Estimate number of clients using Joint Distribution

Skyblack96

New member
Joined
Dec 31, 2022
Messages
5
I have the following problem :

I have a server that can observe which country each client request comes from. The server then gives me the distribution of clients by country (on a 24h basis).
Let's say that we have :
  • Belgium : 10%
  • Canada : 40%
  • China : 50%
Now, I know that I have a defined number of clients in two cities of each country. Let's say :
  • Belgium_city1 : 10000
  • Belgium_city2 : 5000
  • Spain_city1 : 20000
  • Spain_city2 : 15000
  • China_city1 : 40000
  • China_city2 : 60000
Now, I would like to estimate the number of clients from each city that make a request to the server. The solution given by my professor is to simply multiply the number of clients in a city by the probability of the country where the city is located.
For Belgium_city1, for example: 10000*0.1 = 1000 clients requesting my server
But I don't understand where this solution comes from. I tried with the joint distribution like this.
  • Let A = client comes from Belgium
  • Let B = client comes from Belgium_city1
P(A,B) = P(B|A)*P(A)
where :
  • P(A) = 10% as observed by the server
  • P(B|A) = P(Belgium_city1)/[P(Belgium_city1)+P(Belgium_city2)] = 10000/15000
Then the probability that a client makes a request to the server from Belgium_city1 is P(A,B) = 10000/15000*0.1. Now, I'm not sure how to estimate the number of clients making requests from Belgium_city1. The only way to get 1000, as my professor said, is to multiply my previous probability by the total number of clients in Belgium. But why should I do that? It's not clear to me, because I would have multiplied by the total number of clients across all countries and not only Belgium... Can someone explain where I made a mistake?

Thank you in advance
 
You are answering the question: “Given the client is from Belgium, what’s the probability that the client is from Belgium_city1?”
Whereas your professor is answering: "Given a randomly selected client, what’s the probability that the client is from Belgium_city1?"

The difference is the conditional statement.
 
Yes, I think I see what you mean, but then I don't know where to start to find what my professor is asking for given this data. Is it really possible without any assumptions?
 
[math]P(Belgium \cap BelgiumCity1) = P(Belgium)\cdot P(BelgiumCity1|Belgium) = 0.1 \times \frac{10,000}{10,000+5,000} = \frac{1}{15}[/math]
Do the same for the remaining cities. If you did them correctly, the sum should add up to 1.
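As a rough sketch of "do the same for the remaining cities", the snippet below repeats the calculation for all six cities using the figures from the first post. The function and variable names are made up for illustration, and it assumes the 40% country (listed as Canada in the percentages but as Spain in the city list) is one and the same country, labelled "Spain" here.

```python
from fractions import Fraction

# Country shares observed by the server (figures from the first post).
# Assumption: the 40% country is labelled "Spain" to match the city list;
# the percentages in the post call it "Canada".
country_share = {
    "Belgium": Fraction(10, 100),
    "Spain": Fraction(40, 100),
    "China": Fraction(50, 100),
}

# Known number of clients per city (figures from the first post).
city_clients = {
    "Belgium": {"Belgium_city1": 10000, "Belgium_city2": 5000},
    "Spain": {"Spain_city1": 20000, "Spain_city2": 15000},
    "China": {"China_city1": 40000, "China_city2": 60000},
}


def joint_probabilities(country_share, city_clients):
    """Return P(country and city) = P(country) * P(city | country) per city.

    P(city | country) is taken as the city's share of that country's
    clients, i.e. every client is assumed equally likely to make a request.
    """
    joint = {}
    for country, cities in city_clients.items():
        country_total = sum(cities.values())
        for city, n_clients in cities.items():
            joint[city] = country_share[country] * Fraction(n_clients, country_total)
    return joint


joint = joint_probabilities(country_share, city_clients)
for city, p in joint.items():
    print(f"{city}: {p} (about {float(p):.4f})")
print("sum:", sum(joint.values()))
```

Exact fractions are used so that the check "the sum should add up to 1" is not blurred by floating-point rounding.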
 
Thank you, I understand!

And now, if I want an estimate of the number of clients making requests from BelgiumCity1, I just multiply the total number of clients in BelgiumCity1 by P(Belgium∩BelgiumCity1), right? Or should I multiply by the total number of all clients?

Because I still don't see where the solution given by my professor comes from. He said the number of clients making requests from BelgiumCity1 is just 10000*0.1. The only way to obtain this is to multiply what you just wrote by the total number of clients in Belgium, but I was tempted to say we had to multiply by the total number of all clients and not the total number of clients in Belgium...
 
EDIT: Maybe I found the solution. This is in the context of a master's thesis, so maybe some assumptions can be made. Maybe my professor just considered the two events as independent? Then P(Belgium and BelgiumCity1) = P(Belgium)*P(BelgiumCity1), and we can obtain the result given by my professor. What do you think about that? It's probably an error to treat the events as independent, but maybe it's good enough for a first estimate?
 
Suppose you have 100 clients; then the expected number of clients from BelgiumCity1 is 100 * (1/15) ≈ 6.67.
The same idea applies: the counts should add up to 100, or more generally to the number of clients you have.
I don't see the logic behind your professor's calculation either.
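To make the scaling step concrete, here is a small sketch that turns the joint probabilities into expected counts for a given number of requesting clients. The value 100 is the hypothetical total from this reply, the function name is invented for illustration, and the 40% country is again assumed to be the one whose cities are listed as Spain.

```python
from fractions import Fraction

# Figures from the first post. Assumption: the 40% country is called "Spain"
# here to match the city list (the percentages call it "Canada").
country_share = {
    "Belgium": Fraction(1, 10),
    "Spain": Fraction(4, 10),
    "China": Fraction(5, 10),
}
city_clients = {
    "Belgium": {"Belgium_city1": 10000, "Belgium_city2": 5000},
    "Spain": {"Spain_city1": 20000, "Spain_city2": 15000},
    "China": {"China_city1": 40000, "China_city2": 60000},
}


def expected_requesters(n_requesting):
    """Expected number of requesting clients per city, out of n_requesting."""
    counts = {}
    for country, cities in city_clients.items():
        country_total = sum(cities.values())
        for city, n_clients in cities.items():
            # N * P(country) * P(city | country)
            counts[city] = (
                n_requesting * country_share[country] * Fraction(n_clients, country_total)
            )
    return counts


# 100 is the hypothetical number of requesting clients used in this reply.
counts = expected_requesters(100)
for city, c in counts.items():
    print(f"{city}: {float(c):.2f}")
print("total:", sum(counts.values()))
```

Whatever total is passed in, the per-city counts add back up to that total, which is the sanity check mentioned above.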
There's no need to assume independence, as shown by my calculation above.
P(BelgiumCity1) can't be 10,000 since probabilities can't be greater than 1.

I'm not sure if this is a real-world application or a theoretical exercise. If it's the former, what I did is a fair estimate for a first run. After that, you need to keep collecting data, compare it with the estimate, and adjust the proportions accordingly, because we assumed that every client is equally likely to make a request. More realistically, it's very possible that a handful of clients dominate the requests, i.e., low frequency and high severity.
 
Thank you for the explanation, really helpful! :)

No, what I meant is that my professor probably assumed the events are independent (a rough estimate in the context of a real-world app), and then:
P(Belgium∩BelgiumCity1) = P(Belgium)*P(BelgiumCity1) = [math]\frac{1}{10} \times \frac{10000}{150000}[/math], where 150000 is the total number of all clients.
And then he multiplies P(Belgium∩BelgiumCity1) by the total number of clients and just finds:
Estimation of clients from BelgiumCity1 = 0.1*10000
I think that's where my professor's logic comes from?
 
Making a request from Belgium and making it from a Belgian city can't be independent, because a Belgian city is in Belgium.
I don't follow that logic.
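One way to spell this out, as a short derivation with A = "client is from Belgium" and B = "client is from BelgiumCity1" (the same events as in the first post):

[math]B \subseteq A \;\Rightarrow\; P(A \cap B) = P(B)[/math]
[math]P(A \cap B) = P(A)\,P(B) \;\Rightarrow\; P(B) = P(A)\,P(B) \;\Rightarrow\; P(A) = 1 \text{ or } P(B) = 0[/math]

With P(A) = 0.1 and P(B) > 0, neither condition is met, so the two events are dependent.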
 