Probability of selecting a subset ratio matching or exceeding total ratio

tchntm43

This seems like it should be simple, but I'm scratching my head over where to begin.

From an arbitrarily large population of red and blue marbles, where the overall distribution is known to be 48% red and 52% blue (let's say 48,000,000 red and 52,000,000 blue, if the process requires the total to be defined), select 972 marbles at random. What is the probability of getting at least 483 (~49.69%) red marbles?

It seems to me that, if x is the number of red marbles pulled, this would be the sum of the probabilities for x = 483, x = 484, ..., x = 972: you would take the function that gives the probability for a single value of x and sum it from x = 483 to 972. However, I am not sure how to calculate the probability for a single value of x. I know I learned this in statistics 15 years ago, I just can't remember it. I thought it would be 0.48^x for each one, but I think that's actually the probability of getting x red marbles in a row (an extremely small value that is obviously wrong), and even when summing that over 483 to 972, the result is on the order of 10^-190. So that's definitely wrong.
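For reference, the probability of exactly x red marbles in n independent draws is the binomial probability C(n, x) * p^x * (1 - p)^(n - x). The 0.48^x guess leaves out the factor for the blue marbles and the count of possible orderings. Here is a minimal Python sketch of the sum described above, assuming the population is large enough that the draws can be treated as independent. The function names `binom_pmf` and `binom_tail` are illustrative and not from the thread, and the terms are computed in log space only to avoid overflow and underflow at this sample size.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space."""
    # log of the binomial coefficient C(n, k) via the log-gamma function
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def binom_tail(k_min, n, p):
    """P(X >= k_min): sum the individual probabilities from k_min to n."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

# Numbers from the question: 972 draws, 48% red, at least 483 red
print(binom_tail(483, 972, 0.48))
```

With these inputs the result should come out at roughly 0.15, nowhere near the 10^-190 obtained from summing 0.48^x alone.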

Lately I've been doing a lot of calculation with the known and unknown variables the other way around (i.e. the sample has a known ratio and I'm trying to evaluate the probability for the whole, using z-scores and p-values). I think this is probably a lot easier than that, I just can't remember how to do it.
 
With a sample that large, what's stopping you from using the Normal Approximation to the Binomial?
 
Thank you, I have been reading up now on how that works. I found this page on google: https://www.statisticshowto.datasci...theorem/normal-approximation-to-the-binomial/

I have tried it with some of the actual numbers I'm using but I'm getting a z-score that is much too high, and I think I'm doing something wrong.

Here are the numbers I am using.
sample size (n) = 984
probability per unit (p) = 41.78%
probability 1-p (q) = 58.22%
target (t) = 500

mean = n * p = 411.15493
standard deviation (d) = SQRT(n*p*q) = 15.47119
z-score (z) = ((t-0.5) - mean)/d = (499.5-411.15493)/15.47119 = 5.710295
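The same three steps can be sketched in Python as a cross-check. This is only an illustration: the helper name `normal_approx_z` is made up here, and it uses the rounded p = 0.4178 shown above, so the mean and z-score differ from the posted figures in the later decimal places (the posted mean implies a slightly less rounded p).

```python
from math import sqrt

def normal_approx_z(n, p, target):
    """Mean, standard deviation and continuity-corrected z-score
    for the event 'at least `target` successes out of n'."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    # "At least target" starts at target - 0.5 on the continuous scale
    z = ((target - 0.5) - mean) / sd
    return mean, sd, z

mean, sd, z = normal_approx_z(984, 0.4178, 500)
print(mean, sd, z)
```

The z-score comes out at about 5.71, in line with the hand calculation, so the arithmetic up to this point is sound.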

Now I know from other work that a z-score that high is going to give 100% probability in most cases, but just to be sure, I continued with the process. I'm doing this in Excel, so I believe the correct function in this case is NORMSDIST(z), and not surprisingly, this gives 1 for an answer. I don't even get to the last step of adding 0.5 to get the final probability because I'm already at 1.

So I'm doing something wrong. The final result should be a low probability: at 41.78% probability per unit, aiming for at least 500 out of 984 (more than 50% of the total), the chances should be very low. I'm also confused about this process, because if I'm supposed to add 0.5 at the final step, then my probability is always going to be at least 50%, which means I can't get a low probability no matter what. So my process is definitely wrong.
 
"=normdist(5.71,0,1,1)" = 0.999999994009629

You must understand what this means. It is the probability to the LEFT of a value that is 5.71 standard deviations to the RIGHT of the mean. If you wish to know the probability of being even farther to the RIGHT, subtract that value from 1: 1 - 0.999999994009629 = 5.99037097703814E-09, a very small number.
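The left-tail versus right-tail distinction can be sketched in Python as follows. The function names are illustrative, and using `math.erfc` is a choice made for this sketch rather than what Excel does internally. Computing the right tail directly avoids subtracting a number very close to 1 from 1, which loses precision in floating point.

```python
from math import erfc, sqrt

def normal_lower_tail(z):
    """P(Z <= z) for a standard normal Z (what NORMSDIST(z) returns)."""
    return 0.5 * erfc(-z / sqrt(2))

def normal_upper_tail(z):
    """P(Z >= z), computed directly instead of as 1 - lower tail."""
    return 0.5 * erfc(z / sqrt(2))

z = 5.710295  # z-score from the calculation earlier in the thread
print(normal_lower_tail(z))  # prints a value extremely close to 1
print(normal_upper_tail(z))  # the small probability actually wanted
```

For "at least 500 out of 984" the upper tail is the quantity of interest, and at this z-score it is on the order of 10^-9.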
 
Ah, that makes sense! And I was using NORMSDIST instead of the very similarly-named NORMDIST. Thank you very much.
 
Nothing wrong with that. I just like to make sure I know the mean and standard deviation because I entered them with my own hands. Not everyone agrees with me. :) Good work.
 