Probability when subset is known but not total

tchntm43

Many years ago I took a basic college statistics course. Most of the problems had to do with a situation where you have a collection of different objects and you calculate the odds of getting a specific subset or range of subsets from it. For example, you might say "a jar has 8 white marbles and 13 black marbles. What are the odds that the first marble removed is black and, without returning it, the next two marbles you pull are white?"

However, I am trying to do something basically the reverse of this. I have a jar of 100 marbles which can be white or black. But I don't know how many of each there are. From the jar I remove 10 marbles at random: 7 white marbles and 3 black marbles. From that information, I want to know what the odds are that there are more white marbles in the jar than black marbles.

I had an idea about the process, but I have doubts about it. I thought that I could look at the number of possible remaining outcomes. There are 90 marbles remaining, and there could therefore be 0 to 90 white marbles among them (91 possible outcomes). From that, I can determine that 44 to 90 remaining white marbles would mean there are more white marbles than black marbles in the jar, 0 to 42 would mean there are fewer white marbles than black marbles, and 43 would mean there are the same number of each. So 44 of the outcomes are failures in this sense, and the other 47 of the 91 possibilities are successes, which would suggest a 51.6% chance that there are more white marbles than black marbles. Is this correct?
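To make that counting concrete, here is a quick sketch in Python (assuming, as I did above, that every possible count of remaining white marbles is equally likely):

# Counting sketch: 7 white and 3 black already drawn, 90 marbles left.
# Assumes (as in the reasoning above) that each possible number of
# remaining white marbles, 0 through 90, is equally likely.
drawn_white, drawn_black, remaining = 7, 3, 90

successes = 0  # jars that end up with more white than black overall
for remaining_white in range(remaining + 1):
    total_white = drawn_white + remaining_white
    total_black = drawn_black + (remaining - remaining_white)
    if total_white > total_black:
        successes += 1

print(successes, "/", remaining + 1, "=", successes / (remaining + 1))
# prints: 47 / 91 = 0.5164835164835165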

What I'm unsure about: given that I have already pulled more white marbles than black marbles, are the odds of each possible remaining outcome the same? I think that is what I'm assuming in the above calculations. And it seems odd to me that a significant majority in a random 10% subset of the whole would still produce only a tiny increase over 50/50 odds.
 
Before I try to solve this, I'd like a little more information on your context.

You imply that you are not currently taking a course, so this problem does not come from a textbook or similar source. Or does it? Do you have a particular purpose for the question? Do you just need an answer, or do you need to know how to solve it in order to solve other problems like it? And is it possible that you might really want something slightly different from what you asked, such as testing the hypothesis that there are more white than black?

Also, you use the word "odds" rather than "probability"; technically these mean different things. I'm guessing you really mean probability (which is what you are talking about when you say things like "a 51.6% chance").

One more question: Have you ever heard of Bayes' theorem?
 
This is not for a course. And it's not the answer to this specific question I'm looking for, but the process for calculating it. More generally, if I have a group of assorted items, and I know the total number but not the number of each, and I select a random subset from the group and measure the numbers of each item, how can I use that information to make statements of probability about the distribution of the whole?

This is the first step in a type of polling analysis that I haven't seen done before and am hoping to accomplish.

I was not familiar with Bayes' theorem, but I'm reading about it on Wikipedia right now.
 
That is very close to what I'm after, or at least it's a tangential subject. There are a few things on the page I don't understand.

1. The "significance level". I understand that it's between 0 and 1, and that it's the value you compare your result against at the end to determine if your hypothesis is valid or not, but I don't understand how you choose the value. In the example problem, they simply state "Use a 0.05 level of significance" without saying why they picked that value.

2. The different results of the one-tailed and two-tailed tests. I understand how the one-tailed and two-tailed tests are different (in fact I was originally going to object to the two-tailed test because it seemed clear to me that it wouldn't address the meaning of the claim in the word problem, as anyone bragging about a satisfaction rate would consider it a "success" if the rate were higher than the claim, but then I saw that they actually address that more realistic variant in the one-tailed test). However, the results don't make sense to me. The two-tailed test targets the satisfaction rate being exactly 80%, and when it says "we cannot reject the null hypothesis" it sounds to me that they are saying that the claim is valid. However, the one-tailed test allows for any value of 80% or higher as a success, and yet it rejects the null hypothesis at the end. How can we say that it's reasonable to claim exactly 80% but not reasonable to claim 80% or higher? I am sure I'm misunderstanding something.

Going back to my black and white marbles problem... I'd have to adjust the number of marbles considerably to meet the requirements listed on the page, but in theory I could frame the problem as "I claim that 50% or more of the marbles are white, among 10,000 marbles. I sample 100 of them and find that 70% are white marbles." Then I'd have to pick a significance level, and from there it would be almost identical in process to the one-tailed problem listed.

Of course, that wouldn't give me a probability, but I can see what you mean that it would be an important background to getting there.

Thank you very much for the information provided so far!
 
Some of your questions will be answered if you back up a little, perhaps to https://stattrek.com/hypothesis-test/hypothesis-testing.aspx?Tutorial=AP (and following sections under Hypothesis Testing in the table of contents).

The significance level is commonly chosen as 5% (0.05); what it means (as explained in these other sections) is how often you are willing to be wrong in rejecting the null hypothesis. At the 5% level, if the null hypothesis were actually true, you would wrongly reject it (and wrongly say that more than half of the marbles are white, or whatever) only about 5% of the time. This is something like the probability you are looking for; in fact, the p-value itself may be what you want.
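If it helps, here is a quick simulation sketch (my own illustration, not from that page) of what choosing 0.05 means: when the null hypothesis is exactly true, a test at the 0.05 level rejects it only about 5% of the time (a bit less here, because counts are discrete).

# Simulation sketch: how often a 0.05-level test rejects a TRUE null.
# Null: the true proportion is 0.5; draw samples of 100 and run the
# one-tailed z-test, rejecting whenever the p-value falls below 0.05.
import random
from statistics import NormalDist

trials, n, rejections = 50_000, 100, 0
for _ in range(trials):
    successes = sum(random.random() < 0.5 for _ in range(n))
    z = (successes / n - 0.5) / (0.25 / n) ** 0.5
    p_value = 1 - NormalDist().cdf(z)   # upper tail
    if p_value < 0.05:
        rejections += 1

print(rejections / trials)   # around 0.04-0.05 (a bit under 5% because counts are discrete)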

Both one- and two-tailed tests make sense, for different kinds of questions. Yours is one-tailed; but if someone claimed that the marbles are exactly half white, and you wanted to test that claim (counting anything else, in either direction, as wrong), you'd be doing a two-tailed test. It's sort of like the difference between claiming I can always hit a target, and claiming I can jump 8 feet: In the former case, too high or too low would both be misses, but in the latter, "too high" would still be a success. In their first example, the claim is accuracy (which wouldn't be natural); in the second, it is high satisfaction. Their first example could be better; they are trying to demonstrate both cases with the same data, which is unnatural.

Again, in the first example, not rejecting the null means that results like the survey's would occur more than 5% of the time if the true proportion really were 80%, so you are not convinced they are wrong; in the second, rejecting the null means that you are convinced the proportion is probably less than 80%, because there is only a 4% chance that the results of the survey would have occurred if the proportion were at least 80%.

By the way, in your problem I'd take the null hypothesis to be that the true proportion is less than or equal to 50% (the reverse of their second example), because you want to convince yourself that it is more than 50%.
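To tie this back to your marble framing (a rough sketch of that one-tailed test, using the 70-white-out-of-100 sample you described and the null hypothesis that the true proportion of white is at most 50%):

# One-tailed z-test sketch for the marble framing above:
# null hypothesis p <= 0.5, sample of 100 marbles with 70 white.
from statistics import NormalDist

n, white = 100, 70
p0 = 0.5
se = (p0 * (1 - p0) / n) ** 0.5      # 0.05
z = (white / n - p0) / se            # 4.0
p_value = 1 - NormalDist().cdf(z)    # upper tail
print(z, p_value)                    # 4.0, about 3.2e-05

With a p-value that small, you would reject the null at the 0.05 level and conclude that more than half of the marbles are white.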

I should also mention that I've been a little sloppy in some of my statements, for the sake of not sounding too pedantic, and of not taking forever to write this. Statistics is not my strongest field.
 
So I'm trying to calculate a P-value (one-tailed) from some real data. Forget about marbles. Here is the data I am working with:

Total Population: 4,419,587
Sample Size: 1309
True: 617
False: 692
Testing for True >= 50%

Standard Deviation = sqrt((0.5*(1-0.5))/1309) = 0.0138197...
z = ((617/1309) - 0.5) / 0.0138197... = -2.072962473

When I plug this into a Normal Distribution Calculator (https://stattrek.com/Tables/Normal.aspx), using:
mean = 0
standard deviation = 1

I get a result of P-value = 0.019.

When I run the same test for False instead:

Standard Deviation = 0.0138197... (this is the same as above, of course)
z = ((692/1309) - 0.5) / 0.0138197... = 2.072962473

For this, I get a P-value = 0.981

This seems to make sense as both P-values add up to 1.
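For reference, here is a short sketch that reproduces those two numbers from the raw counts, in Python rather than the web calculator (statistics.NormalDist gives the cumulative probabilities the calculator reports):

# Sketch reproducing the two p-values above from the raw counts.
from statistics import NormalDist

n, true_count, false_count = 1309, 617, 692
se = (0.5 * (1 - 0.5) / n) ** 0.5       # about 0.0138197

z_true = (true_count / n - 0.5) / se    # about -2.073
z_false = (false_count / n - 0.5) / se  # about +2.073

print(NormalDist().cdf(z_true))         # about 0.019
print(NormalDist().cdf(z_false))        # about 0.981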

Question 1:
Do these p-values mean that Probability(True >= 50%) = 1.9% and Probability(False >= 50%) = 98.1%?

Question 2:
None of this takes into account the total population size (and I noticed that the problems listed on that page referred to earlier never input 1,000,000 into any part of the calculations). Does that generally not matter? Is it because one of the requirements for using this method is that the total population is at least 20 times the sample size?

Question 3:
I looked up the formula for calculating Normal Distribution (http://mathworld.wolfram.com/NormalDistribution.html), and tried to set it up in Excel so that I could have it calculate this without having to plug the values into the webpage listed. However, I am getting wildly different results. Here is my attempt to calculate the above P-value for (True >= 50%) using a rather cumbersome Excel formula (simplified from the formula listed on the wolfram page since mean = 0 and standard deviation = 1):
=(1/SQRT(2*PI()))*EXP((0-(-2.07296^2))/2)
This instead gives me 0.046536417 as the result. Also, when I run it for the other test, I get almost exactly the same result. Which makes sense because the z in both cases is almost identical except for sign. And the formula there raises that value to the power of 2, which makes the sign irrelevant. So this is definitely not the right way to do it. Any thoughts on how to go about that?
 
Question 1:
Do these p-values mean that Probability(True >= 50%) = 1.9% and Probability(False >= 50%) = 98.1%?
Not exactly. There's some subtlety that I'd probably be more confident of if I taught stats and had to say it all the time, but basically the 1.9% is the probability that you'd get results at least as extreme as the ones you did if the population were exactly 50% true. It's used as a stand-in for the probability that the null hypothesis is true, which is not really possible to calculate.

Question 2:
None of this takes into account the total population size (and I noticed that the problems listed on that page referred to earlier never input 1,000,000 into any part of the calculations). Does that generally not matter? Is it because one of the requirements for using this method is that the total population is at least 20 times the sample size?
Right. As long as the sample is a small enough fraction of the population, you can basically pretend the population is infinite.

Question 3:
I looked up the formula for calculating Normal Distribution (http://mathworld.wolfram.com/NormalDistribution.html), and tried to set it up in Excel so that I could have it calculate this without having to plug the values into the webpage listed. However, I am getting wildly different results. Here is my attempt to calculate the above P-value for (True >= 50%) using a rather cumbersome Excel formula (simplified from the formula listed on the wolfram page since mean = 0 and standard deviation = 1):
=(1/SQRT(2*PI()))*EXP((0-(-2.07296^2))/2)
This instead gives me 0.046536417 as the result. Also, when I run it for the other test, I get almost exactly the same result. Which makes sense because the z in both cases is almost identical except for sign. And the formula there raises that value to the power of 2, which makes the sign irrelevant. So this is definitely not the right way to do it. Any thoughts on how to go about that?
That's the formula for the PDF (the probability density), not the CDF, which is what you need for these calculations.

Excel does have a function for the CDF (which has no closed-form algebraic formula you could type in directly). Look up NORM.DIST, and set its last argument ("cumulative") to TRUE.
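To illustrate the distinction (a quick sketch in Python; the same check could be done in Excel):

# PDF vs CDF at z = -2.07296: the density is what the Excel formula
# above computes; the cumulative probability is the p-value you want.
from statistics import NormalDist

z = -2.07296
print(NormalDist().pdf(z))   # about 0.0465 (matches the Excel result)
print(NormalDist().cdf(z))   # about 0.0191 (matches the calculator's 0.019)

In Excel itself, =NORM.DIST(-2.07296, 0, 1, TRUE) should give the same cumulative value.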
 
Thanks again for the help. This has worked out pretty well. I know that the p-values aren't exactly the probabilities I was asking about, but they are producing values that align with expectations, so at least there's that.

Now I want to add in a twist. Suppose that we have more than two possibilities: True, False, and Not Sure. From a sample size of 1000, we have 406 True, 355 False, and the rest are Not Sure. I want to calculate two things: the probability that there are more True than False, and the probability that of the 3 distinct groups, True is the highest.

For the first one, my instinct is to redefine the sample size as being the sum of True + False, since the Not Sure doesn't matter. Then use the same calculations as before and ignore the existence of Not Sure.
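Here is a sketch of that instinct, reusing the same one-tailed test with the Not Sure responses dropped (so the effective sample is 406 + 355 = 761):

# Sketch of the "drop the Not Sure responses" idea: test whether True
# beats False using only the 406 + 355 = 761 decided responses.
from statistics import NormalDist

true_count, false_count = 406, 355
n = true_count + false_count          # 761
se = (0.5 * (1 - 0.5) / n) ** 0.5
z = (true_count / n - 0.5) / se       # about 1.85
p_value = 1 - NormalDist().cdf(z)     # about 0.032
print(z, p_value)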

But I am not sure how to do the second one. I could use the previous process if it is valid to get two distinct probabilities (that True > False or that True > Not Sure, in each case ignoring the 3rd possibility), but I don't know how to calculate that both are simultaneously true.
 