My probabilities are only adding up to 88%

tchntm43 (New member; joined Jan 8, 2020; 36 messages)
I've done this type of calculation with other sets of data and it works, but in this case I can't get the probabilities to add up to 100% (nor get them close enough to account for rounding errors).

Suppose we have a vast bank of colored marbles that can be red, green, blue, yellow, or orange. We pull the following out at random (partial marbles in the data are due to the numbers originally being calculated retroactively from a percentage; it doesn't matter in this case):
red: 168.82
green: 154.14
blue: 146.8
yellow: 80.74
orange: 22.02

For each color, what is the probability that there are more of that color than any other color (plurality, not majority)?

So to do this, I figured that red being the most would require all of the following to be true: red > green, red > blue, red > yellow, and red > orange. And so on for the other colors. And since it requires all of them to be true, the probability of red being the most would be (probability red > green) * (probability red > blue) * (probability red > yellow) * (probability red > orange).

I should have enough information to set up a grid of z-scores and p-values. I created the following table of z-scores first:
is higher ->        red             green           blue            yellow          orange
red                                 -0.816867631    -1.239467252    -5.575576995    -10.626527
green               0.816867631                     -0.423112739    -4.789311015    -9.95439601
blue                1.239467252     0.423112739                     -4.379350372    -9.60357723
yellow              5.575576995     4.789311015     4.379350372                     -5.79260612
orange              10.62652704     9.954396014     9.603577232     5.792606125
In all cases, the score means "for the color above being higher than the color to the left". The empty diagonal of course is because there is no need to compare a color vs itself.

Next, I calculate the p-values from these (using =NORMDIST(value, 0,1,TRUE) in Excel)
is higher ->        red             green           blue            yellow          orange
red                                 0.207002039     0.107586255     1.23355E-08     1.12155E-26
green               0.792997961                     0.336106504     8.36775E-07     1.20642E-23
blue                0.892413745     0.663893496                     5.95168E-06     3.86085E-22
yellow              0.999999988     0.999999163     0.999994048                     3.46513E-09
orange              1               1               1               1
I lost a table cell at the end and couldn't figure out how to get it back, but it's blank anyway. All the 1s are of course rounded off by Excel. Excel remembers the actual numbers, though, and as evidence of that, highlighting any two cells that are mirror images across the blank diagonal shows that their sum is 1, as it should be. Highlighting the whole table shows the sum as 10, which is also what it should be.

Lastly, I multiplied each column to get the probability of each color being the highest, and I get the following results:
red: 70.77%
green: 13.74%
blue: 3.62%
yellow: 0%
orange: 0%

Sum of probabilities: 88.13%

Of course there are several places where rounding errors can cause the final numbers to be off, but there is no way they should be off by that much. There is something I'm missing here and I can't figure out what it is. When I've created these tables with other sets of data, the probabilities add up to 1 like they should, but it has been a long time since I did that and I've clearly forgotten something important in the process.
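For reference, here is the whole pipeline in a few lines of Python (a sketch of the same normal-approximation calculation; math.erf stands in for Excel's NORMDIST):

```python
from math import erf, sqrt

# Sample counts (the "marbles" drawn), as given above.
counts = {"red": 168.82, "green": 154.14, "blue": 146.8,
          "yellow": 80.74, "orange": 22.02}

def phi(z):
    """Standard normal CDF, equivalent to Excel's NORMDIST(z, 0, 1, TRUE)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_pair(a, b):
    """z-score for 'a higher than b', per the two-proportion setup above."""
    n = a + b
    return (a / n - 0.5) / sqrt(0.25 / n)

colors = list(counts)
# P(color is "the most") as the product of its pairwise win probabilities
# -- the independence assumption whose validity this thread is about.
p_most = {}
for c in colors:
    p = 1.0
    for other in colors:
        if other != c:
            p *= phi(z_pair(counts[c], counts[other]))
    p_most[c] = p

for c in colors:
    print(f"{c}: {100 * p_most[c]:.2f}%")
print(f"sum: {100 * sum(p_most.values()):.2f}%")  # ~88.13%, not 100%
```

Running it reproduces the table above and the 88.13% total, so the shortfall is in the method, not in an Excel rounding error.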
 
I don't understand what you are asking. With a specific number of marbles of each color given there is no question of "probability". There are clearly more red marbles than any other one color so P(red)= 1.0 while P(blue)= P(green)= P(yellow)= P(orange)= 0!
 
Ah, I realize I wasn't clear. I meant the probability for the larger population of marbles that the sample was taken from, which is of unknown size but easily large enough to meet the requirements for using z-score and p-value method.
 
Ah, I realize I wasn't clear. I meant the probability for the larger population of marbles that the sample was taken from, which is of unknown size but easily large enough to meet the requirements for using z-score and p-value method.
I think you're really saying that the "numbers of marbles" you started with describe the proportions in the population, from which we could get probabilities for each color when a marble is drawn; and you are asking about probabilities of each color being a plurality in a sample (of unstated size). Am I right? Or is it the reverse, and you are looking for the probability of each color having a plurality in the population, given the sample data (which clearly can't be a real sample, not being integers, so we don't know the sample size)?

One thing missing in your work is the possibility that there is no strict plurality (that is, there are the same number of two or more colors). But the bigger issue is that the subevents (e.g. red > green and green > blue) are not independent. I'd also like to see how you calculated your z-scores.
 
you are looking for the probability of each color having a plurality in the population, given the sample data
Yes, this. I'm sorry my wording was confusing.
(which clearly can't be a real sample, not being integers, so we don't know the sample size)?
In the original data, I was supplied with a sample size and percentages, which were rounded off. The distribution of the sample size was calculated, by me, inversely from the percentages. For the purpose of this, it shouldn't matter that they aren't integers, since they don't have to be in order to calculate z-scores and p-values (indeed, I could have instead been working with volumes of liquid).
One thing missing in your work is the possibility that there is no strict plurality (that is, there are the same number of two or more colors). But the bigger issue is that the subevents (e.g. red > green and green > blue) are not independent. I'd also like to see how you calculated your z-scores.
Given the population size (in the millions) the chances of there being a tied plurality are pretty small, and I'm willing to overlook that for now. It would show in the results of the z-scores, anyway, so I'll know about that problem if it comes up.

I'm not sure what you mean by "subevents".

z-scores were calculated this way (an example for red > green):
z-score(red > green) = ((red/(red+green)) - 0.5)/SQRT(0.25/(red+green)) = ((168.82/(168.82+154.14)) - 0.5)/SQRT(0.25/(168.82+154.14)) = 0.816867631
 
z-scores were calculated this way (an example for red > green):
z-score(red > green) = ((red/(red+green)) - 0.5)/SQRT(0.25/(red+green)) = ((168.82/(168.82+154.14)) - 0.5)/SQRT(0.25/(168.82+154.14)) = 0.816867631
The probability you obtain from this z is not the probability that there are more red than green in the population! It's true that this is how you would test the hypothesis that there are more red than green, but you are misinterpreting the result. This is a common misunderstanding of what hypothesis tests are about.

What you find from this will be the probability that you would observe what you do in your sample if there were exactly as many red as green. That is a very different thing! In particular, it is nonsense to combine these probabilities as you are trying to do.
 
The probability you obtain from this z is not the probability that there are more red than green in the population! It's true that this is how you would test the hypothesis that there are more red than green, but you are misinterpreting the result. This is a common misunderstanding of what hypothesis tests are about.

What you find from this will be the probability that you would observe what you do in your sample if there were exactly as many red as green. That is a very different thing! In particular, it is nonsense to combine these probabilities as you are trying to do.
I think I'm misunderstanding you, because I'm more confused now. The p-value I got from that z-score is around 0.7930. Are you saying that if the population has an equal number of red and green, that there is about a 79% chance of getting exactly the sample size indicated? That wouldn't make sense, so I'm sure I'm misunderstanding.

I've been using this p-value because it produces results that are very close to those that other people (the "experts") have produced (and I don't know their methodology), and also because it produces predictable results for the extremes. In most of my calculations I'm only working with populations/samples that have 2 types of "marbles", and this method produces values that I'm able to verify in other ways as being roughly correct. It's only been with this current situation of having more than 2 that I've run into a problem where I know the results can't be correct.

If I am doing this the wrong way, I would like to be shown the correct way. I'll restate (and simplify) the original problem:
I have a population of 5 million marbles. I randomly draw 573 from among them. 169 are red, 154 are green, 147 are blue, 81 are yellow, and 22 are orange. What is the probability that there are more red marbles than any other single color in the total population?
 
Oh, I should mention... My formula for calculating the z-score is based on the one-proportion z-test statistic:
z = (p - P) / SQRT(P * (1 - P) / n)
where p is the sample proportion, P is the hypothesized population proportion (0.5 here), and n is the sample size.
I know that's meant for when each unit in the sample/population can have 2 types, but I reasoned (apparently erroneously) that I could break apart a situation with more than 2 types into a number of pairs and run those each independently and then bring them back together at the end.
 
Oh, I should mention... My formula for calculating the z-score is based on the one-proportion z-test statistic:
z = (p - P) / SQRT(P * (1 - P) / n)
I know that's meant for when each unit in the sample/population can have 2 types, but I reasoned (apparently erroneously) that I could break apart a situation with more than 2 types into a number of pairs and run those each independently and then bring them back together at the end.
If you read that page carefully, you should see that the conclusion says nothing about "the probability that the population proportion is greater than 50%" or the like. Rather, the conclusion of the example is "Since the P-value (0.08) is greater than the significance level (0.05), we cannot reject the null hypothesis." See, for example, here: https://stattrek.com/statistics/dictionary.aspx?definition=p-value

Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.​

I did somewhat misstate this; when I called the p-value "the probability that you would observe what you do in your sample if there were exactly as many red as green", I should have said, ""the probability that you would observe at least as many more red as you do in your sample if there were exactly as many red as green". So ...
I think I'm misunderstanding you, because I'm more confused now. The p-value I got from that z-score is around 0.7930. Are you saying that if the population has an equal number of red and green, that there is about a 79% chance of getting exactly the sample size indicated? That wouldn't make sense, so I'm sure I'm misunderstanding.
... no, "exactly" is not right.

But the important thing is that the p-value is not the probability you are thinking it is, so you can't combine them the way you are trying to do.
I've been using this p-value because it produces results that are very close to those that other people (the "experts") have produced (and I don't know their methodology), and also because it produces predictable results for the extremes. In most of my calculations I'm only working with populations/samples that have 2 types of "marbles", and this method produces values that I'm able to verify in other ways as being roughly correct. It's only been with this current situation of having more than 2 that I've run into a problem where I know the results can't be correct.

If I am doing this the wrong way, I would like to be shown the correct way. I'll restate (and simplify) the original problem:
I have a population of 5 million marbles. I randomly draw 573 from among them. 169 are red, 154 are green, 147 are blue, 81 are yellow, and 22 are orange. What is the probability that there are more red marbles than any other single color in the total population?
It may be that others are doing something similar to what you are doing in the two-type case, and it's good enough there. But it doesn't work when you try to do more with the p-values, because they don't combine the way you want.

I don't know that there is a right way to get the probability you are asking for. In fact, a case can be made that the probability that something is true of the entire population is a meaningless concept, along the lines of post #2. Either it is or it isn't; there's no probability about it, unless you can make some a priori assumptions about the population. (You might want to talk with Mr. Bayes about that.)
 
I don't know that there is a right way to get the probability you are asking for. In fact, a case can be made that the probability that something is true of the entire population is a meaningless concept, along the lines of post #2. Either it is or it isn't; there's no probability about it
Post #2 didn't actually have to do with that; it was a misunderstanding about my question. But I want to talk about this because I'm pretty sure it is not correct. If I were to roll a die, but do so under a box, the number on the die facing up is already set. But until I remove the box, I can speak about the probability associated with the result: I can say that after a number of repeated rolls, the proportion of sixes will trend toward 1/6 of the trials, even if I say so for each one after the die has finished rolling, as long as I don't look at it and it remains an unknown. The same should be true about this hypothetical ocean of marbles: as long as I don't know what the number is, it's an unknown, and if there is data available that can lead to making inferences about it, then it's appropriate to do so.

If I inverted the problem and turned it into a more standard probability problem: We know how many blue and green marbles are in the total ocean population, and we want the probability of getting 200 green marbles in a 535 marble sample, then that's a process that's much easier to do. But if I add this catch: the bucket is opaque and we can't actually see the marbles immediately after removing them. Once again, the selection within the bucket is determined, but because we don't know it we can still speak of the probability of what's in the bucket.

Regarding whether or not it can be solved... Consider instead that my 535 sample, selected at random, is 534 green and 1 blue. If I challenged you to a bet on which type of marble the ocean had more of, and it was equal wagers on each, you'd be fighting me to pick green. Anyone would. We can understand that intuitively. That type of extreme ratio in the sample implies that it's almost certain there are more green than blue in the total population. Usually, if you can see something intuitively so clearly, then it can be shown via mathematics as well. I'm certain this can be done. Just not sure how yet. But I've reached out to someone in the area who tutors college-level statistics and probability, I'll see what I get for answer there (and I'll post it here).
 
I didn't say that #2 was saying the same thing I was saying; I said "along the lines of". But that's a minor point.

The important thing is that your analogy to the roll of a die is not parallel to your question. What you're trying to do is more like rolling a die a number of times and writing down what each roll is without actually looking at the die. (You do have the data.) Now you ask, what is the probability that it is a fair die? (That is, that it has six equal sides with six different numbers on them.) You are trying to determine the probability that your population has a particular content, not that some outcome will result from it.

The issue is not, as in your example, the uncertainty of one roll; it's that there is no population of populations, so to speak, from which your particular population is chosen randomly, as there is no population of dice from which your die is chosen.

As for your bet ... yes, that's what statistics is about, and a hypothesis test does reverse what probability does. You can make a strong conclusion. But I'm not sure what the probability of there being more green would be, as a specific number. That's the issue here. I'm not even sure how to define that number.

I've been waiting for someone else to join in this discussion; there are some interesting features to it that others may know more about. I'm not a statistician. (And remember, I only said "a case could be made" for this view, from a Bayesian perspective, not that I'm certain of this.)
 
I was trying this morning to approach this from a different perspective. Forget all the z-score and p-value stuff and work with more basic mathematics. There should be a way to count all of the possible ocean combinations where the sample is possible, identify for each ocean combination the probability of getting our sample, and uh... not really sure where to go from there but I'm brainstorming so I'll try and figure it out as I go. I have a nagging feeling that this process will involve combining together a number of calculations on the order of a 6-digit number raised to an exponent, but hopefully we can identify a way to compress it into a summation process.

Let's define the total population size: 1 million. Without taking a sample, the ocean can have 0-1,000,000 green marbles. So there are 1,000,001 possible ocean populations. With a sample of 535 marbles, we've defined blue as having 234, which means that at least 234 of the marbles in the ocean are blue, leaving the possibility that 0-999,766 are green. But we've also defined 301 as being green, though, so it's actually 301-999,766 possible green marbles. Which gives 999,466 possible ocean configurations.

Ultimately, we also know that we're looking for ocean configurations where green >= 50% of the population, and that's only true when there are at least 500,000 green marbles. So there are 999,466 possible ocean configurations capable of producing our sample, and of them, those with 500,000-999,766 green marbles (or 499,767 configurations) meet the 50% threshold. 499,767/999,466 = 0.500034. So, of the possible ocean configurations capable of giving us our sample, 50.0034% of those are also configurations with at least 500,000 green marbles. I'm not sure what to do with that, but it feels like we may need it later.

Starting with the first of those 999,466 configurations, let's say that the ocean has exactly 301 green marbles, meaning that our sample has miraculously extracted all 301 of them. If we instead are at the pre-sample phase of things and we just have the ocean with 301 green marbles and 999,699 blue, we should be able to ask "what is the probability that, when taking a sample of 535 marbles, exactly 301 of them are green?" And we should be able to do the same for all 999,466 configurations. Let's call this set of probabilities H, just so I can refer to it without confusion elsewhere. So you have H(301) through H(999,766). Each of these probabilities should be extremely small.

I'm a little uncertain where to go from here. But we should be able to take this data and reword the question as "what is the probability that our sample comes from one of the ocean configurations where the green marbles number at least 50%?" Like... what if you were to take H and sum them up (i.e. the probability of getting 301 green marbles in at least one of the configurations if you simulated all of them), and then... multiply by 0.500034, because that's taking into account only those configurations where green is the majority. I think it's probably somewhat more complicated than that, but it's in the right direction, maybe?
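Here is a quick sketch of that enumeration idea in Python (my own construction, not an established step from the thread): with a flat prior over the possible ocean configurations, weighting each configuration by the probability it produces the sample collapses, for a population this large, into essentially a Beta(green + 1, blue + 1) distribution for the ocean's green proportion, so a Monte Carlo over that distribution does the summing.

```python
import random

# Two-color version from above: 535 drawn, 301 green, 234 blue.
green, blue = 301, 234

# Flat prior over configurations + binomial likelihood ~= a
# Beta(green + 1, blue + 1) posterior for the ocean's green proportion.
# Draw many proportions and count how often green holds a majority.
rng = random.Random(42)
trials = 200_000
majority = sum(rng.betavariate(green + 1, blue + 1) > 0.5
               for _ in range(trials))
print(f"P(green majority) ~= {majority / trials:.4f}")
```

With these counts the estimate comes out near 0.998, which matches the intuition in the posts above: a 56% green share in a sample this size makes a green majority very likely, but not the near-certainty a 534-to-1 sample would give.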
 
So I've had an interesting breakthrough on this question. Instead of attacking the question head-on, I think I have found the solution indirectly by using the formula for calculating sample sizes:
sample size = (z^2 * p(1-p) / e^2) / (1 + (z^2 * p(1-p)) / (e^2 * N))
(the standard formula with a finite-population correction for population size N)


When using this formula, you typically arrive at an answer where you would say something like "with a 95% confidence level, the ratio in the total population is within +/- 5% of the sample's 47%".

So here is what I did with the question above. First, I assume a very large total population, which means, according to that page, you can use the simplified version of the formula:
(z^2 * p(1-p))/e^2
z is the z-score (not calculated the way it was in a previous post!). Instead you choose a confidence level and get the corresponding z-score from it.
p is the assumed population proportion, typically taken as 0.5 (the most conservative choice) unless you have reason to assume a different value.
e is the margin of error, which you choose.

Running this through the formula gives you a requirement on the minimum sample size.

So I was experimenting with this. My ratio in the sample for green is 56.26%. I said, let's make the confidence level 95%, since that's typical, and that means a z-score of 1.96. And let's make the margin of error +/- 5%, because that's fairly typical. When we plug our numbers in, we get:
sample size = (1.96^2 * 0.25)/(0.05^2) = 384.16. My actual sample size is larger, so I know that any statement of the accuracy of the sample from this calculation is conservative (it understates the real confidence). So what I have in this case is a statement: "With greater than 95% confidence, the ratio in the total population is between 51.26% and 61.26%."

But the lower end of that is greater than 50%. Bingo! I can now say "With greater than 95% confidence, the ratio in the total population is majority green" or to put it another way "There is greater than 95% probability that the ratio in the total population is majority green"

So the procedure for doing this would be:
1. Of the sample with 2 types represented, identify the type (X) with the majority.
2. Choose a margin of error such that the ratio of X in the sample minus the margin of error is still greater than 50%
3. Choose an initial confidence level (I choose 95% to start with) and get the associated z-score.
4. Run the numbers through the formula.
5. If the indicated required sample size is smaller than our actual sample size, then we're done. Our confidence level is a lower bound on the probability that X is the majority in the total population.
6. If the indicated required sample size is larger than our actual sample size, lower the confidence level (I'd say go in 5% increments) and go to step 4.

In retrospect, I could have picked a margin of error of 0.06 and still met the requirements. This would have lowered the required sample size, and I wonder if I could have compensated by making the confidence level 99%.

I think this works, although it doesn't get you an exact probability (instead you get an "at least"), but it's still meaningful. Being able to say something is somewhat greater than 95% is about as good as saying it's 95%, in most cases.
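The six steps can be mechanized. Here is a sketch in Python (my code, not a standard recipe; statistics.NormalDist does the confidence-level-to-z lookup, and I take the largest margin of error that step 2 allows, the sample ratio minus 50%):

```python
from statistics import NormalDist

def min_confidence_search(sample_size, sample_ratio, step=0.05):
    """Steps 1-6 above: highest confidence level (descending in 5%
    increments from 95%) whose required sample size fits the actual one."""
    margin = sample_ratio - 0.5        # step 2: keeps the low end above 50%
    if margin <= 0:
        raise ValueError("the chosen type must hold a sample majority")
    cl = 0.95                          # step 3: initial confidence level
    while cl > 0:
        z = NormalDist().inv_cdf((1 + cl) / 2)    # confidence level -> z-score
        required = (z ** 2 * 0.25) / margin ** 2  # step 4: the formula
        if required <= sample_size:               # step 5: done
            return cl
        cl -= step                                # step 6: lower CL and retry
    return 0.0

# The green example above: 535 marbles sampled, 56.26% green.
print(min_confidence_search(535, 0.5626))  # 0.95 with these numbers
```

With the margin at its ceiling (6.26%), even a 95% confidence level only needs about 245 samples, so the search stops immediately; the 5% margin and 384-sample requirement used above lead to the same conclusion.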
 
I have been hesitant to intrude on this thread because (a) I have not been sure what the actual question is, and (b) I have forgotten so much about the subtleties of hypothesis testing. But a fundamental point of hypothesis testing is that you must have an explicit, well formulated hypothesis to test. It seems to me that you are trying to test hypotheses like "The population from which this random sample was drawn contains a higher proportion of red balls than green balls." Intuitively, just by looking at the data, it appears that you cannot have high confidence that that hypothesis is true. With respect to the hypothesis "The population from which this random sample was drawn contains a higher proportion of red balls than orange balls," intuition says that you can have high confidence that the hypothesis is true. But is intuition reliable?

If that is the kind of question that you are trying to answer, I shall not rely on my memory to go through the details of how to replace intuition with math, but I am sure that Dr. P or others here can do so if they understand that is the kind of question you are trying to answer.
 
What we need to know is, what is your real goal?

You asked initially, "For each color, what is the probability that there are more of that color than any other color (plurality, not majority)?"

My point has been that that is not what a confidence level means. In particular, that was why your table didn't add up. So that's the question I've been answering.

But perhaps the confidence level/p-value is in fact what you are really looking for -- some measure of how confident you are that a given color constitutes a plurality, and not a probability to put in a table. In that case, the hypothesis test will be fine (and you don't need to go at it backward, using sample size formulas). If so, we're scrapping your original question. Is that what you want? Can you tell us what you want to do with the answer?

(By the way, things may still fail to "add up" when you compare different colors.)
 
Yeah, the subject of the thread did change a little as it went on. I had been asking about a problem with more than 2 colors, assuming that I was using the correct process for 2 colors. When it was explained to me that I was not, I gave up on the >2 problem and went back to figuring this out for just 2 colors. Only when I'm sure I have a solid process down for 2 colors do I want to return to the >2 problem.

Also I'm easily distracted by the excitement of discovering something in mathematics on my own, and so at the moment I have to admit that I'm most interested in knowing if my reasoning regarding this unconventional usage of the sample size formula works out as I laid out in the previous post. I know it was a long post, but to break it down to its core point, is it reasonable to say that if we have a >95% confidence that a population is 51%-61% (i.e. 56% +/- 5%) made up of some thing (whatever it is we're measuring, be it marbles or whatever), then that is the same as saying that there is a >95% probability that a majority of the population is that thing?
 
I made a lot of progress today with this. Regarding my last post about using the sample size formula, I think it works. I'm getting results that make sense and also correlate well with the p-values I was doing earlier (leading me to think that maybe the p-value method is valid as well; even though the p-value may be intended for something else, it seems to also work here). I also went a lot further with reorganizing which variables are being solved for.

First, I took the formula for calculating a needed sample size when given a confidence level (and related z-score) and margin of error. The formula is:
s = (z^2 * 0.25)/e^2
What I really want to have as the variable is the z-score. So I rewrote this equation as:
z = SQRT((s*e^2)/0.25)
To then calculate the Confidence Level, in Excel it's (2*NORMSDIST(z))-1.

After this I knew that I wanted the probability for the value within the population to be between 50% and 100% (in other words, probability of a majority). However, margin of error doesn't allow you to specify that. But I realized I could mess with the margin of error in an interesting way. For example, considering the example survey result below
A survey indicated that 62% of respondents answered in the affirmative. The margin of error is +/- 4%, and the confidence level is 90%
You can separate the margin of error into two halves, each sharing the confidence level equally. So what this means is that you have a 45% chance that the ratio in the general population is between 58%-62%, and a 45% chance that the ratio in the general population is between 62%-66%.

So to use this to get a probability of the majority, I had it calculate the margin of error by subtracting 0.5 from the sample ratio (if the sample ratio is above 50%). This brings the lower end of the margin of error as close to 50% as possible. I then ran the calculations and got a confidence level. Then I went back and did it again with a different margin of error, so large as to hit 100%. Got another confidence level. And then I put them together, half from one and half from the other. If your sample ratio is above 50%, it's just adding the two together and dividing by 2. If the sample ratio is under 50%, then it's the larger confidence level minus the smaller one, again dividing by 2.

Sidenote: this will produce z-scores far outside of the -3 to 3 range they're supposed to have. You have to make sure to cap it at 3. What it means when this happens is that your sample ratio is so overwhelmingly in favor of a ~100% chance that even a much lower sample ratio would get you close to that, and the rest is overkill.
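As described, the split-margin procedure can be sketched like this (my reconstruction of the steps above, including the cap at z = 3; because of rounding in my manual steps, its outputs only roughly match the table that follows):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF (Excel's NORMSDIST)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def half_confidence(sample_size, margin):
    """Confidence level for one margin of error, from the inverted
    sample-size formula z = SQRT((s * e^2) / 0.25), with z capped at 3."""
    z = min(3.0, sqrt((sample_size * margin ** 2) / 0.25))
    return 2.0 * phi(z) - 1.0

def majority_probability(sample_size, ratio):
    """Combine two half-intervals: sample ratio to 50%, and ratio to 100%.
    Above 50%, both halves count toward a majority; below 50%, only the
    part of the upper half that clears 50% does."""
    to_half = half_confidence(sample_size, abs(ratio - 0.5))
    to_one = half_confidence(sample_size, 1.0 - ratio)
    if ratio >= 0.5:
        return (to_half + to_one) / 2.0
    return (to_one - to_half) / 2.0

print(f"{100 * majority_probability(552, 0.510869565):.2f}%")
```

For the 552-sample row this prints about 69.4%, in the same neighborhood as the 67.94% I got by hand with rounding along the way.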

I want to post some sample data and the results I've gotten. I also want to post the results from the p-value method just to show they are similar. The table below references different political candidates poll ratios in different elections. The ratio means "this percent of respondents prefer this candidate over his/her opponent":
sample size     ratio in sample     my new method       old p-value method
600             0.395833333         0%                  0%
552             0.510869565         67.94%              69.52%
799             0.489130435         28.45%              26.94%
690             0.521276596         85.9%               86.82%
586             0.494252874         49.87%              39.04%
744             0.52688172          86.10%              92.87%
755             0.48                13.45%              13.59%
("my new method" values are confidence levels, which are also the probability of a majority in the population)

The numbers seem pretty solid. For the most part they make sense intuitively. I'm not sure why some of them have more difference between this new method and the old p-value method than others do. I thought it could be due to sample size, but looking at the data there it can't be due to that. My new method does have quite a bit of rounding in the steps involved, so some of it is likely due to that.

But all in all, I am filled with joy at the comparisons between the two methods. Both of them were explored as attempts to find a way to calculate the probability of winning an election based on polling data. It seems unlikely that I would pick two wrong methods that use very different procedures and arrive at such similar wrong results.

I'm inclined to think both of these methods are viable for answering the matter of calculating probability when 2 choices are present and sample data is taken. I am aware that the p-value I've calculated is intended for use with rejecting (or not) a null hypothesis, but I'm of the opinion right now that it also works for this procedure.
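The "old p-value method" column can be reproduced exactly; it is just the earlier z-score pushed through the normal CDF (a Python sketch, with math.erf standing in for NORMSDIST):

```python
from math import erf, sqrt

def p_value_method(sample_size, ratio):
    """The earlier approach: z = (ratio - 0.5)/SQRT(0.25/n), then the CDF."""
    z = (ratio - 0.5) / sqrt(0.25 / sample_size)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# (sample size, ratio in sample) pairs from the table above.
rows = [(600, 0.395833333), (552, 0.510869565), (799, 0.489130435),
        (690, 0.521276596), (586, 0.494252874), (744, 0.52688172),
        (755, 0.48)]
for n, ratio in rows:
    print(f"n={n}, ratio={ratio:.3f}: {100 * p_value_method(n, ratio):.2f}%")
```

This reproduces the right-hand column of the table to the displayed precision, so any remaining disagreement between the columns comes from the new method's rounding, not from the p-value column.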
 