Stat Analysis of categorical data: "You hypothesize that in the 17th century nobody lived for 70 years or longer in your country."

_Anna_

New member
Joined
Oct 19, 2023
Messages
13
Hi everyone,
could you please explain to me how to solve the statistical problem below?

You hypothesize that in the 17th century nobody lived for 70 years or longer in your country. In order to test this hypothesis you need to check records in old metrical books. If you find at least one person who lived 70 years or longer, you will reject your hypothesis.
How do I calculate the sample size? How many records in the metrical books do I have to study? Statistical power should be at least 80%.

Thank you.
 
Before I proceed, is this a homework question, work-related or research-related?

It doesn't seem like you have enough information to determine the exact answer without making more assumptions.
 
I am a biologist. I have to do some statistics for my grant application. This problem mimics my research question, and I put it here because it’s easier for understanding than my research question.
 
I invented another problem that mimics my research question better.

Your goal is to develop a new method to separate red and white blood cells. There is a problem with the old methods: the fraction of red blood cells is always contaminated with white blood cells. How many times do you have to repeat your separation experiment to show that the fraction of red blood cells is not contaminated anymore? Statistical power 80%, alpha 5%.
 
In your case, you want to perform a two-sample t-test to compare two groups: one where a fraction of red blood cells is contaminated with white blood cells (before your new method), and another where you hope that this fraction is not contaminated anymore (after applying your new method). You want to determine the sample size needed to detect a significant difference between these groups with a specified level of statistical power.

[math]n = \dfrac{(Z_{\alpha/2}+Z_\beta)^2\cdot \sigma^2}{\delta^2}[/math]
where:
[imath]Z_{\alpha/2}[/imath] is the critical value for your [imath]\alpha/2 = 0.05/2[/imath]
[imath]Z_\beta[/imath] is the critical value for your desired power 80% or [imath]\beta = 0.2[/imath]
[imath]\sigma[/imath] is the estimated standard deviation of the population.
[imath]\delta [/imath] is the effect size (the difference you want to detect).

The last two are where you need to make some assumptions using prior data, a pilot study, a literature review, existing knowledge, etc. In layman's terms, it's your judgment call.

Things to consider:

1) It's important to choose values for [imath]\sigma[/imath] and [imath]\delta[/imath] that are both realistic and meaningful in your specific research context. If you're uncertain about the values, it's better to be conservative and overestimate the required sample size. This way, you ensure that your study is adequately powered to detect meaningful differences, at the cost of possibly collecting more data than is strictly necessary.

2) Remember that larger sample sizes can increase the power of your study but may also be more resource-intensive. Therefore, it's essential to balance statistical power with practical constraints.

Note that the first question isn't the same as the second one you posted: the first attempts to detect a departure from a proportion of zero (i.e., nobody living for 70 years or longer) with high power, whereas the second attempts to detect a meaningful difference due to the new method. Unless I misunderstood the goal of your study, the calculation for the first question would be different. My point here is that your research question should be clear, specific, and well-defined. It sets the stage for the entire research process and guides your methodology, including the choice of statistical tests and sample size calculations.
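As a rough illustration of the formula above, here is a minimal Python sketch. The function name `sample_size` and the planning values `sigma=1.0` and `delta=0.5` are hypothetical, chosen only to show the arithmetic. The code evaluates the formula exactly as written; for two independent groups of equal size, the usual normal approximation multiplies the result by 2 to get the size of each group.

```python
import math
from statistics import NormalDist


def sample_size(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """n = (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil((z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical planning values, for illustration only.
n = sample_size(sigma=1.0, delta=0.5)
print(n)
```

Halving `delta` roughly quadruples the result, which is why the choice of effect size dominates the calculation.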
 
Thanks for your reply.

You said that my first question is not the same as the second one. I need to find out why, because I assumed they were the same.

This is an experimental design for my second research question.
If I separate cells with an old method, I will get either contaminated cells every time, or sometimes contaminated and sometimes non-contaminated cells.
If I separate cells with the new method, I will get non-contaminated cells every time. If I get contaminated cells at least once, I will reject my hypothesis (the hypothesis being that the new method yields non-contaminated cells every time).

Could you please explain why you see two different research questions here? I think the hypotheses are the same: 1. Nobody lives more than 70 years. 2. Cells are never contaminated if I use the new method.

Thank you
 
I am also wondering whether you could explain how to calculate the sample size for my research question in G*Power. Thank you.
 
I would like to add more.

I am at the research proposal stage now. So I wrote in my grant application: all methods we currently use for separating blood cells produce a red cell fraction contaminated with white blood cells. Here I propose a new method that yields a pure fraction of red blood cells. Then I described how I am going to develop my method, and now I have to state how many times I have to repeat my separation experiment in order to show that the red blood cell fraction is not contaminated anymore. How do I calculate n (the number of experiments) in G*Power?
 
Are you trying to prove that your method is significantly more effective than the older method? Or
Are you trying to prove that the probability of contamination of the new method is 0?

It seems like you're trying to do the latter, where it's not only more effective but 100% effective. Correct?

I'm not familiar with G*Power software, but I can imagine it requires the same inputs I provided above.
 
Dear BigBeachBanana, I am very grateful to you for your help. Yes, I am trying to prove that the probability of contamination with the new method is 0. Is that what you called a "proportion of zero" above? If so, could you please explain to me how to calculate the sample size? Then I will try to find out how to do this in G*Power, or will just find an online statistical calculator.

P. S. I would like to send you flowers...
 
I would state your null and alternate hypothesis as follows:

Null Hypothesis (H0): The new cell separation method results in a contamination probability of zero, P(contamination) = 0.
Alternative Hypothesis (H1): The new cell separation method does not result in a contamination probability of zero, P(contamination) ≠ 0.

------
However, here's the issue with that. If you want to make a confident statement that the new method has a zero probability of contamination, you need to be very certain that contamination is impossible, and asserting absolute certainty of zero probability is extremely challenging. There might always be some residual uncertainty or the possibility of rare, unpredictable events.

I recommend you revise your hypothesis. In the context of statistical hypothesis testing, you typically assume some non-zero probability (even a very small one) under the null hypothesis, because absolute certainty of zero contamination is often unrealistic. You might therefore set a very low threshold for what you consider acceptable contamination (p = 0.001 or p = 0.01), and then use statistical testing to determine whether the observed contamination rate is significantly greater than this threshold.

There's no humanly possible sample size for you to test to conclude that the new method has an absolutely 0 chance of contamination.
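To make the threshold idea concrete, here is a minimal Python sketch based on the "at least one" rule from the first post. It assumes the runs are independent and share the same contamination probability p; the function name `runs_to_detect` is hypothetical. If the true probability is p, the chance of seeing at least one contaminated run in n runs is 1 − (1 − p)^n, so the smallest n that reaches the desired power is ln(1 − power) / ln(1 − p), rounded up. Under this rule a contaminated run can never occur if p is exactly 0, so the requirement is driven by power alone.

```python
import math


def runs_to_detect(p: float, power: float = 0.80) -> int:
    """Smallest n with 1 - (1 - p)**n >= power, assuming independent runs."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - power) / math.log(1.0 - p))


# Thresholds taken from the post above (p = 0.01 and p = 0.001).
for p in (0.01, 0.001):
    print(p, runs_to_detect(p))
```

The required number of runs grows roughly in proportion to 1/p, which is the quantitative version of the point that no finite sample can establish a contamination probability of exactly zero.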
 
Thanks.

I need to emphasize another side of this topic. Yes, I understand that it's extremely challenging to prove that something does not exist or is impossible. However, this thread has another goal. My real "separation experiment" is very complicated and expensive. I need to justify in my grant application how many times I need to repeat it in order to draw a conclusion with alpha 0.05 and (1-beta) of at least 0.8. Could you please help me do this (just explain how to calculate the sample size)? Thank you.
 
With [imath]\alpha = 0.05[/imath], [imath]\beta = 0.2[/imath], and [imath]p = 0[/imath]:

[math]n = \dfrac{(1.96+0.84)^2\cdot 0\,(1-0)}{\text{Effect size}^2}=0[/math]
That's the issue with using p = 0 in your hypothesis: the variance term p(1 − p) is zero, so the formula returns n = 0, for the reason stated above.
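A short Python sketch of this calculation shows the collapse at p = 0 and what happens once a small non-zero threshold is used instead. The function name `n_one_proportion` and the value `effect=0.05` are hypothetical, picked only to show the pattern.

```python
from statistics import NormalDist


def n_one_proportion(p: float, effect: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """(z_{alpha/2} + z_beta)^2 * p * (1 - p) / effect^2, before rounding."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * p * (1 - p) / effect ** 2


# effect=0.05 is a hypothetical value, for illustration only.
for p in (0.0, 0.001, 0.01):
    print(p, n_one_proportion(p, effect=0.05))
```

With p = 0 the numerator vanishes regardless of the effect size, so the result says nothing about how many runs are actually needed.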
 
I found a video when I was googling "zero proportion". What do you think?
I cannot post a link here, but it is on YouTube: "What if the Sample Proportion is Zero? A 95% Confidence Interval" by Robert Cruikshank.
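For reference, one standard way to handle a sample proportion of zero (not necessarily the method used in the video) is the exact one-sided upper bound: if 0 contaminated runs are seen in n independent runs, the largest p still consistent with that at 95% confidence solves (1 − p)^n = 0.05, which is approximately 3/n (the "rule of three"). A minimal Python sketch, with the hypothetical function name `upper_bound_zero_events`:

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound for p when 0 events are seen in n independent trials."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


for n in (11, 30, 100):
    exact = upper_bound_zero_events(n)
    print(n, round(exact, 4), round(3 / n, 4))  # exact bound vs. the 3/n shortcut
```

The bound shrinks only as fast as 1/n, so a handful of clean runs still leaves a fairly high contamination probability plausible.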
 
Is it correct? Is my sample size 11 per group? 50% sounds reasonable if we consider my real experiment. Do you know which statistical test the online calculator used?
 
I have just tried an anticipated incidence of 99.9% for group 1 and 0% for group 2 on this online calculator and got a sample size of only 2 per group. Intuitively, that seems wrong.
 