Confidence interval Problem

kenyukobayashi · Nov 18, 2019

Hi everyone,
I'm currently working on a homework assignment for a course called Applied Data Analysis, and I've stumbled across a curious problem.
We have a huge dataset about Reddit containing a million rows, each row describing a post.
To keep things short, I have access to almost 1 million different messages that were posted on Reddit.
The task is to compute the 99% confidence interval of the message length.
To do this, I computed the mean and the standard deviation of the data, then used the following formula: [ mean - z * standard deviation / sqrt(n) ; mean + z * standard deviation / sqrt(n) ] with z = 2.576.
The problem is that the result doesn't seem coherent at all. I got a margin of error that is really small (approx. 1), and my confidence interval doesn't seem right.

Let's say that the mean message length is 200 characters. The standard deviation I got is 300, which is plausible because some people write posts of 3 words while others write posts of 3 paragraphs. The size of the data is 100,000.
With these parameters, the 99% confidence interval would be [197.56 ; 202.44].
So this claims that 99% of the message lengths fall within that interval, which is completely absurd.

Can somebody please enlighten me here?
Thanks a lot!

tkhunny · Nov 18, 2019

Why are you dividing by sqrt(n)? 100,000 is a population, not a sample.

Dr.Peterson · Nov 18, 2019

kenyukobayashi said:
I have access to almost 1 million different messages that were posted on Reddit.
The task is to compute the 99% confidence interval of the message length.
To do this, I computed the mean and the standard deviation of the data, then used the following formula: [ mean - z * standard deviation / sqrt(n) ; mean + z * standard deviation / sqrt(n) ] with z = 2.576.

Let's say that the mean message length is 200 characters. The standard deviation I got is 300, which is plausible because some people write posts of 3 words while others write posts of 3 paragraphs. The size of the data is 100,000.
With these parameters, the 99% confidence interval would be [197.56 ; 202.44].
So this claims that 99% of the message lengths fall within that interval, which is completely absurd.

What was the exact wording of the assignment? I doubt it was that.

The formula you are using is not for a confidence interval for the length of every individual message, but for the mean of the entire population. Confidence intervals don't claim to contain all the data; they claim to contain one parameter, in this case the mean.

If your n = 100,000 is taken as a sample (of the even larger population), then it is a large enough sample that I would expect a very accurate estimate of the mean of the population.
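
To make this concrete, here is a minimal Python sketch using the numbers from the question (mean 200, standard deviation 300, n = 100,000). The function name `mean_confidence_interval` is made up for illustration, and the sketch assumes the normal approximation, which should be reasonable for a sample this large.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, level=0.99):
    """Normal-approximation confidence interval for a population mean.

    Hypothetical helper for illustration: it says where the *mean* is
    likely to be, not where individual observations fall.
    """
    # Two-sided critical value, about 2.576 for a 99% level
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


# Numbers taken from the question above
low, high = mean_confidence_interval(200, 300, 100_000)
print(f"99% CI for the mean: [{low:.2f}, {high:.2f}]")  # roughly [197.56, 202.44]

# Same mean and standard deviation, much smaller sample: the interval widens
small_low, small_high = mean_confidence_interval(200, 300, 100)
print(f"With n = 100: [{small_low:.2f}, {small_high:.2f}]")
```

The interval shrinks as n grows because of the division by sqrt(n), while the standard deviation of 300 (the spread of individual message lengths) stays the same. That is why a narrow interval around the mean is compatible with very different message lengths.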

kenyukobayashi · Nov 19, 2019

Hi,
Here is the question: "plot the mean of message length for each subreddit in descending order. Visualise the statistical significance by plotting the 99% confidence intervals for each subreddit as well."
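
For reference, the computation behind that question could be sketched like this in plain Python. This is only a sketch under assumptions: the input format (pairs of subreddit name and message length), the function name `subreddit_mean_cis`, and the sample data are all made up for illustration, and the plotting step is left out. It uses the same normal-approximation formula as above, which may be too optimistic for subreddits with only a handful of posts.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, stdev


def subreddit_mean_cis(posts, level=0.99):
    """Per-subreddit mean message length with a confidence-interval margin.

    posts: iterable of (subreddit, message_length) pairs (assumed format).
    Returns (subreddit, mean, margin) tuples sorted by mean, descending.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for 99%

    # Group message lengths by subreddit
    lengths = defaultdict(list)
    for subreddit, length in posts:
        lengths[subreddit].append(length)

    rows = []
    for subreddit, values in lengths.items():
        if len(values) < 2:
            continue  # the sample standard deviation needs at least two posts
        margin = z * stdev(values) / sqrt(len(values))
        rows.append((subreddit, mean(values), margin))

    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up data, for illustration only
posts = [("a", 10), ("a", 20), ("a", 30), ("b", 100), ("b", 300), ("c", 5)]
result = subreddit_mean_cis(posts)
for name, avg, margin in result:
    print(f"{name}: {avg:.1f} +/- {margin:.1f}")
```

Each margin could then be drawn as an error bar around the corresponding mean, one bar per subreddit.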

kenyukobayashi · Nov 19, 2019

Dr.Peterson said:
What was the exact wording of the assignment? I doubt it was that.

The formula you are using is not for a confidence interval for the length of every individual message, but for the mean of the entire population. Confidence intervals don't claim to contain all the data; they claim to contain one parameter, in this case the mean.

If your n = 100,000 is taken as a sample (of the even larger population), then it is a large enough sample that I would expect a very accurate estimate of the mean of the population.

I see, I mixed it all up. If that is the case, the values are indeed coherent.
