kenyukobayashi
New member
- Joined
- Nov 18, 2019
- Messages
- 3
Hi everyone,
I'm currently working on a homework assignment for a course called Applied data analysis, and I've stumbled across a curious problem.
We have a huge dataset about Reddit, which contains about a million rows of data, each row describing a post.
To keep things short, I have access to almost 1 million different messages that were posted on Reddit.
The question asks me to compute a 99% confidence interval for the length of the messages.
To do this, I computed the mean and the standard deviation of the data, and then used the following formula: [ mean - z * standard deviation / sqrt(n) ; mean + z * standard deviation / sqrt(n) ], with a z value equal to 2.576.
The problem is that the result doesn't seem coherent at all. I get a margin of error which is really small (approx. 1), and my confidence interval doesn't seem right.
Let's say that the mean message length is 200 characters. The standard deviation I got is 300, which seems plausible because some people write posts of 3 words while others write posts of 3 paragraphs. Say the sample size is 100,000.
With these parameters, the 99% confidence interval would be approximately [197.56 ; 202.44].
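For reference, here is a minimal Python sketch of the calculation described above, using the example numbers (mean 200, standard deviation 300, n = 100,000). The function name `mean_confidence_interval` is just a name I made up for illustration, and the z value is taken from the standard normal distribution rather than typed in by hand, so treat it as a sketch of the formula and not as a definitive implementation.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, std_dev, n, confidence=0.99):
    """Interval from the formula: mean +/- z * std_dev / sqrt(n).

    Hypothetical helper for illustration; it assumes the normal
    approximation implied by using a z value.
    """
    # Two-sided critical value, about 2.576 for 99% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Margin of error: shrinks as n grows because of the sqrt(n) divisor
    margin = z * std_dev / math.sqrt(n)
    return mean - margin, mean + margin


# Example numbers from the post
low, high = mean_confidence_interval(mean=200, std_dev=300, n=100_000)
print(f"[{low:.2f} ; {high:.2f}]")
```

With these inputs the margin of error comes out at roughly 2.44 characters on either side of the mean, which is the very narrow interval I am puzzled by.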
So, as I understand it, this is claiming that 99% of the posted messages have a length contained in that interval, which is completely absurd.
Can somebody please enlighten me here?
Thanks a lot!