r squared with weighting for each data point

dchubbock

New member
Joined
Feb 27, 2016
Messages
3
I have minimal statistics schooling, but I am attempting some analysis of the correlation between data sets in Excel. This I have managed using the RSQ function. However, I would like to weight each data point according to its 'sample size' (amount):

[attached screenshot: r squared.jpg]


I have found that when I compute r-squared on, say, just the top 10 data points by amount, the correlation is generally stronger, with fewer outliers. However, I would still like to calculate r-squared on the larger data set, weighted by the amount.

I hope this makes sense. Thanks in advance to anyone who spends time helping me with this.

Dan
 

Attachments

  • r squared.jpg · 31.7 KB
First, I'm guessing at some of the writing/numbers in the image, as it is too small for me to read. Next, I'm not quite sure how the 'sign up rate', 'soc imp(?) rate', and 'amount' relate to one another. I would usually guess that the third column is a 'relative frequency' column, i.e. that the pair (0.029%, 35.995%) occurred 1,852,512 times. But that would mean you had some umpteen million samples, which I don't think the spreadsheet could handle for RSQ [too bad it doesn't take a relative-frequency argument with a default of 1]. You could account for the relative frequency manually by using the definition of the Pearson product-moment correlation coefficient; see
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
for example. That is, you would apply the same formulas, adjusting each term by the relative-frequency column.
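A sketch of how that might look in Excel, assuming (hypothetically) x in A2:A20, y in B2:B20, and the relative frequencies in C2:C20, with E1:E4 as helper cells:

Code:
E1 (weighted mean of x):   =SUMPRODUCT(A2:A20,C2:C20)/SUM(C2:C20)
E2 (weighted mean of y):   =SUMPRODUCT(B2:B20,C2:C20)/SUM(C2:C20)
E3 (weighted Pearson r):   =SUMPRODUCT(C2:C20,A2:A20-E1,B2:B20-E2)/SQRT(SUMPRODUCT(C2:C20,(A2:A20-E1)^2)*SUMPRODUCT(C2:C20,(B2:B20-E2)^2))
E4 (weighted r-squared):   =E3^2

With integer frequencies this should give the same r-squared that RSQ would return on the fully expanded data, without actually expanding it.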

If it is anything else, I can't come up with anything that makes sense to me off the top of my head, but as wild guesses the first row might mean either of the following two things: 'In a sample of 1,852,512 people, 0.029% of the population reacted positively to 35.995% of the products.' In that case your numbers would be converted by multiplying the 0.029% by the 1,852,512 (giving roughly 537 people).

However, if the statement were instead 'in a sample of 1,852,512 products, 0.029% of the population reacted positively to 35.995% of the products', you would have a different result: your numbers would be converted by multiplying the 35.995% by the 1,852,512.
 
Thank you for responding, Ishuda. I don't believe the Pearson formula is what I'm looking for, but explaining a little more about what the numbers are might help. Your 'wild guess' (the first situation) was actually pretty close.

This is advertising data from Facebook:

Column A is the conversion rate: website sign-ups divided by the number of times the ad was seen (or 'impressions'), as a %.
Column B is the number of times the ad was 'liked', divided by impressions (again, as a %).

(Column B is the variable I'm testing against conversion rate, to see which metric correlates with it most strongly; for example, it could instead be the number of times the ad was 'shared' on Facebook, divided by impressions.)

Column C is the number of impressions, so it acts as a kind of sample size.

The idea is that if the number of impressions for a given segment (row) is higher, then the results for that row are more statistically significant and should therefore carry greater weight.

I'm not sure how to approach this, so any further guidance would be appreciated!

thank you
 
So, if I'm understanding you correctly, if you were to multiply both columns A & B by 'Amount', you would have something like
Column A = Number who signed up
Column B = Number who 'Liked' the ad
That is, for example, for row one we would have 1,852,512 who saw the ad, about 666,719 who 'Liked' it, about 537 who signed up, and about 1,185,256 who couldn't care less. You now make a linear fit between columns A and B and would like to know something about how good the fit is, i.e. a correlation coefficient of some kind.

Columns A & B as they originally stand 'normalize' the numbers so that the number of impressions for a particular ad doesn't matter. If you want to include the number of impressions, you need a three-dimensional fit, i.e. instead of
A = a B + c
you would have
A = a B + b I + c
where I is the number of impressions [in millions, in order to get reasonable numbers to work with].
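A sketch of how that three-variable fit might be done in Excel, assuming (hypothetically) the sign-up rate in A2:A20 and the like rate and impressions-in-millions side by side in B2:C20:

Code:
=LINEST(A2:A20,B2:C20)

Entered as an array formula across three cells (Ctrl+Shift+Enter in older Excel; newer versions spill it automatically), LINEST returns the coefficients right to left: b (impressions) first, then a (like rate), then the intercept c.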

I'm off for right now, but if you have more questions, let us know.
 
After many hours and countless Google searches, I came across this thread, which achieves exactly what I was looking for:

http://stackoverflow.com/questions/11087773/weighted-trendline

It seems weighted least squares regression is the formula I need to use. Plotting the data as bubbles, with impressions as the weights (bubble size), is the best way to visualise what I was aiming for. As you can see in the plot below, the r² value was relatively low because the few outliers were weighted equally with the points that had large sample sizes. The weighted least squares formula appears to fix this.

[attached chart: Least squares bubble chart.jpg]
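For reference, a sketch of the weighted least squares slope and intercept in Excel, again assuming a hypothetical layout with sign-up rate (y) in A2:A20, like rate (x) in B2:B20, and impressions (weights) in C2:C20:

Code:
E1 (weighted mean of y):   =SUMPRODUCT(A2:A20,C2:C20)/SUM(C2:C20)
E2 (weighted mean of x):   =SUMPRODUCT(B2:B20,C2:C20)/SUM(C2:C20)
F1 (weighted slope):       =SUMPRODUCT(C2:C20,B2:B20-E2,A2:A20-E1)/SUMPRODUCT(C2:C20,(B2:B20-E2)^2)
F2 (weighted intercept):   =E1-F1*E2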


The only thing I'm yet to work out is how to plot the trendline based on the weighted least squares formula.
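In case it helps anyone else, one way this might be done (a sketch, using the F1/F2 slope and intercept cells from the block above) is to add a fitted-value column and chart it as a second series:

Code:
D2 (fitted sign-up rate, filled down to D20):   =$F$1*B2+$F$2

Then plot B2:B20 vs D2:D20 as a separate XY (Scatter) series formatted as a line with no markers (alongside a bubble chart this may require a combo chart).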

Thank you again for your help
 

Attachments

  • Least squares bubble chart.jpg · 16.1 KB