Sum of variances from the linear regression coefficients is larger than the total variance

Sarmes · New member · Joined Dec 7, 2013 · Messages: 4
Dear Readers,

I have a linear regression model with 32 variables, 32 regression coefficients (no constant), and about 1000 observations Y, based on a thousand random combinations of X. (X is a matrix with the samples in the rows; each column represents one dimension.)

When I take regression coefficient i (i = 1:32) and multiply it with the ith column of my X matrix, and do that for each of the 32 variables, I get 32 column vectors of 1000 values; let's call this matrix B. If I added up those columns, I would have evaluated the regression model, and the result would be the prediction column vector Y_p.

But my question is about the variances.
If I now take the variance over the columns of matrix B, I get a row vector.
I expected the sum of this row vector to add up to a number equal to the variance of the prediction vector Y_p, but for high dimensions this is somehow not the case.

It can even be off by about 50%.
Why is this? Is it because of round-off errors? (I am using MATLAB.)
It somehow seems to depend on the distribution of the correlation coefficients.
When I have an exponential distribution of the regression coefficients (a few important variables), the difference is much bigger than when the regression coefficients are evenly distributed (all more or less of the same magnitude).

Could somebody maybe explain this to me?

Well, maybe I am explaining it too much from my application's point of view.
Shorter version: why is var(sum(B)) not equal to sum(var(B))?

Here B is a matrix, and the operators var() and sum() always work over the last dimension of the matrix; so in this case the first operator goes over the columns, and the second over the rows.
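
A minimal MATLAB sketch of what I mean (the data and the coefficient vector a are made up purely for illustration; the common factor is only there to give the columns of X some correlation):

Code:
% 1000 samples, 32 variables; the common factor makes the columns of X correlated
n = 1000; p = 32;
common = randn(n, 1);
X = randn(n, p) + 0.5 * repmat(common, 1, p);
a = exp(-(1:p)');                % made-up, exponentially decaying coefficients
B = X * diag(a);                 % column i of B is a(i) * X(:, i)
Y_p = sum(B, 2);                 % prediction vector

sum_of_variances = sum(var(B))   % sum of the 32 column variances
variance_of_sum  = var(Y_p)      % variance of the prediction vector
% The gap between the two numbers is the sum of the off-diagonal entries of
% cov(B); it only vanishes when the columns of B are uncorrelated.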
 
The problem is probably related to Covariances: the Variance here is described by a matrix, not a single vector of values. The elements of the covariance matrix are indeed related to the correlation coefficients, as you suspected. The diagonal elements of the covariance matrix are the Variances of the coefficients \(\displaystyle C_i\), and the off-diagonal elements are

\(\displaystyle Cov_{ij} = \rho_{ij}\ \sqrt{V_i\ V_j}\)

Determining the variance of any function of the \(\displaystyle C_i\) requires matrix operations. Note that correlations may be of either sign, so they may make the sum of the individual variances either larger or smaller than the variance of the total.
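
For instance, a quick MATLAB check of this on the matrix B from the question might look like the sketch below (MATLAB's var and cov use the same 1/(n-1) normalization, so the identity holds up to round-off):

Code:
% Assuming B is the 1000-by-32 matrix described in the question.
C = cov(B);              % 32x32 sample covariance matrix of the columns of B
lhs = var(sum(B, 2));    % variance of the prediction vector Y_p
rhs = sum(C(:));         % sum of ALL entries of the covariance matrix
gap = lhs - rhs          % zero up to round-off
% sum(diag(C)) is the "sum of variances"; what it is missing is the sum of
% the off-diagonal (covariance) entries.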
 
Additional related questions

Thank you for your answer.

Indeed, I did not think about the covariances, because I used random samples and thought that would totally cancel out any covariances or correlations, but when I looked at it in detail they seem to be significant.

Does this mean that the accuracy of the regression is influenced by the covariance of the samples?
For example, for Y = sum_i a_i*X_i I understand that Var(Y) is the total sum of cov(a_i*X_i, a_j*X_j) over all i and j.
In the perfect case the off-diagonal terms would all be zero, and I would only have the variances on the diagonal, which would be equal to the squared regression coefficients times the variances, (a_i)^2*Var(X_i), right?
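
In symbols, just writing out the standard decomposition I have in mind:

\(\displaystyle \operatorname{Var}(Y) = \operatorname{Var}\Big(\sum_i a_i X_i\Big) = \sum_i a_i^2\,\operatorname{Var}(X_i) \;+\; \sum_{i \neq j} a_i\, a_j\, \operatorname{Cov}(X_i, X_j),\)

so only when all the off-diagonal covariances vanish does Var(Y) reduce to the sum of the diagonal terms \(\displaystyle a_i^2\,\operatorname{Var}(X_i)\).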

If I understand this correctly, it seems to imply that it would be better to sample in such a way that the off-diagonal terms of the covariance matrix of the samples are as small as possible, right?
(I tried to imagine this in a 2D example by thinking about making a linear regression over 2 variables, with 3 points that are almost lying on a straight line.)

1) Does this also mean that it could be better to use a subset of the original samples, if for that subset the off-diagonal covariance terms are smaller than for the full set?
That would be really surprising and unintuitive to me: using fewer samples, but getting a more accurate result because those samples are less correlated.

2) How is this covariance of the samples normally related to the accuracy of the regression coefficients? In my example of a regression in 2D over 3 points that are almost on a line, I can get very good R^2 values, although I have very correlated samples. Is there some indicator, such as the sum of the off-diagonal covariance terms divided by the diagonal covariance terms, to give a warning for such cases?
 
I will try to find time [currently on vacation] to work this through, so I can make a coherent reply. Two points:

1) All of your 1000 data must be statistically independent - no correlation (or covariance) between measurements.

2) In principle you have to know the statistical uncertainties (i.e. Variance) of each measured y. If they are not all equal, then the individual data should be weighted inversely as V[y]. If they ARE all equal, the weighting will be equal but the V[y] will propagate as a normalizing factor.

In what I am writing out, I will be using i,j as indices of the 32 coefficients and independent variables, and k as the index of the 1000 observations. I will develop a 32×32 "least-squares" matrix, the inverse of which will be the Variance/Covariance matrix.

See you later!
 
[Comment removed, because the question had already been answered before the moderator released this message.]
 
The linear model for \(\displaystyle y(x_1, x_2, \dots, x_{32})\) is

\(\displaystyle \displaystyle \hat y = \vec a \cdot \vec x = \sum_i\ a_i\ x_i \)

A measurement consists of a value of \(\displaystyle \vec x_k\) and the corresponding \(\displaystyle y_k\), which deviates from the model value by an amount

\(\displaystyle \epsilon_k = \hat y_k - y_k = \sum_i\ a_i\ x_{ik} - y_k\)

Let \(\displaystyle S\) be the sum of the squares of the deviations:

\(\displaystyle \displaystyle S = \sum_k \left(\sum_i\ a_i\ x_{ik} - y_k\right)^2\)

To minimize \(\displaystyle S\) ("least squares"), set the partial derivative with respect to each coefficient to 0:

\(\displaystyle \displaystyle \dfrac {\partial S}{\partial a_j} = \sum_k\ 2 \left(\sum_i\ a_i\ x_{ik} - y_k\right)\ x_{jk} = 0\)

......\(\displaystyle \displaystyle \implies \sum_i\left(\sum_k x_{ik}\ x_{jk}\right) a_i = \sum_k\ y_k\ x_{jk}\)

This is now a set of 32 equations in 32 unknowns. In matrix form,

\(\displaystyle \alpha \times \vec a = \vec{\beta}\), ...where

\(\displaystyle \alpha_{ij} = \sum x_i x_j \;\;\; \text{ and } \;\;\; \beta_i = \sum y\ x_i \),

and the summations are over the measurements, \(\displaystyle k\). The elements of the least-squares matrix \(\displaystyle \alpha\) are sums of products of pairs of the independent variables, and the elements of the constant vector \(\displaystyle \vec{\beta}\) are the sums of the products of y times each independent variable.

The system of equations is solved by multiplying on the left by the inverse of \(\displaystyle \alpha\):

\(\displaystyle \vec a = \alpha^{-1}\ \vec{\beta}\)

The diagonal elements of \(\displaystyle \alpha^{-1}\) are the Variances of \(\displaystyle \vec a\), and the off-diagonal elements are the Covariances. As you suggested in one of your posts, it is not "nice" to have large off-diagonal terms. It is possible to redefine your set of independent variables to minimize the covariances. For instance, I was using N and Z as two of the parameters for a fit. But because there was a strong correlation between N and Z, I got much cleaner results by using (N+Z) and (N-Z) instead. To find the optimum set of linear combinations, I went so far as to diagonalize the Covariance matrix!
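
As a rough MATLAB translation of the above (a sketch only, assuming X is the 1000-by-32 matrix of independent variables with the samples in rows, y is the 1000-by-1 vector of measurements, and the data have already been normalized to unit variance):

Code:
alpha = X' * X;        % 32x32 least-squares matrix, alpha(i,j) = sum_k x_ik * x_jk
beta  = X' * y;        % 32x1 constant vector,       beta(i)   = sum_k y_k  * x_ik
a     = alpha \ beta;  % fitted coefficients (backslash is numerically safer than inv)
Ca    = inv(alpha);    % variance/covariance matrix of the coefficients
var_a = diag(Ca);      % variances of the 32 coefficients; the off-diagonal entries
                       % of Ca are the covariances between coefficients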

The sum of squares \(\displaystyle S\) will always decrease if you add more terms. However, \(\displaystyle S\) is a sample from a \(\displaystyle \chi^2\)-distribution with degrees of freedom = (number of data) \(\displaystyle -\) (number of parameters). If you have the Variance of the data properly normalized, you can use the \(\displaystyle \chi^2/\text{d.f.}\) statistic to decide how many parameters to keep.
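
And the corresponding goodness-of-fit check, continuing the sketch above (again assuming the data have been normalized to unit variance):

Code:
a   = (X' * X) \ (X' * y);     % least-squares coefficients as above
S   = sum((X * a - y).^2);     % residual sum of squares
dof = size(X, 1) - numel(a);   % (number of data) - (number of parameters)
chi2_per_dof = S / dof         % roughly 1 for an adequate model with correct variances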
 
Thank you very much for the comprehensive explanation.
It is really very clear. Based on this I have now programmed a MATLAB script to make a bilinear fit for given input data.

The only thing that is still quite magical to me is that the inverse of alpha is the covariance matrix.
 
Look at the equation for coefficients \(\displaystyle \vec{a}\):

......\(\displaystyle \vec{a} = \alpha^{-1}\ \vec\beta\),....where....\(\displaystyle \displaystyle\beta_i = \sum y\ x_i\)

Assume the independent variables are known perfectly, and that all of the measurements \(\displaystyle y_k\) have the same Variance, \(\displaystyle V_y\). Suppose that you have weighted all of the data by normalizing to unit Variance, that is, all of the sums are divided by \(\displaystyle V_y\). Since the weight is constant, it does not affect the determination of \(\displaystyle \vec{a}\), but it does leave you in the position that propagation of errors from the raw data to \(\displaystyle \vec{a}\) uses the elements of \(\displaystyle \alpha^{-1}\).
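
One way to see it (just standard propagation of errors, written with the matrix \(\displaystyle X\) of the \(\displaystyle x_{ik}\) with samples in rows, so that \(\displaystyle \alpha = X^{T}X\) and \(\displaystyle \vec\beta = X^{T}\vec y\)): since \(\displaystyle \vec a = \alpha^{-1}X^{T}\vec y\) is linear in the measurements,

\(\displaystyle \operatorname{Cov}(\vec a) = \alpha^{-1}X^{T}\,\operatorname{Cov}(\vec y)\,X\,\alpha^{-1} = V_y\ \alpha^{-1}\,(X^{T}X)\,\alpha^{-1} = V_y\ \alpha^{-1},\)

which is just \(\displaystyle \alpha^{-1}\) once the data have been normalized so that \(\displaystyle V_y = 1\).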

This is even more fun when the measurements do not all have the same Variance, so the weights are not equal.
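
A sketch of that unequal-variance case in MATLAB (the vector Vy of individual measurement variances is a hypothetical input; it is whatever uncertainties your measurements have):

Code:
w     = 1 ./ Vy;           % weight of measurement k is 1 / V_y(k)
W     = diag(w);           % 1000x1000 diagonal weight matrix
alpha = X' * W * X;        % weighted least-squares matrix
beta  = X' * W * y;        % weighted constant vector
a     = alpha \ beta;      % fitted coefficients
Ca    = inv(alpha);        % variance/covariance matrix of the coefficients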
 