How to Proceed

JeffM

I have a series of fourteen observations of \(\displaystyle x_{1,t}\) for \(\displaystyle t = 1, \dots, 14\). I also have 14 observations of \(y_t\).

The observations of y are generally considered reliable. The reliability of the observations of x_1 has been questioned. So I am looking for a test of the reliability of the observations of x_1.

\(\displaystyle y_t \equiv x_{1,t} + x_{2, t}.\)

The identity above would obtain if the observations for y, x_1, and x_2 were all perfectly exact. But no observation is exact; the observations for x_1 have been disputed, and there are no data at all for \(x_{2,t}\). My first thought was to look at the approximation:

\(\displaystyle y_t \approx x_{1,t} + z,\) where z is a constant standing in for the unobserved x_2.

There is no reason to believe that this approximation will be a good one because the data for x_2 are highly unlikely to be stable from period to period. Furthermore, if the data for x_1 are indeed defective, this will add another source of error. But I must work with what I have.

When I do the obvious linear regression between y and x_1, I get a seemingly terrible fit with r^2 = 36%. That could be because the missing variable is not stable, or because the data on x_1 really are defective, or both. However, the coefficient of x_1 has the right sign and is statistically significant. Moreover, I can improve the fit to 46% by adding t as a variable, which confirms other evidence that x_2 has a trend component. Of course, even if x_2 is not stable, that does not prove that the data for x_1 are reliable.
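For concreteness, here is a minimal Python sketch of those two regressions. The data here are synthetic stand-ins (hypothetical arrays y, x1, t); the real fourteen observations would replace them:

```python
import numpy as np

# Synthetic stand-in data: 14 periods, with an x_2 that has a trend (all hypothetical).
rng = np.random.default_rng(0)
t = np.arange(1, 15, dtype=float)
x1 = rng.normal(100.0, 10.0, 14)                 # stand-in for the disputed x_{1,t}
x2 = 50.0 + 2.0 * t + rng.normal(0.0, 5.0, 14)   # unobserved x_2 with a trend
y = x1 + x2                                      # the identity y_t = x_{1,t} + x_{2,t}

def r_squared(y, X):
    """R^2 of an OLS fit of y on the columns of X, intercept included."""
    A = np.column_stack([X, np.ones_like(y)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print("r^2, y on x1:      ", r_squared(y, x1[:, None]))
print("r^2, y on x1 and t:", r_squared(y, np.column_stack([x1, t])))
```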

Because my initial test gave ambiguous results, I calculated deviations from the mean for y and for x_1. If the deviation was negative, I scored it as -1; if positive, as 1. Regressing those two sets of scores against each other and forcing the constant term to be zero gives me r^2 of 73%. But I am far from sure that I should not be using some other test (a non-parametric test, perhaps?). Nevertheless, out of fourteen observations, the scores match thirteen times. That suggests to me that the x_1 data are at least decent: I can predict the sign of y's deviation from its mean from the sign of x_1's deviation from its mean 92% of the time. Is it correct to infer that the data for x_1 are decent? Did I apply the proper test by regressing the deviations, or should I use some other test?
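As a sketch, that sign-scoring check might look like this in Python (same hypothetical stand-in data as above, regenerated here so the snippet is self-contained):

```python
import numpy as np

# Hypothetical stand-in data, as in the previous sketch.
rng = np.random.default_rng(0)
t = np.arange(1, 15, dtype=float)
x1 = rng.normal(100.0, 10.0, 14)
y = x1 + 50.0 + 2.0 * t + rng.normal(0.0, 5.0, 14)

# Score each deviation from the mean as +1 or -1.
sy = np.where(y >= y.mean(), 1, -1)
sx = np.where(x1 >= x1.mean(), 1, -1)

# How often the scores agree, and the through-origin regression slope.
matches = int((sy == sx).sum())
slope = (sx @ sy) / (sx @ sx)   # least squares with the constant forced to zero
print(f"signs agree on {matches} of {len(y)}; through-origin slope = {slope:.2f}")
```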

Lacking data on one of the critical variables, perhaps I should even view a coefficient of determination of 36% as being an excellent fit.

Does anyone have a suggestion, other than having a few martinis and forgetting the whole thing?
 
Dr. Phil
There are two distinct forms of error in measurements: statistical and systematic. Statistical uncertainties relate to the precision with which a measurement can be made: if you imagine repeating an identical measurement a large number of times, the distribution of values can typically be represented by a normal distribution with some standard deviation. Systematic uncertainties, on the other hand, limit the accuracy of the measurement, for instance if the instrument is not calibrated correctly.

Let's look at statistics. You state y is "reliable" - can you estimate the standard deviation \(\displaystyle \sigma_{y,t}\) for each observation, or a constant \(\displaystyle \sigma_y\) to be used for all?

Then you say the reliability of \(\displaystyle x_{1,t}\) is questionable (or has been questioned). What do you think, a priori, the standard deviation of a measurement is? If you assign a value \(\displaystyle \sigma_{x1}\) (or values \(\displaystyle \sigma_{x1,t}\)) and work through the regression including propagation of errors, you can see whether the apparent fit is consistent with the proposed \(\displaystyle \sigma_{x1}\). How big a value of \(\displaystyle \sigma_{x1}\) would you have to assume to make the propagated error bars consistent with the scatter plot of the data?

Since I ALWAYS propagate errors through any model fitting, I am sure I can find a sample program to send you, or I should be able to find a write-up of the procedure I follow. In the meantime, look at the scatter plot and your regression line. How big would the error bars in the x direction have to be so that 2/3 would include the regression line? That should be an estimate of the standard deviation.
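A rough numeric version of that rule of thumb (with hypothetical stand-in arrays x1 and y): fit the line, measure each point's horizontal distance to it, and take roughly the 67th percentile of those distances as the implied \(\displaystyle \sigma_x\):

```python
import numpy as np

# Hypothetical stand-in data.
rng = np.random.default_rng(1)
x1 = rng.normal(100.0, 10.0, 14)
y = x1 + rng.normal(0.0, 8.0, 14)

# Ordinary regression line y = a*x + b.
a, b = np.polyfit(x1, y, 1)

# Horizontal distance from each point to the line: how far x1 would have
# to move for the point to land exactly on the regression line.
dx = x1 - (y - b) / a

# The sigma at which ~2/3 of 1-sigma error bars in x reach the line is
# roughly the 67th percentile of these absolute distances.
sigma_x = np.percentile(np.abs(dx), 100 * 2 / 3)
print(f"implied sigma_x is about {sigma_x:.2f}")
```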
 
JeffM
Thank you very much for taking the time to think about my question.

I can see that I provided insufficient information. The y values can be considered reliable because (a) they are summations of many independent events, which minimizes random error, and (b) with respect to each individual event, the buyer did not want to pay too much and the seller did not want to be paid too little, so consistent bias between them should be minimal. That is, the process from which the observations come has built-in checks to keep errors relatively small. But there is no way for me to go back to the 1930s and calibrate the process. I can merely make the plausible judgment that the differing interests of a buyer, a seller, and a shipper will ensure that the quantity purchased and shipped was measured with some care.

With respect to the x_1's, which are government statistics, the process involved no adversarial aspect that would act to minimize errors, and there have been accusations that the government either grossly erred in its estimates or deliberately falsified them.

If I had reliable data for x_2, I could simply use the identity y - x_2 = x_1 and be done with it. But I have no data for x_2, let alone reliable data. If I had got a really good fit when I regressed y and x_1, I would have used that as evidence that the figures for x_1 are reasonably decent and that x_2 was quite stable. But there is no reason to believe that x_2 was stable; in fact, it is highly probable that it had a trend and reasonably likely that it had material variations around its trend. So I was not surprised when the regression gave a low r^2. It did strike me as reassuring about the reliability of the estimates of x_1 that the coefficient at least had the proper sign. But I am reluctant to give these estimates a clean bill of health just on the sign of the coefficient.

So, as I said, I computed the deviations around the mean for y and for x_1. There is an almost perfect match between the sign of the deviations of the two variables around their respective means. I did a regression, but I am not sure that regression is suitable for these binary scorings. I am tempted to say that this matching indicates that the accusations about the unreliability of estimates for x_1 are not supported by the admittedly thin evidence, but I am not sure that I have used the proper test or that my reasoning is correct.
 
With respect to the x_1's, which are government statistics, the process involved no adversarial aspect that would act to minimize errors, and there have been accusations that the government either grossly erred in its estimates or deliberately falsified them.
That makes the problem nearly impossible - whatever analysis you do, biased data will give inaccurate results. What was the hidden agenda of the person(s) who may have fudged the data?

If you believe (or assume) that there is time dependence in the x_2 term, then including it in the regression is a good idea. To see if it really helps the statistics you have to account for the number of degrees of freedom being reduced from 12 to 11 when you add a parameter; I generally use \(\displaystyle \chi^2/d.f.\) as the criterion for goodness of fit.

It might still be interesting - maybe even useful - to consider what error bars you would have to put on the x_1 data to get 2/3 of them to include the regression line. It might be easier to visualize if you make y the independent variable:

\(\displaystyle x_{1,t} = y_t - x_2(t) \longrightarrow \alpha \ y_t + \beta + \gamma \ t\)

The data should be weighted as 1/Variance. If the assumed error bars on the \(\displaystyle x_{1,t}\) are all taken to be equal, \(\displaystyle \sigma_x\), you can do the regression unweighted and scale the propagated error matrix later. [If you do the regression by actually inverting the matrix, then the resulting inverse matrix has the Variances of the determined coefficients on the diagonal and the Covariances off-diagonal.]
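A minimal numpy sketch of that weighted fit, using hypothetical stand-in data for the model \(\displaystyle x_{1,t} = \alpha\, y_t + \beta + \gamma\, t\); inverting the normal matrix gives the variances and covariances described above:

```python
import numpy as np

# Hypothetical stand-in data and an assumed per-point sigma on x1.
rng = np.random.default_rng(2)
t = np.arange(1, 15, dtype=float)
y = rng.normal(150.0, 12.0, 14)
x1 = 0.9 * y - 40.0 - 1.5 * t + rng.normal(0.0, 3.0, 14)
sigma = np.full(14, 3.0)

# Design matrix for x1 = alpha*y + beta + gamma*t, and weights 1/Variance.
A = np.column_stack([y, np.ones_like(t), t])
W = np.diag(1.0 / sigma**2)

# Normal equations: (A^T W A) p = A^T W x1.
M = A.T @ W @ A
p = np.linalg.solve(M, A.T @ W @ x1)

# The inverse of the normal matrix: variances of the fitted coefficients on
# the diagonal, covariances off the diagonal.
cov = np.linalg.inv(M)
for name, val, var in zip(("alpha", "beta", "gamma"), p, np.diag(cov)):
    print(f"{name} = {val:8.3f} +/- {np.sqrt(var):.3f}")
```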

This is "interesting." Good luck!
 
What was the hidden agenda of the person(s) who may have fudged the data?

Personally, the accusations that the Raj deliberately faked the figures for production of rice in Bengal have always seemed tenuous at best. Revolutionaries will choose any stick available to beat their opponents. But, in the 1790s, the British had made a "permanent settlement" with the zamindars of Bengal never to raise their property taxes. (Amazingly, the British kept their promise.) Consequently, the government agents who made the estimates were not tax agents, but primarily magistrates whose only interest was to distinguish ordinary years from potential famine years. Unlike the tax agents who estimated the productivity of land in the rest of India, the estimators in Bengal had no interest in, and probably little training for, making numeric estimates as exact as possible. So it seems quite possible that the statistics are seriously flawed.

If you believe (or assume) that there is time dependence in the x_2 term, then including it in the regression is a good idea. To see if it really helps the statistics you have to account for the number of degrees of freedom being reduced from 12 to 11 when you add a parameter; I generally use \(\displaystyle \chi^2/d.f.\) as the criterion for goodness of fit.

I had not thought to use the chi-squared test. I shall have to review it.

It might still be interesting - maybe even useful - to consider what error bars you would have to put on the x_1 data to get 2/3 of them to include the regression line. It might be easier to visualize if you make y the independent variable:

\(\displaystyle x_{1,t} = y_t - x_2(t) \longrightarrow \alpha \ y_t + \beta + \gamma \ t\)

I see what you mean about making y be the independent variable. That is after all the estimate that warrants the greatest confidence. Great suggestion.

The data should be weighted as 1/Variance. If the assumed error bars on the \(\displaystyle x_{1,t}\) are all taken to be equal, \(\displaystyle \sigma_x\), you can do the regression unweighted and scale the propagated error matrix later. [If you do the regression by actually inverting the matrix, then the resulting inverse matrix has the Variances of the determined coefficients on the diagonal and the Covariances off-diagonal.]

This is "interesting." Good luck!
Dr. Phil, thank you for your help. I truly appreciate it.

The term "error bars" is unfamiliar to me, but I am guessing that it represents the difference between the observed value and the estimated value. Of course, in this case, the question is whether the "observed" values really are valid observations. So if I understand what you are suggesting, I should compute the minimum changes needed in the observed x_{1,t} to get two-thirds of them to fall on the regression plane. If those changes are relatively small, then the presumption is that the observed x_{1,t} are close to exact. That would be a really clever suggestion if I were not missing a relevant variable. But because I am forced to leave out a variable known to be relevant, namely x_2, there is no reason to expect that kind of fit between y and t on the one hand and x_1 on the other even if every observed y_t and every observed x_{1, t} was exact.

This problem is weird enough that I am posting a clearer explanation of it on another site, so it may make sense for you to stop (or at least defer) spending more time on it. Your responses have at the very least helped me formulate the question more cogently. But they have done more: they have given me new things to think about. So thank you again.

PS My mathematical education never included a course in linear algebra. I did study abstract algebra, and many of the examples used linear algebra so I picked up some knowledge about it, but explanations couched in the language of linear algebra are very hard for me to follow.
 
The term "error bars" is unfamiliar to me, but I am guessing that it represents the difference between the observed value and the estimated value.
Not quite. Every measured or estimated quantity has some uncertainty associated with it. For instance, maybe as an a priori guess you estimate a standard deviation of \(\displaystyle \sigma = 3.0\) for the \(\displaystyle x_{1,t}\). Using one \(\displaystyle \sigma\) to represent the statistical spread, the datum can be written \(\displaystyle x_{1,t} \pm 3.0\).

The corresponding 1-sigma error bar drawn on the plotted point would extend from 3.0 below to 3.0 above the point. We expect the "true" value to lie within the error bars about 2/3 of the time.

The deviation of the datum from the model is different. Linear regression is a method that minimizes the sum of squares of the deviations. If the sum is weighted by \(\displaystyle 1/\sigma^2\), it may be taken as representing a sample from a \(\displaystyle \chi^2\) distribution:

\(\displaystyle \chi^2 = \sum_t\left[\left( x_{1,t} - (\alpha\ y_t + \beta + \gamma\ t)\right)/\sigma_t \right]^2\)

Not knowing \(\displaystyle \sigma\), you can adjust it so that \(\displaystyle \chi^2\) turns out to equal its expectation value, which is the number of degrees of freedom (11 in this case, from 14 data minus 3 fitted parameters).
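As a sketch of that adjustment (hypothetical stand-in data again): with equal sigmas, \(\displaystyle \chi^2 = \mathrm{RSS}/\sigma^2\), so setting \(\displaystyle \chi^2\) equal to its expectation of 11 amounts to taking \(\displaystyle \sigma = \sqrt{\mathrm{RSS}/11}\) from an unweighted fit:

```python
import numpy as np

# Hypothetical stand-in data.
rng = np.random.default_rng(3)
t = np.arange(1, 15, dtype=float)
y = rng.normal(150.0, 12.0, 14)
x1 = 0.9 * y - 40.0 - 1.5 * t + rng.normal(0.0, 3.0, 14)

# Unweighted fit of x1 = alpha*y + beta + gamma*t.
A = np.column_stack([y, np.ones_like(t), t])
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
rss = float(np.sum((x1 - A @ coef) ** 2))

# chi^2 = RSS/sigma^2; set it equal to its expectation, the number of
# degrees of freedom (14 data - 3 parameters = 11), and solve for sigma.
dof = len(t) - 3
sigma_hat = np.sqrt(rss / dof)
print(f"sigma that makes chi^2/d.f. = 1: {sigma_hat:.3f}")
```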

[I'm having fun trying to express all of this in understandable terms - and also learning a bit of LaTeX. How does one make a partial derivative symbol? We minimize \(\displaystyle \chi^2\) by making the partials w.r.t. \(\displaystyle \alpha, \beta, \gamma\) all zero.]
 
Dr. Phil

Got it. Error bars are not bars over a symbol but segments drawn through each plotted point, extending one standard deviation on either side, within which the true value is expected to fall about 2/3 of the time.

I have been thinking about your last post. I think it has given me an idea to work on.

Here is how I do partials: \(\displaystyle \dfrac{\partial z}{\partial x}\) - the command is \partial. There may be a more elegant way; I have been teaching myself LaTeX since I started posting here, so my LaTeX skills are minimal.

Also, I think the non-parametric test I have been groping for is the sign test. Researching that also.
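For what it is worth, the sign test on the 13-of-14 agreement reported earlier is one line with scipy (assuming scipy is available):

```python
from scipy.stats import binomtest

# Under the null hypothesis that the deviation signs of y and x_1 are
# unrelated, each of the 14 agreements is a fair coin flip.
result = binomtest(13, n=14, p=0.5, alternative="greater")
print(result.pvalue)   # about 0.0009 - 13 agreements are very unlikely by chance
```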

Thanks again for the discussion. I am learning some things, remembering some other things, and getting new ideas.
 