Proof of an obvious change in variance

HCFandLCM

New member
Joined
Aug 9, 2018
Messages
2
Consider a set of n numbers, which are all distinct. Suppose both the max and the min numbers are moved from the set, and then the median of the original set is added to the resulting set, so that there are now n-1 distinct numbers in total. It is obvious that the variance decreases. But how it the result proved?
 
I'd try a proof by weak mathematical induction. You may have to do separate proofs for odd and even n.
 
I'd try a proof by weak mathematical induction. You may have to do separate proofs for odd and even n.

I have attempted to prove the result using direct expansion and imposing extra the extra assumption that the mean of the original set is 0 (by virtue of the translational invariance of the variance). Ultimately, I arrived at an expression that looks terrible...

If you were to use induction, how would you relate a set of n data and a set of n-1 data?
 
First, I have not really attempted a proof. I was responding to what appeared to be a request for ideas.

Second, your proposition is almost trivial for the case of 2 * 1 because in that case the median and the mean are equal. In the case of 2 * 1 + 1, the median equals the middle observation. In both those cases, the revised variance is zero and so less than the original variance, which under your assumptions is necessarily positive. (I suspect your assumption that each observation is distinct can be relaxed if you try to prove that the revised variance is less than or equal to the original variance.) Thus achieving the base case for an induction proof is easy.

I admittedly did not explore in any depth at all the 2(k + 1) and 2(k + 1) + 1 cases except to note that the absolute value of the difference between the median and the original mean must be less than the absolute value of the differences between the original mean and the two extremes under your assumptions. You do not have to find the exact change in the variance, merely show that the revised variance is less than the original variance. (This point may let you avoid a proof by induction by being the lynchpin of a direct proof.)

I have a lot to do this week and probably do not have time to do a proof, but I shall give this some thought over the next three days.

EDIT: Even if the original mean is zero, the revised mean may not be zero so that does not help much. The key it seems to me will be to show that the difference between the revised mean and the original median has an absolute value less than or equal to the absolute values of the differences between the original means and the original extremes.
 
Last edited:
I had little time today to deal with this problem, but I hope it is clear that the problem does not exist if n = 1 and that the proof is completely trivial if n = 2. So we go on to the cases where n > 2.

Let's get some basics out of the way first.

We order the data from lowest to highest so that

\(\displaystyle x_j \le x_k \text { for any pair of integers } j \text { and } k \text { such that } 1 \le j < k \le n.\)

Now if we assume that \(\displaystyle x_1 = x_n\),

then every number in the original data set is equal to every other number in the data set. Consequently, the mean and median of the original data set are also equal to every number in the original data set, and so the variance of the original data set is zero because the difference between the mean and every number is zero. Furthermore, the revised data set also has every number the same and so its variance is also zero.

\(\displaystyle \therefore x_1 = x_n \implies \text { the variances of the original and revised data sets both } = 0.\)

So to get to the interesting problem we assume

\(\displaystyle n > 2 \text { and } x_1 < x_n.\)

You appear to have beeen correct (and I was wrong) ithat assuming the mean of the original data to equal zero simplifies things without any loss of generality. So let's make that assumption as well.

\(\displaystyle \displaystyle \sum_{j=1}^n x_j = 0 = x_1 + x_n + \sum_{j=2}^{n-1} x_j.\)

There are some obvious consequences to this assumption:

\(\displaystyle x_1 < 0 \text { and } 0 < x_n.\)

\(\displaystyle \text {Let } a = \text { the median of the original data set.}\)

Thus, there are three cases to consider, namely:

\(\displaystyle x_1 < a < 0 < x_n,\ x_1 < a = 0 < x_n, \text { and } x_1 < 0 < a < x_n.\)

If I have time, I shall explore the first case tomorrow.
 
Let's recapitulate where we left off yesterday. We start by defining the original data set, its median, mean, and variance.

\(\displaystyle \text {The original data set } = \{x_1,\ ... \ x_n \} \text { such that }\)

\(\displaystyle n > 2,\ x_j \le x_k \text { if } 1 \le j < k \le n,\ x_1 < x_n, \text { and } \displaystyle \sum_{j=1}^n x_j = 0.\)

\(\displaystyle \text {Let } a = \text { the median of the original data set.}\)

\(\displaystyle \text {Let } b = \text { the arithmetic mean of the original data set.}\)

\(\displaystyle \text {Let } c = \text { the variance of the original data set.}\)

\(\displaystyle \text {Let } d = \displaystyle \sum_{j = 2}^{n-1} x_j.\)

\(\displaystyle \text {Let } e = \displaystyle \sqrt{\sum_{j = 2}^{n-1} x_j^2} \implies e^2 = \sum_{j=2}^{n-1} x_j^2.\)

\(\displaystyle b = 0\ \because \ b = \dfrac{1}{n} * \displaystyle \left ( \sum_{j=1}^n x_j \right ) \text { and }
\sum_{j=1}^n x_j = 0.\)

\(\displaystyle x_1 < a < x_n \ \text { and } x_1 < b < x_2 \ \because \ x_1 < x_n.\)

\(\displaystyle \therefore x_1 < 0 < x_n \ \because \ b = 0.\)

\(\displaystyle \displaystyle b = \dfrac{1}{n} * \left ( \sum_{j=1}^n x_j \right ) = \dfrac{1}{n} * \left \{ x_1 + \left ( \sum_{j=2}^{n-1} \right ) + x_n \right \} = \dfrac{1}{n} * (x_1 + d + x_n).\)

\(\displaystyle \therefore x_1 + d + + x_n = bn = 0 \ \because \ b = 0.\)

\(\displaystyle \therefore d = -\ (x_1 + x_n) \implies d^2 = x_1^2 + 2x_1x_n + x_n^2.\)

We use the fact that the variance = the mean of the squares - the square of the mean.

\(\displaystyle \text {The mean of the squares of the original data set } =\)

\(\displaystyle \displaystyle \dfrac{1}{n} \left ( \sum_{j=1}^n x_j^2 \right ) = \dfrac{1}{n} * \left \{ x_1^2 + \left ( \sum_{j=2}^{n-1} x_j^2 \right ) + x_n^2 \right \} = \dfrac{x_1^2 + e^2 + x_n^2}{n}.\)

\(\displaystyle \therefore c = \dfrac{x_1^2 + e^2 + x_n^2}{n} - b^2 = \dfrac{x_1^2 + e^2 + x_n^2}{n} \ \because \ b = 0.\)

Now let's define the revised data set.

\(\displaystyle \text {The revised data set } = \{a,\ x_2\ ... \ x_{n-1}\}.\)

\(\displaystyle \text {Its mean} = p = \displaystyle \dfrac{1}{n - 1} * \left ( a + \left \{ \sum_{j=2}^{n-1} x_j \right \} \right ) = \dfrac{a + d}{n - 1} \implies p^2 = \dfrac{a^2 + 2ad + d^2}{(n - 1)^2}.\)

\(\displaystyle \text {Its variance } = q = \displaystyle \dfrac{1}{n - 1} * \left ( a^2 + \left \{ \sum_{j=2}^{n-1} + x_j^2 \right \} \right ) - p^2 = \dfrac{a^2 + e^2}{n - 1} - \dfrac{a^2 + 2ad + d^2}{(n - 1)^2}.\)

We want to show that c > q. There are potentially 3 cases to consider a < b, a = b, a > b.

This is all I have time for right now, but at least we have things set up to proceed.
 
Last edited:
For several weeks, I have been working on and off on this proof about the effect on the variance of changing a data set by eliminating the extremes and adding the median as an independent element (given that the mean is zero). I think I now have one. But I have been looking at it for so long that I doubt I could see a flaw in it. An independent eye would be helpful. GIVEN:

\(\displaystyle n \in \mathbb Z,\ n \ge 3,\ 1 \le i \ \le n \implies x_i \in \mathbb R,\ x_1 < x_n, \)

\(\displaystyle 1 \le j < k \le n \implies x_j \le x_k,\ \text { and } \displaystyle \sum_{i=1}^n x_i = 0.\)

\(\displaystyle \text {Set I: } = \{x_1,\ ...\ x_n\}, \text { and } a,\ b,\ \text {and } c\)

\(\displaystyle \text { are the median, mean, and variance of Set I respectively.}\)

\(\displaystyle \text {Set II: } = \{a,\ x_2,\ ...\ x_{n-1}\}, \text { and } p \text { and } q\)

\(\displaystyle \text { are the mean and variance of Set II respectively.}\)

Prove that \(\displaystyle c \ge q.\)

Note that if n = 2, the proof of the equivalent proposition is trivial.

\(\displaystyle n \ge 3 \implies n * n = n^2 \ge 3n \implies n^2 + 1 > 3n = 2n + n \implies\)

\(\displaystyle n^2 - 2n + 1 > n > n - 1 \implies (n - 1)^2 > n > n - 1 \implies \dfrac{1}{(n - 1)^2} < \dfrac{1}{n} < \dfrac{1}{n - 1}.\)

\(\displaystyle b = \dfrac{1}{n} * \displaystyle \sum_{i=1}^n x_i = \dfrac{1}{n} * 0 = 0.\)

\(\displaystyle \displaystyle \sum_{i=1}^n = 0, x_1 < x_n, \text { and } x_j \le x_k \text { for } 1 \le j < k \le n \implies\)

\(\displaystyle x_1 < 0 < x_n.\)

\(\displaystyle \displaystyle d = \sum_{i = 2}^{n-1} x_i.\)

\(\displaystyle \displaystyle \therefore 0 = \sum_{i=1}^n x_i = x_1 + \left ( \sum_{i=2}^{n-1} x_i \right ) + x_n \implies\)

\(\displaystyle x_1 + d + x_2 = 0 \implies d = -\ (x_1 + x_n).\)

\(\displaystyle \displaystyle e = \sum_{i=2}^{n-1} x_i^2 \ge 0.\)

\(\displaystyle \therefore c = \dfrac{x_1^2 + e + x_n^2}{n} - b^2 = \dfrac{x_1^2 + e + x_n^2}{n} \implies\)

\(\displaystyle c \ge \dfrac{nx_1^2 + ne + nx_n^2}{(n - 1)^2} \ \because \ \dfrac{1}{(n - 1)^2} < \dfrac{1}{n} \implies \dfrac{n}{(n - 1)^2} < 1.\)

\(\displaystyle p = \dfrac{a + d}{n - 1} \implies p^2 = \dfrac{a^2 + 2ad + d^2}{(n - 1)^2}.\)

\(\displaystyle \dfrac{1}{(n - 1)^2} < \dfrac{1}{n - 1} \text { and } e \ge 0 \implies \dfrac{a^2 + e}{(n - 1)^2} \le \dfrac{a^2 + e}{n - 1} \implies \dfrac{a^2 + e}{(n - 1)^2} - p^2 \le \dfrac{a^2 + e}{n - 1} - p^2 = q.\)

\(\displaystyle \therefore -\ q \ge p^2 - \dfrac{a^2 + e}{(n - 1)^2} = \dfrac{a^2 + 2ad + d^2 - a^2 - e}{(n - 1)^2} = \dfrac{d^2 + 2ad - e}{(n - 1)^2}.\)

\(\displaystyle \therefore c - q \ge \dfrac{nx_1^2 + ne + nx_n^2 + d^2 + 2ad - e}{(n - 1)^2} = \dfrac{(n - 2)(x_1^2 + x_n^2) + (n - 1)e + d^2 + 2(x_1^2 - ax_1 - ax_n + x_n^2)}{(n - 1)^2}\).

Now every term in the fraction is non-negative except possibly \(\displaystyle x_1^2 - ax_1 - ax_n + x_n^2.\)

And that term is obviously non-negative if \(\displaystyle a = 0.\)

Futhermore \(\displaystyle x_1 < 0,\ x_n > 0, \ \text { and } x_1 \le a \le x_n.\)

\(\displaystyle \text {Case A: } x_1 \le a < 0 < x_n.\)

\(\displaystyle -\ ax_n + x_n^2 > 0 \ \because -\ a > 0 < x_n.\)

\(\displaystyle x_1 \le a < 0 \implies x_1^2 \ge ax_1 \implies x_1^2 - ax_1 \ge 0.\)

\(\displaystyle \therefore (x_1^2 - ax_1 - ax_n + x_n) > 0.\)

\(\displaystyle \text {Case B: } x_1 < 0 < a \le x_n.\)

\(\displaystyle -\ ax_1+ x_1^2 > 0 \ \because -\ a < 0 > x_1.\)

\(\displaystyle 0 < a \le x_n \implies ax_n \le x_n^2 \implies 0 \le x_n^2 - ax_n.\)

\(\displaystyle \therefore (x_1^2 - ax_1 - ax_n + x_n) > 0.\)

Therefore the fraction is non-negative.

\(\displaystyle \therefore c - q \ge 0 \implies\)

\(\displaystyle c \ge q \text { Q.E.D.}\)

So where did I mess up? And, if I did not, I did not find the proof to be at all obvious as alleged by the OP.
 
Last edited:
Top