User:Michael Hardy/Matrix spectral decompositions in statistics

In statistics, a number of theoretical results are usually presented at a fairly elementary level at which they cannot be proved without resort to somewhat cumbersome arguments. They can, however, be proved quickly and conveniently by using spectral decompositions of real symmetric matrices. That such a decomposition always exists is the content of the spectral theorem of linear algebra. Since the most elementary accounts of statistics do not presuppose any familiarity with linear algebra, these results are often stated there without proof.

Certain chi-square distributions

The chi-square distribution is the probability distribution of the sum of squares of several independent random variables each of which is normally distributed with expected value 0 and variance 1. Thus, suppose

\[ X_1, \ldots, X_n \]

are such independent normally distributed random variables with expected value 0 and variance 1. Then

\[ X_1^2 + \cdots + X_n^2 \]

has a chi-square distribution with n degrees of freedom. A corollary is that if

\[ X_1, \ldots, X_n \]

are independent normally distributed random variables with expected value μ and variance σ², then

\[ \sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 \qquad (1) \]

also has a chi-square distribution with n degrees of freedom. Now consider the "sample mean"

\[ \overline{X} = \frac{X_1 + \cdots + X_n}{n}. \]

If one puts the sample mean in place of the "population mean" μ in (1) above, one gets

\[ \sum_{i=1}^n \left( \frac{X_i - \overline{X}}{\sigma} \right)^2. \qquad (2) \]

One finds it asserted in many elementary texts[citation needed] that the random variable (2) has a chi-square distribution with n − 1 degrees of freedom. Why that should be so may be something of a mystery when one considers that

  • The random variables

\[ \frac{X_i - \overline{X}}{\sigma}, \qquad i = 1, \ldots, n, \qquad (3) \]

although normally distributed, cannot be independent (since their sum must be zero);
  • Those random variables do not have variance 1, but rather

\[ \frac{n-1}{n} \]

(as will be explained below);
  • There are not n − 1 of them, but rather n of them.

The fact that the variance of (3) is (n − 1)/n can be seen by writing it as

\[ \frac{X_i - \overline{X}}{\sigma} = \left(1 - \frac{1}{n}\right) \frac{X_i}{\sigma} - \frac{1}{n} \sum_{j \ne i} \frac{X_j}{\sigma} \]

and then using elementary properties of the variance: the n terms on the right are independent, the first has variance (1 − 1/n)², each of the remaining n − 1 has variance 1/n², and these add up to (n − 1)/n.
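Since (3) is easy to simulate, a minimal numpy check (an illustrative addition, not part of the original argument; the values of n, μ, σ and the replication count are arbitrary) can corroborate the value (n − 1)/n:

```python
import numpy as np

# Monte Carlo check that Var((X_i - Xbar)/sigma) = (n - 1)/n.
rng = np.random.default_rng(0)
n, sigma, reps = 5, 2.0, 200_000

X = rng.normal(loc=3.0, scale=sigma, size=(reps, n))
centered = (X - X.mean(axis=1, keepdims=True)) / sigma

print(centered[:, 0].var())  # empirical variance, ~ (n - 1)/n = 0.8
print((n - 1) / n)           # exact value
```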

To resolve the mystery, one begins by thinking about the operation of subtracting the sample mean from each observation:

\[ \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \mapsto \begin{bmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{bmatrix}. \]

This is a linear transformation. In fact it is a projection, i.e. an idempotent linear transformation: if one subtracts the mean from each of the scalar components of this vector, getting a new vector, and then applies the same operation to the new vector, one gets that same new vector back. It projects the n-dimensional space onto the (n − 1)-dimensional subspace whose equation is x1 + ... + xn = 0.

The matrix P of this transformation can be seen to be symmetric by writing it out:

\[ P = \begin{bmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n} \end{bmatrix}. \]

Alternatively, one can see that the matrix is symmetric by observing that the vector (1, 1, 1, ..., 1)T that gets mapped to 0 is orthogonal to every vector in the image space x1 + ... + xn = 0; thus the mapping is an orthogonal projection. The matrices of orthogonal projections are precisely the symmetric idempotent matrices; hence this matrix is symmetric.
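These properties lend themselves to a quick numerical check. The sketch below (an illustrative addition; the choice n = 5 is arbitrary) builds the matrix and confirms that it is a symmetric, idempotent projection of rank n − 1:

```python
import numpy as np

# Build the n x n centering matrix and verify the claims made above.
n = 5
P = np.eye(n) - np.ones((n, n)) / n

print(np.allclose(P, P.T))                  # symmetric
print(np.allclose(P @ P, P))                # idempotent
print(np.linalg.matrix_rank(P))             # rank n - 1 = 4
print(np.round(np.linalg.eigvalsh(P), 12))  # eigenvalues: one 0 and n - 1 ones
```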

Therefore

  • P is an n × n orthogonal projection matrix of rank n − 1.

Now we apply the spectral theorem to conclude that there is an orthogonal matrix G that rotates the space so that

\[ G P G^T = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & 0 \end{bmatrix}, \]

with n − 1 ones on the main diagonal, a single 0 in the last diagonal position, and 0 in every off-diagonal position.

Now let

\[ X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \]

and

\[ U = P X = \begin{bmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{bmatrix}. \]

The probability distribution of X is a multivariate normal distribution with expected value

\[ \mathrm{E}(X) = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} \]

and variance

\[ \operatorname{var}(X) = \sigma^2 I_n. \]

Consequently the probability distribution of U = PX is multivariate normal with expected value

\[ \mathrm{E}(U) = P\,\mathrm{E}(X) = \mu P \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = 0 \]

and variance

\[ \operatorname{var}(U) = P (\sigma^2 I_n) P^T = \sigma^2 P P^T = \sigma^2 P \]

(we have used the fact that P is symmetric and idempotent).

Since G is orthogonal, the rotated vector GU is multivariate normal with expected value 0 and variance G(σ²P)G^T = σ²GPG^T, the diagonal matrix displayed above. Its first n − 1 components are therefore independent normal random variables with expected value 0 and variance σ², and its last component is 0. Because rotation preserves sums of squares, the quantity (2) equals the sum of squares of the first n − 1 components of GU/σ, and so has a chi-square distribution with n − 1 degrees of freedom.
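A Monte Carlo sketch (again an illustrative addition; n, μ, σ and the replication count are arbitrary choices) can corroborate that (2) has a chi-square distribution with n − 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

# Simulate sum((X_i - Xbar)^2) / sigma^2 and compare with chi-square(n - 1).
rng = np.random.default_rng(1)
n, mu, sigma, reps = 6, 10.0, 3.0, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
stat = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / sigma**2

print(stat.mean(), n - 1)        # chi-square(n - 1) has mean n - 1
print(stat.var(), 2 * (n - 1))   # and variance 2(n - 1)
# Kolmogorov-Smirnov test against the chi-square(n - 1) cdf; the p-value
# should not be systematically small.
print(stats.kstest(stat, stats.chi2(n - 1).cdf))
```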

Confidence intervals based on Student's t-distribution

One such elementary result is as follows. Suppose

\[ X_1, \ldots, X_n \]

are the observations in a random sample from a normally distributed population with population mean μ and population standard deviation σ. It is desired to find a confidence interval for μ.

Let

\[ \overline{X} = \frac{X_1 + \cdots + X_n}{n} \]

be the sample mean and let

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 \]

be the sample variance. It is often asserted in elementary accounts that the random variable

\[ \frac{\overline{X} - \mu}{S / \sqrt{n}} \]

has a Student's t-distribution with n − 1 degrees of freedom. Consequently the interval whose endpoints are

\[ \overline{X} \pm A \, \frac{S}{\sqrt{n}}, \]

where A is a suitable percentage point of Student's t-distribution with n − 1 degrees of freedom, is a confidence interval for μ.

That is the practical result desired. But the proof, which uses the spectral theorem, is not given in accounts whose readers are not assumed to be familiar with linear algebra at that level.
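For concreteness, here is a minimal sketch of the recipe just described, assuming a 95% confidence level; the data values are made up for illustration:

```python
import numpy as np
from scipy import stats

# Compute the t-based confidence interval Xbar +/- A * S / sqrt(n).
x = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2])
n = len(x)

xbar = x.mean()
s = x.std(ddof=1)                 # sample standard deviation (n - 1 denominator)
A = stats.t.ppf(0.975, df=n - 1)  # percentage point of Student's t

lo, hi = xbar - A * s / np.sqrt(n), xbar + A * s / np.sqrt(n)
print((lo, hi))                   # 95% confidence interval for mu
```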

Student's distribution and the chi-square distribution

Student's t-distribution (so called because its discoverer, William Sealy Gosset, wrote under the pseudonym "Student") with k degrees of freedom can be characterized as the probability distribution of the random variable

\[ T = \frac{Z}{\sqrt{\chi_k^2 / k}}, \]

where Z is normally distributed with expected value 0 and standard deviation 1, and χ²k is independent of Z and has a chi-square distribution with k degrees of freedom.

The chi-square distribution with k degrees of freedom is the distribution of the sum

\[ Z_1^2 + \cdots + Z_k^2, \]

where Z1, ..., Zk are independent random variables, each normally distributed with expected value 0 and standard deviation 1.
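This characterization is easy to probe by simulation. The sketch below (an illustrative addition; k and the replication count are arbitrary) draws Z and an independent chi-square variable and compares the resulting ratio with Student's t-distribution:

```python
import numpy as np
from scipy import stats

# Form Z / sqrt(chi2/k) from independent draws and compare with t(k).
rng = np.random.default_rng(2)
k, reps = 4, 100_000

Z = rng.normal(size=reps)
chi2 = rng.chisquare(df=k, size=reps)
T = Z / np.sqrt(chi2 / k)

# The p-value should not be systematically small: consistent with t(k).
print(stats.kstest(T, stats.t(df=k).cdf))
```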

The problem

Why should the random variable

\[ \frac{\overline{X} - \mu}{S / \sqrt{n}} \qquad (4) \]

have the same distribution as

\[ \frac{Z}{\sqrt{\chi_k^2 / k}}, \qquad (5) \]

where k = n − 1?

We must overcome several apparent objections to the conclusion we hope to prove:

  • Although the numerator in (4) is normally distributed with expected value 0, it does not have standard deviation 1.
  • The random variable

\[ \sum_{i=1}^n \left( X_i - \overline{X} \right)^2, \]

which enters (4) through S², is the sum of squares of random variables, each of which is normally distributed with expected value 0, but
    • there are not n − 1 of them, but n; and
    • they are not independent (notice in particular that

\[ \sum_{i=1}^n \left( X_i - \overline{X} \right) = 0 \]

regardless of the values of X1, ..., Xn, and that clearly precludes independence); and
    • the standard deviation of each of them is not 1. If one divides each of them by σ, the standard deviation of the quotient is also not 1, but in fact less than 1. To see that, consider that the standard score

\[ \frac{X_i - \mu}{\sigma} \]

has standard deviation 1, and substituting X̄ for μ makes the standard deviation smaller.
  • It may be unclear why the numerator and denominator in (4) must be independent. After all, both are functions of the same list of n observations X1, ..., Xn.

The very last of these objections may be answered without resorting to the spectral theorem. But all of them, including the last, can be answered by means of the spectral theorem. The solution will amount to rewriting the vector

\[ X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \]

in a different coordinate system.
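Before doing so, the last objection can be illustrated numerically: for samples from a normal distribution the sample mean and sample variance are in fact independent, so in simulation their correlation should be near zero. The sketch below (an illustrative addition, and a check rather than a proof; n and the replication count are arbitrary) does exactly that:

```python
import numpy as np

# Correlation between the sample mean and the sample variance across
# many simulated normal samples; independence implies it is zero.
rng = np.random.default_rng(3)
n, reps = 8, 100_000

X = rng.normal(loc=2.0, scale=1.5, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)

print(np.corrcoef(xbar, s2)[0, 1])  # ~0 (uncorrelated; in fact independent)
```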

Spectral decompositions

The spectral theorem tells us that any real symmetric matrix can be diagonalized by an orthogonal matrix.

We will apply that to the n × n projection matrices P = Pn and Q = Qn defined by saying that every entry in P is 1/n and Q = I − P, i.e. Q is the n × n identity matrix minus P. Notice that

\[ P^2 = P = P^T \qquad \text{and} \qquad Q^2 = Q = Q^T. \]

Also notice that P and Q are complementary orthogonal projection matrices, i.e.

\[ P Q = Q P = 0 \qquad \text{and} \qquad P + Q = I. \]

For any vector X, the vector PX is the orthogonal projection of X onto the space spanned by the column vector J in which every entry is 1, and QX is the projection onto the (n − 1)-dimensional orthogonal complement of that space.
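These claims about P and Q lend themselves to a quick numerical sketch (an illustrative addition; n = 5 is an arbitrary choice):

```python
import numpy as np

# Verify that P and Q are complementary orthogonal projections and that
# PX averages the entries while QX subtracts the mean from each entry.
n = 5
P = np.ones((n, n)) / n
Q = np.eye(n) - P
J = np.ones(n)

print(np.allclose(P @ P, P), np.allclose(Q @ Q, Q))  # both idempotent
print(np.allclose(P @ Q, np.zeros((n, n))))          # complementary: PQ = 0
x = np.arange(1.0, n + 1)                            # any vector
print(np.allclose(P @ x, x.mean() * J))              # PX = mean in every entry
print(np.allclose(Q @ x, x - x.mean()))              # QX subtracts the mean
```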

Let G be an orthogonal matrix such that

\[ G Q G^T = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & 0 \end{bmatrix}, \]

with n − 1 ones on the main diagonal, a single 0 in the last diagonal position, and 0 in every off-diagonal position; the spectral theorem guarantees that such a G exists.