Howard Mark and Jerome Workman, Jr. continue their discussion of the derivation of the principal component algorithm using elementary algebra.
This column is a continuation of the set we have been working on to explain and derive the equations behind principal components (1–3). As we usually do, when we continue the discussion of a topic through more than one column, we continue the numbering of equations from where we left off.
We also repeat our note that while our basic approach avoids the use of matrices in the explanations, we include the matrix expressions corresponding to several of the results we derive. There are several reasons for this, which were given in the previous columns. It might not always be obvious that a given matrix expression is equivalent to the algebraic equations stated, and it might not always be easy to go from the algebra to the matrix notation; however, it is always possible to confirm the correspondence by performing the specified matrix operations and seeing that the algebraic expression is recovered. For some of the more complex expressions, we present a demonstration of the equivalency in an appendix of the column.
To recap: So far, we have presented some interesting and useful behaviors that data exhibit when we attempt to create a least squares representation of a data set. We showed that the total sum of squares of a set of data can be separated into several pieces. The key behavior, from our current perspective, is that the least-squares approximation to the data represented by a principal component loading, and the differences between the actual data and that approximation, contain variances (and sums of squares, which are the variances multiplied by n − 1) that together add up to the total variance (or total sum of squares) of the data set.
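As a compact reminder of that partition (a sketch in our own notation, not a reproduction of the numbered equations from the earlier columns, and written for mean-centered data with X̂ denoting the least-squares approximation), the relationship has the form:

```latex
% Hedged recap sketch (our notation, not the numbered equations of refs. 1-3):
% the total sum of squares splits into the part reproduced by the
% least-squares approximation \hat{X} and the part left in the differences.
\sum_{i} X_{i}^{2} \;=\; \sum_{i} \hat{X}_{i}^{2} \;+\; \sum_{i} \bigl(X_{i} - \hat{X}_{i}\bigr)^{2}
```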
We also showed that the mathematical attempt to create a function that is the least-squares fit to the data implicitly, but inevitably and naturally, gave rise to the expression of the result in terms of a variance–covariance matrix, thus confirming our introductory statement that principal components subsume the correlation structure of the data into themselves.
However, the previous approaches failed to produce the equations that would give rise to those functions, because they were all subject to trivial solutions: all the coefficients were zero in one case, and all infinite in the other.
Actually, it was arguably even worse than that. While the maximum variance would be achieved by allowing the coefficients to go to infinity, there was also an infinite number of finite solutions to equation 34; any combination of loading coefficients that met the conditions of equation 36 would satisfy equation 34.
So we are still searching for a solution to the problem of finding the function that maximizes the variance of the approximation to the data set [X], which, as we showed, is the function that is the least-squares estimator of the data array. We are almost there. What we need is a way to keep the answer from collapsing into a trivial solution.
Looking back at the previous developments, in Part II of this series (2), we made the statement:
"The values of the various Li comprising the vector [L] are normalized (or scaled) so that their standard deviation (and therefore their variance) equals unity."
Later in the same column, we stated:
"However, there is a constraint on the values of the Li. As we noted shortly after equation 15, the constraint is that the standard deviation of the Lj (and therefore their variance) is unity. Therefore, at least some of the Lj must be nonzero, and the trivial solution does not exist. We will pursue introducing the constraint in due course."
The time has come, the writer said, to introduce constraints. If equation 34 does in fact represent the solution to the original problem, then all we need to do is find a way to keep the solution from collapsing or exploding, and have it settle into a set of finite values for the coefficients of the loading. More specifically, we want to find the set of Li whose values are such that the sum of their squares is unity; this will prevent them from collapsing to zero or exploding to infinity.
In fact, mathematicians have developed a way of doing just that. There is a branch of mathematics called the calculus of variations, and it is this branch that concerns itself with finding functions that meet one or more specified conditions, such as we have here. We are interested in finding the function that is the least-squares estimator of the data set [X] (condition 1, the function to find), while having coefficients whose total variance is unity (condition 2, a constraint). The chief tool for solving problems of this nature is called a Lagrange multiplier. There is much written about Lagrange multipliers, both in books on mathematics (4,5) and mathematical statistics (6,7) (reference 6 was found to be particularly useful), and on the web; a Google search will turn up much useful information. After reading a few of the tutorial treatises you can find there (including Wikipedia), you actually will understand Lagrange multipliers!
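Stated abstractly (this is the standard textbook recipe, our paraphrase of sources such as references 4–7 rather than anything specific to our data), the method works as sketched below; the paragraphs that follow apply exactly this recipe to our problem.

```latex
% The general Lagrange-multiplier recipe (textbook form; f, g, x_j, and k are
% generic symbols, not quantities from this column): to extremize f subject
% to the constraint g = 0, extremize the augmented function F instead.
F(x_{1},\dots,x_{k},\lambda) \;=\; f(x_{1},\dots,x_{k}) \;+\; \lambda\, g(x_{1},\dots,x_{k})
\qquad\text{with}\qquad
\frac{\partial F}{\partial x_{j}} = 0 \;\;(j = 1,\dots,k),
\qquad
\frac{\partial F}{\partial \lambda} = 0
```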
What is a Lagrange multiplier? A Lagrange multiplier is a variable added to the function of interest in a way that introduces the constraint into the equation. Basically, the function is modified by adding a zero to it; this allows us to introduce the constraint without changing the value of the original function. Consider our problem: the function we want to maximize is the one expressed in equation 32. Note: subscripts on X indicate the row and column numbers of the original X matrix; subscripts on L indicate the column of the X matrix they are coefficients of, and the fact that they correspond to the first principal component:
to which we wish to introduce the constraint:
We begin by adding a zero to equation 32, but we do it in a particular way. We multiply the zero by an as-yet-undetermined constant; by convention, the Greek letter λ is used to represent the undetermined constant:
The constant λ is the Lagrange multiplier; we use the symbol λ because it is the conventional one. Because λ is multiplied by zero, it does not change the value of the equation, no matter what its value turns out to be. Now we need to substitute something for the zero. The "something" we substitute must itself equal zero, and it should represent the constraint. The constraint, as we saw, is represented by equation 37, but equation 37 is not zero. Therefore, we make it equal zero by the simple expedient of subtracting the left-hand side of equation 37 from both sides of that equation.
We can now substitute the left-hand side of equation 39 for the zero in equation 38:
It is now convenient to distribute the summation and to expand the term containing the Lagrange multiplier:
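Because the numbered equations themselves are not reproduced here, the sketch below shows the form this construction takes in the three-variable notation used throughout this series; the published equations 39–41 may differ in arrangement, but the content is the same. The constraint of equation 37 is rearranged so that one side is zero, multiplied by λ, and added to the quantity being maximized; the λ term then expands into separate terms in L1,1, L2,1, and L3,1 (and, likewise, the summation distributes over the expanded square, which is where the sums of squares and cross-products of the X columns appear).

```latex
% Hedged sketch (cf. equations 37-41), assuming the three-column data set [X]
% and the first-component loading coefficients L_{1,1}, L_{2,1}, L_{3,1}:
%
% the constraint, rearranged so that one side equals zero:
1 - L_{1,1}^{2} - L_{2,1}^{2} - L_{3,1}^{2} \;=\; 0
%
% the augmented quantity to be maximized:
\mathrm{SSA} \;=\; \sum_{i}\bigl(X_{i,1}L_{1,1} + X_{i,2}L_{2,1} + X_{i,3}L_{3,1}\bigr)^{2}
          \;+\; \lambda\bigl(1 - L_{1,1}^{2} - L_{2,1}^{2} - L_{3,1}^{2}\bigr)
%
% with the Lagrange term expanded:
\mathrm{SSA} \;=\; \sum_{i}\bigl(X_{i,1}L_{1,1} + X_{i,2}L_{2,1} + X_{i,3}L_{3,1}\bigr)^{2}
          \;+\; \lambda \;-\; \lambda L_{1,1}^{2} \;-\; \lambda L_{2,1}^{2} \;-\; \lambda L_{3,1}^{2}
```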
We are now ready to proceed in the usual way, to find the maximum value of SSA by taking derivatives and setting them equal to zero. Therefore, we take derivatives of SSA with respect to L1,1, L2,1, and L3,1 as we did before, and set those derivatives equal to zero. For simplicity of notation, from this point on, we leave off the row subscript of the X variables, and we understand that all summations are taken over the samples in the sample set. The derivatives are
which reduces to
Similarly, the other derivatives are
These can be rewritten as
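The numbered equations are again not reproduced here; under the same three-variable notation, a hedged sketch of the three rewritten derivatives is the following (compare the matrix form quoted in the note that follows):

```latex
% Hedged sketch of the three derivatives after simplification (cf. equation 44);
% the row subscript on X is dropped and every sum runs over the samples:
\frac{\partial\,\mathrm{SSA}}{\partial L_{1,1}}
  = 2\Bigl[\sum X_{1}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{1,1}\Bigr]
\frac{\partial\,\mathrm{SSA}}{\partial L_{2,1}}
  = 2\Bigl[\sum X_{2}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{2,1}\Bigr]
\frac{\partial\,\mathrm{SSA}}{\partial L_{3,1}}
  = 2\Bigl[\sum X_{3}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{3,1}\Bigr]
```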
(Note: the matrix expression for equation 44 is 2([X]ᵀ[X] – λ[1])[L]ᵀ. See Appendix C for a demonstration of this equivalency.)
Setting these derivatives equal to zero, we get
We can divide both sides of each of these three equations by 2:
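In the same sketch notation, the resulting system (cf. equation 46) is a set of three simultaneous equations in L1,1, L2,1, L3,1, and λ:

```latex
% Hedged sketch of the system that remains to be solved (cf. equation 46):
\sum X_{1}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{1,1} = 0
\sum X_{2}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{2,1} = 0
\sum X_{3}\bigl(X_{1}L_{1,1} + X_{2}L_{2,1} + X_{3}L_{3,1}\bigr) - \lambda L_{3,1} = 0
```

Equivalently, in the matrix notation of the note above, this system is ([X]ᵀ[X] – λ[1])[L]ᵀ = 0.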
At this point, we have set up the equations that must be solved. In the next installment, we will go about showing how to solve them.
Appendix C: Demonstration that 2([X]ᵀ[X] – λ[1])[L]ᵀ is equivalent to equation 44:
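The algebraic demonstration from the printed appendix is not reproduced here. As a stand-in, the short numerical sketch below (Python with NumPy; the variable names, the value of λ, and the random test data are our own assumptions, not part of the original column) checks that evaluating the three algebraic derivative expressions term by term gives the same numbers as the matrix expression 2([X]ᵀ[X] – λ[1])[L]ᵀ for an arbitrary small data set, loading, and λ.

```python
# Numerical sanity check (a sketch, not the original algebraic demonstration):
# verify that the element-by-element derivatives of SSA match the matrix
# expression 2([X]^T [X] - lambda*[1]) [L]^T for arbitrary X, L, and lambda.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))        # 10 hypothetical samples, 3 variables
L = rng.normal(size=3)              # loading coefficients L11, L21, L31
lam = 0.7                           # an arbitrary value for the multiplier

# Algebraic form: for each variable j,
#   dSSA/dL_j1 = 2 * [ sum_i X_ij * (X_i1*L11 + X_i2*L21 + X_i3*L31) - lam*L_j1 ]
scores = X @ L                      # the quantity inside the parentheses, per sample
algebraic = np.array(
    [2.0 * (np.sum(X[:, j] * scores) - lam * L[j]) for j in range(3)]
)

# Matrix form quoted in the column: 2 * ([X]^T [X] - lam*[1]) [L]^T
matrix_form = 2.0 * (X.T @ X - lam * np.eye(3)) @ L

print(algebraic)
print(matrix_form)
print(np.allclose(algebraic, matrix_form))   # True: the two forms agree
```

Running the script prints two identical vectors followed by True; any small X with three columns gives the same agreement, which is the point the algebraic demonstration makes in general.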
Jerome Workman, Jr. serves on the Editorial Advisory Board of Spectroscopy and is director of research and technology for the Molecular Spectroscopy & Microanalysis division of Thermo Fisher Scientific. He can be reached by e-mail at: jerry.workman@thermo.com
Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics (Suffern, NY). He can be reached via e-mail: hlmark@prodigy.net
(1) H. Mark and J. Workman, Spectroscopy 22(9), 20–29 (2007).
(2) H. Mark and J. Workman, Spectroscopy 23(2), 30–37 (2008).
(3) H. Mark and J. Workman, Spectroscopy 23(5), 14–17 (2008).
(4) D. Zwillinger, CRC Standard Mathematical Tables and Formulae (Chapman & Hall/CRC Press, New York, 2003).
(5) A.E. Taylor, Advanced Calculus (Ginn and Company, New York, 1955).
(6) D.F. Morrison, Multivariate Statistical Methods (McGraw-Hill, New York, 1976).
(7) T.W. Anderson, An Introduction to Multivariate Statistical Analysis, 1st ed. (John Wiley & Sons, New York, 1958).