A tutorial on Principal Components Analysis
Introduction
This tutorial is designed to give the reader an understanding of Principal Components
Analysis (PCA). PCA is a useful statistical technique that has found application in
fields such as face recognition and image compression, and is a common technique for
finding patterns in data of high dimension.
Before getting to a description of PCA, this tutorial first introduces mathematical
concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors
and eigenvalues. This background knowledge is meant to make the PCA section
very straightforward, but can be skipped if the concepts are already familiar.
There are examples all the way through this tutorial that are meant to illustrate the
concepts being discussed. If further information is required, the mathematics textbook
“Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc,
ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.
Background Mathematics
This section will attempt to give some elementary background mathematical skills that
will be required to understand the process of Principal Components Analysis. The
topics are covered independently of each other, and examples are given. It is less important
to remember the exact mechanics of a mathematical technique than it is to understand
the reason why such a technique may be used, and what the result of the operation tells
us about our data. Not all of these techniques are used in PCA, but the ones that are not
explicitly required do provide the grounding on which the most important techniques
are based.
I have included a section on Statistics which looks at distribution measurements,
or how the data is spread out. The other section is on Matrix Algebra and looks at
eigenvectors and eigenvalues, important properties of matrices that are fundamental to
PCA.
Statistics
The entire subject of statistics is based around the idea that you have this big set of data,
and you want to analyse that set in terms of the relationships between the individual
points in that data set. I am going to look at a few of the measures you can do on a set
of data, and what they tell you about the data itself.
Standard Deviation
To understand standard deviation, we need a data set. Statisticians are usually concerned
with taking a sample of a population. To use election polls as an example, the
population is all the people in the country, whereas a sample is a subset of the population
that the statisticians measure. The great thing about statistics is that by only
measuring (in this case by doing a phone survey or similar) a sample of the population,
you can work out what the measurement would most likely be if you had used the entire population.
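As a rough illustration, here is a minimal Python sketch of the sample standard deviation, using only the standard library. The function name and the height values are my own invented examples, not taken from the tutorial.

import math

def sample_std_dev(data):
    # Mean of the sample.
    n = len(data)
    mean = sum(data) / n
    # Divide by (n - 1) rather than n because this is a sample of a
    # population, not the whole population.
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return math.sqrt(variance)

heights = [172, 168, 181, 175, 169]  # hypothetical sample (cm)
print(sample_std_dev(heights))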
Covariance
The measures we have looked at so far are purely 1-dimensional. Data sets like this
could be: heights of all the people in the room, marks for the last COMP101 exam, etc.
However many data sets have more than one dimension, and the aim of the statistical
analysis of these data sets is usually to see if there is any relationship between the
dimensions. For example, we might have as our data set both the height of all the
students in a class, and the mark they received for that paper. We could then perform
statistical analysis to see if the height of a student has any effect on their mark.
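To make the idea concrete, here is a minimal Python sketch of the sample covariance between two such dimensions. The height and mark values are invented for illustration; a positive result would suggest the two dimensions tend to increase together, a negative one that they move in opposite directions.

def covariance(x, y):
    # Average the products of each pair's deviations from its mean,
    # dividing by (n - 1) as for the sample variance.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y)
               for xi, yi in zip(x, y)) / (n - 1)

heights = [172, 168, 181, 175, 169]  # hypothetical heights (cm)
marks = [65, 72, 58, 80, 61]         # hypothetical exam marks
print(covariance(heights, marks))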
Eigenvalues
Eigenvalues are closely related to eigenvectors; in fact, we saw an eigenvalue in the
previous figure. Notice how, in both of those examples, the amount by which the original
vector was scaled after multiplication by the square matrix was the same? In that example,
the value was 4, and 4 is the eigenvalue associated with that eigenvector. No matter what
multiple of the eigenvector we took before we multiplied it by the square matrix, we
would always get 4 times the scaled vector as our result (as in Figure 2.3).
So you can see that eigenvectors and eigenvalues always come in pairs. When you
get a fancy programming library to calculate your eigenvectors for you, you usually get
the eigenvalues as well.
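NumPy is one such library: np.linalg.eig returns the eigenvalues and their paired eigenvectors from a single call. The matrix below is my own choice, picked so that one of its eigenvalues is 4 as in the example discussed above; it is not necessarily the matrix from the figure.

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])  # a square matrix with eigenvalue 4

# eig returns the eigenvalues and a matrix whose columns are the
# corresponding eigenvectors; each pair shares the same index.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)         # e.g. [ 4. -1.]
print(eigenvectors[:, 0])  # the eigenvector paired with eigenvalues[0]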
Principal Components Analysis
Finally we come to Principal Components Analysis (PCA). What is it? It is a way
of identifying patterns in data, and expressing the data in such a way as to highlight
their similarities and differences. Since patterns can be hard to find in data of
high dimension, where the luxury of graphical representation is not available, PCA is
a powerful tool for analysing data.
The other main advantage of PCA is that once you have found these patterns in the
data, you can compress the data, i.e. reduce the number of dimensions, without
much loss of information. This technique is used in image compression, as we will see
in a later section.
This chapter will take you through the steps you need to perform a Principal
Components Analysis on a set of data. I am not going to describe exactly why the
technique works, but I will try to provide an explanation of what is happening at each
point so that you can make informed decisions when you try to use this technique
yourself.
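As a preview, the usual recipe (subtract the mean, form the covariance matrix, take its eigenvectors, project onto the largest ones) can be sketched in a few lines of Python with NumPy. This is my own minimal reconstruction of those steps, not the tutorial's worked example, and the data values are invented.

import numpy as np

def pca(data, n_components):
    # Step 1: subtract the mean of each dimension.
    centred = data - data.mean(axis=0)
    # Step 2: calculate the covariance matrix (rows are observations).
    cov = np.cov(centred, rowvar=False)
    # Step 3: find its eigenvalues and eigenvectors (eigh suits
    # symmetric matrices and returns eigenvalues in ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: order components by decreasing eigenvalue and keep the
    # first n_components: these are the principal components.
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:n_components]]
    # Step 5: project the mean-centred data onto those components.
    return centred @ components

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                 [1.9, 2.2], [3.1, 3.0]])  # hypothetical 2-D data
print(pca(data, 1))  # the data re-expressed along its main axis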