19-10-2012, 05:08 PM
Robust Face Recognition via Sparse Representation
Robust Face Recognition.pdf (Size: 3.03 MB / Downloads: 27)
Abstract
We consider the problem of automatically recognizing human faces from frontal views with varying expression and
illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression
models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse
representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This
new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For
feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical.
What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly
computed. Unconventional features such as downsampled images and random projections perform just as well as conventional
features such as Eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses a certain threshold predicted
by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the
fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how
much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We
conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the
above claims.
INTRODUCTION
Parsimony has a rich history as a guiding principle for
inference. One of its most celebrated instantiations, the
principle of minimum description length in model selection
[1], [2], stipulates that within a hierarchy of model classes,
the model that yields the most compact representation
should be preferred for decision-making tasks such as
classification. A related, but simpler, measure of parsimony
in high-dimensional data processing seeks models that
depend on only a few of the observations, selecting a small
subset of features for classification or visualization (e.g.,
Sparse PCA [3], [4] among others). Such sparse feature
selection methods are, in a sense, dual to the support vector
machine (SVM) approach in [5] and [6], which instead
selects a small subset of relevant training examples to
characterize the decision boundary between classes. While
these works comprise only a small fraction of the literature
on parsimony for inference, they do serve to illustrate a
common theme:
CLASSIFICATION BASED ON SPARSE
REPRESENTATION
A basic problem in object recognition is to use labeled
training samples from k distinct object classes to correctly
determine the class to which a new test sample belongs. We
arrange the given n_i training samples from the ith class as
columns of a matrix A_i = [v_{i,1}, v_{i,2}, ..., v_{i,n_i}] ∈ R^(m×n_i). In the
context of face recognition, we will identify a w × h grayscale
image with the vector v ∈ R^m (m = wh) given by
stacking its columns; the columns of A_i are then the training
face images of the ith subject.
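This stacking convention can be sketched in a few lines of NumPy (an illustrative sketch; the helper names and array shapes are ours, not the paper's):

```python
import numpy as np

def stack_image(img):
    """Identify a grayscale image with a vector v in R^m (m = w*h)
    by stacking its columns (column-major / Fortran order)."""
    return img.flatten(order="F")

def class_matrix(images):
    """Arrange the n_i training images of the ith subject as the
    columns of the m x n_i matrix A_i."""
    return np.column_stack([stack_image(img) for img in images])
```

With k classes, the full dictionary A = [A_1, ..., A_k] is then obtained by concatenating the class matrices column-wise (e.g., `np.hstack`).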
Test Sample as a Sparse Linear Combination of
Training Samples
An immense variety of statistical, generative, or discriminative
models have been proposed for exploiting the
structure of the Ai for recognition. One particularly simple
and effective approach models the samples from a single
class as lying on a linear subspace. Subspace models are
flexible enough to capture much of the variation in real data
sets and are especially well motivated in the context of face
recognition, where it has been observed that the images of
faces under varying lighting and expression lie on a special
low-dimensional subspace [24], [30], often called a face
subspace. Although the proposed framework and algorithm
can also apply to multimodal or nonlinear distributions (see
the supplementary appendix for more detail, which can be
found on the Computer Society Digital Library at http://
doi.ieeecomputersociety.org/10.1109/TPAMI.2008.79), for
ease of presentation, we shall first assume that the training
samples from a single class do lie on a subspace. This is the
only prior knowledge about the training samples we will be
using in our solution.
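The subspace assumption can be made concrete with a small synthetic check (our sketch; the dimensions and random data are stand-ins, not face images): a new sample from a subject is reproduced almost exactly by a least-squares combination of that subject's training vectors, while a different subject's subspace leaves a large residual.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_i = 50, 5

# Columns of A_i: training vectors of one subject, spanning a
# low-dimensional "face subspace" (synthetic stand-ins here).
A_i = rng.standard_normal((m, n_i))

# A new image of the same subject lies near that subspace:
# y = A_i @ alpha + small modeling error.
alpha = rng.standard_normal(n_i)
y = A_i @ alpha + 0.01 * rng.standard_normal(m)

# Least-squares fit onto the subject's subspace; the relative
# residual is small only for the correct subject.
x_hat, *_ = np.linalg.lstsq(A_i, y, rcond=None)
rel_residual = np.linalg.norm(y - A_i @ x_hat) / np.linalg.norm(y)
```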
Classification Based on Sparse Representation
Given a new test sample y from one of the classes in the
training set, we first compute its sparse representation x̂_1
via (6) or (10). Ideally, the nonzero entries in the estimate x̂_1
will all be associated with the columns of A from a single
object class i, and we can easily assign the test sample y to
that class. However, noise and modeling error may lead to
small nonzero entries associated with multiple object
classes (see Fig. 3). Based on the global sparse representation,
one can design many possible classifiers to resolve
this. For instance, we can simply assign y to the object class
with the single largest entry in x̂_1. However, such heuristics
do not harness the subspace structure associated with
images in face recognition. To better harness such linear
structure, we instead classify y based on how well the
coefficients associated with all training samples of each
object reproduce y.
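This residual-based rule can be sketched end-to-end. The implementation below is ours, not the authors' code: the ℓ1 step is cast as a linear program via `scipy.optimize.linprog` as a stand-in for the solvers the paper uses for (6)/(10), and the classifier assigns y to the class i minimizing ||y − A δ_i(x̂)||_2.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, y):
    """Solve min ||x||_1 subject to A x = y as a linear program
    (x = u - v with u, v >= 0). A basis-pursuit sketch, not the
    paper's exact solver."""
    m, n = A.shape
    res = linprog(np.ones(2 * n),
                  A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

def src_classify(A, labels, y):
    """Assign y to the class whose training columns best reproduce
    it: argmin_i || y - A * delta_i(x_hat) ||_2, where delta_i keeps
    only the coefficients associated with class i."""
    x_hat = l1_minimize(A, y)
    residuals = {}
    for c in sorted(set(labels)):
        mask = np.array([lab == c for lab in labels])
        x_c = np.where(mask, x_hat, 0.0)  # delta_i(x_hat)
        residuals[c] = np.linalg.norm(y - A @ x_c)
    return min(residuals, key=residuals.get)
```

On synthetic data whose test sample is a sparse combination of one class's columns, the minimum-residual class is the true one; the same residual comparison is what rejects the spurious small entries spread across other classes.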
The Role of Feature Extraction
In the computer vision literature, numerous feature extraction
schemes have been investigated for finding projections
that better separate the classes in lower dimensional spaces,
which are often referred to as feature spaces. One class of
methods extracts holistic face features such as Eigenfaces[
23], Fisherfaces [24], and Laplacianfaces [25]. Another
class of methods tries to extract meaningful partial facial
features (e.g., patches around eyes or nose) [21], [41] (see
Fig. 6 for some examples). Traditionally, when feature
extraction is used in conjunction with simple classifiers
such as NN and NS, the choice of feature transformation is
considered critical to the success of the algorithm. This has
led to the development of a wide variety of increasingly
complex feature extraction methods, including nonlinear
and kernel features [42], [43]. In this section, we reexamine
the role of feature extraction within the new sparse
representation framework for face recognition.
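As a sketch of the "unconventional features" point, here is a random-projection feature extractor (a standard Gaussian random matrix; the scaling and function name are our choices):

```python
import numpy as np

def random_projection_features(V, d, seed=0):
    """Map m-dimensional image vectors (columns of V) into a
    d-dimensional feature space with a random Gaussian matrix R
    (d x m). Unlike Eigenfaces or Laplacianfaces, no training and
    no face-specific engineering is involved."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, V.shape[0])) / np.sqrt(d)
    return R @ V
```

By the Johnson-Lindenstrauss lemma, such projections approximately preserve geometry once d is moderately large, which is consistent with the threshold behavior the sparse representation theory predicts for the feature dimension.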
Recognition Despite Random Pixel Corruption
For this experiment, we test the robust version of SRC, which
solves the extended ℓ1-minimization problem (22) using the
Extended Yale B Face Database. We choose Subsets 1 and 2
(717 images, normal-to-moderate lighting conditions) for
training and Subset 3 (453 images, more extreme lighting
conditions) for testing. Without occlusion, this is a relatively
easy recognition problem. This choice is deliberate, in order
to isolate the effect of occlusion. The images are resized to
96 × 84 pixels, so in this case, B = [A, I] is an 8,064 × 8,761
matrix. For this data set, we have estimated that the polytope
P = conv(B) is approximately 1,185-neighborly (using the
method given in [37]), suggesting that perfect reconstruction
can be achieved up to 13.3 percent (worst possible)
occlusion.
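The extended problem can be sketched on synthetic data: write y = A x0 + e0 with a sparse error e0, stack B = [A, I], and minimize the ℓ1 norm of the combined unknown w = [x; e]. The LP formulation and toy dimensions below are our stand-ins for the paper's solver and data:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(B, y):
    """min ||w||_1 subject to B w = y, as a linear program
    (w = u - v with u, v >= 0). An illustrative sketch."""
    m, n = B.shape
    res = linprog(np.ones(2 * n),
                  A_eq=np.hstack([B, -B]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(2)
m, n = 40, 10
A = rng.standard_normal((m, n))

x0 = np.zeros(n)
x0[3] = 1.0                      # one active training image
e0 = np.zeros(m)
e0[:3] = [8.0, -6.0, 5.0]        # three grossly corrupted pixels

y = A @ x0 + e0
B = np.hstack([A, np.eye(m)])    # B = [A, I], here 40 x 50

w_hat = l1_min(B, y)
x_hat, e_hat = w_hat[:n], w_hat[n:]  # e_hat localizes the corruption
```

Because the corruption is sparse with respect to the pixel basis, the recovered ê isolates exactly the corrupted pixels, and x̂ matches the clean representation.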
Recognition Despite Random Block Occlusion
We next simulate various levels of contiguous occlusion,
from 0 percent to 50 percent, by replacing a randomly located
square block of each test image with an unrelated image, as
in Fig. 12a. Again, the location of occlusion is randomly
chosen for each image and is unknown to the computer.
Methods that select fixed facial features or blocks of the
image (e.g., [16] and [57]) are less likely to succeed here due
to the unpredictable location of the occlusion. The top two
rows in Figs. 12a, 12b, 12c, and 12d show two
representative results of Algorithm 1 with 30 percent
occlusion. Fig. 12a is the occluded image. In the second
row, the entire center of the face is occluded; this is a
difficult recognition task even for humans. Fig. 12b shows
the magnitude of the estimated error ê_1. Notice that ê_1
compensates not only for occlusion due to the baboon but
also for the violation of the linear subspace model caused by
the shadow under the nose. Fig. 12c plots the estimated
coefficient vector x̂_1. The red entries are the coefficients
corresponding to the test image's true class. In both examples,
the estimated coefficients are indeed sparse and have large
magnitude only for training images of the same person. In
both cases, the SRC algorithm correctly classifies the
occluded image. For this data set, our Matlab implementation
requires 90 seconds per test image on a PowerMac G5.
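The occlusion protocol itself is easy to reproduce; here is a sketch (our function name, and random noise substituting for the unrelated baboon image used in the paper):

```python
import numpy as np

def occlude_block(img, frac, rng):
    """Replace a randomly located square block covering roughly
    `frac` of the image area with unrelated content (here, random
    noise stands in for the baboon image)."""
    h, w = img.shape
    side = int(np.sqrt(frac * h * w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out = img.copy()
    out[top:top + side, left:left + side] = rng.random((side, side))
    return out

rng = np.random.default_rng(0)
face = np.zeros((96, 84))            # 96 x 84, as in the experiments
occluded = occlude_block(face, 0.30, rng)
```

The block location is drawn independently for each test image and is never revealed to the classifier, which is what defeats methods relying on fixed facial features or blocks.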
CONCLUSIONS AND DISCUSSIONS
In this paper, we have contended both theoretically and
experimentally that exploiting sparsity is critical for the
high-performance classification of high-dimensional data
such as face images. With sparsity properly harnessed, the
choice of features becomes less important than the number
of features used (in our face recognition example, approximately
100 features are sufficient to make the difference negligible).
Moreover, occlusion and corruption can be handled
uniformly and robustly within the same classification
framework. One can achieve striking recognition performance
on severely occluded or corrupted images with a
simple algorithm that requires no special engineering.