Mixture of Experts for Classification of Gender, Ethnic Origin, and Pose of Human Faces
Abstract—In this paper we describe the application of mixtures
of experts to gender and ethnic classification of human faces, as
well as to pose classification, and show their feasibility on the FERET database
of facial images. The FERET database allows us to demonstrate
performance on hundreds or thousands of images. The mixture
of experts is implemented using the “divide and conquer”
modularity principle with respect to the granularity and/or the locality
of information. The mixture of experts consists of ensembles
of radial basis functions (RBFs). Inductive decision trees (DTs) and
support vector machines (SVMs) implement the “gating network”
components for deciding which of the experts should be used to
determine the classification output and to restrict the support of
the input space. Both the ensemble of RBFs (ERBF) and the SVM use
the RBF kernel (“expert”) for gating the inputs. Our experimental
results yield an average accuracy rate of 96% on gender classification
and 92% on ethnic classification using the ERBF/DT approach
from frontal face images, while the SVM yields 100% on pose classification.
I. INTRODUCTION
TO INTERACT socially, we must be able to process faces
in a variety of ways, a requirement supported by the social and
cognitive psychology literature [1]. This literature also attests
to an impressive set of capabilities, including identifying
familiar faces and extracting information such as gender, race,
and emotional state from a face.
Humans are able to make accurate and fast predictions from
visual imagery. Among face processing tasks, gender classification
is one of the most biologically important and probably the
easiest and the fastest to achieve [2]. It was observed by Bruce et al. [3] that, on average, only 600 ms was needed for classification
of faces based on their sex. In a more recent study by the
same authors [4], human subjects were able to classify nonfamiliar
face pictures using sex as a visual cue with 96% accuracy,
even for the cases when a swimming cap concealed the hair.
Face processing is a difficult task, mostly because of the inherent
variability of the image formation process in terms of
image quality and photometry, geometry, occlusion, change, and
disguise. Two recent surveys discuss these challenges in some
detail [5], [6]. Most face processing systems available today perform
only on restricted databases of images in terms of size,
age, gender, and race, and they assume well-controlled environments.
Most face processing systems assume that all images are
frontal. If additional poses, beyond the frontal one, are possible,
then it becomes necessary to either discriminate between possible
poses or estimate the actual face pose. Pose information
can then be used in a variety of ways, ranging from normalization
and detection of facial landmarks, to face recognizers
trained only on some specific poses.
This paper addresses the problem of automatic categorization
of human faces based on gender and ethnic origin, and
pose discrimination using mixture of experts. The mixture of
experts implements the “divide and conquer” modularity principle
with respect to the granularity and/or the locality of information.
The mixtures of experts are ensembles of radial basis
function (ERBF) networks. Inductive decision trees (DTs) and
support vector machines (SVMs) implement the “gating networks”
components for deciding which of the experts should
be used to determine the classification output and/or to restrict
the support of the input space. Both the ERBF and SVM use the
RBF kernels for gating the inputs.
II. BACKGROUND
Few attempts have been made to perform gender and ethnic
classification, and those that exist have used very small data sets.
SEXNET, an early example of a gender classification system,
characteristic of the holistic approach, is described by Golomb
et al. [7]. In 90 face images comprising 45 beardless males and
45 females, the eyes were manually located and the images
then rotated and scaled automatically to a standard format of
30 × 30 pixels. An encoder back propagation network with 40
hidden units then compressed the images. The output of those
40 units served as input for a sex classification network, trained
using back propagation as well. SEXNET yields an accuracy
of 91.9% on a data set of 90 exemplars corresponding to 45
male and 45 female subjects. The training set was composed
of 80 exemplars and the remaining ten exemplars were used for testing. The system used limited hair information. Brunelli
and Poggio [8] describe a gender classification system using
a discrete approach requiring geometrical features such as
pupil-to-nose vertical distance, nose width, chin radii, and
eyebrow thickness. These geometrical features then define a
feature vector consisting of 18 such features for each person.
No hair information was used and their data set consisted of
168 images of 21 males and 21 females. Brunelli and Poggio
report an accuracy of 92% on the training set and 87.5% on the
testing set using the hyper basis function network. Recently
Wiskott [9] reported an accuracy of 92% on a data set of 111
faces corresponding to 72 male and 39 female faces using the
dynamic link matching architecture (DLA). No restrictions
were placed on hair information.
On the ethnic classification task, the only reference the
authors are aware of is the technique due to O’Toole et al.
[10], who applied principal component analysis (PCA) to
aligned 151 × 225 pixel images of 167 Caucasian and Japanese
faces. A simple criterion based on the reconstruction
coefficients of the first four eigenvectors yielded an accuracy
rate of 76%.
Pose estimation is important for face recognition when view-based
classifiers are trained to recognize a subset of views and,
among other things, to disambiguate gestures during recognition
because head pose is closely related to human intention
and behavior [11]–[12]. Pose estimation, usually on small
data sets, has been approached using (annotated) geometrical
features and affine geometry [13], interpolation and extrapolation
in the three-dimensional (3-D) eigenspace [14], and labeled
graphs and the dynamic link architecture (DLA) [15]. More recently,
McKenna and Gong [16] implemented a template-based
correlation (of oriented Gabor filters) to recognize and track
faces. Using a magnetic sensor and a calibrated camera, they
continuously track the 3-D head pose across the view-sphere
(±90° yaw and ±30° tilt) at intervals of 10° from video sequences.
As the size of the data sets used in the experiments reported
above is quite restricted, no conclusions can be drawn about
the ability of such methods to generalize and to scale up for
large image databases, possibly consisting of several hundreds
or thousands of face images. This paper describes novel committee
network architectures for gender and ethnic classification
of human faces and shows their feasibility using as test
beds hundreds and/or thousands of face images drawn from
the standard FERET face image database. No restrictions were
placed on the hairstyles of different subjects. We are not aware
of any pose discrimination implementations similar to the one
addressed in this paper using SVM or tested on hundreds of images.
III. MIXTURE OF EXPERTS
One (cross-validation) practice in neural networks research
is to try several estimators on a given data set and then choose
the result using a winner-take-all (WTA) approach. It can be argued
that WTA “wastes” the resulting models, which lose the
competition. Instead of choosing a single “best” method for
a given problem, a combination of several predictive models may produce an improved prediction. Model combination approaches
are an attempt to capture the information contained in
all the candidates. Typical model combination procedures consist
of a two-stage process. In the first stage, the training data are
used to separately estimate a number of different models (“experts”).
The parameters of those models are then held fixed. In
the second stage, these individual models are (linearly or nonlinearly)
combined, mixed or gated, to produce the final predictive
model [17].
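As a rough illustration of this two-stage procedure (a sketch, not part of the original paper), the following assumes scikit-learn-style classifiers and hypothetical numpy arrays X_train, y_train, X_val, y_val, and X_test with integer class labels 0..K-1; the particular experts and the simple convex mixture are assumptions made for the example only.

```python
# Illustrative sketch only: "estimate experts, then combine the fixed experts."
# X_train, y_train, X_val, y_val, X_test are hypothetical numpy arrays.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stage 1: estimate the individual "experts" separately; their parameters are
# then held fixed.
experts = [SVC(kernel="rbf", probability=True).fit(X_train, y_train),
           DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)]

# Stage 2: combine the frozen experts linearly, choosing the mixing weight
# that minimizes the empirical risk on held-out data.
p0, p1 = (e.predict_proba(X_val) for e in experts)
best_w = min(np.linspace(0.0, 1.0, 21),
             key=lambda w: np.mean(np.argmax(w * p0 + (1 - w) * p1, axis=1) != y_val))

# Final predictive model: the mixed combination of the two experts.
mix = (best_w * experts[0].predict_proba(X_test)
       + (1 - best_w) * experts[1].predict_proba(X_test))
y_pred = np.argmax(mix, axis=1)
```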
Specific mixture-of-experts architectures used for model
combination usually produce the combined model by minimizing
the empirical risk at each stage [18] or, as is the case
with stacking predictors [19], employ a resampling technique
similar to cross-validation. In the first approach, the training
data are first used to estimate the candidate models, and the
combined model is then created by taking a weighted average. The
procedure for stacking predictors uses a resampling approach
to combine the models. This resampling is done so that data
samples used to estimate the individual approximating functions
are not used to estimate the mixture coefficients.
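A minimal sketch of the stacking variant, again assuming scikit-learn estimators and hypothetical arrays X and y (none of these names come from the paper): the key point is that the mixture coefficients are fit on out-of-fold predictions, so the samples used to estimate each individual model are not reused to estimate the coefficients.

```python
# Illustrative sketch only: stacking with cross-validation-style resampling.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

base_models = [SVC(kernel="rbf", probability=True),
               DecisionTreeClassifier(max_depth=5)]

# Out-of-fold expert outputs (the resampling step): every row of Z was
# predicted by a model that did not see that sample during training.
Z = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
               for m in base_models])

# Mixture coefficients estimated on the out-of-fold expert outputs.
stacker = LogisticRegression(max_iter=1000).fit(Z, y)

# The experts themselves are refit on all of the data for prediction time.
experts = [m.fit(X, y) for m in base_models]

def stacked_predict(X_new):
    # combine the experts' outputs using the learned mixture coefficients
    Z_new = np.hstack([e.predict_proba(X_new) for e in experts])
    return stacker.predict(Z_new)
```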
An early example of using ensembles of experts (neural networks)
is due to Hampshire and Waibel [20]. The Meta-Pi classifier
is a connectionist pattern classifier that consists of a number
of source-dependent subnetworks that are integrated by a combinational
time-delay neural network (TDNN) superstructure.
The TDNN combines the outputs of the modules, trained independently,
in order to provide a global classification. Lincoln
and Skrzypek [21] have proposed clustering multiple backpropagation
networks for improved performance and fault tolerance.
Following training, a “cluster” is created by computing
the average of the outputs generated by the individual networks.
The output of the “cluster” is used as the desired output during
training by feeding it back to the individual networks. According
to the authors, the basic notion behind such a strategy is the
assumption that while it is possible to “fool” a single
BP network all of the time, one cannot mislead all of them all of
the time. Battiti and Colla [22] have proposed means to combine
the outputs of different neural network classifiers for improving
the rejection-accuracy (ROC) rates and to make the combined
performance better than that obtained from the individual components.
The suggested concept of democracy is analogous to the
human way of reaching a pondered decision: query by consensus.
Soulie et al. [23] have proposed multimodular architectures
(MMAs) that integrate various neural networks to realize
feature extraction and recognition in successive stages that are
cooperatively trained.
Consider now the problem of learning a mapping in which
the form of the mapping is different for different regions of the
input space. Although a single homogeneous network could be
applied to this problem, we expect the task will be easier if we
assign different expert networks to tackle each of the different
regions. A “gating” network, which also sees the input
data, is then used to decide which of the experts should determine
the output [24]. Such architectures, based on the “divide and conquer”
modularity principle [25], train the expert networks and
the gating network together. The goal of the training procedure
is to have the gating network learn an appropriate decomposition
of the input space into different regions, with one of the expert networks responsible for generating the outputs for input
vectors falling within each region. Jordan and Jacobs [26] extend
this model by considering a hierarchical system in which
each expert network can itself consist of a mixture-of-experts
model complete with its own gating network [24].
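The gating computation itself can be illustrated with a small numpy sketch (illustrative only; the weight matrices below are random placeholders, whereas in the architectures cited above the experts and the gate are trained jointly):

```python
# Illustrative numpy sketch of the gating computation only.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_experts = 16, 3, 4                      # input dim, classes, experts

W_experts = rng.normal(size=(n_experts, d, k))  # one linear "expert" per region
W_gate = rng.normal(size=(d, n_experts))        # gating-network parameters

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_forward(x):
    """The gate sees the same input and assigns responsibilities to experts."""
    gate = softmax(x @ W_gate)                                   # (n_experts,)
    expert_out = softmax(np.einsum("d,edk->ek", x, W_experts))   # (n_experts, k)
    return gate @ expert_out                                     # gated mixture

print(mixture_forward(rng.normal(size=d)))       # class posterior of the mixture
```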
Gating networks as described above can be shown to carry
conceptual similarity to mixture estimation and the EM algorithm
[26]. As an example, in the context of estimating motion in
scenes containing multiple motions, Weiss and Adelson [27] describe
a novel recurrent network architecture, which solves this
problem by simultaneously estimating motion and segmenting
the scene. The network is comprised of locally connected units
that carry out simple calculations in parallel. Rather than having
one network estimate motion everywhere, there are now
multiple motion-expert subnetworks competing to explain the
data by minimizing the motion error. A gating subnetwork that
assigns different regions of space to different experts controls
the error signal to these expert subnetworks. The advantage of
this approach is that it restores the validity of the smoothness
assumptions: regions undergoing drastically different motions
are assigned to different experts, and the motion of regions assigned
to a specific expert is indeed smoothly varying, which rewards
coherence of assignments.
Learning classifiers from small sample training sets is difficult
in that the parameters of the data distribution cannot be
estimated properly [17]. With only a small number of training objects,
some of them (“outliers”) can largely distort the distribution.
Classifiers built on small training sets are thus usually
biased or unstable [28]. Bootstrap [29], based on random
sampling with replacement, allows one to get more accurate
statistical estimators. By taking a bootstrap replicate, one is
likely to avoid the “outliers” from the original training set. The
bootstrap estimators are not always superior to leave-one-out
(cross-validation) on small samples, despite the fact that
leave-one-out estimators, while nearly unbiased, have high variance on
small samples [30]. Bagging, based on bootstrapping and aggregation,
works by aggregating (voting over or averaging the outputs of) classifiers built
from several bootstrap replicates. Bagging is useful for unstable
(biased and large variance) classifiers, but for stable classifiers
it can degrade their performance.
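A small sketch of bagging under these definitions, assuming hypothetical numpy arrays X and y with nonnegative integer class labels and scikit-learn decision trees standing in for the unstable base classifier:

```python
# Illustrative sketch of bagging: bootstrap replicates plus vote aggregation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
models = []
for _ in range(25):
    # bootstrap replicate: random sampling with replacement, which tends to
    # dilute the influence of individual "outliers" in a small training set
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    # aggregate the bootstrap-trained classifiers by majority vote
    votes = np.stack([m.predict(X_new) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```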
The basic paradigm for improving the accuracy of unstable
methods is that of perturbing and combining. As an example,
Freund and Schapire [31] have proposed an arcing algorithm
whose basis is to adaptively resample and combine so that the
weights during resampling are increased for those cases most
often misclassified. A similar concept using corrective training
driven by an active learning scheme has been suggested by
Krogh and Vedelsby [32]. The active learning scheme takes
advantage of the obvious observation that a combination of the
output of several networks (or other predictors) is only useful
if they disagree on some inputs. The disagreement, called the
ensemble ambiguity, can then reduce the generalization error of
the network ensemble. Arcing has proved more successful than
bagging in test set error reduction. Both bootstrap aggregating
(“bagging”) and arcing (“boosting”) manipulate the training
data in order to generate different classifiers. Combining
multiple versions through either bagging or arcing then reduces
the variance significantly [33]. An empirical comparison of voting classification algorithms has been provided recently by
Bauer and Kohavi [34].
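For concreteness, a hedged sketch of the "adaptively resample and combine" idea in an AdaBoost-style form (a generic variant for illustration, not the exact algorithm of [31]), assuming hypothetical arrays X and y with labels in {-1, +1} and decision stumps as the weak experts:

```python
# Illustrative AdaBoost-style sketch: the weights of cases misclassified by the
# current classifier are increased before the next one is fit.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

w = np.full(len(X), 1.0 / len(X))           # start from uniform case weights
stages = []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y * pred)          # up-weight the misclassified cases
    w /= w.sum()
    stages.append((alpha, stump))

def arced_predict(X_new):
    # final classifier: weighted vote over all stages
    return np.sign(sum(a * s.predict(X_new) for a, s in stages))
```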
IV. ENSEMBLES OF RADIAL BASIS FUNCTIONS (ERBFs) AND
DECISION TREES (DT)
The motivation for the (hybrid) ERBF/DT architecture comes
from the apparent need to process imagery at different levels
of granularity, from the ability of RBFs to provide an approximate
and compressed input representation, and from the recognition that
DT classifiers are fast and comprehensible induction learning
methods based on recursive partitioning (“gating”). RBFs further
allow for clustering similar images before classifying them
and provide thus the potential for developing in the future hierarchical
classifiers where faces can be sequentially discriminated
in terms of gender, race, and age, before final ID recognition
takes place. Decision trees are valuable tools for the description,
classification and generalization of data [35]. Several advantages
of DT-based classification are pointed out by Murthy
and include 1) tree methods are exploratory as opposed to inferential;
they are also nonparametric. As only a few assumptions
are made about the model and the data distribution, trees can
model a wide range of data distributions; 2) the hierarchical decomposition
implies better use of available features and computational
efficiency in classification; and 3) trees perform classification
by a sequence of simple, easy-to-understand tests whose
semantics are intuitively clear to domain experts. Decision trees
provide for flexible and adaptive classification thresholds on the
RBF outputs based on entropy and using both positive and negative
examples of the classes to be learned, and can interpret
(“explain”) the way classification and retrieval are eventually
achieved in terms of the experts being used. The ERBF implements
the equivalent of query by consensus and they are trained
on data reflecting the inherent variability of the input. Ensembles
are defined in terms of their specific topology (connections
and RBF nodes) and the data they are trained on. Training on both the original
data and versions distorted by geometrical changes and
blur provides robustness to those very distortions via generalization.
As it is difficult to decide empirically which groupings of
classifiers (“experts”) are sufficient for classification, and furthermore
as suitable decision boundaries (“thresholds”) are hard
to establish, these issues are addressed by interfacing the DT component
to the ensemble of RBF networks.
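A minimal sketch of this ERBF/DT interface (an illustrative reconstruction under stated assumptions, not the paper's exact configuration): a few small RBF networks, each trained on a slightly perturbed copy of the data, emit class scores, and a decision tree learns adaptive thresholds over those scores. The arrays X and y, the number of centers, the ensemble size, and the perturbations are all hypothetical.

```python
# Illustrative reconstruction: an ensemble of small RBF networks whose output
# scores are gated by a decision tree that sets the decision thresholds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def rbf_features(X, centers, gamma=0.1):
    # Gaussian RBF activation of every sample with respect to every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class SimpleRBFNet:
    def fit(self, X, y, n_centers=10):
        self.centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
        self.out = LogisticRegression(max_iter=1000).fit(rbf_features(X, self.centers), y)
        return self
    def scores(self, X):
        return self.out.predict_proba(rbf_features(X, self.centers))

# Ensemble of RBF networks, each trained on a perturbed copy of the data.
ensemble = [SimpleRBFNet().fit(X + rng.normal(scale=0.01, size=X.shape), y)
            for _ in range(3)]

# DT component interfaced to the ensemble: it learns decision thresholds
# on the experts' outputs from both positive and negative examples.
expert_scores = np.hstack([net.scores(X) for net in ensemble])
gate = DecisionTreeClassifier(max_depth=4).fit(expert_scores, y)
y_pred = gate.predict(expert_scores)
```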