18-08-2012, 03:20 PM
Neural Network-Based Face Detection
Abstract
We present a neural network-based upright frontal face detection system. A retinally connected
neural network examines small windows of an image, and decides whether each window
contains a face. The system arbitrates between multiple networks to improve performance
over a single network. We present a straightforward procedure for aligning positive face examples
for training. To collect negative examples, we use a bootstrap algorithm, which adds
false detections into the training set as training progresses. This eliminates the difficult task of
manually selecting nonface training examples, which must be chosen to span the entire space
of nonface images. Simple heuristics, such as using the fact that faces rarely overlap in images,
can further improve the accuracy. Comparisons with several other state-of-the-art face detection
systems are presented, showing that our system has comparable performance in terms of
detection and false-positive rates.
Introduction
In this paper, we present a neural network-based algorithm to detect upright, frontal views of faces
in gray-scale images. The algorithm works by applying one or more neural networks directly to
portions of the input image, and arbitrating their results. Each network is trained to output the
presence or absence of a face. The algorithms and training methods are designed to be general,
with little customization for faces.
Many face detection researchers have used the idea that facial images can be characterized
directly in terms of pixel intensities. These images can be characterized by probabilistic models of
the set of face images [4, 13, 15], or implicitly by neural networks or other mechanisms [3, 12, 14,
19, 21, 23, 25, 26]. The parameters for these models are adjusted either automatically from example
images (as in our work) or by hand. A few authors have taken the approach of extracting features
and applying either manually or automatically generated rules for evaluating these features [7,11].
Description of the System
Our system operates in two stages: it first applies a set of neural network-based filters to an image,
and then uses an arbitrator to combine the outputs. The filters examine each location in the image at
several scales, looking for locations that might contain a face. The arbitrator then merges detections
from individual filters and eliminates overlapping detections.
Stage One: A Neural Network-Based Filter
The first component of our system is a filter that receives as input a 20x20 pixel region of the
image, and generates an output ranging from 1 to -1, signifying the presence or absence of a face,
respectively. To detect faces anywhere in the input, the filter is applied at every location in the
image. To detect faces larger than the window size, the input image is repeatedly reduced in size
(by subsampling), and the filter is applied at each size. This filter must have some invariance to
position and scale. The amount of invariance determines the number of scales and positions at
which it must be applied. For the work presented here, we apply the filter at every pixel position
in the image, and scale the image down by a factor of 1.2 for each step in the pyramid.
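The scanning scheme above can be sketched as follows. This is a minimal illustration (the helper name and return values are our own, not from the paper) of how repeatedly shrinking the image by a factor of 1.2 yields the set of scales at which the fixed 20x20 filter is applied:

```python
def pyramid_levels(width, height, window=20, scale=1.2):
    """Yield the (width, height) of each pyramid level, stopping once
    the image becomes smaller than the filter window."""
    w, h = width, height
    while w >= window and h >= window:
        yield (w, h)
        # Shrink by the pyramid's scale factor (1.2 in the paper)
        w = int(w / scale)
        h = int(h / scale)

# Example: scan scales for a hypothetical 100x80 input image
levels = list(pyramid_levels(100, 80))
```

At each level, the filter would then be applied at every pixel position, so a face roughly 1.2x larger than the window at one level appears window-sized at the next.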
The filtering algorithm is shown in Fig. 1. First, a preprocessing step, adapted from [21], is
applied to a window of the image. The window is then passed through a neural network, which
decides whether the window contains a face. The preprocessing first attempts to equalize the
intensity values across the window. We fit a function which varies linearly across the window
to the intensity values in an oval region inside the window. Pixels outside the oval (shown in
Fig. 2a) may represent the background, so those intensity values are ignored in computing the
lighting variation across the face. The linear function will approximate the overall brightness of
each part of the window, and can be subtracted from the window to compensate for a variety
of lighting conditions. Then histogram equalization is performed, which non-linearly maps the
intensity values to expand the range of intensities in the window. The histogram is computed for
pixels inside an oval region in the window. This compensates for differences in camera input gains,
as well as improving contrast in some cases.
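The two preprocessing steps can be sketched as below. This is an illustrative reimplementation, not the paper's code: the lighting correction fits a plane I(x, y) ≈ a·x + b·y + c by least squares over the oval-masked pixels and subtracts it; the equalization builds the intensity histogram from the masked pixels only and remaps the whole window through its cumulative distribution. The function names, the 256-bin choice, and the NumPy formulation are assumptions.

```python
import numpy as np

def correct_lighting(window, mask):
    """Fit a linear brightness function over pixels where mask is True
    (the oval region) and subtract it from the whole window."""
    ys, xs = np.nonzero(mask)
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, window[ys, xs].astype(float), rcond=None)
    yy, xx = np.mgrid[:window.shape[0], :window.shape[1]]
    plane = coeffs[0] * xx + coeffs[1] * yy + coeffs[2]
    return window - plane

def equalize(window, mask, levels=256):
    """Histogram-equalize the window using the histogram of the
    masked (oval) pixels; output values lie in [0, 1]."""
    vals = window[mask]
    lo, hi = float(vals.min()), float(vals.max())
    hist, _ = np.histogram(vals, bins=levels, range=(lo, hi))
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    idx = ((window - lo) / max(hi - lo, 1e-9) * (levels - 1)).astype(int)
    return cdf[np.clip(idx, 0, levels - 1)]
```

Applying `correct_lighting` first and `equalize` second mirrors the order described above: remove the slowly varying lighting component, then nonlinearly expand the remaining intensity range.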
Merging Overlapping Detections
Note that in Fig. 3, most faces are detected at multiple nearby positions or scales, while false detections
often occur with less consistency. This observation leads to a heuristic which can eliminate
many false detections. For each location and scale, the number of detections within a specified
neighborhood of that location can be counted. If the number is above a threshold, then that location
is classified as a face. The centroid of the nearby detections defines the location of the
detection result, thereby collapsing multiple detections. In the experiments section, this heuristic
will be referred to as “thresholding”.
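The thresholding heuristic can be sketched as follows. This is a simplified illustration (positions only, ignoring scale; the neighborhood radius and threshold values are placeholders, not the paper's settings): detections within a small neighborhood are grouped, groups below the count threshold are discarded as likely false detections, and each surviving group is collapsed to its centroid.

```python
def merge_detections(detections, radius=2, threshold=2):
    """Collapse nearby detections: keep the centroid of each cluster
    containing at least `threshold` detections within `radius`."""
    kept = []
    used = [False] * len(detections)
    for i, (x, y) in enumerate(detections):
        if used[i]:
            continue
        # Gather all detections within the neighborhood of (x, y)
        cluster = [(x2, y2) for (x2, y2) in detections
                   if abs(x2 - x) <= radius and abs(y2 - y) <= radius]
        if len(cluster) >= threshold:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            kept.append((cx, cy))
            for j, p in enumerate(detections):
                if p in cluster:
                    used[j] = True
    return kept
```

An isolated detection with no nearby support fails the count test and is dropped, which is exactly why inconsistent false detections tend to be eliminated while true faces, detected at several nearby positions and scales, survive.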