16-08-2012, 03:16 PM
Robust Real-time Object Detection
Abstract
This paper describes a visual object detection framework that is capable of processing images extremely
rapidly while achieving high detection rates. There are three key contributions. The first is the introduction
of a new image representation called the “Integral Image” which allows the features used by our detector
to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small
number of critical visual features and yields extremely efficient classifiers [6]. The third contribution is a
method for combining classifiers in a “cascade” which allows background regions of the image to be quickly
discarded while spending more computation on promising object-like regions. A set of experiments in the
domain of face detection is presented. The system yields face detection performance comparable to the best
previous systems [18, 13, 16, 12, 1]. Implemented on a conventional desktop, face detection proceeds at 15
frames per second.
Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely
rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face
detection. Toward this end we have constructed a frontal face detection system which achieves detection and
false positive rates which are equivalent to the best published results [18, 13, 16, 12, 1]. This face detection
system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly.
Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700
MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in
video sequences, or pixel color in color images, has been used to achieve high frame rates. Our system
achieves high frame rates working only with the information present in a single grey scale image. These
alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these
ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows
for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al., our detection system
does not work directly with image intensities [10]. Like these authors, we use a set of features which are
reminiscent of Haar basis functions (though we will also use related filters which are more complex than
Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image
representation for images (the integral image is very similar to the summed area table used in computer
graphics [3] for texture mapping). The integral image can be computed from an image using a few operations
per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location
in constant time.
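The integral image idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes NumPy, pads the table with a leading row and column of zeros to avoid boundary checks, and uses the hypothetical names `integral_image` and `box_sum`.

```python
import numpy as np

def integral_image(img):
    """Return an (H+1, W+1) table where ii[y, x] is the sum of img[:y, :x].

    The extra zero row and column are an assumption of this sketch; they
    let box_sum work at the image border without special cases.
    """
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# The lookup result matches a direct sum over the same pixels.
assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

The cost of `box_sum` does not depend on the rectangle size, which is what makes feature evaluation at any scale a constant-time operation.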
The second contribution of this paper is a method for constructing a classifier by selecting a small number
of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like
features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning
process must exclude a large majority of the available features, and focus on a small set of critical features.
Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the
AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on
only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier,
can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong
bounds on generalization performance [14, 9, 10].
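The single-feature constraint can be illustrated with a small Python sketch. It is an assumption-laden toy, not the paper's training code: the weak learner is a threshold test on one feature column, candidate thresholds are simply the observed feature values, and the names `best_stump`, `adaboost`, and `predict` are hypothetical.

```python
import numpy as np

def best_stump(F, y, w):
    """Find the single feature, threshold and polarity with the lowest
    weighted error. F: (n_samples, n_features); y in {0, 1}; w sums to 1."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature index, threshold, polarity)
    for j in range(F.shape[1]):
        for theta in np.unique(F[:, j]):
            for p in (1, -1):
                pred = (p * F[:, j] < p * theta).astype(int)
                err = float(np.sum(w * (pred != y)))
                if err < best[0]:
                    best = (err, j, float(theta), p)
    return best

def adaboost(F, y, rounds):
    """Each round picks one feature, so the result doubles as a feature list."""
    w = np.full(len(y), 1.0 / len(y))
    chosen = []
    for _ in range(rounds):
        w = w / w.sum()                          # keep weights a distribution
        err, j, theta, p = best_stump(F, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        beta = err / (1.0 - err)
        pred = (p * F[:, j] < p * theta).astype(int)
        w = w * np.where(pred == y, beta, 1.0)   # down-weight correct examples
        chosen.append((j, theta, p, np.log(1.0 / beta)))
    return chosen

def predict(chosen, F):
    """Weighted vote of the selected single-feature classifiers."""
    votes = sum(a * (p * F[:, j] < p * t) for j, t, p, a in chosen)
    total = sum(a for *_, a in chosen)
    return (votes >= 0.5 * total).astype(int)
```

Because every weak classifier reads exactly one feature column, the list returned by `adaboost` names the features that were selected, in the order they were picked.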
The third major contribution of this paper is a method for combining successively more complex classifiers
in a cascade structure which dramatically increases the speed of the detector by focussing attention on
promising regions of the image. The notion behind focus of attention approaches is that it is often possible to
rapidly determine where in an image an object might occur [19, 8, 1]. More complex processing is reserved
only for these promising regions. The key measure of such an approach is the “false negative” rate of the
attentional process. It must be the case that all, or almost all, object instances are selected by the attentional
filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a
“supervised” focus of attention operator. The term supervised refers to the fact that the attentional operator
is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve
fewer than 1% false negatives and 40% false positives using a classifier which can be evaluated in 20 simple
operations (approximately 60 microprocessor instructions). The effect of this filter is to reduce by over one
half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers,
each slightly more complex than the last. If any classifier rejects the sub-window, no further processing
is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree,
and as such is related to the work of Amit and Geman [1].
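The early-rejection control flow can be sketched as follows. The stages here are placeholder threshold tests on a single number, standing in for boosted classifiers; the function name `cascade_classify` and the toy thresholds are assumptions made for illustration only.

```python
def cascade_classify(stages, window):
    """Run a sub-window through the stages in order.

    stages: list of callables returning True (possibly an object) or False.
    Returns (accepted, number_of_stages_evaluated).
    """
    for i, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, i          # rejected: no further processing
    return True, len(stages)

# Toy stages, each stricter than the last (illustrative values only).
stages = [lambda s: s > 0.1, lambda s: s > 0.5, lambda s: s > 0.9]

rejected_early = cascade_classify(stages, 0.05)   # fails the first stage
accepted = cascade_classify(stages, 0.95)         # passes every stage
```

Only sub-windows that survive every stage are reported as detections, so most of the work is spent on the few windows that resemble the object.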
The complete face detection cascade has 32 classifiers, which total over 80,000 operations. Nevertheless
the cascade structure results in extremely rapid average detection times. On a difficult dataset, containing
507 faces and 75 million sub-windows, faces are detected using an average of 270 microprocessor instructions
per sub-window. In comparison, this system is about 15 times faster than an implementation of the
detection system constructed by Rowley et al. [13].
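A short calculation shows why the average cost stays low even though the full cascade is expensive. In this Python sketch, only the first stage's figures (20 operations, 40% of sub-windows passed) come from the text above; the later stage costs and pass rates are made-up values, and `expected_cost` is a hypothetical helper.

```python
def expected_cost(costs, pass_rates):
    """Average operations per sub-window for a cascade.

    costs[i]:      operations needed to evaluate stage i.
    pass_rates[i]: fraction of the sub-windows reaching stage i
                   that stage i lets through.
    """
    total, reach = 0.0, 1.0
    for cost, rate in zip(costs, pass_rates):
        total += reach * cost    # only windows that reach a stage pay for it
        reach *= rate
    return total

# First stage from the text; the remaining numbers are illustrative only.
costs = [20, 100, 1000]
pass_rates = [0.4, 0.2, 0.1]
average = expected_cost(costs, pass_rates)   # 20 + 0.4*100 + 0.08*1000, about 140
```

Under these assumed numbers the stages total 1120 operations, yet the average sub-window costs roughly 140, because most windows never reach the expensive stages.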
An extremely fast face detector will have broad practical applications. These include user interfaces, image
databases, and teleconferencing. This increase in speed will enable real-time face detection applications
on systems where they were previously infeasible. In applications where rapid frame-rates are not necessary,
our system will allow for significant additional post-processing and analysis. In addition our system can be
implemented on a wide range of small low power devices, including hand-helds and embedded processors.
In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection
at two frames per second (this device has a low power 200 mips Strong Arm processor which lacks
floating point hardware).
Overview
The remaining sections of the paper will discuss the implementation of the detector, related theory, and
experiments. Section 2 will detail the form of the features as well as a new scheme for computing them
rapidly. Section 3 will discuss the method by which these features are combined to form a classifier. The
machine learning method used, a variant of AdaBoost, also acts as a feature selection mechanism. While
the classifiers that are constructed in this way have good computational and classification performance, they
are far too slow for a real-time classifier. Section 4 will describe a method for constructing a cascade of
classifiers which together yield an extremely reliable and efficient object detector. Section 5 will describe a
number of experimental results, including a detailed description of our experimental methodology. Finally,
Section 6 contains a discussion of this system and its relationship to related systems.
Features
Our object detection procedure classifies images based on the value of simple features. There are many
motivations for using features rather than the pixels directly. The most common reason is that features can
act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For
this system there is also a second critical motivation for features: the feature-based system operates much
faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou
et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the
difference between the sum of the pixels within two rectangular regions. The regions have the same size and
shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum
within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature
computes the difference between diagonal pairs of rectangles.
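The three feature types can be written directly in terms of rectangle sums. The Python sketch below assumes NumPy and a zero-padded integral image; the function names are hypothetical, only the horizontal orientation of each feature is shown, and the sign convention for the two- and four-rectangle features is a choice made here, since the text only specifies a difference.

```python
import numpy as np

def integral_image(img):
    """(H+1, W+1) table with ii[y, x] equal to the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def box(ii, y, x, h, w):
    """Pixel sum of the h-by-w rectangle whose top-left corner is (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect(ii, y, x, h, w):
    """Horizontally adjacent pair: right rectangle minus left rectangle."""
    return box(ii, y, x + w, h, w) - box(ii, y, x, h, w)

def three_rect(ii, y, x, h, w):
    """Centre rectangle minus the two outside rectangles."""
    centre = box(ii, y, x + w, h, w)
    outside = box(ii, y, x, h, w) + box(ii, y, x + 2 * w, h, w)
    return centre - outside

def four_rect(ii, y, x, h, w):
    """One diagonal pair of rectangles minus the other diagonal pair."""
    main = box(ii, y, x, h, w) + box(ii, y + h, x + w, h, w)
    anti = box(ii, y, x + w, h, w) + box(ii, y + h, x, h, w)
    return main - anti

ii = integral_image(np.arange(16).reshape(4, 4))
checker = integral_image(np.array([[1, 0], [0, 1]]))
```

Here `(y, x)` is the top-left corner of the feature and `h`, `w` are the height and width of each component rectangle, so every feature costs a fixed number of lookups regardless of its size.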
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite
large: 45,396. Note that unlike the Haar basis, the set of rectangle features is overcomplete.