17-09-2016, 04:33 PM
Attachment: 3D Haar-like Features for Pedestrian Detection (PDF, 206.6 KB)
ABSTRACT
One basic observation for pedestrian detection in video
sequences is that both appearance and motion information
are important for modeling moving people. Based on this
observation, we propose a new kind of feature: 3D Haar-like
(3DHaar) features. Motivated by the success of Haar-like
features in image-based face detection and differential-frame-based
pedestrian detection, we naturally extend this
feature by defining seven types of volume filters in 3D
space instead of rectangle filters in 2D space. The
advantage is that they can not only represent a pedestrian's
appearance but also capture motion information. To
validate the effectiveness of the proposed method, we
combine 3DHaar features with a support vector machine (SVM)
for pedestrian detection. Our experiments demonstrate that
3DHaar features are more effective for video-based pedestrian
detection.
INTRODUCTION
Human detection is important for a variety of applications,
such as visual surveillance, smart rooms, and automatic
driver-assistance systems. But it is a challenging task because
of the wide variability in appearance due to clothing,
articulation, and illumination conditions.
In the last few years, many approaches to pedestrian
detection have been proposed in both still images and video
sequences. For pedestrian detection in still images,
researchers mainly focus on the features for modeling
pedestrians. In the early stage, Papageorgiou et al. [1] used
a Haar-based representation combined with a polynomial
SVM to classify pedestrians. In order to detect pedestrians
with partial occlusion, Mohan et al. [2] improved the
method of [1] by dividing the human body into four parts:
head-shoulder, legs, and left and right arms. Later, researchers
started describing pedestrians with local descriptors in the Implicit
Shape Model (ISM) [3], which is a popular method for
object detection and recognition. Modified SIFT-like
descriptors (e.g., the Histogram of Oriented Gradients) have also been used for
pedestrian detection with other classifiers, such as SVM [4]
and cascaded AdaBoost [5]. Recently, Wu and Yu [6] modeled
pedestrians with a Markov Random Field to address the problems
of non-rigid shape and partial occlusion. Munder and
Gavrila [7] carried out an extensive experimental study of
various features and classifiers for pedestrian detection.
Much progress has been made in detection and tracking
of pedestrians in video sequences [8-11]. However, most
methods rely on segmentation of a foreground motion blob.
Motion segmentation by background modeling is simple and
effective when the camera is stationary and changes in
illumination are gradual. But in many applications the
camera may move and the illumination may change suddenly.
In such cases, direct detection of the human pattern can solve the
problem. Considering this requirement, we propose a
method that detects pedestrians directly from video
sequences, independent of motion detection.
Recently, the work of Viola et al. [11] and Dalal et al. [12]
indicates that combining static and dynamic information
can improve detection accuracy. Viola et al. [11]
presented a pedestrian detection algorithm with Haar-like
features computed on the difference between two frames,
considering both appearance and motion information. Our method is a natural
generalization of this algorithm. Instead of extracting
features between two frames, we extract Haar-like features
among multiple frames, which can capture more of the motion
information representative of people. Since they are
extracted in a space-time volume, we call them 3D Haar-like
features. These features are distinctive and robust for
representing the motion and appearance patterns of moving
people. The experimental results further verify the
effectiveness of our method.
The remainder of this paper is organized as follows.
Section 2 reviews the relevant algorithm based on the Haar-like
features proposed in [11]. Section 3 gives a detailed
description of our 3D Haar-like features. Section 4 presents
our experimental results, and Section 5 gives the summary
and the focus of our future work.
2. RELATED WORK
The proposed method can be regarded as a natural
generalization of Viola et al.'s algorithm [11], so we start
with a short description of that algorithm. Given a pair
of images I_t and I_{t+1} adjacent in time, five difference
images are computed: Δ is the difference image between I_t and I_{t+1};
U is the difference image between I_t and I_{t+1}
shifted up by one pixel; and D, L, R are the difference images between I_t and I_{t+1}
shifted down, left, and right by one pixel, respectively.
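As a concrete illustration, these five difference images can be sketched in NumPy as follows. The function name and the wrap-around border handling (`np.roll`) are our own simplifying assumptions, not details from [11]:

```python
import numpy as np

def difference_images(I_t, I_t1):
    """Five absolute difference images in the spirit of Viola et al. [11].

    Delta compares the two frames directly; U, D, L, R compare I_t with
    I_{t+1} shifted by one pixel up, down, left, and right. Borders wrap
    around (np.roll), a simplification for illustration only.
    """
    a = I_t.astype(np.int32)
    b = I_t1.astype(np.int32)
    delta = np.abs(a - b)
    up    = np.abs(a - np.roll(b, -1, axis=0))  # I_{t+1} shifted up
    down  = np.abs(a - np.roll(b,  1, axis=0))  # shifted down
    left  = np.abs(a - np.roll(b, -1, axis=1))  # shifted left
    right = np.abs(a - np.roll(b,  1, axis=1))  # shifted right
    return delta, up, down, left, right
```

When a region moves down by one pixel between frames, the shifted comparison that undoes that motion produces a near-zero difference image, which is the directional cue the first filter type exploits.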
Then four types of rectangle filters are defined on these
five images. The first type of filter compares sums of absolute
differences between Δ and one of {U, L, R, D}; it extracts
information related to the likelihood that a particular region
is moving in a given direction. The second type
compares sums within the same motion image, with
rectangle filters similar to [13]; it measures something
closer to motion shear. The third type measures the
magnitude of a motion image by simply computing its
sum within the detection window. They also use appearance
filters, which operate on the first input image I_t. All the
filters can be evaluated rapidly using the integral image.
The training process then uses AdaBoost to select a subset
of features and construct a cascade classifier. This method
achieves good results.
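For completeness, the integral-image trick mentioned above can be sketched as follows. This is a minimal NumPy version with function names of our own choosing; any rectangle sum is evaluated with four array look-ups regardless of rectangle size:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row/column of zeros,
    so that ii[y, x] is the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (y, x),
    via the standard four-corner inclusion-exclusion."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```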
3. 3D HAAR-LIKE FEATURES
The success of Viola et al.'s algorithm [11] lies in
its use of the motion information between two consecutive
images. But when a person is moving slowly, the motion
pattern between two images is not obvious, so the
features from a two-frame difference are not very distinctive. In
order to capture the long-term motion patterns among
multiple frames and record the person's appearance features
at the same time, we extract Haar-like features from a series
of consecutive frames instead of from two frames.
Although we can obtain more motion information from
multiple frames, we cannot use all the frames of a video
to detect a person: doing so consumes a lot of time and is inapplicable
to a detection task. In addition, a target may not always
stay in one position throughout the whole video. Thus, we divide
videos into small space-time volumes, each of which contains
only a few frames and looks like a cubic window in a video
sequence. The space-time volumes in our method are similar
to the 2D search window in Viola et al.'s algorithm [11]. A
space-time volume can be seen as an independent, complete
unit from which various 3DHaar features are extracted. Our
goal is to judge whether a space-time volume
contains a person or not. In the next section, we give a
detailed description of the 3D Haar-like features.
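The sliding space-time window described above might be sketched as follows. The volume size and stride are illustrative choices of ours, not the paper's settings:

```python
import numpy as np

def spacetime_volumes(frames, vol_t, vol_h, vol_w, stride):
    """Slide a cubic window of size vol_t x vol_h x vol_w over a video
    (frames: a T x H x W array) and yield each space-time volume
    together with its (t, y, x) origin."""
    T, H, W = frames.shape
    for t in range(0, T - vol_t + 1, stride):
        for y in range(0, H - vol_h + 1, stride):
            for x in range(0, W - vol_w + 1, stride):
                yield (t, y, x), frames[t:t + vol_t,
                                        y:y + vol_h,
                                        x:x + vol_w]
```

Each yielded volume plays the role of the 2D search window in [11]: it is classified independently as containing a pedestrian or not.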
3.1. Detailed Description of 3D Haar-like Features
We give the detailed description of the 3DHaar features in this
section. 3DHaar features are extracted within a space-time
volume and can be seen as cubic filters. Specifically, we
adopt seven types of first-order 3DHaar features; see Figure 1
for details. For every type of cubic filter, the feature
value is the absolute difference of the pixel intensity sums
between the black and white regions. Unlike the 2D Haar-like
features, which use the signed difference value [13], we use only
the absolute value. This is because the intensity variations of
pedestrians due to clothes, articulation, and attachments are
more complicated, and their structures are less definite, than
those of faces. Similar cubic filters have been used in [14] for visual
event detection, but only filters (a), (b), and (g) are used there,
on the optical flow field.
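Evaluating a cubic filter reduces to cuboid sums, which a 3D integral volume provides in eight look-ups, by direct analogy with the 2D integral image. Below is a minimal sketch; the left/right split in `haar3d_type_a` is our assumption about what one static filter in Figure 1 might look like, not the paper's exact definition:

```python
import numpy as np

def integral_volume(vol):
    """3D summed-volume table: iv[t, y, x] is the sum of vol[:t, :y, :x]."""
    iv = np.zeros(tuple(s + 1 for s in vol.shape), dtype=np.int64)
    iv[1:, 1:, 1:] = vol.cumsum(0).cumsum(1).cumsum(2)
    return iv

def cuboid_sum(iv, t, y, x, dt, dy, dx):
    """Sum inside a dt x dy x dx cuboid with origin (t, y, x),
    via eight look-ups (3D inclusion-exclusion)."""
    return (iv[t + dt, y + dy, x + dx]
            - iv[t, y + dy, x + dx] - iv[t + dt, y, x + dx]
            - iv[t + dt, y + dy, x]
            + iv[t, y, x + dx] + iv[t, y + dy, x] + iv[t + dt, y, x]
            - iv[t, y, x])

def haar3d_type_a(iv, t, y, x, dt, dy, dx):
    """Illustrative static 3DHaar feature: absolute difference between
    the left and right halves of a cuboid, summed over all its frames."""
    half = dx // 2
    white = cuboid_sum(iv, t, y, x, dt, dy, half)
    black = cuboid_sum(iv, t, y, x + half, dt, dy, half)
    return abs(black - white)
```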
The cubic filters shown in Figure 1 (a), (b), and (c) are the
static features, which are similar to the 2D Haar-like
features used in [13]. They compare sums of the same
regions across the temporal coordinate only. Such features are used to
describe the pedestrian's appearance information.
The features shown in Figure 1 (d), (e), (f), and (g) are the
dynamic features. They compare sums of different
regions in temporal space. Take (d) for example: it
computes the difference between diagonal pairs of cubes
along the temporal dimension. Since the feature value is computed
over multiple frames, it can better describe the motion
information in the scene and capture more of the motion patterns
of pedestrians.
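One possible reading of filter (d) is a diagonal comparison in time and space: sum the early-left and late-right sub-cuboids, sum the early-right and late-left ones, and take the absolute difference. The exact pairing is defined by Figure 1, so the sketch below is only our interpretation:

```python
import numpy as np

def haar3d_type_d(vol):
    """Sketch of a dynamic 3DHaar feature in the spirit of filter (d):
    compare diagonal pairs of sub-cuboids across the temporal dimension.
    vol: a T x H x W space-time volume with even T and W (assumed)."""
    T, W = vol.shape[0], vol.shape[2]
    first, second = vol[:T // 2], vol[T // 2:]
    # diagonal pair 1: early-left + late-right; pair 2: early-right + late-left
    pair1 = first[:, :, :W // 2].sum() + second[:, :, W // 2:].sum()
    pair2 = first[:, :, W // 2:].sum() + second[:, :, :W // 2].sum()
    return abs(pair1 - pair2)
```

On a static volume the two diagonal sums cancel, while motion that shifts energy across the halves over time produces a large response, which matches the intended role of the dynamic filters.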