16-10-2012, 01:34 PM
Recognizing Human Action from a Far Field of View
Abstract
In this paper, we present a novel descriptor to characterize
human action when it is being observed from a far field
of view. Visual cues are usually sparse and vague under this
scenario. An action sequence is divided into overlapping
spatio-temporal volumes to make reliable and comprehensive
use of the observed features. Within each volume, we
represent successive poses by a time series of Histograms of
Oriented Gradients (HOG) and movements by a time series of
Histograms of Oriented Optical Flow (HOOF). Supervised
Principal Component Analysis (SPCA) is applied to select a
subset of discriminatively informative principal components
(PCs), reducing the dimension of the histogram vectors without
loss of accuracy. The final action descriptor is formed
by concatenating the sequences of SPCA-projected HOG and
HOOF features.
A Support Vector Machine (SVM) classifier is trained
to perform action classification. We evaluated our algorithm
on one normal-resolution and two low-resolution
datasets, and compared our results with those of
other reported methods. Using less than 1/5 the dimension
of a full-length descriptor, our method achieves
perfect accuracy on two of the datasets and performs
comparably to other methods on the third.
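The descriptor construction summarized above, per-frame HOG and HOOF histograms projected onto a reduced set of components and concatenated over time, can be sketched as follows. The function name, dimensions, and the use of a plain matrix projection are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def build_descriptor(hog_seq, hoof_seq, hog_proj, hoof_proj):
    """Project each per-frame HOG/HOOF histogram onto a reduced
    basis (e.g. selected principal components), then concatenate
    the two reduced time series into one action descriptor.
    hog_seq:  (T, D_hog) time series of HOG histograms
    hoof_seq: (T, D_hoof) time series of HOOF histograms
    hog_proj / hoof_proj: (D, k) projection matrices"""
    hog_reduced = hog_seq @ hog_proj      # (T, k_hog)
    hoof_reduced = hoof_seq @ hoof_proj   # (T, k_hoof)
    return np.concatenate([hog_reduced.ravel(), hoof_reduced.ravel()])

# toy example: 10 frames, 36-bin HOG and 8-bin HOOF histograms,
# reduced to 6 and 4 components respectively
rng = np.random.default_rng(0)
desc = build_descriptor(rng.random((10, 36)), rng.random((10, 8)),
                        rng.random((36, 6)), rng.random((8, 4)))
print(desc.shape)  # (100,) = 10*6 + 10*4
```

The resulting fixed-length vector is what would be fed to the SVM classifier.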
Introduction
Recognition of human actions from a distant view is
a challenging problem in computer vision. It is of significant
interest in many applications, such as automated
surveillance, aerial video analysis, sport video annotation
and search. Various visual cues have been shown to be effective
for representing human actions, including motion
[8, 9], contours [3, 12], extremities [22], and body parts
[5, 18], etc. Most of these features can be reliably extracted
from image sequences of medium to high-resolution.
Similar to [8], our goal is to recognize actions from video
sequences where human figures are less than 40 pixels in
height. This is usually the case when actions are being imaged
from a far field of view. Not only is the image resolution
greatly reduced, but the quality of visual cues is also
adversely affected by turbulence. As shown in Figure 1(a),
where a person is waving both hands with optical flow vectors
superimposed, the average width of the limbs is about 3 pixels
and the boundary between the body parts and the background is
vague. As a result, the computed optical flow is rather sparse
and noisy. In our problem, we find that
action classification with a single type of feature is easily
subject to background noise and missing features. Moreover,
some human actions cannot be fully characterized
by a single type of feature. For example, it
is difficult to distinguish ‘standing’ from ‘pointing’ using
optical flow alone. Therefore, instead of describing an action
by a single type of measure, we propose a novel descriptor
which combines both human pose and motion information
within a spatio-temporal volume.
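The motion half of such a descriptor, a histogram of oriented optical flow, can be sketched as below. This is a simplified illustration: the binning scheme (uniform over [-π, π), magnitude-weighted, L1-normalized) is an assumption, not necessarily the exact HOOF formulation used in the paper:

```python
import numpy as np

def hoof(flow_x, flow_y, n_bins=8):
    """Histogram of Oriented Optical Flow: bin flow vectors by
    orientation, weighted by magnitude, then L1-normalize so the
    histogram does not depend on how many pixels carry flow."""
    angles = np.arctan2(flow_y, flow_x)             # in [-pi, pi]
    mags = np.hypot(flow_x, flow_y)
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mags.ravel(),
                       minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# four unit-magnitude diagonal flow vectors, one per quadrant
fx = np.array([[1.0, -1.0], [-1.0, 1.0]])
fy = np.array([[1.0,  1.0], [-1.0, -1.0]])
h = hoof(fx, fy, n_bins=4)
print(h)  # each quadrant gets equal weight: [0.25 0.25 0.25 0.25]
```

Magnitude weighting means strong, reliable flow dominates the histogram, which helps when much of the computed flow is sparse and noisy.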
Related Work
The survey papers by Aggarwal and Cai [1], Gavrila
[10], and Hu et al. [13] provide an extensive review of algorithms
and systems for human tracking, motion analysis,
action representation, and behavior recognition. In this section,
we look specifically at work that addresses a similar
problem or adopts a similar representation.
Efros et al. [8] propose an optical flow based motion descriptor
for recognizing human action at a distance. Their
descriptor is formed by rectified optical flow components
in a spatio-temporal volume. They use a k-nearest-neighbor
classifier to perform action recognition and synthesis. As
mentioned before, the use of motion features alone is insufficient
to characterize certain ‘static’ actions. Moreover,
they compute the optical flow feature between figure-centric
frames, which implicitly removes the velocity information
of human movement.
In [17], Lu and Little employ the subspace-projected
HOG descriptor in a hybrid HMM classifier for the joint
task of athlete tracking and action recognition. The subspace
found by PCA provides an efficient representation of the
data, but it does not necessarily allow better separation of
descriptor vectors from different actions.
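One plausible way to make the projection label-aware, ranking principal components by a Fisher-style separability score and keeping only the best, is sketched below. This is a stand-in illustration of supervised PC selection; the SPCA formulation actually used in the paper may differ:

```python
import numpy as np

def select_discriminant_pcs(X, y, k):
    """Project data onto all principal components, then keep the k
    PCs with the highest Fisher score (between-class variance over
    within-class variance) of the projected coordinates.
    X: (n, d) descriptor vectors; y: (n,) class labels."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows = PCs
    Z = Xc @ Vt.T                                       # PC scores
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    within = np.array([Z[y == c].var(axis=0) for c in classes]).mean(axis=0)
    between = means.var(axis=0)
    score = between / (within + 1e-12)
    keep = np.argsort(score)[::-1][:k]
    return Vt[keep].T                                   # (d, k) projection

# toy data: only feature 3 separates the two classes
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
X[y == 1, 3] += 5.0
W = select_discriminant_pcs(X, y, k=2)
print(W.shape)  # (10, 2)
```

Unlike plain PCA, which keeps the highest-variance components regardless of labels, this selection favors components along which the action classes actually separate.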
Similar to our work, Ikizler et al. [14] use both human
contour and motion features for action recognition. They
characterize human contours by histograms of Hough transformed
edges, and use coarse orientation bins to compute
optical flow distribution. They train separate shape and motion
classifiers and combine both classification results by
averaging them. However, there is no evidence that shape
and motion features are equally useful for distinguishing actions.
Therefore, a linear combination of single-feature
classifiers may not be the optimal way to improve
the joint decision.
Preprocessing and action features
Preprocessing. Given a stabilized video with tracks
of human actors, the purpose of our preprocessing stage
is to acquire figure-centric action sequences from the
tracks. This step is critical, because in low-resolution
video frames, even a minor misalignment of a bounding
box can cause the loss of body parts or a large inclusion
of background. To overcome this difficulty, we take an
approach similar to [7] for human figure centralization.
The major difference is that, instead of searching for all
people in the entire frame, it is assumed that the person
of interest is somewhere around the track coordinate.
We train our figure centralization detector with HOG
descriptors extracted from manually cropped figure-centric
bounding boxes and negative samples from descriptors of
patches around the figures. During runtime, within the
neighborhood of interest, the detection window searches
in the space of scale and translation (Figure 1(b)). For
the scale and translation at which the SVM window
classifier yields the highest probability estimate, the
corresponding HOG vector and window coordinates
are stored. The recorded coordinates are then passed to the
calculation of HOOF.
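The search over scale and translation described above can be sketched as follows, with a toy scoring function standing in for the SVM window classifier (all names and values here are illustrative assumptions):

```python
def best_window(score_fn, center, scales, offsets):
    """Search candidate windows around a track coordinate over scale
    and translation, and return the window with the highest score.
    score_fn is assumed to map a window (x, y, scale) to a
    probability-like value, e.g. an SVM decision score."""
    best = None
    for s in scales:
        for dx in offsets:
            for dy in offsets:
                win = (center[0] + dx, center[1] + dy, s)
                score = score_fn(win)
                if best is None or score > best[1]:
                    best = (win, score)
    return best

# toy scorer: peaks when the window sits exactly on the figure
target = (52, 30, 1.0)
score = lambda w: -((w[0] - target[0]) ** 2 + (w[1] - target[1]) ** 2
                    + 10 * (w[2] - target[2]) ** 2)
win, s = best_window(score, center=(50, 31),
                     scales=[0.8, 1.0, 1.2], offsets=range(-3, 4))
print(win)  # (52, 30, 1.0)
```

Restricting the search to a neighborhood of the track coordinate keeps the number of scored windows small, which matters since every candidate requires a HOG extraction.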
Conclusions
When actions are being observed from a far field of view,
available visual cues from human figures are usually sparse
and vague. Therefore, action recognition algorithms that require
an exact description of human shape or motion are unlikely
to perform well in this setting.