06-12-2012, 06:43 PM
REAL-TIME FACE AND HAND DETECTION FOR VIDEOCONFERENCING ON A MOBILE DEVICE
ABSTRACT
The increase in processing power on modern mobile devices
allows for the implementation of more advanced image and
video processing algorithms, such as real-time videoconferencing.
In a videoconferencing setting, region of interest
encoding techniques can be applied to improve the quality
of the user’s face. In this work, three face detection techniques
are implemented on a mobile device and evaluated
in terms of accuracy and speed. A shape-based detection
algorithm achieves the fastest detection time of 165 msec,
but fails to accurately detect the face in all cases. Local
binary patterns and the Viola-Jones algorithm are both capable
of accurately detecting the face, but are significantly
slower. Several methods for increasing the speed of these
feature-based approaches are discussed. Finally, the results
of the face detection are applied to an H.264 video encoder
operating on the mobile device.
INTRODUCTION
Videoconferencing on mobile devices is becoming a possibility
as cellular network bandwidths are rapidly increasing.
Two-way video communication in this setting requires
real-time processing on a cellular device. While these devices
are more powerful than in the past, they still offer little
computational power when compared to modern desktop
computers. Slow processors constrain the complexity of the
algorithms that can be implemented in real-time on a mobile
device. Furthermore, the bandwidths available on a cellular
network are significantly smaller than those available
on a wired network. Consequently, advanced compression
techniques are required to generate video sequences that are
useful to the end users.
DETECTION ALGORITHMS
In both videoconferencing and ASL video telephony, encoding
only the relevant portions of the sequence at a high
quality can yield significant gains in compression. This
improved compression is essential for meeting the bandwidth
constraints of cellular networks, but requires additional
computational complexity for identifying those relevant
regions. In this section, the face and hands of an individual
are identified through the use of skin segmentation
and face detection algorithms. Based on the detected locations
of the face and hands, the 16x16 macroblocks in the
video are labeled as either face, hand, or background.
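The macroblock labeling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the masks, the numeric label values, and the face-over-hand priority rule are assumptions made for the example.

```python
import numpy as np

def label_macroblocks(face_mask, hand_mask, mb_size=16):
    """Label each 16x16 macroblock as face (2), hand (1), or background (0).

    face_mask / hand_mask are boolean pixel masks of shape (H, W). A block
    is labeled with a class if any of its pixels carry that class; face
    takes priority over hand (an assumed tie-breaking rule).
    """
    h, w = face_mask.shape
    rows, cols = h // mb_size, w // mb_size
    labels = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            block = (slice(r * mb_size, (r + 1) * mb_size),
                     slice(c * mb_size, (c + 1) * mb_size))
            if face_mask[block].any():
                labels[r, c] = 2
            elif hand_mask[block].any():
                labels[r, c] = 1
    return labels
```

For a QCIF frame (176x144) this produces an 11x9 grid of labels, one per macroblock, which is the granularity at which an H.264 encoder assigns quantization parameters.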
Color- and shape-based face detection
Face detection can be performed using shape and color information
extracted from the image [1]. Skin pixels have a
color distribution that is distinct from non-skin pixels [8].
Skin detection is performed in the YUV color space. Because
the H.264 encoder also operates within this color space,
no color conversion is required to perform the skin detection.
The chrominance values (U and V) of skin pixels are
modeled as a bivariate Gaussian distribution. The mean μ
and covariance matrix Σ of the distribution are estimated
from a sample set of skin pixels. Skin-color segmentation
is implemented by thresholding the squared Mahalanobis distance,
D²_M(x) = (x − μ)ᵀ Σ⁻¹ (x − μ), between a given pixel's
chrominance values x and the skin pixel distribution.
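A minimal sketch of this skin segmentation follows. The Gaussian parameters and the threshold value here are illustrative placeholders; in practice μ and Σ come from labeled skin samples and the threshold is tuned.

```python
import numpy as np

def skin_mask(u, v, mu, cov, threshold=6.0):
    """Threshold the squared Mahalanobis distance of each pixel's (U, V)
    chrominance pair against a bivariate Gaussian skin model.

    u, v: (H, W) chrominance planes. mu: (2,) mean, cov: (2, 2) covariance,
    both estimated from sample skin pixels. threshold is illustrative.
    """
    x = np.stack([u, v], axis=-1).astype(np.float64) - mu   # (H, W, 2)
    inv = np.linalg.inv(cov)
    # Squared Mahalanobis distance per pixel: (x - mu)^T Sigma^-1 (x - mu)
    d2 = np.einsum('hwi,ij,hwj->hw', x, inv, x)
    return d2 < threshold
```

Because the comparison uses only the U and V planes already present in the encoder's YUV buffers, no color conversion is needed, which is the point made in the text above.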
Hand Detection
While simply identifying the face region may be sufficient
for generic videoconferencing, further processing must be
done for American Sign Language (ASL) video. In ASL,
information is conveyed through both facial expressions and
hand gestures. In order to optimally encode ASL videos, the
hands must also be identified. Following both skin segmentation
and face detection, the signer’s hands are identified
as the large skin clusters not corresponding to the signer’s
face.
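The cluster-selection step above can be sketched with a simple 4-connected flood fill. This is an assumed implementation: the connectivity, the face-box representation, and the minimum cluster area are choices made for the example, not details from the paper.

```python
import numpy as np

def find_hands(skin, face_box, min_area=50):
    """Return large skin clusters that do not overlap the detected face box.

    skin: (H, W) boolean mask. face_box: (r0, r1, c0, c1) half-open bounds.
    min_area is an illustrative threshold for "large" clusters.
    """
    h, w = skin.shape
    visited = np.zeros_like(skin, dtype=bool)
    r0, r1, c0, c1 = face_box
    hands = []
    for sr in range(h):
        for sc in range(w):
            if skin[sr, sc] and not visited[sr, sc]:
                # Flood-fill one connected skin cluster.
                stack = [(sr, sc)]
                visited[sr, sc] = True
                pixels = []
                while stack:
                    r, c = stack.pop()
                    pixels.append((r, c))
                    for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                        if (0 <= nr < h and 0 <= nc < w
                                and skin[nr, nc] and not visited[nr, nc]):
                            visited[nr, nc] = True
                            stack.append((nr, nc))
                # Keep clusters that are large and lie outside the face box.
                in_face = any(r0 <= r < r1 and c0 <= c < c1 for r, c in pixels)
                if len(pixels) >= min_area and not in_face:
                    hands.append(pixels)
    return hands
```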
ACCURACY AND COMPUTATIONAL RESULTS
The algorithms described in Section 2 are implemented on
an HTC Apache PocketPC with an Intel PXA270 processor
running at 416 MHz, with 64 MB RAM, and a 240x320
LCD display. The device runs the Windows Mobile operating
system. Three test videos of American Sign Language
are used for the evaluation. Two of the videos were
recorded using professional video equipment and downsampled
to QCIF resolution (176x144) at 10 frames per second.
One of these videos was recorded indoors in a studio, the
other was recorded outdoors. The third video was captured
using the camera on the PocketPC while being held by the
signer. It was downsampled from QVGA to a resolution of
160x120 at 15 frames per second.
One of the primary factors controlling the speed of the
feature-based face detection algorithms is the number of
image scales included in the search space. A large number
of scales ensures that faces of any size will be found, but
each scale adds a significant amount of computation time.
The number of scales is limited by controlling the scaling
factor and the minimum/maximum expected face size in the
image. In this implementation, the scaling factor was set
to 1.25, the maximum face size was set to 60% of the image
width, and the minimum face size was set to 15% of
the image width. Also, at each image scale, the search is
performed at every other pixel position (a step size of two).
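The search-space restriction above can be made concrete by enumerating the scales those settings imply. The detector's native window size (20 px here) is an assumption; the paper states only the scale factor and the face-size bounds.

```python
def search_scales(image_width, scale_factor=1.25, min_frac=0.15,
                  max_frac=0.60, base_window=20):
    """Enumerate detector scales for faces between min_frac and max_frac
    of the image width, stepping by scale_factor (settings from the text).

    base_window is the detector's native window size in pixels (assumed).
    Returns the list of scale multipliers relative to base_window.
    """
    scales = []
    size = min_frac * image_width          # smallest face size searched
    while size <= max_frac * image_width:  # up to the largest face size
        scales.append(size / base_window)
        size *= scale_factor
    return scales
```

For a QCIF frame (176 px wide) these settings yield only seven scales, versus the dozens a full pyramid from window size to image width would require, which together with the two-pixel step is where the speedup comes from.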
CONCLUSION
This work analyzes low complexity methods for identifying
face and hand regions in a mobile video telephony setting.
Shape-based processing is the most computationally
efficient method for identifying the face and hands, but cannot
adequately identify these regions in the presence of skin-colored
backgrounds. In these noisy environments, feature-based
face detection techniques are applied to the segmentation
task. The Viola-Jones algorithm achieves 90% detection
rates with almost no false positives. The feature-based
techniques are further optimized by restricting the search
space based on the location of skin pixels in the current
frame or the face in previous frames. The detection algorithms
provide an H.264 encoder with a macroblock-level
map of the face and hands, allowing for the use of
region-of-interest encoding techniques.