DigitEyes: Vision-Based Hand Tracking for Human-Computer Interaction
Abstract
Computer sensing of hand and limb motion is an important problem for applications in Human-Computer Interaction (HCI), virtual reality, and athletic performance measurement. Commercially available sensors are invasive, and require the user to wear gloves or targets. We have developed a noninvasive vision-based hand tracking system, called DigitEyes. Employing a kinematic hand model, the DigitEyes system has demonstrated tracking performance at speeds of up to 10 Hz, using line and point features extracted from gray scale images of unadorned, unmarked hands. We describe an application of our sensor to a 3D mouse user-interface problem.
1 Introduction
A "human sensor" capable of tracking a person's spatial motion using techniques from Computer Vision would be a powerful tool for human-computer interfaces. Such a sensor could be located in the user's environment (rather than on their person) and could operate under natural conditions of lighting and dress, providing a degree of convenience and flexibility that is currently unavailable. For the purpose of visual sensing, human hands and limbs can be modeled as articulated mechanisms, systems of rigid bodies connected together by joints with one or more degrees of freedom (DOF's). This model can be applied at a fine (visual) scale to describe hand motion, and at a coarser scale to describe the motion of the entire body. Based on this observation, we formulate human sensing as the real-time visual tracking of articulated kinematic chains.

(Supported by the NASA George Marshall Space Flight Center, GSRP Grant NGT-50559.)
Although many frameworks for human motion analysis are possible, our approach has four main advantages. First, by tracking all of the hand's DOF's, we provide the user with maximum flexibility for interface applications. (See [15, 6] for examples of interfaces requiring a whole-hand sensor.) In addition, our general modeling approach based on 3D kinematics makes it possible to track any subset of hand or body states with the same basic algorithm. Another benefit of full state tracking is invariance to unused hand motions. The motion of a particular finger, for example, can be recognized from its joint angles regardless of the pose of the palm relative to the camera. Finally, by modeling the hand kinematics in 3D we eliminate the need for application- or viewpoint-dependent user modeling.
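To make the invariance point concrete, the sketch below shows how a finger gesture could be read directly from estimated joint angles, with no reference to the palm's pose. The function name and threshold are illustrative choices of our own, not values from the paper.

```python
# Hypothetical sketch: because the tracker estimates 3D joint angles
# directly, a finger gesture can be detected from those angles alone,
# independent of the palm's pose relative to the camera.
import numpy as np

CURL_THRESHOLD = 1.2  # radians; assumed value for a "curled" flexion angle

def finger_is_curled(flexion_angles):
    """flexion_angles: the three in-plane joint angles of one finger (radians)."""
    return float(np.mean(flexion_angles)) > CURL_THRESHOLD

# The same test works for any palm pose, because palm rotation and
# translation occupy separate entries of the state vector.
print(finger_is_curled([1.3, 1.4, 1.1]))  # True
```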
The DigitEyes system treats hand tracking as a model-based sequential estimation problem: given a sequence of images and a hand model, we estimate the 3D hand configuration in each frame. All possible hand configurations are represented by vectors in a state space, which encodes the pose of the palm (six rotation and translation DOF's(1)) and the joint angles of the fingers (four states per finger, five for the thumb). Each hand configuration generates a set of image features, 2D lines and points, by projection through the camera model. A feature measurement process extracts these hand features from grey-scale images by detecting the occluding boundaries of finger links and tips. The state estimate for each image is computed by finding the state vector that best fits the measured features. Our basic tracking framework is similar to that of [4, 7, 16].

(1) We use quaternions to represent palm rotation, resulting in a model with four rotational states, one more than the number of DOF's. Although quaternions are a nonminimal representation, they have the advantage of being free of singularities.
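As a concrete illustration of this state space, the following sketch assembles a state vector under the layout just described: a 3-vector of palm translation, a 4-component unit quaternion, and 21 joint angles (four per finger plus five for the thumb). The helper names are ours, not the paper's.

```python
# A minimal sketch of the hand state vector, assuming the layout stated
# in the text: 6 palm DOF's (rotation stored as a 4-component quaternion,
# so 7 numbers) plus 21 finger joint angles.
import numpy as np

N_JOINT_STATES = 4 * 4 + 5          # four fingers + thumb = 21 joint angles
STATE_DIM = 3 + 4 + N_JOINT_STATES  # translation + quaternion + joints = 28

def make_state(translation, quaternion, joint_angles):
    q = np.asarray(quaternion, dtype=float)
    q /= np.linalg.norm(q)  # quaternions are nonminimal; keep unit length
    return np.concatenate([translation, q, joint_angles])

x = make_state([0.0, 0.0, 0.5], [1, 0, 0, 0], np.zeros(N_JOINT_STATES))
assert x.shape == (STATE_DIM,)
```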
Articulated mechanisms are more difficult to track than the single rigid objects traditionally addressed in Computer Vision. Three major difficulties are the large size of the state space, nonlinearities in the state-to-feature mapping (called the measurement model), and self-occlusions. Finger articulations add an additional 21 states over the rigid body motion of the palm, significantly increasing the computational cost of estimation. These additional states are parameterized by joint angles, which introduce nonlinearities and kinematic singularities into the measurement model. Singularities arise when a small change in a given state has no effect on the image features. In addition to these problems, the fingers occlude each other and the palm during motion, making feature measurement difficult.
The DigitEyes system uses local search and linearization to deal with the large state space and nonlinear measurement model. The key to our local, gradient-based approach to tracking is a high image acquisition rate (10 Hz), which limits the change in the hand state, and therefore in image feature locations, between frames. In the state space, we exploit this locality by linearizing the nonlinear measurement model around the previous estimate. Techniques from robotics provide for fast computation of the necessary kinematic Jacobian. Kinematic singularities are dealt with by stabilizing the state estimator. The resulting linear estimation problem is solved for each frame, producing a sequence of state corrections which are integrated over time to yield an estimated state trajectory.
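The sketch below illustrates one frame of this kind of linearized, stabilized update. It uses a generic numerical Jacobian and a simple damping term as stand-ins; the paper computes the kinematic Jacobian analytically using robotics techniques, and its exact stabilization may differ.

```python
# A minimal sketch of a linearized, stabilized state update, assuming a
# generic measurement function h(x) mapping state to stacked features.
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Finite-difference stand-in for the analytic kinematic Jacobian."""
    y0 = h(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (h(x + dx) - y0) / eps
    return J

def track_step(h, x_prev, z, damping=1e-3):
    """One frame: linearize h about the previous estimate and solve a
    damped (stabilized) linear least-squares problem for the correction."""
    J = numerical_jacobian(h, x_prev)
    residual = z - h(x_prev)
    # Damping keeps the normal equations well conditioned near kinematic
    # singularities, where some columns of J vanish.
    A = J.T @ J + damping * np.eye(x_prev.size)
    dx = np.linalg.solve(A, J.T @ residual)
    return x_prev + dx  # corrections integrate into a state trajectory
```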
As a result of the high image sampling rate, the change in hand features between frames is also small. For a given image, the state estimate from the previous frame is used to predict feature positions. Feature detectors, initialized to these predictions, exploit the symmetry of the finger links to extract lines and points and match them to the hand model. In the first image of the sequence, the user places his hand in a known starting configuration to initialize tracking. In the current system, each finger link detects its features independently of the others, which limits the sensor to hand motions without occlusions. We are extending our feature processing approach to remove this limitation.
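A rough sketch of this prediction-driven measurement step is given below; `project` and `detect_near` are hypothetical placeholders for the model projection and the local boundary detector described above, not functions from the paper.

```python
# A hedged sketch of prediction-driven feature measurement: the previous
# state predicts where each link's feature should appear, and a local
# detector searches only a small window around that prediction.
def measure_features(image, x_prev, project, detect_near, window=15):
    """project: maps a state to predicted 2D features, keyed by finger link.
    detect_near: searches `image` near one prediction; returns a feature
    or None. Matching is implicit: each detection inherits the model link
    of the prediction that seeded it."""
    measurements = {}
    for link_id, predicted in project(x_prev).items():
        found = detect_near(image, predicted, window)
        if found is not None:
            measurements[link_id] = found
    return measurements
```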
In [12], we described the DigitEyes system in detail, and gave results of tracking a 27 DOF hand model from a two camera image sequence under orthographic projection. This paper describes an extension to perspective projection, and gives a detailed example of a user-interface based on our sensor: a 3D graphical mouse. While difficult problems still remain in tracking through occlusions and across complicated backgrounds, these results demonstrate the potential of our approach to vision-based human motion sensing.
2 Previous Work
Previous work on tracking general articulated objects includes [8, 16, 10, 9]. In [16], Yamamoto and Koshikawa describe a system for human body tracking using kinematic and geometric models. They give an example of tracking a single human arm and torso using optical flow features. Pentland and Horowitz [10] give an example of tracking the motion of a human figure using optical flow and an articulated deformable model. In a related approach, Metaxas and Terzopoulos [8] track articulated motion using deformable superquadric models. In [4], Dorner describes a system for interpreting American Sign Language from image sequences of a single hand. Dorner's system uses the full set of the hand's DOF's, and employs a glove with colored markers to simplify feature extraction. One of the earliest systems, by O'Rourke and Badler [9], analyzed human body motion using constraint propagation. None of these earlier approaches based on articulated models have demonstrated real-time tracking results for the full state of a complicated mechanism like the human hand, using natural image features.
In addition to previous work on articulated object tracking, many authors have applied general vision techniques to human motion analysis. In contrast to DigitEyes, these approaches analyze a subset of the total hand motion, such as a set of gestures [2, 13] or the rigid motion of the palm [1]. Darrell and Pentland describe a system for learning and recognizing dynamic hand gestures in [2]. Related work by Segen [13] takes a neural network approach to 2D hand gesture recognition. Both of these approaches work in real-time on unmarked hand images, but they do not produce 3D motion estimates, and it would be difficult to apply them to problems like the 3D mouse interface in Subsect. 5. In [1], Blake et al. describe a real-time contour tracking system that can follow the silhouette of a rigidly moving hand under an affine motion model.
[Figure 1: Kinematic models, illustrated for the fourth finger and thumb. The arrows illustrate the joint axes for each link in the chain. The diagram shows the palm with anchor points, the joint angles of each chain, and a side view of the fourth finger's three links.]
3 Hand Modeling for Visual Tracking
This section is a brief summary of our kinematic and feature models of the hand. See [12] for more detail. We use the Denavit-Hartenberg (DH) representation, widely used in robotics [14], to describe the hand kinematics. We make several simplifying assumptions in modeling the hand, which are illustrated in Fig. 1. First, we assume that each of the four fingers of the hand is a planar mechanism with four degrees of freedom (DOF's). The abduction DOF moves the plane of the finger relative to the palm, while the remaining three DOF's determine the finger's configuration within the plane. Each finger has an anchor point, which is the position of its base joint center in the frame of the palm, which is assumed to be rigid. Real fingers deviate from these modeling assumptions slightly, but we have found the assumptions to be adequate in practice.
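To make these modeling assumptions concrete, the sketch below computes finger joint centers in the palm frame from one abduction angle and three in-plane flexion angles. The rotation conventions and argument names are our own illustrative choices, not the paper's DH parameter assignments.

```python
# A minimal forward-kinematics sketch of the planar-finger model: one
# abduction angle rotates the finger's plane relative to the palm, and
# three flexion angles place the links within that plane.
import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def finger_joint_centers(anchor, abduction, flexions, link_lengths):
    """Returns the 3D joint centers of one finger in the palm frame."""
    R_plane = rot_z(abduction)          # abduction moves the finger's plane
    p = np.asarray(anchor, dtype=float) # base joint center in the palm frame
    points, bend = [p], 0.0
    for theta, length in zip(flexions, link_lengths):
        bend += theta                   # flexion accumulates within the plane
        p = p + R_plane @ rot_x(bend) @ np.array([0.0, length, 0.0])
        points.append(p)
    return points
```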
Hand features consist of lines and points generated by the projection of the hand model into the image plane. Each finger link, modeled by a cylinder, generates a pair of lines in the image corresponding to its occlusion boundaries. The bisector of these lines, which contains the projection of the cylinder's central axis, is used as the link feature. The link feature vector [a b c] gives the parameters of the line equation ax + by + c = 0.
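As an illustration, the sketch below forms such a link feature from two detected boundary lines, normalizing each line so that a^2 + b^2 = 1. The averaging rule is one plausible choice for the bisector, not necessarily the paper's exact computation.

```python
# A hedged sketch of the link feature: given the two occlusion-boundary
# lines of a cylindrical link, form the bisector line that approximately
# contains the projected central axis, as [a, b, c] with ax + by + c = 0.
import numpy as np

def normalize_line(line):
    line = np.asarray(line, dtype=float)
    return line / np.linalg.norm(line[:2])  # force a^2 + b^2 = 1

def link_feature(boundary1, boundary2):
    l1, l2 = normalize_line(boundary1), normalize_line(boundary2)
    if np.dot(l1[:2], l2[:2]) < 0:  # orient the two normals consistently
        l2 = -l2
    # With unit normals, a*x + b*y + c is a signed distance, so the sum of
    # the two line vectors vanishes exactly midway between the boundaries.
    return normalize_line(l1 + l2)

axis = link_feature([1.0, 0.0, -10.0], [1.0, 0.0, -14.0])
print(axis)  # [1, 0, -12]: the midline of the boundaries x=10 and x=14
```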