Gesture Based Interaction
Introduction
Gestures and gesture recognition are terms increasingly encountered in discussions of
human-computer interaction. For many (if not most) people the term includes character
recognition, the recognition of proofreaders' symbols, shorthand, and all of the types of
interaction described in the previous chapter, Marking Interfaces. In fact every physical
action involves a gesture of some sort in order to be articulated. Furthermore, the nature of
that gesture is generally an important component in establishing the quality of feel to the
action. Nevertheless, what we want to isolate for discussion in this chapter are interactions
where the gesture is what is articulated and recognized, rather than a consequence of
expressing something through a transducer. Thus we use the definition of gesture
articulated by Kurtenbach and Hulteen (1990):
“A gesture is a motion of the body that contains information. Waving goodbye is a gesture.
Pressing a key on a keyboard is not a gesture because the motion of a finger on its way to
hitting a key is neither observed nor significant. All that matters is which key was pressed”.
And, of course, this is true regardless of the gesture that was used to push the key. It
could have been pushed lovingly or in anger. Either could be easily sensed by an
observer. But both are irrelevant to the computer, which only cares about what key was
pushed when.
The type of communication that we are discussing here is far richer in many ways than
what we have been dealing with. Consequently, it is not hard to understand why this use
of gesture requires a different class of input devices than we have seen thus far. For the
most part, gestures, as we discuss them, involve a far higher number of degrees of
freedom than we have been looking at. Trying to do gesture recognition by using a mouse
or some other “single point” device for gestural interaction restricts the user to the gestural
vocabulary of a fruit fly! You may still be able to communicate, but your gestural repertoire
will be seriously constrained.
(The primary author of this chapter is Mark Billinghurst.)
The first step in considering gesture based interaction with computers is to understand the
role of gesture in human to human communication. In the next section we review the
psychology and anthropology literature to categorize the types of gestures that are
commonly made and their attributes. In the remainder of the chapter we use these
categories to discuss gesture based interfaces, from symbolic gesture systems to
multimodal conversational interfaces. We end with a discussion of future research
directions, in particular reactive environments in which the user’s entire surroundings are able
to understand their voice and gestural commands and respond accordingly.
Gestures in the Everyday World
If we remove ourselves from the world of computers and consider human-human
interaction for a moment we quickly realize that we utilize a broad range of gesture in
communication. The gestures that are used vary greatly among contexts and cultures
(Morris, Collet, Marsh & O’Shaughnessy 1980) yet are intimately related to communication.
This is shown by the fact that people gesticulate just as much when talking on the phone,
where they cannot see each other, as they do in face-to-face conversation (Rime 1982).
Gestures can exist in isolation or involve external objects. Free of any object, we wave,
beckon, fend off, and to a greater or lesser degree (depending on training) make use of
more formal sign languages. With respect to objects, we have a broad range of gestures
that are almost universal, including pointing at objects, touching or moving objects,
changing object shape, activating objects such as controls, or handing objects to others.
This suggests that gestures can be classified according to their function. Cadoz (1994)
uses function to group gestures into three types:
• semiotic: those used to communicate meaningful information;
• ergotic: those used to manipulate the physical world and create artifacts;
• epistemic: those used to learn from the environment through tactile or haptic exploration.
Within these categories there may be further classifications applied to gestures. Mulder
(1996) provides a summary of several different classifications, especially with respect to
semiotic gestures.
In this chapter we are primarily interested in how gestures can be used to communicate
with a computer, so we will be mostly concerned with empty handed semiotic gestures.
These can further be categorized according to their functionality. Rime and Schiaratura
(1991) propose the following gesture taxonomy:
• Symbolic gestures: These are gestures that, within each culture, have come to have a
single meaning. An emblem such as the “OK” gesture is one example; American Sign
Language gestures also fall into this category.
• Deictic gestures: These are the types of gestures most generally seen in HCI and are
the gestures of pointing, or otherwise directing the listener’s attention to specific events or
objects in the environment. They are the gestures made when someone says “Put that
there”.
• Iconic gestures: As the name suggests, these gestures are used to convey information
about the size, shape or orientation of the object of discourse. They are the gestures
made when someone says “The plane flew like this”, while moving their hand through the
air like the flight path of the aircraft.
• Pantomimic gestures: These are the gestures typically used in showing the use of
movement of some invisible tool or object in the speaker’s hand. When a speaker says “I
turned the steering wheel hard to the left”, while mimicking the action of turning a wheel
with both hands, they are making a pantomimic gesture.
To this taxonomy McNeill (1992) adds two types of gestures which relate to the process of
communication: beat gestures and cohesives. Beat or baton gestures are so named
because the hand moves up and down with the rhythm of speech and looks like it is
beating time. Cohesives, on the other hand, are variations of iconic, pantomimic or deictic
gestures that are used to tie together temporally separated but thematically related portions
of discourse.
Gesture is also intimately related to speech, both in its reliance on the speech channel for
interpretation, and in its own speech-like qualities. Only the first class of gestures,
symbolic, can be interpreted alone without further contextual information. Either this
context has to be provided sequentially by another gesture or action, or by speech input in
concert with the gesture. So these gesture types can also be categorized according to their
relationship with speech:
• Gestures that evoke the speech referent: Symbolic, Deictic
• Gestures that depict the speech referent: Iconic, Pantomimic
• Gestures that relate to conversational process: Beat, Cohesive
The need for a speech channel for understanding varies according to the type of gesture.
Thus gesture types can be ordered according to their speech/gesture dependency. This is
described in Kendon’s Continuum (Kendon 1988):
Gesticulation (Beat, Cohesive) -> Language-Like (Iconic) -> Pantomimes (Pantomimic) -> Emblems (Deictic) -> Sign Language (Symbolic)
Progressing from left to right, the necessity of accompanying speech to understand the
gesture declines, the gestures become more language-like, and idiosyncratic gestures are
replaced by socially regulated signs. For example, sign languages share enough of the
syntactic and semantic features of speech that they don’t require an additional speech
channel for interpretation. However, iconic gestures cannot be understood without
accompanying speech.
In contrast to this rich gestural taxonomy, current interaction with computers is almost
entirely free of gestures. The dominant paradigm is direct manipulation, yet we may
wonder how direct such systems really are when they are so restricted in the ways
that they engage our everyday skills. This deficiency is made obvious when we consider
how proficient humans are at using gestures in the everyday world and then consider how
few of these gestures can be used in human-computer interaction and how long it takes to
learn the input gestures that computers can understand. Even the most advanced gestural
interfaces typically implement only symbolic or deictic gesture recognition. However, this
need not be the case. In the remainder of the chapter we move along Kendon’s Continuum
from right to left, reviewing computer interfaces from each of three categories: gesture-only
interfaces, gesture-and-speech interfaces, and conversational interfaces.
As we shall see from this review, one of the compelling reasons for using gesture at the
interface is because of its relationship to the concepts of chunking and phrasing. In
chapter seven we described how the most intuitive interfaces match the phrase structure of
the human-computer dialogue with the cognitive chunks the human should be learning.
Unintuitive interfaces require simple conceptual actions to be broken up into compound
tasks; for example, a Move action that requires separate Cut and Paste commands. In
contrast, gesture based interfaces allow the use of natural gestural phrases that chunk the
input dialog into units meaningful to the application. This is especially the case when voice
input is combined with gesture, allowing the user to exactly match their input modalities to
the cognitive chunks of the task. For example, saying the command “move the ball like this”
while showing the path of the ball with an iconic gesture specifies both a command and its
relevant parameters in a single cognitive chunk.
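As a rough illustration of this kind of chunking, the following sketch (in Python; the MultimodalCommand structure, the fuse function, and the token handling are illustrative inventions, not part of any system described in this chapter) shows how a spoken command and a simultaneously captured hand trajectory might be fused into a single command object:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MultimodalCommand:
    # One cognitive chunk: a verb, its object, and the gesture that parameterises it
    verb: str                        # e.g. "move", from the speech recogniser
    target: str                      # e.g. "ball", resolved from the utterance (or a deictic gesture)
    path: List[Tuple[float, float]]  # sampled hand trajectory supplying the "like this" part

def fuse(speech_tokens: List[str], hand_samples: List[Tuple[float, float]]) -> MultimodalCommand:
    # Combine a parsed utterance such as "move the ball like this" with the hand
    # trajectory captured while it was spoken, yielding a single command object.
    verb = speech_tokens[0]
    # crude object extraction: the first token that is not a function word
    target = next(t for t in speech_tokens[1:] if t not in {"the", "a", "like", "this"})
    return MultimodalCommand(verb=verb, target=target, path=hand_samples)

cmd = fuse(["move", "the", "ball", "like", "this"],
           [(0.0, 0.0), (0.3, 0.5), (0.9, 0.4)])
print(cmd.verb, cmd.target, len(cmd.path), "path samples")

The point of the sketch is simply that the verb, its object, and its gestural parameters arrive together as one unit, rather than as a sequence of separate commands.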
Gesture Only Interfaces
The gestural equivalents of direct manipulation interfaces are those which use gesture
alone. These can range from interfaces that recognize a few symbolic gestures to those
that implement fully fledged sign language interpretation. Similarly, interfaces may
recognize static hand poses, dynamic hand motion, or a combination of both. In all
cases each gesture has an unambiguous semantic meaning associated with it that can be
used in the interface. In this section we will first briefly review the technology used to
capture gesture input, then describe examples from symbolic and sign language
recognition. Finally we summarize the lessons learned from these interfaces and provide
some recommendations for designing gesture only applications.
Tracking Technologies
Gesture-only interfaces with a syntax of many gestures typically require precise hand pose
tracking. A common technique is to instrument the hand with a glove equipped
with a number of sensors that provide information about hand position, orientation, and
flex of the fingers. The first commercially available hand tracker, the Dataglove, is
described in Zimmerman, Lanier, Blanchard, Bryson and Harvill (1987), and illustrated in
the video by Zacharey, G. (1987). This uses thin fiber optic cables running down the back
of each hand, each with a small crack in it. Light is shone down the cable so that when the
fingers are bent, light leaks out through the cracks. Measuring the light loss gives an accurate
reading of hand pose. The Dataglove could measure each joint bend to an accuracy of 5 to
10 degrees (Wise et al. 1990), but not the sideways movement of the fingers (finger
abduction). The CyberGlove developed by Kramer (Kramer 1989), however, uses strain
gauges placed between the fingers to measure abduction as well as more accurate bend
sensing (Figure 1). Since the development of the Dataglove and CyberGlove, many other
glove based input devices have appeared as described by Sturman and Zeltzer (1994).
Figure 1: The CyberGlove
The CyberGlove captures the position and movement of the fingers and wrist. It has up to
22 sensors, including three bend sensors (including the distal joints) on each finger, four
abduction sensors, plus sensors measuring thumb crossover, palm arch, wrist flexion and
wrist abduction. (Photo: Virtual Technologies, Inc.)
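As a rough sketch of how raw readings from such a glove might be turned into joint angles, the following Python fragment applies a simple two-point linear calibration per sensor (open hand versus closed fist); the sensor count, reading ranges, and angle limits are illustrative assumptions rather than properties of the Dataglove or CyberGlove:

import numpy as np

def make_calibration(raw_flat: np.ndarray, raw_fist: np.ndarray,
                     angle_flat: float = 0.0, angle_fist: float = 90.0):
    # Return a function mapping raw per-sensor readings to joint angles in degrees,
    # using a two-point (open hand / closed fist) linear calibration for each sensor.
    span = (raw_fist - raw_flat).astype(float)
    span[span == 0] = 1e-6                      # guard against dead sensors
    def to_angles(raw: np.ndarray) -> np.ndarray:
        t = (raw - raw_flat) / span             # 0.0 = fully flat, 1.0 = fully bent
        return angle_flat + t * (angle_fist - angle_flat)
    return to_angles

# hypothetical readings from a 22-sensor glove
flat_readings = np.full(22, 100.0)              # raw values with the hand held flat
fist_readings = np.full(22, 600.0)              # raw values with the hand in a fist
to_angles = make_calibration(flat_readings, fist_readings)
print(to_angles(np.full(22, 350.0))[:5])        # roughly 45 degrees for each joint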
Once hand pose data has been captured by the gloves, gestures can be recognized using
a number of different techniques. Neural network approaches and statistical template
matching are commonly used to identify static hand poses, often achieving accuracy rates of
better than 95% (Väänänen and Böhm 1993). Time dependent neural networks may also
be used for dynamic gesture recognition [REF], although a more common approach is to
use Hidden Markov Models. With this technique Kobayashi is able to achieve an accuracy
of XX% (Kobayashi et al. 1997); similar results have been reported by XX and XX.
Hidden Markov Models may also be used to interactively segment glove input into
individual gestures for recognition and to perform online learning of new gestures (Lee 1996).
In these cases gestures are typically recognized using pre-trained templates; however,
gloves can also be used to identify natural or untrained gestures. Wexelblat uses a
top-down and bottom-up approach to recognize natural gestural features such as finger
curvature and hand orientation, and temporal integration to produce frames describing
complete gestures (Wexelblat 1995). These frames can then be passed to higher level
functions for further interpretation.
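A minimal sketch of the statistical template matching mentioned above, assuming the glove delivers a vector of joint angles: each stored pose is represented by a mean joint-angle template, and an unknown pose is assigned to the nearest template. The pose names, angle values, and rejection threshold below are invented for illustration, not taken from the systems cited:

import numpy as np

# Hypothetical template library: one mean joint-angle vector per static pose,
# e.g. averaged over several recorded examples of that pose from the glove.
POSE_TEMPLATES = {
    "point": np.array([ 5.0,  5.0, 80.0, 85.0, 80.0, 85.0, 80.0, 85.0]),  # index straight, rest curled
    "fist":  np.array([80.0, 85.0, 80.0, 85.0, 80.0, 85.0, 80.0, 85.0]),
    "open":  np.array([ 5.0,  5.0,  5.0,  5.0,  5.0,  5.0,  5.0,  5.0]),
}

def classify_pose(joint_angles: np.ndarray, reject_threshold: float = 40.0) -> str:
    # Nearest-template classification: return the closest pose name, or
    # "unknown" if no stored template is within the rejection threshold.
    best_name, best_dist = "unknown", float("inf")
    for name, template in POSE_TEMPLATES.items():
        dist = np.linalg.norm(joint_angles - template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < reject_threshold else "unknown"

print(classify_pose(np.array([8.0, 3.0, 78.0, 82.0, 84.0, 80.0, 79.0, 88.0])))  # -> "point"

Dynamic gestures would instead be handled by matching a sequence of such measurements against a per-gesture model, such as the Hidden Markov Models discussed above.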
Although instrumented gloves provide very accurate results, they are expensive and
encumbering. Computer vision techniques can also be used for gesture recognition,
overcoming some of these limitations. A good review of vision based gesture recognition is
provided by Palovic et al. (1995). In general, vision based systems are more natural to use
than glove interfaces, and are capable of excellent hand and body tracking, but do not
provide the same accuracy in pose determination. However, for many applications this may
not be important. Sturman and Zeltzer point out the following limitations of image based
visual tracking of the hands (Sturman and Zeltzer 1994):
• The resolution of video cameras is too low to both resolve the fingers easily and cover
the field of view encompassed by broad hand motions.
• Conventional video technology, at 30 or 60 frames per second, is insufficient to capture
rapid hand motion.
• Fingers are difficult to track as they occlude each other and are occluded by the hand.
There are two different approaches to vision based gesture recognition: model based
techniques, which try to create a three-dimensional model of the user’s hand and use this
for recognition, and image based techniques, which calculate recognition features directly
from the hand image. Rehg and Kanade (1994) describe a vision-based approach that
uses a stereo camera to create a cylindrical model of the hand. They use fingertips and joint
links as features to align the cylindrical components of the model. Etoh, Tomono and
Kishino (1991) report similar work, while Lee and Kunii use kinematic constraints to
improve the model matching and recognize 16 gestures with XX% accuracy (1993). Image
based methods typically segment flesh tones from the background images to find the hands
and then try to extract features such as fingertips, hand edges, or gross hand geometry
for use in gesture recognition. Using only a coarse description of hand shape and a hidden
Markov model, Starner and Pentland are able to recognize 42 American Sign Language
gestures with 99% accuracy (1995). In contrast, Martin and Crowley calculate the principal
components of gestural images and use these to search the gesture space to match the
target gestures (1997).
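To make the image based approach concrete, here is a small sketch using OpenCV that segments flesh tones and reduces the largest skin-coloured blob to a coarse description (centroid, area, bounding-box aspect ratio). The HSV thresholds are illustrative and would need tuning for real lighting and skin tones, and the feature set is far simpler than those used in the systems cited above:

import cv2
import numpy as np

def coarse_hand_features(frame_bgr: np.ndarray):
    # Segment flesh tones in a colour frame and reduce the largest skin-coloured
    # blob (assumed to be the hand) to a coarse description: centroid, area and
    # bounding-box aspect ratio.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))        # rough skin-tone range
    # [-2] keeps this working across OpenCV versions that return two or three values
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(hand)
    m = cv2.moments(hand)
    cx = m["m10"] / (m["m00"] + 1e-6)
    cy = m["m01"] / (m["m00"] + 1e-6)
    return {"centroid": (cx, cy), "area": cv2.contourArea(hand), "aspect": w / float(h)}

# A sequence of such per-frame feature dictionaries could then be fed to one hidden
# Markov model per gesture and classified by choosing the most likely model.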
Natural Gesture Only Interfaces
At the simplest level, effective gesture interfaces can be developed which respond to
natural gestures, especially dynamic hand motion. An early example is the Theremin, an
electronic musical instrument from the 1920s. This responds to hand position using two
proximity sensors, one vertical, the other horizontal. Proximity to the vertical sensor
controls the music pitch, to the horizontal one, loudness. What is amazing is that music can
be made with orthogonal control of the two prime dimensions, using a control system that
provides no fixed reference points, such as frets or mechanical feedback. The hands work
in extremely subtle ways to articulate steps in what is actually a continuous control space
[REF]. The Theremin is successful because there is a direct mapping of hand motion to
continuous feedback, enabling the user to quickly build a mental model of how to use the
device.
Figure 2: The Theremin.
The figure shows Dr. Robert Moog playing the Theremin. This electronic musical
instrument generates a violin-like tone whose pitch is determined by the proximity
of the performer’s right hand to the vertical antenna, and the loudness is controlled
by the proximity of the left hand to the horizontal antenna. Hence, a musical
performance requires great subtlety and nuance of gestural control on the
part of the artist, with no mechanical aids (such as frets) as a guide. It is an
extreme example of the human’s potential to articulate controlled gestures. (Photo:
Big Briar, Inc.)
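A toy sketch of the Theremin's control mapping, assuming two normalised proximity readings are available: pitch and loudness are each a direct, continuous function of one hand's position. The frequency range, the exponential pitch mapping, and the direction of the volume mapping are illustrative choices rather than measurements of the instrument:

import numpy as np

SAMPLE_RATE = 44_100

def theremin_tone(pitch_proximity: float, volume_proximity: float, seconds: float = 0.5):
    # Map two normalised proximity readings (0.0 = hand far away, 1.0 = touching the
    # antenna) to a continuous pitch and loudness, then synthesise that tone.
    freq = 110.0 * (2.0 ** (pitch_proximity * 4.0))   # 110 Hz up to ~1.76 kHz, closer = higher
    amplitude = volume_proximity                      # assumed: closer = louder (real instruments differ)
    t = np.linspace(0.0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return amplitude * np.sin(2.0 * np.pi * freq * t)

samples = theremin_tone(pitch_proximity=0.6, volume_proximity=0.8)
print(len(samples), "samples, peak amplitude", round(float(samples.max()), 2))

Because every change of hand position maps immediately and continuously onto the sound, the performer gets the direct feedback that, as noted above, lets them quickly build a mental model of the device.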
Myron Krueger’s Videoplace is another system which responds to natural user gesture
(Krueger 1991). Developed in the late 1970s and early 80s, Videoplace uses real time
image processing of live video of the user. Background subtraction and edge detection are
used to create a silhouette of the user, from which relevant features are identified. The feature
recognition is sufficiently fine to distinguish between hands and fingers, whether fingers are
extended or closed, and even which fingers. With this capability, the system has been
programmed to perform a number of interactions, many of which closely echo our use of
gesture in the everyday world.
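The core image processing pipeline can be sketched in a few lines of Python with OpenCV: subtract a stored background frame to obtain the user's silhouette, then run an edge detector over it. The threshold values and the synthetic test frame are illustrative, and Videoplace itself used dedicated hardware rather than this kind of software pipeline:

import cv2
import numpy as np

def silhouette_and_edges(frame_gray: np.ndarray, background_gray: np.ndarray,
                         threshold: int = 30):
    # Subtract a stored background frame to obtain a binary silhouette of the user,
    # then run an edge detector over the silhouette.
    diff = cv2.absdiff(frame_gray, background_gray)
    _, silhouette = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    edges = cv2.Canny(silhouette, 50, 150)
    return silhouette, edges

# synthetic test: a bright rectangular "user" against an empty background
background = np.zeros((240, 320), dtype=np.uint8)
frame = background.copy()
frame[60:180, 100:220] = 200
silhouette, edges = silhouette_and_edges(frame, background)
print(silhouette.max(), edges.max())    # 255 255: both the silhouette and its outline were found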
Videoplace is a stunning piece of work, displaying an extremely high degree of virtuosity
and creativity. The key to its success is the recognition of dynamic natural gestures,
meaning users require no training. Figure 3 shows a kind of “finger painting” while Figure 4
shows how one can select from a menu (in this case the alphabet, thereby enabling text
entry) by pointing at items with the index finger. Finally, Figure 5 shows an object being
manipulated by simultaneously using the index finger and thumb from both hands.