29-09-2012, 11:40 AM
A Framework for Recognizing the Simultaneous Aspects of American Sign Language
A Framework for Recognizing.pdf (Size: 334.53 KB / Downloads: 34)
ABSTRACT
The major challenge that faces American Sign Language (ASL) recognition now
is developing methods that will scale well with increasing vocabulary size. Unlike
in spoken languages, phonemes can occur simultaneously in ASL. The number of
possible combinations of phonemes is approximately 1.5 × 10^9, which cannot be
tackled by conventional hidden Markov model-based methods. Gesture recognition,
which is less constrained than ASL recognition, suffers from the same problem. In
this paper we present a novel framework for ASL recognition that aspires to be a
solution to the scalability problems. It is based on breaking down the signs into their
phonemes and modeling them with parallel hidden Markov models. These model the
simultaneous aspects of ASL independently. Thus, they can be trained independently,
and do not require consideration of the different combinations at training time. We
show in experiments with a 22-sign vocabulary how to apply this framework in
practice. We also show that parallel hidden Markov models outperform conventional
hidden Markov models.
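The abstract's core idea, modeling the simultaneous channels of a sign with independently trained HMMs whose scores combine only at recognition time, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the discrete-emission models, the channel split, and all parameters are invented for the example.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under one HMM
    (the standard forward algorithm, in plain Python)."""
    n = len(start)
    alpha = [math.log(start[i]) + math.log(emit[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            math.log(sum(math.exp(alpha[i]) * trans[i][j] for i in range(n)))
            + math.log(emit[j][o])
            for j in range(n)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def phmm_score(channel_obs, channel_models):
    """Parallel-HMM score: each channel (e.g. right hand, left hand) has its
    own HMM, and because the channels are modeled as independent, their
    log-likelihoods simply add at recognition time."""
    return sum(
        forward_log_likelihood(obs, *model)
        for obs, model in zip(channel_obs, channel_models)
    )
```

Because each channel's model is trained on its own data stream, no combination of channels has to be enumerated during training; the per-channel scores are only summed when a sign is recognized, which is what keeps the approach from scaling with the number of phoneme combinations.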
INTRODUCTION
Computers still have a long way to go before they can interact with users in a truly
natural fashion. From a user’s perspective, the most natural way to interact with a computer
would be through a speech and gesture interface. Although speech recognition has made
significant advances in the past 10 years, gesture recognition has been lagging behind. Yet,
gestures are an integral part of human-to-human communication and convey information
that speech alone cannot [20]. A working speech-and-gesture interface is likely to entail a
major paradigm shift away from point-and-click user interfaces toward a natural language
dialogue-and-spoken command-based interface.
RELATED WORK
In the discussion of related work, we focus on previous work in sign language recognition.
For coverage of gesture recognition, the survey in [24] is an excellent starting point. Other,
more recent work is reviewed in [35].
Much previous work has focused on isolated sign language recognition with clear pauses
after each sign, although the research focus is slowly shifting to continuous recognition.
These pauses make it a much easier problem than continuous recognition without pauses
between the individual signs, because explicit segmentation of a continuous input stream into
the individual signs is very difficult. For this reason, and because of coarticulation effects,
work on isolated recognition often does not generalize easily to continuous recognition.
Erenshteyn and colleagues used neural networks to recognize fingerspelling [6]. Waldron
and Kim also used neural networks, but they attempted to recognize a small set of isolated
signs [34] instead of fingerspelling. They used Stokoe’s transcription system [29] to separate
the handshape, orientation, and movement aspects of the signs.
Kadous used Power Gloves to recognize a set of 95 isolated Auslan signs with 80%
accuracy, with an emphasis on computationally inexpensive methods [13]. Grobel and
Assam used HMMs to recognize isolated signs with 91.3% accuracy out of a 262-sign
vocabulary. They extracted 2D features from video recordings of signers wearing colored
gloves [9].
Braffort described ARGo, an architecture for recognizing French Sign Language. It
attempted to integrate the normally disparate fields of sign language recognition and understanding
[2]. Toward this goal, Gibet and colleagues also described a corpus of 3D
gestural and sign language movement primitives [8]. This work focused on the syntactic
and semantic aspects of sign languages, rather than phonology.
Most work on continuous sign language recognition is based on HMMs, which offer the
advantage of segmenting a data stream into its constituent signs implicitly, thus
bypassing the difficult explicit segmentation problem entirely.
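The implicit segmentation that HMMs provide can be illustrated with Viterbi decoding over a composite model whose states are labeled by the sign they belong to; the most likely state path then yields the sign boundaries as a by-product. The sketch below is a toy example: the two one-state "signs" and all probabilities are made up for illustration.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Viterbi decoding over a composite HMM whose states carry sign labels;
    the best state path segments the input stream implicitly."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Backtrace from the best final state.
    path = [max(V[-1], key=V[-1].get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Two toy one-state "signs" with different emission profiles over two symbols.
states = ["HELLO", "THANKS"]
start = {"HELLO": 0.5, "THANKS": 0.5}
trans = {"HELLO": {"HELLO": 0.8, "THANKS": 0.2},
         "THANKS": {"HELLO": 0.2, "THANKS": 0.8}}
emit = {"HELLO": [0.9, 0.1], "THANKS": [0.1, 0.9]}

print(viterbi([0, 0, 0, 1, 1], states, start, trans, emit))
# -> ['HELLO', 'HELLO', 'HELLO', 'THANKS', 'THANKS']
```

Reading off where the state label changes in the decoded path gives the sign boundary, with no separate segmentation step.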
MODELING ASL
In this section we first give an overview of the relevant aspects of ASL linguistics,
particularly ASL phonology. We describe the movement–hold phonological model in detail,
as it forms the basis of our work. We then discuss its shortcomings and extend this model
to make it suitable for ASL recognition.
ASL is the primary mode of communication for many deaf people in the USA. It is a
highly inflected language; that is, many signs can be modified to indicate subject, object,
and numeric agreement. They can also be modified to indicate manner (fast, slow, etc.),
repetition, and duration [30, 29, 19]. Like all other languages, ASL has structure, which
sets it clearly apart from gesturing. This structure allows us to test ideas in a constrained
framework first, before attempting to generalize the results to gesture recognition problems.