Speech Recognition using Neural Networks
Abstract
This thesis examines how artificial neural networks can benefit a large vocabulary, speaker
independent, continuous speech recognition system. Currently, most speech recognition
systems are based on hidden Markov models (HMMs), a statistical framework that supports
both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs
make a number of suboptimal modeling assumptions that limit their potential effectiveness.
Neural networks avoid many of these assumptions; they can also learn complex functions,
generalize effectively, tolerate noise, and support parallelism. While neural networks
can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal
modeling. Therefore, we explore a class of systems called NN-HMM hybrids, in which
neural networks perform acoustic modeling, and HMMs perform temporal modeling. We
argue that an NN-HMM hybrid has several theoretical advantages over a pure HMM system,
including better acoustic modeling accuracy, better context sensitivity, more natural discrimination,
and a more economical use of parameters. These advantages are confirmed
experimentally by an NN-HMM hybrid we developed, based on context-independent
phoneme models, which achieved 90.5% word accuracy on the Resource Management database,
in contrast to only 86.0% accuracy achieved by a pure HMM under similar conditions.
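
To make the hybrid's division of labor concrete: in a typical NN-HMM scheme, the network estimates per-frame phoneme posteriors, and these are converted to scaled likelihoods by dividing out the class priors (Bayes' rule) before the HMM's decoder consumes them in place of emission probabilities. The following is a minimal sketch of that conversion, not necessarily the thesis's exact recipe; all shapes and values are illustrative.

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Turn per-frame network posteriors P(q|x) into scaled likelihoods
    P(q|x) / P(q), which are proportional to P(x|q) by Bayes' rule and
    can replace the HMM's emission probabilities during decoding."""
    return posteriors / np.maximum(priors, eps)

# Toy example: 3 frames, 2 phoneme classes; priors would normally be
# estimated from the training set (all values here are illustrative).
posteriors = np.array([[0.9, 0.1],
                       [0.6, 0.4],
                       [0.2, 0.8]])
priors = np.array([0.7, 0.3])
print(posteriors_to_scaled_likelihoods(posteriors, priors))
```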
Introduction
Speech is a natural mode of communication for people. We learn all the relevant skills
during early childhood, without instruction, and we continue to rely on speech communication
throughout our lives. It comes so naturally to us that we don’t realize how complex a
phenomenon speech is. The human vocal tract and articulators are biological organs with
nonlinear properties, whose operation is not just under conscious control but also affected
by factors ranging from gender to upbringing to emotional state. As a result, vocalizations
can vary widely in terms of their accent, pronunciation, articulation, roughness, nasality,
pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can
be further distorted by background noise and echoes, as well as electrical characteristics (if
telephones or other electronic equipment are used). All these sources of variability make
speech recognition, even more than speech generation, a very complex problem.
Yet people are so comfortable with speech that we would also like to interact with our
computers via speech, rather than having to resort to primitive interfaces such as keyboards
and pointing devices. A speech interface would support many valuable applications — for
example, telephone directory assistance, spoken database querying for novice users, “hands-busy”
applications in medicine or fieldwork, office dictation devices, or even automatic
voice translation into foreign languages. Such tantalizing applications have motivated
research in automatic speech recognition since the 1950’s. Great progress has been made so
far, especially since the 1970’s, using a series of engineered approaches that include template
matching, knowledge engineering, and statistical modeling. Yet computers are still
nowhere near the level of human performance at speech recognition, and it appears that further
significant advances will require some new insights.
Speech Recognition
What is the current state of the art in speech recognition? This is a complex question,
because a system’s accuracy depends on the conditions under which it is evaluated: under
sufficiently narrow conditions almost any system can attain human-like accuracy, but it’s
much harder to achieve good accuracy under general conditions. The conditions of evaluation
— and hence the accuracy of any system — can vary along the following dimensions:
• Vocabulary size and confusability. As a general rule, it is easy to discriminate
among a small set of words, but error rates naturally increase as the vocabulary
size grows. For example, the 10 digits “zero” to “nine” can be recognized essentially
perfectly (Doddington 1989), but vocabulary sizes of 200, 5000, or 100000
may have error rates of 3%, 7%, or 45% (Itakura 1975, Miyatake 1990, Kimura
1990). On the other hand, even a small vocabulary can be hard to recognize if it
contains confusable words. For example, the 26 letters of the English alphabet
(treated as 26 “words”) are very difficult to discriminate because they contain so
many confusable words (most notoriously, the E-set: “B, C, D, E, G, P, T, V, Z”);
an 8% error rate is considered good for this vocabulary (Hild & Waibel 1993).
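
Error rates like those above are conventionally computed by aligning the recognizer's output against a reference transcript with a minimum-edit-distance alignment, counting substitutions, insertions, and deletions. Below is a minimal sketch, assuming whitespace-tokenized transcripts; word accuracy, as quoted in the abstract, is then roughly one minus this rate.

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance (substitutions + insertions + deletions)
    between reference and hypothesis word sequences, divided by the
    reference length -- the standard word error rate."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("zero one two three", "zero one too three"))  # 0.25
```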
Neural Networks
Connectionism, or the study of artificial neural networks, was initially inspired by neurobiology,
but it has since become a very interdisciplinary field, spanning computer science,
electrical engineering, mathematics, physics, psychology, and linguistics as well. Some
researchers are still studying the neurophysiology of the human brain, but much attention is now being focused on the general properties of neural computation, using simplified neural
models. These properties include:
• Trainability. Networks can be taught to form associations between any input and
output patterns. This can be used, for example, to teach the network to classify
speech patterns into phoneme categories (a minimal sketch of such a classifier
appears after this list).
• Generalization. Networks don’t just memorize the training data; rather, they
learn the underlying patterns, so they can generalize from the training data to new
examples. This is essential in speech recognition, because acoustical patterns are
never exactly the same.
• Nonlinearity. Networks can compute nonlinear, nonparametric functions of their
input, enabling them to perform arbitrarily complex transformations of data. This
is useful since speech is a highly nonlinear process.
• Robustness. Networks are tolerant of both physical damage and noisy data; in
fact noisy data can help the networks to form better generalizations. This is a valuable
feature, because speech patterns are notoriously noisy.
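
As a concrete illustration of the trainability and generalization properties above, the following sketch trains a small one-hidden-layer network by gradient descent to classify two-dimensional "frames" into two phoneme-like categories, then tests it on points it never saw. The data, architecture, and hyperparameters are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for acoustic frames: two Gaussian clusters in 2-D,
# representing two phoneme-like categories (data invented for illustration).
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (100, 2)),
               rng.normal([2.0, 2.0], 0.5, (100, 2))])
y = np.repeat([0, 1], 100)
onehot = np.eye(2)[y]

# One hidden layer of sigmoid units, softmax output, cross-entropy loss.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 2)); b2 = np.zeros(2)

def forward(X):
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))            # hidden activations
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)          # class posteriors

lr = 0.5
for epoch in range(500):
    h, p = forward(X)
    g_logits = (p - onehot) / len(X)                    # dLoss/dlogits
    g_h = g_logits @ W2.T * h * (1 - h)                 # backprop through sigmoid
    W2 -= lr * h.T @ g_logits; b2 -= lr * g_logits.sum(0)
    W1 -= lr * X.T @ g_h;      b1 -= lr * g_h.sum(0)

# Generalization: classify points the network never saw during training.
X_test = np.array([[0.1, -0.2], [1.9, 2.1]])
print(forward(X_test)[1].argmax(axis=1))                # expected: [0 1]
```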
Fundamentals of Speech Recognition
Speech recognition is a multileveled pattern recognition task, in which acoustical signals
are examined and structured into a hierarchy of subword units (e.g., phonemes), words,
phrases, and sentences. Each level may provide additional temporal constraints, e.g., known
word pronunciations or legal word sequences, which can compensate for errors or uncertainties
at lower levels. This hierarchy of constraints can best be exploited by combining
decisions probabilistically at all lower levels, and making discrete decisions only at the
highest level.
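
This principle of deferring hard decisions can be illustrated with log-domain Viterbi scoring: per-frame phoneme scores stay probabilistic and are combined with each word's pronunciation constraint over the whole utterance, and only the final comparison between candidate words is a discrete decision. Below is a toy sketch with a two-word lexicon and made-up frame scores.

```python
import numpy as np

# Per-frame phoneme scores (rows: frames; columns: /k/ /ae/ /t/ /b/).
# All numbers are made up purely to illustrate the computation.
PHONES = ["k", "ae", "t", "b"]
frame_scores = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
])
LEXICON = {"cat": ["k", "ae", "t"], "bat": ["b", "ae", "t"]}

def word_log_score(word):
    """Log-domain Viterbi over a left-to-right chain of phoneme states:
    at each frame the path either stays in its current phoneme or
    advances to the next, and must end in the word's final phoneme."""
    states = [PHONES.index(p) for p in LEXICON[word]]
    T, S = len(frame_scores), len(states)
    v = np.full((T, S), -np.inf)
    v[0, 0] = np.log(frame_scores[0, states[0]])
    for t in range(1, T):
        for s in range(S):
            best_prev = max(v[t-1, s], v[t-1, s-1] if s > 0 else -np.inf)
            v[t, s] = best_prev + np.log(frame_scores[t, states[s]])
    return v[T-1, S-1]

# Scores stay probabilistic across the whole utterance; the only discrete
# decision is the final comparison between candidate words.
print(max(LEXICON, key=word_log_score))   # expected: 'cat'
```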