14-01-2013, 04:46 PM
Automatic Speech Recognition – A Brief History of the Technology Development
Abstract
Designing a machine that mimics human behavior, particularly the capability of speaking
naturally and responding properly to spoken language, has intrigued engineers and scientists for
centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model
for speech analysis and synthesis [1, 2], the problem of automatic speech recognition has been
approached progressively, from a simple machine that responds to a small set of sounds to a
sophisticated system that responds to fluently spoken natural language and takes into account the
varying statistics of the language in which the speech is produced. Based on major advances in
statistical modeling of speech in the 1980s, automatic speech recognition systems today find
widespread application in tasks that require a human-machine interface, such as automatic call
processing in the telephone network and query-based information systems that provide updated
travel information, stock price quotations, weather reports, and the like. In this article, we
review some major highlights in the research and development of automatic speech recognition
during the last few decades so as to provide a technological perspective and an appreciation of the
fundamental progress that has been made in this important area of information and
communication technology.
Keywords
Speech recognition, speech understanding, statistical modeling, spectral analysis, hidden Markov
models, acoustic modeling, language modeling, finite state network, office automation, automatic
transcription, keyword spotting, dialog systems, neural networks, pattern recognition, time
normalization
1. Introduction
Speech is the primary means of communication between people. For reasons ranging from
technological curiosity about the mechanisms for mechanical realization of human speech
capabilities, to the desire to automate simple tasks inherently requiring human-machine
interactions, research in automatic speech recognition (and speech synthesis) by machine has
attracted a great deal of attention over the past five decades.
The desire for automation of simple tasks is not a modern phenomenon, but one that goes
back more than one hundred years in history. By way of example, in 1881 Alexander Graham
Bell, his cousin Chichester Bell and Charles Sumner Tainter invented a recording device that
used a rotating cylinder with a wax coating on which up-and-down grooves could be cut by a
stylus, which responded to incoming sound pressure (in much the same way as a microphone that
Bell invented earlier for use with the telephone). Based on this invention, Bell and Tainter formed
the Volta Graphophone Co. in 1888 in order to manufacture machines for the recording and
reproduction of sound in office environments. The American Graphophone Co., which later
became the Columbia Graphophone Co., acquired the patent in 1907 and trademarked the term
“Dictaphone.” Just about the same time, Thomas Edison invented the phonograph using a tinfoil
based cylinder, which was subsequently adapted to wax, and developed the “Ediphone” to
compete directly with Columbia. The purpose of these products was to record dictation of notes
and letters for a secretary (likely in a large pool that offered the service as shown in Figure 1)
who would later type them out (offline), thereby circumventing the need for costly stenographers.
This turn-of-the-century concept of “office mechanization” spawned a range of electric and
electronic implements and improvements, including the electric typewriter, which changed the
face of office automation in the mid-part of the twentieth century. It does not take much
imagination to envision the obvious interest in creating an “automatic typewriter” that could
directly respond to and transcribe a human’s voice without having to deal with the annoyance of
recording and handling the speech on wax cylinders or other recording media.
A similar kind of automation took place a century later, in the 1990s, in the area of “call
centers.” A call center is a concentration of agents or associates who handle telephone calls from
customers requesting assistance. Among the tasks of such call centers is routing incoming
calls to the proper department, where specific help is provided or where transactions are carried
out. One example of such a service was the AT&T Operator line, which helped a caller place calls,
arrange payment methods, and conduct credit card transactions. The number of agent positions
(or stations) in a large call center could reach several thousand. Automatic speech recognition
technologies provided the capability of automating these call handling functions, thereby
reducing the large operating cost of a call center. By way of example, the AT&T Voice
Recognition Call Processing (VRCP) service, which was introduced into the AT&T Network in
1992, routinely handles about 1.2 billion voice transactions with machines each year using
automatic speech recognition technology to appropriately route and handle the calls [3].