Report on Speech Recognition
INTRODUCTION
One of the most important inventions of the nineteenth century was the telephone. Then, at the midpoint of the twentieth century, the invention of the digital computer amplified the power of our minds, enabling us to think and work more efficiently and making us more imaginative than we could ever have imagined. Now several new technologies have empowered us to teach computers to talk to us in our native languages and to listen to us when we speak (recognition); haltingly, computers have begun to understand what we say. Having given our computers both oral and aural abilities, we have been able to produce innumerable computer applications that further enhance our productivity. Such capabilities enable us to route phone calls automatically and to obtain and update computer-based information by telephone, using a group of activities collectively referred to as voice processing.
SPEECH TECHNOLOGY:
Three primary speech technologies are used in voice processing applications: stored speech, text-to-speech, and speech recognition. Stored speech involves the production of computer speech from an actual human voice that is stored in a computer's memory and used in any of several ways.
Speech can also be synthesized from plain text in a process known as text-to-speech, which also enables voice processing applications to read from textual databases.
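As an illustration only, the following minimal Python sketch produces spoken output from plain text using the third-party pyttsx3 package; the package choice and the sample sentence are assumptions, not part of any system described in this report.

import pyttsx3

engine = pyttsx3.init()          # pick the platform's default synthesizer
engine.setProperty("rate", 150)  # speaking rate in words per minute
engine.say("Your account balance is forty two dollars.")
engine.runAndWait()              # block until playback finishes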
Speech recognition is the process of deriving either a textual transcription or some form of meaning from a spoken input.
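For a concrete picture of transcription, here is a minimal Python sketch using the third-party SpeechRecognition package; the file name utterance.wav is hypothetical, and any recognition backend would serve equally well.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("utterance.wav") as source:  # hypothetical input file
    audio = recognizer.record(source)          # read the entire file

try:
    text = recognizer.recognize_google(audio)  # send audio to a cloud recognizer
    print("Transcription:", text)
except sr.UnknownValueError:
    print("The speech was unintelligible")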
Speech analysis can be thought of as that part of voice processing that converts human speech to digital forms suitable for transmission or storage by computers.
Speech synthesis functions are essentially the inverse of speech analysis – they reconvert speech data from a digital form to one that’s similar to the original recording and suitable for playback.
Speech analysis processes can also be referred to as digital speech encoding (or simply coding), and speech synthesis can be referred to as speech decoding.
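One concrete example of such a coding rule is mu-law companding, the rule used in North American digital telephony. The Python sketch below implements it purely as an illustration; a real codec would also quantize the compressed value to 8 bits, which is omitted here.

import math

MU = 255  # the mu-law companding parameter

def encode(sample: float) -> float:
    # Compress a sample in [-1, 1] (speech analysis / coding).
    return math.copysign(math.log1p(MU * abs(sample)) / math.log1p(MU), sample)

def decode(compressed: float) -> float:
    # Expand back toward the original sample (speech synthesis / decoding).
    return math.copysign((math.exp(abs(compressed) * math.log1p(MU)) - 1) / MU, compressed)

sample = 0.1
print(decode(encode(sample)))  # prints ~0.1: decoding inverts encoding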
Types of speech recognition:
Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages fit into more than one class, depending on which mode they are using.
Isolated Words:
Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on both sides of the sample window. This does not mean that the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, in which they require the speaker to wait between utterances (usually doing processing during the pauses). "Isolated utterance" might be a better name for this class.
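To make the quiet-on-both-sides requirement concrete, here is a minimal endpointing sketch in Python/NumPy. It assumes the audio arrives as a float array scaled to [-1, 1]; the fixed energy threshold is a simplification, since real systems adapt it to the background noise level.

import numpy as np

def trim_silence(samples, rate, frame_ms=20, threshold=0.01):
    # Return the utterance between leading and trailing silence.
    # A frame whose RMS energy falls below the threshold counts as silence.
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = np.flatnonzero(rms > threshold)
    if voiced.size == 0:
        return samples[:0]  # nothing but silence
    start, end = voiced[0], voiced[-1] + 1
    return samples[start * frame_len:end * frame_len]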
Connected Words:
Connected word systems (or, more correctly, "connected utterance" systems) are similar to isolated word systems, but they allow separate utterances to be run together with a minimal pause between them.
Continuous Speech:
Continuous recognition is the next step. Recognizers with continuous speech capabilities are among the most difficult to create, because they must use special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally while the computer determines the content. Basically, it is computer dictation.
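One simple way a recognizer can locate utterance boundaries in a continuous stream is to watch for sufficiently long runs of low-energy frames. The Python sketch below shows the idea; the threshold and pause length are illustrative assumptions, not values from any particular system.

def segment_utterances(rms_frames, threshold=0.01, min_pause_frames=15):
    # Split a stream of per-frame RMS energies into utterance spans.
    # A run of min_pause_frames quiet frames (about 300 ms at 20 ms per
    # frame) closes the current utterance. Returns (start, end) frame pairs.
    spans, start, quiet = [], None, 0
    for i, energy in enumerate(rms_frames):
        if energy > threshold:
            if start is None:
                start = i        # an utterance begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause_frames:
                spans.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:        # the stream ended mid-utterance
        spans.append((start, len(rms_frames)))
    return spans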
Spontaneous Speech:
There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.
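As a toy illustration of one such feature, the Python snippet below strips common filler words from a raw transcript. The fixed filler list is an assumption; real systems model disfluencies statistically rather than with a word list.

FILLERS = {"um", "uh", "ah", "er", "hmm"}

def remove_fillers(transcript: str) -> str:
    # Drop any word that, ignoring case and trailing punctuation, is a filler.
    return " ".join(w for w in transcript.split()
                    if w.lower().strip(",.") not in FILLERS)

print(remove_fillers("um, I'd like to, uh, book a flight"))
# -> "I'd like to, book a flight"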
Voice Verification/Identification:
Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.
SPEECHACTS FRAMEWORK:
SpeechActs was a research prototype developed in the Speech Applications Group in the period 1993-1997 as a testbed for developing spoken natural language applications. In developing the system, a primary goal was to enable software developers without special expertise in speech or natural language to create effective conversational speech applications, that is, applications in which users can speak naturally, as if they were conversing with a personal assistant.
Another goal was for SpeechActs applications to work in conjunction with one another on a discourse level without having specific knowledge of the other applications running in the same suite. For example, if someone talks about "Tom Jones" in one application, and then mentions "Tom" later in the conversation while in another application, that second application should know that the user means "Tom Jones" and not some other "Tom."
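The following toy Python sketch illustrates the shared-discourse idea; the class and method names are hypothetical and are not actual SpeechActs code.

class DiscourseContext:
    # Both applications consult one shared context, so a later "Tom"
    # resolves to the "Tom Jones" mentioned earlier.
    def __init__(self):
        self.recent = []                    # most recent referent last

    def mention(self, full_name: str):
        self.recent.append(full_name)

    def resolve(self, partial: str) -> str:
        for name in reversed(self.recent):  # prefer the latest match
            if partial in name.split():
                return name
        return partial                      # no antecedent found

ctx = DiscourseContext()
ctx.mention("Tom Jones")   # said while in one application
print(ctx.resolve("Tom"))  # "Tom Jones", even from another application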
Given the rapidly changing technology, a third goal was to avoid tying developers to specific speech recognizers or synthesizers. We wanted them to be able to use these speech technologies as plug-in components. SpeechActs supported a handful of speaker-independent, continuous speech recognizers: Hark from BBN, Dagger from Texas Instruments, and Nuance Communications' recognizers. In addition, the framework used TruVoice text-to-speech (previously from Centigram, now from Lernout & Hauspie) or TrueTalk from Entropic. The architecture of the system made it straightforward to add new recognizers and synthesizers to the existing set.
Framework Architecture
The SpeechActs framework comprises an audio server, the Swiftus natural language processor, a discourse manager, a text-to-speech manager, and a set of grammar-building tools. These pieces work in conjunction with third-party speech components and components supplied by application developers.
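To see the plug-in idea in a few lines of Python, the sketch below codes an application against abstract recognizer and synthesizer interfaces so that concrete engines can be swapped in behind them; all names are hypothetical rather than taken from SpeechActs.

from abc import ABC, abstractmethod

class Recognizer(ABC):
    @abstractmethod
    def recognize(self, audio: bytes) -> str:
        ...  # return a textual transcription of the audio

class Synthesizer(ABC):
    @abstractmethod
    def speak(self, text: str) -> bytes:
        ...  # return audio rendering the given text

class Application:
    # Depends only on the interfaces, never on a vendor's engine.
    def __init__(self, recognizer: Recognizer, synthesizer: Synthesizer):
        self.recognizer = recognizer
        self.synthesizer = synthesizer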
CONCLUSION:
Speech recognition is a truly amazing human capacity, especially when you consider that normal conversation requires the recognition of 10 to 15 phonemes per second. It should come as little surprise, then, that attempts to build machine (computer) recognition systems have proven difficult. Despite these problems, a variety of systems are becoming available that achieve some success, usually by addressing one or two particular aspects of speech recognition. A variety of speech synthesis systems, on the other hand, have been available for some time now. Though limited in capabilities and generally lacking the "natural" quality of human speech, these systems are now a common component in our lives.