Speech Recognition



Recognition – Conceptually


Data Acquisition

Training Hidden Markov Models for word set

Recognition & Analysis


Viterbi-based Recognition


Calculates the log-maximum likelihood of a series of observations given a particular HMM.
“Which model did this set of data most likely come from?”

Saves time by calculating only a subset of possible paths through the HMM network.
At each new frame, only the most likely transition/observation state pairs are kept.
Conceptually similar to Dynamic Time Warping (a sketch follows below).
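A minimal log-domain Viterbi scoring sketch in MATLAB is given below. The variable names (A for log transition probabilities, B for per-frame log observation likelihoods, pi0 for log initial probabilities) are ours, not from the slides; the point is that each frame keeps only the best predecessor per state instead of evaluating every path.

```matlab
% Log-domain Viterbi scoring sketch (hypothetical names; needs R2016b+
% for implicit expansion).
% A:   NxN log transition matrix, A(i,j) = log P(state j | state i)
% B:   NxT log observation likelihoods, B(j,t) = log P(o_t | state j)
% pi0: Nx1 log initial-state probabilities
function logLik = viterbi_score(A, B, pi0)
    [~, T] = size(B);
    delta = pi0 + B(:,1);            % best log score ending in each state
    for t = 2:T
        cand  = delta + A;           % cand(i,j) = delta(i) + log a_ij
        delta = max(cand, [], 1)' + B(:,t);  % keep best predecessor only
    end
    logLik = max(delta);             % "which model fits best" score
end
```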


DSP – Recording/Thresholding


Speech Input
Process
Poll A/D for input data (TI-provided code used)
Take only one channel as input
Downsample
Save samples only when the signal threshold has been crossed (see sketch below)
Lead buffer
Tail buffer
PROBLEMS
Sample transfer modes, single channel selection, threshold values, external microphones
TESTING
Visual and audio inspection in Matlab
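A minimal MATLAB sketch of the threshold-based capture with lead and tail buffers might look like the following; the threshold and buffer lengths are illustrative stand-ins, not the project's actual settings.

```matlab
fs     = 8000;                     % downsampled rate
thresh = 0.05;                     % amplitude threshold (full scale = 1)
lead   = round(0.10*fs);           % keep 100 ms before the first crossing
tail   = round(0.20*fs);           % keep 200 ms after the last crossing
active = find(abs(x) > thresh);    % x: recorded, downsampled signal vector
if ~isempty(active)
    i1  = max(1,         active(1)   - lead);
    i2  = min(length(x), active(end) + tail);
    seg = x(i1:i2);                % thresholded utterance with buffers
end
```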


SPEECH RECOGNITION

[attachment=29831]

ABSTRACT

Speech recognition is the analysis side of the subject of machine speech processing; the synthesis side might be called speech production. The two taken together allow computers to work with spoken language. My study concentrates on isolated-word speech recognition. Speech recognition in humans is thousands of years old; on our planet it could be traced back millions of years to the dinosaurs. Our topic might better be called automatic speech recognition (ASR). I give a brief survey of ASR, starting with modern phonetics and continuing through the current state of Large-Vocabulary Continuous Speech Recognition (LVCSR). A simple computer experiment in isolated-word speech recognition, using MATLAB, is described in some detail. I experimented with several different recognition algorithms, using training and testing data from two distinct vocabularies. My training and testing data were recorded with both male and female voices.

INTRODUCTION

Historically the sounds of spoken language have been studied at two different levels: (1) the phonetic components of spoken words, e.g., vowel and consonant sounds, and (2) acoustic wave patterns. A language can be broken down into a very small number of basic sounds, called phonemes (English has approximately forty). An acoustic wave is a sequence of changing vibration patterns (generally in air); however, we are more accustomed to “seeing” acoustic waves as their electrical analog on an oscilloscope (time presentation) or spectrum analyzer (frequency presentation). Also seen in sound analysis are two-dimensional patterns called spectrograms, which display frequency (vertical axis) vs. time (horizontal axis) and represent the signal energy as the figure intensity or color.
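For illustration, a spectrogram of this kind can be produced in MATLAB (assuming the Signal Processing Toolbox; the file name is hypothetical):

```matlab
[x, fs] = audioread('utterance.wav');                 % hypothetical recording
spectrogram(x, hamming(256), 128, 256, fs, 'yaxis');  % frequency vs. time, energy as color
```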

SURVEY OF SPEECH RECOGNITION

The general public’s “understanding” of speech recognition comes from such things as the HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey. Notice that HAL is a one-letter shift of IBM. At the time of the movie’s release (1968) IBM was just getting started on a large speech recognition project that led to a very successful large-vocabulary isolated-word dictation system and several small-vocabulary control systems. In the mid-nineties IBM’s VoiceType, Dragon Systems’ DragonDictate, and Kurzweil Applied Intelligence's VoicePlus were the popular personal-computer speech recognition products on the market. These “early” packages typically required additional (nonstandard) digital signal processing (DSP) hardware. They were about 90% accurate for general dictation and required a short pause between words.


Speech waveform capture (analog to digital conversion)

The a-to-d conversion is generally accomplished by digital signal processing hardware on the computer’s sound card (a standard feature on most computers today). The typical sampling rate, 8000 samples per second, is adequate. The spoken voice is considered to occupy roughly 300 to 3000 Hertz. A sampling rate of 8000 samples per second gives a Nyquist frequency of 4000 Hertz, which should be adequate for a 3000 Hz voice signal. Some systems have used oversampling plus a sharp cutoff filter to reduce the effect of noise. The sample resolution is the 8 or 16 bits per sample that sound cards can provide.
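As a sketch, base MATLAB can capture such a signal directly; the two-second duration here is illustrative:

```matlab
rec = audiorecorder(8000, 16, 1);   % 8000 samples/s, 16 bits/sample, one channel
recordblocking(rec, 2);             % block while recording two seconds of speech
x = getaudiodata(rec);              % samples as a double column vector in [-1, 1]
```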

Pre-emphasis filtering

Because speech has an overall spectral tilt of 5 to 12 dB per octave, a pre-emphasis filter of the form H(z) = 1 - 0.99z^-1 is normally used. This first-order filter compensates for the fact that the lower formants contain more energy than the higher ones. Without it, the lower formants would be preferentially modeled with respect to the higher formants.
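In MATLAB this filter is a one-liner:

```matlab
y = filter([1 -0.99], 1, x);   % pre-emphasis H(z) = 1 - 0.99*z^-1; x: speech samples
```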

Feature extraction

Usually the features are derived from Linear Predictive Coding (LPC), a technique that attempts to derive the coefficients of a filter that (along with a power source) would reproduce the utterance being studied. LPC is useful in speech processing because of its ability to extract and store time-varying formant information. Formants are points in a sound's spectrum where the loudness is boosted. Does the expression “all-pole filter” come to mind? What we get from LPC analysis is a set of coefficients that describe a digital filter; the idea is that this filter, in conjunction with a noise source or a periodic signal (rich in overtones), would produce a sound similar to the original speech. LPC data is often further processed (by a recursion) to produce what are called LPC cepstrum features.
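A sketch of both steps in MATLAB follows; lpc() assumes the Signal Processing Toolbox, the order p = 10 is illustrative, and the recursion follows the Rabiner & Juang convention for the predictor coefficients.

```matlab
p = 10;                 % model order (illustrative)
% y: one pre-emphasized speech frame (column vector)
a = lpc(y, p);          % a = [1, a2, ..., a(p+1)], coefficients of A(z)
alpha = -a(2:end);      % predictor coefficients: H(z) = 1 / (1 - sum alpha_k z^-k)
c = zeros(1, p);        % LPC cepstrum coefficients via the standard recursion
for m = 1:p
    c(m) = alpha(m);
    for k = 1:m-1
        c(m) = c(m) + (k/m) * c(k) * alpha(m-k);
    end
end
```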

The experiments

Each experiment uses several individually recorded samples; some are used for training and the others for testing. We make a point not to test with our training data and not to train with our testing data. The programs allow parameter variation and algorithm substitution for key items such as feature extraction and class selection. The package is also designed to gather and record relevant statistics on the accuracy of recognition. The general idea is to change an algorithm and/or various controlling parameters, rerun the standard experiment, and note any improvement or degradation in the statistics that represent successful recognition. A skeleton of this loop is sketched below.
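The skeleton uses hypothetical variable names, with a nearest-class-mean rule standing in for whichever classifier is under test:

```matlab
nCorrect = 0;
for i = 1:numel(testFeats)                 % testFeats{i}: feature vector of test sample i
    d = zeros(1, nClasses);
    for c = 1:nClasses
        d(c) = norm(testFeats{i} - classMean{c});  % the pluggable distance/class rule
    end
    [~, guess] = min(d);                   % pick the nearest class
    nCorrect = nCorrect + (guess == testLabels(i));
end
accuracy = nCorrect / numel(testFeats);    % the statistic recorded for each run
```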

SUMMARY

Generally, linear prediction worked better than LPC cepstrum. I would expect LPC cepstrum to be better than LPC if used on individual phonemes (as opposed to complete words).
Recognition in the laboratory (controlled environment) was better than in the studio (general use).
The poor performance of the Mahalanobis method is thought to be a result of my small number of training sets. With just a few training samples (I used eight) for each class I get a very good handle on the class mean, but since I need so many features (in the high teens), I can certainly expect trouble with the covariance matrix. I needed to reduce my feature count to nine to be consistent with my eight training samples. Remember, my features are the filter coefficients (not the poles); therefore my nine features contain at most eight degrees of freedom.
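The covariance trouble is easy to demonstrate: with n training vectors the sample covariance matrix has rank at most n - 1, so with eight samples and a feature count in the high teens it is necessarily singular and the Mahalanobis distance cannot be computed.

```matlab
n = 8;  p = 18;        % eight training samples, ~18 features (illustrative)
X = randn(n, p);       % stand-in for one class's training feature vectors
S = cov(X);            % sample covariance estimate
rank(S)                % at most n - 1 = 7, far below p, so S is singular
```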