13-11-2012, 06:10 PM
Speech Recognition
Speech Recognition1.ppt (Size: 1.19 MB / Downloads: 37)
Speech recognition is a technology of particular interest because it supports direct communication between humans and computers through a mode of communication that humans commonly use among themselves and at which they are highly skilled.
Types of speech recognition
Isolated words
Connected words
Continuous speech
Spontaneous speech (automatic speech recognition)
Voice verification and identification
Challenges of speech recognition
Ease of use
Robust performance
Automatic learning of new words and sounds
Grammar for spoken language
Control of synthesized voice quality
Integrated learning for speech recognition and synthesis
What is the task?
Getting a computer to understand spoken language
By “understand” we might mean
React appropriately
Convert the input speech into another medium, e.g. text
Several variables impinge on this (see later)
What’s hard about that?
Digitization
Converting analogue signal into digital representation
Signal processing
Separating speech from background noise
Phonetics
Variability in human speech
Phonology
Recognizing individual sound distinctions (similar phonemes)
Lexicology and syntax
Disambiguating homophones
Features of continuous speech
Syntax and pragmatics
Interpreting prosodic features
Pragmatics
Filtering of performance errors (disfluencies)
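One of the signal-processing difficulties above, separating speech from background noise, can be sketched with a toy energy-based voice activity detector. This is only an illustration of the idea (real recognizers use far more sophisticated methods); the frame size, threshold, and test signals are invented for the example.

```python
import math

# Minimal sketch: energy-based voice activity detection (VAD).
# Frames whose mean energy exceeds a threshold are marked as speech;
# quieter frames are treated as background noise.

def frame_energies(samples, frame_size=160):
    """Split the signal into frames and return the mean energy of each."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [sum(s * s for s in f) / len(f) for f in frames]

def is_speech(samples, threshold=0.01, frame_size=160):
    """Mark each frame as speech (True) or background (False)."""
    return [e > threshold for e in frame_energies(samples, frame_size)]

quiet = [0.001 * math.sin(i / 5) for i in range(320)]  # low-energy background
loud = [0.5 * math.sin(i / 5) for i in range(320)]     # louder "speech"
print(is_speech(quiet + loud))  # → [False, False, True, True]
```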
Digitization
Analogue to digital conversion
Sampling and quantizing
Use filters to measure energy levels for various points on the frequency spectrum
Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
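The sampling-and-quantizing step above can be sketched in a few lines: sample a continuous signal at a fixed rate and quantize each sample to a signed 16-bit integer, as in PCM audio. The 440 Hz test tone, 8 kHz rate, and 0.01 s duration are illustrative choices, not values from the slides.

```python
import math

# Minimal sketch of digitization: sampling plus quantization (PCM-style).

SAMPLE_RATE = 8000      # samples per second
DURATION = 0.01         # seconds of audio
FREQ = 440.0            # "analogue" test tone (Hz)

def analogue(t):
    """Stand-in for the continuous input signal."""
    return math.sin(2 * math.pi * FREQ * t)

def digitize(signal, rate=SAMPLE_RATE, duration=DURATION, bits=16):
    """Sample `signal` at `rate` and quantize to signed `bits`-bit integers."""
    max_level = 2 ** (bits - 1) - 1   # 32767 for 16-bit
    n = int(rate * duration)
    return [round(signal(i / rate) * max_level) for i in range(n)]

samples = digitize(analogue)
# 0.01 s at 8000 Hz gives 80 samples, each in [-32767, 32767]
```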
Identifying phonemes
Differences between some phonemes are sometimes very small
May be reflected in the speech signal (e.g. vowels have more or less distinctive F1 and F2 formants)
Often show up in coarticulation effects (transition to next sound)
e.g. aspiration of voiceless stops in English
Allophonic variation
Performance errors
Performance “errors” include
Non-speech sounds
Hesitations
False starts, repetitions
Filtering implies handling at the syntactic level or above
Some disfluencies are deliberate and have pragmatic effect – this is not something we can handle in the near future
Template-based approach
Hard to distinguish very similar templates
And quickly degrades when input differs from templates
Therefore needs techniques to mitigate this degradation:
More subtle matching techniques
Multiple templates which are aggregated
Taken together, these pointed toward the statistics-based approach
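One classic "more subtle matching technique" for templates is dynamic time warping (DTW), which tolerates the speaker stretching or compressing sounds in time. A minimal sketch, with feature vectors simplified to single numbers and toy templates invented for the example:

```python
# Minimal sketch of the template-based approach using dynamic time
# warping (DTW): the input is matched against each stored template,
# allowing frames to be stretched or repeated in time.

def dtw_distance(a, b):
    """DTW distance between two sequences, allowing time warping."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a frame of a
                                 d[i][j - 1],      # skip a frame of b
                                 d[i - 1][j - 1])  # match frames
    return d[len(a)][len(b)]

def recognise(input_seq, templates):
    """Pick the template word whose sequence is closest to the input."""
    return min(templates, key=lambda w: dtw_distance(input_seq, templates[w]))

templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1, 1]}  # toy templates
print(recognise([1, 3, 3, 5, 3, 1], templates))  # a time-stretched "yes"
```

The warping step is what plain frame-by-frame template comparison lacks: the stretched input still matches the "yes" template with zero cost.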
Statistics-based approach
Collect a large corpus of transcribed speech recordings
Train the computer to learn the correspondences (“machine learning”)
At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
Machine learning
Acoustic and Lexical Models
Analyse training data in terms of relevant features
Learn from large amount of data different possibilities
different phone sequences for a given word
different combinations of elements of the speech signal for a given phone/phoneme
Combine these into a Hidden Markov Model expressing the probabilities
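The acoustic/lexical model described above can be sketched as a tiny Hidden Markov Model for one word: hidden states are phones, observations are coarse acoustic symbols, and the forward algorithm sums over all possible phone sequences. All probabilities, phone labels, and acoustic symbols here are invented for illustration; a real model would be trained from data.

```python
# Minimal sketch of a word HMM: hidden phone states, acoustic-symbol
# observations, and the forward algorithm for P(observations | word).
# All numbers are invented, not trained.

states = ["k", "ae", "t"]                      # phones for the word "cat"
start = {"k": 1.0, "ae": 0.0, "t": 0.0}        # must start at /k/
trans = {                                       # P(next phone | phone)
    "k":  {"k": 0.3, "ae": 0.7, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t":  {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit = {                                        # P(acoustic symbol | phone)
    "k":  {"burst": 0.8, "vowel": 0.1, "hiss": 0.1},
    "ae": {"burst": 0.1, "vowel": 0.8, "hiss": 0.1},
    "t":  {"burst": 0.4, "vowel": 0.1, "hiss": 0.5},
}

def forward(observations):
    """P(observations | word model), summed over all phone paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward(["burst", "vowel", "hiss"]))
```

Because the forward algorithm sums over every phone path, it captures exactly the "different combinations of elements of the speech signal for a given phone" idea: no single alignment is assumed.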
The Noisy Channel Model
Use the acoustic model to give a set of likely phone sequences
Use the lexical and language models to judge which of these are likely to result in probable word sequences
The trick is having sophisticated algorithms to juggle the statistics
A bit like the rule-based approach except that it is all learned automatically from data
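The noisy channel decision rule above can be sketched as choosing the word sequence W that maximises P(acoustics | W) x P(W), i.e. acoustic score times language-model score, usually computed in log space. The candidate sequences and all scores below are invented for illustration:

```python
import math

# Minimal sketch of noisy channel decoding: combine the acoustic model
# score and the language model score in log space, then take the argmax.
# All scores are invented.

acoustic = {               # log P(observed audio | word sequence)
    "recognise speech":   math.log(0.20),
    "wreck a nice beach": math.log(0.25),   # acoustically slightly better
}
language = {               # log P(word sequence) from the language model
    "recognise speech":   math.log(0.010),
    "wreck a nice beach": math.log(0.001),  # far less probable English
}

def decode(candidates):
    """Return the candidate with the highest combined log score."""
    return max(candidates, key=lambda w: acoustic[w] + language[w])

print(decode(list(acoustic)))  # → recognise speech
```

The language model overrides the small acoustic edge of the mis-hearing, which is the point of combining the two models rather than trusting the acoustics alone.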