20-03-2014, 04:20 PM
Isolated-word speech recognition using hidden Markov models
[attachment=60332]
Introduction
Speech recognition is a challenging problem on which much work has been done over the last
few decades. Some of the most successful results have been obtained by using hidden Markov
models, as described by Rabiner in 1989 [1].
A well-working generic speech recognizer would enable more efficient communication
for everybody, but especially for children, illiterate people, and people with disabilities. A
speech recognizer could also serve as a subsystem in a speech-to-speech translator.
The speech recognition system implemented during this project trains one hidden
Markov model for each word that it should be able to recognize. The models are trained
with labeled training data, and the classification is performed by passing the features to
each model and then selecting the best match.
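The per-word classification step described above can be sketched as follows. This is a minimal, hypothetical illustration: `word_models` maps each vocabulary word to its trained HMM, and `log_likelihood` stands in for whatever scoring routine (e.g. the forward algorithm) computes log P(O | λ) for a model; neither name comes from the project itself.

```python
def classify(features, word_models, log_likelihood):
    """Return the word whose HMM scores the feature sequence best.

    `word_models` maps each vocabulary word to its trained model
    parameters; `log_likelihood(model, features)` computes
    log P(O | lambda) for that model, e.g. via the forward algorithm.
    """
    best_word, best_score = None, float("-inf")
    for word, model in word_models.items():
        score = log_likelihood(model, features)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

Training then amounts to fitting one model per labeled word, after which classification is just this argmax over per-model likelihoods.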
Background theory
Hidden Markov models
Basic knowledge of hidden Markov models is assumed, but the two most important
algorithms used in this project will be described.
The observable output from a hidden state is assumed to be generated by a multivariate
Gaussian distribution, so there is one mean vector and one covariance matrix for
each state. We also assume that the state transition probabilities are independent
of time, so that the hidden Markov chain is homogeneous.
We will now define the notation for describing a hidden Markov model as used in
this project. There are N states in total. An element a_{ss'} of the transition
probability matrix A denotes the probability of a transition from state s to state s', and
the probability for the chain to start in state s is π_s. The mean vector and covariance
matrix of the multivariate Gaussian distribution modeling the observable output from
state s are μ_s and Σ_s, respectively. For an observation o, b_s(o) denotes the probability
density of the multivariate Gaussian distribution of state s evaluated at o. We will
sometimes denote the collection of parameters describing the hidden Markov model as
λ = {A, π, μ, Σ}.
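To make the notation concrete, here is a small illustrative sketch of λ = {A, π, μ, Σ} and of the emission density b_s(o). For simplicity it assumes diagonal covariance matrices, so b_s(o) factors into per-dimension univariate Gaussian densities; the variable names and the two-state example model are illustrative, not taken from the project.

```python
import math

def emission_density(o, mu_s, var_s):
    """b_s(o): density of a diagonal-covariance Gaussian at observation o.

    `mu_s` is the mean vector of state s, and `var_s` holds the diagonal
    of its covariance matrix Sigma_s. Computed in log space and then
    exponentiated for numerical stability.
    """
    log_p = 0.0
    for x, m, v in zip(o, mu_s, var_s):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)

# Illustrative model with N = 2 states over 1-D observations:
hmm = {
    "A":   [[0.9, 0.1], [0.0, 1.0]],  # a_{ss'}: transition probabilities
    "pi":  [1.0, 0.0],                # pi_s: initial state probabilities
    "mu":  [[0.0], [3.0]],            # mu_s: per-state mean vectors
    "var": [[1.0], [1.0]],            # diagonal of Sigma_s per state
}
```

With these pieces in place, the forward and Viterbi algorithms operate entirely on A, π, and the values b_s(o).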
Conclusion and future work
During this project, a system for isolated-word speech recognition was implemented and
tested. The cross-validation results are good for a single speaker. Two obvious extensions
are better support for multiple speakers and support for continuous speech. The first step
towards the former would be more, and more robust, features.