23-04-2014, 04:34 PM
IMPLEMENTATION OF SPEECH RECOGNITION IN RESOURCE CONSTRAINED ENVIRONMENTS
ABSTRACT
With the emergence of ubiquitous computing powered by state-of-the-art technology, there
is a constant need for more convenient methods of inputting data and commands to a
computing device. As the size of computing devices decreases rapidly with time, more
sophisticated human-computer interface techniques such as speech are fast evolving. This
project is an attempt to bridge the gap between current speech recognition technology
and embedded systems.
The initial objective of the project will be the implementation of a speech recognition
engine using hidden Markov models. This will involve the design of efficient MATLAB
code on a PC. This phase of the project will cover the development of a limited-domain
recognition engine spanning numerals only. The subsequent step will be to port this
engine to a resource-constrained environment such as an FPGA kit. The long-term aim is
to eliminate the PC altogether and build a stand-alone system. The recognition engine
should be capable of being extended to span the entire vocabulary of the English
language.
PROJECT PHILOSOPHY
Every speech recognition system must be judged on two basic factors which govern its
usability -- accuracy and speed. Unfortunately, one of them almost invariably comes at
the cost of the other. A higher accuracy rate implies a larger training set and more
iterations of the learning algorithm, all of which necessarily take a far greater number
of clock cycles on a standard processor.
The solution we have envisaged in the course of this project is to introduce a degree of
parallelism into the methodology, thereby reducing the number of clock cycles required
for its implementation. This becomes possible when we port the recognition phase of the
system onto an FPGA. The Viterbi algorithm used for recognition operates on
log-likelihoods, so its core recursion involves only 'add' and 'compare' operations,
making it ideal for hardware implementation.
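The add-only property mentioned above can be seen in a software sketch of the Viterbi recursion. The following is an illustrative Python/NumPy version (not the project's MATLAB or FPGA code); note that inside the time loop there are only additions and comparisons, because working in the log domain turns every probability product into a sum:

```python
import numpy as np

def viterbi_log(log_A, log_B, log_pi, obs):
    """Most likely state path via the Viterbi algorithm in log space.

    log_A[i, j] : log transition probability from state i to state j
    log_B[j, o] : log emission probability of symbol o in state j
    log_pi[i]   : log initial probability of state i
    obs         : sequence of observation symbol indices
    """
    N = log_A.shape[0]
    T = len(obs)
    delta = np.empty((T, N))           # best log-likelihood ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers for path recovery

    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # Only additions and comparisons here -- no multiplications.
        scores = delta[t - 1][:, None] + log_A        # N x N candidate scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[:, obs[t]]

    # Backtrack the best state sequence.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

On an FPGA, the `scores` additions for all state pairs can be computed in parallel, which is exactly the source of the speed-up the project targets.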
A BRIEF INTRODUCTION TO SPEECH RECOGNITION
Real time continuous speech recognition is a computationally demanding task, and one
which tends to benefit from increasing the available computing resources.
A typical speech recognition system starts with a preprocessing stage, which takes a
speech waveform as its input, and extracts from it feature vectors or observations which
represent the information required to perform recognition. This stage is efficiently
performed by software. The second stage is recognition, or decoding, which is performed
using a set of phoneme-level statistical models called hidden Markov models (HMMs).
Word-level acoustic models are formed by concatenating phone-level models according
to a pronunciation dictionary. These word models are then combined with a language
model, which constrains the recognizer to recognize only valid word sequences. The
decoder stage is computationally expensive.
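As a rough illustration of the preprocessing stage described above, the sketch below cuts a waveform into short overlapping frames and reduces each frame to a small feature vector of coarse log band energies. The frame sizes, hop length, and feature choice here are assumptions for illustration only; real front ends typically use MFCCs or similar features:

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160, n_bands=13):
    """Turn a 1-D waveform into a sequence of observation vectors.

    Each frame is windowed, transformed to a magnitude spectrum, pooled
    into a few coarse bands, and log-compressed. Parameter values are
    illustrative (e.g. 25 ms frames with a 10 ms hop at 16 kHz).
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # Pool the spectrum into coarse bands; the small epsilon avoids log(0).
        bands = np.array_split(spectrum, n_bands)
        frames.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(frames)  # shape: (num_frames, n_bands)
```

Each row of the returned array is one observation vector handed to the decoder, so a one-second utterance at 16 kHz yields roughly a hundred observations.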
ELEMENTS OF AN HMM
We now formally define the elements of an HMM, and explain how the model generates
observation sequences. An HMM is characterized by the following:
1) N, the number of states in the model. Although the states are hidden, for many
practical applications there is often some physical significance attached to the states or to
sets of states of the model. Generally the states are interconnected in such a way that any
state can be reached from any other state (e.g., an ergodic model). We denote the
individual states as S = {S1, S2, . . . , SN}, and the state at time t as q_t.
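A small sketch can make the ergodic condition concrete. The helper below builds a random HMM with N states; the names A, B, and pi follow the usual HMM notation for the transition, emission, and initial distributions (elements introduced after N in the standard formulation), and the construction is illustrative rather than the project's actual initialisation:

```python
import numpy as np

def make_ergodic_hmm(N, M, seed=0):
    """Random ergodic HMM with N states and M discrete observation symbols.

    'Ergodic' here means every state is reachable from every other state in
    one step, i.e. all entries of the transition matrix A are strictly positive.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)) + 0.1           # strictly positive -> ergodic
    A /= A.sum(axis=1, keepdims=True)      # each row is a distribution
    B = rng.random((N, M)) + 0.1           # emission probabilities per state
    B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)               # uniform initial distribution
    return A, B, pi
```

For speech, a left-to-right topology (with zeros below the diagonal of A) is more common than a fully ergodic one, since phonemes unfold in a fixed temporal order.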