19-01-2013, 12:33 PM
Neural Networks used for Speech Recognition
[attachment=47678]
Abstract
This paper presents an investigation of speech recognition
classification performance. The investigation is carried out
using two standard neural network structures as classifiers:
a feed-forward neural network (NN) trained with the
back-propagation algorithm, and a Radial Basis Function
(RBF) neural network.
INTRODUCTION
Speech is probably the most efficient way for us to
communicate with each other. This also means that speech
could be a useful interface for interacting with machines.
Research on how to improve this type of communication has
been going on for a long time. Successful examples from the
past years, ever since we gained knowledge of
electromagnetism, include the inventions of the megaphone
and the telephone.
Even in the 18th century people were experimenting with
speech synthesis. For example, in the late 18th century Von
Kempelen developed a machine capable of 'speaking' words
and phrases. Nowadays, thanks to the evolution of
computational power, it has become possible not only to
develop, test and implement speech recognition systems, but
also to build systems capable of real-time conversion of text
into speech. Unfortunately, despite the good progress made
in the field, the speech recognition process still faces many
problems, most of them attributable to the fact that speech
is a very subjective phenomenon.
Dynamic Time Warping (DTW)
This technique compares spoken words with reference words.
Every reference word is stored as a sequence of spectra, with
no distinction between the separate sounds within the word.
Because a word can be pronounced at different speeds, time
normalization is necessary. Dynamic Time Warping is a
dynamic programming technique in which the time dimension of
the unknown word is warped (stretched and shrunk) until it
best matches a reference word.
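As a rough illustration (this is not code from the paper), the warping can be sketched with the standard dynamic programming recurrence; the integer sequences and the absolute-difference distance below are hypothetical stand-ins for spectral frames and a frame-distance measure:

```python
# Minimal DTW sketch (hypothetical): accumulated cost of the best
# alignment between two sequences of possibly different lengths.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# The "same word" spoken slowly and quickly aligns with zero cost,
# because the warping absorbs the difference in speed:
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0
```

A real recognizer would compare vectors of spectral coefficients per frame (with, e.g., a Euclidean distance) rather than scalars, and would typically constrain the warping path, but the recurrence is the same.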
IMPLEMENTATION OF SIGNAL-PREPROCESSING
In the previous section we discussed the general
structure of a speech recognition system. In this paper the
main focus is on the neural networks rather than on the
signal pre-processing, although signal pre-processing has a
big impact on the performance of the speech classifier. It is
important to feed the neural network with normalized input.
Recorded samples never produce identical waveforms; the
length, amplitude and background noise may vary. Therefore
we need to perform signal pre-processing to extract only the
speech-related information. This means that using the right
features is crucial for successful classification: good
features simplify the design of a classifier, whereas weak
features (with little discrimination power) can hardly be
compensated for by any classifier. We can divide this process
into several distinct steps:
Signal Pre-processing
As the neural network has to perform the speech
classification, it is very important to feed the network
inputs with relevant data. Appropriate pre-processing is
necessary to ensure that the input to the neural network is
characteristic for every word while having a small spread
amongst samples of the same word. Noise and differences in
the amplitude of the signal can distort the integrity of a
word, while timing variations can cause a large spread
amongst samples of the same word [5],[6].
These problems are dealt with in the signal pre-processing
part, which is composed of several sub-stages: filtering,
entropy-based endpoint detection, and Mel Frequency
Cepstrum Coefficients (MFCC).
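The endpoint-detection sub-stage can be illustrated with a simplified sketch. This is a hypothetical amplitude-entropy variant, not the paper's implementation (real systems usually compute entropy over the short-time spectrum): frames containing speech show a markedly different entropy than silence, so thresholding per-frame entropy locates the start and end of the word.

```python
import math

def frame_entropy(frame, bins=10):
    """Shannon entropy (bits) of the amplitude histogram of one frame."""
    lo, hi = min(frame), max(frame)
    if hi == lo:          # constant frame (e.g. digital silence)
        return 0.0
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in frame:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    total = len(frame)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def detect_endpoints(signal, frame_len=80, threshold=1.0):
    """Return (start, end) sample indices bracketing the frames whose
    entropy exceeds the threshold, or None if no frame qualifies.
    frame_len and threshold are illustrative values, not tuned ones."""
    speech = [i for i in range(0, len(signal) - frame_len + 1, frame_len)
              if frame_entropy(signal[i:i + frame_len]) > threshold]
    if not speech:
        return None
    return speech[0], speech[-1] + frame_len

# Silence - tone - silence: the tone segment is detected as speech.
sig = [0.0] * 160 + [math.sin(0.2 * k) for k in range(160)] + [0.0] * 160
print(detect_endpoints(sig))  # (160, 320)
```

Trimming the samples to the detected endpoints removes the leading and trailing silence, which reduces the length variation that the later stages have to cope with.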
NEURAL NETWORK IMPLEMENTATIONS
Many authors have used neural networks for speech
recognition in the past [9], [10], [11], [12]. For our
implementation the MATLAB Neural Network Toolbox was
used to create, train and simulate the networks [13].
For every word we used 200 recorded samples. Of these
200 samples, 100 were used for training, while the
other 100 were used to test the network (as these were not
included in the training set). The trained network can also
be tested with real-time input from a microphone.
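The feed-forward/back-propagation classifier itself can be sketched in a few lines. The following is a hypothetical pure-Python miniature, not the paper's MATLAB toolbox model: one hidden sigmoid layer trained by gradient descent, fitted here to a toy OR mapping as a stand-in for the real word-feature vectors.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One-hidden-layer feed-forward net with back-propagation (sketch)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rnd = random.Random(seed)
        # each hidden row and the output layer carry an extra bias weight
        self.w1 = [[rnd.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rnd.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]                               # append bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w1]
        hb = self.h + [1.0]
        self.out = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return self.out

    def train_step(self, x, target, lr=0.5):
        out = self.forward(x)
        # output delta uses the sigmoid derivative out * (1 - out)
        d_out = (out - target) * out * (1 - out)
        hb = self.h + [1.0]
        # back-propagate to the hidden layer (using pre-update w2),
        # then apply the gradient-descent weight updates
        for j, hj in enumerate(self.h):
            d_h = d_out * self.w2[j] * hj * (1 - hj)
            for i in range(len(x)):
                self.w1[j][i] -= lr * d_h * x[i]
            self.w1[j][-1] -= lr * d_h               # bias weight
        for j in range(len(hb)):
            self.w2[j] -= lr * d_out * hb[j]

net = TinyNet(2, 4)
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.1, 1.0], 1.0)]
for _ in range(5000):
    for x, t in data:
        net.train_step(x, t)
```

In the paper's setup the inputs would instead be the pre-processed feature vectors, with the 100 training recordings per word driving the weight updates and the 100 held-out recordings measuring generalization.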
CONCLUSION
This paper shows that neural networks can be very
powerful speech signal classifiers. A small set of words
could be recognized with some very simplified models. The
quality of the pre-processing has the biggest impact on the
neural network's performance. In the cases where the
spectrogram combined with entropy-based endpoint
detection was used, we observed poor classification
performance, making this combination a poor strategy for
the pre-processing stage. On the other hand, we observed
that Mel Frequency Cepstrum Coefficients are a very
reliable tool for the pre-processing stage, given the good
results they provide.