25-09-2012, 01:00 PM
THE AURORA EXPERIMENTAL FRAMEWORK FOR THE PERFORMANCE
EVALUATION OF SPEECH RECOGNITION SYSTEMS UNDER NOISY
CONDITIONS
ABSTRACT
This paper describes a database designed to evaluate the
performance of speech recognition algorithms in noisy
conditions. The database may be used to evaluate either
front-end feature extraction algorithms with a defined HMM
recognition back-end or complete recognition systems. The
source speech for this database is TIdigits, a
connected-digits task spoken by American English talkers
(downsampled to 8 kHz). A selection of 8 different real-world
noises has been added to the speech over a range of
signal-to-noise ratios, and special care has been taken to control
the filtering of both the speech and the noise.
The framework was prepared as a contribution to the ETSI
STQ-AURORA DSR Working Group [1]. Aurora is developing
standards for Distributed Speech Recognition (DSR) where the
speech analysis is done in the telecommunication terminal and
the recognition at a central location in the telecom network. The
framework is currently being used to evaluate alternative
proposals for front-end feature extraction. The database has been
made publicly available through ELRA so that other speech
researchers can evaluate and compare the performance of noise
robust algorithms.
INTRODUCTION
The robustness of a recognition system is heavily
influenced by the ability
· to handle the presence of background noise and
· to cope with the distortion caused by the frequency
characteristic of the transmission channel (often
also described as convolutional “noise”, although the
term convolutional distortion is preferred).
The importance of these issues is reflected by the
increasing number of investigations and publications on
these topics in recent years. This in turn is driven by
the need for robustness in real-life scenarios if
recognition systems are to be introduced successfully.
Robustness can be achieved by extracting robust
features in the front-end and/or by adapting the
references to the noise situation.
NOISY SPEECH DATABASE
The TIDigits database is taken as the basis. Only the part
containing recordings of male and female
US-American adults speaking isolated digits and
sequences of up to 7 digits is considered. The original
20 kHz data have been downsampled to 8 kHz with an “ideal”
low-pass filter extracting the spectrum between 0 and 4 kHz.
These data are considered the “clean” data. Distortions are
then added artificially.
Noise Adding
Noise is artificially added to the filtered TIDigits. To add
noise at a desired signal-to-noise ratio (SNR), the term
SNR must first be defined, because its value depends on the
selected frequency range. We define it as the ratio of signal
energy to noise energy after filtering both signals with the
G.712 characteristic. This assumes that the speech and
noise signals were recorded with good and similar equipment
that does not influence the spectrum of the original signals.
To determine the speech energy we apply ITU
recommendation P.56 [8], using the corresponding ITU
software. The noise energy is calculated as an RMS value
with the same software, where a noise segment of the same
length as the speech signal is randomly cut out of the
whole noise recording. We assume that the noise
signal is much longer than the speech signal.
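
The segment selection and RMS measurement described above can be sketched as follows. This is a minimal illustration, not the ITU software: the helper names `rms` and `cut_noise_segment` are hypothetical, the toy signals are synthetic, and the P.56 speech level measurement is not reproduced here.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def cut_noise_segment(noise, length, rng=random):
    """Return a randomly positioned slice of `noise` with `length` samples."""
    if length > len(noise):
        raise ValueError("noise recording is shorter than the speech signal")
    start = rng.randrange(len(noise) - length + 1)
    return noise[start:start + length]


# Synthetic toy data: a long noise recording and a much shorter utterance.
rng = random.Random(0)
noise = [rng.gauss(0.0, 100.0) for _ in range(8000)]
speech_length = 800

segment = cut_noise_segment(noise, speech_length, rng)
noise_rms = rms(segment)
```

Passing a seeded `random.Random` instance keeps the cut position reproducible, which is convenient when the same noisy utterance has to be regenerated.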
The level of the speech signal is left unchanged as long as
no overflow occurs in the short-integer range. Based on
the desired SNR, an attenuation factor is calculated, and the
noise samples are multiplied by it before being added to the
speech samples. The speech level is changed only in case
of an overflow, which happens only for the worst SNR of
-5 dB and in fewer than 10 cases in total across all noises.
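
A minimal sketch of this scaling step is given below, under several assumptions: plain RMS stands in for the P.56 speech level, the G.712 filtering is omitted, the short-integer range is taken to be 16 bits, and on overflow the whole mixture is attenuated (the text does not say how the speech level is changed). The function names are hypothetical.

```python
import math

# Assumed 16-bit "short integer" sample range.
INT16_MIN, INT16_MAX = -32768, 32767


def rms(samples):
    """Root-mean-square level (stand-in for the P.56 speech level)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_factor(speech_level, noise_level, snr_db):
    """Factor for the noise so that speech_level / (factor * noise_level)
    corresponds to snr_db, i.e. snr_db = 20 * log10 of that ratio."""
    return speech_level / (noise_level * 10.0 ** (snr_db / 20.0))


def mix(speech, noise, snr_db):
    """Add scaled noise to speech; rescale only if the sum would overflow."""
    factor = noise_factor(rms(speech), rms(noise), snr_db)
    mixed = [s + factor * n for s, n in zip(speech, noise)]
    peak = max(abs(m) for m in mixed)
    if peak > INT16_MAX:
        # Assumed overflow handling: attenuate speech and noise together,
        # which changes the speech level but leaves the SNR unchanged.
        scale = INT16_MAX / peak
        mixed = [m * scale for m in mixed]
    return [int(round(m)) for m in mixed]


# Toy signals: speech level 1000, noise level 10, target SNR 20 dB.
speech = [1000, -1000] * 4
noise = [10, -10] * 4
noisy = mix(speech, noise, 20)
```

Because the factor is applied to the noise only, the speech samples pass through unchanged whenever the sum stays within range, matching the behaviour described above.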
HTK REFERENCE RECOGNIZER
The reference recognizer is based on the HTK software
package version 2.2 from Entropic. The training and
recognition parameters are fixed so that recognition
results can be compared when applying different feature
extraction schemes. Some parameters, e.g. the number of
states per HMM, have been chosen with respect to
the commonly used frame rate of 100 Hz (frame shift =
10 ms). The task is the recognition of digit strings
without any restriction on the string length.
AURORA WI007 FRONT-END
As already mentioned in the introduction, the definition
of the whole experiment was initially motivated by a
requirement of the Aurora DSR standardization activity. It
will be used to select a robust front-end as a component of
telecommunication terminals for the realization of
distributed speech recognition. This selection process is
work item WI008 of the Aurora group. The proposers of
alternative candidates for the advanced DSR front-end are
evaluating their performance on this database as part of the
final submissions on 27th October 2000.
RECOGNITION PERFORMANCE
This section presents the recognition results obtained when
applying the WI007 front-end and the HTK recognition
scheme as described above. The MFCC of order 0 is not
part of the feature vector, which consists of the remaining 13
components together with the corresponding delta and
acceleration coefficients. Thus a vector contains 39
components in total. These results serve as the baseline
against which a relative improvement can be stated for the
proposals of the Aurora WI008 activity.
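
The way 13 static components grow into a 39-component vector can be sketched as below. The regression formula is the usual one for delta coefficients in HTK-style front-ends; the window length of 2 and the replication of edge frames are assumptions for illustration, not the Aurora configuration, and the function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a list of feature frames:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..window."""
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for i in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][i]  # replicate last frame
                behind = frames[max(t - k, 0)][i]    # replicate first frame
                acc += k * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Concatenate static, delta and acceleration coefficients per frame."""
    d = deltas(static)
    a = deltas(d)  # acceleration = delta of the delta coefficients
    return [s + dd + aa for s, dd, aa in zip(static, d, a)]


# Toy input: 10 frames of 13 static components rising linearly in time.
static = [[float(t)] * 13 for t in range(10)]
features = add_dynamic_features(static)
```

Each frame of `features` holds 13 static, 13 delta and 13 acceleration values, in that order.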
The word accuracy is listed in Table 1 for test set A when
applying multi-condition training. As is well known,
performance deteriorates as the SNR decreases. The
degradation does not differ significantly between the
noises. A performance measure for the whole test set has
been introduced as the average over all noises and over SNRs
between 0 and 20 dB. This average takes a value of 87.81%
for test set A.
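
The averaging rule can be sketched as follows. The accuracies in the table are made-up placeholders, not results from the paper, and the noise names, SNR grid and function name are likewise hypothetical.

```python
def average_accuracy(table, snr_range=(0, 20)):
    """Mean word accuracy over all noises and all SNRs within snr_range (dB)."""
    low, high = snr_range
    values = [acc
              for per_snr in table.values()
              for snr, acc in per_snr.items()
              if low <= snr <= high]
    return sum(values) / len(values)


# Placeholder word accuracies in percent, keyed by noise and SNR in dB.
# These numbers are invented for illustration only.
table = {
    "noise_1": {20: 90.0, 10: 80.0, 0: 60.0, -5: 30.0},
    "noise_2": {20: 94.0, 10: 84.0, 0: 64.0, -5: 34.0},
}
overall = average_accuracy(table)
```

Conditions outside the 0 to 20 dB range, such as the -5 dB entries here, are excluded from the average.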