MFCC Quantisation in Distributed Speech Recognition



Abstract

This chapter investigates the application of the multi-frame GMM-based block quantisation scheme to MFCC quantisation in distributed speech recognition and examines how it compares with other schemes. The advantages of the multi-frame GMM-based block quantiser are: superior recognition performance at low bitrates, which is comparable with vector quantisation; fixed and relatively low computational and memory complexity that is independent of bitrate; and bitrate scalability, where the bitrate can be dynamically altered without requiring codebook re-training.
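
To make the bitrate-scalability property concrete, the Python sketch below (not taken from the chapter) shows the classical high-resolution bit-allocation rule applied across the KLT coefficients of a single mixture component: b_k = b_avg + 0.5 log2(lambda_k / gm), where gm is the geometric mean of the eigenvalues. Because the allocation is a closed-form function of the fixed GMM/KLT parameters and the target bitrate, changing the bitrate changes only this computation, not the trained model. The function name and the clipping of negative allocations are illustrative simplifications.

import numpy as np

def allocate_component_bits(eigvals, total_bits):
    # Classical high-resolution bit allocation across transform (KLT)
    # coefficients: b_k = b_avg + 0.5 * log2(lambda_k / geometric_mean).
    eigvals = np.asarray(eigvals, dtype=float)
    b_avg = total_bits / len(eigvals)
    gm = np.exp(np.mean(np.log(eigvals)))
    bits = b_avg + 0.5 * np.log2(eigvals / gm)
    # A real quantiser redistributes negative allocations; we just clip.
    return np.clip(bits, 0.0, None)

# Example: eigenvalues of one (made-up) component's covariance matrix.
print(allocate_component_bits([4.0, 2.0, 1.0, 0.5], total_bits=8))
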
We begin the chapter with some background theory on speech recognition, which covers the basic ideas of feature extraction and pattern recognition using hidden Markov models (HMMs). Following this, we provide a general review of client/server-based speech recognition systems and the various types of modes (NSR and DSR) that have been proposed and reported in the literature. We also briefly describe the Aurora-2 DSR experimental framework, which will be used extensively to evaluate the performance and robustness to noise of the various DSR schemes. The second half of the chapter is dedicated to presenting and discussing results of different quantisation schemes applied to a common DSR framework.

Preliminaries of Speech Recognition

Figure 7.1 shows a block diagram of a speech recognition system, highlighting its main components. In this section, we give only a brief review of each of these components rather than a comprehensive coverage of the algorithms used in modern recognition systems, as the scope of this chapter is focused on the efficient quantisation of MFCC features for distributed speech recognition.
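
As a concrete reference for the MFCC features quantised throughout this chapter, here is a compact Python sketch of a typical MFCC front end (pre-emphasis, windowing, power spectrum, mel filterbank, log, DCT). The parameter values follow common DSR practice (8 kHz speech, 25 ms frames with a 10 ms shift, 23 mel filters, 13 cepstra) but are assumptions here, not necessarily the chapter's exact configuration.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, frame_len=200, hop=80, n_fft=256,
         n_mels=23, n_ceps=13):
    # Pre-emphasis flattens the spectral tilt of voiced speech.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Overlapping frames, Hamming-windowed.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced uniformly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT to decorrelate -> MFCCs.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, norm='ortho')[:, :n_ceps]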

Speech Production

Speech sounds can be broadly classified as either voiced or unvoiced. Voiced sounds, such as /iy/ (as in see), are periodic and have a harmonic structure that is not present in unvoiced sounds, such as /s/, which are aperiodic and noise-like. These are best visualised in Figure 7.2, which shows the waveform and spectrogram of the sentence "she had your dark suit in greasy wash-water all year", and highlights the voiced and unvoiced sections in the first word, "she". Notice that the spectrum for /sh/ is flat, similar to that of noise, while the spectrum of /iy/ shows a harmonic structure, as characterised by the alternating bands.
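
This periodicity distinction can be checked programmatically. The toy Python function below (an illustration, not part of the chapter; the 0.3 threshold and the 50-400 Hz pitch range are arbitrary choices) labels a frame voiced when its normalised autocorrelation has a strong peak in the plausible pitch-lag range.

import numpy as np

def voiced_unvoiced(frame, fs=8000, fmin=50, fmax=400, thresh=0.3):
    # Voiced frames such as /iy/ are periodic, so their normalised
    # autocorrelation shows a strong peak at the pitch lag; noise-like
    # frames such as /s/ or /sh/ do not.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac /= ac[0] + 1e-12                  # normalise so ac[0] == 1
    lo, hi = fs // fmax, fs // fmin      # plausible pitch-lag range
    return 'voiced' if ac[lo:hi].max() > thresh else 'unvoiced'

fs = 8000
t = np.arange(fs // 40) / fs             # one 25 ms frame
print(voiced_unvoiced(np.sin(2 * np.pi * 150 * t), fs))   # -> voiced
print(voiced_unvoiced(np.random.randn(len(t)), fs))       # -> unvoiced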

Client/Server-Based Speech Recognition

With the increase in popularity of remote and wireless devices such as personal digital assistants (PDAs) and cellular phones, there has been growing interest in applying automatic speech recognition (ASR) technology in the context of mobile communication systems. Speech recognition allows users to perform common tasks that have traditionally been accomplished via buttons or pointing devices, such as making a call through voice dialing or entering data into their PDAs via spoken commands and sentences. Some of the issues that arise when implementing ASR on mobile devices include: computational and memory constraints of the mobile device; network bandwidth utilisation; and robustness to noisy operating conditions.

Network Speech Recognition

In the Network Speech Recognition (NSR) mode [85], the user's speech is compressed using conventional speech coders (such as the GSM speech coder) and transmitted to the server, which performs the recognition task. In speech-based NSR (Figure 7.7(a)), the server calculates ASR features from the decoded speech to perform the recognition. In bitstream-based NSR (Figure 7.7(b)), the server uses ASR features that are derived from linear predictive coding (LPC) parameters taken directly from the bitstream. Numerous studies have been reported in the literature evaluating and comparing the performance of these two forms of NSR [48, 69, 74, 83, 99, 140, 189, 51].
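
The step that distinguishes bitstream-based NSR is deriving cepstral features directly from the coder's LPC parameters rather than from re-synthesised speech. A minimal Python sketch of the standard LPC-to-cepstrum recursion follows; the sign convention assumes an all-pole model H(z) = 1 / (1 - sum_k a_k z^-k), and the example coefficients are made up.

import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    # Standard recursion: c_n = a_n + (1/n) * sum_{k=1}^{n-1} k c_k a_{n-k}
    # for n <= p, with the a_n term dropped for n > p. This is the kind
    # of step a bitstream-based NSR server uses to derive ASR features
    # from LPC parameters without re-synthesising speech.
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

a = np.array([0.9, -0.4, 0.2, -0.1])   # made-up 4th-order LPC vector
print(lpc_to_cepstrum(a, n_ceps=12))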

Literature Review of Speech-Based NSR

Euler and Zinke [48] investigated the effect of three CELP-based speech coders, LD-CELP, RPE-LTP, and TETRA-CELP at 16, 13, and 4.8 kbps, respectively, on isolated word recognition and speaker verification. Narrowband speech was coded and decoded using the CELP coders, and 12 LPCCs and their delta coefficients were extracted from the decoded speech. They found that the speech coders operating at 13 kbps and lower decreased the recognition performance in matched and mismatched conditions.
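
For reference, the delta coefficients mentioned above are conventionally computed with a linear-regression formula over neighbouring frames, d_t = sum_{n=1}^{N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1}^{N} n^2). The Python sketch below uses N = 2 and edge replication, both common but assumed choices rather than the exact configuration used in [48].

import numpy as np

def deltas(ceps, N=2):
    # `ceps` is (frames, coeffs); edge frames are handled by repeating
    # the first/last frame before applying the regression formula.
    padded = np.pad(ceps, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = sum(n * (padded[N + n:len(ceps) + N + n] -
                 padded[N - n:len(ceps) + N - n])
            for n in range(1, N + 1))
    return d / denom

c = np.tile(np.arange(6.0)[:, None], (1, 2))  # toy linearly rising "cepstra"
print(deltas(c))                              # interior frames -> slope 1.0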