SPEECH CODING: FUNDAMENTALS AND APPLICATIONS


INTRODUCTION

Speech coding is the process of obtaining a compact
representation of voice signals for efficient transmission
over band-limited wired and wireless channels and/or
storage. Today, speech coders have become essential
components in telecommunications and in the multimedia
infrastructure. Commercial systems that rely on efficient
speech coding include cellular communication, voice over
Internet protocol (VoIP), videoconferencing, electronic
toys, archiving, and digital simultaneous voice and data
(DSVD), as well as numerous PC-based games and
multimedia applications.
Speech coding is the art of creating a minimally
redundant representation of the speech signal that can
be efficiently transmitted or stored in digital media, and
decoding the signal with the best possible perceptual
quality. Like any other continuous-time signal, speech may
be represented digitally through the processes of sampling
and quantization; speech is typically quantized using
either 16-bit uniform or 8-bit companded quantization.
Like many other signals, however, a sampled speech
signal contains a great deal of information that is
either redundant (nonzero mutual information between
successive samples in the signal) or perceptually irrelevant
(information that is not perceived by human listeners).
Most telecommunications coders are lossy, meaning that
the synthesized speech is perceptually similar to the
original but may be physically dissimilar.
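The 8-bit companded quantization mentioned above can be sketched with the continuous mu-law characteristic (mu = 255, as used in North American telephony). This is a simplified illustration of the idea, not the piecewise-linear segment tables of the G.711 standard:

```python
import math

MU = 255.0  # North American mu-law constant

def mu_compress(x):
    """Compress a sample x in [-1, 1] with the mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Invert mu-law compression."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(y, bits=8):
    """Uniform quantization of y in [-1, 1] to 2**bits levels."""
    levels = 2 ** (bits - 1)
    return round(y * levels) / levels

# Companding spends more of the 8-bit range on quiet samples,
# so a low-level sample survives far better than with 8-bit
# uniform quantization:
x = 0.004
companded = mu_expand(quantize(mu_compress(x), bits=8))
uniform = quantize(x, bits=8)
```

For the quiet sample above, the companded path reconstructs roughly 0.0039 while the uniform path rounds to the nearest coarse level, 0.0078, about a fifty-fold larger error.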

WAVEFORM CODING

Waveform coders attempt to code the exact shape of the
speech signal waveform, without considering in detail the
nature of human speech production and speech perception.
Waveform coders are most useful in applications that
require the successful coding of both speech and nonspeech
signals. In the public switched telephone network (PSTN),
for example, successful transmission of modem and
fax signaling tones, and switching signals is nearly as
important as the successful transmission of speech. The
most commonly used waveform coding algorithms are
uniform 16-bit PCM, companded 8-bit PCM [48], and
ADPCM [46].
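The idea behind ADPCM (predict each sample, quantize only the prediction error, adapt the quantizer step size) can be illustrated with a deliberately simplified codec. The G.726 coder standardized for the PSTN uses a higher-order adaptive predictor and fixed step-scaling tables; the parameters below are illustrative only:

```python
import math

def toy_adpcm(samples, step=0.05):
    """Simplified ADPCM: a first-order predictor (the previous
    decoded sample), a 2-bit quantizer for the prediction error
    (sign bit plus one magnitude bit), and a step size that grows
    on large errors and shrinks on small ones.  Returns the decoded
    signal; a real decoder tracks identical state from the codes."""
    pred = 0.0
    decoded = []
    for x in samples:
        err = x - pred
        big = abs(err) > step              # the transmitted magnitude bit
        level = 1.5 if big else 0.5        # dequantized |error| in steps
        pred += math.copysign(level * step, err)
        decoded.append(pred)
        step = min(max(step * (1.5 if big else 0.9), 1e-4), 1.0)
    return decoded

# Track a slowly varying tone (50 Hz at an 8-kHz sampling rate):
tone = [math.sin(2 * math.pi * 50 * n / 8000) for n in range(400)]
out = toy_adpcm(tone)
```

Because the encoder predicts from the *decoded* signal rather than the original, the decoder can reproduce `pred` exactly from the transmitted bits alone.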

SUBBAND CODING

In subband coding, an analysis filterbank is first used to
filter the signal into a number of frequency bands, and bits are
then allocated to each band according to some criterion (e.g., band
energy or a perceptual masking model).
Because of the difficulty in obtaining high-quality speech
at low bit rates using subband coding schemes, these
techniques have been used mostly for wideband medium
to high bit rate speech coders and for audio coding.
For example, G.722 is a standard in which ADPCM
speech coding occurs within two subbands, and bit
allocation is set to achieve 7-kHz audio coding at rates
of 64 kbps or less.
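The two-subband structure can be demonstrated with the simplest perfect-reconstruction filterbank, the 2-tap Haar QMF pair. G.722 itself uses much longer QMF filters, so this is a sketch of the structure only:

```python
def analysis(x):
    """Split x into decimated low and high bands (2-tap Haar QMF):
    the low band keeps pairwise averages, the high band pairwise
    differences, each at half the input rate."""
    lo = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    hi = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return lo, hi

def synthesis(lo, hi):
    """Upsample and recombine the two bands; exactly inverts analysis()."""
    x = []
    for l, h in zip(lo, hi):
        x.extend([l + h, l - h])
    return x

# In a coder, most bits go to the low band (G.722's 64-kbps mode
# spends 6 of every 8 bits there); here we just verify that the
# filterbank itself is lossless.
signal = [0.1, -0.3, 0.7, 0.2, -0.5, 0.4]
lo, hi = analysis(signal)
restored = synthesis(lo, hi)
```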
In Refs. 12, 13, and 30, subband coding is proposed as
a flexible scheme for robust speech coding. A speech production
model is not used, ensuring robustness to speech
in the presence of background noise, and to nonspeech
sources. High-quality compression can be achieved by
incorporating masking properties of the human auditory
system [54,93]. In particular, Tang et al. [93] present a
scheme for robust, high-quality, scalable, and embedded
speech coding. Figure 3 illustrates the basic structure
of the coder. Dynamic bit allocation and prioritization
and embedded quantization are used to optimize the perceptual
quality of the embedded bitstream, resulting in
little performance degradation relative to a nonembedded
implementation. A subband spectral analysis technique
was developed that substantially reduces the complexity
of computing the perceptual model.
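Dynamic bit allocation of the kind described above can be sketched as a greedy loop: each additional bit goes to the band whose quantization noise is currently largest, under the common rule of thumb that one extra bit buys about 6 dB (a factor of 4 in noise power). A perceptual coder would first divide each band's energy by its masking threshold; this loop is illustrative, not the specific algorithm of Tang et al.:

```python
def allocate_bits(band_energy, total_bits):
    """Greedy dynamic bit allocation across subbands.  Treat each
    band's (masking-weighted) energy as its initial quantization
    noise; each bit given to a band divides its noise by 4 (6 dB).
    Bits therefore flow to whichever band is currently noisiest."""
    noise = list(band_energy)
    bits = [0] * len(noise)
    for _ in range(total_bits):
        i = max(range(len(noise)), key=lambda k: noise[k])
        bits[i] += 1
        noise[i] /= 4.0
    return bits
```

A band 16 times (12 dB) stronger than its neighbor absorbs the first two bits before the allocation evens out, which is the behavior one wants: coarse quantization of strong bands is the most audible error.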

Perceptual Error Weighting

Not all types of distortion are equally audible. Many
types of speech coders, including LPC-AS coders, use
simple models of human perception in order to minimize
the audibility of different types of distortion. In LPC-AS
coding, two types of perceptual weighting are commonly
used. The first type, perceptual weighting of the residual
quantization error, is used during the LPC excitation
search in order to choose the excitation vector with the
least audible quantization error. The second type, adaptive
postfiltering, is used to reduce the perceptual importance
of any remaining quantization error.
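The first type of weighting is commonly implemented with the filter W(z) = A(z/gamma1)/A(z/gamma2), which is small near the formant peaks, so error there (where it is masked by the formant) is penalized less. A minimal sketch, assuming the convention A(z) = 1 - sum_k a_k z^-k and representative values gamma1 = 0.9, gamma2 = 0.6:

```python
def weighting_filter(lpc, gamma1=0.9, gamma2=0.6):
    """Coefficients of W(z) = A(z/gamma1) / A(z/gamma2), where
    A(z) = 1 - sum_k a_k z^-k.  Replacing z by z/gamma scales the
    k-th coefficient by gamma**k, expanding the formant bandwidths."""
    num = [1.0] + [-a * gamma1 ** (k + 1) for k, a in enumerate(lpc)]
    den = [1.0] + [-a * gamma2 ** (k + 1) for k, a in enumerate(lpc)]
    return num, den

def filt(num, den, x):
    """Direct-form IIR filtering of x by num(z)/den(z)."""
    y = []
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n >= k)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n >= k)
        y.append(acc)
    return y
```

During the excitation search, candidate error signals are passed through `filt(num, den, ...)` before their energies are compared, so the winning excitation is the one with the least *audible* error rather than the least mean-squared error. When gamma1 equals gamma2, W(z) = 1 and the search reduces to plain mean-squared error.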

Frame-Based Analysis

The characteristics of the LPC excitation signal u(n)
change quite rapidly. The energy of the signal may change
from zero to nearly full amplitude within one millisecond
at the release of a plosive sound, and a mistake of more
than about 5 ms in the placement of such a sound is
clearly audible. The LPC coefficients, on the other hand,
change relatively slowly. In order to take advantage of the
slow rate of change of LPC coefficients without sacrificing
the quality of the coded residual, most LPC-AS coders
encode speech using a frame–subframe structure, as
depicted in Fig. 8. A frame of speech is approximately
20 ms in length, and is typically composed of three or four
subframes. The LPC excitation is transmitted only once
per subframe, while the LPC coefficients are transmitted
only once per frame. The LPC coefficients are computed
by analyzing a window of speech that is usually longer
than the speech frame (typically 30–60 ms).
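The frame-subframe bookkeeping can be sketched as below. The constants (8-kHz sampling, 20-ms frames, four 5-ms subframes, a 30-ms analysis window starting at the frame boundary) are representative values consistent with the text, not those of any particular standard:

```python
FS = 8000        # sampling rate (Hz)
FRAME = 160      # 20-ms frame
SUBFRAMES = 4    # four 5-ms subframes per frame
WINDOW = 240     # 30-ms LPC analysis window

def frame_schedule(num_samples):
    """Yield, for each frame: its start, the (longer) LPC analysis
    window, and the subframe boundaries.  LPC coefficients are
    computed once per window and transmitted once per frame; the
    excitation is transmitted once per subframe."""
    sub = FRAME // SUBFRAMES
    for start in range(0, num_samples - WINDOW + 1, FRAME):
        window = (start, start + WINDOW)
        subframes = [(start + i * sub, start + (i + 1) * sub)
                     for i in range(SUBFRAMES)]
        yield start, window, subframes

schedule = list(frame_schedule(800))   # 100 ms of 8-kHz speech
```

Note that consecutive analysis windows overlap (each extends 10 ms past its frame), which smooths the frame-to-frame evolution of the LPC coefficients.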

Multiband Excitation (MBE)

In multiband excitation (MBE) coding the voiced/unvoiced
decision is not a binary one; instead, a series of
voicing decisions are made for independent harmonic
intervals [31]. Since voicing decisions can be made in
different frequency bands individually, synthesized speech
may be partially voiced and partially unvoiced. An
improved version of the MBE was introduced in the late
1980s [7,35] and referred to as the IMBE coder. The IMBE
at 2.4 kbps produces better sound quality than does the
LPC-10e. The IMBE was adopted as the Inmarsat-M
coding standard for satellite voice communication at a
total rate of 6.4 kbps, including 4.15 kbps of source coding
and 2.25 kbps of channel coding [104]. The Advanced
MBE (AMBE) coder was adopted as the Inmarsat Mini-M
standard at a 4.8 kbps total data rate, including 3.6 kbps
of speech and 1.2 kbps of channel coding [18,27].
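The band-by-band voicing idea can be sketched as follows. A real MBE coder decides voicing by fitting a synthetic voiced spectrum to each harmonic band; this toy version merely checks whether a band's DFT energy is concentrated at the harmonic bin, and its threshold and band edges are illustrative:

```python
import cmath
import math

def band_voicing(frame, f0, fs=8000, threshold=0.8):
    """Toy MBE-style analysis: one voiced/unvoiced decision per
    harmonic band.  A band is declared voiced when most of its DFT
    energy sits in the bin nearest the harmonic frequency; empty
    bands are declared unvoiced."""
    N = len(frame)
    mag2 = []
    for k in range(N // 2):                       # direct DFT, O(N^2)
        X = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        mag2.append(abs(X) ** 2)
    bins_per_hz = N / fs
    decisions = []
    h = 1
    while (h + 0.5) * f0 * bins_per_hz < len(mag2):
        center = round(h * f0 * bins_per_hz)
        lo = round((h - 0.5) * f0 * bins_per_hz)
        hi = round((h + 0.5) * f0 * bins_per_hz)
        band = sum(mag2[lo:hi])
        decisions.append(band > 1e-9 and mag2[center] / band > threshold)
        h += 1
    return decisions

# A pure 200-Hz tone: the first harmonic band is voiced; the higher
# bands carry (numerically) no energy and come out unvoiced.
frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
v = band_voicing(frame, f0=200)
```

For a voiced vowel with breathy high frequencies, the same loop would return voiced decisions in the low bands and unvoiced ones higher up, which is exactly the mixed excitation MBE is designed to represent.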

Prototype Waveform Interpolative (PWI) Coding

A different kind of coding technique that has properties
of both waveform and LPC-based coders has been
proposed [59,60] and is called prototype waveform interpolation
(PWI). PWI uses both interpolation in the frequency
domain and forward–backward prediction in the time
domain. The technique is based on the assumption that, for
voiced speech, a perceptually accurate speech signal can
be reconstructed from a description of the waveform of a
single, representative pitch cycle per interval of 20–30 ms.
The assumption exploits the fact that voiced speech can
be interpreted as a concatenation of slowly evolving pitch
cycle waveforms. The prototype waveform is described by
a set of linear prediction (LP) filter coefficients describing
the formant structure and a prototype excitation waveform,
quantized with analysis-by-synthesis procedures.
The speech signal is reconstructed by filtering an excitation
signal consisting of the concatenation of (infinitesimal)
sections of the instantaneous excitation waveforms.
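The interpolation step can be sketched in the time domain. Real PWI interpolates the prototypes in the frequency (harmonic) domain and also tracks a changing pitch period; this simplified version assumes equal-length pitch cycles and simply crossfades between the two transmitted prototypes:

```python
def pwi_synthesize(proto_a, proto_b, num_cycles):
    """Reconstruct a voiced segment between two transmitted
    prototype pitch cycles: cycle i is a weighted mix of the two
    prototypes, moving linearly from proto_a to proto_b."""
    out = []
    for i in range(num_cycles):
        w = i / max(num_cycles - 1, 1)
        out.extend((1 - w) * a + w * b
                   for a, b in zip(proto_a, proto_b))
    return out

# Only the two endpoint cycles are transmitted (one per 20-30-ms
# interval); every cycle in between is interpolated.
seg = pwi_synthesize([1.0, 0.0, -1.0, 0.0], [2.0, 0.0, -2.0, 0.0], 3)
```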

MEASURES OF SPEECH QUALITY

Deciding on an appropriate measurement of quality is
one of the most difficult aspects of speech coder design,
and is an area of current research and standardization.
Early military speech coders were judged according to only
one criterion: intelligibility. With the advent of consumer-grade
speech coders, intelligibility is no longer a sufficient
condition for speech coder acceptability. Consumers want
speech that sounds "natural." A large number of subjective
and objective measures have been developed to quantify
"naturalness," but it must be stressed that any scalar
measurement of "naturalness" is an oversimplification.
"Naturalness" is a multivariate quantity, including such
factors as the metallic versus breathy quality of speech,
the presence of noise, the color of the noise (narrowband
noise tends to be more annoying than wideband noise,
but the parameters that predict "annoyance" are not well
understood), the presence of unnatural spectral envelope
modulations (e.g., flutter noise), and the absence of natural
spectral envelope modulations.
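One of the simplest objective measures is the segmental SNR, which averages per-frame SNRs instead of computing one global ratio, so quiet stretches of speech are weighted equally with loud ones. It is easy to compute but only a crude proxy for perceived quality, and it correlates poorly with the quality of parametric coders whose output is physically dissimilar to the input:

```python
import math

def segmental_snr(clean, coded, frame=160):
    """Segmental SNR in dB: compute SNR over each short frame
    (160 samples = 20 ms at 8 kHz), then average the per-frame
    values.  Frames with zero signal or zero error are skipped."""
    snrs = []
    for s in range(0, len(clean) - frame + 1, frame):
        sig = sum(x * x for x in clean[s:s + frame])
        err = sum((x - y) ** 2
                  for x, y in zip(clean[s:s + frame], coded[s:s + frame]))
        if sig > 0 and err > 0:
            snrs.append(10 * math.log10(sig / err))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

For example, a coder whose only distortion is a uniform 10% amplitude error scores 20 dB on every frame, regardless of the signal level in that frame.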