TEXT-INDEPENDENT SPEAKER RECOGNITION
1.1 Introduction
The fundamental issues of automatic speaker recognition include speaker
characterization, classifier design, channel- and session-compensation techniques,
and score normalization algorithms for a robust speaker recognition decision. In this
chapter, an overview on text-independent speaker recognition systems is first given.
Then, the acoustic feature extraction and the state-of-the-art speaker classification
approaches are described. A further discussion on compensation of session variability
and score normalization for robust speaker recognition then follows. Finally, a
performance assessment is provided, as a case study, on the National Institute of
Standards and Technology (NIST) 2008 Speaker Recognition Evaluation corpus.
1.2 Overview of Speaker Recognition
Automatic speaker recognition is a technology that identifies a speaker, based
on his/her speech, using a computer system. Just like any other pattern
recognition problem, it involves feature extraction, speaker modeling (or speaker
characterization), and a classification decision strategy, as shown in Fig. 1.1. Speaker
recognition is often referred to as voiceprint or voice biometrics in applications
where a speaker’s identity is required, for example, in access control and forensic
investigation. As opposed to other biometric technologies, such as biometric
recognition with fingerprint, face and iris, an advantage of speaker recognition is
that it does not require any specialized hardware for the user interface. The only
requirement is a microphone, which is widely available thanks to the pervasive
telephone infrastructure. Over the telephone, speaker enrollment and recognition
can be carried out remotely.
B. Ma and H. Li
Fig. 1.1. A speaker recognition system: input speech undergoes feature extraction, followed by speaker modeling during enrollment, and by pattern matching against a speaker model database and a decision strategy during recognition.
1.2.1 Text-dependent and text-independent speaker recognition
If a speaker recognition system operates on speech of pre-defined text, it is
called a text-dependent system; otherwise, it is a text-independent one. A text-dependent
speaker recognition system can work with fixed text, such as passwords,
credit card numbers, or telephone numbers, as well as with prompted text, which
is given by the system at the point of use. It works well in situations where
the speakers are cooperative. A text-independent speaker recognition system, on
the other hand, does not specify any speech content. It accepts speech with
any content and even in different languages. The latter provides great flexibility
especially when the speakers of interest are not available to provide speech
samples.
In general, a text-dependent speaker recognition system, with comparable
speech content, is more accurate than a text-independent one. Although a
speaker can say any words during enrollment, a text-dependent speaker recognition
system usually assumes that the words spoken during run-time tests have
already been enrolled and are known to the system. In this way, a verbatim
comparison using hidden Markov model (HMM) (Rabiner and Juang, 1993),
a commonly used acoustic model structure in automatic speech recognition
(ASR), becomes possible and has shown promising results (Naik et al., 1989;
Matsui and Furui, 1993; Che et al., 1996; Parthasarathy and Rosenberg,
1996).
Text-independent speaker recognition provides a more flexible application
scenario because it does not require information about the words spoken. Among the
research activities in this area, the National Institute of Standards and Technology
(NIST, US) has conducted a series of speaker recognition evaluations (SREs) since
1996 (NIST, 1996), which provide a common platform for text-independent speaker
recognition technology benchmarking. The NIST SRE has seen increasing
participation in recent years.
1.2.2 Speaker identification and verification
In practice, speaker recognition is implemented as either an identification or a
verification task. Speaker identification is the process of determining which of the
speakers known to the system is talking. Speaker verification, on the
other hand, is the process of a binary decision, accepting or rejecting the identity
claim of a speaker, given an input speech.
Speaker identification can be regarded as a one-to-many problem, identifying
the person as one of the many in the speaker database, while speaker verification can
be regarded as a one-to-one problem, answering the question: “Is the person who he
claims to be?” While the two tasks are formulated differently to address different
application needs, they share common technical challenges. Fundamentally, they are
just different applications of the same speaker modeling and classification techniques,
which are studied in this chapter. For simplicity, these techniques are presented only
in relation to speaker verification applications.
1.2.3 Speaker verification framework
Just like any other problem in pattern recognition, both speaker identification and
speaker verification require a training process for each speaker to register with the
system before a test can be conducted. A typical example of a speaker verification
system is shown in Fig. 1.2.
Fig. 1.2. A speaker verification system: during enrollment, feature extraction and speaker modeling populate a speaker model database; during verification, the extracted features are scored against the claimed speaker model and background models, the score is normalized, and a threshold-based accept/reject decision is made.
During enrollment, a feature extraction process converts the speech samples
into speech feature vectors. A speaker model, such as Gaussian mixture models
(GMMs) or support vector machines (SVMs), is then built to characterize a speaker
from his/her speech features. The resulting speaker models are kept in a speaker
database for future verification tests.
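As a concrete illustration of GMM-based speaker modeling, the likelihood computation at the heart of such a system can be sketched in pure Python. This is a minimal sketch with toy, hand-picked parameters; a real system estimates the mixture parameters from a speaker's feature vectors with the EM algorithm:

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Average log-likelihood of feature frames under a diagonal-covariance GMM.

    frames:    list of feature vectors (lists of floats)
    weights:   mixture weights, summing to 1
    means:     per-component mean vectors
    variances: per-component variance vectors (diagonal covariance)
    """
    total = 0.0
    for x in frames:
        # log of each component's weighted Gaussian density
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_gauss = -0.5 * sum(
                math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                for xi, m, v in zip(x, mu, var)
            )
            comp_logs.append(math.log(w) + log_gauss)
        # log-sum-exp over components for numerical stability
        m_ = max(comp_logs)
        total += m_ + math.log(sum(math.exp(c - m_) for c in comp_logs))
    return total / len(frames)

# Toy 1-D, two-component model centered on 0 and 3
weights = [0.5, 0.5]
means = [[0.0], [3.0]]
variances = [[1.0], [1.0]]
print(gmm_log_likelihood([[0.1], [2.9]], weights, means, variances))
```

Frames lying near the model's component means score higher than frames far away, which is what allows the claimed speaker's model to be compared against alternatives.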
During verification, the same feature extraction process is employed. The speech
feature vectors are then evaluated against the claimed speaker model. To make a
robust decision, a score normalization algorithm is a crucial component in a speaker
verification system. It scales the matching score against the scores given by a
group of background speakers, to make a calibrated decision. Finally, a decision
can be made by comparing the normalized score with a decision threshold, which
is estimated from a development database. Figure 1.2 shows the fundamental
components of a speaker verification system. Each of these components is reviewed
and then a case study of performance evaluation is carried out on the NIST 2008
Speaker Recognition Evaluation (SRE) corpus.
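The normalization-and-threshold step described above can be sketched as follows. This is a minimal Z-norm-style sketch; the background scores and the threshold value are invented for illustration:

```python
def znorm_decision(raw_score, background_scores, threshold):
    """Scale a raw match score against background-speaker scores (Z-norm style)
    and compare the normalized score with a decision threshold."""
    n = len(background_scores)
    mean = sum(background_scores) / n
    var = sum((s - mean) ** 2 for s in background_scores) / n
    std = var ** 0.5 if var > 0 else 1.0  # guard against zero spread
    normalized = (raw_score - mean) / std
    return ("accept" if normalized >= threshold else "reject"), normalized

# A raw score well above the background distribution is accepted
decision, score = znorm_decision(2.5, [0.1, -0.3, 0.2, 0.0], threshold=1.5)
print(decision)
```

The key idea is that the decision is made on a score relative to the background speakers, so the same threshold remains meaningful across different speakers and sessions.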
1.3 Feature Extraction
It is believed that the digital speech signal carries substantial redundant information
as far as speech or speaker recognition is concerned. Transforming the input
speech signal into a sequence of feature vectors is called feature extraction. The
resulting feature vectors are expected to be discriminative among speakers and
robust against any distortion and noise. Low-level acoustic feature extraction is also
a data-reduction process that attempts to capture the essential characteristics of
the speaker and provides a more stable and compact representation of the input
speech than the raw speech signals. The popular acoustic features adopted in
ASR, such as Mel-frequency cepstral coefficients (MFCCs) (Davis and Mermelstein,
1980), perceptual linear predictive (PLP) coefficients (Hermansky, 1990) and linear
predictive cepstral coefficients (LPCCs) (Atal, 1974), are also effective in speaker
recognition.
1.3.1 Spectral analysis
The spectral analysis consists of several processing steps. The input speech signals
pass a low-pass filter to remove high-frequency components and are segmented into
frames. Each frame is a time slice, which is short enough so that the speech wave
can be considered stationary within the frame. For example, a typical practice
is to segment the speech into frames with a 20 ms window at a 10 ms frame shift.
A Hamming window is typically applied to each frame to minimize the
signal discontinuities at the frame boundaries.
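The framing and windowing just described can be sketched in pure Python, assuming a 16 kHz sampling rate (the synthetic sine signal stands in for real audio):

```python
import math

def frame_signal(signal, sample_rate=16000, win_ms=20, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    win = int(sample_rate * win_ms / 1000)      # 20 ms -> 320 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 10 ms -> 160 samples
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, shift):
        frame = signal[start:start + win]
        # element-wise multiplication tapers the frame edges toward zero
        frames.append([s * w for s, w in zip(frame, hamming)])
    return frames

# One second of a synthetic signal yields (16000 - 320) / 160 + 1 = 99 frames
frames = frame_signal([math.sin(0.01 * t) for t in range(16000)])
print(len(frames), len(frames[0]))
```

Each 20 ms frame overlaps its neighbor by 10 ms, so the short-time analysis tracks the signal smoothly while each frame remains approximately stationary.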
Voice activity detection (VAD) is used to detect the presence of human voice in
the signal. It plays an important role as a pre-processing stage in almost all
speech-processing applications, including speaker recognition. It improves the performance
of speaker recognition by excluding silence and noise.
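The role of VAD can be illustrated with a simple energy-based detector. This is a minimal sketch on toy frames; practical VADs use trained models and adaptive thresholds:

```python
def energy_vad(frames, threshold_ratio=0.1):
    """Keep only frames whose energy exceeds a fraction of the peak frame energy."""
    energies = [sum(s * s for s in f) for f in frames]
    threshold = threshold_ratio * max(energies)
    return [f for f, e in zip(frames, energies) if e > threshold]

# The middle frame is near-silence and is discarded
speech = [[0.5, -0.4, 0.6], [0.0, 0.01, 0.0], [0.7, 0.5, -0.6]]
voiced = energy_vad(speech)
print(len(voiced))
```

Dropping low-energy frames before modeling keeps silence and background noise from diluting the speaker-specific statistics.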
Among the different ways of determining the short-time spectral representation
of speech, MFCC was originally developed for speech recognition but also found to
be effective in speaker recognition. The choice of center frequencies and bandwidths
of the filter bank used in MFCC is determined by the properties of the human
auditory system. In particular, a Fourier transform is applied to the windowed
speech signals and a set of bandpass filters is used. These filters are normally equally
spaced in the Mel scale. The output of each filter can be considered as representing
the energy of the signal within the passband of the filter. A discrete cosine transform
can be applied to the log-energy outputs of these filters, in order to calculate the
MFCC as follows
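The last two steps of the pipeline just described, taking the log of the filter-bank energies and applying a type-II DCT, can be sketched in pure Python. The filter-bank energies below are synthetic stand-ins, since the full front end is not shown here:

```python
import math

def mfcc_from_filterbank(energies, num_ceps=13):
    """Apply a log and a type-II DCT to mel filter-bank energies,
    yielding the first num_ceps cepstral coefficients."""
    log_e = [math.log(e) for e in energies]
    k = len(log_e)
    # unnormalized DCT-II: c_n = sum_j log_e[j] * cos(pi * n * (j + 0.5) / k)
    return [sum(log_e[j] * math.cos(math.pi * n * (j + 0.5) / k)
                for j in range(k))
            for n in range(num_ceps)]

# Synthetic positive energies standing in for 24 mel filter outputs
energies = [1.0 + 0.5 * math.sin(0.3 * j) for j in range(24)]
ceps = mfcc_from_filterbank(energies)
print(len(ceps))
```

Keeping only the first dozen or so coefficients discards fine spectral detail and retains the smooth spectral envelope, which is what carries most of the speaker information.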