13-08-2012, 02:27 PM
A Tutorial on Text-Independent Speaker Verification
A Tutorial on Text-Independent Speaker Verification.pdf (Size: 1,012.92 KB / Downloads: 66)
1. INTRODUCTION
Numerous measurements and signals have been proposed
and investigated for use in biometric recognition systems.
Among the most popular measurements are fingerprint, face,
and voice. While each has pros and cons relative to accuracy
and deployment, there are two main factors that have made
voice a compelling biometric. First, speech is a natural signal
to produce that is not considered threatening by users
to provide. In many applications, speech may be the main
(or only, e.g., telephone transactions) modality, so users do
not consider providing a speech sample for authentication
as a separate or intrusive step. Second, the telephone system
provides a ubiquitous, familiar network of sensors for
obtaining and delivering the speech signal. For telephone-based
applications, there is no need for special signal transducers
or networks to be installed at application access points
since a cell phone gives one access almost anywhere. Even for
non-telephone applications, sound cards and microphones
are low-cost and readily available. Additionally, the speaker
recognition area has a long and rich scientific basis with over
30 years of research, development, and evaluations.
Over the last decade, speaker recognition technology has
made its debut in several commercial products.
Figure 1: Modular representation of the training phase of a speaker verification system.
Figure 2: Modular representation of the test phase of a speaker verification system.
The specific recognition task addressed in commercial systems is that
of verification or detection (determining whether an unknown
voice is from a particular enrolled speaker) rather
than identification (associating an unknown voice with one
from a set of enrolled speakers). Most deployed applications
are based on scenarios with cooperative users speaking fixed
digit string passwords or repeating prompted phrases from a
small vocabulary. These generally employ what is known as
text-dependent or text-constrained systems. Such constraints
are quite reasonable and can greatly improve the accuracy of
a system; however, there are cases when such constraints can
be cumbersome or impossible to enforce. An example of this
is background verification, where a speaker is verified behind
the scenes as he/she conducts some other speech interactions.
For cases like this, a more flexible recognition system able to
operate without explicit user cooperation and independent
of the spoken utterance (called text-independent mode) is
needed. This paper focuses on the technologies behind these
text-independent speaker verification systems.
A speaker verification system is composed of two distinct
phases, a training phase and a test phase. Each of them can be
seen as a succession of independent modules. Figure 1 shows
a modular representation of the training phase of a speaker
verification system. The first step consists in extracting parameters
from the speech signal to obtain a representation
suitable for statistical modeling as such models are extensively
used in most state-of-the-art speaker verification systems.
This step is described in Section 2. The second step
consists in obtaining a statistical model from the parameters.
This step is described in Section 3. This training scheme
is also applied to the training of a background model (see
Section 3).
Figure 2 shows a modular representation of the test phase
of a speaker verification system. The entries of the system are
a claimed identity and the speech samples pronounced by
an unknown speaker. The purpose of a speaker verification
system is to verify if the speech samples correspond to the
claimed identity. First, speech parameters are extracted from
the speech signal using exactly the same module as for the
training phase (see Section 2). Then, the speaker model corresponding
to the claimed identity and a background model
are extracted from the set of statistical models calculated
during the training phase. Finally, using the speech parameters
extracted and the two statistical models, the last module
computes some scores, normalizes them, and makes an
acceptance or a rejection decision (see Section 4). The normalization
step requires some score distributions to be estimated
during the training phase or/and the test phase (see
the details in Section 4).
Finally, a speaker verification system can be text-dependent
or text-independent. In the former case, there is
some constraint on the type of utterance that users of the
system can pronounce (for instance, a fixed password or certain
words in any order, etc.). In the latter case, users can
say whatever they want. This paper describes state-of-the-art
text-independent speaker verification systems.
The outline of the paper is the following. Section 2
presents the most commonly used speech parameterization
techniques in speaker verification systems, namely, cepstral
analysis. Statistical modeling is detailed in Section 3, including
an extensive presentation of Gaussian mixture modeling
(GMM) and the mention of several speaker modeling
alternatives like neural networks and support vector
machines (SVMs). Section 4 explains how normalization is
used. Section 5 shows how to evaluate a speaker verification
system. In Section 6, several extensions of speaker verification
are presented, namely, speaker tracking and speaker segmentation.
Section 7 gives a few applications of speaker verification.
Section 8 details specific problems relative to the use
of speaker verification in the forensic area. Finally, Section 9
concludes this work and gives some future research directions.
Figure 3: Modular representation of a filterbank-based cepstral parameterization (speech signal → preemphasis → windowing → FFT → modulus → filterbank → log → cepstral transform → cepstral vectors).
2. SPEECH PARAMETERIZATION
Speech parameterization consists in transforming the speech
signal to a set of feature vectors. The aim of this transformation
is to obtain a new representation which is more compact,
less redundant, and more suitable for statistical modeling
and the calculation of a distance or any other kind of
score. Most of the speech parameterizations used in speaker
verification systems rely on a cepstral representation of
speech.
2.1. Filterbank-based cepstral parameters
Figure 3 shows a modular representation of a filterbank-based
cepstral representation.
The speech signal is first preemphasized, that is, a filter
is applied to it. The goal of this filter is to enhance the high
frequencies of the spectrum, which are generally reduced by
the speech production process. The preemphasized signal is
obtained by applying the following filter:
\[ x_p(t) = x(t) - a \cdot x(t-1). \tag{1} \]
Values of a are generally taken in the interval [0.95, 0.98].
This filter is not always applied, and some practitioners prefer not
to preemphasize the signal before processing it. There is no
definitive answer to this question other than empirical experimentation.
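As an illustration, a minimal NumPy sketch of this preemphasis filter might look as follows; the function name and the default value a = 0.97 are our own choices (the paper only states that a is usually taken in [0.95, 0.98]):

```python
import numpy as np

def preemphasize(x, a=0.97):
    """First-order preemphasis filter of equation (1): x_p(t) = x(t) - a * x(t - 1)."""
    x = np.asarray(x, dtype=float)
    # Keep the first sample unchanged, then apply the difference filter.
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))
```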
The analysis of the speech signal is done locally by the application
of a window whose duration in time is shorter than
the whole signal. This window is first applied to the beginning
of the signal, then moved further and so on until the end
of the signal is reached. Each application of the window to a
portion of the speech signal provides a spectral vector (after
the application of an FFT—see below). Two quantities have
to be set: the length of the window and the shift between two
consecutive windows. For the length of the window, two values
are most often used: 20 milliseconds and 30 milliseconds.
These values correspond to the average duration which allows
the stationarity assumption to hold. For the shift, the
value is chosen in order to have an overlap between two consecutive
windows; 10 milliseconds is very often used. Once
these two quantities have been chosen, one can decide which
window to use. The Hamming and the Hanning windows
are the most commonly used in speaker recognition. One usually uses
a Hamming window or a Hanning window rather than a
rectangular window to taper the original signal towards the edges
of the frame and thus reduce edge effects. In the Fourier domain, there
is a convolution between the Fourier transform of the portion
of the signal under consideration and the Fourier transform
of the window. The Hamming window and the Hanning
window are much more selective than the rectangular
window.
Once the speech signal has been windowed, and possibly
preemphasized, its fast Fourier transform (FFT) is calculated.
There are numerous FFT algorithms (see, for instance, [1, 2]).
Once an FFT algorithm has been chosen, the only parameter
to fix for the FFT calculation is the number of points for
the calculation itself. This number N is usually a power of 2
which is greater than the number of points in the window,
classically 512.
Finally, the modulus of the FFT is extracted and a power
spectrum is obtained, sampled over 512 points. The spectrum
is symmetric and only half of these points are really
useful. Therefore, only the first half of it is kept, resulting in
a spectrum composed of 256 points.
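To make the windowing and FFT steps concrete, here is a rough NumPy sketch assuming the 20 ms window, 10 ms shift, and 512-point FFT mentioned above; the function name and the use of rfft (which directly returns the non-redundant half of the symmetric spectrum) are our own choices:

```python
import numpy as np

def frame_spectra(signal, fs, win_ms=20, shift_ms=10, n_fft=512):
    """Cut the signal into overlapping Hamming-windowed frames and return
    the modulus of the FFT of each frame (one spectral vector per frame)."""
    win_len = int(fs * win_ms / 1000)   # window length in samples
    shift = int(fs * shift_ms / 1000)   # shift between consecutive windows
    window = np.hamming(win_len)
    spectra = []
    for start in range(0, len(signal) - win_len + 1, shift):
        frame = signal[start:start + win_len] * window
        # rfft keeps only the first half of the symmetric spectrum.
        spectra.append(np.abs(np.fft.rfft(frame, n=n_fft)))
    return np.array(spectra)
```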
The spectrum exhibits many fluctuations, and we are
usually not interested in all of their details; only the envelope
of the spectrum is of interest. Another reason for the
smoothing of the spectrum is the reduction of the size of the
spectral vectors. To realize this smoothing and get the envelope
of the spectrum, we multiply the spectrum previously
obtained by a filterbank. A filterbank is a series of bandpass
frequency filters which are multiplied one by one with
the spectrum in order to get an average value in a particular
frequency band. The filterbank is defined by the shape of
the filters and by their frequency localization (left frequency,
central frequency, and right frequency). Filters can be triangular,
or have other shapes, and they can be differently located
on the frequency scale. In particular, some authors use
the Bark/Mel scale for the frequency localization of the filters.
This scale is an auditory scale which is similar to the frequency
scale of the human ear. The localization of the central
frequencies of the filters is given by
\[ f_{\text{MEL}} = 1000 \cdot \frac{\log\left(1 + f_{\text{LIN}}/1000\right)}{\log 2}. \tag{2} \]
Finally, we take the log of this spectral envelope and multiply
each coefficient by 20 in order to obtain the spectral envelope
in dB. At this stage of the processing, we obtain spectral
vectors.
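As a sketch of these two operations, the Mel warping of equation (2) and the conversion of the filterbank outputs to dB can be written as follows; the construction of the triangular filterbank matrix itself is omitted, the function names are our own, and we assume a base-10 logarithm so that the ×20 factor yields dB:

```python
import numpy as np

def hz_to_mel(f_lin):
    """Mel warping of equation (2): f_MEL = 1000 * log(1 + f_LIN / 1000) / log 2."""
    return 1000.0 * np.log(1.0 + np.asarray(f_lin, dtype=float) / 1000.0) / np.log(2.0)

def log_envelope_db(spectrum, filterbank):
    """Smooth a spectral vector with an (n_filters x n_bins) filterbank matrix
    and convert the resulting envelope to dB (20 * log10)."""
    return 20.0 * np.log10(filterbank @ spectrum + 1e-10)  # small offset avoids log(0)
```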
An additional transform, called the discrete cosine transform,
is usually applied to the spectral vectors in speech processing
and yields cepstral coefficients [2, 3, 4]:
\[ c_n = \sum_{k=1}^{K} S_k \cdot \cos\left( n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right), \quad n = 1, 2, \ldots, L, \tag{3} \]
Figure 4: Modular representation of an LPC-based cepstral parameterization.
where K is the number of log-spectral coefficients calculated
previously, Sk are the log-spectral coefficients, and L is
the number of cepstral coefficients that we want to calculate
(L ≤ K). We finally obtain cepstral vectors for each analysis
window.
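A direct NumPy transcription of equation (3) could look as follows (the function name is ours; S holds the K log-spectral coefficients and L is the number of cepstral coefficients to keep):

```python
import numpy as np

def cepstrum_from_log_spectrum(S, L):
    """Discrete cosine transform of equation (3): S_1..S_K -> c_1..c_L (L <= K)."""
    S = np.asarray(S, dtype=float)
    K = len(S)
    n = np.arange(1, L + 1)[:, None]   # cepstral index n = 1..L
    k = np.arange(1, K + 1)[None, :]   # log-spectral index k = 1..K
    return (S * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)
```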
2.2. LPC-based cepstral parameters
Figure 4 shows a modular representation of an LPC-based
cepstral representation.
The LPC analysis is based on a linear model of speech
production. The model usually used is an autoregressive
moving average (ARMA) model, simplified into an autoregressive
(AR) model. This modeling is detailed in particular
in [5].
The speech production apparatus is usually described as
a combination of four modules: (1) the glottal source, which
can be seen as a train of impulses (for voiced sounds) or a
white noise (for unvoiced sounds); (2) the vocal tract; (3)
the nasal tract; and (4) the lips. Each of them can be represented
by a filter: a lowpass filter for the glottal source, an
AR filter for the vocal tract, an ARMA filter for the nasal
tract, and an MA filter for the lips. Globally, the speech
production apparatus can therefore be represented by an
ARMA filter. Characterizing the speech signal (usually a windowed
portion of it) is equivalent to determining the coefficients
of the global filter. To simplify the resolution of this
problem, the ARMA filter is often simplified into an AR filter.
The principle of LPC analysis is to estimate the parameters
of an AR filter on a windowed (preemphasized or not)
portion of a speech signal. Then, the window is moved and
a new estimation is calculated. For each window, a set of coefficients
(called predictive coefficients or LPC coefficients)
is estimated (see [2, 6] for the details of the various algorithms
that can be used to estimate the LPC coefficients) and
can be used as a parameter vector. Finally, a spectrum envelope
can be estimated for the current window from the
predictive coefficients. But it is also possible to calculate
cepstral coefficients directly from the LPC coefficients (see
[6]):
\[ c_0 = \ln \sigma^2, \]
\[ c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k a_{m-k}, \quad 1 \le m \le p, \]
\[ c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k a_{m-k}, \quad p < m, \tag{4} \]
where $\sigma^2$ is the gain term in the LPC model, $a_m$ are the LPC
coefficients, and $p$ is the number of LPC coefficients calculated.
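The recursion of equation (4) can be transcribed as follows; the function name is ours, a holds the LPC coefficients a_1..a_p, and the guard on the sum reflects the fact that a_{m-k} is only defined for m - k ≤ p:

```python
import numpy as np

def lpc_to_cepstrum(a, sigma2, n_ceps):
    """Recursion of equation (4): LPC coefficients and gain -> cepstral coefficients c_0..c_{n_ceps}."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(sigma2)
    for m in range(1, n_ceps + 1):
        # a_{m-k} only exists for m - k <= p, hence the guard on the sum.
        acc = sum((k / m) * c[k] * a[m - k - 1] for k in range(1, m) if m - k <= p)
        c[m] = acc + (a[m - 1] if m <= p else 0.0)
    return c
```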
2.3. Centered and reduced vectors
Once the cepstral coefficients have been calculated, they can
be centered, that is, the cepstral mean vector is subtracted
from each cepstral vector. This operation is called cepstral
mean subtraction (CMS) and is often used in speaker verification.
The motivation for CMS is to remove from the cepstrum
the contribution of slowly varying convolutive noises.
The cepstral vectors can also be reduced, that is, the variance
of each component is normalized to one.
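Both operations are simple per-component statistics over the sequence of cepstral vectors; a minimal sketch (function name ours, cepstra stored as one row per frame) could be:

```python
import numpy as np

def cms_and_variance_normalization(cepstra):
    """Cepstral mean subtraction followed by per-component variance normalization."""
    centered = cepstra - cepstra.mean(axis=0)          # CMS: remove the cepstral mean vector
    return centered / (centered.std(axis=0) + 1e-10)   # reduce: unit variance per component
```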
2.4. Dynamic information
After the cepstral coefficients have been calculated, and possibly
centered and reduced, we also incorporate in the vectors
some dynamic information, that is, some information about
the way these vectors vary in time. This is classically done by
using the Δ and ΔΔ parameters, which are polynomial approximations
of the first and second derivatives [7]:
\[ \Delta c_m = \frac{\sum_{k=-l}^{l} k \cdot c_{m+k}}{\sum_{k=-l}^{l} |k|}, \qquad \Delta\Delta c_m = \frac{\sum_{k=-l}^{l} k^2 \cdot c_{m+k}}{\sum_{k=-l}^{l} k^2}. \tag{5} \]
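Equation (5) can be implemented as a weighted sum over a window of ±l neighbouring frames; in the sketch below the function name, the default l = 2, and the edge handling (repeating the first and last vectors) are our own choices:

```python
import numpy as np

def delta_features(c, l=2, order=1):
    """Polynomial approximation of equation (5) on an (n_frames x n_coeffs) array.
    order=1 returns the Delta parameters, order=2 the DeltaDelta parameters."""
    padded = np.pad(c, ((l, l), (0, 0)), mode='edge')   # repeat edge frames
    k = np.arange(-l, l + 1)
    weights = k if order == 1 else k ** 2
    num = sum(w * padded[l + ki: l + ki + len(c)] for w, ki in zip(weights, k))
    denom = np.abs(k).sum() if order == 1 else (k ** 2).sum()
    return num / denom
```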
2.5. Log energy and Δ log energy
At this step, one can choose whether to incorporate the log
energy and the Δ log energy in the feature vectors or not. In
practice, the former one is often discarded and the latter one
is kept.
2.6. Discarding useless information
Once all the feature vectors have been calculated, a very important
last step is to decide which vectors are useful and
which are not. One way of looking at the problem is to determine
vectors corresponding to speech portions of the signal
versus those corresponding to silence or background noise.
A way of doing it is to compute a bi-Gaussian model of the
feature vector distribution. In that case, the Gaussian with
the “lowest” mean corresponds to silence and background
noise, and the Gaussian with the “highest” mean corresponds
to speech portions. Then, vectors that have a higher likelihood
under the silence-and-background-noise Gaussian are
discarded. A similar approach is to compute a bi-Gaussian
model of the log energy distribution of each speech segment
and to apply the same principle.
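As an illustration of the energy-based variant, the sketch below fits a two-component Gaussian mixture to the frame log-energies and keeps the frames assigned to the component with the higher mean; the use of scikit-learn and the function name are our own choices, not part of the original paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speech_frame_mask(log_energy):
    """Return a boolean mask that is True for frames classified as speech."""
    log_energy = np.asarray(log_energy, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(log_energy)
    speech = np.argmax(gmm.means_.ravel())   # component with the "highest" mean
    return gmm.predict(log_energy) == speech
```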