Audio/Visual Mapping With Cross-Modal Hidden Markov Models
Abstract
The audio/visual mapping problem of speech-driven
facial animation has intrigued researchers for years. Recent
research efforts have demonstrated that hidden Markov model
(HMM) techniques, which have been applied successfully to the
problem of speech recognition, could achieve a similar level of success
in audio/visual mapping problems. A number of HMM-based
methods have been proposed and shown to be effective by the
respective designers, but it is yet unclear how these techniques
compare to each other on a common test bed. In this paper,
we quantitatively compare three recently proposed cross-modal
HMM methods, namely the remapping HMM (R-HMM), the
least-mean-squared HMM (LMS-HMM), and HMM inversion
(HMMI). The objective of our comparison is not only to highlight
the merits and demerits of different mapping designs, but also
to study the optimality of the acoustic representation and HMM
structure for the purpose of speech-driven facial animation. This
paper presents a brief overview of these models, followed by an
analysis of their mapping capabilities on a synthetic dataset. An
empirical comparison on an experimental audio-visual dataset
consisting of 75 TIMIT sentences is finally presented. Our results
show that HMMI provides the best performance, both on synthetic
and experimental audio-visual data.
Index Terms—3-D audio/video processing, joint media and multimodal
processing, speech reading and lip synchronization.
I. INTRODUCTION
The goal of audio/visual (A/V) mapping is to produce
accurate, synchronized and perceptually natural animations
of facial movements driven by an incoming audio stream.
Speech-driven facial animation can provide practical benefits in
human-machine interfaces [1], since the combination of audio
and visual information has been shown to enhance speech perception,
especially when the auditory signals degrade due to
noise, bandwidth filtering, or hearing impairments.
Despite its apparent simplicity, the mapping between continuous
audio and visual streams is rather complex as a result
of co-articulation [2], which causes a given phone to be pronounced
differently depending on the surrounding phonemes.
According to the level at which speech signals are represented,
facial animation approaches can be classified into
two groups: phoneme/viseme and subphonemic mappings.
The phoneme/viseme approach views speech as a bimodal
linguistic entity. The basic idealized linguistic unit of spoken
language is the phoneme. The spoken English language has approximately
58 phonemes [3]. Similarly, the basic unit of facial
speech movement corresponding to a phoneme is the viseme.
Following Goldschen [4], phonemes may be mapped into 35
different visemes. Although phoneme/viseme mappings have
generated fairly realistic “talking-heads” [5], [6], the approach is
inherently limited. When speech is segmented into phonemes,
considerable information is lost, including speech rate, emphasis
and prosody, all of which are essential for animating a
realistically perceived talking face. Therefore, phoneme/viseme
mappings result in less natural facial animations.
An alternative to the phoneme/viseme approach is to construct
a direct mapping from sub-phonemic speech acoustics
(e.g., linear predictive coefficients) onto orofacial trajectories.
This approach assumes a dynamic relationship between a short
window of speech and the corresponding visual frame. In this
case, the problem consists of finding an optimal functional
approximation using a training set of A/V frames. As universal
approximators, neural networks have been widely used for
such nonlinear mapping [7]. To incorporate the co-articulation
cues of speech, Lavagetto [8] and Massaro et al. [9] proposed
a model based on time-delay neural networks (TDNNs), which
uses tapped-delay connections to capture context information
during phone and motion transitions. Hong et al. [10] used a
family of 44 multilayer perceptrons (MLP), each trained on a
specific phoneme. Co-articulation is captured with a seven-unit
delay line. An incoming audio sample is first classified into
a phoneme class using a set of 44 Gaussian mixture models,
and the corresponding MLP is then used to predict the video
components. The predefined length of the delay line, however, limits the time window over which co-articulation dynamics can be captured, since context and phone durations vary considerably across subjects, emotional states, and speech rates.
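For concreteness, the sketch below shows one way such a fixed tapped-delay context window can be assembled from per-frame acoustic features before feeding a frame-level regressor such as an MLP. The window length, zero-padding policy, and function name are illustrative assumptions, not details taken from [8]–[10].

```python
import numpy as np

def tapped_delay_features(audio_frames, n_delays=7):
    """Stack each acoustic frame with its n_delays predecessors, mimicking a
    fixed delay line that exposes co-articulation context to a frame-level
    regressor (a sketch; window length and padding policy are assumptions).

    audio_frames : (T, D) array of per-frame acoustic features (e.g., LPC)
    Returns a (T, D * (n_delays + 1)) array of context windows.
    """
    T, D = audio_frames.shape
    # Pad the beginning with zeros so every frame has a full history
    padded = np.vstack([np.zeros((n_delays, D)), audio_frames])
    return np.hstack([padded[n_delays - k: n_delays - k + T]
                      for k in range(n_delays + 1)])
```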
To overcome the above limitations, hidden Markov models
(HMMs) have recently received much attention for the purpose
of A/V mapping [11]–[14]. HMM-based methods have the advantage
that context information can be easily represented by
state-transition probabilities. To the best of our knowledge, the
first application of HMM to A/V mapping is the work by Yamamoto
et al. [14]. In this approach, an HMM is learned from
audio training data, and each video training sequence is aligned
with the audio sequence using Viterbi optimization. During synthesis,
an HMM state sequence is selected for a given novel
audio input using the Viterbi algorithm, and the visual output
associated with each state in the audio sequence is retrieved.
This technique, however, provides video predictions of limited
quality for two reasons. First, the output of each state is the average
of the Gaussian mixture components associated with that
state, so the predicted visual output of this state is only indirectly
related to the current audio vector by means of the Viterbi state.
Second, synthesis performance depends heavily on the Viterbi
alignment, which is rather sensitive to noise in the audio input.
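As a point of reference, the following is a minimal sketch of this kind of Viterbi-based baseline: decode the most likely audio state sequence, then emit the average visual vector stored for each state. The input layout and function name are assumptions for illustration, not the implementation of [14].

```python
import numpy as np

def viterbi_visual_lookup(log_emissions, log_A, log_pi, state_visual_means):
    """Decode the Viterbi state sequence of an audio HMM and return the
    per-state average visual vectors (a sketch; inputs assumed precomputed).

    log_emissions      : (T, N) log p(audio_t | state_j) under the audio HMM
    log_A              : (N, N) log state-transition matrix
    log_pi             : (N,) log initial state distribution
    state_visual_means : (N, Dv) average visual vector for each state
    """
    T, N = log_emissions.shape
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # scores[i, j]: prev i -> next j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_emissions[t]
    # Backtrack the most likely state sequence
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        states[t] = back[t + 1][states[t + 1]]
    return state_visual_means[states]
```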
Chen and Rao [11], [15] employ a least mean square estimation
method for the synthesis phase, whereby the visual output
is made dependent not only on the current state, but also on
the current audio input. Still, their model predicts the state sequence
by means of the Viterbi algorithm. To address this limitation,
Choi et al. [12] have proposed a hidden Markov model
inversion (HMMI) method. In HMMI, the visual output is generated
directly from the given audio input and the trained HMM
by means of an expectation-maximization (EM) iteration, thus
avoiding the use of the Viterbi sequence. A different mechanism
has been proposed by Brand [13], in which a minimum-entropy
training method is used to learn a concise HMM. As a result, the
Viterbi sequence captures a larger proportion of the total probability
mass, thus reducing the detrimental effects of noise in the
audio input. For reasons that will become clear in Section II-A,
Brand’s method is termed the remapping HMM (R-HMM).
Even though these three HMMs have shown high potential
for A/V mapping in speech-driven facial animation systems, a
clear understanding of their relative merits has yet to emerge. Moreover, the proposed models have not been evaluated under the same experimental conditions or on a common A/V dataset; each was originally tested on its designers' own data. A theoretical
evaluation alone may miss important factors affecting their behavior
since the performance of any computational model will
ultimately depend on the problem domain and the nature of the
data. The objective of this paper is to provide an experimental
comparison of these HMMs, as well as investigate how HMM
structure and choice of acoustic features affect the prediction
performance for speech-driven facial animation.
II. HMM-BASED A/V MAPPING
An HMM is commonly represented by a vector of model parameters $\lambda = (S, O, A, B, \pi)$, where $S$ is the set of Markov chain states, $O$ denotes the set of observations, $A$ is a matrix of state transition probabilities, and $\pi$ is the initial state distribution. If the outputs of the HMM are discrete symbols, $B$ is the observation symbol probability distribution. Although continuous outputs can be discretized through vector quantization, improved performance can be obtained by modeling the output probability distribution $b_j(o_t)$ at state $s_j$ with a semi-parametric Gaussian mixture model

$$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, N(o_t;\, \mu_{jm}, \Sigma_{jm}) \qquad (1)$$

where $c_{jm}$ is the mixture coefficient for the $m$th mixture at state $s_j$, and $N(o_t;\, \mu_{jm}, \Sigma_{jm})$ is a Gaussian density with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$ [16]. As reviewed in the previous section,
a number of extensions of this basic HMM have been proposed
for A/V mappings. A concise description of these models follows.
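Before turning to the individual models, the following minimal sketch shows how the mixture output density in (1) can be evaluated for a single state; the argument layout and function name are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_emission_prob(o_t, weights, means, covs):
    """Evaluate the semi-parametric mixture output density of one HMM state,
    b_j(o_t) = sum_m c_jm N(o_t; mu_jm, Sigma_jm), as in (1).
    (A sketch with an assumed argument layout, not the authors' code.)

    o_t     : (D,) observation vector at time t
    weights : (M,) mixture coefficients c_jm for this state (sum to 1)
    means   : (M, D) mixture mean vectors mu_jm
    covs    : (M, D, D) mixture covariance matrices Sigma_jm
    """
    return sum(w * multivariate_normal.pdf(o_t, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))
```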
A. Remapping HMM
Under the assumption that both acoustic and visual data can
be modeled with the same structure, Brand [13] has proposed a
remapping procedure to train cross-modal HMMs. The training
is conducted using video data. Once a video HMM is learned,
the video output probabilities at each state are remapped onto
the audio space using the M-step in Baum-Welch. Borrowing
notation from [16], the estimation formulas for $\mu_{jm}^{v}$ and $\Sigma_{jm}^{v}$, the mean and covariance for the $m$th Gaussian component at the $j$th state in the visual HMM, are

$$\mu_{jm}^{v} = \frac{\sum_{t=1}^{T} \gamma_t(j,m)\, v_t}{\sum_{t=1}^{T} \gamma_t(j,m)}, \qquad
\Sigma_{jm}^{v} = \frac{\sum_{t=1}^{T} \gamma_t(j,m)\,(v_t - \mu_{jm}^{v})(v_t - \mu_{jm}^{v})^{T}}{\sum_{t=1}^{T} \gamma_t(j,m)} \qquad (2)$$

where $v_t$ is the visual vector at time $t$, and $\gamma_t(j,m)$ is the probability of being in state $s_j$ at time $t$ with the $m$th mixture component accounting for visual sequence $V$ and learned model $\lambda$. To re-map the video HMM into audio space, the audio $\mu_{jm}^{a}$ and $\Sigma_{jm}^{a}$ are obtained by replacing the video vector $v_t$ in (2) with $a_t$, the audio vector at time $t$. All other parameters in the audio HMM remain the same as in the video HMM.
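A compact sketch of this remapping step, assuming the occupation probabilities $\gamma_t(j,m)$ have already been obtained from Baum-Welch training of the video HMM, might look as follows; array shapes and names are assumptions, not the authors' code.

```python
import numpy as np

def remap_to_audio(gamma, audio_frames):
    """Remapping step in the spirit of (2): re-estimate per-state, per-mixture
    means and covariances using the occupation probabilities learned on video,
    but with the audio vectors substituted for the video vectors.

    gamma        : (T, N, M) probability of state j, mixture m at time t
    audio_frames : (T, Da) audio vector a_t for each frame
    Returns (means, covs) with shapes (N, M, Da) and (N, M, Da, Da).
    """
    T, N, M = gamma.shape
    Da = audio_frames.shape[1]
    occ = gamma.sum(axis=0)                                   # (N, M) occupancy
    means = np.einsum('tnm,td->nmd', gamma, audio_frames) / occ[..., None]
    covs = np.zeros((N, M, Da, Da))
    for j in range(N):
        for m in range(M):
            diff = audio_frames - means[j, m]                 # (T, Da)
            covs[j, m] = (gamma[:, j, m, None] * diff).T @ diff / occ[j, m]
    return means, covs
```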
The process of synthesizing a novel video trajectory sequence
involves two steps. First, given a new audio sequence and the
learned audio HMM, the optimal state sequence is obtained with
the Viterbi algorithm. From the Viterbi sequence, the A/V mapping
may be simply implemented by choosing the average visual
vector for each state, as in [14]. This naive solution, however,
yields an animation which displays jerky motion from frame
to frame. Instead, Brand [17] proposes a solution that yields a
short, smooth trajectory that is most consistent with the visual
HMM and the given Viterbi state sequence. For simplicity, each
state is assumed to have one Gaussian component, but the procedure
can be generalized to Gaussian mixtures. Let $N(o;\, \mu, \Sigma)$ be the probability of observation $o$ given a Gaussian model with mean $\mu$ and covariance $\Sigma$. The predicted visual trajectory is then

$$V^{*} = \arg\max_{V} \prod_{t=1}^{T} N\big([v_t, \Delta v_t];\, \mu_{q_t}, \Sigma_{q_t}\big) \qquad (3)$$

where $q_t$ is the Viterbi state at time $t$ and $\Delta v_t = v_t - v_{t-1}$. Thus, the observation at time $t$ for the visual HMM includes both the position $v_t$
and the velocity $\Delta v_t$. Equation (3) has a closed-form solution with a single global optimum. A standard block tri-diagonal system can be obtained by setting its derivative to zero. Details of the solution for such a system can be found in [18] and [19].
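The following sketch solves the quadratic objective behind (3) directly, by assembling the full (block tri-diagonal) normal equations and calling a dense linear solver. It assumes one Gaussian per state and treats the first-frame velocity as the first frame itself; these are simplifications for illustration, not the authors' exact solution from [18], [19].

```python
import numpy as np

def synthesize_trajectory(state_seq, means, covs, dim):
    """Find the visual trajectory v_1..v_T maximizing the product of per-state
    Gaussians over stacked [position; velocity] observations, with velocity
    taken as v_t - v_{t-1} and v_0 assumed to be 0 (an assumption).

    state_seq : length-T list of state indices (e.g., from Viterbi decoding)
    means     : (n_states, 2*dim) per-state means over [position, velocity]
    covs      : (n_states, 2*dim, 2*dim) per-state covariances
    dim       : dimensionality of a single visual frame
    Returns a (T, dim) array of visual frames.
    """
    T = len(state_seq)
    H = np.zeros((T * dim, T * dim))       # quadratic term of the objective
    g = np.zeros(T * dim)                  # linear term

    for t, s in enumerate(state_seq):
        P = np.linalg.inv(covs[s])         # precision of state s
        # E maps the stacked trajectory V onto z_t = [v_t; v_t - v_{t-1}]
        E = np.zeros((2 * dim, T * dim))
        E[:dim, t * dim:(t + 1) * dim] = np.eye(dim)
        E[dim:, t * dim:(t + 1) * dim] = np.eye(dim)
        if t > 0:
            E[dim:, (t - 1) * dim:t * dim] = -np.eye(dim)
        H += E.T @ P @ E
        g += E.T @ P @ means[s]

    V = np.linalg.solve(H, g)              # single global optimum of the quadratic
    return V.reshape(T, dim)
```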
B. Least-Mean Squared HMM
The LMS-HMM method of Chen [11] differs from
the R-HMM method in two fundamental ways. First, the
LMS-HMM is trained on the joint A/V space, as opposed to
video space. Second, the synthesis of video for each particular
state is formulated as a least-mean-squares regression from the
corresponding audio observation. Training of the LMS-HMM
[11] is performed by combining the audio and visual features
into one joint observation vector $o_t = [a_t^{T}, v_t^{T}]^{T}$. Once the joint HMM is trained using Baum-Welch, the extraction of an audio HMM is trivial since the audio parameters are part of the joint A/V distribution

$$\mu_{jm} = \begin{bmatrix} \mu_{jm}^{a} \\ \mu_{jm}^{v} \end{bmatrix}, \qquad
\Sigma_{jm} = \begin{bmatrix} \Sigma_{jm}^{aa} & \Sigma_{jm}^{av} \\ \Sigma_{jm}^{va} & \Sigma_{jm}^{vv} \end{bmatrix} \qquad (4)$$

where $\mu_{jm}^{a}$ and $\Sigma_{jm}^{aa}$ represent the mean vector and covariance matrix for the $m$th Gaussian component at the $j$th state in the audio HMM. To synthesize a video vector from a new audio input, the LMS-HMM method operates in two stages. First, the most likely state sequence is found based on the learned audio HMM using the Viterbi algorithm. Then, the audio input $a_t$ and the Gaussian mixture model corresponding to each Viterbi state $s_j$ are used to analytically derive the visual estimate $\hat{v}_t$ that minimizes the mean squared error (MSE) $E[\|v_t - \hat{v}_t\|^2 \mid a_t]$. It can be shown [15] that this MSE estimate is given by

$$\hat{v}_t = \sum_{m=1}^{M} h_{jm}(a_t)\left[\mu_{jm}^{v} + \Sigma_{jm}^{va}\left(\Sigma_{jm}^{aa}\right)^{-1}\left(a_t - \mu_{jm}^{a}\right)\right] \qquad (5)$$

with

$$h_{jm}(a_t) = \frac{c_{jm}\, N(a_t;\, \mu_{jm}^{a}, \Sigma_{jm}^{aa})}{\sum_{k=1}^{M} c_{jk}\, N(a_t;\, \mu_{jk}^{a}, \Sigma_{jk}^{aa})} \qquad (6)$$

where $c_{jm}$ is the mixture coefficient, and $N(a_t;\, \mu_{jm}^{a}, \Sigma_{jm}^{aa})$ is the probability of $a_t$ for the $m$th Gaussian component in state $s_j$.
Equation (5) shows that the visual output for a given state depends
directly on the corresponding audio input.
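A sketch of this per-state regression, in the style of (5) and (6), is given below for a single audio frame; the argument names and partitioned-covariance layout are assumptions for illustration rather than the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lms_visual_estimate(a, state, weights, mu_a, mu_v, cov_aa, cov_va):
    """Minimum-MSE visual estimate for one audio frame under a per-state GMM,
    in the spirit of (5)-(6) (a sketch with assumed variable names).

    a       : (Da,) audio vector at time t
    state   : Viterbi state index j for this frame
    weights : (n_states, M) mixture coefficients c_jm
    mu_a    : (n_states, M, Da) audio means; mu_v : (n_states, M, Dv) visual means
    cov_aa  : (n_states, M, Da, Da) audio covariance blocks
    cov_va  : (n_states, M, Dv, Da) visual/audio cross-covariance blocks
    """
    M = weights.shape[1]
    # Responsibility of each mixture for this audio frame, as in (6)
    resp = np.array([
        weights[state, m] * multivariate_normal.pdf(a, mu_a[state, m], cov_aa[state, m])
        for m in range(M)
    ])
    resp /= resp.sum()

    # Conditional mean of the visual features given the audio, as in (5)
    v_hat = np.zeros(mu_v.shape[2])
    for m in range(M):
        gain = cov_va[state, m] @ np.linalg.inv(cov_aa[state, m])
        v_hat += resp[m] * (mu_v[state, m] + gain @ (a - mu_a[state, m]))
    return v_hat
```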
C. HMM Inversion
The HMMI approach of Choi et al. [12], [20] addresses a
major weakness of HMMs: reliance on the Viterbi sequence,
which represents only a small fraction of the total probability
mass, with many other state sequences potentially having
nearly equal likelihoods [13]. In addition, the Viterbi search
may be easily misguided by noise in the audio input. To avoid
these problems, in HMMI the visual outputs are estimated
directly from the speech signal, bypassing the Viterbi search
[21].
The training process for HMMI is the same as in the
LMS-HMM method, where both audio and video features are
used to train a joint A/V HMM. During synthesis, the visual
output in HMMI is predicted from the given audio stimuli and
the joint HMM using a procedure that can be regarded as the
inverse version of Baum-Welch. The objective of the HMMI
re-estimation method is to find a video observation sequence
that maximizes Baum's auxiliary function [22]

$$Q(V, \hat{V} \mid A, \lambda) = \sum_{q} P(A, V, q \mid \lambda)\, \log P(A, \hat{V}, q \mid \lambda) \qquad (7)$$

where $A$ is the audio sequence, $V$ is an initial visual sequence, and $\lambda$ are the parameters of the joint A/V HMM. Note that (7) has two identical $\lambda$ since the EM step in HMMI does not re-estimate model parameters but the video observation sequence $\hat{V}$. Details of this derivation may be found in [12]. Since
both HMMI and LMS-HMM are trained on the joint A/V space,
our HMMI implementation uses the visual sequence predicted
by LMS-HMM as an initial value in (7).
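For illustration, the following sketch writes out one per-frame visual update of this kind for full-covariance joint A/V Gaussians, assuming the state/mixture posteriors $\gamma_t(j,m)$ have already been computed by forward-backward over the joint HMM with the current visual estimate. The partitioned-precision notation is an assumption of this sketch; the derivation in [12] gives the authors' exact re-estimation formulas.

```python
import numpy as np

def hmmi_visual_update(a_t, gamma_t, mu_a, mu_v, prec_vv, prec_va):
    """One per-frame visual update for an HMM-inversion-style EM step: maximize
    the posterior-weighted joint log-likelihood over the visual vector, with
    the model parameters held fixed (a sketch, not the paper's exact formulas).

    a_t     : (Da,) audio vector at time t
    gamma_t : (N, M) posterior of state j, mixture m at time t
    mu_a    : (N, M, Da) audio means; mu_v : (N, M, Dv) visual means
    prec_vv : (N, M, Dv, Dv) visual block of each joint precision matrix
    prec_va : (N, M, Dv, Da) visual/audio block of each joint precision matrix
    Returns the (Dv,) visual vector maximizing the weighted joint likelihood.
    """
    Dv = mu_v.shape[2]
    A = np.zeros((Dv, Dv))
    b = np.zeros(Dv)
    for j in range(gamma_t.shape[0]):
        for m in range(gamma_t.shape[1]):
            g = gamma_t[j, m]
            # Setting the gradient of the weighted log-likelihood to zero gives
            # a linear system A v = b in the unknown visual vector v.
            A += g * prec_vv[j, m]
            b += g * (prec_vv[j, m] @ mu_v[j, m]
                      - prec_va[j, m] @ (a_t - mu_a[j, m]))
    return np.linalg.solve(A, b)
```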