Emotional speech recognition: Resources, features, and methods
Abstract
In this paper we overview emotional speech recognition having in mind three goals. The first goal is to provide an up-to-
date record of the available emotional speech data collections. The number of emotional states, the language, the number
of speakers, and the kind of speech are briefly addressed. The second goal is to present the most frequent acoustic features
used for emotional speech recognition and to assess how the emotion affects them. Typical features are the pitch, the
formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based fea-
tures, the intensity of the speech signal, and the speech rate. The third goal is to review appropriate techniques in order to
classify speech into emotional states. We distinguish classification techniques that exploit timing information
from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear
discriminant analysis, k-nearest neighbors, and support vector machines are reviewed.
© 2006 Elsevier B.V. All rights reserved.
Introduction
Emotional speech recognition aims at automati-
cally identifying the emotional or physical state of
a human being from his or her voice. The emotional
and physical states of a speaker are known as
emotional aspects of speech and are included in
the so-called paralinguistic aspects. Although the
emotional state does not alter the linguistic content,
it is an important factor in human communication,
because it provides useful feedback in many
applications, as outlined next.
Data collections
A record of emotional speech data collections is
undoubtedly useful for researchers interested in
emotional speech analysis. An overview of 64 emo-
tional speech data collections is presented in Table
1. For each data collection additional information
is also described such as the speech language, the
number and the profession of the subjects, other
physiological signals possibly recorded simulta-
neously with speech, the data collection purpose
(emotional speech recognition, expressive synthe-
sis), the emotional states recorded, and the kind of
the emotions (natural, simulated, elicited).
Teager energy operator
Another useful feature for emotion recognition is
the number of harmonics due to the non-linear air
flow in the vocal tract that produces the speech
signal. In the emotional state of anger, or under
stressed speech, the fast air flow causes vortices
near the false vocal folds that provide additional
excitation signals beyond the pitch (Teager
and Teager, 1990; Zhou et al., 2001). The additional
excitation signals are apparent in the spectrum as
harmonics and cross-harmonics. In the following,
a procedure to calculate the number of harmonics
in the speech signal is described.
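The detection of these extra excitation components is commonly based on the discrete Teager energy operator, Ψ[x](n) = x(n)² − x(n−1)·x(n+1). As a minimal illustrative sketch (the function name and frame handling are ours, not the paper's):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
    Returns an array two samples shorter than the input."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(Omega*n), the operator yields the constant
# value A^2 * sin(Omega)^2; additional harmonics and cross-harmonics
# therefore show up as deviations from a flat profile.
n = np.arange(200)
tone = 0.5 * np.cos(0.3 * n)
psi = teager_energy(tone)
```

In practice the operator would be applied frame by frame, typically after band-pass filtering the signal into sub-bands; those details are beyond this sketch.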
Emotion classification techniques
The output of emotion classification techniques is
a prediction value (label) about the emotional state
of an utterance. An utterance u_n is a speech segment
corresponding to a word or a phrase. Let u_n,
n ∈ {1, 2, . . . , N}, be an utterance of the data collec-
tion. In order to evaluate the performance of a clas-
sification technique, the cross-validation method is
used. According to this method, the utterances of
the whole data collection are divided into the design
set Ds containing N_Ds utterances and the test set Ts
comprised of N_Ts utterances. The classifiers are
trained using the design set and the classification
error is estimated on the test set. The design and
the test set are chosen randomly. This procedure is
repeated a number of times defined by the user, and
the estimated classification error is the average
classification error over all repetitions (Efron and
Tibshirani, 1993).
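The repeated random-split procedure above can be sketched as follows. This is an illustrative implementation under our own naming (train_fn, error_fn, the design fraction, and the repetition count are assumptions, not values from the paper):

```python
import random

def repeated_holdout_error(utterances, labels, train_fn, error_fn,
                           design_fraction=0.7, repetitions=10, seed=0):
    """Estimate classification error by repeated random splits:
    divide the collection into a design set and a test set, train on
    the design set, measure error on the test set, and average the
    error over all repetitions."""
    rng = random.Random(seed)
    n = len(utterances)
    n_design = int(design_fraction * n)
    errors = []
    for _ in range(repetitions):
        idx = list(range(n))
        rng.shuffle(idx)                       # random design/test split
        design, test = idx[:n_design], idx[n_design:]
        clf = train_fn([utterances[i] for i in design],
                       [labels[i] for i in design])
        errors.append(error_fn(clf,
                               [utterances[i] for i in test],
                               [labels[i] for i in test]))
    return sum(errors) / len(errors)           # average over repetitions
```

Here train_fn stands in for any of the classifiers reviewed (HMM, ANN, LDA, k-NN, SVM), and error_fn for the chosen misclassification measure.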
Concluding remarks
In this paper, several topics have been addressed.
First, a list of data collections was provided includ-
ing all available information about the databases
such as the kinds of emotions, the language, etc.
Nevertheless, there are still some copyright prob-
lems since the material from radio or TV is held
under a limited agreement with broadcasters.
Furthermore, there is a need for adopting protocols
such as those in (Douglas-Cowie et al., 2003;
Scherer, 2003; Schröder, 2005) that address issues
related to data collection. Links with standardiza-
tion activities like MPEG-4 and MPEG-7 concern-
ing the emotion states and features should be
established. It is recommended that the data be
distributed by organizations (such as LDC or ELRA),
rather than by individual research groups or pro-
ject initiatives, for a reasonable fee, so that the
experiments reported on the specific data collec-
tions can be repeated. This is not the case with
the majority of the databases reviewed in this paper,
whose terms of distribution are rather unclear.