10-06-2014, 04:30 PM
Emotional speech recognition: Resources, features, and methods
Abstract
In this paper, we overview emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date
record of the available emotional speech data collections. The number of emotional states, the language, the number
of speakers, and the kind of speech are briefly addressed. The second goal is to present the most frequent acoustic features
used for emotional speech recognition and to assess how the emotion affects them. Typical features are the pitch, the
formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based features,
the intensity of the speech signal, and the speech rate. The third goal is to review techniques suitable for classifying
speech into emotional states. We examine classification techniques that exploit timing information separately from
those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant
analysis, k-nearest neighbors, and support vector machines are reviewed.
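As an illustration of the short-term acoustic features listed above, the sketch below extracts per-frame log-energy, zero-crossing rate, and an autocorrelation-based pitch estimate. It is a minimal NumPy-only sketch, not the paper's method; the frame length (25 ms), hop (10 ms), and pitch search range (60-400 Hz) are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames for short-term analysis."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_term_features(x, sr, frame_len=400, hop=160):
    """Per-frame log-energy, zero-crossing rate, and autocorrelation pitch."""
    frames = frame_signal(x, frame_len, hop)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    # Zero-crossing rate: fraction of adjacent samples whose sign differs.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    pitch = []
    for f in frames:
        f = f - f.mean()
        # Autocorrelation; keep non-negative lags only.
        ac = np.correlate(f, f, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 60          # search lags for 60-400 Hz
        lag = lo + np.argmax(ac[lo:hi])
        pitch.append(sr / lag)
    return energy, zcr, np.array(pitch)

# Synthetic example: a 200 Hz vowel-like tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)
energy, zcr, pitch = short_term_features(x, sr)
```

On the pure tone, the autocorrelation peak falls at the 200 Hz period, so the recovered pitch contour is flat at roughly 200 Hz; on real emotional speech, these contours vary with the speaker's state.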
Introduction
Emotional speech recognition aims at automatically
identifying the emotional or physical state of
a human being from his or her voice. The emotional
and physical states of a speaker are known as
emotional aspects of speech and are included in
the so-called paralinguistic aspects. Although the
emotional state does not alter the linguistic content,
it is an important factor in human communication,
because it provides feedback information in many
applications, as outlined next.
Outline
In Section 2, a corpus of 64 data collections is
reviewed, with emphasis on the data collection
procedures, the kind of speech (natural, simulated,
or elicited), the content, and other physiological
signals that may accompany the emotional speech.
In Section 3, short-term features (i.e., features
extracted on a per-frame basis) that are related
to the emotional content of speech are discussed. In
addition to short-term features, their contours are
of fundamental importance for emotional speech
recognition. Emotions affect the contour characteristics,
such as statistics and trends, as summarized
in Section 4. Emotion classification techniques
that exploit timing information and other techniques
that ignore it are surveyed in Section 5.
Therefore, Sections 3 and 4 aim at describing the
appropriate features to be used with the emotion
classification techniques reviewed in Section 5.
Finally, Section 6 concludes the tutorial by indicating
future research directions.
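The split between classifiers that exploit timing (e.g., hidden Markov models over frame sequences) and those that ignore it can be illustrated with a minimal sketch of the second kind: utterance-level statistics of a pitch contour (timing discarded) fed to a k-nearest-neighbor classifier. This is a NumPy-only illustration under assumed data; the contours, class labels, and statistic choices below are synthetic examples, not material from the paper.

```python
import numpy as np

def contour_stats(contour):
    """Utterance-level statistics of a feature contour (timing discarded):
    mean, standard deviation, range, and linear trend slope."""
    slope = np.polyfit(np.arange(len(contour)), contour, 1)[0]
    return np.array([contour.mean(), contour.std(),
                     contour.max() - contour.min(), slope])

def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k nearest utterance-level feature vectors."""
    d = np.linalg.norm(train_X - x, axis=1)
    idx = np.argsort(d)[:k]
    vals, counts = np.unique(train_y[idx], return_counts=True)
    return vals[np.argmax(counts)]

# Hypothetical pitch contours (Hz): class 1 has a raised, rising pitch,
# class 0 a low, flat one -- purely synthetic toy data.
rng = np.random.default_rng(0)
def fake_contour(base, slope, n=100):
    return base + slope * np.arange(n) + rng.normal(0, 5, n)

X = np.stack([contour_stats(fake_contour(b, s))
              for b, s in [(220, 0.5), (230, 0.4), (120, 0.0), (125, 0.05)]])
y = np.array([1, 1, 0, 0])          # 1 = raised/rising, 0 = low/flat
query = contour_stats(fake_contour(225, 0.45))
pred = knn_predict(X, y, query, k=3)
```

Because only the summary statistics reach the classifier, two contours with the same mean, range, and trend are indistinguishable here, whereas a timing-aware model such as an HMM could still separate them by their temporal evolution.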
Data collections
A record of emotional speech data collections is
undoubtedly useful for researchers interested in
emotional speech analysis. An overview of 64 emotional
speech data collections is presented in Table
1. For each data collection, additional information
is also given, such as the speech language, the
number and profession of the subjects, other
physiological signals possibly recorded simultaneously
with speech, the purpose of the data collection
(emotional speech recognition or expressive synthesis),
the emotional states recorded, and the kind of
emotions (natural, simulated, or elicited).