ABSTRACT:
In this paper we overview emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date record of the available emotional speech data collections. The number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed. The second goal is to present the acoustic features most frequently used for emotional speech recognition and to assess how emotion affects them. Typical features are the pitch, the formants, the vocal-tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review appropriate techniques for classifying speech into emotional states. We examine separately classification techniques that exploit timing information from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed.
I.INTRODUCTION
Emotional speech recognition aims at automatically identifying the emotional or physical state of a human being from his or her voice. The emotional and physical states of a speaker are known as emotional aspects of speech and are included in the so-called paralinguistic aspects. Although the emotional state does not alter the linguistic content, it is an important factor in human communication.

Making a machine recognize emotions from speech is not a new idea. The first investigations were conducted around the mid-eighties using statistical properties of certain acoustic features (Van Bezooijen, 1984; Tolkmitt and Scherer, 1986). Ten years later, the evolution of computer architectures made the implementation of more complicated emotion recognition algorithms feasible, and market requirements for automatic services motivated further research. In environments like aircraft cockpits, speech recognition systems were trained on stressed speech instead of neutral speech (Hansen and Cairns, 1995). The acoustic features were estimated more precisely by iterative algorithms, and advanced classifiers exploiting timing information were proposed (Cairns and Hansen, 1994; Womack and Hansen, 1996; Polzin and Waibel, 1998). Nowadays, research focuses on finding powerful combinations of classifiers that improve classification accuracy in real-life applications.

The wide use of telecommunication services and multimedia devices also paves the way for new applications. For example, in the projects "Prosody for dialogue systems" and "SmartKom", ticket reservation systems have been developed that employ automatic speech recognition able to recognize the annoyance or frustration of a user and change their response accordingly (Ang et al., 2002; Schiel et al., 2002). Similar scenarios are also presented for call center applications (Petrushin, 1999; Lee and Narayanan, 2005). Emotional speech recognition can be employed by therapists as a diagnostic tool in medicine (France et al., 2000). In psychology, emotional speech recognition methods can cope with enormous amounts of speech data in real time, extracting the speech characteristics that convey emotion and attitude in a systematic manner (Mozziconacci and Hermes, 2000).

In the future, emotional speech research will primarily benefit from the growing availability of large-scale emotional speech data collections, and will focus on improving theoretical models of speech production (Flanagan, 1972) and models of the vocal communication of emotion (Scherer, 2003). On the one hand, large data collections that include a variety of speaker utterances under several emotional states are necessary in order to faithfully assess the performance of emotional speech recognition algorithms; the data collections already available consist of only a few utterances each, which makes it difficult to demonstrate reliable emotion recognition results. On the other hand, theoretical models of speech production and of the vocal communication of emotion will provide the necessary background for a systematic study and will help identify more accurate emotional cues over time.
II.DATA COLLECTIONS
A record of emotional speech data collections is undoubtedly useful for researchers interested in emotional speech analysis. For each data collection, additional information is also described, such as the speech language, the number and profession of the subjects, other physiological signals possibly recorded simultaneously with speech, the data collection purpose (emotional speech recognition, expressive synthesis), the emotional states recorded, and the kind of the emotions (natural, simulated, elicited). Three kinds of speech are observed. Natural speech is simply spontaneous speech where all emotions are real. Simulated or acted speech is speech expressed in a professionally deliberated manner. Finally, elicited speech is speech in which the emotions are induced; it is neither neutral nor simulated. For example, portrayals by non-professionals who imitate a professional produce elicited speech, which can be an acceptable solution when an adequate number of professionals is not available (Nakatsu et al., 1999). Acted speech from professionals is the most reliable for emotional speech recognition, because professionals can deliver speech colored by emotions that possess a high arousal, i.e. emotions with a great amplitude or strength.
Additional synchronous physiological signals such as sweat indication, heart beat rate, blood pressure, and respiration can be recorded during the experiments. They provide a ground truth for the degree of the subjects' arousal or stress (Rahurkar and Hansen, 2002; Picard et al., 2001). There is direct evidence that the aforementioned signals relate more to the arousal information of speech than to the valence of the emotion, i.e. the positive or negative character of the emotion (Wagner et al., 2005). As regards other physiological signals, such as EEG or signals derived from blood analysis, no sufficient and reliable results have been reported yet. The recording scenarios employed in data collections are presumably useful for repeating or augmenting the experiments. Material from radio or television is always available (Douglas-Cowie et al., 2003). However, such material raises copyright issues that impede the distribution of the data collection. An alternative is speech from interviews with specialists, such as psychologists and scientists specialized in phonetics (Douglas-Cowie et al., 2003). Furthermore, speech from real-life situations, such as oral interviews of employees when they are examined for promotion, can also be used (Rahurkar and Hansen, 2002). Parents talking to infants while trying to keep them away from dangerous objects are another real-life example (Slaney and McRoberts, 2003). Interviews between a doctor and a patient before and after medication were used in (France et al., 2000). Speech can be recorded while the subject faces a machine, e.g. during telephone calls to automatic speech recognition (ASR) call centers (Lee and Narayanan, 2005), or when the subjects are talking to fake-ASR machines operated by a human (the Wizard-of-Oz method, WOZ) (Fischer, 1999). Giving commands to a robot is another idea that has been explored (Batliner et al., 2004). Speech can also be recorded during imposed stressful situations, for example when the subject adds numbers while driving a car at various speeds (Fernandez and Picard, 2003), or when the subject reads distant car plates on a big computer screen (Steeneken and Hansen, 1999). Finally, subjects' readings of emotionally neutral sentences placed between emotionally biased ones can be another way of recording emotional speech.
III.ESTIMATION OF ACOUSTIC FEATURES
Methods for estimating short-term acoustic features that are frequently used in emotion recognition are described hereafter. Short-term features are estimated on a frame basis,

f_s(n; m) = s(n) w(m − n),   (1)

where s(n) is the speech signal and w(m − n) is a window of length Nw ending at sample m (Deller et al., 2000). Most of the methods stem from the front-end signal processing employed in speech recognition and coding. However, the discussion is focused on acoustic features that are useful for emotion recognition.
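As a rough illustration, the frame-based analysis of Eq. (1) can be sketched in Python as follows; the function name frame_signal, the Hamming window, and the frame-length and hop parameters are illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np

def frame_signal(s, frame_len, hop, window=np.hamming):
    """Split a speech signal s(n) into short-term frames
    f_s(n; m) = s(n) * w(m - n), as in Eq. (1).

    frame_len -- window length Nw in samples (typically 20-30 ms of speech)
    hop       -- frame shift in samples
    """
    w = window(frame_len)                        # analysis window w
    n_frames = 1 + (len(s) - frame_len) // hop   # number of complete frames
    return np.stack([s[i * hop : i * hop + frame_len] * w
                     for i in range(n_frames)])
```

Each row of the returned array is one frame f_s(.; m), on which the short-term features described below can be computed.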
a. Pitch
The pitch signal, also known as the glottal waveform, carries information about emotion because it depends on the tension of the vocal folds and the subglottal air pressure. The pitch signal is produced by the vibration of the vocal folds. Two features related to the pitch signal are widely used, namely the pitch frequency and the glottal air velocity at the vocal-fold opening instant. The time elapsed between two successive vocal-fold openings is called the pitch period T, while the vibration rate of the vocal folds is the fundamental frequency of phonation F0, or pitch frequency. The glottal volume velocity denotes the air velocity through the glottis during the vocal-fold vibration. High velocity indicates music-like speech, as in joy or surprise, whereas low velocity is found in harsher speaking styles such as anger or disgust (Nogueiras et al., 2001). Many algorithms for estimating the pitch signal exist (Hess, 1992). Two algorithms are discussed here. The first pitch estimation algorithm is based on the autocorrelation function and is the most frequently used. The second algorithm is based on a wavelet transform and has been designed for stressed speech.
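A minimal Python sketch of the autocorrelation-based approach is given below. It is a simplified textbook version of the method rather than the specific algorithms cited above: the voiced/unvoiced decision is omitted, and the function name and the default pitch search range of 50-500 Hz are assumptions made for illustration.

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate the pitch frequency F0 (Hz) of a voiced frame by locating
    the autocorrelation peak within a plausible pitch-period range."""
    frame = frame - np.mean(frame)                                  # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
    lag_min = int(fs / f0_max)                                      # shortest allowed period
    lag_max = min(int(fs / f0_min), len(ac) - 1)                    # longest allowed period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])                  # peak lag = pitch period T
    return fs / lag                                                 # F0 = fs / T
```

In practice the height of the autocorrelation peak is also compared against the zero-lag energy to decide whether the frame is voiced at all; only voiced frames yield a meaningful pitch estimate.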
b. Teager energy operator
Another useful feature for emotion recognition is the number of harmonics caused by the nonlinear air flow in the vocal tract that produces the speech signal. In the emotional state of anger, or in stressed speech, the fast air flow causes vortices located near the false vocal folds, providing additional excitation signals other than the pitch (Teager and Teager, 1990; Zhou et al., 2001). These additional excitation signals are apparent in the spectrum as harmonics and cross-harmonics.
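The TEO-based features cited above build on the discrete Teager energy operator, Psi[x(n)] = x(n)^2 − x(n−1) x(n+1). A minimal sketch of the raw operator in Python is shown below; the derived features used in the cited work (e.g. critical-band or harmonic decompositions) are not reproduced here.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    Psi[x(n)] = x(n)**2 - x(n-1) * x(n+1),
    evaluated at the interior samples of the frame."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```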