Multimodal emotion recognition from expressive faces, body gestures and speech
Abstract
In this paper we present a multimodal approach to the recognition of eight emotions that integrates information from facial expressions, body movement and gestures, and speech. We trained and tested a model with a Bayesian classifier, using a multimodal corpus with eight emotions and ten subjects. First, individual classifiers were trained for each modality; then the data were fused at the feature level and at the decision level. Fusing the multimodal data markedly increased the recognition rates in comparison with the unimodal systems: the multimodal approach improved on the most successful unimodal system by more than 10%. Furthermore, fusion at the feature level gave better results than fusion at the decision level.
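As a rough, hypothetical sketch (not the authors' actual pipeline), the following Python fragment contrasts the two fusion strategies mentioned above using a Gaussian naive Bayes classifier from scikit-learn; the array shapes, random data and the product rule used to combine posteriors are illustrative assumptions only.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 240, 8
X_face = rng.normal(size=(n_samples, 19))     # placeholder facial features
X_body = rng.normal(size=(n_samples, 5))      # placeholder motion cues
X_speech = rng.normal(size=(n_samples, 377))  # placeholder speech features
y = rng.integers(0, n_classes, size=n_samples)

# Feature-level fusion: concatenate the modalities into one feature vector
# and train a single classifier on it.
X_fused = np.hstack([X_face, X_body, X_speech])
feature_level_clf = GaussianNB().fit(X_fused, y)

# Decision-level fusion: train one classifier per modality and combine the
# per-class posteriors (here with a simple product rule).
modality_clfs = [GaussianNB().fit(X, y) for X in (X_face, X_body, X_speech)]
posteriors = np.prod(
    [clf.predict_proba(X) for clf, X in zip(modality_clfs, (X_face, X_body, X_speech))],
    axis=0)
decision_level_prediction = posteriors.argmax(axis=1)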
Introduction
In recent years, research in human-computer interaction has increasingly addressed the communication aspect related to the “implicit channel”, that is, the channel through which the emotional domain interacts with the verbal aspects of communication [1]. One of the challenging issues is to endow a machine with emotional intelligence. Emotionally intelligent systems must be able to create an affective interaction with users: they must be endowed with the ability to perceive, interpret, express and regulate emotions [2]. Recognising users’ emotional states is therefore one of the main requirements for computers to interact successfully with humans. Most work in affective computing, however, does not combine different modalities into a single system for the analysis of human emotional behaviour.
Related work
Emotion recognition has been investigated with three main types of databases: acted emotions, natural spontaneous emotions and elicited emotions. The best results are generally obtained with acted-emotion databases because they contain strong emotional expressions. The literature on speech (see for example Banse and Scherer [7]) shows that most studies were conducted with acted emotional speech. Feature sets for acted and spontaneous speech have recently been compared by [8]. Generally, few acted-emotion speech databases included speakers with several different native languages. In recent years, some attempts have been made to collect multimodal data: examples of multimodal databases can be found in [9] [10] [11].
In the area of unimodal emotion recognition, there have been many studies using different, but single, modalities. Facial expressions [12] [13], vocal features [14] [15], body movements and postures [16] [17] [18] and physiological signals [19] have been used as inputs in these attempts, while multimodal emotion recognition is currently gaining ground [20] [21] [22]. Nevertheless, most of these works consider the integration of information from facial expressions and speech, and only a few attempts have been made to combine information from body movement and gestures in a multimodal framework. Gunes and Piccardi [23], for example, fused facial expression and body gesture information at different levels for bimodal emotion recognition. Further, el Kaliouby and Robinson [24] proposed a vision-based computational model to infer acted mental states from head movements and facial expressions.
Procedure
Participants were asked to act eight emotional states: anger, despair, interest, pleasure, sadness, irritation, joy and pride, equally distributed in the valence-arousal space (see Table 1). During the recording process one of the authors acted as director, guiding the actors through the process. Participants were asked to perform specific gestures that exemplify each emotion. The director’s role was to instruct the subject on the procedure (number of gesture repetitions, emotion sequence, etc.) and on the details of each emotion and emotion-specific gesture. For example, for the despair emotion the subject was given a brief description of the emotion (e.g. “facing an existential problem without solution, coupled with a refusal to accept the situation”) and, if the subject required more details, an example of a situation in which the specific emotion was present. All instructions were provided following the procedure used during the collection of the GEMEP corpus [10]. For selecting the emotion-specific gestures we borrowed ideas from figure animation research on the posturing of a figure [26] and arrived at the gestures shown in Table 1.
Body feature extraction
Tracking of the subjects’ body and hands was done using the EyesWeb platform [29]. Starting from the silhouette and the hand blobs of the actors, we extracted five main expressive motion cues using the EyesWeb Expressive Gesture Processing Library [30]: quantity of motion and contraction index of the body, and velocity, acceleration and fluidity of the hands’ barycentre. Data were normalised per actor, using the maximum and minimum values of each motion cue for that actor, so that data from all subjects could be compared.
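The per-actor normalisation can be sketched as follows (a minimal illustration assuming the motion cues are available as a NumPy array per actor; the variable names and data layout are assumptions, not the EyesWeb output format):

import numpy as np

def normalise_per_actor(cues_by_actor):
    # cues_by_actor: dict mapping an actor id to an array of shape
    # (n_samples, n_cues) holding that actor's motion cues.
    normalised = {}
    for actor, cues in cues_by_actor.items():
        lo = cues.min(axis=0)                    # per-cue minimum for this actor
        hi = cues.max(axis=0)                    # per-cue maximum for this actor
        span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
        normalised[actor] = (cues - lo) / span   # each cue mapped to [0, 1]
    return normalised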
Speech feature extraction
The set of features that we used contains features based on intensity, pitch, MFCCs (Mel-frequency cepstral coefficients), Bark spectral bands, voiced-segment characteristics and pause length. The full set contains 377 features. The features from the intensity contour and the pitch contour were extracted using a set of 32 statistical features, applied both to the pitch and intensity contours and to their derivatives. No normalisation was applied before feature extraction; in particular, we did not perform user or gender normalisation of the pitch contour, as is often done to remove differences between registers. The 32 features include: maximum, mean and minimum values, sample mode (most frequently occurring value), interquartile range (difference between the 75th and 25th percentiles), kurtosis, and the third central sample moment.
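A few of the contour statistics listed above can be computed as in the following sketch (an illustration using NumPy; the authors' exact definitions, e.g. of kurtosis, may differ):

import numpy as np

def contour_stats(contour):
    x = np.asarray(contour, dtype=float)
    vals, counts = np.unique(x, return_counts=True)
    centred = x - x.mean()
    m2 = np.mean(centred ** 2)
    return {
        "max": x.max(),
        "mean": x.mean(),
        "min": x.min(),
        "mode": vals[counts.argmax()],                       # most frequent value
        "iqr": np.percentile(x, 75) - np.percentile(x, 25),  # 75th - 25th percentile
        "kurtosis": np.mean(centred ** 4) / m2 ** 2 - 3.0,   # excess kurtosis
        "third_central_moment": np.mean(centred ** 3),
    }

# The same statistics would be applied to the pitch and intensity contours and
# to their derivatives, e.g. contour_stats(np.diff(pitch_contour)).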
Discussion and conclusions
We presented a multimodal framework for analysis and recognition of emotion starting from expressive faces, gestures and speech. We trained and tested a model with a Bayesian classifier, using a multimodal corpus with eight acted emotions and ten subjects of five different nationalities.
We evaluated our approach on a dataset of 240 samples per modality (face, body, speech). Considering the performance of the unimodal emotion recognition systems, the one based on gestures appears to be the most successful, followed by the one based on speech and the one based on facial expressions. We note that in this study we used emotion-specific gestures: gestures selected so as to express each specific emotion. An alternative approach, which may also be of interest, would be to recognise emotions from the different expressivity of the same gesture (one not necessarily associated with any specific emotion) performed under different emotional conditions. This would allow a better comparison with contemporary systems based on facial expressions and speech and will be considered in our future work.