Mind Reading Machines: Automated Inference of Cognitive Mental States from Video
Abstract
Mind reading encompasses our ability to attribute mental
states to others, and is essential for operating in a complex
social environment. The goal in building mind reading
machines is to enable computer technologies to understand
and react to people’s emotions and mental states. This
paper describes a system for the automated inference of
cognitive mental states from observed facial expressions
and head gestures in video. The system is based on a multi-level
dynamic Bayesian network classifier which models
cognitive mental states as a number of interacting facial
and head displays. Experimental results yield an average
recognition rate of 87.4% for 6 mental state groups: agreement,
concentrating, disagreement, interested, thinking and
unsure. Real-time performance, unobtrusiveness and lack
of preprocessing make our system particularly suitable for
user-independent human computer interaction.
Introduction
People mind read or attribute mental states to others all the
time, effortlessly, and mostly subconsciously. Mind reading
allows us to make sense of other people’s behavior, predict
what they might do next, and anticipate how they might feel. While
subtle and somewhat elusive, the ability to mind read is
essential to the social functions we take for granted. A lack
of or impairment in mind reading abilities is thought to be
the primary inhibitor of emotion and social understanding
in people diagnosed with autism (e.g. Baron-Cohen et al.
[2]).
People employ a variety of nonverbal communication
cues to infer underlying mental states, including voice,
posture and the face. The human face in particular provides
one of the most powerful, versatile and natural means
of communicating a wide array of mental states. One
subset comprises cognitive mental states such as thinking,
deciding and being confused, which involve both an affective and
an intellectual component [4].
Extracting head action units
Natural human head motion typically ranges between 70–90°
of downward pitch, 55° of upward pitch, 70° of
yaw (turn), and 55° of roll (tilt), and usually occurs as
a combination of all three rotations [16]. The output
positions of the localized feature points are sufficiently
accurate to permit the use of efficient, image-based head
pose estimation. Expression invariant points such as the
nose tip, root, nostrils, inner and outer eye corners are used
to estimate the pose. Head yaw is given by the ratio of left
to right eye widths. Head roll is given by the orientation
angle of the line connecting the two inner eye corners. The computation of both
head yaw and roll is invariant to scale variations that arise
from moving toward or away from the camera. Head pitch
is determined from the vertical displacement of the nose
tip normalized against the distance between the two eye
corners to account for scale variations. The system supports
up to 50°, 30° and 50° of yaw, roll and pitch respectively.
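As a rough illustration of the geometry described above, the following Python sketch computes the three pose estimates from already-localized 2D feature points; the function name, the point labels and the neutral-frame reference used for pitch are illustrative assumptions, not the paper's exact formulation.

import math

def estimate_head_pose(points, neutral):
    # points / neutral: dicts mapping 'nose_tip', 'left_eye_inner',
    # 'left_eye_outer', 'right_eye_inner', 'right_eye_outer' to (x, y)
    # pixel coordinates; neutral holds the same points for a frontal
    # reference frame (an assumed calibration step).
    def eye_width(p, side):
        inner, outer = p[side + '_eye_inner'], p[side + '_eye_outer']
        return math.hypot(outer[0] - inner[0], outer[1] - inner[1])

    # Yaw: ratio of left to right eye widths (scale-invariant).
    yaw_ratio = eye_width(points, 'left') / eye_width(points, 'right')

    # Roll: orientation angle of the line connecting the inner eye corners.
    li, ri = points['left_eye_inner'], points['right_eye_inner']
    roll_deg = math.degrees(math.atan2(ri[1] - li[1], ri[0] - li[0]))

    # Pitch: vertical nose-tip displacement from the neutral frame,
    # normalized by the inter-eye-corner distance to cancel scale changes.
    inter_eye = math.hypot(ri[0] - li[0], ri[1] - li[1])
    pitch_norm = (points['nose_tip'][1] - neutral['nose_tip'][1]) / inter_eye

    return yaw_ratio, roll_deg, pitch_norm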
Pose estimates across consecutive frames are then used to
identify head action units. For example, a pitch of 20°
at time t followed by 15° at time t + 1 indicates a
downward head action, which is AU54 in the FACS coding
[10].
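A minimal sketch of turning consecutive pitch estimates into head action units follows; the 2° threshold and the decision rule are assumptions made for illustration, while the AU53 (head up) and AU54 (head down) labels come from FACS.

def head_pitch_action(pitch_prev_deg, pitch_curr_deg, threshold_deg=2.0):
    # Classify the frame-to-frame pitch change as an up/down head action.
    delta = pitch_curr_deg - pitch_prev_deg
    if delta <= -threshold_deg:
        return 'AU54'  # downward head action (e.g. 20 deg followed by 15 deg)
    if delta >= threshold_deg:
        return 'AU53'  # upward head action
    return None        # no significant pitch action between these frames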
Extracting facial action units
Facial actions are identified from component-based facial
features (e.g. mouth) comprised of motion, shape and
colour descriptors. Motion- and shape-based analysis are
particularly suitable for a real-time video system, in which
motion is inherent and time constraints place a strict upper
bound on the computational complexity of the methods used.
Colour-based analysis is computationally
efficient, and is invariant to the scale or viewpoint of the
face, especially when combined with feature localization
(i.e. limited to regions already defined by feature point
tracking).
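To make the component-based idea concrete, here is a hedged Python sketch of simple shape and colour descriptors for one region (the mouth), assuming tracked lip feature points and an OpenCV BGR frame; the specific descriptors (aperture ratio, mean saturation) are illustrative stand-ins rather than the paper's actual features.

import cv2
import numpy as np

def mouth_descriptors(frame_bgr, left_corner, right_corner, top_lip, bottom_lip):
    # Shape: mouth aperture normalized by mouth width (scale-invariant).
    width = np.hypot(right_corner[0] - left_corner[0],
                     right_corner[1] - left_corner[1])
    aperture = abs(bottom_lip[1] - top_lip[1]) / max(width, 1e-6)

    # Colour: mean saturation inside the bounding box of the four points,
    # i.e. restricted to the region already defined by feature-point tracking.
    xs = [p[0] for p in (left_corner, right_corner, top_lip, bottom_lip)]
    ys = [p[1] for p in (left_corner, right_corner, top_lip, bottom_lip)]
    roi = frame_bgr[int(min(ys)):int(max(ys)) + 1, int(min(xs)):int(max(xs)) + 1]
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    mean_saturation = float(hsv[..., 1].mean())

    return {'aperture': aperture, 'mean_saturation': mean_saturation}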
Head and facial display recognition
Facial and head actions are quantized and input into left-to-right
HMM classifiers to identify facial expressions and
head gestures. Each is modelled as a temporal sequence of
action units (e.g. a head nod is a series of alternating up and
down movement of the head). In contrast to static classifiers
which classify single frames into an emotion class, HMMs
model dynamic systems spatio-temporally, and deal with
the time warping problem. In addition, the recognition
computation can run in real time, a desirable
property for automated facial expression recognition systems
in human computer interaction.
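As an illustration of how a quantized action sequence might be scored against per-display HMMs, here is a minimal numpy sketch of the forward algorithm in log space; the toy head-nod parameters are assumptions for demonstration, whereas in the paper the HMMs are trained on labelled display sequences.

import numpy as np

def log_likelihood(obs, start, trans, emit):
    # Forward algorithm in log space for a discrete-observation HMM.
    # obs: sequence of symbol indices; start: (n_states,);
    # trans: (n_states, n_states), upper-triangular for left-to-right HMMs;
    # emit: (n_states, n_symbols).
    log_alpha = np.log(start + 1e-12) + np.log(emit[:, obs[0]] + 1e-12)
    for o in obs[1:]:
        log_alpha = (np.logaddexp.reduce(
            log_alpha[:, None] + np.log(trans + 1e-12), axis=0)
            + np.log(emit[:, o] + 1e-12))
    return np.logaddexp.reduce(log_alpha)

# Symbols: 0 = head up (AU53), 1 = head down (AU54), 2 = no action.
nod_hmm = dict(
    start=np.array([1.0, 0.0]),
    trans=np.array([[0.6, 0.4],
                    [0.0, 1.0]]),                        # left-to-right structure
    emit=np.array([[0.1, 0.8, 0.1],                      # state 0: mostly 'down'
                   [0.8, 0.1, 0.1]]))                    # state 1: mostly 'up'

observed = [1, 0, 1, 0, 1]  # alternating down/up actions over five frames
print(log_likelihood(observed, **nod_hmm))               # compare across display HMMs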
Applications and conclusion
The principal contribution of this paper is a multi-level
DBN classifier for inferring cognitive mental states from
videos of facial expressions and head gestures in real time.
The strengths of the system include being fully automated,
user-independent, and supporting purposeful head displays
while decoupling them from facial display recognition. We
reported promising results for 6 cognitive mental states on a
medium-sized posed dataset of labelled videos. Our current
research directions include:
1. testing the generalization power of the system by
evaluating it on a larger and more natural dataset
2. exploring the within-class and between-class variation
among the various mental state classes, perhaps by
utilizing cluster analysis and/or unsupervised classification