Multi-pose lipreading and audio-visual speech recognition
Abstract
In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose relative to the camera. To handle these situations, we introduce a pose normalization block in a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance caused by pose changes and pose normalization techniques. In audio-visual experiments, we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques through the weight assigned to the visual stream in the classifier.
Introduction
The performance of automatic speech recognition (ASR) systems degrades severely in the presence of noise, compromising their use in real-world scenarios. In these circumstances, ASR systems can benefit from other sources of information that are complementary to the audio signal yet still related to speech. Visual speech constitutes such a source of information. Mimicking human lipreading, visual ASR systems are designed to recognize speech from images and videos of the speaker's mouth. This gives rise to audio-visual automatic speech recognition (AV-ASR), which combines the audio and visual modalities of speech to improve the performance of audio-only ASR, especially in the presence of noise [1,2]. In these situations, we cannot trust the corrupted audio signal and must rely on the visual modality of speech to guide recognition. The major challenges that AV-ASR has to face are, therefore, the definition of reliable visual features for speech recognition and the integration of the audio and visual cues when making decisions about the speech classes.
Audio-visual speech recognition
In terms of the visual modality, AV-ASR systems differ in three major aspects: the visual front-end, the audio-visual integration strategy and the pattern classifier associated with the speech recognition task. In Figure 1, we present a typical AVSR system. First, the audio front-end extracts the audio features that will be used in the classifier. This block is identical to that of an audio-only ASR system, and the features most commonly used are perceptual linear prediction [16] or Mel-frequency cepstral coefficients [17,18]. In parallel, the face of the speaker has to be localized in the video sequence and the region of the mouth detected and normalized before relevant features can be extracted [1,19]. Typically, both the audio and visual features are extended to include some temporal information about the speech process. The features are then fed to statistical classifiers, usually hidden Markov models (HMM) [20], to estimate the most likely sequence of phonemes or words. The fusion of information between modalities can happen at two stages [1]: merging the extracted features before they go through the pattern classifiers, or within the statistical models used for classification. In the following, we focus on the visual modality, in particular on the blocks affected by pose changes of the speaker: the extraction of visual features from images of the mouth and the integration of the visual and audio streams. Finally, we present the standard AV-ASR system that we adopt and describe how pose normalization can be incorporated into it.
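To make the two front-ends concrete, the sketch below shows a minimal Python version of this pipeline: MFCCs with temporal deltas on the audio side, and face localization followed by mouth-region cropping and size normalization on the visual side. The use of librosa and an OpenCV Haar cascade, as well as all parameter values, are illustrative assumptions rather than the components used in the article.

# Minimal sketch of the audio and visual front-ends of a typical AVSR system.
# librosa / OpenCV and all parameter values are illustrative assumptions.
import cv2
import librosa
import numpy as np

def audio_front_end(wav_path, sr=16000, n_mfcc=13):
    """MFCC features per frame, extended with first-order temporal deltas."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    delta = librosa.feature.delta(mfcc)                           # temporal context
    return np.vstack([mfcc, delta]).T                             # (T, 2 * n_mfcc)

def visual_front_end(frame, face_detector, roi_size=(64, 64)):
    """Locate the face, crop a rough mouth region and normalize its size."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Rough mouth region: lower third of the detected face box.
    mouth = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    return cv2.resize(mouth, roi_size)  # normalized mouth ROI for feature extraction

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")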
Audio-visual integration and classification
Audio-visual integration strategies can be grouped into two categories: feature fusion and decision fusion [1,3]. In the first case, the audio and visual features are combined by projecting them onto an audio-visual feature space, where traditional single-stream classifiers are used [33-36]. Decision fusion, in turn, processes the streams separately and, at a certain level, combines the outputs of the single-modality classifiers. Decision fusion allows more flexibility for modality integration and is the technique usually adopted in AV-ASR systems [1,3] because it allows weighting the contribution of each modality in the classification task.
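A minimal sketch of decision fusion with a stream weight is given below. The weighted combination of per-class log-likelihoods is a common form of decision fusion; the function names, the single-frame setting and the example numbers are assumptions for illustration, not the article's HMM-based classifier.

# Sketch of decision fusion with a stream weight (single-frame illustration).
import numpy as np

def fuse_and_classify(audio_loglik, visual_loglik, visual_weight=0.3):
    """Combine per-class log-likelihoods of two single-modality classifiers.

    audio_loglik, visual_loglik: arrays of shape (n_classes,) with log p(o | c).
    visual_weight: weight of the visual stream, lowered when the visual
    evidence is unreliable (e.g., non-frontal poses) and raised for noisy audio.
    """
    lam_v = visual_weight
    lam_a = 1.0 - lam_v
    fused = lam_a * audio_loglik + lam_v * visual_loglik
    return int(np.argmax(fused)), fused

# Example with three speech classes: the audio stream is ambiguous,
# so the visual stream tips the decision.
audio = np.log(np.array([0.34, 0.33, 0.33]))
visual = np.log(np.array([0.10, 0.80, 0.10]))
print(fuse_and_classify(audio, visual, visual_weight=0.4))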
Linear regression and lipreading
In our study, the LR techniques are applied considering X and Y to be either the images from the frontal and lateral views, X_I and Y_I, or the visual features extracted from them at different stages of the feature extraction process. A first set of features, X_F and Y_F, is designed to smooth the images and obtain a more compact, low-dimensional representation in the frequency domain. Afterwards, those features are transformed and their dimensionality further reduced so that they contain only information relevant for speech classification, leading to the vectors X_L and Y_L used in the subsequent speech classifier.
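The sketch below illustrates the linear-regression step in this notation: a matrix W is fitted by least squares so that frontal features are predicted from lateral ones, and virtual frontal features for a new lateral observation y are then obtained as W y. The ridge regularization term and the function name are assumptions added for illustration, not details taken from the article.

# Minimal sketch of the linear-regression mapping between poses.
import numpy as np

def fit_pose_mapping(X_frontal, Y_lateral, reg=1e-3):
    """Solve W = argmin ||X - W Y||_F^2 + reg * ||W||_F^2.

    X_frontal: (d_x, N) frontal features (images, DCT or LDA features).
    Y_lateral: (d_y, N) lateral features of the same N synchronized frames.
    Returns W of shape (d_x, d_y), so that X ≈ W @ Y.
    """
    YYt = Y_lateral @ Y_lateral.T + reg * np.eye(Y_lateral.shape[0])
    Wt = np.linalg.solve(YYt, Y_lateral @ X_frontal.T)   # (d_y, d_x)
    return Wt.T

# Pose normalization of a new lateral observation y:
# x_virtual = W @ y, which is then fed to the frontal-trained classifier.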
The visual features X_F and Y_F are the first coefficients of the two-dimensional DCT of the image following the zigzag order, which provide a smooth, compact and low-dimensional representation of the mouth. Note that the selected DCT coefficients can be obtained as a linear transform, X_F = S D X_I, with D the matrix of the two-dimensional DCT basis and S a matrix selecting the DCT coefficients of interest. Therefore, if W_I is the linear mapping estimated between the images, there is also an approximate linear mapping D W_I D^T between the DCT coefficients of the frontal and lateral images, X_D and Y_D.
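As an illustration of this feature extraction, the following sketch computes the 2-D DCT of a mouth ROI and keeps the first coefficients in zigzag order. The use of scipy and the number of retained coefficients are illustrative choices, not values taken from the article.

# Sketch of the DCT feature extraction: 2-D DCT of the mouth image, keeping
# the first coefficients in zigzag order.
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n_rows, n_cols):
    """Row/column indices of a 2-D array in JPEG-style zigzag order."""
    coords = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return sorted(coords,
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def dct_features(mouth_roi, n_coeffs=30):
    """First n_coeffs 2-D DCT coefficients of the mouth ROI in zigzag order."""
    d = dct(dct(mouth_roi.astype(float), axis=0, norm='ortho'),
            axis=1, norm='ortho')
    order = zigzag_indices(*d.shape)[:n_coeffs]
    return np.array([d[r, c] for r, c in order])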
Projective transforms on the images
A simple option when working with the images themselves is to estimate a projective transform from the lateral to the frontal views as a change of coordinate system between the images. In fact, since the difference in pose involves an extra dimension not taken into account in the projective model (the 3-D nature of the head rotation), this approach can only be justified for small pose changes and is impossible to apply, for example, to 90° of head rotation. We include two projective transforms in our experiments to measure the gain obtained by the learning approach of the LR techniques over such geometric transforms. In both cases, we estimate a 3 × 3 projective transform T between the image coordinates, once with a semi-manual and once with an automatic procedure. The coordinate points P used for this purpose are the corners of the lips, the center of the cupid's bow and the center of the lower lip contour in the different poses. In the semi-manual procedure, we selected several frames of each sequence, manually marked the positions of those four points in the frontal and lateral views and estimated the transform T minimizing the error of P_frontal = T P_lateral over the selected frames of the sequence. For the automatic method, we segment the image based on color and region information and detect the lip contour and the positions of the points P from the segmentation.
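For illustration, the sketch below estimates such a 3 × 3 transform T from the marked point correspondences and uses it to warp a lateral mouth image into a virtual frontal view. OpenCV's homography estimation and the output size are assumptions made for the example, not the exact procedure of the article.

# Sketch of the projective-transform baseline: estimate the 3x3 matrix T that
# maps lateral lip-point coordinates to frontal ones from marked correspondences.
import cv2
import numpy as np

def estimate_projective_transform(pts_lateral, pts_frontal):
    """Least-squares estimate of T such that P_frontal ~ T P_lateral.

    pts_lateral, pts_frontal: (N, 2) arrays with the lip points (lip corners,
    center of the cupid's bow, center of the lower lip contour) collected over
    the selected frames of a sequence.
    """
    T, _ = cv2.findHomography(np.float32(pts_lateral), np.float32(pts_frontal), 0)
    return T

def warp_to_frontal(lateral_image, T, out_size=(64, 64)):
    """Generate a virtual frontal view of the mouth ROI from a lateral frame."""
    return cv2.warpPerspective(lateral_image, T, out_size)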
Conclusions
In this article, we presented a lipreading system able to recognize speech from different views of the speaker. Inspired by pose-invariant face recognition studies, we introduced a pose normalization block in a standard system and generated virtual frontal views from non-frontal images. In particular, we used linear regression to project the features associated with different poses at different stages of the lipreading system: the images themselves, a low-dimensional and compact representation of the images in the frequency domain, or the final LDA features used for classification. Our experiments show that pose normalization is more successful when applied directly to the LDA features used in the classifier, while the projection of more general features, such as the images or their low-frequency representation, fails because of misalignments in the training data and errors in the estimation of the transforms.