Modeling and Compressing 3-D Facial Expressions Using Geometry Videos
Abstract
In this paper, we present a novel geometry video (GV) framework to model and compress 3-D facial expressions. GV bridges the gap between 3-D motion data and 2-D video, and provides a natural way to apply well-studied video processing techniques to motion data processing. Our framework includes a set of algorithms to construct GVs, such as hole filling, geodesic-based face segmentation, expression-invariant parameterization (EIP), and GV compression. Our EIP algorithm guarantees the exact correspondence of the salient features (eyes, mouth, and nose) across frames, which leads to GVs with better spatial and temporal coherence than those produced by conventional parameterization methods. Taking advantage of this property, we also propose a new H.264/AVC-based progressive directional prediction scheme, which provides a further 10%–16% bitrate reduction over the original H.264/AVC applied to GV compression while maintaining good video quality. Our experimental results on real-world datasets demonstrate that GV is very effective for modeling high-resolution 3-D expression data, thus providing an attractive approach to expression information processing for the gaming and movie industries.
Introduction
Over the past decade, we have witnessed a revolution in the movie and game industries resulting from the use of motion data. Nowadays, it is common for actors to work in front of a blue screen and interact with invisible computer-animated characters that are added later, trying to fit into a computer-animated world. The movements of the actors are recorded using a motion capture (or mocap) system, by which complex movements, realistic physical interactions, and exchanges of forces can be recreated in a physically accurate manner. Despite this great success in movies and gaming, current mocap requires the subject to wear calibrated markers, and its output is just the approximate motion of a skeleton representing the rigid parts of the subject.
Related Work
GV bridges two different research fields: geometry processing and video processing. This section briefly reviews related work in motion data acquisition and processing, 3-D motion data compression, geometry images/videos, and video compression.
3-D Motion Data Acquisition and Processing
In recent years, we have witnessed significant advances in high-speed shape acquisition devices. Using range scanning techniques such as phase-shifting structured light [1], [5], [6] and spacetime stereo [7], [8], it is possible to scan high-resolution 3-D geometry and/or texture of moving and deforming objects at video speeds.
Wang et al. [9] presented a data-driven approach for accurate facial tracking and expression retargeting. Wang et al. [10] simplified the 3-D human face registration problem to a 2-D image matching problem by conformal parameterization. Mitra et al. [11] proposed an algorithm to register large sets of unstructured point clouds of moving and deforming objects without computing correspondences. Chang and Zwicker [12] presented an unsupervised algorithm that aligns a pair of articulated shapes with significant motion and missing data. Sharf et al. [13] developed a volumetric space-time technique to reconstruct moving and deforming objects from point clouds. Wang et al. [14] developed an efficient non-rigid 3-D motion tracking algorithm to establish inter-frame correspondences that facilitate the temporal study of subtle motions in facial expressions.
3-D Motion Data Compression
Time-varying meshes (TVMs) were introduced for 3-D motion data compression by Han et al. [16]. A TVM is a 3-D motion representation generated from multiple-viewpoint images [17]. Because each frame is generated independently, TVMs provide no correspondence between frames, and the data captured by the structured light-based 3-D camera is bulky and noisy. Although it is natural to model 3-D motion data with TVMs, there are few papers on TVM compression due to these challenges. Han et al. [16] proposed an extended block matching algorithm for TVM compression; by extending the block matching algorithm from 2-D video to 3-D mesh data, they achieved compression ratios of 10–18%, but only inter-frame coding was used in their work. By considering both spatial and temporal redundancies of TVMs, Han et al. [18] achieved compression ratios of 1.9–16%. Yamasaki et al. [19] also compressed connectivity and color textures; instead of marching cubes, they used a patch-based method to describe the model. However, none of the above works has taken the correspondence between frames into consideration.
3-D Motion Data Acquisition and Pre-Processing
We employ the structured light-based 3-D camera system [1] to capture moving objects in real time. The system contains a video camera and a structured light projector. The projector projects digital fringe patterns composed of vertical straight stripes onto the object, and the stripes are deformed by the surface profile. A high-speed charge-coupled device camera synchronized with the projector then captures the distorted fringe images. Finally, by analyzing the fringe images, the 3-D information is recovered from the deformation using triangulation. The system is able to capture the geometry and texture of moving objects in real time. Despite its high speed, the 3-D camera system is not robust, for various reasons such as ambient light interference and occlusions.
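To make this pipeline concrete, here is a minimal sketch of the classic three-step phase-shifting computation used by such systems. The exact pattern design and calibration model of [1] may differ; wrapped_phase, height_map, and the calibration constant k are hypothetical names introduced for illustration, and phase unwrapping is omitted for brevity.

```python
import numpy as np

def wrapped_phase(i1, i2, i3):
    # Classic three-step phase shifting: recover the wrapped phase
    # from three fringe images captured with phase shifts of 2*pi/3.
    return np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)

def height_map(phase, ref_phase, k):
    # Convert the (unwrapped) phase difference against a flat
    # reference plane into depth by triangulation; k is a calibration
    # constant determined by the camera/projector geometry
    # (hypothetical placeholder).
    return k * (phase - ref_phase)
```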
Expression-Invariant Parameterization
In each frame of the captured motion data, the geometry is given in the reference system of the scanner: it is not registered in object space, and correspondences between points in different frames are not available. From the analysis and editing point of view, it is highly desirable to find the correspondence among the captured data. Motion data parameterization serves this purpose by mapping all frames to a common parametric domain and then re-sampling the data on that domain.
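As an illustration of this re-sampling step (a minimal sketch under assumed inputs, not the paper's exact implementation), the following rasterizes one parameterized frame onto a regular grid, producing a single geometry-image frame; uv and xyz are assumed to come from the parameterization.

```python
import numpy as np
from scipy.interpolate import griddata

def resample_frame(uv, xyz, n=256):
    """Re-sample one parameterized frame onto an n x n grid.

    uv  : (V, 2) per-vertex parameter coordinates in [0, 1]^2
    xyz : (V, 3) per-vertex positions for this frame
    """
    gu, gv = np.meshgrid(np.linspace(0.0, 1.0, n),
                         np.linspace(0.0, 1.0, n))
    # Each pixel stores an interpolated (x, y, z) surface position.
    img = griddata(uv, xyz, (gu, gv), method="linear")
    return img  # (n, n, 3); NaN outside the parameter domain
```

Stacking one such image per frame yields the geometry video that is then handed to the video encoder.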
Although there is a large body of literature on surface parameterization [37], [38], there is little work on motion data parameterization. The key challenge in motion data parameterization is that it must take temporal coherence into consideration, i.e., the features in all frames should be mapped consistently to the parametric domain.
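To illustrate what consistent mapping means, the sketch below warps one frame's parameter coordinates so that landmark vertices (e.g., eye corners, mouth corners, nose tip) land at canonical positions shared by all frames. This least-squares affine warp is only a hypothetical post-process for illustration; our EIP algorithm instead enforces exact feature correspondence within the parameterization itself.

```python
import numpy as np

def align_landmarks(uv, landmark_idx, target_uv):
    """Affine warp of one frame's parameter coordinates so that
    landmark vertices map to fixed canonical positions.

    uv           : (V, 2) parameter coordinates of this frame
    landmark_idx : indices of the landmark vertices
    target_uv    : (L, 2) canonical positions shared by all frames
    """
    ones = np.ones((len(landmark_idx), 1))
    src = np.hstack([uv[landmark_idx], ones])
    # Least-squares fit; note an affine map matches more than three
    # landmarks only approximately, unlike exact correspondence in EIP.
    affine, *_ = np.linalg.lstsq(src, target_uv, rcond=None)
    return np.hstack([uv, np.ones((len(uv), 1))]) @ affine
```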
Conclusion
This paper presented a novel framework to model and encode 3-D facial expressions using GVs. Within our framework, we parameterized the 3-D expressions with guaranteed feature correspondence and stored them in a video format, allowing the 3-D data to be significantly compressed by well-studied video compression techniques. Compared to other parameterization methods, our method leads to results that are highly consistent and insensitive to the expressions, and to a higher degree of coherence in the constructed GVs, which is highly desirable for video compression. Our experimental results on real-world datasets showed that our framework is very effective for modeling 3-D facial motion data, and that our predictive compression scheme can lead to considerably improved rate-distortion performance over the original H.264/AVC without any extra cost, thus allowing better GV compression.