
3D TV: A Scalable System for Real-Time Acquisition, Transmission, and
Autostereoscopic Display of Dynamic Scenes

Abstract
Three-dimensional TV is expected to be the next revolution in the
history of television. We implemented a 3D TV prototype system
with real-time acquisition, transmission, and 3D display of dynamic
scenes. We developed a distributed, scalable architecture to manage
the high computation and bandwidth demands. Our system consists
of an array of cameras, clusters of network-connected PCs, and a
multi-projector 3D display. Multiple video streams are individually
encoded and sent over a broadband network to the display. The
3D display shows high-resolution (1024×768) stereoscopic color
images for multiple viewpoints without special glasses. We implemented
systems with rear-projection and front-projection lenticular
screens. In this paper, we provide a detailed overview of our 3D
TV system, including an examination of design choices and tradeoffs.
We present the calibration and image alignment procedures
that are necessary to achieve good image quality. We present qualitative
results and some early user feedback. We believe this is the
first real-time end-to-end 3D TV system with enough views and
resolution to provide a truly immersive 3D experience.
CR Categories: B.4.2 [Input/Output and Data Communications]:
Input/Output Devices—Image Display
Keywords: Autostereoscopic displays, multiview displays, camera
arrays, projector arrays, lightfields, image-based rendering
1 Introduction
Humans gain three-dimensional information from a variety of cues.
Two of the most important ones are binocular parallax, scientifically
studied by Wheatstone in 1838, and motion parallax, described
by Helmholtz in 1866. Binocular parallax refers to seeing
a different image of the same object with each eye, whereas motion
parallax refers to seeing different images of an object when
moving the head. Wheatstone was able to scientifically prove the
link between parallax and depth perception using a stereoscope – the
world’s first three-dimensional display device [Okoshi 1976]. Ever
since, researchers have proposed and developed devices to stereoscopically
display images. These three-dimensional displays hold
tremendous potential for many applications in entertainment, information
presentation, reconnaissance, tele-presence, medicine, visualization,
remote manipulation, and art.
In 1908, Gabriel Lippmann, who made major contributions to color
photography and three-dimensional displays, contemplated producing
a display that provides a “window view upon reality” [Lippmann
1908]. Stephen Benton, one of the pioneers of holographic
imaging, refined Lippmann’s vision in the 1970s. He set out to design
a scalable spatial display system with television-like characteristics,
capable of delivering full color, 3D images with proper occlusion
relationships. The display should provide images with binocular
parallax (i.e., stereoscopic images) that can be viewed from any
viewpoint without special glasses. Such displays are called multiview
autostereoscopic since they naturally provide binocular and
motion parallax for multiple observers. 3D video usually refers to
stored animated sequences, whereas 3D TV includes real-time acquisition,
coding, and transmission of dynamic scenes. In this paper
we present the first end-to-end 3D TV system with 16 independent
high-resolution views and autostereoscopic display.
Research towards the goal of end-to-end 3D TV started in Japan after
the Tokyo Olympic Games in 1964 [Javidi and Okano 2002].
Most of that research focused on the development of binocular
stereo cameras and stereo HDTV displays because the display of
multiple perspective views inherently requires a very high display
resolution. For example, to achieve maximum HDTV output resolution
with 16 distinct horizontal views requires 1920×1080×16
or more than 33 million pixels, which is well beyond most current
display technologies. It has only recently become feasible to deal
with the high processing and bandwidth requirements of such high-resolution
TV content.
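As a quick sanity check on these numbers (the figures below simply restate the pixel counts quoted in this paper), the multiplication can be written out as follows; the 12-million-pixel result corresponds to the native resolution of our projector-based display discussed in Section 2.4.

```python
# Pixel budget of a multiview display: per-view resolution times number of views.
hdtv_16_views = 1920 * 1080 * 16   # 16 distinct views at full HDTV resolution
xga_16_views  = 1024 * 768 * 16    # 16 views at 1024x768, as in our prototype

print(f"HDTV x 16 views: {hdtv_16_views:,} pixels")  # 33,177,600 -> "more than 33 million"
print(f"XGA  x 16 views: {xga_16_views:,} pixels")   # 12,582,912 -> about 12 million
```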
In this paper we present a system for real-time acquisition, transmission,
and high-resolution 3D display of dynamic multiview TV
content. We use an array of hardware-synchronized cameras to capture
multiple perspective views of the scene. We developed a fully
distributed architecture with clusters of PCs on the sender and receiver
side. We implemented several large, high-resolution 3D displays
by using a multi-projector system and lenticular screens with
horizontal parallax only. The system is scalable in the number of
acquired, transmitted, and displayed video streams. The hardware
is relatively inexpensive and consists mostly of commodity components
that will further decrease in price. The system architecture is
flexible enough to enable a broad range of research in 3D TV. Our
system provides enough viewpoints and enough pixels per viewpoint
to produce a believable and immersive 3D experience.
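To illustrate the flavor of this fully distributed design, the sketch below runs one independent sender process per camera stream and one receiver process per display stream, with no central encoder or decoder. It is only a conceptual sketch under assumed names: the queues stand in for the broadband network, and the encode/decode placeholders are not the MPEG-2 components described in Section 3.

```python
# Conceptual sketch of a fully distributed per-stream pipeline: one sender process per
# camera and one receiver process per display stream, with no centralized processing.
# All names below are illustrative stand-ins, not the actual system components.
from multiprocessing import Process, Queue

NUM_STREAMS = 4      # the real system uses 16 hardware-synchronized cameras
NUM_FRAMES = 3       # keep the demo short

def producer(stream_id: int, channel: Queue) -> None:
    """Sender side: acquire and encode one camera stream independently of all others."""
    for t in range(NUM_FRAMES):
        frame = f"cam{stream_id}-frame{t}"   # stand-in for a captured video frame
        packet = frame.encode()              # stand-in for per-stream video encoding
        channel.put(packet)
    channel.put(None)                        # end-of-stream marker

def consumer(stream_id: int, channel: Queue) -> None:
    """Receiver side: decode one stream and hand it to the projector(s) that need it."""
    while (packet := channel.get()) is not None:
        frame = packet.decode()              # stand-in for decoding
        print(f"display stream {stream_id} <- {frame}")

if __name__ == "__main__":
    channels = [Queue() for _ in range(NUM_STREAMS)]  # stand-in for the broadband network
    procs = [Process(target=producer, args=(i, q)) for i, q in enumerate(channels)]
    procs += [Process(target=consumer, args=(i, q)) for i, q in enumerate(channels)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Because every stream has its own producer and consumer, adding cameras or projectors only adds processes; this is the sense in which the architecture scales in the number of acquired, transmitted, and displayed views.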
We make the following contributions:
Distributed architecture: In contrast to previous work in multiview
video, we use a fully distributed architecture for acquisition,
compression, transmission, and image display.
Scalability: The system is completely scalable in the number of
acquired, transmitted, and displayed views.
Multiview video rendering: A new algorithm efficiently renders
novel views from multiple dynamic video streams on a cluster
of PCs.
High-resolution 3D display: Our 3D display provides horizontal
parallax with 16 independent perspective views at 1024×768
resolution.
Computational alignment for 3D displays: Image alignment
and intensity adjustment of the 3D multiview display are
completely automatic using a camera in the loop.
After an extensive discussion of previous work we give a detailed
system overview, including a discussion of design choices and
tradeoffs. Then we discuss the automatic system calibration using
a camera in the loop. Finally, we present results, user experiences,
and avenues for future work.
2 Previous Work and Background
The topic of 3D TV – with thousands of publications and patents –
incorporates knowledge from multiple disciplines, such as image-based
rendering, video coding, optics, stereoscopic displays, multi-projector
displays, computer vision, virtual reality, and psychology.
Some of the work may not be widely known across disciplines.
There are some good overview books on 3D TV [Okoshi 1976; Javidi
and Okano 2002]. In addition, we provide an extensive review
of the previous work.
2.1 Model-Based Systems
One approach to 3D TV is to acquire multiview video from sparsely
arranged cameras and to use some model of the scene for view interpolation.
Typical scene models are per-pixel depth maps [Fehn
et al. 2002; Zitnick et al. 2004], the visual hull [Matusik et al.
2000], or a prior model of the acquired objects, such as human
body shapes [Carranza et al. 2003]. It has been shown that even
coarse scene models improve the image quality during view synthesis
[Gortler et al. 1996]. It is possible to achieve very high image
quality with a two-layer image representation that includes automatically
extracted boundary mattes near depth discontinuities [Zitnick
et al. 2004].
One of the earliest and largest 3D video studios is the virtualized reality
system of Kanade et al. [Kanade et al. 1997] with 51 cameras
arranged in a geodesic dome. The Blue-C system at ETH Zürich
consists of a room-sized environment with real-time capture and
spatially-immersive display [Gross et al. 2003]. The Argus research
project of the Air Force uses 64 cameras that are arranged in a large
semi-circle [Javidi and Okano 2002, Chapter 9]. Many other, similar
systems have been constructed.
All 3D video systems provide the ability to interactively control the
viewpoint, a feature that has been termed free-viewpoint video by
the MPEG Ad-Hoc Group on 3D Audio and Video (3DAV) [Smolic
and Kimata 2003]. During rendering, the multiview video can be
projected onto the model to generate more realistic view-dependent
surface appearance [Matusik et al. 2000; Carranza et al. 2003].
Some systems also display low-resolution stereo-pair views of the
scene in real-time.
Real-time acquisition of scene models for general, real-world
scenes is very difficult and the subject of ongoing research. Many systems
do not provide real-time end-to-end performance, and those that
do are limited to simple scenes with only a handful of objects.
We are using a dense lightfield representation that does not require
a scene model, although we are able to benefit from it should it
be available [Gortler et al. 1996; Buehler et al. 2001]. On the other
hand, dense lightfields require more storage and transmission bandwidth.
We demonstrate that these issues can be solved today.
2.2 Lightfield Systems
A lightfield represents radiance as a function of position and direction
in regions of space free of occluders [Levoy and Hanrahan
1996]. The ultimate goal, which Gavin Miller called the “hyper
display” [Miller 1995], is to capture a time-varying lightfield passing
through a surface and to emit the same (directional) lightfield
through another surface with minimal delay.
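For concreteness, a static lightfield in the two-plane parameterization of Levoy and Hanrahan [1996] is a 4D function of where a ray crosses two parallel reference planes; the dynamic lightfield that such a "hyper display" would have to capture and re-emit adds time as a fifth parameter. The notation below is ours, not taken from the cited papers.

```latex
% Static lightfield: radiance along the ray that crosses one reference plane at (u,v)
% and a second, parallel plane at (s,t), assuming no occluders between the planes.
L = L(u, v, s, t)

% Dynamic (time-varying) lightfield, as a 3D TV system must capture and re-emit it:
L = L(u, v, s, t, \tau)
```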
Early work in image-based graphics and 3D displays has dealt with
static lightfields [Ives 1928; Levoy and Hanrahan 1996; Gortler
et al. 1996]. In 1929, H. E. Ives proposed a photographic multi-camera
recording method for large objects in conjunction with
the first projection-based 3D display [Ives 1929]. His proposal
bears some architectural similarities to our system, although modern
technology allows us to achieve real-time performance.
Acquisition of dense, dynamic lightfields has only recently become
feasible. Some systems use a bundle of optical fibers in front
of a high-definition camera to capture multiple views simultaneously
[Javidi and Okano 2002, Chapters 4 and 8]. The problem with
single-camera systems is that the limited resolution of the camera
greatly reduces the number and resolution of the acquired views.
Most systems – including ours – use a dense array of synchronized
cameras to acquire high-resolution lightfields. The configuration
and number of cameras is usually flexible. Typically, the cameras
are connected to a cluster of PCs [Schirmacher et al. 2001; Naemura
et al. 2002; Yang et al. 2002]. The Stanford multi-camera array
[Wilburn et al. 2002] consists of up to 128 cameras and special-purpose
hardware to compress and store all the video data in real-time.
Most lightfield cameras allow interactive navigation and manipulation
(such as “freeze frame” effects) of the dynamic scene. Some
systems also acquire [Naemura et al. 2002] or compute [Schirmacher
et al. 2001] per-pixel depth maps to improve the results of
lightfield rendering. Our system uses 16 high-resolution cameras,
real-time compression and transmission, and 3D display of the dynamic
lightfield on a large multiview screen.
2.3 Multiview Video Compression and Transmission
Multiview video compression has mostly focused on static lightfields
(e.g., [Magnor et al. 2003; Ramanathan et al. 2003]). There
has been relatively little research on how to compress and transmit
multiview video of dynamic scenes in real-time. A notable exception
is the work by Yang et al. [2002]. They achieve real-time display
from an 8×8 lightfield camera by transmitting only the rays
that are necessary for view interpolation. However, it is impossible
to anticipate all the viewpoints in a TV broadcast setting. We
transmit all acquired video streams and use a similar strategy on
the receiver side to route the videos to the appropriate projectors
for display (see Section 3.3).
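A minimal sketch of this receiver-side routing idea is given below: each projector view only needs the decoded camera streams that lie closest to it along the camera baseline. The positions, the choice of two nearest streams, and the function names are illustrative assumptions; the actual consumer-side processing is described in Section 3.3.

```python
# Illustrative receiver-side routing: each projector's virtual view is served by the
# camera streams nearest to it along a (hypothetical) one-dimensional baseline.

def nearest_streams(view_x: float, camera_xs: list[float], k: int = 2) -> list[int]:
    """Return indices of the k camera streams closest to a projector's view position."""
    order = sorted(range(len(camera_xs)), key=lambda i: abs(camera_xs[i] - view_x))
    return order[:k]

camera_xs = [i / 15.0 for i in range(16)]      # 16 cameras spread over a unit baseline
projector_xs = [i / 15.0 for i in range(16)]   # 16 projector views over the same span

routing = {p: nearest_streams(x, camera_xs) for p, x in enumerate(projector_xs)}
print(routing[0], routing[7])                  # e.g. view 0 draws mainly on streams 0 and 1
```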
Most systems compress the multiview video off-line and focus on
providing interactive decoding and display. An overview of some
early off-line compression approaches can be found in [Javidi and
Okano 2002, Chapter 8]. Motion compensation in the time domain
is called temporal encoding, and disparity prediction between cameras
is called spatial encoding [Tanimoto and Fujii 2003]. Zitnick
et al. [Zitnick et al. 2004] show that a combination of temporal and
spatial encoding leads to good results. The Blue-C system converts
the multiview video into 3D “video fragments” that are then
compressed and transmitted [Lamboray et al. 2004]. However, all
current systems use a centralized processor for compression, which
limits their scalability in the number of compressed views.
Another approach to multiview video compression, promoted by
the European ATTEST project [Fehn et al. 2002], is to reduce the
data to a single view with per-pixel depth map. This data can be
compressed in real-time and broadcast as an MPEG-2 enhancement
layer. On the receiver side, stereo or multiview images are generated
using image-based rendering. However, it may be difficult to
generate high-quality output because of occlusions or high disparity
in the scene [Chen and Williams 1993]. Moreover, a single view
cannot capture view-dependent appearance effects, such as reflections
and specular highlights.
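The sketch below shows the basic depth-image-based rendering step such a receiver would perform: forward-warping a single view plus per-pixel depth to a nearby horizontal viewpoint. The rectified-camera assumption and the textbook disparity formula d = f·b/Z are simplifications for illustration, not the specific algorithm of the ATTEST project, and the holes left by disocclusions show exactly the difficulty mentioned above.

```python
# Simplified depth-image-based rendering: forward-warp one view plus per-pixel depth
# to a nearby horizontal viewpoint. Disoccluded regions remain as holes (zeros),
# and this naive loop does not resolve overlaps in depth order.
import numpy as np

def warp_view(image: np.ndarray, depth: np.ndarray, f: float, baseline: float) -> np.ndarray:
    """Render a novel view shifted by `baseline`, using disparity d = f * baseline / Z."""
    h, w = depth.shape
    out = np.zeros_like(image)
    disparity = np.round(f * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xn = x + disparity[y, x]
            if 0 <= xn < w:
                out[y, xn] = image[y, x]
    return out

# Tiny synthetic example: a nearer block (smaller depth) shifts farther than the
# background, leaving holes behind it in the warped view.
img = np.arange(32, dtype=np.uint8).reshape(4, 8)
dep = np.full((4, 8), 20.0)      # background depth
dep[1:3, 2:4] = 10.0             # a nearer object
print(warp_view(img, dep, f=20.0, baseline=1.0))
```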
High-quality 3D TV broadcasting requires that all the views are
transmitted to multiple users simultaneously. The MPEG 3DAV
group [Smolic and Kimata 2003] is currently investigating compression
approaches based on simultaneous temporal and spatial
encoding. Our system uses temporal compression only and transmits
all of the views as independent MPEG-2 video streams. We
will discuss the tradeoffs in Section 3.2.
2.4 Multiview Autostereoscopic Displays
Holographic Displays It is widely acknowledged that the hologram
was invented by Dennis Gabor in 1948 [Gabor 1948], although
the French physicist Aimé Cotton first described holographic
elements in 1901. Holographic techniques were first applied
to image display by Leith and Upatnieks in 1962 [Leith and
Upatnieks 1962]. In holographic reproduction, light from an illumination
source is diffracted by interference fringes on the holographic
surface to reconstruct the light wavefront of the original
object. A hologram displays a continuous analog lightfield, and
real-time acquisition and display of holograms have long been considered
the “holy grail” of 3D TV.
Stephen Benton’s Spatial Imaging Group at MIT has been pioneering
the development of electronic holography. Their most recent
device, the Mark-II Holographic Video Display, uses acoustooptic
modulators, beamsplitters, moving mirrors, and lenses to create
interactive holograms [St.-Hillaire et al. 1995]. In more recent
systems, moving parts have been eliminated by replacing the
acousto-optic modulators with LCD [Maeno et al. 1996], focused
light arrays [Kajiki et al. 1996], optically-addressed spatial modulators
[Stanley et al. 2000], or digital micromirror devices [Huebschman
et al. 2003].
All current holo-video devices use single-color laser light. To reduce
the amount of display data they provide only horizontal parallax.
The display hardware is very large in relation to the size
of the image (which is typically a few millimeters in each dimension).
The acquisition of holograms still demands carefully controlled
physical processes and cannot be done in real-time. At least
for the foreseeable future it is unlikely that holographic systems will
be able to acquire, transmit, and display dynamic, natural scenes on
large displays.
Volumetric Displays Volumetric displays use a medium to fill
or scan a three-dimensional space and individually address and illuminate
small voxels [McKay et al. 2000; Favalora et al. 2001]. Actuality
Systems (www.actuality-systems.com) and Neos Technologies
(www.neostech.com) sell commercial systems for applications
such as air-traffic control or scientific visualization. However, volumetric
systems produce transparent images that do not provide a
fully convincing three-dimensional experience. Furthermore, they
cannot correctly reproduce the lightfield of a natural scene because
of their limited color reproduction and lack of occlusions. The design
of large-size volumetric displays also poses some difficult obstacles.
Akeley et al. [Akeley et al. 2004] developed an interesting fixed-viewpoint
volumetric display that maintains view-dependent effects
such as occlusion, specularity, and reflection. Their prototype uses
beam-splitters to emit light at focal planes at different physical distances.
Two such devices are needed for stereo viewing. Since
the head and viewing positions remain fixed, this prototype is not a
practical 3D display solution. However, it serves well as a platform
for vision research.
Parallax Displays Parallax displays emit spatially varying directional
light. Much of the early 3D display research focused on improvements
to Wheatstone’s stereoscope. In 1903, F. Ives used a
plate with vertical slits as a barrier over an image with alternating
strips of left-eye/right-eye images [Ives 1903]. The resulting device
is called a parallax stereogram. To extend the limited viewing angle
and restricted viewing position of stereograms, Kanolt [Kanolt
1918] and H. Ives [Ives 1928] used narrower slits and smaller pitch
between the alternating image stripes. These multiview images are
called parallax panoramagrams.
Stereograms and panoramagrams provide only horizontal parallax.
In 1908, Lippmann proposed using an array of spherical lenses instead
of slits [Lippmann 1908]. This is frequently called a “fly’s-eye”
lens sheet, and the resulting image is called an integral photograph.
An integral photograph is a true planar lightfield with directionally
varying radiance per pixel (lenslet).
Integral lens sheets can be put on top of high-resolution
LCDs [Nakajima et al. 2001]. Okano et al. [Javidi and Okano 2002,
Chapter 4] connect an HDTV camera with a fly’s-eye lens to a high-resolution
(1280×1024) LCD display. However, the resolution of
their integral image is limited to 62×55 pixels. To achieve higher
output resolution, Liao et al. [Liao et al. 2002] use a 3×3 projector
array to produce a small display with 2872×2150 pixels. Their
integral display with three views of horizontal and vertical parallax
has a resolution of 240×180 pixels.
Integral photographs sacrifice significant spatial resolution in both
dimensions to gain full parallax. Researchers in the 1930s introduced
the lenticular sheet, a linear array of narrow cylindrical
lenses called lenticules. This reduces the amount of image data by
giving up vertical parallax. Lenticular images found widespread
use for advertising, CD covers, and postcards [Okoshi 1976]. This
has led to improved manufacturing processes and the availability
of large, high-quality, and very inexpensive lenticular sheets.
To improve the native resolution of the display, H. Ives invented the
multi-projector lenticular display in 1931. He painted the back of
a lenticular sheet with diffuse paint and used it as a projection surface
for 39 slide projectors [Ives 1931]. Different arrangements of
lenticular sheets and multi-projector arrays can be found in [Okoshi
1976, Chapter 5]. Based on this description we implemented both
rear-projection and front-projection 3D display prototypes with a
linear array of 16 projectors and lenticular screens (see Section 3.4).
The high output resolution (1024×768), the large number of views
(16), and the large physical dimension (6 ft × 4 ft) of our display lead
to a very immersive 3D experience.
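As an idealized illustration of how a lenticular screen produces horizontal-only parallax, the model below maps a horizontal viewing angle to one of 16 view columns under a lenticule. The 30-degree viewing zone and the linear angle-to-view mapping are assumptions for illustration, not the measured optics of our screens (see Section 3.4).

```python
# Idealized lenticular view selection: within each lenticule's viewing zone, the
# horizontal viewing angle selects one of NUM_VIEWS image columns. The angular range
# and the linear mapping are assumptions, not the measured optics of our display.
import math

NUM_VIEWS = 16
FOV_DEG = 30.0   # assumed total horizontal viewing zone served by a lenticule

def view_index(angle_deg: float) -> int:
    """Map a horizontal viewing angle (0 = screen normal) to a view index."""
    a = max(-FOV_DEG / 2, min(FOV_DEG / 2, angle_deg))   # clamp to the viewing zone
    frac = (a + FOV_DEG / 2) / FOV_DEG                   # 0..1 across the zone
    return min(NUM_VIEWS - 1, int(frac * NUM_VIEWS))

# Two eyes ~6.5 cm apart at 3 m distance subtend roughly 1.2 degrees, so under this
# model they typically land on neighboring views and receive a stereoscopic pair.
eye_angle = math.degrees(2 * math.atan2(0.065 / 2, 3.0))
print(view_index(-eye_angle / 2), view_index(+eye_angle / 2))
```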
Other research in parallax displays includes time-multiplexed
(e.g., [Moore et al. 1996]) and tracking-based (e.g., [Perlin et al.
2000]) systems. In time-multiplexing, multiple views are projected
at different time instances using a sliding window or LCD shutter.
This inherently reduces the frame rate of the display and may lead
to noticeable flickering. Head-tracking designs are mostly used to
display stereo images, although they could also be used to introduce
some vertical parallax in multiview lenticular displays.
Today’s commercial autostereoscopic displays use variations of
parallax barriers or lenticular sheets placed on top of LCD or
plasma screens (www.stereo3d.com). Parallax barriers generally
reduce some of the brightness and sharpness of the image. The
highest resolution flat-panel screen available today is the IBM T221
LCD with about 9 million pixels. Our projector-based 3D display
currently has a native resolution of 12 million pixels. We believe
that new display media – such as organic LEDs or nanotube field-emission
displays (FEDs) – will bring flat-panel multiview 3D displays
within consumer reach in the foreseeable future.
2.5 Multi-Projector Displays
Scalable multi-projector display walls have recently become popular
[Li et al. 2002; Raskar et al. 1998]. These systems offer very
high resolution, flexibility, excellent cost-performance, scalability,
and large-format images. Graphics rendering for multi-projector
systems can be efficiently parallelized on clusters of PCs using, for
example, the Chromium API [Humphreys et al. 2002]. Projectors
also provide the necessary flexibility to adapt to non-planar display
geometries [Raskar et al. 1999].
Precise manual alignment of the projector array is tedious and becomes
downright impossible for more than a handful of projectors
or non-planar screens. Some systems use cameras in the loop to
automatically compute relative projector poses for automatic alignment
[Raskar et al. 1999; Li et al. 2002]. Liao et al. [Liao et al.
2002] use a digital camera mounted on a linear 2-axis stage in their
multi-projector integral display system. We use a static camera for
automatic image alignment and brightness adjustments of the projectors
(see Section 3.5).
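As a flavor of what such camera-in-the-loop alignment involves, the sketch below fits a planar homography between one projector's framebuffer and the static camera's image from a handful of feature correspondences, using OpenCV. The numeric correspondences are fabricated for illustration, and the actual calibration and intensity-adjustment procedure of our system is the subject of Section 3.5.

```python
# Sketch of camera-based projector alignment: detect where projected calibration
# features land in the static camera's image, then fit a planar homography that maps
# projector pixels into a common camera/screen frame. The point values are fabricated.
import numpy as np
import cv2

# (x, y) positions of calibration features in one projector's framebuffer...
projector_pts = np.float32([[100, 100], [900, 120], [880, 700], [120, 680]])
# ...and where the static camera observed those features on the screen.
camera_pts = np.float32([[412, 305], [1220, 330], [1195, 910], [430, 890]])

# Fit the 3x3 homography H with camera_pt ~ H * projector_pt (homogeneous coordinates).
H, _ = cv2.findHomography(projector_pts, camera_pts, 0)
print(H.round(3))

# Warping a projector image with H maps it into the camera frame; composing such maps
# for all projectors against a common reference gives their relative alignment.
warped = cv2.warpPerspective(np.zeros((768, 1024), np.uint8), H, (1600, 1200))
```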