ABSTRACT:
We present a method for object detection in a multi-view 3D model. We use highly overlapping views, geometric data, and semantic surface classification to boost existing 2D algorithms. Specifically, a 3D model is computed from the overlapping views, and the model is segmented into semantic labels using height information, color, and planar qualities. A 2D detector is run on all images, and the detections are mapped into 3D via the model. The detections are clustered in 3D and represented by 3D boxes. Finally, the detections, visibility maps, and semantic labels are combined using a Support Vector Machine to achieve a more robust object detector.
1. INTRODUCTION
3D reconstruction is becoming increasingly common as computing power grows and new methods are developed. Standard graphics cards are now powerful enough to generate photorealistic images of complex scenes in real time. Typical input data consists of multiple images with large overlap, where the cameras' internal parameters and locations are either known or estimated with SfM methods. The model is reconstructed using multi-view geometry and may be represented by such means as polygons, a voxel space, or planar disks; it typically contains no high-level understanding or semantic interpretation.
A seemingly unrelated problem is the detection of objects in
2D images. Common object detection methods often use
sliding windows of different sizes, where each window is
tested for the existence of an object. Thus (Viola, 2001) used a cascade of Haar-feature classifiers of growing complexity, while (Lienhart, 2002) extended the feature set to include diagonal features. (Grabner, 2010) used online learning to reduce the labeling effort: only a few false positives are labeled at each stage. (Kluckner, 2007) used height information from stereo matching to automatically detect false positives, managing to learn a very good car detector from a very small initial training set.
While methods for patch-based detection and recognition are making progress through better descriptors and learning methods (Tuermer, 2011), others seek to use more information than just the pixels in a given patch. This includes a variety of context and scene-understanding cues used to boost object-detection performance; see the reviews in (Oliva, 2007; Divvala, 2009). Thus (Heitz, 2008) learns a few classes to describe the local scene and uses the statistical dependence between scene and objects to improve detection performance. (Hoiem, 2008) combines a method for planar approximation and object detection in a naive Bayes model, using maximum likelihood and belief propagation.
The use of 3D scene information to enhance object
detection has been made even more explicit (Rottensteiner,
2012). Thus (Posner, 2009) used 3D data from laser sensors
and color data to segment the image into semantic labels.
Each pixel is assigned a feature vector built from an HSV histogram, the 3D normal, and image coordinates, and a one-vs.-all classifier is trained for each class. The classification is expanded to patches that are spatially and temporally linked across a few images. (Douillard, 2007) combined visual features from color images and 3D geometric features from laser scans in a CRF for the task of multi-class labeling. Although nodes are linked temporally, only a few images with similar viewpoints can be considered. (Kluckner, 2009) uses features extracted from an input aerial image and a corresponding height map in a randomized forest to learn a per-pixel probability distribution over multiple classes, using a conditional random field to find a smooth labeling of the image classification. (Leberl, 2008) uses graph-based grouping on the 4-neighborhood grid of the image to link the best car detections and extract the street layer, which in turn is used to filter out cars when creating the orthophoto. Although overlapping images were used, the process was done independently for each image, followed by interpolation obtained by projecting the street layer onto the Digital Terrain Model (DTM).
In this paper we take this research direction a step further, starting from the reconstruction of a full 3D scene model from many overlapping views with large baselines. We then detect static objects in the model by using detections from all images and 3D semantic labeling simultaneously. More specifically, the camera location and orientation are calculated for each image using SLAM (Triggs, 1999), and then a dense 3D model is computed (Seitz, 2006; Goesele, 2006; Curless, 1996). A sliding-window detector is run on each image at 6 different rotations, and each image detection is translated into a 3D bounding box using the camera calibration and the 3D model. All 3D bounding boxes are clustered into a smaller set of representative 3D bounding boxes. This allows us to infer from many images while overcoming occlusions and greatly varying viewpoints. A multi-class semantic labeling of the model is performed using geometric information, local planes, and color information from all images. We show that using multiple overlapping viewpoints and context greatly improves the initial performance of the 2D detector.
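As a rough sketch, the data flow just described can be written as the following Python skeleton. Every function here is an illustrative stub standing in for a component named above (SLAM pose estimation, dense reconstruction, the 6-rotation sliding-window detector, back-projection, and clustering); none of it is the authors' code, and the semantic labeling and SVM fusion stages are omitted for brevity.

```python
# Illustrative pipeline skeleton; each step is a stub standing in for a
# component described in the text, not the authors' implementation.

def estimate_poses(images):            # SLAM / bundle adjustment
    return [{"R": None, "t": None} for _ in images]

def reconstruct_mesh(images, poses):   # dense multi-view reconstruction
    return {"vertices": [], "triangles": []}

def detect_2d(image, n_rotations=6):   # sliding-window detector, 6 rotations
    return [{"center": (0, 0), "angle": 0.0, "size": (1, 1), "weight": 1.0}]

def to_3d_box(detection, pose, mesh):  # back-project via camera pose + mesh
    return {"center": (0.0, 0.0, 0.0), "angle": 0.0,
            "size": (1.0, 1.0, 1.0), "weight": detection["weight"]}

def cluster_boxes(boxes):              # graph-cut clustering of 3D boxes
    return boxes                       # stub: identity clustering

def detect_objects(images):
    poses = estimate_poses(images)
    mesh = reconstruct_mesh(images, poses)
    boxes = [to_3d_box(d, pose, mesh)
             for image, pose in zip(images, poses)
             for d in detect_2d(image)]
    return cluster_boxes(boxes)
```

The skeleton only fixes the data flow between stages; each stub would be replaced by the corresponding real component.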
2. OUR METHOD: OBJECT DETECTION FROM
MULTIPLE VIEWS
Our method is described in Algorithm 1 below. The input is a set of overlapping images {I_i}. The 3D scene is reconstructed from those images in order to obtain an estimated location and orientation for each image, and a 3D model: a mesh (V, T) consisting of vertices and triangles, see Section 2.1. Next, we look for objects in the images, in our case cars, each defined by location, orientation, and size, b = (x, θ, s). In Section 2.2 we describe how for each image I_i a set of cars D_i is detected and assigned weights w using a cascade of weak classifiers. In Section 2.3 we describe how each detection d in image I_i is mapped to a 3D bounding box B, and how a 3D bounding box is projected onto an image.
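To make the notation concrete, the objects above can be represented with simple Python containers; the field names are our own, chosen for illustration, and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    """A 2D car detection b = (x, theta, s) in a single image I_i."""
    center: Tuple[float, float]   # location x, in image coordinates
    angle: float                  # orientation theta, in radians
    size: Tuple[float, float]     # size s: (width, height)
    weight: float                 # detector confidence w

@dataclass
class Mesh:
    """The reconstructed model (V, T): vertices and triangles."""
    vertices: List[Tuple[float, float, float]]  # V, points in 3D
    triangles: List[Tuple[int, int, int]]       # T, index triples into V
```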
2.4 Clustering in 3D
Each 3D object is visible in many images. The 2D detector may fail to detect it in some images but succeed in others. In order to collect data from different images and use them together, the detections are mapped into 3D. Using the calculated camera locations, a vertex v ∈ ℝ³ is projected onto the image plane with the projection matrix P_i, and conversely a pixel in image I_i defines a ray that intersects the 3D mesh at a point p. Thus a detection d in image I_i is projected to a 3D box B by projecting the center c of detection d onto the mesh at a point p, which is used as the center point of B. The 3D orientation is calculated by projecting the 2D orientation onto the x-y plane (z = 0) around p. B is always assumed to be parallel to the x-y plane, and is assigned weight w_B = w_d.
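The mapping of a detection center to the mesh can be sketched as a ray–triangle intersection. The Möller–Trumbore routine below is a standard technique, not taken from the paper, and the function names are our own; computing the pixel ray from the projection matrix is assumed done elsewhere.

```python
import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moeller-Trumbore ray-triangle test: distance t along the ray, or None."""
    e1, e2 = v1 - v0, v2 - v0
    h = np.cross(direction, e2)
    a = np.dot(e1, h)
    if abs(a) < eps:                 # ray parallel to the triangle plane
        return None
    f = 1.0 / a
    s = origin - v0
    u = f * np.dot(s, h)
    if u < 0.0 or u > 1.0:           # outside the triangle (barycentric u)
        return None
    q = np.cross(s, e1)
    v = f * np.dot(direction, q)
    if v < 0.0 or u + v > 1.0:       # outside the triangle (barycentric v)
        return None
    t = f * np.dot(e2, q)
    return t if t > eps else None

def backproject(origin, direction, vertices, triangles):
    """Closest intersection p of a pixel ray with the mesh (V, T), or None."""
    best = None
    for i0, i1, i2 in triangles:
        t = ray_triangle(origin, direction,
                         vertices[i0], vertices[i1], vertices[i2])
        if t is not None and (best is None or t < best):
            best = t
    return None if best is None else origin + best * direction
```

The point p returned by `backproject` is what the text uses as the center of the 3D box B.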
Next, we seek a subset of all 3D boxes from all images that best explains the 2D detections. We use the Jaccard index as a measure of similarity between 3D boxes:

J(B1, B2) = vol(B1 ∩ B2) / vol(B1 ∪ B2).

This can easily be calculated: since the detections are parallel to the x-y plane, it reduces to the area of the intersection and union of 2D rectangles. The z dimension is ignored when there is an overlap; otherwise J(B1, B2) = 0, to help prevent the distance function from diminishing too quickly.
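A minimal sketch of this similarity, assuming axis-aligned rectangles for simplicity (the paper's boxes may be oriented in the x-y plane, which would require polygon intersection instead):

```python
def jaccard(b1, b2):
    """Jaccard index of two boxes parallel to the x-y plane.

    Each box is (xmin, ymin, xmax, ymax, zmin, zmax). As in the text, the z
    dimension only gates the score: no z-overlap means J = 0; otherwise the
    index is the area ratio of the x-y footprints.
    """
    x1a, y1a, x2a, y2a, z1a, z2a = b1
    x1b, y1b, x2b, y2b, z1b, z2b = b2
    if min(z2a, z2b) <= max(z1a, z1b):          # no overlap in z
        return 0.0
    iw = min(x2a, x2b) - max(x1a, x1b)          # footprint intersection width
    ih = min(y2a, y2b) - max(y1a, y1b)          # footprint intersection height
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (x2a - x1a) * (y2a - y1a) + (x2b - x1b) * (y2b - y1b) - inter
    return inter / union
```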
We represent this problem as a multi-labeling problem. The goal is to choose for each B a representative that best describes it, while requiring that similar detections have the same representatives. We define a graph G = (N, E) whose nodes are the 3D detections, and edges are drawn between any intersecting detections:

N = {B},  E = {(B1, B2) | J(B1, B2) > 0}.

A labeling of the graph is a function f : {B} → {B}; it is a mapping from each 3D detection to a representing 3D box. We use graph cuts to minimize the total energy (Boykov and Kolmogorov, 2001):

E(f) = Σ_B w_B · (1 − J(f(B), B)) + Σ_{(B1,B2) ∈ E : f(B1) ≠ f(B2)} min(w_B1, w_B2) · J(B1, B2)
The first sum is the data-fidelity term: the cost of assigning B to f(B). The second sum is the smoothness term: the cost of assigning neighboring detections to different labels.
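For a given labeling f, this energy can be evaluated directly. The sketch below does so for axis-aligned boxes with a simplified x-y Jaccard; it is our own illustration of the objective, not the graph-cut optimizer itself, which would minimize it, e.g. via alpha-expansion.

```python
def jaccard_xy(b1, b2):
    # Simplified Jaccard on axis-aligned footprints (xmin, ymin, xmax, ymax).
    iw = min(b1[2], b2[2]) - max(b1[0], b2[0])
    ih = min(b1[3], b2[3]) - max(b1[1], b2[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(b1) + area(b2) - inter)

def energy(boxes, weights, labeling):
    """E(f) = sum_B w_B (1 - J(f(B), B))
            + sum over edges with f(B1) != f(B2) of min(w_B1, w_B2) J(B1, B2).

    labeling[i] is the index of the representative box assigned to box i.
    """
    # Data-fidelity term: cost of representing each box by its label.
    data = sum(w * (1.0 - jaccard_xy(boxes[labeling[i]], b))
               for i, (b, w) in enumerate(zip(boxes, weights)))
    # Smoothness term: intersecting boxes with different labels pay a penalty.
    smooth = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            J = jaccard_xy(boxes[i], boxes[j])
            if J > 0 and labeling[i] != labeling[j]:
                smooth += min(weights[i], weights[j]) * J
    return data + smooth
```

Assigning two identical overlapping boxes the same representative costs nothing, while splitting them pays the smoothness penalty, which is what drives the clustering.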
We use the image of the labeling function as the representative set: {R} ← {B | ∃B′ such that f(B′) = B}, and each label's weight is assigned the sum of the detection weights over its cluster:

w_R = Σ_{B : f(B) = R} w_B
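The representative set and its weights can be collected from the labeling in a few lines; the function and variable names below are our own, for illustration.

```python
from collections import defaultdict

def representatives(labeling, weights):
    """Collect the image of f and sum each cluster's detection weights.

    labeling: dict mapping each 3D box id to its representative's id, f(B).
    weights:  dict mapping each 3D box id to its detection weight w_B.
    Returns:  dict mapping each representative R to w_R, the cluster's summed weight.
    """
    w_rep = defaultdict(float)
    for box, rep in labeling.items():
        w_rep[rep] += weights[box]
    return dict(w_rep)
```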