Kinsight: Localizing and Tracking Household Objects using Depth-Camera Sensors
Abstract
We solve the problem of localizing and tracking
household objects using a depth-camera sensor network. We
design and implement Kinsight that tracks household objects
indirectly – by tracking human figures, and detecting and recognizing
objects from human-object interactions. We devise two
novel algorithms: (1) Depth Sweep – that uses depth information
to efficiently extract objects from an image, and (2) Context
Oriented Object Recognition – that uses location history and
activity context along with an RGB image to recognize objects
at home. We thoroughly evaluate Kinsight’s performance with
a rich set of controlled experiments. We also deploy Kinsight
in real-world scenarios and show that it achieves an average
localization error of about 13 cm.
INTRODUCTION
We interact with a variety of objects at home every day. We
grab objects, interact with them, and place them somewhere
once we are done with them. Imagine the possibilities that
would open up if a system could keep track of all the
objects that we interact with in our daily lives. By knowing
what objects one is dealing with, we could infer what activity
that person is doing [13]. By keeping track of the locations of
the objects, we could build a smart search engine for our home
that could answer queries like: where are my eyeglasses,
my TV remote controller, or my wallet? To realize such
possibilities, as a first step, we build Kinsight, which detects
human-object interactions, recognizes objects, and keeps track
of the locations of the objects – using its keen sight.
HOUSEHOLD OBJECT LOCALIZATION
Localization of household objects by tracking is a special
case of the general tracking problem in which the objective is to find
a mapping between a set of objects and their corresponding
locations. It involves discovery of objects, obtaining their
3D locations, and updating their locations whenever they
are changed. Several distinguishing characteristics make this
problem different from the general tracking problem. These
observations give us the opportunity to apply various optimization
techniques to design and implement an efficient household
object tracking system. In this section, we describe these
observations, which form the basis of the system assumptions
in Kinsight. While most of these assumptions are readily
understandable from everyday experience, we conduct
experiments in two multi-person households using video
cameras and RFID tags in an attempt to quantify them. We
refer to these data in our discussion wherever relevant.
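The object-to-location mapping described above can be sketched as a simple data structure: each object carries a history of 3D sightings, and the most recent sighting is its current location. The names below (`Sighting`, `TrackedObject`) are illustrative, not from the paper:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Sighting:
    """One observed placement of an object: a 3D location plus a timestamp."""
    x: float
    y: float
    z: float
    time: datetime

@dataclass
class TrackedObject:
    """An object and its location history; the last entry is the current location."""
    object_id: str
    history: list = field(default_factory=list)

    def update_location(self, sighting: Sighting) -> None:
        # Append rather than overwrite, so location history is preserved
        # for later use (e.g., as recognition context).
        self.history.append(sighting)

    def current_location(self):
        return self.history[-1] if self.history else None
```

Keeping the full history, rather than only the latest position, is what later lets location context feed into object recognition.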
SYSTEM DESIGN
Kinsight consists of a network of depth-camera sensors.
Each node within the network has its own sensor, processing
unit and a local database. We assume that all nodes are
stationary. A master node acts as the coordinator: it
manages the central database and communicates with zero or
more slave nodes. Figure 3 shows the task architecture within
a node. Tasks performed at each node are divided into four
operational stages: sensing, motion event handling, real-time
processing and post processing. This section describes these
stages in brief.
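A minimal sketch of how the four stages might compose at a node. The motion test and the deferred work are stand-ins; the actual event handling and recognition logic are far richer:

```python
from collections import deque

def run_node(frames):
    """Sketch of one node's loop: sensing -> motion event handling ->
    real-time processing -> (queued) post-processing.
    Frames are opaque values here; inequality stands in for motion detection."""
    events = []           # motion events flagged in real time
    post_queue = deque()  # heavier work deferred to post-processing
    prev = None
    for frame in frames:                        # stage 1: sensing
        if prev is not None and frame != prev:  # stage 2: motion event handling
            events.append(frame)                # stage 3: real-time processing
            post_queue.append(frame)            # stage 4: defer to post-processing
        prev = frame
    return events, list(post_queue)
```

Splitting cheap real-time work from a post-processing queue is one way a node could keep up with 30 fps input while still running expensive recognition later.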
Sensing
Kinsight uses image, depth, and light intensity sensors. The
image and depth data are obtained using Kinect [30]. The
range limit of a Kinect sensor is approximately 11 feet. This
range can be increased or decreased using special lenses such
as [5]. A Kinect connects to a PC via USB, and multiple Kinects
can be connected to a single PC. The image and depth data
are read as two separate streams. The image stream has a
resolution of 640 × 480 and provides 32-bit colored images
at 30 fps. The depth stream has a resolution of 320 × 240
and provides a 16-bit depth value for each pixel. Kinect also
annotates the pixels that are part of a human body and tracks
up to 20 body joints. A light intensity sensor on a MICA2
sensor board (MTS310) provides ambient light readings.
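To make the depth stream concrete, here is a hedged sketch of subsetting a 320 × 240 depth frame by depth, in the spirit of examining only pixels near a human-object interaction point. The millimetre encoding of the 16-bit values and the threshold numbers are assumptions, not taken from the paper (some Kinect drivers also reserve the low bits for a player index):

```python
import numpy as np

# Synthetic 320x240 depth frame: 16-bit values, assumed to be millimetres.
depth_raw = np.full((240, 320), 3000, dtype=np.uint16)  # background at 3 m
depth_raw[100:140, 150:190] = 1200                      # an object at 1.2 m

interaction_depth_mm = 1250  # hypothetical depth of the interaction point
tolerance_mm = 100           # keep pixels within +/-10 cm of that depth

# Boolean mask of candidate pixels; everything else can be skipped.
mask = np.abs(depth_raw.astype(np.int32) - interaction_depth_mm) <= tolerance_mm
candidate_pixels = int(mask.sum())
```

Even in this toy frame the mask discards the vast majority of pixels, which is the efficiency argument for depth-based subsetting over whole-image search.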
EXPERIMENTS
We conduct a number of experiments to evaluate the
performance of Kinsight. These experiments are performed
using Kinect sensors connected to a laptop with a 2.3
GHz Intel Core i5 processor and 4 GB RAM. We label 48
household objects and 80 locations with numeric tags, and ask
the human subjects to move objects according to randomly
generated scripts. The list of items includes personal items
(e.g., phone, wallet, keys), stationery (e.g., pens, boxes, staplers),
utensils (e.g., cups, bottles, pots), toys (e.g., dolls, cars), and
entertainment items (e.g., Xbox, remote controller). A subset of these
items is shown in Figure 4. Location contexts of these items
are generated following the distribution as in Figure 1(b), and
activity contexts are generated by restricting the number of
objects for each session. This controls the variation of
individual variables so we can observe how each influences the system.
For each movement of an object, Kinsight classifies the object
and stores its identity and the location.
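Given the logged (estimated, ground-truth) location pairs from such a run, the reported average localization error can be computed as a mean Euclidean distance. This helper is illustrative, not the paper's evaluation code:

```python
import math

def mean_localization_error(estimates, ground_truth):
    """Mean Euclidean distance between estimated and true 3D locations,
    in the same unit as the inputs (e.g., metres)."""
    dists = [math.dist(e, g) for e, g in zip(estimates, ground_truth)]
    return sum(dists) / len(dists)
```

With positions in metres, an average error of 0.13 corresponds to the roughly 13 cm figure reported in the abstract.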
RELATED WORK
There are numerous works in computer vision research
pertaining to object detection, recognition and tracking. We
mention a few that are recent, and refer readers to [37] for
a survey. [12, 15, 16] detect objects by performing image
segmentation and contour detection. These are detectors for
general objects and do not use any contextual information. [21,
23] use context to improve object recognition accuracy, but the
context is either supplied during training or extracted from the image
itself. [24, 33] are model based approaches for tracking single
objects in real-time; [8] tracks multiple objects simultaneously,
but is not real-time. The differences between these works and
ours are that, ours is a more specialized system dealing with
only household objects, we learn object instances (not class),
and we go beyond images to use location and activity contexts.
Some recent works use depth data along with RGB image
for human pose estimation [30], illumination invariant tracking
[27], 3D mapping for mobile robots [18], and human
activity detection [31]. [19] describes a template matching
algorithm that uses depth to detect objects in images, but it
must search the whole image for a match, whereas Kinsight
selects only the subset of pixels that are close to the depth
of the human-object interaction point. [7] uses depth to enhance
object tracking. But unlike Kinsight, they track each object
separately.
Localization with RFID [20, 25, 28, 35] and WSN [10, 11,
17] technologies has been studied by many. There are also
a number of commercial products that use RFID or similar
technologies [1, 2]. But these systems are intrusive, i.e., they
require attaching a tag to every object we want to keep track of,
and the readers are expensive.
CONCLUSION
In this paper, we describe Kinsight, which uses a depth-camera
sensor network to detect and track household objects’
locations. Kinsight discovers objects from human-object interactions,
uses unsupervised learning techniques to recognize
objects from their appearance, location history and activity
contexts, and updates the locations of the objects. We evaluate
Kinsight in both controlled and uncontrolled environments to
quantify its sensitivity to a wide range of parameters and to
demonstrate its practicality. In real-world scenarios, Kinsight’s
average localization error is about 13 cm.