29-05-2012, 11:01 AM
Adaptive Cluster Distance Bounding for High Dimensional Indexing
Adaptive Cluster Distance Bounding for High Dimensional Indexing.pdf (Size: 446.76 KB / Downloads: 28)
Abstract
We consider approaches for similarity search in
correlated, high-dimensional data-sets, which are derived within a
clustering framework. We note that indexing by “vector approximation”
(VA-File), which was proposed as a technique to combat
the “Curse of Dimensionality”, employs scalar quantization,
and hence necessarily ignores dependencies across dimensions,
which represents a source of suboptimality. Clustering, on the
other hand, exploits inter-dimensional correlations and is thus a
more compact representation of the data-set. However, existing
methods to prune irrelevant clusters are based on bounding
hyperspheres and/or bounding rectangles, whose lack of tightness
compromises their efficiency in exact nearest neighbor search. We
propose a new cluster-adaptive distance bound based on separating
hyperplane boundaries of Voronoi clusters to complement our
cluster based index. This bound enables efficient spatial filtering,
with a relatively small pre-processing storage overhead and is
applicable to Euclidean and Mahalanobis similarity measures.
Experiments in exact nearest-neighbor set retrieval, conducted
on real data-sets, show that our indexing method is scalable with
data-set size and data dimensionality and outperforms several
recently proposed indexes. Relative to the VA-File, over a wide
range of quantization resolutions, it reduces random IO accesses
by factors reaching 100X and more, given (roughly) the same
number of sequential IO operations.
Index Terms—Multimedia databases, indexing methods, similarity
measures, clustering, image databases.
I. INTRODUCTION
WITH developments in semiconductor technology and
powerful signal processing tools, there has been a
proliferation of personal digital media devices such as digital
cameras, music and digital video players. In parallel, storage
media (both magnetic and optical) have become cheaper to
manufacture and correspondingly their capacities have increased.
This has spawned new applications such as Multimedia
Information Systems, CAD/CAM, Geographical Information
systems (GIS), medical imaging, time-series analysis
(in stock markets and sensors), which periodically store large
amounts of data in databases and later retrieve it (see [1] for
more details). The size of these databases can range from the
relatively small (a few hundred GB) to the very large (several
hundred TB, or more). In the future, large organizations will have to
retrieve and process petabytes of data, for various purposes
such as data mining and decision support. Thus, there exist
numerous applications that access large multimedia databases,
which need to be effectively supported.
+ This work was supported in part by NSF, under grants IIS-0329267 and
CCF-0728986, the University of California MICRO program, Applied Signal
Technology Inc., Cisco Systems Inc., Dolby Laboratories Inc., Sony Ericsson
Inc., and Qualcomm Inc.
∗ Mayachitra Inc.; email: rsharadh[at]mayachitra.com
† University of California, Santa Barbara; email: rose[at]ece.ucsb.edu
Spatial queries, specifically nearest neighbor queries, in
high-dimensional spaces have been studied extensively. While
several analyses [2], [3] have concluded that nearest-neighbor
search, with the Euclidean distance metric, is impractical
at high dimensions due to the notorious “curse of dimensionality”,
others have suggested that this may be over-pessimistic.
Specifically, the authors of [4] have shown that what
determines the search performance (at least for R-tree-like
structures) is the intrinsic dimensionality of the data set and
not the dimensionality of the address space (or the embedding
dimensionality).
The typical (and often implicit) assumption in many previous
studies is that the data is uniformly distributed, with independent
attributes. Such data-sets have been shown to exhibit
the “curse of dimensionality” in that the distance between all pairs
of points (in high-dimensional spaces) converges to the same
value [2], [3]. Clearly, such data-sets are impossible to index, as
the nearest and furthest neighbors are indistinguishable. However,
real data sets overwhelmingly invalidate assumptions of
independence and/or uniform distributions; rather, they typically
are skewed and exhibit intrinsic (fractal) dimensionalities
that are much lower than their embedding dimension, e.g.,
due to subtle dependencies between attributes. Hence, real
data-sets are demonstrably indexable with Euclidean distances.
Whether the Euclidean distance is perceptually acceptable is
an entirely different matter, the consideration of which directed
significant research activity in content-based image retrieval
toward the Mahalanobis (or weighted Euclidean) distance (see
[5], [6]). Therefore, in this paper, we focus on real data-sets
and compare performance against state-of-the-art indexes
targeting real data-sets.
II. OUR CONTRIBUTIONS
In this section, we outline our approach to indexing
real high-dimensional data-sets. We focus on the clustering
paradigm for search and retrieval. The data-set is clustered,
so that clusters can be retrieved in decreasing order of their
probability of containing entries relevant to the query.
We note that the Vector Approximation (VA)-file technique
[7] implicitly assumes independence across dimensions, and
that each component is uniformly distributed. This is an
unrealistic assumption for real data-sets that typically exhibit
significant correlations across dimensions and non-uniform
distributions. To approach optimality, an indexing technique
must take these properties into account. We resort to a Voronoi
clustering framework as it can naturally exploit correlations
across dimensions (in fact, such clustering algorithms are the
method of choice in the design of vector quantizers [8]).
Moreover, we show how our clustering procedure can be
combined with any other generic clustering method of choice
(such as BIRCH [9]) requiring only one additional scan of
the data-set. Lastly, we note that the sequential scan is in fact
a special case of a clustering-based index, i.e., one with only a
single cluster.
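For concreteness, a minimal Python sketch of building such Voronoi clusters with Lloyd's k-means iteration follows; it is illustrative only (the function name and parameters are ours, not part of the original design), and any generic clustering method could be substituted, as noted above.

import numpy as np

def kmeans_voronoi(data, k, iters=50, seed=0):
    """Partition data (n x d) into k Voronoi cells around centroids."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Voronoi assignment: each point goes to its nearest centroid.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Re-estimate each centroid; keep the old one if a cell empties.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels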
A. A New Cluster Distance Bound
Crucial to the effectiveness of the clustering-based search
strategy is efficient bounding of query-cluster distances. This
is the mechanism that allows the elimination of irrelevant
clusters. Traditionally, this has been performed with bounding
spheres and rectangles. However, hyperspheres and hyperrectangles
are generally not optimal bounding surfaces for clusters
in high dimensional spaces. In fact, this is a phenomenon
observed in the SR-tree [10], where the authors used a
combination of spheres and rectangles to outperform indexes
using only bounding spheres (like the SS-tree [11]) or bounding
rectangles (R∗-tree [12]).
The premise herein is that, at high dimensions, considerable
improvement in efficiency can be achieved by relaxing
restrictions on the regularity of bounding surfaces (i.e., spheres
or rectangles). Specifically, by creating Voronoi clusters, with
piecewise-linear boundaries, we allow for more general convex
polygon structures that are able to efficiently bound the cluster
surface; constructing the Voronoi clusters under the Euclidean
distance measure makes this possible. By projecting the query
onto these hyperplane boundaries and complementing with the
cluster-hyperplane distance, we develop an appropriate lower
bound on the distance of a query to a cluster.
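A minimal sketch of such a bound, under the Euclidean metric and with names of our own choosing, is given below: for a query q lying on the far side of the perpendicular-bisector hyperplane H_ij between centroids c_i and c_j, every point of cluster i is at least d(q, H_ij) + d(X_i, H_ij) away, where d(X_i, H_ij) is a precomputed smallest point-to-hyperplane distance within the cluster.

import numpy as np

def hyperplane_lower_bound(q, i, centroids, cluster_hplane_dist):
    """Lower-bound the Euclidean distance from query q to cluster i.

    cluster_hplane_dist[i][j] holds min over x in cluster i of
    d(x, H_ij), computed once in preprocessing (illustrative layout).
    """
    bound = 0.0
    c_i = centroids[i]
    for j, c_j in enumerate(centroids):
        if j == i:
            continue
        n = c_i - c_j
        n /= np.linalg.norm(n)        # unit normal of the bisector H_ij
        m = 0.5 * (c_i + c_j)         # a point on H_ij
        s = float(n @ (q - m))        # signed distance of q to H_ij
        if s < 0:                     # q lies on the far side of H_ij
            bound = max(bound, -s + cluster_hplane_dist[i][j])
    return bound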
B. Adaptability to Weighted Euclidean or Mahalanobis Distances
While the Euclidean distance metric is popular within the
multimedia indexing community (see [7], [13], [14], [15]), it
is by no means the “correct” distance measure, in that it may
be a poor approximation of user perceived similarities. The
Mahalanobis distance measure has more degrees of freedom
than the Euclidean distance [6] and, with proper updating (e.g.,
through relevance feedback), has been found to be a much better
estimator of user perceptions (see [16], [17], [5] and, more
recently, [6]). We extend our distance bounding technique to
the Mahalanobis distance metric, and note large gains over
existing indexes.
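As a small illustration (the weight matrix below is hypothetical), the Mahalanobis distance is d_W(x, y) = sqrt((x - y)^T W (x - y)) with W symmetric positive definite; one standard way to reuse Euclidean machinery, noted here only as a sketch, is to factor W = L L^T and work in the rotated space.

import numpy as np

def mahalanobis(x, y, W):
    """d_W(x, y) = sqrt((x - y)^T W (x - y)); W = I gives Euclidean."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ W @ d))

# With W = L L^T (Cholesky), d_W(x, y) equals the Euclidean distance
# between L^T x and L^T y, so Euclidean bounds apply after rotation.
W = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical learned weights
L = np.linalg.cholesky(W)
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert abs(mahalanobis(x, y, W) - np.linalg.norm(L.T @ x - L.T @ y)) < 1e-9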
C. An Efficient Search Index
The data-set is partitioned into multiple Voronoi clusters
and, for any kNN query, the clusters are ranked in order of
their hyperplane bounds; in this way, the irrelevant clusters
are filtered out. We note that the sequential scan is a special
case of our index, with only one cluster. An
important feature of our search index is that we do not store the
hyperplane boundaries (which form the faces of the bounding
polygons), but rather generate them dynamically, from the
cluster centroids. The only storage required apart from the
centroids is that of the cluster-hyperplane boundary distances (or
the smallest cluster-hyperplane distance). Since our bound is relatively
tight, our search algorithm is effective in spatial filtering of
irrelevant clusters, resulting in significant performance gains.
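A minimal sketch of the resulting search loop follows, reusing the hyperplane_lower_bound sketch from Section II-A; load_cluster is a hypothetical callable that sequentially reads one cluster's (id, vector) pairs from disk.

import heapq
import numpy as np

def knn_search(q, k, centroids, cluster_hplane_dist, load_cluster):
    # Rank clusters by the (cheap) hyperplane lower bound.
    bounds = [hyperplane_lower_bound(q, i, centroids, cluster_hplane_dist)
              for i in range(len(centroids))]
    best = []  # max-heap via negated distances: (-d, point_id)
    for i in sorted(range(len(centroids)), key=bounds.__getitem__):
        # Stop once no unscanned cluster can contain a closer point.
        if len(best) == k and bounds[i] >= -best[0][0]:
            break
        for pid, x in load_cluster(i):  # one sequential read per cluster
            d = float(np.linalg.norm(q - x))
            if len(best) < k:
                heapq.heappush(best, (-d, pid))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, pid))
    # Return (distance, point_id) pairs in increasing distance order.
    return sorted((-nd, pid) for nd, pid in best)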
We expand on the results and techniques initially presented
in [18], with comparison against several recently proposed
indexing techniques.
III. PERFORMANCE MEASURE
The common performance metrics for exact nearest-neighbor
search have been the number of page accesses and the response
time. However, page accesses may involve both serial disk
accesses and random IOs, which have different costs. On the
other hand, response time (seek times and latencies) is tied
to the hardware being used and, therefore, the performance
gain/loss would be platform-dependent.
Hence, we propose a new two-dimensional performance metric,
where we separately count the number of sequential
accesses S and the number of random disk accesses R. We note
that, given the performance in terms of the proposed metric,
i.e., the (S, R) pair, it is possible to estimate the total number of
disk accesses or the response times on different hardware models.
The number of page accesses is evaluated as S + R. If T_seq
denotes the average time required for one sequential IO
and T_rand that for one random IO, the average response time
would be T_res = T_seq · S + T_rand · R.
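As a toy illustration of how an (S, R) operating point maps to platform-specific response times (the timing constants below are made up, not measured):

def response_time(S, R, T_seq, T_rand):
    """T_res = T_seq * S + T_rand * R."""
    return T_seq * S + T_rand * R

# E.g., assuming 0.1 ms per sequential IO and 8 ms per random IO:
print(response_time(S=2000, R=50, T_seq=0.1, T_rand=8.0))  # 600.0 (ms)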
[Fig. 1: Comparing performance graphs of (hypothetical) Index A and Index B; axes: number of serial pages (S) versus number of random IOs (R).]
1) Performance Curves for Index Comparison: Many indexes
have index-specific free parameters which, when tuned,
lead to different performance characteristics. For example, in the VA-File the
number of quantization levels per dimension can be varied,
leading to different performance characteristics. For other
indexes, like iDistance [15], LDC [19], as well as our own
technique, it would be the number of clusters. We vary the
tunable parameters of each index and obtain a performance
graph.
If the performance graph of indexing method A lies strictly
below that of another index B (see Figure 1), then method
A would be preferable. This is because, for every choice of
an operating point (S_B, R_B) for method B (or an equivalent
response time), we can find an operating point (S_A, R_A) for A
that has a smaller R and a smaller S and, hence, a lower response
time (see Footnote 1).
IV. RELATED WORK
Several index structures exist that facilitate search and
retrieval of multi-dimensional data. In low dimensional spaces,
recursive partitioning of the space with hyper-rectangles (R-trees
[20], R∗-trees [12]), hyper-spheres (SS-Tree [11]), or a
combination of hyper-spheres and hyper-rectangles (SR-Tree
[10]), have been found to be effective for nearest neighbor
search and retrieval. While the preceding methods specialize to
the Euclidean distance (l2 norm), M-trees [21] have been found to
be effective for spaces with arbitrary distance functions, provided
they are metrics.
Such multi-dimensional indexes work well in low dimensional
spaces, where they outperform sequential scan. But it
has been observed that performance degrades as the feature
dimensionality increases and, beyond a certain dimension threshold,
becomes inferior to sequential scan. In a celebrated result,
Weber et al. [7] have shown that whenever the dimensionality
is above 10, these methods are outperformed by a simple
sequential scan. Such performance degradation is attributed to
Bellman’s ‘curse of dimensionality’ [22], which refers to the
exponential growth of hyper-volume with dimensionality of
the space.
A. Vector Approximation Files
A popular and effective technique to overcome the curse
of dimensionality is the vector approximation file (VA-File)
[7]. The VA-File partitions the space into hyper-rectangular cells,
to obtain a quantized approximation of the data that reside
inside the cells. Non-empty cell locations are encoded into
bit strings and stored in a separate approximation file, on
the hard-disk. During a nearest neighbor search, the vector
approximation file is sequentially scanned and upper and lower
bounds on the distance from the query vector to each cell are
estimated. The bounds are used to prune irrelevant cells. The
final set of candidate vectors is then read from the hard
disk, and the exact nearest neighbors are determined. At this
point, we note that the terminology “Vector Approximation” is
somewhat confusing, since what is actually being performed is
scalar quantization, where each component of the feature vector
is separately and uniformly quantized (in contradistinction
with vector quantization in the signal compression literature).
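A bare-bones sketch of this scalar quantization and of the per-cell distance bounds follows (uniform, data-independent grid as described above; the function names and bit allocation are illustrative):

import numpy as np

def va_approximate(data, b):
    """Map each vector to a tuple of per-dimension cell indices."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    edges = (hi - lo) / (2 ** b)              # cell width per dimension
    cells = np.floor((data - lo) / np.where(edges > 0, edges, 1)).astype(int)
    return np.clip(cells, 0, 2 ** b - 1), lo, edges

def cell_distance_bounds(q, cell, lo, edges):
    """Lower/upper bounds on the Euclidean distance from q to a cell."""
    cell_lo = lo + cell * edges
    cell_hi = cell_lo + edges
    # Per-dimension distance to the nearest / farthest face of the cell.
    d_lo = np.maximum(np.maximum(cell_lo - q, q - cell_hi), 0)
    d_hi = np.maximum(np.abs(q - cell_lo), np.abs(q - cell_hi))
    return float(np.sqrt((d_lo ** 2).sum())), float(np.sqrt((d_hi ** 2).sum()))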
VA-File was followed by several more recent techniques to
overcome the curse of dimensionality. In the VA+-File [23],
the data-set is rotated into a set of uncorrelated dimensions,
with more approximation bits being provided for dimensions
with higher variance. The approximation cells are adaptively
spaced according to the data distribution. Methods such as
LDR [24] and the recently proposed non-linear approximations
[25] aim to outperform sequential scan by a combination of
clustering and dimensionality reduction. There also exist a few
hybrid methods, such as the A-Tree [13] and IQ-Tree [14],
which combine VA-style approximations within a tree-based
index.

Footnote 1: Alternatively, consider the tangent to the performance curve of index A, with parametric form T_seq · S + T_rand · R = T, for some specific hardware serial and random IO response times T_seq and T_rand. The distance of this tangent from the origin is proportional to the optimal response time achievable with A. Now, if the performance graph of A lies strictly below that of another index B, the tangent to A is closer to the origin than the tangent to B. Hence, A offers a faster response time than B.
B. Transformation to and Indexing in One Dimension
Other techniques, such as LDC [19], iDistance [15], and
Pyramid Tree [26], are based on local dimensionality reducing
transformations. The data-set is partitioned and, in each partition,
the distances of the resident vectors to some reference
point, typically the centroid, are evaluated. The feature vectors
in a partition are now indexed by their centroid-distance, using
ubiquitous one-dimensional indexes such as the B+-tree [1].
During query processing, spheres of gradually increasing radii
are drawn around the query, until they intersect a cluster
sphere. Then, the relevant elements in the partition, identified
by centroid-distances which lie in the intersecting region, are
retrieved for finer scrutiny. The search radius is set to a value
such that the exact NNs are returned.
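A simplified sketch of this one-dimensional transformation follows, with a sorted array standing in for the B+-tree (names are ours; a real implementation would use an actual B+-tree and iterate over increasing radii). It assumes a partitioning such as the labels and centroids produced earlier.

import bisect
import numpy as np

def build_index(data, labels, centroids):
    """Per partition: centroid-distance keys sorted alongside point ids."""
    index = {}
    for j in range(len(centroids)):
        ids = np.flatnonzero(labels == j)
        keys = np.linalg.norm(data[ids] - centroids[j], axis=1)
        order = np.argsort(keys)
        index[j] = (keys[order], ids[order])
    return index

def candidates(q, r, j, centroids, index):
    """Ids in partition j whose centroid-distance lies in the ring
    [d(q, c_j) - r, d(q, c_j) + r] cut out by the query sphere."""
    keys, ids = index[j]
    r_q = float(np.linalg.norm(q - centroids[j]))
    lo = bisect.bisect_left(keys, r_q - r)
    hi = bisect.bisect_right(keys, r_q + r)
    return ids[lo:hi]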
It is to be noted that coordinate hyperplanes translated to
the centroid divide the feature space into 2^d boxes, where d is
the space dimensionality. In LDC [19], another approximation
layer is created, by generating a box identification code for
each resident point. Once an initial set of candidates has been
identified with the B+-tree, the corresponding approximations
are scanned to further filter out irrelevant points within the
partition. The surviving elements are finally retrieved from
the hard drive to determine the nearest neighbors. In order to
reduce disk IO, care is taken to control the maximal fraction
of space searched, yet return the exact nearest neighbors.
C. Approximate Similarity Search
Lastly, it has been argued that the feature vectors and
distance functions are often only approximations of user
perception of similarity. Hence, even the results of an exact
similarity search are inevitably perceptually approximate, with
additional rounds of query refinement necessary. Conversely,
by performing an approximate search, for a small penalty in
accuracy, considerable savings in query processing time would
be possible. Examples of such search strategies are MMDR
[27], probabilistic searches (PAC-NN) [28], VA-LOW [29],
[30], [31], [32] and locality sensitive hashing [33]. The reader
is directed to [34] for a more detailed survey of approximate
similarity search. The limits of approximate indexing, i.e., the
optimal tradeoffs between search quality and search time, have
also been studied within an information-theoretic framework
[35].