01-05-2013, 03:15 PM
Authenticated Multi-Step Nearest Neighbor Search
Authenticated Multi.pdf (Size: 616.68 KB / Downloads: 12)
Abstract
Multi-step processing is commonly used for nearest neighbor (NN) and similarity search in applications involving high-dimensional
data and/or costly distance computations. Today, many such applications require a proof of result correctness. In this
setting, clients issue NN queries to a server that maintains a database signed by a trusted authority. The server returns the NN set
along with supplementary information that permits result verification using the dataset signature. An adaptation of the multi-step NN
algorithm incurs prohibitive network overhead due to the transmission of false hits, i.e., records that are not in the NN set, but are
nevertheless necessary for its verification. In order to alleviate this problem, we present a novel technique that reduces the size of
each false hit. Moreover, we generalize our solution for a distributed setting, where the database is horizontally partitioned over
several servers. Finally, we demonstrate the effectiveness of the proposed solutions with real datasets of various dimensionalities.
INTRODUCTION
Let DB be a D‐dimensional dataset. Each record P ∈ DB
can be thought of as a point in the space defined by the
D attribute domains, and in the sequel we use the term
record and point interchangeably. Given a point Q, a
nearest neighbor (NN) query retrieves the record {P∈DB :
DST(Q, P) ≤ DST(Q, Pʹ) ∀ Pʹ∈DB}, where DST(Q, P)
denotes the distance between Q and P. Likewise, a kNN
query returns the k closest points to Q. NN and kNN
queries are common in similarity retrieval. Specifically, since
similarity between records is inversely proportional to their
distance, a kNN query returns the k most similar records to
Q. The multi‐step framework [11], [19] has been proposed
for NN and similarity retrieval in domains that entail high-dimensional
data (e.g., in Time Series, Medical, Image,
Biological and Document Databases), expensive distance
functions (e.g., Road Network Distance, Dynamic Time
Warping), or a combination of both factors.
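As a baseline for what follows, a kNN query can be answered by brute force, computing the distance to every record. The sketch below is ours (all names illustrative); it is exactly the exhaustive approach that the multi-step framework avoids when DST is expensive:

```python
import heapq

def knn(Q, DB, DST, k):
    """Brute-force kNN: return the k records of DB closest to Q under
    DST, computing the distance to every single record."""
    return heapq.nsmallest(k, DB, key=lambda P: DST(Q, P))

# Toy usage with 2-D points and Euclidean distance:
euclid = lambda q, p: ((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2) ** 0.5
print(knn((0, 0), [(0, 0), (3, 4), (1, 1), (6, 8)], euclid, 2))
# → [(0, 0), (1, 1)]
```

`heapq.nsmallest` returns the records already sorted by ascending distance, which matches the kNN definition above.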
MULTI-STEP NN FRAMEWORK
The multi‐step NN framework is motivated by applications
that entail expensive distance computations. Specifically,
let DST(Q, P) be the actual distance between a query Q and
a data point P∈DB. The applicability of the multi‐step
framework rests on the existence of a filter distance metric
dst, which is cheap to evaluate and satisfies the lower
bounding property, i.e., for every possible Q and P: dst(Q, P)
≤ DST(Q, P). Multi‐step NN search was introduced in [11].
Subsequently, Seidl and Kriegel [19] proposed the
algorithm of Figure 1, which is optimal in terms of DST
computations. In order to provide a concrete context, our
explanation focuses on road networks [18], where DST and
dst refer to the network and Euclidean distance,
respectively. Compared to Euclidean distance (dst),
network distance (DST) computations are significantly
more expensive because they entail shortest path
algorithms in large graphs. Moreover, the Euclidean kNNs
can be efficiently retrieved using conventional NN search
on a spatial index.
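Under the lower-bounding assumption, the optimal algorithm of Figure 1 can be sketched as follows. Everything here is our simplification: the pre-sorted list stands in for an index that delivers candidates incrementally in ascending dst order, and instead of the road-network setting we pair a Euclidean filter with a Manhattan DST (Euclidean distance lower-bounds Manhattan distance):

```python
import heapq

def multistep_knn(Q, DB, dst, DST, k):
    """Multi-step kNN in the spirit of Seidl & Kriegel [19]: scan
    candidates in ascending filter distance dst, refine with the
    expensive DST, and stop once the lower bound exceeds the current
    k-th refined distance (illustrative sketch)."""
    ranking = sorted(DB, key=lambda P: dst(Q, P))  # stands in for incremental index ranking
    result = []  # max-heap via negated distances: (-DST(Q, P), P), size <= k
    for P in ranking:
        if len(result) == k and dst(Q, P) > -result[0][0]:
            break  # lower bound beyond the k-th actual distance: done
        d = DST(Q, P)  # expensive refinement step
        if len(result) < k:
            heapq.heappush(result, (-d, P))
        elif d < -result[0][0]:
            heapq.heapreplace(result, (-d, P))
    return sorted((-nd, P) for nd, P in result)

euclid = lambda q, p: ((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2) ** 0.5  # cheap filter dst
manhattan = lambda q, p: abs(q[0] - p[0]) + abs(q[1] - p[1])            # "expensive" DST
print(multistep_knn((0, 0), [(0, 0), (1, 2), (2, 1), (5, 5)], euclid, manhattan, 2))
# → [(0, (0, 0)), (3, (1, 2))]
```

Note how the last candidate, (5, 5), is pruned by its filter distance alone and never incurs a DST computation.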
HIGH-DIMENSIONAL SIMILARITY SEARCH USING MULTI-STEP NN
Several applications including Image, Medical, Time Series
and Document Databases involve high‐dimensional data.
Similarity retrieval in these applications based on low-dimensional
indexes, such as the R*‐Tree [1], is very
expensive due to the dimensionality curse [2]. Specifically,
even for moderate dimensionality (e.g., D = 20) a sequential
scan that computes DST(Q, P) for every P ∈ DB is usually
cheaper than conventional NN algorithms using the index.
Consequently, numerous specialized structures have been
proposed for exact [8] and approximate [22] kNN search in
high dimensions.
AUTHENTICATED QUERY PROCESSING
In authenticated query processing, a server maintains a
dataset DB signed by a trusted authority (e.g., the data
owner, a notarization service). The signature sig is usually
based on a public‐key cryptosystem (e.g., RSA [16]). The
server receives and processes queries from clients. Each
query returns a result set RS ⊆ DB that satisfies certain
predicates. Moreover, the client must be able to establish
that RS is correct, i.e., that it contains all records of DB that
satisfy the query conditions, and that these records have
not been modified by the server or another entity.
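A deliberately naive baseline makes the setting concrete: the owner signs a digest of the whole dataset, and the client can only verify an answer by downloading and re-hashing all of DB. The sketch below is ours, with an HMAC as a toy stand-in for the RSA signature of [16]; its cost is precisely what authenticated data structures such as the MR-Tree avoid, by letting the client verify RS without the full dataset:

```python
import hashlib, hmac

OWNER_KEY = b"owner-secret"  # toy stand-in for the owner's RSA key pair

def sign_dataset(DB):
    """Owner side: digest the (canonically ordered) dataset and 'sign' it."""
    digest = hashlib.sha256(repr(sorted(DB)).encode()).digest()
    return hmac.new(OWNER_KEY, digest, hashlib.sha256).digest()

def verify_dataset(DB_claimed, sig):
    """Client side: recompute the digest over the claimed dataset and
    check it against the owner's signature."""
    digest = hashlib.sha256(repr(sorted(DB_claimed)).encode()).digest()
    return hmac.compare_digest(
        hmac.new(OWNER_KEY, digest, hashlib.sha256).digest(), sig)
```

Any insertion, deletion, or modification by the server changes the digest and fails verification, but only at the cost of shipping the entire DB to the client.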
THE MR-TREE
The MR‐Tree [23] combines the concepts of the MH‐Tree
and the R*‐Tree [1]. A leaf node contains entries e_lf of the
form (pg_P, P), where P is an indexed point, and pg_P is a
pointer to the page accommodating the record of P. An
internal node N stores entries e_in of the form (pg_Nc, MBR_Nc,
h_Nc), where pg_Nc points to e_in's child node Nc. If N is at level 1
(the leaves being at level 0), MBR_Nc is the minimum
bounding rectangle (MBR) of the points in Nc, and h_Nc is a
digest computed on the concatenation of the binary
representations of the points in Nc.
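The digests can be sketched with a standard cryptographic hash. SHA-256 and the little-endian double encoding below are our assumptions, since the description only specifies hashing concatenated binary representations; the key property is that any tampering with a point propagates upward to the root digest:

```python
import hashlib, struct

def point_digest(points):
    """h_Nc for a level-1 entry: hash of the concatenated binary
    representations of the points in child node Nc (sketch)."""
    h = hashlib.sha256()
    for p in points:
        h.update(struct.pack("<%dd" % len(p), *p))  # pack coordinates as doubles
    return h.digest()

def node_digest(children):
    """Digest of a higher-level node: hash over each child's MBR and
    digest, so a change anywhere below alters every ancestor digest."""
    h = hashlib.sha256()
    for mbr, child_h in children:
        h.update(struct.pack("<%dd" % len(mbr), *mbr))
        h.update(child_h)
    return h.digest()
```

Computed bottom-up, this yields a Merkle-style root digest that the owner can sign once for the whole tree.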
AUTHENTICATED MULTI-STEP NN
Our work adopts the GEMINI framework because (i) it has
been proven effective in non‐authenticated similarity
retrieval, especially for numerous (i.e., D > 100) dimensions,
where even high‐dimensional indexes fail; (ii) it can be
extended to authenticated query processing based on a
low-dimensional ADS, i.e., the MR‐Tree, whereas currently
there are no authenticated high‐dimensional structures; (iii)
it is general, i.e., it can also be applied when the expensive
distance computations are due to the nature of the distance
definition (e.g., network distance), rather than the data
dimensionality (in which case D = d).
FALSE HIT REDUCTION ALGORITHM
Ideally, for each false hit P, ReduceFH should derive the
subset SP with the minimum length. Intuitively, this task is
at least as difficult as the Knapsack Problem: we need to
select a subset of items (the values of P that form SP), each assigned a
cost (communication overhead) and a weight (distance
DST(SQ, SP)), such that the sum of costs is minimized and
the sum of weights exceeds DSTmax. An additional
complication is that, when we select one item, the cost of
the rest changes (i.e., unlike knapsack, where the cost is
fixed).
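A simple greedy heuristic conveys the idea. This is our illustrative sketch, not the paper's ReduceFH, and it ignores the changing-cost complication just noted: for a Euclidean-style distance, the partial squared distance over any subset of dimensions lower-bounds the full squared distance, so transmitting the dimensions with the best contribution-to-cost ratio until the partial distance exceeds DSTmax certifies that P lies outside the result range:

```python
def reduce_false_hit(Q, P, costs, DSTmax):
    """Greedy false-hit reduction sketch: choose dimensions of P in
    decreasing ratio of squared-distance contribution to transmission
    cost, stopping once the partial squared Euclidean distance exceeds
    DSTmax**2 (which proves DST(Q, P) > DSTmax)."""
    contrib = [(Q[i] - P[i]) ** 2 for i in range(len(P))]
    order = sorted(range(len(P)), key=lambda i: contrib[i] / costs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += contrib[i]
        if total > DSTmax ** 2:
            return chosen  # indices of the values to transmit
    return chosen  # degenerate case: all of P is needed

print(reduce_false_hit((0, 0, 0), (3, 0, 4), (1, 1, 1), 3.5))
# → [2]
```

In the example, sending the single coordinate with the largest contribution (index 2, contributing 16 > 3.5²) already certifies the false hit, instead of transmitting all three values.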
DISTRIBUTED SERVERS
Next, we compare SD‐AMN and ID‐AMN considering that
the database is horizontally partitioned over m servers.
Recall that the methods first collect distance information,
based on which they determine the range that contains the
result. The NNs and the false hits are obtained during the
verification of this range, which is identical in SD‐AMN
and ID‐AMN. Thus, when measuring the communication
cost, we focus on their differences, which concern the
transmission of query points and the distance information.
The CPU overhead is based again on elementary distance
computations. Finally, due to the identical verification
process, the client cost is similar, and the corresponding
experiments are omitted.
CONCLUSIONS
The importance of authenticated query processing
increases with the amount of information available at
sources that are untrustworthy, unreliable, or simply
unfamiliar. This is the first work addressing authenticated
similarity retrieval from such sources using the multi‐step
kNN framework. We show that a direct integration of
optimal NN search with an authenticated data structure
incurs excessive communication overhead. Thus, we
develop C‐AMN, a technique that addresses the
communication‐specific aspects of NN, and minimizes the
transmission overhead and verification effort of the clients.
Furthermore, we propose ID‐AMN, which retrieves
distance information from distributed servers, eliminating
those that cannot contribute results.