13-07-2012, 10:18 AM
Authenticated Multistep Nearest Neighbor Search
Authenticated multi step nearest neighbor search.pdf (Size: 2.04 MB / Downloads: 37)
1 INTRODUCTION
LETDB be a D-dimensional data set. Each record P 2 DB
can be thought of as a point in the space defined by the
D attribute domains, and in the sequel we use the term
record and point interchangeably. Given a point Q, a
nearest neighbor (NN) query retrieves the record fP 2 DB:
DSTðQ; PÞ DSTðQ; P0Þ 8 P0 2 DBg, where DST(Q, P)
denotes the distance between Q and P. Likewise, a kNN
query returns the k closest points to Q. NN and kNN
queries are common in similarity retrieval. Specifically, since
similarity between records is inversely proportional to their
distance, a kNN query returns the k most similar records to
Q. The multistep framework [11], [19] has been proposed for
NN and similarity retrieval in domains that entail highdimensional
data (e.g., in time series, medical, image,
biological and document databases), expensive distance
functions (e.g., road network distance, dynamic time
warping), or a combination of both factors.
In this paper, we focus on authenticated multistep NN
search for applications that require a proof of result correctness.
For instance, [3] argues that the most cost-effective way for
medical facilities to maintain radiology images is to outsource
all image management tasks to specialized commercial
providers. Clients issue similarity queries to a provider.
The latter returns the result set and additional verification
information, based on which the client establishes that the
result is indeed correct, i.e., it contains exactly the records of
DB that satisfy the query conditions, and that these records
indeed originate from their legitimate data source (i.e., the
corresponding medical facility). A similar situation occurs
for data replication, i.e., when a data owner stores DB at
several servers. Clients issue their queries to the closest (in
terms of network latency) server, but they wish to be assured
that the result is the same as if the queries were sent to the
original source of DB. In other cases, correctness is
guaranteed by a trusted third party. For instance, notarization
services [20] have been proposed to safeguard against
tampering in document databases (the motivating example
being Enron). Authenticated query processing ensures the
client that the received result complies with the validatedDB.
Initially, we study the problem assuming that the entire
DB resides at a single server. Our first contribution is AMN,
an adaptation of a multistep algorithm that is optimal in
terms of DST computations. AMN requires transmissions of
false hits, i.e., records that are not in the result, but are
nevertheless necessary for its verification. In addition to the
network overhead, false hits impose a significant burden to
the client, which has to verify them. The second contribution,
C-AMN, alleviates this problem through an elaborate
scheme that reduces the size of false hits. Finally, we
consider a distributed setting, where the database is
horizontally partitioned over several servers. Our third
contribution, ID-AMN, incrementally retrieves data, gradually
eliminating servers that cannot contribute results.
The rest of the paper is organized as follows: Section 2
surveys related work. Section 3 presents the indexing
scheme and the query algorithms of AMN. Section 4
focuses on C-AMN and minimization of the false hits.
Section 5 deals with distributed servers and ID-AMN.
Section 6 contains an extensive experimental evaluation
with real data sets and Section 7 concludes the paper.
2 BACKGROUND
Section 2.1 describes multistep query processing. Section 2.2
overviews similarity search for high-dimensional data.
Section 2.3 surveys background on authenticated queries.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 641
. S. Papadopoulos, L. Wang, Y. Yang, and D. Papadias are with the
Department of Computer Science and Engineering, Hong Kong University
of Science and Technology, Clear Water Bay, Hong Kong.
E-mail: {stavros, lxwang, yini, dimitris}[at]cse.ust.hk.
. P. Karras is with the School of Computing, National University of
Singapore, Computing 1, #02-18, Computing Drive, Singapore 117417.
E-mail: karras[at]comp.nus.edu.sg.
Manuscript received 18 Sept. 2009; revised 12 Dec. 2009; accepted 9 Feb.
2010; published online 30 Aug. 2010.
Recommended for acceptance by D. Srivastava.
For information on obtaining reprints of this article, please send e-mail to:
tkde[at]computer.org, and reference IEEECS Log Number TKDE-2009-09-0665.
Digital Object Identifier no. 10.1109/TKDE.2010.157.
1041-4347/11/$26.00 2011 IEEE Published by the IEEE Computer Society
Section 2.4 focuses on the MR-Tree, which is used by the
proposed techniques.
2.1 Multistep NN Framework
The multistep NN framework is motivated by applications
that entail expensive distance computations. Specifically, let
DST(Q, P) be the actual distance between a query Q and a
data point P 2 DB. The applicability of the multistep framework
rests on the existence of a filter distance metric dst, which
is cheap to evaluate and satisfies the lower bounding property,
i.e., for every possible Q and P : dstðQ; PÞ DSTðQ; PÞ.
Multistep NN search was introduced in [11]. Subsequently,
Seidl and Kriegel [19] proposed the algorithm of Fig. 1, which
is optimal in terms ofDSTcomputations. In order to provide a
concrete context, our explanation focuses on road networks
[18], where DST and dst refer to the network and euclidean
distance, respectively. Compared to euclidean distance (dst),
network distance (DST) computations are significantly more
expensive because they entail shortest path algorithms in
large graphs. Moreover, the euclidean kNNscan be efficiently
retrieved using conventional NN search on a spatial index.
Assuming that DB is indexed by an R*-Tree [1], the
multistep kNN algorithm first retrieves the k euclidean NNs
of Q using an incremental algorithm (e.g., [7]). These points
are inserted into a result set RS, and their network (DST)
distances are computed. LetDSTmax be the network distance1
between Q and its current kth NN Pk. The next euclidean NN
P is then retrieved. As long as dstðQ; PÞ < DSTmax, the
algorithm computes DST(Q, P) and compares it against
DSTmax. If DSTðQ; PÞ < DSTmax, P is inserted into RS, the
previous Pk is expunged, andDSTmax is updated. The loop of
Lines 5-9 terminates when dstðQ; PÞ DSTmax; because of
the lower bounding property of the euclidean distance, any
point lying further in the euclidean space cannot be closer
than DSTmax in the network.
Independently of the application domain, the algorithm
of Fig. 1 performs the minimum number of DST computations.
Specifically, in addition to RS, the DST distances are
computed only for false hits, i.e., the set of points FH ¼ fP 2
DB-RS : dstðQ; PÞ DSTðQ; PkÞg, where Pk is the final kth
NN. The rest of the records are not accessed at all (if they
reside in pruned nodes of the R*-Tree), or they are
eliminated using their dst to Q.
2.2 High-Dimensional Similarity Search Using Multistep NN
Several applications including image, medical, time series,
and document databases involve high-dimensional data.
Similarity retrieval in these applications based on lowdimensional
indexes, such as the R*-Tree [1], is very
expensive due to the dimensionality curse [2]. Specifically,
even for moderate dimensionality (i.e., D ¼ 20) a sequential
scan that computes DST(Q, P) for every P 2 DB is usually
cheaper than conventional NN algorithms using the index.
Consequently, numerous specialized structures have been
proposed for exact [8] and approximate [22] kNN search in
high dimensions.
The GEMINI framework [6], [11] follows a different
approach, combining multistep search with a dimensionality
reduction technique that exhibits the lower bounding
property. Specifically, each record P 2 DB is mapped to
a low-dimensional representation p in d dimensions
(d D). The resulting d-dimensional data set db is indexed
by an R*-Tree, or any low-dimensional index. The query Q
is also transformed to a d-dimensional point q and
processed using a multistep method. For instance, in the
algorithm of Fig. 1, DST (resp. dst) computations involve
high (low) dimensional points. The index prunes most
nodes and records using the cheap, filter (dst) distances,2
whereas the expensive DST computations are necessary
only for the points in result RS and false hit set FH.
GEMINI is the most common approach for performing
similarity search over high-dimensional data, and especially
time series. Numerous dimensionality reduction
methods have been used extensively including Discrete
Fourier Transform (DFT), Singular Value Decomposition (SVD),
Discrete Wavelet Transform (DWT), Piecewise Linear Approximation
(PLA), Piecewise Aggregate Approximation (PAA),
Adaptive Piecewise Constant Approximation (APCA), and
Chebyshev Polynomials (CP). Their effectiveness is measured
by the number of records that they can prune using only the
low-dimensional representations (i.e., it is inversely proportional
to the cardinality of FH). Ding et al. [5] experimentally
compare various techniques, concluding that their
effectiveness depends on the data characteristics.
2.3 Authenticated Query Processing
In authenticated query processing, a server maintains a data
set DB signed by a trusted authority (e.g., the data owner, a
notarization service). The signature sig is usually based on a
public-key cryptosystem (e.g., RSA [16]). The server receives
and processes queries from clients. Each query returns a
result set RS DB that satisfies certain predicates. Moreover,
the client must be able to establish that RS is correct, i.e.,
that it contains all records of DB that satisfy the query
conditions, and that these records have not been modified by
the server or another entity. Since sig captures the entire DB
(including records not in the query result), in addition to RS,
the server returns a verification object (VO). Given the VO, the
client can verify RS based on sig and the signer’s public key.
VO generation at the server is usually performed using an
authenticated data structure (ADS). The most influential ADS is
642 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011
Algorithm MultistepNN(Q, k)
1. Retrieve the k NNs {P1, ..., Pk} of Q according to dst
2. RS = {P1, ..., Pk}, sorted according to DST
3. DSTmax = DST(Q, Pk) // the current kth NN DST
4. P = next NN of Q according to dst
5. While dst(Q, P) < DSTmax
6. If DST(Q, P) < DSTmax
7. Insert P into RS and remove previous kth NN
8. Update DSTmax over RS
9. P = next NN of Q according to dst
Fig. 1. Optimal multistep kNN processing.
1. To avoid tedious details in our discussion, we assume that all data
distances to the query point are different.
2. Note that in GEMINI DST and dst may be based on the same definition
(e.g., they may both be euclidean distances). In this case, dst is cheaper
because of the lower dimensionality.
the Merkle Hash Tree (MH-Tree) [15], a main-memory binary
tree, originally proposed for single record authentication.
Each leaf in the MH-Tree stores the digest of a record,
calculated using a one-way, collision-resistant hash function
hðÞ, such as SHA-1 [16]. An inner node stores a digest
computed on the concatenation of the digests in its children.
The trusted authority signs the root digest. Fig. 2 illustrates
an MH-Tree over 16 records. Assume that a client requests
record P6. When traversing the tree to locate P6, the server
produces a VO that contains the digests (shown in gray) of
the siblings of the visited nodes: VO ¼ ½½h25½½h5P6h20h30.
Tokens “[” and “]” signify the scope of a node. VO and sig are
transmitted to the client, which subsequently simulates a
reverse tree traversal. Specifically, from h5 and P6 it
computes h19, then h26 (using h19 and h20), h29 (using h25
and h26), and finally h31 (using h29 and h30). Due to the
collision resistance of the hash function, if P6 is modified,
then the digest h31 reconstructed by the client will not match
sig; hence, the verification will fail.
The MH-Tree constitutes the basis of a large number of
subsequent ADSs that support:
1. range queries on a single attribute [13],
2. multidimensional ranges [23],
3. continuous queries on data streams [24],
4. search in generalized directed acyclic graphs [14],
5. search over peer-to-peer networks [21],
6. XML queries [12], and
7. similarity-based document retrieval [17].
In the following, we focus on the MR-Tree [23], the state-ofthe-
art multidimensional ADS, which is utilized by our