NEAREST NEIGHBOUR MATCHING USING K D-TREES
Abstract
Nearest neighbour matching is a problem that occurs in many fields of study. It
involves finding the point within a multi-dimensional data set that is "closest"
to a target point under an arbitrary distance metric. This problem arises in many
situations such as pattern matching and machine learning.
Several basic techniques for nearest neighbour matching are briefly examined,
along with the space and time complexity trade-offs each technique makes.
The k d-tree is discussed in greater detail, along with techniques for optimizing
query time; these include strategies for building the tree and informed search
algorithms.
The discussion of optimizing the tree centres on "splitting strategies," with
region compression also briefly covered. The splitting strategies discussed are
the simple strategy, splitting on the widest dimension and splitting on the most
deviant dimension.
The informed search algorithms discussed are the depth-first search, the hill-
climbing search, the best-first search and the A* search. These algorithms make
use of various heuristics which are affected by the construction of the tree.
After a rigorous empirical analysis, the depth-first search is shown to be the
most efficient search algorithm. Depending on the distribution of the data set,
either the simple splitting strategy or the most deviant dimension splitting
strategy produces desirable results.
Introduction
One of the fundamental problems in computational geometry, nearest neighbour
matching can be described as follows. Given a metric space, S, and a distance
function, d(p1, p2), describing the distance between two points, p1 and p2, find
the closest point within a set of data points to any given query point. That is,
given a query point, q ∈ S, and a set of data points, pi ∈ S, find the point pi
for which d(q, pi) is minimal.
This problem arises in many situations including pattern matching, machine learn-
ing, information retrieval, data mining and statistics. These applications involve
taking samples and representing them as vectors or Cartesian coordinates. For example, to
identify the speaker in an audio recording, one might perform spectral analysis on
the waveform and compare it to the frequency analyses of the voices of some possible
speakers.
Geometry
In a k-dimensional Euclidean space, Cartesian coordinates can be represented by k-
dimensional vectors, with the ith component of the vector representing the point's
position along the ith axis.
This can be proven to be a vector space, and thus all vector space operations may
be applied to these Cartesian coordinates, including addition, scalar
multiplication, the inner product and norms.
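As a concrete illustration of these operations (not taken from the paper; the function name is hypothetical), the Euclidean distance between two points is the norm of the difference of their coordinate vectors:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two k-dimensional points: the 2-norm
    of their component-wise difference."""
    diff = [a - b for a, b in zip(p, q)]        # vector subtraction
    return math.sqrt(sum(c * c for c in diff))  # norm of the difference

# Example: two points in 3-dimensional space.
print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # prints 5.0
```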
Normal Distribution
While directing a survey of Germany in the 1820s, Carl Friedrich Gauss discovered
a pattern in random errors: they tend to be distributed around a mean in a
bell-shaped curve. This distribution is appropriately named the Gaussian distribution
and is more commonly referred to as the normal distribution.
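For reference, the density of a normal distribution with mean μ and standard deviation σ is the familiar bell-shaped curve (a standard result, stated here for completeness, not derived in the paper):

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```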
Nearest Neighbour Matching
Given a set of points, A, and a target point, target, one is faced with the problem
of finding the closest point to target in A using a given distance metric. The most
intuitive solution to this problem, demonstrated in algorithm 3.1, is to check each
element of A in turn, comparing it to the best known match.
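Algorithm 3.1 itself is not reproduced in this excerpt; the following is a minimal Python sketch of such a linear scan (function and variable names are illustrative, not from the paper):

```python
import math

def nearest_neighbour(A, target, distance):
    """Linear scan: compare each element of A in turn against the best
    known match. Costs O(n) distance computations for n points."""
    best, best_dist = None, float("inf")
    for p in A:
        d = distance(p, target)
        if d < best_dist:           # closer than the best known match
            best, best_dist = p, d
    return best

# Example usage with a Euclidean distance metric.
euclid = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbour(points, (1.5, 1.5), euclid))  # prints (1.0, 1.0)
```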
Definition
Bentley [Ben75] proposed a solution to the nearest neighbour matching problem with
linear space complexity and sub-linear time complexity called the multidimensional
binary search tree, or “k d-tree” for short.
A k d-tree is closely analogous to the unidimensional binary search tree and shares
some of the same algorithms. At every level, a binary search tree partitions a set of
points around the point stored within the node. Similarly, the k d-tree partitions a
set of vectors with a hyper-plane that is orthogonal to one of the k axes and passes
through the vector stored within the node. The dimension represented by that axis is
known as the splitting dimension.
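As a rough sketch of this structure (class and function names are illustrative, not from Bentley's paper), a node stores a vector and its splitting dimension, and insertion descends the tree exactly as in a binary search tree, comparing one coordinate per level:

```python
class KDNode:
    """One node of a k d-tree: a stored vector, the splitting dimension,
    and the subtrees on either side of the splitting hyper-plane."""
    def __init__(self, point, split_dim, left=None, right=None):
        self.point = point          # vector stored within the node
        self.split_dim = split_dim  # axis the hyper-plane is orthogonal to
        self.left = left            # coordinate in split_dim is smaller
        self.right = right          # coordinate in split_dim is not smaller

def insert(node, point, k, depth=0):
    """Insert a point, comparing one coordinate per level as in a
    binary search tree."""
    if node is None:
        return KDNode(point, depth % k)
    if point[node.split_dim] < node.point[node.split_dim]:
        node.left = insert(node.left, point, k, depth + 1)
    else:
        node.right = insert(node.right, point, k, depth + 1)
    return node
```

Cycling through the k dimensions by depth, as insert does here, corresponds to the simple splitting strategy mentioned in the abstract.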
Constructing Sub-Optimal k d-Trees
Generating a k d-tree that produces optimal results for all queries is an NP-complete
optimization problem [FBF77]. Instead, a sub-optimal tree that is still "good" can be
built in polynomial time. A "good" tree is assumed to be balanced (its height is
logarithmic in the number of points) and to have smaller hyper-rectangles with sides
of approximately even lengths [Ben75].
To accomplish this using the method in this paper, all points in the tree must be
known ahead of time. A k d-tree could perhaps be balanced dynamically, in a manner
similar to AVL or red-black trees, but that problem is outside the scope of this
paper.
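To illustrate how such a tree can be built when all points are known ahead of time, the sketch below (reusing the hypothetical KDNode class from the earlier sketch) recursively splits the points at the median of the dimension with the widest spread, one of the splitting strategies mentioned in the abstract; the paper's exact procedure may differ:

```python
def build_kdtree(points, k):
    """Recursively build a balanced k d-tree: choose the dimension with
    the widest spread, split at its median, and recurse on each half."""
    if not points:
        return None
    # Widest-dimension splitting strategy: largest (max - min) spread.
    split_dim = max(range(k),
                    key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    pts = sorted(points, key=lambda p: p[split_dim])
    mid = len(pts) // 2                  # the median keeps the tree balanced
    return KDNode(pts[mid], split_dim,
                  left=build_kdtree(pts[:mid], k),
                  right=build_kdtree(pts[mid + 1:], k))

# Example: build a tree over five 2-dimensional points.
root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1)], k=2)
print(root.point, root.split_dim)  # prints (5, 4) 0
```

Splitting at the median guarantees the two subtrees differ in size by at most one point, which keeps the height of the tree logarithmic in the number of points.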