22-12-2012, 04:35 PM
Efficient Multidimensional Fuzzy Search for Personal Information Management Systems
1Efficient Multidimensional.pdf (Size: 1.96 MB / Downloads: 26)
Abstract
With the explosion in the amount of semistructured data users access and store in personal information management
systems, there is a critical need for powerful search tools to retrieve often very heterogeneous data in a simple and efficient way.
Existing tools typically support some IR-style ranking on the textual part of the query, but only consider structure (e.g., file directory)
and metadata (e.g., date, file type) as filtering conditions. We propose a novel multidimensional search approach that allows users to
perform fuzzy searches for structure and metadata conditions in addition to keyword conditions. Our techniques individually score each
dimension and integrate the three dimension scores into a meaningful unified score. We also design indexes and algorithms to
efficiently identify the most relevant files that match multidimensional queries. We perform a thorough experimental evaluation of our
approach and show that our relaxation and scoring framework for fuzzy query conditions in noncontent dimensions can significantly
improve ranking accuracy. We also show that our query processing strategies perform and scale well, making our fuzzy search
approach practical for every day usage.
INTRODUCTION
THE amount of data stored in personal information
management systems is rapidly increasing, following
the relentless growth in capacity and dropping price of
storage. This explosion of information is driving a critical
need for powerful search tools to access often very heterogeneous
data in a simple and efficient manner. Such tools
should provide both high-quality scoring mechanisms and
efficient query processing capabilities.
Numerous search tools have been developed to perform
keyword searches and locate personal information stored in
file systems, such as the commercial tools Google Desktop
Search (GDS) [18] and Spotlight [26]. However, these tools
usually support some form of ranking for the textual part of
the query—similar to what has been done in the Information
Retrieval (IR) community—but only consider structure (e.g.,
file directory) and metadata (e.g., date, file type) as filtering
conditions. Recently, the research community has turned its
focus on search over to Personal Information and Dataspaces
[10], [14], [16], which consist of heterogeneous data collections.
However, as is the case with commercial file system
search tools, these works focus on IR-style keyword queries
and use other system information only to guide the keywordbased
search.
UNIFIED MULTIDIMENSIONAL SCORING
In this section, we present our unified framework for
assigning scores to files based on how closely they match
query conditions within different query dimensions. We
distinguish three scoring dimensions: content for conditions
on the textual content of the files, metadata for conditions on
the system information related to the files, and structure for
conditions on the directory path to access the file.
We represent files and their associated metadata and
structure information as XML documents. We use a simplified
version of XQuery to express metadata and structure
conditions in addition to keyword-based content conditions.
Any file that is relevant to one or more of the query
conditions is then a potential answer for the query; it gets
assigned a score for each dimension based on how close it
matches the corresponding query condition. Scores across
multiple dimensions are unified into a single overall score for
ranking of answers.
Our scoring strategy is based on an IDF-based interpretation
of scores, as described below. For each query condition,
we score files based on the least relaxed form of the condition
that each file matches. Scoring along all dimensions is
uniformly IDF-based which permits us to meaningfully
aggregate multiple single-dimensional scores into a unified
multidimensional score.
Scoring Structure
Most users use a hierarchical directory structure to organize
their files. When searching for a particular file, a user may
often remember some of the components of the containing
directory path and their approximate ordering rather than the
exact path itself. Thus, allowing for some approximation on
structure query conditions is desirable because it allows users
to leverage their partial memory to help the search engine
locate the desired file.
Score Aggregation
We aggregate the above single-dimensional scores into a
unified multidimensional score to provide a unified ranking
of files relevant to a multidimensional query. To do this, we
construct a query vector, ~ VQ, having a value of 1 (exact
match) for each dimension and a file vector, ~ VF , consisting
of the single-dimensional scores of file F with respect to
query Q. (Scores for the content dimension is normalized
against the highest score for that query condition to get
values in the range ½0; 1.) We then compute the projection
of ~ VF onto ~ VQ and the length of the resulting vector is used
as the aggregated score of file F. In its current form, this is
simply a linear combination of the component scores with
equal weighting. The vector projection method, however,
provides a framework for future exploration of more
complex aggregations.
QUERY PROCESSING
We adapt an existing algorithm called the Threshold
Algorithm (TA) [15] to drive query processing. TA uses a
threshold condition to avoid evaluating all possible matches
to a query, focusing instead on identifying the k best answers.
It takes as input several sorted lists, each containing the
system’s objects (files in our scenario) sorted in descending
order according to their relevance scores for a particular
attribute (dimension in our scenario), and dynamically
accesses the sorted lists until the threshold condition is met
to find the k best answers.
Critically,TArelies on sorted and random accesses to retrieve
individual attribute scores. Sorted accesses, that is, accesses to
the sorted lists mentioned above, require the files to be
returned in descending order of their scores for a particular
dimension. Random accesses require the computation of a
score for a particular dimension for any given file. Random
accesses occur when TA chooses a file from a particular list
corresponding to some dimension, then needs the scores for
the file in all the other dimensions to compute its unified
score. To use TA in our scenario, our indexing structures and
algorithms need to support both sorted and random access
for each of the three dimensions.
Evaluating Metadata Scores
Sorted access for a metadata condition is implemented using
the appropriate relaxation DAG index. First, exact matches
are identified by identifying the deepest DAG node N that
matches the given metadata condition (see Section 2.2). Once
all exact matches have been retrieved from N’s leaf
descendants, approximate matches are produced by traversing
up the DAG to consider more approximate matches.
Each parent contains a larger range of values than its children,
which ensures that the matches are returned in decreasing
order of metadata scores. Similar to the content dimension,
we use the TA algorithm recursively to return files in sorted
order for queries that contain multiple metadata conditions.
Random accesses for a metadata condition require
locating in the appropriate DAG index the closest common
ancestor of the deepest node that matches the condition and
the deepest node that matches the file’s metadata attribute
(see Section 2.2). This is implemented as an efficient DAG
traversal algorithm.
Evaluating Structure Scores
The structure score of a file for a query condition depends on
how close the directory in which the file is stored matches the
condition. All files within the same directory will therefore
have the same structure score.
To compute the structure score of a file f in a directory d
that matches the (exact or relaxed) structure condition P of
a given query, we first have to identify all the directory
paths, including d, that match P. For the rest of the section,
we will use structure condition to refer to the original
condition of a particular query and query path to refer to a
possibly relaxed form of the original condition. We then
sum the number of files contained in all the directories
matching P to compute the structure score of these files for
the query using (6). The score computation step itself is
straightforward; the complexity resides in the directory
matching step. Node inversions complicate matching query
paths with directories, as all possible permutations have to
be considered. Specific techniques and their supporting
index structures need to be developed.
CONCLUSIONS
We presented a scoring framework for multidimensional
queries over personal information management systems.
Specifically, we defined structure and metadata relaxations
and proposed IDF-based scoring approaches for content,
metadata, and structure query conditions. This uniformity of
scoring allows individual dimension scores to be easily
aggregated. We have also designed indexing structures and
dynamic index construction and query processing optimizations
to support efficient evaluation of multidimensional
queries.
We implemented and evaluated our scoring framework
and query processing techniques. Our evaluation show that
our multidimensional score aggregation technique preserves
the properties of individual dimension scores and
has the potential to significantly improve ranking accuracy.
We also show that our indexes and optimizations are
necessary to make multidimensional searches efficient
enough for practical everyday usage. Our optimized query
processing strategies exhibit good behavior across all
dimensions, resulting in good overall query performance
and scalability.