30-07-2013, 02:14 PM
EFFICIENT MULTI-DIMENSIONAL FUZZY SEARCH FOR PERSONAL INFORMATION MANAGEMENT SYSTEMS
EFFICIENT MULTI-DIMENSIONAL.doc (Size: 52 KB / Downloads: 42)
Abstract
With the explosion in the amount of semi-structured data users access and store in personal information management systems, there is a critical need for powerful search tools to retrieve often very heterogeneous data in a simple and efficient way..We propose a novel multi-dimensional search approach that allows users to perform fuzzy searches for structure and metadata conditions in addition to keyword conditions. Our techniques individually score each dimension and integrate the three dimension scores into a meaningful unified score. We also design indexes and algorithms to efficiently identify the most relevant files that match multi-dimensional queries. We perform a thorough experimental evaluation of our approach and show that our relaxation and scoring framework for fuzzy query conditions in noncontent dimensions can significantly improve ranking accuracy. We also show that our query processing strategies perform and scale well, making our fuzzy search approach practical for every day usage.
EXISTING SYSTEM
filtering conditions. We adapt an existing algorithm called the Threshold Algorithm (TA) to drive query processing. TA uses a threshold condition to avoid evaluating all possible matches to a query, focusing instead on identifying the k best answers. It takes as input several sorted lists, each containing the system’s objects (files in our scenario) sorted in descending order according to their relevance scores for a particular attribute (dimension in our scenario), and This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Dynamically accesses the sorted lists until the threshold condition is met to find the k best answers.
DISADVANTAGE
Existing tools typically support some IR-style ranking on the textual part of the query, but only consider structure filtering conditions.
PROPOSED SYSTEM
Propose a novel approach that allows users to efficiently perform fuzzy searches across three different dimensions: content, metadata, and structure. We describe individual IDF-based scoring approaches for each dimension and present a unified scoring framework for multi-dimensional queries over personal information file systems. We also present new data structures and index construction optimizations to make finding and scoring fuzzy matches efficient. While our work could be extended to a variety of data space applications and queries, we focus on a file search scenario in this paper. That is, we consider the granularity of the search results to be a single file in the personal information system. Of course, our techniques could be extended to a more flexible query model where pieces of data within files could be returned as results.
ADVANTAGE
We incrementally build the query dependent DAG structures at query time, only materializing those DAG nodes necessary to answer a query.
To improve sorted access efficiency, we propose techniques to skip the scoring of unneeded DAG nodes by taking advantage of the containment property of the DAG.
MODULES DECRIPTION
Data set
We use a data set that contains files and directories from the working environment of one of the authors.
Scoring Content
Where Q is the content query condition, f is the file being scored, N is the total number of files, Nt is the number of files containing the term t, and Norm Length(f) is a normalizing factor that is a function of f’s length. 2 Note that relaxation is an integral part of the above formulas since they score all files that contain a subset of the terms in the query condition.
Scoring Metadata
We introduce a hierarchical relaxation approach for each type of searchable metadata to support scoring. A key characteristic of this hierarchical representation is containment; that is, the set of files matching a node must be equal to or subsume the set of files matching each of its children nodes. This ensures that the score of a file matching a more relaxed form of a query condition is always less than or equal to the score of a file matching a less relaxed
Scoring Structure
Most users use a hierarchical directory structure to organize their files. When searching for a particular file, a user may often remember some of the components of the containing directory path and their approximate ordering rather than the exact path itself. Thus, allowing for some approximation on structure query conditions is desirable because it allows users to leverage their partial memory to help the search engine locate the desired file.
CONCLUSIONS
We presented a scoring framework for multi-dimensional queries over personal information management systems. Specifically, we defined structure and metadata relaxations and proposed IDF-based scoring approaches for content, metadata, and structure query conditions. This uniformity of scoring allows individual dimension scores to be easily aggregated. We have also designed indexing structures and dynamic index construction and query processing optimizations
to support efficient evaluation of multi-dimensional queries.
We implemented and evaluated our scoring framework and query processing techniques. Our evaluation show that our multi-dimensional score aggregation technique preserves the properties of individual dimension scores and has the potential to significantly improve ranking accuracy. We also show that our indexes and optimizations are necessary to make multi-dimensional searches efficient enough for practical everyday usage. Our optimized query processing strategies exhibit good behavior across all dimensions, resulting in good overall query performance and scalability.