01-08-2012, 03:46 PM
Answering Structured Queries on Unstructured Data
Abstract
There is a growing number of applications that require access to both structured and unstructured
data. Such collections of data have been referred to as dataspaces, and Dataspace Support
Platforms (DSSPs) were proposed to offer several services over dataspaces, including search and
query, source discovery and categorization, indexing and some forms of recovery. One of the key
services of a DSSP is to provide seamless querying on the structured and unstructured data.
Querying each kind of data in isolation has been the main subject of study for the fields of
databases and information retrieval. Recently the database community has studied the problem
of answering keyword queries on structured data such as relational data or XML data. The only
combination that has not been fully explored is answering structured queries on unstructured
data.
Introduction
Significant interest has arisen recently in combining techniques from data management and
information retrieval [1, 5]. This is due to the growing number of applications that require access to
both structured and unstructured data. Examples of such applications include data management
in enterprises and government agencies, management of personal information on the desktop, and
management of digital libraries and scientific data. Such collections of data have been referred to
as dataspaces [8], and Dataspace Support Platforms (DSSPs) were proposed to offer several services
over dataspaces, including search and query, source discovery and categorization, indexing and
some forms of recovery.
One of the key services of a DSSP is to provide seamless querying on the structured and
unstructured data. Querying each kind of data in isolation has been the main subject of study for
the fields of databases and information retrieval. Recently the database community has studied
the problem of answering keyword queries on structured data such as relational data or XML
data [10, 2, 4, 21, 11].
Motivation
Broadly, our techniques apply in any context in which a user is querying a structured data source,
while related unstructured sources also exist. The user may want the structured
query to be expanded to include the unstructured sources that have relevant information.
Our work is done in the context of the Semex Personal Information Management (PIM) System [6].
The goal of Semex is to offer easy access to all information on one’s desktop, with possible
extension to mobile devices, imported databases, and the Web. The various types of data on one’s
desktop, such as emails and contacts, LaTeX and BibTeX files, PDF files, Word documents and
PowerPoint presentations, and cached webpages, form the major data sources managed by Semex.
On the one hand, Semex extracts instances and associations from these files by analyzing the data
formats, and creates a database. For example, from LaTeX and BibTeX files, it extracts Paper,
Person, Conference, and Journal instances and authoredBy and publishedIn associations. On the other
hand, these files contain rich text, so Semex also treats them as unstructured data.
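To make the extraction step concrete, here is a minimal sketch of how one already-parsed BibTeX entry could be turned into instances and associations. This is not Semex's actual extractor: the function name `extract`, the dictionary layout of the entry, and the sample values are all assumptions made for illustration. Only the class names (Paper, Person, Conference, Journal) and association names (authoredBy, publishedIn) come from the text above.

```python
def extract(entry):
    """Turn one parsed BibTeX entry (a plain dict, hypothetical layout)
    into (class, value) instances and (source, label, target) associations."""
    title = entry["title"]
    instances = [("Paper", title)]
    associations = []

    # BibTeX separates multiple authors with the word "and".
    for name in entry.get("author", "").split(" and "):
        name = name.strip()
        if name:
            instances.append(("Person", name))
            associations.append((title, "authoredBy", name))

    # Assumption: 'booktitle' names a conference, 'journal' names a journal.
    for field_name, cls in (("booktitle", "Conference"), ("journal", "Journal")):
        venue = entry.get(field_name)
        if venue:
            instances.append((cls, venue))
            associations.append((title, "publishedIn", venue))

    return instances, associations


# Hypothetical sample entry, not taken from the paper.
sample_entry = {
    "title": "A Sample Paper",
    "author": "Ann Lee and Bob Ray",
    "booktitle": "SampleConf",
}
instances, associations = extract(sample_entry)
```

The same files would additionally be indexed as plain text, which is what makes them usable as an unstructured source.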
Our Contributions
In this paper, we study how to extract keywords from a structured query, such that searching for
those keywords in an unstructured data repository returns the most relevant answers. The goal is to
obtain reasonably precise answers even without domain knowledge, and to improve precision when
knowledge of the schema and the structured data is available.
As depicted in Figure 1, the key element in our solution is to construct a query graph that
captures the essence of the structured query, such as the object instances mentioned in the query,
the attributes of these instances, and the associations between these instances. With this query
graph, we can ignore syntactic aspects of the query, and distinguish the query elements that convey
different concepts. The keyword set is selected from the node and edge labels of the graph.
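The query graph idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the `QueryGraph` class, its methods, and the example query values ("Alice", "VLDB") are hypothetical, and the sketch simply returns every node and edge label as a candidate keyword, whereas the paper selects a keyword set from those labels by a strategy not described in this excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class QueryGraph:
    """Hypothetical query graph: nodes are instances or attribute values,
    edges are attributes or associations between them."""
    nodes: dict = field(default_factory=dict)   # node id -> label
    edges: list = field(default_factory=list)   # (source id, label, target id)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def keywords(self):
        """Return all node labels, then all edge labels, without duplicates."""
        seen = []
        labels = list(self.nodes.values()) + [label for _, label, _ in self.edges]
        for label in labels:
            if label not in seen:
                seen.append(label)
        return seen


# Hypothetical query: papers authored by a person named "Alice"
# and published in a conference named "VLDB".
g = QueryGraph()
g.add_node("p", "Paper")
g.add_node("a", "Person")
g.add_node("c", "Conference")
g.add_node("a.name", "Alice")
g.add_node("c.name", "VLDB")
g.add_edge("p", "authoredBy", "a")
g.add_edge("p", "publishedIn", "c")
g.add_edge("a", "name", "a.name")
g.add_edge("c", "name", "c.name")

candidates = g.keywords()
```

Because the graph records only instances, attributes, and associations, two syntactically different formulations of the same query would produce the same graph and therefore the same candidate keywords.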
Related Work
The database community has recently considered how to answer keyword queries on relational data [10,
2, 4] and on XML data [21, 11]. In this paper, we consider the reverse direction: answering
structured queries on unstructured data.
There are two bodies of research related to our work: the information-extraction approach and
the query-transformation approach. Most information-extraction work [9, 16, 17, 18, 13, 7, 3] uses
supervised learning, which is hard to scale to data in a large number of domains and hard to apply
when the query schema is unknown beforehand.
To the best of our knowledge, only one work, SCORE [15], considers transforming
structured queries into keyword search. SCORE extracts keywords from query results on structured
data and uses them to submit keyword queries that retrieve supplementary information. Our
approach extracts keywords from the query itself. It is generic in that we aim to provide reasonable
results even without the presence of structured data and domain knowledge; however, the technique
used in SCORE can serve as a supplement to our approach.