05-07-2014, 01:58 PM
A Probabilistic Scheme for Keyword-Based Incremental Query Construction
Abstract
Databases enable users to precisely express their informational needs using structured queries. However, database query
construction is a laborious and error-prone process, which cannot be performed well by most end users. Keyword search alleviates the
usability problem at the price of query expressiveness. As keyword search algorithms do not differentiate between the possible
informational needs represented by a keyword query, users may not receive adequate results. This paper presents IQP—a novel
approach to bridge the gap between usability of keyword search and expressiveness of database queries. IQP enables a user to start
with an arbitrary keyword query and incrementally refine it into a structured query through an interactive interface. The enabling
techniques of IQP include: 1) a probabilistic framework for incremental query construction; 2) a probabilistic model to assess the
possible informational needs represented by a keyword query; 3) an algorithm to obtain the optimal query construction process. This
paper presents the detailed design of IQP, and demonstrates its effectiveness and scalability through experiments over real-world data
and a user study.
INTRODUCTION
In this paper, we present IQP,
1 a novel system that aims
at bridging the gap between usability of keyword search
and expressiveness of database queries. Using IQP, a user
can benefit from both, a conventional ranking interface and
a more controllable query construction interface. The
former allows the user to immediately identify the most
common interpretation of her query. The latter enables the
user to clarify her search intent step by step, which is
especially helpful when the intended query interpretation
does not receive a good rank. For instance, if a user issues a
keyword query “London Age,” IQP would ask a number of
questions, such as “Is London a person?” or “Are you
looking for a city’s age?.” Based on the user’s responses,
IQP is able to accurately identify the structured query
representing the user’s informational need. Using IQP,
users are able to construct structured queries efficiently,
without necessarily knowing
Query Interpretation Generation
As creation of the entire interpretation space of a keyword
query is expensive, IQP uses a set of precomputed query
templates to accelerate this process. A query template is a
pattern typically used to issue structured queries.
Subquery Relationship
During query construction, IQP utilizes the subquery
relationships between the partial and complete query
interpretations to quickly reduce the interpretation space
of a keyword query
Query Construction Plan
As shown in Fig. 1, in each query construction step of IQP,
a user is presented with a list of query construction options,
i.e., partial interpretations. She is supposed to select the
option that correctly interprets her keywords.
To simplify our presentation, we assume that a user
decides on only one option at a time. If
ESTIMATING QUERY PROBABILITY
To support efficient query construction, it is important to
have an accurate assessment of the probability of whether a
structured query (or a query construction option) interprets
a user’s keyword query correctly. In this section, we
introduce a probabilistic model, which enables IQP to
compute these probabilities. We also discuss possible
statistics for supporting probability assessment.
QUERY CONSTRUCTION ALGORITHM
As mentioned in Section 3, given a keyword query, there
typically exist multiple possible query construction plans.
While every plan can allow a user to obtain the intended
structured query, these plans can differ in efficiency
significantly. A less efficient plan also means reduced
usability of IQP. To find an intended interpretation, a user
needs to traverse a branch of the QCP (as a binary tree). This
requires the QCP to be balanced. As mentioned earlier, a
special case of an unbalanced QCP is a simple ranked list of
all possible query interpretations [18], [32], [33], [37]. In
case, the intended interpretation does not appear within the
top-ranked items, the user has to examine all queries prior
to the intended one, which is tedious and error prone.
Instead, a more balanced QCP tree can efficiently prune the
search space and enable the user to obtain a desired query
in fewer steps. In this section, we propose an algorithm to
create a plan that imposes as little effort on the user as
possible, i.e., a minimum query construction plan
Data Sets and Keyword Queries
In our experiments, we used two real-world data sets: a
crawl of the Internet Movie Database (IMDB) and a crawl of
a lyrics database from the web. The IMDB data set contains
seven tables, such as movies, actors, and directors, with
more than 10,000,000 records. The Lyrics data set [22]
contains five tables, such as artists, albums, and songs, with
around 400,000 records. We deployed IQP on both data
sets. To generate templates, we used the automatic
approach presented in Section 3.2. We set the maximal
length of the join path to four, and obtained 74 templates for
IMDB and 16 templates for Lyrics
Effectiveness of the Probability Estimates
To estimate the effectiveness of the proposed query
probabilistic model, we used three variations of probability
estimates. The first variation, also called base line, assumes
that all structured queries and query construction options
are equally likely. The second variation, referred as (ATF,
Tequal), applied the probabilistic model introduced in
Section 4.1 (represented by (5) and (6)). It uses the Attribute
Term Frequency to estimate PðAi : kijT \AiÞ, but assumes
equal probabilities of query templates. The third version,
represented by (ATF, TLog), not only used ATF to estimate
PðAi : kijT \AiÞ, but also used the query log to estimate the
probabilities of the query templates.
Query Construction versus Query Ranking
Our second set of experiments aimed to compare the
interaction cost of query construction against that of query
ranking. We used two query ranking functions: the ranking
function of IQP and SQAK [32], one of the most recent
query ranking functions in the related work. Note that
query ranking approaches use different statistics from that
of result ranking approaches like [14], [22], as the ranking is
conducted without materializing query results.
We measure the interaction cost of query ranking as the
rank of the intended query interpretation in the ordered
Usability of Query Construction
To assess usability of IQP in real life, we conducted a user
study. The user study aimed to compare usability of two
user interfaces and to assess in which cases the IQP
interface can enable more efficient data access than the
query ranking interface. One interface is the IQP interface
shown in Fig. 1. The other is a query ranking interface
without using the query construction panel. For the query
ranking, we used the IQP ranking function with the
probability estimate (ATF, TEqual). For user convenience,
the query ranking interface presented structured queries in
several pages, each containing 20 queries.
Our users were 15 graduate students from the Computer
Sciences Department. We selected 14 tasks for the users to
perform. Each task required the user to retrieve certain
information from the IMDB data set. For each of the tasks,
we proposed a keyword query. Based on the rank of the
correct query interpretation, which ranged from 0 to 220, we
grouped the tasks into seven complexity categories: 0, 1, 2,
3, 4, 6, and 11. Each category contained two tasks. Category
k means that the correct query interpretation appears in the
kth page of the ranking interface. (0 means the correct
interpretation is within the top-10.) Table 1 provides an
overview over several example tasks.
CONCLUSION
We presented a conceptual framework for the incremental
query construction as well as a probabilistic model, which
enables consistent assessment of the probability of a query
interpretation. We presented an algorithm for generating
optimal query construction plans, which enables the user to
obtain the intended structured query with a minimal
number of interactions. Our experimental results and user
study show that IQP is highly helpful when user intended
structured queries cannot be found within the top-ranked
results. Moreover, our simulations confirm the scalability of
the proposed algorithms for the data sets with up to
100 tables.