27-08-2014, 12:01 PM
Relevance-Based Retrieval on Hidden-Web
Text Databases without Ranking Support
Relevance-Based.pdf (Size: 1.32 MB / Downloads: 16)
Abstract—
Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance,
PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user
would typically prefer a ranking by relevance, measured by an information retrieval (IR) ranking function. A naive approach would be to
submit a disjunctive query with all query keywords, retrieve all the returned matching documents, and then rerank them. Unfortunately,
such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper, we present
algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source
with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms
generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a
relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query
keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation
on the PubMed database and a TREC data set shows that we achieve an order-of-magnitude improvement over the current
baseline approaches.
Top-k Queries
A significant amount of work has been devoted to the
evaluation of top-k queries in databases. Ilyas et al. [13]
provide a survey of the research on top-k queries on
relational databases. This line of work typically handles the
aggregation of attribute values of objects in the case where
the attribute values lie in different sources [14], [15] or in a
single source [16]. Theobald et al. [17] describe a framework
for generating an approximate top-k answer, with some
probabilistic guarantees. In our work, we use the same idea;
the main and crucial difference is that we only have
“random access” to the underlying database (i.e., through
querying), and no “sorted access.” Theobald et al. assumed
that at least one source provides “sorted access” to the
underlying content.
Deep Web
Our work bears some similarities to the problem of searching
and extracting data from the Deep Web [18] databases. Meng
et al. [19], [20] examine the problem of estimating the number
of useful documents in the database, assuming that the
statistics about the frequency and the tf.idf weights of each
word in the database are given. In our work, we estimate such
statistics on-the-fly, as part of the explorative sampling
process. Ntoulas et al. [2] attempt to download the contents of
a Deep Web database by issuing queries through a web form
interface. The goal of Ntoulas et al. is to download and index
the contents of databases with limited query capabilities,
whereas in our case the focus is on achieving on-the-fly
ranking of query results, on top of sources with no (or
nonuseful) ranking capabilities. An alternative approach is to
characterize databases by extracting a small sample of
documents that is then used to describe the contents of the
database. For example, it is possible to use query-based sampling to obtain such a sample.
OVERVIEW OF APPROACH
As mentioned above, our approach is based on choosing the
best sequence q1, q2, ..., qv of Boolean queries to submit to
the data source, such that we retrieve the top-k ranked
documents for Q. Of course, to select the best sequence of
queries, we need to know some statistics about the type of
documents retrieved by each query qi. To get these statistics,
we sample the database through query-based sampling. Hence,
each query serves two purposes at once: it retrieves documents
that refine the necessary statistics and, at the same time, it
aims to retrieve documents that belong to the top-k most
relevant ones. We can therefore view our approach as an
instance of “exploration versus exploitation.”
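The loop described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `issue_query` and `score` are hypothetical stand-ins for a Boolean-only search interface and an IR similarity function, and the query-selection step (here a simple queue) is where the paper's statistics-driven choice would go.

```python
# Sketch of the explore/exploit query loop: each issued query both
# retrieves candidate documents and (in the full method) refines the
# statistics used to pick the next query.
def top_k_via_boolean_queries(candidate_queries, issue_query, score, k):
    """Iteratively issue conjunctive queries, keep the k highest-scoring
    documents seen so far.  `issue_query(q)` yields (doc_id, text) pairs;
    `score(text)` is the IR-style relevance function."""
    seen = {}                      # doc id -> relevance score
    remaining = list(candidate_queries)
    while remaining:
        q = remaining.pop(0)       # in the paper, chosen by estimated benefit
        for doc_id, text in issue_query(q):
            if doc_id not in seen:           # exploration: new docs refine stats
                seen[doc_id] = score(text)   # exploitation: may enter the top-k
    return sorted(seen, key=seen.get, reverse=True)[:k]
```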
Even though we can use any Boolean query in our
strategy, we only consider conjunctive Boolean queries as
candidates, given that a disjunctive query can be split into a set
of conjunctive queries. Conjunctive queries provide a good
query granularity and simplify the analysis below. Note that
in practice we add negation conditions to the issued
conjunctive queries in order to avoid retrieving the same
results multiple times. For instance, if Q = {a, b}, after
submitting q1 = a AND b, we submit q2 = a AND NOT b
instead of q2 = a.
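The splitting just described can be generalized: every nonempty subset of the query keywords yields one conjunctive query, with the absent keywords negated, so the resulting queries partition the results of the disjunction. A small sketch (the function name is ours, not the paper's):

```python
from itertools import combinations

def conjunctive_partition(keywords):
    """Split the disjunction of `keywords` into mutually exclusive
    conjunctive queries: each nonempty subset S of the keywords yields
    (AND of S) AND (NOT k for each keyword k outside S)."""
    kws = sorted(keywords)
    queries = []
    for size in range(len(kws), 0, -1):          # largest conjunctions first
        for subset in combinations(kws, size):
            negated = [k for k in kws if k not in subset]
            parts = list(subset) + [f"NOT {k}" for k in negated]
            queries.append(" AND ".join(parts))
    return queries
```

For Q = {a, b} this reproduces the example above: `a AND b`, then `a AND NOT b` and `b AND NOT a`.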
So, what are the goals of our querying strategy?
Following (1), we need to know the tf and df values for
the terms in the database, to estimate the similarity score of
a query to a document. Using these values, we can then
estimate the overall similarity score distribution for all the
documents in the database. Given the score distribution, we
can compute how many documents in the database have a
score higher than the documents that we have seen so far.
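A crude version of this count can be obtained by assuming the sampled scores are representative of the whole database; the paper's estimation is more refined, but the idea is the same (function name and smoothing are ours):

```python
def estimated_docs_above(sample_scores, threshold, db_size):
    """Estimate how many of the db_size documents in the database score
    above `threshold`, by scaling the fraction observed in a sample.
    Assumes the sample is representative (a simplification)."""
    if not sample_scores:
        return 0
    frac = sum(1 for s in sample_scores if s > threshold) / len(sample_scores)
    return round(frac * db_size)
```

When this estimate drops to zero for the score of the current k-th best document, no unseen document can outrank the results already retrieved, and querying can stop.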
The relatively easy part is the estimation of the df values.
We can estimate these values in two ways: 1) We can send n
queries to the database, one for each query term ti, and
compute the df value for each term. Note that the PubMed
eUtils, which we use in our experiments, have a method to
directly return the number of results (df) for a query. 2) We
can use estimates of the idf (inverse df) values by using some
other database with similar content (for example, using the
Google Web 1T 5-gram collection).
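Option 1) above can be sketched as follows. Here `result_count` is a hypothetical stand-in for a source API that returns the number of matching documents for a one-term query (such as the result-count method of PubMed eUtils mentioned above); the smoothing is our addition to avoid division by zero for unseen terms.

```python
import math

def estimate_idfs(terms, result_count, db_size):
    """Estimate idf values by issuing one query per term: the number of
    results for term t is its document frequency df(t), and
    idf(t) = log(db_size / (df(t) + 1)) with add-one smoothing."""
    idfs = {}
    for t in terms:
        df = result_count(t)               # one query to the source per term
        idfs[t] = math.log(db_size / (df + 1))
    return idfs
```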
CONCLUSIONS
We presented a framework and efficient algorithms to build
a ranking wrapper on top of a document data source that
only serves Boolean keyword queries. Our algorithm