14-09-2013, 12:53 PM
Answering General Time-Sensitive Queries
Abstract
Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far,
research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately,
topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that
we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in
conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for “recency”
queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically
identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly
integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of
news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examine
several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information in
the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries
compared to state-of-the-art retrieval techniques.
INTRODUCTION
TIME is an important dimension of relevance for a large
number of searches, such as over blogs and news
archives. So far, research on searching over such collections
has largely focused on retrieving topically similar documents
for a query. Unfortunately, ignoring or not fully exploiting
the time dimension can be detrimental for a large family of
queries for which we should consider not only the document
topical relevance but the publication time of the documents
as well, as demonstrated by the following example:
Example 1. Consider the query [Madrid bombing] over the news
archive of a state-of-the-art multidocument summarization
system that crawls and summarizes news articles from the web
on a daily basis. Fig. 1 zooms in on a portion of the histogram for
the query results, reporting the number of matching documents
in the news archive for each day between January and December
2004. This histogram reveals particular time intervals that are
likely to be of special interest for the query, such as the month of
March 2004, when a terrorist group bombed trains in Madrid.
The same figure shows an analogous histogram for query
[Google IPO]: the “peaks” in the histogram coincide with two
important events, namely, the announcement of the Google IPO
and, a few months later, the actual IPO.
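The histogram in Example 1 can be sketched in a few lines. This is only an illustration: the publication dates below are made up, and the helper names `daily_histogram` and `peak_days` are hypothetical, not part of the system described above.

```python
from collections import Counter
from datetime import date


def daily_histogram(pub_dates):
    """Count matching documents per publication day."""
    return Counter(pub_dates)


def peak_days(hist, top_n=3):
    """Return the top_n days with the most matching documents."""
    return [day for day, _count in hist.most_common(top_n)]


# Hypothetical publication dates of documents matching a query.
matches = [
    date(2004, 3, 11), date(2004, 3, 11),
    date(2004, 3, 12), date(2004, 3, 12), date(2004, 3, 12),
    date(2004, 7, 2),
]

hist = daily_histogram(matches)
busiest = peak_days(hist, top_n=1)
print(busiest)
```

Days whose counts stand well above the rest are candidates for the "peaks" the example refers to. Picking the top few days is the simplest possible rule and is an assumption of this sketch.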
TIME-SENSITIVE QUERIES
For recency queries [2], the bulk of the relevant documents is,
by definition, from recent days. For other families of queries,
the relevant documents may be distributed differently over
the time span of a news archive. For example, the query
[Madrid bombing] (Fig. 1) executed on a news archive might
be after articles about the specific details of the Madrid train
bombing at the time it happened, so this query might be
considered a past query. More generally, relevant results for
some queries may exist in certain time periods, in which
sudden, large-scale news coverage relevant to the queries
takes place and diminishes after a period of time. Other
queries, such as [Barack Obama], are likely to be after relevant
results from multiple “events.”
Estimation Using Binning
The previous technique, from Jones and Diaz [1], relies
heavily on the underlying retrieval model to estimate p(t|q).
The retrieval model not only suggests the top-k matching
documents as an approximation to the true relevant
documents, but also weights these documents based on
their relevance scores. These scores are, in turn, used to
determine the contribution of each top-k document to the
final temporal relevance value of the document’s publication
day. This direct dependency on the relevance scores for
estimating the p(t|q) values is somewhat problematic,
because these scores were designed for a different purpose,
namely, document ranking. Furthermore, the previous
technique is not conducive to exploring different “shapes”
of the p(t|q) probability distribution. Now, we suggest a
general framework to estimate p(t|q) that addresses these
issues, so that it is less dependent on the underlying
retrieval model by considering only the top-k matching
documents without using their relevance scores directly.
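To make the contrast concrete, the following sketch compares a score-weighted estimate of p(t|q) with one that uses only the publication days of the top-k documents. It is a simplified illustration under stated assumptions, not the binning procedure itself, whose details are not given in this excerpt. The function names, the day labels, and the scores are all hypothetical.

```python
from collections import defaultdict


def p_t_given_q_scored(top_k):
    """Score-weighted estimate: each document adds its relevance
    score to the mass of its publication day."""
    mass = defaultdict(float)
    for day, score in top_k:
        mass[day] += score
    total = sum(mass.values())
    return {day: m / total for day, m in mass.items()}


def p_t_given_q_counts(top_k):
    """Score-free estimate: each top-k document adds one unit
    to the mass of its publication day."""
    mass = defaultdict(float)
    for day, _score in top_k:
        mass[day] += 1.0
    total = sum(mass.values())
    return {day: m / total for day, m in mass.items()}


# Hypothetical top-k results as (publication day, relevance score).
top_k = [
    ("2004-03-11", 0.5),
    ("2004-03-11", 0.25),
    ("2004-03-12", 1.0),
    ("2004-07-02", 0.25),
]

scored = p_t_given_q_scored(top_k)
counted = p_t_given_q_counts(top_k)
print(scored)
print(counted)
```

With these made-up scores, the score-weighted estimate puts the most mass on 2004-03-12 because of a single high-scoring document, while the count-based estimate favors 2004-03-11, the day with the most top-k documents.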
Background: Answering “Recency” Queries
Sometimes queries issued over a news archive are after
recent events or breaking news, as we discussed in Section 1.
Li and Croft [2] developed a time-sensitive approach for
processing recency queries. Their approach processes a
recency query by computing traditional topic relevance
scores for each document, and then “boosting” the scores of
the most recent documents, to privilege recent articles over
older ones. Language models [9] have been used as a
successful approach to rank documents in a collection
according to their topic relevance for a query. To estimate
the relevance of a document d to a query q, p(d|q), the
conditional probability that d is topically relevant to q, is
computed. This retrieval model defines p(d|q) as being
proportional to p(d) · p(q|d), where p(d) is the prior
probability that d is relevant, and p(q|d) is the probability
that query q will be generated from document d. In the
original language models and in later modifications, the
prior p(d) is ignored since it is assumed to be uniform and
constant for all documents. For recency queries, Li and
Croft suggest modifying p(q|d) to combine two elements,
time relevance and topical relevance. Specifically, Li and
Croft define the prior p(d) of document d as a function of
the document creation date, so that recent documents are
given a greater prior value than older documents.
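As a rough sketch of this idea, the code below ranks documents by log p(q|d) + log p(d), where the prior p(d) decays with document age. The exponential form of the prior, the `rate` parameter, and the function names are assumptions made for illustration. The text above only states that the prior is a function of the document creation date.

```python
import math


def recency_log_prior(age_days, rate=0.01):
    """Log of an assumed exponential prior p(d) = rate * exp(-rate * age_days)."""
    return math.log(rate) - rate * age_days


def rank_with_recency(docs, rate=0.01):
    """Sort documents by log p(q|d) + log p(d), highest first.

    docs is a list of (doc_id, log_p_q_given_d, age_days) tuples.
    """
    scored = [
        (doc_id, log_pq + recency_log_prior(age, rate))
        for doc_id, log_pq, age in docs
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical documents: the older one is slightly more topically relevant.
docs = [("old", -4.0, 300), ("new", -4.5, 10)]

ranked = rank_with_recency(docs, rate=0.01)
print([doc_id for doc_id, _ in ranked])
```

With `rate=0.01` the recent document overtakes the older, slightly more topical one. With a much smaller rate the prior is nearly flat and the ranking falls back to topical relevance alone.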
RELATED WORK
Our approach expands Li and Croft’s [2] strategy to process
recency queries, which utilizes a language modeling
framework [9], [10], [13], [22]. Most language modeling
approaches assume that the prior probability p(d) that a
document is relevant to a query is constant. Li and Croft
modified the prior p(d) to reflect the fact that recently
published documents are more likely to be relevant to
recency queries. In our approach, which handles a broader
class of time-sensitive queries, including nonrecency
queries, it is not appropriate to modify the document prior
p(d), as we would have to introduce query-specific
information (i.e., the temporal characteristics of the query)
in the document prior probability p(d), which is assumed to
be query independent. In Section 5, we experimentally
compared our techniques against Li and Croft’s strategies,
which we dubbed QL-RECENCY and RM-RECENCY.
CONCLUSIONS AND FUTURE WORK
We presented a method for processing time-sensitive
queries over a news archive, with techniques for identifying
important time periods for a query. We presented an
extensive experimental evaluation, including TREC as well
as an archive of news articles, and showed that our
techniques improve the quality of search results, compared
to the existing state-of-the-art algorithms.
Our work demonstrates that integrating time in the
retrieval task can improve the quality of the retrieval
results, and motivates further research in the area.
Currently, we rely on the publication time of the documents
to locate time periods of interest. However, a document
published at a later date (e.g., a review article, summarizing
an event) may also be relevant; an interesting direction for
future research is to infer the temporal relevance of a
document by analyzing its contents [32], and not by relying
solely on its publication date. Another promising research
direction is to introduce time-based diversity in query
results by grouping the results into clusters of relevant time
ranges, enabling users to be aware of and interact with time
information when examining the query results. Along the
same lines, as future work, we are interested in integrating
our retrieval techniques with algorithms for query reformulation,
so that searchers are shown reformulations of
their queries that target specific time periods, as suggested
by Jones and Diaz for temporally ambiguous queries [1].