16-01-2013, 10:43 AM
Organizing User Search Histories
1Organizing User.pdf (Size: 660.51 KB / Downloads: 590)
Abstract
Users are increasingly pursuing complex task-oriented goals
on the Web, such as making travel arrangements, managing finances
or planning purchases. To this end, they usually break down the tasks
into a few co-dependent steps and issue multiple queries around these
steps repeatedly over long periods of time. To better support users in
their long-term information quests on the Web, search engines keep
track of their queries and clicks while searching online. In this paper, we
study the problem of organizing a user’s historical queries into groups
in a dynamic and automated fashion. Automatically identifying query
groups is helpful for a number of different search engine components
and applications, such as query suggestions, result ranking, query alterations,
sessionization, and collaborative search. In our approach, we
go beyond approaches that rely on textual similarity or time thresholds,
and we propose a more robust approach that leverages search query
logs. We experimentally study the performance of different techniques,
and showcase their potential, especially when combined together.
INTRODUCTION
AS the size and richness of information on the Web
grows, so does the variety and the complexity of
tasks that users try to accomplish online. Users are no
longer content with issuing simple navigational queries.
Various studies on query logs (e.g., Yahoo’s [1] and
AltaVista’s [2]) reveal that only about 20% of queries
are navigational. The rest are informational or transactional
in nature. This is because users now pursue much
broader informational and task-oriented goals such as
arranging for future travel, managing their finances, or
planning their purchase decisions. However, the primary
means of accessing information online is still through
keyword queries to a search engine. A complex task
such as travel arrangement has to be broken down into a
number of co-dependent steps over a period of time.
PRELIMINARIES
Goal. Our goal is to automatically organize a user’s
search history into query groups, each containing one
or more related queries and their corresponding clicks.
Each query group corresponds to an atomic information
need that may require a small number of queries and
clicks related to the same search goal. For example, in
the case of navigational queries, a query group may
involve as few as one query and one click (e.g., “cnn”
and www.cnn.com). For broader informational queries, a
query group may involve a few queries and clicks (e.g.,
Group 5 queries in Figure 2(b) are all about where to buy
Wii console and games). This definition of query groups
follows closely the definition of search goals given in [4].
QUERY RELEVANCE USING SEARCH LOGS
We now develop the machinery to define the query
relevance based on Web search logs. Our measure of
relevance is aimed at capturing two important properties
of relevant queries, namely: (1) queries that frequently
appear together as reformulations and (2) queries that
have induced the users to click on similar sets of pages.
We start our discussion by introducing three search
behavior graphs that capture the aforementioned properties.
Following that, we show how we can use these
graphs to compute query relevance and how we can
incorporate the clicks following a user’s query in order
to enhance our relevance metric.
Computing Query Relevance
Having introduced the search behavior graphs in the
previous section, we now compute the relevance between
two queries. More specifically, for a given user
query q, we compute a relevance vector using QFG,
where each entry corresponds to the relevance value of
each query qj 2 VQ to q.
The edges in QFG correspond to pairs of relevant
queries extracted from the query logs and the click
logs. However, it is not sufficiently effective to use the
pairwise relevance values directly expressed in QFG as
our query relevance scores. Let us consider a vector
rq, where each entry, rq(qj), is wf (q, qj ) if there exists
an edge from q to qj in QFG, and 0 otherwise. One
straightforward approach for computing the relevance
of qj to q is to use this rq(qj) value. However, although
this may work well in some cases, it will fail to capture
relevant queries that are not directly connected in QFG
(and thus rq(qj) = 0).
Using Search Logs
As discussed in Section 3, our query grouping algorithm
relies heavily on the use of search logs in two ways:
first, to construct the query fusion graph used in computing
query relevance, and, second, to expand the set
of queries considered when computing query relevance.
We start our experimental evaluation, by investigating
how we can make the most out of the search logs.
In our first experiment, we study how we should combine
the query graphs coming from the query reformulations
and the clicks within our query log.
RELATED WORK
While we are not aware of any previous work that
has the same objective of organizing user history into
query groups, there has been prior work in determining
whether two queries belong to the same search task.
In recent work, Jones and Klinkner [4] and Boldi et
al. [5] investigate the search-task identification problem.
More specifically, Jones and Klinkner [4] considered a
search session to consist of a number of tasks (missions),
and each task further consists of a number of sub-tasks
(goals). They trained a binary classifier with features
based on time, text, and query logs to determine whether
two queries belong to the same task. Boldi et al. [5]
employed similar features to construct a query flow
graph, where two queries linked by an edge were likely
to be part of the same search mission.
Our work differs from these prior works in the following
aspects. First, the query-log based features in [4],
[5] are extracted from co-occurrence statistics of query
pairs. In our work, we additionally consider query pairs
having common clicked URLs and we exploit both cooccurrence
and click information through a combined
query fusion graph. [4] will not be able to break ties
when an incoming query is considered relevant to two
existing query groups. Additionally, our approach does
not involve learning and thus does not require manual
labeling and re-training as more search data come in;
our Markov random walk approach essentially requires
maintaining an updated query fusion graph. Finally, our
goal is to provide users with useful query groups onthe-
fly while respecting existing query groups. On the
other hand, search task identification is mostly done at
server side with goals such as personalization, query
suggestions [5] etc.
CONCLUSION
The query reformulation and click graphs contain useful
information on user behavior when searching online.
In this paper, we show how such information can be
used effectively for the task of organizing user search
histories into query groups. More specifically, we propose
combining the two graphs into a query fusion
graph. We further show that our approach that is based
on probabilistic random walks over the query fusion
graph outperforms time-based and keyword similaritybased
approaches. We also find value in combining our
method with keyword similarity-based methods, especially
when there is insufficient usage information about
the queries. As future work, we intend to investigate the
usefulness of the knowledge gained from these query
groups in various applications such as providing query
suggestions and biasing the ranking of search results.