Clustering User Queries of a Search Engine
ABSTRACT
In order to increase retrieval precision, some new search engines
provide manually verified answers to Frequently Asked Queries
(FAQs). An underlying task is the identification of FAQs. This
paper describes our attempt to cluster similar queries according to
their contents as well as user logs. Our preliminary results show
that the resulting clusters provide useful information for FAQ
identification.
INTRODUCTION
The information explosion on the Internet has placed high
demands on search engines. Yet people are far from being
satisfied with the performance of the existing search engines,
which often return thousands of documents in response to a user
query. Many of the returned documents are irrelevant to the user's
need, and the precision of current search engines falls well below
users' expectations.
In order to find more precise answers to a query, a new generation
of search engines - or question answering systems - has appeared
on the Web (e.g. AskJeeves, http://www.askjeeves.com). Unlike
the traditional search engines that only use keywords to match
documents, this new generation of systems tries to “understand”
the user's question, and suggest some similar questions that other
people have often asked and for which the system has the correct
answers. In fact, the correct answers have been prepared or
checked by human editors in most cases.
Clustering Algorithm
Another issue is the choice of the clustering algorithm itself.
Many clustering algorithms are available; the main characteristics
that guide our choice are the following:
1) The algorithm should not require manual setting of the
resulting form of the clusters, e.g. the number of clusters.
It is unreasonable to determine these parameters manually
in advance.
2) Since we only want to find FAQs, the algorithm should
filter out those queries with low frequencies.
3) Since query logs usually are very large, the algorithm
should be capable of handling a large data set within
reasonable time and space constraints.
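These criteria point away from algorithms such as k-means, which require the number of clusters in advance. As a rough illustration only (not the paper's actual algorithm), the sketch below filters out low-frequency queries and lets the number of clusters emerge from a similarity threshold; the `overlap` measure, the thresholds, and the toy query log are all invented for the example:

```python
from collections import Counter

def cluster_queries(query_log, similarity, min_sim=0.5, min_freq=2):
    """Threshold-based clustering sketch meeting the three criteria."""
    # Criterion 2: discard low-frequency queries up front; only
    # frequent queries are FAQ candidates.
    freq = Counter(query_log)
    candidates = [q for q, n in freq.items() if n >= min_freq]

    clusters = []
    for q in candidates:
        # Criterion 1: a query joins the first cluster containing a
        # sufficiently similar member; otherwise it seeds a new cluster,
        # so the number of clusters emerges from the data.
        for cluster in clusters:
            if any(similarity(q, member) >= min_sim for member in cluster):
                cluster.append(q)
                break
        else:
            clusters.append([q])
    # Criterion 3: the frequency filter shrinks the data before any
    # pairwise comparison, keeping time and space manageable.
    return clusters

def overlap(a, b):
    """Jaccard word overlap, an illustrative stand-in similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

log = ["cheap flights", "cheap flights deals", "cheap flights",
       "python tutorial", "cheap flights deals", "python tutorial",
       "rare query"]
print(cluster_queries(log, overlap))
# [['cheap flights', 'cheap flights deals'], ['python tutorial']]
```

Note how "rare query" is dropped by the frequency filter before clustering even begins, directly serving the FAQ-finding goal.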
RELATED WORK ON SIMILARITY CALCULATIONS
The document clustering problem has been studied for a long time
in IR [11]. Traditional approaches use keywords extracted from
documents. If two documents share some keywords, then they are
thought to be similar to some extent. The more keywords two
documents share, and the more important those shared keywords
are, the higher their similarity. The same approach may
also apply to query clustering, as a query may also be represented
as a set of keywords in the same way as a document. However, it
is well known that clustering using keywords has some
drawbacks, due mainly to the fact that keywords and meanings do
not strictly correspond. The same keyword does not always
represent the same information need (e.g. the word “table” may
refer to a concept in data structure or to a piece of furniture); and
different keywords may refer to the same concept. Therefore, the
calculated similarity between two semantically similar queries
may be small, while two semantically unrelated queries may be
considered similar. This is particularly the case when queries are
short. In addition, traditional IR methods treat words such as
“where” and “who” as stopwords and discard them. For queries,
however, these words encode important information about the
user's need, particularly in new-generation
search engines such as AskJeeves. For example, with a
“who”-question, the user intends to find information about a
person.
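The "table" ambiguity and the role of question words can be made concrete with a small keyword-based cosine-similarity sketch; the stopword list and example queries here are invented for illustration, and interrogatives are deliberately kept as keywords:

```python
import math
from collections import Counter

# Illustrative stopword list; unlike document retrieval, interrogatives
# ("who", "where", ...) are kept because they signal the answer type.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "to", "for"}
QUESTION_WORDS = {"who", "what", "where", "when", "why", "how"}

def keywords(query):
    return [w for w in query.lower().split()
            if w not in STOPWORDS or w in QUESTION_WORDS]

def cosine(q1, q2):
    """Cosine similarity over keyword counts."""
    v1, v2 = Counter(keywords(q1)), Counter(keywords(q2))
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) \
         * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Same keyword, different meanings: the score is misleadingly non-zero.
print(cosine("table in data structure", "table of wooden furniture"))  # 0.333...
# Keeping interrogatives separates who- from where-questions.
print(cosine("who invented radio", "where invented radio"))  # 0.666...
```

The first pair illustrates the drawback: the shared word "table" inflates the similarity of two semantically unrelated queries, which is exactly why keywords alone are insufficient for short queries.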
Combination of Multiple Measures
Similarities based on query contents and user clicks represent two
different points of view. In general, content-based measures tend to
cluster queries with the same or similar terms. Feedback-based
measures tend to cluster queries related to the same or similar topics.
Since user information needs may be partially captured by both
query texts and relevant documents, we would like to define a
combined measure that takes advantage of both strategies.
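One simple way to realize such a combined measure is a linear interpolation of the two similarities; the sketch below assumes Jaccard overlap for both components and an illustrative weight `alpha`, which are stand-ins rather than the paper's actual formulas:

```python
def content_sim(q1, q2):
    """Word-overlap (Jaccard) similarity, a stand-in content measure."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def click_sim(clicks1, clicks2):
    """Fraction of clicked documents the two queries share."""
    if not (clicks1 and clicks2):
        return 0.0
    return len(clicks1 & clicks2) / len(clicks1 | clicks2)

def combined_sim(q1, q2, clicks1, clicks2, alpha=0.5):
    """Linear combination of the two viewpoints; alpha balances query
    content against click feedback and would be tuned empirically."""
    return alpha * content_sim(q1, q2) + (1 - alpha) * click_sim(clicks1, clicks2)

# No common words, but identical clicks: the combined score still
# recognizes the shared information need.
print(combined_sim("atomic bomb", "nuclear weapon",
                   {"doc1", "doc2"}, {"doc1", "doc2"}))  # 0.5
```

The example shows why the combination helps: a purely content-based measure would score these two queries at zero, while the click component captures that users treat them as the same topic.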
CONCLUSION
The new generation of search engines for precise question
answering requires the identification of FAQs, so that human
editors may prepare the correct answers to them. The
identification of FAQs is not an easy task; it requires a proper
estimation of query similarity. Given the different forms of
queries and user intentions, the similarity of queries cannot be
accurately estimated through an analysis of their contents alone
(i.e. via keywords). In this paper, we have suggested exploiting
user log information (or user document clicks) as a supplement. A
new clustering principle is proposed: if two queries give rise to
the same document clicks, they are similar. Our initial analysis of
the clustering results suggests that this clustering strategy can
effectively group similar queries together and provides useful
assistance to human editors in discovering new FAQs.