Clustering User Queries of a Search Engine
ABSTRACT
In order to increase retrieval precision, some new search engines
provide manually verified answers to Frequently Asked Queries
(FAQs). An underlying task is the identification of FAQs. This
paper describes our attempt to cluster similar queries according to
their contents as well as user logs. Our preliminary results show
that the resulting clusters provide useful information for FAQ
identification.
INTRODUCTION
The information explosion on the Internet has placed high
demands on search engines. Yet people are far from being
satisfied with the performance of the existing search engines,
which often return thousands of documents in response to a user
query. Many of the returned documents are irrelevant to the user's
need, and the precision of current search engines falls well below
users' expectations.
In order to find more precise answers to a query, a new generation
of search engines - or question answering systems - has appeared
on the Web (e.g. AskJeeves, http://www.askjeeves.com). Unlike
the traditional search engines that only use keywords to match
documents, this new generation of systems tries to “understand”
the user's question, and suggest some similar questions that other
people have often asked and for which the system has the correct
answers. In fact, the correct answers have been prepared or
checked by human editors in most cases.
Clustering Algorithm
Another issue is the choice of the clustering algorithm itself.
Many clustering algorithms are available; the main characteristics
that guide our choice are the following:
1) The algorithm should not require manual setting of the
resulting form of the clusters, e.g. the number of clusters.
It is unreasonable to determine these parameters manually
in advance.
2) Since we only want to find FAQs, the algorithm should
filter out those queries with low frequencies.
3) Since query logs usually are very large, the algorithm
should be capable of handling a large data set within
reasonable time and space constraints.
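These criteria point away from algorithms such as k-means, which require the number of clusters in advance. As a rough illustration only (not the paper's actual algorithm), the sketch below filters out low-frequency queries and lets the number of clusters emerge from a similarity threshold; the `overlap` measure, the thresholds, and the toy query log are all invented for the example:

```python
from collections import Counter

def cluster_queries(query_log, similarity, min_sim=0.5, min_freq=2):
    """Threshold-based clustering sketch meeting the three criteria."""
    # Criterion 2: discard low-frequency queries up front; only
    # frequent queries are FAQ candidates.
    freq = Counter(query_log)
    candidates = [q for q, n in freq.items() if n >= min_freq]

    clusters = []
    for q in candidates:
        # Criterion 1: a query joins the first cluster containing a
        # sufficiently similar member; otherwise it seeds a new cluster,
        # so the number of clusters emerges from the data.
        for cluster in clusters:
            if any(similarity(q, member) >= min_sim for member in cluster):
                cluster.append(q)
                break
        else:
            clusters.append([q])
    # Criterion 3: the frequency filter shrinks the data before any
    # pairwise comparison, keeping time and space manageable.
    return clusters

def overlap(a, b):
    """Jaccard word overlap, an illustrative stand-in similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

log = ["cheap flights", "cheap flights deals", "cheap flights",
       "python tutorial", "cheap flights deals", "python tutorial",
       "rare query"]
print(cluster_queries(log, overlap))
# [['cheap flights', 'cheap flights deals'], ['python tutorial']]
```

Note how "rare query" is dropped by the frequency filter before clustering even begins, directly serving the FAQ-finding goal.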
RELATED WORK ON SIMILARITY CALCULATIONS
The document clustering problem has been studied for a long time
in IR [11]. Traditional approaches use keywords extracted from
documents. If two documents share some keywords, then they are
thought to be similar to some extent. The more keywords two
documents share, and the more important those shared keywords
are, the higher their similarity. The same approach may
also apply to query clustering, as a query may also be represented
as a set of keywords in the same way as a document. However, it
is well known that clustering using keywords has some
drawbacks, due mainly to the fact that keywords and meanings do
not strictly correspond. The same keyword does not always
represent the same information need (e.g. the word “table” may
refer to a concept in data structure or to a piece of furniture); and
different keywords may refer to the same concept. Therefore, the
calculated similarity between two semantically similar queries
may be small, while two semantically unrelated queries may be
considered similar. This is particularly the case when queries are
short. In addition, traditional IR methods treat words such as
“where” and “who” as stopwords and discard them. For queries,
however, these words encode important information about the
user's need, particularly in new-generation
search engines such as AskJeeves. For example, with a
“who”-question, the user intends to find information about a
person.
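The "table" ambiguity and the role of question words can be made concrete with a small keyword-based cosine-similarity sketch; the stopword list and example queries here are invented for illustration, and interrogatives are deliberately kept as keywords:

```python
import math
from collections import Counter

# Illustrative stopword list; unlike document retrieval, interrogatives
# ("who", "where", ...) are kept because they signal the answer type.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "to", "for"}
QUESTION_WORDS = {"who", "what", "where", "when", "why", "how"}

def keywords(query):
    return [w for w in query.lower().split()
            if w not in STOPWORDS or w in QUESTION_WORDS]

def cosine(q1, q2):
    """Cosine similarity over keyword counts."""
    v1, v2 = Counter(keywords(q1)), Counter(keywords(q2))
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) \
         * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Same keyword, different meanings: the score is misleadingly non-zero.
print(cosine("table in data structure", "table of wooden furniture"))  # 0.333...
# Keeping interrogatives separates who- from where-questions.
print(cosine("who invented radio", "where invented radio"))  # 0.666...
```

The first pair illustrates the drawback: the shared word "table" inflates the similarity of two semantically unrelated queries, which is exactly why keywords alone are insufficient for short queries.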
Combination of Multiple Measures
Similarities based on query contents and user clicks represent two
different points of view. In general, content-based measures tend to
cluster queries with the same or similar terms. Feedback-based
measures tend to cluster queries related to the same or similar topics.
Since user information needs may be partially captured by both
query texts and relevant documents, we would like to define a
combined measure that takes advantage of both strategies.
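One simple way to realize such a combined measure is a linear interpolation of the two similarities; the sketch below assumes Jaccard overlap for both components and an illustrative weight `alpha`, which are stand-ins rather than the paper's actual formulas:

```python
def content_sim(q1, q2):
    """Word-overlap (Jaccard) similarity, a stand-in content measure."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def click_sim(clicks1, clicks2):
    """Fraction of clicked documents the two queries share."""
    if not (clicks1 and clicks2):
        return 0.0
    return len(clicks1 & clicks2) / len(clicks1 | clicks2)

def combined_sim(q1, q2, clicks1, clicks2, alpha=0.5):
    """Linear combination of the two viewpoints; alpha balances query
    content against click feedback and would be tuned empirically."""
    return alpha * content_sim(q1, q2) + (1 - alpha) * click_sim(clicks1, clicks2)

# No common words, but identical clicks: the combined score still
# recognizes the shared information need.
print(combined_sim("atomic bomb", "nuclear weapon",
                   {"doc1", "doc2"}, {"doc1", "doc2"}))  # 0.5
```

The example shows why the combination helps: a purely content-based measure would score these two queries at zero, while the click component captures that users treat them as the same topic.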
CONCLUSION
The new generation of search engines for precise question
answering requires the identification of FAQs, so that human
editors may prepare the correct answers to them. The
identification of FAQs is not an easy task; it requires a proper
estimation of query similarity. Given the different forms of
queries and user intentions, the similarity of queries cannot be
accurately estimated through an analysis of their contents alone
(i.e. via keywords). In this paper, we have suggested exploiting
user log information (or user document clicks) as a supplement. A
new clustering principle is proposed: if two queries give rise to
the same document clicks, they are similar. Our initial analysis of
the clustering results suggests that this clustering strategy can
effectively group similar queries together and provides useful
assistance to human editors in discovering new FAQs.