29-06-2013, 03:02 PM
Personalizing Search Based on User Search Histories
Abstract
User profiles, descriptions of user interests, can be used by search engines to
provide personalized search results. Many approaches to creating user profiles
collect user information through proxy servers (to capture browsing histories)
or desktop bots (to capture activities on a personal computer). Both these techniques
require participation of the user to install the proxy server or the bot. In
this study, we explore the use of a less-invasive means of gathering user information
for personalized search. In particular, we build user profiles based on
activity at the search site itself and study the use of these profiles to provide
personalized search results. By implementing a wrapper around the Google
search engine, we were able to collect information about individual user search
activities. In particular, we collected the queries for which at least one search result
was examined, and the snippets (titles and summaries) for each examined
result.
Introduction
Motivation
Companies that provide marketing data report that search engines are used
more and more as referrals to web sites, rather than direct navigation via hyperlinks
[30]. As search engines perform a larger role in commercial applications,
the desire to increase their effectiveness grows. However, search engines
order their results based on the small amount of information available in the
user's queries and by web site popularity, rather than individual user interests.
Thus, all users see the same results for the same query, even if they have wildly
different interests and backgrounds. To address this issue, interest in personalized
search has grown in the last several years, and user profile construction is
an important component of any personalization system.
Overview
Our approach builds user profiles based on the user's interactions with a particular
search engine. Among all search engines available, we decided to adopt
Google [13] for the following reasons:
• it maintains one of the biggest collections of web pages;
• it provides a special API (the Google APIs [12]) that allows users to write
programs that submit queries to Google using a web service based on
the SOAP protocol [35]. The results retrieved are returned in a structured
XML file that can be easily processed;
• it is very popular, so users feel comfortable using it via a new interface
rather than relying on a completely different search engine altogether.
For our system, we implemented GoogleWrapper: a wrapper around the Google
search engine [13] that logs the queries, search results, and clicks on a per user
basis. This information was then used to create user profiles, and these profiles
were used in a controlled study to determine their effectiveness for providing
personalized search results. In order to capture unbiased data, Google's results
were randomized before presentation to the user.
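
The logging and randomization steps described above can be sketched roughly as follows. This is a minimal illustration, not the GoogleWrapper implementation: the names `randomize_results` and `PerUserLog`, and the dictionary layout of a result (a `title` and a `summary`), are assumptions made for the example.

```python
import random


def randomize_results(results, rng=None):
    """Return a shuffled copy of the results so that rank does not bias clicks.

    The input list is left untouched. `rng` is an optional random.Random
    instance (hypothetical parameter, useful for reproducible experiments).
    """
    rng = rng or random.Random()
    shuffled = list(results)
    rng.shuffle(shuffled)
    return shuffled


class PerUserLog:
    """Hypothetical in-memory log of queries, shown results, and clicks per user."""

    def __init__(self):
        self._by_user = {}

    def log_query(self, user_id, query, shown_results):
        # One entry per submitted query, keyed by the user's ID.
        entry = {"query": query, "shown": list(shown_results), "clicked": []}
        self._by_user.setdefault(user_id, []).append(entry)
        return entry

    def log_click(self, entry, result):
        # Record the snippet (title and summary) of an examined result.
        entry["clicked"].append(result)

    def examined_queries(self, user_id):
        # Keep only queries for which at least one result was examined.
        return [e for e in self._by_user.get(user_id, []) if e["clicked"]]
```

Only the entries returned by `examined_queries` would feed profile construction in this sketch, mirroring the abstract's statement that queries with at least one examined result were collected.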
Privacy
In general, in order to provide personalized search, the system needs some
information from which to build a profile, whether it be allocated by the server
or by a client-side bot. A commercial server-side approach could store just the
profile rather than the raw data. However, since we needed to run and evaluate a
variety of algorithms, we stored the raw data for the duration of the experiment. This
raises two privacy issues: first, how securely the data was protected from
hacking, and second, whether users want to share their data at all.
To address the first issue, users were identified using an alphanumeric ID
stored in a cookie. No data on personal identity was exchanged except during
the initial registration process. This information was stored separately in order
to reset a cookie in case it was lost. The log files were stored in a directory that
was not world-accessible. The log with queries and snippets was separated
from the file maintaining the identities of users. The mapping between the two
files was created by means of IDs.
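
The separation of identity data from activity data described above might look roughly like this. It is a sketch under stated assumptions: the class `SeparatedStores`, the 16-character ID length, and the use of an e-mail address as the registration detail are all hypothetical, since the text does not say what was collected at registration or how IDs were generated.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_user_id(length=16):
    """Generate a random alphanumeric ID (the length is an assumption)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class SeparatedStores:
    """Two stores that share nothing except the user ID."""

    def __init__(self):
        self.identities = {}  # user_id -> registration details
        self.activity = {}    # user_id -> logged queries and snippets

    def register(self, email):
        # The returned ID is the value that would be placed in the cookie.
        user_id = new_user_id()
        self.identities[user_id] = {"email": email}
        self.activity[user_id] = []
        return user_id

    def log(self, user_id, query, snippets):
        # Activity records carry the ID only, never personal details.
        self.activity[user_id].append({"query": query, "snippets": snippets})

    def recover_id(self, email):
        # Used to reset a lost cookie from the registration record.
        for user_id, details in self.identities.items():
            if details["email"] == email:
                return user_id
        return None
```

In a file-based deployment, each dictionary would correspond to one of the two files, so that reading the query log alone reveals no personal identity.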
Personalization
Personalization is the process of presenting the right information to the right
user at the right moment. In order to learn about a user, systems must collect
personal information, analyze it, and store the results of the analysis in a user
profile. Information can be collected from users in two ways: explicitly, for
example asking for feedback such as preferences or ratings; or implicitly, for
example observing user behaviors such as the time spent reading an on-line
document.
Commercial systems tend to focus on personalized search using an explicitly
defined profile. In Google's beta version [14], for example, users are asked
to select the categories of topics which they are interested in and the search
engine applies this information during the retrieval process.
Ontologies and Semantic Web
For our study, based on previous research work from Trajkova and Gauch [18],
we decided to represent user profiles as a hierarchy of weighted concepts that
are defined in a reference ontology. According to Gruber [15], an ontology is
a specification of a conceptualization. Ontologies can be defined in different
ways, but they all represent a taxonomy of concepts along with the relations
between them. In the context of the World Wide Web, ontologies are important
because they formally define terms shared between any type of agents without
ambiguity, allowing information to be processed automatically and accurately.
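
A profile represented as a hierarchy of weighted concepts could be modeled along the following lines. This is only an illustrative data structure: the names `Concept`, `add_evidence`, and `total_weight`, the example concept paths, and the rule of accumulating weight at the matched concept are assumptions, as this section does not describe how weights are computed.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in the concept hierarchy with the user's interest weight."""
    name: str
    weight: float = 0.0
    children: dict = field(default_factory=dict)

    def child(self, name):
        # Create the child concept on first use.
        if name not in self.children:
            self.children[name] = Concept(name)
        return self.children[name]


def add_evidence(root, path, amount):
    """Add `amount` of interest to the concept at `path` below `root`."""
    node = root
    for name in path:
        node = node.child(name)
    node.weight += amount


def total_weight(node):
    """Weight of a concept plus the weights of all of its descendants."""
    return node.weight + sum(total_weight(c) for c in node.children.values())


# Usage: two pieces of evidence for one concept, one for another.
profile = Concept("Top")
add_evidence(profile, ["Science", "Biology"], 1.0)
add_evidence(profile, ["Science", "Biology"], 0.5)
add_evidence(profile, ["Sports"], 2.0)
```

Because the concept names come from a shared reference ontology, two profiles built this way can be compared node by node.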
OntoSeek [16] is an example of an information retrieval system based on
ontologies. The main assumption is that precision and recall would improve if
we used sense matching instead of word matching. The domains in which
the system operates are catalogues of either heterogeneous or homogeneous
products. The description of each product in the catalog is translated into a
lexical conceptual graph; i.e., a tree structure where nodes are nouns from the
description and arcs are concepts inferred by the corresponding nouns. All
graphs, one for each product, are stored in a repository. A special user interface
is provided to submit queries. When a query is issued, the user is required to
disambiguate its meaning.