06-12-2012, 12:43 PM
Automatic Discovery of Personal Name Aliases from the Web
Abstract
An individual is typically referred to by numerous name aliases on the web. Accurate identification of aliases of a given person
name is useful in various web related tasks such as information retrieval, sentiment analysis, personal name disambiguation, and
relation extraction. We propose a method to extract aliases of a given personal name from the web. Given a personal name, the
proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of a
candidate being a correct alias of the given name. We propose a novel, automatically extracted lexical pattern-based approach to
efficiently extract a large set of candidate aliases from snippets retrieved from a web search engine. We define numerous ranking
scores to evaluate candidate aliases using three approaches: lexical pattern frequency, word co-occurrences in an anchor text graph,
and page counts on the web. To construct a robust alias detection system, we integrate the different ranking scores into a single
ranking function using ranking support vector machines. We evaluate the proposed method on three data sets: an English personal
names data set, an English place names data set, and a Japanese personal names data set. The proposed method outperforms
numerous baselines and previously proposed name alias extraction methods, achieving a statistically significant mean reciprocal rank
(MRR) of 0.67. Experiments carried out using location names and Japanese personal names suggest the possibility of extending the
proposed method to extract aliases for different types of named entities, and for different languages. Moreover, the aliases extracted
using the proposed method are successfully utilized in an information retrieval task and improve recall by 20 percent in a relation detection
task.
INTRODUCTION
Searching for information about people on the web is one
of the most common activities of internet users. Around
30 percent of search engine queries include person names
[1], [2]. However, retrieving information about people from
web search engines can become difficult when a person has
nicknames or name aliases. For example, the famous
Japanese major league baseball player Hideki Matsui is often
called Godzilla on the web. A newspaper article on the
baseball player might use the real name, Hideki Matsui,
whereas a blogger would use the alias, Godzilla, in a blog
entry. We cannot retrieve all the information
about the baseball player if we use only his real name.
Identification of entities on the web is difficult for two
fundamental reasons: first, different entities can share the
same name (i.e., lexical ambiguity); second, a single entity
can be designated by multiple names (i.e., referential
ambiguity). As an example of lexical ambiguity, consider
the name Jim Clark. Aside from the two most popular
namesakes, the Formula One racing champion and the
founder of Netscape, at least 10 different people are listed
among the top 100 results returned by Google for the name.
On the other hand, referential ambiguity occurs because
people use different names to refer to the same entity on the
web. For example, the American movie star Will Smith is
often called the Fresh Prince in web content.
RELATED WORK
Alias identification is closely related to the problem of
cross-document coreference resolution in which the objective
is to determine whether two mentions of a name in
different documents refer to the same entity. Bagga and
Baldwin [10] proposed a cross-document coreference
resolution algorithm that first performs within-document
coreference resolution on each individual document to
extract coreference chains, and then clusters the coreference
chains under a vector space model to identify all
mentions of a name in the document set. However, the
sheer number of documents on the web makes it impractical
to perform within-document coreference resolution on
each document separately and then cluster the documents
to find aliases.
In personal name disambiguation the goal is to disambiguate
various people that share the same name
(namesakes) [3], [4]. Given an ambiguous name, most name
disambiguation algorithms have modeled the problem as
one of document clustering in which all documents that
discuss a particular individual of the given ambiguous
name are grouped into a single cluster. The web people
search task (WePS) provided an evaluation data set and
compared various name disambiguation systems. However,
the name disambiguation problem differs fundamentally
from that of alias extraction because in name disambiguation
the objective is to identify the different entities that are
referred to by the same ambiguous name; in alias extraction,
we are interested in extracting all references to a single
entity from the web.
METHOD
The proposed method is outlined in Fig. 1 and comprises
two main components: pattern extraction, and alias extraction
and ranking. Using a seed list of name-alias pairs, we
first extract lexical patterns that are frequently used to
convey information related to aliases on the web. The
extracted patterns are then used to find candidate aliases for
a given name. We define various ranking scores using the
hyperlink structure on the web and page counts retrieved
from a search engine to identify the correct aliases among
the extracted candidates.
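
The two components above can be sketched roughly as follows. This is only an
illustration under simplifying assumptions, not the implementation from the
paper: the function names, the [NAME] and [ALIAS] placeholders, the character
limit on the gap, and the example snippets are all hypothetical, and a real
system would obtain the snippets from a web search engine.

```python
import re
from collections import Counter

NAME, ALIAS = "[NAME]", "[ALIAS]"  # hypothetical placeholder tokens

def extract_patterns(snippets, seed_pairs, max_gap_chars=40):
    """Count the text that links a known name to its known alias."""
    counts = Counter()
    gap = re.compile(r"\[NAME\](.{1,%d}?)\[ALIAS\]" % max_gap_chars)
    for name, alias in seed_pairs:
        for snippet in snippets:
            # Mask the seed name and alias so only the connecting text remains.
            marked = snippet.replace(name, NAME).replace(alias, ALIAS)
            for match in gap.finditer(marked):
                counts[NAME + match.group(1) + ALIAS] += 1
    return counts

def extract_candidates(snippets, name, patterns, max_words=3):
    """Apply the patterns to a new name and collect what fills the alias slot."""
    candidates = Counter()
    for pattern in patterns:
        # Everything before the alias slot, with the real name substituted in.
        prefix = pattern.replace(NAME, name)[: -len(ALIAS)]
        regex = re.compile(
            re.escape(prefix) + r"(\w+(?: \w+){0,%d})" % (max_words - 1)
        )
        for snippet in snippets:
            for match in regex.finditer(snippet):
                candidates[match.group(1)] += 1
    return candidates

# Made-up snippets for illustration only.
seed_snippets = ["Will Smith, also known as Fresh Prince, starred in the film."]
patterns = extract_patterns(seed_snippets, [("Will Smith", "Fresh Prince")])

new_snippets = ["Hideki Matsui, also known as Godzilla, hit a home run."]
candidates = extract_candidates(new_snippets, "Hideki Matsui", patterns)
```

Candidates found this way are noisy, which is why a separate ranking step is
needed to separate correct aliases from incidental matches.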
Lexical Pattern Frequency
In Section 3.1 we presented an algorithm to extract numerous
lexical patterns that are used to describe aliases of a personal
name. As we will see later in Section 4, the proposed pattern
extraction algorithm can extract a large number of lexical
patterns. If the personal name under consideration and a
candidate alias occur together in many lexical patterns, then the candidate can be
considered a good alias for the personal name. Consequently,
we rank a set of candidate aliases in the descending
order of the number of different lexical patterns in which they
appear with a name. The lexical pattern frequency of an alias
is analogous to the document frequency (DF) popularly used
in information retrieval.
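
A minimal sketch of this score, assuming the matches are available as
(candidate, pattern) pairs (a hypothetical input format chosen for the
example). Each candidate is scored by the number of distinct patterns it
appears in, so repeated matches of the same pattern count once, much as
document frequency counts documents rather than occurrences.

```python
from collections import defaultdict

def rank_by_pattern_frequency(observations):
    """Rank candidates by the number of distinct patterns they occur in.

    observations: iterable of (candidate, pattern) pairs, one per match.
    """
    patterns_per_candidate = defaultdict(set)
    for candidate, pattern in observations:
        patterns_per_candidate[candidate].add(pattern)
    scores = {c: len(p) for c, p in patterns_per_candidate.items()}
    # Descending score; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up matches for illustration only.
observations = [
    ("Godzilla", "[NAME], also known as [ALIAS]"),
    ("Godzilla", "[NAME], also known as [ALIAS]"),  # same pattern, counted once
    ("Godzilla", "[NAME] aka [ALIAS]"),
    ("Yankees", "[NAME] aka [ALIAS]"),
]
ranking = rank_by_pattern_frequency(observations)
```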
Hub Discounting
A frequently observed phenomenon related to the web is
that many pages with diverse topics link to so-called hubs
such as Google, Yahoo, or MSN. Two anchor texts might
link to a hub for entirely different reasons. Therefore, co-occurrences
coming from hubs are prone to noise. Consider
the situation shown in Fig. 6 where a certain web page is
linked to by two sets of anchor texts. One set of anchor texts
contains the real name for which we must find aliases,
whereas the other set of anchor texts contains various
candidate aliases. If the majority of anchor texts that link to a
particular web page use the real name to do so, then our
confidence in that page as a source of information about
the person whose aliases we want to extract
increases. We use this intuition to compute a simple
discounting measure for co-occurrences in hubs as follows.
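
The formula itself is not included in this excerpt. One simple measure
consistent with the intuition above is the fraction of anchor texts pointing
to a page that contain the real name. The sketch below assumes that form; the
function name, the input format, and the case-insensitive substring match are
assumptions made for illustration, not details confirmed by the text.

```python
def hub_discount(anchor_texts, real_name):
    """Fraction of inbound anchor texts that contain the real name.

    Assumed form of the discount: a value near 1 suggests the page is mostly
    about the person, while a value near 0 suggests a general-purpose hub.
    """
    if not anchor_texts:
        return 0.0
    needle = real_name.lower()
    hits = sum(1 for text in anchor_texts if needle in text.lower())
    return hits / len(anchor_texts)

# Made-up anchor texts linking to one page, for illustration only.
anchors = ["Hideki Matsui", "Godzilla", "Hideki Matsui stats", "search"]
discount = hub_discount(anchors, "Hideki Matsui")
```

Co-occurrence counts observed on a page could then be multiplied by this
value so that evidence from hubs contributes less.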
Web Search Task
To retrieve information about a particular person from a
web search engine, it is common to include the name of the
person in the query. In fact, it has been reported that
approximately one third of all web queries contain a person
name [1]. A name can be ambiguous in the sense that there
might exist more than one individual for a given name. As
such, searching only by the name is insufficient to locate
information regarding the person we are interested in. By
including an alias that uniquely identifies a person among his
or her namesakes, it might be possible to filter out irrelevant
search results. We set up an experiment to evaluate the
effect of aliases in a web search task.
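
As an illustration of the two query conditions, the sketch below builds a
name-only query and a name-plus-alias query. The helper name is hypothetical,
and wrapping each phrase in double quotes for exact matching is a common
search engine convention assumed here, not a detail given in the text.

```python
def build_query(name, alias=None):
    """Build a search query from a name and an optional alias."""
    terms = [name] if alias is None else [name, alias]
    # Quote each phrase so that it is matched as a whole.
    return " ".join('"%s"' % term for term in terms)

name_only = build_query("Hideki Matsui")
with_alias = build_query("Hideki Matsui", "Godzilla")
```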
CONCLUSION
We proposed a lexical-pattern-based approach to extract
aliases of a given name. We use a set of names and their
aliases as training data to extract lexical patterns that
describe numerous ways in which information related to
aliases of a name is presented on the web. Next, we
substitute the real name of the person whose aliases we want
to find into the extracted lexical patterns, and
download snippets from a web search engine. We extract a
set of candidate aliases from the snippets. The candidates
are ranked using various ranking scores computed using
three approaches: lexical pattern frequency, co-occurrences
in anchor texts, and page counts-based association measures.
Moreover, we integrate the different ranking scores
to construct a single ranking function using ranking support
vector machines. We evaluate the proposed method using
three data sets: an English personal names data set, an
English location names data set, and a Japanese personal
names data set. The proposed method achieved high MRR
and AP scores on all three data sets and outperformed
numerous baselines and a previously proposed alias
extraction algorithm.
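
For readers unfamiliar with the two evaluation measures, the sketch below
computes them using their standard definitions. The function names and the
toy rankings are hypothetical, and normalizing average precision by the
number of correct aliases is one common convention; the exact definitions
used in the experiments are not shown in this excerpt.

```python
def reciprocal_rank(ranked, correct):
    """1 / rank of the first correct alias, or 0 if none is found."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, correct):
    """Mean of the precision values at each rank holding a correct alias."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            hits += 1
            total += hits / rank
    return total / len(correct) if correct else 0.0

def mean_reciprocal_rank(results):
    """results: list of (ranked candidates, set of correct aliases) per name."""
    return sum(reciprocal_rank(r, c) for r, c in results) / len(results)

# Toy rankings for two names, for illustration only.
results = [
    (["a1", "x", "y"], {"a1"}),  # correct alias ranked first
    (["x", "b1", "y"], {"b1"}),  # correct alias ranked second
]
mrr = mean_reciprocal_rank(results)  # (1 + 1/2) / 2
```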