22-02-2013, 03:19 PM
Measuring Semantic Similarity between Words Using Web Search Engines
ABSTRACT
Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic Web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q, and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets.
INTRODUCTION
The study of semantic similarity between words has long been an integral part of information retrieval and natural language processing. Semantic similarity between entities changes over time and across domains. For example, apple is frequently associated with computers on the Web. However, this sense of apple is not listed in most general-purpose thesauri or dictionaries. A user who searches for apple on the Web may be interested in this sense of apple and not apple as a fruit. New words are constantly being created, and new senses are assigned to existing words. Manually maintaining thesauri to capture these new words and senses is costly, if not impossible.
We propose an automatic method to measure semantic similarity between words or entities using Web search engines. Because of the vast number of documents and the high growth rate of the Web, it is difficult to analyze each document separately and directly. Web search engines provide an efficient interface to this vast information. Page counts and snippets are two useful information sources provided by most Web search engines. The page count of a query is the number of pages that contain the query words. The page count for the query P AND Q can be considered a global measure of co-occurrence of words P and Q. For example, the page count of the query "apple" AND "computer" in Google is 288,000,000, whereas the same for "banana" AND "computer" is only 3,590,000. The more than 80-fold difference in page counts indicates that apple is more semantically similar to computer than is banana.
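As an illustration of how such page counts can be turned into similarity scores, the sketch below computes four standard co-occurrence coefficients (Jaccard, Dice, overlap, and pointwise mutual information) from counts H(P), H(Q), and H(P AND Q). The paper's own four score definitions appear in section 3.2 and may differ in detail; the total index size n used by the PMI score is an assumed constant here.

```python
import math

def web_jaccard(h_p, h_q, h_pq):
    # Jaccard coefficient over page counts H(P), H(Q), H(P AND Q)
    denom = h_p + h_q - h_pq
    return h_pq / denom if denom > 0 else 0.0

def web_dice(h_p, h_q, h_pq):
    # Dice coefficient: twice the co-occurrence count over the sum of counts
    return 2 * h_pq / (h_p + h_q) if (h_p + h_q) > 0 else 0.0

def web_overlap(h_p, h_q, h_pq):
    # Overlap (Simpson) coefficient: co-occurrence relative to the rarer word
    m = min(h_p, h_q)
    return h_pq / m if m > 0 else 0.0

def web_pmi(h_p, h_q, h_pq, n=1e10):
    # Pointwise mutual information over estimated probabilities;
    # n is an ASSUMED total number of indexed pages, not a figure from the paper.
    if h_pq == 0:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))
```

A higher value on any of these scores indicates stronger global co-occurrence of P and Q on the Web.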
RELATED WORK
Semantic similarity measures are important in many Web-related tasks. In query expansion [5, 25, 40], a user query is modified using synonymous words to improve the relevancy of the search. One method to find appropriate words to include in a query is to compare previous user queries using semantic similarity measures. If there exists a previous query that is semantically related to the current query, it can be suggested to the user or used internally by the search engine to modify the original query.
Semantic similarity measures have been used in Semantic Web-related applications such as automatic annotation of Web pages [7], community mining [23, 19], and keyword extraction for inter-entity relation representation [26].
Semantic similarity measures are necessary for various applications in natural language processing such as word-sense disambiguation [32], language modeling [34], synonym extraction [16], and automatic thesauri extraction [8]. Manually compiled taxonomies such as WordNet and large text corpora have been used in previous work on semantic similarity [16, 31, 13, 17]. Regarding the Web as a live corpus has recently become an active research topic. Simple, unsupervised models demonstrably perform better when n-gram counts are obtained from the Web rather than from a large corpus [14, 15]. Resnik and Smith [33] extracted bilingual sentences from the Web to create a parallel corpus for machine translation. Turney [38] defined a pointwise mutual information (PMI-IR) measure using the number of hits returned by a Web search engine to recognize synonyms. Matsuo et al. [20] used a similar approach to measure the similarity between words and applied their method in a graph-based word clustering algorithm.
METHOD
Outline
We propose a method that integrates both page counts and snippets to measure semantic similarity between a given pair of words. In section 3.2, we define four similarity scores using page counts. We then describe an automatic lexico-syntactic pattern extraction algorithm in section 3.3. We rank the patterns extracted by our algorithm according to their ability to express semantic similarity. We use two-class support vector machines (SVMs) to find the optimal combination of page-count-based similarity scores and top-ranking patterns. The SVM is trained to classify synonymous word pairs and non-synonymous word pairs. We select synonymous word pairs (positive training examples) from WordNet synsets. Non-synonymous word pairs (negative training examples) are automatically created using a random shuffling technique. We convert the output of the SVM into a posterior probability. We define the semantic similarity between two words as the posterior probability that they belong to the synonymous-words (positive) class.
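Converting an SVM's raw decision value into a posterior probability is commonly done with Platt scaling, a sigmoid fit to the decision values. The paper does not give its calibration details, so the sketch below uses illustrative rather than fitted sigmoid parameters a and b:

```python
import math

def platt_posterior(decision_value, a=-1.5, b=0.0):
    """Map a raw SVM decision value to P(synonymous | word pair).

    Platt scaling: posterior = 1 / (1 + exp(a * f + b)) where f is the
    SVM decision value. In practice a and b are fitted on held-out data
    by minimizing negative log-likelihood; the defaults here are
    illustrative placeholders only.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))
```

With a negative slope a, larger decision values (further on the positive side of the margin) map to posteriors closer to 1, which is then read directly as the similarity score.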
Integrating Patterns and Page Counts
In section 3.2 we defined four similarity scores using page counts. Section 3.3 described a lexico-syntactic pattern extraction algorithm and ranked the patterns according to their ability to express synonymy. In this section we describe how we build a robust semantic similarity measure by integrating all the similarity scores and patterns described in the previous sections.
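One way to realize this integration is to concatenate the four page-count scores with normalized frequencies of the top-ranked patterns observed in the snippets for a word pair, yielding one feature vector per pair for the SVM. This is a sketch under assumptions: the normalization and feature layout below are ours, not necessarily the paper's.

```python
def make_feature_vector(scores, pattern_counts, top_patterns):
    # scores: the four page-count-based similarity values for the word pair
    # pattern_counts: dict mapping a lexico-syntactic pattern to its
    #   frequency in the snippets retrieved for this word pair
    # top_patterns: the N highest-ranked patterns, in rank order
    features = list(scores)
    total = sum(pattern_counts.values()) or 1  # avoid division by zero
    for pattern in top_patterns:
        # relative frequency of each top-ranked pattern (0 if absent)
        features.append(pattern_counts.get(pattern, 0) / total)
    return features
```

The resulting vector has length 4 + N and is what the two-class SVM would be trained and evaluated on.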
EXPERIMENTS
We conduct two sets of experiments to evaluate the proposed semantic similarity measure. First, we compare the similarity scores produced by the proposed measure against the Miller-Charles benchmark dataset. We analyze the behavior of the proposed measure with respect to the number of patterns used as features, the number of snippets used to extract the patterns, and the size of the training dataset. Second, we apply the proposed measure in two real-world applications: community mining and entity disambiguation.
The Benchmark Dataset
We evaluate the proposed method on the Miller-Charles [24] dataset, a dataset of 30 word pairs rated by a group of 38 human subjects. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy). The Miller-Charles dataset is a subset of Rubenstein-Goodenough's [35] original dataset of 65 word pairs. Although the Miller-Charles experiment was carried out 25 years later than Rubenstein-Goodenough's, the two sets of ratings are highly correlated (Pearson correlation coefficient = 0.97). Therefore, the Miller-Charles ratings can be considered a reliable benchmark for evaluating semantic similarity measures.
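The Pearson correlation used throughout to compare a measure's scores against the human ratings can be computed directly from the two rating lists; a minimal sketch:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length rating lists:
    # covariance divided by the product of standard deviations.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to the 30 word pairs, xs would hold the measure's similarity scores and ys the mean human ratings; a value near 1 indicates close agreement in ranking and spacing.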
Training Data
We used synonymous word pairs extracted from WordNet synsets as positive training examples and automatically generated non-synonymous word pairs as negative training examples to train the two-class support vector machine described in section 3.4. To determine the optimum combination of positive and negative training examples, we trained a linear-kernel SVM with different combinations of positive and negative training examples and evaluated accuracy against the human ratings in the Miller-Charles benchmark dataset. Experimental results are summarized in Figure 7. A maximum correlation coefficient of 0.8345 is achieved with 1900 positive training examples and 2400 negative training examples. Moreover, Figure 7 reveals that correlation does not improve beyond 2500 positive and negative training examples. Therefore, we conclude that 2500 examples are sufficient to train the proposed semantic similarity measure.
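The random-shuffling generation of negative training pairs mentioned above can be sketched as follows; the function and parameter names are ours, and the paper's exact shuffling procedure may differ. The idea is to pair the first word of one synonym pair with the second word of a different pair, so the result is very likely non-synonymous.

```python
import random

def make_negative_pairs(synonym_pairs, n, seed=0):
    # synonym_pairs: list of (p, q) tuples drawn from WordNet synsets
    # n: number of negative (non-synonymous) pairs to generate
    rng = random.Random(seed)  # seeded for reproducibility
    firsts = [p for p, _ in synonym_pairs]
    seconds = [q for _, q in synonym_pairs]
    negatives = set()
    while len(negatives) < n:
        p = rng.choice(firsts)
        q = rng.choice(seconds)
        # reject real synonym pairs and identical words
        if (p, q) not in synonym_pairs and p != q:
            negatives.add((p, q))
    return list(negatives)
```

Note that n must be smaller than the number of distinct cross-pairings available, or the loop will not terminate; a production version would enumerate candidates instead of rejection-sampling.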
Conclusion
In this paper, we proposed a measure that uses both page counts and snippets to robustly calculate semantic similarity between two given words or named entities. The method consists of four page-count-based similarity scores and automatically extracted lexico-syntactic patterns. We integrated the page-count-based similarity scores with the lexico-syntactic patterns using support vector machines. Training data were automatically generated using WordNet synsets. The proposed method outperformed all the baselines, including previously proposed Web-based semantic similarity measures, on a benchmark dataset. A high correlation (correlation coefficient of 0.834) with human ratings was found for semantic similarity on this benchmark dataset.