27-09-2012, 01:25 PM
Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity
INTRODUCTION
Similarity is a complex concept which has been widely discussed in the linguistic,
philosophical, and information theory communities [Hatzivassiloglou
et al. 1999]. Frawley [1992] discusses all semantic typing in terms of two
mechanisms: the detection of similarities and differences. Jackendoff [1983]
argues that standard semantic relations such as synonymy, paraphrase, redundancy,
and entailment all result from judgments of likeness, whereas antonymy, contradiction, and inconsistency derive from judgments of difference. For our task, given two input text segments, we want to automatically determine a score that indicates their similarity at the semantic level, thus going beyond the simple lexical-matching methods traditionally used for this task.
RELATED WORK
There is extensive literature on measuring the similarity between long texts
or documents [Hatzivassiloglou et al. 1999; Landauer and Dumais 1997; Maguitman et al. 2005; Meadow et al. 2000], but there is less work related
to the measurement of similarity between sentences or short texts [Foltz et al.
1998]. Related work can roughly be classified into four major categories: word
co-occurrence/vector-based document model methods, corpus-based methods,
hybrid methods, and descriptive feature-based methods.
The vector-based document model methods are commonly used in Information
Retrieval (IR) systems [Meadow et al. 2000], where the document most
relevant to an input query is determined by representing a document as a word
vector, and then queries are matched to similar documents in the document
database via a similarity metric [Salton and Lesk 1971]. One extension of word
co-occurrence methods leads to the pattern matching methods which are commonly
used in text mining and conversational agents [Corley and Mihalcea
2005]. This technique relies on the assumption that more similar documents have more words in common. However, texts with similar meaning do not necessarily share many words. Moreover, this sentence representation is inefficient: the vector dimension is very large compared to the number of words in a short text or sentence, so the resulting vectors have many null components.
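The vector-space matching idea described above can be sketched as a bag-of-words cosine similarity. This is a minimal illustration of the technique, not the implementation used by the cited IR systems; the function name and whitespace tokenization are our own simplifications. Note how two texts with no words in common score zero, regardless of meaning, which is exactly the weakness noted above.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary; all other components
    # contribute nothing (the "null components" of the sparse vectors).
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For example, "a heart doctor" and "a cardiac surgeon" share only the word "a" and so receive a low score despite their similar meaning.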
PROPOSED METHOD
The proposed method determines the similarity of two texts from semantic and
syntactic information (in terms of common-word order) that they contain. We
consider three similarity functions in order to derive a more generalized text
similarity method. First, string similarity and semantic word similarity are calculated; then an optional common-word order similarity function can be applied to incorporate syntactic information. Finally, the
text similarity is derived by combining string similarity, semantic similarity
and common-word order similarity with normalization. We call our proposed
method the Semantic Text Similarity (STS) method.
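The combination step can be sketched as a normalized weighted sum of the three component scores. The weights below are illustrative placeholders, not the calibrated values of the STS method; the optional word-order term mirrors the description above.

```python
def combined_similarity(string_sim, semantic_sim, word_order_sim=None,
                        w_string=0.5, w_semantic=0.5, w_order=0.2):
    """Weighted, normalized combination of string, semantic, and
    (optional) common-word order similarity. Weights are illustrative."""
    if word_order_sim is None:
        # Word-order similarity is optional in the method; drop its weight
        # entirely when it is not supplied.
        w_order, word_order_sim = 0.0, 0.0
    score = (w_string * string_sim +
             w_semantic * semantic_sim +
             w_order * word_order_sim)
    # Normalize by the total weight so the result stays in [0, 1]
    # whenever each component score is in [0, 1].
    return score / (w_string + w_semantic + w_order)
```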
String Similarity between Words
We use the longest common subsequence (LCS) [Allison and Dix 1986] measure
with some normalization and small modifications for our string similarity measure.
We use three different modified versions of LCS and then take a weighted sum of these. Kondrak [2005] showed that edit distance and the length of the
longest common subsequence are special cases of n-gram distance and similarity,
respectively. Melamed [1999] normalized LCS by dividing the length of the
longest common subsequence by the length of the longer string and called it
longest common subsequence ratio (LCSR). However, LCSR does not take into account the length of the shorter string, which can have a significant impact on the similarity score.
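The measures above can be sketched as follows: a standard dynamic-programming LCS, Melamed's LCSR, and one normalization that also reflects the shorter string's length. The `nlcs` form shown is an illustrative normalization, not necessarily one of the paper's three exact modified versions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j],
                                                             dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Melamed's LCSR: LCS length divided by the longer string's length."""
    return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0

def nlcs(a, b):
    """A normalization involving both lengths: LCS^2 / (len(a) * len(b)),
    so the shorter string's length also influences the score."""
    if not (a and b):
        return 0.0
    l = lcs_length(a, b)
    return (l * l) / (len(a) * len(b))
```

For "world" and "word" the LCS is "word" (length 4), so LCSR is 4/5 = 0.8; `nlcs` gives 16/20 = 0.8 here, but the two diverge when string lengths differ more sharply.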
Semantic Similarity between Words
There is a relatively large number of word-to-word similarity metrics in the
literature, ranging from distance-oriented measures computed on semantic
networks or knowledge-based (dictionary/thesaurus-based) measures, to metrics
based on models of information theory (or corpus-based measures) learned
from large text collections. A detailed review on word similarity can be found
in Li et al. [2003], Rodriguez and Egenhofer [2003], Weeds et al. [2004], and
Bollegala et al. [2007]. We focus our attention on corpus-based measures because of their large type coverage: word types that occur in real-world texts are often not found in knowledge bases.
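As a toy stand-in for the corpus-based family of measures, the sketch below computes pointwise mutual information (PMI) over sentence-level co-occurrence counts. This is a simplified illustration of how relatedness can be learned from raw text alone, not a specific measure from the surveys cited above.

```python
from math import log2

def pmi(word1, word2, sentences):
    """Pointwise mutual information of two words co-occurring in the same
    sentence; `sentences` is a list of sets of word types. Higher values
    mean the words co-occur more often than chance predicts."""
    n = len(sentences)
    c1 = sum(1 for s in sentences if word1 in s)
    c2 = sum(1 for s in sentences if word2 in s)
    c12 = sum(1 for s in sentences if word1 in s and word2 in s)
    if not (c1 and c2 and c12):
        return 0.0  # no evidence: back off to zero relatedness
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    return log2((c12 / n) / ((c1 / n) * (c2 / n)))
```

Because it needs only raw text, such a measure covers any word type attested in the corpus, which is the coverage advantage noted above over knowledge-based measures.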