27-09-2012, 01:25 PM
Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity
INTRODUCTION
Similarity is a complex concept which has been widely discussed in the linguistic,
philosophical, and information theory communities [Hatzivassiloglou
et al. 1999]. Frawley [1992] discusses all semantic typing in terms of two
mechanisms: the detection of similarities and differences. Jackendoff [1983]
argues that standard semantic relations such as synonymy, paraphrase, redundancy,
and entailment all result from judgments of likeness, whereas antonymy, contradiction, and inconsistency derive from judgments of difference. For our task, given two input text segments, we want to automatically determine a score that indicates their similarity at the semantic level, thus going beyond the simple lexical-matching methods traditionally used for this task.
RELATED WORK
There is extensive literature on measuring the similarity between long texts
or documents [Hatzivassiloglou et al. 1999; Landauer and Dumais 1997; Maguitman et al. 2005; Meadow et al. 2000], but there is less work related
to the measurement of similarity between sentences or short texts [Foltz et al.
1998]. Related work can roughly be classified into four major categories: word
co-occurrence/vector-based document model methods, corpus-based methods,
hybrid methods, and descriptive feature-based methods.
The vector-based document model methods are commonly used in Information
Retrieval (IR) systems [Meadow et al. 2000], where the document most
relevant to an input query is determined by representing a document as a word
vector, and then queries are matched to similar documents in the document
database via a similarity metric [Salton and Lesk 1971]. One extension of word
co-occurrence methods leads to the pattern matching methods which are commonly
used in text mining and conversational agents [Corley and Mihalcea
2005]. This technique relies on the assumption that more similar documents have more words in common. However, texts with similar meaning do not necessarily share many words. Moreover, this sentence representation is inefficient: the vector dimension is very large compared to the number of words in a short text or sentence, so the resulting vectors have many null components.
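The vector-space matching idea described above can be sketched as a bag-of-words cosine similarity. This is a minimal illustration of the technique, not the implementation used by the cited IR systems; the function name and whitespace tokenization are our own simplifications. Note how two texts with no words in common score zero, regardless of meaning, which is exactly the weakness noted above.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary; all other components
    # contribute nothing (the "null components" of the sparse vectors).
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For example, "a heart doctor" and "a cardiac surgeon" share only the word "a" and so receive a low score despite their similar meaning.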
PROPOSED METHOD
The proposed method determines the similarity of two texts from semantic and
syntactic information (in terms of common-word order) that they contain. We
consider three similarity functions in order to derive a more generalized text
similarity method. First, string similarity and semantic word similarity are calculated; then an optional common-word order similarity function can be applied to incorporate syntactic information. Finally, the
text similarity is derived by combining string similarity, semantic similarity
and common-word order similarity with normalization. We call our proposed
method the Semantic Text Similarity (STS) method.
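The combination step can be sketched as a normalized weighted sum of the three component scores. The weights below are illustrative placeholders, not the calibrated values of the STS method; the optional word-order term mirrors the description above.

```python
def combined_similarity(string_sim, semantic_sim, word_order_sim=None,
                        w_string=0.5, w_semantic=0.5, w_order=0.2):
    """Weighted, normalized combination of string, semantic, and
    (optional) common-word order similarity. Weights are illustrative."""
    if word_order_sim is None:
        # Word-order similarity is optional in the method; drop its weight
        # entirely when it is not supplied.
        w_order, word_order_sim = 0.0, 0.0
    score = (w_string * string_sim +
             w_semantic * semantic_sim +
             w_order * word_order_sim)
    # Normalize by the total weight so the result stays in [0, 1]
    # whenever each component score is in [0, 1].
    return score / (w_string + w_semantic + w_order)
```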
String Similarity between Words
We use the longest common subsequence (LCS) [Allison and Dix 1986] measure
with some normalization and small modifications for our string similarity measure.
We use three different modified versions of LCS and then take a weighted sum of these. Kondrak [2005] showed that edit distance and the length of the
longest common subsequence are special cases of n-gram distance and similarity,
respectively. Melamed [1999] normalized LCS by dividing the length of the
longest common subsequence by the length of the longer string and called it
longest common subsequence ratio (LCSR). However, LCSR does not take into account the length of the shorter string, which can have a significant impact on the similarity score.
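The measures above can be sketched as follows: a standard dynamic-programming LCS, Melamed's LCSR, and one normalization that also reflects the shorter string's length. The `nlcs` form shown is an illustrative normalization, not necessarily one of the paper's three exact modified versions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j],
                                                             dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Melamed's LCSR: LCS length divided by the longer string's length."""
    return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0

def nlcs(a, b):
    """A normalization involving both lengths: LCS^2 / (len(a) * len(b)),
    so the shorter string's length also influences the score."""
    if not (a and b):
        return 0.0
    l = lcs_length(a, b)
    return (l * l) / (len(a) * len(b))
```

For "world" and "word" the LCS is "word" (length 4), so LCSR is 4/5 = 0.8; `nlcs` gives 16/20 = 0.8 here, but the two diverge when string lengths differ more sharply.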
Semantic Similarity between Words
There is a relatively large number of word-to-word similarity metrics in the
literature, ranging from distance-oriented measures computed on semantic
networks or knowledge-based (dictionary/thesaurus-based) measures, to metrics
based on models of information theory (or corpus-based measures) learned
from large text collections. A detailed review on word similarity can be found
in Li et al. [2003], Rodriguez and Egenhofer [2003], Weeds et al. [2004], and
Bollegala et al. [2007]. We focus our attention on corpus-based measures because of their large type coverage: word types that occur in real-world texts are often not found in knowledge bases.
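As a toy stand-in for the corpus-based family of measures, the sketch below computes pointwise mutual information (PMI) over sentence-level co-occurrence counts. This is a simplified illustration of how relatedness can be learned from raw text alone, not a specific measure from the surveys cited above.

```python
from math import log2

def pmi(word1, word2, sentences):
    """Pointwise mutual information of two words co-occurring in the same
    sentence; `sentences` is a list of sets of word types. Higher values
    mean the words co-occur more often than chance predicts."""
    n = len(sentences)
    c1 = sum(1 for s in sentences if word1 in s)
    c2 = sum(1 for s in sentences if word2 in s)
    c12 = sum(1 for s in sentences if word1 in s and word2 in s)
    if not (c1 and c2 and c12):
        return 0.0  # no evidence: back off to zero relatedness
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    return log2((c12 / n) / ((c1 / n) * (c2 / n)))
```

Because it needs only raw text, such a measure covers any word type attested in the corpus, which is the coverage advantage noted above over knowledge-based measures.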