27-12-2011, 07:13 PM
Check out the attachment
09-01-2012, 02:44 PM
What is the main problem with the existing system in this project?
14-07-2012, 10:20 AM
To get information about the topic "UML diagrams for clustering with multi-viewpoint based similarity measure" and related topics, refer to the link below: https://seminarproject.net/Thread-cluste...ull-report
18-08-2012, 03:42 PM
Clustering with Multi-Viewpoint based Similarity Measure
Clustering.pdf (Size: 553.02 KB / Downloads: 136)

Abstract
All clustering methods have to assume some cluster relationship among the data objects they are applied to. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure, and we compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

INTRODUCTION
Clustering is one of the most interesting and important topics in data mining. Its aim is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year, proposed for very different research fields and developed using entirely different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple k-means algorithm still remains one of the top 10 data mining algorithms. It is the most frequently used partitional clustering algorithm in practice, and another recent scientific discussion [2] states that k-means is the favourite algorithm of practitioners in the related fields. Needless to say, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than that of other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios can be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.
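As context for the initialization sensitivity mentioned above, here is a minimal NumPy sketch of Lloyd's k-means. The data, seeds, and empty-cluster handling are illustrative assumptions, not anything from the paper:

```python
import numpy as np

def kmeans(X, k, seed, iters=50):
    """Plain Lloyd's k-means; the final partition depends on the random seed."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster emptied out
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Three synthetic blobs: with unlucky seeds, k-means can converge to
# different partitions, illustrating the initialization sensitivity.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [5, 0], [2.5, 4])])
labels_a, _ = kmeans(X, k=3, seed=1)
labels_b, _ = kmeans(X, k=3, seed=2)
```

Running the clustering several times with different seeds and keeping the result with the lowest within-cluster sum of squares is the usual practical workaround.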
RELATED WORK
First of all, Table 1 summarizes the basic notations used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms in the document corpus. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length. The principal definition of clustering is to arrange data objects into separate clusters such that the intra-cluster similarity as well as the inter-cluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance probabilistic model-based methods [9], non-negative matrix factorization [10], information-theoretic co-clustering [11] and so on. In this paper, though, we primarily focus on methods that do utilize a specific measure.

PERFORMANCE EVALUATION OF MVSC
To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity and the extended Jaccard coefficient.

Document collections
The data corpora used for the experiments consist of twenty benchmark document datasets. Besides reuters7 and k1b, which have been described in detail earlier, we included another eighteen text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these datasets are provided together with CLUTO by the toolkit's authors [19].
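The three baseline measures named above can be sketched on toy unit-length TF-IDF-style vectors. The extended Jaccard form below is the standard Tanimoto expression; the vectors themselves are made up for illustration:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def euclidean(x, y):
    return np.linalg.norm(x - y)

def extended_jaccard(x, y):
    # Tanimoto form: x.y / (|x|^2 + |y|^2 - x.y)
    xy = x @ y
    return xy / (x @ x + y @ y - xy)

# toy TF-IDF-like vectors, normalized to unit length as in the paper
d1 = np.array([0.2, 0.0, 0.8, 0.1]); d1 /= np.linalg.norm(d1)
d2 = np.array([0.1, 0.3, 0.7, 0.0]); d2 /= np.linalg.norm(d2)

# on unit vectors, cosine and Euclidean distance are monotonically related:
# ||x - y||^2 = 2 - 2*cos(x, y)
assert np.isclose(euclidean(d1, d2) ** 2, 2 - 2 * cosine(d1, d2))
```

The identity in the last line is why, on unit-normalized documents, clustering with Euclidean distance and with cosine similarity rank pairs identically; the extended Jaccard coefficient behaves differently because it is not invariant to that normalization trade-off.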
They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes and class balance. They were all preprocessed by standard procedures, including stopword removal and stemming.

CONCLUSIONS AND FUTURE WORK
In this paper, we propose a Multi-Viewpoint based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced and compared empirically with other state-of-the-art clustering methods that use different types of similarity measure.
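The multi-viewpoint idea can be illustrated with a small sketch: instead of judging two documents only from the origin (which yields the cosine/inner product on unit vectors), they are judged from many points assumed to lie outside their cluster, and the assessments are averaged. The exact weighting and viewpoint selection in the paper's MVS may differ, so treat this as a schematic:

```python
import numpy as np

def mvs(di, dj, viewpoints):
    """Average similarity of di and dj as seen from each viewpoint dh:
    mean over dh of (di - dh) . (dj - dh).
    Viewpoints are objects assumed NOT to share a cluster with di and dj."""
    return np.mean([(di - dh) @ (dj - dh) for dh in viewpoints])

d1 = np.array([0.6, 0.8])  # toy unit-length document vectors
d2 = np.array([0.8, 0.6])

# the single-viewpoint inner product is recovered when the only
# viewpoint is the origin
origin = [np.zeros(2)]
assert np.isclose(mvs(d1, d2, origin), d1 @ d2)

# with outside documents as viewpoints, the assessment changes
outside = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(mvs(d1, d2, outside))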
03-10-2012, 10:43 AM
To get information about the topic "clustering with multi viewpoint based similarity measure" (full report, PPT) and related topics, refer to the links below:
https://seminarproject.net/Thread-cluste...ull-report http://project-seminars.com/attachment.php?aid=32089 https://seminarproject.net/Thread-cluste...#pid113033
04-10-2012, 04:47 PM
Could you explain more about the meaning of "former" and the alternative forms you mentioned in your paper ("Clustering with Multi-viewpoint based Similarity Measures")? That would help me understand it better. (visaraji[at]gmail.com, saran.saranyap[at]gmail.com)
Thank you, S.Visalakshi
23-01-2013, 01:58 PM
Clustering With Multi-Viewpoint Based Similarity Measure Project
1Clustering With Multi.doc (Size: 37.5 KB / Downloads: 24)

ABSTRACT:
All clustering methods have to assume some cluster relationship among the data objects they are applied to. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

EXISTING SYSTEMS
• Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year.
• The existing system greedily picks the next frequent itemset, which represents the next cluster, so as to minimize the overlap between the documents that contain both that itemset and some remaining itemsets.
• In other words, the clustering result depends on the order in which the itemsets are picked, which in turn depends on the greedy heuristic. Our method does not follow a sequential order of selecting clusters; instead, documents are assigned to the best cluster.
PROPOSED SYSTEM
• The main work is to develop a novel hierarchical algorithm for document clustering that provides maximum efficiency and performance.
• It focuses in particular on studying and exploiting the cluster-overlapping phenomenon to design cluster-merging criteria. The main concern is proposing a new way to compute the overlap rate in order to improve time efficiency and veracity. Based on the hierarchical clustering method, the Expectation-Maximization (EM) algorithm is used in a Gaussian Mixture Model to estimate the parameters, and the two sub-clusters are merged when their overlap is the largest.
• Experiments on both public data and document clustering data show that this approach can improve clustering efficiency and save computing time.
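A rough sketch of the merge-by-overlap step described above, assuming Gaussian cluster models. The Bhattacharyya coefficient used here is an illustrative stand-in for the post's "overlap rate" (whose exact definition is not given), and all names are hypothetical:

```python
import numpy as np

def gaussian_overlap(A, B):
    """Bhattacharyya coefficient between Gaussians fitted to two clusters:
    1.0 means identical distributions, near 0 means well separated.
    (An illustrative stand-in for the proposed overlap rate.)"""
    mu1, mu2 = A.mean(0), B.mean(0)
    s1 = np.cov(A.T) + 1e-6 * np.eye(A.shape[1])  # ridge for stability
    s2 = np.cov(B.T) + 1e-6 * np.eye(B.shape[1])
    s = (s1 + s2) / 2
    diff = mu1 - mu2
    db = (diff @ np.linalg.solve(s, diff) / 8
          + 0.5 * np.log(np.linalg.det(s)
                         / np.sqrt(np.linalg.det(s1) * np.linalg.det(s2))))
    return np.exp(-db)

def merge_most_overlapping(clusters):
    """One agglomerative step: merge the pair with the largest overlap."""
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    i, j = max(pairs, key=lambda p: gaussian_overlap(clusters[p[0]],
                                                     clusters[p[1]]))
    merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
    merged.append(np.vstack([clusters[i], clusters[j]]))
    return merged

rng = np.random.default_rng(0)
near1 = rng.normal([0.0, 0.0], 0.5, (40, 2))
near2 = rng.normal([0.5, 0.0], 0.5, (40, 2))   # overlaps near1 heavily
far = rng.normal([8.0, 8.0], 0.5, (40, 2))
clusters = merge_most_overlapping([near1, near2, far])
```

In a full implementation, the per-cluster means and covariances would come from EM on the Gaussian Mixture Model rather than the plain sample estimates used here, and the step would repeat until the desired number of clusters remains.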