30-11-2012, 02:16 PM
Clustering with Multi-Viewpoint based Similarity Measure
INTRODUCTION
Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They are proposed for very different research fields and developed using entirely different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple k-means algorithm remains one of the top 10 data mining algorithms. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favourite algorithm of practitioners in the related fields. Needless to say, k-means has several basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than that of other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability and scalability account for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one that performs better in some cases but sees limited use because of its high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.
Analysis and practical examples of MVS
In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. To demonstrate its advantages, MVS is compared with cosine similarity (CS) on how well each reflects the true group structure in document collections. Firstly, exploring Eq. (10), we have:
MULTI-VIEWPOINT BASED CLUSTERING
Two clustering criterion functions IR and IV
Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called IR, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. Firstly, let us express this sum in a general form by function F:
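The verbal description above (a cluster size-weighted sum of average pairwise similarities) can be sketched in Python. This is only an illustration under stated assumptions, not the paper's exact formula: the function name weighted_avg_pairwise is hypothetical, the average is taken over all ordered pairs in a cluster including each document paired with itself, and cosine similarity stands in as a placeholder for the similarity measure.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def weighted_avg_pairwise(clusters, sim=cosine):
    """Sum over clusters of (cluster size) * (average pairwise similarity).

    Assumption: the average runs over all n_r * n_r ordered pairs,
    self-pairs included. `sim` is a placeholder similarity function.
    """
    total = 0.0
    for docs in clusters:
        n_r = len(docs)
        if n_r == 0:
            continue  # an empty cluster contributes nothing
        pair_sum = sum(sim(a, b) for a in docs for b in docs)
        total += n_r * (pair_sum / (n_r * n_r))
    return total


# Toy document vectors, invented for illustration only.
tight = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
print(weighted_avg_pairwise(tight), weighted_avg_pairwise(mixed))
```

A partition that groups similar documents together scores higher than one that mixes dissimilar documents, which is the behaviour a criterion of this kind is meant to reward.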
PERFORMANCE EVALUATION OF MVSC
To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity and the extended Jaccard coefficient.
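For reference, the three baseline measures named above can be written in a few lines of Python. This is a minimal sketch using the standard textbook definitions on dense vectors; the function names and toy vectors are illustrative assumptions and do not come from the paper's experimental setup.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; ignores vector length."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / norm if norm else 0.0


def extended_jaccard(x, y):
    """Extended (Tanimoto) Jaccard: x.y / (|x|^2 + |y|^2 - x.y)."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    return xy / denom if denom else 0.0


# Toy term-weight vectors, invented for illustration only.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1, twice the length
d3 = [0.0, 0.0, 3.0]  # shares no terms with d1

for name, f in [("euclidean", euclidean_distance),
                ("cosine", cosine_similarity),
                ("ext. jaccard", extended_jaccard)]:
    print(name, f(d1, d2), f(d1, d3))
```

Note that d1 and d2 point in the same direction, so cosine similarity treats them as identical, while Euclidean distance and the extended Jaccard coefficient still register the difference in length.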