01-09-2014, 03:37 PM
Document Clustering using Rough K-means Algorithm
Abstract
Clustering is an automatic learning technique that groups a set of objects into subsets, or clusters. The goal is to place similar objects in the same cluster and dissimilar objects in different clusters. K-means clustering is characterized by non-overlapping, clearly separated clusters with bivalent memberships: an object either belongs to a cluster or does not belong to it.
However, many real-life applications are characterized by situations where overlapping clusters would be a more suitable representation. Soft clustering mechanisms enable such a representation, as they allow an object to belong to more than one cluster. Document clustering aims to cluster documents based on the similarity of the concepts they are associated with. It is widely applied in web mining, for example to cluster web pages and query results.
Soft clustering is relevant to document clustering because a document may be associated with more than one concept. In our project we investigate the applicability of the Rough K-means algorithm, a soft clustering technique based on rough set principles, to document clustering. The objectives are to implement the Rough K-means algorithm and to perform document clustering with both the Rough K-means and K-means algorithms on benchmark datasets for comparative analysis.
Work Done
Work done so far is summarized below:
1. Literature survey of soft clustering methods, in particular the Rough K-means algorithm.
2. Pre-processing of documents and vector space modeling of data:
a. Stop-word removal.
b. Stemming.
c. Finding term frequency and inverse document frequency (tf-idf).
3. Clustering of objects using the K-means algorithm.
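
The pre-processing steps listed above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the stop-word list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the weight is assumed to be the plain tf * log(N / df) form (other tf-idf variants exist).

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def naive_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using the weight tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


docs = [
    "Clustering groups similar documents",
    "Rough sets allow overlapping clusters",
    "Documents are clustered on the web",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one vector over the shared vocabulary. A term that occurs in every document (here the stem "cluster") gets weight zero, which is the intended effect of the idf factor.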
Future Work
1. Implementation of the Rough K-means algorithm.
2. Vector space representation of documents based on tf-idf values.
3. Comparative study of the K-means and Rough K-means algorithms on benchmark datasets.
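
To give an idea of what the planned Rough K-means implementation might look like, here is a minimal sketch. It assumes one commonly described formulation (in the style of Lingras and West), which may differ from the variant the project finally adopts: each cluster has a lower approximation (objects that surely belong) and an upper approximation (objects that possibly belong), an object goes to the boundary of several clusters when its distances to them differ by at most a threshold, and the centroid is a weighted mix of the lower and boundary means. The names `rough_kmeans`, `w_lower` and `threshold`, and their default values, are assumptions made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points, dim):
    """Component-wise mean of a non-empty list of points."""
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def rough_kmeans(points, centroids, w_lower=0.7, threshold=1.0, iterations=20):
    """Sketch of Rough K-means; returns (centroids, lower, upper) as index lists."""
    k, dim = len(centroids), len(points[0])
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        lower = [[] for _ in range(k)]
        boundary = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            d = [dist(p, c) for c in centroids]
            nearest = min(range(k), key=d.__getitem__)
            # Other clusters that are almost as close as the nearest one.
            close = [j for j in range(k)
                     if j != nearest and d[j] - d[nearest] <= threshold]
            if close:
                # Ambiguous object: boundary region of every candidate cluster.
                for j in [nearest] + close:
                    boundary[j].append(idx)
            else:
                # Unambiguous object: lower approximation of its nearest cluster.
                lower[nearest].append(idx)
        new_centroids = []
        for j in range(k):
            lo = [points[i] for i in lower[j]]
            bd = [points[i] for i in boundary[j]]
            if lo and bd:
                m_lo, m_bd = mean(lo, dim), mean(bd, dim)
                new_centroids.append(
                    [w_lower * a + (1 - w_lower) * b for a, b in zip(m_lo, m_bd)])
            elif lo:
                new_centroids.append(mean(lo, dim))
            elif bd:
                new_centroids.append(mean(bd, dim))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    # Upper approximation = lower approximation plus boundary region.
    upper = [sorted(lower[j] + boundary[j]) for j in range(k)]
    return centroids, lower, upper


# Toy data: two tight groups plus one point halfway between them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
centroids, lower, upper = rough_kmeans(points, [(0, 0), (10, 10)])
```

On this toy data the midpoint (index 6) appears in the upper approximation of both clusters but in neither lower approximation, whereas ordinary K-means would have to force it into exactly one cluster. For documents, the same function could be run on tf-idf vectors instead of 2-D points.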