11-11-2014, 10:40 AM
Abstract: Document clustering is the unsupervised learning of a large set of documents; it helps to filter documents, group similar ones together, and manage a large document repository. Nowadays the WWW (World Wide Web) holds a vast number of documents. This unstructured big data becomes useful once data mining is performed on it. Clustering such a large set of unstructured documents enables (semi-)automated categorization and smoother search. Any document clustering technique needs a suitable similarity measure to find similar documents and group them under the most suitable cluster. While several clustering techniques and distance measures have been proposed in the past, there is no systematic approach for deciding which distance measure to choose in which situation. We therefore studied four distance measures (DM), namely Euclidean DM, Squared Euclidean DM, Cosine DM and Tanimoto DM, under partitional clustering algorithms such as K-Means clustering and canopy clustering. We recorded observations under different factors that affect clustering results and their quality: number of iterations, threshold value, number of clusters, and time. We ran a number of experiments with Hadoop in pseudo-distributed mode. From these observations, and with the help of the confusion matrix and other quality measures, we conclude that the Cosine and Tanimoto distance measures emerge as the best similarity measures for capturing human categorization behaviour, while Euclidean similarity sometimes performs poorly and takes more time. Of the four, Cosine similarity is the best.
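
The four distance measures named in the abstract can be sketched in Python as follows. This is only a minimal illustration, assuming documents are already represented as equal-length numeric term vectors (for example term-frequency weights). The function names are hypothetical, the abstract gives no formulas, and this is not the Hadoop-based code used in the experiments. The Tanimoto distance is written in its common extended-Jaccard form, and the handling of all-zero vectors is an assumption made here for the sake of a runnable example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length term vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared per-term differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors."""
    norm = math.sqrt(dot(a, a) * dot(b, b))
    if norm == 0:
        # Assumption: a vector with no terms is treated as maximally distant.
        return 1.0
    return 1.0 - dot(a, b) / norm


def tanimoto_distance(a, b):
    """1 minus the extended Jaccard (Tanimoto) coefficient."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    if denom == 0:
        # Assumption: this only happens when both vectors are all zeros,
        # which are treated here as identical.
        return 0.0
    return 1.0 - ab / denom


# Two toy documents with the same term proportions but different lengths.
doc_short = [1, 1, 0]
doc_long = [2, 2, 0]

for name, measure in [
    ("Euclidean", euclidean),
    ("Squared Euclidean", squared_euclidean),
    ("Cosine", cosine_distance),
    ("Tanimoto", tanimoto_distance),
]:
    print(f"{name}: {measure(doc_short, doc_long):.4f}")
```

In this toy case the cosine distance is zero because the two vectors point in the same direction, while the Euclidean measures are non-zero because they also react to document length. That difference is one plausible reason the measures can rank documents differently, though the abstract itself does not state the cause.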