Data Clustering using Hadoop

**seminar code** · 11-11-2014, 10:40 AM

Abstract: Document clustering is the unsupervised learning of structure in a large set of documents; it helps to filter documents, group similar ones together, and manage a large document repository. The World Wide Web (WWW) now holds an enormous number of documents. This unstructured big data becomes useful once data mining is applied to it. Clustering such a large set of unstructured documents enables (semi-)automated categorization and smoother search. Any document clustering technique needs a suitable similarity measure to find similar documents and group them under the most suitable cluster. While several clustering techniques and distance measures have been proposed in the past, there is no systematic approach for deciding which distance measure to choose in a given situation. We therefore studied four distance measures (Euclidean, Squared Euclidean, Cosine, and Tanimoto) under clustering algorithms such as K-Means and canopy clustering. We recorded observations while varying the factors that affect clustering results and their quality: number of iterations, threshold value, number of clusters, and running time. We ran a number of experiments with Hadoop in pseudo-distributed mode. From these observations, and with the help of a confusion matrix and other quality measures, we conclude that the Cosine and Tanimoto distance measures best capture human categorization behaviour, while the Euclidean measure sometimes performs poorly and is more time-consuming. Of the four, Cosine similarity performs best.
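For readers unfamiliar with the four distance measures named in the abstract, here is a minimal Python sketch of their commonly used definitions on term-frequency vectors. It is an illustration only, not the authors' Hadoop code: the function names, the zero-vector conventions, the helper `nearest_centroid`, and the toy vectors are all assumptions made for this example, and the Tanimoto form shown (one minus dot product over the sum of squared norms minus dot product) is the usual real-valued generalization, which may differ in detail from the implementation used in the experiments.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance; sensitive to document length."""
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores vector magnitude."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        return 1.0  # assumed convention when either vector is all zeros
    return 1.0 - dot(a, b) / norm


def tanimoto_distance(a, b):
    """1 - (a.b) / (|a|^2 + |b|^2 - a.b); considers angle and magnitude."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    if denom == 0:
        return 0.0  # assumed convention: two all-zero vectors are identical
    return 1.0 - ab / denom


def nearest_centroid(doc, centroids, distance):
    """Index of the centroid closest to doc under the given distance
    (the assignment step of K-Means)."""
    return min(range(len(centroids)), key=lambda i: distance(doc, centroids[i]))


# Toy term-count vectors (hypothetical data).
doc = [10, 0, 5]
centroids = [
    [2, 0, 1],  # same term proportions as doc, but a much shorter document
    [8, 0, 8],  # similar length to doc, different term proportions
]

for measure in (euclidean, squared_euclidean, cosine_distance, tanimoto_distance):
    print(measure.__name__, nearest_centroid(doc, centroids, measure))
```

On this toy data the Cosine measure assigns the document to the first centroid, because it compares only term proportions, while the Euclidean measures pick the second, because they are dominated by document length. This is one plausible reason the choice of measure matters for text, though it does not by itself reproduce the findings reported above.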
