Top 10 algorithms in data mining
Abstract
This paper presents the top 10 data mining algorithms identified by the IEEE
International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM,
Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms
are among the most influential data mining algorithms in the research community. With each
algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and
review current and further research on the algorithm. These 10 algorithms cover classification,
clustering, statistical learning, association analysis, and link mining, which are all among the
most important topics in data mining research and development.
Introduction
In an effort to identify some of the most influential algorithms that have been widely used
in the data mining community, the IEEE International Conference on Data Mining (ICDM,
http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining for presentation
at ICDM ’06 in Hong Kong.
As the first step in the identification process, in September 2006 we invited the ACM KDD
Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate
up to 10 best-known algorithms in data mining. All except one in this distinguished
set of award winners responded to our invitation. We asked that each nomination provide the
following information: (a) the algorithm name, (b) a brief justification, and (c) a representative
publication reference. We also advised that each nominated algorithm should have been
widely cited and used by other researchers in the field, and that the nominations from each nominator
as a group should have a reasonable representation of the different areas in data mining.
Ruleset classifiers
Complex decision trees can be difficult to understand, for instance because information about
one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism
consisting of a list of rules of the form “if A and B and C and ... then class X”, where rules for
each class are grouped together. A case is classified by finding the first rule whose conditions
are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.
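The first-matching-rule scheme described above can be sketched as follows. This is an illustrative sketch only, not C4.5's implementation; the rule contents, feature names, and class labels are invented for the example.

```python
# Sketch of C4.5-style ruleset classification: a case is assigned the class
# of the first rule whose conditions all hold, else a default class.
def classify(case, ruleset, default_class):
    """Each rule is (conditions, label); `conditions` is a list of
    predicates over the case dictionary."""
    for conditions, label in ruleset:
        if all(cond(case) for cond in conditions):
            return label
    return default_class

# Hypothetical rules, with rules for each class grouped together, e.g.
# "if outlook == sunny and humidity > 75 then Don't Play".
ruleset = [
    ([lambda c: c["outlook"] == "sunny",
      lambda c: c["humidity"] > 75], "Don't Play"),
    ([lambda c: c["outlook"] == "overcast"], "Play"),
]

print(classify({"outlook": "sunny", "humidity": 80}, ruleset, "Play"))
```

Because evaluation stops at the first satisfied rule, the order of rules matters; C4.5 groups the rules for each class together before choosing a default.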
C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root
of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path
and whose class is the label of the leaf. This rule is then simplified by determining the effect of
discarding each condition in turn. Dropping a condition may increase the number N of cases
covered by the rule, and also the number E of cases that do not belong to the class nominated
by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing
algorithm is used to drop conditions until the lowest pessimistic error rate is found.
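The hill-climbing simplification above can be sketched as follows. One caveat: where C4.5 uses the upper bound of a binomial confidence interval as the pessimistic error rate, this sketch substitutes the simpler Laplace correction (E + 1) / (N + 2) so the example stays self-contained; the training cases and conditions are invented.

```python
# Greedy (hill-climbing) rule simplification: repeatedly drop the condition
# whose removal yields the lowest error estimate, until no drop helps.
def error_estimate(conditions, label, cases):
    """Laplace-corrected error over cases covered by the rule.
    (Stand-in for C4.5's pessimistic binomial upper bound.)"""
    covered = [c for c in cases if all(cond(c) for cond in conditions)]
    n = len(covered)                                   # cases covered, N
    e = sum(1 for c in covered if c["class"] != label)  # misclassified, E
    return (e + 1) / (n + 2)

def simplify_rule(conditions, label, cases):
    best = error_estimate(conditions, label, cases)
    improved = True
    while improved and conditions:
        improved = False
        for i in range(len(conditions)):
            trial = conditions[:i] + conditions[i + 1:]
            est = error_estimate(trial, label, cases)
            if est <= best:  # dropping condition i does not hurt
                best, conditions, improved = est, trial, True
                break
    return conditions

# Invented example: class "X" depends only on feature a, so the
# condition on b is redundant and should be dropped.
cases = [
    {"a": 1, "b": 1, "class": "X"},
    {"a": 1, "b": 0, "class": "X"},
    {"a": 0, "b": 1, "class": "Y"},
]
rule = [lambda c: c["a"] == 1, lambda c: c["b"] == 1]
simplified = simplify_rule(rule, "X", cases)
```

Dropping the condition on `b` widens coverage from one case to two without introducing errors, so the estimate falls and the condition is discarded; dropping the condition on `a` would admit a "Y" case, so it is kept.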
Research issues
We have frequently heard colleagues express the view that decision trees are a “solved problem.”
We do not agree with this proposition and will close with a couple of open research
problems.
Stable trees. It is well known that the error rate of a tree on the cases from which it was constructed
(the resubstitution error rate) is much lower than the error rate on unseen cases (the
predictive error rate). For example, on a well-known letter recognition dataset with 20,000
cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out
(20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from
20,000 often affects the tree that is constructed!
Suppose now that we could develop a non-trivial tree-construction algorithm that was
hardly ever affected by omitting a single case. For such stable trees, the resubstitution error
rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree
is of the “right” size.
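The gap between resubstitution and leave-one-out error can be demonstrated with a toy experiment. The sketch below uses a 1-nearest-neighbour classifier rather than C4.5, simply because it makes the gap easy to see in a few lines: 1-NN always attains 0% resubstitution error, while its leave-one-out error is not zero; the data points are invented.

```python
# Resubstitution vs. leave-one-out cross-validated error for a toy
# 1-nearest-neighbour classifier on one numeric feature.
def nn_predict(x, train):
    """Label of the training point nearest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def resubstitution_error(data):
    """Error on the same cases the classifier was 'built' from."""
    wrong = sum(1 for x, y in data if nn_predict(x, data) != y)
    return wrong / len(data)

def loo_error(data):
    """Leave-one-out: predict each case from the remaining cases."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        held_out = data[:i] + data[i + 1:]
        if nn_predict(x, held_out) != y:
            wrong += 1
    return wrong / len(data)

# Invented dataset: (feature value, class label).
data = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b"), (1.4, "a")]

print(resubstitution_error(data))  # 0.0 — each point is its own neighbour
print(loo_error(data))             # 0.2 — (2.0, "b") is misclassified
```

A stable algorithm, in the sense the section describes, would be one for which these two numbers approximately agree even for non-trivial models.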