Top 10 algorithms in data mining
Abstract
This paper presents the top 10 data mining algorithms identified by the IEEE
International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM,
Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms
are among the most influential data mining algorithms in the research community. With each
algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and
review current and further research on the algorithm. These 10 algorithms cover classification,
clustering, statistical learning, association analysis, and link mining, which are all among the
most important topics in data mining research and development.
Introduction
In an effort to identify some of the most influential algorithms that have been widely used
in the data mining community, the IEEE International Conference on Data Mining (ICDM,
http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining for presentation
at ICDM ’06 in Hong Kong.
As the first step in the identification process, in September 2006 we invited the ACM KDD
Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate
up to 10 best-known algorithms in data mining. All except one in this distinguished
set of award winners responded to our invitation. We asked that each nomination provide the
following information: (a) the algorithm name, (b) a brief justification, and (c) a representative
publication reference. We also advised that each nominated algorithm should have been
widely cited and used by other researchers in the field, and that the nominations from each nominator
as a group should have a reasonable representation of the different areas in data mining.
Ruleset classifiers
Complex decision trees can be difficult to understand, for instance because information about
one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism
consisting of a list of rules of the form “if A and B and C and ... then class X”, where rules for
each class are grouped together. A case is classified by finding the first rule whose conditions
are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.
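The first-matching-rule scheme described above can be sketched as follows. This is an illustrative sketch only, not C4.5's implementation; the rule contents, feature names, and class labels are invented for the example.

```python
# Sketch of C4.5-style ruleset classification: a case is assigned the class
# of the first rule whose conditions all hold, else a default class.
def classify(case, ruleset, default_class):
    """Each rule is (conditions, label); `conditions` is a list of
    predicates over the case dictionary."""
    for conditions, label in ruleset:
        if all(cond(case) for cond in conditions):
            return label
    return default_class

# Hypothetical rules, with rules for each class grouped together, e.g.
# "if outlook == sunny and humidity > 75 then Don't Play".
ruleset = [
    ([lambda c: c["outlook"] == "sunny",
      lambda c: c["humidity"] > 75], "Don't Play"),
    ([lambda c: c["outlook"] == "overcast"], "Play"),
]

print(classify({"outlook": "sunny", "humidity": 80}, ruleset, "Play"))
```

Because evaluation stops at the first satisfied rule, the order of rules matters; C4.5 groups the rules for each class together before choosing a default.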
C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root
of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path
and whose class is the label of the leaf. This rule is then simplified by determining the effect of
discarding each condition in turn. Dropping a condition may increase the number N of cases
covered by the rule, and also the number E of cases that do not belong to the class nominated
by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing
algorithm is used to drop conditions until the lowest pessimistic error rate is found.
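The hill-climbing simplification above can be sketched as follows. One caveat: where C4.5 uses the upper bound of a binomial confidence interval as the pessimistic error rate, this sketch substitutes the simpler Laplace correction (E + 1) / (N + 2) so the example stays self-contained; the training cases and conditions are invented.

```python
# Greedy (hill-climbing) rule simplification: repeatedly drop the condition
# whose removal yields the lowest error estimate, until no drop helps.
def error_estimate(conditions, label, cases):
    """Laplace-corrected error over cases covered by the rule.
    (Stand-in for C4.5's pessimistic binomial upper bound.)"""
    covered = [c for c in cases if all(cond(c) for cond in conditions)]
    n = len(covered)                                   # cases covered, N
    e = sum(1 for c in covered if c["class"] != label)  # misclassified, E
    return (e + 1) / (n + 2)

def simplify_rule(conditions, label, cases):
    best = error_estimate(conditions, label, cases)
    improved = True
    while improved and conditions:
        improved = False
        for i in range(len(conditions)):
            trial = conditions[:i] + conditions[i + 1:]
            est = error_estimate(trial, label, cases)
            if est <= best:  # dropping condition i does not hurt
                best, conditions, improved = est, trial, True
                break
    return conditions

# Invented example: class "X" depends only on feature a, so the
# condition on b is redundant and should be dropped.
cases = [
    {"a": 1, "b": 1, "class": "X"},
    {"a": 1, "b": 0, "class": "X"},
    {"a": 0, "b": 1, "class": "Y"},
]
rule = [lambda c: c["a"] == 1, lambda c: c["b"] == 1]
simplified = simplify_rule(rule, "X", cases)
```

Dropping the condition on `b` widens coverage from one case to two without introducing errors, so the estimate falls and the condition is discarded; dropping the condition on `a` would admit a "Y" case, so it is kept.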
Research issues
We have frequently heard colleagues express the view that decision trees are a “solved problem.”
We do not agree with this proposition and will close with a couple of open research
problems.
Stable trees. It is well known that the error rate of a tree on the cases from which it was constructed
(the resubstitution error rate) is much lower than the error rate on unseen cases (the
predictive error rate). For example, on a well-known letter recognition dataset with 20,000
cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out
(20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from
20,000 often affects the tree that is constructed!
Suppose now that we could develop a non-trivial tree-construction algorithm that was
hardly ever affected by omitting a single case. For such stable trees, the resubstitution error
rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree
is of the “right” size.
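The gap between resubstitution and leave-one-out error can be demonstrated with a toy experiment. The sketch below uses a 1-nearest-neighbour classifier rather than C4.5, simply because it makes the gap easy to see in a few lines: 1-NN always attains 0% resubstitution error, while its leave-one-out error is not zero; the data points are invented.

```python
# Resubstitution vs. leave-one-out cross-validated error for a toy
# 1-nearest-neighbour classifier on one numeric feature.
def nn_predict(x, train):
    """Label of the training point nearest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def resubstitution_error(data):
    """Error on the same cases the classifier was 'built' from."""
    wrong = sum(1 for x, y in data if nn_predict(x, data) != y)
    return wrong / len(data)

def loo_error(data):
    """Leave-one-out: predict each case from the remaining cases."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        held_out = data[:i] + data[i + 1:]
        if nn_predict(x, held_out) != y:
            wrong += 1
    return wrong / len(data)

# Invented dataset: (feature value, class label).
data = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b"), (1.4, "a")]

print(resubstitution_error(data))  # 0.0 — each point is its own neighbour
print(loo_error(data))             # 0.2 — (2.0, "b") is misclassified
```

A stable algorithm, in the sense the section describes, would be one for which these two numbers approximately agree even for non-trivial models.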