Abstract This paper presents the top 10 data mining algorithms identified by the IEEE
International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM,
Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms
are among the most influential data mining algorithms in the research community. With each
algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and
review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
Introduction
In an effort to identify some of the most influential algorithms that have been widely used
in the data mining community, the IEEE International Conference on Data Mining (ICDM,
http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining for presentation
at ICDM ’06 in Hong Kong.
As the first step in the identification process, in September 2006 we invited the ACM KDD
Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate
up to 10 best-known algorithms in data mining. All except one in this distinguished
set of award winners responded to our invitation. We asked each nominator to provide the
following information: (a) the algorithm name, (b) a brief justification, and (c) a representative
publication reference. We also advised that each nominated algorithm should have been
widely cited and used by other researchers in the field, and the nominations from each nominator
as a group should have a reasonable representation of the different areas in data mining.
After the nominations in Step 1, we verified each nomination for its citations on Google
Scholar in late October 2006, and removed those nominations that did not have at least 50
citations. All remaining (18) nominations were then organized in 10 topics: association analysis,
classification, clustering, statistical learning, bagging and boosting, sequential patterns,
integrated mining, rough sets, link mining, and graph mining. For some of these 18 algorithms
such as k-means, the representative publication was not necessarily the original paper that
introduced the algorithm, but a recent paper that highlights the importance of the technique.
These representative publications are available at the ICDM website (http://www.cs.uvm.
edu/~icdm/algorithms/CandidateList.shtml).
In the third step of the identification process, we had a wider involvement of the research
community. We invited the Program Committee members of KDD-06 (the 2006 ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining), ICDM ’06 (the
2006 IEEE International Conference on Data Mining), and SDM ’06 (the 2006 SIAM International
Conference on Data Mining), as well as the ACM KDD Innovation Award and IEEE
ICDM Research Contributions Award winners to each vote for up to 10 well-known algorithms
from the 18-algorithm candidate list. The voting results of this step were presented at
the ICDM ’06 panel on Top 10 Algorithms in Data Mining.
At the ICDM ’06 panel of December 21, 2006, we also took an open vote with all 145
attendees on the top 10 algorithms from the above 18-algorithm candidate list, and the top 10
algorithms from this open vote were the same as the voting results from the above third step.
The 3-hour panel was organized as the last session of the ICDM ’06 conference, in parallel
with 7 paper presentation sessions of the Web Intelligence (WI ’06) and Intelligent Agent
Technology (IAT ’06) conferences at the same location. That the panel nonetheless attracted
145 participants clearly showed it was a great success.
1 C4.5 and beyond
1.1 Introduction
Systems that construct classifiers are among the most commonly used tools in data mining. Such
systems take as input a collection of cases, each belonging to one of a small number of
classes and described by its values for a fixed set of attributes, and output a classifier that can
accurately predict the class to which a new case belongs.
These notes describe C4.5 [64], a descendant of CLS [41] and ID3 [62]. Like CLS and
ID3, C4.5 generates classifiers expressed as decision trees, but it can also construct classifiers
in more comprehensible ruleset form. We will outline the algorithms employed in
C4.5, highlight some changes in its successor See5/C5.0, and conclude with a couple of open
research issues.
1.2 Decision trees
Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm
as follows:
• If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with
the most frequent class in S.
• Otherwise, choose a test based on a single attribute with two or more outcomes. Make
this test the root of the tree with one branch for each outcome of the test, partition S into
corresponding subsets S1, S2,... according to the outcome for each case, and apply the
same procedure recursively to each subset.
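The divide-and-conquer recursion above can be sketched in Python. This is a minimal illustration, not C4.5's actual code: `choose_test` is a hypothetical callback that would implement the test-selection heuristics described next, returning an attribute and a partitioning function, or None when no useful test exists.

```python
from collections import Counter

def grow_tree(cases, labels, choose_test, min_size=2):
    """Grow a decision tree by divide-and-conquer.

    `cases` is a list of attribute-value dicts, `labels` the parallel list
    of class labels, and `choose_test` a caller-supplied selector returning
    (attribute, partition_fn) or None.
    """
    majority = Counter(labels).most_common(1)[0][0]
    # If all cases belong to one class, or S is small, emit a leaf
    # labeled with the most frequent class in S.
    if len(set(labels)) == 1 or len(cases) < min_size:
        return ("leaf", majority)
    test = choose_test(cases, labels)
    if test is None:
        return ("leaf", majority)
    attribute, partition = test
    # Make the test the root, with one branch per outcome; recurse on
    # each subset S1, S2, ...
    branches = {}
    for outcome, (sub_cases, sub_labels) in partition(cases, labels).items():
        branches[outcome] = grow_tree(sub_cases, sub_labels, choose_test, min_size)
    return ("node", attribute, majority, branches)
```

The majority class is stored at internal nodes as well, which the pruning pass described below would need when collapsing a subtree into a leaf.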
There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic
criteria to rank possible tests: information gain, which minimizes the total entropy of the
subsets {Si } (but is heavily biased towards tests with numerous outcomes), and the default
gain ratio that divides information gain by the information provided by the test outcomes.
Attributes can be either numeric or nominal and this determines the format of the test
outcomes. For a numeric attribute A they are {A ≤ h, A > h} where the threshold h is found
by sorting S on the values of A and choosing the split between successive values that maximizes
the criterion above. An attribute A with discrete values has by default one outcome
for each value, but an option allows the values to be grouped into two or more subsets with
one outcome for each subset.
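The two ranking criteria can be sketched as follows, assuming plain Shannon entropy in bits; the function names are illustrative rather than C4.5's own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, outcome_per_case):
    """Information gain divided by the information provided by the test outcomes.

    `outcome_per_case[i]` is the test outcome for case i; the subsets {Si}
    are the groups of cases sharing an outcome.
    """
    n = len(labels)
    subsets = {}
    for outcome, label in zip(outcome_per_case, labels):
        subsets.setdefault(outcome, []).append(label)
    # Information gain: reduction in total entropy of the subsets {Si}.
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the outcome distribution itself,
    # penalizing tests with numerous outcomes.
    split_info = entropy(outcome_per_case)
    return gain / split_info if split_info > 0 else 0.0
```

A test that splits the classes perfectly into two equal subsets gets gain 1 bit and split information 1 bit, hence ratio 1; a test whose outcomes convey nothing gets ratio 0.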
The initial tree is then pruned to avoid overfitting. The pruning algorithm is based on a
pessimistic estimate of the error rate associated with a set of N cases, E of which do not
belong to the most frequent class. Instead of E/N, C4.5 determines the upper limit of the
binomial probability when E events have been observed in N trials, using a user-specified
confidence whose default value is 0.25.
Pruning is carried out from the leaves to the root. The estimated error at a leaf with N
cases and E errors is N times the pessimistic error rate as above. For a subtree, C4.5 adds
the estimated errors of the branches and compares this to the estimated error if the subtree is
replaced by a leaf; if the latter is no higher than the former, the subtree is pruned. Similarly,
C4.5 checks the estimated error if the subtree is replaced by one of its branches and when
this appears beneficial the tree is modified accordingly. The pruning process is completed in
one pass through the tree.
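The pessimistic estimate can be illustrated by solving for the binomial upper confidence limit numerically. C4.5's implementation details differ (it uses a closed-form approximation), so this bisection version is a sketch of the idea rather than its exact code.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def pessimistic_error_rate(errors, n, cf=0.25):
    """Upper confidence limit U with P(X <= errors; n, U) = cf.

    Instead of errors/n, take the largest error probability still
    consistent (at confidence cf, default 0.25) with observing `errors`
    misclassified cases out of n.
    """
    lo, hi = errors / n, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def estimated_errors(errors, n, cf=0.25):
    """Estimated errors at a leaf: N times the pessimistic error rate."""
    return n * pessimistic_error_rate(errors, n, cf)
```

For example, a pure leaf covering a single case (E = 0, N = 1) gets a pessimistic rate of 0.75 at the default confidence, which is what makes tiny error-free leaves look unattractive and allows pruning to collapse them.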
C4.5’s tree-construction algorithm differs in several respects from CART [9], for instance:
• Tests in CART are always binary, but C4.5 allows two or more outcomes.
• CART uses the Gini diversity index to rank tests, whereas C4.5 uses information-based
criteria.
• CART prunes trees using a cost-complexity model whose parameters are estimated by
cross-validation; C4.5 uses a single-pass algorithm derived from binomial confidence
limits.
• The treatment of cases with unknown attribute values also differs: CART looks for surrogate
tests that approximate the outcomes when the tested attribute has an unknown value,
whereas C4.5 apportions the case probabilistically among the outcomes.
1.3 Ruleset classifiers
Complex decision trees can be difficult to understand, for instance because information about
one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism
consisting of a list of rules of the form “if A and B and C and ... then class X”, where rules for
each class are grouped together. A case is classified by finding the first rule whose conditions
are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.
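The first-match classification scheme can be sketched in a few lines; the rule representation here (a list of attribute-value conditions per rule) is an illustrative choice, not C4.5's internal format.

```python
def classify(case, ruleset, default_class):
    """Return the class of the first rule whose conditions all hold.

    Each rule is (conditions, cls), where conditions is a list of
    (attribute, value) pairs that must all be satisfied by the case.
    Rules for each class are assumed to be grouped together in `ruleset`.
    """
    for conditions, cls in ruleset:
        if all(case.get(attr) == val for attr, val in conditions):
            return cls
    # No rule fires: fall back to the default class.
    return default_class
```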
C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root
of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path
and whose class is the label of the leaf. This rule is then simplified by determining the effect of
discarding each condition in turn. Dropping a condition may increase the number N of cases
covered by the rule, and also the number E of cases that do not belong to the class nominated
by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing
algorithm is used to drop conditions until the lowest pessimistic error rate is found.
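The greedy condition-dropping loop might look like the sketch below. As a self-contained stand-in for the binomial pessimistic rate described earlier, it uses a Laplace-style (E + 1)/(N + 2) estimate; `matches` is a hypothetical helper, not part of C4.5.

```python
def simplify_rule(conditions, cases, labels, target_class, matches):
    """Greedily drop rule conditions while the pessimistic error rate improves.

    `matches(case, conditions)` says whether a case satisfies all conditions.
    The pessimistic rate used here is the Laplace-corrected (E + 1) / (N + 2),
    a simple proxy for the binomial confidence limit.
    """
    def pessimistic(conds):
        covered = [l for c, l in zip(cases, labels) if matches(c, conds)]
        n = len(covered)
        e = sum(1 for l in covered if l != target_class)
        return (e + 1) / (n + 2)

    best = pessimistic(conditions)
    improved = True
    while improved and conditions:
        improved = False
        for cond in list(conditions):
            trial = [c for c in conditions if c != cond]
            rate = pessimistic(trial)
            if rate <= best:  # dropping this condition does not hurt
                conditions, best, improved = trial, rate, True
                break
    return conditions
```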
To complete the process, a subset of simplified rules is selected for each class in turn.
These class subsets are ordered to minimize the error on the training cases and a default
class is chosen. The final ruleset usually has far fewer rules than the number of leaves on the
pruned decision tree.
The principal disadvantage of C4.5’s rulesets is the amount of CPU time and memory that
they require. In one experiment, samples ranging from 10,000 to 100,000 cases were drawn
from a large dataset. For decision trees, moving from 10K to 100K cases increased CPU time
on a PC from 1.4 to 61 s, a factor of 44. The time required for rulesets, however, increased
from 32 to 9,715 s, a factor of 300.
1.4 See5/C5.0
C4.5 was superseded in 1997 by a commercial system See5/C5.0 (or C5.0 for short). The
changes encompass new capabilities as well as much-improved efficiency, and include:
• A variant of boosting [24], which constructs an ensemble of classifiers that are then voted
to give a final classification. Boosting often leads to a dramatic improvement in predictive
accuracy.
• New data types (e.g., dates), “not applicable” values, variable misclassification costs, and
mechanisms to pre-filter attributes.
• Unordered rulesets—when a case is classified, all applicable rules are found and voted.
This improves both the interpretability of rulesets and their predictive accuracy.
• Greatly improved scalability of both decision trees and (particularly) rulesets. Scalability
is enhanced by multi-threading; C5.0 can take advantage of computers with multiple
CPUs and/or cores.
More details are available from http://rulequest.com/see5-comparison.html.
1.5 Research issues
We have frequently heard colleagues express the view that decision trees are a “solved problem.”
We do not agree with this proposition and will close with a couple of open research
problems.
Stable trees. It is well known that the error rate of a tree on the cases from which it was constructed
(the resubstitution error rate) is much lower than the error rate on unseen cases (the
predictive error rate). For example, on a well-known letter recognition dataset with 20,000
cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out
(20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from
20,000 often affects the tree that is constructed!
Suppose now that we could develop a non-trivial tree-construction algorithm that was
hardly ever affected by omitting a single case. For such stable trees, the resubstitution error
rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree
is of the “right” size.
Decomposing complex trees. Ensemble classifiers, whether generated by boosting, bagging,
weight randomization, or other techniques, usually offer improved predictive accuracy.
Now, given a small number of decision trees, it is possible to generate a single (very complex)
tree that is exactly equivalent to voting the original trees, but can we go the other way? That
is, can a complex tree be broken down to a small collection of simple trees that, when voted
together, give the same result as the complex tree? Such decomposition would be of great
help in producing comprehensible decision trees.
2 The k-means algorithm
2.1 The algorithm
The k-means algorithm is a simple iterative method to partition a given dataset into a
user-specified number of clusters, k. The algorithm has been independently discovered by several
researchers across different disciplines, most notably Lloyd (1957, 1982) [53], Forgy (1965),
Friedman and Rubin (1967), and MacQueen (1967). A detailed history of k-means along with descriptions
of several variations are given in [43]. Gray and Neuhoff [34] provide a nice historical
background for k-means placed in the larger context of hill-climbing algorithms.
The algorithm operates on a set of d-dimensional vectors, D = {x_i | i = 1, ..., N}, where
x_i ∈ ℝ^d denotes the ith data point. The algorithm is initialized by picking k points in ℝ^d as
the initial k cluster representatives or “centroids”. Techniques for selecting these initial seeds
include sampling at random from the dataset, setting them as the solution of clustering a
small subset of the data or perturbing the global mean of the data k times. Then the algorithm
iterates between two steps till convergence:
Step 1: Data Assignment. Each data point is assigned to its closest centroid, with ties
broken arbitrarily. This results in a partitioning of the data.
Step 2: Relocation of “means”. Each cluster representative is relocated to the center
(mean) of all data points assigned to it. If the data points come with a probability measure
(weights), then the relocation is to the expectations (weighted mean) of the data partitions.
The algorithm converges when the assignments (and hence the centroid locations c_j) no longer change.
The algorithm execution is visually depicted in Fig. 1. Note that each iteration needs N × k
comparisons, which determines the time complexity of one iteration. The number of iterations
required for convergence varies and may depend on N, but as a first cut, this algorithm
can be considered linear in the dataset size.
One issue to resolve is how to quantify “closest” in the assignment step. The default
measure of closeness is the Euclidean distance, in which case one can readily show that the
non-negative cost function

\sum_{i=1}^{N} \min_{j} \lVert x_i - c_j \rVert_2^2 \qquad (1)

decreases whenever a data point changes its assignment or a centroid is relocated, so the
algorithm is guaranteed to converge in a finite number of iterations.
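The two alternating steps can be sketched in a few lines of Python. This is a minimal Lloyd-style implementation with random seeding from the data; a production version would add smarter initialization (e.g. k-means++) and empty-cluster handling.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-relocation steps.

    `points` is a list of d-dimensional tuples. Returns (centroids,
    assignment), where assignment[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial seeds sampled from the data
    assignment = None
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid (Euclidean),
        # ties broken by the lowest index.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments unchanged: converged
        assignment = new_assignment
        # Step 2: relocate each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment
```

Each pass through the outer loop performs the N × k distance comparisons noted above.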
2.2 Limitations
In addition to being sensitive to initialization, the k-means algorithm suffers from several
other problems. First, observe that k-means is a limiting case of fitting data by a mixture of
k Gaussians with identical, isotropic covariance matrices (Σ = σ²I), when the soft assignments
of data points to mixture components are hardened to allocate each data point solely
to the most likely component. So, it will falter whenever the data is not well described by
reasonably separated spherical balls, for example, if there are non-convex shaped clusters in
the data. This problem may be alleviated by rescaling the data to “whiten” it before clustering,
or by using a different distance measure that is more appropriate for the dataset. For example,
information-theoretic clustering uses the KL-divergence to measure the distance between two
data points representing two discrete probability distributions. It has been recently shown that
if one measures distance by selecting any member of a very large class of divergences called
Bregman divergences during the assignment step and makes no other changes, the essential
properties of k-means, including guaranteed convergence, linear separation boundaries and
scalability, are retained [3]. This result makes k-means effective for a much larger class of
datasets so long as an appropriate divergence is used.
k-means can be paired with another algorithm to describe non-convex clusters. One
first clusters the data into a large number of groups using k-means. These groups are then
agglomerated into larger clusters using single link hierarchical clustering, which can detect
complex shapes. This approach also makes the solution less sensitive to initialization, and
since the hierarchical method provides results at multiple resolutions, one does not need to
pre-specify k either.
The cost of the optimal solution decreases with increasing k till it hits zero when the
number of clusters equals the number of distinct data-points. This makes it more difficult
to (a) directly compare solutions with different numbers of clusters and (b) to find the optimum
value of k. If the desired k is not known in advance, one will typically run k-means
with different values of k, and then use a suitable criterion to select one of the results. For
example, SAS uses the cube-clustering-criterion, while X-means adds a complexity term
(which increases with k) to the original cost function (Eq. 1) and then identifies the k which
minimizes this adjusted cost. Alternatively, one can progressively increase the number of
clusters, in conjunction with a suitable stopping criterion. Bisecting k-means [73] achieves
this by first putting all the data into a single cluster, and then recursively splitting the least
compact cluster into two using 2-means. The celebrated LBG algorithm [34] used for vector
quantization doubles the number of clusters till a suitable code-book size is obtained. Both
these approaches thus alleviate the need to know k beforehand.
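Bisecting k-means can be sketched as follows, with a small inline 2-means as the splitting subroutine. This is illustrative code, assuming k does not exceed the number of points; [73] describes the actual algorithm.

```python
import random

def two_means(points, iters=50, seed=0):
    """Minimal 2-means used as the splitting subroutine; returns two halves."""
    rng = random.Random(seed)
    c = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, cj)) for cj in c]
            halves[d.index(min(d))].append(p)
        if not halves[0] or not halves[1]:
            return halves  # degenerate split; caller handles it
        new_c = [tuple(sum(x) / len(h) for x in zip(*h)) for h in halves]
        if new_c == c:
            break
        c = new_c
    return halves

def sse(cluster):
    """Sum of squared distances to the cluster mean (compactness)."""
    mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, mean)) for p in cluster)

def bisecting_kmeans(points, k):
    """Put all data in one cluster, then repeatedly split the least compact."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)  # least compact cluster
        clusters.remove(worst)
        h1, h2 = two_means(worst)
        if not h1 or not h2:  # degenerate 2-means split: halve instead
            mid = len(worst) // 2
            h1, h2 = worst[:mid], worst[mid:]
        clusters.extend([list(h1), list(h2)])
    return clusters
```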
The algorithm is also sensitive to the presence of outliers, since “mean” is not a robust
statistic. A preprocessing step to remove outliers can be helpful. Post-processing the results,
for example to eliminate small clusters, or to merge close clusters into a large cluster, is also
desirable. Ball and Hall’s ISODATA algorithm from 1967 effectively used both pre- and
post-processing on k-means.
2.3 Generalizations and connections
As mentioned earlier, k-means is closely related to fitting a mixture of k isotropic Gaussians
to the data. Moreover, the generalization of the distance measure to all Bregman divergences
is related to fitting the data with a mixture of k components from the exponential family of
distributions. Another broad generalization is to view the “means” as probabilistic models
instead of points in Rd . Here, in the assignment step, each data point is assigned to the most
likely model to have generated it. In the “relocation” step, the model parameters are updated
to best fit the data points assigned to it. Such model-based k-means allows one to cater to more
complex data, e.g. sequences described by Hidden Markov models.
One can also “kernelize” k-means [19]. Though boundaries between clusters are still
linear in the implicit high-dimensional space, they can become non-linear when projected
back to the original space, thus allowing kernel k-means to deal with more complex clusters.
Dhillon et al. [19] have shown a close connection between kernel k-means and spectral
clustering. The K-medoid algorithm is similar to k-means except that the centroids have to
belong to the data set being clustered. Fuzzy c-means is also similar, except that it computes
a fuzzy membership function for each cluster rather than a hard assignment.
Despite its drawbacks, k-means remains the most widely used partitional clustering
algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable,
and can be easily modified to deal with streaming data. To deal with very large datasets,
substantial effort has also gone into further speeding up k-means, most notably by using
kd-trees or exploiting the triangular inequality to avoid comparing each data point with all
the centroids during the assignment step. Continual improvements and generalizations of the basic algorithm have ensured its continued relevance and gradually increased its effectiveness
as well.
3 Support vector machines
In today’s machine learning applications, support vector machines (SVMs) [83] are considered
a must-try: they offer one of the most robust and accurate methods among all well-known
algorithms. They have a sound theoretical foundation, require only a dozen examples for training,
and are insensitive to the number of dimensions. In addition, efficient methods for training
SVM are also being developed at a fast pace.
In a two-class learning task, the aim of SVM is to find the best classification function
to distinguish between members of the two classes in the training data. The metric for the
concept of the “best” classification function can be realized geometrically. For a linearly separable
dataset, a linear classification function corresponds to a separating hyperplane f (x)
that passes through the middle of the two classes, separating them. Once this function is
determined, a new data instance x_n can be classified by simply testing the sign of the function
f(x_n); x_n belongs to the positive class if f(x_n) > 0.
Because there are many such linear hyperplanes, what SVMs additionally guarantee is that
the best such function is found by maximizing the margin between the two classes. Intuitively,
the margin is defined as the amount of space, or separation, between the two classes
as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance
from the hyperplane to the data points closest to it. Having this geometric definition
allows us to explore how to maximize the margin, so that even though there are an infinite
number of hyperplanes, only a few qualify as the solution to SVM.
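For a fixed linear classifier, both the sign test and the geometric margin are easy to compute. The sketch below uses hypothetical helper names and assumes labels in {+1, -1}; it evaluates a given hyperplane rather than training one.

```python
import math

def decision(w, b, x):
    """Linear decision function f(x) = w · x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Positive class if f(x) > 0, negative class otherwise."""
    return 1 if decision(w, b, x) > 0 else -1

def geometric_margin(w, b, data):
    """Shortest signed distance from the hyperplane to any training point.

    `data` is a list of (x, y) pairs with y in {+1, -1}; SVM training
    chooses w and b so as to maximize this quantity.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in data)
```

Scaling w and b together leaves the classifier unchanged but rescales w · x + b, which is why the margin is normalized by the length of w.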
The reason why SVMs insist on finding the maximum margin hyperplane is that it offers the
best generalization ability: it not only gives the best classification performance (e.g.,
accuracy) on the training data, but also leaves the most room for correct classification of
future data. To ensure that the maximum margin hyperplane is actually found, an SVM
classifier attempts to minimize the following function with respect to w and b:

L_P = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{t} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{t} \alpha_i \qquad (2)

where t is the number of training examples, and α_i, i = 1, ..., t, are non-negative numbers
such that the derivatives of L_P with respect to the α_i are zero. The α_i are the Lagrange
multipliers and L_P is called the Lagrangian. In this equation, the vectors w and the constant
b define the hyperplane.
There are several important questions and related extensions on the above basic formulation
of support vector machines. We list these questions and extensions below.
1. Can we understand the meaning of the SVM through a solid theoretical foundation?
2. Can we extend the SVM formulation to handle cases where we allow errors to exist,
when even the best hyperplane must admit some errors on the training data?
3. Can we extend the SVM formulation so that it works in situations where the training
data are not linearly separable?
4. Can we extend the SVM formulation so that the task is to predict numerical values or
to rank the instances in the likelihood of being a positive class member, rather than
classification?