15-02-2013, 09:48 AM
K-means Clustering
K-means.ppt (Size: 677 KB / Downloads: 61)
Introduction
Partitioning Clustering Approach
a typical clustering analysis approach via partitioning data set iteratively
construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters given in advance)
in principle, partitions achieved via minimising the sum of squared distance to its “representative object” in each cluster
Given a K, find a partition of K clusters to optimise the chosen partitioning criterion
global optimal: exhaustively enumerate all partitions
heuristic method: K-means algorithm
K-means algorithm (MacQueen’67): each cluster is represented by the centre of the cluster and the algorithm converges to stable centres of clusters.
K-mean Algorithm
Initialisation: set seed points
Assign each object to the cluster with the nearest seed point
Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster)
Go back to Step 1), stop when no more new assignment
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
How many steps are needed for convergence?
What are memberships of two clusters?
What are centroids of two clusters after convergence?
How K-mean partitions?
When K centroids are set/fixed,
they partition the whole space
into K subspaces constituting a partitioning.
Changing positions of centroids leads to a new partitioning.
A partitioning amounts to a
Relevant Issues
Efficient in computation
O(tKn), where n is number of objects, K is number of clusters, and t is number of iterations. Normally, K, t << n.
Local optimum
sensitive to initial seed points
converge to a local optimum that may be unwanted solution
Other problems
Need to specify K, the number of clusters, in advance
Unable to handle noisy data and outliers (K-Medoids algorithm)
Not suitable for discovering clusters with non-convex shapes
Applicable only when mean is defined, then what about categorical data? (K-mode algorithm)
Conclusion
K-means algorithm is a simple yet popular method for clustering analysis
Its performance is determined by initialisation and appropriate distance measure
There are several variants of K-means to overcome its weaknesses
K-Medoids: resistance to noise and/or outliers
K-Modes: extension to categorical data clustering analysis
CLARA: dealing with large data sets
Mixture models (EM algorithm): handling uncertainty of clusters