28-06-2012, 12:08 PM
Introduction of Clustering
Introduction of Clustering.ppt (Size: 931.5 KB / Downloads: 169)
What is Clustering?
Attach label to each observation or data points in a set
You can say this “unsupervised classification”
Clustering is alternatively called as “grouping”
Intuitively, if you would want to assign same label to a data points that are “close” to each other
Thus, clustering algorithms rely on a distance metric between data points
Sometimes, it is said that the for clustering, the distance metric is more important than the clustering algorithm.
K-means Overview
An unsupervised clustering algorithm
“K” stands for number of clusters, it is typically a user input to the algorithm; some criteria can be used to automatically estimate K
It is an approximation to an NP-hard combinatorial optimization problem
K-means algorithm is iterative in nature
It converges, however only a local minimum is obtained
Works only for numerical data
Easy to implement
K-means: summary
Algorithmically, very simple to implement
K-means converges, but it finds a local minimum of the cost function
Works only for numerical observations
K is a user input; alternatively BIC (Bayesian information criterion) or MDL (minimum description length) can be used to estimate K
Outliers can considerable trouble to K-means
K-medoids Clustering
K-means is appropriate when we can work with Euclidean distances
Thus, K-means can work only with numerical, quantitative variable types
Euclidean distances do not work well in at least two situations
Some variables are categorical
Outliers can be potential threats
A general version of K-means algorithm called K-medoids can work with any distance measure
K-medoids clustering is computationally more intensive
Otsu’s Image Thresholding Method
Based on the clustering idea: Find the threshold that minimizes the weighted within-cluster point scatter.
This turns out to be the same as maximizing the between-class scatter.
Operates directly on the gray level histogram [e.g. 256 numbers, P(i)], so it’s fast (once the histogram is computed).