19-04-2012, 12:42 PM
k-means++
kMeansPlusPlus.ppt (Size: 242.5 KB / Downloads: 33)
The k-means Problem
Given integer k and n data points in Rd
Partition points into k “similar” clusters
More formally:
Choose k centers ci so as to minimize potential function:
Lloyd’s Algorithm
Often just the “k-means method”
Guess k initial centers ci:
How? Uniformly at random from data points
Repeat until stable:
Assign each point to the closest ci
Set ci to be the center of mass of points assigned to it
Approximation Algorithms
Lots of approximation algorithms available:
Har-Peled and Mazumdar ’04:
1 + ε approximation in O(n + kk+2 ε -2dk logk(n/ε)) time
Kanungo, Mount, et al. ’04:
9 + ε approximation in O(n3 / εd) time
Facility location with “relaxed” metric spaces
The Intuition
Easy way to fix this mistake:
Make centers far away from each other
The k-means++ way:
Choose starting centers iteratively
Let D(x) be distance from x to nearest existing center
Take x as new center with prob. proportional to D(x)2
Run standard Lloyd’s method with these centers
Conclusion
Friends don’t let friends use k-means!