27-02-2013, 01:52 PM
Minimum Spanning Tree Based Clustering Algorithms
Abstract
The minimum spanning tree clustering algorithm is
known to be capable of detecting clusters with irregular
boundaries. In this paper, we propose two minimum spanning
tree based clustering algorithms. The first algorithm
produces a k-partition of a set of points for any given k. The
algorithm constructs a minimum spanning tree of the point
set and removes edges that satisfy a predefined criterion.
The process is repeated until k clusters are produced. The
second algorithm partitions a point set into a group of clusters
by maximizing the overall standard deviation reduction,
without a given k value. We present our experimental results
comparing our proposed algorithms to k-means and
EM. We also apply our algorithms to image color clustering
and compare our algorithms to the standard minimum
spanning tree clustering algorithm.
Introduction
A spanning tree is an acyclic subgraph of a graph G
that contains all the vertices of G. The minimum
spanning tree (MST) of a weighted graph is the minimum-weight
spanning tree of that graph. With the classical MST
algorithms [18, 13, 15], the cost of constructing a minimum
spanning tree is O(m log n), where m is the number
of edges in the graph and n is the number of vertices. More
efficient algorithms for constructing MSTs have also been
extensively researched [12, 7, 8]; these algorithms promise
close to linear time complexity under different assumptions.
A Euclidean minimum spanning tree (EMST) is a spanning
tree of a set of n points in a metric space (E^n), where the
length of an edge is the Euclidean distance between a pair
of points in the point set.
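The EMST construction described above can be sketched with
Kruskal's algorithm run over the complete graph on the point
set; the function and variable names here are illustrative,
not from the paper.

```python
# Sketch: Euclidean minimum spanning tree (EMST) via Kruskal's
# algorithm over the complete graph on a small 2-D point set.
import math
from itertools import combinations

def euclidean_mst(points):
    """Return the EMST as a list of (weight, i, j) edges."""
    n = len(points)
    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two components: keep it
            parent[ri] = rj
            mst.append((w, i, j))
            if len(mst) == n - 1:
                break
    return mst
```

Sorting all O(n^2) pairs is quadratic, so this sketch matches
the classical bounds above rather than the faster specialized
EMST constructions.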
Related work
Clustering algorithms based on minimum and maximum
spanning trees have been extensively studied. In the mid
80’s, Avis [2] found an O(n^2 log^2 n) algorithm for the minmax
diameter 2-clustering problem. Asano, Bhattacharya,
Keil, and Yao [1] later gave an optimal O(n log n) algorithm
using maximum spanning trees for minimizing the
maximum diameter of a bipartition. The problem becomes
NP-complete when the number of partitions is beyond
two [11]. Asano, Bhattacharya, Keil, and Yao also
considered the clustering problems in which the goal is to
maximize the minimum intercluster distance. They gave an
O(n log n) algorithm for computing a k-partition of a point
set by removing the k − 1 longest edges from the minimum
spanning tree constructed from that point set [1].
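The k-partition just described can be sketched directly:
drop the k − 1 heaviest edges of an already-built MST and
read off the connected components as clusters. The helper
name below is illustrative.

```python
# Sketch: k-partition of a point set by removing the k-1
# longest edges from its MST and labeling the components.
def k_clusters(n, mst_edges, k):
    """mst_edges: (weight, i, j) tuples of an MST over n points."""
    kept = sorted(mst_edges)[: n - k]   # discard the k-1 heaviest
    adj = {i: [] for i in range(n)}
    for _, i, j in kept:
        adj[i].append(j)
        adj[j].append(i)
    labels, label = [-1] * n, 0
    for start in range(n):
        if labels[start] == -1:         # new component: flood fill
            stack = [start]
            while stack:
                u = stack.pop()
                if labels[u] == -1:
                    labels[u] = label
                    stack.extend(adj[u])
            label += 1
    return labels
```

Since a tree on n points has n − 1 edges, keeping the n − k
lightest leaves exactly k components.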
Zahn [24] proposes to construct an MST of a point set
and delete inconsistent edges, i.e., the edges whose weights
are significantly larger than the average weight of the nearby
edges in the tree. Zahn’s inconsistency measure is defined
as follows. Let e denote an edge in the MST of the point
set, let v1 and v2 be the end nodes of e, and let w be the
weight of e. A depth-d neighborhood N of an end node v of
an edge e is defined as the set of all edges that belong to all
the paths of length d originating from the node v, excluding
the paths that include the edge e. Let N1 and N2 be the
depth-d neighborhoods of the end nodes v1 and v2. Let w_N1
be the average weight of edges in N1 and σ_N1 be its standard
deviation; similarly, let w_N2 be the average weight of edges
in N2 and σ_N2 be its standard deviation. The edge e is then
judged inconsistent when w is significantly larger than these
neighborhood averages, e.g., when it exceeds both w_N1 and
w_N2 by more than a preset number of standard deviations.
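Zahn's neighborhood test can be sketched as follows. The
particular cutoff used here, w > mean + c·σ on both sides,
is one common instantiation, and the constant c is an
assumption; Zahn discusses several criteria.

```python
# Sketch of Zahn's inconsistency test on an MST: an edge is
# flagged when its weight exceeds mean + c*stddev of BOTH of
# its depth-d endpoint neighborhoods (the cutoff form and the
# constant c are assumptions, not fixed by the paper).
import statistics
from collections import defaultdict

def inconsistent_edges(mst_edges, d=2, c=2.0):
    adj = defaultdict(list)        # node -> [(neighbor, weight)]
    for w, i, j in mst_edges:
        adj[i].append((j, w))
        adj[j].append((i, w))

    def neighborhood(v, banned, depth):
        """Weights of edges on paths of length <= depth from v,
        never entering `banned` (i.e., excluding edge e)."""
        out, frontier, seen = [], [v], {v, banned}
        for _ in range(depth):
            nxt = []
            for u in frontier:
                for x, w in adj[u]:
                    if x not in seen:
                        seen.add(x)
                        out.append(w)
                        nxt.append(x)
            frontier = nxt
        return out

    flagged = []
    for w, i, j in mst_edges:
        for v, other in ((i, j), (j, i)):
            ws = neighborhood(v, other, d)
            if len(ws) < 2:
                break              # too few edges to judge
            if w <= statistics.fmean(ws) + c * statistics.pstdev(ws):
                break              # not significantly heavier
        else:
            flagged.append((w, i, j))
    return flagged
```

On a chain of unit-weight edges with one long edge in the
middle, only the long edge is flagged.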
Our EMST-based clustering algorithms
By and large, the inherent cluster structure of a point
set in a metric space is closely related to how objects or
concepts are embedded in the point set. In practice, the
approximate number of embedded objects can sometimes
be acquired with the help of domain experts. Other times,
this information is hidden and unavailable to the clustering
algorithm. In this section, we present two EMST-based
clustering algorithms: the first assumes a given cluster
number, and we call it the hierarchical EMST clustering
algorithm, or HEMST; the second does not, and we call it the
maximum standard deviation reduction clustering algorithm,
or MSDR.
Our MSDR clustering algorithm
Our HEMST algorithm assumes that the desired number
of clusters is given in advance. In practice, determining
the number of clusters is often coupled with discovering the
cluster structure. In this section, we present another EMSTbased
clustering algorithm—the maximum standard deviation
reduction clustering algorithm, or MSDR which does
not require a predefined cluster number.
Our MSDR algorithm first constructs an EMST from the
given point set S. Next, it computes the standard deviation
of the edges in the EMST, and removes an edge to obtain
a set of two disjoint subtrees such that the overall standard
deviation reduction is maximized. This edge removing process
is repeated to create more disjoint subtrees until the
overall standard deviation reduction is within a threshold.
The desired number of clusters is obtained by finding the
local minimum of the standard deviation reduction function.
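The greedy loop above can be sketched as follows. Two
assumptions not fixed by the excerpt: the "overall standard
deviation" is taken here as the sum of per-subtree standard
deviations of edge weights, and the loop stops when the best
achievable reduction falls below a `threshold` parameter.

```python
# Greedy sketch of the MSDR edge-removal loop: repeatedly cut
# the edge whose removal most reduces the overall standard
# deviation of edge weights, stopping when the reduction is
# below `threshold`. The objective form is an assumption.
import statistics
from collections import defaultdict

def edge_std(edges):
    return statistics.pstdev([w for w, _, _ in edges]) if len(edges) > 1 else 0.0

def split(edges, cut):
    """Partition a tree's edges into the two subtrees left by removing `cut`."""
    _, i, _ = cut
    adj = defaultdict(set)
    for e in edges:
        if e != cut:
            _, a, b = e
            adj[a].add(b)
            adj[b].add(a)
    side, stack = {i}, [i]      # nodes reachable from i without cut
    while stack:
        for x in adj[stack.pop()] - side:
            side.add(x)
            stack.append(x)
    rest = [e for e in edges if e != cut]
    return ([e for e in rest if e[1] in side],
            [e for e in rest if e[1] not in side])

def msdr(mst_edges, threshold=0.1):
    forest = [list(mst_edges)]
    while True:
        current = sum(edge_std(t) for t in forest)
        best = None             # (new_total, tree_index, half_a, half_b)
        for t, tree in enumerate(forest):
            for cut in tree:
                a, b = split(tree, cut)
                total = current - edge_std(tree) + edge_std(a) + edge_std(b)
                if best is None or total < best[0]:
                    best = (total, t, a, b)
        if best is None or current - best[0] < threshold:
            return forest       # each edge list is one cluster
        _, t, a, b = best
        forest[t:t + 1] = [a, b]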
Conclusions and future work
We have demonstrated that our proposed EMST-based
clustering algorithms are very effective when applied to various
clustering problems. Our HEMST clustering algorithm
assumes a given cluster number. The algorithm gradually
finds a set of k representative points that serve as an “attractor”
to points nearby, and outputs the inherent cluster
structure subsequently. We have shown that our algorithm
works much more reliably than the simple SEMST clustering
algorithm. Our MSDR algorithm automatically determines
the desired number of clusters. The objective function
is defined to maximize the overall standard deviation
reduction.