Clustering by Synchronization
ABSTRACT
Synchronization is a powerful basic concept in nature regu-
lating a large variety of complex processes ranging from the
metabolism in the cell to social behavior in groups of individ-
uals. Therefore, synchronization phenomena have been ex-
tensively studied and models robustly capturing the dynam-
ical synchronization process have been proposed, e.g. the
Extensive Kuramoto Model. Inspired by the powerful con-
cept of synchronization, we propose Sync, a novel approach
to clustering. The basic idea is to view each data object as a
phase oscillator and simulate the interaction behavior of the
objects over time. As time evolves, similar objects naturally
synchronize together and form distinct clusters. Inherited from synchronization, Sync has several desirable properties: the clusters revealed by dynamic synchronization truly reflect the intrinsic structure of the data set; Sync does not rely on any distribution assumption and allows detecting clusters of arbitrary number, shape and size. Moreover, the
concept of synchronization allows natural outlier handling,
since outliers do not synchronize with cluster objects. For
fully automatic clustering, we propose to combine Sync with
the Minimum Description Length principle. Extensive ex-
periments on synthetic and real world data demonstrate the
effectiveness and efficiency of our approach.
Categories and Subject Descriptors
H.2.8 [Database applications]: Data mining
General Terms
Algorithms, Design, Reliability
Keywords
Synchronization, Clustering, Kuramoto Model
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD’10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/10/07 ...$10.00.
1. INTRODUCTION
Clustering is essential for knowledge discovery in a large
variety of applications. During the last decades, cluster-
ing has therefore attracted a huge volume of attention, with
multiple books, e.g. [18], surveys, e.g. [23] and research pa-
pers, e.g. [3, 4, 11, 31] to mention a few. Many approaches
have been proposed to address the clustering problem from
different points of view, and each cluster notion comes with specific advantages and drawbacks. As an example, consider
the wide-spread Expectation Maximization (EM) algorithm
[11]. The result of EM, a mixture model of multivariate
Gaussians, is very useful for interpretation in many applica-
tions. However, EM only yields good results if the data at
least approximately follows the model assumption, i.e. con-
sists of Gaussian clusters. Moreover, the number of clusters
needs to be specified as an input parameter and the result
of EM is very sensitive w.r.t. outliers.
In this paper, we consider clustering from a novel point of view: synchronization. Synchronization is the phenomenon that a group of events spontaneously comes into co-occurrence with a common rhythm, despite the differences between the individual rhythms of the events. Synchronous behavior has attracted a large volume of interest in physics, biology, ecology, sociology, communication and other fields of science and technology. It is known that synchrony is rooted in human life, from the metabolic processes in our cells to the highest cognitive tasks we perform as groups of individuals [5]. We will demonstrate that clustering by synchronization has many attractive benefits, but let us first illustrate the basic idea.
1.1 Basic Idea
Synchronization phenomena are prevalent in everyday life; as an example, consider opinion formation. In the beginning, each person has their own view about a certain problem. After interaction by conversation or discussion, people with similar backgrounds, such as education or hobbies, will easily group together and form a common opinion. As time evolves, in the phase of local synchronization, groups with diverse opinions are formed. After further conversation with different groups, all people may finally achieve a uniform opinion (global synchronization).
Inspired by these synchronization phenomena, we exploit
the synchronization dynamics to obtain a natural clustering
of a given data set. The key idea is to regard each data ob-
ject as a phase oscillator. Each object interacts dynamically
Figure 1: Clustering by synchronization. (a) The initial state of the objects: the black arrows indicate the mutual interaction and the red arrows indicate the direction of movement for some sample objects. (b) The comparison of the objects' states before and after one time step: black points represent the initial state, red points indicate the new state. (c) The final state of the objects in local synchronization: synchronized clusters and outliers.
with similar objects. We will elaborate a formal description
of the dynamical object interaction patterns by a reformu-
lated Kuramoto model in Section 3.2 and now provide the
intuition of clustering by synchronization. The process of
the dynamical clustering involves the following stages: First,
starting from initial conditions, each object runs indepen-
dently at its own phase. As time evolves, those objects with
the highest similarity will synchronize first. Then, in a sequential process, more and more objects synchronize, and clusters are produced, driven by the intrinsic structure of the data set. Finally, the whole population is split into several stable, distinct synchronized clusters. Outliers are effectively detected due to their different dynamic behavior: they hardly synchronize with any of the cluster objects.
Figure 1 displays 3 snapshots of the simulated dynamical
movement. For simplicity, 2-dimensional data are displayed.
Consider the three objects P1, P2 and P3, where P1 and P2 are cluster objects and P3 is an outlier. For the border object P1, the neighborhood contains fewer objects and all neighbors are located towards the inside of the cluster. Therefore, the border object is driven towards the inside of the cluster through the non-linear dynamical interaction with its neighbors (Figure 1(a)). The larger the distance between an object and its neighbors, the stronger the interaction imposed, which implies that neighbors at larger distances have a stronger impact on the object. For center objects, e.g. P2, the neighborhood contains more objects; however, all these neighbors are located around the object, causing a relatively slow movement of P2 in comparison to border objects, such
as P1. The movement of objects depends on the structure
of their overall neighborhood, which implies that the objects
will move towards the main direction of their neighborhood.
Through the non-linear interaction, all objects with similar attributes finally synchronize together. In contrast, outlier objects remain isolated from all other objects, e.g. P3. Figure 1(b) displays the old and the new positions of the objects for comparison. The objects with high similarity gradually synchronize together through mutual coupling. The initial points (black) are replaced by the red points after one time step. Figure 1(c) depicts the final states of the objects. The data set is split into clusters and outliers, where cluster objects are synchronized together and have the same phase, while outlier objects remain isolated over time.
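The qualitative dynamics just described can be sketched in a few lines. The following is an illustrative approximation (componentwise sinusoidal coupling over a range-eps neighborhood) and not the paper's exact interaction model, which is derived in Section 3.2; the data, eps value, and step count are made up for the example:

```python
import numpy as np

def sync_step(X, eps):
    """One interaction step: every object moves toward its eps-range
    neighborhood via a bounded (sinusoidal) coupling. Illustrative
    only -- the formal model is given in Section 3.2 of the paper."""
    X_new = X.copy()
    for i, x in enumerate(X):
        # neighbors within range eps (the object itself is included;
        # sin(0) = 0, so it contributes nothing to the sum)
        nbrs = X[np.linalg.norm(X - x, axis=1) <= eps]
        # componentwise coupling, averaged over the neighborhood
        X_new[i] = x + np.sin(nbrs - x).sum(axis=0) / len(nbrs)
    return X_new

# toy data: two tight blobs plus one isolated outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (20, 2)),
               rng.normal(1.5, 0.05, (20, 2)),
               [[5.0, 5.0]]])
for _ in range(30):
    X = sync_step(X, eps=0.5)
# cluster objects collapse onto (almost) identical points, one point
# per cluster; the outlier has no neighbors and never moves
```

Note how the outlier handling falls out of the dynamics for free: an object with an empty neighborhood (apart from itself) receives zero coupling and simply stays where it is.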
1.2 Contributions
The major benefits of our algorithm Sync (for Clustering by Synchronization) can be summarized as follows:
1. Clusters detected by Sync truly reflect the intrinsic data structure. Rooted in the concept of synchronization, Sync allows detecting clusters of arbitrary number, shape, size and object density without requiring any distribution assumptions.
2. Sync is a powerful technique for clustering and outlier
detection. Based on the different dynamical behavior of objects during the process of synchronization, outliers are naturally and effectively detected.
3. Sync is robust against parameter settings and fully automatic through its combination with the Minimum Description Length principle. Relying on synchronization, even the basic version of our algorithm is relatively robust against parameter settings in comparison to other clustering paradigms. For automatic clustering, we combine Sync with the Minimum Description Length principle.
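As background for the third point: the Minimum Description Length principle prefers the clustering under which model plus data can be encoded most compactly. The toy score below (a spherical-Gaussian two-part code; our own illustration, not the coding scheme actually used by Sync) shows why two well-separated blobs earn a shorter code as two clusters than as one:

```python
import numpy as np

def description_length(X, labels):
    """Generic two-part MDL score: cost of the model (grows with the
    number of clusters) plus cost of the data given the model (negative
    log-likelihood, in bits, under a spherical Gaussian per cluster).
    Toy illustration of the principle, not the coding scheme of Sync."""
    n, d = X.shape
    clusters = np.unique(labels)
    # model cost: d mean parameters + 1 variance per cluster,
    # each encoded with the standard 0.5 * log2(n) bits
    model_cost = len(clusters) * (d + 1) * 0.5 * np.log2(n)
    data_cost = 0.0
    for k in clusters:
        pts = X[labels == k]
        var = ((pts - pts.mean(axis=0)) ** 2).mean() + 1e-12
        # bits to encode this cluster's points under N(mean, var * I)
        data_cost += 0.5 * pts.size * (np.log2(2 * np.pi * var) + 1 / np.log(2))
    return model_cost + data_cost

# two well-separated blobs: compare a 1-cluster and a 2-cluster model
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
one = np.zeros(60, dtype=int)          # everything in one cluster
two = np.repeat([0, 1], 30)            # the correct split
# description_length(X, two) < description_length(X, one)
```

The extra model cost of the second cluster is more than repaid by the much smaller per-cluster variance, so the two-cluster model wins; with a single diffuse blob the opposite would hold.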
The remainder of this paper is organized as follows: In the
following section, we briefly survey related work. Section
3 presents our algorithm in detail. Section 4 contains an
extensive experimental evaluation and Section 5 concludes
the paper.
2. RELATED WORK
During the past several decades, many algorithms were
proposed for clustering, such as EM [11], CURE [16],
CLIQUE [3], BIRCH [31], CLARANS [25], DBSCAN [13],
etc. Due to space limitations, we can only provide a very
brief survey on some important major research directions.
In addition, we briefly introduce related work on synchronization phenomena.
PDF-based Clustering. The key idea of these methods
is to detect clusters by using a model of probability den-
sity functions (PDFs) to describe the data structure. The
most fundamental techniques in this line are K-means [22]
and EM [11]. These methods are suitable for detecting spherical Gaussian clusters, and the number of clusters K needs to be specified by the user. The algorithm X-means [26] extends K-means by a technique for automatically detecting K founded on information theory. The algorithm G-means [17] additionally allows detecting non-spherical Gaussian clusters. Another information-theoretic method, RIC [8],
has been designed as a postprocessing step to improve an
initial clustering of an arbitrary conventional clustering al-
gorithm. The cluster model comprises a rotation matrix
determined by PCA and a PDF assigned to each coordinate
selected from a set of predefined PDFs. Clusters with similar characteristics are finally merged. However, the result strongly depends on the quality of the initial clustering, and the cluster model is limited to linear attribute correlations and a predefined set of PDFs. Recently, OCI [9] was proposed to detect clusters using Independent Components. This algorithm uses the Exponential Power Distribution as cluster model and applies Independent Component Analysis for
determining the main directions inside a cluster as well as for
finding split planes in a top-down clustering approach. All
methods in this category tend to fail if the data distribution
does not correspond to the cluster model. An alternative
largely avoiding restricting assumptions on the data distri-
bution is the idea of Parzen Windows. Local information
on the object density is collected by histograms over local
windows in the feature space or by local kernel functions,
e.g. Gaussians. Thereby, at a global level, arbitrary data
distributions can be captured. The algorithm MeanShift
[10] combines this idea with clustering: During the run of
the algorithm, the windows move towards the direction of
the highest object density until convergence. However, the
window size needs to be suitably selected and our experi-
ments demonstrate that this algorithm tends to degrade in
performance in the presence of outliers and noise points.
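The window update that MeanShift iterates can be sketched as follows. This is a bare-bones flat-kernel version with made-up parameter values, not the full algorithm of [10]:

```python
import numpy as np

def mean_shift(X, bandwidth, iters=50):
    """Move every window center toward the local density maximum:
    repeatedly replace each center by the mean of the points falling
    inside its window (flat kernel of the given bandwidth).
    Bare-bones sketch, not the full MeanShift algorithm of [10]."""
    centers = X.copy()
    for _ in range(iters):
        for i, c in enumerate(centers):
            # points of the data set inside the current window
            inside = X[np.linalg.norm(X - c, axis=1) <= bandwidth]
            centers[i] = inside.mean(axis=0)  # shift toward higher density
    return centers

# two dense blobs: all centers collapse onto the two density modes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)),
               rng.normal(3.0, 0.1, (30, 2))])
modes = mean_shift(X, bandwidth=1.0)
```

The sketch also makes the weakness noted above visible: the bandwidth must be chosen to match the data scale, and an outlier far from any blob simply becomes its own "mode" rather than being flagged.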
Density-Based and Spectral Clustering. In density-
based clustering, clusters are regarded as regions of high ob-
ject density in the data space which are separated by regions
of low object density (noise). The algorithm DBSCAN [13] formalizes this idea using two parameters: the neighborhood of a given radius ε has to contain at least a minimum number of objects (MinPts). Arbitrarily shaped clusters can be easily detected by DBSCAN; however, the choice of suitable settings for ε and MinPts is often difficult, especially since the parameters are correlated. Conceptually closely related, spectral clustering refers to a class of techniques which
rely on the Eigenstructure of a similarity matrix to partition
objects into disjoint clusters. These algorithms, e.g. [24, 7]
detect arbitrarily shaped clusters by considering the clus-
tering problem from a graph-theoretic perspective. A clus-
tering is obtained by removing the weakest edges between
highly connected subgraphs, which is formally defined by
the normalized cut or similar objective functions. Similar to K-means, these approaches face the problem of choosing a suitable number of clusters, and they are sensitive w.r.t.
outliers. The approach [7] partially overcomes these prob-
lems by proposing a new cost function for spectral clustering
based on the error between a given graph partitioning and
a clustering result. However, this approach requires sam-
ple data with a known partitioning which is not available
in most cases. The approach of Affinity Propagation [14]
is closely related to spectral and density-based clustering.
Each data point is regarded as a node in a network. The
algorithm performs clustering by letting the data points ex-
change messages along the edges of the networks. By mes-
sage passing, certain data points emerge as so-called exem-
plars (cluster representatives) and the other data points es-
tablish their cluster membership. As an input parameter, a real value must be specified for each data point, characterizing the probability that this particular point is selected as
an exemplar. The number of clusters is controlled by this
parameter but also by the structure of the network.
Kuramoto Model and Synchronization. Synchro-
nization phenomena in large populations of interacting el-
ements are the subject of intense research efforts in physical, biological, chemical, and social systems. The extensive
Kuramoto model [20, 21, 1] is one of the most successful
approaches to explore synchronization phenomena. Seliger
et al. [27] discuss mechanisms of learning and plasticity
in networks of phase oscillators through a generalized Ku-
ramoto model. Arenas et al. [6] apply the Kuramoto model
for network analysis, and study the relationship between
topological scales and dynamic time scales in complex net-
works. This analysis provides a useful connection between
synchronization dynamics, network topology and spectral
graph analysis. From bioinformatics, Kim et al. [19] propose a strategy to find groups of genes by analyzing the cell-cycle specific gene expression with a modified Kuramoto model.
Aeyels et al. [2] introduce a mathematical model for the dynamics of chaotic systems. They characterize the data structure by a set of inequalities in the parameters of the model and apply it to a system of interconnected water basins.
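For concreteness, the globally coupled Kuramoto model underlying these works evolves each phase as dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i); when the coupling K is well above the spread of the natural frequencies ω_i, the population synchronizes and the order parameter r approaches 1. A small simulation sketch with illustrative parameter values:

```python
import numpy as np

def kuramoto(theta, omega, K, dt=0.05, steps=2000):
    """Euler-integrate the classic globally coupled Kuramoto model:
    dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(theta)
    for _ in range(steps):
        # coupling[i] = sum_j sin(theta_j - theta_i)
        coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
        theta = theta + dt * (omega + (K / n) * coupling)
    return theta

rng = np.random.default_rng(1)
n = 50
theta0 = rng.uniform(0, 2 * np.pi, n)   # random initial phases
omega = rng.normal(0.0, 0.1, n)         # slightly heterogeneous frequencies
theta = kuramoto(theta0, omega, K=2.0)  # coupling >> frequency spread

# order parameter r in [0, 1]: r -> 1 indicates global synchronization
r = abs(np.exp(1j * theta).mean())
```

This is the global perspective criticized in the summary below; Sync instead restricts the interaction to local neighborhoods so that several distinct synchronized groups, i.e. clusters, can coexist.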
In summary, previous approaches mainly focus on the syn-
chronization phenomena of a dynamic system from a global
perspective. Moreover, to the best of our knowledge, the synchronization principle and the Kuramoto model have not been investigated in the data mining community. Inspired by ideas
from dynamical system analysis, we propose a novel cluster-
ing technique based on local synchronization.
3. CLUSTERING BY SYNCHRONIZATION
Synchronization phenomena are prevalent in nature. For
example, the e®ect of synchrony has been described in exper-
iments of people conversation, song or rhythm, or of groups
of children interacting to an unconscious beat. In all cases
the purpose of the common wave length or rhythm is to
strengthen the group bond. The lack of such synchrony
can index unconscious tension, when goals cannot be iden-
ti¯ed nor worked towards because the members are "out of
sync". In this section, we introduce Sync, to explore clus-
ters by synchronization. We start with an introduction of
the Kuramoto Model and then develop a variant suitable for
clustering. In Section 3.3 we discuss the algorithm Sync in
detail.