GRAPH CLUSTERING by FLOW SIMULATION
Introduction
Cluster Analysis is the mathematical study of methods for recognizing natural groups
within a class of entities.
This is a common type of definition for the discipline of cluster analysis, and it is unsatisfactory
in at least one respect. However, this cause for dissatisfaction must be welcomed,
since it points directly to some of the essential questions and problems in cluster analysis.
These essential aspects will be briefly discussed in this introduction, followed by an
account of the main contribution of the thesis, the Markov Cluster Process, a cluster process
designed within the setting of graphs. This setting defines a relatively young area
in cluster analysis referred to as graph clustering, which has connections to the clearly
scoped field of graph partitioning. Clustering and graph-clustering methods are also
studied in the large research area labelled pattern recognition. These disciplines and the
applications studied therein form the natural habitat for the Markov Cluster Algorithm.
Their relative positions are sketched, anticipating a more thorough topography in the
first part of this thesis.
The Markov Cluster Process (abbreviated MCL process) defines a sequence of stochastic
matrices by alternation of two operators on a generating matrix. This process is basically all that
is needed for the clustering of graphs, but it is useful to distinguish between the algorithm
and the algebraic process employed by the algorithm. The study of mathematical properties
thus belongs to the process proper, and aspects such as interpretation and scaling
belong to the algorithm or a particular implementation of the algorithm. The second
part of this thesis covers the conceptual and theoretical foundation of the Markov Cluster
(abbreviated MCL) Algorithm and Process. Other proposals for graph clustering are
discussed, and several classic mathematical results are presented which are relevant in
the broad perspective of graph clustering and graph partitioning, in particular the relationship
between graph spectra and connectivity properties. The third part of the thesis
is concerned with algorithmic issues such as complexity, scalability, and implementation,
and various types of experiments and benchmarks establishing the particular strengths
and weaknesses of the algorithm.
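In the thesis the two alternating operators are called expansion (taking a matrix power, which spreads flow along the graph) and inflation (taking entrywise powers followed by column rescaling, which strengthens strong flow and weakens weak flow). The following minimal Python sketch may clarify the alternation; the function names, parameter defaults, and the small two-triangle test graph are illustrative, not taken from the thesis:

```python
import numpy as np

def normalize(M):
    # Rescale so that every column sums to 1 (column-stochastic matrix).
    return M / M.sum(axis=0)

def mcl(adjacency, e=2, r=2, max_iter=50, tol=1e-6):
    # Add self-loops (customary for MCL) and make the matrix stochastic.
    M = normalize(adjacency + np.eye(adjacency.shape[0]))
    for _ in range(max_iter):
        expanded = np.linalg.matrix_power(M, e)  # expansion: flow spreads out
        inflated = normalize(expanded ** r)      # inflation: strong flow is favoured
        if np.allclose(M, inflated, atol=tol):
            break
        M = inflated
    return M

def clusters_from_limit(M, tol=1e-6):
    # In the limit matrix, the nonzero rows belong to attractors; the support
    # of an attractor's row is the cluster whose flow it attracts.
    attractors = np.where(M.diagonal() > tol)[0]
    return [set(np.where(M[a] > tol)[0]) for a in attractors]

# Two triangles joined by a single bridge edge (2, 3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
limit = mcl(A)
print(clusters_from_limit(limit))
```

The inflation exponent r is the main parameter regulating granularity: higher values of r yield more, smaller clusters.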
Cluster analysis and graph clustering
The definition given above is a common denominator of the definitions found in cluster
monographs. It is rather dry, but that is a shortcoming inherent in definitions in general.
More importantly, it is vague. There is nothing wrong with the highly indeterminate
term `entities', as this indicates that methods are studied in abstracto, without fixing a specific
type of entity such as man or meteorite. The problem lies with the word
`natural', as anyone who has ever attempted to use the word argumentatively may confirm.
What a person judges to be natural is in general heavily influenced by personality,
world context, a priori knowledge, and implicit or explicit expectations. If attention is
restricted to the grouping of visual stimuli into patterns by the eye and mind, it is seen
that this process is really a complex interaction of low-level cognitive functions (which
can be situated in the eye) and high-level functions (representing contextual and emotional
stimuli). Witness for example Figure 1, adapted from [151]. What are the natural
groups? It depends. There are several clouds of data points. On the left they may be
seen to form a ring with a sphere in its centre, or the two may be seen as one big sphere,
or each cloud may be seen as a sphere in its own right. In general, if a ring is seen, then
it is `natural' to see one long stretched shape on the right, as this needs the same kind
of linking required for forming the ring. In this instance, there is a tendency to scale
the perceived objects such that they are balanced. The way in which cluster structure
is perceived locally is affected by the overall perceived cluster structure.
The Markov Cluster Algorithm
The main contribution of this thesis is a powerful new cluster algorithm with many good
properties, called the Markov Cluster Algorithm or MCL algorithm. It was designed within
the mathematical setting of graphs and inspired by a simple paradigm (Section 1.2.1 below),
naturally leading to the formulation of an algebraic process for stochastic matrices
called the MCL process. This process forms the engine of the MCL algorithm; one of the
algorithm's particular assets is that it does not contain high-level procedural rules for
the assembling, splitting, or joining of groups.
The problems and aims described in the previous section apply as much to graph clustering
as they do to the field of clustering at large. Thus far few methods have been formulated
which make distinct use of the topology offered by graphs. The methods that have
been formulated have not yet found their way into the cluster analysis monographs, the
main reason being that the graph model is still rather a new kid on the block. A detailed
account is given in Chapter 2. It should be mentioned that many classical methods in
cluster analysis are formulated in such an abstract manner that it is easy to apply them
in the setting of graphs. Notably, methods such as those found in the single-linkage family
even allow a graph-theoretical formulation, but the graph-theoretical concepts that
are used are neither very sophisticated nor powerful. The experiments in Chapter 10
indicate that this is unavoidable.
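To make the single-linkage remark concrete: the single-linkage clusters at dissimilarity level t are precisely the connected components of the graph that retains only the edges of weight at most t, which is why a graph-theoretical formulation comes so easily. A short sketch using a union-find structure (the function name and example data are illustrative):

```python
def single_linkage_clusters(n, edges, t):
    # Single-linkage clusters at threshold t = connected components of the
    # graph keeping only the edges (i, j, d) with dissimilarity d <= t.
    parent = list(range(n))

    def find(x):
        # Find the representative of x's component, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, d in edges:
        if d <= t:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj  # merge the two components
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# Five points with pairwise dissimilarities; a threshold of 2 splits them in two.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0), (3, 4, 1.0), (2, 3, 8.0)]
print(single_linkage_clusters(5, edges, t=2.0))  # → [[0, 1, 2], [3, 4]]
```

Only thresholding and connectivity are used here, which illustrates the point above: the graph-theoretical concepts involved are neither very sophisticated nor powerful.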
MCL experiments and benchmarking
The standardized comparison of cluster methods on common series of problems is altogether
non-existent in cluster analysis. This is a deplorable situation, even though
such standardized comparison (i.e. benchmarking) invites thorny problems. An obstacle
might be that benchmarking more or less requires that the quality of a solution is
measurable in terms of a cost function on clusterings. It is non-trivial to devise a cost
function for clusterings of arbitrary sizes and distribution, but still doable. A bigger
problem is that cost functions inevitably favour the detection of certain shapes above
others, e.g. spherical above stretched or vice versa. The main implication is that clustering
by optimization of a cost function has severe drawbacks if clusters can be of mixed
type, even disregarding the size of the search space. These issues need not affect benchmarking,
however, since the pairing of problem instances with the cost function used to judge them
can be fixed in advance.
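When a reference clustering is available, one standard way to sidestep the shape bias is to score agreement between two clusterings directly, for instance with the classic Rand index: the fraction of item pairs on which the clusterings agree, either both placing the pair together or both placing it apart. A small illustrative sketch, not taken from the thesis:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of item pairs on which the two clusterings agree:
    # either both put the pair in one cluster, or both separate it.
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0 (same partition, relabelled)
```

In practice chance-corrected variants such as the adjusted Rand index are usually preferred, since the raw index tends to be high even for random partitions with many clusters.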