Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract

Ant-based clustering: a comparative study of its relative performance with respect to k -means, average link

[attachment=44495]

Abstract.

Ant-based clustering and sorting is a nature-inspired heuristic for general
clustering tasks. It has been applied variously, from problems arising in commerce, to
circuit design, to text-mining, all with some promise. However, although early results
were broadly encouraging, there has been very limited analytical evaluation of antbased
clustering. Toward this end, we first propose a scheme that enables unbiased
interpretation of the clustering solutions obtained, and then use this to conduct a full
evaluation of the algorithm. Our analysis uses three sets each of real and artificial data,
and four distinct analytical measures. These results are compared with those obtained
using established clustering techniques and we find evidence that ant-based clustering
is a robust and viable alternative.

Introduction

Ant-based clustering and sorting [4] was inspired by the clustering of corpses and larvalsorting
activities observed in real ant colonies [3]. The algorithm’s basic principles are straightforward
[16]: Ants are modelled by simple agents that randomly move in their environment,
a square grid with periodic boundary conditions. Data items that are scattered within this
environment can be picked up, transported and dropped by the agents. The picking and dropping
operations are biased by the similarity and density of data items within the ants’ local
neighbourhood: ants are likely to pick up data items that are either isolated or surrounded by
dissimilar ones; they tend to drop them in the vicinity of similar ones. In this way, a clustering
and sorting of the elements on the grid is obtained. Hence, like ant colony optimisation (ACO,
[6]), ant-based clustering and sorting is a distributed process that employs positive feedback.
However, in contrast to ACO, no artificial pheromones are used; instead, the environment
itself serves as stigmergic variable [5].

Clustering

Clustering is concerned with the division of data into homogenous subgroups. Informally, the
objective of this division is twofold: data items within one cluster are required to be similar
to each other, while those within different clusters should be dissimilar. Problems of this
type arise in a variety of disciplines ranging from sociology and psychology, to commerce,
biology and computer science, and algorithms for tackling them continue to be the subject of
active research. Consequently, there exists a multitude of clustering methods, which differ not
only in the principles of the algorithm used (which of course determine runtime behaviour
and scalability), but also in many of their most basic properties, such as the data handled
(numerical vs. categorical and proximity data), assumptions on the shape of the clusters (e.g.,
spherically shaped), the form of the final partitioning (hard vs. fuzzy assignments) or the
parameters that have to be provided (e.g., the correct number of clusters).
The four main classes of clustering algorithms available in the literature are partitioning
methods, hierarchical methods, density-based clustering and grid-based clustering (see [1]
for an extensive survey). For the purpose of our comparative study we select two of the most
popular and well-studied algorithms from the literature: -means, a representative of the class
of partitioning methods, and agglomerative average link clustering, which is a hierarchical
approach. Additionally, we compare against one-dimensional self-organising maps. All three
algorithms are described in more detail in Section 4.
3 Ant-based clustering in the literature
Ant-based clustering and sorting was originally introduced for tasks in robotics by Deneubourg
[4]. Lumer and Faieta [16] modified the algorithm to extend to numerical data analysis and
it has subsequently been used for data-mining [17], graph-partitioning [15, 14, 13] and textmining
[11, 20, 10].

Algorithms

We will now detail the four algorithms used in our comparative study, starting with a short
description of the ant-based algorithm, and the methods developed to automatically determine
parameter settings and to retrieve clusters from the grid.

Basics

The basic ant algorithm (see Algorithm 1) starts with an initialisation phase, in which (i) all
data items are randomly scattered on the toroidal grid; (ii) each agent randomly picks up one
data item; and (iii) each agent is placed at a random position on the grid. Subsequently, the
sorting phase starts: this is a simple loop, in which (i) one agent is randomly selected; (ii) the
agent performs a step of a given

Spatial separation

As stated above, the spatial separation of clusters on the grid is crucial in order for individual
clusters to be well-defined. Spatial closeness, when it occurs, is, to a large degree, an artefact
of early cluster formation. This is because, early on in a run, clusters will tend to form wherever
there are locally dense regions of similar data items; and thereafter these clusters tend to
drift only very slowly on the grid. After an initial clustering phase, we therefore use a short
interlude (from time 6¨©«ªB©to I¬@A) with a modified neighbourhood function, which replaces
the scaling parameter N18 by ®¯1° >in Equation 3, where ±C is the actual observed number
of occupied grid cells within the local neighbourhood. Hence only similarity, not density, is
taken into account, which has the effect of spreading out data items on the grid again, but
in a sorted fashion; the data items belonging to different clusters will now occupy individual
‘regions’ on the grid. Subsequently, we turn back to using the traditional neighbourhood
function. Once again, clear clusters are formed, but they now have a high likelihood of being
generated along the centres of these ‘regions’, due to the lower neighbourhood quality at their
boundaries.

Evaluation

Comparative results are presented for three synthetic test sets and three real data collections
taken from the Machine Learning Repository [2], which we describe below. In a preprocessing
step, the data vectors are normalised in each dimension, and, for the ant-based algorithm
and agglomerative clustering, all pairwise dissimilarities are precomputed. The employed distance
function is the Euclidean distance1 for the synthetic data sets, and the Cosine measure2
for the real data sets.

Conclusion

In contrast to previous studies on ant-based clustering and sorting, we have studied the algorithm’s
clustering solutions in isolation, in order to obtain an improved understanding of its
performance relative to other methods for clustering. Nonetheless, it should not be forgotten
that the mapping simultaneously provided by the algorithm is one of its most attractive features.
To what extent this mapping really is (or can be improved to be) topology-preserving
is investigated in [9].

project girl