
A New Iterative Graph Based Clustering Method
Abstract—In this paper we propose a new iterative graph-based
clustering technique. The proposed method has three desirable
characteristics compared to well-known clustering techniques.
Specifically, because of its iterative nature, the number of clusters
contained in a given data set need not be known a priori,
as partitional clustering techniques demand. Its performance,
in terms of the quality of the achieved clustering, does not depend
on the distribution of cluster sizes. Finally, no threshold value
is required to achieve the clustering, as hierarchical clustering
techniques demand. All these features manifest themselves in the
important problem of clustering homologous proteins when only
sequence information is available. The proposed method is tested
against well-known and widely used techniques in a
number of experiments on both artificial and real protein
data sets, and all of the above characteristics are confirmed.
I. INTRODUCTION
An important problem in today’s genomics is that of
grouping together homologous proteins when only sequence
information is available. This is a difficult problem since
sequence similarity is a very noisy measure of evolutionary
relatedness [1]. The core of most methods proposed so far in
the bioinformatics literature is based on well-known clustering
techniques also used in other fields of science.
The importance of data clustering, the great number and
diversity of applications, along with the specific requirements
each of them poses, combined with the lack of a universally
accepted definition of the term cluster, has led to a plethora
of clustering methods over the past decades. In recent years
especially, the computer revolution has provided scientists with
very large amounts of data and the computation resources to
process and analyse them, leading to the development of modern
clustering techniques. The existence of this great number
of clustering methods supports what is nowadays
generally accepted: there exists no globally superior method.
Instead, each method has its strengths and weaknesses and is
best suited to a specific class of applications [2].
Although as mentioned above, the definition of a cluster is
somewhat broad and takes usually the shape of the specific
requirements of the application at hand, the general intuition
behind it is that clusters are groups of objects presenting
similar characteristics among them, and dissimilar to the rest
of the data set. If a data set comprises well-defined and
well-separated groups, then the cluster identification problem
becomes fairly trivial, and the vast majority of the
proposed solutions would yield very similar results. In real
applications however, this is seldom the case, as real data
sets are usually characterized by inhomogeneity in their interobject
relations, having strongly related groups co-existing or
possibly overlapping with more loosely connected groups, as
well as “irrelevant” objects. What makes the problem even
more difficult is that, in the majority of real applications,
this knowledge is not available a priori and safe assumptions
cannot be made. In these cases, the particular choice of
clustering method is of crucial importance, and different
methods applied to the same data set can yield highly
different results. Hence, it becomes clear that the selected
method should comply with the clustering requirements of the
application it is intended for, i.e., the interpretation the
method gives to the clustering objective, as well as the goals it sets
in order to achieve it, should match the needs of the application
at hand.
On the basis of the aforementioned interpretation of the
problem, as well as the a priori knowledge required from them,
the main bulk of the existing clustering methods can be loosely
classified into two broad categories: the partitional and
the hierarchical ones.
In partitional clustering methods, the data set at hand is
partitioned into a predefined number of clusters, by seeking
the partition that optimizes a clustering quality measure (e.g.,
modularity [3]). The most popular representative of this class
of methods is k-means [4], where the data are represented as
points in Euclidean space and the goal of the method is to
find the partition that minimizes the sum of distances of each
object from the center of mass of the cluster it is assigned to.
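The k-means procedure just described can be sketched as follows. This is a minimal, illustrative implementation of Lloyd's algorithm (not the paper's proposed method); note that the number of centers k is fixed by the caller up front, which is precisely the limitation the paper addresses:

```python
# Minimal sketch of Lloyd's algorithm for k-means on 2-D points.
# Illustrative only: k is implied by the number of initial centers,
# which must be supplied in advance.
def kmeans(points, centers, iters=20):
    """points: list of (x, y); centers: k initial (x, y) seeds."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

# Two well-separated groups; with k = 2 the partition is recovered easily.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, centers=[(0, 0), (10, 10)])
```

With well-separated data such as this, almost any seeding converges to the same partition; the difficulty in practice lies in choosing k when the group structure is unknown.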
In the same category belongs the family of spectral clustering
methods, which represent the given objects and the relations
between them in the form of a similarity graph, and seek to
partition the graph nodes into k disjoint groups, by introducing
suitable clustering objectives based on the spectral characteristics
of the graph, such as ratio cut (RCut), normalized
cut (NCut), and ratio association (RAssoc) [5], [6], [7]. It
has recently been shown that the above-mentioned spectral
clustering techniques are equivalent to kernel k-means
[8]. In both of the above families of methods, optimization
leads to NP-hard problems, and a solution is obtained through
relaxation of the criteria or via heuristic algorithms. The main
disadvantage of partitional clustering methods, though, is that
the number of clusters k, required beforehand, is in most
real cases unknown and difficult to guess, since in most
applications the only knowledge one is provided with is
the data set itself.
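The spectral objectives mentioned above can be illustrated on a toy similarity graph. The following is a minimal sketch of evaluating the ratio cut (RCut) of a partition, defined as the sum over clusters of the cut weight to the rest of the graph divided by the cluster size; the graph and function names are our own illustration, not taken from the paper:

```python
# Sketch of the ratio-cut (RCut) objective on a small weighted similarity
# graph, represented as an adjacency dict {node: {neighbor: weight}}.
def cut_weight(adj, part_a, part_b):
    """Total weight of edges going from part_a into part_b."""
    return sum(w for u in part_a for v, w in adj[u].items() if v in part_b)

def ratio_cut(adj, parts):
    """RCut = sum over clusters A_i of cut(A_i, complement) / |A_i|."""
    nodes = set(adj)
    return sum(cut_weight(adj, set(a), nodes - set(a)) / len(a) for a in parts)

# Toy graph: two dense triangles joined by a single weak edge (weight 0.1).
adj = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 0.1},
    3: {2: 0.1, 4: 1.0, 5: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {3: 1.0, 4: 1.0},
}
good = ratio_cut(adj, [{0, 1, 2}, {3, 4, 5}])  # cuts only the weak edge
bad = ratio_cut(adj, [{0, 1, 3}, {2, 4, 5}])   # cuts several strong edges
```

Spectral methods seek the k-way partition minimizing such an objective; since the combinatorial search is NP-hard, they instead relax it to an eigenvector problem on the graph Laplacian, which again presupposes that k is known.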