18-08-2012, 03:08 PM
A Novel Similarity Measure for Clustering Categorical
Data Sets
A Novel Similarity Measure for Clustering.pdf (Size: 257.15 KB / Downloads: 67)
INTRODUCTION
Data clustering has attracted considerable research attention in the
fields of computational statistics and data mining. Clustering
techniques can be applied to similarity search, pattern recognition,
trend analysis, and so forth.
Clustering [10] is the technique of grouping a set of physical or
abstract objects into clusters such that objects within a cluster are
more similar to one another than to objects in other clusters. A good
clustering algorithm generates high-quality clusters, yielding low
inter-cluster similarity and high intra-cluster similarity.
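As a toy illustration (not from the paper), the quality criterion above can be made concrete with a simple-matching similarity over categorical objects: objects in the same cluster should score higher than objects drawn from different clusters. The data and function names here are invented for the example.

```python
# Toy illustration: simple-matching similarity between categorical
# objects, comparing intra-cluster vs. inter-cluster values.

def simple_match(x, y):
    """Fraction of attributes on which two categorical objects agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

cluster_a = [("red", "small", "round"), ("red", "small", "oval")]
cluster_b = [("blue", "large", "square"), ("blue", "large", "round")]

intra = simple_match(cluster_a[0], cluster_a[1])   # agree on 2 of 3 attributes
inter = simple_match(cluster_a[0], cluster_b[0])   # agree on 0 of 3 attributes
print(intra, inter)
```

A good categorical clustering is one in which, on average, `intra` stays high and `inter` stays low.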
RELATED WORK
This section surveys categorical clustering algorithms and the
similarity measures they use to find good clusters. It also covers
previous work on measuring similarity between one set of attributes
with respect to another set, known as context-based similarity. Most
earlier work uses k-means as the stepping platform for generating
clusters over categorical attributes.
K-Representative Algorithm
The k-modes algorithm [14] has its own drawbacks: it is unstable
because modes are not unique, so the resulting clusters depend
strongly on the modes selected during the clustering process. Huang
combined k-modes with k-means to give the k-prototypes algorithm
[15], but because of the k-modes problem the same limitations
remained. The k-representatives algorithm [7] works on the principle
of "cluster centers," called representatives, for categorical objects.
Arithmetic operations are completely absent from the initialization
and updating of categorical objects; instead of means, it applies the
notion of fuzzy logic to define representatives for clusters. With
this formulation, the clustering of categorical objects can be cast as
a partitioning problem in a way similar to k-means clustering. Its
dissimilarity measure compares an object against the relative
frequencies of categories stored in a representative.
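A minimal sketch of the k-representatives idea, assuming the common frequency-based formulation: a representative stores, per attribute, the relative frequency of each category in the cluster, and an object's dissimilarity is the frequency-weighted sum of category mismatches. The function names and data are illustrative, not the paper's notation.

```python
from collections import Counter

def representative(cluster):
    """Per-attribute relative frequencies of categories in a cluster."""
    n = len(cluster)
    reps = []
    for j in range(len(cluster[0])):
        counts = Counter(obj[j] for obj in cluster)
        reps.append({cat: c / n for cat, c in counts.items()})
    return reps

def dissimilarity(obj, reps):
    """Sum over attributes of frequency-weighted category mismatches."""
    return sum(sum(f for cat, f in rep.items() if cat != obj[j])
               for j, rep in enumerate(reps))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
reps = representative(cluster)
print(dissimilarity(("red", "small"), reps))  # (1/3) + (1/3) = 0.666...
```

Because the representative is a distribution rather than a single mode, the measure avoids the non-uniqueness problem described above.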
CLOPE Algorithm
The CLOPE (Clustering with sLOPE) algorithm [12] proposes an
approach based on histograms: the goodness of a cluster is higher
when the average frequency of an item is high relative to the number
of distinct items appearing in its transactions. The algorithm is
particularly suitable for large high-dimensional databases, but it is
sensitive to a user-defined parameter (the repulsion factor), which
weights the importance of the compactness/sparseness of a cluster. A
better cluster is reflected graphically by a higher height-to-width
ratio. CLOPE uses a histogram of a cluster C with the items on the
x-axis, ordered by decreasing occurrence, and occurrence counts on
the y-axis. A larger height means a heavier overlap among the items
in the cluster and thus more similarity among the transactions in the
cluster.
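The histogram statistics above can be sketched for a single cluster of transactions: size S is the total number of item occurrences, width W is the number of distinct items, and height H = S / W; CLOPE's criterion rewards a high height-to-width ratio, generalized by the repulsion factor r to S / W**r. The helper name and sample data are illustrative.

```python
from collections import Counter

def cluster_profile(transactions, r=2.0):
    """S, W, H, and the per-cluster term S / W**r of CLOPE's criterion."""
    occ = Counter(item for t in transactions for item in t)
    S = sum(occ.values())      # total item occurrences in the cluster
    W = len(occ)               # histogram width: number of distinct items
    H = S / W                  # histogram height
    term = S / W ** r          # r is the user-defined repulsion factor
    return S, W, H, term

tight = [("a", "b"), ("a", "b", "c")]   # transactions sharing most items
loose = [("a", "b"), ("c", "d", "e")]   # transactions sharing no items
print(cluster_profile(tight))  # S=5, W=3, H=1.666...
print(cluster_profile(loose))  # S=5, W=5, H=1.0
```

The overlapping transactions produce a taller, narrower histogram, which is exactly the shape CLOPE prefers; raising r penalizes wide (sparse) clusters more strongly.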
CONCLUSION AND FUTURE WORK
In this paper, a novel similarity measure for categorical attributes
of relational data sets has been proposed, based on the intuitive
ideas of functional dependency and context-based similarity. The idea
is generalized with a functional dependency that also takes into
account the context of the transactions in a cluster, and thus
influences the resulting number of clusters. Our application shows
that this similarity measure is quite effective in finding
interesting clusterings of relational data sets.
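To make the notion of context-based similarity concrete, here is a hedged sketch of one common co-occurrence formulation (not necessarily the paper's exact measure): two values of an attribute are similar in the context of a second attribute when they co-occur with similar distributions of that second attribute's values. All names and data below are invented for illustration.

```python
from collections import Counter

def context_similarity(rows, attr, context, v1, v2):
    """Overlap of context-value distributions for attr values v1 and v2."""
    def dist(v):
        counts = Counter(r[context] for r in rows if r[attr] == v)
        n = sum(counts.values())
        return {c: k / n for c, k in counts.items()}
    d1, d2 = dist(v1), dist(v2)
    # Distribution overlap: sum of minimum probabilities (1.0 = identical).
    return sum(min(d1.get(c, 0.0), d2.get(c, 0.0)) for c in set(d1) | set(d2))

rows = [
    {"fruit": "apple",  "color": "red"},
    {"fruit": "apple",  "color": "green"},
    {"fruit": "cherry", "color": "red"},
    {"fruit": "banana", "color": "yellow"},
]
print(context_similarity(rows, "fruit", "color", "apple", "cherry"))  # 0.5
```

Here "apple" and "cherry" are partially similar because both co-occur with "red", while "apple" and "banana" share no context values at all; a measure of this kind lets a clustering algorithm treat distinct categorical values as near rather than strictly equal or unequal.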