12-09-2012, 05:09 PM
PROBABILISTIC DISTANCE CLUSTERING
ABSTRACT
We present a new iterative method for probabilistic clustering of data. Given clusters,
their centers, and the distances of data points from these centers, the probability
of cluster membership at any point is assumed inversely proportional to the distance
from (the center of) the cluster in question. There are several ways of modeling this
assumption, and each such model (working principle) gives a different algorithm.
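One of the simplest working principles reads the assumption literally: membership probabilities proportional to the reciprocals of the distances, normalized to sum to one. The sketch below assumes Euclidean distances and plain 1/d weighting; it is an illustrative reading, not necessarily the exact model used in the thesis.

```python
import math

def membership_probabilities(point, centers):
    """Cluster-membership probabilities at a point, taken inversely
    proportional to the Euclidean distance from each cluster center
    (one possible working principle)."""
    dists = [math.dist(point, c) for c in centers]
    # If the point coincides with a center, assign it there with certainty.
    for k, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if j == k else 0.0 for j in range(len(centers))]
    inv = [1.0 / d for d in dists]
    s = sum(inv)
    return [v / s for v in inv]
```

A point equidistant from two centers gets probability 1/2 for each, and the probabilities always sum to one.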
The method is a generalization, to several centers, of the Weiszfeld method for
solving the Fermat-Weber location problem. At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points,
and the centers are updated as convex combinations of these points, with weights de-
termined by the working principle. Computations stop when the centers stop moving.
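For a single center, the scheme being generalized is the classical Weiszfeld iteration for the Fermat-Weber point: the center is repeatedly replaced by a convex combination of the data points with weights proportional to 1/distance. In the sketch below, the centroid starting point and the tiny-distance guard are implementation choices, not part of the scheme itself.

```python
import math

def weiszfeld(points, iters=100):
    """Classical Weiszfeld iteration: converges to the point minimizing
    the sum of Euclidean distances to the data points (Fermat-Weber)."""
    dim = len(points[0])
    # Start at the centroid (an arbitrary but common choice).
    c = tuple(sum(x[j] for x in points) / len(points) for j in range(dim))
    for _ in range(iters):
        d = [max(math.dist(x, c), 1e-12) for x in points]  # guard zero distance
        w = [1.0 / di for di in d]
        s = sum(w)
        # New center: convex combination of the points with weights w / s.
        c = tuple(sum(wi * x[j] for wi, x in zip(w, points)) / s
                  for j in range(dim))
    return c
```

For collinear points the Fermat-Weber point is the coordinate-wise median, which gives an easy sanity check.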
Progress is monitored by the joint distance function (JDF), a measure of distance from all cluster centers that evolves during the iterations and captures the data in its low contours.
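A natural candidate for such a joint distance is the harmonic-sum form, which for two clusters reduces to d1*d2/(d1+d2): it is small whenever the point is near any one center. The sketch below assumes that form; the thesis may define the JDF differently.

```python
import math

def joint_distance(point, centers):
    """Joint distance function (JDF) at a point: a single number that is
    small when the point is close to *any* cluster center.  Assumes the
    harmonic-sum form 1 / sum_k (1/d_k)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        return 0.0  # the point sits exactly on a center
    return 1.0 / sum(1.0 / d for d in dists)
```

Note that this value never exceeds the distance to the nearest center, so its low contours enclose the data around the centers.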
An important issue in clustering is the "right" number of clusters that best fits a data set. The JDF is used successfully to settle this issue, using only information freely available from the algorithm.
Introduction
This thesis presents a new approach to clustering, called probabilistic distance clustering, its algorithms, and selected applications. The thesis is divided into five parts.
Preliminaries
The present chapter contains a description of the thesis.
Chapter 2 gives a brief survey of the clustering concepts, notation and terminology
that are relevant for this thesis. The approaches of center-based clustering and hierarchical clustering are compared, and the main algorithms are described briefly.
Probabilistic Distance Clustering
This part develops the models and algorithms for probabilistic clustering of data.
The main idea is presented in Chapter 3, with a new iterative method for prob-
abilistic clustering of data. The method is based on a principle, or a model of the
relationship between distances and probabilities. Given the clusters, their centers, and
the distances of data points from these centers, the probability of cluster membership
at any point is assumed inversely proportional to the distance from (the center of) the
cluster in question. The cluster centers and cluster membership probabilities of the
data points are updated using this principle. This is the basis of the probabilistic
distance clustering method described in Section 3.4.
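Putting the pieces together, the overall iteration alternates the two updates and stops when the centers stop moving. The sketch below uses the weight p^2/d, one plausible reading of the inverse-distance principle; the exact weighting of Section 3.4 may differ, and the tolerance and iteration cap are implementation choices.

```python
import math

def pd_cluster(points, centers, tol=1e-6, max_iter=200):
    """Probabilistic distance clustering loop (illustrative sketch):
    each center is updated as a convex combination of the data points,
    and iteration stops when no center moves more than `tol`."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        new_centers = []
        for k in range(len(centers)):
            num = [0.0] * len(points[0])
            den = 0.0
            for x in points:
                d = [max(math.dist(x, c), 1e-12) for c in centers]
                p = (1.0 / d[k]) / sum(1.0 / dj for dj in d)  # membership prob.
                w = p * p / d[k]                              # assumed weight
                num = [a + w * xi for a, xi in zip(num, x)]
                den += w
            new_centers.append(tuple(a / den for a in num))
        shift = max(math.dist(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

On two well-separated groups of points, the centers settle near the respective group centroids after a handful of iterations.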
Related Problems
This part studies two problems which are closely related to distance clustering, and can be solved using the results of Part II with few modifications.
In Chapter 6, an important application of the PDQ method to estimating the parameters of a mixture of distributions is presented. In such problems the cluster sizes are unknown and need to be estimated. We first describe the problem of mixtures of distributions and introduce the EM Algorithm, a well-known method for solving this type of problem. The PDQ method may serve as an alternative to that method,
or as a preprocessor giving the EM Method a good start. We apply the algorithm to the
estimation of the parameters of Gaussian mixtures, and compare it to the EM method.
We conclude the chapter with the results of a number of computational experiments,
comparing the running time and solution quality of our algorithm with the EM Method.
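For concreteness, a textbook EM iteration for a two-component one-dimensional Gaussian mixture looks as follows; this is a generic sketch of the baseline the chapter compares against, not the thesis's implementation, and the extremes-of-the-data initialization is an assumption made here for determinism.

```python
import math

def em_gaussian_1d(data, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture.
    Returns mixing weights, means, and standard deviations."""
    mu = [min(data), max(data)]   # deterministic initialization (assumption)
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [pi[j] * math.exp(-0.5 * ((x - mu[j]) / sigma[j]) ** 2)
                    / (sigma[j] * math.sqrt(2 * math.pi)) for j in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate mixing weights, means, standard deviations.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sigma[j] = math.sqrt(max(var, 1e-9))
    return pi, mu, sigma
```

On well-separated data the estimated means converge to the per-group averages, which is the regime where a good starting point (such as one supplied by a clustering preprocessor) matters least; EM's sensitivity to initialization is precisely what motivates using PDQ as a preprocessor.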