19-01-2013, 11:53 AM
Similarity Measures for Text Document Clustering
ABSTRACT
Clustering is a useful technique that organizes a large quantity
of unordered text documents into a small number of
meaningful and coherent clusters, thereby providing a basis
for intuitive and informative navigation and browsing
mechanisms. Partitional clustering algorithms have been
recognized to be more suitable than hierarchical
clustering schemes for processing large datasets. A
wide variety of distance functions and similarity measures
have been used for clustering, such as squared Euclidean
distance, cosine similarity, and relative entropy.
In this paper, we compare and analyze the effectiveness
of these measures in partitional clustering for text document
datasets. Our experiments utilize the standard K-means
algorithm, and we report results on seven text document
datasets and five distance/similarity measures that
have been most commonly used in text clustering.
INTRODUCTION
We are facing an ever-increasing volume of text documents.
The abundant texts flowing over the Internet, huge collections
of documents in digital libraries and repositories,
and digitized personal information such as blog articles and
emails are piling up quickly every day. These have brought
challenges for the effective and efficient organization of text
documents.
DOCUMENT REPRESENTATION
There are several ways to model a text document. For example,
it can be represented as a bag of words, where words
are assumed to appear independently and their order is immaterial.
The bag-of-words model is widely used in information
retrieval and text mining [21]. Unlike a mathematical set,
the bag keeps a count of how many times each word occurs.
Each word corresponds to a dimension in the resulting data
space, and each document then becomes a vector of
non-negative values, one per dimension. Here we use the
frequency of each term as its weight, so terms that
appear more frequently are considered more important and descriptive
for the document.
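As a minimal sketch of this representation (the tokenization, toy documents, and function names here are illustrative assumptions, not from the paper), a term-frequency vector can be built as follows:

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector: one dimension per vocabulary word,
    weighted by how often the word occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Toy corpus of pre-tokenized documents (illustrative only).
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocabulary = sorted({t for d in docs for t in d})
vectors = [tf_vector(d, vocabulary) for d in docs]
# vocabulary -> ['apple', 'banana', 'cherry']
# vectors    -> [[2, 1, 0], [0, 1, 1]]
```

Each document is now a point in a space with one dimension per vocabulary term, which is the input the distance and similarity measures below operate on.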
SIMILARITY MEASURES
Before clustering, a similarity/distance measure must be determined.
The measure reflects the degree of closeness or
separation of the target objects and should correspond to
the characteristics that are believed to distinguish the clusters
embedded in the data. In many cases, these characteristics
are dependent on the data or the problem context at
hand, and there is no measure that is universally best for all
kinds of clustering problems.
Moreover, choosing an appropriate similarity measure is also
crucial for cluster analysis, especially for particular types
of clustering algorithms. For example, density-based
clustering algorithms, such as DBScan [4], rely heavily on
the similarity computation. Density-based clustering finds
clusters as dense areas in the data set, and the density of a
given point is in turn estimated as the closeness of the corresponding
data object to its neighboring objects. Recalling
that closeness is quantified as a distance/similarity value,
we can see that a large number of distance/similarity computations
are required for finding dense areas and estimating the
cluster assignment of new data objects. Therefore, understanding
the effectiveness of different measures is of great
importance in helping to choose the best one.
Euclidean Distance
Euclidean distance is a standard metric for geometrical problems.
It is the ordinary distance between two points and can
be easily measured with a ruler in two- or three-dimensional
space. Euclidean distance is widely used in clustering problems,
including clustering text. It satisfies the four conditions of a
metric (non-negativity, identity of indiscernibles, symmetry, and
the triangle inequality) and therefore is a true metric. It is also
the default distance measure used with the K-means algorithm.
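For term-frequency vectors of equal length, the computation is straightforward; this sketch (function name and example vectors are our assumptions, not from the paper) shows it directly:

```python
import math

def euclidean_distance(a, b):
    """Ordinary (L2) distance between two equal-length term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean_distance([2, 1, 0], [0, 1, 1])
# (2-0)^2 + (1-1)^2 + (0-1)^2 = 5, so d = sqrt(5)
```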
Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the
Tanimoto coefficient, measures similarity as the intersection
divided by the union of the objects. For text documents, the
Jaccard coefficient compares the summed weight of shared terms
to the summed weight of terms that are present in either of the
two documents but are not shared.
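For weighted term vectors, a common generalization of this idea is the extended Jaccard (Tanimoto) coefficient, dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)); the sketch below assumes that form and uses illustrative vectors:

```python
def jaccard_similarity(a, b):
    """Extended Jaccard (Tanimoto) coefficient for weighted term
    vectors: shared weight over total weight minus shared weight."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

s = jaccard_similarity([2, 1, 0], [0, 1, 1])
# dot = 1, |a|^2 = 5, |b|^2 = 2, so s = 1 / (5 + 2 - 1) = 1/6
```

Identical vectors score 1, and vectors with no shared terms score 0, matching the intersection-over-union intuition.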
CLUSTERING ALGORITHM
For all subsequent experiments, the standard K-means algorithm
is chosen as the clustering algorithm. This is an iterative
partitional clustering process that aims to minimize the
least squares error criterion [15]. As mentioned previously,
partitional clustering algorithms have been recognized to be
better suited for handling large document datasets than hierarchical
ones, due to their relatively low computational
requirements [16, 9, 3].
The standard K-means algorithm works as follows. Given a
set of data objects D and a pre-specified number of clusters
k, k data objects are randomly selected to initialize k clusters,
each one being the centroid of a cluster. The remaining
objects are then assigned to the cluster represented by the
nearest or most similar centroid. Next, new centroids are
re-computed for each cluster and in turn all documents are
re-assigned based on the new centroids. This assign-and-update
step iterates until the solution converges, that is, until no
data object changes cluster after an update of the centroids.
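The assign-then-update loop described above can be sketched as follows (a minimal illustration using Euclidean distance; the function name, seed handling, and toy data are our assumptions, not the paper's experimental setup):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means: random centroid initialization, then
    alternate assignment and centroid updates until assignments
    stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: no point moved
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # recompute centroid as the mean of its members
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assignment, centroids

points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
labels, _ = kmeans(points, k=2)
```

On this toy data the two nearby pairs of points end up in separate clusters regardless of which points the random initialization picks.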
RELATED WORK
The most similar work to this paper is [17], which compares
four similarity measures on a collection of Yahoo! news
pages. The present study differs in two aspects. First, we
extended the experiments by including the averaged KL divergence.
Our results broadly agree with Strehl et al.'s [17].
We both found that the performance of the cosine similarity,
Jaccard coefficient, and Pearson's coefficient are very close,
and significantly better than the Euclidean distance measure.
In addition, we found that the KL divergence measure
is comparable and in some cases better than the others. Second,
we also experimented with other types of data sets in
addition to the web page documents.
As for the averaged Kullback-Leibler divergence, Lin gave a
clear explanation in [11]. This measure was more frequently
used to assess the similarity between words, especially for
such applications as word sense disambiguation. Only recently
has this measure been utilized for document
clustering. Information-theoretic clustering algorithms
such as the Information Bottleneck method [18] rely on this
measure and have shown considerable improvement in overall
performance.
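As a sketch of the symmetric, averaged form following Lin's formulation (the unweighted mixture below is our illustrative choice, not necessarily the paper's exact per-term weighting, and it assumes dense probability vectors):

```python
import math

def avg_kl_divergence(p, q):
    """Averaged KL divergence in the symmetric form due to Lin:
    0.5*D(P||M) + 0.5*D(Q||M) with M = (P+Q)/2. Averaging against
    the mixture M avoids the division-by-zero problems plain KL
    divergence has on sparse term distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = avg_kl_divergence([0.5, 0.5], [0.5, 0.5])  # identical -> 0.0
far = avg_kl_divergence([1.0, 0.0], [0.0, 1.0])   # disjoint -> log 2
```

The measure is zero for identical distributions and bounded above by log 2 for completely disjoint ones, which makes it well-behaved on document term distributions.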
CONCLUSIONS
To conclude, this investigation found that, except for the Euclidean
distance measure, the other measures have comparable
effectiveness for the partitional text document clustering
task. The Pearson correlation coefficient and the averaged
KL divergence measures are slightly better in that their
resulting clustering solutions are more balanced and match
the manually created category structure more closely.
Meanwhile, the Jaccard and Pearson coefficient measures
find more coherent clusters. Despite these differences,
the overall performance of these measures is similar. Considering
the type of cluster analysis involved in this study, which is
partitional and requires a similarity or distance measure, we
can see that there are three components that affect the final
results: the representation of the objects, the distance or similarity
measure, and the clustering algorithm itself. This leads us
to two directions for future work.