Text Classification Aided by Clustering: a Literature Review
Introduction
Supervised and unsupervised learning have been the focus of critical research in the areas of
machine learning and artificial intelligence. In the literature, these two streams flow
independently of each other, despite their close conceptual and practical connections. In this
work we deal exclusively with the scenario in which text classification is aided by clustering. This
chapter provides a review and interpretation of the role of clustering in different fields of
text classification with an eye towards identifying the important areas of research. Drawing
upon the literature review and analysis, we discuss several important research issues
surrounding text classification tasks and the role of clustering in support of these tasks. We
define the problem, postulate a number of baseline methods, examine the techniques used,
and classify them into meaningful categories.
A standard research issue for text classification is the creation of compact representations of
the feature space and the discovery of the complex relationships that exist between features,
documents and classes. There are several approaches that try to quantify the notion of
information for the basic components of a text classification problem. Given the variables of
interest, the sources of information about these variables can be compressed while preserving
the information they convey. Clustering is one of the approaches used in this context. In this vein, an
important area of research where clustering is used to aid text classification is the area of
dimensionality reduction. Clustering is used as a feature compression and/or extraction
method: features are clustered into groups based on selected clustering criteria. Feature
clustering methods create new, reduced-size event spaces by joining similar features into
groups. They define a similarity measure between features, and collapse similar features
into single events that no longer distinguish among their constituent features. Typically, the
parameters of the cluster become the weighted average of the parameters of its constituent
features. Two types of clustering have been studied: i) one-way clustering, i.e. feature
clustering based on the distributions of features in the documents or classes, and ii) co-clustering,
i.e. clustering both features and documents.
One-way clustering (clustering features)
In text classification using one-way clustering, a clustering algorithm is applied prior to a
classifier to reduce feature dimensionality by grouping together “similar” features into a
much smaller number of feature clusters; the clusters then serve as features for the
classification task, replacing the original feature space. A crucial stage in this procedure is
determining the similarity between features. Three main clustering methods have been
applied in the literature: information bottleneck, distributional clustering, and divisive
clustering.
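As a rough illustration of feature clustering for dimensionality reduction (a generic sketch, not a reproduction of the specific algorithms cited in this chapter), the Python fragment below represents each word by its class-conditional distribution, groups similar words with k-means, and collapses each group into a single aggregated feature before fitting a classifier. The corpus, the number of clusters, and the choice of k-means and naive Bayes are assumptions made only for this example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed toy corpus: two newsgroup categories, word-count features.
train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = CountVectorizer(max_features=2000, stop_words="english").fit_transform(train.data)
y = np.array(train.target)

# Represent each word by its distribution over the classes, P(c | w).
n_classes = y.max() + 1
class_counts = np.vstack(
    [np.asarray(X[y == c].sum(axis=0)).ravel() for c in range(n_classes)]
).T
word_class_dist = class_counts / np.maximum(class_counts.sum(axis=1, keepdims=True), 1)

# Group "similar" words: words with similar class-conditional profiles share a cluster.
n_word_clusters = 50
word_labels = KMeans(n_clusters=n_word_clusters, n_init=10, random_state=0).fit_predict(word_class_dist)

# Collapse each word cluster into one aggregated feature (sum of member counts),
# giving a much smaller event space that replaces the original feature space.
X_dense = X.toarray()
X_reduced = np.zeros((X_dense.shape[0], n_word_clusters))
for k in range(n_word_clusters):
    X_reduced[:, k] = X_dense[:, word_labels == k].sum(axis=1)

# The cluster features are then used for the classification task.
clf = MultinomialNB().fit(X_reduced, y)
print("dims:", X_dense.shape[1], "->", X_reduced.shape[1], "| train accuracy:", clf.score(X_reduced, y))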
An important feature clustering method that formulates a principle for the extraction and
efficient representation of relevant information is the information bottleneck (IB) method
(Tishby et al., 1999). The objective of the IB method is to extract from one variable X the
information that is relevant for predicting another variable Y.
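For reference, the IB trade-off is usually stated as the following functional (given here from the general IB literature rather than quoted from this chapter), where the compressed representation of X is written as X with a tilde, I(.;.) denotes mutual information, and the parameter beta balances compression against preserved relevance; the objective is minimized over the stochastic mapping from X to its compressed representation:

\min_{p(\tilde{x}\mid x)} \; I(X;\tilde{X}) \;-\; \beta\, I(\tilde{X};Y)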
Co-clustering (clustering features and documents)
When co-clustering is used in text classification, a two-stage procedure is usually followed: feature
clustering followed by document clustering. In this way, a reduction of both dimensions is
attained.
The double-clustering (DC) algorithm (Slonim & Tishby, 2000) is a co-clustering two-stage
procedure based on the IB method. Intuitively, in the first stage of DC, feature clustering
generates coarser pseudo features, which reduce noise and sparseness that might be
exhibited in the original feature space. Then, in the second stage, documents are clustered as
distributions over the “distilled” pseudo features, and therefore generate more accurate
document clusters. An extension of the DC algorithm, the so-called Iterative Double
Clustering (IDC) (Yaniv & Souroujon, 2001), applies the DC procedure in an iterative
manner. Whenever the first DC iteration succeeds in extracting meaningful structure from the
data, a number of subsequent iterations can further improve the clustering quality, because
progressively less noisy data representations are generated. Experiments conducted on text
classification tasks indicate that IDC outperforms DC and is competitive even with SVMs
when the training set is small.
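As a rough illustration of the two-stage idea (a sketch only; k-means stands in here for the IB-based procedure actually used by DC), the following Python fragment first clusters the features, re-expresses every document as a distribution over the resulting pseudo features, and then clusters the documents. The synthetic data and all parameter values are assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Assumed toy term-document matrix: rows are documents, columns are word counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(200, 500)).astype(float)

# Stage 1: cluster features using their profiles over the documents (columns of X),
# producing coarser pseudo features that reduce noise and sparseness.
n_feature_clusters = 40
feat_labels = KMeans(n_clusters=n_feature_clusters, n_init=10, random_state=0).fit_predict(X.T)

# Re-express each document as a distribution over the distilled pseudo features.
X_pseudo = np.zeros((X.shape[0], n_feature_clusters))
for k in range(n_feature_clusters):
    X_pseudo[:, k] = X[:, feat_labels == k].sum(axis=1)
X_pseudo /= np.maximum(X_pseudo.sum(axis=1, keepdims=True), 1e-12)

# Stage 2: cluster the documents in the pseudo-feature space.
n_doc_clusters = 5
doc_labels = KMeans(n_clusters=n_doc_clusters, n_init=10, random_state=0).fit_predict(X_pseudo)
print("document cluster sizes:", np.bincount(doc_labels))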
Clustering in semi-supervised classification
Clustering in semi-supervised classification is used as a method to extract information from
the unlabelled data in order to boost the classification task. In particular, clustering is used:
i) to create a training set from the unlabelled data, ii) to augment the training set with new
documents from the unlabelled data, iii) to augment the dataset with new features, and iv)
to co-train a classifier.
Create a training set from the unlabelled data
(Fung & Mangasarian, 2001) propose a model for classifying two-class unlabelled data,
called clustered concave semi-supervised SVM (CVS3VM). First, a k-median clustering
algorithm finds k cluster centres for the unlabelled examples such that the sum of distances
between each example and the closest cluster centre is minimized. Then, examples within a
certain distance from these k cluster centres are treated as representative examples of the
clusters, and hence of the overall dataset, and are given to an expert or oracle to label.
Finally, a linear SVM is trained using this small sample of labelled data. The model performs
effectively when compared with other methods.
(Li et al., 2004) follow a similar approach where a k-means clustering algorithm is used to
cluster the unlabelled data into a certain number of subsets and to assign corresponding
cluster labels. Then, an SVM classifier is trained on this labelled set.
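A minimal sketch of this general cluster-then-label-then-train idea (using k-means and scikit-learn rather than the k-median formulation of CVS3VM; the synthetic data, the number of clusters, and the simulated oracle are assumptions) is:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.svm import LinearSVC

# Assumed pool of two-class data treated as unlabelled; the true labels exist
# here only so that the "oracle" can be simulated.
X, y_true = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Step 1: find k cluster centres among the unlabelled examples.
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Step 2: take the example closest to each centre as a cluster representative
# and ask the oracle (here, the hidden true labels) to label it.
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
X_rep, y_rep = X[rep_idx], y_true[rep_idx]

# Step 3: train a linear SVM on the small labelled representative sample
# (assumes both classes appear among the representatives).
clf = LinearSVC().fit(X_rep, y_rep)
print("accuracy over the full pool:", clf.score(X, y_true))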
Co-training
In general, a co-training algorithm produces an initial weak classifier from a few labelled
examples and later uses unlabelled data to improve its performance. The idea was first
introduced in (Blum & Mitchell, 1998). The key defining properties of this problem class are
that (i) the feature set can be factored into two (or more) components, i.e. there are two distinct
views of an example x, each of which is redundantly sufficient to correctly classify the example,
and (ii) the two views are conditionally independent given the class, so that the features
in one view of the example x do not always co-occur with the features in the second view. A
different approach to co-training is given in (Goldman & Zhou, 2000). See (Abney, 2002;
Seeger, 2000) for a comprehensive survey on co-training.
The use of “concepts” derived by clustering as in (Raskutti et al., 2000a) provides an
alternate description of the data, similar to the redundant views used in co-training. In this
vein, (Raskutti et al., 2002b) present a co-training strategy to make use of unlabelled data.
Two predictors are trained in parallel, and each predictor labels the unlabelled data to train
the other predictor in the next round. The process repeats for a number of iterations. The
predictors are SVMs: one is trained using the original word-presence feature view, and the
other is trained solely with the new cluster features that are derived by clustering both
labelled and unlabelled data. The new features include membership information as well as
similarity to clusters’ centroids for the more populous clusters as described in their previous
work (Raskutti et al., 2000a).
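The general co-training loop can be sketched as follows (a schematic Python example: the two views are obtained by splitting synthetic features in half, confidence is taken from the SVM margins, and all parameters are assumptions rather than details from the cited papers):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Assumed data with two redundant views obtained by splitting the feature set.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=20, random_state=0)
view_a, view_b = X[:, :20], X[:, 20:]

labelled = np.zeros(len(y), dtype=bool)
labelled[:40] = True              # small initial labelled seed
y_work = y.copy()                 # working labels; pseudo-labels are written here

clf_a, clf_b = LinearSVC(), LinearSVC()
for _ in range(5):                # a few co-training rounds
    clf_a.fit(view_a[labelled], y_work[labelled])
    clf_b.fit(view_b[labelled], y_work[labelled])

    # Each predictor labels the unlabelled examples it is most confident about;
    # those examples enter the shared pool used to train the other predictor next round.
    for clf, view in ((clf_a, view_a), (clf_b, view_b)):
        unlab = np.where(~labelled)[0]
        if len(unlab) == 0:
            break
        conf = np.abs(clf.decision_function(view[unlab]))
        top = unlab[np.argsort(conf)[-20:]]      # 20 most confident predictions
        y_work[top] = clf.predict(view[top])
        labelled[top] = True

print("final training-pool size:", labelled.sum())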
Clustering in large-scale classification problems
Clustering in large-scale classification problems is used as a down-sampling pre-processing step
before classification, in order to select the most representative training examples according to: i)
clustering combined with information from the resulting hyperplane of an SVM initially trained on
cluster representatives (see the sketch below), ii) clustering and prior class label information, iii) a
combination of cases i and ii, iv) solely clustering results, and v) problem decomposition.
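A rough sketch of case i (train an SVM on cluster representatives, then keep only the training examples that fall near its hyperplane before retraining) is given below; the synthetic data, the cluster count, and the margin threshold are assumptions made for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.svm import LinearSVC

# Assumed large labelled training set.
X, y = make_classification(n_samples=10000, n_features=30, n_informative=10, random_state=0)

# Train an initial SVM only on cluster representatives.
km = KMeans(n_clusters=100, n_init=3, random_state=0).fit(X)
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
initial = LinearSVC().fit(X[rep_idx], y[rep_idx])

# Keep only the examples close to the initial hyperplane (the most informative
# ones for the decision boundary) and retrain on this reduced set.
margin = np.abs(initial.decision_function(X))
keep = margin < 1.0
final = LinearSVC().fit(X[keep], y[keep])
print("kept", int(keep.sum()), "of", len(X), "examples")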