15-05-2012, 01:42 PM
A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification
SPJ 15 A Fuzzy Self-Constructing Feature Clustering Algo.pdf (Size: 5.18 MB / Downloads: 56)
INTRODUCTION
In text classification, the dimensionality of the feature vector is usually huge. For example, 20 Newsgroups [1] and Reuters-21578 top-10 [2], two real-world data sets, both have more than 15,000 features. Such high dimensionality can be a severe obstacle for classification algorithms [3], [4]. To alleviate this difficulty, feature reduction approaches are applied before document classification tasks are performed [5]. Two major approaches, feature selection [6], [7], [8], [9], [10] and feature extraction [11], [12], [13], have been proposed for feature reduction. In general, feature extraction approaches are more effective than feature selection techniques, but they are also more computationally expensive [11], [12], [14]. Developing scalable and efficient feature extraction algorithms for high-dimensional document data sets is therefore highly desirable.
OUR METHOD
Several issues are pertinent to most existing feature clustering methods. First, the parameter k, indicating the desired number of extracted features, has to be specified in advance. This places a burden on the user, since trial and error is needed until an appropriate number of extracted features is found. Second, when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a cluster have the same degree of contribution to the resulting extracted feature. It may be better to allow more similar words to have larger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues.
Self-Constructing Clustering
Our clustering algorithm is an incremental, self-constructing learning approach. Word patterns are considered one by one, and the user does not need to know the number of clusters in advance. No clusters exist at the beginning; clusters are created as necessary. For each word pattern, its similarity to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. When a new cluster is created, the corresponding membership function is initialized. When, instead, the word pattern is combined into an existing cluster, the membership function of that cluster is updated accordingly.
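The incremental procedure above can be sketched as follows. This is a minimal illustration, assuming Gaussian-style membership functions with a user-chosen deviation floor `sigma0` and a similarity threshold `rho`; the variable names and the running mean/deviation updates are illustrative, not the paper's exact equations:

```python
import numpy as np

def self_constructing_cluster(patterns, rho=0.5, sigma0=0.3):
    """Incremental self-constructing clustering sketch.

    Each cluster keeps running statistics (count, sum, sum of squares)
    so its Gaussian membership function can be updated on the fly.
    """
    clusters = []  # each: {"n": count, "s": sum, "ss": sum of squares}
    for x in patterns:  # word patterns are considered one by one
        x = np.asarray(x, dtype=float)
        best, best_sim = None, 0.0
        for j, c in enumerate(clusters):
            m = c["s"] / c["n"]                          # cluster mean
            var = np.maximum(c["ss"] / c["n"] - m**2, 0.0)
            sd = np.sqrt(var) + sigma0                   # floor keeps sd > 0
            # product of 1-D Gaussians, written as one exponential
            sim = np.exp(-np.sum(((x - m) / sd) ** 2))
            if sim > best_sim:
                best, best_sim = j, sim
        if best is None or best_sim < rho:
            # no existing cluster is similar enough: create a new one
            clusters.append({"n": 1, "s": x.copy(), "ss": x**2})
        else:
            # combine x into the most similar cluster; the membership
            # function (mean and deviation) is updated via the sums
            c = clusters[best]
            c["n"] += 1
            c["s"] += x
            c["ss"] += x**2
    return clusters
```

Note that the variance of each cluster enters the similarity directly, addressing the second issue raised above: two clusters with the same mean but different spreads assign different memberships to the same word pattern.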
Text Classification
Given a set D of training documents, text classification can be done as follows: We specify the similarity threshold for (16) and apply our clustering algorithm. Assume that k clusters are obtained for the words in the feature vector W. Then we find the weighting matrix T and convert D to D′ by (25). Using D′ as training data, a classifier based on support vector machines (SVM) is built. Note that any classification technique other than SVM could be applied instead; SVM is chosen because Joachims [36] showed that it performs better than other methods for text categorization.
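This pipeline can be sketched roughly as follows. For illustration only, a hard (0/1) weighting matrix T is assumed, where the paper's equation (25) would assign fuzzy degrees, and scikit-learn's `LinearSVC` stands in for the SVM; the function name and `word_to_cluster` input are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def classify_with_extracted_features(D, labels, word_to_cluster, k):
    """Project documents through a weighting matrix T and train an SVM.

    D: (docs x words) term matrix; word_to_cluster: cluster index per word.
    """
    m = D.shape[1]
    T = np.zeros((m, k))
    for i, j in enumerate(word_to_cluster):
        T[i, j] = 1.0          # hard membership; a fuzzy T would use degrees
    D_prime = D @ T            # reduced training data D' (docs x k)
    clf = LinearSVC().fit(D_prime, labels)  # any classifier could be swapped in
    return clf, T
```

At test time, a new document is converted with the same T before prediction, so training and test data live in the same k-dimensional extracted-feature space.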