16-11-2012, 04:54 PM
Manifold Adaptive Experimental Design for Text Categorization
Manifold Adaptive Experimental Design.pdf (Size: 2.27 MB / Downloads: 30)
INTRODUCTION
Text classification has been a fundamental problem in
many information processing tasks [1], [14], [17], [22],
[32], [44]. In order to train a classifier that can automatically
distribute documents into different semantic categories,
one usually needs to collect a large set of labeled examples.
To reduce the effort of collecting labels, many researchers
have studied active learning [37] for text categorization.
The key problem in active learning is determining which
unlabeled examples would be the most informative, i.e.,
which would improve the classifier the most if they were
labeled and used as training examples.
There has been a long tradition of research on active
learning in the machine learning community [10], [12], [16].
One popular group of algorithms selects the most uncertain
data given previously trained models. One representative
algorithm in this group is SVMActive. Based on the
observation that the closer to the SVM boundary a data
point is, the less reliable its classification is, Tong and Koller
proposed SVMActive, which selects the unlabeled data
points closest to the boundary and solicits the user's labels for
them, so as to achieve maximal refinement of the hyperplane
between the two classes [43]. Another group of algorithms
chooses the most informative points that optimize some
expected measures [12]. Many algorithms in statistics belong
to this category. In statistics, the problem of selecting samples
to label is typically referred to as experimental design [2].
BACKGROUND
The generic problem of active learning is the following.
Given a set of points X = {x1, x2, ..., xn} in R^m, find a
subset Z = {z1, z2, ..., zk} of X which contains the most
informative points. In other words, the points zi (i =
1, ..., k) can improve the classifier the most if they are
labeled and used as training points.
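The problem statement above can be sketched as a generic pool-based selection loop. The function names (`oracle`, `select`, `train`) are placeholders of my own choosing, not part of the paper; any concrete algorithm discussed below plugs in a different `select`:

```python
import numpy as np

def active_learning_loop(X, oracle, select, train, k):
    """Generic pool-based active learning: repeatedly pick the most
    informative unlabeled point, query its label, and retrain.

    X      : (n, m) array of candidate points
    oracle : index -> label (simulates asking the user)
    select : (model, X, unlabeled_indices) -> chosen index
    train  : (labeled points, labels) -> new model
    k      : number of points to label in total
    """
    labeled_idx, labels = [], []
    unlabeled = list(range(len(X)))
    model = None
    for _ in range(k):
        i = select(model, X, unlabeled)   # pick the most informative point
        labeled_idx.append(i)
        labels.append(oracle(i))          # solicit its label
        unlabeled.remove(i)
        model = train(X[labeled_idx], labels)
    return model, labeled_idx
```

Classifier-independent strategies ignore `model` inside `select`; classifier-dependent ones, such as Simple Margin below, use it to score the pool.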
There has been extensive research in machine learning
on this subject. Some popular directions include selecting
the most uncertain data given previously trained models
[42] and selecting the most representative points by
exploiting the cluster structure of the data [13]. One
representative algorithm which selects the most uncertain
data is SVMActive [42], [43]. This method selects the points
that can reduce the size of the version space as much as
possible. Since it is difficult to measure the version space,
the authors provide three approximations. One of them
which selects the points closest to the current decision
boundary is called SimpleMargin. This method was also
proposed by Schohn and Cohn [35] and has been very
popular. Some other methods include query-by-committee
[39], density-weighted methods [29], [38], and explicit
error-reduction techniques [34], [48]. Please refer to [37] for
a comprehensive treatment of active learning approaches.
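The SimpleMargin selection rule reduces to a one-liner once the current SVM is summarized by a linear decision function. The sketch below assumes a hyperplane given by weights `w` and bias `b` (the paper works with SVM classifiers, but this exact helper is illustrative, not taken from it):

```python
import numpy as np

def simple_margin_select(w, b, X_pool, k):
    """SimpleMargin: score every unlabeled point by its distance to the
    current hyperplane w.x + b = 0, and return the indices of the k
    closest points, i.e., the ones whose classification is least reliable."""
    margins = np.abs(X_pool @ w + b) / np.linalg.norm(w)
    return np.argsort(margins)[:k]
```

Because the scores depend on `w` and `b`, the selected set changes whenever the classifier is retrained, which is exactly why this strategy is classifier dependent.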
Manifold Adaptive Kernel
In order to incorporate the manifold structure into the
active learning process, a natural way is to perform the active
learning task in a manifold adaptive kernel space. In the
following, we discuss how to incorporate the manifold
structure into the reproducing kernel Hilbert space, which
leads to the manifold adaptive kernel space.
The kernel trick is usually applied in the hope of discovering
the nonlinear structure in the data by mapping the original
nonlinear observations into a higher dimensional linear space
[36]. The most commonly used kernels include the Gaussian
kernel and the polynomial kernel. However, the nonlinear
structure captured by such data-independent kernels may
not be consistent with the intrinsic manifold structure, such as
geodesic distance, curvature, and homology [4], [31].
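A standard way to warp a data-independent kernel toward the manifold is the graph-Laplacian deformation of the Gram matrix, K~ = K - K (I + gamma*L*K)^{-1} gamma*L*K, where L = D - W is the Laplacian of a neighborhood graph W. Whether MAED uses exactly this construction is not stated in this excerpt, so the following is a minimal sketch of the general technique, with `gamma` a hypothetical deformation parameter:

```python
import numpy as np

def manifold_adaptive_gram(K, W, gamma=1.0):
    """Deform a data-independent Gram matrix K using the graph
    Laplacian L = D - W of a neighborhood graph with adjacency W:
        K~ = K - K (I + gamma*L*K)^{-1} (gamma*L) K
    so that the resulting kernel respects the manifold structure
    encoded by the graph."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    n = K.shape[0]
    M = gamma * L
    # Solve (I + M K) X = M K instead of forming the inverse explicitly.
    K_tilde = K - K @ np.linalg.solve(np.eye(n) + M @ K, M @ K)
    return K_tilde
```

Setting `gamma = 0` recovers the original kernel, so the deformation interpolates between the data-independent geometry and the graph geometry.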
Text Categorization Results
In this section, we discuss the performance of the four
different algorithms on text categorization. Before the
experimental comparison, it is important to note that the
algorithms Random Sampling, Convex TED, and MAED are
all classifier independent, while the algorithm Simple Margin is
classifier dependent. For the former three algorithms, the data
selection is performed globally. In other words, the selected
data points are used for all the binary classification tasks.
For Simple Margin, however, since the active learning (data
selection) process depends on the decision boundary, we have
to select k data points for labeling separately for each binary
classification task. In our experiments, four categories are used;
thus Simple Margin may select up to 4k data points if
there is no overlap. Moreover, since Simple Margin is
classifier dependent, it needs at least one example from each
category to train the initial classifier. In our experiments, we
randomly select one example from each category to train an
initial SVM classifier for Simple Margin.
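The labeling-cost bookkeeping above can be made concrete with a small hypothetical helper (`label_budget` is my own name, not from the paper); it counts the worst-case number of labels each kind of method consumes:

```python
def label_budget(k, n_categories, classifier_dependent):
    """Worst-case number of labeled examples consumed.

    Classifier-independent methods (Random Sampling, Convex TED, MAED)
    select one global set of k points that is reused by every binary
    one-vs-rest task.  Simple Margin selects k points per binary task
    (no overlap in the worst case) and additionally needs one seed
    example per category to train its initial classifier."""
    if classifier_dependent:
        return n_categories * k + n_categories
    return k
```

With four categories, Simple Margin may therefore require up to 4k + 4 labels where the classifier-independent methods require only k, which should be kept in mind when reading the comparison.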