26-05-2012, 02:36 PM
Bridging Domains Using World Wide Knowledge for Transfer Learning
Bridging Domains.pdf (Size: 10.26 MB / Downloads: 205)
INTRODUCTION
TEXT classification, which aims to assign a document to one
or more categories based on its content, is a fundamental
task for Web and document data mining applications,
ranging from information retrieval and spam detection to online
advertising and Web search. Traditional supervised
learning approaches to text classification require sufficient
labeled instances in a problem domain in order to train a
high-quality model. However, it is not always easy or feasible to
obtain new labeled data in a domain of interest (hereafter
referred to as the target domain). This lack of labeled data
can seriously hurt classification performance in
many real-world applications.
To address this problem, transfer learning techniques, in
particular domain adaptation techniques, have been
introduced. They capture knowledge shared with
related domains where labeled data are available and
use that knowledge to improve the performance of data
mining tasks in a target domain. In transfer learning
terminology, one or more auxiliary domains are identified
as the source of knowledge transfer, and the domain of
interest is known as the target domain. Much effort has been
dedicated to this problem in recent years in machine learning,
data mining, and information retrieval [1], [2], [3], [4], [5].
RELATED WORK
In this section, we briefly review some previously proposed
methods for solving the task of domain adaptation. Since
Wikipedia is used as our auxiliary data for building the
information bridge, we also review some methods that
extract useful knowledge from Wikipedia and other similar
knowledge bases such as ODP. Finally, since our approach
to the domain adaptation problem is closely related to semisupervised
learning, we also briefly review this topic.
Domain Adaptation
Domain adaptation has attracted increasing attention in
recent years. In general, previous domain adaptation
approaches can be classified into two categories [7]: instance-based
approaches [1], [2] and feature-based approaches [4],
[5], [8], [9].
Instance-based methods seek reweighting
strategies on the source data such that the source distribution
better matches the target distribution. Feature-based
methods try to discover a shared feature space in which
the distributions of the different domains are pulled closer. Both
types try to discover the relation between the source and
target domains using only the two domains themselves. For
example, instance-based transfer learning models assume
that there is a subset of instances sharing similar distributions
across domains, and they then emphasize the
impact of these data in the models since they are more
“similar.” Feature-based domain adaptation models
assume that different domains may share some features,
for instance, a subset of explicit or implicit features.
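To make the instance-based idea concrete, the density ratio between the target and source distributions can be estimated with a simple domain classifier: train a logistic model to tell source examples from target examples, then weight each source instance by P(target | x) / P(source | x). The sketch below is a minimal illustration of this general strategy, not the specific method of any cited paper; the function names and the one-dimensional toy data are our own.

```python
import math
import random

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_domain_classifier(src, tgt, lr=0.1, epochs=100):
    """Logistic regression separating source (label 0) from target (label 1)."""
    data = [(x, 0.0) for x in src] + [(x, 1.0) for x in tgt]
    dim = len(src[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def importance_weight(x, w, b):
    """Estimate p_target(x) / p_source(x) from the classifier's odds."""
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    return p / max(1.0 - p, 1e-12)
```

Source instances that look like target data receive weights above 1 and therefore more influence when the target-domain model is trained.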
Here, we consider some well-known instance-based
domain adaptation methods. Jiang et al. [2] apply
instance weighting to natural language processing tasks,
using a form of importance sampling to correct for
sample selection bias [10]. Dai
et al. [1] propose a boosting-style reweighting method
that applies different weighting schemes to data from
different domains. Feature-based methods have also been
developed and compared against instance-based ones.
Daumé et al. [11] propose a simple feature augmentation
method for NLP tasks. Blitzer et al. [12] use the Structural
Correspondence Learning (SCL) model to identify correspondences
among features from different domains by
modeling their correlations with pivot features that behave
in the same way in both domains for discriminative learning;
the chosen pivot features serve to bridge the two domains. Lee
et al. [13] use transfer learning on an ensemble of related
tasks to construct an informative prior on feature relevance.
They assume that features themselves have metafeatures
that are predictive of their relevance to the prediction task,
and model their relevance as a function of the metafeatures.
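The feature augmentation of Daumé et al. admits a very compact sketch: each feature vector is tripled into a shared copy, a source-only copy, and a target-only copy, so that a single linear classifier can learn both domain-shared and domain-specific weights. The sketch below assumes dense, list-based feature vectors:

```python
def augment_source(x):
    """Map a source-domain vector to [shared copy, source copy, zeros]."""
    return x + x + [0.0] * len(x)

def augment_target(x):
    """Map a target-domain vector to [shared copy, zeros, target copy]."""
    return x + [0.0] * len(x) + x
```

Training one linear model on the augmented vectors lets the weights on the shared block capture behavior common to both domains, while the domain-specific blocks absorb the rest.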
Raina et al. [5] describe an approach to self-taught
learning that uses sparse coding to construct high-level
features from unlabeled data. They first express each
unlabeled instance as a sparse weighted linear
combination of basis vectors, enforcing sparsity with an L1 penalty. They
then use these sparse codes as input features to standard supervised
classification algorithms. In [9], a co-clustering-based
classification algorithm called CoCC is proposed to classify
out-of-domain documents. The class structure is passed
through word clusters from the in-domain data to the
out-of-domain data. Additional class-label information given by
the in-domain data is extracted and used for labeling the
word clusters for out-of-domain documents. However, one
drawback of many previous works is that the knowledge
shared between the source and target domains may be
quite limited, so the relation between the two domains
cannot be fully exploited.
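To make the sparse-coding step of self-taught learning [5] concrete, the sketch below solves min_a ||x - sum_j a_j b_j||^2 + lam * ||a||_1 for a fixed basis via iterative soft-thresholding (ISTA). This is an illustrative sketch only: the basis in the test is a toy orthonormal one, and learning the basis itself (part of the full method) is omitted.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of the L1 norm)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_code(x, basis, lam=0.2, step=0.1, iters=500):
    """ISTA for min_a ||x - sum_j a[j] * basis[j]||^2 + lam * ||a||_1."""
    k, d = len(basis), len(x)
    a = [0.0] * k
    for _ in range(iters):
        # Residual of the current reconstruction.
        r = [sum(a[j] * basis[j][i] for j in range(k)) - x[i] for i in range(d)]
        # Gradient of the squared-error term w.r.t. each coefficient.
        grad = [2.0 * sum(basis[j][i] * r[i] for i in range(d)) for j in range(k)]
        # Gradient step followed by soft-thresholding.
        a = [soft_threshold(a[j] - step * grad[j], step * lam) for j in range(k)]
    return a
```

For an orthonormal basis this reduces to coordinate-wise soft-thresholding: small coefficients are driven exactly to zero, yielding the sparse high-level features that are then fed to a supervised classifier.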
Data Mining with Online Knowledge Repository
A major component of our approach is to use online
knowledge repositories as auxiliary information sources to
help bridge the gap between the source domain and the
target domain. Therefore, we review some recent approaches
to data mining with online knowledge repositories.
In recent years, understanding and using online knowledge
repositories to aid real-world data mining tasks has
become an active research topic. A growing number of
works use Wikipedia for feature enrichment.
Gabrilovich and Markovitch [14], [15] use the Open
Directory Project (ODP) for feature enrichment in the text
classification problem. They also show that using Wikipedia
as the external Web knowledge resource for feature
enrichment performs better than using ODP [16].
XIANG ET AL.: BRIDGING DOMAINS USING WORLD WIDE KNOWLEDGE FOR TRANSFER LEARNING 771
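As a toy illustration of feature enrichment with an external knowledge base, suppose an inverted index mapping terms to weighted Wikipedia concepts has been prebuilt (the tiny index below is entirely hypothetical; a real system would derive it from the full Wikipedia corpus). Documents can then be represented in a shared concept space rather than by raw words alone:

```python
# Hypothetical term -> Wikipedia-concept weights for illustration only.
CONCEPT_INDEX = {
    "goal":    {"Football": 0.9, "Project_management": 0.3},
    "striker": {"Football": 0.8},
    "budget":  {"Project_management": 0.7, "Finance": 0.6},
}

def enrich(doc_terms):
    """Map a bag of words to a weighted vector over Wikipedia concepts."""
    concepts = {}
    for term in doc_terms:
        for concept, weight in CONCEPT_INDEX.get(term, {}).items():
            concepts[concept] = concepts.get(concept, 0.0) + weight
    return concepts
```

Because both source-domain and target-domain documents map into the same concept space, concepts can carry signal across domains even when the two domains share few surface words, which is the role the auxiliary knowledge base plays in this paper.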
CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel framework for tackling
the problem of domain adaptation under large information
gaps. We model the learning problem as a semisupervised
learning problem aided by a method for filling in the
information gap between the source and target domains with
the help of an auxiliary knowledge base (such as
Wikipedia). By conducting experiments on several difficult
domain adaptation tasks, we show that our algorithm can
significantly outperform several existing domain adaptation
approaches in situations where the source and target domains
are far from each other. In each case, an auxiliary domain can
be used to fill in the information gap efficiently.