26-05-2012, 02:36 PM
Bridging Domains Using World Wide Knowledge for Transfer Learning
Bridging Domains.pdf (Size: 10.26 MB / Downloads: 205)
INTRODUCTION
TEXT classification, which aims to assign a document to one
or more categories based on its content, is a fundamental
task for Web and document data mining applications,
ranging from information retrieval and spam detection to online
advertising and Web search. Traditional supervised
learning approaches to text classification require sufficient
labeled instances in a problem domain in order to train a
high-quality model. However, it is not always easy or feasible to
obtain new labeled data in a domain of interest (hereafter
referred to as the target domain). This lack of labeled data
can seriously hurt classification performance in
many real-world applications.
To address this problem, transfer learning techniques, in
particular domain adaptation techniques, have been
introduced. They capture knowledge shared with
related domains where labeled data are available and
use that knowledge to improve the performance of data
mining tasks in a target domain. In transfer learning
terminology, one or more auxiliary domains are identified
as the source of knowledge transfer, and the domain of
interest is known as the target domain. Much effort has been
dedicated to this problem in recent years in machine learning,
data mining, and information retrieval [1], [2], [3], [4], [5].
RELATED WORK
In this section, we briefly review some previously proposed
methods for solving the task of domain adaptation. Since
Wikipedia is used as our auxiliary data for building the
information bridge, we also review some methods that
extract useful knowledge from Wikipedia and other similar
knowledge bases such as ODP. Finally, since our approach
to the domain adaptation problem is closely related to semisupervised
learning, we also briefly review this topic.
Domain Adaptation
Domain adaptation has attracted increasing attention in
recent years. In general, previous domain adaptation
approaches can be classified into two categories [7]: instance-based
approaches [1], [2] and feature-based approaches [4],
[5], [8], [9].
Instance-based methods seek reweighting
strategies on the source data such that the source distribution
better matches the target distribution. Feature-based
methods try to discover a shared feature space in which
the distributions of the different domains are pulled closer. Both
types try to discover the relation between the source and
target domains using only the two domains themselves. For
example, instance-based transfer learning models assume
that there is a subset of instances sharing similar distributions
across domains, and they then emphasize the
impact of these data in the models since they are more
“similar.” Feature-based domain adaptation models
assume that different domains may share some features,
for instance, a subset of explicit or implicit features.
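To make the instance-based idea concrete, the density ratio between the target and source distributions can be estimated with a simple domain classifier: train a logistic model to tell source examples from target examples, then weight each source instance by P(target | x) / P(source | x). The sketch below is a minimal illustration of this general strategy, not the specific method of any cited paper; the function names and the one-dimensional toy data are our own.

```python
import math
import random

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_domain_classifier(src, tgt, lr=0.1, epochs=100):
    """Logistic regression separating source (label 0) from target (label 1)."""
    data = [(x, 0.0) for x in src] + [(x, 1.0) for x in tgt]
    dim = len(src[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def importance_weight(x, w, b):
    """Estimate p_target(x) / p_source(x) from the classifier's odds."""
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    return p / max(1.0 - p, 1e-12)
```

Source instances that look like target data receive weights above 1 and therefore more influence when the target-domain model is trained.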
Here, we consider some well-known instance-based
domain adaptation methods. Jiang et al. [2] apply
instance weighting to natural language processing tasks,
using a form of importance sampling to correct for
sample selection bias [10]. Dai
et al. [1] propose a boosting-style reweighting method
that applies different weighting schemes to data from
different domains. Feature-based methods have also been
developed and compared against instance-based ones.
Daumé et al. [11] propose a simple feature augmentation
method for NLP tasks. Blitzer et al. [12] use the Structural
Correspondence Learning (SCL) model to identify correspondences
among features from different domains by
modeling their correlations with pivot features that behave
in the same way in both domains for discriminative learning;
the chosen pivot features serve to bridge the two domains. Lee
et al. [13] use transfer learning on an ensemble of related
tasks to construct an informative prior on feature relevance.
They assume that features themselves have metafeatures
that are predictive of their relevance to the prediction task,
and model their relevance as a function of the metafeatures.
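The feature augmentation of Daumé et al. admits a very compact sketch: each feature vector is tripled into a shared copy, a source-only copy, and a target-only copy, so that a single linear classifier can learn both domain-shared and domain-specific weights. The sketch below assumes dense, list-based feature vectors:

```python
def augment_source(x):
    """Map a source-domain vector to [shared copy, source copy, zeros]."""
    return x + x + [0.0] * len(x)

def augment_target(x):
    """Map a target-domain vector to [shared copy, zeros, target copy]."""
    return x + [0.0] * len(x) + x
```

Training one linear model on the augmented vectors lets the weights on the shared block capture behavior common to both domains, while the domain-specific blocks absorb the rest.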
Raina et al. [5] describe an approach to self-taught
learning that uses sparse coding to construct high-level
features from unlabeled data. They first express each
unlabeled instance as a sparse weighted linear
combination of basis vectors, enforcing sparsity with an L1 penalty. They
then use these sparse codes as input features to standard supervised
classification algorithms. In [9], a co-clustering-based
classification algorithm called CoCC is proposed to classify
out-of-domain documents. The class structure is passed
through word clusters from the in-domain data to the
out-of-domain data. Additional class-label information given by
the in-domain data is extracted and used for labeling the
word clusters for out-of-domain documents. However, one
drawback of many previous works is that the knowledge
shared between the source and target domains may be
quite limited, so the relation between the two domains
cannot be fully exploited.
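To make the sparse-coding step of self-taught learning [5] concrete, the sketch below solves min_a ||x - sum_j a_j b_j||^2 + lam * ||a||_1 for a fixed basis via iterative soft-thresholding (ISTA). This is an illustrative sketch only: the basis in the test is a toy orthonormal one, and learning the basis itself (part of the full method) is omitted.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of the L1 norm)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_code(x, basis, lam=0.2, step=0.1, iters=500):
    """ISTA for min_a ||x - sum_j a[j] * basis[j]||^2 + lam * ||a||_1."""
    k, d = len(basis), len(x)
    a = [0.0] * k
    for _ in range(iters):
        # Residual of the current reconstruction.
        r = [sum(a[j] * basis[j][i] for j in range(k)) - x[i] for i in range(d)]
        # Gradient of the squared-error term w.r.t. each coefficient.
        grad = [2.0 * sum(basis[j][i] * r[i] for i in range(d)) for j in range(k)]
        # Gradient step followed by soft-thresholding.
        a = [soft_threshold(a[j] - step * grad[j], step * lam) for j in range(k)]
    return a
```

For an orthonormal basis this reduces to coordinate-wise soft-thresholding: small coefficients are driven exactly to zero, yielding the sparse high-level features that are then fed to a supervised classifier.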
Data Mining with Online Knowledge Repository
A major component of our approach is to use online
knowledge repositories as auxiliary information sources to
help bridge the gap between the source domain and the
target domain. Therefore, we review some recent approaches
to data mining with online knowledge repositories.
In recent years, understanding and using online knowledge
repositories to aid real-world data mining tasks has
become an active research topic. A growing number of
works use Wikipedia for feature enrichment.
Gabrilovich and Markovitch [14], [15] use the Open
Directory Project (ODP) for feature enrichment in the text
classification problem. They also show that using Wikipedia
as the external Web knowledge resource for feature
enrichment performs better than using ODP [16].
XIANG ET AL.: BRIDGING DOMAINS USING WORLD WIDE KNOWLEDGE FOR TRANSFER LEARNING 771
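As a toy illustration of feature enrichment with an external knowledge base, suppose an inverted index mapping terms to weighted Wikipedia concepts has been prebuilt (the tiny index below is entirely hypothetical; a real system would derive it from the full Wikipedia corpus). Documents can then be represented in a shared concept space rather than by raw words alone:

```python
# Hypothetical term -> Wikipedia-concept weights for illustration only.
CONCEPT_INDEX = {
    "goal":    {"Football": 0.9, "Project_management": 0.3},
    "striker": {"Football": 0.8},
    "budget":  {"Project_management": 0.7, "Finance": 0.6},
}

def enrich(doc_terms):
    """Map a bag of words to a weighted vector over Wikipedia concepts."""
    concepts = {}
    for term in doc_terms:
        for concept, weight in CONCEPT_INDEX.get(term, {}).items():
            concepts[concept] = concepts.get(concept, 0.0) + weight
    return concepts
```

Because both source-domain and target-domain documents map into the same concept space, concepts can carry signal across domains even when the two domains share few surface words, which is the role the auxiliary knowledge base plays in this paper.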
CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel framework for tackling
the problem of domain adaptation under large information
gaps. We model the learning problem as a semisupervised
learning problem aided by a method for filling in the
information gap between the source and target domains with
the help of an auxiliary knowledge base (such as
Wikipedia). By conducting experiments on several difficult
domain adaptation tasks, we show that our algorithm can
significantly outperform several existing domain adaptation
approaches in situations where the source and target domains
are far from each other. In each case, an auxiliary domain can
be used to fill in the information gap efficiently.