FEATURE-BASED TRANSFER LEARNING WITH REAL-WORLD APPLICATIONS
ABSTRACT
Transfer learning is a new machine learning and data mining framework that allows the training
and test data to come from different distributions and/or feature spaces. We can find many novel
applications of machine learning and data mining where transfer learning is helpful, especially
when we have limited labeled data in our domain of interest. In this thesis, we first survey
different settings and approaches of transfer learning and give a big picture of the field. We
focus on latent space learning for transfer learning, which aims at discovering a “good” common
feature space across domains, such that knowledge transfer becomes possible. In our study,
we propose a novel dimensionality reduction framework for transfer learning, which tries to
reduce the distance between different domains while preserving data properties as much as possible.
This framework is general for many transfer learning problems when domain knowledge
is unavailable. Based on this framework, we propose three effective solutions to learn the latent
space for transfer learning. We apply these methods to two diverse applications: cross-domain
WiFi localization and cross-domain text classification, and achieve promising results. Furthermore,
for a specific application area, such as sentiment classification, where domain knowledge
is available for encoding into transfer learning methods, we propose a spectral feature alignment
algorithm for cross-domain learning. In this algorithm, we try to align domain-specific features
from different domains by using some domain-independent features as a bridge. Experimental
results show that this method outperforms a state-of-the-art algorithm on two real-world datasets
for cross-domain sentiment classification.
INTRODUCTION
Supervised data mining and machine learning technologies have already been widely studied
and applied to many knowledge engineering areas. However, most traditional supervised algorithms
work well only under a common assumption: the training and test data are drawn from
the same feature space and the same distribution. Furthermore, the performance of these algorithms
relies heavily on collecting sufficient high-quality labeled training data to train a
statistical or computational model for making predictions on future data [127, 77, 189]. However,
in many real-world scenarios, labeled training data are in short supply or can be
obtained only at great cost. This problem has become a major bottleneck to making machine
learning and data mining methods more applicable in practice.
In the last decade, semi-supervised learning techniques [233, 34, 131, 27, 90] have been
proposed to address the problem that the labeled training data may be too few to build a good
classifier, by making use of a large amount of unlabeled data to discover the underlying data
structure, together with a small amount of labeled data, to train models. Nevertheless, most semi-supervised
methods require that the training data, both labeled and unlabeled, and the test data
come from the same domain of interest, which implicitly assumes that the training and test data
are represented in the same feature space and drawn from the same data distribution.
Instead of exploring unlabeled data to train a precise model, active learning, which is another
branch in machine learning for reducing annotation effort of supervised learning, tries to design
an active learner to pose queries, usually in the form of unlabeled data instances to be labeled
by an oracle (e.g., a human annotator). The key idea behind active learning is that a machine
learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to
choose the data from which it learns [101, 168]. However, most active learning methods assume
that the active learner has a budget for posing queries in the domain of interest. In some
real-world applications the budget may be quite limited, in which case active learning methods
may fail to learn accurate classifiers in the domain of interest.
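The idea of letting the learner choose the data from which it learns can be illustrated with uncertainty sampling, one of the simplest query strategies. The sketch below is illustrative only (the function name and the toy probability pool are invented for this example); it selects the unlabeled instances whose current predictions are least confident:

```python
import numpy as np

def uncertainty_sampling(probs, k):
    """Return indices of the k unlabeled instances whose predicted
    class probabilities are least confident (closest to uniform)."""
    confidence = probs.max(axis=1)       # top class probability per instance
    return np.argsort(confidence)[:k]    # k smallest = most ambiguous

# Toy pool of predicted class probabilities for 4 unlabeled instances.
pool = np.array([[0.95, 0.05],   # very confident
                 [0.55, 0.45],
                 [0.80, 0.20],
                 [0.51, 0.49]])  # most uncertain
queries = uncertainty_sampling(pool, k=2)
print(queries)  # [3 1]: the two most ambiguous instances are queried
```

The selected instances would then be sent to the oracle for labeling, and the model retrained.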
Transfer learning, in contrast, allows the domains, tasks, and distributions used in training
and testing to be different. The main idea behind transfer learning is to borrow labeled data
or knowledge extracted from some related domains to help a machine learning algorithm to
achieve greater performance in the domain of interest [183]. Thus, transfer learning can be
regarded as a different strategy for learning models with minimal human supervision, compared
to semi-supervised learning and active learning. In the real world, we can observe many examples of
transfer learning. For example, we may find that learning to recognize apples might help to
recognize pears. Similarly, learning to play the electronic organ may help facilitate learning the
piano. Furthermore, in many engineering applications, it is expensive or impossible to collect
sufficient training data to train a model for use in each domain of interest. It would be nice
if one could reuse the training data which have been collected in some related domains/tasks
or the knowledge that is already extracted from some related domains/tasks to learn a precise
model for use in the domain of interest. In such cases, knowledge transfer, or transfer learning,
between tasks or domains becomes highly desirable and crucial.
Many examples in knowledge engineering can be found where transfer learning can truly be
beneficial. One example is Web document classification, where our goal is to classify a given
Web document into several predefined categories. In this area (see, e.g., [49]), the labeled
examples may be university Web pages that are associated with category information obtained
through previous manual-labeling efforts. For a
classification task on a newly created Web site where the data features or data distributions may
be different, there may be a lack of labeled training data. As a result, we may not be able to
directly apply the Web-page classifiers learned on the university Web site to the new Web site.
In such cases, it would be helpful if we could transfer the classification knowledge into the new
domain.
The need for transfer learning may also arise when the data can be easily outdated. In this
case, the labeled data obtained in one time period may not follow the same distribution in a
later time period. For example, in the indoor WiFi localization problem, which aims to detect a
user’s current location based on previously collected WiFi data, it is very expensive to calibrate
WiFi data for building localization models in a large-scale environment, because a user needs to
label a large collection of WiFi signal data at each location. However, the WiFi signal-strength
values may be a function of time, device or other dynamic factors. As shown in Figure 1.1,
values of received signal strength (RSS) may differ across time periods and mobile devices. As
a result, a model trained in one time period or on one device may yield degraded performance
for location estimation in another time period or on another device. To reduce the
re-calibration effort, we might wish to adapt the localization model trained in one time period
(the source domain) for a new time period (the target domain), or to adapt the localization model
trained on a mobile device (the source domain) for a new mobile device (the target domain), as
introduced in [142].
As a third example, transfer learning is also desirable when the features between domains
change. Consider the problem of sentiment classification, where our task is to automatically
classify the reviews on a product, such as a brand of camera, into polarity categories (e.g.,
positive or negative). In the literature, supervised learning algorithms [146] have proven to be
promising and have been widely used for sentiment classification.

[Figure 1.1: Contours of RSS values over a 2-dimensional environment, collected from the same
AP in different time periods and received by different mobile devices (e.g., panel (d): WiFi RSS
received by device B in T2). Different colors denote different signal strength values.]

However, these methods are domain
dependent. The reason is that users may use domain-specific words to express sentiment in
different domains. Table 1.1 shows several user review sentences from two domains: electronics
and video games. In the electronics domain, we may use words like “compact”, “sharp” to
express our positive sentiment and use “blurry” to express our negative sentiment, while in
the video game domain, words like “hooked” and “realistic” indicate positive opinion and the word
“boring” indicates negative opinion. Due to the mismatch between domain-specific vocabularies, a
sentiment classifier trained in one domain may not work well when directly applied to other
domains. Thus, cross-domain sentiment classification algorithms, which transfer knowledge
from related domains to the domain of interest [25], are highly desirable for reducing domain
dependency and manual labeling cost.
Table 1.1: Cross-domain sentiment classification examples: reviews of electronics and video
games products. Boldfaced words are domain-specific, i.e., much more frequent in one domain
than in the other. “+” denotes positive sentiment, and “-” denotes negative sentiment.

Electronics:
  + Compact; easy to operate; very good picture quality; looks sharp!
  + I purchased this unit from Circuit City and I was very excited about the quality of the
    picture. It is really nice and sharp.
  - It is also quite blurry in very dark settings. I will never buy HP again.

Video games:
  + A very good game! It is action packed and full of excitement. I am very much hooked
    on this game.
  + Very realistic shooting action and good plots. We played this and were hooked.
  - The game is so boring. I am extremely unhappy and will probably never buy UbiSoft
    again.
1.1 The Contribution of This Thesis
Generally speaking, transfer learning can be categorized into three settings: inductive transfer,
transductive transfer and unsupervised transfer. This categorization was first described in our
survey article [141] and will be introduced in detail in Chapter 2. In this thesis, we focus on the
transductive transfer learning setting, where we are given a large amount of labeled data in a
source domain and some unlabeled data in a target domain, and our goal is to learn an accurate
model for use in the target domain. Note that in this setting, no labeled data in the target domain
are available for training.
Furthermore, in transfer learning, we have the following three main research issues: (1)
What to transfer; (2) How to transfer; (3) When to transfer [141], which will be introduced in
detail in Chapter 2 as well.
“What to transfer” asks which part of the knowledge can be transferred across domains or tasks.
Some knowledge is specific to individual domains or tasks, while some knowledge is common
to different domains and may help improve performance in the target domain or task. After
discovering which knowledge can be transferred, learning algorithms need to be developed to
transfer it, which corresponds to the “how to transfer” issue.
“When to transfer” asks in which situations knowledge transfer should be performed. Likewise,
we are interested in knowing in which situations knowledge should not be transferred. In some
situations, when the source domain and target domain are not related to each other, brute-force
transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning
in the target domain, a situation which is often referred to as negative transfer.
In this thesis, we focus on “What to transfer” and “How to transfer” by implicitly assuming
that the source and target domains are related to each other. We leave the issue of how to avoid
negative transfer to future work. For “What to transfer”, we propose to discover a latent
feature space for transfer learning, where the distance between domains can then be reduced
and the important information of the original data can be preserved simultaneously. Standard
machine learning and data mining methods can be applied directly in the latent space to train
models for making predictions on the target domain data. Thus, the latent space can be treated
as a bridge across domains to make knowledge transfer possible and successful.
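To make “reducing the distance between domains” concrete, one standard measure of distribution distance in this line of work is the Maximum Mean Discrepancy (MMD): the distance between the means of the two samples after mapping into a feature space. Below is a minimal sketch assuming a plain linear kernel, in which case the squared MMD reduces to the squared distance between empirical means (the function name and the toy Gaussian data are invented for illustration):

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Squared MMD with a linear kernel: the squared Euclidean
    distance between the two empirical sample means."""
    return float(np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2))

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 5))   # "source domain" sample
Xt = rng.normal(1.0, 1.0, size=(200, 5))   # "target domain", shifted mean
print(mmd_linear(Xs, Xs))                  # 0.0 (identical samples)
print(mmd_linear(Xs, Xt) > mmd_linear(Xs, Xs[:100]))  # True: domains differ
```

A projection that shrinks this quantity while preserving data variance is exactly what the latent-space framework searches for.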
For “How to transfer”, we propose two embedding learning frameworks to learn the latent
space based on two different situations: (1) domain knowledge is hidden or hard to capture, and
(2) domain knowledge can be observed or easy to encode in embedding learning. In most application
areas, such as text classification or WiFi localization, the domain knowledge is hidden.
For example, text data may be controlled by some latent topics, and WiFi data may be controlled
by some hidden factors, such as the structure of a building, etc. In this case, we propose
a novel and general dimensionality reduction framework for transfer learning. Our framework
aims to learn a latent space shared across domains, such that when data from different domains
are projected onto this space, the distance between the data distributions is dramatically reduced
while the original data properties, such as variance and local geometric structure, are preserved
as much as possible. Based on this framework, we propose three different
algorithms to learn the latent space, Maximum Mean Discrepancy Embedding (MMDE) [135],
Transfer Component Analysis (TCA) [139] and Semi-supervised Transfer Component Analysis
(SSTCA) [140]. More specifically, in MMDE we formulate latent space learning for transfer
learning as a non-parametric kernel matrix learning problem. The resulting free-form kernel
may be more precise for transfer learning, but suffers from a high computational cost.
Thus, in TCA and SSTCA, we instead propose to learn parametric kernel-based embeddings for
transfer learning. The main difference between the two is that TCA is an unsupervised
feature extraction method, while SSTCA is a semi-supervised feature extraction method. We apply
these three algorithms to two diverse application areas: wireless sensor networks and Web
mining.
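As a rough sketch of what a TCA-style parametric embedding looks like in code: with a linear kernel, one can solve a regularized eigenproblem that trades off minimizing the MMD between domains against preserving variance. The snippet below is a simplified illustration only (the function name, regularization value, and toy data are assumptions for this sketch, not the thesis's exact algorithm):

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0):
    """Simplified TCA-style embedding with a linear kernel: find
    directions that shrink the MMD between the two domains while
    keeping the centered data variance, via the leading
    eigenvectors of (K L K + mu I)^{-1} K H K."""
    X = np.vstack([Xs, Xt])
    n, ns, nt = len(X), len(Xs), len(Xt)
    K = X @ X.T                                # linear kernel matrix
    e = np.r_[np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)]
    L = np.outer(e, e)                         # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real
    return K @ W                               # rows: embedded source, then target points

rng = np.random.default_rng(1)
Xs = rng.normal(0, 1, (30, 4))                 # source domain
Xt = rng.normal(2, 1, (30, 4))                 # mean-shifted target domain
Z = tca(Xs, Xt, dim=2)
print(Z.shape)                                 # (60, 2)
```

In the embedded space, standard classifiers trained on the source rows can be applied to the target rows.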
In contrast to the general framework for transfer learning without domain knowledge, in
some application areas, such as sentiment classification, some domain knowledge can be observed
and used for learning the latent space across domains. For example, in sentiment classification,
though users may use some domain-specific words as shown in Table 1.1, they may
use some domain-independent sentiment words, such as “good”, “never buy”, etc. In addition,
some domain-specific and domain-independent words may co-occur in reviews frequently,
which means there may be a correlation between these words. This observation motivates us to
propose a spectral feature clustering framework [137] to align domain-specific words from different
domains in a latent space, by modeling the correlation between domain-independent
and domain-specific words in a bipartite graph and using the domain-independent features as a
bridge for cross-domain sentiment classification.
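The bipartite-graph idea can be sketched as follows. This is not the exact published algorithm, only an illustrative approximation: domain-specific words are embedded via an SVD of a degree-normalized co-occurrence matrix with the domain-independent (pivot) words, so that words playing the same sentiment role in different domains land near each other. All word examples and counts below are invented for illustration:

```python
import numpy as np

def spectral_alignment(M, k=2):
    """Embed domain-specific words using their co-occurrence with
    domain-independent (pivot) words: degree-normalize the bipartite
    co-occurrence matrix and keep the top-k left singular vectors."""
    d_rows, d_cols = M.sum(axis=1), M.sum(axis=0)
    Mn = M / np.sqrt(np.outer(d_rows, d_cols))   # normalized bipartite graph
    U, _, _ = np.linalg.svd(Mn)
    return U[:, :k]                              # one latent row per word

# Toy counts: rows are domain-specific words, columns are pivot words
# (e.g. co-occurrence with "good" vs. with "never buy").
M = np.array([[8.0, 1.0],    # "sharp"  (electronics, positive)
              [1.0, 9.0],    # "blurry" (electronics, negative)
              [7.0, 2.0],    # "hooked" (video games, positive)
              [2.0, 8.0]])   # "boring" (video games, negative)
Z = spectral_alignment(M)
# Words with similar pivot profiles align across domains:
pos_gap = np.linalg.norm(Z[0] - Z[2])   # "sharp" vs "hooked"
neg_gap = np.linalg.norm(Z[0] - Z[1])   # "sharp" vs "blurry"
print(pos_gap < neg_gap)                # True
```

Once aligned, a sentiment classifier trained on one domain's latent features can be applied to the other domain's.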
In this thesis, we study the problem of feature-based transfer learning and its real-world
applications, such as WiFi localization, text classification and sentiment classification. Note
that there has been a large amount of work on transfer learning for reinforcement learning in
the machine learning literature (see, e.g., the survey article [182]). However, in this thesis
we focus only on transfer learning for classification and regression tasks, which are more
closely related to machine learning and data mining. The main contributions of this thesis can be
summarized as follows:
• We give a comprehensive survey of transfer learning, in which we summarize different
transfer learning settings and approaches and discuss the relationship between transfer
learning and other related areas. Readers may get a big picture of the field from this
survey.
• We propose a general dimensionality reduction framework for transfer learning without
any domain knowledge. Based on the framework, we propose three solutions to learn
the latent space for transfer learning. Furthermore, we apply them to solve the WiFi
localization and text classification problems and achieve promising results.
• We propose a specific latent space learning method for sentiment classification, which
encodes domain knowledge in a spectral feature alignment framework. The proposed method
outperforms a state-of-the-art cross-domain method in the field of sentiment classification.