Abstract—In many text mining applications, side-information is available along with the text documents. Such side-information may
be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or
other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of
information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially
when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because
it can either improve the quality of the representation for the mining process, or can add noise to the process. Therefore, we need a
principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we
design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective
clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a
number of real data sets in order to illustrate the advantages of using such an approach.
INTRODUCTION
The problem of text clustering arises in the context of
many application domains such as the web, social networks,
and other digital collections. The rapidly increasing
amounts of text data in the context of these large online
collections has led to an interest in creating scalable and
effective mining algorithms. A tremendous amount of work
has been done in recent years on the problem of clustering
in text collections [5], [11], [27], [30], [37] in the
database and information retrieval communities. However,
this work is primarily designed for the problem of pure text
clustering, in the absence of other kinds of attributes. In
many application domains, a tremendous amount of side-information is also associated with the documents.
This is because text documents typically occur in the context
of a variety of applications in which there may be a
large amount of other kinds of database attributes or meta-information which may be useful to the clustering process.
Some examples of such side-information are as follows:
• In an application in which we track user access
behavior of web documents, the user-access behavior
may be captured in the form of web logs. For each
document, the meta-information may correspond to
the browsing behavior of the different users. Such logs can be used to enhance the quality of the mining
process in a way which is more meaningful to the
user, and also application-sensitive. This is because
the logs can often pick up subtle correlations in content,
which cannot be picked up by the raw text
alone.
• Many text documents contain links among them,
which can also be treated as attributes. Such links
contain a lot of useful information for mining purposes.
As in the previous case, such attributes may
often provide insights about the correlations among
documents in a way which may not be easily accessible
from raw content.
• Many web documents have meta-data associated
with them which correspond to different kinds of
attributes such as the provenance or other information
about the origin of the document. In other
cases, data such as ownership, location, or even temporal
information may be informative for mining
purposes. In a number of network and user-sharing
applications, documents may be associated with
user-tags, which may also be quite informative.
While such side-information can sometimes be useful in
improving the quality of the clustering process, it can be a
risky approach when the side-information is noisy. In such
cases, it can actually worsen the quality of the mining process.
Therefore, we will use an approach which carefully
ascertains the coherence of the clustering characteristics of
the side information with that of the text content. This helps
in magnifying the clustering effects of both kinds of data.
The core of the approach is to determine a clustering in
which the text attributes and side-information provide similar
hints about the nature of the underlying clusters, and
at the same time ignore those aspects in which conflicting
hints are provided.
In order to achieve this goal, we will combine a partitioning
approach with a probabilistic estimation process,
which determines the coherence of the side-attributes in
the clustering process. A probabilistic model on the side
information uses the partitioning information (from text
attributes) for the purpose of estimating the coherence
of different clusters with side attributes. This helps in
abstracting out the noise in the membership behavior of
different attributes. The partitioning approach is specifically
designed to be very efficient for large data sets. This can
be important in scenarios in which the data sets are very
large. We will present experimental results on a number of
real data sets, and illustrate the effectiveness and efficiency
of the approach.
While our primary goal in this paper is to study the
clustering problem, we note that such an approach can
also be extended in principle to other data mining problems
in which auxiliary information is available with text.
Such scenarios are very common in a wide variety of data
domains. Therefore, we will also propose a method in this
paper to extend the approach to the classification problem. We will show that the extension of the approach to
the classification problem provides superior results because
of the incorporation of side information. Our goal is to
show that the advantages of using side-information extend
beyond a pure clustering task, and can provide competitive
advantages for a wider variety of problem scenarios.
This paper is organized as follows. The remainder of
this section will present the related work on the topic. In
the next section, we will formalize the problem of text
clustering with side information. We will also present an
algorithm for the clustering process. We will show how
to extend these techniques to the classification problem in
Section 3. In Section 4, we will present the experimental
results. Section 5 contains the conclusions and summary.
1.1 Related Work
The problem of text-clustering has been studied widely by
the database community [18], [25], [34]. The major focus
of this work has been on scalable clustering of multidimensional
data of different types [18], [19], [25], [34].
A general survey of clustering algorithms may be found
in [21]. The problem of clustering has also been studied
quite extensively in the context of text-data. A survey
of text clustering methods may be found in [3]. One of
the most well known techniques for text-clustering is the
scatter-gather technique [11], which uses a combination
of agglomerative and partitional clustering. Other related
methods for text-clustering which use similar methods are
discussed in [27], [29]. Co-clustering methods for text data
are proposed in [12], [13]. An Expectation Maximization
(EM) method for text clustering has been proposed in [22].
Matrix-factorization techniques for text clustering are proposed
in [32]. This technique selects words from the document
based on their relevance to the clustering process, and
uses an iterative EM method in order to refine the clusters.
A closely related area is that of topic-modeling, event tracking,
and text-categorization [6], [9], [15], [16]. In this context,
a method for topic-driven clustering for text data has been
proposed in [35]. Methods for text clustering in the context
of keyword extraction are discussed in [17]. A number of practical tools for text clustering may be found in [23].
A comparative study of different clustering methods may
be found in [30].
The problem of text clustering has also been studied in
the context of scalability in [5], [20], [37]. However, all of these
methods are designed for the case of pure text data, and
do not work for cases in which the text-data is combined
with other forms of data. Some limited work has been done
on clustering text in the context of network-based linkage
information [1], [2], [8], [10], [24], [31], [33], [36], though
this work is not applicable to the case of general side-information attributes. In this paper, we will provide a first
approach to using other kinds of attributes in conjunction
with text clustering. We will show the advantages of using
such an approach over pure text-based clustering. Such an
approach is especially useful when the auxiliary information
is highly informative, and provides effective guidance
in creating more coherent clusters. We will also extend the
method to the problem of text classification, which has been
studied extensively in the literature. Detailed surveys on
text classification may be found in [4], [28].
2 CLUSTERING WITH SIDE INFORMATION
In this section, we will discuss an approach for clustering
text data with side information. We assume that we have a
corpus S of text documents. The total number of documents
is N, and they are denoted by T1 ... TN. It is assumed that
the set of distinct words in the entire corpus S is denoted
by W. Associated with each document Ti, we have a set of
side attributes Xi. Each set of side attributes Xi has d dimensions,
which are denoted by (xi1 ... xid). We refer to such
attributes as auxiliary attributes. For ease in notation and
analysis, we assume that each side-attribute is binary,
though both numerical and categorical attributes can easily
be converted to this format in a fairly straightforward
way. This is because the different values of the categorical
attribute can be assumed to be separate binary attributes,
whereas numerical data can be discretized to binary values
with the use of attribute ranges. Some examples of such
side-attributes are as follows:
• In a web log analysis application, we assume that
xir corresponds to the 0-1 variable, which indicates
whether or not the ith document has been accessed
by the rth user. This information can be used in order
to cluster the web pages in a site in a more informative
way than techniques based purely
on the content of the documents. As in the previous
case, the number of pages in a site may be large, but
the number of documents accessed by a particular
user may be relatively small.
• In a network application, we assume that xir is a 0-1 variable indicating whether or not the ith document Ti has a hyperlink to the
rth page Tr. If desired, it can be implicitly assumed
that each page links to itself in order to maximize
linkage-based connectivity effects during the clustering
process. Since hyperlink graphs are large and
sparse, it follows that the number of such auxiliary variables is high, but only a small fraction of them take on the value of 1.
• In a document application with associated GPS or provenance information, the possible attributes may be drawn from a large number of possibilities. Such attributes will naturally satisfy the sparsity property.
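The sparse 0-1 side-attributes in the examples above can be stored compactly by keeping, for each document, only the attribute indices that take the value 1. The following is a minimal sketch in Python; the document identifiers and user names are illustrative assumptions, not part of the paper:

```python
# Sparse representation of binary auxiliary attributes: for each
# document i, keep only the set of attribute indices r with x_ir = 1.
# Document identifiers and user names below are hypothetical examples.

def build_side_attributes(access_log):
    """access_log: iterable of (doc_id, attr_id) pairs, e.g. from a web log."""
    side = {}
    for doc_id, attr_id in access_log:
        side.setdefault(doc_id, set()).add(attr_id)
    return side

log = [(0, "user_a"), (0, "user_b"), (1, "user_a"), (2, "user_c")]
side = build_side_attributes(log)
print(side[0])  # attributes with value 1 for document 0
```

Because only the nonzero entries are stored, the memory cost grows with the number of accesses (or links), not with the full N x d attribute matrix.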
As noted in the examples above, such auxiliary attributes
are quite sparse in many real applications. This can be a
challenge from an efficiency perspective, unless the sparsity
is carefully taken into account during the clustering process.
Therefore, our techniques will be designed to account for
such sparsity. However, our approach can easily be adapted to non-sparse attributes by treating the attribute values in a more symmetric way.
We note that our technique is not restricted to binary
auxiliary attributes, but can also be applied to attributes
of other types. When the auxiliary attributes are of other
types (quantitative or categorical), they can be converted
to binary attributes with the use of a simple transformation
process. For example, numerical data can be discretized
into binary attributes. Even in this case, the derived binary
attributes are quite sparse especially when the numerical
ranges are discretized into a large number of attributes. In
the case of categorical data, we can define a binary attribute
for each possible categorical value. In many cases, the number
of such values may be quite large. Therefore, we will
design our techniques under the implicit assumption that
such attributes are quite sparse. The formulation for the
problem of clustering with side information is as follows:
Text Clustering with Side Information: Given a corpus
S of documents denoted by T1 ... TN, and a set of auxiliary variables
Xi associated with document Ti, determine a clustering of
the documents into k clusters which are denoted by C1 ... Ck,
based on both the text content and the auxiliary variables.
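The conversion of categorical and numerical auxiliary attributes to binary form, as described above, can be sketched as follows. The attribute names and bin boundaries here are illustrative assumptions:

```python
def categorical_to_binary(value, categories):
    """One binary attribute per possible categorical value."""
    return {f"cat={c}": int(value == c) for c in categories}

def numerical_to_binary(value, bin_edges):
    """One binary attribute per numerical range [edge_i, edge_{i+1})."""
    return {f"bin_{i}": int(lo <= value < hi)
            for i, (lo, hi) in enumerate(zip(bin_edges, bin_edges[1:]))}

print(categorical_to_binary("blog", ["news", "blog", "wiki"]))
print(numerical_to_binary(7.5, [0, 5, 10, 15]))
```

Note that each derived vector has exactly one entry equal to 1, which is why the binary attributes become sparser as the number of categories or ranges grows.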
We will use the auxiliary information in order to provide
additional insights, which can improve the quality of
clustering. In many cases, such auxiliary information may
be noisy, and may not have useful information for the clustering
process. Therefore, we will design our approach in
order to magnify the coherence between the text content
and the side-information, when this is detected. In cases, in
which the text content and side-information do not show
coherent behavior for the clustering process, the effects of
those portions of the side-information are marginalized.
2.1 The COATES Algorithm
In this section, we will describe our algorithm for text clustering
with side-information. We refer to this algorithm as
COATES throughout the paper, which corresponds to the
fact that it is a COntent and Auxiliary attribute based TExt
cluStering algorithm. We assume that an input to the algorithm
is the number of clusters k. As in the case of all
text-clustering algorithms, it is assumed that stop-words
have been removed, and stemming has been performed in
order to improve the discriminatory power of the attributes.
The algorithm requires two phases:
• Initialization: We use a lightweight initialization
phase in which a standard text clustering approach
is used without any side-information. For this purpose,
we use the algorithm described in [27]. This algorithm is used because it is simple and can quickly and efficiently provide a reasonable initial starting point. The centroids
and the partitioning created by the clusters
formed in the first phase provide an initial starting
point for the second phase. We note that the first
phase is based on text only, and does not use the
auxiliary information.
• Main Phase: The main phase of the algorithm is
executed after the first phase. This phase starts
off with these initial groups, and iteratively reconstructs
these clusters with the use of both the text
content and the auxiliary information. This phase
performs alternating iterations which use the text
content and auxiliary attribute information in order
to improve the quality of the clustering. We refer to these iterations as content iterations and auxiliary iterations, respectively. The combination of the two
iterations is referred to as a major iteration. Each
major iteration thus contains two minor iterations, corresponding
to the auxiliary and text-based methods
respectively.
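The two-phase structure above can be sketched in outline form. This is only a skeleton of the control flow, not the paper's exact procedures; the three callables are hypothetical hooks standing in for the text-clustering initialization of [27] and the two minor iterations:

```python
def coates(documents, side_attrs, k, text_cluster, content_iteration,
           auxiliary_iteration, num_major_iterations=10):
    """Skeleton of the two-phase COATES structure (a sketch, not the
    paper's exact algorithm). Hypothetical hooks:
      text_cluster(documents, k)                 -> (centroids, assignment)
      content_iteration(documents, centroids)    -> (centroids, assignment)
      auxiliary_iteration(side_attrs, assignment, k) -> assignment
    """
    # Phase 1: lightweight text-only initialization (no side-information).
    centroids, assignment = text_cluster(documents, k)

    # Phase 2: each major iteration is one content-based minor iteration
    # followed by one auxiliary-attribute-based minor iteration.
    for _ in range(num_major_iterations):
        centroids, assignment = content_iteration(documents, centroids)
        assignment = auxiliary_iteration(side_attrs, assignment, k)
    return centroids, assignment
```

Separating the hooks this way mirrors the paper's design: the text and side-information steps refine a shared partitioning without either data type being folded into the other's distance function.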
The focus of the first phase is simply to construct an initialization,
which provides a good starting point for the
clustering process based on text content. Since the key techniques
for content and auxiliary information integration
are in the second phase, we will focus most of our subsequent
discussion on the second phase of the algorithm. The
first phase is simply a direct application of the text clustering
algorithm proposed in [27]. The overall approach uses
alternating minor iterations of content-based and auxiliary
attribute-based clustering. These phases are referred to as
content-based and auxiliary attribute-based iterations respectively.
The algorithm maintains a set of seed centroids,
which are subsequently refined in the different iterations.
In each content-based phase, we assign a document to its
closest seed centroid based on a text similarity function.
The centroids for the k clusters created during this phase
are denoted by L1 ... Lk. Specifically, the cosine similarity
function is used for assignment purposes. In each auxiliary
phase, we create a probabilistic model, which relates
the attribute probabilities to the cluster-membership probabilities,
based on the clusters which have already been
created in the most recent text-based phase. The goal of this
modeling is to examine the coherence of the text clustering
with the side-information attributes. Before discussing the
auxiliary iteration in more detail, we will first introduce
some notations and definitions which help in explaining
the clustering model for combining auxiliary and text
variables.
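The content-based assignment step (each document goes to the closest seed centroid under cosine similarity) can be sketched as follows, assuming documents and centroids are represented as sparse term-frequency dictionaries. The sample terms are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_to_centroid(doc, centroids):
    """Index of the seed centroid L_j closest to doc under cosine similarity."""
    return max(range(len(centroids)), key=lambda j: cosine(doc, centroids[j]))

doc = {"data": 2, "mining": 1}
centroids = [{"web": 3, "log": 1}, {"data": 1, "mining": 2}]
print(assign_to_centroid(doc, centroids))  # -> 1
```

Since both vectors are sparse, the dot product only iterates over terms present in the document, which keeps each assignment cheap even for a large vocabulary W.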
We assume that the k clusters associated with the data
are denoted by C1 ... Ck. In order to construct a probabilistic
model of membership of the data points to clusters, we
assume that each auxiliary iteration has a prior probability
of assignment of documents to clusters (based on the
execution of the algorithm so far), and a posterior probability
of assignment of documents to clusters with the
use of auxiliary variables in that iteration. We denote the
prior probability that the document Ti belongs to the cluster
Cj by P(Ti ∈ Cj). Once the pure-text clustering phase
has been executed, the a-priori cluster membership probabilities
of the auxiliary attributes are generated with the
use of the last content-based iteration from this phase