Abstract. The field of text mining seeks to extract useful information
from unstructured textual data through the identification and exploration
of interesting patterns. The techniques employed usually do not involve
deep linguistic analysis or parsing, but rely on simple “bag-of-words” text
representations based on the vector space model. Several approaches to the identification
of patterns are discussed, including dimensionality reduction,
automated classification and clustering. Pattern exploration is illustrated
through two applications from our recent work: a classification-based
Web meta-search engine and visualization of coauthorship relationships
automatically extracted from a semi-structured collection of documents
describing researchers in the region of Vojvodina. Finally, preliminary
results concerning the application of dimensionality reduction techniques
to problems in sentiment classification are presented.
1. Introduction
Text mining is a new area of computer science which fosters strong connections
with natural language processing, data mining, machine learning, information
retrieval and knowledge management. Text mining seeks to extract
useful information from unstructured textual data through the identification
and exploration of interesting patterns [2]. This paper will discuss several approaches
to the identification of global patterns in text, based on the “bag-of-words”
(BOW) representation described in Section 2. The covered approaches
are automated classification and clustering (Section 3), and dimensionality reduction
(Section 5). Pattern exploration will be illustrated through two applications
from our recent work: presentation of Web meta-search engine results
(Section 4) and visualization of coauthorship relationships automatically
extracted from a semi-structured collection of documents describing researchers
in the Serbian province of Vojvodina (Section 6). Finally, preliminary results
concerning the application of dimensionality reduction techniques to problems
in sentiment classification are presented in Section 7.
2. Bag-of-Words Document Representation
Let W be the dictionary – the set of all terms (words) that occur at least
once in a collection of documents D. The bag-of-words representation of document
dn is a vector of weights (w1n, . . . , w|W|n). In the simplest case, the weights
win ∈ {0, 1} denote the presence or absence of a particular term in a document.
More commonly, win represents the frequency of the ith term in the nth
document, resulting in the term frequency representation. Normalization can
be employed to scale the term frequencies to values between 0 and 1, accounting
for differences in the lengths of documents. Besides words, n-grams may also be
used as terms. However, two different notions have been referred to as “n-grams”
in the literature: the first treats n-grams as phrases, i.e. sequences of n words,
while the second treats them as sequences of n characters. Word n-grams are
usually used to enrich the BOW representation rather than on their own, whereas
character n-grams are used instead of words.
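The normalized term-frequency representation described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: it assumes whitespace tokenization only, whereas real systems would add stemming, stop-word removal and possibly n-gram terms.

```python
from collections import Counter

def bow_vectors(docs):
    """Build length-normalized term-frequency vectors for a list of documents.
    Tokenization is plain lowercased whitespace splitting (a simplification)."""
    tokenized = [doc.lower().split() for doc in docs]
    # The dictionary W: every term occurring at least once in the collection.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Dividing by document length scales weights into [0, 1],
        # accounting for differences in document lengths.
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

vocab, vecs = bow_vectors(["the cat sat", "the cat and the dog"])
```

Each vector has one component per dictionary term, so short documents and long documents become directly comparable.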
The transformation of a document set D into the BOW representation enables
the transformed set to be viewed as a matrix, where rows represent document
vectors, and columns are terms. This view enables various matrix decomposition
techniques to be applied for the tasks of clustering [2] and dimensionality
reduction (Section 5). Furthermore, since documents are treated as vectors,
they can be compared using classical distance/similarity measures. The most
commonly employed measures include cosine and Tanimoto similarity [6].
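Both similarity measures mentioned above reduce to a few arithmetic operations on the weight vectors. A minimal sketch, assuming document vectors are given as plain Python lists of weights:

```python
import math

def cosine(u, v):
    """Cosine similarity: the cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tanimoto(u, v):
    """Tanimoto (extended Jaccard) similarity for real-valued weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0
```

Both measures equal 1 for identical non-zero vectors and 0 for vectors sharing no terms, which makes them convenient for comparing sparse BOW vectors.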
3. Machine Learning with Textual Data
The field of machine learning (ML) is concerned with the question of how
to construct computer programs that automatically improve with experience.
One important division of learning methods is into supervised and unsupervised.
In supervised learning, computer programs capture structural information and
derive conclusions (predictions) from previously labeled examples (instances,
points). Unsupervised learning finds groups in data without relying on labels.
ML techniques can roughly be divided into four distinct areas: classification,
clustering, association learning and numeric prediction [10]. Classification
applied to text is the subject of text categorization (TC), which is the task
of automatically sorting a set of documents into categories from a predefined
set [8]. Classification of documents is employed in text filtering, categorization
of Web pages (see Section 4), sentiment analysis (see Section 7), etc. Classification
can also be used on smaller parts of text depending on the concrete
application, e.g. document segmentation or topic tracking. In the ML approach,
classifiers are first trained on previously sorted (labeled) data before being
applied to sorting unseen texts. The most popular classifiers applied to text
include naïve Bayes, k-nearest neighbor, and support vector machines [10].
include naïve Bayes, k-nearest neighbor, and support vector machines [10].
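As a toy illustration of the ML approach to text categorization, the following sketch implements one of the listed classifiers, k-nearest neighbor, over term-frequency bags compared with cosine similarity. The training set is a hypothetical example, not data from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency bags (dicts)."""
    dot = sum(u.get(t, 0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, text, k=3):
    """Label a text by majority vote among the k most similar training documents.
    `train` is a list of (text, label) pairs."""
    query = Counter(text.lower().split())
    ranked = sorted(train,
                    key=lambda ex: cosine(Counter(ex[0].lower().split()), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("goal match striker", "sports"),
         ("election vote parliament", "politics"),
         ("league cup goal", "sports")]
```

The same train-then-predict pattern applies to the other listed classifiers; only the model built from the labeled data changes.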
While classification is concerned with finding models by generalization of evidence
produced by a dataset, clustering deals with the discovery of models by
finding groups of data points which satisfy some objective criterion, e.g. maximizing
the similarity of points within a cluster while minimizing the similarity of points
from different clusters. Examples of algorithms used on text data include k-means,
and approaches employing matrix decompositions [2].
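The k-means algorithm mentioned above can be sketched in plain Python. This minimal version uses Euclidean distance and randomly sampled initial centroids; applied to text, the input points would be the BOW vectors of Section 2.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means sketch: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign point to the centroid with smallest squared distance.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters
```

On sparse high-dimensional text vectors, cosine-based variants (spherical k-means) are usually preferred over raw Euclidean distance.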
4. Application: Enhancing Web Search
One way to enhance users’ efficiency and experience of Web search is by
means of meta-search engines. Traditionally, meta-search engines were conceived
to address different issues concerning general-purpose search engines,
including Web coverage, search result relevance, and their presentation to the
user. A common approach to alternative presentation of results is by sorting
them into (a hierarchy of) clusters which may be displayed to the user in a
variety of ways, e.g. as a separate expandable tree (vivisimo.com) or arcs which
connect Web pages within graphically rendered “maps” (kartoo.com). However,
topics generated by clustering may not prove satisfactory for every query, and
the “silver bullet” method has not yet been found. An example of a meta-search
engine which sorts search results into a hierarchy of topics using text categorization
techniques is CatS [7] (stribog.im.ns.ac.yu/cats). Figure 1 shows the
subset of 100 results for the query ‘animals england’ sorted into the category Arts
→ Music, helping to separate pages about animals living in England from pages
concerning the English music scene. The categories employed by CatS were
extracted from the dmoz Open Directory (www.dmoz.org).