Abstract. The field of text mining seeks to extract useful information
from unstructured textual data through the identification and exploration
of interesting patterns. The techniques employed usually do not involve
deep linguistic analysis or parsing, but rely on simple “bag-of-words” text
representations based on the vector space model. Several approaches to the identification
of patterns are discussed, including dimensionality reduction,
automated classification and clustering. Pattern exploration is illustrated
through two applications from our recent work: a classification-based
Web meta-search engine and visualization of coauthorship relationships
automatically extracted from a semi-structured collection of documents
describing researchers in the region of Vojvodina. Finally, preliminary
results concerning the application of dimensionality reduction techniques
to problems in sentiment classification are presented.
1. Introduction
Text mining is a new area of computer science which fosters strong connections
with natural language processing, data mining, machine learning, information
retrieval and knowledge management. Text mining seeks to extract
useful information from unstructured textual data through the identification
and exploration of interesting patterns [2]. This paper will discuss several approaches
to the identification of global patterns in text, based on the “bag-of-words”
(BOW) representation described in Section 2. The covered approaches
are automated classification and clustering (Section 3), and dimensionality reduction
(Section 5). Pattern exploration will be illustrated through two applications
from our recent work: presentation of Web meta-search engine results
(Section 4) and visualization of coauthorship relationships automatically
extracted from a semi-structured collection of documents describing researchers
in the Serbian province of Vojvodina (Section 6). Finally, preliminary results
concerning the application of dimensionality reduction techniques to problems
in sentiment classification are presented in Section 7.
2. Bag-of-Words Document Representation
Let W be the dictionary – the set of all terms (words) that occur at least
once in a collection of documents D. The bag-of-words representation of document
dn is a vector of weights (w1n, . . . , w|W|n). In the simplest case, the weights
win ∈ {0, 1} denote the presence or absence of a particular term in a document.
More commonly, win represents the frequency of the ith term in the nth
document, resulting in the term frequency representation. Normalization can
be employed to scale the term frequencies to values between 0 and 1, accounting
for differences in the lengths of documents. Besides words, n-grams may also be
used as terms. However, two different notions have been referred to as “n-grams”
in the literature: the first treats n-grams as phrases, i.e. sequences of n words,
while the second treats them as sequences of n characters. Word n-grams are
usually used to enrich the BOW representation rather than on their own, whereas
character n-grams are used instead of words.
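The normalized term-frequency representation described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: it assumes whitespace tokenization only, whereas real systems would add stemming, stop-word removal and possibly n-gram terms.

```python
from collections import Counter

def bow_vectors(docs):
    """Build length-normalized term-frequency vectors for a list of documents.
    Tokenization is plain lowercased whitespace splitting (a simplification)."""
    tokenized = [doc.lower().split() for doc in docs]
    # The dictionary W: every term occurring at least once in the collection.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Dividing by document length scales weights into [0, 1],
        # accounting for differences in document lengths.
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

vocab, vecs = bow_vectors(["the cat sat", "the cat and the dog"])
```

Each vector has one component per dictionary term, so short documents and long documents become directly comparable.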
The transformation of a document set D into the BOW representation enables
the transformed set to be viewed as a matrix, where rows represent document
vectors, and columns are terms. This view enables various matrix decomposition
techniques to be applied for the tasks of clustering [2] and dimensionality
reduction (Section 5). Furthermore, since documents are treated as vectors,
they can be compared using classical distance/similarity measures. The most
commonly employed measures include cosine and Tanimoto similarity [6].
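Both similarity measures mentioned above reduce to a few arithmetic operations on the weight vectors. A minimal sketch, assuming document vectors are given as plain Python lists of weights:

```python
import math

def cosine(u, v):
    """Cosine similarity: the cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tanimoto(u, v):
    """Tanimoto (extended Jaccard) similarity for real-valued weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0
```

Both measures equal 1 for identical non-zero vectors and 0 for vectors sharing no terms, which makes them convenient for comparing sparse BOW vectors.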
3. Machine Learning with Textual Data
The field of machine learning (ML) is concerned with the question of how
to construct computer programs that automatically improve with experience.
One important division of learning methods is into supervised and unsupervised.
In supervised learning, computer programs capture structural information and
derive conclusions (predictions) from previously labeled examples (instances,
points). Unsupervised learning finds groups in data without relying on labels.
ML techniques can roughly be divided into four distinct areas: classification,
clustering, association learning and numeric prediction [10]. Classification
applied to text is the subject of text categorization (TC), which is the task
of automatically sorting a set of documents into categories from a predefined
set [8]. Classification of documents is employed in text filtering, categorization
of Web pages (see Section 4), sentiment analysis (see Section 7), etc. Classification
can also be used on smaller parts of text depending on the concrete
application, e.g. document segmentation or topic tracking. In the ML approach,
classifiers are first trained on previously sorted (labeled) data before being
applied to sorting unseen texts. The most popular classifiers applied to text
include naïve Bayes, k-nearest neighbor, and support vector machines [10].
include naïve Bayes, k-nearest neighbor, and support vector machines [10].
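As a toy illustration of the ML approach to text categorization, the following sketch implements one of the listed classifiers, k-nearest neighbor, over term-frequency bags compared with cosine similarity. The training set is a hypothetical example, not data from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency bags (dicts)."""
    dot = sum(u.get(t, 0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, text, k=3):
    """Label a text by majority vote among the k most similar training documents.
    `train` is a list of (text, label) pairs."""
    query = Counter(text.lower().split())
    ranked = sorted(train,
                    key=lambda ex: cosine(Counter(ex[0].lower().split()), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("goal match striker", "sports"),
         ("election vote parliament", "politics"),
         ("league cup goal", "sports")]
```

The same train-then-predict pattern applies to the other listed classifiers; only the model built from the labeled data changes.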
While classification is concerned with finding models by generalization of evidence
produced by a dataset, clustering deals with the discovery of models by
finding groups of data points which satisfy some objective criterion, e.g. maximizing
the similarity of points within a cluster while minimizing the similarity of points
from different clusters. Examples of algorithms used on text data include k-means,
and approaches employing matrix decompositions [2].
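The k-means algorithm mentioned above can be sketched in plain Python. This minimal version uses Euclidean distance and randomly sampled initial centroids; applied to text, the input points would be the BOW vectors of Section 2.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means sketch: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign point to the centroid with smallest squared distance.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters
```

On sparse high-dimensional text vectors, cosine-based variants (spherical k-means) are usually preferred over raw Euclidean distance.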
4. Application: Enhancing Web Search
One way to enhance users’ efficiency and experience of Web search is by
means of meta-search engines. Traditionally, meta-search engines were conceived
to address different issues concerning general-purpose search engines,
including Web coverage, search result relevance, and their presentation to the
user. A common approach to alternative presentation of results is by sorting
them into (a hierarchy of) clusters which may be displayed to the user in a
variety of ways, e.g. as a separate expandable tree (vivisimo.com) or arcs which
connect Web pages within graphically rendered “maps” (kartoo.com). However,
topics generated by clustering may not prove satisfactory for every query, and
the “silver bullet” method has not yet been found. An example of a meta-search
engine which sorts search results into a hierarchy of topics using text categorization
techniques is CatS [7] (stribog.im.ns.ac.yu/cats). Figure 1 shows the
subset of 100 results for the query ‘animals england’ sorted into the category Arts
→ Music, helping to separate pages about animals living in England from pages
concerning the English music scene. The categories employed by CatS were
extracted from the dmoz Open Directory (www.dmoz.org).