11-02-2013, 03:59 PM
Text Mining with Information Extraction
Abstract
Text mining concerns looking for patterns in unstructured text. The related task of Information
Extraction (IE) is about locating specific items in natural-language documents. This paper
presents a framework for text mining, called DISCOTEX (Discovery from Text EXtraction),
using a learned information extraction system to transform text into more structured data which
is then mined for interesting relationships. The initial version of DISCOTEX integrates an IE
module acquired by an IE learning system, and a standard rule induction module. In addition,
rules mined from a database extracted from a corpus of texts are used to predict additional
information to extract from future documents, thereby improving the recall of the underlying
extraction system. Encouraging results are presented on applying these techniques to a corpus
of computer job announcement postings from an Internet newsgroup.
Introduction
The problem of text mining, i.e. discovering useful knowledge from unstructured or semi-structured
text, is attracting increasing attention [4, 18, 19, 21, 22, 27]. This paper suggests a new framework
for text mining based on the integration of Information Extraction (IE) and Knowledge Discovery
from Databases (KDD), a.k.a. data mining. KDD and IE are both topics of significant recent
interest. KDD considers the application of statistical and machine-learning methods to discover
novel relationships in large relational databases. IE concerns locating specific pieces of data in
natural-language documents, thereby extracting structured information from free text. However,
there has been little if any research exploring the interaction between these two important areas.
In this paper, we explore the mutual benefit that the integration of IE and KDD for text mining can
provide.
Traditional data mining assumes that the information to be “mined” is already in the form of a
relational database. Unfortunately, for many applications, electronic information is only available
in the form of free natural-language documents rather than structured databases. Since IE addresses
the problem of transforming a corpus of textual documents into a more structured database, the
database constructed by an IE module can be provided to the KDD module for further mining of
knowledge, as illustrated in Figure 1. Information extraction thus plays a natural preprocessing
role in text mining.
Background: Text Mining and Information Extraction
“Text mining” is used to describe the application of data mining techniques to automated discovery
of useful or interesting knowledge from unstructured text [20]. Several techniques have been
proposed for text mining, including conceptual structure, association rule mining, episode rule mining,
decision trees, and rule induction methods. In addition, Information Retrieval (IR) techniques
have widely used the “bag-of-words” model [2] for tasks such as document matching, ranking, and
clustering.
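As a rough illustration of the bag-of-words model mentioned above, the following Python sketch represents each document as a multiset of tokens and compares documents by cosine similarity. The tokenizer, the function names, and the sample sentences are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Map a document to token counts, discarding word order."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters,
    # digits, '+' and '#', so that tokens such as "c++" survive.
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, invented for illustration only.
doc1 = bag_of_words("Java programmer needed, Java and SQL experience required")
doc2 = bag_of_words("Seeking SQL developer with Java experience")
doc3 = bag_of_words("Restaurant guide for downtown Austin")

sim_related = cosine_similarity(doc1, doc2)
sim_unrelated = cosine_similarity(doc1, doc3)
```

Because word order is discarded, the two job postings score as similar purely on shared vocabulary, while the unrelated document shares no tokens with them. This is what makes the model convenient for matching and ranking, and also why it cannot by itself identify which string fills which slot.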
The related task of information extraction aims to find specific data in natural-language text.
DARPA’s Message Understanding Conferences (MUC) have concentrated on IE by evaluating the
performance of participating IE systems based on blind test sets of text documents [13]. The
data to be extracted is typically given by a template which specifies a list of slots to be filled with
substrings taken from the document. Figure 2 shows a (shortened) document and its filled template
for an information extraction task in the job-posting domain. This template includes slots that are
filled by strings taken directly from the document. In the job-posting domain, several slots, such as
programming languages, platforms, applications, and areas, may have multiple fillers.
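One way to picture a filled template is as a mapping from slot names to lists of fillers, each of which is a substring of the source document. The posting text, the slot names, and the helper functions below are hypothetical and are not the contents of Figure 2; the sketch only shows the shape of the data.

```python
# Hypothetical job posting, invented for illustration.
document = (
    "Title: Web Developer. "
    "We need a developer with Java and Perl experience on Windows NT "
    "and UNIX, familiar with Oracle."
)

# A filled template: each slot maps to zero or more fillers.
# Slot names here are assumptions, not the paper's exact template.
template = {
    "title": ["Web Developer"],
    "language": ["Java", "Perl"],
    "platform": ["Windows NT", "UNIX"],
    "application": ["Oracle"],
    "area": [],
}


def fillers_are_substrings(template, document):
    """Check that every filler was taken directly from the document."""
    return all(
        filler in document
        for fillers in template.values()
        for filler in fillers
    )


def multi_filler_slots(template):
    """Return the names of slots that hold more than one filler."""
    return sorted(slot for slot, fillers in template.items() if len(fillers) > 1)


valid = fillers_are_substrings(template, document)
multi = multi_filler_slots(template)
```

Storing fillers as lists keeps single-valued and multi-valued slots uniform, so an empty list simply means that nothing was extracted for that slot.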
We have developed machine learning techniques to automatically construct information extractors
for job postings, such as those listed in the USENET newsgroup misc.jobs.offered
[6]. By extracting information from a corpus of such textual job postings, a structured, searchable
database of jobs can be automatically constructed, thus making the data in online text more
easily accessible. IE has been shown to be useful in a variety of other applications, e.g. seminar
announcements, restaurant guides, university web pages, apartment rental ads, and news articles
on corporate acquisitions [5, 9, 23].
Integrating Data Mining and Information Extraction
In this section, we discuss the details of our proposed text mining framework, DISCOTEX (Discovery
from Text EXtraction). We consider the task of first constructing a database by applying a
learned information-extraction system to a corpus of natural-language documents. Then, we apply
standard data-mining techniques to the extracted data, discovering knowledge that can be used for
many tasks, including improving the accuracy of information extraction.
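The two-stage idea can be sketched in a few lines of Python: mine rules from records that an extractor has already produced, then use those rules to suggest fillers the extractor may have missed in a new document. This sketch substitutes simple one-antecedent association rules with support and confidence thresholds for the paper's rule induction module, and the records, thresholds, and function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=2, min_confidence=0.8):
    """Mine rules 'item a implies item b' from sets of (slot, filler) items."""
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = set(record)
        item_counts.update(items)
        # Count every ordered pair of distinct items that co-occur.
        pair_counts.update(permutations(sorted(items), 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        confidence = n / item_counts[a]
        if n >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def predict_missing(extracted, rules):
    """Suggest items implied by the mined rules but absent from the extraction."""
    present = set(extracted)
    return sorted({b for (a, b) in rules if a in present and b not in present})


# Hypothetical extracted database: one set of (slot, filler) pairs per posting.
records = [
    {("language", "HTML"), ("language", "JavaScript"), ("area", "Web")},
    {("language", "HTML"), ("area", "Web")},
    {("language", "HTML"), ("language", "JavaScript"), ("area", "Web")},
    {("language", "C++"), ("platform", "UNIX")},
]

rules = mine_rules(records)

# A new posting where the extractor found only one filler.
suggestions = predict_missing({("language", "JavaScript")}, rules)
```

In this toy database every posting that mentions JavaScript also mentions HTML and the Web area, so both are suggested as additional fillers for the new posting. In practice a suggested filler would presumably be accepted only if it actually appears in the document, since fillers must be substrings of the text.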
The DISCOTEX System
In the proposed framework for text mining, IE plays an important role by preprocessing a corpus
of text documents in order to pass extracted items to the data mining module. In our implementations,
we used two state-of-the-art systems for learning information extractors, RAPIER (Robust
Automated Production of Information Extraction Rules) [6] and BWI (Boosted Wrapper Induction)
[15]. By training on a corpus of documents annotated with their filled templates, they acquire
a knowledge base of extraction rules that can be tested on novel documents.