Text Mining with Information Extraction pdf

**study tips** · 11-02-2013, 03:59 PM

Abstract

Text mining concerns looking for patterns in unstructured text. The related task of Information
Extraction (IE) is about locating specific items in natural-language documents. This paper
presents a framework for text mining, called DISCOTEX (Discovery from Text EXtraction),
using a learned information extraction system to transform text into more structured data which
is then mined for interesting relationships. The initial version of DISCOTEX integrates an IE
module acquired by an IE learning system, and a standard rule induction module. In addition,
rules mined from a database extracted from a corpus of texts are used to predict additional
information to extract from future documents, thereby improving the recall of the underlying
extraction system. Encouraging results are presented on applying these techniques to a corpus
of computer job announcement postings from an Internet newsgroup.

Introduction

The problem of text mining, i.e. discovering useful knowledge from unstructured or semi-structured
text, is attracting increasing attention [4, 18, 19, 21, 22, 27]. This paper suggests a new framework
for text mining based on the integration of Information Extraction (IE) and Knowledge Discovery
from Databases (KDD), a.k.a. data mining. KDD and IE are both topics of significant recent
interest. KDD considers the application of statistical and machine-learning methods to discover
novel relationships in large relational databases. IE concerns locating specific pieces of data in
natural-language documents, thereby extracting structured information from free text. However,
there has been little if any research exploring the interaction between these two important areas.
In this paper, we explore the mutual benefit that the integration of IE and KDD for text mining can
provide.

Traditional data mining assumes that the information to be “mined” is already in the form of a
relational database. Unfortunately, for many applications, electronic information is only available
in the form of free natural-language documents rather than structured databases. Since IE addresses
the problem of transforming a corpus of textual documents into a more structured database, the
database constructed by an IE module can be provided to the KDD module for further mining of
knowledge, as illustrated in Figure 1. Information extraction thus serves as the preprocessing
step that makes free text accessible to standard data mining.

Background: Text Mining and Information Extraction

“Text mining” is used to describe the application of data mining techniques to automated discovery
of useful or interesting knowledge from unstructured text [20]. Several techniques have been
proposed for text mining including conceptual structure, association rule mining, episode rule mining,
decision trees, and rule induction methods. In addition, Information Retrieval (IR) techniques
have widely used the “bag-of-words” model [2] for tasks such as document matching, ranking, and
clustering.
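
To make the bag-of-words model concrete, the following is a minimal sketch in Python. It treats each document as an unordered count of word tokens and ranks documents against a query by cosine similarity. The tokenizer, the function names, and the sample texts are assumptions made for illustration; they do not come from the paper or from [2].

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[t] * bag_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up query and documents, used only to exercise the functions.
query = bag_of_words("java programmer for database applications")
docs = [
    "Seeking a Java programmer with database experience",
    "Restaurant guide: best pizza in town",
]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query, bag_of_words(d)),
    reverse=True,
)
```

Because word order is discarded, "a b" and "b a" receive the same representation. This is what makes the model cheap for matching and ranking, and it is also why it cannot locate specific items in a text the way information extraction does.
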
The related task of information extraction aims to find specific data in natural-language text.
DARPA’s Message Understanding Conferences (MUC) have concentrated on IE by evaluating the
performance of participating IE systems based on blind test sets of text documents [13]. The
data to be extracted is typically given by a template which specifies a list of slots to be filled with
substrings taken from the document. Figure 2 shows a (shortened) document and its filled template
for an information extraction task in the job-posting domain. This template includes slots that are
filled by strings taken directly from the document. In the job-posting domain, several slots may
have multiple fillers, for example programming languages, platforms, applications, and
areas.
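
A filled template of this kind can be pictured as a simple record. The sketch below is one possible representation in Python, not the format used by the authors: the class name, the single-filler slots `title` and `city`, and the sample posting are hypothetical. Only the four multi-filler slots are named in the text above.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class JobTemplate:
    # Single-filler slots (hypothetical names chosen for this sketch).
    title: Optional[str] = None
    city: Optional[str] = None
    # Multi-filler slots mentioned in the text.
    languages: List[str] = field(default_factory=list)
    platforms: List[str] = field(default_factory=list)
    applications: List[str] = field(default_factory=list)
    areas: List[str] = field(default_factory=list)

    def fillers(self):
        """Yield every (slot, filler) pair, flattening multi-filler slots."""
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None:
                continue
            items = value if isinstance(value, list) else [value]
            for item in items:
                yield f.name, item


def ungrounded_fillers(template, document):
    """Return the fillers that do not occur verbatim in the document."""
    return [(slot, s) for slot, s in template.fillers() if s not in document]


# Made-up posting and its filled template.
posting = "Web Developer needed in Austin. Skills: Java and Perl on Windows NT and UNIX."
filled = JobTemplate(
    title="Web Developer",
    city="Austin",
    languages=["Java", "Perl"],
    platforms=["Windows NT", "UNIX"],
)
```

Here `ungrounded_fillers(filled, posting)` returns an empty list, which reflects the constraint that every filler is a substring taken directly from the document.
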
We have developed machine learning techniques to automatically construct information extractors
for job postings, such as those listed in the USENET newsgroup misc.jobs.offered
[6]. By extracting information from a corpus of such textual job postings, a structured, searchable
database of jobs can be automatically constructed, thus making the data in online text more
easily accessible. IE has been shown to be useful in a variety of other applications, e.g. seminar
announcements, restaurant guides, university web pages, apartment rental ads, and news articles
on corporate acquisitions [5, 9, 23].

Integrating Data Mining and Information Extraction

In this section, we discuss the details of our proposed text mining framework, DISCOTEX (Discovery
from Text EXtraction). We consider the task of first constructing a database by applying a
learned information-extraction system to a corpus of natural-language documents. Then, we apply
standard data-mining techniques to the extracted data, discovering knowledge that can be used for
many tasks, including improving the accuracy of information extraction.
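
The two stages described above can be sketched end to end on toy data. The Python code below is an illustration under strong assumptions, not the DISCOTEX implementation: a dictionary lookup stands in for the learned extractor, simple single-antecedent rules filtered by support and confidence stand in for the rule induction module, and all names, thresholds, and documents are invented. The sketch also assumes that a predicted filler is accepted only when its string actually occurs in the document.

```python
from collections import Counter
from itertools import permutations

# Hypothetical slot vocabulary; a learned extractor would replace this lookup.
KNOWN_FILLERS = {
    "languages": ["Java", "SQL", "HTML"],
    "applications": ["Oracle"],
    "areas": ["Database"],
}

# A deliberately incomplete vocabulary that simulates a low-recall extractor.
WEAK_FILLERS = {"applications": ["Oracle"]}


def extract(document, vocabulary):
    """Toy IE step: return the (slot, filler) pairs found verbatim in the text."""
    return {
        (slot, filler)
        for slot, fillers in vocabulary.items()
        for filler in fillers
        if filler in document
    }


def mine_rules(records, min_support=2, min_confidence=0.8):
    """Toy mining step: keep rules "a -> b" that are frequent and reliable."""
    item_counts = Counter(item for rec in records for item in rec)
    pair_counts = Counter(
        pair for rec in records for pair in permutations(sorted(rec), 2)
    )
    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support and n / item_counts[a] >= min_confidence:
            rules.setdefault(a, set()).add(b)
    return rules


def extract_with_rules(document, vocabulary, rules):
    """Add rule-predicted fillers, but only if they occur in the document."""
    found = extract(document, vocabulary)
    predicted = {
        b for a in found for b in rules.get(a, ()) if b[1] in document
    }
    return found | predicted


# Made-up corpus: extract a small database, then mine it for rules.
corpus = [
    "Oracle DBA wanted. SQL required. Database administration.",
    "Java developer with Oracle and SQL skills.",
    "HTML and Java web designer.",
    "Oracle SQL tuning expert for Database group.",
]
records = [extract(doc, KNOWN_FILLERS) for doc in corpus]
rules = mine_rules(records)
```

On this corpus the mined rules include "Oracle implies SQL". Calling `extract_with_rules("Senior Oracle consultant; strong SQL background.", WEAK_FILLERS, rules)` therefore recovers the SQL filler that the weak extractor alone would miss, which illustrates how mined rules can raise recall.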

The DISCOTEX System

In the proposed framework for text mining, IE plays an important role by preprocessing a corpus
of text documents in order to pass extracted items to the data mining module. In our implementations,
we used two state-of-the-art systems for learning information extractors, RAPIER (Robust
Automated Production of Information Extraction Rules) [6] and BWI (Boosted Wrapper Induction)
[15]. By training on a corpus of documents annotated with their filled templates, these systems
acquire a knowledge base of extraction rules that can be tested on novel documents.
