13-11-2012, 01:53 PM
Incremental Information Extraction Using Relational Databases
Abstract
Information extraction systems are traditionally
implemented as a pipeline of special-purpose processing
modules targeting the extraction of a particular kind of
information. A
major drawback of such an approach is that whenever a
new extraction goal emerges or a module is improved,
extraction has to be reapplied from scratch to the entire
text corpus even though only a small part of the corpus
might be affected. In this paper, we describe a novel
approach for information extraction in which extraction
needs are expressed in the form of database queries,
which are evaluated and optimized by database systems.
Using database queries for information extraction
enables generic extraction and minimizes reprocessing
of data by performing incremental extraction to identify
which part of the data is affected by the change of
components or goals. Furthermore, our approach
provides automated query generation components so
that casual users do not have to learn the query language
in order to perform extraction. To demonstrate the
feasibility of our incremental extraction approach, we
performed experiments to highlight two important
aspects of an information extraction system: efficiency
and quality of extraction results. Our experiments show
that in the event of deployment of a new module, our
incremental extraction approach reduces the processing
time by 89.64 percent as compared to a traditional
pipeline approach. By applying our methods to a corpus
of 17 million biomedical abstracts, our experiments
show that query performance is efficient enough for
real-time applications. Our experiments also show that
our approach achieves high-quality extraction results.