Biomedical Language Processing
Literature Overload
Exponential growth of the peer-reviewed literature and
the breakdown of disciplinary boundaries heralded by
genome-scale instruments have made it harder than
ever for scientists to find and assimilate all the publications
relevant to their research. The widespread adoption
of title/abstract word search, primarily through the
National Library of Medicine's PubMed system (http://www.ncbi.nlm.nih.gov/pubmed), was the first major
change in the way bioscientists found relevant publications
since the origin of Index Medicus in 1879. (Although
it remains useful for locating pre-1966 literature
(Hersh, 2003), Index Medicus ceased publication in
2004.) However, PubMed is only the beginning of a revolution
in how scientists use the biomedical literature.
Computational tools that classify documents, extract factual information, generate summaries, and generally process human language offer powerful new ways of staying on top of the torrent of publications.
The biomedical literature is growing at a double-exponential pace; over the last 20 years, the total size of MEDLINE (the database searched by PubMed) has grown at a ~4.2% compounded annual growth rate, and the number of new entries in MEDLINE each year has grown at a compounded annual growth rate of ~3.1% (see Figure 1). There are now more than 16,000,000 publications in MEDLINE; more than three million of those were published in the last 5 years alone. The number of MEDLINE entries with a 2005 publication date was 666,029, more than 1,800 per day.
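To make these growth rates concrete, the following short Python sketch (a toy illustration using the figures quoted above; the 10-year horizon is an arbitrary choice) projects the counts forward and computes the doubling time implied by compound growth:

import math

def project(current, cagr, years):
    """Project a quantity forward under a compounded annual growth rate."""
    return current * (1.0 + cagr) ** years

# Figures from the text: ~16 million total MEDLINE records growing at
# ~4.2% per year; 666,029 new entries in 2005 growing at ~3.1% per year.
print(f"Total records in 10 years: {project(16_000_000, 0.042, 10):,.0f}")
print(f"New entries per year in 10 years: {project(666_029, 0.031, 10):,.0f}")

# Doubling time under compound growth: t = ln(2) / ln(1 + r)
print(f"Doubling time at 4.2%/year: {math.log(2) / math.log(1.042):.1f} years")

At a 4.2% compounded rate, the total collection roughly doubles every 17 years.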
Large as MEDLINE is, it captures only bibliographic information
and abstracts. Electronic access to the full
texts, including graphics and figures, is also on the
rise, and sophisticated linkages between publications
and data repositories or other supplementary materials
increase the amount of information available still further.
Although online full-text materials are increasingly prevalent,
dramatic increases in subscription prices and decreases
in library budgets have paradoxically decreased
access for some researchers. Toll-free linking, where
copyright owners allow free search but charge per
view, is one approach to ameliorating this problem.
An alternative strategy is the recent movement toward an "Open Access" model of scientific publishing. On April 11, 2003, a group of individuals interested in promoting open access to the scientific literature drafted a statement of principles now referred to as the Bethesda Statement on Open Access Publishing (http://www.earlham.edu/~peters/fos/bethesda.htm), later followed in Europe by the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (http://www.zim.mpg.de/openaccess-berlin/berlindeclaration.html), ushering in an era of unrestricted use of scientific publications.
Although some publishers have resisted open access,
many others have responded by increasing access to
archives and developing related services. In 2004, the
US National Library of Medicine (NLM) created a repository,
called PubMedCentral (PMC, http://www.pubmedcentral.gov/), for open access articles, which
as of this writing tracks some or all of the content of
154 biomedical journals automatically and accepts individual
article submissions from hundreds of others. Perhaps
most important for the future was the publication in the Federal Register of a new "Policy on Enhancing Public Access to Archived Publications Resulting From NIH-Funded Research," which, beginning on May 2, 2005, requests that all NIH-funded investigators submit to PMC all manuscripts resulting from research supported in whole or in part by NIH money. Less than 6 months later, more than 430,000 full-text articles (totaling more than 5 TB in compressed form) were available through PMC. Furthermore,
NLM is digitizing earlier print issues of many of
the journals already in PMC, extending the availability
of full texts back to before the implementation of the
2005 policy. Although NIH officials estimate that
~10% of the literature is NIH supported, and only about
6.5% of the MEDLINE entries for 2005 were indexed as
supported by NIH extramural funding, PMC marks a significant
change in the availability of full-text scientific articles
in biomedicine. As stated on the NLM's web site, PMC "makes it possible to integrate the literature with a variety of other information resources such as sequence databases and other factual databases that are available to scientists, clinicians and everyone else interested in the life sciences. The intentional and serendipitous discoveries that such links might foster excite us and stimulate us to move forward." The development of novel computational tools and techniques for textual analysis is a vital prerequisite for achieving NLM's vision.
Biomedical Language Processing Systems
Meanwhile, over the last 5 years or so, there has been a remarkable surge of new results in biomedical language processing (BLP). BLP encompasses the many computational tools and methods that take human-generated texts as input, generally applied to tasks such as information retrieval, document classification, information extraction, plagiarism detection, or literature-based discovery. Information retrieval systems, like PubMed or Google, focus on searching large collections to find documents that are relevant to a query. They are evaluated by their sensitivity (what proportion of all of the relevant documents are found; called recall in the information retrieval literature) and their specificity (what proportion of the documents found are actually relevant to the query; called precision).
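As a concrete illustration of these two measures, here is a minimal Python sketch that scores a retrieved document set against a gold-standard set of relevant documents (the document identifiers are hypothetical):

def retrieval_metrics(retrieved, relevant):
    """Return (sensitivity/recall, precision) for a retrieval run."""
    hits = retrieved & relevant                # relevant documents actually found
    sensitivity = len(hits) / len(relevant)    # fraction of relevant docs found
    precision = len(hits) / len(retrieved)     # fraction of found docs that are relevant
    return sensitivity, precision

# Hypothetical PubMed IDs, for illustration only.
retrieved = {"PMID:1", "PMID:2", "PMID:3", "PMID:4"}
relevant = {"PMID:2", "PMID:4", "PMID:5"}
print(retrieval_metrics(retrieved, relevant))  # (0.666..., 0.5)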
Document classification is another task with many biomedical applications. Such a system can be used to organize large retrieval results into meaningful categories (e.g., Tanabe et al. [1999]). Document classification can also be used to filter or route a flow of documents (e.g., from a service like Thomson's Current Contents or from web technologies such as RSS, the framework for providing automated "feeds" of news stories, articles, or other content types related to a specific subject matter) based on their contents. Such filtering technologies are used, for example, by several of the model organism database projects to identify publications relevant to their gene annotation efforts.
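As a toy illustration of such filtering, the sketch below flags abstracts containing enough gene-annotation trigger terms; the term list and threshold are illustrative assumptions, not any project's actual pipeline:

import re

# Hypothetical trigger vocabulary for routing abstracts toward curators.
TRIGGER_TERMS = {"gene", "allele", "mutant", "phenotype", "expression"}

def is_curation_candidate(abstract, min_hits=2):
    """Flag an abstract when at least min_hits trigger terms appear in it."""
    tokens = set(re.findall(r"[a-z]+", abstract.lower()))
    return len(TRIGGER_TERMS & tokens) >= min_hits

abstracts = [
    "A novel allele of the period gene alters circadian phenotype.",
    "We report improved imaging hardware for confocal microscopy.",
]
for a in abstracts:
    print(is_curation_candidate(a), "-", a)

Real filtering systems typically use trained classifiers rather than a fixed keyword list, but the routing logic is the same: score each incoming document and pass along only those above a threshold.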
Information extraction (also sometimes called text data mining) systems scan large numbers of publications to extract specific factual information, often to populate a database. The Literature Support for Alternative Transcripts (LSAT) system, discussed in more detail below, produced a database of transcript diversity in about 4,000 human genes by scanning more than 14,000 MEDLINE abstracts from hundreds of different journals.
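To suggest how pattern-based extraction of this kind can work, here is a toy Python sketch that pulls gene names from sentences asserting alternative splicing; the pattern and sentences are illustrative assumptions, not LSAT's actual rules:

import re

# A deliberately simple pattern: a capitalized gene-like token followed
# by an "is/are alternatively spliced" assertion.
PATTERN = re.compile(
    r"(?P<gene>[A-Z][A-Za-z0-9-]+)\s+(?:is|are)\s+alternatively\s+spliced"
)

sentences = [
    "CD44 is alternatively spliced in a tissue-specific manner.",
    "We measured expression of BRCA1 in tumor samples.",
]

records = []
for s in sentences:
    m = PATTERN.search(s)
    if m:
        records.append({"gene": m.group("gene"), "evidence": s})
print(records)  # one record, for CD44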
Literature-based discovery is the attempt to automatically induce novel hypotheses by processing existing publications. Although a few dramatic results were obtained in the 1980s and 1990s, e.g., the discovery of a link between magnesium and migraine (Swanson, 1988), repeated, successful literature-based discovery remains beyond current abilities (see Weeber et al. [2005] for a good review of LBD systems).
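Swanson's magnesium and migraine result rested on linking two disjoint literatures through shared intermediate terms. A minimal Python sketch of that "ABC" co-occurrence idea, using toy data rather than term pairs actually mined from MEDLINE, might look like:

# Toy co-occurrence lists standing in for term pairs mined from two
# disjoint literatures. If A co-occurs with B, and B with C, but A never
# co-occurs with C directly, then A-C is a candidate hypothesis.
cooccurs = {
    "migraine": {"spreading depression", "vascular reactivity"},
    "spreading depression": {"magnesium"},
    "vascular reactivity": {"magnesium"},
}

def abc_candidates(a):
    """Terms reachable from a through one intermediate but never directly."""
    direct = cooccurs.get(a, set())
    indirect = set()
    for b in direct:
        indirect |= cooccurs.get(b, set())
    return indirect - direct - {a}

print(abc_candidates("migraine"))  # {'magnesium'}

The hard part in practice is not this traversal but ranking the flood of candidate intermediate and target terms, which is one reason repeated, reliable discovery remains out of reach.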