BibPro: A Citation Parser Based on Sequence Alignment
Abstract
The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of the
resources to meet researchers’ needs. As a result, a number of network services have compiled databases from the public resources
scattered over the Internet. However, different conferences and journals adopt different citation styles. Accurately extracting
metadata from a citation string that is formatted in one of thousands of different styles is an interesting problem, and it has
attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a
citation parser called BibPro that extracts components of a citation string. To demonstrate the efficacy of BibPro, we conducted
experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark.
Even with citations and associated metadata retrieved from the web as training data, our experiments show that BibPro still achieves
reasonable performance.
INTRODUCTION
Citations play an important role in many scientific-publication
digital libraries (DLs), such as CiteSeer,
arXiv e-Print, DBLP, and Google Scholar. Users often use
citations to find information of interest in DLs, while
researchers depend on citations to determine the impact of
a particular article. Evaluations of an individual’s performance
for promotion purposes or the allocation of grants
may use citations as evidence of the competence of a
researcher and the impact of his/her published work.
Citations have also been used as auxiliary support in
information retrieval tasks, e.g., automatic document
classification [2], [3], indexing and ranking [10], and quality
assessment [4]. Moreover, bibliographic measures that rely
on citations have inspired recent web link analysis
algorithms like PageRank [5]. In a broader sense, citations
are the basis of DLs that specialize in scientific publications.
Parsing citations is essential for integrating bibliographical
information published on the Internet. Most citation
management techniques are based on the assumption that
we can correctly identify the main components of a citation,
such as authors’ names, title, publication venue, date, and
the number of pages.
BIBPRO: A CITATION PARSER
We observe that, given its metadata, a citation string can
be rewritten into a canonical string consisting of symbols
corresponding to its Fields and Delimiters. Two citation
strings of the same citation format usually have similar
canonical strings. On the other hand, the canonical strings of
two citation strings in different formats tend to be dissimilar.
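
The observation above can be illustrated with a minimal sketch. It assumes the metadata is already known and that each field value appears verbatim in the citation string. The symbol alphabet (A, T, V, P, Y), the function name canonicalize, and the sample citations are hypothetical and chosen only for illustration; the excerpt does not specify the symbols BibPro actually uses.

```python
# Hypothetical symbol alphabet; not taken from the BibPro paper.
FIELD_SYMBOLS = {"author": "A", "title": "T", "venue": "V", "pages": "P", "year": "Y"}


def canonicalize(citation: str, metadata: dict) -> str:
    """Replace each known field value with its symbol and keep delimiter characters."""
    # Locate the first occurrence of every field value in the citation string.
    spans = []
    for field, value in metadata.items():
        start = citation.find(value)
        if start >= 0:
            spans.append((start, start + len(value), FIELD_SYMBOLS[field]))
    spans.sort()

    out, pos = [], 0
    for start, end, symbol in spans:
        if start < pos:
            continue  # skip a match that overlaps a field already emitted
        # Text between fields is delimiter material; whitespace is dropped.
        out.append("".join(ch for ch in citation[pos:start] if not ch.isspace()))
        out.append(symbol)
        pos = end
    out.append("".join(ch for ch in citation[pos:] if not ch.isspace()))
    return "".join(out)


# Two made-up citations in the same style, and one in a different style.
same_style_1 = canonicalize(
    'J. Doe, "Parsing Things," Journal of Examples, pp. 1-10, 2001.',
    {"author": "J. Doe", "title": "Parsing Things",
     "venue": "Journal of Examples", "pages": "1-10", "year": "2001"},
)
same_style_2 = canonicalize(
    'A. Roe, "Another Title," Example Transactions, pp. 20-35, 2010.',
    {"author": "A. Roe", "title": "Another Title",
     "venue": "Example Transactions", "pages": "20-35", "year": "2010"},
)
other_style = canonicalize(
    "Doe, J. (2001). Parsing things. Journal of Examples, 1-10.",
    {"author": "Doe, J.", "title": "Parsing things",
     "venue": "Journal of Examples", "pages": "1-10", "year": "2001"},
)
print(same_style_1)  # A,"T,"V,pp.P,Y.
print(same_style_2)  # A,"T,"V,pp.P,Y.
print(other_style)   # A(Y).T.V,P.
```

In this toy setting the two citations that share a style map to the same canonical string, while the third maps to a clearly different one.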
Our strategy for solving the citation parsing problem
is thus to rewrite a given citation string into a canonical
form. However, without prior knowledge of its metadata,
the boundaries between the fields and delimiters of a citation
string are not trivial to determine. For example, some punctuation marks
used as delimiters may also appear inside a field.
In addition, it is sometimes hard to determine the type of a
field by its content, and the order of fields varies among
different citation formats. Thus, our strategy is to rewrite
such a citation string so that the structural information
associated with its citation style is preserved as much as
possible while the textual information specific to that citation
string is removed as much as possible.
CONCLUSION
Parsing citations is challenging due to the diverse nature of
citation formats. In this paper, we present a sequence-
alignment-based citation parser called “BibPro.” The basic
concept of BibPro is to transform semistructured properties
of a citation string into a sequence template, and apply
sequence alignment techniques to further resolve the
structured information. The effectiveness and applicability
of BibPro are demonstrated by experiments on three benchmark
data sets. Specifically, BibPro achieved over 90 percent
field-level accuracy on all three benchmarks. In addition,
BibPro can automatically construct a template database
using imperfectly labeled citation strings retrieved
from the web, and experiments show that its field-level
accuracy is over 80 percent for the given data sets. BibPro is
implemented to allow users to maintain a profile of
templates in order to meet specific user requirements.
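
As a rough illustration of matching a sequence against stored templates by alignment, the sketch below scores two symbol sequences with a standard global alignment (Needleman-Wunsch) and picks the best-scoring template. This excerpt does not state which alignment algorithm or scoring scheme BibPro uses, so the algorithm choice, the match, mismatch, and gap values, and the names alignment_score and best_template are all assumptions made for the example.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) between two symbol sequences.

    The scoring parameters are illustrative assumptions, not BibPro's values.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_template(sequence: str, templates: list) -> str:
    """Return the stored template whose alignment score with the sequence is highest."""
    return max(templates, key=lambda t: alignment_score(sequence, t))


# Hypothetical template database with two canonical strings.
templates = ['A,"T,"V,pp.P,Y.', "A(Y).T.V,P."]
print(best_template('A,"T,"V,P,Y.', templates))
```

A full parser would also need the alignment itself (the traceback), not only the score, to map the aligned symbols back to field boundaries in the citation string.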