27-09-2012, 12:48 PM
Synopses Generation for Specialized Document-Element Search Engines
ABSTRACT
Scientists often want to search for document-elements such as
tables and figures in digital documents. A document-element
search engine helps them retrieve a set of document-elements
using keyword queries. They then need to decide whether each
returned document-element is useful and determine what
information it contains. This last step is typically done by
downloading the paper and reading it. In this paper, we
investigate how to automatically extract information (a
synopsis) related to document-elements from documents. The
extracted information can be indexed and provided along with
the search results, enabling the end-user to quickly find the
related information. Thus, this work has significant potential
to improve the ease of use of a document-element search engine
and, consequently, the productivity of the end-user. We
propose a novel method to extract synopses, investigate the
optimum synopsis size and demonstrate the utility of our
extracted synopses in document-element understanding with a
user study.
INTRODUCTION
In academic writing, authors use a number of document-elements
for a variety of purposes, such as reporting and summarizing
experimental results (plots, tables) or describing a process
(flow charts) or an algorithm (pseudo-code). A document-element
is defined as an entity, separate from the running text of the
document, that either augments or summarizes the information
contained in the running text. Figures, tables and pseudo-code
for algorithms are the most commonly used document-elements in
scientific literature and are sources of valuable information.
Recently, significant efforts have been made to extract and
utilize the information present in these document-elements.
Kataria et al. describe algorithms to extract data from 2-D
plots, which can then be stored, indexed and eventually
queried [5]. TableSeer, a specialized search engine, allows
end users to search for tables in digital documents [6]. The
BioText Search Engine, a specialized search engine for biology
documents, offers the capability to search for figures and
tables in the documents [4].
Presentation of Synopses to User
After scoring and ranking all the sentences, we need to decide
how many and which sentences to include in the synopsis
presented to the user. Carbonell et al. describe Maximum
Marginal Relevance as a criterion for selecting sentences for
summarization that combines query relevance and information
novelty [1]. In a complete document, such as a paper, many
sentences convey the same information, for example, sentences
in the abstract, introduction and conclusion. For a
document-element, however, we get only a small subset of
sentences that are related to it, so it is unlikely that this
small set of candidate sentences will introduce redundancy.
Nevertheless, presenting all of these relevant sentences to the
user hurts the readability of the synopsis and requires more
time to read and understand, and hence defeats the whole
purpose of making the search results more user friendly.
Hence, we need to determine an optimum synopsis size that
balances the trade-off between information content and the
readability and effectiveness of the synopsis.
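
For reference, the Maximum Marginal Relevance criterion cited
above [1] can be sketched as follows. This is a minimal
illustration of MMR in general, not the selection procedure
proposed in this paper. The bag-of-words cosine similarity, the
function names and the trade-off weight `lam` are all
assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased term-frequency vector (an assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, sentences, k, lam=0.7):
    """Greedily pick up to k sentences by Maximum Marginal Relevance.

    Each step picks the candidate maximizing
    lam * sim(candidate, query) - (1 - lam) * max sim(candidate, selected).
    The default lam is an arbitrary illustrative value.
    """
    q = bag_of_words(query)
    vecs = [bag_of_words(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(vecs[i], q)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

A low `lam` favours novelty, so a verbatim duplicate of an
already selected sentence is passed over. With `lam=1.0` the
selection reduces to plain relevance ranking.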
Determining the Penalty parameter, P
In order to determine the value of P that optimizes user
satisfaction, we generated synopses for 43 document-elements
selected randomly from different scientific documents at
different values of P. The subjects were asked to rate all the
generated synopses, and the average score over all synopses
was computed for each value of P. The average length of the
synopses (in number of sentences) and the average scores for
different P values are tabulated in Table 1, and Figure 3
shows the variation of the average score with P.
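
The aggregation described above can be sketched as follows. The
record layout, the function names and especially the ratings
are hypothetical: the numbers are placeholders invented to
exercise the code and are not the study's data.

```python
from collections import defaultdict
from statistics import mean


def average_by_p(ratings):
    """Average synopsis length and user score for each value of P.

    `ratings` is an iterable of (p_value, num_sentences, score)
    tuples; this layout is an assumption made for the example.
    """
    lengths = defaultdict(list)
    scores = defaultdict(list)
    for p, num_sentences, score in ratings:
        lengths[p].append(num_sentences)
        scores[p].append(score)
    return {p: (mean(lengths[p]), mean(scores[p])) for p in sorted(scores)}


def best_p(averages):
    """Return the P value with the highest average user score."""
    return max(averages, key=lambda p: averages[p][1])


# Placeholder ratings, invented purely for illustration.
example_ratings = [
    (0.1, 6, 3), (0.1, 8, 2),
    (0.5, 3, 4), (0.5, 5, 5),
    (0.9, 1, 2), (0.9, 1, 3),
]
averages = average_by_p(example_ratings)
chosen_p = best_p(averages)
```

Each entry of `averages` corresponds to one row of a table like
Table 1, and `chosen_p` is the P value at which the average
score peaks.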
Comparison with other methods
The aim of this experiment was to compare the proposed
approach with state-of-the-art methods and to demonstrate the
utility of synopses for document-element search engines. For
this, we randomly selected 100 document-elements from
different scientific publications and generated synopses by
our method and the following two methods:
1. Google Desktop: This is the desktop version of Google, the
most widely used commercial search engine, and is used for
searching documents stored on the user's desktop
(http://desktop.google). Along with the search results, it
also provides query-biased snippets to facilitate the search
process. Though the exact algorithms it uses are unpublished,
they are presumed to represent the state of the art. We stored
all the test documents on our desktop and then queried the
desktop search engine with the same query, formulated by
extracting keywords from the caption and reference sentence as
described in Algorithm 1. The synopses in this case are the
query-biased snippets accompanying the corresponding documents
returned as search results.
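
The query formulation step used for the baseline can be
illustrated roughly as follows. Algorithm 1 is not reproduced
in this excerpt, so this is only a sketch under assumptions:
the stop-word list, the tokenization and the function name
`formulate_query` are hypothetical and may differ from the
actual algorithm.

```python
import re

# A tiny illustrative stop-word list (an assumption); a real
# system would use a much fuller one.
STOP_WORDS = {
    "a", "an", "and", "as", "at", "by", "for", "in", "is", "of",
    "on", "the", "to", "with", "shows", "shown",
}


def formulate_query(caption, reference_sentence):
    """Build a keyword query from a caption and a reference sentence.

    Keeps the first occurrence of each token that is not a stop
    word, preserving the original order.
    """
    text = f"{caption} {reference_sentence}".lower()
    seen = set()
    keywords = []
    for token in re.findall(r"[a-z0-9]+", text):
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        keywords.append(token)
    return " ".join(keywords)
```

The resulting string would then be submitted unchanged to each
system under comparison, so that all methods see the same query.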
CONCLUSIONS
The present work identified the problem of generating synopses
for document-elements such as tables and figures in digital
documents. The proposed algorithm generates synopses by
ranking sentences on the basis of their relevance to the
document-element and their proximity to reference sentences.
The algorithm then determines which sentences to include in
the description, balancing the information content and length
of the description so that the generated descriptions are both
effective and useful. The usefulness of the proposed approach
is confirmed by a user study. Our future work will include
developing more features to improve the quality of the
generated synopses and investigating the use of synopses for
improved document search.