18-03-2014, 05:01 PM
An OCR Free Method for Word Spotting in Printed Documents: the Evaluation of Different Feature Sets
Abstract:
An OCR-free word spotting method is developed and evaluated under a strong
experimental protocol. Different feature sets are evaluated under the same experimental
conditions. In addition, a tuning process in the document segmentation step is proposed,
which significantly reduces processing time. For this purpose, a complete OCR-free
method for word spotting in printed documents was implemented, and a document
database containing document images and their corresponding ground truth text files was
created. A strong experimental protocol based on 800 document images allows us to compare
the results of the three feature sets used to represent the word image.
Introduction
With the advances in information technology observed in recent years, it is common to
find large volumes of information available in digital format. A large amount of this
information is composed of scanned document images. Due to the large volume, there
is an urgent need to provide fast access methods to this information. However, the
current tools for indexing and searching in large databases are not prepared to deal
with this type of data. Moreover, OCR-based methods have proven to be an
expensive option from the computational point of view [Doermann, 98]. An interesting
alternative is the group of methods that aim to make word spotting in
document images possible without using OCR. The advantages of such methods are
a small execution time and robustness to noisy documents
[Balasubramanian, 06], [Lu, 04], [Lu, 02], [Rath, 03].
Segmentation
The first step in the segmentation process consists of finding all connected
components (CCs) in the binary image of the document. For each CC, shape features
are calculated, such as the ratio between height and width and the ratio between black
and white pixels inside the connected component bounding box (CCBB). These
features are used in a filtering process designed to eliminate table lines,
figures, and all kinds of graphics present in the document. The CCs that pass this
filter have their position and dimension stored in linear data structures (LDS). In order
to accelerate the access to each CC,
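The first segmentation steps described above (finding CCs, computing the two shape features, and filtering) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the use of 8-connectivity, the dictionary layout, and all threshold values are assumptions made for the example.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected black components in a binary image.

    `image` is a list of rows where 1 = black (ink) and 0 = white.
    Returns one dict per component with its bounding box (CCBB)
    and its number of black pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited black pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, bottom, left, right, black = y, y, x, x, 0
            while queue:
                cy, cx = queue.popleft()
                black += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append({"x": left, "y": top,
                               "w": right - left + 1, "h": bottom - top + 1,
                               "black": black})
    return components


def shape_features(cc):
    """Return (height/width ratio, black/white pixel ratio inside the CCBB)."""
    white = cc["w"] * cc["h"] - cc["black"]
    aspect = cc["h"] / cc["w"]
    # A fully black bounding box has no white pixels; treat the ratio as infinite.
    black_white = cc["black"] / white if white else float("inf")
    return aspect, black_white


def keep_component(cc, min_aspect=0.1, max_aspect=10.0, min_black_white=0.05):
    """Hypothetical filter: drop very elongated or very sparse components.

    The thresholds are placeholders chosen for illustration only.
    """
    aspect, black_white = shape_features(cc)
    return min_aspect <= aspect <= max_aspect and black_white >= min_black_white
```

With thresholds like these, a long horizontal rule (very small height/width ratio) or a large, mostly empty table frame (very small black/white ratio) would be rejected, while character-sized components would pass. Suitable values would have to be tuned on real document images.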
Document Database
The document image database created for our experiments is composed of 865 papers
in PDF format, digitized at 300 dpi and published in ICASSP'97. Most
documents contain four pages in two-column format with text, equations, graphics
and tables.
To create a textual version of each document while keeping the page layout, the OCR
available in Acrobat [Acrobat, 06] was used. In addition, this textual version was
submitted to Xpdf [Xpdf, 05] to obtain, for each word, its position in the page, the
page number and the document file name. A visual verification of each textual
document was carried out. This process failed to produce the textual version
for some low-quality documents, which correspond to about 1% of the initial
number of documents. A copy of the created database can be obtained by
contacting the authors.
Figure 10 shows two parts of distinct documents, where we can observe some
problems, such as fragmented and touching characters and a significant variance in
stroke thickness, while Figure 11 shows a sample of a document page
available in the database.
Conclusions
We have implemented a complete OCR-free word spotting method and evaluated
three different feature sets by considering the same experimental protocol. In addition,
a document database containing document images and their corresponding ground
truth text files was created.
Some specific contributions were made at each stage of the developed word
spotting method, such as: a significant reduction in processing time in the
segmentation process; an automatic scheme to generate the conversion tables; and the
comparison of three feature sets under the same conditions.
Further work may be done by considering the combination of feature sets in the
proposed method. To do so, we plan to implement some methods for feature
selection. In addition, we plan to evaluate the efficiency of the best features for word
spotting in a database of historical documents and also to compare the proposed
method with OCR-based word spotting approaches.