25-09-2013, 04:46 PM
An OCR Free Method for Word Spotting in Printed Documents: the Evaluation of Different Feature Sets
Abstract:
An OCR-free word spotting method is developed and evaluated under a strong
experimental protocol. Different feature sets are evaluated under the same experimental
conditions. In addition, a tuning process in the document segmentation step is proposed which
provides a significant reduction in terms of processing time. For this purpose, a complete OCR-
free method for word spotting in printed documents was implemented, and a document
database containing document images and their corresponding ground truth text files was
created. A strong experimental protocol based on 800 document images allows us to compare
the results of the three feature sets used to represent the word image.
Introduction
With the advances in information technology in recent years, large volumes of
information are now commonly available in digital format. A large share of this
information consists of scanned document images. Given this volume, there is an
urgent need for fast access methods to this information. However, current tools
for indexing and searching large databases are not prepared to deal with this
type of data. Moreover, OCR-based methods have proven computationally expensive
[Doermann, 98]. An interesting alternative is the group of methods that aim to
enable word spotting in document images without using OCR. The advantages of
such methods are a short execution time and robustness to noisy documents
[Balasubramanian, 06], [Lu, 04], [Lu, 02], [Rath, 03].
Method Overview
The implemented method for word spotting in printed documents is illustrated in
Figure 1 and encompasses several algorithms: pre-processing, segmentation of the
document image into word images, feature extraction from the word images,
conversion of ASCII queries into descriptors, and word matching. With this
framework, we may evaluate specific strategies for segmentation, feature extraction,
and word matching.
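The last two stages (query conversion and word matching) can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not give the conversion table or the matching rule, so the table contents, the function names, and the use of edit distance as the matching score are all hypothetical.

```python
def query_to_descriptor(query, table):
    """Concatenate the per-character codes of an ASCII query into one descriptor.

    `table` is a hypothetical conversion table: character -> list of codes.
    """
    descriptor = []
    for ch in query:
        descriptor.extend(table[ch])
    return descriptor


def edit_distance(a, b):
    """Levenshtein distance between two sequences (assumed matching score)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spot(query, table, word_descriptors, max_distance=1):
    """Return indices of word images whose descriptor is close to the query's."""
    target = query_to_descriptor(query, table)
    return [i for i, d in enumerate(word_descriptors)
            if edit_distance(target, d) <= max_distance]
```

A small tolerance such as `max_distance=1` is one way to absorb a single segmentation or feature error in a word image; the threshold value here is arbitrary.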
Preprocessing
The original image of a document is first binarized by using the Otsu method [Otsu,
78] followed by a smoothing process [Suen, 92] to reduce the noise in the contour of
the character strokes which may be produced during the acquisition process. The
masks in Figure 2 and their respective rotations by 90°, 180° and 270° were used in
the smoothing process. In these masks the code “1” represents the foreground, code
“0” represents the background, while the code “?” represents “don't care”. Figure 3
shows an example of character “h” before and after the smoothing process.
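The two preprocessing steps can be illustrated with a minimal pure-Python sketch. The Otsu part follows the standard between-class variance criterion. The smoothing part only shows the mechanics of matching a 3x3 mask with “?” cells in its four rotations: the actual masks of Figure 2 are not reproduced in this text, so `BUMP_MASK` is a hypothetical example, and the assumption that dark pixels are foreground is ours.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]            # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels with value > t
        if w0 == 0:
            continue
        if w1 == 0:
            break
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray):
    """Map a 2D gray image to 1 (foreground, assumed dark) and 0 (background)."""
    t = otsu_threshold([p for row in gray for p in row])
    return [[1 if p <= t else 0 for p in row] for row in gray]


# Hypothetical mask (not one of the Figure 2 masks): a one-pixel bump on a flat edge.
BUMP_MASK = ["?0?", "010", "111"]


def rotate90(mask):
    """Rotate a square mask 90 degrees clockwise."""
    return ["".join(col) for col in zip(*mask[::-1])]


def matches(image, r, c, mask):
    """True if the 3x3 neighborhood centered at (r, c) agrees with the mask."""
    for dr in range(3):
        for dc in range(3):
            m = mask[dr][dc]
            if m != "?" and int(m) != image[r + dr - 1][c + dc - 1]:
                return False
    return True


def smooth(image, mask, new_value):
    """Set the center pixel to new_value wherever the mask or a rotation matches."""
    masks = [mask]
    for _ in range(3):
        masks.append(rotate90(masks[-1]))
    out = [row[:] for row in image]          # read from image, write to the copy
    for r in range(1, len(image) - 1):       # border pixels are left untouched
        for c in range(1, len(image[0]) - 1):
            if any(matches(image, r, c, m) for m in masks):
                out[r][c] = new_value
    return out
```

Reading from the original image while writing to a copy keeps one correction from triggering another within the same pass.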
Left-to-Right Primitive String (LRPS)
The LRPS word descriptors were originally proposed for information retrieval in
document image databases. Each feature in this set is a pair of attributes (σ , ω ) where
σ is the Line-or-Traversal Attribute (LTA) and ω is the Ascender-and-Descender
Attribute (ADA). The first attribute is calculated taking into account an analysis of the
straight lines in the word image and the number of transitions that do not belong to
straight lines. The second attribute is obtained through the analysis of the feature
position (straight line or a transition) by considering the ascending and descending
lines. These attributes provide structural information of the word image. They are
scale and translation invariant. A detailed description of these attributes may be found
in [Lu, 04] and [Lu, 02]. Figure 6 provides an overview of the LRPS features.
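As a rough illustration of the two ingredients mentioned above, the sketch below counts stroke transitions in one pixel column and labels a primitive by the zones it reaches relative to two reference lines. It is a simplification under our own assumptions, not the exact LTA/ADA definitions of [Lu, 04] and [Lu, 02]: the label names, the function names, and the `x_line`/`baseline` parameters are hypothetical.

```python
def count_transitions(column):
    """Count background-to-foreground (0 -> 1) transitions down a pixel column."""
    count = 0
    prev = 0
    for pixel in column:
        if pixel == 1 and prev == 0:
            count += 1
        prev = pixel
    return count


def zone_label(top, bottom, x_line, baseline):
    """Label a primitive spanning rows [top, bottom]; row indices grow downward.

    x_line is the top row of the lowercase body, baseline its bottom row.
    """
    above = top < x_line        # reaches into the ascender zone
    below = bottom > baseline   # reaches into the descender zone
    if above and below:
        return "full"
    if above:
        return "ascender"
    if below:
        return "descender"
    return "x-zone"
```

Because the label depends only on positions relative to the reference lines, it does not change when the word image is shifted or uniformly rescaled, which is consistent with the invariance stated above.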
Arica and Yarman-Vural Descriptors (AYV)
The AYV descriptors are based on the Arica and Yarman-Vural features [Arica, 00]
used for isolated character recognition. We have adapted the features proposed by the
authors for the representation of printed word images. For this purpose, unlike
the original method, only the vertical columns were taken into account, since we have
to employ them for both the words and the isolated characters. In addition, in our
modified AYV features no size normalization method is considered. Figure 7
illustrates the descriptors computed for two columns of the character “a”.
Document Database
The document image database created for our experiments is composed of 865 papers
in PDF format, digitized at 300 dpi and published in ICASSP'97. Most
documents contain four pages in two-column format with text, equations, graphics
and tables.
To create a textual version of each document while keeping the page layout, the OCR
available in Acrobat [Acrobat, 06] was used. In addition, this textual version was
processed with Xpdf [Xpdf, 05] to obtain, for each word, its position in the page, the
page number and the document file name. A visual verification of each textual
document was carried out. This process failed to produce the textual version
for some low-quality documents, which correspond to about 1% of the initial
set of documents. A copy of the created database can be obtained by
contacting the authors.
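One possible in-memory layout for such ground truth entries is sketched below. The text only states that each word has a position, a page number and a document file name; the field names, the use of a single (x, y) point as position, and the case-insensitive lookup are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecord:
    """One ground truth entry (hypothetical field layout)."""
    text: str
    document: str   # document file name
    page: int       # page number
    x: float        # position in the page (assumed to be a single point)
    y: float


def build_index(records):
    """Group records by lowercased word so a query's true occurrences can be listed."""
    index = defaultdict(list)
    for rec in records:
        index[rec.text.lower()].append(rec)
    return index
```

With such an index, the occurrences returned by a word spotting run can be compared against `index[query.lower()]` for each query.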
Figure 10 shows parts of two distinct documents, where we can observe some
problems such as fragmented and touching characters and a significant variance in
stroke thickness, while Figure 11 shows a sample of a document page
available in the database.
Conclusions
We have implemented a complete OCR-free word spotting method and evaluated
three different feature sets by considering the same experimental protocol. In addition,
a document database containing document images and their corresponding ground
truth text files was created.
Some specific contributions were made at each stage of the developed word
spotting method, such as: a significant reduction in terms of the processing time in the
segmentation process; an automatic scheme to generate the conversion tables; and the
comparison of three feature sets under the same conditions.
Further work may be done by considering the combination of feature sets in the
proposed method. To that end, we plan to implement some methods for feature
selection. In addition, we plan to evaluate the efficiency of the best features for word
spotting in a database of historical documents and also to compare the proposed
method with OCR-based word spotting approaches.