16-11-2012, 04:17 PM
A hierarchical and scalable model for contemporary document
image segmentation
Abstract
In this paper, we introduce a novel color segmentation
approach robust against digitization noise and
adapted to contemporary document images. This system is
scalable, hierarchical, versatile and completely automated,
i.e. user independent. It proposes an adaptive binarization/
quantization without any penalizing information loss. This
model may be used for many purposes. For instance, we rely
on it to carry out the first steps leading to advertisement
recognition in document images. Furthermore, the color
segmentation output is used to localize text areas and
enhance optical character recognition (OCR) performance.
We ran tests on a variety of magazine images to highlight our
contribution to the well-known OCR product ABBYY
FineReader. We also obtained promising results with our ad detection
system on a large set of test images with complex layouts.
Introduction
Nowadays, we encounter more and more digitized documents
with overlaying color layers owing to DTP (Desktop
publishing). However, little research on processing such
images exists in the literature. Even the existing work targets
specific applications such as mixed raster content (MRC)
[2]. Without prior processing of the colors in some document
pages, several applications, such as optical character
recognition (OCR) and layout segmentation, cannot be
efficient. Color information is also essential for further tasks
such as advertisement detection.
Digitized documents are commonly degraded by a conventional
series of operations (printing, digitization, image
compression, etc.) that alter the original colors and
introduce undesirable ones.
A generic multi-layer color segmentation system
We aim to get as close as possible to the original colors of
the document. To do so, we propose the color segmentation
scheme in Fig. 2.
• We first separate the chromatic layer from the achromatic
one. A given pixel is called chromatic if it has a
defined hue (red, green, blue, yellow, etc.). Otherwise, it
is achromatic (shades of gray, including black and
white). This step is fundamental as chromatic and
achromatic pixels cannot always be treated in the same
way. Indeed, it would be meaningless to apply some
processes that examine the hue values to achromatic
pixels, since the hue for these pixels is either undefined
or unreliable [19]. Additionally, the saturation noise is
removed at this stage.
• The chromatic layer is split into monochromatic and
multi-chromatic layers:
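
The chromatic/achromatic separation described in the first step above can be illustrated with a minimal sketch. It assumes 8-bit RGB input and uses simple fixed thresholds on HSV saturation and value. The thresholds and the names `SAT_MIN`, `VAL_MIN`, `is_chromatic` and `split_layers` are hypothetical and chosen for illustration only; the paper does not give them here, and its actual criterion may differ.

```python
import colorsys

# Hypothetical thresholds for illustration only (not taken from the paper).
SAT_MIN = 0.2   # below this saturation the hue is treated as unreliable
VAL_MIN = 0.15  # below this brightness the pixel is treated as near-black


def is_chromatic(rgb, sat_min=SAT_MIN, val_min=VAL_MIN):
    """Return True if an 8-bit (r, g, b) pixel has a usable hue."""
    r, g, b = (c / 255.0 for c in rgb)
    _, s, v = colorsys.rgb_to_hsv(r, g, b)
    return s >= sat_min and v >= val_min


def split_layers(pixels):
    """Split rows of (r, g, b) pixels into chromatic and achromatic masks."""
    chromatic = [[is_chromatic(p) for p in row] for row in pixels]
    achromatic = [[not c for c in row] for row in chromatic]
    return chromatic, achromatic


# Tiny 2x2 example: pure red, mid gray, black, saturated blue.
image = [
    [(255, 0, 0), (128, 128, 128)],
    [(0, 0, 0), (30, 60, 200)],
]
chromatic, achromatic = split_layers(image)
```

With these assumed thresholds, the red and blue pixels fall in the chromatic mask and the gray and black pixels in the achromatic one, so hue-based processing can be restricted to pixels whose hue is meaningful.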
Application to Text localization to improve OCR results
A variety of approaches to text information extraction from
images and video, ranging from detection to recognition,
have been proposed for specific applications, including
page segmentation, address block localization, license plate
localization, etc. In spite of such extensive studies, there is
still no general-purpose system. A survey of text information
extraction methods is given in [18].
Several studies have addressed text extraction in the
compressed domain (MPEG, JPEG...) [26, 48]. However,
they concern videos and use motion information to extract
text. Our method processes compressed images as well as
regular ones since the color segmentation stage efficiently
filters the compression damage.