Extraction of Text Regions in Natural Images
Abstract
The detection and extraction of text regions in an image is a well-known problem in the
computer vision research area. The goal of this project is to compare two basic
approaches to text extraction in natural (non-document) images: edge-based and
connected-component based. The algorithms are implemented and evaluated using a set
of images of natural scenes that vary along the dimensions of lighting, scale and
orientation. Accuracy, precision and recall rates for each approach are analyzed to
determine the success and limitations of each approach. Recommendations for
improvements are given based on the results.
1. Introduction
Recent studies in the field of computer vision and pattern recognition show great
interest in content retrieval from images and videos. This content can be in the
form of objects, color, texture, and shape, as well as the relationships between them. The
semantic information provided by an image can be useful for content-based image
retrieval, as well as for indexing and classification purposes [4,10]. As stated by Jung,
Kim and Jain in [4], text data is particularly interesting because text can be used to easily
and clearly describe the contents of an image. Since text data can be embedded in an
image or video in different font styles, sizes, orientations, colors, and against a complex
background, the problem of extracting the candidate text region becomes a challenging
one [4]. Also, current Optical Character Recognition (OCR) techniques can only handle
text against a plain monochrome background and cannot extract text from a complex or
textured background [7].
Different approaches for the extraction of text regions from images have been proposed
based on basic properties of text. As stated in [7], text has some common distinctive
characteristics in terms of frequency and orientation information, and also spatial
cohesion. Spatial cohesion refers to the fact that text characters of the same string appear
close to each other and are of similar height, orientation and spacing [7]. Two of the main
methods commonly used to determine spatial cohesion are based on edge [1,2] and
connected component [3] features of text characters.
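To make the connected-component idea concrete, the sketch below labels 4-connected foreground pixels in a binary image, which is the kind of grouping step a connected-component method builds candidate character regions from. The nested-list grid, 4-connectivity, and function name are illustrative assumptions, not the exact procedure of [3]:

```python
from collections import deque

def label_components(binary):
    """Label 4-connected components in a binary image given as a list of
    lists of 0/1 values. Returns a grid of integer labels (0 = background)
    and the number of components found."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                # New component: flood-fill from this seed pixel.
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

In a full pipeline, each labeled component would then be filtered by the spatial-cohesion heuristics mentioned above (similar height, spacing, and alignment of neighboring components).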
The fact that an image can be divided into categories depending on whether or not it
contains any text data can also be used to classify candidate text regions. Thus, other
methods for text region detection, as described in more detail in the following section,
utilize classification techniques such as support vector machines [9,11], k-means
clustering [7] and neural network based classifiers [10]. The algorithm proposed in [8]
uses the focus of attention mechanism from visual perception to detect text regions.
2. Related Work
The purpose of this project is to implement, compare, and contrast the edge-based and the
connected component methods. The other methods mentioned here are examples of text
extraction techniques that can be used for future projects.
Various methods have been proposed in the past for detection and localization of text in
images and videos. These approaches take into consideration different properties related
to text in an image, such as color, intensity, connected components, and edges. These
properties are used to distinguish text regions from their background and/or other regions
within the image. The algorithm proposed by Wang and Kangas in [5] is based on color
clustering. The input image is first pre-processed to remove any noise if present. Then the
image is grouped into different color layers and a gray component. This approach utilizes
the fact that usually the color data in text characters is different from the color data in the
background. The potential text regions are localized using connected component based
heuristics from these layers. Also, an aligning and merging analysis (AMA) method is
used in which each row and column value is analyzed [5]. The experiments conducted
show that the algorithm is robust in locating mostly Chinese and English characters in
images; some false alarms occurred due to uneven lighting or reflection conditions in the
test images.
The text detection algorithm in [6] is also based on color continuity. In addition, it
uses multi-resolution wavelet transforms and combines low-level as well as high-level
image features for text region extraction. The textfinder algorithm proposed in [7] is based on
the frequency, orientation and spacing of text within an image. Texture based
segmentation is used to distinguish text from its background. Further, a bottom-up ‘chip
generation’ process is carried out that uses the spatial cohesion property of text
characters. The chips are collections of pixels in the image consisting of potential text
strokes and edges. The results show that the algorithm is robust in most cases, except for
very small text characters that are not properly detected. Also in the case of low contrast
in the image, misclassifications occur in the texture segmentation.
A focus of attention based system for text region localization has been proposed by Liu
and Samarabandu in [8]. Intensity profiles and spatial variance are used to detect text
regions in images. A Gaussian pyramid is created with the original image at different
resolutions or scales. The text regions are detected in the highest resolution image and
then in each successive lower resolution image in the pyramid.
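The pyramid construction described above amounts to repeated smoothing and 2x downsampling. The sketch below substitutes a simple 2x2 box average for the Gaussian smoothing step purely for illustration; the function names and list-of-lists image representation are assumptions, not the implementation of [8]:

```python
def downsample(img):
    """Halve image resolution by averaging non-overlapping 2x2 blocks.
    (A true Gaussian pyramid convolves with a Gaussian kernel before
    subsampling; the box average here is a simplifying assumption.)"""
    rows, cols = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)]
            for r in range(rows)]

def build_pyramid(img, levels):
    """Return a list of images from full resolution down through
    `levels` successively halved scales."""
    pyramid = [img]
    for _ in range(levels - 1):
        img = downsample(img)
        pyramid.append(img)
    return pyramid
```

Detection would then start at `pyramid[0]` (the highest resolution) and proceed down the list, matching the coarse-to-fine search described above.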
The approach used in [9, 11] utilizes a support vector machine (SVM) classifier to
segment text from non-text in an image or video frame. Initially, text is detected in
multi-scale images using edge-based techniques, morphological operations, and projection
profiles of the image [11]. These detected text regions are then verified using wavelet
features and SVM. The algorithm is robust with respect to variance in color and size of
font as well as language.
3. Approach
The goal of the project is to implement, test, and compare two approaches
for text region extraction in natural images, and to discover how the algorithms perform
under variations in lighting, orientation, and scale of the text. The
algorithms are from Liu and Samarabandu in [1,2] and Gllavata, Ewerth and Freisleben
in [3]. The comparison is based on the accuracy of the results obtained, and precision
and recall rates. The technique used in [1,2] is an edge-based text extraction approach,
and the technique used in [3] is a connected-component based approach.
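The edge-based approach starts from a map of edge strength, on the premise that text strokes produce dense, strong edges. As a minimal illustration, the sketch below estimates gradient magnitude with central differences; [1,2] use directional kernels and additional smoothing and feature maps, so this is a simplified stand-in, not the authors' method:

```python
def gradient_magnitude(img):
    """Approximate edge strength at each interior pixel of a grayscale
    image (list of lists of numbers) using central differences. Strong
    responses mark candidate text strokes; border pixels are left at 0."""
    rows, cols = len(img), len(img[0])
    mag = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (img[r][c + 1] - img[r][c - 1]) / 2.0  # horizontal gradient
            gy = (img[r + 1][c] - img[r - 1][c]) / 2.0  # vertical gradient
            mag[r][c] = (gx * gx + gy * gy) ** 0.5
    return mag
```

A thresholded version of this map, followed by grouping of nearby strong-edge pixels, gives the candidate text regions that the subsequent filtering stages operate on.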
In order to test the robustness and performance of the approaches used, each algorithm
was first implemented in the original proposed format. The algorithms were tested on the
image data set provided by Xiaoqing Liu (xliu65[at]uwo.ca) and Jagath Samarabandu
(jagath[at]uwo.ca), as well as another data set which consists of a combination of indoor
and outdoor images taken from a digital camera. The results obtained were recorded
based on criteria such as invariance with respect to lighting conditions, color, rotation,
and distance from the camera (scale) as well as horizontal and/or vertical alignment of
text in an image. The experiments have also been conducted on images containing
different font styles and text characters belonging to languages other than English.
Also, the precision and recall rates (Equations (1) and (2)) have been computed based on
the number of correctly detected words in an image in order to further evaluate the
efficiency and robustness of each algorithm.
The precision rate is defined as the ratio of correctly detected words to the sum of
correctly detected words plus false positives. False positives are those regions in the
image which are not actually text characters but have been detected by the algorithm
as text regions.
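In code, these rates reduce to simple ratios over word counts. The function names below are mine; they are intended only to mirror what Equations (1) and (2) in the report express:

```python
def precision_rate(true_positives, false_positives):
    """Precision = correctly detected words /
    (correctly detected words + falsely detected regions)."""
    return true_positives / (true_positives + false_positives)

def recall_rate(true_positives, false_negatives):
    """Recall = correctly detected words /
    (correctly detected words + missed words)."""
    return true_positives / (true_positives + false_negatives)
```

For example, an algorithm that correctly detects 8 words while reporting 2 spurious regions has a precision rate of 8 / (8 + 2) = 0.8.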