03-04-2012, 01:06 PM
Text Information Extraction in Images and Video: A Survey
Abstract
Text data present in images and video contain useful information for automatic annotation, indexing, and structuring of images. Extraction of this information involves detection, localization, tracking, extraction, enhancement, and recognition of the text from a given image. However, variations of text due to differences in size, style, orientation, and alignment, as well as low image contrast and complex backgrounds, make the problem of automatic text extraction extremely challenging. While comprehensive surveys of related problems such as face detection, document analysis, and image and video indexing can be found, the problem of text information extraction has not been well surveyed. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to classify and review these algorithms, to discuss benchmark data and performance evaluation, and to point out promising directions for future research.
Keywords: Text information extraction, text detection, text localization, text tracking, text enhancement, OCR
1 Introduction
Content-based image indexing refers to the process of attaching labels to images based on their content. Image content can be divided into two main categories: perceptual content and semantic content [1]. Perceptual content includes attributes such as color, intensity, shape, texture, and their temporal changes, whereas semantic content refers to objects, events, and their relations. A number of studies on the use of relatively low-level perceptual content [2-6] for image and video indexing have already been reported. Studies on semantic image content in the form of text, faces, vehicles, and human actions have also attracted some recent interest [7-16]. Among them, text within an image is of particular interest because (i) it is very useful for describing the contents of an image; (ii) it can be extracted more easily than other kinds of semantic content; and (iii) it enables applications such as keyword-based image search, automatic video logging, and text-based image indexing.
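To make the notion of low-level perceptual content concrete, the following is a minimal sketch of one such attribute: a normalized intensity histogram used as an index feature, with an L1 distance for retrieval. The list-of-pixels representation, the bin count, and the function names are illustrative assumptions, not details from the survey.

```python
# Sketch of a perceptual-content feature for content-based indexing:
# a normalized grayscale intensity histogram (one of the low-level
# attributes -- color, intensity, texture -- mentioned above).

def intensity_histogram(pixels, bins=8):
    """Return a normalized histogram of pixel intensities in [0, 255]."""
    counts = [0] * bins
    width = 256 / bins
    for p in pixels:
        counts[min(int(p / width), bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def histogram_distance(h1, h2):
    """L1 distance between two histograms; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(h1, h2))
```

An index built this way simply stores one histogram per image and ranks candidates by `histogram_distance` at query time; note that such features carry no semantic information, which is precisely the gap that text extraction aims to fill.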
1.1 Text in images
A variety of approaches to text information extraction (TIE) from images and video have been proposed for specific applications including page segmentation [17, 18], address block location [19], license plate location [9, 20], and content-based image/video indexing [5, 21]. In spite of such extensive studies, it is still not easy to design a general-purpose TIE system. This is because there are so many possible sources of variation when extracting text from a shaded or textured background, from low-contrast or complex images, or from images having variations in font size, style, color, orientation, and alignment. These variations make the problem of automatic TIE extremely difficult.
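The stages named in the abstract (detection, localization, extraction, followed by enhancement and recognition) can be sketched as a toy pipeline. This is purely illustrative: the 2-D list representation, the fixed binarization threshold, and the assumption that text pixels are simply "bright pixels on a dark background" are placeholder choices, and real TIE systems replace each stage with far more robust algorithms.

```python
# Toy sketch of the first TIE stages: detection, localization, extraction.
# "Text" is crudely approximated as pixels brighter than a fixed threshold.

THRESHOLD = 128  # illustrative binarization threshold (an assumption)

def detect_text(image):
    """Detection: decide whether the image contains any candidate text pixels."""
    return any(p > THRESHOLD for row in image for p in row)

def localize_text(image):
    """Localization: bounding box (top, left, bottom, right) of candidate pixels."""
    coords = [(r, c) for r, row in enumerate(image)
              for c, p in enumerate(row) if p > THRESHOLD]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

def extract_text_region(image, box):
    """Extraction: crop the localized region for later enhancement and OCR."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

The difficulty the survey describes is exactly that no single threshold or rule of this kind survives variations in font, color, orientation, contrast, and background.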
Figures 1-4 show some examples of text in images. Page layout analysis usually deals with document images1 (Fig. 1). Readers may refer to papers on document segmentation/analysis [17, 18] for more examples of document images. Although images acquired by scanning book covers, CD covers, or other multi-colored documents have characteristics similar to those of document images (Fig. 2), they cannot be directly dealt with using conventional document image analysis techniques. Accordingly, this survey distinguishes this category of images as multi-color document images from other document images. Text in video images can be further classified into caption text (Fig. 3), which is artificially overlaid on the image, and scene text (Fig. 4), which exists naturally in the image. Some researchers use the term ‘graphics text’ for scene text, and ‘superimposed text’ or ‘artificial text’ for caption text [22, 23]. It is well known that scene text is more difficult to detect, and very little work has been done in this area. In contrast to caption text, scene text can have any orientation and may be distorted by perspective projection. Moreover, it is often affected by variations in scene and camera parameters such as illumination, focus, motion, etc.

1 The distinction between document images and other scanned images is not very clear. In this paper, we refer to images with text contained in a homogeneous background as document images.

Fig. 1. Grayscale document images: (a) single-column text from a book, (b) a two-column page from a journal (IEEE Transactions on PAMI), and (c) an electrical drawing (courtesy of Lu [24]).

Fig. 2. Multi-color document images: each text line may or may not be of the same color.

Fig. 3. Images with caption text: (a) shows captions overlaid directly on the background; (b) and (c) contain text in frames for better contrast; (c) contains a text string that is polychrome.

Fig. 4. Scene text images: images with variations in skew, perspective, blur, illumination, and alignment.
Fig. 4. Scene text images: Images with variations in skew, perspective, blur, illumination, and alignment.
Before we attempt to classify the various techniques used in TIE, it is important to define the commonly used terms and summarize the characteristics2 of text that can be used for TIE algorithms. Table 1 shows a list of properties that have been utilized in recently published algorithms [25-30]. Text in images can exhibit many variations with respect to the following properties:
1. Geometry: