30-05-2012, 12:58 PM
A Character Segmentation Algorithm for Printed Kannada Text Document
Abstract
Optical Character Recognition (OCR) systems have been developed effectively for recognizing printed characters of non-Indian languages. Comparatively little work exists on character recognition for Indian languages, especially for the Kannada script, which is among the 12 major scripts of India. Segmentation is an important task in any OCR system. It separates the text in a document image into lines, words and characters. The accuracy of an OCR system depends mainly on the segmentation algorithm used. Segmentation of Kannada text is difficult compared with Latin-based languages because of its structural complexity and larger character set. Adjacent characters in a Kannada word sometimes overlap in the vertical projection profile because of bottom extension characters (subscripts, or Vatthus). Profile-based methods can segment only non-overlapping lines and characters. This paper presents an algorithm for segmenting printed Kannada text at the character level. We propose a segmentation method in which the Vatthus of a Kannada character are segmented first using the Connected Component algorithm, and the remaining character is then segmented by the vertical projection profile method. The proposed algorithm is thus based on projection profiles and connected components. Experimental results show that 100% line and word segmentation accuracy and about 98% character segmentation accuracy can be achieved with overlapping characters.
Keywords— Kannada Text, Projection Profile, Character Segmentation, Connected Components.
I. INTRODUCTION
Optical character recognition (OCR) translates a scanned or printed document image into a text document. Once translated, the text can be stored in ASCII or Unicode format. OCR has several practical applications [1], including (1) reading aids for the blind, (2) automatic text entry into the computer for desktop publication, library cataloging, ledgering, etc., (3) automatic reading for sorting of postal mail, bank cheques and other documents, (4) document data compression: from document image to ASCII format, (5) language processing such as indexing, spell checking, grammar checking, etc., and (6) multi-media system design.
In OCR systems, efficient character segmentation is a crucial pre-processing step for reliable character recognition. The overall success (recognition) rate of an OCR system depends mainly on the proper segmentation of characters. A typical OCR system consists of image capturing, pre-processing, segmentation, feature extraction and recognition stages. Segmentation refers to the extraction of objects of interest from the rest of the image. It is one of the decisive stages in an OCR system, because incorrectly segmented characters will not be recognized properly. This reduces the recognition rate of the OCR system.
The methods of segmentation are broadly classified into three strategies as follows:
• The classical or dissection approach: In this approach, the segments are identified by extracting the distinguishing attributes of the character image.
• Recognition-based Segmentation: The image as a whole is searched for components that match predefined classes.
• Holistic Approach: The system tries to recognize the word as a whole.
The classical segmentation approach uses several methods of segmentation, such as white space and pitch, projection analysis, and connected component processing. The most common approach is projection profile analysis, since it is simple and fast.
In Kannada text, for words having bottom extension characters (Vatthus), the space between two adjacent characters does not produce zero-valued valleys in the vertical projection profile, which makes it difficult to extract individual characters from the word, as shown in fig (8). In such situations the usual approach is to use the Connected Component method, treating each individual character in the text (both main and subscript) as a separate component of the image. This, however, is both time consuming and computationally expensive.
In this paper we propose a segmentation algorithm in which only the Vatthus (subscripts) are segmented from the character using Connected Component processing; the remaining character is then easily segmented using the traditional vertical projection profile method. Because segmentation by projection is simple and fast, the proposed algorithm is faster than the conventional method, in which all the characters in the text are segmented by connected component processing alone.
The major strength of the proposed algorithm, which combines connected component analysis and projection profiles, is that it works faster than the classical single-stage method of segmenting characters using connected component analysis only. The proposed method can also be used for other scripts (e.g., Telugu) whose word structure consists of 'main' and 'vatthu' (subscript) like characters.
II. LITERATURE SURVEY
Currently there are many OCR systems available for handling printed/handwritten English documents with reasonable accuracy. However, there are not many reported efforts at developing OCR systems for Indian languages, especially for a South Indian language like Kannada. Some previous work on character segmentation for different languages is summarized below.
Amara and Noureddine [7] have described a method for segmentation of printed Arabic characters using a modified histogram as well as the number of black segments in a line of pixels.
Veena and Mishra [3] have proposed a method for segmentation of touching and fused Devanagari characters in two stages. In the first stage the words are segmented into easily separable characters or composite characters. Statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second stage the hypothesized characters are further segmented.
Pal and Sagarika [6] have proposed a method of segmentation of unconstrained Bangla characters based on piece-wise projection for the segmentation of lines and Water reservoir principle for segmenting characters inside a word.
Anniwear and Yoshinao [8] propose a segmentation technique for Uygur Scripts.
In the OCR system for Tamil proposed by Aparna and Radhakrishna [2], horizontal and vertical projection profiles are employed for line and word segmentation. Connected component analysis is performed on words to extract the individual characters.
In the Kannada OCR system proposed by Ashwin and Sastry [4], a segmentation method is described in which the words are vertically segmented into three zones. However, there are a few situations in which the zones overlap, which reduces the recognition rate of the OCR system.
This problem is solved in our segmentation by extracting the character as a whole instead of its constituents.
OVERVIEW OF KANNADA LANGUAGE
Kannada is one of the four popular Dravidian languages of South India. Kannada script is written horizontally from left to right, and the concept of upper and lower case is absent. It is a non-cursive script, i.e. a Kannada word is written without joining the characters of the word; the characters are isolated within the word. Kannada script is more complicated than English because of the presence of compound characters. Modern Kannada has 51 base characters, called the Varnamale. There are 16 vowels and 35 consonants, as shown in fig (1) and fig (2), respectively.
Figure 1: Kannada Vowels
Figure 2: Kannada Consonants
Consonants take modified shapes when combined with vowels. When a consonant character is used alone, it results in a dead consonant (mula vyanjana). Vowel modifiers can appear to the right, on top or at the bottom of the base consonant. A basic consonant can combine with the vowel signs to form another set of 16 Consonant-Vowel (CV) composite characters, called Gunithakshara, as shown in fig (3).
Figure 3: Consonant-Vowel Composite Characters
In Kannada, all the 34 consonants have a half/short form, which is referred to as a half consonant (Vatthu or subscript), as shown in fig (4).
Figure 4: short forms/Half Consonants (Vatthus or Subscripts)
Any half consonant can appear below any other consonant or a CV character as a bottom extension character to form a Conjunct-Consonant character. Three examples of such characters are shown in fig (5).
Figure 5: An example of Conjunct-Consonant
In the rest of the paper, Vatthus or half consonants are referred to as subscripts, and all characters other than the subscripts (vowels, consonants, and CV characters) are referred to as main characters.
III. SEGMENTATION METHODOLOGY
Segmentation extracts lines, then words, and finally characters from the text document image. Our segmentation system uses the classical approach, in which the scanned image is dissected into individual building blocks to be recognized as characters. The dissection method makes use of properties such as height, width and spacing.
The proposed method first segments the lines, and then the words, from the scanned document image using horizontal and vertical projection profiles respectively. Each word is then segmented into individual characters using the vertical projection profile. Each segmented character is examined for the presence of a subscript. If a subscript is present, it is extracted using the Connected Component method. If no subscript is present (or once the subscript has been extracted), the main character is segmented using the vertical projection profile.
The details of the segmentation methodology adopted for segmentation of lines, words and characters are now described.
A. Line Segmentation
To separate the text lines, the horizontal projection profile of the text document image is computed. The horizontal projection profile (HPP) is a histogram of the number of ON pixels along every row of the image. When the projection profile is plotted, peaks and valleys appear in the plot. The white space between text lines is used to segment them. Fig (6) shows an example of a Kannada document along with its horizontal projection. The projection profile has valleys of zero height between the text lines, and the lines are cut at these valleys.
Fig (6): Text Lines with Horizontal Projection Profile
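The line segmentation step can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation; it assumes the page is already binarized into a list of rows in which 1 marks an ON (ink) pixel, and the function names are hypothetical.

```python
def horizontal_projection(image):
    """Count ON pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line."""
    hpp = horizontal_projection(image)
    lines = []
    start = None
    for y, count in enumerate(hpp):
        if count > 0 and start is None:
            start = y                      # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, y))       # zero-height valley closes the line
            start = None
    if start is not None:                  # line that touches the bottom edge
        lines.append((start, len(hpp)))
    return lines


# Toy page: two "text lines" separated by two empty rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```

Each returned pair can be used to slice the page into one line image for the word segmentation stage.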
B. Word Segmentation
The spacing between words is used for word segmentation. In Kannada script, the spacing between words is greater than the spacing between the characters within a word. The spacing is found by taking the Vertical Projection Profile (VPP) of an input text line. The vertical projection profile is the sum of ON pixels along every column of the image. A sample input text line and its vertical projection profile are shown in fig (7). From the histogram it is clear that the zero-valued valleys between words are wider than the zero-valued valleys between characters within a word. This information is used to separate words from the input text lines.
Fig (7): Input Text Lines and its vertical Projection Profile indicating word segmentation for a sample document
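A minimal Python sketch of this gap-width rule is given here. The paper does not state how the width threshold is chosen, so `min_gap` is an assumed parameter; the function names and the list-of-rows image representation (1 = ON pixel) are likewise assumptions made for illustration.

```python
def vertical_projection(image):
    """Count ON pixels (value 1) in every column of a binary image."""
    return [sum(column) for column in zip(*image)]


def segment_words(line_image, min_gap):
    """Split a text line into (left, right) column ranges, right exclusive.

    min_gap is an assumed threshold: a run of at least min_gap empty columns
    is treated as a word gap, and a shorter run as a gap between characters.
    """
    vpp = vertical_projection(line_image)
    words = []
    start = None       # left edge of the word being built
    last_ink = None    # most recent column that contained ink
    for x, count in enumerate(vpp):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            words.append((start, last_ink + 1))   # wide valley: close the word
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


# Toy line: character gaps are 1 column wide, the word gap is 3 columns wide.
line = [
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
]
print(segment_words(line, min_gap=3))  # [(0, 3), (6, 10)]
```

With `min_gap=1` every zero-valued valley becomes a cut, which is the behaviour needed when splitting a word without subscripts into characters.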
C. Character Segmentation
Kannada is a non-cursive script in which the individual characters in a word are isolated. Spacing between the characters is used for segmentation. But sometimes in the vertical projection profile there will not be any zero-valued valleys due to the presence of conjunct-consonant (subscripts) characters. The subscript character position overlaps with the two adjacent main characters in vertical direction as shown in fig (8).
Fig (8): A sample Kannada word with subscripts along with its vertical projection profile
Hence in these cases characters cannot be separated by the usual vertical projection profile method, and the following approach is used instead.
For character segmentation it is first necessary to check whether a character has a subscript. For this purpose the Kannada character is divided into different horizontal zones, as explained below.
1) Zones in a Kannada word: A Kannada word can be divided into different horizontal zones. Two cases are considered: a word without subscripts, as in Fig. (9) (pronounced RAMANU), and a word with subscripts, as in Fig. (10) (pronounced PRASHASTHAVAAGI).
Fig (9): Two-Horizontal zones in a sample word without conjunct-consonant character
Fig (10): Three-Horizontal zones in a sample word with conjunct-consonant character
Consider the sample word in Fig. (9), which does not have a subscript character. The imaginary horizontal line that passes through the topmost pixel of the word is the top line. Similarly, the horizontal line that passes through the bottommost pixel of the main character is the base line. The horizontal line passing through the first peak in the profile is the head line. The word can be divided into top and middle zones. The top zone is the portion between the top line and the head line, and the middle zone is the portion between the head line and the base line.
A word with conjunct-consonant characters is divided into three horizontal zones, as shown for a sample word with subscripts in Fig. (10): the top, middle and bottom zones. The top and middle zones are chosen as for a word without subscripts. The bottom zone lies between the base line and the bottom line, where the bottom line is the horizontal line passing through the bottommost pixel of the word.
Before character segmentation it is first necessary to find out whether the segmented character has a subscript or not. This can be detected as follows:
i.) In the HPP as in Fig (9), there are two peaks of approximately equal size in the top and middle zones of the word. The absence of the third peak after the second peak indicates that there are no subscripts in the word.
ii.) In the HPP as in Fig (10), there are two peaks of approximately equal size in the top and middle zones of the word. Also, there is an occurrence of third peak after the second peak in the bottom zone of the word, which is due to the subscripts in the word.
Thus, by checking the presence or absence of the third peak in the bottom zone of the horizontal projection profile of the segmented Kannada character, it is possible to find out whether the segmented word has a subscript or not.
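The third-peak test can be sketched in Python as follows. The paper does not define how peaks are detected, so this sketch assumes a peak is any local maximum of the HPP whose height is at least `min_height`; both that definition and the names used are assumptions, and a real profile would probably need smoothing before peaks are counted.

```python
def count_peaks(profile, min_height=1):
    """Count local maxima (plateaus count once) of at least min_height."""
    peaks = 0
    rising = False
    prev = 0
    for value in list(profile) + [0]:      # trailing 0 closes a final peak
        if value > prev:
            rising = True
        elif value < prev:
            if rising and prev >= min_height:
                peaks += 1                 # profile just turned downward
            rising = False
        prev = value
    return peaks


def has_subscript(hpp, min_height=1):
    """Assumed rule: a third HPP peak signals ink in the bottom zone."""
    return count_peaks(hpp, min_height) >= 3


# Toy profiles: two peaks (top and middle zones) versus three peaks.
hpp_plain = [0, 4, 9, 5, 3, 8, 4, 0]
hpp_with_subscript = [0, 4, 9, 5, 3, 8, 4, 1, 3, 1, 0]
print(has_subscript(hpp_plain), has_subscript(hpp_with_subscript))  # False True
```

Raising `min_height` makes the test less sensitive to small ripples in the profile, at the risk of missing a faint subscript.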
2) Character Segmentation of a word without Subscripts:
Consider a Kannada word which does not have any subscripts, as shown in fig (11). There are zero-valued valleys in the VPP of the word, which makes character separation easier. The VPP is scanned from left to right. The portion of the image that lies between two successive zero-valued valleys of the VPP is taken to be a separate character and is separated out.
Fig (11): Vertical Projection Profile of a word without subscripts
3) Character Segmentation of a word having subscripts: Consider a sample Kannada word, as shown in Fig (12), which contains subscripts.
Fig (12): A sample word with subscripts
If the VPP of this word is considered, there are no zero-valued valleys between the third character and its subscript, or between the fifth character and its subscript. Hence the zero-valued valleys of the vertical projection alone do not determine the character separation. The individual characters in this case are separated as follows:
4) Subscript character segmentation: Consider a sample character as shown in Fig. (13).
Fig (13): A sample character with subscript
The total height of the character in terms of number of rows (H) is calculated. The columns of the character are scanned from left to right, and every column is scanned from bottom to top to find an ON pixel P. When such a pixel is found, the number of rows traversed upward from the bottom (L) is counted. If L is less than or equal to a threshold value, the pixel P is assumed to be one of the points of the subscript character. Then, using P as the initial point, the Connected Component algorithm is applied to extract the subscript character at that position. The threshold value is obtained from the position of the valley between the second and third peaks of the HPP, which lies below the base line in the bottom zone of the character.
Connected Component analysis extracts a group of pixels connected by 8-connectivity. Given one interior point of a connected component in an image, all the pixels of the component can be extracted iteratively using the connected component algorithm. At the end of this process, after the subscripts have been separated, what remains is a plain character without any subscript, as in Fig (14).
Fig (14): Plain Word and segmented subscript of Fig (13)
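The subscript extraction procedure can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' MATLAB code: the breadth-first flood fill stands in for "the connected component algorithm", `threshold` is passed in rather than derived from the HPP valley, and all names are hypothetical. As in the paper, the sketch assumes that the subscript does not touch the main character.

```python
from collections import deque


def extract_component(image, seed):
    """Collect all ON pixels 8-connected to seed (breadth-first search)."""
    height, width = len(image), len(image[0])
    seen = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < height and 0 <= nx < width
                        and image[ny][nx] == 1 and (ny, nx) not in seen):
                    seen.add((ny, nx))
                    queue.append((ny, nx))
    return seen


def split_subscript(image, threshold):
    """Return (main, subscript) images with the same size as image.

    threshold is the assumed maximum L (rows counted up from the bottom row)
    at which the lowest ON pixel of a column is taken as a subscript point.
    """
    height, width = len(image), len(image[0])
    main = [row[:] for row in image]
    subscript = [[0] * width for _ in range(height)]
    for x in range(width):                     # columns, left to right
        for y in range(height - 1, -1, -1):    # each column, bottom to top
            if main[y][x] == 1:
                if height - 1 - y <= threshold:
                    for py, px in extract_component(main, (y, x)):
                        main[py][px] = 0       # move the pixel out of main
                        subscript[py][px] = 1
                break                          # only the lowest ON pixel is tested
    return main, subscript


# Toy glyph: a ring-shaped main character with a 2x2 subscript below it.
glyph = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
main, sub = split_subscript(glyph, threshold=1)
for row in sub:
    print(row)
```

Because the flood fill starts only from pixels near the bottom edge, main characters are never traversed pixel by pixel, which is the source of the speed-up the paper claims over applying connected components to every character.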
5) Main Character Segmentation: Removing the subscript converts a character with a subscript into a plain character without one. The main character is then segmented using the vertical projection profile, as explained earlier.
IV. EXPERIMENTAL RESULTS
The algorithm is implemented in MATLAB. It was tested on several printed Kannada document images containing characters of different fonts and sizes. We considered only good-quality printed documents with no touching or broken characters.
After all segmentation steps are complete, we have two databases: one contains only the main characters, and the other contains only the separated subscripts of that document. Since the subscript is separated at the level of each individual character, it is easier to link a main character with its subscript, which was a difficult task in earlier methods.
The same sample page was also segmented using the conventional method, in which all the characters of the text are treated as individual components and extracted using the connected component method. The time taken was greater than with the proposed method. In the conventional method, the connected component method is applied to both main and subscript characters, whereas in the proposed method only the subscripts are extracted using connected component analysis. The main character is extracted using the simple projection profile method, which does not take much time. Normally, the ratio of subscripts to main characters in a Kannada document is very small. Hence segmentation by the proposed method is much faster.
The proposed method also works for characters in which the main character itself descends into the bottom zone, as shown in fig (15), which was a failure case in earlier methods.
Fig (15): A sample Kannada word having characters with their vowel modifier extended to bottom zone
VI. FUTURE WORK:
The proposed method considers only good-quality printed documents without any touching or broken characters. It can be extended to handle touching and broken characters in the document. Segmentation of touching lines and characters may require some heuristic approaches.