
M Swamy Das et. al. / International Journal of Engineering Science and Technology Vol. 2(11), 2010, 6606-6610

SEGMENTATION OF OVERLAPPING TEXT LINES, CHARACTERS IN PRINTED TELUGU TEXT DOCUMENT IMAGES

M Swamy Das, Dept. of Computer Science and Engineering, Hyderabad, INDIA

Dr. CRK Reddy, Dept. of Computer Science and Engineering, Hyderabad, INDIA

Dr. A Govardhan, Dept. of Computer Science and Engineering, JNTU Jagtial, INDIA

G. Saikrishna, Dept. of Computer Science and Engineering, Hyderabad, INDIA

Abstract: Segmentation is an important task in any OCR system. It separates a text document image into lines, words and characters. The accuracy of an OCR system depends mainly on the segmentation algorithm used. Segmenting Telugu text is difficult compared with Latin-based languages because of its structural complexity and larger character set, which contains vowels, consonants and compound characters. Some of the characters may overlap. Profile-based methods can segment only non-overlapping lines and characters. This paper addresses the segmentation of overlapped text lines and characters. The proposed algorithm is based on projection profiles, connected components and vertical spatial relationships. It also uses the nearest neighborhood method to cluster the connected components. Experimental results show that 100% line segmentation accuracy and about 98% character segmentation accuracy can be achieved with overlapping lines and characters.

Keywords: segmentation; nearest neighborhood; glyphs; projection profiles; base line.

1. Introduction

Optical character recognition (OCR) is a process that translates a scanned or printed document image into a text document. Once translated, the text can be stored in ASCII or Unicode format. OCR has several practical applications [1], including (1) reading aid for the blind, (2) automatic text entry into the computer for desktop publication, library cataloging, ledgering, etc., (3) automatic reading for sorting of postal mail, bank cheques and other documents, (4) document data compression: from document image to ASCII format, (5) language processing such as indexing, spell checking, grammar checking, etc., and (6) multi-media system design.
The typical phases [2] of an OCR system are
o Preprocessing
o Segmentation
o Recognition

The preprocessing phase includes conversion of the gray-scale image into a binary image, noise removal, thinning, and skew detection and correction. The segmentation phase divides the text image into lines, words and characters. The final recognition phase consists of feature extraction, selection and classification.

Actual processing takes place on binary images. A binary image separates the foreground pixels from the background. Otsu's method [3] can be used for binarization. In this method a threshold value is selected; intensity values above the threshold are converted into one intensity value (white) and values below the threshold into another (black).
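The binarization step can be illustrated with a minimal Python sketch of Otsu's threshold selection. This is not the authors' MATLAB code. The function names `otsu_threshold` and `binarize`, the 256 gray levels, and the choice to mark dark ink as 1 are assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance
    when pixels <= t form one class and pixels > t form the other."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low = 0   # pixels at or below the candidate threshold
    sum_low = 0     # sum of their gray levels
    for t in range(levels):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, nothing to compare
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        between_var = count_low * count_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray, threshold):
    """Map a gray-scale image (list of rows) to 0/1.
    Assumption: 1 marks dark ink (<= threshold), 0 marks light paper."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


# Tiny synthetic page: dark ink (about 20) on light paper (about 220).
gray = [
    [220, 221, 20, 219],
    [222, 19, 21, 220],
    [218, 220, 22, 221],
]
flat = [value for row in gray for value in row]
binary = binarize(gray, otsu_threshold(flat))
```

The later sketches assume the same convention, with foreground pixels stored as 1 in a list of rows.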

ISSN: 0975-5462


Scanned documents may contain noise, which must be removed before the document is processed. The commonly used approach is to low-pass filter the image and use the filtered image for later processing. After noise removal, the image is thinned (reduced to a one-pixel-wide representation) so that the image components keep only the information essential for further analysis and recognition. The scanned document may also contain skew; the skew angle is identified and the skew is corrected.

Preprocessing (noise removal, deskewing, digitization)

Segmentation (lines, words and characters)

Feature extraction and recognition

Fig. 1. Steps of an OCR System

The noise-free, skew-corrected image output by the preprocessing phase is given as input to the segmentation stage. In this stage the binary image is separated into lines, words and characters. Several segmentation methods are discussed in Section 3. The extracted characters are then given as input to the feature extraction and recognition phase, which recognizes and classifies them. Figure 1 shows the steps of a typical OCR system.

2. The characteristics of Telugu script

Telugu is one of the most popular spoken languages of South India and has its own script. The character set of Telugu contains 16 vowels, 36 consonants, vowel modifiers (maatras) and consonant modifiers (vaththus). These orthographic units are combined to represent several frequently used syllables (estimated between 5000 and 10000) in the language [4]. We refer to these basic orthographic units as glyphs (single connected component representation). These characters vary in size (i.e. width and height), whereas in Latin-based scripts most characters have the same size, with a few exceptions. In printed Telugu newspapers, books and other documents we find characters that overlap or touch, as seen in Figure 4. Segmentation of such characters is difficult compared with Latin-based scripts like English. Figure 2 shows some sample images of simple and compound Telugu characters.

Fig. 2. Examples for simple and compound characters

3. Segmentation algorithms

Segmentation extracts lines, then words, and finally characters from text document images. Segmentation methods are classified into dissection, recognition-based and holistic methods [5]. The dissection method makes use of properties such as height, width and spacing. It is suitable for printed document images in which each character image is well spaced. The other two methods are based on component searching and are used for segmenting handwritten document images.

Several dissection-based methods have been reported in the literature for segmenting Indian script documents. These include projection profile (white space analysis) [6], Voronoi and docstrum [7], graph cut, and connected component based methods. All these methods result in segmentation errors. Jawahar [8] proposed the graph cut method, which requires a priori information about the script structure to cut. Rajasekharan proposed a projection-based method for Kannada script document segmentation [9].

In the projection profile methods, the horizontal and vertical profiles are computed. A projection profile is a histogram of the foreground pixel counts along the rows (horizontal profile) or columns (vertical profile) of the image. When the projection profiles are plotted, we can see peaks and valleys in the plot. The zero-valued valleys are identified to separate the lines, words and characters. The horizontal profile is used for line segmentation and the vertical profile is used for word and character segmentation. Examples are shown in Fig. 4. When the characters overlap or touch, this method cannot segment them, as shown in Fig. 5. This method is suitable for segmenting document images that are well spaced, without overlapping or touching.

The connected component (CC) method [10, 13] first labels the pixels in the image. Pixels that are connected are given the same label and form one blob. The connectivity can be 4 or 8. After labeling, the labeled components are extracted from the image. The CC method solves the overlapping character segmentation problem, but it separates simple characters into their constituent glyphs, which may increase the recognition complexity. For example, certain characters will each be segmented into two glyphs [12]. These glyphs have to be reassembled to preserve the character shape if the recognition phase uses the shapes of the basic characters. If the recognition system uses the basic characters, the CC method is suitable.

The proposed method can solve the overlapping segmentation problem. It makes use of projection profiles and CCs with some heuristics to segment overlapped text documents in a robust way. The method is described in the next section.

4. Proposed method

In Telugu script, the consonant and vowel modifiers may be attached to or placed on the top, bottom, left or right of the base character, as shown in Figure 2. The text document image may contain overlapped lines and characters. To segment the image into lines and characters, this method first extracts and labels the connected components of the document image. For each connected component the top, bottom, left and right positions are identified.
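The labeling step and the top, bottom, left and right positions can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation. The function name `label_components`, the breadth-first flood fill and the `(top, bottom, left, right)` tuple layout are assumptions, and the image is assumed to be a list of rows with foreground pixels stored as 1.

```python
from collections import deque


def label_components(img, connectivity=8):
    """Label connected foreground pixels and collect their bounding boxes.

    Returns (labels, boxes): labels has the image shape with 0 for background
    and 1..n for components; boxes[k] is (top, bottom, left, right) of
    component k + 1.
    """
    rows = len(img)
    cols = len(img[0]) if rows else 0
    if connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:  # 4-connectivity: only horizontal and vertical neighbours
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    labels = [[0] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or labels[r][c]:
                continue
            label = len(boxes) + 1
            labels[r][c] = label
            top = bottom = r
            left = right = c
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and img[ny][nx] == 1 and not labels[ny][nx]):
                        labels[ny][nx] = label
                        queue.append((ny, nx))
            boxes.append((top, bottom, left, right))
    return labels, boxes


img = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
]
labels, boxes = label_components(img)
```

With 8-connectivity the two diagonally adjacent pixels on the right form one component. With `connectivity=4` they are labeled separately, which shows how the choice of connectivity changes the number of glyphs.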
Using these positions, spatial relationships are established and the components are clustered using the nearest neighborhood algorithm [11]. The following sections describe line, word and character segmentation using this method.

4.1. Line segmentation

1. Label the connected components of the given document image.
2. Determine the top, bottom, left and right positions of each CC using its bounding box.
3. Establish the following vertical spatial relations to check whether two CCs are
   a. fully overlapped, or
   b. partially overlapped.
4. Cluster the CCs using the nearest neighborhood method to extract lines.

A connected component is said to be fully overlapped if its top and bottom positions lie within the range of another component's top and bottom positions. A component is said to be partially overlapped if the overlapping distance is greater than or equal to half of its height. Figure 3 shows fully and partially overlapping components.

Fig. 3. Example of fully and partially overlapped components

To cluster the connected components into lines, construct an undirected graph in which each node represents a connected component and each link represents the distance between components. If a component overlaps another component, compute the Euclidean distance between the two components using the relationships established in the above algorithm. The distance between non-overlapping components is infinite, i.e. they are not reachable. If a connected component is reachable from, and nearest to, a component that belongs to a cluster, add it to that cluster. Each cluster of connected components forms a line. This clustering approach is the nearest neighborhood approach. Fig. 4 (a, b, c) shows a sample text image, its horizontal projection profile and the text lines extracted using this method.

4.2. Word segmentation

In Telugu documents words are generally well spaced, although the spacing varies. To segment a text line image into words, compute its vertical projection profile. In the profile, the zero-valued valleys may represent either character spaces or word spaces. To differentiate between character and word spacing, find the maximum character space cluster and use it for separating the words. From experimentation with several Telugu document images, it was found that the word space varies from 4 to 9 and the character space varies from 0 to 4. A spacing of 4 is used for word segmentation.

4.3. Character segmentation

Character segmentation of Telugu words is somewhat complex, since vowel modifiers are placed on or attached to the top of base characters and consonant modifiers are attached to the left, right or bottom of the base character. Projection profiles alone may lead to under-segmentation errors. To extract overlapping characters, this method first removes the vowel modifiers and consonant modifiers. After their removal, the word image contains only the base characters, with clear paths between them. This space can be used to segment the base characters. Vertical projections are computed for this image and used to separate the base characters. The vowel and consonant modifiers removed earlier are then added back to the base characters using the nearest neighborhood method. The algorithm is given as

1. Remove consonant modifiers from the word. For this,
   a. Determine the middle row using the bounding box (i.e. height/2).
   b. Compute the horizontal profile and identify the bottom base line using this profile. The bottom base line is the highest peak row in the profile below the middle row.
   c. The CCs below the bottom base line are consonant modifiers. Remove them from the word and add them to the consonant modifiers group.
2. Remove vowel modifiers by finding the top base line, computed as in the above step, and add the modifiers to the vowel modifier group.
3. Using the vertical profile, separate the base characters along the clear paths between them. Then add the vowel and consonant modifiers using the nearest neighborhood method with horizontal relationship heuristics.
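The character segmentation steps of Section 4.3 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation. The names `base_lines` and `segment_characters` are hypothetical, the component bounding boxes are assumed to come from a prior labeling step, a modifier is assumed to lie entirely above the top base line or entirely below the bottom base line, and the vertical profile of the stripped word is approximated by the columns that the base-component bounding boxes cover.

```python
def horizontal_profile(img):
    """Foreground pixel count of every row (img is a list of 0/1 rows)."""
    return [sum(row) for row in img]


def base_lines(img):
    """Return (top_base_line, bottom_base_line) as row indices.

    Each base line is taken as the highest profile peak on its side of the
    middle row; the word image is assumed to be at least two rows tall.
    """
    profile = horizontal_profile(img)
    middle = len(img) // 2
    top = max(range(middle), key=lambda r: profile[r])
    bottom = max(range(middle, len(img)), key=lambda r: profile[r])
    return top, bottom


def segment_characters(img, boxes):
    """Group component indices into characters.

    boxes[i] = (top, bottom, left, right) of connected component i.
    """
    top_line, bottom_line = base_lines(img)
    base, modifiers = [], []
    for i, (top, bottom, _left, _right) in enumerate(boxes):
        # Assumption: a modifier lies entirely outside the base-line band.
        if top > bottom_line or bottom < top_line:
            modifiers.append(i)
        else:
            base.append(i)

    # Columns covered by base components (bounding-box approximation of the
    # vertical profile of the word after the modifiers are removed).
    width = len(img[0])
    covered = [False] * width
    for i in base:
        for x in range(boxes[i][2], boxes[i][3] + 1):
            covered[x] = True

    # Each run of covered columns between clear paths is one base character.
    spans, start = [], None
    for x in range(width + 1):
        filled = x < width and covered[x]
        if filled and start is None:
            start = x
        elif not filled and start is not None:
            spans.append((start, x - 1))
            start = None

    characters = [[] for _ in spans]
    for i in base:
        for k, (lo, hi) in enumerate(spans):
            if lo <= boxes[i][2] and boxes[i][3] <= hi:
                characters[k].append(i)
                break

    # Re-attach each modifier to the horizontally nearest base character.
    for i in modifiers:
        if not spans:
            break
        centre = (boxes[i][2] + boxes[i][3]) / 2
        k = min(range(len(spans)),
                key=lambda k: abs(centre - (spans[k][0] + spans[k][1]) / 2))
        characters[k].append(i)
    return characters
```

Attaching a modifier by the distance between horizontal centres is only one possible reading of the "horizontal relationship heuristics"; the paper does not spell the heuristic out.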

(a). Sample text image

(b). Horizontal profile showing no zero valleys

(c). Extracted text lines

Fig. 4. Line segmentation
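The overlap relations and nearest neighborhood clustering of Section 4.1 can be sketched in Python as below. This is one possible interpretation, not the authors' implementation. The function names are hypothetical. The sketch assumes that the partial-overlap test is applied to the shorter of the two components, that the Euclidean distance is measured between bounding-box centres, and that components are visited from tallest to shortest so that base characters seed the lines.

```python
import math


def vertical_overlap(a, b):
    """Number of pixel rows shared by boxes a and b: (top, bottom, left, right)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_reachable(a, b):
    """True if the two boxes are fully or partially overlapped vertically."""
    fully = (b[0] <= a[0] and a[1] <= b[1]) or (a[0] <= b[0] and b[1] <= a[1])
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    # Assumption: overlap of at least half the shorter component's height.
    partially = 2 * vertical_overlap(a, b) >= shorter
    return fully or partially


def centre_distance(a, b):
    """Euclidean distance between the centres of two bounding boxes."""
    ay, ax = (a[0] + a[1]) / 2, (a[2] + a[3]) / 2
    by, bx = (b[0] + b[1]) / 2, (b[2] + b[3]) / 2
    return math.hypot(ay - by, ax - bx)


def cluster_lines(boxes):
    """Return text lines as lists of component indices, top to bottom."""
    # Assumption: tallest components first, so base characters seed the lines.
    order = sorted(range(len(boxes)), key=lambda i: -(boxes[i][1] - boxes[i][0]))
    cluster_of = {}   # component index -> cluster index
    clusters = []
    for i in order:
        best, best_d = None, math.inf
        for j in cluster_of:  # only already-clustered components
            if is_reachable(boxes[i], boxes[j]):
                d = centre_distance(boxes[i], boxes[j])
                if d < best_d:
                    best, best_d = j, d
        if best is None:      # nothing reachable: start a new line
            cluster_of[i] = len(clusters)
            clusters.append([i])
        else:                 # join the line of the nearest reachable component
            cluster_of[i] = cluster_of[best]
            clusters[cluster_of[best]].append(i)

    lines = [sorted(c, key=lambda i: boxes[i][2]) for c in clusters]
    lines.sort(key=lambda line: min(boxes[i][0] for i in line))
    return lines


# Two lines whose row ranges overlap (rows 9..12), plus a small modifier (index 6).
boxes = [
    (0, 10, 0, 8), (1, 10, 12, 20), (2, 12, 40, 48),     # line 1
    (11, 21, 0, 8), (12, 22, 12, 20), (9, 20, 40, 48),   # line 2
    (8, 11, 22, 25),                                     # modifier between lines
]
lines = cluster_lines(boxes)
```

In this example every row from 0 to 22 contains ink, so the horizontal profile has no zero valley, yet the clustering still separates the two lines and assigns the small modifier to the line of its nearest reachable component.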

(a). Sample text line and its vertical profile

(b). Extracted words

Fig. 5. Word segmentation
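The word segmentation rule of Section 4.2 can be sketched in Python as below. The sketch is illustrative only. The names `vertical_profile` and `split_words` are hypothetical, the spacing of 4 reported in the paper is assumed to be measured in pixel columns, and a gap of at least that width is assumed to mark a word boundary.

```python
def vertical_profile(img):
    """Foreground pixel count of every column (img is a list of 0/1 rows)."""
    return [sum(column) for column in zip(*img)]


def split_words(img, min_word_gap=4):
    """Return (first_column, last_column) for each word in a text line image.

    A run of zero-valued profile columns at least min_word_gap wide is
    treated as a word space; narrower runs are treated as character spaces.
    """
    profile = vertical_profile(img)
    words = []
    start = None      # first ink column of the current word
    last_ink = None   # most recent ink column seen
    for x, count in enumerate(profile):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_word_gap:
            words.append((start, last_ink))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink))
    return words


# One text line: a 2-column character gap, then a 5-column word gap.
line = [
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
]
words = split_words(line)
```

The threshold is a parameter so that a value derived from the character-space cluster of a particular document can be passed in instead of the fixed value 4.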

(a) Sample text word image and its profile

(b) Word image after the removal of consonant modifiers

(c) Base characters (after removal of vowel modifiers)

(d) Extracted characters

Fig. 6. Character segmentation

5. Results and discussions

The algorithm is implemented in MATLAB. The data sets were collected from the books of the Digital Library of India (http://dli.iiit.ac.in). The algorithm was tested with several document images, some of which contained overlapping lines and characters. Sample test results are shown in Figures 3, 4, and 5. The experiments show that the proposed method segments text documents reliably even when the text is overlapped. Line segmentation accuracy of 100% for good-quality documents and 98% for poor-quality documents, and character segmentation accuracy of 98% for good-quality documents and 95% for poor-quality documents, were achieved. The limitation of this method is that it produces segmentation errors for touching and broken characters.

6. Conclusion and Future work

In this experiment, the proposed algorithm was tested with several document images, some of which contained overlapping lines and characters. Even so, it segmented all the documents in a robust way and gave good results. However, it could not segment touching lines and characters, and broken characters were over-segmented. Segmentation of touching lines and characters may require some heuristic approaches.

References

[1] U. Pal, B.B. Chaudhuri. (2004): Indian script character recognition: a survey, Pattern Recognition, 37, 1887-1899.
[2] B. Anuradha, Arun Agarwal and C. Raghavendra Rao. (2008): An Overview of OCR Research in Indian Scripts, IJCSES, Vol. 2, No. 2.
[3] N. Otsu. (1979): A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-9, No. 1.
[4] Atul Negi, Chakravarthy Bhagvati and B. Krishna. (2001): An OCR System for Telugu, Proceedings of 6th ICDAR, IEEE Computer Society Press.
[5] Richard G. Casey and Eric Lecolinet. (1996): A Survey of methods and strategies in character segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 7.
[6] C. V. Lakshmi, C. Patvardhan. (2004): An optical character recognition system for printed Telugu text, Pattern Analysis & Applications, Volume 7, pp. 190-204.
[7] Agarwal, David Doermann. (2009): Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features, 10th International Conference, ICDAR.
[8] K.S. Sesh Kumar, A. M. Namboodiri, C.V. Jawahar. (2006): Learning Segmentation of Documents with Complex Scripts, Fifth Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338, pp. 749-760.
[9] B.M. Sagar, Dr. G. Shoba, Dr. P. Ramakanth Kumar. (2008): Character Segmentation algorithms for Kannada optical character Recognition, Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition.
[10] R.C. Gonzalez and R.E. Woods. (2004): Digital Image Processing, Pearson Education.
[11] Nitin Bhatia and Vandana. (2010): Survey of Nearest Neighbor Techniques, IJCSIS.
[12] C. V. Lakshmi, C. Patvardhan. (2002): A Multi-font OCR System for printed Telugu Text, Proceedings of LEC'02, IEEE.
[13] Stephen Marchand-Maillet. (1999): Binary Digital Image Processing - A Discrete Approach.
[14] Chakravarthy, Ravi. (2002): On Developing high Accuracy OCR Systems for Telugu and Other Indian Scripts, Proceedings of LEC'02, IEEE.
