
Offline Handwritten Devanagari Word Recognition: A Segmentation Based Approach

Bikash Shaw CVPR Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700108, India. bikash [email protected]

Swapan Kr. Parui CVPR Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700108, India. [email protected]

Abstract

A novel segmentation-based approach is proposed for recognition of offline handwritten Devanagari words. Stroke-based features form the feature vectors. A hidden Markov model is used for recognition at the pseudocharacter level. Word level recognition is done on the basis of a string edit distance.

1 Introduction

Offline handwriting recognition is one of the challenging problems in pattern recognition. The main challenges are the wide variety of handwriting styles, the large variety of pen types, poor image quality and the lack of stroke ordering information. There are two broad approaches to this problem: segmentation-based and segmentation-free. The first approach segments a word image into character or pseudocharacter subimages, recognizes the subimages separately and finally combines the results for word recognition. The second approach treats the whole word image as a single entity and performs recognition without explicit segmentation. The first approach depends heavily on segmentation performance, since poor segmentation contributes to recognition error.

This paper proposes a segmentation-based approach to handwritten Devanagari word recognition. Based on the head line, a word image is segmented into pseudocharacters. A continuous density hidden Markov model (HMM) is proposed to recognize the pseudocharacters. A pseudocharacter is assumed to be a string of several strokes. Each such stroke is assumed to be generated from a stroke primitive. These stroke primitives are in fact the states of the proposed HMM. One HMM is constructed for each pseudocharacter. To classify an unknown pseudocharacter image, its class conditional probability for each HMM is computed. Thus, from a word image, a string of pseudocharacters is obtained. The string edit distance is computed between this string and all the words in the lexicon. The word for which the distance is minimum is the recognized word. The number of word classes here is 100 (Fig. 1).

Malayappan Shridhar University of Michigan-Dearborn, Dearborn, Michigan, 48128, USA. [email protected]

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

2 Devanagari Script

The Devanagari script is written from left to right. The Devanagari alphabet is used for writing the Hindi, Sanskrit, Marathi and Nepali languages. Each Devanagari consonant has an inherent vowel (A). Vowels can be written as independent letters, or by using a variety of diacritical marks which are written above, below, before or after the consonant they belong to. Devanagari has 13 independent vowels, 33 independent consonants and 12 dependent vowel signs (Fig. 2). One characteristic feature of the Devanagari script is a horizontal line on the top of most characters. This line is called the head line, Matra or Shirorekha. A Devanagari word is divided into three zones, namely, the upper zone, the middle zone and the lower zone. The Matra divides the upper and middle zones (Fig. 3).

3 Segmentation

Segmentation is the process of separating individual character or pseudocharacter images from a word image. The large variation in handwriting style and the cursiveness of the script make segmentation quite difficult. Our segmentation procedure assumes that a head line or Matra is present in a word image.

Figure 1. 100 Devanagari words of the lexicon set with their class number.

Figure 2. Devanagari character set.

Figure 3. Zones of Devanagari word.

In printed Devanagari script the segmentation procedure is straightforward. It is illustrated in Fig. 4. First the dominant horizontal line in the word is identified and then it is removed from the word as shown in Fig. 4(b). We then extract the subimages that are separated from their neighbours by vertical white spaces. The subimages obtained are shown in Fig. 4(c). For printed Devanagari script, the Matra is easily identifiable by taking the horizontal projection of the word image. In handwritten script, however, the Matra is not easily identified because of the large variation in writing style (Fig. 5).

Figure 4. Segmentation of printed word.

Figure 5. Variation of Matra shape.

Preprocessing: The image is first smoothed using a median filter and then binarized by Otsu's [2] thresholding method. The binarized image is smoothed again using a median filter. Skew correction is performed during segmentation. No slant, image height or pen-width correction is done here.

Detection of Matra: The morphological opening operator, which is an erosion followed by a dilation, is applied with a horizontal structuring element of size $1 \times 27$ (the size 27 is determined empirically). For example, after opening, the image ($X_0$) in Fig. 6(a) becomes the image ($X_1$) in Fig. 6(b), in which the connected components are studied. Smaller components are removed and larger components indicate the Matra. The threshold on the length of a component is determined empirically. Let the image $X_M$ represent the Matra (Fig. 6(c)).

Figure 6. (a) $X_0$, (b) $X_1$, (c) $X_M$, (d) $X_{MS}$, (e) $X_S$, (f) $X_F$, (g) individual subimages
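The Matra detection step can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows of 0/1 values with background outside the image border; the function name `horizontal_opening` is hypothetical and not taken from the paper. For a flat horizontal line element, opening keeps exactly the horizontal runs of object pixels that are at least as long as the element, so the sketch works on runs directly instead of performing erosion and dilation separately. The subsequent removal of small connected components is not shown.

```python
def horizontal_opening(image, length):
    """Opening of a binary image with a 1 x `length` horizontal structuring element.

    For a flat horizontal line element, erosion followed by dilation keeps
    exactly those horizontal runs of object pixels that are at least
    `length` pixels long, so the runs are processed directly.
    """
    opened = []
    for row in image:
        out = [0] * len(row)
        start = None
        # A trailing 0 sentinel closes a run that reaches the right border.
        for x, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = x                      # a new run of object pixels begins
            elif not pixel and start is not None:
                if x - start >= length:        # keep only sufficiently long runs
                    out[start:x] = [1] * (x - start)
                start = None
        opened.append(out)
    return opened


# Toy example with a short element (the paper uses length 27 on real images).
toy = [
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
matra_candidates = horizontal_opening(toy, 4)
```

In the toy image only the run of five pixels in the first row survives; the shorter runs, which would correspond to character strokes crossing a row, are erased.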

Skew correction of word image: The proposed skew correction scheme is designed for non-uniform skew and assumes that the Matra is piecewise linear. The Matra image $X_M$ is first divided along the horizontal direction into $n$ windows $V_1, V_2, \dots, V_n$ of width $w$. The Matra component in each window is assumed to have uniform skew.

The task now is to estimate the skew for each such Matra component, that is, the angle that each such component makes with the x-axis. For each $V_j$, the pixel positions (i.e., x and y coordinates) of all object pixels are considered; let the scatter matrix of these coordinates be $\Sigma_j$. Let $e_j$ be the eigenvector corresponding to the larger eigenvalue of $\Sigma_j$. Then $e_j = (a_j, b_j)$ indicates the principal axis of $V_j$, and the skew angle is computed as $\theta_j = \tan^{-1}(b_j / a_j)$.

The next task is to rotate $V_j$ by the angle $\theta_j$ in the clockwise direction so that it becomes horizontal (due to the discrete nature of the data, it will only be approximately horizontal). Let $r_j = b_j / a_j$. An integer $K$ is found such that $r_j$ is closest to $1/K$. Each slab of $K$ columns (starting from the left) is adjusted, that is, pushed up or down depending on whether $\theta_j$ is negative or positive. For example, in Fig. 7 there are 3 Matra components and each component has width $w = 9$. The values of $r_1$, $r_2$ and $r_3$ are 0.24, 0.32 and 0.24 respectively and the corresponding $K$-values are 4, 3 and 4 respectively. For the first component, the slab size is 4 and hence there are 3 slabs with widths 4, 4 and 1 from the left. The first slab is kept unchanged, the second slab is pushed down by 1 row and the third slab is pushed down by 2 rows (since the skew angle is positive). (In general, the $i$-th slab is pushed up or down by $i - 1$ rows for a negative or positive skew angle in a component.) In the second component, the slab size is 3 and there are 3 slabs which are pushed down similarly. A similar action is taken on the third component.

At this stage each Matra component is deskewed so that it is at least nearly horizontal. However, the deskewed components may remain at different heights (second row in Fig. 7). For proper alignment, the average row values in each component are computed and the components are pushed up or down so that their average row values are aligned with that of the first component (third row in Fig. 7). Due to the discrete nature of the data, there may be some noise, which is removed by median filtering to obtain the final deskewed Matra (fourth row in Fig. 7). For example, the deskewed Matra ($X_{MS}$) in Fig. 6(d) is obtained from the Matra in Fig. 6(c). Applying the same operations to the image in Fig. 6(a) gives the image ($X_S$) in Fig. 6(e).

Figure 7. Skew correction
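The skew estimate and slab layout described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`principal_slope`, `slab_size`, `slab_layout`) are hypothetical, the scatter matrix is taken as the 2 x 2 covariance of the pixel coordinates, the closest $1/K$ is chosen using the magnitude of $r_j$, and a window with zero cross-covariance is treated as horizontal. The sketch computes the slope of the principal axis and the per-slab row shifts; moving the pixels themselves is left out.

```python
import math


def principal_slope(pixels):
    """Return r = b / a for the principal axis (a, b) of a set of (x, y) pixels."""
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n
    if sxy == 0:
        return 0.0  # assumption: an axis-aligned Matra window is horizontal
    # Larger eigenvalue of the scatter matrix [[sxx, sxy], [sxy, syy]].
    lam = 0.5 * (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))
    # (sxy, lam - sxx) is an eigenvector for lam, so b / a = (lam - sxx) / sxy.
    return (lam - sxx) / sxy


def slab_size(r):
    """Return the integer K for which 1/K is closest to |r| (None if r is 0)."""
    r = abs(r)
    if r == 0:
        return None  # no skew, nothing to shift
    low = max(1, math.floor(1.0 / r))
    high = low + 1
    return low if abs(1.0 / low - r) <= abs(1.0 / high - r) else high


def slab_layout(width, k):
    """Split a window of `width` columns into slabs of k columns.

    Returns (slab_width, row_shift) pairs from the left; the i-th slab
    (1-based) is shifted by i - 1 rows.
    """
    return [(min(k, width - start), i)
            for i, start in enumerate(range(0, width, k))]


# Values from the worked example in the text: w = 9, r = 0.24 and r = 0.32.
first_component = slab_layout(9, slab_size(0.24))
second_component = slab_layout(9, slab_size(0.32))
theta_degrees = math.degrees(math.atan(principal_slope([(0, 0), (4, 1), (8, 2)])))
```

With these inputs the first component is split into slabs of widths 4, 4 and 1 with shifts 0, 1 and 2 rows, which agrees with the example in the text.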

Removal of Matra from word image: The Matra is removed by deleting the object pixels of the skew-corrected Matra $X_{MS}$ from the object pixels of the skew-corrected image $X_S$, resulting in the final image $X_F$ shown in Fig. 6(f). Connected component analysis is performed on $X_F$ to get the individual subimages of pseudocharacters (Fig. 6(g)).
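A minimal sketch of this step follows, assuming binary images stored as lists of rows of 0/1 values. The function names are hypothetical, and 8-connectivity with left-to-right ordering of the components is an assumption; the paper does not state which connectivity is used.

```python
from collections import deque


def remove_matra(word, matra):
    """Clear every object pixel of `word` that is also an object pixel of `matra`."""
    return [[1 if w and not m else 0 for w, m in zip(word_row, matra_row)]
            for word_row, matra_row in zip(word, matra)]


def connected_components(image):
    """Return the 8-connected components as sorted lists of (row, col) pixels.

    Components are ordered from left to right by their leftmost column.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first search from an unvisited object pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(sorted(component))
    components.sort(key=lambda comp: min(col for _, col in comp))
    return components


# Toy word: a head line joining two vertical strokes.
word = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
matra = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
subimages = connected_components(remove_matra(word, matra))
```

Once the head line is removed, the two strokes that it joined fall apart into separate components, which is what makes the pseudocharacter subimages recoverable.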

4 Pseudocharacter Recognition

Feature Extraction: We extract strokes that are either horizontal or vertical using two directional views [3]. Eight scalar features are extracted from each vertical and horizontal stroke. These features represent the shape, size and position of a stroke with respect to the pseudocharacter image. Details of the features extracted from the vertical and horizontal strokes are given in [3].

Figure 8. (a) Preprocessed image (b)-(c) Vertical and horizontal strokes along with input image (d)-(e) Vertical and horizontal strokes after removing small strokes

HMM Classifier: An HMM with state space $S = \{s_1, \dots, s_N\}$ and state sequence $Q = q_1, \dots, q_T$ is defined as $\gamma = (\pi, A, B)$, where the initial state distribution is $\pi = \{\pi_i\}$ with $\pi_i = \mathrm{Prob}(q_1 = s_i)$, the state transition probability distribution is $A = \{a_{ij}(t)\}$ with $a_{ij}(t) = \mathrm{Prob}(q_{t+1} = s_j \mid q_t = s_i)$, and the observation symbol probability distributions are $B = \{b_i\}$, where $b_i(O_t)$ is the distribution for state $i$ and $O_t$ is the observation at instant $t$. The HMM here is non-homogeneous. The problem is how to efficiently compute $P(O \mid \gamma)$, the probability of an observation sequence $O = O_1, \dots, O_T$ given a model $\gamma = (\pi, A, B)$. For a classifier of $m$ classes, we denote the $m$ different HMMs by $\gamma_j$, $j = 1, \dots, m$. Let an unknown input pattern $X$ have an observation sequence $O$. The probability $P(O \mid \gamma_j)$ is computed for each model $\gamma_j$ and $X$ is assigned to the class $c$ whose model gives the highest probability. For a given $\gamma$, $P(O \mid \gamma)$ is computed using the well known forward and backward algorithms [4].

Note that the observation sequence $O = O_1, \dots, O_T$ in our problem is the sequence of feature vectors of the strokes (arranged from left to right) that are present in a handwritten pseudocharacter image, and $T$ is the number of strokes in the image. The states here are certain feature primitives (more specifically, individual 8-dimensional Gaussian distributions in the feature space) that are found using the EM algorithm [3]. The parameters of the HMM are estimated as described in [3]. The HMM parameter estimates are fine-tuned by re-estimation with the Baum-Welch forward-backward algorithm [4].
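The classification rule can be sketched with the forward algorithm. The sketch below makes simplifying assumptions that are not in the paper: the transition matrix is homogeneous (constant over $t$), each state emits from a Gaussian with diagonal covariance, and no scaling is applied, which is tolerable only for short sequences such as the strokes of one pseudocharacter. All function names and the toy models are hypothetical.

```python
import math


def diagonal_gaussian(mean, var):
    """Return the density function of a Gaussian with diagonal covariance."""
    def pdf(x):
        log_p = 0.0
        for xi, mi, vi in zip(x, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        return math.exp(log_p)
    return pdf


def forward_likelihood(pi, A, emit, observations):
    """Compute P(O | model) with the forward algorithm.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    emit[i] : callable giving the emission density of state i
    """
    n = len(pi)
    alpha = [pi[i] * emit[i](observations[0]) for i in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * emit[j](obs)
                 for j in range(n)]
    return sum(alpha)


def classify(models, observations):
    """Return the label of the model with the highest likelihood."""
    return max(models,
               key=lambda label: forward_likelihood(*models[label], observations))


# Two toy single-state models over 1-dimensional features
# (the paper uses 8-dimensional stroke features).
models = {
    "class_a": ([1.0], [[1.0]], [diagonal_gaussian([0.0], [1.0])]),
    "class_b": ([1.0], [[1.0]], [diagonal_gaussian([5.0], [1.0])]),
}
label = classify(models, [[0.2], [0.1]])
```

For longer observation sequences a scaled or log-domain forward pass would be the usual choice to avoid numerical underflow.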

5 Word Level Recognition

Each training set word image is segmented into pseudocharacter subimages, each of which is recognized as a pseudocharacter using the HMM described above. Thus, each training image generates a string of pseudocharacters. All distinct strings of pseudocharacters obtained from the training set form a dictionary $D = \{W_1, W_2, \dots, W_J\}$ of pseudocharacter strings on the basis of the lexicon set of words. Note that each $W_j$ comes from a unique word in the lexicon set. An unknown word image is first segmented into pseudocharacter subimages. These individual subimages are then recognized by the HMM classifier, generating a pseudocharacter string, say, Str. The string edit distance [1] of Str is computed against all the strings in $D$. A dynamic programming approach described in [5] gives an efficient way to find the minimum edit distance between Str and $W_j$, $j = 1, 2, \dots, J$. If the distance is minimum for string $W_j$, then Str is recognized as the word corresponding to $W_j$. To calculate the edit distance, we need the costs of insertion, deletion and substitution in advance; ideally these would be derived from estimates of the corresponding error probabilities. Here, the costs are set empirically. For insertion and deletion of a pseudocharacter, the cost is 1.50. For substitution between a pair of pseudocharacters of similar shape, the cost is 0.80; for any other substitution the cost is 1.0.
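The word level matching can be sketched with the standard dynamic programming recurrence for edit distance, using the costs stated above. The function names, the pseudocharacter labels (`p1`, `p2`, ...), the word labels and the set of shape-similar pairs are hypothetical placeholders; the paper does not list which pairs are treated as similar.

```python
INSERT_DELETE_COST = 1.5          # insertion or deletion of a pseudocharacter
SIMILAR_SUBSTITUTION_COST = 0.8   # substitution within a shape-similar pair
SUBSTITUTION_COST = 1.0           # any other substitution


def substitution_cost(a, b, similar_pairs):
    """Cost of replacing pseudocharacter a by pseudocharacter b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in similar_pairs:
        return SIMILAR_SUBSTITUTION_COST
    return SUBSTITUTION_COST


def edit_distance(source, target, similar_pairs=frozenset()):
    """Weighted edit distance between two pseudocharacter sequences."""
    previous = [j * INSERT_DELETE_COST for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * INSERT_DELETE_COST]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + INSERT_DELETE_COST,                       # delete s
                current[j - 1] + INSERT_DELETE_COST,                    # insert t
                previous[j - 1] + substitution_cost(s, t, similar_pairs),
            ))
        previous = current
    return previous[-1]


def recognize(string, dictionary, similar_pairs=frozenset()):
    """Return the word whose dictionary string is nearest to `string`.

    `dictionary` is a list of (pseudocharacter_string, word) pairs.
    """
    best = min(dictionary,
               key=lambda entry: edit_distance(string, entry[0], similar_pairs))
    return best[1]


# Hypothetical dictionary with two entries.
dictionary = [
    (("p1", "p2", "p3"), "word_A"),
    (("p4", "p5"), "word_B"),
]
word = recognize(("p1", "p2"), dictionary)
```

In this example the query is one insertion away from the first entry (cost 1.5) and two substitutions away from the second (cost 2.0), so the first word is returned.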

6 Experimental Results and Conclusions

There is no reported work on recognition of handwritten Devanagari words except [3], which takes a holistic approach. Here a novel algorithm for Devanagari word segmentation is proposed and an HMM is used for recognition at the pseudocharacter level. A string edit distance algorithm is used for recognizing the word. The proposed scheme has been trained and tested on a recently developed database of handwritten Devanagari word images. No other standard database of handwritten Devanagari word images exists. The training and test databases consist of 22500 and 17200 images respectively of handwritten words of 100 word classes collected from 436 different writers. Our lexicon set is rich in variation in the sense that it covers all individual basic characters (independent vowels and consonants), dependent vowel signs and all commonly occurring compound characters (shown in Fig. 2). Also, the word length in our lexicon covers a wide spectrum (from 2 to 7 characters).

For a number of characters, the handwritten images sometimes get segmented into two pseudocharacters after removal of the Matra. Thus, our set of pseudocharacters consists of such pseudocharacters as well as true characters. The number of pseudocharacters obtained from the training set is 118. So, in the first stage, we have a 118-class problem. One HMM is modeled for each pseudocharacter class. To classify an unknown pseudocharacter image, its class conditional probability for each HMM is computed. The test set accuracy of pseudocharacter level recognition is 81.63%. The string of pseudocharacters thus obtained is used in the second stage to find the best matching $W_j$ in the dictionary $D$ using an edit distance. The recognized word corresponding to $W_j$ is identified from a look-up table. The word level accuracy is 84.31% on the test set. The word level accuracy reported in [3] is 82.89%. The performance of the proposed method may be improved by improving pseudocharacter level accuracy and by obtaining better estimates of the costs of insertion, deletion and substitution.

References

[1] T. Okuda, E. Tanaka, and T. Kasai. A method for the correction of garbled words based on the Levenshtein metric. IEEE Trans. Inform. Theory, 25(1):172–178, February 1976.
[2] N. Otsu. A threshold selection method from gray-level histograms. IEEE Trans. on Systems, Man, and Cybernetics, 9(1):62–66, March 1979.
[3] S. K. Parui and B. Shaw. Offline handwritten Devanagari word recognition: An HMM based approach. Proc. PReMI-2007 (Springer), LNCS-4815:528–535, December 2007.
[4] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.
[5] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, January 1974.