A two-stage approach for segmentation of handwritten Bangla word

A two-stage approach for segmentation of handwritten Bangla word images Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, # Mita Nasipuri , Dipak Kumar Basu Computer Science & Engineering Department, Jadavpur University, Kolkata-700032, India. # Corresponding Author; e-mail: [email protected]

Abstract. Segmentation of handwritten Bangla word images is a challenging problem. Discontinuity or absence of the ‘Matra’, an important feature of Bangla script, may cause a word image to appear inherently segmented into several connected sub-images. Around 55% of these inherently segmented connected sub-images do not require further segmentation. In the present work, we have designed a novel two-stage approach for segmentation of isolated Bangla word images. In the first stage, a feature based approach classifies the connected word segments into one of two classes, namely ‘Segment further’ and ‘Do not Segment’, using a multi-layer perceptron (MLP) based classifier. In the second stage, fuzzy segmentation features are designed to identify the ‘Matra’ region and the potential segmentation points on the ‘Matra’ of the connected word segments that belong to the ‘Segment further’ class. Using the current technique, the overall segmentation accuracy achieved after the two stages is 95.87%.

Keywords: Two-stage segmentation, handwritten Bangla words, multi-layer perceptron, fuzzy segmentation features.

1. Introduction

Segmentation of isolated word images, extracted from optically scanned document images of handwritten text, is one of the major problems of optical character recognition (OCR). Segmentation and identification of word components make a decisive contribution towards the overall performance of an OCR system. The better the segmentation process, the less ambiguity is encountered in recognition of candidate characters or word pieces.

Vertical pixel density histograms of word images are often used for segmenting the word images into constituent characters. This is done by identifying the valleys of the histograms as the terminal points of the characters. The technique has applications in OCR of English text, but it is not effective for segmenting words of Bangla script. Consecutive characters that overlap in column positions make Bangla word segmentation more complex than segmentation of English words. The problem is compounded for handwritten Bangla words because of variation in sizes and shapes of handwritten characters.

Bangla is an important South Asian script widely used in India and Bangladesh. Popularity wise, Bangla ranks fifth in the world, both as a script and a language. Significant contributions made so far for OCR of handwritten texts mostly concern English texts [1-6]. A contour based segmentation technique [5] and a Bayesian knowledge based SVM [6] were designed for segmentation of handwritten English word images. Work relating to OCR of Bangla script has limited references in the literature. Two such instances, [7] and [8], the former focusing on a multistage approach based on different topological features and the latter on recognition of isolated handwritten characters based on stroke features, have not addressed the problem of Bangla text segmentation.

The problem of Bangla text segmentation has been addressed in [9-13]. The technique of word segmentation described in [9] has shown a high success rate by properly segmenting nearly 98.6% of characters of printed text. The technique is based on detection of an important feature of Bangla text, called the Matra. A Matra is a horizontal line that touches the upper part of many characters of Bangla script, as shown in Fig. 1(a). Depending on the character, it covers at most the entire character width. The consecutive characters in a Bangla word that have Matras are joined through a common Matra, formed by joining the Matras of the individual characters, as shown in Fig. 1(b). This line may have some discontinuity over the positions where the characters in the word appear without Matras. The said technique [9] mainly works by identifying and removing the Matra. It will not be effective for handwritten text, where the Matras are not as strictly horizontal as those of printed words.

Another work, related to segmentation of touching characters in printed Bangla and Devnagri text, is presented in [12]. A. Bishnu et al. developed a recursive contour following technique [10] for segmentation of handwritten Bangla word images. In another work, U. Pal et al. used the water reservoir principle [11] for the same purpose. Prior to the present work, a fuzzy technique [13] was developed by the present authors for segmentation of handwritten Bangla word images. However, due to the unavailability of standard Bangla word datasets, the performances of these works cannot be compared. In the light of the above discussion, segmentation of Bangla words remains an active area of research.

(a) An illustration of Matras of individual characters in a word
(b) An illustration of the common Matra of a word
(c) An illustration of the three zones and region boundaries of a word
Fig. 1(a-c). Illustration of some important features of Bangla script

Most of the aforesaid techniques for segmentation of handwritten Bangla word images depend heavily on the presence of the Matra feature. But in reality, due to the variety of writing styles of individuals, the Matra in a word image often appears wavy or non-horizontal, discontinuous, or even completely missing in continuous writing. These wavy, discontinuous or missing Matras make the problem of word segmentation more difficult. The existing segmentation algorithms, as reported in [9-13], often identify potential segmentation points on the common Matra of the word image. However, the performance of such techniques may degrade significantly in case of a discontinuous or missing Matra in the word images. Fig. 2 shows a sample word image with discontinuous Matras, where the technique described in [13] fails to identify potential segmentation points on the Matra. In such cases, the word image appears internally segmented into a number of sub-images or segments, each containing a collection of connected (4-connected or 8-connected) black pixels, as shown in Fig. 2. Such sub-images may contain one or more characters, modified shapes of characters or their subparts. Some or all of these sub-images may require further segmentation for extracting individual characters or modified shapes. In a typical survey it has been found that only around 18% of handwritten Bangla words are written as a single connected segment. Around 74% of the words generate 2-4 connected segments and around 8% of the word images generate more than 4 connected segments.

Fig. 2. Internal segmentation within a handwritten Bangla word image.

The major motivation behind the present work is to identify only those connected segments from handwritten Bangla word images that need to be segmented further and subsequently segment them using a fuzzy feature based segmentation algorithm. The segments which need no further segmentation may be left alone, as an input to the recognition module for subsequent processing.

2. The present work

Choice of suitable features for pattern classes is a domain specific design issue. In the first stage of the present work we have designed a set of features that can classify the noise-free connected segments into one of two classes, namely “Segment Further” and “Do not Segment”, using an MLP classifier. In the second stage we have designed a non-linear fuzzy feature set to identify the Matra of the word segments and subsequently segment them into potential word components. The following sub-sections briefly discuss the methodology involved in the present work.

2.1. Preprocessing of word images

Preprocessing is an important task in document image processing. In the present work, we have used several metrics computed from spatial attributes of pixels of the binary image. Noise pixels appearing in the background and along the contour of the word image may therefore affect the segmentation accuracy. To remove noisy pixels and to smooth the contours, we have applied a sequence of erosion and dilation, two basic mathematical morphological operators [14], to the input handwritten word images.
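The erosion-then-dilation sequence described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the structuring element, so a 3x3 square is assumed here, the image is a list of rows with 1 for black and 0 for white, and pixels outside the image are treated as white.

```python
def erode(img):
    """3x3 binary erosion: a pixel stays black only if it and all 8
    neighbours are black (out-of-image neighbours count as white)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(all(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
    return out


def dilate(img):
    """3x3 binary dilation: a pixel becomes black if it or any of its
    8 neighbours is black."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(any(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
    return out


def remove_noise(img):
    """Erosion followed by dilation (a morphological opening)."""
    return dilate(erode(img))
```

With a 3x3 element, isolated specks disappear while strokes at least three pixels thick survive. Thinner strokes would be erased as well, so the element size would have to be matched to the scanning resolution.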

2.2. Connected component analysis

In the present work we have used a simple technique for identifying the connected segments within the word image. To label all connected pixels in a word image identically, the connected component labeling algorithm [14] scans the image pixel by pixel from left to right and from top to bottom. During scanning, it considers all 8 neighbours of each pixel. For each connected segment, all its member pixels appearing in the sub-image are replaced by a single distinct symbol. This completes the labelling of the connected pixels in the image and generates uniquely coded connected segments. Each such connected segment is subsequently extracted for segment analysis.
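A minimal sketch of 8-connected labelling is given below. It is not the raster-scan algorithm of [14]: for brevity it uses a breadth-first flood fill, which produces the same grouping of pixels. The function names and the bounding-box helper are illustrative assumptions.

```python
from collections import deque


def label_components(img):
    """Label 8-connected black regions of a binary image (1 = black).
    Returns (labels, count); labels holds 0 for background pixels."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):              # top to bottom
        for c in range(w):          # left to right
            if img[r][c] and not labels[r][c]:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count


def bounding_boxes(labels):
    """Map each label to its inclusive box (top, left, bottom, right)."""
    boxes = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab:
                t, l, b, rt = boxes.get(lab, (r, c, r, c))
                boxes[lab] = (min(t, r), min(l, c), max(b, r), max(rt, c))
    return boxes
```

Each bounding box can then be used to crop one connected segment out of the word image for the feature computation of the next sub-section.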

2.3. Design of the segment classification features

For identifying segments that require further segmentation, we have designed a 7-element feature vector based on the morphological attributes of different connected segments. Table 1 shows brief descriptions of the features selected for the present work. All the feature values are normalized within the range 0 to 1 using the following formula.

Normalized feature value = (actual feature value / maximum possible feature value)

The height/width ratio of a connected segment gives a rough idea about the possible structure of the sub-image. The maximum value this feature can take is the height of the word image. The lower this ratio, the higher the chance that the sub-image needs to be segmented further.

Table 1. Features used for classification of the word segments

Sl. No.   Description of the feature
1         The height/width ratio of each connected segment
2         Width of each connected segment
3         Proportion of black pixels in each connected segment
4         Offset in number of rows of each connected segment from the starting row of the original word image
5         Maximum horizontalness of black pixels in each connected segment
6         Count of Matra pixels, as discussed in [13], for each connected segment
7         Count of segmentation pixels, as discussed in [13], for each connected segment

The width of a connected segment, alone, is also an important attribute for segment classification. Often, due to the presence of ascendants and descendants within a word image, the height/width ratio may give misleading information. In such cases the width can help classify the segment correctly. The wider a connected segment, the higher its chance of belonging to the “Segment Further” class. The proportion of black pixels within the sub-image of given size is also used as a feature, to estimate the length of pen strokes within it. The higher this proportion within the connected segment, the higher its chance of needing further segmentation.
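The first three features and their normalization can be sketched as follows. The height/width ratio is divided by the word height, as stated in the text. The choice of word width as the maximum possible segment width is an assumption made for this illustration, as is the function name.

```python
def geometric_features(segment, word_height, word_width):
    """Return normalized (height/width ratio, width, black proportion)
    for a connected segment cropped to its bounding box.

    segment is a list of rows holding 1 (black) or 0 (white).
    """
    height = len(segment)
    width = len(segment[0])
    black = sum(sum(row) for row in segment)

    # Feature 1: the ratio is largest for a one-pixel-wide segment that
    # spans the full word height, so its maximum is word_height.
    ratio = (height / width) / word_height
    # Feature 2: assumed maximum is the width of the whole word image.
    norm_width = width / word_width
    # Feature 3: already a fraction of the bounding-box area.
    proportion = black / (height * width)
    return ratio, norm_width, proportion
```

For a 2x4 segment holding 6 black pixels inside a 10x20 word image, this gives a ratio of 0.05, a width of 0.2 and a black-pixel proportion of 0.75.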

As shown in Fig. 1(c), any handwritten Bangla word image may be hypothetically divided into three horizontal zones, namely the upper zone containing the ascendants, the lower zone containing the descendants and the middle zone containing most of the characters, modified shapes of characters and their sub-parts. Connected segments generated close to the middle zone have a higher chance of belonging to the “Segment Further” class. On the contrary, connected segments generated close to the upper and lower zones are more likely to be in the “Do not Segment” class. To approximate this observation, we have computed the offset between the starting row of each connected segment and the starting row of the original word image. This feature gives an estimate of which of the three aforementioned zones the segment belongs to.

Segmentation of any sub-image depends significantly on the Matra feature. As mentioned earlier, the Matra within a word image appears to be horizontal in nature. Any connected segment with a significant presence of Matra may be segmented further into component characters, modified shapes or their sub-parts. To estimate the presence of Matra pixels within a sub-image, a horizontal longest-run count of each pixel, as discussed in [13], is computed within the sub-image. The maximum value of this longest-run count is used as a feature after suitable normalization.

Any connected segment may be segmented further using the existing Matra based segmentation algorithm, as discussed in [13]. In doing so, each segment may generate a set of approximate Matra pixels and some potential segmentation points on the Matra. A connected segment that needs further segmentation yields more Matra pixels and more potential segmentation points on the Matra than a segment that requires no further segmentation. In the present work, we have used the counts of Matra pixels and of potential segmentation points as features, with suitable normalization. Fig. 3 shows a sample word image and the two possible categories of connected word segments in it.

Fig. 3. A sample word image and its two classes of connected segments
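The row-offset and maximum-horizontalness features (features 4 and 5) might be computed as in the sketch below. The normalizers are assumptions, since the text only says "suitable normalization": the offset is divided by the word height and the longest run by the segment width.

```python
def longest_horizontal_run(segment):
    """Length of the longest unbroken horizontal run of black pixels."""
    best = 0
    for row in segment:
        run = 0
        for px in row:
            run = run + 1 if px else 0
            best = max(best, run)
    return best


def zone_and_matra_features(segment, segment_top_row, word_top_row, word_height):
    """Return normalized (row offset, maximum horizontalness)."""
    # Feature 4: how far below the top of the word the segment starts.
    offset = (segment_top_row - word_top_row) / word_height
    # Feature 5: a long horizontal run suggests the presence of a Matra.
    horizontalness = longest_horizontal_run(segment) / len(segment[0])
    return offset, horizontalness
```

A segment whose longest run covers most of its width scores close to 1 on the second value, which is the cue for a significant Matra described above.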

2.4. Design of the MLP classifier in first stage

In the present work, an MLP classifier is used to classify each connected word segment generated from the word image into one of the two output classes, that is, to decide whether the given sub-image needs to be segmented further or not, using the above mentioned feature set. The MLP classifier designed for this work is trained with the Back Propagation (BP) algorithm. It minimizes the sum of the squared errors for the training samples by conducting a gradient descent search in the weight space. The number of neurons in its hidden layer is also adjusted during training.
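A toy version of such a classifier is sketched below, for illustration only. It is not the authors' implementation: the class name, weight initialization, sigmoid activations and per-sample update schedule are all assumptions. The default learning rate 0.8 and momentum 0.7 are the values reported in the results section.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, one sigmoid output (hypothetical helper class)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.dw_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        self.dw_out = [0.0] * (n_hidden + 1)

    def _forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb)))
              for row in self.w_hidden] + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hb)))
        return xb, hb, out

    def predict(self, x):
        return self._forward(x)[2]

    def train_step(self, x, target, lr, momentum):
        """One gradient-descent step on the error 0.5 * (out - target)^2."""
        xb, hb, out = self._forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas are computed before the output weights change.
        delta_hidden = [delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w_hidden))]
        for j in range(len(self.w_out)):
            self.dw_out[j] = momentum * self.dw_out[j] - lr * delta_out * hb[j]
            self.w_out[j] += self.dw_out[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                self.dw_hidden[j][i] = (momentum * self.dw_hidden[j][i]
                                        - lr * delta_hidden[j] * xb[i])
                row[i] += self.dw_hidden[j][i]


def sum_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data)


def train(net, data, epochs=300, lr=0.8, momentum=0.7):
    for _ in range(epochs):
        for x, t in data:
            net.train_step(x, t, lr, momentum)
    return net
```

In use, each training sample would be the 7-element feature vector of a segment with target 1 for “Segment Further” and 0 for “Do not Segment”. An output above 0.5 would then be read as “Segment Further”.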

2.5. Fuzzy headline estimation in second stage

The common headline or Matra of a connected word segment may be identified as the continuous horizontal stripe of black pixels appearing at the top of most of the characters and some of the modified shapes in the word segment. In cursive handwriting the Matra often appears disjoint and wavy. This makes the identification of potential Matra pixels a challenging task. In the present work, we have developed two fuzzy measures to identify the membership value of each pixel, that is, its potential of belonging to the Matra.

2.5.1. Horizontalness feature

The horizontalness property of the Matra may be extracted from the row wise sum of continuous runs of black pixels, as shown in Fig. 4. This value is normalized with respect to the maximum longest run value of any pixel within the word image.

Fig. 4. Word images and the corresponding horizontal longest run components that exceed the mean horizontalness of the respective words.

2.5.2. Verticalness feature

Many characters and modified shapes in Bangla script have a vertical stripe of black pixels as a part of their shapes. This vertical stripe often appears at the right side, middle or left side of the characters. These stripes touch the Matra of a word image and often extend to the bottom of the respective characters or modified shapes. In the present work, we have developed a technique to identify prominent vertical stripes in a word image and to identify their average top and bottom rows within the principal segments. This verticalness property may be extracted from the column wise count of continuous runs of black pixels, as shown in Fig. 5. This value is normalized with respect to the maximum longest run value of any pixel within the word image.

Fig. 5. Word images and the corresponding vertical longest run components that exceed the mean verticalness of the respective words.

2.5.3. Design of the fuzzy membership function

In the present work, we have designed a bell shaped membership function to map the horizontalness feature values of each row and so determine its belongingness to the Matra region. The generalized bell function depends on three parameters a, b, and c as given by:

f(x; a, b, c) = 1 / (1 + |(x - c) / a|^(2b))

where the parameter b is usually positive. The parameter c locates the center of the curve, i.e., R2, and x is the row index of any black pixel P(x, y) in the word image. For computation of the fuzzy feature values, we have designed a fuzzy function fh(xh, f(x;a,b,c)) for the horizontalness feature, such that

fh(xh, f(x;a,b,c)) = xh * f(x;a,b,c)

where xh is the normalized horizontalness component of the pixel P(x, y) under consideration and 0 ≤ xh ≤ 1. Fig. 6 shows a diagrammatic representation of the bell shaped fuzzy membership function designed for the present work. A pixel P(x, y) is identified as a headline pixel if its value exceeds the mean of all such fh(P(x, y)) values within the region R1-R3.

Fig. 6. Fuzzy bell-shaped membership function for Matra determination.
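The fuzzy headline estimation of Section 2.5 can be sketched as below, using the standard generalized bell function f(x; a, b, c) = 1 / (1 + |(x - c)/a|^(2b)). Several details are assumptions made for this illustration: the horizontalness of a pixel is taken as the length of the horizontal run containing it, and the region rows R1, R2 and R3 are supplied by the caller rather than estimated from the image.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function centred at c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def horizontal_run_lengths(img):
    """For every black pixel, the length of the horizontal run it lies in."""
    out = [[0] * len(row) for row in img]
    for r, row in enumerate(img):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                for k in range(start, c):
                    out[r][k] = c - start
            else:
                c += 1
    return out


def headline_pixels(img, r1, r2, r3, a=1.0, b=1.0):
    """Return the set of (row, col) pixels whose fuzzy value
    fh = xh * f(row; a, b, r2) exceeds the mean fh over rows r1..r3."""
    runs = horizontal_run_lengths(img)
    longest = max(max(row) for row in runs) or 1   # avoid division by zero
    scores = {}
    for r in range(r1, r3 + 1):
        for c, px in enumerate(img[r]):
            if px:
                xh = runs[r][c] / longest          # normalized horizontalness
                scores[(r, c)] = xh * gbell(r, a, b, r2)
    if not scores:
        return set()
    mean = sum(scores.values()) / len(scores)
    return {p for p, s in scores.items() if s > mean}
```

With a = b = 1 the membership halves one row away from R2 (gbell(R2 + 1, 1, 1, R2) is 0.5), so only long runs lying close to R2 are kept as headline pixels.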

2.6. Design of fuzzy segmentation features

Once the black pixels constituting the Matra of a word segment are identified, the next task is to identify certain column positions on the Matra from where the word segment can be vertically segmented into constituent characters. Such column positions are called terminal points of segments. One of the prominent features (F1) for identifying terminal points of segments is the number of black pixels along each vertical column position on the Matra. The smaller the number of black pixels along a vertical column position on the Matra, the higher its degree of belongingness (µ1) to the set of segment terminal points. On this basis a bell-shaped fuzzy membership function (µ1), as discussed in the previous section, is designed. Another feature (F2) is considered within the region (R2 - R3). Here again, the greater the distance, the lower the degree of belongingness (µ2) of the associated point to the set of segment terminal points. A third feature (F3), similar to F2, is obtained by extending the region (R2 - R3), previously considered for computing F2, to (R2 - R4). A detailed description of these three features is given in [13]. The membership functions (µ1, µ2, µ3) for these features are shown in Fig. 7.

Fig. 7. Fuzzy membership functions µ1, µ2 and µ3

To decide finally whether a black pixel on the Matra can be considered a segment terminal point, the average of its three feature values is computed. The pixels for which this average exceeds a certain threshold are considered segment terminal points. The threshold is fixed as the average of all three feature values over all the black pixel positions on the Matra of the word segment.
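The thresholding rule can be sketched as follows. The membership values µ1, µ2 and µ3 are assumed to have been computed already for each Matra column. The dictionary layout and function name are illustrative choices, not taken from the paper.

```python
def terminal_columns(memberships):
    """Select segment terminal points on the Matra.

    memberships maps a column index to its (mu1, mu2, mu3) values.
    A column is kept when the average of its three values exceeds the
    mean of those averages over all Matra columns.
    """
    averages = {col: sum(mus) / len(mus) for col, mus in memberships.items()}
    if not averages:
        return []
    threshold = sum(averages.values()) / len(averages)
    return sorted(col for col, avg in averages.items() if avg > threshold)
```

Because the threshold is derived from the word segment itself, no absolute cut-off has to be tuned. The rule simply keeps the columns that look more like terminal points than the typical Matra column of that segment.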

3. Results and discussion

In the present work, we have collected isolated handwritten Bangla word images from different persons of varying age groups. Word images are assumed to be slant and slope corrected and written in black ink with uniform pressure. Each such image is digitized using a flatbed scanner with 300 dpi resolution. 250 such word images were randomly selected for the current experimentation. As discussed earlier, around 82% of such word images generate more than one connected segment after the connected component analysis, and around 52% of such connected segments need no further segmentation.

In the first stage, to classify the connected segments into one of the two classes, namely “Segment Further” and “Do not Segment”, an MLP based classifier is designed with the Back Propagation learning algorithm. For preparation of the training and the test sets, a collection of 600 such connected segments of Bangla word images is formed by taking 300 segments from each of the two classes. For cross validation of results, three folds of test sets are formed by dividing the original dataset of 600 samples into three equal, mutually disjoint parts. For each fold of the test set, the corresponding training set is formed from the rest of the dataset. Thus three pairs of test and training sets are formed for three-fold cross validation of results. In each of these pairs, the training and the test sets contain 400 and 200 samples respectively.

For the present work, an MLP with one hidden layer is chosen. This keeps the computational requirement low without affecting the function approximation capability of the network. According to the Universal Approximation theorem [15], a single hidden layer is sufficient to compute a uniform approximation to a given training set. To design an MLP for classification of the connected word segments, several runs of the BP algorithm with learning rate (η) = 0.8 and momentum term (α) = 0.7 are executed for different numbers of neurons in its hidden layer. The maximum recognition performances of the MLP, as achieved through three-fold cross validation, are 94.5%, 93% and 94.5%. The average success rate of these three sets of experiments is 94%. Fig. 8(a-c) shows the images of some test samples successfully classified through this experimentation. Fig. 9 shows some of the images where our technique fails to classify the segments into the desired classes.
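The three-fold arrangement and the averaging of the fold accuracies can be illustrated as below. The sketch splits the samples in their given order. Whether the original folds were shuffled or balanced by class is not stated, so that aspect is left out.

```python
def three_fold_splits(samples, k=3):
    """Split samples into k equal, disjoint folds and return a list of
    (training set, test set) pairs, one per fold."""
    fold_size = len(samples) // k
    folds = [samples[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        pairs.append((train, test))
    return pairs


def average_accuracy(fold_accuracies):
    return sum(fold_accuracies) / len(fold_accuracies)
```

For 600 samples this yields three pairs of 400 training and 200 test samples, and average_accuracy([94.5, 93.0, 94.5]) reproduces the reported average of 94%.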

Fig. 8(a-c). Some of the correctly classified test samples: (a) successfully classified into the “Do not Segment” class; (b-c) successfully classified into the “Segment Further” class

Fig. 9. Some sample test images misclassified into “Segment Further” class

In the second stage, for designing the fuzzy function, the values of the two positive constants a and b were both chosen as 1. As discussed earlier, the row index of the lower boundary of the upper zone (R2) is assigned to the third constant c in the said fuzzy function. Fig. 10(a-b) shows some sample word images where the segment classification algorithm of the first stage is not applied, resulting in incorrect segmentation in different parts of the word image. However, using the current two-stage algorithm, connected word segments belonging to the ‘Do not Segment’ class are successfully extracted and only the segments that need further segmentation are segmented, as shown in Fig. 11(a-b).

Fig. 10(a-b). Sample word images with segmentation errors shown in encircled regions.

Fig. 11(a-b). Sample word images with successfully classified word segments and subsequent segmentation result.

After two stages, the overall segmentation accuracy, as observed manually with respect to potential segmentation points on each word image of the 250 word dataset, is evaluated as 95.87%. When only the second stage segmentation algorithm is employed on the same dataset, the segmentation accuracy is 91.86%. Thus we could significantly improve over the conventional single stage approach to identifying potential segmentation points in handwritten Bangla word images. This technique may significantly reduce the cases of under-segmentation. However, there is further scope for improvement. An iterative implementation of the present technique, along with the existing segmentation algorithm, may further improve the overall segmentation performance on handwritten Bangla word images in future.

Acknowledgements

Authors are thankful to the “Center for Microprocessor Application for Training Education and Research” and the “Project on Storage Retrieval and Understanding of Video for Multimedia” of the Computer Science & Engineering Department, Jadavpur University, for providing infrastructure facilities during progress of the work.

References

1. R.G. Casey et al., “A Survey of Methods and Strategies in Character Segmentation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, pp. 690-706, 1996.
2. R.M. Bozinovic et al., “Off-line Cursive Script Word Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 68-83, 1989.
3. J.T. Favata, “Offline General Handwritten Word Recognition Using an Approximate BEAM Matching Algorithm,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 1009-1021, 2001.
4. A.W. Senior et al., “An Off-line Cursive Handwriting Recognition System,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, pp. 309-321, 1998.
5. B. Verma, “A contour code feature based segmentation for handwriting recognition,” in Proc. 7th ICDAR, pp. 1203-1207.
6. M. Maragoudakis et al., “Improving handwritten character segmentation by incorporating Bayesian knowledge with support vector machines,” in Proc. ICASSP’2002, vol. 4, pp. IV-4174.
7. A.F.R. Rahman, R. Rahman, M.C. Fairhurst, “Recognition of Handwritten Bengali Characters: a Novel Multistage Approach,” Pattern Recognition, vol. 35, pp. 997-1006, 2002.
8. T.K. Bhowmik, U. Bhattacharya and S.K. Parui, “Recognition of Bangla Handwritten Characters Using an MLP Classifier Based on Stroke Features,” in Proc. ICONIP, Kolkata, India, pp. 814-819, 2004.
9. B.B. Chaudhuri and U. Pal, “A Complete Printed Bangla OCR System,” Pattern Recognition, vol. 31, no. 5, pp. 531-549, 1998.
10. A. Bishnu, B.B. Chaudhuri, “Segmentation of Bangla Handwritten Text into Characters by Recursive Contour Following,” in Proc. 5th ICDAR, pp. 402-405, 1999.
11. U. Pal, S. Datta, “Segmentation of Bangla Unconstrained Handwritten text,” in Proc. 7th ICDAR, pp. 1128-1132, 2003.
12. U. Garain, B.B. Chaudhuri, “Segmentation of touching characters in printed Devnagri and Bangla scripts using fuzzy multifactorial analysis,” IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 22, pp. 449-459, 2002.
13. S. Basu, R. Sarkar, N. Das, M. Kundu, M. Nasipuri, D.K. Basu, “A Fuzzy Technique for Segmentation of Handwritten Bangla Word Images,” in Proc. International Conference on Computing: Theory and Applications (ICCTA'07), pp. 427-433, 2007.
14. R.C. Gonzalez and R.E. Woods, Digital Image Processing, First Edition, Prentice-Hall India, 1992.
15. S. Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Pearson Education Asia, pp. 208-209, 2001.