4th Indian International Conference on Artificial Intelligence (IICAI-09)

Handwritten Bangla Compound character recognition: Potential challenges and probable solution

Nibaran Das, Subhadip Basu, Ram Sarkar, Mahantapas Kundu, Mita Nasipuri

Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India
[email protected]

Abstract.

A novel technique is presented here for recognition of handwritten compound characters of the Bangla alphabet. It advocates incrementally expanding the number of learned character classes from the more frequently occurring to the less frequently occurring ones. The work is preceded by a survey to find the frequencies of occurrence of all Bangla characters in standard literature. One important finding of the survey is that, on average, only 4.27 percent of the characters in a standard text piece are compound characters. Out of the 160 compound character classes, characters of 55 classes constitute 90 percent of the compound characters occurring on average in a standard text piece. For the time being, only handwritten characters from these classes are considered here. The average recognition rate observed in this work is 84.67 percent after 3-fold cross validation of results. It is comparable with the performance reported in a related work [3]. The work presented here can be considered an important step towards the development of OCR for handwritten Bangla characters, including complex shaped compound characters.

Key Words. Bangla compound character, OCR, Quad-tree structure, Longest Run, Cross Validation.

1. Introduction

Character recognition is the most fundamental problem of Optical Character Recognition (OCR) related research. The problem of character recognition varies from script to script, and its complexity grows from printed to handwritten characters. OCR research has a long and rich history of nearly half a century. Still, its success has largely been limited to Roman script, used for English and some other European languages, and to scripts of Asian languages like Chinese, Korean, and Japanese.


Some important Asian scripts, like Devanagari, Tamil, Oriya and Bangla, which are mainly in use in India and neighbouring countries, have so far received little attention in OCR research. Of these, Bangla is an important script. Like Roman script, it is used to write a number of languages, such as Bangla, Assamese and Manipuri. Bangla is the national language of Bangladesh, the second most popular language in India after Hindi, and the fifth most popular language in the world. Besides its popularity, Bangla script has a very rich and complex alphabet of more than 200 characters in all. In addition to 50 Basic characters, Bangla script has 160 Compound characters. Compound characters have no parallel in Roman script. A compound character is a complex shaped character which consists of two or more basic characters, pronounced simultaneously in words of the Bangla language. With all the richness of the Bangla alphabet, the problem of Bangla character recognition gets compounded with unconstrained handwritten characters.

Despite the complexity of the problem of handwritten Bangla character recognition and the popularity of Bangla script, instances of research on OCR of handwritten Bangla characters, as observed in the literature, are few in number [1,2,4,5]. Moreover, all these research efforts are directed at Basic characters only. In the recent past, U. Pal et al. [3] have attempted to recognize handwritten compound characters of 138 classes using gradient features.

The present work is motivated by a survey conducted by the authors of this paper. The survey reveals that, in a standard text piece, 95.63 percent of characters are Basic characters and the remaining 4.27 percent are compound characters on average. Text samples from present day popular Bangla newspapers like Anandabazar, Bartaman and AajKaal, and Bangla magazines like Anandamela and Desh, were collected for this purpose.
Another important observation made from this survey is that 90 percent of the compound characters occurring in the text samples belong to only 55 classes of compound characters. The above observations have motivated us to develop an incremental approach to the recognition problem of handwritten compound characters of the Bangla alphabet. As a first step of this strategy, we propose here a scheme to recognize Bangla compound characters of 55 classes, which cover 90 percent of the compound characters occurring in popular text pieces. The scheme can be gradually extended to include newer classes of compound characters, from the more frequently occurring to the less frequently occurring ones.

At the time of implementation of this scheme, a short cut is taken after observing high substitution rates among certain character classes in the confusion matrix prepared with the training data. These substitutions are due to the limitation of the feature set used here. Instead of putting effort into further enhancement of the present feature set, we have decided to club classes with mutually high substitution rates into single groups and to defer their finer classification to a second pass of the classification process, as described in the two pass approach to pattern classification [8]. After making this short cut, 43 classes of compound characters, some of which cover more than one character class, are finally considered in this work.

By the above approach, we initially concentrate our effort on recognizing the compound characters that constitute 90 percent of all the compound characters in a standard text piece. The character classes of the remaining 10 percent of the compound characters may be clubbed together into a single class until the present scheme is extended to include any character class from the clubbed one. Since ours is an incremental approach, the clubbed character classes will gradually merge with the learned classes of the present scheme.
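The selection step behind this incremental strategy can be illustrated with a minimal sketch: classes are sorted by frequency of occurrence and added until a target share of all occurrences is covered. The function name `classes_for_coverage` and the toy counts are hypothetical and are not taken from the survey; only the 90 percent target comes from the text.

```python
from collections import Counter

def classes_for_coverage(counts, target=0.90):
    """Return the most frequent classes whose cumulative share of
    occurrences first reaches `target`, plus the share they cover."""
    total = sum(counts.values())
    selected, covered = [], 0
    # Visit classes from the most to the least frequent one.
    for label, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if covered >= target * total:
            break
        selected.append(label)
        covered += n
    return selected, covered / total

# Hypothetical counts for five made-up classes, not survey data.
toy_counts = Counter({"c1": 500, "c2": 300, "c3": 120, "c4": 50, "c5": 30})
chosen, share = classes_for_coverage(toy_counts)
print(chosen, share)
```

Extending the scheme later amounts to raising `target` and learning the classes that are newly selected.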

2. A Brief overview of Bangla Script

The following are the major features of Bangla script.
• The script of Bangla language is known as Bangla/Bengali Script.
• The script is used to write Bangla, Assamese and Manipuri Languages.
• It is derived from the ancient Brahmi script through various transformations.
• The writing style is horizontal and left to right. The concept of upper or lower case is absent.
• There are about fifty Basic characters in Bangla script, among which eleven are vowels and thirty nine are consonants.
• A vowel occurring after a consonant in a Bangla word takes a modified shape, called the Modifier.
• Consonant modifiers are also possible.
• Two or more consonant characters combine to form Compound characters, which partly retain the shapes of the constituent characters.
• In general, two hundred sixty Compound characters are present in the literature. But, according to ‘Barnaparichaya’, there are 194 Bangla compound characters. This variation occurs due to change of shape and simplification of spellings over time.

It has already been mentioned that, compared to the number of Basic characters, the number of compound characters is much higher in the Bangla alphabet. As two or more basic characters are combined to form a compound character, compound characters are more complex in shape than the Basic characters. Some pairs of compound characters resemble each other so closely that only a period or a small line distinguishes them.

In the year 1997, West Bengal Bangla Academy introduced new shapes/glyphs to represent Bangla Compound characters through their Bangla dictionary “Academy Banan Abhidhan”. The objective was to simplify the complex shapes for easier understanding of the compound characters. Nowadays, in India, these glyphs for compound characters are mainly followed in Bangla books. In newspapers, however, the recommended shapes are not followed, as common people are not yet accustomed to them.

2.1. A survey on Bangla compound characters

Bangla compound characters have evolved into their present shapes and structures with time. Some compound characters, shown in Fig. 1, have become obsolete at present, and some have been modified in shape and structure. So, before recognizing compound characters, it has to be decided which compound characters, with what kind of shapes and structures, are to be recognized, and how important they are, i.e., how frequently they occur in standard text. For this, a survey is conducted under the present work with samples of alphabetic characters collected from three popular Bangla dailies, Ananda-Bazar Patrika, Aajkaal and Bartaman, and two popular magazines, Anandamela and Desh. Around 40 million words are randomly selected for this purpose. It has been observed from the survey that 260 compound characters are in use at present along with 50 Basic characters. In a standard piece of text, 95.63 percent of characters are found to be Basic characters and the remaining 4.27 percent are compound characters.

Out of the compound characters surveyed here, those formed with Ya-Phala do not change the shapes of the associated characters, as the Ya-Phala appears separately, with a distinct shape and structure, after the associated character. So the compound characters formed with Ya-Phala need not be given separate attention and may be excluded from the list of 260 compound characters identified from the survey. Likewise, all the compound characters formed with R-Ph may also be excluded from the list, as the R-Ph, which appears separately at the top of the associated character like a modifier, can be recognized as a separate character. As a result of these considerations, the total number of compound characters surveyed under this work stands at 160.

Fig 1. Some obsolete characters which are not used in current Bangla Text

3. The Past work

It has already been mentioned that instances of work on OCR of handwritten Bangla characters [1,2,4,5] are few in the literature. Besides that, only one instance of work on OCR of handwritten compound characters [3] has come to our notice. In that work, Pal et al. have used the Modified Quadratic Discriminant Function (MQDF) as the classifier. The directional information obtained from the arc tangent of the gradient is used there to form the feature set for the characters. Using 5-fold cross validation of results, Pal et al. obtained around 85.90% accuracy on a dataset of Bangla compound characters containing 20,543 samples. The number of classes considered there for the compound characters is 138. The classes have been selected on the basis of “Relational studies between phoneme and grapheme statistics in current Bangla”, Journal of Acoustical Society of India, vol.-23, pp. 67-77, 1995 by B. B. Chaudhuri and U. Pal [6].

4. The Present work

It has already been mentioned that 43 compound character classes, which constitute 90 percent of the compound characters in contemporary Bangla literature, are chosen here. Some of these character classes consist of different characters with very close resemblances, as per the confusion matrix prepared with the training data. Typical characters of all these classes are shown in Fig 2. An 84-element, quad tree based longest run feature set has been designed here for classification of compound characters of the said classes. An MLP based pattern classifier is trained and tested on the basis of this feature set.

[Character sample images not reproduced]

SL. No.   Number of Samples
1         241
2         240
3         222
4         240
5         222
6         217
7         232
8         239
9         236
10        233
11        228
12        238
13        238
14        238
15        229
16        234
17        223
18        229
19        228
20        225
21        222
22        240
23        230
24        232
25        227
26        221
27        216
28        227
29        219
30        240
31        215
32        229
33        222
34        205
35        214
36        217
37        214
38        217
39        234
40        231
41        220
42        226
43        215

Total no. of Character Samples = 9765

Fig-2 : Typical characters of 43 compound character classes are enlisted
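The paper does not state how classes with mutually high substitution rates were identified and clubbed, so the following is only a sketch of one plausible procedure. It assumes a confusion matrix of raw counts and a hypothetical `threshold` on the row-normalized substitution rate; two classes are merged when each is mistaken for the other more often than that threshold.

```python
import numpy as np

def club_confusable_classes(confusion, threshold=0.15):
    """Group classes whose mutual substitution rate exceeds `threshold`.

    confusion[i, j] = number of class-i samples labelled as class j.
    Returns a list mapping each class index to a group id.
    The threshold value is an assumption, not taken from the paper.
    """
    confusion = np.asarray(confusion, dtype=float)
    n = confusion.shape[0]
    rates = confusion / confusion.sum(axis=1, keepdims=True)
    parent = list(range(n))  # union-find forest over class indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # "Mutually high": both directions must exceed the threshold.
            if rates[i, j] > threshold and rates[j, i] > threshold:
                parent[find(j)] = find(i)

    roots = sorted({find(i) for i in range(n)})
    group_id = {root: g for g, root in enumerate(roots)}
    return [group_id[find(i)] for i in range(n)]

# Toy 3-class confusion matrix (made-up counts): classes 0 and 1 are confused.
toy_confusion = [[70, 25, 5],
                 [30, 65, 5],
                 [2, 3, 95]]
groups = club_confusable_classes(toy_confusion)
print(groups)
```

Applied to the 55 selected classes, a procedure of this kind would yield a smaller number of groups, some holding more than one character class, which is the situation described for the 43 classes used here.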

4.1. The Feature Set

4.1.1. Longest Run Features

The longest run features [4] are computed within the minimal bounding box encompassing a character image. They are computed along 4 directions, viz., row wise, column wise and along the directions of the two major diagonals. The row wise longest run feature is the sum, over all rows of the region, of the length of the longest bar that fits consecutive black pixels in that row. In fitting a bar to a number of consecutive black pixels within a rectangular region, the bar may extend beyond the boundary of the region if the chain of black pixels continues there. The three other longest run features within the rectangle are computed in the same way but along the remaining three directions. Each longest run feature value is normalized by dividing it by the product of the height (h) and the width (w) of the minimal bounding box. The product, h x w, represents the sum of the lengths of the bars that fit consecutive black pixels, in each of the four directions individually, within a region completely filled with black pixels.

4.1.2. Quad-tree based Features

In a quad-tree, each node, except the leaf nodes, has four children. A quad-tree is most often used for representing a two dimensional space by recursively subdividing it into four equal quadrants or regions. In the current work, we have used a modified version of the quad tree to represent multiple partitions of sub-images of a character image. Here, partitioning a character pattern (or a subpart of it) into 4 regions is done by drawing a horizontal and a vertical line through the Centre of Gravity (CG) of the black pixels in that region. If the depth of the quad-tree structure is d, then the total number of sub-images for each character pattern at the leaf nodes is 4^d. The coordinates of the CG of any image frame, (Cx, Cy), are calculated as follows:

$$C_x = \frac{1}{mn}\sum_{x}\sum_{y} x \, f(x, y), \qquad C_y = \frac{1}{mn}\sum_{x}\sum_{y} y \, f(x, y),$$

$$f(x, y) = \begin{cases} 1, & \text{for all black pixels} \\ 0, & \text{otherwise} \end{cases}$$


where x and y are the coordinates of each pixel in the image of size (m x n) pixels. Equal partitioning and CG based partitioning of character images for generating quad-tree structures of depth 2 are illustrated in Fig. 3 with three sample images of compound characters. For each sub-image at any node of the quad-tree structure, 4 longest run features are computed. Partitioning a character pattern using a CG based quad-tree structure is a novelty of the current work. Equal partitioning, as usually done in many approaches, often generates many sub-images with little or no information, which is avoided in the current CG based quad-tree structure. In the current work, we have taken the depth of the quad-tree structure (d) as 2, which gives a root node, 4 nodes at depth 1 and 16 nodes at depth 2. Thus, the total number of nodes in the quad-tree structure is 21 (= 1 + 4 + 16). Altogether 84 (= 21 x 4) longest run features are computed from the 21 image partitions represented by the nodes of the quad tree.
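The four longest run features of a single region can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, the region is assumed to be a binary NumPy array with 1 for black pixels, and runs are clipped at the region boundary instead of being allowed to extend beyond it as described in Section 4.1.1.

```python
import numpy as np

def longest_run(line):
    """Length of the longest run of consecutive nonzero values in a 1-D array."""
    best = current = 0
    for value in line:
        current = current + 1 if value else 0
        best = max(best, current)
    return best

def longest_run_features(region):
    """Row, column and two diagonal longest run features, divided by h * w.

    Simplification (an assumption of this sketch): runs are measured
    inside the region only.
    """
    region = np.asarray(region, dtype=np.uint8)
    h, w = region.shape
    rows = sum(longest_run(r) for r in region)
    cols = sum(longest_run(c) for c in region.T)
    # Offsets -(h-1)..(w-1) visit every diagonal of an h x w array once.
    offsets = range(-(h - 1), w)
    diag = sum(longest_run(np.diagonal(region, k)) for k in offsets)
    flipped = np.fliplr(region)  # anti-diagonals become diagonals
    anti = sum(longest_run(np.diagonal(flipped, k)) for k in offsets)
    return np.array([rows, cols, diag, anti], dtype=float) / (h * w)

full = longest_run_features(np.ones((3, 4)))
toy = longest_run_features([[1, 1, 0],
                            [0, 1, 1]])
print(full, toy)
```

For a region completely filled with black pixels every feature equals 1, which matches the normalization argument given for the product h x w. Applying the function to each of the 21 quad-tree nodes gives the 84 values of the feature vector.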

(a) Sample character images. (b) Character images with equal partitioning. (c) CG based partitioning of depth 2.

Fig. 3. Different image partitioning schemes for different samples are illustrated.
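The CG based partitioning illustrated in Fig. 3 can be sketched as a recursive split of each region at the centre of gravity of its black pixels. The function names below are hypothetical. One assumption is made loudly: the sketch averages the coordinates over the black pixels only, so that the split point lies at the centroid of the ink, whereas the expression for (Cx, Cy) in Section 4.1.2 is printed with the factor 1/(mn). Regions without black pixels are split at their geometric centre, which is also an assumption.

```python
import numpy as np

def centre_of_gravity(region):
    """(row, col) centroid of the black pixels, rounded to pixel indices.

    Assumption: coordinates are averaged over black pixels only; an
    empty region falls back to its geometric centre.
    """
    ys, xs = np.nonzero(region)
    if len(ys) == 0:
        return region.shape[0] // 2, region.shape[1] // 2
    return int(round(float(ys.mean()))), int(round(float(xs.mean())))

def quad_tree_regions(image, depth=2):
    """Return the sub-images of all quad-tree nodes, root first."""
    image = np.asarray(image)
    level = [image]
    nodes = [image]
    for _ in range(depth):
        next_level = []
        for region in level:
            cy, cx = centre_of_gravity(region)
            # A horizontal and a vertical line through the CG give 4 children.
            next_level += [region[:cy, :cx], region[:cy, cx:],
                           region[cy:, :cx], region[cy:, cx:]]
        nodes += next_level
        level = next_level
    return nodes

nodes = quad_tree_regions(np.ones((8, 8), dtype=np.uint8), depth=2)
print(len(nodes), [n.shape for n in nodes[:5]])
```

With depth 2 the function returns 1 + 4 + 16 = 21 regions, one per node of the quad tree described above.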

4.2. The MLP Classifier

In the present work, an MLP classifier [4] is employed for recognition of unknown compound characters using the above mentioned set of 84 features. The MLP has been chosen because it is a kind of feed forward Artificial Neural Network (ANN), a family of models well known for their learning and generalization abilities, which are necessary for dealing with imprecision in input patterns.


5. Experimentation

Experimentation on the present work is conducted in two phases, viz., preparation of the data set, and design of the MLP based pattern classifier with this data set.

5.1 Preparation of the Data Set

A data set of handwritten samples of the compound character classes of our choice is required here for testing the performance of the proposed technique. Due to the unavailability of such a standard data set, a data set is prepared for our work. In preparing it, a problem arises from the fact that, due to variations in the standards followed by different printing houses, some compound characters appear in various shapes in documents prepared by these houses. For these characters, the most commonly used forms or shapes are considered here. The particular shapes of compound characters we consider as standards are given in the data collection sheets. The sheets have been filled in by more than 250 individuals of different age groups and sexes. We have collected samples of all 160 compound character classes identified in the survey associated with this work. All the handwritten character samples collected on the data sheets are digitized into gray scale images through an optical scanner. These gray scale images are binarized through thresholding and finally scaled to 96x96 pixels each. Out of the samples of 160 classes, samples of the 55 compound character classes which cover 90 percent of the compound characters occurring in standard text pieces are isolated for our work. These samples are further regrouped to form 43 classes, as mentioned before. For preparation of the training and the test sets of samples, a second data set of 9,765 compound character samples of the said 43 character classes is formed from the original one. This data set is divided into 3 equal folds for cross validation of results. Each of these folds is in turn considered as a test set and the remaining 2/3 portion of the data as a training set.
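A 3-fold split that keeps the proportion of each character class the same in every fold can be sketched as below. The paper does not describe how the folds were drawn, so the per-class shuffling and the function names `stratified_folds` and `train_test_splits` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=3, seed=0):
    """Assign sample indices to k folds class by class, so that every
    fold keeps roughly the same class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin to the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy labels: 6 samples of class "a" and 9 of class "b" (made-up data).
toy_labels = ["a"] * 6 + ["b"] * 9
toy_folds = stratified_folds(toy_labels)
splits = list(train_test_splits(toy_folds))
print([len(f) for f in toy_folds])
```

Because class sizes in the data set are not multiples of 3 in general, the folds can differ by one sample per class; the ratio of training to test samples per class then stays close to 2:1.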
At this point, it is worth mentioning that the number of samples per character class is not equal for all classes. But the ratio of the number of training samples to the number of test samples is kept the same for each character class.

5.2 Design of the MLP based Pattern Classifier

For the present work, an MLP with a single hidden layer, i.e., an MLP with one input layer, one hidden layer and one output layer, is chosen. This is mainly to keep the computational requirement of the MLP low without affecting its function approximation capability. According to the Universal Approximation Theorem, a single hidden layer is sufficient to compute a uniform approximation to a given training set [9]. To design an MLP for classification of handwritten compound characters, several runs of the Back Propagation (BP) algorithm with learning rate (η) = 0.8 and momentum term (α) = 0.7 are executed for different numbers of neurons in its hidden layer. Recognition performances of the MLP on the test sets, observed from this experimentation, are given in Table 2.
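Training a single hidden layer MLP with back propagation and a momentum term can be sketched in NumPy as follows. Only the learning rate (0.8) and the momentum term (0.7) come from the text. The sigmoid activations, the squared error loss, the batch update, the weight initialization and the class name `SingleHiddenLayerMLP` are assumptions of this sketch, since the paper does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SingleHiddenLayerMLP:
    """Input layer -> one sigmoid hidden layer -> sigmoid output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.velocity = [np.zeros_like(p) for p in self.params()]

    def params(self):
        return [self.W1, self.b1, self.W2, self.b2]

    def forward(self, X):
        self.hidden = sigmoid(X @ self.W1 + self.b1)
        self.output = sigmoid(self.hidden @ self.W2 + self.b2)
        return self.output

    def train_epoch(self, X, T, eta=0.8, alpha=0.7):
        """One batch BP update with momentum; returns the loss before it."""
        Y = self.forward(X)
        # Error terms for squared error with sigmoid units.
        delta_out = (Y - T) * Y * (1.0 - Y)
        delta_hid = (delta_out @ self.W2.T) * self.hidden * (1.0 - self.hidden)
        n = X.shape[0]
        grads = [X.T @ delta_hid / n, delta_hid.mean(axis=0),
                 self.hidden.T @ delta_out / n, delta_out.mean(axis=0)]
        for p, v, g in zip(self.params(), self.velocity, grads):
            v *= alpha      # keep a fraction of the previous update
            v -= eta * g    # add the new gradient step
            p += v          # in-place, so self.W1 etc. are updated
        return float(np.mean((Y - T) ** 2))

# Toy problem (made-up data): the class is given by the first input bit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
net = SingleHiddenLayerMLP(n_in=2, n_hidden=4, n_out=2)
losses = [net.train_epoch(X, T) for _ in range(500)]
print(losses[0], losses[-1])
```

For the experiment described here the network would take 84 inputs and 43 outputs, with `n_hidden` varied over the values listed in Table 2.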


Curves showing the variation of the recognition performance of the MLP on the three folds of test samples, with increases in the number of neurons in its hidden layer, are plotted in Fig. 6 from Table 2.

Table 2. Recognition performances of the MLP on 3 folds of test data with different numbers of neurons in the hidden layer

No. of hidden    Percentage recognition rate of the MLP on test samples
neurons          Fold#1    Fold#2    Fold#3
40               83.36     81.07     85.12
50               83.79     81.16     84.69
60               83.85     81.93     85.61
70               84.28     81.22     85.70
80               84.28     81.78     85.61
90               84.59     81.88     86.19
100              85.17     82.64     86.10
110              84.65     82.24     85.64
120              84.53     81.90     85.46

The best recognition performance of the MLP is observed on test fold #3. It is 86.19% with 90 hidden neurons. The average of the best recognition performances on the three folds of test samples is 84.67%.
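The summary figures quoted above can be checked directly against Table 2. The short script below assumes that the three rows of rates in the table correspond to Fold#1, Fold#2 and Fold#3 in that order, which is consistent with the 86.19% reported for fold #3.

```python
hidden_neurons = [40, 50, 60, 70, 80, 90, 100, 110, 120]
rates = {
    "Fold#1": [83.36, 83.79, 83.85, 84.28, 84.28, 84.59, 85.17, 84.65, 84.53],
    "Fold#2": [81.07, 81.16, 81.93, 81.22, 81.78, 81.88, 82.64, 82.24, 81.90],
    "Fold#3": [85.12, 84.69, 85.61, 85.70, 85.61, 86.19, 86.10, 85.64, 85.46],
}

# Best rate per fold and the hidden layer size at which it occurs.
best = {}
for fold, values in rates.items():
    top = max(values)
    best[fold] = (hidden_neurons[values.index(top)], top)

# Average of the per-fold best rates.
average_best = sum(rate for _, rate in best.values()) / len(best)
print(best, round(average_best, 2))
```

The per-fold maxima are 85.17% and 82.64% at 100 hidden neurons for folds #1 and #2, and 86.19% at 90 hidden neurons for fold #3; their mean rounds to the reported 84.67%.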

6. Discussion on Experimental Results

Due to the lack of a standard data set of handwritten compound characters, the performance of this work cannot be directly compared with that of others. Still, it can be said that the recognition performance achieved here is quite comparable with that of the contemporary work [3] mentioned before. A few samples of misclassified and correctly classified character images are shown in Fig. 4 and Fig. 5 respectively. The primary reason behind misclassification of the character images used here is the insufficiency of the feature set in distinguishing finer details of character images of different classes with close resemblance. The image pairs 4 (b-c) and 4 (d-e) reflect this. The instance of misclassification shown in Fig. 4(a) possibly results from a non-standard style of writing followed for the character. Remedial procedures in such cases may not be available. For other cases of misclassification, more powerful features may be explored in future. Lexicon matching may also be useful in domain specific applications, like city name recognition for automatic mail sorting, extraction of information from filled-in forms, etc.


(a) A character image of “k” misclassified as “t”
(b) A character image of “n” misclassified as “p”
(c) A character image of “p” misclassified as “n”
(d) A character image of “p” misclassified as “m”
(e) A character image of “p” misclassified as “n”

Fig. 4. Some samples of misclassified character images.

(a) A character image of n
(b) A character image of b
(c) A character image of m

Fig. 5. Some samples of successfully classified character images.

Fig. 6. The Curves show variation of recognition performances of the MLP as the number of neurons in its hidden layer is increased for three different folds of the data.

Sample images of the character classes which are formed with samples of more than one class can be further classified using more statistical and other topological features, or by analysing the context of occurrence of each such character. Newer classes of less frequently occurring compound characters may be included in future work. In the light of the above discussion, the present work can be viewed as an important step towards dealing with the recognition problem of a large number of compound character classes, by creating a scope for incrementally extending the number of learned character classes from the more frequently occurring to the less frequently occurring ones.

Acknowledgements

The authors are thankful to the “Center for Microprocessor Application for Training Education and Research” and the “Project on Storage Retrieval and Understanding of Video for Multimedia” of the Computer Science & Engineering Department, Jadavpur University, for providing infrastructure facilities during the progress of the work.

References

[1] A. F. R. Rahman, R. Rahman, M. C. Fairhurst, “Recognition of Handwritten Bengali Characters: a Novel Multistage Approach,” Pattern Recognition, vol. 35, pp. 997-1006, 2002.
[2] T. K. Bhowmik, U. Bhattacharya and Swapan K. Parui, “Recognition of Bangla Handwritten Characters Using an MLP Classifier Based on Stroke Features,” in Proc. ICONIP, Kolkata, India, pp. 814-819, 2004.
[3] U. Pal, T. Wakabayashi and F. Kimura, “Handwritten Bangla compound character Recognition using Gradient Feature,” ICIT, 2007.
[4] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, D. K. Basu, “Handwritten Bangla Alphabet Recognition using an MLP Based Classifier,” in Proc. of the 2nd National Conf. on Computer Processing of Bangla, pp. 285-291, Feb-2005, Dhaka.
[5] K. Roy, U. Pal and F. Kimura, “Bangla handwritten character recognition,” International Journal of Tomography & Statistics, vol. 5, pp. 27-36, 2007.
[6] B. B. Chaudhuri and U. Pal, “Relational studies between phoneme and grapheme statistics in current Bangla,” Journal of Acoustical Society of India, vol.-23, pp. 67-77, 1995.
[7] http://paschimbangabanglaakademi.org/wbfinalkarmasuchi.htm
[8] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, D. K. Basu, “A two pass approach to pattern classification,” Proc. of the 11th ICONIP, Nov-2004.
[9] S. Haykin, “Neural Networks: A Comprehensive Foundation,” Second Edition, Pearson Education Asia, pp. 208-209, 2001.
