International journal of Computer Science & Network Solutions http://www.ijcsns.com

May.2014-Volume 2.No5 ISSN 2345-3397

Gujarati Handwritten Numeral optical Character through Naive Bayes Classifier

Kamal Moro, Mohammed Fakir, Belaid Bouikhalene, Badr Dine El Kessab, Rachid El Yachi

Information processing and telecommunication teams Faculty of Science and Technology, 523, Beni Mellal, Morocco [email protected]

Abstract

This paper deals with an optical character recognition (OCR) system for handwritten Gujarati numbers. Much work exists for Indian languages such as Hindi, Kannada, Tamil, Bangla, Malayalam and Gurumukhi, but hardly any work is traceable for Gujarati, especially for handwritten characters. The features of Gujarati digits are abstracted by four different profiles of digits. Skeletonization and binarization are also applied as preprocessing of the handwritten numerals before their classification. This work has achieved a success rate of approximately 80.5% for Gujarati handwritten digit identification.

Keywords: Optical character recognition, Naïve Bayes, Gujarati Handwritten Digits, classification.

I. Introduction

In simple terms, a naive Bayes classifier assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the presence or absence of the other features. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers (Mozina.M et al, 2004). An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

II. Bayes theorem

Gujarati, which belongs to the Devnagari family of languages and originated and flourished in Gujarat, a western state of India, is spoken by over 50 million people of the state. Though it has inherited a rich culture and literature and is very widely spoken, hardly any significant work has been done on the identification of Gujarati optical characters. The Gujarati script differs from those of many other Indian languages in not having any shirolekha (headlines). Gujarati numerals do not carry shirolekha either, which is true of the numerals of almost all Indian languages. The numerals in Indian languages are based on sharp curves, and hardly any straight lines are used. Figure.1 shows the set of Gujarati numerals. As is visible in Figure.1, Gujarati digits are very peculiar by nature. Only two Gujarati digits, one (1) and five (5), contain a straight line, which makes Gujarati digit identification a little more difficult. Gujarati digits also often invite misclassification. These confusing sets of digits are shown in Figure.2.

Figure.1. Gujarati digits 0-9

Figure.2. Confusing Gujarati digits

This paper addresses the problem of handwritten Gujarati numeral recognition. Gujarati numeral recognition requires binarization and skeletonization as preprocessing. Further, profiles are used for feature extraction and a naive Bayes classifier is used for the classification. To develop a system that identifies Gujarati handwritten digits, we collected numerals 0-9 written in Gujarati script from a large number of writers. These numerals were scanned at 300 dpi with a flatbed scanner. Initially they are in separate boxes of 50*30 pixels each. Since our problem is to identify handwritten digits, the first requirement is to bring all the characters into a standard normal form. This is needed because writers may use different types of pens and paper, and may even follow different styles of writing.

III. Naive Bayesian model

In probability theory, Bayes' theorem relates conditional probabilities: given two events A and B, Bayes' theorem determines the probability of A given B, if we know the probabilities of A, of B, and of B given A. This basic theorem (originally named "probability of causes") has significant applications. To arrive at Bayes' theorem, we start from the definition of conditional probability:

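$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A)$$

where $P(A \cap B)$ is the probability that A and B both occur. By dividing both sides by $P(B)$, we obtain:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$

Bayes' theorem is sometimes refined by noting that:

$$P(B) = P(A \cap B) + P(\bar{A} \cap B) = P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A}) \qquad (2)$$

which allows the theorem to be rewritten as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} \qquad (3)$$

where $\bar{A}$ is the complement of A. More generally, if $\{A_i\}$ is a partition of the set of possibilities, we obtain:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)} \qquad (4)$$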
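As a small numerical illustration of the partition form of Bayes' theorem, the following Python sketch computes the posterior of each event of a partition from its prior and likelihood. The function name `posterior` and the three-event numbers are hypothetical, chosen only for illustration; they do not come from the experiments in this paper.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every event A_i of a partition.

    priors[i]      is P(A_i)
    likelihoods[i] is P(B | A_i)
    """
    # Numerators P(B | A_i) * P(A_i), one per event of the partition.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator P(B), obtained by summing over the partition.
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) is zero: the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical three-event partition (illustrative values only).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]
post = posterior(priors, likelihoods)
most_probable = max(range(len(post)), key=post.__getitem__)
```

With these values the third event has the largest posterior even though it has the smallest prior, because its likelihood dominates.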

IV. Classification from a probability model

Abstractly, the probability model for a classifier is a conditional model $p(C \mid F_1, \dots, F_n)$ over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. Using Bayes' theorem, this can be written

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\,p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \qquad (5)$$

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model $p(C, F_1, \dots, F_n)$, which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$p(C, F_1, \dots, F_n) = p(C)\,p(F_1 \mid C)\,p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1}) \qquad (6)$$

Now the "naive" conditional independence assumptions come into play: assume that each feature $F_i$ is conditionally independent of every other feature $F_j$ for $j \neq i$, given the class $C$. Under these independence assumptions, the conditional distribution over the class variable $C$ is:

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\,\prod_{i=1}^{n} p(F_i \mid C)}{p(F_1, \dots, F_n)} \qquad (7)$$

V. Classification from a probability model

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function defined as follows:

$$\mathrm{classify}(f_1, \dots, f_n) = \underset{c}{\operatorname{argmax}}\; p(C = c)\,\prod_{i=1}^{n} p(F_i = f_i \mid C = c) \qquad (8)$$

VI. Estimated value of the parameters

All model parameters (probabilities of the classes and probability distributions associated with the different features) can be approximated by the relative frequencies of the classes and the features in the training data set. This is a maximum likelihood estimate of the probabilities. The class priors may, for example, be calculated by assuming that the classes are equiprobable (each prior = 1 / (number of classes)), or by estimating each class probability from the training data (prior for a given class = (number of samples in that class) / (total number of samples)). When working with features that are continuous random variables, it is generally assumed that the corresponding probability distributions are normal distributions, whose expectation and variance are estimated. The expectation, $\mu$, is calculated with:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (9)$$

where $N$ is the number of samples and $x_i$ is the value of a given sample. The variance, $\sigma^2$, is calculated with:

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2 \qquad (10)$$

Assuming that there are m classes, the probabilities [...] for i = 1 ... m can be estimated as follows:

(11)

If, for a certain class, a certain feature never takes a given value anywhere in the training data, then the probability estimate based on the frequency is zero. This poses a problem, since it introduces a zero factor when the probabilities are multiplied. Therefore, knowing that [...] is an increasing function, the probability estimates are corrected using the following formula:

(12)
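The estimation steps above and the MAP rule of equation (8) can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: class priors are taken as relative frequencies, each feature is modelled by a normal distribution with the mean and variance of equations (9) and (10), and scores are summed in the log domain. The constant `VARIANCE_FLOOR` is an assumption added here to avoid division by zero for constant features; it does not reproduce the correction of equation (12). The names `fit`, `log_scores`, `classify` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

VARIANCE_FLOOR = 1e-9  # assumption: keeps variances strictly positive


def fit(samples, labels):
    """Estimate (prior, means, variances) for each class from training data."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for c, rows in by_class.items():
        n = len(rows)  # needs at least 2 samples per class for the variance
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / (n - 1) + VARIANCE_FLOOR
            for col, m in zip(zip(*rows), means)
        ]
        model[c] = (n / len(samples), means, variances)
    return model


def log_scores(model, x):
    """Log of the numerator of the posterior, for every class."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        s = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the normal density with mean m and variance var.
            s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[c] = s
    return scores


def classify(model, x):
    """MAP decision: return the class with the highest score."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


# Toy two-class, two-feature data set (illustrative values only).
samples = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
           [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
labels = [0, 0, 0, 1, 1, 1]
model = fit(samples, labels)
```

Working with sums of logarithms rather than products of probabilities leaves the argmax unchanged and avoids numerical underflow when many features are multiplied.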

VII. Experimental results

For this experiment, the classifier was trained on a total of 30 sets of digits and tested on 60 other new sets of digits. In total, it was trained on 300 digits and tested on 600 digits. Figure.3 shows the recognition process used.

Resizing -> Binarization -> Skeletonization -> Feature Extraction -> Classification

Figure.3. Recognition Process

A. Binarization

Binarization is often the first step in image processing and analysis systems (Trier.D et al, 1995), (Leedham.G et al, 2002), (Kefali.A et al, 2009), especially for images of documents. Its goal is to reduce the amount of information present in the image while keeping only the relevant information, which allows the use of simpler analysis methods than those required for grayscale or color images (Sauvola.J et al, 2000). In our case, we used the algorithm of Wolf (Wolf.C et al, 2002) (Figure.4).
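For readers unfamiliar with this family of methods, the following Python sketch shows a local adaptive threshold in the style of Wolf. It is an illustration under stated assumptions rather than the exact procedure used here: the threshold formula T = (1 - k)·m + k·M + k·(s/R)·(m - M) is the one commonly attributed to Wolf and Jolion, where m and s are the mean and standard deviation in a window around the pixel, M is the minimum gray level of the image and R is the largest local standard deviation. The window radius, the value k = 0.5 and the function name `wolf_binarize` are assumptions, since the paper does not state its parameters.

```python
import math


def wolf_binarize(img, radius=1, k=0.5):
    """Binarize a grayscale image given as a list of rows (0 = black, 255 = white).

    Returns an image of the same size where 1 marks ink and 0 marks background.
    """
    h, w = len(img), len(img[0])
    means = [[0.0] * w for _ in range(h)]
    stds = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Window clipped at the image borders.
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            m = sum(window) / len(window)
            means[y][x] = m
            stds[y][x] = math.sqrt(sum((v - m) ** 2 for v in window) / len(window))
    min_gray = min(min(row) for row in img)         # M in the formula
    max_std = max(max(row) for row in stds) or 1.0  # R; 1.0 avoids dividing by zero
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            m, s = means[y][x], stds[y][x]
            t = (1 - k) * m + k * min_gray + k * (s / max_std) * (m - min_gray)
            row.append(1 if img[y][x] <= t else 0)  # ties count as ink
        out.append(row)
    return out


# Toy 5x5 image: a dark vertical stroke on a light background.
toy = [[200, 200, 20, 200, 200] for _ in range(5)]
binary = wolf_binarize(toy)
```

On the toy image only the central column is marked as ink. A real implementation would use a larger window and a vectorized computation of the local statistics.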

Figure.4. Binarization of a digit, a: before binarization, b: after binarization

B. Skeletonization

In the continuous setting, the skeleton of a shape is a set of lines running through its middle (Figure.5). The objective of skeletonization is to represent a shape with a minimum of information, in a form that is simple to extract and convenient to handle. This is the notion of the continuous medial axis of a shape introduced by Blum (Blum.H, 1967) in 1964: "Consider a prairie covered with dry grass and Ω a set of points of this prairie. If all points of the contour of Ω are ignited simultaneously, the fire spreads uniformly at a constant speed. The skeleton of Ω (denoted MA(Ω)) is then defined as the locus of points where the fire fronts meet."

Figure.5. Example of Skeleton

There are currently a variety of methods for building skeletons from shapes. One of the best known is topological thinning, which examines the pixels of the binary image and iteratively removes those that do not belong to the final skeleton. Pixels are examined in one of two ways: the sequential approach, where the removal of a point P at a given iteration depends on the operations performed so far, that is, on the previous iteration but also on the pixels already processed in the current one; and the parallel approach, in which the pixels are examined independently at each iteration, so that the removal of P depends only on the operations performed at the previous iteration. The latter approach is used here, by applying the Guo-Hall algorithm (Guo.Z et al, 1989) (Figure.6).

Figure.6. Skeleton of a digit, a: before skeletonization, b: after skeletonization

C. Feature Extraction

We use the box approach of (Hanmandlu.M et al, 2007), (Hanmandlu.M et al, 2003) and (Hanmandlu.M et al, 2005). This approach requires the spatial division of the character image. Its major advantage stems from its robustness to small variations, its ease of implementation and its relatively high recognition rate. The choice of box size and number of boxes is discussed in Section 6 on results. Each character image is divided into 24 boxes so that the portions of a numeral lie in some of these boxes. Some boxes may be empty, as shown in Figure.7, where the English numeral 3 is enclosed in a 6*4 grid. However, all boxes are considered for analysis, in sequential order. Taking the bottom left corner as the absolute origin (0,0), the coordinate distance (vector distance) for the kth pixel in the bth box at location (i,j) is computed as

$$d_k^b = \sqrt{i^2 + j^2} \qquad (13)$$

By dividing the sum of distances of all black pixels present in a box by the total number of pixels in that box, a normalized vector distance ($\gamma_b$) for each box is obtained as

$$\gamma_b = \frac{1}{n_b}\sum_{k} d_k^b \qquad (14)$$

where $d_k^b$ is the vector distance of the kth black pixel of the bth box, the sum runs over the black pixels of that box, and $n_b$ is the total number of pixels in the bth box. These vector distances constitute a set of features based on distances. Therefore, 24 $\gamma_b$'s corresponding to 24 boxes will constitute a feature set. However, for empty boxes, the value will be zero.
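The box feature described above can be sketched in Python as follows. This is a minimal illustration, assuming that distances are measured from the bottom left corner of the whole image (as the text's "absolute origin" suggests), that boxes are visited row by row from the top, and that the image dimensions are multiples of the grid size (any leftover rows or columns are ignored). The function name `box_features` and the toy image are hypothetical.

```python
import math


def box_features(img, rows, cols):
    """Normalized vector distance of each box of a binary image.

    img is a list of rows in which 1 marks a black (ink) pixel.
    Returns rows * cols features, one per box.
    """
    h, w = len(img), len(img[0])
    bh, bw = h // rows, w // cols  # box height and width in pixels
    features = []
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for y in range(r * bh, (r + 1) * bh):
                for x in range(c * bw, (c + 1) * bw):
                    if img[y][x]:
                        # Coordinates relative to the bottom left corner of the image.
                        i, j = x, h - 1 - y
                        total += math.hypot(i, j)
            # Divide by the total number of pixels in the box; empty boxes give 0.
            features.append(total / (bh * bw))
    return features


# Toy 4x4 image split into a 2x2 grid, with one ink pixel in the top right corner.
toy = [[0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
feats = box_features(toy, 2, 2)
```

Called as `box_features(img, 10, 5)` on a 50*50 image, the same function would return one value per 5*10-pixel block, 50 values in total; how the paper orients its 10*5 blocks is not stated, so this call is only an example.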

Figure.7. Portions of the numeral lie within some boxes while others are empty

In practice, each character is divided into 50 blocks of 10*5 pixels. A sample of the extracted feature vector is as follows:

[0 0 0.6672 0.7017 0 0 1.9429 2.8865 0 0 0.5770 2.0683 0 0 0 3.3593 0 0 1.1157 0 2.4256 1.8303 0 2.4254 3.5686 0 4.0410 6.6789 9.9418 0 0 0 0 7.1634 0 0 0 0 8.0269 0 0 9.6223 0 8.6789 0 0 0 10.4075 3.5689 0].

D. Classification

In terms of classification, we performed recognition on 600 characters (60 characters for each class), based on 100 learning characters (10 characters for each class). Following the recognition process of Figure.3, 483 of the 600 characters are recognized, giving a recognition rate of 80.5%. Table.1 gives more details on the recognition rate for each class.

TABLE I RESULT SUMMARY
(rows: digit written; columns: digit recognized)

Figure     0    1    2    3    4    5    6    7    8    9   Success(%)
0         56    0    0    0    0    0    0    2    2    0   93.33
1          0   42    2    0    0   14    2    0    0    0   70
2          0   12   43    0    2    0    1    1    1    0   71.67
3          0    1    0   58    0    0    1    0    0    0   96.67
4          0    3    1    1   49    1    0    1    4    0   81.67
5          0    5    3    0    1   51    0    0    0    0   85
6          0    2    0    3    0    3   50    0    1    1   83.33
7          2    0    1    4    0    0    0   53    0    0   88.33
8          2    0    0    0    2    0    1    0   54    1   90
9          0    0    2    0    1    0    0    8   22   27   45
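The per-class success rates and the overall recognition rate can be checked directly from the counts of Table I. The Python sketch below does so, reading each row as the digit written and each column as the digit recognized; this orientation is an inference from the fact that, read that way, every row sums to the 60 test samples per class. The variable names are hypothetical.

```python
# Confusion matrix from Table I: confusion[written][recognized].
confusion = [
    [56, 0, 0, 0, 0, 0, 0, 2, 2, 0],
    [0, 42, 2, 0, 0, 14, 2, 0, 0, 0],
    [0, 12, 43, 0, 2, 0, 1, 1, 1, 0],
    [0, 1, 0, 58, 0, 0, 1, 0, 0, 0],
    [0, 3, 1, 1, 49, 1, 0, 1, 4, 0],
    [0, 5, 3, 0, 1, 51, 0, 0, 0, 0],
    [0, 2, 0, 3, 0, 3, 50, 0, 1, 1],
    [2, 0, 1, 4, 0, 0, 0, 53, 0, 0],
    [2, 0, 0, 0, 2, 0, 1, 0, 54, 1],
    [0, 0, 2, 0, 1, 0, 0, 8, 22, 27],
]

# Success rate per class: correctly recognized samples over samples of that class.
per_class = [100.0 * confusion[d][d] / sum(confusion[d]) for d in range(10)]

# Overall recognition rate: trace of the matrix over the total number of samples.
correct = sum(confusion[d][d] for d in range(10))
total = sum(sum(row) for row in confusion)
overall = 100.0 * correct / total

# Most frequent confusion (largest off-diagonal count).
worst = max(
    ((confusion[a][b], a, b) for a in range(10) for b in range(10) if a != b)
)
```

The largest off-diagonal count is the digit 9 recognized as 8 (22 of 60 samples), which accounts for the 45% success rate of that class.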

IX. Conclusions

Following the recognition process described above, naive Bayes classification enabled us to achieve a recognition rate of 80.5%. We consider this an acceptable recognition rate, but it remains inadequate in a professional setting. The performance of any classification method depends on feature extraction. In future work, we expect to use other feature extraction techniques in the recognition process, and hidden Markov models and k-nearest neighbors for classification.

References

i. Mozina.M, Demsar.J, Kattan.M, & Zupan.B. (2004). Nomograms for Visualization of Naive Bayesian Classifier. In Proc. of PKDD-2004, pp. 337-348.
ii. Trier.Ø.D and Taxt.T. (March 1995). Evaluation of binarization methods for document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(3), pp. 312-315.
iii. Leedham.G, Varma.S, Patankar.A, and Govindaraju.V. (August 2002). Separating text and background in degraded documents images - A comparison of global thresholding techniques for multi-stage thresholding. Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, pp. 244-249.
iv. Kefali.A, Sari.T and Sellami.M. (2009). Evaluation de plusieurs Techniques de seuillage d'images de documents arabes anciens. IMAGE'09 Biskra.
v. Sauvola.J, Pietikainen.M. (2000). Adaptive document image binarization. Pattern Recognition, 33(2), pp. 225-236.
vi. Wolf.C, Jolion.J.M, and Chassaing.F. (2002). Extraction de texte dans des vidéos : le cas de la binarisation. 13ème Congrès Francophone de Reconnaissance des Formes et Intelligence Artificielle, pp. 145-152.
vii. Blum.H. (1967). A transformation for extracting new descriptions of shape. Models for the Perception of Speech and Visual Form, MIT Press, pp. 362-380.
viii. Guo.Z and Hall.R.W. (March 1989). Parallel thinning with two-subiteration algorithms. Comm. ACM, 32(3), pp. 359-373.
ix. Hanmandlu.M, Ramana Murthy.O.V. (2007). Fuzzy model based recognition of handwritten numerals. Pattern Recognition 40, pp. 1840-1852.
x. Hanmandlu.M, Murali Mohan.K.R, Chakraborty.S, Goyal.S, Roy Choudhury.D. (2003). Unconstrained handwritten character recognition based on fuzzy logic. Pattern Recognition 36 (3), pp. 603-623.
xi. Hanmandlu.M, Yusof.M.H.M, Vamsi Krishna.M. (2005). Off line signature verification and forgery detection using fuzzy modeling. Pattern Recognition 38 (3), pp. 341-356.
