Devnagari Handwritten Character Recognition

Proceedings of 2013 IEEE International Conference on Information and Communication Technologies (ICT 2013)

Devnagari Handwritten Character Recognition (DHCR) for Ancient Documents: A Review

Kunal Ravindra Shah

Dipak Dattatray Badgujar

Sinhgad Institute of Technology Lonavala, India [email protected]

Sinhgad Institute of Technology Lonavala, India [email protected]

Abstract: Character recognition for ancient documents has attracted growing attention from researchers. Recognition remains difficult because of errors in text line detection, overlapping characters, joined text lines, aging, and physical damage to the text. Moreover, most Devnagari handwritten ancient documents are not accessible to everyone because of their delicate condition. The purpose of recognizing such documents is to preserve and retrieve the knowledge contained in ancient manuscripts. This paper reviews several methods for detecting characters in such documents with low error in the retrieved text.

Keywords: OCR, Pattern Recognition, DBSCAN, DoG, Devnagari Handwritten Character Recognition, Ancient Documents, Historical Manuscripts

I. INTRODUCTION

In computer science, pattern recognition is the technique used to identify input data such as images, speech, or streams of text, together with the relationships between them. It involves identifying attributes, extracting features from those attributes, and finally comparing patterns; the result is either a matched or an unmatched pattern.

Text recognition from ancient documents poses specific challenges, among them degradation and stains, faded ink, varying text lines, overlapping text elements, and varying layouts. In degraded manuscripts (e.g. damaged by mold, humidity, or bad storage conditions) the text or parts of it can disappear. Huge numbers of ancient Vedic and historical documents are preserved in museums worldwide, but the purpose of this preservation has not been served: very few of these documents are available for research. This raw material becomes useful only if it is available to all, which requires transcribing it into a textual electronic format. To preserve cultural heritage and to enable automated document processing, libraries and national archives have started digitizing historical documents. We want to propose a system that can recognize ancient documents and make them available to research scholars.

Pattern recognition is important in many applications, such as optical character recognition, astronomy, medicine, robotics, and remote sensing by satellites. From this area, this paper presents different methods of character recognition for ancient documents. Character recognition was first invented in 1929 by Gustav Tauschek in Germany. Since then, many researchers have been interested in recognizing characters from ancient documents. Hindi and Bangla are two of the best-known languages in India. Hindi is written in the Devnagari script [2], and the scripts of both languages are derived from the Brahmi script. Although digitization facilities are now available, not all documents can be digitized. Documents also contain noise such as overlapping characters, characters interfering with the ruled lines of the page, clutter, aging, irregular writing styles, and fluctuating text lines [5]. There is therefore a need for a very accurate method for identifying characters in ancient documents.

Likforman-Sulem [3] provides a detailed survey of text line segmentation for historical documents. Recent approaches introduce seam carving, known from image retargeting, for text line segmentation [2]. Document layout analysis is the first step in document understanding; it identifies the regions of interest for optical character recognition [4]. The following sections review different methods of handwritten character recognition for ancient documents.

978-1-4673-5758-6/13/$31.00 © 2013 IEEE

Fig.1 Sample Devnagari language Ancient Image


II. METHODS FOR CHARACTER DETECTION FOR ANCIENT DOCUMENTS

Method 1 Layout Analysis for Historical Manuscripts [4]

Angelika Garz, Markus Diem & Robert Sablatnig first applied their layout analysis method to Glagolitic manuscripts dating from the 11th century, which belong to a discovery of 42 codices made in St. Catherine’s Monastery in 1975. The method consists of two consecutive steps: extraction and classification of features, followed by a multistage localization algorithm. For both steps the Difference of Gaussian (DoG) is calculated. Its local minima and maxima provide interest points at junctions, circles, and arcs. A local feature is then computed at each interest point using SIFT descriptors (128-dimensional vectors) [4]. Gradient magnitude and orientation are computed in the blurred region around each interest point. The localization step uses the scales and positions of the detected interest points and reduces the number of falsely detected descriptors. In the last step, text line segmentation, a density-based clustering algorithm such as DBSCAN is applied to the interest points, and text lines are detected as regions with a high density of interest points.

The method was applied to 100 pages with variations in script, layout, writing style, writing instrument, and character size, both within and between pages. The data set also contains image patches with 18 embellished initials, 30 plain initials, and 30 headings. Results were reported as precision and recall: precision is low and recall is higher. The method also handles some of the issues raised by embellished initials. In this way part-based layout analysis was implemented.
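The density-based clustering step of Method 1 can be illustrated with a minimal DBSCAN sketch over 2-D interest points. This is not the authors' implementation: the function name `dbscan`, the parameters `eps` and `min_pts`, and the toy coordinates are assumptions chosen for illustration, and a real system would cluster DoG interest points extracted from a manuscript image.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # core point: keep expanding the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels


# Hypothetical interest points: two dense "text lines" and one stray stain.
line_a = [(x, 0) for x in range(5)]
line_b = [(x, 10) for x in range(5)]
stray = [(20, 5)]
labels = dbscan(line_a + line_b + stray, eps=1.5, min_pts=3)
print(labels)
```

Each dense run of points receives its own cluster id, while the isolated point is labelled as noise, which is the behaviour the text relies on when it says text lines are found where interest points are dense.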

Method 2 GIDOC: Gimp-based Interactive Transcription of Old Text Documents [9]

This method was proposed by A. Juan, Veronica R., J. Andreu, Nicolas S. & A. H. Toselli. According to them, some handwritten text recognition applications do not require perfect (error-free) transcriptions. A small number of errors may be acceptable, and to keep the system user friendly it is better to leave out characters that cannot be read reliably, as long as the result stays close to the original text. GIDOC offers an approach that makes the system user friendly and interactive: the user specifies the amount of error allowed in the final retrieved characters. GIDOC uses semi-supervised learning. The handwritten collection used is the Catastro de Ensenada; for example, the data was used to check registration entries in an old town, which requires checking documents and properties. The system is built on the popular GNU Image Manipulation Program (GIMP). The input image is processed so that the predicted output characters contain minimal errors, including errors that the naked human eye cannot detect. Finally, confidence measures are calculated as posterior word probabilities.
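One way to read the interaction between the user-specified error allowance and the word confidence measures is sketched below. The selection rule is an assumption made for illustration, not the procedure published for GIDOC: it treats 1 - posterior as the estimated probability that a word is wrong and flags the least confident words for user supervision until the expected error rate of the remaining words falls within the allowance. The function name `words_to_supervise` and the sample words and posteriors are hypothetical.

```python
def words_to_supervise(words, max_error_rate):
    """Return the indices of the words a user should check.

    words: list of (text, posterior) pairs. 1 - posterior is used as the
    estimated probability that the word is wrong (assumption of this sketch).
    """
    order = sorted(range(len(words)), key=lambda i: words[i][1])
    expected_errors = sum(1.0 - p for _, p in words)
    budget = max_error_rate * len(words)
    flagged = []
    for i in order:                      # least confident words first
        if expected_errors <= budget:
            break
        expected_errors -= 1.0 - words[i][1]   # a supervised word counts as correct
        flagged.append(i)
    return sorted(flagged)


# Made-up recognizer output: (word, posterior probability).
sample = [("w1", 0.99), ("w2", 0.95), ("w3", 0.40), ("w4", 0.98), ("w5", 0.55)]
print(words_to_supervise(sample, max_error_rate=0.1))
```

With a looser allowance fewer words are flagged, which mirrors the trade-off between user effort and residual error described in the text.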


They proposed a framework that consists of the following stages:

A) Document image preprocessing:
1) Skew correction
2) Background removal
3) Noise reduction
4) Line image extraction
5) Slant correction
6) Non-linear size normalization

B) HMM training/decoding:
1) Lines represented as sequences of feature vectors
2) Characters modeled by continuous density HMMs
3) Words modeled by stochastic finite-state automata
4) Text lines modeled by N-grams
5) Baum-Welch algorithm for training
6) Viterbi algorithm for decoding

In the end, they obtained characters with fewer errors, and the characters that could not be detected reliably were rejected for further analysis.
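The decoding stage listed above can be illustrated with a small Viterbi sketch. It deliberately simplifies the setting: the source describes continuous density HMMs over feature vectors, whereas this example uses a discrete HMM with made-up states, observation symbols, and probabilities, all of which are assumptions for illustration only.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence of a discrete HMM."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical two-character model with three stroke-like observation symbols.
states = ("ka", "kha")
start_p = {"ka": 0.6, "kha": 0.4}
trans_p = {"ka": {"ka": 0.7, "kha": 0.3}, "kha": {"ka": 0.4, "kha": 0.6}}
emit_p = {
    "ka": {"loop": 0.5, "bar": 0.4, "hook": 0.1},
    "kha": {"loop": 0.1, "bar": 0.3, "hook": 0.6},
}
decoded = viterbi(["loop", "bar", "hook"], states, start_p, trans_p, emit_p)
print(decoded)
```

A practical recognizer would work with log-probabilities to avoid underflow on long lines; plain products are kept here so the recurrence stays easy to follow.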

Method 3 Recognition-based Keyword Spotting [11]

A. Bhardwaj, V. Govindaraju & S. Setlur proposed a keyword spotting technique for Sanskrit documents. The method consists of the following three phases:

Phase 1: Preprocessing phase
Phase 2: Recognition phase
Phase 3: Matching phase

In the preprocessing phase, the horizontal profile of the document image is used to segment the document into line images, whereas the vertical profile of each line image is used to extract individual word images. Because ancient documents have poor print quality, they contain noise. To clean the images, the authors used the Block Adjacency Graph (BAG) before continuing with the character recognition part of the procedure [11].

Fig.2 Sample Character Detection using keyword spotting [11]

The Block Adjacency Graph (BAG) is created by dividing the text horizontally and vertically into runs, which are stored in variables such as Hruns. Each Hrun is classified as above (abv) or below (bel). In the splitting stage an Hrun is added to the block that occurs above it, and in the merging stage an Hrun is added to the block that occurs below it. The graph is drawn by connecting the centroids of neighboring blocks. Because of noise, the extracted text may not be clearly legible to the user. To make the text lines clear, BAG-based contour smoothing is performed by mapping pixels to the blocks of the BAG structure [15]. The average block size avg_size is calculated by dividing the sum of the areas of all blocks by the total number of blocks. Blocks whose area is smaller than avg_size and that are connected by only one edge are considered noise and removed. After splitting and merging, the runs are grouped together, and thin (bony) stroke ends of some characters are removed so that the retrieved characters are clearly visible to end users.

The next phase, character recognition, performs segmentation by trying different merges of blocks from the BAG. The OCR provides several hypotheses for each segmentation. In the matching phase, the best match is found between these hypotheses and the class labels of the word used as the query. In this way, keyword spotting performed well in character recognition.
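The avg_size noise rule described for the BAG can be sketched as follows. The data layout is an assumption made for illustration: blocks are given as a dictionary of areas and the graph as a list of edges between block ids, and "connected to only one edge" is interpreted as a node of degree one. The function name `remove_noisy_blocks` and the sample values are hypothetical.

```python
def remove_noisy_blocks(areas, edges):
    """Drop blocks that are smaller than the average area and have exactly one edge.

    areas: {block_id: area}; edges: list of (block_id, block_id) pairs.
    Returns the filtered (areas, edges).
    """
    avg_size = sum(areas.values()) / len(areas)
    degree = {block: 0 for block in areas}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    noisy = {block for block, area in areas.items()
             if area < avg_size and degree[block] == 1}
    kept_areas = {block: area for block, area in areas.items() if block not in noisy}
    kept_edges = [(a, b) for a, b in edges if a not in noisy and b not in noisy]
    return kept_areas, kept_edges


# Made-up graph: block "C" is a small speck hanging off block "B".
areas = {"A": 40, "B": 50, "C": 6, "D": 44}
edges = [("A", "B"), ("B", "C"), ("B", "D")]
clean_areas, clean_edges = remove_noisy_blocks(areas, edges)
print(clean_areas, clean_edges)
```

Blocks "A" and "D" also have a single edge but are larger than the average, so only the small dangling block is removed.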

Fig.3 Sample Character Detection using keyword spotting [11]

Method 4 OCR system for identifying characters of languages Bangla & Devnagari [1]

This method was proposed by B. B. Chaudhuri & U. Pal in 1997. According to them, the system goes through the following steps:

1) Text Digitization, Gray tone to two tone conversion, Noise cleaning, Text Block Identification
In this step the text image is converted from gray tone to two tone. At this stage the image contains noise, irregularities in the text, and other undesired effects. The method of Mahmoud [6] is used to remove this noise.

2) Skew correction, Line & word detection
The second step covers skew detection and correction. The average width of the components is used for skew angle detection. The uppermost pixel of each selected component is taken so that the properties of a digital straight line can be identified. The digital line is then used to detect the skew angle, and the skew is corrected by rotating the image in the opposite direction. To handle multiple columns, the text blocks are separated. Lines are detected from the valleys of the histogram computed as the row-wise sum of gray values [1]. The position between two consecutive headlines where the histogram height is least denotes a boundary line [17]. Because the characters of a word are usually connected to the headline, the headlines are removed from the page. The boundary between the middle and lower zones is then found as a path that does not touch a black pixel. Finally, segmentation is carried out part by part [18].

3) Grouping of Characters
The detected characters are grouped into three groups: basic, modifier, and component. The grouping depends on the bounding box width, the number of border pixels per unit width, and the accumulated curvature over the border.

4) Character recognition
Characters are recognized on the basis of the properties of Bangla and Hindi characters [19]. The decision rules test for the presence or absence of features in the characters.

5) Error detection
Errors are detected using a dictionary together with the structural properties of the characters. Errors are divided into five categories: substitution of characters, rejection of characters, splitting of characters, run-on, and deletion. About 10000 words were used for character recognition. The results show that this OCR method produces few errors when detecting the characters of words.


Fig.4 OCR system Devnagari languages [1]
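The line detection rule of Method 4, where the boundary between two text lines is the row with the lowest histogram value between two consecutive headlines, can be sketched as below. The sketch assumes a binary image in which 1 marks ink, and it treats rows whose ink count reaches a fraction `headline_ratio` of the maximum as headline rows; that threshold, the function name `line_boundaries`, and the toy image are assumptions, not details from the source.

```python
def line_boundaries(image, headline_ratio=0.8):
    """Locate headline rows and the boundary row between consecutive headlines.

    image: list of rows, each a list of 0/1 pixels (1 = ink).
    """
    profile = [sum(row) for row in image]          # row-wise ink histogram
    peak = max(profile)
    # Rows dense enough to be treated as part of a headline (assumed threshold).
    dense_rows = [y for y, v in enumerate(profile) if v >= headline_ratio * peak]
    # Collapse runs of adjacent dense rows into one headline position each.
    headlines = [y for i, y in enumerate(dense_rows)
                 if i == 0 or y - dense_rows[i - 1] > 1]
    boundaries = []
    for top, bottom in zip(headlines, headlines[1:]):
        segment = profile[top + 1:bottom]
        # The least-inked row between two headlines separates the two lines.
        boundaries.append(top + 1 + segment.index(min(segment)))
    return headlines, boundaries


# Toy page with two text lines, each starting with a full-width headline.
page = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
]
headlines, boundaries = line_boundaries(page)
print(headlines, boundaries)
```

On a gray-scale scan the same idea applies after inverting the image so that darker pixels contribute larger values to the row sums.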

Method 5 Text line segmentation without Binarization [7]

According to Angelika Garz, Horst Bunke, Robert Sablatnig & Andreas Fischer, text lines can be detected with a clustering technique that requires no binarization. The method addresses problems of real-life documents such as stains and aging of the text. First, interest points are identified in the gray scale image. Next, clusters are extracted in high-density regions, and touching components such as ascenders and descenders are separated using seam carving. Finally, text lines are generated by concatenating neighboring clusters, taking the orientation of the words in the document into account.

For this method the authors used a dataset of Carolingian handwriting [8] that includes ascenders, descenders, and capital letters with long strokes and little structure. As in the layout analysis method, the Difference of Gaussian [9] is used to detect interest points. DBSCAN clustering is applied to group consecutive characters into words, and the text area is calculated for each word cluster. Seam carving [7] is used to identify and separate text lines. The text lines are finally generated by joining the nearest word clusters. In terms of implementation and results, this method was found to make text line detection in ancient documents easy, and it is also useful for real-world handwriting applications.
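The seam carving step used to separate touching text lines can be illustrated with a minimal dynamic-programming sketch that finds the cheapest left-to-right seam through an energy map. How the energy map is derived from the gray scale image is not specified here; the sketch simply assumes that ink carries high energy, and the function name `min_energy_horizontal_seam` and the toy map are hypothetical.

```python
def min_energy_horizontal_seam(energy):
    """Return one row index per column for the cheapest left-to-right seam.

    energy: list of rows of non-negative costs (high cost = ink, assumed).
    """
    rows, cols = len(energy), len(energy[0])
    cost = [[energy[r][0]] + [0] * (cols - 1) for r in range(rows)]
    back = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        for r in range(rows):
            # A seam may move at most one row up or down between columns.
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            prev = min(candidates, key=lambda p: cost[p][c - 1])
            cost[r][c] = energy[r][c] + cost[prev][c - 1]
            back[r][c] = prev
    # Start from the cheapest end point and follow the back-pointers.
    r = min(range(rows), key=lambda p: cost[p][cols - 1])
    seam = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        seam.append(r)
    return seam[::-1]


# Toy energy map: rows 0 and 3 are text lines; a descender (5) and an
# ascender (7) intrude into the gap between them.
energy = [
    [9, 9, 9, 9],
    [0, 0, 5, 0],
    [1, 7, 0, 3],
    [9, 9, 9, 9],
]
seam = min_energy_horizontal_seam(energy)
print(seam)  # [1, 1, 2, 1]
```

The seam bends around the intruding strokes instead of cutting through them, which is the property that makes seam carving attractive for separating lines whose ascenders and descenders touch.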

Method 6 Text Line Extraction [10]

F. Kleber, R. Sablatnig & M. Gau describe a text line extraction method based on evaluating the ruling of Glagolitic texts. It is suitable for corrupted manuscripts because it extrapolates the baselines using a priori knowledge of the ruling. Text detection depends on the ruling of the text, i.e. line distance, line positions, and line orientation. The upper and lower baselines are used to calculate a synthetic ruling [10]. Character detection is complicated by angled lines, skewed lines, distorted text, intersecting lines, inline text, line structure broken by differing line heights, and the font of initials, all of which make character recognition hard. The authors follow these steps:

a) Preprocessing
b) Finding base lines
c) Extrapolation

In the preprocessing step the distortion is corrected, using the detection technique of [12]. After distortion correction, noise is removed from the input data using morphological operations [10]. The authors chose a size of 131*131 pixels for interpolation. For spotted or thin (bony) text, binarization followed by image labeling is used to correct the text.

In the next step, finding base lines, the text filtered in the previous step is processed again to delete noise belonging to some of the connected components (CC). After noise removal the average text height is calculated and used to segment the page according to text height. Any connected component that does not fit the ruling is ignored. For each connected component the nearest neighbor is then searched. Lines are detected by counting the gaps between text components: if the number of gaps is large, the components form a horizontal line. Principal Component Analysis is applied to the upper or lower corner points of the bounding boxes of all CCs in the same line [10].

The extrapolation step fills the gaps in the detected lines. Here an angle deviation of +20 or -20 is considered, and the results showed that this angle gives improved results. Finally, wherever a line needs to be inserted in between, the line is inserted. With this method the text lines were extracted successfully.
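The use of Principal Component Analysis on bounding-box corner points can be sketched for the 2-D case, where the direction of the first principal axis has a closed form. The sketch also checks the result against an angle deviation of 20; the source does not state the unit, so degrees are an assumption here, as are the function names `principal_angle` and `fits_ruling` and the sample points.

```python
from math import atan2, degrees


def principal_angle(points):
    """Orientation in degrees of the first principal axis of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Closed-form direction of the dominant eigenvector of the 2x2 scatter matrix.
    return degrees(0.5 * atan2(2 * sxy, sxx - syy))


def fits_ruling(points, ruling_angle=0.0, max_deviation=20.0):
    """True if the fitted baseline deviates from the ruling by at most max_deviation.

    Degrees are assumed for both angles; wrap-around near +/-90 is ignored.
    """
    return abs(principal_angle(points) - ruling_angle) <= max_deviation


# Made-up lower corner points of bounding boxes along a slightly rising baseline.
baseline = [(0, 0), (10, 1), (20, 2), (30, 3)]
steep = [(0, 0), (1, 1), (2, 2)]
print(round(principal_angle(baseline), 2), fits_ruling(baseline), fits_ruling(steep))
```

A baseline fitted this way can be extended across gaps in a damaged line, and candidates whose orientation strays too far from the known ruling can be discarded.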

TABLE I: DETAILS OF DEVANAGARI HANDWRITTEN CHARACTER RECOGNITION SYSTEMS FOR ANCIENT DOCUMENTS

Sr | Method | Feature | Classifier | Dataset | Accuracy (%)
1 | Layout Analysis for Historical Manuscripts | DBSCAN Clustering | Gaussian classifier | 8000 | 82
2 | GIDOC: Gimp-based Interactive Transcription of Old Text Documents | Finite State Automata | HMM | 8000 | 60
3 | Recognition-based Keyword Spotting | BAG | Neural Network | 4506 | 84
4 | OCR system for identifying characters of languages Bangla & Devnagari | DOG | HMM | 10000 | 75
5 | Text line segmentation without Binarization | DOG | Multistage localization | 5600 | 88
6 | Text Line Extraction | Component connect, Bounding Box | Principal Component Analysis (PCA) | 12000 | 87


III. CONCLUSION & DISCUSSION

All the methods reviewed above are important and useful for recognizing characters in ancient documents. They address problems such as staining and aging of the text, overlapping text lines, characters touching the ruled lines of the page, and writing errors. Comparing the methods, text line segmentation without binarization proves to be effective.

REFERENCES

[1] B. B. Chaudhuri, U. Pal, “An OCR System to Read Two Indian Language Scripts: Bangla & Devnagari”, IEEE, 1997
[2] F. Kleber, R. Sablatnig, M. Gau & H. Miklas, “Ancient Document Analysis Based on Text Line Extraction”, IEEE, 2008
[3] L. Likforman-Sulem, A. Zahour, and B. Taconet, “Text Line Segmentation of Historical Documents: A Survey”, IJDAR, vol. 9, no. 2, pp. 123–138, 2007
[4] Angelika Garz, Robert Sablatnig, Markus Diem, “Layout Analysis for Historical Manuscripts Using SIFT Features”, International Conference on Document Analysis and Recognition, IEEE, 2011
[5] Angelika Garz, Robert Sablatnig, “Multi-Scale Texture-Based Text Recognition in Ancient Manuscripts”, IEEE, 2010
[6] S. A. Mahmoud, “Arabic character recognition using Fourier descriptors and character contour encoding”, Pattern Recognition, IEEE, vol. 27, pp. 815-824, 1994
[7] Angelika Garz, Andreas Fischer, Robert Sablatnig, Horst Bunke, “Binarization-free Text Line Segmentation for Historical Documents Based on Interest Point Clustering”, 10th IAPR International Workshop on Document Analysis Systems, IEEE, 2012
[8] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, IJCV, vol. 60, no. 2, pp. 91–110, 2004


[9] A. Juan, V. Romero, J.A. Sanchez, N. Serrano, A.H. Toselli & E. Vidal, “Handwritten Text Recognition for Ancient Documents”, Workshop and Conference Proceedings 11, 58-65, JMLR, 2010
[10] F. Kleber, R. Sablatnig, M. Gau, H. Miklas, “Ancient Document Analysis Based on Text Line Extraction”, 978-1-4244-2175-6/08, IEEE, 2008
[11] A. Bhardwaj, S. Kompalli, S. Setlur and V. Govindaraju, “An OCR based approach to word spotting in Devanagari documents”, In Proceedings of the 15th SPIE - Document Recognition and Retrieval, volume 6815, 2008
[12] F. Kleber and R. Sablatnig, “Skew Detection Technique Suitable for Degraded Ancient Documents”, In Proc. of the 36th CAA Conference, to appear, 2008
[13] Anurag Bhardwaj, Srirangaraj Setlur and Venu Govindaraju, “Keyword Spotting Techniques for Sanskrit Documents”, Sanskrit Computational Linguistics, Lecture Notes in Computer Science, Springer, Volume 5402, 2009, pp 403-416
[14] Teh C.-H. and Chin R.T., “On Image Analysis by the Methods of Moments”, IEEE Trans. Pattern Analysis and Machine Intelligence, 10, No. 4, pp. 496-513, 1988
[15] A. Bhardwaj, S. Kompalli, S. Setlur and V. Govindaraju, “An OCR based approach to word spotting in Devanagari Documents”, In Proceedings of the 15th SPIE - Document Recognition and Retrieval, volume 6815, 2008
[16] Nikita G D Singh, “Sanskrit word recognition using Gradient feature extraction”, VSRD-IJCSIT, Vol. 2 (3), 2012, 167-174
[17] N. Garg, S. Kaur, “Improvement in Efficiency of Recognition of Handwritten Gurumukhi Script”, IJCST Vol. 2, Issue 3, September 2011
[18] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, “Offline Recognition of Devanagari Script: A Survey”, IEEE Transaction on Systems, Man & Cybernetics, Part C: Applications & Reviews, Vol. 41, No. 6, Nov. 2011
[19] M. Shahi, Dr. A. K. Ahlawat, B. N. Pandey, “Literature Survey on Offline Recognition of Handwritten Hindi Curve Script Using ANN Approach”, International Journal of Scientific and Research Publications, Volume 2, Issue 5, May 2012
