
Segmentation-free Handwritten Chinese Text Recognition with LSTM-RNN

Ronaldo Messina, Jérôme Louradour
A2iA S.A., 39 rue de la Bienfaisance, Paris, France
Email: {rm,jl}@a2ia.com

Abstract—We present initial results on the use of Multi-Dimensional Long-Short Term Memory Recurrent Neural Networks (MDLSTM-RNN) for recognizing lines of handwritten Chinese text without explicit segmentation of the characters. Most Chinese text recognizers in the literature perform a pre-segmentation of the text image into characters. This can be a drawback, as explicit segmentation is an extra step before recognizing the text, and the errors made at this stage directly impact the performance of the whole system. MDLSTM-RNN is now a state-of-the-art technology that provides the best performance on languages with Latin and Arabic characters, hence we propose to apply it to Chinese text recognition. Our results on the data from Task 4 of the ICDAR 2013 competition for handwritten Chinese recognition are comparable in performance with the best reported systems.

I. INTRODUCTION

Most systems that recognize Chinese text use some pre-segmentation of the text line [1], [2], [3], [4] before identifying the detected characters with a classifier. In this paper, we present initial results on the recognition of Chinese Characters (CC) without explicit segmentation, using Connectionist Temporal Classification (CTC) [5], which enables a Multi-Dimensional Long-Short Term Memory Recurrent Neural Network (MDLSTM-RNN) to recognize sequences of characters. Recently, most text line recognition competitions were won by systems using MDLSTM-RNN trained with CTC [6], [7], [8], largely outperforming HMM-based systems. The main difference between the tasks addressed in these competitions and Chinese text recognition is the number of characters to recognize, which is much smaller: at most a few hundred different characters in the case of Arabic and Latin, compared to several thousand for CC. Apart from this difference, there is no particular reason why MDLSTM-RNN could not work with Chinese or with any other language with a large charset. Indeed, the only assumption made about the writing system is that lines of characters must be horizontal, with at most one character per column of pixels. Having to segment each line into characters is quite costly and also increases the chances of errors. To the best of our knowledge, the only work that dispenses with the character segmentation stage for Chinese text recognition is [9], which used Hidden Markov Models. We advocate the use of an MDLSTM-RNN system to deal with the variable number of characters in each line, avoiding explicit character segmentation. The aim of this work is to assess the capacity and performance of

MDLSTM-RNN and CTC-based line recognition systems in recognizing CC. As mentioned above, one of the issues any system recognizing CC has to face is the huge number of characters. In the framework of RNN and CTC, the number of possible characters is (roughly) the number of outputs of the RNN. The main risk when training the network's free parameters using CTC is the numerical underflow that could occur if the probabilities were uniformly spread over too many character classes. Our results, presented in Section IV, show that MDLSTM-RNN actually yields performance at the level of state-of-the-art systems. In our experiments, the system is trained using labeled images from the Institute of Automation of Chinese Academy of Sciences (CASIA) Off-line Chinese Handwriting Databases [10], briefly described in Section II. Performance is evaluated on Task 4 of the ICDAR 2013 Handwritten Chinese Character Recognition Competition [3], which uses data from the CASIA databases. It is measured by the character error rate (CER), which is calculated by aligning the system output (as a sequence of characters) with the available ground-truth of the line snippets. The system we evaluate is described in Section III. Without any prior knowledge of Chinese, we use only character-based n-gram language models, constructed as described in Section III-B. Note that other systems use word- and character-based language models [11], [12], [13]. Using a character-based language model enables us to circumvent the issue of converting the sequence of CC into words, which is a non-trivial problem in Chinese.

II. DATABASES

A. Labeled images of text

We use the data from the CASIA Off-line Chinese Handwriting Databases to train the "optical model" (see Section III) of the recognition system. This database comprises images of both isolated characters and lines of text. The part with only isolated characters has three sub-sets, named HWDB1.0-1.2, and the part with lines of text also has three sub-sets: HWDB2.0-2.2. The details on the collection of data, number of characters and lines are presented in [10]. All these data were grouped and randomly split into two sets for developing the MDLSTM-RNN. About 85% is used to calculate the free parameters of the network: this is usually known as the "training" set. The remaining "validation" data is used to gauge the convergence of the parameters and perform early-stopping, but it is not used to modify the free parameters.
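The paper does not give the exact split procedure or stopping thresholds; the sketch below only illustrates the idea (an 85/15 random partition and a patience-based early-stopping test on the validation CER). File names, patience and threshold values are hypothetical.

```python
import random

def split_casia_lines(samples, train_fraction=0.85, seed=0):
    """Randomly partition (image, transcription) pairs into a training set
    (about 85%) and a validation set, as described above."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

class EarlyStopping:
    """Stop training when the validation CER has not improved by more than
    `min_delta` for `patience` consecutive evaluations."""
    def __init__(self, patience=5, min_delta=1e-3):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_rounds = float("inf"), 0

    def should_stop(self, validation_cer):
        if validation_cer < self.best - self.min_delta:
            self.best, self.bad_rounds = validation_cer, 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience

# Hypothetical usage: 'samples' would be loaded from the CASIA HWDB files.
samples = [("line_%05d.png" % i, "<transcription>") for i in range(10000)]
train_set, valid_set = split_casia_lines(samples)
stopper = EarlyStopping()
```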

The data used to assess performance (the "evaluation" dataset) is the one from the ICDAR 2013 Handwritten Chinese Character Recognition Competition [3], which is a held-out part of the CASIA database. We only explore Task 4, where lines of text in a page are given to the system, so that there is no influence of automatic line detection on the recognition performance. There are 60 writers represented in the evaluation data, which are different from those in the training dataset. Our evaluation dataset is in fact a bit smaller than the reported one, as we ignored lines containing invalid characters. It comprises 3 397 lines, which amount to 90 635 total characters (vs the reported 91 563 tokens). The number of different characters is 1 379 (vs the reported 1 385 "classes"), including 3 outlier characters that are never seen in the training data and that appear 28 times in the evaluation dataset.

B. Textual data

There are four sources of text data in our work. We use the transcriptions in the CASIA database and three openly available corpora: the PH corpus [14] with newswire texts from the Xinhua News Agency between January 1990 and March 1991 (identified as "PH"), the Chinese part of the multilingual UN corpus [15] ("UN") and the Lancaster Corpus of Mandarin Chinese [16] ("LCMC"). The number of characters in each corpus is given in Table I. We ignore lines containing characters not in the charset of our MDLSTM-RNN. The remaining data are used to estimate language models for the recognition system, as described in Section III-B.

TABLE I: Number of characters in the corpora used to estimate the language model.

  Database    #Chars        #UniqueChars
  CASIA       4 901 265     7 347
  PH          3 741 129     4 705
  UN          5 541 943     2 353
  LCMC        1 509 112     4 801

Finally, note that we applied some normalization to the corpora used to train the Optical Model (Section III) and to estimate the Language Model (Section III-B). Namely, we converted ellipsis to three dots, and double and single quotes to the usual western keyboard characters. In addition, some Latin characters that were coded as fullwidth characters were changed into the standard characters obtained with a western keyboard. These normalizations should not have a significant impact on the character error rates, so we can still compare our rates with those found in the literature on the same evaluation data.
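The exact normalization table is not listed in the paper; the following sketch only illustrates the kind of mapping described (ellipsis, curly quotes, and fullwidth Latin characters), with the character table being our own assumption.

```python
import unicodedata

# Assumed mapping for the punctuation normalizations described above;
# the actual table used by the authors is not given in the paper.
PUNCT_MAP = {
    "\u2026": "...",               # horizontal ellipsis -> three dots
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> "
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> '
}

def normalize_line(text):
    """Map ellipsis/quotes and fold fullwidth Latin characters (U+FF01-U+FF5E)
    to their standard ASCII counterparts, leaving Chinese characters untouched."""
    out = []
    for ch in text:
        if ch in PUNCT_MAP:
            out.append(PUNCT_MAP[ch])
        elif "\uff01" <= ch <= "\uff5e":
            # NFKC normalization folds fullwidth forms to halfwidth ASCII
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

print(normalize_line("ＡＢＣ…“你好”"))  # -> ABC..."你好"
```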

III. RECOGNITION SYSTEM

A. Optical Model

Our recognition system uses a Multi-Dimensional Long-Short Term Memory Recurrent Neural Network (MDLSTM-RNN) [17], [18] for modeling sequences of characters; this network is referred to in the following as the "optical model". The input is an image of a text line that is scanned in four different directions [19], producing predictions for each character in the charset.

The scanning procedure can be seen as dividing the image into regularly spaced "frames" (vertical slices of the input image). For each frame, the network produces an output vector of probabilities. Each element in the vector is associated with one of the characters in the charset. The charset includes Chinese and Latin characters along with punctuation marks, white space and also a special character, dubbed "blank", which is active either when the frame corresponds to the transition between two characters or when it matches the same character as the previous frame [17]. Images are gray-scale and our system uses the raw pixel values, keeping their 2D structure. The pixel values are normalized so as to have zero mean and unit variance on the training dataset, and no actual feature extraction is performed.

The network's free parameters are trained to minimize the negative log-likelihood (NLL) of the target transcription given the sequence of outputs predicted by the network, for all images in the training data. Connectionist Temporal Classification [5] is an efficient method to sum over all alignments between the sequence of network outputs for an image and the sequence of characters in its transcription. This method avoids explicit character segmentation and also enables calculating the NLL during training. Minimization of the NLL is performed with Stochastic Gradient Descent [20], with a constant learning rate (1e-3) in all epochs. Convergence is declared when the Character Error Rate on a validation set does not decrease by more than a threshold for a given number of iterations. Dropout [21] is employed to regularize the parameters and was found to boost the performance [22]. It is applied after each LSTM layer of the network, whose architecture is illustrated in Figure 2.

Generally speaking, it is also possible to adapt an existing network to new data. When the charset of the new data is different, the last layer of the network can be modified accordingly (removing connections corresponding to deprecated characters and adding zero-weight connections for new characters), whereas the rest of the network is kept as it is. Adaptation reduces the time needed to obtain an optical model and, in our experience, it is always better to adapt the parameters of an existing network on new data than to estimate the parameters from scratch with a random initialization. The intuition is that the first layers, which are connected directly or closely to the input images, learn low-level features that characterize images of text in general and are hopefully not very dependent on the language.

We had no system for Chinese characters to adapt from, and no certainty whether adaptation could be useful when the charset is completely new and particularly large. We therefore adapted an existing model for handwritten French [8] on a small subset of the Chinese data (with about 2 000 elements in the charset) and compared its convergence rate and final performance with a model of the same architecture trained from scratch. The adapted RNN converged faster and reached a much lower error rate, as illustrated in Figure 1. As the convergence was successful, we adapted this first network on the whole Chinese training dataset and charset. We report the performance of the resulting system in Section IV.
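The paper does not name a training framework. Purely as an illustration, here is a minimal sketch of one CTC training step in PyTorch, with plain SGD at the constant learning rate of 1e-3 mentioned above. The stand-in model, charset size and synthetic batch are placeholders, not the MDLSTM-RNN of Figure 2 (which works on raw pixels, whereas this sketch assumes pre-extracted frame features).

```python
import torch
import torch.nn as nn

# Stand-in for the "optical model": any module producing, for each image,
# a sequence of per-frame scores over the charset plus the CTC blank.
num_classes = 1379 + 1           # characters + the "blank" (index 0 here)
optical_model = nn.Sequential(   # placeholder, NOT the architecture of Fig. 2
    nn.Linear(64, 128), nn.Tanh(), nn.Linear(128, num_classes)
)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # sums over all alignments
optimizer = torch.optim.SGD(optical_model.parameters(), lr=1e-3)

# One synthetic mini-batch: 2 images, 100 frames each, 64 features per frame.
frames = torch.randn(100, 2, 64)                    # (T, N, features)
targets = torch.randint(1, num_classes, (2, 20))    # character indices, no blank
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

log_probs = optical_model(frames).log_softmax(dim=2)   # (T, N, C)
nll = ctc_loss(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
nll.backward()        # gradient of the negative log-likelihood
optimizer.step()      # one stochastic gradient descent update
print(float(nll))
```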

[Fig. 2 diagram: Input image → MDLSTM (2x2 blocks, 2 features) → Convolutional (input 2x4, stride 2x4, 6 features) → Sum & Tanh → MDLSTM (10 features) → Convolutional (input 2x4, stride 2x4, 20 features) → Sum & Tanh → MDLSTM (50 features) → Fully-connected (N features) → Sum → Collapse → N-way softmax → CTC, with dropout after each MDLSTM layer.]

Fig. 2: Architecture of the MDLSTM-RNN used in our system.

[Fig. 1 plot: CER (0.2 to 1.0) vs. number of updates / 1k (0 to 700), for the model trained from scratch ("Scratch") and the model adapted from the French recognizer ("MaurdorHWRFR").]

Fig. 1: Convergence curves for the models from scratch and from an existing one (dashed line).
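As a concrete illustration of the last-layer adaptation described in Section III-A, the sketch below re-uses all layers of a trained PyTorch model except the final fully-connected layer, which is rebuilt for the new charset: rows for characters shared between the two charsets keep their weights, new characters start at zero, and deprecated characters are dropped. The attribute name `final_layer` and the charset dictionaries are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def adapt_output_layer(model, old_charset, new_charset):
    """Replace the last fully-connected layer of `model` (assumed stored in
    `model.final_layer`) so that it covers `new_charset`."""
    old_fc = model.final_layer
    new_fc = nn.Linear(old_fc.in_features, len(new_charset))
    with torch.no_grad():
        new_fc.weight.zero_()                           # new characters: zero weights
        new_fc.bias.zero_()
        for ch, new_idx in new_charset.items():
            if ch in old_charset:                       # shared characters: copy weights
                old_idx = old_charset[ch]
                new_fc.weight[new_idx] = old_fc.weight[old_idx]
                new_fc.bias[new_idx] = old_fc.bias[old_idx]
    model.final_layer = new_fc          # the rest of the network is untouched
    return model

# Hypothetical usage with a toy model holding a final layer over 3 characters:
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.final_layer = nn.Linear(8, 3)

model = adapt_output_layer(Toy(), old_charset={"a": 0, "b": 1, "c": 2},
                           new_charset={"b": 0, "d": 1})
print(model.final_layer.weight.shape)   # torch.Size([2, 8])
```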

Regarding the computational complexity, we observed an important increase in the time spent to compute the output normalization of the network (the "softmax" layer) over the 1 379 output character classes. Whereas the time spent on this normalization is often negligible with about a hundred classes, it amounts to roughly 10% of the total time for Chinese recognition, given the number of possible characters. Future work includes more efficient computation, as in [23], to reduce this burden.

B. Language modeling

Decoding with the RNN can be done simply by taking the output with the highest activation for each frame, then removing the repetitions of characters and blanks, and lastly removing all remaining occurrences of blank: we will call this procedure "raw" RNN decoding (a sketch is given below). Even if the results of this baseline approach can be quite good, the accuracy can still be improved by incorporating lexical constraints and priors about the language. For this purpose, we estimate character n-gram Language Models (LM) on textual data, handled as sequences of Chinese characters.
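A minimal sketch of the "raw" decoding procedure mentioned above: pick the most active class in each frame, collapse consecutive repetitions, and drop the blanks. The per-frame scores are assumed here to be plain Python lists of probabilities.

```python
def raw_rnn_decode(frame_probs, charset, blank=0):
    """Best-path ("raw") CTC decoding.

    frame_probs: sequence of per-frame probability (or score) vectors,
                 one entry per character class, index `blank` for blank.
    charset:     list mapping class index -> character.
    """
    # 1. pick the most active class in each frame
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. collapse consecutive repetitions, 3. remove the blanks
    decoded, previous = [], None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(charset[label])
        previous = label
    return "".join(decoded)

# Tiny example with 3 classes: blank, "中", "国"
charset = ["<blank>", "中", "国"]
probs = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
print(raw_rnn_decode(probs, charset))   # -> "中国"
```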

Punctuation marks and other particular symbols are treated as any other character in the LM. Note that the Language Model and the Optical Model handle the same set of tokens, which is not the case when performing state-of-the-art text recognition on Latin languages, with an LM on words and an Optical Model on characters. Different combinations of the textual data presented in Section II-B are used to estimate language models; we present recognition results for all of them in Table II. We estimate the LMs using the SRILM toolkit [24], with orders in the range n = 2 ... 5, pruning at a threshold of 1e-7, and Witten-Bell [25] smoothing. We use the Kaldi [26] decoder to apply the language model to the network outputs, using finite-state transducers [27] created with the popular "HCLG recipe" [28].
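To make the character-level language modeling concrete, here is a small self-contained sketch of an interpolated Witten-Bell character bigram on a toy corpus. The actual models are estimated with SRILM, so this only illustrates the smoothing scheme of [25], not the authors' toolchain.

```python
from collections import Counter, defaultdict

class WittenBellBigram:
    """Interpolated Witten-Bell character bigram:
       P(c | h) = (count(h,c) + T(h) * P_uni(c)) / (count(h) + T(h)),
       where T(h) is the number of distinct characters seen after h.
       The unigram is Witten-Bell smoothed against a uniform distribution
       over the charset."""

    def __init__(self, lines, charset):
        self.charset = charset
        self.uni = Counter()
        self.bi = defaultdict(Counter)
        for line in lines:
            chars = ["<s>"] + list(line)
            self.uni.update(chars)
            for h, c in zip(chars, chars[1:]):
                self.bi[h][c] += 1

    def p_unigram(self, c):
        n = sum(self.uni.values())
        t = len(self.uni)                       # distinct characters seen
        return (self.uni[c] + t / len(self.charset)) / (n + t)

    def p(self, c, h):
        hist = self.bi.get(h)
        if not hist:                            # unseen history: back off fully
            return self.p_unigram(c)
        t = len(hist)                           # distinct successors of h
        return (hist[c] + t * self.p_unigram(c)) / (sum(hist.values()) + t)

# Toy corpus of two "lines" of Chinese text
lm = WittenBellBigram(["中国人民", "中文文本"], charset=set("中国人民文本识别"))
print(lm.p("国", "中"), lm.p("本", "中"))   # the seen bigram scores higher
```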

IV. RESULTS

In the ICDAR 2013 competition, performance was measured in terms of Accuracy Rate (AR) and Correct Rate (CR), defined as follows:

AR = (N − Sub − Ins − Del) / N
CR = (N − Sub − Del) / N

where N is the number of characters in the line, and Sub, Ins, and Del are the numbers of substitutions, insertions and deletions obtained when aligning the recognition hypothesis with the reference transcription. AR can be negative if lines suffer from severe over-segmentation. The Character Error Rate (CER) is simply equal to 100% − AR. The best system in the competition [3] reached CER = 13.3% (AR = 86.7%, CR = 88.8%), while the best reported system at that time [2] achieved a CER of 10.7% (AR = 89.3%, CR = 90.2%). The insertion rate of these systems based on explicit character segmentation lies between 1% and 2%.

Table II presents the results of our system with raw RNN decoding and with different Language Models, depending on which data are used to estimate them. The raw RNN decoding reaches 16.5% character errors. By using a Language Model, this error rate is decreased to 10.6%, and the insertion rate is about 0.5%.
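A minimal sketch of how AR and CR can be computed: a standard Levenshtein alignment between the hypothesis and the reference gives the substitution, insertion and deletion counts, from which the two rates follow (CER = 1 − AR). The example strings are only illustrative.

```python
def align_counts(ref, hyp):
    """Dynamic-programming (Levenshtein) alignment of two character strings;
    returns (substitutions, insertions, deletions) for one optimal alignment."""
    # Each cell holds (edit_distance, subs, ins, dels) for ref[:i] vs hyp[:j].
    dp = [[(j, 0, j, 0) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            d, s, n, l = dp[i - 1][j - 1]
            match = ref[i - 1] == hyp[j - 1]
            cand = [dp[i - 1][j - 1] if match else (d + 1, s + 1, n, l)]
            d, s, n, l = row[j - 1]
            cand.append((d + 1, s, n + 1, l))      # insertion in the hypothesis
            d, s, n, l = dp[i - 1][j]
            cand.append((d + 1, s, n, l + 1))      # deletion from the reference
            row.append(min(cand))
        dp.append(row)
    return dp[-1][-1][1], dp[-1][-1][2], dp[-1][-1][3]

def rates(ref, hyp):
    """Accuracy Rate and Correct Rate as defined above."""
    sub, ins, dele = align_counts(ref, hyp)
    n = len(ref)
    return (n - sub - ins - dele) / n, (n - sub - dele) / n

print(rates("中国人民银行", "中国个人民银行"))   # one insertion: AR ~0.83, CR = 1.0
```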

TABLE II: Character Error Rate (%) on the CASIA evaluation dataset, for our approach with raw RNN decoding and with different language models (characterized by the data used to estimate them). State-of-the-art performance is also indicated.

                                          n-gram order
  Language Model data               2      3      4      5
  This work
    No LM (raw RNN decoding)      16.5      -      -      -
    CASIA                         14.9   14.0   13.9   13.9
    CASIA+UN                      13.9   13.0   12.9   12.8
    CASIA+LCMC                    12.5   11.6   11.6   11.6
    CASIA+PH                      12.0   10.9   10.8   10.8
    CASIA+UN+LCMC+PH              12.0   10.7   10.6   10.9
  ICDAR 2013 winner [3]           13.3      -      -      -
  Best reported [2]               10.7      -      -      -

[Fig. 3(a): histogram of per-character error rates (frequency vs. character error, 0-100%) with raw RNN decoding, plotted separately for characters in "isolatedChars" and the rest.]

Although the results are not rigorously comparable, we can state that the performance reached with the "CASIA+UN+LCMC+PH" data for the LM is very close to the best reported results [2]. Note that this system [2] uses a much larger corpus for language modeling (roughly six times more characters), and pretty much the same labeled data to train the optical model (CASIA HWDB1.x and 2.x). MDLSTM-RNN provides high quality character predictions, but the Language Model also plays an important role. Many characters are visually similar in Chinese, therefore the LM helps by giving low scores to sequences that are neither likely nor valid in Chinese.

The textual corpus that yields the most improvement with respect to the "smallest" Language Model is PH: it alone practically dominates the performance improvement. This can be explained by the fact that it contains news texts, which is the same kind of material as in the CASIA evaluation database [10]. The UN corpus is the largest in terms of number of characters, but it has the smallest number of unique characters, and we would expect its contents to be quite distinct from news; that is what the results show: not much improvement over the baseline. The LCMC corpus, which is much smaller but contains data similar to news, brings more improvement than the UN corpus. These experimental results also show that there is no real improvement with order above n = 3 for the n-gram LM; the size of the text dataset limits the quality of the estimated probabilities as the model order increases.

There is a set of characters appearing only in isolation in the training images; we will call them "isolatedChars". This set is larger than the set of characters appearing along with others in the text lines. In the evaluation data, the lines having at least one character in "isolatedChars" contain a total of 8 990 characters, among them 388 from the isolated-only set. As only a few characters are in this situation, they should not hamper the final performance; but if they were much more frequent, the performance of MDLSTM-RNN systems could be lowered, as those characters are never presented to the network in any other context. In Figure 3, we show the distribution of the character error rate for all characters in the evaluation dataset, split into the "isolatedChars" set and the rest (see Section II-B for details about the charset).

[Fig. 3(b): the same histogram when decoding with the 4-gram LM ("CASIA+UN+LCMC+PH").]

Fig. 3: Distribution of the character error rates, for chars in “isolatedChars” and the rest, for the raw output and using a language model.

The distribution is presented for both the raw RNN decoding (Fig. 3(a)) and when using a 4-gram LM (Fig. 3(b)). In the data from sources other than the training dataset, used to create the language models, about 30% of the lines contain at least one character present in "isolatedChars". It is interesting that the language model improves the recognition rate of the characters that were present only in isolation in the training dataset, but a large percentage of those characters are still not recognized as well as those seen in lines of text during training (cf. Fig. 3(b)).

V. CONCLUSION

We presented a handwritten text recognition system for Chinese characters based on MDLSTM-RNN that works without segmenting the line into characters. The raw outputs of the network yield a character error rate of 16.5%, and the use of a language model at the character level brings the performance to 10.6%, which is at the same level as the best published result on the evaluation data of Task 4 in the ICDAR 2013 competition [3].

Line detection is more robust than character detection, and annotating the positions of each character is much more time-consuming and error-prone. Reducing the annotation effort to just line detection and transcription is therefore of economic interest when creating new databases for training handwritten Chinese recognition systems.

MDLSTM-RNN showed a higher error rate for characters that were only seen in isolation during training, which lowers the performance of our segmentation-free approach using CTC on those characters. However, this is mostly an artifact of data preparation: we believe that if we avoid presenting some characters only in isolation, this should not be a real issue.

There is more to be done on the efficient calculation of a large output softmax layer in the neural network. Future work is to use hierarchical computation as in [23] to reduce the burden.

REFERENCES

[1] S. N. Srihari, X. Yang, and G. R. Ball, "Offline Chinese handwriting recognition: an assessment of current technology," Frontiers of Computer Science in China, vol. 1, no. 2, pp. 137-155, 2007.
[2] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese Text Recognition by Integrating Multiple Contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1469-1481, 2012.
[3] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese Handwriting Recognition Competition," in 12th International Conference on Document Analysis and Recognition, ser. ICDAR '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 1464-1470.
[4] Y. Zou, K. Yu, and K. Wang, "Continuous Chinese handwriting recognition with language model," in 11th International Conference on Frontiers in Handwriting Recognition, 2008, pp. 344-348.
[5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in 23rd International Conference on Machine Learning. ACM, 2006, pp. 369-376.
[6] E. Grosicki and H. El Abed, "ICDAR 2009 handwriting recognition competition," in 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 1398-1402.
[7] J. Sanchez, V. Romero, A. Toselli, and E. Vidal, "ICFHR2014 competition on handwritten text recognition on transcriptorium datasets (HTRtS)," in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014, pp. 181-186.
[8] B. Moysset, T. Bluche, M. Knibbe, M. F. Benzeghiba, R. Messina, J. Louradour, and C. Kermorvant, "The A2iA Multi-lingual Text Recognition System at the Maurdor Evaluation," in International Conference on Frontiers in Handwriting Recognition, 2014.
[9] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognition, vol. 42, no. 1, pp. 167-182, 2009.
[10] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA Online and Offline Chinese Handwriting Databases," in ICDAR. IEEE, 2011, pp. 37-41.
[11] Q.-F. Wang, F. Yin, and C.-L. Liu, "Integrating language model in handwritten Chinese text recognition," in 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 1036-1040.
[12] H.-M. Wang, J.-L. Shen, Y.-J. Yang, C.-Y. Tseng, and L.-S. Lee, "Complete recognition of continuous Mandarin speech for Chinese language with very large vocabulary but limited training data," in International Conference on Acoustics, Speech, and Signal Processing, vol. 1. IEEE, 1995, pp. 61-64.
[13] Y. Li, C. L. Tan, and X. Ding, "A hybrid post-processing system for offline handwritten Chinese script recognition," Pattern Anal. Appl., vol. 8, no. 3, pp. 272-286, 2005.
[14] G. Jin. The PH corpus. [Online]. Available: ftp://ftp.cogsci.ed.ac.uk/pub/chinese
[15] A. Eisele and Y. Chen, "MultiUN: A multilingual corpus from United Nation documents," in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, Eds. Valletta, Malta: European Language Resources Association (ELRA), May 2010.
[16] A. McEnery and Z. Xiao, "The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study," in LREC. European Language Resources Association, 2004.
[17] A. Graves and J. Schmidhuber, "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks," in NIPS, 2008, pp. 545-552.
[18] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," in 17th International Conference on Artificial Neural Networks, 2007, pp. 549-558.
[19] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, F. Benzeghiba, and C. Kermorvant, "The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation," in International Workshop on Document Analysis Systems, 2014.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998, pp. 2278-2324.
[21] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012.
[22] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, "Dropout improves recurrent neural networks for handwriting recognition," in International Conference on Frontiers in Handwriting Recognition, 2014.
[23] F. Morin and Y. Bengio, "Hierarchical probabilistic neural network language model," in AISTATS, vol. 5, 2005, pp. 246-252.
[24] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit," in Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 257-286.
[25] I. Witten and T. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE Transactions on Information Theory, vol. 37, no. 4, 1991.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011. IEEE Catalog No.: CFP11SRW-USB.
[27] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," in Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007), ser. Lecture Notes in Computer Science, vol. 4783. Springer, 2007, pp. 11-23. http://www.openfst.org
[28] M. Mohri, "Finite-State Transducers in Language and Speech Processing," Computational Linguistics, vol. 23, pp. 269-311, 1997.