Chinese Character Recognition: History, Status, and Prospects

DAI Ruwei1, LIU Chenglin2, and XIAO Baihua1

1 Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 10080, China {ruwei.dai,baihua.xiao}@ia.ac.cn
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 10080, China [email protected]

Abstract. Chinese character recognition (CCR) is an important branch of pattern recognition. It has long been considered an extremely difficult problem because of the very large number of categories, the complicated structures, the similarity between characters, and the variability of fonts and writing styles. Because of its unique technical challenges and great social needs, the last four decades have witnessed intensive research in this field and a rapid increase in successful applications. However, higher recognition performance is still needed to improve existing applications and to open up new ones. This paper first provides an overview of Chinese character recognition and the properties of Chinese characters. Some important methods and successful results in the history of Chinese character recognition are then summarized. As for classification methods, this article pays special attention to the syntactic-semantic approach for online Chinese character recognition, as well as the meta-synthesis approach for discipline crossing. Finally, the remaining problems and possible solutions to them are discussed.

1 Introduction

Chinese character recognition is an important branch of pattern recognition [1-5]. Its solution relies on techniques from many fields: image processing, machine learning, cognitive science (noetic science), linguistics, etc. Since the start of pattern recognition research in the 1950s, character recognition has been a major test case and a stimulator of pattern recognition methodology. At the first workshop on pattern recognition, held in Puerto Rico, US, in 1966, about one third of the papers dealt with character recognition [6]. Approaches such as blurring [7], directional pattern matching [8,9], hierarchical classification [10,11] and multiple classifier combination [12,13] were first proposed by the character recognition community and later evolved into attractive research fields.

Character recognition systems contribute tremendously to the advance of automation and can be of significant benefit to man-machine communication in many applications, such as postal mail sorting, business card reading, bank check and transaction form processing, and, recently, digital libraries and mobile phones. Chinese characters are used by over 1.3 billion people in China and some other countries or areas, but typing Chinese characters into computers is not a trivial task. In China, many people cannot even use phonetic codes for character entry because they habitually speak a dialect and cannot pronounce Mandarin correctly. So, the automatic recognition of Chinese characters would have widespread benefits.

Chinese character recognition has been considered an extremely difficult problem because of the very large number of categories, the complicated structures, the similarity between characters, and the variability of fonts and writing styles. Because of its unique technical challenges and great social needs, the last four decades have witnessed intensive research in this field and a rapid increase in successful applications. This paper provides a brief review of this field, outlines the important methods and advances, and discusses potential future research directions.

Approaches to character recognition are divided into online and offline, depending on the hardware and application mode. Recognition is called “online” if the temporal sequence of the pen trajectory (captured by, e.g., a digitizing tablet) is available. The pen trajectory is recognized immediately after it is written, and the user can respond to the recognition result (by correcting the result or re-writing). Recognition is “offline” if the task is to recognize previously written text, which is converted to images using a scanner or a camera. This paper covers both online and offline Chinese character recognition.

The rest of this paper is organized as follows. Section 2 describes the properties of Chinese characters. Section 3 briefly reviews the history of Chinese character recognition and the state of the art. Section 4 addresses the progress from the syntactic to the syntactic-semantic approach and its application to online Chinese character recognition. Section 5 discusses the discipline crossing between pattern recognition and systems science, as well as the resulting meta-synthesis approaches. Finally, Section 6 discusses the remaining problems and the possibilities for solving them.

2 Properties of Chinese Characters

Chinese characters have unique structures compared to western characters, and this uniqueness poses technical challenges to recognition. This section summarizes the properties of Chinese characters.

2.1 Evolution of Chinese Characters

Fig. 1 demonstrates the evolution of Chinese characters. The origin of Chinese characters can be traced back to oracle script and script on bronze before 1000 BC. Official script was invented in the Qin Dynasty (about 220 BC) and became popular in the Han Dynasty. Its shape is very similar to that of contemporary characters. Regular script, cursive script and fluent script were invented in the late Han Dynasty (about 180 AD). Since then, while spoken Chinese varies across regions, the written Chinese characters have remained relatively stable. The regular, cursive and fluent scripts are still commonly used today. However, as can be seen in Fig. 2, the traditional Chinese characters have many strokes. To ease writing, the Chinese government carried out a reform of Chinese characters and published 2,235 simplified characters during 1956-1964. The average number of strokes for the 2,235 characters was reduced from 16.03 to 10.3. The simplified characters, together with the characters that were not simplified, have become the standard for official communication across China.

Fig. 1. Examples of the evolution of Chinese characters (from left to right: ‘sun’, ‘moon’, ‘vehicle’, and ‘horse’).

Fig. 2. Examples of traditional and simplified Chinese characters (upper for traditional, lower for simplified).

2.2 Chinese Character Set

Chinese characters are used in daily communications by over one quarter of the world’s population, mainly in Asia, such as China, Korea, Japan, and Singapore. There are mainly three character sets: traditional Chinese characters, simplified Chinese characters, and Japanese Kanji [5]. In Japan, 2,965 Kanji characters are included in the JIS level-1 standard and 3,390 Kanji characters are in the level-2 standard. Japanese Kanji characters mostly have the same shape as the corresponding traditional or simplified Chinese characters. In Taiwan of China, 5,401 traditional characters are included in a standard set. In the mainland of China, three character sets, containing 6,763, 20,902 and 27,533 Chinese characters, respectively, were announced as National Standards (see Table 1). The 6,763 characters in GB2312-80 cover 99.99% of usage, but still do not suffice. In particular, many characters used in personal names and place names are not included in this set. A general-purpose recognizer needs to cover about 9,000 simplified characters, about 3,000 of which have different traditional shapes. In addition, about 1,000 symbols and special characters should be included. In academic experiments, usually 3,755 characters are considered. The very large number of categories poses a technical challenge for efficient and accurate classification of Chinese characters.

Table 1. National standards of Chinese character set.

National standard   Number of characters                              Description
GB2312-80           Level-1: 3,755; Level-2: 3,008 (6,763 in total)   Simplified
GBK                 20,902                                            Simplified and Traditional
GB18030-2000        27,533                                            Plus characters of minority nationalities
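The level-1 and level-2 counts of GB2312-80 in Table 1 can be checked against a code table. The sketch below is only an illustration and rests on two assumptions that are not stated in this paper: that level-1 characters occupy rows 16 to 55 and level-2 characters rows 56 to 87 of the 94x94 GB2312 code table, and that Python's built-in `gb2312` codec rejects unassigned positions. The helper name `count_gb2312` is ours.

```python
def count_gb2312(first_row, last_row):
    """Count positions in GB2312 rows [first_row, last_row] that decode to a character."""
    count = 0
    for row in range(first_row, last_row + 1):
        for col in range(1, 95):  # each row has 94 cells
            raw = bytes([0xA0 + row, 0xA0 + col])  # EUC-CN byte pair
            try:
                raw.decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned position
            count += 1
    return count

level1 = count_gb2312(16, 55)  # assumed level-1 rows
level2 = count_gb2312(56, 87)  # assumed level-2 rows
print(level1, level2, level1 + level2)
```

If the codec follows the standard's code table, the three printed numbers should agree with the GB2312-80 row of Table 1.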

2.3 Character Structures

Chinese characters are ideographs with complicated structures. Many Chinese characters contain relatively independent substructures, called radicals, and some common radicals are shared by different characters. That is to say, a Chinese character is composed of radicals, which are in turn composed of straight-line or poly-line strokes (see Fig. 3). As far as we know, the most complicated Chinese character has 36 strokes (see the bottom left of Fig. 3). The total number of radicals and single-component characters in Chinese characters is about 500.

Fig. 3. Examples of Chinese character structures (the right panel shows a complicated Chinese character with five radicals, some of which can be further decomposed).

The patterns of Chinese character structures can be roughly categorized into 10 types (single-radical, left-right, up-down, up-left, up-right, left-down, up-left-down, left-up-right, left-down-right, and enclosure), see Fig. 4. Some of the patterns can be further divided into sub-categories. The structural complexity of Chinese characters is a merit for recognition: it carries rich information for discriminating different characters. The hierarchical character-radical-stroke structure can be utilized in recognition to largely reduce the size of the reference model database and to speed up recognition. However, the complexity of the structures makes structural description difficult.
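The saving from the character-radical-stroke hierarchy can be illustrated with a toy decomposition table. The sketch below is not taken from any system reviewed here: the nine characters, their decompositions, and the name `CHARACTERS` are illustrative assumptions, and a real recognizer would store shape models rather than labels.

```python
# Hypothetical decomposition table: character -> (structure type, radicals).
CHARACTERS = {
    "明": ("left-right", ["日", "月"]),
    "朋": ("left-right", ["月", "月"]),
    "林": ("left-right", ["木", "木"]),
    "叶": ("left-right", ["口", "十"]),
    "杏": ("up-down", ["木", "口"]),
    "早": ("up-down", ["日", "十"]),
    "吕": ("up-down", ["口", "口"]),
    "昌": ("up-down", ["日", "日"]),
    "古": ("up-down", ["十", "口"]),
}

# Each shared radical needs only one reference model, however many
# characters contain it.
radical_models = {r for _, radicals in CHARACTERS.values() for r in radicals}

print(len(CHARACTERS), "characters share", len(radical_models), "radical models")
```

In this toy table, nine characters are described with five radical models plus one structure label each, which is the kind of reduction the hierarchical representation aims at.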


Fig. 4. The 10 types of Chinese character structures (single-radical, left-right, up-down, up-left, up-right, left-down, up-left-down, left-up-right, left-down-right and enclosure).

Besides the large number of categories and the complexity of structures, there are many similar Chinese characters which differ only slightly (see Fig. 5). The similar characters are hard to discriminate by computer recognizers.

Fig. 5. Examples of similar Chinese character pairs.

2.4 Writing Styles

The enormous variety of writing styles of different persons can be roughly divided into three categories: regular script (also called handprint), fluent script, and cursive script. The intermediate style between regular and fluent is called fluent-regular, and the intermediate style between fluent and cursive is called fluent-cursive [5]. Some examples of the three typical styles are shown in Fig. 6. The strokes of regular script are mostly straight-line segments. Fluent script has many curved strokes and, frequently, successive strokes are connected. In cursive script, some character shapes differ drastically from the standard shape.


Fig. 6. Examples of three major writing styles (from left to right: regular script, fluent script, cursive script).

3 Historical Review of the Technology

In the following, we review the history of Chinese character recognition with respect to the evolution of recognition targets, the evolution of recognition methods, and the important techniques and representative results. Finally, we summarize recent advances in the state of the art.

3.1 Evolution of Recognition Target

Since Casey and Nagy published the first work on printed Chinese character recognition [10], the recognition of Chinese characters has evolved into an attractive research area. The evolution of the recognition target, from machine-printed to handwritten and from online to offline, basically follows an easy-to-difficult order. As summarized in Table 2, both printed Chinese character recognition and online handprinted Chinese character recognition were started in the mid-1960s by researchers in the US. Researchers in Japan began to deal with printed and online handprinted Chinese (Kanji) character recognition in the late 1960s, and with offline handprinted Chinese character recognition in the late 1970s. Chinese researchers entered this area about 10 years later than those in Japan, i.e., printed and online handprinted Chinese character recognition in the late 1970s, and offline handprinted Chinese character recognition in the late 1980s. Since the 1990s, handprinted Chinese character recognition (especially online recognition) has been well commercialized, and the research target has moved to less-constrained handwritten character recognition.

Table 2. Evolution of recognition targets.

Time          Recognition target
Mid-1960s~    Printed Chinese (IBM) [10]; online handprint (MIT, U. Pittsburgh) [14,15]
Late 1960s~   Printed Chinese, online handprint (Japan)
Late 1970s~   Offline handprint (Japan); printed Chinese, online handprint (China)
Late 1980s~   Offline handprint (China)
Mid-1990s~    Less-constrained handwritten

3.2 Evolution of Recognition Methods

Since the 1960s, many effective methods have been proposed in the area of Chinese character recognition. The evolution of the major methods is summarized in Table 3. Template matching, including one-stage classification and hierarchical classification, was widely used in early works on Chinese character recognition, especially printed character recognition. Character structure analysis (stroke analysis, relaxation matching, attributed graph matching) attracted much attention during the 1970s-1990s. In particular, structural matching using relaxation and attributed graphs was popular in the 1980s-1990s. Following template matching and pattern matching, feature matching became popular in the 1980s [2]. It provided good feature extraction techniques for the current statistical classification methods. Since the 1990s, statistical recognition methods have dominated the technology. However, structural methods are still under study because they resemble the procedure of human cognition and have the potential to recognize cursively handwritten characters. A recent advance is the statistical modeling of character structures [21].

Table 3. Evolution of major recognition methods.

Time          Recognition method
1960s-1970s   Hierarchical template matching [10,11]
1970s-1990s   Structural (stroke analysis, relaxation, attributed graph) [1][16]; syntactic (attributed grammar) [17-19]
1980s~        Feature matching [2]; statistical classification [20]
1990s~        Statistical methods dominate; statistical structure modeling [21]

3.3 Important Techniques

In addition to the evolution of general methods reviewed in the last section, this section summarizes the important detailed techniques that have shaped the technology of this field, i.e., those that have yielded satisfactory recognition results or made applications successful. Table 4 lists such techniques, including those for pre-processing, feature extraction, classification, etc. Most of them are still actively used.

The first paper on printed Chinese character recognition, by Casey and Nagy in 1966, proposed the technique of hierarchical template matching [10]. This is the origin of current multi-stage classification, which speeds up classification for large category sets. Blurring was proposed by Iijima in the 1960s and became internationally known when it was published in 1973 at the 1st International Joint Conference on Pattern Recognition (IJCPR) [7]. Blurring is equivalent to what is now called spatial filtering, and is effective in reducing image noise and improving translation invariance. Directional pattern matching was first published in 1979 by Yasuda and Fujisawa [8], and became widely known after a paper was published in 1983 in Pattern Recognition Letters (Yamashita et al. [9]). Directional pattern matching is the origin of the popular direction feature extraction, and is often used together with blurring.

The work of Yamamoto and Rosenfeld on Chinese character recognition using relaxation matching [16] started a boom of relaxation-based structural matching in the 1980s and 1990s. Nonlinear normalization based on line density equalization is effective in reducing within-class shape variation, and thus greatly improves recognition accuracy. It was first published in 1984 in Japanese, and internationally at ICPR 1988 [22] and in the journal Pattern Recognition in 1990 [23].

Before the modified quadratic discriminant function (MQDF) of Kimura et al. was proposed, Chinese character recognition was mostly performed by simple distance-based classifiers. The MQDF, with lower complexity and higher generalization performance than the ordinary QDF, has been demonstrated to be superior in handwritten Chinese character recognition. It was first published at ICPR 1984, and became widely known after publication in IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI) in 1987 [20]. Decision-tree classification, intensively studied by Suen’s group at Concordia University [24,25], is superior in classification speed for large category sets and has been applied successfully to printed Chinese character recognition. The semantic-syntactic approach, proposed in the 1980s by J.W. Tai (R.W. Dai) [19,26], has largely influenced the technology of online Chinese character recognition. Though multiple-classifier approaches have been intensively studied since the beginning of the 1990s, they were not successfully applied to large category set problems until researchers from the Institute of Automation, Chinese Academy of Sciences (CASIA) and Tsinghua University published their works on Chinese character recognition by combining multiple classifiers [27,28].

Table 4. Important techniques in Chinese character recognition.

Year        Technique                                         Authors
1966        Hierarchical template matching [10]               Casey & Nagy (IBM)
1960s       Blurring (spatial filtering) [7]                  T. Iijima (TIT, Japan)
1979        Directional pattern matching [8,9]                Fujisawa & Yasuda (Hitachi)
1982        Relaxation matching [16]                          K. Yamamoto (ETL)
1984        Nonlinear normalization [22,23]                   Yamada (ETL), Tsukumo (NEC)
1984        Modified QDF [20]                                 Kimura et al. (Mie U)
Mid-1980s   Decision tree classification [24,25]              Suen’s group (Concordia)
Mid-1980s   Semantic-syntactic (attributed grammar) [19,26]   J.W. Tai (R.W. Dai), Y.J. Liu (CASIA)
1997        Multiple classifiers fusion [27,28]               R.W. Dai (CASIA), X.Q. Ding (Tsinghua)

3.3 Important Results

Many experiments of Chinese character recognition, mostly using the effective techniques listed in Table 4, have reported high performance. To evaluate the performance, the ETL8B and ETL9B databases, collected by Electro-Technical Laboratory of Japan, have been widely tested. The ETL8B database contains the handprinted images of 952 characters (881 Kanji and 71 hiragana), 160 samples per class. The ETL9B database contains the handprinted images of 3,036 classes (2,965 Kanji and 71 hiragana), 200 samples per class. In China, some experiments have been conducted on the HCL2000 database (collected by Beijing University of Posts and Telecommunications) and the CASIA database (Institute of Automation, Chinese Academy of Sciences), both for 3,755 Chinese characters, with 1,000 samples per class and 300 samples per class, respectively.

Some representative results of online Chinese character recognition have been summarized in [5]. In this paper, we give a list of representative results of offline Chinese character recognition in Table 5. For reference, the underlying normalization/feature extraction and classification methods are also given. “NLN” denotes nonlinear normalization, which varies slightly in implementation.

Table 5. Important results in Chinese character recognition.

Year   Author           Feature         Classifier       Database/class   Accuracy
1983   Yamashita [9]    Direction       Correlation      ETL8/881         94.8%
1984   Yamamoto [29]    Contour         Relaxation       ETL8/952         99.0%
1984   Yamada [30]      Contour         DP               ETL8/952         99.6%
1986   Yamamoto [31]    Contour         Relaxation       ETL9/3036        98.5%
1988   Tsukumo [22]     NLN/Direction   Correlation      ETL9/3036        94.42%
1992   Tsukumo [32]     NLN/Direction   Flexible match   ETL9/3036        95.05%
1993   Jun Guo [33]     NLN/Direction   Perturbation     ETL9/3036        96.32%
1996   Saruta [34]      NLN/Direction   Neural net       ETL9/3036        95.48%
1997   Suzuki [35]      NLN/Direction   Mahalanobis+     ETL9/3036        99.31%
1997   Kimura [36]      NLN/Direction   MQDF             ETL9/3036        99.15%
1999   Kato [37]        NLN/Direction   Mahalanobis      ETL9/3036        99.42%
2001   Sawa [38]        NLN/Gradient    MQDF+Perturb     ETL9/3036        99.41%
2003   J.X. Dong [39]   NLN/Gradient    SVM              ETL9/3036        99.0%
2005   H. Liu [40]      NLN/Gradient    MQDF+MCE+        HCL/3755         98.56%
2006   C.L. Liu [41]    NLN/Gradient    DFE-MQDF         CASIA/3755       98.43%

Yamashita et al. first applied directional pattern matching to a relatively large character set and reported a high accuracy on a test set of 20 samples per class of ETL8B [9]. The methods of Yamamoto et al. [29], Yamada [30], and Yamamoto et al. [31] are basically contour segment matching. They reported very high accuracies on small test sets of only a few samples per class.

Among the experiments on the ETL9B database, Tsukumo and Tanaka [22], Tsukumo [32], Guo et al. [33], and Saruta et al. [34] used 100 samples per class for training classifiers and the disjoint 100 samples per class for testing. All four works use nonlinear normalization and the contour direction feature, while the classification methods are correlation (simple distance), flexible pattern matching, perturbation-based correlation, and class-module neural network, respectively.

From 1997, the experiments on the ETL9B database mostly use 160 or more samples per class for training and the remaining samples for testing. They all use nonlinear normalization and the contour or gradient direction feature, with finer implementations. The higher accuracies are also partially due to the larger number of training samples. Suzuki et al. use a modified Mahalanobis distance for classification and an auxiliary measure for pairwise discrimination [35]. Kimura et al. use a pseudo Bayes classifier, which is similar to MQDF, for classification [36]. The classifier of Kato et al. is an asymmetric Mahalanobis distance function [37]. Sawa et al. generated deformed training samples for estimating the parameters of MQDF [38]. Dong et al. proposed a fast algorithm for training support vector machine (SVM) classifiers on large data sets and applied it successfully to Chinese character recognition [39].

H. Liu and X. Ding (Tsinghua University) [40] and C.L. Liu (CASIA) [41] experimented on the ETL9B database as well as on a Chinese database. They both use nonlinear normalization and the gradient direction feature, and for classification they combine MQDF with minimum classification error (MCE) training or discriminative feature extraction (DFE). Their accuracies on the test samples of ETL9B are over 99.3%, and the accuracies on the HCL2000 and CASIA databases are 98.56% and 98.43%, respectively. This suggests that, on these databases, the characters written by Chinese writers are more difficult to recognize than those written by Japanese writers.

3.4 The State of the Art

In the following, we discuss the general status of the technology and highlight some recent advances in pre-processing, feature extraction, feature transformation, and classifier design.

3.4.1 Character Pre-Processing

The main pre-processing steps include noise reduction (generally by smoothing or low-pass spatial filtering) and character shape normalization. Of the two, normalization has the greater influence on recognition performance. It not only standardizes the image size, but also reduces the within-class variation of character shape. Nonlinear normalization based on line density equalization [22,23] has contributed significantly to the improvement of performance in handwritten Chinese character recognition. A pseudo two-dimensional (P2D) nonlinear normalization method through line density smoothing [42] can further improve the recognition accuracy. This method, however, is very time-consuming. Recently, a new P2D normalization method based on line density projection interpolation was proposed by Liu et al. [43]. It performs comparably to the method of [42] with only a little more complexity than 1D normalization. Compared to 1D nonlinear normalization, P2D normalization can not only correct non-uniform stroke density, but also alleviate the imbalance of width/height and stroke positions in different parts of the character image.

3.4.2 Feature Extraction

For feature extraction, the features should describe as much detail of the character shape as possible, be invariant to within-class shape variation, and reflect between-class differences. A better tradeoff between within-class invariance and between-class discrimination can be achieved by structural element decomposition (e.g., local stroke direction decomposition) and image (or feature map) blurring. Early works usually used 4-orientation decomposition to extract direction features. 8-direction decomposition has been found to give higher recognition accuracy. Possibly, it can be extended to 12 or 16 directions. In the case of 8-direction contour (chaincode) or gradient direction decomposition, the two sides of a stroke edge are assigned to different directions, so that parallel strokes are less easily confused. The local gradient direction of the character image can be partitioned into a number of angle ranges [44]. A gradient vector decomposition approach, originally proposed in online character recognition [45], has yielded superior recognition performance in handwritten character recognition [46].

Comparing the chaincode direction feature and the gradient direction feature, both are insensitive to stroke-width variation and yield high recognition accuracies in Chinese character recognition, but the chaincode feature applies to binary images only, while the gradient feature applies to gray-scale images as well. The gradient feature generally outperforms the chaincode feature because it is more stable against image noise and the fluctuation of local contour direction. Nevertheless, the computation of the gradient feature is a little more complicated than that of the chaincode feature. Extracting the gradient direction feature directly from gray-scale images is likely to become a trend, especially for low-resolution or degraded character images.

3.4.3 Feature Transformation
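The feature vector that the transformations in this subsection operate on is the direction feature of Section 3.4.2. As a point of reference, a minimal sketch of such a feature is given here. It is not the implementation of any cited work. It assumes NumPy, assigns each pixel's gradient to the nearest of 8 standard directions (in the spirit of the angle-range partition of [44], not the vector decomposition of [45]), and replaces blurring plus sampling with simple averaging over an 8x8 grid of zones. The function name `direction_feature` and its parameters are ours.

```python
import numpy as np

def direction_feature(img, n_dir=8, grid=8):
    """Return an (n_dir * grid * grid)-dimensional gradient direction feature."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)                # finite differences along rows, columns
    mag = np.hypot(gx, gy)                   # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)   # gradient direction in [0, 2*pi)
    step = 2 * np.pi / n_dir
    # Index of the nearest standard direction for every pixel.
    bins = np.floor(ang / step + 0.5).astype(int) % n_dir

    h, w = img.shape
    zh, zw = h // grid, w // grid            # zone size; leftover rows/columns are dropped
    feat = np.zeros((n_dir, grid, grid))
    for d in range(n_dir):
        plane = np.where(bins == d, mag, 0.0)[: zh * grid, : zw * grid]
        # Zone averaging stands in for blurring followed by sampling.
        feat[d] = plane.reshape(grid, zh, grid, zw).mean(axis=(1, 3))
    return feat.ravel()

# A vertical bar: its left and right edges have opposite gradient directions.
img = np.zeros((64, 64))
img[:, 24:40] = 1.0
feat = direction_feature(img)
planes = feat.reshape(8, 8, 8)
print(feat.shape, [round(float(p.sum()), 3) for p in planes])
```

With the default settings the vector has 8 x 8 x 8 = 512 components. For the bar image, only the two planes for opposite horizontal gradient directions are non-zero, which shows how the two sides of a stroke edge fall into different direction planes.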

After feature extraction, the reduction of dimensionality is important both for reducing the computational complexity of classification and for improving the generalization accuracy. The dimensionality of the direction feature vector, for example, is as high as 512 (8 directions, 8x8 sampled values for each direction) or more. When a nonlinear classifier like the MQDF is used directly on this high-dimensional vector, the storage and computation costs are very high. Hence, dimensionality reduction is now widely adopted in character recognition.

Dimensionality reduction is performed by projecting the feature vector onto a low-dimensional linear subspace. The most popular linear dimensionality reduction techniques are principal component analysis (PCA) and linear discriminant analysis (LDA) [47]. Unlike PCA, which maximizes the variance of the data vectors in the subspace regardless of their class labels, LDA learns a subspace that maximizes the ratio of between-class scatter to within-class scatter. LDA has shown promise in Chinese character recognition [36]. Though it performs fairly well in practice, LDA has some inherent drawbacks. It assumes equal-covariance Gaussian densities for all classes, and does not separate nearby classes well in the subspace. Heteroscedastic discriminant analysis (HDA, e.g., [48]) considers the difference of covariance in subspace learning, but is extremely expensive for large category sets. Very recently, H. Liu and X. Ding proposed a new HDA method [40], which is computationally feasible for large category sets and yields higher accuracy than LDA. Discriminative feature extraction (DFE), which adjusts the subspace axes with the aim of minimizing the classification error on training data, has yielded a significant improvement in accuracy in handwritten Chinese character recognition [41].

3.4.4 Classifier Design

As to classifier design, there is a tradeoff between classifier complexity and classification accuracy. Typical classifiers include:
(1) Minimum distance (or correlation) classifier, which was widely used before the 1980s;
(2) LVQ (learning vector quantization) for prototype optimization, which offers a good tradeoff between complexity and accuracy [49];
(3) Modified quadratic discriminant function (MQDF, proposed by Kimura et al. in the 1980s [20]), which gives high accuracy but involves a large number of classifier parameters.
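For reference, the MQDF is commonly written in the following form, often called MQDF2 in later literature. The notation is introduced here for illustration and does not come from this paper: x is the d-dimensional feature vector, \mu_i is the mean of class \omega_i, \lambda_{ij} and \phi_{ij} are the j-th largest eigenvalue of the class covariance matrix and its eigenvector, k is the number of principal eigenvectors retained, and \delta_i is a constant that replaces the minor eigenvalues.

```latex
g_2(x, \omega_i) =
  \sum_{j=1}^{k} \frac{1}{\lambda_{ij}} \left[ \phi_{ij}^{\top} (x - \mu_i) \right]^2
  + \frac{1}{\delta_i} \left\{ \lVert x - \mu_i \rVert^2
      - \sum_{j=1}^{k} \left[ \phi_{ij}^{\top} (x - \mu_i) \right]^2 \right\}
  + \sum_{j=1}^{k} \log \lambda_{ij}
  + (d - k) \log \delta_i
```

The input is assigned to the class with the smallest g_2. Under this formulation each class stores one mean vector, k eigenvectors, k eigenvalues and one constant, i.e., d + kd + k + 1 values. This is more than the d values of a minimum distance classifier, but fewer than the full covariance matrix required by the ordinary QDF.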


To speed up classification for large category sets, hierarchical classification has been commonly adopted since the early work on Chinese character recognition. Suen’s group reported progress in Chinese character recognition using decision tree classifiers in the 1980s [24,25]. More often, candidate reduction by class grouping (clustering) or multi-stage dynamic candidate selection is adopted. Typical works using both schemes include the printed Chinese character recognition system of Hitachi [11] (published at the 1st IJCPR in 1973) and the work on handprinted Chinese character recognition by Y.Y. Tang et al. [50].

More recently, multiple classifier combination approaches have attracted increasing interest. Combining multiple classifiers is effective in improving on the recognition accuracy of single classifiers. Suen’s group at Concordia University did much work on combining classifiers for small character set recognition (e.g., handwritten numerals) [12]. Combining classifiers for large character sets faces some problems: simple majority vote does not perform sufficiently well, while training a combiner on a large data set is not trivial. Researchers from CASIA and Tsinghua University reported success with multiple classifiers in Chinese character recognition: one work uses weighted confidence fusion [28] and the others are based on meta-synthesis [27,51-55].
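The difference between majority vote and confidence-level fusion can be seen in a small numerical sketch. It is not the combiner of [28] or of the meta-synthesis works. It assumes NumPy, that each classifier outputs one confidence score per class, and that the fusion weights are given. The function names and the scores are made up for illustration.

```python
import numpy as np

def majority_vote(scores):
    """scores: (n_classifiers, n_classes). Each classifier votes for its top class."""
    votes = scores.argmax(axis=1)
    counts = np.bincount(votes, minlength=scores.shape[1])
    return int(counts.argmax())

def weighted_fusion(scores, weights):
    """Weighted sum of the classifiers' confidence vectors, then pick the top class."""
    fused = np.asarray(weights, dtype=float) @ scores
    return int(fused.argmax())

# Three hypothetical classifiers, three classes.
scores = np.array([
    [0.40, 0.35, 0.25],   # weak preference for class 0
    [0.45, 0.40, 0.15],   # weak preference for class 0
    [0.05, 0.90, 0.05],   # strong preference for class 1
])
weights = [1 / 3, 1 / 3, 1 / 3]

print(majority_vote(scores), weighted_fusion(scores, weights))
```

Here two classifiers weakly prefer class 0 and one strongly prefers class 1, so the vote selects class 0 while the fused confidence selects class 1. Voting discards the confidence values that fusion uses. Choosing the weights on a large category set is the non-trivial training problem mentioned above.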

4 From Syntactic to Syntactic-Semantic Approach

Generally speaking, the mathematical methods for solving pattern recognition problems can be grouped into two major categories: the statistical (or decision-theoretic) approach and the syntactic (or structural) approach. Statistical pattern recognition is based on statistical characterizations of patterns, assuming that the patterns are generated by a probabilistic system. Structural pattern recognition is based on the structural interrelationships of features. The former approach has been intensively studied and widely used. This section focuses on the progress of the latter approach, from syntactic to syntactic-semantic.

4.1 Syntactic Pattern Recognition

Structural information has long been regarded as very important in pattern analysis. Prof. K.S. Fu is well known as the pioneer of syntactic pattern recognition, and he published many papers and books in this field [17,18,56-64]. This approach draws an analogy between the structure of patterns and the syntax of languages (see Fig. 7).


Fig. 7. Block diagram of a syntactic pattern recognition system [17].

Different kinds of grammars have been proposed to describe patterns. It is hoped that in order to describe a class of patterns，the grammar used can be directly inferred from a set of sample strings or a set of sample patterns. The problem of learning a grammar based on a set of sample strings is called grammatical inference. Actually，even some simple two-dimensional patterns(such as equilateral triangles) have to be described by context-sensitive languages. However，the inference of general context-sensitive grammar may not be desired since recognition of context-sensitive languages is in general rather complex and time consuming. 4.2 Syntactic-Semantic Approach The syntactic approach was considered to expand in 80s last century. Strings, trees, and graphs have been suggested to describe pattern structures. The basic idea is to represent a pattern by its components (sub-patterns) and the relations between them. By using relational graphs to represent pattern structures, the simple “concatenation” relation can be extended to include many other relations. And semantic information in pattern description can be included by adding attributes to the description of subpattern and relations. Besides, the description and recognition of patterns by means of attributed grammars have been advocated by several investigators [17,59-64]. Attributed grammars were first formulated by Knuth [65]. And an attributed grammar method for pictorial pattern recognition was developed in China [19,26,66], and successfully applied for solving online Chinese character recognition problem. The attributed grammar including two parts was defined as follows: (1) A syntactic part represented by a context-free or finite-state grammar. (2) A semantic part consisting of three sets 1) A set of primitive attributes 2) A set of relational attributes 3) A set of semantic functions or semantic rules Such an “attributed grammar approach” was termed as “syntactic-semantic approach”. 
For the attributed grammar, there is a trade-off between syntactic and semantic complexity in the grammar definition (see Figure 8). That is, semantic information can be used to lower the syntactic complexity of the pattern description. Thus, attributed finite-state languages, instead of context-sensitive languages, can be used as a normal form for pattern description. Both the traditional statistical approach and the syntactic approach can be regarded as special cases of the syntactic-semantic approach.

Fig. 8. Syntactic-semantic approach: tradeoff between syntactic and semantic.
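
The trade-off can be illustrated with the equilateral triangle mentioned above. If a triangle is encoded as a chain of unit-length primitives, its strings take the form a^n b^n c^n, which no context-free grammar can generate. If each side is instead a single primitive carrying a length attribute, a finite-state syntactic part that accepts only the string abc is sufficient, and a semantic rule checks that the three lengths agree. The following Python sketch is only an illustration under these assumptions; the names Primitive, TRANSITIONS, syntax_ok and semantics_ok, and the 5% tolerance, are hypothetical and are not taken from [19,26,66].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Primitive:
    symbol: str    # syntactic label of a side: "a", "b" or "c"
    length: float  # primitive attribute attached to that symbol


# Syntactic part: a finite-state grammar accepting exactly the string "abc",
# written as a transition table {state: {symbol: next_state}}.
TRANSITIONS = {0: {"a": 1}, 1: {"b": 2}, 2: {"c": 3}}
ACCEPTING = {3}


def syntax_ok(pattern: List[Primitive]) -> bool:
    """Run the finite-state part over the symbols only."""
    state = 0
    for p in pattern:
        state = TRANSITIONS.get(state, {}).get(p.symbol)
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


def semantics_ok(pattern: List[Primitive], tol: float = 0.05) -> bool:
    """Semantic rule (hypothetical): all side lengths agree within tol."""
    lengths = [p.length for p in pattern]
    return max(lengths) - min(lengths) <= tol * max(lengths)


def is_equilateral(pattern: List[Primitive], tol: float = 0.05) -> bool:
    # The semantic rule is only evaluated on syntactically valid patterns.
    return syntax_ok(pattern) and semantics_ok(pattern, tol)


good = [Primitive("a", 10.0), Primitive("b", 10.2), Primitive("c", 9.9)]
skewed = [Primitive("a", 10.0), Primitive("b", 10.0), Primitive("c", 5.0)]
print(is_equilateral(good), is_equilateral(skewed))
```

The syntactic part stays trivially simple because the length comparison, which would otherwise force a context-sensitive grammar, has been moved into the semantic rule.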

5 Discipline Crossing

Character recognition is closely related to various fields: image processing, machine learning, cognitive science (noetic science), linguistics, etc. Meta-synthesis was proposed by Chinese scientists in 1990 for dealing with open complex giant systems (OCGS) [67]. This section briefly introduces the progress on discipline crossing between pattern recognition and systems science, and the related meta-synthetic approach to pattern recognition system design.

5.1 An Introduction of OCGS and Meta-synthesis

To facilitate research in systems science, systems can be classified into different types according to different principles. Depending on the number and variety of subsystems contained in a system, and the number of interactions between them, systems can be divided into two groups: simple systems and giant systems. If there is a large variety of subsystems with a hierarchical structure, and the interactions among basic units, and between basic units and the environment, are represented by some mathematical formula (such as a nonlinear function) or by an information protocol, then the aggregate is called a complex giant system. In addition, if a system and its subsystems exchange energy, information, or material with the environment, it is called an open system. The method for dealing with OCGS goes beyond reductionism: studies and practice have shown that the only feasible and effective way to treat an OCGS is a meta-synthesis from the qualitative to the quantitative, i.e., the meta-synthetic engineering method. The main ideas of meta-synthesis can be summarized as human-computer cooperation, integration, and a system point of view.

5.2 Meta-synthesis Approach for Pattern Recognition Systems Design

To illustrate how meta-synthesis relates to pattern recognition, an example of traditional Chinese medical diagnosis is given in Figure 9.
In traditional Chinese medical treatment, the doctor acquires information from the patient by looking, listening, asking, and feeling the pulse. These pieces of information are brought together in the doctor's mind, and according to his (or her) medical experience, the doctor performs a meta-synthesis and then gives the diagnosis.

Fig. 9. An example of meta-synthesis (traditional Chinese medical treatment: a synthetic diagnosis is reached by considering the overall condition of the human body).

Inspired by the ideas of meta-synthesis mentioned above, meta-synthetic approaches were proposed in China as a guiding framework for the design of complicated pattern recognition systems. The characteristics of the meta-synthetic approach for pattern recognition system design can be summarized as follows: human-computer cooperation, integration, and a closed-loop system. Fig. 10 shows the framework of such approaches.

Fig. 10. Block diagram of a pattern recognition system using meta-synthesis approach.

5.3 Multiple Classifiers Combination for CCR

Multiple classifier integration has received considerable attention in the past decade. The idea has appeared under many names: hybrid methods, classifier combination, information fusion, ensemble learning, etc. In the meta-synthetic approach, integration includes not only the combination of multiple classifiers, but also human-computer integration by supervised learning. As mentioned above, Chinese character recognition was considered an extremely difficult problem due to the very large number of categories, complicated structures, similarity between characters, and variability of fonts or writing styles. The large number of categories is the main obstacle that prevents many supervised learning algorithms (such as artificial neural networks) from being applied successfully in CCR. To address this problem, some researchers began to study multiple classifier combination using the ideas of meta-synthesis in the 1990s, and reported promising improvements in handwritten Chinese character recognition [27,51-55]. In 1997, Hao et al. proposed an initial meta-synthesis approach for handwritten Chinese character recognition, which used a linear subnet to train the integration network [27]. A human teacher is proficient at judging which class a training sample belongs to, but not at judging suitable combination weights directly. Thus, an adaptive weighted multiple classifier combination method was proposed, in which supervised learning is applied indirectly [53]. In order not only to improve the performance of the pattern recognition system, but also to pursue a feasible human-machine integration scheme, Wang et al. [54] proposed a new parallel compact integration scheme based on multi-layer perceptron (MLP) networks for handwritten Chinese character recognition (see Figure 11). In this approach, MLP network classification and integration can be applied reasonably and effectively to large vocabulary classification problems.
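
The idea of learning combination weights indirectly from class labels can be sketched as follows. This is a minimal Python illustration, not the algorithm of [53]: it assumes, purely for demonstration, that each classifier's weight is its normalized accuracy on labelled samples, and the names argmax, learn_weights and combine, as well as the toy scores, are hypothetical.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def learn_weights(outputs_per_classifier, labels):
    """Derive one weight per classifier from class labels only.

    outputs_per_classifier[k][i] is the score vector produced by
    classifier k for training sample i; labels[i] is the class index
    given by the teacher. Assumption of this sketch: the weight of a
    classifier is its training accuracy, normalized to sum to one.
    """
    accuracies = []
    for outputs in outputs_per_classifier:
        hits = sum(1 for scores, y in zip(outputs, labels) if argmax(scores) == y)
        accuracies.append(hits / len(labels))
    total = sum(accuracies)
    if total == 0:
        # No classifier was ever right: fall back to equal weights.
        return [1.0 / len(accuracies)] * len(accuracies)
    return [a / total for a in accuracies]


def combine(score_vectors, weights):
    """Weighted sum of the classifiers' score vectors for one sample."""
    n_classes = len(score_vectors[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_vectors))
        for c in range(n_classes)
    ]


# Toy data: two classifiers, four labelled samples, two classes.
labels = [0, 1, 0, 1]
outputs_a = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]  # 4 of 4 correct
outputs_b = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]  # 2 of 4 correct
weights = learn_weights([outputs_a, outputs_b], labels)

# A new sample on which the two classifiers disagree.
fused = combine([[0.4, 0.6], [0.6, 0.4]], weights)
decision = argmax(fused)
```

In this toy case a plain average of the two score vectors would tie, whereas the label-derived weights let the more reliable classifier decide. The teacher only ever supplies class labels, never the weights themselves.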

Fig. 11. Totally parallel integration with two-step supervised learning.
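
As a rough illustration of the two-step supervised learning named in Fig. 11, the Python sketch below trains several small base classifiers first, concatenates their outputs into an enhanced feature, and then trains an integration classifier on that feature. To keep the example short and dependency-free, nearest-class-mean classifiers are swapped in for the MLP networks of [54], and each base classifier is assumed to see only a subset of the input features. All function names and the toy data are hypothetical.

```python
def train_nearest_mean(samples, labels):
    """Return {label: mean vector} learned from labelled samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def scores(means, x):
    """One similarity per class (negative squared distance), in label order."""
    return [-sum((a - b) ** 2 for a, b in zip(x, means[y])) for y in sorted(means)]


def enhanced_feature(base_models, x):
    """Concatenate the outputs of all base classifiers for input x."""
    feature = []
    for feature_ids, means in base_models:
        feature.extend(scores(means, [x[i] for i in feature_ids]))
    return feature


def train_two_step(samples, labels, feature_groups):
    # Step 1: train each base classifier on its own feature subset.
    base_models = []
    for ids in feature_groups:
        subset = [[x[i] for i in ids] for x in samples]
        base_models.append((ids, train_nearest_mean(subset, labels)))
    # Step 2: train the integration classifier on the enhanced features.
    enhanced = [enhanced_feature(base_models, x) for x in samples]
    integrator = train_nearest_mean(enhanced, labels)
    return base_models, integrator


def classify(base_models, integrator, x):
    s = scores(integrator, enhanced_feature(base_models, x))
    return sorted(integrator)[s.index(max(s))]


# Toy data: three classes in 2-D; neither coordinate alone separates all three.
samples = [(0, 0), (1, 0), (0, 1), (0, 4), (1, 4), (0, 5), (4, 0), (5, 0), (4, 1)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
base_models, integrator = train_two_step(samples, labels, [(0,), (1,)])
predictions = [classify(base_models, integrator, x)
               for x in [(0.5, 0.5), (0.5, 4.5), (4.5, 0.5)]]
```

Each base classifier confuses two of the three classes on its own, but the integration step resolves them from the combined outputs. In practice the second step would normally be trained on outputs for samples not used in the first step, to avoid overly optimistic base scores.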

Figure 11 shows the parallel compact integration system, where L compact MLP network classifiers are integrated to solve an N-category classification problem. The outputs of all compact MLP network classifiers are combined into an enhanced feature, which is input to the integration network. Each output node of the integration MLP network corresponds to a category, and its output value is directly proportional to the similarity between the input pattern and the corresponding category. Effective human-machine integration is realized through two-step supervised learning, which takes full advantage of the intelligence of the teacher. In the first step, the compact MLP network classifiers are constructed and trained; in the second step, the integration network is constructed and trained. During supervised learning, the relationship among different categories is learned and recorded in the compact MLP network classifiers, and the relationship among different compact MLP network classifiers is learned and recorded in the integration network.

5.4 Pattern Recognition with Feedback

Furthermore, the meta-synthetic ideas can change the traditional viewpoint of pattern recognition system design by incorporating feedback into the classifier integration network. The integrated pattern recognition system with feedback is no longer a traditional nonlinear forward mapping, but a closed-loop nonlinear dynamical system. The importance of this point of view is that it establishes a connection between the field of pattern recognition and the field of control systems. This method has been successfully applied to handwritten numeral recognition [53]; see Fig. 12.


Fig. 12. Integration network with feedback.

6 Future Directions

The gap between the current technical status and the required performance indicates that the problem of Chinese character recognition is not yet completely solved, which leaves research opportunities. For example, on the three sets of handwritten Chinese character samples (regular script, fluent script, and cursive script) shown in Fig. 6, a state-of-the-art recognizer gives accuracies of 98%, 82%, and 70%, respectively. The performance on fluent script and cursive script is far from satisfactory. Table 6 lists the remaining technical problems and potential solutions for improving recognition performance.

Table 6. Remaining problems and potential solutions.

Issue | Problem | Solution
----- | ------- | --------
Database (very important) | Insufficient training data, esp. fluent and cursive writing | Collection of new data, artificial samples
Normalization | Tradeoff between shape restoration and distortion | New normalization algorithms, perturbation
Feature extraction | Features for discriminating similar characters | Higher-order feature detectors, feature selection
Classification | Insufficient accuracy, esp. similar characters and cursive | Learning from large data, multi heterogeneous classifiers
Structural matching | Model building, stroke extraction | Structural learning, model-based stroke extraction
Character segmentation | Character splitting and merging | Character detection relying on robust classification
Contextual processing | Statistics (n-gram): not sufficient | Syntactic/semantic text analysis

The first problem is the lack of large sample databases for training classifiers; in particular, samples of fluent and cursive handwriting are insufficient. For statistical classification, the techniques of normalization, feature extraction, and classification can be further improved: (1) Current normalization methods trade off within-class variation reduction against shape distortion. New normalization methods need to be proposed, or perturbation methods may help. (2) In feature extraction, higher-order feature detectors can be designed to extract more discriminative features, but feature selection is needed to overcome the curse of dimensionality. (3) Learning classifiers from large sample data has not received enough attention in Chinese character recognition. Some classifiers, such as neural networks and support vector machines (SVMs), encounter difficulty when applied to classification problems with large category sets. Structural matching needs efficient algorithms for stroke extraction and structural model learning. Syntactic-semantic methods need to be studied further, because they resemble the procedure of human cognition and have the potential to recognize cursive script Chinese characters. In addition, for practical applications, character segmentation and contextual processing deserve more attention. Segmentation is a character detection problem, for which good classifiers can help. Contextual processing should exploit more syntactic/semantic knowledge.
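
For reference, the n-gram style of contextual processing that Table 6 describes as not sufficient can be sketched as a bigram rescoring of classifier candidates. The Python sketch below is a generic Viterbi search, not a method taken from the cited works; the function name viterbi_decode, the candidate scores, and the bigram probabilities are all made-up values for illustration, and the candidate list is assumed to be non-empty.

```python
import math


def viterbi_decode(candidates, bigram_logp, default_logp=math.log(1e-6)):
    """Pick the character sequence maximizing classifier + bigram scores.

    candidates: one dict per character position, {char: classifier log score}.
    bigram_logp: {(prev_char, char): log probability}; unseen pairs get
    default_logp. All numbers used here are illustrative assumptions.
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (s, [c]) for c, s in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for c, s in position.items():
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + bigram_logp.get((kv[0], c), default_logp),
            )
            total = score + bigram_logp.get((prev, c), default_logp) + s
            new_best[c] = (total, path + [c])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]


# The classifier slightly prefers the wrong first character on its own.
candidates = [
    {"入": math.log(0.55), "人": math.log(0.45)},
    {"民": math.log(0.90), "尺": math.log(0.10)},
]
bigram_logp = {("人", "民"): math.log(0.1)}  # hypothetical language-model entry
decoded = "".join(viterbi_decode(candidates, bigram_logp))
```

Here the bigram statistics override the classifier's first choice for the leading character. Such pairwise statistics capture only local context, which is why the text calls for richer syntactic/semantic knowledge.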

Acknowledgements

Some parts of this paper were presented at the 18th ICPR in Hong Kong, August 2006, as a keynote speech. The authors would like to thank Prof. Yuan Yan Tang, Prof. Xiaoqing Ding and Prof. Chunheng Wang for providing materials to assist with this paper, and the reviewers for their suggestions for improvement. The authors also apologize to researchers whose work has been overlooked.

References

1. W. Stallings, Approaches to Chinese character recognition, Pattern Recognition, 8(2): 87-98, 1976.
2. S. Mori, K. Yamamoto, M. Yasuda, Research on machine recognition of handprinted characters, IEEE Trans. Pattern Analysis and Machine Intelligence, 6(4): 386-405, 1984.
3. M. Umeda, Advances in recognition methods for handwritten Kanji characters, IEICE Trans. Information and Systems, E29(5): 401-410, 1996.
4. T.H. Hildebrandt, W. Liu, Optical recognition of handwritten Chinese characters: advances since 1980, Pattern Recognition, 26(2): 205-225, 1993.
5. C.-L. Liu, S. Jaeger, M. Nakagawa, Online recognition of Chinese characters: the state-of-the-art, IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2): 198-213, 2004.
6. G. Nagy, Pattern Recognition 1966 IEEE Workshop, IEEE Spectrum, Feb. 1967, pp. 92-94.
7. T. Iijima, H. Genchi, K. Mori, A theory of character recognition by pattern matching method, Proc. 1st IJCPR, 1973, pp. 50-56.
8. M. Yasuda, H. Fujisawa, An improved correlation method for character recognition, Systems, Computers, and Controls, 10(2): 29-38, 1979 (Translated from Trans. IEICE Japan, 62-D(3): 217-224, 1979).
9. Y. Yamashita, K. Higuchi, Y. Yamada, Y. Haga, Classification of handprinted Kanji characters by the structured segment matching method, Pattern Recognition Letters, 1: 475-479, 1983.
10. R. Casey, G. Nagy, Recognition of printed Chinese characters, IEEE Trans. Electronic Computers, EC-15(1): 91-101, 1966.
11. S. Yamamoto, A. Nakajima, K. Nakata, Chinese character recognition by hierarchical pattern matching, Proc. 1st IJCPR, 1973, pp. 183-194.
12. L. Xu, A. Krzyzak, C. Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. System, Man, and Cybernetics, 27(3): 418-435, 1992.
13. J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3): 226-239, 1998.
14. J. Liu, Real Time Chinese Handwriting Recognition, E.E. Thesis, MIT, Cambridge, 1966.
15. M. Zobrak, A method for rapid recognition hand drawn line patterns, M.S. Thesis, University of Pittsburgh, 1966.
16. K. Yamamoto, A. Rosenfeld, Recognition of handprinted Kanji characters by a relaxation method, Proc. 6th ICPR, Munich, 1982, pp. 395-398.
17. K.S. Fu, Syntactic Methods in Pattern Recognition, Academic Press, 1974.
18. K.S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, 1982.
19. J.W. Tai, A syntactic-semantic approach for Chinese character recognition, Proc. 7th ICPR, Montreal, Canada, 1984, pp. 374-376.
20. F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 9(1): 149-153, 1987.
21. I.-J. Kim, J.H. Kim, Statistical character structure modeling and its application to handwritten Chinese character recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 25(11): 1422-1436, 2003.
22. J. Tsukumo, H. Tanaka, Classification of handprinted Chinese characters using non-linear normalization and correlation methods, Proc. 9th ICPR, Rome, 1988, pp. 168-171.
23. H. Yamada, K. Yamamoto, T. Saito, A nonlinear normalization method for handprinted Kanji character recognition--line density equalization, Pattern Recognition, 23(9): 1023-1029, 1990.
24. Y.X. Gu, Q.R. Wang, C.Y. Suen, Application of a multilayer decision tree in computer recognition of Chinese characters, IEEE Trans. Pattern Analysis and Machine Intelligence, 5(1): 83-89, 1983.
25. Q.R. Wang, C.Y. Suen, Analysis and design of a decision tree based on entropy reduction and its application to large character set recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 6(4): 406-417, 1984.
26. J.W. Tai, Y.J. Liu, Chinese character recognition, Syntactic and Structural Pattern Recognition--Theory and Application, H. Bunke and A. Sanfeliu (Eds.), World Scientific, 1989.
27. H. Hao, X. Xiao, R. Dai, Handwritten Chinese character recognition by metasynthesis approach, Pattern Recognition, 30(8): 1321-1328, 1997.
28. X. Lin, X. Ding, M. Chen, R. Zhang, Y. Wu, Adaptive confidence transform based classifier combination for Chinese character recognition, Pattern Recognition Letters, 19(10): 975-988, 1998.
29. K. Yamamoto, H. Yamada, T. Saito, R. Oka, Recognition of handprinted Chinese characters and Japanese cursive syllabary, Proc. 7th ICPR, Montreal, 1984, pp. 385-388.
30. H. Yamada, Contour DP matching method and its application to handprinted Chinese character recognition, Proc. 7th ICPR, Montreal, 1984, pp. 389-392.
31. K. Yamamoto, H. Yamada, T. Saito, I. Sakaga, Recognition of handprinted characters in the first level of JIS Chinese characters, Proc. 8th ICPR, Paris, 1986, pp. 570-572.
32. J. Tsukumo, Handprinted Kanji character recognition based on flexible template matching, Proc. 11th ICPR, The Hague, 1992, Vol.2, pp. 483-486.
33. J. Guo, N. Sun, Y. Nemoto, M. Kimura, H. Echigo, R. Sato, Recognition of handwritten characters using pattern transformation method with cosine function, Trans. IEICE Japan, J76-D-II(4): 835-842, 1993 (in Japanese).
34. K. Saruta, N. Kato, M. Abe, Y. Nemoto, High accuracy recognition of ETL9B using exclusive learning neural network-II (ELNET-II), IEICE Trans. Information and Systems, 79-D(5): 516-521, 1996.
35. M. Suzuki, S. Omachi, N. Kato, H. Aso, H. Nemoto, A discrimination method of similar characters using compound Mahalanobis function, Trans. IEICE Japan, J80-D-II(10): 2752-2760, 1997 (in Japanese).
36. F. Kimura, T. Wakabayashi, S. Tsuruoka, Y. Miyake, Improvement of handwritten Japanese character recognition using weighted direction code histogram, Pattern Recognition, 30(8): 1329-1337, 1997.
37. N. Kato, M. Suzuki, S. Omachi, H. Aso, Y. Nemoto, A handwritten character recognition system using directional element feature and asymmetric Mahalanobis distance, IEEE Trans. Pattern Analysis and Machine Intelligence, 21(3): 258-262, 1999.
38. K. Sawa, T. Wakabayashi, S. Tsuruoka, F. Kimura, Y. Miyake, Accuracy improvement by gradient feature and variable absorbing covariance matrix in handwritten Chinese character recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, J84-D-II(11): 2379-2397, 2001 (in Japanese).
39. J.X. Dong, A. Krzyzak, C.Y. Suen, High accuracy handwritten Chinese character recognition using support vector machine, Proc. Int. Workshop on Artificial Neural Networks for Pattern Recognition, Florence, Italy, 2003.
40. H. Liu, X. Ding, Handwritten character recognition using gradient feature and quadratic classifier with multiple discrimination schemes, Proc. 8th ICDAR, Seoul, Korea, 2005, pp. 19-23.
41. C.-L. Liu, High accuracy handwritten Chinese character recognition using quadratic classifiers with discriminative feature extraction, Proc. 18th ICPR, Hong Kong, 2006, Vol.2, pp. 942-945.
42. T. Horiuchi, R. Haruki, H. Yamada, K. Yamamoto, Two-dimensional extension of nonlinear normalization method using line density for character recognition, Proc. 4th ICDAR, Ulm, Germany, 1997, pp. 511-514.
43. C.-L. Liu, K. Marukawa, Pseudo Two-dimensional shape normalization methods for handwritten Chinese character recognition, Pattern Recognition, 38(12): 2242-2255, 2005.
44. A. Kawamura, K. Yura, T. Hayama, Y. Hidai, T. Minamikawa, A. Tanaka, S. Masuda, On-line recognition of freely handwritten Japanese characters using directional feature densities, Proc. 11th ICPR, The Hague, 1992, Vol.2, pp. 183-186.
45. G. Srikantan, S.W. Lam, S.N. Srihari, Gradient-based contour encoder for character recognition, Pattern Recognition, 29(7): 1147-1160, 1996.
46. C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition, 36(10): 2271-2285, 2003.
47. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, 1990.
48. M. Loog, R.P.W. Duin, Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Trans. Pattern Analysis and Machine Intelligence, 26(6): 732-739, 2004.
49. C.-L. Liu, M. Nakagawa, Evaluation of prototype learning algorithms for nearest neighbor classifier in application to handwritten character recognition, Pattern Recognition, 34(3): 601-615, 2001.
50. Y.Y. Tang, et al., Offline recognition of Chinese handwriting by multi-feature and multilevel classification, IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5): 556-561, 1998.
51. Dai Ruwei, Hao Hongwei, Xiao Xuhong, Systems and Integration of Chinese Character Recognition, Zhejiang Science and Technology Press, 1998 (in Chinese).
52. Dai Ruwei, Wang Lixin, Pattern recognition systems integration by metasynthesis, Systems Science and Systems Engineering, Scientific and Technical Documents House, Beijing, 1997, 7-13.
53. B.H. Xiao, C.H. Wang, R.W. Dai, Adaptive combination of classifiers and its application to handwritten Chinese character recognition, Proc. 15th ICPR, Barcelona, 2000, pp. 327-330.
54. Wang Chunheng, Xiao Baihua, Dai Ruwei, Parallel compact integration in handwritten Chinese character recognition, Science in China Series F--Information Sciences, 47(1): 89-96, 2004.
55. B.H. Xiao, C.H. Wang, R.W. Dai, Handwritten Chinese character recognition by metasynthetic approach, Int. J. Information Technology and Decision Making, World Scientific Press, 1(4): 621-634, 2003.
56. K.S. Fu, Sequential Methods in Pattern Recognition and Machine Learning, Academic Press, New York, 1968.
57. K.S. Fu, Pattern Recognition and Machine Learning, Plenum Press, 1971.
58. K.S. Fu, Grammatical inference: introduction and survey, Part I and Part II, IEEE Pattern Analysis and Machine Intelligence, 8(3): 343-375, 1986.
59. K.C. You, K.S. Fu, A syntactic approach to stage recognition using attributed grammars, IEEE Trans. System, Man, and Cybernetics, 9(6): 334-345, 1979.
60. W.-H. Tsai, K.S. Fu, Attributed grammar--a tool for combining syntactic and statistical approaches to pattern recognition, IEEE Trans. System, Man, and Cybernetics, 10(12): 873-885, 1980.
61. W.-H. Tsai, K.S. Fu, A syntactic-statistical approach to recognition of industrial objects, Proc. 5th ICPR, Miami, 1980, pp. 251-259.
62. W.-H. Tsai, K.S. Fu, Error-correcting isomorphism of attributed relational graphs for pattern analysis, IEEE Trans. System, Man, and Cybernetics, 9(12): 757-768, 1979.
63. W.-H. Tsai, K.S. Fu, A pattern deformation model and Bayes error-correcting recognition system, IEEE Trans. System, Man, and Cybernetics, 9(12): 745-756, 1979.
64. J.W. Tai, K.S. Fu, Semantic syntax-directed translation for pictorial pattern recognition, Technical Report, School of EE, Purdue University, TR-EE 81-83, Oct 1981.
65. D.E. Knuth, Semantics of context-free language, Journal of Mathematical System Theory, 2(2): 127-145, 1968.
66. J.W. Tai, A kind of relational attributed grammars, Acta Automatica Sinica, 9(2), 1983 (in Chinese).
67. Qian Xuesen, Yu Jingyuan, Dai Ruwei, A new discipline of science--the study of open complex giant system and its methodology, Chinese J. System Engineering and Electronics (in English), 4(2): 2-12, 1993.
