The recognition of handwritten digit strings of unknown length using hidden Markov models

S. Procter, J. Illingworth and A. J. Elms
School of Electronic Engineering, Information Technology and Mathematics
University of Surrey, Guildford, Surrey, GU2 5XH, United Kingdom
fS.Procter,[email protected]

Abstract

We apply an HMM-based text recognition system to the recognition of handwritten digit strings of unknown length. The algorithm is tailored to the input data by controlling the maximum number of levels searched by the Level Building (LB) search algorithm. We demonstrate that setting this parameter according to the pixel length of the observation sequence, rather than using a fixed value for all input data, results in a faster and more accurate system. Best results were achieved by setting the maximum number of levels to twice the estimated number of characters in the input string. We also describe experiments which show the potential for further improvement by using an adaptive termination criterion in the LB search.

1. Introduction

Hidden Markov models (HMMs) [5] have been widely used in the field of speech recognition for many years [4], but have only recently begun to receive a similar degree of attention in the context of text recognition [1, 3]. The HMM approach is particularly suited to the recognition of handwritten text because it does not rely on the prior segmentation of words into characters, which is the basis of many existing OCR systems. Such segmentation-based techniques often fail catastrophically when segmentation of the text string is difficult or impossible, as is frequently the case with unconstrained handwriting. This paper focuses on the Level Building (LB) algorithm [6], one of the fundamental building blocks of an HMM-based system for segmentation-free recognition. The LB algorithm matches models against an observation sequence without first segmenting the sequence into subsequences that may have been produced by different models: our system uses HMMs to model individual characters, and the LB algorithm to match these models to unsegmented text strings. The particular focus of this paper is to examine the effect of one algorithm parameter, the maximum number of levels L, on the speed and accuracy of the system.

The following section briefly describes the operation of the LB algorithm in the context of an HMM-based text recognition system [2]. Section 3 describes our experiments to examine the effect of the parameter L on the performance of the system. Section 4 describes a method by which the parameter can be adjusted adaptively as the algorithm proceeds, improving the efficiency of the system. Finally, Section 5 discusses the results of our experiments and indicates possible directions for future research.

2. The Level Building algorithm

In the same way that the Viterbi algorithm matches a single model to a sequence of observations, determining the maximum likelihood state sequence of the model given the observations, the LB algorithm [6] matches an observation sequence to a number of models. It jointly optimises the segmentation of the sequence into subsequences produced by different models, and the matching of those subsequences to particular models. The technique is very useful in the field of text recognition since, in conjunction with hidden Markov models (HMMs) [5], it allows the recognition of entire words without their prior segmentation into characters [2].

The operation of the LB algorithm in our system is depicted in Figure 1. Each level enumerates the match of a single character model, consisting of a number of internal states, to a part of the observation sequence, which is a function of time t. If the entire observation sequence has length T, then each path through the trellis from t = 0 to t = T - 1 corresponds to a particular match of a sequence of models to the observations. The algorithm proceeds one level at a time. At level l = 0, each model w is matched to the observation sequence from time t = 0. An array P(l, t, w) is maintained, holding the probability of the

[Figure 1. The Level Building algorithm: a trellis of character-model levels (Level 0, Level 1, Level 2, ..., Level L-1), with HMM state index plotted against time.]

sequence up to and including model w at level l matching the observation sequence up to time t. After each level is completed, the level-reduced array P̂ is computed:

P̂(l, t) = max_w P(l, t, w)

such that P̂(l, t) is the probability of the best match sequence at level l and time t. Subsequent levels pick up from the best match at the previous level. The probability of the maximum likelihood match of l models to the entire observation sequence is Pmax(l) = P̂(l, T - 1), and the probability of the overall best match model sequence is the maximum of Pmax(l) over all l, 0 ≤ l < L. The sequence of models comprising this best match is recovered by tracing a path back through the trellis, using an array of back-pointers constructed as the trellis is built up.

Previous work on the LB algorithm has not addressed the issue of how to select the maximum number of levels L: researchers have tended simply to choose an arbitrary "large enough" value. However, building more levels than necessary wastes computation time, and also provides more opportunity for accidental alignments to produce high-probability incorrect matches, possibly causing the correct match to be overlooked. Until recently, our work followed previous implementations of the LB algorithm in using a simple fixed number of search levels. This approach is far from ideal, however, in applications where the test data consists of strings of unknown length; in that case the question of how many levels to build is particularly sensitive.
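The level-by-level recursion described above can be sketched in code. The following is a minimal illustrative implementation, not the authors' system: it assumes a caller-supplied scoring function `match(w, s, t)` standing in for the Viterbi pass that scores character model `w` against the observations from time `s` to `t`, and it folds the max-over-models reduction of P̂ directly into the per-level array.

```python
def level_building(obs_len, models, match, max_levels):
    """Level Building search (illustrative sketch).

    match(w, s, t) -> log-probability of model w matching the
    observations from time s to t inclusive (in the real system,
    a Viterbi pass through the HMM for character w).
    """
    NEG = float("-inf")
    T = obs_len
    # P_hat[l][t]: best log-probability of any sequence of l+1 models
    # ending at time t (level-reduced array, max over w built in)
    P_hat = [[NEG] * T for _ in range(max_levels)]
    # back[l][t]: (model, start_time) achieving P_hat[l][t]
    back = [[None] * T for _ in range(max_levels)]

    for l in range(max_levels):
        for t in range(T):
            for w in models:
                if l == 0:
                    # level 0: every model starts at time t = 0
                    cands = [(match(w, 0, t), 0)]
                else:
                    # later levels pick up from the best match
                    # ending at s - 1 on the previous level
                    cands = [(P_hat[l - 1][s - 1] + match(w, s, t), s)
                             for s in range(1, t + 1)
                             if P_hat[l - 1][s - 1] > NEG]
                for p, s in cands:
                    if p > P_hat[l][t]:
                        P_hat[l][t] = p
                        back[l][t] = (w, s)

    # Pmax(l) = P_hat[l][T-1]; pick the number of levels maximising it
    best_l = max(range(max_levels), key=lambda l: P_hat[l][T - 1])
    # trace the optimum path back through the back-pointer array
    seq, t, l = [], T - 1, best_l
    while l >= 0:
        w, s = back[l][t]
        seq.append(w)
        t, l = s - 1, l - 1
    return P_hat[best_l][T - 1], list(reversed(seq))
```

Any log-probability function will do for `match` when experimenting with the control flow; only the trellis construction and traceback above are specific to the LB search.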

3. Fixed versus variable levels

As a first attempt to increase the efficiency of our system, we made the maximum number of levels for each field proportional to the width of the field in pixels, rather than

[Figure 2. Effect of L on recognition performance: character and field recognition rates (%) plotted against speed (characters per second), for the fixed and proportional methods.]
a fixed value for all fields, i.e. L = x/x0, where x is the width of the observation sequence in pixels and x0 is a constant. The system was evaluated for various values of x0 between 10 and 27 pixels per level. For comparison we also evaluated the system with the maximum number of levels L fixed at various values between 15 and 25. To compare the two approaches, we plotted their recognition rates at both character and field level against the speed at which the system could process the input, in characters per second. Obviously, the more levels explored by the algorithm, the slower the system became. The results of these tests are shown in Figure 2. The test set consists of the digit fields from the first 50 forms in NIST Special Database 1: a total of 6500 characters in 1400 fields. Each form contains three fields of length 10, and five fields of each length from 2 to 6 digits. The system was trained on approximately 6000 examples of each digit. The times quoted are CPU times on an SGI Power Challenge with a 200MHz R10000 processor. Note that the software is currently at the development stage and no optimisation for speed has been performed; recognition speeds are quoted merely for relative comparison between the various trials described in this paper.

The character recognition rates behave in a very predictable manner, remaining constant until the number of levels falls below a critical point, where they begin to fall rapidly. The field recognition rates actually increase slightly as the number of levels is reduced towards the critical point, because the number of insertion errors falls faster than the number of deletions rises. There is little difference between the recognition rates of the best examples of the two methods, but the graphs clearly show that the variable-levels method is about 6% faster than the fixed-levels method for a given recognition rate.

4. Adaptive level building

The ideal method of determining how many levels to use in the LB algorithm would be to recognise the point at which exploring further levels would not significantly improve the results, and to terminate the search there. On examining the probabilities of the matches produced at each level of the LB search, it is apparent that the curve tends to be concave, i.e. the likelihood of the best match of l models to an observation sequence tends to increase monotonically with l until some optimum value lopt is reached, and then steadily declines for l > lopt. An example of this behaviour is shown in Figure 3, which plots the log probability of the maximum likelihood match of l models to an observation sequence, for l = 2, 3, ..., 40. The system was able to correctly recognise the sequence at level 10, the lowest level possible since the sequence contains 10 characters. The increasing probabilities observed between level 10 and the maximum at level 17 are due to the insertion of space characters, producing a better match to the input sequence.

[Figure 3. Log probability of the best match of l models to an observation sequence.]

If the relationship between l and Pmax(l) were truly concave, it would be possible to detect the level of the optimum match as the minimum l satisfying Pmax(l) > Pmax(l + 1), i.e. the search could be terminated as soon as the probability of the best match at the current level fell below that of the previous level. Although there is no guarantee that the probability function will always be truly concave, the relationship appears sufficiently well behaved for this strategy to succeed. The test was therefore repeated with no predetermined maximum L; instead the search was terminated as soon as the maximum probability decreased from one level to the next. For increased robustness, two further trials were performed, in which the search was terminated only after n = 2 or 3 consecutive falls in Pmax(l). The recognition rates of the n = 2 and n = 3 trials (rf = 74.2%, rc = 93.25%) are identical to those of the x0 = 10 trial in the previous section, while the new trials are about 17% and 19% faster respectively; this indicates that the adaptive approach holds some promise for speeding up recognition without affecting performance. The n = 1 trial performed similarly to the fastest variable-levels trials in the previous section, in terms of both speed and accuracy. However, the previous trials with x0 close to the critical point at around x0 = 22 achieve a higher field recognition rate than the adaptive trials while running at virtually comparable speeds, indicating that a more intelligent method of selecting the search cutoff point is required.

5. Discussion

In this paper we have shown that careful choice of the maximum number of levels L in the LB search has beneficial effects on both the speed and accuracy of an HMM-based text recognition system. Overall best performance was achieved by the x0 = 22 trial, equivalent to approximately two levels per digit for our test data, which seems very appropriate: just enough to model each digit and all inter-character spaces. We believe that an adaptive termination strategy for the LB search has the potential to produce further improvements, particularly in the speed of the system, but a more intelligent approach is required than that described in the previous section. One possibility is to end the search when the same segmentation, disregarding spaces, has been produced at a certain number of consecutive levels. Another approach would be to terminate the search as soon as some minimum probability threshold has been reached.
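The adaptive termination criterion of Section 4 can be sketched as follows. This is an illustrative outline, not the authors' implementation: `match(w, s, t)` again stands in for the per-model Viterbi scorer, the level arrays are built one at a time so the search can stop early, and `hard_cap` is a hypothetical safety limit (which could itself be set proportionally, e.g. to the field width divided by x0).

```python
def adaptive_level_building(obs_len, models, match, n_falls=2, hard_cap=50):
    """LB search that stops after n_falls consecutive falls in Pmax(l).

    match(w, s, t) -> log-probability of model w matching observations
    s..t inclusive (a Viterbi pass in the real system).
    Returns (best probability, number of models in the best match).
    """
    NEG = float("-inf")
    T = obs_len
    prev_level = None                  # level-reduced array at l - 1
    best_p, best_l = NEG, -1
    pmax_prev, falls = NEG, 0

    for l in range(hard_cap):
        cur = [NEG] * T
        for t in range(T):
            for w in models:
                if l == 0:
                    p = match(w, 0, t)  # level 0 starts at time 0
                else:
                    # best continuation from the previous level
                    p = max((prev_level[s - 1] + match(w, s, t)
                             for s in range(1, t + 1)
                             if prev_level[s - 1] > NEG), default=NEG)
                if p > cur[t]:
                    cur[t] = p
        pmax = cur[T - 1]              # Pmax(l) for this level
        if pmax > best_p:
            best_p, best_l = pmax, l
        # terminate after n_falls consecutive decreases in Pmax(l)
        falls = falls + 1 if pmax < pmax_prev else 0
        if falls >= n_falls:
            break
        pmax_prev = pmax
        prev_level = cur

    return best_p, best_l + 1
```

With `n_falls=1` this reproduces the simplest stop-on-first-fall rule; larger values trade a little extra computation for robustness against locally non-concave Pmax(l) curves, mirroring the n = 1, 2, 3 trials described above.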

References

[1] C. Bose and S. Kuo. Connected and degraded text recognition using hidden Markov model. Pattern Recognition, 27(10):1345-1363, 1994.
[2] A. Elms. A connected character recogniser using level building of HMMs. In Proc. 12th Int. Conf. Pattern Recognition, pages 439-442, 1994.
[3] M. Gilloux, M. Leroux, and J.-M. Bertille. Strategies for handwritten words recognition using hidden Markov models. In Proc. 2nd Int. Conf. Document Analysis and Recognition, pages 299-304, 1993.
[4] X. Huang, Y. Ariki, and M. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.
[5] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.
[6] L. Rabiner and S. Levinson. A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building. IEEE Trans. Acoustics, Speech and Signal Processing, 33(3):561-573, 1985.