KEYWORD SEARCH IN HANDWRITTEN DOCUMENTS

Aleksander Kołcz, Joshua Alspector, Marijke Augusteijn, Robert Carlson and George Viorel Popescu*

University of Colorado at Colorado Springs, 1420 Austin Bluffs Pkwy., Colorado Springs, CO 80907, USA
(ark@eas, josh@eas, mfa@antero, carlson@vision).uccs.edu

*Rutgers University, CAIP, Frelinghuysen Road, PO Box 1390, Piscataway, NJ 08855, USA
[email protected]

Abstract The problem of word recognition in databases of handwritten documents is addressed. The particular approach taken exploits the natural line-oriented structure of handwritten material, which facilitates the search and allows the utilization of the temporal information embedded in the data. The proposed method is based on elastic matching between features extracted from the keywords, provided as search inputs, and the candidate document words. The feature-space distance between the input and candidate words serves as a basis for estimating the probability of a correct match, and the probabilities obtained for different feature sets are combined to give a single final estimate. Several search modes are considered in a series of experiments that quantify the effectiveness of the current approach. Future improvements of the method, as well as its extensions to other related search problems, are suggested.

1 Introduction

Although much progress has been made in the area of isolated character recognition [1], the results do not transfer easily to the problem of handwriting recognition, mainly due to the difficulty of reliably segmenting cursive writing into individual characters. There is a pressing need for effective ways of recognizing handwritten words, as there is a large collection of (especially historical) documents that have never been converted into typewritten form, and analysing and searching such data is tedious and time-consuming. A visual variant of the popular grep command would be particularly useful, since it would identify document regions containing material resembling that provided by the user, thus allowing examination of the handwritten text together with additional important attributes (e.g., style) that would be lost if the documents were simply transcribed into typewritten form. Such a tool would also be useful in situations where it is impossible to translate the graphical material into symbols (e.g., due to a lack of understanding of the particular language structure), as well as in cases where the graphical information does not represent text but other, similarly structured information. As such, the graphical approach would provide an alternative to methods that take advantage of the lexical (or character-oriented) structure of text data [2, 3].

Figure 1. A typical page from the Archives database.

Our approach follows the general principles of handwriting recognition [1], where preprocessing of the image data is followed by feature extraction and recognition that combines the results obtained with different features. What makes it distinctive is the emphasis placed on the line structure present in handwritten data. In most languages, writing proceeds along a predetermined direction, with parallel lines of text constituting a document page. For each individual line there is a definite time ordering of consecutive symbols, which makes it possible to associate each line of text with a time axis, although strict time ordering is less valid for the actual strokes of which characters are composed. We attempt to exploit the line-oriented structure inherent in handwritten data, particularly during the feature-selection stage.

We address the problem of word spotting using a database of fragments from the Archives of the Indies (Seville, Spain), representing the official Spanish records following the New World conquest. These documents are especially challenging: they feature large variability of style and, typically for the period, flowery handwriting in which overlap between individual words and variations of character size are common. Figure 1 shows a typical page from the Archives.

2 Line-oriented document processing

Following the standard preprocessing stages of noise removal and deskewing, the document pages are segmented into lines along which the search is performed. Although line segmentation introduces desirable structure into the search process, it also introduces certain errors, as parts of words may be clipped and divided between neighboring lines. The line-segmentation algorithm is based on a Fourier technique: the fundamental frequency of the horizontal ink-density histogram of a document page is estimated and interpreted as the average line width, from which line-break positions are selected. This width is then locally varied to account for the irregularities in line spacing inherent in handwritten documents. The same ink-density histogram is also used in the deskewing process, where the maximization of the histogram peaks is sought. Figure 2 illustrates the line-segmentation process; the segmentation noise due to the presence of significant text skew can be seen.

The script data along consecutive document lines are further processed by extracting salient features of the text. In our approach, feature extraction is based on recording the positions of text-background transitions occurring within the vertical pixel columns along a document line. For each pixel column, the ordered set of transition points provides a form of run-length code, allowing reconstruction of the complete inked and background regions. Although the full set of such transitions provides a complete description of the data, it can be reduced to a certain extent to simplify the search process, while still providing an accurate description.
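To make the Fourier technique concrete, the following minimal sketch (in Python with NumPy) estimates the average line width from the dominant frequency of the ink-density histogram and places locally adjusted line breaks. The function names and the ±width/4 adjustment window are illustrative assumptions rather than the exact implementation.

import numpy as np

def estimate_line_width(page):
    # page: 2-D binary array, 1 = inked pixel, row 0 = top of the page
    density = page.sum(axis=1).astype(float)   # horizontal ink-density histogram
    density -= density.mean()                  # remove the DC component
    spectrum = np.abs(np.fft.rfft(density))
    k = int(spectrum[1:].argmax()) + 1         # dominant non-DC frequency bin
    return len(density) / k                    # its period = average line width

def segment_lines(page):
    # Place breaks near multiples of the estimated width, each snapped to
    # the nearest ink-density minimum (the valley between text lines).
    density = page.sum(axis=1).astype(float)
    width = estimate_line_width(page)
    breaks, pos = [], width
    while pos < len(density):
        lo = max(0, int(pos - width / 4))
        hi = min(len(density), int(pos + width / 4))
        breaks.append(lo + int(density[lo:hi].argmin()))
        pos = breaks[-1] + width
    return breaks                              # row indices of line boundaries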

[Figure 2 shows three panels: the results of line segmentation (ink density plotted against pixel position along the line), the ink-density histogram, and the FFT of the histogram, in which the dominant frequency is clearly present.]

Figure 2. Illustration of the line-segmentation algorithm (compare with the original page in Figure 1).

In the current implementation, only the top and bottom transition points are used, which is equivalent to extracting the upper and lower word profiles (or outlines). More precisely, the upper and lower profiles are created by recording the top-most and bottom-most inked pixels, respectively, for each vertical pixel column along a document line. This process is illustrated in Figure 3, where the profiles are extracted for an instance of the word gobernador. The shaded areas visualize the interpretation of the profiles as word outlines viewed from above or below the word. As seen in Figure 3, special attention has to be paid to small gaps (i.e., blank pixel columns) occurring within words. Several methods of treating these artifacts can be envisioned, such as assigning the profiles within the gap region a constant value (e.g., corresponding to the top or bottom line border, or to the average profile value). Eventually, we settled on the variant where profile values within a gap are set to the average non-gap profile value within the line region considered. An analogous profile extraction is performed for the keyword (or model) inputs to the system.
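A minimal sketch of this profile extraction, including the adopted gap-filling variant, might look as follows (the binary-image representation and NaN bookkeeping are our own implementation choices):

import numpy as np

def word_profiles(line_img):
    # line_img: 2-D binary array for one text line; 1 = inked pixel
    rows, cols = line_img.shape
    inked = line_img.any(axis=0)               # columns containing ink
    upper = np.full(cols, np.nan)
    lower = np.full(cols, np.nan)
    for c in np.flatnonzero(inked):
        r = np.flatnonzero(line_img[:, c])
        upper[c] = r[0]                        # top-most inked pixel
        lower[c] = r[-1]                       # bottom-most inked pixel
    if inked.any():
        # gaps (blank columns) receive the average non-gap profile value
        upper[~inked] = np.nanmean(upper)
        lower[~inked] = np.nanmean(lower)
    return upper, lower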

Figure 3. Illustration of the upper (left) and lower (right) profile-extraction process.

Given a set of keyword inputs covering the class of words to be found in the document, the search procedure considers each keyword in turn and finally combines the keyword-specific results into the overall system output. During each keyword search, the lines of the segmented document pages are scanned from left to right and each pixel position is considered a potential starting point of a candidate word; no attempt at explicit word segmentation is made. Instead, a window whose length is equal to the length of the model word is initially taken to contain the candidate word, with its length subsequently adjusted according to the estimated difference in horizontal scale between the candidate and the model.

Experience with the data suggests that limiting the acceptable range of horizontal-scale ratios to [0.5, 2] is reasonable. The search method is based on the idea of elastic matching: at every potential candidate (pixel) position along a document line, a transformation between the representations of the model and candidate words is found such that their (suitably defined) distance is minimized. The set of transformations currently considered consists of affine transformations. In future work we will also incorporate a more general class of nonlinear transformations, with a suitable cost function penalizing transformations that lead to significant changes of the original word shape.
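The scanning procedure can be sketched as follows; for simplicity, the sketch approximates the affine fit with a small discrete set of horizontal scales drawn from the accepted [0.5, 2] range, so it should be read as an illustration of the windowing and re-scaling logic rather than the full elastic-matching optimization.

import numpy as np

def scan_line(model, line, scales=(0.5, 0.75, 1.0, 1.5, 2.0)):
    # model, line: 1-D profile waveforms; returns, for every starting
    # pixel, the best normalized distance over the admissible scales.
    m = len(model)
    best = np.full(len(line), np.inf)
    for start in range(len(line)):
        for s in scales:                       # scales limited to [0.5, 2]
            length = int(round(s * m))
            window = line[start:start + length]
            if len(window) < length:
                break                          # window runs past the line end
            # re-scale the candidate window to the model length
            resampled = np.interp(np.linspace(0.0, length - 1.0, m),
                                  np.arange(length), window)
            best[start] = min(best[start], float(np.abs(resampled - model).mean()))
    return best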

3 The search and selection process

During the search of a document line using one of the input models, once the length of a candidate word has been estimated for a given pixel position, the distance between the candidate and the model word is calculated. More precisely, the distance between the corresponding upper and lower profiles of the candidate and model words is computed, where the distance measure can be based on any suitable metric, for example the Euclidean or the city-block distance. Additional normalization makes the recorded distances independent of the lengths of the words being compared: the profiles are first re-scaled to the same length and the resulting distance is divided by that length. Thus, if u(i), i = 1, ..., T, and w(i), i = 1, ..., T, represent the (upper or lower) profiles being compared (after re-scaling to the same length), then their distance D(w, u) is defined by

D(w, u) = (1/T) Σ_{i=1}^{T} d(u(i), w(i))

where d denotes the (e.g., Euclidean) distance between profile elements. The upper and lower profile distances are computed for all segmented lines of the document pages and for all models in the keyword set. Thus, in the end, the intermediate search results are given by a set of inter-profile distance functions, which serve as a basis for calculating a probability estimate indicating the chance of a correct match. Such an estimate can be derived for matching according to each of the individual profiles (i.e., upper and lower) separately, but ultimately we seek to combine the information contained in the individual features to obtain a single, more accurate estimate incorporating all the information available.
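As a self-contained illustration, the normalized distance D can be computed as below, with the absolute difference standing in for the (scalar) element distance d; the common-length re-scaling by linear interpolation is an assumption, as the resampling method is not specified above.

import numpy as np

def profile_distance(u, w):
    # D(w, u) = (1/T) * sum_{i=1..T} d(u(i), w(i)), after re-scaling both
    # profiles to a common length T so that D is length-independent.
    T = max(len(u), len(w))
    def rescale(p):
        return np.interp(np.linspace(0.0, len(p) - 1.0, T),
                         np.arange(len(p)), p)
    return float(np.abs(rescale(u) - rescale(w)).mean())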

Deriving the correct form of a probability function is difficult, as the function depends inherently on the distance metric considered, as well as on the class properties of the document words. Possible methods involve parametric and nonparametric approaches; in the latter case, learning algorithms (e.g., neural networks) can be employed to derive the appropriate distance-to-probability mapping directly from the data. The nonparametric approach to estimating the probability of a correct match is still under development, and the approach presently used relies on a simple but plausible transformation that maps the distance values between the profile features into a "pseudo-probability" of a correct match. More precisely, the inter-profile distances are monotonically transformed into the [0, 1] range by the mapping

p = exp(−α · d)

where d represents the distance between (upper or lower) profiles at a given pixel position and p denotes the corresponding pseudo-probability. This probabilistically motivated transformation normalizes the distance waveforms to lie within the [0, 1] interval while preserving the relative ordering of distances, so that low values of d (indicating positions of a likely match) are mapped into high values of p. The individual probability functions of the upper and lower profiles are subsequently combined (assuming their partial independence) using the formula

p = 1 − (1 − p_u)(1 − p_d)

where p represents the combined probability measure, while p_u and p_d correspond to the upper and lower profiles, respectively.

Absolute values of the combined probability functions can be used as a starting point for selecting correct-match candidates in the search. In the most straightforward approach, the probability value at each pixel position along each document line could be compared with all others to determine whether a candidate word starting at that position is a likely match. We can expect, however, that only local maxima of the probability waveform are likely to correspond to valid match candidates. Additionally, since the probability waveforms are calculated for the search results with each element of the model set, it makes sense to combine these results so that only the "strongest" model-based candidates are considered. We implement these arguments by means of a two-stage filtering process. In the first stage, local maxima of the probability waveforms are identified for the search results with each model. Here, the length of the locality window is equal to the length of the model, with the window centered at the start of the candidate word. This procedure produces a set of (probability) peak locations for each document line and each model.
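Both the distance-to-probability mapping and the combination rule above translate directly into code; note that no value for α is specified, so the default below is an arbitrary assumption.

import numpy as np

def pseudo_probability(d, alpha=1.0):
    # p = exp(-alpha * d): small distances map to probabilities near 1
    return np.exp(-alpha * np.asarray(d, dtype=float))

def combine_profiles(p_u, p_d):
    # p = 1 - (1 - p_u)(1 - p_d), assuming partial independence of the
    # upper- and lower-profile evidence
    return 1.0 - (1.0 - p_u) * (1.0 - p_d)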

Subsequently, during the second filtering stage, some of the potential candidate locations are eliminated by requiring that a valid candidate must correspond to a local maximum of the probability function when the results corresponding to all models are taken into account. Here, the length of the peak-locality window is set to the average length of a model word.
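The two-stage filter amounts to non-maximum suppression applied twice; the sketch below follows that reading, although the exact stage-two criterion (a peak must also dominate the across-model envelope within the average-length window) is our interpretation of the description.

import numpy as np

def local_maxima(p, window):
    # indices attaining the maximum of p within a centered window
    half = max(1, window // 2)
    return [i for i in range(len(p))
            if p[i] == p[max(0, i - half):i + half + 1].max()]

def filter_candidates(model_probs, model_lengths):
    # Stage 1: per-model peaks, locality window = model length.
    # Stage 2: keep peaks that are also maxima of the across-model
    # envelope within a window of the average model length.
    envelope = np.max(np.stack(model_probs), axis=0)
    stage2 = set(local_maxima(envelope, int(np.mean(model_lengths))))
    kept = set()
    for p, m in zip(model_probs, model_lengths):
        kept.update(i for i in local_maxima(p, m) if i in stage2)
    return sorted(kept)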

4 Experimental results

To assess the effectiveness of our approach, several experiments were carried out in which a 13-page document was searched using a set of 19 model words (instances of the Spanish word gobernador, which occurs often in the archive). Each page consisted of 3 lines of text and contained one instance of the word gobernador (Figure 4 provides an example).

Figure 4. Example of a document "page" used in the simulations. The word gobernador can be seen in the center.

We identified several search modes whose practical utility depends on the amount of user interaction desired. Generally, however, the search was considered successful if any of the model words in the set led to the localization of a correct word within the scope of the search, where correct localization means that the identified candidate and the correct word in the document overlap. The most likely candidates were identified as those corresponding to the highest (i.e., above a given threshold) values of the probability function. By varying the acceptance threshold, the number of correctly found instances of gobernador can be traded against the total number of candidates labeled as "correct". The results are visualized by means of a success-rate vs. false-alarm-rate plot, which indicates the number of false candidates selected for a given proportion of correct matches.
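The trade-off curve can be produced by sweeping the acceptance threshold over the filtered candidates, as in the sketch below; the candidate tuple layout and the overlap test are our own assumptions about the bookkeeping.

def sweep_thresholds(candidates, true_spans, thresholds):
    # candidates: (start, length, probability) tuples that survived filtering;
    # true_spans: (begin, end) extents of the correct words on the page.
    curve = []
    for t in thresholds:
        kept = [(s, l) for s, l, p in candidates if p >= t]
        hits = sum(any(s < e and s + l > b for b, e in true_spans)
                   for s, l in kept)
        curve.append((t, hits, len(kept) - hits))  # threshold, correct, false alarms
    return curve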

Currently available results correspond to the search based on the upper and lower profiles only, with the probability of a correct match at a given position along a document line obtained by combining the individual profile-based probabilities, assuming their mutual independence (which is not generally valid) and a Gaussian distribution (as described in Section 3).


Figure 5. Experimental results of localizing the correct word on a page. The left part shows the recognition rate (and its cost in terms of the fraction of candidates considered) as functions of the probability threshold. The right part shows the success rate vs. the average number of incorrect candidates (per page) included in the set.

The results are summarized in Figure 5, where the success rate is plotted against the false-alarm rate (right), representing the number of incorrect candidates included for a given threshold value. As can be seen, several instances of the word gobernador provide very close matches with the model set, but complete recognition can be achieved only at a relatively high cost in terms of the number of false candidates that must be considered. The left graph of Figure 5 shows the recognition rate and false-alarm rate as functions of the pseudo-probability threshold; here, the false-alarm rate is given as the proportion of false candidates selected for a given threshold value. The two curves are clearly separated, although there is definite scope for further improvement. It is hoped that by extending the feature set (e.g., to the full information provided by the pixel columns) and by allowing nonlinear deformations within the class of elastic matching used by the algorithm, the current results will be substantially improved.

5 Conclusions

The results obtained, although far from the desired goal, suggest that even with a limited feature set (i.e., incorporating the upper and lower word profiles only) and a crude estimate of the probability of a correct match, the method is fairly effective and able to localize the correct word candidates. Hence, the upper and lower profile waveforms represent valid features of handwritten words, although they appear insufficient to offer practical levels of word-spotting performance within the current framework. We expect the enhancements currently under development to improve the method's performance significantly. In particular, we plan to apply generative stroke-based models [4] to account for the possible variation within the class of input words without the need for excessively large amounts of input data. This approach should be particularly helpful in cases where the input words are rare and their various instances are not readily available. Stroke models will also enable search methods based on subword matching, whose accuracy may exceed that of whole-word procedures by taking advantage of the temporal information embedded in the documents (e.g., via application of hidden Markov model methods [2]). Additional improvements are expected when the full set of features corresponding to ink-background transitions along pixel columns is used, as opposed to the upper and lower profiles only. We also expect to extend our method to other similarly structured search problems, for example those involving video archives.

Acknowledgment We thank Ms. Victoria Carmona Vergara of Seville, Spain, for her help in obtaining sample documents from the Archives of the Indies.

References

[1] S. Impedovo, Fundamentals in Handwriting Recognition. Springer-Verlag, 1994.

[2] A. D. Mohamed and P. D. Gader, "Handwritten word recognition using segmentation-free hidden Markov modeling and segmentation-based dynamic programming techniques," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 548-554, 1996.

[3] P. D. Gader, M. Mohamed, and J. H. Chiang, "Handwritten word recognition with character and inter-character neural networks," IEEE Transactions on Systems, Man, and Cybernetics - Part B, vol. 27, pp. 158-164, 1997.

[4] M. Revow, C. K. I. Williams, and G. E. Hinton, "Using generative models for handwritten digit recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 592-606, 1996.