Pattern Recognition 42 (2009) 3146 -- 3157


Handwritten Chinese text line segmentation by clustering with distance metric learning

Fei Yin ∗, Cheng-Lin Liu

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing 100190, PR China

ARTICLE INFO

Article history:
Received 7 August 2008
Received in revised form 21 November 2008
Accepted 20 December 2008

Keywords:
Handwritten text line segmentation
Clustering
Minimal spanning tree (MST)
Distance metric learning
Hypervolume reduction

ABSTRACT

Separating text lines in unconstrained handwritten documents remains a challenge because handwritten text lines are often non-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of a document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric through supervised learning on a dataset of pairs of CCs, the proposed algorithm is made robust enough to handle various documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a correct line detection rate of 98.02% and compared favorably with other competitive algorithms. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis lies mainly in complex layout structure and degraded image quality, handwritten document analysis is difficult mainly due to the irregularity of layout and character shapes originating from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition remain unsolved, though enormous efforts have been devoted to them and great advances have been made.
Text line segmentation of handwritten documents is much more difficult than that of printed documents. Whereas printed documents have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text

∗ Corresponding author. Tel.: +86 10 6263 2251. E-mail addresses: [email protected] (F. Yin), [email protected] (C.-L. Liu). 0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2008.12.013

line detection techniques, such as projection analysis [1–7] and K-nearest neighbor connected components (CCs) grouping [12–14], are not able to segment handwritten text lines successfully. Fig. 1 shows an example of an unconstrained handwritten Chinese document with segmentation results from the X–Y cut algorithm [1], the stroke skew correction algorithm [6], the Docstrum algorithm [12] and the piece-wise projection algorithm [5]. In this case, the X–Y cut algorithm and the stroke skew correction algorithm succeed in detecting the text lines but fail to locate their boundaries. The Docstrum algorithm locates the boundaries of text lines very well, but fails to detect some lines correctly (the first and fourth lines in Fig. 1(c)) because of the anomalous size of characters. Although the piece-wise projection algorithm can overcome the aforementioned errors, it fails to segment some small-size CCs (the first and eighth lines in Fig. 1(d)).
Many efforts have been devoted to the difficult problem of handwritten text line segmentation [1–28]. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters, under the assumption of straight lines. Bottom-up methods group small units of the image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways.


Fig. 1. An example of handwritten document with text lines segmented by the X–Y cut algorithm (a), the stroke skew correction algorithm (b), the Docstrum algorithm (c) and the piece-wise projection algorithm (d).

All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, on the other hand, are computationally complicated, and the design of a robust combination scheme is non-trivial.
In this paper, we propose an effective bottom-up method for text line segmentation in unconstrained handwritten Chinese documents. Our approach is based on minimal spanning tree (MST) clustering of CCs, with the distance metric between CCs designed by supervised learning. The number of clusters, namely the number of text lines, is decided automatically by a new hypervolume reduction criterion. Except for some empirical parameters in the pre-processing of CCs and the post-processing of text lines, the clustering algorithm itself has no artificial parameters. An experimental comparison of clustering with the learned metric against clustering with an artificially designed metric shows that supervised metric learning largely improves the accuracy of text line segmentation. The proposed method was also compared with other state-of-the-art methods on a large database of handwritten Chinese documents, and its superiority was demonstrated. By customizing the between-component features and training with documents of specific languages, we suggest that the proposed method is also applicable to documents of other languages.
The rest of this paper is organized as follows. Section 2 gives a brief review of related work. Section 3 provides an overall description of our clustering-based text line segmentation method, and Section 4 elaborates the distance metric learning scheme. Section 5 presents the hypervolume reduction criterion and the straightness measure for text line grouping. Experimental results are presented in Section 6 and concluding remarks are given in Section 7.

2. Previous works

The structure of a document image is a hierarchy of text regions, text lines, words, characters and CCs. Text lines can be extracted by top-down region partitioning, bottom-up component aggregation, or a hybrid scheme. Some representative segmentation methods are reviewed below.
The X–Y cut algorithm [1,2] is a typical projection-based top-down segmentation method. It uses horizontal and vertical projection histograms alternately along the X and Y axes to partition the document image into a hierarchical tree structure in which each leaf node represents a text line. Because of the assumption of parallel text lines and significant between-line gaps, this method performs well only on printed documents. Some modified projection-based methods have been proposed to deal with slightly curved handwritten text lines. The piece-wise projection approaches partition the document image into several vertical strips [3–5]. The text lines, assumed to be approximately straight within a strip, are extracted from each strip according to horizontal projection profiles and then connected using heuristic rules. Su et al. [6] use the horizontal stroke histogram to detect the skew of handwritten Chinese documents and segment the text lines with the projection histogram along the estimated skew angle. Weliwitage et al. [7] describe a modified projection-based method called cut text minimization (CTM), in which an optimization technique is applied to minimize the number of text pixels cut while tracking the boundary between two lines, after the start points of lines are found using projection. Similarly, Liwicki et al. [8] propose another improved projection-based method combined with slant detection, in which dynamic programming is used to find the paths between two consecutive lines. From a different viewpoint, some researchers have proposed water-reservoir-based top-down methods [9–11]. They assume that hypothetical water flows, from both the left and right sides of the image frame, face obstruction from


characters of text lines, and the strips of areas left un-wetted on the image frame are labeled for extracting text lines. An obvious observation about most top-down methods is that their performance relies on the assumption of well-separable text lines: approximately straight and parallel, globally or locally in a region.
The Docstrum method of O'Gorman [12] is typical of bottom-up grouping. It merges neighboring CCs using rules based on the geometric relationship between K-nearest-neighbor units, and performs well on printed documents as well as handwritten documents with slightly curved lines. Based on similar ideas, the Voronoi diagram combined with heuristic rules is used in [14] to merge CCs into text lines. In [13], each CC is represented by the vertical coordinates of its bounding box, and the CCs are grouped by weighted k-means clustering under the spatial constraints of valid address lines. Likforman-Sulem and Faure [15] develop a method based on perceptual grouping, in which text lines are iteratively constructed by grouping neighboring CCs according to three Gestalt criteria, namely proximity, similarity and direction continuity. Although this method can integrate local constraints with a global measure, it cannot be applied to poorly structured documents. Nicolas et al. [16] use the artificial-intelligence concept of a production system to search for an optimal alignment of CCs into text lines. Defining the initial state as the set of CCs in the un-segmented document and a possible alignment (text lines) of the CCs as the goal state, they give two operators ("merge" or "do not merge" a component to its adjacent text lines) for traversing states under a best-path search framework. The reliance of this method on heuristic rules makes it inefficient for unconstrained handwritten documents.
The grouping of components into text lines has also been treated using MST clustering [17,18], in which the CCs are grouped by MST with a hand-crafted distance metric and the edges between text lines are then deleted using heuristic rules; the performance relies on the distance metric between components and on the heuristic rules. The Hough transform has also been applied to handwritten text line detection, with the gravity centers [19,20] or minima points [21] of CCs as the points to be fitted. Sometimes the CCs are split into equally spaced blocks to be voted in the Hough domain [22]. In general, Hough-transform-based methods need a sophisticated post-processing procedure to extract the lines and involve a high computational burden.
From a different viewpoint, some researchers have proposed smearing-based bottom-up methods. Shi et al. use an adaptive local connectivity map (ALCM) [23] or a fuzzy run-length [24], in which the value of each pixel is the sum of all pixels in the original image within a specified horizontal distance. After thresholding the smeared image, the CCs represent probable regions of text lines. Kennard and Barrett use a similar method with a slight extension to deal with freeform handwritten historical documents [25]. All the smearing-based methods, like other bottom-up ones, involve parameters to be tuned artificially. Nicolas et al. treat document image segmentation as a labeling problem [26]. They partition the document image into an n × m grid, construct a Markov random field (MRF) model based on the grid, and then label the grid sites into states. Their results show that this method does not perform robustly on handwritten documents.
The level-set-based method proposed by Li et al. [27] is an effective hybrid approach for unconstrained handwritten documents. After converting the binary image to gray scale using a continuous anisotropic Gaussian kernel, the level set method is exploited to determine the boundary between neighboring text lines.
Though high segmentation accuracies have been reported, this method suffers from high computational complexity.

3. Clustering based text line segmentation

In this section, we describe the rationale of our approach and the MST clustering algorithm. The distance metric learning and text line

Fig. 2. Hierarchical structure of a document image.

grouping techniques are elaborated in Sections 4 and 5, respectively. The performance of MST clustering relies on the metric of distance between image components. After clustering, the resulting tree is carefully cut into subtrees, each corresponding to a text line.

3.1. Rationale

A document image can be viewed as a hierarchical structure as in Fig. 2: it consists of text lines, each text line consists of CCs, and a CC is made of black runs or pixels. Equivalently, a text line can be viewed as a cluster of stroke pixels or CCs. We prefer using CCs as the basic units of clustering because CCs are easy to detect and their number is much smaller than that of stroke pixels. An important feature of this clustering problem is that all clusters (text lines) have irregular boundaries. We use the MST algorithm for clustering the CCs into text lines because, unlike many other clustering algorithms, it does not assume a spherically shaped cluster structure in the underlying data.
Two important issues in clustering are the distance metric between units and the criterion for determining the number of clusters. A good metric for clustering CCs should satisfy the condition that the distance between two neighboring components in the same text line is smaller than that between components of different lines. In documents with close or interfering text lines, the Euclidean distance does not satisfy this. We previously used a hand-crafted distance metric [28], which works fairly well but not sufficiently well. We therefore design a better metric by supervised learning on labeled pairs of CCs. By labeling some pairs as "close" (within the same text line) and others as "distant" (between lines), a distance metric can be automatically learned to fit the target of small within-line distances and large between-line distances.
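To make the pair-labeling idea concrete, here is a toy sketch (the CC identifiers and line labels are hypothetical; this is not the annotation tooling used in this work) of deriving the "close" and "distant" pair sets from ground-truth line labels of CCs:

```python
# Toy sketch: pairs of CCs within the same ground-truth line form the
# "close" set S; pairs across different lines form the "distant" set D.
from itertools import combinations

# Hypothetical CC ids mapped to their ground-truth line index.
line_of = {'cc1': 0, 'cc2': 0, 'cc3': 1, 'cc4': 1}

S = [(a, b) for a, b in combinations(line_of, 2) if line_of[a] == line_of[b]]
D = [(a, b) for a, b in combinations(line_of, 2) if line_of[a] != line_of[b]]
print(len(S), len(D))  # 2 4
```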
Under a learned distance metric, the tree generated by the MST algorithm has the desired characteristic that neighboring CCs of the same text line are connected and each line corresponds to a subtree (Fig. 3). However, the branches (paths between terminal and branching nodes) do not correspond to text lines perfectly, due to the variability of the layout of text lines. We hence use a second-stage clustering procedure to dynamically cut the edges of the tree into groups corresponding to text lines. The criterion for selecting the edge to cut and the criterion for stopping the cutting (determining the number of clusters) are important in the second stage. Simply deleting the shortest edge is not reliable, because the edges between different lines (red lines in Fig. 3) are not always longer than those within the same line. Our approach is to select the edge to cut such that the sum of hypervolumes of clusters


4. Distance metric learning

Fig. 3. MST of a document image.

As many clustering algorithms rely critically on the distance metric between pairs of input units, some recent studies have contributed to metric learning from data [32–34]. To improve the performance of fuzzy c-means clustering, an evolutionary algorithm was used to optimize the scales of the dimensions of the input data set [32]. Domeniconi [33] proposed a variant of the k-means algorithm in which individual Euclidean metric weights are learned for each cluster. Xing et al. [34] combined gradient descent and iterative projections to learn a Mahalanobis metric for k-means clustering. Inspired by these works on distance metric learning, we design our distance metric for text line segmentation by supervised learning.
In our work, the definition of the distance between CCs is the key to making the generated MST have the components of the same text line in the same subtree and those of different lines in different subtrees. Fig. 5 gives an example of MST clustering based on the hand-crafted metric [28] and on the learned metric proposed in this paper, with the between-line edges marked in blue. We can observe that while the learned metric groups the CCs of each text line into the same subtree (Fig. 5(b)), the hand-crafted metric splits some text lines into multiple subtrees (Fig. 5(a)).

4.1. Problem formulation

Fig. 4. The framework of our approach.

is reduced maximally, and to stop clustering when the measure of straightness of the text lines reaches a maximum. From the above description, the framework of our approach is depicted in Fig. 4.

3.2. Clustering algorithm

Our algorithm starts with a binary document image. In pre-processing, the CCs are labeled using a fast algorithm based on contour tracing [29]. Small components with few black pixels are considered noise and are removed. We then estimate the dominant character size from the component-size histogram, obtained using the method in [12]. Empirically, components with height or width larger than three times the dominant character height are split vertically or horizontally using the touching-character splitting method in [30], because they are most likely to contain touching characters, and such big components affect the result of MST clustering. Finally, each component is viewed as a node in a graph (the document graph). Each pair of nodes is linked by an edge weighted by the distance between them. The distance metric is designed to strengthen within-line links and weaken between-line links. From the weighted document graph, an MST is built using Kruskal's algorithm [31]. In the resulting tree, most edges correspond to within-line links and some correspond to between-line links. Since Kruskal's MST algorithm is well known and can easily be found in the literature, we will not give its details in this paper.
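As an illustration of this step, the following sketch (not the authors' implementation) builds an MST over CC centroids with a pluggable pairwise distance using SciPy; the toy coordinates and the plain Euclidean distance are assumptions for demonstration only:

```python
# Illustrative sketch: build an MST over CC centroids from a dense
# weighted graph; with well-separated lines, exactly one MST edge
# bridges the two lines and is a candidate for cutting.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def build_mst(centroids, dist):
    """centroids: (n, 2) array; dist: callable on a pair of points."""
    n = len(centroids)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = dist(centroids[i], centroids[j])
    return minimum_spanning_tree(w).toarray()  # dense (n, n), MST edges > 0

# Toy example: two horizontal "text lines" of three components each.
pts = np.array([[0, 0], [10, 0], [20, 0],
                [0, 15], [10, 15], [20, 15]], dtype=float)
mst = build_mst(pts, dist=lambda a, b: np.linalg.norm(a - b))
edges = np.transpose(np.nonzero(mst))
print(len(edges))  # 5  (n - 1 edges; one of them bridges the two lines)
```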

For supervised learning of the distance metric between CCs, we need training samples of component pairs labeled as "within-line" and "between-line". To this end, we annotated some training document images using our ground-truthing tool GTLC (Ground-truthing tool for handwritten Chinese Text Lines and Characters) [42], which labels text lines and characters by automated transcript alignment and hand correction.
Let C = {x_1, x_2, ..., x_n} be the collection of CCs in a training document, where n is the number of components. We obtain two sets of component pairs as the samples for metric learning:

S = {(x_i, x_j) | x_i and x_j belong to the same line},
D = {(x_i, x_j) | x_i and x_j belong to different lines}.

Considering the fact that only spatially neighboring components are linked in the MST, we can discard many component pairs from the sample set to accelerate metric learning. To do this, we construct the area Voronoi diagram [14] of the training document, which represents the spatial adjacency between the components. A component x_i is a neighbor of another component x_j only if they are adjacent in the Voronoi diagram. The pairs that are not adjacent are removed from S and D.
The aim of metric learning is to make the distance between components in S small and the distance between components in D large under the learned metric. Hence, we formulate metric learning as a convex programming problem [35]:

min_{A ∈ R^{m×m}}  Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||²_A
s.t.  Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||²_A ≥ 1,   A ⪰ 0,

where A ∈ R^{m×m} (m is the dimensionality of the feature space characterizing component pairs) defines the distance metric:

d(x_i, x_j) = d_A(x_i, x_j) = ||x_i − x_j||_A = (v_ij^T A v_ij)^{1/2},
Fig. 5. The results of MST clustering with hand-crafted metric (a) and learned metric (b).

and v_ij is the feature vector characterizing the relationship between components x_i and x_j. A is determined by solving the convex programming problem.

4.2. Feature space

The features characterizing the relationship between two components x_i and x_j are integral to the distance metric and are influential to the performance of clustering. Below is a list of the features (eight in total) that we use.

(1) Normalized horizontal and vertical distances between the centroids of two components. The horizontal/vertical distance between the centroids of two CCs measures their spatial closeness. To generalize to different documents (with differing font size and imaging resolution), this distance is normalized with respect to the character size (divided by the estimated dominant character size).

(2) Normalized horizontal and vertical overlapping degree. If two components overlap horizontally (align vertically), the normalized horizontal overlap degree can be computed by [30]:

novlp_x = (1/2)(ovlp/W_1 + ovlp/W_2) − dist/span,

where ovlp is the overlapping width of the two bounding boxes, W_1 and W_2 are the widths of the bounding boxes, dist is the horizontal distance between the centers of the two bounding boxes, and span is the spanning width of the two bounding boxes (Fig. 6). The normalized vertical overlap degree is computed similarly from the heights of the two bounding boxes.

Fig. 6. Definition of normalized horizontal overlap.

Fig. 7. An example of minimum run-length (MRL).

(3) Normalized horizontal and vertical minimum run-length.

The horizontal minimum run-length (MRL_x) is the horizontal run-length between vertically overlapping (horizontally aligned) CCs, wherein the minimum horizontal distance between black runs is taken as the distance measure (Fig. 7). It is normalized with respect to the dominant character size of the document image. The vertical minimum run-length (MRL_y) is computed similarly and normalized in the same way.

(4) Height and width ratio of merged components.

Suppose two CCs are merged; then the Height Ratio is computed by:

R_hei = max(H_1, H_2) / span,

where H_1 and H_2 are the heights of the two bounding boxes, and span is the spanning height of the two bounding boxes. The Width Ratio is computed similarly from the widths of the two CCs.
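The metric learning problem of Section 4.1 is solved in the paper as a convex program. As a rough, non-authoritative stand-in, the sketch below obtains a positive semi-definite matrix A by whitening the within-line pair features and rescaling so that the between-line constraint Σ_D ||x_i − x_j||²_A ≥ 1 holds; the synthetic pair-feature vectors are assumptions for demonstration:

```python
# Illustrative stand-in for supervised metric learning (not the paper's
# convex-programming solver): A = inverse covariance of within-line pair
# features, scaled to satisfy the between-line constraint; PSD by design.
import numpy as np

def learn_metric(VS, VD, reg=1e-3):
    """VS, VD: (n, m) arrays of pair-feature vectors for sets S and D."""
    m = VS.shape[1]
    cov = VS.T @ VS / len(VS) + reg * np.eye(m)  # within-line scatter
    A = np.linalg.inv(cov)                       # whiten within-line pairs
    d_between = np.einsum('ij,jk,ik->', VD, A, VD)
    if d_between < 1.0:                          # enforce sum over D >= 1
        A /= max(d_between, 1e-12)
    return A

rng = np.random.default_rng(0)
# Dimension 0 is discriminative: small for within-line, large for between.
VS = rng.normal(scale=[0.2, 1.0], size=(50, 2))
VD = rng.normal(scale=[1.0, 0.2], size=(50, 2))
A = learn_metric(VS, VD)
assert A[0, 0] > A[1, 1]  # the discriminative dimension is weighted up
```

Under the learned A, within-line pairs get small distances and between-line pairs large ones, which is the property the MST construction needs.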

5. Text line grouping

Although the learned distance metric encourages the components of the same text line to be connected in a subtree, some components from different lines may still be connected. Since between-line edges are not obvious (their lengths, i.e., the distances between components, are not necessarily larger than within-line edge lengths), correctly recognizing and cutting the between-line edges is non-trivial. Although several algorithms [36–39] for this problem have been proposed, they do not perform satisfactorily in our case of handwritten Chinese text line segmentation. The cutting results for the image in Fig. 3 using the algorithms of [37,39] are shown in Fig. 8, where many cutting or connection errors occurred, marked by blue circles. Our MST-based text line grouping process consists of two phases: in initial grouping, the MST is cut into subtrees using a hypervolume reduction criterion and a straightness measure; in post-processing, some remaining text line errors are corrected using heuristic rules.

5.1. Initial grouping

We use a criterion based on hypervolume [40] to select the edges of the MST to cut. By cutting some edges, each subtree corresponds to a cluster of CCs. The sum of the hypervolumes of the clusters is computed for evaluating the partition:

F_v = Σ_{i=1}^{k} [det(C_i)]^{1/2},

where det(C_i) is the determinant of the covariance matrix C_i of cluster i, computed from the constituent black pixels of the CCs in the cluster. Initially, all the components in the MST are considered a single cluster, and every edge is tentatively deleted to split a cluster into two clusters (subtrees). The edge with the maximal reduction of the F_v measure is selected for deletion, so that the total F_v measure of the document is minimized. We call this the maximum hypervolume reduction criterion:

edge_deleted = arg max_{edge} ΔF_v = arg max_{edge} [F_v(S_k) − F_v(S_{k+1})],

where S_k = {T_1, T_2, ..., T_k} denotes the partition into k disjoint subtrees (S_1 denotes the initial MST).
The F_v measure cannot determine the number of clusters, since it always decreases as the number of clusters increases. Fortunately, it is reasonable to assume rectangular shapes for the text lines (if a text line is curvilinear, it can be divided into several sublines that are approximately straight). We conjecture that when the number of clusters (partitioned text lines) is appropriate, a measure of straightness of the text lines reaches a maximum. We compute the total straightness measure as:

F_s = Σ_{i=1}^{k} (λ_{i1}/λ_{i2})²,

where k is the number of clusters (partitioned text lines), and λ_{i1} and λ_{i2} (λ_{i1} ≥ λ_{i2}) are the eigenvalues of the covariance matrix of each cluster.
Our experiments demonstrate that the F_s measure performs well in finding the number of clusters: the number at the maximum of F_s fits

Fig. 8. Results of edge cutting for the image in Fig. 3 using the algorithm in [37] (a) and the algorithm in [39] (b). Cutting/connection errors are marked with blue circles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 9. Partitioning criteria for the document in Fig. 3. (a) FV as a function of number of clusters; (b) FS as a function of number of clusters; (c) the partitioned text lines.

well the actual number of text lines. An example is shown in Fig. 9. By iteratively deleting edges according to F_v, the total F_v measure and F_s measure with an increasing number of clusters are shown in Fig. 9(a) and (b), respectively. We can see that k = 5, corresponding to the maximum of F_s, gives a preferable partition of text lines (Fig. 9(c)).

5.2. Post-processing

After initial grouping, most text lines have been grouped correctly, but a few errors may still exist. For example, some lines are split into several pieces because of large within-line horizontal gaps, or some CCs are falsely grouped into other lines. Most of these errors can be corrected using heuristic rules similar to [27]. The post-processing procedure has the following steps:

(1) Estimate the orientation of the initial text lines using the least mean squared-error method, and estimate their height and width using the method in [30].
(2) If the length of a text line is shorter than 1/10 of the image width, or it contains fewer than three CCs and its height is larger than half of the average height of all lines, it is labeled as an "isolated line". The other text lines are labeled as "unprocessed lines".
(3) If the height of an "unprocessed line" is smaller than half of the average height of all lines, all of its CCs are labeled as "isolated CCs". In an "unprocessed line", if the distance between the centroid of a CC and the midline (which crosses the centroid of the text line and has the same orientation) is larger than half of the height of the text line, the CC is also labeled as an "isolated CC".
(4) Select the longest "unprocessed line" and merge it with another "unprocessed line" if all of the following conditions are met: (1) the difference of their orientations is less than 15°; (2) the horizontal gap between their bounding boxes is less than 1/10 of the image width; (3) their bounding boxes overlap by more than 50% of the average height in the direction orthogonal to the average orientation. Mark the merged line as a "processed line".
(5) Iterate Step 4 until there is no "unprocessed line".
(6) Merge an "isolated CC" to the i-th text line if the distance between the centroid of the CC and the midline of the text line is smaller than the height of the text line. If no text line can be found to merge the "isolated CC" with, the CC is labeled as "noise" and is deleted.
(7) Similar to Step 6, merge the CCs of an "isolated line" into a "processed line"; if no text line can be found for a CC, keep it in the "isolated line", and if the "isolated line" still has CCs after the merging, relabel it as a "processed line".

6. Experimental results

We evaluated the performance of our algorithm on a large database of unconstrained handwritten Chinese documents and compared it with some existing reference algorithms. In the following, we briefly describe the database and evaluation methodology, outline the reference algorithms, and then present the experimental results.

6.1. Database

A large database of unconstrained Chinese handwritten documents, HIT-MW [41], was collected by Harbin Institute of Technology and is publicly available for free use. The database contains 853 text forms written by more than 780 writers. There are 8,677 text lines in total and each line has 21.51 characters on average. Each


Fig. 10. Example documents in the HIT-MW database.

document was scanned at a resolution of 300 DPI. A typical image size is approximately 1700×1500 pixels, and each image contains 530 CCs on average. Fig. 10 shows two images in this database. Since the images in the HIT-MW database are not labeled at the CC level (only a part of the images have been segmented into text lines), we annotated all 853 document images using our ground-truthing tool GTLC [42].

6.2. Evaluation methodology

Several evaluation schemes have been proposed for document image segmentation [43–45], but they were designed for printed documents or graphics and evaluate performance based on bounding boxes. It is not appropriate to measure handwritten text lines using bounding boxes because they are often curved and multi-skewed. Therefore, we evaluate performance by counting the number of matches between the pixels of detected text lines and the pixels in the ground-truth data. Similar to [27], we calculate the MatchScore matrix between a detected text line and a ground-truthed line:

MatchScore(i, j) = T(G_i ∩ R_j) / T(G_i ∪ R_j),

where G_i is the set of pixels of the i-th ground-truthed text line, R_j is the set of pixels of the j-th detected text line, and T(S) is the cardinality of set S. The Hungarian algorithm is used to find a one-to-one correspondence between the detected text lines and the ground-truthed ones [46]. Since the numbers of lines in the two sets may differ, either a detected line or a ground-truthed line is allowed to be matched with a dummy line. The performance is evaluated at the text line level. If a ground-truthed line and the corresponding detected line share at least 95% of their pixels, the detected text line is claimed to be correct. The percentage of correctly detected text lines out of the ground-truthed lines gives the correct detection rate (recall rate), and the percentage of false lines out of the detected lines gives the error rate.

6.3. Reference algorithms

In addition to comparison with our previous clustering algorithm with a hand-crafted metric [28], we compared the hypervolume reduction criterion in text line grouping with other criteria in [37,39]. Then, we compared the performance of the proposed algorithm not

only with two algorithms designed for printed documents, X–Y cut [1] and Docstrum [12], but also with two algorithms designed for segmenting handwritten text lines, stroke skew correction [6] and piece-wise projection [5]. The hand-crafted metric was formed using a subset of the features described in Section 4.2: it is a weighted combination of the horizontal minimum run-length and the Euclidean distance between the centroids of two CCs, with the weight determined by the normalized vertical overlapping degree. This empirical combination was found to perform fairly well. For fair comparison with the other methods, we also optimized the weighting parameter on some ground-truthed document images. After MST clustering, the algorithms in [37,39] cut edges in different ways. The one in [37] finds a global threshold of edge length according to the edge length histogram of the linkage graph; all edges longer than the threshold are then cut. The authors of [37] demonstrated that this method was more efficient than the algorithm in [36]. The algorithm in [39] measures each hypothesized cluster (subtree) by the standard deviation of edge lengths within the subtree; the edge selected for cutting is the one that reduces the average standard deviation the most. This is similar to our method of hypervolume reduction, but it uses the deviation of edge lengths instead of the hypervolume. As a typical top-down method, the X–Y cut algorithm [1] builds a structural tree of the document by recursively analyzing the horizontal and vertical projection profiles of partitioned regions. The Docstrum algorithm [12] builds the document structure bottom-up by merging neighboring CCs. We used a public-domain implementation of the X–Y cut and Docstrum algorithms in our experiments [45]. The stroke skew correction algorithm [6] estimates the skew angle of text lines from the horizontal stroke histogram and then segments text lines using projection profiles after deskewing.
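As an illustration, the hand-crafted metric described above can be sketched as follows. The paper does not give the exact formula, so the combination rule, the use of the bounding-box gap as a stand-in for the minimum horizontal run-length, and all function names are our assumptions:

```python
import math

def vertical_overlap(a, b):
    """Normalized vertical overlapping degree of two CC bounding boxes.
    Boxes are (left, top, right, bottom); the normalization by the
    smaller box height is an assumption."""
    ov = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return ov / max(1.0, min(a[3] - a[1], b[3] - b[1]))

def centroid(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def handcrafted_distance(a, b):
    """Sketch of the hand-crafted metric: a weighted combination of the
    horizontal gap (standing in for the minimum horizontal run-length)
    and the centroid distance, weighted by the vertical overlap."""
    w = vertical_overlap(a, b)
    # Horizontal gap between the two bounding boxes (0 if they overlap).
    gap = max(0.0, max(a[0], b[0]) - min(a[2], b[2]))
    (xa, ya), (xb, yb) = centroid(a), centroid(b)
    euc = math.hypot(xb - xa, yb - ya)
    # High vertical overlap (likely same line) -> trust the horizontal
    # gap; low overlap (likely different lines) -> centroid distance.
    return w * gap + (1.0 - w) * euc
```

Under this sketch, two horizontally adjacent components that overlap vertically receive a much smaller distance than vertically stacked ones, which is the behavior the weighting is meant to produce.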
The piece-wise projection algorithm [5] obtains an initial set of text lines from piece-wise projection profiles; any obstructing CCs are then assigned to the line above or below according to a probability evaluated under a Gaussian assumption or a distance metric. This algorithm was shown to perform very well in segmenting English and Arabic text lines. To yield the best performance for the above algorithms (MST clustering with the hand-crafted metric, X–Y cut, Docstrum, piece-wise projection), we used the Nelder–Mead simplex search method [47] to optimize their free parameters on ground-truthed data, as Mao et al. did in [45,48]. The hand-crafted metric has one weighting parameter to optimize. The X–Y cut algorithm and the Docstrum


F. Yin, C.-L. Liu / Pattern Recognition 42 (2009) 3146 -- 3157

Table 1
Correct rates of text line detection using learned and hand-crafted metrics.

Method                                    Correct detection
Learned metric                            1051 (95.02%)
Hand-crafted metric (optimized weight)    1008 (91.14%)
Hand-crafted metric (empirical weight)    975 (88.16%)

Table 2
Correct rates of text line detection using different clustering criteria.

Criterion                 Correct detection
Hypervolume reduction     1051 (95.02%)
Criterion in [37]         823 (74.41%)
Criterion in [39]         341 (30.83%)

Table 3
Correct rates and error rates of text line detection on 803 images.

Method                           Detected lines    Correct detection    Error rate (%)
Proposed (with post-processing)  8211              8008 (98.02%)        2.47
Proposed (w/o post-processing)   8444              7822 (95.75%)        7.37
X–Y cut [1]                      8605              3682 (45.07%)        57.21
Docstrum [12]                    9602              5341 (65.38%)        44.38
Stroke skew correction [6]       5897              4521 (55.34%)        23.33
Piece-wise projection [5]        8150              7521 (92.07%)        7.72
algorithm each has four parameters, optimized by simplex search as in [45]. For the piece-wise projection algorithm, the authors of [5] did not mention any free parameters, but in our implementation we found that two parameters need to be determined: the minimal difference and the minimal distance between neighboring peaks and valleys of the projection histogram. The other strip-based projection methods were not evaluated in our experiments because they rely on many heuristic rules and free parameters, which are hard to tune for optimal performance.

6.4. Performance comparison

We conducted three experiments: comparing metric learning with the hand-crafted metric, comparing hypervolume reduction for text line grouping with other MST clustering criteria, and comparing the proposed text line segmentation algorithm with four existing methods.

6.4.1. Comparing metric learning with the hand-crafted metric

The example of Fig. 5 in Section 4 demonstrates that distance metric learning can markedly improve the performance of MST clustering of handwritten documents. To evaluate the performance quantitatively, we selected 150 images with complex layouts from the HIT-MW database: 50 documents were used for distance metric learning and for optimizing the weighting parameter of the hand-crafted metric, and the remaining 100 documents, containing 1,106 text lines, were used for evaluation. Previously, the weight of the hand-crafted metric was determined from the normalized vertical overlapping degree [28]. In this experiment, all processing steps except the distance metric are the same. The correct rates of text line detection by MST clustering with the learned metric and the hand-crafted metric (with optimized and empirical weights) are shown in Table 1. We can see that distance metric learning improves the performance of text line segmentation significantly. By optimizing the weighting parameter of the hand-crafted metric, the performance is also improved considerably over that with the empirical weight.

6.4.2. Comparing MST clustering criteria

To compare the text line segmentation performance of the proposed hypervolume reduction criterion with the other MST clustering criteria [37,39], we used the same 150 images as in Section 6.4.1: 50 for metric learning and 100 for evaluation. For the three criteria compared, all processing steps except the tree partitioning procedure are the same. The correct rates of text line detection on the 100 test images are shown in Table 2. The hypervolume reduction criterion yields the best performance. Since the criterion in [37] finds a global threshold of edge length, between-line
edges shorter than the threshold cannot be deleted to separate linked text lines. With the criterion in [39], we observed that the local minimum of the standard deviation reduction function always gives fewer clusters than the real number of text lines, causing many text lines to be merged with each other. In contrast, our algorithm, based on the hypervolume reduction criterion combined with the straightness measure of text lines, mostly finds the correct number of clusters.

6.4.3. Comparison with existing methods

To compare the performance of the proposed MST clustering-based text line segmentation algorithm with the X–Y cut, Docstrum, stroke skew correction and piece-wise projection algorithms, we randomly selected 50 images (containing 508 text lines) from the HIT-MW database for training (distance metric learning and parameter tuning), and used the remaining 803 images (containing 8,169 text lines) for evaluation. The correct rates (recall rates: the percentage of correctly detected lines out of all ground-truthed ones) and error rates (the percentage of erroneous lines out of all detected ones) are shown in Table 3. To assess the effect of post-processing in the proposed clustering-based method, we evaluated the algorithm both with and without post-processing. From Table 3, we can see that post-processing is effective in improving the correct rate of text line segmentation. However, even without post-processing, the proposed method still yields a higher correct rate and a lower error rate than the four existing algorithms. Among the existing methods, the piece-wise projection algorithm yields the best performance. We observe that the X–Y cut and Docstrum algorithms tend to extract more text lines, but many of them do not match the ground-truthed lines. The stroke skew correction algorithm is designed for segmenting handwritten Chinese text lines, but it is only better than the X–Y cut algorithm.
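The evaluation protocol of Section 6.2 can be sketched in a few lines. This is our illustrative reconstruction (the function name is ours), using SciPy's `linear_sum_assignment` as the Hungarian solver; the real evaluation operates on pixel sets of full document images, whereas here lines are tiny toy pixel sets:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def evaluate(gt_lines, det_lines, accept=0.95):
    """MatchScore-based evaluation: each line is a set of (x, y) pixels.
    MatchScore(i, j) = |G_i & R_j| / |G_i | R_j|. The Hungarian
    algorithm finds the one-to-one correspondence (unmatched lines in a
    rectangular matrix are effectively paired with dummies), and a
    detected line is counted correct when its score reaches `accept`."""
    score = np.zeros((len(gt_lines), len(det_lines)))
    for i, g in enumerate(gt_lines):
        for j, r in enumerate(det_lines):
            score[i, j] = len(g & r) / len(g | r)
    rows, cols = linear_sum_assignment(-score)  # negate to maximize
    correct = sum(score[i, j] >= accept for i, j in zip(rows, cols))
    recall = correct / len(gt_lines)            # correct detection rate
    error = (len(det_lines) - correct) / len(det_lines)
    return correct, recall, error
```

With identical ground-truth and detected pixel sets, every matched pair has MatchScore 1.0, so the recall is 1.0 and the error rate is 0.0.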
The piece-wise projection algorithm achieves competitive results because it can tolerate multi-skewed and curved text lines to a certain extent. Fig. 11 shows the segmentation results of a document image using the X–Y cut, Docstrum, stroke skew correction, piece-wise projection and the proposed algorithms. From the figure, we can see that only the proposed algorithm segments all the text lines correctly, while the X–Y cut, Docstrum and stroke skew correction algorithms generate many false text lines. Although the piece-wise projection algorithm finds almost all text lines, some small CCs are falsely segmented, such as in the last text line in Fig. 11(d). Overall, the proposed algorithm performs very well on unconstrained handwritten Chinese documents with multi-skewed and curved text lines. The proposed clustering-based algorithm was implemented in C++. The overall processing time for an image of 1700×1500 pixels containing 1000 CCs (about 500–600 characters) is about 2.5 seconds on a personal computer with a 3.6 GHz Pentium 4 CPU and 1 GB of memory. This speed is acceptable. We could not compare our method with the level set based method of Li et al. [27] because that method is non-trivial to implement and their image database is not available for our evaluation. Our algorithm and that of Li et al. were compared with the X–Y

Fig. 11. Segmentation results of a document image by X–Y cut (a), Docstrum (b), stroke skew correction (c), piece-wise projection (d) and the proposed algorithm (e).

cut and Docstrum algorithms and, as a result, both yielded significantly higher correct detection rates than the X–Y cut and Docstrum algorithms. The level set based method is, however, more computationally demanding: according to [27], the segmentation of an image of 2000×1500 pixels takes about 20 seconds on a 1.6 GHz CPU with 1 GB of memory.

6.5. Error analysis

The proposed clustering-based method with metric learning, though it performs well, still produces some text line

detection errors. The errors are mostly of two types: (1) error line splitting (ELS), where a real text line is split into two or more lines (corresponding to multiple clusters); and (2) error line merging (ELM), where two or more real text lines are merged into a single cluster. We observed that ELS errors occur when characters are inserted into a line, such as those marked by blue circles in Fig. 12. ELM errors are mainly caused by the overlapping of two neighboring text lines, especially touching characters, such as the case marked by a blue circle in Fig. 13. In this case, since the two text lines are connected at only a few touching characters, a post-processing procedure is necessary to separate them vertically.
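The two error types can be counted mechanically from a pixel-overlap matrix between ground-truthed and detected lines: a ground-truthed line whose pixels are spread over several detected lines indicates a split (ELS), and a detected line covering several ground-truthed lines indicates a merge (ELM). A minimal sketch; the significance threshold of 0.1 and the function name are our assumptions:

```python
def count_split_merge(overlap, min_share=0.1):
    """overlap[i][j]: fraction of ground-truth line i's pixels that fall
    in detected line j. A GT line significantly covered by two or more
    detected lines counts as a split (ELS); a detected line
    significantly covering two or more GT lines counts as a merge (ELM)."""
    n_gt = len(overlap)
    n_det = len(overlap[0]) if n_gt else 0
    splits = sum(
        1 for i in range(n_gt)
        if sum(overlap[i][j] >= min_share for j in range(n_det)) >= 2)
    merges = sum(
        1 for j in range(n_det)
        if sum(overlap[i][j] >= min_share for i in range(n_gt)) >= 2)
    return splits, merges
```

For example, a ground-truthed line shared equally between two detected lines registers as one split and no merge, while one detected line fully covering two ground-truthed lines registers as one merge and no split.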

Fig. 12. An example of error line splitting.

Fig. 13. An example of error line merging.

7. Conclusion

We have proposed a new method for text line segmentation in unconstrained handwritten Chinese document images based on minimal spanning tree (MST) clustering with distance metric learning. This bottom-up method is able to segment multi-skewed, curved and slightly overlapping text lines. Apart from some empirical parameters in the pre-processing of connected components (CCs) and the post-processing of text lines (which are easy to determine and do not influence the performance critically), the algorithm has no artificial parameters in clustering. In MST clustering, the metric of distance between CCs is learned on a dataset of pairs of components labeled as "within-line" or "between-line". This avoids artificial tuning of the metric and improves the clustering performance significantly. The number of clusters is automatically determined by cutting the edges of the generated tree using a hypervolume reduction criterion and a straightness measure. The proposed algorithm was evaluated on a large database of unconstrained handwritten Chinese documents and was demonstrated to be superior to several previous algorithms. We plan to further improve the algorithm by refining the features of the distance metric and the post-processing procedure, and to evaluate it on document images of various languages by customizing the between-component features and training on document images of the specific languages.

Acknowledgments

The authors would like to thank Tonghua Su for authorizing us to use the HIT-MW database, Zhenglong Li for discussions on distance metric learning, and Gang Liu and Yi Li for their suggestions on the experiments. This research was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 60775004 and 60825301.

References
[1] G. Nagy, S. Seth, M. Viswanathan, A prototype document image analysis system for technical journals, Computer 25 (7) (1992) 10–22.
[2] J. He, A.C. Downton, User-assisted archive document analysis for digital library construction, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 1, 2003, pp. 498–502.
[3] A. Zahour, B. Taconet, P. Mercy, S. Ramdane, Arabic handwritten text-line extraction, in: Proceedings of the Sixth International Conference on Document Analysis and Recognition, 2001, pp. 281–285.
[4] U. Pal, S. Datta, Segmentation of Bangla unconstrained handwritten text, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 2, 2003, pp. 1128–1132.
[5] M. Arivazhagan, H. Srinivasan, S. Srihari, A statistical approach to line segmentation in handwritten documents, in: Document Recognition and Retrieval XIV, Proceedings of the SPIE, 2007, pp. 6500T-1-11.
[6] T. Su, T. Zhang, H. Huang, Y. Zhou, Skew detection for Chinese handwriting by horizontal stroke histogram, in: Proceedings of the Ninth International Conference on Document Analysis and Recognition, 2007, pp. 899–903.
[7] C. Weliwitage, A.L. Harvey, A.B. Jennings, Handwritten document offline text line segmentation, in: Proceedings of Digital Image Computing: Techniques and Applications, 2005, pp. 184–187.
[8] M. Liwicki, E. Indermuehle, H. Bunke, On-line handwritten text line detection using dynamic programming, in: Proceedings of the Ninth International Conference on Document Analysis and Recognition, 2007, pp. 447–451.
[9] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, D.K. Basu, Text line extraction from multi-skewed handwritten documents, Pattern Recognition 40 (6) (2007) 1825–1839.
[10] U. Pal, P.P. Roy, Multioriented and curved text lines extraction from Indian documents, IEEE Transactions on Systems, Man and Cybernetics, Part B 34 (4) (2004) 1676–1684.
[11] U. Pal, P.P. Roy, Text line extraction from India document, in: Proceedings of the Fifth International Conference on Advances in Pattern Recognition, 2003, pp. 275–279.
[12] L. O'Gorman, The document spectrum for page layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11) (1993) 1162–1173.
[13] F. Kimura, Y. Miyake, M. Shridhar, Handwritten ZIP code recognition using lexicon free word recognition algorithm, in: Proceedings of the Third International Conference on Document Analysis and Recognition, 1995, pp. 906–910.
[14] K. Kise, A. Sato, M. Iwata, Segmentation of page images using the area Voronoi diagram, Computer Vision and Image Understanding 70 (3) (1998) 370–382.
[15] L. Likforman-Sulem, C. Faure, Extracting lines on handwritten document by perceptual grouping, in: Advances in Handwriting and Drawing: A Multidisciplinary Approach, 1994, pp. 21–38.
[16] S. Nicolas, T. Paquet, L. Heutte, Text line segmentation in handwritten document using a production system, in: Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 245–250.
[17] I.S.I. Abuhaiba, S. Datta, M.J.J. Holt, Line extraction and stroke ordering of text pages, in: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, 1995, pp. 390–393.
[18] A. Simon, J.-C. Pret, A.P. Johnson, A fast algorithm for bottom-up document layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (3) (1997) 273–277.
[19] B. Yu, A. Jain, A robust and fast skew detection algorithm for generic document, Pattern Recognition 29 (10) (1996) 1599–1629.
[20] Y. Pu, Z. Shi, A natural learning algorithm based on Hough transform for text lines extraction in handwritten document, in: Proceedings of the Sixth International Workshop on Frontiers in Handwriting Recognition, 1998, pp. 637–646.
[21] L. Likforman-Sulem, A. Hanimyan, C. Faure, A Hough based algorithm for extracting text lines in handwritten documents, in: Proceedings of the Third International Conference on Document Analysis and Recognition, 1995, pp. 774–777.
[22] G. Louloudis, B. Gatos, I. Pratikakis, K. Halatsis, A block-based Hough transform mapping for text line detection in handwritten document, in: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006, pp. 515–520.
[23] Z. Shi, S. Setlur, V. Govindaraju, Text extraction from gray scale historical document image using adaptive local connectivity map, in: Proceedings of the Eighth International Conference on Document Analysis and Recognition, vol. 2, 2005, pp. 794–798.
[24] Z.X. Shi, V. Govindaraju, Line separation for complex document images using fuzzy runlength, in: Proceedings of the First International Conference on Document Image Analysis for Libraries, 2004, pp. 306–312.
[25] D.J. Kennard, W.A. Barrett, Separating lines of text in free-form handwritten historical documents, in: Proceedings of the Second International Conference on Document Image Analysis for Libraries, 2006, pp. 12–23.
[26] S. Nicolas, T. Paquet, L. Heutte, Markov random field models to extract the layout of complex handwritten documents, in: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006, pp. 563–568.
[27] Y. Li, Y. Zheng, D. Doermann, S. Jaeger, Script-independent text line segmentation in freestyle handwritten document, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1–17.
[28] F. Yin, C.-L. Liu, Handwritten text line extraction based on minimal spanning tree clustering, in: Proceedings of the Fifth International Conference on Wavelet Analysis and Pattern Recognition, vol. 3, 2007, pp. 1123–1128.
[29] F. Chang, C.J. Chen, C.J. Lu, A linear-time component-labeling algorithm using contour tracing technique, Computer Vision and Image Understanding 93 (2) (2004) 206–220.
[30] C.-L. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (11) (2002) 1425–1437.
[31] A.V. Aho, J.E. Hopcroft, J.D. Ullman, Data Structures and Algorithms, Addison-Wesley, 1983.
[32] A. Schenker, M. Last, H. Bunke, A. Kandel, Fuzzy clustering with genetically adaptive scaling, International Journal of Image and Graphics 2 (4) (2002) 557–572.
[33] C. Domeniconi, Locally adaptive techniques for pattern classification, Ph.D. Thesis, University of California, Riverside, 2002.
[34] E. Xing, A.Y. Ng, M. Jordan, S. Russell, Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems, vol. 15, 2003, pp. 505–512.
[35] L. Vandenberghe, S. Boyd, Semidefinite programming, SIAM Review 38 (1) (1996) 1–50.
[36] C.T. Zahn, Graph-theoretical methods for detecting and describing Gestalt clusters, IEEE Transactions on Computers 20 (1) (1971) 68–86.
[37] Y. He, L.H. Chen, A threshold criterion, auto-detection and its use in MST-based clustering, Intelligent Data Analysis 9 (3) (2005) 253–271.
[38] L. Yujian, A clustering algorithm based on maximal θ-distant subtrees, Pattern Recognition 40 (5) (2007) 1425–1431.
[39] O. Grygorash, Y. Zhou, Z. Jorgensen, Minimum spanning tree based clustering algorithm, in: Proceedings of the 18th International Conference on Tools with Artificial Intelligence, 2006, pp. 73–81.
[40] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989) 773–781.
[41] T. Su, T. Zhang, D. Guan, Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text, International Journal on Document Analysis and Recognition 10 (1) (2007) 27–38.
[42] F. Yin, C.-L. Liu, A tool for ground-truthing text lines and characters in handwritten Chinese documents, submitted to ICDAR 2009.
[43] C. Wolf, J.M. Jolion, Object count/area graphs for the evaluation of object detection and segmentation algorithms, International Journal on Document Analysis and Recognition 8 (4) (2006) 280–296.
[44] I.T. Philips, A.K. Chhabra, Empirical performance evaluation of graphics recognition system, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (9) (1999) 849–870.
[45] S. Mao, T. Kanungo, Empirical performance evaluation methodology and its application to page segmentation algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 242–256.
[46] G. Liu, R.M. Haralick, Optimal matching problem in detection and recognition performance evaluation, Pattern Recognition 35 (10) (2002) 2125–2135.
[47] J. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (4) (1965) 308–313.
[48] S. Mao, T. Kanungo, Automatic training of page segmentation algorithms: an optimization approach, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, pp. 531–534.

About the Author—FEI YIN received the B.S. degree in Computer Science from Xidian University, Xi'an, China, and the M.E. degree in Pattern Recognition and Intelligent Systems from Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively. He is currently pursuing a Ph.D. degree in Pattern Recognition and Intelligent Systems at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include document image analysis, handwritten character recognition and computer vision.

About the Author—CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at the Korea Advanced Institute of Science and Technology (KAIST) and later at the Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. Since 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the Deputy Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially their applications to character recognition and document analysis. He has published over 80 technical papers in international journals and conference proceedings. He won the IAPR/ICDAR Young Investigator Award in 2005.