Layout-Aware Limiarization for Readability Enhancement of Degraded Historical Documents

Flávio Bertholdo

Eduardo Valle

Arnaldo de A. Araújo

Bertholdo Consultoria Av. Prof. Mário Werneck 881 / 302 30455-610 Belo Horizonte, Brazil +55 31 3243-7397

NPDI — ICEx / DCC / UFMG Av. Antônio Carlos, 6627 31270-010 Belo Horizonte, Brazil +55 31 3409-5860

NPDI — ICEx / DCC / UFMG Av. Antônio Carlos, 6627 31270-010 Belo Horizonte, Brazil +55 31 3409-5860

[email protected]

[email protected]

[email protected]

ABSTRACT

In this paper we propose a limiarization technique (also known as thresholding or binarization) tailored to improve the readability of degraded historical documents. Limiarization is a simple image processing technique employed in many complex tasks, such as image compression, object segmentation and character recognition. The technique also finds applications in its own right: since it produces a high-contrast image, in which the foreground is clearly separated from the background, it can greatly improve the readability of a document, provided that other attributes (like character shape) do not suffer. Our technique exploits statistical characteristics of textual documents and applies both global and local thresholding. In a visual inspection of experiments on a collection of severely degraded historical documents, it compares favorably with the state of the art.

Categories and Subject Descriptors

I.4.3 [Image Processing and Computer Vision]: Enhancement – grayscale manipulation. I.4.6 [Image Processing and Computer Vision]: Segmentation. I.7.5 [Document and Text Processing]: Document Capture – Optical character recognition, Scanning.

General Terms

Algorithms, Design, Experimentation.

Keywords

Image enhancement, Binarization, Limiarization, Readability improvement, Historical documents.

1. INTRODUCTION

Digitization has become a major tool for Libraries, Archives and Museums to improve access to their collections [2]. The digital format is able to offer reproductions of such quality that the user will seldom request access to the original. Since excessive manipulation is a source of wear and tear, the availability of digital surrogates may have a great impact on the preservation of historical documents. On the other hand, if the copies are poor, access to the originals will be requested often, defeating the purpose of digitization.

Therefore, much effort is spent on improving the users' acceptance of the reproductions. For textual documents (printed, typed or handwritten), the most important quality feature is readability, for which measurements have been developed since the inception of analog reproduction. For example, norms on resolution, quality index and density [1] have been developed in order to obtain better readability on microfilms. On digital images the criteria are more complex, but they still depend heavily on contrast (which is related to density) and resolution. Also important are the absence of noise and the preservation of character shapes.

Limiarization (also known as thresholding or binarization) is a simple image processing technique in which the pixels of a grayscale image are mapped to the extreme values of the scale (either black or white). The result is a high-contrast image, in which the foreground is clearly separated from the background. This can greatly improve the readability of the document, provided that other criteria (like character shape) are not degraded. Limiarization is important for many applications, including compression, segmentation, character recognition (OCR and ICR), etc. Not surprisingly, there is an abundant literature on the subject. Basically, there are two families of solutions: global methods, which apply a single threshold to the whole image, and local methods, which adapt the threshold to different regions of the document. The latter are more robust when applied to complex backgrounds, but they are also prone to generating noise artifacts.

We propose a method based on both global and local techniques, in order to improve the readability of degraded historical documents. Its originality is to assume that an image representing a page of text has an inherent structure (being composed of lines of characters), which can be exploited to eliminate local anomalies, preserve the character shapes and avoid noise, while the global contrast is enhanced. We have applied our method to a severely deteriorated collection of important documents of Brazilian history. In a visual comparison, our method better preserves the character shapes, while still clearly separating text from background.
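To make the basic operation concrete, the following minimal sketch (in Python with NumPy; our illustration, not part of the original paper) maps a grayscale image to pure black and white given a global threshold. Every method discussed below differs only in how that threshold is chosen and in whether it varies across the image.

import numpy as np

def binarize(gray, threshold):
    # Map every pixel of an 8-bit grayscale image to an extreme of the scale:
    # white if it is at or above the threshold (paper), black otherwise (ink).
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)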

2. RELATED WORK

As we mentioned before, the literature on limiarization methods is abundant, and it is not our intention here to make an exhaustive inventory of the state of the art. A more comprehensive survey can be found in [10].

In a very general way, the works on limiarization can be divided into two classes: (1) general-purpose methods, which do not take into consideration specific characteristics of the documents; and (2) application-oriented methods, which profit from specific a priori knowledge about the documents being processed. In addition, the methods can be considered global, if they propose a single threshold for the entire document, or local, if they adapt the thresholding to local characteristics of the document.

Global methods, which are comparatively simpler, try to find a single grayscale threshold above which all pixels are mapped to white and below which all pixels are mapped to black. The criterion used to establish this value varies for each technique, and may take into consideration the separability between the two classes of pixels to be assigned black and white [8], the maximization of the entropy [5], etc.
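As an illustration of the global family, the sketch below (Python/NumPy; our illustration, not code from the paper) implements Otsu's separability criterion [8]: it exhaustively searches for the threshold that maximizes the between-class variance of the two pixel classes. The input is assumed to be an 8-bit grayscale image.

import numpy as np

def otsu_threshold(gray):
    # Histogram of grayvalues, normalized to a probability distribution.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # mean of dark class
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # mean of bright class
        var_between = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# The single threshold is then applied to the whole image, e.g.:
# binary = np.where(gray >= otsu_threshold(gray), 255, 0).astype(np.uint8)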

Local methods take into account local statistics of the image to obtain an adaptive threshold for different regions. The methods proposed by Niblack [7] and Sauvola [9] are perhaps the best-known adaptive thresholding methods; both use a sliding-window technique to compute an individual threshold for each pixel. While Niblack's method is more sensitive to background noise, Sauvola's method uses an a priori hypothesis on the grayvalues of the foreground and the background to regularize the threshold estimation.
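The sketch below shows the usual formulations of both methods (our illustration, not code from [7] or [9]): Niblack's threshold is the local mean shifted by a fraction of the local standard deviation, while Sauvola's rescales that deviation by a dynamic range R. The window size and the parameters k and R are typical textbook choices, not values taken from the original papers.

import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(gray, window=25):
    # Per-pixel mean and standard deviation over a sliding window.
    g = gray.astype(float)
    mean = uniform_filter(g, window)
    sq_mean = uniform_filter(g * g, window)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    return mean, std

def niblack(gray, window=25, k=-0.2):
    # T(x, y) = m(x, y) + k * s(x, y), with k negative for dark text.
    mean, std = local_stats(gray, window)
    return np.where(gray > mean + k * std, 255, 0).astype(np.uint8)

def sauvola(gray, window=25, k=0.5, r=128.0):
    # T(x, y) = m(x, y) * (1 + k * (s(x, y) / R - 1)), R being the dynamic range.
    mean, std = local_stats(gray, window)
    return np.where(gray > mean * (1 + k * (std / r - 1)), 255, 0).astype(np.uint8)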

Recent methods have been proposed with the specific purpose of contrast enhancement for textual historical documents [3, 6, 11, 12]. In particular, Kavallieratou [6] proposes an interesting method, based on an iterative procedure of histogram equalization, which uses as a priori knowledge the fact that white pixels (from the paper) should be much more frequent than black pixels (from the text). Strictly speaking, this is not a limiarization method, since pixels with intermediate values may remain after the contrast enhancement, but those can be eliminated by a simple thresholding, if necessary.
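The sketch below is our reading of that iterative procedure, as summarized later in Section 3 (subtract the global mean, re-equalize the histogram, and stop when two consecutive means differ by less than 0.2 on a 0–1 scale). The clipping of negative values and the equalization via the empirical CDF are our assumptions; the full details are in [6].

import numpy as np

def iterative_global_enhancement(gray, tol=0.2):
    # Sketch of the iterative contrast enhancement attributed to [6].
    img = gray.astype(float) / 255.0          # normalize to the 0-1 range
    prev_mean = None
    while True:
        mean = img.mean()
        if prev_mean is not None and abs(mean - prev_mean) < tol:
            break                             # means have stabilized
        prev_mean = mean
        img = np.clip(img - mean, 0.0, 1.0)   # subtract the global mean
        # Histogram equalization through the empirical CDF.
        hist, bins = np.histogram(img, bins=256, range=(0.0, 1.0))
        cdf = hist.cumsum().astype(float)
        cdf /= cdf[-1]
        img = np.interp(img, bins[:-1], cdf)
    return (img * 255).astype(np.uint8)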

3. OUR APPROACH

We propose an algorithm combining global and local strategies. Our aim is to ameliorate the document readability, by eliminating local anomalies and noise, while preserving the character shapes and clearly separating text and background. The method can be decomposed into four basic steps:

1. Extraction of global image statistics;
2. Segmentation of the document into text lines;
3. Thresholding only on the rows containing text, considering both local and global statistics — the main goal of this step is to preserve the character shapes;
4. Global limiarization based on Kavallieratou's work [6].

The first step extracts global statistics from the image grayscale values. Moment-based statistics, like the mean and the variance, are the most used in the literature (e.g. [7, 9]). In our work we extract the global mean (μ) over grayvalues, which is used later, in step 3, in the computation of the local thresholds.

The second step consists in segmenting the image into horizontal text lines. This is done through a very simple method, in which each (1-pixel) row of the image is independently classified. We assume that text-bearing rows have a concentration of grayvalues around two well-separated modes, due to the presence of the characters, while the background rows tend towards a narrow unimodal distribution. Thus, we compute the mean and the mode of the grayvalues on the row, and classify it as containing text if the difference between those values is higher than a threshold T_row. We have empirically set T_row = 20 (for a 0–255 grayscale range), a value which allowed a good separation between text-bearing and background rows. The results can be appreciated in Figure 1.
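A minimal sketch of this row classifier follows (ours, in Python/NumPy). The paper only states the mean–mode criterion and T_row = 20, so the histogram-based mode computation is an assumption; the row is assumed to be an 8-bit grayscale array.

import numpy as np

def is_text_row(row, t_row=20):
    # A text-bearing row is roughly bimodal (paper plus ink), so its mean is
    # pulled away from its mode; in a background row the two nearly coincide.
    mean = row.mean()
    mode = np.bincount(row, minlength=256).argmax()   # most frequent grayvalue
    return abs(mean - mode) > t_row

Rows flagged as text are passed on to the local enhancement of step 3; all other rows are handled only by the global stage of step 4.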

Figure 1: Detecting the text-bearing rows — the second step of our approach to limiarization.

In the third step we apply an adaptive local thresholding on the text-bearing rows. The choice of applying local limiarization only on those rows is motivated by two reasons. First, we have observed that local thresholding tends to create noise artifacts in the darker regions of the background. Second, assuming that the row does contain text, we are able to employ a rather aggressive character-enhancing technique. Inspired by the works of Niblack [7] and Sauvola [9], a threshold T is computed for the row from the row mean (μ_row), the row standard deviation (σ_row) and the image mean (μ).

We then add σ_row to the pixels whose grayvalues are above T; the idea is to skew the brightest pixels even further towards the bright end of the scale, in order to minimize background fluctuations. Then we scan the row horizontally, looking for significant transitions: a decreasing transition closely followed by an increasing transition, where the difference in grayvalues between the extremes of the transition is at least σ_row. Whenever we detect this kind of profile (which indicates the presence of a character), we subtract σ_row from the grayvalues of the whole span, effectively skewing those pixels towards the darker end of the scale.

In the fourth step, we apply a global contrast enhancement with the Kavallieratou technique. This consists basically in computing the mean grayvalue of the image, subtracting this mean from all pixels, applying a histogram equalization, and iterating the entire process. The iteration stops when the difference between two consecutive means becomes too small (less than 0.2 on a grayscale normalized to 0–1). After the last equalization, some intermediate grayvalues may remain, and a simple thresholding can be applied to obtain a pure black-and-white image. A detailed description can be found in [6].
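The following sketch illustrates the row enhancement of step 3 (our simplified illustration, not the authors' implementation). The threshold t is taken as an input, since the paper computes it from μ_row, σ_row and μ; the dip detection is a straightforward reading of the transition rule described above.

import numpy as np

def enhance_text_row(row, t, sigma):
    # row: 1-D uint8 array of grayvalues; t: row threshold; sigma: row std (σ_row).
    r = row.astype(float)
    r[r > t] = np.minimum(r[r > t] + sigma, 255)   # push background towards white
    i, n = 0, len(r)
    while i < n - 1:
        j = i
        while j < n - 1 and r[j + 1] <= r[j]:      # follow a decreasing transition
            j += 1
        if r[i] - r[j] >= sigma:                   # drop of at least sigma
            k = j
            while k < n - 1 and r[k + 1] >= r[k]:  # followed by an increasing one
                k += 1
            if r[k] - r[j] >= sigma:               # rise of at least sigma: a character
                r[i:k + 1] = np.maximum(r[i:k + 1] - sigma, 0)   # darken the span
            i = k + 1
        else:
            i = j + 1
    return np.clip(r, 0, 255).astype(np.uint8)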

4. THE TEST IMAGES

Our test images come from a collection held by the State Archives of Minas Gerais (APM). The documents belonged to the Department of Political and Social Order of the State of Minas Gerais (DOPS/MG), which was created during the period of Brazilian history known as Estado Novo (“New State”, 1937–45) and which held the role of secret police during the Brazilian military dictatorship (1964–85). The collection has documents created between 1927 and 1982 and is considered an important source for the collective memory of the State and the Country. It contains information about the periods of political repression, as well as about the political and social movements which fought for the restoration of democracy.

The documents themselves have a tortuous history. Though their transfer to the Permanent Collection of APM was decided in 1990, the actual transfer only happened in 1998, after it was discovered that they continued to be illegally used by the security forces. In addition, the original documents were incinerated before the transfer (in microfilm format) was completed, generating the unusual situation (in the context of Historical Archives) where the films are considered primary sources. The collection today consists of 98 microfilm rolls, containing approximately 250 thousand photograms. The images have been digitized into JPEG files, meant to make access to the collection easier.

Because of the intrinsic nature of the documents (newspaper clippings, typewritten notes, pamphlets…) and their history (bad storage, low-quality microfilming, digitization in a compressed format), the collection presents a strong challenge in terms of readability enhancement. It remains, nevertheless, one of the most important collections of APM, and it is among the most requested by the users of the Archive.

Figure 2: Comparative test on typewritten document. From top to bottom: original document (top), Kapur [5], Otsu [8], Sauvola [9], Kavallieratou [6], our approach (bottom).

Figure 3: Comparative test on newspaper clipping. From left to right, top to bottom: original document (top left), Kapur, Otsu, Sauvola, Kavallieratou, our approach (bottom right).

5. RESULTS

Due to the lack of standardized databases of historical documents, the validation of methods is done mainly by visual comparison with the state of the art. We have chosen to compare with two classical, well-accepted methods, by Kapur [5] and Otsu [8], and with two modern approaches, by Sauvola [9] and Kavallieratou [6].

The comparison on purely textual documents (Figures 2 and 3) shows that our approach provides a significant enhancement of the contrast of deteriorated documents, while avoiding distortion of the character shapes. The test on the newspaper clipping (Figure 3), in particular, shows how the methods based only on global thresholding (Otsu, Kapur) fail to provide good results on severely deteriorated documents, since they tend either to make character strokes overly thick or to severely fragment the characters. The results by Sauvola, which use local thresholding, provide a more uniform density, but since the method is not particularly aware of the textual content of the image, this uniformity does not necessarily translate into a better preservation of character shapes. Note, in particular, how the text on underlined zones is badly distorted. Kavallieratou's method provided good results in both cases, but our approach was able to produce a crisper separation between the text and the background, due to the local enhancement of text-bearing rows.

On the test with text and pictures on the same page (Figure 4), we see that, while our technique is optimized for text, it does not penalize the limiarization of pictures. In this test, the advantages of the text-oriented local thresholding are particularly striking: while in Kavallieratou's technique the low-contrast heading was lost, our technique was able to preserve it.

Figure 4: Comparative test on magazine cover. From left to right, top to bottom: original document (top left), Kapur, Otsu, Sauvola, Kavallieratou, our approach (bottom right).

6. DISCUSSION

The main advantage of our algorithm is its awareness of the text-containing areas of the page and its use of a priori knowledge about the grayscale distribution of those areas, which allows the enhancement of character shapes before the global contrast enhancement is applied. The weakest point of the algorithm is the line classification step, which classifies the pixel rows independently and sometimes “misses” one of the textual rows (classifying it as non-text), generating artifacts. We expect that results will be improved by employing better line segmentation. The algorithm could also benefit from more sophisticated layout-detection techniques, though our results show that it already performs well even on complex pages (multiple columns, embedded images).

We are currently working on a more quantitative evaluation of our results. We are interested both in user-based experimental designs and in computer-based evaluations [4, 10, 13], though the former are expensive and difficult to control and the latter suffer from inherent validity issues when applied to difficult datasets like ours. We expect that using our images to cross-validate user experiments and computer-based evaluations will reinforce the authority of the latter.

7. ACKNOWLEDGEMENTS

The authors are thankful to the Brazilian agencies CAPES, CNPq and FAPEMIG for their financial support of this work.

8. REFERENCES

[1] ANSI/AIIM MS23-1991, Practice for operational procedures/inspection and quality control of first-generation, silver microfilm of documents, pp. 46–48, § 8.3.7–8.3.7.3.4.
[2] Conway, P. Preservation in the digital world (CLIR Reports). Council on Library and Information Resources, 1996.
[3] Gatos, B., Pratikakis, I., Perantonis, S. An adaptive binarization technique for low quality historical documents. IAPR Workshop on Document Analysis Systems, pp. 102–113, 2004.
[4] Govindaraju, V., Srihari, S. Assessment of image quality to predict readability of documents. Proc. SPIE, 2660(333), 1996.
[5] Kapur, J., Sahoo, P., Wong, K. A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics and Image Processing, 1985.
[6] Kavallieratou, E., Antonopoulou, H. Cleaning and enhancing historical document images. Advanced Concepts for Intelligent Vision Systems, pp. 681–688, 2005.
[7] Niblack, W. An introduction to digital image processing. Prentice Hall, Englewood Cliffs, N.J., pp. 115–116, 1986.
[8] Otsu, N. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, SMC-9(1), pp. 62–66, 1979.
[9] Sauvola, J., Pietikäinen, M. Adaptive document image binarization. Pattern Recognition, 33, pp. 225–236, 2000.
[10] Sezgin, M., Sankur, B. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1), pp. 146–165, 2004.
[11] Shi, Z., Govindaraju, V. Historical document image segmentation using background light intensity normalization. SPIE Document Recognition and Retrieval XII, pp. 16–20, 2005.
[12] Silva, J., Lins, R., Rocha, V. Binarizing and filtering historical documents with back-to-front interference. ACM Symposium on Document Engineering, pp. 853–858, 2006.
[13] Sturgill, M., Simske, S. An optical character recognition approach to qualifying thresholding algorithms. ACM Symposium on Document Engineering, pp. 263–266, 2008.