A Method for Improving the Visual Quality of Digitized Antique Books

F. Stanco, G. Ramponi and L. Tenze
DEEI, University of Trieste, via A. Valerio 10, 34127 Trieste, Italy
[email protected], [email protected], [email protected]

Abstract

In this paper we propose an algorithm which improves the quality of the digital version of antique books. In particular, we suggest a technique that enhances the virtual quality of the paper, eliminates some defects and improves the performance of Optical Character Recognition (OCR) operators.

1. Introduction

Antique books represent an immense historic, cultural, and artistic patrimony, stored in a distributed form in thousands of private and public libraries all over the world. It is well known that digital techniques are a powerful tool for making this patrimony available to users with different needs and interests, ranging from scholars to the general public; large intellectual and financial resources have already been devoted to this purpose. Examples of supported projects in the field abound; without claiming to compile an exhaustive list, we may mention the Universal Library Project [1], the Digital Libraries Initiative [2], and the EU-funded BRICKS Digital Library project [3].

A typical processing chain for a page to become a digital document is not easily defined, since its steps depend strongly on the purpose of the procedure. After the scanning process, which must be performed accurately, avoiding UV illumination and limiting visible light exposure and mechanical stress on the book’s binding, the document may be segmented into its text, graphic, and background components; it may be compressed, separately or jointly; and it may be binarized and input to an OCR system.

An important element of the overall treatment is the restoration phase. To be generally applicable without incurring the criticism of the paleographic expert, restoration must remove the defects that are due to aging without losing the original characteristics of the document, such as the texture of the paper and the different shades of the ink.

The interest of this paper is in the restoration field. Different problems are addressed, namely the alterations of the ink and of the paper, mainly due to humidity, and the yellowing of the paper due to cellulose oxidation catalysed by metals [4]. In the procedure we propose, a method is suggested for making the background aspect of the paper more homogeneous, reducing the humidity-induced local alterations; at the same time, discoloured parts of the characters are restored. Some early results on the possible use of this technique as a preprocessing stage in an OCR system are finally shown.

The rest of the paper is organized as follows: Section 2 describes the algorithm for the background enhancement; Section 3 describes how to use the method to improve the performance of an OCR operator; Section 4 shows some experimental results. A Conclusions section ends the paper.

2. Page enhancement

Due to humidity or unintentional contact with water, the pages of old books can be deformed and take a local wavelike shape. When we acquire these books, the deformations produce projective distortion (skew) in the final image and, consequently, different colors in its background. If the distortion is significant and hence produces irregular text lines, it cannot be ignored. In this case different warping algorithms can be applied. For example, Brown et al. [6] suggest creating a planar representation of a document that is modified to obtain a flat area. In most cases, however, the distortion is small and the skew does not produce relevant artifacts, but only color variations in the background (see Fig. 1).

In this section, we suggest a simple algorithm that improves the quality using only the luminance information. Given the context this method is designed for, it is reasonable to assume that the page is formed by dark text over a light background. Starting from this assumption, our algorithm divides the luminance histogram into two parts: text and background. Then, the background is adjusted in color. The way in which binarization and restoration are performed is described in the rest of this section. For clarity, a block scheme of the procedure is reported in Fig. 2.

Typical values of the parameters of the Rational Filter in Eq. (1), for 8-bit images, are A = 5 and k = 0.03. The image Yr output by this operator contains the most significant background and character structures of the original page and is further processed as indicated below. Small but important details of the page are lost in Yr, but our procedure preserves them by evaluating the pixel-by-pixel ratio between the original and filtered images:

$$ Y_d(i,j) = \frac{Y(i,j)}{Y_r(i,j)} \qquad (2) $$

The image Yd is multiplied back at the end of the tonal adjustment process, as indicated in Fig. 2.

Tonal Adjustment. We want to modify the values of Yr in order to enhance the contrast between the text and the background. To this purpose we adjust the tonal range of the image using a curve. We consider a graph where the horizontal axis represents the original intensity values of the pixels and the vertical axis represents the new values. All the points on the diagonal line have identical input and output values; any curve different from the diagonal changes the intensity of the pixels. To enhance the differences between the text and the background, we use a curve that flattens both dark and bright image areas, as indicatively shown in Fig. 3. By increasing or decreasing the slope of its central portion we can control the contrast of the image. The output image has higher contrast than the input one. It is multiplied by Yd, as mentioned above, to obtain Yn, and the result is used to adjust the chrominance matrices. Yn is also used in the RGB color space conversion in place of the original Y.

Figure 1: An example of background with a variation of colors.
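The tonal adjustment and the re-injection of the details can be sketched as follows. This is a minimal illustration, not the authors' implementation: it borrows the polynomial reported in Section 4 as Eq. (5), and it assumes 8-bit luminance values normalized to [0, 1] before the curve is applied. The names `tone_curve` and `tonal_adjust`, the `eps` guard against division by zero, and the final clipping to [0, 255] are our own assumptions.

```python
import numpy as np

def tone_curve(x):
    """Fifth-degree curve of Eq. (5); x is the intensity normalized to [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return -4 * x**5 + 15 * x**4 - 20 * x**3 + 10 * x**2

def tonal_adjust(Y, Yr, eps=1e-6):
    """Apply the curve to the smoothed luminance Yr and restore fine details.

    Y, Yr: luminance arrays with values in [0, 255].
    eps:   hypothetical guard against division by zero (not in the paper).
    """
    Y = np.asarray(Y, dtype=np.float64)
    Yr = np.asarray(Yr, dtype=np.float64)
    Yd = Y / np.maximum(Yr, eps)          # detail image, Eq. (2)
    Yc = 255.0 * tone_curve(Yr / 255.0)   # tonal adjustment of Yr
    Yn = np.clip(Yc * Yd, 0.0, 255.0)     # multiply the details back (clipping assumed)
    return Yn
```

In a flat region, where Y equals Yr, the detail image Yd is 1 and the output is simply the curve applied to the value of the region.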

Scanning the page with a flat-bed scanner or another acquisition device, we produce a color image I, represented by its R, G and B components. Our algorithm first converts this input image I to a different color space, YCbCr, via a conventional transformation [8]. The algorithm operates first on the luminance image Y, and it modifies its values to better reproduce the structure of the original image. Then, using the structure learned in Y, it modifies the chrominance matrices Cb and Cr.

Rational Filter. In this step we produce an image Yr in which the largest edges are still well marked and the values inside a region are made homogeneous. This result may be achieved by applying an Anisotropic Diffusion (AD) operator [9] to the luminance Y, or by applying the Rational Filter (RF) [5] n times. We choose the RF because it is simple and effective. The Rational Filter is a nonlinear operator that attenuates small image variations while preserving edges. It acts by modulating the coefficients of a linear lowpass filter in order to limit its action in the presence of luminance changes. Different versions of this operator can be devised; we use one of the simplest: for each pixel Y(i, j), the output of the filter is obtained according to the relation

$$ Y_r(i,j) = Y(i,j) + \frac{Y(i-1,j) + Y(i+1,j) - 2Y(i,j)}{k\,\bigl(Y(i-1,j) - Y(i+1,j)\bigr)^2 + A} + \frac{Y(i,j-1) + Y(i,j+1) - 2Y(i,j)}{k\,\bigl(Y(i,j-1) - Y(i,j+1)\bigr)^2 + A} \qquad (1) $$
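A direct NumPy transcription of Eq. (1) may help to see how the filter behaves. The sketch below is only illustrative: the function name `rational_filter` is ours, and the treatment of the image borders (edge replication) is an assumption, since the text does not specify it. The default values of n, A and k are those reported in the paper.

```python
import numpy as np

def rational_filter(Y, n=10, A=5.0, k=0.03):
    """Apply n iterations of the Rational Filter of Eq. (1) to a luminance image.

    Border pixels are handled by edge replication, which is an assumption:
    the paper does not say how image borders are treated.
    """
    Yr = np.asarray(Y, dtype=np.float64).copy()
    for _ in range(n):
        P = np.pad(Yr, 1, mode="edge")
        up, down = P[:-2, 1:-1], P[2:, 1:-1]      # Y(i-1, j), Y(i+1, j)
        left, right = P[1:-1, :-2], P[1:-1, 2:]   # Y(i, j-1), Y(i, j+1)
        # Each term is a second difference divided by an edge-sensitive weight:
        # a large difference between opposite neighbours makes the denominator
        # large, so the smoothing is weakened across edges.
        vert = (up + down - 2.0 * Yr) / (k * (up - down) ** 2 + A)
        horiz = (left + right - 2.0 * Yr) / (k * (left - right) ** 2 + A)
        Yr = Yr + vert + horiz
    return Yr
```

With A = 5 and k = 0.03, an isolated pixel of value 101 in a region of value 100 is moved to 100.2 by one iteration, whereas a pixel next to a 0-to-255 step changes by only about 0.13 per iteration, so the edge is essentially preserved.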

Figure 3: Examples of curves.

Thresholding. In the image Yn the algorithm automatically searches for a threshold that divides the luminance histogram into two separate classes. We use Otsu’s method of threshold selection [10], which is among the most effective and fastest global thresholding methods. Since the histogram of Yn shows two separate peaks as a result of the tonal correction described above, the threshold t selected by Otsu’s method is effective. The binary image is then defined as

$$ Y_{bw}(i,j) = \begin{cases} 0 & \text{if } Y_n(i,j) \ge t \\ 1 & \text{elsewhere} \end{cases} \qquad (3) $$
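For illustration, Otsu's threshold selection and the binarization of Eq. (3) can be sketched as below. This is a generic histogram-based version of Otsu's method written for 256 gray levels, not code taken from the paper; the function names, the rounding of Yn to integer levels, and the convention that t is the first level assigned to the bright class are our assumptions.

```python
import numpy as np

def otsu_threshold(Yn):
    """Return an Otsu threshold t for an image with values in [0, 255].

    Levels below t form the dark class, levels >= t the bright class.
    """
    levels = np.clip(np.rint(Yn), 0, 255).astype(np.int64)
    hist = np.bincount(levels.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # probability of levels <= g
    mu = np.cumsum(p * np.arange(256))    # cumulative mean up to level g
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)               # between-class variance per split
    valid = denom > 0                     # skip splits with an empty class
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    # argmax gives the last level of the dark class; t is the next level.
    return int(np.argmax(sigma_b)) + 1

def binarize(Yn, t):
    """Eq. (3) as printed: 0 where Yn >= t, 1 elsewhere."""
    return np.where(np.asarray(Yn) >= t, 0, 1).astype(np.uint8)
```

On an image whose histogram has two well separated peaks, any threshold between the peaks gives the same between-class variance, which is why the tonal correction makes the choice of t robust.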


Figure 2: Block scheme of the algorithm.

Ybw is the binary version of Yn.

Color filtering. In this step we correct the chrominance matrices, using the information derived from the luminance. In other words, we change the chrominance values of the pixels that belong to the background, and leave all the remaining pixels unchanged. The binary image Ybw is helpful for this operation. Let Mx = {Cx(i, j) | Ybw(i, j) = 1}, with x ∈ {r, b}, be the set of chrominance values that belong to the background. The new chrominance matrices are defined as follows:

$$ C'_x(i,j) = \begin{cases} \mathrm{median}(M_x) & \text{if } Y_{bw}(i,j) = 1 \\ C_x(i,j) & \text{elsewhere} \end{cases} \qquad (4) $$

The matrices C'b and C'r are the new chrominance matrices produced by our algorithm.

Color conversion. The components Yn, C'b and C'r obtained from our algorithm are used instead of Y, Cb and Cr in the color conversion to obtain the RGB components of the final image If.

3. Improving OCR

Optical Character Recognition (OCR) algorithms detect alphanumeric characters in document images [7]. Widely used OCR systems give better results if the text is black on a uniform white background, or vice versa. They are also based on the assumption that the original image is a flat document. This means that pages from an antique book are not an ideal input for an OCR algorithm: as described above, they are usually characterized by a nonuniform background, and the document may not be flat. These problems, added to the faulty recognition of particular characters used in the past, produce outputs that are illegible and far from the original text.

Since binarization is an important part of OCR, we propose to use the technique described in the previous section to improve this task. In particular, we suggest creating a binary image Ybw and using it as the input of the OCR system. In this case, since it is necessary to impose a clear separation between the text and the background, we suggest choosing the curve used during the tonal adjustment with care. More precisely, we propose to use a high-slope curve like the rightmost one in Fig. 3. This solution allows a better separation of the text from the background and enhances the color of the text.
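The binary image Ybw also drives the color filtering step of Section 2. A minimal sketch of Eq. (4) for one chrominance matrix is given below; the function name `filter_chrominance` is ours, and the mask is assumed to be True exactly on the pixels that Eq. (4) treats as background (those with Ybw = 1). The guard for an empty background set is also an assumption, since the text does not discuss that case.

```python
import numpy as np

def filter_chrominance(C, background):
    """Eq. (4): set background chrominance to its median, keep the rest.

    C:          one chrominance matrix (Cb or Cr).
    background: boolean mask, True where the pixel belongs to the background
                (the pixels with Ybw = 1 in Eq. (4)).
    """
    C = np.asarray(C, dtype=np.float64)
    background = np.asarray(background, dtype=bool)
    out = C.copy()
    if background.any():  # an empty set M_x has no median (assumed guard)
        out[background] = np.median(C[background])
    return out
```

The same function would be called once for Cb and once for Cr with the same mask, so that the whole background receives a single, uniform chrominance while the remaining pixels keep their original color.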

4. Experimental results

This section shows some results obtained by applying the proposed algorithm to old books. The performance of the algorithm cannot be quantitatively evaluated using MSE or PSNR, because the images are real scans of old books and no defect-free reference is available. The proposed algorithm works without user intervention. Only two parameters need to be set: the number of iterations n for the RF and the equation of the curve used during the tonal adjustment. In our experiments the number of iterations n is 10. The curve used is described by a fifth-degree polynomial with null first derivative at zero and at one, and null second and third derivatives at one. More precisely, the equation of the curve is

$$ y(x) = -4x^5 + 15x^4 - 20x^3 + 10x^2 \qquad (5) $$
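As a quick check, added here for illustration, the derivative conditions stated for Eq. (5) follow directly from factoring the first derivative:

```latex
y'(x) = -20x^4 + 60x^3 - 60x^2 + 20x = -20\,x\,(x-1)^3
% simple root at x = 0:  y'(0) = 0
% triple root at x = 1:  y'(1) = y''(1) = y'''(1) = 0
% endpoints: y(0) = 0, \quad y(1) = -4 + 15 - 20 + 10 = 1
% y'(x) > 0 for 0 < x < 1, so the curve is monotonically increasing
```

The curve therefore maps [0, 1] onto [0, 1] without inverting the order of the gray levels, and it is flattest near x = 1, that is, in the bright background range.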

Figs. 4, 5 and 6 report some processed images. It can be seen that the algorithm enhances the text and restores the background, maintaining the original characteristics of the paper while eliminating the artifacts. In particular, for the image in Fig. 4 the background still looks natural but the paper deformation has disappeared. In Fig. 6(b) the blotches near the text have been eliminated, while the color of the stamp has been preserved.

To improve the binarization for the OCR, as described in Section 3, we use a curve different from (5). We propose a seventh-degree polynomial with several null derivatives at zero and at one. The equation is

$$ y(x) = -20x^7 + 70x^6 - 84x^5 + 35x^4 \qquad (6) $$
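The two curves can be compared numerically with a few lines of NumPy. The sketch below is only a verification aid: the helper `null_derivative_orders` and the variable names are ours, and the comparison at x = 0.5 is our choice of a representative midtone.

```python
from numpy.polynomial import Polynomial

# Coefficients are listed in increasing powers of x.
curve5 = Polynomial([0, 0, 10, -20, 15, -4])          # Eq. (5)
curve7 = Polynomial([0, 0, 0, 0, 35, -84, 70, -20])   # Eq. (6)

def null_derivative_orders(p, x, max_order=4, tol=1e-9):
    """Return the derivative orders (1..max_order) of p that vanish at x."""
    return [m for m in range(1, max_order + 1) if abs(p.deriv(m)(x)) < tol]

# Slope of each curve at the midtone x = 0.5.
mid_slope5 = curve5.deriv()(0.5)
mid_slope7 = curve7.deriv()(0.5)
```

For Eq. (6) the first three derivatives vanish at both x = 0 and x = 1, so the curve is flat at both ends of the range and symmetric about the midpoint, where its slope is 2.1875; the slope of Eq. (5) at the same point is 1.25.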



Figure 4: Image in Fig. 1 after the restoration.

Fig. 7 shows some details of the original image and of our processed image; both have been binarized by the same (Otsu) operator, and both results have been processed by a common OCR system. It can be seen that the number of mistakes is reduced when the proposed processing is performed. As described in Section 2, our technique works better because the RF combined with the tonal adjustment separates the histogram into two distinct peaks. Fig. 8 shows the original luminance histogram in gray and the histogram produced by our technique in black. The new histogram has two separate peaks, while the gray histogram presents only one evident peak. This marked separation in the histogram allows a more effective thresholding.

5. Conclusions

In this paper an algorithm to restore the visual quality of old books has been proposed. The method preserves and enhances the text information while modifying the background. It has been shown that this method helps to improve the performance of an OCR operator.


Figure 5: (a) original image with a damaged background; (b) Image in (a) after restoration.

Acknowledgements

We wish to thank Digital Codex and Biblion Onlus for providing the pictures used in the experiments. The original documents belong to the Redentoristi Library at S. Maria della Consolazione in Venice, Italy. This work has been partially supported by a grant of the Regione Friuli Venezia Giulia.

Figure 6: (a) Detail of an image with a damaged background; (b) image in (a) after restoration.

Figure 7: OCR results with different images.

Figure 8: Histograms.

References

[1] http://www.ul.cs.cmu.edu
[2] http://www.dli2.nsf.gov
[3] http://www.brickscommunity.org
[4] Z. Kollia, E. Sarantopoulou, A. C. Cefalas, S. Kobe, and Z. Samardzija. Nanometric size control and treatment of historic paper manuscript and prints with laser light at 157 nm. Appl. Phys. A, Mater. Sci. Process., 79:379–382, 2004.
[5] G. Ramponi. The Rational Filter for Image Smoothing. IEEE Signal Processing Letters, 3(3):63–65, March 1996.
[6] M. S. Brown and W. Brent Seales. Image Restoration of Arbitrarily Warped Documents. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(10):1295–1306, October 2004.
[7] T. Pavlidis and S. Mori. Optical Character Recognition. Proc. IEEE, 80(7):1027–1209, July 1992.
[8] www.mathworks.com
[9] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(7):629–639, July 1990.
[10] N. Otsu. A Threshold selection method from gray-scale histogram. IEEE Trans. Syst., Man, Cyber., SMC-8:62–66, 1978.