Text Extraction from Web Images

Changsong Liu (a), Cheng Yang (a), Xiaoqing Ding (a), Jian Fan (b)

(a) State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
(b) Hewlett-Packard Labs (United States)

ABSTRACT

Web images constitute an important part of web documents and have become a powerful medium of expression, especially when they contain text. The text embedded in web images often carries semantic information related to the layout and content of the pages, and statistics show a significant need to detect and recognize it. In this paper, we first give a short review of the methods proposed for text detection and recognition in web images; we then present a framework to extract text from web images, comprising text localization and recognition stages. In the text localization stage, a local thresholding method is applied to generate text candidates, a two-stage strategy is used to select among them, and text regions are then localized with a coarse-to-fine text line extraction algorithm. For text recognition, two text region binarization methods are proposed to improve recognition performance on web images. Experimental results for text localization and recognition demonstrate the effectiveness of these methods. Additionally, a recognition evaluation of text regions in web images has been conducted as a benchmark.

Keywords: web images, text detection, text recognition, text extraction, OCR

1. INTRODUCTION

In recent years, web pages published online have increased explosively and become more and more prevalent in our daily life. Web images constitute an important part of web documents and have been rapidly integrated into web pages as a powerful content delivery medium. They are used for a variety of purposes, such as navigation, decoration, advertisements, logos and informational images [1]. As web page design becomes more sophisticated, much of the text presented on web pages is being embedded into images. Images containing text often carry important textual information related to the layout and content of the page document. The embedded text often corresponds to page headers, titles and URLs, and therefore has potentially high semantic value for web document analysis and understanding.

One common use of this semantic value is in information retrieval and indexing. To our knowledge, existing web image search systems rely primarily on the metadata surrounding images on web pages, such as image names, ALT tags and HTML data, while the text inside the images is not accessible to real-world search engines. However, ignoring the embedded text can be a serious matter, according to several surveys [2, 6, 7]. One study [7] reports that 42% of the images sampled contain text, and that 59% of the images with text contain at least one word that does not appear in the corresponding HTML file. Worse, a significant fraction (56%) of the textual descriptions (ALT tags) on web pages are incomplete, wrong or missing altogether [2]. Clearly, there is a significant need for methods that detect and recognize the text embedded in web images.

Though approaches to text extraction from digital images have been studied extensively for decades, most of the work has focused on text recovery from scanned, real-scene and video images; few techniques that can be applied directly to web images have been reported. The task is considerably challenging for conventional OCR techniques, since web images differ in many ways from video frames, real-scene images and other document images [4]. First, web images are usually of ultra-low resolution (72-100 dpi) and their text is often of small font size (between 6 and 12 pt). Second, web images tend to contain various artifacts and thus have complex color themes for both text and background. Third, web images are usually noiseless and anti-aliased. Such special conditions clearly pose a challenge to traditional text recovery techniques.

Imaging and Printing in a Web 2.0 World II, edited by Qian Lin, Jan P. Allebach, Zhigang Fan, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7879, 78790P · © 2011 SPIE-IS&T · doi: 10.1117/12.880027

Similar to general text information extraction systems applied in other domains, the procedure of text recovery from web images can be divided into two main stages: text detection and recognition. The performance of a text extraction algorithm depends critically on the text detection module, which usually comprises text candidates' generation, text candidates' selection and text line grouping. The recognition stage typically involves pre-processing, character segmentation, feature extraction, pattern classification and post-processing. The general architecture of the text information extraction (TIE) system is shown in Figure 1.

Figure 1. General architecture of the text information extraction (TIE) system: input images pass through text detection (text candidates' generation, text candidates' selection, text line localization) and then text recognition (text region binarization, text segmentation, text recognition) to produce the output text.
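To make the data flow of Figure 1 concrete, the following C++ sketch mirrors the two-stage pipeline; all types and function names here are hypothetical, introduced only for illustration.

```cpp
#include <string>
#include <vector>

struct Image {};                          // an input web image
struct Region {};                         // a localized text region
struct TextLine { std::string text; };    // recognized output text

// Text detection: candidates' generation, candidates' selection,
// and text line localization.
std::vector<Region> detectText(const Image& input) {
    std::vector<Region> regions;
    // ... generate candidates, select them, group them into lines ...
    return regions;
}

// Text recognition: region binarization, segmentation, and recognition.
std::vector<TextLine> recognizeText(const std::vector<Region>& regions) {
    std::vector<TextLine> lines;
    // ... binarize each region, segment characters, classify them ...
    return lines;
}

std::vector<TextLine> extractText(const Image& input) {
    return recognizeText(detectText(input));  // detection feeds recognition
}
```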

In this paper, we first present a short review of the methods proposed for text detection and recognition in web images. Then a TIE framework designed to detect and recognize text in web images is proposed, including approaches for text localization and for text region foreground/background segmentation. Experimental results for text localization and recognition demonstrate the effectiveness of these methods. Additionally, a recognition evaluation of text regions in web images has been conducted as a benchmark.

The rest of this paper is organized as follows. In Section 2, we give a short review of approaches for text detection and recognition in web images. Section 3 describes our proposed method for detecting text in web images. In Section 4, two binarization methods for text foreground and background segmentation are presented. Experimental results for both text localization and recognition are discussed in Section 5. Finally, we offer our conclusions and suggest topics for future work in Section 6.

2. RELATED WORK

Since the late 1990s, several methods focusing on text detection in web images have been proposed, and a few authors have presented text recognition algorithms directed at web image OCR. In this section, we give a short review of the existing methods for text detection and recognition in web images. As in other surveys, the two subtopics are addressed separately.

2.1. Text detection in web images

Text in images can be detected by exploiting discriminative properties of characters such as edge density, color gradient, texture and perceptual uniformity. A wide variety of techniques have been proposed to detect text in video frames, still scene images, book covers and web images; for a comprehensive survey of text detection, see [8]. In general, current text detection methods can be broadly classified into two categories [8]: region-based and texture-based. In region-based methods, pixels exhibiting certain properties, such as constant color or high gradient strength, are grouped together to extract text candidate regions; the resulting candidate components are then verified geometrically using heuristics or machine-learning-based methods. Texture-based methods use the observation that text in images has distinct textural properties that distinguish it from the background, and usually consist of a texture feature extraction stage followed by machine-learning-based classification. For text detection in web images, most of the proposed methods fall into the region-based category. Of these, the majority exploit color processing to extract and verify text candidates, while a few algorithms operate on gray levels.


Among the color processing methods, the main contributions to this specific problem come from Lopresti and Zhou [5, 6]. They present an approach based on color clustering for text extraction, followed by connected component analysis for text detection [5]. In a more recent paper [6], the same authors slightly changed the clustering algorithm to include spatial as well as color information. However, these methods cannot be applied to text with varying colors or to images with gradually changing colors. A. Antonacopoulos et al. propose several important approaches for text segmentation in color web images using color perception and topological features [1, 2, 3, 4]. In [2], the method exploits the characteristics of web images designed for human viewing on a screen by adopting a human perspective of color perception. In a later method [1, 3], a fuzzy propinquity measure expresses the likelihood of merging two components, based on topological relationships and color similarity under a color distance measure. Finally, in [1, 4], the authors propose a novel method that works in a split-and-merge manner, aiming to identify and analyze regions that are perceptually different in color using the Hue-Lightness-Saturation (HLS) color system. A comparison of the two segmentation methods is given in [1]. Taehoon Park et al. [25] present a method to detect text of various orientations and font sizes in web images; it uses a spatial merging algorithm to reduce color in low-resolution web images and extracts characters with a stroke-based method. In [24], Jun Sun et al. propose an efficient text extraction method using color clustering and connected component analysis, with a novel stroke verification step to remove noisy non-character connected components.

Several region-based text detection methods that work on gray levels have also been proposed. Perantonis et al. [35] present a novel web image processing algorithm for text area identification, in which text and inverted-text candidates are segmented in grayscale images using a novel edge extraction technique; a conditional dilation technique, applied over several iterations, then selects text and inverted-text objects among all candidates. In [23], the same authors improve their algorithm by defining a threshold to distinguish the brightness transitions from the text body to its background. Seongah Chin [26] proposes a method for filtering text blocks in web images that works without prior knowledge of text orientation, size or font; it calculates segment widths and their frequency of occurrence to identify a font density function. Jiaying He et al. [27] present a novel method for hybrid Chinese/English text location, segmentation and Chinese character reconstruction in web images, in which morphological operations segment text pixels in the grayscale image and character components are identified via morphological complexity analysis.

For completeness, the few texture-based approaches that have been applied to detect text in web images are also discussed here; note that these methods target text detection in general color images or video frames and have merely been applied to web images. In [28], Chunmei Liu et al. propose an algorithm based on color texture features for detecting text in images and video frames.
Qixiang Ye et al. [29] propose a novel coarse-to-fine algorithm that can locate text lines against complex backgrounds using multi-scale wavelet features.

From the methods described above, we can see that most fall into the region-based category, generate text candidates from color information, and analyze stroke features with heuristic knowledge to select text candidates. Perhaps because more attention has recently been paid to text extraction from scene images and video frames, few new methods for detecting text in web images have been proposed since 2004.

2.2. Text recognition in web images

Once text regions have been localized by a text detection method, they are recognized in the recognition stage. As in traditional text recognition, most effort on recognizing text from images has gone into pre-processing, text/background segmentation, feature extraction, pattern classification and post-processing. Here we review methods proposed for recognizing the low-resolution, poor-quality text found in web images; these methods mainly concern text/background segmentation and recognition.

A few methods proposed for text recognition in web images focus on text/background segmentation, feature extraction and post-processing. Apart from the text detection methods described in Section 2.1, Lopresti and Zhou describe two methods [6, 30] for recognizing text in in-line web images: the first is based on polynomial surface fitting, and the second on n-tuples, which performs better than the surface-fitting method on the same evaluation dataset. Zohra Saidane et al. [10] propose a novel binarization method to segment text from color images with complex backgrounds; it is based on supervised learning and works without making assumptions or requiring tunable parameters. Jun Sun et al. [24] propose a text recognition algorithm for web images.


In their method, the extracted binary text line image is segmented and recognized using a dynamic programming algorithm, and a post-processing stage that uses contextual information from the web page further improves recognition performance.

Besides efforts on pre-processing and post-processing within the recognition procedure, some novel methods that can simultaneously segment and recognize ultra-low-resolution or adjacent words have been presented. Farshideh Einsele et al. [32] propose a competing approach that directly models the specificities of these images: gray level, ultra-low resolution and anti-aliasing. Experiments performed on synthetic data with their HMM system bring promising results and clearly show that HMMs can simultaneously segment and recognize ultra-low-resolution words. The same authors [33] improve their system by using an ergodic topology for the HMMs, in which all character models are connected to each other; the system is evaluated on different font sizes and families and shows good robustness for sizes down to 6 points. In [31], Charles Jacobs et al. present a camera-based OCR system that can recognize text in poor-quality images of text documents, using a convolutional-neural-network-based character recognizer to handle character segmentation and recognition simultaneously on the gray-level image.

In summary, most methods proposed for text recognition in web images treat pre-processing and text region binarization as the main way to improve the performance of conventional OCR engines. Novel classifiers that perform segmentation at the same time as recognition, such as HMMs and convolutional neural networks (CNNs), also give impressive results for small-font, low-resolution character recognition.

2.3. Reported experimental results

Several experimental results on text localization and recognition in web images have been published in the literature; Table 1 gives an overview. It should be noted that direct comparison of these results is not meaningful, since almost all of them were evaluated on different datasets.

Table 1. Overview of the reported experimental results for extraction and recognition of text from web images.

Authors                  Year   OCR type       Images   Detection        Recognition   Method
D. Lopresti [5]          1997   -              262      47%              -             Color clustering and CCA
D. Lopresti [30]         1997   dedicated      50       -                69.7%         Surface fitting
D. Lopresti [30]         1997   dedicated      50       -                89.3%         N-tuple
D. Lopresti [6]          2000   -              482      78.8%            -             Color clustering and CCA
Taehoon Park [25]        1998   -              200      88.8%            -             Color reduction and CCA of strokes
A. Antonacopoulos [3]    2002   -              124      53.10%~73.71%    -             Fuzzy segmentation
A. Antonacopoulos [4]    2007   -              115      69.65%           -             Split-merge
S.J. Perantonis [35]     2003   FineReader 5   650      85.58%           61.58%        Text area identification
S.J. Perantonis [23]     2004   FineReader 5   1100     80.45%           64.07%        Text area location
Jun Sun [24]             2003   dedicated      43       89.5%            70.4%         Color clustering, CCA, post-processing
Jiaying He [27]          2004   -              250      84.5%            -             Hybrid characters location

3. TEXT LOCALIZATION

In this section, we propose a region-based method for localizing text in web images with complex backgrounds. First, we adjust the image resolution according to the image size in a pre-processing stage. In the text detection stage, text candidates are extracted based on the intensity properties of local regions in the gray-level image. For text selection, an analysis of neighboring connected components (CCs) based on heuristic knowledge filters out obvious noise CCs, and a cascade classifier then classifies the remaining candidates using features computed on each CC. Finally, a coarse-to-fine text line extraction algorithm localizes the text regions in the image.

3.1. Image enhancement


As already mentioned, pre-processing is necessary to correct the resolution of web images before text detection. In our experiments we found that more than 80% of web images have a low resolution of 72 to 96 dpi, and fewer than 20% have a resolution above 100 dpi. In other words, compared with the common 300 dpi of scanned documents, most web images have less than one third of the normal resolution, so adjusting the resolution with an effective image enhancement algorithm is essential for good text detection.

Based on the characteristics of text in web images, we choose a classical interpolation algorithm as the resolution enhancement method for its robustness and efficiency. The enhancement is based on bicubic interpolation, which has been widely applied to super-resolution of other low-resolution images such as camera or video frames [11]. Compared with other traditional interpolation or more complex algorithms, the bicubic method achieves a better tradeoff between speed and accuracy. Bicubic interpolation passes a cubic polynomial, rather than a linear function, through the neighboring points; the convolution kernel we adopt here is the Hermite spline with α equal to 0.5. To simplify the enhancement, we simply upscale all source web images before any other processing. For each image, we compute the scaling factor from its width and height; given the web image resolutions described above, the factor ranges from 2x to 4x, and in general the larger the image, the greater the scaling coefficient selected. In our experiments, this resolution correction strategy makes almost all characters in the enhanced web images recognizable. A sketch of the interpolation is given below.
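Here is a minimal C++ sketch of bicubic upscaling with the standard cubic convolution kernel, assuming a grayscale image in a flat array; the paper's "Hermite spline with α equal to 0.5" is taken to mean the usual kernel parameter a = -0.5, which is an assumption on our part.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Standard cubic convolution kernel; a = -0.5 gives the Catmull-Rom case.
static double cubicKernel(double x, double a = -0.5) {
    x = std::fabs(x);
    if (x <= 1.0) return (a + 2.0) * x * x * x - (a + 3.0) * x * x + 1.0;
    if (x <  2.0) return a * (x * x * x - 5.0 * x * x + 8.0 * x - 4.0);
    return 0.0;
}

std::vector<uint8_t> upscaleBicubic(const std::vector<uint8_t>& src,
                                    int w, int h, int factor) {
    const int W = w * factor, H = h * factor;
    std::vector<uint8_t> dst(size_t(W) * H);
    auto at = [&](int x, int y) {                 // clamp-to-edge sampling
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        return double(src[size_t(y) * w + x]);
    };
    for (int Y = 0; Y < H; ++Y) {
        for (int X = 0; X < W; ++X) {
            // Map destination pixel center back into source coordinates.
            double sx = (X + 0.5) / factor - 0.5, sy = (Y + 0.5) / factor - 0.5;
            int ix = int(std::floor(sx)), iy = int(std::floor(sy));
            double acc = 0.0;
            for (int m = -1; m <= 2; ++m)         // 4x4 source neighborhood
                for (int n = -1; n <= 2; ++n)
                    acc += at(ix + n, iy + m) * cubicKernel(sx - (ix + n))
                                              * cubicKernel(sy - (iy + m));
            dst[size_t(Y) * W + X] = uint8_t(std::clamp(acc, 0.0, 255.0));
        }
    }
    return dst;
}
```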
3.2. Text candidates' generation

Similar to the text detection method proposed in [12], we propose a region-based text detection method using an adaptive image decomposition algorithm. First, we convert the enhanced color image to grayscale. The image is then decomposed with an adaptive local thresholding approach in which the threshold at each pixel is computed mainly from a local mean. Finally, text candidates are extracted from the resulting decomposed image layers. The local thresholding method for image decomposition is a simplified, computationally cheaper version of classical binarization methods such as Niblack and Sauvola, and is described by the following formula:
s(x,y) = \begin{cases} 255, & I(x,y) > T_{+}(x,y) \\ 0, & I(x,y) < T_{-}(x,y) \\ 127, & \text{otherwise} \end{cases}, \quad \text{where } T_{\pm}(x,y) = \mu(x,y,W) \pm \delta \qquad (1)

where s(x,y) and I(x,y) are the segmented result and the original intensity value at each pixel of the grayscale image, μ(x,y,W) is the mean intensity of the sub-window of size W centered at pixel (x,y), and δ is a constant positive integer by which each pixel is assigned to its layer. Based on our experiments, we set δ to 8 to obtain a high recall rate. Without prior knowledge, W is computed dynamically from the image size within a given range. Through this segmentation, the original grayscale image is decomposed into three layers: white, black and homogeneous. We extract text candidates only from the white and black layers; the homogeneous layer will be analyzed further, as a few small characters may appear in it. We label the connected components in the white and black layers and generate the text candidates from the resulting connected components.
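A minimal C++ sketch of the three-layer decomposition of Eq. (1) follows, assuming a flat grayscale array; the integral image used for the local mean is our implementation choice, not specified in the paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint8_t> decompose(const std::vector<uint8_t>& img,
                               int width, int height, int W, int delta = 8) {
    // Integral image: sums[(y, x)] = sum of img over the rectangle [0,x) x [0,y),
    // so any window mean can be read off in constant time.
    std::vector<int64_t> sums(size_t(width + 1) * (height + 1), 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            sums[size_t(y + 1) * (width + 1) + (x + 1)] =
                img[size_t(y) * width + x]
                + sums[size_t(y) * (width + 1) + (x + 1)]
                + sums[size_t(y + 1) * (width + 1) + x]
                - sums[size_t(y) * (width + 1) + x];

    std::vector<uint8_t> s(size_t(width) * height);
    const int r = W / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Clip the W x W window centered at (x, y) to the image borders.
            int x0 = std::max(0, x - r), x1 = std::min(width,  x + r + 1);
            int y0 = std::max(0, y - r), y1 = std::min(height, y + r + 1);
            int64_t sum = sums[size_t(y1) * (width + 1) + x1]
                        - sums[size_t(y0) * (width + 1) + x1]
                        - sums[size_t(y1) * (width + 1) + x0]
                        + sums[size_t(y0) * (width + 1) + x0];
            double mu = double(sum) / ((x1 - x0) * (y1 - y0));
            uint8_t v = img[size_t(y) * width + x];
            if      (v > mu + delta) s[size_t(y) * width + x] = 255; // white layer
            else if (v < mu - delta) s[size_t(y) * width + x] = 0;   // black layer
            else                     s[size_t(y) * width + x] = 127; // homogeneous
        }
    }
    return s;
}
```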

An example of text candidates' generation is shown in Figure 2.

Figure 2. Text candidates' generation result: (a) a web image; (b) local binarization result; (c), (d) text candidates in the black and white layers.

3.3. Text candidates’ selection


In this section, we verify all the text candidates extracted in Section 3.2 and filter out the non-text components in both binary layers, using a two-stage strategy. In the first stage, isolated, too-small or too-big CCs are removed according to the positional relationships between each CC and its neighbors, using heuristic rules. In the second stage, we compute several features for each remaining candidate and classify it as text or non-text with a cascade classifier, yielding the detected text components in the two layers. An example of text candidates' selection on the same image is given in Figure 3.

The main idea of the neighboring-component analysis comes from [13]: text characters generally do not appear alone, but together with other characters of similar properties, usually regularly placed in a horizontal string [14]. We use several relatively loose rules to roughly eliminate distinctly false candidates, especially isolated or bar-shaped components. For each text candidate cc_i, we search for a corresponding cc_j that satisfies all of the following rules; if no such cc_j exists, cc_i is filtered out as a false candidate. We consider only the location of each CC, where xs, xe, ys, ye, x and y denote the left, right, top, bottom, width and height of its bounding box:

\frac{x_{\max}}{MaxChars} < x_i < \frac{x_{\max}}{MinChars} \;\text{ and }\; \frac{y_{\max}}{MaxLines} < y_i < \frac{y_{\max}}{MinLines},

\left( \frac{|xe_j - xs_i|}{y_i} < 3 \;\text{ or }\; \frac{|xs_j - xe_i|}{y_i} < 3 \right) \;\text{ and }\; \frac{|ys_j - ys_i|}{y_i} < 2 \;\text{ and }\; \frac{|y_i - y_j|}{y_i} < 2 \qquad (2)
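The following C++ sketch implements the neighborhood part of these rules as reconstructed above; the struct and function names are ours, and the global size bounds (the x_max/MaxChars and y_max/MaxLines constraints) are assumed to have been applied beforehand.

```cpp
#include <cmath>
#include <vector>

struct CC {
    int xs, xe, ys, ye;                       // bounding box edges
    int h() const { return ye - ys; }         // height
};

bool hasTextNeighbor(const CC& ci, const std::vector<CC>& candidates) {
    const double hi = ci.h();
    for (const CC& cj : candidates) {
        if (&cj == &ci) continue;
        // Horizontal proximity: a small gap on either side of cc_i.
        bool closeH = std::fabs(double(cj.xe - ci.xs)) < 3 * hi ||
                      std::fabs(double(cj.xs - ci.xe)) < 3 * hi;
        // Vertical alignment and similar height.
        bool aligned = std::fabs(double(cj.ys - ci.ys)) < 2 * hi &&
                       std::fabs(hi - cj.h()) < 2 * hi;
        if (closeH && aligned) return true;   // a matching cc_j exists
    }
    return false;                             // isolated: filter out cc_i
}
```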

e(x,y) = \begin{cases} \text{foreground}, & I(x,y) \le T(x,y) \\ \text{background}, & I(x,y) > T(x,y) \end{cases} \qquad (15)

According to this rough estimate, we sum the color values of all corresponding foreground and background pixels in the original RGB image O(x,y), obtaining the mean color M_fg of the foreground pixels and M_bg of the background pixels. In the color classification step, instead of a traditional color clustering method, we simply classify pixels as foreground or background according to the color distance between each pixel's value in O(x,y) and the means M_fg and M_bg. The rule for the final binarization by color classification is defined as follows:

B(x,y) = \begin{cases} 0, & D_{bg} < D_{fg} \;\text{ or }\; D_{bg} < T_{bg} \\ 1, & D_{fg} < D_{bg} \;\text{ or }\; D_{fg} < T_{fg} \end{cases} \qquad (16)

where B(x,y) is the final binarized image, and D_fg and D_bg are the Euclidean color distances between the pixel value in the RGB image O(x,y) and M_fg and M_bg, respectively. T_fg and T_bg are the color distance thresholds for classifying pixels as foreground or background; both are set to 20 in our experiments. Some text region binarization examples produced by this approach are shown in Figure 7.

Figure 7. Comparison of the two binarization methods: (a) text region; (b) method in Section 4.1; (c) method in Section 4.2.
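The two steps above condense into a short routine. Below is a minimal C++ sketch, assuming the RGB image is stored as packed 8-bit triples and a rough foreground mask from Eq. (15) is available; classifyColors, roughFg and dist are our names, not from the paper. Since the two cases of Eq. (16) can overlap, the sketch resolves ties by the distance comparison alone.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct RGB { double r, g, b; };

// Euclidean distance between an 8-bit RGB pixel and a mean color.
static double dist(const uint8_t* p, const RGB& m) {
    return std::sqrt((p[0] - m.r) * (p[0] - m.r) +
                     (p[1] - m.g) * (p[1] - m.g) +
                     (p[2] - m.b) * (p[2] - m.b));
}

std::vector<uint8_t> classifyColors(const std::vector<uint8_t>& rgb,
                                    const std::vector<uint8_t>& roughFg,
                                    int nPixels, double Tfg = 20, double Tbg = 20) {
    // Mean colors of the roughly estimated foreground and background pixels.
    RGB mfg{0, 0, 0}, mbg{0, 0, 0};
    int nfg = 0, nbg = 0;
    for (int i = 0; i < nPixels; ++i) {
        const uint8_t* p = &rgb[3 * size_t(i)];
        RGB& m = roughFg[i] ? mfg : mbg;
        (roughFg[i] ? nfg : nbg)++;
        m.r += p[0]; m.g += p[1]; m.b += p[2];
    }
    if (nfg) { mfg.r /= nfg; mfg.g /= nfg; mfg.b /= nfg; }
    if (nbg) { mbg.r /= nbg; mbg.g /= nbg; mbg.b /= nbg; }

    // Final binarization of Eq. (16): each pixel joins the closer mean.
    std::vector<uint8_t> B(nPixels);
    for (int i = 0; i < nPixels; ++i) {
        double dfg = dist(&rgb[3 * size_t(i)], mfg);
        double dbg = dist(&rgb[3 * size_t(i)], mbg);
        B[i] = (dfg < dbg || dfg < Tfg) ? 1 : 0;   // 1 = foreground
    }
    return B;
}
```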

The proposed color classification based method outperforms local binarization based algorithms in segmenting some text regions with complex backgrounds: it reduces noise in foreground regions and misses few strokes. However, it fails where the foreground pixels have inconsistent colors or are part of an illustration, and characters in low-resolution or low-contrast text regions may be binarized with thicker or thinner strokes than they actually have.

5. EXPERIMENTAL RESULTS

In this section, the proposed system is implemented and evaluated to show the efficiency of the proposed text detection and recognition methods for web images. The system was developed in the Microsoft Visual C/C++ environment, and both experiments were conducted on a Core 2 Duo 3.16 GHz desktop computer running Windows XP SP3. Since the criteria and datasets for text detection and recognition differ, we divide the experiments into two phases and describe them separately.

5.1. Evaluation of text localization

In this phase, we evaluate the text localization approach on an image set collected from real-world web pages. The dataset contains 1134 color web images containing text, randomly selected from 67 websites; image sizes range from 25×13 to 1407×1009 pixels, and most of the embedded characters are in English. We labeled the locations of text in the images with bounding-box rectangles; about 3551 ground-truth bounding boxes were tagged manually. Several evaluation schemes for text detection have been proposed [9], with different measures, matching rules and threshold-selection strategies; the criteria presented in the ICDAR 2005 competition [34] have been widely used to evaluate scene text localization, and we adopt this evaluation strategy [34] for text detection in web images. We describe the measures here for completeness. Two measures, precision and recall, are defined based on the area matching ratio:


p = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|} \qquad (17)

and

r = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|} \qquad (18)

where m(r, R) denotes the best match for a rectangle r within a set of rectangles R, and T and E are the sets of ground-truth and estimated rectangles, respectively. We use the f measure to evaluate overall performance by combining precision and recall; their relative weights are controlled by α, which we set to 0.5 to give equal weight to both:

f = \frac{1}{\alpha / p + (1 - \alpha) / r} \qquad (19)
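As an illustration (not code from the paper), the following C++ sketch computes p, r and f for two sets of rectangles. The match function assumes the ICDAR 2005 convention of intersection area over the area of the minimum rectangle containing both; the type and function names are ours.

```cpp
#include <algorithm>
#include <vector>

struct Rect { int x0, y0, x1, y1; };

static double area(const Rect& r) {
    return double(std::max(0, r.x1 - r.x0)) * std::max(0, r.y1 - r.y0);
}

// Match between two rectangles: intersection over minimum containing box.
static double match(const Rect& a, const Rect& b) {
    Rect inter{std::max(a.x0, b.x0), std::max(a.y0, b.y0),
               std::min(a.x1, b.x1), std::min(a.y1, b.y1)};
    Rect hull {std::min(a.x0, b.x0), std::min(a.y0, b.y0),
               std::max(a.x1, b.x1), std::max(a.y1, b.y1)};
    return area(hull) > 0 ? area(inter) / area(hull) : 0.0;
}

// m(r, R): best match for rectangle r within the set R.
static double bestMatch(const Rect& r, const std::vector<Rect>& R) {
    double best = 0.0;
    for (const Rect& r2 : R) best = std::max(best, match(r, r2));
    return best;
}

// Eqs. (17)-(19) over estimated rectangles E and ground-truth rectangles T.
void evaluate(const std::vector<Rect>& E, const std::vector<Rect>& T,
              double& p, double& r, double& f, double alpha = 0.5) {
    double sp = 0.0, sr = 0.0;
    for (const Rect& re : E) sp += bestMatch(re, T);
    for (const Rect& rt : T) sr += bestMatch(rt, E);
    p = E.empty() ? 0.0 : sp / E.size();
    r = T.empty() ? 0.0 : sr / T.size();
    f = (p > 0 && r > 0) ? 1.0 / (alpha / p + (1.0 - alpha) / r) : 0.0;
}
```

Plugging in the values reported below (p = 0.74, r = 0.80, α = 0.5) gives f = 1/(0.5/0.74 + 0.5/0.80) ≈ 0.77, which matches Table 2.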

The overall performance of our method is shown in Table 2, and some text localization results are illustrated in Figure 8.

Table 2. Text line localization results.

System            Precision   Recall   f      t(s)
Proposed method   0.74        0.80     0.77   0.067

Figure 8. Some representative images with the corresponding results.

5.2. Evaluation of text recognition

In this phase, the performance of text recognition in web images is evaluated. Several well-known OCR engines, together with our own recognition engine, are evaluated on a set of text regions for comparison; the result also provides a benchmark. Additionally, to evaluate the text region binarization methods proposed in Section 4, we compare the recognition results produced by our traditional OCR engine with those of the same engine improved by integrating the proposed text extraction methods.

We collected a dataset of text regions cropped manually from 155 real-world web images. It consists of 348 color text regions, almost all at word level, with only a few at sentence level. Of these, 311 regions have a resolution of 96 dpi and 37 of 72 dpi; region sizes range from 17×11 to 249×176 pixels, and text heights vary from 9 to 176 pixels. About 3070 characters are embedded in these regions, most of them English, with between 2 and 45 characters per region (about 9 on average). Some sample text regions are shown in Figure 9.

Figure 9. Some text region examples in the dataset.

To provide a benchmark, several well-known OCR engines were evaluated on this dataset. Of these, ABBYY FineReader 10 and Nuance OmniPage Pro 17 are famous commercial OCR engines, Tesseract OCR V3.0 is an open-source engine, and THOCR 2007 is commercial software developed by our lab.


Specifically, FineReader 10 [19] is professional OCR software aimed at recognizing scanned documents, PDFs and digital camera images; OmniPage Pro 17 [20] can convert a variety of documents into editable text and PDFs with high accuracy; Tesseract OCR V3.0 [21] is probably one of the most accurate open-source OCR engines available and was among the top three engines in the 1995 UNLV accuracy test; and THOCR 2007 is well-known Asian multi-language recognition software for scanned documents, developed mainly by our lab at Tsinghua University. To assess the performance of the different OCR engines, we use the character recognition rate (CRR) and character precision rate (CPR) defined in [22]. The two metrics are computed against the ground truth as:

CRR = \frac{N_r}{N} \qquad (20)

and

CPR = \frac{N_r}{N_e} \qquad (21)

where N is the true total number of characters, N_r is the number of correctly recognized characters and N_e is the total number of extracted characters. In addition, we compute the word recognition rate (WRR) to gauge the coherence of character recognition across the different engines. Note that text regions were upscaled by 2x using linear interpolation before being fed into the OmniPage, Tesseract OCR and THOCR engines, since none of these can correct the resolution of text regions by itself, whereas FineReader can adjust the resolution of input web images. All color images were fed directly to these engines without any other pre-processing such as binarization or segmentation. The recognition results are listed in Table 3.

Table 3. Recognition results on the dataset (Ext.: the number of extracted characters).

OCR engine                 Ext.   CRR(%)   CPR(%)   WRR(%)
Nuance OmniPage Pro 17     2234   62.66    80.11    47.99
Tesseract OCR V3.0         3024   82.91    84.17    57.47
THOCR 2007                 2939   86.48    90.33    64.66
ABBYY FineReader 10        3056   96.38    96.82    90.23
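As a quick consistency check of Eqs. (20) and (21) against these figures (our arithmetic, not reported in the paper): for THOCR 2007, CRR = N_r / N = 86.48% with N = 3070 gives N_r ≈ 2655 correctly recognized characters, and CPR = N_r / N_e ≈ 2655 / 2939 ≈ 90.3%, consistent with the 90.33% listed.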

Table 3 shows that FineReader 10 outperforms the other three OCR engines. The main reason the others cannot achieve higher recognition rates is that none of them is designed to work with low-resolution images; all are intended mainly for scanned document recognition. In contrast, FineReader 10 can recognize photos of documents captured with digital cameras and mobile phones using the third generation of ABBYY's Camera OCR technology. Moreover, FineReader adjusts the resolution of digital photos automatically before recognition and provides a wide range of image pre-processing functions and tools, such as blurred image correction, automated resolution detection and correction, and automatic correction of 3D perspective distortion [19].

It is worth noting that most text regions in the dataset have characteristics significantly different from traditional documents, such as blurred or touching characters on complex backgrounds. Proper text extraction methods are therefore necessary to segment characters from the background before feeding them into traditional recognition engines. To demonstrate the efficiency of the text binarization algorithms proposed in Section 4, we converted the color text images in the dataset into binary ones using the proposed text extraction methods and fed the binarized images into the OCR engine for evaluation. In this experiment, two outputs were generated by integrating the two different text extraction methods, and the better one was selected as the final recognition result for each text region. The evaluation results are listed in Table 4.

Table 4. Comparison of recognition results with and without the proposed text extraction methods.

THOCR 2007               Ext.   CRR(%)   CPR(%)   WRR(%)
Traditional method       2939   86.48    90.33    64.66
With proposed methods    3044   91.60    92.38    75.14


From Table 4, we can see an improvement of more than 5% in CRR from adopting the proposed text extraction methods. Though the role of the text extraction method in this scheme is relatively limited, it is very effective at enhancing the performance of a general OCR engine on text recognition from web images.

6. CONCLUSION AND FUTURE WORK

6.1. Conclusion

In this paper, we present an effective method for character detection and text region binarization in web images. A short review of the approaches proposed for text detection and recognition in web images is given for completeness. In our proposed text detection method, a local thresholding algorithm generates text candidates in the grayscale image; neighboring CCs are then analyzed to roughly filter out noise candidates, and a cascade classifier selects text candidates from the remaining CCs. A projection-profile-based text line segmentation algorithm localizes the text regions in the image. In the text recognition stage, two text region binarization algorithms are presented to improve the performance of OCR on web images. Experimental results for both text localization and recognition show the effectiveness of the proposed methods. Additionally, an evaluation of text region recognition in web images using several existing OCR engines has been conducted to provide a benchmark.

6.2. Future work

Though the evaluation results for text information extraction from web images are promising, much work remains, and both the text localization and the recognition stages can be improved in several ways. For text localization, we will adopt a multi-resolution approach to extract characters that are too small or too large, and robust machine-learning methods such as SVMs can help remove non-text CCs. For text recognition, we will focus on novel pre-processing and text region segmentation methods to extract text from images with complex backgrounds. Moreover, the web image set used in our experiments will later be made available for public text localization evaluation.

REFERENCES

[1] D. Karatzas, "Text Segmentation in Web Images Using Colour Perception and Topological Features", PhD thesis, University of Liverpool, UK, 2002.
[2] A. Antonacopoulos and D. Karatzas, "An Anthropocentric Approach to Text Extraction from WWW Images", DAS 2000, pp. 515-526, Rio de Janeiro, Brazil, 2000.
[3] A. Antonacopoulos and D. Karatzas, "Fuzzy Segmentation of Characters in Web Images Based on Human Colour Perception", DAS 2002, pp. 295-306, 2002.
[4] D. Karatzas and A. Antonacopoulos, "Colour Text Segmentation in Web Images Based on Human Perception", Image and Vision Computing, 25 (2007).
[5] D. Lopresti and J. Zhou, "Extracting Text from WWW Images", ICDAR 1997, pp. 248-252, 1997.
[6] D. Lopresti and J. Zhou, "Locating and Recognizing Text in WWW Images", Information Retrieval 2, 177-206 (2000).
[7] T. Kanungo and C. H. Lee, "What Fraction of Images on the Web Contain Text?", IBM Almaden Research Center, 2001.
[8] K. Jung, K. I. Kim and A. K. Jain, "Text Information Extraction in Images and Video: A Survey", Pattern Recognition, vol. 37, no. 5, May 2004.
[9] Y.-F. Pan et al., "Text Localization in Natural Scene Images Based on Conditional Random Field", ICDAR 2009.
[10] Z. Saidane and C. Garcia, "Robust Binarization for Video Text Recognition", ICDAR 2007.
[11] http://en.wikipedia.org/wiki/Bicubic_interpolation
[12] R. Jiang et al., "A Learning-Based Method to Detect and Segment Text from Scene Images", Journal of Zhejiang University Science A, 8(4):568-574, 2007.
[13] B. Gatos, I. Pratikakis, K. Kepene and S. Perantonis, "Text Detection in Indoor/Outdoor Scene Images", CBDAR 2005.
[14] N. Ezaki, M. Bulacu and L. Schomaker, "Text Detection from Natural Scene Images: Towards a System for Visually Impaired Persons", ICPR 2004, pp. 683-686.
[15] R. P. dos Santos, "Text Line Segmentation Based on Morphology and Histogram Projection", ICDAR 2009.


[16] J.-C. Wu, J.-W. Hsieh and Y.-S. Chen, "Morphology-Based Text Line Extraction", Machine Vision and Applications, 19:195-207, 2008.
[17] C. Thillou and B. Gosselin, "Color Binarization for Complex Camera-Based Images", Proc. Electronic Imaging Conf. of the Int. Soc. for Optical Imaging, pp. 301-308, 2005.
[18] J. Sauvola and M. Pietikainen, "Adaptive Document Image Binarization", Pattern Recognition 33 (2000).
[19] http://finereader.abbyy.com/professional
[20] http://www.nuance.com/for-business/by-product/omnipage/professional/index.htm
[21] http://code.google.com/p/tesseract-ocr/
[22] D. Chen, J. M. Odobez and H. Bourlard, "Text Detection and Recognition in Images and Video Frames", Pattern Recognition, vol. 37, no. 3, pp. 595-608, 2004.
[23] S. J. Perantonis et al., "Text Area Identification in Web Images", Lecture Notes in Artificial Intelligence, Springer, pp. 82-92, Samos, Greece, 2004.
[24] J. Sun, H. Yu and Y. Katsuyama, "Effective Text Extraction and Recognition for WWW Images", Proceedings of the ACM Symposium on Document Engineering, pp. 115-117, 2003.
[25] T. Park, D. Kim and K. Chung, "Orientation and Scale Invariant Text Region Extraction in WWW Images", IAPR Workshop on Machine Vision Applications, Makuhari, Chiba, Japan, Nov. 17-19, 1998.
[26] S. Chin, "Filtering of Text Blocks in Web Images", IDEAL 2003, LNCS 2690, pp. 1037-1041, 2003.
[27] J. He and S. Li, "Hybrid Chinese/English Text Identification in Web Images", ICIG 2004.
[28] C. Liu, C. Wang and R. Dai, "Text Detection in Images Based on Color Texture Features", ICIC 2005, Part I, LNCS 3644, pp. 40-48.
[29] Q. Ye, Q. Huang et al., "Fast and Robust Text Detection in Images and Video Frames", Image and Vision Computing 23 (2005) 565-576.
[30] J. Zhou and D. Lopresti, "OCR for World Wide Web Images", Proceedings of the IS&T/SPIE International Symposium on Electronic Imaging, pp. 58-65, San Jose, California, February 1997.
[31] C. Jacobs, P. Y. Simard, P. Viola and J. Rinker, "Text Recognition of Low-Resolution Document Images", ICDAR 2005.
[32] F. Einsele et al., "A HMM-Based Approach to Recognize Ultra Low Resolution Anti-Aliased Words", PReMI 2007, LNCS 4815, pp. 511-518, 2007.
[33] F. Einsele and R. Ingold, "A Language-Independent, Open-Vocabulary System Based on HMMs for Recognition of Ultra Low Resolution Words", Journal of Universal Computer Science, vol. 14, no. 18, pp. 2982-2997, 2008.
[34] S. M. Lucas, "ICDAR 2005 Text Locating Competition Results", ICDAR 2005.
[35] S. J. Perantonis, B. Gatos and V. Maragos, "A Novel Web Image Processing Algorithm for Text Area Identification that Helps Commercial OCR Engines to Improve Their Web Recognition Accuracy", WDA 2003, Edinburgh, UK, 2003.
[36] C. Wolf, J.-M. Jolion and F. Chassaing, "Text Localization, Enhancement and Binarization in Multimedia Documents", ICPR 2002, vol. 2, pp. 1037-1040.
