Integrating Visual and Textual Cues for Image Classification

Theo Gevers, Frank Aldershoff, Jan-Mark Geusebroek

ISIS, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{gevers, mark}@wins.uva.nl

Abstract. In this paper, we study computational models and techniques to merge textual and image features to classify images on the World Wide Web (WWW). A vector-based framework is used to index images on the basis of textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional image descriptor in vector form to be used as an index. Experiments are conducted on a representative set of more than 100,000 images downloaded from the WWW together with their associated text. Performance evaluations are reported on the accuracy of merging textual and pictorial information for image classification.

1 Introduction

Today, a number of systems are available for retrieving images from the World Wide Web on the basis of textual or pictorial information, for example [3], [4]. New research is directed towards unifying textual and pictorial information to retrieve images from the World Wide Web [1], [3]. Most of these content-based search systems follow the so-called query-by-example paradigm, and significant results have been achieved. A drawback, however, is that the low-level image features used for image retrieval are often too restricted to describe images on a conceptual or semantic level. This semantic gap is a well-known problem in content-based image retrieval by query by example. To enhance the performance of content-based retrieval systems (e.g. by pruning the number of candidate images), image classification has therefore been proposed to group images into semantically meaningful classes [5], [6], [7]. The advantage of these classification schemes is that simple, low-level image features can be used to express semantically meaningful classes. Image classification can be based on unsupervised learning techniques such as clustering, Self-Organizing Maps (SOM) [7] and Markov models [6]. Further, supervised grouping can be applied. For example, vacation images have been classified into city vs. landscape with supervised learning in a Bayesian framework [5]. However, these classification schemes are based entirely on pictorial information. Aside from image


retrieval ([1], [3]), very little attention has been paid to using both textual and pictorial information for classifying images on the Web. This is even more surprising if one realizes that images on Web pages are usually surrounded by text and discriminative HTML tags such as IMG, and the HTML fields SRC and ALT. Hence, WWW images carry intrinsic annotation information induced by the HTML structure. Consequently, the set of images on the Web can be seen as an annotated image set. The challenge is to arrive at a framework that classifies images on the Web into semantically meaningful groups by means of composite pictorial and textual (annotation) information, and to evaluate its expected added value, i.e. whether the use of composite information increases the classification rate compared to classification based on visual information alone. Therefore, in this paper, we study computational models and techniques to combine textual and image features to classify images on the Web. The goal is to classify WWW images into photographic images (e.g. real-world pictures) and synthetic images (e.g. icons, buttons and banners). Classifying images into photographs and artwork is a realistic application scenario: roughly 80% of the images on the Internet are synthetic and 20% are photographs, providing a significant class division. We then aim at classifying photographic images into portrait vs. non-portrait (i.e. whether or not the image contains a substantial face). Portraits are among the most popular search scenarios. To achieve classification, a vector-based framework is presented to index images on textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional descriptor in vector form to be used as an index.

2 Approach

The Web crawler downloaded over 100,000 images from the World Wide Web together with their associated textual descriptions, appearing near the images in the HTML documents in the form of HTML tags (IMG) and fields (SRC and ALT). After parsing the text associated with an image, significant keywords are identified and captured in vector form. Salient image features (e.g. color and texture) are likewise computed from the images in vector form. The textual and visual vectors are then combined into a unified multidimensional vector descriptor to be used as an index. The classification of an image is computed by comparing its composite feature vector to the feature vectors of a set of already classified images. To be precise, let an image I be represented by a feature vector of the form $I = (f_0, w_{I0}; f_1, w_{I1}; \ldots; f_t, w_{It})$ and a typical query Q by $Q = (f_0, w_{Q0}; f_1, w_{Q1}; \ldots; f_t, w_{Qt})$, where $w_{Ik}$ (or $w_{Qk}$) represents the weight of feature $f_k$ in image I (or query Q), and t image/text features are used for classification. The weights are assumed to be between 0 and 1. A feature can be seen as an image characteristic or an HTML keyword. Then, a weighting scheme is used to emphasize features having high feature frequencies but low overall collection frequencies.


In this way, image features and key-words are transformed into numbers denoting the feature frequency (ff) times the inverse document frequency:

$$ w_i = \mathrm{ff}_i \log\left(\frac{N}{n}\right) \qquad (1) $$

where n is the number of documents/images containing the feature out of a total of N documents/images. Depending on the content of the vectors, we will use different distance functions (e.g. histogram intersection for visual information and cosine distance for composite information). For general classification, the Minkowski distance is taken. To make the Minkowski distances comparable, normalization is obtained by dividing the vector components by their standard deviation:

$$ D_M(x, y) = \sqrt[\rho]{\sum_{i=1}^{n} \frac{|x_i - y_i|^{\rho}}{\sigma_i}} \qquad (2) $$

equalizing the lengths of the image vectors, where $\sigma_i^2 = \mathrm{Var}[x_i]$, i.e. $\sigma_i$ is the standard deviation of the corresponding vector component. The classification scheme used in this paper is based on the nearest neighbor classifier. In this scheme, the classification of an unknown image is computed by comparing its feature vector to the feature vectors of a set of already classified images. This is done by finding the nearest neighbors of the unknown image in the feature vector space. The class of the unknown image is then derived from the classes of its nearest neighbors.
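To make the indexing and matching concrete, the following is a minimal Python/NumPy sketch of the weighting of eq. (1), the normalized Minkowski distance of eq. (2), and the nearest-neighbor vote. The vocabulary handling, the defaults ρ = 2 and k = 8, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def tfidf_weights(feature_freqs, doc_freqs, n_docs):
    """Eq. (1): w_i = ff_i * log(N / n_i), computed per feature."""
    return {f: ff * np.log(n_docs / doc_freqs[f])
            for f, ff in feature_freqs.items()
            if doc_freqs.get(f, 0) > 0}

def to_vector(weights, vocabulary):
    """Lay the weighted features out on a fixed vocabulary axis."""
    return np.array([weights.get(f, 0.0) for f in vocabulary])

def minkowski(x, y, sigma, rho=2):
    """Eq. (2): Minkowski distance with per-component normalization by sigma_i."""
    return float(np.sum(np.abs(x - y) ** rho / sigma) ** (1.0 / rho))

def knn_classify(query_vec, train_vecs, train_labels, sigma, k=8, rho=2):
    """Nearest-neighbor classification: majority vote among the k closest images."""
    dists = [minkowski(query_vec, v, sigma, rho) for v in train_vecs]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Here `train_vecs` would hold the already classified (indexed) images, and `sigma` the per-component standard deviation of those vectors, e.g. `np.std(np.stack(train_vecs), axis=0)`, so that each term in eq. (2) is divided by the corresponding σ_i.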

3 Textual Information

The HTML structure facilitates the extraction of information associated with an image. Firstly, an image is included on a Web page by its IMG tag. This IMG tag contains the file name of the image to be displayed. In addition, the IMG tag contains fields such as SRC and ALT. Consequently, by using this tag and its fields, it is possible to identify textual information associated with an image. Note that text appearing before and after a particular image is not excerpted; from an informal survey of Web pages we observed that, in general, these text blocks contain a large variety of words that are not very specific to the nearby images. A more elaborate survey on valuable HTML tags can be found in [3]. In order to yield a robust, consistent and non-redundant set of discriminative words, HTML documents are parsed for each class during the training stage as follows:
– Class labeling: The class of WWW images is given manually, such as photo, art, or portrait.
– Text parsing: Text fragments appearing in the IMG tag and the SRC and ALT fields are excerpted.


– Eliminating redundant words: A stop list is used to eliminate redundant words. For example, to discriminate photographic and synthetic images, words such as image and gif, which appear in roughly equal numbers in both classes, are eliminated.
– Stemming: Suffix removal methods are applied to the remaining words to reduce each word to its stem form.
– Stem merging: Multiple occurrences of a stem form are merged into a single text term.
– Word reduction: Words with very low frequency are eliminated.
In this way, a highly representative set of key-words is computed for each class during the supervised training stage (a sketch of this parsing pipeline is given below). Then, weights are assigned to the key-words and the result is represented in multidimensional vector space as discussed in Section 2.
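A minimal sketch of this parsing stage in Python is given below. The stop list, the naive suffix stripper (a stand-in for a real stemmer such as Porter's), and the frequency cutoff are illustrative assumptions, not the authors' settings.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"image", "gif", "jpg", "the", "and"}   # illustrative stop list

class ImgTextExtractor(HTMLParser):
    """Collect the SRC and ALT fields of every IMG tag; surrounding text is ignored."""
    def __init__(self):
        super().__init__()
        self.fragments = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name in ("src", "alt") and value:
                    self.fragments.append(value)

def naive_stem(word):
    """Crude suffix removal as a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords_for_class(html_docs, min_freq=2):
    """Parse a class's training documents into a stem-merged keyword frequency table."""
    counts = Counter()
    for doc in html_docs:
        parser = ImgTextExtractor()
        parser.feed(doc)
        words = re.findall(r"[a-z]+", " ".join(parser.fragments).lower())
        counts.update(naive_stem(w) for w in words if w not in STOP_WORDS)
    return {stem: c for stem, c in counts.items() if c >= min_freq}
```

The resulting per-class keyword frequencies would then be weighted with eq. (1) and placed in the vector space of Section 2.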

4 Pictorial Information

In this section, visual cues are proposed to allow the classification of images into the following groups: photographic vs. synthetic images in Section 4.1, and portrait vs. non-portrait images in Section 4.2.

4.1 Photographic vs. Art

Classification of WWW images into photographic and synthetic (artwork) images has been addressed before, for example by [2]. To capture the differences between the two groups, various human observers were asked to visually inspect images and classify them into photographs and artwork. The following observations were made. Because artwork is manually created, it tends to have a limited number of colors. In contrast, photographs usually contain many different shades of color. Further, artwork is usually designed to convey information, as with buttons. Hence, the limited set of colors in artwork is often very bright to attract the attention of the user, whereas photographs contain mostly dull colors. Further, edges in photographs are usually soft and subtle due to natural light variations and shading; in artwork, edges are usually abrupt and artificial. Based on these observations, the following image features have been selected to distinguish photographic images from artwork:
Color variation: The number of distinct hue values in an image relative to the total number of hues. Hue values are computed by converting the RGB image into HSI color space, from which H is extracted. Synthetic images tend to have fewer distinct hues than photographs.
Color saturation: The accumulation of the saturation of colors in an image relative to the total number of pixels. S from the HSI model is used to express saturation. Colors in synthetic images are likely to be more saturated.
Color transition strength: The pronouncement of hue edges in an image. Color transition strength is computed by applying the Canny edge detector to the H color component. Synthetic images tend to have more abrupt color transitions than photographs.


To express these observations in a more mathematical way, we need to be precise about the definitions of intensity I, RGB, normalized color rgb, saturation S, and hue H. Let R, G and B, obtained by a color camera, represent the 3-D sensor space:

$$ C = \int_{\lambda} p(\lambda) f_C(\lambda) \, d\lambda \qquad (3) $$

for $C \in \{R, G, B\}$, where $p(\lambda)$ is the radiance spectrum and $f_C(\lambda)$ are the three color filter transmission functions. To represent the RGB sensor space, a cube can be defined on the R, G, and B axes. White is produced when all three primary colors are at M, where M is the maximum light intensity, say M = 255. The main diagonal axis connecting the black and white corners defines the intensity:

$$ I(R, G, B) = R + G + B \qquad (4) $$

All points in a plane perpendicular to the grey axis of the color cube have the same intensity. The plane through the color cube at points R = G = B = M is one such plane. This plane cuts out an equilateral triangle, which is the standard rgb chromaticity triangle:

$$ r(R, G, B) = \frac{R}{R + G + B} \qquad (5) $$
$$ g(R, G, B) = \frac{G}{R + G + B} \qquad (6) $$
$$ b(R, G, B) = \frac{B}{R + G + B} \qquad (7) $$

The transformation from RGB used here to describe the color impression hue H is given by:

$$ H(R, G, B) = \arctan\left(\frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)}\right) \qquad (8) $$

and saturation S, measuring the relative white content of a color as having a particular hue, by:

$$ S(R, G, B) = 1 - \frac{\min(R, G, B)}{R + G + B} \qquad (9) $$

In this way, all color features can be calculated from the original R, G, B values corresponding to the red, green, and blue images provided by the color camera.
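As a concrete reference, the conversions of eqs. (4)-(9) can be written in a few lines of NumPy. The epsilon guard and the use of arctan2 (followed by wrapping into [0, 2π)) are implementation choices of this sketch, not prescribed by the paper.

```python
import numpy as np

def color_models(rgb):
    """Compute I, (r, g, b), hue H and saturation S per eqs. (4)-(9).

    `rgb` is a (height, width, 3) float array of R, G, B values.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = R + G + B + 1e-12                 # small epsilon guards divisions by zero
    I = R + G + B                             # eq. (4)
    r, g, b = R / total, G / total, B / total # eqs. (5)-(7)
    H = np.arctan2(np.sqrt(3.0) * (G - B), (R - G) + (R - B))  # eq. (8)
    H = np.mod(H, 2.0 * np.pi)                # wrap hue into [0, 2*pi) as used in Section 4.1
    S = 1.0 - np.minimum(np.minimum(R, G), B) / total           # eq. (9)
    return I, (r, g, b), H, S
```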

Color Variation The number of distinct hue values in an image relative to the total number of hues. Synthetic images tend to have fewer distinct hues than photographs. More precisely, a hue histogram $H_H(i)$ is constructed by counting the number of times a hue value $H(R_x, G_x, B_x)$ is present in an image I:


$$ H_H(i) = \frac{\eta(H(R_x, G_x, B_x) = i)}{N} \quad \forall x \in I \qquad (10) $$

where $\eta$ indicates the number of times $H(R_x, G_x, B_x)$, defined by eq. (8), equals the value of index i, and N is the total number of image locations. Then the relative number of distinct hue values in an image is given by:

$$ f_1 = \frac{1}{B}\, \eta(H_H(i) > t_H) \qquad (11) $$

where B is the total number of bins and $t_H$ is a threshold based on the noise level in the hue image, used to suppress marginally visible hue regions.
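A minimal sketch of this feature, following eqs. (10)-(11); the bin count B and the noise threshold t_H below are assumed values, not settings taken from the paper.

```python
import numpy as np

def color_variation(H, n_bins=64, t_H=0.01):
    """Eqs. (10)-(11): fraction of hue bins populated above the noise level.

    `H` is a hue image with values in [0, 2*pi); `n_bins` (B) and `t_H`
    are illustrative settings.
    """
    hist, _ = np.histogram(H.ravel(), bins=n_bins, range=(0.0, 2.0 * np.pi))
    hist = hist / H.size                                   # normalized histogram H_H(i)
    return float(np.count_nonzero(hist > t_H) / n_bins)    # f1
```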

Color Saturation The accumulation of the saturation of colors in an image relative to the maximum saturation. Colors in synthetic images are likely to be more saturated. Let the histogram $H_S(i)$ be constructed by counting the number of times a saturation value $S(R_x, G_x, B_x)$ is present in an image I:

$$ H_S(i) = \frac{\eta(S(R_x, G_x, B_x) = i)}{N} \quad \forall x \in I \qquad (12) $$

where $\eta$ indicates the number of times $S(R_x, G_x, B_x)$, defined by eq. (9), equals the value of index i. Then the relative accumulation of the saturation of colors in an image is given by:

$$ f_2 = \frac{\sum_{i=1}^{B} i \, H_S(i)}{B} \qquad (13) $$

where B is the total number of bins.
Color Transition Strength The pronouncement of hue edges in an image. Color transition strength is computed by applying the edge detector on the H color component. Synthetic images tend to have more abrupt color transitions than photographs. Due to the circular nature of hue, the standard difference operator is not suited for computing the difference between hue values. The difference between two hue values $h_1$ and $h_2$, ranging over $[0, 2\pi)$, is defined as follows:

$$ d(h_1, h_2) = |h_1 - h_2| \bmod \pi \qquad (14) $$

yielding a difference $d(h_1, h_2) \in [0, \pi]$ between $h_1$ and $h_2$. To find hue edges in images we use an edge detector of the Sobel type, where the component of the positive gradient vector in the x-direction is defined as follows:

$$ H_x(\mathbf{x}) = \frac{1}{4}\big( d(H(x-1, y-1), H(x+1, y-1)) + 2\, d(H(x-1, y), H(x+1, y)) + d(H(x-1, y+1), H(x+1, y+1)) \big) \qquad (15) $$

and in the y-direction as:

$$ H_y(\mathbf{x}) = \frac{1}{4}\big( d(H(x-1, y-1), H(x-1, y+1)) + 2\, d(H(x, y-1), H(x, y+1)) + d(H(x+1, y-1), H(x+1, y+1)) \big) \qquad (16) $$

The gradient magnitude is represented by:

$$ \|\nabla H(\mathbf{x})\| = \sqrt{H_x^2(\mathbf{x}) + H_y^2(\mathbf{x})} \qquad (17) $$

After computing the gradient magnitude, non-maximum suppression is applied to $\|\nabla H(\mathbf{x})\|$ to obtain local maxima in the gradient values:

$$ \mathcal{M}(\mathbf{x}) = \begin{cases} \|\nabla H(\mathbf{x})\|, & \text{if } \|\nabla H(\mathbf{x})\| > t_\sigma \text{ and } \mathbf{x} \text{ is a local maximum} \\ 0, & \text{otherwise} \end{cases} \qquad (18) $$

where $t_\sigma$ is a threshold based on the noise level in the hue image, used to suppress marginally visible edges. Then the pronouncement of hue edges in an image is given by:

$$ f_3 = \frac{1}{M} \sum_{i=1}^{M} \mathcal{M}(\mathbf{x}_i) \qquad (19) $$

where M is the number of local hue edge maxima.
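The remaining two features can be sketched in the same style. The code below follows eqs. (12)-(19); the bin count, the threshold t_sigma, and the simplified non-maximum suppression (a pixel is kept when it exceeds all eight neighbours, rather than suppression along the gradient direction as in a Canny-style detector) are assumptions of this sketch.

```python
import numpy as np

def color_saturation(S, n_bins=64):
    """Eqs. (12)-(13): accumulated saturation relative to the number of bins B."""
    hist, _ = np.histogram(S.ravel(), bins=n_bins, range=(0.0, 1.0))
    hist = hist / S.size                               # H_S(i)
    i = np.arange(1, n_bins + 1)
    return float(np.sum(i * hist) / n_bins)            # f2

def hue_diff(h1, h2):
    """Eq. (14): circular hue difference, |h1 - h2| mod pi."""
    return np.mod(np.abs(h1 - h2), np.pi)

def color_transition_strength(H, t_sigma=0.1):
    """Eqs. (15)-(19): Sobel-type hue gradient, simplified non-maximum
    suppression, and the mean strength of the surviving maxima (f3)."""
    d = hue_diff
    # Sobel-type components on the image interior (border pixels are skipped).
    Hx = 0.25 * (d(H[:-2, :-2], H[:-2, 2:])
                 + 2.0 * d(H[1:-1, :-2], H[1:-1, 2:])
                 + d(H[2:, :-2], H[2:, 2:]))
    Hy = 0.25 * (d(H[:-2, :-2], H[2:, :-2])
                 + 2.0 * d(H[:-2, 1:-1], H[2:, 1:-1])
                 + d(H[:-2, 2:], H[2:, 2:]))
    grad = np.sqrt(Hx ** 2 + Hy ** 2)                  # eq. (17)
    # Simplified non-maximum suppression: keep a pixel if it exceeds the
    # noise threshold and all of its 8 neighbours.
    padded = np.pad(grad, 1, mode="constant")
    neighbours = np.stack([padded[dy:dy + grad.shape[0], dx:dx + grad.shape[1]]
                           for dy in range(3) for dx in range(3)
                           if not (dy == 1 and dx == 1)])
    is_max = (grad > t_sigma) & (grad >= neighbours.max(axis=0))
    maxima = grad[is_max]
    return float(maxima.mean()) if maxima.size else 0.0   # eq. (19), f3
```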

4.2 Portrait vs. Non-portrait

In the previous section, images were classified into photographs and artwork. In this section, photographs are further classified into portrait and non-portrait types. We use the observation that portrait images are substantially occupied by skin colors (color constraints) stemming from faces (shape constraints). To that end, a two-step filter is used to detect portraits. Firstly, all pixels within the skin-tone color range are determined, resulting in blobs. Then, these blobs are tested with respect to shape and size requirements. It is known that when a face is folded (i.e. has varying surface orientation), a broad variance of RGB values is generated due to shading. In contrast, the normalized color space rgb and $c_1 c_2 c_3$ are insensitive to face folding and to changes in surface orientation, illumination direction and illumination intensity [2]. To further reduce illumination effects, color ratios are taken, based on the observation that a change in illumination color affects $c_1$, $c_2$ and $c_2$, $c_3$ proportionally. Hence, by computing the color ratios $c_2/c_1$ and $c_3/c_2$, and r/g and g/b, the disturbing influence of a change in illumination color is suppressed to a large degree. A drawback is that $c_1 c_2 c_3$ and rgb become unstable when intensity is very low, near 5% of the total intensity range [2]. Therefore, pixels with low intensity are removed from the set of skin-tone pixels.


Fig. 1. a. Skin (bright) and non-skin (dark) pixels plotted in the color ratio space $c_2/c_1$ and $c_3/c_2$. b. Skin (bright) and non-skin (dark) pixels plotted in the color ratio space r/g and g/b. The color ratios achieve robust isolation of the skin-tone pixels.

To illustrate the imaging effects on color ratios, pixels are plotted in Figure 1 based on the rgb and $c_1 c_2 c_3$ ratios. As can be seen in Figure 1.a, the color ratios achieve robust isolation of the skin-tone pixels. In fact, this allows us to specify a rectangular envelope containing most of the skin pixels. This rectangular envelope defines the limits of the $c_2/c_1$ and $c_3/c_2$ values of pixels taken from regions within the skin tone. These limits have been set as follows: $c_2/c_1 = [0.35 \ldots 0.80]$ and $c_3/c_2 = [0.40 \ldots 0.98]$. These values have been determined empirically by visual inspection of a large set of test images, and the choice has proved effective in our application. The second step is to impose shape restrictions on the skin-colored blobs to actually identify portraits, since problems may still occur when dealing with images taken under a wide variety of imaging circumstances. To this end, small blobs are removed by applying morphological operations. The remaining blobs are then tested with respect to two criteria. First, blobs are required to occupy at least 5% of the total image. Further, at least 20% of the blob should occupy the image. This ensures that only fairly large blobs remain.
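A possible implementation of the two-step portrait filter, using the ratio limits quoted above, is sketched below with NumPy and SciPy. The $c_1 c_2 c_3$ channels and the intensity image are assumed to be precomputed inputs, and the 3x3 structuring element and the single largest-blob area test are simplifications of the shape criteria described in the text.

```python
import numpy as np
from scipy import ndimage

# Empirical skin-tone limits from Section 4.2.
C2_C1_RANGE = (0.35, 0.80)
C3_C2_RANGE = (0.40, 0.98)

def detect_portrait(c1, c2, c3, intensity, min_blob_frac=0.05, eps=1e-12):
    """Two-step portrait filter: (1) select skin-tone pixels by the c2/c1 and
    c3/c2 ratio limits, discarding low-intensity pixels; (2) clean up the mask
    morphologically and require a sufficiently large remaining blob.
    """
    ratio21 = c2 / (c1 + eps)
    ratio32 = c3 / (c2 + eps)
    skin = ((ratio21 >= C2_C1_RANGE[0]) & (ratio21 <= C2_C1_RANGE[1]) &
            (ratio32 >= C3_C2_RANGE[0]) & (ratio32 <= C3_C2_RANGE[1]))
    # Ratios become unstable at very low intensity (below ~5% of the range).
    skin &= intensity > 0.05 * intensity.max()
    # Remove small blobs with a morphological opening, then label what is left.
    skin = ndimage.binary_opening(skin, structure=np.ones((3, 3)))
    labels, n_blobs = ndimage.label(skin)
    if n_blobs == 0:
        return False
    largest = max(np.sum(labels == i) for i in range(1, n_blobs + 1))
    return bool(largest >= min_blob_frac * skin.size)
```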

5 Experiments

In this section, we assess the accuracy of the image classification method. In the experiments, we use two different sets of images. The first dataset consists of images taken from the Corel Stock Photo Libraries. The second dataset is composed of the 100,000 images downloaded from the Web.

5.1 Photographic vs. Synthetic

Test Set I: The first dataset was composed of 100 images per class, resulting in a total of 200 images; hence the training set for each class is 100 images. The test set (query set) consisted of 50 images per class.
Classification based on Image Features: Based on edge strength and saturation, defined in Section 4.1, the 8-nearest neighbor classifier with weighted histogram intersection distance provided a classification success of 90% (i.e. 90% were correctly classified) for photographic images and 85% for synthetic images. From these results it is concluded that automated type classification based on low-level image features provides a satisfactory distinction between photographs and artwork.
Test Set II: The second dataset consists of a total of 1432 images, composed of 1157 synthetic images and 275 photographic images. Hence, 81% are artwork and 19% are photographs, a representative ratio for WWW images.
Classification based on Image Features: Images are classified on the basis of color variation, color saturation, and color edge strength. The 8-nearest neighbor classifier with weighted histogram intersection distance provided a classification success of 74% for photographic images and 87% for synthetic images.
Classification based on Composite Information: Images are classified on the basis of composite information, using the same image features: color variation, color saturation and color edge strength. Further, the annotations of the images have been processed with the HTML parsing method given in Section 3. Typical high-frequency words derived from photographs were: photo (in different languages, such as bilder (German) and foto (Dutch)), picture and people. Typical high-frequency words derived from artwork were: icon, logo, graphics, button, home, banner, and menu. Based on image features and keywords, the 8-nearest neighbor classifier with weighted cosine distance provided a classification success of 89% (i.e. 89% were correctly classified) for photographic images and 96% for synthetic images. From these results it is concluded that classification accuracy based on both textual and pictorial information is very high and that it outperforms classification based only on pictorial information.

5.2 Portraits vs. Non-portraits

Test Set I: To classify images into portraits and non-portraits, a dataset is used consisting of 110 images containing a portrait and an arbitrary set of 100 images, resulting in a total of 210 images. The test set consisted of 32 queries.
Portraits vs. Non-portraits based on Image Features: Based on the skin detector, the 8-nearest neighbor classifier provided a classification success of 81% for images containing portraits.
Test Set II: The second dataset is composed of 64 images, of which 26 were portraits and


32 non-portraits (arbitrary photographs).
Portraits vs. Non-portraits based on Image Features: Images have been represented by the skin feature in addition to 3 eigenvectors expressing color in the $c_1 c_2 c_3$ color space. As a consequence, the skin feature is used together with global color invariant features for classification. Based on a 4-nearest neighbor classifier, a classification success of 72% was obtained for portrait images and 92% for non-portrait images.
Portraits vs. Non-portraits based on Composite Information: HTML pages were parsed and relevant words were derived. Typical high-frequency words derived for portraits were: usr, credits, team and many different names such as laura, kellerp, eckert, arnold, and schreiber. High-frequency words derived for non-portraits were (see above): photo (in different languages, such as bilder (German) and foto (Dutch)), picture and tour. Based on image features and keywords, the 8-nearest neighbor classifier with weighted cosine distance provided a classification success of 72% (i.e. 72% were correctly classified) for portrait images and 92% for non-portrait images. From these results it is concluded that classification accuracy based on both textual and pictorial information is roughly the same as the classification rate based entirely on pictorial information. This is due to the inconsistent textual descriptions assigned to the portrait images we found on the Web: for example, different nicknames, surnames and family names were associated with portraits. Adding a list of names to the training set improved the classification rate to 87% for portrait images and 94% for non-portrait images. In future work, we will investigate other HTML tags and word lists to improve portrait classification performance.

6 Conclusion

From the theoretical and experimental results it is concluded that, for classifying images into photographic vs. synthetic, the contributions of image and text features are equally important. Consequently, high discriminative classification power is obtained from composite information. Classifying images into portraits vs. non-portraits shows that pictorial information is more important than textual information. This is due to the inconsistent textual descriptions for portrait images found on the Web. Hence, only marginal improvement in performance is achieved by using composite information for classification. As an extension, a list of surnames was added to the training set, which enhanced the classification rate.

References
1. J. Favella and V. Meza, "Image-retrieval Agent: Integrating Image Content and Text", IEEE Intelligent Systems, 1999.
2. T. Gevers and A. W. M. Smeulders, "PicToSeek: Combining Color and Shape Invariant Features for Image Retrieval", IEEE Trans. on Image Processing, 9(1), pp. 102-120, 2000.


3. S. Sclaroff, M. La Cascia, S. Sethi and L. Taycher, "Unifying Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web", CVIU, 75(1/2), 1999.
4. J. R. Smith and S.-F. Chang, "VisualSEEk: A Fully Automated Content-based Image Query System", ACM Multimedia, 1996.
5. A. Vailaya, M. Figueiredo, A. Jain and H. Zhang, "Content-based Hierarchical Classification of Vacation Images", IEEE ICMCS, June 7-11, 1999.
6. H.-H. Yu and W. Wolf, "Scene Classification Methods for Image and Video Databases", Proc. SPIE on DISAS, 1995.
7. D. Zhong, H. J. Zhang and S.-F. Chang, "Clustering Methods for Video Browsing and Annotation", Proc. SPIE on SRIVD, 1995.