ADAPTIVE IMAGE CLASSIFICATION BASED ON FOLKSONOMY

Esin Guldogan and Moncef Gabbouj
Department of Signal Processing, Tampere University of Technology, Tampere, Finland
[email protected], [email protected]

ABSTRACT

In this paper, we present a novel adaptive image classification method for content-based image classification systems based on user-defined tags and annotations. The proposed method utilizes low-level features and folksonomies for improved classification accuracy. Thus, users' perceptive semantics are modeled in terms of low-level features and combined adaptively with low-level image categorization. The proposed method has been thoroughly evaluated, and selected results are illustrated in the paper. It is shown that satisfactory improvements can be achieved by integrating folksonomies into the classification scheme. Furthermore, it is a language-independent and low-complexity method that can be used on various databases, languages, and Content-Based Image Retrieval applications.

1. INTRODUCTION

Content-based image analysis systems usually represent content with low-level and high-level features. Various low-level descriptors have been proposed recently in the domain of image indexing, retrieval, and classification. High-level features, also known as logical or semantic features, capture the various degrees of semantics present in images. High-level features can be classified as objective or subjective. Subjective features concern abstract attributes: they describe the meaning and purpose of objects or scenes. The use of low-level features does not yield satisfactory retrieval results in many cases, especially when high-level concepts in the user's mind are not easily expressible in terms of low-level features.
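As a concrete illustration of what a low-level feature is (this sketch is not code from the paper, and the 4-levels-per-channel quantization is an arbitrary choice), a quantized RGB color histogram can be computed as follows:

```python
from collections import Counter

def rgb_histogram(pixels, bins_per_channel=4):
    """Quantized RGB color histogram: a simple low-level feature.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    Each channel is quantized into `bins_per_channel` levels, giving
    bins_per_channel**3 bins in total; the histogram is L1-normalized.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    hist = [0.0] * bins_per_channel ** 3
    n = len(pixels)
    for (qr, qg, qb), c in counts.items():
        idx = (qr * bins_per_channel + qg) * bins_per_channel + qb
        hist[idx] = c / n
    return hist

# Example: a hypothetical image that is half pure red, half pure blue.
pixels = [(255, 0, 0)] * 50 + [(0, 0, 255)] * 50
h = rgb_histogram(pixels)
print(len(h))   # 64 bins
print(sum(h))   # 1.0 (normalized)
print(max(h))   # 0.5 (two equally dominant color bins)
```

Such a vector captures color distribution only; it carries no semantic information, which is exactly why the semantic gap arises.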
This challenge is called the "semantic gap" between the low-level feature vector representation and the semantic concept of the image content; it has recently been a popular research area involving studies on relevance feedback, annotation of images, and classification. Semantic image classification is found to be one of the most powerful methods for improving image retrieval performance. Image classification also helps to increase the efficiency of browsing. However, utilizing only low-level features for image classification will not give satisfactory accuracy due to the semantic gap. Therefore, the associated metadata of images may be combined into the image classification method in order to improve the accuracy. Users may help to extract the connotative information of the image by tagging.* Image annotation or social tagging is the process of associating texts or keywords with an image, which is defined as folksonomy. A folksonomy is a method of collaboratively creating tags to annotate and categorize image content. The term is also known as collaborative tagging, social classification, social indexing, and social tagging. Some websites, such as Flickr, include tag clouds as a way to visualize the tags in a folksonomy. The image tagging process is subjective and inconsistent, since semantically related images may be interpreted with different words [6]. This inconsistency and subjectivity may lead to ambiguous image classification. In order to improve semantic accuracy and efficiency, these two approaches to image classification should be combined appropriately. The studies presented in this paper are motivated by this idea of improvement. In this paper, an adaptive image classification based on visual low-level features and folksonomies is presented. The rest of the paper is organized as follows: Section 2 introduces the image classification approaches in the literature, Section 3 discusses the proposed method and algorithms, Section 4 presents the experimental results, and Section 5 concludes the paper along with some future remarks.

* This work is supported by the Devices and Interoperability Ecosystem (DIEM) project, which is part of the Finnish ICT SHOK program coordinated by TIVIT.

2. IMAGE CLASSIFICATION

Image classification has been proposed to group images into semantically meaningful classes in order to fill the semantic gap between low-level visual features and human perception. There are various supervised and unsupervised approaches based on low-level image features for image clustering, such as Self-Organizing Maps (SOM) and Markov models [12], [13]. However, these classification schemes are entirely based on pictorial information. Recent systems map words to visual features and images by automatically annotating the images [3], [5], [11]. In such systems, the query process is based on natural-language queries over an automatically annotated image database.
One drawback of this approach is its limited, pre-defined vocabulary. Textual queries face two main challenges: the query must correspond to the text associated with the image, and the language of the query must match the language of the tags. Various approaches have been introduced for categorizing images in order to annotate them for the subsequent retrieval process [4], [7]. However, the image categorization results highly impact the accuracy of the annotation, which in turn proportionally affects the retrieval accuracy. In most cases these approaches rely on training databases, and selection of the training samples is a considerably challenging process. Additionally, most of the previous works categorize images with controlled vocabularies. However, this approach limits the information that can be captured from the image [8]. Similarly, tags and annotations help the system to refer to the high-level concepts in the image, but they do not comprehensively represent the information lying in the image. We cannot rely entirely on annotated images, since large image databases are rarely fully tagged and tags are subjective, inconsistent, and unreliable. Cai et al. [2] studied hierarchical clustering of WWW images using visual, textual, and link analysis; however, this method aims to facilitate users' browsing of WWW images. Abbasi et al. [1] studied folksonomies to improve image classification. They created a vector from the bag-of-words and merged it with low-level features to train a one-class SVM. However, combining more than one low-level feature increases the complexity of the SVM. It is also a challenging task to choose an appropriate SVM kernel and its associated parameters: different parameter values and different kernel choices may yield different performances. Likewise, it is a challenging task to translate the visual representation of an image into text descriptions. In the proposed method, we do not eliminate the low-level features (denotative attributes) that can be extracted from the image. We also utilize unrestricted, user-defined folksonomies to extract the boundless information lying in the image. After all, the image itself contains most of the information that users are trying to find. The aim of the proposed method is to categorize images more accurately by integrating folksonomies, so that human perception is taken into consideration within the process.
The main differences of the proposed method from the previous works are as follows:
- The proposed method utilizes low-level features and folksonomies at each level of categorization without separating them.
- The tags are not directly used as text data in the classification process, unlike many other approaches. Image classification is performed with the visual features, assisted by the tags.
- In this study, we model human perception by low-level features based on folksonomies.
- This approach does not need training data and is language-independent.

3. ADAPTIVE IMAGE CLASSIFICATION METHOD

The proposed method consists of three steps. In the first step, the image database is categorized based on low-level visual features. Let I_n denote an image without any tag and I_n^T a tagged image, where N is the number of images in the database and 0 ≤ n < N. The image database contains tagged and untagged images, represented as follows:

R = { I_n , I_n^T },

where T is the bag-of-words consisting of the folksonomies, i.e., the collaboratively created (social) tags. After the first step, each image in the database belongs to one class, based on image classification with only low-level visual features:

( I_n , C_k ) and ( I_n^T , C_k ), 0 ≤ k < K,

where K represents the number of clusters. In the second step, a visual feature model is created for the commonly tagged images within a category. Figure 1 illustrates the overview of the proposed visual feature modeling approach. Red circles indicate the image categories created after the first step. Dashed circles represent folksonomy groups, where commonly tagged images within a cluster are shown with striped texture. As shown in Figure 1, a visual feature model for each image category is created after the second step of the proposed method. The Term Frequencies (TF) of the tags within a category are calculated, and the images sharing the tag with the highest TF are selected as the commonly tagged images of that category. In order to define the commonly tagged group of images, the following mapping is applied:

( I_n^T , C_k ) → ( I_n^TC , C_k ),

where I_n^TC represents the commonly tagged images within class k. A visual feature model is created from these images ( I_n^TC ) in order to model the human perception based on tags with the low-level features. The visual feature model is computed from the low-level features of the commonly tagged images in class k as follows:

(1/t) Σ_{i=0}^{t} f_i( I_n^TC ),

where t represents the number of I_n^TC within class k and f(·) represents a low-level feature vector of the images. Afterwards, the resulting feature vector I_n^TC = [x_1, x_2, …, x_V] is converted to a new vector of smoothed data, where V represents the size of the feature vector. The smoothed points Y_{I_n^TC} = [y_1, y_2, …, y_V] form the visual feature model of the commonly tagged images within class k for an individual visual feature:

y_v = (1/3) Σ_{i=-1}^{+1} x_{v+i}, 1 ≤ v ≤ V.

The final step is the so-called adaptation step, where re-categorization is performed on the whole database. The Euclidean distance is calculated between the model of each cluster and all the images in the database. Afterwards, relevant images are moved to the corresponding cluster, which can be represented as follows:

( I_n , C_k ) → ( I_n , C_j ), 0 ≤ k, j < K, if d( Y_{C_j} , I_n ) < Threshold.
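The three ingredients above can be sketched in a few lines of pure Python. This is a minimal, hypothetical illustration, not the authors' implementation: function names and toy data are invented, and the paper does not specify how the smoothing handles the vector edges (here the edges average only the available neighbors).

```python
from collections import Counter

def common_tag_images(cluster):
    """Step 2a (sketch): select the images carrying the most frequent tag.

    `cluster` is a list of (feature_vector, tags) pairs for one class;
    untagged images simply have an empty tag list.
    """
    tf = Counter(tag for _, tags in cluster for tag in tags)
    if not tf:
        return []
    top_tag, _ = tf.most_common(1)[0]
    return [feat for feat, tags in cluster if top_tag in tags]

def visual_feature_model(feats):
    """Step 2b (sketch): mean feature vector of the commonly tagged
    images, smoothed with a 3-point moving average."""
    t, V = len(feats), len(feats[0])
    mean = [sum(f[v] for f in feats) / t for v in range(V)]
    smoothed = []
    for v in range(V):
        window = mean[max(0, v - 1): v + 2]  # edge-aware 3-point window
        smoothed.append(sum(window) / len(window))
    return smoothed

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def adapt(database, models, threshold):
    """Step 3 (sketch): move an image to the nearest class model if it
    lies within the distance threshold; otherwise keep its label."""
    out = {}
    for name, (feat, label) in database.items():
        best = min(models, key=lambda k: euclidean(models[k], feat))
        out[name] = best if euclidean(models[best], feat) < threshold else label
    return out

# Toy example (all data hypothetical).
cluster = [([1.0, 1.0], ["flower"]),
           ([1.2, 0.8], ["flower", "red"]),
           ([5.0, 5.0], ["car"])]
feats = common_tag_images(cluster)        # the two "flower" images
model = visual_feature_model(feats)       # approximately [1.0, 1.0]
database = {"img1": ([1.05, 0.95], "C0"), # close to the model
            "img2": ([9.0, 9.0], "C1")}   # far from the model
new = adapt(database, {"C0": model}, threshold=1.0)
print(new)  # img1 assigned to C0; img2 keeps C1
```

The threshold in the final step is the tuning knob: a small value leaves most images in their original low-level clusters, while a large value lets the tag-derived models dominate the re-categorization.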

Figure 1: Overview of the visual feature modeling approach

4. EXPERIMENTAL RESULTS

In the experiments, we have utilized two image databases. The first dataset is a real-world image database for evaluating the categorization accuracy of the proposed method. Part of these test images were created with the Nokia Image Space service [10] and uploaded to Flickr. The images were mainly selected manually from Flickr, where most of them are tagged by users of Image Space and Flickr. All the associated tags are utilized for the image classification. There are 360 images in the database, of which only 180 are tagged. These images were pre-assigned by MUVIS [9] team members to 9 semantic classes, each containing 40 images. The classes are: sports, animals, towers, flowers, berries, doors, people, shoes, and handcrafts. The second dataset is the Corel real-world image database. Corel image data sets are well-categorized and widely used in the CBIR literature; therefore, we utilized a Corel database with 10000 images for evaluating the results. These images were pre-assigned by a group of human observers to 100 semantic classes, each containing 100 images. Sample classes are: Africa, Beach, Buildings, Buses, Dinosaurs, Flowers, Elephants, Horses, Food, and Mountains. However, the Corel database does not have social tags available. For this reason, the class names are used as tags, so there is only a single tag per image, which also represents the semantic ground-truth class. In the experiments, the following low-level color, shape, and texture features are used: YUV and HSV color histograms with 128 bins, the Gray Level Co-Occurrence Matrix texture feature with parameter value 12, the Canny Edge Histogram, and Dominant Color with 3 colors. The image database is categorized by the K-means clustering method with only low-level features, with only tags, and with the

proposed method, in order to assess the benefits. The results are illustrated in the figures. Figure 2 shows the classification accuracy of categorization with low-level features and with the proposed method for each category of tagged images. The proposed adaptive method significantly improves the classification accuracy in the majority of the image categories.
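The K-means baseline used in the comparison can be sketched as follows. This is a generic, minimal k-means on toy 2-D points with deterministic initialization, purely for illustration; the actual experiments operate on the richer low-level features listed above.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Minimal k-means with deterministic initialization (the first k
    points serve as initial centers). A stand-in for the baseline
    clustering, not the paper's implementation."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[j].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center for an empty cluster
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
    return centers, labels

# Two well-separated toy "feature" blobs; k-means should separate them.
pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, labels = kmeans(pts, k=2)
print(labels)  # the first three points share one label, the last three the other
```

With only low-level features, such clusters reflect visual similarity alone; the proposed adaptation step then refines them using the tag-derived visual feature models.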

Figure 2: Classification accuracy of the tagged images in each category

In Figure 3, the test database contains 180 tagged and 180 untagged images. The proposed method improves the classification accuracy by 21% over the entire database. As shown in Figure 3, the proposed method is compared with the method introduced by Abbasi et al. [1], which merges the tags and the low-level visual features into one single form for image classification. In these experiments, we utilized K-means clustering for both methods in order to compare them on equal terms. As illustrated in Figure 3, the proposed method improves the classification accuracy compared with the other classification approaches.

Figure 3: Classification accuracy on the tagged and untagged image database, comparing classification with visual features only, with textual tags and visual features combined as in [1], and with the proposed method

Figure 4 and Figure 5 present the accuracy of classification with low-level visual features and with the proposed method on the Corel image database. Figure 4 presents the classification accuracies of 10 sample classes; the proposed adaptive method significantly improves the accuracy in the majority of the image categories. Figure 5 illustrates the classification accuracy over the whole Corel dataset of 10000 images. As shown in Figure 5, the proposed adaptive classification method improves the accuracy compared with classification using only visual features.

Figure 4: Classification accuracy of 10 sample classes in the Corel dataset

Figure 5: Classification accuracy of the Corel image database

5. CONCLUSIONS

A novel approach to adaptive image classification is introduced in this paper. Visual features and folksonomies are combined for adaptive image classification in order to improve the accuracy. We introduce an approach to model the folksonomies visually for classifying both tagged and untagged images. Experiments show that the proposed method improves the image classification accuracy. The proposed image categorization approach is language-independent, which gives the flexibility to use it with different languages. It may be applied to several types of databases and sets of features. Determining the "commonly tagged" images is a challenging task; therefore, ontologies will be used for more general categorization in the future. Assessment studies on the criteria will also be carried out using different databases on different platforms. In addition, this work may be extended to multimodal features for multimedia databases.

6. REFERENCES

[1] R. Abbasi, M. Grzegorzek, S. Staab, "Using Colors as Tags in Folksonomies to Improve Image Classification", in Proceedings of SAMT, 2008.
[2] D. Cai, X. He, Z. Li, W.Y. Ma, J.R. Wen, "Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information", in Proceedings of the ACM International Conference on Multimedia, New York, USA, 2004, pp. 952-959.
[3] G. Carneiro et al., "Supervised Learning of Semantic Classes for Image Annotation and Retrieval", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 3, 2007, pp. 394-410.
[4] R. Datta, W. Ge, J. Li, and J.Z. Wang, "Toward Bridging the Annotation-Retrieval Gap in Image Search", IEEE Multimedia: Special Issue on Advances in Multimedia Computing, vol. 14, no. 3, 2007, pp. 24-35.
[5] D. Djordjevic, E. Izquierdo, "An Object- and User-Driven System for Semantic-Based Image Annotation and Retrieval", IEEE Trans. Circuits Syst. Video Techn., vol. 17, no. 3, 2007, pp. 313-323.
[6] W.-T. Fu, T.G. Kannampallil, R. Kang, "A Semantic Imitation Model of Social Tag Choices", 2009 International Conference on Computational Science and Engineering, 2009, vol. 4, pp. 66-73.
[7] S. Lindstaedt, R. Mörzinger, R. Sorschag, V. Pammer, G. Thallinger, "Automatic Image Annotation Using Visual Content and Folksonomies", Multimedia Tools and Applications, vol. 42, no. 1, March 2009, pp. 97-113.
[8] E. Menard, "Ordinary Image Retrieval in a Multilingual Context: A Comparison of Two Indexing Vocabularies", in Proceedings of ISKO UK Conference 2009, June 2009.
[9] MUVIS: A System for Content-Based Indexing and Retrieval in Multimedia Databases, http://muvis.cs.tut.fi/.
[10] Nokia Image Space Service: http://research.nokia.com/research/imagespace
[11] N. Rasiwasia, N. Vasconcelos, P.J. Moreno, "Query by Semantic Example", Proc. 5th Int'l Conf. Image and Video Retrieval, LNCS 4071, Springer, 2006, pp. 51-60.
[12] H. Yu and W. Wolf, "Scene Classification Methods for Image and Video Databases", Proc. SPIE on DISAS, 1995.
[13] D. Zhong, H. Zhang, F. Chang, "Clustering Methods for Video Browsing and Annotation", Proc. SPIE on SRIVD, 1995.