
A Query Paradigm to Discover the Relation between Text and Images

Simone Santini
Praja, Inc.
University of California, San Diego

ABSTRACT

This paper studies the relation between images and text in image databases. An analysis of this relation results in the definition of three distinct query modalities: (1) linguistic scenario: images are part of a whole including a self-contained linguistic discourse, and their meaning derives from their interaction with the linguistic discourse; a typical case of this scenario is constituted by images on the World Wide Web; (2) closed world scenario: images are defined in a limited domain, and their meaning is anchored by conventions and norms in that domain; (3) user scenario: the linguistic discourse is provided by the user, as in highly interactive systems with relevance feedback. This paper deals with image databases of the first type. It shows how the relation between images (or parts of images) and text can be inferred and exploited for search. The paper develops a similarity model in which the similarity between two images is given by both their visual similarity and the similarity of the attached words. Both the visual and the textual similarity can be manipulated by the user through the two windows of the interface.

Keywords: Image Databases, Information Retrieval, Images and Text, Textual Description of Images, Text and Visual Feature Integration

1. INTRODUCTION

Content based image retrieval (CBIR) borrows many ideas, concepts, and techniques from the closely related field of Information Retrieval. Any parallel and mutual influence between the two fields, however, is destined to rest on shaky grounds until the semiotic relation between images and text is clarified, and all the due consequences are drawn. In particular, the common model of information retrieval is that a document is about something, and that aboutness is a property that can be ascribed to text qua text, in a way which is (in first approximation at least) independent of the query that is being asked. The methodological question that one should ask as a foundation of CBIR therefore is: are images parallel to text in this respect? Are images about something? Or, to make the question more precise, is an image a predicate, or a set of predicates?

One must recognize that, in general, the answer to this question is negative. If it is true that, as Barthes stated, the implicit message in every image is Ça-a-été (this-has-been), it is also true that there is nothing in the image per se to specify the nature of the thing that really happened, since there are no syntactic differences between a documentary image and a staged scene.

A necessary premise to the analysis of the semiotic status of an image is the recognition of its artificial status as a message. An image, just like a text or a speech, is an artifact created for the purpose of communication.∗ This statement could be interpreted as a truism in the case of a painting or a staged photograph, but it is no less true in the case of an allegedly "realistic" or "documentary" photograph, which relies on the filtering, selecting, and ultimately mediating presence of the photographer, and whose recognition as a "documentary" photograph is more a cultural and social recognition of the status and function of the photographer than a function of the content of the image itself.
There is nothing in an image that says Ça-a-été, except our acceptance of the social role of the photographer.

Author contact: Simone Santini, Department of Electrical and Computer Engineering; University of California, San Diego; 9500 Gilman Drive; La Jolla, CA 92093-0407; USA; email: [email protected]

∗ I am not forgetting the problematic nature of such a statement, but it is not essential for the analysis that will follow to consider it in greater depth. The reader is however invited to do so.1

Eco2 wrote that a sign is that which can be used to lie and, in this sense, an image is not a sign. An image is not a predicate, but something that can be predicated.† If I see a photograph of Jacques Chirac shaking hands with Georg Cantor on the first page of Le Monde, I am inclined to think of it as a lie. The lie, however, is not in the picture itself, but in the picture taken in a particular context (the front page of Le Monde). The conventions that regulate this context require that pictures on the first page of a newspaper represent "reality," unless otherwise specified. That is, in this case, the context is stating the predicate Ça-a-été of which the photograph is an object, and the context is lying, not the picture. Had the picture been in the book section of the same newspaper, used in the review of a book on "the political influence of higher mathematics," it would have been a perfectly normal and "true" photograph.

The necessity of an external validation of the image content is related to the contextual incompleteness of pictures, the state of affairs by which pictures can't themselves be predicates, but become so with the help of a textual discourse.3 Pictures are not predicates but entities which are predicated by some form of associated text with the help of some verbal shifter (like the famous ceci in Magritte's Ceci n'est pas une pipe). An image is therefore a predicate only in a certain textual context. I call eidoneme the elementary carrier of signification in the image world (much like the grapheme in the case of written communication).

An important question (one might say the most important question for CBIR) is the nature of the textual component of an eidoneme. This question gives rise to three possible scenarios4:

1. The image is part of a coherent whole which includes text and a discourse that anchors this text to a meaning. The World Wide Web provides a perfect model of this situation.
In this case the database operates in the territory between text and image, as a trait d'union between the two.5

2. The text is implicit in the social discourse, which sets and delimits the possible interpretations of the images even before the database is designed. This is the typical case of limited domain databases, and here the image database can operate in its most traditional way, as a search engine based on automatically extracted content features.

3. The linguistic discourse is provided by the user; this is the case of strong feedback and tight loop interactive systems. The database in this case is a writing tool for the user, in which linguistic concepts are written and associated to the images in the repository.

This paper discusses the interaction between text and images in the first scenario above. Within the realm of a simple, direct relation between images and the accompanying text, there are some important distinctions to be made. The simplest (and least interesting) relation is normative, and is represented by a label telling one exactly what the image is supposed to mean. In this case, however, the label acts as a strong delimiter of the content of the image. Moreover, the label itself, taken in isolation, can't support signification without being immersed in a context. In other words: labeling images can only be done in a pre-defined domain which will immerse the image and the label in a context (case 2 above), and doesn't constitute a relation of the type in which I am interested here.

A more interesting circumstance occurs when the relation between text and images is more flexible and complex, such as in the case of images on the web embedded in the linguistic context of pages. This type of relation has been exploited in many systems, such as VisualSeek5 and MAVIS2.6 These systems invoke first order relations between images and text, and images can be characterized to the extent that the text associated to them can.
This is a problem in cases in which certain images escape characterization by first order relations, either because supporting text couldn't be found, or because the text itself can't be interpreted. In this case it is possible to use second order relations, induced by the similarity between images and by the first order relation between images and text. So, an image visually close to a cluster of images sharing a certain meaning will participate in that meaning.

† This doesn't actually mean that an image is not a sign but, in the language of Peirce, that it is a rhematic indexical qualisign.
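The second order relation sketched above can be made concrete in many ways; the following is one illustrative possibility (not the paper's implementation: the function names and the weighting scheme are mine). An image without usable text inherits term weights from its visually closest neighbours, each contribution scaled by visual similarity.

```python
# Hypothetical sketch: propagating term weights to an image that lacks
# usable text from its k visually nearest labelled neighbours.
# `visual_sim(a, b)` and `term_weights[n]` are assumed inputs, not the
# paper's own data structures.

def propagate_terms(target, neighbours, visual_sim, term_weights, k=3):
    """Blend the term weights of the k visually closest labelled images,
    weighting each neighbour's contribution by its visual similarity."""
    ranked = sorted(neighbours, key=lambda n: visual_sim(target, n),
                    reverse=True)[:k]
    total = sum(visual_sim(target, n) for n in ranked)
    inferred = {}
    for n in ranked:
        w = visual_sim(target, n) / total
        for term, weight in term_weights[n].items():
            inferred[term] = inferred.get(term, 0.0) + w * weight
    return inferred
```

The normalization by the total similarity keeps the inferred weights on the same scale as the neighbours' weights, so propagated and directly indexed images remain comparable.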

2. IMAGE ACQUISITION

In order to work in a linguistic modality, it is necessary to find a situation in which the text is complete and extended enough to determine its own context without the need of an external convention. At the same time, the text must include references to images. The most obvious text in which these conditions are satisfied, and which is already in a convenient form for automatic processing, is the World Wide Web. In the following, therefore, I will assume the World Wide Web as my test-bed, and consider the problem of creating an image database for images taken from web pages.

2.1. Web Crawling

Images are captured from web pages together with the accompanying text. A link following program (sometimes colorfully known as a "web crawler" or "web spider") iteratively visits a page and all the pages linked to it, in a random fashion. The crawler maintains a list of pending pages to visit, and a list of visited pages. The high level description of its operation is as follows:

Algorithm 2.1. Let P be the set of pending sites and V the set of visited sites, each one represented by its URL h. The algorithm starts with V = ∅ and P = {h0}, where h0 is the starting site. The algorithm visits at most N_MAX sites. Also, rnd(P) is a random element taken from P.

    while P ≠ ∅ and visits < N_MAX
        h ← rnd(P); P ← P − {h}
        if h ∉ V
            V ← V ∪ {h}
            for all links h_i in the page h
                if h_i ∉ V
                    P ← P ∪ {h_i}
            Q[v].t ← Text(h)
            Q[v].n ← Title(h)
            Q[v].k ← Keywords(h)
            for each image I_k in the page h
                Q[v].I[k].n ← URL(I_k)
                Q[v].I[k].i ← IndexOf(I_k, h, T[v])
                Q[v].I[k].a ← AltText(I_k)
            visits ← visits + 1

The function Text returns an array with all the words in the text portion of the page (that is, excluding text which is included in the header or in tags). The function IndexOf takes an image, the URL of a page, and the text of the same page; it returns the position in the text array at which the image I_k occurs. The function Keywords returns the list of words included in the META KEYWORDS tag. The function AltText returns the text included in the ALT argument of the IMG tag of I_k.

The output of the crawler is, for each page, a structure Q containing the title, text, and keywords of the page, and a list of structures I, one for each image. Each structure I contains the URL, index, and text associated with the image.

Before indexing, images are subject to a simple filtering operation. In order to be accepted as valid, an image must satisfy the following criteria:

1. The image must be contained in a jpeg file.
2. The image must be at least of size 100 × 100.
3. The width to height ratio must be between 1/3 and 3.
4. There must be some text between the image and the beginning of the page.
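The crawl loop of Algorithm 2.1 can be sketched in Python. This is a skeleton of the traversal only: `get_links` stands in for fetching a page and extracting its links (a real crawler would also fill the Q structures here), so the logic can be exercised offline.

```python
import random

# Sketch of the traversal in Algorithm 2.1. `get_links(h)` is an assumed
# callable standing in for "fetch page h and return the URLs it links to".

def crawl(h0, get_links, n_max, rnd=random.choice):
    pending, visited = {h0}, set()
    while pending and len(visited) < n_max:
        h = rnd(sorted(pending))        # rnd(P): a random element of P
        pending.discard(h)              # P <- P - {h}
        if h in visited:
            continue
        visited.add(h)                  # here the real crawler fills Q[v]
        pending |= set(get_links(h)) - visited
    return visited
```

Keeping `pending` and `visited` as sets makes the membership tests of the algorithm (h ∉ V, h_i ∉ V) constant time.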

The first rule tries to eliminate troublesome images like animated GIFs, or images of no direct interest to this specific database, like cartoons.‡ The second and third criteria try to eliminate small images, like buttons, and very skewed images, like separation lines. Rules 3 and 4 are also used to reduce the number of ad banners that make it into the database.
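The four acceptance criteria can be condensed into a single predicate. A minimal sketch follows; the parameter names are mine, and `text_index` is the image's position in the page's text array (so rule 4 reduces to it being positive).

```python
# Sketch of the image acceptance rules; parameter names are illustrative.

def accept_image(url, width, height, text_index):
    if not url.lower().endswith((".jpg", ".jpeg")):   # rule 1: jpeg file
        return False
    if width < 100 or height < 100:                   # rule 2: at least 100x100
        return False
    if not (1 / 3 <= width / height <= 3):            # rule 3: aspect ratio
        return False
    return text_index > 0                             # rule 4: text precedes image
```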

2.2. How text comes about on the 'net

English text on the internet is not like usual English text: differences are present at several levels. Lexically, Amitay7 found that the distribution of words in web pages is different from that of other forms of written English: sentences are more fragmented, and often fragments don't have a proper sentence structure. In general, delimiters are less frequently used in web pages: the word "the" makes up about 7% of regular written English, but only 3% of "web English." While the cultural consequences of this impoverishment and "newspeak" transformation of the English language can be rather serious, from the point of view of automatic analysis these characteristics are advantageous: less structure means that the relation between meaning and word counting is more direct.

The structure of a page is also rather different from that of a page of linear English text, mostly due to the presence of clearly identifiable markers with specific functions that have no correspondence in English. The most prominent of these markers is the link which, on a typical web page, is both typographically and functionally segregated. There is a well documented tendency, in web writing, to place semantically charged words near links and, so to speak, to compress the more meaningful portions of the discourse in those areas. The portion of text around the link in which useful information about the linked page can be found is called the anchor text for that link. The extent of the anchor text is estimated at 50 characters before and after the link text.8 It is noteworthy that the most reliable information about the contents of a page is not found in the page itself, but in the anchor text of the pages that link to it.

Considering single links between pairs of pages still gives a partial view of the structure of the web. Rather, the complete pattern of links between sites is a better description of the underlying social organization of the web.
Chakrabarti et al.9 took the point of view that one of the most prominent aspects of such social organization is the conferral of authority to certain pages. This line of thought may lead to interesting insights into the process of meaning assignment on the web, but it will not be pursued here.
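The 50-character anchor text window described above can be sketched as follows. This assumes the page has already been reduced to plain text; the function name and signature are illustrative.

```python
# Sketch of anchor-text extraction: the `window` characters of page text
# on either side of the link text. `page_text` is assumed to be plain
# text, with markup already stripped.

def anchor_text(page_text, link_text, window=50):
    i = page_text.find(link_text)
    if i < 0:
        return ""
    start = max(0, i - window)
    end = min(len(page_text), i + len(link_text) + window)
    return page_text[start:end]
```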

2.3. Construction of the Text Index

Following acquisition, the text collected in the pages and the image tags is analyzed to make it suitable for indexing. First, all uninformative words are removed. There are two classes of such words: the first is composed of very common words in the English language (such as "the," "which," and so on), while the second is composed of words widely used in web pages, such as "url," "http," "click," "buy," and so on. These words give no information about the content of a page, and are ignored.

Common word removal is followed by stemming, a procedure that reduces the size of the index by eliminating variations of the same word. A typical example of stemming in English is the removal of the final "s" from plurals, so that, for instance, "books" and "book" will be considered as the same term. The system presented here uses Porter's stemming algorithm,10 which is a rule based algorithm. The algorithm performs quite well, but it has the disadvantage (especially acute in web applications) that its set of rules is specific to the English language. It is therefore impossible to create a good multilingual index of the web with it. Other algorithms, based on the statistical occurrence of terminations,11 can be made multilingual simply by using a multilingual dictionary, and seem more promising for web applications. I am currently testing these algorithms.

The terms produced by stemming are used directly as an index. I am gathering data on the effectiveness of this approach, to evaluate the possibility of using some more sophisticated indexing technique such as latent semantics.12 Indexing terms come from three separate areas of the page that links to the image under consideration: the anchor text of the IMG tag (this includes the alternate text for the image, if present), the collection of keywords contained in the META tag of the page (if present), and the title of the page.
These areas have different expected relevance for the characterization of the contents of the images. In particular, the considerations of the previous section point to a higher relevance of the anchor text. When the page is analyzed, the three areas are summarized in three separate sets of terms, which then contribute differently to the final index.

‡ I am also rather uncomfortable with proprietary standards, which I feel are contrary to the spirit of the Internet.
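The stop-word removal and stemming pipeline can be illustrated as follows. The stop lists here are tiny samples, and the stemmer strips only a few common suffixes: a stand-in for Porter's full rule set,10 not a reimplementation of it.

```python
# Illustrative index-preparation pipeline: stop-word removal followed by
# stemming. Stop lists are small samples; naive_stem is a toy stand-in
# for Porter's algorithm.

ENGLISH_STOPS = {"the", "a", "an", "of", "which", "and", "to", "in"}
WEB_STOPS = {"url", "http", "click", "buy", "here", "home"}

def naive_stem(word):
    # Strip a few plural/verbal suffixes, keeping at least a 3-letter stem.
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(words):
    lowered = (w.lower() for w in words)
    return [naive_stem(w) for w in lowered
            if w not in ENGLISH_STOPS | WEB_STOPS]
```

A production version would substitute a real Porter stemmer for `naive_stem`; the surrounding pipeline stays the same.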

Weighting. The three different areas of the page from which the index text is derived are assigned a priori "thematic" weights. In general, the anchor text is the most relevant for indexing, followed by the keywords and the title. I used the following thematic weights: ν_a = 1.2/2 for the anchor text, ν_k = ν_t = 0.4/2 for the keywords and the title.

Each term is also assigned a weight measuring its potential indexing quality, determined using probabilistic weighting.11 Let T_µ be a term to which a weight needs to be assigned, df_µ the document frequency of T_µ (that is, the number of documents in which T_µ appears as a term), and tf_µk the term frequency of T_µ for image I_k (that is, the number of times T_µ appears as a term in the index of image I_k). The basic term weight for T_µ in image I_k, β_k(T_µ), is then given by

    β_k(T_µ) = tf_µk log((N_I − df_µ) / df_µ)    (1)

where N_I is the number of images in the database. This weight formula captures the two characteristics that make a term highly discriminative: (1) it appears in relatively few documents in the collection (low value of df_µ), and (2) it appears many times in the image in question (high value of tf_µk).

The weight of a term is then multiplied by the thematic weight corresponding to the section in which the term was found. If, in the index of image I_k, the term T_µ appears ta_µk times in the anchor text, tk_µk times among the keywords, and tt_µk times in the title, then the associated weight is

    α_µ,k = α(T_µ, I_k) = (ta_µk ν_a + tk_µk ν_k + tt_µk ν_t) log((N_I − df_µ) / df_µ)


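The combined thematic weighting can be written as a one-line function. The sketch below mirrors the weighting scheme of this section (ν_a = 0.6, ν_k = ν_t = 0.2); the function and parameter names are mine.

```python
import math

# Sketch of the thematic term weight: per-area term counts (anchor ta,
# keywords tk, title tt) combined with the thematic weights and scaled by
# the discriminativity factor log((N_I - df) / df).

NU_A, NU_K, NU_T = 0.6, 0.2, 0.2   # thematic weights: 1.2/2 and 0.4/2

def term_weight(ta, tk, tt, df, n_images):
    return (ta * NU_A + tk * NU_K + tt * NU_T) * math.log((n_images - df) / df)
```

Note that the factor vanishes when a term appears in half the collection (df = N_I / 2) and grows as the term becomes rarer, which is the discriminativity behavior the text describes.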
2.4. Visual Features

The visual features used for the images are quite standard in visual information retrieval. The visual search algorithm was put together using pieces of "scrap" software developed in the lab for different purposes, or pieces of publicly available software. The search algorithm uses three types of features representing color, structure, and texture. Color and structure are represented by feature histograms, while texture is represented by the mean and variance of local features. Each feature is partially localized by dividing the images into a fixed number R of rectangles and computing the features separately for each rectangle. The organization of the feature vector for the visual search is shown in Fig. 1.

[Figure 1. Organization of the feature vector for the visual search.]
