Integrated Browsing and Querying for Image Databases

Simone Santini and Ramesh Jain
University of California, San Diego

Abstract


This paper describes the architecture and the salient characteristics of the image database system El Niño. The system uses a new interaction model which purports to overcome the problem of the semantic gap. We have a semantic gap when the meaning that the user has in mind for an image is at a higher semantic level than the features on which the database operates. We argue that we can overcome the problems caused by the semantic gap if we replace the usual query paradigm with a more active exploration process, and we develop an interface based on these premises. Other relevant aspects of El Niño, which are discussed in the paper, are its distributed architecture composed of partially independent engines connected to a mediator, its attempts to integrate visual and textual queries, and its query algebra approach to the problem of putting together queries from different engines.

1 Introduction

What exactly is wrong with existing image databases? Most people would wholeheartedly agree that something is indeed amiss but, if asked to pinpoint exactly where the problems are, their analytical capacities seem to fail them. When we started working on image databases at the Visual Computing Laboratory at the University of California, San Diego, we asked ourselves the same question. In the course of the last three years, we believe we have started to glimpse some of the answers, and we have tried to incorporate the lessons we learned in our system "El Niño". El Niño is the collective name of a group of search engines, interface tools, and communication and integration modules for the management of image repositories. This paper describes the architecture of El Niño and its most characteristic aspects.

The starting point of our work was an analysis of the meaning of images and the role of semantics in the search process. In traditional databases, the semantics of a particular record is a function of its syntax and of the semantics of its atomic constituents. If the bits in the third column of a certain relational table can be interpreted as the number 90,000, then the employee to whom that record corresponds earns $90,000 per year. This assumption has been, more or less explicitly, extended (coeteris paribus, of course) to image databases. But, as our analysis revealed, the coeteris was, in this case, not paribus at all. The common outcome of the use of many existing image databases is a mix of excitement and frustration. It is clear that the systems could help you solve some interesting problems, but there are aspects of those very systems that leave you with a deep frustration. To understand the point, consider Fig. 1, which is a typical example of query and answer from a current image database. The query is the image in the top-left corner, and the similarity criterion included a mixture of global color, local color, edges, and texture that we selected using four knobs in the interface.

Figure 1: An example of a query from a first-generation image database.


Some of these images are acceptable answers to the query, while others appear quite out of place. We can show, however, that most of them are not out of place at all. The face in third position, for instance, is there because the forehead of the woman is very similar in structure and position to the arch above the door in the query. The fourth image is there because the vest of the woman is similar in color and position to the door in the query, and so on. The database is obviously making some sense of the similarity criterion that we selected. The problem is that the sense the database is making is not the sense the user wanted.

We can analyze the situation as follows. The user has some semantic specification in mind ("I want to see images of old doors"). He found an example of an old door (how this is done is the subject of a glissando in many systems) and concocted a similarity criterion that, according to him, captured a notion of similarity that could induce that semantics (the door is blue, and it has a fairly well defined structure). The database used this similarity to sort the images and returned the best results. The similarity did in fact induce a semantics in the database images. Alas, not the right one. We call this problem the semantic gap. The user has a fairly rich semantics in mind when he starts selecting a similarity criterion, but the tools that the database offers him are inadequate to express it. Connected to this is the problem of query refinement. Receiving an answer like that of Fig. 1 is not bad at all, provided that we know how to change the similarity criterion in order to improve it. In this sense, the database fails: it is not at all clear how to manipulate our four knobs in order to receive a better answer: we don't know whether we should give more weight to color, less weight to texture, or whatnot. In fact, our four knobs probably don't give us enough expressive power to express the similarity criterion we would need.

This paper presents El Niño, an image database system stemming in part from these ideas. The most noteworthy characteristic of El Niño is its interaction model, which tries to solve the problem of the semantic gap. Section 2 describes how the exploratory interface of El Niño can help narrow the semantic gap. Section 3 describes the distributed architecture of El Niño and the integration of its various components. Section 4 describes the search engines used by El Niño and how the use of operators in the mediator can help integrate queries of different nature, allowing some original ideas in the integration of perceptual queries and metadata queries. Section 5 briefly sketches the role of the similarity algebra in joining together queries based on different criteria. Section 6 presents a query example, and conclusions and future directions are presented in Section 7.

2 Interface and Query Specification

The interface and the query specification modalities are tightly connected in El Niño, and are arguably the most distinctive characteristics of the system. We started our work from some considerations on the meaning of an image. The same image can "mean" a number of different things depending on the particular circumstances of the query. Moreover, as a simple experiment can reveal, the meaning attributed by people to an image depends on the context in which the image is presented. For instance, when showing people the images of Fig. 2 and asking them to identify the image in the middle, the word "portrait" appears more frequently than the word "face." The opposite happens if we ask people to name the central image in Fig. 3. Context is essential for the determination of the meaning of an image and for the judgment of image similarity. Unfortunately, context is not something that we can codify easily. Context is a set of social and cultural conventions that depend critically on our role as participants in a network of interactions. Since cognition, in its full sense, is inextricably tied to context, we can't expect any significant cognitive ability from a context-free database.

Our solution is to let the user be constantly aware of the overall context in which the database is placing the images by displaying a configuration of images. A configuration is a set of images displayed in such a way that their mutual distance reflects their similarity as currently interpreted by the database. Figures 2 and 3 can be taken as examples of configurations. Note that, although the configurations contain in part the same images, the meanings they convey are quite different. In El Niño the minimal displayable unit is not a single image, but a configuration of images which, much like the configurations of Figs. 2 and 3, gives the user information on what kind of similarity measure the database is using (since, as we have seen, an image taken by itself does not convey any meaning). If we are looking for a portrait, and the database shows us the configuration of Fig. 2, we are on the right track. If the database shows us the configuration in Fig. 3, there is something wrong with our query.

El Niño gives the user tools for exploring the cognitive map. The user can pan and zoom the map, and apply other display operators such as fish-eye lenses. Browsing, as a supplement to querying, has already stimulated a certain interest in the image database arena [5, 2, 8]. The display portion of our interface can essentially be seen as browsing the database's cognitive map.

Another important aspect of our interaction model is the use of context feedback as the communication channel between the user and the database. The user does not intervene directly on the parameters of the distance measure but (as in relevance feedback [9]) selects a number of "positive" examples and a number of "negative" examples. In addition to standard relevance feedback, the user can place the examples in the relative positions that they should have.

Figure 2: A Modigliani portrait placed in a context that suggests "Painting."

For instance, assume that a user is looking for pictures of people. Although all the pictures of people would be marked as positive examples, the close-ups could be clustered together to indicate that they should be considered very similar, while the full-figure shots would be grouped together in another part of the display. In other words, rather than the user trying to understand the properties of the similarity measures used by the database, the database should use the user's categorization to develop new similarity measures. A user interaction using configuration feedback is shown schematically in Fig. 4. In Fig. 4.A the database proposes a certain distribution of images (shown as colored rectangles), representing the current similarity criterion. For instance, the green image is considered very similar to the orange one, and the brown to the purple. In Fig. 4.B the user moves some images around to reflect his own interpretation of the relevant similarities. The result is shown in Fig. 4.C. According to the user, the red and green images are quite similar to each other, and the brown image is quite different from them. The images that the user has selected form anchors for the determination of the new similarity criterion. Given the new position of the anchors, the database will redefine its similarity measure and change the configuration to that of Fig. 4.D. The red and the green images are in this case considered quite similar, and the brown quite different. Note that the result is not a simple rearrangement of the images in the interface. For practical reasons, an interface can't present more than a small fraction of the images in the database. Typically, we display the 100-300 most relevant

images. The reorganization involves the whole database: some images will disappear from the display (the purple image in Fig. 4.A), and some will appear (the yellow, gray, and cyan images in Fig. 4.D). This feedback mechanism constitutes the fundamental interaction mode between the user and the database, but other collateral tools can be attached to the interface. We introduce two: visual concepts and visual dictionaries. A visual concept is simply a set of images that the user regards as equivalent (or almost equivalent) and as the vehicle of a specific semantic value in the context of the current query. These images can be collected in a single unit that behaves in the interface like any other image (although it represents a more abstract unit than a single image). An interaction involving visual concepts looks like that in Fig. 5. Looking at the display of Fig. 5.A, the user still decides to consider the red and the green images as close to each other but, in addition to this, they are regarded as having enough semantic relevance in the current context to deserve a special status as a concept. The user opens a concept box and drags the images inside the box. The box is then used as an icon to replace the images in the display space. A visual concept works much like a cluster of images that are kept very close in the display space but, in addition to this, ancillary information can be attached to the concept box as metadata. So, if the user of a museum creates a concept called "mannerist madonna," the words "mannerist" and "madonna" can be used to replace the actual images in a query. The second tool (the visual dictionary) derives from the same desire to integrate textual and visual information.

Figure 3: A Modigliani portrait placed in a context that suggests "Face."

It is well known that attaching labels to a database suffers from two drawbacks: it is an extremely expensive operation (which limits its usefulness to applications with high added value), and it doesn't capture all the meanings of an image. In a visual dictionary, we label a subset of the database and use the results of a textual search as a starting point for the visual search. The structure of a visual dictionary is shown in Fig. 6. Let us assume that a user is looking for some romantic images of old cars on quiet country roads. We have a large database D of images, and a subset A ⊂ D that has been labeled (or for which every image has some text attached). Note that A might not contain the images that we are looking for, and that its labeling might be too coarse for the semantics that we are considering (e.g., images might not be labeled according to their "romanticity," or to whether they are city or country images). On the other hand, A will probably contain some examples of cars, and we will be able to retrieve them as a (partial) match to the query "old cars in country roads." Although these cars are not what we are looking for, we can use them as visual examples to start a visual query in the whole database D. The visual dictionary solves two major problems of text and visual databases:





- It overcomes the problems of labeling schemes. It is not necessary to label the whole database, or to try to capture in the text all the minutiae of an image. We don't expect a good answer from the visual dictionary, but just enough examples to start a visual query.

- It provides a convenient way to start a visual search. Apart from the idea of drawing a sketch of what one is looking for, there is no commonly accepted way of posing a query to a database. A visual dictionary is a tool for starting visual queries.

In addition to all this, the visual dictionary is easily integrated with visual concepts. A visual concept can be seen as a dictionary entry generated by the user in the context of a query. This observation highlights the problem of the scope and the life span of a visual concept: concepts could be limited to a session, to a user, or they could be shared among all users, thereby being permanently inserted into the visual dictionary. This is essentially a policy problem that is best resolved on an application-by-application basis, and whose implications go beyond the scope of this paper.
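As a rough illustration, the two-stage use of a visual dictionary can be sketched as follows. Everything here is an illustrative assumption (function names, the seed size, the way the engines are passed in), not El Niño's actual interface.

```python
# Sketch of a visual-dictionary query: a text search over the labeled
# subset A supplies seed images for a visual query over the whole
# database D. All names here are illustrative assumptions.

def dictionary_query(text_query, labeled_subset, database,
                     text_search, visual_search, n_seeds=20):
    """Return images from `database` visually similar to the best
    textual matches for `text_query` found in `labeled_subset`."""
    # Stage 1: textual search restricted to the labeled subset A.
    # We only need *some* plausible examples, not a good answer.
    seeds = text_search(text_query, labeled_subset)[:n_seeds]

    # Stage 2: use the seeds as positive examples of a visual query
    # over the complete database D.
    return visual_search(positive_examples=seeds, collection=database)


# Hypothetical usage for the example in the text:
# results = dictionary_query("old cars in country roads", A, D,
#                            text_engine.search, visual_engine.search)
```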

3 Architecture

El Niño is a collection of search engines connected to a mediator, which communicates with the user via a user interface (Fig. 7). In our current implementation, the interface is written in Java and made available through the web, so the user runs it from within a web browser. Each engine defines one or more feature spaces and a representation of images in them. In addition, each engine measures similarity between images using a similarity criterion that can be adapted by changing the values of a number of parameters. The similarity engines of El Niño will be considered in the next section, while this section analyzes the function of the mediator.
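The description above suggests a minimal engine interface. The sketch below is our own reading of it (class and method names are assumptions), collecting the operations the rest of the paper relies on: nearest-neighbor queries, pairwise distances, and a parameter vector that the feedback process can read and write.

```python
# A minimal sketch of the engine interface implied by the architecture.
# Names and signatures are illustrative assumptions, not El Nino's API.
from abc import ABC, abstractmethod
from typing import Sequence


class SimilarityEngine(ABC):
    @abstractmethod
    def k_nearest(self, referent: str, k: int) -> list[tuple[str, float]]:
        """Return (image handle, distance) pairs for the k images
        closest to the referent under the current criterion."""

    @abstractmethod
    def distance(self, handle_i: str, handle_j: str) -> float:
        """Distance between two images under the current criterion."""

    @abstractmethod
    def get_parameters(self) -> Sequence[float]:
        """Read the parameter vector w of the similarity criterion."""

    @abstractmethod
    def set_parameters(self, w: Sequence[float]) -> None:
        """Write a new parameter vector w (used by the mediator when
        the engine is not self-adapting)."""

    def adapt(self, constraints) -> bool:
        """Self-adapting engines override this to run their own
        optimizer on the user's distance constraints; the default
        reports that the mediator must optimize instead."""
        return False
```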


Figure 5: Interaction involving the creation of concepts.


Figure 4: Schematic description of an interaction using a direct manipulation interface.

3.1 The Mediator

The mediator in El Niño is assigned three different tasks: dispatching the queries to the right engines, managing the configuration feedback process, and managing the database name space.

3.1.1 Query Dispatch

The main activity of the mediator is the establishment of a similarity criterion and its use for the determination of the similarity between two images. This process is illustrated in Fig. 8. The similarity criterion is specified as a graph. The graph in Fig. 8, for instance, establishes that the similarity criterion should include text and color similarity, and that the results of the two criteria should be joined together using the and operator (the operator algebra of El Niño will be described in section 5). The queries T and C can contain parameters for the configuration of the query engines. For instance, the query C could include the information that color saturation should be disregarded for this particular criterion. A more sophisticated model of engine configuration will be discussed in the following section. Two engines (a text engine and a color engine) are allocated to serve the query, and the query parameters are used to configure them. Whenever the user asks to compute the distance between two images I and J in the database, the text engine returns a value s1(I, J), and the color engine returns a value s2(I, J). As specified by the query graph, these two scores must be joined together using the and operator: given the values s1(I, J) and s2(I, J), the and operator returns the value s(I, J) = f_∧(s1(I, J), s2(I, J)), which is returned as the similarity between image I and image J with respect to the current similarity criterion.

Once a similarity measure is set, El Niño allows two interrogation modalities: a user can ask for the first k images from a given reference point in the feature space (and the corresponding distances), or can ask to compute the distance between two images in the database. The first modality would be sufficient for an interface based on a simple browser that just displays the images closest to the query. Displaying configurations, however, requires knowledge of the distances between all pairs of displayed images, so that the images can be placed in their proper positions in the interface. Displaying a configuration entails, from the mediator's point of view, the following steps:

1. A criterion is established, either by parameters set in the criterion graph or as a result of configuration feedback (see below).

2. A query is issued to retrieve the first k images closest to a referent (in our geometric approach, the feature space is transformed so that the referent is always the origin, so there is no need to define it explicitly as part of the query; see section 4). These images form the display set D.

3. The distance modality is used to request the distances between all pairs of images in the set D. The interface uses these distances to determine the position of the images in the configuration (see section 2).
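A compact sketch of these three steps, as the mediator might run them against the engine interface sketched earlier. The multidimensional-scaling layout step (here scikit-learn's MDS) is our own assumption about how the interface turns pairwise distances into screen positions.

```python
# Sketch of the three steps that produce a configuration. The engine
# object and the MDS layout are illustrative assumptions.
import numpy as np
from sklearn.manifold import MDS


def build_configuration(engine, referent, k=100):
    # Step 2: retrieve the k images closest to the referent
    # (step 1, fixing the criterion, is assumed to have happened).
    neighbours = engine.k_nearest(referent, k)
    handles = [h for h, _ in neighbours]

    # Step 3: request all pairwise distances in the display set D.
    n = len(handles)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = engine.distance(handles[i], handles[j])

    # Interface side: place images so that screen distances reflect
    # the database distances (here with metric MDS).
    positions = MDS(n_components=2,
                    dissimilarity="precomputed").fit_transform(dist)
    return list(zip(handles, positions))
```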


Figure 6: Schematic representation of a search involving a visual dictionary.

Figure 7: The overall architecture of El Niño.

3.1.2 Configuration Feedback

The main modality of interaction between the interface and the engines is configuration feedback. The user selects a number of images on the screen and places them in positions that, for the purpose of the current query, reflect their mutual similarity. The similarity engines use this information to adapt their similarity criteria. For each engine, the distance function depends on a parameter vector w, which must be changed in order to adapt the similarity criterion in a way that reflects the user's requirements. This requires an optimization process, and El Niño recognizes two types of engines: self-adapting and non-self-adapting. Self-adapting engines contain an internal optimizer and run their own optimization process (Fig. 9). The mediator sends the engine the IDs of the images that the user has selected, and a table with the required distances between these images. The engine runs its own optimization process and determines the optimal value of the vector w. This operational mode minimizes the load on the mediator and the network communication, but requires the presence of an optimization algorithm in the engine. If an engine does not have an internal optimization procedure, the mediator can take care of the optimization (Fig. 10). In this case, the engine must provide procedures for reading and updating the parameter vector w and for determining the distance between two images given a parameter vector. The mediator runs its own optimizer and communicates to the engine the new values of the vector w.

The optimization process resulting from a configuration feedback is in general largely underconstrained. A user typically selects at most a dozen images for feedback, resulting in at most some 60 distance constraints. On the other hand, the similarity measure can easily depend on more than 100 parameters. We need to constrain the optimization problem in order to avoid overfitting: we are looking for a significant global change in the distance measure, not for a pathological adaptation of single parameters. We tackle this problem using the two concepts of space curvature and natural distance. For each feature space in every engine we define a natural distance. Formally, the definition of the natural distance is arbitrary but, intuitively, it corresponds to a somewhat neutral distance, isotropic and uniform over the space. This distance determines what we conventionally call a zero-curvature space (this is just a conventional designation: the actual Gaussian curvature of the space might not be zero). When the distance parameters change, the distance function changes too, and this results in a curvature of the feature space. We associate with the curvature determined by the parameters w a cost F_c(w, w_0), where w_0 is the parameter vector corresponding to the natural distance. If the configuration entered by the user is characterized by the distances d = [d_12, d_23, ..., d_{h-1,h}] (where d_ij is the requested distance between the i-th and the j-th images in the configuration), and if the parameters w induce a distance vector d'(w) between the same images, then we measure the mismatch by a cost F_d(d'(w) - d). The optimizer minimizes the criterion

F_d(d'(w) - d) + F_c(w, w_0).

In other words, the optimizer tries to find a solution that places the images in the configuration at the right distances and, at the same time, deviates as little as possible from the natural distance function.
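A sketch of the mediator-side adaptation for a non-self-adapting engine. The quadratic forms chosen for F_d and F_c, the relative weight between them, and the use of a general-purpose minimizer are our own assumptions; the paper only requires that the two costs penalize the constraint mismatch and the deviation from the natural distance.

```python
# Sketch of mediator-side configuration feedback for an engine that
# cannot optimize itself. Quadratic costs and Nelder-Mead are
# illustrative choices, not the paper's prescription.
import numpy as np
from scipy.optimize import minimize


def adapt_parameters(engine, pairs, target_d, w0, curvature_weight=0.1):
    """pairs: list of (handle_i, handle_j); target_d: distances the
    user imposed on those pairs by placing the images."""
    target_d = np.asarray(target_d, dtype=float)
    w0 = np.asarray(w0, dtype=float)

    def cost(w):
        engine.set_parameters(w)
        induced = np.array([engine.distance(i, j) for i, j in pairs])
        mismatch = np.sum((induced - target_d) ** 2)   # plays the role of F_d
        curvature = np.sum((w - w0) ** 2)              # plays the role of F_c
        return mismatch + curvature_weight * curvature

    result = minimize(cost, w0, method="Nelder-Mead")
    engine.set_parameters(result.x)
    return result.x
```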


Figure 8: Query dispatch process

Figure 10: Mediator-assisted adaptation of the similarity measure in a non-self-optimizing engine.




Figure 9: Adaptation of the similarity measure in a self-optimizing engine.

3.1.3 Database Name Space


Each image in the database is identified by a system-wide image handle, which is assigned by the mediator when the image is inserted into the database. Typically, all the engines contain the same images (this is not a strict requirement, but images that are not represented in some engines may fail to respond properly to queries involving those engines). The insertion process is represented schematically in Fig. 11 for a simple system containing only two engines. The insertion process is initiated by one of the engines. The administrator of the engine communicates the URL of the desired image to the engine. The engine, in turn, issues an insertion request to the mediator, with the URL of the image that is to be inserted. If the mediator grants the insertion, it generates a new handle and communicates it to the engine, authorizing the insertion. In the current version of the system, the mediator does not perform any check to guarantee that the image being inserted is not already present in the system, beyond checking for duplicate URLs. When the request is authorized, the engine downloads the image, extracts the proper features, and stores them in its internal index. At the same time, the mediator issues an insertion request to all the other engines in the system, specifying the URL of the image and the handle that has been assigned to it. Each engine will download the image, encode it, and store it in its own index. Finally, the mediator inserts the handle and the URL of the image in a location table so that the location can be made available to the user. Optionally, the mediator can download the image and create a thumbnail version in a local archive. A web server running on the same machine makes the thumbnails available to the interface through the web.
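The insertion protocol can be sketched as follows; the class layout, method names, and the handle numbering are illustrative assumptions that follow the description above rather than the actual implementation.

```python
# Sketch of the mediator side of image insertion: grant a handle,
# broadcast the insertion to every engine, record the location.
# Class and method names are illustrative assumptions.
from itertools import count


class Mediator:
    def __init__(self, engines):
        self.engines = engines          # all registered engines
        self.location_table = {}        # handle -> URL
        self._next_handle = count(1000)

    def insertion_request(self, url, requesting_engine):
        # Reject only exact duplicate URLs (the current system does
        # no deeper duplicate detection).
        if url in self.location_table.values():
            return None

        handle = next(self._next_handle)
        # Authorize the requesting engine, then ask every other
        # engine to download, encode, and index the same image.
        requesting_engine.insert(url, handle)
        for engine in self.engines:
            if engine is not requesting_engine:
                engine.insert(url, handle)

        # Record the location so the interface can fetch the image
        # (optionally, a thumbnail could be cached locally here).
        self.location_table[handle] = url
        return handle
```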

4 Search Engines

The architecture of El Niño is based on the idea of defining different search engines for the different needs of the system, and of integrating them using the mediator's operator algebra (which will be introduced in section 5). In this section we present the search engines that have been implemented so far as part of the system. There are currently three engines in El Niño: a "specialized features" engine, which uses a combination of standard image analysis techniques; an "image decomposition" engine, which uses a wavelet decomposition generated by a transformation group to represent images; and a textual engine, which implements the visual dictionary.


Figure 12: The feature vector of the SF engine.

Figure 11: Insertion of a new image in El Niño.

4.1 The Specialized Features Engine

We occasionally refer to the specialized features (SF) engine as "el cheapo," because it was put together in a short time (about a week) using pieces of "scrap" software found on the lab's disks, or pieces of publicly available software. Originally "el cheapo" was developed as a reference for measuring the performance of other engines that we developed or will develop in the future. The SF engine uses three types of features representing color, structure, and texture, which represent the standard choice of many existing databases (part of the interest in developing SF was to see how our interface could improve the performance of these standard techniques). Color and structure are represented by feature histograms, while texture is represented by the mean and variance of local features. Each feature is partially localized by dividing the image into a fixed number R of rectangles and computing the features separately for each rectangle. The organization of the feature vector of the SF engine is shown in Fig. 12. The feature vector contains three composite feature vectors, one for color, one for structure, and one for texture. Each composite feature vector contains R local feature vectors, one for each of the rectangles into which the image is divided. Each local feature is computed by a feature extractor. The three types of features (color, structure, texture) are extracted by each of the local feature extractors on each of the R rectangles as follows:


Color. Pixels are represented in the HSV color space, and each dimension in the color space is represented as a separate single dimensional histogram. Rather than dividing HSV into bins and computing the histogram, we compute the first three moments of the statistical distribution of the histogram directly from the pixel data, following the technique developed in [12]. Every local color feature is composed of 9 numbers: three statistical moments for each of the three color channels.


Structure. Structure is represented as a histogram of edge directions. Eight filters are used to compute the strength of edges along eight different directions and, for each direction, a histogram of the strength of the edges along that direction is computed. As in the color case, we represent each histogram by its first three statistical moments. Each local structure vector is therefore represented by 24 numbers (three moments for each of the eight directions).

Texture. Texture is represented using Manjunath and Ma's Gabor features [6]. For every point in the rectangle, we apply 30 filters (six directions and five scales) and take the norm of the complex number resulting from the application of each filter. A texture is represented by the averages and variances of each of the components of the feature vector as they vary across the rectangle, for a total of 60 numbers.
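As an illustration of the color component, the sketch below computes three moments of each HSV channel over one rectangle, in the spirit of the moment-based representation described above; the exact normalization of the moments is our own assumption, not necessarily the one used in [12].

```python
# Sketch of the local color feature: three moments (mean, standard
# deviation, cube root of the third central moment) of each HSV
# channel over one rectangle. The normalization is an assumption.
import numpy as np


def local_color_feature(rect_hsv):
    """rect_hsv: array of shape (h, w, 3) with H, S, V channels."""
    feature = []
    for c in range(3):
        channel = rect_hsv[:, :, c].astype(float).ravel()
        mean = channel.mean()
        centered = channel - mean
        std = np.sqrt(np.mean(centered ** 2))
        third = np.cbrt(np.mean(centered ** 3))
        feature.extend([mean, std, third])
    return np.array(feature)          # 9 numbers per rectangle
```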


For each local feature, a distance function is defined to compute the distance between two instances of that feature. In the SF engine, the distance between two local feature vectors x and y is always a weighted Minkowski distance of the type

d_L(x, y) = ( Σ_i w_i |x_i - y_i|^p )^(1/p)

with p > 0. The weights w_i are set independently for each rectangle as a result of the configuration feedback process. The composite feature computes the distance between two vectors as a weighted sum of the distances between the corresponding local features. Finally, the distances relative to the three composite features are weighted and put together into the distance of the complete feature vector using one of the "and" operators defined in section 5.
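Putting the pieces together, the SF distance computation can be sketched as below; the flat layout of the weight vectors is our own simplification.

```python
# Sketch of the SF distances: a weighted Minkowski distance between
# local feature vectors, and a weighted sum of local distances for a
# composite feature. The data layout is a simplifying assumption.
import numpy as np


def local_distance(x, y, w, p=2.0):
    """Weighted Minkowski distance d_L(x, y) between two local
    feature vectors, with per-component weights w and p > 0."""
    return np.sum(w * np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)


def composite_distance(locals_x, locals_y, component_weights,
                       rect_weights, p=2.0):
    """Weighted sum of local distances over the R rectangles of one
    composite feature (color, structure, or texture)."""
    return sum(
        rw * local_distance(x, y, cw, p)
        for x, y, cw, rw in zip(locals_x, locals_y,
                                component_weights, rect_weights)
    )
```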

4.2 The Image Decomposition Engine

The image decomposition engine is based on the idea of deriving a feature vector general enough that, in principle, every conceivable similarity criterion can be expressed using it. Selecting a particular feature vector in general constrains the types of similarity we can represent. It is well known, for instance, that using global histograms it is impossible to enforce any similarity criterion that depends on the spatial distribution of features. In order to guarantee the generality of the features, we impose a reconstruction constraint: the feature set of an image must be general enough to allow (at least in principle) the reconstruction of the complete image. The feature space is derived from a multiresolution decomposition of the image generated by a discrete subgroup of a suitable transformation group. The choice of the transformation group is determined by invariance considerations [10]. As an example, consider the use of the affine group to generate the transform. An element of the group is determined by three parameters: the two spatial coordinates x and y, and the scale parameter s. In a color image, the value of a pixel is an element of C, the color manifold, which is embedded in R^3. A pixel in an image is an element of the six-dimensional space G × C, where G is the affine transformation group. We call this the image space. An image is a set of elements in this space. Both the transformation group and the color space can be endowed with a natural metric, which makes it possible to define the distance between two coefficients. With the distance between two coefficients it is possible to define the distance between two sets (i.e., between two images) in a standard way. Defining the distance between two images as the distance between two sets helps us keep the dimensionality of the space low and avoid many of the problems of the "dimensionality curse."

Most multiresolution decompositions will generate intolerably large numbers of coefficients. For instance, the decomposition of an image of size 128 × 128 generates about 21,000 coefficients, each of which is represented by 6 numbers. Fortunately this representation is highly redundant and, for database applications, we don't need the complete reconstruction of the image. Since the image is represented as a set of coefficients in a metric space, it is natural to use vector quantization to reduce the number of coefficients necessary. We can represent images with as few as 50 coefficients and still obtain acceptable results, although usually we use 100 or 200 coefficients.

A similarity criterion is determined by imposing a metric on the image space. Thanks to the generality of our feature representation, virtually all similarity criteria can be represented as a metric in the image space. A mixture of Gaussians model is used to create a probability distribution that contains most of the features of the positive examples and as little as possible of the features of the negative examples. The metric of the image space is then derived from this distribution. As an illustration, Fig. 13.a shows a number of geodesics (lines of minimal distance) of a hypothetical two-dimensional image space with no samples selected. The image space is Euclidean, and the geodesics are straight lines. If the user selects a number of images with a concentration of features around the point (0.4, 0.4), the image space is distorted, and its geodesics become as in Fig. 13.b. If the user selects a set of images with two concentrations of features (around (0.2, 0.2) and (0.8, 0.8)), the geodesics become those of Fig. 13.c.

Figure 13: Geodesics in the query space for (a) Euclidean space, (b) space with one concentration of features, and (c) space with two concentrations of features. The black lines with the circles show the feature concentration areas.

More general spatial distributions of the positive and negative examples generate more complex distance measures. The geometry of the image space implicitly defines categorization in El Niño. The samples collected by the user form the context from which conceptualization emerges. Regions rich in relevant features are singled out by the choice of the metric. These categories do not rely on a predefined ontology, as is the case for simple schemes based on weighting distances computed on predefined features.
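The set-to-set distance is only described abstractly above. A minimal sketch under our own assumptions (a weighted Euclidean metric on the transform coefficients and a symmetric average-minimum set distance, which is just one of the "standard" choices mentioned in the text) could look like this.

```python
# Sketch of a distance between two images represented as sets of
# transform coefficients (x, y, scale, color components). The
# coefficient metric and the symmetric average-minimum set distance
# are illustrative choices, not El Nino's actual definitions.
import numpy as np


def coefficient_distance(a, b, weights):
    """Weighted Euclidean distance between two coefficients."""
    return np.sqrt(np.sum(weights * (a - b) ** 2))


def image_distance(coeffs_a, coeffs_b, weights):
    """Symmetric average of nearest-coefficient distances between the
    two coefficient sets (each an array of shape (n, d))."""
    def directed(src, dst):
        return np.mean([
            min(coefficient_distance(c, d, weights) for d in dst)
            for c in src
        ])
    return 0.5 * (directed(coeffs_a, coeffs_b) +
                  directed(coeffs_b, coeffs_a))
```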

4.3 The Text Engine

The text engine uses some elementary techniques from information retrieval to associate labels with images and to retrieve images based on the labels. Label association is done automatically whenever possible. In the case of web images, we parse the page and collect the text surrounding the IMG tag from which the URL of the image was derived, using a technique similar to that used by Smith and Chang in their system WebSeek [11]. The text goes through two processing stages (Fig. 14). The first stage (common words removal) removes all the words that are so common in the language that they do not provide any information about the content of the image. Articles, prepositions, and the like are included in this category. We also include time specifications (today, yesterday, and so on), since they might no longer be valid when the images are retrieved from the database. The second stage (stemming) attempts to extract the root of a word. Suffixes like the "s" of the plural or the "ed" of the past tense are removed in this phase.

Stemming is prone to error. Some words have derivations with radically different roots, like "go" and "went," while words with different meanings can originate from the same root and differ only in the suffix, like "terrible" and "terrific." In information retrieval systems these problems are solved with ad hoc rules. In the current text engine of El Niño we simply ignore them. We are planning to include a more sophisticated text search engine in future versions.

Once keywords have been assigned to the images in the visual dictionary, we use the vector space similarity technique to determine the match between the query and an image. Let I be an image with an associated set A of keywords. Every word in A receives a weight 1/√|A|. Similarly, the query contains a set B of terms, each one with a weight 1/√|B|. The similarity between the query and image I is given by

|A ∩ B| / √(|A| |B|).

This measure has a simple geometric interpretation. The lists of keywords are seen as vectors in a (very high dimensional) space with one axis for each word in the dictionary. The vector corresponding to a list has component 1 along an axis if the corresponding keyword is part of the list, and 0 otherwise. The previous similarity function is equal to the cosine of the angle between the image vector and the query vector.
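A toy version of the text engine's pipeline and matching score follows. The stop-word list and the suffix-stripping rules are drastic simplifications of our own, not the actual rules used in El Niño; the score is the keyword cosine defined above.

```python
# Toy sketch of the text engine: stop-word removal, naive suffix
# stripping, and the keyword cosine score |A ∩ B| / sqrt(|A| |B|).
# Word lists and stemming rules are drastic simplifications.
import math
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "his", "her",
              "just", "outside", "today", "yesterday", "tomorrow"}


def keywords(text):
    words = re.findall(r"[a-z]+", text.lower())
    stems = set()
    for word in words:
        if word in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "es", "s"):   # naive stemming
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        stems.add(word)
    return stems


def text_similarity(query, label):
    a, b = keywords(query), keywords(label)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

The returned value is the cosine of the angle between the two binary keyword vectors, as in the geometric interpretation above.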

5 Similarity Algebra

The mediator contains a number of operators to put together the results of the different engines. The operators take two similarity measures relative to two engines and transform them into a new similarity measure resulting from their combination. The operators define an algebra in the space of distance functions [1]. These operators act on the distance functions defined by the single engines. All the distance functions are defined on F × F (where F is an appropriate feature space) and take values in [0, 1]. Consider a query Q, and let s_{c1}(Q, I) ∈ [0, 1] be the similarity between the query Q and image I according to the criterion c1 implemented by engine e1. We can interpret this value as the truth value of the predicate "image I is like the query Q according to criterion c1." Similarly, we have the value s_{c2}(Q, I) representing the similarity between Q and I according to criterion c2. We make the following hypothesis: the similarity of Q and I with respect to the criterion "c1 and c2" (resp. "c1 or c2") depends only on the values s_{c1}(Q, I) and s_{c2}(Q, I). In this case, we can write

s_{c1 ∧ c2}(Q, I) = ∧(s_{c1}(Q, I), s_{c2}(Q, I))

(and similarly, with a function ∨, for the or connective) [4]. The function ∧ (resp. ∨) implements the and (or) operator in our system. Similarly, we define the not operator as s_{¬c}(Q, I) = ¬(s_c(Q, I)). Formally, in El Niño a distance is a function in L²[F × F; R+], which is a Hilbert space, and the operators form an algebra on this space. On this Hilbert space we define the two operators and (∧) and or (∨). The and operator, for instance, has the signature

∧ : L²[F × F; R+] × L²[F × F; R+] → L²[F × F; R+]

and similarly for the or operator. The not operator ¬ is defined on L²[F × F; R+].

In order to allow query optimization, it is important that the two operations be distributive. If d1, d2, d3 are distance measures, then we should have

d1 ∧ (d2 ∨ d3) = (d1 ∧ d2) ∨ (d1 ∧ d3)

and

d1 ∨ (d2 ∧ d3) = (d1 ∨ d2) ∧ (d1 ∨ d3).

In standard logic, there are three important properties involving the negation operator (¬): the two De Morgan theorems and the involutive property of the negation. These can be written as

¬(d1 ∧ d2) = (¬d1) ∨ (¬d2),   ¬(d1 ∨ d2) = (¬d1) ∧ (¬d2),   ¬(¬x) = x.

Figure 14: Text processing to create labels for the visual dictionary engine.

In the class of operators that we use, we can make one, and only one, of the three relations valid. The only operator definitions that make it possible to satisfy all three relations are

d1 ∧ d2 = max(d1, d2),   d1 ∨ d2 = min(d1, d2),   ¬d = 1 - d.

(The use of max and min here is inverted with respect to the norms established, for instance, in fuzzy logic; the reason is that we are dealing with distances rather than similarities.) The min and max operators have the disadvantage that, for any values of d1 and d2, ∧(d1, d2) and ∨(d1, d2) depend on only one of them. For instance, assume that we have two criteria of similarity, s1 and s2, and two images I1 and I2, and that the similarity between each image and the query according to the two criteria is given by the following table:

        s1     s2
  I1    0.1    0.1
  I2    0.1    0.99

Intuitively, the similarity to the query with respect to "s1 and s2" should be higher for image I2 but, if we use the min function, that value is the same for the two images. Fagin [4] proposed the use of two classes of dual functions that he called norms and co-norms. Each norm has a dual co-norm (as a particular case, min is a norm, and max is its dual co-norm). Using a norm for the ∧ function, the dual co-norm for the ∨ function, and the not function above, the De Morgan theorems and involution are satisfied. Unfortunately, none of Fagin's functions (except for min and max) make ∧ and ∨ distributive. A class of operators that generates a distributive algebra is obtained as follows. Consider a monotonically increasing function g : [0, 1) → R such that g(0) = 0 and g(x) → ∞ for x → 1. The inverse g⁻¹ is defined for all x ≥ 0, and takes values in [0, 1). The logic operators can then be defined as

d1 ∧ d2 = g⁻¹(g(d1) · g(d2)),   d1 ∨ d2 = g⁻¹(g(d1) + g(d2)).

This definition enforces distributivity, but forces us to make some compromise about the negation operator. In particular, it is possible to choose the negation operator so as to satisfy exactly one of the properties above (i.e., one of the two De Morgan theorems, or the involution property), but it is impossible to satisfy more than one. Whether these operators are sufficiently better than min and max to justify renouncing such important properties is essentially an empirical matter. A series of experiments conducted to measure the agreement between the "and" operator and the user's intuitive notion of "and" gave the results of Table 1. The table reports the weighted displacement index defined in [3] (the lower the number, the better the result). Analysis of variance reveals that the difference is significant within a 5% probability of error.

Table 1: Average (W̄_q) and variance (σ²_q) of the weighted displacement for the two similarity algebras considered.

                            W̄_q    σ²_q
  min/max operators         0.3    0.12
  distributive operators    0.2    0.05
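The two operator families can be sketched as follows. The generator g(x) = x / (1 - x) is only one admissible example of a function with the required properties (increasing, g(0) = 0, diverging as x approaches 1); it is our choice for illustration, not necessarily the one used in El Niño.

```python
# Sketch of the two distance-combination algebras discussed above.
# g(x) = x / (1 - x) is just one admissible generator (our choice).

def and_minmax(d1, d2):          # "and" on distances = max
    return max(d1, d2)

def or_minmax(d1, d2):           # "or" on distances = min
    return min(d1, d2)

def not_op(d):
    return 1.0 - d


def g(x):
    return x / (1.0 - x)

def g_inv(y):
    return y / (1.0 + y)

def and_distributive(d1, d2):
    return g_inv(g(d1) * g(d2))

def or_distributive(d1, d2):
    return g_inv(g(d1) + g(d2))


# Example: combining the text and color distances of one image.
if __name__ == "__main__":
    d_text, d_color = 0.3, 0.6
    print(and_minmax(d_text, d_color), and_distributive(d_text, d_color))
```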

6 A Query Example

This section presents a brief example of the use of El Niño and its interface. The interface of El Niño is shown in Fig. 15 with a configuration of random images displayed. The user was asked to look for a certain group of car images that look like the image in Fig. 16 (not necessarily that particular one). In the display of Fig. 15 there are no cars like the one we are looking for, but there are a couple of cars. The user selected one (the one subjectively most similar to the target) and marked another one with a cross. The second car image will be used to determine the similarity measure of the database, but it will not be considered a query example (the database will not try to return images similar to that car); its value is in the establishment of a reference distance between two examples of cars.

After a few interactions, the situation is that of Fig. 17 (for the sake of clarity, from now on we only show the image space rather than the whole interface). Two images of the type that we are looking for have appeared. They are selected and placed very close to one another. A third car image is placed at a certain distance but excluded from the search process. Note that the selection process is being refined as we proceed: during some of the previous interactions, the car image that is now excluded was selected as a positive example because, relative to what was presented at the time, it was relatively similar to our target. Now that we are "zeroing in" on the images that we are actually interested in, the red car is no longer similar to what we need.

At the next iteration, the situation is that of Fig. 18. At this point we have a number of examples of the images we are looking for. Further iterations (e.g. with the selection represented in the figure) can be used to obtain more examples of that class of images. Note that the "negative" examples are placed much farther away from the positive examples than in the previous case. This will lead to a more discriminating distance measure which, in effect, will try to zoom in on the class of images we are looking for.


Figure 15: The interface of El Niño at the beginning of a search process.

Figure 16: One of our target images.

Figure 17: The interface of El Niño during the search process.

Figure 18: The interface of El Niño during the search process.

7 Conclusion

El Niño is an attempt to overcome the limitations of traditional search engines and, more specifically, to find an alternative to the well known query-by-example paradigm. The interaction model of El Niño is based on the following two ideas:

- Context is essential for the determination of the meaning of an image. Contextuality is enforced by showing the user, rather than a set of images, a configuration of the feature space. In a configuration, the images are not just presented, but placed in such a way that their mutual distance reflects their similarity as currently interpreted by the database.

- The user never manipulates the parameters of the distance measure directly. Rather, interaction proceeds entirely through an augmented form of relevance feedback that we call configuration feedback.

The functional specifications of El Niño are based on results on user interaction with traditional image repositories. For instance, [7] is a study of the search patterns of journalists looking for images to illustrate stories. The issues that El Niño addresses are:

- Images are searched for following several different models that are very hard to integrate. In [7] it was found that about half of the images were searched for by some very specific label ("I need an image of Frank Zappa"). Others are searched for by a more generic situation ("I want an image of a rock concert"). Other searches involve characteristics that are hard to characterize with labels ("a quiet countryside").

- Users don't like clear-cut answers; a certain amount of browsing is appreciated [7]. On the other hand, browsing should not take too much time. When browsing is done manually (as in the case of most existing archives), users don't want to browse more than a few dozen images.

These issues conditioned the design of El Niño at all levels. The most evident influence is in the design of the user interface. Starting from an analysis of the meaning of images, we designed an interface in which browsing and searching are indistinguishable and integrated. The meaning of images emerges from the interaction of the user with the database, and is "discovered" rather than being encoded into the database. The same issues also condition the architectural design of El Niño. Since we need to deal with several query models at the same time (ranging from visual browsing to keyword queries), we need to place special emphasis on the integration of different similarity engines. Our architecture, based on a mediator which integrates different engines and logic-like operators that put together their responses, is designed with this necessity in mind.

References

[1] S. Adali, P. Bonatti, M. L. Sapino, and V. S. Subrahmanian. A multi-similarity algebra. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, pages 402-413, 1998.

[2] V. Chalana, M. Y. Jaisimha, D. R. Haynor, and E. Arbogast. Medplus: a medical image analysis and browsing environment. In Proceedings of the SPIE, vol. 3031 (Medical Imaging), 1997.

[3] A. Desai Narasimhalu, Mohan S. Kankanhalli, and Jiankang Wu. Benchmarking multimedia databases. Multimedia Tools and Applications, 4(3):333-355, May 1997.

[4] Ronald Fagin. Combining fuzzy information from multiple systems. In Proceedings of the 15th ACM Symposium on Principles of Database Systems, Montreal, 1996.

[5] R. R. Korfhage and K. A. Olsen. Image organization using VIBE, a visual information browsing environment. In Proceedings of the SPIE, vol. 2606 (Digital Image Storage and Archiving Systems, Philadelphia), 1995.

[6] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837-842, 1996.

[7] Marjo Markkula and Eero Sormunen. Searching for photos: journalists' practices in pictorial IR. In The Challenge of Image Retrieval: Papers Presented at a Workshop on Image Retrieval, University of Northumbria at Newcastle, February 1998.

[8] R. L. Plante, D. Goscha, R. M. Crutcher, J. Plutchak, R. E. M. McGrath, X. Lu, and M. Folk. Java, image browsing, and the NCSA astronomy digital image library. In Astronomical Society of the Pacific Conference Series, volume 125, 1996.

[9] Yong Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644-655, 1998.

[10] Simone Santini. Explorations in Image Databases. PhD thesis, University of California, San Diego, January 1998.

[11] J. R. Smith and Shih-Fu Chang. Visually searching the web for content. IEEE Multimedia, 4(3):12-20, 1997.

[12] Markus Stricker and Markus Orengo. Similarity of color images. In Proceedings of SPIE, Vol. 2420, Storage and Retrieval of Image and Video Databases III, San Jose, USA, pages 381-392, February 1995.