Beyond Query by Example

Simone Santini* and Ramesh Jain
Visual Computing Laboratory
University of California, San Diego
ssantini, [email protected]

Abstract

This paper considers some of the problems we found trying to extract meaning from images in database applications, and proposes some ways to solve them. We argue that the meaning of an image is an ill-defined entity, and that it is not in general possible to derive from an image the meaning that the user of the database wants. Rather, we should be content with a correlation between the intended meaning and simple perceptual clues that databases can extract. Rather than working on the impossible task of extracting unambiguous meaning from images, we should provide the user with the tools he needs to drive the database in the areas of the feature space where "interesting" images are.

1 Meaningless Responses

Try to remember the last time you used a web demo to experiment with an image database. Most, if not all, database demos have very similar interfaces. On one side you form a query, either by drawing a sketch or by selecting one of a set of random images that the database gives you. While doing this, you also select one or more similarity criteria on which you want to base your query. Typical choices are color, structure, or texture [2]. Sometimes you can use several criteria together [3], or you can draw areas and decide that the spatial relations between these areas are relevant to your query. On the other (metaphorical) half of your interface you have a browser. You hit the "go" button and, after a certain time, the browser displays the first n (typically 9 to 15) images in the database in order of similarity to the query. If you are like us, you will look at the results with mixed feelings. One or two of them will make perfect sense; some others will not. Consider Fig. 1. This is the result of a query done with one of the standard image database engines currently available. (The image in the top left corner is the query.) Some of the images returned are somewhat disappointing.

* This work was partially supported by the National Science Foundation under grant NSF-IRI-9610518.

Figure 1: Results of an example query; the query image is in the top left corner.

Yet, looking at them closely, it is possible to understand in most cases why the database returned them. The head of the woman in the third image superficially resembles the shape of the arch on top of the door in the query. The woman in the fourth image wears a vest similar in color to the door, and so on. We can explain in this way most of the images returned. Still, you are not happy. The fact is that you asked for an image similar to a door, and received images that semantically were not doors at all. In this paper, we propose and support the following explanation: determining the meaning of an image is an inherently ill-posed problem, since it depends on the "situatedness" of the observer as well as on the image data; we can, however, find a correlation between reasonable interpretations of an image and the simple perceptual clues that a database can use. With the right interface, the user can interact with the database to "drive it" into interesting regions of the feature space. There were problems in the example, essentially, because images can be similar at many different levels. Two images can be similar because they have the same dominant colors, because they are both paintings by Bruegel the Elder, because they both convey a sense of calm, because they both represent an old man and a dog, and so on. The database can interpret and use only some of these possible similarities, based on very primitive semantics. This results in breakdown [11]. The computer system operated in accordance with its own semantic categories. The breakdown occurs because at first the system gave the illusion of operating according to the same categories as the user

(i.e., some of the images returned are actually doors), and when these categories are violated, the user experiences frustration. If we assume that we have no annotations, and that the domain of the database is not overly restricted, it is virtually impossible to work with models with high semantic content. In our work we intentionally reject any form of object identification or region segmentation, and rely instead on simple perceptual clues. Users, on the other hand, reason on a different semantic level, one in which objects, and not perceptual clues, are the main concern. A flexible, perceptual approach can be useful only if there is a correlation between the two levels, that is, if the perceptual level can provide information about the semantic categories of interest to the user. We emphasized the word "correlation" to stress the fact that we do not look for an exact correspondence. That would be tantamount to doing unconstrained object recognition: an effort bound to run into the frame problem [6, 1]. We argue that for the retrieval problem a correlation is enough. The user will provide the missing information and the situatedness to drive the system towards the right images.

Figure 2: The referent of the apple query (a) and the cat query (b).

2 Where did semantics go?

What the user wants from a database is a semantically meaningful answer to a query. If we refuse any symbolic representation of semantic categories in favor of a more perceptual approach, we should ask whether this approach can be useful for the queries we have in mind. In absolute terms, the answer is no, as testified by the terrible difficulty of building an autonomous agent driven by vision. Our goal, however, is not to build a machine that sees, but to build a machine that will assist a user in activities that require perception. Our refusal of the anthropomorphic paradigm has shifted the problem significantly. The relevant question that we need to answer is no longer "can simple perceptual clues identify semantically meaningful objects," but: is there enough correlation between simple perceptual clues and semantically meaningful objects that the interaction between the user and the system will be meaningful?

The answer to that question is not unique. It depends not only on the similarity measurement, but on the whole system. In particular, in a social system, as opposed to an anthropomorphic one, we cannot ignore the role of the interface. Consider two queries, which are presented to the database as "query-by-example." The first ("apple") query uses the image of Fig. 2.a, and the second ("cat") query uses the image of Fig. 2.b. We submitted the queries to the similarity measurement system in [9]. How well does our simple perceptual engine capture some of the possible meanings associated with the images in Fig. 2? Let us start with the apple query: some possible interpretations for the image of Fig. 2.a are that it represents a fruit, a red object, a round object, and an apple. Fig. 3 shows the percentage of the first k images returned by the database that have these four semantics. These four semantic interpretations of the image are usually considered at very different "levels." Color is considered a very "low level," or perceptual, attribute, while being a fruit or an apple is a cognitive attribute. The fact that a significant percentage of the results returned by the database are in effect fruits or apples indicates a correlation between perceptual and cognitive semantics that our system can exploit.


Figure 3: The percentage of the first k images returned by the database that are fruits, red objects, round objects, and apples.

The apple query is relatively simple: a single fruit in the middle of a white background. Would some kind of semantic interpretation emerge even from a more complicated image? The "cat" query can help us answer this question. In this case, we tried to attach two semantics to our image: it represents a cat, and it represents an animal. Fig. 4 shows how many of the first k answers returned by the system are cats or animals. When we try to interpret the meaning of the images returned by the database, the results are not perfect: only a fraction of the top images have the meanings that we were looking for. This is unavoidable, since we intentionally avoid modeling the user's "high level" semantic interpretation, and rely on some form of correlation between the two independent perceptions of the database and the user. In this approach, the traditionally non-perceptual parts of an image database (query formulation, result display, and so on) play a vital role. They must deal with the uncertainty induced by the search engine, and let the user explore continuously the perceptual space of the database. It is essential to provide the user with tools for the exploration of the database's perceptual space [8].


Figure 4: The percentage of the first k images returned by the database that are cats or animals.
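The quantity plotted in Figs. 3 and 4 is simply the fraction of the first k returned images that carry a given semantic label. A minimal sketch of this measurement follows, assuming a hypothetical ranked result list and a hand-made label table; neither the identifiers nor the labels are part of the system described here.

```python
# Fraction of the top-k results that carry a given semantic label
# (the quantity plotted in Figs. 3 and 4). Labels are assigned by hand.
from typing import Dict, List, Set

def semantic_fraction(ranking: List[str], labels: Dict[str, Set[str]],
                      label: str, k: int) -> float:
    """ranking: image ids ordered by decreasing similarity to the query."""
    top = ranking[:k]
    hits = sum(1 for img in top if label in labels.get(img, set()))
    return hits / len(top)

# Hypothetical example: a ranking returned for the apple query.
ranking = ["img07", "img12", "img03", "img44", "img21"]
labels = {"img07": {"apple", "fruit", "red", "round"},
          "img12": {"red", "round"},
          "img03": {"fruit", "round"},
          "img44": {"red"},
          "img21": {"fruit", "apple", "round"}}

for semantic in ("fruit", "red", "round", "apple"):
    print(semantic, semantic_fraction(ranking, labels, semantic, k=5))
```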

















Figure 5: Schematic description of an interaction using a direct manipulation interface.

3 Interface Design

Based on the observations in the previous two sections, we derive the following two guidelines for the design of interfaces:

1. The user should have a global view of the placement of images in the database. This global view will place every image in the context of other images that, given the current similarity criterion, are similar to it. This "cognitive map" of the database will make the correlation between the similarity criterion and the possible semantics more evident by placing similar images together. (In the first example above, if red apples are associated with cucumbers and bananas, the database has obviously captured the "fruit" semantics; if they are associated with red balls, the database is using the "red round" semantics.)

2. The user should be able to manipulate the environment in a simple and intuitive way. In many current interfaces, the user is given knobs or cursors corresponding to some database-defined quantities. For instance, the database may employ several feature extractors, and the user can be asked to choose their relative importance. This is a highly unintuitive interface and a poor use of the user's abilities. It is not immediately clear how asking for "more texture," as opposed to "less color," will change the database response.

Based on these principles, we tried to replace the query-answer model of interaction with a direct manipulation one. In our model, the database gives the user information about the status of the whole database, rather than just about a few images that satisfy the query. Whenever possible, the user manipulates the image space directly by moving images around, rather than manipulating weights or some other quantity related to the similarity measure currently used by the database. A user interaction using a direct manipulation interface is shown schematically in Fig. 5. In Fig. 5.A the database proposes a certain distribution of images (represented schematically as shapes) to the user. The distribution of the images

reflects the current similarity interpretation of the database. For instance, the pointed triangle is considered very similar to the eight-point star. In Fig. 5.B the user moves some images around to reflect his own interpretation of the relevant similarities. The result is shown in Fig. 5.C. According to the user, the triangle and the five-point star are quite similar to each other, and the circle is quite different from them. The images that the user has placed form the anchors for the determination of the new similarity criterion. The database then redefines its similarity measure, and returns with the configuration of Fig. 5.D. Note that the result is not a simple rearrangement of the images in the interface. The reorganization consequent to the user interaction involves the whole database. Some images will disappear from the display (the hexagon in Fig. 5.A), and some will appear (e.g. the black square in Fig. 5.D).

A slightly different operation on the same interface is the definition of visual concepts. A visual concept is a set of images that, for the purpose of the current application, can be considered as equivalent or almost equivalent. A user can decide to open a bucket and fill it with images that represent the same concept. The bucket can have metadata attached and can be placed on the interface like a regular image. A concept works much like a cluster of images that are kept very close in the display space. Visual concepts can also be transferred from one session to another and made available to other users to integrate application-related semantics into visual searches.

This approach requires a different and more sophisticated organization of the database in several respects:

1. The requirement for a contextual presentation that the user can manipulate requires a formal definition of one or more display spaces, in which the images are presented in a relation that resembles that established by the query. Images are placed in a high dimensional feature space. Given the impossibility of representing high dimensional spaces, we need to carefully define the relations between the display space and the feature space.

2. The database must accommodate arbitrary (or almost arbitrary) similarity measures, and must automatically determine the similarity measure based on the anchors and the concepts formed by the user.

3. The database should make optimal use of the semantics induced by the use of visual concepts.

In the following section we describe in greater detail the principles behind the design of direct manipulation interfaces. This paper deals with interfaces to image databases, so we will not consider point 2 above. The reader should refer to [7] for more details.

4 Direct Manipulation Interface

The direct manipulation interface is composed of three spaces and a number of operators. The operators can be transformations of a space onto itself or transformations from one space to another. The three spaces on which the interface is based are:

• The Feature space F. This is the space of the coefficients of a suitable representation of the image [9]. The feature space is a topological space, but not a metric one. There is in general no way to assign a "distance" to a pair of feature vectors.

• The Query space Q. When the feature space is endowed with a metric, the result is the query space. The metric of the query space is derived from the user query, so that the distance from the origin of the space to any image is the "score" of the image in the user query.

• The Display space D. This is a low dimensional space (0 to 3 dimensions) which is displayed to the user and with which the user interacts. The distribution of images in the display space is derived from that of the query space.

The feature space is a relatively fixed entity, and is a property of the database. The query space, on the other hand, is created anew with a different metric for every new query.

4.1 Operators in the Feature Space

The feature space is not completely immutable; some of the operators act on it. This is usually done for reasons of convenience (some queries may be faster on some particular transformation of the feature space), a concept not dissimilar from the formation of "stored views" in databases. In many cases, we need to adjust the feature vectors for the convenience of the database operations. A very common example is dimensionality reduction. In this case, we will say that we obtain a view of the feature space. The operators that act on the feature space are used for this purpose. The most common are projection (projecting the feature vectors onto a low dimensional space) and quantization.
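As a rough illustration of such a view, the sketch below projects feature vectors onto a few principal directions and then vector-quantizes them. The specific choices (PCA, a small k-means codebook, the dimensions) are illustrative assumptions, not the operators actually used in the paper.

```python
# Illustrative "view" of a feature space: projection followed by quantization.
# PCA and k-means are stand-ins for the generic projection and quantization
# operators mentioned in the text, not the paper's actual implementation.
import numpy as np

def make_view(features: np.ndarray, out_dim: int = 8, n_codes: int = 64,
              iters: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Projection: keep the first out_dim principal directions.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:out_dim].T
    # Quantization: a small k-means codebook over the projected vectors.
    codebook = projected[rng.choice(len(projected), n_codes, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(projected[:, None] - codebook[None], axis=2)
        assign = d.argmin(axis=1)
        for c in range(n_codes):
            members = projected[assign == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return vt[:out_dim], codebook, assign

# features: one row per feature vector (e.g., per transform coefficient).
features = np.random.default_rng(1).normal(size=(500, 32))
basis, codebook, assign = make_view(features)
```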

4.2 The Query Space

The feature space, endowed with a similarity measure derived from a query, becomes the query space. In the query space the "score" of an image is determined by its distance from the origin. The determination of the geometry of the query space is in general quite complicated, and is beyond the scope of this paper. We will just assume that every image is represented as a set of n numbers (which may or may not identify a vector in an n-dimensional vector space, as discussed in the previous section) and that the query space is endowed with a distance function that depends on m parameters. The feature sets corresponding to images x and y are represented by x^i and y^i, i = 1, ..., n, and the parameters by ξ^µ, µ = 1, ..., m. In the following, Latin indices will span 1, ..., n, and Greek indices will span 1, ..., m. Also, to indicate a particular image in the database we will use either different Latin letters, as in x^i, y^i, or an uppercase Latin index. So x_I is the I-th image in the database (1 ≤ I ≤ N), and x^j_I is the corresponding feature vector. Since this notation can be quite confusing, we will try to avoid it whenever possible, and use x^i, y^i instead. The parameters ξ^µ are a representation of the query, and are the values that determine the distance function. Given the parameters ξ^µ, the distance function in the query space can be written as

\[
f : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^+ : \quad (x^i, y^i, \xi^\mu) \mapsto f(x^i, y^i; \xi^\mu)
\tag{1}
\]

with f ∈ L^2(ℝ^n × ℝ^n × ℝ^m, ℝ^+). Depending on the situation, we will write f_ξ(x^i, y^i) in lieu of f(x^i, y^i; ξ^µ).
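As a concrete, if simplified, illustration of a parametric distance f(x, y; ξ), the sketch below uses a diagonal quadratic form whose weights play the role of the parameters ξ^µ. The metrics actually used in El niño are more general (Riemannian), so this is only a stand-in.

```python
# A simple parametric distance f(x, y; xi): a weighted Euclidean distance
# whose weights play the role of the query parameters xi^mu. The real system
# uses more general (Riemannian) metrics; this is only an illustration.
import numpy as np

def f(x: np.ndarray, y: np.ndarray, xi: np.ndarray) -> float:
    """Distance between feature sets x and y under parameters xi (xi >= 0)."""
    diff = x - y
    return float(np.sqrt(np.sum(xi * diff * diff)))

# The "score" of an image is its distance from the query placed at the origin.
def score(x: np.ndarray, xi: np.ndarray) -> float:
    return f(np.zeros_like(x), x, xi)

x = np.array([0.2, 0.5, 0.1])
y = np.array([0.3, 0.4, 0.7])
xi = np.array([1.0, 2.0, 0.5])   # one weight per feature dimension (m = n here)
print(f(x, y, xi), score(x, xi))
```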

As defined in the previous section, the feature space per se is topological but not metric. Rather, its intrinsic properties are characterized by the functional

\[
L : \mathbb{R}^m \to L^2(\mathbb{R}^n \times \mathbb{R}^n, \mathbb{R}^+)
\tag{2}
\]

which associates to each query ξ^µ a distance function:

\[
L(\xi^\mu) = f(\cdot, \cdot\,; \xi^\mu).
\tag{3}
\]

The query can also be seen as an operator χ_ξ which transforms the feature space into the query space. If L is the characteristic functional of the feature space, then χ_ξ L = L(ξ) is the metric of the query space. Once the feature space F has been transformed into the metric query space Q, other operations are possible, such as:

Distance. Given a feature set x^i, return its distance from the query:

\[
D(x^i) = f(0, x^i; \xi^\mu)
\tag{4}
\]

Select by Distance. Return all feature sets that are closer to the query than a given distance:

\[
S(d) = \{ x^i : D(x^i) \le d \}
\tag{5}
\]

k-Nearest Neighbors. Return the k images closest to the query:

\[
N(k) = \{ x^i : |\{ y^i : D(y^i) < D(x^i) \}| < k \}
\tag{6}
\]

It is necessary to stress again that these operations are not defined in the feature space F, since that space is not endowed with a metric. Only when a query is defined does a metric exist, and these operations make sense. This unfortunately represents a complication from the point of view of indexing. Indexing cannot be tailored to the feature space F, but must take into account the presence of the different metrics that are possible in the query space Q.
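A minimal sketch of the three query-space operations of Eqs. (4)-(6), reusing the illustrative parametric distance from the previous sketch (again an assumption rather than the paper's engine):

```python
# Query-space operations of Eqs. (4)-(6): score, range selection, and k-NN.
# The weighted Euclidean distance is the same illustrative stand-in as before.
import numpy as np

def distance(x: np.ndarray, xi: np.ndarray) -> float:
    """D(x) = f(0, x; xi): distance of feature set x from the query."""
    return float(np.sqrt(np.sum(xi * x * x)))

def select_by_distance(X: np.ndarray, xi: np.ndarray, d: float) -> np.ndarray:
    """S(d): indices of all feature sets closer to the query than d."""
    scores = np.array([distance(x, xi) for x in X])
    return np.where(scores <= d)[0]

def nearest_neighbors(X: np.ndarray, xi: np.ndarray, k: int) -> np.ndarray:
    """N(k): indices of the k feature sets closest to the query."""
    scores = np.array([distance(x, xi) for x in X])
    return np.argsort(scores)[:k]

X = np.random.default_rng(0).random((100, 3))   # 100 images, 3 features
xi = np.array([1.0, 2.0, 0.5])                   # query parameters
print(select_by_distance(X, xi, 0.8)[:5], nearest_neighbors(X, xi, 5))
```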

4.3 The Display Space

The display operator φ projects image x^i onto the screen position X^Σ, Σ = 1, 2,¹ in such a way that

\[
d(X^\Sigma, Y^\Sigma) \approx f(x^i, y^i; \xi^\mu)
\tag{7}
\]

This projection is done by an operator that we write:

\[
\phi(x^k_I; f_\xi) = (X^\Sigma_I, \emptyset).
\tag{8}
\]

The parameter f_ξ reminds us that the projection that we see on the screen depends on the distribution of images in the query space which, in turn, depends on the query parameters ξ^µ. The notation (X^Σ_I, ∅) means that the image x_I is placed at the coordinates X^Σ_I in the display space, and that there are no labels attached to it (that is, the image is not anchored at any particular location of the screen, and does not belong to any particular visual concept). A configuration of the display space is obtained by applying the display operator to the whole query space:

\[
\phi(Q) = \phi(F; f_\xi) = \{ (X^\Sigma_I, \aleph_I) \}
\tag{9}
\]

¹ Capital Greek indices will span 1, 2. In this section we will refer to a two dimensional display space which, in our experience, is the most common case. The arguments presented apply, mutatis mutandis, to display spaces of dimensions from 1 to 3.

As we said before, it is impractical to display the whole database. More often, we display only a limited number P of images. Formally, this can be done by applying the P-nearest neighbors operator to the space Q:

\[
\phi(N(P)(Q)) = \phi(N(P)(F; f_\xi)) = \{ (X^\Sigma_I, \aleph_I) \}, \quad I = 1, \ldots, P
\tag{10}
\]

where ℵ_I is the set of labels associated to the I-th image. The display space D is the space of such configurations.
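The display operator of Eqs. (7)-(10) asks for screen positions whose mutual distances approximate the query-space distances. One crude way to realize this, sketched below, is classical stress minimization (an MDS-style gradient descent); the paper does not prescribe this particular algorithm, so treat it as one possible realization.

```python
# One possible realization of the display operator phi: place the P nearest
# neighbors of the query on a 2-D screen so that screen distances roughly
# match query-space distances (Eq. 7), by minimizing a simple stress function.
# This specific MDS-style procedure is an illustrative choice, not the paper's.
import numpy as np

def display(dist: np.ndarray, steps: int = 500, lr: float = 0.01,
            seed: int = 0) -> np.ndarray:
    """dist: P x P matrix of query-space distances among the displayed images.
    Returns P x 2 screen coordinates X_I^Sigma."""
    rng = np.random.default_rng(seed)
    P = dist.shape[0]
    X = rng.normal(scale=0.1, size=(P, 2))
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]          # pairwise displacements
        d = np.linalg.norm(diff, axis=2) + 1e-9        # current screen distances
        grad = ((d - dist) / d)[:, :, None] * diff     # gradient of the stress
        X -= lr * grad.sum(axis=1)
    return X

# Hypothetical 4-image example with a hand-made distance matrix.
dist = np.array([[0.0, 0.3, 0.8, 0.9],
                 [0.3, 0.0, 0.7, 0.8],
                 [0.8, 0.7, 0.0, 0.2],
                 [0.9, 0.8, 0.2, 0.0]])
print(display(dist))
```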

With these definitions, we can describe the operators that the user has available to manipulate the display space.

The Place Operator. The place operator moves an image from one position of the display space to another, and attaches a label α to the image to "glue" it to its new position. The operator that places the I-th image in the display is ζ_I : Q → Q with

\[
\zeta_I\bigl(\{ (X^\Sigma_J, \aleph_J) \}\bigr) =
\bigl(\{ (X^\Sigma_J, \aleph_J) \} - \{ (X^\Sigma_I, \aleph_I) \}\bigr)
\cup \{ (\tilde{X}^\Sigma_I, \aleph_I \cup \alpha) \}
\tag{11}
\]

where X̃ is the position given to the image by the user.
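A display configuration can be modeled as a map from image identifiers to a (position, labels) pair; the place operator then just rewrites one entry and adds the anchor label. The sketch below is a schematic data-structure view, with the anchor label name chosen arbitrarily.

```python
# Schematic view of a display configuration {(X_I, aleph_I)} and of the place
# operator zeta_I of Eq. (11). The label name "anchor" is an arbitrary choice.
from typing import Dict, Set, Tuple

Position = Tuple[float, float]
Config = Dict[str, Tuple[Position, Set[str]]]   # image id -> (X_I, aleph_I)

def place(config: Config, image_id: str, new_pos: Position,
          anchor_label: str = "anchor") -> Config:
    """Move image_id to new_pos and glue it there with an anchor label."""
    pos, labels = config[image_id]
    updated = dict(config)                       # leave the old configuration intact
    updated[image_id] = (new_pos, labels | {anchor_label})
    return updated

config: Config = {"img01": ((0.1, 0.2), set()),
                  "img02": ((0.8, 0.5), set())}
config = place(config, "img01", (0.4, 0.4))
print(config["img01"])
```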





 



Visual Concept Creation. A visual concept is a set of images that, conceptually, occupy the same position in the display space and are characterized by a set of labels. Formally, we will include in this set the keywords associated to the concept as well as the identifiers of the images that are included in the concept. So, if the concept contains images I_1, ..., I_k, the set of labels is

\[
\lambda = (W, \{ I_1, \ldots, I_k \})
\tag{12}
\]

where W is the set of keywords. We call Λ the set of concepts, and we will use the letter λ to represent a concept. The creation of a concept is an operator κ : D → Λ defined as

\[
\kappa\bigl(\{ (X^\Sigma_J, \aleph_J) \}\bigr) = (W, \{ I_1, \ldots, I_k \}) = \lambda
\tag{13}
\]

Visual Concept Placement. The insertion of a concept λ at a position Z^Σ of the display space is defined as the action of the operator η : Λ × ℝ² × D → D defined as

\[
\eta\bigl(\lambda, Z^\Sigma, \{ (X^\Sigma_J, \aleph_J) \}\bigr) =
\bigl(\{ (X^\Sigma_J, \aleph_J) \} - \{ (X^\Sigma_{I_k}, \aleph_{I_k}) \}\bigr)
\cup \{ (Z^\Sigma, \alpha \cup \lambda) \}
\tag{14}
\]
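In the same schematic data-structure view used above, a visual concept is just a set of keywords plus the identifiers of its member images (Eq. 12); creating one and placing it on the display (Eqs. 13-14) can be sketched as follows, with all names again illustrative.

```python
# Schematic view of visual concepts (Eqs. 12-14): a concept bundles keywords
# with member image ids; placing it removes the members from the display and
# puts a single concept entry at the chosen position. Names are illustrative.
from typing import Dict, Set, Tuple

Position = Tuple[float, float]
Config = Dict[str, Tuple[Position, Set[str]]]

def create_concept(keywords: Set[str], members: Set[str]):
    """lambda = (W, {I_1, ..., I_k})."""
    return (keywords, members)

def place_concept(config: Config, concept, pos: Position) -> Config:
    keywords, members = concept
    updated = {i: v for i, v in config.items() if i not in members}
    updated["concept:" + "/".join(sorted(keywords))] = (pos, set(keywords))
    return updated

config: Config = {"img01": ((0.1, 0.2), set()),
                  "img02": ((0.3, 0.1), set()),
                  "img03": ((0.9, 0.9), set())}
doors = create_concept({"door"}, {"img01", "img02"})
print(place_concept(config, doors, (0.2, 0.2)))
```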

4.4 Query Creation

When the user moves images around the interface, he or she imposes a certain number of constraints of the form d(x_I, y_I) = d_xy. Assume that the user takes a set T of images and places them in certain positions of the interface, so that, for all pairs (x, y) ∈ T × T, the value d_xy is given. The query can then be determined by solving the system of equations

\[
f(x^i, y^i; \xi^\mu) = d_{xy}, \qquad x, y \in T
\tag{15}
\]

in the unknowns ξ^µ. The creation of a query can be seen as an operator

\[
\chi : D \to \mathbb{R}^m : \{ (X^\Sigma_I, \aleph_I) \} \mapsto \xi^\mu.
\tag{16}
\]

In general, we require that a query depend only on the images that have been anchored by the user, so, if C = {(X^Σ_I, ℵ_I)} and C′ = {(X^Σ_I, ℵ_I) ∈ C : α ∈ ℵ_I}, we have χ(C) = χ(C′).
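When f is the illustrative weighted distance used earlier, the constraint system of Eq. (15) becomes linear in the weights ξ after squaring, so it can be solved with a clipped least-squares fit. This is only one way to realize Eq. (15); the paper's engine solves a more general problem over Riemannian metrics.

```python
# Solving Eq. (15) for the query parameters xi from the user's anchors.
# With the illustrative weighted squared distance sum_i xi_i (x_i - y_i)^2,
# the constraints are linear in xi, so a clipped least-squares fit suffices.
# This is one possible realization, not the paper's actual solver.
import numpy as np

def query_from_anchors(features: dict, placed: dict) -> np.ndarray:
    """features: image id -> feature vector; placed: (id_a, id_b) -> desired d_ab."""
    rows, targets = [], []
    for (a, b), d_ab in placed.items():
        diff = features[a] - features[b]
        rows.append(diff * diff)          # coefficients of xi in the constraint
        targets.append(d_ab ** 2)
    xi, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return np.clip(xi, 0.0, None)         # keep the metric parameters non-negative

features = {"a": np.array([0.1, 0.9, 0.3]),
            "b": np.array([0.2, 0.1, 0.4]),
            "c": np.array([0.8, 0.2, 0.9])}
placed = {("a", "b"): 0.2, ("a", "c"): 0.9, ("b", "c"): 0.7}   # user-imposed distances
print(query_from_anchors(features, placed))
```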

5 The Interface at Work

We have used these principles in the design of the interface for our database system El niño. As we mentioned in the previous section, the interface that we described requires the support of a suitable engine and data model. In particular, the engine must be able to:

• Understand the placement of images in the display space.

• Create a similarity criterion "on the fly" based on the placement of samples in the display space.

The engine that we use in El niño satisfies these requirements using a purely geometric approach. The feature space is generated with a multiresolution decomposition of the image. Depending on the transformation group that generates the decomposition, the space can be embedded in different manifolds. If the transformation is generated by the two dimensional affine group, then the space has dimensions x, y, and scale, in addition to the three color dimensions R, G, B. In this case the feature space is diffeomorphic to ℝ^6. In other applications, we generate the transform using the phase space of the Weyl-Heisenberg group [4], obtaining transformation kernels which are a generalization of the Gabor filters [10, 5]. In this case, in addition to the six dimensions above we have the direction θ of the filters, and the feature space is diffeomorphic to the cylinder ℝ^6 × S^1. An image is represented as a set of coefficients in this six (or seven) dimensional space. The raw feature space of El niño is the space of such sets of coefficients. Each image is represented by a set of about 30,000 coefficients. In order to reduce the memory occupation of each image and make the distance computations more efficient, we create a view in this space by a vector quantization operation that reduces an image to a number of coefficients between 50 and 100 (depending on the particular implementation of El niño).

The query space is created by endowing the view on the feature space with a metric. One of the characteristics of El niño is that the metric is not a simple Minkowski metric, but a more general Riemann metric. This allows us to create endless similarity criteria based on the query choices of the user. The description of the engine of El niño goes beyond the scope of this paper. The interested reader can find a full description in [7].

The display space in El niño can be visualized using a number of two-dimensional displays, and some three dimensional displays. Fig. 6 shows a two dimensional display of El niño. The user can zoom in and out, and see a sample of the images in the display along the two axes (at higher magnification the dots inside the display are also displayed as images). The small window on the left is a visual concept being formed. Other possible displays include a "checkerboard" display, in which images are displayed in a fixed grid, and a three dimensional display (Fig. 7).

6 Conclusions

In a retrieval system the interface is an essential element. The goal of the system is to communicate with the user and to engage with him in a search for information. Since the user's semantics is correlated with, but does not exactly correspond to, the perceptual semantics of the database, it is essential to create tools to direct the database towards images with the right semantics.

This approach marks a substantial departure from the usual anthropomorphic paradigm. Image management systems can't simulate human perception because disconnected images don't provide an environment in which any perception phenomenologically equivalent to human perception can take place. This doesn't mean that we can't build tools that use perception in order to assist people in some task. We must, however, place the systems in the right social framework: we must build perceptual tools that operate in an environment that includes the person using them, not anthropomorphic systems that attempt to operate independently of any interaction with people.

References

[1] Daniel Dennett. Cognitive wheels: the frame problem of AI. In Christopher Hookway, editor, Minds, Machines, and Evolution, Philosophical Studies, pages 129–152. Cambridge University Press, 1984.

Figure 6: Two dimensional embodiment of the display space.

[2] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, and Peter Yanker. Query by image and video content: the QBIC system. IEEE Computer, 1995.

[3] Amarnath Gupta. Visual information retrieval technology: A Virage perspective. Technical report, Virage, Inc., 1995.

[4] C. Kalisa and B. Torrésani. N-dimensional affine Weyl-Heisenberg wavelets. Annales de l'Institut Henri Poincaré, Physique théorique, 59(2):201–236, 1993.

[5] Shidong Li and D. M. Healy, Jr. A parametric class of discrete Gabor expansions. IEEE Transactions on Signal Processing, 44(2):201–211, 1996.

[6] John McCarthy and P. Hayes. Some philosophical problems from the standpoint of artificial intelligence. In B. Meltzer and D. Michie, editors, Machine Intelligence, vol. 4, pages 463–502. 1969.

[7] Simone Santini. Explorations in Image Databases. PhD thesis, University of California, San Diego, January 1998.

[8] Simone Santini and Ramesh Jain. Image databases are not databases with images. In Alberto Del Bimbo, editor, Proceedings of the International Conference on Image Analysis and Processing, Florence, Italy. Springer-Verlag, September 1997.

[9] Simone Santini and Ramesh Jain. Similarity is a geometer. Multimedia Tools and Applications, 5(3), November 1997. Available at http://wwwcse.ucsd.edu/users/ssantini.

Figure 7: A three dimensional embodiment of the display space.

[10] B. Torrésani. Wavelets associated with representations of the affine Weyl-Heisenberg group. Journal of Mathematical Physics, 32(5):1273–1279, 1991.

[11] Terry Winograd and Fernando Flores. Understanding Computers and Cognition. Addison-Wesley, 1987.