AI Goggles: Describing What You See and Retrieving What You Have Seen

Tatsuya Harada, Hideki Nakayama and Yasuo Kuniyoshi
Graduate School of Information Science and Technology, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
{harada, nakayama, kuniyoshi}@isi.imi.t.u-tokyo.ac.jp

Abstract— "AI Goggles" assists the recognition and memory capability of the human user who wears it. It is a wearable system consisting of a camera and a head mount display (HMD) attached to goggles, and a small wearable PC. The AI Goggles instantly recognizes images in the user's field of view, automatically names them, and records movie clips annotated with those names. The recorded data can then be searched by specifying a keyword (i.e., a "key") in order to retrieve movie clips from around the time when the specified scene was seen (e.g., leaving an item at the bathroom mirror), as if they were the user's "external" visual memories. Moreover, the system can quickly learn unknown images on site through teaching by the user, and can then label and retrieve them without losing recognition ability for previously learned ones. As the core algorithm, we propose a new method for multi-labeling and retrieval of unconstrained real-world images. Our method outperforms the current state-of-the-art methods in terms of both accuracy and computation speed on a standard benchmark dataset. This is a major contribution to the development of visual and memory assistive man-machine user interfaces.

Index Terms— Image Annotation/Retrieval, Wearable System, Global Feature, Probabilistic Canonical Correlation Analysis

I. INTRODUCTION

A memory aid is essential assistance for improving quality of life, especially for the elderly. The AI Goggles is a wearable system capable of annotating images and retrieving visual memories in real time without any external computation resources. The AI Goggles is also capable of learning new images taught by the user. This ability further enhances man-machine interaction and makes our system practical as a memory aid in the real world. The system is composed of a web camera mounted on the goggles, a head mount display (HMD), and a tablet PC (Fig. 1). The system continuously extracts an image of the user's view from the web camera, annotates the image, and then displays the annotation result on the HMD. At the same time, the system accumulates the image and the corresponding annotation result in a vision log. When it receives a search query from the user, it finds appropriate movie clips in the vision log and displays them on the HMD. In addition, it can incrementally learn images taught by the user on sight.

There are some studies closely related to the AI Goggles. Schiele et al. [13] proposed a memory-assistive system using visual triggers, and built a museum guide as a practical application. However, their interface is built on query-by-example (QBE) and does not provide keyword search. It has been argued that it is difficult to overcome the semantic gap [14] by QBE.

Fig. 1. Overview of the AI Goggles system. The camera captures images of what the wearer is currently seeing. The portable computer quickly recognizes the captured images and shows the results, together with labels of objects and scenes, on the head mount display. It also accumulates the images and labels so that the wearer can later search them by label.

Torralba et al. [15] proposed a wearable system that recognizes a variety of objects and scenes. Nevertheless, because their approach needs to train a large-scale graphical model, it faces practical issues as the number of target objects grows. In addition, their system does not support online learning. In contrast, our method is extremely fast in terms of both learning and recognition. Our system can annotate a wide range of images in real time and enables re-experiencing of memories through our image annotation/retrieval method. Besides, it supports online learning, which is an important function for a practical application.

II. REAL-TIME ANNOTATION/RETRIEVAL OF VISUAL INFORMATION AND ONLINE LEARNING

As the core of the system, we develop an accurate and fast image annotation and retrieval method [12]. Our contributions are 1) nonparametric density estimation in the concept space obtained by using both images and labels, and 2) stable online learning supporting the addition of new labels. Our algorithm is based on Canonical Correlation Analysis (CCA) to learn the concept. Although there are some previous works using CCA for image annotation [8] [5], mere use of CCA loses the information of the non-linear distribution in the latent space. Our proposed method can efficiently exploit the non-linear structure in the highly compressed concept space by a sample-based approach with a probabilistic basis [1]. Kernelised methods [5] [6] are usually utilized to exploit non-linearity, but their scalability is barely tractable because they need to solve eigenvalue problems whose dimension is the number of training samples.

A. Model of Images and Labels

In the learning phase, we assume that images are weakly annotated with one or more labels. Our goal is to assign multiple labels to unlabeled generic images, or to output unlabeled images related to the user's queries. To this end, we have to learn the relationship between images and labels. We represent an image feature as x, a labels feature as w, and the joint probability of the image feature and the labels feature as p(x, w). In order to obtain p(x, w), we do not connect images and labels directly, but introduce a concept that indirectly associates images with labels. The concept obtained from an image feature x and a labels feature w is denoted as z. We assume the image feature and the labels feature are conditionally independent given the concept. By introducing the concept, the joint probability p(x, w) of the image and the labels can be represented as:

p(x, w) = Σ_{i=1}^{N_z} p(x|z_i) p(w|z_i) p(z_i),   (1)

where N_z is the number of latent variables.
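As a concrete illustration of (1), the following minimal Python sketch evaluates the joint probability from per-concept conditional densities. The array names (px_given_z, pw_given_z, pz) and the toy values are ours and only illustrate the conditional-independence decomposition, not the actual implementation.

```python
import numpy as np

def joint_probability(px_given_z, pw_given_z, pz):
    """Evaluate eq. (1): p(x, w) = sum_i p(x|z_i) p(w|z_i) p(z_i).

    px_given_z : (Nz,) array of p(x|z_i) for a fixed image feature x
    pw_given_z : (Nz,) array of p(w|z_i) for a fixed labels feature w
    pz         : (Nz,) array of prior probabilities p(z_i)
    """
    return float(np.sum(px_given_z * pw_given_z * pz))

# Toy usage with three latent concepts (illustrative values only).
p_xw = joint_probability(np.array([0.2, 0.5, 0.1]),
                         np.array([0.7, 0.1, 0.3]),
                         np.array([1/3, 1/3, 1/3]))
```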

B. Conceptual Learning

The choice of the concept obtained from images and labels is important for the performance of annotation and retrieval. In this work, we use CCA to acquire the concept. CCA does not seek a direct relationship between two variables; instead, it linearly transforms the variables into new variables and maximizes the correlation between the new variables. In fact, from the viewpoint of a probabilistic interpretation of CCA [1], the structure of CCA is similar to (1). In this manner, we can learn the concept efficiently using CCA on a statistically sound basis.

Now, the image feature vector has p variables, x = (x_1, ..., x_p)^T, and the labels feature vector has q variables, w = (w_1, ..., w_q)^T. CCA transforms a learning data set D = {(x_1, w_1), ..., (x_N, w_N)} by the following linear transformations:

s_i = A^T x̂_i = A^T (x_i − x̄),   (2)
t_i = B^T ŵ_i = B^T (w_i − w̄),   (3)

where A and B are the matrices containing the canonical correlation vectors that maximize the correlation between s and t.

We let C = [C_xx, C_xw; C_wx, C_ww] denote the sample covariance matrix obtained from D. The optimal matrices A and B can be obtained by the following eigenvalue problems:

C_xw C_ww^{-1} C_wx A = C_xx A Λ^2   (A^T C_xx A = I_d),   (4)
C_wx C_xx^{-1} C_xw B = C_ww B Λ^2   (B^T C_ww B = I_d),   (5)

where Λ^2 is the diagonal matrix of the first d canonical correlations, and d is the dimension of the canonical variables s and t. Because we solve the eigenvalue problem only once during learning, we can obtain the latent variables very fast.
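The following Python sketch shows how A and B could be obtained from the covariance blocks by solving the generalized eigenvalue problems (4) and (5). It is a minimal NumPy/SciPy illustration, not the authors' implementation; the function names and the small ridge term added for numerical stability are our own assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def fit_cca(X, W, d, ridge=1e-6):
    """Linear CCA via the generalized eigenproblems (4) and (5).

    X : (N, p) image features, W : (N, q) label features.
    Returns mean vectors and projection matrices A (p, d), B (q, d).
    """
    x_mean, w_mean = X.mean(axis=0), W.mean(axis=0)
    Xc, Wc = X - x_mean, W - w_mean
    N = X.shape[0]
    Cxx = Xc.T @ Xc / N + ridge * np.eye(X.shape[1])
    Cww = Wc.T @ Wc / N + ridge * np.eye(W.shape[1])
    Cxw = Xc.T @ Wc / N

    # Eq. (4): Cxw Cww^{-1} Cwx A = Cxx A Lambda^2, with A^T Cxx A = I_d.
    Mx = Cxw @ np.linalg.solve(Cww, Cxw.T)
    evals_x, A = eigh(Mx, Cxx)                  # eigenvalues in ascending order
    A = A[:, np.argsort(evals_x)[::-1][:d]]     # keep the top-d canonical vectors

    # Eq. (5): Cwx Cxx^{-1} Cxw B = Cww B Lambda^2, with B^T Cww B = I_d.
    Mw = Cxw.T @ np.linalg.solve(Cxx, Cxw)
    evals_w, B = eigh(Mw, Cww)
    B = B[:, np.argsort(evals_w)[::-1][:d]]
    return x_mean, w_mean, A, B

def to_concept(x, x_mean, A):
    """Project an image feature into the concept space: s = A^T (x - x_mean), cf. (2)."""
    return A.T @ (x - x_mean)
```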

The relationship between s, t and the latent variable z can be obtained from the posterior expectation of z in probabilistic CCA: E(z|x) = G_x^T s and E(z|w) = G_w^T t, where G_x, G_w are arbitrary matrices such that G_x G_w^T = Λ.

C. Selection of Probability Density Function

We let D_l = {(s_1, t_1), ..., (s_N, t_N)} denote the data set transformed from D. Because G_x and G_w are arbitrary matrices such that G_x G_w^T = Λ, a proper selection of these matrices is difficult. Here, we select the identity matrix as G_x; that is, we use s as the latent variable z. By using s as the latent variable and a uniform prior p(s_i) = 1/N, equation (1) can be represented as:

p(x, w) = (1/N) Σ_{i=1}^{N} p(x|s_i) p(w|s_i).   (6)

p(x|s_i) is the conditional density function responsible for generating the image feature x from the concept s_i. In this calculation, we do not use distances in the image feature space, but in the concept space. Therefore, we can compare images in the intrinsic feature space for annotation, where features that do not contribute to annotation are eliminated. We use a Gaussian kernel over s_i in the concept space. Let s denote the variable obtained from x by the linear transformation A. Then we have:

p(x|s_i) = (1 / √((2π)^M |Σ|)) exp(−(1/2) (s − s_i)^T Σ^{-1} (s − s_i)),   (7)

where Σ is a covariance matrix. Here we assume Σ = βI_d. We can control the smoothness of the density function through the bandwidth β. In general, the computational cost of (7) is low, because the dimension of the transformed variables (s and s_i) is highly compressed compared with the original dimension of the images.

p(w|s_i) is the conditional density function responsible for generating the labels from the concept s_i. We design this function in a top-down manner using a simple language model, as in CRM and MBRM:

p(w|s_i) = Π_{w∈w} p_W(w|s_i),   (8)

where w is a single label and p_W is the occurrence probability of words, represented as follows:

p_W(w|s_i) = μ δ_{w,s_i} + (1 − μ) N_w / N_W,   (9)

where N_W is the total number of labels in the training data set, N_w is the number of images that contain w in the training data set, and δ_{w,s_i} is one if the label w is annotated on the image corresponding to s_i and zero otherwise. μ is a parameter between zero and one. As μ approaches one, the occurrence probability gives more weight to the labels of each image; otherwise, it gives more weight to the frequency of labels among all labels.
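Putting (6)-(9) together, annotation amounts to scoring each vocabulary word w by Σ_i p(x|s_i) p_W(w|s_i) for a new image. The sketch below is a minimal NumPy illustration of that computation under the isotropic assumption Σ = βI_d; the function and variable names are ours, and the normalization constant of (7) is dropped since it does not affect the ranking.

```python
import numpy as np

def annotate(s_new, S, labels, beta, mu, top_k=3):
    """Rank vocabulary words for a new image, following eqs. (6)-(9).

    s_new  : (d,) concept-space projection of the new image, s = A^T (x - x_mean)
    S      : (N, d) concept-space projections of the training images
    labels : (N, V) binary matrix; labels[i, j] = 1 if word j annotates image i
    beta   : Gaussian bandwidth (Sigma = beta * I_d)
    mu     : mixing weight of eq. (9)
    """
    # Gaussian kernel in the concept space, eq. (7) up to a constant factor.
    sq_dist = np.sum((S - s_new) ** 2, axis=1)          # (N,)
    k = np.exp(-0.5 * sq_dist / beta)                   # proportional to p(x | s_i)

    # Word occurrence model, eq. (9): delta term plus global label frequency.
    N_w = labels.sum(axis=0)                            # images containing each word, (V,)
    N_W = labels.sum()                                  # total number of labels
    p_word = mu * labels + (1.0 - mu) * (N_w / N_W)     # (N, V), p_W(w | s_i)

    # Score of each word, proportional to sum_i p(x|s_i) p_W(w|s_i), cf. (6).
    scores = k @ p_word                                 # (V,)
    return np.argsort(scores)[::-1][:top_k]             # indices of the top-ranked words
```

Retrieval works the other way around: given a query word, images are ranked by the same score evaluated over the stored vision log.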

D. Online Learning

In order to extend the algorithm to incremental learning of new training samples, the main problem is how to derive an incremental CCA. One obvious method is to use the classical perceptron-style algorithm [9]. However, to obtain discriminative power with this approach, we would need to feed samples of all target objects into the system evenly and sequentially. In a practical situation it is natural to give the system a large batch of samples of a certain object all at once, so this assumption is invalid. Moreover, this approach is quite difficult to use if the dimension of the feature vector increases, so it is almost impossible to learn a new label. For this reason, we extend CCA so that only the covariance matrices are updated incrementally. This method needs to solve an eigenvalue problem at every update, but, unlike kernelised methods [5] [6], the dimension of our eigenvalue problem is that of the image and labels features. Therefore, the computational cost is constant with respect to the number of samples and is generally much smaller than the cost of calculating the covariance matrices. Furthermore, this approach can deal with the case in which the dimension of the features increases.

Suppose we already have t training samples, together with their mean vectors, correlation matrices, and covariance matrices, denoted m, R, and C respectively. When a new training sample {x_{t+1}, w_{t+1}} is given, we update the mean and covariance as follows:

m_x = ((t − l)/(t + 1)) m_x + ((1 + l)/(t + 1)) x_{t+1},
R_xx = ((t − l)/(t + 1)) R_xx + ((1 + l)/(t + 1)) x_{t+1} x_{t+1}^T,
C_xx = R_xx − m_x m_x^T,   (10)

where l is a positive number that determines the weight of a new sample. We update C_ww and C_xw likewise. When a new label is added, the update algorithm changes slightly: a new label feature corresponding to the new label is appended to the tail of the original label feature vector, and the stored statistics are padded with zero blocks to match the new dimension:

m_w = ((t − l)/(t + 1)) [m_w; 0] + ((1 + l)/(t + 1)) w_{t+1},
R_ww = ((t − l)/(t + 1)) [R_ww, 0; 0, 0] + ((1 + l)/(t + 1)) w_{t+1} w_{t+1}^T,
R_xw = ((t − l)/(t + 1)) [R_xw, 0] + ((1 + l)/(t + 1)) x_{t+1} w_{t+1}^T.   (11)

After updating the covariance matrices, we can obtain the new projection matrices A and B by solving the eigenvalue problems (4) and (5). In incremental learning, finding a good value of d (the dimension of the canonical space) is difficult because the structure of the canonical space changes dynamically. Here, we set a threshold on the eigenvalues and employ the eigenvectors whose eigenvalues are greater than the threshold. After updating the canonical space, we need to recalculate the hidden variables s using the new A. It is worth noting that, in spite of the algorithm's simplicity, it is guaranteed to provide the same solution as the batch process. This property makes our system highly stable, which is crucially important for realistic situations.
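A minimal NumPy sketch of the update rules (10) and (11) follows, written for one modality (the image side); it maintains the running mean and correlation statistics and pads them with zeros when new label dimensions appear. The class and method names are ours, and the weighting constant l is treated as a plain scalar argument, as in the text.

```python
import numpy as np

class IncrementalStats:
    """Running mean/correlation/covariance for one modality, per eqs. (10)-(11)."""

    def __init__(self, dim):
        self.t = 0                      # number of samples seen so far
        self.m = np.zeros(dim)          # mean vector
        self.R = np.zeros((dim, dim))   # (uncentered) correlation matrix

    def grow(self, new_dim):
        """Pad the statistics with zeros when new label dimensions are added, cf. (11)."""
        old = self.m.shape[0]
        m = np.zeros(new_dim)
        m[:old] = self.m
        R = np.zeros((new_dim, new_dim))
        R[:old, :old] = self.R
        self.m, self.R = m, R

    def update(self, v, l=0.0):
        """Apply eq. (10) for a new sample v with weighting constant l."""
        a = (self.t - l) / (self.t + 1)
        b = (1.0 + l) / (self.t + 1)
        self.m = a * self.m + b * v
        self.R = a * self.R + b * np.outer(v, v)
        self.t += 1

    def covariance(self):
        """C = R - m m^T, as in the last line of (10)."""
        return self.R - np.outer(self.m, self.m)
```

The cross-correlation R_xw can be maintained in the same way using np.outer(x, w), after which the projection matrices are re-obtained by solving (4) and (5) on the updated covariance blocks.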

E. Image and Label Features

As the image feature, we use the color higher-order local auto-correlation (Color-HLAC) features [7]. This is a powerful global feature for color images. In general, global image features are suitable for realizing scalable systems because they can be extracted very fast. They are also well suited to the weak labeling problem, where we cannot predict the location and the number of objects in images. In this system we use at most the 1st-order correlations. We extract Color-HLAC features from the original images and from weakly binarized images, as described in [12]. As the labels feature, we use the word histogram. In this work, each image is simply annotated with a few words, so the word histogram becomes a binary feature.

III. EXPERIMENTAL RESULTS

A. Performance of the Annotation/Retrieval Algorithm

We evaluated the performance using the Corel5K dataset [3], which contains 5000 pairs of images and labels. This is the de facto standard benchmark for the problem of image annotation with multiple words. Each image is manually annotated with one to five words. 4500 images are used as training data and 500 images as test data. The training data contain 371 words, 260 of which appear in the test data. We evaluate annotation with Mean Recall (MR) and Mean Precision (MP). Because these are a trade-off, we also evaluate the overall annotation performance with the F-measure (FM). Similarly, we evaluate retrieval with the Mean Average Precision (MAP) and with the average precision over the labels whose recall is greater than zero in annotation (MAP+).

TABLE I. Experimental results on Corel5K.

Method        MR     MP     FM     MAP    MAP+
MBRM [4]      0.25   0.24   0.24   0.30   0.35
SML [2]       0.29   0.23   0.26   0.31   0.49
DCMRM [10]    0.28   0.23   0.25   -      -
JEC [11]      0.32   0.27   0.29   0.33   0.52
Proposed 1    0.31   0.29   0.30   0.33   0.53
Proposed 2    0.32   0.31   0.31   0.33   0.58

There are two proposed methods: Proposed 1 uses 1st-order Color-HLAC features, and Proposed 2 uses 2nd-order Color-HLAC features. The results show that our method outperforms the previously published methods. The MR of Proposed 1 is inferior to that of JEC, but its FM, i.e. the overall annotation score, exceeds that of JEC. As for the computational cost on Corel5K, our method takes just 5 seconds for learning and 10 seconds for annotating all 500 test images on an eight-core desktop PC. On the other hand, SML [2] used 3000 state-of-the-art Linux machines in 2005. JEC uses many kinds of image features and calculates distances in the raw feature space, whereas our method measures distances in the concept space, which is highly compressed compared to the image feature space. Therefore, our method calculates the distances faster than JEC. In this sense, our proposed algorithm is well suited to the mobile application we aim at. The computational costs of our recognition and incremental learning methods are O(N). A cost of O(N) may seem intractable for large datasets, but these computations can be perfectly parallelized because they are instance-based and do not need sequential estimation. This property fits the trend of current computer hardware.

B. Demonstration of the AI Goggles

We tested the AI Goggles in a real environment. We set up an experimental room simulating a common living space, in which we placed various objects such as a desk, chair, books, and a clock. Moreover, we tested our system outside the room and taught it some landmark objects and scenes. Overall, we taught 106 labels in the current setup. The initial training dataset contains about 5000 samples of just 40 labels in the room; a large part of the samples is taught through the online learning algorithm. The objective here is to test how well our system can adapt to a whole new environment. Our incremental algorithm takes just 0.1 seconds on average per frame to update all the samples in the concept space.

Fig. 2. Annotation examples with the AI Goggles. Example outputs include: phone 0.98, keyboard 0.79, mouse 0.78; map 0.96, Hokkaido 0.52; flower 0.99, pansy 0.99, begonia 0.99; street 0.65, building 0.65; tree 0.87, stone 0.86, statue 0.86; bicycle 0.68.

After completing learning, we actually wear the goggles and attempt real-time annotation in the real environment. The vision log, which contains the image and the posterior probability of each word, is saved at 1 fps. We set a threshold on the posterior probability; the recognition result is the set of objects whose posterior probability exceeds this threshold. The system succeeds in annotating various generic images both in the experimental room and outside (Fig. 2). It is remarkable that, although these two environments are considerably different, the incremental learning in the outside environment does not deteriorate the recognition performance in the experimental room. These results indicate the powerful and flexible recognition ability of the system.

Fig. 3. Retrieval examples with the AI Goggles, showing retrieved images for the queries "face", "face frog", and "flower begonia".

We also attempt to retrieve images containing certain objects from the previously recorded vision log (Fig. 3). For example, when we search the vision log with the query "face", our system retrieves various objects as "face"; in other words, it can flexibly deal with high intra-class variation. Our system also permits multi-label retrieval: if we input "face" and "frog" as the query, the system correctly retrieves the target. As a memory aid, suppose we cannot find a pair of pliers and want to know where it is now. We let the system retrieve images of the pliers, which we must have seen somewhere before. The system shows us a sequence of images from the last time the pliers were seen, and this visual information reminds us where the pliers are.

IV. CONCLUSIONS

In this paper, we presented the AI Goggles, which annotates and retrieves various images in the real world. Using the AI Goggles, we can retrieve what we have seen in the past with a search query, as if it were the user's "external" visual memory. Moreover, it enables instant learning of new scenes, and the labeling policy of the system can be flexibly adapted to the user's preferences. We hope our system contributes to a memory aid for many people.

REFERENCES

[1] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.
[2] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans. PAMI, 29(3):394–410, 2007.
[3] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. ECCV, 2002.
[4] S. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proc. CVPR, 2004.
[5] D. Hardoon, C. Saunders, S. Szedmak, and J. Shawe-Taylor. A correlation approach for automatic image annotation. In Second Int. Conf. on ADMA, 2006.
[6] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[7] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image database – query by visual example. In Proc. ICPR, 1992.
[8] T. Kurita, T. Kato, I. Fukuda, and A. Sakakura. Sense retrieval on an image database of full color paintings. Trans. of IPSJ, 33(11):1373–1383, 1992.
[9] P. L. Lai and C. Fyfe. A neural implementation of canonical correlation analysis. Neural Networks, 12(10):1391–1397, 1999.
[10] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma. Dual cross-media relevance model for image annotation. In Proc. ACM Multimedia, 2007.
[11] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In Proc. ECCV, 2008.
[12] H. Nakayama, T. Harada, Y. Kuniyoshi, and N. Otsu. High-performance image annotation and retrieval for weakly labeled images using latent space learning. In Proc. PCM, 2008.
[13] B. Schiele, T. Jebara, and N. Oliver. Sensory-augmented computing: Wearing the museum's guide. IEEE Micro, 21(3):44–52, 2001.
[14] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349–1380, 2000.
[15] A. Torralba, K. P. Murphy, W. T. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In Proc. ICCV, 2003.