Communicating User's Focus of Attention by Image Processing as Input for a Mobile Museum Guide

Adriano Albertini, Roberto Brunelli, Oliviero Stock, Massimo Zancanaro
ITC-irst, 38050 Povo, Trento, Italy
+39 0461 315444

{albertini,brunelli,stock,[email protected]}

ABSTRACT
The paper presents a first prototype of a handheld museum guide that delivers contextualized information based on the recognition of painting details selected by the user through the guide camera. The resulting interaction modality is analyzed and compared with previous approaches. Finally, alternative, more scalable solutions are presented that preserve the most interesting features of the system described.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors, Human information processing.

General Terms
Human Factors.

Keywords
Human-machine Interaction, Machine Learning, Appearance-based Recognition.

1. INTRODUCTION
Vision-based human-computer interfaces are gaining wide acceptance since they allow a casual user to interact with a computer system in a natural way. Face detection and gesture recognition are the most commonly exploited tasks: the user may control a system through natural gesturing, often combined with speech commands. Embedded in a smart environment as in [4], or used in conjunction with a robot-like appearance of the system as in [6], this input style helps sustain the metaphor that interacting with a computer system is like interacting with another person. Recognition of the user's focus of attention through vision techniques, as a basis for dynamically overlaid presentations, has been studied by several groups and has been at the basis of some advanced realizations (among which we can cite [8]). These approaches, while interesting, still require substantial hardware, such as goggles, which may not be appropriate for a cultural visit or for our personal, highly mobile setting.

In this paper, we introduce an initial investigation toward a less intrusive interaction style based on vision recognition.

Our test case scenario is a museum visit, and the task is to obtain information about a detail of a painting. With a standard multimedia mobile guide, this task can be accomplished by "static" menu selection. In a different smart environment, speech and gesturing may provide a solution, but natural interaction may be inappropriate in quiet public spaces such as museums.

In our setting, the visitor carries a PDA equipped with a webcam and headphones. In order to ask for information about a painting, or about a detail, the visitor just points the webcam toward the painting, an act that requires no learning. Feedback of the camera view is provided on the PDA screen. While the camera frames the paintings and their details, short text labels are displayed on the PDA screen to help the visitor recognize the objects (and to provide feedback on the system's ability to recognize them). By clicking the text label associated with an object, the visitor can retrieve a multimedia presentation.

At present, we have completed a first prototype for the famous fifteenth-century fresco "The Cycle of the Months" at Torre Aquila in Trento, Italy. Although experimental, the prototype helped clarify the assumptions behind this interaction style. In the following, we briefly describe the architecture of the system and give some details about the vision recognition engine. Finally, we discuss the lessons learned and the design of the new prototype under construction.

2. SYSTEM ARCHITECTURE
Figure 1 illustrates the system architecture. The mobile device is an iPAQ 3870 with an internal wireless card, equipped with a Winnov Traveler webcam. The user interface is implemented in C# and embeds a Flash MX player to run the multimedia presentations. On the server side, the vision recognition engine, described in the next section, runs on a Linux machine as an HTTP server, while the database is an extension of MySQL that allows semantic-based indexing and XML-based querying over an HTTP connection [1]. Communication between the client and the servers uses a standard Wi-Fi network.

The user interface on the mobile device has two modes: browsing and presenting. In browsing mode, the point of view of the camera is mirrored on the display. While in this mode, color-reduced still images are sent to the vision-recognition engine every 500 ms. If the engine recognizes a scene, an identifier is sent back to the interface, which queries the database for a presentation.

If a multimedia presentation for that scene is found, the database responds with its title and code (as well as other meta-data not used here).


Figure 1. The system architecture (simplified)

The title is then used to prompt the user. If the user clicks on the label, the actual presentation, an SWF file, is retrieved from the database over HTTP and played back.


Figure 2. Snapshot of the interface in browsing mode

During playback, the interface automatically switches to presenting mode, providing a VCR-like panel to pause, stop, and replay the presentation. At the end of the presentation, the interface automatically switches back to browsing mode.
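To make the client-server exchange concrete, the following is a minimal sketch of the browsing-mode loop. It is written in Python rather than in the C# of the actual interface, and the endpoint names, reply format, and capture helper are assumptions introduced for illustration, not the prototype's actual protocol.

```python
# Minimal sketch of the browsing-mode loop (hypothetical endpoints and message format).
import time
import requests

RECOGNIZER_URL = "http://vision-server/recognize"    # assumed endpoint of the vision engine
DATABASE_URL = "http://db-server/presentation"       # assumed endpoint of the database server

def grab_color_reduced_frame() -> bytes:
    """Placeholder for the PDA camera capture and color-reduction step."""
    raise NotImplementedError

def browsing_loop():
    while True:
        frame = grab_color_reduced_frame()
        # Send a color-reduced still image to the vision-recognition engine over HTTP.
        reply = requests.post(RECOGNIZER_URL, data=frame,
                              headers={"Content-Type": "application/octet-stream"})
        scene_id = reply.json().get("scene_id")      # assumed reply format
        if scene_id is not None:
            # Ask the database for the presentation associated with the recognized scene.
            meta = requests.get(DATABASE_URL, params={"scene": scene_id}).json()
            print("Recognized:", meta["title"])      # shown to the user as a clickable label
        time.sleep(0.5)                              # one frame every 500 ms
```

Tapping the label would then trigger the retrieval of the SWF file referenced by the returned code and switch the interface to presenting mode.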

3. THE VISION RECOGNITION ENGINE
The task to be solved by the vision recognition engine is to determine which area of a painting the user is aiming the camera at, in order to provide contextualized information to the visitor. The proposed solution is based on the learning-by-example paradigm: several images, annotated with the required parameters, are fed to the system, which learns the appropriate mapping from them to painting coordinates [3]. There are then three important stages: how the training images are generated, how the system perceives them, and how it translates them into painting coordinates.

Figure 3. The vision recognition engine: generation of training samples (a-b), low resolution perception (c), metric localization error with grid and random sensor layouts (d)

A major problem in the development of systems based on learning is the availability of a sufficient number of training examples. In order to gather enough data, a flexible graphical engine has been realized [3]. The tool supports the efficient simulation of several characteristics of real optical systems, relying on a set of post-processing functions acting on synthetic images generated by a 3D rendering system.
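The graphical engine of [3] is not described in detail here; the sketch below only illustrates the general idea of deriving many annotated training views from a high-resolution reference image by simulating camera pose and optics with simple post-processing. The field of view, blur range, and output resolution are assumptions made for the example.

```python
# Sketch of synthetic training-sample generation from a reference image (assumed pipeline).
import random
from PIL import Image, ImageFilter

def make_training_sample(reference: Image.Image, out_size=(160, 120)):
    """Simulate one camera view -- random crop (pose), downscaling (resolution),
    Gaussian blur (optics) -- and return it with its painting coordinates."""
    w, h = reference.size
    cw, ch = w // 4, h // 4                        # assumed field of view
    x = random.randint(0, w - cw)
    y = random.randint(0, h - ch)
    view = reference.crop((x, y, x + cw, y + ch))  # where the camera is pointing
    view = view.resize(out_size)                   # webcam-like resolution
    view = view.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    # Target annotation: normalized painting coordinates of the view centre.
    target = ((x + cw / 2) / w, (y + ch / 2) / h)
    return view, target

if __name__ == "__main__":
    fresco = Image.open("fresco.jpg")              # hypothetical reference image
    samples = [make_training_sample(fresco) for _ in range(1000)]
```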

Visual hyper-acuity, the perception of a difference in the relative spatial localization of visual stimuli with a precision exceeding the characteristic size of the sensors, is a task-dependent ability that can be improved by training. Based on the insight of [5] and the ideas of [7], an artificial visual system has been realized that samples the incoming image through a set of receptive fields (see Figure 3(c)), each described by histograms of the local distribution of several image features (hue, edgeness, and luminance). Large receptive fields are more likely to overlap across different snapshots of the world, supporting increased performance of the localization system via a mechanism similar to the one underlying hyper-acuity. The histograms of all receptive fields are then concatenated and used to learn the required mapping with function approximation techniques based on the knowledge of the values at a sparse set of points (i.e., the examples). The performance obtained on 10,000 synthetic examples is reported in Figure 3. The system proved capable of generalizing from the synthetic images to those obtained through the PDA camera, albeit with degraded performance.
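To make the perception and mapping steps concrete, here is a minimal sketch of receptive-field histogram features followed by a learned mapping to painting coordinates. The grid layout, the use of luminance and edge strength only (hue is omitted for brevity), the histogram sizes, and the choice of a k-nearest-neighbour regressor as the function approximator are all assumptions for illustration; the paper does not prescribe a specific approximation technique.

```python
# Sketch: per-receptive-field histograms concatenated into a feature vector,
# then a regressor learned from a sparse set of examples (assumed design choices).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def receptive_field_features(gray: np.ndarray, grid=(4, 4), bins=8) -> np.ndarray:
    """Describe a grayscale image by per-cell histograms of luminance and edge
    strength, concatenated into one vector (a simplified stand-in for the
    hue/edgeness/luminance receptive-field histograms described above)."""
    gy, gx = np.gradient(gray.astype(float))
    edges = np.hypot(gx, gy)
    h, w = gray.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = (slice(i * h // grid[0], (i + 1) * h // grid[0]),
                    slice(j * w // grid[1], (j + 1) * w // grid[1]))
            feats.append(np.histogram(gray[cell], bins=bins, range=(0, 255), density=True)[0])
            feats.append(np.histogram(edges[cell], bins=bins, range=(0, 100), density=True)[0])
    return np.concatenate(feats)

def train_locator(images, coords, k=3):
    """Learn the mapping from feature vectors to (x, y) painting coordinates."""
    X = np.stack([receptive_field_features(img) for img in images])
    return KNeighborsRegressor(n_neighbors=k).fit(X, np.asarray(coords))
```

At run time, `train_locator(...).predict([receptive_field_features(snapshot)])` would return the estimated painting coordinates framed by a new camera snapshot.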

4. LESSONS LEARNED FROM THE FIRST PROTOTYPE
The choice of Torre Aquila as a whole as a test case was too ambitious for the state of the technology. The size of the frescoes, 7.89 x 3.05 meters, their position 2 meters above the floor, and the darkness of the medieval tower turned out to be a serious challenge for the capabilities of a standard webcam. Thus, for demonstration purposes, the system was tested on reproductions of the actual frescoes, which hindered the possibility of an extensive user evaluation. Yet some initial considerations about the usability of the interaction paradigm can be drawn even in that non-ecological setting.

First of all, it is evident that the point-and-click style of interaction makes the task of selecting details easier. The VCR metaphor of the presenting mode is intuitive enough to be mastered in a few minutes of interaction with the system. Since the mode selection is automatic, the user does not have the burden of controlling the system. Displaying the title of the presentation provides reasonable feedback about the system status and its ability to recognize the scenes. Yet this feedback should be further improved, for example by providing hints not only when the engine recognizes something but also when it does not. The vision engine is of course sensitive to relative position and lighting, and even if these limitations can be compensated by pre-analyzing many images of the same subject, there is always the possibility that the system has information about a given scene but is not able to recognize it. Feedback about the impossibility of analyzing an image, for example because of poor lighting, might help the user make better use of the system.

In the present application, the system analyzes the flow of images in real time. Of course, this choice poses a scalability problem: if the system response is not in real time, the user may be confused. One possible solution is to provide feedback about the speed of movement; if it is too fast, the interface may visually signal the user to slow down, for example by blurring the display (a minimal sketch of this idea is given at the end of this section). Another possible solution is to slightly change the interaction style: rather than interpreting the flow of images, the system might interpret just the one the user chooses to "shoot". Of course, this would prevent the system from giving feedback about an already known scene, but if the internal database of the vision-recognition engine is well designed, very few shots should go unrecognized. This solution presents a number of advantages: first of all scalability, both in terms of the number of objects in the museum covered by the system and in terms of communication over the Wi-Fi network; secondly, it is easier to port this metaphor to other portable devices and infrastructures, such as mobile phones and UMTS networks.

A new prototype for testing this design and its implications is already in progress.
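As a rough illustration of the "slow down" feedback mentioned above, the snippet below estimates apparent motion from the difference between consecutive downsampled frames and blurs the live preview when it exceeds a threshold. The downsampling size, the difference metric, and the threshold value are assumptions, not details of the prototype.

```python
# Sketch: blur the live preview when inter-frame motion is too high (assumed heuristic).
import numpy as np
from PIL import Image, ImageFilter

MOTION_THRESHOLD = 12.0   # assumed value: mean absolute gray-level difference

def too_fast(prev: np.ndarray, curr: np.ndarray) -> bool:
    """Compare consecutive downsampled grayscale frames."""
    return float(np.mean(np.abs(curr.astype(int) - prev.astype(int)))) > MOTION_THRESHOLD

def preview(frame: Image.Image, prev_small):
    """Return the frame to display and the downsampled frame to keep for next time."""
    small = np.asarray(frame.convert("L").resize((40, 30)))
    if prev_small is not None and too_fast(prev_small, small):
        frame = frame.filter(ImageFilter.GaussianBlur(radius=4))  # visual "slow down" cue
    return frame, small
```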

5. CONCLUSIONS
A first prototype of a handheld museum guide delivering contextualized information based on the recognition of painting details selected by the user through the guide camera has been described. Several interaction issues have been analyzed, and new solutions aiming at a more scalable, non-obtrusive approach have been proposed.

6. ACKNOWLEDGMENTS
Research partially funded by the Provincia Autonoma di Trento under Project PEACH: Personal Experience of Active Cultural Heritage.

7. REFERENCES
[1] Alfaro I., Zancanaro M., Nardon M., Guerzoni A. Navigating by Knowledge. In Proceedings of the International Conference on Intelligent User Interfaces (IUI 2003), Miami, FL, January 2003.
[2] Bell B., Feiner S., Hollerer T. Information at a Glance. IEEE Computer Graphics and Applications, vol. 22, pp. 6-9, 2002.
[3] Brunelli R. Low Resolution Image Sampling for Pattern Matching. In Proceedings of the International Conference on Computer Vision and Graphics (ICCVG 2004), Warsaw, Poland, September 22-24, 2004.
[4] Crowley J., Berard F., Coutaz J. Finger Tracking as an Input Device for Augmented Reality. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pp. 195-200, Zurich, 1995.
[5] Poggio T., Fahle M., Edelman S. Fast Perceptual Learning in Visual Hyperacuity. Science, 247:1018-1021, 1992.
[6] Robertson, Laddaga, Van Kleek. Virtual Mouse Vision Based Interface. In Proceedings of the International Conference on Intelligent User Interfaces (IUI 2004), Madeira, 2004.
[7] Schiele B., Crowley J.L. Recognition without Correspondence using Multidimensional Receptive Field Histograms. International Journal of Computer Vision, 36(1), 2000.
[8] Schnadelbach H., Rodden T., Koleva B., Flintham M., Fraser M., Izadi S., Chandler P., Foster M., Benford S., Greenhalgh C. The Augurscope: A Mixed Reality Interface for Outdoors. In Proceedings of SIGCHI 2002, pp. 91-6, 2002.