Designing a multimodal dialogue system for mobile phones

Holmer Hemsen
Natural Interactive Systems Laboratory
University of Southern Denmark
Campusvej 55, DK-5230 Odense M, Denmark
[email protected]

Abstract

Due to their wide distribution and increasing technical capabilities, mobile phones will play an important role as user interfaces for information and transaction services in non-stationary situations. The small screen size, however, introduces usability problems. A speech-centric multimodal interface is therefore a promising approach for building effective user interfaces for these devices. In this paper we present an experimental client-server implementation of a near-realistic application in the area of real estate, and we discuss design decisions for combining speech with other modalities under the above-mentioned constraints.

1 Background

Effective graphical user interfaces for small-screen devices are difficult to realise; the lack of user acceptance of interfaces based on WAP is only one example. Several projects have developed techniques for creating more effective graphical user interfaces (e.g. Björk et al., 2000; Buchanan et al., 2000; Holmquist, 2000). The very restricted screen size of mobile phones, however, allows only a small amount of information to be presented at once, even if focus & context methods like the one presented in Holmquist (2000) are applied. For this reason, speech input/output is a promising means of circumventing the restrictions of the screen size and of navigating between pages of information.

Mobile phones have technical restrictions that make it hard to develop a speech-centric multimodal dialogue system. First, the device has relatively low computational power, and second, the highly variable background noise makes it difficult to develop good speech recognisers (Dobler, 2000), even if speech recognition is done on an external server. However, the computational power of mobile phones is increasing, and approaches for recognising continuous speech via mobile phones are under development. Command-based information services aimed at mobile phones are already publicly available (e.g. from Telefónica Móvistar (Telefónica, 2003)).

Assuming that these technical restrictions will diminish further in the future, the question remains how speech interaction can reduce the usability problems of graphical user interfaces on mobile phones, and which combinations of speech with visual and pointing input turn out to be effective. Results on this issue are still few; see Almeida et al. (2002) and Chang et al. (2002) for some approaches, which, however, use PDAs as interaction devices.

For evaluating different combinations of modalities, and for gaining hands-on experience, a test environment has been constructed. The architecture is briefly described in section 2.1, followed by a description of a use case (section 3) and a discussion of different user interface designs for the implementation of the test application (section 4). Section 5 discusses preliminary evaluation results.

2 Approach

2.1 A multimodal architecture for Rapid Application Development (RAD)

The main focus of the research is to analyse the advantages of spoken input and output in combination with other modalities for developing usable dialogue systems for small-screen devices, and in particular mobile phones. A frequently used method for the development and evaluation of spoken dialogue systems is Wizard of Oz (WOZ) (cf. Dahlbäck, 1993; Bernsen et al., 1998:p. 127ff). For multimodal dialogue systems, however, WOZ is much more difficult to realise. In most cases software tools and special technical setups are needed to support the wizard in his or her task of efficiently simulating the system. Furthermore, capturing the non-speech input probably requires additional tools. We decided instead to use a rapid application development approach based on the SpeechMania dialogue development system. SpeechMania is a telephone-based system that allows connecting to an external application through a dynamically linked library.

The mobile device is currently simulated as a widget on a desktop PC, with buttons substituting for the keypad of the phone and a mouse pointer simulating a pointing device. We abstract from many technical restrictions, such as bandwidth limitations and the difficulty of speech recognition in noisy environments. Nevertheless, we took into account that mobile devices will also in the future have much lower computational power than a desktop environment. The architecture (fig. 1) reflects this fact and is therefore divided into a client that uses few computational resources (the GUI simulating the mobile phone) and a server part (SpeechMania plus an additional component for modality handling and database connection). The GUI solely handles the presentation of the data coming from the server and catches mouse events, which are sent to the server for processing. Data between client and server is sent on demand, which reduces the required bandwidth to a minimum.
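The paper does not specify the wire format between client and server; the following minimal Java sketch merely illustrates the division of labour just described, with a thin GUI client that does no interpretation of its own: it reports click coordinates to the server and renders whatever presentation data the server pushes back. The socket address, the CLICK message format, and the plain-text display protocol are all assumptions made for illustration.

import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

/** Thin client: forwards pointer events to the dialogue server and
 *  renders the presentation data the server sends back on demand. */
public class ThinClient {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("localhost", 9000);   // hypothetical server address
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        BufferedReader in =
            new BufferedReader(new InputStreamReader(socket.getInputStream()));

        JLabel display = new JLabel("waiting for server ...");
        JFrame frame = new JFrame("Mobile phone simulation");
        frame.add(display);
        frame.setSize(176, 208);                         // typical phone screen of the period
        frame.setVisible(true);

        // The client does no interpretation: clicks are simply reported to the server.
        display.addMouseListener(new MouseAdapter() {
            @Override public void mouseClicked(MouseEvent e) {
                out.println("CLICK " + e.getX() + " " + e.getY());
            }
        });

        // Presentation data arrives on demand; here simply as plain text lines.
        String line;
        while ((line = in.readLine()) != null) {
            String content = line;
            SwingUtilities.invokeLater(() -> display.setText(content));
        }
    }
}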

Figure 1: Architecture of the multimodal dialogue system.

2.2 Application domain

On the basis of the above-mentioned architecture, a dialogue system for retrieving information on real estate is under development. This scenario makes it possible to examine different combinations of modalities not only with respect to their theoretical value but also with respect to their practical usage. The system is intended to enable the user to specify a basic set of search criteria and to inspect the retrieved set of results. The data for the application is based on advertisements for real estate in a local newspaper. Even though the approach described here is an experimental system, it should be clear that a real system of this kind is not meant to replace an information retrieval system for desktop environments, but rather to supplement it in situations in which a desktop environment is not available.

2.3 User tasks and combinations of modalities used

The system allows the user to specify a set of criteria the house of choice should satisfy, to query the database, and to interactively inspect the results on the display of his/her mobile phone. The input modalities used are: continuous speech, pointing, clicking and typing. The output modalities are: continuous speech (pre-recorded), photos, tables, tables with highlighted cells, and buttons. Solving the following problems has been crucial for the choice of modalities and modality combinations at the different steps of the dialogue:
• providing easy ways of navigating between the results retrieved;
• using combinations of modalities that enable efficient interaction;
• finding modality combinations that work well on small-screen displays.

For a description of the combinations of modalities used, see Tables 1 and 2.

Table 1: Combinations of input modalities, with examples of usage:
• Speech: selecting criteria.
• Speech AND Pointing: zooming in/out on the map area pointed at (fig. 3).
• Speech EXOR Clicking: clicking on a tab or choosing a view via speech (fig. 5).
• Clicking: selecting properties from a checkbox list.
• Typing: entering numbers, e.g. a price level.

Table 2: Combinations of output modalities, with examples of usage:
• Speech: presenting instructions.
• List: showing the list of areas for selection (fig. 2).
• Checkboxes: list of properties to choose from, e.g. garage.
• Tabbed pane AND picture AND buttons AND speech: picture of the realty, buttons indicating the number of pictures available, speech explaining the buttons; the tabbed pane groups the information for one realty (fig. 5).
• Tabbed pane AND table: table used for presenting price information (fig. 5).
• Tabbed pane AND table AND highlighted cell: if the user asked for a specific price, the corresponding cell is highlighted for fast location (fig. 5).
• Map: for selecting an area of the town (fig. 3).
• Tabbed pane AND picture AND table: while calling the realtor, presenting on the screen the ID of the realty together with the name and a picture of the contact person.
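The "Speech AND Pointing" row in Table 1 implies some form of late fusion: a spoken zoom command has to be combined with the most recent pointing event on the map. The paper does not describe this mechanism; the following Java sketch shows one plausible fusion step, in which the class and method names, the command phrases, and the fixed time window for pairing the two inputs are all assumptions.

import java.awt.Point;

/** Minimal late-fusion sketch: pairs a spoken zoom command with the
 *  most recent pointing event, provided both occur close together in time. */
public class MapFusion {
    private static final long PAIRING_WINDOW_MS = 3000;  // assumed time window

    private Point lastPointed;   // where the user last pointed on the map
    private long lastPointedAt;  // timestamp of that pointing event

    public void onPointing(Point p) {
        lastPointed = p;
        lastPointedAt = System.currentTimeMillis();
    }

    /** Called by the speech recogniser with e.g. "zoom in" / "zoom out". */
    public void onSpeech(String command) {
        boolean recent = lastPointed != null
                && System.currentTimeMillis() - lastPointedAt < PAIRING_WINDOW_MS;
        if (command.equals("zoom in") && recent) {
            zoom(lastPointed, 2.0);          // zoom into the pointed-at area
        } else if (command.equals("zoom out") && recent) {
            zoom(lastPointed, 0.5);
        } else {
            // No recent pointing event to pair with: ask the user to point first.
            prompt("Please point at an area on the map.");
        }
    }

    private void zoom(Point centre, double factor) { /* update the map view */ }
    private void prompt(String text)               { /* speech output */ }
}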

3 Use case example

Let us assume that our user is interested in buying a house in district A or B, with a living area between 100 and 200 square metres. Using the mobile phone together with a headset including a microphone, the user calls the system. The interaction between user and system then proceeds as follows (a sketch of steps 5 to 7 follows the list):

1. After calling the system, a welcome screen is presented and the system asks the user whether he or she has experience with the system. If the user knows the system, no introduction is given and the user is directly asked to specify one of the following search criteria: price level, area, size of the house, or basic properties (e.g. garage yes/no, central heating yes/no).

2. Assuming that the user decides to specify the areas first, three alternative user interfaces are available: selecting the areas via a map (fig. 3), selecting the areas from a list of areas (fig. 2), or directly specifying the areas by speech only (cf. section 4.2).

3. The system offers the user the possibility to specify further criteria.

4. The user asks the system to specify the size of the realty and clarifies in a spoken dialogue that he/she wants to specify the size of the living area.

5. The system presents two text fields for the upper and lower limit on the screen, additionally explaining (using speech output) the format of the expected input and remarking that at least one field has to be filled in.

6. After the user has typed in the upper and lower size limits and pressed the send button, the system presents the values on the screen and asks the user to confirm them.

7. Using speech input, the user confirms the values and tells the system that no further criteria are to be specified.

8. After querying the database, the system presents on the screen a list of items satisfying the search criteria. Each entry shows a set of fields dynamically chosen according to the search criteria specified. Since the user has specified limits for the size of the living area, showing the concrete value for each house as part of each list entry is an obvious choice. The items and their fields (e.g. in fig. 4: size of the living area, size of the realty, and age) are explained by the system using speech the first time the list is presented. Although it contains only a few items, the list enables the user to identify items of no interest, for example because of the age of the house.

9. By selecting an item of interest and asking for a specific kind of data, for example a photo, the user is presented with a tabbed pane showing the requested data. The other tabs contain more information about the same realty (see fig. 6). The user can now either inspect the other tabs or select a concrete type of data by asking for it using speech. Furthermore, the user can ask to go back to the list, go directly to the next item (and thereby browse between the pictures of several houses), or have a call to the realtor established automatically.
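Steps 5 to 7 combine typed input with spoken confirmation. The paper does not show the underlying SpeechMania dialogue description; the following Java sketch illustrates one plausible state machine for this subdialogue, with all class and method names invented for illustration.

/** Sketch of the size-specification subdialogue (steps 5-7): typed numeric
 *  input is echoed on the screen and confirmed by speech. */
public class SizeCriterionDialogue {
    private Integer lowerLimit;   // square metres; at least one limit is required
    private Integer upperLimit;

    enum State { AWAITING_INPUT, AWAITING_CONFIRMATION, DONE }
    private State state = State.AWAITING_INPUT;

    /** Step 6: the user pressed "send" after typing into the two text fields.
     *  (For brevity, non-numeric input is not handled here.) */
    public void onSend(String lowerField, String upperField) {
        lowerLimit = lowerField.isEmpty() ? null : Integer.valueOf(lowerField);
        upperLimit = upperField.isEmpty() ? null : Integer.valueOf(upperField);
        if (lowerLimit == null && upperLimit == null) {
            speak("Please fill in at least one of the two fields.");
            return;                          // stay in AWAITING_INPUT
        }
        show("Living area: " + (lowerLimit == null ? "-" : lowerLimit)
                + " to " + (upperLimit == null ? "-" : upperLimit) + " m2");
        speak("Is this correct?");
        state = State.AWAITING_CONFIRMATION;
    }

    /** Step 7: spoken confirmation ("yes" / "no"). */
    public void onSpeech(String utterance) {
        if (state != State.AWAITING_CONFIRMATION) return;
        if (utterance.equals("yes")) {
            state = State.DONE;              // criterion accepted, dialogue continues
        } else {
            speak("Please correct the values and press send again.");
            state = State.AWAITING_INPUT;
        }
    }

    private void speak(String text) { /* pre-recorded speech output */ }
    private void show(String text)  { /* render on the simulated phone screen */ }
}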

In the following, some design decisions concerning the system and its user interface are described in more detail, and benefits and consequences for usability on small-screen terminals are discussed.

4 Designing the User Interface

The dialogue system has a speech-centric interface: pure speech interaction is used for selecting the properties the user wants to specify as search criteria for his/her house of choice, and partly for inspecting the result set. Tactile and multimodal interaction is used for specifying the properties and for inspecting the result set as well.

4.1 The special role of speech

In the development of the demonstrator, speech interaction plays a special role:
1. Speech output is used for presenting information for which the area of the screen is too small.
2. Speech is used as an alternative when tactile interaction would be slower.
3. Speech input is used in combination with tactile interaction to enhance the tactile interaction possibilities.
4. Speech interaction is used for choosing the properties to define and for inspecting the different features of the realties in the result set.

One criterion for deciding whether the use of speech is preferable to another modality is whether the task can be solved more efficiently (JavaSpeechAPI: chapter 3.1), in particular with respect to task completion time (Jurafsky & Martin, 2000:p. 758). In the MATIS project it was found that users prefer using the mouse for making selections from short lists (Sturm et al., 2001:p. 164). We therefore decided to use checkboxes to let the user select between five basic properties the house should have (e.g. central heating). In contrast, for selecting among the roughly twenty areas of the city, we offer the user a speech-only interface for naming up to three areas the house should be situated in. Marking the areas in a list displayed on the screen would take longer, particularly since the screen is very small and scrolling would be needed. A sketch of this selection rule appears after section 4.2 below.

4.2 Different user interfaces for the same task

Although a speech-only interface seems favourable for selecting areas, two alternative interfaces have been implemented for the same task:
• a list from which the user can choose different areas (fig. 2);
• a map on which the user can select areas by clicking on them, using speech for zooming in/out (fig. 3).
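The selection rule in section 4.1 can be stated compactly in code. The following hypothetical helper chooses the input method from the number of options; the exact cut-off is an assumption, since the paper only states that five properties use checkboxes while around twenty areas use speech.

/** Sketch of the modality-selection rule from section 4.1: short option
 *  lists are offered as checkboxes (mouse preferred for short lists,
 *  cf. Sturm et al., 2001), long ones via speech. */
public class InputModalityChooser {
    enum InputMethod { CHECKBOXES, SPEECH }

    private static final int MAX_CHECKBOX_OPTIONS = 8;  // assumed cut-off

    public static InputMethod choose(int optionCount, boolean fitsOnScreen) {
        // If the list fits on the small screen without scrolling, clicking
        // checkboxes is faster than a spoken selection dialogue.
        if (optionCount <= MAX_CHECKBOX_OPTIONS && fitsOnScreen) {
            return InputMethod.CHECKBOXES;
        }
        // Otherwise speech avoids scrolling through a long on-screen list:
        // e.g. choose(5, true) -> CHECKBOXES, choose(20, false) -> SPEECH.
        return InputMethod.SPEECH;
    }
}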

Figure 2: List view for selecting multiple areas by clicking.

Figure 3: Multimodal map; speech + pointing interaction.

The list view has the advantage that even a user who is not familiar with the districts the system covers can select among them, while the map enables the user to choose areas of the city without knowing their exact names. Furthermore, these alternative interfaces allow us to evaluate different user interfaces for the same task; they can also give evidence on which kind of user interface is preferred and whether all of them work equally well on a small screen.

4.3 The problem of navigation

The big difference between desktop-sized user interfaces for internet access and small-screen devices is not only that presenting the same amount of information requires splitting it into several separately presented chunks, but also that the document structure cannot be shown in parallel using menus or tree-like structures. Finding efficient techniques for navigation can therefore be seen as a central problem in designing effective user interfaces for small-screen devices. Navigation using speech can have a crucial impact on the usability of such an interface: speech allows the user to jump to basically any part of the document without knowing the exact document hierarchy. However, the user has to be made aware of the different nodes of the hierarchy, a problem similar to that of informing the user of a spoken dialogue system which information is available at each stage of the dialogue.

In the real estate information system, the specification of the search criteria is directed by a spoken dialogue, so the problem of navigation arises when inspecting the results retrieved from the database. As when searching for a house in a newspaper, the user has to be enabled a) to quickly identify houses of potential interest and b) to easily inspect the details of a realty. We therefore implemented two ways of navigation (see fig. 4). After specifying the search criteria, the user gets a list of results. The list contains only very basic information on each house; this information is generated dynamically depending on the criteria specified by the user. For example, it is not necessary to state in this list that a house is situated in the area "Tarup" if this has been defined as one of the 'must have' criteria. In parallel with presenting the list, the types of the items in the list are explained via speech, so no additional headers or labels are needed. A sketch of this dynamic field selection follows.
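The paper does not detail how the fields of the list entries are generated. A minimal Java sketch of the rule just described (omit fields fixed by a 'must have' criterion, show fields the user constrained by a range) could look as follows; the field names, the default fields, and the data types are assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of dynamic list-entry generation for the result list. */
public class ListEntryBuilder {
    // Fields shown even without a matching criterion; assumed defaults.
    private static final Set<String> DEFAULT_FIELDS = Set.of("size of realty", "age");

    public static List<String> buildEntry(Map<String, String> house,
                                          Set<String> mustHaveCriteria,
                                          Set<String> rangeCriteria) {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, String> field : house.entrySet()) {
            String key = field.getKey();
            // A value fixed by a 'must have' criterion (e.g. area = "Tarup")
            // is the same for every hit and carries no information.
            if (mustHaveCriteria.contains(key)) continue;
            // Values constrained by a range (e.g. living area) differ between
            // houses, so the concrete value is worth showing; a few default
            // fields are always included.
            if (rangeCriteria.contains(key) || DEFAULT_FIELDS.contains(key)) {
                fields.add(key + ": " + field.getValue());
            }
        }
        return fields;
    }
}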

Figure 4: List view (left) and tabbed pane view (right)

Although it gives only very basic information, the list enables the user to exclude realties of no interest at all. To access further information on a realty, the user selects the specific item and asks for a specific kind of information, for example a photo of the house. The information asked for is shown on the screen in a tabbed pane GUI. The tabbed pane has the advantage that the information available for each house is kept together (see fig. 5). The user can then inspect details of the house by clicking on the picture buttons to view another picture on the same tab, asking for a specific type of information on the house via speech, going back to the list, or moving on to the details of the next house in the list. The tabbed pane has the additional advantage that the depth of the document hierarchy is reduced, and tabs with the same colour indicate the same kind of information for every realty. A sketch of the spoken navigation commands over this structure follows.
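The spoken navigation options just listed amount to a small command dispatcher over the result set. The following Java sketch shows one plausible shape for it; the command phrases, class and method names are invented for illustration, not taken from the system.

import java.util.List;

/** Sketch of dispatching spoken navigation commands over the result set:
 *  the user can jump between list, tabs and items without traversing the
 *  document hierarchy step by step. */
public class ResultNavigator {
    private final List<String> houseIds;  // result set from the database query
    private int current = 0;              // index of the currently shown house

    public ResultNavigator(List<String> houseIds) { this.houseIds = houseIds; }

    public void onSpeech(String utterance) {
        switch (utterance) {
            case "back to the list" -> showList();
            case "next house"       -> { current = Math.min(current + 1, houseIds.size() - 1);
                                         showTab(houseIds.get(current), "photo"); }
            case "show the price"   -> showTab(houseIds.get(current), "price");
            case "show a photo"     -> showTab(houseIds.get(current), "photo");
            case "call the realtor" -> callRealtor(houseIds.get(current));
            default                 -> speak("You can say: back to the list, next house, "
                                             + "show the price, show a photo, or call the realtor.");
        }
    }

    private void showList()                     { /* render the result list */ }
    private void showTab(String id, String tab) { /* render one tab for house id */ }
    private void callRealtor(String id)         { /* establish a phone call */ }
    private void speak(String text)             { /* speech output */ }
}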

Figure 5: Tabbed pane views

5 Preliminary evaluation

The system is currently at an experimental stage. The architecture has proved to work, and the information system has been implemented with basic functionality. Due to the chosen rapid application development approach, both white-box and black-box tests have been carried out as part of the implement-test-revise iterations. One of the advantages of SpeechMania is that input to the system can be given offline, which means that input can be typed for easy testing. These tests revealed problems such as missing links in the intended dialogue structure, as well as lengthy system prompts in some sections of the dialogue. The online version has been tested as well, by a small group of internal users who were given a test scenario to solve. The test users viewed the general idea of a multimodal information system for mobile phones positively. However, the test also exposed the usability problems of a multimodal system and one of the challenges of this approach: finding efficient methods of presenting the possible interaction alternatives to the user at each step of the dialogue. Not all of the users were aware of all the interaction possibilities at each step, especially when the results were presented by the system. The general dialogue structure appeared to be reasonable, and the strategy of using a tabbed pane for grouping and presenting the information for each house was perceived as natural and easy to use.

6 Conclusion

In the present paper we have shown several possibilities for incorporating speech into the design of a multimodal dialogue system, providing the system with a more usable and effective user interface under the constraint of a very small screen. On the basis of a concrete task, building an information retrieval system for the real estate domain, we have shown how speech input/output can be integrated and which advantages result from this integration. Although the different modality combinations for circumventing the restricted screen size have only been partially evaluated so far, the results are promising. Furthermore, we have found several modes of combining speech input/output that enable more effective interaction. Empirical results are expected from testing the system with real users.

7 Acknowledgements

I would like to thank the anonymous reviewers for their valuable comments.

8 References

Almeida, L., I. Amdal, N. Beires, M. Boualem, L. Boves, E. den Os, P. Filoche, R. Gomes, J. Eikeset Knudsen, K. Kvale, J. Rugelbak, C. Tallec & N. Warakagoda, 2002. Implementing and evaluating a multimodal and multilingual tourist guide. In Jan van Kuppevelt et al. (eds.), Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, pages 1–7, Copenhagen, Denmark, June 28–29.

Björk, St., J. Redström, P. Ljungstrand & L.E. Holmquist, 2000. PowerView: Using information links and information views to navigate and visualize information on small displays. In Proceedings of Handheld and Ubiquitous Computing 2000 (HUC2k), Bristol, U.K.

Buchanan, G., S. Farrant, M. Jones, H. Thimbleby, G. Marsden & M. Pazzani, 2000. Improving mobile internet usability. In Proceedings of the Tenth International World Wide Web Conference, pages 673–680, Hong Kong.

Chang, E., H. Meng, Y. Li & T. Fung, 2002. Efficient web search on mobile devices with multi-modal input and intelligent text summarization. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, USA, May.

Dobler, St., 2000. Speech recognition technology for mobile phones. Ericsson Review, (3).

Dybkjær, L., E. André, W. Minker & P. Heisterkamp (eds.), 2002. Proceedings of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany, June 17–19.

Holmquist, L.E., 2000. Breaking the Screen Barrier. PhD thesis, Göteborg University, Department of Informatics, Sweden.

JavaSpeechAPI. Sun Microsystems, Inc. Java Speech API Programmer's Guide, v1.0 edition.

Jurafsky, D. & J.H. Martin, 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall.

Kaljuvee, O., O. Buyukkokten, H. Garcia-Molina & A. Paepcke, 2001. Efficient web form entry on PDAs. In Proceedings of the Tenth International World Wide Web Conference.

Maybury, M.T. & J.-C. Martin (eds.), 2002. Proceedings of the Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, Spain.

Pieraccini, R., B. Carpenter, E. Woudenberg, S. Caskey, St. Springer, J. Bloom & M. Phillips, 2002. Multi-modal spoken dialog with wireless devices. In Dybkjær et al. (2002).

Sturm, J., F. Wang & B. Cranen, 2001. Adding extra input/output modalities to a spoken dialogue system. In Jan van Kuppevelt & Ronnie Smith (eds.), Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, 1–2 September, Aalborg, Denmark.

Telefónica, 2003. Movistar, Telefónica Móviles España. http://www.empresa.movistar.com/30/301030.shtml. Accessed: 15.05.03.