
Interactions between a quiz robot and multiple participants: Focusing on speech, gaze and bodily conduct in Japanese and English speakers

Akiko Yamazaki (1), Keiichi Yamazaki (4), Keiko Ikeda (2), Matthew Burdelski (3), Mihoko Fukushima (4), Tomoyuki Suzuki (4), Miyuki Kurihara (4), Yoshinori Kuno (4) & Yoshinori Kobayashi (4)

(1) Tokyo University of Technology / (2) Kansai University / (3) Osaka University / (4) Saitama University

This paper reports on a quiz robot experiment in which we explore similarities and differences in human participants' speech, gaze, and bodily conduct in responding to a robot's speech, gaze, and bodily conduct across two languages. Our experiment involved three-person groups of Japanese- and English-speaking participants who stood facing the robot and a projection screen that displayed pictures related to the robot's questions. The robot was programmed so that its speech was coordinated with its gaze, body position, and gestures in relation to transition relevance places (TRPs), key words, and deictic words and expressions (e.g. this, this picture) in both languages. Contrary to findings on human interaction, we found that the frequency of English speakers' head nodding was higher than that of Japanese speakers in human-robot interaction (HRI). Our findings suggest that the coordination of the robot's verbal and non-verbal actions around TRPs, key words, and deictic words and expressions is important for facilitating HRI irrespective of participants' native language.

Keywords: coordination of verbal and non-verbal actions; robot gaze; comparison between English and Japanese; human-robot interaction (HRI); transition relevance place (TRP); conversation analysis

1. Introduction

"Were an ethologist from Mars to take a preliminary look at the dominant animal on this planet, he would be immediately struck by how much of its behavior, within a rather extraordinary array of situations and settings (from camps in the tropical rain forest to meetings in Manhattan skyscrapers), was organized through face-to-face interaction with other members of its species" (M. Goodwin 1990: p. 1).


The study of face-to-face interaction has long been a central concern across various social scientific disciplines. More recently it has become an important focus within fields related to technology and computer-mediated discourse (e.g. Heath & Luff 2000; Suchman 2006). This is especially true of research adopting ethnomethodology (Garfinkel 1967) and conversation analysis (hereafter abbreviated as CA) (Sacks, Schegloff & Jefferson 1974) that examines human-robot interaction (HRI) (e.g. Pitsch et al. 2013; A. Yamazaki et al. 2010; A. Yamazaki et al. 2008). One of the central issues in this area is multicultural and inter-cultural patterns in human-robot and human-virtual agent interaction. Many of the relevant corpora are of human-human interaction, but were collected for the purposes of HRI or human-agent interaction, such as the CUBE-G corpus (e.g. Nakano & Rehm 2009), the corpora of the natural language dialogue group of USC (e.g. Traum et al. 2012), and the CMU cross-cultural receptionist corpus (e.g. Makatchev, Simmons & Sakr 2012). These works also focus on the verbal and non-verbal behavior of human interactions in order to develop virtual agents. While we have a similar interest in examining and developing technology that can be employed in real-world environments across various cross-cultural and cross-linguistic settings, our research project utilizes a robot that is able to verbalize and display its bodily orientation towards objects in the immediate vicinity and multiple participants in order to facilitate participants' understanding of, and engagement in, the robot's talk. Thus, a key feature of the present research is not only the use of speech and non-verbal language, but also the coordination of these resources in order to facilitate visitors' engagement in human-robot interaction (A. Yamazaki et al. 2010). In the present paper we focus on how non-verbal actions (e.g. gaze, torso, gesture) are related to question-response sequences in multi-party HRI.

A rich literature on human social interaction can be found in studies on CA. A main focus of these studies is to identify the underlying social organization in constructing sequences of interaction. In particular, a growing number of studies examine question-response sequences. For example, as Stivers and Rossano (2010) point out, a question typically elicits a response from the recipient of the question turn. Thus asking a question is a technique for selecting a next speaker (Sacks, Schegloff & Jefferson 1974). Since a question forms the first part of an adjacency pair, it calls for a specific type of second pair part (cf. Sacks 1987; Schegloff 2007). Rossano (2013) points out that in all question-response sequences there is a transition relevance place (TRP). A TRP is defined as follows: "The first possible completion of a first such unit constitutes an initial transition-relevance place" (Sacks, Schegloff & Jefferson 1974: p. 703); it is a place where turn transfer or speaker change may occur. At a TRP, a hearer can provide a response to the speaker, but may not necessarily take the turn (e.g. verbal continuers, head nods). When somebody asks a question, a hearer has a normative obligation to answer (Rossano 2013). Stivers and her colleagues (2010) examined question-response sequences in naturally occurring dyadic interactions across ten different languages. While there is some variation in the ways speakers produce question formats, what is common among them is that the speaker typically gazes towards the addressee(s) when asking a question to a multiparty audience (e.g. Hayashi 2010).

A number of researchers have studied gaze in human interaction in various settings and sequential contexts. In particular, Kendon (1967) points out that gaze patterns are systematically related to particular features of talk. C. Goodwin (1981) clarified that hearers display their engagement towards the speaker by using gaze. In addition, Bavelas and her colleagues (2002) describe how mutual gaze plays a role in speaker-hearer interactions. A recent study of dyadic interactions in cross-cultural settings reveals similarities and differences in gaze behaviors among different languages and cultures (Rossano, Levinson & Brown 2009). While gaze has been given much thought and acknowledged as an important resource in human-human interaction, a question remains as to how gaze is deployed and can be employed in multiparty HRI. Within the current research on gaze behavior in HRI (e.g. Knight & Simmons 2012; Mutlu et al. 2009) there has not yet been discussion of multiparty question-response sequences in cross-cultural settings.

The present paper begins to fill this gap by comparing human-robot interaction in Japanese and English within a quiz robot experiment. The use of a robot allows us to ask the same questions employing the same bodily gestures, and to compare the responses of participants under the same conditions. Utilizing videotaped recordings, our analysis involves detailed transcriptions of human-robot interaction and a quantitative summary of the kinds of participants' responses. We show that a main difference between the two language groups is that the frequency of nodding of English speakers is significantly higher than that of Japanese speakers. This is contrary to research on human-human interaction that argues that Japanese speakers nod more often than English speakers (e.g. Maynard 1990). Participants show their engagement in interaction with a robot when the robot's utterances and bodily behavior such as gaze are coordinated appropriately.

This paper is organized in the following manner. In Section 2, we discuss the background of this study. In Section 3, we explain the setup of the quiz robot experiment. In Section 4, we offer an initial analysis. In Section 5, we provide a detailed analysis of participants' responses in regard to the robot's gaze and talk. Discussion and concluding remarks follow in Section 6.


2. Background of this study

2.1 Cross-cultural communicative differences: Word order

A rich literature on multicultural and inter-cultural variations in interaction can be found in the literature on 'interactional linguistics' (e.g. Tanaka 1999; Iwasaki 2009), in which scholars with training in CA and related fields tackle cultural differences from a cross-linguistic perspective. They do not follow the traditional syntactic approach to defining "differences" among languages used in interaction. Rather, they reveal an interactional word order associated with a specific social interactional activity. For instance, one study focused on a cross-cultural comparison between Japanese and English (Fox, Hayashi & Jasperson 1996). Differences between interactions involving these two languages are particularly interesting due to their respective word orders. There is a distinctive difference in 'projection' in regard to the timing of completion of a current turn-constructional unit (e.g. a sentential unit), which is defined as 'projectability' in CA.

In regard to question formats, which include interrogatives, declaratives and tag-questions (Stivers 2010), English and Japanese word order exhibits differences and similarities in interaction. For interrogatives, the sentence structure is nearly the opposite between Japanese and English. As Tanaka (1999: p. 103) states, "[i]n a crude sense, the structures of Japanese and English can be regarded as polar opposites. This is reflected in differences in participant orientations to turn-construction and projection." For declaratives, there are no dramatic differences between the two languages in terms of word order; however, the placement of the question word in the sentence differs. For tag-questions, the sentence structure is similar between Japanese and English: a question format 'tagged on' to the end of a statement.

2.2 Coordination of verbal and non-verbal actions and questioning strategy

We have conducted ethnographic research in several museums and exhibitions in Japan and the United States in order to explore ways that expert human guides engage visitors in the exhibits. We analyzed video recordings by adopting CA and have applied the findings to a guide robot that can be employed in museums. Based on the following findings, we employed two central design principles in the robot for the current project.

First, as the coordination of verbal and non-verbal actions is an essential part of human interaction (C. Goodwin 2000), we observed that guides often turn their heads toward the visitors when they mention a key word in their talk, and point towards a display when they use a deictic word or expression (e.g. this painting). We also found that visitors display their engagement by nodding and shifting their gaze, in particular at key words, deictic words and expressions, and sentence endings (one location where a TRP occurs). Adopting these findings, we programmed the coordination of verbal and non-verbal actions into a museum guide robot, and found that in dyadic robot guide-human interactions participants frequently nodded and shifted their gaze towards the object (A. Yamazaki et al. 2010).

Second, we observed that human guides often use a question-answer strategy in engaging visitors. In particular, guides use questions aimed at engaging visitors while coordinating their gaze (K. Yamazaki et al. 2009; Diginfonews 2010). Guides ask a pre-question (first question) regarding an exhibit, and at the same time monitor the visitors' responses (during a pause) to check whether visitors are displaying engagement (e.g. nodding). Then the guide asks a main question (second question) towards a particular visitor who displays engagement. The first question serves as a "clue" to the main question. We programmed this questioning strategy into the robot's talk, and the results showed that the robot could select an appropriate visitor who had displayed "knowledge" in order to provide an answer to the main question (A. Yamazaki et al. 2012).

In relation to the second finding, we found that guides use a combination of three types of pre-question and main question formats towards multiple visitors: (1) the guide begins with a pre-question as an interrogative, and then asks a main question; (2) the guide begins with a pre-question as a tag-question, and then asks a main question; and (3) the guide begins with a declarative sentence, without revealing the name or attributes of the referent, and then asks a question regarding the referent. We implemented these three types of combinations of pre-question and main question in the current quiz robot, as we are interested in how the coordination of the robot's questions (verbal actions) and gaze (non-verbal actions) is effective in multiparty interaction in English and Japanese.
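To make the questioning strategy of Section 2.2 concrete, the following is a minimal Python sketch of the pre-question/main-question logic. All robot and sensor interface names (robot.say, sensors.nod_counts, and so on) are hypothetical illustrations; the paper does not publish the actual control code.

# Illustrative sketch only: all robot/sensor method names are assumptions.
import time

def ask_with_pre_question(robot, sensors, pre_question, main_question,
                          pause_s=2.0):
    """Pre-question, then monitor engagement during a pause, then direct
    the main question at the visitor who displayed the most engagement."""
    robot.gaze_at("all_participants")
    robot.say(pre_question)             # first question: the "clue"

    time.sleep(pause_s)                 # pause while visitors respond
    nods = sensors.nod_counts()         # e.g. {participant_id: nod count}

    # Select the visitor who displayed "knowledge" (here: most nods).
    answerer = max(nods, key=nods.get)

    robot.gaze_at(answerer)             # coordinate gaze with selection
    robot.point_at(answerer)
    robot.say(main_question)
    return answerer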

3. The present experiment: A quiz robot in Japanese and English

In this experiment, we implemented the robot's movements based on the ethnographic findings described above, coordinated with its verbal actions in both English and Japanese.

3.1 Robot system

We used a Robovie-R ver. 3 robot. The experimental system was designed to provide explanations to three participants. The system has three pan-tilt-zoom (PTZ) cameras and three PCs, each dedicated to processing images from one PTZ camera observing one participant. Figure 1 presents an overview of the robot system.


In addition to the three cameras outside the body of the robot and the three PCs for image processing, we used a laser range sensor (Hokuyo UTM-30LX) and another PC for processing the range sensor data and for integrating the sensor processing results. The system detects and tracks the bodies of multiple visitors in the range sensor data, and a PTZ camera is assigned to each detected body. Each camera is controlled so that it turns toward the observed body of its assigned participant. For detecting and tracking a face and computing its direction, we used Face API (http://www.seeingmachines.com/product/faceapi/). The pan, tilt, and zoom of each PTZ camera are automatically adjusted based on the face detection results, so that the face remains focused in the center of the image. The system can locate a human body with a margin of error of 6 cm for position and 6 degrees for orientation. It can measure 3D face direction within 3 degrees of error at 30 frames per second. The head orientation is measured around three coordinate axes (roll, pitch and yaw) with the origin at the center of the head. From the face direction results, the system can recognize the following behaviors of participants: nodding, head shaking, cocking and tilting the head, and gazing away, and it can choose an appropriate answerer based on such recognition results (A. Yamazaki et al. 2012). However, in the experiments described later, we did not use these autonomous functions, so as to prevent potential recognition errors from influencing human behaviors. The system detected and tracked the participants' positions using the laser range sensor to turn the robot's head precisely toward them. A human experimenter, who was seated behind the screen and could see the participants' faces, controlled the robot to point to and ask one of the three participants to answer the question. In other words, we adopted a Wizard of Oz (WOZ) method for selecting an answerer, with a human experimenter controlling the robot's body movement.
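As an illustration of how head gestures can be recognized from such a face-direction stream, the sketch below classifies roughly one second of pitch/yaw samples. The thresholds and window length are our own assumptions for illustration, not the values used in the system described above.

# Sketch of head-gesture classification from 30 fps face-direction data.
# Thresholds and heuristics are illustrative assumptions.
import numpy as np

def classify_head_gesture(pitch_deg, yaw_deg, move_thresh=10.0,
                          away_thresh=20.0):
    """pitch_deg, yaw_deg: ~30 samples (1 s at 30 fps) of head angles,
    in degrees, relative to facing the robot."""
    pitch = np.asarray(pitch_deg)
    yaw = np.asarray(yaw_deg)

    pitch_range = np.ptp(pitch)   # peak-to-peak: up-down movement
    yaw_range = np.ptp(yaw)       # peak-to-peak: left-right movement

    if abs(yaw.mean()) > away_thresh:
        return "gazing away"      # sustained offset away from the robot
    if pitch_range > move_thresh and pitch_range > 2 * yaw_range:
        return "nodding"          # mostly vertical oscillation
    if yaw_range > move_thresh and yaw_range > 2 * pitch_range:
        return "head shaking"     # mostly horizontal oscillation
    return "none"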

Figure 1. Quiz robot system (overview diagram: three PTZ cameras with PCs for face tracking and pan-tilt control, a laser range sensor with a PC for body tracking and the main process, and a robot control PC)

Following the findings from our ethnographic research on expert human guides, we programmed the robot to move its gaze by moving its head (its eyes can be moved as well) and its arms/hands in relation to its speech, using canned phrases (a built-in text-to-speech system can be used as well). The robot can open and close its hands and move its forefingers to point towards a target projected on the screen, similar to a human guide. The robot speaks English to the English-speaking participants and Japanese to the Japanese-speaking participants. A rough outline of the sequence of speech and movement is as follows (a sketch of how such a script might be represented follows this list):

1. Before the robot begins to talk, it looks towards the participants.
2. When the robot says the first word, it moves its gaze and hands to a picture projected on the screen behind it.
3. During its speech, the robot moves its gaze and hand when it utters deictic words and expressions such as 'here' and 'this picture'.
4. At the end of each sentence, the robot moves its gaze towards the three participants one at a time, or looks at a particular participant depending on the length of the sentence, as expert human guides do.
5. When the robot asks the main question, it moves its gaze and hand towards a particular participant (selected by the experimenter).
6. If the participant gives the correct answer to the main question, the robot makes a clapping gesture (an experimenter operates a PC to play a hand-clapping sound) and repeats the answer to the question. When the participant gives an incorrect answer, the robot says, "That's incorrect" (Chigaimasu in Japanese) and then produces the correct answer.
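As referenced above, one simple way to represent such a coordinated sequence is as an ordered script of speech and movement events. The event names and robot API below are hypothetical, and the phrase text is loosely adapted from the Q3 excerpt in Section 3.3 (the main-question wording here is invented for illustration).

# Hypothetical script representation for one question. A real controller
# would schedule gaze/gesture concurrently with speech, keyed to word
# onsets, deictic expressions, and TRPs (sentence endings).
Q3_SCRIPT = [
    ("gaze", "participants"),      # (1) look at participants before talking
    ("say", "This well-known port in Japan"),
    ("gaze", "screen"),            # (2) gaze/hand to the picture at the
    ("point", "screen"),           #     first words
    ("say", "has many sightseeing spots"),
    ("gaze", "each_participant"),  # (4) sentence end: gaze at each in turn
    ("say", "Do you know the name of this port?"),
    ("gaze", "answerer"),          # (5) main question: gaze/point at the
    ("point", "answerer"),         #     participant chosen by the operator
]

def run_script(robot, script, answerer):
    for action, target in script:
        resolved = answerer if target == "answerer" else target
        if action == "say":
            robot.play_canned_phrase(target)
        elif action == "gaze":
            robot.turn_head(resolved)
        elif action == "point":
            robot.point_hand(resolved)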

3.2 Experimental setup

The following are the details of the experiment we conducted in English and Japanese with the quiz robot. As described above, each group consisted of three participants.

1. Experiment 1 (in English): Kansai University (Osaka, Japan), 15 June 2012; 21 participants (7 groups): 18 male and 3 female native speakers of English (mainly Americans, New Zealanders, and Australians). All participants were international undergraduate/graduate students or researchers who either were studying Japanese language and culture at the time or had in-depth knowledge of Japan.
2. Experiment 2 (in Japanese): Saitama University (Saitama, Japan, near Tokyo), 4 July 2012; 51 participants (17 groups): 31 male and 20 female native speakers of Japanese. All participants were undergraduate students of Saitama University.


During the experiment, we used three video cameras. The positions of the robot (R), the participants (1, 2 and 3), and the three video cameras (A, B, C) are shown in Figure 2.

Figure 2. Bird's eye view of experimental setup (display image size: 180 cm width × 120 cm height)

Before the experiments, our staff asked the participants to answer the robot's quiz questions. The robot then asked each group six questions related to six different pictures projected on a screen behind the robot. The content of these questions (Q1–Q6) was as follows:

Q1: Name of the war portrayed in Picasso's painting 'Guernica' (Screen: Guernica painting) (Answer: Spanish Civil War).
Q2: Name of a Japanese puppet play (Screen: picture of a famous playwright and a person operating a puppet) (Answer: Bunraku).
Q3: Name of the prefecture in which the city of Kobe is located (Screen: picture of Kobe port and a Christmas light show called 'Luminarie') (Answer: Hyogo Prefecture) (Figure 3).
Q4: Name (either first, last, or both) of the lord of Osaka Castle (Screen: picture of Osaka Castle and Lord Hideyoshi Toyotomi) (Answer: Hideyoshi Toyotomi) (Figure 4).
Q5-a (English speakers only): Full name of a Japanese baseball player in the American major leagues (Screen: photo of Ichiro Suzuki) (Answer: Ichiro Suzuki).
Q5-b (Japanese speakers only): Full name of the chief cabinet secretary in former Japanese Prime Minister Kan's cabinet (Screen: photo of Chief Cabinet Secretary Yukio Edano) (Answer: Yukio Edano).


Q6: Name of the former governor of California who was the husband of the niece of a former president of the United States (Screen: map of California, John F. Kennedy) (Answer: Arnold Schwarzenegger) (Figure 5).

Figure 3. Image projected on screen at Q3 (Kobe port on the left and "Luminarie" on the right)

Figure 4. Image projected on screen at Q4 (Osaka Castle on the left and Hideyoshi Toyotomi on the right)

When an image is projected on the screen (as in Figures 3, 4 and 5), the robot poses a pre-question to the visitors (e.g. "Do you know this castle?"), and then provides an explanation of the image before asking a main question (e.g. "Do you know the name of the famous person in this photo who had this castle built?").
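To illustrate, the quiz content could be represented as a simple per-question data structure that drives the same behavior in both language conditions. The field names and file names below are our own illustration; the Q4 pre-question and main question are quoted from the text above, while those for Q3 are invented.

# Illustrative data structure for quiz items; field and file names are
# hypothetical. Each item pairs a projected image with a pre-question,
# a main question, and the accepted answer.
QUESTIONS = [
    {
        "id": "Q4",
        "image": "osaka_castle_hideyoshi.jpg",   # see Figure 4
        "pre_question": "Do you know this castle?",
        "main_question": ("Do you know the name of the famous person "
                          "in this photo who had this castle built?"),
        "answer": "Hideyoshi Toyotomi",
    },
    {
        "id": "Q3",
        "image": "kobe_port_luminarie.jpg",      # see Figure 3
        "pre_question": "Do you know this well-known port in Japan?",
        "main_question": ("Do you know the name of the prefecture "
                          "in which this city is located?"),
        "answer": "Hyogo Prefecture",
    },
]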


Figure 5. Image projected on screen at Q6 (President Kennedy on the left and a map of the United States with California highlighted on the right)

Due to differences in the topics of Q5 in English and Japanese, and the low frequency of correct answers for Q1 and Q2, we do not analyze these three questions in detail here. Rather, we focus on Q3, Q4, and Q6. We will also report some results of a questionnaire that participants filled out after the experiment on whether they knew the answers to the pre-questions and main questions of Q3, Q4 and Q6.

3.3 Experimental stimuli

In what follows we explain the three question types in English and Japanese in relation to robot gaze, and discuss similarities and differences in terms of word order.

1. Declarative question: Q3

For each utterance there are three lines of transcript. Rh stands for the robot's hand motion and Rg represents the robot's head motion. The third line, R, represents the robot's speech. Transcription symbols for the robot's movement are as follows: 'f' = robot facing forward towards the participants; a comma-like mark (,) = robot moving its hand/head; 'i' = robot in a still position facing towards the screen; 'd' = robot's hand/head down; '1' = robot in a still position facing towards Participant 1, who stands furthest to the right of the three participants from the robot's position; '2' = robot in a still position facing towards Participant 2, who stands between the other two participants; '3' = robot in a still position facing towards Participant 3, who stands furthest to the left of the three participants; 'o' = robot spreading its hands and arms outward; 'm' = robot moving its hands.

We adopt the Jefferson (1984) transcription notation for describing the robot's speech. The transcription symbols are as follows: > < encloses a portion of an utterance delivered at a pace noticeably quicker than the surrounding talk, and < > one noticeably slower; (.) represents a brief pause; ↑ marks a sharp rise in pitch register; ↓ marks a noticeable fall in pitch register; underlining represents vocalic emphasis.

Q3 in English
01. Rh: f,,,,,iiiiiiiiiiiiiiiiiiiiiiiiiiiii
02. Rg: f,,,,,iiiiiiiiiiiiiiiiiii,,,,,fffff
03. R : This well-known(.)port in Japan has
04. Rh: iiiiiiiiiiiiiiiiiiii,,,,,,ooooommmmmmmm
05. Rg: ffffffffffffffffffff,,,,,,111111,,,,,,,
07. Rh: mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
08. Rg: ,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3,,,
09. R : a;nd Arima (.)>hot springs
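Tracks like Rh and Rg above encode one state symbol per time slice, aligned character by character with the speech line. As a small illustration (our own, not the paper's tooling), consecutive identical symbols can be collapsed into readable segments:

# Collapse a per-slice annotation track (e.g. line 01, Rh) into
# (state, duration-in-slices) segments. Symbol glosses follow the
# transcription conventions defined above.
from itertools import groupby

GLOSS = {
    "f": "facing participants", ",": "moving",
    "i": "still, facing screen", "d": "hand/head down",
    "1": "facing Participant 1", "2": "facing Participant 2",
    "3": "facing Participant 3", "o": "hands spread outward",
    "m": "hands moving",
}

def segments(track):
    return [(GLOSS[s], len(list(run))) for s, run in groupby(track)]

print(segments("f,,,,,iiiiiiiiiiiiiiiiiiiiiiiiiiiii"))
# segments of line 01 (Rh): facing participants, then moving,
# then still facing the screen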