
CHI 2018, April 21–26, 2018, Montréal, QC, Canada

Better Understanding of Foot Gestures: An Elicitation Study

Yasmin Felberbaum, University of Haifa, Haifa, Israel, [email protected]
Joel Lanir, University of Haifa, Haifa, Israel, [email protected]

ABSTRACT

We present a study aimed at better understanding users' perceptions of foot gestures employed on a horizontal surface. We applied a user elicitation methodology, in which participants were asked to suggest foot gestures for actions (referents) in three conditions: standing up in front of a large display, sitting down in front of a desktop display, and standing on a projected surface. Based on majority count and agreement scores, we identified three gesture sets, one for each condition. Each gesture set shows a mapping between a common action and its chosen gesture. As a further contribution, we suggest a new measure called the specification score, which indicates the degree to which a gesture is specific, preferable and intuitive for an action in a specific condition of use. Finally, we present measurable insights that can be implemented as guidelines for future development and research of foot interaction.

Author Keywords

Foot interaction; foot gestures; elicitation study; user-defined gesture set.

ACM Classification Keywords

H.5.2. Information interfaces and presentation (e.g., HCI): User Interfaces

INTRODUCTION

We often use our feet to interact with surrounding devices. We use our feet in the car while driving, or on a bicycle when pedaling. We use foot pedals when playing a piano or to add effects when playing an electric guitar, and we use our feet to play various computer games (most notably dancing or sports games). Foot input has the potential to create new opportunities that can augment and enrich our interactions with computing devices. For example, foot input could be used to augment mouse interaction while working on a stationary computer, to switch between screen displays during surgery, or to compensate for limited hand movement abilities.


Foot interaction can be either an additional input mechanism or the main method of interaction; it can be used while standing and while sitting; and it can be used by different types of users in various scenarios. Various research projects have started exploring the design space of foot interaction. Some previous works looked at foot interaction while the user is in motion, for example using kicking gestures as a control option for mobile devices when the hands are not available [1]. Others looked at stationary scenarios, such as interacting with a fixed vertical surface [10], navigating in a virtual environment by combining foot gestures with other body-based gestures [9], or using the feet as the main input to control an avatar in a video game [34]. Looking at these studies, it seems that foot interaction has the potential to be intuitive, helpful and even enjoyable.

It is important to differentiate between the terms foot interaction and foot gesture. Foot interaction means any way of interacting with a device using one's foot (e.g., pedals, foot buttons, etc.). Foot gesture refers to a predefined foot movement that triggers a specific action, for example, dragging the foot to drag an object on the screen. While the examples given above use and implement foot interactions in various scenarios, the nature of foot gestures and the understanding of which foot gestures to assign to different functions often remain unclear. Some mappings were suggested in previous works [1, 4, 34]. However, these were either examined in a specific domain or studied within a specific implementation.

The focus of the current study is on direct foot interaction on a horizontal surface. We examined three different user conditions: standing up in front of a large display, sitting down in front of a desktop display, and standing on a projected surface. We elicited user gestures employing a user-defined gesture methodology [32] for two representational domains, GUI actions and Avatar actions, creating gesture sets for foot interaction in all three user conditions. The findings revealed measurable observations and principles for foot gestures that could be generalized and used across various domains and applications.

RELATED WORK

In this section, we first explore the use of foot gestures as an additional interaction mechanism, and then look at works that examined foot gestures as the main method of interaction. Finally, we review previous work regarding foot gesture categorization and mapping.


Foot gestures as an additional interaction mechanism

Foot interaction can be used as an augmentation to other types of input, with hands often better suited for performing accurate tasks and feet used to perform less accurate tasks [17, 21, 24, 33]. Lu et al. [14] claimed that combining hand and foot interaction may be more intuitive than using hands only. They developed a multimodal football game using augmented reality on smartphones, in which interaction is done using foot gestures and finger touch. Foot gestures can also augment body-based gestures. LaViola et al. [12] presented an example in which hands-free, upper-body-based gestures in combination with foot gestures were perceived as an intuitive and natural way to navigate a virtual world. Silva and Bowman [23] presented an example of combining body-based interaction, foot interaction (using a pedal) and keyboard input for desktop games. Göbel et al. [6] used eye gaze and foot interaction to assign different controls to "pan" and "zoom" tasks. Other studies that examined foot gestures as an augmentation were mostly interested in finding a way to interact with devices when our hands are busy or when our device is out of reach. Crossan et al. [4] claimed that foot gestures alone should not be the main interaction type. Similarly, Scott et al. [22] suggested a larger variety of foot gestures to allow for eyes- and hands-free interaction with mobile devices.

Foot gestures as the main form of interaction

A different approach looks at foot gestures as a stand-alone way of interaction and the major means of input. In Yin and Pai [34], the user controlled a virtual character using only foot gestures (specifically, foot pressure). However, the authors mention that it is impossible to recreate full body motion using foot pressure alone. Lv et al. [15] presented an augmented foot interaction interface that detects and tracks the user's foot motion and foot gestures using computer vision-based algorithms. The paper presented two implementations, an augmented football game and an augmented foot piano, both controlled by foot pressure or gestures only. Another implementation suggested controlling a mobile phone menu by "kicking" the desired option [16]. Following this direction, Han et al. [9] explored how to control various mobile actions such as navigation and zoom with kick gestures.

Foot Gestures Categorization and Mapping

Several studies attempted to categorize various aspects of foot interaction. Different challenges that should be taken into consideration when using foot interaction with a horizontal surface were explored, such as how to prevent inadvertent activation of the surface, what the active part of the foot is when stepping on the surface, and what the active part is when pointing at an object on the surface [2]. Velloso et al. [29] suggested describing foot gestures according to four categories: (1) semaphoric gestures, simple signals (e.g., tap), which are considered the most basic foot gestures one could perform; (2) deictic gestures, which are meant for pointing at or "touching" a specific target (e.g., tapping on an object); (3) manipulative gestures, which are performed in order to change an object's properties, e.g., changing an object's location by dragging it; and (4) implicit gestures, which are not primarily targeted to interact with a device.

Looking specifically at foot input while sitting with the feet placed under the desk, Velloso et al. [28] explored foot pointing methods, direction of foot movement, the use of both feet together, and the combination of both feet and hand-based mouse interaction. The authors found that a horizontal foot motion is faster than a vertical one, and that users performed tasks faster using two feet than when using only one.

Various works explored specific foot gestures in relation to specific system actions. For example, kick gestures, among other gestures, were explored to interact with parts of a large vertical surface [10], to control desktop applications [20], and to control a mobile device [9, 16]. Crossan et al. [4] explored the use of foot tapping to browse a menu on a mobile device that is out of reach (i.e., in the pocket). Fukahori et al. [7] presented a user-defined gesture study of foot pressure and pattern recognition. Finally, Alexander et al. [1] performed a user elicitation study to examine how foot gestures can be used to control a mobile device when the hands are busy. They asked participants to suggest gestures for a variety of common mobile-device commands such as answering and ending a call. The work presents general gesture mappings and is focused on mobile phone control while the user is on the move, allowing for any type of foot gesture (including kicking and mid-air gestures).

The mappings suggested in these previous works mostly examined one or two gestures performed in a specific scenario or related to a specific use. The focus of our study is on what we believe is the most common setting for foot interaction: interacting with a horizontal surface. Horizontal sensing surfaces are already in commercial use. Nevertheless, the mapping of foot gestures to control various actions on these surfaces remains relatively unexplored. Our work aims to explore this design space, define intuitive mappings, and find general insights about the characteristics and suitability of foot gestures for various tasks. Our study thus provides a first perspective into users' opinions and intuitions regarding foot interaction and the use of foot gestures with a horizontal foot-sensing input device.

METHODOLOGY

The user elicitation study methodology was first presented as a method for defining intuitive and easy-to-use gestures [31, 32]. Participants suggest gestures that could be used to perform basic actions presented to them (referred to as referents). The level of consensus regarding a referent is determined by an agreement score calculated for each referent.


Figure 1: (a) Performing foot gestures in front of a large display (Standing condition); (b) performing foot gestures in front of a desktop display (Sitting condition); and (c) performing foot gestures while the display is projected on the floor (Projection condition).

We conducted a user elicitation study for foot gestures on a horizontal surface using a methodology similar to that employed in previous works [8, 18, 25, 30]. Our goal was to obtain an overview of using the feet as an interaction input tool regardless of a specific use case; hence, we conducted the study in three conditions and included referents from two representative domains.

Study Conditions

There are three possible body postures when interacting using the feet: standing, sitting and walking/running [29]. We looked at the standing and sitting postures, which are more natural with a stationary horizontal surface; we considered walking and running as possible suggested gestures. In addition, we incorporated two options for displaying the system's output: via an external screen or projected onto an area on the floor. Thus, we examined foot gestures in three conditions (Figure 1): (1) Standing. The participant stands on the surface and the referents are presented on a large display facing the user; (2) Sitting. The participant sits in front of a computer screen with referents presented on the screen; (3) Projection. The participant stands on the surface with referents projected onto the surface, similar to a large horizontal touch screen.

Referents

During the study, we presented referents from two domains, GUI actions and avatar actions, which are representative of a wide range of use cases of foot interaction [12, 15, 17, 23]. The GUI domain represents a common scenario of operating a general computing interface, while the Avatar domain represents a scenario of operating a virtual character (e.g., in gaming). The list of representative referents includes basic action controls. For the GUI referents, we used most of the referents suggested by Wobbrock et al. [32], excluding referents that were targeted to specific use cases. The avatar referents were based on common actions of an avatar in a virtual environment [13]. We did not use referents related to an avatar's interactions with other virtual entities (for example: bow, hand wave, agree, deny, etc.). Avatar control referents were not examined in the Projection condition, since controlling an avatar on a projected floor is not intuitive.


Participants

A total of 60 participants took part in the study; 20 participants were randomly assigned to each condition: Standing (13 female), Sitting (13 female) and Projection (16 female). Each session took approximately one hour. The average age of participants was 27.2 (SD = 2.91) in the Standing condition group, 26.55 (SD = 4.86) in the Sitting condition group, and 24.9 (SD = 4.05) in the Projection condition group. The participants were mostly students from various departments in our university: information systems, psychology, philosophy, economics and therapy professions.

Foot Preference

Foot preference is the user's natural preference to use the left or right foot. We used a partial foot preference test based on Chapman et al. [3] to determine participants' foot preference. Our test included three of the tasks that were found most reliable: (1) write the participant's name on the floor with the foot, similar to writing it in the sand; (2) erase the name from the sand; and (3) kick an imaginary ball. Most of the participants tended to have a more skillful right foot.

Figure 2: (a) Image of the Zoom-in referent's video; (b) image of the Jump referent's video.


Procedure

The study's procedure is similar to the procedure presented by Wobbrock et al. [32]. Before starting the session, the participants were asked to perform the foot preference test explained above. Short videos representing the GUI and Avatar referents were shown as separate sets, in a counter-balanced order (Figure 2 shows examples of the presented referents). Referents of each type were presented in a random order, but opposite referents (such as zoom-in and zoom-out) were presented sequentially, as there is a high degree of consistency even when the referents are presented randomly [32]. Participants were allowed to use their feet in any way they wanted, and were encouraged to verbally explain their choices and think aloud. After each suggested gesture, participants were asked to rate the gesture's goodness and ease on two 7-point Likert scales, similar to the scales used in [32]. A video camera placed in front of the participant recorded each session; the recordings were subsequently manually annotated and categorized.

RESULTS

Throughout this work, we differentiate between suggested gestures and chosen gestures. A suggested gesture is any gesture suggested during the study by a single participant, whereas a chosen gesture is a gesture that was suggested by the majority of participants for a specific referent. In this section, we describe observations and quantitative results based on the gestures' categorization.

Chosen Gestures

We discuss the overall agreement score, goodness and ease results across conditions and domains. Next, we describe the chosen gestures for each of the three conditions.

Overall Agreement Score Results

We calculated an agreement score as a single value that represents the consensus on a chosen gesture for a particular referent. There are several ways to calculate agreement scores [26, 27]; we followed the original method presented in [32], shown in Eq. 1, where P_r is the set of all gestures proposed for referent r and the P_i are the groups of identical proposals within P_r. Results of the mean agreement scores in all conditions and referent sets are presented in Figure 3. In both the Standing and Sitting conditions, the overall average agreement score for the Avatar referents was significantly higher than that obtained for the GUI referents (Standing: t(10.831) = -4.34, p = 0.01; Sitting: t(26) = -3.22, p = 0.003).

A_r = \sum_{P_i \subseteq P_r} \left( \frac{|P_i|}{|P_r|} \right)^2        (1)

Figure 3: Mean agreement scores across conditions and domains. Error bars denote standard error.
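To make the computation concrete, the following minimal Python sketch computes the per-referent agreement score of Eq. 1 from a list of elicited proposals. Only the formula follows [32]; the function name and the example referent and counts are ours and purely illustrative, not data from this study.

from collections import Counter

def agreement_score(proposals):
    # Agreement score for a single referent (Eq. 1): identical proposals
    # form the groups P_i; |P_r| is the total number of proposals elicited
    # for the referent.
    total = len(proposals)                       # |P_r|
    groups = Counter(proposals)                  # |P_i| for each group
    return sum((size / total) ** 2 for size in groups.values())

# Hypothetical referent with 20 proposals (illustrative only):
zoom_in = ["spread feet apart"] * 11 + ["step forward"] * 6 + ["double tap"] * 3
print(round(agreement_score(zoom_in), 3))   # (11/20)^2 + (6/20)^2 + (3/20)^2 = 0.415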

Looking at the GUI domain, a one-way ANOVA showed significant differences in agreement scores across the three conditions (F = 5.32, p = 0.008). A post-hoc analysis using the Bonferroni correction indicated that the Projection average agreement score was significantly higher than that of either the Standing (p = 0.012) or the Sitting (p = 0.047) condition. In the Avatar domain, the mean agreement score in the Standing condition was found to be significantly higher than that of the Sitting condition (t(22) = 2.35, p = .028). This suggests that regarding the avatar-related gestures, the participants tended to agree more about gestures performed when standing than about gestures performed in the sitting position. A graph comparing the agreement scores for each referent in the three conditions is presented in Figure 4.

Figure 4: Agreement scores for each referent (Avatar and GUI domains) in all three conditions.

Figure 5: Top-ten chosen gestures in each of the three conditions: standing, sitting and projection.

Goodness and Ease

For each suggested gesture, participants rated its goodness (to what extent do you think that the suggested gesture is a good match for the referent?) and ease (to what extent is the suggested gesture easy to perform?). We compared goodness scores across the conditions and domains. For GUI referents, goodness ratings were almost the same across all conditions (Standing: M_GUI = 5.73, Sitting: M_GUI = 5.73, Projection: M_GUI = 5.97); nevertheless, the difference between the Projection condition and the other two conditions was significant (F(2,975) = 4.854, p = 0.008). A post-hoc analysis indicated that the goodness level in the Projection condition was significantly higher than in either the Standing or the Sitting condition. For Avatar referents, goodness ratings in the Standing condition were significantly higher than in the Sitting condition (Standing: M_Avatar = 6.11; Sitting: M_Avatar = 5.79; t(437) = 2.49, p = 0.013), meaning that participants found their suggested gestures for Avatar referents performed from a standing position more suitable than those performed from a seated position. We compared ease scores across conditions and domains as well. For GUI referents, ease ratings were similar across all three conditions, and no significant difference was found between the conditions. However, for Avatar referents, participants found the gestures suggested in the Standing condition easier to perform than those suggested in the Sitting condition (Standing: M_Avatar = 6.42; Sitting: M_Avatar = 6.07; t(437) = 3.10, p = 0.002).

Gesture sets

Figure 5 presents the top-10 chosen gestures of the referents with the highest agreement scores in the standing, sitting and projection conditions. Chosen gestures are gestures that were suggested the highest number of times for a specific referent. A total of 620 gestures (115 distinct) were suggested in the Standing condition; 480 gestures (107 distinct) in the Sitting condition; and 400 gestures (114 distinct) in the Projection condition.
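In the same spirit, a chosen gesture can be read directly off the proposal counts. The following sketch (with hypothetical counts, not the study data) simply picks the most frequent proposal for a referent.

from collections import Counter

def chosen_gesture(proposals):
    # The chosen gesture for a referent is the proposal suggested by the
    # largest number of participants (majority count).
    return Counter(proposals).most_common(1)[0][0]

# Hypothetical proposals for a single referent (illustrative only):
print(chosen_gesture(["tap"] * 9 + ["double tap"] * 7 + ["kick"] * 4))   # tap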


Specification Score

We wished to generalize our information on foot gestures regardless of a specific referent or use case. Therefore, we suggest a new metric to complement the agreement score, one that examines how much a gesture is "unique" or "specific" to the referent for which it was chosen and how much it can be generalized and used for various referents. We refer to this as the specification score. The equation is based on the agreement score equation presented in [32]:

S = \sum_{r \in R_p} \left( \frac{|r|}{|R_p|} \right)^2        (2)

In Eq. 2, R_p is the set of referents for which the proposed gesture p was suggested. We count the number of times the gesture was suggested for a referent r (|r|) and divide it by the total number of times the gesture was suggested in the dataset (|R_p|). To calculate the score, we sum the squares of these ratios. The score can vary between 0 and 1. The more referents the gesture was chosen for, the lower the score. Thus, when a gesture gets a low score, it is less specific to an action and more general; when a gesture gets a high score, it is more specific to an action. For example, the "walk forward/in place" gesture in the Standing condition was suggested twice (by two participants) for the "activate" referent and 11 times for the "walk" referent. Therefore, the specification score for this gesture is:

S_{walk forward/in place} (Standing) = (2/13)^2 + (11/13)^2 = 0.74

The relatively high score shows that this gesture is quite specific to the referent(s) for which it was chosen. Figure 6 presents the average specification scores, as described above, for the gestures chosen in each of the conditions and for each of the referent sets. Comparing the GUI- and Avatar-related specification scores for the chosen gestures revealed that in the Standing and Sitting conditions, the mean Avatar-related scores were higher than those of the GUI-related chosen gestures. This difference was significant in the Sitting condition, but not in the Standing condition (Standing: M_Avatar = 0.50, M_GUI = 0.37, t(11.98) = -1.55, p = 0.145; Sitting: M_Avatar = 0.66, M_GUI = 0.38, t(26) = -2.45, p = 0.021). As regards the specification scores of GUI-related chosen gestures, the differences between the conditions were not significant. Similarly, for the Avatar-related chosen gestures, specification scores in the Sitting condition were higher, but not significantly so, than those in the Standing condition (Standing: M_Avatar = 0.50, Sitting: M_Avatar = 0.66, t(22) = -1.33, p = 0.198).

Figure 6: Mean specification scores across conditions and domains. Error bars denote standard error.
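As an illustration of Eq. 2, the following minimal Python sketch computes the specification score of a single gesture. The function and variable names are ours; the counts reproduce the "walk forward/in place" example given above.

def specification_score(counts_per_referent):
    # Specification score of one gesture (Eq. 2): counts_per_referent maps
    # each referent to the number of times this gesture was suggested for it;
    # the denominator is the total number of times the gesture was suggested.
    total = sum(counts_per_referent.values())
    return sum((count / total) ** 2 for count in counts_per_referent.values())

# "walk forward/in place" in the Standing condition (example from the text):
print(round(specification_score({"activate": 2, "walk": 11}), 2))   # 0.74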


          Standing              Sitting               Projection
          Simple    Complex     Simple    Complex     Simple    Complex
GUI       57.3%     42.7%       80%       20%         43%       56.8%
Avatar    42.1%     57.9%       66.4%     33.6%       —         —

Table 1: Distribution of simple and complex gestures in each of the conditions and domains

When comparing gestures for a single referent across conditions, some gestures have different specification scores. Thus, for example, the gesture "drag from right to left", which was suggested and chosen to denote the referent "select/move the next object on the right" in all three conditions, had a score of 0.389 in the Standing condition, 0.621 in the Sitting condition, and 0.352 in the Projection condition. This suggests that gestures that are unique or specific in one condition are not necessarily as specific in another condition. We believe that this difference stems from differences in affordances between the conditions, as gestures that are more natural to perform when standing are not necessarily as easy to perform when sitting.

Properties of Gestures

In this section, we describe the analysis of a few relevant foot gesture properties that may help devise guidelines for developing foot input systems.

Simple and Complex Gestures

To learn about foot gesture structure, we classified all the suggested gestures into simple and complex gestures (or compound gestures, as referred to in Ruiz et al. [19]). Accordingly, simple gestures consist of a single movement, meaning that there is no spatial discontinuity while performing the gesture (for example, inflection points or pauses in motion); complex gestures contain more than one simple gesture, meaning that they can be divided into simple gestures. As many foot gestures require the use of both feet, we considered gestures in which both feet perform a simple gesture simultaneously as simple, whereas gestures in which a simple gesture is performed alternately with each foot were considered complex. The distribution of suggested simple and complex gestures within the three conditions is presented in Table 1. Overall, in the Standing and Projection conditions, there was more or less an equal distribution of simple and complex gestures, whereas in the Sitting condition, most of the suggested gestures were simple.

When looking more closely at each condition, in the Standing condition most of the GUI-related gestures were simple and most of the Avatar-related gestures were complex, whereas in the Sitting condition, most of the gestures suggested for both domains were simple. A correlation was found between the domain and the complexity of the suggested gestures (Standing: χ² = 16.18, p < 0.001; Sitting: χ² = 12.30, p < 0.002). When looking at the chosen gesture sets across both domains, complex gestures had a significantly higher mean specification score than simple gestures in both the Standing and Sitting conditions (Standing: M_simple = 0.28, M_complex = 0.51, p