Annotating Multimodal Behaviors Occurring during Non Basic Emotions

Jean-Claude Martin, Sarkis Abrilian, Laurence Devillers
LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France
{martin, abrilian, devil}@limsi.fr

Abstract. The design of affective interfaces such as credible expressive characters in story-telling applications requires the understanding and the modeling of relations between realistic emotions and behaviors in different modalities such as facial expressions, speech, hand gestures and body movements. Yet, research on emotional multimodal behaviors has focused on individual modalities during acted basic emotions. In this paper we describe the coding scheme that we have designed for annotating multimodal behaviors observed during mixed and non acted emotions. We explain how we used it for the annotation of videos from a corpus of emotionally rich TV interviews. We illustrate how the annotations can be used to compute expressive profiles of videos and relations between non basic emotions and multimodal behaviors.

1 Introduction

The design of affective interfaces such as credible expressive characters in storytelling applications requires the understanding and the modeling of relations between realistic emotions and behaviors in different modalities such as facial expressions, speech, body movements and hand gestures. Until now, most experimental studies of multimodal behaviors related to emotions have considered only basic and acted emotions and their relation with mono-modal behaviors such as facial expressions [8] or body movements [18]. Recent tools [11] facilitate the annotation and the collection of multimodal corpora but raise several issues regarding the study of emotions: How should we annotate multimodal behaviors occurring during emotions? What are the relevant behavioral dimensions to annotate? How do the multimodal behaviors observed during basic acted emotions differ from those observed during non acted emotions?

EmoTV is an audiovisual corpus of 51 video clips of emotionally rich monologues from TV interviews (on various topics such as politics, law and sports) that we have collected for studying non acted emotions. We have designed a coding scheme for annotating the context and several dimensions of emotions (categories, intensity, valence, etc.), both at the level of the whole video clip and at the level of the different emotional segments that each clip contains (see [1] for a description of how these segments were defined). The main conclusion of a first annotation phase of the 51 clips was that emotional segments cannot be labeled with a single emotion label but rather with a combination of two labels (blend, masked, cause-effect conflict, ambiguous) [6]. Furthermore, classical schemes used for detailed annotation of communicative multimodal behaviors proved to be partly inappropriate for non acted emotions [2]. The corresponding parts of the coding scheme were either removed or modified in order to improve the annotation process.

In this paper we describe the new coding scheme that we have designed for annotating multimodal behaviors during real-life mixed emotions. The scheme focuses on the annotation of emotion-specific behaviors in speech, head and torso movements, facial expressions, gaze, and hand gestures. We do not aim at collecting detailed data on each individual modality or at building statistically representative models of the relations between emotions and multimodal behaviors. Instead, our goals are to use the annotations produced with this scheme to identify the levels of representation required for realistic emotional behavior and to explore the coordination between modalities during non acted behaviors observed in individual videos. Section 2 provides a short survey of studies on multimodal emotional behavior. Section 3 details the coding scheme that we have defined. Section 4 illustrates the measures that can be computed from our annotations, such as expressivity profiles.

2 Studying multimodal emotional behaviors

The experiment described in [16] asked younger and older adults to evaluate the following parameters from actors' body movements: age, gender and race; hand position; gait; variations in movement form, tempo, and direction; and movement quality. The actors were silent and their faces were electronically blurred in order to isolate the body cues to emotion. In a first part, the subjects identified the emotions depicted in brief videotaped displays of young adult actors portraying emotional situations. In a second part, they rated the videotaped displays on characteristics of movement quality (form, tempo, force, and direction) using a 7-point scale and verbal descriptors (smooth / jerky, stiff / loose, soft / hard, slow / fast, expanded / contracted, and no action / a lot of action). The ratings of the two age groups showed high agreement and provided reliable information about particular body cues to emotion. The errors made by older adults were linked to exaggerated or ambiguous body cues.

In [18], twelve drama students were asked to code body movement and posture performed by actors. The emotions were acted and mostly basic. Only categories with more than 75% inter-coder agreement between 2 coders were kept: position of upper body, position of shoulders, position of head, position of arms, position of hands, movement quality (movement activity, spatial expansion, movement dynamics, energy, and power), body movements (jerky and active), and body posture.

The experiment in [4] investigated the accuracy of children in identifying emotional meaning in expressive body movement performances. 103 subjects participated in two tasks of nonverbal decoding skill. The subjects were asked to identify, on a TV screen, the emotions expressed by two dancers (i.e. which one of them was feeling happy, sad, angry, or afraid). The following specific cues were used by the adult subjects to discriminate the emotions: Anger (changes in tempo; directional changes in face and torso); Happiness (frequency of arms up, duration of arms away from torso); Fear (muscle tension); Sadness (duration of time leaning forward).

In [5], the authors studied the effects of three variables on the inference of emotions from body movements: sex of the mover, sex of the perceiver, and expressiveness of the movement. 42 adults participated in the experiment, analyzing video recordings of 96 different movements, each performed by students of expressive dance. The subjects used seven dichotomous dimensions for describing the movements: trunk movement (stretching, bowing), arm movement (opening, closing), vertical direction (upward, downward), sagittal direction (forward, backward), force (strong-light), velocity (fast-slow), and directness (moving straight towards the end position versus following a lingering, s-shaped pathway).

Three dimensions of movement are defined in Laban's theory of the Art of Movement [17]: up-down, left-right, and forward-backward. Some body movements and gestures are given an interpretation (e.g. the wish to communicate is expressed in outward-reaching movements and the wish to remain private is expressed in inward movements). This suggests that form and motion are related to emotions. In the field of Embodied Conversational Agents, [10] proposed a set of expressivity attributes for facial expressions (intensity, temporal course), for gaze (length of mutual gaze), for gesture (strength, tempo, dynamism, amplitude) and global parameters (overall activation, spatial extent, fluidity, power, repetitivity).

Several studies also involve TV material. The Belfast naturalistic database contains similar emotional interviews annotated with continuous dimensions [7]. Other studies deal with multimodal annotation, although not specifically with emotion [14]. A coding scheme was designed for the annotation of 3 videos of interviews taken from television broadcasting [3]. It includes the annotation of facial displays, gestures, and speech. The scheme covers both the form of an expression and its semantic-pragmatic function (e.g. give feedback / elicit feedback / turn managing / information structuring), as well as the relation between modalities: repetition, addition, substitution, contradiction.

As this short survey reveals, most studies of multimodal behavior during emotions are based on basic and acted emotions and do not always make use of recent tools and advances in multimodal corpora management.

3 A scheme for annotating multimodal emotional behavior

We have grounded our coding scheme on requirements collected both from the parameters described as perceptually relevant for the study of emotional behavior (see previous section) and from the features of the emotionally rich TV interviews that we have selected. The following measures are thus required for the study of emotional behaviors: the expressivity of movements, the number of annotations in each modality, their temporal features (duration, alternation, repetition, and structural descriptions of gestures), the directions of movements, and the functional description of relevant gestures. We have selected some non verbal descriptors from the existing studies listed above. We have defined the coding scheme at an abstract level and then implemented it as an XML file for use with the Anvil tool [11]. This section describes how each modality is annotated in order to enable subsequent computation of the relevant parameters of emotional behavior listed above. The tracks are annotated one after the other (e.g. the annotator starts by annotating the 1st track for the whole video and then proceeds to the next track). The duration and order of annotations are available in the annotations.

For the speech transcription, we annotate verbal and nonverbal sounds¹ such as hesitations, breaths, etc. The most frequent tags are: "laugh", "cries", "b" (breath), and "pff". We do the word-by-word transcription using Praat².

In the videos only the upper body of people is visible. The torso, head and gesture tracks contain descriptions of both pose and movement; pose and movement annotations thus alternate. The direction of a movement, its type (e.g. twist vs. bend) and its angles can be computed from the annotations in the pose track. Examples of instructions for annotating torso movements are provided in Table 1. Movement quality is annotated for torso, head, shoulders, and hand gestures. The attributes of movement quality that are listed in the studies reported in the previous section and that we selected as relevant for our corpus are: the number of repetitions, the fluidity (smooth, normal, jerky), the strength (soft, normal, hard), the speed (slow, normal, fast), and the spatial expansion (contracted, normal, expanded). The annotation of fluidity is relevant for individual gestures as several of them are repetitive. The 1st annotation phase revealed that a set of three possible values for each expressive parameter was more appropriate than a larger set of possible values.

The head pose track contains pose attributes adapted from the FACS coding scheme [9]: front, turned left / right, tilt left / right, upward / downward, forward / backward. The primary head movement observed between the start and the end pose is annotated with the same set of values as the pose attribute. A secondary movement enables the combination of several simultaneous head movements observed in EmoTV (e.g. a head nod while turning the head). Facial expressions are coded using combinations of Action Units (the low-level coding we used for the 1st annotation phase was based on FAPs and was inappropriate for the manual annotation of emotional videos).

As for gesture annotation, we have kept the classical attributes [12, 15] but focused on repetitive and manipulator gestures, which occur frequently in EmoTV, as we observed during the 1st annotation phase. Our coding scheme enables the annotation of the structural description ("phases") of gestures, as their temporal patterns might be related to emotion: preparation (bringing arm and hand into stroke position), stroke (the most energetic part of the gesture), sequence of strokes (a number of successive strokes), hold (a phase of stillness just before or just after the stroke), and retract (movement back to rest position). We have selected the following set of gesture functions ("phrases") as they were observed in our corpus: manipulator (contact with body or object; movement which serves functions of drive reduction or other non-communicative functions, like scratching oneself), beat (synchronized with the emphasis of the speech), deictic (arm or hand used to point at an existing or imaginary object), illustrator (represents attributes, actions or relationships of objects and characters), and emblem (movement with a precise, culturally defined meaning). Currently, the hand shape is not annotated since it is not considered a main feature of emotional behavior either in our survey of experimental studies or in our videos. The direction of shoulder movements is also annotated as some of them are observed. An example of the annotation of multimodal behaviors is provided in Figure 1.

¹ http://www.ldc.upenn.edu/
² http://www.fon.hum.uva.nl/praat/

Table 1. Annotating torso side and bend movements: example of annotation guide instructions

Torso side movement (with illustrative example images): Side right (+20°) / Front (0°) / Side left (-20°)
Torso bend movement (with illustrative example images): Bend front (+20°) / Front (0°) / Bend back (-20°)
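As an illustration of how movement attributes can be derived from the pose track, the following sketch computes the type, direction and angular amplitude of a torso movement from two consecutive pose annotations. It is a minimal sketch under assumed data structures: the record format and field names (time, side, bend) are hypothetical simplifications, not the actual attributes of our Anvil tracks.

```python
# Illustrative sketch only: the pose record below is a hypothetical
# simplification of a torso pose annotation, not the actual Anvil attributes.
from dataclasses import dataclass

@dataclass
class TorsoPose:
    time: float   # timestamp in seconds
    side: float   # side angle in degrees (+20 = side right, 0 = front, -20 = side left)
    bend: float   # bend angle in degrees (+20 = bend front, 0 = front, -20 = bend back)

def describe_movement(start: TorsoPose, end: TorsoPose) -> dict:
    """Derive movement type, direction and amplitude from two consecutive poses."""
    d_side = end.side - start.side
    d_bend = end.bend - start.bend
    # The dominant angular change determines the movement type.
    if abs(d_side) >= abs(d_bend):
        movement_type, amplitude = "side", abs(d_side)
        direction = "right" if d_side > 0 else "left"
    else:
        movement_type, amplitude = "bend", abs(d_bend)
        direction = "front" if d_bend > 0 else "back"
    return {
        "type": movement_type,
        "direction": direction,
        "amplitude_deg": amplitude,
        "duration_s": end.time - start.time,
    }

# Example: a torso movement from a front pose to a side right pose.
print(describe_movement(TorsoPose(12.0, 0, 0), TorsoPose(12.8, 20, 0)))
# -> a "side right" movement of 20 degrees lasting about 0.8 s
```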

4 Computing measures from annotations

With this new coding scheme, one coder produced 455 multimodal annotations of behaviors in the different modalities on the 19 emotional segments of 4 videos selected for their multimodally rich content (e.g. expressive gestures), for a total duration of 77 seconds. These annotations have been validated and corrected by a second coder. The average annotation time was estimated at 200 seconds per second of such video featuring rich multimodal behaviors. This annotation time might decrease in future annotations as the coders learn to use the scheme more efficiently.

We developed software for parsing the resulting annotation files and computing measures. It enables the comparison of the "expressivity profiles" of different videos featuring blended emotions (Table 2), similarly to the work done by [10] on expressive embodied agents. For example, videos #3 and #36 are quite similar regarding their emotion labels, average intensity and valence (although their durations are quite different).
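As a minimal sketch of the kind of computation this software performs, the following code derives an expressivity profile (share of movements per modality and percentage of annotations carrying each extreme expressive value, as in Table 2) from a list of simplified annotation records. The record fields used here are hypothetical; our actual tool parses the XML files produced with Anvil.

```python
# Minimal illustration, not our actual parsing tool: annotations are assumed
# to be available as simple records with hypothetical field names.
from collections import Counter

def expressivity_profile(annotations):
    """annotations: list of dicts such as
       {"modality": "head", "speed": "fast", "strength": "soft",
        "fluidity": "jerky", "expansion": "contracted"}"""
    movements = [a for a in annotations if a["modality"] in ("head", "torso", "hand")]
    n = len(movements) or 1
    by_modality = Counter(a["modality"] for a in movements)
    profile = {f"% {m} movement": round(100 * by_modality[m] / n)
               for m in ("head", "torso", "hand")}
    # Percentage of movement annotations carrying each extreme expressive value
    # (the "normal" middle value is not reported, as in Table 2).
    for param, (high, low) in {"speed": ("fast", "slow"),
                               "strength": ("hard", "soft"),
                               "fluidity": ("jerky", "smooth"),
                               "expansion": ("expanded", "contracted")}.items():
        values = Counter(a.get(param) for a in movements)
        profile[f"% {high} vs. {low}"] = (round(100 * values[high] / n),
                                          round(100 * values[low] / n))
    return profile

# Usage with two toy annotation records:
print(expressivity_profile([
    {"modality": "head", "speed": "fast", "strength": "hard",
     "fluidity": "jerky", "expansion": "normal"},
    {"modality": "torso", "speed": "normal", "strength": "normal",
     "fluidity": "normal", "expansion": "contracted"},
]))
```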

Fig. 1. From top: (1) speech tracks, (2) alternation of pose and movement for torso, (3) head behaviors annotated with one pose and two movement tracks, (4) facial expressions (including frequent closed eyes in this example), (5) hand gestures, including annotation of phase, phrase and movement expressivity (jerky, hard, fast, …)

The distribution of movements across modalities in these two videos is also similar (head movements are the most frequent, followed by torso and then hand gestures). The relations between the extreme values of the expressive parameters are also compatible in the two videos: fast movements are perceived more often than slow movements, hard movements more often than soft, jerky more often than smooth, and contracted more often than expanded. Video #30, with a similar average intensity (4), has a quite different expressive profile, with a positive valence, more head movements and no hand gestures. Even more movements are evaluated as being fast, but compared with videos #3 and #36, movements in this video are more often perceived as soft and smooth. The goal of such measures is to explore the dimensions of emotional behavior in order to study how multimodal signs of emotion are combined during non acted emotion.

Table 2. Expressivity profiles of three videos involving blended emotions

Video#                                        #3                            #36                          #30
Duration                                      37s                           7s                           10s
Emotion labels                                Anger (66%), despair (25%),   Anger (55%), despair (44%)   Exaltation (50%), Joy (25%),
                                              sadness (12%)                                              Pride (25%)
Average intensity (1: min – 5: max)           5                             4.6                          4
Average valence (1: negative, 5: positive)    1                             1.6                          4.3
% head movement                               56                            60                           72
% torso movement                              28                            20                           27
% hand movement                               16                            20                           0
% fast vs. slow                               47 vs. 3                      33 vs. 13                    83 vs. 0
% hard vs. soft                               17 vs. 17                     20 vs. 0                     0 vs. 27
% jerky vs. smooth                            19 vs. 8                      6 vs. 0                      5 vs. 50
% expanded vs. contracted                     0 vs. 38                      13 vs. 20                    0 vs. 33
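Such profiles can also be compared numerically once they are represented as vectors. The snippet below is only an illustrative possibility, not a measure used in this paper: it encodes the numeric entries of Table 2 and uses a simple mean absolute difference, which indeed ranks video #36 as closer to video #3 than video #30 is.

```python
# Illustrative only: one possible way to quantify the similarity between the
# expressivity profiles of Table 2 (not a measure used in the paper).

# Numeric entries of Table 2, in row order:
# % head, torso, hand movement, then fast, slow, hard, soft,
# jerky, smooth, expanded, contracted percentages.
profiles = {
    "#3":  [56, 28, 16, 47, 3, 17, 17, 19, 8, 0, 38],
    "#36": [60, 20, 20, 33, 13, 20, 0, 6, 0, 13, 20],
    "#30": [72, 27, 0, 83, 0, 0, 27, 5, 50, 0, 33],
}

def distance(a, b):
    """Mean absolute difference between two profiles (lower = more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(distance(profiles["#3"], profiles["#36"]))  # smaller: similar profiles
print(distance(profiles["#3"], profiles["#30"]))  # larger: different profiles
```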

5 Conclusion and future directions

In this paper we have described our work on the study of multimodal behaviors occurring during non acted emotions. We explained the different features of the coding scheme we have defined and illustrated some measures that can be computed from the annotations. Currently, the protocol used for the validation of annotations is to have a second coder validate the annotations done by a first coder, followed by brainstorming discussions. Other validation protocols are under consideration. We are currently investigating the use of such annotations in a copy-synthesis approach for the specification of expressive embodied agents, which can be useful for perceptual validation [13]. Future directions include the annotation of other videos of EmoTV with the coding scheme described in this paper, the validation of the annotations by the automatic computation of inter-coder agreement from the annotations of several coders, and the computation of further relations between 1) the multimodal annotations and 2) the annotation of emotions (labels, intensity and valence) and the global annotations, such as the modalities in which activity was perceived as relevant to emotion [6].

References

1. Abrilian, S., Devillers, L., Buisine, S., Martin, J.-C.: EmoTV1: Annotation of Real-life Emotions for the Specification of Multimodal Affective Interfaces. HCI International (2005a) Las Vegas, USA
2. Abrilian, S., Martin, J.-C., Devillers, L.: A Corpus-Based Approach for the Modeling of Multimodal Emotional Behaviors for the Specification of Embodied Agents. HCI International (2005b) Las Vegas, USA
3. Allwood, J., Cerrato, L., Dybkjær, L., Paggio, P.: The MUMIN multimodal coding scheme. Workshop on Multimodal Corpora and Annotation (2004) Stockholm
4. Boone, R. T., Cunningham, J. G.: Children's decoding of emotion in expressive body movement: The development of cue attunement. Developmental Psychology 34(5) (1998)
5. DeMeijer, M.: The attribution of aggression and grief to body movements: the effect of sex-stereotypes. European Journal of Social Psychology 21 (1991)
6. Devillers, L., Abrilian, S., Martin, J.-C.: Representing real life emotions in audiovisual data with non basic emotional patterns and context features. First International Conference on Affective Computing & Intelligent Interaction (ACII'2005) (2005) Beijing, China
7. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: Towards a new generation of databases. Speech Communication 40 (2003)
8. Ekman, P.: Emotions revealed. Weidenfeld & Nicolson (2003)
9. Ekman, P., Friesen, W.: Facial Action Coding System (FACS). (1978)
10. Hartmann, B., Mancini, M., Pelachaud, C.: Formational Parameters and Adaptive Prototype Instantiation for MPEG-4 Compliant Gesture Synthesis. Computer Animation (2002)
11. Kipp, M.: Anvil - A Generic Annotation Tool for Multimodal Dialogue. Eurospeech (2001)
12. Kipp, M.: Gesture Generation by Imitation. From Human Behavior to Computer Character Animation. Dissertation.com, Boca Raton, Florida (2004)
13. Lamolle, M., Mancini, M., Pelachaud, C., Abrilian, S., Martin, J.-C., Devillers, L.: Contextual Factors and Adaptative Multimodal Human-Computer Interaction: Multi-Level Specification of Emotion and Expressivity in Embodied Conversational Agents. 4th International and Interdisciplinary Conference on Modeling and Using Context (Context) (2005) Paris
14. Magno Caldognetto, E., Poggi, I., Cosi, P., Cavicchio, F., Merola, G.: Multimodal Score: an Anvil Based Annotation Scheme for Multimodal Audio-Video Analysis. Workshop "Multimodal Corpora: Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces", in association with the 4th International Conference on Language Resources and Evaluation (LREC) (2004) Lisbon, Portugal
15. McNeill, D.: Hand and mind: what gestures reveal about thought. University of Chicago Press (1992)
16. Montepare, J., Koff, E., Zaitchik, D., Albert, M.: The use of body movements and gestures as cues to emotions in younger and older adults. Journal of Nonverbal Behavior 23(2) (1999)
17. Newlove, J.: Laban for actors and dancers. Routledge, New York (1993)
18. Wallbott, H. G.: Bodily expression of emotion. European Journal of Social Psychology 28 (1998)