The Conductor: Gestures for embodied agents with logic programming

Zsófia Ruttkay 1, Zhisheng Huang 2, Anton Eliens 2

1 Center for Mathematics and Computer Sciences, Amsterdam, The Netherlands
[email protected], http://www.cwi.nl/~zsofi
2 Intelligent Multimedia Group, Division of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
{huang,eliens}@cs.vu.nl, http://www.cs.vu.nl/{~huang,~eliens}

Abstract. The paper discusses how distributed logic programming can be used to define and control the hand gestures of embodied agents in virtual worlds, using the STEP language as an interface between the constructs of logic programming and the humanoid model defined in VRML. In this framework, different gesture dictionaries can be defined, and variants of a hand gesture can be generated on the fly according to dynamically changing factors. The approach is tested on the demanding demonstrator of conducting, which also provides experience on the time performance of the animation.

1 Introduction

Embodied conversational agents (ECAs) [2] are more or less realistic representations of humans in interactive applications. We are particularly interested in Web-based applications, where the ECA is part of a virtual environment accessible through standard web browsers. Such agents may occur in different applications and roles: as instructor [20] or sales assistant [22], as translator for the hearing-impaired,1 or as representative of real persons in shared environments. It has been pointed out that fully embodied agents should have the capability of gesturing, both to increase the efficiency and the naturalness of communication [4]. Gestures express, redundantly or additively relative to what is being told, information about different aspects of the world [19, 28]:
- identify certain objects or events (e.g. the victory emblem);
- indicate characteristics of objects (location, shape, size), events (time, motion, start/end/repetition) or concepts (e.g. two hands with pointing fingers moving towards each other until touching, indicating 'conflict');
- show the state of the speaker (emotional, physical, cognitive, availability of modalities).

1 http://www.vcom3d.com/

Gestures also improve the elicitation of spoken text, by punctuating it with special gestures that indicate:
- syntactical chunks (contrast, enumeration, new topic);
- communicational coordination (listening, turn giving/taking).

Besides the above straightforward functions, gestures also convey information about the identity of the speaker (cultural, gender, age, personality) [31]. Naturally, other nonverbal modalities (facial expressions, head and gaze movement) also play a role, often together with hand gestures. In this paper we concentrate on the definition and generation of hand gestures.

The task is more complicated than defining and concatenating pre-cooked motion patterns. A single-hand gesture involves the coordinated motion of 18 hand and arm joints, and many variants of the same gesture occur. Moreover, for a single communicative function (e.g. greeting, pointing, emphasis) alternative gestures can be used, and the actual choice is made on the basis of static and dynamic characteristics of the speaker (e.g. right- or left-handedness, whether one of the hands is occupied, …), of the environment (e.g. relative location of speaker, listener and the object to be pointed at) and of other factors of the presentation, such as the time available to perform a gesture. Subtle coordination with speech is also a must. Summing up, the programming framework for the gesturing of agents should ideally be appropriate for all the following tasks:
1. provide high-level and compositional definition and control of gestures (as opposed to defining and controlling all joints individually);
2. allow the definition of gesture dictionaries, that is, of all the gestures 'known' by an embodied agent;
3. allow reasoning about the choice of hand and other parameters of the gesture to be used;
4. support the generation of gestures which are: a) individual; b) non-repetitive; c) resembling the motion qualities of humans [5]; d) subtly synchronized (also to speech);
5. make possible the real-time generation of hand gestures in virtual environments.

We have a preference, in accordance with items 1-3, for a rule-based declarative language. Parallelism, obviously, is a must. Requirement 4 makes it clear that handling time and other numerical parameters is a necessity. The final requirement makes one ponder seriously whether a logic-based declarative approach could be appropriate at all, because of the performance demands, especially as the animation has to be performed in VRML, which is notorious for tedious manipulation and low rendering efficiency.

In this paper we outline how the STEP language, as an interface between distributed logic programming and VRML, can be used to deal with the above problems related to gesturing. In Section 2, declarative programming and the DLP language are outlined, the STEP language is introduced and the modeling of humans in VRML is discussed. Section 3 is devoted to the definition and generation of gestures using STEP. In Section 4 we sum up our results, compare them to other methods for gesturing, and discuss further possibilities. Throughout the paper we use conducting as the running example, and refer to the hand gestures of conducting. Though a synthetic conductor has no really useful practical application, it is an excellent vehicle to demonstrate and experiment with solutions to the general problems of gesturing listed above.

2 The languages and tools: DLP, humanoids in VRML, and STEP

2.1 DLP

The Distributed Logic Programming Language (DLP) [6]2 combines logic programming, object-oriented programming and parallelism. DLP has been used as a tool for web agents, in particular 3D web agents [15]. The use of DLP as a language for the implementation of agent-based virtual environments is motivated by the following language characteristics: object-oriented Prolog, VRML EAI extensions, and distribution.

2 http://www.cs.vu.nl/~eliens/projects/logic/index.html

DLP incorporates object-oriented programming concepts, which make it a useful tool for programming. The language accepts the syntax and semantics of logic programming languages like Prolog. It is a high-level declarative language suitable for the construction of distributed software architectures in the domain of artificial intelligence; in particular, it is a flexible language for rule-based knowledge representation [7]. In DLP, an object is designed as a set of rules and facts, which consist of a list of formulas built from predicates and terms (variables or constants). DLP is an extensible language: special-purpose requirements for particular application domains can easily be integrated into the existing object-oriented language framework. DLP has been extended with a run-time library for VRML EAI [18] and a TCP/IP library for network communication. The following predicates are some examples of DLP VRML built-in predicates:
• URL-load predicate loadURL(URL): loads a VRML world at URL into the Web browser.
• Get-position predicate getPosition(Object,X,Y,Z): gets the current position of the Object in the VRML world.
• Set-position predicate setPosition(Object,X,Y,Z): sets the position of the Object in the VRML world.
• Get-rotation predicate getRotation(Object,X,Y,Z,R): gets the current rotation of the Object in the VRML world.
• Set-rotation predicate setRotation(Object,X,Y,Z,R): sets the rotation of the Object in the VRML world.
• Get-property predicate getSFVec3f(Object,Field,X,Y,Z): gets a value (consisting of the three float numbers X, Y, and Z) of the Field of the Object.
• Set-property predicate setSFVec3f(Object,Field,X,Y,Z): assigns the SFVec3f value X, Y, and Z to the Field of the Object.
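To give a feel for how these built-ins combine with ordinary Prolog-style rules, the following small sketch (our own illustration; place_above is not part of DLP's library) reads the position of one object and places another object one unit above it:

% Illustrative rule combining the get/set built-in predicates.
place_above(Object, Reference) :-
    getPosition(Reference, X, Y, Z),   % read the reference object's position
    Y1 is Y + 1.0,                     % compute a position one unit higher
    setPosition(Object, X, Y1, Z).     % move Object to that position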

DLP programs are compiled to Java class files, which makes DLP a convenient tool for the implementation of VRML EAI applets. DLP is also a distributed programming language: DLP programs can be executed on different computers in a distributed architecture, supported by its TCP/IP library.

2.2 Humanoids in VRML

The avatars of 3D web agents are built in the Virtual Reality Modeling Language (VRML) or X3D, the next generation of VRML. These avatars have a humanoid appearance. The Humanoid Animation Working Group3 proposes a specification, called the H-anim specification, for the creation of libraries of reusable humanoids in Web-based applications, as well as authoring tools that make it easy to create humanoids and animate them in various ways.

3 http://h-anim.org

According to the H-anim standard, a humanoid contains a set of Joint nodes arranged to form a hierarchy. Each Joint node can contain other Joint nodes and may also contain a Segment node which describes the body part associated with that joint. Each Segment can also have a number of Site nodes, which define locations relative to the segment. Sites can be used for attaching accessories such as hats, clothing and jewelry; in addition, they can be used to define eye points and viewpoint locations. Each Segment node can have a number of Displacer nodes, which specify which vertices within the segment correspond to a particular feature or configuration of vertices. Figure 1 shows the typical major joints of H-anim humanoids; furthermore, each hand has 15 finger joints. Turning a body part of a humanoid amounts to setting the rotation of the corresponding joint; moving a body part amounts to setting the corresponding joint to a new position.

H-anim specifies a standard way of representing humanoids in VRML. The scripting language STEP has been designed for embodied agents/web agents based on H-anim humanoids.
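To make the connection with the DLP built-ins of Section 2.1 concrete, the sketch below bends a humanoid's right elbow by setting the rotation of the corresponding Joint node. It is only an illustration: the joint name follows the H-anim convention, the axis-angle value is arbitrary, and the way the Joint node is addressed (here simply by its name) may differ in an actual implementation.

% Sketch: turning a body part = setting the rotation of its Joint node.
% The axis-angle value (1,0,0,-1.5) is illustrative, not a prescribed pose.
bend_right_elbow :-
    setRotation(r_elbow, 1.0, 0.0, 0.0, -1.5).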

Fig. 1. Major joints of a typical H-anim humanoid.

2.3 The STEP language

STEP (Scripting Technology for Embodied Persona) is a scripting language for embodied agents, in particular for their non-verbal acts like gestures and postures [16]. Based on dynamic logic [11], STEP has a solid semantic foundation, in spite of its rich set of variants of compositional operators and interaction facilities on worlds. The design of STEP was motivated by the following principles:
• convenience – it may be used by non-professional authors;
• compositional semantics – operations can be combined;
• re-definability – for high-level specification of actions;
• parametrization – for the adaptation of actions;
• interaction – with a (virtual) environment.
The principle of convenience implies that STEP uses natural-language-like terms for 3D graphics references. The principle of compositional semantics states that STEP has a set of built-in action operators. The principle of re-definability suggests that STEP should incorporate a rule-based specification system. The principle of parametrization justifies that STEP introduces a Prolog-like syntax. The principle of interaction requires that STEP be based on a more powerful meta-language, like DLP.
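For instance, re-definability and parametrization could show up as Prolog-like rules that define a named, parameterized action once and reuse it with different arguments. The sketch below is our own illustration: the script/2 rule form, the action name and the shoulder_joint mapping are assumptions, and the turn primitive it uses is introduced in the next paragraph.

% Illustrative only: a parameterized, user-(re)definable scripting action.
% Hand is a parameter, so the same definition yields a left- or right-handed
% variant of the gesture.
script(raise_arm(Agent, Hand, Speed), Action) :-
    shoulder_joint(Hand, Joint),
    Action = turn(Agent, Joint, front, Speed).

% Mapping from an abstract hand to the H-anim joint that moves the arm.
shoulder_joint(left,  l_shoulder).
shoulder_joint(right, r_shoulder).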

Turn and move are the two main primitive actions for body movements in STEP. Turn actions specify the change of the rotation of a body part or of the whole body over time, whereas move actions specify the change of the position of a body part or of the whole body over time. Turn and move actions are expressed as follows:

turn(Agent, BodyPart, Direction, Duration)
move(Agent, BodyPart, Position, Duration)

In the above expressions, BodyPart refers to a body joint in the H-anim specification, like l_shoulder, r_elbow, etc. Direction states a rotation, which can be an item of the form rotation(1,0,0,-1.57), or a natural-language-like term such as 'front'. Position states a position, which can be an item like position(1,0,0), or a natural-language-like term such as 'front'. Duration states the time interval of the action, which can be a direct time specification, like time(2,second) or beat(2), or a natural-language-like term, like 'fast' or 'slow'. For instance, the action turn(humanoid, l_shoulder, front, fast) turns the humanoid's left arm to the front, fast. The meaning of natural-language-like terms such as 'front' and 'fast' is defined by the ontology component of STEP.

The STEP animation engine uses the standard slerp rotation interpolation to create the turning animation. The rotation interpolations are considered to be linear by default. STEP also supports non-linear interpolation by using the enumerating type of the interpolation operator. An example:

turnEx(Agent, l_shoulder, front, fast, enum([0,0.1,0.2,0.7,1]))

The above expression turns the agent's left arm to the front via the interpolation points 0, 0.1, 0.2, 0.7, 1, that is, by covering equal parts of the entire trajectory in the indicated portions of the total time.

Scripting actions can be composed using the following composite operators:
• Sequence operator 'seq': the action seq([Action1, ..., ActionN]) denotes a composite action in which Action1, ..., ActionN are executed sequentially.
• Parallel operator 'par': the action par([Action1, ..., ActionN]) denotes a composite action in which Action1, ..., ActionN are executed simultaneously.
• Non-deterministic choice operator 'choice': the action choice([Action1, ..., ActionN]) denotes a composite action in which one of Action1, ..., ActionN is executed.
• Repeat operator 'repeat': the action repeat(Action, T) denotes a composite action in which Action is repeated T times.

Using high-level interaction operators, scripting actions can directly interact with the internal states of embodied agents or with the external states of worlds. These interaction operators are based on a meta-language which is used to build embodied agents. Motivated by dynamic logic, the higher-level interaction operators in STEP are:
• the test operator test(p), which tests whether p is true in a state;
• the do operator do(p), which executes a goal p at the meta-language level;
• the conditional if_then_else(p, Action1, Action2), which executes Action1 if the state p holds, and Action2 otherwise.
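To show how these operators combine in practice, the following sketch puts them together for the paper's running example of conducting. It is our own illustration: the script/2 rule form, the direction term 'down', and the helper predicates free_hand/2 and arm_joints/3 are assumptions, not part of STEP's built-in vocabulary; the turn primitive and beat(N) durations are as introduced above.

% Illustrative sketch: a two-beat conducting pattern for one arm, with the
% hand chosen at run time via the interaction operators.
script(conduct_two_beats(Agent, Hand), Action) :-
    arm_joints(Hand, Shoulder, Elbow),
    Action = seq([
        % beat 1: the arm swings down, shoulder and elbow moving in parallel
        par([ turn(Agent, Shoulder, down,  beat(1)),
              turn(Agent, Elbow,    down,  beat(1)) ]),
        % beat 2: the arm swings back up to the front
        par([ turn(Agent, Shoulder, front, beat(1)),
              turn(Agent, Elbow,    front, beat(1)) ])
    ]).

% Choose the right hand when it is free, otherwise fall back to the left.
script(conduct(Agent), Action) :-
    Action = if_then_else(free_hand(Agent, right),
                          conduct_two_beats(Agent, right),
                          conduct_two_beats(Agent, left)).

arm_joints(right, r_shoulder, r_elbow).
arm_joints(left,  l_shoulder, l_elbow).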

STEP has been implemented in the Distributed Logic Programming language DLP [8,13,15]. XSTEP is the XML-encoded version of STEP, which serves as an XML-based markup language for embodied agents [14]. The resources of STEP and XSTEP can be found on the STEP website: http://wasp.vs.vu.nl/step.

3 Definition of hand gestures

3.1 The body space and reference points

Human hand gesturing happens in certain parts of the space in front of and next to the body. In order to be able to express easily where a gesture should or may start and end, we use reference points and planes relative to the human body, see Figure 2.

Fig. 2. Humanoid with body region planes and the body region right_to_head.

The body region planes, which are similar to the ones used in the HamNoSys sign language notation system [27], are the following (see Figure 2):

Horizontally: navel, breast, shoulder, head, thigh.
Vertically: middle, left_shoulder, right_shoulder, left_uparm, right_uparm, left_arm, right_arm.
In depth: depth_base (through the two middle lines of the resting arms and standing body), front_uparm, front_arm.

The hands can reach, roughly speaking, a half-sphere in space. (For the time being, we do not consider gestures which happen behind the depth_base plane.) The above planes divide the space around the body into unit reference body regions. Several of these, as well as unions of some of them, are labeled. These labeled regions will be used as possible start and end positions of gestures. The semantics of a particular body space region can easily be given in terms of its boundary planes, see the example below and Figure 2:

body_region(Agent, right_to_head, (X,Y,Z)) :-
    above(Agent, shoulder, (X,Y,Z)),
    below(Agent, head, (X,Y,Z)),
    right_to(Agent, right_shoulder, (X,Y,Z)),
    left_to(Agent, right_arm, (X,Y,Z)).

below(Agent, BodyPlane, (_,Y,_)) :-
    height(Agent, BodyPlane, PY),
    Y < PY.
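The remaining boundary predicates can be defined in the same style. The sketch below is our own guess at their shape: it assumes that above/3, right_to/3 and left_to/3 mirror below/3, and that the X coordinate of a vertical plane is available through a width/3 fact analogous to height/3 (width/3 is our assumption, not taken from the paper).

% Sketch of the remaining boundary predicates, mirroring below/3.
% Whether 'right' corresponds to increasing or decreasing X depends on the
% orientation of the humanoid's coordinate frame; here we assume increasing X.
above(Agent, BodyPlane, (_,Y,_)) :-
    height(Agent, BodyPlane, PY),
    Y >= PY.

right_to(Agent, BodyPlane, (X,_,_)) :-
    width(Agent, BodyPlane, PX),
    X >= PX.

left_to(Agent, BodyPlane, (X,_,_)) :-
    width(Agent, BodyPlane, PX),
    X < PX.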