Shifting of Focus of Perceptual Attention in Embodied Conversational Agents

Youngjun Kim
Institute for Creative Technologies
13274 Fiji Way, Suite 600
Marina del Rey, CA 90292
[email protected]

Randall W. Hill, Jr.
Institute for Creative Technologies
13274 Fiji Way, Suite 600
Marina del Rey, CA 90292
[email protected]

ABSTRACT

We have developed a computational model of perceptual attention that interleaves the top-down and bottom-up attention processes of an embodied agent in a virtual environment. What has been lacking in our model until now is how it would work when one of the tasks being performed by the agent is holding a conversation with one or more agents. In this paper, we propose some modifications to our model that integrate perceptual attention toward objects and locations in the environment with the need to look at other parties in a social context.

1. Introduction

An important characteristic of an embodied conversational agent is the ability to direct its perceptual attention to objects and locations in a virtual environment in a manner that looks natural and serves a functional purpose. Not only must the agent direct its gaze at objects related to the tasks it is performing, but it must also be able to cope with sudden events that demand attention. For example, if the agent hears the sound of a car crash, the natural response under most circumstances would be for it to shift its attention toward the source of the noise. After seeing what caused the noise, the agent can then decide whether to resume its first task or change to a new task, such as helping the car crash victims. In either case, the focus of perceptual attention will shift back to top-down control in service of its tasks. It would be unnatural, under most circumstances, if the agent remained fixated on its tasks and ignored the car crash.

Suppose that one of the tasks being performed by the agent at the time of the car crash was holding a conversation with one or more agents in a group. In this case, the agent would need to shift its perceptual attention from one agent to another, according to social conventions, as well as the need to pick up non-verbal cues. At the same time, the agent needs to continue to pay attention to the environment, since most conversations are not held in isolation from the rest of the world. One approach to modeling perception and gaze in these situations is to avoid the issue of attention altogether by making the embodied agent omniscient. In this approach, gaze behaviors are generated as a way of making the agent appear to be attending, rather than serving a cognitive function. The problem with the omniscient approach is that the suspension of disbelief is broken when the human participant realizes that agents can perceive things without looking at them, e.g., they can see through walls, or they know what is happening on the other side of town without being there. The other extreme is to limit the agent's attention behaviors to focus only on the parties engaged in a conversation and ignore what is happening around them. In this case it looks very strange indeed when the embodied agent does not react to events like crashes, explosions, yelling, and other distractions that should have led to at least some kind of gaze shift.

Traum and Rickel [13] presented a state-of-the-art model of multi-party dialogue in immersive three-dimensional virtual worlds, organized as a set of dialogue management layers. Among these multi-party, multi-conversational dialogue layers, the attention layer concerns the object or process that embodied agents attend to. Their attention model is not fully implemented, and only (visual) give-attention is currently modeled. Our aim in this paper is to extend their attention layer by addressing the issue of perceptual attention. We present a computational model of perceptual attention for Embodied Conversational Agents (ECAs) that takes a decision-theoretic approach. This model controls the ECAs' focus of attention so that they dynamically interact with objects in the environment, while balancing this with the need to attend to others in a conversation, a challenge we are only beginning to address in this paper.

2. Human Perceptual Attention

Perceptual attention, spanning both perception and cognition, is critical for enabling awareness of situations in the synthetic environment. ECAs need perception so that

they can perform tasks, react to contingencies, interact with other agents (both virtual and human), plan, and make decisions about what to do next or at some future time [7]. Perceptual attention is a set of mechanisms that integrates percepts of objects from bottom-up (e.g., sensing and perception) and top-down (e.g., cognition) processes and then chooses the object that is most salient at a given time. The concept of perceptual attention is somewhat elusive, but at the same time obvious: virtual humans that either do not pay attention to their surroundings or pay attention to the wrong things at the wrong times will look mechanistic and clumsy. In humans, attention provides a way of limiting perceptual processing by focusing on specific objects, and it provides a way for cognition to control perception by shifting the focus from one object to another. Attention is a critically important mechanism for restricting the sensory information processed by the perception module in an ECA. In the first stages of perception, attention mechanisms filter the perceptual information that comes through the sensory system; subsequent processes selectively strengthen or weaken the importance of that information. Directing attention toward a particular region of space can be achieved by two distinguishable kinds of shifts: covert and overt shifts of attention. It is well known that covert and overt attention shifts affect gaze direction to locations in space [14]. Sequences of eye fixations describe the way in which overt attention is deployed, whereas directing attention to locations in space without moving the eyes describes the way in which covert attention is deployed.

Figure 1. Virtual Human Architecture [8]. (Modules connected by the Communications Bus: Action Selection, Emotions, NLU (pragmatics), Working Memory, Task Reasoner, Dialogue Manager, World State Reasoner, Perception (state, events), Natural Language Generator, and Motor Commands, implemented in Soar and Tcl/Tk.)

3. Modeling Perception in Embodied Agents

Our virtual humans are implemented in Soar, a general architecture for building intelligent agents [10], and build on STEVE [11] in an immersive environment called the Mission Rehearsal Exercise (MRE) [12]. As such, their behavior is not scripted; rather, it is driven by a set of general, domain-independent capabilities. The virtual humans perceive events in the scenario, reason about the tasks they are performing, and control the bodies and faces of the PeopleShop™ characters to which they have been assigned. They send and receive messages to one another, to the character bodies, and to the audio system via the Communications Bus shown in Figure 1.

We have developed a model of perceptual resolution based on psychological theories of human perception [6,7]. Hill's model predicts the level of detail at which an agent will perceive objects and their properties in the virtual world; he applied it to synthetic helicopter pilots in simulated military exercises. We extended the model to simulate many of the limitations of human perception, both visual and aural, and limited the virtual human's simulated visual perception to 190 horizontal degrees and 90 vertical degrees. The level of detail perceived about an object is high, medium, or low, depending on where the object is in the virtual human's field of view and whether attention is being focused on it. The virtual human perceives dynamic objects, under the control of the simulator, by filtering updates that the system periodically broadcasts.

Figure 2. A snapshot of the MRE simulation.

To model aural perception, we estimate the sound pressure levels of objects in the environment and compute their individual and cumulative effects on each listener based on the distances and directions of the sources. This enables the agents to perceive aural events involving objects outside the visual field of view. For example, when a virtual human hears a vehicle approaching from behind, he can choose to look over his shoulder to see who is coming. Another effect of modeling aural perception is that some sound events can mask others. A helicopter flying overhead can make it impossible to hear someone speaking in normal tones a few feet away. The noise could then prompt the agent to shout and could also prompt the addressee to cup his ear to indicate that he cannot hear.

Given a set of perceived objects, the agent's perceptual model updates the attributes of the objects that fall within its limited sensory range. At any point in time, the agent must recognize which object is the most salient and draw his focus of attention to it. The next section describes our approach to computing the salience of the objects in the field of view and the subsequent behaviors associated with shifting the agent's gaze.
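The field-of-view filtering described above can be sketched as follows. The 190-degree horizontal and 90-degree vertical limits come from our model; the zone thresholds (the 30-degree attended cone, the quarter-FOV medium zone) and the function name are illustrative assumptions, not values from the paper:

```python
H_FOV = 190.0        # horizontal field of view, degrees (from the text)
V_FOV = 90.0         # vertical field of view, degrees (from the text)
FOCUS_CONE = 30.0    # assumed angular radius of the attended region

def level_of_detail(h_angle: float, v_angle: float, attended: bool) -> str:
    """Classify the level of detail perceived for an object at the given
    angular offsets (degrees) from the agent's gaze direction."""
    if abs(h_angle) > H_FOV / 2 or abs(v_angle) > V_FOV / 2:
        return "none"                    # outside the simulated visual field
    if attended and abs(h_angle) <= FOCUS_CONE:
        return "high"                    # attention focused on a central object
    if abs(h_angle) <= H_FOV / 4:
        return "medium"
    return "low"                         # visible but peripheral
```

A dynamic object outside these bounds is still reachable through the aural channel described below, which is what lets an agent turn toward a sound before the source ever enters the visual field.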

4. Model of Dynamic Perceptual Attention

To compute object salience and control gaze behaviors, we have developed a model called Dynamic Perceptual Attention (DPA). Internally, DPA combines objects selected by bottom-up and top-down perceptual processes from a decision-theoretic perspective and then selects the most salient object. Externally, DPA controls an embodied agent's gaze not only to exhibit his current focus of attention but also to update his beliefs (e.g., position) about the selected object. That is, the embodied agent dynamically decides where to look, which object to look at, and how long to attend to it.

4.1 Decision-Theoretic Control

One of the consequences of modeling perception with limited sensory inputs is that it creates uncertainty about each perceived object. For instance, if an object that is being tracked moves out of an agent's field of view, the attention model increases the uncertainty level of that object. To illustrate this idea, consider the screen snapshot of the MRE simulation shown in Figure 2. A mother and a medic are taking care of an injured boy, and the sergeant is conversing with a human participant. Since the mother, the boy, and the medic are out of the sergeant's visual field of view, his uncertainty levels for each of these characters will increase with time. Uncertainty is one of the factors that help the sergeant decide which object to focus on.

To deal with the uncertainties of the perceived objects, we take a decision-theoretic approach to computing the perceptual costs and benefits of shifting the focus of attention to a perceived object. The expected cost is computed by calculating the perceptual cost of moving gaze to the selected object; the expected benefit is computed by considering the benefit of information accuracy on the selected object. DPA then controls the agent's gaze to exhibit his attention on the object with the highest expected utility. By our definition, the expected utility (EU) of gazing at object i given the evidence is computed as follows:

EU(Gaze(obj_i) | E) = Σ P(Result(Gaze(obj_i)) | Do(Gaze(obj_i)), E) × U(Result(Gaze(obj_i)))

where Gaze(obj_i) is gazing at object i; Result(Gaze(obj_i)) ranges over the possible outcome states; P(Result(Gaze(obj_i)) | Do(Gaze(obj_i)), E) is the probability of each outcome given that the gaze is executed; E is the agent's available evidence about the world; Do(Gaze(obj_i)) is the execution of the gaze at object i; and U is the utility function.

We assume that the probability of gazing at an object is always 1; that is, an observer always gazes at the object after shifting the focus of attention. Therefore,

EU(Gaze(obj_i) | E) = Σ U(Result(Gaze(obj_i)))
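Under this simplification, selecting a gaze target reduces to summing the outcome-state utilities for each candidate object and taking the maximum. A minimal sketch, in which the object names and utility values are hypothetical rather than taken from the scenario:

```python
def expected_utility(outcome_utilities):
    """EU(Gaze(obj_i) | E) = sum of U(Result(Gaze(obj_i))) over outcomes,
    since the probability of each gaze outcome is fixed at 1."""
    return sum(outcome_utilities)

def best_gaze_target(objects):
    """Select the object whose gaze shift maximizes expected utility."""
    return max(objects, key=lambda name: expected_utility(objects[name]))

targets = {
    "mother": [0.2],        # utilities of the possible outcome states
    "boy":    [0.7, 0.1],
    "radio":  [0.4],
}
best_gaze_target(targets)   # -> "boy"
```

In the full model, each utility is the benefit of information accuracy (Section 4.2) net of the cost of the gaze shift (Section 4.3).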

4.2 Computing the Benefit

We consider object-based information accuracy a key factor in computing the benefit of shifting the focus of attention to an object. The term object-based information accuracy is used here to describe the level of detail of the information about an object rendered in the agent's mental image of the virtual world. Human beings continually estimate the information accuracy of each perceived object (from 0.0 to 1.0) based on subjective preference and then make efforts to keep it within a certain range of a desired information accuracy (DIA). The relation between changes in information accuracy and the benefit of information accuracy is shown in Figure 3. Information accuracy is dynamic in both space and time, and stochastic functions of time and space are required to describe its dynamics. If the information accuracy is within the information accuracy tolerance boundary (IATB), the benefit of acquiring information is 0. If the accuracy goes outside the boundary, the benefit of acquiring information increases or decreases (the benefit ranges from -∞ to +∞). The desired information accuracy is determined by the object's priority, and the boundary is closely related to the relative concern for the object in the context of the situation: the more important the information, the narrower the boundary.

Figure 3. The interrelation of information accuracy and benefit of information accuracy. (The plot shows the benefit, from -∞ to +∞, against information accuracy from 0.0 to 1.0, marking the desired information accuracy, the current information accuracy and its benefit B, and the information accuracy tolerance boundary.)

The duration of gazing at an object affects accuracy. While an object is monitored with the agent's gaze (i.e., overt monitoring), the accuracy of the target information of the object increases. Conversely, while the object is monitored by memory and projection (i.e., covert monitoring), the accuracy of the target information decreases. That is, overtly monitored information is rendered from the virtual world in real time and will always satisfy or over-satisfy the desired accuracy, whereas covertly monitored information is not rendered from the virtual world but is reused from previously rendered information. Covert monitoring therefore causes uncertainty about the information to increase. While an agent overtly and covertly monitors information, he must have strategies for selectively acquiring it.

To model strategies for selective information accuracy, we first adopt a model of memory decay [9] to compute the changes in information accuracy (the x-axis of Figure 3) as follows:

CIA = CIA + p · exp(c · t)

where CIA is the current information accuracy, p is the priority of the target object, c is the degree of concern, and t is the time in seconds since the last fixation. The priority attribute, p, indicates how important the selected object is (e.g., staying alive has an extremely high priority for every agent), and the concern attribute, c, indicates how much concern the agent should assign to the selected object (e.g., even though the agent puts a high priority on staying alive, he may not give it much concern until there is a threat to his life). As the current information accuracy (CIA) moves away from the desired information accuracy (DIA), two kinds of reactions arise: a desire to perceptually acquire objects and a desire to perceptually inhibit them. The desire to acquire an object increases as CIA falls below the lower information accuracy tolerance boundary; the desire to inhibit an object begins as CIA rises above the upper boundary. This inhibiting and acquiring behavior derives from the concept of inhibition of return [Klein 2000], the process by which the currently attended location or information is prevented from being attended to again, a crucial element of human attentional deployment. The desires to acquire or inhibit objects are increased or decreased by (k · x²) / 2, respectively, where k is a constant and x is the distance between CIA and the information accuracy tolerance boundary. If an agent shifts his perceptual attention from one monitored object to another, the constant k increases in proportion to his concern for the object.
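The acquire/inhibit reaction can be sketched as a piecewise function of CIA. The (k · x²)/2 rule and the tolerance-boundary behavior come from the text; the function name and the sample constants below are illustrative assumptions:

```python
def acquisition_desire(cia: float, dia: float, tolerance: float, k: float) -> float:
    """Positive values express a desire to perceptually acquire the object
    (accuracy too low), negative values a desire to inhibit it (accuracy
    over-satisfied), and 0.0 means CIA lies inside the tolerance boundary."""
    lower, upper = dia - tolerance, dia + tolerance
    if cia < lower:                      # below the boundary: acquire
        x = lower - cia                  # distance from the boundary
        return (k * x ** 2) / 2
    if cia > upper:                      # above the boundary: inhibit
        x = cia - upper
        return -(k * x ** 2) / 2
    return 0.0                           # within the IATB: no benefit

acquisition_desire(cia=0.2, dia=0.8, tolerance=0.1, k=2.0)  # strong desire to acquire
```

Narrowing the tolerance (as the lieutenant's question does in Section 5) widens the region where this desire is non-zero, which is what pulls the agent's gaze back to the object.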

4.3 Computing the Cost

Even if the benefit of drawing attention to one object is higher than the benefit of attending to others, an agent should not automatically select that object as the best one, since the cost of shifting the focus of attention must also be considered. For instance, if you are driving and somebody in your car asks you a question, you decide whether to look at him based on the traffic conditions and on how important eye contact is, since making eye contact during conversation matters in many social situations. If your car is about to crash into another car, you probably should not make eye contact with him, even if he asks critical questions. We compute the cost of information accuracy based on several factors (e.g., the degrees of head and eye movement, the loudness of voice, and distance efficiency).

Figure 4. Experimentation with Scenario. (Panels (a)-(d) each plot the benefit of information acquisition against information accuracy, marking the desired information accuracy, the current information accuracy and its benefit B, and the information accuracy tolerance boundary at successive points in the scenario; panel (d) corresponds to the lieutenant's question "LT: Is the LZ secure?")

The following procedure guides our agents in choosing the resource with the lowest cost:

set the default minimal cost for each object of concern from the top-down and bottom-up processes
compute the distance and angle between the observer and the object
for each modality that the observer can use:
    if the observer can contact the object at that distance with the modality, then compute the cost of contacting the object
    if the cost is less than the minimal cost, then set it as the minimal cost
return the minimal cost of contacting the object
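A minimal sketch of this procedure, assuming each modality is represented by a (max_range, cost_function) pair; the gaze and speech entries and their coefficients are hypothetical stand-ins for the head/eye-movement, loudness, and distance factors named above:

```python
def minimal_contact_cost(distance, angle, modalities, default_cost=1.0):
    """Return the lowest cost of contacting an object across the modalities
    the observer can use at this distance and angle."""
    minimal = default_cost                       # default minimal cost
    for max_range, cost_fn in modalities:
        if distance <= max_range:                # modality can reach the object
            cost = cost_fn(distance, angle)
            if cost < minimal:                   # keep the cheaper option
                minimal = cost
    return minimal

modalities = [
    (50.0, lambda d, a: 0.01 * d + 0.002 * a),   # gaze: cost grows with head turn
    (20.0, lambda d, a: 0.05 * d),               # speech: louder voice at distance
]
minimal_contact_cost(10.0, 45.0, modalities)     # gaze wins here (cost ≈ 0.19)
```

If no modality can reach the object, the default cost is returned, so an unreachable object never looks cheaper than a reachable one.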

5. Implementation with Scenario

When our simulation starts, a simulated Army vehicle carries a human participant (the lieutenant) to an accident site where an Army vehicle has crashed into a civilian car, injuring a boy. The participant then takes on the task of directing the troops to rescue the boy by interacting with our three embodied conversational agents: the sergeant (SGT), the mother, and the medic. The lieutenant inquires, "Sergeant, what happened here?"

The sergeant faces the lieutenant. "They just shot out from the side street, sir. The driver couldn't see them coming." "How many people are hurt?" "The boy and one of our drivers." "Are the injuries serious?" Looking up, the medic answers. "The driver's got a cracked rib, but the kid's..." Glancing at the mother, the medic checks himself. "Sir, we've gotta get a medevac in here ASAP." The lieutenant then commands the sergeant to execute the task of securing the landing zone (LZ) so that the medevac can safely land. Once this task is executed, the sergeant contacts the 3rd squad in order to dispatch it to the LZ.

The initial relation between the information accuracy of the 3rd squad's position and the benefit of acquiring that information is shown in Figure 3. The sergeant initially knows the position of the 3rd squad and the location of the LZ. When the task (secure-lz) is executed, the sergeant tries to obtain highly accurate spatial information about the 3rd squad, which will be dispatched to the LZ to secure it. Since the squad is not yet in the area, the sergeant gains a benefit from acquiring its current spatial information. Next, the sergeant contacts the squad and commands it to move forward to the LZ. After he learns that the squad is moving toward the LZ, he reduces the slope of the benefit curve, reflecting the hope of achieving the task supplied by the emotion module (shown in Figure 4(a)). After the LZ is secured by the squad, whom the sergeant trusts highly, he no longer needs to maintain the information (i.e., that the LZ is secure) with one hundred percent accuracy; he can lower the importance of the information (i.e., lower the desired information accuracy and expand the tolerance boundary) and then monitor, search for, or track other information (shown in Figure 4(b)). While the sergeant monitors, searches for, or tracks other objects, the accuracy of the information about the security of the LZ gradually decreases (shown in Figure 4(c)). Then the lieutenant asks, "Is the LZ secure?" This speech event increases the desired accuracy of the information and narrows the tolerance boundary, since the lieutenant wants highly accurate information (shown in Figure 4(d)). The change in accuracy changes the benefit, which is passed to the emotion module; this increases the sergeant's degree of distress and suggests that he should update his belief about the security status of the LZ. As a result, the sergeant gazes at the landing zone to check whether it is still secured by the dispatched squad members and then reports its status to the lieutenant.

6. Related Work

There are few frameworks describing where and at what an observer has to look at a given time. Findlay and Walker [4] present a psychological model of the information flow routes and competitive pathways in saccade generation. Eye movements of agents have been simulated in a number of ways as a form of exhibiting their focus of attention. Chopra and Badler [2] built a psychologically motivated computational framework for generating visual attending behaviors (in tasks such as locomotion, object manipulation, or visual search) in an embedded simulated human agent. Cassell and Vilhjalmsson [1] have used gaze as an important communicative behavior in their animated characters, which have several limitations: (1) the model does not operate in real time, (2) it only includes conversational gaze, and (3) it does not include variability due to emotional state or individual differences. Rickel and Johnson [11] also employ gaze in their tutoring agent, STEVE, who looks the student in the eye during conversational interaction and looks at objects in the environment when performing tasks or monitoring the student. Their main purpose in adopting eye movements is to generate eye movements for non-verbal communication (e.g., turn-taking) that are controlled by top-down attention. The general limitation of STEVE is that a gaze command typically comes at the beginning of a cognitive activity but is not updated during that activity. So, for example, if STEVE starts talking to a person, he gazes at them; then, if his attention is drawn to an action in the environment, he will remain gazing at that action until something else causes a gaze command. Hill [6,7] applied a simulation of attention to a virtual helicopter pilot, which selectively draws attention to objects or areas based on features of the objects, their importance to tasks, and perceptual grouping. However, the helicopter pilot has no animation of head and eye movements. We extended Hill's model of perceptual resolution based on psychological theories of human perception.

7. Discussion

The computational model of shifting the focus of perceptual attention for ECAs provides the potential to support multi-party dialogues in a virtual world. As we begin to integrate perceptual attention into multi-party, multi-conversational dialogue layers [13], we have demonstrated that embodied agents can respond dynamically to events that are not even relevant to their tasks. In addition, the decision-theoretic framework provides a principled way of shifting attention among objects in the environment. By integrating language tasks with the model described in this paper, we hope to make progress toward natural gaze behaviors in embodied conversational agents.

8. References

[1] J. Cassell and H. Vilhjalmsson: "Fully Embodied Conversational Avatars: Making Communicative Behaviors Autonomous", Autonomous Agents and Multi-Agent Systems, 2:45-64, Kluwer Academic Publishers, 1999.
[2] S. Chopra-Khullar and N. Badler: "Where to Look? Automating Attending Behaviors of Virtual Human Characters", Autonomous Agents and Multi-Agent Systems, 4(1-2), pp. 9-23, 2001.
[3] H. Egeth and S. Yantis: "Visual Attention: Control, Representation, and Time Course", Annual Review of Psychology, 48:269-297, 1997.
[4] J. Findlay and R. Walker: "A Model of Saccade Generation Based on Parallel Processing and Competitive Inhibition", Behavioral and Brain Sciences, 22:661-721, 1999.
[5] J. Gratch: "Emile: Marshalling Passions in Training and Education", in Proceedings of the Fourth International Conference on Autonomous Agents, pp. 325-332, ACM Press, New York, 2000.
[6] R. Hill: "Modeling Attention in Virtual Humans", in Proceedings of the 8th Conference on Computer Generated Forces and Behavioral Representation, SISO, Orlando, Fla., 1999.
[7] R. Hill: "Perceptual Attention in Virtual Humans: Toward Realistic and Believable Gaze Behaviors", in Proceedings of the AAAI Fall Symposium on Simulating Human Agents, pp. 46-52, AAAI Press, Menlo Park, Calif., 2000.
[8] R. Hill, J. Gratch, S. Marsella, J. Rickel, W. Swartout, and D. Traum: "Virtual Humans in the Mission Rehearsal Exercise System", Künstliche Intelligenz (KI Journal), special issue on Embodied Conversational Agents, 2003.
[9] N. Moray: "Designing for Attention", in Attention: Selection, Awareness, and Control, Oxford University Press, 1993.
[10] A. Newell: Unified Theories of Cognition, Harvard University Press, Cambridge, Mass., 1990.
[11] J. Rickel and W. Lewis Johnson: "Animated Agents for Procedural Training in Virtual Reality: Perception, Cognition, and Motor Control", Applied Artificial Intelligence, 13:343-382, 1999.
[12] W. Swartout et al.: "Toward the Holodeck: Integrating Graphics, Sound, Character and Story", in Proceedings of the 5th International Conference on Autonomous Agents, 2001.
[13] D. Traum and J. Rickel: "Embodied Agents for Multi-party Dialogue in Immersive Virtual Worlds", in Proceedings of AAMAS '02, Bologna, Italy, July 15-19, 2002.
[14] J. Wolfe: "Guided Search 2.0: A Revised Model of Visual Search", Psychonomic Bulletin & Review, 1(2), pp. 202-238, 1994.