Combining Sensory and Symbolic Data for Manipulative Gesture Recognition

Jannik Fritsch, Nils Hofemann and Gerhard Sagerer
Bielefeld University, Faculty of Technology, Bielefeld, Germany

Abstract

In this paper we propose to recognize manipulative hand gestures by incorporating symbolic constraints in a particle filtering approach used for trajectory-based activity recognition. To this end, the notion of the situational and spatial context of a gesture is introduced. This scene context is incorporated during the analysis of the trajectory data. A first evaluation in an office environment demonstrates the suitability of our approach. In contrast to purely trajectory-based approaches, our method recognizes manipulative gestures including the information which objects were manipulated.

1. Introduction

A famous categorization of motion recognition approaches is due to Bobick [3] and differentiates Movement, Activity, and Action depending on the level of knowledge required during recognition. Today many gesture recognition approaches analyze only the hand trajectory and therefore recognize activities in Bobick's categorization. Recognizing a 'pick' activity, however, is not very informative if there are many different objects that could be picked. In such a situation, the symbolic information about which object has been manipulated is crucial and needs to be incorporated into the recognition system to obtain a complete description of the manipulative gesture performed, i.e., to recognize the action instead of the activity.

Only the parallel use of symbols and sensory data allows us to cope with the uncertainties in the individual cues and to take the temporal development into account, i.e., to associate the correct scene object with the observed hand motion. For example, the hand may pass by several objects before picking one. Instead of fusing the two cues at a single point in time, it is therefore necessary to continuously process both cues in a single recognition framework to recognize manipulative gestures.

Here we introduce a novel framework for the recognition of actions that incorporates symbolic constraints in a particle filtering approach for trajectory-based activity recognition [2]. Based on our extension, the scene context can be incorporated during the analysis of the trajectory data. Before presenting our approach, we review related literature dealing with the recognition of manipulative gestures in Section 2 and describe the trajectory recognition algorithm in Section 3. The details of our extension to trajectory-based particle filtering are outlined in Section 4. An evaluation of the method based on actions performed in an office environment is described in Section 5 before we conclude with a summary.

2. Related Work

One of the first approaches exploiting hand motions and objects in parallel is the work of Kuniyoshi [6] on qualitative recognition of assembly actions in a blocks-world domain. Kuniyoshi's approach features an action model capturing the hand motion as well as an environment model representing the object context. Action recognition is performed by relating the two models to each other based on rules contained in hierarchical parallel automata.

In the action recognition approach by Ayers and Shah [1] a person is tracked based on skin color. Interaction with an object is defined in terms of intensity changes within the object's image area, which has to be provided at initialization. Neither Kuniyoshi nor Ayers and Shah use motion models; instead they define an action as the static relation between the position of a hand or person and an object.

An approach that actually combines sensory trajectory data and symbolic object data and uses motion models is the work by Moore et al. [7]. Different processing steps are carried out to obtain image-based, object-based, and action-based evidence for objects and actions observed with a camera mounted on the ceiling. To obtain action-based evidence the hand trajectory is analyzed with Hidden Markov Models trained offline on different activities related to the known objects. The sensory trajectory information is used primarily as an additional cue for object recognition. In contrast to this object-centered approach, we present here a hand-centered approach for recognizing manipulative gestures with the help of symbolic information.


3. Activity Recognition

Our framework is based on the CONDENSATION algorithm, a particle filtering algorithm introduced by Isard and Blake [5]. Black and Jepson [2] adapted the CONDENSATION algorithm to classify hand trajectories by representing movements as parameterized models which are matched with the input data. We follow the implementation of Black and Jepson, where each movement model m^{(\mu)} consists of a 2-dimensional trajectory X_T which describes the motion of the hand during execution of the movement:

m^{(\mu)} = X_T = \{x_0, \dots, x_t, \dots, x_T\}                                   (1)

The hand trajectory is acquired by applying an adaptive skin color segmentation algorithm to an image sequence and tracking the resulting skin-colored regions with a Kalman filter (for details see [4]). To recognize activities consisting of a sequence of basic movements we use parent models M^{(\nu)} as a concatenation of child models m^{(\mu_i)} linked together by transition probabilities a_{\mu_i \to \mu_j}:

M^{(\nu)} = \{m^{(\mu_1)}, \dots, m^{(\mu_k)}\}                                     (2)
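To make the model structure concrete, the following Python sketch shows one possible representation of the child and parent models of Eqs. (1) and (2); the class names MovementModel and ParentModel and the example trajectories are illustrative assumptions, not part of the original system.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MovementModel:
    """Child model m^(mu): a 2-dimensional trajectory x_0 .. x_T."""
    name: str
    trajectory: np.ndarray        # shape (T+1, 2), one (x, y) point per time step

@dataclass
class ParentModel:
    """Parent model M^(nu): a concatenation of child models linked by
    transition probabilities a_{mu_i -> mu_j}."""
    name: str
    children: List[MovementModel]
    transitions: np.ndarray       # transitions[i, j] = a_{mu_i -> mu_j}

# Example: a 'pick' activity built from two hand-specified child trajectories
approach = MovementModel("approach", np.array([[1.0, 0.0]] * 10))
retract  = MovementModel("retract",  np.array([[-1.0, 0.0]] * 10))
pick = ParentModel("pick", [approach, retract],
                   transitions=np.array([[0.0, 1.0], [0.0, 0.0]]))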

For comparison of a model m^{(\mu)} with the observed data z_t at time t the sample vector s_t is used. This vector defines the child model \mu belonging to the parent model M^{(\nu)}. The time index \phi_t indicates the current position within the model trajectory. The parameter \alpha_t is used for amplitude scaling while \rho_t defines the scaling in time.

s_t = (\mu_t, \nu_t, \phi_t, \alpha_t, \rho_t)                                      (3)

The goal of the CONDENSATION algorithm is to determine the parameter vector s_t so that the fit of the model trajectory with the observed data z_t is maximized. This is achieved by temporal propagation of N weighted samples

\{(s_t^{(1)}, \pi_t^{(1)}), \dots, (s_t^{(N)}, \pi_t^{(N)})\}                       (4)

which represent the a posteriori probability p(s_t \mid z_t) at time t.
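A sample of Eq. (3) can be pictured as a small record; the sketch below assumes the reconstructed parameter names (mu, nu, phi, alpha, rho) and is not taken from the authors' implementation.

from typing import NamedTuple

class Sample(NamedTuple):
    mu: int       # index of the current child model within the parent
    nu: int       # index of the parent model
    phi: float    # current position (phase) within the model trajectory
    alpha: float  # amplitude scaling of the model trajectory
    rho: float    # time scaling (how fast phi advances per frame)

# one particle of the sample set in Eq. (4); its weight pi is stored separately
s = Sample(mu=0, nu=0, phi=0.0, alpha=1.0, rho=1.0)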

The weight \pi_t^{(n)} of a sample can be calculated from

\pi_t^{(n)} = \frac{p(z_t \mid s_t^{(n)})}{\sum_{j=1}^{N} p(z_t \mid s_t^{(j)})} \qquad \text{with} \qquad p(z_t \mid s_t) = \prod_{i=1}^{2} p(z_{t,i} \mid s_t)          (5)

For each component i of the model trajectory, scaled by \alpha_t and \rho_t, the probability p(z_{t,i} \mid s_t) is calculated by comparing the last w time steps of the model, starting at \phi_t, with the observed data. For calculating the difference between model and observed data a Gaussian density is assumed for each point of the model trajectory (see [2] for details).

The propagation of the weighted samples over time consists of three steps and is based on the sample pool of the previous time step (a code sketch of these steps follows the list):

Select: Selection of N - M samples s_{t-1}^{(n)} according to their respective weights from the sample pool at t - 1 (see Eq. 4) and random initialization of M new samples.

Predict: The parameters of each sample s_t^{(n)} are predicted by adding Gaussian noise to \alpha_{t-1} and \rho_{t-1} as well as to the position \phi_{t-1}, which is increased in each time step by \rho_t. If \phi_t > \phi_{max}, the next child \mu_{j,t}^{(n)} of the sample's parent \nu^{(n)} is selected or, if \mu_{i,t-1}^{(n)} was the last child model, a new sample is initialized.

Update: Determination of the weights \pi_t^{(n)} (see Eq. 5).
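The three propagation steps might be implemented roughly as follows. This is a simplified sketch, not the authors' code: samples are plain dictionaries, the noise magnitudes are invented, the next child model is chosen deterministically rather than via the transition probabilities a_{\mu_i \to \mu_j}, likelihood() stands in for p(z_t | s_t) of Eq. (5), and the defaults N=3500, M=175 are taken from the evaluation in Section 5.

import numpy as np

rng = np.random.default_rng(0)

def random_init(models):
    """Start a fresh sample at the beginning of a randomly chosen parent model."""
    nu = int(rng.integers(len(models)))
    return {"mu": 0, "nu": nu, "phi": 0.0,
            "alpha": float(rng.uniform(0.8, 1.2)),
            "rho": float(rng.uniform(0.8, 1.2))}

def propagate(samples, weights, models, likelihood, N=3500, M=175):
    """One select/predict/update cycle of the trajectory-matching particle filter.
    models[nu]["children"] is a list of child trajectories (numpy arrays);
    likelihood(sample) plays the role of p(z_t | s_t) in Eq. (5)."""
    # Select: resample N-M particles according to their weights, add M new ones
    idx = rng.choice(len(samples), size=N - M, p=weights)
    selected = [dict(samples[i]) for i in idx] + [random_init(models) for _ in range(M)]

    # Predict: diffuse alpha and rho, advance phi by rho (plus noise)
    predicted = []
    for s in selected:
        s["alpha"] += rng.normal(0.0, 0.05)
        s["rho"]   += rng.normal(0.0, 0.05)
        s["phi"]   += s["rho"] + rng.normal(0.0, 0.5)
        children = models[s["nu"]]["children"]
        if s["phi"] > len(children[s["mu"]]) - 1:      # current child finished
            if s["mu"] + 1 < len(children):            # switch to the next child
                s["mu"] += 1
                s["phi"] = 0.0
            else:                                      # last child: reinitialize
                s = random_init(models)
        predicted.append(s)

    # Update: weight each sample by how well the scaled model fits the observation
    w = np.array([likelihood(s) for s in predicted], dtype=float)
    return predicted, w / w.sum()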

The classification of movements is achieved by calculating the end-probability P_{end} that a certain child model \mu_i is completed at time t:

P_{end,t}(\mu_i) = \sum_{n=1}^{N} \begin{cases} \pi_t^{(n)} & \text{if } \mu_i \in s_t^{(n)} \wedge \phi_t > 0.9\,\phi_{max} \\ 0 & \text{else} \end{cases}          (6)

A parent model, i.e., an activity, is considered complete if the probability P_{end,t}(\mu_k) of the last child model \mu_k exceeds a threshold that has been determined empirically.
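A minimal sketch of Eq. (6), assuming the same dictionary-based samples as in the propagation sketch above; the 0.9 completion factor follows the reconstructed equation.

import numpy as np

def end_probability(samples, weights, mu_i, phi_max):
    """P_end,t(mu_i): weight mass of samples in child mu_i near completion."""
    mask = np.array([s["mu"] == mu_i and s["phi"] > 0.9 * phi_max
                     for s in samples])
    return float(np.sum(np.asarray(weights)[mask]))

# an activity (parent model) counts as recognized when the end-probability of
# its last child model exceeds an empirically chosen threshold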

4. Incorporating Context in Particle Filtering

The application of particle filtering for recognizing hand trajectories enables the recognition of activities that are defined solely on the basis of hand motions. In this section we present our extension to Black and Jepson's approach that enables the incorporation of context knowledge. For recognizing manipulative gestures we consider the scene objects as context. The symbolic scene information \Theta_t contains for each scene object O_l its type and a specific ID:

\Theta_t = \{O_1, \dots, O_l, \dots, O_L\}                                          (7)

O_l = (type_l, ID_l) \qquad \forall\, l \in \{1, \dots, L\}                         (8)

We first describe the situational context and the spatial context of a gesture before introducing the incorporation of this context into the particle filtering algorithm.

4.1. Situational Context

The situational context of a gesture consists of the Precondition that must hold for its recognition and the Effect the gesture has on the state of the scene. For modeling the state of the scene a global hand state (GHS) is defined that is either empty or holds an object O_l (see Eq. 8):

GHS_t = \{\emptyset \,|\, O_l\}                                                     (9)

Precondition and effect of a child model operate on the GHS. For example, as a precondition of the action 'take cup' the hand would need to be empty, and the effect of recognizing the action would be the hand holding the object 'cup'.
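As an illustration of how the situational context could gate sample creation, the sketch below encodes preconditions and effects as simple string tags operating on the global hand state; the tag names ('hand_empty', 'grasp', ...) are assumptions of this sketch, not the paper's representation.

EMPTY = None   # the global hand state GHS is either empty or a held object

def precondition_holds(action_model, ghs):
    """Select step: only instantiate samples whose precondition matches the GHS."""
    if action_model["precondition"] == "hand_empty":
        return ghs is EMPTY
    if action_model["precondition"] == "hand_full":
        return ghs is not EMPTY
    return True

def apply_effect(action_model, ghs, manipulated_object):
    """On recognition, the action's effect updates the GHS."""
    if action_model["effect"] == "grasp":
        return manipulated_object          # the hand now holds the object
    if action_model["effect"] == "release":
        return EMPTY
    return ghs

# e.g. 'take cup': precondition 'hand_empty', effect 'grasp'
take_cup = {"precondition": "hand_empty", "effect": "grasp"}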

4.2. Spatial Context

For a manipulative gesture the hand trajectory has to be related to the object that is manipulated. Obviously, this object must be close enough to the hand trajectory to be touched or picked for interaction. In limited environments this general formulation of object relevance as having a small distance to the hand trajectory may be sufficient [1], but in more complex environments several objects will fulfill such a simple distance criterion. In order to avoid the computational complexity as well as the increase in ambiguity associated with a larger number of relevant objects, we define a context area as the image area containing objects potentially relevant for a specific manipulative gesture. This area is positioned at the current hand position and is defined as a circle segment with radius r and start/end angles \beta_1, \beta_2. For interaction with objects that do not have a specific 'handling direction' and can be approached from different directions, the orientation orient of the context area is relative to the hand direction. For objects that need to be approached from a specific direction the context area has an absolute orientation.

Besides defining where symbolic context is expected, we need to specify what context is expected. First of all, the object type type defines what kind of object is expected in the context area. Additionally, the symbolic context can be irrelevant, necessary, or optional, represented in the context importance imp. This is essential as a manipulated object may be temporarily occluded by the hand manipulating it, preventing its successful recognition. The overall spatial context C_T of a gesture is therefore defined by the individual spatial contexts c_t for each time step t:

c_t = (imp, orient, \beta_1, \beta_2, r, type)                                      (10)

C_T = (c_1, \dots, c_t, \dots, c_T)                                                 (11)

Note that the object type type can denote a class of objects with the same manipulative properties, for example, objects that can be used to drink from (e.g., bottle, can, . . . ).
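One possible geometric test for the circle-segment context area is sketched below; the dictionary keys ('r', 'ang_min', 'ang_max', 'relative') and the angle convention are illustrative assumptions made for this sketch.

import numpy as np

def in_context_area(obj_pos, hand_pos, hand_dir, ctx):
    """Check whether an object lies in the circle-segment context area c_t.
    ctx: 'r' = radius, 'ang_min'/'ang_max' = segment bounds in radians,
    'relative' = True if the segment is oriented relative to the hand direction,
    False for an absolute image orientation."""
    offset = np.asarray(obj_pos, float) - np.asarray(hand_pos, float)
    if np.linalg.norm(offset) > ctx["r"]:
        return False
    angle = np.arctan2(offset[1], offset[0])
    if ctx["relative"]:                               # rotate into the hand's frame
        angle -= np.arctan2(hand_dir[1], hand_dir[0])
    angle = (angle + np.pi) % (2 * np.pi) - np.pi     # wrap to [-pi, pi)
    return ctx["ang_min"] <= angle <= ctx["ang_max"]

# a forward-facing 90 degree segment with a 40 pixel radius
ctx = {"r": 40.0, "ang_min": -np.pi / 4, "ang_max": np.pi / 4, "relative": True}
print(in_context_area((30, 5), (0, 0), (1, 0), ctx))   # True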

4.3. Incorporation of Context Information

Adding situational and spatial context to Eq. 1 results in a movement model suitable for action recognition:

m_{action}^{(\mu)} = (Precondition, X_T, C_T, Effect)                               (12)

The definition of an action model contains static information about the symbolic context of an action. However, similar to the parameters \phi, \alpha, \rho relating the trajectory model X_T to the observed motion data in the sample vector s_t in Eq. 3, we need to incorporate information about the relation between the spatial context C_T and the observed symbolic data. Using the example of the action 'take cup', the generic model context 'cup' needs to be associated with a specific object present in the scene. To relate the context object type type to a specific context object O_l in \Theta_t, the symbolic sample data \gamma_t is used to store the ID of the scene object and is added to the sample vector s_t, giving:

s_t = (\mu_t, \nu_t, \phi_t, \alpha_t, \rho_t, \gamma_t) \qquad \text{with} \qquad \gamma_t = (ID_t, HS_t)          (13)

Furthermore, \gamma_t contains the sample-specific hand state (HS) that is defined analogously to the GHS (see Eq. 9). Initialization of the ID in \gamma_t is performed the first time that an object context is expected by a gesture model and an appropriate scene object is found. Through this binding of the model to a specific object ID it is ensured that the manipulative gesture defined by this model interacts with the same object in the following time steps.

Using this extended sample vector, we now have to adapt the select, predict, and update steps of the particle filtering algorithm. Incorporating the context so that it affects the recognition is done in two ways: The situational context is applied in the select step of the particle filtering algorithm to initialize and select only those samples whose precondition matches the actual situation, i.e., the GHS contents. On recognizing an action, the GHS is changed by the effect defined in the action model. The spatial context is taken into account in the update step by changing the weight of samples based on their match with the symbol-based observations. For this purpose we extend the calculation of the sample weight in Eq. 5 with a multiplicative context factor P_{symb} representing how well the observed scene \Theta_t fits the expected context:

\pi_t^{(i)} \propto p(z_t \mid s_t^{(i)}) \cdot P_{symb}(\Theta_t, s_t^{(i)})       (14)
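The extended update step of Eq. (14) could look roughly as follows: a constant factor P_present or P_missing is returned depending on whether a matching object is observed, and the sample is bound to that object's ID on first contact. The function and key names are illustrative; only the constants 1.0 and 0.95 are taken from the evaluation in Section 5.

def symbolic_factor(sample, scene, objects_in_area, expected_type,
                    p_present=1.0, p_missing=0.95):
    """P_symb(Theta_t, s_t): constant factor depending on whether the expected
    context object is observed; binds the sample to a concrete object ID the
    first time a matching object is found (illustrative sketch)."""
    if sample.get("obj_id") is not None:              # sample is already bound
        hit = sample["obj_id"] in objects_in_area and \
              scene[sample["obj_id"]]["type"] == expected_type
        return p_present if hit else p_missing
    for obj_id in objects_in_area:                    # first-time binding
        if scene[obj_id]["type"] == expected_type:
            sample["obj_id"] = obj_id
            return p_present
    return p_missing

# extended update step of Eq. (14):
#   pi_t ~ p(z_t | s_t) * symbolic_factor(...), followed by renormalization
scene = {7: {"type": "cup", "pos": (120, 80)}}
s = {"mu": 0, "nu": 0, "phi": 3.0, "alpha": 1.0, "rho": 1.0, "obj_id": None}
print(symbolic_factor(s, scene, objects_in_area={7}, expected_type="cup"))  # 1.0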

Currently, a constant factor is used for P_{symb}, depending on whether the expected context is present (P_{present}) or not (P_{missing}). To determine the object manipulated after recognizing a gesture (see Eq. 6), the symbolic sample data \gamma in all samples of the sample set that were used for calculating P_{end,t}(\mu_i) is analyzed. The binary binding between a sample and an object does not provide information about the quality of the binding, but the sample weight represents how well the overall sample matches the observed gesture. Therefore, the object probability P_{Obj,t} is calculated for every object O_l based on the weights of the samples belonging to the gesture model \mu_i and containing this object in the symbolic sample data:

P_{Obj,t}(O_l, \mu_i) = \sum_{n=1}^{N} \begin{cases} \pi_t^{(n)} & \text{if } O_l \in \gamma_t^{(n)} \wedge \mu_i \in s_t^{(n)} \wedge \phi_t > 0.9\,\phi_{max} \\ 0 & \text{else} \end{cases}          (15)

The object with the highest probability P_{Obj,t}(O_l, \mu_i) is selected as the one manipulated by the recognized gesture. This probabilistic selection of the manipulated object allows us to recognize gestures that interact with an object in the vicinity of other objects of the same type.
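Eq. (15) and the subsequent selection of the most likely object can be sketched as follows, again assuming dictionary-based samples with an 'obj_id' binding; the names are illustrative.

from collections import defaultdict

def manipulated_object(samples, weights, mu_i, phi_max):
    """P_Obj,t(O_l, mu_i): accumulate the weight mass per bound object ID over
    the near-complete samples of gesture model mu_i and return the most likely
    manipulated object (or None if no sample is bound)."""
    p_obj = defaultdict(float)
    for s, w in zip(samples, weights):
        if (s["mu"] == mu_i and s["phi"] > 0.9 * phi_max
                and s.get("obj_id") is not None):
            p_obj[s["obj_id"]] += w
    return max(p_obj, key=p_obj.get) if p_obj else None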

5. Recognition Performance

The applicability of the proposed framework is demonstrated in an office environment (see Fig. 1) containing the actions listed in Table 1, which are performed with the right hand. For the model generation several movement trajectories were recorded and a mean trajectory was calculated after manual segmentation. The transition probabilities were set by hand. The symbolic scene information \Theta_t for the office domain contains the object types cup, phone, keyboard, and book (see Eq. 8). For the experiments correct object recognition results were available. Note that all actions except type on keyboard consist of two manipulative gestures, one fetching the object and one placing the object back on the table.

Figure 1. Office scenario used for evaluation with example trajectory and spatial context.

We have evaluated the recognition quality with six test subjects who were shown typical actions beforehand. Each of the subjects performed every action ten times, leading to a total of 60 examples per action. The parameters were N = 3500, M = 175, w = 10, P_present = 1.0, and P_missing = 0.95. All other parameters were set similar to [2]. The results obtained are given in Table 1.

Action              Recognized (#)   Recognized (%)
take cup                  58               97
stop drinking             53               88
pick up phone             59               98
hang up phone             58               97
pick book                 60              100
stop reading              52               87
type on keyboard          52               87
Total                    392             93.3

Table 1. Recognition results for 420 actions.

The actions take cup, pick up phone, and pick book, whose action models contain context objects, are recognized with a high recognition rate. A similarly high recognition rate was obtained for the hang up phone action, which contains a very characteristic trajectory model. The type on keyboard action had a lower recognition rate, as the distance between the rest position of the hand and the keyboard was small, resulting in a short trajectory model. The low recognition rates for stop drinking and stop reading were mostly due to a too-early recognition of these actions while the person was still interacting with the object, causing random motions that were erroneously recognized as actions. While the overall recognition rate of 93.3% is promising, further work will concentrate on using view-independent trajectory models based on relative motion and direction, and on a more detailed evaluation of the influence of the context factor.

6. Summary

In this paper we introduced our integrated action recognition approach with explicit modeling of the situational and spatial context of a manipulative gesture. Situational context is used in the select step of the particle filter while the spatial context affects the update step. The performance of the proposed action recognition approach was demonstrated with the recognition of actions in an office environment. The proposed approach provides a complete recognition of manipulative gestures, including which objects were manipulated.

Acknowledgements

This work was partially supported by the German Research Foundation within the special research project 'Situated Artificial Communicators' and the Graduate Program 'Task Oriented Communication'.

References

[1] D. Ayers and M. Shah. Monitoring human behavior in an office environment. In IEEE Workshop on Interpretation of Visual Motion, CVPR, Santa Barbara, CA, June 1998.
[2] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: CONDENSATION-based recognition of gestures and expressions. Lecture Notes in Computer Science, 1406:909-924, 1998.
[3] A. Bobick. Movement, activity, and action: The role of knowledge in the perception of motion. Phil. Trans. Royal Society London, B(352):1257-1265, 1997.
[4] J. Fritsch. Vision-based Recognition of Gestures with Context. Ph.D. dissertation, Bielefeld University, Technical Faculty, 2003.
[5] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. Lecture Notes in Computer Science, 1064:343-356, 1996.
[6] Y. Kuniyoshi and H. Inoue. Qualitative recognition of ongoing human action sequences. In Proc. Int. Joint Conf. on Artificial Intelligence, pages 1600-1609, 1993.
[7] D. Moore, I. Essa, and M. Hayes. Exploiting human actions and object context for recognition tasks. In Proc. Int. Conf. on Computer Vision, volume 1, pages 80-86, Corfu, 1999.