
Real-Time Recognition of Human Gestures for 3D Interaction

Antoni Jaume-i-Capó, Javier Varona, and Francisco J. Perales

Unitat de Gràfics i Visió per Ordinador
Departament de Ciències Matemàtiques i Informàtica
Universitat de les Illes Balears
Edifici Anselm Turmeda, Ctra. de Valldemossa km 7,5, 07122 Palma de Mallorca, Spain
{antoni.jaume,xavi.varona,paco.perales}@uib.es
http://dmi.uib.es/~ugiv/

Abstract. Natural interaction is a fundamental concept that is not yet fully exploited in most existing human-computer interfaces. Recent technological advances have made it possible to naturally and significantly enhance interface perception by means of visual inputs, the so-called Vision-Based Interaction (VBI). In this paper, we present a gesture recognition algorithm where the user's movements are obtained through a real-time vision-based motion capture system. Specifically, we focus on recognizing user motions that carry a particular meaning, that is, gestures. By defining an appropriate representation of the user's motions based on a temporal posture parameterization, we apply non-parametric techniques to learn and recognize the user's gestures in real-time. This recognition scheme has been tested for controlling a classical computer videogame. The results obtained show excellent performance in online classification, and the computational simplicity of the approach makes a real-time learning phase possible.

Keywords: Vision-Based Gesture Recognition, Human-Computer Interaction, Non-Parametric Classification.

1 Introduction

Since computers first appeared, researchers have been conceiving forms of interaction between people and machines. Today, human-computer interfaces are created in order to allow communication between humans and computers by means of a common set of physical or logical rules. Vision-based interfaces (VBI) use computer vision in order to sense and perceive the user and their actions within an HCI context [1]. In this sense, the idea of using body gestures as a means of interacting with computers is not new. The first notable system was Bolt's "Put That There" multimodal interface [2], which combined speech recognition with pointing to move objects within a scene.


The topic of human motion analysis has received great interest in the scientific community, mainly from biomechanics, computer graphics and computer vision researchers. The computer vision community has produced a large body of work in this field; for an exhaustive overview, see one of the most recent surveys on this topic [3]. From a human-computer interaction point of view, we are especially interested in obtaining user motions in order to recognize gestures that can be interpreted as system events. In this sense, the approaches used for gesture recognition, and for the analysis of human motion in general, can be classified into three major categories: motion-based, appearance-based, and model-based. Motion-based approaches attempt to recognize the gesture directly from the motion, without any structural information about the physical body [4,5]. Appearance-based approaches use two-dimensional information such as gray-scale images, edges or body silhouettes [6,7]. In contrast, model-based approaches focus on recovering the three-dimensional configuration of articulated body parts [8,9,10]. However, model-based approaches are often difficult to apply to real-world applications, mainly because of the difficulty of capturing and tracking the requisite model parts, i.e., the user's body joints that take part in the considered gestures. A partial solution is usually to simplify the capture to a few body parts and use their temporal trajectories to recognize the gestures of interest [11]. For example, Rao et al. consider the problem of learning and recognizing actions performed by a human hand [12]. They target affine invariance and apply their method to real image sequences, using skin color to find the hands. They characterize a gesture by means of dynamic moments, which they define as maxima in the spatio-temporal curvature of the hand trajectory that are preserved from 3D to 2D. Their system does not require a model; in fact, it builds its own model database by memorizing the input gestures. Other hand-based gesture recognition methods use hand poses as gestures for navigating in virtual worlds [13]. Nevertheless, exploiting only the 3D location of one or two hands is not sufficient for recognizing the complex gestures needed to control interactive applications. On the other hand, several classification techniques have been used in gesture recognition systems. In the majority of approaches, the temporal properties of the gesture are handled statistically using Hidden Markov Models (HMM), mainly because the image values are used directly [11]. However, these approaches are not applied in real-time because HMMs require a costly learning phase in order to tune all the model parameters. Our idea is to use the gesture recognition system in real-time, taking advantage of a robust estimation of the 3D positions of the user's body joints.

In this paper, we present a gesture recognition system that takes into account all body limbs involved in the considered gestures. The advantage of our system is that it is built on top of a motion capture system that recovers the body joint positions of the user's upper body in real-time. The computed joint positions are made spatially invariant by normalizing limb positions and sizes, so that only the limb orientations are used.


Fig. 1. System scheme

From the limb orientations, the user posture is represented by means of a histogram over all the limbs. By cumulating the posture histograms, we represent a gesture in a temporally invariant form. Finally, using this gesture representation, the performed gestures are classified in order to generate the desired computer events in real-time, see Figure 1.

This paper is organized as follows. The real-time full-body motion capture system used to obtain the user motions is presented in Section 2. Next, in Section 3, our human gesture representation is described. How the human gestures are recognized is explained in Section 4. The performance evaluation of our system in a real-time 3D interaction application and the obtained results are described in Section 5. The results are discussed in the last section in order to demonstrate the viability of this approach.

2 Human Motion Capture

In this work, the real-time constraint is very important because our goal is to use the captured motions as input for gesture recognition in a vision-based interface (VBI). In this sense, the motions of the user's limbs are extracted through a real-time vision-based motion capture system. Usually, locating all the user's body joints in order to recover the user's posture is not possible with computer vision algorithms alone, mainly because most of the joints are occluded by clothes. Inverse Kinematics (IK) approaches can solve the body posture from the 3D positions of visible body parts such as the face and hands, provided these can be clearly located. Therefore, these visible body parts (hereafter referred to as end-effectors) are automatically located in real-time and fed into an IK module, which in turn can provide 3D feedback to the vision system (see Figure 2). The only environmental constraint of the real-time vision-based motion capture system is that the user is located in an interactive space in front of a wide screen (such as a workbench), and that the background wall is covered with chroma-key material, as shown in Figure 3.


Fig. 2. General architecture of the real-time vision-based motion capture system

Fig. 3. Interactive space

The system may work without a chroma-key background; however, using it ensures a real-time response. This interactive space is instrumented with a stereo camera pair, previously calibrated by means of a simple automatic process. We apply chroma-keying, skin-color segmentation and 2D-tracking algorithms to each image of the stereo pair in order to locate the user's end-effectors in the scene. Then, we combine these results in a 3D-tracking algorithm to robustly estimate their 3D positions in the scene. Applying the Priority Inverse Kinematics algorithm with the 3D positions of the wrists as end-effectors, the motion capture system recovers the 3D positions of each considered user joint. Detailed technical information on this system can be found in [14].
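The concrete segmentation and tracking algorithms belong to the Vision-PIK system described in [14]. As an illustration only, the following Python sketch shows how per-camera skin-color blobs could be located and combined by stereo triangulation into 3D end-effector positions; the color bounds, blob selection and OpenCV calls are illustrative assumptions, not the paper's actual implementation.

```python
import cv2
import numpy as np

# Illustrative skin-color bounds in YCrCb space (assumed values, not the paper's).
SKIN_LOW = np.array([0, 133, 77], dtype=np.uint8)
SKIN_HIGH = np.array([255, 173, 127], dtype=np.uint8)

def skin_centroids(frame_bgr, max_blobs=3):
    """Return 2D centroids of the largest skin-colored blobs (face and hands)."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, SKIN_LOW, SKIN_HIGH)
    # OpenCV >= 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:max_blobs]
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids

def triangulate(P_left, P_right, pt_left, pt_right):
    """Recover a 3D point from matched 2D detections in the calibrated stereo pair."""
    pl = np.array(pt_left, dtype=float).reshape(2, 1)
    pr = np.array(pt_right, dtype=float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P_left, P_right, pl, pr)  # 4x1 homogeneous point
    return (X_h[:3] / X_h[3]).ravel()
```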

3 Human Gesture Representation

Using the computed 3D positions of the involved body joints, we address the main problems in the gesture recognition challenge: temporal, spatial and style


variations between gestures. Temporal variations are due to different gesture speeds between users. Spatial variations are due to physical characteristics of the human body, such as different body sizes. Style variations are due to the personal way in which each user performs his or her movements. To cope with spatial variations, we represent each body limb by means of a unit vector. Temporal variation is managed using a temporal gesture representation. Finally, the most difficult challenge, style variation, is addressed by parameterizing the gestures of each user in a simple learning phase.

An appropriate posture representation for gesture recognition must cope with spatial variations. In order to make the data invariant to different body sizes, the first step is to change the reference system, because this system is defined by the calibration process of the Vision-PIK algorithm. In our system we use a planar pattern for computing the intrinsic and extrinsic parameters of the stereo camera pair [15]. With this approach, the coordinate system is placed in the world depending on the location of the calibration object, as shown in Figure 4. Therefore, the joints' positions are referenced from an unknown world origin.

Fig. 4. Reference system alignment for unifying the joints' 3D positions reported by the vision-based motion capture system. Left: vision reference system. Right: user-centered reference system.

To solve this problem, the coordinate system is automatically aligned with the user's position and orientation in the first frame, as shown in Figure 4. The origin of the reference system is placed at the user's foot position. Next, the y-axis is aligned with the unit vector that joins the user's foot and back, and the x-axis is aligned with the unit vector that joins the user's right and left shoulders, with its vertical component set to zero. Once the reference system is aligned with the user's position and orientation, the joints' 3D positions are environment-independent, because the reference origin is attached to the user's body and does not depend on the calibration process. However, the data still depends on the size of the user's limbs.
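As a minimal sketch of this alignment step, the following code builds the user-centered frame from the first-frame joint positions and re-expresses any joint in it. The function names are hypothetical, and "setting the y component to zero" is interpreted here as removing the shoulder vector's component along the new vertical axis.

```python
import numpy as np

def user_frame(foot, back, r_shoulder, l_shoulder):
    """Build the user-centered reference frame from first-frame joint positions."""
    foot, back = np.asarray(foot, float), np.asarray(back, float)
    r_shoulder, l_shoulder = np.asarray(r_shoulder, float), np.asarray(l_shoulder, float)

    y_axis = back - foot                          # user's vertical axis
    y_axis /= np.linalg.norm(y_axis)

    x_axis = l_shoulder - r_shoulder              # right-to-left shoulder direction
    x_axis -= np.dot(x_axis, y_axis) * y_axis     # drop the component along the vertical axis
    x_axis /= np.linalg.norm(x_axis)

    z_axis = np.cross(x_axis, y_axis)             # completes a right-handed frame
    R = np.column_stack((x_axis, y_axis, z_axis)) # frame axes as columns, in vision coordinates
    return foot, R

def to_user_frame(joint, origin, R):
    """Express a joint 3D position in the user-centered reference system."""
    return R.T @ (np.asarray(joint, float) - origin)
```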


One possibility to make the data size-invariant is to use motion information of the joints through Euler angles [16]. Nevertheless, in this case the motion information is unstable, i.e., small changes in these values could produce wrong detections. Alternatively, we propose to represent each body limb by means of a unit vector that encodes the limb orientation. Formally, the unit vector u_l that represents the orientation of a limb l defined by joints J_1 and J_2 is computed as

    u_l = (J_2 − J_1) / ‖J_2 − J_1‖ ,                                  (1)

where J_i = (x_i, y_i, z_i) is the i-th joint 3D position in the user-centered reference system. In this way, depending on the desired gesture alphabet, it is only necessary to compute the unit vectors of the involved body limbs. This representation makes the data independent of the user's size and thus solves the spatial variations.

Once the motion capture data is spatially invariant, the next step is to represent the human posture. We build the posture representation using the unit vectors of the limbs involved in the gesture set. The idea is to represent the user's body posture as a feature vector composed of all the unit vectors of the user's limbs. Formally, the representation of the orientation of a limb l is

    q_l = (u_x^+, u_x^−, u_y^+, u_y^−, u_z^+, u_z^−) ,                 (2)

where u_x^+ and u_x^− are respectively the positive and negative magnitudes of the x component of the unit vector; note that u_x = u_x^+ − u_x^− and u_x^+, u_x^− ≥ 0. The same applies to the components u_y and u_z. In this way, the orientation components of the limb unit vector are half-wave rectified into six non-negative channels.
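A minimal sketch of this limb parameterization (Equations 1 and 2) could look as follows; the helper names are illustrative.

```python
import numpy as np

def limb_orientation(j1, j2):
    """Unit vector of the limb defined by joints J1 and J2 (Eq. 1)."""
    u = np.asarray(j2, float) - np.asarray(j1, float)
    return u / np.linalg.norm(u)

def rectify(u):
    """Half-wave rectify a limb unit vector into six non-negative channels (Eq. 2).

    Returns q_l = (ux+, ux-, uy+, uy-, uz+, uz-).
    """
    pos = np.maximum(u, 0.0)    # positive magnitudes
    neg = np.maximum(-u, 0.0)   # negative magnitudes
    return np.stack([pos, neg], axis=1).ravel()
```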

Therefore, we build a histogram of limb orientations which represents the complete set of the user's limb orientations. We propose two forms of building this histogram, see Figure 5. The first one cumulates the limb orientations,

    q = Σ_{l=1}^{n} q_l ,                                              (3)

and the second one links the limb poses,

    q = {q_l}_{l=1..n} ,                                               (4)

where n, in both cases, is the number of limbs involved in the gestures to be recognized. The main difference between the two representations depends on the considered gesture set. The cumulative representation is more robust to tracking errors, but the set of gestures that can be recognized is much smaller; for example, identical movements performed by different limbs cannot be distinguished. On the other hand, the linked representation allows the definition of more gestures, although it is more sensitive to errors in the estimation of the limb orientations.
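The two constructions of Equations 3 and 4 can be illustrated with the following sketch, taking as input the rectified limb vectors q_l of Equation 2.

```python
import numpy as np

def posture_cumulated(ql_vectors):
    """Cumulative posture histogram (Eq. 3): element-wise sum of the limb vectors q_l."""
    return np.sum(ql_vectors, axis=0)      # 6 bins shared by all limbs

def posture_linked(ql_vectors):
    """Linked posture histogram (Eq. 4): concatenation of the limb vectors q_l."""
    return np.concatenate(ql_vectors)      # 6 * n bins, one block per limb
```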


Fig. 5. Construction of the posture representation

Our approach represents gestures by means of a temporal representation of the user's postures. The reason for using posture information is that postures directly define gestures; in several cases, it is even possible to recognize a gesture from a single posture. If we consider a gesture to be composed of several body postures, the gesture representation feature vector accumulates the postures involved in the gesture, that is,

    q̂_t = (1/T) Σ_{i=t−T}^{t} q_i ,                                   (5)

where t is the current frame and T is the gesture periodicity, which can be interpreted as a temporal window of cumulated postures. This process absorbs the temporal variations of gestures by detecting the periodicity of each user's gesture performance in order to fix the value of T, that is, the temporal extent of the gesture.
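A simple way to maintain q̂_t on-line is a sliding window over the last T posture histograms, as in the following sketch; how T is detected from the gesture periodicity is not shown here.

```python
from collections import deque
import numpy as np

class GestureAccumulator:
    """Maintain the gesture feature as the mean of the last T posture histograms (Eq. 5)."""

    def __init__(self, T):
        self.window = deque(maxlen=T)   # T = gesture periodicity (temporal window length)

    def update(self, posture_histogram):
        """Add the current frame's posture histogram and return the updated feature."""
        self.window.append(np.asarray(posture_histogram, float))
        return np.mean(list(self.window), axis=0)
```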

4 Human Gesture Recognition

An important goal of this work is that the human-computer interaction should be performed using natural gestures. As has been shown in several experiments with children [17], whether a gesture feels natural depends on the user's experience. The key is to take advantage of the system's ability to work in real-time. For this reason, before the recognition process starts, the system asks the user to perform several of the allowable gestures in order to build a training set in real-time. Specifically, before starting the game, the system randomly prompts the user to make several isolated performances of each gesture. By performing the gestures several times in random order, the gesture models capture style variations. This is a way to automatically build the training set.
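The following sketch illustrates how such a training set could be collected; record_performance is a hypothetical callback that captures one isolated performance of the prompted gesture and returns its accumulated posture histogram.

```python
import random

def collect_training_set(gesture_names, record_performance, repetitions=3):
    """Prompt the user in random order and store exemplar histograms per gesture."""
    prompts = list(gesture_names) * repetitions
    random.shuffle(prompts)                         # random order captures style variations
    models = {name: [] for name in gesture_names}
    for name in prompts:
        print(f"Please perform the gesture: {name}")
        models[name].append(record_performance(name))
    return models
```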


Fig. 6. Interpretations of the rotation gesture by different users

In addition, we have analysed how users interpret each of the gestures; the complex gestures, in particular, are performed by different users in completely different ways, see Figure 6. This fact reinforces the idea of building user-specific gesture models. In order to complete the process, it is necessary to choose a distance for comparing the current gesture, q̂_t, with a gesture model, p̂. We choose the Earth Mover's Distance (EMD), a measure of the amount of work necessary to transform one weighted point set into another. It has been shown that bin-by-bin measures (e.g., the L_p distance or the normalized scalar product) are less robust than cross-bin measures such as the EMD, which allows features from different bins to be matched, for capturing perceptual dissimilarity between distributions [18].
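As an illustration, the following sketch classifies the current histogram against the stored exemplars by nearest neighbour. For simplicity it uses a one-dimensional EMD over equally spaced bins (computed via cumulative sums), which is only an approximation of the general EMD of [18]; the rejection threshold is an assumption.

```python
import numpy as np

def emd_1d(h1, h2):
    """Earth Mover's Distance between two histograms, simplified to one dimension.

    Bins are treated as equally spaced along a single axis, so the EMD reduces to the
    summed absolute difference of the cumulative histograms.
    """
    p = np.asarray(h1, float); p = p / p.sum()
    q = np.asarray(h2, float); q = q / q.sum()
    return np.abs(np.cumsum(p - q)).sum()

def recognize(current, models, threshold=None):
    """Nearest-neighbour classification of the current gesture histogram.

    models maps gesture names to lists of exemplar histograms collected during the
    learning phase; threshold rejects weak matches to limit false positives.
    """
    best_name, best_dist = None, np.inf
    for name, exemplars in models.items():
        for p_hat in exemplars:
            d = emd_1d(current, p_hat)
            if d < best_dist:
                best_name, best_dist = name, d
    if threshold is not None and best_dist > threshold:
        return None   # no gesture recognized
    return best_name
```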

5 Performance Evaluation

In order to test our gesture recognition approach, we asked different users to play a computer videogame, interacting by means of body gestures. The proposed game, a modified version of Tetris, provides four different commands: left, right, down and rotate. The system has been implemented in Visual C++ using the OpenCV library [19] and it has been tested in a real-time interaction context on an AMD Athlon 2800+ (2.083 GHz) under Windows XP. The images have been captured using two Sony DFW-500 cameras, which provide 320 × 240 images at a capture rate of 30 frames per second. To assess the real-time behaviour, we have measured the time needed to recognize a gesture once the joint positions of the user have been obtained.


Fig. 7. Some visual results of gesture recognition. In the case of the rotate gesture, a gesture sequence is shown.

The real-time performance of the vision-based motion capture system was evaluated in [14]. Including the gesture recognition step, the frame rate is 21 frames per second. Therefore, we can conclude that our approach works in near real-time. For testing purposes, we acquired several sessions of different users while they produced all the gestures during the videogame. The game was manually controlled by a human operator (the classical Wizard of Oz experiment [17]) in order to give users the immersive experience of really playing the game with their own gestures. At the moment, our dataset contains a training set in which each user performs each command three times. After training the system, we evaluated its performance by testing the real behaviour of the recognition system. Specifically, the testing set is composed of 73 different performances of the command set by three different users.

Table 1. Comparative results between the proposed posture representations

Posture Representation | Gestures | Correct | Wrong | Not Recognized | False Positive
cumulated              |    73    | 84.95%  | 4.10% |     10.95%     |     7.20%
linked                 |    73    | 87.69%  | 2.73% |      9.58%     |     4.18%

The results presented in Table 1 and Figure 7 show that both representations obtain good results, with a reasonable rate of correct recognition, although it should be noted that the gesture set in this application is small. Note that the linked representation is more accurate, since its number of false positives is smaller than that of the cumulated representation (a false positive occurs when the system recognizes a gesture although the user has not performed any gesture).


Fig. 8. Recognition misclassifications due to errors in the Vision-PIK estimation of the user's body joints

In addition, the majority of misclassifications and non-recognized gestures are due to errors in the Vision-PIK estimation of the user's body joints, see Figure 8. In this case, Table 1 shows that the linked representation is again more robust to these feature extraction errors than the cumulated one.

6 Conclusions

We have shown the potential of the system through a user interface. In this sense, we have defined two appropriate gesture representations capable of coping with variations between gestures across different users and performances, while also making real-time recognition possible. Our approach is original and could be extended to represent more complex gestures and human activities. In fact, hand-based gesture recognition could be approached with the presented representation by substituting finger poses for the user's body posture.

The complete system has been tested in a real-time application, a gesture-based videogame control, and the results obtained show that the presented gesture recognition approach performs well. From these experiments, we can conclude that for the control of interactive applications with a reduced alphabet, the linked representation can alleviate the errors of the feature extraction step, making the interface more robust. On the other hand, the experiments have shown that, from a practical point of view, this classification technique is appropriate for real-world problems due to its simplicity in learning and on-line classification. Besides, the system adapts itself to each particular user's way of performing the gestures, avoiding a prior off-line training phase to learn the gestures that can be recognized by the system.

As future work, this approach can be extended to more complex gestures than the ones shown in the presented application by adding more limbs to the gesture representation. It is important to point out that our approach needs further testing; specifically, it should be evaluated in real sessions with more users, and these sessions should examine how the number of learning exemplars affects the recognition of gestures. Currently, we plan to use the recognized gesture in the


IK algorithm to improve its results using the on-line information of the gesture being performed.

Acknowledgements

This work has been supported by the Spanish M.E.C. under project TIN2007-67993 and by the Ramón y Cajal fellowship of Dr. J. Varona.

References

1. Turk, M., Kölsch, M.: Perceptual interfaces. In: Emerging Topics in Computer Vision. Prentice Hall, Englewood Cliffs (2004)
2. Bolt, R.A.: 'Put-That-There': Voice and gesture at the graphics interface. In: SIGGRAPH 1980: Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, pp. 262–270. ACM Press, New York (1980)
3. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2–3), 90–126 (2006)
4. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Analysis and Machine Intelligence 23(3), 257–267 (2001)
5. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proceedings of the International Conference on Computer Vision (ICCV 2003) (2003)
6. Starner, T., Weaver, J., Pentland, A.: Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Trans. Pattern Analysis and Machine Intelligence 20(12), 1371–1375 (1998)
7. Elgammal, A., Shet, V., Yacoob, Y., Davis, L.: Learning dynamics for exemplar-based gesture recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2003), pp. 571–578 (2003)
8. Kojima, A., Izumi, M., Tamura, T., Fukunaga, K.: Generating natural language description of human behavior from video images. In: Proceedings of the 15th International Conference on Pattern Recognition, vol. 4, pp. 4728–4731 (2000)
9. Ren, H., Xu, G., Kee, S.: Subject-independent natural action recognition. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 523–528 (2004)
10. Wang, L., Tan, T., Ning, H., Hu, W.: Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12), 1505–1518 (2003)
11. Wu, Y., Huang, T.S.: Vision-based gesture recognition: A review. In: Braffort, A., Gibet, S., Teil, D., Gherbi, R., Richardson, J. (eds.) GW 1999. LNCS (LNAI), vol. 1739, pp. 103–115. Springer, Heidelberg (2000)
12. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. International Journal of Computer Vision 50(2), 203–226 (2002)
13. O'Hagan, R.G., Zelinsky, A., Rougeaux, S.: Visual gesture interfaces for virtual environments. Interacting with Computers 14(3), 231–250 (2002)
14. Boulic, R., Varona, J., Unzueta, L., Peinado, M., Suescun, A., Perales, F.: Evaluation of on-line analytic and numeric inverse kinematics approaches driven by partial vision input. Virtual Reality 10(1), 48–61 (2006)


15. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
16. Moeslund, T.B., Reng, L., Granum, E.: Finding motion primitives in human body gestures. In: Gesture in Human-Computer Interaction and Simulation: 6th International Gesture Workshop, pp. 133–144 (2006)
17. Höysniemi, J., Hämäläinen, P., Turkki, L., Rouvi, T.: Children's intuitive gestures in vision-based action games. Commun. ACM 48(1), 44–50 (2005)
18. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
19. Bradski, G., Pisarevsky, V.: Intel's computer vision library. In: Proceedings of Computer Vision and Pattern Recognition (CVPR 2000), vol. 2, pp. 796–797 (2000)