IMTC 2005 – Instrumentation and Measurement Technology Conference Ottawa, Canada, 17-19 May 2005

Dynamic Gesture Recognition
Chris Joslin, Ayman El-Sawah, Qing Chen, Nicolas Georganas
Discover Laboratory, University of Ottawa, SITE, 800 King Edward Avenue, Ottawa, Ontario, K1N 6N5
{joslin, aelsawah, qchen, georganas}@discover.uottawa.ca

Abstract - In this paper we introduce our method for enabling dynamic recognition of hand gestures. Like much other research on gesture recognition, we use a camera to track the motions and interpret them as meaningful gestures; however, we emphasise the tracking of the fingers as well as the hand in order to cover a much wider range of gestures. Recognition is performed in three key stages, with a fourth in development. The first stage processes the visual information from the camera and identifies the key regions and elements (such as the hand and fingers). This classified information is passed to a 2D-to-3D module that transforms the 2D classification into a full 3D space, applying it to a calibrated hand model using inverse projection matrices and inverse kinematics. Simplifying this model into posture-curvature information, we apply it to a Hidden Markov Model (HMM), which is used to identify and differentiate between gestures, even ones using the same finger combinations. We also briefly discuss our current work on applying context awareness to this scenario, which is used in combination with the HMM in order to apply a different semantic to each gesture. This is especially useful given the large overlap in the semantics attributed to hand gestures.


Keywords – Gesture Recognition, Hidden Markov Models, 3D space


I. INTRODUCTION

Complete Human Computer Interaction (HCI) is a domain reliant on its own sub-domains, like any other research field. Voice recognition, nuance, and overall understanding still rely on each other's input, such as a view of the user's face; this mimics a human's requirement to fully understand a conversation by hearing the voice and watching the accompanying body gestures, poses, and facial expressions, rather than by reading a textual transcript. Dynamic gesture recognition is one part of HCI, and whilst it can play a large role on its own in obtaining information from a hand gesture and identifying it with a specific meaning, it can also play its role within a larger system identified with a specific context. Essentially, dynamic gesture recognition is the recognition of a set of user-centred motions in a single continuous flow. For example, a user makes the "thumbs-up" sign and the computer processes this and determines from its database that this is the sign for "okay". The complexities lie in two distinct areas: identifying the actual motion itself (by tracking a particular limb, or by comparing video images, for example) and then understanding that motion, compared against hundreds of other specific and non-specific gestures. In our research we focus on gestures made by the hand and arms, relative to the body (i.e. we do not consider other body or facial motions, although we do account for their interference).

A. Constraints and Considerations

It is important to discuss the constraints and conditions applied to our research in order to justify our method; these are as follows:


- Generic Environment ~ it was very important that the system be capable of working in a generic environment, without static or homogeneous backgrounds. It could not be assumed that a specific camera could be relied upon, or that fixed lighting or background segmentation would be present at any point.
- Generic Applications ~ whilst the research was aimed at enabling dynamic gesture recognition, the actual end application was not fixed; in fact, as stated at the beginning of the introduction, it is the context that finally indicates an actual gesture's meaning. Therefore, in our approach we could not define a subset of gestures applicable only to a specific situation. This generic approach also carries through to the 2D-to-3D conversion module described in Section II.B, whereby the data from the 3D space is used for gesture recognition, but can also be used for control in a 3D space.
- Marker-less Gestures ~ the initial research presented in this paper uses a "markered" approach to obtaining 2D positions, and consequently gestures; however, our overall aim is to eventually remove all but a single identification marker. This one marker would provide the system with a reference point indicating which hand, in a crowded room, is the one providing the gesture input.

B. State-of-the-Art

Gesture recognition is a well-populated research domain with firm beginnings in the early 1990s; consequently, many approaches have been proposed for its different aspects. The term 'gesture' is often used loosely and incorrectly as a replacement for 'pose' (a gesture is a sequence of poses over time); whilst we abhor this vague usage, we also address methods using pose recognition in order to provide a complete analysis of the state-of-the-art. There are several key domains encompassed under the broad umbrella of "gesture recognition"; here we highlight some of the most significant. The earliest key instances were from Ishibuchi et al. [1], who introduced hand gesture recognition using a 3D prediction model in 1993; in the same year, Ahmad et al. [2] introduced the use of Bayesian techniques to recognise 3D hand gestures even with large amounts of noise and uncertain or missing input data. Mu-Chun et al. [3] discuss the use of a neural network to recognise only static postures, whilst Jung et al. [4] demonstrate methods for finding a human hand in an image sequence using colour separation; from the same group, Min et al. [5] introduce the use of Hidden Markov Models to recognise gestures. Eickeler et al. [6] followed this and presented a continuous Hidden Markov Model that essentially introduced dynamic feature extraction and evaluation methods. More recently, Bretzner et al. [7] have used a multi-scale colour segmentation technique, coupled with particle filters, in order to track hand postures. Ming-Hsuan et al. [8] use 2D motion trajectories to track hand gestures, whilst Xia et al. [9] use depth data to provide information on hand gestures. Other significant achievements have contributed to this domain, but due to paper length limitations we offer only a selection of the key research in this area. In the following section we describe our approach, how it differs from previous and ongoing research, and the direction it is taking in terms of future work.

II. APPROACH AND RESULTS

Holistically speaking, we use a camera-based system to record information about the gesture; this is processed from 2D space into 3D space in order to obtain a broader understanding of the gesture, and recognition is then performed using a Hidden Markov Model (HMM). For our approach we chose a modular system consisting of four key components, each with a separate task and interlinked with simple Application Programming Interfaces (APIs). This was done for two main reasons: firstly, it permitted multiple researchers to work together flexibly; secondly, we were able to plug in different replacement modules for specific circumstances (addressed briefly later). The system is described in the following sub-sections, and an overview is illustrated in Figure 1.

Fig. 1. Dynamic Gesture Recognition System Overview

We have focused our research on gestures, rather than hand poses, taking into account the significant role played by the fingers as well as by the hand itself. Again, this differentiates our work, as many previous research papers use the term "hand gestures" to indicate mainly a strong arm movement involving the hand.

A. Two-Dimensional Image Processing

We use a non-specific camera to provide raw images that are processed in an HSV colour format; ideally we would prefer a full 24-bit colour depth, but 12 bits or more is sufficient (in our initial model we used 16 bits in a Bayer tile format). This raw image is continuously updated at approximately 30 frames per second (fps) and is initially processed to remove noise and compensate for light variations. We use a colour reference technique to identify both the hand(s) and the markers; this is mainly because, for the orientations and distances that we are using, it is not possible to use pattern recognition techniques, and it is also faster to process colour information than pattern information.


The user to whom the gesture will be attributed wears a specific set of markers on their hand. This enables the system to quickly identify the gesturing hand and to distinguish it from other users that may be in the same field of view. As part of the initial phase, we convert the colour space into HSV (Hue, Saturation, Value), with 255 values for each plane. This provides the best performance in terms of identifying a very narrow variance (usually 3 or 4 levels) of colour (Hue), with a small acceptable variance (larger than for Hue, usually around 10) for shading (Saturation), and a large acceptable variance (usually the full range of 255) for the lighting conditions (Value). We initially identify all the "skin"-coloured elements in the room associated with a specific marker colour (in this case a specific yellow), passing a pixel if it falls within the specified range. We then process the identified pixels for noise by counting the number per segment (determining whether a mass of pixels exists in one place, a segment being 8 by 8 pixels) and by segment clustering (determining whether segments join with each other). This technique identifies the gesturing hand and provides information as to its orientation. Markers of a different colour are used on the fingers in order to recognise the finger end-effector positions.
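As a rough sketch of this stage (not the authors' implementation), the fragment below thresholds a frame in HSV and counts hits per 8x8 segment to suppress noise; it assumes OpenCV/NumPy, uses OpenCV's 0-179 hue scale rather than the 255-level planes described above, and the threshold values are illustrative placeholders that would need calibrating against the actual marker colour.

```python
import cv2
import numpy as np

# Illustrative thresholds only: a narrow Hue window, a modest Saturation window,
# and the full Value (brightness) range, following the variances described above.
MARKER_LOWER = np.array([28, 120, 0])     # assumed yellow marker, OpenCV HSV scale
MARKER_UPPER = np.array([32, 130, 255])

SEGMENT = 8        # 8 x 8 pixel segments
MIN_PIXELS = 16    # minimum in-range pixels for a segment to count as "occupied"

def occupied_segments(frame_bgr):
    """Threshold in HSV, then count in-range pixels per 8x8 segment."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, MARKER_LOWER, MARKER_UPPER)          # 0/255 mask
    hits = (mask > 0).astype(np.uint8)
    h, w = hits.shape
    hs, ws = h // SEGMENT, w // SEGMENT
    blocks = hits[:hs * SEGMENT, :ws * SEGMENT].reshape(hs, SEGMENT, ws, SEGMENT)
    counts = blocks.sum(axis=(1, 3))          # in-range pixels per segment
    return counts >= MIN_PIXELS               # boolean "occupied" grid
```

Adjacent occupied segments would then be clustered (for example with a connected-components pass over the boolean grid) to locate the gesturing hand and each marker.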


Fig. 2. Hand/Marker Identification

The marker on the back of the hand is a 0.02 m by 0.02 m square, and we search for the marker's corners (as shown in Figure 2) in order to pass this information for transformation into the 3D domain. As the corners' positions are essential for an accurate interpretation of the hand's orientation and distance from the camera, and as the corners as seen by the camera are rather fuzzy, we process this information in order to obtain the most accurate positions possible. To do this we first determine the edges of the marker, assuming a solid edge (i.e. the colour boundaries in both the X and Y directions of the frame), and then remove duplicate pixel information, followed by a sequential ordering of the pixels to make a path (i.e. we search each border pixel to determine its neighbouring points in order to make a continuous path). We then process this ordered pixel path, determining the gradient, whereby each severe gradient change indicates a corner. The corner information here is only used to determine the sides of the marker (i.e. each side's start and end points). We then fit each side as a single implicit linear equation (Ax + By + C = 0) using linear regression in order to determine A, B, and C (whereby B is fixed to -1 when A ≠ 0). Using the four linear equations (one for each side), the line intersection points are found (within a bounding region, as opposite lines are very rarely parallel) and used to determine the true marker corners.
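A minimal sketch of this side-fitting and intersection step is given below, assuming NumPy; it reads the "B fixed to -1" convention as fitting y = Ax + C for non-vertical sides and falling back to x = constant otherwise, which is one plausible interpretation rather than the exact original implementation.

```python
import numpy as np

def fit_side(points):
    """Least-squares fit of one marker side as A*x + B*y + C = 0.
    For non-vertical sides this is y = A*x + C (B fixed to -1)."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    if np.ptp(x) < 1e-6:                 # (near-)vertical side: x = constant
        return (1.0, 0.0, -x.mean())
    A, C = np.polyfit(x, y, 1)           # slope and intercept of y = A*x + C
    return (A, -1.0, C)

def intersect(l1, l2):
    """Intersection of two implicit lines a*x + b*y + c = 0."""
    (a1, b1, c1), (a2, b2, c2) = l1, l2
    M = np.array([[a1, b1], [a2, b2]])
    return np.linalg.solve(M, -np.array([c1, c2]))   # raises if the lines are parallel

def refine_corners(sides):
    """sides: four lists of border pixels, ordered around the marker.
    Adjacent sides intersect at the refined corner positions."""
    lines = [fit_side(s) for s in sides]
    return [intersect(lines[i], lines[(i + 1) % 4]) for i in range(4)]
```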

B. Two to Three Dimensional Conversions

In order to work in a 3D space, the 2D coordinates of the hand and finger end points are transformed using an inverse projection, and then refined using inverse kinematics to determine the finger joint angles. The marker on the back of the hand is used as a key reference point; it is made to be exactly 0.02 m x 0.02 m square, and this is used to determine the hand's orientation. Using the perspective projection equation and the perpendicular nature of the marker, as shown in Figure 3, we are able to dynamically determine the orientation of the hand as one of two possible positions. Using the finger markers, we are able to refine this to only one position. The refinement process includes evaluating all possible 3D coordinates corresponding to the 2D coordinates evaluated in the previous step. Due to the requirement for dynamic and real-time processing of hand orientation and joint information, we introduced a method for computing the joint angles using a constraints-based hash table.

Fig. 3. Computation of the Hand's Orientation using Marker
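To illustrate the kind of relationship involved (an approximation of ours, not the paper's actual projection computation), a simple pinhole model with an assumed focal length gives the hand's distance from the apparent marker size, and a tilt angle from the foreshortening of one side; the sign of the tilt is ambiguous, matching the two candidate orientations mentioned above, and is resolved using the finger markers.

```python
import numpy as np

MARKER_SIZE = 0.02   # metres: the 2 cm square on the back of the hand
FOCAL_PX = 800.0     # assumed focal length in pixels (from camera calibration)

def estimate_marker_pose(width_px, height_px):
    """Rough distance and tilt of the square marker under a pinhole model.
    width_px  : projected side length along the axis facing the camera
    height_px : projected side length along the tilted axis
    Returns (distance_m, tilt_rad, -tilt_rad); the sign is ambiguous."""
    distance = FOCAL_PX * MARKER_SIZE / width_px
    ratio = np.clip(height_px / width_px, 0.0, 1.0)   # foreshortening factor
    tilt = np.arccos(ratio)                           # height = width * cos(tilt)
    return distance, tilt, -tilt
```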

By using forward kinematics and computing the end-effector position of each finger for all possible joint positions, we were able to create a hash table mapping each end position to the corresponding angles.


Using a binary search method in the hash table, we can compute all the joint angles for a hand with minimal computational expenditure. This effectively produces a set of 3D joint angles from 2D positional inputs. In terms of our modular approach, we can replace the first two modules with a CyberGlove™ [10], which is capable of accurately recording finger joint motions via flex sensors in a tightly fitting glove, and which aids in training.
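A toy version of this precompute-and-search idea is sketched below, assuming a planar three-joint finger with made-up link lengths and a coarse angle grid; the real system uses the full constrained hand model, but the structure (build the table offline, search it at run time) is the same in spirit.

```python
import bisect
import numpy as np
from itertools import product

LINKS = (0.045, 0.025, 0.020)     # assumed phalanx lengths in metres
STEP = np.radians(10)             # angle discretisation step

def fingertip(angles):
    """Planar forward kinematics: cumulative flexion of three phalanges."""
    x = y = total = 0.0
    for link, a in zip(LINKS, angles):
        total += a
        x += link * np.cos(total)
        y += link * np.sin(total)
    return (x, y)

def build_table():
    """Offline: map quantised fingertip positions to the joint angles producing them."""
    grid = np.arange(0.0, np.radians(91), STEP)
    table = [(fingertip(angles), angles) for angles in product(grid, repeat=3)]
    table.sort(key=lambda entry: entry[0])            # sorted so we can bisect
    return table

def lookup(table, target_xy):
    """Run time: bisect to the neighbourhood of the target, then pick the nearest entry."""
    keys = [entry[0] for entry in table]
    i = bisect.bisect_left(keys, tuple(target_xy))
    lo, hi = max(0, i - 50), min(len(table), i + 50)
    nearest = min(table[lo:hi],
                  key=lambda e: (e[0][0] - target_xy[0]) ** 2 + (e[0][1] - target_xy[1]) ** 2)
    return nearest[1]                                  # the joint angles
```

The ±50-entry window is arbitrary; a production table would index positions more carefully (or use a spatial structure) so that the nearest entry is always found.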

Fig. 4. 2D to 3D Conversion Block Diagram

Fig. 5. HMM Learning and Processing

An overview of the 2D to 3D conversion module is shown in Figure 4. The tracking algorithm refines the hand orientation and the 3D postures evaluated by inverse kinematics. It utilises visual cues, in the form of extra 2D feature points or projection lines, and a posture estimator acting on a posture history table. A major advantage of the posture estimator is that it provides an interface allowing us to decouple the 2D-to-3D conversion module's processing loop from the dynamic gesture recognition module(s), which run in different thread(s) and require a precise and essentially higher sampling frequency, using a mutual exclusion (mutex) or concurrent-read, exclusive-write (CREW) monitor.
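As a simplified sketch of this decoupling (names and structure are ours, not the paper's), the posture history table can be wrapped in a small monitor; a plain lock is used here for brevity, whereas a true CREW monitor would additionally allow multiple simultaneous readers.

```python
import threading
from collections import deque

class PostureHistory:
    """The 2D-to-3D loop pushes postures at its own rate; recognition thread(s)
    read consistent snapshots independently, at their own (higher) frequency."""

    def __init__(self, maxlen=256):
        self._lock = threading.Lock()
        self._history = deque(maxlen=maxlen)

    def push(self, posture):
        """Exclusive write, called by the conversion loop."""
        with self._lock:
            self._history.append(posture)

    def snapshot(self, n=None):
        """Read a copy of the most recent postures for the recognition module."""
        with self._lock:
            items = list(self._history)
        return items if n is None else items[-n:]
```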

C. Recognition

The Hidden Markov Model (HMM) [11-16] is a doubly embedded stochastic process with an underlying stochastic process that we want to recognise. The chosen HMM uses a "State" layer, representing the dynamic gesture, and an "Observation" layer, representing the observed sequence of hand postures described by the finger joints and their angles.

A learning pre-processing step (as shown in Figure 5) is used to represent dynamic gestures in terms of a set of time-dependent stochastic matrices. The model is trained much like a neural network, with statistically variable input data representing a specific gesture presented during the learning stage. The statistical variation is enriched by adding random offsets and time scaling, to accommodate dynamic gesture acuteness and speed respectively. A model can also be merged with other models to allow further generalisation using different training data, or to build a hierarchical taxonomy of dynamic gestures by grouping similar gestures (models) in an abstract model, which can be refined further using more specific models. During the recognition stage we feed the model a time sequence of hand attributes, which are classified as postures. The probability of each dynamic gesture given the observed sequence is calculated using the stochastic matrices representing each model, computed in the training stage.
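The enrichment step might look something like the sketch below (the parameter values are ours and purely illustrative): each recorded joint-angle sequence is resampled in time to vary the gesture's speed and shifted by a random offset to vary its acuteness before being added to the training set.

```python
import numpy as np

def augment(sequence, n_copies=20, offset_std=0.05, scale_range=(0.8, 1.25)):
    """sequence: array of shape (frames, joint-angle features).
    Returns n_copies randomly time-scaled and offset variants."""
    seq = np.asarray(sequence, dtype=float)
    t = np.arange(len(seq))
    copies = []
    for _ in range(n_copies):
        scale = np.random.uniform(*scale_range)                # gesture speed
        new_len = max(2, int(round(len(seq) * scale)))
        new_t = np.linspace(0, len(seq) - 1, new_len)
        resampled = np.stack([np.interp(new_t, t, seq[:, k])   # per-feature resampling
                              for k in range(seq.shape[1])], axis=1)
        offset = np.random.normal(0.0, offset_std, size=(1, seq.shape[1]))
        copies.append(resampled + offset)                      # gesture acuteness
    return copies
```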


The dynamic gesture with the highest probability above the acceptance threshold is recognised. The attribute set used to define postures is based on the finger joints and their angles, using a 26-degrees-of-freedom hand model. We argue that the attributes we use, namely finger curvature, tilt, and abduction, are more suitable for recognition because they relate directly to the finger posture, i.e. they do not require a hand model to visualise the posture. We also provide different levels of abstraction in which postures, and consequently dynamic gestures, can be represented, saving data storage and processing time. Figure 6 presents a graphical representation of the finger curvature, calculated as the reciprocal of the radius of the circle on each finger.

Fig. 6. Posture Attribute Extraction
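A compact sketch of this scoring step is given below, assuming (more simply than the attribute set above) that each observed posture has been quantised to a discrete symbol, so that each gesture model is a standard discrete HMM (pi, A, B); the scaled forward algorithm gives the log-likelihood of the observation sequence, and the best-scoring gesture above the acceptance threshold is reported.

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    """Scaled forward algorithm for one discrete HMM.
    obs: posture symbols (ints); pi: (N,) initial probabilities;
    A: (N, N) state transitions; B: (N, M) emission probabilities."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate, then weight by the emission
        c = alpha.sum()                   # scaling factor avoids numerical underflow
        log_p += np.log(c)
        alpha /= c
    return log_p

def recognise(obs, models, threshold):
    """models: {gesture name: (pi, A, B)}. Returns the best gesture or None."""
    scores = {name: forward_log_prob(obs, *m) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```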

Currently, our HMM is able to recognise 5 gestures, ranging from a quotes sign to a grasp, with a 100% success rate; all involve finger motions rather than only hand motion. This figure drops to just above 90% when recognition is performed for a user on whom the model has not been trained. However, this is being improved by the use of an automatic calibration technique (hand size being the major cause of misinterpretation).

D. Context Awareness

In this fourth module we examine the influence of specific contexts on a recognised gesture. Contexts include the environment within which the gesture is being used, the situation, and the user's own context. As this work is still being researched, and due to the large number of possibilities, we can only offer a few examples, as follows:

- Environmental Context ~ this might include differentiating between quiet and noisy environments, situations with low visibility, and possible geographical locations and variances.
- Situational Context ~ depending on the scenario (e.g. the same gesture in CAD and in film production will have completely different meanings).
- User Context ~ the user's own abilities or disabilities might influence the context, such as whether someone is visually or hearing impaired, the person's ethnic background or gender, or even their age.

All these elements will influence how a gesture is recognised and interpreted within that specific context. As in everyday life, the same gesture can have a multitude of meanings in different contexts; our current research is focused on classifying these elements and determining how to attribute their influence.

III. FUTURE WORK

The work presented here shows the initial results of an ongoing programme of research. As well as our work on providing context awareness to the application, we are looking towards specific marker improvements in order to remove all but one marker (specifically designed to orient the 3D reference world, and to be more visible from a larger range of orientations). We are also using higher-resolution cameras equipped with specific region-of-interest functions to enable large-room capabilities, so that gestures can be recognised in more useful and unconstrained environments, coupled with research into the problems associated with such complexities. We are also training our HMM to recognise a much greater range of gestures.


ACKNOWLEDGEMENT

The research presented in this paper has been funded by the CITO (Communications and Information Technology Ontario) project VERGINA (Virtual Environment Research in hapto-visual Gesture-recognition Interfaces). The authors would also like to thank Thierry Metais for his advice, and Francois Malric for providing essential technical support.


REFERENCES

[1] K. Ishibuchi, H. Takemura, F. Kishino, "Real-Time Hand Gesture Recognition Using 3D Prediction Model", IEEE International Conference on Systems, Man and Cybernetics: 'Systems Engineering in the Service of Humans', 17-20 Oct. 1993, Vol. 5, pp. 324-328.
[2] S. Ahmad, V. Tresp, "Classification with Missing and Uncertain Inputs", IEEE International Conference on Neural Networks, 28 Mar.-1 Apr. 1993, Vol. 3, pp. 1949-1954.
[3] S. Mu-Chun, J. Woung-Fei, C. Hsiao-Te, "A Static Hand Gesture Recognition System Using a Composite Neural Network", Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, 8-11 Sep. 1996, Vol. 2, pp. 786-792.
[4] S. Jung, Y. Ho-Sub, W. Min, B. W. Min, "Locating Hands in Complex Images Using Color Analysis", IEEE International Conference on Systems, Man, and Cybernetics: 'Computational Cybernetics and Simulation', 12-15 Oct. 1997, Vol. 3, pp. 2142-2146.
[5] B. W. Min, Y. Ho-Sub, S. Jung, Y. Yun-Mo, E. Toshiaki, "Hand Gesture Recognition Using Hidden Markov Models", IEEE International Conference on Systems, Man, and Cybernetics: 'Computational Cybernetics and Simulation', 12-15 Oct. 1997, Vol. 5, pp. 4232-4235.
[6] S. Eickeler, A. Kosmala, G. Rigoll, "Hidden Markov Model Based Continuous Online Gesture Recognition", International Conference on Pattern Recognition (ICPR), Aug. 1998, pp. 1206-1208.
[7] L. Bretzner, I. Laptev, T. Lindeberg, "Hand Gesture Recognition Using Multi-Scale Colour Features, Hierarchical Models and Particle Filtering", Proc. Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 20-21 May 2002, pp. 405-410.
[8] Y. Ming-Hsuan, N. Ahuja, M. Tabb, "Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 8, pp. 1061-1074, Aug. 2002.
[9] L. Xia, K. Fujimura, "Hand Gesture Recognition Using Depth Data", Proc. Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 17-19 May 2004, pp. 529-534.
[10] CyberGlove from Immersion: http://www.immersion.com
[11] L. R. Rabiner, B. H. Juang, "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, Jan. 1986, pp. 4-15.
[12] Z. Ghahramani, "An Introduction to Hidden Markov Models and Bayesian Networks", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 15, No. 1, pp. 9-42, 2001.
[13] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, 1989.
[14] P. Guttorp, "Stochastic Modeling of Scientific Data", Chapman & Hall, 1995.
[15] W. J. Stewart, "Introduction to the Numerical Solution of Markov Chains", Princeton University Press, 1994.
[16] X. Rong Li, "Probability, Random Signals, and Statistics: A Textgraph with Integrated Software for Electrical and Computer Engineers", CRC Press, 1999.