Eye Activity Detection and Recognition using Morphological Scale-space Decomposition

I. Ravyse, H. Sahli, M.J.T. Reinders†, J. Cornelis

Vrije Universiteit Brussel, Dept. of Electronics and Information Processing (ETRO), Pleinlaan 2, B-1050 Brussels, Belgium
{icravyse, hsahli, jpcornel}@etro.vub.ac.be

† Delft University of Technology, Dept. of Electrical Engineering, P.O. Box 5031, 2600 GA Delft, The Netherlands
[email protected]

Published in: 15th International Conference on Pattern Recognition (ICPR 2000), Vol. 1, pp. 1080-1083, September 3-8, 2000, Barcelona, Spain

Abstract

Automatic recovery of eye gestures from image sequences is one of the important topics for face recognition and model-based coding of videophone sequences. Usually, complicated models of the eye and its motion are used. In this contribution, a method for eye gesture parameter estimation is described. A previously published automatic eye detection/tracking algorithm, based on template matching, is used for the eye pose detection. The eye gesture analysis is realised with a mathematical morphology scale-space approach, forming spatio-temporal curves out of scale measurement statistics. The resulting curves provide a direct measure of the eye gesture, which can then be used as an eye animation parameter. Experimental results demonstrate the efficiency and robustness of the approach.

1. Introduction

Vision-based facial gesture analysis extracts, from an image sequence, spatio-temporal information about the pose and expression of the facial features. The modeling and tracking of faces and expressions is a topic of increasing interest. An overview can be found in [6]. Numerous techniques have been proposed for facial feature detection and facial expression (gesture) analysis, such as template matching [5], principal component analysis [1], interpolated views [7], deformable templates [15], active contours [9], optical flow [4], a physical model coupled with optical flow [8], etc.

Here, we propose a tracking algorithm based on template matching to locate the positions of the eye regions in subsequent frames, coupled to a scale-space approach, namely the sieves, to extract eye gesture parameters directly from the gray-level data. The proposed method resembles the work done by Matthews et al. [11] for lip reading, but differs in the implementation and in the analysis of the spatio-temporal curves for eye gesture parameter estimation. In this paper, the eye gesture is defined as the closing of the eyelids. The direction of gaze is not considered yet. The estimated low-level animation parameter is represented by an Action Unit (AU) [12]. An AU stands for a small visible change in the facial expression. More precisely, AU43 parameterizes the amount of eye closure.

The paper is organized as follows. Section 2 summarizes the approach for the tracking of eye regions. Section 3 describes the sieve decomposition. In Section 4, spatio-temporal curves of granules are introduced, from which gesture parameters are estimated. Section 5 discusses the experimental results obtained with the test sequences Suzie and Miss America. Final conclusions are drawn in Section 6.

2. Tracking eye regions

A previously published eye tracking scheme [14], [13] is used to track the location of the left and right eye regions in image sequences. The scheme is based on template matching, but is kept invariant to orientation, scale, and shape changes by exploiting temporal information. To cope with appearance changes of the eye feature, a codebook of iconic views is automatically constructed during the eye region tracking. For each frame, the position and orientation of the left and right eye region are provided, as well as the normalized eye regions used for the template matching. Figure 1 shows the detected left eye region for the first 9 frames of the Claire sequence. These images will be used to illustrate our algorithm for eye state (gesture) estimation.
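The full tracker is specified in [14]; purely as an illustration of its template-matching core, the following minimal Python sketch locates an eye template in a search window around its previous position using OpenCV's normalized cross-correlation. The codebook of iconic views and the orientation/scale normalization of the published scheme are omitted here, and the function and parameter names (track_eye, search_radius) are ours.

```python
import cv2

def track_eye(frame_gray, template, prev_pos, search_radius=20):
    """Locate an eye template near its previous position via normalized
    cross-correlation; returns the new top-left corner and the match score."""
    x, y = prev_pos
    h, w = template.shape
    # Restrict the search to a window around the previous position,
    # clipped to the frame borders.
    y0, x0 = max(0, y - search_radius), max(0, x - search_radius)
    y1 = min(frame_gray.shape[0], y + h + search_radius)
    x1 = min(frame_gray.shape[1], x + w + search_radius)
    window = frame_gray[y0:y1, x0:x1]
    scores = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    return (x0 + best_loc[0], y0 + best_loc[1]), best_score
```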


3. Sieve Decomposition

Diffusion-based scale-space approaches [10] in image processing allow (i) the simplification of images at multiple scales, by removing fine details, and (ii) the accurate estimation of the scale, position, and identity of the objects in an image, from measures of extrema and edges. However, simplification is not only achieved by diffusion: alternative nonlinear morphological filters have been introduced to this end [2], [3]. Here, we consider the one-dimensional increasing-scale sequence of alternating sequential filters with flat structuring elements, known as sieves. The sieves decompose a one-dimensional bounded function $f$ into a sequence of increasing-scale granule functions $g_s$, $s = 1, \ldots, S$, representing the key features of the original signal. A formal presentation of the decomposition is given as follows:

$$ f_1 = f, \qquad f_{s+1} = \psi_{s+1}(f_s), \qquad g_s = f_s - f_{s+1} \qquad (1) $$

The operator $\psi_s$ can be an opening $\gamma_s$, a closing $\phi_s$, an opening/closing $\gamma_s \phi_s$, or a closing/opening $\phi_s \gamma_s$, for a flat structuring element of size $s$. Note that for the one-dimensional case the opening and closing operators are defined as:

$$ (\gamma_s f)(x) = \max_{\{I \in \mathcal{I}_s : \, x \in I\}} \; \min_{u \in I} f(u), \qquad (\phi_s f)(x) = \min_{\{I \in \mathcal{I}_s : \, x \in I\}} \; \max_{u \in I} f(u) $$

with $\mathcal{I}_s$ the set of intervals of $\mathbb{Z}$ with $s$ elements. The sieves are able to robustly reject noise in the manner of medians, and do so more effectively than diffusion-based schemes. Moreover, the sieves have also been shown experimentally to be fast [2]. As can be seen from Eq. 1 and the definition of the operator $\psi_s$, extrema of a particular scale are removed at each stage of the cascade. The differences between successive $f_s$, called granule functions $g_s = f_s - f_{s+1}$, are locally non-zero for intervals of $s$ samples. The non-zero intervals are called granules. The granules are scale-related primitives to which the edge information is intimately bound [3].
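As a concrete illustration of Eq. 1, here is a minimal Python sketch of the opening variant of the sieve ($\psi_s = \gamma_s$), assuming SciPy's grey_opening as the flat-structuring-element opening; the function name is ours.

```python
import numpy as np
from scipy.ndimage import grey_opening

def opening_sieve(f, max_scale):
    """Opening sieve of Eq. 1: f_1 = f, f_{s+1} = gamma_{s+1}(f_s),
    with granule functions g_s = f_s - f_{s+1}."""
    f = np.asarray(f, dtype=float)
    granules = []
    for s in range(1, max_scale + 1):
        f_next = grey_opening(f, size=s + 1)  # gamma_{s+1}: flat SE of s+1 samples
        granules.append(f - f_next)           # g_s, non-zero on s-sample intervals
        f = f_next
    return granules, f                        # g_1..g_S and the residual signal
```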

Figure 1. Tracked left eye of the Claire sequence

4. Eye gesture parameter estimation

The one-dimensional recursive sieve defined above may be used to decompose a two-dimensional image of an eye, by scanning vertically over each image column, i.e. in the direction in which eye motion occurs when blinking. The resulting granularity spectrum contains the position, amplitude, and scale information of the set of granules that describe the image. By ignoring position and amplitude information, Matthews et al. [11] defined a scale histogram as the number of granules, the granule energies (squared amplitudes), the granule amplitudes, or the granule absolute amplitudes summed across the entire image. This histogram was used, for audio-visual speech recognition, as a feature vector for standard hidden Markov models. Instead of working with the eigenvalue decomposition approach of [11], we suggest another way of combining the scales. Following the same basic idea as Matthews et al. [11], we form, for each scale $s$, a spatio-temporal curve in which the granule amplitudes, summed across the entire eye image, are represented as a function of time (frame).
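Reusing the opening_sieve sketch from Section 3 (under the same assumptions), the per-frame observation could be computed as follows; the names are again ours.

```python
import numpy as np

def scale_sums(eye_image, max_scale=5):
    """One observation per scale for a single eye image: sieve each image
    column vertically (the direction of eyelid motion when blinking) and
    sum the granule amplitudes per scale across the whole image."""
    sums = np.zeros(max_scale)
    for column in np.asarray(eye_image, dtype=float).T:  # vertical scans
        granules, _ = opening_sieve(column, max_scale)
        for s, g in enumerate(granules):
            sums[s] += g.sum()  # opening granules are non-negative
    return sums  # sums[s-1] is the curve value at this frame for scale s
```

Stacking these vectors over the frames of Figure 1 yields the five spatio-temporal curves of Figure 2.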


Figure 2. The granule amplitudes summed across the entire eye image (ordinate) as a function of time, i.e. each of the images of Figure 1 (abscissa), per scale

As can be seen from Figure 2, obtained using the opening operator, the sum of the granule amplitudes found at a given scale changes whenever the eye blinks. The relevant information (the eye gesture) is captured by the way the sum of the granule amplitudes changes, rather than by its specific values. For some scales the sum of the granule amplitudes will increase or decrease over time, with sudden transitions (minima/maxima) corresponding to clear eye state changes. In order to make the spatio-temporal curves (one observation $x_s^t$ at each frame $t$) comparable at each scale $s$, we compute their z-scores.







The z-score of an observation $x^t$ in a sample $x = \{x^1, \ldots, x^n\}$ is given by $z^t = \frac{x^t - \bar{x}}{\sigma_x}$, where $\bar{x}$ and $\sigma_x$ are the mean and the standard deviation of $x$, respectively. Since the mean is sensitive to outliers, replacing it by the median provides more robust z-scores. Let $x_s^t$ be the granule amplitudes summed across image frame $t$, for a given scale $s$. The robust z-score for an observation $x_s^t$ is given by:

$$ z_s^t = \frac{x_s^t - \mathrm{med}(x_s)}{\mathrm{mad}(x_s)} \qquad (2) $$

where $\mathrm{mad}(x_s) = \frac{1}{n} \sum_{t=1}^{n} \left| x_s^t - \mathrm{med}(x_s) \right|$ is the mean absolute deviation and $n$ is the number of images. Figure 3 shows the curve of the robust z-scores for the observations of scale $s = 2$ (Figure 2). Abnormal observations, representing a clear eye gesture change (open or closed), are marked on the curve.

Figure 3. $z_s^t$ obtained using the observations in Figure 2 for scale $s = 2$

The robust z-score for the $t$-th image frame is then defined to be:

$$ z^t = \sum_{s} z_s^t \qquad (3) $$

where $z_s^t$ is the robust z-score corresponding to the observation at scale $s$.

Figure 4. Total z-score

The normalized $z^t$ values, estimated using the observations of Figure 2, are shown in Figure 4. As can be seen, the estimated values follow the eye gesture shown in Figure 1: the lowest value corresponds to an open eye, and the highest value to a closed eye. The $z^t$ scores (Figure 4) can be used as values of AU43 for eye animation (see Figure 10).
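A minimal NumPy sketch of Eqs. 2 and 3 follows, assuming the per-frame observations are stacked into an (n_frames x n_scales) array; the rescaling of $z^t$ to the 0-100 range of Figure 4 is not spelled out in the paper, so a simple min-max normalization is assumed here.

```python
import numpy as np

def gesture_scores(X):
    """Robust z-scores z_s^t (Eq. 2) and per-frame totals z^t (Eq. 3).
    X has shape (n_frames, n_scales): X[t, s] is the granule amplitude
    sum of frame t at scale s+1."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)               # per-scale median over the frames
    mad = np.mean(np.abs(X - med), axis=0)   # mean absolute deviation
    z = (X - med) / mad                      # robust z-scores (Eq. 2)
    z_total = z.sum(axis=1)                  # combine the scales (Eq. 3)
    # Assumed min-max normalization to [0, 100] for use as the AU43 value.
    z_norm = 100.0 * (z_total - z_total.min()) / np.ptp(z_total)
    return z, z_norm
```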

5. Experimental Results

To test the performance of the proposed approach, it has been applied to the Suzie and Miss America sequences, for 150 frames each. The left and right eye positions were estimated using the tracking algorithm, and the eye gesture analysis was performed using the opening sieve decomposition. Figure 5 shows the results of the estimated left eye gesture for Suzie.

Figure 5. Estimated $z^t$ for the Suzie sequence

By labeling the frames where the eye is closed, ground truth is created. This is illustrated for the left eye of the Miss America sequence in Figure 7. The choice of an optimal threshold on the graph of the estimated parameter $z^t$ depends on the miss detections and false alarms of the receiver operating curve (ROC) (Figure 6). The ROC shows the number of miss detections (defined as ‘eye not estimated as closed but labeled as closed’) and of false alarms (defined as ‘eye estimated as closed but not labeled as closed’), plotted as a function of the threshold. The optimal threshold on the $z^t$ graph is situated at the joint minimum of both curves; it is indicated on the $z^t$ graph in Figure 7, and a sketch of this selection is given below.

Figure 6. Receiver operating curves for the Miss America sequence
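The threshold selection can be sketched as follows, assuming per-frame scores from the previous sketch and hand-labeled closed-eye frames; the convention that a high $z^t$ corresponds to a closed eye follows Figure 4, and all names are ours.

```python
import numpy as np

def optimal_threshold(z_norm, closed_labels, thresholds=range(0, 101)):
    """Scan candidate thresholds and return the one at the joint minimum
    of the miss-detection and false-alarm counts (cf. Figure 6)."""
    closed = np.asarray(closed_labels, dtype=bool)
    best_t, best_cost = None, None
    for t in thresholds:
        estimated_closed = np.asarray(z_norm) > t  # high z^t = closed eye
        misses = np.sum(closed & ~estimated_closed)        # labeled, missed
        false_alarms = np.sum(~closed & estimated_closed)  # detected, unlabeled
        cost = misses + false_alarms  # joint minimum of the two ROC curves
        if best_cost is None or cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```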

Figure 7. Closed eye ground truth (double underlined parts), estimated $z^t$, and optimal threshold for the Miss America sequence

For all frames, the decision between a closed and an open eye, as well as the automatic estimation of the AU43 value for face animation, has been carried out successfully by the proposed algorithm.

The tracking algorithm can also be applied to the mouth corners. Images of the mouth are then used to detect the vertical opening, as in the eye images (Figure 8). Although the resulting graph can be considered a first estimate of the mouth state (Figure 9), at the optimal threshold more miss detections and false alarms are found than for the eyes.

This is because a mouth has many more ways of opening, and reflections make the lips look whiter (as in Figure 9(c)).

Figure 8. Estimated $z^t$ for the mouth of the Miss America sequence

Figure 9. (a) Open mouth, (b) closed mouth, and (c) miss-detected closed mouth

Figure 10. Left eye animation results using $z^t$ as values for AU43, related to the sequence given in Figure 1

6. Discussion and conclusions

This paper presented an approach for the automatic estimation of eye gestures. By using a morphological scale-space, information about the edges and regions of the eye is preserved. We have used an approach similar to that of Matthews et al. [11], but proposed some necessary modifications to incorporate additional information about the spatio-temporal nature of an eye gesture. For the estimation of animation parameters, we introduced an observation measure that depends on both scale and time and reflects each eye gesture state more effectively. It is shown that, in a typical video sequence, we can automatically detect the eye regions and estimate the eye gesture states for almost all images. In future work, we will consider the complete deformation of the facial features.

References

[1] P. Antoszczyszyn, J. Hannah, and P. Grant. Automatic fitting and tracking of facial features in head-and-shoulders sequences. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2609–2612, May 1998.
[2] J. Bangham, P. Ling, and R. Harvey. Scale-space from nonlinear filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5):520–528, May 1996.
[3] J. Bangham, P. Chardaire, C. Pye, and P. Ling. Multiscale nonlinear decomposition: The sieve decomposition theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5):529–539, May 1996.
[4] B. Bascle and A. Blake. Separability of pose and expression in facial tracking and animation. In Proceedings 6th IEEE International Conference on Computer Vision, 1998.
[5] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1024–1053, Oct. 1993.

[6] R. Chellappa, C. Wilson, and S. Sirohey. Human and machine recognition of faces: a survey. Proceedings of the IEEE, 83(5):705–740, 1995.
[7] T. Darrell, I. Essa, and A. Pentland. Task-specific gesture analysis in real time using interpolated views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1236–1242, Dec. 1996.
[8] I. Essa and A. Pentland. Facial expression recognition using image motion. In M. Shah and R. Jain, editors, Motion-based Recognition, pages 271–298. Kluwer Academic Publishers, 1997.
[9] S. Lepsøy and S. Curinga. Conversion of articulatory parameters into active shape model coefficients for lip motion representation and synthesis. Signal Processing: Image Communication, 13:209–225, 1998.
[10] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, 1994.
[11] I. Matthews, J. Bangham, R. Harvey, and S. Cox. A comparison of active shape model and scale decomposition based features for visual speech recognition. In Proceedings European Conference on Computer Vision, Lecture Notes in Computer Science, pages 514–528. Springer-Verlag, 1998.
[12] F. Parke and K. Waters. Computer Facial Animation. A K Peters, 1996.
[13] I. Ravyse, M. Reinders, J. Cornelis, and H. Sahli. Eye gesture estimation. In IEEE Benelux Signal Processing Chapter, Signal Processing Symposium (SPS2000), Hilvarenbeek, The Netherlands, page 4, Mar. 2000.
[14] M. Reinders. Eye tracking by template matching using an automatic codebook generation scheme. In Proceedings of the Third Annual Conference of the Advanced School for Computing and Imaging (ASCI), Heijen, The Netherlands, pages 215–221, Jun. 1997.
[15] A. Yuille and P. Hallinan. Deformable templates. In A. Blake and A. Yuille, editors, Active Vision, pages 21–38. MIT Press, 1993.