Robustification of detection and tracking of faces

Karl Schwerdt, James L. Crowley, Jean-Baptiste Durand

Project PRIMA, Lab. GRAVIR - IMAG, INRIA Rhône-Alpes
655, ave. de l'Europe, 38330 Montbonnot St. Martin, France

Abstract. Although computing power and transmission bandwidth have both been steadily increasing over the last few years, bandwidth rather than processing power remains the primary bottleneck for many complex multimedia applications involving communication. Current video coding algorithms use, for instance, intelligent encoding to yield higher compression ratios at the cost of additional computing requirements for encoding and decoding. The use of techniques from the fields of computer vision and robotics, such as object recognition, scene interpretation, and tracking, can further improve compression ratios as well as provide additional information about the video sequence being transmitted. This paper discusses current developments on a Multi-modal Face Tracking System (MFTS), which was introduced in [1]. This face detection and tracking system is currently being used for the Video Workbench [2] and a mediaspace [3], and serves as a basis for our research in detecting facial expressions and gestures [4]. Our latest improvements targeted two major goals: a) improving robustness and b) improving precision, without compromising the performance of MFTS.

Key words: Face tracking, computer vision, human-machine interaction

1 Introduction

High-performance computers are available today at reasonable price levels, which enables the use of complex algorithms at the human-machine interface in order to make interaction more flexible, more intelligent, more ergonomic, more intuitive, and more efficient. In other words, we are able to use computing power to enhance the usability of technical devices. This goal can be achieved by making computers, or similar devices with a calculation unit, "aware" of what is happening around them. This marks a change from previous stages of research and development, since the socioeconomic context of how we use any kind of computing device is implicitly being redefined. Applications for research in this context include means of computer input and output, electronic communication, information retrieval systems, devices integrating digital TV and the internet, security and surveillance systems, and support devices in educational environments like university classrooms (remote teaching).

Despite exponential growth in computing power and transmission bandwidths, the data to be handled will by far exceed any available storage or transmission capacity if left untreated (uncoded, uncompressed). Effective data administration and handling and efficient treatment of any kind of data will therefore remain a focus of research efforts.

All of this serves as the framework for research in our group. Our research revolves around detection, recognition, interpretation, and tracking of objects and human beings, as a whole and of important body parts like hands and faces. Once a human or object is detected and recognized, the information gained is fed into applications like a video communication system (video electronic mail, video telephony, video conferencing), a mediaspace (virtual office), or virtual working tools (Magic board). We make use of mathematical (eigenspace), statistical (correlation, histograms, Kalman filter), artificial intelligence (confidence factors for decision making), and fuzzy logic (genetic programming) methods and algorithms.

The Multi-modal Face detection and Tracking System (MFTS) was originally developed in the robotics area to normalize a video sequence to centered images of a face. Face tracking allowed us to implement a compression scheme based on Principal Component Analysis (PCA), which is called Orthonormal Basis Coding (OBC) and implemented

in the Video Workbench [2]. The Video Workbench uses MFTS for bit allocation of a standard video codec following ITU-T recommendation H.263 [5]. We designed and implemented the face tracker and video codecs entirely in software. Moreover, MFTS is being used for a mediaspace [3], and serves as a basis for our research in recognition of facial expressions and gestures [4].

Although MFTS reliably detects and follows a face in real time, we found it not precise enough for appearance-based coding and interpretation. Appearance-based coding and interpretation, i.e., treatment of image signals by orthonormal transformations, requires exact knowledge of the position of salient features like the eyes and the mouth. MFTS is built with a modular structure and is robust and fast. The robustness comes from the parallel use of partially redundant modules. Our latest improvements targeted two major goals: a) improving robustness and b) improving precision, without compromising the performance of MFTS.

We present in this paper our latest developments on MFTS and discuss the benefits and problems that arose. Since everything described in this paper is part of ongoing research, we do not give final results and conclusions. The face tracker as described here is still being tested and adjusted. Section 2 gives an overview of our new approach, and section 3 summarizes and discusses the results so far.

2 Approach

The modular structure of MFTS gave us the possibility to individually add and remove modules and study their impact and performance separately. Basically, MFTS automatically detects a face, keeps track of its position, and steers a camera to keep the face in the center of the image and at approximately the same size. This offers a user the possibility to move freely in front of the camera while the video image will always be normalized to his face. While being reliable and recovering quickly from failures, MFTS in its former version turned out to lack the precision needed for an eigenspace coding or interpretation system. For example, "jumping" background consumed a non-negligible number of basis dimensions.

The techniques used for the modules of the face tracker have already been documented in [1], so they do not have to be discussed here in every detail. The most important change over [1] is that MFTS no longer maintains a face box. Both eyes and the mouth are tracked instead. Figure 1 contains the block diagram of MFTS. The face contour module is used to generate a region of interest, so that the eyes and mouth modules do not have to search the entire image for the eyes and mouth. In addition, it serves as a corrective for the results of these modules.

Fig. 1.: Block diagram of the face tracker (MFTS). The image acquisition unit delivers grayscale images to the eyes module (blink detection, luminance extraction, correlation for the right and left eye) and images in RGB format to the mouth module (correlation, color histogram) and the face contour module (color histogram); a Kalman filter combines the module outputs into a position vector that is passed to the video codec.

MFTS has six modules: Eye Blink Detection, (Normed) Cross-Correlation for tracking the left and right eye, mouth detection by Color Histogram, (Normed) Cross-Correlation for mouth tracking, and face contour detection by Color Histogram. We maintain a rectangle for each eye and for the mouth. The face tracker modules partially overlap in their functions and are thus partially redundant; together they can calculate the size, center, and orientation of the face. This redundancy guarantees that the face tracker is robust.

In order to either interpret head movements by a trajectory in its eigenspace, or to efficiently encode such a sequence by OBC, we need to know the position and orientation of a face or head to the pixel. Tracking a face by a face box yielded jitter from frame to frame of about 1–3 pixels, even when filtered through a Kalman filter. A more restrictive Kalman filter can eliminate those jumps, at the price of losing track of the exact size of the face. We hope to finally avoid this dilemma by tracking objects smaller than the entire face, namely the eyes and the mouth.

2.1 Confidence Factors

While all modules return a result, usually in the form of a position vector, those results have to be qualified in order to react properly to what is going on in front of the camera. A confidence factor is computed from the modules' output in order to estimate the probability of a successful detection. A confidence factor is represented by a numerical value between 0 (failure or no confidence) and 1 (hit or certainty). Given an observed N-dimensional vector Y, and assuming a Gaussian density function, the confidence factor CF of the k-th frame is calculated as

\[ CF[k] = \exp\left( -\tfrac{1}{2}\, [Y - s]^T\, C_s^{-1}\, [Y - s] \right), \]

where C_s is the covariance and s the mean of successful detections. An empirically determined threshold is then applied in order to decide whether the position detected by a particular module is good. In former versions of MFTS, C_s and s were pre-calibrated by measuring a sample set of correct detections. This approach, and especially the pre-calibrated covariance matrix, proved too restrictive at initial detections. Starting with a "neutral" covariance matrix, i.e., the identity matrix, and a pre-estimated mean, and recursively updating both after each successful detection proved more reliable.

Figure 2 shows the flow chart of MFTS. Control flow is more complicated than in [1] and [2] and has some particularities, for instance a delay loop (LOOP 1) and two concurrently performed loops (LOOP 2 and LOOP 3).
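As an illustration of this recursive calibration, the following is a minimal sketch in Python (not the authors' implementation; the class name and the running-average update rule are assumptions, since the paper does not spell out the exact update):

```python
import numpy as np

class ConfidenceModel:
    """Sketch of the confidence factor of section 2.1: a Gaussian score
    around the mean of successful detections, with mean and covariance
    updated recursively after each success."""

    def __init__(self, initial_mean):
        self.s = np.asarray(initial_mean, dtype=float)  # pre-estimated mean
        self.C = np.eye(len(self.s))                    # "neutral" covariance: identity
        self.n = 1                                      # number of successful detections

    def confidence(self, Y):
        """CF[k] = exp(-1/2 (Y - s)^T C_s^{-1} (Y - s))."""
        d = np.asarray(Y, dtype=float) - self.s
        return float(np.exp(-0.5 * d @ np.linalg.inv(self.C) @ d))

    def update(self, Y):
        """Fold a successful detection into mean and covariance
        (Welford-style running update; assumed, not from the paper)."""
        Y = np.asarray(Y, dtype=float)
        self.n += 1
        d = Y - self.s
        self.s = self.s + d / self.n
        self.C = self.C + (np.outer(d, Y - self.s) - self.C) / self.n

# Usage: accept a detection if its confidence exceeds an empirical threshold.
model = ConfidenceModel(initial_mean=[160.0, 120.0])  # e.g. an expected eye position
Y = [158.0, 124.0]                                    # observed position vector
if model.confidence(Y) > 0.5:                         # threshold chosen empirically
    model.update(Y)
```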

Fig. 2.: Flow chart of the face tracker. The tracking loop begins with blink detection (BDT); confidence factors are compared against thresholds T1 through T5 to decide when to initialize the histogram and correlation modules, when to wait in the delay loop (LOOP 1), and when to run eye tracking (LOOP 2) and mouth tracking (LOOP 3) concurrently. Abbreviations: BDT: Blink Detection; CF: Confidence Factor; COR: Correlation; HST: (Color) Histogram; MDT: Mouth Detection; MCOR: Mouth-Tracking by COR; ECOR: Eye-Tracking by COR; MHST: Mouth-Tracking by HST; EHST: Eye-Tracking by HST.

2.2 Eyes Module

The eyes module has been extended from blink detection alone to three modules: the detection of eye blinking and the tracking of each of the two eyes by NCC.

Blink Detection is used for quick initialization or recovery (re-initialization) of the face tracker. While the appearance of different people can vary greatly (skin color, beard, hair, etc.), they all have to blink to keep their eyes moist. That is, we can be sure to detect eyes by taking the difference between two images with a time lag of approximately 200 msec (see the first sketch below). This has proven to give reliable results.

A good example of improving reliability by updating the confidence factor is the blink detection module, where the covariance matrix is used to calculate a confidence factor that identifies the two rectangles framing the eyes among a possibly large number of candidate rectangles. This covariance matrix used to be measured in advance with some arbitrary examples and was never updated after a successful detection. In consequence, at the initial detection (at program start time), the admitted variance in eye position and size was too low, whereas at subsequent detections, this variance was not restrictive enough. By starting with a neutral covariance matrix (i.e., the identity matrix) and updating it at each successful detection, initial detection could be performed more easily and more reliably (each failure delays the start of the face tracker by about 200 msec). Subsequent cycles of the eye blink detection module could then use a priori information about eye position and size accumulated since the face tracker was started.

After a successful detection of the eyes, the correlation module has to wait until it can be sure the eyes have re-opened; otherwise it would break immediately once the eyes reopen. Therefore, a loop that guarantees such a delay has been added. It is shown as LOOP 1 in Figure 2. After proper initialization by the blink detection module, LOOP 2 and LOOP 3 run concurrently, tracking the eyes and the mouth.

2.3 Tracking by Normalized Cross-Correlation (NCC)

A common method for object tracking is to cross-correlate certain areas of successive images. However, this can turn out to be costly if 1) the areas to be correlated are relatively large and/or 2) the area to be searched is relatively large. It also cannot follow 3D motion very well [6]. Hence, it appears natural to use correlation only for the tracking of small movements. Our correlation algorithm needs to be initialized with a correlation mask, and this correlation mask should represent a characteristic area of the image. We limit the area to be searched to a so-called "region of interest", which is roughly identical to a face box detected by the face contour module (see next subsection). A (normalized) cross-correlation algorithm (NCC) finds a best match between the correlation mask and an area of the same size in the region of interest.

Originally, a region between the eyes of about 20 × 20 pixels containing parts of the eyebrows was chosen as the correlation mask. Choosing a region between the eyes of a detected face gave a distinctive area, but gave away the possibility of tracking specific facial features like the eyes or the mouth. The speed of the NCC module allows us to run three correlation modules at each cycle. For precise and close tracking it is desirable to follow particular face areas like the eyes and the mouth. At the same time, this gives us the knowledge we need to recognize head movements and facial expressions.
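To make the frame-differencing idea of section 2.2 concrete, here is a minimal sketch under stated assumptions (grayscale frames roughly 200 msec apart, OpenCV for thresholding and contour extraction; the function name and parameter values are hypothetical, not MFTS code):

```python
import cv2

def blink_candidates(frame_prev, frame_curr, diff_thresh=25, min_area=20):
    """Difference two grayscale frames taken about 200 msec apart and
    return bounding rectangles of the regions that changed, which
    should include the (blinking) eyes."""
    diff = cv2.absdiff(frame_curr, frame_prev)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # The two rectangles framing the eyes are then selected among these
    # candidates via the confidence factor of section 2.1.
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```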
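Similarly, a minimal sketch of NCC tracking within a region of interest, assuming OpenCV's normalized cross-correlation matcher (the mask size and search window are illustrative, not the authors' exact values):

```python
import cv2

def track_by_ncc(frame_gray, mask, roi):
    """Find the best match for the correlation mask `mask` (e.g. a
    20 x 20 patch around an eye) inside the region of interest
    `roi` = (x, y, w, h) by normalized cross-correlation."""
    x, y, w, h = roi
    search = frame_gray[y:y + h, x:x + w]
    scores = cv2.matchTemplate(search, mask, cv2.TM_CCORR_NORMED)
    _, best, _, (bx, by) = cv2.minMaxLoc(scores)
    # Best-match position in full-image coordinates, plus a score
    # that can feed the confidence factor of section 2.1.
    return (x + bx, y + by), best
```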
2.4 Face Contour Module

Another way to detect or follow objects within an image is by detecting their color. The color of human skin is usually well distinguishable in an environment like an office. It has been shown [7] that the technique of color histograms is stable and reliable if the color components of a pixel are normalized by the pixel luminance. The color histogram module searches the entire image and returns the biggest area of skin. The quality of the color histogram algorithm depends on its initialization. Once properly initialized, it can recover the position of a face quickly and reliably. The face contour module is initialized by the blink detection module and in turn initializes the color histogram of the mouth detection module.

2.5 Mouth Module

It is desirable to be able to detect and track the mouth of a person communicating, since, combined with eye tracking, it gives valuable information about the size, position, and orientation of a talking head. The major problem when trying to detect and follow a mouth is that it permanently changes its shape (which is quite normal when talking).

At frame processing rates above 15 Hz, mouth tracking by correlation seems to be possible. However, the quick changes in the shape of the mouth lead us to expect the NCC module to break relatively often, particularly if the system frequency drops significantly below 10 Hz. That is, for reliable mouth tracking we need to recover the mouth tracking module quickly. A way to accomplish this has proven to be a lip detection algorithm using the red color component of the image.

Mouth (Lip) Detection by Color Histogram is one of the new features of the current MFTS. Figure 3 shows the flow chart of the mouth detection process by color histogram. While the face contour module uses a two-dimensional color histogram (of normalized red and green color values) to detect a face, the mouth detection module only uses the red component, since the red component is likely to be the discriminating value between skin and lip color.
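As background for both modules, here is a minimal sketch of skin detection by a normalized-color histogram in the spirit of section 2.4 and [7]; the histogram resolution and the use of simple backprojection are assumptions, not the authors' exact implementation:

```python
import numpy as np

def normalized_rg(image_rgb):
    """Chromaticity coordinates: normalize the color components by the
    pixel luminance so the histogram is stable under lighting changes."""
    rgb = image_rgb.astype(np.float64)
    lum = rgb.sum(axis=2) + 1e-6
    return rgb[..., 0] / lum, rgb[..., 1] / lum     # r, g in [0, 1]

def skin_histogram(face_patch_rgb, bins=32):
    """Learn a two-dimensional histogram of normalized (r, g) values
    from a known face region, e.g. one delivered by blink detection."""
    r, g = normalized_rg(face_patch_rgb)
    hist, _, _ = np.histogram2d(r.ravel(), g.ravel(),
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.max()

def backproject(image_rgb, hist, bins=32):
    """Probability-of-skin image: look up each pixel's (r, g) bin in the
    histogram; the biggest connected high-probability area is the face."""
    r, g = normalized_rg(image_rgb)
    ri = np.clip((r * bins).astype(int), 0, bins - 1)
    gi = np.clip((g * bins).astype(int), 0, bins - 1)
    return hist[ri, gi]
```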

Fig. 3.: Flow chart of the MDT modules. The histogram is initialized after eye detection by computing the (normalized) average mean µr of the red components. Each MDT cycle then applies the histogram to the image, extracts a region (iZone) supposed to contain the mouth, sets the B and G values in iZone to 0 and R = C · max(0, r − µr), eliminates pixels with R + G + B < T (too dark to be mouth), thresholds, and takes the biggest connected area.
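A minimal sketch of this pipeline, with illustrative values for the constant C and the darkness threshold, and OpenCV's connected-components routine standing in for whichever labeling method MFTS actually uses:

```python
import cv2
import numpy as np

def detect_mouth(izone_rgb, mu_r, C=4.0, dark_thresh=60):
    """Follow the Fig. 3 flow: emphasize red relative to the skin mean
    mu_r inside the candidate region iZone, drop pixels too dark to be
    mouth, threshold, and keep the biggest connected area."""
    rgb = izone_rgb.astype(np.float64)
    lum = rgb.sum(axis=2) + 1e-6
    r = rgb[..., 0] / lum                        # normalized red component
    R = np.clip(C * np.maximum(0.0, r - mu_r) * 255.0, 0, 255)
    R[rgb.sum(axis=2) < dark_thresh] = 0         # eliminate pixels too dark
    mask = (R > 128).astype(np.uint8)            # threshold
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:                                    # label 0 is the background
        return None
    biggest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    x, y, w, h = stats[biggest, :4]              # bounding rectangle of the mouth
    return (x, y, w, h)
```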

Mouth-Tracking by NCC is straightforward and happens in the same manner as for eye tracking. Since the area to be searched is limited and the appearance of a mouth changes during a conversation, the detection rate of this module was rather low.

2.6 Kalman Filtering

All six modules above output a position vector, whose values are generated in different manners. A zeroth order Kalman filter was therefore added in order to smooth out jitter and to keep alterations of the size of the rectangles for the eyes and mouth smooth. The Kalman filter was the first module with a permanently updated covariance matrix. The results were promising, but the resulting face box was still not precise enough to serve as input for eigenspace coding (see above). The position and size of the detected face box were not sufficient to follow a change in the relative position of eyes and mouth as it happens, for instance, when a head turns. However, in order to precisely track and identify those head movements and facial expressions, the relative positions of the eyes and mouth with respect to the face box have to be known at any time. More dimensions were therefore added to the input vector to keep track of the positions of the eyes and mouth. A constant uncertainty matrix, which is actually a diagonal matrix, permits adjustment of the variance and adaptation speed of the data to be filtered. It keeps the covariance matrix of the observed data from becoming too restrictive about the size and position of the face.
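For concreteness, a minimal sketch of a zeroth order (constant position) Kalman filter with a constant diagonal uncertainty matrix, as described above; the noise values and the stacking of eye and mouth positions into one state vector are illustrative assumptions:

```python
import numpy as np

class ZerothOrderKalman:
    """Zeroth order Kalman filter: the state is the position vector
    itself (no velocity term). A constant diagonal uncertainty matrix Q
    is added at each step so the state covariance never becomes too
    restrictive about the size and position of the face."""

    def __init__(self, x0, q=2.0, r=4.0):
        self.x = np.asarray(x0, dtype=float)   # filtered position vector
        self.P = np.eye(len(self.x))           # state covariance
        self.Q = q * np.eye(len(self.x))       # constant uncertainty matrix
        self.R = r * np.eye(len(self.x))       # measurement noise

    def update(self, z):
        """Predict (position assumed constant, covariance grows by Q),
        then correct with the measured position vector z."""
        self.P = self.P + self.Q
        K = self.P @ np.linalg.inv(self.P + self.R)      # Kalman gain
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.x)
        self.P = (np.eye(len(self.x)) - K) @ self.P
        return self.x

# Usage: filter the stacked (x, y) positions of both eyes and the mouth.
kf = ZerothOrderKalman(x0=[120, 80, 200, 80, 160, 150])
smoothed = kf.update([122, 79, 203, 81, 158, 152])
```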

3 Results and Discussion

Three criteria are important for the performance evaluation of MFTS: 1) robustness, measured in failure and recovery rate, 2) precision, and 3) speed, measured in Hz. A metric to measure tracking precision is yet to be developed. 3D movements like a head nodding or turning are particularly difficult to determine precisely.

The operating conditions of MFTS in its current version differ too much from the version described in [1] and [2] to give a final estimate of its performance at this point. The concurrent use of several modules might impose speed problems, even though the operating area of each module has been reduced. The MFTS of [2] used, most of the time, one module per cycle, and even then its computational cost was not negligible for the overall system performance.

The new face tracking system, as discussed in this paper, is currently in its evaluation phase and is run solely on pre-recorded video streams. That is, its operating conditions are not yet comparable to those of earlier versions of the face tracker. Figure 4 shows a screen dump of the Video Workbench running the new MFTS.

Fig. 4.: Screen dump of Video Workbench, using the new MFTS

4 Conclusions

Complementary visual processes have been integrated to build a new MFTS. The primary focus was a decisive improvement of precision. MFTS in its new form is intended to feed a video stream normalized to a face into a system transforming this video stream into an orthonormal basis space. High precision is needed to avoid losing basis dimensions for motion compensation. Applications for this kind of research are video communication and gesture recognition. Further improvements target, e.g., asynchronous camera control.

References

1. J. L. Crowley, F. Bérard, and J. Coutaz, "Multi-modal tracking of faces for video communications," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 640–645, June 1997.
2. K. Schwerdt and J. Crowley, "Contributions of computer vision to the coding of video sequences," in Symposium on Intelligent Robotics Systems, Edinburgh, United Kingdom, July 1998.
3. J. Coutaz, F. Bérard, E. Carraux, and J. L. Crowley, "Early experience with the mediaspace CoMedi," in Journal of Cognitive Neuroscience, Heraklion, Greece, September 1998.
4. J. Martin, D. Hall, and J. Crowley, "Statistical recognition of parameter trajectories for hand gestures and face expressions," in Workshop on Perception of Human Actions, Freiburg, Germany, June 6, 1998.
5. ITU-T Study Group XV, "Recommendation H.263: Video coding for low bit rate communication," tech. rep., ITU-T, Geneva, Switzerland, http://www.itu.ch, 1996.
6. A. Elefteriadis and A. Jacquin, "Automatic face location detection and tracking for model-assisted coding of video teleconferencing sequences at low bit-rates," Signal Processing: Image Communication, vol. 7, pp. 231–248, 1995.
7. B. Schiele and A. Waibel, "Gaze tracking based on face color," in International Workshop on Automatic Face- and Gesture-Recognition, June 1995.
