Real-Time Head Pose Estimation Using Random Regression Forests

Yunqi Tang, Zhenan Sun, and Tieniu Tan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
{yqtang,znsun,tnt}@nlpr.ia.ac.cn

Abstract. Automatic head pose estimation is useful in human-computer interaction and biometric recognition, but it remains a very challenging problem. To achieve robust head pose estimation, this paper proposes a novel method based on depth images. The bilateral symmetry of the face is used to design a discriminative integral slice feature, which is represented as a 3D vector from the geometric center of a slice to the nose tip. Random regression forests are employed to map discriminative integral slice features to continuous head poses, because they can maintain accuracy when a large proportion of the data is missing. Experimental results on the ETH database demonstrate that the proposed method is more accurate than state-of-the-art methods for head pose estimation.

Keywords: Head pose estimation, Random regression forests, Integral slice features.

1 Introduction

Automatic estimation of head pose from range images has become an active topic in recent years, driven by the availability of affordable and reliable depth cameras and by the increasing demands of applications such as face recognition, human-computer interaction and facial feature analysis. Compared with 2D-based methods [2,3,4,5], estimating head pose from depth images has several advantages. Firstly, the pixel values of a depth image have a physical meaning: they indicate the distance from objects to the viewpoint. Secondly, it is easy to detect and segment the head area in depth images. Thirdly, depth images are illumination invariant. Building on these advantages, we proposed a robust method to estimate head pose from depth images in [13], which achieved encouraging performance.

In this paper, a more accurate method is proposed based on our previous work [13]. One more dimension, namely the difference of depth values, is included in the atom vector of Integral Slice Features (ISF) [13], which makes the feature more discriminative for indicating different head poses. Furthermore, a more powerful regressor, random forests [11], is employed to map the discriminative integral slice features to real-valued parameters of head poses. The performance of this method is evaluated on the ETH database [9], and our results are superior to the state of the art.

Z. Sun et al. (Eds.): CCBR 2011, LNCS 7098, pp. 66–73, 2011. © Springer-Verlag Berlin Heidelberg 2011


The rest of this paper is organized as follows. Section 2 reviews existing work on head pose estimation. Section 3 describes the technical details of the proposed head pose estimation method. Section 4 discusses the experimental results on the public database, and Section 5 concludes the paper.

2 Related Work

Head pose estimation has been a hot research topic over the last decade, and there is a large amount of literature on it [1]. Existing methods can generally be classified into three categories: appearance template based methods [9,4], geometry based methods [6,12] and learning based methods [2,3,7,5,8,10].

Appearance template based methods [9,4] directly compare a new head image with a set of exemplars labeled with discrete poses, in order to find the most similar image and take its angle as the estimation result. For instance, the work of [9] first used geometric features to obtain hypothesized nose tips, then compared the new image with a set of template images, and finally took the orientation of the most similar template as the estimation result. Although these methods require neither negative training samples nor facial feature points, they are time consuming and cannot estimate a continuous head pose.

Geometry based methods [6,12] use facial feature points, such as the inner and outer corners of the eyes, the nose tip and the left and right corners of the mouth, to calculate head pose directly. The advantage of these methods is that they need no training process, but their performance depends heavily on the accuracy of facial feature point detection.

Learning based methods [7,5,8,10] use machine learning tools to map the input images or feature data to discrete or continuous head poses. For example, Fanelli et al. [7] synthesized a large head pose database to train a mapping from single depth features to real-valued parameters. However, these methods may suffer from overfitting.

The method in this paper is a trade-off between geometry based and learning based methods. It takes the position of the nose tip as a precondition to design a new efficient feature and uses machine learning tools to estimate the parameters of head poses. The nose tip is one of the most salient points of the face and can be detected easily compared with other feature points. A feature based on the nose tip has a physical meaning and can accurately indicate different head poses. Experimental results on a public dataset show that the proposed method achieves state-of-the-art performance compared with [9,7,13].

3 Head Pose Estimation with Random Regression Forests

In this section, we detail the discriminative integral slice features and describe how random regression forests are used to map depth features to head poses.

3.1 Discriminative Integral Slice Features

A depth image is an image or image channel that contains information about the distance from the surfaces of scene objects to the camera. A head's depth image can therefore be regarded as a 3D surface, and a vertical slice of it can be defined as the set of pixels whose depth values lie in $[l, h]$. We formulate a slice of image $I$ as

$$S_\omega(I) = \{(x, y) \mid l \le d_I(x, y) \le h\} \tag{1}$$

where $S_\omega(I)$ is a vertical slice of depth image $I$, $\omega = (l, h)$ denotes the slicing parameters (low threshold $l$ and high threshold $h$), and $d_I(x, y)$ denotes the depth value of pixel $(x, y)$. The integral center of this slice can then be defined as

$$C_\omega(I) = \left( \frac{\sum_{(x_i, y_i) \in S_\omega(I)} x_i}{N(S_\omega(I))},\ \frac{\sum_{(x_i, y_i) \in S_\omega(I)} y_i}{N(S_\omega(I))} \right) \tag{2}$$

where $C_\omega(I)$ is the coordinate of the integral center of slice $S_\omega$, $N(S_\omega(I))$ denotes the number of pixels within the slice, and $(x_i, y_i)$ is the coordinate of pixel $i$ in depth image $I$. For a given slice $S_\omega$, the feature is defined as a 3-dimensional vector from the geometric center of the slice to a reference point, which can be formulated as

$$F_\omega(I) = \left( F_\omega(I)|_x,\ F_\omega(I)|_y,\ F_\omega(I)|_z \right) \tag{3}$$

where $F_\omega(I)|_x = C_\omega(I)|_x - R_x$ is the x-component, $F_\omega(I)|_y = C_\omega(I)|_y - R_y$ is the y-component, $F_\omega(I)|_z = d_I(C_\omega(I)) - d_I(R)$ is the z-component, and $R$ is the coordinate of the reference point, which can be the nose tip or, optionally, a slice center.

It is well known that the face is bilaterally symmetric about the nose tip. The 3-dimensional vector from the geometric center of a slice of the head surface to the nose tip therefore has a clear physical meaning: $F_\omega(I)|_x$ and $F_\omega(I)|_y$ indicate the roll of a head pose, $F_\omega(I)|_x$ and $F_\omega(I)|_z$ provide yaw information, and $F_\omega(I)|_y$ and $F_\omega(I)|_z$ provide pitch information. Figure 1 illustrates five features across different poses. If somebody turns his or her head to the right, $F_\omega(I)|_x$ has a positive response, and a negative response for a turn to the left; if somebody raises his or her head, $F_\omega(I)|_y$ has a negative response for slices with low thresholds $l$ and $h$, and a positive response when the head is lowered.

3.2 Random Regression Forests

As a bagging based classifier, the random forest has many advantages. It yields good results on common datasets; its performance is comparable to that of AdaBoost while being more robust; and, owing to its parallel structure, it can be implemented efficiently on a GPU. A forest consists of a number of parallel decision trees, each of which can be trained or used for prediction separately. There are two types of nodes within each tree: split nodes and leaf nodes. A split node always corresponds to a binary test, which directs samples towards the left or right child. A leaf node can be regarded as a cluster of samples that can be described by a simple model. During training, the samples falling in a leaf node are averaged to obtain the model's parameters, which are stored in that leaf node. During testing, a new sample goes down from the root of each decision tree until a leaf node is reached. The average model of the reached leaf nodes is then used to predict its result. Figure 2 illustrates how head pose is estimated with a decision tree.

Fig. 1. Examples of discriminative integral slice features across different poses. There are five slices for each head depth image, with slicing parameters ω = (0, 20), (0, 40), (20, 40), (40, 60) and (60, 80). The blue points are the positions of the nose tip. The green points are the geometric centers of the slices. The red arrows are the vectors from the geometric centers of the slices to the nose tips.

Fig. 2. How head pose is estimated with a decision tree. The purple circle represents a leaf node. The blue circle represents a split node. The orange arrow denotes the path of the input image going down from the root to a leaf.

Training. It is usually impossible to describe a large dataset accurately with a single distribution. It is therefore natural to divide the large dataset into dozens of small datasets and to describe each of them with a simple distribution. This is the main idea of decision trees, namely divide-and-conquer. Each split node divides the sample set reaching it into two subsets according to the result of a binary test:


$$F_\omega(I)|_\theta > \tau, \quad \theta \in \{x, y, z\} \tag{4}$$

where $\tau$ is the threshold of the binary test. Each leaf node corresponds to a small set that can be described with a simple distribution. We model the pose of a head as a 3-dimensional variable $P = (\alpha, \beta, \gamma)$, which follows a multivariate Gaussian distribution at each leaf node. We build a decision tree recursively, starting from the root node, with the following steps. Firstly, randomly select a subset of training samples, each composed of discriminative integral slice features and the annotated head pose. Secondly, select a subset of features by randomly generating a set of binary tests $T$, each test being a triple $t = (\omega, \theta, \tau)$ according to Equation (4). Thirdly, for the samples $S$ reaching the node, if the number of these samples is larger than a fixed threshold, split the samples into left and right subsets ($S_L$ and $S_R$) using each test $t$ in $T$:

$$S_L = \{ I \mid F_\omega(I)|_\theta < \tau,\ t(\omega, \theta, \tau) \in T,\ I \in S \}, \quad \theta \in \{x, y, z\} \tag{5}$$

$$S_R = \{ I \mid F_\omega(I)|_\theta \ge \tau,\ t(\omega, \theta, \tau) \in T,\ I \in S \}, \quad \theta \in \{x, y, z\} \tag{6}$$

and compute its information gain, which is defined as

$$IG(t) = H(S) - \left( w_L H(S_L) + w_R H(S_R) \right) \tag{7}$$

where $H$ denotes differential entropy, $w_L$ is the ratio between the number of samples in $S_L$ and $S$, and $w_R$ is the ratio between the number of samples in $S_R$ and $S$. Fourthly, find the test $t^*$ with the largest information gain:

$$t^* = \arg\max_{t \in T} IG(t) \tag{8}$$
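The split selection of Eqs. (5)-(8) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the features of a sample are flattened into one list, so a test is written as a hypothetical pair `(feature_index, tau)` standing for $(\omega, \theta, \tau)$; the differential entropy is that of a fitted 3D Gaussian, $H = \frac{1}{2}\log\left((2\pi e)^3 |\Sigma|\right)$; and a small ridge `eps` is added to the covariance diagonal for numerical safety. All function names are illustrative.

```python
import math


def covariance3(poses):
    """Population covariance (3x3) of a list of (alpha, beta, gamma) poses."""
    n = len(poses)
    mean = [sum(p[k] for p in poses) / n for k in range(3)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in poses) / n
             for b in range(3)] for a in range(3)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def gaussian_entropy(poses, eps=1e-6):
    """Differential entropy of a 3D Gaussian fitted to the poses."""
    cov = covariance3(poses)
    for k in range(3):
        cov[k][k] += eps  # assumption: small ridge keeps the determinant positive
    det = max(det3(cov), 1e-300)  # guard against floating-point round-off
    return 0.5 * math.log((2 * math.pi * math.e) ** 3 * det)


def information_gain(samples, test):
    """Eq. (7) for one test; samples is a list of (features, pose) pairs."""
    idx, tau = test
    left = [pose for feat, pose in samples if feat[idx] < tau]    # Eq. (5)
    right = [pose for feat, pose in samples if feat[idx] >= tau]  # Eq. (6)
    if not left or not right:
        return float("-inf")  # assumption: a test that separates nothing is rejected
    poses = [pose for _, pose in samples]
    w_left = len(left) / len(samples)
    w_right = len(right) / len(samples)
    return gaussian_entropy(poses) - (w_left * gaussian_entropy(left)
                                      + w_right * gaussian_entropy(right))


def best_test(samples, tests):
    """Eq. (8): the candidate test with the largest information gain."""
    return max(tests, key=lambda t: information_gain(samples, t))
```

A test whose threshold separates two pose clusters reduces the entropy of both children and therefore scores a higher gain than a test on an uninformative feature.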

If $IG(t^*)$ exceeds a fixed threshold, keep the node as a split node and assign the parameters of the corresponding test $t^*$ to it; otherwise, keep it as a leaf node and store the mean and covariance of $P$ in it. Finally, if the depth of the tree is below a maximum value, go to the third step to build the left and right children recursively.

Testing. Given a new depth image of a head, discriminative integral slice features are extracted and sent to the trained forest. In each tree, the new sample goes down from the root and is recursively directed to the left or right child according to the binary tests stored in the split nodes, until a leaf node is reached. The distributions of all reached leaf nodes are averaged to generate the final result:

$$p(d \mid l) = \frac{1}{N} \sum_{i=1}^{N} p_i(d \mid l) \tag{9}$$

where $N$ denotes the number of trees within a forest, and $p_i(d \mid l)$ is the distribution of the leaf node reached by the given sample in the $i$-th tree.
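The testing procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each tree is a hypothetical nested dictionary, a test is a `(feature_index, tau)` pair with samples going left when the feature is below `tau` (as in Eqs. (5) and (6)), and only the leaf means are averaged, which gives the mean of the equally weighted mixture in Eq. (9).

```python
def descend(node, features):
    """Follow the binary tests from the root until a leaf is reached."""
    while "leaf_mean" not in node:
        idx, tau = node["test"]
        node = node["left"] if features[idx] < tau else node["right"]
    return node["leaf_mean"]


def predict_pose(forest, features):
    """Average the leaf means reached in every tree of the forest."""
    leaf_means = [descend(tree, features) for tree in forest]
    n = len(leaf_means)
    return tuple(sum(mean[k] for mean in leaf_means) / n for k in range(3))


# Hypothetical two-tree forest; the stored poses are made-up (alpha, beta, gamma) values.
tree_a = {"test": (0, 0.0),
          "left": {"leaf_mean": (-30.0, 0.0, 0.0)},
          "right": {"leaf_mean": (30.0, 0.0, 0.0)}}
tree_b = {"test": (1, 5.0),
          "left": {"leaf_mean": (20.0, 10.0, 0.0)},
          "right": {"leaf_mean": (40.0, -10.0, 2.0)}}
estimate = predict_pose([tree_a, tree_b], [3.0, 7.0])
```

Because each tree is traversed independently, the loop over trees is the part that could be run in parallel.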

4 Experiments

To evaluate the performance of the proposed method, we conducted an experiment on the ETH Face Pose Range Image Data Set [9] and compared the results with the state-of-the-art methods [9,7,13], whose results are also reported on the ETH data set. ETH is a large public database with about 10k range images covering almost all possible head rotations, including yaw from −90° to +90° and pitch from −45° to +45°. This paper focuses on the problem of head pose estimation and thus assumes that the head and the nose tip have been detected in the image. Noise in the depth images is removed with the method of [13].

Our method is mainly controlled by three parameters: $l$, $h$ and the number of trees $n$. The thresholds $l$ and $h$ determine the feature slices: each pair of $l$ and $h$ yields a slice, which corresponds to one feature of the image. For a depth image, there would be $\sum_{i=0}^{255} (255 - i) = 32640$ slices at most. In the present setup, we set $l$ and $d = h - l$ as increasing arithmetic sequences on $[0, 240]$ with a common difference of $\Delta d = 10$. Thus there are $3 \times \sum_{i=0}^{24} (24 - i) = 900$ features for each image.
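The feature construction of Eqs. (1)-(3) on this slicing grid can be sketched as below. Several choices are assumptions made for illustration and are not stated in the paper: the depth image is a list of rows with values in 0..255, background pixels carry a value outside the slicing range (255 here), the grid is read as all pairs $l < h$ on the levels 0, 10, ..., 240 (which reproduces the count of 300 slices), the depth at a fractional slice center is read at the nearest pixel, and an empty slice contributes zeros. The function names are hypothetical.

```python
def slice_grid(max_level=240, step=10):
    """All slicing parameters (l, h) with l < h on the levels 0, step, ..., max_level."""
    levels = range(0, max_level + 1, step)
    return [(l, h) for l in levels for h in levels if l < h]


def integral_slice_feature(depth, ref, l, h):
    """Feature of one slice; ref = (x, y) is the reference point, e.g. the nose tip."""
    rx, ry = ref
    # Eq. (1): pixels whose depth value lies in [l, h]
    pixels = [(x, y) for y, row in enumerate(depth)
              for x, d in enumerate(row) if l <= d <= h]
    if not pixels:
        return (0.0, 0.0, 0.0)  # assumption: empty slices contribute zeros
    # Eq. (2): integral center of the slice
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    # assumption: depth at the fractional center is read at the nearest pixel
    dz = depth[round(cy)][round(cx)] - depth[ry][rx]
    # Eq. (3): x, y and depth differences relative to the reference point
    return (cx - rx, cy - ry, float(dz))


def feature_vector(depth, ref):
    """Concatenate the three components of every slice on the grid."""
    features = []
    for l, h in slice_grid():
        features.extend(integral_slice_feature(depth, ref, l, h))
    return features


# Toy 3x4 depth image: nose tip at (x=1, y=1) with depth 0, background set to 255.
depth = [[255, 255, 255, 255],
         [255,   0,  15,  15],
         [255, 255, 255, 255]]
feature = integral_slice_feature(depth, (1, 1), 0, 20)
```

With the default grid, `len(slice_grid())` is 300 and `feature_vector` returns 900 values, while `slice_grid(255, 1)` enumerates the 32640 slices of the exhaustive case.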

[Figure: two plots of accuracy (%) against the number of trees (5 to 30), with legend entries Nosetip and SliceCenter.]

Trees (angle error