Depth Perception Model Based on Fixational Eye Movements Using Bayesian Statistical Inference

Norio Tagawa
Faculty of System Design, Tokyo Metropolitan University, Tokyo, Japan
[email protected]

2010 International Conference on Pattern Recognition

Abstract—The small vibrations of the eyeball that occur when we fix our gaze on an object are called “fixational eye movements.” It has been reported that these involuntary eye movements also contribute to monocular depth perception. In this study, we focus on “tremor,” the smallest type of fixational eye movement, and construct a depth perception model based on tremor using the MAP-EM algorithm. Its effectiveness is confirmed through numerical evaluations using artificial images.

Keywords-fixational eye movements; depth perception; structure from motion; Bayesian estimation

I. INTRODUCTION

Figure 1. Illustration of fixational eye movements, including microsaccade, drift, and tremor.

It is well known that fixational eye movements, irregular involuntary motions of the eyeball, arise when a human gazes at a fixed target [1]. Because the retina can maintain its receptive sensitivity only when the images of targets vibrate finely upon it, fixational eye movements are required in the first place simply to see anything. As an additional function, it has been reported that fixational eye movements provide cues for depth perception, even though the image motion they cause on the retina is not consciously perceived, and an actual vision system based on fixational eye movements has been proposed [2]. On the other hand, as a form of monocular stereopsis, “structure from motion (SFM)” has been a central subject, and many notable results have been reported. Although there are various computational principles for SFM, the gradient method is effective when computationally efficient and spatially dense depth recovery is important [3], [4], [5]. For the gradient method, it has to be noted that there is an appropriate image motion size for recovering accurate depth. Since the gradient equation holds exactly only for infinitesimal motions, the equation error cannot be ignored for very large motions. Conversely, for small motions, the motion information is buried in the observation errors of the spatio-temporal derivatives of brightness. To make the motion size suitable, adaptive adjustment of the frame rate is required. We have previously proposed a method that needs no variable frame rate, based on multi-resolution decomposition of images [6], but it incurs a high computational cost. Here we pay attention to small motions so as to avoid the equation error in the gradient method. To solve the above-mentioned S/N problem caused by small motions, we can use many observations collectively. For such a strategy, the motion direction and motion size have to take various values.

From the above discussion, in this study we examine a depth perception model based on fixational eye movements. Fixational eye movements are classified into three types, as shown in Fig. 1: microsaccade, drift, and tremor. Here, as a first report of our attempt, we focus on tremor, the smallest of the three. In a subsequent step, we plan to use the analogy of drift and/or microsaccade for further improvement.

II. GRADIENT METHOD FOR TREMOR

A. Motion model of eyeball

We use perspective projection as our camera-imaging model. The camera is fixed with an (X, Y, Z) coordinate system, where the viewpoint, i.e., the lens center, is at the origin O and the optical axis lies along the Z-axis. The projection plane, i.e., the image plane, Z = 1 can be used without loss of generality, which means that the focal length equals 1. A space point (X, Y, Z) on the object is projected to the image point (x, y).

We introduce a motion model representing fixational eye movements. We place the eyeball's rotation center at a distance Z_0 behind the lens center along the optical axis, and we assume that there is no explicit translation of the eyeball. However, since the rotation center and the coordinate origin differ, an eyeball rotation r induces a translational vector u with respect to the coordinate origin, formulated as follows:

u = r \times [0, 0, Z_0]^\top = Z_0 [r_y, -r_x, 0]^\top. \quad (1)

From this equation, it can be confirmed that the rotation r_z around the Z-axis causes no translation, and hence r_z carries no information about depth. Therefore, we set r_z = 0 and redefine r \equiv [r_x, r_y]^\top as the rotation vector of the eyeball. Using Eq. (1) and the inverse depth d(x, y) = 1/Z(x, y), the optical flow v = [v_x, v_y]^\top is written as

v_x = x y r_x - (1 + x^2) r_y - Z_0 r_y d \equiv v_x^r - r_y Z_0 d, \quad (2)

v_y = (1 + y^2) r_x - x y r_y + Z_0 r_x d \equiv v_y^r + r_x Z_0 d. \quad (3)

In the above equations, d is an unknown variable at each pixel, and r is an unknown parameter common to the whole image.
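To make the flow model concrete, here is a minimal NumPy sketch of Eqs. (1)-(3); it is our own illustration, not code from the paper, and the function name `tremor_flow` and its argument conventions are hypothetical.

```python
import numpy as np

# Minimal sketch of the tremor-induced flow, Eqs. (1)-(3); names are ours.
# x, y are image coordinates (scalars or arrays), d is the inverse depth,
# (rx, ry) is the eyeball rotation, and Z0 is the rotation-center offset.
def tremor_flow(x, y, d, rx, ry, Z0=1.0):
    vx_r = x * y * rx - (1.0 + x**2) * ry   # rotational part of Eq. (2)
    vy_r = (1.0 + y**2) * rx - x * y * ry   # rotational part of Eq. (3)
    vx = vx_r - ry * Z0 * d                 # Eq. (2)
    vy = vy_r + rx * Z0 * d                 # Eq. (3)
    return vx, vy
```

Only the terms scaled by Z_0 d depend on the scene, which is why r_z, causing no translation, carries no depth information.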

B. Gradient equation for rigid motion

The gradient equation is the first-order approximation of the assumption that image brightness is invariant before and after the relative 3-D motion between camera and object. At each pixel (x, y), it relates the partial derivatives f_x, f_y, and f_t (t denoting time) of the image brightness f(x, y, t) to the optical flow:

f_t = -f_x v_x - f_y v_y. \quad (4)

By substituting Eqs. (2) and (3) into Eq. (4), the gradient equation representing the rigid-motion constraint can be derived explicitly:

f_t = -(f_x v_x^r + f_y v_y^r) - (-f_x r_y + f_y r_x) Z_0 d \equiv -f^r - f^u d. \quad (5)
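The constraint of Eq. (5) separates into a purely rotational term f^r and a depth-coupled term f^u. A small sketch, under the same assumptions and naming conventions as above:

```python
# Sketch of Eq. (5); returns (f_r, f_u) so that the model predicts
# ft = -f_r - f_u * d at a pixel with spatial gradients (fx, fy).
def f_terms(fx, fy, x, y, rx, ry, Z0=1.0):
    f_r = (fx * (x * y * rx - (1.0 + x**2) * ry)
           + fy * ((1.0 + y**2) * rx - x * y * ry))  # f^r = fx*vx^r + fy*vy^r
    f_u = (-fx * ry + fy * rx) * Z0                  # f^u, multiplies d
    return f_r, f_u
```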

C. Probabilistic models

We use M for the number of pairs of two successive frames and N for the number of pixels. In this study, we assume that the optical flow is very small, and hence the observation errors of f_t, f_x, and f_y, which are computed by finite differences, are small. Additionally, the equation error is also small, and therefore we can assume that an error having no relation to f_t, f_x, and f_y is added to the gradient equation as a whole. From this consideration, we assume that f_t^{(i,j)} is a Gaussian random variable, while f_x^{(i,j)} and f_y^{(i,j)} have no error:

p(f_t^{(i,j)} | d^{(i)}, r^{(j)}, \sigma_o^2) = \frac{1}{\sqrt{2\pi}\sigma_o} \exp\left[ -\frac{\{ f_t^{(i,j)} + f^{r(i,j)} + f^{u(i,j)} d^{(i)} \}^2}{2\sigma_o^2} \right], \quad (6)

where i = 1, ..., N and j = 1, ..., M, and σo² is an unknown variance. We ignore the temporal correlation of tremor that would be needed to form a drift component, and we assume that r^{(j)} is a 2-dimensional Gaussian random variable with mean 0 and variance-covariance matrix σr² I, where I indicates the 2 × 2 unit matrix:

p(r^{(j)} | \sigma_r^2) = \frac{1}{(\sqrt{2\pi}\sigma_r)^2} \exp\left( -\frac{r^{(j)\top} r^{(j)}}{2\sigma_r^2} \right), \quad (7)

where σr² is assumed to be unknown. Since multiple frames vibrated by the irregular rotations {r^{(j)}} are used for processing and no tracking procedures are employed, the recovered d^{(i)} at each pixel takes an average value over the neighboring region defined by the vibration width in the image. Therefore, {d^{(i)}} should be assumed to be correlated within this neighboring region, the extent of which depends on the depth value. In this study, to simplify the modeling, we use the following equation as the depth model:

p(d | \sigma_d^2) = \frac{1}{(\sqrt{2\pi}\sigma_d)^N} \exp\left( -\frac{d^\top L d}{2\sigma_d^2} \right), \quad (8)

where d is the N-dimensional vector composed of {d^{(i)}} and L is the matrix corresponding to the 2-dimensional Laplacian operator with a free-end condition. By assuming this probability density, we make the recovered depth map smooth. In this study, the variance σd² is controlled heuristically in consideration of the smoothness of the recovered depth map. In the future, we plan to examine a strategy for determining σd² within a whole system that models all fixational movements, including microsaccade and drift. Hereafter, we use the definition Θ ≡ {σo², σr²}.

D. Computation algorithm

By applying the MAP-EM algorithm [7], the parameters {d, Θ} can be estimated as a MAP estimator based on p(d, Θ | {f_t^{(i,j)}}), which is formulated by marginalizing the joint probability p({r^{(j)}}, d, Θ | {f_t^{(i,j)}}) with respect to {r^{(j)}}; the prior of Θ is formally regarded as a uniform distribution. Additionally, {r^{(j)}} can be estimated as a MAP estimator based on p({r^{(j)}} | {f_t^{(i,j)}}, \hat{Θ}, \hat{d}), in which \hat{·} denotes a MAP estimator as described above.

In the EM scheme, {{f_t^{(i,j)}}, {r^{(j)}}} is regarded as the complete data, {r^{(j)}} is treated as missing data, and {d, Θ} is treated as the unknown parameter. The E step and the M step are repeated alternately until convergence. First, in the E step, we calculate the conditional expectation of the log-likelihood of the complete data given the observations {f_t^{(i,j)}}, using the current MAP estimates {\hat{d}, \hat{Θ}}. This is generally called the Q function. For the MAP-EM algorithm in particular, the objective function J(d, Θ) maximized in the M step is equal to the Q function augmented by the log prior densities of the parameters.
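Schematically, the estimation loop can be written as below; this is our own sketch, assuming the observations ft and the vectors w0, wd of Eq. (13) have been assembled beforehand, and `em_iteration` is the hypothetical helper spelled out after Eq. (18).

```python
# Hypothetical top-level MAP-EM loop (our sketch, not the authors' code).
d, so2, sr2 = d_init, 1e-2, 1e-2          # initial depth map and variances
for _ in range(max_iters):
    # one E step (Eqs. 10-12) followed by one M step (Eqs. 15-16)
    d_new, so2, sr2, rm = em_iteration(ft, w0, wd, d, so2, sr2, sd2, (H, W))
    if np.max(np.abs(d_new - d)) < tol:   # stop once the depth map settles
        d = d_new
        break
    d = d_new
```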


In the following, values computed using \hat{Θ} are also indicated by \hat{·}. Based on the densities defined in Sec. II-C, the objective function is derived as

J(d, \Theta) = \mathrm{Const.} - \frac{MN}{2}\ln\sigma_o^2 - M\ln\sigma_r^2 - \frac{1}{2\sigma_o^2}\sum_{j=1}^{M}\left[ \sum_{i=1}^{N} (f_t^{(i,j)})^2 + 2\sum_{i=1}^{N} f_t^{(i,j)} w^{(i,j)\top} \hat{r}_m^{(j)} + \mathrm{tr}\left( \sum_{i=1}^{N} w^{(i,j)} w^{(i,j)\top} \hat{R}^{(j)} \right) \right] - \frac{1}{2\sigma_r^2}\sum_{j=1}^{M} \mathrm{tr}\,\hat{R}^{(j)} - \frac{d^\top L d}{2\sigma_d^2}, \quad (9)

using the following definitions, derived by formulating the posterior density p({r^{(j)}} | {f_t^{(i,j)}}, \hat{d}, \hat{Θ}):

\hat{r}_m^{(j)} \equiv E\left[ r^{(j)} | \{f_t^{(i,j)}\}, \hat{d}, \hat{\Theta} \right] = -\frac{1}{\hat{\sigma}_o^2} \hat{V}_r^{(j)} \sum_{i=1}^{N} f_t^{(i,j)} \hat{w}^{(i,j)}, \quad (10)

\hat{V}_r^{(j)} = \left( \frac{1}{\hat{\sigma}_o^2} \sum_{i=1}^{N} \hat{w}^{(i,j)} \hat{w}^{(i,j)\top} + \frac{1}{\hat{\sigma}_r^2} I \right)^{-1}, \quad (11)

\hat{R}^{(j)} \equiv E\left[ r^{(j)} r^{(j)\top} | \{f_t^{(i,j)}\}, \hat{d}, \hat{\Theta} \right] = \hat{V}_r^{(j)} + \hat{r}_m^{(j)} \hat{r}_m^{(j)\top}, \quad (12)

w^{(i,j)} = \begin{bmatrix} f_x^{(i,j)} x^{(i)} y^{(i)} + f_y^{(i,j)} (1 + y^{(i)2}) \\ -f_x^{(i,j)} (1 + x^{(i)2}) - f_y^{(i,j)} x^{(i)} y^{(i)} \end{bmatrix} + Z_0 d^{(i)} \begin{bmatrix} f_y^{(i,j)} \\ -f_x^{(i,j)} \end{bmatrix} \equiv w_0^{(i,j)} + Z_0 d^{(i)} w_d^{(i,j)}. \quad (13)

In the M step, {d, Θ} is updated so as to maximize Eq. (9). Ignoring a constant term, we rewrite Eq. (9) as follows:

J(d, \Theta) = -\frac{MN}{2}\ln\sigma_o^2 - M\ln\sigma_r^2 - \frac{1}{2\sigma_o^2}\hat{F}(\{d^{(i)}\}) - \frac{1}{2\sigma_r^2}\hat{G} - \frac{d^\top L d}{2\sigma_d^2}. \quad (14)

From this representation, σo² and σr² can be updated as

\sigma_o^2 = \frac{\hat{F}(\{d^{(i)}\})}{MN}, \quad \sigma_r^2 = \frac{\hat{G}}{2M}. \quad (15)

For d, the partial derivative of the last term in Eq. (9) with respect to each d^{(i)} includes the neighboring d's; hence, simultaneous equations would have to be solved to update d. To avoid such a complicated solution, we use the One-Step-Late (OSL) technique [8]: we regard the above-mentioned neighboring d's as constants and evaluate them at the current estimate \hat{d}. As a result, each d^{(i)} can be updated individually as follows:

d^{(i)} = \frac{ \hat{\sigma}_o^2 \bar{d}^{(i)} - \sigma_d^2 Z_0 \sum_{j=1}^{M} \left( f_t^{(i,j)} w_d^{(i,j)\top} \hat{r}_m^{(j)} + \mathrm{tr}\, B^{(i,j)} \hat{R}^{(j)} \right) }{ \sigma_d^2 Z_0^2 \sum_{j=1}^{M} \mathrm{tr}\, A^{(i,j)} \hat{R}^{(j)} + \hat{\sigma}_o^2 }, \quad (16)

where σo² is also evaluated at the current estimate, the matrices A^{(i,j)} and B^{(i,j)} are defined as

A^{(i,j)} \equiv w_d^{(i,j)} w_d^{(i,j)\top}, \quad (17)

B^{(i,j)} \equiv \left( w_d^{(i,j)} w_0^{(i,j)\top} + w_0^{(i,j)} w_d^{(i,j)\top} \right)/2, \quad (18)

and \bar{d}^{(i)} indicates the local mean value over the 4-neighbor system, excluding d^{(i)} itself.
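Under our reading of Eqs. (10)-(16), one E/M cycle can be sketched in NumPy as follows; array shapes, boundary handling, and all names are our assumptions, not the authors' implementation. The sketch exploits the identities tr(A^{(i,j)} \hat{R}^{(j)}) = w_d^⊤ \hat{R} w_d and tr(B^{(i,j)} \hat{R}^{(j)}) = w_d^⊤ \hat{R} w_0, which follow from the symmetry of \hat{R}^{(j)}.

```python
import numpy as np

def em_iteration(ft, w0, wd, d, so2, sr2, sd2, shape, Z0=1.0):
    """One MAP-EM cycle, Eqs. (10)-(16). ft: (N, M); w0, wd: (N, M, 2);
    d: (N,) inverse depth; shape = (H, W) of the image grid."""
    N, M = ft.shape
    w = w0 + Z0 * d[:, None, None] * wd                 # Eq. (13)

    # --- E step: posterior moments of each rotation r^(j), Eqs. (10)-(12) ---
    S = np.einsum('imk,iml->mkl', w, w)                 # sum_i w w^T, per frame
    Vr = np.linalg.inv(S / so2 + np.eye(2) / sr2)       # Eq. (11)
    b = np.einsum('im,imk->mk', ft, w)                  # sum_i ft w, per frame
    rm = -np.einsum('mkl,ml->mk', Vr, b) / so2          # Eq. (10)
    R = Vr + np.einsum('mk,ml->mkl', rm, rm)            # Eq. (12)

    # --- M step: variance updates, Eq. (15) ---
    F = ((ft**2).sum()
         + 2.0 * np.einsum('im,imk,mk->', ft, w, rm)
         + np.einsum('imk,iml,mkl->', w, w, R))
    G = np.trace(R, axis1=1, axis2=2).sum()
    so2_new, sr2_new = F / (M * N), G / (2.0 * M)

    # --- M step: OSL update of d, Eq. (16), with current so2 as sigma_o^2 ---
    dbar = local_mean(d.reshape(shape)).ravel()         # 4-neighbor mean
    num = so2 * dbar - sd2 * Z0 * (
        np.einsum('im,imk,mk->i', ft, wd, rm)           # ft wd^T rm terms
        + np.einsum('imk,mkl,iml->i', wd, R, w0))       # tr(B R) = wd^T R w0
    den = sd2 * Z0**2 * np.einsum('imk,mkl,iml->i', wd, R, wd) + so2
    return num / den, so2_new, sr2_new, rm

def local_mean(d2d):
    """Mean of the 4-neighbors (free-end boundary via edge padding)."""
    p = np.pad(d2d, 1, mode='edge')
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
```

Because \hat{V}_r^{(j)} is only 2 × 2, the E step stays cheap even for many frames.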

III. NUMERICAL EVALUATIONS

To confirm the effectiveness of the proposed method, we conducted numerical evaluations using artificial images. Figure 2(a) shows the original image, generated by a computer-graphics technique using the depth map shown in Fig. 2(b). The image size assumed in these evaluations is 128 × 128 pixels, which corresponds to −0.5 ≤ x, y ≤ 0.5 measured with the focal length as the unit. In Fig. 2(b), the vertical axis indicates the depth Z with the focal length as the unit, and the horizontal axes indicate the pixel position in the image plane.

Figure 2. Example of the data used in the experiments: (a) artificial image; (b) true depth map.

In our model, pairs of two successive images are assumed to be used in turn to calculate f_t. In this study, we ignore the drift component of fixational eye movements, and hence we should avoid divergence of the movement range at each image position. To simplify the procedures, each rotation value was sampled as an independent Gaussian random variable with respect to the initial image shown in Fig. 2(a), and the image pairs for the gradient equations were formed from the initial image and each successive image. Additionally, in order to first justify our algorithm under the assumed statistical models, we computed {f_t} using Eq. (5) with the true values of r and {d} and used them for depth recovery.

We executed the proposed algorithm with σr² = 10⁻⁴. Under this condition, the mean magnitude of the optical flow was approximately one pixel, which is sufficiently small compared with the textures in Fig. 2(a) and is consistent with our model. Gaussian random values with mean 0 and a standard deviation corresponding to 1% of the mean value of {f_t} were added to the true {f_t}. By varying the value of σd², we evaluated the effectiveness of the smoothness constraint introduced by Eq. (8). The initial values of both σo² and σr² were arbitrarily set to 1.0 × 10⁻², and {d} was initialized as the plane Z = 9.0. Examples of the results with M = 100 are shown in Fig. 3. Additionally, to confirm the effectiveness of collectively using many observations caused by small motions, we varied M for each σd².

Table I
RMSE OF RECOVERED DEPTH

  σd²/σo²    M = 50    M = 100
  10⁻¹       0.6983    0.4741
  10⁻²       0.3948    0.3079
  10⁻³       0.2097    0.1841
  10⁻⁴       0.0945    0.0769
  10⁻⁵       0.3699    0.2124

The RMSEs of the recovered depth are shown in Table I for each value of σd² and M. From these results, it can be confirmed that the smoothness constraint is important for reducing the actual degrees of freedom of d. However, if this constraint is applied excessively, the recovered depth map becomes too smooth and the recovery error increases again. It should be noted that the scales of the Z-axes in Figs. 3(a)-(c) differ. We can also see that collecting many observations works well: the error decreases as the number of motions M increases.

Figure 3. Results of recovered depth maps with M = 100: σd² is (a) σo² × 10⁻¹; (b) σo² × 10⁻³; (c) σo² × 10⁻⁵.
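For completeness, the evaluation protocol of this section can be paraphrased as the following sketch; it is our reconstruction under stated assumptions (the gradients fx, fy, coordinates x, y, true inverse depth d_true, and the vectors w0, wd of Eq. (13) are assumed precomputed, and all names are hypothetical), reusing the helpers sketched in Sec. II.

```python
rng = np.random.default_rng(0)

# Our reconstruction of the protocol in Sec. III (illustrative only).
M, sr2_true = 100, 1e-4
rot = rng.normal(0.0, np.sqrt(sr2_true), size=(M, 2))   # true tremor rotations

# True ft from Eq. (5), then additive Gaussian noise at 1% of the mean level.
ft = np.empty((N, M))
for j in range(M):
    f_r, f_u = f_terms(fx, fy, x, y, rot[j, 0], rot[j, 1])
    ft[:, j] = -f_r - f_u * d_true
ft += rng.normal(0.0, 0.01 * np.abs(ft).mean(), size=ft.shape)

d, so2, sr2 = np.full(N, 1.0 / 9.0), 1e-2, 1e-2          # plane Z = 9.0 init
for _ in range(100):                                     # iterate MAP-EM
    d, so2, sr2, _ = em_iteration(ft, w0, wd, d, so2, sr2, sd2, (128, 128))

rmse = np.sqrt(((1.0 / d - 1.0 / d_true) ** 2).mean())   # RMSE in depth Z
```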

IV. CONCLUSIONS

In this study, we proposed a depth perception model based on fixational eye movements, especially tremor. This model recovers a depth map collectively, using multiple images over the period corresponding to one drift. Since it treats small changes of the image brightness pattern, the linear approximation error contained in the gradient equation remains small. Additionally, because one depth map corresponding to multiple successive images is recovered, the adverse influence of observation errors can be reduced. In the future, we have to show the effectiveness of the proposed model through real-image experiments, and to construct a whole model based on fixational eye movements, including drift and microsaccade. A technique for determining σd² should also be examined. If regions of discontinuity are known as prior information, σd² should be determined as a function of (x, y) so as to preserve the edges of the 3-D shape without a line process.

REFERENCES

[1] S. Martinez-Conde, S. L. Macknik, and D. Hubel, "The role of fixational eye movements in visual perception," Nature Reviews Neuroscience, vol. 5, pp. 229-240, 2004.

[2] S. Ando, N. Ono, and A. Kimachi, "Involuntary eye-movement vision based on three-phase correlation image sensor," in Proc. 19th Sensor Symposium, pp. 83-86, 2002.

[3] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, pp. 185-203, 1981.

[4] E. P. Simoncelli, "Bayesian multi-scale differential optical flow," Handbook of Computer Vision and Applications, vol. 2, pp. 397-422, 1999.

[5] A. Bruhn and J. Weickert, "Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods," Int. J. Comput. Vision, vol. 61, no. 3, pp. 211-231, 2005.

[6] N. Tagawa, J. Kawaguchi, S. Naganuma, and K. Okubo, "Direct 3-D shape recovery from image sequence based on multi-scale Bayesian network," in Proc. ICPR '08, 2008.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1-38, 1977.

[8] P. J. Green, "On use of the EM algorithm for penalized likelihood estimation," J. Roy. Statist. Soc. B, vol. 52, pp. 443-452, 1990.