SPATIO-TEMPORAL CONSISTENT DEPTH MAPS FROM MULTI-VIEW VIDEO

Marcus Mueller, Frederik Zilly, Christian Riechert, Peter Kauff
{marcus.mueller, frederik.zilly, christian.riechert, peter.kauff}@hhi.fraunhofer.de

ABSTRACT

The demand for high quality depth maps from stereo and multi-camera videos increases constantly. The main application for these depth maps is rendering new perspectives of the captured scene by means of Depth Image Based Rendering (DIBR). Accurate depth maps are the linchpin of DIBR. On the basis of a four-camera set-up, we show that combining hybrid recursive matching with motion estimation, cross-bilateral post-processing and mutual depth map fusion produces spatio-temporally consistent depth maps appropriate for artifact-free view synthesis.

Index Terms— Multi-view matching, disparity post-processing, depth map fusion, temporal consistency

1. INTRODUCTION

Today's 3D productions are mainly based on classic stereo video, captured with two cameras and displayed on stereoscopic glasses-based screens. Recently, more and more auto-stereoscopic multi-view and lightfield displays are entering the market. These displays, however, need many different perspectives that have to be rendered according to the specific display requirements. In general, the field of view covered by the baseline of a classic stereo shoot is too small for such displays. In MUSCADE [1], one of the EU FP7 projects exploring aspects of future 3D TV, a four-camera rig is used to enlarge the field of view, preventing ill-posed view extrapolation. To be backwards-compatible with standard stereo, a common stereo rig is enhanced with two satellite cameras as shown in figure 1 (the actual MUSCADE rig uses a mirror rig for the center stereo pair). To interpolate the needed perspectives by DIBR, consistent dense depth maps for all camera views are needed.
Since temporally inconsistent depth maps lead to annoying flickering artifacts during virtual view synthesis, enforcing temporal consistency is of great importance. Against this background, we introduce an approach to estimate spatially and temporally consistent depth maps from multiple synchronized camera views. The remainder of the paper is organized as follows. After a short discussion of related work in section 2, we introduce the overall concept in section 3. Section 4 briefly recapitulates the idea of hybrid recursive matching (HRM) [2] and details its modifications to

978-1-61284-162-5/11/$26.00 2011 © IEEE

improve temporal consistency. Results are shown in section 5 and we conclude in section 6.

2. RELATED WORK

Many approaches to multi-baseline stereo have been proposed in recent years. Some of them operate in a multi-image manner, e.g. [3], meaning that all images are treated equally, while others use repeated application of two- or three-view techniques, e.g. [4]. Global optimization techniques have also been reformulated to include additional cameras [5]. Recently, the concept of scene flow was introduced to jointly estimate motion and structure from a stereo image sequence. Scene flow is the three-dimensional motion field of points in the world, just as optical flow is the two-dimensional motion field of points in an image [6].

3. MULTI-VIEW DISPARITY ESTIMATION

Figure 1. Arrangement of the 4-camera rig (Cam1, Cam2, Cam3, Cam4)

Figure 1 shows a sketch of the multi-camera rig that was used for the experiments. We start the computation by independently estimating two disparity maps for each of the following camera pairs:

( Cam1 ; Cam2 ) => Disparity_12 & Disparity_21
( Cam1 ; Cam3 ) => Disparity_13 & Disparity_31
( Cam2 ; Cam4 ) => Disparity_24 & Disparity_42
( Cam3 ; Cam4 ) => Disparity_34 & Disparity_43
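The pair bookkeeping above can be sketched as a small scheduling loop. This is a hedged illustration only: estimate_disparity is a hypothetical stand-in for the hybrid recursive matcher of section 4, and the optional seed argument models the inner-pair initialization described in the text; none of these names come from the paper.

```python
# Illustrative sketch of the pairwise disparity scheduling.
# Each pair yields two maps, one per view (8 maps in total).

PAIRS = [(1, 2), (1, 3), (2, 4), (3, 4)]  # the wide pairs listed above

def estimate_disparity(left, right, seed=None):
    """Hypothetical stand-in for hybrid recursive matching (section 4)."""
    # A real implementation would return a dense disparity map for `left`.
    return {"pair": (left, right), "seeded": seed is not None}

def estimate_all(init_pair=(2, 3)):
    # The inner stereo pair (Cam2; Cam3) may provide an initialization
    # for the wider pairs, as described in the text.
    seed = estimate_disparity(*init_pair)
    maps = {}
    for left, right in PAIRS:
        maps[f"Disparity_{left}{right}"] = estimate_disparity(left, right, seed)
        maps[f"Disparity_{right}{left}"] = estimate_disparity(right, left, seed)
    return maps
```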

This selection is motivated by the increased depth resolution of a larger baseline. The inner stereo pair ( Cam2 ; Cam3 ) may be used to initialize the disparity estimation for the selected pairs. A hybrid recursive matching algorithm, described in the following section, is used for this purpose, yielding the initial disparity maps shown in the first row of figure 3. In the next step, the consistency of the disparity maps is checked within stereo pairs by a left-right consistency check and across stereo pairs by a trifocal consistency check, see the second row of figure 3. Note that the different matching processes are carried out independently; the probability of detecting and rejecting all mismatches by these consistency checks is therefore very high. After these checks we have trifocal-consistent disparity maps for camera 2 and camera 3 as well as left-right-consistent disparity maps for camera 1 and camera 4. We can now use these maps to fill the occlusions of the center-pair disparity maps, see the third row of figure 3. Occlusions in the outer cameras are filled by a huge cross-bilateral filtering window covering half of the image width. In a post-processing step, all disparity maps are cross-bilaterally filtered with their respective camera views. The final disparity maps are fed back to guide the processing for the succeeding frame.
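The paper does not spell out the left-right consistency check itself; a common formulation, shown here as a hedged numpy sketch, marks a pixel as consistent when its disparity agrees with the disparity stored at the corresponding position in the other view within a tolerance. The sign convention and tolerance are assumptions, not values from the paper.

```python
import numpy as np

def left_right_consistency(d_lr, d_rl, tol=1.0):
    """Boolean mask of pixels passing a standard left-right check.

    d_lr: disparity map of the left view (pixel x matches x - d_lr(x) on the right)
    d_rl: disparity map of the right view
    A match at x is accepted if |d_lr(x) - d_rl(x - d_lr(x))| <= tol.
    """
    h, w = d_lr.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    # Position of the corresponding pixel in the right view, clamped to the image.
    x_right = np.clip(np.round(xs - d_lr).astype(int), 0, w - 1)
    diff = np.abs(d_lr - d_rl[ys, x_right])
    return diff <= tol
```

The trifocal check across stereo pairs follows the same idea, chaining correspondences through a third view instead of back-projecting within one pair.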

Figure 2. Concept of multi-image matching (block diagram: the four camera views Cam1-Cam4 pass through motion analysis, pairwise disparity estimation, consistency checks, depth map fusion and occlusion filling, and final filtering, yielding the depth maps for Cam1-Cam4)

4. TEMPORALLY CONSISTENT MATCHING

The main idea of hybrid recursive matching (HRM) is to unite the advantages of block-recursive disparity matching and pixel-recursive optical flow estimation in one common scheme [2]. To enable real-time performance and to enforce temporal and spatial consistency, the search for corresponding points during block recursion is restricted to only three candidate positions:

- a horizontal predecessor, taken from the left or right position in the actual frame
- a vertical predecessor, taken from the top or bottom position in the actual frame
- a temporal predecessor, taken from the same position in the previous frame

The candidates are tested to find the best match using locally adaptive support windows with the sum of squared color differences, similar to [7]. Obviously, without any update, block recursion is not able to follow spatio-temporal deviations in the disparity field. A pixel recursion step therefore acts as a permanent update process, injecting one new candidate vector per pixel position into the block recursion. A detailed description of the block-recursion and pixel-recursion processes can be found in [2]. The following modifications were made with respect to [2] to improve temporal consistency:

- integration of motion estimation
- introduction of a temporal weight factor per pixel

As mentioned before, the temporal candidate is used to enforce temporal consistency. Using the disparity from the previous frame, however, is a rather bad prediction when there are fast moving objects in the scene or when the camera is moving. To improve the prediction we employ a simple block-matching based motion estimation algorithm. Since the motion fields of this simple approach are noisy, we adapted the cross-trilateral median filter [8] to motion vectors to smooth the motion vector fields while preserving motion discontinuities (see figure 4). The filter weight of a motion vector M_p at location p in the filter window centered at location s is computed as:

weight_s(M_p) = r(M_p) · w(p - s) · c(I_p - I_s),

where w(p - s) represents the standard domain kernel

w(x) = exp( -1/2 · (|x| / σ_s)² ),

c(I_p - I_s) represents the standard range kernel

c(x) = exp( -1/2 · (|x| / σ_c)² ),

and r(M_p) represents the confidence kernel. The confidence of a motion vector is a combination of two reliability values:

- the forward-backward consistency and
- the correlation value.

The filtered motion vector M'_s at position s is thus:

M'_s = med{ weight_s(M_0) ◊ M_0, weight_s(M_1) ◊ M_1, ..., weight_s(M_n) ◊ M_n }, M_n ∈ N_s,

where weight_s(M) ◊ M denotes duplication of M by a factor weight_s(M). Here, the forward-backward consistency check is the equivalent of the left-right consistency check used to verify disparity consistency within a stereo pair. To control temporal smoothness, a weighting factor is assigned to the temporal candidate for each pixel. The temporal candidate is selected if its dissimilarity value is smaller than the current best candidate's dissimilarity value multiplied with this weight factor. The strength of the weight factor is controlled by

- the reliability of the motion vector R_M and
- the reliability of the previous match R_D.

The reliability of the motion vector R_M is defined as the product of the zero-normalized cross-correlation value corrval and the reciprocal of the forward-backward consistency C_fb:

R_M = { 0                  if corrval < Thres_Corr or C_fb > Thres
      { corrval · 1/C_fb   else
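The weighting and filtering steps of this section can be sketched as follows. This is a hedged numpy illustration, not the authors' implementation: kernel widths, the correlation threshold, the window layout and all function names are assumptions; only the kernel products, the weighted-median form and the Thres = 2 value follow the text, and weight-as-duplication is approximated by a weighted-median routine.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the value where cumulative weight reaches half the total."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def motion_reliability(corrval, c_fb, thres_corr=0.5, thres=2.0):
    """R_M as in the text; thres_corr is an illustrative guess, Thres = 2 per the paper."""
    if corrval < thres_corr or c_fb > thres:
        return 0.0
    return corrval / c_fb

def filter_motion_vector(window_mv, window_int, center_int, confidence,
                         sigma_s=2.0, sigma_c=10.0):
    """Cross-trilateral weighted median for one window (sigmas are illustrative).

    window_mv:  (k, k, 2) motion vectors M_p in the filter window N_s
    window_int: (k, k) image intensities I_p of the window
    center_int: intensity I_s at the window center s
    confidence: (k, k) confidence values r(M_p)
    """
    k = window_mv.shape[0]
    ys, xs = np.mgrid[0:k, 0:k]
    c = (k - 1) / 2.0
    dist = np.hypot(ys - c, xs - c)
    w_domain = np.exp(-0.5 * (dist / sigma_s) ** 2)                       # w(p - s)
    w_range = np.exp(-0.5 * ((window_int - center_int) / sigma_c) ** 2)  # c(I_p - I_s)
    weights = (confidence * w_domain * w_range).ravel()                  # r · w · c
    # Component-wise weighted median of the motion vectors.
    return np.array([weighted_median(window_mv[..., i].ravel(), weights)
                     for i in range(2)])
```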

The reliability of the disparity R_D is just the reciprocal of the trifocal consistency C_trif or the left-right consistency C_lr:

R_D = { 0          if C_trif > Thres and C_lr > Thres
      { 1/C_trif   if C_trif < Thres
      { 1/C_lr     else

where Thres is a predefined threshold (2 for the results shown). For the results shown, the temporal weight factor was set to TmpWeightFac = 1 + R_D + R_M.

5. EXPERIMENTAL RESULTS

For the results shown, we applied the algorithm to one of the MUSCADE test sequences called Band06. The baseline of the outer stereo pairs is about three times the baseline of the inner stereo pair. Figure 3 shows examples of the different steps during the estimation process, while figure 4 gives an example of the motion field post-processing. Showing temporal consistency is hardly possible on paper and will be demonstrated during the presentation. The modifications to the basic hybrid recursive matching algorithm led to dramatic improvement in the case of moving objects and moving cameras. Even in the case of little object movement and a static camera, we observed a distinct improvement of the resulting depth maps.

6. CONCLUSION

A very fast and easy to implement method to enforce spatially and temporally consistent depth maps for multiple stereo pairs was introduced. First experiments showed very good results, in particular when applied to moving cameras. Further improvements can be expected from more elaborate motion estimation algorithms, which we plan to investigate in the near future.

7. REFERENCES

[1] www.muscade.eu/
[2] N. Atzpadin, P. Kauff, O. Schreer, "Stereo Analysis by Hybrid Recursive Matching for Real-Time Immersive Video Conferencing", IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Immersive Telecommunications, Vol. 14, No. 3, pp. 321-334, January 2004
[3] R. Collins, "A Space-Sweep Approach to True Multi-Image Matching", Proceedings of the Image Understanding Workshop, Vol. 2, pp. 1213-1220, February 1996
[4] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nister, and M. Pollefeys, "Real-Time Visibility-Based Fusion of Depth Maps", IEEE International Conference on Computer Vision (ICCV), 2007
[5] G. Vogiatzis, C. Hernandez, P. H. S. Torr, and R. Cipolla, "Multi-view Stereo via Volumetric Graph-Cuts and Occlusion Robust Photo-Consistency", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 12, pp. 2241-2246, December 2007
[6] S. Vedula, S. Baker, P. Rander, R. Collins and T. Kanade, "Three-Dimensional Scene Flow", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 3, pp. 475-480, March 2005
[7] K. Yoon, I. Kweon, "Locally Adaptive Support-Weight Approach for Visual Correspondence Search", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2005
[8] M. Mueller, F. Zilly, P. Kauff, "Adaptive Cross-Trilateral Depth Map Filtering", Proceedings of the 3DTV Conference, Tampere, June 2010

Initial depth maps for Cam1, Cam2, Cam3 and Cam4 (from left to right)

Left-right consistency for Cam1 and Cam4 (outer left and outer right) and trifocal consistency for Cam2 and Cam3 (middle left and middle right)

Merged depth maps for Cam2 and Cam3 (middle left and middle right)

Finally filtered depth maps for Cam1, Cam2, Cam3 and Cam4 (from left to right)

Figure 3. Interim results and final depth maps of the multi-image matching algorithm

Initial motion vector fields for Cam1, Cam2, Cam3 and Cam4 (from left to right)

Forward-backward-consistency for motion vector fields of Cam1, Cam2, Cam3 and Cam4 (from left to right)

Finally filtered motion vector fields for Cam1, Cam2, Cam3 and Cam4 (from left to right)

Figure 4. Post-processing of motion vector fields