Robust Enhancement of Depth Images from Kinect Sensor

ABM Tariqul Islam¹, Christian Scheel¹, Renato Pajarola², and Oliver Staadt¹

¹ Visual Computing Lab, University of Rostock, Germany
² Visualization and MultiMedia Lab, University of Zürich, Switzerland

ABSTRACT
We propose a new method to fill missing or invalid values in depth images generated by the Kinect depth sensor. To fill the missing depth values, we use a robust least median of squares (LMedS) approach. We apply our method in telepresence environments, where Kinects are frequently used to reconstruct the captured scene in 3D. We introduce a modified 1D LMedS approach for efficient traversal of consecutive image frames. Our approach resolves the unstable nature of depth values in static scenes that is perceived as flickering. We obtain very good results for both static and moving objects inside a scene.

Figure 1: Kinect images – (a) color image, (b) depth image.

Index Terms: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual Reality; I.4.3 [Image Processing and Computer Vision]: Enhancement

1 INTRODUCTION

Like in many multimedia applications, 3D scene reconstruction is crucial in telepresence. A telepresence system provides a virtual window for local users through which they can interact with remote users and objects [3]; a 3D representation of the remote users and objects makes the communication more realistic and natural. To produce 3D content for telepresence, researchers (such as in [3]) often use low-cost depth cameras such as Kinects. Kinects capture a scene with acceptable resolution and speed, but the generated depth data is unstable and of poor quality. Consequently, 3D content generated from Kinect depth data is often ill-suited for many applications, including telepresence. There have been quite a few works ([2], [1], [4]) on enhancing Kinect depth images. Some of these works (such as [4]) use an off-line approach and are thus unsuitable for real-time applications; others (such as [1]) target only static scenes, and some (such as [2]) cannot handle dark regions where color information is missing. We propose an approach that enhances the depth frames of both static and dynamic parts of a scene. We modify the original least median of squares (LMedS) [5] approach, a popular statistical regression technique for outlier handling, to fit our purpose. We modify LMedS into a 1D LMedS: for each pixel, we go through a window of frames, detect the missing depth values, identify them as outliers by comparing them to certain constraints, and finally replace them with stable depth values. Here, we only present our work on enhancing the depth frames with respect to invalid and unstable depth values; although we observe considerably less edge bleeding with our approach, our work on improving the remaining edge bleeding is still under development



Figure 2: Kinect images with static objects – (a) color image, (b) depth images of 4 consecutive frames. The depth frames in (b) correspond to the red-marked area of (a).

and uses an anisotropic diffusion approach along with the structural similarity of depth and color frames. A GPGPU implementation of our method achieves real-time speed, while even a CPU implementation already reaches 7 fps.
Kinect depth images suffer severely from various kinds of noise, such as black holes caused by invalid depth values. The black holes in Figure 1(b) show the invalid depth values. Kinects produce such holes even if the objects remain static; this phenomenon is visible in the green and orange marked areas in Figure 2(b). In the yellow marked areas in Figure 2(b), we can see that some pixels have valid depth values in one frame but invalid depth values in the next. The depth values the Kinect generates at a given pixel therefore do not follow any pattern and are unstable. We take this unstable nature into consideration and apply 1D LMedS to replace unstable and invalid values with valid, stable depth values.
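For illustration, the following minimal NumPy sketch shows how such invalid pixels could be detected before filtering; the constant INVALID_RAW and the function invalid_mask are our own names (Figure 3 shows 2047 as the raw "no depth" value, while some drivers report 0 instead), not part of the paper's implementation.

```python
import numpy as np

# Hypothetical helper (not from the paper): mark the 'black hole' pixels in a
# raw Kinect depth frame. Figure 3 shows 2047 as the raw "no depth" value;
# other drivers report 0 instead, so both are treated as invalid here.
INVALID_RAW = 2047

def invalid_mask(depth_frame: np.ndarray) -> np.ndarray:
    """Boolean mask that is True at pixels without a valid depth measurement."""
    return (depth_frame == INVALID_RAW) | (depth_frame == 0)

# Toy 2x3 frame with two holes, one marked 2047 and one marked 0.
frame = np.array([[902, 904, 2047],
                  [903,   0,  923]], dtype=np.uint16)
print(invalid_mask(frame))
```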

2 PROPOSED STRATEGY

To fill the invalid depth values, we propose our 1D LMedS approach, which is very robust to outliers. Consider, for example, 300 consecutive depth frames of which we take 7 frames in a sliding window. According to the original LMedS [5], filtering the outliers involves calculating the minimum of the medians (M_J) of the squared residuals (r_i^2) of all points m_i of a sample set; the m_i are also subsampled based on conic parameters P_J. The LMedS formula and the explanation of its variables and constants are given in [5]. In our approach, we do not need 5 conic points (for P_J in [5]) but only 1 point, because we do not take the neighboring pixels of the current pixel as a subset of the sample (which the original LMedS does in its 2D approach) and we traverse in one direction only. The name 1D


[Figure 3 panel content: depth values of pixel (30, 10) over frames t to t+5: 902, 904, 923, 2047, 903, 904; flow steps: Step-1 (absolute distance), Step-2 (median), Step-3 (minimum median); the inliers are averaged, giving the final depth value 903.]

Figure 3: (a) Window of frames with valid and invalid raw depth values in green and red, respectively, (b) flow diagram of 1D LMedS.

LMedS comes from this one-directional traversal. To fill the invalid depth values of these frames, one would start by iterating through each pixel over the window of frames and calculating the M_J of the r_i^2 for each pixel on all of these frames. In 1D LMedS, we use the absolute difference to obtain the residuals (r_i) among the depth values; in [5], r_i^2 was used to avoid negative values. The calculations of σ̂ and w_i also follow the calculation pattern of r_i. The 1D LMedS formulae are given in Equation 1, and the whole process is shown in Figure 3. However, traversing through the frames still takes quite some time. Therefore, we adapted and modified a faster data traversal procedure [5] to fit 1D LMedS. In our faster pixel traversal, we sort the depth values beforehand, split the traversal path to half of the window size, and calculate distances among the depth values. We do this for all the depth values inside the window and finally take the minimum distance. We find the index of the minimum distance and retrieve the depth value at this index. Based on this depth value and index for a pixel, we calculate σ̂ and find the inliers and

outliers according to w_i of Equation 1. We then take the average of the depth values of the inliers as the final depth value. In a Kinect depth image, both invalid and valid depth values fluctuate from one frame to another (see Figure 2), even if the objects remain static. Figure 3(a) shows the different valid depth values of the same pixel (30, 10). 1D LMedS solves this issue by assigning a stable, valid depth value over the consecutive frames.

\[
\hat{\sigma} = 1.4826 \left( 1 + \frac{5}{n-1} \right) M_J, \qquad
M_J = \operatorname*{med}_{i=1,\dots,n} r_i(m_i), \qquad
w_i =
\begin{cases}
1 & \text{if } r_i \le 2.5\,\hat{\sigma} \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]
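A minimal Python/NumPy sketch of the per-pixel 1D LMedS filter described above (and in Equation 1) is given below. The paper reports CPU and GPGPU implementations but does not publish code, so the function name lmeds_1d, the exact handling of the half-window distances, and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def lmeds_1d(depth_values: np.ndarray) -> float:
    """Estimate a stable depth value for one pixel from a window of frames."""
    n = len(depth_values)
    h = n // 2                                # traverse only half of the window
    s = np.sort(depth_values.astype(np.float64))

    # Fast traversal: distances between sorted values half a window apart;
    # the minimum distance points at the densest cluster of depth values.
    dists = s[h:] - s[:n - h]
    j = int(np.argmin(dists))
    reference = s[j]                          # depth value at the minimum-distance index

    # Residuals as absolute differences and robust scale estimate (Eq. 1).
    r = np.abs(depth_values - reference)
    m_j = np.median(r)
    sigma = 1.4826 * (1.0 + 5.0 / (n - 1)) * m_j

    # Weights w_i: inliers lie within 2.5 sigma-hat of the reference value.
    inliers = depth_values[r <= 2.5 * sigma]
    return float(inliers.mean())              # final depth = average of the inliers

# Example with the window of Figure 3(a); the invalid value 2047 is rejected.
window = np.array([902, 904, 923, 2047, 903, 904], dtype=np.float64)
print(lmeds_1d(window))                       # ~903, a stable depth for this pixel
```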

3 RESULTS
We have applied 1D LMedS to both static and moving objects inside the captured scene. 1D LMedS works well in both cases (see Figure 4(a)-(c) and Figure 4(d)-(e), respectively), filling the invalid depth values with valid ones while stabilizing the depth values over consecutive frames. One limitation for moving objects is that, when we increase the window size beyond 30 frames, a ghosting effect appears. For our tests, we used a window size of 15 for scenes with static objects and 30 for moving objects. Furthermore, when a particular pixel has no valid depth value anywhere inside the window of frames, that pixel remains a black hole. If we increase the window size, that pixel may obtain a valid depth value in some frame, and 1D LMedS can then assign a valid depth value to it. We also tested 1D LMedS on the test sequence from [1]; see the results in Figure 5.
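As a usage sketch (not the authors' published code), the loop below shows how the per-pixel filter from the previous sketch could be run over a sliding window of frames; the function enhance_sequence and the (T, H, W) array layout are assumptions, while the window sizes of 15 and 30 follow the tests reported above.

```python
import numpy as np

def enhance_sequence(frames: np.ndarray, window: int = 15) -> np.ndarray:
    """Apply the per-pixel 1D LMedS filter over a sliding window of raw depth frames.

    frames: array of shape (T, H, W); the first enhanced frame corresponds to
    the (window + 1)-th raw frame.
    """
    t_total, height, width = frames.shape
    enhanced_frames = []
    for t in range(window, t_total):
        sliding = frames[t - window:t]                 # the last `window` raw frames
        enhanced = np.empty((height, width), dtype=np.float64)
        for y in range(height):
            for x in range(width):
                # If the window holds no valid depth at all, the pixel stays a
                # hole (the average of the inliers is still an invalid value).
                enhanced[y, x] = lmeds_1d(sliding[:, y, x])
        enhanced_frames.append(enhanced)
    return np.stack(enhanced_frames)

# Window sizes used in the tests: 15 for static scenes, 30 for moving objects.
# enhanced_static = enhance_sequence(static_frames, window=15)
# enhanced_moving = enhance_sequence(moving_frames, window=30)
```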



Figure 4: Results with static & moving objects – (a) color image, (b),(d): raw depth image, (c),(e): enhanced depth image with our approach.


Figure 5: Results with static and moving objects in the test sequence from [1] – (a),(e): color image, (b),(f): raw depth image (static and moving objects), (c): enhanced depth image (static object) by [1], (d): enhanced depth image (static object) by our approach, (g): enhanced depth image (moving object) by our approach.

It is worth mentioning that the enhanced depth frame is first available from the (window size + 1)-th frame onward. For the (window size + 2)-th frame and later ones, we only have to wait for one additional frame's calculation, which adds very little delay.

ACKNOWLEDGEMENTS
This work was supported by the EU FP7 Marie Curie ITN "DIVA" under REA Grant Agreement No. 290227 and by the German Research Foundation (DFG) within the GRK 1424 MuSAMA.

REFERENCES
[1] M. Camplani, T. Mantecon, and L. Salgado. Depth-color fusion strategy for 3-D scene modeling with Kinect. IEEE Transactions on Cybernetics, 43(6):1560–1571, 2013.
[2] L. Chen, H. Lin, and S. Li. Depth image enhancement for Kinect using region growing and bilateral filter. In 21st International Conference on Pattern Recognition (ICPR), pages 3070–3073, Nov. 2012.
[3] A. Maimone and H. Fuchs. Encumbrance-free telepresence system with real-time 3D capture and display using commodity depth cameras. In 10th IEEE ISMAR, pages 137–146, Oct. 2011.
[4] S. Matyunin, D. Vatolin, Y. Berdnikov, and M. Smirnov. Temporal filtering for depth maps generated by Kinect depth camera. In 3DTV Conference, pages 1–4. IEEE, 2011.
[5] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.

