High Accuracy Depth Filtering for Kinect Using Edge Guided Inpainting
Saumik Bhattacharya, Sumana Gupta, K. S. Venkatesh
IIT Kanpur, India

Abstract-Kinect is an easy and convenient means to calculate the depth of a scene in real time. It is used widely in several applications for its ease of installation and handling. Many of these applications need a high accuracy depth map of the scene for rendering. Unfortunately, the depth map provided by Kinect suffers from various degradations due to occlusion, shadowing, scattering, etc. The two major degradations are edge distortion and shadowing. Edge distortion appears due to the intrinsic properties of Kinect and makes any depth based operation perceptually degraded. The problem of edge distortion removal has not received as much attention as the hole filling problem, though it is considerably important at the post processing stage of an RGB scene. We propose a novel method to remove line distortion and construct a high accuracy depth map of the scene by exploiting the edge information already present in the RGB image.

Index Terms-Kinect; edge distortion; contour detection; high accuracy depth map

I. INTRODUCTION

Kinect is used widely for measuring the depth of a scene in real time. It uses an IR structured light pattern to estimate the depth of the scene, and due to this structured light method, several distortions appear in the depth calculation. Holes and line distortion are two major degradations among them [2]. Holes, which appear due to shadowing, can be filled quite accurately using different inpainting methods assuming certain priors [1]-[3]. Line distortion, on the other hand, is critical to handle because it is difficult to estimate the actual region of each object of the scene individually; moreover, the distortion is both time varying and depth varying. Owing to all these factors, the removal of edge distortion becomes a challenging task. As human eyes are extremely sensitive to edge information, any operation that depends on the depth map looks very artificial if the edge rectification is absent or not properly done. Thus, it is immensely important to rectify the distorted edges of the depth map. The line distortion appears due to the inherent structured light method of Kinect: as Kinect uses an IR dot pattern to estimate depth, it is not certain that the pattern will capture the object edges properly, and the depth measurement at the object edges becomes erroneous most of the time. As the depth estimation itself deteriorates with increasing distance, the line distortion also increases. Another factor that affects depth based operations is the improper alignment of the depth edges and the RGB edges. As the RGB camera and the IR sensor of Kinect are located apart, we need a calibration before capturing the scene [4], [6]. Though there are several excellent


algorithms for calibrating Kinect, it is very difficult to align an object edge and the corresponding depth edge exactly due to the inherent distortion [4], [5]. Several authors have proposed different methods for hole filling [1], [7], [8], but the removal of line distortion has not been properly investigated yet. Although there are a few methods for edge modification, the majority of them do not give sharp edges that are properly aligned with the RGB image after processing [9]-[11]. The method proposed by Chen et al. uses bilateral filtering, which reconstructs the depth map quite well and preserves the structure of the objects in the scene; however, it does not remove line distortion [9]. Camplani proposed two methods for depth modification using Kalman filtering and bilateral filtering; here again the post-processed depth map does not follow the RGB edges faithfully, and the methods are computationally expensive [6], [7]. Lai et al. proposed a method which uses structural information of the RGB scene for depth modification, but the depth modification depends largely on prior hole filling of the scene and it performs poorly for the large distortions present at distant objects [12]. Several other methods, such as denoising of the depth data captured using ToF cameras, could be used to refine the depth map captured by Kinect. However, this does not address the edge distortion problem, since a ToF camera, due to its different working principle, has considerably smaller distortion at the edges and the distortion does not increase drastically with distance [13]. We propose a novel approach which exploits the edge information present in the RGB scene to reconstruct the depth map. As a result, we obtain a modified depth map which is perfectly aligned with the RGB image without any edge distortion. This, in turn, ensures that any depth dependent application appears realistic and accurate.

Our contributions:





• In the first part of the algorithm, we propose a contour extraction algorithm to generate a continuous contour map.
• We propose a robust edge based inpainting approach to remove edge distortion and obtain a high accuracy depth map of the scene.
• The proposed method removes the misalignment problem that appears due to small errors in calibration and hole filling.

The rest of the paper is organized as follows: Section 2 discusses the proposed approach; Section 3 presents the experimental results; Section 4 discusses the results obtained using the proposed approach and concludes the paper.


Fig. 1. Block diagram: Removal of edge distortion (RGB image; depth map after hole filling; inpainting equation considering the contour points as perfect insulators; inpaint the contour points; final depth map).

II. PROPOSED APPROACH

A. Boundary Detection

The main objective of our algorithm is to remove the edge distortion in such a way that the objects in the RGB image and the corresponding depth map align perfectly. Before starting the edge modification, it is important to calibrate the Kinect, as the raw data from the RGB camera and the IR sensor have almost no correspondence because of their positions and viewing angles [4]. We use the method proposed by Z. Zhang to calibrate the Kinect using the chess board pattern [14]. In order to obtain refined object edges, the processing should follow the contours of the RGB scene, so the first goal is to extract the contour points from the RGB scene to define the region of interest. In the real world, ideal contours of a scene should be continuous and thin. Several methods have been proposed for detecting the contours of a scene [15]-[18]. We combine the algorithms proposed by Catanzaro et al. and Qu et al., with significant modifications, to produce continuous binary contours [16], [17]. The algorithm proposed by Catanzaro et al. produces smooth, continuous contours of a scene, but the contours are thick and non-binary in nature, and the algorithm also detects a good number of object edges of the scene. Normal thresholding cannot preserve the continuity of the contour points, so it is difficult to convert them to a thin binary contour. On the other hand, the contour extraction algorithm proposed by Qu et al. generates a thin binary contour of a scene, but it does not produce continuous edges and it also drops many contour edges. We adopt the method proposed by Catanzaro et al. as our initial contour estimation [16]. In spite of its fast operation and continuous estimation of the contour points, the algorithm detects edges which arise from different patterns within an object and do not belong to the object contour. Unlike some other methods for contour extraction, the algorithm proposed by Catanzaro et al. does not use the surround facilitation and surround suppression masks motivated by the Gestalt law [15]. These masks consider the neighbouring information of an edge pixel and help to separate pattern edges from object contours. The suppression-facilitation mask proposed by Qu et al. is therefore applied on the initially estimated contour [15] to suppress the object edges and to enhance the contour edges, keeping the edge continuity intact.

Consider an RGB scene I of size m x n with corresponding calibrated depth map D. Applying the algorithm proposed by Catanzaro et al., we first calculate the tentative contour map of the scene, C_t. Now, C_t will contain the contours of the scene as well as some texture edges. We apply the suppression-facilitation mask to each potential contour point, i.e., to each pixel \bar{p}_0 = (x_0, y_0) with intensity lower than the upper saturation value (255 in the case of uint8 images and 1 for binary images). The surround facilitation F_\sigma(\bar{p}_0) is defined as

F_\sigma(\bar{p}_0) = \sum_{p \in \text{F-region}} w_F(p, \bar{p}_0)\, \bar{C}_t(p)     (1)

where p(x, y) is a point in the facilitation region (F-region), \bar{C}_t is the inverted image of C_t, and w_F(p, \bar{p}_0) is defined as:

w_F(p, \bar{p}_0) = \frac{w_{FC}(p, \bar{p}_0)}{\sum_{q \in \text{F-region}} w_{FC}(q, \bar{p}_0)}     (2)

We calculate w_{FC}, which works as the probabilistic grouping cue, as:

w_{FC}(p, \bar{p}_0) = \exp\left(-\left(\frac{\bar{C}_t(p) - \bar{C}_t(\bar{p}_0)}{\sigma_g}\right)^{6}\right) \cdot \exp\left(-\frac{(p - \bar{p}_0)^2}{\sigma_d^2}\right) \cdot \exp\left(-\tan\left(\frac{\theta_E(p) - \theta_E(\bar{p}_0)}{2}\right)\right)     (3)

\sigma_g and \sigma_d represent the effect of gradient contrast and of distance from the pixel of interest, respectively, and can be computed as discussed in [15]. \theta_E denotes the orientation of the edge and can be calculated as:

\theta_E(p) = \tan^{-1}\left(\nabla_y I_\sigma(p) / \nabla_x I_\sigma(p)\right) - 90^{\circ}     (4)

\nabla_x I_\sigma(p) and \nabla_y I_\sigma(p) are the x-component and y-component of the scale dependent gradient of the RGB scene I, defined as:

\nabla_x I_\sigma(p) = (I * \partial g_\sigma / \partial x)(p), \qquad \nabla_y I_\sigma(p) = (I * \partial g_\sigma / \partial y)(p)     (5)

g_\sigma(\bar{r}) is a zero mean Gaussian function with variance \sigma^2, represented as

g_\sigma(\bar{r}) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}     (6)
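To make eqs. (3)-(6) concrete, the sketch below computes the scale dependent gradients, the edge orientation \theta_E, and the grouping cue w_{FC} for a pair of pixels. It is an illustration only, not the authors' code: the grayscale conversion, the scale sigma, and the values of sigma_g and sigma_d are assumptions (the paper derives sigma_g and sigma_d as in [15]), and arctan2 stands in for the arctangent of the gradient ratio.

```python
import numpy as np
from scipy import ndimage


def scale_gradients(I_gray, sigma=2.0):
    """Scale dependent gradients (eq. (5)) and edge orientation theta_E (eq. (4)).

    I_gray : 2-D float array, a grayscale version of the RGB scene I (assumption).
    sigma  : scale of the zero mean Gaussian g_sigma of eq. (6) (assumed value).
    """
    # Convolution of I with the x- and y-derivatives of a Gaussian, eq. (5).
    grad_x = ndimage.gaussian_filter(I_gray, sigma=sigma, order=(0, 1))
    grad_y = ndimage.gaussian_filter(I_gray, sigma=sigma, order=(1, 0))
    # Edge orientation, eq. (4); arctan2 stands in for arctan(grad_y / grad_x).
    theta_E = np.degrees(np.arctan2(grad_y, grad_x)) - 90.0
    return grad_x, grad_y, theta_E


def grouping_cue(C_bar, theta_E, p, p0, sigma_g=0.5, sigma_d=3.0):
    """Probabilistic grouping cue w_FC(p, p0) of eq. (3) for a pixel pair.

    C_bar   : inverted tentative contour map (2-D float array).
    theta_E : edge orientation map in degrees.
    p, p0   : (row, col) tuples.
    sigma_g, sigma_d : placeholders; the paper computes them as in [15].
    """
    dg = (C_bar[p] - C_bar[p0]) / sigma_g
    dist_sq = (p[0] - p0[0]) ** 2 + (p[1] - p0[1]) ** 2
    dtheta = np.deg2rad(theta_E[p] - theta_E[p0])
    return (np.exp(-dg ** 6)
            * np.exp(-dist_sq / sigma_d ** 2)
            * np.exp(-np.tan(dtheta / 2.0)))
```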

The surround suppression is computed as:

S_\sigma(\bar{p}_0) = \sum_{p \in \text{S-region}} w_S \, \bar{C}_t(p)     (7)

where w_S = 1/(s_{max} - s(\bar{p}_0)), s_{max} denotes the number of points in the S-region, and s(\bar{p}_0) is defined as:

s(\bar{p}_0) = \sum_{p \in \text{S-region}} a(p, \bar{p}_0)     (8)

a(p, \bar{p}_0) = \begin{cases} 1 & \text{if } \bar{C}_t(p) \le \beta \bar{C}_t(\bar{p}_0) \\ 0 & \text{if } \bar{C}_t(p) > \beta \bar{C}_t(\bar{p}_0) \end{cases}     (9)

\beta is a constant and can be evaluated as proposed in [15]. The final contour map C_t is decided as

C_t(\bar{p}_0) = \begin{cases} \cdots & \text{if } R(\bar{p}_0) = 1 \\ \cdots & \text{if } R(\bar{p}_0) = 0 \end{cases}     (10)

where

H(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}     (11)

and R(\bar{p}_0) is a classifier defined as:

R(\bar{p}_0) = \begin{cases} 1 & \text{if } s(\bar{p}_0) \ge S_t \\ 0 & \text{if } s(\bar{p}_0) < S_t \end{cases}     (12)

S_t is a threshold which depends on the captured RGB scene and can be calculated as proposed in [15]. After calculating the final contour map C_t, we use the binarization proposed in [19] to get the final binary contour map C_b: C_b(\bar{p}) = 1 if \bar{p} is a contour point and C_b(\bar{p}) = 0 otherwise.
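The texture/contour decision of eqs. (8), (9), (11) and (12) can be sketched as below. This is only a schematic reading of the formulas under assumptions that are not in the paper: the S-region is taken as a square window around \bar{p}_0, and beta and S_t are placeholders rather than the values derived in [15].

```python
import numpy as np


def ramp(z):
    """H(z) of eq. (11): pass non-negative values, clip negatives to zero."""
    return np.maximum(z, 0.0)


def classify_contour_point(C_bar, y0, x0, beta=1.2, S_t=None, half=5):
    """Classifier R(p0) of eq. (12), built from s(p0) of eqs. (8)-(9).

    C_bar : inverted tentative contour map (2-D float array).
    half  : half-width of a square S-region around p0 (an assumption).
    beta, S_t : constants; placeholders here, derived as in [15] in the paper.
    """
    region = C_bar[max(0, y0 - half):y0 + half + 1,
                   max(0, x0 - half):x0 + half + 1]
    # a(p, p0) of eq. (9): 1 where the response does not exceed beta * C_bar(p0).
    a = (region <= beta * C_bar[y0, x0]).astype(np.float64)
    s = a.sum()                      # s(p0), eq. (8)
    if S_t is None:
        S_t = 0.5 * region.size      # placeholder threshold (paper follows [15])
    return 1 if s >= S_t else 0      # R(p0), eq. (12)
```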

Fig. 2. Edge guided inpainting. The pixels in \Psi_{d_e} are known depths and the pixels in \Psi_{u_e} are unknown depths. The dotted lines are the normals from the boundaries of the known regions. The red line is the contour C_b.

B. Line Distortion Removal

Kinect depth estimation suffers severely from occlusion and shadowing. Any missing estimate at a certain point is shown as a black pixel in the depth map. These bad pixels appear due to the structured light method used by Kinect for depth measurement and are widely referred to as 'holes' in the literature. Hole filling is an interesting problem and several authors have proposed various methods for it. In our algorithm, hole filling can be viewed as a pre-processing step, and we apply the method proposed by Matyunin et al. to fill the holes initially [1]. After eliminating all the missing pixels, we initiate the process for the removal of line distortion. The process is divided into two steps: puncturing of probable distorted contour points with morphological filtering, and edge-guided inpainting.

1) Morphological Puncturing: After calculating the binary contour map, we modify our depth map as follows:

d_p(\bar{p}) = d_h(\bar{p}) \, \bar{C}_f(\bar{p})     (13)

where d_h is the estimated depth after hole filling and \bar{C}_f is the complement of C_f, which is defined as

C_f = \bigcup_{t \in k} (C_b)_t     (14)


where (C_b)_t denotes the translation of the binary image C_b to the point t and k is the structural element of the morphological operation. C_f can be simplified further as

C_f = \{\, y \in \mathbb{Z}^2 \mid y = a + t,\ \text{for some } a \in C_b \text{ and } t \in k \,\}     (15)

In d_p, the wrongly estimated depth pixels that appear due to the edge distortion are removed. However, some of the properly estimated depth pixels are also dropped in this process, so we need inpainting to fill the punctured pixels with proper depth estimates.
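A minimal sketch of the puncturing step of eqs. (13)-(15) is shown below. It assumes an OpenCV-style dilation with a square structuring element (a 7 x 7 element is used in the experiments) and marks punctured pixels with 0 so that the later inpainting stage can identify them; these conventions are illustrative, not taken from the paper.

```python
import numpy as np
import cv2


def puncture_depth(d_h, C_b, ksize=7):
    """Morphological puncturing of eqs. (13)-(15).

    d_h   : hole-filled depth map (2-D array).
    C_b   : binary contour map, 1 at contour points, 0 elsewhere.
    ksize : size of the square structuring element k (7 x 7 in the experiments).
    """
    k = np.ones((ksize, ksize), np.uint8)
    # C_f: dilation of C_b by k, i.e. the union of translations in eqs. (14)-(15).
    C_f = cv2.dilate(C_b.astype(np.uint8), k)
    # d_p = d_h restricted to the complement of C_f, eq. (13).
    d_p = d_h.copy()
    d_p[C_f > 0] = 0   # punctured pixels, to be re-estimated by inpainting
    return d_p, C_f
```

The dilation widens the RGB contour into a band, so the depth pixels most likely to be corrupted by line distortion fall inside the punctured region.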

2) Edge Guided Inpainting: Inpainting is a well known method to replace wrong or unwanted pixels using neighbouring pixel information. The well known methods, e.g. patch based inpainting, exemplar based inpainting and inpainting using isotropic or anisotropic heat diffusion, can neither handle big missing areas nor give sharp edges [20]-[22]. We propose an edge guided method to inpaint large missing areas while keeping the edge information intact. In normal inpainting, we do not have any prior knowledge about the missing region, which, in turn, makes the inpainting process more challenging. But in the depth inpainting of Kinect data, we already have the RGB scene, and the contour information extracted from the RGB image can be used as a prior to make the inpainting process simpler and more accurate. As a depth map contains very little texture information, it is better to use diffusion to inpaint the missing regions of a depth map, for two reasons: 1. Diffusion performs well even for large unknown regions, where exemplar based or patch based inpainting fails. 2. It is easier to control the diffusion process to produce a good result provided the image to be inpainted contains little texture, and the depth map can be considered a very low textured image. Thus, it is better to use diffusion to reduce the complexity of the algorithm. We define the modified depth map with prior, d_p^e, as:

d_p^e = \cdots     (16)

For each point \bar{p} \in d_p^e such that \bar{p} \in e(C_f), we can consider a patch \Psi_P given by (17):

\Psi_P = \Psi_U \cup \Psi_d     (17)

where \Psi_U is the patch consisting of unknown pixels, \Psi_d is the patch of known pixels, and e(\cdot) is the Canny edge detector.

If, for a certain patch \Psi_{P_a}, \forall p_e \in C_b, p_e \notin \Psi_{P_a}, then we can inpaint the group of unknown pixels \Psi_{U_a} directly using the values present in \Psi_{d_a}. We use the heat flow model proposed by Bertalmio et al. [22] for inpainting the pixels of \Psi_{U_a}. As the depth map contains almost no texture, the heat flow equation can estimate the missing pixels quite accurately. The inpainting in this case is defined as:

\Psi_{U_a}^{n+1}(\bar{p}) = \Psi_{U_a}^{n}(\bar{p}) + \Delta t \, \Psi_{U_a}^{nt}(\bar{p})     (18)

where n denotes the inpainting iteration, \Delta t denotes the rate of improvement and \Psi_{U_a}^{nt}(\bar{p}) is the update at the n-th iteration


calculated from the pixels of \Psi_{P_a}. The update term can be calculated as:

\Psi_{U_a}^{nt}(\bar{p}) = \delta \vec{L}^{n}(\bar{p}) \cdot \vec{N}^{n}(\bar{p})     (19)

where \vec{N}^{n}(\bar{p}) is the direction of propagation, calculated as the normal from e(C_f), i.e., from the boundary of the region to be inpainted, and \delta \vec{L}^{n}(\bar{p}) is the change in information of the surrounding pixels of \bar{p}, defined as

\delta \vec{L}^{n}(\bar{p}) = \left( L^{n}(i+1, j) - L^{n}(i-1, j),\; L^{n}(i, j+1) - L^{n}(i, j-1) \right)     (20)

where \bar{p} = (i, j) and L^{n}(\bar{p}) is the actual information to propagate. For a smooth surface, L^{n}(i, j) can be approximated using the Laplacian operator. At saturation, \Psi_{U_a}^{nt}(\bar{p}) becomes zero, we get \Psi_{U_a}^{n+1}(\bar{p}) = \Psi_{U_a}^{n}(\bar{p}), and the process for \bar{p} is stopped. If, for a certain patch \Psi_{P_e}, there is some p_e such that p_e \in C_b and p_e \in \Psi_{P_e}, i.e. there is a contour point inside the patch, then we inpaint the missing region \Psi_{U_e} as described in eq. (18) with some modifications in the update term. For \Psi_{U_e}, we define the update term \Psi_{U_e}^{nt}(\bar{p}) at the n-th iteration as:

\Psi_{U_e}^{nt}(\bar{p}) = 0 \quad \text{if } \mathcal{R}(\vec{N}^{n}(\bar{p})) \cap C_b \ne \emptyset     (21)

\Psi_{U_e}^{nt}(\bar{p}) = \delta \vec{L}^{n}(\bar{p}) \cdot \vec{N}^{n}(\bar{p}) \quad \text{otherwise}     (22)

where \mathcal{R}(\cdot) is an operator to select the pixels through which

\vec{N}^{n}(\bar{p}) is going. In other words, if, for a missing pixel estimate, the normal from the boundary of the region to be inpainted crosses any contour point, we drop that normal. After all the punctured points have been filled, only the contour points remain to be inpainted; each contour point is replaced with the minimum depth value among its nearest neighbouring pixels. In Fig. 2, we show how the inpainting process works. The figure shows a patch \Psi_{P_e} which includes a portion of C_b (shown in red). The normals from the boundary of the known regions, shown as yellow and orange lines, contribute differently to the inpainting. The normal which crosses the RGB contour C_b to reach the unknown point p_e, the orange line in this case, should be dropped from the update term of the point p_e; the update term for the missing pixel should depend only on the normals shown as green lines in the figure. The whole inpainting process can be visualized as a heat flow model in which the known pixels are the source of heat and the contour points act as perfect insulators with zero conductivity. For fast and reliable edge guided inpainting, we choose the size of \Psi_P as 3 x 3.
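The whole inpainting stage of eqs. (18)-(22) can be sketched as an iterative diffusion in which contour pixels never act as sources, which is the 'perfect insulator' behaviour described above. The code below is an illustrative approximation rather than the authors' implementation: it replaces the explicit boundary normals and the \mathcal{R}(\cdot) operator with a 4-neighbour average that simply excludes contour pixels, and the iteration count and \Delta t are assumed values.

```python
import numpy as np


def edge_guided_diffusion(d_p, C_b, n_iter=400, dt=0.5):
    """Illustrative edge guided inpainting of the punctured depth map d_p.

    d_p    : punctured depth map (2-D float array), 0 at pixels to be filled.
    C_b    : binary RGB contour map (1 at contour points).
    n_iter : number of diffusion iterations (assumed value).
    dt     : rate of improvement, Delta t of eq. (18) (assumed value).

    Approximation of eqs. (18)-(22): unknown pixels relax towards the mean of
    their 4-neighbours, but neighbours lying on the contour C_b are excluded,
    so no depth diffuses across an RGB contour (the 'insulator' behaviour).
    np.roll wraps at the image border, which is acceptable for a sketch.
    """
    depth = d_p.astype(np.float64)
    contour = C_b.astype(bool)
    to_fill = (depth == 0) & ~contour     # punctured, non-contour pixels
    usable = ~contour                     # contour pixels never act as sources

    for _ in range(n_iter):
        num = np.zeros_like(depth)
        cnt = np.zeros_like(depth)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
            ok = np.roll(np.roll(usable & (depth > 0), dy, axis=0), dx, axis=1)
            num += np.where(ok, nb, 0.0)
            cnt += ok
        avg = np.divide(num, cnt, out=np.zeros_like(num), where=cnt > 0)
        # Move unknown pixels towards the neighbour estimate (cf. eq. (18)).
        update = np.where(to_fill & (cnt > 0), avg - depth, 0.0)
        depth = depth + dt * update

    # Finally replace each contour point by the minimum depth among its
    # nearest non-zero neighbours, as described in the text.
    for y, x in zip(*np.nonzero(contour)):
        nbhd = depth[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
        vals = nbhd[nbhd > 0]
        if vals.size:
            depth[y, x] = vals.min()
    return depth
```

With d_p and C_b as produced by the puncturing sketch, edge_guided_diffusion(d_p, C_b) returns a refined map; the actual method propagates information along boundary normals rather than averaging, so this is only a structural analogue.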

III. EXPERIMENTAL RESULTS

We consider an RGB scene and the corresponding depth map captured by the Kinect and apply our algorithm after filling the holes in the depth map. Fig. 3(f) shows the contour of an RGB scene detected using our algorithm. Fig. 3 clearly shows that the contour detected by our algorithm has thin, continuous contour points, and that the edges appearing due to the textures in the scene are far fewer compared to 3(b) and 3(c). The continuity of the contour points is important, as it ensures smooth refinement of the edges.

Fig. 3. Contour detection: (a) Original RGB image; (b) Output of Canny edge detector; (c) Output of [16]; (d) Output of [15]; (e) Output of [17]; (f) Output of our algorithm.

As it is impossible to have a ground truth for a depth map captured by Kinect, we quantitatively measure the performance of our algorithm on different artificial patches. Each computer generated patch has a size of 50 x 50. Fig. 4 shows the edge refinement on typical distorted depth maps of artificial RGB patches. If the ground truth depth map for a given depth map is d_true and the refined depth map is d_ref, then the difference image is defined as d_diff = |d_true - d_ref|. The total number of non-zero elements in the difference image gives an idea of the efficiency of the reconstruction: a smaller number of non-zero pixels indicates a better reconstruction. As shown in Fig. 4(p)-(r), our algorithm completely outperforms the other methods. The total number of non-zero pixels (NP) in the difference images is tabulated in Table 1. Both of the other algorithms depend heavily on the initial depth map and introduce blur in the final refined patch due to filtering. On the other hand, our algorithm removes the distortion completely and produces sharp edges in each case.
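The evaluation metric used on the artificial patches is simple enough to state directly; the sketch below merely restates the definition of d_diff and the non-zero pixel count NP from the text, with placeholder array names.

```python
import numpy as np


def nonzero_pixels(d_true, d_ref):
    """Number of non-zero pixels (NP) in the difference image d_diff = |d_true - d_ref|.

    A smaller count means the refined depth map d_ref is closer to the ground
    truth d_true of the artificial patch.
    """
    d_diff = np.abs(d_true.astype(np.int64) - d_ref.astype(np.int64))
    return int(np.count_nonzero(d_diff))
```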

The algorithm has also been tested on a large number of actual RGB scenes and their corresponding depth maps captured by Kinect.


Table 1: Comparison of different methods for edge refinement

                     Patch (a)        Patch (b)        Patch (c)
                     NP     Time      NP     Time      NP     Time
Method of [10]       293    9.3       372    12.5      774    14.2
Method of [11]       498    5.8       681    6.4       1135   7.1
Our method           0      7.4       6      8.9       15     16.1

The improvement for each image is significant, and almost always the depth modification tracks the actual RGB contour points faithfully. As shown in Fig. 5(e)-(h), most of the other algorithms depend on the accuracy of the hole filling algorithm and miss the RGB contour points drastically if the hole filling is not perfect. Moreover, due to the filtering operation, they introduce blur near the edges. The edge refinement using our algorithm is shown in Fig. 5(i) & (j). All the depth modifications are done using a 7 x 7 structural element. As the outward pixels of the punctured depth are filled before the inner pixels, even if a contour edge is broken, the leakage due to diffusion is not significant. The red boxes show only a few of the significant rectifications made by our algorithm; each red box has been zoomed and placed in an inset of the respective image for better viewing. Most of the changes are subtle, and it is best to view the electronic copy at maximum zoom to fully appreciate them. To establish the scope of our algorithm, we also apply it to a video frame, where the edge alignment is more difficult if there is any motion. In Fig. 6, we compare the results when the final depth maps, refined using the algorithms of [10], [11] and ours, are used for segmenting the scene. As the algorithms proposed by [10] and [11] depend largely on the hole filling algorithm, any wrong estimation in that stage is not rectified later. Thus, Fig. 6(d) & (f) have extra regions around the body which are difficult to remove and do not look good perceptually. In our algorithm, the depth map aligns perfectly with the RGB scene, so there is no extra region around the body after segmentation.
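For the comparison in Fig. 6, the refined depth map is used to cut out the foreground. A minimal sketch of such a depth-based segmentation is given below; the depth threshold and the masking convention are assumptions for illustration, not details from the paper.

```python
import numpy as np


def segment_foreground(rgb, depth_refined, max_depth=1500):
    """Depth-based foreground segmentation, as used for the comparison in Fig. 6.

    rgb           : H x W x 3 colour frame.
    depth_refined : refined depth map aligned with the RGB frame.
    max_depth     : assumed foreground threshold (e.g. millimetres for Kinect).
    """
    mask = (depth_refined > 0) & (depth_refined < max_depth)
    segmented = rgb.copy()
    segmented[~mask] = 0    # background pixels are blanked out
    return segmented, mask
```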

IV. DISCUSSION AND CONCLUSION

The proposed algorithm refines the edges quite accurately in spite of the inaccurate depth estimation in the hole filling stage, which is carried out before any edge refining processing. The object edges do not hamper the refinement, as depth does not vary drastically within a particular object and any object edge has almost the same depth value around it; so, while inpainting, the depth remains the same inside a particular object. The time required by the algorithm is almost the same as that of the existing methods, but with a significantly higher improvement. Our algorithm works better than existing algorithms for static as well as dynamic scenes. The future scope of this work is the reduction of the complexity of the continuous contour detection so that the algorithm can be used in real time.


Fig. 4. Line distortion removal for artificial patches: (a)-(c) RGB images of size 50 x 50 each; (d)-(f) Ground truth; (g)-(i) Distorted depth maps; (j)-(l) Edge modification using the algorithm of [10]; (m)-(o) Edge modification using the algorithm of [11]; (p)-(r) Edge modification using the proposed algorithm.

REFERENCES

[1] S. Matyunin et al., Temporal filtering for depth maps generated by Kinect depth camera, 3DTV-CON, IEEE, 2011.
[2] L. Cruz et al., Kinect and RGBD images: Challenges and applications, 25th SIBGRAPI, IEEE, 2012.



Fig. 6. Line distortion removal from a video frame for segmentation: (a) Original RGB image; (b) Depth map with line distortion after hole filling; (c)-(d) Depth map after distortion removal using [10] and the respective foreground segmentation; (e)-(f) Depth map after distortion removal using [11] and the respective foreground segmentation; (g)-(h) Depth map after distortion removal using our algorithm and the respective foreground segmentation.

Fig. 5. Line distortion removal for original scenes: (a)-(b) Original RGB images; (c)-(d) Depth maps after hole filling; (e)-(f) Line distortion removal using the algorithm of [10]; (g)-(h) Line distortion removal using the algorithm of [11]; (i)-(j) Our algorithm.

[3] K. Essmaeel et al., Temporal denoising of Kinect depth data, SITIS, IEEE, 2012.
[4] K. Khoshelham & S. O. Elberink, Accuracy and resolution of Kinect depth data for indoor mapping applications, Sensors, 12.2: 1437-1454, 2012.
[5] J. Smisek et al., 3D with Kinect, Consumer Depth Cameras for Computer Vision, Springer London, 3-25, 2013.
[6] K. Khoshelham & S. O. Elberink, Accuracy and resolution of Kinect depth data for indoor mapping applications, Sensors, 12.2: 1437-1454, 2012.
[7] M. Camplani & L. Salgado, Efficient spatio-temporal hole filling strategy for Kinect depth maps, Proceedings of SPIE, Vol. 8920, 2012.


[8] X. Kang et al., A method of hole-filling for the depth map generated by Kinect with moving objects detection, BMSB, IEEE, 2012.
[9] L. Chen et al., Depth image enhancement for Kinect using region growing and bilateral filter, ICPR, IEEE, 2012.
[10] D. Miao et al., Texture-assisted Kinect depth inpainting, ISCAS, IEEE, 604-607, 2012.
[11] J. Yang et al., Depth recovery using an adaptive color-guided auto-regressive model, ECCV, Springer-Verlag, 158-171, 2012.
[12] P. Lai et al., Depth map processing with iterative joint multilateral filtering, PCS, IEEE, 9-12, 2010.
[13] C. Chen et al., A color-guided, region-adaptive and depth-selective unified framework for Kinect depth recovery, MMSP, IEEE, 9-12, 2013.
[14] Z. Zhang, A flexible new technique for camera calibration, Pattern Analysis and Machine Intelligence, IEEE Transactions, 22.11: 1330-1334, 2000.
[15] Z. Qu et al., Contour detection based on contextual influences, ICIA, IEEE, 2010.


[16] B. Catanzaro et al., Efficient, high-quality image contour detection, ICCV, IEEE, 2009.
[17] G. D. Joshi & J. Sivaswamy, A simple scheme for contour detection, VISAPP, 2006.
[18] P. Arbelaez et al., Contour detection and hierarchical image segmentation, Pattern Analysis and Machine Intelligence, IEEE Transactions, 33.5: 898-916, 2011.
[19] J. Canny, A computational approach to edge detection, Pattern Analysis and Machine Intelligence, IEEE Transactions, 6: 679-698, 1986.
[20] T. F. Chan & J. Shen, Nontexture inpainting by curvature-driven diffusions, Journal of Visual Communication and Image Representation, 12.4: 436-449, 2001.
[21] Z. Xu & J. Sun, Image inpainting by patch propagation using patch sparsity, Image Processing, IEEE Transactions, 19.5: 1153-1165, 2010.
[22] M. Bertalmio et al., Image inpainting, Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press/Addison-Wesley, 2000.
