Calculating Real World Object Dimensions from Kinect RGB-D Image Using Dynamic Resolution

Atif Anwer, Asim Baig
SciFacterz, Islamabad, Pakistan
{atif, asim.baig}@scifacterz.com

Rab Nawaz
Muhammad Ali Jinnah University, Islamabad, Pakistan
r_nawaz@hotmail.com

Abstract - One of the major research directions in robotic vision focuses on calculating the real world size of objects in a scene using stereo imaging. This information can be used in decision making for robots, manipulator localization in the workspace, path planning and collision prevention, augmented reality, object classification in images, and other areas that require object size as a feature. In this paper we present a novel approach to calculating real world object size using RGB-D images acquired from the Microsoft Kinect™, as an alternative to stereo imaging based approaches. We introduce a dynamic resolution matrix that estimates the size of each pixel of an image in real world units. The main objective is to convert the size of objects represented in the image from pixels to real world units (such as feet, inches, etc.). We verify our results using available open-source RGB-D datasets. The experimental results show that our approach provides accurate measurements.

Keywords - Kinect, RGB-D imaging, depth maps, dynamic resolution matrix, real world dimensions

I. INTRODUCTION

A lot of work has been done on detecting and segmenting objects using stereo images. Techniques such as depth map generation [1], which estimates depth information from a single input image, and point feature extraction, which uses object boundaries for object recognition and pose identification [2], are proven but somewhat complicated approaches to extracting depth information and segmenting objects from stereo images. Approaches such as [3] require calibration of stereo cameras, and their use in real world scenarios is complicated by occlusion and other issues inherent to stereoscopic imaging.

For effective manipulation of the real world environment, it is often necessary to know the sizes and dimensions of objects. This is analogous to human vision, which is very adept at recognizing objects whether they are small or large, placed further from or nearer to the eye. It is also human nature to recognize and classify objects based on their size and dimensions as an inherent property. Inferring the real world size of objects at a distance helps us make decisions regarding tasks to be performed, such as picking them up or manoeuvring around them while walking or driving. Similarly, in order to localize a robot in an environment, it needs to recognize and segment the objects in its surroundings. The size of the objects can be one of the features that help improve this segmentation process. Depth information has generally been acquired using stereo cameras or expensive hardware such as LIDARs or time-of-flight cameras, each having its own set of pros and cons [4]. With the advent of low cost RGB-D sensors such as the Microsoft Kinect™ and the Asus Xtion, modern segmentation techniques [5, 6, 7] utilizing both the colour and the depth information from these devices are being developed. However, the features calculated by these techniques do not include or attempt to estimate the real world size of the objects using RGB-D imagery.

In this paper, we propose a method for calculating the dimensions of rectangular surfaces and objects in meters, thereby calculating an important real world property that is unique to each object in an image. The objects can be in any orientation or pose in an image, within the Kinect's range. In a digital image, each pixel should ideally represent a same-sized segment of the real world, but this is not the case: each pixel represents a differently sized segment of the real world because of linear perspective, which is discussed in detail later in the paper. The proposed approach overcomes this issue of linear perspective by developing a dynamic resolution matrix. This matrix is generated using the depth image and the intrinsic parameters of the Kinect camera, and contains the actual distance in millimetres (or meters) that each pixel covers in the real world. It is then used to calculate the actual size of any object selected in the image.

The rest of the paper is structured as follows. Section 2 provides a brief description of the Kinect sensor and outlines the issue of linear perspective in the acquired images. Section 3 discusses the proposed approach for calculating the dynamic resolution matrix. Section 4 provides information regarding the experimental setup and methodology, with Section 5 discussing the results. Conclusions and future work are provided in Section 6.

II. OVERVIEW OF KINECT AND RGB-D IMAGING

This section begins by providing a brief introduction to the Kinect sensor and how it acquires and generates the depth image. The later part of the section covers the theory of linear perspective and its effect on digital imagery. The following section (Section 3) discusses the formulation of the dynamic resolution matrix from the depth image acquired from the Kinect.

Microsoft Kinect and RGB-D image

The Microsoft Kinect™ camera works by projecting a 3x3 repeated pattern of infra-red dots and detecting the disparities between the emitted pattern and the observed dot positions [8].

Fig. 1. Components inside the Microsoft Kinect: IR emitter, colour sensor, IR depth sensor and microphone array.

Fig. 3. Kinect axes.

The depth image acquired by the Kinect has a 4:3 aspect ratio (pixel resolution of 640x480), a horizontal field of view of 58.5 degrees and a vertical field of view of 45.6 degrees, as specified on MSDN. The depth stored in the image is the perpendicular distance from the object to the camera-laser plane rather than the actual distance from the object to the sensor, as shown in Fig. 2 and explained in [9].

Fig. 2. Kinect depth distance vs actual distance.

The raw values returned by the Kinect's depth sensor are inversely proportional to the depth in the image, so objects closer to the sensor produce higher raw values than objects further away. Each depth pixel is represented by one 16-bit unsigned integer in which the 13 high-order bits contain the depth value (in millimetres). Any depth value outside the reliable range, or at locations where the depth could not be calculated (mostly sharp edges and IR-absorbing or reflective surfaces), is replaced with zero.
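As a small illustration of the packed depth format just described, the sketch below (our own illustration, not code from the paper) recovers the millimetre depth from a raw 16-bit value by discarding the three low-order bits:

import numpy as np

def raw_to_depth_mm(raw_frame):
    # The 13 high-order bits of each 16-bit value hold the depth in
    # millimetres (as described above), so shifting right by 3 recovers it.
    # Zero still marks pixels with no reliable depth.
    return np.asarray(raw_frame, dtype=np.uint16) >> 3

# Example: a depth of 1000 mm packed as 1000 << 3 == 8000 unpacks back to 1000.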

Linear perspective

When viewing an object with the naked eye, objects appear smaller as they become more distant because of linear perspective. The same concept also holds for digital images, including those acquired from the Kinect.

Fig. 4. Linear perspective.

III. DYNAMIC RESOLUTION

The image captured by the image sensor in a frame is affected by linear perspective; for example, two parallel lines appear to meet at a distant point. Each pixel covers a certain portion of the length of the scene, as illustrated in Fig. 5.


Fig. 5. Each pixel in a column represents a portion of length in the scene, where the area captured by pixel 'a' is smaller than the area captured by pixel 'b'.

Due to linear perspective, the area of the scene captured by a pixel is proportional to the distance of the object from the camera. Pixels closer to the camera therefore capture a smaller area 'a' than pixels representing objects further away, which capture a larger area 'b', as explained in Fig. 5. In other words, the number of pixels covering a given real-world area decreases with distance. We call this phenomenon of a varying number of pixels per unit of distance "dynamic resolution". Hence, the dynamic resolution decreases with the depth in an image, in both the column and the row direction, as shown in Fig. 6.

Fig. 6. Each pixel in a row represents a portion of width in the scene, where the area captured by pixel 'a' is smaller than the area captured by pixel 'b'.

This is because linear perspective and the angle of view stretch the scene represented by the pixels. Pixels in a row closer to the camera capture a smaller area 'a' horizontally than pixels representing objects further away, 'b', in a parallel row farther ahead, as illustrated in Fig. 6.

A. Depth Normalization

Once the Kinect depth images are acquired, the regions with missing information, represented by zero values, need to be filled prior to processing the images, similar to [6]. This process, called depth normalization, fills each zero-valued pixel with the statistical mode of the surrounding 25 pixels, as shown below using the Freiburg dataset (fr3/cabinet).

Fig. 7. Depth image from Kinect with zero values.

Fig. 8. Depth image after normalization.

Using the statistical mode returns sharper edges than using the statistical mean, since the mode only assigns the most frequently occurring value among the surrounding 25 pixels to the centre pixel. Other methods such as bilateral filters can also be used for depth normalization; however, the results of using the statistical mode were verified to be reasonably accurate and fast enough to compute in real time.
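A minimal sketch of this hole-filling step is given below. It is our own illustration rather than the authors' Matlab implementation, and it assumes the depth image is a NumPy array in which zeros mark missing values; each zero pixel is replaced by the statistical mode of its 5x5 neighbourhood.

import numpy as np

def normalize_depth(depth):
    # Fill zero-valued (missing) depth pixels with the mode of the
    # surrounding 5x5 window (25 pixels), as described in the text.
    padded = np.pad(depth, 2, mode="edge")
    filled = depth.copy()
    zero_rows, zero_cols = np.nonzero(depth == 0)
    for r, c in zip(zero_rows, zero_cols):
        window = padded[r:r + 5, c:c + 5].ravel()
        window = window[window > 0]            # ignore other missing pixels
        if window.size:                        # mode of the valid neighbours
            values, counts = np.unique(window, return_counts=True)
            filled[r, c] = values[np.argmax(counts)]
    return filled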


B. Dynamic resolution matrices

We can find the value for the row resolution as:

\( res_{row} = \dfrac{2\,H_i \tan(AOV_{row}/2)}{\text{image width}} \)    (1)

Similarly, we can calculate the column resolution as:

\( res_{col} = \dfrac{2\,H_i \tan(AOV_{col}/2)}{\text{image height}} \)    (2)


Equations (1) and (2) form the dynamic resolution matrices. We use these resolutions to obtain the width and the height of the object, respectively. In order to calculate the res_row and res_col matrices for the Kinect, we make use of the following Kinect properties and technical specifications:

Angle of View (AOV): The Kinect's IR camera has a horizontal angle of view (AOV_row) of 58.5 degrees and a vertical angle of view (AOV_col) of 45.6 degrees.

Image Resolution: The Kinect outputs a depth image with a resolution of 640x480 pixels. The actual IR sensor has a resolution of 1280x1024; however, an image of 1280x960 is received by the Kinect drivers, and the final *.png image is saved with a 4:3 aspect ratio.

Distance from Camera to Object (H_i): This value is the 11-bit depth value at the output of the depth image. For the Freiburg dataset, however, the value is simply the decimal value of each element divided by 5000.
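The sketch below shows how equations (1) and (2) can be turned into the two per-pixel matrices using the constants listed above. It is an illustrative implementation under our own assumptions (NumPy arrays, depth already converted to meters), not the authors' code.

import numpy as np

AOV_ROW_DEG = 58.5    # horizontal angle of view of the Kinect IR camera
AOV_COL_DEG = 45.6    # vertical angle of view
IMAGE_WIDTH = 640     # depth image resolution
IMAGE_HEIGHT = 480

def dynamic_resolution_matrices(depth_m):
    # Return (res_row, res_col): the real-world size in meters covered by
    # each pixel in the row (width) and column (height) directions,
    # following equations (1) and (2); depth_m holds H_i for every pixel.
    res_row = 2.0 * depth_m * np.tan(np.radians(AOV_ROW_DEG) / 2.0) / IMAGE_WIDTH
    res_col = 2.0 * depth_m * np.tan(np.radians(AOV_COL_DEG) / 2.0) / IMAGE_HEIGHT
    return res_row, res_col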

C. Calculating dimension of objects

Once the three matrices (depth, row resolution and column resolution) have been created, the dimensions of an object that directly faces the Kinect (i.e., lies perpendicular to its z-axis) can be calculated simply by numerically adding the resolution value of each pixel along the perimeter of the segmented object. However, it is rarely the case that an object is exactly perpendicular to the Kinect; the objects are usually seen at a perspective, along an arbitrary direction. For calculating and verifying the results of the method on the dataset, the objects and surfaces to be measured were taken manually from the user as input via a GUI in Matlab®. The user was required to point out the vertices of the object. Once the vertices were input, the point-slope form of a straight line (y - y1 = m(x - x1)) was used to calculate the slope of each edge. The calculated slope was used to identify the matrix element indices of the pixels along the perimeter of the selected polygon, and their values were added to calculate the estimated height and width of the polygon.
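A simplified sketch of this measurement step is shown below. It assumes two user-selected vertices per edge and walks the straight line between them, following the point-slope relation, while summing the appropriate per-pixel resolution; the function name and the use of NumPy are our own choices, not the paper's Matlab implementation.

import numpy as np

def edge_length_m(p1, p2, res_row, res_col):
    # Estimate the real-world length of the edge between pixel vertices
    # p1 = (x1, y1) and p2 = (x2, y2) by rasterising the line
    # y - y1 = m(x - x1) and summing per-pixel resolution values.
    # res_row / res_col are the matrices from equations (1) and (2).
    x1, y1 = p1
    x2, y2 = p2
    steps = int(max(abs(x2 - x1), abs(y2 - y1))) + 1
    xs = np.linspace(x1, x2, steps).round().astype(int)
    ys = np.linspace(y1, y2, steps).round().astype(int)
    # Horizontal travel contributes width (res_row); vertical travel height (res_col).
    dx = np.abs(np.diff(xs))
    dy = np.abs(np.diff(ys))
    length = np.sum(dx * res_row[ys[1:], xs[1:]]) + np.sum(dy * res_col[ys[1:], xs[1:]])
    return float(length)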

Fig. 10. Polygon created from vertices selected by the user in the fr3/cabinet dataset, on a vertical surface of the cabinet.

IV. METHODOLOGY SUMMARY

In order to validate our results, we use the publicly available datasets from the Computer Science Department, Technical University of Munich, Germany [10], titled "Freiburg RGB-D Dataset", released under the Creative Commons Attribution License. The depth images provided by the Freiburg dataset are scaled by a factor of 5000, i.e., a pixel value of 5000 in the depth image corresponds to a distance of 1 meter from the camera (instead of 1000 representing 1 meter). Therefore, the Freiburg depth images need to be divided by 5000 to obtain the actual depth value of each pixel in meters.

The technique presented in this paper was tested on two datasets. The fr1/xyz dataset consists of 798 frames of RGB and depth images taken with a Kinect of a general office table, as shown in Fig. 9. Here the "Multiple View Geometry" textbook and the Samsung monitor (Samsung SyncMaster 2494HM) were the objects with known dimensions. Tests were also done on the fr3/cabinet dataset, which consists of 1147 frames of RGB and Kinect depth images taken while moving 360 degrees around a drawer cabinet. The dimensions of the cabinet, as confirmed by members of the Computer Vision Group, TUM, are W: 43.7 cm, D: 78.7 cm, H: 56.5 cm (from the floor). All images in both datasets were read sequentially and the dimensions were calculated multiple times on various frames.

Fig. 9. Polygon created from vertices selected by the user in the fr1/xyz dataset on the "Multiple View Geometry" book.

The method of calculating dimensions is summarized as follows.

After depth normalization (as explained in Section III-A) and conversion to actual depth data (by dividing by 5000, as explained above), we are left with an image whose values represent the depth of each pixel, in meters, from the Kinect camera. Two matrices, with dimensions equal to those of the Kinect depth image, are then created by parsing the depth image once with the res_row formula and once with the res_col formula. As a result, the three resulting matrices are the Kinect depth image in meters, a matrix in which each element holds the real-world length (in meters) covered by that pixel in the row direction, and a matrix in which each element holds the real-world length covered in the column direction. Most importantly, these values accommodate the scaling of the scene due to linear perspective, in both the vertical and the horizontal direction.
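Putting these steps together, a hedged end-to-end sketch of the processing just described might look as follows. The use of OpenCV and NumPy is our assumption (the paper's implementation was in Matlab), and normalize_depth and dynamic_resolution_matrices are the helper functions sketched in Section III.

import cv2
import numpy as np

FREIBURG_DEPTH_SCALE = 5000.0   # 5000 units per meter in the Freiburg depth PNGs

def process_frame(depth_png_path):
    # Load one Freiburg depth frame, normalize it and build the per-pixel
    # resolution matrices (illustrative pipeline only).
    raw = cv2.imread(depth_png_path, cv2.IMREAD_UNCHANGED)      # 16-bit depth PNG
    filled = normalize_depth(raw)                               # fill zero pixels (Sec. III-A)
    depth_m = filled.astype(np.float64) / FREIBURG_DEPTH_SCALE  # convert to meters
    res_row, res_col = dynamic_resolution_matrices(depth_m)     # equations (1) and (2)
    return depth_m, res_row, res_col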

V. RESULTS AND CALCULATED DIMENSIONS

The dimensions of the book, the monitor and the cabinet were calculated for various frames. The frames were selected as those in which the target object was clearly visible with minimal blur and camera shake, but located at random positions, some closer to the centre of the image and others closer to the edges. The dimensions obtained are shown in the following tables, along with the accuracy of the estimated size (where accuracy represents the closeness, in percent, to the actual size of the object).

TABLE I: RESULTS FOR TARGET OBJECT IN FR1/XYZ DATASET
Freiburg Dataset: fr1/xyz
Target Object: "Multiple View Geometry" book (placed near the left edge of the table)
Actual dimensions of the object (h x w x d inches): 9.6 x 6.7 x 1.3
Actual dimensions of the object (h x w x d meters): 0.243 x 0.170 x 0.033

# | Frame | Calculated height (m) | Calculated width (m) | Accuracy (height) | Accuracy (width)
1 | 57  | 0.214 | 0.156 | 88.07% | 91.76%
2 | 260 | 0.210 | 0.150 | 86.42% | 88.24%
3 | 465 | 0.221 | 0.157 | 90.95% | 92.35%
4 | 561 | 0.225 | 0.162 | 92.59% | 95.29%
5 | 689 | 0.241 | 0.162 | 99.18% | 95.29%

TABLE II: RESULTS FOR TARGET OBJECT (MONITOR) IN FR1/XYZ DATASET
Freiburg Dataset: fr1/xyz
Target Object: Samsung SyncMaster 2494HM monitor
Actual dimensions of the object (h x w x d inches): 16.54 x 22.56 x 9.84
Actual dimensions of the object (h x w x d meters): 0.420 x 0.573 x 0.249

# | Frame | Calculated height (m) | Calculated width (m) | Accuracy (height) | Accuracy (width)
1 | 80  | 0.446  | 0.467  | 89.93% | 81.50%
2 | 88  | 0.3625 | 0.5118 | 86.31% | 89.32%
3 | 169 | 0.377  | 0.712  | 89.76% | 80.47%
4 | 216 | 0.379  | 0.5404 | 90.24% | 94.31%
5 | 305 | 0.53   | 0.533  | 78.79% | 93.02%

TABLE III: RESULTS FOR TARGET OBJECT (CABINET) IN FR3/CABINET DATASET
Freiburg Dataset: fr3/cabinet
Target Object: Front/back faces of the cabinet
Actual dimensions of the object (h x w x d cm): 56.5 x 43.7 x 78.7
Actual dimensions of the object (h x w x d meters): 0.565 x 0.437 x 0.787

# | Frame | Calculated height (m) | Calculated width (m) | Accuracy (height) | Accuracy (width)
1 | 10  | 0.4551 | 0.36  | 80.55% | 82.38%
2 | 269 | 0.557  | 0.296 | 98.58% | 67.73%
3 | 334 | 0.535  | 0.409 | 94.69% | 93.59%
4 | 345 | 0.517  | 0.454 | 91.50% | 96.25%
5 | 438 | 0.609  | 0.360 | 92.77% | 82.38%

TABLE IV: RESULTS FOR TARGET OBJECT (CABINET) IN FR3/CABINET DATASET
Freiburg Dataset: fr3/cabinet
Target Object: Side faces of the cabinet
Actual dimensions of the object (h x w x d cm): 56.5 x 43.7 x 78.7
Actual dimensions of the object (h x w x d meters): 0.565 x 0.437 x 0.787

# | Frame | Calculated height (m) | Calculated depth (m) | Accuracy (height) | Accuracy (depth)
1 | 21  | 0.364 | 0.720 | 64.42% | 91.49%
2 | 89  | 0.548 | 0.611 | 96.99% | 77.64%
3 | 151 | 0.505 | 0.698 | 89.38% | 88.69%
4 | 158 | 0.498 | 0.723 | 88.14% | 91.87%
5 | 643 | 0.632 | 0.693 | 89.39% | 88.06%

As shown in the above tables, the object dimensions are successfully estimated to be quite close to the original dimensions of the objects using the proposed technique. The objects in the frames are at various angles (some very acute) and poses when the measurements were taken, verifying that the calculated dynamic resolutions are reasonably accurate and that the area represented by each pixel is close to the values of the matrices created by the res_col and res_row formulas.

The possible errors in the calculation, resulting in smaller or larger estimated dimensions, are attributed to the following problems:

1) The Kinect sensor itself is not as accurate as some of the other devices used for precise depth measurement, and it suffers from calibration issues and certain internal errors [11]. The Kinect's low cost means the trade-off in accuracy is nevertheless justified: using the technique in this paper, the dimensions of objects can be measured with reasonable accuracy at a much lower cost than using LIDARs or other precise ranging devices.

2) The Kinect RGB and depth images are not synced and have slightly different timestamps, so there is a varying offset between the two images. It is estimated that the technique can give much better results on real-time synced Kinect images.

3) Depth normalization fills the zero values in the depth image (which are inherent to the lower-accuracy Kinect) with estimated values. These zero values generally occur at IR-absorbing surfaces, sharp edges or glossy/reflective surfaces. Since the exact edges of objects cannot be found, the exact values are hard to calculate, although the technique in this paper estimates dimensions to within roughly 90-95%.

VI. CONCLUSIONS

The technique presented in this paper provides an easy and fast method of measuring the dimensions of objects using low cost sensors such as the Microsoft Kinect. The technique is useful in non-contact metrology setups where the distance between the camera and the objects being measured is not known a priori.

In future work, we aim to extend the proposed approach to calculate the actual size of objects of various other shapes, such as circles. Including the pitch angle in the calculations will improve the accuracy; one interesting application is that, by using depth and pitch angle, we can estimate the horizontal distance of an object from the robot. More datasets will also be tried in the future.

Presently the object segmentation is performed manually. Future work may include automatic segmentation, along with sub-pixel edge detection of objects, before their sizes are calculated.

ACKNOWLEDGMENT

The authors would like to thank the members of the Computer Science Department, Technical University of Munich, Germany, for providing the open-source dataset, and especially Dr. Jürgen Sturm and Dr. Felix Endres for answering queries related to the dimensions of objects in the dataset.

REFERENCES

[1] S. Battiato, A. Capra, S. Curti, and M. La Cascia, "3D stereoscopic image pairs by depth-map generation," in Proc. 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), 2004, pp. 124-131.
[2] B. Steder, R. B. Rusu, K. Konolige, and W. Burgard, "Point feature extraction on 3D range scans taking into account object boundaries," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 2601-2608.
[3] S. Oh, S. Park, and C. Lee, "Vision based platform monitoring system for railway station safety," in Proc. 7th International Conference on ITS Telecommunications (ITST), 2007, pp. 1-5.
[4] V. Morell-Gimenez, M. Saval-Calvo, J. Azorin-Lopez, J. Garcia-Rodriguez, M. Cazorla, S. Orts-Escolano, and A. Fuster-Guillo, "A comparative study of registration methods for RGB-D video of static scenes," Sensors, vol. 14, no. 5, pp. 8547-8576, 2014.
[5] A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich, and M. Vincze, "Segmentation of unknown objects in indoor environments," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 4791-4796.
[6] G. Triantafyllidis, M. Dimitriou, T. Kounalakis, and N. Vidakis, "Detection and classification of multiple objects using an RGB-D sensor and linear spatial pyramid matching," Electronic Letters on Computer Vision and Image Analysis, vol. 12, no. 2, 2013.
[7] I. Dryanovski, W. Morris, R. Kaushik, and J. Xiao, "Real-time pose estimation with RGB-D camera," in Proc. IEEE Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 2012, pp. 13-20.
[8] B. Langmann, K. Hartmann, and O. Loffeld, "Depth camera technology comparison and performance evaluation," in Proc. ICPRAM (2), 2012, pp. 438-444.
[9] M. R. Andersen, T. Jensen, P. Lisouski, A. K. Mortensen, M. K. Hansen, T. Gregersen, and P. Ahrendt, "Kinect depth sensor evaluation for computer vision applications," Technical Report, 2012.
[10] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 573-580.
[11] K. Khoshelham and S. O. Elberink, "Accuracy and resolution of Kinect depth data for indoor mapping applications," Sensors, vol. 12, no. 2, pp. 1437-1454, 2012.
