Proceedings of the 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013)

3D Modeling of Indoor Environments Using KINECT Sensor

Arafa Majdi, Mohamed Chafik Bakkay and Ezzeddine Zagrouba
Équipe SIIVA / Laboratoire Riadi, Institut supérieur d'informatique, Université de Tunis El Manar
2 avenue Bayrouni, 2080 Ariana, Tunisia
[email protected], [email protected], [email protected]

Abstract—3D scene modeling for indoor environments has stirred significant interest in the last few years. The obtained photo-realistic renderings of internal structures are used in a wide variety of civilian and military applications such as training, simulation, heritage conservation, localization and mapping. However, building such complicated maps poses significant challenges for both the computer vision and robotics communities (low lighting and textureless structures, transparent and specular surfaces, registration and fusion problems, coverage of all details, real-time constraints, etc.). Recently, the Microsoft Kinect sensor, originally developed as a gaming interface, has received a great deal of attention for its ability to produce high-quality depth maps in real time. However, this active sensor fails completely on transparent and specular surfaces for several technical reasons. As these objects should be included in the 3D model, we have investigated methods to capture them without any modification of the hardware. In particular, the passive Structure from Motion (SFM) technique can be efficiently integrated into the reconstruction process to improve the detection of these surfaces. In fact, we propose to fill the holes in the depth map provided by the Infrared (IR) Kinect sensor with new values passively retrieved by the SFM technique. This makes it possible to acquire a large amount of additional depth information from two consecutive RGB frames in a relatively short time. To preserve the real-time behavior of our approach, we select key RGB images instead of using all the available frames. The experiments show a strong improvement in indoor reconstruction as well as in transparent object inspection.

Keywords—3D modeling, RGB, Kinect, indoor environment, transparent objects

I. INTRODUCTION

The 3D reconstruction process is a set of structured steps by which a real scene is re-created, as faithfully as possible, within a virtual three-dimensional space using a computer. This can be done in a number of different ways, but usually involves the use of the visual and/or shape information available in a physical scene. The obtained 3D model can be used in many applications such as medicine, video games, training, virtual reality, and law enforcement. Over the last few decades, laser and Lidar [1] sensors have become the most widely used sensors for large-scale 3D modeling of building interiors and exteriors. However, these sensors are very expensive and require sophisticated platforms. Recently, the introduction of the Microsoft Kinect sensor has given researchers access to a personal, highly performant yet low-cost 3D sensor. These RGB-D cameras capture visual data along with per-pixel depth information in real time (30 RGB-D images per second). Although the Kinect depth data is compelling, particularly compared to other commercially available depth cameras, it has important drawbacks for the 3D reconstruction process. In fact, the Infra-Red (IR) Kinect sensor provides accurate depths only up to a limited distance (typically less than 5 m) and for a limited field of view (∼60°). In addition, compared with a laser scanner, the depth estimates are not very accurate (∼3 cm at 3 m depth). Moreover, the Kinect cannot estimate depths reliably for all types of object surfaces: transparent and reflective objects such as glasses, pans, bottles and pitchers cannot be detected and may cause serious problems leading to a partial or complete failure of the acquisition step (Fig. 1). These technical limits have been addressed by only a few works in the state of the art [11], [12], and several problems remain open. In this paper, we describe a new method to overcome these technical limitations. We present a new reconstruction system (Fig. 2) which creates a complete 3D model of a physical scene based on two different vision systems: the active Microsoft Kinect sensing system and the passive Structure from Motion (SFM) technique. The system takes live RGB and depth data from a moving Kinect camera and passively retrieves an additional depth map using two consecutive RGB key-frames. This allows filling most of the existing holes in the active map. The Kinect camera can be used in any indoor space to reconstruct an accurate 3D model of the physical scene within a few seconds.


Fig. 1. Failure of the Kinect depth camera on transparent and specular surfaces. (a) Kinect depth map; (b) corresponding Kinect RGB image.


Fig. 2. The primary components of our system.

The proposed scanning system contains four main steps. First, it acquires the available visual and shape data using the Microsoft Kinect sensor. For each new acquisition, the SFM technique is used to fill the holes in the active depth map by adding new depth measurements. As the camera moves, more details of the physical scene are covered and the depth map is progressively enriched with additional depth data. As a result, we obtain an updated depth map that contains only a few holes. These data are then re-projected as a set of discrete 3D points (a point cloud). After the next acquisition, the consecutive 3D data are spatially aligned under the supervision of a parallel loop-closure detection thread. This thread matches the current observation against the previous ones, taking spatial considerations into account. If a loop closure is detected, a global alignment algorithm is executed, taking advantage of the combination of visual and shape information. Finally, we propose a surfel mesh representation to faithfully incorporate all input data into a single efficient representation.

The rest of this paper is structured as follows. The next section gives a brief summary of the most relevant related works. We also illustrate the Kinect failure cases caused by the use of a structured light technique to generate discrete depth range measurements of the physical scene, and in Section III we describe how to overcome these limitations, arguing all the technical choices. Then, in Section IV, we illustrate some experimental results performed to evaluate the performance of the system under different acquisition circumstances. Finally, we conclude the study and present our perspectives for future work.


II. RELATED WORK

The 3D reconstruction process usually starts with an acquisition step. There are two main classes of acquisition techniques: passive methods and active ones. In the first class, the shape information is typically inferred under general illumination and natural light. The shape can be retrieved with passive monocular techniques using a single camera, such as Shape from Motion and Shape from Focus, or with multiple cameras as in stereo vision methods. These passive systems are able to produce accurate, high-resolution 3D models under restrictive ideal conditions, but when they are used separately they are not suitable for the most general class of objects (e.g., the matching process can fail in dark or textureless places). In contrast to passive techniques, active sensing techniques directly obtain the shape information by triangulation [8], emitting structured light or another form of energy towards the object to be scanned. These active sensors provide very accurate depths, but they are better suited to measuring small objects because of their short standoff distance. Moreover, in many cases a single active sensor that satisfies all the requirements is still not available. Nowadays, there is a trend towards combining several sensors to detect more and more details of the filmed scene. For example, the authors in [10] combine a time-of-flight camera with a stereo camera to build 3D models of medium-sized objects in real time. The approach described in [7] uses a consumer RGB-D camera to obtain accurate models of large indoor environments. Using the same RGB-D camera, [3] incorporates various aspects of user interaction to make the 3D modeling process more robust and easier for large-scale usage. In our case, we use the Kinect RGB-D camera to acquire the visual and shape information of the scene. As mentioned previously, transparent and specular surfaces are very hard to acquire with such cameras. Recently, some initial works have been proposed to solve this problem. The authors in [11] proposed a cross-modal stereo method to be combined with the Kinect depth estimates in order to obtain reliable depths on specular and transparent objects, but its quality is limited, especially for Lambertian objects. I. Lysenkov et al. [12] present an algorithm based on a segmentation step and an edge-fitting process for recognition and pose estimation; however, it requires a training step in which a 3D model of the transparent object is created.

Once the 3D acquisition step is achieved, the obtained 3D images have to be registered so that they are aligned in a common coordinate frame. This process, called registration, is commonly cast as an optimization problem. We can divide registration methods into two main categories. The first one is based on feature description and matching techniques: the motion is obtained from the matching between discrete features in the overlapping image areas. The second class consists of iterative approaches that are faster and easier to implement than the previous ones: the motion is obtained by minimizing a cost function, typically through the iterative refinement of an initial transformation (e.g., given by a RANSAC algorithm). One of the most popular methods is the ICP algorithm [7]. It iterates between associating each point in one time frame with the closest point in the other frame and computing the rigid transformation that minimizes the distance between the point pairs. Recently, the robustness of ICP in 3D has been improved by incorporating ideas such as multi-agent approaches [4] and additional attributes like geometric descriptors, reflectance values and point-to-plane associations [9]. After the registration step, a merging process is required to join all the scans into a single non-redundant model. Merging methods can be classified into surface mesh integration approaches [7] and volumetric ones [12]. Finally, if the merged model is a polygon mesh, a post-processing step can be applied to reduce the number of polygons in low-curvature regions [5], after which the mesh can be smoothed with surface fairing methods [2].

III. SYSTEM OVERVIEW

The Microsoft Kinect camera uses a structured light technique to generate depth maps containing discrete range measurements of the scene. Theoretically, the camera depths z are provided for all physical points by active triangulation. Thus, we can calculate the disparity d for each pair of corresponding points (x1, x2) in the two images using equation (1), and then generate 3D point clouds in real time:

d = x_1 - x_2 = \frac{bf}{z}, \qquad (1)

where b is the baseline between the infrared (IR) emitter and the IR camera sensor, and f is the focal length of the pinhole model [7]. However, in practice there are many disturbing factors that make the Kinect depth camera provide wrong depth values. In fact, an object's depth is captured only if the distance between the object and the depth camera is larger than 50 cm. If this distance is beyond 5 meters, the intensity of the reflected IR ray becomes too low to be measured accurately; when this distance is less than 2 meters, the opposite phenomenon is observed and the signal becomes completely saturated. Furthermore, there are difficulties with the reflectance properties of object surfaces (both overly-absorbent and overly-reflective surfaces), and the structured light pattern is usually deviated as it passes through transparent or translucent objects. All these technical limitations may cause holes in the acquired depth images.
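To make Eq. (1) concrete, the following minimal sketch (our own illustration, not the authors' code; the baseline and focal-length values are assumptions that would come from the calibration step in practice) converts a Kinect-style disparity map into metric depths and flags unmeasurable pixels as holes:

```python
import numpy as np

# Assumed calibration values for illustration only; real values come from
# the calibration step described in Section III-A.
BASELINE_M = 0.075   # b: IR emitter to IR camera baseline, in meters
FOCAL_PX = 580.0     # f: focal length of the pinhole model, in pixels

def depth_from_disparity(disparity_px):
    """Invert Eq. (1): z = b*f / d. Pixels with non-positive disparity become holes (NaN)."""
    d = np.asarray(disparity_px, dtype=np.float32)
    z = np.full(d.shape, np.nan, dtype=np.float32)
    valid = d > 0
    z[valid] = (BASELINE_M * FOCAL_PX) / d[valid]
    return z

def disparity_from_depth(depth_m):
    """Forward form of Eq. (1): d = x1 - x2 = b*f / z."""
    z = np.asarray(depth_m, dtype=np.float32)
    return np.where(z > 0, (BASELINE_M * FOCAL_PX) / z, 0.0)
```

Outside the usable range described above, the returned disparities are unreliable, which is exactly what produces the holes discussed next.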

These holes in the reconstructed model are difficult to patch up automatically unless additional knowledge about the object is provided to the hole-filling algorithm [3]. To overcome these limitations, we propose a solution that consists of filling the holes in the depth map retrieved by the Kinect depth sensor with new values retrieved by the SFM technique. This helps to acquire a large amount of additional depth information from consecutive 2D RGB frames in a relatively short time. However, this approach raises the problem of selecting the RGB images to be involved in the passive SFM reconstruction process. Thus, instead of using all the available stereo images, we use the best consecutive stereo pairs (key-frames), chosen so that the camera motion between two successive Kinect poses is as large as possible. This allows reproducing the pinhole model while still keeping enough feature points to match the consecutive key-frames. The heuristic key-frame selection method is detailed in Section III-B.

A. Calibration step

In order to find corresponding pixels between the IR and RGB images, the stereo image pairs should first be rectified (row-aligned). Thus, we calculate the intrinsic and extrinsic parameters of the RGB and IR Kinect cameras. Once the calibration parameters of the IR and RGB cameras are obtained, we start by rectifying the two consecutive RGB image pairs. As they are row-aligned, we calculate the difference between their homologous pixels. As a result, we obtain the disparity map for every pair of RGB key-frames (Fig. 3).



Fig. 3. The disparity map in (c) is calculated based on the left RGB key-frame in (a) and the right one in (b).
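A possible OpenCV-based sketch of this step is given below. It is an illustration under stated assumptions rather than the authors' implementation: the intrinsic matrix K, the distortion coefficients and the relative motion (R, t) between the two key-frames are assumed to be available from calibration and RANSAC, and the block-matching parameters are illustrative.

```python
import cv2
import numpy as np

def rectify_and_match(img_left, img_right, K, dist, R, t, image_size):
    """Row-align two consecutive RGB key-frames, then compute their disparity map."""
    # Rectification: both images are remapped so homologous pixels share a scanline.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist, image_size, R, t)
    m1x, m1y = cv2.initUndistortRectifyMap(K, dist, R1, P1, image_size, cv2.CV_32FC1)
    m2x, m2y = cv2.initUndistortRectifyMap(K, dist, R2, P2, image_size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_left, m1x, m1y, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_right, m2x, m2y, cv2.INTER_LINEAR)

    # Dense matching on the rectified pair (semi-global block matching).
    gray_l = cv2.cvtColor(rect_l, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(rect_r, cv2.COLOR_BGR2GRAY)
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = sgbm.compute(gray_l, gray_r).astype(np.float32) / 16.0  # SGBM returns fixed-point values
    return disparity, Q  # Q can re-project disparities to 3D if needed
```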

B. Fusing SFM depth with Kinect depth

The first RGB frame is chosen as a key-frame image because it is the only one available; we denote it I1. The second key-frame, denoted I2, is chosen as far as possible from I1 while having at least M matched points with I1. This strategy guarantees a sufficient number of matches between the two selected consecutive key-frames. The camera motion is already calculated by the RANSAC algorithm, so it is useless to re-compute it with SFM. However, the camera motion is used as a precondition before applying the SFM technique: when the rotation or the translation between the two consecutive camera poses is less than a threshold, there is not enough motion to reproduce the pinhole model between the two consecutive images. After converting the disparity measurements from the consecutive stereo pairs into depth measurements using the SFM algorithm, we can directly compare these values with the active depths (pixels which have the same row and column index in the two depth maps). The invalid values are easily recognized in the Kinect raw depth output: these pixels are marked by the hardware with a depth integer value of 2047 = 0x7FF (i.e., the largest 11-bit integer value). As a result, we obtain an updated Kinect depth map, which is then converted into a 3D point cloud.
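The following sketch (an illustration of the two checks described above, not the authors' implementation) tests the minimal-motion precondition before running SFM and replaces raw Kinect values flagged as 2047 with passively retrieved depths; the motion thresholds are assumptions.

```python
import numpy as np

KINECT_NO_DEPTH = 2047          # raw value used by the hardware for unmeasured pixels (0x7FF)
MIN_TRANSLATION_M = 0.05        # assumed thresholds below which SFM is skipped
MIN_ROTATION_RAD = 0.05

def enough_motion(R, t):
    """Precondition: skip SFM when rotation and translation are both below threshold."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    return angle >= MIN_ROTATION_RAD or np.linalg.norm(t) >= MIN_TRANSLATION_M

def fuse_depths(kinect_raw, kinect_depth_m, sfm_depth_m):
    """Fill holes of the active Kinect depth map with passive SFM depths.
    All arrays are assumed pixel-aligned (same row and column indices)."""
    fused = kinect_depth_m.copy()
    holes = kinect_raw == KINECT_NO_DEPTH
    usable = holes & np.isfinite(sfm_depth_m) & (sfm_depth_m > 0)
    fused[usable] = sfm_depth_m[usable]    # passive values only where the sensor failed
    return fused
```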


C. Sparse alignment step and loop closure detection

In this step, we align the n-th acquisition with the best previous one. We use the RANSAC algorithm to find correspondences between the current acquisition at time i and the previous acquisition at time i−1. RANSAC takes as input a source RGB-D frame Ps and a target RGB-D frame Pt. It also has access to the previous relative transformation Tp, which is initialized to the identity at the beginning. First, SURF (Speeded Up Robust Features) is used to extract fast and reliable feature points from the RGB images. These keypoints are then associated with their corresponding depth values using the calibration parameters. Then, we use a standard stereo-vision-based re-projection error measure (2), as described in [7], to find the optimal transformation between the two sets of keypoints in pixel space:

T^{*} = \arg\min_{T} \frac{1}{|A_f|} \sum_{i \in A_f} \left\| \mathrm{Proj}\big(T(f_s^i)\big) - \mathrm{Proj}\big(f_t^i\big) \right\|^2, \qquad (2)

where A_f contains the associations between feature points in the two sets. Each term in the summation measures the squared distance between a feature point f_t^i in the target frame and the transformed pose of the associated feature point f_s^i in the source frame, and T is the transformation between the two feature sets. The Perform_RANSAC_Alignment algorithm is used to jointly optimize the data associations A_f and the transformation T*; we refer the reader to [7] for more details about this function. After a frame F has been aligned, we check for a loop closure, i.e., we detect whether the Kinect has returned to a previously visited location after discovering new places for a while. First, we define keyframes, a subset of aligned frames selected based on visual overlap. Then, we reuse the visual features to find a rigid transformation with the most recent keyframe. As long as the number of RANSAC inliers is above a threshold, we do not add F as a keyframe. As the camera continues to move, its view contains progressively fewer 3D feature point matches with the previous keyframe. The first frame that fails to match against the previous keyframe becomes the next keyframe.
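As an illustration of this sparse step only (a sketch, not the paper's code): SURF requires the opencv-contrib package, the thresholds are assumptions, and the Perform_RANSAC_Alignment routine of [7] is approximated here by OpenCV's PnP-RANSAC, which likewise minimizes a re-projection error.

```python
import cv2
import numpy as np

MIN_INLIERS = 40   # assumed threshold below which the frame becomes a new keyframe

def back_project(pt, depth, K):
    """Lift a pixel (u, v) with its Kinect depth to a 3D point (pinhole model)."""
    u, v = int(round(pt[0])), int(round(pt[1]))
    z = float(depth[v, u])
    return ((pt[0] - K[0, 2]) * z / K[0, 0], (pt[1] - K[1, 2]) * z / K[1, 1], z)

def sparse_align(rgb_s, depth_s, rgb_t, K):
    """SURF matching + RANSAC pose between a source RGB-D frame and a target RGB frame."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # needs opencv-contrib
    kp_s, des_s = surf.detectAndCompute(cv2.cvtColor(rgb_s, cv2.COLOR_BGR2GRAY), None)
    kp_t, des_t = surf.detectAndCompute(cv2.cvtColor(rgb_t, cv2.COLOR_BGR2GRAY), None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_s, des_t, k=2) if m.distance < 0.7 * n.distance]

    # Source keypoints lifted to 3D with their depths; target keypoints kept in pixel
    # space, so the optimized quantity is the re-projection error of Eq. (2).
    pairs = []
    for m in good:
        p3 = back_project(kp_s[m.queryIdx].pt, depth_s, K)
        if p3[2] > 0:                      # skip keypoints falling on depth holes
            pairs.append((p3, kp_t[m.trainIdx].pt))
    pts3d = np.float32([p for p, _ in pairs])
    pts2d = np.float32([q for _, q in pairs])
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None, reprojectionError=3.0)

    R, _ = cv2.Rodrigues(rvec)
    is_new_keyframe = (not ok) or inliers is None or len(inliers) < MIN_INLIERS
    return R, tvec, is_new_keyframe
```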

D. Dense alignment step

This step consists of transforming the 3D points of the source cloud Ps using the current transformation T. In the first iteration, T is initialized with the rigid transformation provided by the visual RANSAC algorithm when enough visual features are available. Then, for each point in Ps, the nearest point in the target point cloud Pt is determined. As in the local alignment step, we use a measure incorporating the re-projection error for feature point associations, as follows:

T^{*} = \arg\min_{T} \left[ \frac{1}{|A_f|} \sum_{i \in A_f} \left\| \mathrm{Proj}\big(T(f_s^i)\big) - \mathrm{Proj}\big(f_t^i\big) \right\|^2 + \frac{1}{|A_d|} \sum_{j \in A_d} w_j \Big| \big(T(p_s^j) - p_t^j\big) \cdot n_t^j \Big|^2 \right], \qquad (3)

Here, the first term minimizes the re-projection error between the fixed feature point associations obtained from RANSAC, while the second term accumulates the weighted point-to-plane distances over the dense point associations A_d. The loop exits when the transformation no longer changes by more than a small threshold θ, or when the maximum number of iterations is reached; otherwise, the dense data associations are recomputed using the most recent transformation. Considering that each RGB-D frame contains roughly 250,000 3D points, it is necessary to create a map that incorporates all available data into a concise representation. One method for doing this is surfels [6]. A surfel consists of a location, a surface orientation, a patch size and a color. We follow rules similar to those in [6] for updating, adding, and removing surfels. Finally, we note that the surfel representation will be very useful in the next section to support our contributions, especially for the inspection of transparent objects.
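The surfel update rules of [6] are not reproduced in the paper; the sketch below only illustrates, under assumed thresholds, what a surfel record and a non-redundant map update could look like.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray      # 3D location
    normal: np.ndarray        # surface orientation
    radius: float             # patch size
    color: np.ndarray         # RGB color
    weight: int = 1           # number of supporting observations (assumed counter)

class SurfelMap:
    """Concise map: each measurement either refines a nearby, similarly oriented
    surfel or creates a new one, so redundant surfels are not accumulated."""
    def __init__(self, merge_dist=0.01, min_normal_dot=0.8):
        self.surfels = []
        self.merge_dist = merge_dist          # assumed merge radius (meters)
        self.min_normal_dot = min_normal_dot  # assumed orientation compatibility

    def integrate(self, point, normal, radius, color):
        s = self._find_compatible(point, normal)
        if s is None:
            self.surfels.append(Surfel(point, normal, radius, color))
            return
        w = s.weight
        s.position = (w * s.position + point) / (w + 1)   # running averages
        s.normal = (w * s.normal + normal)
        s.normal = s.normal / np.linalg.norm(s.normal)
        s.color = (w * s.color + color) / (w + 1)
        s.radius = min(s.radius, radius)
        s.weight += 1

    def _find_compatible(self, point, normal):
        for s in self.surfels:   # a spatial index (k-d tree) would be used in practice
            if (np.linalg.norm(s.position - point) < self.merge_dist
                    and float(np.dot(s.normal, normal)) > self.min_normal_dot):
                return s
        return None
```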

IV. EXPERIMENTS

We performed several experiments to evaluate several aspects of the approach (accuracy and completeness) under challenging conditions. Our first goal is to build accurate, consistent 3D models of large-scale indoor environments in real time. Specifically, we demonstrate in Section IV-A that the integration of the SFM technique does not affect the overall consistency of the map, especially for the modeling of large scenes. In Section IV-B, we demonstrate the contribution of the approach in filling the Microsoft Kinect depth map, and we show the advantages of our approach in reconstructing transparent and specular surfaces, also in real time.

A. 3D modeling of indoor environments

The system was tested in two different scenes: a large, highly textured living room (Fig. 4) and a small, less textured hall (Fig. 6).


Fig. 4. 3D reconstruction of a living room space.

The geometric correctness of the 3D model was evaluated by comparing four object dimensions in the model with their corresponding dimensions in the real scene (Fig. 5).

Fig. 5. Quantitative evaluation of the measured virtual dimensions compared with those in the physical world.

We show that the accuracy of the map is conserved despite the fact that we have introduced new data with an approximate accuracy. In fact, the SFM technique uses a local transformation given by RANSAC to passively retrieve depths; this local transformation may make the model inconsistent through the cumulative error introduced by the local pose. However, the global alignment step adjusts the overall depth map before adding it to the model. In addition, the reconstructed scene is sufficiently textured to make the passive reconstruction process successful. Still, we must measure the cost introduced by this additional processing in more critical circumstances, to check whether the advantageous properties of SFM are conserved in such cases. Thus, in the next test, we build 3D models of a small, less textured hall (Fig. 6) without the use of SFM (a) and with the SFM technique (b). We tested the two RGB-D mapping systems in the hall under the same acquisition circumstances (the brightness and the number of acquisitions taken from different points of view). During the test, the camera was carried slowly by a person and pointed in the same direction (from left to right) so that each location is visited only once. Each obtained model (a) and (b) in Fig. 6 consists of exactly 16 frames over a length of 4 m.

Fig. 6. 3D reconstruction of a hall space without activating SFM (a), and activating SFM (b).

The visual quality of the two models (a) and (b) is lower than that of the living room model. This is due to the limited number of acquired frames and the textureless components of the scene. However, the model in (b) contains fewer holes than the model in (a). This can be seen in the glasses of the cupboard and in the area at the top of the refrigerator: for these areas, the Kinect depth camera was incapable of estimating depth measurements because of the Infrared camera drawbacks detailed above, unlike the SFM technique, which relies only on visual information to retrieve depths. However, in many areas of (b) for which the SFM technique was not activated, due to very slow motion or an insufficient number of matches, we do not observe a considerable improvement (e.g., the left part of the scene under the painting). Nevertheless, the observed improvement requires additional computation time in the acquisition step. We estimated that, for each depth map given by the SFM algorithm, we need on average 2.8 seconds to accomplish this step, whereas the RANSAC, ICP, SURF, surfel modeling and loop-closure detection algorithms take respectively 21 ms, 48 ms, 60 ms and 80 ms. Thus, the SFM algorithm is the most time-consuming step of the system. However, during the acquisition this technique is called only a few times (five times in our case). Therefore, this additional cost does not strongly affect the modeling process.

B. 3D modeling of transparent and specular surfaces

We use a translucent cup which exhibits the main common surface states. For each new acquisition, we track the total number of surfels added to the map. Since the incremental nature of surfel map construction requires rebuilding the surfel map when camera poses are updated, we can directly observe the contribution of integrating the SFM technique in the additional number of surfels added to the model, as the surfel representation avoids adding redundant surfels.

Fig. 7. 3D reconstruction of a translucent cup without integrating the SFM technique (left) and integrating the SFM technique (right).


For the left 3D model in Fig. 7, obtained without the SFM technique, a large portion of the scanned points on the surface (about 30%) has not been reconstructed, unlike the golden ball attached to the top of the cup. Indeed, the structured light pattern passed through all the transparent physical points of the surface, and some translucent points, those on the right side of the cup, refracted it. Therefore, none of these physical points will ever be added to the 3D model. On the other hand, when the SFM technique is integrated, these physical points are more likely to be reconstructed. In fact, many of the transparent physical points of the surface take their perceived colors from the background of the cup, while the translucent physical points have a specific whitish color. Thus, for both kinds of physical points, the SFM technique can retrieve depth data passively based on color information. Each set of points is now well reconstructed and successfully introduced into the virtual 3D model, which is precisely our contribution. To quantify the improvement, we track the evolution of the model's completeness through the number of surfels progressively added to the representation during the reconstruction process.

Fig. 8. Evolution of the number of surfels when integrating the SFM technique (blue curve) and without integrating the SFM technique (red curve).

As can be seen in Fig. 8, when the SFM technique is integrated, more surfels are added to the representation.

V. DISCUSSION AND FUTURE WORK

3D scene modeling for indoor environments has a variety of civilian and military applications: 3D mapping, simulation and gaming. However, significant challenges are posed: low lighting and textureless surfaces, repetitive structures and the coverage of all details in a consistent 3D model. Many approaches have been proposed in the state of the art; many of them are very accurate but typically expensive and slow (e.g., laser scanning). Other approaches, based only on vision, are less expensive but suffer from a lack of accuracy and robustness and often require a prohibitive amount of computation. Thus, the computer vision and robotics communities are moving towards the use of more sophisticated sensors for an efficient 3D building process. RGB-D sensors have been proposed; these sensors can easily be used to scan indoor spaces into 3D models in real time. In this paper, we investigated how the Microsoft Kinect, an inexpensive RGB-D camera developed mainly for gaming and entertainment applications, can be used jointly with a passive method (SFM) to build dense 3D maps of indoor environments. The experiments have shown that this approach is able to effectively handle translucent and transparent objects. In fact, it uses the visual information (the color of the objects or of their background) to retrieve depth data, and thus optimizes the visual quality of the virtual reconstructed scene. Another interesting avenue for research is the use of the YIQ color space, which has the special ability to separate the luminance component from the chromaticity; this could significantly improve the quality of reconstruction in outdoor or dark locations.

REFERENCES

[1] F. Leberl, A. Irschara, T. Pock, P. Meixner, M. Gruber, S. Scholz, and A. Wiechert, "Point Clouds: Lidar versus 3D Vision", Photogrammetric Engineering & Remote Sensing, 2010.
[2] G. Taubin, "A Signal Processing Approach to Fair Surface Design", Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, 1995.
[3] H. Du, P. Henry, X. Ren, M. Cheng, D. B. Goldman, S. M. Seitz, and D. Fox, "Interactive 3D modeling of indoor environments with a consumer depth camera", Proceedings of the 13th International Conference on Ubiquitous Computing, 2011.
[4] M. A. Jallouli, E. Zagrouba, N. Doggaz, and M. Abidi, "Decomposition of an alignment problem of two 3D images by a multi-agent approach", 4th International Conference on Innovations in Information Technology, 2007.
[5] M. Garland and P. S. Heckbert, "Surface Simplification Using Quadric Error Metrics", Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, 1997.
[6] M. Krainin, P. Henry, X. Ren, and D. Fox, "Manipulator and object tracking for in-hand 3D object modeling", The International Journal of Robotics Research, vol. 30, 2011.
[7] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments", Symposium on Experimental Robotics, 2010.
[8] S. Edmond, M. Matteo, M. Stefano, A. Mauro, and M. Emanuele, "Real-Time 3D Model Reconstruction with a Dual-Laser Triangulation System for Assembly Line Completeness Inspection", Intelligent Autonomous Systems Conference, 2012.
[9] S. May, D. Dröschel, D. Holz, E. Fuchs, S. Malis, A. Nüchter, et al., "Three-dimensional mapping with time-of-flight cameras", Journal of Field Robotics, 2009.
[10] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt, "3D shape scanning with a time-of-flight camera", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[11] W. C. Chiu, U. Blanke, and M. Fritz, "Improving the Kinect by Cross-Modal Stereo", 22nd British Machine Vision Conference (BMVC), Dundee, UK, 2011.
[12] I. Lysenkov, V. Eruhimov, and G. Bradski, "Recognition and Pose Estimation of Rigid Transparent Objects with a Kinect Sensor", Robotics: Science and Systems, 2012.
