Hybrid optical system for three-dimensional shape acquisition

In Yeop Jang, Min Ki Park, and Kwan H. Lee*

Mechatronics Department, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-gu, Gwangju, South Korea
*Corresponding author: [email protected]

Received 1 March 2013; revised 21 April 2013; accepted 22 April 2013; posted 22 April 2013 (Doc. ID 185905); published 22 May 2013

Hybrid concepts are often used to improve existing methods in many fields. We developed a hybrid optical system that consists of multiple color cameras and one depth camera to make up for the concavity problem of visual hull construction. The heterogeneous data from the color cameras and the depth camera are fused in an effective way. The experimental results show that the proposed hybrid system can reconstruct concave objects successfully by combining the visual hull and the depth data. © 2013 Optical Society of America

OCIS codes: (100.6890) Three-dimensional image processing; (110.5200) Photography; (110.6880) Three-dimensional image acquisition; (150.6910) Three-dimensional sensing.
http://dx.doi.org/10.1364/AO.52.003680

1. Introduction

As 3D applications, such as 3D movies, 3D TV, and 3D games, have become prevalent, the demand for generating 3D content has increased significantly. Acquiring 3D graphic objects has become an essential task, and various methods have been proposed to reconstruct 3D shapes in the real world with efficiency and speed. Multiview stereo using only color images [1,2] is a popular method for acquiring 3D geometric information. It provides users with results that are easily obtained by using off-the-shelf cameras. However, its accuracy is highly dependent on the texture of the subject, and it often requires a large number of cameras and a long processing time to obtain high-quality results. As such, ambitious attempts have been made to extract geometric information using a smaller number of cameras. The visual hull method, first introduced by Laurentini [3], is a representative shape reconstruction method that uses a small number of input images. The visual hull is the maximal shape consistent with a set of given silhouettes of the subject from multiple views. Because the method requires only a small number of silhouette images, it can reconstruct a watertight object robustly against pixel noise and with low computational cost. Simioni et al. applied this technique to obtain 3D models from x-ray absorption images [4]. However, the visual hull suffers from the fundamental problem that it cannot recover any concave region that is not identified in the silhouette cues. This is called the concavity problem in this paper. Matusik et al. [5] partially solve the concavity problem using photoconsistency, which is the information identified from different views. However, we need to solve the more difficult concavity, which cannot be identified even with all views, such as a deep concave region (e.g., a cup or bowl). Feng et al. attempted to solve this difficult concavity problem by positioning a virtual camera inside the concave region under the hypothesis that all the concave regions are empty and have the same shape [6], but it is an approximation technique, so the shape fidelity of the reconstructed concave region may be poor. Progress on 3D shape reconstruction has also been made in terms of hardware. A commercial 3D scanner for scanning a human head and face has been developed [7].

The off-the-shelf scanner enables the user to scan an object freely using a handheld device, but it is expensive and requires processing time that is unreasonable for practical applications. Meanwhile, depth cameras based on a new concept have also appeared. They provide both the depth value and the RGB color value of a subject concurrently, using infrared structured light or the time-of-flight technique [8–10]. However, their expensive hardware components and low resolution have hampered widespread use of the technology in practical applications. The recent introduction of a new light-pattern-based optical sensor (Kinect) [11] provides a breakthrough in this technology. This device is inexpensive and provides a relatively high-resolution (640 × 480) RGB image and a depth image with 11-bit precision at 30 Hz, based on active stereo using an infrared light pattern. Some researchers have tried to use the Kinect as a 3D scanner: Henry et al. [12] built dense 3D maps of indoor environments, and Newcombe et al. [13] acquired a relatively accurate shape using a single Kinect camera with a real-time mapping technique. The Kinect alleviates the correspondence problem in depth extraction by projecting an invisible infrared structured pattern onto the subject, so it can provide more accurate depth reconstruction than stereo matching, even for color-homogeneous regions. Although many methods for 3D reconstruction have been presented, the concavity problem remains challenging. Figure 1 shows examples acquired by the existing methods, in which the concave regions are not properly reconstructed. In 3D scanning, many cases occur in which the object must be scanned without contact, such as the reconstruction of the shape of relics or valuable objects. However, most existing contactless scanning techniques cannot handle a concave object since a deep concave region is poorly identified by contactless sensors. For many years hybrid concepts have been adopted to improve existing algorithms in various fields. In [16–18], a ToF sensor and the classical stereovision method are combined to complement or speed up the results. Other hybrid systems [19,20] generate an improved free-viewpoint video in terms of accuracy and computational time by using both a depth camera and regular cameras: the depth camera to obtain an initial estimate of the view-dependent geometry and the regular cameras to refine the geometry.

Fig. 1. (a) Original cup, (b) the image produced by a 3D scanner [14], (c) as reconstructed by the classical visual hull [15], and (d) as estimated by multiview stereo [1].

On the other hand, hull- or template-based concepts, which are effective in generating a single object, have been proposed in various forms. Bogomjakov et al. [21] generate free-viewpoint video of a single object by proposing a depth hull concept that is an extension of the classical visual hull. They configure and synchronize two depth cameras and a few video cameras and provide a real-time shape reconstruction and rendering solution in which they significantly reduce the visual artifacts in the occluded areas. Guan et al. [22] improve the details of a shape with a few devices by fusing a visual hull and a depth image from a ToF sensor based on a probabilistic occupancy grid. Esteban and Schmitt [23] reconstruct an accurate full 3D model based on the fusion of texture and silhouette. Our work is similar to these works, but it is more focused on solving the deep concavity problem of the visual hull technique. For a known subject shape, such as the human body, a 3D template is often used as an alternative to the visual hull. Carranza et al. and Weiss et al. [24,25] estimate the pose of a human from silhouette cues and deform the template proxy geometry by using the pose. Deformation-based shape reconstruction methods often require an accurate template 3D model, but it is difficult to obtain an accurate template in practical cases. Tong et al. [26] scan a 3D full human body using three Kinect depth cameras. A rough body template is estimated at the first frame and updated continuously based on a pairwise nonrigid geometry registration over the next frames. As we observe, visual hull construction has been shown to have practical merits in 3D shape reconstruction. At the same time, optical sensors have evolved remarkably over the past decade. In this paper, we present a 3D object modeling method to overcome the concavity problem in 3D reconstruction by fusing the visual hull with an optical sensor. In our method, we also build a novel optical hardware system for object scanning, which combines multiview cameras and an advanced optical depth sensor [11].

2. Proposed System

A. Overview of the Framework

The overall framework of the proposed system is described in Fig. 2. First, we acquire the input images from several color cameras and a depth camera, collocated around a subject, and perform geometric and depth camera calibration. Next, an initial convex shape is constructed as a mesh surface using a visual hull construction [15]. Then, a raw depth image from the depth camera is enhanced before being combined with the initial shape, since it contains measurement errors and optical noise. To achieve this, we generate a filter for the depth image based on an image filtering technique with a multiscale concept. The filtered depth data and the visual hull are combined in the next step by employing the visual hull as a deformable template model. Last, the enhanced visual hull is displayed as a textured 3D model.

Fig. 2. Overall framework of the proposed method.

B. Data Acquisition and Camera Calibration

Fig. 4. Depth linearization.

As shown in Fig. 2, our hybrid camera array consists of a depth camera [11] and several color cameras. The depth camera is in charge of capturing the concave region of a subject, and the color cameras are in charge of covering the overall shape of the subject. We acquire a pair of color and depth images from the depth camera and obtain the color and silhouette images from the multiple color cameras. All the acquired color images are used in the geometric camera calibration step. The color image from the depth camera is used again in the depth enhancement step. The silhouette images are used to construct the template model, but we do not consider automatic silhouette extraction, which is out of the scope of this paper. First, we perform geometric calibration using the techniques described in [27,28]. In this step, we use the depth camera as the reference camera. After geometric calibration, we perform depth data calibration of the depth camera. Unfortunately, the depth camera may return a nonlinear depth and scale profile according to the measuring distance due to the limitations of the sensor. In other words, the depth range between the maximum and the minimum depth of the subject and the scale of the captured subject do not vary consistently with the distance between the depth camera and the subject. Since these problems may cause shape distortion in the final reconstructed model, an additional preprocess is necessary to resolve them.

To achieve this, we calibrate the depth camera so that the raw depth data are spatially linearized in terms of the depth and the scale. We need a robust function that always outputs linear data regardless of whether the captured depth data are linear or not. We first measure the depth and the width of a planar object at regular distance intervals, as shown in Fig. 3. Then, we apply a line fitting based on the measured depth values D to define a linear depth function DL that produces a consistent depth value according to the measuring distance I (Fig. 4). Thereafter, the scale linearization of the depth data is performed using a curve fitting. First, a linear function L is computed through line fitting on the measured width values O of a planar subject, and then the scale error Ei is estimated as the difference between Li and Oi [Fig. 5(a)]. That is, a scale that varies consistently with the measuring distance can be obtained by compensating the measured width Oi with the corresponding scale error Ei at any measuring distance. We finally compute a compensation function P by performing a polynomial curve fitting based on E [Fig. 5(b)]. It makes the scale of the captured subject consistent at all measuring distances. Thus, our depth camera calibration makes the measured data from the depth camera consistent with respect to the scale and the depth of a subject, wherever it is captured.
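The fitting itself is straightforward. A minimal NumPy sketch of the procedure is shown below (illustrative only; the authors report a MATLAB implementation). The sample arrays are the calibration measurements listed in Section 3; the resulting fits illustrate the procedure rather than reproduce the paper's exact coefficients.

```python
import numpy as np

# Calibration samples of a planar plate captured at regular distance intervals
# (values taken from the measurement list in Section 3).
dist  = np.array([0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3])            # measuring distance (m)
depth = np.array([231, 208, 184, 161, 136, 113, 90], float)       # raw depth reading (pixel intensity)
width = np.array([152.67, 76.8, 54.0, 40.2, 30.5, 22.0, 20.0])    # measured plate width O (mm)

# Depth linearization: fit the linear depth function D_L.
depth_fit = np.poly1d(np.polyfit(dist, depth, 1))                 # D_L = a*dist + b

# Scale linearization: line fit L on the widths, scale error E = L - O,
# and a polynomial compensation function P fitted to E.
L = np.poly1d(np.polyfit(dist, width, 1))
E = L(dist) - width                                               # E_i = L_i - O_i
P = np.poly1d(np.polyfit(dist, E, 4))                             # compensation polynomial

def r_squared(y, y_hat):
    """Coefficient of determination used to verify each fit."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

print("R^2 depth fit:", r_squared(depth, depth_fit(dist)))
print("R^2 scale compensation:", r_squared(E, P(dist)))
```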

Fig. 3. Measurement of depth and scale of a planar object for the depth camera calibration.

Fig. 5. Linearization of the scale. (a) The measured width of the subject, and (b) the scale errors and the scale compensation function P.


Fig. 6. (a) Visual hull and (b) the depth hull.

C. Generation of a Template Model

The proposed method uses a sensor fusion concept that combines the multiview cameras and the depth camera. Accordingly, a shape deformation concept is adopted to naturally combine the heterogeneous data from the two kinds of cameras. This technique usually requires a template model that is accurately modeled in advance, either from user-provided data or with a 3D scanner. However, preparing such a template model is not only burdensome, it is also inefficient for content generation. We therefore automatically generate a template model based on image-based modeling using multiview images. A visual hull construction generates a watertight surface that does not contain outliers or holes, so it is suitable as a template model. Figure 6 shows the original visual hull and our depth hull for the template model that is used to restore the concave region. As shown in Fig. 6, the original visual hull is the intersection of the visual cones from the cameras, and each visual cone is obtained by passing through the silhouette of the subject extracted at each view. On the other hand, our depth hull is generated by carving the visual hull using depth values from the depth camera. The visual hull is constructed in a volumetric form by [3], and the volumetric form is converted into a polygonal surface by [29]. Finally, the surface is used as the template model whose concave region is deformed [Fig. 13(a)].
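To make the carving idea concrete, the sketch below builds a voxel-based depth hull: a voxel is kept only if it projects inside every silhouette (the classical visual hull) and is additionally removed when it lies in front of the depth surface measured by the depth camera. This is a simplified illustration only; the actual system builds a polyhedral visual hull [15,29], and the `project` helper and camera matrices are assumed inputs.

```python
import numpy as np

def project(P, X):
    """Project 3D points X (N,3) with a 3x4 camera matrix P; return pixel coords and depth."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3], x[:, 2]

def depth_hull(voxels, color_cams, silhouettes, depth_cam, depth_map):
    """
    voxels:      (N,3) voxel-grid centers.
    color_cams:  list of 3x4 matrices of the color cameras; silhouettes: matching binary masks.
    depth_cam:   3x4 matrix of the depth camera; depth_map: its enhanced depth image.
    """
    keep = np.ones(len(voxels), dtype=bool)

    # Classical visual hull: a voxel survives only if it falls inside every silhouette.
    for P, sil in zip(color_cams, silhouettes):
        uv, _ = project(P, voxels)
        u, v = np.round(uv).astype(int).T
        inside = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        keep &= inside
        keep[inside] &= sil[v[inside], u[inside]] > 0

    # Depth carving: drop voxels lying in front of the measured depth surface
    # as seen from the depth camera (the empty space inside the concavity).
    uv, z = project(depth_cam, voxels)
    u, v = np.round(uv).astype(int).T
    seen = keep & (u >= 0) & (u < depth_map.shape[1]) & (v >= 0) & (v < depth_map.shape[0])
    keep[seen] &= z[seen] >= depth_map[v[seen], u[seen]]
    return voxels[keep]
```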

D. Depth Enhancement

The raw depth image from the depth camera contains measurement errors and optical noise, so depth image enhancement is necessary before the depth image is used in the shape deformation. To achieve this, we perform depth image filtering and convert the filtered depth images into the signed distance function (SDF) proposed by [30]. In mathematics, the SDF determines the distance of a given point from a boundary; if a point is on the boundary, its SDF is zero. We apply this concept to the depth image and extract an optimal depth image by aggregating the depth pixels whose SDFs are zero (Fig. 7). Figure 8 shows the overall procedure of our depth enhancement. First, multiscale color and depth image pyramids are generated from the depth camera output at user-specified levels.

Fig. 7. Signed distance function in 3D. In practice, no depth values align along the z axis as in the figure, due to measurement error and noise. The depth value at the first measurement may be greater than the one at the others and vice versa. If we generate depth surfaces using the raw depth images, the surfaces may be shuffled and interpenetrate each other. The SDFs are computed over the depth images and their combination, subject to D(x) = 0, results in an optimal isosurface [30].

Next, a joint bilateral filter (JBF), which uses the intensity distribution of a reference image, is applied to the depth pyramid to restore accurate depth values for ambiguous pixels in the raw depth image. In this paper, we use the corresponding color image as the reference image and apply the filter cumulatively from the lowest level to the highest. This aims to filter the depth image considering the distribution of both the depth and the color. In the experiments, we obtain k frames of color (c_i) and depth (d_i) from the depth camera, and the above process is independently applied to the k frames in the same way. Thereafter, each pixel of the filtered depth images is converted to the signed distance functions (d_1(x), ..., d_k(x)) over the k frames based on the algorithm from [30], and their combination finally results in a cumulative signed distance function, D(x), which constitutes the final depth image used for the restoration of the concave region in the next step:

D(x) = \frac{\sum_{i=1}^{k} w_i(x) d_i(x)}{\sum_{i=1}^{k} w_i(x)},   (1)

Fig. 8. Overall procedure of depth enhancement.

where x = (x_1, x_2) ∈ Z^2 is a pixel index of the raw depth image, i is the frame number of the k-frame depth video, and w ∈ R and d ∈ R are the weight value and the signed distance for each pixel, respectively. We extract an isosurface that satisfies D(x) = 0 in the least-squares sense (see Fig. 7). That is, the isosurface can be extracted by gathering the pixels whose corresponding signed distance value is 0. The weight w_i(x) depends on the viewing angle of the depth sensor (see [30]), but all w_i(x) are set to the same value since the angle is fixed in our case. The depth values in the boundary area of objects, however, oscillate, so this area has signed distances d_i(x) with large variance. We compute the maximum variation of the signed distances d_i(x) as follows:

B_d(x) = \max d(x) - \min d(x), \qquad d = \{ d_i \mid i = 1, \ldots, k \},   (2)

where x is a pixel and i is the frame number of the corresponding depth image. If B_d(x) is larger than an arbitrary threshold (≈30), the corresponding depth pixel x is excluded from the computation of the cumulative signed distance D(x). This process leads to a boundary cleaning of the final depth data. Figure 8 shows that the final depth image is produced from a raw depth image by this enhancement.
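A compact NumPy sketch of this per-pixel fusion [Eqs. (1) and (2)] over a stack of k filtered frames is shown below. For clarity it combines the depth values directly; with uniform weights, as in our fixed-camera case, the zero isosurface of the averaged SDFs coincides with this average. Array names and the stacking convention are illustrative (the actual pipeline uses the GPU implementation of [30] in PCL, as noted in Section 3).

```python
import numpy as np

def fuse_depth_frames(depth_stack, weights=None, boundary_thresh=30.0):
    """
    depth_stack: (k, H, W) array of filtered depth images d_i(x).
    Returns the cumulative per-pixel estimate D(x) [Eq. (1)]; pixels whose
    frame-to-frame variation B_d(x) [Eq. (2)] exceeds the threshold are marked
    invalid (boundary cleaning).
    """
    k = depth_stack.shape[0]
    if weights is None:
        # The sensor viewing angle is fixed in our setup, so all w_i(x) are equal.
        weights = np.ones((k, 1, 1))

    # Eq. (1): weighted combination of the k observations at every pixel.
    D = (weights * depth_stack).sum(axis=0) / weights.sum(axis=0)

    # Eq. (2): maximum variation over the k frames; large values indicate
    # oscillating depth along object boundaries.
    B = depth_stack.max(axis=0) - depth_stack.min(axis=0)
    D[B > boundary_thresh] = np.nan      # exclude unreliable boundary pixels
    return D
```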

E. Extraction of the Region of Interest

The restoration of a concave region starts from finding the part of the template model that is to be deformed. This part is the region of interest, which includes the concave region of the subject. The subproblem of extracting only the region of interest is solved as shown in Fig. 9. First, we adopt a back-face culling method to extract the front face of the template model, which corresponds to the view of the depth camera. One back-face culling approach simply separates the front and back faces using an algebraic operation [31]: it discards all polygons for which the dot product of the polygon normal vector and the camera view vector is greater than or equal to zero. This idea is well suited to extracting the region of interest in our method, since the polygon normal vectors are computed in the template generation step and the camera view vector is given in the camera calibration step. To apply the back-face culling method, the template model is first transformed to the coordinate system of the depth camera. The front surface of the template model, as viewed from the depth camera, is then extracted by the back-face culling method.

Fig. 9. Procedure for extracting the region of interest.

Fig. 10. Extraction of the front surface. The color of the model represents the angle between each normal vector and the view direction.

Figure 10 shows the extraction of the front surface using the back-face culling algorithm. As explained above, each vertex normal n_i and the view vector d are used to compute the dot product f = (n_i · d), and if the dot product is larger than an arbitrary constant T, the corresponding vertex is considered a candidate vertex for deformation. At this point, some vertex normal vectors on the front surface may have the wrong direction due to local artifacts. Figure 11(a) depicts the local artifacts and Fig. 11(b) shows our strategy for estimating the normal vectors. We compute the dot product considering the neighboring vertices N(V_i) as well as each vertex V_i, using the following equation:

G_i = \mathrm{med}\{ f(n_j, d) \mid n_j \in \mathrm{normal}(V_i, N(V_i)) \},   (3)

where G_i is the median of the dot products between the view vector d and the normal vectors of the neighboring vertices of V_i, including V_i itself. This provides a normal-filtering effect by taking the neighboring normal vectors into account. The candidate vertex set for the deformation, S, can then be defined by the following condition:

S = \{ V_i \in \mathbb{R}^3 \mid G_i > T, \; i = 1, \ldots, n \},   (4)

where n is the number of vertices of the template model. Next, a series of image processing operations, as shown in Fig. 12, is performed to extract the concave area more precisely.
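Before that image-based refinement, the vertex preselection of Eqs. (3) and (4) can be sketched as follows. This is a minimal illustration: the per-vertex normals and one-ring adjacency are assumed inputs, and T = 0.5 is the value reported in Section 3.

```python
import numpy as np

def select_candidates(normals, neighbors, view_dir, T=0.5):
    """
    normals:   (n, 3) unit vertex normals of the template model, in the depth-camera frame.
    neighbors: list of index lists; neighbors[i] is the one-ring of vertex i.
    view_dir:  (3,) view vector d of the depth camera (sign convention as in Eq. (4),
               i.e., vertices facing the camera yield dot products above T).
    Returns indices of the candidate set S = {V_i | G_i > T}.
    """
    f = normals @ np.asarray(view_dir, dtype=float)   # dot products f(n_i, d)
    candidates = []
    for i, nbrs in enumerate(neighbors):
        ring = [i] + list(nbrs)                       # the vertex itself plus its neighbors
        G_i = np.median(f[ring])                      # Eq. (3): median of the dot products
        if G_i > T:                                   # Eq. (4): threshold test
            candidates.append(i)
    return np.array(candidates, dtype=int)
```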

Fig. 11. Normal vectors: (a) the local artifacts, and (b) estimation of the normal vectors.

Fig. 12. This figure shows that the concave region can be extracted even at a diagonal view. (a) The projected depth image of the front surface, (b) the captured depth image, (c) adaptive thresholding applied to the difference of (a) and (b), (d) the outliers removed using the iterative erosion and dilation operations, and (e) the inner region after the extraction of the contour.

The depth camera is positioned directly toward the concave region of the subject so that the region can be captured at once. Consequently, all the candidate vertices in the front surface, S, belong to the concave area. However, the front surface seen directly from the depth camera may not correspond exactly to the concave region, since the front surface contains both the concave region and convex regions, as shown in Fig. 12(b). Accordingly, we need to separate out the candidate vertices that do not need to be deformed. The front surface is projected in 2D toward the depth camera view. Then, two depth images are obtained from the viewpoint of the depth camera: the enhanced depth image (D_k) from the depth camera and the projected depth image (D_s) from the front surface. Next, a difference image (F) of the two depth images and the standard deviation of the difference image (σ_F) are computed. If σ_F is higher than a threshold α, a concave region exists along that viewpoint, because the difference of the two images (F) increases in the concave region, where the projected depth image (D_s) has a positive depth value but the captured depth image (D_k) is supposed to have a negative one. Consequently, we apply an adaptive thresholding [32] to the difference image (F) to obtain only the concave region; if σ_F ≤ α, no concave region is detected along that viewpoint and the segmentation is skipped. The following equations and Fig. 12 summarize the above operations:

F = D_s - D_k,
\sigma_F = \sqrt{ \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \bigl( F(i,j) - \mu \bigr)^2 },
g = \text{adaptive-threshold}(F) \quad \text{if } \sigma_F > \alpha,   (5)

where g is the segmented image [Fig. 12(c)] acquired by applying the adaptive thresholding to F. Next, morphological erosion and dilation operations are iteratively applied to remove the outliers in the segmented image g [Fig. 12(c)]. Then, a contour is extracted by [33] and the inner region of the contour is acquired by [34] in the difference image [Fig. 12(d)]. The inner region corresponds to the final region of interest, which should be deformed in the template model. Now, the vertices V of the template model that correspond to the final region of interest are considered as another mesh surface, M, with their edges E:

M = (V_i, E_i), \qquad \forall V_i \in S.   (6)
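A sketch of this segmentation step using OpenCV is shown below. The threshold α and the structuring element are illustrative (the paper does not report their values), Otsu's method [32] stands in for the adaptive thresholding, and keeping the largest contour is our assumption.

```python
import cv2
import numpy as np

def extract_concave_roi(D_s, D_k, alpha=5.0):
    """
    D_s: depth image of the front surface projected from the template model (float32).
    D_k: enhanced depth image captured by the depth camera (float32).
    Returns a binary mask of the region of interest, or None if no concavity is detected.
    """
    F = D_s - D_k                                   # difference image, Eq. (5)
    if F.std() <= alpha:                            # sigma_F test: no concave region in this view
        return None

    # Thresholding of F (Otsu's method [32] used here as the "adaptive" threshold).
    F8 = cv2.normalize(F, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, g = cv2.threshold(F8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Iterative erosion and dilation (morphological opening) to remove outliers.
    kernel = np.ones((3, 3), np.uint8)
    g = cv2.morphologyEx(g, cv2.MORPH_OPEN, kernel, iterations=3)

    # Contour extraction and filling of its inner region -> final region of interest.
    contours, _ = cv2.findContours(g, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    roi = np.zeros_like(g)
    if contours:
        largest = max(contours, key=cv2.contourArea)
        cv2.drawContours(roi, [largest], -1, 255, thickness=cv2.FILLED)
    return roi
```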

F. Restoration of the Concave Region

In this step, the concave region is restored by fusing the depth data and the visual hull. Fusing two types of geometric information is a difficult task. Mesh deformation is an excellent method for generating an optimal surface by modifying geometry according to control points. We adopt the concept of mesh deformation and use the enhanced depth data corresponding to the concave region as the control points to combine the heterogeneous geometric data. Let the mesh M be described by a pair (V, E), where E describes the connectivity and V = {V_m, ..., V_k | m < k < n}, with n the number of vertices of the template model, describes the geometric 3D positions in the region of interest. To deform M, the region of interest of the template model, we utilize a Laplacian deformation technique [35]. To apply the Laplacian deformation to the mesh M, the coordinate system of the mesh should be converted to the Laplacian coordinate system instead of the Cartesian one. The Laplacian coordinate L uses a set of differentials instead of absolute 3D coordinates to describe V:

L(V_i) = V_i - \frac{1}{|N_i|} \sum_{j \in N_i} V_j.   (7)

Specifically, each vertex V_i is represented by the difference between V_i and the average of its neighbors V_j. This allows the two surfaces, whose scales and coordinate systems differ, to be combined naturally. The template model is transferred to the depth camera coordinate system, and then the x- and y-coordinates of the projected region of interest coincide with those of the depth image. This means that the topology of the region of interest can also be projected onto the corresponding depth image, so the depth image can be constructed as another surface, H, having the same topology as the region of interest of the template model. This surface is represented in the Laplacian coordinate system as well. We now have two Laplacian mesh surfaces at the depth camera view: the mesh M corresponding to the region of interest, and the depth surface H. Sorkine et al. [35] proposed a surface coating concept that transfers the surface details of a source surface onto a target surface. To restore the concave region of M based on H, we adopt this coating transfer technique. Let m_i and h_i be the ith vertex of M and H, respectively. The coating value ξ_i is defined at each vertex by

\xi_i = h_i - m_i, \qquad i \in \{m, \ldots, k\}.   (8)

Then we can reconstruct a complete geometry that recovers the concave region, S, by simply adding ξ to M in the Laplacian coordinate system and applying the inverse Laplacian transformation L^{-1} to turn it back to the Cartesian coordinate system:

S = L^{-1}(M + \xi).   (9)
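A condensed sketch of this fusion with SciPy sparse matrices is shown below, using a uniform (umbrella) Laplacian for Eq. (7). Anchoring the boundary of the region of interest with soft positional constraints is our assumption to make the inverse transform of Eq. (9) well posed; the paper does not detail its constraint handling, and all names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def umbrella_laplacian(n, neighbors):
    """Uniform Laplacian: L(V_i) = V_i - mean of the neighbors of V_i, as in Eq. (7)."""
    rows, cols, vals = [], [], []
    for i, nbrs in enumerate(neighbors):
        rows.append(i); cols.append(i); vals.append(1.0)
        for j in nbrs:
            rows.append(i); cols.append(j); vals.append(-1.0 / len(nbrs))
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

def coating_transfer(M_verts, H_verts, neighbors, boundary_idx, w=1e3):
    """
    M_verts: (n,3) vertices of the region of interest M (template model).
    H_verts: (n,3) vertices of the depth surface H with the same topology.
    boundary_idx: indices of boundary vertices kept near their template positions.
    Returns the restored surface S of Eq. (9).
    """
    n = len(M_verts)
    L = umbrella_laplacian(n, neighbors)
    delta_M = L @ M_verts                  # Laplacian coordinates of M
    delta_H = L @ H_verts                  # Laplacian coordinates of H
    ksi = delta_H - delta_M                # Eq. (8): coating values
    rhs = delta_M + ksi                    # Eq. (9): M + ksi in Laplacian coordinates

    # Soft positional constraints on the boundary vertices, solved per axis in the
    # least-squares sense (the "inverse Laplacian transform").
    m = len(boundary_idx)
    C = sp.csr_matrix((np.full(m, w), (np.arange(m), boundary_idx)), shape=(m, n))
    A = sp.vstack([L, C]).tocsc()
    S = np.zeros_like(M_verts, dtype=float)
    for axis in range(3):
        b = np.concatenate([rhs[:, axis], w * M_verts[boundary_idx, axis]])
        S[:, axis] = spla.lsqr(A, b)[0]
    return S
```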

3. Experimental Results

In order to evaluate the proposed system, two types of experiments are carried out. In the first experiment, we measure a subject having a deep concave region, such as a paper cup [Figs. 13(e)–13(j)]. In the other experiment, we test a more complex case in which a concave object contains some convex regions, such as a fruit bowl [Figs. 13(a)–13(d)]. Eleven CCD cameras with a resolution of 2560 × 1920 are arranged around the whole subject. There is no specific reason to choose 11 cameras; we chose experimentally the minimum number of cameras that captures the subject fully within their fields of view, considering the computational complexity. One depth camera captures the concave region of the subject at 640 × 480 resolution. From all these cameras, we concurrently acquire 12 color images covering all directions of the subject with a planar checkerboard. The acquired color images are used for geometric calibration of the cameras, which is implemented according to [28] and estimates the positional parameters and the intrinsic parameters of the color cameras as well as the depth camera. In addition, we measure the average depth values and the widths of a planar subject from the depth camera seven times at a distance interval of 0.9 m to calibrate the depth camera, as shown in Fig. 3. Our depth calibration consists of two steps: the linearization of the depth and of the scale. In the case of the scale correction, we consider the width of the subject as the value representing the scale of the subject, since the decreasing ratio of the scale according to the measuring distance is consistent with that of the width. The following list shows the measured data and the estimated linearization functions for the depth camera calibration.

• The measuring distance (meters), D: 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3.
• The measured depth value (pixel intensity), DL: 231, 208, 184, 161, 136, 113, 90.
• The measured width (millimeters), O: 152.67, 76.8, 54, 40.2, 30.5, 22, 20.
• The width difference, Ei = Li − Oi: −0.256383, 0.23119792, 0.3997037, 0.40825871, 0.23409836, −0.1514545, −1.0152.
• The depth linearization function: DL = −26.23D + 254.9.
• The scale linearization functions: L = −21.68D + 132.5, P = −0.8782D^4 + 14.72D^3 − 90.45D^2 + 230.6D − 183.7.
• The accuracy of the depth function (R^2: 0–1): 0.9999.
• The accuracy of the scale function (R^2: 0–1): 0.9956.

As shown above, we evaluate how well the linearization functions correct the raw depth data by using the R-square statistic, which usually measures the accuracy of a regression model.


Fig. 13. Results. (a)–(d) show reconstruction of a concave region containing convex objects. It indicates why the concave region must be reconstructed based on the real measured data. (a) The visual hull, (b) the raw depth data of a concave region, (c) the geometry reconstructed by the proposed method, and (d) the textured model. (e)–(g) show the reconstructions of concave objects having color homogeneous regions. (h)–(j) show the concave region of earthenware reconstructed by the wireframe, depth map, and the textured model, respectively.

The evaluation shows that the linearization functions are accurately modeled, with R^2 close to 1. In short, the key of our depth camera calibration is how accurately the measured depth and the measured scale of the subject are linearized with respect to the measuring distance. To this end, we estimate and evaluate the mathematical models that linearize the depth data from the depth camera. After that, the template model is constructed using a polygonal visual hull algorithm [15]. The raw depth image is enhanced by the pyramidal joint bilateral filtering and the SDF. In this process, we use about 30 depth and color images from the depth camera, aimed directly at the concave region of the subject. When we construct the pyramid of color and depth images, the pyramid has four levels, and the image at each level is half the size of the image at the level below. The image pyramid is then filtered cumulatively from the top level to the bottom by the JBF. The JBF uses the corresponding color image as the range filter kernel, with a kernel size of 10, a spatial parameter of 3, and a range parameter of 0.1. The filtered 30 depth images are used to generate the SDF with the weight w(x) = 1, and a single enhanced depth image is generated. The SDF process is implemented using PCL [36] and runs in real time on a GPU.
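The pyramid-plus-JBF step can be sketched as follows with OpenCV (requires the opencv-contrib ximgproc module for the joint bilateral filter). The coarse-to-fine accumulation, in which the upsampled coarse result fills invalid pixels before the next filtering pass, is our interpretation of the cumulative filtering described above; the parameter values are the ones reported here.

```python
import cv2
import numpy as np

def pyramidal_jbf(color, depth, levels=4, d=10, sigma_color=0.1, sigma_space=3):
    """
    color: (H,W,3) uint8 image from the depth camera; depth: (H,W) float32, 0 = invalid.
    Returns a filtered depth image obtained by cumulative joint bilateral filtering
    over a four-level color/depth pyramid.
    """
    # Build the image pyramids (level 0 = full resolution).
    colors, depths = [color], [depth.astype(np.float32)]
    for _ in range(1, levels):
        colors.append(cv2.pyrDown(colors[-1]))
        depths.append(cv2.pyrDown(depths[-1]))

    # Filter from the coarsest level to the finest, carrying the result along.
    filtered = None
    for lvl in range(levels - 1, -1, -1):
        cur = depths[lvl].copy()
        if filtered is not None:
            up = cv2.resize(filtered, (cur.shape[1], cur.shape[0]))
            cur = np.where(cur > 0, cur, up)            # seed holes with the coarser estimate
        guide = colors[lvl].astype(np.float32) / 255.0  # normalized so sigma_color = 0.1 is meaningful
        filtered = cv2.ximgproc.jointBilateralFilter(guide, cur, d, sigma_color, sigma_space)
    return filtered
```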

Next, we extract the region of interest to be deformed by using back-face culling with a dot product threshold of T = 0.5 and apply the morphological erosion and dilation operations. Last, the template model and the enhanced depth image corresponding to the concave region are generated as two independent surfaces with Laplacian coordinates, and the two surfaces are used in the deformation. The final outcome demonstrates the natural fusion of heterogeneous geometric data. Figure 13 shows the results from the proposed system. We test the proposed system with a bowl and cups having deep concave regions, which are good examples for showing its performance. The results show why it is hard to scan a concave region, and why the region must be reconstructed based on real measured data, not by an approximation. The concave regions of most concave objects are often color homogeneous. The existing vision-based methods [1], such as stereo matching, fail on color-homogeneous regions, so it is hard for them to reconstruct the concave region. Also, the other existing method [6], which approximates the shape of the concave region instead of scanning it, may not work in irregular cases where the shape of a concave region cannot be anticipated. As the results show, the proposed system can robustly reconstruct the objects even when the concave region is color homogeneous or contains several objects inside. The processes are implemented using MATLAB on a PC with a dual-core 2.8 GHz CPU, 4 GB of RAM, and a 730 MHz GPU. It takes about 154 s to generate a concave model having 8400 vertices. We focus on showing a system that can reconstruct a concave model with only passive scanning. We expect that the performance of the system can be further improved through optimization.

4. Conclusion and Discussion

In this paper, we propose a 3D modeling system that can scan a concave object, as well as a convex one, with totally passive sensors. Scanning concave regions has been a challenging problem in 3D shape reconstruction, since even a high-performance 3D scanner requires an additional probe to do the job. Our proposed system uses an improved visual hull construction method that effectively fuses the depth data from an optical depth sensor with the visual hull from a shape-from-silhouette algorithm. We have taken a step toward innovating the classical image-based modeling systems by utilizing the potential of an off-the-shelf depth sensor for shape reconstruction. The proposed system consists of several commodity cameras and a single depth camera, so an additional process is required to combine the heterogeneous data from the different types of cameras. First, the calibration of the depth camera is performed to linearize the depth data according to the measuring distance.

Nevertheless, the depth range of the visual hull is not coincident with that of the depth map, so we effectively fuse the heterogeneous depth data by applying a mesh coating algorithm that transfers geometric signals among different mesh surfaces regardless of their scale. The proposed method makes a more credible reconstruction of the concave region possible even when the concave region is not empty. It is also more practical in that it is robust on homogeneous regions, where the existing stereovision-based fusion methods generate unstable disparities. The proposed method can enable optical 3D scanners to reconstruct even concave regions with only passive sensing, so that optical scanners can be used in a wider range of 3D applications. The proposed system overcomes the concavity problem under the hypothesis that the subject has a single concave area. Of course, the system is scalable, so the number of cameras can be increased to deal with multiple concave regions. However, an indiscreet increase in the number of sensors lowers the system performance. Also, we adopt a mesh-based shape representation since it reduces computational resources by constructing the geometry with less point data, but it occasionally causes a loss of shape details. For further work, we plan to find an optimal camera composition that uses the minimal number of cameras to properly cover all concave regions of the subject, and to preserve the lost shape details for better results.

Special thanks to K. Forbes for his valuable comments. This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2012008699).

References
1. Y. Furukawa and J. Ponce, “Accurate, dense, and robust multi-view stereopsis,” IEEE Trans. Pattern Anal. Mach. Intell. 32, 1362–1376 (2010).
2. G. H. Liu, X. Y. Liu, and Q. Y. Feng, “High-accuracy three-dimensional shape acquisition of a large-scale object from multiple uncalibrated camera views,” Appl. Opt. 50, 3691–3701 (2011).
3. A. Laurentini, “The visual hull concept for silhouette-based image understanding,” IEEE Trans. Pattern Anal. Mach. Intell. 16, 150–162 (1994).
4. E. Simioni, F. Ratti, I. Calliari, and L. Poletto, “Three-dimensional modeling using x-ray shape-from-silhouette,” Appl. Opt. 50, 3282–3288 (2011).
5. W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan, “Image-based visual hulls,” in SIGGRAPH, Proceedings of the Conference on Computer Graphics and Interactive Techniques (ACM, 2000), pp. 369–374.
6. J. Feng, B. Song, and B. Zhou, “Bottom and concave surface rendering in image-based visual hull,” in Proceedings of the 7th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry (2008), paper 3.
7. http://www.cyberware.com/products/scanners/px.html.
8. http://www.pmdtec.com/.
9. http://en.wikipedia.org/wiki/Canesta.
10. http://www.mesa-imaging.ch/.
11. http://www.microsoft.com/en-us/kinectforwindows/.
12. P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments,” Int. J. Rob. Res. 31, 647–663 (2012).
13. R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: real-time dense surface mapping and tracking,” in Proceedings of IEEE International Symposium on Mixed and Augmented Reality (IEEE, 2011), pp. 127–136.

14. http://www.3d3solutions.com/products/3d-scanner/hdi-advance/.
15. F. Keith, V. Anthon, and B. Ndimi, “Using silhouette consistency constraints to build 3D models,” in Proceedings of the Fourteenth Annual South African Workshop on Pattern Recognition (PRASA) (2003).
16. J. Zhu, L. Wang, R. Yang, and J. Davis, “Fusion of time-of-flight depth and stereo for high accuracy depth maps,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2008).
17. U. Hahne and M. Alexa, “Depth imaging by combining time-of-flight and on-demand stereo,” in Proceedings of the Dynamic 3D Imaging Workshop (Dyn3D) (2009), pp. 70–83.
18. Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Misusik, and S. Thrun, “Multi-view image and ToF sensor fusion for dense 3D reconstruction,” in Proceedings of IEEE Conference on Computer Vision Workshops (IEEE, 2009), pp. 1542–1549.
19. E. Tola, C. Zhang, Q. Cai, and Z. Zhang, “Virtual view generation with a hybrid camera array,” CVLAB-Report-2009-001 (École Polytechnique Fédérale de Lausanne, 2009).
20. C. Kuster, T. Popa, C. Zach, C. Gotsman, and M. Gross, “A hybrid camera system for interactive free-viewpoint video,” in Proceedings of Vision, Modeling, and Visualization (VMV) (2011), pp. 17–24.
21. A. Bogomjakov, C. Gotsman, and M. Magnor, “Free-viewpoint video from depth cameras,” in Proceedings of the International Workshop on Vision, Modeling and Visualization (VMV) (2006), pp. 89–96.
22. L. Guan, J. S. Franco, and M. Pollefeys, “3D object reconstruction with heterogeneous sensor data,” in Proceedings of International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT) (2008), paper 108.
23. C. H. Esteban and F. Schmitt, “Silhouette and stereo fusion for 3D object modeling,” Comput. Vis. Image Underst. 96, 367–392 (2004).


24. J. Carranza, C. Theobalt, M. Magnor, and H. P. Seidel, “Free-viewpoint video of human actors,” ACM Trans. Graph. 22, 569–577 (2003).
25. A. Weiss, D. Hirshberg, and M. J. Black, “Home 3D body scans from noisy image and range data,” in Proceedings of IEEE Conference on Computer Vision (IEEE, 2011), pp. 1951–1958.
26. J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan, “Scanning 3D full human bodies using Kinects,” IEEE Trans. Vis. Comput. Graph. 18, 643–650 (2012).
27. K. Nakano and H. Chikatsu, “Camera calibration techniques using multiple cameras of different resolutions and bundle of distances,” Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XXXVIII, 484–489 (2010).
28. http://www.vision.caltech.edu/bouguetj/calib_doc/.
29. W. Matusik, C. Buehler, and L. McMillan, “Polyhedral visual hulls for real-time rendering,” in Proceedings of Twelfth Eurographics Workshop on Rendering (EGWR) (2001), pp. 115–125.
30. B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in SIGGRAPH, Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (ACM, 1996), pp. 303–312.
31. T. Akenine-Möller and E. Haines, “Acceleration algorithms: backface culling,” in Real-Time Rendering, 2nd ed. (2002), Chap. 9.3, pp. 359–363.
32. N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst. Man Cybern. SMC-9, 62–66 (1979).
33. Y. K. Liu and B. Zalik, “An efficient chain code with Huffman coding,” Pattern Recogn. 38, 553–557 (2005).
34. E. Haines, “Point in polygon strategies,” in Graphics Gems IV, P. S. Heckbert, ed. (Morgan Kaufmann, 1994), pp. 24–46.
35. O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rossl, and H. P. Seidel, “Laplacian surface editing,” in Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP) (2004), pp. 175–184.
36. http://pointclouds.org/.