
Combining Stereo and Visual Hull for On-line Reconstruction of Dynamic Scenes

Ming Li, Hartmut Schirmacher and Hans-Peter Seidel
Max-Planck-Institut fuer Informatik
Stuhlsatzenhausweg 85, D-66123 Saarbruecken, Germany
{ming,htschirm,hpseidel}@mpi-sb.mpg.de

Abstract— In this paper, we present a novel system that combines depth from stereo and visual hull reconstruction for acquiring dynamic real-world scenes at interactive rates. First, we use the silhouettes from multiple views to construct a polyhedral visual hull as an initial estimate of the object in the scene. The visual hull is used to limit the disparity range during stereo estimation. The use of the constraints imposed by the silhouettes from the earlier processing stage reduces the amount of computation significantly. The restricted search range improves both the speed and the quality of the stereo reconstruction. In return, stereo information can compensate for some of the inherent drawbacks of the visual hull method, such as the inability to reconstruct concave regions.

I. INTRODUCTION

In recent years, the acquisition of dynamic scenes using multiple cameras has found exciting applications in the fields of telecommunication, human-computer interaction and entertainment. However, some technical problems need to be overcome before commercial applications can be developed, such as geometry estimation, photometry and lighting reconstruction, and motion recovery. Among them, the reconstruction of 3D geometric information is the key issue. Once the 3D scene structure is recovered, we can examine the scene from arbitrary viewpoints. The recovery of photometric properties of the scene also becomes possible. Finally, real geometry allows editing and composition with other virtual objects or environments. Considering these advantages, our research focuses on improving the quality of 3D reconstruction.

There are many ways to reconstruct 3D information from images. Correlation-based “depth from stereo”[25] is one classical approach. Fuchs et al. [9] propose to apply this method to reconstruct real-world scenes from a large number of fixed cameras, but at that time the speed of stereo algorithms was severely limited by the performance of the available machines. To overcome this problem, Kanade [14] develops a video-rate stereo machine with the help of customized hardware. Shortly after, Kanade et al. [13] coin the term “Virtualized Reality” to characterize a system that is able to capture dynamic scenes. They use a multi-baseline stereo algorithm to improve reconstruction quality. However, they do not use their stereo machine for on-line processing due to its high cost. Recently, thanks to

the fast advances in CPU speed and digital video camera technology, some commercial products[21], [7] as well as academic research systems[5], [22] demonstrate that stereo reconstruction can be carried out by general-purpose computers in real time.

Parallel to these efforts, silhouette information has also been used to recover the 3D model of dynamic scenes. This alternative approach, shape-from-silhouette, reconstructs the scene as a visual hull[15]. Roughly speaking, the visual hull is a conservative shell that envelops the true geometry of the object. This shell can be determined from object silhouettes in multiple images taken from different viewpoints. Moezzi et al. [20] construct the visual hull using voxels in an off-line processing system. Later, Cheung et al. [4] show that the voxel method can be implemented in real time. Lok[16] exploits graphics hardware to accelerate the voxel method and also achieves real-time performance. Matusik et al.[17] propose an image-based visual hull algorithm. Their system also runs in real time, but no explicit 3D model is recovered. Soon after, they find that the polyhedral visual hull[18] not only provides an explicit 3D model, but can also be reconstructed even faster using an acceleration technique.

Each of the above two methods has some drawbacks. The visual hull cannot recover concave regions, no matter how many images are used. In addition, for recovering subtle details on the object, the visual hull method requires a large number of different views. As Simon Baker et al.[1] point out, the visual hull method makes use of rays that are tangent to the surfaces of the object of interest, while stereo primarily matches rays that are radiating from the object body. This means the two are quite complementary in nature. To our surprise, however, little attention has been paid to investigating the combination of the two methods.

One idea of using estimated geometry for stereo reconstruction has been proposed by Debevec et al.[6]. They construct a hierarchy of blocks instead of a visual hull as a geometric estimate. With the geometric information available, one image of the stereo pair is reprojected from the other image's viewpoint. In this way, the foreshortening problem for widely separated stereo pairs is eliminated and the stereo reconstruction can be done more robustly. Later, Vedula et al. [27] present a combined method for their

Virtualized Reality system. They first execute the stereo algorithm and then use voxel merging to obtain a volumetric model, from which a polygonal mesh is extracted. After that, foreground and background objects are separated according to depth continuity, and the foreground objects are reprojected to generate per-pixel search bounds for stereo matching, together with the silhouettes of the objects. Finally, stereo reconstruction is refined using the search bounds, and the improved depth maps are merged again into a volumetric model. They call their method Model-Enhanced Stereo (MES).

We propose a combined method in which a polyhedral visual hull is first constructed using silhouette information; this visual hull then serves as an initial geometry estimate to limit the search range of the following stereo algorithm. On the one hand, the visual hull improves both the quality and the speed of the stereo reconstruction. On the other hand, depth-from-stereo can recover more geometric detail and concave regions on the object, and compensates for the inherent drawback of the visual hull. Thus, better reconstruction quality can be obtained using our combined method. On-line processing not only eliminates the need for storing huge amounts of data, but also enables interactivity between multiple virtual and real objects. We aim at building a system which can capture a dynamic scene and perform 3D reconstruction and rendering in real time. In order to address the performance issue, we deploy our system in a client-server architecture to distribute image acquisition and stereo reconstruction to several client machines. Multi-threading techniques and the highly optimized IPL[11] & OpenCV[12] libraries are used to further boost performance. The choice of a polyhedral representation of the visual hull also contributes to a faster reconstruction.

As far as we know, this is the first effort to combine the two classic 3D reconstruction methods in real time. We demonstrate that we are able to combine them at interactive rates using mainstream hardware. Our method is similar in spirit to MES in its use of restricted search bounds for the stereo computation, but there are some noteworthy distinctions. First, we extract silhouette information from the original images in an early processing stage, while they generate the silhouettes from the estimated 3D objects in a middle stage. Secondly, we use a polyhedral visual hull representation for a direct and faster reconstruction of an initial geometric estimate. As a result, our system achieves interactive reconstruction and rendering rates.

The remainder of this paper is organized as follows. Sect. II gives an overview of our system, including the setup, calibration and synchronization. Sect. III explains the visual hull reconstruction and Sect. IV the basic stereo algorithm. Sect. V describes how the two are combined, and Sect. VI the rendering. After some timings and result images are shown in Sect. VII, we conclude this paper and present some ideas for future research.

II. SYSTEM OVERVIEW

Our system consists of several cameras observing the scene from different viewing directions.


These cameras are grouped pairwise and arranged along an arc. Each camera pair is connected to one client computer. These computers communicate with the server via a standard TCP/IP network. All cameras are calibrated in advance and image acquisition is synchronized at run time. Fig. 1 shows the acquisition setup.

Fig. 1. Acquisition setup: 3 camera pairs are arranged along an arc.

A. Processing stages

The system initialization includes recording a background image for each camera and sending calibration information from each client to the server. After the initialization, the system enters the processing cycle, which is defined as the time of processing one synchronized image set collected by all cameras. According to the direction of the network transfer, we divide one cycle into three stages, illustrated in Fig. 2.

In stage 1, the object's silhouette is estimated. First, the stereo pairs are individually rectified so that their epipolar lines align with the image scanlines, which is needed for the later stereo computation. Then, on each client, the moving foreground object is segmented from the background and used for silhouette extraction. Since the two cameras of a stereo pair are very close to each other, using both silhouettes does not improve the visual hull reconstruction much. Therefore, as seen in Fig. 2(a), we only extract the silhouette for one camera of each stereo pair and transfer it to the server.

In stage 2, when all the silhouette information is available, the server computes a polyhedral visual hull using general 3D CSG intersection and then broadcasts the polyhedral model back to all clients.

In the last stage, all clients use the visual hull to guide the stereo computation and generate depth maps. The depth maps, together with the color images, are sent over the network to the server for rendering. Note that since we already have the silhouette information, the stereo computation only needs to be performed on the foreground object mask instead of the whole image. The network transfer can also be greatly reduced by transferring only a rectangular region of interest (ROI) of the image.

The system architecture has been designed to distribute the computational load between the server and the clients. The time needed for image acquisition, rectification, silhouette extraction and stereo reconstruction is independent of the number of stereo pairs. This provides good scalability and allows us to achieve interactivity.


Fig. 2. Processing stages of the combined visual hull and depth-from-stereo algorithm

B. Calibration

Calibration is a fundamental prerequisite for all 3D reconstruction algorithms. For multiple cameras, we must calibrate them in a common Euclidean coordinate system. This is necessary both for the computation of the visual hull and for the rendering of color-plus-depth images. In our system, a checkerboard is used as the calibration target since it is easy to use and manipulate. However, the 3D feature points on the checkerboard cannot be seen simultaneously by all cameras pointing in different directions, so we cannot directly calibrate them in a common coordinate system. To tackle this problem, we first calibrate a subset of cameras, all of which can see the checkerboard. Then the checkerboard is repositioned, and we choose another subset of cameras which includes one camera belonging to the previous subset. As a result, this camera is calibrated twice with respect to different checkerboard positions. These two sets of calibration data allow us to register all camera parameters of the two subsets in one common frame. This procedure is repeated until all cameras in the system are calibrated. The calibration algorithm we use is the well-known coplanar calibration proposed by Tsai[26].
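To make the registration step concrete, the following sketch shows how the doubly-calibrated camera links the two subsets, assuming each calibration yields a 4x4 rigid transform that maps checkerboard (world) coordinates to camera coordinates. The function and variable names are illustrative only and are not part of our implementation.

```cpp
// Registration of two calibration subsets via the shared camera
// (illustrative sketch). Each M maps world (checkerboard) coordinates
// to camera coordinates as a 4x4 homogeneous transform: x_cam = M * x_world.
#include <Eigen/Dense>

using Mat4 = Eigen::Matrix4d;

// M_shared_1: shared camera w.r.t. checkerboard position 1 (the common frame)
// M_shared_2: shared camera w.r.t. checkerboard position 2
// M_j_2     : another camera of the second subset, w.r.t. position 2
// returns   : that camera's extrinsics expressed w.r.t. the common frame
Mat4 registerToCommonFrame(const Mat4& M_shared_1,
                           const Mat4& M_shared_2,
                           const Mat4& M_j_2) {
    // world1 --M_shared_1--> shared cam --M_shared_2^-1--> world2 --M_j_2--> camera j
    return M_j_2 * M_shared_2.inverse() * M_shared_1;
}
```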

C. Synchronization

The visual hull construction is based on silhouette information, which is not very sensitive to synchronization problems across multiple video streams. Intensity-based stereo algorithms, on the other hand, depend heavily on per-pixel color information; slight inconsistencies between the images of a stereo pair result in erroneous depth estimates. Hence, synchronization is crucial to the performance of our system.

As in [17], [18], we use external triggers on the cameras to implement the synchronization. We connect all cameras' trigger sockets to the pins of the parallel port on the server computer. After each processing cycle, we send a signal that orders every camera to deliver a frame to its host computer. The trigger method not only guarantees simultaneous image capture but also ensures that the server obtains synchronized information from the clients in spite of varying network latencies. From the results of our system, we conclude that this method works well for on-line 3D reconstruction.

III. VISUAL HULL RECONSTRUCTION

One can think of the visual hull as the set of all 3D points that project into the silhouette in every original view. From another perspective, the visual hull can be regarded as the intersection of 3D generalized cones, which are produced by extruding all the silhouette contours along the viewing directions. These two ways of thinking inspire two different methods of visual hull reconstruction: volumetric and polygonal. The former is computationally stable, but its downsides are an intensive memory requirement, limited recovery precision and time-consuming rendering of the volumetric data. In contrast, the latter uses a more elegant polygonal representation, requires less memory, and is suitable for fast direct rendering; however, a robust polyhedral intersection is hard to implement.

No matter which method is chosen, silhouette extraction is a necessary common step. The extraction first requires classifying the whole image into “foreground” and “background” parts. Image differencing[3] is used in our system. First, we accumulate 30-50 frames while keeping the background unchanged. Then the mean and standard deviation images for the background are computed. When a foreground object enters the scene, we subtract the statistical background information from every new frame to get the foreground object mask. After noise is removed by a median filter, we apply morphological operators to fill small holes in the mask. Finally, from the silhouette information, the contour can be retrieved as a 2D polygon. The above operations are all performed using functions from the IPL[11] and OpenCV[12] libraries

for the sake of fast computation. Fig. 3 shows an original image, the extracted foreground person, the corresponding mask, and the polygonal contour.
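For illustration, the sketch below performs the same differencing, filtering and contour extraction using the modern OpenCV C++ API rather than the IPL/OpenCV C interfaces used in our system; the threshold factor k and the kernel sizes are placeholder values, not the ones we actually use.

```cpp
// Illustrative sketch of background differencing and silhouette extraction.
#include <opencv2/opencv.hpp>
#include <vector>

// Per-pixel background statistics accumulated over ~30-50 empty frames.
struct BackgroundModel {
    cv::Mat mean;    // CV_32F, per-pixel mean of the gray background
    cv::Mat stddev;  // CV_32F, per-pixel standard deviation
};

// Classify pixels deviating from the background mean by more than
// k standard deviations as foreground, clean the mask, return the contour.
std::vector<cv::Point> extractSilhouette(const cv::Mat& frameBGR,
                                         const BackgroundModel& bg,
                                         double k = 3.0) {
    cv::Mat gray, gray32;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    gray.convertTo(gray32, CV_32F);

    cv::Mat diff, thresh = k * bg.stddev;
    cv::absdiff(gray32, bg.mean, diff);            // |I - mean|
    cv::Mat mask = diff > thresh;                  // CV_8U, 255 = foreground

    cv::medianBlur(mask, mask, 5);                 // remove salt-and-pepper noise
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE,  // fill small holes
                     cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(7, 7)));

    // Keep the largest contour as the 2D polygonal silhouette.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return {};
    size_t best = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[best])) best = i;
    return contours[best];
}
```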

Fig. 3. Image differencing and silhouette extraction

Once the 2D contour is sent to the server, it can easily be extruded to form a cone-like volume using the camera calibration information. The extrusion scale is set to a very large value so that we do not lose any part of the object. At the moment we use the Brep library [2] to carry out the 3D intersection of the cone-like volumes. Fig. 4 illustrates how a visual hull is obtained from three such volumes.
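As an illustration of the extrusion step, the following sketch back-projects each contour vertex through an assumed pinhole camera (intrinsics K, world-to-camera pose R, t) and pushes it out to a large distance; the resulting generalized cone is what the CSG intersection operates on. This is a simplified sketch, not the Brep-based implementation.

```cpp
// Extrude one 2D silhouette contour into a cone-like volume (sketch).
#include <Eigen/Dense>
#include <vector>

struct SilhouetteCone {
    Eigen::Vector3d apex;                       // camera center in world coords
    std::vector<Eigen::Vector3d> farPolygon;    // contour pushed out along the rays
};

SilhouetteCone extrudeSilhouette(const std::vector<Eigen::Vector2d>& contour,
                                 const Eigen::Matrix3d& K,   // intrinsics
                                 const Eigen::Matrix3d& R,   // world-to-camera rotation
                                 const Eigen::Vector3d& t,   // world-to-camera translation
                                 double farDist) {           // the "very large value"
    SilhouetteCone cone;
    cone.apex = -R.transpose() * t;             // camera center C = -R^T t
    for (const auto& p : contour) {
        Eigen::Vector3d pixel(p.x(), p.y(), 1.0);
        // Viewing ray of the contour vertex, expressed in world coordinates.
        Eigen::Vector3d dir = (R.transpose() * K.inverse() * pixel).normalized();
        cone.farPolygon.push_back(cone.apex + farDist * dir);
    }
    // The side faces (apex, v_i, v_i+1) plus the far cap bound the generalized
    // cone; these polyhedra are then intersected by the CSG routine.
    return cone;
}
```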

Fig. 4. Visual hull construction. (a) Three cone-like volumes extruded from different views. (b) The intersection result: a polyhedral visual hull.

The quality of the visual hull depends on the correctness and precision of the silhouette extraction. Shadows, however, are an annoying problem for obtaining correct silhouettes. A shadow cast onto a static surface such as the ground should be classified as background, because the geometry of that surface remains unchanged. However, the shadow moves with the foreground object and makes the shadowed region darker, so the image differencing technique will wrongly identify such regions as foreground. For the moment we avoid this problem by proper lighting and camera positioning. Cheung et al.[4] suggest using color information and multiple thresholds.

IV. BASIC STEREO ALGORITHM

The stereo algorithm can be decomposed into three steps: (1) rectification[8], (2) matching, and (3) depth recovery. It is well known that the use of epipolar constraints reduces the dimension of matching from 2D to 1D. For a general stereo configuration, however, the epipolar lines are not aligned with the image scanlines, which makes the pixel traversal more difficult. To address this issue, rectification, a perspective image warping, is applied to the stereo images to align the epipolar lines. Considering that the stereo cameras are fixed and the transformation between original and rectified images therefore remains constant, this warping operation can be accelerated by using a precomputed lookup table.

Matching is the most important step in the stereo computation, because once the correspondence between pixels is established, we can immediately calculate the depth through the following equation:

\mathrm{Depth} = \frac{\mathrm{Baseline} \cdot \mathrm{FocalLength}}{\mathrm{Disparity}} \qquad (1)

In the above equation, Baseline is the distance between the optical centers of the stereo pair, FocalLength is one of the intrinsic camera parameters obtained by calibration, and Disparity is the coordinate difference between the matching points. Since the stereo pair has already been rectified, the disparity along the Y axis of the image is always 0. So in the following, the disparity we mention only refers to a scalar: the coordinate difference in X direction between corresponding pixels.

We determine the correspondences by comparing small pixel neighborhoods around the points, for instance a rectangular window region. There are several criteria to evaluate the difference between the pixel neighborhoods, such as normalized cross-correlation (NCC), sum of squared differences (SSD), sum of absolute differences (SAD), etc. Among them, SAD is the fastest and has proved able to yield satisfactory results. Therefore, we choose SAD as the difference score for the correspondence search. We have further optimized the SAD computation and made it independent of the window size by exploiting the coherence of neighboring pixels. To improve the quality of the depth map, we interpolate neighboring difference scores to obtain sub-pixel accuracy[10] for the disparity. We also compute the disparity map for both stereo images and then check the left-right consistency to remove false matches.
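For reference, the following sketch shows a naive SAD window search on a rectified grayscale pair; it omits the incremental window sums, sub-pixel interpolation and left-right consistency check described above, and the window size and disparity bounds are placeholder parameters.

```cpp
// Naive SAD disparity search on a rectified CV_8UC1 stereo pair (sketch).
#include <opencv2/opencv.hpp>
#include <climits>
#include <cstdlib>

cv::Mat sadDisparity(const cv::Mat& left, const cv::Mat& right,
                     int dispMin, int dispMax, int halfWin = 3) {
    cv::Mat disp(left.size(), CV_32F, cv::Scalar(0));
    for (int y = halfWin; y < left.rows - halfWin; ++y) {
        for (int x = halfWin; x < left.cols - halfWin; ++x) {
            int bestD = dispMin, bestCost = INT_MAX;
            for (int d = dispMin; d <= dispMax; ++d) {      // assumes d >= 0
                if (x - d - halfWin < 0) continue;          // window must stay inside
                int cost = 0;
                for (int dy = -halfWin; dy <= halfWin; ++dy)
                    for (int dx = -halfWin; dx <= halfWin; ++dx)
                        cost += std::abs(left.at<uchar>(y + dy, x + dx) -
                                         right.at<uchar>(y + dy, x - d + dx));
                if (cost < bestCost) { bestCost = cost; bestD = d; }
            }
            disp.at<float>(y, x) = static_cast<float>(bestD);
        }
    }
    return disp;
}

// Depth then follows directly from equation (1):
//   depth = baseline * focalLength / disparity   (for disparity > 0)
```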

V. COMBINATION OF VISUAL HULL AND STEREO

One of the key factors influencing the correspondence search in the stereo computation is the disparity range. If the range is too large, more candidate correspondences must be examined. This not only slows down the overall computation but also increases the chance of false matches, which lead to incorrect depth recovery. Therefore, our combination is based on

tightening the disparity search range for the stereo matching. We exploit the polyhedral visual hull to guide the stereo algorithm in two different ways. The first imposes a dynamic range constraint on the stereo computation for all pixels; the constraint is calculated from the bounding box of the visual hull. The second generates a per-pixel disparity range through off-screen rendering of the visual hull, in order to provide a more precise constraint. In the following, we describe both in detail.

A. Dynamic range constraint

In traditional stereo matching, we can only specify one fixed depth range for all the video frames. This range has to accommodate all the depth changes of the dynamic events; in other words, it must be larger than the union of the depth ranges of all frames. A large depth range corresponds to a large disparity range, which both slows down the stereo matching and deteriorates the depth recovery result. Hence, we need a way to compute a tighter depth bound for every video frame.

The visual hull can serve this purpose in the following way. First, we compute a bounding box for the visual hull. For each client, we transform the 8 vertices of the box from the world coordinate system to the camera coordinate system. Then the minimum and maximum depth values are calculated from these transformed vertices. With equation 1, the depth range can be converted to a disparity range. Recall that the visual hull is a conservative estimate of the actual object, and so is its bounding box. Thus, the disparity range computed from the bounding box can be applied to restrict the stereo matching. Moreover, this computation is done for every frame, so the disparity range also varies from frame to frame. Therefore this method provides a tighter, dynamic range constraint. Fig. 5 shows an example of this idea: the fixed depth range is much larger than each dynamic range. Note that the exact fixed depth range cannot even be determined before all the video frames are recorded and analyzed, so an estimate of the fixed range could be larger still than the one depicted in the figure.

Fig. 5. Dynamic range constraint. The balls are the same object in different frames. Tighter bounds can be achieved for each single frame.

The main advantage of this method is that the bounding box computation can be done very quickly. It also removes the need to transfer the whole visual hull to the client: strictly speaking, only the eight bounding box vertices need to be transferred, which costs almost nothing. Despite its speed, however, the dynamic range constraint is a rather rough estimate, since it is global with respect to all pixels of the foreground object.
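The following sketch illustrates this computation, assuming the rectified reference camera's world-to-camera pose (R, t) and the calibrated baseline and focal length are available; the names and types are illustrative only.

```cpp
// Per-frame disparity range from the visual hull's bounding box (sketch).
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>

std::pair<int, int> disparityRangeFromBox(const Eigen::Vector3d& boxMin,
                                          const Eigen::Vector3d& boxMax,
                                          const Eigen::Matrix3d& R,   // world-to-camera
                                          const Eigen::Vector3d& t,
                                          double baseline, double focalLength) {
    double zMin = std::numeric_limits<double>::max(), zMax = 0.0;
    for (int i = 0; i < 8; ++i) {                       // 8 corners of the box
        Eigen::Vector3d corner((i & 1) ? boxMax.x() : boxMin.x(),
                               (i & 2) ? boxMax.y() : boxMin.y(),
                               (i & 4) ? boxMax.z() : boxMin.z());
        double z = (R * corner + t).z();                // depth, camera looks along +z
        zMin = std::min(zMin, z);
        zMax = std::max(zMax, z);
    }
    // Equation (1): disparity = baseline * focalLength / depth.
    // Far depth -> small disparity, near depth -> large disparity.
    int dispMin = static_cast<int>(std::floor(baseline * focalLength / zMax));
    int dispMax = static_cast<int>(std::ceil (baseline * focalLength / zMin));
    return {dispMin, dispMax};
}
```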

B. Per-pixel range constraint

So far, we have not taken full advantage of the visual hull information. If we send the complete polyhedral visual hull back to the clients, the disparity range can be refined to a per-pixel level, and the stereo algorithm can benefit from this more precise constraint. In the following, we present two schemes for generating a per-pixel range constraint. One is not conservative but faster; the other is truly conservative at the expense of more computation. Since both of them need a disparity map generated from the visual hull, we first discuss how to obtain it.

1) Compute disparity map from visual hull: Given the polyhedral geometry and the camera pose information, we are able to generate a depth map using the Z-buffer algorithm. Since the Z-buffer algorithm is implemented in graphics hardware, we use OpenGL off-screen rendering to accelerate the depth map generation. Once we have the depth map, it can easily be converted to a disparity map using equation 1. We find that this conversion can be decomposed into two simple arithmetic image operations, which can be performed using fast IPL functions. The following explains this in detail. First, from the OpenGL transformation pipeline[23] we can derive the equation:

Z_{gl} = \left( \frac{\mathrm{Depth}\,(\mathrm{Far}+\mathrm{Near}) - 2\,\mathrm{Far}\,\mathrm{Near}}{\mathrm{Depth}\,(\mathrm{Far}-\mathrm{Near})} + 1 \right) \Big/\, 2 \qquad (2)

where Z_{gl} is the value read from the OpenGL depth buffer, Far and Near are the far and near planes used in the perspective viewing volume setup, and Depth is the depth value of a 3D point on the object. We rewrite equation 2 in order to express the real depth value in terms of Z_{gl}:

\mathrm{Depth} = \frac{\mathrm{Far} \cdot \mathrm{Near}}{\mathrm{Far} - Z_{gl}\,(\mathrm{Far}-\mathrm{Near})} \qquad (3)

Substituting equation 3 into equation 1 leads to:

\mathrm{Disparity} = \frac{\mathrm{Baseline} \cdot \mathrm{FocalLength}\,\bigl(\mathrm{Far} - Z_{gl}\,(\mathrm{Far}-\mathrm{Near})\bigr)}{\mathrm{Far} \cdot \mathrm{Near}} \qquad (4)

Since Far, Near, Baseline and FocalLength are all constants, the above equation tells us that the disparity can be computed from the original OpenGL depth value with one floating-point multiplication and one addition per pixel.
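As an illustration, the sketch below reads back the off-screen Z-buffer with glReadPixels and applies this multiply-add per pixel; our implementation performs the same operation with IPL image arithmetic, and the surrounding OpenGL context and off-screen buffer setup are assumed to exist.

```cpp
// Convert the rendered Z-buffer of the visual hull into a per-pixel
// disparity bound using equation (4): disparity = a * z_gl + b (sketch).
#include <GL/gl.h>
#include <vector>

std::vector<float> disparityFromDepthBuffer(int width, int height,
                                            double nearPlane, double farPlane,
                                            double baseline, double focalLength) {
    std::vector<float> zgl(static_cast<size_t>(width) * height);
    glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, zgl.data());

    // Constants of equation (4) rearranged as disparity = a * z_gl + b.
    const float a = static_cast<float>(-baseline * focalLength *
                        (farPlane - nearPlane) / (farPlane * nearPlane));
    const float b = static_cast<float>( baseline * focalLength / nearPlane);

    std::vector<float> disparity(zgl.size());
    for (size_t i = 0; i < zgl.size(); ++i)
        disparity[i] = a * zgl[i] + b;   // one multiply and one add per pixel
    // Background pixels (z_gl = 1) map to the far-plane disparity; they are
    // masked out by the silhouette anyway.
    return disparity;
}
```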

2) Approximate range limit: With the default OpenGL settings, the Z-buffer contains the depth values of the front surfaces. Therefore, the disparity map we obtain in Sec. V-B.1 is an upper bound of the disparity for every pixel. In order to generate a lower bound and thus form a range, we can simply subtract a common threshold from the upper bound. The selection of the threshold depends on the maximum concavity of the actual object. Consider the situation in Fig. 6: for the depth recovery of point B, the disparity threshold must be greater than the disparity difference computed from the depths of A and B. Otherwise, the actual disparity falls outside the estimated range limit and the depth cannot be recovered correctly. However, the concavity of the object is not known in advance for a dynamic scene, so this scheme does not guarantee correct disparity range coverage of the object. In other words, it is not conservative.

Fig. 6. Per-pixel range constraint. The shaded region is the visual hull represented in 2D. The actual object is drawn with a thick line.

In our system, the test scene is a moving person, which does not exhibit obvious concavities. Therefore the approximate range limit works very well. Fig. 7 shows the result. Note that in Fig. 7(a) the disparity map is very smooth, which means the surface detail of the object is poorly reconstructed by the visual hull. In Fig. 7(b), there is a large region of false depth recovery on the lower right part of the body due to the large disparity range. In Fig. 7(c) we can see that this false recovery is corrected by using the approximate per-pixel range limit. At the same time, more detail is recovered by the stereo algorithm than by the visual hull.

In the previous scheme, we only make use of the front surfaces of the visual hull. But if we also know the depth values of the back surfaces, we can obtain both an upper and a lower bound of the disparity. Taking the example of Fig. 6 again, a conservative disparity range for point B can be determined from the depth values of A and C. This conservative range is necessary for a correct depth recovery of point B. One disadvantage of this method is that it needs two passes of OpenGL rendering to generate the depth maps for both front and back surfaces. But since the visual hull is rendered by the graphics hardware, the real-time performance of the whole system is not lowered much.


Fig. 7. Improved stereo reconstruction. (a) Disparity map generated from the visual hull. (b) Depth map without using visual hull information. (c) Improved depth map using the approximate range limit by a threshold. (d) Rendered view from the depth map in (b). (e) Rendered view from the depth map in (c).

Even with the help of the per-pixel range constraint, the stereo algorithm may still fail in some specular or textureless regions. In such regions, we can use the disparity map generated from the visual hull to fill in the missing values.
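The two rendering passes for the conservative variant can be sketched as follows, assuming an off-screen OpenGL context is already set up; drawVisualHull() is only a placeholder for issuing the hull's triangles. The front-surface pass uses the default GL_LESS depth test; the back-surface pass clears the depth buffer to 0 and uses GL_GREATER.

```cpp
// Two-pass rendering of the visual hull to obtain per-pixel upper and lower
// disparity bounds (sketch; the FBO/context setup is assumed to exist).
#include <GL/gl.h>

void drawVisualHull() { /* placeholder: issue the hull's triangles here */ }

void renderFrontAndBackDepth() {
    // Pass 1: nearest (front) surfaces -> upper disparity bound per pixel.
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glClearDepth(1.0);
    glClear(GL_DEPTH_BUFFER_BIT);
    drawVisualHull();
    // ... read the depth buffer and convert with equation (4) ...

    // Pass 2: farthest (back) surfaces -> lower disparity bound per pixel.
    glDepthFunc(GL_GREATER);
    glClearDepth(0.0);
    glClear(GL_DEPTH_BUFFER_BIT);
    drawVisualHull();
    // ... read back the second depth map and convert likewise ...
}
```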

VI. RENDERING

It is straightforward to synthesize a novel view from multiple fully-calibrated color images with depth information. Some systems compute a 3D point cloud from multiple depth maps and render these points directly. Forward 3D warping[19], [22] warps every reference view to the destination view and makes it possible to combine different samples corresponding to one common point in 3D. Of course, one can also merge the depth maps into a single polyhedral model and render it using texture mapping, which can yield a more consistent rendering result; however, the polyhedral model extraction is too hard to accomplish in real time. Based on the above analysis, we choose the 3D warping method for our rendering as a trade-off between quality and speed. We implemented the incremental computation proposed by Shade et al.[24] in order to achieve faster results. The rendering is decoupled from the 3D reconstruction processing cycle and can run much faster than the 3D reconstruction. Fig. 8 shows a sequence of dynamic events rendered from a novel viewpoint.
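For completeness, the following sketch shows a plain (non-incremental) forward warp of one color-plus-depth image into a novel view, assuming pinhole intrinsics and world-to-camera poses for both views; splatting, hole filling and the incremental optimization of Shade et al.[24] are omitted.

```cpp
// Naive forward 3D warping of one color+depth image into a novel view (sketch).
#include <Eigen/Dense>
#include <opencv2/opencv.hpp>

void forwardWarp(const cv::Mat& color,          // CV_8UC3 reference image
                 const cv::Mat& depth,          // CV_32F per-pixel depth
                 const Eigen::Matrix3d& Kref, const Eigen::Matrix3d& Rref,
                 const Eigen::Vector3d& tref,
                 const Eigen::Matrix3d& Kdst, const Eigen::Matrix3d& Rdst,
                 const Eigen::Vector3d& tdst,
                 cv::Mat& output) {             // preallocated CV_8UC3 destination
    const Eigen::Matrix3d KrefInv = Kref.inverse();
    for (int v = 0; v < color.rows; ++v) {
        for (int u = 0; u < color.cols; ++u) {
            float z = depth.at<float>(v, u);
            if (z <= 0.0f) continue;                       // background pixel
            // Unproject to the reference camera frame, then to world coordinates.
            Eigen::Vector3d xCam = z * (KrefInv * Eigen::Vector3d(u, v, 1.0));
            Eigen::Vector3d xWorld = Rref.transpose() * (xCam - tref);
            // Project into the destination view.
            Eigen::Vector3d xDst = Kdst * (Rdst * xWorld + tdst);
            if (xDst.z() <= 0.0) continue;                 // behind the camera
            int ud = static_cast<int>(xDst.x() / xDst.z() + 0.5);
            int vd = static_cast<int>(xDst.y() / xDst.z() + 0.5);
            if (ud < 0 || vd < 0 || ud >= output.cols || vd >= output.rows) continue;
            output.at<cv::Vec3b>(vd, ud) = color.at<cv::Vec3b>(v, u);
        }
    }
}
```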


Fig. 8. Rendering Result. A sequence of dynamic events rendered from a novel viewpoint at about 10 fps

VII. IMPLEMENTATION AND RESULTS

We use 6 Sony DFW500 FireWire video cameras, which are connected to 3 Linux PCs. Each of them has a single Athlon 1.1GHz CPU, an Nvidia GeForce2 graphics card and 768MB RAM. Another machine with the same configuration is used as the server. The video capturing resolution is set to 320 x 240, while the off-screen rendering of the depth map and the stereo computation run at half of that resolution in order to obtain higher frame rates. We use a multi-threaded implementation on both the client machines and the server and achieve 2.5-3.3 fps for the reconstruction. The rendering runs at about 10 fps.

We have measured the timings of every step in one processing cycle on one client machine. Table I gives the result. Note that the timings of the two cameras in one stereo pair are presented separately. The accumulated total time is a bit smaller than the actual total time because there is some overhead due to the synchronization.

Processing step                              Timings (ms)
Image capturing & rectification              29 / 29
Silhouette extraction                        29 / 43
Visual hull reconstruction                   15
Disparity computation from visual hull       87 / 74
Disparity computation from stereo            30
Depth computation                            25 / 20
Network transfer of color+depth map          57 / 48
Accumulated total time                       272 / 259
Actual total time                            299 / 272

TABLE I. Timings of our system. The timings of the pair of cameras are presented separately.

When we only use three silhouettes, the visual hull reconstruction does not take much time, as seen in Table I. But

when more cameras are used to capture the entire object, this time will increase and may become a bottleneck of the whole system. In that case, a faster algorithm such as the one in [18] can be implemented to alleviate the problem.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we present a combined method for the reconstruction of dynamic events in an on-line processing system. We use the polyhedral visual hull representation as an estimate of the object in the scene. It can be reconstructed quickly and hence meets the demands of a real-time system. This estimated geometry imposes a bounding volume constraint or a per-pixel constraint on the depth-from-stereo algorithm to improve and speed up the recovery. The stereo algorithm can recover geometric detail and concave regions on the object, both of which cannot be accomplished by the visual hull method alone. Compared to previous systems, ours gives better reconstruction results while keeping near real-time performance. This is made possible by the fast reconstruction of the visual hull, distributed computation across several machines and processors, and the use of intensively optimized image processing libraries. We hope that our work can inspire more research on hybrid approaches for real-time 3D reconstruction.

In the future, we would like to develop more reconstruction and rendering algorithms based on the combination of stereo and visual hull. For example, the visual hull can also be used as a basic model for carrying out a model-based stereo[6] algorithm. Two other possibilities to employ visual hull information are: 1) helping to merge the range maps into one consistent model; 2) predicting occlusion information for stereo reconstruction. A common question for these proposals is how precise the visual hull must be in order to help in the different tasks. Because we mainly focus on developing a real-time system, higher frame rates are always desirable. Therefore, network transfer, compression, and parallelization also need to be further investigated.

Finally, we plan to integrate virtual objects into our system and allow real-time interaction between real and virtual objects. Such seamless integration and interaction would provide the user with a great sense of immersion.

ACKNOWLEDGMENT

Special thanks to Marcus Magnor for his valuable advice on this paper.

REFERENCES

[1] S. Baker, T. Sim, and T. Kanade. A characterization of inherent stereo ambiguities. In Proceedings of the 8th International Conference on Computer Vision, 2001.
[2] Philippe Bekaert. Boundary representation library. http://breplibrary.sourceforge.net/.
[3] M. Bichsel. Segmenting simply connected moving objects in a static scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11):1138–1142, November 1994.
[4] Kong Man Cheung, Takeo Kanade, J.-Y. Bouguet, and M. Holler. A real time system for robust 3d voxel reconstruction of human motions. In Proceedings of the 2000 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '00), volume 2, pages 714–720, June 2000.
[5] K. Daniilidis, J. Mulligan, R. McKendall, G. Kamberova, D. Schmid, and R. Bajcsy. Real-time 3d tele-immersion. In A. Leonardis et al., editor, The Confluence of Vision and Graphics. Kluwer Academic Publishers, 2000.
[6] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. Computer Graphics, 30(Annual Conference Series):11–20, 1996.
[7] Videre Design. Small vision system. http://www.ai.sri.com/konolige/svs/svs.htm.
[8] Olivier Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, Cambridge, Massachusetts, 1993.
[9] Henry Fuchs, Gary Bishop, Kevin Arthur, Leonard McMillan, Ruzena Bajcsy, Sang Lee, Hany Farid, and Takeo Kanade. Virtual space teleconferencing using a sea of cameras. In First International Symposium on Medical Robotics and Computer Assisted Surgery, pages 161–167, 1994.
[10] A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with multiple windowing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 858–863. IEEE Computer Society Press, June 1997.
[11] Intel. IPL library 2.2 for Linux. http://www.intel.com/software/products/perflib/ipl/index.htm.
[12] Intel. OpenCV library beta 2.1 for Linux. http://www.intel.com/research/mrl/research/opencv/.
[13] T. Kanade, P. J. Narayanan, and P. W. Rander. Virtualized reality: Concept and early results. IEEE Workshop on the Representation of Visual Scenes, June 1995.
[14] Takeo Kanade. Development of a video-rate stereo machine. In Proceedings of the 1994 ARPA Image Understanding Workshop (IUW'94), pages 549–558, November 1994.
[15] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Machine Intell., 16(2):150–162, February 1994.
[16] Benjamin Lok. Online model reconstruction for interactive virtual environments. In Proceedings 2001 Symposium on Interactive 3D Graphics, pages 69–72, March 2001.
[17] Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J. Gortler, and Leonard McMillan. Image-based visual hulls. In Kurt Akeley, editor, Siggraph 2000, Computer Graphics Proceedings, pages 369–374. ACM Press / ACM SIGGRAPH / Addison Wesley Longman, 2000.
[18] Wojciech Matusik, Chris Buehler, and Leonard McMillan. Polyhedral visual hulls for real-time rendering. In Proceedings of Twelfth Eurographics Workshop on Rendering, pages 115–125, June 2001.
[19] Leonard McMillan. An Image-based Approach to Three-Dimensional Computer Graphics. PhD thesis, University of North Carolina at Chapel Hill, 1997.
[20] Saied Moezzi, Arun Katkere, Don Y. Kuramura, and Ramesh Jain. Reality modeling and visualization from multiple video sequences. IEEE Computer Graphics and Applications, 16(6):58–63, November 1996.
[21] Pt Grey Research. Digiclops stereo vision systems. http://www.ptgrey.com/index.htm.
[22] Hartmut Schirmacher, Ming Li, and Hans-Peter Seidel. On-the-fly processing of generalized Lumigraphs. In Proc. Eurographics 2001, volume 20 of Computer Graphics Forum, pages 165–173. Eurographics Association, Blackwell, 2001.
[23] Mark Segal and Kurt Akeley. The OpenGL Graphics System: A Specification (Version 1.3). Silicon Graphics, Inc., Aug. 2001. http://www.opengl.org/developers/documentation/version1_3/glspec13.pdf.
[24] Jonathan W. Shade, Steven J. Gortler, Li-Wei He, and Richard Szeliski. Layered depth images. Computer Graphics, 32(Annual Conference Series):231–242, August 1998.
[25] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[26] R. Y. Tsai. An efficient and accurate camera calibration technique for 3-D machine vision. In Proc. IEEE Computer Vision and Pattern Recognition, pages 364–374, Miami Beach, FL, June 1986.
[27] Sundar Vedula, Peter Rander, Hideo Saito, and Takeo Kanade. Modeling, combining, and rendering dynamic real-world events from image sequences. In Proc. 4th Conference on Virtual Systems and Multimedia (VSMM98), pages 326–332, November 1998.