
Single Image-Based On-line Camera Calibration and VE Modeling Method for Teleoperation via Internet

JiaCheng Tan (a), Gordon J. Clapworthy (a), Igor R. Belousov (b)

(a) Department of Computer & Information Sciences, De Montfort University, Milton Keynes MK7 6HP, United Kingdom
(b) Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Moscow 125047, Russia

ABSTRACT

Virtual Environment (VE) techniques provide a powerful tool for the visualization of the 3D environment of a teleoperation work site, particularly when “live” video display is inadequate for the task operation or its transmission is constrained, for example by limited bandwidth. However, the ability of VE to cope with the dynamic phenomena of typical teleoperation work sites is severely limited by its pre-defined, model-based nature. Thus, an on-line composing mechanism is needed to make it environment-compliant. For this purpose, this paper describes an on-line technique for camera calibration and an interactive VE modeling method that works on a single 2D image. Experiments have shown that the methods are convenient and effective for on-line VE editing.

Keywords: Virtual environment, visualization, teleoperation, robotics, image-based modeling

1. INTRODUCTION

Teleoperation usually excludes the possibility of direct access to the work site; it thus requires close supervision by the human operator, whose primary need is for accurate information about the site. In a dynamic environment, one of the major challenges is to provide such information at a speed suited to the activity. In many cases, live video (monoscopic or stereoscopic) is used to meet this requirement, presenting the operators with “live” views of the work site from particular positions. However, there are severe restrictions on video display when it is used for Internet-based teleoperation, particularly in relation to the frame rate which, in practice, is far below the minimum requirement and may vary considerably depending upon the communication volume on the Internet at the time. Another shortcoming of live video display is that the scene presented is not navigable, and the operator cannot interact with the objects in the display.

To overcome such restrictions, VE techniques have been explored to re-create the remote work scene and present it to the teleoperator1. VEs have a number of advantages – they are visually convincing, highly interactive, and require low bandwidth to update because of the model-based nature of computer graphics. Nevertheless, VE technologies are currently inadequate to handle the uncertainties of changes to the environment which may take place during teleoperation. These may be caused by accidents, by external interventions, or simply because the environment evolves in an unpredicted way; such dynamic factors potentially exist in any teleoperation system. Obviously, proper interpretation of these dynamic factors is crucial to subsequent task operations.

For these reasons, VR alone is seldom used for teleoperation purposes. It is usually combined with a measure of video transmission because video images naturally represent dynamic changes to the environment. However, simply putting the VR and video image displays together is of limited value: as long as the changes are not appropriately integrated into the VE, the part of the environment that has changed cannot be manipulated in the same sense as the remaining parts of the VE, and the advantages afforded by VR cannot be exploited in these regions. At present, a commonly-applied technique used to tackle the problem is augmented reality (AR)2, in which the model-based graphics contents are not affected by any unexpected environmental change, and real images of the changed parts of the environment are imposed on to the VE. However, the constraints associated with image transmission and display mentioned previously mean that AR is difficult to apply in Internet-based teleoperation. Moreover, if the environmental change in the image cannot be reflected in the form of graphics models, AR faces the same problem of being unable to interact with this part of the environment in processes such as task and path planning.

The above arguments suggest the need for an effective on-line modeling method if VR is to be successfully applied in aid of teleoperation. For this purpose, we have pursued a light-weight modeling method based on single images to capture the distinctive features of environmental changes so that on-line VR editing can be realised. To achieve this, we need a fast but effective method for image analysis and for object modeling. This paper introduces the work we have done in this respect.

2. BACKGROUND

Recognizing and recovering objects from images has long been a subject in computer and robot vision, but in recent years the subject has also received much attention from the area of computer graphics. The driving force behind this is the increased interest in image-based rendering; by rendering a 3D scene using 2D images, the slow and tedious modeling processes for virtual reality, animation and CAD can be avoided, while the photo-realistic rendering quality is, at the same time, retained. However, many aspects of this problem remain unsolved. The most impressive progress during this period has been mainly in recognizing 2D objects in single images and in recognizing 3D objects in range maps. Considerable progress has also been made in the recognition of 2D or 3D objects by using multiple 2D images. At present, under certain circumstances and at a certain cost, 3D objects can be recovered from images. Complex scenes are also recoverable on an object-by-object basis. Some of these techniques have been successfully applied in commercial software such as PhotoModeler (http://www.photomodeler.com). Roughly, these methods can be classified into two categories, according to whether they use multiple images or just a single image.

2.1. Multiple-image-based methods

The multiple-image-based methods work on two or more images that are taken from successive, carefully-chosen views of objects. By analyzing the geometric relationships between the different camera positions from which the images are taken and the correspondence between features on the successive images, depth information can be recovered. A very popular method that works on this principle is stereo correspondence matching3,4,5. One of the difficulties in applying the method is the determination of stereo correspondences between images. In general, the method is successful only when the images are similar in appearance, i.e. the images are obtained by using cameras that are closely spaced. As the distance of separation (or baseline) increases, the surfaces in the images exhibit different degrees of foreshortening, different patterns of occlusion and large disparities in their locations in the two neighboring images – all of these are determining factors in stereo correspondence matching. Besides these factors, a lack of texture in the scene may make a stereo algorithm fail to find the correct match for a particular point, since many neighborhoods of the point will be similar in appearance. While the use of a small baseline makes the correspondence matching easier, as the baseline is decreased, the depths recovered from the images become very sensitive to noise in the image measurements.

The structure-from-motion method also works on multiple images and also utilizes the correspondence between image features. Unlike stereo matching, however, the method does not require knowledge of the camera parameters, and the images used can be calibrated or uncalibrated6. The key process of the method is to find a projective transform; with this, one is able to recover the 3D locations of points and the positions of cameras, up to an unknown scale factor4,7. As in stereo matching, the limitation of the method is that the recovered structure is very sensitive to noise in the image measurements when the translation between the available camera positions is small.

Instead of focusing on particular points or lines as in the methods above, the shape-from-silhouette-contours technique considers the object as a whole. The silhouette of the object and the camera center associated with a particular image define an enclosing conic region of space. The intersections of many such cones define a bounding volume that represents the shape of the object. Given a sufficient number of images captured from different view angles, in principle the method is able to recover any convex 3D object8.

While this set is by no means complete, the methods described above do demonstrate the basic characteristics of multiple-image-based methods. In general, such methods are applicable under certain conditions, to a certain class of problems, in which sufficient views of objects are available and a sufficient number of correspondences can be found. The advantages of the methods are that they do not normally require prior knowledge about the objects or scenes, and they can be implemented as fully-automatic algorithms.

However, several negative elements must also be taken into account when using these methods. Firstly, the results obtained are usually not very satisfactory – the complexity of the real scene and the fragility of the recovery algorithms are among the factors accounting for this. Secondly, the methods usually exhibit poor real-time performance. Thirdly, in some applications we cannot take it for granted that we shall always have as many 2D images as the methods require. Very often, due to the restrictions of the working environment, the camera cannot move freely, and its locations may be confined to a very limited space. In such cases, a method using as few 2D images as possible is more advantageous.

2.2. Single-image-based methods

Relatively little work has been conducted on recovering 3D objects from single images. The reason for this is very clear: a single image of a generic 3D object does not contain sufficient information for the full reconstruction of the object. In fact, it has been proven that the recognition of a specific 3D object usually requires at least two appropriately-defined 2D model views in the case of orthographic projection9. Under more general conditions, such as perspective projection or non-uniform transformation, many more views may be required10. However, it has been shown that 3D objects can be partially or completely recovered from a single 2D view by exploiting the properties of, and the constraints on, the objects or the scene. For example, symmetry is a strong constraint for information restoration. The constraints may take other forms if simple properties are not easy to find; for example, the objects may belong to a specific class that is known a priori or is learned from views of “prototypical” objects of the same class. In this sense, single-image-based modeling methods are actually techniques that exploit the properties of the objects or scene so that the information lost in perspective projection can be restored, or can be generated according to the prior knowledge. The following are some influential techniques that work on single images, though they may not all be equally suitable for 3D modeling.

The shape-from-shading technique obtains 3D information about objects by an analysis of the irradiance equation of the scene. The irradiance equation governs the relation between the brightness in the image and the normal of a specified area on the surface of an object. In principle, the continuity constraint on the visible surface of an object facilitates the solution for the surface, and the hidden surface can be recovered by using other constraints on the objects, such as symmetry. However, the irradiance equation is not sufficiently accurate for interpreting images because it regards shading as a local phenomenon. In the real world, surfaces that reflect light to the observer reflect light to other surfaces as well. Theoretically, the inter-reflection effect can be taken into account by introducing a radiosity equation, which expresses the radiosity on a surface patch as the integral over the light reflected to it by all other surface patches in the image11. However, the relationship between an object’s shape and its radiosity is exceptionally complex, because it requires exact knowledge of the positions and reflectances of all surfaces, including those outside the field of view of the camera12. For the same reason, it is difficult to obtain absolute depth information for the surface even if well-controlled illumination is applied. Thus, except for very simple scenarios, the method is more suitable for generating new model views than for extracting 3D objects.

To produce a novel view of objects for image-based recognition and graphic rendering, Vetter and Poggio13 investigated a method that works on a single image. Their study focuses on specific objects that can be represented by a linear combination of a sufficiently small number of other, simpler objects – the prototypes. The study of the properties of such objects, the linear object classes, shows that new orthographic views of any object of the class under uniform transformation can be generated exactly if the corresponding transformed views are known for the set of prototypes. Although the method reported works only for a specific class of objects, its generalization provides ways of dealing with more general objects, as reported by Lando and Edelman14 in their work on face recognition. The weak point of the method is that it requires knowledge of the prototypes viewed from a different direction, which may be too strong a requirement to meet in practical applications.

The previously-mentioned methods can be helpful when 3D models of the objects are not explicitly required. To model a 3D object, as in the construction of virtual environments, we need qualitative as well as quantitative descriptions of objects. In this respect, Debevec15 reported an interactive modeling system that exploits the constraints present in the architectural scene and is capable of recovering an architectural structure model from a single image. The method takes advantage of prior knowledge about the architectural structures to be recovered: by casting the predefined model on to the image plane and aligning the model with the image features, the model parameters are obtained. Another attempt to model a 3D object/scene from a single image, called Tour Into the Picture, has been reported by Horry et al.16 in the area of computer animation. By specifying the vanishing point of the scene, they created a simple 3D model of the scene so that a 3D feeling of “walking or flying through” the 2D picture can be achieved in animation. Unfortunately, the model thus established is very crude and not fully 3D-structured, so the method is not applicable to accurate 3D modeling.

Obviously, single-image-based methods are relatively immature compared with multiple-image-based methods. However, considering the scarcity of source images in our application, we have paid particular attention to the single-image-based methods. In this paper, we present an interactive modeling technique that extracts 3D models from a single image and converts the models into graphic objects of the virtual environment for teleoperation.

3. ON-LINE CAMERA CALIBRATION

Unlike some of the multiple-image-based methods, in which explicit camera calibration is not required, single-image-based methods require the camera to be explicitly calibrated if absolute parameters are required. This is because a single image does not contain the absolute depth information that is present in multiple images. For a given camera, the intrinsic parameters can be found beforehand using standard calibration methods such as that reported by Tsai17; this is a once-for-all process. In contrast, external calibration of the camera may be required from time to time in practice. How often a camera needs to be externally calibrated depends upon how the camera is mounted in the working environment. If a camera is rigidly fixed in the teleoperation environment, its location is known, so the external calibration, just like the intrinsic calibration, has to be done only once. However, if a camera is mounted on the robot manipulator and thus moves with it, or if a camera does not move with the manipulator but is not permanently mounted, a mechanism for on-line adjustment of the external parameters will be needed. Because our system is based on single images, any external calibration method that relies on correspondence matching or requires more than one image is excluded, and the calibration has to be carried out with the current image of the environment.

To find the external parameters of a camera system from a single image, we require sufficient absolute knowledge about the scene. This requirement is usually difficult to meet in many computer vision applications. Fortunately, for teleoperation systems, we can often find some objects about which we have prior knowledge, for example the robot manipulator or part of it. By assigning feature points on the robot manipulator, or on other objects such as calibration landmarks, finding the external parameters of the camera is straightforward. Obviously, a calibration method that requires a huge input effort would be prohibitive for on-line use, so we pursue a method that requires as few inputs as possible. For this purpose, we investigate a simple method to accomplish the calibration, as described below.

Figure 1. Perspective projection model

For simplicity, in the following discussion we use the well-accepted pin-hole camera model. It is well known that at least six points, of which no four are coplanar, are needed to obtain the calibration matrix of the camera if we solve the calibration equations directly. In our experience, it is very hard to obtain acceptable external parameters when exactly six points are used unless the user inputs are strictly controlled; the equations are very sensitive to input errors and are not stable enough to guarantee a solution. We are therefore interested in methods that lead directly to the external parameters, rather than obtaining them by solving the calibration equations.

3.1. Calibration formulation

It is known that if the intrinsic parameters of the camera are separated from the external ones, only six parameters remain to be solved. This means we can use as few as three points to obtain them. To do so, we choose the camera coordinate system as shown in Fig. 1. Suppose $A(x_{1w}, y_{1w}, z_{1w})$, $B(x_{2w}, y_{2w}, z_{2w})$ and $C(x_{3w}, y_{3w}, z_{3w})$ are three known points in the world coordinate system, and $A(x_{1c}, y_{1c}, z_{1c})$, $B(x_{2c}, y_{2c}, z_{2c})$ and $C(x_{3c}, y_{3c}, z_{3c})$ are their representations in the camera coordinate system. These points are mapped on to the image plane as $a(x_1, y_1)$, $b(x_2, y_2)$ and $c(x_3, y_3)$ by perspective projection. Suppose the three points can be recovered in camera coordinates; they then define a vector $\mathbf{n}_c$ in the camera coordinate system,

\[
\mathbf{n}_c = \mathbf{v}_{AB} \times \mathbf{v}_{AC} =
\begin{vmatrix}
\mathbf{i}_c & \mathbf{j}_c & \mathbf{k}_c \\
x_{2c}-x_{1c} & y_{2c}-y_{1c} & z_{2c}-z_{1c} \\
x_{3c}-x_{1c} & y_{3c}-y_{1c} & z_{3c}-z_{1c}
\end{vmatrix},
\tag{1}
\]

where $\mathbf{v}_{AB}$ and $\mathbf{v}_{AC}$ are the vectors defined by the point pairs $A, B$ and $A, C$ respectively; $\mathbf{n}_c$ is fully defined up to its direction. The same vector represented in the world coordinate system is

\[
\mathbf{n}_w =
\begin{vmatrix}
\mathbf{i}_w & \mathbf{j}_w & \mathbf{k}_w \\
x_{2w}-x_{1w} & y_{2w}-y_{1w} & z_{2w}-z_{1w} \\
x_{3w}-x_{1w} & y_{3w}-y_{1w} & z_{3w}-z_{1w}
\end{vmatrix}.
\tag{2}
\]

In equations (1) and (2), $\mathbf{i}_w, \mathbf{j}_w, \mathbf{k}_w$ and $\mathbf{i}_c, \mathbf{j}_c, \mathbf{k}_c$ are the unit vectors of the world and camera coordinate systems, respectively. With these unit vectors and one of the three points, for example point $A$, we can construct a reference coordinate system $O_r$, defined so that $\mathbf{n}_c$ is its unit vector $\mathbf{k}_r$, $\mathbf{v}_{AB}$ (or $\mathbf{v}_{AC}$) is $\mathbf{i}_r$, and $\mathbf{k}_r \times \mathbf{i}_r$ is $\mathbf{j}_r$. The point $A$ will be chosen as the origin of this reference system, with its location expressed as $(x_{0w}, y_{0w}, z_{0w})$ in world coordinates and $(x_{0c}, y_{0c}, z_{0c})$ in camera coordinates. With this definition, it is not difficult to write the vectors $\mathbf{i}_r$, $\mathbf{j}_r$ and $\mathbf{k}_r$ in both world and camera coordinates. Suppose we have

\[
(\mathbf{i}_r)_w = \begin{pmatrix} x_{irw} \\ y_{irw} \\ z_{irw} \end{pmatrix},\quad
(\mathbf{j}_r)_w = \begin{pmatrix} x_{jrw} \\ y_{jrw} \\ z_{jrw} \end{pmatrix},\quad
(\mathbf{k}_r)_w = \begin{pmatrix} x_{krw} \\ y_{krw} \\ z_{krw} \end{pmatrix}
\tag{3}
\]

in world coordinates and their equivalents in camera coordinates

\[
(\mathbf{i}_r)_c = \begin{pmatrix} x_{irc} \\ y_{irc} \\ z_{irc} \end{pmatrix},\quad
(\mathbf{j}_r)_c = \begin{pmatrix} x_{jrc} \\ y_{jrc} \\ z_{jrc} \end{pmatrix},\quad
(\mathbf{k}_r)_c = \begin{pmatrix} x_{krc} \\ y_{krc} \\ z_{krc} \end{pmatrix};
\tag{4}
\]

then the rotation of the world coordinate system relative to the reference coordinate system, $R_{wr}$, can be written as

\[
R_{wr} = \left[\, (\mathbf{i}_r)_w \;\; (\mathbf{j}_r)_w \;\; (\mathbf{k}_r)_w \,\right]^{-1} =
\begin{pmatrix}
x_{irw} & x_{jrw} & x_{krw} \\
y_{irw} & y_{jrw} & y_{krw} \\
z_{irw} & z_{jrw} & z_{krw}
\end{pmatrix}^{-1}.
\tag{5}
\]

Similarly, we have the rotation of the camera coordinate system relative to the reference coordinate system, $R_{rc}$:

\[
R_{rc} = \left[\, (\mathbf{i}_r)_c \;\; (\mathbf{j}_r)_c \;\; (\mathbf{k}_r)_c \,\right] =
\begin{pmatrix}
x_{irc} & x_{jrc} & x_{krc} \\
y_{irc} & y_{jrc} & y_{krc} \\
z_{irc} & z_{jrc} & z_{krc}
\end{pmatrix}.
\tag{6}
\]

Then, the transformation matrices corresponding to the above rotation matrices are

\[
T_{wr} = \begin{pmatrix} R_{wr} & (x_{0w},\, y_{0w},\, z_{0w})^T \\ \mathbf{0} & 1 \end{pmatrix}
\quad\text{and}\quad
T_{rc} = \begin{pmatrix} R_{rc} & (x_{0c},\, y_{0c},\, z_{0c})^T \\ \mathbf{0} & 1 \end{pmatrix},
\tag{7}
\]

where $(x_{0w}, y_{0w}, z_{0w})$, the translation component of $T_{wr}$, can be obtained by rotating the vector $[-x_{1w},\, -y_{1w},\, -z_{1w}]^T$ by $R_{wr}$, and $(x_{0c}, y_{0c}, z_{0c})$, the origin of the reference coordinate system expressed in camera coordinates, is defined to be at $(x_{1c}, y_{1c}, z_{1c})$. Finally, we obtain the transformation that takes points in the world coordinate system into the camera coordinate system,

\[
T_{wc} = T_{rc}\, T_{wr}.
\tag{8}
\]

With this transformation and the intrinsic parameters of the camera available, it is not difficult to obtain the perspective projection matrix.
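To make the composition in equations (5)-(8) concrete, the following is a minimal Java sketch built on the javax.vecmath classes that ship with Java3D; the class and method names are illustrative and are not taken from our implementation. It assumes the three points are already known in both the world and the camera frames (recovering the camera-frame coordinates is the subject of Section 3.2).

```java
import javax.vecmath.Matrix3d;
import javax.vecmath.Matrix4d;
import javax.vecmath.Point3d;
import javax.vecmath.Vector3d;

/** Sketch of Section 3.1: build T_wc from three non-collinear points (A, B, C)
    known in both the world frame and the camera frame. Names are illustrative. */
public class ThreePointExtrinsics {

    /** Unit vectors i_r, j_r, k_r of the reference frame (cf. eqs. (1)-(4)),
        expressed in whichever frame the input points are given in;
        the columns of the returned matrix are (i_r, j_r, k_r). */
    static Matrix3d referenceAxes(Point3d a, Point3d b, Point3d c) {
        Vector3d iR = new Vector3d(); iR.sub(b, a); iR.normalize();      // i_r = v_AB / |v_AB|
        Vector3d ac = new Vector3d(); ac.sub(c, a);
        Vector3d kR = new Vector3d(); kR.cross(iR, ac); kR.normalize();  // k_r = n (normalized v_AB x v_AC)
        Vector3d jR = new Vector3d(); jR.cross(kR, iR);                  // j_r = k_r x i_r
        Matrix3d axes = new Matrix3d();
        axes.setColumn(0, iR); axes.setColumn(1, jR); axes.setColumn(2, kR);
        return axes;
    }

    /** T_wc = T_rc * T_wr (eqs. (5)-(8)); A is taken as the origin of the reference frame. */
    static Matrix4d worldToCamera(Point3d aW, Point3d bW, Point3d cW,
                                  Point3d aC, Point3d bC, Point3d cC) {
        // R_wr: rotation of the world frame relative to the reference frame (eq. (5)).
        Matrix3d rWr = referenceAxes(aW, bW, cW); rWr.invert();
        // Translation of T_wr: the vector (-A_w) rotated by R_wr (cf. eq. (7)).
        Vector3d oW = new Vector3d(-aW.x, -aW.y, -aW.z); rWr.transform(oW);
        Matrix4d tWr = new Matrix4d(rWr, oW, 1.0);

        // R_rc: rotation of the camera frame relative to the reference frame (eq. (6)),
        // with the reference origin located at A_c in camera coordinates.
        Matrix3d rRc = referenceAxes(aC, bC, cC);
        Matrix4d tRc = new Matrix4d(rRc, new Vector3d(aC), 1.0);

        Matrix4d tWc = new Matrix4d(); tWc.mul(tRc, tWr);                // eq. (8)
        return tWc;
    }
}
```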

3.2. Retrieval of positions from image

We now return to the problem of determining the three points in the camera coordinate system. Nine parameters define the positions of the points in the camera coordinate system. Theoretically, it is not difficult to find a solution, since we can easily establish the equations that define the problem, although there are difficulties in practice. As the three points are known in the world coordinate system, we can exploit various relationships that exist between them. Of the possible relations, the distance constraint is the most obvious and is, in fact, the most convenient and the easiest to verify on a real robot manipulator. Suppose the distances between the points are $d_{12}$, $d_{13}$ and $d_{23}$; then we have

\[
\begin{aligned}
(x_{1w}-x_{2w})^2 + (y_{1w}-y_{2w})^2 + (z_{1w}-z_{2w})^2 &= d_{12}^2 \\
(x_{1w}-x_{3w})^2 + (y_{1w}-y_{3w})^2 + (z_{1w}-z_{3w})^2 &= d_{13}^2 \\
(x_{2w}-x_{3w})^2 + (y_{2w}-y_{3w})^2 + (z_{2w}-z_{3w})^2 &= d_{23}^2
\end{aligned}
\tag{9}
\]

and

\[
\begin{aligned}
(x_{1c}-x_{2c})^2 + (y_{1c}-y_{2c})^2 + (z_{1c}-z_{2c})^2 &= d_{12}^2 \\
(x_{1c}-x_{3c})^2 + (y_{1c}-y_{3c})^2 + (z_{1c}-z_{3c})^2 &= d_{13}^2 \\
(x_{2c}-x_{3c})^2 + (y_{2c}-y_{3c})^2 + (z_{2c}-z_{3c})^2 &= d_{23}^2.
\end{aligned}
\tag{10}
\]

According to the collinearity constraint of perspective projection, we have

\[
\frac{x_{1c}}{z_{1c}} = \frac{x_1}{f},\quad \frac{x_{2c}}{z_{2c}} = \frac{x_2}{f},\quad \frac{x_{3c}}{z_{3c}} = \frac{x_3}{f},\qquad
\frac{y_{1c}}{z_{1c}} = \frac{y_1}{f},\quad \frac{y_{2c}}{z_{2c}} = \frac{y_2}{f},\quad \frac{y_{3c}}{z_{3c}} = \frac{y_3}{f}.
\tag{11}
\]

By substituting (11) into (10), we arrive at a set of conic equations:

\[
\begin{aligned}
m z_{1c}^2 + n z_{2c}^2 - 2q\, z_{1c} z_{2c} &= d_{12}^2 \\
m z_{1c}^2 + p z_{3c}^2 - 2s\, z_{1c} z_{3c} &= d_{13}^2 \\
n z_{2c}^2 + p z_{3c}^2 - 2t\, z_{2c} z_{3c} &= d_{23}^2,
\end{aligned}
\tag{12}
\]

where

\[
\begin{aligned}
m &= \left(\frac{x_1}{f}\right)^2 + \left(\frac{y_1}{f}\right)^2 + 1, &
n &= \left(\frac{x_2}{f}\right)^2 + \left(\frac{y_2}{f}\right)^2 + 1, &
p &= \left(\frac{x_3}{f}\right)^2 + \left(\frac{y_3}{f}\right)^2 + 1, \\
q &= \frac{x_1 x_2}{f^2} + \frac{y_1 y_2}{f^2} + 1, &
s &= \frac{x_1 x_3}{f^2} + \frac{y_1 y_3}{f^2} + 1, &
t &= \frac{x_2 x_3}{f^2} + \frac{y_2 y_3}{f^2} + 1,
\end{aligned}
\tag{13}
\]

and $f$ is the focal length of the camera. It is obvious that

\[
m > 0,\quad n > 0,\quad p > 0.
\tag{14}
\]

It is not difficult to prove that, for any three points with distinct images on the image plane, the following inequalities hold:

\[
q^2 - mn < 0,\qquad s^2 - mp < 0,\qquad t^2 - np < 0.
\tag{15}
\]

The inequalities (14) and (15) imply that equation (12) represents three elliptical cylinders. The geometric meaning of equation (12) is clear: the intersection points of the cylinders give the depths of the three points in the camera coordinate system. However, equation (12) is not analytically manageable. With specified parameters, its solution is, in general, a function of the roots of two fourth-order polynomial equations that require numerical solution. Worse still, owing to errors in finding the image coordinates of the points, a real solution of equation (12) is not guaranteed.
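Equation (12) can nevertheless be attacked numerically when required. The sketch below applies a plain Newton-Raphson iteration to the three equations (rather than the quartic-root formulation mentioned above), again using javax.vecmath for the 3×3 linear algebra; all names are illustrative. The iteration must be seeded with a reasonable depth estimate and, as just noted, may fail to converge when noisy image measurements leave the system without a real solution.

```java
import javax.vecmath.Matrix3d;
import javax.vecmath.Vector3d;

/** Numerical solution of eq. (12) by Newton-Raphson iteration (illustrative sketch,
    not the quartic-root formulation discussed in the text). */
public class ThreePointDepths {

    /** Coefficients m, n, p, q, s, t of eq. (13), from the image coordinates
        (x1,y1), (x2,y2), (x3,y3) (relative to the principal point) and the focal length f. */
    static double[] coefficients(double[] x, double[] y, double f) {
        double m = (x[0]*x[0] + y[0]*y[0]) / (f*f) + 1.0;
        double n = (x[1]*x[1] + y[1]*y[1]) / (f*f) + 1.0;
        double p = (x[2]*x[2] + y[2]*y[2]) / (f*f) + 1.0;
        double q = (x[0]*x[1] + y[0]*y[1]) / (f*f) + 1.0;
        double s = (x[0]*x[2] + y[0]*y[2]) / (f*f) + 1.0;
        double t = (x[1]*x[2] + y[1]*y[2]) / (f*f) + 1.0;
        return new double[] { m, n, p, q, s, t };
    }

    /** Newton iteration for the depths (z1c, z2c, z3c); z0 is an initial guess. */
    static Vector3d solveDepths(double[] c, double d12, double d13, double d23, Vector3d z0) {
        double m = c[0], n = c[1], p = c[2], q = c[3], s = c[4], t = c[5];
        Vector3d z = new Vector3d(z0);                     // (z.x, z.y, z.z) = (z1c, z2c, z3c)
        for (int i = 0; i < 50; i++) {
            // Residuals of eq. (12).
            Vector3d F = new Vector3d(
                m*z.x*z.x + n*z.y*z.y - 2*q*z.x*z.y - d12*d12,
                m*z.x*z.x + p*z.z*z.z - 2*s*z.x*z.z - d13*d13,
                n*z.y*z.y + p*z.z*z.z - 2*t*z.y*z.z - d23*d23);
            if (F.length() < 1e-9) break;
            // Jacobian of the residuals with respect to (z1c, z2c, z3c).
            Matrix3d J = new Matrix3d(
                2*m*z.x - 2*q*z.y, 2*n*z.y - 2*q*z.x, 0.0,
                2*m*z.x - 2*s*z.z, 0.0,               2*p*z.z - 2*s*z.x,
                0.0,               2*n*z.y - 2*t*z.z, 2*p*z.z - 2*t*z.y);
            J.invert();                 // throws if the Jacobian is singular
            J.transform(F);             // F := J^{-1} F
            z.sub(F);                   // Newton step: z := z - J^{-1} F
        }
        return z;
    }
}
```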

With this conclusion in mind, we investigate a method with which the three locations in camera coordinates can be obtained more easily. Rather than focusing on individual points on the objects in the teleoperation scene, we shift our attention to the objects themselves. Of the various geometric primitives, the sphere is one of the most versatile under perspective projection and has been studied extensively in camera calibration. Under ideal perspective projection, the image of a sphere is an ellipse whose major axis lies on the line passing through the principal point and the centre of the ellipse. Given the intrinsic parameters of the camera and a sphere, the parameters of the ellipse are determined by only two factors: the distance between the sphere and the image plane, and the distance between the sphere and the principal axis of projection. In other words, if we know the location of a sphere, we can compute its image; conversely, if we can identify the ellipse that best represents the image of a sphere of known radius, we can find the location of the sphere.

This method is attractive for our problem for a number of reasons. Firstly, the sphere is one of the fundamental objects in our teleoperation scenario, so we do not need to introduce extra objects into the work site to conduct the calibration. More importantly, the locations of the sphere in the world coordinate system can be obtained by instructing the robot manipulator to move the sphere to an arbitrary number of locations, so direct measurement of its position is avoided. Another appealing factor is that, with the joint angles of the manipulator known, it is relatively easy to estimate the position of the sphere on the image. Consequently, the segmentation of the image can be carried out more effectively and a fully-automatic calibration procedure is feasible.

In our implementation, the sphere is tracked on the image by a small window within which the basic image processing takes place, as shown in Fig. 2(a). The results of the processing are the discrete edge data of the ellipses. Once these data are available, a number of detection methods can be applied to extract the ellipses implied by the data, such as diameter bisection18, least-squares fitting and generalized Hough transform methods19. These methods are intended for the detection of general ellipses, for which five parameters have to be identified: planar position, orientation, size and eccentricity. In our problem, we are interested in the locations of the spheres rather than in the ellipses themselves. With a camera calibrated intrinsically, we know what the ellipse should look like if we know the position of the sphere, i.e. we have only to identify three parameters. We use a method similar to the Hough transform in searching for these parameters.

Fig. 2(b) shows approximately 80 positions obtained with the method. The exceptional y values (the depths; the convention for the coordinate system in the figure differs from the camera coordinate system shown in Fig. 1), such as those at points 6, 46 and 63, arise because the absolute values at these points are near zero. The arithmetic mean and the standard deviation of the measurements are $x = -2.256$ mm, $y = -11.9943$ mm, $z = -1.219$ mm and $d_x = 13.7065$ mm, $d_y = 50.6828$ mm, $d_z = 11.636$ mm, respectively.


Figure 2 (a): The set-up for external camera calibration. (b): The measurement errors (in percent) for the x, y and z positions of the sphere over the sample set.
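The three-parameter search itself is easy to sketch. The fragment below illustrates one possible Hough-style formulation (our implementation differs in detail): each edge pixel votes for a candidate sphere centre if its viewing ray lies close to the cone of rays tangent to the sphere placed at that centre, and the candidate collecting the most votes is accepted. The edge-extraction step, the search grid around the position predicted from the joint angles and all identifiers are assumptions made for illustration; image coordinates are taken relative to the principal point, in the same units as the focal length.

```java
import java.util.List;
import javax.vecmath.Point2d;
import javax.vecmath.Vector3d;

/** Hough-style search for a sphere's position in camera coordinates from the
    edge pixels of its (elliptical) image. Illustrative sketch only. */
public class SphereLocator {

    /** Votes for a candidate centre: an edge pixel supports the candidate if its
        viewing ray lies within tolRad radians of the sphere's silhouette cone. */
    static int votes(Vector3d centre, double radius, double f,
                     List<Point2d> edgePixels, double tolRad) {
        double dist = centre.length();
        if (dist <= radius) return 0;                       // camera inside the sphere: impossible
        double coneHalfAngle = Math.asin(radius / dist);    // half-angle of the tangent cone
        int count = 0;
        for (Point2d e : edgePixels) {
            Vector3d ray = new Vector3d(e.x, e.y, f);       // viewing ray through the pixel
            double angle = ray.angle(centre);               // angle between ray and centre direction
            if (Math.abs(angle - coneHalfAngle) < tolRad) count++;
        }
        return count;
    }

    /** Exhaustive search over a coarse 3D grid around a predicted position
        (e.g. from the manipulator's joint angles); returns the best candidate. */
    static Vector3d locate(Vector3d predicted, double radius, double f,
                           List<Point2d> edgePixels, double range, double step) {
        Vector3d best = new Vector3d(predicted);
        int bestVotes = -1;
        for (double dx = -range; dx <= range; dx += step) {
            for (double dy = -range; dy <= range; dy += step) {
                for (double dz = -range; dz <= range; dz += step) {
                    Vector3d c = new Vector3d(predicted.x + dx, predicted.y + dy, predicted.z + dz);
                    int v = votes(c, radius, f, edgePixels, 0.002);   // ~0.1 degree tolerance (placeholder)
                    if (v > bestVotes) { bestVotes = v; best = c; }
                }
            }
        }
        return best;
    }
}
```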

4. REVERSE MAPPING

Suppose $p(x_w, y_w, z_w)$ is a point in the world coordinate system, $p(x_c, y_c, z_c)$ is its equivalent in the camera coordinate system and $p(x_{im}, y_{im})$ is the perspective projection of the point on to the image plane. The relationships between them are defined by the following equations:

\[
x_c = \frac{x_{im}}{f}\, z_c, \qquad y_c = \frac{y_{im}}{f}\, z_c
\tag{16}
\]

and

\[
\begin{bmatrix} x_{im} \\ y_{im} \end{bmatrix} = \frac{f}{z_c} \begin{bmatrix} x_c \\ y_c \end{bmatrix}, \qquad
\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = T_{wc} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix},
\tag{17}
\]

where $f$ is the focal length of the pin-hole camera. Equations (16) and (17) define a unique mapping from world points to image points, but what we need in the reconstruction of 3D objects is the reverse, i.e. to find points in the workspace from their perspective projections. As the reverse mapping process is not one-to-one, to determine a 3D point from its 2D image we have to reduce the dimension of the workspace. Rather than seeing an image point $p(x_{im}, y_{im})$ as the projection of a point $p(x_w, y_w, z_w)$ in the 3D workspace, we take it as the image of a point $p_\Pi(x_w, y_w, z_w)$ on a plane $\Pi$,

\[
\Pi : \; a_1 x + b_1 y + c_1 z + d_1 = 0,
\tag{18}
\]

or of a point $p_L(t)$ on a line $L$,

\[
L : \;
\begin{cases}
x = x_0 + k_x t \\
y = y_0 + k_y t \\
z = z_0 + k_z t.
\end{cases}
\tag{19}
\]

By using the chosen reference planes or reference lines, the reverse mapping can be found. If a reference plane is chosen, equation (17) produces two linear equations in $x_w$, $y_w$ and $z_w$. These equations, together with the equation of the reference plane, give the world coordinates of the image point $p(x_{im}, y_{im})$:

\[
\begin{cases}
a_1 x_w + b_1 y_w + c_1 z_w + d_1 = 0 \\
a_2 x_w + b_2 y_w + c_2 z_w + d_2 = 0 \\
a_3 x_w + b_3 y_w + c_3 z_w + d_3 = 0.
\end{cases}
\tag{20}
\]

If a reference line is chosen, the locus of candidate image points degenerates to a line in the image plane, so there is a mapping from the 2D line $L_{im}$,

\[
L_{im} : \;
\begin{cases}
x_{im} = f(t) \\
y_{im} = g(t),
\end{cases}
\tag{21}
\]

to the 3D line

\[
\begin{cases}
x_w = x_0 + k_x t \\
y_w = y_0 + k_y t \\
z_w = z_0 + k_z t.
\end{cases}
\tag{22}
\]
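The plane-referenced case reduces to the 3×3 linear system (20): equation (17) supplies two linear equations for each image point (one from $x_{im}$, one from $y_{im}$) and the reference plane supplies the third. The following minimal sketch, again using javax.vecmath, assumes $T_{wc}$ is stored as a rotation $R$ and a translation $\mathbf{t}$, with image coordinates measured from the principal point; the names are illustrative.

```java
import javax.vecmath.Matrix3d;
import javax.vecmath.Point3d;
import javax.vecmath.Vector3d;
import javax.vecmath.Vector4d;

/** Reverse mapping of an image point on to a reference plane (eqs. (17), (18), (20)).
    Illustrative sketch; R and t are the rotation and translation parts of T_wc. */
public class ReverseMapper {

    static Point3d backProjectToPlane(double xim, double yim, double f,
                                      Matrix3d R, Vector3d t, Vector4d plane /* (a, b, c, d) */) {
        // Rows of R.
        Vector3d r1 = new Vector3d(), r2 = new Vector3d(), r3 = new Vector3d();
        R.getRow(0, r1); R.getRow(1, r2); R.getRow(2, r3);

        // From x_im * z_c = f * x_c and y_im * z_c = f * y_c, with
        // (x_c, y_c, z_c) = R p_w + t, we get two linear equations in p_w;
        // the reference plane a x + b y + c z + d = 0 supplies the third.
        Matrix3d A = new Matrix3d(
            f*r1.x - xim*r3.x, f*r1.y - xim*r3.y, f*r1.z - xim*r3.z,
            f*r2.x - yim*r3.x, f*r2.y - yim*r3.y, f*r2.z - yim*r3.z,
            plane.x,           plane.y,           plane.z);
        Vector3d b = new Vector3d(
            -(f*t.x - xim*t.z),
            -(f*t.y - yim*t.z),
            -plane.w);

        A.invert();          // throws if the viewing ray is parallel to the plane
        A.transform(b);      // b := A^{-1} b, i.e. the world point
        return new Point3d(b);
    }
}
```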

As the reverse mapping is based on the establishment of a series of references, we require that the object to be reconstructed should possess some geometric references within the workspace and should not be freely floating somewhere within it. Actually, for our application and for most terrestrial applications, this condition can be met. With the reverse mapping available, the 3D interpretation of the image is straightforward. According to the approach described in this section, a typical reconstruction process includes:

• find a suitable reference for the object so that its location can be determined;
• choose, from the model set, the most appropriate geometric primitive for the object or part of it;
• set the parameters of the primitive so that it is in alignment with the object to be reconstructed;
• assign the appearance or motion attributes to the object.

5. IMPLEMENTATION

Besides the hardware, the system basically consists of a VE module and an image-based modeler. To make the system accessible over the Internet, it is implemented in pure Java, using the Java3D and Java Advanced Imaging (JAI) APIs. The VE module is a standard Java3D application that provides the facilities for manipulation control, view-platform control and manipulator path planning. The modeler is implemented as a separate, self-defined rendering engine for the rendering of 3D wire-frames. To achieve a rapid system response, the objects defined for this rendering engine are highly simplified. This rendering engine works as a VR editor.

When an image is downloaded from the server at the robot work site, the modeler performs the external calibration for the camera if it is required. The result of the calibration is a workspace model. Superimposed on to the image, the workspace model makes the 2D image 3D-navigable, allowing the reconstruction work to be done. All the work concerned with object generation and deformation, and the selection of position and orientation, is performed in this modeler. The appearance of the model, as generated, depends upon the visual characteristics of the object. If the object is uniformly colored, that color is assigned as the color of the model; otherwise, a texture mapping is performed using the portion of the image occupied by the object.

Once the geometric properties and the appearance attributes of an object are determined, the object is converted into a standard Java3D node and is inserted into the live scene graph for rendering. Because the VE module and the modeler work in an interactive manner, we can either choose an object from the live VE for editing, or generate an object to be inserted into the live VE.
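As an illustration of this last step, the fragment below is a simplified Java3D sketch (not our actual modeler code) that wraps a reconstructed box primitive in a detachable branch and attaches it to a live scene graph; it assumes the parent group was created with the ALLOW_CHILDREN_EXTEND capability before being made live.

```java
import javax.media.j3d.Appearance;
import javax.media.j3d.BranchGroup;
import javax.media.j3d.Material;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.vecmath.Color3f;
import javax.vecmath.Vector3f;
import com.sun.j3d.utils.geometry.Box;

/** Simplified sketch of inserting a reconstructed primitive into a live Java3D scene graph. */
public class ObjectInserter {

    /** Wraps a box primitive of the given half-dimensions, colour and position
        into a detachable branch and attaches it to the (live) parent group. */
    static BranchGroup insertBox(BranchGroup liveParent, float dx, float dy, float dz,
                                 Color3f colour, Vector3f position) {
        Appearance ap = new Appearance();
        ap.setMaterial(new Material(colour, new Color3f(0f, 0f, 0f), colour,
                                    new Color3f(1f, 1f, 1f), 64f));

        Transform3D t3d = new Transform3D();
        t3d.setTranslation(position);                     // place the primitive in the workspace
        TransformGroup tg = new TransformGroup(t3d);
        tg.addChild(new Box(dx, dy, dz, ap));

        BranchGroup branch = new BranchGroup();
        branch.setCapability(BranchGroup.ALLOW_DETACH);   // so the object can later be edited or removed
        branch.addChild(tg);
        branch.compile();

        // The parent must have been created with Group.ALLOW_CHILDREN_EXTEND set
        // before it was made live, otherwise this call throws a CapabilityNotSetException.
        liveParent.addChild(branch);
        return branch;
    }
}
```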

6. EXPERIMENTAL RESULTS

To verify our reconstruction method and the principles upon which the modeling system is established, an off-line modeling experiment using a sample image has been conducted. We use a synthetic image captured from a Java3D-based VE as the sample because a virtual camera makes it possible to adjust the scene set-up and the camera parameters. However, the use of a synthetic image does not weaken the validity of the method.

Figure 3 (a): Sample image used for object reconstruction. (b): Constructed wire-frame objects. (c): The VE obtained from a single image

In the sample image shown in Fig. 3(a), we have some simple objects such as blocks and a cylinder, and a complex object – the robot manipulator. To test the reconstruction method, these objects will be reconstructed by using the modeler (we do not need to model the robot manipulator at run time, because it has been well modeled beforehand as a basic element of the VE).

It proved extremely easy to reconstruct an object for which a reference is well defined and easily found, such as the blocks on the floor, or the base and trunk of the manipulator (vertical cylinders). In other cases the reconstruction process is more complex, as for the shoulder (the horizontal cylinder on top of the trunk) and the arm components: here, we have to establish the references for these objects on an object that has already been reconstructed. Sometimes additional information is needed; for example, to determine the position of the shoulder, we have to assume that its axis is at a right angle to the axis of the trunk. As expected, the method is unable to reconstruct the two suspended balls because no location reference for them can be found within the workspace: their point of attachment is outside the view, so it is impossible to determine their depth.

The results of the modeling process are shown in Fig. 3(b), where the objects are displayed in wire-frame form. For clarity, the background (the sample image) has been filtered out. Fig. 3(c) shows the corresponding reconstructed 3D VE. As there is an explicit model for each object in the navigable VE, these objects are “live” objects rather than static image patches. Once they are registered in the VE, they are “sensible” to the VE: the teleoperation operator can interact with them, and the embedded path planner can take them into account when planning paths for the robot manipulator.

7. DISCUSSION

The work undertaken has established that the method, based on camera calibration and on image and workspace analysis, is a feasible way of dealing with a fairly wide range of dynamic phenomena within teleoperation. Compared with approaches such as image augmentation (AR) or automatic model extraction, the on-line VR editing method is more effective, more reliable and computationally more economical for this particular application. By integrating the environmental changes into the VE, the method supports a fully model-based VE for teleoperation, so that the changes can be properly represented in task operation and in task planning. Besides its use in teleoperation, the method is also suitable for constructing virtual architectural structures, where the relationships between the various objects are more clearly defined.

Several improvements and extensions can be made to the current work, and we hope to proceed with these in the near future. A major one is the integration of automatic alignment, which at present relies entirely on the operator's observation and experience. Another is the investigation of techniques with which the reference lines and planes can be tracked reliably when constructing structures similar to the robot manipulator.

REFERENCES

1. I. R. Belousov, J. C. Tan, G. J. Clapworthy, "Teleoperation and Java3D Visualization of a Robot Manipulator over the World Wide Web", Information Visualization 99, IEEE Computer Press, pp. 543-548, 1999.
2. A. Rastogi, P. Milgram, D. Drascic, "Telerobotic Control with Stereoscopic Augmented Reality", SPIE Volume 2653: Stereoscopic Displays and Virtual Reality Systems III, pp. 115-122, 1996.
3. Z. Zhang, R. Deriche, O. Faugeras, Q. Luong, "A Robust Technique for Matching Two Uncalibrated Images through the Recovery of the Unknown Epipolar Geometry", Artificial Intelligence, Vol. 78, pp. 87-119, 1995.
4. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
5. W. E. L. Grimson, From Images to Surfaces, MIT Press, 1981.
6. A. Shashua, "Projective Structure from Two Uncalibrated Images: Structure from Motion and Recognition", MIT A.I. Memo No. 1363, 1992.
7. J. J. Koenderink, A. J. Van Doorn, "Affine Structure from Motion", Journal Opt. Soc. Am. A, Vol. 8, pp. 377-385, 1991.
8. R. Szeliski, "Image Mosaicing for Tele-Reality Applications", Proc. Workshop on Applications of Computer Vision, pp. 44-53, 1994.
9. T. Poggio, "3D Object Recognition and Matching: On a Result of Basri and Ullman", in Spatial Vision in Humans and Robots, L. Harris and M. Jenkin (eds), Cambridge University Press, Cambridge, UK, 1993.
10. T. Poggio, S. Edelman, "A Network That Learns to Recognize 3D Objects", Nature, Vol. 343, pp. 263-266, 1990.
11. T. Wada, H. Ukida, T. Matsuyama, "Shape from Shading with Interreflections under Proximal Light Source: 3D Shape Reconstruction of Unfolded Book Surface from a Scanner Image", Proc. 5th Int. Conf. on Computer Vision, pp. 66-71, 1995.
12. J. Haddon, D. Forsyth, "Shading Primitives: Finding Folds and Shallow Grooves", Int. Conf. on Computer Vision, January 1998.
13. T. Vetter, T. Poggio, "Linear Object Classes and Image Synthesis from a Single Example Image", MIT A.I. Memo No. 1531, 1995.
14. M. Lando, S. Edelman, "Generalization from a Single View in Face Recognition", International Workshop on Automatic Face- and Gesture-Recognition, Zurich, June 26-28, 1995.
15. P. E. Debevec, C. J. Taylor, J. Malik, "Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach", Proc. SIGGRAPH '96, pp. 11-20, 1996.
16. Y. Horry, K. Anjyo, K. Arai, "Tour into the Picture: Using a Spidery Mesh Interface to Make Animation from a Single Image", Proc. SIGGRAPH '97, Los Angeles, California, August 3-8, pp. 225-232, 1997.
17. R. Tsai, "A Versatile Camera Calibration Technique for High Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses", IEEE Journal of Robotics & Automation, Vol. 3, No. 4, pp. 323-344, August 1987.
18. S. Tsuji, F. Matsumoto, "Detection of Ellipses by a Modified Hough Transformation", IEEE Trans. on Computers, 27(8), pp. 777-781, 1978.
19. E. R. Davies, "Finding Ellipses Using the Generalized Hough Transform", Pattern Recognition Letters, Vol. 9, No. 2, pp. 87-96, February 1989.