Semi-automatic Annotations in Unknown Environments

Gerhard Reitmayr∗, Ethan Eade†, Tom W. Drummond‡

Engineering Department, Cambridge University, Cambridge, UK

∗e-mail: [email protected]   †e-mail: [email protected]   ‡e-mail: [email protected]

Figure 1: Left to right: Selecting a disk bay on a computer; placing a note that this disk should be replaced; the system estimates the 3D location of the note automatically and renders it correctly registered to the environment.

Abstract

Unknown environments pose a particular challenge for augmented reality applications because the 3D models required for tracking, rendering and interaction are not available ahead of time. Consequently, authoring of AR content must take place on-line. This work describes a set of techniques to simplify the on-line authoring of annotations in unknown environments using a simultaneous localisation and mapping (SLAM) system. The point-based SLAM system is extended to specifically track and estimate high-level features indicated by the user. The automatic estimation of these complex landmarks by the system relieves the user of the burden of manually specifying the full 3D pose of annotations while improving accuracy. These properties are especially interesting for remote collaboration applications where either user interfaces on handhelds or camera control by the remote expert are limited.

Index Terms: H.5.1 [Information Systems]: Multimedia Information Systems—Augmented Reality; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Tracking

1 Introduction and Related Work

Augmented reality applications depend heavily on geometrical knowledge about the world to fuse computer graphics with real-world environments. Recent work in real-time visual simultaneous localisation and mapping (SLAM) allows a system equipped with a single camera to create a model of the user's environment without any prior knowledge. One scenario where this is of interest is remote collaboration in unknown environments [3], where a remote expert annotates a local live view to support a mobile user. This is also the target application of the work presented herein.

However, placing 3D annotations in the video stream of a moving camera is not easy. The user has to specify six parameters with respect to a given world frame, either through direct-manipulation interfaces such as those seen in CAD software, or by using multiple views to triangulate the true pose [1]. SLAM-based systems allow the remote operator to use the 3D model created by the mapping system directly. A reference frame can be specified through manual selection of three point landmarks and further manual manipulation of the annotation within the resulting reference frame [3]. Performing such selections with sufficient accuracy is hard in moving video images, and users prefer to make the selection in an abstract 3D view [9]. This work focuses on the real-world structure to be annotated, instead of the geometry and location of the virtual objects added to the augmented scene. The user indicates interest in a feature by placing an annotation on it. The system then measures the feature's location and shape and updates the model. Thus, the presented system combines the complementary strengths of a human operator, who understands a scene and can make informed decisions about it, and a computer system, which performs accurate measurements beyond human capability. The result is a simple interface coupled with accurate operation. The focus on complex features also differentiates this work from other state-of-the-art SLAM systems using more general, low-level features such as points [3], lines [16] or planes [14]. Close integration of the user with the system was demonstrated in modelling from images [4] and videos [19], delivering high-quality models with minimal user input.

2 Annotating with SLAM

This work employs the monocular SLAM system described by Eade and Drummond [7] for general pose and landmark estimation. The system implements a FastSLAM [11] particle filter that represents the distribution of camera trajectories as a set of particles and the landmark positions as independent estimates conditioned on each particle's trajectory. In typical operation the number of particles is small (50), allowing the system to run at a frame rate of 30 fps with hundreds of point landmarks.
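As a rough illustration of this factorisation, the sketch below (Python with NumPy; all class and function names are illustrative assumptions, not taken from the authors' implementation) keeps one camera-pose hypothesis per particle and an independent Gaussian landmark estimate conditioned on that hypothesis.

```python
import numpy as np

class LandmarkEstimate:
    """Gaussian estimate of one landmark, conditioned on a particle's trajectory."""
    def __init__(self, mean, cov):
        self.mean = np.asarray(mean, dtype=float)   # e.g. 3D point or plane parameters
        self.cov = np.asarray(cov, dtype=float)

class Particle:
    """One camera-trajectory hypothesis with per-landmark filters."""
    def __init__(self, pose):
        self.pose = pose          # current camera pose hypothesis (R, T)
        self.landmarks = {}       # landmark id -> LandmarkEstimate
        self.weight = 1.0

def fastslam_step(particles, sample_motion, measure, kalman_update, resample):
    """One filter iteration: propagate poses, update landmarks, weight, resample."""
    for p in particles:
        p.pose = sample_motion(p.pose)                 # draw a pose from the motion model
        for lid, lm in p.landmarks.items():
            z, found = measure(lid, p.pose)            # image measurement of landmark lid
            if found:
                p.weight *= kalman_update(lm, z, p.pose)  # per-landmark update, returns likelihood
    return resample(particles)                         # importance resampling on the weights
```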

Figure 2: Information flow between the main system components. SLAM estimates basic camera pose and 3D point map. The augmentation component renders annotation overlays. User interaction defines new landmarks, while SLAM updates annotations.

The system is extended to provide the user with an interface to track and capture simple structures such as rectangles or circles in the view. When the user selects a structure for annotation, the user interface component sends a description of the observed structure to the SLAM system, introducing a new landmark into the map, which is measured accurately during subsequent frames (see Fig. 2). These complex landmarks capture more information at once but do not require more state to be estimated (see Sec. 3). The location of the annotation itself is derived from the landmark structure, instead of being specified by the user. All parameters are linked to the landmark in the map and are continuously updated as the estimate of the landmark converges (see Sec. 4).

3 Complex Landmarks

Complex landmarks are planar structures that can be reliably tracked in the environment. For example, the rectangular outline of a poster on a wall or a sticker on an engine forms such a complex landmark. The outline is tracked from frame to frame using a simple edge tracking technique [5] and provides a measurement of the underlying shape and its location in the global coordinate system. To include planar, complex landmarks in the particle filter framework, three components must be defined: (a) a measurement or tracking method for locating the shapes in each frame; (b) an observation function that predicts the shape's appearance in a frame from the plane estimate and the static shape parameters; and (c) an initialisation method to create an initial estimate of the plane parameters. Components (a) and (b) are basic requirements for any filter model. The initialisation method (c) avoids convergence of the plane parameters to wrong estimates if the observations are ill-conditioned due to small camera motion after the creation of a new landmark.

The image of a planar shape in one frame is determined by the plane supporting the shape and the camera pose the frame was taken from. Together, these form the 9 parameters necessary to fully describe the 3D location of the shape itself. Within the FastSLAM framework, each particle represents a camera trajectory known with full certainty. Therefore, all that is left to estimate are the parameters of the supporting plane. Each particle estimates the plane parameters independently using a Kalman filter. The variation in camera pose is captured by the assembly of particles, each of which initialises a landmark estimate with a different camera pose. Two models for planar shapes are presented: polygonal shapes described by vertices and the edges between them, and ellipsoidal shapes represented by the equation of a 2D conic section, constrained to an ellipse.
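A minimal sketch of how such a per-particle complex landmark could be held is given below; the class, its fields and the promotion threshold are illustrative assumptions rather than the authors' code, and the three methods mirror requirements (a)-(c).

```python
import numpy as np

class PlanarLandmark:
    """Per-particle estimate of a planar complex landmark.

    The supporting plane is parametrised as p = (a, b, c) with
    a*x + b*y + c*z + 1 = 0; the shape vector s and the reference
    camera pose C0 are fixed when the landmark is created.
    """
    def __init__(self, plane_mean, plane_cov, shape, C0):
        self.plane_mean = np.asarray(plane_mean, dtype=float)  # (a, b, c)
        self.plane_cov = np.asarray(plane_cov, dtype=float)    # 3x3 covariance
        self.shape = np.asarray(shape, dtype=float)            # static shape parameters s
        self.C0 = C0                                           # camera pose of the initial view
        self.observed = False                                  # still in the unobserved state

    def track(self, image, predicted_outline):
        """(a) Locate the shape in the current frame by local edge search."""
        raise NotImplementedError  # see the shape-tracking sketches in Sec. 3.1

    def predict_outline(self, Ci):
        """(b) Predict the shape's appearance from the plane estimate and pose Ci."""
        raise NotImplementedError  # homography transfer, see Sec. 3.2

    def try_promote(self, depth_mean, depth_std, ratio_threshold=0.1):
        """(c) Accept the plane once its depth is well conditioned (Sec. 3.3).

        The threshold value is an illustrative choice, not taken from the paper.
        """
        if depth_std / depth_mean < ratio_threshold:
            self.observed = True
        return self.observed
```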

3.1 Shape tracking

The 2D shape of a complex landmark is described by an n-dimensional parameter vector describing its outline. For a new video frame, the shape location is updated by local edge search from the outline given by a prior estimate of the shape's location in the frame. We assume a calibrated camera with a radial distortion model and treat all image points as already undistorted and in normalised camera coordinates.

3.1.1 Polygonal shapes

For polygonal shapes, the geometry is parametrised by the set of vertices of the polygon. A polygon with $n$ vertices $v_i = (u_i, v_i)$ is described by a $2n$-dimensional parameter vector $s$ concatenating all coordinates. Along an edge between vertices $v_i$ and $v_{i+1}$, measurement points are initialised on a regular grid as

$m_k = (1 - t)\,v_i + t\,v_{i+1}, \qquad t \in [0, 1]. \qquad (1)$

A search from $m_k$ along the normal direction of the edge $(v_i, v_{i+1})$ yields a distance $d_k$ to the shape outline. Minimising the error function

$e(s) = \sum_k d_k^2 \qquad (2)$

updates the parameter vector $s$ to describe the observed shape. The change $\delta s$ of $s$ is given by $\delta s = (J^T J)^{-1} J^T d$, where $J$ is the Jacobian of $(d_k)$ with respect to $s$.
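The sketch below illustrates Eqs. (1) and (2): measurement points are sampled along each polygon edge and a normal-equations step updates the shape vector. The distance search in the image and the Jacobian are passed in as placeholders, since they depend on the tracker; all names are illustrative.

```python
import numpy as np

def polygon_measurement_points(shape, samples_per_edge=10):
    """Sample points m_k along every edge of the polygon (Eq. 1), with edge normals."""
    vertices = shape.reshape(-1, 2)          # s concatenates all vertex coordinates
    points, normals = [], []
    n = len(vertices)
    for i in range(n):
        v0, v1 = vertices[i], vertices[(i + 1) % n]
        edge = v1 - v0
        normal = np.array([-edge[1], edge[0]]) / np.linalg.norm(edge)
        for t in np.linspace(0.0, 1.0, samples_per_edge):
            points.append((1.0 - t) * v0 + t * v1)
            normals.append(normal)
    return np.array(points), np.array(normals)

def polygon_shape_update(shape, distances, jacobian):
    """Normal-equations step delta_s = (J^T J)^-1 J^T d minimising e(s) = sum_k d_k^2.

    `distances` are the edge-search results d_k and `jacobian` is d(d_k)/d(s);
    the sign convention follows the distance definition used by the search.
    """
    JtJ = jacobian.T @ jacobian
    delta_s = np.linalg.solve(JtJ, jacobian.T @ distances)
    return shape + delta_s
```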

3.1.2 Ellipsoidal shapes

Ellipsoidal shapes are described as implicit curves. An ellipse is parametrised by the 5-dimensional vector $s$ containing the coefficients of the canonical equation for 2D conic sections

$f(u, v) = a u^2 + b u v + c v^2 + d u + e v + 1 = 0. \qquad (3)$

The zero set $Z(f) = \{(u, v) \mid f(u, v) = 0\}$ describes the points on the ellipse. Again, measurement points $m_k = (u_k, v_k)$ are generated along the ellipse, and the normal vector is calculated as

$n_k = \begin{pmatrix} 2 a u_k + b v_k + d \\ 2 c v_k + b u_k + e \end{pmatrix}. \qquad (4)$

A one-dimensional linear search along the normal for the location of maximum gradient magnitude is then performed, and the distance $d_k$ of the found location to the ellipse parametrised by $s$ is minimised. Obtaining the true geometric distance requires solving a quartic equation. Instead, the following approximation of the distance [17] between a point and an ellipse is used:

$\mathrm{dist}\big((u, v), Z(f)\big) = \frac{f(u, v)}{\|\nabla f(u, v)\|}. \qquad (5)$

The error function to be minimised is therefore

$e(s) = \sum_k \mathrm{dist}\big(m_k + d_k n_k,\, Z(f)\big)^2. \qquad (6)$

This least-squares estimation scheme is used as for polygonal shapes to update the ellipse parameters. As Eq. (3) can describe any conic section, an additional test for the specific ellipsoidal shape is performed. If the final shape is an ellipse, the result is returned as a successful measurement.
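The following sketch implements Eqs. (3)-(5) directly; the final ellipse test uses the standard conic discriminant $b^2 - 4ac < 0$, which is one plausible realisation of the check mentioned above rather than the paper's stated test.

```python
import numpy as np

def conic_value(s, u, v):
    """f(u, v) for the conic s = (a, b, c, d, e), Eq. (3)."""
    a, b, c, d, e = s
    return a * u * u + b * u * v + c * v * v + d * u + e * v + 1.0

def conic_gradient(s, u, v):
    """Gradient of f; its direction gives the outline normal n_k of Eq. (4)."""
    a, b, c, d, e = s
    return np.array([2.0 * a * u + b * v + d,
                     2.0 * c * v + b * u + e])

def approx_ellipse_distance(s, u, v):
    """First-order point-to-ellipse distance approximation of Eq. (5)."""
    return conic_value(s, u, v) / np.linalg.norm(conic_gradient(s, u, v))

def is_ellipse(s):
    """Discriminant test: the conic is an ellipse only if b^2 - 4ac < 0."""
    a, b, c, _, _ = s
    return b * b - 4.0 * a * c < 0.0
```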

3.2 Observation of planar shapes

The state of a complex landmark is represented as a plane with parameter vector $p = (a, b, c, 1)$ such that a point on the plane in homogeneous coordinates $x = (x, y, z, 1)$ fulfils the equation

$p\,x = a x + b y + c z + 1 = 0. \qquad (7)$

To update the estimate of a plane, the location of the planar shape in a new frame has to be related to the plane parameters. This is accomplished through an observation function that predicts the view of the shape in a frame given the plane $p$. The observation function is further parametrised by the shape parameter vector $s$, the camera pose $C_0$ of the initial frame and the camera pose $C_i$ of the current frame. Camera poses are represented as rigid transformations consisting of a rotation $R$ and a translation $T$.


Figure 3: Five sample points from the originally seen ellipse are projected into the new frame via H and the distance to the measured ellipse is recorded.

The observation function depends on the homography between two camera frames $C_0$ and $C_i$ induced by a plane visible in both. Given the transformation $(R, T)$ between the two cameras and the plane parameters $(a, b, c)$ in camera frame $C_0$, an image point $(u, v)$ in the first frame transforms to a point $(\bar{u}, \bar{v})$ in the second through

$\lambda\,(\bar{u}, \bar{v}, 1)^T = \big(R - T\,(a, b, c)\big)\,(u, v, 1)^T. \qquad (8)$
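A direct transcription of Eq. (8) is sketched below (illustrative names; normalised image coordinates are assumed as stated in Section 3.1):

```python
import numpy as np

def plane_homography(R, T, plane):
    """H = R - T (a, b, c) for the plane a*x + b*y + c*z + 1 = 0 (Eq. 8).

    R is the 3x3 rotation and T the translation taking frame C0 to Ci;
    plane = (a, b, c) is expressed in the C0 frame.
    """
    T = np.asarray(T, dtype=float).reshape(3, 1)
    abc = np.asarray(plane, dtype=float).reshape(1, 3)
    return np.asarray(R, dtype=float) - T @ abc

def transfer_point(H, u, v):
    """Map a normalised image point (u, v) in C0 to (u_bar, v_bar) in Ci, up to scale."""
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]
```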

Observing how a measured point differs from the projected point $(\bar{u}, \bar{v})$ provides information on the plane parameters. A similar approach is also used in [10] to estimate the normal of planar patches in a SLAM system directly from pixel intensities.

3.2.1 Polygonal shapes

For polygonal shapes, the vertex locations $v_i$ are transformed into the current video frame as $\bar{v}_i$, using the difference $(R, T) = C_i C_0^{-1}$ between the two camera poses and the current plane estimate to compute the homography $H = (R - T\,(a, b, c))$. This yields the following observation function for the vertex locations:

$\bar{v}_i = \mathrm{obs}(p; v_i, C_0, C_i) = \mathrm{project}(H v_i), \qquad (9)$

$\mathrm{project}(x, y, z) = (x/z, y/z)^T. \qquad (10)$

Then, local edge tracking of the polygonal shape locates the vertices in the new video frame. This provides an update $\delta s$ for the shape descriptor $s$ with measurement covariance. The resulting measurement updates the Kalman filter of the plane parameters in each particle using the Jacobian of (9) with respect to $p$.

3.2.2 Ellipsoidal shapes

The ellipse parameter vector $s$ is not used directly as an observation of the plane parameters, because the difference between the projected and the measured ellipse is only an algebraic error. Instead, the distances of a set of measurement points $m_k$ to the measured ellipse are minimised. This ensures that the error minimised is indeed a re-projection error. To track the ellipse, the ellipse parameter vector $s$ is transformed to the current video frame through the homography $H$ induced by the difference in camera poses and the plane estimate. Then a local edge search using the distance function described in Section 3.1.2 is performed. The parameter vector $s_m$ of the measured ellipse and the covariance of the parameters provide the basis for the following observation function. Five points $m_k$, equally spaced along the circumference of the ellipse in the original frame, are also projected into the current frame using $H$ (see Fig. 3). The distance function (6) is used to measure the distance of the predicted points to the measured ellipse. This yields the composite observation function

$d_k = \mathrm{obs}(p; m_k, C_0, C_i, s_m) = \mathrm{dist}\big(\mathrm{project}(H m_k),\, Z(s_m)\big), \qquad (11)$

where $Z(s_m)$ denotes the zero set of the measured ellipse defined by the parameter vector $s_m$. The resulting measurement updates the Kalman filter estimate of the plane parameters in each particle using the Jacobian of (11) with respect to $p$.
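Both shape types feed the same kind of correction: the plane's Kalman filter in each particle is updated with the Jacobian of the relevant observation function with respect to $p$. A generic, hedged sketch of such an update is shown below; the finite-difference Jacobian is used purely for illustration, and the measurement conventions (predicted vertices for polygons, predicted point-to-ellipse distances for ellipses) are supplied by the caller.

```python
import numpy as np

def numerical_jacobian(obs_fn, p, eps=1e-6):
    """Finite-difference Jacobian of an observation function with respect to the plane p."""
    p = np.asarray(p, dtype=float)
    z0 = np.asarray(obs_fn(p), dtype=float)
    J = np.zeros((z0.size, p.size))
    for j in range(p.size):
        dp = np.zeros_like(p)
        dp[j] = eps
        J[:, j] = (np.asarray(obs_fn(p + dp), dtype=float) - z0) / eps
    return z0, J

def plane_kalman_update(p, P, obs_fn, z_measured, R_meas):
    """EKF-style correction of the per-particle plane estimate (p, P) from one observation."""
    p = np.asarray(p, dtype=float)
    z_pred, J = numerical_jacobian(obs_fn, p)
    innovation = np.asarray(z_measured, dtype=float) - z_pred
    S = J @ P @ J.T + R_meas               # innovation covariance
    K = P @ J.T @ np.linalg.inv(S)         # Kalman gain
    p_new = p + K @ innovation
    P_new = (np.eye(p.size) - K @ J) @ P
    return p_new, P_new
```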

Figure 4: Tool cursors for placing new annotations. (Left) Point cursor shows the supporting triangle and points at the intersection point. (Middle) Polygon cursor captures a polygonal outline from a rectangular shape. (Right) Ellipse cursor starting with a circular shape.

3.3 Initialisation

The update step requires a good initial estimate to avoid converging to wrong estimates due to non-linearities in the observation functions. The initial view provides the initial shape parameters $s$ and the camera pose $C_0$, but does not provide any information on the plane estimate itself. Only a second view from a reasonably different camera pose $C_i$ allows computation of a plane estimate. To initialise the filter with a good plane estimate, the system creates a new complex landmark in a special unobserved state. While in this state, the system projects and tracks the planar shape given a preliminary estimate of the plane, but does not update the filter. Instead, it calculates an estimate of the plane parameters given the initial and current view, using the unscented transform to obtain a normal distribution of the plane parameters. If the ratio between the standard deviation and the mean of the depth of the plane falls below a given threshold, the system accepts the plane parameters and promotes the landmark to normal operation.

4 Annotation Placement

The system infers the location of annotations automatically from the SLAM landmark information, once the user has selected the location in a view of the scene. The system supports two different types of annotations, point annotations and planar annotations, directly related to point and planar landmarks.

4.1 Point annotations

For point landmarks, the system assumes that the point landmarks sample a closed surface. The user specifies a ray by clicking into the view, and the intersection between the ray and the surface is computed to define the origin of the annotation reference frame. The ray is intersected with the local triangle of the Delaunay triangulation of the re-projection of all visible point landmarks, yielding the intersection point $x$. The point $x$ is also represented in barycentric coordinates with respect to the triangle to allow later updates of its position (a sketch of this anchoring is given below). The orientation of the annotation frame is defined by the normal vector of the triangle and the up-vector of the initial camera frame. The normal vector and the resulting reference frame are updated every frame from the current estimates of the triangle corners.

4.2 Planar annotations

Defining a planar annotation also initialises a new complex landmark in the system (see Fig. 2). The system presents the user with an active tracking cursor that performs edge tracking (see Fig. 4) from some fixed initial shape to locate an outline in the image that is close to the initial shape. When the tracking shape is brought close to the salient edges of the real shape, the cursor snaps onto the real shape.
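Returning to the point annotations of Section 4.1, the sketch below shows one way the ray-surface intersection and its barycentric re-anchoring could be computed. It assumes SciPy's Delaunay triangulation is available; the projection and ray helpers are placeholders and all names are illustrative.

```python
import numpy as np
from scipy.spatial import Delaunay

def ray_triangle_plane(origin, direction, tri):
    """Intersect a viewing ray with the plane supporting triangle tri (3x3)."""
    a, b, c = tri
    n = np.cross(b - a, c - a)
    t = np.dot(a - origin, n) / np.dot(direction, n)
    return origin + t * direction

def barycentric(point, tri):
    """Barycentric coordinates of a 3D point with respect to triangle tri."""
    a, b, c = tri
    m = np.column_stack([b - a, c - a])
    uv, *_ = np.linalg.lstsq(m, point - a, rcond=None)   # tolerant to points slightly off-plane
    return np.array([1.0 - uv[0] - uv[1], uv[0], uv[1]])

def anchor_point_annotation(click_2d, landmarks_3d, project, ray_for):
    """Pick the triangle under the click and store the hit point in barycentric form."""
    pts_2d = np.array([project(x) for x in landmarks_3d])   # re-project visible landmarks
    tri = Delaunay(pts_2d)
    simplex = tri.find_simplex(np.asarray(click_2d))
    if simplex < 0:
        return None                                         # click outside the triangulation
    idx = tri.simplices[simplex]
    corners = np.array([landmarks_3d[i] for i in idx])
    origin, direction = ray_for(click_2d)                   # viewing ray through the clicked pixel
    hit = ray_triangle_plane(origin, direction, corners)
    return idx, barycentric(hit, corners)

def current_anchor(bary, corners):
    """Re-evaluate the anchor as the triangle corner estimates move (Sec. 4.1)."""
    return bary @ corners                                   # weighted sum of the 3D corners
```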

Figure 5: Possible annotations include: 3D arrows drawing attention to a detail, highlights of objects, texture labels to signal danger, or text callouts with instructions.

Figure 6: (Left) The reconstructed map seen from two views. The coordinate frame represents the camera pose. Dark polygons show the three rectangles and the five measurement points on the estimated CD. (Right) Corresponding shots from the video view with arrows registered to the centres of the complex landmarks. The white outlines show the tracked estimates of the shapes.

Then, clicking on the shape creates both the new annotation and the underlying complex landmark, initialised with the outline shape. Planar annotations directly use the planar shape and pose of a single complex landmark. The centre of mass of polygonal shapes, or the centre of ellipsoidal shapes, defines the origin of the annotation frame. The normal corresponding to the z axis is directly given by the estimated plane of the complex landmark. For polygonal shapes, the first edge in the shape description specifies the direction of the x axis and the y axis is computed as the cross product of z and x. For ellipsoidal shapes, the same approach as for point annotations is applied.
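A hedged sketch of this frame construction for the polygonal case follows; the vertex centroid stands in for the centre of mass, and re-orthogonalising the first edge against the normal is an illustrative choice rather than a detail stated in the paper.

```python
import numpy as np

def planar_annotation_frame(plane, outline_3d):
    """Reference frame of a planar annotation from its supporting plane and outline.

    plane = (a, b, c) of a*x + b*y + c*z + 1 = 0; outline_3d is an Nx3 array of
    polygon vertices. The vertex centroid stands in for the centre of mass.
    """
    outline_3d = np.asarray(outline_3d, dtype=float)
    origin = outline_3d.mean(axis=0)                # annotation origin
    z = np.asarray(plane, dtype=float)
    z = z / np.linalg.norm(z)                       # plane normal defines the z axis
    edge = outline_3d[1] - outline_3d[0]            # first edge gives the x direction
    x = edge - np.dot(edge, z) * z                  # re-orthogonalise against the normal
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)                              # y completes the right-handed frame
    rotation = np.column_stack([x, y, z])           # columns are the frame axes in world coordinates
    return origin, rotation
```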

4.3 Annotation graphics

Different graphical icons can be used with the individual annotation types. For point landmarks, the graphical icons are simply transformed from the local annotation frame into the world frame and rendered. Possible choices are a transparent arrow pointing along the normal at the intersection point on the estimated surface, a textual label attached to the intersection point, or an image overlaid in the tangent plane. Annotations of planar landmarks also use the landmark's geometry to render outlines, semi-transparent shapes, or text notes and images centred on the landmark (see Fig. 5).

5 Results

In general, complex landmarks accumulate more information than point landmarks and converge more quickly to their final locations. Fig. 6 shows qualitative results of including multiple rectangles and a circle. The two smaller rectangles lie in the same plane and correspond well to the surrounding point cloud. The larger rectangle is angled away from the wall because the poster is fixed only at the top and the system approximates it as a flat surface. The CD on the desk below the wall is also coplanar with nearby points. Both surfaces are approximately normal to each other. Fig. 1 shows the typical workflow of creating annotations. The appropriate tracking cursor is selected (here, the rectangle) and placed to match the interesting feature closely. Upon clicking the select button, the annotation is placed in a default position parallel to the image plane. After some frames with significant camera motion the annotation snaps into the right pose. The annotation location is further refined as the system continues to track the underlying shape.

6 Discussion and Future Work

The presented system demonstrates how user input can direct the attention of a vision system to directly support the application. Here, the creation of annotations in previously unknown environments at run-time is simplified by using the SLAM system's measurement capabilities to accurately determine the location of new annotations. The user has to select a salient feature in the environment only once, without having to fully specify its location. The system then automatically estimates the full pose of the feature and the attached annotation from subsequent video frames.

Future work will focus on improving the robustness of the underlying SLAM system, which can easily lose track due to fast camera motion or transient occlusions. A dense representation of the environment would allow more complex objects to be recovered. Recognition of geometric objects such as cuboids, cylinders or even more general shapes could simplify the interaction further.

This work is supported by a grant from the Boeing Company.

References

[1] Yohan Baillot, Dennis Brown, and Simon Julier. Authoring of physical models using mobile computers. In Proc. ISWC 2001, pages 39–46, Zurich, Switzerland, October 8–9, 2001. IEEE.
[2] Andrew J. Davison, Walterio W. Mayol, and David W. Murray. Real-time localisation and mapping with wearable active vision. In Proc. ISMAR 2003, pages 18–27, Tokyo, Japan, October 7–10, 2003. IEEE.
[3] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proc. SIGGRAPH '96, pages 11–20, 1996.
[4] Tom W. Drummond and Roberto Cipolla. Visual tracking and control using Lie algebras. In Proc. IEEE CVPR 1999, June 23–25, 1999.
[5] Ethan Eade and Tom Drummond. Scalable monocular SLAM. In Proc. CVPR 2006, volume 1, pages 469–476, 2006.
[6] Walterio W. Mayol, Andrew J. Davison, Ben J. Tordoff, Nicholas D. Molton, and David W. Murray. Interaction between hand and wearable camera in 2D and 3D environments. In Proc. BMVC 2004, September 2004.
[7] Nicholas D. Molton, Andrew J. Davison, and Ian D. Reid. Locally planar patch features for real-time structure from motion. In Proc. BMVC 2004, September 2004.
[8] Michael Montemerlo, Sebastian Thrun, Daphne Koller, and Ben Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proc. IJCAI, Acapulco, Mexico, 2003.
[9] Gilles Simon. Automatic online walls detection for immediate use in AR tasks. In Proc. ISMAR 2006, pages 39–42, October 22–25, 2006.
[10] Paul Smith, Ian Reid, and Andrew J. Davison. Real-time monocular SLAM with straight lines. In Proc. BMVC 2006, volume 1, pages 17–26, Edinburgh, September 2006.
[11] Gabriel Taubin. Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(11):1115–1138, 1991.
[12] Anton van den Hengel, Anthony Dick, Thorsten Thormählen, Benjamin Ward, and Philip H. S. Torr. Building models of regular scenes from structure and motion. In Proc. BMVC 2006, September 2006.