Semi-Autonomous Generation of Appearance-based Edge Models from Image Sequences

Jeremiah Neubert∗ (University of North Dakota), John Pretlove† (ABB Research), Tom Drummond‡ (Cambridge University)

Figure 1: Creation of 3D edge models directly from image sequences. The user indicates planar regions of an object to generate a polygon in (a). The planar regions are reconstructed, and keyframes, consisting of sets of edgels (white lines) inside each planar region, are selected to capture viewpoint-related changes in appearance (b). The resulting model is used to track the object in real time (c).

Abstract

Many of the robust visual tracking techniques utilized by augmented reality applications rely on 3D models and information extracted from images. Models enhanced with image information make it possible to initialize tracking and detect poor registration. Unfortunately, generating 3D CAD models and registering them to image information can be a time-consuming operation. The process regularly requires multiple trips between the site being modeled and the workstation used to create the model. The system presented in this work eliminates the need for a separately generated 3D model by utilizing modern structure-from-motion techniques to extract the model and associated image information directly from an image sequence. The technique can be implemented on any handheld device instrumented with a camera and network connection. The process of creating the model requires minimal user interaction in the form of a few cues to identify planar regions on the object of interest. In addition, the system selects a set of keyframes for each region to capture viewpoint-based appearance changes. This work also presents a robust tracking framework to take advantage of these new edge models. The performance of both the modeling technique and the tracking system is verified on several different objects.

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]


1 Introduction

The use of 3D CAD models for tracking has been the focus of a large body of work [1, 3, 12, 15, 18]. Contour tracking using these models can be made fast and efficient using a 1D search. One such technique, presented by Drummond and Cipolla [3], renders a wireframe model onto the image using a prior estimate of pose. A discrete set of points is selected from the rendered edges and matched to the nearest variation in image intensity found by searching along the normal to the edge. The distances between the correspondences are then used to calculate the posterior estimate of the pose using a robust M-estimator. Unfortunately, systems that match wireframe models to the nearest variation in image intensity fail in cluttered environments because they assume that the nearest edge is the correct one. Moreover, errors caused by bad edge correspondences lead to a bad prior pose estimate in the next frame, making it almost impossible to recover. The systems presented in [20] and [14] address the issue by supplementing the model with image information (such as texture) to improve edge tracking and provide a mechanism for recovery. The system presented by Fua et al. [20] uses point features registered to the model to aid in image tracking, failure detection, and initialization. Point features have strong descriptors which provide robust matching to obtain more reliable correspondences for the initial pose estimation and failure recovery. In addition, Fua et al. allow for multiple hypotheses of edgel location to further increase robustness. Rosten and Drummond [15] use a similar method, but the point features are extracted at runtime after the pose has been estimated. The features are then back-projected onto the 3D model using the posterior pose estimate from the edge tracker. The back-projected features are matched with the next frame to obtain a better initial pose estimate. This improves edge tracking by increasing the likelihood that the nearest edge is the correct one. These points also prevent the system from becoming stuck in a local minimum when two edges render to the same location in the image. The drawback to such a system is that there is no mechanism to detect or handle failure.

These tracking techniques are powerful and robust methods for augmenting the world, but they require a complete 3D model supplemented with image information. This may be difficult and expensive to generate in some environments. We are currently working to create an augmented reality system for use in large industrial environments. Large industrial plants are dynamic structures due to constantly changing technology and safety regulations. The dynamic nature of an industrial environment makes maintaining accurate models almost impossible. Even where CAD models are available, they are either significantly out of date or do not yield the information needed for successful visual tracking. When the decision is made to create the needed models, the process can be prohibitively expensive because large industrial environments can include more than 30,000 objects. It would be convenient if a person using a handheld computer instrumented with a cheap off-the-shelf camera could capture all the information needed to produce a model supplemented with the desired image information, without prior knowledge of the object structure.

This paper proposes such a method, using modern structure-from-motion techniques such as [5, 13] to recover the camera trajectory over the sequence of images. This camera trajectory is combined with cues from the user, provided in the form of annotated images, to create an appearance-based edge model. The set of annotated images is no larger than the number of planar surfaces being added to the model. The selected planar regions are then reconstructed using point features. Keyframes are then selected to model each region's appearance as a function of camera pose. The keyframes consist of a list of edgel locations and 1D feature vectors. The result of this process is a "crude," appearance-based edge model which contains a set of planar regions reconstructed up to a scale factor and their associated keyframes. Because the model contains few details about the actual 3D structure of the object, keyframes are used to capture the effects of unmodeled structure on the object's appearance. This differs from previous systems that relied on complete 3D models textured with image information, which were rendered to produce the desired features. Relying on keyframes rather than textured models has the added benefit of reduced computational load, because rendering such models on small handheld computers can consume considerable resources. Moreover, the "crude" model does not require prior knowledge of the object's structure to create and is sufficiently accurate for use with an edge tracking system such as that described in [14].

This work also presents a modified edge tracker for utilizing the newly created models. The system is based on that of [3] with the enhanced edge localization of [14]. Modifications to these existing techniques are needed to accommodate the "crude" representation of the object's structure, which requires the system to detect when a planar region is not visible, because some of the object's structure may be unmodeled, leading to unexpected self-occlusions. In addition, the new tracking system must select the best keyframe for each planar region, unlike other systems that use textured models or whole-object keyframes to capture pose-based appearance changes.

The remainder of the paper is organized as follows. First, we present a brief background on the creation of edge models from a sequence of images. Our algorithm for edge model creation is then described in Section 3. This includes how the 3D location of the plane is recovered from the image sequence and the criteria used to select the keyframes that represent the viewpoint-based appearance of each planar region. A graph of neighboring keyframes is then created to aid the proposed tracking system in determining the optimum keyframe to use on a given image. In Section 4 the edge tracking system is presented. The paper then concludes with a demonstration of the presented system on several objects and a discussion of the results.

2 Background

Several systems already exist for creating contour models using image information. One such system, outlined by Reitmayr and Drummond [14], uses a 3D CAD model textured with image information created offline to generate new edge models for any given object pose. The new edge model is generated by rendering the textured CAD model into a frame buffer, which is processed to locate edgels and extract their corresponding 1D feature descriptors. The feature descriptors not only provide more robust localization, but are also used to detect failure. The drawback to the system is the cost of building textured models. In addition, textured models are often sparsely populated with interesting features, so a lot of unused information is stored with them.

The SLAM community has also been working on creating 3D edge models from image sequences without any prior knowledge of the world. The system presented in [4] replaces standard point features with edge features. Using a particle filter, it is able to accurately determine the camera pose relative to some arbitrary world frame. Unfortunately, the model would be difficult to use to augment objects in the world because the system is only aware of a 3D cloud of features corresponding to edgels. There is no knowledge of what objects they lie on, occlusion, or surfaces. In fact, features that become occluded are simply removed from the map. Also, without any information from the user, the system may not even track any part of the desired object. Gee and Mayol-Cuevas [9] move beyond edgels to line segments. This gives the system a slightly deeper understanding of the world in that there is more than a cloud of 3D points; rather, points are gathered together to form lines. This is not enough for the edge trackers presented above because there is no method to identify occlusion and no real understanding of the world. In [8] the SLAM system assigns all the edgels to a plane, but it assumes a single plane of fixed orientation to the camera. Most objects being augmented have several planes, and the orientation of the camera with respect to them is constantly changing.

Our system combines structure-from-motion and user input to construct a 3D model of the world that will facilitate augmentation of objects. Because the model contains planes rather than a cloud of 3D points, it is a trivial matter to add augmentations.

3 Model Creation

The first step in creating a new model is to record a video sequence observing the object from positions distributed over the volume of space in which an end user would use the system. The images are then passed through a SLAM system such as [5, 13] to get an initial estimate of the camera trajectory. The accuracy of the camera trajectory can be further improved using a non-linear optimization over the entire sequence [10]. The trajectory is described by a series of poses $T_{cw}(0), \ldots, T_{cw}(N)$. The pose of the camera when image $i$ was captured, relative to some arbitrary world frame, is expressed as an SE(3) transformation

$$T_{cw}(i) = \begin{bmatrix} R & \vec{t} \\ 0 & 1 \end{bmatrix}, \qquad (1)$$

where $R \in SO(3)$ and $\vec{t} \in \mathbb{R}^3$. The subscript $cw$ indicates that the transformation represents the {c}amera pose relative to a {w}orld coordinate frame. A point $\vec{p}_w \in \mathbb{R}^3$ known with respect to the world frame can be expressed in camera frame $i$ by applying the rigid transformation $T_{cw}(i)$:

$$\vec{p}_c(i) = R\,\vec{p}_w + \vec{t}. \qquad (2)$$
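To make the pose convention concrete, the following sketch (our own illustrative example, not code from the paper) builds a 4x4 homogeneous transform and applies it to a world point exactly as in eqs. 1 and 2:

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 SE(3) matrix T_cw from a rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_point(T_cw, p_w):
    """Express a world point p_w in the camera frame, i.e. eq. 2: p_c = R p_w + t."""
    return T_cw[:3, :3] @ p_w + T_cw[:3, 3]

# Example: a camera translated 1 m along the world x-axis, no rotation.
T_cw = make_pose(np.eye(3), np.array([-1.0, 0.0, 0.0]))
print(transform_point(T_cw, np.array([1.0, 0.0, 2.0])))  # -> [0. 0. 2.]
```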

The final pieces of data needed before the model can be created are user cues indicating the planar regions to be extracted and modeled. The reconstructed planar surfaces are combined to create an appearance based edge model similar to the 3D models already commonly used. The user can specify the planar regions to add to the model using any one of several methods for specifying an image region including painting the pixels in the desired area or drawing a polygon around them. In this work the planes are indicated using the latter method where the polygon is constructed from a series of mouse clicks. While the system is robust to significant noise in the selection area, the majority of the selected pixels should correspond to the desired planar region. Fig. 2 shows an example selection. The remainder of the process outlined below is performed for each indicated region before it is added to the model.

Figure 2: User annotation of a model plane. Each 'x' denotes a user-selected point used to create the desired polygon.

3.1 Reconstruction of Image Plane

For each indicated region the 3D description of the plane is found relative to the camera pose when the annotated image $r$ was captured. A plane can be represented by a four-vector $\vec{L} = [\vec{n}, d]^T$, where $\vec{n}$ is the unit-length three-vector normal to the plane and $d$ is the distance from the origin to the plane. $\vec{L}$ is constructed such that if a point $\vec{p} \in \mathbb{R}^3$ is on the plane described by $\vec{L}$, the following is true:

$$0 = \vec{n}^T \vec{p} + d. \qquad (3)$$

Point Feature Selection: The process used to recover the user-specified planar region $\vec{L}_c(r)$ begins with the location of FAST features [16] in image $r$. At the location of each feature an image patch is extracted and used to create a normalized feature descriptor. The feature is localized in prior and subsequent frames by searching for the image point that maximizes the normalized cross-correlation score with the feature's descriptor. When the maximum cross-correlation score is not sufficiently high, the correspondence is rejected because there is a high probability that it is false. The search is made faster by using the known camera motion and epipolar geometry to get an initial estimate of feature location. Once the reconstructed 3D location of a point is known with sufficient accuracy, it can be projected onto the image using the camera trajectory, and only pixels within three standard deviations of the estimate need to be searched for a match.
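As an illustration of this matching step, the sketch below scores candidate pixels by normalized cross-correlation against a stored patch descriptor. The patch size and acceptance threshold are illustrative choices, not values stated in the paper.

```python
import numpy as np

def normalize_patch(patch):
    """Zero-mean, unit-norm version of an image patch (the stored descriptor)."""
    p = patch.astype(np.float64).ravel()
    p -= p.mean()
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else p

def match_feature(image, descriptor, candidates, patch_size=11, min_score=0.8):
    """Return the candidate pixel whose surrounding patch best matches the descriptor
    by normalized cross-correlation, or None if the best score is below min_score
    (a likely false correspondence)."""
    half = patch_size // 2
    best_score, best_px = -1.0, None
    for (u, v) in candidates:  # candidate pixels, e.g. near the epipolar prediction
        patch = image[v - half:v + half + 1, u - half:u + half + 1]
        if patch.shape != (patch_size, patch_size):
            continue  # too close to the image border
        score = float(np.dot(normalize_patch(patch), descriptor))
        if score > best_score:
            best_score, best_px = score, (u, v)
    return best_px if best_score >= min_score else None
```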

One problem with using a normalized image patch as a descriptor is that some are more unique than others. This is especially true with manufactured objects, where textures may contain a large number of features with similar descriptors. A weak, or commonly occurring, feature descriptor can lead to false correspondences and poor localization. One method for handling features with weak descriptors is placing a large covariance on their location, which allows the information from these features to still be incorporated into the solution. Unfortunately, the information that they contribute is often misleading and can skew the solution, so it is better to neglect them. Features with weak descriptors are identified and removed by searching the image near where they were extracted for likely false correspondences. If the feature is detected anywhere but the pixels neighboring its original location, it is rejected. Another phenomenon that can lead to errors in the reconstruction of the planar region is using a disproportionate number of unique features from a small area. This gives the local region undue influence over the reconstruction results. To avoid this, the system enforces a minimum distance between the unique features. If subsequent features are extracted within the minimum distance of an existing feature, they are not used during reconstruction.

Feature Reconstruction: It is assumed that a 2D feature $f_i$ corresponds to a 3D point

$$\vec{p}_{f_i} = [x\; y\; z]^T. \qquad (4)$$

By observing the feature in multiple images with known camera pose it is possible to determine its 3D location. Unfortunately, the error in this reconstruction is not linear in depth and has a non-zero-mean distribution. Such a distribution will lead to a biased estimate if it is not modeled correctly. One solution would be to use an unscented transform [11] to get an estimate of the feature's 3D location from image observations. Although an unscented transform will reduce the effects of these non-linearities, a linear solution to the problem is found in [5], which reconstructs the points in what is referred to as inverse depth coordinates. This formulation is shown to have an approximately linear, zero-mean distribution in all dimensions. The same point $\vec{p}_{f_i}$ can be expressed in inverse depth coordinates as

$$\vec{b}_{f_i} = [u_{f_i}\; v_{f_i}\; q_{f_i}]^T, \qquad (5)$$

where the normalized camera coordinates $\vec{v}_{f_i} = [u_{f_i}\; v_{f_i}]^T$ are defined as

$$\vec{v}_{f_i} = \begin{bmatrix} x_{f_i}/z_{f_i} \\ y_{f_i}/z_{f_i} \end{bmatrix} = \mathrm{proj}\!\left(\vec{p}_{f_i}\right) \quad \text{and} \quad q_{f_i} = \frac{1}{z_{f_i}}. \qquad (6)$$

This holds as long as the depth of the point is not zero, which is reasonable because the z-axis in our system is defined to be parallel to the optical axis and zero at the optical center. By dividing eq. 2 by $z_{f_i}$ it can be shown that a rigid transformation can be applied to $\vec{b}_{f_i}$ using the equation

$$\vec{b}^c_{f_i}(i) = R \begin{bmatrix} u_{f_i} \\ v_{f_i} \\ 1 \end{bmatrix} + q_{f_i}\,\vec{t}, \qquad (7)$$
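A small sketch (our own illustration, using the notation above) of converting between Cartesian and inverse depth coordinates and applying eq. 7:

```python
import numpy as np

def to_inverse_depth(p):
    """Cartesian point [x, y, z] -> inverse depth coordinates [u, v, q] (eqs. 5-6)."""
    x, y, z = p
    return np.array([x / z, y / z, 1.0 / z])

def from_inverse_depth(b):
    """Inverse depth coordinates [u, v, q] -> Cartesian point [x, y, z]."""
    u, v, q = b
    return np.array([u / q, v / q, 1.0 / q])

def transform_inverse_depth(b, R, t):
    """Apply a rigid transformation to inverse depth coordinates (eq. 7).
    Returns R [u, v, 1]^T + q t; normalizing its first two components by the
    third gives the feature's normalized camera coordinates in the new frame."""
    u, v, q = b
    return R @ np.array([u, v, 1.0]) + q * t
```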

where $R$ and $\vec{t}$ are the rotation and translation from the original coordinate frame to the camera coordinate frame at $i$.

An initial estimate of the inverse depth coordinates can be made from two observations, provided the camera motion between the two observations is not degenerate. The features were originally extracted in frame $r$, the user-annotated frame, so one of the observations used in the initial estimate will be from frame $r$. The second observation can come from any frame that does not lead to a degenerate solution, but for simplicity of notation it is assumed that the second observation comes from the subsequent frame $r+1$. The observations are obtained from the images as distorted pixel locations $\vec{w}_{f_i}(r)$ and $\vec{w}_{f_i}(r+1)$. Using the inverse of the camera model (eq. 25), the normalized camera coordinates of the observations, $\vec{v}_{f_i}(r)$ and $\vec{v}_{f_i}(r+1)$, are found. The covariances of these observations are $C_{f_i}(r)$ and $C_{f_i}(r+1)$ respectively, each calculated by differentiating the inverse camera model and multiplying it with the expected measurement noise. Using these results, the optimum estimate of the feature's inverse depth coordinates will minimize

$$\vec{e}(r)^T C_{f_i}^{-1}(r)\, \vec{e}(r) + \vec{e}(r+1)^T C_{f_i}^{-1}(r+1)\, \vec{e}(r+1). \qquad (8)$$

The residual error $\vec{e}$ is the difference between the observation $\vec{v}$ and the projection of the feature's inverse depth coordinates into normalized camera coordinates, given by

$$\vec{e} = \mathrm{ProjError}\!\left(\vec{v}, \vec{b}, R, \vec{t}\right) = \vec{v} - \mathrm{proj}\!\left(R \begin{bmatrix} u_{f_i} \\ v_{f_i} \\ 1 \end{bmatrix} + q_{f_i}\,\vec{t}\right). \qquad (9)$$

The first step in minimizing eq. 8 is calculating the residual errors $\vec{e}(r)$ and $\vec{e}(r+1)$ used to update the estimate of $f_i$'s inverse depth coordinates. There is no rotation or translation between $\vec{b}^c_{f_i}(r)$ and $\vec{v}_{f_i}(r)$ because they are expressed in the same coordinate frame. The residual for the second observation is calculated with respect to the reference frame of the first measurement, using the rotation $R$ and translation $\vec{t}$ from the rigid transformation $T_{(r+1)r}$. The results are stacked to get

$$\begin{bmatrix} \vec{e}(r) \\ \vec{e}(r+1) \end{bmatrix} = \begin{bmatrix} \mathrm{ProjError}\!\left(\vec{v}_{f_i}(r), \vec{b}^c_{f_i}(r), I, \vec{0}\right) \\ \mathrm{ProjError}\!\left(\vec{v}_{f_i}(r+1), \vec{b}^c_{f_i}(r), R, \vec{t}\right) \end{bmatrix}. \qquad (10)$$

The next step is to determine the Jacobian that relates the residual errors to changes in $\vec{b}^c_{f_i}(r)$ by calculating the partial derivative of eq. 10 with respect to $u$, $v$, and $q$ to obtain

$$J = \begin{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \\[6pt] q^{-1} \begin{bmatrix} 1 & 0 & -u\,q^{-1} \\ 0 & 1 & -v\,q^{-1} \end{bmatrix} \begin{bmatrix} \vec{R}_1 & \vec{R}_2 & \vec{t} \end{bmatrix} \end{bmatrix}, \qquad (11)$$

where $\vec{R}_1$ and $\vec{R}_2$ are the first two columns of $R$, $\vec{t}$ is the translation, and the elements of $\vec{b}^c_{f_i}(r+1)$ are denoted $u$, $v$, $q$ for brevity. For the next part of the solution it is helpful to define

$$Q^{-1} = \begin{bmatrix} C_{f_i}(r)^{-1} & 0 \\ 0 & C_{f_i}(r+1)^{-1} \end{bmatrix}, \qquad (12)$$

which is the weight applied to the information supplied by each observation based on its uncertainty. Using the initial guess $\vec{b}^c_{f_i} = [u^c_{f_i}(r)\; v^c_{f_i}(r)\; 1]^T$ and the equations above, the solution can be found iteratively by adding

$$\Delta\vec{b}^c_{f_i}(r) = \left(J^T Q^{-1} J\right)^{-1} J^T Q^{-1} \begin{bmatrix} \vec{e}(r) \\ \vec{e}(r+1) \end{bmatrix} \qquad (13)$$

to $\vec{b}^c_{f_i}$. The process of calculating the residual error and Jacobian and updating the estimate of $\vec{b}^c_{f_i}$ is repeated until the result of eq. 8 is sufficiently small. It is also important to note that the covariance of the solution is

$$P = \left(J^T Q^{-1} J\right)^{-1}, \qquad (14)$$
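The two-view initialization of eqs. 8-14 is a small Gauss-Newton problem. The sketch below is our own simplified illustration: it works directly on normalized camera coordinates, uses a numerical Jacobian instead of eq. 11, and assumes isotropic measurement noise and non-degenerate camera motion.

```python
import numpy as np

def residuals(b, v_r, v_r1, R, t):
    """Stacked reprojection residuals (eq. 10) for inverse depth coords b = [u, v, q]."""
    u, v, q = b
    m = R @ np.array([u, v, 1.0]) + q * t       # eq. 7: feature expressed in frame r+1
    e_r = v_r - np.array([u, v])                # frame r: identity transform
    e_r1 = v_r1 - m[:2] / m[2]                  # frame r+1: project and compare
    return np.concatenate([e_r, e_r1])

def triangulate_inverse_depth(v_r, v_r1, R, t, iters=10):
    """Gauss-Newton estimate of [u, v, q] from two normalized observations."""
    b = np.array([v_r[0], v_r[1], 1.0])         # initial guess: q = 1
    eps = 1e-6
    for _ in range(iters):
        e = residuals(b, v_r, v_r1, R, t)
        J = np.empty((4, 3))                    # numerical Jacobian w.r.t. (u, v, q)
        for k in range(3):
            db = np.zeros(3)
            db[k] = eps
            J[:, k] = (residuals(b + db, v_r, v_r1, R, t) - e) / eps
        b = b - np.linalg.solve(J.T @ J, J.T @ e)
    return b
```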

which is used to initialize an extended Kalman filter (EKF). Using the EKF provides a framework for including additional measurements to refine the estimate of $\vec{b}^c_{f_i}$. In our implementation the innovation is defined by eq. 9, which is mapped to $\vec{b}_{f_i}$ by the Jacobian in the bottom half of eq. 11.

Plane Reconstruction: Once a sufficient number of points have been reconstructed accurately, the equation of the plane can be estimated. Fortunately, there is no need to convert the inverse depth coordinates of the features into Cartesian coordinates, because a solution for calculating $\vec{L}_c(r)$ with inverse depth coordinates can be derived. First, note that a point on the plane described by $\vec{L}$ obeys eq. 3, which, by expressing $\vec{p} = [x\; y\; z\; 1]^T$ in projective three-space $\mathbb{P}^3$, can be rewritten as $\vec{L}^T \vec{p} = 0$. Taking advantage of the fact that $\lambda\vec{p} \equiv \vec{p}$ in $\mathbb{P}^3$, the following is true:

$$\vec{p} \equiv z^{-1} [x\; y\; z\; 1]^T = [u\; v\; 1\; q]^T. \qquad (15)$$

Thus the equation of the plane can be calculated directly from the inverse depth coordinates using a standard least-squares solution and RANSAC [7] to reject outliers.
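A minimal sketch of this plane fit (our own illustration: a linear least-squares fit on [u, v, 1, q] rows inside a basic RANSAC loop; the threshold and iteration count are arbitrary placeholder values):

```python
import numpy as np

def fit_plane(B):
    """Least-squares plane L = [n, d] from inverse depth points B (N x 3 rows of [u, v, q]).
    Each point contributes a row [u, v, 1, q] so that L . row = 0 (eqs. 3 and 15)."""
    A = np.column_stack([B[:, 0], B[:, 1], np.ones(len(B)), B[:, 2]])
    _, _, Vt = np.linalg.svd(A)
    L = Vt[-1]                                # null-space direction of A
    return L / np.linalg.norm(L[:3])          # scale so the normal n = L[:3] is unit length

def ransac_plane(B, iters=200, thresh=0.01, rng=np.random.default_rng(0)):
    """Fit a plane to inverse depth points with RANSAC, then refit on the inliers."""
    best_inliers = None
    A = np.column_stack([B[:, 0], B[:, 1], np.ones(len(B)), B[:, 2]])
    for _ in range(iters):
        sample = B[rng.choice(len(B), 3, replace=False)]
        L = fit_plane(sample)
        res = np.abs(A @ L)                   # algebraic point-to-plane residual
        inliers = res < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_plane(B[best_inliers])
```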

3.2 Edgel Extraction

The next step is identifying frames that view a portion of the user-specified planar region by projecting the annotations onto the reconstructed plane, using the method outlined below in Section 3.3, to get a 3D bounding polygon. The bounding polygon can be rendered onto any image in the sequence using the known camera trajectory. Frames that contain a portion of the user-specified region, viewed from the side annotated by the user in frame $r$, are converted to monochrome byte images and blurred using a Gaussian kernel with a $\sigma$ of 1.5 to remove high-frequency image noise. The blurred images are processed using a standard edge detector such as Canny [2]. A list of edgels $e_{ij}$ contained in the projection of the user-specified polygon in image $i$ is stored in what is referred to as an edge frame $E_i$, where $E_i = \{e_{i1}, \ldots, e_{in}\}$. The edge frame is then added to the set of possible keyframes $\mathcal{E}$.

One of the major issues with not having a priori knowledge of the 3D object structure is an inability to detect self-occlusion. Edgels generated by structure that does not lie on the plane are detected and removed by analyzing their motion. Those that lie on the plane can be mapped to other frames using a planar homography calculated from the plane equation $\vec{L}_c(r)$ and the camera motion between the frames. The motion of the camera between the two images $i$ and $j$ is

$$T_{ji} = T_{cw}(j)\, T_{cw}(i)^{-1}. \qquad (16)$$

The equation of the plane with respect to the camera at $i$ can be calculated from $\vec{L}_c(r)$ and the known camera motion using

$$\vec{L}_c(i) = \vec{L}_c(r)\, T_{ri}. \qquad (17)$$

Combining these results, the homography is

$$H = R_{ji} - \vec{t}_{ji}\, \frac{\vec{n}^T_{L_c(i)}}{d_{L_c(i)}}. \qquad (18)$$
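A short sketch of assembling such a plane-induced homography from the relative pose and the plane parameters (our own illustration following the sign convention of eq. 18; it maps normalized camera coordinates from frame i to frame j, with the plane given in frame i's coordinates):

```python
import numpy as np

def plane_homography(T_ji, L_i):
    """Homography induced by the plane L_i = [n1, n2, n3, d] (n^T p + d = 0, frame i coords)
    between normalized camera coordinates of frames i and j (cf. eq. 18)."""
    R_ji, t_ji = T_ji[:3, :3], T_ji[:3, 3]
    n, d = L_i[:3], L_i[3]
    return R_ji - np.outer(t_ji, n) / d

def map_point(H, v_i):
    """Map a normalized image point [u, v] from frame i to frame j via H."""
    p = H @ np.array([v_i[0], v_i[1], 1.0])
    return p[:2] / p[2]
```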

It is assumed that the object appearance is constant "near" any particular camera pose. It can also be assumed that the camera motion is smooth and continuous. Given these assumptions, the object's appearance should be invariant over a small subset of sequential frames. Thus an edgel present in $E_k$ should be present in each set in the interval $[E_{k-g}, E_{k+g}]$, where $g$ is a relatively small number. The edgel $e_{ki} \in E_k$ and its covariance are projected into the frames in $[E_{k-g}, E_{k+g}]$. Edgel $e_{ki}$ has a match in a frame if there is an edgel within $3\sigma$ of its predicted location with a similar gradient normal. If the number of frames that do not contain a corresponding edgel within $3\sigma$ of the prediction indicates with more than 95% confidence that the edgel is not on the plane, it is removed from $E_k$. Using $g = 3$, an edgel that is not in two or more sets has less than a 5% chance of being on the plane and is removed.
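The sketch below illustrates this filtering rule. It is our own simplified version: each edgel is assumed to have already been mapped into the neighboring edge frames via the homography of eq. 18, and the 3-sigma gating and gradient-normal test are reduced to a fixed pixel radius and angle threshold for brevity.

```python
import numpy as np

def has_match(pred_px, pred_normal, edge_frame, radius=3.0, max_angle=0.5):
    """True if the edge frame contains an edgel near the predicted position
    with a similar gradient normal (angle threshold in radians)."""
    for px, normal in edge_frame:                      # edge_frame: list of (pixel, unit normal)
        if np.linalg.norm(np.asarray(px) - pred_px) < radius:
            cos_a = abs(float(np.dot(normal, pred_normal)))
            if np.arccos(np.clip(cos_a, -1.0, 1.0)) < max_angle:
                return True
    return False

def keep_edgel(predictions, neighbor_frames, min_matches=2):
    """Keep an edgel only if it is re-detected in at least min_matches of the
    2g neighboring edge frames (g = 3 and min_matches = 2 in the paper)."""
    matches = sum(has_match(px, n, frame)
                  for (px, n), frame in zip(predictions, neighbor_frames))
    return matches >= min_matches
```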

3.3 Keyframe Selection

An object's appearance is a function of camera pose. As the camera moves through space, edgels appear and disappear from view. These changes have been estimated in the past by using a textured model rendered into a frame buffer. Our system utilizes keyframes rather than textured models because textured models can be computationally expensive to render on a handheld computer, and without a "complete" 3D model self-occlusions need to be captured in some other way. A large amount of previous work outlines methods for selecting a small subset of frames to represent a large sequence of images without significant information loss. The methods vary significantly depending on the information being preserved by the keyframes, as seen in [17] and [6]. The function of the keyframes in this work is to capture changes in plane appearance as a function of camera pose. This includes modeling the appearance of new edgels and the disappearance of old edgels. In addition to changes to the visible edgels, the keyframes need to capture how the feature descriptors change because, unlike the systems in [14] and [20] that extract feature descriptors from a frame buffer, our system relies on the keyframes to obtain 1D edgel descriptors.

The goal of the selection scheme is to find $E_{keyframes} \subset \mathcal{E}$ which minimizes $|E_{keyframes}|$, while also minimizing the number of unrepresented changes in the extracted planar region's appearance. To describe the similarity between two sets of edgels, an approximation to the intersection of the sets needs to be defined. An approximation is used because, when mapping a set of image points from one image to another, multiple pixels in the first may map to one pixel in the corresponding image. Thus expressing the intersection in frame $i$ or $j$ will yield different sets. The operation $E_i \xrightarrow{H} E_j$ is the set of all the edgels in $E_i$ that can be mapped to an edgel in $E_j$ with a similar gradient normal using the homography $H$ (see eq. 18). The metric used to quantify the difference between any two edge lists $E_i$ and $E_j$ is defined as

$$\mathrm{sim\_score}(E_i, E_j) = 1.0 - \frac{\left(|E_j| - |E_i \xrightarrow{H} E_j|\right) + \left(|E_i| - |E_j \xrightarrow{H} E_i|\right)}{\max\left(|E_i|, |E_j|\right)}. \qquad (19)$$

The numerator of $\mathrm{sim\_score}(\cdot)$ is the difference in appearance, expressed as the number of edgels in each frame that are not contained in the other. This term is normalized by the maximum number of edgels in the sets. The $\mathrm{sim\_score}(\cdot)$ was created to quantify the differences between two edgel lists.

The set of keyframes $E_{keyframes}$ is initialized by adding the edge list that satisfies

$$\operatorname*{arg\,max}_{E_i \in \mathcal{E}} \left(|E_i|\right), \qquad (20)$$

which ensures that the frame with maximum information is in $E_{keyframes}$. Anecdotal evidence suggests that this is a frame where the planar region is approximately normal to the optical axis and occupies the largest number of image pixels. The remainder of the keyframes are selected by evaluating each edge frame $E_i \in \mathcal{E}$ to find

$$\mathrm{best\_score}(E_i) = \max_{E_k \in E_{keyframes}} \left(\mathrm{sim\_score}(E_i, E_k)\right). \qquad (21)$$

The value of $\mathrm{best\_score}(E_i)$ represents how much of the information in $E_i$ is unmodeled by $E_{keyframes}$. When the unmodeled information exceeds an acceptable level, specified by a threshold of 0.7 in the case of this work, $E_i$ is added to $E_{keyframes}$.

Once the construction of $E_{keyframes}$ is completed, the neighbors of each keyframe are found so that the tracking system in Section 4 can traverse $E_{keyframes}$. For each edge list $E_k \in E_{keyframes}$, the set of neighbors $E_{neighbors\,k} \subset E_{keyframes}$ is defined as the set that maximizes

$$\sum_{\forall E_j \in E_{neighbors\,k}} \mathrm{sim\_score}(E_k, E_j), \qquad (22)$$

where $|E_{neighbors\,k}| = h$. We chose $h = 3$ and found it to be sufficient for our purposes. Assuming that the appearance of the plane changes continuously, it is reasonable to assume one of the members of $E_{neighbors\,k}$ will match the image when $E_k$ no longer does. Larger values of $h$ may increase the likelihood of this, but will also increase the computational load of the tracking system (see Section 4). Two examples of neighbor selection are shown in Fig. 3. The blue dashed lines show the sets of neighbors for keyframes 1 and 2. The relationship is not bidirectional; that is, $E_j$ may be a neighbor of $E_i$, but $E_i$ may not be a neighbor of $E_j$. This is only important for keyframes near the edge of the graph, which have limited options for neighboring frames, possibly leading to dissimilar neighbors (see Section 5.3). The keyframe labeled 2 in the figure depicts how poses close in position may not be similar, as the viewing angle and the distance to the object create the most significant changes in appearance.
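The selection loop of eqs. 19-21 can be summarized in a few lines. The sketch below is our own illustration and makes two assumptions: the mapped-intersection sizes of eq. 19 are precomputed and passed in through a score callback, and the 0.7 threshold is interpreted as promoting a frame to keyframe when its best similarity to the current keyframe set falls below 0.7.

```python
def sim_score(n_i, n_j, i_to_j, j_to_i):
    """Eq. 19: n_i, n_j are |E_i| and |E_j|; i_to_j, j_to_i are the sizes of the
    mapped intersections |E_i -> E_j| and |E_j -> E_i|."""
    return 1.0 - ((n_j - i_to_j) + (n_i - j_to_i)) / max(n_i, n_j)

def select_keyframes(frames, score, threshold=0.7):
    """Greedy keyframe selection. `frames` is the list of candidate edge frames and
    `score(a, b)` returns sim_score between two of them using precomputed mappings.
    A frame is promoted to keyframe when it is poorly represented by the current set."""
    keyframes = [max(frames, key=len)]                   # eq. 20: start with the largest edge list
    for frame in frames:
        best = max(score(frame, k) for k in keyframes)   # eq. 21
        if best < threshold:                             # appearance not yet represented
            keyframes.append(frame)
    return keyframes
```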

Figure 3: The red line represents a planar region that has been extracted. The camera pose for each keyframe is shown in the picture. The neighbors for two of the keyframes are indicated by the dashed blue lines.

The final step in the creation of $E_{keyframes}$ is calculating the 3D locations required for tracking. The 3D location of each $e_{ki} \in E_k$ can be obtained by calculating the distance from the camera to the plane $\vec{L}_c(r)$ along the line defined by the optical center of the camera and the normalized camera coordinates of $e_{ki}$. This distance corresponds to the point's depth. The function used to recover the depth of the point in camera $k$ is

$$\mathrm{depth}(\vec{v}_{e_{ki}}) = \frac{-d_k}{u_{e_{ki}} n_{k1} + v_{e_{ki}} n_{k2} + n_{k3}}, \qquad (23)$$

where $\vec{L}_c(k)$ can be obtained using eq. 17 and its normal component is $\vec{n}_k = [n_{k1}\; n_{k2}\; n_{k3}]^T$. Then, given the definition of normalized camera coordinates in eq. 6, the 3D position of $e_{ki}$ is calculated as

$$\vec{p}^{\,T}_{e_{ki}} = \mathrm{depth}(\vec{v}_{e_{ki}}) \cdot [u_{e_{ki}}\; v_{e_{ki}}\; 1.0].$$
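A small sketch of this back-projection onto the plane (our own illustration of eq. 23):

```python
import numpy as np

def backproject_to_plane(uv, L):
    """3D position of an edgel with normalized camera coordinates uv = [u, v]
    lying on the plane L = [n1, n2, n3, d] expressed in the same camera frame (eq. 23)."""
    u, v = uv
    n1, n2, n3, d = L
    depth = -d / (u * n1 + v * n2 + n3)   # depth (z) of the intersection along the viewing ray
    return depth * np.array([u, v, 1.0])
```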

4 Edge Tracking

The edge tracking system developed for use with the appearance-based edge models created in Section 3 is an extension of the edge tracker presented in [3] with the improved edge localization outlined in [14]. Two modifications, outlined below, address keyframe selection for each planar region and detecting when a plane is occluded. Detecting occlusion is particularly important because of the increased likelihood of bad edge correspondences and for determining when to render augmentations associated with the plane.

4.1 Keyframe Selection for Tracking

Figure 4: The red line depicts a 3D plane viewed from two positions represented as black dots. The area of the image the plane occupies (the solid black line between the dashed lines) is the same for both camera poses. Using the area metric described in [19], the two images contain the same information and the appearance should be the same. Unfortunately, it is likely the plane's appearance is significantly different at these two poses.

Defining a metric for selecting the keyframe that best reflects the current appearance of a planar region, given an estimate of pose, is difficult. One metric proposed by Fua et al. [19] attempted to capture the effects of changes in viewing angle and scale by selecting a keyframe based on the number of pixels occupied by each surface of the model. The pixel area of the surfaces is determined by rendering the 3D model, with each surface assigned a different color, into a frame buffer. The area of the image occupied by each plane is then estimated by raster-scanning the buffer. The keyframe $k$ that minimizes

$$\sum_{L \in Model} \left( A\!\left(S, \mathrm{CamProj}(\cdot), T_{cw}(i)\right) - A\!\left(S, \mathrm{CamProj}(\cdot), T_{cw}(k)\right) \right) \qquad (24)$$

is defined to be the nearest keyframe to $T_{cw}(i)$. The function $A(\cdot)$ returns the area of surface $S$ in pixels based on the camera projection function $\mathrm{CamProj}(\cdot)$ and camera pose $T_{cw}$. This method only works with models containing multiple planes and self-occlusion. If the model were as simple as a rectangular planar region, the system would fail because vastly different camera poses yield the same area (see Fig. 4). As pointed out in [19], the metric used must reflect differences in both the viewing angle and scale, thus the area metric is not suitable for determining the best keyframe on a plane-by-plane basis. Additionally, it is not apparent how to define a metric that weights the effects of viewing angle and scale appropriately.

Instead, we propose utilizing the set of neighbors $E_{neighbors\,k}$ found in Section 3.3. When the percentage of $E_k$'s successfully located edgels drops below a threshold, the members of $E_{neighbors\,k}$ are searched for a better match to the object's appearance. If the search is limited only to frames in $E_{neighbors\,k}$ there is a possibility of landing in a local minimum, where a keyframe with a lower percentage of successful matches must be passed through to get to the optimum keyframe. To prevent this, each time $E_{neighbors\,k}$ is searched for a better keyframe an additional frame is selected at random from $E_{keyframes}$ for evaluation. Because a new random keyframe is evaluated each time a transition is made, the system should eventually find the optimum keyframe regardless of whether or not it is in $E_{neighbors\,k}$. The frame with the highest percentage of successful edgel matches is then selected for use by the tracking system. The random frame is also very useful when recovering from failure or an occlusion because the viewing angle may have changed significantly.
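A sketch of this keyframe-switching rule (our own illustration; match_fraction is a hypothetical callback that runs the edge search of Section 4.2 with a given keyframe and returns the fraction of edgels successfully matched, and the switch threshold is a placeholder):

```python
import random

def select_tracking_keyframe(current, neighbors, all_keyframes, match_fraction,
                             switch_threshold=0.6):
    """Keep the current keyframe while it matches well; otherwise evaluate its
    neighbors plus one random keyframe and keep the best-matching candidate."""
    score = match_fraction(current)
    if score >= switch_threshold:
        return current
    candidates = list(neighbors) + [random.choice(all_keyframes)]
    scored = [(match_fraction(k), k) for k in candidates]
    best_score, best = max(scored, key=lambda s: s[0])
    return best if best_score > score else current
```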

4.2 Locating Edgels in the Current Frame

The first step in estimating an edgel's location requires an initial position estimate obtained with the camera pose $T^-_{cw}$, where $-$ denotes a prior estimate. As stated above, $T^-_{cw} \in SE(3)$ not only represents the camera pose but also the rigid transformation from the world to the current frame. The initial estimate of the position of edgel $e_i$ is found by applying a rigid transformation to its 3D position, calculated above, using eq. 2 to obtain $\vec{p}^{\,c}_{e_i}$. Projecting $\vec{p}^{\,c}_{e_i}$ onto the image requires a calibrated camera model. In this work a quintic camera model

$$\begin{bmatrix} u_{e_i} \\ v_{e_i} \end{bmatrix} = \mathrm{CamProj}(\vec{p}^{\,c}_{e_i}) = \begin{bmatrix} f_u & 0 & o_u \\ 0 & f_v & o_v \end{bmatrix} \begin{bmatrix} r' \frac{x_{e_i}}{z_{e_i}} \\[2pt] r' \frac{y_{e_i}}{z_{e_i}} \\[2pt] 1 \end{bmatrix} \qquad (25)$$

is used, where $r = \sqrt{\left(\frac{x}{z}\right)^2 + \left(\frac{y}{z}\right)^2}$ and $r' = 1 + \alpha r^2 + \beta r^4$ provide the correction for radial lens distortion. The parameters $f_u$ and $f_v$ are the focal lengths of the camera and $[o_u, o_v]$ is the location of the projection of the optical center in the image. $\mathrm{CamProj}(\vec{p}^{\,c}_{e_i})$ is the initial estimate of the location of $e_i$ in the current image.

The measurement associated with $e_i$, used to update the pose in Section 4.4, is the distance $d_i$ from the prior estimate of its location to the position along its normal $\vec{n}_{e_i}$ that maximizes the normalized cross-correlation score with the extracted 1D feature descriptor. If the maximum score is below a minimum threshold, localization has failed and $e_i$ is not used in pose estimation.
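A sketch of this projection model (our own illustration of eq. 25; the calibration values in the example are placeholders):

```python
import numpy as np

def cam_proj(p, fu, fv, ou, ov, alpha, beta):
    """Project a camera-frame point [x, y, z] to pixel coordinates using the
    quintic radial distortion model of eq. 25."""
    x, y, z = p
    u, v = x / z, y / z                       # normalized camera coordinates
    r2 = u * u + v * v                        # r^2
    rp = 1.0 + alpha * r2 + beta * r2 * r2    # r' = 1 + alpha r^2 + beta r^4
    return np.array([fu * rp * u + ou, fv * rp * v + ov])

# Example with placeholder calibration parameters.
print(cam_proj(np.array([0.1, -0.05, 1.0]),
               fu=500, fv=500, ou=320, ov=240, alpha=-0.2, beta=0.05))
```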

4.3 Plane Visibility and Failure Detection

After the system has attempted to locate each of the edgels in the current image, the results are used to identify occluded planes based on the percentage of edgels successfully matched. When the percentage of successfully matched edgels in any particular planar region is below a threshold, 60% in our system, all measurements obtained from that region are discarded because of an increased likelihood of false correspondences. The region is also marked as not visible, and any augmentations associated with it are not rendered. Tracking has failed when none of the planar regions are visible.

A more "correct" method was proposed in [14], where the distributions of the normalized cross-correlation scores for successfully and unsuccessfully tracked objects are modeled. Tracking failure is detected by determining which distribution the cross-correlation scores were most likely drawn from. Because the focus of our system is to provide a method for quick, easy modeling of objects, estimating the needed distributions was rejected as too involved. The distributions can change with the operating environment and the object being tracked. In the absence of this data, a threshold on the percentage of successfully tracked edgels is a good approximation.

4.4 Updating Camera Pose

The actual position of the edgels differs from that of the prior estimate by a rigid transformation $\Delta T \in SE(3)$. Rigid transformations compose the Lie group SE(3), thus $\Delta T$ can be decomposed into a six-vector $\vec{\theta}$ using the exponential map. The goal of the system is to obtain an accurate posterior pose estimate by determining $\Delta T$ in

$$T^+_{cw} = \Delta T \; T^-_{cw}. \qquad (26)$$

$\vec{\theta}$ can be estimated from the measurements $d_i$ by calculating the Jacobian, which is the partial derivative of eq. 25 with respect to $\vec{\theta}$ multiplied with the edge normal $\vec{n}_{e_i}$:

$$\frac{\partial d_i}{\partial \vec{\theta}} = \vec{n}_{e_i} \begin{bmatrix} \dfrac{\partial u_{e_i}}{\partial \vec{\theta}} \\[6pt] \dfrac{\partial v_{e_i}}{\partial \vec{\theta}} \end{bmatrix}. \qquad (27)$$

The measurements $d_i$ and their corresponding Jacobians can be stacked to form the $N$-vector $\vec{d}$ and the $N \times 6$ matrix $J$, leading to the formulation

$$J \vec{\theta} = \vec{d}, \qquad (28)$$

which can be solved using least squares. Unfortunately, the least-squares solution is not robust to outliers, so an iteratively reweighted least-squares solution is adopted.
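A sketch of this reweighted solve (our own illustration; the Tukey biweight is used here as one common choice of robust M-estimator weight, which is an assumption rather than the kernel stated in the paper):

```python
import numpy as np

def tukey_weights(residuals, c=4.685):
    """Tukey biweight: down-weight measurements with large residuals, zero beyond c."""
    r = np.abs(residuals) / (np.median(np.abs(residuals)) + 1e-9)
    w = (1.0 - (r / c) ** 2) ** 2
    w[r > c] = 0.0
    return w

def irls_pose_update(J, d, iters=5):
    """Solve J theta = d (eq. 28) with iteratively reweighted least squares.
    J is N x 6 (stacked eq. 27 rows); d is the N-vector of edge-normal distances."""
    theta = np.linalg.lstsq(J, d, rcond=None)[0]       # unweighted initial solution
    for _ in range(iters):
        w = tukey_weights(d - J @ theta)
        JW = J * w[:, None]                            # rows scaled by their weights
        theta = np.linalg.solve(JW.T @ J, JW.T @ d)    # weighted normal equations
    return theta
```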

5 Results

All the data and results shown here were gathered using a VAIO U-71 tablet PC with a 1.1 GHz Pentium III M CPU, instrumented with a USB 2 webcam with a 3.6 mm lens. The camera collected 640 × 480 RGB images, but the tracking is done on quarter-sized byte images. The images collected by the tablet for model creation were transmitted via a wireless network connection to a 3.0 GHz Pentium workstation, which obtained the camera trajectory using the system presented in [5], reconstructed the planar regions, and selected the keyframes.

The performance of the modeling method outlined in Section 3 is demonstrated by using it to create appearance-based contour models for several items of varying difficulty. The results show that while the system is constructed to model the planar regions on an object, the regions need only be an approximation of a plane and can contain significant deviations. The initial model created was for a box-like computer with two large planar surfaces and a rounded front panel. The second object modeled was a laser jet printer, which has several recessed handles and switches as well as being composed completely of rounded surfaces. The last item modeled was a control panel similar to those found in modern industrial environments.

5.1 Trajectory Recovery

There were many options available for recovering the trajectory of the camera over the image sequence, including commercial systems such as Boujou™ from 2d3™ and the SLAM systems presented in [5, 13]. Our initial attempts to use Boujou were unsuccessful because of the large amount of radial distortion in our images. While it may have been possible to use Boujou, the initial performance and lengthy processing time led to the adoption of the system presented in [5]. This system proved to be robust and sufficiently accurate for our application. It also was able to process up to 20 frames per second on the workstation. The accuracy of the camera trajectory was further improved using non-linear optimization over the entire sequence [10]. Trajectory recovery rarely failed. The failures that did occur were caused by an inadequate number of visible features or a camera motion that violated the assumptions made by the algorithm. This type of failure is catastrophic and easily detected. Such failures can be avoided with minimal training before using the system.

5.2 Object Modeling

The first model discussed is that of the computer. It is useful for demonstrating the ability of the system to reconstruct the planar regions. While it is a simple structure, small errors in the reconstruction of the planar surfaces are very obvious because of its long, narrow shape. The close proximity of the two parallel planes makes any errors in reconstruction very apparent. Fig. 5 shows that the reconstructed planes corresponding to the sides of the computer are parallel and that the registration of the annotations remains constant with respect to the computer regardless of viewing angle.

Figure 5: The model created using the presented system is projected over the object. The two planes on the sides of the computer appear to be parallel, as one would expect.

The second object modeled was a laser jet printer. The laser printer is approximately box-like, but unfortunately its surfaces are not planar. When viewing the printer from the top it is obvious the structure is far from a box. The middle portion of the front shown in Fig. 7(b) juts out more than 10 mm further than the edges of the planar region. In addition, there are several recesses inside the selected region. The side of the printer shown in Fig. 7(a) has deviations of more than 15 mm, excluding the recessed handles and switches. Even if the sides were approximated as planes for the purpose of creating a 3D CAD model, the object would be difficult to draw due to the many curved edges and non-conventional shapes. The hours required to measure and draft such an object are avoided using our modeling technique. The model shown in Fig. 8 was created from a video sequence and two annotations requiring fewer than 20 mouse clicks in total (see Fig. 7). The quality of the model created can be seen in the frames from the tracking sequence in Fig. 8, which shows the planes to be approximately normal to one another and the annotations correctly registered to the object. The top line in Fig. 8(b) looks to be slightly high in this frame, which is a result of the annotation being placed slightly above the plane, as shown in Fig. 7(b).

Figure 7: The images above show the annotations supplied by the user to create the model shown in Fig. 8. Creating the annotations took fewer than twenty clicks of the mouse.

The last and most relevant of the results are shown in Fig. 9. Several of the knobs on the panel are more than 100 mm above the plane. Previously, a complete 3D model of the control panel would require the position and height of each knob to be measured and images to be collected. Our system was able to create a model of the panel from a video sequence and fewer than ten mouse clicks. This underscores the importance of the system to industry, because panels with similar appearance are abundant in industrial environments. It is very unlikely that accurate, up-to-date CAD models of such panels exist because they are constantly modified to accommodate plant changes. This system would significantly reduce the time and cost required to obtain models for visually tracking objects in large industrial environments, facilitating the use of augmented reality.

Figure 6: The keyframes acquired to represent a planar region selected by the user in Fig. 7, taken from frames #16, #148, #202, #283, #322, #447, and #509 of the original image sequence. Only a quarter of the total edgels found (white dots) are shown; their corresponding gradient normals are depicted with lines.

Table 1: The neighbor relationships for the keyframes in Fig. 6

Keyframe | Neighbors
16       | 148, 202, 283
148      | 202, 16, 283
202      | 148, 322, 283
283      | 322, 202, 16
322      | 283, 202, 447
447      | 509, 322, 283
509      | 447, 322, 16

5.3 Keyframe Selection

The keyframes selected using the annotated frame in Fig. 7(a) are shown in Fig. 6. As can be seen, the frames selected using the technique in Section 3.3 represent the full range of viewing angles with an approximately minimal number of keyframes. The only two frames from a similar viewing angle were #16 and #148. This was caused by a slight variation in the registration of the annotations with respect to the printer that allowed the edgels on the left side of the printer to be added in frame #148. Variation in the exact 3D location of the polygon can occur because of noise in the camera trajectory or errors in the plane reconstruction. In this case, only a few pixels of variation in the position of the polygon relative to the printer generated a significant change in appearance between the two frames, because of the large number of edgels near the boundary.

Looking at the neighbor relationships between the keyframes in Table 1, it can be seen that the results were for the most part as expected. The neighbors of #16, #148, #202, #283, #322, and #447 are the frames taken from similar viewing angles. The only frame with an unexpected neighbor was frame #509, which claimed #16 as a neighbor. The reason for this is that both #509 and #16 have few if any edgels in the top left corner. It also should not be completely unexpected, because #509 has the steepest viewing angle and there are no keyframes beyond it. This means that the third nearest frame, #283, has more than a 45° difference in viewing angle, which is more than the difference between any other set of neighbors. This difference is reflected in the fact that the only frame with #509 as a neighbor is #447.

5.4 Tracking

The models were then verified using the presented tracking system. The posterior pose estimate obtained with the tracker was filtered using an EKF with a constant velocity model. The EKF also provided the prior pose estimate. The included video and Fig. 8 show the printer being tracked using the newly created model. Another video shows the model created for the control panel in use. The edgels are rendered onto the video sequence as x's. The keyframe transitions can be observed by watching for a discontinuous change in the set of edgels drawn. An anecdotal investigation revealed that the system transitions from the current frame to a frame from its set of neighbors the majority of the time. The random frame was only chosen after partial or full occlusion occurred. On rare occasions it has also been noticed that the system can become stuck on a keyframe that views the plane from an extreme angle, such as frame #509 in Fig. 6. Such frames contain a relatively small number of edgels, increasing the likelihood that they can be successfully matched at other poses. In time the frame will fail to match the object's appearance and the system will transition to a more appropriate keyframe, but using the wrong keyframe can lead to instability and failure. The problem could be corrected by forcing the system to evaluate a frame's neighbors when it contains fewer edgels than all of its neighbors. The video also shows the system's ability to detect occlusion and failure reliably. When a plane becomes occluded the system no longer renders the edgels used for tracking. After an occlusion the system is able to reacquire the object most of the time, but there is no robust initialization scheme in place to allow quick recovery from a large range of poses. The addition of a method for automatic initialization, like those presented in [20] and [14], would enhance the system's robustness. Obtaining the information required by these systems during the keyframe extraction phase of model creation would be trivial.

Figure 8: Two frames from a video where the object is being tracked using the newly created model.

Figure 9: This annotation was used to extract a model of the control panel, which was then used to track it.

6 Conclusion

In this paper we presented a powerful tool for extracting "crude" 3D models from a sequence of images using minimal user interaction. The paper also presented a framework for incorporating the newly created models into a contour tracking system based on existing techniques known to be robust with complete 3D models. The presented methods are verified by modeling and tracking the objects in Section 5. The goal of this work was to develop a system for producing models sufficient for contour tracking from a sequence of images and a minimal amount of user input, which, as the results show, was accomplished.

The system could be further improved by not relying on point features to reconstruct the plane. While this works well for most objects, there is a small subset of objects that contain edge features but are devoid of point features. Currently our system would require point features to be added to the object before it could be modeled. This limitation could be addressed by reconstructing edge features using the techniques in [4], making it possible to recover the plane equation even in the absence of point features.

Acknowledgements

The authors wish to thank ABB for their generous support and sponsorship, without which this work would not have been possible. Thanks also to Ethan Eade for supplying the SLAM software used in this work.

References

[1] G. Bleser, H. Wuest, and D. Stricker. Online camera pose estimation in partially known and dynamic scenes. In IEEE Proc. of the Intl. Symp. on Mixed and Augmented Reality, Santa Barbara, CA, 2006. IEEE.
[2] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6), Nov. 1986.
[3] T. Drummond and R. Cipolla. Real-time visual tracking of complex structures. IEEE Trans. Pattern Anal. Mach. Intell., 24(7), July 2002.
[4] E. Eade and T. Drummond. Edge landmarks in monocular SLAM. In British Machine Vision Conf., volume 1, pages 7–16, Edinburgh, September 2006.
[5] E. Eade and T. Drummond. Scalable monocular SLAM. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 469–476, Washington, DC, USA, 2006. IEEE Computer Society.
[6] B. Fauvet, P. Bouthemy, P. Gros, and F. Spindler. A geometrical key-frame selection method exploiting dominant motion estimation in video. In Int. Conf. on Image and Video Retrieval, volume 3115 of Lecture Notes in Computer Science, pages 419–427, Dublin, Eire, July 2004.
[7] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. In Proceedings of the Image Understanding Workshop, pages 71–88, 1980.
[8] J. Folkesson, P. Jensfelt, and H. Christensen. Vision SLAM in the measurement subspace. In IEEE Intl. Conf. on Robotics and Automation, pages 30–35, Apr. 2005.
[9] A. Gee and W. Mayol-Cuevas. Real-time model-based SLAM using line segments. In Intl. Symp. on Visual Computing, November 2006.
[10] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN 0521540518, second edition, 2004.
[11] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, 1997.
[12] G. Klein and D. Murray. Full-3D edge tracking with a particle filter. In British Machine Vision Conf., Edinburgh, September 2006. BMVA.
[13] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping. In Intl. Joint Conf. on Artificial Intelligence, Acapulco, Mexico, August 2003.
[14] G. Reitmayr and T. Drummond. Going out: Robust model-based tracking for outdoor augmented reality. In IEEE Proc. of the Intl. Symp. on Mixed and Augmented Reality, Santa Barbara, CA, 2006. IEEE.
[15] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Intl. Conf. on Computer Vision, pages 1508–1515, Washington, DC, USA, 2005. IEEE.
[16] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. of the European Conf. on Computer Vision, volume 1, pages 430–443, May 2006.
[17] T. Thormählen, H. Broszio, and A. Weissenfeld. Keyframe selection for camera motion and structure estimation from multiple views. In Proc. of the European Conf. on Computer Vision, volume 127, pages 523–535, May 2004.
[18] L. Vacchetti, V. Lepetit, and P. Fua. Combining edge and texture information for real-time accurate 3D camera tracking. In IEEE Proc. of the Intl. Symp. on Mixed and Augmented Reality, pages 48–57, Washington, DC, USA, 2004. IEEE.
[19] L. Vacchetti, V. Lepetit, M. Ponder, G. Papagiannakis, D. Thalmann, N. Magnenat-Thalmann, and P. Fua. Stable real-time AR framework for training and planning in industrial environments. In S. Ong and A. Nee, editors, Virtual and Augmented Reality Applications in Manufacturing, pages 129–146. Springer-Verlag, 2004.
[20] M. Özuysal, V. Lepetit, F. Fleuret, and P. Fua. Feature harvesting for tracking-by-detection. In Proc. of the European Conf. on Computer Vision, 2006.