Robotic 3D Visual Mapping for Augmented Situational Awareness in Unstructured Environments

Jaime Valls Miró and Gamini Dissanayake
ARC Centre of Excellence for Autonomous Systems (CAS)
University of Technology Sydney (UTS), Australia
{j.vallsmiro,g.dissanayake}@cas.edu.au

Abstract

This paper describes our experiences in the application of localization and mapping techniques to aid mobile robots in reaching and exploring terrains which are inaccessible or considered too dangerous for humans. The work is driven, in particular, by the practical constraints of deploying (semi)autonomous platforms in unstructured search and rescue scenarios, where prior knowledge of the environment is unavailable yet required before a rescue operation can be mounted. Smaller, tracked, highly mobile vehicles have been found to be the most capable of navigating such challenging environments in six degrees of freedom. Yet they are not able to carry large payloads, such as the ubiquitous SICK laser range scanner, particularly if long-term autonomy is also expected. In this work we describe a mobile robot equipped with a range of sensor packages chosen to best provide useful and understandable information about the interior of a collapsed building in the form of dense 3D maps, which humans can use to navigate and locate victims. The sensor arrangements employed, and the augmented algorithms adopted to suit the requirements of such challenging field deployments, are also presented. Appearance sensors, capable of providing scene texture information to the rescuer while also providing salient visual features for “place recognition”, feature prominently. In particular, a time-of-flight range imager coupled with a standard high-resolution camera has been used within a delayed-state SLAM formulation in the information form, with encouraging results. Tests with the monocular and stereoscopic rigs which constituted the initial inspiration of this research work are also documented. Data collected from a mock-up urban search and rescue arena which simulates a multi-level disaster area has been used to validate the proposed approaches. Despite their limitations, the resulting 3D maps have been found to be effective navigation tools, both for the rescuer and for the autonomous platform.

Keywords: visual 6 DoF SLAM, 3D texture maps, unstructured environments.

Motivation: 3D mapping for USAR robotics

Urban search and rescue (USAR) is concerned with the emergency response to the collapse of man-made structures. With the advances of technology and the increasing number of large earthquakes (California 1989, Kobe 1995, Izmit 1999), terrorist attacks (New York 2001), landslides (Cauca 1994), hurricanes (Mitch 1998) and other catastrophes, the need to devote resources to dealing with the aftermath of these natural and human-caused disasters is all too apparent. Search and rescue is therefore becoming an important application area for mobile robots, where the robots are employed with the ultimate objective of helping to identify and rescue victims from a hostile environment. In this process, one of the key objectives is to send a robot to explore unstructured terrains which are inaccessible or considered too dangerous for rescue workers, have it navigate in this environment and build a map that could be used by human rescuers to retrieve victims. Robots also have the ability to bring in medical support and supplies to the victims, hence increasing their chances of


Figure 1: (a) The visual 3D sensor package; (b) the mock-up UTS search and rescue arena navigated by a small tracked robot.

survival during the long process of extraction [Murphy et al., 2005]. Search and rescue robots are not merely a tool for the future: they have already been used in real disasters [Casper & Murphy, 2003]. The limitations of traditional 2D feature maps in providing useful and understandable information which humans can then use for navigating and locating victims in these scenarios have led our research towards the generation of rich textural 3D maps instead. The 3D mapping of a search and rescue scenario using an autonomous robot is a challenging problem which can be framed within the generic simultaneous localization and mapping (SLAM) problem [Dissanayake et al., 2001], where a robot is expected to move with six degrees of freedom in a three-dimensional environment. This already demanding picture is further complicated by the lack of odometry information from the wheel encoders, as this tends to be totally unreliable given the nature of the disaster environment (see Figure 1(b) for an illustrative example).

Proposition: a 3D visual sensor package for SLAM

In our recent work we have advocated the use of a compact visual 3D sensor package in conjunction with an appropriate SLAM methodology. The strategy, which has already been successfully tested [Ellekilde et al., 2007], is currently being further enhanced by a new-generation 3D camera and algorithmic improvements to reduce computational cost. The sensor consists of a 3D camera, the CSEM SwissRanger SR-3000 [Oggier et al., 2004], which employs time-of-flight technology to gather 3D data. The camera illuminates the scene with near infra-red radiation, and is able to provide low-resolution (176 × 144 pixels) 3D range images without any additional pan/tilt mechanism. Yet the textured images also provided by these sensors are of poor quality and subject to substantial changes in illumination as the camera pose changes. To overcome this, we have proposed mounting the 3D ranger with a traditional camera in order to fully utilise the strengths of conventional, higher resolution 2D image processing techniques in a 3D scenario. The two cameras are fixed relative to each other as illustrated in Figure 1(a), where a PointGrey Dragonfly-2 camera (insensitive to the infra-red light emitted by the SwissRanger) with 1024 × 768 pixels resolution has been combined with the SR-3000. By augmenting the salient 2D features with range information, the combined observations made by these two cameras are used as the sole input to an Extended Information Filter (EIF) based SLAM algorithm. Our recent proposition [Zhou et al, 2007] is a novel algorithm which makes efficient use of the 3D features by giving the robot the 'intelligence' to select, out of the steadily collected data, the maximally informative observations to be used in the estimation process. We have shown that, although the actual

evaluation of information gain for each frame introduces an additional computational cost, the overall efficiency is significantly increased by keeping the information matrix compact. The noticeable advantage of this strategy is that the continuously gathered data is not heuristically segmented prior to being input to the filter. On the contrary, the scheme lends itself to being statistically optimal.
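As an illustration of how a salient 2D feature can be augmented with range information, the following sketch back-projects the SwissRanger range image into a point cloud and attaches a 3D position to each keypoint detected in the high-resolution image. It is only a minimal outline: the pinhole models, the nearest-projected-point association rule, and all function and parameter names (sr3000_to_points, attach_range_to_features, T_cam_ranger, etc.) are illustrative assumptions, not the calibration or association procedure actually used in our system.

```python
import numpy as np

def sr3000_to_points(range_img, fx, fy, cx, cy):
    """Back-project every ranger pixel to a 3D point in the ranger frame.
    range_img is assumed to hold metric depth along the optical axis."""
    h, w = range_img.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = range_img
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def attach_range_to_features(keypoints_uv, points_ranger, T_cam_ranger, K_cam):
    """Give each 2D keypoint from the high-resolution camera a 3D position by
    projecting the ranger point cloud into the camera and taking, for every
    keypoint, the nearest projected point (a hypothetical association rule).
    Assumes all points have positive depth in the camera frame."""
    R, t = T_cam_ranger[:3, :3], T_cam_ranger[:3, 3]
    pts_cam = points_ranger @ R.T + t            # ranger frame -> camera frame
    uv = pts_cam @ K_cam.T                       # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    features_3d = []
    for kp in keypoints_uv:                      # one (u, v) per SIFT keypoint
        i = np.argmin(np.sum((uv - kp) ** 2, axis=1))
        features_3d.append(pts_cam[i])
    return np.asarray(features_3d)
```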

Related work

The construction of environment models as in traditional SLAM has primarily focused on 2D. Initially addressed mostly using sonars, the problem is now most commonly tackled with laser scanners, which have become the most popular 2D sensors. Wide availability of lower-cost, lower-power, lighter-weight passive cameras, as well as the maturity of computer vision algorithms, has made real-time vision processing much more practical in recent times, and consequently there has been an increasing interest in visually based navigation systems in the robotics community. Vision SLAM in particular has seen many advances in recent years [Davison & Murray, 2002], [Se et al., 2002], [Newman & Ho, 2005]. Cameras are interesting as they provide a wealth of geometric information from an unmodified scene, as well as perceptual information such as textures and colours, which can be matched by few other sensors. In particular, approaches to efficiently represent salient regions of an image such as the Scale Invariant Feature Transform (SIFT) [Lowe, 2004] or, more recently, Speeded Up Robust Features (SURF) [Bay et al, 2006] are of great help in providing reliable data association, an essential part of SLAM. As monocular systems cannot directly retrieve depth information from the scene, traditional Bayesian techniques to solve the SLAM problem such as the Extended Kalman Filter (EKF) [Dissanayake et al., 2001] cannot be readily used [Bailey, 2003]. Special landmark initialization techniques have been proposed in the literature to overcome this, thus enabling a full Gaussian estimate of the landmark state and the application of the traditional EKF technique [Newman et al, 2002], [Kwok & Dissanayake, 2004]. These approaches usually assume that robust data association is available and also that relatively accurate odometry information can be employed to produce a baseline. A key work by Davison [Davison, 2003], in which odometry was not available, assumed some restricting dynamics (linear and angular robot velocities) in a scene with a maximum depth of around 5 meters, a high sampling rate and known scene information in the form of a template. The proposition of a mixture of Gaussians by Solà [Solà i Ortega et al., 2005] removed some of these restrictions, and in some very recent work [Civera et al, 2007] a completely dimensionless monocular SLAM algorithm using an inverse depth parameterisation has been formulated which can cope with features at any depth and without the need for an initial pattern. Another interesting SLAM algorithm using a single camera is vSLAM [Karlsson et al., 2005], where the environment is loosely represented as a topological map. Despite these improvements in single-camera SLAM, the low density of features in the resulting maps poses a serious limitation to the application of the approach, particularly in unstructured areas such as USAR. In recent years, more and more attention has turned towards the construction of dense 3D maps and towards enabling the robot to move in full 3D itself, as is mandatory in USAR operations. Laser scanners are one of the most popular sensors employed in 3D SLAM research because of the accurate range information they provide. The most common approach is to use a 2D laser to generate 3D scans [Cole & Newman, 2006]. For example, the AIS 3D laser range finder is built on the basis of a 2D range finder extended with a mount and a servomotor [Nüchter et al., 2004], [Lingemann et al., 2005]. Mapping is done by means of a fast pose tracking algorithm and 3D scan matching.
The proposed algorithms have been evaluated at the RoboCup Rescue (www.rescuesystem.org/robocuprescue/) 2004 competition in Lisbon [Nüchter et al., 2005]. However, the use of a range sensor only, such as a laser scanner, may exhibit a number of important limitations [Karlsson et al., 2005]: (i) it is difficult to extract point features, (ii) the large number of features makes efficient map maintenance a very hard task, and (iii) data association is not very reliable, especially when the loop is closed. Further to this, from a strictly logistical point of view, they are slow, bulky and tend to be heavy for a small rescue robot. Using a stereo camera rig for 3D mapping is advantageous in that it makes the system fully observable: the sensor provides enough information (range and bearing) to compute the full three-dimensional state of the observed landmarks. Some approaches rely on 2D projections of the features, such as the


vertical edges corresponding to corners and door frames as proposed in [Castellanos et al, 1999], which are subsequently tracked using an EKF. This type of visual feature representation is not sufficiently distinctive, and requires elaborate data association solutions to reject spurious matches. [Se et al., 2002], [Se et al., 2005] have made use of a Triclops system where three images are obtained at each frame. SIFT features are detected, matched and used in an EKF. The approach is however not globally consistent, as the cross-correlations were not fully maintained for computational reasons. [Sim et al., 2005] proposed a Rao-Blackwellised particle filter instead. A motion model based purely on visual odometry was used, effectively generalising the problem to unconstrained 3D motion. Feature management and computational complexity, which grows exponentially with the number of particles, is likely to make this strategy infeasible in a large environment. Other approaches rely on iterative minimization algorithms from vision techniques such as ICP to perform local 3D alignments to solve the SLAM problem [Saez et al., 2005], which produce good results when restricted to fairly structured environments and limited egomotion dynamics. Similar limitations are reported in [Dailey & Parnichkun, 2005], where 3D line segments become landmarks suitable to be used in a particle filter implementation in which each particle also carries with it an EKF. Although these are promising approaches to solve the SLAM problem using visual cues, camera calibration and stereo correlation remain notoriously fickle. More relevant to USAR, determining correspondences between pixels in stereo pairs is a slow process that depends strongly on surfaces exhibiting sufficient texture, as well as on ambient lighting conditions [Se & Jasiobedzki, 2006]. Very recently, some research has been published in the direction of fusing visual and range sensor data to obtain dense 3D maps. For example, [Newman et al., 2006], [Cole & Newman, 2006] make use of a tilting 2D laser scanner and a camera for SLAM in outdoor environments; the camera is mainly used in the detection of loop closure. [Ohno & Tadokoro, 2005] use a 3D scanner to generate an initial rough 3D map, and then use the dense texture images to recover the final 3D shape. Our efforts have also veered towards sequential 3D dense map reconstruction within a visual SLAM framework by combining, in a compact visual sensor package, a traditional intensity camera and one of the recently developed low-resolution 3D range imagers, the CSEM SR-3000. The sensor, which was shown by [Weingarten et al., 2004] to be usable for some simple local navigation, is enhanced here with a traditional passive imager to perform SLAM. The key idea is to use the traditional camera image for textural 3D reconstruction and to help in the data association, and a selection of the combined 3D features from the aggregated camera and range scans to update the filter.

Past experiences with visual SLAM in USAR

Our work was initially motivated by the RoboCup Rescue competition, where a robot is supposed to navigate a USAR scenario, find victims and come out with a good geometric map of the area in which to locate them. It could perhaps be argued whether an accurate geometric map is really required for robot navigation in this scenario, but those are the rules of the competition, which naturally lend the problem to be solved within the SLAM framework.

2D visual SLAM

We set ourselves the target of evaluating whether a robot with a small payload could autonomously achieve these tasks. Our small in-house built robot 'HOMER', depicted in Figure 2(a), was entered into the easiest level of the competition, in which motion is constrained to 2D. An example of this environment can be seen in Figure 2(b). The robot was not able to carry a large SICK laser scanner, so a MEGA-DCS stereo camera from Videre Design, with 90 deg field-of-view lenses, was used as the navigation sensor. Autonomous mapping using EKF-based SLAM with SIFT visual features was implemented on the platform. Our SLAM approach was limited to a 'flattened' 2D environment for computational advantage. Here, features were initialised by matching them across the stereo pair and using the bearings obtained to compute range (a sketch of this initialisation is given at the end of this subsection). Afterwards, separate bearing and stereo disparity observation models were used to update them, depending on their perceived accuracy. We found that mapping was successful when a map pruning strategy was used to manage computational costs, although the reliability was poor when the robot was turning quickly, as the odometry was unreliable and the time between images could be up to six seconds in some runs. The average speed was in the range of 0.3 m/sec. Resulting

Figure 2: (a) HOMER robot with the stereo rig; (b) HOMER during the RoboCup Rescue competition; (c) stereo SLAM output in the office scenario; (d) stereo SLAM output in the UTS rescue arena. The green line is the SLAM output, the red '*' are the detected features, and the blue dots are walls generated by superimposing laser scans using the SLAM output poses as ground truth, indicating that the map quality is good.

maps are shown from runs in a mock-up search and rescue arena (Figure 2(d)) as well as among the structured workstations in our lab (Figure 2(c)). SIFT worked well as a 'place recognition' engine and was able to facilitate loop closure quite successfully and to approximately locate the robot with respect to the features. Tests using a Pioneer robot, where a laser rangefinder was used to obtain 'ground truth', showed that the location estimates were reasonable but that the geometric accuracy and the statistical consistency of the maps deteriorated as the areas grew larger, directing the work towards the use of local submaps as proposed by other researchers [Estrada et al., 2005], [Huang et al., 2006]. The reader is referred to [Valls Miró et al, 2005], [Valls Miró et al, 2006] for more details on this work.
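The stereo initialisation mentioned above can be illustrated with a minimal triangulation sketch under a rectified-geometry assumption; the function name and the numeric parameters below are placeholders rather than the calibrated values of the MEGA-DCS rig.

```python
import numpy as np

def init_feature_from_stereo(uL, uR, v, fx, fy, cx, cy, baseline):
    """Triangulate a landmark from a matched feature in a rectified stereo pair.
    (uL, v) and (uR, v) are the pixel coordinates of the same SIFT feature in
    the left and right images; baseline is the camera separation in metres."""
    disparity = uL - uR                      # positive for points in front of the rig
    if disparity <= 0.0:
        raise ValueError("non-positive disparity: feature cannot be initialised")
    z = fx * baseline / disparity            # range along the optical axis
    x = (uL - cx) * z / fx                   # lateral offset in the left camera frame
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Illustrative numbers only (not the calibrated values of our rig):
landmark = init_feature_from_stereo(uL=412.0, uR=396.0, v=250.0,
                                    fx=380.0, fy=380.0, cx=320.0, cy=240.0,
                                    baseline=0.09)
```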

Range and intensity fusion for dense 3D modeling

Following on from these experiences, a working relationship was established with the Rescue/USAR Section of the NSW Fire Brigades, where the need for richer, more informative maps became distinctly evident if they were ever to be used by rescuers. With the lack of texture rendering pure stereo processing unviable, it was decided to take advantage of novel time-of-flight sensors, coupled with a camera for robust data association. By exploiting the SIFT technique on the textured images, two-dimensional feature locations are obtained which can then be identified in the range images. The full 3D information can in turn be employed to calculate the relative registration between consecutive camera poses by a least-squares fitting method [Arun et al., 1987] and the application of an outlier removal scheme such as RANSAC [Fischler & Bolles, 1981]. With the newly obtained

Figure 3: (a) Two textured scans to be merged, with correspondences from SIFT; (b) frontal view of the images merged to form a single 3D model; (c) the 3D visual sensor package. The example shows two neighbouring range images augmented with textures from the conventional camera, together with the (3D) feature locations obtained from the SIFT matching algorithm, and the final result of merging the two images, which illustrates the high quality and human-readable characteristics of the final 3D composition. The sensor package consists of a SwissRanger SR-2 mounted with a standard 640 × 480 pixels resolution UniBrain Fire-i camera.

transformation, the information from the textured range images can then be combined to form a single model of the environment. A depiction of this process can be seen in Figure 3. The reader is referred to [Ellekilde et al., 2006] for more details on this work.
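The pose registration step can be sketched as follows, assuming the matched SIFT features have already been lifted to 3D as two equal-length point arrays P and Q. It is a minimal illustration of the closed-form least-squares fit of [Arun et al., 1987] wrapped in a simple RANSAC loop; the thresholds, iteration counts and function names are placeholders, not those of our implementation.

```python
import numpy as np

def arun_fit(P, Q):
    """Closed-form least-squares rigid transform (R, t) with Q ~ R @ P + t,
    following Arun et al. (1987): SVD of the cross-covariance of the centred sets."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cQ - R @ cP
    return R, t

def ransac_registration(P, Q, iters=200, inlier_thresh=0.05, rng=None):
    """Estimate the relative camera pose from 3D-3D SIFT correspondences,
    rejecting outliers with RANSAC and refitting on the inlier set."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)    # minimal sample
        R, t = arun_fit(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)
        inliers = err < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return arun_fit(P[best_inliers], Q[best_inliers])      # refined estimate
```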

The Extended Information Filter for SLAM

The 3D construction algorithm proved to work well in combining neighbouring images locally, and it was only natural to extend the approach to enable 3D mapping of larger environments. However, as the absolute error was expected to increase as more and more images were merged together, the methodology was extended to a fully-fledged SLAM algorithm. The goal was to enable consistent updates and the correction of feature locations as these were re-observed, as well as the performance of global corrections when loop closure was identified. In our work an EIF [Thrun et al., 2005] has been proposed to estimate the state vector. A number of reasons can be cited for employing an EIF instead of an EKF: (i) as a sequence of camera poses, instead of only the last one, is included in the state vector, the EIF formulae are simplified; (ii) the resulting information matrix is sparse, and the sparseness can be exploited to reduce the computational cost; (iii) the entire sequence of camera poses can be maintained in the state vector; (iv) in general, there are computational advantages in using an EIF rather than an EKF. All the features as well as the history of camera poses are maintained in the state vector of the EIF

contemplated here. New camera poses are initialised with respect to the best matching frame at a known pose, and measurement updates are additive in the information form. As the sensor package is assumed to operate in full 6 DoF without odometry, the formulation of the filter becomes simpler and results in a naturally sparse information matrix: there is no process motion prediction step to correlate the current state with previous states. Expressed formally, in our full 6 DoF SLAM the state vector X contains a set of 3D features and a set of camera poses, where a camera pose is represented as

    (x_C, y_C, z_C, \alpha_C, \beta_C, \gamma_C)                                    (1)

in which \alpha_C, \beta_C and \gamma_C represent the ZYX Euler angle rotation, and the corresponding rotation matrix is referred to as RPY(\alpha_C, \beta_C, \gamma_C). A 3D point feature in the state vector is represented by

    (x_F, y_F, z_F)                                                                 (2)

expressed in the global coordinate frame.

Let i represent the information vector and I the associated information matrix. The relationship between the estimated state vector \hat{X}, the corresponding covariance matrix P, the information vector i, and the information matrix I is described by

    \hat{X} = I^{-1} i, \qquad P = I^{-1}                                           (3)

The first camera pose is chosen as the origin of the global coordinate system. At time t = 0, the state vector X contains only the initial camera pose [0, 0, 0, 0, 0, 0]^T, and the corresponding 6 × 6 diagonal information matrix I is filled with large diagonal values, representing the camera starting at a known position. An estimate of the position of a new feature is provided by the observation model as

    \begin{bmatrix} \hat{x}_F \\ \hat{y}_F \\ \hat{z}_F \end{bmatrix} =
    \begin{bmatrix} \hat{x}_C \\ \hat{y}_C \\ \hat{z}_C \end{bmatrix} +
    RPY(\hat{\alpha}_C, \hat{\beta}_C, \hat{\gamma}_C)^{-1}
    \begin{bmatrix} x_L \\ y_L \\ z_L \end{bmatrix}                                 (4)

where x_L, y_L and z_L are the feature location expressed in the local reference frame. In the update step, the information vector and information matrix updates can be described by

    I(k+1) = I(k) + \nabla H_{k+1}^{T} Q_{k+1}^{-1} \nabla H_{k+1}
    i(k+1) = i(k) + \nabla H_{k+1}^{T} Q_{k+1}^{-1} \left[ z(k+1) - H_{k+1}(\hat{X}(k)) + \nabla H_{k+1} \hat{X}(k) \right]        (5)

where Q_{k+1} is the covariance matrix of the observation noise w_{k+1} and z(k+1) is the observation vector. The corresponding state vector estimate \hat{X}(k+1) can be computed by solving the linear equation

    I(k+1) \hat{X}(k+1) = i(k+1)                                                    (6)

Once the accurate camera poses (in the global frame) at which the images were taken are made available by the filter, the local textured range images acquired at each camera pose can be combined to produce a 3D point cloud map. An example of the quality of the maps achieved with this strategy can be seen in Figure 4, where it is also apparent how simple registration results in an unacceptable accumulated error (Figure 4(a)). The reader is referred to [Ellekilde et al., 2007] for more details on this work.
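A minimal numerical sketch of the update and recovery steps of equations (5) and (6) is given below. It assumes the observation Jacobian, the predicted observation and the observation noise covariance are supplied by the caller, and it uses dense matrices throughout; the real filter exploits the sparsity of the information matrix, so the helper and variable names here are purely illustrative.

```python
import numpy as np

def eif_update(I_mat, i_vec, X_hat, z, h_of_x, H_jac, Q):
    """One additive EIF measurement update, equations (5) and (6):
    I(k+1) = I(k) + H^T Q^-1 H
    i(k+1) = i(k) + H^T Q^-1 [ z - h(X_hat) + H X_hat ]
    followed by recovery of X_hat(k+1) from I(k+1) X_hat(k+1) = i(k+1)."""
    Q_inv = np.linalg.inv(Q)
    I_new = I_mat + H_jac.T @ Q_inv @ H_jac
    i_new = i_vec + H_jac.T @ Q_inv @ (z - h_of_x + H_jac @ X_hat)
    X_new = np.linalg.solve(I_new, i_new)   # a sparse solver would be used in practice
    return I_new, i_new, X_new

# Initialising the filter as described in the text: the first camera pose is the
# origin of the global frame and is given large diagonal information values.
X_hat = np.zeros(6)                         # [x, y, z, alpha, beta, gamma]
I_mat = 1e6 * np.eye(6)
i_vec = I_mat @ X_hat                       # consistent with eq. (3): i = I X_hat
```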

The way forward: an information-efficient visual EIF on a ruggedised platform

The technique employed so far uses a conventional EIF approach which recovers the robot and feature poses at the end of each 'acquire, update' cycle. The compulsory inversion of the information matrix

Figure 4: Final 3D maps and textured close-ups of the mock-up UTS USAR arena. (a) Map obtained by simple registration of the images in sequence, without any filtering; (b) final 3D map – detail of a corner with two paper boxes; (c) final 3D map with the robot trajectory and the walls (measured by hand); (d) picture of the corner with the two paper boxes.

invoked during each estimation cycle comes at a significant computational cost. Although the sensor package is capable of delivering data at a minimum frame rate of 10 Hz, which theoretically should guarantee appropriate frame registration, the increasing sampling delay between consecutive frames caused by this extensive computation might yield inadequate data association, and therefore unsuccessful frame registration, as environments grow larger. Such a problem is particularly magnified in unstructured environments, where the robot's view changes rapidly and unpredictably along its undulating path. Many efforts [Thrun et al, 2004], [Eustice et al., 2005], [Cole & Newman, 2006] have been made in recent years to reduce the computational burden generally faced by delayed-state SLAM algorithms, and the reader is referred to [Zhou et al, 2007] for a closer review. However, one noticeable common problem of these strategies is that loop closure cannot be automatically detected: separate loop closure methods are required in conjunction with the proposed techniques. Moreover, these propositions require either human supervision over the acquired data or raw odometry measurements in order to minimise the number of critical robot poses that should be maintained. Yet none of these resources are normally


Figure 5: (a) Detail of the 3D map of the UTS lab obtained with the proposed information-efficient EIF SLAM; (b) a standard picture of the same area, acquired by the conventional camera of the sensor package and used in the filter.

readily available in settings such as the search and rescue scenario that motivates this work. We have engaged the computational argument from an alternative perspective: instead of gathering the minimal information and trying to deal with it in the filter in the most efficient way from a purely mathematical standpoint, we seek an information-driven solution in which the sensor package is allowed the freedom to collect data at its own rate, while the robot is given the 'intelligence' to choose those critical observations that should be incorporated into the estimation process. To that end, we extend the method first introduced in [Ellekilde et al., 2007] and propose an improved filtering technique whereby, given a desired estimation error bound, only the frame giving maximal information gain out of a buffer of sampled overlapping frames (determined by the desired coarseness of the final 3D map) is added to the filter. As the information matrix has an inverse relationship with the covariance matrix, as described by (3), the maximally informative frame must also update the information matrix so as to contribute the largest determinant. Based on that premise, we use the natural logarithm of the determinant of the information matrix, denoted log(det(I(k + 1))), as the measurable quantity of this information update. Hence, with the new observations made at each new robot pose, the information gain can be obtained without recovering the state vector, which is the major computational expense in EIF SLAM. The camera pose at which the observations provide maximal information gain is added to the filter (a minimal sketch of this selection loop is given at the end of this section). By doing so, the filter incorporates a reduced number of robot poses, optimally distributed along the robot trajectory based on its uncertainty belief. Using this technique we can afford to maintain, in an efficient manner, both the minimal set of robot poses with respect to the pre-set error bound and their associated feature poses in the state vector, making the case for producing more accurate maps when compared to strategies where only the robot poses are kept. Moreover, with proper data association, the scheme lends itself to automatic detection of loop closure, be that by visual matching or Bayesian gating, or both. To validate the proposed strategy, we first tested our algorithm in a simulation environment. Performance results from a simple experimental setup, in which observations are made at 200 camera poses equally distributed along a circular trajectory of 3 loops and 3 meters radius, are summarised in Table 1. 500 features have been randomly populated in the 10×6×10 meters environment. In the 'Consecutive' case, when all 200 frames are used to update the filter – the case equivalent to our traditional EIF approach where all frames are considered, as described in the previous section – the simulation completes in 428 seconds and 96 features are maintained. If we accept some information loss and the filter is allowed to look ahead in a buffer of 3 frames ('Look ahead 3'), only 109 frames and 95 features are kept in the filter and the simulation completes in 185 seconds. When the filter is set to look ahead for longer (4 steps, 'Look ahead 4'), only 60 frames are maintained, with 90 features. A more visual example of the validity of the approach is illustrated in Figure 5, where a detail of the


Table 1: Performance comparison between the traditional and the information-efficient EIF

Measurement         Consecutive   Look ahead 3   Look ahead 4
Time (sec)          428.96        185.37         118.55
No. frames          200           109            60
No. features        96            95             90
Max. matrix size    1488          939            630

Figure 6: The newly acquired UTS USAR robot: iRobot’s PackBot Explorer

3D map obtained in a typical laboratory can be seen alongside one of the snapshots from the passive camera used to build the map. While it is difficult to evaluate the goodness of the result empirically, as in the previous case the size of the environment and the visual cues are in agreement with the resulting map. The reader may refer to [Zhou et al, 2007] for further details and more results. Currently, the sensor package is being fitted to a recently acquired, highly robust (military-rated) tracked mobile robot, iRobot's PackBot Explorer – shown in Figure 6 in the UTS USAR arena – in order to evaluate the real-time characteristics of this approach to the solution of visual SLAM in USAR environments.
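The frame-selection loop outlined above can be sketched as follows. The sketch assumes a fixed state dimension and that a helper returns the linearised update terms for each candidate frame; the function names, buffer handling and the dense log-determinant computation are illustrative assumptions, not the implementation evaluated in Table 1. Its point is that only the winning frame triggers the expensive state recovery of equation (6).

```python
import numpy as np

def log_det(I_mat):
    """log(det(I)) of the information matrix: the quantity used to measure the
    information gain of a candidate frame without recovering the state vector."""
    sign, logdet = np.linalg.slogdet(I_mat)
    return logdet     # information matrices are positive definite, so sign is +1

def update_with_best_frame(I_mat, i_vec, X_hat, candidate_frames, build_update):
    """Out of a buffer of overlapping candidate frames, apply only the one whose
    measurement update yields the largest gain in log(det(I(k+1))).
    build_update(frame, X_hat) is assumed to return (z, h_of_x, H_jac, Q)."""
    base = log_det(I_mat)
    best_gain, best = -np.inf, None
    for frame in candidate_frames:
        z, h_of_x, H_jac, Q = build_update(frame, X_hat)
        I_try = I_mat + H_jac.T @ np.linalg.inv(Q) @ H_jac   # matrix part of eq. (5)
        gain = log_det(I_try) - base
        if gain > best_gain:
            best_gain, best = gain, (z, h_of_x, H_jac, Q)
    # Only the most informative frame triggers the full update and state recovery.
    z, h_of_x, H_jac, Q = best
    Q_inv = np.linalg.inv(Q)
    I_new = I_mat + H_jac.T @ Q_inv @ H_jac
    i_new = i_vec + H_jac.T @ Q_inv @ (z - h_of_x + H_jac @ X_hat)
    return I_new, i_new, np.linalg.solve(I_new, i_new)       # eq. (6)
```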

Conclusions and future directions

A review of the authors' experiences with the use of a range of appearance sensors for SLAM in a USAR scenario has been presented. From the initial steps in 2D visual SLAM, a novel sensor package based on a range imager and a passive camera has emerged as a truly viable alternative. We have presented registration and filtering approaches for dynamically incorporating observations into the estimation process for 3D robot navigation in unstructured terrain. Our concluding results have shown that, by gauging the information gain of each frame, we can automatically incorporate the most informative observations for the purpose of SLAM and thus assimilate a comprehensive collection of information about the environment we intend to explore. For real tasks in search and rescue scenarios, extracting the minimum information is only the first step towards efficiently understanding the spatial structure of the unknown environment. Our current work is moving to extend our study from how to extract the optimal information to how to optimally extract such information. As proposed in [Huang et al, 2005] and [Martinez-Cantin et al, 2007], active trajectory planning and exploration seem the natural companions to our proposed strategy in order to best comprehend the environment surrounding the robot, in minimum time and within bounded estimation errors.


Acknowledgements

This work is supported by the Australian Research Council (ARC) through its Centre of Excellence programme, and by the New South Wales State Government. The ARC Centre of Excellence for Autonomous Systems (CAS) is a partnership between the University of Technology Sydney, the University of Sydney and the University of New South Wales. The authors would also like to acknowledge Miss Weizhen Zhou and Mr Lars-Peter Ellekilde for their participation in some of the work described herein.

References

[Fischler & Bolles, 1981] Fischler, M. & Bolles, R. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. Assoc. Comp. Mach., 24:381–395.
[Arun et al., 1987] Arun, K. S., Huang, T. S. & Blostein, S. D. (1987). Least-squares fitting of two 3-D point sets. IEEE PAMI, 9(5):698–700.
[Castellanos et al, 1999] Castellanos, A. J., Montiel, J. M. M., Neira, J. & Tardós, J. D. (1999). Sensor influence in the performance of simultaneous mobile robot localization and map building. In 6th Int. Symp. on Experimental Robotics, Sydney, Australia, pp. 203–212.
[Dissanayake et al., 2001] Dissanayake, G., Newman, P., Clark, S., Durrant-Whyte, H. & Csorba, M. (2001). A solution to the simultaneous localization and map building (SLAM) problem. IEEE Trans. on Robotics and Automation, 17:229–241.
[Davison & Murray, 2002] Davison, A. J. & Murray, D. (2002). Simultaneous localization and map-building using active vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(7):865–880.
[Se et al., 2002] Se, S., Lowe, D. G. & Little, J. (2002). Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. Int. J. of Robotics Research, 21(8):735–758.
[Newman et al, 2002] Newman, P., Leonard, J., Rikoski, R. & Bosse, M. (2002). Mapping partially observable features from multiple uncertain vantage points. Int. J. of Robotics Research, 21(10–11):943–976.
[Casper & Murphy, 2003] Casper, J. & Murphy, R. R. (2003). Human-robot interactions during the robot-assisted urban search and rescue response at the World Trade Center. IEEE Transactions on Systems, Man and Cybernetics Part B, 33(3):367–385.
[Davison, 2003] Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the 9th IEEE International Conference on Computer Vision, pp. 1403–1410.
[Bailey, 2003] Bailey, T. (2003). Constrained initialization for bearing-only SLAM. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, Taipei, Taiwan, pp. 1966–1971.
[Weingarten et al., 2004] Weingarten, J. W., Gruener, G. & Siegwart, R. (2004). A state-of-the-art 3D sensor for robot navigation. In Proceedings of the IEEE Int. Conf. on Intelligent Robots and Systems, Sendai, Japan, pp. 2155–2160.
[Thrun et al, 2004] Thrun, S., Liu, Y., Koller, D., Ng, A. Y., Ghahramani, Z. & Durrant-Whyte, H. (2004). Simultaneous localization and mapping with sparse extended information filters. Int. Journal of Robotics Research, 23:693–716.
[Kwok & Dissanayake, 2004] Kwok, N. M. & Dissanayake, G. (2004). An efficient multiple hypothesis filter for bearing-only SLAM. In Proceedings of the IEEE Int. Conf. on Intelligent Robots and Systems, Alberta, Canada, pp. 736–741.


[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision, 60(2):91–110.
[Nüchter et al., 2004] Nüchter, A., Surmann, H., Lingemann, K., Hertzberg, J. & Thrun, S. (2004). 6D SLAM with an application in autonomous mine mapping. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, pp. 1998–2003.
[Oggier et al., 2004] Oggier, T., Lehmann, M., Kaufmann, R., Schweizer, M., Richter, M., Metzler, P., Lang, G., Lustenberger, F. & Blanc, N. (2004). An all-solid-state optical range camera for 3D real-time imaging with sub-centimeter depth resolution (SwissRanger). Proceedings of the SPIE, (5249):534–545.
[Nüchter et al., 2005] Nüchter, A., Lingemann, K. & Hertzberg, J. (2005). Mapping of rescue environments with Kurt3D. In Proceedings of the IEEE International Workshop on Safety, Security and Rescue Robotics, pp. 158–163.
[Ohno & Tadokoro, 2005] Ohno, K. & Tadokoro, S. (2005). Dense 3D map building based on LRF data and color image fusion. In Proceedings of the IEEE Int. Conf. on Intelligent Robots and Systems, pp. 1774–1779.
[Estrada et al., 2005] Estrada, C., Neira, J. & Tardós, J. D. (2005). Hierarchical SLAM: real-time accurate mapping of large environments. IEEE Transactions on Robotics, 21(4):588–596.
[Lingemann et al., 2005] Lingemann, K., Nüchter, A., Hertzberg, J. & Surmann, H. (2005). High-speed laser localization for mobile robots. Robotics and Autonomous Systems, 51:275–296.
[Murphy et al., 2005] Murphy, R. R., Riddle, D. & Rasmussen, E. (2005). Robot-assisted medical reachback: a survey of how medical personnel expect to interact with rescue robots. In Proceedings of the 14th International Workshop on Robot and Human Interactive Communication.
[Eustice et al., 2005] Eustice, R. M., Singh, H. & Leonard, J. J. (2005). Exactly sparse delayed-state filters. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2428–2435.
[Saez et al., 2005] Saez, J. M. & Escolano, F. (2005). Entropy minimization SLAM using stereo vision. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, pp. 36–43.
[Sim et al., 2005] Sim, R., Elinas, P., Griffin, M. & Little, J. J. (2005). Vision-based SLAM using the Rao-Blackwellised particle filter. In Int. Joint Conf. on Artificial Intelligence Workshop on Reasoning with Uncertainty in Robotics, Edinburgh, Scotland.
[Solà i Ortega et al., 2005] Solà i Ortega, J., Lemaire, T., Devy, M., Lacroix, S. & Monin, A. (2005). Delayed vs undelayed landmark initialization for bearing-only SLAM. In Proceedings of the IEEE Int. Conf. on Robotics and Automation Workshop on SLAM, Barcelona, Spain.
[Thrun et al., 2005] Thrun, S., Burgard, W. & Fox, D. (2005). Probabilistic Robotics. The MIT Press.
[Dailey & Parnichkun, 2005] Dailey, M. N. & Parnichkun, M. (2005). Landmark-based simultaneous localization and mapping with stereo vision. In Proceedings of the IEEE Asian Conf. on Industrial Automation and Robotics, Bangkok, Thailand, pp. 108–113.
[Karlsson et al., 2005] Karlsson, N., Di Bernardo, E., Ostrowski, J., Goncalves, L., Pirjanian, P. & Munich, M. E. (2005). The vSLAM algorithm for robust localization and mapping. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, pp. 24–29.
[Newman & Ho, 2005] Newman, P. & Ho, K. (2005). SLAM – loop closing with visually salient features. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, pp. 644–651.
[Huang et al, 2005] Huang, S., Kwok, N. M., Dissanayake, G., Ha, Q. P. & Fang, G. (2005). Multi-step look-ahead trajectory planning in SLAM: possibility and necessity. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, pp. 1091–1096.

[Se et al., 2005] Se, S., Lowe, D. G. & Little, J. (2005). Vision-based global localization and mapping for mobile robots. IEEE Trans. on Robotics, 21(3):364–375.
[Valls Miró et al, 2005] Valls Miró, J., Dissanayake, G. & Zhou, W. (2005). Vision-based SLAM using natural features in indoor environments. In Proceedings of the IEEE Int. Conf. on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, Australia, pp. 151–156.
[Huang et al., 2006] Huang, S., Wang, Z. & Dissanayake, G. (2006). Mapping large scale environments using relative position information among landmarks. In Proceedings of the IEEE International Conference on Robotics and Automation, Orlando, Florida, USA, pp. 2297–2302.
[Newman et al., 2006] Newman, P., Cole, D. & Ho, K. (2006). Outdoor SLAM using visual appearance and laser ranging. In Proceedings of the IEEE International Conference on Robotics and Automation, Orlando, Florida, USA, pp. 1180–1187.
[Cole & Newman, 2006] Cole, D. M. & Newman, P. (2006). Using laser range data for 3D SLAM in outdoor environments. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1556–1563.
[Bay et al, 2006] Bay, H., Tuytelaars, T. & Van Gool, L. (2006). SURF: speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision, Graz, Austria.
[Ellekilde et al., 2006] Ellekilde, L.-P., Valls Miró, J. & Dissanayake, G. (2006). Fusing range and intensity images for generating dense models of three-dimensional environments. In Proceedings of the IEE Int. Conf. on Man-Machine Systems, Langkawi, Malaysia.
[Se & Jasiobedzki, 2006] Se, S. & Jasiobedzki, P. (2006). Photo-realistic 3D model reconstruction. In Proceedings of the IEEE Int. Conf. on Robotics and Automation Workshop on SLAM, Orlando, USA, pp. 3076–3082.
[Valls Miró et al, 2006] Valls Miró, J., Zhou, W. & Dissanayake, G. (2006). Towards vision based navigation in large indoor environments. In Proceedings of the IEEE Int. Conf. on Intelligent Robots and Systems, Beijing, China, pp. 2096–2102.
[Civera et al, 2007] Civera, J., Davison, A. J. & Montiel, J. M. M. (2007). Dimensionless monocular SLAM. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis.
[Ellekilde et al., 2007] Ellekilde, L.-P., Huang, S., Valls Miró, J. & Dissanayake, G. (2007). Dense 3D map construction for indoor search and rescue. Journal of Field Robotics, 24(1/2):71–89.
[Martinez-Cantin et al, 2007] Martinez-Cantin, R., De Freitas, N., Doucet, A. & Castellanos, J. A. (2007). Active policy learning for robot planning and exploration under uncertainty. In Proceedings of Robotics: Science and Systems, Atlanta, USA.
[Zhou et al, 2007] Zhou, W., Valls Miró, J. & Dissanayake, G. (2007). Information efficient 3D visual SLAM in unstructured domains. In Proceedings of the IEEE Int. Conf. on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, Australia, pp. 323–328.
