Appearance-based robot navigation

Luca Regini, Guido Tascini, Primo Zingaretti
Istituto di Informatica, University of Ancona
{regini, tascini, zinga}@inform.unian.it

Abstract. The paper presents a vision system for robot navigation based on an appearance-based image-matching process and a stochastic position evaluator. Its main peculiarities are the use of colour sets, weighted walkthroughs and POMDPs. The choice of the representation of image appearances is fundamental. We use image-domain features, as opposed to interpreted characteristics of the scene, and we adopt feature vectors that include both the chromatic attributes of colour sets and their mutual spatial relationships. The colour sets are obtained by auto-thresholding the colour histograms, and weighted walkthroughs are used to capture the spatial relationships between colour sets. The information provided by the appearance-based image matching is used to update a POMDP describing the belief distribution over the actual position of the robot in the environment.

1 Introduction

This paper describes an appearance-based robot navigation system that does not require modifications to the environment. Autonomous robot navigation has to face three major problems: self-localisation, obstacle avoidance and path planning. In this paper we are mainly concerned with the self-localisation problem, or positioning, which consists in determining the position and orientation (often referred to as the pose) of a mobile robot in its environment. There are numerous approaches to self-localisation, based on different sensor modalities; in this paper we only consider self-localisation by means of vision. Owing to the tremendous amount of information that visual sensing provides about the robot's environment, extracting visual features for positioning is not an easy task. A preliminary task is therefore the selection of an appropriate set of visual features (often referred to as landmarks) to be used during the navigational trials [6, 20]. Another difficulty of a self-localisation process is solving the matching between the observations provided by the robot's sensors and the landmarks themselves [7, 19]. This problem could be considerably simplified if a prediction of the robot's pose were given a priori; without such a prediction, absolute matching is quite difficult because the observations are not error-free. Only recently have appearance-based, or view-based, approaches been proposed [1, 5, 8, 14]. An appearance-based approach provides qualitative measurements of the position of the robot, thus monitoring the progress of the overall mission. Once certain relevant positions are attained, other navigation behaviours are launched. This strategy is further justified by the fact that positioning accuracy depends on the distance and angle between the robot and the landmark.

Landmark navigation is rather inaccurate when the robot is far away from the landmark, and a higher degree of accuracy is obtained only when the robot is near a landmark [2]. In an appearance-based approach, robot localisation is performed without using explicit object models. Instead, robot positioning is based on the knowledge of reference snapshots taken from a large number of positions and orientations in the environment (possibly associated with some kind of coordinates, geo-referenced or symbolic). During the navigation process the system compares the appearance of the current view with those of the reference images. The appearance of a scene varies according to the position and the orientation from which the images are taken, and an appearance-based approach uses this property to estimate the robot's pose. The choice of the representation of the appearances is fundamental for the matching process (i.e., the calculation of the similarity between two images). Pourraz and Crowley [9] compressed a large set of images using the Karhunen-Loève transform and used the first few dimensions of the new representation space to capture the significant variations in scene appearance. Sim and Dudek [12] employed a Principal Components reconstruction of a feature vector that encodes both appearance and image geometry. Our approach also uses image-domain features, as opposed to interpreted characteristics of the scene. We describe image appearance in terms of feature vectors that include both the chromatic attributes of colour sets and their mutual spatial relationships, captured by weighted walkthroughs [15]. In previous papers we described in detail the image-segmentation algorithm adopted to obtain the colour sets [18] and the image-matching technique developed ad hoc for mobile robot localisation [17]. That prototypical system was tested in both indoor and outdoor real environments. The analysis of the results revealed two distinct causes of failure: the absence of reference images near the starting position and a too rough planning strategy. Consequently, a position evaluator has been integrated with the appearance-based matching process to overcome these failures or, in any case, to make the self-localisation process more robust. In particular, the information provided by the appearance-based image matching is used to update a partially observable Markov decision process (POMDP) [13] describing the belief distribution over the actual position of the robot in the environment. While the three main modules of the system proposed here for appearance-based self-localisation are described separately elsewhere [17, 18], in this paper we describe the system as a whole in more detail.

2 The navigation system

This section describes the working of the navigation system, which consists of two phases: the off-line learning, or exploration, phase and the on-line localisation phase. An outline of the system is depicted in Fig. 1. During the learning phase a visual tour of the environment is performed and images (i.e., snapshots of the environment) are gathered together with their coordinates. This information is first used to initialise a POMDP and then fed to a clustering algorithm, which groups all similar snapshots into the same class. The set constituted by the most representative image (seed) of each class is taken as the reference image database for that environment. An example of a visual tour and of the corresponding seeds extracted by the clustering algorithm is shown in Fig. 2, where each square of the grid superimposed over the map represents a state of the POMDP that has been initialised.
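As an illustration of this off-line step, the following Python sketch groups tour snapshots into classes and keeps one seed per class. The greedy, threshold-based grouping and the names (cluster_snapshots, similarity) are assumptions made for the example only; the clustering algorithm actually used is not detailed here.

```python
# Hypothetical sketch of the off-line phase: grouping tour snapshots into
# classes by appearance similarity and keeping one seed per class.
import numpy as np

def cluster_snapshots(features, similarity, threshold=0.8):
    """Greedily assign each feature vector to the first class whose seed
    it resembles; otherwise open a new class. Returns (labels, seeds)."""
    labels, seeds = [], []          # seeds holds one representative index per class
    for idx, f in enumerate(features):
        best_class, best_sim = None, threshold
        for c, seed_idx in enumerate(seeds):
            s = similarity(f, features[seed_idx])
            if s > best_sim:
                best_class, best_sim = c, s
        if best_class is None:      # no existing class is similar enough
            seeds.append(idx)
            labels.append(len(seeds) - 1)
        else:
            labels.append(best_class)
    return np.array(labels), seeds

# Toy usage with cosine similarity as a stand-in for the appearance metric mu.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((20, 8))
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    labels, seeds = cluster_snapshots(feats, cos, threshold=0.95)
    print(labels, seeds)
```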

Figure 1: Outline of the appearance-based navigation system.

Figure 2: The left-hand figure shows a visual tour consisting of 445 snapshots taken at 388 different positions (at some positions snapshots with different orientations were taken). Each point denotes the place where a snapshot of the environment was taken, and neighbouring points with the same colour belong to the same class resulting from the clustering algorithm. In the right-hand figure the 31 points that correspond to the seeds of each class, and that constitute the reference image database for this environment, are re-drawn to emphasise their view directions.

At the beginning of the localisation phase the robot does not know its position. The robot first compares the appearance of the image acquired from the unknown position with those of the reference images, which are stored in its memory as feature vectors. The system then selects from the reference database the most similar image and starts moving in order to increase the similarity between the current snapshot and an image (possibly different from the one first selected) from the reference database. Finally, when the similarity between the snapshot acquired in the last position and a reference image is above a threshold, this processing phase stops and the POMDP process starts.

The result of its working leads either to the iteration of the first processing phase (in case of temporary failure) or to the successful localisation of the robot. The above description of the two phases shows the fundamental importance of the feature extraction module in both the off-line and the on-line phases. The feature extraction module, represented by the shaded area in Fig. 1, receives as input the snapshots taken from the environment and outputs a feature vector for each image.

2.1 Image segmentation by histogram thresholding

We perform image segmentation by histogram thresholding because it fits appearance-based approaches well: in particular, histogram thresholding is computationally simpler than other existing algorithms and does not need a priori information about the image. A first drawback of histogram thresholding is the lack of spatial knowledge: the back-projection from the histogram colour space to the image may spread each colour interval over several separate regions. However, our approach to robot self-localisation does not require connected regions as results of the segmentation, and this greatly simplifies colour segmentation. The number of bins in a typical multimodal histogram ranges from a few tens to a few hundred; a second drawback of histogram auto-thresholding is therefore the high dimensionality of the resulting feature vectors. Our segmentation technique, fully described in [18], also overcomes this drawback by reducing the modes of a histogram through two iterative operations: histogram smoothing and constrained merging of neighbouring histogram modes. In this manner the segmentation process results in the extraction of a few dominant colours, typically fewer than 10, which are usually sufficient to characterise the colour information in an image. This last property is very important because small errors in the selection of segmentation thresholds do not significantly affect the subsequent image-matching process, provided that no dominant colour is excluded. As the colour sets resulting from the segmentation process usually have jagged contours, a sequence of morphological operations, above all closing operations, is then performed. In some cases, this processing can greatly reduce the complexity of the subsequent computation of spatial relationships between colour sets without affecting the meaning of the results.
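The following Python sketch conveys the spirit of this segmentation step on a 1-D intensity histogram (the paper works on colour histograms); the smoothing and merging rules shown are simplified stand-ins for those of [18], not a faithful reimplementation.

```python
# Minimal sketch: smooth a histogram, place thresholds at local minima and
# merge weak modes until only a few dominant intervals remain.
import numpy as np

def smooth(hist, passes=3):
    """Iteratively smooth the histogram with a 3-bin moving average."""
    h = hist.astype(float)
    kernel = np.array([1.0, 1.0, 1.0]) / 3.0
    for _ in range(passes):
        h = np.convolve(h, kernel, mode="same")
    return h

def auto_thresholds(hist, max_modes=10, min_mass=0.02):
    """Thresholds at local minima; modes with too little pixel mass are merged
    into a neighbour until at most max_modes dominant colours remain."""
    h = smooth(hist)
    minima = [i for i in range(1, len(h) - 1) if h[i] <= h[i - 1] and h[i] < h[i + 1]]
    bounds = [0] + minima + [len(h)]
    total = h.sum()
    while len(bounds) - 1 > 1:
        masses = [h[bounds[k]:bounds[k + 1]].sum() / total for k in range(len(bounds) - 1)]
        weakest = int(np.argmin(masses))
        if masses[weakest] >= min_mass and len(bounds) - 1 <= max_modes:
            break
        # drop the boundary shared with the neighbouring mode, merging the two
        drop = weakest if weakest > 0 else 1
        del bounds[drop]
    return bounds  # bin indices delimiting the dominant colour intervals

def segment(image, bounds):
    """Back-project: label each pixel with the index of its colour interval."""
    return np.digitize(image, bounds[1:-1])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    img = np.concatenate([rng.normal(60, 5, 500), rng.normal(180, 8, 500)]).clip(0, 255)
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    b = auto_thresholds(hist)
    print("thresholds:", b, "labels:", np.unique(segment(img, b)))
```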
2.2 The appearance-based metric

Typically, an appearance-based approach does not use semantically relevant information (i.e., interpreted characteristics of the scene or of the objects in the scene), which would imply solving hard and time-consuming processes such as robust feature extraction and recognition. Instead, image-domain features that are easy to detect should be used. In addition, in an appearance-based approach the system must be able to compare global similarity measures of the images (i.e., the main task is to find images that appear visually similar as a whole). The framework adopted here for computing the global similarity between images derives from the research area of content-based retrieval of images from an image database [10, 11]. Directional relationships between colour sets are included in the feature vector by adopting the representation referred to as weighted walkthroughs [15].

Weighted walkthroughs are a spatial modelling technique suitable for a representation that captures the spatial arrangement of the colour sets of an image without taking into account their distribution over separate non-connected regions. Given a Cartesian reference system, the projections of two pixels a = ⟨xa, ya⟩ and b = ⟨xb, yb⟩ on each reference axis can take three different orderings: a before, coincident with, or after b.

The combination of the two projections results in nine different bi-dimensional displacements, or primitive directions, called walkthroughs, each encoded by an index pair ⟨i, j⟩ with:

i = \begin{cases} -1 & \text{if } x_b < x_a \\ 0 & \text{if } x_b = x_a \\ +1 & \text{if } x_b > x_a \end{cases} \qquad j = \begin{cases} -1 & \text{if } y_b < y_a \\ 0 & \text{if } y_b = y_a \\ +1 & \text{if } y_b > y_a \end{cases}

Given two sets of pixels A and B, a pair ⟨i, j⟩ is a walkthrough from A to B if ⟨i, j⟩ encodes the displacement between at least one pair of pixels belonging to A and B, respectively (see Fig. 3a). As multiple walkthroughs may connect different pairs of pixels in A and B, a weight ωi,j(A, B), which measures the number of pairs of pixels belonging to A and B whose displacement is captured by the direction ⟨i, j⟩, is associated with each walkthrough (see Fig. 3b-c). Weighted walkthroughs have very important properties: they are reflexive, invariant with respect to shifting and scaling of the image and, above all, compositional, which makes it possible to reduce the complexity of their computation to a linear combination of terms representing the weighted walkthroughs between the elements of any partition, in particular into rectangular entities, of the original sets. Different arrangements of a pair of pixel sets A and B can then be compared through a distance ds(w, w'), which composes the differences between homologous weights in the two weighted walkthroughs w(A, B) and w'(A, B) corresponding to the two arrangements:

d_s(w, w') = \sum_{i=-1}^{1} \sum_{j=-1}^{1} \left| w_{i,j} - w'_{i,j} \right|
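A minimal Python sketch of this computation follows; the weights are taken here as the fraction of pixel pairs falling in each primitive direction, an assumption consistent with, but not necessarily identical to, the definition in [15].

```python
# Sketch: weighted walkthroughs between two pixel sets and the distance d_s.
import numpy as np

def weighted_walkthroughs(A, B):
    """A, B: arrays of (x, y) pixel coordinates. Returns a 3x3 array w where
    w[i+1, j+1] is the weight of the walkthrough <i, j> from A to B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    w = np.zeros((3, 3))
    for ax, ay in A:
        i = np.sign(B[:, 0] - ax).astype(int)   # -1, 0, +1 along x
        j = np.sign(B[:, 1] - ay).astype(int)   # -1, 0, +1 along y
        np.add.at(w, (i + 1, j + 1), 1.0)
    return w / (len(A) * len(B))                # weights sum to 1

def d_s(w, w_prime):
    """Distance between two arrangements: sum of absolute differences
    between homologous weights."""
    return float(np.abs(w - w_prime).sum())

if __name__ == "__main__":
    A = [(0, 0), (0, 1), (1, 0)]
    B = [(3, 0), (3, 2)]
    w = weighted_walkthroughs(A, B)
    print(w, d_s(w, np.zeros((3, 3))))
```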

Owing to the integral nature of the weights ωi,j, ds(w, w') satisfies a property of continuity, which: i) ensures that slight changes in the mutual positioning or in the distribution of pixels in two sets A and B result in slight changes in the weighted walkthroughs between A and B; and ii) provides a quantitative basis for smoothing the trade-off between accuracy and computational complexity when pixel sets are approximated by a multi-rectangular shape. In addition, coefficients may be introduced in the definition of ds(w, w'), as in [15], to change the relative weight of alignment, overlapping, and horizontal and vertical displacements, thus impacting on the mutual relevance of directional and topological relationships. Our appearance-based localisation system describes image appearances using a feature vector that includes both the chromatic attributes of colour sets and their mutual spatial relationships. We have defined a metric by which the similarity µ between two feature vectors is given by the product of two components, the attribute similarity µa and the spatial similarity µs:

\mu = \alpha_a \mu_a \cdot \alpha_s \mu_s

where the coefficients αa and αs balance the mutual relevance of spatial and attribute similarity.

w(A, B) = \begin{bmatrix} w_{-1,1} & w_{0,1} & w_{1,1} \\ w_{-1,0} & w_{0,0} & w_{1,0} \\ w_{-1,-1} & w_{0,-1} & w_{1,-1} \end{bmatrix} \qquad w(A, B) = \begin{bmatrix} 0 & 0 & 0.679 \\ 0 & 0 & 0.323 \\ 0 & 0 & 0.321 \end{bmatrix}

Figure 3: (a) Pixel sets A and B are connected by only three distinct walkthroughs, ⟨1, 1⟩, ⟨1, 0⟩ and ⟨1, −1⟩; (b) the nine weights w_{i,j} arranged in a 3×3 array; (c) the weights associated with the walkthroughs connecting A to B.

For a detailed description of the computation of µa and µs refer to [17]. The matching process starts by finding the correspondences between the colour sets of the two images being compared. If a colour set, in either of the two images, remains without a correspondence, then a penalty pi, proportional to its dimension, is set. The attribute and spatial similarities µa and µs are then given by the products (multiplicative composition) of the similarity (= 1 − distance) measures among the nc corresponding colour sets ih and ik, respectively belonging to the two images h and k being compared, minus the sum of the nu penalties pi of the unmatched colour sets:

\mu_a = \prod_{i=1}^{n_c} \mu_{i_h, i_k} - \sum_{i=1}^{n_u} p_i ; \qquad \mu_s = \prod_{i=1}^{n_c} \left( 1 - d_s(i_h, i_k) \right) - \sum_{i=1}^{n_u} p_i
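The following Python sketch shows how µa, µs and the overall similarity µ can be composed once the colour-set correspondences, their pairwise similarities/distances and the penalties are available; the data layout and names are illustrative only.

```python
# Sketch of the matching step: products of similarities minus penalties.
import numpy as np

def attribute_similarity(matched_mu, penalties):
    """mu_a: product of attribute similarities minus the sum of penalties."""
    return float(np.prod(matched_mu) - np.sum(penalties))

def spatial_similarity(matched_ds, penalties):
    """mu_s: product of (1 - d_s) over matched pairs minus the penalties."""
    return float(np.prod(1.0 - np.asarray(matched_ds)) - np.sum(penalties))

def overall_similarity(mu_a, mu_s, alpha_a=1.0, alpha_s=1.0):
    """mu = alpha_a * mu_a * alpha_s * mu_s (coefficients balance the two terms)."""
    return alpha_a * mu_a * alpha_s * mu_s

if __name__ == "__main__":
    mu_i = [0.9, 0.8, 0.95]        # attribute similarities of matched colour sets
    ds_i = [0.1, 0.2, 0.05]        # weighted-walkthrough distances of the same pairs
    pens = [0.05]                  # one unmatched colour set
    mu_a = attribute_similarity(mu_i, pens)
    mu_s = spatial_similarity(ds_i, pens)
    print(mu_a, mu_s, overall_similarity(mu_a, mu_s))
```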

Finally, the matching process is completed by finding in the reference database the image (feature vector) that maximises the similarity µ with respect to the feature vector of the current snapshot image.

2.3 The heuristic-stochastic position evaluator

The causes of failure of the localisation process described in previous papers [17, 18] are related to the absence of any notion of state. POMDPs [13] have been found to be a reliable and computationally feasible tool for solving localisation problems while accounting for uncertainties and sources of error. A POMDP is formally defined as a 5-tuple {S, A, Pa, O, Po}, where S is a finite set of states, A is a finite set of actions, O is a finite set of observations (sensory readings), Pa(s' → s | a) denotes the probability that action a, when executed at s', carries the robot to s, and Po(o, s) denotes the likelihood of perceiving o at s. For each state s ∈ S only a subset A(s) ⊆ A of all actions is feasible. A POMDP introduces a probability distribution Bt(s), usually called the belief distribution, over the set of states S. The belief Bt(s) is a measure that expresses the robot's confidence of being in s. A POMDP first initialises Bt(s) and then uses Pa and Po to update its belief every time an action is taken or a sensor is read, according to the following formulas, respectively:

B_t(s) = K \cdot \sum_{s' \in S \,|\, a \in A(s')} P_a(s' \to s \mid a) \cdot B_{t-1}(s') \qquad (1)

B_t(s) = K \cdot B_t(s) \cdot P_o(o, s) \qquad (2)

where K is a normalisation factor that ensures that Bt(s) sums to 1.
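A compact Python sketch of the two update rules follows; the transition and observation models are assumed to be given as arrays, and the indexing of states and actions is illustrative.

```python
# Sketch of the action update (1) and the sensor update (2).
import numpy as np

def predict(belief, P_a):
    """Eq. (1): action update. P_a[s_prev, s] = P(s_prev -> s | a) for the
    executed action a. The division plays the role of the factor K."""
    b = belief @ P_a                      # sum over s' of P_a(s'->s|a) * B(s')
    return b / b.sum()

def correct(belief, P_o_given_s):
    """Eq. (2): sensor update. P_o_given_s[s] = P_o(o, s) for the observation o."""
    b = belief * P_o_given_s
    return b / b.sum()

if __name__ == "__main__":
    n = 4
    belief = np.full(n, 1.0 / n)          # uniform initial belief
    P_a = np.array([[0.1, 0.9, 0.0, 0.0],
                    [0.0, 0.1, 0.9, 0.0],
                    [0.0, 0.0, 0.1, 0.9],
                    [0.9, 0.0, 0.0, 0.1]])
    P_o = np.array([0.05, 0.05, 0.8, 0.1])
    belief = predict(belief, P_a)
    belief = correct(belief, P_o)
    print(belief)
```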

It should be noted that, while an action update takes into account the history of the robot's moves, a sensory reading modifies the belief of the current state independently of previous perceptions. Pa can be obtained from a model of the robot kinematics and from information contained in the topological map about the reachability of a place [3]. Here we are concerned with finding a suitable way to update the belief after a sensor reading (i.e., a substitute for eq. 2) when information comes from a vision device and observations can be assessed with an appearance-based algorithm. An appearance-based method models data directly in the visual domain, instead of building a geometric model from a set of observations. The environment is tessellated into square cells (50 cm × 50 cm) and, for each cell, we consider eight possible orientations, each corresponding to a different view of the environment.

A state s is then defined as a triple {x, y, D}, where x and y are the coordinates of the cell and D is the view direction; actions are movements that change the robot's orientation or cell position. During the visual tour a map-building process encodes adjacency relationships between the places in the environment visited by the robot. The adopted data structure is a many-to-many list in which a node has pointers to the following and previous nodes along the path direction [16]. Krose et al. [5] computed Po as a sum of kernel functions around a database of 8000 camera snapshots taken every 10 cm in the environment during the training phase. In practice, the availability of all this information may be unrealistic, because it requires a very time-consuming off-line phase. Gechter et al. [4] replaced Po with a coefficient proportional to the inverse of the distance, furnished by a PCA (Principal Components Analysis) module, between the image perceived and the image associated with the current state of the vehicle. Our approach also replaces Po with the inverse of a distance, but using a different appearance-based metric, which provides a very reliable measure of similarity between images, thus drastically reducing the number of reference snapshots needed. We require the system to maintain a database of observations Odb(S), where ok = Odb(sk) is the feature vector observed from state sk; ok takes an "empty" value when no observation is available. The database is first initialised with data from the visual tour and is then updated with data coming from successful localisations (see below). Keeping track of observations in Odb(S) allows us to update the belief distribution according to the following formula:

B_t(s) = K \cdot B_t(s) \cdot \frac{1}{1 - \mu(o, o_r)}

where µ(o, or) is the similarity between the feature vectors of the current appearance o and of the reference appearance or = Odb(sr). Obviously, Bt(s) is updated only if o ≠ ∅. Within this framework, localisation can be obtained by following a procedure that minimises the entropy H of the belief distribution [3]:

H = -\sum_{s \in S} B_t(s) \log B_t(s)
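The following Python sketch combines the vision-driven correction with the entropy criterion; Odb and µ are placeholders for the observation database and the appearance metric of Section 2.2, and the cosine similarity used in the toy example is only a stand-in.

```python
# Sketch: similarity-based belief correction and belief entropy.
import numpy as np

def vision_correct(belief, observation, O_db, mu, eps=1e-6):
    """Scale B_t(s) by 1 / (1 - mu(o, o_r)) for every state with a stored
    reference appearance, then renormalise (the factor K in the text)."""
    b = belief.copy()
    for s, o_ref in enumerate(O_db):
        if o_ref is not None:                       # update only observed states
            b[s] *= 1.0 / (1.0 - mu(observation, o_ref) + eps)
    return b / b.sum()

def entropy(belief, eps=1e-12):
    """H = -sum_s B_t(s) log B_t(s); low entropy means a concentrated belief."""
    return float(-np.sum(belief * np.log(belief + eps)))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    feats = [rng.random(8) for _ in range(5)]
    O_db = [feats[0], None, feats[2], None, feats[4]]   # states 1 and 3 unobserved
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    b = np.full(5, 0.2)
    b = vision_correct(b, feats[2], O_db, cos)
    print(b, entropy(b))
```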

However, because the update of the belief distribution occurs only for states for which there is at least one previous observation, it is very difficult to localise correctly when the robot starts from a position that is far away from these states. We therefore adopt a two-step localisation process. During the first phase we iteratively run the strategy employed in our previous works until a pose with a high similarity, with respect to one of the reference images in the database, is reached. Odometry information is used to avoid running into short loops. When this phase is concluded we can assume, with a good degree of confidence, to be in the surroundings of the visual tour, so that we can start the POMDP localisation. This second phase helps to discriminate regions that possess states with a similar appearance. The localisation process stops when a suitable minimum of the entropy has been found, meaning that the belief distribution is strongly clustered around a few states, possibly a single one. In the latter case, we use the information collected during the last ten actions to update both the image reference database and the observation database Odb(S). In conclusion, the use of a stochastic position evaluation based on a POMDP, where the observation probability is conditioned by the results of an image-processing module, makes it possible to avoid (or, at least, to reduce the effects of) perceptual aliasing, i.e., the situation in which two appearances at two different poses (differing either spatially or in view direction) have similar observation probabilities. In fact, the system can discriminate among ambiguous poses by exploiting the history of its movements.
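The two-step procedure can be summarised by the following Python skeleton, in which all sensing, motion and similarity routines are trivial stubs standing in for the modules described above and the thresholds are purely illustrative.

```python
# Skeleton of the two-step localisation loop with stubbed robot routines.
import numpy as np

def localise(acquire, references, similarity, move_towards, pomdp_step,
             sim_threshold=0.9, entropy_threshold=0.5, max_steps=50):
    # Phase 1: appearance-only servoing towards a well-matched reference view.
    for _ in range(max_steps):
        snapshot = acquire()
        scores = [similarity(snapshot, r) for r in references]
        best = int(np.argmax(scores))
        if scores[best] >= sim_threshold:
            break                            # close enough to the visual tour
        move_towards(references[best])
    # Phase 2: POMDP updates until the belief entropy is low enough.
    for _ in range(max_steps):
        belief, H = pomdp_step(acquire())
        if H <= entropy_threshold:
            return belief                    # localisation succeeded
    return None                              # temporary failure: restart phase 1

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    refs = [rng.random(4) for _ in range(3)]
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    acquire = lambda: refs[1] + rng.normal(0, 0.01, 4)     # stub camera
    move = lambda ref: None                                # stub motion command
    pomdp = lambda o: (np.array([0.05, 0.9, 0.05]), 0.4)   # stub belief/entropy
    print(localise(acquire, refs, cos, move, pomdp))
```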

References

[1] C. Andersen, S.D. Jones, J.L. Crowley, Appearance based processes for visual navigation, Symposium on Intelligent Robotics Systems (SIRS), 1997.
[2] J. Borenstein, H.R. Everett, L. Feng, Navigating Mobile Robots: Sensors and Techniques, A. K. Peters, Ltd., Wellesley, 1996.
[3] W. Burgard, D. Fox, S. Thrun, Active mobile robot localization by entropy minimization, EUROBOT'97, Brescia, IEEE Computer Society, 155-162, 1997.
[4] F. Gechter, V. Thomas, F. Charpillet, Robot localization by stochastic vision based device, 5th Conf. on Systemics, Cybernetics and Informatics (SCI), July 22-25, Orlando, Florida, USA, 2001.
[5] B.J.A. Krose, N. Vlassis, R. Bunschoten, Y. Motomura, A probabilistic model for appearance-based robot localization, Image and Vision Computing, 19(6), 381-391, 2001.
[6] S. Livatino, C.B. Madsen, Autonomous robot navigation with automatic learning of visual landmarks, Symposium on Intelligent Robotics Systems (SIRS), 1999.
[7] E. Mouaddib, B. Marhic, Geometrical matching for mobile robot localisation, IEEE Trans. on Robotics and Automation, 16(5), 542-552, 2000.
[8] K. Ohba, Y. Sato, K. Ikeuchi, Appearance-based visual learning and object recognition with illumination invariance, Machine Vision and Applications, 12(4), 189-196, 2000.
[9] F. Pourraz, J.L. Crowley, Continuity properties of the appearance manifold for mobile robot position estimation, Symposium on Intelligent Robotics Systems (SIRS), 251-260, 1998.
[10] S. Ravela, R. Manmatha, Retrieving images by similarity of visual appearance, IEEE Workshop on Content-Based Access of Image and Video Libraries, 67-74, 1997.
[11] T.K. Shih, C.S. Wang, A.Y. Chang, C.H. Kao, H.R. Tyan, An efficient color-spatial indexing and retrieval scheme for image database, 7th Int. Conf. on Parallel and Distributed Systems, 117-122, 2000.
[12] R. Sim, G. Dudek, Learning visual landmarks for pose estimation, Proc. of Int. Conf. on Robotics and Automation (ICRA), Detroit, 1972-1978, 1999.
[13] R. Simmons, S. Koenig, Probabilistic robot navigation in partially observable environments, Proc. 14th Int. Joint Conf. on Artificial Intelligence (IJCAI), Montreal, Canada, 1995.
[14] I. Ulrich, I. Nourbakhsh, Appearance-based place recognition for topological localization, Proc. of Int. Conf. on Robotics and Automation (ICRA), San Francisco, vol. 2, 1023-1029, 2000.
[15] E. Vicario, W.X. He, Weighted walkthroughs in retrieval by contents of pictorial data, Lecture Notes in Computer Science, vol. 1311: Image Analysis and Processing, 132-139, Springer, 1997.
[16] F. Zanichelli, Topological maps and robust localization for autonomous navigation, IJCAI Workshop on Adaptive Spatial Representations of Dynamic Environments, 1999.
[17] P. Zingaretti, Image matching for appearance-based self-localization, EUROBOT'01, Lund, Lund University Cognitive Studies, vol. 86, 41-48, 2001.
[18] P. Zingaretti, L. Bossoletti, Image segmentation for appearance-based self-localisation, Image Analysis and Processing - ICIAP'01, IEEE Computer Society, 113-118, 2001.
[19] P. Zingaretti, A. Carbonaro, Route following based on adaptive visual landmark matching, Robotics and Autonomous Systems, vol. 25, 177-184, 1998.
[20] P. Zingaretti, A. Carbonaro, Learning to acquire and select useful landmarks for route following, EUROBOT'99, Zurich, IEEE Computer Society, 161-168, 1999.