Expert Systems with Applications 42 (2015) 179–192


An interactive method for the image alignment problem based on partially supervised correspondence

Xavier Cortés, Francesc Serratosa

Universitat Rovira i Virgili, Departament d'Enginyeria Informàtica i Matemàtiques, Spain

Article info

Article history: Available online 7 August 2014

Keywords: Homography estimation; Image alignment; Interactive method; Partially supervised learning; Salient points correspondence

Abstract

In a robotic scenario in which robots are localised through the images they obtain, it is usual that current image registration techniques fail due to large differences between images, occlusions, shadows and so on. In these cases, a human can interact with the computer vision system to help it find a preliminary mapping between some salient points. We present a method to solve the alignment problem between two images in which a human (or oracle) imposes a partial correspondence between salient points extracted from the images through a Human–Machine Interface, and the Interactive Correspondence Method computes a new image alignment and a set of point correspondences. Practical evaluation shows that within two human interactions our method finds an almost optimal alignment and correspondence set between images that no current completely automatic method is able to obtain. Moreover, we demonstrate that in situations in which there is a large difference between the images, adding interactivity is the only approach that achieves an acceptable result.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, interaction between robots and humans has increased rapidly. Applications of this field are very diverse, ranging from automatic site exploration (Trevai et al., 2003) to using robot formations to transport and evacuate people in emergency situations (Casper & Murphy, 2003). Within the area of social and cooperative robots, interactions between a group of people and a set of accompanying robots have become a primary point of interest (Garrell & Sanfeliu, 2012). One of the low-level tasks that these systems have to face is the automatic recognition of the scenes and objects the robot visualises. This problem is usually called image registration in the computer vision research field. Image registration tries to determine which parts of one image correspond to which parts of another image. This problem often arises at the early stages of many computer vision applications such as scene reconstruction, object recognition and tracking, pose recovery and image retrieval. Therefore, it has been of basic importance to develop effective methods that are robust both in the sense of being able to deal with noisy measurements and in the sense of having a wide field of application. Two

This research is supported by the CICYT project DPI2013-42458-P.

Corresponding author: F. Serratosa.
E-mail addresses: [email protected] (X. Cortés), [email protected] (F. Serratosa).

http://dx.doi.org/10.1016/j.eswa.2014.07.051
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

interesting image registration surveys are Zitová and Flusser (2003) and Salvi, Matabosch, Fofi, and Forest (2007). We present an interactive method by which a human can help the object or scene recognition modules incorporated on the robots when these modules consider that they are not able to solve the registration problem automatically. This type of interaction is completely different from the ones presented in Trevai et al. (2003), Casper and Murphy (2003) and Garrell and Sanfeliu (2012), since in those cases the human interaction is performed at a higher level. For instance, in Trevai et al. (2003) and Casper and Murphy (2003), the interaction is based on imposing orders such as "move straight ahead" or "go up stairs". In Garrell and Sanfeliu (2012), the orders are "follow this person" or "bring me to the exit". Nevertheless, our experience has shown us that in most cases robots cannot perform these orders because they cannot solve the low-level registration problem. Therefore, interacting in this low-level task, which is really easy for humans, frequently makes it unnecessary to interact on higher-level tasks, in which the interaction is more complicated due to the need for more knowledge of the current situation, such as the position of other robots, an automatically built map of the environment or the current position of other objects. We have not found any paper whose aim is to interact on the registration problem integrated in a robotics framework. For this reason, we first explain the general robotics framework and then we explain the interactive registration method.


The three typical steps involved in the solution of the image registration problem are the following (Xu & Petrou, 2011). First, some salient points are selected from both images. Second, a set of tentative matches between these sets of points is computed together with the image alignment. And third, a process of outlier rejection that eliminates the spurious correspondences can further refine these tentative matches and the initial alignment. Salient points, which play the role of the parts of the image to be matched, are image locations that can be robustly detected among different instances of the same scene with varying imaging conditions. These points can be corners (intersections of two edges) (Tomasi & Kanade, 1991), maximum curvature points (Han & Brady, 2005) or isolated points of maximum or minimum local intensity (Rosten & Drummond, 2006). There is an evaluation of the most competent approaches in Mikolajczyk and Schmid (2005). When salient points have been detected, several correspondence methods can be applied that obtain the alignment (or homography) that maps one image into the other (Zhang, 1994), discard outlier points (Fischler & Bolles, 1981) or characterise the image as an attributed graph (Sanromà, Alquézar, & Serratosa, 2012; Sanromà, Alquézar, Serratosa, & Herrera, 2012). The main drawback of image registration methods is that their ability to obtain the correspondence parameters strongly depends on the reliability of the initial tentative correspondences. Moreover, the image alignment parameters and the correspondence parameters need to be jointly estimated. Considering the alignment parameters, there are two basic strategies: the first one is to consider a rigid deformation and the second one is to consider a non-rigid deformation.
In the first case, it is assumed that the whole image (and so the extracted salient points) suffers the same deformation, and so the image alignment parameters are applied equally to all the salient points or image pixels. Some examples are Luo and Hancock (2003), Rangarajan, Chui, and Bookstein (1997), Sanromà, Alquézar, Serratosa, and Herrera (2012) and Gold and Rangarajan (1996). In the second case, each salient point suffers a different projection and there are different alignment parameters applied to each salient point or image region. Some examples are Chui and Rangarajan (2003), Myronenko and Song (2010) and Sanromà et al. (2012). Usually, the rigid strategy is applied to detect objects in outdoor images, in which the deformation is mostly due to the change of the point of view. The non-rigid strategy is mostly applied to object detection or matching in medical or industrial images, because objects are assumed to suffer deformations although the point of view is the same. Humans are very good at finding the correspondences between local parts of an image regardless of the intrinsic or extrinsic characteristics of the point of view. Human interactivity in image registration has been applied to medical images (Khader & Ben Hamza, 2012; Pfluger, Thomas, et al., 2000; Pietrzyk et al., 1994) and two systems have been patented (Berg Jens Von, 2010; Gering, 2010). These papers and patents are very specific to some medical environments and for this reason cannot be applied to our problem. In Pfluger et al. (2000), the authors show a comparison of 3-D images in MRI-SPECT format, focusing on brain images. In Pietrzyk et al. (1994), the authors present a method to validate the 3D position given 3D medical images. Finally, in Khader and Ben Hamza (2012), the aim is to solve the registration problem given similar medical images extracted from different sensors or technologies.
Patent (Berg Jens Von, 2010) defines a system for registering thorax X-ray images that does not depend on bony structures. And patent (Gering, 2010) defines a multi-scale registration for medical images where images are first aligned at a coarse resolution, and subsequently at progressively finer resolutions; user input is applied at the current scale. Current automatic methods to extract parts of images and their correspondences in non-controlled environments are far from

having the performance of a human. Fig. 1 shows two images extracted from the RESID database (http://www.featurespace.org). In each image, 50 salient points have been extracted using the method of Harris and Stephens (1988). The outlier detector (Fischler & Bolles, 1981) has considered 43 salient points to be outliers and only 7 to be inliers. The correspondence detector has missed 6 of the 7 points (red lines) and hit only 1 point (green line). This is because of the large differences between both images and, more precisely, due to the inability of the initial correspondence detector to find a good initial correspondence. For this reason, in this paper we propose a semi-automatic method in which humans can interact with the system when the quality of the automatically found correspondences is considered not good enough, and then impose a partial and initial correspondence between some local parts of two scenes. To do so, the oracle selects through a Human–Machine Interface the points of both images that have to be mapped. The oracle basically helps the pattern recognition system to find the initial correspondence through its interaction. In this action, there is a learning process in which the pattern recognition system learns some node correspondences from the oracle (local knowledge). Note that this kind of learning is different from the one presented in Caetano, McAuley, Cheng, Le, and Smola (2009), in which the pattern recognition system learns some weights (global knowledge) to find the graph node correspondences more accurately. Therefore, our framework is Partially Supervised Learning (PSL), since only a few point correspondences are learned from the oracle. PSL is a research field of machine learning and pattern recognition that builds upon the broad areas of learning, clustering and estimating classifiers from partially (or weakly) labelled data.
PSL studies are particularly expected to involve the identification of scenarios for successful real-world applications. To state just a few examples, significant instances of PSL applications are rooted in image and multimodal information processing, sensor and information fusion, human–machine interaction and data mining. Recently, a special issue about PSL for Pattern Recognition has been published, in which several methods and applications are described (Schwenker, in press-a,b). The Human–Machine Interface (HMI), also named Human–Computer Interface, provides more natural, powerful and compelling interactive experiences. For decades, HMI has been an active research field closely related to new technology advances. The paper published in Matthew (2014) provides an interesting review of the current trends and key aspects of HMI, for instance in non-desktop computing. Some of the latest applications have been natural language understanding (Revuelta-Martinez, Rodriguez, Garcia-Varea, & Montero, 2013), semi-supervised database indexing (Phuong, Visani, Boucher, & Ogier, 2013), image segmentation (de Miranda, Falcão, & Udupa, 2010; Panagiotakis et al., 2013) and handwritten text transcription (Romero, Toselli, & Vidal, 2012), among others. Finally, human interactivity specifically applied to pattern recognition has been considered in Toselli, Vidal, and Casacuberta (2011). Considering the visual presentation of the HMI, three main approaches have been proposed in the literature: (a) side-by-side views (Andrews, Wohlfahrt, & Wurzinger, 2009; Holten & Wijk, 2008), (b) superimposed or merged views (Ahlberg, 1996; Erten, Harding, Kobourov, Wampler, & Yee, 2003) and (c) animations (Diehl & Gorg, 2002; Yee & Fisher, 2001).
These approaches are often complemented with techniques for highlighting the correspondences between nodes of both graphs such as lines that go from one graph to the other (Collins & Carpendale, 2006; Holten & Wijk, 2008) or node colour coding (Elmqvist, Dragicevic, & Fekete, 2008; Kleinbe, 2002). Fig. 2 shows, through the visual presentation of our HMI, the results of our Interactive Correspondence Method (ICM). We have


Fig. 1. An example of automatic image registration where only one point has been properly matched (green line) and 6 points have been improperly matched (red lines). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

chosen the side-by-side view and highlighting based on colour lines. In this case, only two interactions were needed (yellow lines) to achieve 10 correct correspondences (green lines) and only one incorrect correspondence (red line). Note that not only has the number of correct correspondences increased, but the number of inliers has also increased from 7 to 11. We advocate a cooperative model in which robots have cameras and have to identify objects and scenes in a cooperative manner, with the aim of detecting specific objects or performing Simultaneous Localisation And Mapping (SLAM). We assume there is a master system that receives images from the robots and finds the correspondences between the images of these robots to perform higher-level tasks. These images may have been taken from different scenes or from the same scene but with completely different points of view, illuminations and so on. When scenes or objects from scenes have been localised, the master system sends new orders to the robots, such as specific movements, grasping objects and so on. When a robot is not able to recognise the scene, it stops and asks the master system to localise it or to send it the information of the scene. The master obtains the whole information through the correspondence between the current and past images of all the robots. Due to the environment (for instance, rescue inside buildings), we assume it is not possible to localise the robots through GPS, 3D triangulation or sensors on the wheels.

When we put our system into practice, we realised that most of the cases in which robots did not correctly react to the human's orders were due to their failure to solve the low-level image registration problem appropriately. Therefore, the strength of our method is that, by putting human interaction into the method, most of these incorrect robot reactions are avoided. Nevertheless, technology tends to make systems run as autonomously as possible. For this reason, the main weakness of our method is that robots are less autonomous. Better registration methods are going to be discovered, and then less human interaction will be needed. Fig. 3 is a screen shot of our HMI. We can visualise the cameras of robots 1 and 2 and the salient points automatically extracted. Both robots go on the pavement and Robot 1 follows Robot 2. Green lines represent the automatically extracted correspondences and the yellow line represents the correspondence imposed by the human, since he or she has realised that one of the correspondences was wrong. This interaction influences the obtained image alignment and also the rest of the correspondences. A similar interaction method was presented in Villamizar, Andrade-Cetto, Sanfeliu, and Moreno-Noguer (2012). In that case, there is only one robot and the human decides whether a selected part of the image is a face and, if it is, imposes the name of the person.

Fig. 2. An example of semi-automatic image registration where 2 interactions have been imposed (yellow lines). Despite the extreme differences between the images, 8 points have been properly matched (green lines) and only one point has been improperly matched (red line). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 3. Screen shot of our HMI with robot 1 and robot 2 images.

Fig. 4 shows a schematic view of our method, based on an Interactive Correspondence Method and a Human–Machine Interface. The user visualises both images together with the salient points and the current correspondence between them. Then, the user imposes some point mappings and the output of the module is a correspondence influenced by the oracle. This process is repeated until the user considers the correspondence good enough. A preliminary version of this method was presented in Cortés, Moreno, and Serratosa (2013). In that paper, the authors presented a simple interactive method to estimate the homography between two images. Besides, this project is part of a larger project in which social robots guide people through urban areas (Garrell & Sanfeliu, 2012) and have tracking abilities (Serratosa, Alquézar, & Amézquita, 2012). Fig. 5 represents three robots performing guiding tasks in an indoor environment. Robots fence the visitor group to force them to follow a specific tour. Robots need to work in a

cooperative manner to keep a triangular shape inside which people have to stay. In these cooperative tasks, it is crucial to have a low-level computer vision task such that the salient points extracted from the three robots are properly mapped. In this environment, there is a human that, through our HMI, helps to properly map the salient points. The rest of the paper is organised as follows. In the next section, we succinctly explain the method used to deduce an alignment given a correspondence between two sets of points. In Section 3, we explain our method to achieve an automatic alignment and correspondence and in Section 4, we add interactivity to this method. This means that a human proposes some point mappings and the method finds the best image alignment and point set correspondences considering this human imposition. In Section 5, we experimentally validate our interactive model and show that our method solves the alignment problem on images with large differences. We conclude the paper in Section 6.

Fig. 4. Basic scheme of our method composed of an Interactive Correspondence Method (ICM) and the Human–Machine Interface (HMI).


Fig. 5. Three robots performing guiding tasks. Robots are located to fence the group.

2. Alignment Estimation given a Correspondence between two point sets

Direct Linear Transformation (Hartley & Zisserman, 2003) is an algorithm that solves a set of variables from a set of similarity relations of the form x ∝ A·y, where ∝ denotes equality up to an unknown scalar multiplication, x and y are known vectors, and A is a matrix (or linear transformation) which contains the unknown parameters to be solved. This type of relation appears frequently in projective geometry and can be used to obtain the homography (points alignment) that transforms one set of points (represented as vector x) into another set of points (represented as the other vector y). Note that, in some applications, deformations are not rigid (Serratosa, Cortés, & Solé-Ribalta, 2013) and therefore the method that we present cannot be applied. Given two images I_1, I_2 and a set of pairs of corresponding points on these images {(x_1, y_1) ↔ (x'_1, y'_1)}, ..., {(x_n, y_n) ↔ (x'_n, y'_n)}, it is possible to obtain a transformation matrix F that projects each of the points in I_1 into the points in I_2 with a minimal error using the Direct Linear Transformation algorithm. Then [x'_i, y'_i, 1]^T = F·[x_i, y_i, 1]^T, where (x'_i, y'_i) ∈ I_2 and (x_i, y_i) ∈ I_1. Throughout this paper we suppose the transformations that images suffer to be aligned to the other ones are affine transformations. If this is not the case, the only part that has to be reconsidered is the error function we derive below, and therefore the obtained matrix A and vector B. If we suppose that the mean error has a normal distribution, then the least-squares estimation is optimal (Pedhazur & Elazar, 1982). Moreover, if we assume an affine transformation, then the transformation matrix is defined as

        | a  −b  c |
    F = | b   a  d |
        | 0   0  1 |

where a = S·cos(α), b = S·sin(α), S = scale, c = translation in x and d = translation in y. In this case the error function is usually estimated as follows:

    E(a, b, c, d) = Σ_{i=1..n} [ (a·x_i − b·y_i + c − x'_i)² + (b·x_i + a·y_i + d − y'_i)² ]

We find the minimum by setting the derivatives to zero: dE/da = 0, dE/db = 0, dE/dc = 0 and dE/dd = 0. This leads to the following linear problem: A·(a, b, c, d)^T + B = 0, where

        | Σ(x_i² + y_i²)    0                Σx_i    Σy_i |
    A = | 0                 Σ(x_i² + y_i²)  −Σy_i    Σx_i |
        | Σx_i             −Σy_i             n       0    |
        | Σy_i              Σx_i             0       n    |

and

        | −Σ(x_i·x'_i + y_i·y'_i) |
    B = | −Σ(x_i·y'_i − y_i·x'_i) |
        | −Σx'_i                  |
        | −Σy'_i                  |

We solve this linear system using LU factorization (Bunch & Hopcroft, 1974).
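The derivation above can be sketched directly in code. The following is a minimal NumPy/SciPy sketch (the function name is ours, not the authors'): it builds the 4×4 system A·(a, b, c, d)^T + B = 0 derived in this section, solves it with an LU factorization as the text suggests, and assembles the transformation matrix F.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def affine_params_lsq(src, dst):
    """Least-squares fit of the 4-parameter transform of Section 2:
    x' = a*x - b*y + c,  y' = b*x + a*y + d.
    Builds A and B of the derived linear system and solves it via LU."""
    x, y = src[:, 0], src[:, 1]
    xp, yp = dst[:, 0], dst[:, 1]
    n, s = float(len(src)), float(np.sum(x**2 + y**2))
    A = np.array([[s,        0.0,      x.sum(), y.sum()],
                  [0.0,      s,       -y.sum(), x.sum()],
                  [x.sum(), -y.sum(),  n,       0.0],
                  [y.sum(),  x.sum(),  0.0,     n]])
    B = -np.array([(x * xp + y * yp).sum(),
                   (x * yp - y * xp).sum(),
                   xp.sum(),
                   yp.sum()])
    # A*(a,b,c,d)^T + B = 0  =>  solve A*theta = -B
    a, b, c, d = lu_solve(lu_factor(A), -B)
    return np.array([[a, -b, c], [b, a, d], [0.0, 0.0, 1.0]])
```

Applying the returned F to homogeneous points [x_i, y_i, 1]^T projects the points of I_1 into the coordinate space of I_2.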

3. Automatic Alignment & Correspondence Estimation

Fig. 6 shows the main steps of our Automatic Alignment Estimation. Any interest point extraction algorithm, such as Harris and Stephens (1988), DoG (Lowe, 2004) or LoG (Lindeberg, 1998; Mikolajczyk & Schmid, 2004), can be used to obtain the salient points P_1 and P_2, given images I_1 and I_2. Once these salient points are obtained, the system obtains the homography H^0_{1,2} between them (alignment between points), for instance using the Iterative Closest Point algorithm (Zhang, 1992) or the RANSAC algorithm (Fischler & Bolles, 1981). In cases in which the algorithm not only obtains the homography but also detects the outliers, as the RANSAC algorithm does (Fischler & Bolles, 1981), it returns a Consensus Set CS_{1,2}. This consensus set is composed of two sets of points (one from each initial salient point set) that the algorithm considers should be mapped. In our interactive framework (Section 4), we need to validate the returned alignment. For this reason, we propose to project the set of points P_1 and image I_1 to obtain P'_1 and I'_1, where P'_1 = H_{1,2}·P_1 and I'_1 = H_{1,2}·I_1. Thus, the human can visualise the output through a Human–Machine Display. Moreover, the display also shows the consensus set generated by the alignment method (green lines). Note that there is no human interaction in this framework; we present the interactive framework in the next section. Fig. 7 shows the basic steps of our Automatic Correspondence Estimation. In this case, three steps (red box) have been added to


Fig. 6. Scheme of our Automatic Alignment Estimation model.
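The alignment-plus-consensus-set step of the scheme above can be approximated with a small RANSAC-style loop. The sketch below is a simplification under stated assumptions: tentative pairs are given in order (P1[i] ↔ P2[i]), the model is the 4-parameter transform of Section 2 rather than a full homography, and all names and parameter values are illustrative.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares fit of x' = a*x - b*y + c, y' = b*x + a*y + d."""
    x, y = src[:, 0], src[:, 1]
    rows_x = np.column_stack([x, -y, np.ones_like(x), np.zeros_like(x)])
    rows_y = np.column_stack([y, x, np.zeros_like(x), np.ones_like(x)])
    A = np.vstack([rows_x, rows_y])
    rhs = np.concatenate([dst[:, 0], dst[:, 1]])
    return np.linalg.lstsq(A, rhs, rcond=None)[0]      # (a, b, c, d)

def apply_similarity(params, pts):
    a, b, c, d = params
    return np.column_stack([a * pts[:, 0] - b * pts[:, 1] + c,
                            b * pts[:, 0] + a * pts[:, 1] + d])

def ransac_alignment(P1, P2, n_iter=500, tol=3.0, seed=0):
    """RANSAC-style loop: fit minimal samples of 2 tentative pairs,
    keep the model with the largest consensus set CS_{1,2},
    then refit on that consensus set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(P1), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(P1), size=2, replace=False)
        params = fit_similarity(P1[idx], P2[idx])
        err = np.linalg.norm(apply_similarity(params, P1) - P2, axis=1)
        inliers = err < tol
        if inliers.sum() > best.sum():
            best = inliers
    return fit_similarity(P1[best], P2[best]), best
```

The returned boolean mask plays the role of the consensus set CS_{1,2} that the later filtering step consumes.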

obtain the final correspondence f_{1,2}. In the first one, a similarity matrix is defined with the projected points P'_1 and the original points P_2. Then this matrix is modified with the information of the Consensus Set CS_{1,2} generated by the homography estimator. The method considers the similarity to be zero if the mapped points are not in the Consensus Set. For this reason, all the cells that relate non-mapped points are set to zero. This method is inspired by the algorithm presented in Serratosa, Cortés, and Solé-Ribalta (2012), although in that case the interactive algorithm modified a cost matrix instead of a similarity matrix. Algorithm 1 shows the process used to update the similarity matrix, encapsulated in the Consensus Set Filtering module.

Algorithm 1. Consensus Set Filtering
Input: Similarity Matrix M_{1,2}, Consensus Set CS_{1,2}
Output: Similarity Matrix with Restrictions M'_{1,2}
For all p'_1 ∉ CS_{1,2} and p'_1 ∈ P'_1:
    M'_{1,2}(i, b) = 0, ∀b, being p'_1 the ith element of P'_1
End for all
For all p_2 ∉ CS_{1,2} and p_2 ∈ P_2:
    M'_{1,2}(a, j) = 0, ∀a, being p_2 the jth element of P_2
End for all
End Algorithm
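A minimal sketch of Algorithm 1, together with the final assignment step on the filtered similarity matrix (the assignment is solved here with SciPy's `linear_sum_assignment`, which implements the same Hungarian method as Munkres' algorithm). Function names and the rule of dropping zero-similarity pairs are our illustrative choices, not the authors' API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def consensus_set_filtering(M, cs_pairs):
    """Algorithm 1 sketch: zero every row whose point of P'_1 is not in
    the consensus set, and every column whose point of P_2 is not in it.
    cs_pairs is a list of (i, j) index pairs belonging to CS_{1,2}."""
    M2 = M.copy()
    rows_in_cs = {i for i, _ in cs_pairs}
    cols_in_cs = {j for _, j in cs_pairs}
    for i in range(M2.shape[0]):
        if i not in rows_in_cs:
            M2[i, :] = 0.0
    for j in range(M2.shape[1]):
        if j not in cols_in_cs:
            M2[:, j] = 0.0
    return M2

def final_correspondence(M_filtered):
    """Assignment step: maximise total similarity; pairs whose similarity
    was zeroed out by the filtering are dropped from the result."""
    rows, cols = linear_sum_assignment(M_filtered, maximize=True)
    return [(i, j) for i, j in zip(rows, cols) if M_filtered[i, j] > 0]
```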

The last step obtains the final correspondence using an algorithm that solves the assignment problem, such as the Munkres' algorithm (Kuhn, 1955). In this case, the Human-Matching Display not only shows the projected images and points but also the found point correspondences f_{1,2} (green lines). Therefore, in this case the green lines are not the Consensus Set returned by the alignment algorithm but the final output correspondences. Again, there is no human interaction.

4. Interactive Alignment & Correspondence Estimation

Fig. 8 presents the main scheme of the Interactive Alignment Estimation. It is based on the alignment estimation model shown in Fig. 6, but the oracle's interaction is added (dashed lines in the scheme). The main idea is to project the initial sets of points P_1 and P_2 into P*_1 and P*_2 such that the input of the alignment estimation becomes "easier", and so the output alignment tends to be more accurate or more in concordance with the user. As commented in the introduction, the initial correspondence estimation is crucial to obtain a good alignment, and so our proposal is to modify and improve this initial set. Fig. 9 shows the main scheme of the module called Alignment Estimation given the human Correspondence. The oracle imposes some correspondences between points, f*_{1,2}. Note that f*_{1,2} is usually a partial labelling between points with few point mappings. It is normal that in each interaction the user imposes only one point


Fig. 7. Scheme of an Automatic Correspondence Estimation model.

mapping, although f*_{1,2} represents the union of all the mappings imposed through all interactions. We suppose there are no contradictions between the oracle's feedbacks. The Direct Linear Transform algorithm (Section 2) obtains homography H*_{1,2}. The Points Set Filtering given Images module obtains two reduced point sets P*_1 and P*_2 from point sets P_1 and P_2. The discarded points in P_1 and P_2 are the ones that must be outliers in I_1 and I_2. When we project I_1 and obtain I*_1 = H*_{1,2}·I_1, then I*_1 has the same information as I_1 but the image is shown in the coordinate space of I_2. Thus, we consider that a point in P_1 (or P_2) is an outlier if its coordinates are out of the domain of image I_2 (or I*_1). Fig. 10 graphically shows an example of the Points Set Filtering given Images. The last image represents the large image in which we have marked the salient points and also placed the small image on it. Thus, blue points are inliers since they are inside the projected small image. Contrarily, red points are outliers. Fig. 11 shows the main scheme of our Interactive Correspondence Estimation model. It is based on the Alignment Estimation model shown in Fig. 8, but in this case the feedback of the oracle influences two modules. The first one, called Alignment Estimation given a Correspondence, has just been described (Fig. 9). The other one is the Feedback Set Filtering. This module executes Algorithm 1, as the Consensus Set Filtering module does, but its inputs are the feedback of the human f*_{1,2} and matrix M'_{1,2}. We force a rigid transformation between images, but non-rigid transformations appear in real applications. For this reason, if we made the oracle's feedback influence only the Alignment Estimation given the human Correspondence, and the Feedback Set were considered to be null in the Feedback Set Filtering module, the model might not achieve the point correspondences imposed by the oracle. That is, in cases in which there is no real rigid transformation, it may happen that f_{1,2} ≠ f*_{1,2} although the oracle has imposed the correspondence between all points. For this reason, the Feedback Set Filtering is needed. We show this effect in the practical evaluation.

5. Practical evaluation

To validate our method, we have performed two different kinds of experiments. The aim of the former is to validate the Interactive Alignment Estimation, whereas the aim of the latter is to validate the Interactive Correspondence Estimation. In the first case, an alignment ground truth (homography between images) is needed, but in the second case a correspondence ground truth (mapping between salient points of the images) is needed. For this reason, we have used two kinds of databases. In Section 5.1 we succinctly explain the used databases and their principal characteristics. In Section 5.2, we show the experiments applied on the Interactive Alignment Estimation and on the Interactive Correspondence Estimation.

5.1. Databases

We used four different databases to validate the Interactive Alignment Estimation model: BOAT, EAST PARK, EAST SOUTH and RESID, and another four databases to validate the Interactive Correspondence Estimation model: APARTMENTS, HOUSE, HOTEL and FACES. BOAT, EAST PARK, EAST SOUTH and RESID (http://www.featurespace.org): These databases are used to evaluate point extraction algorithms and image registration methods (Fig. 15


Fig. 8. Scheme of Interactive Alignment Estimation model.

Fig. 9. Scheme of module ‘‘Alignment Estimation given the human Correspondence’’.

shows some images from these databases). They consist of images taken from different outdoor scenarios across different zooms and rotations. Each database contains 10 or 11 images that are ordered according to a variation of zoom and rotation. Moreover, the homography matrix from the first image to the other ones is given. Points have been extracted using the scale-invariant feature

detector (Lowe, 2004). For each image, we keep the 50 points with the highest scales. There is no correspondence ground truth. APARTMENTS database (http://deim.urv.cat/francesc.serratosa/databases/): It consists of ten images of an apartment building taken from different perspectives (Fig. 12a). Twelve salient points have been manually extracted from each image. Each point


Fig. 10. Example of the Points Set Filtering given Images.
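The domain test behind the Points Set Filtering illustrated above can be sketched in a few lines. The sketch assumes an image domain of [0, width) × [0, height); this convention and the function name are ours.

```python
import numpy as np

def points_in_domain(points, width, height):
    """Points Set Filtering sketch: a (projected) point is kept as an
    inlier only if it falls inside the target image domain."""
    x, y = points[:, 0], points[:, 1]
    return (x >= 0) & (x < width) & (y >= 0) & (y < height)
```

Usage: project P_1 with the current homography, compute the mask against the domain of I_2, and keep only `P1[mask]`; the symmetric test for P_2 uses the domain of the projected image I*_1.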

Fig. 11. Scheme of Interactive Correspondence Estimation model.

represents the same part of the scene through the whole set of images, and so the point correspondence is implicit in the point enumeration. The first 11 points represent parts of the building and the 12th point is the highest part of a lamppost. HOTEL (http://www.featurespace.org), HOUSE (http://www.featurespace.org) and FACES (Oliver Jesorsky, Kirchberg, & Frischholz, 2001): Similarly to the APARTMENTS database, they are composed of a set of images (Fig. 12b–d) and, from each image, a set of points has been selected such that each point represents the

same part of the scene through the whole images. Again, the correspondence between points is implicit in the point enumeration. Nevertheless, due to the degree of image distortion is very low, we generated other sets of points by artificially adding some noise on the original points (similar to Cortés and Serratosa (2013)). First, we added a global artificial transformation for all points (rotation and translation noise). And second, we added an individual noise for each point following a normal distribution. This noise simulates different viewpoints highly separated



Fig. 12. One image extracted from the (a) APARTMENTS, (b) HOUSE, (c) HOTEL, and (d) FACES databases.

Fig. 13. (a) Mean End Point Error in pixels. (b) Number of obtained inliers. ( ) Interactive Correspondence Method with no interactions ( ) Interactive Correspondence Method after two interactions ( ) Smooth Point Registration Using Neighbouring (Sanromà et al., 2012a) ( ) Softassign Procrustes (Rangarajan & Chui, 1997) ( ) Unified Approach (Luo & Hancock, 2003) ( ) Dual-Step Method (Cross & Hancock, 1998) ( ) Graduated Assignment (Gold & Rangarajan, 1996) and ( ) Matching by Correlation (Sanromà et al., 2012a).

Fig. 14. (a) Mean End Point Error in pixels and (b) number of correspondences with respect to the number of interactions. ( ) Boat ( ) Eastsouth ( ) Eastpark and ( ) Resid.

from each other and makes the oracle's interaction useful to find a good solution, since the transformation between the original frame and the artificial one is non-rigid. Note that, by construction, we do not guarantee that the ground-truth correspondence achieves the minimum cost.

5.2. Experiments

In all experiments we use Random Sample Consensus (Fischler & Bolles, 1981) as the Automatic Alignment Estimation algorithm.
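As a rough illustration of this step (our own sketch, not the authors' implementation; the 4-point sample size, the 3-pixel inlier threshold and the helper names are assumptions), a RANSAC loop for estimating a homography from noisy point correspondences could look like:

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: least-squares homography from >= 4 matches."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)       # null vector of A, reshaped to 3x3
    return H / H[2, 2]             # fix the scale ambiguity

def project(H, pts):
    """Apply homography H to an (n, 2) array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def ransac_homography(src, dst, n_iters=1000, thresh=3.0, rng=None):
    """Repeatedly fit H to 4 random matches; keep the model with the most
    inliers (reprojection error below `thresh` pixels); refit on them."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The final refit on all inliers is one common variant; implementations differ in how the consensus model is polished.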

5.2.1. Interactive Alignment Estimation

To evaluate the Interactive Alignment Estimation (IAE) we performed two different experiments using the BOAT, EAST PARK, EAST SOUTH and RESID sequences. In the first one, we compared the IAE to the results presented in Sanromà et al. (2012a). In that paper, the authors presented a comparative study of six completely automatic methods: Smooth Point Registration Using Neighbouring (Sanromà et al., 2012a), Softassign Procrustes (Rangarajan, Chui, & Bookstein, 1997), Unified Approach (Luo & Hancock, 2003), Dual-Step Method (Cross & Hancock, 1998), Graduated Assignment



Fig. 15. Some examples of matches from BOAT, EAST PARK, EAST SOUTH and RESID databases. Images above show the results without human interaction and images below show the results with two imposed interactions. Yellow lines: imposed mappings. Green lines: correct mappings. Red lines: incorrect mappings. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

(Gold & Rangarajan, 1996) and Matching by Correlation (Sanromà et al., 2012a). As mentioned by the authors, the image registration was performed only between adjacent images, because non-adjacent images presented non-admissible results. Concerning our method, we show the results when no user interactions are performed and at the second interaction. We suppose the oracle imposes node mappings that are in concordance with the alignment ground truth. That is, in each interaction, the oracle selects a point in the first image that has not been selected before and maps it to the point in the second image at minimum Euclidean distance from its projection. Similarly to Sanromà et al. (2012a), we evaluate the goodness of the methods through the End Point Error, which is obtained as the sum of the Euclidean distances between the points of the second image and the points of the first image projected with the estimated alignment (homography). The better the estimated alignment is, the smaller the distance between the points of the second image and the projected points of the first image has to be.

Fig. 13a shows the Mean End Point Error obtained by the six automatic methods and by our method without interactions and at the second interaction. After the second interaction, our model obtains the minimum End Point Error and the smallest standard deviation. Considering our method without interactions, note the important reduction not only in the distance value but also in the standard deviation. Fig. 13b shows the number of obtained inliers. We experimentally validate that our method tends to select few, but correct, inliers and thus obtains the best image alignment.

The aim of the second experiment is to show the goodness of our method when there exists an important difference between the images to be considered. To do so, we have used the same datasets as in the previous experiment, but all combinations between images were considered (not only the adjacent ones). We do not present results of the completely automatic methods since they were not acceptable. Fig. 14 shows the evolution of (a) the Mean End Point Error and (b) the number of inliers with respect to the number of interactions. We can see that at the second interaction both measures tend to stabilize, which means our method only requires two user interactions. Moreover, while comparing the values of Figs. 13a and 14a, we realise there is not a large difference between the error values. Therefore, we conclude that we achieve a good alignment although there is a large difference between the images to be considered. Fig. 15 shows the correspondences using our method without interactions (images above) and after two mappings have been imposed (images below). Yellow lines: imposed mappings. Green lines: correct mappings. Red lines: incorrect mappings. Correct mappings considerably increase at the second interaction.
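The End Point Error used in these experiments can be sketched as follows (our own illustration, not the authors' code; `H` is a 3x3 homography and both point arrays hold n x 2 pixel coordinates):

```python
import numpy as np

def end_point_error(H, pts1, pts2):
    """Sum of Euclidean distances between pts2 and pts1 projected by the
    3x3 homography H. pts1 and pts2 are (n, 2) arrays of pixel coordinates."""
    ph = np.hstack([pts1, np.ones((len(pts1), 1))]) @ H.T
    proj = ph[:, :2] / ph[:, 2:3]   # back to inhomogeneous coordinates
    return np.linalg.norm(proj - pts2, axis=1).sum()
```

The smaller the error, the better the estimated alignment; dividing by n gives the mean error reported in Fig. 13a.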


Fig. 16. Hamming distance with respect to the number of interactions. ( ): the complete method. ( ): Alignment Estimation given a Correspondence deactivated. ( ): Feedback Set Filtering module deactivated.

Fig. 14 shows a steep descent between interactions 1 and 2. This is because the interactive module establishes an acceptable alignment at interaction 2, which provides a good initialisation for the RANSAC algorithm (Fischler & Bolles, 1981). A good initialisation is very important for the RANSAC algorithm because it can then properly identify the outlier points. For this reason, there is an important increase in the number of found correspondences at the second interaction.
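The simulated oracle used in these experiments can be sketched as follows (our own reconstruction of the behaviour described above; the function name and the first-unselected-point policy are assumptions):

```python
import numpy as np

def oracle_imposition(H_gt, pts1, pts2, already_imposed):
    """Return one new mapping (i, j): i indexes a point of the first image
    not imposed before; j indexes the point of the second image closest to
    the projection of pts1[i] under the ground-truth homography H_gt."""
    ph = np.hstack([pts1, np.ones((len(pts1), 1))]) @ H_gt.T
    proj = ph[:, :2] / ph[:, 2:3]
    for i in range(len(pts1)):
        if i in already_imposed:
            continue
        j = int(np.argmin(np.linalg.norm(pts2 - proj[i], axis=1)))
        return i, j
    return None  # every point has already been imposed
```

In the correspondence experiments the oracle instead picks the unselected point at random; the nearest-projection rule shown here matches the alignment experiments.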


5.2.2. Interactive Correspondence Estimation

The aim of the experiments performed on the Interactive Correspondence Estimation is to minimize the Hamming distance between the obtained correspondence and the correspondence ground truth. We suppose the oracle imposes node mappings that are in concordance with the correspondence ground truth and, in each interaction, the oracle randomly selects one of the points that has not been selected before. Fig. 16 shows the average Hamming distance with respect to the number of interactions in the four databases. In interactions 0 and 1, the Alignment Estimation given a Correspondence is not used, because two matchings are needed to deduce a unique affine alignment. Nevertheless, in interaction 1, the first node mapping imposition is introduced into the Feedback Set Filtering module. Interaction 2 is the first one in which the whole model is applied.

We have performed three different experiments. In the first one (blue curve), the complete interactive method is used. In the second one (orange curve), the Alignment Estimation given a Correspondence is not used and the human impositions only affect the Feedback Set Filtering module. Conversely, in the third one (yellow curve), the Alignment Estimation given a Correspondence is used but the Feedback Set Filtering module is not. While using the complete method, the model establishes an initial alignment at the second interaction; for this reason, the Hamming distance presents a steep descent. Throughout the other interactions, the model slightly updates the alignment and imposes the restrictions, and so the Hamming distance decreases until arriving at zero. When the Feedback Set Filtering module is deactivated, the Hamming distance does not arrive at zero. This is because, in a real application, a rigid homography that would allow deducing the ideal correspondence by applying only this transformation does not exist. Finally, in the case that Alignment Estimation given a Correspondence is deactivated, the Hamming distance arrives at zero but more slowly than using the whole method. Through these experiments, we show the need for using the whole method, since one of the most important qualities that an interactive method needs to have is the ability to react in a few oracle interactions.

6. Conclusions and future work

When a team of robots needs to have consistent shared information about its environment based on its computer vision systems, the completely automatic methods usually fail to find a good enough correspondence between the different parts of the scene. This is because of the large differences between the seen images. In these cases, an interactive method can find a correspondence establishment (or improve a low-quality one) and so increase the consistency of the shared information. We have presented an interactive method in which, given two images taken from different robots, the oracle imposes some correspondences between their local parts to estimate an initial alignment and to restrict the point correspondences. The main difference between our method and the methods presented in the literature related to robotics is that the interaction is performed at the low-level image registration process. In the other presented methods, the interaction is based on giving orders for high-level tasks. Results show that in the first two interactions, there is an important increase in the quality of the correspondences. This result is important since it means that a long-term dependence on the oracle is not necessary, although the oracle is crucial at the initial stage. Therefore, we conclude that, with only two human interactions, most of the situations that an automatic system cannot solve are now solved with good results. Our method has been developed in conjunction with a social robot network project (Garrell & Sanfeliu, 2012). Thanks to our interactive model, some new robot tasks have been achieved in which the robot images had low quality due to shadows, partial occlusions or extremely different points of view and illuminations. These good practical results validate


our theory that human interaction is useful not only in high-level tasks but also in low-level tasks. Although there is a tendency to make systems autonomous, as commented in the introduction section, some interactive methods and patents related to image registration have been presented recently. This is because, although this problem has been studied for a while, in some applications or environments the achievements are far from being acceptable. Therefore, it is worth introducing some level of human interaction, although the run time can be affected, to assure an acceptable accuracy. In the near future, the interaction is expected to be reduced as the automatic methods increase their quality. Nowadays, we are modelling the interaction on several images (or sets of points) simultaneously (not only one-to-one but many-to-many) with the aim of obtaining a coherent set of alignments and correspondences between different scenes. Moreover, we are also studying models for non-rigid homographies. Finally, we want to apply active query strategies, as was done in Cortés and Serratosa (2013) and Cortés, Serratosa, and Solé (2012), to suggest the points to be imposed by the oracle. We also want to reduce the computation run time by applying the Fast Bipartite method (Serratosa, 2014).

References

Ahlberg, C. (1996). Spotfire: An information exploration environment. SIGMOD Record, 25(4), 25–29. Andrews, K., Wohlfahrt, M., & Wurzinger, G. (2009). Visual graph comparison. In Proceedings IV'2009 (pp. 62–67). IEEE. Berg Von, J., & Neitzel, U. (2010). Interactive image registration, WO 2010134013 A1. Bunch, J. R., & Hopcroft, J. (1974). Triangular factorization and inversion by fast matrix multiplication. Mathematics of Computation, 28, 231–236. Caetano, T. S., McAuley, J. J., Cheng, L., Le, Q. V., & Smola, A. J. (2009). Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6), 1048–1058.
Casper, J., & Murphy, R. R. (2003). Human–robot interactions during the robot-assisted urban search and rescue response at the World Trade Center. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33, 367–385. Chui, H., & Rangarajan, A. (2003). A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89(2–3), 114–141. Collins, C., & Carpendale, M. S. T. (2006). Vislink: Revealing relationships amongst visualizations. IEEE Transactions on Visualization and Computer Graphics, 13, 1192–1199. Cortés, X., & Serratosa, F. (2013). Active-learning query strategies applied to select a graph node given a graph labelling. In GbR2013 (pp. 61–70). Vienna, Austria. Cortés, X., Serratosa, F., & Solé, A. (2012). Active graph matching based on pairwise probabilities between nodes. In SSPR2012 (pp. 98–106). Hiroshima, Japan. Cortés, X., Moreno, & Serratosa, F. (2013). Improving the correspondence establishment based on interactive homography estimation. In Computer analysis of images and patterns, CAIP2013, LNCS (Vol. 8048, pp. 457–465). York, United Kingdom. Cross, A., & Hancock, E. (1998). Graph matching with a dual-step EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1236–1253. de Miranda, P. A. V., Falcão, A. X., & Udupa, J. K. (2010). Synergistic arc-weight estimation for interactive image segmentation using graphs. Computer Vision and Image Understanding, 114(1), 85–99. Diehl, S., & Gorg, C. (2002). Graphs, they are changing: Dynamic graph drawing for a sequence of graphs. In Graph drawing. Elmqvist, N., Dragicevic, P., & Fekete, J. D. (2008). Rolling the dice: Multidimensional visual exploration using scatterplot matrix navigation. IEEE TVCG, 14(6). Erten, C., Harding, P. J., Kobourov, S. G., Wampler, K., & Yee, G. V. (2003). Graphael: Graph animations with evolving layouts. In Symposium on graph drawing (pp. 98–110). Fischler, M. A., & Bolles, R. C. (1981).
Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395. Garrell, A., & Sanfeliu, A. (2012). Cooperative social robots to accompany groups of people. International Journal of Robotics Research, 31(13), 1675–1701. Gering, D. T. (2010). Systems and methods for interactive image registration. U.S. Patent No. 7,693,349. Gold, S., & Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4). Han, W., & Brady, M. (2005). Real-time corner detection algorithm for motion estimation. Image and Vision Computing, 695–703. Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Proceedings of the 4th Alvey vision conference (pp. 147–151). Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press.


Holten, D., & Wijk, J. J. V. (2008). Visual comparison of hierarchically organized data. Computer Graphics Forum, 27. Jesorsky, O., Kirchberg, K. J., & Frischholz, R. (2001). Robust face detection using the Hausdorff distance. In AVBPA (pp. 90–95). Khader, M., & Ben Hamza, A. (2012). An information-theoretic method for multimodality medical image registration. Expert Systems with Applications, 39(5), 5548–5556. Kleinberg, J. M. (2002). Bursty and hierarchical structure in streams. In 8th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 91–101). ACM Press. Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97. Lindeberg, T. (1998). Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2). Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91. http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94. Luo, B., & Hancock, E. (2003). A unified framework for alignment and correspondence. Computer Vision and Image Understanding, 92(1), 26–55. Turk, M. (2014). Multimodal interaction: A review. Pattern Recognition Letters, 36, 189–195. Mikolajczyk, K., & Schmid, C. (2004). Scale and affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63–86. Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630. Myronenko, A., & Song, X. B. (2010). Point set registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12), 2262–2275. Panagiotakis, C., Papadakis, H., Grinias, E., Komodakis, N., Fragopoulou, P., & Tziritas, G. (2013). Interactive image segmentation based on synthetic graph coordinates. Pattern Recognition, 46(11), 2940–2952. Pedhazur, E. J. (1982).
Multiple regression in behavioural research: Explanation and prediction. New York: Holt, Rinehart and Winston. ISBN 0-03-041760-0. Pfluger, T., et al. (2000). Quantitative comparison of automatic and interactive methods for MRI-SPECT image registration of the brain based on 3-dimensional calculation of error. Journal of Nuclear Medicine, 41(11), 1823–1829. Phuong, L. H., Visani, M., Boucher, A., & Ogier, J.-M. (2013). A new interactive semi-supervised clustering model for large image database indexing. Pattern Recognition Letters. Pietrzyk, U., et al. (1994). An interactive technique for three-dimensional image registration: Validation for PET, SPECT, MRI and CT brain studies. Journal of Nuclear Medicine, 35(12), 2011–2018. Rangarajan, A., Chui, H., & Bookstein, F. L. (1997). The softassign procrustes matching algorithm. In Proceedings of the 15th international conference on information processing in medical imaging (pp. 29–42). Springer-Verlag. Revuelta-Martinez, A., Rodriguez, L., Garcia-Varea, I., & Montero, F. (2013). Multimodal interaction for information retrieval using natural language. Computer Standards and Interfaces, 35(5), 428–441. Romero, V., Toselli, A. H., & Vidal, E. (2012). Multimodal interactive handwritten text transcription. World Scientific Publishing. Rosten, E., & Drummond, T. (2006). Machine learning for high-speed corner detection. European Conference on Computer Vision, 1, 430–443. Salvi, J., Matabosch, C., Fofi, D., & Forest, J. (2007). A review of recent range image registration methods with accuracy evaluation. Image and Vision Computing, 25(5), 578–596. Sanromà, G., Alquézar, R., & Serratosa, F. (2012). A new graph matching method for point-set correspondence using the EM algorithm and softassign. Computer Vision and Image Understanding, 116(2), 292–304. Sanromà, G., Alquézar, R., Serratosa, F., & Herrera, B. (2012).
Smooth point-set registration using neighbouring constraints. Pattern Recognition Letters, 33, 2029–2037. Schwenker, F., & Trentin, E. (in press-a). Pattern classification and clustering: A review of partially supervised learning approaches. Pattern Recognition Letters. Schwenker, F., & Trentin, E. (in press-b). Partially supervised learning for pattern recognition. Pattern Recognition Letters. Serratosa, F. (2014). Fast computation of bipartite graph matching. Pattern Recognition Letters, 45, 244–250. Serratosa, F., Alquézar, R., & Amézquita, N. (2012). A probabilistic integrated object recognition and tracking framework. Expert Systems With Applications, ESWA, 39, 7302–7318. Serratosa, F., Cortés, X., & Solé-Ribalta, A. (2012). Interactive graph matching by means of imposing the pairwise costs. ICPR, 1298–1301. Serratosa, F., Cortés, X., & Solé-Ribalta, A. (2013). Component retrieval based on a database of graphs for hand-written electronic-scheme digitalisation. Expert Systems With Applications, ESWA, 40, 2493–2502. Tomasi, C., & Kanade, T. (1991). Detection and tracking of point features. Tech. Rep. CMUCS-91-132, Carnegie Mellon University. Toselli, A. H., Vidal, E., & Casacuberta, F. (Eds.). (2011). Multimodal interactive pattern recognition and applications. Springer. Trevai, C., Fukazawa, Y., Ota, J., Yuasa, H., Arai, T., & Asama, H. (2003). Cooperative exploration of mobile robots using reaction–diffusion equation on a graph. In IEEE international conference on robotics and automation (Vol. 2).



Villamizar, M., Andrade-Cetto, J., Sanfeliu, A., & Moreno-Noguer, F. (2012). Bootstrapping boosted random ferns for discriminative and efficient object classification. Pattern Recognition, 45(9), 3141–3153. Xu, M., & Petrou, M. (2011). 3D scene interpretation by combining probability theory and logic: The tower of knowledge. Computer Vision and Image Understanding, 115(11), 1581–1596. Yee, K., & Fisher, D. (2001). Animated exploration of dynamic graphs with radial layout. In INFOVIS (pp. 43–50).

Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), 119–152. Zitová, B., & Flusser, J. (2003). Image registration methods: A survey. Image and Vision Computing, 21(11), 977–1000.