
Computation Sharing in Distributed Robotic Systems: a Case Study on SLAM

Bruno D. Gouveia, David Portugal, Member, IEEE, Daniel C. Silva and Lino Marques, Member, IEEE

Abstract—Aiming at increasing team efficiency, mobile robots may act as a node of a Robotic Cluster to assist their teammates in computationally demanding tasks. Having this in mind, we propose two distributed architectures for the Simultaneous Localization And Mapping (SLAM) problem, our main case study. The analysis focuses especially on the efficiency gain that can be obtained. It is shown that the proposed architectures enable us to raise the workload up to values that would not be possible in a single robot solution, thus gaining in localization precision and map accuracy. Furthermore, we assess the impact of network bandwidth. All the results are extracted from frequently used SLAM datasets available in the robotics community and a real world testbed is described to show the potential of using the proposed philosophy.

Note to Practitioners—Multi-robot systems commonly suffer from limited computational resources, which interferes with the ability of each individual robot to finish the task underway. Generally in these scenarios, the use of the available computational power varies over time. Hence, based on the Robotic Clusters concept, we propose to allocate computational resources available in the multi-robot system to solve computationally hard problems arising in the real world. As a case-study, we propose two architectures for a solidary multi-robot system engaged in a SLAM task. In this paper, we thoroughly discuss the tradeoffs of the architectures proposed in terms of efficiency, complexity, exchange of data, load balancing and SLAM performance.

Keywords—Distributed Computing, Robotic Clusters, Simultaneous Localization and Mapping, Task sharing in Multi-Robot Systems, Networked Robots.

I. INTRODUCTION

With the growth of the robotics field, algorithms for navigation, localization, mapping and other robotic tasks have tended to increase in complexity over the years. In the particular case of multi-robot systems, low-cost processing components with limited capabilities are often used to keep the overall cost of the team low. In processors with limited computation power, running several of the aforementioned tasks simultaneously may not be possible. Thus, inspired by a variety of network-based computing concepts, roboticists have been adopting strategies to relieve the load involved in perception, reasoning, storage and computation in diverse robotic tasks.

This work was partially carried out in the framework of TIRAMISU project (www.fp7-tiramisu.eu). This project is funded by the European Community’s Seventh Framework Program (FP7/SEC/284747). B.D. Gouveia, D. Portugal and L. Marques are with the Institute of Systems and Robotics, University of Coimbra, 3030-290 Coimbra, Portugal. D.C. Silva is with the Department of Informatics Engineering, University of Coimbra, 3030-290 Coimbra, Portugal. E-Mail: {bgouveia, davidbsp, lino}@isr.uc.pt and

[email protected]

Following this philosophy, new areas have been emerging such as Cloud Robotics and Robotic Clusters. The concept of Cloud Robotics was described in [1] as:

Definition 1 (Cloud Robotics): A new approach to robotics that takes advantage of the Internet as a resource for massively parallel computation and real-time sharing of vast data resources.

Despite the tremendous potential of cloud robotics, such approaches use resources that are located outside of the robotic system; hence they rely on a persistent connection to a network infrastructure, which may not be available in several multi-robot tasks, e.g. exploration of remote places, search and rescue missions in the aftermath of a catastrophic scenario, etc. In order to address this issue, the concept of Robotic Cluster was introduced in [2] as:

Definition 2 (Robotic Cluster): A group of individual robots which are able to share their processing resources among the group in order to quickly solve computationally hard problems.

The general idea behind this notion is to design a cooperation mechanism for high performance computing in multi-robot systems and solve complex problems in robotics applications, using robots’ wireless communication capabilities and by sharing their processing resources. While the CPUs of these robots are shared in the robotic cluster, each node is still able to run its own tasks independently. Applications of robotic clusters usually demand high processing resources and relatively cheap designs. Both processing and cost are hard constraints that emphasize simplicity of the individual robots, and thus motivate a cluster approach to solve computationally hard problems. An evident application is SLAM, where mobile robots typically produce large amounts of data that must be processed in real-time [4]. Still, modern day computer networks have not been fully explored in this context and the absence of closed SLAM solutions based on parallel computing techniques is surprising.
Beyond proposing new SLAM methods to reduce sensor measurement noise and produce accurate representations, an increasing effort has been made to use efficient data structures in the state of the art, such as pose graphs [5], to reduce memory and processing requirements. Conversely, we propose two multi-robot architectures to distribute the computation load of a widely used Rao-Blackwellized Particle Filter (RBPF) SLAM approach, based on the robotic cluster concept.

In the next section, seminal work on SLAM and network-based computing is reviewed, and in Section III, the proposed distributed architectures for solidary SLAM are described in detail. Afterwards, in Section IV we compare these architectures in terms of efficiency and network overload by resorting to SLAM datasets that are frequently used by the community. We also analyze the benefits in performance of turning to distributed computation sharing over the single robot solution, and we validate the work through real world experiments on physical robots with limited processing capabilities. Finally, the article ends with conclusions and future directions of research.

II. PRELIMINARIES

A. SLAM Problem Formulation

The SLAM problem requires the robotic platform to estimate a map of its surroundings and to localize itself in it. Clearly, this implies gathering and then analyzing information from the sensorial system as the robot explores the world. However, this information is plagued by noise and uncertainty, which has led to a strong prevalence of probabilistic solutions to the SLAM problem over mathematically simpler approaches. Probabilistic solutions present the advantage of robustness to measurement noise, and the ability to formally represent uncertainty in the measurement and estimation process.

Besides sensor measurements, denoted as z, in the context of mobile robotics we also deal with controls, which represent the signals transmitted to the robot’s actuators to indicate how it is supposed to move. These controls, which are denoted as u, represent the way the robot is intended to move and not necessarily the way it actually moves; thus it is common to replace controls with odometric measurements which, albeit prone to errors, are fairly accurate at representing robotic motion [3].

Most of the probabilistic models used to solve the problem of SLAM rely on Bayes filters, which are an extension of Bayes rule to integrate temporal data, and are usually formulated as follows:

$$p(s_t, m \mid z^t, u^t) = \eta \, p(z_t \mid s_t, m) \cdot \int p(s_t \mid u_t, s_{t-1}) \, p(s_{t-1}, m \mid z^{t-1}, u^{t-1}) \, ds_{t-1}, \qquad (1)$$

where $\eta$ represents the Bayes normalization factor that ensures that $p(s_t, m \mid z^t, u^t)$ is a valid distribution function, $s_t$ is the robot’s estimated pose at time $t$, and $m$ represents the map. Also, $s^t = \{s_1, s_2, s_3, \ldots\}$, i.e. the superscripted symbol represents all the values of that quantity up to time $t$. To arrive at this formulation, we assume that the map, and therefore the world being mapped, are static. This is an extremely useful assumption in practical terms, given that by rendering the map constant, it removes the need to integrate over it.

B. Related Work

Typical solutions for the problem of SLAM rely on Kalman Filters (KFs) [6] and Particle Filters (PFs) [7], which are popular implementations of Bayes filters to incrementally compute joint posterior distributions over robot poses and landmarks. Over the years, several enhancements based on these approaches have been described in the literature, such as Extended Kalman Filters (EKFs) and RBPFs. In addition, graph-based algorithms, e.g. [5], [8], have also gained popularity due to the efficiency gained when maintaining large-scale maps, which stems from discarding irrelevant measurements and graph optimization processes. Also popular nowadays are methods which take advantage of the high scanning rates of modern day Light Detection And Ranging (LIDAR) technology. These methods rely heavily on scan matching of consecutive sensor readings, in combination with other techniques like multi-resolution occupancy grid maps [9] or dynamic likelihood field models for measurement [10].

Despite the evident advancements in SLAM research, a robot with such capabilities still has to be equipped with appropriate computation power to adequately handle the processing and memory requirements. Beyond the usage of optimized data structures and algorithms, some authors have turned to modern day computer architectures to handle heavy processing SLAM in real-time. According to [11], multiprocessing is extremely efficient in terms of energy consumption since the speed requirements on each parallel unit are reduced, allowing for a reduction in voltage and frequency. For example, the authors in [12] have proposed a Visual SLAM module implemented in the Compute Unified Device Architecture (CUDA), which runs at a 15 Hz rate. Their approach performs sparse scene flow, real-time feature tracking, visual odometry, loop detection and global mapping in parallel. Similarly, Par and Tosun [13] have addressed classical PF-based Sequential Monte Carlo Localization of a vehicle in a known map with multicore and manycore processors. They have used the Open Multi-Processing (OpenMP) programming model for the parallelization of the predict and update phases of the PF in a multicore CPU, and also implemented a Graphics Processing Unit (GPU) version, obtaining speedup values of up to 4.7 and 75 respectively.

Related philosophies for distributing the computation load in robotic applications are emerging. This is the case of cloud robotics [1], [14].
In [15], robots performing visual SLAM query a cloud service to run demanding steps of the algorithm and allocate storage space, thus freeing the robot embedded computers from most of the computation effort. This work is part of the RoboEarth project, which provides a cloud robotics infrastructure, more specifically a giant network and database repository for sharing information and triggering learning behaviors for robots [16].

In [17], a cloud-inspired framework named Distributed Agents with Collective Intelligence (DAvinCi) is presented. The authors use a powerful Hadoop¹ system composed of eight Intel Quad core server nodes, which works as a private cloud computing environment for service robots in large environments. An RBPF grid-based SLAM implementation was used as a demonstration. The robots are basically mobile sensing nodes, which send data to the DAvinCi server that works as a proxy to the Hadoop framework that takes care of all the computation. With this powerful parallel computing system, the authors were able to obtain a speedup ratio close to 7.3. However, no network analysis or system scalability assessment was reported.

¹ Hadoop is a Java based framework that supports data intensive distributed applications running on large clusters of computers.

The two previously described works imply the continued availability of the cloud entity. This may not be possible in tasks involving distributed multi-robot systems over wide areas or outdoor environments. In addition, the connection to/from the cloud represents a critical point of failure in the system, and the available bandwidth may become a bottleneck on the communication flow in systems with a high number of agents.

In the last decades a noticeable effort has been made in the area of networked robots ever since the pioneering work described by Siegwart et al. [18] on teams of mobile robot systems connected to the Web, and more recently multiprocessor architectures have been emerging in the context of multi-robot systems, e.g. in the scope of the Swarmanoid project [19], where CPU-intensive tasks are allocated to the main processor of each robot and real-time sensor readings are allocated to several microcontrollers. This represented an important architectural shift away from the classic single-microcontroller robots to a distributed and intrinsically modular design.

Inspired by networked robots and parallel computation, Marjovi et al. presented in [2] the concept of robotic cluster by empowering heterogeneous robots with the ability of sharing their processing resources when solving complex collective problems. This was applied to the problem of topological map merging in a distributed multi-robot system. Communication between agents relied on a Mesh wireless Ad Hoc network, and a parallel implementation based on Message Passing Interface (MPI) routines was described. Robots are equivalent to nodes of the mobile network, which locally map an unknown environment using a topological SLAM approach. In such a dynamic system, they continuously connect to and disconnect from the mesh network, being able to discover close-by processing units in the robotic cluster, so as to merge locally mapped graphs into globally consistent topological maps.
Such an approach is particularly suited for applications where computation peaks occur on a regular basis, and the latency when connecting to the cluster and querying the available computing nodes is negligible.

In this work, we present an evolution of the robotic cluster concept which moves away from a cluster architecture using Message Passing Interface (MPI), by implementing a distributed system with lightweight, loosely coupled message queues instead. The use of message queues enables us to build a less rigid system and choose from different messaging design patterns. When necessary, the system makes use of the available shared computation and it can eventually use computational resources located outside the system without the need to create bridges. We propose to distribute the widely used RBPF SLAM approach GMapping, presented by Grisetti et al. [20], for the first time as far as our knowledge goes, in a solidary multi-robot system, where robots occasionally share their computation resources to solve heavy algorithm steps, as illustrated in Fig. 1. Hereafter, we clarify the contributions of this work to the state of the art and, before describing the proposed architectures, we present the rationale behind the choice of a Gridmap RBPF SLAM approach as the key case study.

Fig. 1: An illustrative example of the Workload distribution in a multi-robot system composed of K = 5 robots.

C. Statement of Contributions

The main goal of this work is to describe and discuss two parallel computation architectures used to speed up complex steps of the GMapping SLAM approach in a distributed multi-robot system. Based on the robotic cluster concept, no network infrastructure is assumed to exist and robots are endowed with limited computation capabilities. This enables the system to run a distributed version of the algorithm that can cope with computation peaks. Using this approach, a system of small computers with low energy consumption is allowed to run shared applications using only the resources available in the system, solving problems that previously needed help from external computers. The solution designed is independent of connections to an outside network to avoid possible communication bottlenecks and enable team scalability. Moreover, the balanced utilization of the processing resources decreases the computation load in the robot’s CPU, thus reducing battery usage and improving response times.

Besides presenting distributed architectures that allow robots to assist their peers in the SLAM task, we intend to contribute to the state of the art by quantitatively analyzing the benefits of such an approach in commonly used and well-known datasets. Thus, we conduct CPU load analysis and impose limits on the network traffic to characterize the proposed architectures. We also show that the approaches developed enable us to increase the overall workload when compared to classical single robot solutions, therefore increasing localization precision and mapping accuracy. This is corroborated by a well-known benchmarking metric that is applied for performance comparison. Furthermore, we discuss implementation issues and describe an experiment with a team of robots to demonstrate the potential to apply the method in the real world.

III. DISTRIBUTED COMPUTING ARCHITECTURES FOR RBPF SLAM

A. The SLAM Case Study

SLAM has gained increasing attention over the years, and today several approaches exist to address the problem with different levels of success. Promising results to extend the single robot SLAM problem to multiple robots have also been presented recently [21], [22]. However, solutions for the Multi-Robot SLAM problem are not as mature as those for single robot SLAM, since they involve overcoming several additional challenges like rendezvous and robots’ mutual detection, extracting relative poses between robots, conducting map alignment and merging, network latency, pose consistency in local and global maps, managing different coordinate systems, and keeping the information up to date and synchronized in all robots. Even though it is theoretically faster to map an environment by adopting a multi-robot SLAM method, it is also much more computationally demanding than single robot SLAM, due to the above challenges and the fact that each robot is concurrently building its local map, as well as estimating its local pose in the map. Furthermore, there are no widely acknowledged solutions available in the community. Therefore, we argue that using a multi-robot SLAM approach in robots with limited computational power, e.g. swarm robots, is currently not feasible.

In this work, we follow an alternative philosophy, where one of the robots in the team performs SLAM and asks for assistance from its teammates to solve heavy steps of the SLAM algorithm in parallel. Advantages of the described philosophy include not having to address the additional challenges of multi-robot SLAM, and freeing the remaining robots in the team to perform other tasks in the shared environment, such as detection of people, gas leakages or other events. Also, due to the computing peaks related with heavy steps of the SLAM algorithm, single processing systems have the disadvantage of requiring a computer that is most of the time oversized for its purposes. Depending on its structure, it may not be possible to parallelize the algorithm, or the resultant speedup may not match the expectations, in which case it is more advantageous to run multiple instances of the algorithm with different parameters.

GMapping is a grid-based RBPF implementation for the SLAM problem. It makes use of a PF, where each particle is encoded by a potential trajectory of the robot and the corresponding occupancy grid map, thus providing a hypothesis for the localization and mapping problem at a given time step. There are several advantages for choosing GMapping as the key SLAM approach addressed in this study.
Firstly, it is an open source ready-to-use algorithm available in Robot Operating System (ROS) [23]. Secondly, it is well-established with recognized performance as proven by recent works, e.g. [24]. Finally, the nature of the algorithm is particularly suitable for parallelization, given that it is a single threaded approach where each particle is computed independently. This represents a convenient scenario for parallelization, since there is no need to share data between particles. This is known as an embarrassingly parallel problem.

Every time a new laser scan is acquired by the robot, the algorithm follows the steps illustrated in the flowchart of Fig. 2. During execution, the robot continuously updates the pose on each processed particle using its motion model, based on the estimation of odometry. In the beginning, when the first scan is received, it is directly registered in the map. Afterwards, registration only takes place if the linear distance or angular distance traversed by the robot surpasses specific thresholds. When this is the case, a laser scan match is performed, thus correcting the estimation of pose in the map of each particle. As a consequence, the weights of the particle tree are updated, and the effective sample size $N_{eff}$ is calculated in order to estimate how well the current particle set represents the target posterior:

$$N_{eff} = \frac{1}{\sum_{i=1}^{N} \left(\tilde{\omega}^{i}\right)^{2}}, \qquad (2)$$

where $\tilde{\omega}^{i}$ is the normalized weight of particle $i$. Finally, each time $N_{eff}$ drops below N/2, where N is the number of particles, the resampling step is run and particles with a low importance weight are replaced by samples with a high weight. This threshold drastically decreases the risk of replacing good particles, because the number of resampling operations is reduced during a run, and resampling is only performed when needed. For more details on the GMapping algorithm the reader should refer to [20].

Fig. 2: Flowchart of the ProcessScan function of the GMapping algorithm, which is called every time a new scan is retrieved.

Using a profiling tool,² it was possible to verify that on average, 98.47% of the computation time of the ProcessScan function is spent in the scan matching step. This occurs because scan matching involves comparing the set of data returned from the Laser Range Finder (LRF) with the 2D map obtained thus far, which is a process that is repeated for all the N particles involved. Thus, it is executed many times during localization and involves several mathematical operations, representing a high computational burden for any SLAM algorithm.

² Callgrind, available in the Valgrind distribution: http://valgrind.org

In addition, it should be noted that scan matching only occurs when the distance traveled by the robot surpasses a predetermined linear threshold γ or an angular threshold θ (cf. Fig. 2), which are important parameters of the algorithm. Hence, we expect that frequent computation peaks will occur every time the condition is verified. This is shown in Fig. 3a, where a run of the algorithm on an Asus EeePC 901 netbook with N = 10, γ = 1.0 m and θ = 0.5 rad is depicted. While in Fig. 3a the algorithm is run with a low number of particles, in Fig. 3b we present a similar chart with N = 40. In this situation, one cannot distinguish the computational peaks, because scan matching takes too long. Therefore, when it is necessary to proceed with the next scan matches, the previous ones still have not finished processing, and consequently, the algorithm skips steps. This has a tremendous impact on localization and mapping, as shown later on. Thus, in this case the algorithm will only run in real-time if the robot moves slowly, giving the computer enough time to process laser scan matches.

Fig. 3: CPU load in the classical GMapping algorithm running on an Asus EeePC 901 with different number of particles N. (a) GMapping with N = 10. (b) GMapping with N = 40.

Hence, our preliminary study of the GMapping algorithm not only led us to parallelize the scan matching step, by distributing the N particles over the available processor resources, but also to keep the other steps as single threaded, since the overhead of parallelization together with their low computational demand would render the parallelization inefficient in these steps. In the next section we propose two distributed architectures for the GMapping RBPF SLAM in a solidary multi-robot system.

B. Distributed Architectures for GMapping

Two distinct distributed architectures were implemented based upon the GMapping algorithm so as to divide the laser scan matching step into several tasks, each of which is solved by a different node of the robotic cluster. These two architectures mainly differ in the way that they deal with the state of the system. The state (or context) is defined as all the necessary information shared within the system, which enables the external computing nodes to return the intended results.
More specifically, in our problem these include the particle data composed of estimated poses and maps, particle indexes and weight, laser scan matching parameters and sensor data.

Following this philosophy, the first architecture, deemed stateless, is analogous to a black box, where at any point in time the value of the outputs depends only on the value of the inputs after a certain processing time. In this architecture, the teammates that provide assistance run a remote process, whose context is totally independent of the main SLAM process running on the local robot. This enables multiple execution of tasks at the same time without any concurrency issue.

On the other hand, the second architecture follows a stateful approach. In such a system, at any point in time the value of the outputs depends on the value of the inputs and of an internal state. This is analogous to a state machine with memory, as the same set of inputs can generate different outputs depending on the previous inputs received by the system. In this architecture, the teammates that provide assistance share their context when the workload that comes from the robot running the main SLAM process is distributed. In this stateful system, the execution of tasks requires a synchronization point to access and update the internal state of the system in an exclusive way. Both architectures are described in more detail below, and their features are discussed.

Fig. 4: Stateless Architecture.

1) Stateless: Due to the uncertainty of the unknown environment being explored, there may be failures in the communication between robots, as well as changes in the available bandwidth of the wireless network. An ideal distributed application should be robust to dynamic changes in the topology of the system and comprise effective load balancing. In order to achieve the desired features, we followed a stateless approach (cf. Fig. 4) where the external computing nodes of the robotic cluster (remote workers) are totally independent of the main system, i.e. they do not maintain a common state or context.

The architecture uses a synchronous request-reply queue (workqueue), where each element in the queue contains the index of the particles requiring a laser scan match. Different requester threads are used to assign the computation of particles in the queue to each robot in the team, running remotely or locally. Thus, such K threads work as a dedicated interface between the SLAM algorithm and the K agents. In order to efficiently transfer particle data across the robotic cluster system, the requester threads build and send a compressed WorkPackage message to the remote workers. An additional synchronous request-reply (REQ-REP) queue is used in each thread to interface with the remote workers and communication is done via TCP. On the other hand, the local worker does not need a mechanism to transmit data structures since it has direct access to the shared memory of the main system. After receiving the message, each worker runs the scan matching step on the assigned particle and sends the response back to the corresponding requester thread, which updates the local structures with the data received. The response sent by the remote workers contains a partial map, which only includes the area of the map where changes occur.

The request-reply queues used have interesting features, such as always replying to the last client that sent a request. Also, a remote worker may process requests from various clients and the number of remote workers can change dynamically in the system. Additionally, this mechanism promotes load balancing since the threads that process the data faster will fetch data from the workqueue more often. This also takes into consideration the time to transfer data over the network when using remote workers. In terms of robustness to communication failures, the Requester thread can simply resend the index to the workqueue if it detects a timeout while sending data or waiting for the reply. The corresponding particle can then be reassigned and processed by an available worker (local or remote).

In Algorithm 1, an overview of the GMapping algorithm adapted from [20] is presented. Lines 5 to 23 (in red) contain the operations over the particle set that are distributed among the different workers in the system. In order to exchange information between computers, the following messages were used in the Stateless approach:

• WorkPackage: Message with the particle data, the parameters used in the laser scan matcher and the current sensor data (odometry and laser range finder). The particle contains a local map and the current pose.
• WorkReply: Reply message with the index of the processed particle, the new pose and new weights obtained from the laser scan matcher, the area on the map that has been modified and optionally the new size of the map, if it grew.
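The request-reply work queue described above can be illustrated with a minimal Python sketch that uses in-process threads in place of networked workers. It is not the implementation used in this work: the TCP transport and message compression are omitted, the scan matcher is replaced by a stub, and the names `requester`, `distribute`, `scan_match`, `make_flaky_worker` and `WorkerTimeout`, as well as the message fields, are assumptions made only for this example.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class WorkPackage:
    """Request sent to a worker (field names are illustrative)."""
    index: int
    pose: tuple
    scan: list


@dataclass
class WorkReply:
    """Reply returned by a worker (field names are illustrative)."""
    index: int
    pose: tuple
    weight: float


class WorkerTimeout(Exception):
    """Stands in for a timeout while sending data or waiting for a reply."""


def scan_match(package):
    """Stub for the scan matcher: shifts the pose and returns a dummy weight."""
    x, y, theta = package.pose
    return WorkReply(package.index, (x + 0.5, y, theta), 1.0)


def make_flaky_worker(fail_on_call):
    """Build a worker that times out on its `fail_on_call`-th request."""
    calls = 0

    def worker(package):
        nonlocal calls
        calls += 1
        if calls == fail_on_call:
            raise WorkerTimeout()
        return scan_match(package)

    return worker


def requester(workqueue, worker, particles, scan, results, lock):
    """One requester thread: the interface between the queue and one worker."""
    while True:
        with lock:
            if len(results) == len(particles):
                return  # every particle has been processed
        try:
            index = workqueue.get(timeout=0.01)
        except queue.Empty:
            continue  # an index may still be handed back by another requester
        try:
            reply = worker(WorkPackage(index, particles[index], scan))
        except WorkerTimeout:
            workqueue.put(index)  # hand the particle back for reassignment
            return  # simplification: this worker is treated as unreachable
        with lock:
            results[reply.index] = reply


def distribute(particles, scan, workers):
    """Process all particles with one requester thread per worker.

    The first worker is assumed to be the reliable local one, which
    guarantees termination in this sketch.
    """
    workqueue = queue.Queue()
    for index in range(len(particles)):
        workqueue.put(index)
    results, lock = {}, threading.Lock()
    threads = [
        threading.Thread(
            target=requester,
            args=(workqueue, worker, particles, scan, results, lock),
        )
        for worker in workers
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    particles = [(float(i), 0.0, 0.0) for i in range(8)]
    workers = [scan_match, make_flaky_worker(2)]
    replies = distribute(particles, [1.0, 2.0], workers)
    print(sorted(replies))
```

Because each requester fetches a new index only after its worker has replied, faster workers take more particles from the queue, which is the load-balancing behaviour described in the text. A particle whose request times out is simply put back in the queue and picked up by another requester.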

Algorithm 1 GMapping algorithm, adapted from [20]. The steps in red refer to the algorithm’s distribution.

Data: $S_{t-1}$, the sample set of the previous time step; $z_t$, the most recent laser scan; $u_{t-1}$, the most recent odometry measurement
Result: $S_t$, the new sample set

 1  $S_t = \{\}$
 2  forall $s_{t-1}^{(i)} \in S_{t-1}$ do
 3      $\langle x_{t-1}^{(i)}, w_{t-1}^{(i)}, m_{t-1}^{(i)} \rangle = s_{t-1}^{(i)}$
        // scan-matching
 4      $x_t'^{(i)} = x_{t-1}^{(i)} \oplus u_{t-1}$
 5      $\hat{x}_t^{(i)} = \mathrm{argmax}_x \, p(x \mid m_{t-1}^{(i)}, z_t, x_t'^{(i)})$
 6      if $\hat{x}_t^{(i)}$ = failure then
 7          $x_t^{(i)} \sim p(x_t \mid x_{t-1}^{(i)}, u_{t-1})$
 8          $w_t^{(i)} = w_{t-1}^{(i)} \cdot p(z_t \mid m_{t-1}^{(i)}, x_t^{(i)})$
 9      else
10          for $k = 1, \ldots, K$ do
11              $x_k \sim \{x_j \mid |x_j - \hat{x}^{(i)}| < \Delta\}$
12          end
            // compute Gaussian proposal
13          $\mu_t^{(i)} = (0, 0, 0)$, $\eta^{(i)} = 0$
14          forall $x_j \in \{x_1, \ldots, x_K\}$ do
15              $\mu_t^{(i)} = \mu_t^{(i)} + x_j \cdot p(z_t \mid m_{t-1}^{(i)}, x_j) \cdot p(x_t \mid x_{t-1}^{(i)}, u_{t-1})$
16              $\eta^{(i)} = \eta^{(i)} + p(z_t \mid m_{t-1}^{(i)}, x_j) \cdot p(x_t \mid x_{t-1}^{(i)}, u_{t-1})$
17          end
18          $\mu_t^{(i)} = \mu_t^{(i)} / \eta^{(i)}$, $\Sigma_t^{(i)} = 0$
19          forall $x_j \in \{x_1, \ldots, x_K\}$ do
20              $\Sigma_t^{(i)} = \Sigma_t^{(i)} + (x_j - \mu^{(i)})(x_j - \mu^{(i)})^T \cdot p(z_t \mid m_{t-1}^{(i)}, x_j) \cdot p(x_j \mid x_{t-1}^{(i)}, u_{t-1})$
21          end
22          $\Sigma_t^{(i)} = \Sigma_t^{(i)} / \eta^{(i)}$
            // sample new pose
23          $x_t^{(i)} \sim \mathcal{N}(\mu_t^{(i)}, \Sigma_t^{(i)})$
            // update importance weights
24          $w_t^{(i)} = w_{t-1}^{(i)} \cdot \eta^{(i)}$
25      end
        // update map
26      $m_t^{(i)} = \mathrm{integrateScan}(m_{t-1}^{(i)}, x_t^{(i)}, z_t)$
        // update sample set
27      $S_t = S_t \cup \{\langle x_t^{(i)}, w_t^{(i)}, m_t^{(i)} \rangle\}$
28  end
29  $N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{w}^{(i)})^2$
30  if $N_{eff} < T$ then
31      $S_t = \mathrm{resample}(S_t)$
32  end

2) Stateful: Despite the interesting features of the Stateless architecture, it heavily relies on the exchange of data across the network, especially local gridmaps, which represent large data structures, as opposed to sensor data or scan matching parameters. Hence, a second architecture was implemented to reduce data exchange by maintaining a common state between the distinct nodes of the robotic cluster. For this reason, the architecture proposed herein is designated as Stateful.

The Stateful architecture distributes the different particles used in the GMapping algorithm between the existing remote workers and keeps a local synchronized copy at the same time (the Particle Container). This design choice allows the system to recover from faults in the network by processing data locally in such cases, and only send data to the remote workers when they are again available. In addition, keeping a local copy of all particles also simplifies the algorithm when the resampling step occurs, i.e. having all the particles, the main system can keep the trajectory tree updated and send the missing particles to the remote workers when resampling; thus there is no need to exchange information between the different workers in the system.

Instead of using synchronous request-reply queues as in the Stateless approach, in this architecture the data is broadcast to the local and remote workers, using a Publisher-Subscriber (PUB-SUB) queue. Like before, the local worker thread interacts directly with the main system, while the communication with the remote workers is done via TCP. In order to keep the particles synchronized, the remote workers send the result of the laser scan matching step to the main system as soon as it becomes available using a Push-Pull queue. A Recipient Thread in the main system receives the processed particles and updates the Particle Container, hence guaranteeing the consistency of the information in the whole system. The algorithm follows the execution sequence illustrated in Fig. 5.

Fig. 5: Stateful Architecture.

Some similarities to the Stateless architecture exist, the main differences being related with maintaining a common state and contextualizing the particles processed in all modules of the system. As mentioned previously, after the scan matching process there is a need for synchronization. This is also true for the resampling process, which is shown in the diagram of Fig. 6. On the one hand, if there is no need to resample, the main system sends an empty message to the remote worker to carry on with the algorithm execution. On the other hand, if there is a need for resampling, the main system executes this step and sends the updated particle indexes to the remote workers. In order to maintain consistency in the whole system, the remote workers may ask for missing particles that they do not possess after the resampling step. In this situation, the main system will send a compressed message with the missing particles to the remote workers. This mechanism guarantees that all the workers in the system become synchronized.

Despite drastically reducing the network load, maintaining a Stateful architecture comes at a cost. In this architecture there are no means for automatic load balancing. The workload, i.e. the number of particles, is distributed a priori and it does not adapt to the worker’s processing capabilities or network latency. Another feature that is lost is the ability to use the same remote workers for different clients. To exchange information between computers, the following messages were used in the Stateful approach:
•

• StartPackage: Initial message sent to the remote workers. It contains all the parameters needed for the laser scan matcher and the motion model used, and is sent only at the beginning of the algorithm.
• Sensordata: Message with the sensor data sent in each iteration of the algorithm. It contains updated pose data for every particle and data from the LRF.
• WorkReply: Reply message with the index of the processed particle, the new pose and new weights obtained from the laser scan matcher, the area on the map that has been modified and, optionally, the new size of the map, if it grew.
• IndexMessage: Message sent by the main system after the Resampling process. It contains the indexes of all valid particles and is sent to every remote worker.
• ResampleMessage: Message used as a reply to the IndexMessage. It contains the indexes of the particles that the remote worker is missing.
• Particles: Message sent by the main system after a ResampleMessage. It contains the particles' data (map, pose and weights).

Fig. 6: Exchange of messages in the Resampling process of the Stateful Architecture.
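The resampling exchange described by the IndexMessage, ResampleMessage and Particles messages can be illustrated with a minimal sketch. It is not taken from the authors' implementation: the class fields, the `handle_index_message` function and the idea that each remote worker owns a fixed list of particle slots are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IndexMessage:
    """Sent by the main system after resampling (field name is hypothetical).

    valid_indexes[slot] is the index of the surviving particle that should
    occupy that slot of the new particle set.
    """
    valid_indexes: List[int]


@dataclass
class ResampleMessage:
    """Reply of a remote worker: particle indexes it needs but does not hold."""
    missing_indexes: List[int]


def handle_index_message(
    msg: Optional[IndexMessage],
    slots: List[int],
    store: Dict[int, str],
) -> Optional[ResampleMessage]:
    """Worker-side reaction to the message received after the resampling check.

    msg   -- None models the empty message (no resampling took place).
    slots -- assumed: the slots of the particle set this worker processes.
    store -- assumed: particles currently held by the worker, keyed by index.
    """
    if msg is None:
        # Empty message: nothing to synchronize, carry on with the algorithm.
        return None
    needed = {msg.valid_indexes[slot] for slot in slots}
    missing = sorted(needed - set(store))
    # The main system would answer this with a Particles message.
    return ResampleMessage(missing_indexes=missing)
```

For instance, a worker that owns slots 3 to 5 and is told that slot 3 must now hold a copy of particle 2 would request particle 2 only, since it already holds particle 5.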

The implementation of the above architectures was done using the Zero Message Queue (ZeroMQ) library,3 which provides different types of queues and supports several types of transport. We freely provide an open-source implementation of both architectures for distributing the GMapping SLAM approach.4

3 http://zeromq.org
4 Available at: https://github.com/brNX/gmapping-stateless, and: https://github.com/brNX/gmapping-stateful
5 http://kaspar.informatik.uni-freiburg.de/∼slamEvaluation/datasets.php

IV. RESULTS AND DISCUSSION

In this section, experimental results are presented and discussed. We analyze the efficiency gain, the effect of limited network bandwidth, and the localization and mapping performance of the proposed architectures. With this in mind, we have tested 3 datasets typically used in SLAM benchmarking: ACES Building, Intel Research Lab, and MIT CSAIL Building,5 and conducted further experiments with physical teams of mobile robots to validate the system in a real world scenario.

The experimental tests that were run on the SLAM benchmarking datasets make use of the two distinct computer architectures depicted in Table I. The EeePC runs the main SLAM system (local worker) and Odroid X2 single-board computers are used as remote workers. These were chosen in order to validate the proposed approaches using heterogeneous processors with distinct computation power, while having a main system that is very limited in terms of computation power. In these experiments, the computers were connected via Ethernet to a 100 Mbps switch.

TABLE I: Specifications of the Asus EeePC netbook and the Odroid X2 single-board computer used in the experiments.

Computers Used | Asus EeePC 901 | Odroid X2
CPU | Intel Atom N270 with Hyper-Threading | Exynos4412 Prime ARM Cortex-A9 Quad Core
# Cores | 1 Physical (2 Virtual) Cores | 4 Physical Cores
Clock Speed | 1.6 GHz | 1.7 GHz
Cache L2 | 512 kB | 1 MB
RAM | 1 GB (DDR2) | 2 GB (DDR2)
Storage | 20 GB SSD | 16 GB, SD card
Operating System | Ubuntu 12.04, 32 bit | Ubuntu-based Linaro 12.11, 32 bit
Approximate Cost | $219 | $135

Several different configurations were tested. The number of particles of the GMapping algorithm was increased from N = 30 to N = 120 in steps of 30, stopping earlier when an "out of memory" error occurred in the Asus EeePC 901. The total number of local and remote workers in the system, K, was increased up to a maximum of 5. Furthermore, for each configuration we ran 5 trials, resulting in a total of 460 trials, which lasted 216.2 hours (≈ 9 days), with the following combinations: 1) dataset = {ACES Building, Intel Research Lab, MIT CSAIL Building}, 2) N = {30, 60, 90, 120}, 3) architecture = {Classical, Stateless, Stateful}, 4) K = {1, 2, 3, 4, 5}.

TABLE II: Average Laser Scan Processing Times (in seconds) over 5 trials for each different combination {dataset, N, method, K}.

Dataset | N | Classical, K=1 | Stateless, K=2 | Stateless, K=3 | Stateless, K=4 | Stateless, K=5 | Stateful, K=2 | Stateful, K=3 | Stateful, K=4 | Stateful, K=5
ACES Building, Austin (≈ 56×58 m) | 30 | 1.417 | 1.437 | 1.430 | 1.417 | 1.438 | 0.616 | 0.806 | - | -
ACES Building, Austin (≈ 56×58 m) | 60 | 2.797 | 2.753 | 2.697 | 2.698 | 2.687 | 1.053 | 0.864 | 1.035 | 1.159
ACES Building, Austin (≈ 56×58 m) | 90 | 4.226 | 4.082 | 4.005 | 3.957 | 3.966 | 1.825 | 1.507 | 5.500 | -
Intel Research Lab, Seattle (≈ 28.5×28.5 m) | 30 | 1.557 | 1.406 | 1.239 | 1.129 | 1.077 | 0.607 | 0.713 | - | -
Intel Research Lab, Seattle (≈ 28.5×28.5 m) | 60 | 3.102 | 2.798 | 2.432 | 2.210 | 2.082 | 1.109 | 0.702 | 0.795 | 0.788
Intel Research Lab, Seattle (≈ 28.5×28.5 m) | 90 | 4.653 | 4.164 | 3.601 | 3.269 | 3.065 | 1.882 | 1.068 | 0.831 | 0.949
Intel Research Lab, Seattle (≈ 28.5×28.5 m) | 120 | 6.129 | 5.512 | 4.767 | 4.333 | 4.055 | 2.604 | 1.469 | 1.098 | 0.898
MIT CSAIL Building, Cambridge (MA) (≈ 61×46.5 m) | 30 | 3.020 | 2.047 | 1.607 | 1.366 | 1.247 | 1.218 | 1.492 | - | -
MIT CSAIL Building, Cambridge (MA) (≈ 61×46.5 m) | 60 | 6.007 | 4.098 | 3.110 | 2.626 | 2.347 | 2.141 | 1.482 | 1.750 | 1.840
MIT CSAIL Building, Cambridge (MA) (≈ 61×46.5 m) | 90 | 8.939 | 6.099 | 4.622 | 3.870 | 3.453 | 3.624 | 2.227 | 1.898 | 2.265
MIT CSAIL Building, Cambridge (MA) (≈ 61×46.5 m) | 120 | 11.942 | 8.086 | 6.158 | 5.173 | 4.584 | 4.973 | 2.912 | 2.352 | 2.138

A. Efficiency Gain

In parallel computing, it is common to use the speedup metric to measure how fast a parallel algorithm is when compared to the corresponding sequential algorithm. The speedup υ is defined as:

$$\upsilon = \frac{T_{Class}}{T_{Dist}}, \qquad (3)$$

where $T_{Class}$ represents the total processing time of the classical method, and $T_{Dist}$ the total processing time of a distributed method in the same exact conditions. In this analysis, we start by looking into the average time taken by the laser scan processing step of the SLAM algorithm. To this end, we have used the Classical single processing GMapping approach, and the Stateless and the Stateful architectures proposed herein. Table II presents the average time in seconds over the 5 trials with every distinct configuration. Note that there are no results for N = 90 with K = 5 in the Stateful approach, and N = 120 in the ACES Building,

due to the 1 GB memory limitation in the Asus EeePC 901. In all trials, the main system together with the local worker ran on the Asus EeePC and, for K > 1, the remote workers ran on K − 1 Odroid X2 boards. Although automatic load balancing occurs in the Stateless architecture, this is not true for the Stateful architecture. Thus, for the latter, 10 particles were always assigned to the main system, while the remaining particles were equitably distributed over the K − 1 remote workers. This value was conservatively chosen to guarantee that no scan matching steps are skipped in the main system running on the Asus EeePC, and we leave the optimization of particle assignment in the Stateful architecture as future work. In order to avoid having fewer particles running on the remote workers than on the local worker, for N = 30 in the Stateful architecture we did not run experiments with K > 3, as seen in Table II.

The first immediate observation is that scan processing usually takes less time in the Stateful architecture than in the Stateless one. Despite being more complex and requiring the synchronization of the particles in all the available workers, the Stateful architecture greatly reduces the communication overhead imposed by the Stateless approach. As a consequence, its efficiency is superior. By defining an acceleration factor υSM as the speedup obtained considering only the scan matching (SM) step of the algorithm, it is shown that in the ACES dataset the Stateless architecture is highly inefficient, providing no acceleration with K > 1. This happens due to the size of the map, which leads to the allocation of large memory blocks and a non-negligible communication overhead. In the other datasets, the approach accelerates the scan matching processing time by up to a factor of 1.52 (Intel) and 2.60 (CSAIL). The maximum acceleration factor observed in the Stateful architecture was 3.24 (ACES), 6.82 (Intel) and 5.58 (CSAIL).
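As a worked example, the acceleration factors quoted above can be recomputed from the Intel Research Lab rows of Table II. The sketch below is only illustrative: the dictionary layout and function names are assumptions, and because the inputs are the rounded table entries, the results agree with the quoted factors only to about two decimal places.

```python
from typing import Dict, List, Optional

# Intel Research Lab rows of Table II (average scan processing time, seconds).
CLASSICAL: Dict[int, float] = {30: 1.557, 60: 3.102, 90: 4.653, 120: 6.129}

# For each N, the times for K = 2, 3, 4, 5 (None where no experiment was run).
STATELESS: Dict[int, List[Optional[float]]] = {
    30: [1.406, 1.239, 1.129, 1.077],
    60: [2.798, 2.432, 2.210, 2.082],
    90: [4.164, 3.601, 3.269, 3.065],
    120: [5.512, 4.767, 4.333, 4.055],
}
STATEFUL: Dict[int, List[Optional[float]]] = {
    30: [0.607, 0.713, None, None],
    60: [1.109, 0.702, 0.795, 0.788],
    90: [1.882, 1.068, 0.831, 0.949],
    120: [2.604, 1.469, 1.098, 0.898],
}


def acceleration(t_class: float, t_dist: float) -> float:
    """Ratio between the classical and the distributed processing time."""
    return t_class / t_dist


def max_acceleration(distributed: Dict[int, List[Optional[float]]]) -> float:
    """Largest acceleration factor over all tested (N, K) combinations."""
    factors = [
        acceleration(CLASSICAL[n], t)
        for n, times in distributed.items()
        for t in times
        if t is not None
    ]
    return max(factors)


if __name__ == "__main__":
    print(f"Stateless: {max_acceleration(STATELESS):.3f}")
    print(f"Stateful:  {max_acceleration(STATEFUL):.3f}")
```

In both architectures the maximum occurs at the larger particle counts, which matches the trend that the gain grows with the processing load.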

Fig. 7: Speedup using the proposed architectures. [Two plots, one for the Stateless Architecture and one for the Stateful Architecture: speedup (υ) versus number of particles (N = 30, 60, 90, 120), with one curve per K = 2, 3, 4, 5 for each of the ACES, Intel and CSAIL datasets.]

Some relevant global trends can also be observed. For instance, in the CSAIL dataset the processing time is larger than in the other datasets. This happens due to the resolution of the laser scan messages: in the CSAIL dataset there are 360 beams in each scan, while in the other datasets there are only 180. In addition, for both architectures the scan processing time generally drops with the number of remote workers, due to the
progressive decrease of computational load on each worker when more processors are involved. However, in the Stateful architecture this is not necessarily true, especially when there are too many workers K for few particles N , i.e. when the ratio N/K is low. In such situations, since each worker only has a small portion of the total number of particles, when resampling occurs the probability of the remote workers to ask for missing particles increases. Waiting for the missing particles to be sent over the network has an impact on the time to process the scans, and clearly delays in the system increase with the number of particles requested. Hence, deceleration even occurs if the Stateful architecture is not run within specific intervals of N/K . As a consequence of the above facts, it is possible to run a distributed method with several more particles than the classical method, thus gaining in localization and mapping quality. A clear example of this is shown in the results for the Intel Research Lab, where the classical method with 60 particles takes approximately the same amount of time to process the laser scans as the Stateless architecture method with 90 particles and 4 workers, or the Stateful architecture with 120 particles and 2 workers. Furthermore, since computation sharing accelerates the scan matching step it enables the robot that is collecting data to move faster. When moving at higher speeds, the algorithm reaches the γ or the θ threshold quicker. Performing SLAM with GMapping in real-time without dropping data is possible as long as the processing time is shorter than the time between the thresholding condition is verified. In the analysis presented before, we have only addressed the time to process laser scans. However, it is important to understand how the overall time to run the algorithm has been accelerated by distributing the scan matching step. In Fig. 
7, the evolution of the speedup metric υ of the overall approach with an increasing number of particles is shown for both architectures presented in this article. The illustrated curves confirm that the efficiency gain is generally higher when using the Stateful approach, for which υ ≤ 6.641, whereas the Stateless architecture presents lower speedup values, υ ≤ 2.615. Moreover, speedup tends to increase in general with the processing load, i.e. the number of particles N. This is common in distributed computation because the overhead of parallelization, such as the creation and management of threads, is approximately the same regardless of the size of the problem. Therefore, the ratio between overhead and workload decreases when the processing load grows. As a result, the efficiency tends to increase with the size of the problem. This is more visible in the Stateful approach, especially when the resampling step does not require many missing particles to be transferred. In the Stateless architecture, this is not so evident due to the heavy dependence on the exchange of data across the network.

Fig. 8: Speedup with limited bandwidth in MIT CSAIL (N = 60, K = 3). [Bar chart: speedup of the Stateless and Stateful architectures for the 150 Mbps, 100 Mbps, 75 Mbps and 54 Mbps settings.]

B. Limiting Network Bandwidth

In a distributed system such as the one considered herein, all processing resources are connected via a communication network. In the above analysis, the network bandwidth was only limited by the available hardware, i.e. a 100 Mbps switch. However, in order to understand the impact of the communication quality on the proposed architectures, additional tests were made by limiting the network bandwidth using WiFi. In these additional tests, we used the MIT CSAIL dataset with N = 60 and K = 3, and analyzed both the Stateless and the Stateful architectures with WiFi bandwidth limits of 150 Mbps (802.11n, 40 MHz), 100 Mbps (the same specification as the previous one, but limited by software), 72.2 Mbps (802.11n, 20 MHz), and 54 Mbps (802.11g). All configurations were tested in 5 different trials, and the resulting speedup is depicted in Fig. 8.

The chart shows that both tested approaches can operate in different network conditions and that, in both architectures, the speedup decreases progressively with the available bandwidth. However, the slope is much smoother in the Stateful architecture, i.e. the Stateless architecture is much more affected by the available bandwidth than the Stateful one. This occurs due to the greater exchange of data among workers in the network of the Stateless architecture. In order to demonstrate these differences in data exchange, consider the two scenarios given in the rows of Table III, in which the Stateful architecture was run with the parameters reported.
Every time there is a laser scan match in either architecture, it is necessary to communicate the laser scan (180 values of 32-bit floats) and a robot pose (4 values of 32-bit floats for x, y, z, ψ) to the remote worker, which corresponds to 736 bytes. However, in the Stateless architecture, in addition to the scan matcher settings, it is also necessary to send a map for each particle transmitted, which represents the largest data structure communicated by the proposed architectures and can be arbitrarily large. In the case of the ACES Building, the average compressed map size exchanged was around 70 kB, and for the Intel Research Lab map, it was around 40 kB. These are exchanged NRW times for every scan matching step, where NRW represents the mean number of particles that are processed on the remote worker. On the other hand, in the Stateful architecture, the scan matcher settings are only sent once at the beginning, and maps are only seldom exchanged during the resampling step (see Table III). Additional information in synchronization messages is also exchanged in the Stateful approach; however, its size (several tens of bytes) is negligible when compared to the map size. Since the output of the scan matching step running on the remote workers is the same for both architectures (see the description of the WorkReply messages), it can be easily seen that the Stateless architecture leads to a much higher communication load than its counterpart. In the tests above, and assuming a conservative ratio NRW/N = 1/3, the Stateless architecture would exchange around 620 MB (30 × 40 kB × 517 − 40 kB × 4) more than the Stateful architecture in the Intel map, and around 2.6 GB (40 × 70 kB × 930 − 70 kB × 3) more in the ACES building.

TABLE III: Test cases for data exchange over the network.

Dataset | K | N | scan matches | resampling | map exchanges in resampling
ACES Building | 2 | 90 | 930 | 9 | 3
Intel Research Lab | 2 | 120 | 517 | 12 | 4

Fig. 9: Maps obtained and translational errors in the Intel Research Lab dataset with N = 90.
(a) Classical Approach: $\bar{\epsilon}_{trans}$ = 0.065 ± 0.381 (m), $\bar{\epsilon}_{rot}$ = 2.809 ± 6.539 (deg).
(b) Stateless Approach, K = 2: $\bar{\epsilon}_{trans}$ = 0.031 ± 0.048 (m), $\bar{\epsilon}_{rot}$ = 2.060 ± 2.003 (deg).
(c) Stateful Approach, K = 2: $\bar{\epsilon}_{trans}$ = 0.028 ± 0.041 (m), $\bar{\epsilon}_{rot}$ = 2.039 ± 2.017 (deg).
(d) Translational error, $\epsilon_{trans}$, over the pose relations (# Relations) using the three methods (logarithmic scale).

C. Performance Analysis

Due to the acceleration observed in the scan processing step, it has been shown that the proposed architectures are able to handle a number of particles that cannot necessarily be handled in the classical method without dropping data, as shown previously in Fig. 3. Hence, the performance obtained with a robotic clustering architecture, in terms of localization accuracy and mapping precision, is expected to be superior in situations where the classical approach skips scan matching steps, and at least equal when this is not the case.
This is clearly noticed in the Intel Research Lab dataset with N = 90 particles (cf. Fig. 9). Despite the evident increase in mapping quality, visual inspection of the resulting maps does not allow a detailed comparison, so a quantitative measure is needed to evaluate the results precisely. In this work, we make use of the benchmarking metric presented in [25] to assess the impact of computation sharing on SLAM performance. This metric evaluates the accuracy of the poses in robot trajectories during data acquisition. It uses only relative relations between poses and does not rely on a global reference frame, which even allows the comparison of algorithms with different estimation techniques and sensor modalities. Moreover, using this metric, the error is not cumulative; instead, it is isolated by comparing the relative displacement between estimated poses of the robot with the corresponding relation between the ground truth poses, which were manually measured by the authors in [25] for commonly used datasets.6 The mean translational error, $\bar{\epsilon}_{trans}$, and the mean angular error, $\bar{\epsilon}_{rot}$, with the corresponding standard deviation were calculated for the maps generated by all three approaches (cf. Fig. 9a, 9b and 9c). Additionally, in the chart of Fig. 9d, one can see the evolution of the translational error over the different pose relations in a logarithmic scale. Results clearly demonstrate that the errors of the proposed architectures are lower than those of the classical method, both for the translational and rotational error. The peaks in the evolution chart show situations where relations measured by the classical method have high errors, which explains why the map generated was only consistent locally, but tends to be inconsistent as a whole. Additionally, the maps generated by the two proposed architectures yield very similar results, with greatly reduced error measurements.
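The idea behind the relation-based error described above can be sketched as follows. This is a minimal illustration rather than the evaluation tool of [25]: it assumes planar poses (x, y, θ), reports mean absolute errors, and all names (`relative`, `relation_errors`, the tuple layout of a relation) are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]          # (x, y, theta) in metres and radians
Relation = Tuple[int, int, Pose]           # (i, j, reference displacement from i to j)


def wrap(angle: float) -> float:
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def relative(a: Pose, b: Pose) -> Pose:
    """Pose b expressed in the frame of pose a (inverse of a composed with b)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = math.cos(a[2]), math.sin(a[2])
    return (c * dx + s * dy, -s * dx + c * dy, wrap(b[2] - a[2]))


def relation_errors(estimated: Sequence[Pose],
                    relations: List[Relation]) -> Tuple[float, float]:
    """Mean translational (m) and rotational (deg) error over all relations."""
    trans, rot = [], []
    for i, j, reference in relations:
        # Displacement between the two estimated poses of the trajectory.
        delta = relative(estimated[i], estimated[j])
        # Residual between the estimated and the reference displacement.
        err = relative(reference, delta)
        trans.append(math.hypot(err[0], err[1]))
        rot.append(abs(err[2]))
    return mean(trans), math.degrees(mean(rot))
```

Because only displacements between pairs of poses are compared, an early localization error does not contaminate the error of later relations, which is the non-cumulative property mentioned in the text.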
A further aspect related to the performance of both architectures is the CPU load imposed on the remote workers, which determines how much capacity remains for them to perform additional tasks. To analyze this, we measured CPU usage in each remote worker using the Intel Research Lab and the ACES Building datasets with K = 3 and N = 60. Results for the CPU load are depicted in Table IV. As can be seen, the Stateful architecture leads to lower mean values and higher standard deviation, while at the same time reaching higher values of maximum CPU load. In this approach, there is less overhead: since maps are rarely sent, there is no need to wait for encapsulation, encoding or decoding of map data. Also, each remote worker immediately processes the assigned particles as a whole, which leads to several processing peaks over time. In the Stateless architecture, the remote workers fetch single particles from the work queue. Therefore, in this approach the remote workers are not fed as quickly, since they have to wait for each particle to arrive, as particles also contain local maps. They only receive the next particle after processing and sending back the previous one. As a consequence, in the Stateless architecture, the computation load in the remote workers tends to be more constant (lower standard deviation), and computation instants are also more frequent, leading to fewer processing peaks (lower values of maximum CPU load).

6 For more information on the relations metric please refer to http://kaspar.informatik.uni-freiburg.de/∼slamEvaluation/index.php.

TABLE IV: CPU load in the remote workers. Mean (x̄), standard deviation (σ), and maximum (max(x)) percentage (%) values.

Dataset, architecture | x̄ | σ | max(x)
ACES Building, Stateless | 4.23 | 4.74 | 16.85
ACES Building, Stateful | 3.42 | 6.97 | 36.65
Intel Research Lab, Stateless | 4.47 | 5.00 | 18.65
Intel Research Lab, Stateful | 3.53 | 6.66 | 26.50

Fig. 10: Real world Results. a) Experimental scenario (12.85 × 9.15 m). b) Ground Truth map. c) Robots used. d) Map obtained with the Stateful architecture (N = 30, K = 3). e) Map obtained with the classical approach (N = 30).

Results have shown that both distributed architectures lead to a significant efficiency gain in a SLAM task, being robust to network limitations and enabling performance gains. The Stateless architecture has the advantage of being simple and requiring a smaller amount of memory in the main system when compared to the Stateful architecture. It also encompasses automatic load balancing, so there is no need to assign a fixed workload to each worker before run-time, and the approach adapts to the computer architectures used in the system. However, these advantages alone do not lead to superior efficiency, due to the excessive exchange of data between different processors in this architecture, which limits the attained speedup.

On the other hand, the Stateful architecture generally leads to a higher efficiency gain, provided that it is properly dimensioned both in terms of N/K and computation load, since it does not provide automatic load balancing. Furthermore, it is on average much faster than the Stateless architecture, since it limits the flow of data in the network. This comes at the cost of requiring more memory in the main system, and occasional delays due to the transmission of particles to the remote workers. For this reason, ideally the robot that is collecting data should not move at very high speeds during the resampling process when the latter approach is used.
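Since the Stateful architecture has no automatic load balancing, the a priori workload split used in the benchmark experiments (10 particles on the main system, the rest spread over the K − 1 remote workers) can be sketched as below. The function name, the `local_share` parameter and the way a remainder is handed out are assumptions for illustration; the text only states that the remaining particles were distributed equitably.

```python
from typing import List


def assign_particles(n: int, k: int, local_share: int = 10) -> List[int]:
    """Static split of n particles over k workers.

    Returns a list whose first entry is the number of particles kept by the
    local worker (main system) and whose remaining k - 1 entries are the
    numbers given to the remote workers.
    """
    if n < 0 or k < 1:
        raise ValueError("n must be non-negative and k at least 1")
    if k == 1:
        # Single-processor case: everything stays on the main system.
        return [n]
    local = min(local_share, n)
    base, extra = divmod(n - local, k - 1)
    # Assumption: any remainder goes to the first remote workers, one each.
    remote = [base + (1 if w < extra else 0) for w in range(k - 1)]
    return [local] + remote


if __name__ == "__main__":
    for n, k in [(60, 3), (90, 4), (30, 4)]:
        print(n, k, assign_particles(n, k))
```

With N = 30 and K = 4 the remote workers would receive fewer particles than the local worker, which is the situation the experiments avoided by not running N = 30 with K > 3.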

D. Real world Experiment with a team of Robots

Interesting advantages of using distributed SLAM architectures have been demonstrated with datasets regularly used in the Robotics community. However, it is crucial to validate the work in real world robotic clusters, where the computation resources are mobile, the wireless connectivity is unpredictable and yet the system must perform in real-time. With this in mind, a large arena was built in a classroom of the Department of Electrical and Computer Engineering of the University of Coimbra, which is shown in Fig. 10a, and a ground truth map of the arena was designed as a reference (cf. Fig. 10b). The robots used to perform the experiment were a team of three ROS-enabled iRobot Roombas, each equipped with an Asus EeePC 901 with very limited processing capabilities and a Hokuyo URG-04LX LRF, as shown in Fig. 10c.

The experimental test consisted of having the three robots explore the scenario in leader-follower motion. The robot in the front, i.e. the leader, acquired sensor data and ran a distributed version of the GMapping algorithm using the Stateful architecture with N = 30 particles, while being assisted by the other two robots, i.e. the followers, which made their processing resources available, thus serving as remote workers. The particles were equally distributed among the K = 3 workers. The robots made use of their onboard 802.11b/g/n wireless card for communication, and moved at a maximum linear speed of 0.4 m/s and a maximum angular speed of 0.5 rad/s.

The map generated in real-time using the Stateful architecture is illustrated in Fig. 10d. As can be seen, the map is generally consistent and resembles the ground truth reference, with some glitches. In order to be able to compare the generated map with the classical approach, all sensor data were recorded during the experiment. This allowed us to run the classical GMapping approach offline with N = 30 in the Asus EeePC 901, fed with the exact same sensor data. The resulting map is presented in Fig. 10e. Once again, the advantage of distributing the computation load of the SLAM algorithm becomes clear by comparing the consistency of both maps.7 In these tests, it was not possible to conduct a numeric comparison, since there is no ground truth of relations between poses of the robot during the experiment.

V. CONCLUSION

This work presents two distributed architectures based on lightweight, loosely coupled message queues to speed up a 2D RBPF SLAM approach widely used by the Robotics community. The proposed architectures enable a solidary multi-robot system to share its limited computation resources in order to solve a complex SLAM problem without depending on connections to an outside network, thus following a robotic cluster philosophy. This leads to a balanced utilization of resources, reducing battery usage and response time in members of the team, and also fosters team scalability by avoiding communication bottlenecks to a central entity.

Several experimental tests were conducted using very diverse configurations in common benchmark datasets, which demonstrated that processing time drastically decreases when the computational resources are used efficiently. In addition, the impact of limited network bandwidth and the robustness to it were analyzed, and an increase in localization and mapping performance was shown when the number of particles cannot be fully handled by the single processing solution. Advantages and disadvantages of both architectures have been discussed, and a real world testbed was described to show the potential of using the proposed philosophy in teams of physical robots.

Higher speedup values than those obtained in this work could eventually be reached using more powerful processors. However, this is outside the scope of the work, as we set out to use economical, small-sized and limited processing components frequently used in robotic platforms. In the future, it would be interesting to adapt the load distribution online in the Stateful approach, e.g. by taking into consideration the history of processed particles by each worker in the system, and also to specify the intervals of N/K where optimized performance of the approach can be attained. Furthermore, despite not being tested in this paper, the authors suggest that similar architectures can be used in the future to speed up any SLAM algorithm that relies on laser scan matching. This is the case of Particle and Kalman filter-based approaches, which would require few modifications in terms of the associated data structures.

7 A video of the experiments is available at: http://isr.uc.pt/∼davidbsp/videos/ieeet-ase

REFERENCES

[1] J. J. Kuffner, “Cloud-Enabled Robots”. In IEEE-RAS International Conference on Humanoid Robots, Nashville, TN, 2010.
[2] A. Marjovi, S. Choobdar, and L. Marques, “Robotic clusters: Multi-robot systems as computer clusters: A topological map merging demonstration”, Robotics and Autonomous Systems, 60 (9), pp. 1191-1204, 2012.
[3] S. Thrun, “Robotic mapping: A survey”. In Exploring artificial intelligence in the new millennium, Morgan Kaufmann Publisher, 2002.
[4] H. Durrant-Whyte and T. Bailey, “Simultaneous Localization and Mapping: Part I”, in IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99-110, 2006.
[5] P. Agarwal, G.D. Tipaldi, L. Spinello, C. Stachniss, W. Burgard, “Robust Map Optimization using Dynamic Covariance Scaling”. In Proc. of the Int. Conf. on Robotics and Automation (ICRA 2013), Karlsruhe, Germany, May 6-10, 2013.
[6] G. Dissanayake, P. Newman, S. Clark, H.F. Durrant-Whyte, M. Csorba, “A Solution to the Simultaneous Localization and Map Building (SLAM) Problem”. In IEEE Transactions on Robotics and Automation, 17 (3), June 2001.
[7] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, “FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem”. In AAAI National Conf. on Artificial Intelligence, 2002.
[8] S. Thrun, M. Montemerlo, “The GraphSLAM Algorithm With Applications to Large-Scale Mapping of Urban Structures”. In Int. Journal on Robotics Research, 25 (5-6), pp. 403-429, 2006.
[9] S. Kohlbrecher, J. Meyer, O. Von Stryk, U. Klingauf, “A Flexible and Scalable SLAM System with Full 3D Motion Estimation”. In Proc. of the Int. Symp. on Safety, Security and Rescue Robotics, Nov. 2011.
[10] E. Pedrosa, N. Lau, A. Pereira, “Online SLAM Based on a Fast Scan-Matching Algorithm”. In Progress in Artificial Intelligence, Lecture Notes in Computer Science, Editors: L. Correia, L. P. Reis, J. Cascalho, Vol. 8154, 295-306, Springer, 2013.
[11] A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, R.W. Brodersen, “Optimizing Power Using Transformations”. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14 (1), pp. 12-31, 1995.
[12] B. Clipp, J. Lim, J. Frahm, M. Pollefeys, “Parallel, Real-Time Visual SLAM”. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010), pp. 3961-3968, Taipei, Taiwan, October, 2010.
[13] K. Par, O. Tosun, “Parallelization of Particle Filter based Localization and Map Matching Algorithms on Multicore/Manycore Architectures”. In Proc. of the 2011 IEEE Intelligent Vehicles Symposium, Baden, Germany, June, 2011.
[14] B. Kehoe, S. Patil, P. Abbeel, K. Goldberg, “A Survey of Research on Cloud Robotics and Automation”. In IEEE Transactions on Automation Science and Engineering, Special Issue on Cloud Robotics and Automation, 12 (2), April 2015. (In Press)
[15] L. Riazuelo, J. Civera, J.M.M. Montiel, “C2TAM: A Cloud framework for cooperative Tracking and Mapping”. In Robotics and Autonomous Systems, 62 (4), pp. 401-413, April 2014.
[16] D. Hunziker, M. Gajamohan, M. Waibel, R. D’Andrea, “Rapyuta: The RoboEarth Cloud Engine”. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 438-444, Karlsruhe, Germany, May, 2013.
[17] R. Arumugam, V. R. Enti, L. Bingbing, W. Xiaojun, K. Baskaran, F. F. Kong, A. S. Kumar, K. D. Meng, and G. W. Kit, “DAvinCi: A cloud computing framework for service robots”, in 2010 IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 3084-3089.
[18] R. Siegwart, P. Balmer, C. Portal, C. Wannaz, R. Blank and G. Caprari, “RobOnWeb: A Setup with Mobile Mini-Robots on the Web”. In Beyond Webcams: An Introduction to Online Robots, K. Goldberg and R. Siegwart (editors), MIT Press, 2002.
[19] M. Dorigo, D. Floreano, L.M. Gambardella, F. Mondada, S. Nolfi, T. Baaboura, M. Birattari et al., “Swarmanoid: A Novel Concept for the Study of Heterogeneous Robotic Swarms”. In IEEE Robotics & Automation Magazine, 20 (4), pp. 60–71, Dec. 2013.
[20] G. Grisetti, C. Stachniss, W. Burgard, “Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters”. In IEEE Transactions on Robotics, 23(1), pp. 34-46, Feb. 2007.
[21] M. Lazaro, L. Paz, P. Piniés, J. Castellanos, and G. Grisetti, “Multi-Robot SLAM Using Condensed Measurements”. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, Nov. 3-7, 2013, pp. 1069-1076.
[22] M. Eich, R. Hartanto, S. Kasperski, S. Natarajan, J. Wollenberg, “Towards Coordinated Multirobot Missions for Lunar Sample Collection in an Unknown Environment”. In Journal of Field Robotics, Special Issue on Space Robotics, Part 2, Volume 31, Issue 1, pages 35–74, 2014.
[23] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, A. Ng, “ROS: an open-source Robot Operating System”. In IEEE International Conference on Robotics and Automation (ICRA), Workshop on Open Source Software, Kobe, Japan, 2009.
[24] J. Machado Santos, D. Portugal and R. P. Rocha, “An Evaluation of 2D SLAM Techniques Available in Robot Operating System”. In Proc. of the 2013 Int. Symposium on Safety, Security and Rescue Robotics (SSRR 2013), Linköping, Sweden, Oct 21-26, 2013.
[25] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, A. Kleiner, “On Measuring the Accuracy of SLAM Algorithms”. In Autonomous Robots, 27(4), 2009.