Goal-based learning in tactile robotics

John Lloyd

A dissertation submitted in partial fulfilment for the requirements of the University of the West of England, Bristol for the Degree of Master of Science

Faculty of Environment and Technology, University of the West of England, Bristol

September 2016

Declaration

This study was completed for the MSc in Advanced Engineering Robotics at the University of Bristol and University of the West of England, Bristol. The work is my own. Where the work of others is used or drawn on, it is attributed.

Signed:_________________________________

John Lloyd Student No: 15042295

Number of words: 19,631

Date:_____________________

Abstract

This dissertation reports on work carried out between June and August 2016, to develop and evaluate a method for goal-based learning in an active touch robotic system. Over the past few years, Bristol Robotics Laboratory (BRL) has led the way in developing high-acuity, active touch, robotic systems, but until now these systems have relied exclusively on predefined control policies. By incorporating a goal-based learning capability, it is hoped that this will give tactile robotic systems the ability to learn more complex, optimal control policies than we currently know how to program directly, and the ability to adapt these policies to changes in the environment over time.

This work has addressed the problem by developing a simplified model of active perception and control, and reinterpreting the existing methods in use at BRL under this new model. By abstracting away from the detail, it was possible to extend the model to incorporate a powerful form of reinforcement learning known as direct policy search, and thereby show how to incorporate a goal-based learning capability within the existing framework.

The effectiveness of the proposed approach was demonstrated by using it to learn optimal control policies for three tactile robotic tasks that have been widely used in past research at BRL. These tasks include curvature perception, orientation perception, and a more complex exploration task that involves tracking the edge (contour) of a disk. The results show that the new goal-based learning approach can find optimal (or near-optimal) policies for all of these demonstration tasks. Furthermore, a particular class of methods, based on Bayesian optimisation, can find good solutions in as few as 100 learning trials, which makes it well-suited to online, real-time learning. This is something that has not been possible using other methods in the past. However, more work still needs to be done to improve the efficiency of the proposed methods and investigate how they scale up to larger and more complex problems.

Acknowledgements

I would like to thank my wife, Ruth, and our children, Catherine and Anna, for their support, patience and understanding during the course of my studies over the past year. I would also like to thank my supervisor, Dr Nathan Lepora, for his support, guidance and advice during the course of this project.

Table of Contents

1 Introduction
  1.1 Aims and objectives
  1.2 Motivation
  1.3 Main contributions
  1.4 Dissertation outline
2 Background and related work
  2.1 The importance of tactile perception in robotics
  2.2 Active perception, sensorimotor control and learning in tactile robotics
  2.3 Biomimetic tactile perception and control
  2.4 Recent work on tactile perception at Bristol Robotics Laboratory
  2.5 Reinforcement learning for goal-based learning in tactile perception
  2.6 Related work
3 Integrating active perception and reinforcement learning
  3.1 A simplified model of active perception
  3.2 Bayesian filters and active perception
  3.3 Reinterpretation of existing methods in terms of Bayesian filters
  3.4 Integrating reinforcement learning and Bayesian active perception
  3.5 Direct policy search for reinforcement learning
  3.6 Model-free, episode-based, direct policy search
  3.7 Policy optimisation methods
4 Application to learning problems in tactile robotics
  4.1 Curvature perception
  4.2 Orientation perception
  4.3 Contour exploration
5 Discussion
  5.1 Significance of results
  5.2 Relationship with other work
  5.3 Suggestions for further research
6 Conclusions
References
Appendices
A Performance on standard optimisation benchmarks
  A.1 Use of standard optimisation benchmarks for testing and evaluation
  A.2 Comparison between policy search gradient and hybrid optimisers
  A.3 Characterisation of optimiser performance on standard benchmarks
  A.4 Computational costs
  A.5 Deterioration of performance under noisy conditions
  A.6 Discussion of results
B Particle filter implementations of active perception
  B.1 Particle filter algorithm
  B.2 Cylinder classification
  B.3 Angle estimation
  B.4 Discussion of results

1 Introduction

1.1 Aims and objectives

The aim of this dissertation project is to develop and evaluate a method for goal-based learning in an active touch robotic system. This involves achieving the following objectives:

• Conduct a detailed literature review of goal-based learning in tactile robotics, with a particular focus on reinforcement learning approaches;

• Reinterpret and extend current methods of tactile active perception and control developed at BRL, to incorporate a goal-based learning capability;

• Demonstrate the new approach on some representative tactile perception and exploration tasks, using a 6 degrees-of-freedom (DOF) robotic arm equipped with a TacTip touch sensor.

1.2 Motivation

Active touch robotic systems sense and perceive their environments in an interactive, exploratory manner, using a similar approach to that used by humans and other animals (Prescott et al. 2011; Prescott & Dürr 2013). This bioinspired approach allows robots to reduce uncertainty and resolve ambiguity about their surroundings to an extent that would not be possible using passive touch alone (Lepora 2016a). The exploratory movements or manipulations carried out by a robot during the active touch process are specified by a control policy, which can either be predefined, based on assumptions about the task and environment, or learned using some form of learning paradigm.

Two popular approaches for learning in robotics are supervised learning and reinforcement learning. In supervised learning, the robot is shown examples of how to behave by a “teacher” and, over time, it learns to imitate that behaviour. In reinforcement learning, there is no teacher. Instead, the robot receives a scalar reward signal from the environment, which tells the robot how well it is achieving a particular goal. The robot uses this reward signal to work out how it needs to behave in order to maximise the long-term cumulative reward and achieve the goal. A key advantage of a goal-based approach such as reinforcement learning is that it does not require a teacher who already knows how to perform the task. It can also adapt the control policy to changes in the environment in an online fashion (Kormushev et al. 2013; Kober et al. 2013).

Over the past few years, BRL has led the way in developing high-acuity, active touch, robotic systems (Lepora 2016b; Lepora et al. 2015; Ward-Cherrier et al. 2016; Lepora et al. 2016; Cramphorn et al. 2016; Lepora & Ward-Cherrier 2015). However, up until now these systems have relied exclusively on predefined control policies. By incorporating a goal-based learning capability, it is hoped that this will give tactile robotic systems the ability to learn more complex, optimal control policies than we currently know how to program directly, and to adapt these policies to changes in the environment over time.

1.3 Main contributions

The work described in this dissertation makes the following contributions:

• It provides a comprehensive review of goal-based learning in tactile robotics. This will help to improve understanding of the state-of-the-art in this area and identify gaps in knowledge that can be addressed through further research.

• It develops a simplified model of active perception and control, and reinterprets the existing tactile perception methods developed at BRL as a subclass of this model. Specifically, the existing methods are shown to be equivalent to histogram-based implementations of Bayesian filters embedded in a special type of perception/control loop. The extension of this model to incorporate goal-based learning is framed as a partially-observable Markov decision process (POMDP). This new interpretation will allow many useful methods that have been developed in other areas of robotics (e.g., mobile robotics) to be applied in the tactile domain. Some potential benefits of adopting this new interpretation have been demonstrated using particle filter implementations of Bayesian filters (see Appendix B).

• It extends existing tactile perception methods developed at BRL by incorporating goal-based learning using direct policy search reinforcement learning. These extensions can be viewed as an approximate method for finding solutions to the underlying POMDP problem. The learning approach has been successfully demonstrated on three tactile robotic tasks relating to curvature perception, orientation perception, and contour exploration. In future, this will enable tactile control policies to be learned or optimised, rather than having to fully predefine them in advance.

• It compares the performance of a number of black-box optimisers that were used to implement direct policy search reinforcement learning in the extended framework. During the course of this work, two novel optimisers were also developed, which can be viewed as hybrids between existing algorithms. Since no single optimiser was found to be best-suited to all problems, this collection of optimisers will provide a useful toolkit that can be used in further work in this area. Furthermore, one particular class of optimiser, known as Bayesian optimisers, was found to be suitable for online, real-time learning, which is something that has not been possible in the past.


1.4 Dissertation outline

The rest of this dissertation is organised into five chapters, as follows:

• Chapter 2 explains why tactile sensing and perception are important in robotics, and, in particular, why a biomimetic approach based on active touch is a good one. It goes on to summarise recent work carried out in this area at BRL, and justifies the need for a goal-based learning capability. Finally, it explains why reinforcement learning is a promising approach for providing this capability, and outlines related work in this area.

• Chapter 3 develops a simplified model of active perception and control, and reinterprets the existing tactile perception methods developed at BRL as a subclass of this model. It goes on to explain why the extension of this model to incorporate goal-based learning can be framed as a POMDP problem, and proposes an approximate but computationally tractable solution using Bayesian filtering and reinforcement learning. It also explains why episode-based, direct policy search is the most appropriate reinforcement learning method in this setting, and provides details of the black-box optimisation algorithms used in its implementation.

• Chapter 4 describes how the new learning approach was used to find optimal control policies for three tactile robotic tasks that have been widely used in past research at BRL. These tasks include curvature perception, orientation perception, and a more complex exploration task that involves tracking the edge (contour) of a disk. It also includes some performance comparisons between the different black-box optimisers used to implement the reinforcement learning element of the system.

• Chapter 5 discusses the main results from this work, explains their significance, and links them back to other relevant research. It also identifies any weaknesses in the methodology and proposes some areas for further work.

• Finally, Chapter 6 summarises the main achievements of this project and presents some conclusions.


2 Background and related work

2.1 The importance of tactile perception in robotics

In humans and other animals, tactile sensing is essential for day-to-day survival (Prescott & Dürr 2013). Humans rely on it for detection, exploration, recognition and manipulation (Gallace & Spence 2014). For example, people with chilled or anaesthetised fingertips often have great difficulty in carrying out the simplest of day-to-day tasks, like buttoning up a jacket, or grasping and picking up small objects (Westling & Johansson 1984). Furthermore, any major loss of touch cannot be adequately compensated for by other sensory modalities such as sight, and it results in catastrophic impairments of hand dexterity, haptic capabilities, walking and self-perception (Cole & Paillard 1995; Robles-De-La-Torre 2006).

In the context of tactile sensing, it is important to distinguish between active touch and passive touch. Gibson (1962) originally made the distinction as follows: “active touch refers to what is ordinarily called touching. This ought to be distinguished from passive touch, or being touched”. Since then, the definition has been refined somewhat, with active touch systems now characterised as purposive and information-seeking systems (Prescott et al. 2011). Humans tend to use passive touch for detection and response, and active touch for exploration, recognition and manipulation. Other animals use whiskers, tentacles, antennae, fins or feathers in a similar manner (Prescott et al. 2011).

Tactile sensing is important in robotics too (Cutkosky et al. 2008), particularly in applications such as medical robotics that require the use of sensitive touch and fine motor control (Eltaib & Hewit 2003; Tiwana et al. 2012). It is also important when operating in unstructured or unpredictable human environments (Kemp et al. 2007; Argall & Billard 2010), and in agricultural applications that involve processing natural products (Lee 2000). Since humans use their hands and fingers for exploration, recognition and manipulation, there has been a great deal of interest in developing dexterous hands for carrying out these activities in robots (Bicchi 2000; Ma & Dollar 2011; Bullock et al. 2013; Ma et al. 2013). Indeed, dexterous robotic hands will be essential in the next generation of more flexible industrial robots with greater manipulation capabilities, and in social and service robots for supporting humans in daily routines or providing assistance to the elderly or the disabled (Kappassov et al. 2015). Thus, tactile sensing and perception are likely to play a major role in these areas of robotics as well.

Despite the many apparent benefits of tactile sensing to robotics, progress has been relatively slow in this area as compared to areas such as computer vision and machine learning. One reason for this slow development is that there is no equivalent of the charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) array sensor in the tactile domain, which is robust to physical impact and abrasions, and compliant enough to be incorporated in non-planar and flexible surfaces (Cutkosky et al. 2008). A second reason is that touch is an inherently active sense, and so progress can only be made using complex robotic systems, whereas progress can be made in vision by treating it as a passive modality. Nevertheless, there has been significant progress in developing new types of tactile sensor in recent years (Dahiya et al. 2010; Girao et al. 2013; Winstone et al. 2013), and with this progress there has been a rapid advance in tactile information processing algorithms for perception, exploration and manipulation (Kappassov et al. 2015).

2.2 Active perception, sensorimotor control and learning in tactile robotics

Tactile perception is concerned with the interpretation of information provided by touch. Perhaps more than any other sensory modality, touch is an active sense, so tactile perception necessarily involves coordinating motor actions with sensory perceptions in a continuous process of sensorimotor control (Gibson 1962; Loomis & Lederman 1986; Prescott et al. 2011). In humans, this involves a variety of sensorimotor control policies, which are used to move the hands and fingers in a purposive manner to gather more information about the environment, as shown in Figure 2.1 (Lederman & Klatzky 1987; Lederman & Klatzky 1993).

Figure 2.1 – Exploratory procedures used by humans to perceive properties of objects (Source: Sensation and Perception).

In some situations, more than one policy is used during the perception process. So, for example, to recognise a cylindrical container using touch alone, we might use information from the points of contact with our hand and fingers to sense the spatial extent of the object; then use one or more fingers to trace the circular contours at either end; and finally rotate the object in the hand to confirm that it is indeed cylindrical. This process relies on a combination of bottom-up sensory information based on touch, and top-down information about the possible object shapes that could account for the bottom-up sensory information.

In order to use these sensorimotor control policies in robotic systems, we either need to predefine appropriate dynamic and kinematic models of the manipulator and objects, and embed these within a conventional control system, or they need to be learned and adapted, based on representative data or interactions with the environment (Schaal & Atkeson 2010; Nguyen-Tuong & Peters 2011). However, even if it were possible to predefine policies for all types of tactile perception and manipulation task, these models would still need to be calibrated, and in the absence of online adaptation would become inaccurate over time due to wear-and-tear on components or changes in the environment. Furthermore, idealised assumptions made during modelling typically require higher-quality sensors and actuators that comply with these assumptions, thus driving up the cost of the system (Ghadirzadeh & Maki 2015). So the ability to learn a control policy rather than having to predefine it has some real advantages from this perspective too.

The ability to learn and adapt sensorimotor control policies is also beneficial when operating in unstructured, uncertain and dynamic human environments. It allows robots to infer unobserved properties of the world and select appropriate actions that depend on these properties (Kemp et al. 2007).

2.3 Biomimetic tactile perception and control

As in other areas of technology, robotics researchers often turn to nature for inspiration in their search for solutions. Biological inspiration does not mean that there is any attempt to faithfully copy nature. Instead, the goal is to identify the underlying principles and then reinterpret them in the context of biomimetic robotic systems (Bar-Cohen 2006; Bhushan 2009; Pfeifer et al. 2012). The area of biomimetics is already having a positive impact across a number of research areas, and with the rapid growth of discoveries in recent years it has become a leading paradigm for the development of new technologies (Lepora, Verschure, et al. 2013). Indeed, many recent advances in tactile sensors have been based on biomimetic principles (Chorley et al. 2009; Winstone et al. 2013; Assaf et al. 2014).

As far as biological principles of sensorimotor control and learning in humans are concerned, there is growing consensus that the underlying computational mechanisms include Bayesian inference and decision theory (Knill & Richards 1996; Knill & Pouget 2004; Kording & Wolpert 2006), forward models of sensorimotor interactions with the environment (Franklin & Wolpert 2011), optimal feedback control (Todorov & Jordan 2002; Todorov 2004), and various types of learning process that operate at different levels of perception and control (Daw & Doya 2006; Doya 2007; Krakauer & Mazzoni 2011; Gottlieb 2012; Botvinick & Weinstein 2014). Many of these principles have also been used to develop biomimetic tactile robotic systems (Lepora et al. 2012; Fishel & Loeb 2012; Xu et al. 2013; Martins et al. 2014; Lepora 2016b).

2.4 Recent work on tactile perception at Bristol Robotics Laboratory

Over the past few years, work at the University of Sheffield and Bristol Robotics Laboratory (BRL) has focused on biomimetic approaches to tactile perception, exploration and manipulation. These approaches have been successfully applied to problems in object recognition (Lepora, Martinez-Hernandez & Prescott 2013a; Lepora, Martinez-Hernandez & Prescott 2013b), angle and position discrimination (Martinez-Hernandez, Dodd, et al. 2013), tactile object exploration and recognition (Martinez-Hernandez, Metta, et al. 2013; Martinez-Hernandez 2014), tactile super-resolution (Lepora et al. 2015; Lepora & Ward-Cherrier 2015), tactile quality control (Lepora et al. 2016), and tactile manipulation (Cramphorn et al. 2016; Ward-Cherrier et al. 2016).

The BRL work has made extensive use of a biomimetic optical tactile sensor known as the TacTip (Chorley et al. 2009; Winstone et al. 2013; Assaf et al. 2014). The TacTip has a 3D-printed construction with a soft, deformable, hemispherical “fingertip” surface, lined with a concentric array of pins on the inside surface. The pins are illuminated by a ring of LEDs, and their positions tracked using computer vision algorithms, which are applied to a sequence of images captured using the built-in CCD web-camera (see Figure 2.2).

Figure 2.2 – TacTip sensor: (a) 3D-printed construction; (b) deformation of hemispherical tip and internal pins when placed in contact with an object; and (c) CCD camera images of pins.


The TacTip is mounted as an end-effector on a 6-DOF ABB robotic arm that can precisely and repeatedly position the sensor (absolute repeatability 0.01 mm) during the perception, exploration or manipulation process (see Figure 2.3).

Figure 2.3 – TacTip system organisation showing the main hardware and software components.

Most of the core algorithm development is carried out in a MATLAB environment on a high-specification PC. The interface to the robot arm controller and TacTip camera is implemented in Python, and this also runs on the host PC.

The active perception methods in use at BRL are based on a biomimetic approach with theoretical underpinnings in Bayesian sequential analysis (e.g., see (Lai 2001) or (Berger 2013)). The underlying principle is to accumulate evidence for multiple perceptual alternatives over a sequence of trials until a predefined belief threshold is reached (Lepora et al. 2012). Figure 2.4 shows the output produced by the system for a typical active perception sequence. Here, the system is being used to identify cylinders of different curvature, while simultaneously estimating the position of the sensor and relocating it to a central focal point over the cylinder axis. In this context, the cylinder ID is referred to as the “what” variable, and the sensor position is referred to as the “where” variable (Lepora, Martinez-Hernandez & Prescott 2013a).


Figure 2.4 – Example system output for TacTip cylinder identification.
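To make the evidence-accumulation principle concrete, the following Python sketch shows a minimal, illustrative version of the idea; the function names, the pre-trained likelihood model and the threshold value are assumptions made for illustration, not the BRL implementation.

```python
import numpy as np

def accumulate_evidence(n_classes, likelihood, observe, belief_threshold=0.99, max_taps=20):
    """Sequential evidence accumulation over perceptual classes (illustrative sketch).

    likelihood(z) is assumed to return a vector of class likelihoods p(z | class),
    e.g. from histogram models built from training taps; observe() performs one
    tactile "tap" and returns an observation.
    """
    posterior = np.full(n_classes, 1.0 / n_classes)   # start from a uniform prior
    for tap in range(max_taps):
        z = observe()                                 # collect one tactile observation
        posterior *= likelihood(z)                    # Bayes update (unnormalised)
        posterior /= posterior.sum()                  # renormalise
        if posterior.max() >= belief_threshold:       # enough evidence for a decision?
            break
    return int(posterior.argmax()), float(posterior.max()), tap + 1
```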

A number of predefined control policies have been used in conjunction with these active perception methods, including: an active focal policy, where the system attempts to move the sensor from its estimated position to a predefined “focal point” at each step; a passive random policy, where the sensor is moved to a random position at each step; and a passive stationary policy, where the sensor remains in the same position. The naming of these policies is somewhat unfortunate because, under the model developed here, a passive policy can be used in conjunction with an active perception process.

Perhaps the terms “static” and “dynamic” would be more representative of the behaviour of these policies in this context.

Some earlier research carried out at the University of Sheffield looked at using reinforcement learning to learn the optimal decision threshold and focal point parameters for an active focal policy (Lepora, Martinez-Hernandez, Pezzulo, et al. 2013). A multi-armed bandit approach was used to optimise the episodic returns over a coarse discretisation of the policy parameters (13 decision threshold values and 16 focal point positions). The results showed that reinforcement learning could be used to optimise robot behaviour for a simple tactile perception task. However, this particular method required between 1000 and 2000 learning trials, making it computationally intractable for online learning (and indeed the study was only carried out on pre-collected data). The authors also acknowledged the problem of scaling the approach to larger state spaces and action spaces.

2.5 Reinforcement learning for goal-based learning in tactile perception

As discussed in the previous sections, for tactile manipulation to be effective in unstructured, uncertain and non-stationary environments, robotic systems need to be able to learn, improve and adapt their control policies instead of just relying on ones that are predefined through direct programming. There are two well-established approaches for teaching robots new skills: imitation learning and reinforcement learning (Kormushev et al. 2013). Imitation learning, also known as programming-by-demonstration or learning-from-demonstration (Friedrich et al. 1996; Argall et al. 2009; Meltzoff et al. 2009), can be thought of as a form of supervised learning, where a policy is learned from examples or demonstrations provided by a “teacher” or “supervisor”. It has the advantage of being an intuitive form of learning for humans, who already use this method to teach other humans how to perform tasks.

Reinforcement learning (Kaelbling et al. 1996; Sutton & Barto 1998; Wiering & Van Otterlo 2012) allows a robot to autonomously discover an optimal behaviour in a series of trial-and-error interactions with its environment. Instead of explicitly programming the robot to achieve a particular goal, the robot attempts to learn an optimal policy by interacting with its environment and receiving feedback in the form of a scalar feedback signal known as a reward. The reward measures the one-step performance of the robot as it follows a particular policy. The ultimate aim of the learning process is to find a policy that maximises the expected cumulative reward.

As discussed by Kormushev et al. (2013), reinforcement learning offers three previously missing capabilities when compared to other learning approaches used in robotics:

1. It can learn new tasks that humans do not know how to solve (or at least find difficult to specify or program, even if they intuitively know how to perform the task).

2. It can optimise the process of achieving difficult goals using only a cost function in situations where there is no analytic or closed-form solution, and where the optimal solution is not known in advance.

3. It can learn to adapt a skill to a previously unseen version of a task or to a slightly different environment.

Reinforcement learning has been successfully applied to many other goal-based learning problems in robotics, and has a strong theoretical underpinning that has been established over many years (Kober et al. 2013). As such, it is considered the most promising approach for goal-based learning in tactile robotics.

2.6 Related work

As far as it has been possible to establish, other than the initial work of Lepora et al. (2013), there are no other published reports of reinforcement learning (or any other form of goal-based learning) being applied to high-acuity tactile sensors in an active perception context. Some researchers have employed similar methods to those developed at BRL (Hsiao & Kaelbling 2010; Hsiao et al. 2011; Fishel & Loeb 2012; Xu et al. 2013; Loeb & Fishel 2014), but these have typically been applied to low-resolution tactile sensors such as the BioTac, BioTac SP or NumaTac devices (SynTouch 2016), which contain 24 “taxels” or fewer, compared with the 127 taxels used in the TacTip sensor. They also make no attempt to learn an optimal policy for achieving a particular goal, but instead try to maximise the information gain associated with each active move.

Other researchers have applied reinforcement learning to problems such as dexterous manipulation, but, once again, these approaches typically use low-dimensional tactile data and rely on other sensory modalities (e.g., vision) for support. Fu et al. (2015) used a model-based reinforcement learning algorithm that encodes prior knowledge from previous tasks using a deep neural network, and combines this with online adaptation of the dynamics model. They showed that this approach can be used to solve a variety of complex robotic manipulation tasks in a single attempt, using prior data from other manipulation tasks. Han et al. (2015) considered the problem of reinforcement learning of complex compound manipulation tasks. They described an approach for training chains of controllers for performing compound tasks, with separate sub-goals for each stage. Another group tackled the problem of multi-phase manipulation tasks using a combination of Dynamic Motor Primitives (DMP) and model-based reinforcement learning (Kroemer et al. 2015). Van Hoof et al. (2015) used reinforcement learning with tactile sensory feedback to learn how to roll cylindrical objects from side to side using a compliant, under-actuated robotic hand. Their approach is notable in that it only makes use of tactile and kinaesthetic feedback, unlike most other approaches that also rely on some form of visual feedback. Kober (2013) and Kormushev (2013) provide extensive surveys on the use of reinforcement learning in robotics.

Other possible approaches for goal-directed learning in tactile robotics include those that fall under the general heading of “sensorimotor learning”. An example of such an approach is sensorimotor contingency (SMC) theory (O’Regan & Noe 2001a; O’Regan & Noe 2001b; Philipona & O’Regan 2006; O’Regan 2011). A number of applications of SMC theory to robotics have been reported in recent years. One group has applied them in an acoustic and kinaesthetic setting, to object perception and control of behaviour (Maye & Engel 2011); prediction and action planning (Maye & Engel 2012; Maye & Engel 2013), and terrain discrimination and adaptive walking behaviour (Hoffmann et al. 2012). Their implementation of sensorimotor theory relies on a set of fixed-order Markov models to capture the probabilities of action-observation pairs, conditioned on past action-observation pairs. Another group used sensorimotor learning in a visual perception setting, to enable a robot to learn how to classify objects, based on how they responded to a pushing action (Hogman et al. 2013). In this case, Gaussian Processes (GP) were used to model SMCs, and these were then used in a Bayesian classification framework to determine the object class. Gaussian Processes were also used in another visual perception application of sensorimotor learning, this time to allow a robot to autonomously learn hand-eye coordination (Ghadirzadeh & Maki 2015). Yet another approach used a neural network implementation of sensorimotor learning to allow a visually-guided mobile robot to learn how to perceive different spatial arrangements of obstacles and thereby recognise if a particular arrangement was a dead end or a passageway (Hoffmann 2007). Georgeon et al. (2013) introduced a new type of model for sensorimotor interactions, called an Enactive Markov Decision Process (EMDP), and used it to develop a cognitive architecture for discovering, memorising, and exploiting spatio-sequential patterns of interaction.

Inspired by earlier work that investigated how infants use a process of “motor babbling” to gain experience of different body configurations (Meltzoff & Moore 1997), other researchers have applied a similar principle in sensorimotor learning methods that allow humanoid robots to autonomously learn internal models of their self-body and environment (Saegusa et al. 2008). Other work has focused on the use of motor babbling with Bayesian networks to autonomously learn forward models for robotic control applications (Demiris & Dearden 2005). While motor babbling relies on random exploration in “motor space”, an alternative approach, known as “goal babbling”, relies on goal-based exploration in “task space” (Rolf et al. 2010; Rolf et al. 2011; Rolf & Steil 2012; Rolf & Asada 2014).

The general problem of autonomous learning of higher cognitive functions in robots is studied in the area of developmental robotics (Asada et al. 2009; Lungarella et al. 2003). Stoytchev (2009) proposed five basic principles of developmental robotics, and explained how they can be applied by robots to autonomously learn how to use tools. Using visual perception, Fitzpatrick et al. (2003) investigated how repeated interactions with objects reveal how they behave when acted upon (e.g., sliding vs. rolling when an object is pushed). They describe two experiments where, in a discovery mode, the visual system learns about the consequences of motor actions in terms of visual features, and in a goal-based mode the mapping is inverted to select the motor action that causes a particular change in visual features.


3 Integrating active perception and reinforcement learning

3.1 A simplified model of active perception

The distinction between active and passive perception, and its extension to active and passive learning, has been made in a number of different areas including information processing systems (Bajcsy 1988; Bajcsy et al. 2016), psychology (Gibson 2014), neuroscience (Freeman 1999), computational biology (Lepora & Pezzulo 2015), tactile perception (Lepora 2016a), and machine learning (Castro et al. 2005; Settles 2011; Settles 2012). In the context of robotics, perception is the process by which an agent, such as a robot or robot subsystem, uses its sensors to obtain information about the state of its environment. The distinction between active perception and passive perception is illustrated in Figure 3.1.

Figure 3.1 – The distinction between active perception and passive perception in a robotics context: (a) passive perception, and (b) active perception.

In passive perception, an agent uses a set of passive observations (e.g., sensor measurements) to classify, estimate, predict or infer some quantity of interest relating to its environment. Observations may be single-valued, or they may be sequences (e.g., time series), and are usually chosen deterministically or at random from the environment. This type of approach is exemplified by supervised methods in machine learning (e.g., see Bishop (2006) or Murphy (2012)). In active perception, the agent interacts with its environment in a dynamic fashion, by taking actions (e.g., by moving, or manipulating an object) and observing how the environment changes state in response. In doing this, the observations become dependent on the actions, via changes in the environment state. This type of approach is exemplified by recursive probabilistic approaches used in mobile robotics and other areas (Thrun et al. 2005; Särkkä 2013; Ferreira & Dias 2014).

An agent’s environment is characterised by its state. In some situations it is reasonable to assume that the state is Markov – that knowledge of past states does not carry any additional information that would help to predict the future – but this need not be the case. In robotics problems, the environment state is typically not fully observable due to noise and ambiguities in the sensor observations (Kober et al. 2013). So the agent must compensate for this by maintaining an internal state that captures its belief about the state of the environment (and possibly its time history). This internal state is often referred to as the agent state, belief state or information state, depending on the context, to distinguish it from the true environment state.

The agent state can be represented in a variety of different ways. For example, it can be represented as a full, or partial, time-history of actions and observations. Alternatively, it can be represented as a probability distribution over environment states, or using information derived from the underlying probability distribution. However, in many cases, a single-point estimate of the environment state is not sufficient as it does not convey any information about the level of uncertainty in the estimate. In this situation, the agent will not be able to tell whether any actions it has taken to reduce uncertainty or gather information have succeeded. Where the agent state takes the form of a compressed representation of the probability distribution over environment states it is often referred to as an augmented state space (Thrun et al. 2005).

The control actions taken during the active perception process may involve movement of the agent, such as changing the pose of a sensor, manipulator or robot; or it may involve manipulation of objects in the environment. The control actions depend on both the agent state and the particular goal that the agent is trying to achieve. This dependency is encapsulated in a control policy.

In using terms like “control action” and “control policy” it is natural to make comparisons between active perception and feedback control. These two processes are depicted in Figure 3.2. Other than some minor differences in terminology and notation, they are seen to be very similar. In both cases, an agent (controller) takes actions (applies control inputs) that are determined by a control policy, before observing its environment (system outputs) and updating some internal representation of the environment state. In both cases, taking actions tends to increase uncertainty, while making observations tends to reduce it.


Figure 3.2 – The similarities between active perception and feedback control: (a) active perception, and (b) feedback control.

However, in active perception, the primary goal is to acquire information and reduce uncertainty. The main focus is on making observations to gather information; taking actions is simply a way of generating new observations. Conversely, in feedback control, the main focus is on achieving some sort of physical state (e.g., moving to a given position, following a particular trajectory). The focus here is on taking actions; making observations is just a way of reducing the uncertainty that tends to accumulate in taking these actions. The two processes have an interesting kind of symmetry. In robotics, goals are often expressed in terms of both information acquisition and physical state variables. For example, in mobile robotics, localisation and navigation problems often require a robot to reduce its uncertainty about its environment and position, while simultaneously navigating to a new position. The distinction between perception-led goals and action-led goals is sometimes made using the terms “active perception” and “perceptive action”, respectively (Dahiya et al. 2011). Another term used to describe the relationship between these two processes is “interactive perception” (Bohg et al. 2016).

Under this simplified model of active perception, the main objective of the agent is to maintain an internal state, and make decisions based on this state while following a control policy. Thus, it seems sensible to split these two roles into separate components: a state estimator and a decision maker, as shown in Figure 3.3.

Figure 3.3 – State estimator and decision maker components of an active perception agent.

This type of approach was used by Hsiao (2010) in the context of tactile robotics, and is widely used in mobile robot localisation (Thrun et al. 2005). It is also used in output feedback control, where the state estimator is usually referred to as an “observer”. However, the aim of an observer is usually to produce a single-point estimate of the environment (plant) state, rather than the more general type of agent state discussed earlier. The role of the state estimator is to compute the agent state, based on the actions taken and observations made at each time step. A popular approach for doing this involves using a Bayesian filter to maintain a probability distribution over environment states, and then deriving the agent state from this distribution. Other approaches include storing a complete, or partial, history of actions and observations, or using a finite state machine, recurrent neural network, or reservoir computing model (Lukoosevicius & Jaeger 2009) to compute the agent state. The role of the decision maker is to produce inferences and determine control actions at each time step, based on the agent state and control policy. If we treat inferences as a type of action (regardless of whether they change the state of this agent’s environment or some other agent’s environment) we can simplify the situation so that the decision maker only has to generate actions at each time step. The same argument can be applied to incoming control reference signals, treating them as a type of observation in an action-led scenario. The control policies used by the decision maker may be deterministic or probabilistic, timedependent or time-independent, stateless or memory-based.


3.2 Bayesian filters and active perception

If the state estimator of the two-component, active perception model described above is implemented using a Bayesian filter (a Bayesian approach for estimating the state of a time-varying signal that is indirectly observed through noisy measurements), we arrive at what is arguably one of the most widely-used systems for mobile robot localisation in use today (see Figure 3.4).

Figure 3.4 – Bayesian filter-based active perception.

At each time step, the Bayesian filter recursively updates a posterior probability distribution over environment states, which is referred to as the belief distribution bel(x_t). The generic algorithm for performing the Bayesian filter updates is shown in Table 3.1.

1:  algorithm bayesian_filter(bel(x_{t-1}), u_t, z_t):
2:      for all x_t do
3:          bel_bar(x_t) = ∫ p(x_t | u_t, x_{t-1}) bel(x_{t-1}) dx_{t-1}
4:          bel(x_t) = η p(z_t | x_t) bel_bar(x_t)
5:      end
6:      return bel(x_t)

Table 3.1 – Bayesian filter algorithm.

Here, p(x_t | u_t, x_{t-1}) is the state transition probability (technically a probability distribution rather than a probability), and it models the effect of actions on the environment state, which is assumed to have the Markov property; in other words, that knowledge of past states does not carry any additional information that would help to predict the future. It is also known as the motion model or prediction model. The probability p(z_t | x_t) is the observation probability, and it models the probabilistic dependence of observations on the state, taking the role of the likelihood in the application of Bayes rule in Line 4 of the algorithm. It is also known as the measurement model or observation model. The coefficient η represents the normalising factor (i.e., the evidence) in the application of Bayes rule in Line 4 of the algorithm. In situations where discrete probability distributions are used, the integration operation in Line 3 is replaced by a summation. The probabilistic evolution of environment states and observations is shown in Figure 3.5.

Figure 3.5 – Probabilistic evolution of environment states and observations in the Bayesian filter.
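For a discrete state space, the recursion in Table 3.1 reduces to a matrix-vector computation. The following Python sketch shows one step of such a histogram filter; representing the transition and measurement models as callables is an assumption made for illustration.

```python
import numpy as np

def bayes_filter_step(belief, action, observation, transition_model, measurement_model):
    """One recursion of a discrete (histogram) Bayesian filter (cf. Table 3.1).

    belief:            length-N array over the discretised environment states
    transition_model:  callable, transition_model(action)[i, j] = p(x_t = i | u_t, x_{t-1} = j)
    measurement_model: callable, measurement_model(observation)[i] = p(z_t | x_t = i)
    """
    predicted = transition_model(action) @ belief            # prediction step (Line 3), summed over x_{t-1}
    posterior = measurement_model(observation) * predicted   # measurement update (Line 4), unnormalised
    return posterior / posterior.sum()                       # normalise by the evidence (the role of η)
```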

There are many different ways of implementing Bayesian filters, including Kalman filters, multi-hypothesis tracking, grid-based/histogram approaches, topological approaches and particle filters (Fox & Hightower 2003). Further details of these methods and the trade-offs between them can be found in standard textbooks (Thrun et al. 2005; Russell & Norvig 2009; Särkkä 2013). As was true for the more general case, in the Bayesian filter-based approach the agent state can be represented in a number of different ways, depending on the application. However, as before, if the agent needs to take actions that are only needed to gather information or reduce uncertainty, then a single-point estimate derived from the belief distribution does not convey enough information to do this.

3.3 Reinterpretation of existing methods in terms of Bayesian filters

By reinterpreting the tactile perception methods developed at BRL in terms of the model developed in the previous section, we can abstract away from the detail and use this as the basis for incorporating a goal-based reinforcement learning capability. The key steps of the BRL tactile perception algorithm are shown in Figure 3.6. By mapping the steps of the algorithm to the state estimator or decision maker components of the Bayesian active perception model developed in Section 3.2, it is possible to establish a relationship between the two methods. Indeed, using the attributes listed in Table 3.2, the two methods are seen to be functionally equivalent.


Figure 3.6 – BRL active perception algorithm and mapping to components of the model developed in previous sections.

Belief state: Represented as a discrete probability distribution.

Motion model: Deterministic translation (equivalent to p(x_t | u_t, x_{t-1}) = 1 if x_t = x_{t-1} + u_t^trans, and 0 otherwise).

Measurement model: Discrete, histogram-based, conditional probability distribution p(z_t | x_t).

Agent state: Augmented state space comprising the single-point, maximum a posteriori (MAP) estimate and its associated posterior probability. The state space may be factored into “where” and “what” components.

Decision maker: Utilises various parameterised policies that depend on the agent state (e.g., active focal, passive random or passive stationary). Uses a posterior probability threshold to determine when sufficient evidence has been accumulated for a decision (based on optimal decision theory).

Table 3.2 – Attributes for mapping the BRL active perception algorithm onto the Bayesian active perception model.
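Read this way, a single update of this kind of method can be sketched as a special case of the histogram filter above, with a deterministic shift of the belief as the motion model and a MAP-based augmented agent state. The bin-shift representation of the translation and the use of np.roll are simplifying assumptions for illustration (edge effects are ignored); this is not the BRL code itself.

```python
import numpy as np

def deterministic_translation_update(belief, shift, observation, histogram_likelihood):
    """Histogram filter update with a deterministic translation motion model (illustrative).

    belief:               length-N array over discretised "where" states
    shift:                integer number of bins the sensor was moved by the last action
    histogram_likelihood: callable z -> length-N array of p(z | x), e.g. from training data
    """
    predicted = np.roll(belief, shift)                         # deterministic translation of the belief
    posterior = histogram_likelihood(observation) * predicted  # measurement update
    posterior /= posterior.sum()

    map_index = int(posterior.argmax())                        # augmented agent state:
    agent_state = (map_index, float(posterior[map_index]))     # MAP estimate plus its posterior probability
    return posterior, agent_state
```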


Establishing a mapping between the BRL tactile perception algorithm and the Bayesian filter methods that are widely used in other areas of robotics is important because it allows us to draw on a larger body of research that could help solve problems in the tactile domain. Similarly, some of the novel aspects of the BRL approach (e.g., use of optimal decision theory) may benefit other areas of robotics that use Bayesian filter methods. For example, in the mobile robotics domain, histogram-based implementations of Bayesian filters are not considered to be particularly efficient or scalable, despite having a number of other advantages. Consequently, more efficient methods have been developed over the years to deal with this problem, including Kalman filters, multi-hypothesis tracking, topological approaches and particle filters (Fox & Hightower 2003). Pending results from a more detailed analysis, these methods might also prove useful in the tactile domain.

3.4 Integrating reinforcement learning and Bayesian active perception

For the reasons given in Section 2.5, reinforcement learning is considered to be the most appropriate paradigm for implementing goal-based learning in the tactile robotics domain. To recap briefly, in reinforcement learning, a scalar feedback signal, known as a reward, indicates how well an agent is doing at each time step in making progress towards a particular goal. Under the hypothesis that all goals can be described by the maximisation of expected cumulative reward, the agent’s objective is to select actions that maximise total future reward. With this high-level view of reinforcement learning in mind, the active perception model described in Section 3.1 can be extended to incorporate rewards, as shown in Figure 3.7.

Figure 3.7 – Extending the active perception agent to incorporate rewards.

At each time step t, the agent executes action u_t, and receives observation z_t and a scalar reward r_t from the environment. During the same time step, the environment receives action u_t, and at the next time step emits observation z_{t+1} and a scalar reward r_{t+1}. The objective of the agent is to use the sequence of actions, observations and rewards to learn an optimal control policy for achieving the goal associated with the sequence of rewards.
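This interaction can be written as a simple loop. The sketch below assumes a hypothetical environment interface (reset() returning an initial observation, and step(action) returning an observation, reward and termination flag) and an agent with the step() method sketched in Section 3.1; it is intended only to make the flow of actions, observations and rewards explicit.

```python
def run_episode(agent, environment, max_steps=50):
    """Run one agent-environment episode and return the episodic return (sum of rewards)."""
    episodic_return = 0.0
    action = None                                 # no action has been taken yet
    observation = environment.reset()             # initial observation
    for _ in range(max_steps):
        action = agent.step(observation, action)  # update agent state, then choose an action
        observation, reward, done = environment.step(action)
        episodic_return += reward                 # accumulate the scalar rewards
        if done:
            break
    return episodic_return
```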

As was the case for the active perception agent described in Section 3.1, at this stage no particularly strong assumptions have been made in defining this simple reinforcement learning model. By making stronger assumptions about the environment and its observability, we can draw on a wide array of reinforcement learning algorithms that have been developed over the past few decades. For example, by assuming that the environment state is Markov but not fully observable, we can treat the problem as a POMDP (Kaelbling et al. 1998). While there are many different ways of solving POMDPs (Murphy 2000), exact solutions are intractable to compute for all but the smallest problems, and so approximations are often used (Cassandra 1998; Hauskrecht 2000; Aberdeen 2003).

If, instead, we assume that the environment is fully observable, we can treat the problem as a Markov Decision Process (MDP) and use more conventional reinforcement learning techniques to find a solution (e.g., see Sutton and Barto (1998)). The problem is that for most practical problems in robotics the environment is not fully observable. One way round this problem is to extend the two-component active perception model developed in Section 3.1 so that the reinforcement learning decision-maker uses the agent state computed by the state estimator in place of the environment state, as shown in Figure 3.8.

Figure 3.8 – Extending the two-component active perception model to incorporate reinforcement learning.


This approach also extends quite naturally to the Bayesian filter version of the active perception agent, and hence to the functionally-equivalent tactile perception methods developed at BRL, as shown in Figure 3.9.

Figure 3.9 – Bayesian filter active perception agent with reinforcement learning decision maker.

3.5 Direct policy search for reinforcement learning

For the two-component active perception model, a wide variety of reinforcement learning algorithms can be used to implement the (learning) decision maker component. These algorithms can be categorised in various ways, but for the purpose of this study we will use the taxonomy shown in Figure 3.10.

Figure 3.10 – Taxonomy of reinforcement learning algorithms.


Using this taxonomy, reinforcement learning algorithms can be classified into three main types. Value-based methods attempt to learn a policy indirectly, by first learning a value function (i.e., the expected reward to come), and then using this with a greedy or near-greedy policy to select actions that maximise the increase in value at each step. Policy-based methods attempt to learn a policy directly, without relying on an explicit value function. Policy-based reinforcement learning is also referred to as direct policy search, to distinguish it from the type of indirect policy search used in value-based approaches. A third class of methods, the so-called “actor-critic” methods, is a hybrid between value-based and policy-based approaches.

It is intuitively simpler to determine how to act than the value of acting. For example, some policies can be specified in very compact form, whereas their value functions can be very complex.



Unlike value-based approaches, it is possible to learn stochastic and memory-based policies rather than just deterministic or near-deterministic ones. This makes it better suited to operating in partially-observable or non-Markov environments (Singh et al. 1994).



It is effective in high-dimensional and continuous action spaces. Using the valuebased approach, the next action for a given state is determined by maximising the value function over all possible actions for that state. If the action space is continuous and high-dimensional this requires a global optimisation over a continuous, highdimensional space, which can be very time-consuming.



It generally has better convergence properties than value-based reinforcement learning because it does not have to resort to approximate value functions.

25



It is easier to incorporate expert domain knowledge using a policy-based approach because policies can be partially designed and parameterised beforehand, and then fine-tuned during the subsequent learning process.

In actor-critic approaches, the “actor” is the decision maker that learns an optimal policy, while the “critic” simultaneously learns a value function for providing feedback to the actor on the effectiveness of its actions (rather than the actor just depending on feedback from single-step rewards from the environment). While these architectures can be more efficient in some circumstances, they are more complicated to implement and can suffer from some of the disadvantages associated with value-based approaches. Taking these points into consideration, and acknowledging that other methods may be more applicable under some circumstances, policy-based methods are considered to be the most promising approach for implementing the reinforcement learning decision maker in the two-component active perception agent.
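As a concrete example of a compact parameterised policy, the sketch below expresses an active-focal-style policy (Section 2.4) in terms of a two-element parameter vector; the parameter layout and the agent-state format are illustrative assumptions rather than the policies used later in this work.

```python
class ActiveFocalPolicy:
    """Parameterised policy: move towards a focal point until a belief threshold is reached."""
    def __init__(self, theta):
        self.focal_point, self.decision_threshold = theta    # theta = [focal_point, decision_threshold]

    def act(self, agent_state):
        estimated_position, map_probability = agent_state    # augmented state: MAP estimate + probability
        decide = map_probability >= self.decision_threshold  # enough evidence accumulated to decide?
        move = self.focal_point - estimated_position         # otherwise, relative move towards the focal point
        return decide, move
```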

3.6 Model-free, episode-based, direct policy search

Direct policy search reinforcement learning can be further classified according to whether it is model-free or model-based (see Figure 3.10). Model-free policy search methods, including likelihood gradient methods such as REINFORCE (Williams 1992) and natural gradient methods such as Exponential Natural Evolution Strategies (Glasmachers et al. 2010; Wierstra et al. 2014), use samples to directly update the policy. Model-based methods, such as PILCO (Deisenroth & Rasmussen 2011) and Episodic REPS (Daniel et al. 2012), use samples to estimate a forward model of the agent’s dynamics and its environment, and then use the forward model to generate the samples used in the optimisation process. Model-based policy search is sample-efficient, but only works if a good model can be learned. Model-free policy search does not need to construct a model of the environment, and is therefore simpler to implement and more widely applicable; however, it typically requires a large number of samples to learn a policy. Taking all of these points into consideration, model-free policy search was considered to be the most appropriate approach for this study.

Model-free search algorithms can also be classified according to whether they are episode-based or step-based (see Figure 3.10). Episode-based methods such as Episodic REPS, CMA-ES (Hansen et al. 2003), Natural Evolution Strategies and Cross-Entropy optimisation (Rubinstein 1999; Kroese et al. 2006) explore in parameter space; they perturb the policy parameters at the beginning of an episode and evaluate the quality of the parameters using the episodic returns. Step-based methods such as REINFORCE, GPOMDP (Baxter & Bartlett 2001), and Episodic Natural Actor Critic explore in action space; they perturb the actions at


each step and evaluate the quality of state-action pairs by the reward to come (or a Monte Carlo estimate of this quantity). While step-based approaches are generally more data-efficient, they can produce jerky trajectories due to noisy exploration in action space. They also implicitly make use of the structure of the reinforcement learning problem and, in particular, the Markov assumption. Episode-based methods are more widely applicable and do not rely on the Markov assumption. They also produce smoother trajectories and are more efficient for smaller parameter spaces (Deisenroth 2013). Furthermore, since they are not dependent on the structure of the reinforcement learning problem, they can make use of “black-box optimisers” to optimise the policy parameters. For these reasons, episode-based, direct policy search is considered to be the most appropriate form of reinforcement learning for implementing goal-based learning in active perception agents. Deisenroth (2013) discusses the advantages and disadvantages of different policy search methods in more detail.

The episode-based, direct policy search methods used in this study use black-box optimisers to maximise an objective function that is defined in terms of the policy parameters (see Figure 3.11).

Figure 3.11 – Episode-based policy search using a black-box optimiser.


At the start of each learning trial, the optimiser generates a parameter vector θ, which determines the behaviour of the control policy during the trial. The control policy itself can be deterministic, random, stateless or memory-based. The policy is then run for a single episode, and the rewards received at each time step are summed to generate an episodic return R. The optimiser treats this return as a noisy estimate of the expected return (i.e., the expected cumulative reward) and adjusts the parameter vector to maximise this quantity.
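The following minimal Python sketch illustrates the shape of this loop. The optimiser object and the run_episode function are placeholders supplied by the caller (they are assumptions for illustration, not part of the framework described here); any black-box optimiser exposing an ask/tell-style interface could be substituted.

    import numpy as np

    def episode_based_policy_search(optimiser, run_episode, n_trials=100):
        """Generic episode-based policy search driven by a black-box optimiser.

        `optimiser` is assumed to expose ask() -> parameter vector and
        tell(theta, episodic_return); `run_episode` runs the parameterised
        control policy for one episode and returns the summed rewards.
        Both are placeholders supplied by the caller.
        """
        best_theta, best_return = None, -np.inf
        for _ in range(n_trials):
            theta = optimiser.ask()                 # propose policy parameters
            episodic_return = run_episode(theta)    # noisy estimate of the expected return
            optimiser.tell(theta, episodic_return)  # let the optimiser absorb the result
            if episodic_return > best_return:
                best_theta, best_return = theta, episodic_return
        return best_theta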

3.7 Policy optimisation methods

3.7.1 Characteristics of the optimisation problem

The optimisation problem that is solved during episode-based, direct policy search has the following characteristics:

• The samples of the objective function (i.e., the returns) tend to be very noisy. The return generated during a learning episode depends on the initial state of the environment, on the states visited while following the parameterised policy, and on the rewards generated in response to this trajectory. In general, all of these are random quantities, and so the returns will also be random.



• The gradient of the underlying objective function (i.e., the expected cumulative return) is not directly available.



• The objective function may have many local maxima.

To solve such problems we need to use optimisers that are capable of working with noisy, multimodal objective functions, without relying on explicit derivative information.

3.7.2 Selection of optimisers

In this study, four main types of black-box optimiser were considered:

1. Policy search gradient methods;
2. Natural Evolution Strategies (NES);
3. Cross Entropy (CE) optimisation;
4. Bayesian optimisation.

These optimisers are based on somewhat different underlying principles, but collectively provide a useful toolkit that can be used to solve a wider range of problems than would otherwise be possible using a single method. The first three are stochastic optimisation methods. They define probability distributions (usually referred to as “search distributions”) over the policy parameter space, and iteratively improve the search distribution to find a more optimal solution. The optimal policy parameters are either returned as samples of the final search distribution, or as statistics of it, such as the mean, median or mode.

Bayesian optimisation (Shahriari et al. 2016; Snoek et al. 2012; Brochu et al. 2010) incrementally constructs an approximate model of the underlying objective function, and then optimises over the model to find the optimal parameters. Bayesian optimisation has been rediscovered many times and is also known as Efficient Global Optimisation (Jones et al. 1998), Response Surface Optimisation (Jones 2001), and Sequential Kriging Optimisation (Huang et al. 2006).

3.7.3 Policy search gradient methods

Policy search gradient methods define a parameterised search distribution π(θ|ρ) over the policy parameters θ. The expected return under the search distribution is then defined as

    J(\rho) = \mathbb{E}_{\pi(\theta|\rho)}[R],

and the so-called “log-likelihood trick” is used to write the search gradient ∇ρ J(ρ) as an empirical mean of a set of N sample returns R_n, weighted by the corresponding gradients of the log probabilities under the search distribution:

    \nabla_\rho J(\rho) \approx \frac{1}{N} \sum_{n=1}^{N} R_n \, \nabla_\rho \log \pi(\theta_n \mid \rho)    (3.1)

The gradient is used to iteratively adjust the search distribution parameters in the direction of steepest ascent, to maximise the expected return. The canonical form of the policy search gradient algorithm is shown in Table 3.3.

1:  algorithm policy_search_gradient(π(·|ρ), α, β):
2:      Initialise return baseline R̄ = 0
3:      repeat
4:          for n = 1 … N
5:              Draw sample θ_n ~ π(θ|ρ)
6:              Evaluate return R_n
7:              Calculate log-derivatives ∇ρ log π(θ_n|ρ)
8:          end
9:          Update return baseline R̄ ← β · (1/N) Σ_n R_n + (1 − β) · R̄
10:         ∇ρ J = (1/N) Σ_n (R_n − R̄) · ∇ρ log π(θ_n|ρ)
11:         ρ ← ρ + α · ∇ρ J
12:     until stopping criterion is met
13:     return mean of search distribution π(θ|ρ)

Table 3.3 – Canonical policy search gradient algorithm.
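As a concrete illustration of the structure of Table 3.3, the sketch below implements a baseline-subtracted search gradient for a separable (diagonal) Gaussian search distribution. It is an illustrative sketch under that assumption, not the exact implementation used in this study, and run_episode is again a caller-supplied placeholder.

    import numpy as np

    def policy_search_gradient(run_episode, dim, n_iters=1000, n_samples=1,
                               alpha=0.1, beta=0.1, seed=0):
        """Baseline-subtracted policy search gradient for a separable Gaussian
        search distribution N(mu, diag(sigma^2)) over the policy parameters."""
        rng = np.random.default_rng(seed)
        mu, log_sigma = np.zeros(dim), np.zeros(dim)
        baseline = 0.0
        for _ in range(n_iters):
            sigma = np.exp(log_sigma)
            thetas = mu + sigma * rng.standard_normal((n_samples, dim))
            returns = np.array([run_episode(th) for th in thetas])
            # Log-derivatives of the diagonal Gaussian w.r.t. mu and log(sigma)
            z = (thetas - mu) / sigma
            grad_mu = z / sigma                   # d log pi / d mu
            grad_log_sigma = z ** 2 - 1.0         # d log pi / d log(sigma)
            # Exponentially averaged return baseline reduces gradient variance
            baseline = beta * returns.mean() + (1.0 - beta) * baseline
            advantages = (returns - baseline)[:, None]
            mu += alpha * (advantages * grad_mu).mean(axis=0)
            log_sigma += alpha * (advantages * grad_log_sigma).mean(axis=0)
        return mu  # mean of the final search distribution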

Two variants of this approach were implemented and evaluated in this study. The first variant is based on early work on policy gradient approaches by Williams (1992), and is described in more detail by Lepora (2016c). Using this method, the policy parameters are represented as fixed-length binary variables, and the search distribution is defined using separable multivariate Bernoulli distributions over the bits of each variable. The parameters of the Bernoulli distributions are estimated using a neural network-based logistic regression model. Single-sample updates (N = 1) are used to estimate the gradient. A return baseline, R̄, is calculated using an exponential averaging scheme and subtracted from the return in Lines 9 and 10 of the algorithm, to reduce the variance of the gradient estimates. In this study, this method is referred to by the pseudo-acronym REINFORCE.

The second variant is also derived from early work by Williams (1992), but in this case a separable multivariate Gaussian distribution is used as the search distribution. In this variant of the algorithm, the learning rate coefficient is computed as α = α₀ / ‖∇ρ J‖, and α₀ is reduced according to an exponential schedule α₀ ← 0.99 α₀ at each time step. As before, a baseline is subtracted from the return to reduce the variance of the gradient estimates. In this study, this method is referred to by the pseudo-acronym SGES.

Due to the separable nature of the search distributions, both of these variants sometimes encounter problems in situations where the search needs to proceed along “ridges” that are not aligned with the coordinate axes in parameter space (i.e., they are skew).

3.7.4 Natural Evolution Strategies

In many ways, Natural Evolution Strategies (NES) (Wierstra et al. 2014) are similar to the policy search gradient methods introduced in the previous section. However, instead of following the standard gradient of the expected return under the search distribution, they follow the natural gradient. This avoids some of the drawbacks of standard gradients, which are prone to slow or even premature convergence (Amari 1998; Amari & Douglas 1998).

The concept of a gradient is covered in most university-level textbooks on engineering mathematics. The gradient of a function is a vector that points in the direction that maximises the increase in the function for an infinitesimal step length. The standard gradient assumes that the step length is measured using a Euclidean distance metric, and this results in the familiar expression for the gradient operator,

    \nabla = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_d} \right).

However, for non-Euclidean spaces (e.g., polar, cylindrical or spherical coordinates) the distance metric is no longer Euclidean, but is specified in terms of a second-order metric tensor G; the corresponding gradient operator then becomes G⁻¹∇. If the space represents the parameters of a family of probability distributions, it has been argued that a more appropriate distance metric is given by the Fisher information matrix, F, which is related to the Kullback-Leibler (KL) divergence between two probability distributions (Amari 1998).

NES algorithms also use a rank-based “fitness shaping” technique to normalise the sampled returns from individual learning episodes, mapping them onto a set of equally-spaced utility values. This helps prevent the algorithm from getting stuck on plateaus, or systematically over-stepping steep optima. The canonical form of the NES algorithm is shown in Table 3.4. Two variants of this algorithm were investigated in this study: Exponential NES (xNES) (Glasmachers et al. 2010) and Separable NES (SNES) (Wierstra et al. 2014).

The xNES algorithm uses a multivariate Gaussian search distribution with a fully adaptive covariance matrix. It also incorporates some novel techniques for updating the parameters of the search distribution. In each step, the coordinate system is transformed so that the search distribution has zero mean and unit variance. This results in the Fisher information matrix becoming the unit matrix, and the natural gradient coinciding with the standard gradient. The symmetric covariance matrix is encoded using the exponential map

    \exp(M) = \sum_{k=0}^{\infty} \frac{M^k}{k!},

which results in a multiplicative form of the covariance matrix update and ensures that the covariance matrix remains symmetric positive definite. The specific variant of the xNES algorithm used in this study is described in Algorithm 5 and Table 1 of (Wierstra et al. 2014).

1:  algorithm natural_evolution_strategy(π(·|ρ), α):
2:      repeat
3:          for n = 1 … N
4:              Draw sample θ_n ~ π(θ|ρ)
5:              Evaluate return R_n
6:              Calculate log-derivatives ∇ρ log π(θ_n|ρ)
7:          end
8:          Sort {θ_n} with respect to R_n and compute rank-based utilities u_n
9:          F = (1/N) Σ_n ∇ρ log π(θ_n|ρ) ∇ρ log π(θ_n|ρ)ᵀ
10:         ∇ρ J = (1/N) Σ_n u_n · ∇ρ log π(θ_n|ρ)
11:         ρ ← ρ + α · F⁻¹ ∇ρ J
12:     until stopping criterion is met
13:     return mean of search distribution π(θ|ρ)

Table 3.4 – Canonical Natural Evolution Strategy algorithm.

The SNES variant is very similar to the xNES algorithm, but it uses a separable multivariate Gaussian distribution as the search distribution. This reduces the computational complexity from cubic or quadratic to linear, which improves the scalability of the approach for high-dimensional parameter spaces. The specific variant of the SNES algorithm used in this study is described in Algorithm 6 and Table 1 of (Wierstra et al. 2014).
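A compact sketch of a separable NES update, with rank-based fitness shaping, is shown below. The utility weights and learning rates follow common defaults from the NES literature rather than the exact settings used in this study, and run_episode is again a placeholder; it is intended only to illustrate the natural-gradient-style mean and standard-deviation updates.

    import numpy as np

    def snes(run_episode, dim, n_iters=200, pop_size=10, seed=0):
        """Separable Natural Evolution Strategy sketch with rank-based utilities."""
        rng = np.random.default_rng(seed)
        mu, sigma = np.zeros(dim), np.ones(dim)
        eta_mu, eta_sigma = 1.0, (3 + np.log(dim)) / (5 * np.sqrt(dim))
        # Rank-based utilities (fitness shaping): the best sample gets the largest weight
        ranks = np.arange(1, pop_size + 1)
        u = np.maximum(0.0, np.log(pop_size / 2 + 1) - np.log(ranks))
        u = u / u.sum() - 1.0 / pop_size
        for _ in range(n_iters):
            s = rng.standard_normal((pop_size, dim))       # samples in natural coordinates
            thetas = mu + sigma * s
            returns = np.array([run_episode(th) for th in thetas])
            s_sorted = s[np.argsort(-returns)]             # sort so the best return comes first
            # Natural-gradient updates for the mean and (log) standard deviations
            mu += eta_mu * sigma * (u[:, None] * s_sorted).sum(axis=0)
            sigma *= np.exp(0.5 * eta_sigma * (u[:, None] * (s_sorted ** 2 - 1)).sum(axis=0))
        return mu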


3.7.5 Cross Entropy optimisation

Cross Entropy (CE) optimisation (Rubinstein 1999; Kroese et al. 2006) has many similarities with policy search gradient methods and Natural Evolution Strategies. It also defines a search distribution over the parameter space, and iteratively updates the distribution during the search process. However, it uses an importance sampling technique, rather than the gradient of the expected return, to update the search distribution. The canonical form of the CE optimisation algorithm is shown in Table 3.5.

When a multivariate normal distribution is used as the search distribution, the parameter updates in Line 9 of the algorithm have a particularly simple form, and are computed as the sample mean and sample covariance of the elite samples. In the multivariate normal case, different smoothing schemes are used for the mean and covariance in Line 10 of the algorithm. For the mean, a fixed smoothing parameter α ∈ [0.5, 0.9] is used, while for the covariance, a dynamic smoothing parameter α_t = β − β(1 − 1/t)^q is used, where q is an integer between 5 and 10, and β is a smoothing parameter between 0.8 and 0.99. This helps to avoid premature convergence of the search distribution. The specific variant of the CE optimisation algorithm used in this study is described in Algorithm 2 of (Kroese et al. 2006).

1:  algorithm cross_entropy_optimisation(π(·|ρ), N, N_elite, α):
2:      repeat
3:          for n = 1 … N
4:              Draw sample θ_n ~ π(θ|ρ)
5:              Evaluate return R_n
6:          end
7:          Sort {θ_n} with respect to R_n
8:          Select the N_elite best-performing samples as the elite samples
9:          ρ* = argmax_ρ Σ_{n ∈ elite} log π(θ_n|ρ)
10:         ρ ← α · ρ* + (1 − α) · ρ
11:     until stopping criterion is met
12:     return mean of search distribution π(θ|ρ)

Table 3.5 – Canonical Cross-Entropy optimisation algorithm.
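The sketch below illustrates the CE update for a separable Gaussian search distribution. For simplicity it uses a single fixed smoothing parameter for both the mean and the standard deviation, rather than the dynamic covariance smoothing described above, so it should be read as an illustrative sketch under that assumption; run_episode is a placeholder.

    import numpy as np

    def cross_entropy_optimise(run_episode, dim, n_iters=100, pop_size=100,
                               n_elite=10, smoothing=0.7, seed=0):
        """Cross-Entropy optimisation sketch with a separable Gaussian
        search distribution and fixed smoothing of mean and std."""
        rng = np.random.default_rng(seed)
        mu, sigma = np.zeros(dim), np.ones(dim)
        for _ in range(n_iters):
            thetas = mu + sigma * rng.standard_normal((pop_size, dim))
            returns = np.array([run_episode(th) for th in thetas])
            elite = thetas[np.argsort(-returns)[:n_elite]]    # best-performing samples
            # Maximum-likelihood parameters of the elite set, smoothed into mu, sigma
            mu = smoothing * elite.mean(axis=0) + (1 - smoothing) * mu
            sigma = smoothing * elite.std(axis=0) + (1 - smoothing) * sigma
        return mu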

3.7.6 A novel hybrid algorithm for single samples

A key difference between policy search gradient, cross-entropy and evolution strategy approaches is that policy search gradient algorithms will work with single samples from the search distribution. Cross-entropy and evolution strategy algorithms need to maintain a population of samples from the search distribution, and then sort these samples based on the corresponding returns. Early experiments with the population-based algorithms showed that their performance degrades significantly and they can become unstable if the population size is reduced to a single sample.

To overcome this difficulty, and to see whether it is possible to retain some of the benefits of the natural gradient and exponential parameterisation in an algorithm that only depends on a single sample at each optimisation step, a hybrid, heuristic-based algorithm was developed. In the hybrid algorithm, instead of using the xNES/SNES fitness-shaping approach, a baseline is subtracted from the returns, and these are then normalised in a similar way to the policy search gradient approach. The canonical hybrid algorithm is shown in Table 3.6. When this algorithm is run using a single sample (N = 1), the utility values default to u = ±1, depending on whether the sample return is greater than or less than the current return baseline.

1:  algorithm policy_search_natural_gradient(π(·|ρ), α, β):
2:      Initialise return baseline R̄ = 0
3:      repeat
4:          for n = 1 … N
5:              Draw sample θ_n ~ π(θ|ρ)
6:              Evaluate return R_n
7:              Calculate log-derivatives ∇ρ log π(θ_n|ρ)
8:          end
9:          R̄ ← β · (1/N) Σ_n R_n + (1 − β) · R̄
10:         u_n = R_n − R̄, for n = 1 … N
11:         u ← u / ‖u‖
12:         F = (1/N) Σ_n ∇ρ log π(θ_n|ρ) ∇ρ log π(θ_n|ρ)ᵀ
13:         ρ ← ρ + α · F⁻¹ · (1/N) Σ_n u_n ∇ρ log π(θ_n|ρ)
14:     until stopping criterion is met
15:     return mean of search distribution π(θ|ρ)

Table 3.6 – Canonical policy search natural gradient algorithm.

Two variants of this algorithm were developed. The first is based on the xNES algorithm, and uses a multivariate Gaussian with a full covariance matrix for the search distribution; it is referred to by the pseudo-acronym xGES2 in this study. The second variant is based on the SNES algorithm, and uses a separable multivariate Gaussian with a diagonal covariance matrix; it is referred to by the pseudo-acronym SGES2. As was the case for the corresponding NES algorithms, the separable variant offers improved scalability over the full-covariance version for high-dimensional parameter spaces.


3.7.7 Bayesian optimisation

Bayesian optimisation (Brochu et al. 2010; Snoek et al. 2012; Shahriari et al. 2016) is a sequential, model-based approach to black-box optimisation under uncertainty, which is derived from response surface optimisation methods. Response surface methods iteratively construct a data set of sample points and corresponding function evaluations, which are used to build a response surface model, f: θ ↦ R, of the underlying objective function. The response surface is then used as a proxy for the true objective function in the optimisation process. In the context of direct policy search, a sample point represents a particular parameterisation, θ, of a policy, and the corresponding function evaluation represents the return, R, generated when that policy is applied for a single learning episode:

    R = f(\theta)    (3.2)

In Bayesian optimisation, the response surface is represented using a probabilistic model such as a Gaussian Process (GP), but other types of model (e.g., random forests) can also be used. This allows the model to be constructed using noisy observations, and it explicitly accounts for uncertainty in a principled way. Using such a probabilistic model, the response surface represents a probability distribution over the returns, conditioned on the parameter values.

A GP is a distribution over functions, f ~ GP(m(θ), k(θ, θ′)), and is defined in terms of a prior mean function m(θ) and a prior covariance function k(θ, θ′) (Rasmussen & Williams 2006). For a specified prior mean and covariance function, and data set D, it is possible to compute a predictive mean function μ(θ) and predictive variance function σ²(θ) for any point θ in the parameter space.

In setting up the GP response surface model for Bayesian optimisation, it is necessary to specify the prior mean and covariance function, as well as various other hyperparameters. In this study, a zero prior mean function was used throughout, and either a squared-exponential or neural network prior covariance function, depending on the context. The hyperparameters were optimised using a (local) optimisation routine provided in the GPML library (Rasmussen & Nickisch 2016). The canonical Bayesian optimisation algorithm is shown in Table 3.7.

1:  algorithm bayesian_optimisation(f(·), D):
2:      repeat
3:          Construct a GP response surface and acquisition function using data set D
4:          Find θ* that maximises the acquisition function
5:          Evaluate return R* = f(θ*)
6:          Add (θ*, R*) to data set D
7:      until stopping criterion is met
8:      return optimal value of θ with respect to final response surface

Table 3.7 – Canonical Bayesian optimisation algorithm.
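A minimal sketch of this loop is given below. It uses scikit-learn's GaussianProcessRegressor and a UCB acquisition function, maximised here by brute-force evaluation over random candidate points; the study's own implementation used the GPML library and the DIRECT optimiser instead, so this is illustrative only, and run_episode is a placeholder.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def bayesian_optimise(run_episode, bounds, n_init=10, n_iters=50, kappa=2.0, seed=0):
        """Sequential model-based optimisation of a noisy episodic return.

        `bounds` is an array of shape (dim, 2) giving lower/upper parameter limits;
        `run_episode` is the placeholder episode evaluator.
        """
        rng = np.random.default_rng(seed)
        lo, hi = bounds[:, 0], bounds[:, 1]
        X = lo + (hi - lo) * rng.random((n_init, len(lo)))           # bootstrap design
        y = np.array([run_episode(x) for x in X])
        kernel = RBF(length_scale=np.ones(len(lo))) + WhiteKernel()  # allow for noisy returns
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        for _ in range(n_iters):
            gp.fit(X, y)                                             # rebuild the response surface
            cand = lo + (hi - lo) * rng.random((2000, len(lo)))      # random candidates (DIRECT in the study)
            mean, std = gp.predict(cand, return_std=True)
            x_next = cand[np.argmax(mean + kappa * std)]             # maximise the UCB acquisition
            X = np.vstack([X, x_next])
            y = np.append(y, run_episode(x_next))
        gp.fit(X, y)
        mean, _ = gp.predict(X, return_std=True)
        return X[np.argmax(mean)]   # parameters with the highest expected return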

The response surface is constructed in an incremental fashion, starting with an initial set of points to bootstrap the process (obtained using stratified uniform random sampling in this study), and then adding a new point to the data set and rebuilding the response surface at each step. An acquisition function is used to map the response surface probability distribution to a single utility function, which is then optimised to find the next point. Acquisition functions are usually parameterised so that the user can trade off exploration and exploitation during the optimisation process. Various different acquisition functions have been developed over the years, including those based on probability of improvement, expected improvement, upper confidence bound, and information-based methods (Shahriari et al. 2016). All of these functions depend on the predictive mean and variance functions produced by the GP model, but there is no consensus as to whether any of them is better than the others.

Probability of improvement (PI) attempts to maximise the probability of finding an improvement over the current best return f(θ⁺), where θ⁺ is the best sample point found so far:

    \mathrm{PI}(\theta) = \mathbb{P}\left( f(\theta) \geq f(\theta^{+}) + \xi \right) = \Phi\!\left( \frac{\mu(\theta) - f(\theta^{+}) - \xi}{\sigma(\theta)} \right)    (3.3)

Here, Φ(·) is the normal cumulative distribution function. The parameter ξ is used to trade off exploration against exploitation, with higher values representing a higher degree of exploration.

Expected improvement (EI) attempts to maximise the expected improvement over the current best return:

    \mathrm{EI}(\theta) = \begin{cases} \left( \mu(\theta) - f(\theta^{+}) - \xi \right) \Phi(Z) + \sigma(\theta)\,\phi(Z) & \text{if } \sigma(\theta) > 0 \\ 0 & \text{if } \sigma(\theta) = 0 \end{cases}

    Z = \begin{cases} \dfrac{\mu(\theta) - f(\theta^{+}) - \xi}{\sigma(\theta)} & \text{if } \sigma(\theta) > 0 \\ 0 & \text{if } \sigma(\theta) = 0 \end{cases}    (3.4)

Here, φ(·) and Φ(·) denote the PDF and CDF of the standard normal distribution, respectively. Once again, the parameter ξ is used to trade off exploration against exploitation, with higher values representing a higher degree of exploration. Lizotte (2008) suggests that setting ξ = 0.01 (scaled by the signal variance if necessary) works well in almost all cases, although the experiments carried out in this study suggest that higher values are required when the returns are very noisy.

Upper Confidence Bound (UCB) is defined as:

    \mathrm{UCB}(\theta) = \mu(\theta) + \kappa\,\sigma(\theta)    (3.5)

Here, κ > 0 is a parameter that determines the trade-off between exploration and exploitation, with higher values of κ representing a higher degree of exploration. A special case of UCB is GP-UCB, where the parameter κ = \sqrt{\nu \tau_t}. With ν = 1, τ_t = 2 \log\left( t^{d/2+2} \pi^2 / 3\delta \right) and δ ∈ (0, 1), it can be shown with high probability that this method leads to zero regret in the limit t → ∞ (Srinivas et al. 2010).

All three types of acquisition function were used in this study. The optimisation over the acquisition function to determine the next sample point was implemented using the DIRECT algorithm (Finkel 2003). DIRECT is a deterministic, derivative-free, global optimiser that has been used in a number of Bayesian optimisation implementations.

In noisy environments, instead of using the parameter vector with the best return encountered so far to calculate the PI and EI acquisition functions, we use the parameter vector with the highest expected return, θ⁺ = argmax_θ μ(θ). This avoids the problem of trying to maximise the probability of improvement or expected improvement using unreliable or noisy samples. For any of these acquisition functions, it is possible to reduce the level of noise by averaging the returns over a number of learning episodes (Hutter 2009). In extremely noisy environments, it may be better to use more specialised types of acquisition function that have been explicitly designed for this purpose (Picheny et al. 2013).
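The three acquisition functions can be written down directly from the GP predictive mean and standard deviation; the sketch below is a straightforward transcription of Equations 3.3 to 3.5 (the small constant guarding against division by zero is an implementation detail, not part of the definitions).

    import numpy as np
    from scipy.stats import norm

    def probability_of_improvement(mean, std, best, xi=0.01):
        """PI acquisition (Equation 3.3); xi trades exploration for exploitation."""
        return norm.cdf((mean - best - xi) / np.maximum(std, 1e-12))

    def expected_improvement(mean, std, best, xi=0.01):
        """EI acquisition (Equation 3.4); returns zero where the predictive std is zero."""
        std = np.asarray(std, dtype=float)
        z = np.where(std > 0, (mean - best - xi) / np.maximum(std, 1e-12), 0.0)
        ei = (mean - best - xi) * norm.cdf(z) + std * norm.pdf(z)
        return np.where(std > 0, ei, 0.0)

    def upper_confidence_bound(mean, std, kappa=2.0):
        """UCB acquisition (Equation 3.5); larger kappa gives more exploration."""
        return mean + kappa * std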

3.7.8 Modifications to deal with constraints

For the Bayesian optimisation algorithms used in this study, the DIRECT algorithm was used to optimise over the acquisition function and select the next point within a valid range; the valid range of parameters is specified in the function call to the DIRECT algorithm. However, for the other optimisation methods (e.g., policy search gradient, natural evolution strategies, cross entropy optimisation) there is no built-in mechanism for handling constraints on parameter values, as they are essentially unconstrained optimisers. To get around this problem, a transformation method developed at NASA in the 1970s (Park 1975) was used to map the constrained optimisation problem to an unconstrained problem. This involved applying a logistic squashing function to the samples drawn from the search distribution, to map them to the valid range before calculating the episodic return.
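The squashing step can be illustrated as follows; the specific scaling used here is an assumption for illustration and is not necessarily the exact form given by Park (1975).

    import numpy as np

    def squash_to_range(theta, lower, upper):
        """Map unconstrained search-distribution samples to [lower, upper]
        with a logistic function before the episodic return is evaluated."""
        return lower + (upper - lower) / (1.0 + np.exp(-theta))

For example, an unconstrained sample of 0.0 maps to the midpoint of the valid range, while very large positive or negative samples saturate at the upper or lower bound respectively.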


4 Application to learning problems in tactile robotics

Before applying the new goal-based learning approach to the three tactile robotic tasks described in this chapter, all of the optimisation algorithms were tested and evaluated on a selection of standard benchmark problems (see Appendix A).

4.1 Curvature perception

The first set of experiments show that it is possible to learn an optimal (or close-to-optimal) policy for identifying cylinders with different curvatures using the direct policy search methods described in Chapter 3. An active focal policy was used for these experiments because it is relatively simple and has demonstrated better performance than passive random or passive stationary policies for this type of task in the past. The policy was parameterised by a decision threshold probability and a focal point position (measured in mm). The threshold probability defines a level of confidence that must be achieved before a final decision is produced. The focal point specifies where the sensor is moved to from its estimated position after each tap.

4.1.1 Experimental setup

A set of 5 cylinders with different curvatures and diameters (30 to 70 mm) was mounted to one side of the ABB robot arm in the experimental test area, as shown in Figure 4.1. For each cylinder, separate training and test data sets were gathered by making 400 taps at 0.1 mm intervals across a 40 mm range centred on the axis of each cylinder. The taps were made along a straight line, perpendicular to each cylinder. The training data was used to compute the likelihood distribution for the Bayesian updates performed during the active perception process. The test data was used to produce simulated sensor readings for different cylinders and sensor locations during the offline Monte Carlo simulations.

Figure 4.1 – Experimental setup: (a) ABB robotic arm and work space; (b) cylinders used in the curvature perception experiments.

The largest (sixth) cylinder shown in Figure 4.1 was not used in this experiment, as it was mounted at a different height to the other cylinders and, as such, presented complications in the training and testing process.


4.1.2 What does the objective function look like?

Each learning trial can be viewed as a decision process, requiring T time steps to complete, and with classification error E = 0 if the cylinder is correctly identified and E = 1 if it is incorrectly identified. The episodic return can then be defined in terms of the negative Bayes risk,

    R = -(c_T T + c_E E),

where c_T and c_E are positive coefficients that express the relative risk of decision times and errors. Since only the relative value of c_T / c_E is important, we can simplify this expression to

    R = -(c T + E),

where larger values of c place more emphasis on timely decision making, and smaller values place more emphasis on correct decisions. The overall goal is to maximise the expected episodic return.

In situations like this, where there are only two parameters, we can analyse the shape of the underlying objective function using a brute-force approach. This information can then be used to judge the quality of solutions found by the optimisers. While the individual returns are very noisy, the expected values can be approximated using the empirical means. Approximate objective functions obtained using 100 samples for four different values of c are shown in Figure 4.2.
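To make the return definition concrete, the sketch below shows the general shape of a single learning episode for the active focal policy: sequential Bayesian updates of the class posterior until the decision threshold is crossed, followed by the negative Bayes-risk return. The env object and its methods (uniform_prior, tap, bayes_update, move_to_focal_point, true_class) are placeholders standing in for the existing active perception machinery, not real APIs, so this is only an illustrative sketch.

    import numpy as np

    def run_curvature_episode(decision_threshold, focal_point, env, c=0.1, max_taps=50):
        """One learning episode of the active focal policy.

        Returns the negative Bayes risk R = -(c*T + E), where T is the number
        of taps and E is 1 for a misclassification, 0 otherwise.  `env` is a
        placeholder wrapping the tap simulator and Bayesian perception updates.
        """
        posterior = env.uniform_prior()                      # prior over cylinder classes
        for t in range(1, max_taps + 1):
            reading = env.tap()                              # simulated sensor reading
            posterior = env.bayes_update(posterior, reading)
            if posterior.max() >= decision_threshold:        # enough evidence: decide
                break
            env.move_to_focal_point(focal_point, posterior)  # active repositioning
        error = float(np.argmax(posterior) != env.true_class)
        return -(c * t + error)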

Figure 4.2 – Curvature perception objective functions for different values of the decision time vs error trade-off coefficient c: (a) c = 1.0; (b) c = 0.1; (c) c = 0.01; (d) c = 0.001.


Note that the focal point positions are all negative with respect to the start of the tap range in the base reference frame.

For the case c = 0.1, there is a reasonably well-defined global maximum, in contrast to the broad flat regions that characterise the maxima for other values of c. The objective function for c = 0.1 is shown from a number of different perspectives in Figure 4.3. Note that Figure 4.3(d) was computed using a higher resolution and a larger number of samples in the region around the global maximum. For the focal point parameter, notice that there is also a good local maximum at -18 mm, on the opposite side of the cylinder axis to the global maximum at -24 mm.

Figure 4.3 – More detailed view of the curvature perception objective function for c = 0.1: (a) from above; (b) side elevation (decision threshold); (c) side elevation (focal point); (d) close-up of side elevation (focal point).

Figure 4.4 shows how the level of noise increases considerably as the number of samples averaged over is reduced from 100, through 10, to 1 in the single-sample case. The single-sample case is particularly relevant because it is the one that is most often encountered in these experiments. As Figure 4.4(c) shows, this is a particularly difficult function to optimise,


having a large flat region around the global maximum and an extremely low signal-to-noise ratio.

Figure 4.4 – Noise levels on the curvature perception objective function for returns averaged over (a) 100 samples, (b) 10 samples, and (c) 1 sample.

4.1.3 Performance on single-sample learning episodes (c = 0.1)

Rather than perform a detailed comparison between the different optimisation approaches across all four values of c, it was decided to focus initially on the case c = 0.1, which has a reasonably well-defined global maximum, and then extend the results to other values of c afterwards. Unfortunately, none of the population-based stochastic optimisers (i.e., the CE optimiser, xNES and SNES) could be made to work with this particular data set, probably because of the high level of noise present in the sampled returns. This was not particularly surprising, since the performance of these optimisers was found to deteriorate significantly with increasing levels of noise in trials on standard benchmark functions (see Appendix A).


The remaining stochastic optimisers (xGES2, SGES2, REINFORCE) were run for 10,000 trials and the three variants of Bayesian optimiser were run for 500 trials to give them sufficient time to converge. The stochastic optimisers typically required between 1000 and 2000 trials to converge, while the Bayesian optimisers required between 100 and 200 trials. Example runs for a selection of optimisers using single-sample returns are shown in Figure 4.5.

Figure 4.5 – Examples of optimisation runs for a selection of optimisers using single-sample returns: (a) xNES, (b) SNES, (c) REINFORCE, and (d) Bayes (PI).


Box plots showing the distribution of optimal parameter values and returns after 500 trials, for 10 runs of the experiment, are shown in Figure 4.6.

Figure 4.6 – Distribution of optimal parameter values and returns after 500 trials, for 10 runs of the experiment: (a) decision threshold; (b) focal point; and (c) episode return.

The solid green lines in Figure 4.6(a) and Figure 4.6(b) show the actual optimal values of the two policy parameters, based on brute-force evaluation (see Figure 4.3). The dashed green line in Figure 4.6(b) shows the location of the local maximum along the focal point axis identified in Figure 4.3(d). (Note that these box plots were produced using the MATLAB Statistics and Machine Learning Toolbox; the red crosses represent outliers, and the dashed lines span between the interquartile range and the minimum/maximum values.) Interestingly, the stochastic optimisers all seemed to converge towards this local maximum in preference to the nearby global maximum. However, it was not clear why this was so.

With reference to Figure 4.3 and Figure 4.4, and taking account of the fact that the region around the global maximum is relatively flat and the signal-to-noise level very low, the results show that all optimisers can find reasonably good solutions within 500 iterations. However, the REINFORCE method is arguably the most accurate, despite not performing so well on the noise-free benchmark functions, or functions with low levels of noise (see Appendix A). It is also noticeable that the focal point estimates produced by the three Bayesian optimisers have a higher variance than those produced by the other optimisers.

While the Bayesian optimisers offer comparable levels of performance to the stochastic optimisers (albeit with a higher variance on the focal point parameter), they typically require an order of magnitude fewer trials to converge. This makes them particularly well-suited to online learning, where each learning trial can take seconds or even minutes to complete.

4.1.4 Increasing the solution accuracy

Given that Bayesian optimisers can find relatively good solutions quickly, it is natural to ask whether it is possible to trade some of this speed advantage for increased accuracy. One way of improving the accuracy of Bayesian optimisers in noisy environments is by using more samples per learning trial and averaging out the noise in the sampled returns (Hutter 2009). To see whether this approach works in this context, a second experiment was carried out where the returns were averaged over 10 samples. This approach was only viable for the Bayesian optimisers, since the single-sample stochastic optimisers require an order of magnitude more trials to converge, and generating 10 samples for each trial in these cases would make the learning process intractable.

Another benefit of reducing the level of noise on the returns was that the population-based stochastic optimisers could now be made to work on this problem. While the Cross Entropy optimiser was still not a viable option, due to the need for a large population of samples in each trial (100 samples per trial were used in this study), both the xNES and SNES optimisers used much smaller population sizes (6 samples in this study) and so were still reasonably competitive in terms of computation time. Example runs for a selection of optimisers using returns averaged over sets of 10 samples are shown in Figure 4.7.


Figure 4.7 – Examples of optimisation runs for a selection of optimisers using averaged returns: (a) Bayes (PI), (b) Bayes (UCB), (c) xNES, and (d) SNES.

For the SNES optimiser shown in Figure 4.7(d), notice the jump in optimal focal point value from the good local maximum at -18 mm to the global maximum at -24 mm, after 300 trials.


Box plots showing the distribution of optimal parameter values and returns after 500 trials, for 10 runs of the experiment, are shown in Figure 4.8.

Figure 4.8 – Distribution of optimal parameter values and returns after 500 trials, for 10 runs of the experiment: (a) decision threshold; (b) focal point; and (c) episode return.

The performances of the original single-sample Bayesian optimisers, as well as those of the xNES and SNES optimisers, have been included for comparative purposes. The labels of the Bayesian optimisers that use averaged returns are suffixed with a “+” in the figures. The results show that, for the PI and UCB Bayesian optimisers, this level of noise reduction does not have a significant effect on the accuracy of solutions found by the optimisers, but it


does reduce the variance to some extent. In the extreme case, where the returns are averaged over an infinite number of samples, the empirical means would cease to be noisy and the optimisers would effectively be operating on a deterministic objective function. Since all of the population-based stochastic optimisers produce accurate results on deterministic objective functions (see Appendix A), in most situations of practical relevance it is reasonable to expect more accurate results as the level of noise is reduced.

4.1.5 Reducing the learning time

Returning to the speed/accuracy trade-off, it is also useful to know whether Bayesian optimisers can achieve a reasonable level of performance with even fewer trials and samples than are needed for full convergence. This is particularly important from an online, real-time learning perspective. To address this question, the PI and UCB Bayesian optimisers that were used in the previous experiment were run for 36 trials to bootstrap construction of the GP response surface, followed by a further 64 optimisation trials. So, in this experiment, the total number of trials and samples was limited to 100.

The results are shown in Figure 4.9. The performances of the original, single-sample Bayesian optimisers have been included for comparative purposes. The labels of the sample-limited Bayesian optimisers are suffixed with a “-” in the figures. The results show that limiting the total number of trials/samples to 100 does not have a significant impact on the accuracy of these two Bayesian optimisers. This is important because, when used in an online setting, the learning time is dominated by the length of time it takes to generate each sample return. This experiment shows that it is possible to produce good results (though not necessarily optimal ones) in a tractable period of time – something that is not currently possible using other types of optimiser or reinforcement learning approach.


Figure 4.9 – Distribution of optimal parameter values and returns after 100 trials, for 10 runs of the experiment: (a) threshold probability; (b) focal point; and (c) episode return.

4.1.6 Performance on single-sample learning episodes (other c values)

A final set of experiments on curvature perception looked at how sample-limited Bayesian optimisers such as the ones considered in the previous section performed for the other values of the c parameter mentioned earlier. To answer this question, a Bayesian optimiser with a UCB acquisition function was run for 100 trials (36 initial trials to bootstrap construction of the GP response surface, and 64 subsequent optimisation trials), for 10 runs of the experiment. (The UCB variant tended to run slightly faster than the other variants, and produced comparable levels of performance.) Examples of the final response surfaces produced by the Bayesian optimiser for the four different values of c are shown in Figure 4.10.

Figure 4.10 – Final response surfaces after 100 (36+64) trials of Bayesian optimisation (UCB) for different values of c: (a) c = 1.0; (b) c = 0.1; (c) c = 0.01; (d) c = 0.001.

Comparing these response surfaces to the mean return functions of Figure 4.2 shows that they capture most of the salient features of the underlying objective functions, at least in the regions surrounding the global maxima. Box plots showing the distribution of optimal parameter values after 100 trials, for 10 runs of the experiment, for the different values of c, are shown in Figure 4.11.

The results show that as c is increased (corresponding to more emphasis on faster decision times in the Bayes risk formula) the optimal decision threshold decreases, but there is very little change in the optimal focal point. In fact, for all values of c the optimal focal point is located close to the centre of the tap range, almost directly over the cylinder axis. The significantly larger variance of the optimal focal point for c = 1.0 is consistent with the shape of the response function and underlying objective function for this value of c (see Figure 4.2(a) and Figure 4.10(a)), and also with decisions being made immediately after the first tap (i.e., there is no active perception).

Figure 4.11 – Distribution of optimal parameter values after 100 trials, for 10 runs of the experiment, for four different values of c: (a) decision threshold; and (b) focal point.

4.1.7 Summary and discussion of results

This first set of experiments showed that the proposed goal-based learning approach can learn an optimal (or close-to-optimal) parameterised policy for identifying cylinders with different curvatures. Optimality was defined in terms of a negative Bayes risk criterion that incorporates a trade-off parameter, c, for determining the relative importance of decision time versus decision correctness.

In the first experiment, single-sample learning episodes were used to learn an optimal policy for a particular value of the trade-off parameter (c = 0.1). This value was selected because the associated objective function has a relatively well-defined global maximum, making it easier to evaluate the results. The performance of each optimiser was evaluated over 10 runs of the experiment, which was sufficient to demonstrate the robustness of the results.

Curvature perception proved difficult with the population-based stochastic optimisers when they were used with single-sample learning episodes. This might have been because of the high levels of non-Gaussian noise present in the returns and the deteriorative effect this has on performance (Beyer 2000). However, we were still able to find good solutions with the other stochastic and Bayesian optimisers, which produced returns of greater than -0.34 after 500 trials (the optimal value is approximately -0.26 and the worst-case value over the optimisation domain is less than -5.0). Although none of these solutions was strictly optimal, this might be the best we can hope for with such a low signal-to-noise ratio in the region around the global maximum. Clearly, further research is needed to answer this question.

The Bayesian optimisers performed even better than the stochastic optimisers, producing returns between -0.32 and -0.28 after 500 trials. Furthermore, they converged after only 100 to 200 trials, which makes them far better suited to online learning scenarios than their stochastic counterparts.

The second experiment looked at whether it is possible to improve the performance of the Bayesian optimisers by averaging out some of the noise on the returns over 10 consecutive learning episodes. In other words, rather than running a single learning episode to generate a return, 10 episodes were run and then the returns averaged. This approach was advocated by Hutter (2009) as a means of handling noisy objective functions in Bayesian optimisation. Somewhat surprisingly, the results showed that this approach does not significantly improve the accuracy in this particular case, producing returns of less than -0.28 after 500 trials. This might be due to the extremely high levels of non-Gaussian noise present in the returns (see Figure 4.4). However, another benefit of averaging the returns in this way is that the noise level can be reduced to a point where the population-based stochastic optimisers start to work reliably. In this experiment the xNES and SNES optimisers produced returns between -0.31 and -0.28, which are comparable with the single-sample results obtained using other optimisers.

A third experiment looked at whether it is possible to achieve reasonable performance using even fewer trials. For this experiment, two of the Bayesian optimisers were initialised with 36 samples to construct a good initial GP response surface, and then run for a further 64 optimisation trials (making a total of 100 samples/trials). The results showed that it was possible to achieve a comparable level of performance to that achieved using 500 optimisation trials, with returns of between -0.29 and -0.27. In fact, the performance of the Bayes (UCB) variant actually improved, from a mean return of -0.31 to around -0.27.

The UCB variant of the Bayesian optimiser was then used to find optimal parameterised policies for three other values of c used in the negative Bayes risk criterion (c = 0.001, c = 0.01, c = 1.0). The results showed that the optimal decision threshold decreases as c is increased in value, but the optimal focal point remains approximately constant at the centre of the tap range, close to the cylinder axis. This result is consistent with our intuition, which suggests that if decision time is more important than decision accuracy, then the level of evidence required to make a decision should be reduced. Furthermore, the physical symmetry of the problem and the stateless nature of the active focal policy suggest that the optimal focal point is likely to be situated somewhere close to the cylinder axis.


4.2 Orientation perception

The second set of experiments show that it is possible to learn an optimal (or close-to-optimal) policy for estimating the angle of the tangent to a curved edge or contour. An active focal policy with two parameters was also used in these experiments.

4.2.1 Experimental setup

A 54 mm radius disk was mounted on a bench in front of the ABB robot arm, as shown in Figure 4.12.


Figure 4.12 – Experimental setup: (a) ABB robotic arm and work space; (b) disk used in the angle perception experiments.

For this experiment, the 360 degree range of angles was discretised into 36 classes, ranging from 0 to 350 degrees in 10 degree increments. Rather than move the sensor round the perimeter of the cylinder to collect the training and test data, the sensor was fixed in one position and rotated in 10 degree increments on the spot. For each angle class, 40 taps were made at 0.5 mm intervals across a 20 mm range centred on (and normal to) the edge of the cylinder.

4.2.2 What does the objective function look like?

As before, an episodic return is defined in terms of the negative Bayes risk, and the overall goal is to maximise the expected episodic return. Approximate objective functions obtained using 100 samples for four different values of c are shown in Figure 4.13.

Figure 4.13 – Angle perception objective functions for different values of the decision time vs error trade-off coefficient c: (a) c = 1.0; (b) c = 0.1; (c) c = 0.01; (d) c = 0.001.

Unlike the case for curvature perception, the objective functions for the orientation perception problem are not symmetric about the centre of the focal point range; there is a distinctive asymmetry, which is more pronounced for higher values of the threshold probability. They also do not have distinctive maxima, in the sense that they are very flat-topped in these regions. The objective function for c = 0.1 is shown from a number of different perspectives in Figure 4.14.


Figure 4.14 – More detailed view of the angle perception objective function for c = 0.1: (a) original view; (b) from above; (c) side elevation (decision threshold); (d) side elevation (focal point).

Figure 4.15 shows how the level of noise increases considerably as the number of samples averaged over is reduced from 100, through 10, to 1 in the single-sample case.

From Figure 4.14, we can see that if we optimised the objective function for c = 0.1, the optimal solution would have a decision threshold of around 0.1, and a focal point between 2 mm and 18 mm. The focal point would also have a high variance, because the surface is relatively flat and the returns are quite noisy for small values of the decision threshold. This would make it very difficult to judge the quality of any solutions. To overcome this problem, it was decided to optimise the function over a restricted range of decision thresholds where the global maximum is somewhat more pronounced and easier to identify. So, for this set of experiments, the decision threshold was restricted to values greater than 0.7.


Figure 4.15 – Noise levels on the angle perception objective function for returns averaged over (a) 100 samples, (b) 10 samples, and (c) 1 sample.

4.2.3 Performance on single-sample learning episodes

A Bayesian optimiser with a UCB acquisition function was run for 100 trials (36 initial trials to bootstrap construction of the GP response surface, and 64 subsequent optimisation trials), over 10 runs of the experiment. Examples of the final response surfaces produced by the Bayesian optimiser for the four different values of c are shown in Figure 4.16.

Figure 4.16 – Final response surfaces after 100 (36+64) trials of Bayesian optimisation (UCB) for different values of c: (a) c = 1.0; (b) c = 0.1; (c) c = 0.01; (d) c = 0.001.

Comparing these response surfaces to the mean return functions of Figure 4.13 shows that they capture most of the salient features of the underlying objective functions, at least in the regions surrounding the global maxima. Box plots showing the distribution of optimal parameter values after 100 trials, for 10 runs of the experiment, for the different values of c, are shown in Figure 4.17.

The results show that as c is increased (corresponding to more emphasis on faster decision times in the Bayes risk formula), the optimal decision threshold remains constant at approximately 0.7. This is consistent with the negative slope of the objective function in the decision threshold dimension in Figure 4.14, and the constraint that the decision threshold is greater than 0.7.



Figure 4.17 – Distribution of solutions found by Bayesian optimisers with UCB acquisition function for all c values: (a) decision threshold; and (b) focal point.

It can also be seen that as c is increased, the optimal focal point shifts away from the edge of the cylinder in the centre of the tap range, to a location outside the perimeter. This is somewhat contrary to intuition, which might suggest that the optimal focal point should be located directly over the edge of the cylinder.

4.2.4 Summary and discussion of results

This second set of experiments showed that the proposed goal-based learning approach can learn an optimal (or close-to-optimal) parameterised policy for estimating tangent angles around the circumference of a disk. The angles were discretised in 10-degree intervals, giving 36 angle classes, ranging from 0 to 350 degrees. Once again, an active focal policy was used, and the aim was to learn an optimal decision threshold probability and focal point position. As before, optimality was defined in terms of a negative Bayes risk criterion, which incorporated a trade-off parameter, c, for determining the relative importance of the time taken to make a decision versus the correctness of the decision.

By plotting the mean return over 100 samples for different values of the threshold probability and focal point, it was possible to approximate the underlying objective functions (i.e., the expected return) for different values of the trade-off parameter c. In contrast to the curvature perception experiments, for the angle perception experiments the objective functions were found to be asymmetric about the centre of the tap range located over the edge of the disk. In particular, the optimal focal point was found to be skewed towards the outside of the disk, and the degree of skewness became more pronounced for higher values of c, where decision time is the main factor that determines overall performance. While it is not entirely clear why this is so, one possibility is that since the disk is a solid object,

most of the “taxels” are displaced when tapping inside the edge, and the “delta” due to the edge feature is relatively small. When tapping outside the edge, most of the taxels are not displaced from their rest positions, and the delta due to the edge feature is much greater. In other words, tapping the sensor inside the edge of the disk is like trying to detect a small AC signal superimposed on a large DC bias; tapping the sensor outside the edge of the disk is like trying to detect a small AC signal on top of a small DC bias.

Another characteristic feature of the objective functions is that they are almost flat for lower values of the decision threshold. So, for this set of experiments, the search for an optimal decision threshold was restricted to values greater than 0.7, where the global maximum is somewhat more pronounced and easier to identify.

The UCB variant of the Bayesian optimiser was used to find an optimal set of parameters for the active focal policy, for all four values of c (c = 0.001, c = 0.01, c = 0.1, c = 1.0). The performance was evaluated over 10 runs of the experiment. For each run, the optimiser was initialised with 36 samples (selected using stratified uniform sampling over a 6x6 grid) to construct a good initial GP response surface, and then run for a further 64 optimisation trials (making a total of 100 samples/trials).

The results showed that the optimiser can find good solutions for all values of c. For c = 0.01, c = 0.1 and c = 1.0, the solutions are close to optimal with respect to the approximate objective function. For c = 0.001, the objective function is flat over a large range of focal points in the region around the global maximum, so the solutions found by the optimiser have a much larger variance in this dimension. The optimal decision threshold was found to be between 0.70 and 0.71 in all cases, which is not surprising given the negative slope of the surface in this dimension and the constraint that the threshold is greater than 0.7. The optimal focal point was found to shift outwards from the edge of the disk for increasing values of c (more emphasis on timeliness of decisions). As noted above, this contrasts with the curvature perception results, where the optimal focal point remained roughly constant, located centrally over the cylinder axis.

4.3 Contour exploration

4.3.1 Experimental setup

The third and final set of experiments showed that an ABB robot arm equipped with a TacTip sensor can navigate around the circumference of a disk using tactile feedback alone, and that the proposed learning approach can find an optimal parameterised policy for doing this.


4.3.2 Offline simulation approach

The same 54 mm radius disk that was used for the orientation perception experiments was also used for this experiment. The same data capture process was also used to collect separate training and test data sets. However, unlike the orientation perception experiments, the Monte Carlo test procedure used in this experiment needs to simulate sensor readings around the edge of the cylinder as it is being traversed. So a simulator was constructed that uses the test data to generate the sensor readings that would be produced by the physical robot arm and TacTip for any Cartesian position (and orientation) in the vicinity of the circumference. The simulator was initialised with the dimensions of the cylinder and its location relative to the starting position of the sensor, so that for any position in the vicinity of the edge it can work out the corresponding discretised tangent angle and normal (radial) distance from the edge. It can then sample from the (test data) sensor readings that were collected for these values.

4.3.3 A simple contour exploration algorithm

A simple control algorithm was developed to enable the robot arm to navigate around the circumference of the disk. A single iteration of this algorithm is shown in Figure 4.18.

Figure 4.18 – A single iteration of the contour exploration algorithm, showing passive prediction step (tangential) and active correction steps (normal to passive prediction step).

The starting point for each iteration is assumed to be on or close to the circumference of the disk. Each iteration consists of two phases: an initial passive prediction step that moves to the next point on (or close to) the circumference; and a sequence of active correction steps that corrects the error made in the initial prediction step and estimates the tangent angle at the next point on the circumference.

In the simplest case, an initial prediction step of fixed length is taken along the tangent from the start point; in this experiment, a prediction step length of 5 mm was used. A sequence of active correction steps is then taken along the normal to the prediction step direction, to bring the sensor back onto the edge of the disk. The length of each of these steps is calculated as the distance from the estimated sensor location to the desired target location on the edge of the disk, measured along the normal to the edge of the disk. Since the steps are not taken along the normal to the disk, but along the normal to the prediction step direction, they never quite reach the edge of the disk. Strictly speaking, the shifting of distributions in the Bayesian filter step should happen in both the normal (radial) distance and tangent angle dimensions, since a move along the normal to the prediction step direction changes both of these values. However, if the prediction step length is small compared to the radius of the disk, the errors will be relatively small.

The sequence of active correction steps terminates when the probability of the MAP estimate of the tangent angle exceeds a predefined threshold. This algorithm can be extended by adding an angular offset

to the prediction step

direction, as shown in Figure 4.19.

Figure 4.19 – Passive prediction step taken along an offset angle

61

to the tangent line.

In the figure, a negative angular offset is shown, but in general the offset can be either positive or negative. When the offset is positive, the prediction step will move closer to the disk than if it had moved along the tangent line; when it is negative, the step will move further away. For some positive value of the offset and prediction step length, the step will lie along a chord of the circle (disk). The complete algorithm is shown in Table 4.1.

1:  algorithm contour_exploration(decision_threshold, angular_offset):
2:      Initialise variables
3:      repeat
4:          Passive prediction step:
5:              Move the prediction step distance along a line at the angular offset to the estimated tangent angle
6:          Active correction step:
7:              repeat
8:                  Update posterior distribution over tangent angle and normal distance using Bayesian filter update
9:                  Compute MAP estimate of tangent angle and normal distance
10:                 if ℙ(MAP estimate) ≤ decision_threshold
11:                     Calculate correction move from estimated distance to focal point
12:                     Move the correction distance along the normal to the prediction step direction
13:                 end
14:                 Increment tap counter
15:             until ℙ(MAP estimate) > decision_threshold
16:             Increment iteration counter
17:     until stopping criterion is met

Table 4.1 – Contour exploration algorithm.
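To illustrate the structure of the listing in Table 4.1, the following is a minimal Python sketch of a single iteration of the loop. The agent interface (update, map_estimate, move), the 10 mm focal point default and the sign conventions are illustrative assumptions, not the project's MATLAB implementation.

    import numpy as np

    def contour_iteration(agent, threshold, offset_deg, step_mm=5.0, focal_mm=10.0):
        """One passive prediction step followed by active correction steps.

        agent is assumed to expose:
          agent.update()        -- one tap plus a Bayesian filter update
          agent.map_estimate()  -- (tangent_angle_deg, radial_mm, probability) of the MAP class
          agent.move(dx, dy)    -- relative Cartesian move of the sensor
        """
        tangent_deg, _, _ = agent.map_estimate()

        # Passive prediction step: move along the tangent estimate plus an angular offset.
        heading = np.radians(tangent_deg + offset_deg)
        agent.move(step_mm * np.cos(heading), step_mm * np.sin(heading))

        # Active correction steps: move along the normal to the prediction direction
        # until the MAP tangent-angle probability exceeds the decision threshold.
        normal = heading + np.pi / 2.0   # sign convention is illustrative
        while True:
            agent.update()
            tangent_deg, radial_mm, prob = agent.map_estimate()
            if prob > threshold:
                break
            correction = focal_mm - radial_mm   # distance from estimate to the focal point
            agent.move(correction * np.cos(normal), correction * np.sin(normal))
        return tangent_deg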

The algorithm is parametrised by two values: a decision threshold probability that determines when the sequence of active correction steps should terminate (lower values terminate the sequence in fewer steps), and an angular offset that determines the direction of the prediction step. Some examples of contour trajectories produced using this algorithm are shown in Figure 4.20 and Figure 4.21. The position of the edge of the disk is shown in blue.


Figure 4.20 – Examples of contour exploration trajectories for an angular offset of 0° and decision threshold probabilities of (a) 0.1, (b) 0.3, (c) 0.5, and (d) 0.7.

The short black lines that are present in many of the figures are due to the initial radial (normal) position being incorrectly estimated as 0 mm. Since the focal point is situated in the middle of the tap range at approximately 10 mm, this results in an outwards radial move of approximately 10 mm from the actual initial position close to the edge of the disk, to the outside. After the second tap, the radial position is correctly estimated and the sensor moves back onto the edge of the disk.


Figure 4.21 – Examples of contour exploration trajectories for a decision threshold probability of 0.5 and angular offsets of (a) −40°, (b) −20°, (c) 20°, and (d) 40°.

4.3.4 Learning the optimal parameters

For this experiment, each learning episode consisted of 50 iterations of the algorithm, and the episodic return was defined as a negative weighted sum of three terms:

R = −(w₁·N_taps + w₂·L_steps + w₃·E_track)    (4.1)

Here, N_taps is the number of taps needed to complete the episode (a measure of the total time needed), L_steps is the total length of all active steps taken during the episode (a measure of the movement away from the edge during the episode), and E_track is the total of the absolute errors between the estimated radial (normal) positions and the target focal points at the end of each active correction sequence (a measure of the ability to track the edge of the disk). Note that the relationship between the number of taps and the total active step length is not as straightforward as it might at first appear: sometimes one increases when the other increases, and sometimes it decreases (see Table 4.2 and Table 4.3).
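As a concrete illustration of equation (4.1), a minimal Python sketch of the episodic return calculation is given below; the argument and weight names are illustrative, and the default weights correspond to the equal weighting of 0.001 used in this experiment.

    def episodic_return(n_taps, total_step_length, total_tracking_error,
                        w1=0.001, w2=0.001, w3=0.001):
        """Negative weighted sum of the three performance terms in equation (4.1)."""
        return -(w1 * n_taps + w2 * total_step_length + w3 * total_tracking_error)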


The coefficients w₁, w₂ and w₃ are positive constants that determine the relative importance of the different terms in the expression. In this experiment, values of w₁ = w₂ = w₃ = 0.001 were used. Examples that show how the individual components of the return vary with the decision threshold probability and the angular offset are listed in Table 4.2 and Table 4.3, respectively.

threshold   offset   N_taps   L_steps   E_track
0.1         0°       71       19        1025
0.3         0°       171      1329      185
0.5         0°       240      1901      43
0.7         0°       330      2068      15
0.9         0°       482      2205      9

Table 4.2 – Variation of performance with threshold probability.

threshold   offset   N_taps   L_steps   E_track
0.5         −40°     260      892       58
0.5         −20°     233      444       19
0.5         0°       240      1901      43
0.5         20°      225      1907      31
0.5         40°      263      1541      13

Table 4.3 – Variation of performance with different values of angular offset.

All three variants of Bayesian optimiser (PI, EI and UCB) were used to learn optimal values of the decision threshold probability and the angular offset. Each variant was run for 100 trials (16 initial trials to bootstrap construction of the GP response surface, and 84 subsequent optimisation trials), over 3 runs of the experiment. Examples of the final response surfaces constructed by the Bayesian optimisers and the distribution of samples selected during the optimisation process are shown in Figure 4.22.
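For reference, the following is a minimal Python sketch of this kind of two-parameter search using the scikit-optimize library, rather than the MATLAB/GPML implementation used in the project. The run_episode stub, the parameter bounds and the synthetic score it returns are illustrative assumptions only.

    from skopt import gp_minimize

    def run_episode(threshold, offset_deg):
        # Hypothetical stand-in for one 50-iteration learning episode in the offline
        # simulator; replace with a call that returns the true episodic return R.
        return -(threshold ** 2 + (offset_deg / 45.0) ** 2)   # purely synthetic score

    def objective(params):
        threshold, offset_deg = params
        return -run_episode(threshold, offset_deg)   # gp_minimize minimises, so negate R

    result = gp_minimize(objective,
                         dimensions=[(0.05, 0.95),    # decision threshold probability
                                     (-45.0, 45.0)],  # angular offset (degrees)
                         acq_func="EI",               # "PI", "EI" or "LCB" acquisition
                         n_initial_points=16,         # bootstrap points for the GP surface
                         n_calls=100)                 # total learning trials
    best_threshold, best_offset = result.x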


Figure 4.22 – Final response surfaces and distribution of samples selected during the Bayesian optimisation process for three variants of the algorithm: (a) Bayes (PI) response surface; (b) Bayes (PI) samples; (c) Bayes (EI) response surface; (d) Bayes (EI) samples; (e) Bayes (UCB) response surface; (f) Bayes (UCB) samples.


Note that all of these surfaces are very similar in appearance, despite having been produced by Bayesian optimisers with different acquisition functions. The mean optimal parameter values for each variant of Bayesian optimiser are listed in Table 4.4.

             Bayes (PI)   Bayes (EI)   Bayes (UCB)
threshold    0.30         0.31         0.31
offset       −18.27°      −18.94°      −18.84°

Table 4.4 – Optimal parameter values, averaged over three runs of the experiment.

These results are extremely unintuitive, since a negative angular offset corresponds to a move outwards from the tangent line rather than a move inwards towards the disk, as one might have expected. However, the results are consistent with results from Section 4.2, which showed that when the emphasis is on making quicker decisions the optimal focal point is located to the outside of the disk, rather than directly over the edge. The trajectory produced by the contour-following algorithm using the optimal parameter values found by the Bayes (UCB) optimiser is shown in Figure 4.23.

Figure 4.23 – Contour exploration trajectory for the optimal parameter values: threshold = 0.31 and offset = −18.84°.

4.3.5 Summary and discussion of results

This final set of experiments showed that the proposed goal-based learning approach can learn an optimal policy for tracking the edge (contour) of a disk. A simple two-phase, iterative control algorithm was developed for this purpose. In the first phase of each iteration (the prediction step), the sensor is moved to the next predicted point on the edge of the disk. For these experiments, the prediction step consisted of a straight-line move at an angular offset to the tangent at the current point on the circumference. In the second phase of each iteration (the correction step), an active perception/control sequence is used to (a) reposition the sensor on the edge of the disk, and (b) estimate the tangent angle at the new location on the

circumference. Two parameters were optimised during the learning process: the angular offset used in the prediction step, and the decision threshold probability used in the active perception/control sequence in the correction step. Optimality for this experiment was defined in terms of a weighted sum of three performance criteria, which measured (a) the time taken to complete a learning episode, (b) the degree to which the sensor tracked the edge, and (c) the amount of “wandering about” during the process. All three variants of Bayesian optimiser were applied to this problem, and three runs of the experiment were carried out for each one. Fewer runs were used in this experiment than in some of the earlier ones because it was not intended to be a rigorous statistical comparison between algorithms, but more of a demonstration that provides confidence that the proposed approach can solve non-trivial problems. The results show that where the three components of the performance criterion are equally weighted, the optimal angular offset is between -18 and -19 degrees, and the optimal decision threshold probability is approximately 0.31. This is a counter-intuitive result, as this angular offset corresponds to a move outwards from the tangent line, rather than a move inwards towards the disk, as one might have expected. However, it is consistent with the results from Section 4.2, which show that when the emphasis is on making quicker decisions the optimal focal point is located towards the outside of the disk, rather than directly over the edge. This experiment also highlights another benefit of using episodic direct policy search reinforcement learning.

Since very few assumptions are made about the underlying

reinforcement learning problem, black-box optimisers can be used to co-optimise other parameters in the system (e.g., constituent models or algorithms) at the same time as optimising the parameters of an active perception/control policy. Another novel feature of this experiment, at least in the context of the BRL tactile robotics work, was the use of an offline Monte Carlo simulator for developing and testing the algorithm. This reduces development and test time considerably, requiring only a few seconds to perform each learning episode, rather than a few minutes. This type of approach might also be useful for developing other tactile robotic algorithms in the future, although care needs to be taken to ensure that the Monte Carlo environment is a good approximation to the underlying physical system.


5 Discussion

5.1 Significance of results

Overall, the results show that the proposed goal-based learning approach can be used to learn optimal parameterised policies for a range of tactile perception and exploration tasks that are currently used in the tactile robotics group at BRL, rather than having to completely define them in advance. Furthermore, one particular class of methods, based on Bayesian optimisation, can find good solutions in fewer than 100 learning trials, making them suitable for online, real-time autonomous learning. This is something that has not been possible until now, because existing approaches typically require at least an order of magnitude more learning trials to find a solution.

The learning framework has been synthesised from a simplified model of active perception and control, which, in one particular configuration, identifies the mapping between the tactile perception methods currently in use at BRL and the Bayesian filter methods that are widely used in mobile robotics and other areas. This novel interpretation will allow many useful methods that have been developed in other areas of robotics (e.g., mobile robotics) to be applied in the tactile domain. Some potential benefits of adopting this new interpretation have already been demonstrated as part of this work, in an initial pilot study of particle filter implementations of tactile active perception (see Appendix B).

Although it has been argued that direct policy search methods are the most appropriate form of reinforcement learning for tactile robotic problems, the learning framework is certainly more general than that, and there is nothing to prevent other forms of reinforcement learning being used within it. Similarly, while single-point MAP estimates and their associated probabilities have been used to represent the agent state in the demonstrations used in this study, there is no reason why other representations should not be used. Indeed, it is not even necessary to compute the agent state using a probabilistic approach. The same applies to the use of decision-theoretic probability thresholding in the control policies: it was used here because it is a central feature of the existing policies in use at BRL, not because it is a necessary prerequisite of the learning framework.

Notwithstanding the high level of flexibility provided by the learning framework, the direct policy search methods advocated and demonstrated in this study do offer a number of important advantages over other methods.

Most importantly, they do not make any

particularly strong assumptions about the context in which they operate. This means that powerful black-box optimisers can be used to optimise the policy parameters and co-optimise any other parameters used by the host system or algorithm. By directly optimising the policy rather than trying to learn a value function and attempting to behave optimally with respect to

that function, it is possible to incorporate expert knowledge about the problem domain in an initial policy, and then further optimise this over time. This capability is also important from a safety point-of-view as it allows an initial policy to be defined with a “safe” range of parameters that can be optimised over – something that would be very difficult to do with value-based reinforcement learning.

5.2 Relationship with other work

As stated in Chapter 2, other than the work of Lepora et al. (2013), it has not been possible to find any other published reports of reinforcement learning (or any other form of goal-based learning) being applied to high-acuity tactile sensors in an active perception setting. In that study, the researchers attempted to learn optimal values for the threshold probability and focal point of an active focal policy using a value-based form of reinforcement learning, and demonstrated their approach on a cylinder classification problem. While it is difficult to make direct comparisons with their results, because they used a different type of lower-resolution sensor, cylinders of different diameter, and presented their results in a different way, the approach that they used required more than 1000 learning trials to converge, which makes it unsuited to practical, online robot learning. They also used a very coarse discretisation of the sensor position, with only 16 position classes, as opposed to the 1000 position classes used in this study.

In other work, Martinez-Hernandez et al. (2014; 2016) developed a number of tactile, active perception, contour-following algorithms, but did not apply any form of learning to their approach. Despite using a more complicated architecture for tracking the contour (edge) of a disk, the resulting trajectories do not appear to be as accurate as those obtained in this study (based on a direct visual comparison). Once again, it is difficult to make robust comparisons because of the different experimental setups and presentation of results, but this does suggest that a simpler but more optimised approach is better suited to this type of task.

5.3 Suggestions for further research

Based on the results and experience gained while carrying out this study, it would be interesting and useful to explore the following areas of further work:

• Experience gained during this project has shown Bayesian optimisation to be something of a “black art”. It would therefore be useful to develop a more automated solution that does not require so much manual optimisation of the GP models and associated hyperparameters.



• Some researchers have pointed out that Bayesian optimisation in very noisy environments needs to be treated differently than noise-free or low-noise scenarios (Hutter 2009), and that non-standard types of acquisition function may be better in these situations (Picheny et al. 2013). Since only the three most common types of acquisition function were evaluated in this study, it would be interesting to see whether any of these more specialised functions improve accuracy and convergence speed. It would also be interesting to see whether the increased time required to average out some of the noise on learning episode returns, as discussed in Section 4.1.4, is offset by a faster convergence time overall. In a similar vein, it would be useful to know what the trade-off is between population size and convergence speed for the population-based stochastic optimisers (only the default recommended population sizes were used in this study).

• In this study, fairly simple parameterised policies have been used throughout. It would therefore be interesting to see whether more optimal solutions could be found using more flexible, stochastic, or memory-based policies (e.g., using neural networks or finite state machines). However, more flexible policies are likely to need more data to learn a larger number of parameters, so this approach may be better suited to offline learning.



• The reward functions used in this study have combined multiple objectives using a weighted sum of terms, which involves specifying the relative importance of the terms a priori. However, in many practical applications it is advantageous to identify a set of non-dominated solutions on a “Pareto front”, rather than fixing the trade-off a priori and maximising a single weighted objective (Ngatchou et al. 2005). It would therefore be interesting to explore this idea further in this context.



• One of the problems encountered in the curvature perception experiments was the high level of noise on the sampled returns. For situations like this, it would be interesting to see whether a combined policy-based and value-based approach (i.e., an “actor-critic” system) would help to reduce the variance of the returns and thereby improve the accuracy and speed of convergence (Deisenroth 2013; Grondman et al. 2012).



• In this study, some initial proof-of-concept work has been carried out on particle filter implementations of active perception agents (see Appendix B). Particle filters have also been successfully used in other areas of tactile robotics (Petrovskaya et al. 2006; Petrovskaya et al. 2007; Petrovskaya & Khatib 2011; Platt et al. 2011). It would therefore be interesting to perform a more rigorous comparison between the histogram-based (grid) implementations of Bayesian filters that have traditionally been used in the BRL work, and alternatives such as particle filters.



• Using the models developed in this study, the most general formulation of the goal-based learning problem can be viewed as a POMDP problem. As tactile robotic tasks become more complex, it might be possible to find better solutions by resorting to a full POMDP treatment of the problem, rather than using a particular type of approximate method as has been done in this study.


6 Conclusions

The principal aim of this dissertation project was to develop and evaluate a method for goal-based learning in an active touch robotic system. This has been achieved by reinterpreting and extending current methods of tactile active perception developed at BRL and demonstrating their effectiveness on three goal-based learning tasks in the areas of tactile perception and exploration. More specifically, this project has achieved the following:

• Developed a simplified model of active perception and used it to show that BRL methods of tactile active perception are equivalent to a histogram-based Bayesian filter embedded in a special type of perception/control loop. The perception/control loop uses information derived from the estimated belief distribution to determine which actions to take under a control policy. This mapping between approaches will allow many useful methods that have been developed in other areas of robotics (e.g., localisation methods in mobile robotics) to be applied in the tactile domain.



• Extended the simplified active perception model to incorporate a reward signal from the environment and replaced the decision-maker element of the perception/control loop with a reinforcement learning agent. The reinforcement learning agent is implemented using direct policy search methods that rely on black-box optimisers to adjust the parameters of the policy in order to maximise the expected returns. This will provide a goal-based learning capability within the existing framework and allow control policies to be learned, optimised or adapted rather than having to be predefined in advance.

• Successfully demonstrated the effectiveness of the goal-based learning capability on three tactile robotic tasks: two tactile perception tasks concerned with cylinder identification and angle estimation, respectively, and an exploration task concerned with tracking the edge of a disk.

From this work we can draw two important conclusions:

• It is possible to integrate goal-based learning in the tactile active perception systems in use at BRL and elsewhere, using the methods described in this dissertation. In its most general formulation, this problem can be viewed as a POMDP learning problem, and the methods developed in this dissertation as approximate solution techniques.



• These methods are effective at learning parameterised control policies for tactile robotic tasks that have been used by researchers at BRL over the past few years. However, further work needs to be done to investigate how these methods will scale up to more complex tasks as they arise in future, and on improving their accuracy and efficiency on current problems.


References Aberdeen, D., 2003. A (revised) survey of approximate methods for solving partially observable Markov decision processes. National ICT Australia, Canberra, Australia, pp.1–41. Amari, S., 1998. Natural gradient works efficiently in learning. Neural Computation, 10, pp.251–276. Amari, S. & Douglas, S.C., 1998. Why natural gradient? Proceedings of the 1998 IEEE international conference on Acoustics, Speech and Signal Processing, 2, pp.1213–1216. Argall, B.D. et al., 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), pp.469–483. Argall, B.D. & Billard, A.G., 2010. A survey of Tactile Human-Robot Interactions. Robotics and Autonomous Systems, 58(10), pp.1159–1176. Asada, M. et al., 2009. Cognitive Developmental Robotics: A Survey. IEEE Transactions on Autonomous Mental Development, 1(1), pp.12–34. Assaf, T. et al., 2014. Seeing by Touch: Evaluation of a Soft Biologically-Inspired Artificial Fingertip in Real-Time Active Touch. Sensors, 14(2), pp.2561–2577. Bajcsy, R., 1988. Active Perception. Proceedings of the IEEE. Bajcsy, R., Aloimonos, Y. & Tsotsos, J.K., 2016. Revisiting Active Perception. arXiv preprint arXiv:1603.02729. Bar-Cohen, Y., 2006. Biomimetics - using nature to inspire human innovation. Bioinspiration & biomimetics, 1(1), pp.1–12. Baxter, J. & Bartlett, P.L., 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, pp.319–350. Berger, J.O., 2013. Statistical decision theory and Bayesian analysis, Springer Science & Business Media. Beyer, H.-G., 2000. Evolutionary algorithms in noisy environments: theoretical issues and guidelines for practice. Computer Methods in Applied Mechanics and Engineering, 186(2–4), pp.239–267. Bhushan, B., 2009. Biomimetics: lessons from nature--an overview. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1893), pp.1445–1486. Bicchi, A., 2000. Hands for dexterous manipulation and robust grasping: A difficult road 75

toward simplicity. IEEE Transactions on Robotics and Automation, 16(6), pp.652–662. Bishop, C.M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics), Secaucus, NJ, USA: Springer-Verlag New York, Inc. Bohg, J. et al., 2016. Interactive Perception: Leveraging Action in Perception and Perception in Action. arXiv preprint arXiv:1604.03670. Botvinick, M. & Weinstein, A., 2014. Model-based hierarchical reinforcement learning and human action control. Philos Trans R Soc Lond B Biol Sci, 369(1655). Brochu, E., Cora, V.M. & De Freitas, N., 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. Bullock, I.M., Ma, R.R. & Dollar, A.M., 2013. A hand-centric classification of human and robot dexterous manipulation. IEEE Trans. Haptics, 6(2), pp.129–144. Cassandra, A.R., 1998. Exact and approximate algorithms for partially observable Markov decision processes. PhD Thesis. Brown University. Castro, R., Willett, R.M. & Nowak, R.D., 2005. Faster Rates in Regression Via Active Learning. In Neural Information Processing Systems. Chorley, C. et al., 2009. Development of a Tactile Sensor Based on Biologically Inspired Edge Encoding. In 14th International Conference on Advanced Robotics (ICAR 2009). pp. 1–6. Cole, J. & Paillard, J., 1995. Living without touch and peripheral information about body position and movement: Studies with deafferented subjects. In The body and the self. MIT Press, pp. 245–266. Cramphorn, L., Ward-cherrier, B. & Lepora, N.F., 2016. Tactile manipulation with biomimetic active touch. In 2016 IEEE International Conference on Robotics and Automation (ICRA). Cutkosky, M.R., Howe, R.D. & Provancher, W.R., 2008. Force and tactile sensors. In B. Siciliano & O. Khatib, eds. Springer Handbook of Robotics. Springer, pp. 455–476. Dahiya, R.S. et al., 2011. Guest Editorial special issue on robotic sense of touch. IEEE Transactions on Robotics, 27(3), pp.385–388. Dahiya, R.S. et al., 2010. Tactile sensing - From humans to humanoids. IEEE Transactions on Robotics, 26(1), pp.1–20. Daniel, C., Neumann, G. & Peters, J., 2012. Hierarchical Relative Entropy Policy Search. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS


2012), XX, pp.273–281. Daw, N.D. & Doya, K., 2006. The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16(2), pp.199–204. Deisenroth, M.P., 2013. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2(1), pp.1–142. Deisenroth, M.P. & Rasmussen, C.E., 2011. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the 28th International Conference on machine learning (ICML-11). pp. 465–472. Demiris, Y. & Dearden, A., 2005. From motor babbling to hierarchical learning by imitation: a robot developmental pathway. In Fifth International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. pp. 31–37. Doya, K., 2007. Reinforcement learning: Computational theory and biological mechanisms. HFSP Journal, 1(1), pp.30–40. Eltaib, M.E.H. & Hewit, J.R., 2003. Tactile sensing technology for minimal access surgery - A review. Mechatronics, 13(10), pp.1163–1177. Ferreira, J.F. & Dias, J., 2014. Probabilistic approaches to robotic perception, Springer. Finkel, D.E., 2003. DIRECT optimization algorithm user guide. Center for Research in Scientific Computation, North Carolina State University, 2, pp.1–14. Fishel, J.A. & Loeb, G.E., 2012. Bayesian exploration for intelligent identification of textures. Frontiers in Neurorobotics, 6(June), pp.1–20. Fitzpatrick, P.M. et al., 2003. Learning About Objects Through Action - Initial Steps Towards Artificial Cognition. In Procceedings of the ICRA’03 IEEE International Conference on Robotics and Automation. pp. 3140–3145. Fox, D. & Hightower, J., 2003. Bayesian filtering for location estimation. IEEE Pervasive Computing, 2, pp.24--33. Franklin, D.W. & Wolpert, D.M., 2011. Computational mechanisms of sensorimotor control. Neuron, 72(3), pp.425–442. Freeman, W.J., 1999. Comparison of brain models for active vs. passive perception. Information sciences, 116(2), pp.97–107. Friedrich, H. et al., 1996. Robot programming by Demonstration (RPD): Supporting the induction by human interaction. Machine Learning, 23(2–3), pp.163–189.


Fu, J., Levine, S. & Abbeel, P., 2015. One-Shot Learning of Manipulation Skills with Online Dynamics Adaptation and Neural Network Priors. arXiv preprint arXiv:1509.06841. Gallace, A. & Spence, C., 2014. In touch with the future, Oxford University Press. Georgeon, O.L., Marshall, J.B. & Manzotti, R., 2013. ECA: An enactivist cognitive architecture based on sensorimotor modeling. Biologically Inspired Cognitive Architectures, 6, pp.46–57. Ghadirzadeh, A. & Maki, A., 2015. A sensorimotor approach for self-learning of hand-eye coordination. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4969–4975. Gibson, J.J., 1962. Observations on active touch. Psychological review, 69(6), pp.477–491. Gibson, J.J., 2014. The ecological approach to visual perception, Psychology Press. Girao, P.S. et al., 2013. Tactile sensors for robotic applications. Measurement: Journal of the International Measurement Confederation, 46(3), pp.1257–1271. Glasmachers, T. et al., 2010. Exponential natural evolution strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation. pp. 393–400. Gottlieb, J., 2012. Attention, Learning, and the Value of Information. Neuron, 76(2), pp.281– 295. Grondman, I. et al., 2012. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. , 42(6), pp.1291–1307. Han, W., Levine, S. & Abbeel, P., 2015. Learning Compound Multi-Step Controllers under Unknown Dynamics. In 2015 IEEE/RSJ International Conference on Intelligent Robot and Systems (IROS). pp. 6435–6442. Hansen, N. et al., 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp.1–18. Hauskrecht, M., 2000. Value-Function Approximations for Partially Observable Markov Decision Processes. Journal of Artificial Intelligence Research, 13, pp.33–94. Hoffmann, H., 2007. Perception through visuomotor anticipation in a mobile robot. Neural Networks, 20(1), pp.22–33. Hoffmann, M. et al., 2012. Using sensorimotor contingencies for terrain discrimination and adaptive walking behavior in the quadruped robot puppy. In From Animals to Animats 12. Springer, pp. 54–64.


Hogman, V., Bjorkman, M. & Kragic, D., 2013. Interactive object classification using sensorimotor contingencies. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2799–2805. van Hoof, H. et al., 2015. Learning Robot In-Hand Manipulation with Tactile Features. In 2015 IEEE/RAS 15th International Conference on Humanoid Robots. pp. 121–127. Hsiao, K. & Kaelbling, L.P., 2010. Task-Driven Tactile Exploration. Proceedings of Robotics Science and Systems. Hsiao, K., Kaelbling, L.P. & Lozano-Pérez, T., 2011. Robust grasping under object pose uncertainty. Autonomous Robots, 31(2–3), pp.253–268. Huang, D. et al., 2006. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3), pp.441–466. Hutter, F., 2009. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD Thesis. University of British Columbia. Jamil, M. & Yang, X.-S., 2013. A Literature Survey of Benchmark Functions For Global Optimization Problems Citation details: Momin Jamil and Xin-She Yang, A literature survey of benchmark functions for global optimization problems. Int. Journal of Mathematical Modelling and Numerical Optimisation, 4(2), pp.150–194. Jones, D.R., 2001. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21, pp.345–383. Jones, D.R., Schonlau, M. & William, J., 1998. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13, pp.455–492. Kaelbling, L., Littman, M. & Cassandra, A., 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101(1–2), pp.99–134. Kaelbling, L.P., Littman, M.L. & Moore, A.W., 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, pp.237–285. Kappassov, Z., Corrales, J.-A. & Perdereau, V., 2015. Tactile sensing in dexterous robot hands — Review. Robotics and Autonomous Systems, 74, pp.195–220. Kemp, C.C., Edsinger, A. & Torres-Jara, E., 2007. Challenges for robot manipulation in human environments. IEEE Robotics and Automation Magazine, 14(1), pp.20–29. Knill, D.C. & Pouget, A., 2004. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), pp.712–719.


Knill, D.C. & Richards, W., 1996. Perception as Bayesian inference, Cambridge University Press. Kober, J., Bagnell, J.A. & Peters, J., 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), pp.1238–1274. Kording, K.P. & Wolpert, D.M., 2006. Bayesian decision theory in sensorimotor control. Trends in Cognitive Sciences, 10(7), pp.319–326. Kormushev, P., Calinon, S. & Caldwell, D., 2013. Reinforcement Learning in Robotics: Applications and Real-World Challenges. Robotics, 2(3), pp.122–148. Krakauer, J.W. & Mazzoni, P., 2011. Human sensorimotor learning: Adaptation, skill, and beyond. Current Opinion in Neurobiology, 21(4), pp.636–644. Kroemer, O. et al., 2015. Towards Learning Hierarchical Skills for Multi-Phase Manipulation Tasks. In IEEE International Conference on Robotics and Automation 2015. pp. 1503–1510. Kroese, D.P., Porotsky, S. & Rubinstein, R.Y., 2006. The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability, 8(3), pp.383–407. Lai, T.L., 2001. Sequential analysis: some classical problems and new challenges. Statistica Sinica, 11, pp.303–408. Lederman, S.J. & Klatzky, R.L., 1993. Extracting object properties through haptic exploration. Acta Psychologica, 84(1), pp.29–40. Lederman, S.J. & Klatzky, R.L., 1987. Hand movements: A window into haptic object recognition. Cognitive Psychology, 19(3), pp.342–368. Lee, M.H., 2000. Tactile Sensing : New Directions, New Challenges. The International Journal of Robotics Research, 19(7), pp.636–643. Lepora, N.F., Martinez-Hernandez, U., Pezzulo, G., et al., 2013. Active Bayesian perception and reinforcement learning. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 4735–4740. Lepora, N.F., 2016a. Active tactile perception. In Prescott, Tony J., E. Ahissar, & E. Izhikevich, eds. Scholarpedia of Touch. Atlantis Press, pp. 151–159. Lepora, N.F., 2016b. Biomimetic Active Touch with Fingertips and Whiskers. IEEE Transactions on Haptics, 9(2), pp.170–183. Lepora, N.F. et al., 2012. Brain-inspired Bayesian perception for biomimetic robot touch.


Proceedings - IEEE International Conference on Robotics and Automation, pp.5111–5116. Lepora, N.F. et al., 2015. Tactile Superresolution and Biomimetic Hyperacuity. IEEE Transactions on Robotics, 31(3), pp.605–618. Lepora, N.F., 2016c. Threshold Learning for Optimal Decision Making. In Neural Information Processing Systems (NIPS) 2016. Lepora, N.F., Martinez-Hernandez, U. & Prescott, T.J., 2013a. A SOLID case for active bayesian perception in robot touch. In Biomimetic and Biohybrid Systems. Springer, pp. 154– 166. Lepora, N.F., Martinez-Hernandez, U. & Prescott, T.J., 2013b. Active Bayesian Perception for Simultaneous Object Localization and Identification. Robotics: Science and Systems IX. Lepora, N.F. & Pezzulo, G., 2015. Embodied Choice: How Action Influences Perceptual Decision Making. PLoS Computational Biology, 11(4), pp.1–22. Lepora, N.F., Verschure, P. & Prescott, T.J., 2013. The state of the art in biomimetics. Bioinspiration & Biomimetics, 8, pp.1–11. Lepora, N.F. & Ward-Cherrier, B., 2015. Superresolution with an optical tactile sensor. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2686– 2691. Lepora, N.F., Ward-cherrier, B. & Member, S., 2016. Tactile Quality Control With Biomimetic Active Touch. IEEE Robotics and Automation Letters, 1(2), pp.646–652. Lizotte, D.J., 2008. Practical Bayesian Optimization. PhD Thesis. University of Alberta. Loeb, G.E. & Fishel, J.A., 2014. Bayesian action and perception: Representing the world in the brain. Frontiers in Neuroscience, 8(Oct), pp.1–13. Loomis, J.M. & Lederman, S.J., 1986. Tactual perception. In Handbook of perception and human performances 2. p. 31.3-31.41. Lukoosevicius, M. & Jaeger, H., 2009. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), pp.127–149. Lungarella, M. et al., 2003. Developmental robotics: a survey. Connection Science, 15(4), pp.151–190. Ma, R.R. & Dollar, A.M., 2011. On dexterity and dexterous manipulation. IEEE 15th International Conference on Advanced Robotics: New Boundaries for Robotics, ICAR 2011, pp.1–7.


Ma, R.R., Odhner, L.U. & Dollar, A.M., 2013. A modular, open-source 3D printed underactuated hand. In 2013 IEEE International Conference on Robotics and Automation. pp. 2737–2743. Martinez-Hernandez, U., Dodd, T.J., et al., 2013. Active Bayesian perception for angle and position discrimination with a biomimetic fingertip. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5968–5973. Martinez-Hernandez, U., Metta, G., et al., 2013. Active contour following to explore object shape with robot touch. In 2013 World Haptics Conference, WHC 2013. pp. 341–346. Martinez-hernandez, U. et al., 2016. Active sensorimotor control for tactile exploration. Robotics and Autonomous Systems, In Press. Martinez-Hernandez, U., 2014. Autonomous active exploration for tactile sensing in robotics. PhD Thesis. University of Sheffield. Martins, R., Ferreira, J.F. & Dias, J., 2014. Touch attention Bayesian models for robotic active haptic exploration of heterogeneous surfaces. IEEE International Conference on Intelligent Robots and Systems (IROS), pp.1208–1215. Maye, A. & Engel, A.K., 2011. A discrete computational model of sensorimotor contingencies for object perception and control of behavior. In 2011 IEEE International Conference on Robotics and Automation. pp. 3810–3815. Maye, A. & Engel, A.K., 2013. Extending sensorimotor contingency theory: prediction, planning, and action generation. Adaptive Behavior, 21(6), pp.423–436. Maye, A. & Engel, A.K., 2012. Using sensorimotor contingencies for prediction and action planning. In From Animals to Animats 12. Springer, pp. 106–116. Meltzoff, A.N. et al., 2009. Foundations for a New Science of Learning. Science, 325(5938), pp.284–288. Meltzoff, A.N. & Moore, M.K., 1997. Explaining Facial Imitation: A Theoretical Model. Early Development & Parenting, 6(June), pp.179–192. Murphy, K.P., 2000. A Survey of POMDP Solution Techniques, University of British Columbia. Murphy, K.P., 2012. Machine learning: a probabilistic perspective, MIT press. Ngatchou, P., Zarei, A. & El-Sharkawi, A., 2005. Pareto Multi Objective Optimization. Proceedings of the 13th International Conference on, Intelligent Systems Application to Power Systems, pp.84–91.


Nguyen-Tuong, D. & Peters, J., 2011. Model learning for robot control: a survey. Cognitive Processing, 12(4), pp.319–340. O’Regan, J.K., 2011. Why Red Doesn’t Sound Like a Bell, Oxford University Press. O’Regan, J.K. & Noe, A., 2001a. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24, pp.939–1031. O’Regan, J.K. & Noe, A., 2001b. What it is like to see: A sensorimotor theory of perceptual experience. Synthese, 129(1), pp.79–103. Park, S.K., 1975. A transformation method for constrained-function minimization, NASA Technical Note TN D-7983. Petrovskaya, A. et al., 2006. Bayesian estimation for autonomous object manipulation based on tactile sensors. Proceedings - IEEE International Conference on Robotics and Automation, 2006, pp.707–714. Petrovskaya, A. et al., 2007. Touch Based Perception for Object Manipulation. Robotics Science and Systems, Robot Manipulation Workshop, (February), pp.2–7. Petrovskaya, A. & Khatib, O., 2011. Global localization of objects via touch. IEEE Transactions on Robotics, 27(3), pp.569–585. Pfeifer, R., Lungarella, M. & Iida, F., 2012. The challenges ahead for bio-inspired “soft” robotics. Communications of the ACM, 55(11), p.76. Philipona, D. & O’Regan, J.K., 2006. Color naming, unique hues, and hue cancellation predicted from singularities in reflection properties. Visual Neuroscience, 23, pp.331–339. Picheny, V., Wagner, T. & Ginsbourger, D., 2013. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3), pp.607–626. Platt, R., Permenter, F. & Pfeiffer, J., 2011. Using bayesian filtering to localize flexible materials during manipulation. IEEE Transactions on Robotics, 27(3), pp.586–598. Prescott, T.J., Diamond, M.E. & Wing, A.M., 2011. Active touch sensing. Philos Trans R Soc Lond B Biol Sci, 366(1581), pp.2989–2995. Prescott, T.J. & Dürr, V., 2013. The World of Touch. In T. J. Prescott, E. Ahissar, & E. Izhikevich, eds. Scholarpedia of Touch. Atlantis Press, pp. 1–28. Rasmussen,

C.E. & Nickisch, H., 2016. GPML Library. Available at: http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html. Rasmussen, C.E. & Williams, C.K.I., 2006. Gaussian processes for machine learning, MIT

Press. Robles-De-La-Torre, G., 2006. The Importance of the Sense of Touch in Virtual and Real Environments. IEEE Multimedia, 3(July), pp.24–30. Rolf, M. & Asada, M., 2014. Autonomous development of goals: From generic rewards to goal and self detection. In 2014 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics. pp. 187–194. Rolf, M. & Steil, J.J., 2012. Goal Babbling: a New Concept for Early Sensorimotor Exploration. In Humanoids 2012 Workshop on Developmental Robotics: Can developmental robotics yield human-like cognitive abilities?. pp. 40–43. Rolf, M., Steil, J.J. & Gienger, M., 2010. Goal babbling permits direct learning of inverse kinematics. IEEE Transactions on Autonomous Mental Development, 2(3), pp.216–229. Rolf, M., Steil, J.J. & Gienger, M., 2011. Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions. In 2011 IEEE International Conference on Development and Learning. Rubinstein, R.Y., 1999. The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodology and Computing in Applied Probability, 1(2), pp.127–190. Russell, S. & Norvig, P., 2009. Artificial Intelligence: A Modern Approach 3rd ed., Upper Saddle River, NJ, USA: Prentice Hall Press. Saegusa, R. et al., 2008. Active motor babbling for sensorimotor learning. In 2008 IEEE International Conference on Robotics and Biomimetics. pp. 794–799. Särkkä, S., 2013. Bayesian filtering and smoothing, Cambridge University Press. Schaal, S., 2003. Dynamic movement primitives-A framework for motor control in humans and humanoid robots. In Adaptive Motion of Animals and Machines. Springer. Schaal, S. & Atkeson, C.G., 2010. Learning control in robotics. IEEE Robotics and Automation Magazine, 17(2), pp.20–29. Settles, B., 2012. Active learning, Morgan & Claypool Publishers. Settles, B., 2011. From Theories to Queries: Active Learning in Practice. Proceedings of the Workshop on Active Learning and Experimental Design, 16, pp.1–18. Shahriari, B. et al., 2016. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), pp.148–175. Singh, S.P., Jaakkola, T. & Jordan, M.I., 1994. Learning without state-estimation in partially 84

observable Markovian decision processes. Proceedings of the eleventh international conference on machine learning, 31(0), p.37. Snoek, J., Larochelle, H. & Adams, R., 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Neural Information Processing Systems (NIPS). Srinivas, N. et al., 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp.1015–1022. Stoytchev, A., 2009. Some basic principles of developmental robotics. IEEE Transactions on Autonomous Mental Development, 1(2), pp.122–130. Surjanovic,

S., 2015. Optimization Test Problems. Available at:

http://www.sfu.ca/~ssurjano/optimization.html. Sutton, R.S. & Barto, A.G., 1998. Introduction to Reinforcement Learning, MIT Press. SynTouch, 2016. SynTouch Products. Available at: http://www.syntouchllc.com/Products/. Thrun, S., Burgard, W. & Fox, D., 2005. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), The MIT Press. Tiwana, M.I., Redmond, S.J. & Lovell, N.H., 2012. A review of tactile sensing technologies with applications in biomedical engineering. Sensors and Actuators, A: Physical, 179, pp.17– 31. Todorov, E., 2004. Optimality principles in sensorimotor control. Nature neuroscience, 7(9), pp.907–15. Todorov, E. & Jordan, M., 2002. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11), pp.1226–1235. Ward-Cherrier, B., Cramphorn, L. & Lepora, N.F., 2016. Tactile manipulation with a TacThumb integrated on the Open-Hand M2 Gripper. IEEE Robotics and Automation Letters, 1(1), pp.169–175. Westling, G. & Johansson, R.S., 1984. Factors influencing the force control during precision grip. Experimental brain research, 53(2), pp.277–284. Wiering, M. & Van Otterlo, M., 2012. Reinforcement Learning: State-of-the-Art, Springer. Wierstra, D. et al., 2014. Natural Evolution Strategies. Journal of Machine Learning Research, 15, pp.949–980. Williams, R.J., 1992. Simple Statistical Gradient-Following Algorithms for Connectionist 85

Reinforcement Learning. Machine Learning, 8(3), pp.229–256. Winstone, B. et al., 2013. TACTIP - Tactile Fingertip Device, Texture Analysis Through Optical Tracking of Skin Features. In Biomimetic and Biohybrid Systems. Springer, pp. 323–334. Xu, D., Loeb, G.E. & Fishel, J.A., 2013. Tactile identification of objects using Bayesian exploration. Proceedings - IEEE International Conference on Robotics and Automation, pp.3056–3061.


Appendices


A Performance on standard optimisation benchmarks

A.1 Use of standard optimisation benchmarks for testing and evaluation

Many of the optimisation algorithms discussed in Chapter 3 are complex and difficult to implement. Even if they appear to work on easy test examples, they may still contain bugs that only become apparent when they are applied in more challenging situations. For this reason, all of the algorithms implemented in this study were tested on a comprehensive suite of optimisation benchmarks that included the following functions:

• 2D Gaussian
• 2D paraboloid
• 2D peaks
• Rosenbrock
• Matyas
• McCormick
• Branin-Hoo
• Goldstein-Price
• Ackley

These functions were inverted for the purpose of this study, so that they could be used as objective functions for maximisation rather than minimisation problems (see Figure A.1). Definitions and further details are given by Jamil and Yang (2013) and Surjanovic (2015).
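As an illustration, a minimal Python definition of one of these inverted benchmarks (the Branin-Hoo function, in its standard parameterisation) is given below; the original test suite was implemented in MATLAB.

    import numpy as np

    def inverted_branin(x1, x2):
        """Negated Branin-Hoo function, so its three global minima become maxima.

        Standard parameterisation; global maxima of approximately -0.398 at
        (-pi, 12.275), (pi, 2.275) and (9.42478, 2.475).
        """
        a, b, c = 1.0, 5.1 / (4.0 * np.pi ** 2), 5.0 / np.pi
        r, s, t = 6.0, 10.0, 1.0 / (8.0 * np.pi)
        f = a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s
        return -f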


Figure A.1 – Examples of optimisation benchmarks used for testing and preliminary evaluation purposes: (a) 2D paraboloid, (b) Branin-Hoo function, (c) Ackley function.

A typical sequence of samples produced by the Cross Entropy optimiser when applied to the Branin-Hoo test function is shown in Figure A.2.


Figure A.2 – A typical sequence of samples produced by the Cross Entropy optimiser when applied to the Branin-Hoo test function, after (a) 1 trial, (b) 10 trials, (c) 15 trials, (d) 20 trials, (e) 25 trials, and (f) 100 trials.

This sequence was produced with a population size of 100 and an elite sample size of 10. The Branin-Hoo function has three global maxima, and the sequence shows the multivariate Gaussian search distribution becoming progressively more focused on one of these maxima, at (9.425, 2.475). The slight curvature at the ends of the distribution is due to the nonlinear logistic function that is used to map the constrained optimisation problem to an unconstrained problem (see Section 3.7.8). For each of the 100 trials, 100 samples of the objective function were used, giving 10,000 samples in total.
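The following is a minimal Python sketch of the cross-entropy search loop described above (population size 100, elite sample size 10, multivariate Gaussian search distribution). It omits the logistic constraint mapping used in the project, and the initial mean and covariance in the commented usage are illustrative.

    import numpy as np

    def cross_entropy_maximise(objective, mean, cov, n_trials=100,
                               population=100, n_elite=10, seed=0):
        """Cross-entropy method with a multivariate Gaussian search distribution."""
        rng = np.random.default_rng(seed)
        mean = np.asarray(mean, dtype=float)
        cov = np.asarray(cov, dtype=float)
        for _ in range(n_trials):
            samples = rng.multivariate_normal(mean, cov, size=population)
            scores = np.array([objective(s) for s in samples])
            elite = samples[np.argsort(scores)[-n_elite:]]        # best n_elite samples
            mean = elite.mean(axis=0)                             # refit the search distribution
            cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))
        return mean

    # Example usage with the inverted Branin-Hoo function defined in Section A.1:
    # best = cross_entropy_maximise(lambda x: inverted_branin(x[0], x[1]),
    #                               mean=[2.5, 7.5], cov=np.diag([25.0, 25.0]))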

A typical sequence of samples and response surfaces produced by the Bayesian optimiser (PI) when applied to the Branin-Hoo test function is shown in Figure A.3.

Figure A.3 – A typical sequence of samples and response surfaces produced by the Bayesian optimiser (PI) when applied to the Branin-Hoo test function, after (a) and (b) 1 trial; (c) and (d) 25 trials; (e) and (f) 100 trials.

Initially, the samples are selected uniformly at random over the optimisation domain (Figure A.3(a)). However, as time progresses, the sampling becomes more focused in the vicinity of


the three global maxima (Figure A.3(c) and (e)), and the corresponding response surfaces become better approximations to the underlying objective function shown in Figure A.1(b). In this experiment, 16 initial samples were used to bootstrap construction of the Bayesian optimiser response surface, followed by 100 optimisation samples/trials. So a total of 116 samples were required here, in contrast to the 10,000 samples required by the CE optimiser.

A.2 Comparison between policy search gradient and hybrid optimisers

As described in Section 3.7.6, two novel optimisers (xGES2 and SGES2) were developed in this study. These can be viewed as hybrids between Gaussian policy search gradient methods and Natural Evolution Strategies (NES). During testing, the new optimisers demonstrated significantly better performance than the Gaussian policy search gradient optimiser (SGES) on all of the benchmark problems, and so were used in preference in this study. Table A.1 shows the performance of the two new optimisers after 1000 trials, when applied to the Branin-Hoo benchmark (statistics are based on 10 runs of the experiment).

            resid. error          perf. shortfall
            mean      std dev     mean      std dev
sges        0.9068    0.7237      -1.5545   1.3986
xges2       0.0073    0.0220      -0.3990   0.0034
sges2       0.0059    0.0177      -0.3980   0.0004

Table A.1 – Comparison of algorithm performance (10 runs).

Since the Branin-Hoo function has three global maxima, performance was measured in terms of the residual error to the closest global maximum, and the shortfall from the maximum performance of 0.3979. For this benchmark, the two hybrid algorithms demonstrated much better performance than the original SGES algorithm. Since the primary goal of this study was not to develop new optimisers, no attempt was made to carry out a rigorous comparison across all benchmark functions.

A.3 Characterisation of optimiser performance on standard benchmarks

To get a better feel for the performance of the optimisers described in Section 3.7, and to develop some insight into their operation, they were applied to three of the benchmarks described in Section A.1. The performance results are shown in Figure A.4.


Figure A.4 – Performance of optimisers on benchmark problems: (a), (c) and (e) stochastic optimisers (500 trials); (b), (d) and (f) Bayesian optimisers (100 trials).


The performance of the stochastic optimisers is plotted over 500 trials in Figure A.4(a), (c) and (e). The performance of the Bayesian optimisers is plotted over 100 trials, and compared with the three best-performing stochastic optimisers for reference purposes in Figure A.4(b), (d) and (f). Unfortunately, the REINFORCE method could not be made to work with the Branin-Hoo function, and so no results were available for this optimiser. Notice that, in general, the population-based stochastic optimisers (CE optimiser, xNES and SNES) and the Bayesian optimisers converge much more quickly than the single-sample optimisers (xGES2, SGES2 and REINFORCE). However, the population-based stochastic optimisers require a larger number of samples per trial than the Bayesian optimisers. The residual errors and corresponding performance shortfalls for the three benchmark functions are shown in Table A.2 and Table A.3.

             Paraboloid                Branin-Hoo                Ackley
             resid.     perf.          resid.     perf.          resid.     perf.
             error      shortfall      error      shortfall      error      shortfall
ceopt        0.000      0.000          0.000      -0.796         0.000      0.000
xnes         0.000      0.000          0.000      -0.796         0.000      0.000
snes         0.000      0.000          0.000      -0.796         0.000      0.000
xges2        0.001      0.000          0.000      -0.402         0.000      -0.004
sges2        0.001      0.000          0.000      -0.402         0.001      -0.004
reinforce    0.377      -0.461         n/a        n/a            0.217      -0.611

Table A.2 – Performance of stochastic optimisers on benchmark problems (500 trials).

             Paraboloid                Branin-Hoo                Ackley
             resid.     perf.          resid.     perf.          resid.     perf.
             error      shortfall      error      shortfall      error      shortfall
ceopt        0.000      0.000          0.000      -0.796         0.000      0.000
xnes         0.000      0.000          0.000      -0.796         0.000      0.000
snes         0.000      0.000          0.000      -0.796         0.000      0.000
bayes (pi)   0.007      0.000          0.036      -0.797         0.011      -0.034
bayes (ei)   0.006      0.000          0.169      -0.835         0.011      -0.034
bayes (ucb)  0.007      0.000          1.438      -0.878         0.110      -0.606

Table A.3 – Performance of Bayesian optimisers on benchmark problems (100 trials).

Once again, the population-based stochastic optimisers produced the best performance across all three benchmarks. However, the Bayesian optimisers were also reasonably accurate and required far fewer samples to converge than the stochastic optimisers (for example, the Bayesian optimisers only required 116 samples to converge, whereas the CE optimiser required 10,000 samples).

A.4 Computational costs

When comparing optimiser performance, it is also important to consider the computational costs. For example, the time needed for each optimiser to complete 100 trials for the Ackley benchmark is shown in Table A.4.

             comp. time (secs)    samples per trial
ceopt        6.47E-02             100
xnes         7.03E-02             6
snes         2.69E-02             6
xges2        4.79E-02             1
sges2        2.72E-02             1
reinforce    1.59E-02             1
bayes (pi)   7.99E+01             1
bayes (ei)   6.94E+01             1
bayes (ucb)  7.32E+01             1

Table A.4 – Computation time for 100 optimisation trials on the Ackley benchmark (MATLAB environment).

Looking at these figures, it might appear that the stochastic optimisers are around three orders of magnitude faster than the Bayesian optimisers. However, this does not tell the full story. The reason why the Bayesian optimisers are so much slower is that they need to fit a GP model to the data and optimise over an acquisition function at each trial. In particular, the function call to the DIRECT optimisation algorithm, which performs the optimisation over the acquisition function, accounts for approximately 85% of the overall computation time, while the call to the routine that optimises the hyperparameters of the GP model accounts for approximately 12%. However, the time needed to evaluate the Ackley objective function at each iteration is only a tiny fraction of this. So, for the Ackley benchmark, or indeed any of the benchmark functions, it is largely irrelevant whether 1 or 100 samples are needed in each trial, since the computation required to obtain them is only a small fraction of the overall computation time. However, for many robotics problems (including the ones considered in this study), the collection of a sample in each learning trial may take milliseconds to carry out in an offline environment, and possibly seconds or even minutes in an online environment. In this situation, the time required to compute/collect a sample of the objective function in each trial becomes the dominant factor in the overall computation time, and the overheads associated with Bayesian optimisation become less significant.

A.5 Deterioration of performance under noisy conditions

All of the benchmark functions discussed above are noise-free. However, in robotics, the objective functions used in direct policy search are often extremely noisy (as observed in the curvature perception experiments). Figure A.5 shows how the performance of the optimisers described in Section 3.7 changes as increasing amounts of zero-mean Gaussian noise are added to the 2D paraboloid benchmark function. Figure A.5(a) and (b) show the performance without any added noise; Figure A.5(c) and (d) show the performance for a low level of zero-mean noise, with a standard deviation of 0.1; and Figure A.5(e) and (f) show the performance for a high level of noise, with a standard deviation of 1.0.

Figure A.5 – Deterioration of performance under noisy conditions: (a), (c) and (e) stochastic optimisers (500 trials); (b), (d) and (f) Bayesian optimisers (100 trials).

Note that while the expected value of the noisy objective function is equal to 4 at the maximum, it is possible for the sampled noisy values in individual trials to be greater than this. Since the Bayesian optimisers return the best performance encountered so far at each trial (if they did not, the corresponding performance charts would have the appearance of white noise), the performance graphs for these optimisers take the form of rising staircase functions. The residual errors and corresponding performance shortfalls for the three noise levels are shown in Table A.5 and Table A.6.

             No noise                  Low noise (σ = 0.1)       High noise (σ = 1.0)
             resid.     perf.          resid.     perf.          resid.     perf.
             error      shortfall      error      shortfall      error      shortfall
ceopt        0.000      0.000          0.027      0.001          0.130      0.122
xnes         0.000      0.000          0.080      0.050          0.271      0.421
snes         0.000      0.000          0.104      0.028          0.153      0.179
xges2        0.001      0.000          0.074      0.094          0.102      0.446
sges2        0.001      0.001          0.053      0.141          0.217      0.291
reinforce    0.377      0.502          0.368      0.487          0.193      0.707

Table A.5 – Deterioration of stochastic optimiser performance under noisy conditions (500 trials).

             No noise                  Low noise (σ = 0.1)       High noise (σ = 1.0)
             resid.     perf.          resid.     perf.          resid.     perf.
             error      shortfall      error      shortfall      error      shortfall
ceopt        0.000      0.000          0.026      0.002          0.071      -0.008
xnes         0.000      0.000          0.111      0.006          0.314      -0.249
snes         0.000      0.000          0.131      0.002          0.016      -0.034
bayes (pi)   0.007      0.000          0.057      0.226          0.066      2.704
bayes (ei)   0.006      0.000          0.057      0.285          0.153      4.194
bayes (ucb)  0.007      0.000          0.033      0.201          0.067      2.019

Table A.6 – Deterioration of Bayesian optimiser performance under noisy conditions (100 trials).

The results show that the performance of all optimisers deteriorates with increasing levels of noise. This is consistent with results from other studies in this area (Beyer 2000; Picheny et al. 2013).

A.6 Discussion of results

Episode-based, direct policy search reinforcement learning involves optimising the expected episodic return using noisy samples that are generated during separate learning trials. This process is typically carried out using black-box optimisers. Two main classes of optimiser were evaluated in this study: stochastic optimisers and Bayesian optimisers. The stochastic optimisers can be further classified according to whether they rely on single samples or populations of samples at each step of the optimisation process. Three variants of Bayesian optimiser were used in this study, each with a different type of acquisition function for selecting samples (PI, EI or UCB – see Section 3.7.7).

All optimisers were tested on a suite of benchmark functions to check that they were working correctly and to gain some insight into their relative performance, especially under increasing levels of noise, which is an important characteristic of many tactile robotic tasks. Population-based stochastic optimisers such as CE, xNES and SNES produced the most accurate results and converged faster than single-sample stochastic optimisers such as REINFORCE, xGES2 and SGES2. The Bayesian optimisers fell somewhere in between in terms of accuracy, but required fewer samples to converge than their stochastic counterparts. When applied to a 2D paraboloid objective function, the performance of all optimisers deteriorated under increasing levels of noise. This finding is consistent with other studies in this area (Beyer 2000; Picheny et al. 2013). For example, in the case of evolutionary optimisation algorithms, Beyer found that increased levels of noise on the objective function reduced both the accuracy of solutions and the speed of convergence.

For the benchmark test problems, the Bayesian optimisers typically required around three orders of magnitude more computational time per iteration than their stochastic counterparts. However, this computational time is dominated by the time required to call the DIRECT global optimiser used to optimise the acquisition function, and the time required to optimise the hyperparameters of the GP model. For online learning, the main factors that determine overall computation time are the number of iterations required for convergence and the population size, since the time required to carry out each learning trial is much greater than the time required to carry out the optimisation computations in each iteration. Thus, Bayesian optimisers can often find good solutions in much less time than stochastic optimisers in real-time, online learning scenarios.
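A rough, purely illustrative calculation makes this point concrete. The timing figures below are assumptions rather than measurements from this study; only the three-orders-of-magnitude per-iteration ratio and the 500/100 trial budgets echo the benchmark comparisons above.

# Illustrative (assumed) timings for an online learning scenario.
trial_time_s = 10.0        # time to carry out one physical learning trial
stoch_overhead_s = 0.01    # optimiser computation per iteration (stochastic)
bayes_overhead_s = 10.0    # optimiser computation per iteration (Bayesian),
                           # ~3 orders of magnitude more than the stochastic case

stoch_trials = 500         # trial budget used for the stochastic optimisers
bayes_trials = 100         # trial budget used for the Bayesian optimisers

stoch_total_min = stoch_trials * (trial_time_s + stoch_overhead_s) / 60
bayes_total_min = bayes_trials * (trial_time_s + bayes_overhead_s) / 60
print(f"stochastic: {stoch_total_min:.1f} min, Bayesian: {bayes_total_min:.1f} min")
# Because the trial time dominates, the smaller trial budget of the Bayesian
# optimisers outweighs their much larger per-iteration overhead.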


B Particle filter implementations of active perception

B.1 Particle filter algorithm

A simple particle filter implementation of a Bayesian active perception agent was developed using the methods described in (Thrun et al. 2005). The basic algorithm is shown in Table B.1.

1:  algorithm particle_filter(X_{t-1}, w_{t-1}, u_t, z_t):
2:      Resample particles:
3:      for m = 1 … M
4:          draw index k with probability ∝ w_{t-1}[k]
5:          x̄_t[m] = x_{t-1}[k]
6:      end
7:      for m = 1 … M
8:          Move resampled particle using dynamics model:
9:          sample x_t[m] ~ p( ∙ | x̄_t[m], u_t)
10:         Compute particle weight using observation model:
11:         w_t[m] = p(z_t | x_t[m])
12:     end
13:     return X_t, w_t

Table B.1 – Particle filter algorithm.
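As a concrete illustration, the listing below gives a minimal NumPy sketch of one iteration of this resample-first loop. The dynamics and observation models are passed in as functions, and simple multinomial resampling is used for brevity; the experiments reported here used the low-variance resampling scheme and respawn fraction described in the next paragraph.

import numpy as np

def particle_filter_step(particles, weights, u, z, move, obs_likelihood, rng):
    # One iteration of the resample-first particle filter sketched in Table B.1.
    #   particles : (M, d) array of states from the previous iteration
    #   weights   : (M,) array of (unnormalised) weights from the previous iteration
    #   u, z      : action applied and observation received on this iteration
    #   move(x, u)           : samples a successor state from the dynamics model
    #   obs_likelihood(z, x) : evaluates the observation model p(z | x)
    M = len(particles)
    p = weights / np.sum(weights)

    # Lines 3-6: resample particles in proportion to their previous weights
    # (multinomial resampling here; the experiments used a low-variance scheme).
    idx = rng.choice(M, size=M, p=p)
    resampled = particles[idx]

    # Lines 7-12: move each resampled particle with the dynamics model,
    # then re-weight it with the observation model.
    new_particles = np.array([move(x, u) for x in resampled])
    new_weights = np.array([obs_likelihood(z, x) for x in new_particles])

    # Return a weighted particle set, so that MAP-style estimates can be
    # computed by the decision maker before the next resampling step.
    return new_particles, new_weights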

Note that while it is more conventional to move the particles and re-compute the weights before resampling, here the order is reversed so that each iteration finishes with a set of weighted particles rather than a set of (equally weighted) resampled particles. This makes it easier to compute the MAP estimates used by the decision maker component during each iteration.

For this set of experiments, 500 particles were used (M = 500). A low-variance form of resampling described in (Thrun et al. 2005) was used to implement the resampling step in Lines 3 to 6 of the algorithm. Rather than resample the entire population at each iteration, a small fraction (10% in these experiments) was "respawned" by sampling uniformly at random over the state space. The remaining particles were resampled in the normal way. This helps to maintain diversity in the population and avoids premature convergence to incorrect states (this "particle deprivation" problem is discussed in more detail in Thrun et al. 2005).

Two active perception experiments were carried out using the particle filter: the first involving cylinder classification (curvature perception), and the second involving angle estimation (orientation perception). The cylinder classification experiment used the same experimental setup described in Section 4.1, with the aim of classifying five cylinders of different curvatures and diameters. The angle estimation experiment used the same experimental setup described in Section 4.2, with the aim of estimating the angle of the tangent to the curved edge of a disk (radius 54 mm).

B.2 Cylinder classification

For the cylinder classification experiment, a mixed discrete/continuous state space was used to represent the state variables. The cylinder ID was represented as an integer in the range 1 to 5. The sensor location was represented as a continuous real-valued variable in the range -40 mm to 0 mm (corresponding to the range of tap positions across the cylinder axis). The motion model was defined as a simple one-dimensional translation with a small amount of additive Gaussian noise, x_t = x_{t-1} + ∆x + ε, where ε ~ N(0, σ²) and σ = 0.01 ∙ |∆x|.

The observation model was implemented using a radially-symmetric, multivariate Gaussian distribution, with covariance matrix equal to the identity matrix, and a (conditional) mean estimated using a 6-50-254 feedforward neural network, as shown in Figure B.1.

Figure B.1 – Neural network-based observation model used for cylinder classification.
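The corresponding particle weight can then be computed directly from the network output. The sketch below uses a stand-in function predict_mean for the trained network (not reproduced here); because the covariance is the identity matrix, the Gaussian density reduces to a squared-distance term, and the constant normalising factor can be dropped since the weights are normalised before use.

import numpy as np

def observation_weight(z, cylinder_id, position, predict_mean):
    # Unnormalised particle weight under a radially-symmetric Gaussian
    # observation model with identity covariance.
    #   z            : 254-element vector of mean pin displacements for one tap
    #   predict_mean : stand-in for the trained network; maps (cylinder_id,
    #                  position) to a predicted 254-element mean vector
    mu = predict_mean(cylinder_id, position)
    sq_dist = np.sum((np.asarray(z) - np.asarray(mu)) ** 2)
    # exp(-0.5 * ||z - mu||^2); subtracting a common reference value before
    # exponentiating (not shown) can help avoid numerical underflow in
    # high-dimensional observation spaces.
    return np.exp(-0.5 * sq_dist)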

The time series for the x and y displacements of each of the 127 TacTip pins were averaged so that a single observation consisted of a 254-element vector of mean pin displacements over a single tap (time series). So the observation model can be viewed as a multivariate Gaussian distribution of mean sensor readings, conditioned on the cylinder ID and sensor position. Although a categorical, 1-of-K, encoding was used to encode the cylinder ID as input to the network, in hindsight, some form of ordinal encoding that captures the fact that cylinders can be ordered by their diameter or curvature might have been more appropriate.

The neural network inputs and outputs were pre- and post-processed by normalising them so that their minimum and maximum values fell in the range -1 to 1, respectively. The network was trained using 10,000 iterations of scaled conjugate gradient descent backpropagation (MATLAB NN Toolbox implementation). To minimise over-fitting, the hidden layer size and stopping time were adjusted by partitioning the data into separate training and validation sets, and then using the validation set to check generalisation performance on data that had not been used to train the network. When an approximately optimal hidden layer size had been found, the validation data was merged back in with the training data, and the network retrained on all of the data.

The advantage of using a neural network for the observation model, as opposed to a histogram-based approach, is that it forms a continuous mapping between inputs and outputs. So, if it is properly regularised, it can generalise to values it has not been trained on. This more compact representation can also help to reduce the training time and storage requirements of the observation model, and increase overall run-time speed. Unlike some other types of non-parametric statistical model (e.g., Gaussian processes), neural networks can be trained on very large data sets, which is an important consideration as the dimensionality or size of the state space is increased.

A typical sequence of particle distributions for the cylinder classification problem is shown in Figure B.2. Each distribution is plotted after the resampling phase at the start of the subsequent iteration/step.


Figure B.2 – Example sequence of particle distributions for cylinder classification problem.

The red crosses in the figures correspond to particles that have been respawned (50 out of 500 particles). The blue circles correspond to the sampled or resampled particles (450 out of 500 particles). An active focal policy with a focal point of -19.95 mm (approximately in the centre of the range of tap positions) was used here. Initially, the particles are uniformly distributed over the full range of sensor positions for all five cylinders. After one tap of the sensor and the first resampling, the distribution of resampled particles is located at approximately -30 mm for Cylinders 2 to 5. There are no resampled particles for Cylinder 1, indicating that this is not a strong hypothesis for the cylinder class. After a move towards the central focal point from the estimated position, a second tap of the sensor, and the second resampling, all of the resampled particles have collapsed onto the central focal point on Cylinder 3 (the correct cylinder class and sensor position). The estimated sensor positions and object classes for all 10 iterations of the particle filter are shown in Table B.2.

step   estimated position   estimated move   estimated object class
1      -29.18               9.23             3
2      -20.28               0.33             3
3      -19.95               0.00             3
4      -19.96               0.01             3
5      -19.95               0.00             3
6      -19.95               0.00             3
7      -19.95               0.00             3
8      -19.95               0.00             3
9      -19.95               0.00             3
10     -19.95               0.00             3

Table B.2 – Estimated sensor positions and object classes for 10 iterations of the particle filter.
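For completeness, the sketch below shows one way of reading estimates like those in Table B.2 off the weighted particle set, in the spirit of the MAP-style estimates mentioned in Appendix B.1: the cylinder ID is taken as the class carrying the greatest total weight, and the position as the weighted mean within that class. The exact estimator used in the experiments may differ in detail.

import numpy as np

def estimate_state(particles, weights):
    # Estimate (cylinder_id, position) from weighted particles.
    #   particles : (M, 2) array; column 0 = cylinder ID, column 1 = position (mm)
    #   weights   : (M,) array of unnormalised particle weights
    w = weights / np.sum(weights)
    ids = particles[:, 0].astype(int)

    # Discrete part: class with the greatest total weight.
    classes = np.unique(ids)
    class_mass = np.array([w[ids == c].sum() for c in classes])
    best_class = classes[np.argmax(class_mass)]

    # Continuous part: weighted mean position within the winning class.
    mask = ids == best_class
    position = np.sum(w[mask] * particles[mask, 1]) / np.sum(w[mask])
    return best_class, position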

B.3 Angle estimation

For the angle estimation experiment, a two-dimensional continuous state space was used to represent the state variables. The tangent angle was represented as a continuous real-valued variable in the range 0 to 360 degrees, and the sensor location was represented as a continuous real-valued variable in the range 0 mm to 20 mm.

The motion model for this experiment was defined in terms of two components. The change to the sensor position was defined as a simple one-dimensional translation with a small amount of additive Gaussian noise, x_t = x_{t-1} + ∆x + ε_x, where ε_x ~ N(0, σ_x²) and σ_x = 0.01 ∙ (|∆x| + 1). The change to the continuous-valued tangent angle for a (radial) change in sensor position was assumed to be zero with a small amount of additive Gaussian noise, θ_t = θ_{t-1} + ε_θ (modulo 360 degrees), where ε_θ ~ N(0, σ_θ²) and σ_θ = 0.5 ∙ (|∆x| + 1).
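A minimal sketch of this two-part motion model is given below; the noise scalings follow the reconstruction above and should be treated as illustrative rather than definitive.

import numpy as np

def move_angle_state(state, dx, rng):
    # Propagate (position_mm, angle_deg) through the angle-estimation motion model.
    # Position noise scales as 0.01*(|dx| + 1) and angle noise as 0.5*(|dx| + 1),
    # following the reconstruction of the model given above.
    position, angle = state
    position = position + dx + rng.normal(0.0, 0.01 * (abs(dx) + 1.0))
    angle = (angle + rng.normal(0.0, 0.5 * (abs(dx) + 1.0))) % 360.0
    return position, angle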

Once again, the observation model was implemented using a radially-symmetric, multivariate Gaussian distribution, with covariance matrix equal to the identity matrix, but this time the (conditional) mean was estimated using a 3-100-254 feedforward neural network, as shown in Figure B.3.


Figure B.3 – Neural network-based observation model used for angle estimation.

The tangent angle was encoded using sine and cosine inputs, to capture its natural periodicity. All other procedures and settings remained the same as for the cylinder classification experiment, including the min/max pre- and post-processing. Once again, the network was trained using 10,000 iterations of scaled conjugate gradient descent backpropagation, and a validation set was used to ensure that over-fitting did not occur. A typical sequence of particle distributions for the angle estimation problem is shown in Figure B.4. Each distribution is plotted after the resampling phase at the start of the subsequent iteration/step.


Figure B.4 – Example sequence of particle distributions for angle estimation problem.

An active focal policy with a central focal point of 9.5 mm (approximately in the middle of the range of tap locations over the edge of the disk) was used here. Initially, the particles are uniformly distributed over the state space. After one tap of the sensor and the first resampling, the distribution of resampled particles is located at approximately 16 mm (close to the location of the first tap) and at an angle of approximately 136 degrees. After a move towards the central focal point from the estimated position, a second tap of the sensor, and the second resampling, all of the resampled particles have collapsed onto the central focal point at an angle of approximately 131 degrees (the actual angle is 130 degrees). The estimated sensor positions and tangent angles for all 10 iterations of the particle filter are shown in Table B.3.


step   estimated position   estimated move   estimated angle   estimated angle (rounded)
1      16.24                -6.74            136.30            140
2      9.59                 -0.09            130.55            130
3      9.43                 0.07             129.53            130
4      9.43                 0.07             129.43            130
5      9.68                 -0.18            130.20            130
6      9.43                 0.07             129.57            130
7      9.44                 0.06             129.57            130
8      9.65                 -0.15            129.71            130
9      9.44                 0.06             129.64            130
10     9.44                 0.06             129.64            130

Table B.3 – Estimated sensor positions and tangent angles for 10 iterations of the particle filter.

What is particularly interesting about this example is that the particles are located in a fully continuous state space, and the observation model is used to compute a likelihood for any position and angle, despite only having been trained on samples with angles at 10 degree intervals. This indicates that the neural network is doing a good job of interpolating (i.e., generalising) between the sampled data points.

B.4 Discussion of results

In the cylinder classification experiment, the particle filter correctly identified the cylinder after a single tap, and estimated the sensor position to within 0.01 mm accuracy after three taps. In comparison, the histogram-based approach used in Section 4.1 could only achieve a maximum accuracy of 0.4 mm using a grid of 100 points over a 40 mm range. In the angle estimation experiment, the particle filter estimated the tangent angle to within 1 degree accuracy and the sensor position to 0.1 mm accuracy after two taps. In comparison, the histogram-based approach used in Section 4.2 could only achieve a maximum accuracy of 10 degrees for the tangent angle, and 0.5 mm accuracy for the sensor position. Furthermore, the particle filter implementation only needed to update 500 sample points per Bayesian filter update, as opposed to the 1440 grid points (40 position classes x 36 angle classes) needed in the histogram-based approach.

While it is certainly possible to increase the accuracy of the histogram-based approach by using more grid points, this increases the computational and storage requirements, and much of this computation is wasted because it is associated with regions of low probability in the state space. The particle filter, on the other hand, can increase the accuracy in regions where it is needed by increasing the number of particles in areas of high probability.

While the point of these demonstrations is not to perform a rigorous statistical comparison between particle filters and histogram-based methods, they do provide some degree of confidence that particle filter methods will work well in the tactile domain. As such, this would be an interesting area for further work.
