Combining Reinforcement Learning and Differential Inverse Kinematics for Collision-Free Motion of Multilink Manipulators

Pedro Martín¹ and José del R. Millán²

¹ Dept. of Computer Science, University of Jaume I, 12071 Castellón, Spain
² Institute for Systems, Informatics and Safety, Joint Research Centre, European Commission, 21020 Ispra (VA), Italy

Abstract. This paper presents a class of neural controllers that learn goal-oriented obstacle-avoiding strategies for multilink manipulators. They acquire these strategies on-line through reinforcement learning from local sensory data. These controllers are mainly made of two neural modules: a reinforcement-based action generator and a module for differential inverse kinematics (DIV). The action generator generates actions with regard to a goal vector in the manipulator joint space. Suitable goal vectors are provided by the DIV module. This module is based on the inversion of a neural network that has been previously trained to approximate the manipulator forward kinematics in polar coordinates. Results for two- and three-link planar manipulators are shown. These controllers achieve a good performance quite rapidly and exhibit good generalization capabilities in the face of new environments.

1 Introduction

Making a manipulator reach a goal position implies finding a path from an initial configuration of the robot joints to a goal configuration while avoiding collisions with obstacles. We want to build autonomous robot arms that make motion decisions in real time and that only rely on local information coming from their sensors. To this end, the robot arm is equipped with rings of range sensors (e.g., sonar sensors) placed along its links. The path-finding problem can be posed as a sequential decision task [3], and a reinforcement-based neural controller can be used to learn the appropriate goal-oriented obstacle-avoiding strategies while the robot is interacting with its environment. We have successfully applied such an approach to autonomous mobile robots [8,9]. For multilink manipulators, we propose a class of controllers that have two modules: a reinforcement-based action generator and a module for differential inverse kinematics (DIV). The action generator provides actions (i.e., increments for the joints) that are encoded with respect to a Goal Vector (GV) in the joint space. Such a vector indicates a direction that the robot arm can follow to get closer to one of the possible goal configurations. The DIV module is used to generate suitable goal vectors. It obtains a GV by inverting a neural network [5] that has been previously trained to approximate the forward kinematics of the robot arm in polar coordinates.
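To make the interplay between the two modules concrete, the following Python sketch outlines a single decision step of such a controller. The callables div_module, action_generator and arm are placeholders for the components described in the remaining sections, not functions defined in the paper.

```python
import numpy as np

def control_step(q, p_goal, sensor_readings, div_module, action_generator, arm):
    """One decision step of the combined controller (illustrative sketch).

    div_module(q, p_goal)        -> goal vector (GV) in joint space
    action_generator(inputs, gv) -> joint increments expressed relative to the GV
    arm(q, dq)                   -> new joint configuration after executing dq
    """
    gv = div_module(q, p_goal)                      # invert the forward-kinematics net
    inputs = np.concatenate([q, sensor_readings])   # position and sensory inputs
    dq = action_generator(inputs, gv)               # reinforcement-based action
    return arm(q, dq)                               # move the manipulator
```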

Other works [11] have previously used reinforcement learning to allow a two-link manipulator to generate collision-free trajectories to a goal position. Yet they do not consider the use of real sensors that can prevent the robot from colliding. In our approach, the manipulator stops and retracts if any sensor detects an obstacle closer than a safety distance (a virtual collision). Furthermore, those previous works rely on position and velocity information (instead of sensory information) to produce torque or absolute position changes (instead of relative goal-directed changes) for the manipulator joints. Consequently, they fail to generalize when facing an environment different from the one used to train the controller. There also exist connectionist methods that aim at obtaining a model of the free configuration space for a subsequent planning process [1]. In contrast, our reinforcement-based approach allows the robot to interact with the environment from the very beginning and to improve its performance as learning proceeds.

Fig. 1. The simulated three-link robot arm.

We have tested this kind of controller on two simulated planar robot arms: one with two links and another with three links. Figure 1 illustrates the simulated robot arms for the more general case of the three-link manipulator. Each link carries three evenly spaced rings of range sensors. A ring has two sensors pointing to the right and the left side of the link. An additional range sensor is placed at the end-effector. The simulated robot arms have revolute joints with angles (q1, q2, q3). The angular movements of the joints are constrained as follows:

-π/2 ≤ q1 ≤ 5π/4,  -7π/8 ≤ q2 ≤ 3π/4  and  -π/4 ≤ q3 ≤ π/8.

The environments faced by the robots have obstacles that are represented by circles or by a set of circles in order to simplify calculations (see Fig. 5).
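For reference, the joint constraints above can be encoded directly; the numeric bounds in this sketch simply transcribe the limits given above and are only as reliable as that transcription.

```python
import numpy as np

# Joint limits of the simulated three-link arm (radians), as stated above.
Q_MIN = np.array([-np.pi / 2, -7 * np.pi / 8, -np.pi / 4])
Q_MAX = np.array([5 * np.pi / 4, 3 * np.pi / 4, np.pi / 8])

def clamp_joints(q):
    """Keep a joint configuration inside the allowed angular ranges."""
    return np.clip(q, Q_MIN, Q_MAX)
```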

2 Controller Architecture

These controllers are made up of two modules: a reinforcement-based action generator module and a differential inverse kinematics (DIV) module. Figure 2 shows the modules and how they interact with the environment. For the three-link manipulator, we have used an action generator that has three reactive modules: one simple module that allows the robot to follow the GV (FGV module) and two connectionist reinforcement-based modules, one for Negotiating Obstacles (NO module) and another one for Reaching the Goal (RG module). The FGV module is used when the robot does not detect any obstacle in the direction to approach the goal. The NO module is in operation when the arm end-effector is outside a region of reachability. The region of reachability is the geometrical region of the workspace that contains all the arm configurations from which the end-effector can reach the goal axis. The goal axis is the axis that passes through the goal position and the origin of the workspace (see Fig. 2). Finally, the RG module is switched on when the end-effector enters the region of reachability. The action generator used for the two-link robot has only one reinforcement-based module. We have used an Actor-Critic neural architecture [2] for the reinforcement-based modules. Other organizations suitable for acquiring the motion strategies through reinforcement learning are also possible for the action generator.

The action generator receives an input vector of real numbers in the interval [0,1]. These inputs are divided into three groups: position, sensory and goal inputs. The position inputs p give information about the current joint configuration of the robot arm relative to the goal axis. The sensory inputs d can be grouped into front, back, joint-limit and end-effector sensory inputs. The front and back components are obtained from the range sensors: the front inputs come from the sensors that lie in the direction that the first link must follow to approach the goal. The joint-limit components correspond to virtual sensors; they give information about the proximity of the joints to their physical limits. Finally, the end-effector component is obtained from the range sensor located at the end-effector. The goal inputs g provide information about how close the robot arm is to the goal position. These inputs correspond to the components of the Workspace Shortest Path Vector (WSPV) that connects the current position of the end-effector with the goal position in polar coordinates: WSPV = p* - p.

The actor gives an action vector a (e.g., a = (a1, a2, a3) for the three-link manipulator) that specifies a movement in relative polar coordinates with regard to the GV. All components are real values in the interval [-1,1]. The last component (e.g., a3) corresponds to the length of the action vector as a proportion of the length of the GV. The previous components represent the angular deviations from the GV (e.g., a1 and a2). The actual angular increments Δq supplied to the robot joints are calculated from the action vector. There are a total of n inputs to the controller, with n = 21 for the two-link robot and n = 30 for the three-link case.

The reinforcement signal reflects the objective that the robot arm must achieve:

r(t) = -4 + (-2 + 2e^(-4||WSPV||))              if collision
r(t) = 2 - 2e^(-4 ρg) + e^(-4||WSPV||) - 1      otherwise          (1)

where ρg is a measure of the deviation of the action from the GV. For the three-link robot arm, this measure is computed as ρg = |a1| · |a2| · l, where l = ||Δq||/ma and ma is the maximum increment the robot arm can move toward the goal when performing an action. The value of ρg lies in the interval [0,1]. The actor is severely punished when the robot runs into a virtual collision (i.e., if any sensor detects an obstacle closer than a safety distance). For collision-free situations, r rewards approaching the goal properly. Collisions, on the other hand, have different associated penalties: the closer the robot gets to the goal, the less punishment it receives from a collision.
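A direct transcription of this signal, assuming the reading of Eq. (1) given above (the coefficients in the original are partly illegible), could look as follows.

```python
import numpy as np

def reinforcement(wspv, rho_g, collision):
    """Reinforcement signal of Eq. (1), as reconstructed above.

    wspv      : WSPV = p* - p, goal minus end-effector location (polar coordinates)
    rho_g     : scalar in [0, 1] relating the executed action to the goal vector
    collision : True if any sensor reports an obstacle within the safety distance
    """
    dist = np.linalg.norm(wspv)
    if collision:
        # Penalty between -6 (far from the goal) and -4 (at the goal).
        return -4.0 + (-2.0 + 2.0 * np.exp(-4.0 * dist))
    # Collision-free case: depends on the action/GV relation and on the
    # remaining distance to the goal.
    return 2.0 - 2.0 * np.exp(-4.0 * rho_g) + np.exp(-4.0 * dist) - 1.0
```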

Fig. 2. Block diagram of the robot controller.

3 Actor-Critic Modules

We want the controller to handle continuous input and action vector spaces. For this purpose, the reinforcement-based modules of the action generator are implemented with an actor-critic architecture. The actor's aim is to learn to perform those actions that optimize the cumulative reinforcement received along the trajectory to the goal. It is made up of two neural networks: the policy network and the exploration network (see Figure 3). Both networks have one hidden layer of units with sigmoidal activation functions. This activation function has asymptotes at -1 and 1.

Fig. 3. Architecture of the actor module. Each action component is obtained as a_i = N(μ_i, σ_i) · 0.1, with μ_i ∈ [-10, 10] and σ_i ∈ [0.1, 10], i = 1, ..., m (m = 2 for the two-link arm and m = 3 for the three-link arm).

The policy network has Gaussian units to generate the action components. The mean value μ_i(t) for Gaussian unit i is the net input coming from the hidden layer to that unit. The standard deviation σ_i(t) of each Gaussian unit is provided by the corresponding output unit of the exploration network, which is linear. Action components a_i(t) are obtained as shown in Figure 3. The structural credit assignment problem is tackled by means of the REINFORCE algorithm, which adjusts the weights of the policy and exploration networks [12]:

Δw_ij = α (r - b) e_ij = α (r - b) ∂ln N / ∂w_ij          (2)

where α is the learning rate, b is the reinforcement baseline, and e_ij is the characteristic eligibility of w_ij. In order to deal with the temporal credit assignment problem, the critic uses temporal difference (TD) methods [10] to learn to estimate the cumulative reinforcement. In particular, we use TD(0). Other temporal difference methods were tested for the two-link case, and TD(0) appeared to be the most successful [7]. The critic is also a neural network with one hidden layer. The activation functions of the hidden units are sigmoidal with asymptotes at -1 and 1, while the output units are linear. More details about the implementation of the actor-critic modules can be found in [7].
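The pieces above can be sketched as follows. The sampling step follows Fig. 3 (clipping the action components to [-1, 1] is our assumption), the eligibilities are the standard REINFORCE derivatives for a Gaussian unit, and the discount factor gamma is not given in the paper; in the full algorithm these eligibilities are further backpropagated through the policy and exploration networks to reach the weights w_ij of Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mu, sigma):
    """Draw action components from the Gaussian output units (cf. Fig. 3).
    mu comes from the policy network, sigma from the exploration network.
    Returns the raw Gaussian samples (needed for the eligibilities) and the
    scaled, clipped action components."""
    raw = rng.normal(mu, sigma)
    return raw, np.clip(0.1 * raw, -1.0, 1.0)

def gaussian_eligibilities(raw, mu, sigma):
    """Characteristic eligibilities of the Gaussian parameters (Williams, 1992):
    d ln N / d mu    = (raw - mu) / sigma^2
    d ln N / d sigma = ((raw - mu)^2 - sigma^2) / sigma^3"""
    e_mu = (raw - mu) / sigma**2
    e_sigma = ((raw - mu) ** 2 - sigma**2) / sigma**3
    return e_mu, e_sigma

def reinforce_update(w, eligibility, r, b, alpha=0.01):
    """Eq. (2): delta_w = alpha * (r - b) * eligibility (alpha is illustrative)."""
    return w + alpha * (r - b) * eligibility

def td0_error(r, v, v_next, gamma=0.95):
    """TD(0) error used by the critic to estimate the cumulative reinforcement;
    gamma is an assumed discount factor."""
    return r + gamma * v_next - v
```

A common choice, though not stated explicitly here, is to use the critic's prediction as the reinforcement baseline b in Eq. (2).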

4 Differential Inverse Kinematics Module

For a redundant robot arm, such as a three-link planar arm, a closed-form inverse kinematics solution is difficult, if not impossible, to obtain because an infinite number of joint configurations forms a solution manifold in the joint space. A typical solution to this problem is the use of the pseudoinverse of the Jacobian matrix, which makes it possible to obtain a joint velocity vector from the velocity vector of the end-effector [13]. This approach can be used to generate a suitable goal vector. However, it is computationally expensive [6], since the pseudoinverse of the Jacobian matrix must be recalculated for every new joint configuration of the robot arm. Instead of that approach, we propose the use of a differential inverse kinematics (DIV) neural module. This module is inspired by the distal learning approach [4] to the inverse kinematics of robot manipulators.
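For comparison, the classical pseudoinverse step mentioned above can be sketched as follows for a planar arm; Cartesian rather than polar workspace coordinates are used purely for brevity.

```python
import numpy as np

def jacobian(q, link_lengths):
    """Analytic Jacobian of the planar forward kinematics
    x = sum_k L_k cos(q_1 + ... + q_k), y = sum_k L_k sin(q_1 + ... + q_k)."""
    angles = np.cumsum(q)
    J = np.zeros((2, len(q)))
    for i in range(len(q)):
        J[0, i] = -np.sum(link_lengths[i:] * np.sin(angles[i:]))  # dx/dq_i
        J[1, i] = np.sum(link_lengths[i:] * np.cos(angles[i:]))   # dy/dq_i
    return J

def pseudoinverse_step(q, dp, link_lengths):
    """One differential IK step, dq = J^+(q) dp; the pseudoinverse must be
    recomputed for every new configuration q, which is what makes this
    approach expensive."""
    return np.linalg.pinv(jacobian(q, link_lengths)) @ dp
```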

4.1 Operation of the DIV Module

Let q and p be the current configuration and the current end-effector location of the robot arm in polar coordinates. Let p* be the goal location in polar coordinates where the end-effector must finally be placed. The robot's task can be stated as a differential inverse kinematics problem: "Given the difference Δp = p* - p = WSPV in the workspace between the current location of the end-effector p = F(q) and a goal location p*, obtain the minimum vectorial difference Δq = SPV between the current configuration q and its closest goal configuration q*." The function F stands for the forward kinematics. The SPV is the Shortest Path Vector in the joint space that connects the current joint configuration with the closest goal configuration.

We have applied a backpropagation inversion algorithm [5] to a feedforward neural network that has been previously trained to approximate the robot forward kinematics in polar coordinates. This algorithm conducts a gradient search in input space. The search generates a sequence of input vectors q = q^0, q^1, ..., q^n such that the network error is minimal. This is the same error that has been used to train the neural network with the forward kinematics. The inversion of the neural network is carried out iteratively. At each iteration, the configuration q^l is fed to the network to generate its corresponding output F(q^l) = p^l. Then, the corresponding error E^l is computed and backpropagated to obtain the next configuration of the sequence, q^(l+1). This procedure finishes when ||p* - p^l|| is sufficiently small.
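A minimal sketch of this inversion is given below for a one-hidden-layer tanh network standing in for the trained forward-kinematics model (tanh is a sigmoid with asymptotes at -1 and 1); the learning rate, tolerance and iteration cap are our own illustrative choices.

```python
import numpy as np

def invert_forward_model(q0, p_star, W1, b1, W2, b2,
                         lr=0.05, tol=1e-3, max_iters=500):
    """Backpropagation inversion of a trained forward-kinematics network.

    The network p = W2 @ tanh(W1 @ q + b1) + b2 plays the role of F, the
    forward kinematics in polar coordinates.  Starting from the current
    configuration q0, a gradient search is performed in *input* space until
    the predicted end-effector location p^l is close enough to the goal p*.
    The returned difference q - q0 plays the role of the SPV, i.e. the goal
    vector handed to the action generator.
    """
    q = np.array(q0, dtype=float)
    for _ in range(max_iters):
        h = np.tanh(W1 @ q + b1)          # hidden activations
        p = W2 @ h + b2                   # predicted end-effector location F(q^l)
        err = p_star - p
        if np.linalg.norm(err) < tol:     # stop when ||p* - p^l|| is small enough
            break
        # Gradient of E = 0.5 * ||p* - p||^2 with respect to the inputs q,
        # obtained by backpropagating the error through the frozen network.
        grad_q = -(W1.T @ ((W2.T @ err) * (1.0 - h**2)))
        q -= lr * grad_q
    return q - q0
```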