Neural Mechanisms for Training Autonomous Robots

Gordon Wyeth
Department of Electrical and Computer Engineering
University of Queensland
Brisbane, Australia

Abstract
Minimalist neural mechanisms are suitable tools for programming and training autonomous robots. This paper explores the limitations of hand-crafted minimalist robot control mechanisms based on a neural paradigm, and then shows that these mechanisms are well suited to robot training using well understood neural learning mechanisms. Training a robot is more powerful than the methods more commonly used for robot learning, such as reinforcement learning and genetic techniques. A trained robot is told more than whether it was wrong or right for a particular action or sequence (reinforcement learning); it is also told what it should have done (supervised learning). Robots can hence develop appropriate behaviour much more rapidly. The neural mechanisms and training techniques have been developed on a kinematically realistic simulator, and have been ported from the simulated vehicles to a real vision guided robot: CORGI [8]. Results from both the simulation and CORGI are presented.

1 Making a Robot Behave
It is difficult to make an autonomous robot perform reliably. Making a robot behave is a problem that has been approached in two distinct ways:
• robot programming, where the robot is told what to do, or
• robot learning, where the robot is asked to determine appropriate behaviour by interacting with the environment.
Programming is complicated by the need to cater for novel situations; complex sensor spaces and internal state spaces make reliable programming troublesome. The alternative, autonomous robot learning, is still a relatively young field. Work with evolutionary computing and reinforcement learning has provided solutions to simple tasks (see [2] for an overview), but it is apparent that these approaches do not present immediate solutions for the kind of things we would like robots to do.

Training a robot, on the other hand, involves placing the robot in a typical situation and showing it appropriate behaviour. Trained neural mechanisms have provided solutions to some real world tasks through this approach, also called supervised learning. For example, driving a vehicle [5] was taught by having a neural mechanism learn an association between visual input and steering angles while the vehicle was under human control. When released from human control, the mechanism used the previous examples to steer through new terrain. This kind of efficient learning would be difficult to implement using a reinforcement learning or evolutionary paradigm.

The tasks investigated for robot training in this paper are centred on a class of problems we call hunt and gather robotics. The tasks a hunt and gather robot might perform include picking up around the house, collecting litter from streets, parks and waterways, picking crops from a field or orchard, or collecting rocks on interplanetary exploration missions. These tasks are characterised by targets and obstacles that have undefined locations and that may be static or dynamic; the task of the robot is to collect the targets while avoiding the obstacles.

The nature of this task allows operation of the robot based on a reactive control system. A commonly used definition of a reactive system is a system that contains no state (or memory). A lack of internal state enhances the robot's ability to react to dynamic situations and eases issues of sensor and effector calibration, and a reactive robot circumvents the difficult problems of knowledge representation in an unstructured environment. On the downside, a reactive robot may be less than optimal: for example, it may explore a region repeatedly without venturing into new areas, as it has no memory of having visited that region before.

This paper will show that small neural network systems are capable of providing reactive control for a hunt and gather robot. These control systems are based on Braitenberg vehicles [1]. Simulation studies are used to show the type of neural structures that are useful for autonomous robot behaviour generation, and then to show how such structures may be generated by training. The results of the simulation study form the basis of tests on a real hunt and gather robot: CORGI.

2 Simulator Design
The simulator is designed to allow parameter changes to several features of the problem, namely:
• the environment, in particular the number and position of the obstacles and targets,
• the controller, including the type of neurons and neural connectivity used for computation, and
• the robot, allowing variation of drive wheel position, and of sensor range, beam width and position.
For the purposes of rapid investigation of controller strategies, a simulator was deemed most suitable for initial design experiments. To make the simulation as realistic as possible, the vehicle kinematics are calculated at a fine grain, with each step of the simulator lasting only 10 ms. The sensors can only sense the environment within their prescribed range and beam width.

A number of simplifying assumptions are made. Firstly, the velocity of each wheel is controlled linearly by the value fed to the motor: no consideration is given to inertial effects or non-linear behaviour of the effector. The simulator calculates the kinematics of the vehicle, not its dynamics; collisions are not detected and do not affect the robot's motion. Secondly, the sensors return an ideal normalised value that represents the proximity of the object to the sensor. Target sensor readings are not occluded by obstacles, and where there is more than one target or obstacle, the sensor uses the range of the closest sensed item.

For the purposes of these experiments the robot is configured with the wheels to the rear, and two pairs of sensors (obstacle and target) at the front of the robot. The sensors can sense in a 90° arc in front of the robot, with a 30° overlap in the centre of the arc.
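As an illustration of the fine-grained kinematic update described above, the following is a minimal sketch assuming a differential-drive vehicle; the class and parameter names (Robot, WHEEL_BASE) are assumptions for illustration, not the simulator's actual code.

```python
import math

DT = 0.01          # 10 ms simulation step, as in the paper
WHEEL_BASE = 0.2   # assumed distance between the drive wheels (metres)

class Robot:
    def __init__(self, x=0.0, y=0.0, heading=0.0):
        self.x, self.y, self.heading = x, y, heading

    def step(self, v_left, v_right):
        """Advance one 10 ms tick. Wheel velocities map linearly from
        motor commands; inertia and collisions are ignored, matching
        the paper's simplifying assumptions."""
        v = (v_left + v_right) / 2.0             # forward velocity
        omega = (v_right - v_left) / WHEEL_BASE  # turn rate
        self.x += v * math.cos(self.heading) * DT
        self.y += v * math.sin(self.heading) * DT
        self.heading += omega * DT
```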

3 Choosing a Neural Network
A great number of neural networks suitable for supervised learning techniques have been developed. In choosing a network, it is useful to know the complexity of the computation to be performed in order to choose the complexity of the network. The simple networks and training systems developed thirty or more years ago [6][7] were criticised for their inability to perform certain classes of computation [4]. Most modern researchers turn to more complex networks for robot training, typically some variation of the multi-layer perceptron trained by backpropagation. However, investigations with hand-wired examples show that robot behaviour generation may be achieved with the simplest of networks. It is therefore possible to use computationally simple training mechanisms to achieve results similar to those accomplished using the more complex techniques.

In particular, two network architectures were investigated with hand-wired examples. In both cases, the robot control architecture was based on a single layer network, with the neural inputs connected to the sensors and the outputs connected to the actuators. The difference between the networks lies in the transfer function of the neuron. The first class investigated used the linear unit, or Adaline [7]. It computes:

$$V_i = \sum_j w_{ij} V_j + \theta_i$$

where $V_i$ is the output from unit $i$, $V_j$ is the input from unit $j$, and $w_{ij}$ is the strength of the connection from unit $j$ to unit $i$. This is contrasted with a type of non-linear unit, the threshold unit or Perceptron, which computes:

$$V_i = g\!\left(\sum_j w_{ij} V_j + \theta_i\right)$$

where $g(\cdot)$ is the Heaviside or threshold function:

$$g(x) = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$$

The following sections illustrate the use of these units in reactive robot control. Firstly, their properties for use with hand-wired controllers are investigated and a comparison of performance made. Secondly, the properties of the units and their associated training algorithms are investigated and conclusions drawn as to the most suitable element for use in reactive robot training. The simulator is used to provide a graphical description of the behaviour of each control system; the results obtained with each architecture are illustrated using screen dumps from the simulator.
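As a concrete illustration, the two transfer functions can be sketched as follows; the function names are illustrative and NumPy is assumed for the weighted sum.

```python
import numpy as np

def linear_unit(weights, inputs, bias):
    """Adaline output: V_i = sum_j w_ij * V_j + theta_i."""
    return np.dot(weights, inputs) + bias

def threshold_unit(weights, inputs, bias):
    """Perceptron output: Heaviside of the same weighted sum."""
    return 1.0 if np.dot(weights, inputs) + bias > 0.0 else 0.0
```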

3.1 Hand-Wired Networks
The hand-wired networks were designed by observing the type of behaviour produced by the four types of network possible with two sensors connected to two actuators. The four possibilities arise from two parameters: the connections may be crossed or uncrossed, and may be excitatory or inhibitory. Figure 1 illustrates the four possibilities.


Fig 1. The four basic connection strategies.

The properties of these connectivities are discussed in detail in [8]. In summary:
• uncrossed excitatory connections cause the robot to avoid the sensed object,
• crossed excitatory connections cause the robot to accelerate towards the sensed object,
• uncrossed inhibitory connections cause the robot to turn towards the sensed object and slow to a stop near it, and
• crossed inhibitory connections cause the robot to slow in the presence of the sensed object and then accelerate away from it.
To arbitrate between opposing sensed objects and the related reactive behaviours, the strengths of the connections may be weighted to favour one kind of behaviour over another. The type of unit, linear or threshold, then decides the type of final behaviour produced.

3.1.1 Linear Units
Experimentation with linear units produced the connectivity shown in Figure 2(a) for the hunt and gather behaviour. The uncrossed inhibitory connections associated with the target sensors cause the robot to wander until it encounters a target, at which point it turns towards the object and stops nearby. The crossed inhibitory connections are used for obstacle avoidance: the robot slows in the presence of the object and turns away. The asymmetry in the connection weights prevents the robot from coming to a complete halt when approaching a wall perpendicularly. Instead the robot is "right-handed"; it turns to the right when both sensors are activated equally.

Fig 2. A robot that uses linear units.
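The wiring of Figure 2(a) can be sketched as a single weight matrix. The weight values below are assumptions chosen to match the behaviours described (uncrossed inhibitory target connections, crossed inhibitory obstacle connections with a slight asymmetry so the robot turns right rather than stalling head-on); the figure's exact weights are not recoverable from the text.

```python
import numpy as np

FORWARD_BIAS = 1.0  # assumed bias term that keeps the robot wandering forward

# Rows: [left motor, right motor].
# Columns: [target_L, target_R, obstacle_L, obstacle_R].
W = np.array([
    [-5.0,  0.0,  0.0, -4.5],   # left motor: uncrossed target, crossed obstacle
    [ 0.0, -5.0, -5.0,  0.0],   # right motor (slightly stronger obstacle weight)
])

def motor_commands(target_l, target_r, obstacle_l, obstacle_r):
    """Linear units: the weighted sum drives the motors directly."""
    sensors = np.array([target_l, target_r, obstacle_l, obstacle_r])
    return W @ sensors + FORWARD_BIAS
```

With equal activation on both obstacle sensors, the right motor is inhibited slightly more than the left, producing the "right-handed" turn described above.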

Some successful simulation results are shown in Figures 2(b) and 2(c). However, this simple control system is far from foolproof, even in the restricted environment of the simulation. Depending on the initial state of the robot, there will usually be some regions of the environment that remain unexplored, which may mean that the robot wanders about without locating a target; Figures 3(a) and 3(c) illustrate this difficulty. In Figure 3(b) the vehicle stops at a target, but too far away from it and with an obstacle intervening: the inhibitory effects of the obstacle sensor and the target sensor have combined to stop the robot prematurely. Figure 3(d) illustrates another problem that occurs when obstacles and targets are in close proximity: the target sensor has attracted the robot too close to an obstacle, causing a collision.


Fig 3. Some pathological cases.

This vehicle shows remarkable ability for such a simple method. The vehicle's performance on the hunt and gather task is on a par with other early reactive robots (see [3] for comparable examples), yet the control system calculation consists of only four multiplications and six additions.

3.1.2 Threshold Units
The threshold unit creates two states for each motor (on or off), giving four states of motion for the robot: straight ahead, turning left, turning right and stopped. Figure 4(a) shows the effect of changing the units in the previous experiment to threshold units. The vehicle moves at a steady pace straight ahead, with a change of state when a sensor input passes the threshold level. The greater strength of the inhibitory connection from the left obstacle sensor to the right hand motor causes the right hand motor to slow before the left hand motor, which causes the vehicle to turn right at obstacles that it approaches with both sensors activated.

Fig 4. Using threshold units.

This vehicle does not respond strongly enough to the target stimulus: Figure 4(a) shows the robot start to turn towards the stimulus but then continue through the area. This is corrected in the robot illustrated in Figure 4(b) by increasing the strength of the connections from the target sensors. Crossed excitatory connections are also introduced to ensure that the increased inhibitory value does not saturate the threshold while the robot is still too far from the target. This vehicle generally performs well but, as with the vehicle designed using linear units, it is not guaranteed to explore the entire environment while looking for the target. Further experiments have shown that the addition of recurrent connections and adjustment of the weights from the units to the actuators can produce a robot that covers the environment more thoroughly. It is not the purpose of this paper to treat exhaustively the possibilities of these units for robot control; the systems presented demonstrate that simple robot controllers are possible using single layers of neural units.
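For illustration, the four motion states can be sketched as follows; the function names are assumptions, but the on/off motor logic matches the threshold units described above.

```python
import numpy as np

def motor_outputs(W, sensors, biases):
    """Threshold units: each motor is simply on (1) or off (0)."""
    return ((W @ sensors + biases) > 0.0).astype(float)

def motion_state(left_on, right_on):
    """Map the two binary motor outputs to the vehicle's motion."""
    return {
        (1.0, 1.0): "straight ahead",
        (0.0, 1.0): "turning left",   # only the right wheel drives
        (1.0, 0.0): "turning right",  # only the left wheel drives
        (0.0, 0.0): "stopped",
    }[(left_on, right_on)]
```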

3.2 Network Training
As stated earlier, the majority of controller systems designed around neural networks have been based on complex networks generated either by backpropagation or by a genetic algorithm. This paper shows that some simpler training algorithms are suitable for training neural controllers for robots. In particular, the Widrow-Hoff rule and the Perceptron Learning Algorithm are investigated for the training of linear and non-linear units respectively. These algorithms are particularly relevant given the performance of the single layer systems in the hand-wired examples. The chief criticism of the Widrow-Hoff rule and the Perceptron Learning Algorithm was that they could train only single layers of neural units [4], even though they are guaranteed to converge on a solution if one exists; single layers of neural units are known to have limited computational abilities. The preceding examples, though, have shown that single layer solutions are possible for the hunt and gather problem.

3.2.1 Widrow-Hoff Rule
The Widrow-Hoff rule is a neural network training method based on gradient descent of the network's error over a training set of patterns. It works only for linear units, and is the direct antecedent of the popular backpropagation algorithm. The training rule can be stated:

$$\Delta w_{ij} = \eta(\zeta_i - V_i)V_j$$

where $w_{ij}$ is the weight as before, $V_x$ is the activation of unit $x$, $\zeta_i$ is the target value of unit $i$, and $\eta$ is the learning rate.
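A minimal sketch of one such update, assuming a NumPy weight matrix with one row per actuator; the function name and argument layout are illustrative, and the default learning rate of 0.01 is the value used in the experiment below.

```python
import numpy as np

def widrow_hoff_step(W, sensors, targets, lr=0.01):
    """One supervised update of a single layer of linear units.

    W:       weight matrix, one row per actuator output
    sensors: input activations V_j
    targets: desired actuator values zeta_i (from the human driver)
    """
    outputs = W @ sensors                 # linear units: no squashing
    errors = targets - outputs            # (zeta_i - V_i)
    W += lr * np.outer(errors, sensors)   # delta w_ij = eta * error_i * V_j
    return W
```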


Fig 5. Training using the Widrow-Hoff rule, showing the effect of varying the training termination point.

In this experiment, an attempt is made to train the robot for the obstacle avoidance task. The network was assigned small random weights, with both obstacle sensor inputs connected to both actuator outputs. The robot was driven around the simulated environment using key presses to indicate the desired direction of the vehicle. The prescribed path is shown in Figure 5(a) and represents an obstacle avoidance behaviour. As the vehicle was driven, the learning rule was applied with a learning rate of 0.01; for the path shown, this amounted to about 2000 sensor-actuator pairs. The behaviour of the robot after training is shown in Figure 5(b): the robot starts off well, but fails at a boundary point. A significantly different result, shown in Figure 5(c), was obtained by cutting off training just after the robot had encountered a similar situation. This robot can handle walls straight ahead, but fails to address obstacles on the right.

The disparity can be explained by observing the error $(\zeta - V)$ during training. The training data comes in clusters: as the vehicle heads through a region without obstacles, the weights favour forward motion. As the vehicle enters a region requiring obstacle avoidance, the error rises; this forces the weights into a new part of weight space and the error falls away. When a new region is reached, the weights change again, and in the act of learning about the new region the changes made for the previous region are corrupted. This feature of machine learning is sometimes described as the stability-plasticity dilemma.

3.2.2 Perceptron Learning Algorithm
The Perceptron Learning Algorithm [6] is a network training algorithm based on Hebb's rule, which implies that a connection's strength should be increased when the units that the connection relates are active simultaneously. The algorithm was implemented as:

$$\Delta w_{ij} = \begin{cases} \eta\,\zeta_i V_j, & \text{if } \zeta_i \ne V_i \\ 0, & \text{otherwise} \end{cases}$$

The network had its weights initialised to zero, with all sensor input units connected to both actuator units. The robot was manually controlled over a single pass of the path shown in Figure 6(a), applying the learning rule with a learning rate $\eta$ of 0.5 over 3000 sensor-actuator pairs.
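The update as stated can be sketched as follows. Note that it is applied only when an output unit is wrong, and that the increment uses the target value $\zeta_i$ rather than the error; names are illustrative, and the biases are treated as fixed here for simplicity.

```python
import numpy as np

def perceptron_step(W, biases, sensors, targets, lr=0.5):
    """One update of a single layer of threshold units.

    targets: desired binary actuator values zeta_i (from the trainer)
    """
    outputs = ((W @ sensors + biases) > 0.0).astype(float)
    wrong = targets != outputs                     # update only on error
    W += lr * np.outer(wrong * targets, sensors)   # eta * zeta_i * V_j
    return W, outputs
```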


Fig 6. Training using Perceptron Learning, showing the effect of retraining on the first corner.

When the vehicle was returned to the starting position to run autonomously, it initially had a problem at the first corner, where it turned in an unexpected direction (Figure 6(b)). The behaviour that followed was reasonable in that it performed obstacle avoidance, but not quite as it was trained. Keeping in mind the stability-plasticity dilemma, this is not unreasonable: the first corner is a situation that the robot saw only once, at the very beginning of the session. When the robot had some additional training by a single pass through that corner, the useful hunting behaviour shown in Figure 6(c) emerged.

4 Training a Real Robot
The mechanisms described have been investigated for behaviour generation in a vision guided robot: CORGI. CORGI is designed to collect objects from an office environment using only its vision sense for guidance. The vision sense has been developed using trained neural networks, allowing the robot to be trained to collect a variety of objects or be adapted to operate in new environments. The details of the methods used to train the vision component of CORGI to recognise tennis balls in an office environment are presented in [8].

The trained vision system of CORGI is a sensor system with characteristics similar to the sensors simulated above. The system has two outputs, one to represent the presence of the recognised objects on the left, the other to represent the right; an object in the centre of the screen registers on both outputs. For the purposes of this experiment two vision systems were trained, one to detect obstacles and the other to detect tennis balls. The obstacle sensor had a reliability of 89%, while the tennis ball sensor had a reliability of 82%.

The output of the control system is used to directly drive the robot motors, which are situated wheelchair fashion on either side of the robot, not unlike the robot used in the simulation. Distinctly different, though, is the dynamic nature of the real robot. The simulated robot had exact velocity control of its motors; the real robot has no velocity control, and the time constant imposed by the inertia of the robot comes into play. Most notably, when approaching an obstacle the robot will tend to move towards that obstacle for some time after the turning commands start. Initial investigations with hand-wired examples found that the motors had to be reversed to achieve a fast enough response. This had the added advantage that if the robot unexpectedly came upon an obstacle, it would back up sharply, reversing out of harm's way. This was achieved using a modified network architecture with two units per motor, as shown in Figure 7. This configuration provides connections for forward and reverse activation of the motor. Note that the

connections to the motors are not trained, and are fixed at +1 and -1 respectively. The Perceptron Learning Algorithm was used to train the connection strengths between sensor outputs and motor inputs, so threshold units were chosen for these motor units.

Fig 7. Connection architecture for CORGI.

The robot was driven about the training environment for about two minutes, using a joystick to avoid obstacles and halt at tennis balls. Observation of the behaviour after two minutes of training showed that the robot favoured left hand turns to avoid obstacles, induced by a bias in the training operation. Some further training favouring right hand turns corrected this tendency. The robot also tended to run past tennis balls rather than stopping, so a further two minutes of training was used to emphasise turning towards tennis balls and stopping.

The obstacle avoidance behaviour was the more impressive of the two behaviours. In subsequent trials, it has never failed while in the training environment. It also performs well in unseen environments, with the performance level degrading as the performance of the sensory network degrades. The trained controller was able to compensate for the inertial effects of the robot and the noisy nature of the obstacle sensor. In fact, this noise sometimes became a bonus, helping to resolve symmetrical situations and making the path across free space less predictable, ensuring that the robot explored all areas of the environment.

The tennis ball collection behaviour was not as successful. The noise from the tennis ball perception network made the approach to the ball irregular, making subsequent manipulation of the ball impossible. Modifications are required to the tennis ball sensor to improve accuracy and guarantee a consistent approach.

The network was able to learn the hunt and gather behaviour after only four minutes of training, which corresponds to about 2000 iterations through the learning cycle. While learning, the robot updated its weights at about 8 Hz. This represents a small amount of additional overhead compared to the time taken to run the perception network, which runs at about 9 Hz. The experiment also showed that aspects of the robot's behaviour could be improved with additional training without degrading the performance of other aspects of behaviour. This demonstrates that the Perceptron Learning Algorithm has both stability and plasticity when applied to the robot behaviour control problem.
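A sketch of this arrangement, assuming NumPy and unit names following the figure (FL/BL for the forward and backward left-motor units, FR/BR for the right); the trainable sensor wiring W is an assumption, while the fixed +1/-1 output connections are as described above.

```python
import numpy as np

def heaviside(x):
    return (x > 0.0).astype(float)

def motor_speeds(W, biases, sensor_outputs):
    """Compute left/right motor drive from the vision sensor outputs.

    W has four rows, one per threshold unit: FL, BL, FR, BR.
    Only W (and biases) are trained; the +1/-1 motor connections are
    fixed, so each motor runs forward, in reverse, or not at all.
    """
    fl, bl, fr, br = heaviside(W @ sensor_outputs + biases)
    left_motor = (+1.0) * fl + (-1.0) * bl
    right_motor = (+1.0) * fr + (-1.0) * br
    return left_motor, right_motor
```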

5 Conclusions
This paper has shown, by simulation and by experiments with a real robot, that the sensor to motor mapping required to achieve the hunt and gather behaviour is both linearly separable and stationary. Linear separability implies that higher order networks, such as the multi-layer perceptron, are not necessary. The fact that the mapping is stationary means that storage of context information is not required, and hence that the network does not require backward or recurrent connections. The minimalist approach developed here greatly reduces the resources required for network computation and greatly reduces the complexity of the training algorithm.

References
[1] Braitenberg, V. (1984) Vehicles: Experiments in Synthetic Psychology. MIT Press, Cambridge, MA.
[2] Dorigo, M. (1996) Introduction to the Special Issue on Learning Autonomous Robots. IEEE Transactions on Systems, Man and Cybernetics, vol. 26, no. 3, June 1996.
[3] Maes, P., ed. (1990) Designing Autonomous Agents: Theory and Practice from Biology to Engineering and Back. MIT Press, Cambridge, MA.
[4] Minsky, M.L. and Papert, S.A. (1969) Perceptrons. MIT Press, Cambridge, MA.
[5] Pomerleau, D.A. (1993) Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishers.
[6] Rosenblatt, F. (1962) Principles of Neurodynamics. Spartan.
[7] Widrow, B. and Hoff, M.E. (1960) Adaptive Switching Circuits. 1960 IRE WESCON Convention Record, part 4, pp. 96-104.
[8] Wyeth, G.F. (1997) CORGI: A Robot to Play Fetch. To appear in Australian Journal of Intelligent Information Processing Systems.