
Evolving Recurrent Neural Network Architectures by Genetic Programming

Anna I. Esparcia-Alcázar & Ken Sharman
Dept. of Electronics & Electrical Engineering, The University of Glasgow, Glasgow G12 8LT, Scotland, UK
e-mail: [email protected], [email protected]

ABSTRACT

We propose a novel design paradigm for recurrent neural networks. This employs a two-stage Genetic Programming / Simulated Annealing hybrid algorithm to produce a neural network which satisfies a set of design constraints. The Genetic Programming part of the algorithm is used to evolve the general topology of the network, along with specifications for the neuronal transfer functions, while the Simulated Annealing component of the algorithm adapts the network's connection weights in response to a set of training data. Our approach offers important advantages over existing methods for automated network design. Firstly, it allows us to develop recurrent network connections; secondly, we are able to employ neurones with arbitrary transfer functions; and thirdly, the approach yields an efficient and easy-to-implement on-line training algorithm. The procedures involved in using the GP/SA hybrid algorithm are illustrated by using it to design a neural network for adaptive filtering in a signal processing application.

1. Introduction

In the past, the problem of automatically obtaining the topology of a neural network has been tackled by several authors, generally in one of two distinct ways.

Constructive algorithms start with a network of a given minimum size and add nodes and links as appropriate, whereas destructive algorithms assume a network of a certain maximum size and remove unneeded components. It has been shown that both methods are somewhat restrictive, because they limit, in some way, the final network architectures that can be reached. A third class of design methods, using Evolutionary Algorithms (such as genetic algorithms (GAs) and evolutionary programming (EP)), has more recently been applied to designing neural networks. These are population-based search methods which do not have such constraints.

In GAs, both the structure and parameters of a network are represented as a (usually) fixed-length string, and a population of such strings (or genotypes) is evolved, mainly by recombination. However, one problem with this approach is the size of the required genotype bit string (Kitano, 1990): for a fully connected network with N neurones, there must be at least N² genetic components to specify the connection weights. This large genome size leads to impractically long convergence times. This problem has been addressed by Gruau (Gruau, 1994), who has developed a compact cellular growth (constructive) algorithm based on symbolic S-expressions that are evolved by Genetic Programming. Although this method can evolve very elaborate structures, we have observed that it takes a very long time to converge to an optimum, which makes it unsuitable for certain applications.

On the other hand, in standard EP there is no such dual representation where a genotype maps to a network phenotype. EP algorithms evolve by mutation only, operating directly on the network components. It has been suggested (Angeline, Saunders and Pollack, 1994) that this approach is much better suited to network design than GAs. The basis for this claim is that the dual gene/phene relationship used in GAs abstracts the physical network topology and leads to deceptive evolution. We believe, however, that any problems are due to the representations and codings used, and not to the genetic techniques themselves, and that, with suitable coding schemes, it is possible to develop non-deceptive GA-based design algorithms.

We present here a recombination-based algorithm, a hybrid of Genetic Programming and Simulated Annealing. The basic idea is to separate the evolution of the structures from the evolution of the weights: the weights are encoded as a vector of node gains and are not part of the genotype. The weight vector is initialised whenever a new genotype is created. The genotypes (the structures) are then evolved by GP and the weights by SA. This GP/SA hybrid algorithm is described in section 2 below and represents a powerful automated design technique for a wide variety of neural network systems. Its main features are:

1. It easily caters for both recurrent (feedback) and non-recurrent network architectures.
2. It allows for the use of arbitrary neuronal transfer functions and is not restricted to the weighted-sum sigmoid of classical systems.
3. The training algorithm is automatically provided as part of the evolution.

A rough sketch of the overall two-stage loop is given below.
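The following is a minimal control-flow sketch of this two-stage scheme, written in Python purely for illustration; the tree format, the crossover operator, the fitness measure and the annealing routine are hypothetical placeholders (the actual operators are described in sections 2 and 3). It shows only the division of labour: GP evolves the structures by recombination, while each structure's gain vector is adapted separately and is not part of the genotype.

    import random

    def random_tree():                  # placeholder expression tree (prefix symbols)
        return [random.choice("+-*"), "x0", "stk0"]

    def crossover(a, b):                # placeholder subtree exchange
        return a[:2] + b[2:]

    def anneal_gains(tree):             # stub: the real loop is given in Table 1 (section 3)
        gains = [random.uniform(-1.0, 1.0) for _ in tree]   # fresh weight vector per genotype
        fitness = -abs(sum(gains))      # placeholder fitness after weight adaptation
        return gains, fitness

    population = [random_tree() for _ in range(20)]
    for generation in range(10):
        # adapt each individual's node gains, then rank by the resulting fitness
        scored = sorted(((anneal_gains(t)[1], t) for t in population),
                        key=lambda s: s[0])
        parents = [t for _, t in scored[10:]]          # keep the fitter half
        offspring = [crossover(random.choice(parents), random.choice(parents))
                     for _ in range(10)]
        population = parents + offspring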

2.1 Functional Specification of Neural Networks

In this section we describe the basis of our approach, which is to describe networks and their dynamics as discrete-time functions. These discrete-time functions are then represented as a set of expression trees which are evolved by Genetic Programming. A Simulated Annealing algorithm operates on the parameters of these expression trees to implement on-line training. We consider discrete-time networks described by the following pair of equations:

    y_n = F(x_n, s_n)
    s_{n+1} = G(x_n, s_n)

where y_n ∈ ℜ^N and x_n ∈ ℜ^M, n = 0, 1, …, are the N outputs from and M inputs to the network respectively. The vector s_n ∈ ℜ^L represents a set of L internal state variables, which are required to describe recurrent connections within the network. The function F: ℜ^M × ℜ^L → ℜ^N describes the input/output mapping of the system, while G: ℜ^M × ℜ^L → ℜ^L describes the internal network dynamics in terms of the state variables.

These two functions can be written as lists of single-output/multi-input expressions, F = {f_1, f_2, …, f_N} and G = {g_1, g_2, …, g_L}, where f_i and g_i are single-valued functions of the inputs and state variables. These sets of individual functions are the objects which we evolve using Genetic Programming. To do this we write each function as an expression tree, which in turn can be written as a variable-length string of symbols in Polish (prefix) notation. The following section discusses the components of these expression trees in detail.
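As an illustration, a discrete-time network in this state-space form can be simulated by iterating the two equations over an input sequence. The particular F and G below are arbitrary examples chosen only to make the sketch runnable, not functions used in this work:

    import math

    # F maps (input, state) -> output; G maps (input, state) -> next state.
    def F(x, s):                        # one output (N = 1), illustrative only
        return [math.tanh(x[0] + 0.5 * s[0])]

    def G(x, s):                        # one state variable (L = 1), illustrative only
        return [0.9 * s[0] + 0.1 * x[0]]

    def run_network(inputs, s0):
        """Iterate y_n = F(x_n, s_n), s_{n+1} = G(x_n, s_n)."""
        s, outputs = s0, []
        for x in inputs:
            outputs.append(F(x, s))     # current output
            s = G(x, s)                 # state carried to the next sample
        return outputs

    print(run_network([[1.0], [0.0], [-1.0]], s0=[0.0]))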

2.2 Components of the Expression Trees

In this section we introduce a class of nodes as the primitive elements of expression trees which are suited to designing neural networks. A node is a primitive function with one output and zero or more inputs. Every node also has a real-valued output gain, w, which is adjusted by the Simulated Annealing algorithm (see Section 3). In the node names below, the index N represents an integer in the range [0 … maxInt]. The basic nodes we will be using are:

• Input/output nodes, xN and yN. The former represent the system input and the latter the system output, both delayed by N samples.

• Non-linear transfer function node, nlN, which implements the sigmoidal function

    g(x) = (1 − e^(−βx)) / (1 + e^(−βx))

where the amount of non-linearity β, β ∈ [β_lo … β_hi], is a linear function of the parameter N, as follows. The range [β_lo … β_hi] is partitioned into maxInt equally spaced subintervals, which provide maxInt + 1 possible values for β; the parameter N simply addresses each of these values, e.g. for nl0, β = β_lo, and for nlmaxInt, β = β_hi. (A small numerical sketch of this mapping is given at the end of this section.)

• Delay node, Z, which returns the value of its argument delayed by one time sample. This is implemented using a memory stack with as many positions as Z nodes appearing in a particular tree (up to a certain maximum). When a particular Z node is evaluated, the value at the associated position of the stack is returned as the output of the node and its current input is stored in its place.

• Function node, fN, which executes the Nth subroutine tree. These subroutines, also called automatically defined functions (Koza, 1994), are in every respect the same as any other function used by the main tree, except that they can have a variable number of arguments. Subroutine trees are intrinsic to a particular main tree; they are created and evolve together with it and are not accessible by any other tree. Thus, an expression tree is properly defined as the set comprising a main tree and all its associated subroutine trees, if any. Function nodes are important in addressing the problem of scalability (i.e. the growth in the size of the expression trees as the size of the networks increases): by providing a compact way of expressing repetitive tasks, they allow bigger networks to be expressed as small trees.

• Average node, avgN, which returns the average of its N inputs.

• Constant node, cN, which returns the Nth entry of a constant table, whose values can be randomly initialised or preselected by the user.

Together with these, other node types are used to implement mathematical operations: +, −, *, /, +1, −1, *2 and /2. It is interesting to note that we are not restricting the networks to have sigmoidal cells with a weighted sum of inputs.
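As an illustration of the nlN node, the following sketch shows how the integer index N selects one of the maxInt + 1 equally spaced values of β and applies the sigmoid above. The numerical values of maxInt, β_lo and β_hi used here are arbitrary choices for the example, not the settings used in the experiments:

    import math

    MAX_INT, BETA_LO, BETA_HI = 255, 0.1, 10.0   # illustrative values only

    def nl(n, x):
        """Sigmoid g(x) = (1 - e^(-beta*x)) / (1 + e^(-beta*x)), with beta chosen by n."""
        beta = BETA_LO + n * (BETA_HI - BETA_LO) / MAX_INT
        return (1.0 - math.exp(-beta * x)) / (1.0 + math.exp(-beta * x))

    print(nl(0, 1.0), nl(MAX_INT, 1.0))          # nl0 uses beta_lo, nl255 uses beta_hi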

2.3 Local Recursion Nodes

In the previous section we showed how to achieve recursion from the output of the tree. Here we present the main contribution of this paper: the introduction of nodes that allow for local recursion within the network. These are called psh and stkN; the former pushes the value of its associated subtree onto a stack, while the latter returns the value of the Nth position in the stack. In order to maintain data coherency, two stacks are used, whose size equals the maximum number of psh nodes allowed in a tree. In a particular evaluation of the tree (e.g. for the nth input), one of the stacks is used for writing whenever a psh node is encountered; the other stack is used for reading whenever an stk node is encountered. In the next evaluation (for the (n+1)th input) the two stacks are swapped, so that what was written before is now read.

Internal recursion is important for developing modular solutions to certain problems. For example, the biquadratic digital filter section in canonical form is described by the coupled equations (Proakis & Manolakis, 1992)

    p_n = x_n + c1·p_{n−1} + c2·p_{n−2}
    y_n = p_n + c3·p_{n−1} + c4·p_{n−2}

A possible genotype coding for this using psh and stkN nodes is

    (+ (psh (+ x0 (+ (* c1 stk0) (* c2 (Z stk0))))) (+ (* c3 stk0) (* c4 (Z stk0))))

In this expression tree, the sub-tree rooted at the psh node evaluates the term p_n and pushes this value onto the stack memory, ready for the next cycle. The stk0 node therefore returns the value p_{n−1}, which can be delayed by the Z node to obtain p_{n−2}. It could be argued that an equivalent result could be achieved using Z nodes only. In digital signal processing practice, however, this would imply a loss of significant digits in the obtained parameters, which does not occur when internal recursion is used. An alternative way of achieving internal recursion would be to keep track of the output of every node in the tree in each evaluation and to define a node type that addresses this information in the next evaluation. We believe, though, that the amount of memory required would be too high, especially considering that not all of the stored information would be needed. A small sketch of the double-stack mechanism, applied to this biquadratic example, is given below.
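In the following illustration (with arbitrary coefficient values), one stack position corresponds to the single psh node, a one-sample memory cell plays the role of the Z node, and the two stacks are swapped after every evaluation so that the value pushed at time n is read back at time n+1:

    def biquad(xs, c1=0.5, c2=-0.25, c3=0.3, c4=0.1):   # coefficients are illustrative
        write_stk, read_stk = [0.0], [0.0]   # one psh node -> one position in each stack
        z_mem = 0.0                          # memory of the single Z node
        ys = []
        for x in xs:
            p_prev = read_stk[0]             # stk0: value pushed on the previous cycle, p_{n-1}
            p_prev2, z_mem = z_mem, p_prev   # Z stk0: one further sample of delay, p_{n-2}
            p = x + c1 * p_prev + c2 * p_prev2          # the psh sub-tree computes p_n
            write_stk[0] = p                            # psh stores p_n for the next cycle
            ys.append(p + c3 * p_prev + c4 * p_prev2)   # the remaining sub-tree gives y_n
            write_stk, read_stk = read_stk, write_stk   # swap stacks between evaluations
        return ys

    print(biquad([1.0, 0.0, 0.0, 0.0, 0.0]))   # impulse response of the sketch filter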

2.4 An Example

In Figure 1 a three-cell RNN and a possible tree representation for it are shown. The tree representation is:

    [Main tree] (+ f1 (* c0 (+ f2 f3)))
    [f1] psh nl1 (avg4 (x0 stk0 stk1 stk2))
    [f2] psh nl2 (avg4 (x0 stk0 stk1 stk2))
    [f3] psh nl2 (avg4 (x0 stk0 stk1 stk2))

Figure 1 - A Fully Recurrent Neural Network. Each processing cell, labelled P1–P3, implements a sigmoidal transfer function on the average of the cell's input values. The connecting links have independent strengths labelled wij. The system output is taken from cell P3 and the input is applied to each cell simultaneously. Here, each cell in the network is represented by its own function (f1, f2 and f3), which are invoked by the main tree. The main tree executes the tree associated with each function, but only returns the value of the output node, which is represented by function f1 (if c0 = 0). The psh nodes in each function tree store the computed cell outputs on the stack, and these are accessed from the previous cycle using the stkN nodes. Therefore, x0 represents the input at the current instant of time, stk0 represents f1 at the previous instant, and so on. The only differences between f1, f2 and f3 are the values of their respective node gains and the amount of non-linearity introduced by the nl nodes. In this representation the weights wij have been omitted for simplicity. We will see how to include them in the next section.

3. Simulated Annealing: Training by Evolution

3.1 Numerical Values Problem in GP

The problem of obtaining the parameter values for a system has in general been addressed by including constant nodes in the terminal set. For instance, Koza (Koza, 1992) defines an ephemeral random constant as a terminal which is assigned a random value every time it appears in the first generation. The values remain constant afterwards, and therefore there is no learning of the parameters. A more recent way of tackling this problem is that of Howard & D'Angelo (Howard & D'Angelo, 1995). In their method an individual consists of a tree plus a binary string, the latter encoding the values of a (predefined) fixed number of coefficients. These coefficients are evolved together with the structure (in a GA manner), although they may or may not be used by it.

3.2 Node Gains

In both methods described above, the parameters are terminal nodes. Here we describe another way of addressing the parameter representation, by means of what we call node gains. A gain value is assigned to each link between a pair of nodes in the tree, so that the data are modified as they pass from node to node through the tree. These gain values are optimised using a simulated annealing algorithm. Consider the link between the output of a node labelled P and the input to a following node labelled Q. The link has a strength of α_pq (a real number), and the relationship between the value at the output of node P, x, and the input to node Q, y, is y = α_pq · x.

Fitness maximisation with respect to the node gains is accomplished using an annealing algorithm (Press et al., 1988), which is applied after a new tree has been produced by crossover or mutation during evolution. Let α_pq(i) be the vector of the values of all the node gains in a tree at iteration i of the annealing process, and let f(i) be the fitness of the tree using this set of node gains. The annealing algorithm is summarised in Table 1; a short sketch of this loop follows the table.

Table 1: The Annealing Algorithm for adapting node gains

while ( not terminated ) do
  1. Perturb α_pq(i) to get α_pq'(i).
  2. Evaluate the fitness, f'(i), using these perturbed parameters.
  3. If f'(i) ≥ f(i), then accept the perturbation, α_pq(i+1) = α_pq'(i), and continue.
  4. Else accept the perturbation with probability e^((f'(i) − f(i)) / T) and continue.
  5. Reduce the temperature T according to an annealing schedule.
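The loop of Table 1 can be written down almost verbatim. In the sketch below, the fitness function, the perturbation size and the cooling schedule are illustrative assumptions; any fitness measure defined over the gain vector of a tree could be plugged in:

    import math
    import random

    def anneal_node_gains(fitness, gains, t0=1.0, cooling=0.99, iterations=1000):
        f, t = fitness(gains), t0
        for _ in range(iterations):
            # Step 1: perturb alpha_pq(i) to obtain alpha_pq'(i)
            candidate = [g + random.gauss(0.0, 0.05) for g in gains]
            # Step 2: evaluate the fitness f'(i) with the perturbed parameters
            f_new = fitness(candidate)
            # Steps 3-4: accept if no worse, else accept with probability e^((f'(i)-f(i))/T)
            if f_new >= f or random.random() < math.exp((f_new - f) / t):
                gains, f = candidate, f_new
            # Step 5: reduce the temperature according to the annealing schedule
            t *= cooling
        return gains, f

    # Toy usage: adjust two gains so that g0 + 2*g1 approaches 3 (purely illustrative).
    best, best_f = anneal_node_gains(lambda g: -abs(g[0] + 2 * g[1] - 3), [0.0, 0.0])
    print(best, best_f)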

Two interesting features of the simulated annealing algorithm are, first, that no knowledge of the derivatives is required and, second, that the algorithm does not get stuck in local optima. These features are also shared by the GA, but we believe that SA has advantages over GAs in this particular case. The first is the simplicity of the implementation: with a GA it would be necessary to run a separate GA to adapt the weights of every single individual, and also to take into account that the number of weights would be different for each of them; this looks very computationally expensive when compared to the annealing method. The second, which we consider to be the main attraction of the Simulated Annealing algorithm, is that it is similar to the learning process in nature. When an individual is young, it can accept changes that decrease its fitness with a relatively high probability; the older the individual gets, the smaller this probability becomes.

4. Results

We applied the techniques presented in this paper to a commonly encountered signal processing task, namely the channel equalisation problem. This consists of restoring a sequence of symbols (a signal), s, which has been distorted by a non-linear communications channel and corrupted by additive Gaussian noise, n, and thus converted into a set of observations, o.

Figure 2: The Channel Equalisation problem. (The original block diagram shows the signal s_k passing through the unknown channel, additive noise n_k forming the observations o_k, and an adaptive filter whose output is compared with s_k to give the error ε_k.)

[Figure 3 panels: Tree; Neural Network (phenotype); Symbol String (genotype).]

[ Main Tree ] ( [-0.997972] nl85 ( [5.80889] + ( [5.15094] + [1.8461] stk2 ( [-0.8095] Z [3.1497] x0 ) ) ( [-10.4714] - ( [2.5497] psh [1.7968] x1 ) ( [-1.88777] + [-1.36576] y1 [-0.311426] x1 ) ) ) )

Figure 3: An equalising filter for the channel x_k = c_k + 0.015·c_k² + n_k, where c_k = 0.5·c_{k-1} + s_k + 0.6·s_{k-1}, n_k is the additive Gaussian noise and s_k is the original signal to be restored. The variance of the noise was chosen so that the resulting signal-to-noise ratio was 30 dB. The output of the channel, x, is used as an input to the GP system, with s as a reference; y represents the output of the GP system. This solution was obtained in one of the runs and is depicted here in three representations: tree, neural network and symbol string. The links in both the tree and the neural network are weighted, i.e. there is a gain value associated with them. These gain values are represented by the numbers in brackets in the symbol string. In the neural network, ∆ represents a unit sample delay and the sigmoidal function is represented by its characteristic shape. Note that the number of nodes in the tree does not coincide with the number of processing cells in the neural network; for instance, the psh node does not have an equivalent processing cell in the NN. Also note that because there is only one psh node, the value of stk3 is clipped to stk0. A constant table of predefined values was used during the GP evolution, but none of them appears in this solution (nor in any of the ones evolved in other runs). This suggests that the node gains approach eliminates the need for constant nodes. No function nodes (or ADFs) were used in this example.

The aim is to find a function F(·) so that ŝ = F(o), where ŝ is the estimate of the signal and F(·) is the so-called equalising filter. The classical approach to this problem involves sending an initial known signal and finding the filter that minimises the mean square output error, |s − ŝ|². It has been shown (Kechriotis, Zervas & Manolakos, 1994) that recurrent neural networks are very proficient at tackling this problem, known as trained adaptation.

An example is given in Figure 3. The symbol string (the values in brackets being the node gains) can be represented as an expression tree or as a neural network, as shown. The procedure to obtain this tree was as follows. The fitness for GP evolution was calculated using 50 samples of the observations (after a transient of 20 samples had been removed); the termination criterion for the run was the performance of a number of node evaluations equal to 3·10⁸. This typically results in a system (the solution tree) with a high fitness value and for which the number of misclassified symbols is zero. This solution was then tested with a further 10000 samples of the observations; the number of misclassified symbols was 0, giving a bit error rate (BER) of 0.00%. In the rest of the runs the structure of the solutions obtained was similar to the one shown here (with an nl node at the root of the tree), with values of the BER ranging from 0 to 0.12%. This range of BER would be acceptable in speech transmission, where the BER has to be below 1%, i.e. one error every one hundred samples. In computer data transmission, however, the BER has to be below 0.001% (i.e. one error in 100000 samples), and therefore some of these solutions would be unacceptable.
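For reference, the test channel of Figure 3 and the bit-error-rate measurement can be sketched as follows. Only the channel equations, the 30 dB signal-to-noise ratio and the idea of counting misclassified symbols come from the text; the ±1 symbol alphabet, the noise scaling and the naive threshold decision used here are illustrative assumptions (a proper equaliser, such as the evolved filter, would take the place of the threshold):

    import math
    import random

    def simulate_channel(num_symbols, snr_db=30.0, seed=0):
        rng = random.Random(seed)
        s = [rng.choice((-1.0, 1.0)) for _ in range(num_symbols)]   # assumed symbol alphabet
        c_prev, s_prev, clean = 0.0, 0.0, []
        for sk in s:
            c = 0.5 * c_prev + sk + 0.6 * s_prev     # c_k = 0.5*c_{k-1} + s_k + 0.6*s_{k-1}
            clean.append(c + 0.015 * c * c)          # x_k before noise: c_k + 0.015*c_k^2
            c_prev, s_prev = c, sk
        power = sum(v * v for v in clean) / num_symbols
        sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))        # noise level for ~30 dB SNR
        x = [v + rng.gauss(0.0, sigma) for v in clean]
        return s, x

    def bit_error_rate(reference, decisions):
        errors = sum(1 for a, b in zip(reference, decisions) if a != b)
        return errors / len(reference)

    s, x = simulate_channel(10000)
    naive = [1.0 if v >= 0.0 else -1.0 for v in x]   # threshold the raw observations
    print("BER without equalisation:", bit_error_rate(s, naive))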

5. Conclusions

We have shown how Recurrent Neural Networks can be generated and trained by means of a combination of Genetic Programming and Simulated Annealing techniques. Genetic Programming is a recombination-based evolutionary algorithm which has been used here to evolve the structure of the networks, represented as expression trees. For the tree representation we have introduced node types that are well suited to tackling discrete-time systems in general and digital signal processing problems in particular. Of special interest are the nodes that allow for time recursion, because by this means recurrent neural networks can be easily represented. An added feature of the RNNs thus obtained is that they are more general than classic NNs, in the sense that different kinds of operations are allowed in their cells. Simulated Annealing is used to adapt the weights of each network, which are represented as a vector of node gains. SA has the advantages over gradient-based methods of not needing information about the derivatives and of not getting stuck in local optima. It also has advantages over GAs, namely the simplicity of implementation and the similarity to the learning process in nature. We are aware, however, that the implementation of the hybrid GP/SA algorithm is very computationally expensive and therefore not yet suited to real-time applications. Further research will still have to be done in this domain.

Bibliography

Angeline PJ, Saunders GM and Pollack JB, 1994. An Evolutionary Algorithm that Constructs Recurrent Neural Networks. IEEE Trans. on Neural Networks, vol. 5, Jan. 1994.

Gruau F, 1994. Genetic Micro Programming of Neural Networks. In Advances in Genetic Programming, The MIT Press.

Howard LM and D'Angelo DJ, 1995. The GA-P: a Genetic Algorithm and Genetic Programming Hybrid. IEEE Expert, Aug. 1995, pp. 11-15.

Kechriotis G, Zervas E and Manolakos ES, 1994. Using Recurrent Neural Networks for Adaptive Communication Channel Equalization. IEEE Trans. on Neural Networks, vol. 5, pp. 267-278.

Kitano H, 1990. Designing Neural Networks Using Genetic Algorithms with Graph Generation System. Complex Systems, vol. 4.

Koza JR, 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press.

Koza JR, 1994. Genetic Programming II: Automatic Discovery of Reusable Programs. The MIT Press.

Press WH, Flannery BP, Teukolsky SA and Vetterling WT, 1988. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.

Proakis G and Manolakis DG, 1992. Digital Signal Processing: Principles, Algorithms and Applications. Macmillan.

Sharman KC, Esparcia-Alcázar AI and Li Y, 1995. Evolving Signal Processing Algorithms by Genetic Programming. Procs. of the IEE/IEEE Conf. on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA).

Sharman KC and Esparcia-Alcázar AI, 1993. Genetic Evolution of Symbolic Signal Models. Procs. of the 2nd Workshop on Natural Algorithms in Signal Processing, 1993.