Evolving Neural Feedforward Networks

Heinrich Braun
Institut für Logik, Komplexität und Deduktionssysteme
Universität Karlsruhe
[email protected]

Joachim Weisbrod
Institut für Programmstrukturen und Datenorganisation
Universität Karlsruhe
[email protected]

Abstract

For many practical problem domains the use of neural networks has led to very satisfactory results. Nevertheless, the choice of an appropriate, problem-specific network architecture still remains a poorly understood task. Given an actual problem, one can choose a few different architectures, train each of them a few times, and finally select the architecture with the best behaviour. But, of course, there may exist totally different and much better suited topologies. In this paper we present a genetic-algorithm-driven network generator that evolves neural feedforward network architectures for specific problems. Our system ENZO (Evolutiver Netzwerk-Optimierer) optimizes both the network topology and the connection weights at the same time, thereby saving an order of magnitude in necessary learning time. Together with our new concept for solving the crucial neural network problem of permuted internal representations, this approach provides an efficient and successful crossover operator. This makes ENZO well suited to managing the large networks needed in application-oriented domains. In experiments with three different applications our system generated very successful networks. The generated topologies show distinct improvements in network size, learning time, and generalization ability.

1 Introduction

The imitation of biological mechanisms works well in the case of both neural networks and genetic algorithms.

Therefore the idea to follow the biological paradigm and to optimize neural networks (NN) using genetic algorithms (GA) suggests itself. Indeed, the optimization of network topologies according to explicit performance criteria seems to be a task predestined for evolutionary methods. In our work we show that it is possible to generate appropriate problem-specific feedforward network architectures by simultaneously optimizing both network topology and connection weights using genetic algorithms. Our network generator ENZO is such a system; it has been successfully employed to meet the needs of different tasks. Using ENZO we tested our algorithm with three different applications: (1) a pattern recognition problem, (2) the emulation of the kinematics of a backing up truck derived from [1], and (3) the endgame of the two-player game Nine Men's Morris [2], employing a (23-12-10-10) topology, a (3-20-3) topology, and a (120-60-20-2-1) topology, respectively. At least the last of these is much more complex than the problems considered in publications on comparable topics ([3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]). As Miller et al. [9] argue, the search space of possible network topologies is infinitely large, not differentiable, complex, noisy, deceptive, and multimodal. These attributes make random, enumerative, gradient descent, or heuristic search methods impracticable. Genetic algorithms, however, represent a search method that is able to manage the demands of the examined search space [16]. Accordingly, there has been a lot of work concerning the use of genetic methods to evolve problem-specific network architectures. On the other hand there have been attempts to achieve more robust learning techniques by using genetic methods ([12], [13]), the most promising of them being combinations of standard backpropagation (BP) and GA ([3], [14]).

while evolution do
    (1) select one resp. two parents            [selection]
    (2) generate one offspring                  [mutation resp. crossover]
    (3) evaluate offspring                      [learning & evaluation]
    (4) insert offspring in population and
        delete the last population element      [survival of the fittest]

Figure 1: Genetic algorithm skeleton
But from our point of view there is no reason to strictly separate topology and weight optimization. Our network generator ENZO successfully hybridizes both optimization processes, additionally establishing powerful mechanisms to both improve the optimization process and save learning time. Genetic topology optimization methods can be divided into two classes according to their phenotype-genotype mapping: strong and weak representations. In strong representation schemes each gene of the genotype's genstring is interpreted as an individual connection between two units of the represented network, so the length of the genstring equals the number of potential connections allowed by the represented architecture. In weak representation schemes the genes correspond to more abstract network properties; examples of such weak encodings can be found in [6], [7], or [15]. We agree with Miller et al. [9] that weak schemes may be useful for 'capturing the architectural regularities of large networks rather efficiently'. But their application also requires much more detailed knowledge of both genetic and neural mechanisms. For this reason we preferred a strong encoding in our work. Interesting explorations using strong representation schemes are described in [8], [9], or [14], for instance. In all of these papers, however, the resulting algorithms are evaluated only on very 'small' applications like the XOR problem or the 2-bit adder, and generalizing the results of these experiments to real-world applications is by no means trivial.
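To make the strong scheme concrete, the following minimal sketch (our illustration, not the authors' code; the topology and all names are invented) builds a genstring with exactly one gene per potential connection of a layered feedforward net:

import numpy as np

# Hypothetical "maximal" 3-5-2 feedforward topology: every connection
# between successive layers is one potential connection, i.e. one gene.
layers = [3, 5, 2]
n_potential = sum(a * b for a, b in zip(layers, layers[1:]))  # 3*5 + 5*2 = 25

# A strong genstring is a vector of exactly n_potential genes; in the
# simplest case each gene just says whether its connection exists.
rng = np.random.default_rng(0)
genstring = rng.random(n_potential) < 0.5

print(f"{genstring.sum()} of {n_potential} potential connections established")

The genstring length is fixed by the maximal topology, which is precisely what distinguishes a strong encoding from the weak, property-based encodings discussed above.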

2 Our approach

2.1 Basic algorithm

Our main design decisions were influenced by the desire to create a network generator that is both easy to control and able to handle 'large', real-world applications. For this reason we selected a strong representation scheme, meaning that every gene of the genotype relates to exactly one connection of the represented network. Therefore the set of possible connections is fixed and the genetic algorithm searches for an optimal topology using a subset of these connections. However, a gene of our genstring has to encode more than just the two states ⟨existing & learnable⟩ and ⟨not existing⟩. Especially for applications with complex networks like the Nine Men's Morris example mentioned above it is often necessary, or at least profitable, to fix some network properties a priori. Examples of such properties are fixed weights (⟨existing & not learnable⟩) or connections that are linked together in order to have the same weights (⟨existing & linked to ...⟩). Our system was designed to accommodate such a priori arrangements. ENZO uses the same scheme already proposed by Braun ([17], [18]) for finding optimal solutions of large travelling salesman problems (see Figure 1). Our algorithm briefly works as follows (see Figure 2): Given the population size S, the initial connection density P_I, and the specification of the 'maximal' topology, ENZO generates a start population pop of S different networks, each of them using about 100·P_I % of the total number of connections allowed by the specified network architecture: for every initial network, each of its potential connections is established with probability P_I. These nets are trained, evaluated, and sorted according to their fitness values. The fitness values can incorporate any design criteria judged important for the given network application; arbitrary linear combinations of interesting optimization criteria (e.g. successful training, generalization ability, or network size) may be applied. Now ENZO starts to create offspring using crossover and/or mutation.
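A sketch of this initialization and ranking step (ours; the error terms and weighting coefficients are invented placeholders, since the paper deliberately leaves the linear combination to the user):

import numpy as np

rng = np.random.default_rng(1)

def generate_start_population(S, P_I, n_potential):
    """Each of the S initial nets establishes every potential connection
    independently with probability P_I, so it uses about 100*P_I % of
    the connections allowed by the maximal topology."""
    return [rng.random(n_potential) < P_I for _ in range(S)]

def fitness(train_error, generalization_error, genstring,
            a=1.0, b=1.0, c=0.1):
    """Placeholder linear combination of optimization criteria
    (training success, generalization ability, network size);
    lower values are better in this sketch."""
    return (a * train_error + b * generalization_error
            + c * genstring.mean())

pop = generate_start_population(S=25, P_I=0.5, n_potential=25)
# After training each net, the population would be kept sorted by fitness,
# e.g. pop.sort(key=lambda net: fitness(err(net), gen_err(net), net)).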

PROC ENZO-algorithm;
    pop := Generate_Start_Population(S, P_I, maximal topology);
    repeat
        net1 := Selection(pop, ...);
        offspring := copy(net1);
        if (crossover requested) then
            net2 := Selection(pop, ...);
            offspring := Crossover(net1, net2, P_C, F_BP);
        fi
        if (mutation requested) then
            Mutation(offspring, P_M, F_BP);
        fi
        Training(offspring);
        Evaluation(offspring);
        Insertion(offspring, pop);
    until (best element satisfies);

Figure 2: ENZO basic algorithm

With a polynomial bias preferring population individuals with a higher ranking, ENZO selects one or two parent networks. Considering the selected network(s), our network generator creates one offspring. To recombine two parent networks net1 and net2, our crossover operator checks for each potential connection whether it is used by the two parents. If the connection exists in both parents, the emerging offspring receives it, too. If the connection is used by only one parent net, the offspring gets a chance P_C to obtain it. Mutation takes place by changing the state of each potential connection with a given probability P_M. Finally the new network is evaluated and inserted into the population according to its fitness value, removing the population element with the lowest fitness. Thus ENZO does not work in generations, i.e. individuals with high fitness values may live very long. The explanations above describe only the fundamental skeleton underlying our algorithm. In the following section we take a closer look at some of the decisive components embedded in this skeleton.
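The operators just described translate almost literally into code. The sketch below is ours, assuming a genstring is a boolean vector as in the earlier examples; the power-function selection is one plausible reading of the 'polynomial bias', and the column-sum fitness is a stand-in for actual training and evaluation:

import numpy as np

rng = np.random.default_rng(2)

def select(pop_size, bias=2.0):
    """Rank-based selection; rank 0 is the fittest net. Raising a uniform
    number to a power > 1 polynomially prefers high-ranking individuals."""
    return int(pop_size * rng.random() ** bias)

def crossover(net1, net2, P_C):
    """A connection used by both parents is inherited for sure; one used
    by exactly one parent is inherited with probability P_C."""
    both, one = net1 & net2, net1 ^ net2
    return both | (one & (rng.random(net1.size) < P_C))

def mutate(net, P_M):
    """Change the state of each potential connection with probability P_M."""
    return net ^ (rng.random(net.size) < P_M)

# One steady-state step: there are no generations, the offspring simply
# replaces the worst element of a fitness-sorted population.
pop = sorted((rng.random(25) < 0.5 for _ in range(10)),
             key=lambda net: net.sum())          # stand-in fitness
net1, net2 = pop[select(len(pop))], pop[select(len(pop))]
pop[-1] = mutate(crossover(net1, net2, P_C=0.6), P_M=0.01)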

2.2 Crucial mechanisms

The crucial obstacle to efficient evolution-driven network generation is permuted internal representations.

That is, in order to solve the given task two successfully trained networks may extract the same features and use the same internal representations, but distribute these internal representations in a totally different way among their hidden neurons: the contributions of the hidden neurons to the overall solution may be internally permuted. Therefore this problem is also referred to as the phenomenon of different structural mappings coding the same functional mapping. This may cause significant problems for the crossover operator, because two parents successfully solving the given task with almost identical functional mappings may use totally different structural mappings. Applying the crossover operator to such parents will create an offspring with partly doubled and partly missing internal representations; such an offspring is unlikely to achieve good results. We counter the danger of generating poor offspring when recombining such parents by introducing connection-specific distance coefficients. In our network specification we assign to each possible connection C_ij between unit j and unit i an additional attribute L_ij representing the connection's length. These coefficients may be induced, for instance, by an imaginary layout of the network on the 2-dimensional plane. Considering these connection lengths L_ij, our system prevents permuted internal representations by preferring, for each functional mapping, the structural mapping with the smallest total connection length. Every decision concerning the insertion or deletion of a connection C_ij is influenced by its length attribute L_ij.

That is, every possible connection obeys its own specific probabilities P_Mij = f(P_M, L_ij) and P_Cij = f(P_C, L_ij). Making long connections less probable than short ones is both biologically plausible and sensible from a hardware point of view. The second important mechanism introduced is what we call reduced weight transmission. This mechanism justifies our earlier statement that ENZO is a hybrid system: ENZO combines genetic topology optimization with genetic learning. Especially in application-oriented domains with large networks, computation time becomes the main problem. In our algorithm offspring receive from their parents not only topological properties but also knowledge. This is achieved by the following mechanism: if an offspring's connection C_ij is established, its weight W_ij receives a fraction F_BP · W'_ij of the corresponding parent weight W'_ij instead of a random value (F_BP is another user-specified parameter in [0,1]). Therefore a new offspring's learning process does not start at an arbitrary point in weight space, but most likely in the neighbourhood of the expected optimum. This approach not only saves learning time but also decreases the danger of getting stuck in a poor local optimum.
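One possible reading of both mechanisms in code (ours; the paper fixes neither the function f nor the layout, so the exponential decay and all constants below are assumptions):

import numpy as np

rng = np.random.default_rng(3)

def length_scaled_prob(P_base, L, alpha=0.5):
    """P_ij = f(P_base, L_ij): let a connection's insertion/deletion
    probability decay with its length L_ij, so long connections become
    less probable than short ones."""
    return P_base * np.exp(-alpha * (L - L.min()))

def transmit_weights(parent_W, inherited, F_BP):
    """Reduced weight transmission: an inherited connection starts at
    F_BP * parent weight instead of a random value, placing the offspring
    near the parent's optimum in weight space."""
    W = rng.normal(0.0, 0.1, parent_W.shape)   # random init for new genes
    W[inherited] = F_BP * parent_W[inherited]
    return W

L = np.array([1.0, 1.0, 2.0, 4.0])    # lengths from an imaginary 2-D layout
print(length_scaled_prob(0.01, L))    # per-connection mutation probabilities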

3 Experimental results

With experiments on three different applications we were able to validate our algorithm, including its crucial details. The complete results can be found in [19].

(1) Digit recognition

First we examined a digit recognition problem described in [20]. In order to classify seven distorted sets of the trained digits, a (23-12-10-10) topology was used. While in [20] no perfect network could be found (perfect meaning no false classification on the given test sets), ENZO evolved lots of them. As optimization criterion we simply used the number of false classifications on the seven test sets, i.e. 70 distorted digits. To compare ENZO's performance with standard backpropagation we determined the quotient of necessary learning epochs divided by the number of obtained perfect nets. For standard BP we found (examining 6,000 networks):

    (6,000 nets × 220 epochs/net) / 21 perfect nets ≈ 63,000 epochs per perfect net

To get comparable data using our network generator we started 6 runs, each with a population size of 25 networks and computing 1,000 offspring:

    ( 6 × 25 nets × 200 epochs/initial net      [initialization]
    + 6 × 1,000 nets × 110 epochs/offspring )   [evolution]
      / 141 perfect nets ≈ 5,000 epochs per perfect net
Notice that the 141 perfect nets are derived from a total of 150 evolved networks, namely 6 runs with a population size of 25 each. The formulas above also show the success of our concept of weight transmission from ancestors to offspring, evidently reducing the necessary training epochs (from around 200 for initial population elements down to 110 for offspring). The importance of our distance coefficients when using crossover was significantly supported by experiments neglecting this distance criterion. Table 1 shows the relating results (in each case 6 runs creating 300 offspring with a population size S = 25): compared to our normal crossover we obtained only 22 perfect nets instead of 119, reaching a mean population fitness of 1.86 instead of 0.25. The last column of Table 1 indicates that without the connection-specific L_ij more than half of the created offspring could not even be successfully trained, i.e. they did not reach the given training criterion within the allowed 1,000 training epochs. Table 2 contains additional results related to our concept of weight transmission (again, in each case 6 runs creating 300 offspring with a population size S = 25). Different values for the weight reduction factor F_BP show that ENZO can transmit both too much and too little knowledge from parents to offspring. We found that the optimum value for the parameter F_BP depends very much on the complexity of the given application. Without weight transmission, i.e. F_BP = 0, ENZO evolved no perfect network at all.
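Both quotients follow directly from the figures quoted above; a short check in plain arithmetic:

# Standard BP: 6,000 nets at ~220 epochs each, yielding 21 perfect nets.
print(6000 * 220 / 21)                        # ~62857 epochs per perfect net
# ENZO: 6 runs, each with 25 initial nets at ~200 epochs plus 1,000
# offspring at ~110 epochs, yielding 141 perfect nets in total.
print((6 * 25 * 200 + 6 * 1000 * 110) / 141)  # ~4894 epochs per perfect net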


                 mean      perfect   training   successfully
                 fitness   nets      epochs     trained nets
with L_ij        0.25      110        71        100%
without L_ij     1.86       22       600         45%

Table 1: Results with and without connection-specific distance coefficients L_ij (crossover and mutation).

                 mean      perfect   training   inserted
                 fitness   nets      epochs     offspring
F_BP = 0.5       1.91        1        70        27%
F_BP = 0.7       0.75       68       110        31%
F_BP = 0.9       1.24       24        85        25%

Table 2: Results for different weight transmission factors F_BP (just mutation, without crossover).

[Figure 3: ENZO and 'Sokrates' (crossover and mutation). Line plot of mean population fitness (left axis, 0-6) and training epochs per offspring (right axis, 0-60) over the 1,000 generated offspring.]

(2) Kinematics of the truck backer-upper

Our second application was derived from [1]. A (3-20-3) topology was used to emulate the kinematics of a backing up truck within a delimited range of situations. Despite being very easy to learn (about 20 training epochs), this application confirmed the conclusions drawn from the digit recognition task. Most of the experiments with this application were run to find good choices for the newly introduced parameters. Moreover, our results support the frequently published observation that in this domain mutation alone is a powerful mechanism to achieve satisfying performance (cf. [12], [14], or [15]).

(3) Nine Men's Morris

Finally we tested our system with a very hard problem requiring a very large network architecture. The task was to learn a scoring function for the endgame of the two-player game Nine Men's Morris. The so-called 'Sokrates' net introduced in [21] has attained a high level of performance, defeating most human opponents. In order to train the net, the employed (60-30-10-1) topology is doubled, leading to a (120-60-20-2-1) topology. At this point ENZO's capability to consider a priori specifications determined by the user is required, because during training the two subnets have to remain identical, i.e. corresponding weights have to be linked together in order to be identically adjusted.
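One common way to realize such linked weights during gradient training is to tie them through their gradients. The sketch below is ours, not the 'Sokrates' implementation, and the link groups are invented for illustration; it averages the gradients within each group so that tied weights, starting identical, stay identical:

import numpy as np

def apply_links(grads, link_groups):
    """Average the gradients inside each link group, so all weights in a
    group receive the same update and remain identical after each
    training step (assuming they start out identical)."""
    for group in link_groups:
        grads[group] = grads[group].mean()
    return grads

# Toy example: weights 0/3 and 1/4 are corresponding weights of the two
# identical subnets and are therefore linked.
g = np.array([0.2, -0.1, 0.5, 0.4, -0.3])
print(apply_links(g, link_groups=[[0, 3], [1, 4]]))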

With a training set consisting of about 90 pairs of endgame positions, the network is to create a scoring function that generalizes as well as possible to the roughly 60,000 essentially different endgame configurations. We started ENZO with the following parameter values: population size S = 30, initial connection density P_I = 0.7, crossover probability P_C = 0.6, mutation probability P_M = 0.01, relearning factor F_BP = 0.7. We let ENZO generate 1,000 offspring, requiring about one week of computation time on a SUN 4-SLC workstation. Figure 3 shows the results of one run using our crossover and mutation operators. The number of training epochs is reduced from 50 at the start to an almost constant 5 epochs in the evolution phase, as a consequence of weight transmission. In order to compare differently trained 'Sokrates' nets, the authors of [2] introduce the average improvement as a measure of the network's playing performance. Besides being used as fitness criterion, this measure also allows a comparison between our best evolved net and the handcrafted original 'Sokrates' net. Table 3 shows an increase of the average improvement from 0.826 to 1.218, i.e. a performance enhancement of 47%. Additionally, our net used only 43% of its potential connections. We have to emphasize that the data used for evaluating the fitness of a given network architecture within ENZO's evolution phase and the data used for evaluating the evolved network's performance (as shown in Table 3) were totally disjoint, of course.

                      Scoring improvement per move (1,000 moves)             Average       Moves to
                      2 (=opt.)    0     -2    -4    -6    -8   ≤ -10        improvement   draw    loss
original 'Sokrates'     81.7%    5.4%  3.0%  2.4%  1.3%  1.6%    3.6%          0.826       0.5%    0.3%
ENZO                    86.1%    4.1%  2.6%  2.1%  1.3%  1.2%    2.2%          1.218       0.2%    0.4%

Table 3: Generalization ability of the original 'Sokrates' net and our best evolved net.

               Scoring improvement per move (1,000 moves)             Average       Moves to
               2 (=opt.)    0     -2    -4    -6    -8   ≤ -10        improvement   draw    loss
1. training      82.4%    5.0%  3.1%  1.6%  2.8%  1.6%    3.2%          0.939       0.1%    0.4%
2. training      83.2%    5.6%  2.9%  1.8%  2.2%  1.4%    2.6%          1.068       0.1%    0.4%
3. training      85.5%    2.6%  2.4%  2.2%  2.0%  1.8%    3.4%          0.971       0.1%    0.4%
4. training      83.5%    4.2%  2.5%  1.5%  2.2%  1.2%    4.6%          0.778       0.0%    0.3%
5. training      82.6%    5.8%  2.6%  1.6%  1.9%  1.0%    4.5%          0.826       0.1%    0.3%
6. training      84.9%    4.6%  2.2%  1.3%  1.7%  1.5%    3.4%          1.022       0.1%    0.3%
7. training      84.0%    5.0%  2.7%  1.1%  1.8%  1.5%    3.7%          0.922       0.1%    0.2%
8. training      81.7%    6.5%  2.4%  1.4%  2.0%  1.8%    3.9%          0.835       0.1%    0.2%
average          83.5%    4.9%  2.6%  1.5%  2.0%  1.5%    3.7%          0.920       0.1%    0.3%

Table 4: Generalization ability of our best re-trained topology.

In order to test how well suited the evolved topologies were, we reset the weights of our best performing network and re-trained the net eight times. Table 4 shows the performance properties of the eight re-trained networks. On average we get an increase of the average improvement from 0.826 to 0.920, i.e. a performance enhancement of 11%. That means using just the evolved topology (employing only 43% of the connections) distinctly improves the average performance. On the other hand, our mechanism of weight transmission turns out to be crucial, since the re-trained nets are far from reaching the performance of the best evolved net: comparing the best of the eight nets with the handcrafted original 'Sokrates' net, we find an increase from 0.826 to 1.068, i.e. a performance enhancement of 29% instead of 47% for the best evolved net (see Table 3).

This last experiment strongly confirms our expectation that our network generator is able to evolve problem-specific network topologies with evidently improved performance properties.

4 Conclusion and future directions

The essential virtues of our genetic algorithm consist (1) in the new combination of the parental properties when merging the parents' genes (crossover with connection-specific distance coefficients) and (2) in speeding up the learning process by inheriting knowledge from the parents (weight transmission). By solving the problem of permuted internal representations we can propose a successful crossover operator, and by hybridizing genetic topology optimization and genetic learning we can introduce a very efficient network generator, ENZO. For application-oriented problems fast learning is crucial, because learning is the most time-consuming part of the genetic algorithm. In ENZO learning is sped up by more than an order of magnitude using the reduced weight transmission heuristic; in application (3), for example, the learning time shrinks on average to just 9% compared to learning from scratch (random starting weights). Moreover, learning from scratch means gradient descent to an 'average' local minimum, whereas our weight transmission mechanism strongly biases the descent towards a good local minimum.


In application (1), for instance, the probability of generating a perfectly trained net is increased by a factor of 6 during evolution (compared to learning from scratch). Combining both effects, the average time for generating a perfect net is decreased by a factor of 20 in application (1), just due to the reduced weight transmission heuristic. The experiments with our applications allowed us to validate both our basic design decisions and some crucial details. While applications (1) and (2) were mainly employed to find good choices for our newly introduced parameters, application (3) represented a hard touchstone for the overall performance of our algorithm. On the one hand there was a really 'large' network to be managed, and on the other hand the network consisted of two equal sub-networks, leading to specific a priori restrictions on the network specification. Our system evolved networks impressively surpassing our best handcrafted networks by 47% in performance while using only 43% of the connections. These results were achieved by an evolution process that required about a week of computation time (SUN 4-SLC workstation). As already mentioned above, ENZO generates a completely specified neural network. Obviously it is an interesting question whether an appropriate problem-specific topology is evolved. Using just the topology of the best network and training from scratch, we could validate that the generated topologies possess three main advantages:

- smaller network size (e.g. application (3): the generated network's size is about 43% of the handcrafted network's),

- shorter training time (e.g. application (3): on average only 35% of the training epochs needed for the handcrafted network),

- higher generalization capability (e.g. application (3): performance is increased by 29%).

But, of course, networks with the evolved topology but re-trained weights do not reach the performance of the evolved networks, due to the synergetic effect of evolving both topology and connection weights. From our point of view, as applications become more complex, our proposed crossover operator will evidently surpass the pure use of mutation. In addition, the possibility to consider a priori specifications determined by the network designer is surely essential for such real-world applications.

References

[1] D. Nguyen and B. Widrow, The truck backer-upper, in: Proc. Internat. Neural Network Conf. (Kluwer Academic Publishers, Dordrecht, 1990)
[2] H. Braun, J. Feulner, V. Ullrich, Learning strategies for solving the problem of planning using backpropagation, in: Proc. Fourth Intern. Conf. Neural Networks (Nimes, 1991)
[3] R.K. Belew, J. McInerney, N.N. Schraudolph, Evolving networks: using the genetic algorithm with connectionist learning, CSE Technical Report #CS90-174 (University of California, San Diego)
[4] D.J. Chalmers, The evolution of learning: an experiment in genetic connectionism, in: Proc. of the 1990 Connectionist Models Summer School (Morgan Kaufmann, San Mateo, CA, 1990)
[5] D.B. Fogel, L.J. Fogel, V.W. Porto, Evolving neural networks, in: Biological Cybernetics 63 (Springer, Berlin, 1990)
[6] P.J.B. Hancock, L.S. Smith, GANNET: Genetic design of a neural net for face recognition, in: Parallel Problem Solving from Nature (Springer, Berlin, 1990)
[7] S. Harp, T. Samad, A. Guha, Towards the genetic synthesis of neural networks, in: Proc. Third Internat. Conf. Genetic Algorithms (Morgan Kaufmann, San Mateo, CA, 1989)
[8] K.U. Hoeffgen, H.P. Siemon, A. Ultsch, Genetic improvements of feedforward nets for approximating functions, in: Parallel Problem Solving from Nature (Springer, Berlin, 1990)
[9] G. Miller, P. Todd, S. Hedge, Designing neural networks using genetic algorithms, in: Proc. Third Internat. Conf. Genetic Algorithms (Morgan Kaufmann, San Mateo, CA, 1989)
[10] H. Muehlenbein, Limitations of multi-layer perceptron networks - steps towards genetic neural networks, in: Parallel Computing 14 (1990)
[11] W. Schiffmann, M. Joost, R. Werner, Performance evaluation of evolutionary created neural network topologies, in: Parallel Problem Solving from Nature (Springer, Berlin, 1990)
[12] M. Scholz, A learning strategy for neural networks based on a modified evolutionary strategy, in: Parallel Problem Solving from Nature (Springer, Berlin, 1990)

[13] D. Whitley, T. Hanson, Optimizing neural networks using faster, more accurate genetic search, in: Proc. Third Internat. Conf. Genetic Algorithms (Morgan Kaufmann, San Mateo, CA, 1989)
[14] D. Whitley, T. Starkweather, C. Bogart, Genetic algorithms and neural networks: optimizing connections and connectivity, in: Parallel Computing 14 (1990)
[15] D. Whitley, S. Dominic, R. Das, Genetic reinforcement learning with multilayer neural networks, in: Proc. Fourth Internat. Conf. Genetic Algorithms (Morgan Kaufmann, San Mateo, CA, 1991)
[16] J.H. Holland, Adaptation in natural and artificial systems (University of Michigan Press, Ann Arbor, 1975)
[17] H. Braun, Massiv parallele Algorithmen für kombinatorische Optimierungsprobleme und ihre Implementierung auf einem Parallelrechner, Dissertation, TH Karlsruhe, Fakultät für Informatik (1990)
[18] H. Braun, On solving traveling salesman problems by genetic algorithms, in: Parallel Problem Solving from Nature, LNCS 496 (Springer, Berlin, 1991)
[19] J. Weisbrod, Einsatz Genetischer Algorithmen zur Optimierung der Topologie mehrschichtiger Feedforward-Netzwerke, Diplomarbeit, TH Karlsruhe, Fakultät für Informatik (1992)
[20] J. Weisbrod, Untersuchung der Einsatzmöglichkeiten Neuronaler Netze zur visuellen Mustererkennung gestanzter Ziffern, Studienarbeit, TH Karlsruhe, Fakultät für Elektrotechnik (1989)
[21] V. Ullrich, Erlernen von Spielstrategien für Mühle durch Neuronale Netze, Diplomarbeit, TH Karlsruhe, Fakultät für Informatik (1991)

@INPROCEEDINGS{WB93a,
  author    = {Braun, H.\ and Weisbrod, J.},
  title     = {Evolving Neural Feedforward Networks},
  editor    = {R. F. Albrecht and C. R. Reeves and N. C. Steele},
  booktitle = {Proceedings of the International Conference on Neural Networks and Genetic Algorithms (Innsbruck, Austria)},
  publisher = {Springer},
  address   = {Wien and New York},
  year      = {1993}
}