Evolutionary Neural Networks for Nonlinear Dynamics Modeling

I. De Falco, A. Iazzetta, P. Natale and E. Tarantino
Research Institute on Parallel Information Systems (IRSIP)
National Research Council of Italy (CNR)
Via P. Castellino, 111, 80131 Naples, Italy

Abstract. In this paper the evolutionary design of a neural network model for predicting the behavior of nonlinear systems is discussed. In particular, Breeder Genetic Algorithms are considered to provide the optimal set of synaptic weights of the network. The feasibility of the proposed neural model is demonstrated by predicting the Mackey-Glass time series. A comparison with Genetic Algorithms and the Back Propagation learning technique is performed.

Keywords: Time Series Prediction, Artificial Neural Networks, Genetic Algorithms, Breeder Genetic Algorithms.

1 Introduction

Artificial Neural Networks (ANNs) have been widely utilized in many application areas over the years. Nonetheless, their drawback is that the design of an efficient architecture and the choice of the synaptic weights require high processing time. In particular, learning neural network weights can be considered a hard optimization problem for which the learning time scales exponentially, becoming prohibitive as the problem size grows [1]. Many researchers working in the field of Genetic Algorithms (GAs) [2, 3] have tried to use these to optimize neural networks [4, 5]. A typical approach is to use a GA to evolve the optimal topology of an appropriate network and then utilize Back Propagation (BP) [1] to train the weights. BP is the most common method used to find the set of weights. Unfortunately, this method is gradient-based, so it gets stuck in the first local optimum it finds. Evolutionary Algorithms have also been utilized to try to overcome these BP limitations in the training process. The usage of these algorithms may yield several advantages over the BP technique. Firstly, they can be applied even to non-continuous problems, since they do not require gradient information. Secondly, they reduce the chance of getting stuck in local optima. Thirdly, they provide the user with several possible neural network configurations, from which the one most appropriate to his needs can be chosen. The first evolutionary approaches to the learning phase were carried out using GAs. This genetic approach attains good results, but at the cost of an extremely slow evolutionary search process, which makes GAs impractical for large network design [6, 7].

Recently, other evolutionary systems, based on a real representation of the variables and on real-valued operators, have revealed themselves to be effective with respect to GAs for optimizing both ANN structures [8] and ANN weights [9, 10]. In [10], among other problems, time series prediction is considered. This problem represents a sound test for investigating the effectiveness of forecasting techniques. The problems are faced in [10] by using Evolutionary Programming [11]. We have found that for many real-valued problems Breeder Genetic Algorithms (BGAs) [12, 13, 14] outperform GAs. Therefore, our idea is to implement a hybrid system, based both on the ability of ANNs to capture the nonlinearities present in the time series and on the capability of BGAs to search for a solution in the massively multi-modal landscape of the synaptic weight space. Of course this procedure does not guarantee that the optimal configuration will be found; it does, however, allow a configuration close to optimal performance to be attained. We wish to examine the performance of the resulting system with respect to both the classical GA driving ANNs and the BP on its own. A standard benchmark for prediction models, the Mackey-Glass (MG) time series [15], is considered. The results obtained are compared against those achieved by the other above-mentioned techniques in order to establish their degrees of effectiveness. In order to preserve the features of GAs and BGAs we have decided to do nothing to reduce the detrimental effect of the permutation problem [7]. In fact, our aim is to compare the techniques on the test problem, rather than to attempt to achieve the best possible solution.

The paper is organized as follows. In Section 2 a description of the aforementioned benchmark problem is reported. In Section 3 some implementation details are outlined. In Section 4 the experimental results are presented and discussed. Section 5 contains final remarks and prospects of future work.

2 The Mackey-Glass Series

We can attempt to predict the behavior of a time series generated by a chaotic dynamical system as shown in [16]: the time series is transformed into a reconstructed state space using a delay space embedding [17]. In this latter space, each point is a vector x composed of time series values corresponding to a sequence of n delay lags:

x(t), x(t − τ), x(t − 2τ), …, x(t − (n − 1)τ)

(1)

The next step is to assume a functional relationship between the current state x(t) and the future state x(t + P):

x(t + P) = f_P(x(t)).

(2)

The aim is to find a predictor f̂_P which approximates f_P, so that we can predict x(t + P) on the basis of n previous values. If the data are chaotic, f_P is necessarily nonlinear. P represents the number of time steps ahead over which we wish to perform our prediction. Based on Takens' theorem [18], an estimate of the dimension D of the manifold from which the time series originated can be used to construct an ANN model using at least 2D + 1 external inputs [19]. For a noise-free system of dimension D, it is sufficient to choose n = 2D + 1. It is obvious that for a D-dimensional attractor, n must be at least as large as D [16]. The time series used in our experiments is generated by the chaotic Mackey-Glass differential delay equation [15], defined below:

dx/dt = a · x(t − τ) / (1 + x^c(t − τ)) − b · x(t)

(3)

where a, b, c and τ are constants. The value of τ is very important, because it determines the system behavior. In fact, for initial values of x in the time interval [0, τ], the system approaches a stable equilibrium point for τ < 4.53, a limit cycle for 4.53 < τ < 13.3, and, after a series of period doublings for 13.3 ≤ τ ≤ 16.8, the system becomes chaotic. From the value 16.8 on, τ controls how chaotic the series is: the larger τ, the larger the dimensionality of the attractor. For example, D is 2.1 for τ = 17, 2.4 for τ = 23, 3.5 for τ = 30, and 7.5 for τ = 100. For our experiments we have set τ = 17, a = 0.2, b = 0.1, c = 10. As regards P, we have performed a set of experiments with P = 6.
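As an illustration of the setup above, the series and the delay-embedded training pairs can be generated as follows. This is our sketch, not the authors' code: the Euler step size, the constant initial history, and the lag spacing of 6 between the four inputs (following the Lapedes-Farber setup) are assumptions.

```python
import numpy as np

def mackey_glass(n_samples, tau=17, a=0.2, b=0.1, c=10, h=1.0, x0=1.2):
    """Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)**c) - b*x(t).
    Step size h and the constant initial history are assumptions."""
    history = int(tau / h)
    x = [x0] * (history + 1)              # constant history on [0, tau]
    for _ in range(n_samples):
        x_tau = x[-history - 1]           # delayed value x(t - tau)
        x.append(x[-1] + h * (a * x_tau / (1.0 + x_tau ** c) - b * x[-1]))
    return np.array(x[-n_samples:])

def embed(series, n_lags=4, lag=6, horizon=6):
    """Inputs (x(t), x(t-lag), ..., x(t-(n_lags-1)*lag)), target x(t+horizon)."""
    X, y = [], []
    for t in range((n_lags - 1) * lag, len(series) - horizon):
        X.append([series[t - k * lag] for k in range(n_lags)])
        y.append(series[t + horizon])
    return np.array(X), np.array(y)

series = mackey_glass(1500)               # 1200 training + 300 verifying values
X, y = embed(series)
```

Each row of X is then one four-component input vector for the network, and the matching entry of y is the value to be predicted six steps ahead.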

3 The Evolutionary Prediction System

Following the recipe by Lapedes and Farber [20] for the Mackey-Glass series, we have decided to consider a family of MLPs with the following features: the input layer consists of four nodes; there are two hidden layers with ten neurons per hidden layer; the hyperbolic tangent is the activation function in the hidden layers; the output node has a semi-linear activation. Our aim is to provide an automatic procedure to optimize the learning phase of these neural networks for time series prediction. The hybrid system we have designed and implemented consists of a BGA which evolves a population of individuals or chromosomes representing potential candidate solutions. Since the network topology is totally fixed (4-10-10-1), the number of connection weights to be determined is 171 (actually, 150 are the real connections between neurons and 21 are the bias values for the nodes belonging to the two hidden layers and the output layer). We have decided to let these values vary within the range [-0.6, 0.6]. Thus, each individual is constituted by an array of 171 real values. Each of these chromosomes is a "genetic encoding" in which the genotype codes for the different sets of connection weights of each MLP. Such encodings are transformable into the corresponding neural networks

(phenotypes). The evaluation of the phenotypes determines the fitness of the related genotype. The parameter we have chosen to evaluate the goodness of an individual is the normalized mean square error on the training set, E_t. We define E_t as:

E_t = [ (1/(N−1)) Σ_{i=1}^{N} (x_i − o_i)² ]^{1/2} / [ (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)² ]^{1/2}

(4)

where x_i is the i-th input value, o_i is the i-th predicted value, and x̄ is the mean of the input values. Specifically, the fitness function we use is the following:

F(y) = E_t

(5)
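A minimal sketch (ours, not the authors' code) of how a 171-component chromosome can be mapped onto the fixed 4-10-10-1 network and scored with Eqs. (4)-(5). The weight ordering inside the chromosome and the identity output activation (one reading of "semi-linear") are assumptions.

```python
import numpy as np

SHAPES = [(4, 10), (10, 10), (10, 1)]     # fixed 4-10-10-1 topology

def unpack(w):
    """Split a flat 171-vector into weight matrices and bias vectors.
    Ordering (all 150 weights first, then the 21 biases) is an assumption."""
    Ws, bs, i = [], [], 0
    for r, c in SHAPES:
        Ws.append(np.asarray(w[i:i + r * c]).reshape(r, c)); i += r * c
    for _, c in SHAPES:
        bs.append(np.asarray(w[i:i + c])); i += c
    return Ws, bs

def mlp_forward(w, X):
    """tanh in the two hidden layers; output activation taken as identity."""
    Ws, bs = unpack(w)
    h = np.tanh(X @ Ws[0] + bs[0])
    h = np.tanh(h @ Ws[1] + bs[1])
    return (h @ Ws[2] + bs[2]).ravel()

def fitness(w, X, x):
    """F(y) = E_t: RMSE normalized by the standard deviation of the targets."""
    o = mlp_forward(w, X)
    num = np.sqrt(np.sum((x - o) ** 2) / (len(x) - 1))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
    return num / den
```

A perfect predictor gives E_t = 0, while a predictor that always outputs the mean of the targets gives E_t = 1, so values well below 1 indicate that genuine structure has been captured.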

In the evolutionary algorithm the μ genotypes of the population are individually evaluated to determine the genotypes that code for high-fitness phenotypes. A selection process establishes the survivors. Appropriate genetic operators are used to introduce variety into the population and to sample variants of candidate solutions for the next generation. Thus, over several generations, the population gradually evolves toward genotypes that correspond to high-fitness phenotypes. The process is repeated until an adequate network is achieved or a stopping criterion is fulfilled. Let MLP(y_i) with i ∈ {1, …, μ} be the algorithm for training the MLP related to the individual y_i representing a neural network configuration. The general scheme for the BGA is the following:

Given a training set T_s;
Procedure Breeder Genetic Algorithm
begin
  randomly initialize a population of μ sets of connection weights;
  while (termination criterion not fulfilled) do
    transform the genotypes into the corresponding phenotypes y_i;
    train y_i by means of MLP(y_i) on T_s;
    evaluate the trained network y_i;
    save the genotype of the best trained network in the new population;
    select the best T% sets of synaptic weights;
    for i = 1 to μ − 1 do
      randomly select two structures among the T%;
      recombine them so as to obtain one offspring;
      perform mutation on the offspring;
    od
    update variables for termination;
  od
end
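The scheme above can be sketched as follows. To keep the sketch self-contained we substitute a toy quadratic fitness for the inner MLP training step, and we use extended intermediate recombination as a placeholder operator; both substitutions, along with the generation count, are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
MU, T, GENS, DIM = 100, 0.20, 50, 171     # population, truncation rate, dims

def fitness(w):
    """Toy stand-in for E_t of the trained network (lower is better)."""
    return float(np.sum(w ** 2))

def recombine(p1, p2, d=0.25):
    """Extended intermediate recombination (placeholder operator)."""
    alpha = rng.uniform(-d, 1.0 + d, size=DIM)
    return p1 + alpha * (p2 - p1)

pop = rng.uniform(-0.6, 0.6, size=(MU, DIM))
initial_best = min(fitness(ind) for ind in pop)
for gen in range(GENS):
    order = np.argsort([fitness(ind) for ind in pop])
    elite = pop[order[0]].copy()                  # best genotype survives
    parents = pop[order[: int(T * MU)]]           # truncation selection, T = 20%
    children = [recombine(*parents[rng.choice(len(parents), 2, replace=False)])
                for _ in range(MU - 1)]
    pop = np.vstack([elite] + children)
final_best = min(fitness(ind) for ind in pop)
```

The elitist copy of the best genotype mirrors the "save the genotype of the best trained network" step, so the best fitness can never worsen from one generation to the next.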

4 Experimental results

The data of the time series have been divided into two sets: the training set T_s, used to train the prediction system, and the verifying set V_s, used to evaluate the performance of the trained system. The former consists of 1200 samples, while the latter is made of 300 values. The population size for all the experiments has been set to 100. Preliminary runs have been made to determine the best operators for the BGA. As far as the recombination operator is concerned, we have taken into account the Discrete Recombination (DR), the Extended Intermediate Recombination (EIR), the Extended Line Recombination (ELR), the Fuzzy Recombination (FR) [21] and the BGA Line Recombination (BGALR) [22]. This latter operator works as follows: let x = (x_1, …, x_n) and y = (y_1, …, y_n) be the parent strings, with x being the one with better fitness; then the offspring z = (z_1, …, z_n) is computed by

z_i = x_i ± range_i · 2^(−kα) · (y_i − x_i) / ‖x − y‖,   α ∈ [0, 1]
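The BGALR formula above might be implemented as follows; drawing the sign and α independently for each component is our reading of the operator in [22], so treat this as a hedged sketch rather than the definitive form.

```python
import numpy as np

rng = np.random.default_rng(1)

def bgalr(x, y, k=16, range_i=0.75):
    """BGA Line Recombination: x is the fitter parent. Each offspring
    component z_i = x_i +/- range_i * 2**(-k*alpha) * (y_i - x_i)/||x - y||,
    with sign and alpha drawn anew per component (our assumption)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    norm = np.linalg.norm(x - y)
    alpha = rng.uniform(0.0, 1.0, size=x.size)
    sign = rng.choice([-1.0, 1.0], size=x.size)
    return x + sign * range_i * 2.0 ** (-k * alpha) * (y - x) / norm
```

Since 2^(−kα) ≤ 1, every component of the offspring stays within range_i · |y_i − x_i| / ‖x − y‖ of the fitter parent, which is why the operator mixes recombination with a mutation-like perturbation.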

As regards ELR, EIR and FR, on the basis of their definitions in [21], several values for the typical parameter d have been used, the best one resulting in all cases d = 0.5. For the mutation operator, the Discrete Mutation (DM) and the Continuous Mutation (CM) have been investigated, with different values for their parameters k and range_i. The best values have turned out to be k = 16 and range_i = 0.75. The DR and the ELR have turned out to be the worst operators. DR yields a very fast decrease in the best fitness during the first generations, but it loses power as the number of generations increases. BGALR and EIR have been very close as regards performance, the former being slightly better. Furthermore, BGALR has achieved runs with average values much closer to the best values than EIR. So, the BGALR has been chosen as the recombination operator. Due to its structure, it does not require the application of a further mutation operator. In Table 1 we report the best final values and the average final values achieved by using the different recombination operators. The latter values are averaged over 10 runs. The mutation operator for all of the runs has been the CM (where needed, of course).
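For concreteness, the Continuous Mutation with the parameters above can be sketched like this. The step form ±range_i · 2^(−kα) and the per-gene mutation probability follow one common formulation of the BGA mutation scheme [12]; the paper does not spell out its exact variant, so this is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def continuous_mutation(w, k=16, range_i=0.75, p=None):
    """Mutate each gene with probability p (default 1/n) by a step
    +/- range_i * 2**(-k*alpha), alpha uniform in [0, 1] (assumed form)."""
    w = np.asarray(w, float).copy()
    n = w.size
    mask = rng.random(n) < (1.0 / n if p is None else p)
    alpha = rng.uniform(0.0, 1.0, size=n)
    sign = rng.choice([-1.0, 1.0], size=n)
    w[mask] += (sign * range_i * 2.0 ** (-k * alpha))[mask]
    return w
```

With k = 16 most steps are tiny (2^(−16α) decays very quickly in α), so the operator mainly performs fine local adjustments while occasionally making a jump of up to range_i.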

Table 1. The best final (best) and the average final (av.ge) values obtained (averaged over 10 runs)

          best     av.ge
  DR      0.4042   0.4249
  ELR     0.4591   0.4656
  FR      0.3768   0.3849
  EIR     0.3165   0.3412
  BGALR   0.2617   0.3136

Some trial runs have also been made to find a good value for the truncation

Fig. 1. The results of the BGA on the training set (a) and on the verifying set (b)


Fig. 2. The results of the GA on the training set (a) and on the verifying set (b)

rate, resulting in T = 20%. The maximum number of generations allowed is 1000. In all runs, however, most of the evolution takes place during the first 500-600 generations. All of the runs performed with BGALR show quite similar behavior. In Fig. 1 we report the results obtained in the best run. In this case we have obtained an error of 0.2617 on the training set (E_t) and of 0.2832 on the verifying set (E_v). Fig. 1(a) shows the target curve and the obtained one for a slice of the training set, namely the slice [800-900], for clarity's sake. Fig. 1(b) reports the target and the output for the verifying set, in the slice [1200-1300]. It is interesting to compare these results against those we achieved when using binary GAs to drive the evolution. In this case we have decided to employ one-point crossover with probability P_c = 0.8 and bit-flip mutation with probability P_m = 0.005. Each variable has been encoded with 10 bits. The population size is 100, and the maximum number of generations is 1000. Several runs with different selection methods have been performed, truncation with threshold T = 20% turning out to be the best. The best final value over 10 runs has been 0.4878 for the training set, resulting in E_v = 0.4880. Fig. 2 reports the results on the training set (a) and on the verifying set (b). We also report the results achieved by the BP on the problem under account.
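With 10 bits per variable, each binary chromosome is 1710 bits long. A plausible decoding into the weight range [-0.6, 0.6] is linear scaling of each gene's integer value; the paper does not specify its exact encoding, so this is an assumption for illustration.

```python
import numpy as np

BITS, LO, HI = 10, -0.6, 0.6
N_WEIGHTS = 171

def decode(chromosome):
    """Map a (171*10)-bit string to 171 real weights by linear scaling:
    gene value 0 -> LO, gene value 2**BITS - 1 -> HI (assumed encoding)."""
    genes = np.asarray(chromosome).reshape(N_WEIGHTS, BITS)
    ints = genes @ (2 ** np.arange(BITS - 1, -1, -1))   # binary -> integer
    return LO + (HI - LO) * ints / (2 ** BITS - 1)
```

The 1710-bit chromosome length also illustrates why the binary GA searches a much coarser and larger space than the real-coded BGA operating directly on 171 reals.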


Fig. 3. The results of the BP on the training set (a) and on the verifying set (b)

We have used a learning rate of 0.5 and a momentum of 0.05; the latter value has also been used throughout all the previously described experiments. We have used both a simple BP (SBP), which stops in the first local optimum it meets, and a more sophisticated one (BP), provided with a mechanism allowing it to escape local optima. Namely, training is performed and the final weight set is saved. Then, learning is repeated and, if a new weight set leading to a decrease of E_t is found, it is saved. This mechanism is iterated. The saved weight set is restored every time the search leads to solutions which are worse than the saved one. When the new solution is worse than the saved one by more than 20%, the learning ends. We have carried out several experiments with different numbers of training epochs, 400 turning out to be the most suitable for both techniques. Fig. 3 reports the results on the training set (a) and on the verifying set (b) for the best BP run. This run has E_t = 0.1049 and E_v = 0.1468. Table 2 summarizes the best final results and the average final results with the different techniques.

Table 2. E_t and E_v for the best runs with the different techniques

         E_t      E_v
  GA     0.4878   0.4880
  BGA    0.2617   0.2832
  SBP    0.1194   0.1573
  BP     0.1049   0.1468
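The save/restore mechanism of the "cycling" BP described above can be sketched as follows; train_once is a hypothetical stochastic training routine (e.g. 400 epochs of BP with shuffled pattern presentations), not an API from the paper, and the demo trainer below is a scripted stand-in.

```python
def iterated_bp(train_once, w0, max_rounds=50):
    """Repeat training from the best saved weights; keep a run's result only
    if it lowers E_t, and stop when a run is worse by more than 20%."""
    best_w, best_err = train_once(w0)
    for _ in range(max_rounds):
        w, err = train_once(best_w)       # retrain from the saved weight set
        if err < best_err:
            best_w, best_err = w, err     # save the improved weight set
        elif err > 1.2 * best_err:        # worse by more than 20%: give up
            break
        # otherwise discard w (i.e. restore the saved weights) and retry
    return best_w, best_err

# Demo with a fake trainer returning a scripted error sequence (assumption):
seq = [1.0, 0.8, 0.5, 0.7, 2.0]
state = {"i": 0}
def fake_train(w):
    err = seq[min(state["i"], len(seq) - 1)]
    state["i"] += 1
    return w, err

best_w, best_err = iterated_bp(fake_train, 0.0)
```

In the scripted demo the third run's error (0.7) is more than 20% worse than the saved 0.5, so the loop stops and returns the saved weights.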

In Fig. 4 the evolution of the three techniques employed is shown. Each training phase of the BP is plotted as one generation, to keep the figure congruent. As can be seen, the performance of the BGA is always better than that of the GA for the same number of generations. The MLP trained with the cycling BP technique is the best during the whole evolution.

Fig. 4. The evolution of the different techniques utilized (best E_t versus generation, 0-1000, for the BGA, the GA and the BP network)

5 Conclusions and Future Work

In this paper an evolutionary approach for automating the design of a neural network model for time series prediction has been investigated. BGAs have been used to face the optimization of the search in the large space of connection weights. The approach has been tested on a standard benchmark for prediction techniques, the Mackey-Glass series. We have focused our attention on the prediction for t + 6. The experimental results have proved the effectiveness, with respect to GAs, of the proposed BGA for the Mackey-Glass time series prediction. In particular, the BGALR recombination operator has turned out to be the most suitable for the problem at hand. We are trying to formulate some hypotheses on why this happens. It has been reported by Whitley [7] that the problem of training an MLP may be an application that is inherently not a good match for GAs relying heavily on recombination. Some researchers do not use recombination at all, while others have used small populations and high mutation rates in conjunction with recombination. The basic feature of the BGALR operator is that, in order to create each new component of an offspring, it performs not only recombination but also some kind of mutation. This might be seen as an extremely high mutation probability, which seems to be in accordance with Whitley's observation.

Though the automatic procedure provided here has achieved a network architecture with good performance, the results of the BGAs are worse than those obtained with the BP techniques. However, it should be pointed out that we have considered naive implementations of the GAs and the BGAs in order to compare them. More sophisticated evolutionary algorithms, which do not use recombination operators [10], attain performance comparable with that achieved by BP techniques. Future work will focus both on the investigation of the relative importance of recombination and mutation and on the search for an appropriate fitness function. In fact, the dependence of the results on the specific fitness function chosen is not clear. In this paper we have evaluated individuals by the classical error on the training set, typical of MLPs. In [23] it is suggested to take into account the fact that the search performed by a GA is holistic, and not local as is usually the case when perceptrons are trained by traditional methods. We aim to find new fitness functions which can take advantage of this idea. Of course, in order to ascertain the effectiveness of the evolutionary approach in facing a learning process, different time series prediction problems have to be analyzed. Furthermore, since the BGA method can easily be distributed among several processors, we intend to implement a parallel version of the proposed evolutionary approach, with the aim of improving the performance and the quality of the neural networks while decreasing the search time.

References

1. D. E. Rumelhart, J. L. McClelland, Parallel Distributed Processing, I-II, MIT Press, 1986.
2. J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
3. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, Massachusetts, 1989.
4. J. D. Schaffer, D. Whitley and L. J. Eshelman, Combinations of Genetic Algorithms and Neural Networks: A Survey of the State of the Art, in Combinations of Genetic Algorithms and Neural Networks, J. D. Schaffer, L. D. Whitley eds., pp. 1-37, 1992.
5. X. Yao, A Review of Evolutionary Artificial Networks, Int. J. Intelligent Systems, 8 (4), pp. 539-567, 1993.
6. D. J. Montana and L. Davis, Training Feedforward Neural Networks using Genetic Algorithms, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 762-767, 1989.
7. D. Whitley, Genetic Algorithms and Neural Networks, in Genetic Algorithms in Engineering and Computer Science, J. Periaux, M. Galan and P. Cuesta eds., John Wiley, pp. 203-216, 1995.
8. I. De Falco, A. Della Cioppa, P. Natale and E. Tarantino, Artificial Neural Networks Optimization by means of Evolutionary Algorithms, in Soft Computing in Engineering Design and Manufacturing, Springer-Verlag, London, 1997.
9. A. Imada and K. Araki, Evolution of a Hopfield Associative Memory by the Breeder Genetic Algorithm, Proceedings of the Seventh International Conference on Genetic Algorithms, Morgan Kaufmann, pp. 784-791, 1997.
10. X. Yao and Y. Liu, A New Evolutionary System for Evolving Artificial Neural Networks, IEEE Trans. on Neural Networks, 8 (3), pp. 694-713, 1997.
11. D. B. Fogel, Evolutionary Computation: Towards a New Philosophy of Machine Intelligence, IEEE Press, New York, 1995.
12. H. Mühlenbein and D. Schlierkamp-Voosen, Analysis of Selection, Mutation and Recombination in Genetic Algorithms, Neural Network World, 3, pp. 907-933, 1993.
13. H. Mühlenbein and D. Schlierkamp-Voosen, Predictive Models for the Breeder Genetic Algorithm I. Continuous Parameter Optimization, Evolutionary Computation, 1 (1), pp. 25-49, 1993.
14. H. Mühlenbein and D. Schlierkamp-Voosen, The Science of Breeding and its Application to the Breeder Genetic Algorithm, Evolutionary Computation, 1, pp. 335-360, 1994.
15. M. Mackey and L. Glass, Oscillation and Chaos in Physiological Control Systems, Science, 197, pp. 287-289, 1977.
16. D. Farmer and J. Sidorowich, Predicting Chaotic Time Series, Physical Review Letters, 59, pp. 845-848, 1987.
17. N. H. Packard, J. P. Crutchfield, J. D. Farmer and R. S. Shaw, Geometry from a Time Series, Physical Review Letters, 45, pp. 712-716, 1980.
18. F. Takens, Detecting Strange Attractors in Turbulence, in Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1981.
19. M. Casdagli, S. Eubank, J. D. Farmer and J. Gibson, State Space Reconstruction in the Presence of Noise, Physica D, 51, pp. 52-98, 1991.
20. A. Lapedes and R. Farber, Nonlinear Signal Processing using Neural Networks: Prediction and System Modeling, Los Alamos National Laboratory Technical Report LA-UR-87-2662, 1987.
21. H. M. Voigt, H. Mühlenbein and D. Cvetkovic, Fuzzy Recombination for the Continuous Breeder Genetic Algorithm, Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann, 1995.
22. D. Schlierkamp-Voosen and H. Mühlenbein, Strategy Adaptation by Competing Subpopulations, Proceedings of Parallel Problem Solving from Nature (PPSN III), Morgan Kaufmann, pp. 199-208, 1994.
23. P. G. Korning, Training of Neural Networks by means of Genetic Algorithm Working on Very Long Chromosomes, Technical Report, Computer Science Department, Aarhus, Denmark, 1994.