Automatic Design of Modular Neural Networks Using Genetic Programming

Naser NourAshrafoddin, Ali R. Vahdat, and M.M. Ebadzadeh

Amirkabir University of Technology (Tehran Polytechnic)

Abstract. The traditional trial-and-error approach to designing neural networks is time consuming and does not guarantee that the best feasible neural network for a specific application will be found; automatic approaches have therefore gained importance and popularity. In addition, traditional (non-modular) neural networks cannot solve complex problems well, since such problems introduce a wide range of overlap which, in turn, causes a wide range of deviations from efficient learning in different regions of the input space, whereas a modular neural network attempts to reduce the effect of these problems via a divide-and-conquer approach. In this paper we introduce a different approach to the autonomous design of modular neural networks: genetic programming is used to design their architectures, transfer functions, and connection weights automatically. Our approach offers important advantages over existing methods for automated neural network design. First, it prefers smaller modules to bigger modules; second, it allows neurons, even in the same layer, to use different transfer functions; and third, it is not necessary to convert each individual into a neural network to obtain its fitness value during the evolution process. Several tests were performed on problems based on some of the most popular test databases. The results show that using genetic programming for the automatic design of neural networks is an efficient method and is comparable with existing techniques.

Keywords: Modular neural networks, evolutionary computing, genetic programming, automatic design.

1 Introduction

The human brain is the most elaborate information processing system known. Researchers believe that it derives most of its processing power from its huge number of neurons and connections. It is estimated that the human brain contains about 10^11 neurons, each connected to an average of 10^4 other neurons, amounting to a total of about 10^15 connections. This sophisticated biological system teaches us some important principles regarding the design of NNs:

• Despite its massive connectivity it is relatively small and highly organized [1].
• Subsequent layers of neurons, arranged in a hierarchical fashion, form increasingly complex representations.


• Structuring of the brain into distinct streams or pathways allows for the independent processing of different information types and modalities [1].
• Modules containing little more than a hundred cells, also known as minicolumns, are the basic functional modular units of the cerebral cortex [2].

We propose to capture these global and local structural regularities by modeling the brain in a framework of modular neural networks (MNNs) [3]. Rather than starting from a fully connected network and then letting the desired functions emerge through prolonged learning, we advocate an approach that proceeds from a structured modular network architecture. NNs designed in this way are also scalable: expansion modules can be added, or redundant modules removed, as necessary. By choosing the right modular structure, networks with a wide range of desirable characteristics can be developed, as outlined in Section 3.

Non-modular NNs tend to suffer from high internal interference because of the strong coupling among their hidden-layer weights, i.e. all neurons try to affect the overall output of the NN. Moreover, complex tasks tend to introduce a wide range of overlap which, in turn, causes a wide range of deviations from efficient learning in different regions of the input space. A modular NN attempts to reduce the effect of these problems via a divide-and-conquer approach: it generally decomposes the large, highly complex task into several sub-tasks, each handled by a simple, fast and efficient module, and then integrates the sub-solutions via a multi-module decision-making strategy. Hence, modular NNs have generally proved to be more efficient than non-modular alternatives [4].

The problems of manually designing NNs have led researchers to invent techniques for automating the design process. Several automatic approaches have been proposed for this task; a recent group of them is based on evolutionary algorithms (EAs) [5], [6], notably the genetic algorithm (GA) and evolutionary programming (EP). In GAs, both the structure and the parameters of a network are represented as a (usually) fixed-length string, and a population of such strings (genotypes) is evolved, mainly by recombination. One problem with this approach is the size of the required genotype bit string. EP, on the other hand, evolves by mutation only, operating directly on the network components; it has been suggested that this approach is much better suited to network design than GAs [5].

Another problem faced by evolutionary training of NNs is the permutation problem, also known as the competing conventions problem. It is caused by the many-to-one mapping from the representation (genotype) to the actual NN (phenotype): two NNs that order their hidden nodes differently in their chromosomes are still functionally equivalent. For example, the NNs shown in Fig. 1 are functionally equivalent, but they have different chromosomes. In general, any permutation of the hidden nodes produces functionally equivalent NNs with different chromosome representations. The permutation problem makes the crossover operator very inefficient and ineffective at producing good offspring, which is why EP works better than GA for evolving NNs.
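To make the permutation problem concrete, the following small Python sketch (purely illustrative; the layer sizes and weight values are our own assumptions, not taken from Fig. 1) builds a single-hidden-layer network, swaps its two hidden units, and checks that the computed function is unchanged while a flat genotype encoding of the weights is not:

import numpy as np

def net_output(x, w_hidden, w_out):
    """Single-hidden-layer net: tanh hidden units, linear output unit."""
    return np.tanh(w_hidden @ x) @ w_out

# Illustrative weights: 2 inputs, 2 hidden units (rows of w_hidden).
w_hidden = np.array([[0.5, -1.2],
                     [2.0,  0.3]])
w_out = np.array([1.0, -0.7])

# Swap the two hidden units: permute the rows of w_hidden and the entries of w_out.
perm = [1, 0]
w_hidden_p, w_out_p = w_hidden[perm], w_out[perm]

x = np.array([0.8, -0.4])
same_function = np.allclose(net_output(x, w_hidden, w_out),
                            net_output(x, w_hidden_p, w_out_p))
same_genotype = np.array_equal(np.concatenate([w_hidden.ravel(), w_out]),
                               np.concatenate([w_hidden_p.ravel(), w_out_p]))
print(same_function, same_genotype)   # True False

The two genotypes compete for the same phenotype, so crossing them over tends to produce poor offspring.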


Fig. 1. Permutation problem. Two equivalent NNs and their corresponding non-equivalent binary representations.

The problems mentioned above are due to the representations and codings used, not to the EAs themselves; thus, with suitable coding schemes, it is possible to develop non-deceptive EA-based NN design algorithms. A well-known and efficient approach that avoids these problems is genetic programming (GP). GP [7] is an automatic, domain-independent method for solving problems. Starting from many randomly created solutions, it applies the Darwinian principle of natural selection together with recombination (crossover), mutation, gene duplication, gene deletion, gene insertion and certain mechanisms of developmental biology, and thus breeds an improved population over many generations. GP starts from a high-level statement of a problem's requirements and attempts to produce a solution that solves the problem.

The remainder of the paper is organized as follows: Section 2 reviews previous attempts at automatic design of NNs using various implementations of GP. Section 3 describes our approach in detail, together with some important implementation points. Section 4 presents experiments and their results, and Section 5 concludes the paper.

2 Related Works

Following Koza's introduction of the genetic programming approach [7], [8], the use of genetic programming for the automatic design of NNs was, perhaps, first introduced in [9]. Koza and Rice show how to find both the weights and the architecture of a NN using a population of LISP symbolic expressions (S-expressions) of varying size and shape. They use the function set F = {P, W, +, –, *, %} and the terminal set T = {R, D0, D1, …, Dn} to produce trees and their corresponding S-expressions. The function P denotes a neuron with a linear threshold transfer function, and the function W is the weighting function used to give a weight to a signal going into a P function. +, – and * are ordinary arithmetic operators, and % is the protected division operator, which returns zero in the case of a division by zero. R is a random floating-point constant and D0, D1, …, Dn are the network input signals. A sample S-expression from their work that performs odd parity (exclusive-or) perfectly is shown below:

(P (W (+ 1.1 0.741) (P (W 1.66 D0) (W –1.387 D1))) (W (+ 1.2 1.584) (P (W 1.19 D1) (W –0.989 D0))))


This S-expression can be represented graphically as the tree in Fig. 2(a), and converted further into the form one would typically see in the NN literature, as shown in Fig. 2(b).

Fig. 2. (a) The tree of a simple NN corresponding to the above S-expression and (b) its corresponding NN [9].
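As a concrete check of this example, the following Python sketch gives one possible reading of Koza and Rice's primitives (our interpretation, not their code): W multiplies a weight by an incoming signal, and P is a linear threshold unit, here assumed to fire when its weighted input sum exceeds 1.

def W(weight, signal):
    # Weighting function: scale a signal entering a processing unit.
    return weight * signal

def P(*weighted_inputs):
    # Linear threshold unit; the firing threshold of 1.0 is an assumption.
    return 1 if sum(weighted_inputs) > 1.0 else 0

def parity_net(D0, D1):
    # Direct transcription of the S-expression quoted above.
    return P(W(1.1 + 0.741, P(W(1.66, D0), W(-1.387, D1))),
             W(1.2 + 1.584, P(W(1.19, D1), W(-0.989, D0))))

for D0 in (0, 1):
    for D1 in (0, 1):
        print(D0, D1, '->', parity_net(D0, D1))   # reproduces the XOR truth table

With these definitions the expression reproduces the exclusive-or truth table, matching the behaviour claimed in [9].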

Gruau [10] developed a compact cellular growth (constructive) algorithm, called cellular encoding, based on symbolic S-expressions that are evolved by genetic programming. A similar scheme, edge encoding by Luke and Spector [11], grows network graphs using edges instead of nodes. While the cellular encoding approach evaluates each grammar-tree node in a breadth-first manner, thus ensuring parallel execution, edge encoding is designed to work depth first. Cellular encoding tends to create graphs with a large number of connections, which must then be pruned by the corresponding operators; edge encoding, with its depth-first approach, favors graphs with fewer connections.

Ritchie et al. [12] implemented a GP-based algorithm for optimizing NN architecture (GPNN) and compared its ability to model and detect gene-gene interactions with that of a traditional back-propagation NN (BPNN). The rules used in their implementation are consistent with those described by Koza and Rice [9]. Using simulated data, they demonstrated that GPNN was able to model nonlinear interactions as well as a traditional BPNN and also achieved better predictive ability.

3 Proposed Method

Here we develop a GP-optimized NN design approach in an attempt to improve upon the trial-and-error process of choosing an optimal architecture for a purely feed-forward modular NN. The method selects the more important inputs from a larger pool of variables and optimizes the weights and the connectivity of the network, including the number of hidden layers and the number of nodes in each hidden layer. The algorithm thus attempts to generate an appropriate modular NN for a given dataset.


3.1 Function and Terminal Sets

We use different transfer functions for different neurons: linear (purelin), hard limit (hardlim), symmetric hard limit (hardlims), log-sigmoid (logsig), and tan-sigmoid (tansig). Utilizing these transfer functions means that there is no necessity to use the same transfer function for different neurons of a NN, even within the same layer.

The arithmetic functions used are plus, minus and times, which are simple arithmetic operators. For division we use a protected function (pdivide), which returns the dividend when the divisor is zero. ppower returns the same result as the power function unless the result would be a complex value, in which case it returns zero. psqrt returns the square root of its argument, or zero if the argument is less than zero. exp returns the exponential of its argument, and plog returns the logarithm of the absolute value of its argument, or zero if the argument is zero. Furthermore, min and max operators are added, which return the minimum and the maximum of their arguments, respectively. The function set used in this work is therefore F = {hardlim, hardlims, purelin, logsig, tansig, plus, minus, times, pdivide, ppower, psqrt, exp, plog, min, max}.

We use a random floating-point constant generator together with the n input signals (one per dataset feature) to form our terminal set. The terminal set is therefore T = {R, D0, …, Dn}, where Di is the ith feature of the dataset and n is the number of dataset features. There is no need to declare the number of input signals to the algorithm beforehand; instead, the algorithm recognizes the number of input signals (dataset features) automatically and adds them to the terminal set.
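The transfer functions and protected operators described above can be sketched in Python as follows (a minimal re-implementation based only on the textual descriptions; the authors' actual MATLAB functions may differ in detail):

import math

def hardlim(n):  return 1.0 if n >= 0 else 0.0      # hard limit
def hardlims(n): return 1.0 if n >= 0 else -1.0     # symmetric hard limit
def purelin(n):  return n                           # linear
def logsig(n):   return 1.0 / (1.0 + math.exp(-n))  # log-sigmoid
def tansig(n):   return math.tanh(n)                # tan-sigmoid

def plus(a, b):  return a + b
def minus(a, b): return a - b
def times(a, b): return a * b

def pdivide(a, b):
    # Protected division: return the dividend when the divisor is zero.
    return a if b == 0 else a / b

def ppower(a, b):
    # Protected power: return 0 when the result would be complex, e.g. (-2) ** 0.5.
    try:
        r = a ** b
    except (OverflowError, ZeroDivisionError):
        return 0.0
    return 0.0 if isinstance(r, complex) else r

def psqrt(a):
    # Protected square root: return 0 for negative arguments.
    return math.sqrt(a) if a >= 0 else 0.0

def plog(a):
    # Protected log: logarithm of the absolute value, 0 if the argument is 0.
    return 0.0 if a == 0 else math.log(abs(a))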

Fig. 3. A sample tree individual and its equivalent neural network


Once a terminal R is used in a tree, a random constant is substituted for R, and this constant does not change for the rest of the run. A sample individual and its corresponding NN are shown in Fig. 3.

To further simplify the evaluation of each tree as a NN, in our implementation each NN is a MATLAB executable string. A codification of the tree in Fig. 3 is:

HardLim(Plus(Plus(Times(LogSig(Plus(Plus(Times(x1, 0.46), Times(x2, 0.22)), 0.01)), 0.5), Times(x3, 0.72)), 0.4))

Since each tree must represent a NN, not all combinations of functions and terminals can be used, and there are rules for the creation and manipulation of trees in this work; e.g. the root node of a tree must be a transfer function.

To avoid the bloat problem (excessive tree growth without a corresponding improvement in fitness) we continuously supervise the depth and size of the trees, where depth denotes the number of levels of a tree and size denotes its number of nodes. We therefore place restrictions on the depth and size of the trees; the maximum allowed depth of each tree is limited to 15 levels.

The initial population is generated with the ramped half-and-half method. Standard tree crossover and mutation are used as diversity operators. For tree crossover, random nodes are chosen from both parent trees and the respective branches are swapped, creating two new offspring. For tree mutation, a random node is chosen from the parent tree and substituted by a new random tree built from the available terminals and functions; this new random tree is created using the grow method. For parent selection we use the lexicographic selection method, which is very similar to tournament selection; the main difference is that if two individuals are equally fit, the shorter one (the tree with fewer nodes) is chosen as the better. For survivor selection we use total elitism: the best individuals from parents and children together fill the new population.
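A minimal sketch of this parent-selection rule (our interpretation; the tournament size and the fitness/size accessors are illustrative assumptions):

import random

def lexicographic_tournament(population, fitness, size, k):
    """Tournament selection with a parsimony tie-break: among k random
    competitors, return the one with the lowest fitness (error); if two
    fitnesses are equal, the individual with fewer nodes wins."""
    competitors = random.sample(population, k)
    return min(competitors, key=lambda ind: (fitness(ind), size(ind)))

With 10% of a 150-individual population as competitors (the setting used in Section 4), k would be 15.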

3.2 Fitness (Evaluation) Function

A fast and easily computed criterion for evaluating the performance of a NN on a classification problem is its classification error. The evolutionary process demands the assignment of a fitness value to each genotype; this value is the result of evaluating the NN on the pattern set representing the problem. Since each individual corresponds to a NN, at first glance it seems that every individual must be converted into a NN (with its weights already set, so it does not need to be trained) in order to evaluate its fitness. The interesting point in our work is that there is no need to convert each tree into a NN: thanks to our string-based implementation, each individual is represented as a MATLAB executable string which can be evaluated directly, so upon building a tree and applying the inputs to it, the fitness value of the corresponding NN is immediately available. The fitness function is the sum of the absolute differences between the expected output y_i and the actual output o_i of each individual NN, where N is the number of input samples:

Fitness = \sum_{i=1}^{N} | y_i - o_i |    (1)

Calculating fitness may be a time-consuming task, and during the evolutionary process the same tree is certain to be evaluated more than once. To avoid this, evaluations are kept in memory so that their results can be reused if they are needed again.
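Putting the string-based evaluation and the evaluation cache together, a hedged Python analogue might look like the sketch below (the paper evaluates MATLAB strings; here a Python expression string and a dictionary of the functions sketched in Section 3.1 stand in for them, and the cache assumes a fixed training set):

_eval_cache = {}   # expression string -> outputs on the (fixed) training set

def outputs_of(expr, samples, functions):
    """Evaluate an executable expression string on every training sample."""
    if expr not in _eval_cache:
        _eval_cache[expr] = [eval(expr, functions, sample) for sample in samples]
    return _eval_cache[expr]

def fitness(expr, samples, targets, functions):
    # Eq. (1): sum of absolute differences between expected and produced outputs.
    return sum(abs(y - o)
               for y, o in zip(targets, outputs_of(expr, samples, functions)))

# Usage sketch with the codified tree of Fig. 3 (feature names x1..x3 assumed):
# expr = "hardlim(plus(plus(times(logsig(plus(plus(times(x1, 0.46), "
#        "times(x2, 0.22)), 0.01)), 0.5), times(x3, 0.72)), 0.4))"
# samples = [{"x1": 5.1, "x2": 3.5, "x3": 1.4}, ...]; targets = [1, ...]
# functions = {"hardlim": hardlim, "logsig": logsig, "plus": plus, "times": times}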

4 Experimental Results

To show the effectiveness of the proposed method we conducted several experiments on some of the most popular knowledge-extraction datasets of differing complexity from the UCI Machine Learning Repository. Briefly, the goal is to classify a number of samples into a number of classes, i.e. to predict one attribute of the database given some other attributes. The first dataset is the iris dataset; it contains 150 samples in 3 classes, each sample having 4 features. The second is the wine dataset, which contains 178 samples in 3 classes, each with 13 features. The third is the glass dataset, with 214 samples in 6 classes, each with 10 features. We used 70% of the samples for fitness evaluation (training error) of the individuals and the remaining 30% for testing (generalization error). Below we give the setup for two experiments and the results obtained.

In the first experiment, performed with the proposed GP approach, we try to produce a single tree (NN) for each dataset that is able to classify all samples of its respective dataset; the expected output of this NN is therefore the class number of the sample to be classified. The parameters of the algorithm, tuned over a number of preliminary runs, are as follows:

• Population size: 150 individuals
• Initial population: ramped half-and-half
• Termination condition (number of generations): 50 generations
• Maximum depth of each tree: 15 levels
• Maximum number of nodes in each tree: no constraint
• Parent selection method: lexicographic, with 10% of the population as competitors
• Survivor selection method: total elitism

Two other important parameters are the probabilities of the crossover and mutation operators. In our implementation these two parameters are variable and are updated by the algorithm whenever needed [13]: if an individual generated by the crossover operator yields a good fitness value (e.g. it becomes the best individual so far), the probability of crossover is increased, and the same rule applies to the mutation operator. The minimum and maximum limits for these two probabilities are 0.1 and 0.9, respectively.
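A hedged sketch of this adaptive scheme, in the spirit of Davis [13] (the paper does not give the exact update rule; the 0.05 step size below is our assumption):

def adapt_probabilities(p_cx, p_mut, new_best_found, produced_by_crossover,
                        step=0.05, p_min=0.1, p_max=0.9):
    """Increase the probability of the operator that just produced a new best
    individual, keeping both probabilities within [0.1, 0.9] as in the paper."""
    if new_best_found:
        if produced_by_crossover:
            p_cx = min(p_max, p_cx + step)
        else:
            p_mut = min(p_max, p_mut + step)
    return max(p_min, p_cx), max(p_min, p_mut)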

Table 1 shows the characteristics of the best NNs found in this experiment and the average errors, over 10 runs, for the classification of all samples in each dataset.

Table 1. Characteristics of the best NNs found in first experiment for classification of all samples in each dataset

Dataset name | # of classes | # of training samples | # of test samples | Average training error (Fitness) | Average generalization error | Depth (# of levels) | Size (# of nodes)
Iris  | 3 | 105 | 45 | 0.1278 | 0.1150 | 15 | 59
Wine  | 3 | 125 | 53 | 0.4689 | 0.5600 | 15 | 61
Glass | 6 | 152 | 62 | 0.7129 | 0.9360 | 14 | 68

The second experiment is conducted so that a NN (module) is evolved, using GP, for each class of the dataset. The inputs to these modules are all samples of the dataset, and the output is either 0 or 1, deciding whether or not the sample belongs to the module's class. In the next step another NN is evolved, as in Fig. 4, which receives the outputs of the previous modules and outputs the final class number of the sample. Briefly, the features of each sample are fed into the modules, and the number of its class is produced as the output of the last NN. The same procedure is carried out for all three datasets, with the same set of GP parameters as in the first experiment.

Fig. 4. The general structure of modular NNs generated with our method
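The two-stage decision procedure of Fig. 4 can be written schematically as follows (an illustrative Python sketch; the per-class modules and the combining network are assumed to be callables built from the evolved expression strings described in Section 3):

def classify(features, class_modules, combining_net):
    """Stage 1: each evolved module outputs 0 or 1 for its own class.
       Stage 2: the final GP-optimized NN maps the vector of module outputs
       to the class number of the sample."""
    votes = [module(features) for module in class_modules]
    return combining_net(votes)

# Usage sketch (iris): class_modules = [module_1, module_2, module_3];
# classify(sample_features, class_modules, combining_net) -> 1, 2 or 3.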


Tables 2, 3 and 4 show the characteristics of the best single NNs for each class, and of the final modular neural network, found in this experiment for classifying all samples of each dataset; the results are averages over 10 runs of the algorithm. As can be seen in Table 1, when a single GP-optimized NN is used for the classification of the iris dataset, the training and generalization errors are not very promising: the average training and generalization errors for this dataset are 0.1278 and 0.1150, respectively. These NNs decide the class number of each sample directly.

Table 2. Characteristics of the best NN modules and final modular NN found in the second experiment for classification of all samples of the iris dataset

Class | # of training samples | # of test samples | Average training error (Fitness) | Average test error | Depth (# of levels) | Size (# of nodes)
1   | 35  | 15 | 0      | ---    | 10 | 29
2   | 35  | 15 | 0.1232 | ---    | 15 | 135
3   | 35  | 15 | 0.1357 | ---    | 13 | 75
all | 105 | 45 | 0.0228 | 0.0542 | 15 | 114

Table 3. Characteristics of the best NN modules and final modular NN found in the second experiment for classification of all samples of the wine dataset

Class | # of training samples | # of test samples | Average training error (Fitness) | Average test error | Depth (# of levels) | Size (# of nodes)
1   | 41  | 18 | 0.0731 | ---    | 7  | 21
2   | 50  | 21 | 0.1044 | ---    | 15 | 76
3   | 34  | 14 | 0.0028 | ---    | 14 | 79
all | 125 | 53 | 0.0365 | 0.0891 | 11 | 32

Table 4. Characteristics of the best NN modules and final modular NN found in the second experiment for classification of all samples of the glass dataset

Class | # of training samples | # of test samples | Average training error (Fitness) | Average test error | Depth (# of levels) | Size (# of nodes)
1   | 49  | 21 | 0.0693 | ---    | 10 | 28
2   | 12  | 5  | 0.4333 | ---    | 4  | 8
3   | 54  | 22 | 0.0546 | ---    | 12 | 37
4   | 9   | 4  | 0.5555 | ---    | 8  | 17
5   | 7   | 2  | 0.4285 | ---    | 4  | 9
6   | 21  | 8  | 0.0952 | ---    | 6  | 20
all | 152 | 62 | 0.0125 | 0.0539 | 13 | 70


On the other hand, using the second method (the modular neural network), the average training and generalization errors for the iris dataset, shown in the last row of Table 2, are reduced to 0.0228 and 0.0542, respectively. These results show that the second approach is far more effective than the first. For the wine dataset the same methods are applied: the average training and generalization errors of the first method are 0.4689 and 0.5600, respectively, while with the second method these errors are reduced to 0.0365 and 0.0891. The average training and generalization errors for the glass dataset with the first method are 0.7129 and 0.9360, while with the second approach they are reduced to 0.0125 and 0.0539.

5 Conclusion

Comparing Tables 1 through 4, it can be concluded that using a single NN for the classification of all samples of a dataset, even when it is evolved by an evolutionary algorithm, is not a good idea: such a NN does not yield a good generalization error. On the other hand, using a modular neural network structure with one module per class of the dataset, and then feeding the module outputs into a new GP-optimized NN that decides the class number of each sample, yields much better results. The reason is that, since a module is evolved for each class, and the samples within a class are relatively similar, learning the input space is easier. In effect, each module separates a single class from the other classes, and a final NN decides the sample's class number. In summary, the most important features of this work are:

1. No knowledge of the architecture or connection weights of the NNs is required a priori.
2. Smaller NNs are preferred to larger NNs with the same performance.
3. Neurons, even in the same layer, are not forced to use the same transfer function; a variety of transfer functions is available for each neuron.
4. Since a string-based representation is used in our implementation, there is no need to convert the chromosomes of the population into real NNs in order to calculate their fitness values.

References

[1] Murre, J.M.J.: Transputers and neural networks: An analysis of implementation constraints and performance. IEEE Transactions on Neural Networks 4, 284–292 (1993)
[2] Eccles, J.C.: Neuroscience 6, 1839–1855 (1981)
[3] Murre, J.: Learning and Categorization in Modular Neural Networks. Harvester Wheatsheaf (1992)
[4] Auda, G.: Cooperative modular neural network classifiers. PhD thesis, University of Waterloo, Systems Design Engineering Department, Canada (1996)
[5] Yao, X.: Evolving Artificial Neural Networks. Proceedings of the IEEE 87(9) (1999)
[6] Back, T., Fogel, D.B., Michalewicz, Z.: Evolutionary Computation 1: Basic Algorithms and Operators, and Evolutionary Computation 2: Advanced Algorithms and Operators. IOP Publishing Ltd (2000)


[7] Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA (1992)
[8] Koza, J.R.: Genetic Programming: A Paradigm for Genetically Breeding Populations of Computer Programs to Solve Problems. Stanford University Computer Science Department Technical Report STAN-CS-90-1314 (June 1990)
[9] Koza, J.R., Rice, J.P.: Genetic Generation of Both Weights and Architecture for a Neural Network. In: Proceedings of the International Joint Conference on Neural Networks, vol. II, pp. 397–404 (1991)
[10] Gruau, F.: Automatic definition of modular neural networks. Adaptive Behaviour 3(2), 151–183 (1995)
[11] Luke, S., Spector, L.: Evolving graphs and networks with edge encoding: Preliminary report. In: Koza, J.R. (ed.) Late Breaking Papers at the Genetic Programming Conference, Stanford University, July 28–31, pp. 117–124 (1996)
[12] Ritchie, M.D., White, B.C., Parker, J.S., Hahn, L.W., Moore, J.H.: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics 4, 28 (2003)
[13] Davis, L.: Adapting operator probabilities in genetic algorithms. In: Schaffer, J.D. (ed.) Proceedings of the Third International Conference on Genetic Algorithms, pp. 61–69. Morgan Kaufmann, San Mateo, CA (1989)