Architecture Selection for Neural Networks

Pythagoras Karampiperis a,b, Nikos Manouselis a,c, Theodore B. Trafalis b,1

a Dept. of Electronics & Computer Engineering, Technical University of Crete
b Dept. of Production Engineering & Management, Technical University of Crete
c Informatics and Telematics Institute, Center for Research and Technology – Hellas

1 During the preparation of this paper, T. B. Trafalis was on leave from the University of Oklahoma.
Abstract - Researchers worldwide agree that exhaustive search over the space of network architectures is computationally infeasible even for networks of modest size. A common technique is therefore the use of heuristic strategies that dramatically reduce the search complexity. These heuristic approaches employ directed search algorithms, such as selection of the number of nodes via sequential network construction (SNC), and pruning of inputs and weights via sensitivity based pruning (SBP) and optimal brain damage (OBD). The main disadvantage of these techniques is that they produce perceptrons with only one hidden layer rather than multilayer networks. As a consequence, we cannot ensure that the resulting network will produce "optimal" results. This paper investigates other heuristic strategies that produce multilayer perceptrons. We propose three algorithms for the construction of networks and compare them with the classic approaches. Simulation results show that these methods lead to "near-optimal" network structures by exploiting error-related information.

1 INTRODUCTION

During the last years, many algorithms for training neural networks have been proposed. Some of them come from optimization, like BFGS and steepest descent, and others are heuristic algorithms that try to speed up the training process even more. Little effort has been made, though, in selecting the optimal structure of the neural network, even though exhaustive search over the whole space of networks is infeasible due to its complexity. The most common technique for selecting the network structure is to choose a locally optimal number of internal units from a sequence of fully connected networks with an increasing number of hidden units, using sequential network construction (SNC) to train the networks in this sequence. Then, starting from the optimal fully connected network, input variables are pruned via sensitivity based pruning (SBP) and weights are pruned via optimal brain damage (OBD). The main drawback of this technique is that it works only with two-layer perceptrons, so it limits the search space over the networks, and it also requires a large number of nodes in real-world problems, which makes it computationally inefficient. Given the futility of exhaustively sampling the space of possible networks in search of an optimum, we present three heuristic search algorithms capable of producing a multi-layer perceptron. Section 2 describes the training algorithm we used to train the neural networks in order to compare the proposed algorithms, Section 3 introduces the algorithms for network structure selection, and Section 4 compares the success rate and the number of epochs required by each algorithm. Finally, Section 5 gives details about the testing platform we developed.

2 LEARNING ALGORITHM

The neural network that we developed for testing the network selection algorithms consists of neurons with an activation function of the form $\Phi_j(U_j) = a_j \tanh(b_j U_j)$, with $(a_j, b_j) > 0$ and $U_j = \sum_{i=0}^{m} w_{ji} X_i$, and the values of the free parameters of the network are located by employing the back-propagation algorithm. The feed-forward error back-propagation (BP) learning algorithm is the most famous procedure for training artificial neural networks. BP is based on searching an error surface, using gradient descent, for point(s) with minimum error. Each iteration of BP consists of two sweeps: a forward activation to produce a solution, and a backward propagation of the computed error to modify the weights. There has been much research on improving BP's performance. The Extended Delta-Bar-Delta (EDBD) variation of BP attempts to escape local minima by automatically adjusting step sizes and momentum rates with the use of momentum constants. To produce a better search over the error surface, we consider the following quantities as free parameters in each neuron:
− $w$: weight of every synaptic connection,
− $a, b$: activation function parameters,
− $r_w$: learning rate of $w$,
− $r_a$: learning rate of $a$,
− $r_b$: learning rate of $b$.
In this way, every node has a different activation function and every free parameter has its own learning rate. In order to avoid getting trapped in a local minimum, we adopted a momentum constant equal to $(1 - r_x)$ for every free parameter $x$ of the network. The corrections of the free parameters at epoch $n$ then become:

$$\Delta w_{ji}(n) = r_{w_{ji}} \, \delta_j \, X_i + (1 - r_{w_{ji}}) \, \Delta w_{ji}(n-1),$$
$$\Delta a_j(n) = r_{a_j} \, \delta_j \, \frac{Y_j}{a_j(n) \, \Phi'_j(U_j)} + (1 - r_{a_j}) \, \Delta a_j(n-1),$$
$$\Delta b_j(n) = r_{b_j} \, \delta_j \, \frac{U_j}{b_j(n)} + (1 - r_{b_j}) \, \Delta b_j(n-1),$$

where $r_{w_{ji}}$ is the learning rate parameter of $w_{ji}$ with $0 < r_{w_{ji}} < 1$ for every weight $(j, i)$, $r_{a_j}$ is the learning rate parameter of $a_j$ with $0 < r_{a_j} < 1$ for every node $j$, and $r_{b_j}$ is the learning rate parameter of $b_j$ with $0 < r_{b_j} < 1$ for every node $j$. The local gradient of each neuron is given by:
− For a neuron $j$ located in the output layer: $\delta_j = \Phi'_j(U_j) \, [D_j - Y_j]$
− For a neuron $j$ located in a hidden layer: $\delta_j = \Phi'_j(U_j) \sum_k \delta_k w_{kj}$
and the derivative is equal to
$$\Phi'_j(U_j) = \frac{b_j}{a_j} \, [a_j - Y_j][a_j + Y_j]$$
for each neuron $j$.

Every adjustable (free) network parameter of the cost function has its own individual learning-rate parameter, which is updated using the quantity
$$R(n) = -x(n) \, \frac{\Delta x(n)}{r_x(n)}$$
− Rule 1: if $R(n+1) > 0$ and $R(n) > 0$, then increase $r_x$ ($r_x = r_x + e$)
− Rule 2: if $R(n+1) < 0$ and $R(n) < 0$, then decrease $r_x$ ($r_x = r_x - e$)
where $e$ is a small positive number.
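As an illustration of these update rules, the following Python sketch (our own; the class name, initial values and clipping bounds are assumptions, and the authors' actual implementation is the Delphi/C++ platform described in Section 5) applies the epoch-n corrections for one neuron, including the (1 − r_x) momentum term and the sign-based adjustment of the individual learning rates:

import numpy as np

class NeuronParams:
    """Hypothetical container for the free parameters of a single neuron:
    weights w, activation parameters a and b, one learning rate per parameter,
    and the previous corrections used by the momentum term (1 - r_x)."""

    def __init__(self, n_inputs, e=0.01):
        self.w = np.random.uniform(-0.5, 0.5, n_inputs + 1)  # w_j0..w_jm, as in U_j = sum_{i=0}^m w_ji X_i
        self.a, self.b = 1.0, 1.0                             # activation parameters, kept > 0
        self.r_w = np.full(n_inputs + 1, 0.1)                 # learning rate of each w_ji
        self.r_a, self.r_b = 0.1, 0.1                         # learning rates of a_j, b_j
        self.dw_prev = np.zeros(n_inputs + 1)
        self.da_prev = self.db_prev = 0.0
        self.e = e                                            # small positive rate increment

    def phi(self, U):
        return self.a * np.tanh(self.b * U)

    def phi_prime(self, U):
        Y = self.phi(U)
        return (self.b / self.a) * (self.a - Y) * (self.a + Y)

    @staticmethod
    def _adapt(r, x, dx, dx_prev, e):
        # R(n) = -x(n) * dx(n) / r_x(n): increase r by e when R keeps the same
        # positive sign across epochs, decrease it by e when it keeps the same
        # negative sign, leave it unchanged otherwise.
        R_new, R_old = -x * dx / r, -x * dx_prev / r
        r = r + e * ((R_new > 0) & (R_old > 0)) - e * ((R_new < 0) & (R_old < 0))
        return np.clip(r, 1e-4, 0.999)                        # keep 0 < r_x < 1

    def update(self, X, delta, U):
        """Apply the corrections given the local gradient delta, the net input U
        and the input vector X (length m + 1, matching the weight vector)."""
        Y = self.phi(U)
        dw = self.r_w * delta * X + (1.0 - self.r_w) * self.dw_prev
        da = self.r_a * delta * Y / (self.a * self.phi_prime(U)) + (1.0 - self.r_a) * self.da_prev
        db = self.r_b * delta * U / self.b + (1.0 - self.r_b) * self.db_prev

        self.r_w = self._adapt(self.r_w, self.w, dw, self.dw_prev, self.e)
        self.r_a = float(self._adapt(self.r_a, self.a, da, self.da_prev, self.e))
        self.r_b = float(self._adapt(self.r_b, self.b, db, self.db_prev, self.e))

        self.w += dw
        self.a = max(self.a + da, 1e-3)                       # crude way to respect (a_j, b_j) > 0
        self.b = max(self.b + db, 1e-3)
        self.dw_prev, self.da_prev, self.db_prev = dw, da, db

The delta passed to update() is the local gradient defined above; the same sign-based adaptation is applied independently to every free parameter.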

In order to have an estimate of how general the result of the neural network is, we split the training examples into two sets. The first one is used to train the network using the BP variation described above, and the second set of examples is used to measure the error on examples that the network has not been trained on. This measured error is called the generalization error, and in order to avoid over-training we use it as a stopping criterion for the training process.
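As a small illustrative sketch (our own; the split ratio and the callables wrapping the BP variant are assumptions, not taken from the paper), the validation-based stopping criterion could be organized as follows:

def train_with_early_stopping(train_one_epoch, generalization_error, examples,
                              split=0.7, max_epochs=1000):
    """Sketch of the stopping criterion: train_one_epoch(train_set) runs one epoch
    of the BP variant of Section 2 and generalization_error(validation_set) measures
    the error on the held-out set; both are assumed callables supplied by the caller."""
    cut = int(len(examples) * split)
    train_set, validation_set = examples[:cut], examples[cut:]
    best_error = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(train_set)
        g_err = generalization_error(validation_set)
        if g_err >= best_error:        # generalization error stopped improving: over-training begins
            return best_error, epoch
        best_error = g_err
    return best_error, max_epochs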

3 ARCHITECTURE SELECTION ALGORITHM

The neural network architecture used is a growing multilayer perceptron, which begins from a basic structure of one node in one hidden layer and then self-organizes until it reaches the optimum structure for the given problem. The self-organization process consists of two phases: a growing one and a shrinking one. In the growing phase of self-organization the network is assumed to be fully connected. Therefore, after selecting the optimal fully connected network we try to improve the generalization error even further through optimal brain damage (second phase). In this section we examine three different self-organizing algorithms.

I. Symmetric Growth (SG)

The symmetric growth algorithm tries to obtain the best network structure using a trial-and-error procedure. During the growing phase of self-organization, two basic principles must always hold:
− Every hidden layer has the same number of hidden nodes as the rest of the hidden layers.
− There are two ways of growing: horizontal (by incrementing the number of hidden layers) or vertical (by incrementing the number of hidden nodes).
The growth rule of the network is based on the optimization of the generalization error. At every step of the growth algorithm, the following alternatives are examined:
− Calculate the generalization error using the current structure.
− Calculate the generalization error after horizontal growth.
− Calculate the generalization error after vertical growth.
Then the following conditions are examined:

− If horizontal growth proves better than vertical growth and the current structure, grow horizontally and return to the beginning.
− If vertical growth proves better than horizontal growth and the current structure, grow vertically and return to the beginning.
− If the current structure proves better than both horizontal and vertical growth, self-organization stops.

Initialization: Network consists of one hidden layer with one node.
Step 1: Calculate the generalization error of the current network structure.
Step 2: Increase horizontally and calculate the generalization error of the new network structure.
Step 3: Increase vertically and calculate the generalization error of the new network structure.
Step 4: Compare the current structure with the two new structures and keep the best among them.
Step 5: If the best structure is the current structure then stop, else return to Step 1.
FIGURE 1: Steps for Symmetric Growth Algorithm (SG).
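For illustration only, a schematic version of the loop in Figure 1 (our own sketch; the structure encoding as a pair of counters and the train_and_score callable are assumptions) might look like this:

def symmetric_growth(train_and_score):
    """Sketch of the SG loop. train_and_score(layers, nodes) is an assumed helper
    that builds a fully connected MLP with `layers` hidden layers of `nodes` nodes
    each, trains it, and returns its generalization error."""
    layers, nodes = 1, 1                                # initialization: one hidden layer, one node
    current = train_and_score(layers, nodes)
    while True:
        horizontal = train_and_score(layers + 1, nodes)  # horizontal growth: one more hidden layer
        vertical = train_and_score(layers, nodes + 1)    # vertical growth: one more node per hidden layer
        best = min(current, horizontal, vertical)
        if best == current:                              # current structure wins: stop self-organization
            return layers, nodes, current
        if best == horizontal:
            layers, current = layers + 1, horizontal
        else:
            nodes, current = nodes + 1, vertical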

This growing algorithm starts from the simple one-node, one-hidden-layer perceptron and ends in an MLP network that gives near-optimum generalization. After the network structure is chosen, pruning techniques are used to remove certain weights or nodes from this structure while minimizing the generalization error.

II. Worst Layer Growth (WLG)

The worst layer growth algorithm tries to obtain the best network structure using the error information provided during the training process. It follows these steps:
1. Measure the generalization error of the current network structure.
2. For each hidden layer plus the output layer, calculate the mean absolute square error produced by all the nodes of the corresponding layer, using the formula $E_l = \frac{1}{J} \sum_j (E_j)^2$, where $J$ is the total number of nodes that the layer contains and $E_j$ is the absolute error of node $j$, given by
$$E_j = \begin{cases} D_j - Y_j, & \text{if node } j \text{ is in the output layer} \\ \sum_k \delta_k w_{kj}, & \text{if node } j \text{ is in a hidden layer} \end{cases}$$
where $k$ runs over all the nodes of the layer following the one that node $j$ belongs to, connected with node $j$ through synaptic weight $w_{kj}$.
3. Find the layer or layers that give the maximum mean absolute square error.

4. For each of these layers, starting from the first hidden layer and moving toward the output layer, add one node to the previous hidden layer. Obviously, if the first hidden layer belongs to the selected set of layers for growth, then, since it is impossible to add a new node to the input layer, another layer consisting of only one node is added as the first hidden layer and the old first hidden layer becomes the second hidden layer. Similarly, if the output layer belongs to the set of layers selected for growth, a new node is added to the last hidden layer.
5. Measure the generalization error produced by the new network; if it is less than the error produced by the network before the growth, the algorithm starts all over again, otherwise it stops, keeping the network architecture obtained just before the last growth.

It is worth noting that the starting network is the simple network consisting of one hidden layer with only one node. After applying the above algorithm we therefore arrive at a more complex network consisting of several hidden layers, without the constraint that all hidden layers contain the same number of nodes.

Initialization: Network consists of one hidden layer with one node.
Step 1: Calculate the generalization error of the current network structure.
Step 2: Select the layer(s) with the maximum mean square error and add a new node to the hidden layer before them, or add a new layer.
Step 3: Calculate the generalization error of the new network structure.
Step 4: Compare the current with the new structure and keep the best among them.
Step 5: If the best structure is the current structure then stop, else return to Step 1.
FIGURE 2: Steps for Worst Layer Growth Algorithm (WLG).
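To make steps 2-4 concrete, here is a minimal Python sketch (ours, not the authors' implementation; the list-based encoding of the architecture and of the per-layer node errors is an assumption):

def worst_layers(layer_node_errors):
    """WLG steps 2-3: layer_node_errors is an assumed list with one entry per hidden
    layer plus the output layer, each entry holding the errors E_j of that layer's nodes."""
    mean_sq = [sum(e * e for e in errors) / len(errors)     # E_l = (1/J) * sum_j (E_j)^2
               for errors in layer_node_errors]
    worst = max(mean_sq)
    return [l for l, v in enumerate(mean_sq) if v == worst]  # indices of the worst layer(s)

def grow(hidden_sizes, selected):
    """WLG step 4: hidden_sizes[i] is the node count of hidden layer i; the index
    len(hidden_sizes) stands for the output layer."""
    offset = 0
    for l in sorted(selected):
        if l == 0:
            hidden_sizes.insert(0, 1)          # cannot grow the input layer: new one-node first hidden layer
            offset = 1
        else:
            hidden_sizes[l - 1 + offset] += 1  # add a node to the hidden layer before the selected one
    return hidden_sizes

For example, if the output layer shows the largest mean error, grow() simply adds one node to the last hidden layer, exactly as described in step 4.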

III. Worst Node Growth (WNG)


The worst node growth algorithm is based on the same idea as the worst layer growth algorithm. It uses the error information provided during the training process as a criterion for deciding whether a node should be added to an existing layer or to a new layer. This self-organizing algorithm is simpler than WLG, but as shown in Section 4 it leads to a better 'optimal' network structure and is more efficient computationally. It follows these steps:
1. Measure the generalization error of the current network structure.
2. For each node of the hidden layers plus the output layer, calculate the absolute error of that node, given by
$$E_j = \begin{cases} D_j - Y_j, & \text{if node } j \text{ is in the output layer} \\ \sum_k \delta_k w_{kj}, & \text{if node } j \text{ is in a hidden layer} \end{cases}$$
where $k$ runs over all the nodes of the layer following the one that node $j$ belongs to, connected with node $j$ through synaptic weight $w_{kj}$.

3. Find the layer or layers that contain the node(s) with the maximum absolute error.

4. For each of these layers, starting from the first hidden layer and moving toward the output layer, add one node to the previous hidden layer. Obviously, if the first hidden layer belongs to the selected set of layers for growth, then, since it is impossible to add a new node to the input layer, another layer consisting of only one node is added as the first hidden layer and the old first hidden layer becomes the second hidden layer. Similarly, if the output layer belongs to the set of layers selected for growth, a new node is added to the last hidden layer.

5. Measure the generalization error produced by the new network; if it is less than the error produced by the network before the growth, the algorithm starts all over again, otherwise it stops, keeping the network architecture obtained just before the last growth.

Initialization: Network consists of one hidden layer with one node.
Step 1: Calculate the generalization error of the current network structure.
Step 2: Select the layer(s) that contain the node(s) with the maximum absolute error and add a new node to the hidden layer before them, or add a new layer.
Step 3: Calculate the generalization error of the new network structure.
Step 4: Compare the current with the new structure and keep the best among them.
Step 5: If the best structure is the current structure then stop, else return to Step 1.
FIGURE 3: Steps for Worst Node Growth Algorithm (WNG).
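WNG differs from WLG only in how the layers to grow are selected; a matching sketch of steps 2-3 (again our own, with the same assumed per-layer error lists) could be:

def layers_with_worst_node(layer_node_errors):
    """WNG steps 2-3: pick the layer(s) containing the node(s) whose absolute
    error |E_j| is the largest."""
    worst = max(abs(e) for errors in layer_node_errors for e in errors)
    return [l for l, errors in enumerate(layer_node_errors)
            if any(abs(e) == worst for e in errors)]

The growth step itself (adding a node to the previous hidden layer, or inserting a new one-node first hidden layer) is the same as the grow() function sketched for WLG.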

As in the WLG algorithm, the starting network is the simple network consisting of one hidden layer with only one node. After applying the above algorithm we therefore arrive at a more complex network consisting of several hidden layers, without the constraint that all hidden layers contain the same number of nodes.

4 COMPARISONS

In our simulations, we studied the three proposed self-organizing techniques on two different parity problems and on a real-world classification problem from the PROBEN1 database (Prechelt, 1994). In particular, we studied the 3-bit and 4-bit parity problems and the cancer classification problem of the PROBEN1 set (the standard PROBEN1 benchmarking rules were applied). We report performance results for the three proposed self-organizing algorithms and the sequential network construction (SNC) algorithm. For each of the 3-bit and 4-bit parity problems we performed 50 trials with various initializations of the free parameters of each new node. The maximum number of epochs per trial was set to 1000 and learning was considered successful when Fahlman's (Fahlman, 1988) '40-20-40' criterion was met (sketched below, after Table I). Training results are shown in Table 1 for the 3-bit problem and in Table 2 for the 4-bit problem. From these tables, it is evident that WNG is able to solve the problems with a high success rate in a relatively small number of epochs.

TABLE I
EXPERIMENTAL RESULTS FOR THE 3-BIT PARITY PROBLEM.
Algorithm   #epochs   Successes
SNC         620       70
SG          985       75
WLG         390       95
WNG         310       97
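For reference, our reading of Fahlman's '40-20-40' criterion is that an output counts as logical 0 when it lies in the lower 40% of the output range, as logical 1 when it lies in the upper 40%, and as indeterminate (wrong) in the middle 20%; a trial is successful when every output lands on the correct side. The small sketch below is our own illustration of that check, assuming outputs and binary targets scaled to [lo, hi]:

def meets_40_20_40(outputs, targets, lo=0.0, hi=1.0):
    """Sketch of the 40-20-40 criterion (our interpretation of Fahlman, 1988)."""
    low_cut = lo + 0.4 * (hi - lo)    # below this: logical 0
    high_cut = hi - 0.4 * (hi - lo)   # above this: logical 1
    for y, d in zip(outputs, targets):
        if d <= low_cut and y >= low_cut:     # target is 'off' but output is not clearly off
            return False
        if d >= high_cut and y <= high_cut:   # target is 'on' but output is not clearly on
            return False
    return True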

The cancer problem concerns the diagnosis of breast cancer. The task is to classify a tumor as benign or malignant based on cell descriptions gathered by a microscope. The problem has nine real-valued inputs and two binary outputs, and consists of 699 examples partitioned into 350 training examples, 175 validation examples and 174 test examples. We used the cancer3 problem dataset of the PROBEN1 database, which corresponds to a certain assignment of the samples to each of the three partitions.

TABLE II
EXPERIMENTAL RESULTS FOR THE 4-BIT PARITY PROBLEM.
Algorithm   #epochs   Successes
SNC         1320      40
SG          1730      32
WLG         970       84
WNG         660       93

For this problem we performed 10 trials with various initializations of the free parameters of each new node, for each tested algorithm. The maximum number of epochs per trial was set to 5000 and learning was again considered successful when Fahlman's criterion was met. Training results for this problem are shown in Table 3. From this table we can make the following observations: (a) WLG and WNG are the only algorithms that exhibit a 100% success rate, with WNG requiring a considerably smaller mean number of epochs; (b) SNC and SG failed to solve the problem in all the trials.

TABLE III
EXPERIMENTAL RESULTS FOR THE CANCER PROBLEM.
Algorithm   #epochs   Successes
SNC         Failed    0
SG          Failed    0
WLG         3784      100
WNG         3272      100

5 IMPLEMENTATION

The simulations were carried out in a Neural Network Simulation Platform, implemented using the Borland Delphi 5 visual programming environment for the user interface part (Figure 4) and DLL libraries written in C++. The platform ran on a computer system equipped with an Intel Celeron processor at 700 MHz and 128 MB RAM, running the Microsoft Windows 2000 Professional operating system.

FIGURE 4: Simulation Platform.


6 CONCLUSIONS

In this paper, three self-organizing algorithms for neural networks have been introduced. These algorithms are useful for 'automatically' selecting the optimal network structure for a specific problem. It was shown that both WLG and WNG work very well on real-world problems without structural constraints such as a single hidden layer (SNC) or an equal number of nodes in all hidden layers (SG).

References
[1] Ash, T. (1989), 'Dynamic node creation in back-propagation neural networks', Connection Science 1(4), 365-375.
[2] Barron, A. (1984), 'Predicted squared error: a criterion for automatic model selection', in S. Farlow, ed., Self-Organizing Methods in Modeling, Marcel Dekker, New York.
[3] Geman, S., Bienenstock, E. and Doursat, R. (1992), 'Neural networks and the bias/variance dilemma', Neural Computation 4(1), 1-58.
[4] Fahlman, S. E. (1988), 'Faster learning variations on back-propagation: an empirical study', in D. Touretsky, G. Hinton and T. Sejnowski, eds, Proceedings of the Connectionist Models Summer School, Morgan Kaufmann, San Mateo, pp. 38-51.
[5] Hamey, L. (1998), 'XOR has no local minima: a case study in neural network error surface analysis', Neural Networks 11, 669-681.
[6] Liu, Y. (1993), 'Neural network model selection using asymptotic jackknife estimator and cross-validation method', in S. J. Hanson, J. D. Cowan and C. L. Giles, eds, Advances in Neural Information Processing Systems 5, Morgan Kaufmann Publishers, San Mateo, CA, pp. 599-606.
[7] Prechelt, L. (1994), 'PROBEN1 – a set of neural network benchmark problems and benchmarking rules', Technical Report 21/94, Universitat Karlsruhe, Germany.
[8] Schwartz, G. (1978), 'Estimating the dimension of a model', Ann. Stat. 6, 461-464.
[9] Sprinkhuizen-Kuyper, I. G. and Boers, E. J. W. (1996), 'The error surface of the simplest XOR network has only global minima', Neural Computation 8, 1301-1320.
[10] Sprinkhuizen-Kuyper, I. G. and Boers, E. J. W. (1996), 'The error surface of the 2-2-1 XOR network: stationary points with infinite weights', Technical Report 96-10, Department of Computer Science, Leiden University.