Unsupervised Design of Artificial Neural Networks via Multi-Dimensional Particle Swarm Optimization

Serkan Kiranyaz*, Turker Ince**, Alper Yildirim*** and Moncef Gabbouj*
*Tampere University of Technology, Tampere, Finland, {serkan.kiranyaz, moncef.gabbouj}@tut.fi
**Izmir University of Economics, Izmir, Turkey, [email protected]
***TUBITAK, Ankara, Turkey, [email protected]

Abstract
In this paper, we present a novel and efficient approach for the automatic design of Artificial Neural Networks (ANNs) by evolving toward the optimal network configuration(s) within an architecture space. The evolution technique, the so-called multi-dimensional Particle Swarm Optimization (MD PSO), re-forms the native structure of PSO particles in such a way that they can make inter-dimensional passes with a dedicated dimensional PSO process. Thus, in a multi-dimensional search space where the optimum dimension is unknown, swarm particles can seek both positional and dimensional optima. This eventually removes the necessity of setting a fixed dimension a priori, which is a common drawback for the family of swarm optimizers. With a proper encoding of the network configurations and parameters into particles, MD PSO can then seek the positional optimum in the error space and the dimensional optimum in the architecture space. The optimum dimension converged upon at the end of an MD PSO process corresponds to a unique ANN configuration, where the network parameters (connections, weights and biases) can then be resolved from the positional optimum reached in that dimension. The efficiency and performance of the proposed technique are demonstrated on one of the hardest synthetic problems. The experimental results show that MD PSO evolves to optimum or near-optimum networks in general.

1. Introduction
Several researchers have attempted to design ANNs automatically with respect to a particular problem. The earlier attempts fall into two broad categories, constructive and pruning algorithms [4], [7], [12], for which many deficiencies and limitations have been reported [1]. Efforts have then focused on evolutionary algorithms (EAs) [2], particularly Genetic Algorithms (GAs) [8] and Evolutionary Programming (EP) [6], for both training and evolving ANNs. Most GA-based methods have also been found quite inadequate for evolving ANNs, mainly due to two major problems: the permutation problem and noisy fitness evaluation [14].


Although EP-based methods [1], [14], [15] might address such problems, they usually suffer from high complexity and strict parameter dependence. In principle, Particle Swarm Optimization (PSO) [9], [11], which has obvious ties with the EA family, lies somewhere between GA and EP. Yet unlike GA, PSO has no complicated evolutionary operators such as crossover, selection and mutation, and it is highly dependent on stochastic processes. PSO has been successfully applied for training feed-forward [5], [10] and recurrent ANNs [13], and several works in this field have shown that it can achieve a learning ability superior to the traditional Back-Propagation (BP) method in terms of accuracy and speed. Only a few researchers have investigated the use of PSO for the evolutionary design of ANNs or, to be precise, of fully connected feed-forward ANNs, the so-called multi-layer perceptrons (MLPs), and only with a single hidden layer. The major drawback of many PSO variants, including the basic method, is that they can only be applied to a search space with a fixed number of dimensions. However, in the field of ANNs, as well as in many other optimization problems (e.g. clustering, spatial segmentation, optimization of dimensional functions, etc.), the optimum dimension where the optimum solution lies is also unknown and should thus be determined within the PSO process. In order to address this problem, in this paper we first present a multi-dimensional PSO (MD PSO) technique, which negates the need to fix the dimension of the solution space in advance. We then adapt the MD PSO technique for designing (near-) optimal ANNs. The focus is particularly drawn to the automatic design of feed-forward ANNs, and the search is carried out over all possible network configurations within the specified architecture space. Therefore, no assumption is made about the number of (hidden) layers, and in fact none of the network properties (e.g. feed-forward or not, differentiable activation function or not, etc.) is an inherent constraint of the proposed scheme. As long as the potential network configurations are transformed into a hash (dimension) table with a proper hash function, where the indices represent the solution-space dimensions of the particles, MD PSO can then seek both positional and dimensional optima in an interleaved PSO process. The optimum dimension found naturally corresponds to a distinct ANN architecture where the network parameters (connections, weights and biases) can be resolved from the positional optimum reached in that dimension.

2. The Proposed Method
Designing the optimal network architecture can be thought of as a search process within the architecture space containing all potential and feasible architectures. To accomplish this, the proposed MD PSO technique is adapted accordingly and presented next.

2.1. MD PSO Algorithm
Instead of operating at a fixed dimension N, the MD PSO algorithm is designed to seek both positional and dimensional optima within a dimension range, $D_{min} \le N \le D_{max}$. In order to accomplish this, each particle has two sets of components, which are subjected to two independent and consecutive processes. The first is a regular positional PSO, i.e. the traditional velocity updates and the corresponding positional shifts in an N-dimensional search (solution) space. The second is a dimensional PSO, which allows the particle to navigate through dimensions. Accordingly, each particle keeps track of its last position, velocity and personal best position (pbest) in a particular dimension, so that when it re-visits the same dimension at a later time, it can perform its regular "positional" fly using this information. The dimensional PSO process of each particle may then move the particle to another dimension, where it will remember its positional status and keep "flying" within the positional PSO process in this dimension, and so on. The swarm, on the other hand, keeps track of the gbest particles in all dimensions, each of which indicates the best (global) position achieved so far in that dimension and can thus be used in the regular velocity update equation for that dimension. Similarly, the dimensional PSO process of each particle uses its personal best dimension, in which the personal best fitness score has so far been achieved. Finally, the swarm keeps track of the global best dimension, dbest, among all the personal best dimensions. The gbest particle in dimension dbest represents the optimum solution at the optimum dimension. In an MD PSO process at time (iteration) t, each particle a in the swarm, $\xi = \{x_1, \ldots, x_a, \ldots, x_S\}$, is represented by the following characteristics:

$xx_{a,j}^{xd_a(t)}(t)$ : jth component (dimension) of the position in dimension $xd_a(t)$
$vx_{a,j}^{xd_a(t)}(t)$ : jth component (dimension) of the velocity in dimension $xd_a(t)$
$xy_{a,j}^{xd_a(t)}(t)$ : jth component (dimension) of the personal best (pbest) position in dimension $xd_a(t)$
$gbest(d)$ : global best particle index in dimension d
$\hat{xy}_j^{d}(t)$ : jth component (dimension) of the global best position of the swarm in dimension d
$xd_a(t)$ : dimension component
$vd_a(t)$ : velocity component of the dimension
$\widetilde{xd}_a(t)$ : personal best dimension component

Let f denote the dimensional fitness function that is to be optimized within a certain dimension range, $\{D_{min}, D_{max}\}$. Without loss of generality, assume that the objective is to find the minimum (position) of f at the optimum dimension within a multi-dimensional search space. Assume that particle a re-visits the same dimension after T iterations (i.e. $xd_a(t) = xd_a(t+T)$); then its personal best position can be updated at iteration t+T as follows,

$$xy_{a,j}^{xd_a(t+T)}(t+T) = \begin{cases} xy_{a,j}^{xd_a(t)}(t) & \text{if } f\left(xx_a^{xd_a(t+T)}(t+T)\right) > f\left(xy_a^{xd_a(t)}(t)\right) \\ xx_{a,j}^{xd_a(t+T)}(t+T) & \text{else} \end{cases} \quad (1)$$

Furthermore, the personal best dimension of particle a can be updated at iteration t+1 as follows,

$$\widetilde{xd}_a(t+1) = \begin{cases} \widetilde{xd}_a(t) & \text{if } f\left(xx_a^{xd_a(t+1)}(t+1)\right) > f\left(xy_a^{\widetilde{xd}_a(t)}(t)\right) \\ xd_a(t+1) & \text{else} \end{cases} \quad (2)$$

Let gbest(d) be the index of the global best particle in dimension d and let S(d) be the total number of particles in dimension d; then it follows that,

$$\hat{xy}^{dbest}(t) = xy_{gbest(dbest)}^{dbest}(t) = \min\left(xy_1^{dbest}(t), \ldots, xy_{S(dbest)}^{dbest}(t)\right)$$

For a particular iteration t and for a particle $a \in \{1, S\}$, first the positional components are updated in its current dimension, $xd_a(t)$, and then the dimensional update is performed to determine its next, (t+1)st, dimension, $xd_a(t+1)$. The positional update is performed for each dimension component, $j \in \{1, xd_a(t)\}$, as follows:

$$\begin{aligned}
vx_{a,j}^{xd_a(t)}(t+1) &= w(t)\, vx_{a,j}^{xd_a(t)}(t) + c_1 r_{1,j}(t)\left(xy_{a,j}^{xd_a(t)}(t) - xx_{a,j}^{xd_a(t)}(t)\right) + c_2 r_{2,j}(t)\left(\hat{xy}_j^{xd_a(t)}(t) - xx_{a,j}^{xd_a(t)}(t)\right) \\
xx_{a,j}^{xd_a(t)}(t+1) &= xx_{a,j}^{xd_a(t)}(t) + C_{vx}\left[vx_{a,j}^{xd_a(t)}(t+1), \{V_{min}, V_{max}\}\right] \\
xx_{a,j}^{xd_a(t)}(t+1) &\leftarrow C_{xx}\left[xx_{a,j}^{xd_a(t)}(t+1), \{X_{min}, X_{max}\}\right]
\end{aligned} \quad (3)$$

where $C_{xx}[\cdot,\cdot]$ and $C_{vx}[\cdot,\cdot]$ are the clamping operators applied over each positional component, $xx_{a,j}^{d}$ and $vx_{a,j}^{d}$, respectively. $C_{xx}[\cdot,\cdot]$ may or may not be applied depending on the optimization problem, but $C_{vx}[\cdot,\cdot]$ is basically needed to avoid explosion. Due to space limitations, the details of the clamping operators and the pseudo-code for MD PSO are skipped in this paper.
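As a concrete illustration of the interleaved positional/dimensional bookkeeping described above, the following is a minimal Python sketch of a single MD PSO iteration for one particle. It is a simplified reading of Eqs. (1)-(3), not the authors' implementation: the dimensional velocity rule, the fitness function and all parameter values (w, c1, c2, clamping limits) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MIN, D_MAX = 2, 10              # dimension range {Dmin, Dmax} (assumed values)
W, C1, C2 = 0.72, 1.49, 1.49      # inertia and acceleration coefficients (assumed)
V_CLAMP, X_CLAMP = 1.0, 5.0       # clamping limits used by Cvx / Cxx (assumed)

def fitness(x):
    """Illustrative fitness (sphere function); stands in for the real f."""
    return float(np.sum(x ** 2))

class Particle:
    def __init__(self):
        dims = range(D_MIN, D_MAX + 1)
        # One positional state per dimension, remembered between visits.
        self.xx = {d: rng.uniform(-X_CLAMP, X_CLAMP, d) for d in dims}
        self.vx = {d: np.zeros(d) for d in dims}
        self.xy = {d: self.xx[d].copy() for d in dims}          # pbest per dimension
        self.pbest_f = {d: fitness(self.xx[d]) for d in dims}
        self.xd = int(rng.integers(D_MIN, D_MAX + 1))           # current dimension
        self.vd = 0.0                                           # dimensional velocity
        self.pd = self.xd                                       # personal best dimension

def step(p, gbest_pos, dbest):
    """One iteration for particle p, given per-dimension gbest positions and dbest."""
    d = p.xd
    r1, r2 = rng.random(d), rng.random(d)
    # Positional PSO in the current dimension, Eq. (3), with clamping.
    p.vx[d] = W * p.vx[d] + C1 * r1 * (p.xy[d] - p.xx[d]) + C2 * r2 * (gbest_pos[d] - p.xx[d])
    p.vx[d] = np.clip(p.vx[d], -V_CLAMP, V_CLAMP)                 # Cvx
    p.xx[d] = np.clip(p.xx[d] + p.vx[d], -X_CLAMP, X_CLAMP)       # Cxx
    f = fitness(p.xx[d])
    # Personal best position update in this dimension, Eq. (1).
    if f < p.pbest_f[d]:
        p.xy[d], p.pbest_f[d] = p.xx[d].copy(), f
    # Personal best dimension update, Eq. (2) (simplified comparison).
    if p.pbest_f[d] < p.pbest_f[p.pd]:
        p.pd = d
    # Dimensional PSO (assumed form): drift toward the personal best dimension and dbest.
    p.vd = W * p.vd + C1 * rng.random() * (p.pd - p.xd) + C2 * rng.random() * (dbest - p.xd)
    p.xd = int(np.clip(round(p.xd + p.vd), D_MIN, D_MAX))
```

A full run would additionally maintain, at swarm level, the per-dimension gbest positions (`gbest_pos`) and the global best dimension `dbest`, updating both after every particle step.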

2.2. Evolving ANNs via MD PSO
As a stochastic search process in a multi-dimensional search space, MD PSO seeks (near-) optimal networks in an architecture space, which can be defined by any type of ANNs with any properties. All network configurations in the architecture space are enumerated into a (dimensional) hash table with a proper hash function, which basically ranks the networks with respect to their complexity, i.e. associates higher hash indices to networks with higher complexity. MD PSO can then use each index as a unique dimension of the search space, where particles make inter-dimensional navigations to seek an optimum dimension (dbest) and the optimum solution in that dimension, $\hat{xy}^{dbest}$.

In this work, we apply the MD PSO technique for evolving fully-connected, feed-forward ANNs, the so-called MLPs. The reasoning behind this choice is that the MLP is the most widely used network type in this field and it offers conventional training methods such as BP. Suppose, for the sake of simplicity, that a range is defined for the minimum and maximum number of layers, $\{L_{min}, L_{max}\}$, and for the number of neurons in hidden layer l, $\{N^l_{min}, N^l_{max}\}$. Without loss of generality, assume that the size of both the input and output layers is determined by the problem and hence fixed. This yields that the architecture space can now be defined only by two range arrays, $R_{min} = \{N_I, N^1_{min}, \ldots, N^{L_{max}-1}_{min}, N_O\}$ for the minimum and $R_{max} = \{N_I, N^1_{max}, \ldots, N^{L_{max}-1}_{max}, N_O\}$ for the maximum number of neurons allowed for each layer of an MLP. The size of both arrays is naturally $L_{max}+1$, where the corresponding entries define the range of the lth hidden layer for all those MLPs which can have an lth hidden layer. The size of the input and output layers, $\{N_I, N_O\}$, is fixed and the same for all configurations in the architecture space, within which any l-layer MLP can be defined provided that $L_{min} \le l \le L_{max}$. $L_{min} \ge 1$ and $L_{max}$ can be set to any value meaningful for the problem encountered. The hash function then enumerates all potential MLP configurations into hash indices, starting from the simplest MLP with $L_{min}-1$ hidden layers, each of which has the minimum number of neurons given in $R_{min}$, to the most complex network with $L_{max}-1$ hidden layers, each of which has the maximum number of neurons given in $R_{max}$.
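The hash function itself is not spelled out in the paper; the sketch below shows one plausible enumeration consistent with the description above, listing every hidden-layer configuration between Rmin and Rmax ordered by depth and then by neuron counts, so that higher indices correspond to more complex networks. The function name and the exact ordering are assumptions for illustration only.

```python
from itertools import product

def enumerate_architectures(N_I, N_O, R_min, R_max, L_min, L_max):
    """Map hash indices (search-space dimensions) to MLP configurations.

    R_min / R_max hold the minimum / maximum neuron counts for the
    L_max - 1 possible hidden layers, e.g. R_min = [1, 1], R_max = [8, 4].
    Returns a list: index -> full layer-size tuple (N_I, h1, ..., N_O).
    """
    table = []
    for L in range(L_min, L_max + 1):              # number of layers (depth first)
        n_hidden = L - 1
        ranges = [range(R_min[l], R_max[l] + 1) for l in range(n_hidden)]
        for hidden in product(*ranges):            # simpler networks get lower indices
            table.append((N_I,) + hidden + (N_O,))
    return table

# The "limited" architecture space R1 used in Section 3 (N_I = 2, N_O = 1):
arch = enumerate_architectures(2, 1, R_min=[1, 1], R_max=[8, 4], L_min=1, L_max=3)
print(len(arch))            # 41 configurations: 1 + 8 + 8*4
print(arch[0], arch[-1])    # simplest: (2, 1); most complex: (2, 8, 4, 1)
```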

Let $N_h^l$ be the number of hidden neurons in layer l of an MLP with input and output layer sizes $N_I$ and $N_O$, respectively. The input neurons are merely fan-out units, since no processing takes place there. Let F be the activation function applied over the weighted inputs plus a bias, as follows:

$$y_k^{p,l} = F\left(s_k^{p,l}\right), \quad \text{where } s_k^{p,l} = \sum_j w_{jk}^{l-1}\, y_j^{p,l-1} + \theta_k^l \quad (4)$$

where $y_k^{p,l}$ is the output of the kth neuron of the lth hidden/output layer when pattern p is fed, $w_{jk}^{l-1}$ is the weight from the jth neuron in layer l-1 to the kth neuron in layer l, and $\theta_k^l$ is the bias value of the kth neuron of the lth hidden/output layer, respectively. The training mean square error, MSE, is formulated as,

$$MSE = \frac{1}{2\,P\,N_O} \sum_{p \in T} \sum_{k=1}^{N_O} \left(t_k^p - y_k^{p,O}\right)^2 \quad (5)$$

where $t_k^p$ is the target (desired) output and $y_k^{p,O}$ is the actual output of the kth neuron in the output layer, l=O, for pattern p in the training set T with size P, respectively. At a time t, suppose that particle a in the swarm, $\xi = \{x_1, \ldots, x_a, \ldots, x_S\}$, has the positional component formed as,

$$xx_a^{xd_a(t)}(t) = \left\{ \{w_{jk}^0\}, \{w_{jk}^1\}, \{\theta_k^1\}, \{w_{jk}^2\}, \{\theta_k^2\}, \ldots, \{w_{jk}^{O-1}\}, \{\theta_k^{O-1}\}, \{\theta_k^O\} \right\}$$

where $\{w_{jk}^l\}$ and $\{\theta_k^l\}$ represent the sets of weights and biases of layer l. Note that the input layer (l=0) contains only weights, whereas the output layer (l=O) has only biases. By means of such a direct encoding scheme, particle a represents all potential network parameters of the MLP architecture at the dimension (hash index) $xd_a(t)$. As mentioned earlier, the dimension range, $D_{min} \le xd_a(t) \le D_{max}$, within which MD PSO particles can make inter-dimensional jumps, is determined by the architecture space defined. Setting the MSE in (5) as the fitness function enables MD PSO to perform evolutions of both the network parameters and the architectures within its native process.
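To make the direct encoding concrete, the sketch below decodes a flat particle position of the appropriate dimensionality into per-layer weight and bias sets in the spirit of Eq. (4) and evaluates the MSE fitness of Eq. (5). The parameter ordering, the tanh activation and the helper names are illustrative assumptions; only the encoding idea (one scalar per connection weight and per hidden/output bias) follows the text above.

```python
import numpy as np

def param_count(layers):
    """Number of scalars encoded in a particle for an MLP with layer sizes
    (N_I, h1, ..., N_O): all weights plus the hidden/output biases."""
    weights = sum(layers[l] * layers[l + 1] for l in range(len(layers) - 1))
    biases = sum(layers[1:])                       # the input layer has no biases
    return weights + biases

def decode(position, layers):
    """Split a flat position vector into (W, theta) pairs, one per layer transition."""
    params, i = [], 0
    for l in range(len(layers) - 1):
        n_in, n_out = layers[l], layers[l + 1]
        W = position[i:i + n_in * n_out].reshape(n_in, n_out); i += n_in * n_out
        theta = position[i:i + n_out]; i += n_out
        params.append((W, theta))
    return params

def mse_fitness(position, layers, X, t):
    """Forward pass in the form of Eq. (4) (tanh assumed as F) and MSE of Eq. (5)."""
    y = X
    for W, theta in decode(position, layers):
        y = np.tanh(y @ W + theta)
    P, N_O = t.shape
    return np.sum((t - y) ** 2) / (2.0 * P * N_O)

# Example: evaluate a random particle at the dimension encoding a 2-8-4-1 MLP.
layers = (2, 8, 4, 1)
pos = np.random.default_rng(1).uniform(-1, 1, param_count(layers))
X = np.random.default_rng(2).uniform(-1, 1, (10, 2))    # dummy patterns
t = np.sign(X[:, :1])                                    # dummy targets
print(mse_fitness(pos, layers, X, t))
```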

3. Experimental Results


In order to determine which network architectures are (near-) optimal for a given problem, we apply exhaustive BP training over every network configuration in the architecture space defined. As mentioned earlier, BP is nothing but a gradient descent algorithm, and thus a single run is susceptible to getting trapped in the nearest local minimum. However, performing it several times with randomized initial parameters eventually increases the chance of converging to (a close vicinity of) the global minimum in the error space; it is therefore performed a significantly large number of times (e.g. K=500), so that for a particular architecture the minimum error obtained among all trials can be assumed to be the (near-) optimal score achievable with that network. Note that even though K is kept quite high, there is still no guarantee of converging to the global optimum with BP; however, the idea is to obtain the "trend" of the best performances achievable with every configuration under equal training conditions. In this way the optimality of the networks evolved by MD PSO can be justified. In order to show the optimality of the network configurations evolved by MD PSO, we first use a "limited" architecture space $R^1$ ($R^1_{min} = \{N_I, 1, 1, N_O\}$ and $R^1_{max} = \{N_I, 8, 4, N_O\}$) containing the simplest 1-, 2- or 3-layer MLPs with $L^1_{min} = 1$ and $L^1_{max} = 3$; it contains 41 networks.
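The exhaustive BP baseline described above amounts to a multi-restart loop over every configuration in the architecture space. A schematic version is sketched below; `train_with_bp` is a hypothetical helper standing in for any standard back-propagation trainer that returns the final training MSE, and the data arguments are placeholders.

```python
K = 500  # number of random restarts per configuration, as in the text

def best_bp_error(architectures, X, t, train_with_bp, K=K):
    """For each configuration, keep the minimum MSE over K randomly
    initialized BP runs; the per-index minima give the 'trend' curve."""
    best = {}
    for idx, layers in enumerate(architectures):
        best[idx] = min(train_with_bp(layers, X, t) for _ in range(K))
    return best
```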

Furthermore, we have selected the two-spirals problem, one of the most challenging problems in this field, which can only be solved by a few "optimal" configurations. In all experiments in this section, we perform 100 MD PSO runs, each of which terminates at the end of 1000 epochs (iterations). The two-spirals problem, proposed by Alexis Wieland [16], is highly non-linear and exhibits further interesting properties. For instance, the 2D data has some temporal characteristics, since the radius and angle of the spiral vary with time. The error space is highly multi-modal with many local minima, so methods such as BP encounter severe problems in reducing the error [3]. The data set consists of 194 patterns (2D points), 97 samples in each of the two classes (spirals), with $N_I = 2$ and $N_O = 1$. It is now used as a benchmark for ANNs by many researchers.
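For reference, the two-spirals data can be generated with the widely used CMU benchmark formulation (97 points per spiral, 194 patterns in total). The exact radius/angle parameters below follow that common recipe and are an assumption; the authors' data set may have been produced slightly differently.

```python
import numpy as np

def two_spirals(points_per_class=97):
    """Standard two-spirals benchmark: radius and angle vary with the sample
    index, and the second spiral is the point-wise mirror of the first."""
    i = np.arange(points_per_class)
    angle = i * np.pi / 16.0
    radius = 6.5 * (104 - i) / 104.0
    x = radius * np.sin(angle)
    y = radius * np.cos(angle)
    X = np.vstack([np.column_stack([x, y]),        # class 0 (first spiral)
                   np.column_stack([-x, -y])])     # class 1 (mirrored spiral)
    t = np.concatenate([np.zeros(points_per_class), np.ones(points_per_class)])
    return X, t

X, t = two_spirals()
print(X.shape, t.shape)   # (194, 2) (194,)
```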


Figure 1: Error (MSE) statistics from exhaustive BP training (top) and the dbest histogram from 100 MD PSO evolutions (bottom) for the two-spirals problem.

It is obvious that none of the configurations yields a sufficiently low error value with BP training; for most of the configurations (hash indices) BP can at best perform a linear approximation, resulting in MSE ≈ 0.49, and only a few 3-layer MLPs (with indices 32, 33, 38 and 41) are able to reduce the MSE to 0.3. MD PSO shows a performance similar to BP, achieving 0.349 ≤ mMSE(dbest) ≤ 0.371 for dbest = 25, 32, 33, 38 and 41. These are evidently the best possible MLP configurations in $R^1$, to which a large majority of the MD PSO runs converged.

4. Conclusions
In this paper, we proposed a novel and unsupervised design methodology for ANNs based on MD PSO, which offers a generic solution to a common drawback of the family of PSO methods, namely the a priori fixation of the search-space dimension. To this end, MD PSO defines a new particle formation and embeds the ability of dimensional navigation into the core of the process. It basically allows particles to make inter-dimensional 'passes' with a dedicated PSO process whilst performing regular positional updates in every dimension they visit. Such flexibility negates the requirement of setting the dimension in advance, since swarm particles can now converge to the global solution at the optimum dimension in a simultaneous manner.

With the proper adaptation of the native MD PSO process, the proposed method can evolve the optimum network within an architecture space for a particular problem. The automatic design methodology is generic and applicable to any type of ANN in an architecture space of varying size and properties, as long as a proper hash function enumerates all configurations in the architecture space, with respect to their complexity, into hash indices that represent the dimensions of the solution space over which MD PSO seeks optimality.

References
[1] P. J. Angeline, G. M. Saunders, and J. B. Pollack, "An evolutionary algorithm that constructs recurrent neural networks", IEEE Trans. Neural Networks, vol. 5, pp. 54–65, Jan. 1994.
[2] T. Back and H.-P. Schwefel, "An overview of evolutionary algorithms for parameter optimization", Evolutionary Computation, vol. 1, pp. 1–23, 1993.
[3] E. Baum and K. Lang, "Constructing Hidden Units Using Examples and Queries", in Advances in Neural Information Processing Systems, vol. 3, pp. 904–910, San Mateo, CA, 1991.
[4] N. Burgess, "A constructive algorithm that converges for real-valued input patterns", Int. Journal of Neural Systems, vol. 5, no. 1, pp. 59–66, 1994.
[5] M. Carvalho and T. B. Ludermir, "Particle Swarm Optimization of Neural Network Architectures and Weights", in Proc. of the 7th Int. Conf. on Hybrid Intelligent Systems, pp. 336–339, Washington DC, Sep. 17–19, 2007.
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, MA, 1996.
[7] M. Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks", Neural Computation, vol. 2, no. 2, pp. 198–209, 1990.
[8] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, pp. 1–25, 1989.
[9] J. Kennedy and R. Eberhart, "Particle swarm optimization", in Proc. of IEEE Int. Conf. on Neural Networks, vol. 4, pp. 1942–1948, Perth, Australia, 1995.
[10] M. Meissner, M. Schmuker, and G. Schneider, "Optimized Particle Swarm Optimization (OPSO) and its application to artificial neural network training", BMC Bioinformatics, 7:125, 2006.
[11] M. G. Omran, A. Salman, and A. P. Engelbrecht, Particle Swarm Optimization for Pattern Recognition and Image Processing, Springer, Berlin, 2006.
[12] R. Reed, "Pruning algorithms - A survey", IEEE Trans. Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
[13] J. Salerno, "Using the particle swarm optimization technique to train a recurrent neural model", in Proc. IEEE Int. Conf. on Tools with Artificial Intelligence, pp. 45–49, 1997.
[14] X. Yao and Y. Liu, "A new evolutionary system for evolving artificial neural networks", IEEE Trans. on Neural Networks, vol. 8, no. 3, pp. 694–713, May 1997.
[15] X. Yao, "Evolving artificial neural networks", Proc. of the IEEE, vol. 87, no. 9, pp. 1423–1447, Sep. 1999.
[16] Z. H. Zhou, S. F. Chen, and Z. Q. Chen, "FANNC: A fast adaptive neural network classifier", Knowledge and Information Systems, pp. 115–129, 2000.