A hybrid algorithm for artificial neural network training

Engineering Applications of Artificial Intelligence

Contents lists available at SciVerse ScienceDirect — journal homepage: www.elsevier.com/locate/engappai

A hybrid algorithm for artificial neural network training

Masoud Yaghini*, Mohammad M. Khoshraftar, Mehdi Fallahi

Rail Transportation Department, School of Railway Engineering, Iran University of Science and Technology, Resalat Sq., Hengam Street, Tehran, Iran


Article history: Received 30 August 2011; received in revised form 25 January 2012; accepted 31 January 2012.

Abstract: Artificial neural network (ANN) training is one of the major challenges in using a prediction model based on an ANN. Gradient-based algorithms are the most frequently used training algorithms, but they have several drawbacks. The aim of this paper is to present a method for training ANNs. The abilities of metaheuristics and greedy gradient-based algorithms are combined to obtain a hybrid of improved opposition-based particle swarm optimization and a backpropagation algorithm with a momentum term. Opposition-based learning and random perturbation help diversify the population during the iterations. Time-varying parameters improve the search ability of standard PSO, and a constriction factor guarantees particle convergence. Since several contingent local minima may occur in the weight space, a new cross-validation method is proposed to prevent overfitting. The effectiveness and efficiency of the proposed method are compared with several other well-known ANN training algorithms on various benchmark problems. © 2012 Elsevier Ltd. All rights reserved.

Keywords: Artificial neural networks; Hybrid training algorithm; Particle swarm optimization; Backpropagation algorithm; Cross validation; Time-varying parameter

1. Introduction

Artificial neural networks (ANNs) are currently being used in a variety of applications with great success. Their first main advantage is that they do not require a user-specified problem-solving algorithm (as is the case with classic programming); instead, they "learn" from examples, much like human beings. Their second main advantage is that they possess an inherent generalization ability: they can identify and respond to patterns which are similar but not identical to the ones with which they have been trained (Vosniakos and Benardos, 2007). The ANN is one of the most important data mining techniques, used for both supervised and unsupervised learning (Yaghini et al., in press). Training an ANN is a complex task of great importance in supervised learning problems. Most ANN training algorithms use gradient-based search. These methods have the advantage of a directed search, in which the weights are always updated in a way that minimizes the error; this is called the ANN learning process. However, these algorithms have several negative aspects, such as dependency on a learning-rate parameter, network paralysis, slowing down by an order of magnitude for every extra (hidden) layer added, and a complex, multimodal error space. Therefore, these algorithms most likely become trapped in local minima, making them entirely

Abbreviations: ANN, artificial neural network; PSO, particle swarm optimization; BPA, backpropagation algorithm; EA, evolutionary algorithm; GA, genetic algorithm; SI, swarm intelligence; CEP, classification error percentage; ACO, ant colony optimization. * Corresponding author. Tel.: +98 21 77240117; fax: +98 21 77491030; mobile: +98 9122963777. E-mail address: [email protected] (M. Yaghini).

dependent on initial weight settings, which means these algorithms cannot guarantee universal usefulness (Kiranyaz et al., 2009). The global search strategy of metaheuristics enables them to avoid being trapped in secondary performance peaks and can therefore provide effective and robust solutions to the ANN training problem (Castellani and Rowlands, 2009). They have the advantage of being applicable to any type of ANN, feedforward or not, with any activation function, differentiable or not (Kiranyaz et al., 2009). They provide acceptable solutions in a reasonable time for hard and complex problems. They are particularly useful for dealing with large complex problems that generate many local optima, and they are less likely to be trapped in local minima than traditional gradient-based search algorithms. They do not depend on gradient information and are thus quite suitable for problems where such information is unavailable or very costly to obtain or estimate (Talbi, 2009). The learning algorithm is an important aspect of an ANN-based model. In this article, a literature review and categorization of ANN learning algorithms are presented. Then, a method for training ANNs is proposed. The proposed method combines the global search strategy of improved opposition-based particle swarm optimization (PSO) with the local search ability of the traditional backpropagation algorithm (BPA) with a momentum term. Opposition-based learning and random perturbation are the two diversification components of the algorithm. Time-varying social and cognitive components improve the search ability of the algorithm, and the constriction factor is another parameter that guarantees convergence of the particles. When training an ANN prediction model, overfitting, i.e., learning more than an adequate specification of the training data, is one of the common problems, especially with large data sets (Prechelt, 1994).
This problem has an adverse effect on the prediction model's ability to forecast new patterns and causes incorrect or

0952-1976/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2012.01.023

Please cite this article as: Yaghini, M., et al., A hybrid algorithm for artificial neural network training. Eng. Appl. Artif. Intel. (2012), doi:10.1016/j.engappai.2012.01.023


lower-than-expected predictions (Yaghini et al., in press). Several approaches have been proposed to solve this problem, none of which considers the multimodality of the weight space. Therefore, a new cross-validation method is proposed. The organization of this paper is as follows. Section 2 presents a literature review of previous research. In Section 3, the components of the proposed algorithm, the criterion for accuracy evaluation, the proposed cross-validation method, and the steps of the algorithm are explained. In Section 4, the experimental results on the benchmark problems are presented. In Section 5, conclusions and some hints for future research are given.

2. Literature review

Metaheuristic algorithms for training ANN models can be divided into single-solution-based and population-based algorithms (S-Metaheuristics and P-Metaheuristics). In training ANNs with S-Metaheuristics, Tabu Search (TS) (Battiti and Tecchiolli, 1995; Sexton et al., 1998) and Simulated Annealing (SA) (Treadgold and Gedeon, 1998; Chalup and Maire, 1999) have been utilized. Among P-Metaheuristics, ANN training approaches can be divided into evolutionary algorithms (EAs) and swarm intelligence (SI) algorithms. Learning and evolution are two fundamental forms of adaptation. There has been great interest in combining learning and evolution with ANNs, and combining ANNs and EAs can lead to significantly better intelligent systems than relying on ANNs or EAs alone (Yao, 1999). In training ANNs with EAs, Porto et al. (1995) and Mandischer (2002) compare EAs and gradient-based algorithms. Sexton and Dorsey (2000) compare the BPA with the genetic algorithm (GA) for ANN training over a collection of 10 benchmark real-world data sets. Cantu-Paz and Kamath (2005) present an empirical evaluation of eight combinations of GAs and ANNs on 15 public-domain and artificial data sets to identify the methods that consistently produce accurate classifiers that generalize well. They use a predefined parameter setting for the GA and indicate that the GA does not depend on the initial random weights for finding superior solutions. Alba and Chicano (2004) and Malinak and Jaksa (2007) combine EAs with gradient-based local search algorithms to obtain better results. Another class of P-Metaheuristics used as training algorithms is SI. These algorithms originate from the social behavior of species that share a common target (for example, competing for food) (Talbi, 2009). Among SI algorithms, PSO is one of the most successful.
Unlike the GA, PSO has no complicated evolutionary operators such as crossover, selection and mutation, and it is highly dependent on stochastic processes (Kiranyaz et al., 2009). Kennedy and Eberhart (1995) introduced PSO. Engelbrecht and Bergh (2000) propose a method to employ PSO in a cooperative configuration, achieved by splitting the input vector into several sub-vectors, each of which is optimized cooperatively in its own swarm. Mendes et al. (2002) and Gudise and Venayagamoorthy (2003) use PSO to train ANNs; however, they use a very simple problem that does not reveal the superiority of their method. Zhao et al. (2005) present a modified PSO, which adjusts the trajectories (positions and velocities) of the particles based on the best positions visited earlier by them and other particles, and which incorporates a population diversity method to avoid premature convergence. Carvalho and Ludermir (2007) use a methodology entirely based on PSO and apply it to benchmark classification problems from the medical field; the results obtained by this methodology lie between the results of other well-studied techniques such as the GA and SA. Carvalho and Ludermir (2006) analyze the use of the PSO algorithm and two variants with a local search operator for neural

network training, and to evaluate these algorithms they apply them to three medical-field benchmark classification problems. Al-Kazemi and Mohan (2002) use a multi-phase PSO algorithm (MPPSO), which simultaneously evolves multiple groups of particles that change their search criterion when changing phases, and which incorporates hill-climbing. In addition to modifications of the basic PSO algorithm, other PSO variations have also been developed. Among these, the variations that incorporate opposition-based learning into PSO can deliver better performance than standard PSO. Opposition-based learning was first introduced by Tizhoosh (2005) and later applied to PSO; it is based on the concept of opposite points and opposite numbers. Han and He (2007) propose a modified PSO algorithm for noisy problems, which utilizes opposition-based learning. Wu et al. (2008) propose an opposition-based comprehensive learning PSO, which utilizes opposition-based learning for swarm initialization and exemplar selection. Omran (2009) presents an improved PSO that applies a simplified form of opposition-based learning: the particle with the worst fitness is replaced by its opposite particle. Opposition-based learning is applied to only one particle instead of the whole swarm and is not used at initialization time. The hybrid improved opposition-based particle swarm optimization and genetic algorithm (HIOPGA) method is an ANN training algorithm that combines the abilities of two population-based algorithms (Yaghini et al., 2011). The algorithm starts training with a population of particles; during its iterations, when some ANNs (particle positions) in the d-dimensional space cannot be improved by PSO, a sub-population of the trapped particles is established and sent to the GA, where it is evolved by the GA crossover and mutation operators.
This process is repeated until the algorithm meets the termination condition. The authors compare their proposed method with the backpropagation algorithm on several benchmark problems. Apart from PSO, researchers have employed other SI algorithms, none of which is as successful as PSO. Blum and Socha (2005) present a continuous version of the Ant Colony Optimization (ACOR) algorithm. Chen et al. (2008) propose a new hybrid algorithm based on the artificial fish swarm algorithm and PSO; they compare their proposed algorithms with a specialized gradient-based algorithm for ANN training. Karaboga et al. (2007) propose an Artificial Bee Colony (ABC) algorithm for classification purposes and compare its performance with the traditional BPA and the GA.

3. The proposed algorithm

In this section, the proposed hybrid algorithm for optimizing the weights of ANN prediction models is explained. It is a combination of local search and global search algorithms. In the ANN prediction models, fully connected layered feedforward networks are employed. Except for the input units, each unit in the network has a bias.

3.1. The proposed particles

A good, detailed description of the basic PSO algorithm can be found in Yu et al. (2008). The proposed algorithm combines PSO and the BPA (Fig. 1). For simplicity, in this figure an ANN with one hidden layer, three input units, one hidden unit and two output units is considered.

3.2. Criterion for accuracy evaluation

For classification problems, as shown in Eqs. (1) and (2), the Classification Error Percentage (CEP) is utilized to evaluate the accuracy. Let $\vec{o}_p = (o_{p1}, \ldots, o_{pn})$ and $\vec{t}_p = (t_{p1}, \ldots, t_{pn})$, where $n$ is the number of



(Clerc and Kennedy, 2002). $K$ is calculated using Eq. (7), where $\phi(t) = C_1(t) + C_2(t)$ and $\phi(t) \ge 4$:

$$C_1(t+1) = (t/m) \times (C_{1m} - C_1(t)) + C_1(t) \quad (5)$$

$$C_2(t+1) = (t/m) \times (C_{2m} - C_2(t)) + C_2(t) \quad (6)$$

$$K(t) = \frac{2}{\left| 2 - \phi(t) - \sqrt{\phi(t)^2 - 4\phi(t)} \right|} \quad (7)$$
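Eqs. (5)–(7) can be implemented directly; the Python sketch below (helper names are ours, not from the paper) shows the coefficient updates and the Clerc–Kennedy constriction factor:

```python
import math

def time_varying(c_current, c_final, t, m):
    """Eqs. (5)-(6): drive an acceleration coefficient linearly from its
    current value toward its final value over m iterations."""
    return (t / m) * (c_final - c_current) + c_current

def constriction(c1, c2):
    """Eq. (7): Clerc-Kennedy constriction factor, defined for phi >= 4."""
    phi = c1 + c2
    if phi < 4:
        raise ValueError("constriction factor requires phi = c1 + c2 >= 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))
```

With the common setting $c_1 = c_2 = 2.05$ (so $\phi = 4.1$), the factor evaluates to roughly 0.73, the value usually quoted for constricted PSO.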

3.4. The opposition-based learning components

Fig. 1. (a) An ANN structure, (b) the particle for PSO

ANN output units, and $o_{pi}$ and $t_{pi}$ are the predicted and target values of output unit $i$; $\vec{p} = (p_1, \ldots, p_k)$ is an input pattern, where $k$ is the number of ANN inputs, and $P$ is the number of patterns:

$$c(\vec{p}) = \begin{cases} 1 & \text{if } \vec{o}_p \ne \vec{t}_p \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

$$\mathrm{CEP} = 100 \times \sum_{p=1}^{P} c(\vec{p}) \, / \, P \quad (2)$$

For approximation problems, the Normalized Root Mean Squared Error (NRMSE) is employed, as shown in Eqs. (3) and (4), where $N$ is the number of output units, $P$ is the number of patterns, and $o_{pi}$ and $t_{pi}$ are the predicted and target values of the $i$th output unit for pattern $p$:

$$\mathrm{RMSE} = \sqrt{ \sum_{p=1}^{P} \sum_{i=1}^{N} (t_{pi} - o_{pi})^2 \, / \, (P \times N) } \quad (3)$$

$$\mathrm{NRMSE} = 100 \times \frac{\mathrm{RMSE}}{\sum_{p=1}^{P} \sum_{i=1}^{N} t_{pi} \, / \, (P \times N)} \quad (4)$$
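Eqs. (1)–(4) can be read off directly in code; this Python sketch (helper names are ours) computes both error measures for lists of per-pattern output and target vectors:

```python
import math

def cep(outputs, targets):
    """Eqs. (1)-(2): percentage of patterns whose predicted output vector
    differs from the target vector."""
    wrong = sum(1 for o, t in zip(outputs, targets) if o != t)
    return 100.0 * wrong / len(targets)

def nrmse(outputs, targets):
    """Eqs. (3)-(4): RMSE over all P*N output values, normalized by the
    mean target value and expressed as a percentage."""
    pn = sum(len(t) for t in targets)                      # P * N
    sq_err = sum((ti - oi) ** 2
                 for o, t in zip(outputs, targets)
                 for oi, ti in zip(o, t))
    rmse = math.sqrt(sq_err / pn)
    mean_target = sum(ti for t in targets for ti in t) / pn
    return 100.0 * rmse / mean_target
```

For example, `cep([[1, 0], [0, 1]], [[1, 0], [1, 0]])` misclassifies one of two patterns and returns 50.0.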

The proposed algorithm implements the opposition-based learning components in two different ways. First, after population initialization, to start with a better population, the algorithm calculates the opposite position and velocity of each particle; then, for each particle, the better one (the current particle or its opposite) is inserted into the population. Second, during the iterations, when the algorithm finds a new velocity and position for a particle, the opposite position and velocity of that particle are calculated, and the better one is inserted into the current generation. When creating opposite particles, an important question arises: what should the velocity of these particles be? One can keep the same velocity as the original particle, randomly reinitialize the velocity, or calculate the opposite of the velocity of the original particle. The velocity of the original particle cannot be used, because it was calculated using the current position of the original particle and would be invalid for the opposite particle. Reinitializing the opposite particle's velocity randomly is not an inviting option, because the experience gained by the original particle would be discarded. Other researchers have not investigated this question and use random initialization of the velocity. Here, the opposite velocity of the original particle is used, since it is believed that opposite velocities, like opposite positions, yield better performance. The opposite velocity is calculated in exactly the same way as the opposite position. The pseudocode of the opposite-particle calculation is illustrated in Fig. 2, where $[x_{min}, x_{max}]$ is the initial interval of the particle position (the initial weights of the ANN) and $[v_{min}, v_{max}]$ is the velocity interval. The position and velocity of the $i$th particle at iteration $t$ are $X_i(t) = (x_{i1}(t), \ldots, x_{id}(t))$ and $V_i(t) = (v_{i1}(t), \ldots, v_{id}(t))$.
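A minimal sketch of the opposite-particle calculation described above (function names are ours; the opposite of a value x in an interval [a, b] is a + b − x):

```python
def opposite_vector(vec, lo, hi):
    """Opposite point of vec within [lo, hi], applied dimension-wise."""
    return [lo + hi - v for v in vec]

def opposite_particle(position, velocity, xmin, xmax, vmin, vmax):
    """Opposite particle as chosen in this paper: the opposite position in
    [xmin, xmax] together with the opposite (not re-randomized) velocity
    in [vmin, vmax]."""
    return (opposite_vector(position, xmin, xmax),
            opposite_vector(velocity, vmin, vmax))
```

For instance, with positions in [−1, 1], the opposite of 0.2 is −0.2, mirroring the particle through the center of the interval.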

3.3. The improved PSO

Although PSO is capable of locating a good solution at a significantly fast rate, its ability to fine-tune the optimum solution is comparatively weak, mainly due to the lack of diversity at the end of the evolutionary process. To improve the search ability of standard PSO, time-varying parameters are utilized. Suppose that $t$ and $m$ are the current and final iteration numbers, and $C_1(t)$, $C_2(t)$, $C_1(m)$ and $C_2(m)$ are the cognitive and social components of the current and final iterations. In the first version of PSO, a single weight, $c = c_1 = c_2$, called the acceleration constant, was used instead of the two distinct weights used in this paper; however, the latter offers better control over the algorithm, leading to its predominance over the first version (Parsopoulos and Vrahatis, 2010). The time-varying parameters are calculated using Eqs. (5) and (6); if a parameter reaches its final value, it is reset to its initial value. Using the time-varying parameters, one can use a large cognitive component and a small social component at the beginning of the search, which guarantees that particles move around the search space rather than toward the best population position; later, a small cognitive component and a large social component allow the particles to converge to the global optimum (Ratnaweera et al., 2004). $K$ is another parameter, utilized along with these, called the constriction coefficient, with the hope that it can ensure that the PSO converges

3.5. Random perturbation

PSO can quickly find a good solution but sometimes suffers from stagnation without improvement (Ratnaweera et al., 2004). Therefore, to avoid this drawback of basic PSO, the velocities of particles are reset in order to give them new momentum. Under this strategy, when the global best position does not improve as the number of generations increases, each particle $i$ is selected from the population with a predefined probability (0.5 in this study), and

Fig. 2. Pseudocode of opposite particle calculation.


then a random perturbation is added to each dimension $v_{id}$ (selected with a predefined probability, 0.5 in this study) of the velocity vector $v_i$ of the selected particle $i$. The velocity resetting is presented in Fig. 3, where $r_1$, $r_2$ and $r_3$ are separately generated uniformly distributed random numbers in the range (0, 1), and $v_{max}$ is the maximum magnitude of the random perturbation in each dimension of the selected particle.

3.6. The proposed cross validation method

The training error of an ANN may decrease as the training process progresses. However, at some point, usually in the later stages of training, the ANN may start to take advantage of idiosyncrasies in the training data. Consequently, its generalization performance may start to deteriorate even though the training error continues to decrease (Islam et al., 2009). Early stopping in cross validation (Prechelt, 1998) is one common approach to avoiding overfitting. In this method, the training data is divided into training and validation sets. The training process does not terminate when the training error is minimized; instead, it stops when the validation error starts to increase. This termination criterion can be deceptive, because the validation error surface may contain several local minima. In the proposed algorithm, to decrease the negative effect of a multimodal validation space on the model's generalization ability, a simple criterion for terminating the training process of the ANN is used. At the end of every $L$ training iterations, the validation error is evaluated. If the validation error has increased $T$ successive times relative to the preceding check (independent of how large the increases actually are), the training process is terminated. The idea behind this criterion is to stop training when the validation error increases not just once but during $T$ consecutive checks. It can be assumed that such increases indicate the beginning of final overfitting, not just an intermittent one. Fig. 4 presents the pseudocode of the proposed cross validation method.
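A minimal sketch of this stopping rule (function and parameter names are ours): the validation error is sampled every L iterations, and training stops after T consecutive rises between successive checks:

```python
def should_stop(val_error_history, L, T):
    """Return True once the validation error, checked every L iterations,
    has risen T times in a row (regardless of the size of each rise)."""
    checks = val_error_history[::L]   # one sample per strip of L iterations
    consecutive_rises = 0
    for prev, cur in zip(checks, checks[1:]):
        consecutive_rises = consecutive_rises + 1 if cur > prev else 0
        if consecutive_rises >= T:
            return True
    return False
```

With L = 1 and T = 3, the history [5, 4, 3, 4, 5, 6] triggers a stop after three consecutive rises, while a monotonically decreasing history never does.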

Fig. 3. Pseudocode of random perturbation.
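The velocity-resetting strategy of Section 3.5 (Fig. 3) can be sketched as follows; the symmetric sign of the perturbation and the function name are our assumptions:

```python
import random

def perturb_velocity(velocity, vmax, p_select=0.5, p_dim=0.5):
    """Sketch of the random perturbation: a particle is chosen with
    probability p_select, and each dimension is then shifted, with
    probability p_dim, by a random amount of magnitude at most vmax
    (the sign choice is our assumption)."""
    if random.random() >= p_select:          # r1: particle not selected
        return list(velocity)
    return [v + (2.0 * random.random() - 1.0) * vmax  # r2, r3 drive the shift
            if random.random() < p_dim else v
            for v in velocity]
```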

3.7. Termination criterion

The algorithm uses three criteria simultaneously as its termination condition. First, termination is based on the training error: at the end of each iteration $t$, if the error on the training patterns is less than $\varepsilon$, the training process terminates (for the classification problems $\varepsilon = 10^{-2}$; for the approximation problems $\varepsilon = 10^{-6}$). Second, if the number of iterations becomes greater than a predefined number, the training process is terminated. Third, according to the proposed cross validation method, if the algorithm meets the overtraining condition, the training process is terminated.

3.8. The overall structure of the proposed hybrid algorithm

The steps of the hybrid improved opposition-based particle swarm optimization (IOPSO) and BPA for artificial neural network training are as follows:

Step 1) Specify the starting position and velocity of the particles according to the initial parameter values. Set the iteration counter to zero (iter = 0), and set the counter of no-improvements after BPA training on gbest to zero (gbestCounterAfterBpTrain = 0).
Step 2) To establish a better population, calculate the opposite position and velocity of the particles, and insert the better particle into the population (pseudocode in Fig. 2).
Step 3) For each particle, specify the best personal position and count the number of no-improvements in it (pbest_i and pbest_iCounter). In addition, specify the best global position and count the number of no-improvements in it (gbest and gbestCounter).
Step 4) If Etrain(gbest(iter)) < ε, go to Step 23; otherwise, go to Step 5.
Step 5) Calculate the validation error of the best global position (Eval(gbest(iter))).
Step 6) According to Eqs. (5)–(7), calculate C1(iter+1), C2(iter+1) and K(iter+1).
Step 7) Calculate the new position and velocity of the particles (training by PSO).
Step 8) For each particle, specify the best personal position and count the number of no-improvements in it (pbest_i and pbest_iCounter). Moreover, specify the best global position and count the number of no-improvements in it (gbest and gbestCounter).
Step 9) If Etrain(gbest(iter)) < ε, go to Step 23; otherwise, go to Step 10.
Step 10) Perform the proposed cross validation.
Step 11) If the number of no-improvements in the best global position is greater than the maximum allowed number (gbestCounter > Maxgbest), go to Step 12 (backpropagation training condition).

Fig. 4. Pseudocode of the proposed cross validation method.


Step 12) Set the backpropagation counter to zero (BPCounter = 0).
Step 13) Train the best global particle using the backpropagation algorithm with the momentum term; then increase the backpropagation counter (BPCounter++).
Step 14) If BPCounter > MaxNumberofBpTrain, go to Step 16; otherwise, go to Step 15.
Step 15) If the training error of the global particle decreased after backpropagation training, set the counter of no-improvements after BP training on gbest to zero; otherwise, increase the counter (gbestCounterAfterBpTrain++).
Step 16) If Etrain(gbest(iter)) < ε, go to Step 23; otherwise, go to Step 17.
Step 17) If gbestCounterAfterBpTrain > 5, call random perturbation (pseudocode in Fig. 3).
Step 18) Increase the iteration counter (iter++).
Step 19) If the iteration counter is greater than the maximum allowed number (iter > m), go to Step 23; otherwise, go to Step 20.
Step 20) If the remainder of the iteration counter divided by the proposed cross-validation strip length (iter mod L) is zero, go to Step 21; otherwise, go to Step 2.
Step 21) If the validation error of the global position in the current check is greater than in the previous check (Eval(gbest(iter)) > Eval(gbest(iter − 1))), increase the overtraining counter (Tcounter++); otherwise, set it to zero (Tcounter = 0).
Step 22) If the overtraining counter is greater than the maximum allowed number (Tcounter > T), go to Step 23 (termination because of overfitting); otherwise, go to Step 2.
Step 23) Stop the training.
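The steps above can be condensed into the following high-level Python skeleton. This is a sketch only: all callbacks (evaluate, pso_step, bp_step, opposite) stand in for components the paper defines elsewhere, and the validation handling is simplified:

```python
def hybrid_train(swarm, evaluate, pso_step, bp_step, opposite,
                 eps=1e-2, max_iter=100, max_gbest=10, max_bp=5, L=5, T=3):
    """Condensed IOPSO_BPA loop (Steps 1-23); helper names are ours."""
    gbest, stall, rises, prev_val = None, 0, 0, float("inf")
    for it in range(1, max_iter + 1):
        # Step 2: keep the better of each particle and its opposite
        swarm = [min(p, opposite(p), key=evaluate) for p in swarm]
        swarm = pso_step(swarm)                       # Step 7: PSO update
        best = min(swarm, key=evaluate)
        if gbest is None or evaluate(best) < evaluate(gbest):
            gbest, stall = best, 0
        else:
            stall += 1
        if evaluate(gbest) < eps:                     # Steps 4/9/16: error target met
            break
        if stall > max_gbest:                         # Steps 11-15: BPA refinement
            for _ in range(max_bp):
                gbest = bp_step(gbest)
            stall = 0
        if it % L == 0:                               # Steps 20-22: cross validation
            val = evaluate(gbest)   # placeholder for the validation error
            rises = rises + 1 if val > prev_val else 0
            prev_val = val
            if rises > T:                             # stop on overfitting
                break
    return gbest                                      # Step 23
```

On a toy objective (particles are scalars, evaluate = abs, the PSO step halves every particle), the loop drives the best solution below eps.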

4. Experimental results

4.1. Benchmark problems

In this section, a comparison of the performance of the proposed algorithm and three other algorithms on several well-known benchmark problems is presented. Table 1 summarizes the problem specifications, which exhibit considerable variety in the numbers of patterns, attributes and classes. Detailed descriptions of these problems can be obtained from the University of California, Irvine, Machine Learning Repository (UCI Repository of machine learning databases, 2011). In this work, the data of each problem is partitioned into three sets: a training set, a validation set and a testing set; the numbers of patterns in these sets are given in Table 1. The training set is used for weight modification during ANN training, the validation set is used for stopping the training process, and the testing set is used for judging the prediction ability of a trained ANN. These partitions were inspired by suggestions in benchmarking methodologies (Prechelt, 1995, 1996).


For each benchmark problem, the entropy is calculated according to Eq. (8), where $P(C_i)$ is the probability of class $C_i$ in the data set, determined by dividing the number of patterns of class $C_i$ by the total number of patterns in the data set. The entropy of a data set is the average amount of information needed to identify the class label of a pattern in the data set. In fact, entropy captures the class distribution and measures the impurity of the data set, and it can be considered a criterion for the difficulty of the problems:

$$E = -\sum_i P(C_i) \log_2 P(C_i) \quad (8)$$

For instance, the Gene data set has three different class labels. The total number of patterns is 3175, with 762 patterns belonging to class 1, 756 to class 2 and 1648 to class 3; the class probabilities are 0.24, 0.24 and 0.52, respectively. According to Eq. (8), the entropy of the Gene data set is

$$E = -(0.24 \times (-2.05) + 0.24 \times (-2.05) + 0.52 \times (-0.95)) \approx 1.50$$
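Eq. (8) can be checked directly; the following sketch (helper names are ours) reproduces the Gene figure from the raw class counts:

```python
import math

def entropy(class_counts):
    """Eq. (8): Shannon entropy of a class distribution given raw counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# Gene data set: 762 / 756 / 1648 patterns in classes 1 / 2 / 3
gene_entropy = entropy([762, 756, 1648])   # ~1.48, listed as 1.50 in Table 1
```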

4.2. Comparison with implemented methods

The algorithms are implemented in the Java programming language, and a personal computer with an Intel(R) Pentium(R) 2.66 GHz CPU, a 32-bit Windows 7 Ultimate operating system, and 4.0 GB of installed memory (RAM) is used to obtain all the results. To reduce the effect of random parameter initialization on the prediction ability of the models, each model is independently run 40 times; the averages and standard deviations of the results are presented in Table 2. The BPA needs much more time and many more iterations to converge, but produces less dispersed solutions. The random nature of the metaheuristic algorithms causes much more dispersed solutions; e.g., in the Gene problem, the test-error standard deviation is 0.71 for the BPA, 8.01 for the combination of two metaheuristic algorithms (IOPSO_GA), and 2.49 for the combination of local and global search in the proposed method. According to Fig. 5 and Table 1, the proposed algorithm is sensitive to the number of input attributes, the problem size and the entropy. For example, among the benchmark problems, the Gene problem has the greatest number of input attributes; it is also a large-scale problem with fairly high entropy. As illustrated in Table 2, the proposed algorithm does not show good generalization ability for this problem, and the same holds for the Horse and Diabetes problems. Fig. 6(a)–(h) presents CPU time versus testing error for the best-so-far network from the beginning of each algorithm. As these figures reveal, the combination of the global search ability of the improved opposition-based PSO and the local search ability of the BPA with the momentum term gives the best result. As Table 2 discloses, the mean and standard deviation over 40 independent runs of the algorithms are calculated for each of the eight benchmark problems. The small standard deviations of IOPSO_BPA in comparison with the other three

Table 1. Characteristics of benchmark problems.

| Problem  | Input attributes | Output classes | Training patterns | Validation patterns | Testing patterns | Entropy |
|----------|------------------|----------------|-------------------|---------------------|------------------|---------|
| Cancer   | 9                | 2              | 350               | 175                 | 174              | 0.93    |
| Card     | 51               | 2              | 345               | 173                 | 172              | 0.99    |
| Diabetes | 8                | 2              | 348               | 192                 | 192              | 0.93    |
| Gene     | 120              | 3              | 1588              | 794                 | 793              | 1.50    |
| Heart    | 35               | 2              | 460               | 230                 | 230              | 0.99    |
| Horse    | 58               | 3              | 182               | 91                  | 91               | 1.32    |
| Iris     | 4                | 3              | 75                | 38                  | 37               | 1.58    |
| Thyroid  | 21               | 3              | 3600              | 1800                | 1800             | 0.45    |


Table 2. Performance of the algorithms on eight benchmark problems (all results averaged over 40 independent runs; SD = standard deviation).

| Problem  | Method    | Training error mean | Training error SD | Validation error mean | Validation error SD | Test error mean | Test error SD | CPU time mean | CPU time SD |
|----------|-----------|---------------------|-------------------|-----------------------|---------------------|-----------------|---------------|---------------|-------------|
| Cancer   | BPA       | 2.11  | 0.07 | 2.42  | 0.25 | 2.30  | 0.32 | 9.22   | 1.02  |
| Cancer   | IOPSO     | 5.64  | 1.23 | 5.99  | 1.16 | 6.00  | 1.19 | 2.99   | 1.23  |
| Cancer   | IOPSO_GA  | 1.51  | 1.42 | 1.78  | 1.38 | 1.72  | 1.22 | 1.99   | 0.99  |
| Cancer   | IOPSO_BPA | 1.11  | 0.34 | 1.13  | 0.42 | 1.14  | 0.47 | 0.69   | 0.41  |
| Card     | BPA       | 14.89 | 0.89 | 15.02 | 0.71 | 15.12 | 0.62 | 36.92  | 3.23  |
| Card     | IOPSO     | 18.61 | 2.10 | 18.84 | 2.23 | 18.91 | 2.17 | 6.72   | 5.47  |
| Card     | IOPSO_GA  | 12.87 | 3.23 | 13.64 | 3.82 | 13.95 | 3.78 | 7.81   | 2.22  |
| Card     | IOPSO_BPA | 13.14 | 0.91 | 13.35 | 1.26 | 13.37 | 1.57 | 6.22   | 1.12  |
| Diabetes | BPA       | 34.51 | 0.26 | 35.12 | 0.67 | 35.00 | 0.47 | 10.21  | 0.41  |
| Diabetes | IOPSO     | 29.46 | 0.99 | 30.98 | 1.01 | 31.72 | 1.03 | 1.58   | 0.71  |
| Diabetes | IOPSO_GA  | 28.34 | 1.28 | 31.03 | 1.47 | 30.46 | 1.56 | 2.23   | 0.89  |
| Diabetes | IOPSO_BPA | 21.54 | 0.34 | 22.17 | 0.72 | 22.92 | 0.62 | 1.35   | 0.36  |
| Gene     | BPA       | 54.23 | 0.67 | 58.70 | 0.63 | 58.76 | 0.71 | 365.76 | 20.28 |
| Gene     | IOPSO     | 55.98 | 3.32 | 56.61 | 3.43 | 57.63 | 3.42 | 118.35 | 18.45 |
| Gene     | IOPSO_GA  | 56.14 | 6.89 | 56.46 | 7.65 | 56.62 | 8.01 | 144.27 | 19.32 |
| Gene     | IOPSO_BPA | 39.45 | 2.12 | 39.70 | 2.31 | 39.60 | 2.49 | 52.53  | 16.66 |
| Heart    | BPA       | 67.78 | 0.93 | 68.82 | 0.81 | 68.21 | 0.78 | 13.35  | 0.61  |
| Heart    | IOPSO     | 26.35 | 2.44 | 27.05 | 2.51 | 27.00 | 2.99 | 6.37   | 1.78  |
| Heart    | IOPSO_GA  | 19.63 | 2.59 | 19.98 | 2.76 | 20.00 | 2.83 | 13.95  | 2.21  |
| Heart    | IOPSO_BPA | 17.75 | 1.91 | 18.21 | 1.99 | 18.26 | 1.89 | 5.99   | 1.32  |
| Horse    | BPA       | 29.01 | 0.66 | 29.34 | 0.47 | 29.26 | 0.61 | 23.43  | 0.55  |
| Horse    | IOPSO     | 27.93 | 0.79 | 28.21 | 1.01 | 28.11 | 0.99 | 4.70   | 1.10  |
| Horse    | IOPSO_GA  | 26.47 | 1.54 | 26.99 | 1.56 | 27.00 | 1.78 | 9.38   | 2.23  |
| Horse    | IOPSO_BPA | 26.17 | 0.67 | 26.27 | 0.78 | 26.37 | 0.89 | 3.97   | 0.97  |
| Iris     | BPA       | 18.34 | 0.46 | 18.96 | 0.39 | 18.92 | 0.44 | 1.64   | 0.34  |
| Iris     | IOPSO     | 23.76 | 0.65 | 23.99 | 0.69 | 24.05 | 0.69 | 0.48   | 0.12  |
| Iris     | IOPSO_GA  | 20.87 | 0.98 | 21.12 | 1.01 | 21.19 | 1.03 | 0.70   | 0.18  |
| Iris     | IOPSO_BPA | 17.67 | 0.54 | 18.15 | 0.56 | 18.01 | 0.57 | 0.26   | 0.09  |
| Thyroid  | BPA       | 8.01  | 0.32 | 8.54  | 0.29 | 8.45  | 0.37 | 181.42 | 10.96 |
| Thyroid  | IOPSO     | 6.98  | 0.62 | 7.21  | 0.63 | 7.28  | 0.61 | 37.70  | 4.45  |
| Thyroid  | IOPSO_GA  | 6.61  | 0.97 | 7.03  | 1.03 | 7.01  | 1.02 | 68.16  | 5.89  |
| Thyroid  | IOPSO_BPA | 5.91  | 0.29 | 6.34  | 0.32 | 6.22  | 0.33 | 13.63  | 3.72  |

According to Table 2 and Fig. 6(a)-(h), the combination of the global search ability of IOPSO with the local search ability of BPA with the momentum term outperforms the other three ANN weight-optimization algorithms in both CPU time and testing error. With respect to the variation in solutions over 40 independent runs, Table 2 also shows its superiority over the other metaheuristic-based methods, and its variation is comparable to that of BPA with the momentum term. For most problems, the combination of IOPSO and GA can find good solutions in reasonable time, but the variation of its solutions is high; IOPSO_GA therefore lacks adequate stability for prediction.

4.3. Comparison with promising methods in the literature
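The structure of such a hybrid can be sketched as follows. This is not the paper's IOPSO_BPA implementation (opposition-based learning, time-varying parameters, and true backpropagation are omitted); it only illustrates the general pattern: PSO with the Clerc-Kennedy constriction factor for global search, with the global best periodically refined by a few momentum gradient steps. The objective is a toy quadratic standing in for the network error surface, and all parameter values are illustrative.

```python
import random

def hybrid_train(loss, grad, dim, swarm=20, iters=60, steps=5):
    """Global PSO search with a constriction factor, refined by
    momentum gradient descent on the best particle (sketch)."""
    chi, c1, c2 = 0.729, 2.05, 2.05              # Clerc-Kennedy constriction
    X = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(swarm)]
    V = [[0.0] * dim for _ in range(swarm)]
    P = [x[:] for x in X]                        # personal bests
    g = min(P, key=loss)[:]                      # global best
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                V[i][d] = chi * (V[i][d]
                                 + c1 * random.random() * (P[i][d] - X[i][d])
                                 + c2 * random.random() * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if loss(X[i]) < loss(P[i]):
                P[i] = X[i][:]
                if loss(P[i]) < loss(g):
                    g = P[i][:]
        # local refinement: a few gradient steps with a momentum term
        m = [0.0] * dim
        for _ in range(steps):
            gr = grad(g)
            for d in range(dim):
                m[d] = 0.9 * m[d] - 0.05 * gr[d]
                g[d] += m[d]
    return g

# Toy quadratic standing in for the network error surface (assumption).
f = lambda w: sum((wi - 3) ** 2 for wi in w)
df = lambda w: [2 * (wi - 3) for wi in w]
w = hybrid_train(f, df, dim=4)
```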

Fig. 5. Test error-CPU time graph for the proposed method over benchmark problems.

The standard deviations of the algorithms reveal the robustness of the proposed method. For example, the pairs (test error S.D., CPU time S.D.) for the Thyroid problem are (0.33, 3.72) for the proposed method, (1.02, 5.89) for IOPSO_GA, (0.61, 4.45) for IOPSO, and (0.37, 10.96) for BPA; IOPSO_BPA thus produces less dispersed solutions, so it is more stable under varying starting conditions and finds promising solutions in reasonable time. For a fair comparison of the models, Fig. 7 is presented: it shows each solution method's rank in CPU time and in error for every benchmark problem. Accordingly, IOPSO_BPA clearly dominates the other methods with regard to CPU time and generalization ability.
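The time and error ranks plotted in Fig. 7 can be recomputed directly from the Table 2 means; a minimal sketch using the Thyroid row:

```python
def ranks(scores):
    """Rank methods by score, 1 = best (lowest)."""
    order = sorted(scores, key=scores.get)
    return {m: i + 1 for i, m in enumerate(order)}

# Thyroid row of Table 2: mean test error and mean CPU time.
test_err = {"BPA": 8.45, "IOPSO": 7.28, "IOPSO_GA": 7.01, "IOPSO_BPA": 6.22}
cpu_time = {"BPA": 181.42, "IOPSO": 37.70, "IOPSO_GA": 68.16, "IOPSO_BPA": 13.63}

print(ranks(test_err))  # IOPSO_BPA ranks first on error
print(ranks(cpu_time))  # IOPSO_BPA ranks first on time
```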

There are a number of algorithms for ANN training that could be compared with the proposed algorithm. Different algorithms employ different experimental methodologies; therefore, a direct comparison with other algorithms using statistical tests is not practical in this paper. Furthermore, the results of the independent runs required for statistical tests are generally not available. As a result, different algorithms cannot be compared fairly unless all of them are re-implemented under the same experimental setup. Since the aim of the experimental comparison here is to understand the strengths and weaknesses of IOPSO_BPA, this section compares IOPSO_BPA with other competing methods on the eight classification problems in Table 3. The error rates in the table refer to the CEP over the test data set. The best classification performance is achieved by the proposed technique on the Cancer, Card, Diabetes, Heart, Horse, and Iris data sets. In this study, in order to examine IOPSO_BPA against other experimented data mining techniques for
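CEP, the classification error percentage, is simply the share of misclassified test patterns expressed as a percentage; a minimal sketch:

```python
def cep(predicted, actual):
    """Classification error percentage over a test set."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

print(cep([0, 1, 1, 0], [0, 1, 0, 0]))  # 25.0
```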


Fig. 6. (a)-(h): CPU time (millisecond) vs. testing error of the best-so-far network for the Cancer, Card, Diabetes, Gene, Heart, Horse, Iris, and Thyroid problems, respectively.

prediction, Table 4 compares the results of IOPSO_BPA against those of six prediction methods tested by Friedman (1997). The accuracy rate in the table refers to CEP on the testing set. IOPSO_BPA outperforms the other algorithms on most data sets. Note that training time is not compared in this section because these algorithms were implemented in different programming languages and on computers with different specifications.

5. Conclusions

The learning algorithm is an important aspect of ANN-based models. In this article, a literature review and a categorization of ANN learning algorithms were presented. Most learning algorithms are based on gradient descent methods, which are greedy and can become trapped in local optima of the weight


Fig. 7. CPU time-Error rank for each problem.

Table 3. Comparison between IOPSO_BPA and other competing methods on the eight classification problems.

Algorithm (Authors)                       Cancer  Card   Diabetes  Gene   Heart  Horse  Iris   Thyroid

IOPSO_BPA                                 1.14    13.37  22.92     39.60  18.26  26.37  18.01  6.22
HIOPGA (Yaghini et al., 2011)             1.72    13.95  30.46     56.62  20.00  27.00  21.19  7.01
PSO-PSO (Carvalho and Ludermir, 2007)     4.83    –      24.36     –      19.88  –      –      –
Rprop (Carvalho and Ludermir, 2006)       3.49    –      23.83     –      19.90  –      –      –
ACOR (Blum and Socha, 2005)               2.39    –      25.82     –      21.59  –      –      –
ACOR-BP (Blum and Socha, 2005)            2.14    –      23.80     –      18.29  –      –      –
GABP (Alba and Chicano, 2004)             1.43    –      36.46     –      54.30  –      –      –
LM (Alba and Chicano, 2004)               3.17    –      25.77     –      41.50  –      –      –
LM-BP (Alba and Chicano, 2004)            0.02    –      28.29     –      22.66  –      –      –
GA (Sexton and Dorsey, 2000)              2.27    15.02  26.23     12.04  20.44  23.99  –      3.78
Prechelt (Prechelt, 1994)                 5.17    19.24  25.83     15.41  22.65  34.84  –      7.23
Average                                   2.52    15.40  26.71     30.92  25.41  28.05  19.60  6.06
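The Average row of Table 3 is evidently computed over the available entries only, skipping the dashes; for example, the Thyroid column averages the four reported values. A sketch of that convention:

```python
def column_average(values):
    """Average a results column, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    return round(sum(present) / len(present), 2)

# Thyroid column of Table 3 (None marks algorithms without reported results).
thyroid = [6.22, 7.01, None, None, None, None, None, None, None, 3.78, 7.23]
print(column_average(thyroid))  # 6.06
```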

Table 4. Comparison with other data mining techniques reported in Friedman (1997) for prediction, in terms of the average testing accuracy rate for all data sets.

Algorithm   Cancer        Card           Diabetes       Heart

IOPSO_BPA   1.14 ± 0.47   13.37 ± 1.57   22.92 ± 0.62   18.26 ± 1.89
NB          2.64 ± 0.50   13.77 ± 1.10   25.52 ± 0.89   18.52 ± 3.26
BN          3.08 ± 0.63   13.77 ± 1.76   24.61 ± 0.29   17.78 ± 2.46
TAN         3.08 ± 0.67   15.80 ± 1.24   24.48 ± 1.11   16.67 ± 2.48
CL          7.60 ± 0.81   14.93 ± 1.31   25.26 ± 1.19   17.78 ± 2.96
C4.5        5.27 ± 0.59   14.35 ± 1.82   23.96 ± 0.85   18.89 ± 3.77
SNB         3.81 ± 0.63   13.43 ± 1.81   23.96 ± 0.83   18.15 ± 2.83
Average     3.80          14.20          24.39          18.01

space. Metaheuristic algorithms, in contrast, have a global search strategy that enables them to escape local optima. The proposed algorithm therefore combines the global search ability of metaheuristics with a local greedy gradient-based algorithm, yielding a superior hybrid method. The training time and accuracy of the proposed algorithm on eight benchmark problems were compared


with three other well-known ANN training algorithms, and the results demonstrate the dominance of the proposed algorithm. In addition, the proposed method was compared, in terms of testing accuracy rate, with other competing methods and with several data mining techniques for prediction over all data sets; this evaluation also showed the superiority of the proposed algorithm. Devising a satisfactory and efficient ANN training algorithm remains a challenging subject, especially for problems that require handling very large amounts of data, so reducing the training time and increasing model accuracy are of great importance. Future research could therefore replace the local search applied to the global particle with a faster gradient-based algorithm, such as quick propagation or the Levenberg-Marquardt algorithm, or with a single-solution metaheuristic such as tabu search or simulated annealing.

References

Alba, E., Chicano, J.F., 2004. Training Neural Networks with GA hybrid algorithms. In: Proceedings of Genetic and Evolutionary Computation (GECCO04), 26–30 June, Seattle, Washington, 852–863.


Al-Kazemi, B., Mohan, C.K., 2002. Training feedforward Neural Networks using multi-phase particle swarm optimization. 9th International Conference on Neural Information Processing (ICONIP02), 18–22 November, Singapore, Vol. 5, 2615–2619.
Battiti, R., Tecchiolli, G., 1995. Training neural nets with the reactive tabu search. IEEE Trans. Neural Networks 6 (5), 1185–1200.
Blum, C., Socha, K., 2005. Training feed-forward Neural Networks with Ant Colony Optimization: an application to pattern classification. IEEE 5th International Conference on Hybrid Intelligent Systems (HIS'05), 6–9 November, Rio de Janeiro, Brazil, 233–238.
Cantu-Paz, E., Kamath, C., 2005. An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems. IEEE Trans. Syst. Man Cybern., Part B: Cybern.
Carvalho, M., Ludermir, T.B., 2006. An analysis of PSO hybrid algorithms for feedforward Neural Networks training. Proceedings of the Ninth Brazilian Symposium on Neural Networks (SBRN'06), 23–27 October, Ribeirão Preto, SP, Brazil.
Carvalho, M., Ludermir, T.B., 2007. Particle swarm optimization of neural network architectures and weights. 7th International Conference on Hybrid Intelligent Systems (HIS2007), 17–19 September, Kaiserslautern, Germany, 336–339.
Castellani, M., Rowlands, H., 2009. Evolutionary Artificial Neural Network design and training for wood veneer classification. Eng. Appl. Artif. Intell. 22 (4–5), 732–741.
Chalup, S., Maire, F., 1999. A study on hill climbing algorithms for neural network training. In: Proceedings of the Congress on Evolutionary Computation (CEC99), 6–9 July, Washington, DC, USA, 3, 2014–2021.
Chen, X., Wang, J., Sun, D., Liang, J., 2008. A novel hybrid evolutionary algorithm based on PSO and AFSA for feedforward neural network training. 4th International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM), 12–14 October, Dalian, China, 1–5.
Clerc, M., Kennedy, J., 2002. The particle swarm: explosion, stability, and convergence in a multi-dimensional complex space. IEEE Trans. Evol. Comput. 6, 58–73.
Engelbrecht, A.P., Bergh, F.V.D., 2000. Cooperative learning in Neural Networks using particle swarm optimizers. S. Afr. Comput. J. 26, 84–90.
Friedman, N., 1997. Bayesian network classifiers. Mach. Learn. 29, 131–163.
Gudise, V.G., Venayagamoorthy, G.K., 2003. Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks. Proceedings of the IEEE Swarm Intelligence Symposium, 24–26 April, 110–117.
Han, L., He, X., 2007. A novel opposition-based particle swarm optimization for noisy problems. IEEE 3rd International Conference on Natural Computation (ICNC2007), Haikou, Hainan, China, 3, 624–629.
Islam, M.M., Sattar, M.A., Amin, M.F., Yao, X., Murase, K., 2009. A new adaptive merging and growing algorithm for designing artificial neural networks. IEEE Trans. Syst. Man Cybern., Part B: Cybern. 39 (3), 705–772.
Karaboga, D., Akay, B., Ozturk, C., 2007. Artificial Bee Colony (ABC) optimization algorithm for training feed-forward neural networks. Proceedings of the 4th International Conference on Modeling Decisions for Artificial Intelligence (MDAI'07), 16–18 August, Kitakyushu, Japan, 318–329.
Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. IEEE International Conference on Neural Networks, 27 November–1 December, 4, 1942–1948.
Kiranyaz, S., Ince, T., Yildirim, A., Gabbouj, M., 2009. Evolutionary Artificial Neural Networks by multi-dimensional Particle Swarm Optimization. Neural Networks 22 (10), 1448–1462.
Malinak, P., Jaksa, R., 2007. Simultaneous gradient and evolutionary Neural Network weights adaptation methods. IEEE Congress on Evolutionary Computation (CEC), 25–28 September, Singapore, 2665–2671.
Mandischer, M., 2002. A comparison of evolution strategies and backpropagation for neural network training. Neurocomputing 42 (1–4), 87–117.


Mendes, R., Cortez, P., Rocha, M., Neves, J., 2002. Particle swarms for feedforward Neural Network training. IEEE International Joint Conference on Neural Networks (IJCNN02), 12–17 May, Honolulu, HI, USA, 1895–1899.
Omran, M.G.H., 2009. Using opposition-based learning with particle swarm optimization and barebones differential evolution. In: Lazinica, A. (Ed.), Particle Swarm Optimization. InTech Education and Publishing, 373–384.
Parsopoulos, K.E., Vrahatis, M.N., 2010. Particle Swarm Optimization and Intelligence: Advances and Applications. Information Science Reference.
Porto, V.W., Fogel, D.B., Fogel, L.J., 1995. Alternative neural network training methods. IEEE Expert 10 (3), 16–22.
Prechelt, L., 1994. PROBEN1: a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Karlsruhe, Germany.
Prechelt, L., 1995. Some notes on neural learning algorithm benchmarking. Neurocomputing 9 (3), 343–347.
Prechelt, L., 1996. A quantitative study of experimental evaluations of neural network learning algorithms. Neural Networks 9 (3), 457–462.
Prechelt, L., 1998. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11 (4), 761–767.
Ratnaweera, A., Halgamuge, S.K., Watson, H.C., 2004. Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Trans. Evol. Comput. 8 (3), 240–255.
Sexton, R.S., Alidaee, B., Dorsey, R.E., Johnson, J.D., 1998. Global optimization for artificial neural networks: a tabu search application. Eur. J. Oper. Res. 106 (2–3), 570–584.
Sexton, R.S., Dorsey, R.E., 2000. Reliable classification using neural networks: a Genetic Algorithm and backpropagation comparison. Decis. Support Syst. 30, 11–22.
Talbi, E.-G., 2009. Metaheuristics: From Design to Implementation. John Wiley and Sons.
Tizhoosh, H.R., 2005. Opposition-based learning: a new scheme for machine intelligence. International Conference on Computational Intelligence for Modelling, Control and Automation, 28–30 November, Vienna, Austria, Vol. 1, 695–701.
Treadgold, N.K., Gedeon, T.D., 1998. Simulated annealing and weight decay in adaptive learning: the SARPROP algorithm. IEEE Trans. Neural Networks 9 (4), 662–668.
UCI Repository of Machine Learning Databases, 2011. Department of Information and Computer Science, University of California, Irvine.
Vosniakos, G.C., Benardos, P.G., 2007. Optimizing feedforward Artificial Neural Network architecture. Eng. Appl. Artif. Intell. 20 (3), 365–382.
Wu, Z., Ni, Z., Zhang, C., Gu, L., 2008. Opposition-based comprehensive learning Particle Swarm Optimization. 3rd International Conference on Intelligent System and Knowledge Engineering (ISKE), 17–19 November, 2, 1013–1019.
Yaghini, M., Khoshraftar, M.M., Fallahi, M., 2011. HIOPGA: a new hybrid metaheuristic algorithm to train feedforward Neural Networks for prediction. The 7th International Conference on Data Mining (DMIN'11), 18–21 July, Las Vegas, NV, USA.
Yaghini, M., Khoshraftar, M.M., Seyedabadi, M. Railway passenger train delay prediction via neural network model. J. Adv. Transp., doi:10.1002/atr.193, in press.
Yao, X., 1999. Evolving Artificial Neural Networks. Proc. IEEE 87 (9), 1423–1447.
Yu, J., Wang, S., Xi, L., 2008. Evolving Artificial Neural Networks using an improved PSO and DPSO. Neurocomputing 71 (4–6), 1054–1060.
Zhao, F., Ren, Z., Yu, D., Yang, Y., 2005. Application of an improved particle swarm optimization algorithm for Neural Network training. International Conference on Neural Networks and Brain (ICNN&B05), 13–15 October, Beijing, China, 3, 1639–1698.
