A New Algorithm for Evolution of Artificial Neural Network Classifier

2006 IEEE Congress on Evolutionary Computation Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006

ANNE - A New Algorithm for Evolution of Artificial Neural Network Classifier Systems

Marco Castellani

Abstract - This paper introduces ANNE, a new algorithm for the evolution of neural network classifiers. Unlike standard divide and conquer approaches, the proposed algorithm simultaneously evolves the input feature vector, the network topology and the weights. The use of the embedded approach is also novel in an evolutionary feature selection paradigm. Tested on seven benchmark problems, ANNE creates compact solutions that achieve accurate and robust learning results. Significant reduction of the input features is obtained in most of the data sets. The performance of ANNE is compared to the performance of five control algorithms that combine different manual and automatic feature selection approaches with different structure design techniques. The tests show that ANNE performs concurrent feature selection and structure design with results that are equal to or better than the best results obtained by algorithms specialised only in feature selection or neural network architecture design. Moreover, the proposed approach fully automates the neural network generation process, thus removing the need for time-consuming manual design.

I. INTRODUCTION

The implementation of artificial neural network (ANN) [1] pattern classifiers requires the solution of three complex optimisation problems, namely, the selection of the input vector of data features, the identification of a suitable ANN structure, and the training of the frequently large set of parameters (i.e., the connection weights). The three tasks are mutually related. Since pattern classification is based on the information carried by the feature vector, feature selection affects the classifier accuracy, learning capabilities, and size. At the same time, given a fixed input vector configuration, the learning capabilities of ANN systems are determined by the choice of ANN topology. However, since the goodness of a particular choice of input attributes and ANN architecture can be fully evaluated only on the performance of the trained classifier, the effectiveness of the learning procedure greatly influences the feature selection and structure design processes. Feature selection, structure design and weight training can be regarded as three search problems in the discrete space of the subsets of data attributes, the discrete space of the possible ANN configurations, and the continuous space of the ANN parameters, respectively. These three problems share common features such as the high complexity, multimodality, and deceptiveness of the search space. Due to

M. Castellani is with Centro de Inteligência Artificial (CENTRIA), Departamento de Informática, Universidade Nova de Lisboa, Quinta da Torre, 2829-516 Caparica, Portugal; phone: +351 212948536; fax: +351 212948541; e-mail: [email protected].

0-7803-9487-9/06/$20.00/©2006 IEEE

the difficulty of each of the three optimisation tasks, a divide and conquer approach is commonly used and the search problems are treated separately. Thanks to their global search strategy, evolutionary algorithms (EAs) [2] produce robust results when pursuing optimisation in large, noisy, multimodal and deceptive spaces. EAs have been applied to solve separately the problems of feature selection, structure design and weight training for ANN classifiers [3], [4].

This paper presents ANNE (Artificial Neural Network Evolver), a new EA for simultaneous feature selection, structure design and weight training for ANNs. Similar to the other algorithms in the literature, the proposed method relies on the global nature of the evolutionary search to avoid being trapped by sub-optimal peaks of performance. However, ANNE is characterised by a distinctive approach based on the concurrent evolution of the input vector, the ANN topology and the weights. Compared to standard divide and conquer methods, ANNE’s approach simplifies the optimisation of the algorithm, reduces the ANN design effort and improves the quality of the solutions. In the present study the proposed procedure is tested on a popular type of ANN classifier, the multi-layer perceptron (MLP) [1].

Sections II, III and IV survey the main literature respectively on feature selection, weight training and structure design for ANN systems. Section V introduces the new algorithm. Section VI describes the experimental design and Section VII presents the experimental results. Section VIII summarises the results and concludes the paper.

II. FEATURE SELECTION FOR ANN CLASSIFIER SYSTEMS

Feature selection can be regarded as a search problem in the discrete space of the subsets of data attributes. The solution requires the removal of unnecessary, conflicting, overlapping and redundant features in order to maximise the classifier accuracy, compactness, and learning capabilities.
Due to the often large set of attributes and their interactions, selecting the optimal feature vector is usually difficult and time consuming. The search space is noisy, complex, nondifferentiable, multimodal and deceptive, and the results are strongly related to the type of classifier system used. Two main approaches are customary for feature selection, namely the filter approach and the wrapper approach [5]. The filter approach pre-processes the initial set of attributes based on desirable properties such as orthogonality and information content. The filter method is usually the least computationally intensive. However, since the feature


selection criterion completely ignores the inductive and representational biases of the learning system, filter algorithms are prone to unexpected failures [6]. The wrapper approach assesses each candidate feature vector on the learning results of the classifier. Compared to the filter approach, this method involves a far more severe computational effort that may hinder adequate exploration of the search space. Moreover, since many ANN training algorithms are prone to sub-optimal convergence, the wrapper approach must take into account possibly inaccurate evaluations of the candidate solutions. This problem is exacerbated by the fact that different choices of input features modify the ANN optimisation landscape, making it particularly difficult to define adequate learning parameters. Notwithstanding its limitations, it is commonly accepted that the wrapper approach, when applicable, provides superior results in terms of classification accuracy [7]. Greedy backward elimination and forward selection techniques are generally employed to search the space of the candidate solutions [5], [7], [8]. Unfortunately, such methods evaluate the contribution of each feature separately, failing to take into account possible interactions amongst attributes. The local search of the feature space also makes the above algorithms liable to get trapped by local minima or to be deceived by noise in the evaluation of the solutions. Global search methods are better suited to deal with the complexity and the multi-modality of the attribute space and with noisy fitness evaluations. A variety of studies explored the use of EAs for automatic feature selection [3], [9] for ANN classifier systems, the wrapper approach being the common choice of implementation. Despite encouraging results, the use of EAs has so far been limited by the often lengthy training procedures of the classifiers. 
Lengthy evaluations of the candidate solutions add to the intrinsic slowness of convergence of EAs, often making the application of this approach problematic. Drastic reduction of computational effort can be achieved by embedding the search for the optimal feature set into the training procedure of the classifier. This embedded approach [5] avoids the computational overheads of repeating the whole training procedure for every evaluation of a solution. Unfortunately, applications of the embedded approach to feature selection for ANN systems are so far less common and mostly rely on backward elimination or forward selection techniques [5], [10]. Such greedy optimisation techniques, as pointed out before, are only sub-optimal since they are prone to converge to local peaks of performance and do not take into account possible interactions amongst features.

III. TRAINING THE ANN CONNECTION WEIGHTS

Training an ANN classifier requires it to learn the setting of the weights that maximises the system accuracy. The task can be regarded as a combinatorial search problem where a solution is sought in the high-dimensional, highly complex and multimodal space of the ANN parameters. The sampling of such space is usually affected by a number of factors such as initialisation of the ANN weights, partition of the training and the test sets of examples, order of presentation of the training data and others. Gradient based learning methods such as the backpropagation (BP) [1] rule can get easily trapped by local maxima or flat areas of the optimisation surface and time consuming experimentation is required before a satisfactory solution is found.

Due to the robustness of their search approach, global search methods are often applied to ANN training. The first applications of EAs to ANN parameter learning date back to the late 1980s in the fields of Genetic Algorithms (GAs) [2], [11] and Evolutionary Programming (EP) [2], [12]. The common approach is to encode the ANN weights into genes that are then concatenated to build the genotype. Much debated is the representation of the solutions since the popular GA practice of binary coding [2] gives rise to long bit-strings for any non-trivial ANN architecture, leading to the dual problem of a large search space and increased disruptiveness of the crossover operator. Moreover, the larger the strings are, the longer the processing time is. For the above reasons, standard EAs are often modified to allow more compact and efficient encodings [11], [13] and are hybridised with other search algorithms (e.g., the BP rule) to speed up the learning process [11], [14].

The use of genetic crossover is also a major topic in the evolutionary ANN debate because there is no consensus on which are the functional units to swap. On the contrary, the distributed nature of the knowledge base in connectionist systems seems to favor the argument against point-to-point exchanges of genetic material amongst solutions. Relevant to the efficiency of the crossover operator is also the competing convention problem [15], namely the many-to-one mapping from the representation of the solutions (the genotype) to the actual ANN (the phenotype). This problem often leads to high disruption of the solutions' behaviour following genetic recombinations. A way to prevent competing conventions is to match pairs of similarly performing neurons of mated solutions prior to the crossover operation [15]. Alternatively, sub-populations (species) of neurons are evolved, each species corresponding to a fixed position on a pre-defined ANN architecture [16]. Unfortunately, these approaches don't scale well to large ANN structures.

Because of its real-valued encoding and the lack of a crossover operator, EP is often regarded as a more suitable approach to ANN training. Several successful examples are reported in the literature, mainly using Gaussian [12] or Cauchy [17] genetic mutation as the main search operator. For further insights on the evolutionary training of ANNs the reader can find broad surveys on the topic in [3], [4].

IV. STRUCTURE DESIGN FOR ANN CLASSIFIER SYSTEMS

The choice of ANN structure has a considerable impact on the processing power and learning capability of the classifier. Too small a topology may not possess enough representation power to fully learn the desired input-output relationship, whereas a too large one may result in the ANN


response too closely modelling the training data. The latter case usually produces a solution with poor generalisation capabilities. Due to the lack of clear structure optimisation rules, manual trial and error is still the most common choice for ANN generation. Designing an ANN system can be regarded as a search problem in the space of the allowed ANN topologies. Given some optimality criteria such as system accuracy, compactness and learning speed, the performance level of all architectures forms an infinitely large, discrete, complex and nondifferentiable optimisation surface [3]. This surface is also noisy, deceptive and multimodal [3].

Most algorithms for ANN design use gradient-based search techniques, such as constructive and destructive algorithms [18], [19]. The main problem with such algorithms is that they are prone to sub-optimal convergence. Several studies report applications of EAs to the design of ANN architectures coupled to customary parameter learning algorithms, a typical example being the evolution of MLP topologies with BP training of the ANN weights [20]. Fitness evaluation is usually expressed as a multi-objective optimisation criterion that takes into account different requirements such as ANN accuracy, size, learning speed, etc.

Two main approaches for encoding the candidate solutions have emerged, namely direct encoding and indirect encoding [3]. Direct encoding specifies every ANN connection and node, usually representing individuals by means of connection matrices. Following this approach, chromosomes are easy to decode but the algorithm does not scale well to large ANN structures. Indirect encoding specifies only a compact representation of the ANN structure, usually through parameters describing the network size and connectivity [21] or via developmental rules [22]. While indirect encoding seems more biologically plausible and does not give rise to the problem of competing conventions, the action of the genetic operators on the actual phenotype becomes less clear and the decoding of the chromosomes more difficult. Moreover, small changes in the genotype produce large changes in the phenotype, creating a rugged and more difficult search surface.

The use of EAs to design ANNs that are then trained using some parameter learning algorithm allows compact and effective structures to be built. However, imprecision in the evaluation of the candidate solutions must be taken into account due to possible sub-optimal convergence of the weight training procedure. Furthermore, the training of the ANN weights may be excessively slow for adequate exploration of the search space. For the above reasons, it is preferable to simultaneously optimise both the ANN architecture and the parameters. This is done either by alternating steps of evolutionary structure optimisation with steps of standard (e.g. BP-driven) training of the parameters [23] or by evolving at the same time both the connectivity and the weights [24].

In the first case, the standard learning technique behaves like an additional problem-specific mutation operator. This genetic transmission of learned knowledge introduces an element of "Lamarckism" [25] into the search, that is, the permanent storing in the genotype of acquired behaviours resulting from learning by the phenotype. In the second case, different mutation operators modify the ANN structure and the weights. Standard ANN weight training algorithms (e.g., BP for MLPs) are often used to speed up the search through Lamarckian learning. For the reasons discussed in Section III, the use of genetic crossover is not customary and it depends on the representation of the candidate solutions. Due to the difficulty of encoding the connection weights, the use of indirect encoding becomes problematic once the whole ANN system is evolved. Indirect encoding also makes the implementation of Lamarckism less straightforward, an otherwise useful feature to speed up the search process and improve the accuracy of the solutions.

V. THE PROPOSED ALGORITHM

The ANNE algorithm is designed for concurrent feature selection, structure design and weight training for ANN systems. The simultaneous optimisation of the whole ANN configuration, a fairly unexplored approach in the evolutionary generation of ANNs, is motivated by the need to account for the mutual interactions between input vector configuration, ANN architecture and weight settings. As discussed before, sub-optimal convergence of the weight training procedure makes traditional approaches prone to inaccurate evaluation of the candidate input vector configurations and ANN structures. Likewise, poor choices of ANN structure or input data attributes negatively affect the ANN learning accuracy and speed. A fully automated approach to ANN design and training also removes the need for human intervention in the ANN implementation process. The algorithm uses the embedded approach for the selection of the data attributes.
The choice of this approach is novel in evolutionary feature selection for ANNs. Compared to traditional wrapper algorithms, the embedded approach removes the need to restart the whole ANN training procedure for every evaluation of a candidate input vector solution. In an embedded evolutionary feature selection procedure the number of ANN training cycles also coincides with the convergence of the evolutionary procedure. This removes the problem of setting a separate stopping criterion for the weight training algorithm. This problem is of major concern in traditional wrapper algorithms, since different input vector configurations determine different ANN learning speeds. The use of the embedded approach is therefore beneficial in terms of reduced computational overheads [5] and reduced effort for algorithm optimisation. The population is evolved through a mix of random genetic manipulations and Lamarckian gradient based learning. Since it is more suitable for transmitting the setting of the connection weights, the direct encoding approach is


used for representing the candidate solutions. This section presents the application of the algorithm to the evolution of MLP classifiers of any pre-defined number of layers.

A. General Overview

The algorithm comprises three components, namely, a feature selection module, a structure design module and an ANN training module, that act concurrently on the same pool of individuals. The three modules are expected to benefit from their co-occurrence. The presence of similarly performing structural mutations of an individual is likely to favour population diversity. Moreover, manipulations of the input vector and the ANN topology modify the error surface, thus helping the weight training algorithm to escape local peaks or flat areas of fitness. Finally, ANNs possess well-known fault tolerance to addition or removal of input signals and processing units. This capability minimises the number of fatal mutations, since moderate changes of the ANN architecture and the input vector are not likely to cause major disruption to the progress of the learning procedure.

The genotype of each individual is composed of two chromosomes, namely a binary string representing the data attributes and a real-valued variable-length string representing the weights. At each generation, the fitness of the population is assessed, then a cycle of the feature selection module, a cycle of the structure design module and a cycle of the ANN training module are executed. Evolution is achieved via random genetic operations of mutation and crossover. Genetic crossover is operated only in the feature selection module. The choice against genetic recombination of the ANN structures is motivated by the lack of clear functional units in ANN systems and by the competing convention problem. The BP rule is included into the ANN training module to support the weight optimisation procedure. This problem-specific operator acts on the weights of the decoded individuals so as to reduce the classification error.
The changes are stored into the genotype (Lamarckian learning). As a result of the action of the three modules, at every learning cycle a new population is produced through genetic manipulation of the individuals. New solutions replace old ones via generational replacement [2]. The procedure is repeated until a pre-defined number of iterations has elapsed and the fittest solution of the last generation is picked.

B. Feature Selection Module

The feature selection module selects from an initial broad set of data attributes the subset that maximises the ANN performance. The solution defines the ANN input vector. Candidate solutions are encoded in the chromosome representing the set of data attributes. This chromosome corresponds to a binary mask of length equal to the size of the full feature set and defines whether a particular attribute is fed to the ANN input layer. The feature selection module manipulates the input mask via two genetic operators, namely bit-flip mutation [2] and two-point crossover [2].
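The two operators acting on the binary input mask can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the list-based mask and the `rng` argument are assumptions made for illustration, and masked attributes are fed to the network as zero signals, in line with the description of switched-off features in Section V.D.

```python
import random


def bit_flip_mutation(mask, rate, rng):
    """Flip each bit of the binary feature mask independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def two_point_crossover(parent_a, parent_b, rng):
    """Exchange the segment between two random cut points of two equal-length masks."""
    if len(parent_a) != len(parent_b):
        raise ValueError("masks must have the same length")
    lo, hi = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b


def apply_mask(pattern, mask):
    """Switch off masked attributes by replacing them with a zero signal (assumption)."""
    return [x if keep else 0.0 for x, keep in zip(pattern, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = [1] * 8, [0] * 8
    child_a, child_b = two_point_crossover(a, b, rng)
    print(child_a, child_b)
    print(bit_flip_mutation(child_a, 0.05, rng))
```

The mutation rate of 0.05 used in the example matches the feature mask mutation rate listed in Table II; the mask length of 8 is arbitrary.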

C. ANN Structure Design Module

The structure design module evolves the size (i.e. number of nodes) of the hidden layer(s) of the ANN classifier. The number of hidden layers is at present fixed a priori and each layer is fully connected to the neighbouring ones. Two genetic mutation operators of node addition and node deletion are used. The former adds a node with connection weights initialised to small random values, the latter deletes a node. Structure mutations therefore add or remove those parts of the genotype corresponding to the mutated units. To bias the search toward compact ANN structures, node deletion is given a slightly higher probability than node addition. If node deletion is chosen, the algorithm picks the neuron with the weakest connections from a randomly selected hidden layer and removes it.

D. ANN Training Module

The ANN training module evolves the ANN weights so as to minimise classification error. Evolution is achieved via two genetic operators, namely mutation and the BP rule. Genetic mutations slightly modify the weights of each node of a solution. For each weight, the perturbation is randomly sampled with uniform probability from an interval of pre-defined width. The BP rule is used as a deterministic mutation operator with the purpose of speeding up the learning process. If selected, a solution undergoes one cycle of BP learning over the whole training set. Because BP is computationally expensive, the operator is used with a moderate rate of occurrence. The deterministic weight training operator is the only part where the ANNE algorithm is specific to the ANN paradigm. If other ANN models are to be trained, the BP rule can be substituted by other parameter learning procedures.

Incoming node weights from switched off features are still processed by the ANN training module. However, the only possible alteration of such weights comes from the mutation operator, which is a zero-mean random perturbation.
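The node addition and deletion operators of Section V.C and the uniform weight mutation of Section V.D can be sketched as follows for a single hidden layer. This is an illustrative sketch under several assumptions that the text does not settle: the strength of a node is taken as the sum of the absolute values of its incoming and outgoing weights, addition and deletion are treated as mutually exclusive events, the new node is initialised in the weight initialisation range of Table II, the mutation amplitude of Table II is read as the full width of a symmetric interval, and bias weights are omitted. All function names are hypothetical.

```python
import random


def add_node(w_in, w_out, rng, init_range=0.05):
    """Append a hidden node whose weights are small random values (range assumed)."""
    n_inputs = len(w_in[0])
    new_in = w_in + [[rng.uniform(-init_range, init_range) for _ in range(n_inputs)]]
    new_out = [row + [rng.uniform(-init_range, init_range)] for row in w_out]
    return new_in, new_out


def delete_weakest_node(w_in, w_out):
    """Remove the hidden node with the smallest total absolute weight (assumed criterion)."""
    if len(w_in) <= 1:
        return w_in, w_out  # keep at least one hidden node

    def strength(j):
        return sum(abs(w) for w in w_in[j]) + sum(abs(row[j]) for row in w_out)

    weakest = min(range(len(w_in)), key=strength)
    new_in = [row for j, row in enumerate(w_in) if j != weakest]
    new_out = [[w for j, w in enumerate(row) if j != weakest] for row in w_out]
    return new_in, new_out


def mutate_structure(w_in, w_out, rng, p_add=0.0225, p_delete=0.0275):
    """Apply at most one structural mutation; deletion is slightly more likely."""
    draw = rng.random()
    if draw < p_add:
        return add_node(w_in, w_out, rng)
    if draw < p_add + p_delete:
        return delete_weakest_node(w_in, w_out)
    return w_in, w_out


def mutate_weights(weights, rng, width=0.2):
    """Add a zero-mean uniform perturbation drawn from [-width/2, width/2] to each weight."""
    half = width / 2.0
    return [[w + rng.uniform(-half, half) for w in row] for row in weights]
```

Here `w_in` holds one row of incoming weights per hidden node and `w_out` one row per output node with one weight per hidden node, so a structural mutation adds or removes exactly the part of the genotype that belongs to the mutated unit.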
The BP operator has no effect since there is no signal (i.e., zero signal) passing through those connections. The genetic drift of these elements is thus expected to be extremely modest.

E. Fitness Evaluation Procedure

The fitness of each candidate solution is evaluated by testing its classification accuracy on the training data set. To encourage the creation of compact and high performing solutions, whenever the accuracy of two individuals is equal, preference is given first to the solution using the smallest feature set, then to the solution having the most economical structure. ANN optimisation therefore follows a hierarchical criterion where accuracy has priority over compactness.

VI. EXPERIMENTAL DESIGN

The proposed algorithm is tested on seven benchmark classification problems. Five control tests are performed on the same problems using different combinations of feature


selection and ANN design and training algorithms.

A. Data Sets

Seven real-world numerical data sets are chosen from the UCI Machine Learning Repository [26] for the experimental tests. Their main features are listed in Table I. The Ionosphere, the LandSat and the Vowel databases are pre-divided into a training set and a test set of examples. The other databases are randomly split into a training set containing 80% of the examples and a test set containing the remaining 20%. Each algorithm uses the training set for learning and the test set for final estimation of the learning accuracy. Due to the highly unbalanced distribution of the examples, the size of the classes in the Musk training set is balanced by duplicating randomly picked members of the smaller classes. To reduce the danger of overfitting, the order of presentation of the training samples is randomly reshuffled for every training epoch.

B. ANN Training Test

The first experiment is performed using the full set of data attributes and manually optimised MLP structures. The results of the learning trials will be used as a baseline for assessing the efficacy of the feature extraction and structure design algorithms under evaluation. For each classification problem, a pre-fixed architecture is trained using the standard BP rule with momentum term. The learning procedure is run for a fixed number of iterations on the training set of examples.

C. Feature Selection Tests

In the next two experiments, manually optimised MLP structures are trained using two reduced sets of data attributes. The two feature subsets are generated respectively through a feature reduction algorithm based on the filter approach and ANNE’s feature selection module. In the first test, feature reduction is achieved via principal components analysis (PCA) [27] of the full feature set.
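As a small illustration of the PCA-based reduction, the sketch below shows only the cut-off step: given the eigenvalues of the correlation matrix, it returns the smallest number of leading principal components that accounts for a required share of the total variance. The 90% threshold and the use of the correlation matrix are taken from Section VII; the function name and the list-based interface are assumptions, and the eigen-decomposition itself is left to a numerical library.

```python
def components_to_keep(eigenvalues, min_variance=0.90):
    """Return the smallest k such that the k largest eigenvalues
    account for at least `min_variance` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_variance:
            return k
    return len(ordered)


if __name__ == "__main__":
    # Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
    print(components_to_keep([2.5, 1.0, 0.4, 0.1]))
```

For a correlation matrix the eigenvalues sum to the number of input variables, so the share of variance explained by each component is its eigenvalue divided by that number.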
Even though, strictly speaking, PCA is not a feature selection method but a feature extraction method [7], PCA is chosen for this experiment because it is one of the most popular, well understood and effective feature reduction algorithms. PCA works by transforming the original vector space of (possibly) correlated variables into an equivalent space of uncorrelated variables (the principal components). Principal components are arranged in decreasing order according to their capability of accounting for the variance of the original data. Since most of the original data variance can be accounted for by a small number of principal components, effective reduction of the input vector can be achieved by setting a heuristic cut-off criterion on the last components. For each classification problem, a minimal number of principal components is selected and the input patterns are transformed accordingly. This new data representation is used for training a pre-fixed MLP structure via the BP rule. In the second test, feature reduction is performed using ANNE's feature selection module. That is, the structure

TABLE I
DATA SETS

Data set       Size   Features   Classes   Training set
Ionosphere      351         33         2   200 - fixed
Iris            150          4         3   80% - random
Landsat        6435         36         6   4435 - fixed
Musk           6598        166         2   80% - random
Segmentation   2310         19         7   80% - random
Vehicle         846         18         4   80% - random
Vowel           990         10        11   528 - fixed

design module is switched off and the EA is used for simultaneously optimising the feature vector and the MLP weights using the feature selection and the ANN training modules. The algorithm is run for a fixed number of iterations and the fittest individual of the last generation is chosen as the final solution. For each classification problem, a pre-fixed MLP architecture is used. This algorithm will be henceforth referred to as the Feat algorithm. The purpose of these tests is to assess the efficacy of the evolutionary feature selection module of ANNE.

D. Structure Design Test

The fourth experiment uses the full set of data attributes and applies ANNE's ANN design module to evolve the MLP topologies. For each classification task, the feature selection module is switched off and the EA is used to simultaneously design and train the ANN classifier using the ANN design and the ANN training modules. This algorithm will be henceforth called the Str algorithm. The purpose of this test is to assess the efficacy of the ANN design module, which is the evolutionary structure design module of ANNE.

E. Full Optimisation Tests

The last two experiments combine feature selection and automatic ANN design and training. In the first test, the PCA-reduced feature sets generated in the experiment described in Section VI.C are used. For each benchmark problem, the MLP classifier is designed and trained using the procedure described in Section VI.D. In the second test, the full ANNE algorithm is applied.

VII. EXPERIMENTAL RESULTS

This section presents the experimental settings and the results of the application of ANNE and the control algorithms to the seven benchmark classification problems. Input data are normalised according to the mean-variance procedure. In the learning trials involving the three largest databases, a sampling procedure is used to reduce computational overheads and the duration of the algorithms.
The method randomly selects a subset of the training data from each class, at each learning cycle, and uses this sample instead of the full training set. The size of the subset is 10% of the training examples for the Musk database, 20% of the training examples for the LandSat database and 50% of the


TABLE II
LEARNING PARAMETERS

Setting                         Module   BP              ANNE
Learning coefficient                     0.1
Momentum term                            0.01
Weights initialisation range             [-0.05, 0.05]   [-0.05, 0.05]
Trials                                   10              10
Generations                              *               *
Population size                          -               100
Feature mask crossover rate     FS       -               1.0
Feature mask mutation rate      FS       -               0.05
Masked features at start        FS       -               10%
Hidden node addition rate       ANND     -               0.0225
Hidden node deletion rate       ANND     -               0.0275
Weights mutation rate           ANNT     -               0.25
Amplitude weights mutation      ANNT     -               0.2
BP Lamarckian operator rate     ANNT     -               0.6

* depending upon data set
FS = feature selection module, ANND = structure design module, ANNT = weight training module

training examples for the Image Segmentation database. The large number of instances guarantees a sufficient representativeness of the sampled subsets. Since the set of input attributes is generally not expressed on a common scale, calculation of the principal components is performed using the correlation matrix of the input variables. The heuristic criterion chosen to reduce the complexity of the input space retains only the smallest number of leading principal components that accounts for at least 90% of the total data variance. The topology of the MLP classifiers is fixed to one hidden layer. The hidden nodes use the hypertangent transformation function, while the output nodes use the sigmoidal transformation function.

The learning algorithms are optimised according to experimental trial and error. Table II reports the main parameter settings. Except for the number of learning cycles, once the parameters of an algorithm are optimised, they are kept unchanged for all the experiments. Experimental evidence shows that the performance of ANNE is robust to reasonable variations of the search parameters. Learning times ranged from a few minutes for the smaller data sets to several hours for the larger data sets on a Pentium III 1 GHz processor with 512 MB of RAM.

For each benchmark problem, the algorithm under evaluation is run and the classification accuracy of the final solution is estimated on the test set of examples. This procedure is repeated with different random initialisations ten times for each benchmark problem. The final learning results are estimated as the average values of the ten independent learning trials.

A. Learning Results

For each classification task, the results of the six learning tests (ANNE plus the five control tests) are reported in Table III. For each experiment, the table reports the size of the hidden layer of the MLP, the number of selected features,

the mean and the standard deviation of the ANN accuracy and the number of learning cycles. Accuracy results report the percentage of successfully classified examples of the test set. Table IV reports the ANOVA F-test statistics for each benchmark problem and the critical value for a 5% alpha level of significance. The test reveals that there are significant differences in the accuracy results obtained on the LandSat, Image Segmentation and Vehicle data sets.

The BP training of manually optimised structures produces results that are competitive with the more sophisticated EAs on all but one data set. The LandSat data set is the only case where the BP rule clearly underperforms.

The feature selection trials confirm the efficacy of PCA as a means of reducing the size of the input feature vector. This is particularly true for the larger LandSat and Musk data sets. Unfortunately, the poor accuracy results scored on the Vehicle database confirm the concerns raised in Section II about the risks of occasional failure of filter feature selection approaches. More moderate degradation of the learning performance also occurs on the Image Segmentation data set.

The simultaneous evolution of the input vector and the weights allows the manually optimised ANN structures (Feat algorithm) to obtain high classification accuracies. Regarding the size of the feature vector, the most significant reductions are achieved on the Ionosphere and Iris data sets, where the number of selected features is roughly 1/3 of the full attribute set. Substantial reduction of the number of input features is also recorded for the Musk, Image Segmentation and LandSat databases. Compared to the PCA-based feature selection algorithm, the Feat algorithm obtains a smaller input vector for the Ionosphere data set and similarly sized input vectors for the Iris and the Image Segmentation data sets. In the other cases, the filter procedure allows greater reduction of the input features.
However, the evolutionary algorithm obtains more accurate ANN solutions. The tests carried out using the full feature set and the PCA-reduced feature set show no appreciable differences in accuracy between manually optimised and evolutionary ANN structures. This result suggests that the optimisation of the feature vector is more difficult and has deeper consequences on the ANN behaviour than structure optimisation. However, in most of the cases the automatic procedure achieves more compact architectures. This is especially true in the case of the Musk classification problem, where the Str algorithm evolves MLP hidden layers that contain on average one third of the nodes contained in the manually optimised structures. Particularly compact solutions are also evolved for the LandSat and Vehicle classification problems, where the average size of the automatically designed hidden layers is half the size of the manually generated counterparts.
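The one-way ANOVA F-test used to compare the accuracy results of the six algorithms can be outlined as follows. This is a minimal sketch rather than the code used in the study: the function name `one_way_anova_f` is hypothetical, and the accuracy values are synthetic numbers generated only to illustrate the calculation. With six algorithms and ten independent runs each, the F statistic has (5, 54) degrees of freedom.

```python
import random
from statistics import mean


def one_way_anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within).

    `groups` holds one list of test accuracies per algorithm.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Variability of the algorithm means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability of the single runs around their own algorithm mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic data: six algorithms, ten runs each (NOT the paper's raw results).
rng = random.Random(0)
runs = [[rng.gauss(mu, 1.0) for _ in range(10)]
        for mu in (86.5, 87.0, 89.0, 89.0, 87.0, 89.0)]
f_stat, df_b, df_w = one_way_anova_f(runs)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

The differences between the algorithms are judged significant at the 5% level when F exceeds the critical value of the F distribution with those degrees of freedom.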

TABLE III
EXPERIMENTAL RESULTS

Data set      Measure                  BP  PCA+BP    Feat     Str  PCA+Str    ANNE
Ionosphere    Hidden layer size       2.0     2.0     2.0     1.9      1.5     2.2
              Inputs                 33.0    19.0    11.9    33.0     19.0    11.6
              Accuracy (%)           96.6    94.8    94.2    94.0     93.5    94.4
              Std. dev. (%)           1.6     0.4     2.3     1.6      1.5     1.4
              Learning cycles        1500    8500    1500    2000     3000    3000

Iris          Hidden layer size       2.0     2.0     2.0     1.9      1.5     1.3
              Inputs                  4.0     2.0     1.5     4.0      2.0     1.5
              Accuracy (%)           96.3    92.3    94.7    96.3     92.3    96.3
              Std. dev. (%)           2.8     5.0     3.6     3.3      3.9     2.9
              Learning cycles         700     800     400    2500     3000     200

LandSat       Hidden layer size      30.0    20.0    20.0    18.8      9.9    18.2
              Inputs                 36.0     4.0    22.4    36.0      4.0    25.5
              Accuracy (%)           86.6    87.0    88.9    89.1     87.1    88.8
              Std. dev. (%)           0.6     0.4     0.4     0.6      0.4     0.4
              Learning cycles       10000    9000    6500    7000     6500    8500

Musk          Hidden layer size      20.0    15.0    10.0     7.8      5.3     8.6
              Inputs                166.0    26.0   103.5   166.0     26.0    65.8
              Accuracy (%)           98.6    98.7    98.5    98.5     97.1    98.3
              Std. dev. (%)           0.4     0.3     0.3     0.4      1.3     0.4
              Learning cycles        1000    8000    9500    7500    10000   10000

Segmentation  Hidden layer size      30.0    20.0    20.0    15.1     17.5    16.5
              Inputs                 19.0     9.0     8.9    19.0      9.0    10.6
              Accuracy (%)           97.2    93.6    96.1    96.6     92.9    96.6
              Std. dev. (%)           0.7     1.1     0.8     1.0      1.4     0.8
              Learning cycles        9000    8000    9000    8000    10000   10000

Vehicle       Hidden layer size      30.0    20.0    25.0    17.6     12.7    18.9
              Inputs                 18.0     5.0    16.2    18.0      5.0    16.8
              Accuracy (%)           83.1    58.2    82.2    82.6     59.6    83.0
              Std. dev. (%)           2.7     3.3     2.0     2.7      4.3     1.9
              Learning cycles        3000   10000    5000    5000     8000   10000

Vowel         Hidden layer size      40.0    30.0    35.0    36.8     39.9    40.0
              Inputs                 10.0     7.0     9.2    10.0      7.0     8.6
              Accuracy (%)           56.2    53.6    56.8    56.5     54.5    55.8
              Std. dev. (%)           1.8     3.0     2.5     2.7      2.2     3.4
              Learning cycles         800    1200     900     900     1200     900

TABLE IV
ANOVA TEST OF SIGNIFICANCE (α=0.05)

Data set      F-test
Ionos.          0.47
Iris            0.29
LandSat         5.33
Musk            0.92
Segm.           3.19
Vehicle        17.88
Vowel           0.22
Critical        2.38

The ANNE algorithm confirms the good results obtained independently by the feature selection and the ANN design and training modules. Concerning the classification accuracy, the results achieved by ANNE are competitive with the best results obtained by the other algorithms. In terms of feature selection, the solutions evolved by ANNE are comparable to the solutions created by the Feat algorithm. Only on the large Musk data set does ANNE's concurrent evolution of the feature vector and the ANN configuration favour the creation of smaller feature subsets. However, since the ANN structures evolved by ANNE are smaller than the manually optimised structures used for the Feat algorithm, the solutions created by ANNE are more compact. Likewise, although the architectures evolved by ANNE are comparable to the solutions created by the Str algorithm using the full feature set, the ANN solutions produced by ANNE are smaller due to the reduced feature vector. Overall, the tests show ANNE's capability of achieving concurrent feature selection and structure design with results that are equal to or better than the results achieved by specialised algorithms aimed only at feature selection or ANN topology optimisation. Moreover, thanks to the simultaneous evolution of the whole ANN configuration, ANNE creates more compact solutions and removes the need for lengthy manual optimisation.

Compared to the automatic system using PCA for feature selection and the Str algorithm for ANN optimisation (PCA+Str), ANNE produces in general less compact solutions. The difference between the two algorithms is particularly noticeable in those classification problems where PCA generates much smaller input vectors. However, the solutions evolved by ANNE are more accurate than the solutions produced by the PCA+Str algorithm. This difference in accuracy is most noticeable on the LandSat, Image Segmentation and Vehicle databases.

On the whole, the results underline the importance that feature selection has for the outcome of the ANN design and training processes. With the exception of the LandSat benchmark problem, the differences in classification accuracy measured between different algorithms appear to be related almost exclusively to the different input vectors used. In general, algorithms working on the PCA-generated sets of features tend to create more compact but worse performing solutions. Compared to manual design, evolutionary design of ANN structures produces more compact solutions with no degradation of the learning accuracy.

The largest differences in classification accuracy are found in the LandSat and the Vehicle classification problems. In the first case, there seems to be a clear bias in favour of evolutionary weight training and a bias against PCA-based feature reduction. In the second case, PCA fails to provide an adequately discriminant feature set for data categorisation.
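The compactness comparison can be made concrete with a rough count of connection weights. The sketch below is illustrative only: it assumes a fully connected one-hidden-layer MLP with bias terms and one output node per class (two for Musk), none of which is stated in this section, and the helper `mlp_weight_count` is a hypothetical name. The input and hidden layer sizes are the averages reported in Table III for the Musk data set, so the counts are estimates rather than the sizes of actual evolved networks.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs, with_bias=True):
    """Connection weights of a fully connected one-hidden-layer MLP."""
    bias = 1 if with_bias else 0
    return (n_inputs + bias) * n_hidden + (n_hidden + bias) * n_outputs


# Average (inputs, hidden nodes) reported in Table III for the Musk data set.
musk = {
    "BP": (166.0, 20.0),
    "Feat": (103.5, 10.0),
    "Str": (166.0, 7.8),
    "PCA+Str": (26.0, 5.3),
    "ANNE": (65.8, 8.6),
}
N_OUTPUTS = 2  # assumption: one output node per class

estimates = {name: mlp_weight_count(n_in, n_hid, N_OUTPUTS)
             for name, (n_in, n_hid) in musk.items()}
for name, count in estimates.items():
    print(f"{name:8s} about {count:.0f} weights")
```

Under these assumptions the ANNE estimate falls below those of the Feat and Str algorithms and above that of PCA+Str, which matches the ordering described in the text.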
Evolutionary structure optimisation produces the worst results on the Vowel classification problem, where manual design achieves smaller or comparable ANN topologies. Inspection of the learning curves reveals that this failure is due to quick overfitting of the weight training process, which prematurely stops the search for more compact structures.

As a last remark, despite the removal of an often large number of features, none of the feature selection algorithms appreciably improves the accuracies obtained by algorithms working on the full set of data attributes. This result suggests the presence of superfluous attributes, rather than features that negatively affect the separability of the classes. The ability of ANNs, and in particular of MLPs, to cope with noisy input signals should also be considered. Via the parameter learning procedure, the solutions are taught to zero the weight of harmful and conflicting features. Therefore, provided that the size of the data sample is adequate to learn the categorisation task using the full set of attributes, the removal of even a large number of features has no effect on the classifier accuracy. However, feature selection is of primary importance since it reduces the costs of collecting and pre-processing the data feature set.

VIII. CONCLUSIONS

ANNE, a new algorithm for data feature selection and ANN design and training, is presented. The approach is characterised by two distinguishing features, namely the simultaneous evolution of the whole ANN solution (that is, the input vector, the ANN architecture and the weights) and the use of the embedded approach for feature selection. In both cases, the proposed study addresses poorly investigated areas in the evolutionary generation of ANN classifiers. The proposed algorithm is applied to the optimisation of a popular type of ANN classifier, the MLP. Five control algorithms are generated through combinations of manual and automatic feature selection and ANN design techniques. ANNE's performance is compared to the performance of the five control algorithms on seven real world benchmark problems.

Experimental evidence shows that the proposed method generates very compact solutions capable of accurate and robust learning results. Evolutionary feature selection achieves significant reduction of the set of input features in most of the benchmark problems with no degradation of the ANN performance. Experimental comparison with the popular PCA feature reduction algorithm shows that evolutionary feature selection allows more accurate and consistent learning results. Compared to manual optimisation of the ANN architecture, evolutionary ANN design creates smaller topologies capable of competitive learning results. Moreover, the proposed evolutionary technique removes the need for time-consuming manual design.

The limited scope of this paper allowed only a partial discussion on the merits of the proposed technique. In view of the good learning results and the compactness of the evolved solutions, it can be concluded that the ANNE algorithm is a promising tool for the automatic generation of ANN classifiers. Further work should address a discussion on the complexity of the solution space as well as comparison with other algorithms in the literature.

REFERENCES

[1] R. P. Lippmann, "An introduction to computing with neural nets", IEEE ASSP Mag., vol. 4, issue 2, part 1, pp. 4-22, 1987.
[2] D. B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 2nd ed., New York, IEEE Press, 2000.
[3] X. Yao, "Evolving artificial neural networks", Proc. IEEE, vol. 87, no. 9, pp. 1423-1447, 1999.
[4] E. Cantu-Paz and C. Kamath, "An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems", IEEE Trans. on Syst., Man, and Cyb., part B, vol. 35, no. 5, pp. 915-927, 2005.
[5] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning", Art. Intell., no. 97, pp. 245-271, 1997.
[6] S. Salcedo-Sanz, G. Camps-Valls, F. Perez-Cruz, J. Sepulveda-Sanchis and C. Bousoño-Calzon, "Enhancing genetic feature selection through restricted search and Walsh analysis", IEEE Trans. on Syst., Man and Cyb., part C, vol. 34, no. 4, pp. 398-406, 2004.
[7] L. Portinale and L. Saitta, "Feature selection", Deliverable D14.1, IST Project MiningMart, IST-11993, 2002.
[8] K. Z. Mao, "Orthogonal forward selection and backward elimination algorithms for feature subset selection", IEEE Trans. on Syst., Man and Cyb., part B, vol. 34, no. 1, pp. 629-634, 2004.
[9] P. Zhang, B. Verma and K. Kumar, "Neural vs. statistical classifier in conjunction with genetic algorithm based feature selection", Pattern Recogn. Letters, vol. 26, pp. 909-919, 2005.
[10] V. Schetinin, "A learning algorithm for evolving cascade neural networks", Neural Processing Letters, no. 17, pp. 21-31, 2003.
[11] D. Montana and L. Davis, "Training feedforward neural networks using genetic algorithms", Proc. 11th Int. Joint Conf. on AI, Detroit, MI, pp. 762-767, 1989.
[12] D. B. Fogel, L. J. Fogel and V. W. Porto, "Evolutionary programming for training neural networks", Proc. Int. Joint Conf. on NNs, S. Diego, CA, pp. 601-605, 1990.
[13] F. Menczer and D. Parisi, "Evidence of hyperplanes in the genetic learning of neural networks", Biol. Cyb., vol. 66, pp. 283-289, 1992.
[14] W. Yan, Z. Zhu and R. Hu, "Hybrid genetic/BP algorithm and its application for radar target classification", Proc. 1997 IEEE Nat. Aerospace and Electr. Conf., NAECON part 2, pp. 981-984, 1997.
[15] D. Thierens, J. Suykens, J. Vandewalle and B. De Moor, "Genetic weight optimisation of a feedforward neural network controller", in Artificial neural networks and genetic algorithms, R. F. Albrecht, C. R. Reeves and N. C. Steele, Eds., Wien, A, Springer-Verlag, 1993, pp. 658-663.
[16] R. Miikkulainen and D. E. Moriarty, "Efficient reinforcement learning through symbiotic evolution", Mach. Learn., vol. 22, no. 1-3, pp. 11-32, 1996.
[17] X. Yao and Y. Liu, "Fast evolution strategies", Proc. of the 6th Annual Conf. on Ev. Progr. EP97, Lecture Notes in Computer Science, vol. 1213, Springer-Verlag, Berlin, D, 1997, pp. 151-161.
[18] R. Reed, "Pruning algorithms - A survey", IEEE Trans. Neural Networks, vol. 4, pp. 740-747, 1993.
[19] R. Parekh, J. H. Yang and V. Honavar, "Constructive neural-network learning algorithms for pattern classification", IEEE Trans. Neural Networks, vol. 11, no. 2, pp. 436-451, 2000.
[20] G. F. Miller, P. M. Todd, and S. U. Hegde, "Designing neural networks using genetic algorithms", Proc. 3rd Int. Conf. on GAs and Appl., Arlington, VA, pp. 379-384, 1989.
[21] S. A. Harp, T. Samad and A. Guha, "Designing application-specific neural networks using the genetic algorithm", in Advances in neural information processing systems, vol. 2, D. S. Touretzky, Ed., Morgan Kaufmann, San Mateo, CA, 1990, pp. 447-454.
[22] H. Kitano, "Designing neural networks using genetic algorithms with graph generation system", Complex Syst., vol. 4, no. 4, pp. 461-476, 1990.
[23] A. Cangelosi and J. L. Elman, "Gene regulation and biological development in neural networks: an exploratory model", Tech. Rep., CRL-UCSD, Univ. California, San Diego, CA, 1995.
[24] J. Chvál, "Evolving artificial neural networks by means of evolutionary algorithms with L-systems based encoding", Res. Rep., Dept. Cyb. and Art. Intell., Tech. Univ. Košice, Sk, 2002.
[25] F. Aboitiz, "Mechanisms of adaptive evolution - darwinism and lamarckism restated", Med. Hypotheses, vol. 38, no. 3, pp. 194-202, 1992.
[26] UCI Machine Learning Repository, Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[27] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate analysis, London, UK, Academic Press, 1979.