Evaluation of Neural Network Robust Reliability Using Information-Gap Theory

S. Gareth Pierce, Yakov Ben-Haim, Keith Worden, and Graeme Manson

Abstract—A novel technique for the evaluation of neural network robustness against uncertainty using a nonprobabilistic approach is presented. Conventional optimization techniques were employed to train multilayer perceptron (MLP) networks, which were then probed with an uncertainty analysis using an information-gap model to quantify the network response to uncertainty in the input data. It is demonstrated that the best performing network on data with low uncertainty is not in general the optimal network on data with a higher degree of input uncertainty. Using the concepts of information-gap theory, this paper develops a theoretical framework for information-gap uncertainty applied to neural networks, and explores the practical application of the procedure to three sample cases. The first consists of a simple two-dimensional (2-D) classification network operating on a known Gaussian distribution, the second a nine-class vibration classification problem from an aircraft wing, and the third a two-class example from a database of breast cancer incidence.

Index Terms—Information-gap, neural networks, robustness, uncertainty.

Manuscript received March 4, 2005; revised October 21, 2005. This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) under Grants GR/R96415/01 and RA 013700 in association with the Defense Science and Technology Laboratory (DSTL), Farnborough, U.K. S. G. Pierce is with the Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1XW, U.K. (e-mail: s.g.pierce@eee.strath.ac.uk). Y. Ben-Haim is with the Faculty of Mechanical Engineering, Technion—Israel Institute of Technology, Haifa 32000, Israel. K. Worden and G. Manson are with the Department of Mechanical Engineering, Dynamics Group, University of Sheffield, Sheffield S1 3JD, U.K. Digital Object Identifier 10.1109/TNN.2006.880363

I. INTRODUCTION

Traditional backpropagation-based training of feedforward artificial neural networks (ANNs) is accomplished by a process of minimization of an error function which quantifies the network output performance (whether regression or classification) in terms of the difference between the network predicted values and the true target values [1] applied to some set of training data. The basic approach of direct gradient descent has been much improved in recent years by the development of more sophisticated search algorithms such as scaled conjugate gradients [2]. Current work on genetic algorithms [3] and subdivisional searches of the error surface [4] seeks to improve the location of the true global minimum, and to avoid the problem of solution trapping in local minima which plagues all gradient-based techniques.

Assuming that satisfactory network convergence has been obtained from the training data, probably the next largest issue lies in the ability of the trained network to provide satisfactory classification performance on previously unseen data. The generalization capacity of a network is a measure of the network's capacity to learn the underlying structure of the data rather than any noise present in the data. For the example of a classifier, a network with good generalization will tend to produce similar classification rates for both the training and (independent but drawn from the same distribution) testing data, whereas a poorly generalizing network would give high classification on the training set and relatively poor performance on the test data. Such a network would be said to be overfitted. The problem is likely to be exacerbated for large networks (with many independent parameters) and limited amounts of training data. An often cited empirical requirement is that the minimum number of training patterns $N$ required for good generalization performance is given by [1], [2]

$$N \geq \frac{W}{\varepsilon} \qquad (1)$$

where $W$ is the total number of independent network weights and biases and $\varepsilon$ is the fraction of classification errors permitted on the test data. Practically, cross-validation and early stopping [1], [2] using an independent validation data set are used as termination or selection criteria for network training, the final network performance being evaluated using a third independent test data set.

Good generalization performance is an indicator that the information capacity of the network (reflected in the number of weights) is of the same order as, or smaller than, the total information content of the training set. The rationale here is that if the network is sufficiently complex, it can memorize all the features of the data (including noise). For the network to generalize, it has to store just the important features of the training data. This being stated, there also exists evidence to suggest that generalization performance depends more on the size of the weights rather than their number [5].

Over recent years, sophisticated methods of addressing the overfitting problem through regularization techniques have been developed. The problem with overfitted networks can be described as a tendency to converge to weight matrix solutions with high component values that tend to generate excessively sharp decision boundaries (where the sharpness is more characteristic of noise in the data) [2], [6]. Regularization seeks to penalize the large weight and bias values of the network that are associated with such sharp decision boundaries. A simple way to implement regularization is using an additive weight decay term in the error function [1]. A more sophisticated basis for regularization can be found in the approach of Bayesian-evidence update techniques for network training [6], [7], which frame the optimization problem rather differently. The maximum-likelihood approach seeks to find a unique value of the network weights

and biases corresponding to the optimized value, whereas the Bayesian technique marginalizes over all possible combinations of the weights, assuming that the solution weight matrix has a posterior probability distribution which will likely be centered close to or at the maximum-likelihood solution. Such an approach is useful as network weight regularization falls naturally into the framework; additionally, it is possible to estimate confidence bounds on the output predictions [6], [7] based on the widths of the posterior probability functions for the weight matrix.

The problems of network overfitting and lack of generalization are central to understanding the inertia to the practical application of ANNs, especially to safety critical applications. If, for example, we envisage an ANN classifier being used to assess the condition of a major structural component, it is imperative that the performance of the classifier over the most diverse range of inputs is well understood. Poor classifier performance could result in catastrophic failure and possible loss of life. Although a range of techniques (including the aforementioned Bayesian approach) have been developed for output confidence interval predictions [6]–[14], they all adopt a probabilistic standpoint and, therefore, suffer from the common drawback that since the probability distributions are usually estimated from the low-order moments of the data (typically mean and standard deviation), there is often no validation of the extremes of the distributions. Unfortunately, it is often the extreme events of the data that are likely to be associated with the unpredictable failure events of greatest interest. The use of fuzzy membership functions [15] provides one approach to understanding the effects of input uncertainties on classifier performance.

In this paper, a novel nonprobabilistic approach is described which was applied to the prediction of bounded worst case errors in the presence of a specified level of uncertainty in the input data. Furthermore, having quantified bounded worst case error performance for a specific level of input uncertainty, the technique is then proposed as a tool to discriminate between otherwise equivalent networks possessing sharply contrasting responses to input uncertainty. The technique is based on the theory of information-gap uncertainty [16] and lies in presenting both crisp (single valued) data and interval [17] data to a number of neural networks under evaluation. The basic idea is to design the ANN to be: 1) robust to unknown idiosyncratic variation of the learning data, and 2) able to classify data at a specified rate of success. The authors' preliminary findings on interval-based [17] techniques to implement information-gap analysis have been applied to nonlinear regression [18] and real damage classification based on low-frequency vibration data of an aircraft wing [19], although no formal presentation of the underlying theory of the information-gap was given in these publications.

In this paper, the authors seek to explicitly formalize the mathematical background to the interval information-gap technique and illustrate the application of the technique to three separate examples. The first example comprises a simple two-class classification problem based on a Gaussian distribution of two-dimensional (2-D) data in order to illustrate the underlying concepts of the information-gap approach. The second and third examples draw on higher dimensional data sets forming


studies of aircraft wing vibration [20] and part of a study of breast cancer classification [22]–[24]. Multilayer perceptron (MLP) networks were used to implement the classifiers, and the interval results obtained were compared with conventional network training based on cross validation (using both maximum-likelihood and Bayesian-evidence training incorporating weight regularization).

It is demonstrated that the use of interval-based network evaluation allows two important concepts to be explored. First, a new criterion for network selection can be established, as the information-gap technique allows the identification of an ANN classifier which is intrinsically more robust to uncertainty on the input data than network solutions obtained by conventional training. Second, and of equal importance, the reliance on probabilistic-based estimates of confidence bounds on network predictions is obviated by virtue of the fully conservative nature of interval sets. It is possible to make worst case error predictions that are an inclusive bounded solution set (for a specified degree of input uncertainty to the classifier). This approach circumvents the inherent problem associated with all Monte-Carlo-based sampling techniques of the input space, where it is impossible to guarantee that all possible combinations of input space have been correctly sampled. For low-dimensional classification and regression problems, it is possible to use the vertices of an interval set [18] as an alternative to, and to verify the results of, interval propagation. However, for high-dimensional and possibly highly nonlinear classifiers, the interval technique provides a highly computationally efficient (a single forward propagation through a given network structure) alternative to computationally intensive Monte-Carlo methods, and furthermore provides a guarantee of conservative bounds on output predictions.

II. NETWORK STRUCTURE AND CONVENTIONAL IMPLEMENTATION

A. Network Structure

The MLP network implementation and training was undertaken in MATLAB using the NETLAB toolbox developed by Nabney [6]. The first example data set presented consists of a simple 2-D Gaussian distribution (chosen to illustrate the basic concepts) comprising two classes; therefore, each network had two input nodes ($x_1$ and $x_2$) and two output nodes corresponding to the classes $C_1$ and $C_2$. The outputs from the second layer are given by

$$a_k = \sum_{j=1}^{M} w_{kj} \tanh\!\left( \sum_{i=1}^{d} w_{ji} x_i + b_j \right) + b_k, \qquad k = 1, \ldots, c \qquad (2)$$

where $w$ denotes the weight matrix, $b$ the bias vector, $d$ the number of input nodes, $M$ the number of hidden nodes, and $c$ the number of output nodes (the hyperbolic tangent hidden activation is that used by the NETLAB MLP implementation).

The examples of real world data were from the vibration properties of a GNAT aircraft wing (nine input parameters and nine


output classes) and a Breast Cancer Incidence Database (nine input parameters, two output classes). The network output was given by transformation of the second-layer activations by the output activation function. Since there was a series of $c$ independent output classes, it was appropriate to utilize the softmax function [9]

$$y_k = \frac{\exp(a_k)}{\sum_{k'=1}^{c} \exp(a_{k'})} \qquad (3)$$

where $y_k$ is the output for the $k$th class and $k = 1$ to $c$, where $c$ is the total number of network outputs. The choice of the softmax activation function ensured that the outputs always summed to unity, and thus could be directly interpreted as class-conditional probability values. NETLAB automatically used an appropriate entropy function [6] for the 1-of-$c$ output class coding to calculate the network error function and included a weight decay penalty regularization term

$$E = -\sum_{n} \sum_{k=1}^{c} t_k^n \ln y_k^n + \frac{\alpha}{2} \sum_{i} w_i^2 \qquad (4)$$

where $t_k^n$ is the target variable for presentation $n$, $y_k^n$ is the network prediction for presentation $n$, $\alpha$ is the weight decay hyperparameter, and $w_i$ are the network weights.

A conventional cross-validation approach [1], [2] was used to investigate both crisp (single-valued) and interval-based input data sets. Note that all network training was performed on the crisp input data set. Conventional training was undertaken using both maximum-likelihood and Bayesian-evidence update techniques [2], [6]–[8] to ensure that good network generalization performance was obtained.
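For concreteness, the following minimal sketch shows how the second-layer activations (2), the softmax outputs (3), and the regularized cross-entropy error (4) fit together. It is an illustrative NumPy re-statement rather than the authors' MATLAB/NETLAB implementation; the function and variable names are assumptions made here, and the weight decay term is applied to all parameters (including biases) purely for simplicity.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: tanh hidden units, softmax outputs (cf. (2)-(3))."""
    z = np.tanh(W1 @ x + b1)      # hidden-unit activations
    a = W2 @ z + b2               # second-layer activations, eq. (2)
    e = np.exp(a - a.max())       # subtract the max for numerical stability
    return e / e.sum()            # softmax outputs, eq. (3); they sum to unity

def regularized_error(X, T, W1, b1, W2, b2, alpha):
    """Cross-entropy over all presentations plus a weight-decay penalty (cf. (4))."""
    E = 0.0
    for x, t in zip(X, T):        # t is a 1-of-c coded target vector
        y = mlp_forward(x, W1, b1, W2, b2)
        E -= np.sum(t * np.log(y))
    params = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
    return E + 0.5 * alpha * np.sum(params ** 2)

# Toy usage: d = 2 inputs, M = 3 hidden nodes, c = 2 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(mlp_forward(np.array([0.1, -0.4]), W1, b1, W2, b2))
```

With a 1-of-$c$ coded target, the cross-entropy term in (4) reduces to minus the log of the output assigned to the true class of each presentation.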

B. Conventional Network Selection—Classification for Crisp Data

Having independently trained a family of networks on the crisp training data (each distinct network training session used identical training data but started from different randomly chosen initial conditions), the performance of each individual network was evaluated on both the validation and test data sets. Formally, the fundamental elementary input to the neural network was denoted $\mathbf{x}$ and the output $\mathbf{y}(\mathbf{x}; q)$. Using the softmax activation of (3), $y_k$ was the posterior probability that the input arose from class $k$. The neural network design vector $q$ specified the structure and parameters of the neural net. The classification algorithm in the crisp case, without consideration of uncertainty, was denoted $\hat{c}(\mathbf{x})$, whose value indicated the index of the class represented by measurement $\mathbf{x}$. The classification algorithm was simply that $\hat{c}(\mathbf{x})$ was equal to the index of the maximal element of $\mathbf{y}(\mathbf{x}; q)$, that is

$$\hat{c}(\mathbf{x}) = \arg\max_{k} \; y_k(\mathbf{x}; q). \qquad (5)$$

This approach was entirely conventional as follows. After each individual network had finished the training phase, the validation data set was used to appraise the performance. The best performing network was the one which delivered the best overall classification performance on the validation data set. A final check on performance was to measure the classifier performance on a third independent test data set.

III. INFORMATION-GAP MODEL OF UNCERTAINTY

A. Introduction to Information-Gap Theory

Having trained networks on the crisp training data, both validation and test data were presented to the networks having made the data sets uncertain using an interval expansion. The validation or test data set contained $N$ measured input vectors denoted $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The provenance of these vectors was known, that is, the class membership was known when the vector was measured. Specifically, datum $\mathbf{x}_i$ was obtained when the data point belonged to class $c_i$. However, it was not guaranteed that these vectors accurately or comprehensively represented the conditions from which they arose. That is, very different vectors could well have arisen from the same classes of conditions. This sort of information-gap associated with generic categories of classification target is quite common because of the complexity and variability of failure processes and their effects. This is especially true for high-dimensional data sets. The uncertainty surrounding each of the test data vectors $\mathbf{x}_i$ was represented by defining a local information-gap model

$$\mathcal{U}_i(h) = \left\{ \mathbf{x} : \| \mathbf{x} - \mathbf{x}_i \|_\infty \leq h \right\}, \qquad h \geq 0 \qquad (6)$$

where $\| \cdot \|_\infty$ was the infinity norm in $\mathbb{R}^d$, defined as

$$\| \mathbf{x} - \mathbf{x}_i \|_\infty = \max_{j} \left| x_j - x_{ij} \right| \qquad (7)$$

where $h$ was the unknown "horizon of uncertainty" and $j$ was the $j$th component label for the individual variable. Thus $\mathcal{U}_i(h)$ was not a single set, but rather an unbounded family of nested sets of possible realizations $\mathbf{x}$. With this particular choice of norm, the information-gap model contained hypercuboids and thus convex sets. The global information-gap model was the Cartesian product of the local models

$$\mathcal{U}(h) = \mathcal{U}_1(h) \times \mathcal{U}_2(h) \times \cdots \times \mathcal{U}_N(h). \qquad (8)$$

The elements of $\mathcal{U}(h)$ were vectors of the form $(\mathbf{x}_1', \ldots, \mathbf{x}_N')$ where $\mathbf{x}_i' \in \mathcal{U}_i(h)$. Note that the Cartesian product is used because the elements of $\mathcal{U}(h)$ are possible training sets rather than points. $\mathcal{U}(h)$ obeyed the two basic axioms of information-gap models: namely nesting and contraction [16]. Nesting asserts


Fig. 1. Illustration of multiple interval output classification for a nine-class problem. (a) Uncertainty = 0.03. (b) Uncertainty = 0.06.

that the uncertainty sets become more inclusive as the horizon of uncertainty increases

$$h \leq h' \;\Longrightarrow\; \mathcal{U}(h) \subseteq \mathcal{U}(h'). \qquad (9)$$

Contraction asserts that the measured test data are contained within the information-gap model at all horizons of uncertainty (i.e., for all values of $h$)

$$(\mathbf{x}_1, \ldots, \mathbf{x}_N) \in \mathcal{U}(h) \quad \text{for all } h \geq 0. \qquad (10)$$

Clearly, the local information-gap models also displayed nesting and contraction. Note that $\mathcal{U}(h)$ was a convex information-gap model since, from (6), the local models were convex information-gap models.
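A minimal sketch of the local model (6)-(7) and of the nesting and contraction properties (9)-(10), using the axis-aligned box (hypercuboid) form implied by the infinity norm; the function names and the example datum are illustrative only.

```python
import numpy as np

def local_infogap_set(x_i, h):
    """U_i(h): axis-aligned box of half-width h around the measured datum x_i (eqs. (6)-(7))."""
    x_i = np.asarray(x_i, dtype=float)
    return x_i - h, x_i + h          # lower and upper corners of the hypercuboid

def contains(box, x):
    lo, hi = box
    return np.all(lo <= x) and np.all(x <= hi)

x_i = np.array([0.2, -1.3])
small = local_infogap_set(x_i, 0.03)
large = local_infogap_set(x_i, 0.06)

# Contraction (10): the measured datum lies in U_i(h) for every h >= 0.
assert contains(small, x_i) and contains(large, x_i)

# Nesting (9): every point of U_i(0.03) also lies in U_i(0.06).
assert np.all(small[0] >= large[0]) and np.all(small[1] <= large[1])
```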

B. Robustness Function

The quantification of the margin of uncertainty tolerance was begun by defining a decision function for the test datum $\mathbf{x}_i$ at uncertainty $h$, whose value was the index of the element of the output vector which was maximal on the set of all output vectors in the local information-gap model at $h$

$$\hat{c}(\mathbf{x}_i, h) = \arg\max_{k} \; \max_{\mathbf{x} \in \mathcal{U}_i(h)} y_k(\mathbf{x}; q). \qquad (11)$$

$\hat{c}(\mathbf{x}_i, h)$ was the neural network decision, at horizon of uncertainty $h$, for the test datum $\mathbf{x}_i$, and represented the index for which the strongest evidence existed at this value of $h$ (recall that $y_k$ is a conditional probability for class $k$). The notation was useful in emphasizing the dependence of the network classification on both the horizon of uncertainty and the value of the test datum. However, it was cumbersome, so the following lighter notation was adopted:

$$\hat{c}_i(h) \equiv \hat{c}(\mathbf{x}_i, h). \qquad (12)$$

If it was required to refer succinctly to the index selected by the neural network at uncertainty value $h$ based on the $i$th test datum, then the notation $\hat{c}_i(h)$ was adopted. Recall that $\mathbf{x}_i$ was obtained for membership of class $c_i$. Thus the decision was correct if $\hat{c}_i(h) = c_i$.

The quantification of robustness to uncertainty employed a lower hit function which was binary valued and depended on the success or failure of the decision algorithm. The lower hit function, at horizon of uncertainty $h$, equalled unity if the neural network decision was unambiguously correct, and equalled zero otherwise, as shown in (13)

$$\underline{H}_i(h) = \begin{cases} 1, & \text{if } \hat{c}_i(h) = c_i \text{ and } \displaystyle \min_{\mathbf{x} \in \mathcal{U}_i(h)} y_{\hat{c}_i(h)}(\mathbf{x}; q) > \max_{k \neq \hat{c}_i(h)} \; \max_{\mathbf{x} \in \mathcal{U}_i(h)} y_k(\mathbf{x}; q) \\ 0, & \text{otherwise.} \end{cases} \qquad (13)$$

The first condition asserts that the neural network identified the correct class, i.e., at horizon of uncertainty $h$, datum $\mathbf{x}_i$ was identified correctly as corresponding to class $c_i$. The minimum term is the lowest neural network output from among all the realizations of the input at uncertainty $h$, in the class selected by the neural network. The maximum term is the greatest neural network output from among all the realizations of the input at uncertainty $h$, in all the classes other than that which was selected by the neural network. The second condition therefore means that at horizon of uncertainty $h$, the least supportive evidence in favor of the selected class was more supportive than the most supportive evidence in favor of any other class. In short, the lower hit function equals unity for datum $\mathbf{x}_i$ at uncertainty $h$ if and only if the neural network gets it unambiguously correct. It is evident that if the lower hit function has an initial value of 1 (indicating correct classification) then it will eventually flip from 1 to 0 as the horizon of uncertainty $h$ increases beyond a critical value.

Fig. 1 illustrates this concept for the nine-class GNAT wing problem [19]. In Fig. 1, the interval output predictions for the


nine classes are shown (along with the crisp output values) when the input uncertainty parameters were 0.03 and 0.06, respectively. In Fig. 1(a), the network uniquely (and incidentally correctly) identified the input data as belonging to class 1. The lowest output for class 1 (labelled "threshold") was clearly higher than the greatest output from any of the other classes, and hence the network classification was unambiguous. However, as the input uncertainty parameter increased, the output uncertainties also grew, and a point was reached where the lowest output from the winning class was no longer higher than the greatest output from any of the other classes. That is to say, some of the other classes' highest interval output bound had risen above the threshold value. In Fig. 1(b), this occurred for classes 2 and 3, and hence the classifier identified that particular input datum as belonging to any of classes 1, 2, or 3. The classifier was deemed to have misclassified at the horizon of uncertainty in Fig. 1(b).

It was then possible to define the information-gap robustness function. The total number of correct and unambiguous classifications (or hits) by the neural network was given by

$$n(h) = \sum_{i=1}^{N} \underline{H}_i(h) \qquad (14)$$

which was some fraction of the total number of test vectors $N$. Let $r_c$ be the lowest acceptable fraction of successful hits. The robustness of the neural network, which was specified by parameters $q$, was the greatest horizon of uncertainty at which the number of correct and unambiguous classifications was at least a fraction $r_c$ of the number of test data

$$\hat{h}(q, r_c) = \max\left\{ h : \frac{n(h)}{N} \geq r_c \right\}. \qquad (15)$$

It is clear that because the lower hit function decreases with increasing $h$, the robustness decreases as the fractional success $r_c$ increases

$$r_c' > r_c \;\Longrightarrow\; \hat{h}(q, r_c') \leq \hat{h}(q, r_c). \qquad (16)$$

Equation (16) represents a tradeoff: An increase in the critical rate of successful classification $r_c$ causes a decrease in the robustness to uncertainty $\hat{h}(q, r_c)$.

C. Opportunity Function

Retaining the definition of the decision algorithm defined in (11) and (12), the ambiguous decision algorithm for test datum $\mathbf{x}_i$ was the set of all classes which could possibly be responsible for the observed test datum

$$\hat{C}_i(h) = \left\{ k : \max_{\mathbf{x} \in \mathcal{U}_i(h)} y_k(\mathbf{x}; q) \;\geq\; \min_{\mathbf{x} \in \mathcal{U}_i(h)} y_{\hat{c}_i(h)}(\mathbf{x}; q) \right\}. \qquad (17)$$

The minimum term in (17) was the same as in (13): it was the least supportive evidence for the class $\hat{c}_i(h)$ selected by the decision algorithm. In contrast, the maximum term was the most supportive evidence for class $k$. The condition in (17) asserts that the strongest evidence for class $k$ was at least as strong as the weakest evidence for class $\hat{c}_i(h)$.

As previously described, $\hat{c}_i(h)$ was the class to which datum $\mathbf{x}_i$ was assigned at uncertainty $h$ by the decision algorithm (11). Recall that $\hat{c}_i(h)$ was not necessarily the correct classification of $\mathbf{x}_i$. $\hat{C}_i(h)$ was the set of all decision classes which at uncertainty $h$ contained at least one neural network output which was at least as supportive as the least supportive output of class $\hat{c}_i(h)$. The set $\hat{C}_i(h)$ always contained the index $\hat{c}_i(h)$, and when $h$ was sufficiently small, the set contained only this index. However, as the horizon of uncertainty was increased, the least supportive evidence for class $\hat{c}_i(h)$ became less convincing while the most supportive evidence from other classes became more so, and, therefore, the set $\hat{C}_i(h)$ became more inclusive. This is illustrated in Fig. 1.

Now define an upper hit function for datum $\mathbf{x}_i$ at uncertainty $h$, which equalled unity if and only if the correct class $c_i$ belonged to the set $\hat{C}_i(h)$. In other words, the upper hit function equalled unity if $\mathbf{x}_i$ was correctly classified at uncertainty $h$ even if that classification was not unique

$$\overline{H}_i(h) = \begin{cases} 1, & \text{if } c_i \in \hat{C}_i(h) \\ 0, & \text{otherwise.} \end{cases} \qquad (18)$$

Referring back to Fig. 1, we note that for small uncertainty values [Fig. 1(a)] the classifier provided unambiguous selection of class 1. As the horizon of uncertainty of the input data increased, the classifier did not provide a unique solution. This is shown in Fig. 1(b), where the classifier identified that the input datum belonged to any of classes 1, 2, or 3. The best case error position was that in this situation the classifier was deemed to have correctly classified $\mathbf{x}_i$. It is clear that the upper hit function can never be lower than the lower hit function

$$\overline{H}_i(h) \geq \underline{H}_i(h). \qquad (19)$$

Furthermore, if $\overline{H}_i(h)$ has an initial value of 0, then it will flip from 0 to 1 as the horizon of uncertainty $h$ increases beyond a critical value. This is in contrast to $\underline{H}_i(h)$, which flips from 1 to 0 as the horizon of uncertainty increases.

The sum $\overline{n}(h) = \sum_{i=1}^{N} \overline{H}_i(h)$ was the total number of test data which were classified correctly but possibly also ambiguously. The opportuneness of network parameters $q$ was the lowest horizon of uncertainty at which ambiguous classification of as much as the fraction $r_w$ of the test data was possible but not guaranteed

$$\hat{\beta}(q, r_w) = \min\left\{ h : \frac{\overline{n}(h)}{N} \geq r_w \right\}. \qquad (20)$$

The information-gap nesting dictates a tradeoff between opportunity and windfall aspiration [16]. $\hat{\beta}(q, r_w)$ increased (got worse) as $r_w$ got larger (got better), which was precisely the opposite trend from (16)

$$r_w' > r_w \;\Longrightarrow\; \hat{\beta}(q, r_w') \geq \hat{\beta}(q, r_w). \qquad (21)$$

Equation (21) represents a tradeoff: An increase in the windfall rate of successful classification $r_w$ causes an increase in the level of uncertainty $\hat{\beta}(q, r_w)$ needed to obtain that windfall.

The robustness for unambiguously correct classification $\hat{h}$ and the opportunity $\hat{\beta}$ could be either sympathetic or antagonistic with respect to changes in the design of the neural network. That is, a modification of $q$ which caused $\hat{h}$ to improve (increase) may cause $\hat{\beta}$ to either improve (decrease) or deteriorate (increase).
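The sketch below shows how the hit functions (13) and (18) and the robustness and opportuneness functions (15) and (20) can be evaluated once per-datum interval output bounds are available at each horizon of uncertainty. The array layout and names are assumptions made here, the bounds are assumed to come from a conservative interval propagation of the kind described later in Section IV, and the "rates" are the means of the hit functions over the test set.

```python
import numpy as np

def lower_hit(y_lo, y_hi, true_class):
    """Eq. (13): 1 if the selected class is correct and its lowest output
    exceeds the highest output of every other class, else 0."""
    k_hat = int(np.argmax(y_hi))                 # decision (11): strongest evidence
    others_hi = np.delete(y_hi, k_hat)
    return int(k_hat == true_class and y_lo[k_hat] > others_hi.max())

def upper_hit(y_lo, y_hi, true_class):
    """Eq. (18): 1 if the true class is in the ambiguous decision set (17)."""
    k_hat = int(np.argmax(y_hi))
    ambiguous = {k for k in range(len(y_lo)) if y_hi[k] >= y_lo[k_hat]}   # set (17)
    return int(true_class in ambiguous)

def robustness(h_grid, lower_rates, r_c):
    """Eq. (15): greatest evaluated h at which the mean lower hit rate is at least r_c."""
    ok = [h for h, rate in zip(h_grid, lower_rates) if rate >= r_c]
    return max(ok) if ok else 0.0

def opportuneness(h_grid, upper_rates, r_w):
    """Eq. (20): smallest evaluated h at which the mean upper hit rate reaches r_w."""
    ok = [h for h, rate in zip(h_grid, upper_rates) if rate >= r_w]
    return min(ok) if ok else np.inf
```

In practice both functions are read off curves such as Figs. 3-5: the robustness is the largest uncertainty at which the lower hit rate curve still sits above the demanded rate, and the opportuneness is the smallest uncertainty at which the upper hit rate curve first reaches the aspired windfall rate.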


Fig. 2. 1000 points of training data sampled from a 2-D Gaussian distribution comprising three centers (large "+" signs) and two classes.

IV. EXPERIMENTAL ILLUSTRATION OF INFORMATION-GAP TECHNIQUE

A. 2-D Gaussian Data

Having described the theoretical background to the information-gap analysis and its application to the design of neural networks, we applied the techniques to a simple toy data set to illustrate the practical implications. Three data sets labelled training, validation, and test were independently drawn from a known generative 2-D set of Gaussian distribution functions. Fig. 2 illustrates the distribution of training data, which consisted of three independent centers (shown as large "+" signs) divided into two designated classes; the dotted line indicates the 0.5 probability contour derived from the posterior distribution, and therefore represents the ideal analytic Bayesian classifier boundary. Two complete data sets were studied; the first consisted of 100 points of data (for each of the training, validation, and test sets), the second of 1000 points. The calculated Bayesian classification rates for the test data set were 90.0% and 84.2%, respectively, for the 100- and 1000-point sets.
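A data set of this kind can be generated along the following lines. The paper does not list the centers, mixing proportions, or spreads actually used, so the values below are placeholders chosen only to illustrate the structure (three centers, two classes, independent training/validation/test draws).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example layout: three isotropic Gaussian centers, two assigned to
# class 0 and one to class 1 (the actual parameters used in the paper are not given).
centers = [(-0.5, 0.3), (0.6, 0.6), (0.1, -0.5)]
labels  = [0, 0, 1]                # class membership of each center
sigma   = 0.35                     # assumed common standard deviation

def draw(n_points):
    idx = rng.integers(0, len(centers), size=n_points)   # equal mixing proportions assumed
    X = np.array([rng.normal(centers[i], sigma) for i in idx])
    t = np.array([labels[i] for i in idx])
    return X, t

train_X, train_t = draw(1000)      # validation and test sets are drawn independently
```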

B. Crisp Network Training and Results

The number of hidden nodes in the second network layer was varied between 1 and 15 hidden units. Each individual network structure was trained with 100 independent training sessions starting at differently randomly chosen points on the error surface, so that a total of 1500 independent networks were evaluated. Up to 1000 iterations of a scaled conjugate gradient optimization were implemented using both conventional maximum-likelihood- and Bayesian-evidence-based update training [2], [6], [7]. The purpose of the two different size data sets and training techniques was to investigate the generalization behavior displayed by the two training regimes.

Note that for networks with a single hidden node there were a total of seven independent components of the weight and bias matrix. For networks with 15 hidden nodes, there were 77 such components. Allowing for a misclassification of $\varepsilon = 0.1$ (10%) on the test data, (1) then indicates that a minimum of 770 training patterns should be used for the most complex network structures (15 hidden nodes). It was clear that the 100-point data sets would, therefore, contain insufficient data content for effective generalization to take place when using large network structures and maximum-likelihood-based training. The use of the Bayesian-evidence update algorithm allowed an additional investigation of the superior generalization performance of the more complex networks when used with sparse training data.

The results of the network training are summarized in Tables I–IV for each of the training techniques and both data sets. In each case, the best performing network (from 1500) on the validation data set is identified. Within a conventional network training framework, this result represents the best network choice from all the possibilities. Included at the bottom of each of Tables I–IV is the mean and variance of the forward propagation of the test set through all 100 networks.

From Table I (maximum likelihood with 100-point data) it is clear that overtraining was becoming an issue for networks with more than about four hidden nodes, as the training data classification rate tended towards 100%, significantly higher than both the validation and test set results. This indicates that the network decision boundaries were unrealistically sharp due to insufficient training data. Since a four-hidden-node network contained 22 independent weight components, this was not a surprising finding. This supposition is further supported by the high value of variance (11.0) obtained from the forward propagation of the test set through the 100 networks. Note that the corresponding variance figure for the Bayesian-evidence update trained networks illustrated in Table II was 0.26. These networks, trained on exactly the same training data, did not tend to reproduce the high training set classification rates associated with the maximum-likelihood approach, indicative of superior generalization.

Tables III and IV show the maximum-likelihood and Bayesian-evidence update techniques, respectively, applied to the 1000-point data sets. It was reassuring to note that the two techniques produced broadly similar classification results between the data sets. The best classification rates on the validation set were similar at 81.4% and 81.2%, respectively, and the variances to forward propagation of the test set were 0.02 and 0.004, respectively.
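The weight counts quoted above follow from the usual parameter count of a single-hidden-layer MLP, $(d+1)M + (M+1)c$; a quick check against the figures given in the text (the helper name is ours):

```python
def n_weights(d, M, c):
    """Independent weights and biases of a d-M-c MLP: (d+1)*M + (M+1)*c."""
    return (d + 1) * M + (M + 1) * c

assert n_weights(2, 1, 2) == 7      # one hidden node     -> 7 components
assert n_weights(2, 4, 2) == 22     # four hidden nodes   -> 22 components
assert n_weights(2, 15, 2) == 77    # fifteen hidden nodes -> 77 components

eps = 0.1                           # 10% misclassification allowed on the test data
print(n_weights(2, 15, 2) / eps)    # eq. (1): roughly 770 training patterns needed
```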


TABLE I CLASSIFICATION RATES FOR MAXIMUM-LIKELIHOOD TRAINING; 100-POINT DATA SETS

TABLE II CLASSIFICATION RATES FOR BAYESIAN-EVIDENCE TRAINING; 100-POINT DATA SETS

TABLE III CLASSIFICATION RATES FOR MAXIMUM-LIKELIHOOD TRAINING; 1000-POINT DATA SETS

TABLE IV CLASSIFICATION RATES FOR BAYESIAN-EVIDENCE TRAINING; 1000-POINT DATA SETS

C. Forward Propagation of Uncertainty

Following conventional crisp network training and evaluation, the network responses to uncertainty in the input data were investigated. The test data set was made uncertain by applying an interval expansion of size $h$ in all dimensions of the data sets. This corresponds to the unknown horizon of uncertainty in the information-gap model of (6). Interval numbers [17] occupy a bounded range of the number line, and can be defined as an ordered pair of real numbers $[\underline{x}, \overline{x}]$ with $\underline{x} \leq \overline{x}$ such that

$$[\underline{x}, \overline{x}] = \left\{ x \in \mathbb{R} : \underline{x} \leq x \leq \overline{x} \right\}. \qquad (22)$$

Therefore, each datum $x_{ij}$ was replaced by the interval

$$[x_{ij} - h, \; x_{ij} + h] \qquad (23)$$

where $x_{ij} - h$ represents the lower interval bound and $x_{ij} + h$ the upper interval bound. Equation (23) defines the local information-gap model $\mathcal{U}_i(h)$. The intervalized test data set was then presented to a particular network structure and the interval forward propagation calculated through the network. The outputs were also obviously interval numbers, and the classification upper hit rate and lower hit rate were calculated from (13) and (17).
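The interval forward propagation itself can be sketched as follows. The authors used INTLAB for this step; the NumPy version below is an illustrative stand-in under the same principle, with names of our own choosing. The affine layers map a box exactly using the centre-radius form, the tanh hidden units are monotone and so act on the bounds directly, and the softmax bounds take the most and least favourable activation for each class. Because the hidden-layer box is itself an enclosure, the final output intervals are conservative, which is exactly the property required by (13) and (18).

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Exact bounds of W @ x + b over the box lo <= x <= hi (centre-radius form)."""
    c, r = (lo + hi) / 2.0, (hi - lo) / 2.0
    oc, orad = W @ c + b, np.abs(W) @ r
    return oc - orad, oc + orad

def interval_softmax(lo, hi):
    """Component-wise bounds of softmax over an activation box: softmax is
    increasing in its own activation and decreasing in all the others."""
    m = hi.max()                                  # common shift for numerical stability
    y_lo, y_hi = np.empty_like(lo), np.empty_like(hi)
    for k in range(len(lo)):
        others = np.arange(len(lo)) != k
        y_hi[k] = np.exp(hi[k] - m) / (np.exp(hi[k] - m) + np.exp(lo[others] - m).sum())
        y_lo[k] = np.exp(lo[k] - m) / (np.exp(lo[k] - m) + np.exp(hi[others] - m).sum())
    return y_lo, y_hi

def interval_mlp(x, h, W1, b1, W2, b2):
    """Conservative output intervals for the intervalized datum [x - h, x + h] (eq. (23))."""
    lo, hi = x - h, x + h                         # interval expansion of the input
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = np.tanh(lo), np.tanh(hi)             # tanh is monotone increasing
    lo, hi = interval_affine(lo, hi, W2, b2)
    return interval_softmax(lo, hi)               # interval class "probabilities"
```

A single call per test datum and per value of $h$ is all that is required, which is the source of the computational advantage over Monte-Carlo sampling noted earlier.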

D. Single Network Uncertainty Propagation

First, we considered the case of a single network structure. Consider the maximum-likelihood training on the 1000-point data sets. The best crisp trained network was established on the validation sets (Table III) to be three hidden nodes, and network structure number 72. The interval output classification rates for


Fig. 3. Lower and upper hit rate functions for maximum-likelihood training on a 1000-point data set; network number 72 was optimal from cross validation.

lower hit rate and upper hit rate as a function of interval size are shown in Fig. 3. This plot can be used to quantify the performance under uncertainty of a single network and directly assess the robustness of the neural net at any demanded classification rate. For example, in Fig. 3, by setting the failure criterion to a classification rate of 70%, an error of up to ±0.0214 can be tolerated in all the elements of the measured vectors $\mathbf{x}_i$ without jeopardizing the 70% classification rate. That is, the robustness (15) is $\hat{h}(q, 0.7) = 0.0214$. This value of robustness implies that the elements of the measured vectors can all err by up to ±0.0214 without causing the mean of the lower hit rates to fall short of 70%. Additionally, operating at this level of uncertainty provides the opportunity for classifications of up to 92.4%. In other words, the opportunity function (20) is $\hat{\beta}(q, 0.924) = 0.0214$. This means that if the elements of the measured vectors err up to ±0.0214, then the mean of the upper hit rates can be as large as 92.4%.


Fig. 4. Mean and range of lower and upper hit functions for 100 networks, maximum-likelihood training on a 1000-point Gaussian data set.

Fig. 5. Robustness functions for the 100 networks using maximum-likelihood training on the 1000-point Gaussian data.

E. Multiple Network Uncertainty Propagation

Having evaluated the single network response to uncertainty, it is more instructive to plot the results from all 100 evaluated networks for a fixed number of hidden nodes. Fig. 4 illustrates the mean (square and circle symbols) and maximum and minimum range (shown as error bars) on both the lower and upper hit functions for three hidden nodes. The individual curves for networks number 72 and 7 are also highlighted. In contrast to the maximum-likelihood algorithm with 100 data points discussed earlier, where very consistent classification performance was obtained for all 100 networks, Fig. 4 shows a large range in the lower hit function (hence a larger variation of robustness) depending on the horizon of uncertainty. For an uncertainty of 0.05, the range from maximum to minimum classification rate in Fig. 4 is 34%.

From the discussion of the maximum-likelihood algorithm with 100 data points we would conclude that there was little to gain in preferring one network over another. However, when considering the performance variation inherent in Fig. 4, it is clear that certain networks have significantly superior robustness compared to others. What is of particular interest is to note the performance of the best network (in this case number 72) selected by the conventional cross validation approach. This is plotted as the dashed line in Fig. 4, and for uncertainty up to 0.03 it provides an excellent classification rate. However, as the horizon of uncertainty increases, the performance of this network degrades. Fig. 4 clearly shows a different network (number 7, dotted line) that maintains superior performance at all levels of input uncertainty. For uncertain input data, this network would be a superior choice to that indicated by conventional cross validation on the crisp data.

An alternative display of the robustness is presented in Fig. 5, showing the calculated robustness function $\hat{h}(q, r_c)$ as a function of the required classification rate $r_c$. Each line shows the evolution of robustness of each network (with the number of hidden nodes fixed at three).


Fig. 6. Mean and range of lower and upper hit functions for 100 networks, Bayesian-evidence training on a 100-point Gaussian data set.


Fig. 8. Detail of GNAT wing showing position of sensors (A, B, and C) and removable panels (P1–P9).

Networks number 7 and 72 are again highlighted to illustrate that although the network selected from conventional cross validation (number 72) has good robustness at typically required hit rates (above 60%), it is not in general the most robust network at all values of $r_c$.

As a more extreme example, consider the case for the Bayesian-evidence trained networks on the 100-point Gaussian data. From Table II, we see that the best network structure selected from the validation set was for four hidden nodes and corresponded to individual network number 15. The variability of classification of all 100 networks with four hidden nodes on the test data was studied. As in the previous example, the networks provided consistent classification performance on the test data set, with a mean and variance of the classification rate of 89.72% and 0.26%, respectively. Plotting the classification rate against interval size (Fig. 6) and robustness function (Fig. 7), however, shows that in this case the cross validation selected network (number 15) was among the worst performing networks for all sizes of uncertainty. Analysis of the interval data, however, allowed the identification of network number 28 as providing significantly higher robustness to uncertainty. For example, setting a required classification rate of 70% in Fig. 7 gave a robustness for network number 15 that was improved by a factor of approximately 2.8 for network number 28.

Fig. 7. Robustness functions for the 100 networks using Bayesian-evidence training on the 100-point Gaussian data.

Having demonstrated the basic information-gap approach to determining network robustness for an artificial 2-D data set, we now apply the same techniques to two real world examples. The first is a damage classification problem on a GNAT aircraft wing discussed in detail in [19]; the second is a set of breast cancer screening data [22], [23] from the University of California at Irvine repository [24].

F. Application of Technique to Real World Data: GNAT Wing Data

Previous work [20] has described the application of a structural health monitoring strategy to the problem of damage location on an aircraft (GNAT trainer) wing located at the Defense Science and Technology Laboratory (DSTL), Farnborough, U.K. The wing was instrumented with an array of 12 accelerometers (Fig. 8) to measure the response to forced vibration provided using a Ling electrodynamic shaker attached directly below panel P4 on the lower surface of the wing. The shaker was driven using a white Gaussian noise source. The wing had a series of nine removable panels (P1–P9) which could be removed and replaced to provide a reproducible and reversible representation of changing stiffness conditions on the wing structure. For full details of the experimental data collection and preprocessing, the reader is referred to [20].

Data was collected from all 12 accelerometers for a variety of undamaged (normal condition with all panels in place) and simulated damage conditions. By systematic removal of panels P1–P9, the effect of damage on the spectral response of the transmissibility functions could be observed. For each panel


TABLE V CLASSIFICATION RATES FOR MAXIMUM-LIKELIHOOD TRAINING; GNAT WING DATA SET

removed, 100 individual measurements were recorded for each of the nine separate transmissibilities, corresponding to removal of panels P1–P9. In addition to the damage condition measurements, a set of normal condition (undamaged) cases were recorded. An outlier technique [21] was used to identify statistically relevant changes in the transmissibility spectra, and these were then used to train a series of MLP networks to identify which of the nine panels had been removed. Networks with up to 15 hidden nodes were investigated.

As with the Gaussian example, the experimental data was divided into training, validation, and testing sets, and a conventional cross validation approach was used to identify the most appropriate network structure [19], which was identified as having nine input nodes (determined by the spectral line data), four hidden nodes (determined by cross validation), and nine output nodes (determined by the dimensionality of the nine-class problem); see Table V. A mean of 91.0% and variance of 1.22% were obtained for forward propagation of the test set through 100 instances of this network structure.

Section III, along with Fig. 1, provided the definitions of the lower hit function (13) and upper hit function (18) in terms of increasing horizon of uncertainty $h$. The example of Fig. 1 shows the nine-class GNAT wing classifier in response to specific uncertainty values of 0.03 and 0.06 and illustrates the multiple class membership possible as the horizon of uncertainty increases. As with the Gaussian data sets, after training the networks with crisp data, the network responses to uncertainty in the input data were investigated by forward propagating the test set with different values of uncertainty parameter $h$. The results for all 100 networks with four hidden nodes are illustrated in Fig. 9, with the mean (square and circle symbols) and maximum and minimum range (error bars) for both the lower and upper hit functions. Once again, the best selected crisp network (number 85 in Table V, shown by the dashed line in Fig. 9) from the cross validation approach is obviously not the most robust network to all values of input uncertainty. We identify an alternate network (number 74) that has a 10% improvement in lower hit function at smaller uncertainties, rising to a 20% improvement at an uncertainty of 0.04, compared to network number 85.

G. Application of Technique to Real World Data: Wisconsin Breast Cancer Database

As an additional test of the method, an independent database of incidence of breast cancer was analyzed using the information-gap technique. The data was available for download from

Fig. 9. Mean and range of lower and upper hit functions for 100 networks; maximum-likelihood training on GNAT wing data set.

the University of California Irvine Machine Learning Repository [24], and recorded breast cancer incidence measured at the University of Wisconsin Hospitals [22], [23]. The original database contained 699 anonymous clinical instances, with two classes (malignant and benign), and nine integer-valued measured parameters with values from 1 to 10 comprising clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses [24]. Although in total there were 699 individual measurements (of the set of all parameters), there were 16 instances of missing parameters, which occurred for measurements 24, 41, 140, 146, 159, 165, 236, 250, 276, 293, 295, 298, 316, 322, 412, and 618. These measurements were removed from the set, therefore leaving a total of 683 data vectors for network training, validation, and testing. Of these data vectors, there were 444 instances of class benign and 239 instances of class malignant. The benign set was down-sampled to be of equal size to the malignant set so that network training was conducted with equal representatives drawn from both classes [25]. This avoided the problem of otherwise skewing the data sets by using nonuniform priors. This data was subdivided into three equal training, validation, and testing data sets of size 159 vectors. Since the experimental data was collected over a period of time (1989–1991), it was


TABLE VI CLASSIFICATION RATES FOR MAXIMUM-LIKELIHOOD TRAINING; WISCONSIN BREAST CANCER DATA SET

TABLE VII CLASSIFICATION RATES FOR BAYESIAN-EVIDENCE TRAINING; WISCONSIN BREAST CANCER DATA SET

decided to form these sets using interleaved sampling from the total data set (i.e., every third record for the training set, etc.). The data sets were normalized to extend over the range −1 to 1, and a series of MLP networks with up to 15 hidden nodes were trained on the training data. Each individual network structure was trained with 100 independent training sessions starting at differently randomly chosen points on the error surface, so that a total of 1500 independent networks were evaluated. Up to 100 iterations of a scaled conjugate gradient optimization were implemented using both conventional maximum-likelihood- and Bayesian-evidence-based update training. The training results are summarized in Tables VI and VII. In each case, as before, the best performing network structure from the cross validation data is identified, and the mean and variance of the classification rates of the test data set forward propagated through all 100 networks are shown.

As with the Gaussian and GNAT data sets, after training the networks with crisp data, the network responses to uncertainty in the input data were investigated by forward propagating the test set with different values of uncertainty parameter $h$. The results for all 100 networks, with five and four hidden nodes respectively for maximum-likelihood and Bayesian training, are illustrated in Figs. 10 and 11. Again, the mean (square and circle symbols) and maximum and minimum range (error bars) for both the lower and upper hit functions are shown. In Fig. 10, network number 43 was the best cross validation selected crisp network, and in this case this particular network displays reasonable robustness to uncertainty. However, it was still possible to identify a different network (number 8 in Fig. 10) that provided equal or better classification performance at all values of horizon of uncertainty. For the Bayesian trained networks in Fig. 11, the best cross validation selected crisp network was number 25. Its lower hit function performance was outperformed by network number 78, which displayed high robustness to uncertainty at all measured values of $h$.
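The preprocessing described at the start of this section can be sketched as below. The file name and the column layout (an identifier, nine integer attributes, and a class label, with "?" marking missing values) are assumptions about the repository file rather than details given in the paper, and the random seed is arbitrary.

```python
import numpy as np

# Assumed file layout: id, nine integer attributes, class label, comma separated.
with open("breast-cancer-wisconsin.data") as f:
    rows = [ln.strip().split(",") for ln in f if ln.strip()]
rows = [r for r in rows if "?" not in r]                  # drop the 16 incomplete records
X = np.array([[int(v) for v in r[1:10]] for r in rows], dtype=float)
t = np.array([0 if r[10] == "2" else 1 for r in rows])    # 0 = benign, 1 = malignant

# Balance the classes by down-sampling the benign cases to the malignant count.
rng = np.random.default_rng(0)
benign, malignant = np.where(t == 0)[0], np.where(t == 1)[0]
keep = np.sort(np.concatenate(
    [rng.choice(benign, size=len(malignant), replace=False), malignant]))
X, t = X[keep], t[keep]

# Normalize each attribute (values 1 to 10) to the range -1 to 1.
X = 2.0 * (X - 1.0) / 9.0 - 1.0

# Interleaved split into training, validation, and test sets (every third record).
train, val, test = (np.arange(len(X)) % 3 == i for i in (0, 1, 2))
X_train, t_train = X[train], t[train]
```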

Fig. 10. Mean and range of lower and upper hit functions for 100 networks; maximum-likelihood training on Wisconsin breast cancer data set.

H. Conclusions for Uncertainty Propagation

In general, our findings indicate that the following methodology should be adopted to select the network with the highest tolerance to input uncertainty. First, classical training paradigms (for example, Bayesian evidence or maximum likelihood) are employed to train a series of networks on the training data. The information-gap robustness function is then used to select the network with the highest robustness to uncertainty in the validation or test data set. This approach allows the network most robust to uncertainties in the input data to be identified, and thus guarantees that more quantifiably reliable classifier performance is obtained, at the expense of searching over a multiplicity of networks. It is proposed for future work that this step could be avoided by incorporating a suitable robustness term in the network optimization objective function.
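A sketch of this selection step, under the notation used above: after conventional training, each candidate network's robustness (15) is evaluated on the intervalized validation data and the network with the greatest robustness at the demanded classification rate is retained. The function names and data structures are illustrative only.

```python
def most_robust_network(networks, h_grid, lower_rate_fn, r_c):
    """Pick the network whose robustness (15) at the demanded rate r_c is greatest.

    lower_rate_fn(net, h) should return the mean lower hit rate (13) of `net`
    on the intervalized validation data at horizon of uncertainty h."""
    best_net, best_h = None, -1.0
    for net in networks:
        rates = [lower_rate_fn(net, h) for h in h_grid]
        h_hat = max((h for h, r in zip(h_grid, rates) if r >= r_c), default=0.0)
        if h_hat > best_h:
            best_net, best_h = net, h_hat
    return best_net, best_h
```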


Fig. 11. Mean and range of lower and upper hit functions for 100 networks; Bayesian training on the Wisconsin breast cancer data set.

V. CONCLUSION

This paper has described a novel nonprobabilistic approach applied to the prediction of bounded worst case errors of neural networks in the presence of a specified level of uncertainty in the input data. The technique is based on information-gap uncertainty theory [16], and lies in the investigation of (computationally inexpensive) forward propagation of uncertain data through MLP networks trained on crisp data sets. This approach is considerably less computationally demanding than a corresponding Monte-Carlo-based sampling over the uncertainty of the input space and, additionally, due to the conservative nature of the interval propagation, provides a bounded absolute worst case possible error for a given horizon of uncertainty on the input data. Such an approach obviates the reliance on probability-based estimates of confidence bounds on network predictions and offers the capability to make the classifier robust against highly extreme events in the input data.

This paper has presented the formal theoretical background to the information-gap approach and introduced the concepts of robustness (lower hit function) and opportunity (upper hit function) available to a network in the presence of uncertainty. We have demonstrated the application of the theory to simple 2-D Gaussian data sets comprising two classes. Using conventional cross validation selection criteria and both maximum-likelihood and Bayesian-evidence training procedures, we trained a large number of MLP classifier networks on two different sized data sets, and selected optimal networks based on the robustness to uncertainty in the input data. In general, we have indicated that the cross-validation network selection framework does not guarantee the highest overall network robustness; in fact, in our second example of Gaussian data this network was among the worst performing over all tested networks. When requiring 70% correct classification, an improvement in robustness to uncertainty by a factor of 2.8 was demonstrated using the information-gap network selection technique.

We have further verified the technique by its application to two real world applications. In the first, the information-gap approach was applied to a classification problem involving vibrational spectra from a small trainer aircraft wing structure (the GNAT data). The second problem considered the classification of breast cancer measurements into benign or malignant categories. In both cases, it was demonstrated that networks selected by conventional cross validation techniques did not guarantee the best robustness to uncertainties in the input data. Improvements of up to 20% in worst case classification performance were observed for the GNAT data at an uncertainty value of 0.04.

For future work, it is proposed that the technique could be incorporated into network training algorithms (perhaps by incorporating a robustness requirement term into the objective function). In this fashion, it would be possible to automatically select the network most robust to uncertainty in the input data, given an additional input parameter of maximum size of uncertainty. It is anticipated that better quantification of the robustness of neural network classifiers will help to promote greater understanding and acceptance of such classifiers in increasingly demanding safety critical applications, where an understanding and quantification of classification performance in response to extreme input conditions is essential.

ACKNOWLEDGMENT

The authors would like to thank I. Nabney, Aston University, Birmingham, U.K., who developed NETLAB [http://www.ncrg.aston.ac.uk/netlab/], and S. M. Rump at Technical University of Hamburg-Harburg, Hamburg, Germany, who developed INTLAB [http://www.ti3.tu-harburg.de/~rump/intlab/]. Both software programs were used by the authors. The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. W. H. Wolberg.

REFERENCES

[1] S. Haykin, Neural Networks, a Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.: Oxford Univ. Press, 1995.
[3] D. E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[4] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter, Applied Interval Analysis. New York: Springer-Verlag, 2001.
[5] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Feb. 1998.
[6] I. T. Nabney, Netlab: Algorithms for Pattern Recognition. New York: Springer-Verlag, 2002.
[7] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[8] ——, "The evidence framework applied to classification networks," Neural Comput., vol. 4, no. 5, pp. 720–736, 1992.
[9] G. Papadopoulos and P. J. Edwards, "Confidence estimation methods for neural networks: a practical comparison," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1278–1287, Nov. 2001.
[10] D. Lowe and C. Zapart, "Point-wise confidence interval estimation by neural networks: a comparative study based on automotive engine calibration," Neural Comput. and Appl., vol. 8, pp. 77–85, 1999.
[11] P. M. Wong, A. G. Bruce, and T. D. Gedeon, "Confidence bounds of petrophysical predictions from conventional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 40, no. 6, pp. 1440–1444, Jun. 2002.
[12] G. Chryssolouris, M. Lee, and A. Ramsey, "Confidence interval prediction for neural network models," IEEE Trans. Neural Netw., vol. 7, no. 1, pp. 229–232, Jan. 1996.


[13] R. Shao, E. B. Martin, J. Zhang, and A. J. Morris, "Confidence bounds for neural network representations," Comput. Chem. Eng., vol. 21, pp. S1173–S1178, 1997, Supplement.
[14] W. A. Wright, "Bayesian approach to neural network modeling with input uncertainty," IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1261–1270, Nov. 1999.
[15] W. Duch, "Uncertainty of data, fuzzy membership functions, and multilayer perceptrons," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 10–23, Jan. 2005.
[16] Y. Ben-Haim, Information-Gap Decision Theory: Decisions Under Severe Uncertainty. New York: Academic Press, 2001.
[17] R. M. Moore, Interval Analysis. Englewood Cliffs, NJ: Prentice Hall, 1966.
[18] S. G. Pierce, K. Worden, and G. Manson, "Information-gap robustness of a neural network regression model," presented at the Int. Modal Analysis Conf. (IMAC XXII), Dearborn, MI, Jan. 26–29, 2004, paper 292.
[19] ——, "A novel information-gap technique to assess reliability of neural network based damage detection," J. Sound Vib., vol. 293, no. 1–2, pp. 96–111, 2006.
[20] G. Manson, K. Worden, and D. Allman, "Experimental validation of a structural health monitoring methodology. Part III. Damage location on an aircraft wing," J. Sound Vib., vol. 259, no. 2, pp. 365–385, 2003.
[21] K. Worden and G. R. Tomlinson, Nonlinearity in Structural Dynamics. Bristol, U.K.: Institute of Physics Publishing, 2001.
[22] K. P. Bennett and O. L. Mangasarian, "Robust linear programming discrimination of two linearly inseparable sets," Opt. Methods Software, vol. 1, pp. 23–34, 1992.
[23] W. H. Wolberg and O. L. Mangasarian, "Multisurface method of pattern separation for medical diagnosis applied to breast cytology," Proc. Nat. Acad. Sci. USA, vol. 87, pp. 9193–9196, 1990.
[24] Univ. California Irvine, Machine Learning Repository [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[25] L. Tarassenko, A Guide to Neural Computing Applications. London, U.K.: Arnold, 1998.

S. Gareth Pierce received the B.Sc. degree in pure and applied physics in 1989 and the Ph.D. degree in instrumentation in 1993, both from the University of Manchester, Manchester, U.K. He is a Lecturer in the Centre for Ultrasonic Engineering (CUE), Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow, U.K. He has undertaken a wide range of research work over the past 15 years and has authored over 50 research publications. His current work in CUE focuses on novel ultrasonic techniques for applications in materials inspection, diagnostic ultrasound and sonar systems. His primary research interests include ultrasonic nondestructive evaluation (NDE), laser ultrasonics, acoustic emission, optical fibre sensors and communications systems, signal processing, statistical pattern recognition,


neural networks, nonlinear systems, uncertainty analysis, composite materials, and fatigue performance of materials.

Yakov Ben-Haim received the B.A. degree in mathematics and chemistry from Beloit College, Beloit, WI, in 1973 and the M.Sc. degree in nuclear engineering and the Ph.D. degree in chemistry, both from the University of California at Berkeley, in 1978. He is the Yitzhak Moda'i Chair in Technology and Economics at the Technion-Israel Institute of Technology, Haifa, Israel. He is known for his development of information-gap decision theory for modeling and managing severe uncertainty. Information-gap theory is applied in engineering, biological conservation, economics, project management, homeland security, medicine, and other areas. He has published more than 60 articles and four books.

Keith Worden received the B.Sc. degree in theoretical physics from the University of York, York, U.K., in 1982 and the Ph.D. degree in mechanical engineering from Heriot-Watt University, Edinburgh, U.K., in 1989. He is the Head of The Dynamics Research Group, the Department of Mechanical Engineering, the University of Sheffield, Sheffield, U.K. He has been conducting research in structural dynamics for the last twenty years and has mainly concentrated on aspects of signal processing, structural health monitoring, nonlinear dynamics, and machine learning. His current interests lie in the area of uncertainty analysis applied to structural dynamics, for which The Dynamics Research Group at Sheffield holds a current U.K. Engineering and Physical Sciences Research Council (EPSRC) Platform Grant.

Graeme Manson received the B.Eng. degree in mechanical engineering from Heriot-Watt University, Edinburgh, U.K., in 1992 and the Ph.D. degree in mechanical engineering from the University of Manchester, Manchester, U.K., in 1996. He is a U.K. Engineering and Physical Sciences Research Council (EPSRC) Advanced Research Fellow, Department of Mechanical Engineering, University of Sheffield, U.K. The title of his fellowship is “Uncertainty in Structural Dynamics: Quantification, Fusion and Propagation” and his other research areas of interest include damage identification, multivariate statistical techniques, neural networks, genetic and hybrid approaches to optimization and nonlinear system analysis. He is the author of over 80 research papers.