BAYES ERROR RATE ESTIMATION USING NEURAL NETWORK ENSEMBLES

Kagan Tumer and Joydeep Ghosh

Department of Electrical and Computer Engineering, University of Texas, Austin, TX 78712-1084
E-mail: [email protected], [email protected]

March 26, 1997

Abstract

Assessing the performance of a given pattern classifier requires knowing the lowest achievable error, or the Bayes error rate. There are several classical approaches for estimating or finding bounds for the Bayes error. One type of approach focuses on obtaining analytical bounds, which are both difficult to calculate and dependent on distribution parameters that may not be known. Another strategy is to estimate the class densities through non-parametric methods, and use these estimates to obtain bounds on the Bayes error. This article presents two approaches to estimating performance limits based on neural network ensembles. First, we present a framework that estimates the Bayes error when multiple classifiers are combined through averaging. Then we discuss a more general method that provides error estimates based on disagreements among classifiers. The methods are illustrated on artificial data, a difficult four-class problem involving underwater acoustic data, and two problems from the Proben1 benchmarks. For the data sets with known Bayes error, the combiner-based methods introduced in this article outperform the existing methods.

Preliminary versions of some of the ideas in this paper appear in conference proceedings [34, 38]. This research was supported in part by AFOSR contract F49620-93-1-0307, ONR contract N00014-92-C-0232, and NSF grant ECS 9307632.

1 Introduction

For many real-life classification problems, 100% correct classification is not possible. In addition to fundamental limits to classification accuracy arising from overlapping class densities, errors creep in because of deficiencies in the classifier and the training data. Classifier-related problems such as an incorrect structural model, parameters, or learning regime may be overcome by changing or improving the classifier. Other errors, arising for example from finite training data sets, mislabeled patterns, and outliers, can be directly traced to the data, and cannot necessarily be corrected during the classification stage. It is therefore important not only to design a good classifier, but also to estimate limits (or bounds) on the achievable classification rate given the available data. Such estimates help designers decide whether it is worthwhile to: try to improve upon their current classifier scheme; use a different classifier on the same data set; or invest more resources in collecting more/better data samples. We have ourselves faced this dilemma in medical and oil services (electrical log inversion) applications where acquisition of new samples is quite expensive [16, 39]. Furthermore, estimation of performance bounds is also helpful during feature extraction. For example, suppose we estimate that one cannot do better than 80% correct classification on sonar signals based on their Fourier spectra. This indicates that one needs to look at other feature descriptors (say Gabor wavelets or auto-regressive coefficients [15]) if higher classification levels (say 90% correct) are desired.

For a given set of features, the fundamental performance limit for statistical pattern recognition problems is provided by the Bayes error. This error provides a lower bound on the error rate that can be achieved by a pattern classifier [6, 8, 10, 42]. Therefore, it is important to calculate or estimate this error in order to evaluate the performance of a given classifier. When the pattern distributions (both likelihoods and priors) are completely known, it is possible, although not always practical, to obtain the Bayes error directly [10]. However, when the pattern distributions are unknown, the Bayes error is not readily obtainable. In Section 2, we highlight the difficulties in estimating this value and summarize several classical methods aimed at finding bounds for the Bayes rate.

In Section 3 we introduce two combiner-based error estimators. First, in Section 3.1, we derive an estimate of the Bayes error based on the linear combining theory introduced in [36, 37]. This estimate relies on the result that combining multiple classifiers reduces the model-based errors stemming from individual classifiers [36]. It is therefore possible to isolate the Bayes error from other error components and compute it explicitly. Because this method relies on classifiers that approximate the a posteriori class probabilities, it is particularly well coupled with feed-forward neural networks [27, 29, 32]. Then, in Section 3.2, we present a method for assessing error limits for a given problem and a particular number of classifiers. Although for some types of combining the Bayes error can be estimated, there are situations (for example, when the amount of available data and/or the degree of correlation between classifiers prevents the accurate estimation of the Bayes error as described in Section 3.1) when an empirical estimate of the lowest achievable error rate is more desirable. The plurality error limit method introduced herein focuses on the agreement between different classifiers and uses the combining scheme to differentiate between various error types. By isolating certain repeatable errors (or exploiting the diversity among classifiers [31]), we derive a sample-based estimate of the achievable error rate. In Section 4 we apply these methods to both artificial and real-world problems, using radial basis function networks and multi-layered perceptrons as the base classifiers. The results obtained both from the linear combining theory and the empirical plurality error limit are reported, and show that the combining-based methods achieve better estimates than the classical methods on the problems studied in this article.

2 Background

Most real-world classification problems deal with situations where classes are not fully separable. Therefore, it is unrealistic to expect absolute classification performance (100% correct). The objective of any statistical pattern classification method is to attain the "best possible" performance. The obvious question that arises, of course, is how to determine the optimum classification rate. Since the Bayes decision provides the lowest error rate, the problem is equivalent to determining the Bayes error [6, 30, 42].

Let us consider the situation where a given pattern vector x needs to be classified into one of L classes. Further, let P(c_i) denote the a priori class probability of class i, 1 ≤ i ≤ L, and p(x|c_i) denote the class likelihood, i.e., the conditional probability density of x given that it belongs to class i. The probability of the pattern x belonging to a specific class i, i.e., the a posteriori probability p(c_i|x), is given by Bayes' rule:

    p(c_i|x) = \frac{p(x|c_i)\, P(c_i)}{p(x)},                                    (1)

where p(x) is the probability density function of x, given by:

    p(x) = \sum_{i=1}^{L} p(x|c_i)\, P(c_i).                                      (2)

The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier, and the error associated with this classifier is called the Bayes error, which can be expressed as [10, 13]:

    E_{bayes} = 1 - \sum_{i=1}^{L} \int_{C_i} P(c_i)\, p(x|c_i)\, dx,             (3)

where C_i is the region where class i has the highest posterior. Obtaining the Bayes error from Equation 3 entails evaluating the multi-dimensional integral of possibly unknown multivariate density functions over unspecified regions (C_i). Due to the difficulty of this operation, the Bayes error can be computed directly only for very simple problems, e.g., problems involving Gaussian class densities with identical covariances. For most real-world problems, estimating the densities (e.g., through Parzen windows) and priors is the only available course of action. Although tedious, a numerical integration method can then be used to obtain the Bayes error. However, since errors are introduced during the estimation of both the class densities and the regions, and are compounded by the numerical integration scheme, the results are in general not reliable. Therefore, attention has focused on approximations and bounds for the Bayes error, which are either calculated through distribution parameters, or estimated through training data characteristics.
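When the likelihoods and priors are known, Equation 3 can also be approximated by Monte Carlo sampling rather than explicit numerical integration over the regions C_i, since E_bayes = E_x[1 - max_i p(c_i|x)] under the mixture density p(x). The following minimal Python/NumPy sketch illustrates this (it is an illustration, not part of the original study; the two-Gaussian configuration, with unit covariances and means separated by 2.56 along one coordinate, is an assumed example whose Bayes error should come out near 10%).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical 2-class, 2-D Gaussian problem with equal priors (assumed example).
priors = np.array([0.5, 0.5])
dists = [multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
         multivariate_normal(mean=[2.56, 0.0], cov=np.eye(2))]

# Draw samples from the mixture p(x) = sum_i P(c_i) p(x|c_i)  (Equation 2).
n = 200_000
counts = rng.multinomial(n, priors)
x = np.vstack([d.rvs(size=k, random_state=rng) for d, k in zip(dists, counts)])

# Posteriors p(c_i|x) via Bayes' rule (Equation 1).
joint = np.column_stack([p * d.pdf(x) for p, d in zip(priors, dists)])
posteriors = joint / joint.sum(axis=1, keepdims=True)

# Monte Carlo form of Equation 3: E_bayes = E_x[1 - max_i p(c_i|x)].
print("Monte Carlo Bayes error estimate:", (1.0 - posteriors.max(axis=1)).mean())
```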

2.1 Parametric Estimates of the Bayes Error

One of the simplest bounds for the Bayes error is provided by the Mahalanobis distance measure [6]. For a 2-class problem, let Σ be the non-singular, average covariance matrix (Σ = P(c_1) Σ_1 + P(c_2) Σ_2), and μ_i be the mean vector of class i, i = 1, 2. Then the Mahalanobis distance Δ, given by:

    \Delta = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2),                       (4)

provides the following bound on the Bayes error:

    E_{bayes} \le \frac{2\, P(c_1) P(c_2)}{1 + P(c_1) P(c_2)\, \Delta}.           (5)

The main advantage of this bound is the lack of restriction on the class distributions. Furthermore, it is easy to calculate using only the sample means and sample covariance matrices. It therefore provides a quick way of obtaining an approximation to the Bayes error. However, it is not a particularly tight bound, and more importantly, as formulated above, it is restricted to a 2-class problem.

Another bound for a 2-class problem can be obtained from the Bhattacharyya distance, which is given by [6]:

    \rho = -\ln \int \sqrt{p(x|c_1)\, p(x|c_2)}\, dx.                             (6)

In particular, if the class densities are Gaussian with mean vectors \mu_i and covariance matrices \Sigma_i for classes i = 1, 2, respectively, the Bhattacharyya distance is given by [10]:

    \rho = \frac{1}{8} (\mu_2 - \mu_1)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_2 - \mu_1)
           + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}.      (7)

Using the Bhattacharyya distance, the following bounds on the Bayes error can be obtained [6]:

    \frac{1}{2} \left[ 1 - \sqrt{1 - 4 P(c_1) P(c_2) \exp(-2\rho)} \right] \le E_{bayes} \le \exp(-\rho) \sqrt{P(c_1) P(c_2)}.      (8)

In general, the Bhattacharyya distance provides a tighter error bound than the Mahalanobis distance, but it has two drawbacks: it requires knowledge of the class densities, and it is more difficult to compute. Even if the class distributions are known, computing Equation 6 is not generally practical. Therefore, Equation 7 has to be used even for non-Gaussian distributions to alleviate both concerns. While an estimate of the Bhattacharyya distance can be obtained by computing the first and second moments of the sample and using Equation 7, this compromises the quality of the bound. A more detailed discussion of the effects of using training sample estimates for computing the Bhattacharyya distance is presented in [7].

A tighter upper bound than either the Mahalanobis or the Bhattacharyya distance based bound is provided by the Chernoff bound:

    E_{bayes} \le P(c_1)^s P(c_2)^{1-s} \int p(x|c_1)^s\, p(x|c_2)^{1-s}\, dx,    (9)

where 0 ≤ s ≤ 1. For classes with Gaussian densities, the integral in Equation 9 yields \exp(-\mu_c(s)), where the Chernoff distance, \mu_c(s), is given by [10]:

    \mu_c(s) = \frac{s(1-s)}{2} (\mu_2 - \mu_1)^T \left[ s\Sigma_1 + (1-s)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1)
               + \frac{1}{2} \ln \frac{|s\Sigma_1 + (1-s)\Sigma_2|}{|\Sigma_1|^s\, |\Sigma_2|^{1-s}}.      (10)

The optimum s for a given \mu_i and \Sigma_i combination can be obtained by plotting \mu_c(s) for various s values [10]. Note that the Bhattacharyya distance is a special case of the Chernoff distance, obtained when s = 0.5. Although the Chernoff bound provides a slightly tighter bound on the error, the Bhattacharyya bound is often preferred because it is easier to compute [10].

The common limitation of the bounds discussed so far stems from their restriction to 2-class problems. An extension of these bounds to the L-class problem is presented in [13]. In this scheme, upper and lower bounds for the Bayes error of an L-class problem are obtained from the bounds on the Bayes errors of L subproblems, each involving L - 1 classes. The bounds for each (L - 1)-class problem are in turn obtained from L - 1 subproblems, each involving L - 2 classes. Continuing this progression eventually reduces the problem to obtaining the Bayes error of 2-class problems. Based on this technique, the upper and lower bounds for the Bayes error of an L-class problem are respectively given by [13]:

    E^{L}_{bayes} \le \min_{\beta \in \{0,1\}} \left[ \frac{1}{L - 2\beta} \sum_{i=1}^{L} (1 - P(c_i))\, E^{L-1}_{bayes,i} + \frac{1 - \beta}{L - 2\beta} \right],      (11)

and

    E^{L}_{bayes} \ge \frac{L - 1}{L(L - 2)} \sum_{i=1}^{L} (1 - P(c_i))\, E^{L-1}_{bayes,i},      (12)

where E^{L}_{bayes} is the Bayes error of the L-class problem, E^{L-1}_{bayes,i} is the Bayes error of the (L-1)-class subproblem in which the ith class has been removed, and \beta is an optimization parameter. Therefore, the Bayes error of an L-class problem can be computed starting from the \binom{L}{2} pairwise errors.
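For reference, the 2-class parametric bounds above can be evaluated along the following lines. The Python/NumPy sketch below is illustrative only (function names and the grid search over s are assumptions, not the authors' code); it computes the Mahalanobis bound of Equation 5, the Gaussian Bhattacharyya bounds of Equations 7-8, and the Chernoff bound of Equations 9-10 from class means and covariances. With the DATA1-like statistics shown, it should reproduce the roughly 18.95% Mahalanobis bound and 5.12%-22.04% Bhattacharyya bounds reported later in Table 4.

```python
import numpy as np

def mahalanobis_bound(p1, p2, mu1, mu2, sig1, sig2):
    """Upper bound of Equation 5 from the Mahalanobis distance (Equation 4)."""
    sigma = p1 * sig1 + p2 * sig2                       # average covariance
    d = mu1 - mu2
    delta = d @ np.linalg.solve(sigma, d)               # Equation 4
    return 2.0 * p1 * p2 / (1.0 + p1 * p2 * delta)      # Equation 5

def chernoff_distance(s, mu1, mu2, sig1, sig2):
    """Gaussian Chernoff distance mu_c(s) (Equation 10); s = 0.5 gives Equation 7."""
    d = mu2 - mu1
    sig_s = s * sig1 + (1.0 - s) * sig2
    quad = 0.5 * s * (1.0 - s) * d @ np.linalg.solve(sig_s, d)
    _, logdet_s = np.linalg.slogdet(sig_s)
    _, logdet_1 = np.linalg.slogdet(sig1)
    _, logdet_2 = np.linalg.slogdet(sig2)
    return quad + 0.5 * (logdet_s - s * logdet_1 - (1.0 - s) * logdet_2)

def bhattacharyya_bounds(p1, p2, mu1, mu2, sig1, sig2):
    """Lower and upper bounds of Equation 8."""
    rho = chernoff_distance(0.5, mu1, mu2, sig1, sig2)
    upper = np.exp(-rho) * np.sqrt(p1 * p2)
    lower = 0.5 * (1.0 - np.sqrt(max(0.0, 1.0 - 4.0 * p1 * p2 * np.exp(-2.0 * rho))))
    return lower, upper

def chernoff_bound(p1, p2, mu1, mu2, sig1, sig2, grid=np.linspace(0.01, 0.99, 99)):
    """Upper bound of Equation 9, minimized over s on a grid of values."""
    return min(p1**s * p2**(1 - s) * np.exp(-chernoff_distance(s, mu1, mu2, sig1, sig2))
               for s in grid)

# Example: equal priors, identity covariances, means separated by 2.56 in one dimension.
mu1, mu2, sig = np.zeros(8), np.r_[2.56, np.zeros(7)], np.eye(8)
print(mahalanobis_bound(0.5, 0.5, mu1, mu2, sig, sig))
print(bhattacharyya_bounds(0.5, 0.5, mu1, mu2, sig, sig))
print(chernoff_bound(0.5, 0.5, mu1, mu2, sig, sig))
```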

2.2 Non-Parametric Estimate of the Bayes Error

The computation of the bounds for 2-class problems presented in the previous section, and their extensions to the general L-class problem, depends on knowing (or approximating) certain class distribution parameters, such as the priors, class means, and class covariances. Although it is in general possible to estimate these values from the data sample, the resulting bounds are not always satisfactory.

A method that provides an estimate of the Bayes error without requiring knowledge of the class distributions is based on the nearest neighbor (NN) classifier. The NN classifier assigns a test pattern to the same class as the pattern in the training set to which it is closest (defined in terms of a pre-determined distance metric). The Bayes error can be bounded in terms of the error of an NN classifier. Given a 2-class problem with sufficiently large training data, the following result holds [5]:

    \frac{1}{2} \left( 1 - \sqrt{1 - 2 E_{NN}} \right) \le E_{bayes} \le E_{NN}.      (13)

This result is independent of the distance metric chosen. For the L-class problem, Equation 13 has been generalized to [5]:

    \frac{L-1}{L} \left( 1 - \sqrt{1 - \frac{L}{L-1} E_{NN}} \right) \le E_{bayes} \le E_{NN}.      (14)

Equations 13 and 14 place bounds on the Bayes error provided that the sample sizes are sufficiently large. These results are particularly significant in that they are attained without any assumptions or restrictions on the underlying class distributions. However, when dealing with limited data, one must be aware that Equations 13 and 14 are based on asymptotic analysis. Corrections to these equations based on sample size limitations, and their extensions to k-NN classifiers, are discussed in [4, 9, 11, 12].
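A minimal Python/NumPy sketch of this procedure is given below (an illustration, not code from the paper; the leave-one-out protocol, Euclidean metric, and array names X and y are assumptions). It estimates E_NN with a 1-NN classifier and plugs the result into Equation 14.

```python
import numpy as np

def nn_bayes_bounds(X, y, num_classes):
    """Leave-one-out 1-NN error and the Bayes error bounds of Equation 14."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances; exclude each point as its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    e_nn = np.mean(y[d2.argmin(axis=1)] != y)            # leave-one-out 1-NN error
    L = num_classes
    lower = (L - 1) / L * (1.0 - np.sqrt(max(0.0, 1.0 - L / (L - 1) * e_nn)))
    return e_nn, lower, e_nn                              # (E_NN, lower bound, upper bound)

# Usage (hypothetical data): e_nn, lo, hi = nn_bayes_bounds(X_train, y_train, num_classes=2)
```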

3 Bayes Error Estimation through Classifier Combining

In difficult classification problems with limited training data, high-dimensional patterns, or patterns that involve a large amount of noise, the performance of a single classifier is often unsatisfactory and/or unreliable. In such cases, combining the outputs of multiple classifiers has been shown to improve classification performance [18, 22, 35, 37, 40, 41, 43]. In this section we present two methods that use the results obtained from multiple classifiers to obtain an estimate of the Bayes error.

3.1 Bayes Error Estimation Based on Decision Boundaries

If the outputs of the individual classifiers approximate the corresponding class posteriors, simple averaging can be an effective combining strategy. The effect of such a combining scheme on classification decision boundaries, and their relation to error rates, was theoretically analyzed in [36, 37]. More specifically, we showed that combining the outputs of different classifiers "tightens" the distribution of the obtained decision boundaries about the optimum (Bayes) boundary. The classifier outputs are modeled as:

    f_i^m(x) = p_i(x) + \epsilon_i^m(x),                                          (15)

where p_i(x) is the posterior of the ith class given input x, and \epsilon_i^m(x) is the error of the mth classifier in estimating that posterior [27, 37]. If the classifier errors are i.i.d., combining can drastically reduce the error rates. However, the errors are rarely independent, and the gains generally depend on the correlation among the individual classifiers [1, 3, 20, 37]. Using the averaging combiner, whose output for the ith class is defined by:

    f_i^{ave}(x) = \frac{1}{N} \sum_{m=1}^{N} f_i^m(x),                           (16)

leads to the following relationship between E^{ave}_{model} and E_{model} (see [36, 37] for details):

    E^{ave}_{model} = \frac{1 + \delta(N - 1)}{N}\, E_{model},                    (17)

where E^{ave}_{model} and E_{model} are the expectations of the model-based error for the average combiner and for the individual classifiers respectively, N is the number of classifiers combined, and \delta is the average correlation between the individual classifiers. (For i.i.d. errors, Equation 17 reduces to E^{ave}_{model} = E_{model}/N, a result very similar to those derived in [25] for regression problems and in [36] for classification problems.)
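As an illustrative example of Equation 17, with N = 5 classifiers whose errors have an average correlation of \delta = 0.5, E^{ave}_{model} = \frac{1 + 0.5 \cdot 4}{5} E_{model} = 0.6\, E_{model}; that is, averaging removes 40% of the model-based error, compared with the 80% reduction that would be obtained if the errors were uncorrelated.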

This result indicates a new way of estimating the Bayes error. The total error of a classifier (E_{total}) can be divided into the Bayes error and the model-based error, which is the extra error due to the specific classifier (model/parameters) being used. Thus, the errors of a single classifier and of the ave combiner are respectively given by:

    E_{total} = E_{bayes} + E_{model},                                            (18)
    E^{ave}_{total} = E_{bayes} + E^{ave}_{model}.                                (19)

Note that E_{model} can be further decomposed into bias and variance, as discussed in [3, 14]; the effect of bias/variance on the decision boundaries is analyzed in detail in [36]. The Bayes error, of course, is not affected by the choice of the classifier. Solving the set of Equations 17, 18, and 19 for E_{bayes} provides:

    E_{bayes} = \frac{N E^{ave}_{total} - (\delta(N - 1) + 1)\, E_{total}}{(N - 1)(1 - \delta)}.      (20)

Equation 20 provides an estimate of the Bayes error as a function of the individual classifier error, the combined classifier error, the number of classifiers combined, and the correlation among them.

Three values need to be determined in order to obtain an estimate of the Bayes error using the expression derived above. E_{total} is estimated by averaging the total errors of the individual classifiers (note that averaging classifier errors to obtain E_{total} is a different operation than averaging classifier outputs to obtain E^{ave}_{total} [36, 37]). E^{ave}_{total} is the error of the average combiner. In order to estimate the correlation coefficient \delta, we first build an estimate of the posterior probabilities of each pattern using both the network outputs and the target classes (this posterior estimate relies on the targets and therefore cannot be used in determining the classification decisions). Then we compute the error (or residue) vectors for each pattern as the deviation between the network outputs and the posterior values. Finally, we compute the correlation between the errors of the individual classifiers. In order to emphasize that the Bayes error is obtained using these estimates, rather than the true error and correlation terms, we re-write Equation 20 using the estimates:

    \hat{E}_{bayes} = \frac{N \hat{E}^{ave}_{total} - (\hat{\delta}(N - 1) + 1)\, \hat{E}_{total}}{(N - 1)(1 - \hat{\delta})},      (21)

where \hat{[\cdot]} denotes the estimate of [\cdot]. The Bayes error estimate obtained through Equation 21 is particularly sensitive to the estimation of the correlation, and we will discuss the impact of using \hat{\delta} instead of \delta in Section 4.
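The sketch below shows how Equation 21 can be evaluated from the soft outputs of N classifiers on a labeled test set. It is an illustrative Python/NumPy reconstruction, not the authors' code; in particular, it simplifies the posterior-estimation step by using the one-hot target vectors in place of the posterior estimates when computing residues, which is an assumption.

```python
import numpy as np

def bayes_error_estimate(outputs, targets_onehot, true_labels):
    """Equation 21. outputs: (N, n_patterns, L) soft outputs of N classifiers."""
    N = outputs.shape[0]
    preds = outputs.argmax(axis=2)                            # individual decisions
    e_total = np.mean(preds != true_labels, axis=1).mean()    # averaged individual error
    ave_pred = outputs.mean(axis=0).argmax(axis=1)            # ave combiner decision (Eq. 16)
    e_total_ave = np.mean(ave_pred != true_labels)

    # Residues of each classifier with respect to a posterior estimate; here the
    # one-hot targets stand in for the posterior estimate (simplifying assumption).
    residues = (outputs - targets_onehot[None]).reshape(N, -1)
    corr = np.corrcoef(residues)
    delta = corr[np.triu_indices(N, k=1)].mean()              # average pairwise correlation

    return (N * e_total_ave - (delta * (N - 1) + 1) * e_total) / ((N - 1) * (1 - delta))

# Usage (hypothetical shapes): outputs (N, n, L), targets_onehot (n, L), true_labels (n,)
```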

3.2 Heuristic-Based Plurality Error Limit

The previous section focused on estimating the Bayes error through linear combining theory, and was dependent on the properties of the ave combiner. In this section, we present a "plurality error limit" that is applicable to a larger variety of combiners.

In combiners based on activation values, the decision of each individual classifier carries information that is not directly used in the combining step. The agreements/disagreements among the classifiers' decisions form the basis of the heuristic-based "plurality error limit." This limit is based on the available data and provides a value that reflects the amount of separability present in the sample. Before proceeding further, let us provide the following definitions, applicable to the general combining scheme with N (N ≥ 2) classifiers and L (L ≥ 2) classes (the cases where N = 2 or L = 2, or both, are special cases). For a given input pattern, a class is called:

- a likely class, if at least two thirds of the classifiers have picked that class (a likely class does not necessarily imply a correct class);
- an unlikely class, if at most one third of the classifiers have picked that class;
- a possible class, if it is neither likely nor unlikely.

With these definitions, we specify four distinct types of patterns, namely, patterns where:

1. the correct class is possible or likely, and all the incorrect classes are unlikely;
2. the correct class is possible, but an incorrect class is also possible;
3. the correct class is unlikely, and all the incorrect classes are also unlikely;
4. the correct class is unlikely, but an incorrect class is possible or likely.

This characterization of the patterns allows an objective interpretation of the results. Errors occurring in patterns falling in the first category are the easiest to handle. These errors are generally caused by slight differences in the training schemes of the classifiers. Since the evidence for the correct class outweighs the evidence for all incorrect classes, even simple combiners should correct this type of error. A pattern of the second or third type produces more problematic errors. In these cases, the evidence for the correct class is comparable to the evidence for the incorrect classes. Some of these errors can be corrected by combiners, while others cannot. Errors caused by patterns in the fourth category are the most difficult, and in general are not rectifiable by a combiner. In these errors, most evidence points to a particular erroneous class (a situation that typically indicates an outlier or a mislabeled pattern). Only rarely will it be possible for certain combiners to correct these errors. Therefore, the number of patterns in the fourth category provides a "plurality error limit" on the number of errors after combining. In fact, combiners that rely directly on classification information (majority vote, plurality vote) cannot correct these errors, and for such combiners this value provides a lower bound on the attainable error. For other combiners (e.g., simple averaging), the plurality error limit can be reached, or even exceeded, in certain situations where a single correct decision can be strong enough to rectify the erroneous decisions of a larger number of classifiers.
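A minimal Python sketch of this bookkeeping is given below (illustrative only; the function name and array layout are assumptions, not the authors' implementation). Given the hard decisions of N classifiers on each pattern, it labels each class as likely, possible, or unlikely using the two-thirds/one-third thresholds above and counts the category-4 patterns.

```python
import numpy as np

def plurality_error_limit(decisions, true_labels, num_classes):
    """decisions: (N, n_patterns) array of class indices chosen by each classifier."""
    N, n = decisions.shape
    votes = np.stack([(decisions == c).sum(axis=0) for c in range(num_classes)], axis=1)
    likely = votes >= np.ceil(2 * N / 3)         # at least two thirds of the classifiers
    unlikely = votes <= np.floor(N / 3)          # at most one third of the classifiers
    possible = ~likely & ~unlikely

    category4 = 0
    for k in range(n):
        c = true_labels[k]
        wrong = np.arange(num_classes) != c
        # Category 4: correct class unlikely, but some incorrect class possible or likely.
        if unlikely[k, c] and (likely[k, wrong] | possible[k, wrong]).any():
            category4 += 1
    return category4 / n                          # plurality error limit as an error rate

# Usage (hypothetical): limit = plurality_error_limit(decisions, y_test, num_classes=4)
```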

4 Experimental Bayes Error Estimates

In this section, we apply the Bayes error estimation strategies discussed in Section 3. First, two artificial data sets with known Bayes errors are used. Then a real-life underwater sonar problem is used to test the combiner-based estimates. In both cases, the Bayes error estimates provided by classical methods are presented for comparison. Finally, we present results from two data sets extracted from the Proben1 benchmarks [26].

4.1 Artificial Data

In this section, we apply the methods to two artificial problems with known Bayes error rates. Both problems are taken from Fukunaga's book [10], and are 8-dimensional, 2-class problems, where each class has a Gaussian distribution and the priors are equal. For each problem, the class means and the diagonal elements of the covariance matrices (off-diagonal elements are zero) are given in Table 1. From these specifications, we first generated 1000 training examples and 1000 test examples. Then we generated a second set of training/test sets with 100 patterns in each. The goal of this second step is to ensure that the method works with small sample sizes. The Bayes error rates for these problems (10% for DATA1 and 1.9% for DATA2) are given in [10].

It is a well-known result that the outputs of certain properly trained feed-forward artificial neural networks approximate the class posteriors [2, 27, 29]. Therefore, these networks provide a suitable choice for the multiple-classifier combining scheme discussed in Section 3.1.

Table 1: Artificial Data Sets (dimensions 1-8).

DATA1:  mu_1          = (0, 0, 0, 0, 0, 0, 0, 0)
        diag(Sigma_1) = (1, 1, 1, 1, 1, 1, 1, 1)
        mu_2          = (2.56, 0, 0, 0, 0, 0, 0, 0)
        diag(Sigma_2) = (1, 1, 1, 1, 1, 1, 1, 1)

DATA2:  mu_1          = (0, 0, 0, 0, 0, 0, 0, 0)
        diag(Sigma_1) = (1, 1, 1, 1, 1, 1, 1, 1)
        mu_2          = (3.86, 3.10, 0.84, 0.84, 1.64, 1.08, 0.26, 0.01)
        diag(Sigma_2) = (8.41, 12.06, 0.12, 0.22, 1.49, 1.77, 0.35, 2.73)
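For reference, data sets matching these specifications can be generated along the following lines (a Python/NumPy sketch, not the original experimental code; drawing exactly equal class counts and the seed are assumptions).

```python
import numpy as np

def make_gaussian_set(mu1, mu2, diag1, diag2, n_per_class, seed=0):
    """Draw an equal-prior 2-class sample with diagonal Gaussian class densities."""
    rng = np.random.default_rng(seed)
    x1 = rng.multivariate_normal(mu1, np.diag(diag1), size=n_per_class)
    x2 = rng.multivariate_normal(mu2, np.diag(diag2), size=n_per_class)
    x = np.vstack([x1, x2])
    y = np.repeat([0, 1], n_per_class)
    perm = rng.permutation(len(y))
    return x[perm], y[perm]

# DATA1: identical identity covariances, means separated along the first dimension.
x1_tr, y1_tr = make_gaussian_set(np.zeros(8), np.r_[2.56, np.zeros(7)],
                                 np.ones(8), np.ones(8), n_per_class=500)

# DATA2: unequal means and unequal diagonal covariances (Table 1).
mu2 = np.array([3.86, 3.10, 0.84, 0.84, 1.64, 1.08, 0.26, 0.01])
d2 = np.array([8.41, 12.06, 0.12, 0.22, 1.49, 1.77, 0.35, 2.73])
x2_tr, y2_tr = make_gaussian_set(np.zeros(8), mu2, np.ones(8), d2, n_per_class=500)
```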

Two different types of networks were selected for this application. The first is a multi-layered perceptron (MLP), and the second is a radial basis function (RBF) network. A detailed account of how to select, design and train these networks is available in Haykin's excellent book on the topic [19].

Table 2: Experimental Correlation Factors for Artificial Data.

Data Set / Classifier Pairs       Estimated Correlation
                                  (1000 samples)   (100 samples)
Different runs of DATA1/MLP            0.99             0.96
Different runs of DATA1/RBF            0.82             0.67
DATA1/MLP and DATA1/RBF                0.62            -0.01
Different runs of DATA2/MLP            0.91             0.99
Different runs of DATA2/RBF            0.71             0.53
DATA2/MLP and DATA2/RBF                0.35            -0.27

The single-hidden-layer MLP used for DATA1 had 5 units, and the RBF network had 5 kernels, or centroids; for DATA2 the number of hidden units and the number of kernels were increased to 12 (the network sizes were established experimentally). For the case with 100 training/test samples, 5 different training/test sets were generated and 20 runs were performed on each set; the reported results are the averages over both the different samples and the different runs. For the case with 1000 training/test samples, the variability between different training sets was minimal. For that reason we report the results of 20 runs on one typical set of 1000 training/test samples.

Tables 2 and 3 provide the correlation factors and combining results, respectively. Notice that the MLP combining results for DATA1 fail to show any improvement over the individual classifiers (row with N = 1). This is caused by the simplicity of the problem, and the lack of variability among different MLPs. The similarity between MLPs is confirmed by the high correlation among them, as shown in Table 2. The RBF networks suffer less from high correlations, since variations between kernel locations introduce differences that cannot arise in an MLP.

Table 3: Combining Results for Artificial Data.

Data Set / Classifier     N    Error Rate (%)  std. dev.    Error Rate (%)  std. dev.
                               (1000 samples)               (100 samples)
DATA1/MLP                 1        10.52         0.19           13.02         0.74
                          3        10.55         0.10           13.03         0.74
                          5        10.54         0.09           12.90         0.75
                          7        10.53         0.08           12.88         0.68
DATA1/RBF                 1        10.39         0.81           12.54         2.56
                          3        10.06         0.39           12.19         1.91
                          5        10.13         0.29           11.98         1.49
                          7        10.16         0.28           11.90         1.31
DATA1/MLP and DATA1/RBF   3        10.32         0.27           11.48         1.49
                          5        10.34         0.19           11.51         1.23
                          7        10.33         0.14           11.33         1.19
DATA2/MLP                 1         3.22         0.44            5.63         0.57
                          3         3.10         0.25            5.62         0.47
                          5         3.11         0.24            5.59         0.42
                          7         3.12         0.24            5.58         0.48
DATA2/RBF                 1         3.49         0.29            6.00         2.95
                          3         3.33         0.17            4.42         2.10
                          5         3.36         0.12            3.78         1.57
                          7         3.31         0.10            3.51         1.39
DATA2/MLP and DATA2/RBF   3         2.77         0.25            4.24         0.56
                          5         2.67         0.27            4.31         0.52
                          7         2.65         0.16            4.35         0.51

Table 4 shows the different estimates of the Bayes error. For each data set, the Bayes error is estimated from the combining results, using Tables 2 and 3 and Equation 21. Each row of Table 3, coupled with the corresponding correlation value from Table 2, provides an estimate of the Bayes error. These values are averaged to yield the reported results. When the correlation among classifiers is close to one, the Bayes estimate becomes unreliable, because the denominator in Equation 21 is near zero.

In such cases, it is not advisable to use the classifiers with high correlation in the Bayes estimate equation. The Bayes error estimates reported in this article are based on classifiers whose correlations were less than an experimentally selected threshold (for this study, only classifiers with correlations less than or equal to .97 were used). For example, based on the correlations in Table 2, for DATA1 with 1000 samples only RBF networks and RBF/MLP hybrids were used in determining the Bayes estimate, whereas for DATA1 with 100 samples all available classifiers (MLPs, RBFs and MLP/RBF hybrids) were used.

Table 4: Bayes Error Estimates for Artificial Data (given in %).

                                                   DATA1                     DATA2
Actual Bayes error                                 10.00                      1.90
Ê_bayes (Eq. 21), 1000 samples                      9.24                      2.15
Nearest neighbor bounds (Eq. 13), 1000 samples   8.73 ≤ E_bayes ≤ 15.94    2.15 ≤ E_bayes ≤ 4.20
Ê_bayes (Eq. 21), 100 samples                      10.70                      2.36
Nearest neighbor bounds (Eq. 13), 100 samples    8.62 ≤ E_bayes ≤ 15.76    2.43 ≤ E_bayes ≤ 4.75
Mahalanobis bound (Eqs. 4-5),                    E_bayes ≤ 18.95           E_bayes ≤ 14.13
  true mean and covariance                         (Δ = 6.55)                (Δ = 10.16)
Bhattacharyya bounds (Eqs. 7-8),                 5.12 ≤ E_bayes ≤ 22.04    0.23 ≤ E_bayes ≤ 4.74
  true mean and covariance                         (ρ = 0.82)                (ρ = 2.36)

Studying Tables 3 and 4 leads us to conclude that the performance of the base classifiers has little impact on the final estimate of the Bayes error. For example, for DATA1, when the individual classifiers were trained and tested on 1000 patterns, they performed well, coming close to the true Bayes error rate. In those cases, combining provided limited improvements, if any. For individual classifiers trained and tested on only 100 samples, on the other hand, neither the MLPs nor the RBF networks provided satisfactory results. Combining provided moderate improvements in some, but not all, cases (note that combining multiple MLPs still yielded poor results). Yet the Bayes error estimates were still accurate and close to both the true rate and the rate obtained with 1000 samples. This confirms that the method is not sensitive to the actual performance of its classifiers, but rather to the interaction between the individual classifier performance, the combiner performance and the correlation among the classifiers.

For the Mahalanobis and Bhattacharyya distances, the bounds were based on the true means and covariance matrices, as these quantities were known (because these bounds are not satisfactory, we did not investigate the effect of using sample means and covariances, as that step would further compromise them). Notice that although the Bhattacharyya bound is expected to be tighter than the Mahalanobis bound, this is not so for DATA1. The reason for this discrepancy is twofold: first, the Mahalanobis distance provides tighter bounds as the error becomes larger [6]; second, two terms contribute to the distance of Equation 7, one for the difference of the means and one for the difference of the covariances. When the covariances are identical, the second term is zero, leading to a small Bhattacharyya distance, which in turn leads to a loose bound on the error. DATA1, by virtue of having a large Bayes error due exclusively to the separation of the class means, represents a case where the Bhattacharyya bound fails to improve on the Mahalanobis bound. For DATA2, the Bhattacharyya distance provides bounds that are more useful, and the upper bound in particular is very similar to the upper bound provided by the NN method. For both DATA1 and DATA2, the Bayes error rate estimates obtained through the classifier combining method introduced in this article are closest to the actual Bayes error rate. This is particularly remarkable since both experiments are biased towards the classical techniques, because the class distributions are Gaussian.

Tables 5 and 6 show the plurality error limits for these two artificial data sets, based on the analysis presented in Section 3.2. The intuitive idea behind the plurality error limit is as follows: the first row of Table 5 states that, for the 1000-sample case, one cannot expect to obtain error rates lower than 9.92% when combining 5 different MLPs. Based on the plurality limits, we also provide an estimate of the Bayes error, denoted by Ê_plu. Recall that the plurality limit is obtained by isolating patterns that appear to be mislabeled, or patterns whose input characteristics and output labeling fall outside the statistical decision boundaries; this description is in effect a rewording of Equation 3. Note that the correlations also affect the plurality limits. For example, the plurality error limits obtained through MLPs for DATA2 based on 100 samples (a case with unusually high correlation) are noticeably higher than the values obtained through RBF networks (Table 6). In order to estimate the Bayes error rate, we average all the plurality limits obtained by varying the number of combiners.

The results are presented in Tables 5 and 6; once again, only the plurality limits obtained from classifiers with correlations less than or equal to .97 are used in the final estimate. Note that the Bayes error estimates shown in these tables are not simply the average of the plurality limits listed in the same tables. This discrepancy arises because the plurality limits are computed separately for each set of training/test patterns, and those values are then used to obtain the Bayes error estimate, an operation in which not all plurality limits are used (e.g., plurality limits resulting from cases with high correlation are dropped). Tables 5 and 6 show the average plurality limits and the average Bayes error estimates.

Table 5: Plurality Error Limits for artificial DATA1 (in %).

Classifier   N    Plurality Error Limit     Plurality Error Limit
                  (1000 samples)            (100 samples)
MLP          5          9.92                     12.39
             6         10.32                     12.58
             7         10.14                     12.46
             9         10.28                     12.55
RBF          5          8.74                      9.21
             6          9.51                      9.75
             7          9.18                      9.18
             9          9.31                      9.36
MLP-RBF      5          8.91                      7.53
             6          9.68                      8.99
             7          9.32                      8.50
             9          9.70                      8.93
Bayes error estimates: Ê_plu = 9.29 (1000 samples), Ê_plu = 9.47 (100 samples).

Table 6: Plurality Error Limits for artificial DATA2 (in %).

Classifier   N    Plurality Error Limit     Plurality Error Limit
                  (1000 samples)            (100 samples)
MLP          5          2.58                      5.31
             6          2.85                      5.45
             7          2.74                      5.35
             9          2.74                      5.46
RBF          5          2.85                      2.40
             6          3.12                      2.74
             7          3.01                      2.00
             9          3.09                      2.36
MLP-RBF      5          1.90                      2.39
             6          2.09                      3.18
             7          2.00                      2.81
             9          2.09                      3.36
Bayes error estimates: Ê_plu = 2.59 (1000 samples), Ê_plu = 2.70 (100 samples).
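The final averaging step described above amounts to the following sketch (illustrative Python with an assumed data structure: a list of (correlation, plurality_limit) pairs collected over classifier types and ensemble sizes).

```python
# Average only the plurality limits whose classifier pools had correlation <= 0.97,
# mirroring the threshold used for the Equation 21 estimates.
def average_plurality_limits(results, max_corr=0.97):
    kept = [limit for corr, limit in results if corr <= max_corr]
    return sum(kept) / len(kept)

# Usage (hypothetical values): average_plurality_limits([(0.82, 9.92), (0.62, 8.91)])
```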

4.2 Underwater Sonar Data

The previous section dealt with obtaining the Bayes error for simple artificial problems. In this section, we apply the methods to a difficult underwater sonar problem. From the original sonar signals of four different underwater sources, two qualitatively different feature sets are extracted [17]. The first one (FS1), a 25-dimensional set, consists of Gabor wavelet coefficients, temporal descriptors and spectral measurements. The second feature set (FS2), a 24-dimensional set, consists of reflection coefficients based on both short and long time windows, and temporal descriptors.

Table 7: Description of Data.

                     Feature Set 1        Feature Set 2
Class Description    Training   Test      Training   Test
Porpoise Sound          116      284         142      284
Cracking Ice            116      175         175      175
Whale Sound 1           116      129         129      129
Whale Sound 2           148      235         118      235
Total                   496      823         564      823

Table 7 shows the class descriptions and the number of patterns used for training and testing with each of the two feature sets. The training sets do not comprise the same patterns, because of the different preprocessing techniques. The test sets, however, contain exactly the same patterns, making it possible to combine and/or compare the results. The availability of two feature sets is an excellent opportunity to underscore the dependence of the Bayes error on feature selection. Since both feature sets were extracted from the same underlying distributions, the differences between the Bayes errors obtained provide an implicit rating of the effectiveness of the extracted features in conserving the discriminating information present in the original data.

Two types of feed-forward artificial neural networks, namely an MLP with a single hidden layer of 40 units and an RBF network with 40 kernels, are used to classify the patterns. The error rates of each network on each feature set, averaged over 20 runs, as well as the results of the ave combiner, are presented in Table 8. The rows with N = 1 give the single-classifier results. Note that the improvements due to combining are much more noticeable for this difficult problem.

Table 8: Combining Results for the Sonar Data.

Feature Set / Classifier     N    Error Rate (%)   std. dev.
FS1/MLP                      1        7.47           0.44
                             3        7.19           0.29
                             5        7.13           0.27
                             6        7.09           0.24
                             7        7.11           0.23
                             9        7.10           0.19
FS1/RBF                      1        6.79           0.41
                             3        6.15           0.30
                             5        6.05           0.20
                             6        5.98           0.19
                             7        5.97           0.22
                             9        5.86           0.17
FS1/MLP and FS1/RBF          3        6.11           0.34
                             5        6.11           0.31
                             6        6.18           0.30
                             7        6.08           0.32
                             9        6.00           0.30
FS2/MLP                      1        9.95           0.74
                             3        9.32           0.35
                             5        9.20           0.30
                             6        9.18           0.36
                             7        9.07           0.36
                             9        9.01           0.38
FS2/RBF                      1       10.94           0.93
                             3       10.55           0.45
                             5       10.43           0.30
                             6       10.43           0.30
                             7       10.44           0.32
                             9       10.38           0.24
FS2/MLP and FS2/RBF          3        8.46           0.57
                             5        8.17           0.41
                             6        8.23           0.32
                             7        8.14           0.28
                             9        8.07           0.30

Table 9: Experimental Correlation Factors for the Sonar Data.

Feature Set / Classifier Pairs      Estimated Correlation
Different runs of FS1/MLP                  0.88
Different runs of FS1/RBF                  0.70
FS1/MLP and FS1/RBF                        0.35
Different runs of FS2/MLP                  0.76
Different runs of FS2/RBF                  0.72
FS2/MLP and FS2/RBF                        0.20

Table 10: Bayes Error Estimates for Sonar Data (in %).

Feature   Ê_bayes     Nearest Neighbor Bounds     Mahalanobis Bound   Bhattacharyya Bound
Set       (Eq. 21)    (Eq. 14)                    (Eqs. 4, 5, 11)     (Eqs. 7, 8, 11)
FS1         4.20      3.27 ≤ E_bayes ≤ 6.40          ≤ 14.61              ≤ 0.20
FS2         7.21      6.88 ≤ E_bayes ≤ 13.12         ≤ 19.53              ≤ 1.10

Table 9 shows the estimated average error correlations between different runs of a given classifier/feature set pair, and between the two types of networks trained with a single feature set. Table 10 shows the estimates for the Bayes error using Equation 21, the total error rates from Table 8 and the correlation values from Table 9, as well as the lower and upper bounds computed through Equation 14, where the upper bound is also the error of the nearest neighbor classifier. The bounds provided by the Mahalanobis distance are not tight enough to be of particular use, since all the classifiers (MLP, RBF, NN and the various combiners) provide better results than this bound. The bounds provided by the Bhattacharyya distance, on the other hand, are not dependable due to the assumptions made on the distributions.

Equation 7 is derived for Gaussian classes, and can lead to significant errors when the class distributions are not Gaussian. Furthermore, since Equations 5 and 8 provide bounds for the 2-class case, and need to be extended to the 4-class case through the repeated application of Equation 11, the errors are compounded. Therefore, in this case only the bounds provided by the nearest neighbor method can be reliably compared to the combiner-based estimates.

Table 11: Plurality Error Limits for the Sonar Data.

Feature Set   Classifier   N    Plurality Error Limit (%)
FS1           MLP          5            5.98
                           6            6.42
                           7            6.32
                           9            6.51
              RBF          5            5.27
                           6            5.42
                           7            5.47
                           9            5.45
              MLP/RBF      5            4.25
                           6            4.40
                           7            4.40
                           9            4.60
FS2           MLP          5            7.58
                           6            7.83
                           7            7.87
                           9            6.38
              RBF          5            8.59
                           6            9.07
                           7            8.97
                           9            9.14
              MLP/RBF      5            5.89
                           6            5.97
                           7            6.15
                           9            6.38
Bayes error estimates: Ê_plu = 5.37 (FS1), Ê_plu = 7.49 (FS2).

Table 11 shows the plurality error limits and the resulting Bayes error estimates for the underwater sonar classification problem. For FS2, the two combining-based methods discussed in Section 3 provide similar Bayes error estimates. For FS1, on the other hand, the plurality limit based Bayes error estimate is higher than the estimate based on linear combining theory.

4.3 Proben1 Benchmarks

In this section we apply the combiner-based Bayes error estimation methods to selected data sets from the Proben1 benchmark set [26] (available at URL ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.Z). The data sets included in this study are the GLASS1 and GENE1 sets, where the name and number combinations correspond to a specific training/validation/test set split (we are using the same notation as in the Proben1 benchmarks). GENE1 is based on intron/exon boundary detection, or the detection of splice junctions in DNA sequences [23, 33]; 120 inputs are used to determine whether a DNA section is a donor, an acceptor, or neither. There are 3175 examples, of which 1588 are used for training. The GLASS1 data set is based on the chemical analysis of glass splinters; the 9 inputs are used to classify 6 different types of glass. There are 214 examples in this set, and 107 of them are used for training.

Table 12: Combining Results for the GLASS1 Data.

Classifier     N    Error Rate (%)   std. dev.   Correlation
MLP            1        32.26          0.57         0.92
               3        32.08          0.00
               5        32.08          0.00
               6        32.08          0.00
               7        32.08          0.00
               9        32.08          0.00
RBF            1        31.79          3.49         0.68
               3        29.81          2.28
               5        29.25          1.84
               6        28.68          1.13
               7        29.06          1.51
               9        28.68          1.13
MLP and RBF    3        30.66          0.52         0.08
               5        32.36          0.82
               6        32.45          0.41
               7        32.45          0.96
               9        32.55          0.82
Bayes error estimates: Ê_bayes = 27.59 (Eq. 21); nearest neighbor bounds: 21.69 ≤ E^nn_bayes ≤ 37.74 (Eq. 14).

Table 12 contains the combining results and the Bayes error estimates for the GLASS1 data, and Table 13 presents the combining results for the GENE1 data, along with the correlation and Bayes error estimates. Because the correlation among multiple RBF networks for GENE1 is .98, care must be taken in estimating the Bayes error.

More precisely, even moderate improvements in classification rates under such high correlation imply zero or near-zero Bayes error rates. Therefore, we estimate the Bayes error rate by combining MLPs and MLP/RBF hybrids only, as discussed in Section 4.1. We have also included the nearest neighbor bounds for these two data sets, based on Equation 14 and denoted by E^nn_bayes, in the last row of each table. Note that for the GENE1 problem the nearest neighbor method fails to provide accurate bounds. This shortcoming is mainly due to the high dimensionality of the problem, where proximity in the Euclidean sense is not necessarily a good measure of class membership.

Table 13: Combining Results for the GENE1 Data.

Classifier     N    Error Rate (%)   std. dev.   Correlation
MLP            1        13.47          0.44         0.73
               3        12.30          0.42
               5        12.23          0.39
               6        12.05          0.29
               7        12.08          0.23
               9        12.15          0.23
RBF            1        14.62          0.42         0.98
               3        14.48          0.37
               5        14.35          0.33
               6        14.34          0.37
               7        14.33          0.35
               9        14.28          0.30
MLP and RBF    3        12.43          0.48         0.30
               5        12.28          0.40
               6        12.35          0.33
               7        12.17          0.37
               9        12.16          0.34
Bayes error estimates: Ê_bayes = 9.19 (Eq. 21); nearest neighbor bounds: 16.72 ≤ E^nn_bayes ≤ 29.25 (Eq. 14).

Tables 14 and 15 show the plurality limits for the GLASS1 and GENE1 data sets, and the Bayes error estimates obtained from them. For GENE1, because of the high correlation, the RBF-based plurality limits are not used in computing the Bayes error estimate, as discussed in Section 4.1. For the GLASS1 data set the two combining-based estimates are particularly close, and for GENE1 the two estimates are within 8% of each other.

Table 14: Plurality Error Limits for the GLASS1 Data.

Classifier   N    Plurality Error Limit (%)
MLP          5           32.08
             6           31.70
             7           32.08
             9           32.08
RBF          5           23.59
             6           22.26
             7           24.06
             9           24.15
MLP-RBF      5           24.81
             6           25.66
             7           25.94
             9           28.77
Bayes error estimate: Ê_plu = 27.57.

Table 15: Plurality Error Limits for the GENE1 Data.

Classifier   N    Plurality Error Limit (%)
MLP          5           10.00
             6           10.80
             7           10.47
             9           10.84
RBF          5           14.41
             6           14.04
             7           13.89
             9           14.00
MLP-RBF      5            9.12
             6            9.38
             7            9.44
             9            9.50
Bayes error estimate: Ê_plu = 9.94.

5 Conclusion

Estimating the limits of achievable classification rates is useful in difficult classification problems. Such information is helpful in determining whether it is worthwhile to try a different classifier, or a different set of parameters with the chosen classifier, in the hope of getting better classification rates. For example, knowing that certain errors (e.g., outliers or mislabeled patterns) are too deeply rooted in the original data to be eliminated with better classifiers or combiners may prevent the search for non-existent "better" classifiers, and instead point to the need for more data samples or better feature extractors.

In this article, we presented two methods that can assist in determining these performance limits. First, we derive an estimate of the Bayes error based on linear combining theory, exploiting the ability of certain well-trained neural networks to directly estimate the posterior probabilities. Experimental results show that this error estimate compares favorably with classical estimation methods. Then, we adopt a more general scheme to obtain limits on the error rate for classifiers trained on specific data samples. Experimental results indicate that in most cases the combiners perform well, and come close to the sample-dependent error limit. Furthermore, the error estimates presented in this article help isolate classifier/feature set pairs that fall short of their performance potential.

For both Bayes estimation methods introduced in this article, the estimation of the correlation plays a crucial role in the accuracy of the Bayes error. If we use the combiner-based estimate given in Equation 21, then the correlation directly influences the result. If, on the other hand, we use the plurality limit based approach, the correlation provides an indirect benefit: we can determine which plurality limits are reliable, and use only those to obtain a Bayes error estimate. Several researchers have shown the desirability of reducing correlations among classifiers in an ensemble and have proposed methods to achieve this [21, 24, 28, 37]. Note that if the classifiers are very highly correlated, the Bayes rate estimation based on Equation 21 becomes less reliable. Thus we expect our technique to provide even better results when applied to ensembles that employ one of these decorrelation methods first.

References

[1] K. M. Ali and M. J. Pazzani. On the link between error correlation and error reduction in decision tree ensembles. Technical Report 95-38, Department of Information and Computer Science, University of California, Irvine, 1995.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.
[3] L. Breiman. Bagging predictors. Technical Report 421, Department of Statistics, University of California, Berkeley, 1994.
[4] L. J. Buturovic. Improving k-nearest neighbor density and error estimates. Pattern Recognition, 26(4):611-616, 1993.
[5] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27, 1967.
[6] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.
[7] A. Djouadi, O. Snorrason, and F. D. Garber. The quality of training-sample estimates of the Bhattacharyya coefficient. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):92-97, January 1990.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[9] K. Fukunaga. The estimation of the Bayes error by the k-nearest neighbor approach. In L. N. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition 2, pages 169-187. North-Holland, Amsterdam, 1985.
[10] K. Fukunaga. Introduction to Statistical Pattern Recognition. 2nd edition, Academic Press, 1990.
[11] K. Fukunaga and D. Hummels. Bayes error estimation using Parzen and k-NN procedures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:634-643, 1987.

[12] K. Fukunaga and D. Hummels. Bias of the nearest neighbor error estimate. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):103-112, 1987.
[13] F. D. Garber and A. Djouadi. Bounds on the Bayes classification error based on pairwise risk functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3):281-288, 1988.
[14] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.
[15] J. Ghosh, L. Deuser, and S. Beck. A neural network based hybrid system for detection, characterization and classification of short-duration oceanic signals. IEEE Journal of Ocean Engineering, 17(4):351-363, October 1992.
[16] J. Ghosh and K. Tumer. Structural adaptation and generalization in supervised feedforward networks. Journal of Artificial Neural Networks, 1(4):431-458, 1994.
[17] J. Ghosh, K. Tumer, S. Beck, and L. Deuser. Integration of neural classifiers for passive sonar signals. In C. T. Leondes, editor, Control and Dynamic Systems: Advances in Theory and Applications, volume 77, pages 301-338. Academic Press, 1996.
[18] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1000, 1990.
[19] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, 1994.
[20] R. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7(5):867-888, 1995.
[21] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems-7, pages 231-238. M.I.T. Press, 1995.
[22] W. P. Lincoln and J. Skrzypek. Synergy of clustering multiple back propagation networks. In D. Touretzky, editor, Advances in Neural Information Processing Systems-2, pages 650-657. Morgan Kaufmann, 1990.

[23] M. O. Noordewier, G. G. Towell, and J. W. Shavlik. Training knowledge-based neural networks to recognize genes in DNA sequences. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems-3, pages 530-536. Morgan Kaufmann, 1991.
[24] D. W. Opitz and J. W. Shavlik. Generating accurate and diverse members of a neural-network ensemble. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems-8, pages 535-541. M.I.T. Press, 1996. In press.
[25] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing, chapter 10. Chapman & Hall, 1993.
[26] L. Prechelt. PROBEN1: A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, September 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.Z on ftp.ira.uka.de.
[27] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461-483, 1991.
[28] B. Rosen. Ensemble learning using decorrelated neural networks. Connection Science, Special Issue on Combining Artificial Neural Networks: Ensemble Approaches, 8(3 & 4):373-384, 1996.
[29] D. W. Ruck, S. K. Rogers, M. E. Kabrisky, M. E. Oxley, and B. W. Suter. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4):296-298, 1990.
[30] R. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. Wiley, 1992.
[31] A. J. J. Sharkey, N. E. Sharkey, and G. O. Chandroth. Neural nets and diversity. In Proceedings of the 14th International Conference on Computer Safety, Reliability and Security, pages 375-389, Belgirate, Italy, 1995.

[32] P. A. Shoemaker, M. J. Carlin, R. L. Shimabukuro, and C. E. Priebe. Least squares learning and approximation of posterior probabilities on classification problems by neural network models. In Proc. 2nd Workshop on Neural Networks, WNN-AIND 91, Auburn, pages 187-196, February 1991.
[33] G. G. Towell and J. W. Shavlik. Interpretation of artificial neural networks: Mapping knowledge-based neural networks into rules. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems-4, pages 977-984. Morgan Kaufmann, 1992.
[34] K. Tumer and J. Ghosh. Limits to performance gains in combined neural classifiers. In Proceedings of the Artificial Neural Networks in Engineering '95, pages 419-424, St. Louis, 1995.
[35] K. Tumer and J. Ghosh. Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. Technical Report 95-02-98, The Computer and Vision Research Center, University of Texas, Austin, 1995. (Available from URL http://www.lans.ece.utexas.edu/ under select publications / tech reports.)
[36] K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341-348, February 1996.
[37] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, Special Issue on Combining Artificial Neural Networks: Ensemble Approaches, 8(3 & 4):385-404, 1996.
[38] K. Tumer and J. Ghosh. Estimating the Bayes error rate through classifier combining. In Proceedings of the International Conference on Pattern Recognition, pages 695-699, Vienna, 1996.
[39] K. Tumer, N. Ramanujam, R. Richards-Kortum, and J. Ghosh. Spectroscopic detection of cervical pre-cancer through radial basis function networks. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems-9. M.I.T. Press, 1997. (To appear.)
[40] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.

[41] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, May 1992.
[42] T. Y. Young and T. W. Calvert. Classification, Estimation and Pattern Recognition. Elsevier, New York, 1974.
[43] X. Zhang, J. P. Mesirov, and D. L. Waltz. Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, 225:1049-1063, 1992.