
FEATURE SELECTION INCREASES CROSS-VALIDATION IMPRECISION

Yufei Xiao¹, Jianping Hua² and Edward R. Dougherty¹,²

¹Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX 77843
²Translational Genomics Research Institute, Phoenix, AZ 85004

ABSTRACT

Even without feature selection, cross-validation error estimation is problematic for small samples owing to the high variance of the deviation distribution describing the difference between the estimated and true errors. This paper investigates the increased loss of cross-validation precision owing to feature selection by comparing deviation distributions and by introducing two variation-based measures that quantify the further degradation in performance.

1. INTRODUCTION

A key research goal in genomic signal processing is the classification problem accompanied by feature selection. Intrinsic to this goal is how error estimation affects the choice of features and classifier design. Here, given the typically small sample sizes in genomics, we assume that error estimation is done on the same data set as feature selection and classifier design. A classifier is designed according to a classification rule: the rule is applied to sample data to yield a classifier. Either the features are given prior to the data, in which case the classification rule yields a classifier with the given features constituting its argument, or both the features and the classifier are determined by the classification rule. In the latter case, the entire set of possible features constitutes the feature set relative to the classification rule. Feature selection constrains the space of functions from which a classifier might be chosen, but it does not reduce the number of features involved in designing the classifier. In particular, if cross-validation error estimation is to be used, the approximate unbiasedness of cross-validation applies to the classification rule, and since feature selection is part of the classification rule, feature selection must be accounted for within the cross-validation procedure to maintain the approximate unbiasedness; otherwise, the error estimate is biased low. This paper concerns the quality of such a cross-validation estimation procedure.

There are various issues to consider with regard to the quality of an error estimator in the context of small samples. The most obvious is its accuracy, which is most directly analyzed via the deviation distribution of the estimator, that is, the distribution of the difference between the estimated and true errors. Model-based simulation studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as bootstrap and bolstered estimators [1], and that it also performs poorly when used to rank feature sets [2]. Moreover, when feature selection is performed, similar studies show that cross-validation compares unfavorably with bootstrap and bolstered estimators when used inside forward search algorithms, such as sequential floating forward search [3].

Here we are concerned with another, potentially more problematic issue: the use of cross-validation to estimate the error of a classifier designed in conjunction with feature selection. We say potentially more serious because, owing to the computational burden of bootstrap and the analytic formulation of bolstering, these estimators are not readily amenable to situations in which there are thousands of variables from which to choose. As in the case of prior-chosen features, the main concern is the distribution of the deviation between the cross-validation error estimates and the true errors of the designed classifiers. Owing to the added complexity of feature selection, one might surmise that the situation would be worse than for a given feature set, and it is. Even with a given feature set, the deviation distribution for cross-validation tends to have high variance, which is why its performance generally is not good, especially for leave-one-out cross-validation. What we observe in the current study is that the cross-validation deviation distribution is significantly flatter when there is feature selection, which means that cross-validation estimates are even more unreliable than for given feature sets. This inaccuracy of cross-validation in the presence of feature selection has recently been addressed [4]; however, our approach differs in two substantive ways. The major difference is that we are specifically interested in a comparative methodology that isolates the effects of feature selection in the deviation analysis of cross-validation. A second difference is that the analysis in [4] relies on t-test feature selection, which is generally not used in practice since its justification depends on the features being independent.
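To make the placement of feature selection relative to the cross-validation loop concrete, the following minimal sketch contrasts the two cases. It uses scikit-learn with a univariate filter (SelectKBest) and synthetic placeholder data purely for illustration; these are stand-ins for the SFFS selection and the Gaussian model described in Section 2, not the procedures used in the paper.

```python
# Sketch: feature selection inside vs. outside the cross-validation loop.
# Assumes scikit-learn; SelectKBest and make_classification are illustrative
# stand-ins for the SFFS selection and the Gaussian model used in the paper.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=50, n_features=100, n_informative=20,
                           random_state=0)            # placeholder data

# Correct: selection is part of the classification rule, so it is redone
# inside every leave-one-out fold.
rule = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("lda", LinearDiscriminantAnalysis())])
err_correct = 1.0 - cross_val_score(rule, X, y, cv=LeaveOneOut()).mean()

# Incorrect: features are chosen once from the full sample, and only the
# classifier is cross-validated on the reduced data; the held-out point has
# already influenced the feature set, so the estimate tends to be biased low.
X_reduced = SelectKBest(f_classif, k=20).fit_transform(X, y)
err_wrong = 1.0 - cross_val_score(LinearDiscriminantAnalysis(), X_reduced, y,
                                  cv=LeaveOneOut()).mean()

print(err_correct, err_wrong)
```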

2. METHOD AND RESULTS

Our method is to compare the cross-validation deviation distributions for the classification rule used with and without feature selection. In the latter case, we use the best features among the full collection. By doing this in conjunction with comparing the variances of the deviation distributions, we are able to assess the increased deviation variance due to feature selection. Besides using the deviation distribution for the best features, we also compute the cross-validation error in the wrong way, by applying it to the selected features, posterior to feature selection. The variance of this posterior cross-validation estimate for the selected features is also substantially less than the variance of the deviation distribution when feature selection is performed within the cross-validation procedure.

The experiments discussed in this paper involve two 100-dimensional Gaussian class-conditional distributions with means m0 = (µ1, · · · , µ100) and m1 = −m0, respectively. The two Gaussian distributions share the same block-diagonal covariance matrix Σ, which contains 20 equal blocks. There are 20 best features, one in every block, and the 5 features within each block are mutually correlated with ρ = 0.1. The two distributions are equiprobable. The optimal classifier given the full distributions is obtained by linear discriminant analysis (LDA). We draw samples of size 50 and apply LDA to design classifiers. We apply sequential floating forward search (SFFS) for feature selection and use bolstered resubstitution for error estimation within the SFFS algorithm, owing to its superior performance for SFFS using LDA [3].
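A minimal sketch of this synthetic model is given below. The per-feature mean values are not specified in this excerpt, so assigning mean ±1 to the single best feature in each block and 0 to the remaining features is an illustrative assumption; the block-diagonal covariance with ρ = 0.1, the equiprobable classes, and the sample size follow the description above.

```python
# Sketch of the synthetic model: two equiprobable 100-dimensional Gaussians
# with block-diagonal covariance (20 blocks of 5 features, within-block
# correlation rho = 0.1).  The mean values are illustrative assumptions:
# the single "best" feature in each block gets mean +/-1, the rest get 0.
import numpy as np

D, BLOCKS, BLOCK_SIZE, RHO, N = 100, 20, 5, 0.1, 50

m0 = np.zeros(D)
m0[::BLOCK_SIZE] = 1.0        # assumed: first feature of each block is "best"
m1 = -m0

block = np.full((BLOCK_SIZE, BLOCK_SIZE), RHO) + (1 - RHO) * np.eye(BLOCK_SIZE)
Sigma = np.kron(np.eye(BLOCKS), block)   # 100 x 100 block-diagonal covariance

def draw_sample(n=N, rng=np.random.default_rng(0)):
    """Draw n labeled points, each class chosen with probability 1/2."""
    y = rng.integers(0, 2, size=n)
    means = np.where(y[:, None] == 0, m0, m1)
    X = means + rng.multivariate_normal(np.zeros(D), Sigma, size=n)
    return X, y

X, y = draw_sample()
print(X.shape, y.mean())
```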

Cross-validation is used to estimate the errors of the designed classifiers, both leave-one-out (LOO) and k-fold. In the subsequent description of our methodology, we assume LOO. We first draw a random sample S of size 50, perform feature selection on the whole sample to obtain a feature set F, and design the LDA classifier CF on S. We also use the known best features Fb (from the model) to design a classifier CFb. To obtain true classification errors, we generate a much larger sample S′ of size 5000 on which to test CF and CFb, and denote their true errors by E and Eb, respectively. To obtain the LOO error estimate (with feature selection), for i = 1, · · · , 50, we leave out the point si in S to form the training set Si = S \ {si}, perform feature selection on Si to get a feature set Fi, design a surrogate classifier Ci, and test Ci on si, with error 1 if si is misclassified and error 0 if si is correctly classified. The estimated classification error, denoted Ê, is then the average of these errors over all i. To estimate the classification error without feature selection, we simply use the best feature set Fb and obtain the LOO error of CFb, denoted Êb. We also estimate the classification error in the wrong way, by using the feature set F when designing the surrogate classifiers Ci′, thereby obtaining the “wrong” LOO error Ê′. We compute ∆E = Ê − E, ∆E′ = Ê′ − E, and ∆Eb = Êb − Eb. Finally, we repeat the previous procedure 12,800 times and plot the deviation distributions of ∆E, ∆E′, and ∆Eb, as shown in Fig. 1; the plot shows ∆E (blue), ∆Eb (red), and ∆E′ (green).
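A condensed sketch of one Monte Carlo repetition of this procedure follows. It reuses draw_sample and BLOCK_SIZE from the previous sketch, and it substitutes a simple univariate t-statistic ranking for SFFS with bolstered resubstitution purely to keep the example short; the feature-selection method and the single repetition (rather than 12,800) are therefore assumptions for illustration, not the paper's procedure.

```python
# Sketch of one Monte Carlo repetition: correct LOO (selection inside the
# loop), LOO with the known best features, and the "wrong" LOO (selection
# done once on the full sample).  draw_sample and BLOCK_SIZE come from the
# previous sketch; the t-statistic ranking is an illustrative stand-in for
# the paper's SFFS with bolstered resubstitution.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def select_features(X, y, k=20):
    """Rank features by absolute two-sample t-statistic (illustrative only)."""
    a, b = X[y == 0], X[y == 1]
    t = np.abs(a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0) / len(a) + b.var(0) / len(b) + 1e-12)
    return np.sort(np.argsort(t)[-k:])

def loo_error(X, y, feature_fn):
    """LOO estimate; feature_fn maps a training set to the features used."""
    errs = []
    for i in range(len(y)):
        tr = np.delete(np.arange(len(y)), i)
        F = feature_fn(X[tr], y[tr])
        clf = LinearDiscriminantAnalysis().fit(X[tr][:, F], y[tr])
        errs.append(clf.predict(X[[i]][:, F])[0] != y[i])
    return np.mean(errs)

rng = np.random.default_rng(1)
X, y = draw_sample(50, rng)                  # training sample S
Xt, yt = draw_sample(5000, rng)              # large test sample S'
Fb = np.arange(0, 100, BLOCK_SIZE)           # best features under the assumed means
F = select_features(X, y)                    # selected once, on the whole of S

def true_error(F_used):
    clf = LinearDiscriminantAnalysis().fit(X[:, F_used], y)
    return np.mean(clf.predict(Xt[:, F_used]) != yt)

E, Eb = true_error(F), true_error(Fb)
dE  = loo_error(X, y, select_features) - E   # correct: selection in each fold
dEb = loo_error(X, y, lambda *a: Fb) - Eb    # best features, no selection
dEw = loo_error(X, y, lambda *a: F) - E      # "wrong": F fixed in advance
print(dE, dEb, dEw)
```

Repeating this procedure many times and histogramming dE, dEb, and dEw would yield (under the stated simplifications) deviation distributions analogous to those in Fig. 1.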

Using the standard deviations (SD) of the deviation distributions, we define two measures for assessing the increased imprecision of the deviation distribution owing to feature selection:

κb = [SD(∆E) − SD(∆Eb)] / SD(∆E),
κ′ = [SD(∆E) − SD(∆E′)] / SD(∆E).

Table 1. Comparison of standard deviations and the κb, κ′ measures.

             LOO      CV10
  SD(∆E)     0.0729   0.0699
  SD(∆Eb)    0.0586   0.0612
  SD(∆E′)    0.0503   0.0544
  κb         0.1964   0.1238
  κ′         0.3104   0.2217

[Fig. 1. Simulated distributions of ∆E: (a) LOO; (b) 10-fold CV. Each panel shows the deviation distribution of the correct error estimate (∆E), of the estimate using the best features (∆Eb), and of the “wrong” error estimate (∆E′).]
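As a quick check, the κ values in Table 1 follow directly from the reported standard deviations; the short computation below reproduces them up to rounding of the reported figures.

```python
# Recompute the kappa measures from the standard deviations reported in
# Table 1 (values copied from the table above).
sd = {
    "LOO":  {"dE": 0.0729, "dEb": 0.0586, "dEp": 0.0503},
    "CV10": {"dE": 0.0699, "dEb": 0.0612, "dEp": 0.0544},
}
for name, s in sd.items():
    kappa_b = (s["dE"] - s["dEb"]) / s["dE"]
    kappa_p = (s["dE"] - s["dEp"]) / s["dE"]
    print(f"{name}: kappa_b = {kappa_b:.4f}, kappa' = {kappa_p:.4f}")
# LOO:  kappa_b = 0.1962, kappa' = 0.3100   (Table 1: 0.1964, 0.3104)
# CV10: kappa_b = 0.1245, kappa' = 0.2217   (Table 1: 0.1238, 0.2217)
```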

Figure 1 shows the results of LOO and 10-fold CV (CV10) applied to one model using LDA. The plots make the key point: correctly performed LOO or 10-fold cross-validation error estimation is close to being unbiased, but feature selection significantly increases the variance of the deviation (a flatter curve) in comparison to not using feature selection. This phenomenon is corroborated by the third distribution in the figure, which is obtained by the incorrect application of cross-validation. Although it demonstrates low bias, as expected, its variance is close to that of the deviation distribution for the best feature set, neither of these involving feature selection within the cross-validation procedure. The standard deviations of these deviation distributions, along with κb and κ′, are listed in Table 1. Both κb and κ′ are positive, indicating increased variance, in all the models and classifiers we have studied so far. However, the trend κb < κ′ in Table 1 does not always hold; our studies show that it is model- and classifier-dependent. The complete results are to be reported in a full-length paper.

3. CONCLUSION

Thus far, the completed experiments show a consistent degradation in performance when using cross-validation with feature selection in comparison with using it in the absence of feature selection. Our full goal is to measure this increase in imprecision under various conditions, namely, for different classification rules, feature-label distributions, and cross-validation protocols. Since it appears that it may be necessary to use cross-validation in certain circumstances when feature selection is being employed, it behooves us to know how it performs in different situations, for instance, to help choose a classification rule or cross-validation protocol.

4. REFERENCES

[1] U. Braga-Neto and E. R. Dougherty, “Bolstered error estimation,” Pattern Recognition, vol. 37, pp. 1267-1281, 2004.

[2] C. Sima, U. Braga-Neto, and E. R. Dougherty, “Superior feature-set ranking for small samples using bolstered error estimation,” Bioinformatics, vol. 21, no. 7, pp. 1046-1054, 2005.

[3] C. Sima, S. Attoor, U. Braga-Neto, J. Lowey, E. Suh, and E. R. Dougherty, “Impact of error estimation on feature selection,” Pattern Recognition, vol. 38, no. 12, pp. 2472-2482, 2005.

[4] A. M. Molinaro, R. Simon, and R. M. Pfeiffer, “Prediction error estimation: a comparison of resampling methods,” Bioinformatics, vol. 21, no. 15, pp. 3301-3307, 2005.