
Sensors and Actuators B 131 (2008) 93–99

Random forests and nearest shrunken centroids for the classification of sensor array data

Matteo Pardo ∗, Giorgio Sberveglieri

Sensor Lab, CNR-INFM & University of Brescia, Brescia, Italy

Available online 15 December 2007

Abstract

Random forests and nearest shrunken centroids are among the most promising new classification methodologies. In this paper we apply them – to our knowledge for the first time – to the classification of three E-Nose datasets for food quality control applications. We compare their classification rates with those obtained by state-of-the-art support vector machines. Classifier parameters are optimized in an inner cross-validation cycle, and the error is calculated by outer cross-validation in order to avoid any bias. Since nested cross-validation is computationally expensive, we also investigate the dependence of the error on the number of inner and outer folds. We find that random forests and support vector machines have similar classification performance, while nearest shrunken centroids perform worse. On the other hand, random forests and nearest shrunken centroids have a built-in feature selection mechanism that is very helpful for understanding the structure of the dataset and for evaluating sensors. We show that random forests and nearest shrunken centroids produce different feature rankings and explain our findings with the nature of the classifiers. Computations are carried out with the powerful statistical packages distributed by the R project for statistical computing.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Random forests; Nearest shrunken centroids; Support vector machines; Feature selection; Electronic nose; Sensor array; Food analysis

1. Introduction

While feature plots (e.g. responses of single sensors over time) and descriptive statistics (e.g. calibration tables) may be sufficient for the analysis of small, low-dimensional sensor data, proper pattern recognition (PR) methods are needed to evaluate the performance of sensor systems (such as E-Noses) in practical tasks. In the chemical sensor arena, simple visual explorative analysis methods are still predominantly employed, because the focus is traditionally on materials development and – furthermore – completely automated measurement systems (from sample handling to database recording) are the exception rather than the rule. Unfortunately, this hinders proper (statistical) sensor system evaluation, which requires the collection and annotation of representative sets of measurements and their processing with state-of-the-art, sound statistical procedures. PR methods with top performance – in terms of low prediction errors and data interpretation capabilities, tested on various dataset types and often with a theoretical underpinning – are continuously being developed at the border between



∗ Corresponding author. E-mail address: [email protected] (M. Pardo).

doi:10.1016/j.snb.2007.12.015

machine learning and statistics. For instance, the potential of state-of-the-art PR algorithms like random forests (RF), nearest shrunken centroids (NSC) and support vector machines (SVM) is exploited in diverse application areas, notably in post-genomics (e.g. for DNA microarray data analysis), yet their take-up is relatively slow in the analysis of chemical sensor array data. An important reason for this slow take-up is the lack of tested, documented and shared PR software. While it may be fair to say that in chemometrics – and most of engineering – Matlab is the language of choice, over the last years the R language [1] for statistical computing has been gaining momentum. The R project is an active open source (GNU) enterprise, distributing a wealth of state-of-the-art statistical packages [2]. R has already been successfully taken up in several application fields, most notably the analysis of genomic data (Bioconductor [3,4]). Being developed inside statistics, R gives particular weight to sound, reproducible results. Gentleman and Lang go a step further, introducing "the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, etc.), and as a means for distributing, managing and updating the collection" [5]. To the best of our knowledge, R has never been used for the analysis of chemical sensor data. We believe


that its openness and accent on reproducibility can be advantageous, and in this paper we will therefore use some R statistical packages.

RF is an ensemble of classification trees [6]. It uses both bootstrap aggregation (bagging), a successful approach for combining unstable learners, and random variable selection for tree building. Nearest shrunken centroids classification [7] makes one important modification to standard nearest centroid classification: the shrinkage consists of moving the class centroids towards zero by a threshold, setting a centroid to zero if it hits zero. Finally, SVM are nowadays among the most widely used learning machines, across a wide spectrum of application fields. RF and NSC are here applied – to our knowledge – for the first time to the analysis of chemical sensor array data.

In analytical chemistry there are a few papers that take advantage of the state-of-the-art classification performance and variable selection capabilities of RF. Hancock et al. were the first to compare the performance of RF and other modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies [8]. Of the methods they reviewed, though, genetic algorithms coupled to multiple linear regression and stochastic Treeboost [9] were found to considerably improve the predictive performance compared to RF. In another paper by the same group a different result was reported: Treeboost with an RF variable reduction step performs worse than RF alone on a dataset of SELDI-TOF mass spectral serum profiles [10]. Finally, Granitto et al. applied the Random Forest-Recursive Feature Elimination (RF-RFE) algorithm to the identification of relevant features in the spectra produced by proton transfer reaction-mass spectrometry (PTR-MS) analysis of agro-industrial products [11]. RF-RFE is compared with the more traditional support vector machine-recursive feature elimination (SVM-RFE), extended to allow multiclass problems, and with a baseline method based on the Kruskal–Wallis statistic (KWS). They show that RF-RFE outperforms SVM-RFE and KWS on the task of finding small subsets of features with high discrimination levels.

2. Methods

SVM have already been used (even if seldom) for chemical sensor data, and we therefore refer the reader to the literature. Classical textbooks on statistical learning theory and SVM are [12,13]; Burges wrote one of the first tutorials [14]; the first papers applying SVM to chemical sensors are [15–17]. Since random forests and NSC are newcomers to the sensor field, we explain them briefly below.

2.1. Random forests (RF)

Recently there has been a lot of interest in "ensemble learning"—methods that generate many classifiers and aggregate their results. Two well-known methods are boosting [18] and bagging [19]. In boosting, successive classifiers give extra weight to points incorrectly predicted by earlier predictors.

In the end, a weighted vote is taken for prediction. In bagging, successive classifiers do not depend on earlier classifiers—each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction.

Starting from bagging, Breiman proposed random forests [6], which add an additional layer of randomness. RFs are ensembles of trees (classification or regression trees). In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the trees are constructed. In standard trees, each node is split using the best split among all variables. In a RF, each node is split using the best among a subset of predictors (i.e. features) randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting [6]. In addition, RF has only two hyperparameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. Thirdly, RF automatically gives a variable ranking: the RF algorithm estimates the importance of a variable by looking at how much the prediction error on the data not in the bootstrap sample (what Breiman calls "out-of-bag" data) increases when that variable is permuted while all others are left unchanged. The necessary calculations are carried out tree by tree as the RF is constructed. The randomForest package [20] provides an R interface to the original Fortran programs by Breiman and Cutler.
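As a minimal illustration of this workflow – a sketch on hypothetical data, not the computations of this paper, which are run through MCRestimate as described in Section 3.2 – a forest can be trained and its variable ranking inspected as follows:

```r
# Sketch: train a random forest and inspect the OOB error and the
# permutation-based variable ranking. X and y are hypothetical stand-ins
# for a sensor array dataset (samples x features matrix, class factor).
library(randomForest)

set.seed(1)
X <- matrix(rnorm(100 * 30), nrow = 100)        # e.g. 100 measurements, 30 features
y <- factor(rep(c("classA", "classB"), each = 50))

rf <- randomForest(x = X, y = y,
                   ntree = 500,                  # number of trees in the forest
                   mtry = floor(sqrt(ncol(X))),  # random feature subset size per split
                   importance = TRUE)            # compute permutation importance

print(rf)             # confusion matrix and out-of-bag (OOB) error estimate
importance(rf)        # per-feature importance measures
varImpPlot(rf)        # ranked variable importance plot
```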


2.2. Nearest shrunken centroids (NSC)

Standard nearest centroid classification computes a standardized centroid for each class: the average value of each feature in each class, divided by the within-class standard deviation of that feature. The feature vector of a new sample is then compared to each of these class centroids, and the class of the closest centroid (in squared distance) is the predicted class for that new sample. NSC [7,21] makes one important modification to standard nearest centroid classification: it "shrinks" each of the class centroids toward the overall centroid (over all classes) by an amount called the threshold. Concretely, the overall centroid is first subtracted from the data (so the new overall centroid is zero). The shrinkage then consists of moving the class centroids towards zero by the threshold, setting a centroid to zero if it hits zero. For example, if the threshold were 2.0, a centroid of 3.2 would be shrunk to 1.2, a centroid of −3.4 would be shrunk to −1.4, and a centroid of 1.2 would be shrunk to zero. The threshold is the only hyperparameter of NSC: typically one determines the optimal threshold by cross-validation over a range of threshold values. After shrinking the centroids, a new sample is classified by the usual nearest centroid rule, but using the shrunken class centroids.

This shrinkage has two advantages: (1) it can make the classifier more accurate by reducing the effect of noisy features, and (2) it performs automatic feature selection. In particular, if a feature is shrunk to zero for all classes, it is eliminated from the prediction rule. Alternatively, it may be set to zero for all classes except one, and we learn that that feature characterizes that class. Hastie et al. developed the R package pamr for NSC (PAM stands for prediction analysis for microarrays, since the method was originally developed to analyze microarray data).
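The corresponding NSC workflow in pamr could look as follows (again a hypothetical sketch; pamr stores the dataset as a list and expects a features × samples matrix):

```r
# Sketch: NSC training, threshold selection by cross-validation and
# prediction with the pamr package (X and y as in the RF sketch above).
library(pamr)

mydata <- list(x = t(X), y = y)      # pamr expects features in rows, samples in columns
fit <- pamr.train(mydata)            # fits NSC over an automatic grid of thresholds

cv <- pamr.cv(fit, mydata)           # cross-validated error for each threshold
pamr.plotcv(cv)                      # inspect the CV curve to choose a threshold

thr <- 2.0                                            # hypothetical threshold from the CV curve
pamr.listgenes(fit, mydata, threshold = thr)          # features surviving the shrinkage
pred <- pamr.predict(fit, mydata$x, threshold = thr)  # shrunken centroid classification
```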

3. Experimental

3.1. Measurements

The E-Nose datasets we analyze in this paper were produced by the commercial EOS835 E-Nose, manufactured by the Italian company Sacmi s.c.a.r.l. The EOS835 system includes a pneumatic assembly for dynamic sampling (pump, electrovalve and electronic flow meter), a thermally controlled sensor chamber with an internal volume of 35 cm3, and an electronic board for controlling the sensor heaters and measuring the sensing layers. The system is equipped with five thin film sensors deposited at the Sensor Lab; the sensors are based on different metal oxides in order to extend the array sensitivity over a larger spectrum of volatile compounds. A suite of software for easy data handling, preprocessing and explorative analysis has also been developed at the Sensor Lab.

We analyzed two datasets posing different, commercially relevant problems in food analysis. In the first experiment we investigated the detection of toxigenic strains of the fungus Fusarium verticillioides in corn [22]; in the second we investigated extra virgin olive oil defects [23]. We refer to the cited references for experimental details. For the fungi dataset we have 103 measurements, almost balanced (58 vs. 55) between fungi which produce the deleterious mycotoxin 'fumonisin' and fungi which do not. The oil dataset (371 patterns) posed both the problem of distinguishing between two defect types ('vinegary' and 'heated') and that of discriminating the amount of defect. For the latter problem we present the results obtained in the discrimination of the 'vinegary' defect into the four classes 'high', 'medium', 'threshold' and 'under threshold'.

3.2. Computation

From each sensor response curve we extracted five different features. Three are standard features: the classical R/R0 (taking R as the minimum of the resistance response curve during adsorption; R0 is the steady-state value in air) and the integral of the response curve, calculated during the adsorption and the desorption step. We also calculated features in the phase space [24]. The phase space is spanned by the response and its first time derivative (the time variable is implicit in the trajectory described by the sensor). In this space, we calculated the integral of the trajectory during the adsorption and desorption steps, named adsorption phase integral and desorption phase integral, respectively. Both the standard integral and the phase space integrals convey information about the sensor response dynamics. Six features extracted from each of five sensors give a thirty-dimensional dataset. The same features were extracted from another dataset (determination of coffee ripening), on which feature selection has been performed [25].
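A sketch of this feature extraction for a single sampled response curve is given below. This is one plausible implementation: the function and variable names, the trapezoidal integration and the index sets marking the adsorption and desorption steps are our assumptions, and only the five features described above are computed.

```r
# Sketch: extract the features described above from one sampled sensor
# response curve. t: time stamps; r: resistance samples; ads, des: index
# ranges of the adsorption and desorption steps; r0: steady-state value in air.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

extract_features <- function(t, r, ads, des, r0) {
  dr <- c(0, diff(r) / diff(t))          # first time derivative of the response
  c(RR0       = min(r[ads]) / r0,        # classical R/R0 (R = minimum during adsorption)
    int_ads   = trapz(t[ads], r[ads]),   # response integral, adsorption step
    int_des   = trapz(t[des], r[des]),   # response integral, desorption step
    phase_ads = trapz(r[ads], dr[ads]),  # phase-space integral, adsorption step
    phase_des = trapz(r[des], dr[des]))  # phase-space integral, desorption step
}
```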
We applied each of the three classifiers (RF, NSC, SVM) to each of the three classification problems: (i) fumonisin-producing fungi versus fumonisin non-producing fungi; (ii) 'vinegary' versus 'heated' defects in oil; (iii) discrimination between 'high', 'medium', 'threshold' and 'under threshold' levels of the 'vinegary' defect in oil.

We tuned the classifiers over a grid of possible hyperparameter values. For SVM (Gaussian kernel) the hyperparameters are the parameter γ of the Gaussian kernel and the weight (cost) given to misclassified samples. We considered γ = 2^(−2:2), i.e. γ = 2^(−2), 2^(−1), ..., 2^2, and cost = 2^(−2:2). For NSC there is only one hyperparameter, the threshold, which determines the shrinkage level; we considered 10 possible values. Finally, in preliminary tests we found that the performance of RF does not depend much on the actual values of its hyperparameters inside a large interval—as already reported in the literature [6]. To speed up training we therefore used the preset values of the randomForest package: number of trees = 500 and number of features at each split = sqrt(total number of features), i.e. floor(sqrt(30)) = 5.

In order to select the optimal hyperparameters and to assess the error without bias, we used nested cross-validation. Folds were stratified, and we further averaged over two repetitions of each nested CV cycle (subsets in the repetitions contain different patterns). Unbiased error determination is crucial in PR, yet sometimes still overlooked. A typical (incorrect) strategy is to perform a cross-validation loop to tune the hyperparameters, and then to use a new (not nested) cross-validation loop, over the same samples, to estimate the test error. In this case, during the second cross-validation loop each sample set aside as test set is not completely independent of the model being evaluated, because it was previously used to select the hyperparameters. Positive bias in error estimation has often arisen in feature selection studies as well, as flagged, e.g., by an important paper by Ambroise and McLachlan [26].

Since nested cross-validation is computationally expensive, we also decided to investigate the dependence of the error on the number of inner and outer folds. We considered a grid of 25 outer/inner fold combinations: outer CV folds 2, 4, 6, 8, 10; inner CV folds 2, 4, 6, 8, 10. Altogether this means training, e.g., 45050 SVM. We were interested in three computational aspects:

1. the impact of the number of inner and outer folds on the estimated error,
2. the relative performance of the three classifiers,
3. the feature rankings produced by RF and NSC.

To carry out the computations we used the R package MCRestimate, which implements parameter optimization (of hyperparameters, as in our case, but also feature selection) and error estimation by two nested cross-validation loops [27]. MCRestimate is built on top of a number of R packages.
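MCRestimate itself is a Bioconductor package, and we refer to [27] for its interface. Purely as an illustration of the nested scheme – not the paper's actual code – a self-contained sketch using e1071's SVM and the grid above (fold assignment simplified, with stratification and repetitions omitted) could read:

```r
# Sketch: nested cross-validation for SVM. The inner loop (tune.svm) selects
# gamma and cost on the training part of each outer fold only; the outer
# loop measures the error on data never used for tuning.
library(e1071)

nested_cv <- function(X, y, outer_folds = 10, inner_folds = 10) {
  fold <- sample(rep(seq_len(outer_folds), length.out = nrow(X)))
  acc <- numeric(outer_folds)
  for (k in seq_len(outer_folds)) {
    train <- fold != k
    tuned <- tune.svm(X[train, ], y[train],
                      gamma = 2^(-2:2), cost = 2^(-2:2),  # grid as in the text
                      tunecontrol = tune.control(cross = inner_folds))
    pred <- predict(tuned$best.model, X[!train, ])
    acc[k] <- mean(pred == y[!train])   # outer fold never seen during tuning
  }
  mean(acc)                             # unbiased estimate of the classification rate
}
```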


4. Results

In Tables 1–3 we report the correct classification rate for the three classification problems, as a function of the CV folds, for each of the three classifiers. In bold we highlight the best performance for each classifier; if the optimal performance is obtained more than once, we select the CV combination which requires the least time (i.e. the smallest CV number = 'outer folds number' × 'inner folds number'). We see that:

1. SVM and RF perform similarly (each classifier does better on one problem), while NSC consistently performs worse. NSC is by far the simplest (and fastest) classifier.
2. There is a slight dependence on the number of external CV folds (particularly in the fungi dataset in Table 1, where four external CV folds produce a consistently – over internal CV fold numbers – higher classification rate), while the number of inner CV folds seems to be immaterial.
3. 2 × 2 nested CV is often enough for a good result. With respect to 10 × 10 CV, 2 × 2 CV requires 4% of the training time, so this result may spare quite some time in future computational studies.

Table 1: Classification rate as a function of classifier and number of inner and outer CV folds, fungi dataset. For each classifier the best row (outer fold number) is in bold; best results are underlined.

Table 2: Classification rate as a function of classifier and number of inner and outer CV folds, 'vinegary' defect level. For each classifier the best row (outer fold number) is in bold; best results are underlined.

Table 3: Classification rate as a function of classifier and number of inner and outer CV folds, oil defect recognition. For each classifier the best row (outer fold number) is in bold; best results are underlined.

For each outer CV fold, NSC and RF produce ranked variable lists. The list is the result of hyperparameter optimization in the inner CV cycles. In fact, NSC performs proper feature selection – through the centroid shrinking mechanism – as part of classifier tuning, i.e. the performance of each optimized classifier depends only on a subset of selected features, while RF depends on all features and simply outputs a ranking of all of them. Nothing prevents RF from being used as a feature selector with another classifier trained after it; here we did not exploit this possibility, which would require a further nested CV cycle to assess the error.

As an example, let us consider the lists produced by NSC for 10 × 10 CV. Ten lists are produced, one for every external fold. Since we perform two repetitions of the whole procedure, we obtain 20 feature lists. Note that what counts is the number of outer folds: 10 × 2 CV also produces 20 lists, while 2 × 10 CV produces 4 lists. Given a set of lists one can calculate some simple summary statistics, e.g.:

• How often is a feature part of the lists?
• What mean/median position does a feature have in the lists?
• What is the smallest list a feature is part of?

In Table 4 we answer these questions for the 20 lists produced by NSC and 10 × 10 CV on the fungi dataset. Only features present in at least half of the lists (10 out of 20) are shown. We see that:

• only 13 out of 30 features are present in at least half of the lists,
• the first three features are present in all 20 lists,
• the same three features constitute the smallest group of selected features (this happened only once out of 20 hyperparameter optimizations).


Table 4: NSC variable importance (10 outer CV folds, 10 inner CV folds, fungi dataset)

Variable   No. of times   Smallest group   Median position   Mean position   Variance
"2"        20             3                1                 1.25            0.2
"14"       20             3                2                 1.75            0.2
"4"        20             3                3                 3.05            0.05
"16"       19             5                4                 4.32            0.34
"17"       19             5                6                 6.32            1.56
"1"        18             7                6.5               6.28            2.33
"5"        18             7                7                 6.83            2.03
"25"       17             8                7                 6.82            2.15
"13"       16             9                9                 9.25            2.2
"26"       15             10               10                10              5.43
"29"       12             13               11.5              12.25           3.11
"27"       11             13               11                11.27           2.22
"28"       10             13               12.5              12.4            5.6

The table summarizes the 20 vectors of variables obtained through hyperparameter optimization in the inner CV cycles. Only features present in at least half of the optimized classifiers (10 out of 20) are shown.
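These statistics are straightforward to compute once the ranked lists are collected. A small sketch follows, assuming (our assumption, not the paper's code) that the lists are stored as integer vectors of selected features, one per outer CV fold and repetition:

```r
# Sketch: Table 4-style summary statistics for one feature over a
# collection of ranked feature lists (hypothetical input format:
# `lists` is a list of integer vectors, most important feature first).
rank_summary <- function(lists, feature) {
  pos <- sapply(lists, function(l) match(feature, l))   # rank in each list, NA if absent
  c(n_times  = sum(!is.na(pos)),                        # how often the feature is selected
    smallest = min(lengths(lists)[!is.na(pos)]),        # smallest list containing it
    median   = median(pos, na.rm = TRUE),
    mean     = mean(pos, na.rm = TRUE),
    variance = var(pos, na.rm = TRUE))
}
```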


Table 5: NSC variable importance (8 and 10 outer CV folds, all inner CV folds, fungi dataset). Each row gives, for one combination of outer and inner CV fold numbers, the variable numbers ranked in decreasing order of importance.

Outer 8, inner 2:   2 14 4 16 17 1 5 13 25 26 27 28 29 18 15 3 30 6 23 11 7 20 24 10 8 21 12 22 19 9
Outer 8, inner 4:   2 14 4 16 17 1 5 25 13 26 29 27 28 18 30 15 3 6 7 24 20 23 21 22 11 10 8 12 19 9
Outer 8, inner 6:   2 14 4 16 17 5 1 25 13 26 28 27 29 15 3 18 30 7 6 23 11 9 12 19 24 20 10 22 8 21
Outer 8, inner 8:   2 14 4 16 17 5 1 25 13 26 27 28 29 30 18 7 15 6 3
Outer 8, inner 10:  2 14 4 16 5 17 1 25 13 26 29 27 28
Outer 10, inner 2:  2 14 4 16 25 1 17 5 13 26 29 27 28 3 15 18 30 7 6 23 11 20 24 10 22 21 8 12 19 9
Outer 10, inner 4:  2 14 4 16 5 17 1 25 13 26 29 27 28 15 3 18 30 7 6 23 11 20 24 10 8 21 22 12 19 9
Outer 10, inner 6:  2 14 4 16 17 1 5 25 13 26 29 28 27 15 3 30 18 6 23 7 11 20 22 19 24 10 21 8 9 12
Outer 10, inner 8:  2 14 4 16 1 17 5 25 13 26 27 29 28 15 18 3 30 6 7 23 11 24 10 19 21 8 20 12 22 9
Outer 10, inner 10: 2 14 4 16 17 1 5 25 13 26 29 27 28 15 18 3 30 7 6 23 11 20 24 10 22 21 8 19 12 9

Variables 13 and 19 are top performing variables for RF (Table 7) which are not in the first positions in the present NSC rankings.

NSC, for any fixed number of outer CV folds, performs five different inner CV cycles (inner CV fold numbers 2, 4, 6, 8, 10); for instance, it produces five tables like Table 4 for ten outer CV folds. In Table 5 we summarize the rankings produced by NSC for outer fold numbers equal to eight and ten; each ranking is obtained by just taking the first column of the corresponding Table 4-type table. All features are shown, not only those present in at least half of the lists as in Table 4. We notice a couple of points:

• Not all rows have the same length. This is because – for 8 × 8 and 8 × 10 CV – the 16 optimized models make use of only 19 and 13 features, respectively; the other features are never used for these CV combinations.
• There is substantial homogeneity between the lists: (1) the first 13 elements are the same for all lists, with minor differences in order; (2) the first four features are the same and in the same order for all lists.

In Table 6 we show, for the RF classifier, the same type of table as Table 4. As said, RF gives a ranking of all features, so each of the 20 lists contains all 30 features. What is noticeable is that: (1) only the first two features are in the same order as in Table 4; (2) the variance of the feature rank (apart from the first two features) is much bigger than for NSC in Table 4.

Table 6: RF variable importance (10 outer CV folds, fungi dataset)

Variable   No. of times   Smallest group   Median position   Mean position   Variance
"2"        20             30               1                 1.45            0.68
"14"       20             30               2                 2               0.95
"13"       20             30               4                 4.55            3.31
"19"       20             30               4                 4.6             5.52
"3"        20             30               5                 6.4             16.57
"22"       20             30               7                 6.9             9.25
"25"       20             30               7                 7.35            6.34
"4"        20             30               8.5               8.65            10.87
"20"       20             30               10                11.3            19.38
"1"        20             30               10.5              11.1            13.67
"15"       20             30               11                10.3            17.06
"7"        20             30               11                11.5            7.63
"18"       20             30               12                11.35           6.87

The first 13 features are presented, in analogy to Table 4.


The latter point can be explained by the randomness built into RF: at each tree split a subset of five features is randomly selected. Further, RF being a highly non-linear classifier, it is more sensitive to the differences between the 20 data subsets over which the 20 lists are built.

In Table 7 we report the rankings for RF over all external CV fold numbers. Since RF does not involve an internal CV, there is only one list for every fixed external CV fold number. Some points should be noted:

• The difference between rankings is bigger than the difference between the rankings in the corresponding Table 5. This is due to the high variance within each single RF variable ranking, which we just noted. Further, the first rows of Table 7 summarize a small number of sub-rankings (e.g. only four for two outer CV folds); these rankings are naturally unstable. In Table 5 we start with eight outer CV folds, for space constraints.
• Features 13 and 19 have a high ranking for RF (almost always positions 3 and 4), while they have a low ranking for NSC (they are the two variables noted under Table 5). In particular, feature 19 ranks very low for NSC. This can be contrasted, e.g., with feature 9, which is unimportant (low ranking) for both NSC and RF. The striking difference in the importance of feature 19 can be explained as follows: NSC ranks features individually and independently in the classifier construction process. In this way, on the one hand it cannot consider the joint discrimination capabilities of groups of features, and on the other hand it does not exclude correlated features. Feature 19 has low discrimination ability by itself, but is complementary – inside the non-linear RF classifier – to the other features, and therefore gets a high rank.
• Symmetrically to the previous point, features 16 and 17 have a high rank for NSC, yet a low one for RF (they are the two variables noted under Table 7). They are correlated to the top performing features, and therefore do not bring an advantage when considered inside a group of features which already contains the top scoring ones.

Table 7: RF variable importance for all outer CV fold numbers, fungi dataset. Each row gives the variable numbers ranked in decreasing order of importance.

Outer 2:  2 14 13 3 1 4 25 15 19 27 22 7 8 18 20 21 26 5 24 11 16 23 10 17 30 28 9 29 6 12
Outer 4:  14 2 13 19 3 22 25 20 15 4 1 18 7 28 16 5 12 21 23 30 27 10 8 11 9 24 17 6 26 29
Outer 6:  2 14 19 13 25 3 4 15 22 20 18 7 1 27 8 10 16 5 21 26 24 28 9 30 17 23 6 11 12 29
Outer 8:  2 14 13 19 3 4 22 25 18 7 1 20 15 16 21 10 27 5 17 26 30 23 24 8 28 11 6 9 29 12
Outer 10: 2 14 13 19 3 22 25 4 20 1 15 7 18 16 27 10 8 21 30 5 24 26 17 23 9 28 11 6 12 29

Variables 16 and 17 are top performing variables for NSC (Table 5) which are not in the first positions in the present RF rankings.

5. Conclusions

RF, NSC and SVM are state-of-the-art classifiers with top performance in comparative studies. We compared their classification ability on three E-Nose problems. SVM and random forests were found to have similar top classification results, while NSC consistently performed worse. NSC, on the other hand, allows very fast training, which is necessary for very high-dimensional feature sets (e.g. DNA microarrays) but not for the feature vectors usually considered in chemical sensor arrays. RF and NSC have an intrinsic feature selection capability, which suggests their use for the fundamental issue of interpreting multisensor data. We saw that NSC, scoring features independently, may lose some of the features found by RF. Computations have been carried out with the R MCRestimate package: we believe that the time is ripe for the uptake by the sensor community of the R project for statistical computing and its powerful statistical packages.

Acknowledgements

This work has been supported by the FP6 European Project Woundmonitor. Pardo also acknowledges a short-term mobility grant from CNR for a 2-month stay at the Max Planck Institute for Molecular Genetics, Berlin. We wish to thank Matteo Falasconi and Manuela Gobbi for sharing the datasets analyzed in the paper.

References

[1] R. Ihaka, R. Gentleman, R: a language for data analysis and graphics, J. Comput. Graph. Stat. 5 (1996) 299–314.
[2] The R Project for Statistical Computing. http://www.r-project.org.
[3] R. Gentleman, V. Carey, D. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Yang, J. Zhang, Bioconductor: open software development for computational biology and bioinformatics, Genome Biol. 5 (2004) R80.
[4] BioConductor. http://www.bioconductor.org.
[5] R. Gentleman, D.T. Lang, Statistical analyses and reproducible research, Bioconductor Project Working Papers, The Berkeley Electronic Press, 2004.

[6] L. Breiman, Random forests, Machine Learn. 45 (2001) 5–32.
[7] R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu, Class prediction by nearest shrunken centroids, with applications to DNA microarrays, Stat. Sci. 18 (2003) 104–117.
[8] T. Hancock, R. Put, D. Coomans, Y. Vander Heyden, Y. Everingham, A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies, Chemometr. Intell. Lab. Syst. 76 (2005) 185.
[9] J.H. Friedman, Stochastic gradient boosting, Computat. Stat. Data Anal. 38 (2002) 367–378.
[10] D. Donald, T. Hancock, D. Coomans, Y. Everingham, Bagged super wavelets reduction for boosted prostate cancer classification of SELDI-TOF mass spectral serum profiles, Chemometr. Intell. Lab. Syst. 82 (2006) 2.
[11] P.M. Granitto, C. Furlanello, F. Biasioli, F. Gasperi, Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products, Chemometr. Intell. Lab. Syst. 83 (2006) 83–90.
[12] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[13] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[14] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (1998) 121–167.
[15] M. Pardo, G. Sberveglieri, Support vector machines for the classification of electronic nose data, in: 8th International Symposium on Chemometrics in Analytical Chemistry, Seattle, 2002.
[16] C. Distante, N. Ancona, P. Siciliano, Support vector machines for olfactory signals recognition, Sens. Actuators B: Chem. 88 (2003) 30–39.
[17] M. Pardo, G. Sberveglieri, Classification of electronic nose data with support vector machines, Sens. Actuators B: Chem. 107 (2005) 730.
[18] R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, Ann. Stat. 26 (1998) 1651–1686.
[19] L. Breiman, Bagging predictors, Machine Learn. 24 (1996) 123–140.
[20] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[21] R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. U.S.A. 99 (2002) 6567–6572.
[22] M. Falasconi, E. Gobbi, M. Pardo, M. Della Torre, A. Bresciani, G. Sberveglieri, Detection of toxigenic strains of Fusarium verticillioides in corn by electronic olfactory system, Sens. Actuators B: Chem. 108 (2005) 250.
[23] M. Falasconi, M. Pardo, M. Vezzoli, S. Esposto, M. Servili, G. Montedoro, G. Sberveglieri, Electronic nose and SPME-GC–MS to identify olive oil defects, in: 11th International Symposium on Olfaction & Electronic Nose, Barcelona, 2005.


[24] E. Martinelli, C. Falconi, A. D'Amico, C. Di Natale, Feature extraction of chemical sensors in phase space, Sens. Actuators B: Chem. 95 (2003) 132.
[25] M. Pardo, G. Sberveglieri, Comparing the performance of different features in sensor arrays, Sens. Actuators B: Chem. 123 (2007) 437.
[26] C. Ambroise, G. McLachlan, Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Natl. Acad. Sci. U.S.A. 99 (2002) 6562–6566.
[27] M. Ruschhaupt, W. Huber, A. Poustka, U. Mansmann, A compendium to ensure computational reproducibility in high-dimensional classification tasks, Stat. Appl. Genet. Mol. Biol. 3 (2004), Article 37.

Biographies

Matteo Pardo received his degree in physics (summa cum laude) with a thesis in theoretical surface physics in 1996 and obtained his PhD in computer engineering in 2000. Since 2002 he has been a researcher of the National Institute for Matter Physics (now CNR-INFM). His research interest is data analysis, in particular the application of modern machine learning techniques (classification, regression and feature selection) to gas sensor arrays and, recently, to high-throughput molecular biology. He has published more than 22 journal papers and more than 70 conference papers, mostly as first author. He was an invited lecturer at 6 international conferences (one plenary) and director of the "Short Course on Fundamentals of Signal and Data Processing" for the 2nd EU Network of Excellence on Artificial Olfactory Sensing. In 2003 he was awarded the Wolfgang Göpel Memorial Award. He is the Technical Program Chair of the next International Symposium on Olfaction and Electronic Nose.

Prof. Giorgio Sberveglieri was born on July 17, 1947 and received his degree in physics cum laude from the University of Parma (Italy), where in 1971 he started his research activities on the preparation of semiconducting thin film solar cells. In 1994 he was appointed full professor in experimental physics. He is now the Director of the CNR-INFM and Brescia University Sensor Laboratory. He is associate editor of the IEEE Sensors Journal, has been the general chairman of IMCS-11 (11th International Meeting on Chemical Sensors) and is the Chair of the Steering Committee of the IMCS conference series. He will be the general chairman of the next International Symposium on Olfaction and Electronic Nose. During 35 years of scientific activity he has published more than 250 papers in international journals and presented more than 250 oral communications at international congresses (12 plenary talks and 45 invited talks). He has also been an evaluator for the European Union in the areas of nanoscience, nanomaterials and ICT. In FP6 he was the coordinator of the EU project NANOS4 (Nano-structured solid-state gas sensors with superior performance).