
R.A. Viscarra Rossel, J. Near Infrared Spectrosc. 15, 39–47 (2007)


Robust modelling of soil diffuse reflectance spectra by “bagging-partial least squares regression”

R.A. Viscarra Rossel

Australian Centre for Precision Agriculture, Faculty of Agriculture, Food & Natural Resources, McMillan Building A05, The University of Sydney, NSW 2006, Australia. E-mail: [email protected]

Visible (vis), near infrared (NIR) and mid infrared (mid-IR) diffuse reflectance spectroscopy (DRS) coupled with partial least squares regression (PLSR) are increasingly being used in the agricultural and environmental sciences as an efficient complement to conventional laboratory analysis. The DRS techniques are rapid, relatively cheap and more efficient for obtaining data than conventional analysis, especially when a large number of samples and analyses are required. A single spectrum may be used to predict various physical, chemical and biological soil properties. The robustness of PLSR models and their predictions may be improved by combining the implementation of PLSR with bootstrap aggregation or “bagging”. Bagging aims to reduce the variance of predictions by aggregating a number of models obtained in the course of re-sampling. The aim of this work was to test the implementation of bagging with PLSR (bagging-PLSR) using vis-NIR and mid-IR soil diffuse reflectance spectra to predict soil organic carbon (OC). Bagging-PLSR was shown to: (i) be more robust than PLSR alone, (ii) be less prone to over fitting and improve prediction accuracy and (iii) provide a measure of the uncertainty of the models and their predictions.

Keywords: partial least squares regression, bootstrap aggregation, bagging, soil diffuse reflectance spectra

Introduction

Diffuse reflectance spectroscopy (DRS) coupled with multivariate calibration is increasingly being used in soil science as an efficient complement to conventional laboratory analysis. The advantages of DRS for soil studies using the visible (vis), near infrared (NIR) and mid infrared (mid-IR) have been identified.1 The DRS techniques are rapid, relatively cheap and more efficient for obtaining soil data than conventional laboratory techniques, especially when a large number of samples and analyses are required. Moreover, a single spectrum may be used to predict various physical, chemical and biological soil properties. To this end, partial least squares regression (PLSR) has been successfully used in many studies for the prediction of various soil properties from their diffuse reflectance spectra.1–3

However, PLSR is essentially a linear regression technique and may not always be appropriate for soil diffuse reflectance spectra, as some of the relationships between spectra and soil properties may be non-linear. In such instances, non-linear relationships may be linearised using transformations4 and by pre-processing the spectra5 or may be avoided using wavelength selection techniques6 or judicious experimental design.7 When using PLSR, weak non-linearities may be compensated for by using a few extra latent variables (or factors) in the analysis. However, care must be taken because

DOI: 10.1255/jnirs.694

training models with too many factors will tend to over fit the calibration data and to predict the test/prediction data poorly. The cross-validation error8 is usually used to determine the optimal number of factors to model.

The robustness of these PLSR models and their predictions may be improved by combining their implementation with (statistical) re-sampling or ensemble techniques such as bootstrap aggregation (or bagging).9 Bagging aims to reduce the variance of predictors by aggregating a number of models obtained in the course of re-sampling. Hence, it can prevent over fitting. It has been most successful with unbiased but variable predictors, although Breiman10 suggested the use of iterated bagging to reduce bias. In this instance, I deal with the former, i.e. reduction of variance of PLSR predictors.

Few studies exist that use bagging together with multivariate calibration techniques. For example, Conlin et al.11 used data augmentation and aggregation techniques with PLSR to develop more robust spectroscopic calibration models. Mevik et al.12 investigated the properties of bagging and data augmentation with PLSR to improve the robustness of calibrations. Hancock et al.13 compared the performance of a number of data mining techniques with common chemometric techniques for multivariate calibration. I did not find any study that investigated the use of bagging with multivariate calibration of soil diffuse reflectance spectra. Therefore, the aim of this paper is to compare PLSR to bagging with PLSR (bagging-PLSR)

ISSN 0967-0335

© IM Publications 2007


using vis-NIR and mid-IR soil diffuse reflectance spectra for the prediction of soil organic carbon (OC). Before delving into the work, I will briefly outline the PLSR and bagging techniques.

Partial least squares regression

Partial least squares regression (PLSR) has been implemented in many areas of econometric and scientific research, including soil science,2 since its formulation in the early 1980s. The method was first proposed for analysing NIR spectra by Wold et al.14 who derived an algorithm with orthogonal scores. Since then, attempts have been made to modify PLSR but, essentially, it has not changed. Martens and Naes15 proposed a PLSR algorithm with orthogonal loadings and the reader is directed to that text for details on the algorithm and methodology. In summary, PLSR aims to link the centred response variable vector, y, to the matrix of centred predictors, X, through k latent variables (or factors) by:

$$X = t_1 p_1' + \dots + t_k p_k' + E_k$$

$$y = t_1 q_1 + \dots + t_k q_k + f_k$$

where t is a vector of scores calculated by $t_k = X_{k-1} w_k$ with scaled weights $w_k = c X_{k-1}' y_{k-1}$, c is the scaling factor, p are the spectral loadings, q the chemical loadings and E and f are the predictor and response variable residuals, respectively, of the estimated effect for the kth factor. Thus, the algorithm may be defined successively using the above equations and by incrementing k = 1, 2, …, K.

The number of factors to use in the PLSR model may be determined through leave-one-out cross-validation.8 The optimal number of factors should allow the modelling of as much as possible of the correlation between X and y without over fitting y. Then, for the selected number of factors one calculates the final linear regression coefficients, $b = W (P'W)^{-1} q$ and $b_0 = \bar{y} - \bar{x}' b$, to be used in the predictor $\hat{y}_i = b_0 + x_i' b$, where $x_i$ is the new spectrum.
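The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch of an orthogonal-scores algorithm for a single response, not the implementation used in this work. The function names (`pls1_fit`, `pls1_predict`, `loo_cv_rmse`) are hypothetical, the scaling factor c is assumed to normalise the weights to unit length, and the loadings are assumed to be obtained by regressing the current residuals on the scores.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Orthogonal-scores PLS for one response; returns (b0, b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean        # centred predictors and response
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk                      # w_k proportional to X'_{k-1} y_{k-1}
        w = w / np.linalg.norm(w)          # assumed scaling c: unit-length weights
        t = Xk @ w                         # scores t_k = X_{k-1} w_k
        tt = t @ t
        p = Xk.T @ t / tt                  # spectral loadings p_k
        qk = yk @ t / tt                   # chemical loading q_k
        Xk = Xk - np.outer(t, p)           # predictor residual E_k
        yk = yk - qk * t                   # response residual f_k
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # b = W (P'W)^{-1} q
    b0 = y_mean - x_mean @ b               # b0 = mean(y) - mean(x)' b
    return b0, b

def pls1_predict(b0, b, X_new):
    """Apply the predictor y_hat = b0 + x'b to one or more spectra."""
    return b0 + X_new @ b

def loo_cv_rmse(X, y, n_factors):
    """Leave-one-out cross-validation RMSE for a given number of factors."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i      # drop sample i from the training set
        b0, b = pls1_fit(X[keep], y[keep], n_factors)
        errors.append(y[i] - pls1_predict(b0, b, X[i]))
    return float(np.sqrt(np.mean(np.square(errors))))
```

One way to choose the number of factors with this sketch is to evaluate `loo_cv_rmse` for k = 1, 2, …, K and keep the smallest k beyond which the error no longer decreases appreciably.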

The bootstrap and bootstrap aggregation (bagging)

In essence, the bootstrap performs sampling within a sample. It is a technique that may be used to estimate the cumulative distribution function (CDF) of a population, its moments and their uncertainty by re-sampling with replacement. The bootstrap assumes that the CDF of the data is sufficiently similar to that of the original population, and that multiple realisations of the population can be replicated from a single data set. Although a bootstrap sample may contain the same datum two, three or more times, it also leaves out approximately 37% of the data in the course of re-sampling.16 The concept and applications of the bootstrap are described in other publications16,17 and the reader is directed to them for thorough accounts.

Bootstrap aggregation or “bagging” was first proposed by Breiman18 to reduce the variability of unbiased but variable predictors. Breiman showed that bagging can lead to substantial improvements in accuracy in both classification

and regression, especially when alterations in the training data set can cause significant changes in the outcome of the modelling procedure. These improvements result from the aggregation of a number of different bootstrapped models, where each model provides unique information. For example, with linear regression, an unknown is predicted with each of the bootstrapped (training) models and the bagged prediction is simply the average of these predictions. Bagging will work best when procedures are variable and/or unstable. The reader is directed to Breiman18 for a detailed account.
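The linear regression example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this work: ordinary least squares stands in for the base model, and the function names, the default of 50 bootstrap models and the use of the standard deviation of the individual predictions as a measure of spread are all choices made for the example. The sketch also reports the mean fraction of data left out of each bootstrap sample; for a sample of size n this fraction is expected to be $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.368$ as n grows, in line with the figure of approximately 37% given above.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return coef[0] + X @ coef[1:]

def bagged_predict(X, y, X_new, n_boot=50, seed=0):
    """Bagged prediction: average over models fitted to bootstrap samples.

    Returns the bagged mean, the standard deviation of the individual
    predictions, and the mean fraction of data left out of each sample.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, len(X_new)))
    left_out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n indices with replacement
        coef = fit_linear(X[idx], y[idx])  # train on the bootstrap sample
        preds[b] = predict_linear(coef, X_new)
        left_out[b] = 1.0 - len(np.unique(idx)) / n
    return preds.mean(axis=0), preds.std(axis=0, ddof=1), left_out.mean()
```

Replacing `fit_linear` and `predict_linear` with a PLSR fit and predictor would give a bagging-PLSR variant of the same sketch.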

Methods

Vis-NIR data

One hundred and thirty-seven A-horizon soil samples from various locations across Brittany, France, were collected, oven-dried, ground and sieved to a size fraction