Comparison of Geographically Weighted Regression and ... - UConn

0 downloads 0 Views 2MB Size Report
... study area is located in Longyan, Fujian Province, China (115°10′–115°40′ ..... Summary Statistics of SOM Sample Data and Ln-Transformed Data. N. Min.
Comparison of Geographically Weighted Regression and Regression Kriging for Estimating the Spatial Distribution of Soil Organic Matter Ku Wang

Department of Geographical Science, Minjiang University, 350108, Fuzhou, China

Chuanrong Zhang1 and Weidong Li

Department of Geography and Center of Environmental Sciences and Engineering, University of Connecticut, Storrs, Connecticut 06269

Abstract: Soil organic matter (SOM) is an important component of soils, and knowing the spatial distribution and variation of SOM is the premise for sustainably utilizing soils. The objective of this study was to compare geographically weighted regression (GWR) with regression kriging (RK) for estimating the spatial distribution of SOM using field-sample data in SOM and auxiliary data in correlated environmental variables (e.g., elevation, slope, ferrous minerals index, and Normalized Difference Vegetation Index). Results showed that GWR was a relatively better method and could provide promising results for SOM prediction in comparison with RK. The map interpolated by GWR showed similar spatial patterns influenced by environmental variables and the nonapparent effect of data outliers, but with higher accuracies, compared to that interpolated by RK.

INTRODUCTION Soil organic matter (SOM) is one of the important components of soils. Although it accounts for only a small percentage of soil materials, its influence on many soil properties is significant. For instance, SOM has great impacts on soil structure, arability, water holding capacity, and capability of nutrient supply and conservation (Reeves, 1997; Aref and Wander, 1998). Soil organic matter is also considered a gigantic store of organic carbon (C). It is estimated that soil holds approximately three times more C than terrestrial vegetation and twice as much as the atmosphere (Eswaran et al., 1993; Davidson et al., 2000). Therefore, it is important to understand the spatial distribution and variation of SOM for purposes such as environmental protection and sustainable development of agriculture. At regional scales, the spatial distribution of SOM is not only affected by natural ecological processes but also by intensive human activities. However, due to its complex variability, it is difficult to obtain detailed SOM information for large areas. The traditional method of acquiring detailed SOM data is intensive field soil surveys 1

Corresponding author; email: [email protected]

915 GIScience & Remote Sensing, 2012, 49, No. 6, p. 915–932. http://dx.doi.org/10.2747/1548-1603.49.6.915 Copyright © 2012 by Bellwether Publishing, Ltd. All rights reserved.

916

wang et al.

followed by laboratory sample analysis, which is very time consuming and expensive. In addition, sometimes detailed survey and sampling can be difficult to carry out in some regions where access is constrained by topography and other factors. The high cost of collecting SOM data at many locations has created a strong need for effective methods of inferring continuous spatial information about SOM, such as quantitative interpolation methods. A number of prediction methods have been suggested to interpolate SOM from sparse sampling points into continuous surfaces, varying from regression methods (such as simple linear regressions, nonlinear regressions, inverse distance weighting, generalized linear models, and regression trees) to geostatistical methods (such as ordinary kriging [OK], ordinary cokriging [OCK], regression kriging [RK]) and other hybrid techniques (e.g., McKenzie, 1999; Robinson and Metternicht, 2006; Grimm et al., 2008). Among the aforementioned approaches, OK is a commonly used method to estimate the unsampled locations by calculating weighted averages of observed samples of the target variable. While OK intends to minimize the prediction error variance, this approach conducts interpolation relying solely on point samples of the target variable and does not consider auxiliary information. Thus it requires dense sample data to conduct a reasonable interpolation, and is limited by the quantity and density of samples (Pang et al., 2009). Influenced by the same spatial processes or regional conditions, the spatial distribution of SOM has great correlations with other environmental factors. To consider the coregionalization feature, OCK has been employed to improve the prediction accuracy by utilizing an auxiliary variable (Goovaerts, 1998; Li et al., 2006). By incorporating easily acquired auxiliary data, which have close relations to the target variable, OCK has been used for purposes of reducing sampling density and improving prediction accuracy. Another alternative kriging approach to incorporate auxiliary environmental variables is a hybrid interpolation approach, called RK (Odeh et al., 1995). Regression kriging first uses ordinary regression on auxiliary environmental variables to obtain the trend component and then uses simple or ordinary kriging to interpolate the residuals from the regression model. Therefore, it can incorporate a number of environmental variables in prediction. Results from previous studies proved that RK could greatly improve prediction accuracy. Regression kriging has been applied to modeling or mapping the spatial variability of soil properties in some studies in the literature (e.g., Baxter and Oliver, 2005). Comparison studies of RK and OCK (Knotters et al., 1995; Eldeiry and Garcia, 2009) showed that RK had obvious advantages over OCK: the former generated much less of a smoothing effect and also has fewer parameters to compute. The strong smoothing effect of OCK may represent a major drawback on predictions based on the data of multiple environmental variables. In addition, the difficulty in fitting a suitable linear model to the co-regionalization matrix is also a serious concern, especially when auxiliary variables are not strongly correlated with the primary variable and have different correlation ranges (Myers, 1982). Although RK is able to take auxiliary data into account, it adopts the stationary assumption. This means that parameters of RK do not change over different locations. However, many environmental variables are spatially heterogeneous, which implies that every location has an intrinsic degree of uniqueness due to its situation with respect to the rest of the study area. Thus applying a single set of global parameters may not be adequate to describe the relations between the target variable and



spatial distribution of soil organic matter

917

auxiliary environmental variables. Geographically weighted regression (GWR) was specifically designed to deal with issues of non-stationarity by measuring local relationships between the target variable and explanatory variables, which differ from location to location (Fotheringham et al., 2002). Unlike RK, which depends on a ­single set of variogram and/or regression parameters to summarize global relationships, GWR estimates local regression parameters, and its model performance varies across a study region; thus it is a local regression procedure developed to deal with the non-­stationarity issue (Yu et al., 2009). Similar to RK, GWR is also able to incorporate a number of auxiliary environmental variables. Recently, GWR was extensively used to model spatial distributions and relationships in a variety of different fields, particularly to investigate spatial non-stationarity between the target variable and the explanatory variables (e.g., Tu and Xia, 2008; Kamarianakis et al., 2008; Mitchell and Yuan, 2010). However, based on our knowledge, studies using this approach to model the spatial distribution and variation of soil properties have been rare and specific comparison studies between GWR and RK have not been found. We found that Mishra et al. (2010) was the only application case study that used GWR in predicting soil properties. The objective of this study is to compare the performance of GWR with RK in estimating the spatial distribution and variation of SOM. It should be noted that this study does not intend to find a superior spatial prediction or interpolation method, but rather mainly to verify whether GWR is a suitable method for estimating the spatial distribution and variation of SOM at a regional scale. STUDY AREA AND DATA Study Area The study area is located in Longyan, Fujian Province, China (115°10′–115°40′ E., 26°0′–26°40′ N) (Fig.1), covering a total area of 1 260 km2. Elevations in this area decrease from the north, east, and west toward the south and center, with altitudes ranging from 140 m to 1120 m. It has a subtropical monsoon humid climate with abundant but seasonally uneven rainfall. The annual mean precipitation is 1600 mm but over half falls during the rainy season (April–June) when storms are frequent. Drought conditions occur in autumn, as both frequency and amount of precipitation decrease significantly. The annual mean temperature is 19.1°C, the coldest temperature is 3.4°C in January, and the hottest is 35.4°C in July. The main geomorphic types are river valley alluvial plains, red-earth hills, red sandstone hills, and granite and metamorphic rock mountains. The soils in the area are dominated by Haplic Acrisols and Stagnic Anthrosols, dotted with Ferralic Cambisols. Data Sets and Data Processing Field Sample Data. Surface soil samples from the 0 to 20 cm horizon were collected at randomly selected sites in the study region during November 2005. The total number of selected sample sites was 353 (Fig. 1). Coordinates and elevations of the sample sites were recorded using a GPS (GARMIN) instrument with accuracies ranging from 3 to 15 meters. One thousand grams of mixed soil at each site was collected

918

wang et al.

Fig. 1. The study area and sampling sites.

and further prepared in the laboratory for chemical analysis. Soil organic matter was measured using the potassium dichromate volumetric method (SOM = SOC × 1.724) after air drying, grinding, and sieving. To assess the prediction accuracies of RK and GWR approaches, we randomly divided the field sample data set of SOM into a validation data set with 115 samples (one-third of the total) and the calibration data set with the other 238 samples (two-thirds of the total) (Fig. 1). Environmental Data. We collected several environmental data sets, which included: a digital elevation model (DEM) (grid format, 30 × 30 m resolution) from the Fujian Provincial Geomatics Center, China; an ETM+ remote sensing image (January 2000) downloaded from International Scientific Data Service Platform (http://datamirror.csdb.cn); and a land use map (shapefile format of ESRI) from the Land and Resources Department of Fujian Province, China. The environmental indicators included elevation, slope, slope aspect, Normalized Difference Vegetation Index (NDVI), the distance from sample point to river (DIST), land use type, and ferrous mineral index (FMI). Elevation data were acquired directly from the DEM. Slope and aspect data were also calculated from the DEM using the spatial analyst module of ArcGIS. The unit of slope adopted here is degrees, and it ranges from 0 to 70. Aspect data are qualitative data representing directions, and range from 0 to 360 from the north direction. Here, a cosine function was used to transfer slope aspect into quantitative data, with values



spatial distribution of soil organic matter

919

ranging from –1 to 1.2 Normalized Difference Vegetation Index data were derived from the ETM+ image using the following equation (e.g., Pavel and Kappas, 2008)

NDVI = (Band4 – Band3)/(Band4 + Band3) .

(1)

Because underground water greatly influences the accumulation and decomposition of SOM (Jiang et al., 1987), it was used as a factor to analyze the variation of SOM. However, underground water data is difficult to obtain for large areas, so the distances of sample locations to a river and elevation were jointly used to describe the underground water level. In general, if a sample site is close to a river and located at a lower elevation, it can be assumed that its underground water level is shallow, and vice versa. FMI data were obtained from the ETM+ image using the ratio “Band5/Band4” using ERDAS IMAGINE software. This index reflects the mineralization intensity of soil ingredients; the higher the intensity of soil mineralization, the lower SOM content (Zech et al., 1997). There are several land use types in the study area, such as forest, meadow, paddy land, dry land, garden, bare land, and village, which should have impacts on SOM. Because land use belongs to the category of qualitative data, quantitative transformation is needed for land use data to be used with other quantitative data in regression models. In this study, FRAGSTATS, a landscape analysis package, was used to calculate land use patch index. Here, the area-weighted mean shape index (AWMSI) of patches was used for quantitative transformation of land use data. It was thought that this index represented the impact of human activity on the land; the larger the value, the more complex the patches as a result of human activity (Gao and Li, 2011; Wang et al., 2009). METHODS Ordinary Kriging Ordinary kriging is based on the theory of regionalized variables, which assumes that the variables involved are random but spatially correlated at some scales (Matheron, 1971). Assume Z(xi) is a regionalized variable, which has a variogram (autovariogram) γ(h), a function describing the spatial dependence of a spatially random field or stochastic process z(u). The variogram can be estimated by 1   h  = --------------2N  h 

Nh

  z  ui  – z  ui + h  

2

(2)

i=1

where h is the spatial lag between two locations; N(h) is the number of observed data pairs with the lag h; and z(ui) and z(ui + h) are two measured values at locations ui and ui + h, respectively. To describe the experimental variogram for use in kriging, a mathematical model has to be fitted to the experimental variogram. Ordinary kriging is an optimal technique that provides unbiased estimates with minimum and known errors. In this study, the traditional OK was used in RK and it is expressed as 2

The value of cosine aspect of slope (COSA) indicates the degree of northerliness (Lian et al., 2006).

920

wang et al.

n

z  x 0  =

(3)

i=1

n

with a constraint

 i  x0 z  xi 

 i  x0  = 1. Here, z*(x ) is the estimated value of the variable z 0

i=1

at location x0; z(xi) is the measured data; λi(x0) refers to the weights associated with the measured values, which are estimated by the stationary OK system; and n is the number of measured values within a neighborhood. Regression Kriging In the case of RK, the soil property at an unsampled location x0 is estimated by summing the predicted drift and residual ˆ  x  + eˆ  x  , z  x 0  = m 0 0

(4)

ˆ is the drift, which is usually fitted with a linear regression, and eˆ is the where m residual, which is estimated using OK. RK can be expressed as p

z x0  =



ˆ k  q k  x 0  +

k=0

n

 i  x0   e  xi q0  x0  = 1,

(5)

i=1

where ˆ k represents the k-th estimated coefficient of the drift model, qk (x0) is the k-th external explanatory variable or predictor at location x0, p is the number of predictors, λi(x0) are weights determined by the covariance function of the residuals, and e(xi) is the regression residual at location xi. The first part of the right hand side of Equation (5) represents the linear regression and the second part represents kriging of the residuals. Apparently, RK combines both the regression of the target variable (SOM) based on the environmental factors (such as NDVI, elevation, etc.), which are used as explanatory co-variables, and the kriging of the regression residuals (Odeh et al., 1995). Regression kriging is based on the idea that the deterministic component of the dependent variable can be explained by regression and the residuals are assumed to describe the spatially varying but self-dependent components (Sun et al., 2012). We interpolated the spatial distribution of SOM by RK using five steps: (1) determine the LnSOM (i.e., Ln-transformed SOM) drift model using multiple linear regression (MLR); (2) derive the LnSOM drift model residuals at the sample locations; (3) model the covariance structure of the LnSOM drift model residuals using a variogram model; (4) interpolate the LnSOM drift model residuals using ordinary kriging; (5) add the LnSOM drift model surface to the interpolated residuals at each prediction point. Geographically Weighted Regression Geographically weighted regression is an extension of the traditional regression in which variations in rates of change are allowed in order that regression coefficients



spatial distribution of soil organic matter

921

are specific to a location rather than being global estimates (Brunsdon et al., 1998; Fotheringham et al., 1998). Suppose there are series of explanatory variables {xij} and dependent variables {yi}, i = 1, 2, …, m, j = 1, 2, …, n, a conventional linear regression fitted by the ordinary least squares (OLS) method is expressed as yi = 0 +

n

 j xij + i ,

(6)

j=1

where yi is the value of the dependent variable y at location i, β0 is the intercept, βj is the coefficient for independent predictor variable xj. εi represents the error term, which is generally assumed to be independent and normally distributed with zero means and constant variance σ2. In this model, each of the parameters can be thought of as the parameter between one of the independent variables and the dependent variable. This type of regression is known as global because of the assumption of spatial stationarity of its parameter estimates, which means that a single model is fitted to all of the sample data and is applied equally to the entire study area of interest. The regression model and its coefficients are constant across the study area, assuming the relationships between the dependent and independent variables to be spatially constant. However, variations or spatial non-stationarity in relationships between the dependent and independent variables over space commonly exist in spatial data sets and the assumption of stationarity or structural stability over space may be unrealistic (Fotheringham et al., 1998). So, when analyzing spatial data, we should take into account spatial non-stationarity. The local regression approach, known as GWR, recognizes explicitly that the parameter estimates in a regression model can vary across the space in which the regression model is calibrated. Geographically weighted regression allows the parameter estimates to be a function of location. The local estimation of the parameters with GWR is expressed by the following equation y i =  0  u i ,v i  +

n

 j  ui ,vi xij + i ,

(7)

j=1

where (ui, vi) is the spatial location of the i-th observation and βj (ui, vi) is the value of the j-th parameter at location i. The regression parameters of this equation are estimated at each location i(ui, vi). Therefore, this GWR model can measure spatial variations in relationships. The parameters in the GWR model can be calibrated using the weighted least squares approach. In matrix form, the parameters of the GWR model at each location i are estimated by –1 T T ˆ  u i ,v i  =  X W  u i ,v i X  X W  u i ,v i Y ,

(8)

where W(ui, vi) is an (m × m) spatial weighting diagonal matrix with m being the number of observed data for regression point i, X is an [m × (n +1)] independent data matrix with n being the number of explanatory variables, and Y is an (m × 1) dependent data vector.

922

wang et al.

To estimate parameters in the GWR model, it is important to decide the spatial weighting matrix, which can be calculated by different methods. One method is to specify W(ui, vi) as a continuous and monotonic decreasing function of distance dij between point i and point j. For adaptive kernel size, the weight of each point can be calculated by applying the Gaussian function d ij 2 2 w ij = 1 –  ------ , if d ij  h ,  h

(9)

w ij = 0 , if d ij  h

where wij is the weight of location j in the space at which data are observed for estimating the dependent variable at location i, and h is referred as a bandwidth. This function is a distance decay function, in which the weights of farther distant points from an unsampled location i decrease and the weights will practically fall to zero for those observations that are sufficiently distant from location i to be estimated (beyond the bandwidth). In this study, the adaptive method was employed to calculate kernels of GWR. The selection of the weighting function and optimal bandwidth h was accomplished by minimizing the corrected Akaike Information Criterion (AIC) as described in Fotheringham et al. (2002). The Akaike Information Criterion and varied bandwidths were calibrated to process the regression models. In the processing of the GWR regression models, weights and bandwidths decreased in the densely sampled places and increased in the sparsely sampled places (Jaimes et al., 2010). Evaluation The prediction accuracy of different approaches was evaluated by comparing the validation data (i.e., remaining observed data), which were not used for interpolation, with the predicted data. Mean error (ME) and root mean square error (RMSE) were used to verify prediction accuracy. Mean error is expressed as 1 ME = --n

n

  z  xi ,yi  + z  xi ,yi   ,

(10)

i=1

where n is the number of SOM observations in the validation dataset, z(xi,yi) and z*(xi,yi) are values of the observed and the predicted SOM, respectively; and xi and yi are the location coordinates. Root mean square error is expressed as RMSE =

1--n

n

  z  xi ,yi  – z  xi ,yi  

2

.

(11)

i=1

RESULTS Exploratory Data Analysis Regression and geostatistical analysis perform best on normally distributed data. When non-normality is apparent, transformations of the data to make them

spatial distribution of soil organic matter



923

Table 1. Summary Statistics of SOM Sample Data and Ln-Transformed Data N

Min. Max. Mean Median

STD 8.37

Skew

SOM (g/kg)

353

7.25 52.3 21.2 19.7

.822

LnSOM

353

1.98 3.95 2.98 2.98 .3937 –.07

STDE of STDE of Kurt. skew kurt. .130

 .674

.259

.130

–.569

.259

a N = number of samples; Min. = minimum; Max. = maximum; STD = standard deviation; skew = skewness; kurt. = kurtosis; STDE = standard error; LnSOM = Ln-transformed soil organic matter.

approximately normal should be helpful. Table 1 provides a statistical summary of the entire SOM data set of 353 observations. The observed SOM varies from 7.25 to 52.3 g/kg, which means there are great variations of SOM in the study area. The value of skewness is 0.822 and that of kurtosis is 0.674, indicating that the SOM data are approximately normally distributed. However, the Kolmogorov-Smirnov (K-S) test produced a significant coefficient (p-value = 0.04 < 0.05), which indicates that the data set may need a normal transformation. Here, the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis (here, the null hypothesis is that the data accord with a normal distribution) is true. If the p-value is less than the required significance level, the null hypothesis is rejected. On the contrary, if the p-value is not less than the required significance level, the evidence is insufficient to reject the normality of data. In K-S test, 0.05 and 0.01 are the two frequently used significance levels, representing, respectively, the significant and very significant levels. In this study, we used a logarithmic function to perform the data transformation. We confirmed a final normal data distribution through the result of the K-S test (p-value = 0.194 > 0.05). The Ln-transformed data values vary from 1.98 to 3.95. A Pearson correlation analysis (Barnes et al., 2005) was performed to check the relationships between the Ln-transformed SOM and the environmental factors, including ELEVATION, SLOPE, COSA, NDVI, DIST, AWMSI, and FMI. Correlation coefficients among these variables are listed in Table 2. The results of the correlation analysis indicate that SOM has significant correlations with almost all of the environmental variables except for COSA. Positive correlations exist with ELEVATION, NDVI, SLOPE, DIST, and AWMSI, and a negative correlation occurs with FMI. However, there are also significant correlations among the environmental variables themselves, such as ELEVATION with NDVI, slope, FMI, and AWMSI; and DIST with ELEVATION, NDVI, and FMI. To reduce the multicollinearity problem, a stepwise linear regression was performed for dropping closely related predictor variables. Table 3 shows the results of the stepwise linear regressions using all variables. Based on the results, we chose the linear regression model e, which considers five environmental variables of ELEVATION, FMI, DIST, SLOPE, and NDVI, to be our final multiple linear regression (MLR) model. This model explains 60 percent of the variance in SOM (adjusted R2 = 0.600) and has a significant F-test result (p-value = 0.025 < 0.05). The MLR model is expressed as

924

wang et al.

Table 2. Pearson Correlation Matrix Between SOM and Environmental Variablesa LnSOM LnSOM NDVI DIST

1

NDVI

DIST

ELEVASLOPE TION

COSA

FMI

AWMSI

.517** .180** .728** .333** 0.055 –.582** .226** 1

.198** .512** .343** –0.043 –.561** .208** 1

ELEVATION SLOPE COSA FMI AWMSI

.363** .187** –0.051 –.201** 0.052 1

.538** 0.058 –.540** .333** 1

0.043 –.407** 1

.137*

–0.041

–0.016

1

–.219** 1

LnSOM = Ln-transformed soil organic matter; NDVI = Normalized Difference Vegetation Index; DIST = the nearest distance from sample locations to river; COSA = cosine value of aspect of slope; FMI = ferrous minerals index; AWMSI = area-weighted mean shape index calculated from the land use patches. * = correlation is significant at the 0.05 level (2-tailed); ** = correlation is significant at the 0.01 level (2-tailed). The 2-tailed test is a statistical test used in inference, in which a given statistical hypothesis, H0 (the null hypothesis: there is no relationship between variables here), will be rejected when the value of the test statistic is either sufficiently small or sufficiently large. The test is named after the “tail” of data under the far left and far right of a bell-shaped normal data distribution, or bell curve. The numbers 0.05 and 0.01 represent significant and very significant levels, respectively. a



LnSOM = 2.988 + 0.02elevation – 0.453FMI – 0.008slope + 0.482NDVI (13)

Note that based on the results of the stepwise regressions, the regression coefficient of DIST is zero. SOM Interpolated by RK and GWR Table 4 lists the fitted parameters of the variogram models for the Ln-transformed SOM and its residuals. Parameters of the variogram models include the following information: (1) nugget (C0), standing for the level of random variation within the data; (2) sill (C0 + C), representing the total magnitude of spatial variability; (3) range, describing the spatial dependence of the variability; (4) ratio C/(C0 + C), reflecting the proportion of spatially structured variance (C) in the total spatial variability (C0 + C); the larger the value, the higher the spatial structure dependence in SOM data (Cambardella et al., 1994); (5) R2, indicating how well the model fits the experimental variogram data although it is not as sensitive or robust as the Residual Sums of Squares (RSS) value for best-fit calculations; (6) RSS, showing an exact measure of how well the model fits the experimental variogram data; the lower the RSS value, the better the model fits. LnSOM and its residuals show a clear spatial dependence (Table 4). Both variogram models have approximately the same form and nugget but the residual variogram

spatial distribution of soil organic matter



925

Table 3. Results of the Stepwise Linear Regression Analysis Using Seven Independent Variables

a

Change statistics

R2

Adjusted R2

Std. error of the estimate

R2 change

F change

Sig. F change

.530

.528

.2736

.530

265.6

.000

Modelsa

b

.580

.576

.2591

.050

28.2

.000

c

.592

.586

.2560

.012

6.66

.010

d

.600

.593

.2538

.009

5.01

.026

e

.609

.600

.2517

.009

5.07

.025

a. Predictors: (Constant), ELEVATION. b. Predictors: (Constant), ELEVATION, FMI. c. Predictors: (Constant), ELEVATION, FMI, SLOPE. d. Predictors: (Constant), ELEVATION, FMI, SLOPE, DIST. e. Predictors: (Constant), ELEVATION, FMI, SLOPE, DIST, NDVI. Dependent Variable: LnSOM (i.e., Ln-transformed SOM).

Table 4. Parameters of the Omni-directional Exponential Semivariogram Models for the Ln-transformed SOM and its Residuals Variogram model

C0 (nugget)

C0+C (sill)

C/C0+C

LnSOM

Exponential

0.0001

0.120

0.999

10,260 0.938

4.62e-04

Residuals

Exponential

0.0001

0.064

0.999

 6,270 0.944

8.69e-05

Range (m)

R2

RSS

model has a somewhat smaller sill and range (Fig. 2), which is often found in practice (Hengl et al., 2004). We interpolated the spatial distribution of SOM using GWR with a variable bandwidth (spatially adaptive kernel) that adapts for the density of sample data at each regression location. The optimal kernel size for our study was determined through an interactive statistical optimization process to minimize the AIC. In addition to providing estimates of regression coefficients, t-statistics, and goodness-of-fit for each location, the ArcGIS software also provides several statistical tests to determine whether the GWR model is more useful than the MLR model. The results of the statistical tests show that the AIC value for the GWR (1103.8) is far lower than that for MLR (2233.9). This indicates that the local model provides a better fit to the sample data even after accounting for differences in degrees of freedom. The adjusted R2 generated by the GWR (0.9262) is much higher than that by MLR (0.5688). This means that GWR has a large improvement in explained variance. Figure 3 shows the spatial distributions of SOM interpolated using RK and GWR. It can be seen that the prediction maps generated by both methods show a similar spatial pattern or trend: high values of SOM usually are concentrated in high mountains where little human disturbance occurs while lower values of SOM mostly occur in dry land, paddy land, or other land at low elevation where soil is frequently disturbed by human activities. In addition, the minimum and maximum values predicted by both

926

wang et al.

Fig. 2. Variograms of LnSOM (left graph) and its residuals (right graph) fitted using exponential models.

Fig. 3. Maps of SOM predicted by RK (left) and GWR (right).

RK and GWR are beyond the minimum and maximum values of the observed sample data (Fig. 3). This indicates that both predicted maps reflect changes in elevation, slope, and other environmental factors, and generate much detailed spatial information about the distribution of SOM. An apparent difference is that the SOM map generated by RK shows more variations in the flat area of the floodplain than that generated by GWR. This should be related to the fact that RK incorporates the autocorrelation of sample data residuals whereas the GWR method used here did not consider autocorrelation, because the environmental variables (i.e., ELEVATION, FMI, SLOPE, and NDVI) should not change much in such a flat area. This further means RK is more sensitive to the spatial variation of SOM sample data (particularly outliers) than GWR is. Another difference is that the predicted SOM values on the low hills in the southern region are apparently higher, with clearer detail in the GWR map than in the RK map.

spatial distribution of soil organic matter



927

Table 5. Comparison of the Performances of RK and GWR using ME, RMSE, maximum (– and +) Errors of Predictions and Their Coefficients of Regression (adjusted R2) with Validation Data Maximum (–) error (g/kg)

Maximum (+) error (g/kg)

ME (g/kg)

RMSE (g/kg)

Adjusted R2

RK

–25.8

14.8

–3.6377

7.576

0.699

GWR

 –5.4

 6.3

 0.6058

2.478

0.909

This means GWR is more sensitive to the spatial variation of environmental variables than RK is. Comparison of the General Prediction Accuracies of RK and GWR To further assess the performances of RK and GWR, the 115 validation points were used for the purpose of testing the prediction accuracies of the two methods. Table 5 shows the comparison results. If the prediction is unbiased, the ME and RMSE will be expected to be zero. Thus, for an accurate prediction model, their absolute values should be as small as possible. It can be seen from Table 5 that the maximum errors generated by RK are much larger in terms of their absolute values than those generated by GWR. Both ME and RMSE produced by GWR have smaller absolute values than those generated by RK. These statistical values indicate that the prediction accuracy of GWR is higher than that of RK. Note that the negative ME value for RK illustrates that the method underestimated SOM in space. A similar conclusion can also be drawn from the error distribution at validation sampling sites (Fig. 4), which clearly shows that at some locations large errors were produced by the RK method. In addition, the adjusted R2 values between the observed SOM data and the predicted data at the validation sites also indicate that the GWR performed better (adjusted R2 = 0.909) than did RK (adjusted R2 = 0.699). DISCUSSION It was noticed in this study that RK and GWR produced some estimated values beyond the value range of the observed data. For example, the predicted maximum value is 156.8 g/kg for RK and 174.2 g/kg for GWR; both values are far larger than the observed maximum value of 52.3 g/kg. This issue may come from the process of back transformation of the estimated Ln-transformed data (Cho et al., 2009). When the observed data were transformed by a logarithmic function for fitting to a normal distribution, the back transformation may generate extreme values. To verify this fact, we also tested the prediction results using non-transformed original data for both RK and GWR. The results indicate that the maximum estimated values, 63.4 g/kg by RK and 63.8 g/kg by GWR, are close to the maximum observed value. As a best linear predictor, OK has its own characteristics: it considers the autocorrelation of sample data, the predicted value is identical to the observed value if the predicted point occurs at a sampling site, and the interpolation accuracy is higher near

928

wang et al.

Fig. 4. Errors of predictions by RK and GWR at the 115 validation sample sites.

the sampling point than farther away from the sampling point. While RK uses ordinary linear regressions to estimate the trend component, it partially inherits the characteristics of OK, which is used for estimating the residuals at unsampled locations. On this point, GWR is different because it is based on local regression and usually does not consider the autocorrelation of sample data; thus the values predicted by GWR are not always the same as the observed values on the sampling sites. GWR prediction mainly depends on the regression coefficients established by its neighboring SOM data and environmental factors. Because of this the map interpolated by GWR is more continuous in transition and less affected by sample values, particularly extreme values (i.e., outliers), than that created by RK, and it also inherits more closely the spatial variation characteristics of the incorporated environmental factors. The spatial kernel function and the bandwidth used in the model fitting process have an impact on the estimation of the GWR coefficients. The bandwidth is the radius of the search circle around each point being estimated and controls the distance decay in the weighting function (Guo et al., 2008). Changing the spatial kernel function or the bandwidth may change the coefficient estimates. Therefore, in this study we used an adaptive kernel function to reduce the limited data problem in some areas and we selected the optimal bandwidth by minimizing the AIC (Koutsias et al., 2010). From this aspect, the number of samples is not an absolutely crucial factor in this study. As long as there are enough observation points, which can represent most of the explanatory variables in space, they may meet the needs of the regressions. It is expected that land use may have a good relationship with SOM (e.g., Lettens et al., 2004), but the factor was removed in this study in the process of the stepwise regression. We used AWMSI as an indicator for land use. However, this case study did not show AWMSI to be a good index reflecting the impacts of land use on the spatial distribution of SOM. Other indices may be tested in future research to determine whether they more properly represent the impact of land use. In addition, most of the sampling points in this study are located in cultivated lands (especially in paddy soils), and the samples located in forest lands account for only a small percentage. The asymmetrical distribution of the sampling points may lead to a decrease in the correlation coefficients between SOM and environmental variables, and has a further impact on



spatial distribution of soil organic matter

929

prediction accuracy. Hence, more samples in complex environments or better sampling designs are needed in order to improve the estimation accuracy of GWR. CONCLUSIONS The spatial distribution maps of SOM interpolated by RK and GWR and the validation analysis show that there are some differences among the predicted results by the different methods. The interpolation maps using GWR and RK both capture many details. However, the map interpolated by GWR shows higher global accuracy, and also appears to be more realistic in depicting the details of the SOM patterns impacted by environmental factors in nature. In general, GWR is more sensitive to the spatial variation of environmental variables, whereas RK is more sensitive to the spatial variation of SOM sample data, particularly outliers. Although RK, by integrating the merits of OK and MLR, takes both spatial autocorrelation and auxiliary environmental variables into consideration, its model is still constructed on the basis of the spatial stationary assumption. Thus it does not deal with non-stationary spatial relationships of variables within the study area. The GWR approach, however, overcomes the limitation to some extent by using non-stationary regression models, which have different parameters at different locations. This may be one important reason why GWR produced the more realistic results of SOM with higher global accuracy in this study. In summary, the following conclusions may be made from this study: (1) Similar to the often-used RK in soil science, GWR is able to incorporate a variety of multiple auxiliary environmental factors into its modeling process. (2) The performance of GWR for estimating the spatial distribution and variation of SOM in the study area is good, and even better than that of RK in terms of prediction accuracy and realistic patterns. This result agrees with Mishra et al. (2010). Based on these considerations, we think RK and GWR each may have their respective applicable fields. Regression kriging may be more suitable for spatial predictions in relatively uniform environments, especially those suitable for gathering strongly autocorrelated data via regular grid sampling. Conversely, GWR is more effective in spatial predictions involving complex environments and spatially varied correlation relationships between a dependent variable and multiple explanatory variables. In addition, compared to RK, GWR may have higher requirements for the density of sample data for effectively estimating local parameters. ACKNOWLEDGMENTS We thank the reviewers very much for constructive comments that proved useful in revision of the manuscript. We gratefully acknowledge support for this research from the National Natural Science Foundation of China (No. 41271232), Natural Science Foundation of Fujian Province, China (Grant No. 2012J01179), and Science and Technological Project of the Educational Commission of Fujian Province, China (Grant No. JA11202). REFERENCES Aref, S., and M. M. Wander, 1998, “Long-Term Trends of Corn Yield and Soil Organic Matter in Different Crop Sequences and Soil Fertility Treatments on the Morrow Plot,” Advances in Agronomy, 62:153–197.

930

wang et al.

Barnes, C. W., Kinkel, L. L., and J. V. Groth, 2005, “Spatial and Temporal Dynamics of Puccinia andropogonis on Comandra umbellata and Andropogon gerardii in a Native Prairie,” Canadian Journal of Botany, 83:1159–1173. Baxter, S. and M. Oliver, 2005, “The Spatial Prediction of Soil Mineral N and Potentially Available N Using Elevation,” Geoderma, 128:325–339. Brunsdon, C., Fotheringham, S., and M. Charlton, 1998, “Spatial Nonstationarity and Autoregressive Models,” Environment and Planning A, 30:1905–1928. Cambardella, C. A., Moorman, T. B., Novak, J. M., Parkin, T. B., Karlen, D. L., Turco, R. F., and A. E. Konopka, 1994, “Fieldscale Variability of Soil Properties in Central Iowa Soils,” Soil Science Society of America Journal, 58:1501–1511. Cho, S., Lambert, D. M., Kim, S. G., and S. Jung, 2009, “Extreme Coefficients in Geographically Weighted Regression and Their Effects on Mapping,” GIScience & Remote Sensing, 46(3):273–288. Davidson, E. A., Trumbore, S., and R. Amundson, 2000, “Biogeochemistry: Soil Warming and Organic Carbon Content,” Nature, 408:789–790. Eldeiry, A. and L. A. Garcia, 2009, “Comparison of Regression Kriging and Cokriging Techniques to Estimate Soil Salinity Using Landsat Images,” paper presented at The 29th Annual Hydrology Days, Fort Collins, CO, March 25–27, 2009. Eswaran, H., Berg, E. V. D., and P. Reich, 1993, “Organic Carbon in Soil of the World,” Soil Science Society of America Journal, 57:192–194. Fotheringham, A. S., Brunsdon, C. A., and M. E. Charlton, 2002, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, New York, NY: John Wiley & Sons. Fotheringham, A. S., Charlton, M. E., and C. Brundson, 1998, “Geographically Weighted Regression: A Natural Evolution of the Expansion Method for Spatial Data,” Environment and Planning A, 30:1905–1927. Gao, J. B. and S. C. Li, 2011, “Detecting Spatially Non-stationary and Scale-Dependent Relationships between Urban Landscape Fragmentation and Related Factors Using Geographically Weighted Regression,” Applied Geography, 31:292–302. Goovaerts, P., 1998, “Ordinary Cokriging Revisited,” Mathematical Geology, 30:21–42. Grimm, R., Behrens, T., Märker, M., and H. Elsenbeer, 2008, “Soil Organic Carbon Concentrations and Stocks on Barro Colorado Island—Digital Soil Mapping Using Random Forests Analysis,” Geoderma, 146:102–113. Guo, L., Ma, Z., and L. Zhang, 2008, “Comparison of Bandwidth Selection in Application of Geographically Weighted Regression: a Case Study,” Canadian Journal of Forest Research, 38:2526–2534. Hengl, T., Heuvelink, G., and A. Stein, 2004, “A Generic Framework for Spatial Prediction of Soil Variables Based on Regression Kriging,” Geoderma, 122(1–2):75–93. Jaimes, N. B. P., Sendra, J. B., Delgado, M. G., and R. F. Plata, 2010, “Exploring the Driving Forces Behind Deforestation in the State of Mexico (Mexico) Using Geographically Weighted Regression,” Applied Geography, 30:576–591. Jiang, J. R., Zhong, S. L., Yuan, Z. P., Xiao, R. L., and Y. Z. Zhang, 1987, “Effect of Different Cropping System and Underwater Level on Content of Soil Organic Matter and Ferric Oxide in Paddy Soil,” Journal of Hunan Agricultural University (Natural Sciences), 3:25–30.



spatial distribution of soil organic matter

931

Kamarianakis, Y., Feidas, H., Kokolatos, G., Chrysoulakis, N., and V. Karatzias, 2008, “Evaluating Remotely Sensed Rainfall Estimates Using Nonlinear Mixed Models and Geographically Weighted Regression,” Environmental Modelling & Software, 23:1438–1447. Knotters, M., Brus, D. J., and J. H. Oude Voshaar, 1995, “A Comparison of Kriging, Cokriging and Kriging Combined with Regression for Spatial Interpolation of Horizon Depth with Censored Observations,” Geoderma, 67:227–246. Koutsias, N., Martínez-Fernández, J., and B. Allgöwer, 2010, “Do Factors Causing Wildfires Vary in Space? Evidence from Geographically Weighted Regression,” GIScience & Remote Sensing, 47(2):221–240. Lettens, S., van Orshoven, J., van Wesemael, B., and B. Muys, 2004, “Soil Organic and Inorganic Carbon Contents of Landscape Units in Belgium Derived Using Data from 1950 to 1970,” Soil Use and Management, 20:40–47. Li, Z., Zhang, Y. K., Schilling, K., and M. Skopec, 2006, “Cokriging Estimation of Daily Suspended Sediment Loads,” Journal of Hydrology, 327:389–398. Lian, G., Guo, X. D., Fu, B. J., and C. X. Hu, 2006, “Spatial Variability and Prediction of Soil Organic Matter at County Scale on the Loess Plateau,” Progress in Geography, 25:112–123 (in Chinese). Matheron, G., 1971, The Theory of Regionalized Variables and its Applications, Fontainebleu, France: CMMF, Les Cahiers du center de Morphologie Mathematique de Fontainebleau 5. McKenzie, N. J., 1999, “Spatial Prediction of Soil Properties Using Environmental Correlation,” Geoderma, 89:67–94. Mishra, U., Lal, R., Liu, D., and M. van Meirvenne, “Predicting the Spatial Variation of the Soil Organic Carbon Pool at a Regional Scale,” Soil Science Society of America Journal, 74:906–914. Mitchell, M. and F. Yuan, 2010, “Assessing Forest Fire and Vegetation Recovery in the Black Hills, South Dakota,” GIScience & Remote Sensing, 47(2):276–299. Myers, D. E., 1982, “Matrix Formulation of Co-Kriging,” Mathematical Geology, 14:249–257. Odeh, I. O. A., McBratney, A. B., and D. J. Chittleborough, 1995, “Further Results on Prediction of Soil Properties from Terrain Attributes: Heterotopic Cokriging and Regression-Kriging,” Geoderma, 67:215–226. Pang, S., Li, T. X., Wang, Y. D., and H. Y. Yu, 2009, “Spatial Interpolation and Sample Size Optimization for Soil Copper (Cu) Investigation in Cropland Soil at County Scale Using Cokriging,” Agricultural Sciences in China, 8:1369–1377. Pavel, A. P. and M. Kappas, 2008, “Reducing Uncertainty in Modeling the NDVIPrecipitation Relationship: A Comparative Study Using Global and Local Regression Techniques,” GIScience & Remote Sensing, 45(1):47–67. Reeves, D. W., 1997, “The Role of Soil Organic Matter in Maintaining Soil Quality in Continuous Cropping Systems,” Soil Tillage Research, 43:131–167. Robinson, T. P. and G. Metternicht, 2006, “Testing the Performance of Spatial Interpolation Techniquesfor Mapping Soil Properties,” Computers and Electronics in Agriculture, 50:97–108. Sun, W., Minasny, B., and A. McBratney, 2012, “Analysis and Prediction of Soil Properties Using Local Regression-Kriging,” Geoderma, 171–172:16–23.

932

wang et al.

Tu, J., and Z. G. Xia, 2008, “Examining Spatially Varying Relationships Between Land Use and Water Quality Using Geographically Weighted Regression I: Model Design and Evaluation,” The Science of the Total Environment, 407:358–378. Wang, K., Wang, H. J., Shi, X. Z., Weindorf, D. C., Yu, D. S., Liang, Y., and D. M. Shi, 2009, “Landscape Analysis of Dynamic Soil Erosion in Subtropical China: A Case Study in Xingguo County, Jiangxi Province,” Soil and Tillage Research, 105:313–321. Yu, D., Peterson, N. A., and R. J. Reid, 2009, “Exploring the Impact of Non-normality on Spatial Non-stationarity in Geographically Weighted Regression Analyses: Tobacco Outlet Density in New Jersey,” GIScience & Remote Sensing, 46(3):329– 346. Zech, W., Senesi, N., Guggenberger, G., Kaiser, K., Lehmann, J., Miano, T. M., Miltner, A., and G. Schroth, 1997, “Factors Controlling Humification and Mineralization of Soil Organic Matter in the Tropics,” Geoderma, 79:117–161.