Environ Monit Assess (2009) 158:419–431 DOI 10.1007/s10661-008-0594-x

Integrating classification and regression tree (CART) with GIS for assessment of heavy metals pollution Wei Cheng · Xiuying Zhang · Ke Wang · Xuelong Dai

Received: 6 May 2008 / Accepted: 29 September 2008 / Published online: 13 November 2008 © Springer Science + Business Media B.V. 2008

Abstract An assessment of heavy-metals pollution (AHP) system, which integrates the classification and regression tree (CART) model with geographical information systems (GIS), was developed to assess heavy metals pollution in Fuyang, Zhejiang, China. The integration of the decision tree model with ArcGIS Engine 9 using a COM implementation in Microsoft® Visual Basic 6.0 provided an approach for assessing the spatial distribution of soil Zn content with high predictive accuracy. CART assigned samples to the correct Zn concentration class with an accuracy of nearly 90%. This is a large improvement over the ordinary Kriging method, because the spatial autocorrelation in the study area has been severely disrupted by human activities. The approach can also be used to investigate the inter-relationships between heavy metals pollution and environmental and anthropogenic variables. Moreover, the research presents model predictions over space for further applications and investigations.

Keywords ArcGIS Engine 9 · CART · Fuyang · GIS · Heavy metals pollution

W. Cheng · K. Wang (B) Institute of Remote Sensing and Information System Application, Zhejiang University, Hangzhou, 310029, China e-mail: [email protected]
X. Zhang International Institute for Earth System Science, Nanjing University, Nanjing, 210093, China
X. Dai Agriculture Bureau of Fuyang, Fuyang, 311409, China
W. Cheng · K. Wang Key Laboratory of Agricultural Remote Sensing and Information System, Zhejiang Province, China

Introduction

Alongside air, water, and the biota, soil is of central significance in ecosystem research because it is where many kinds of interactions take place between minerals, air, water, and the living environment (Bloemena 1995). The accumulation of heavy metals in agricultural soils is of increasing concern because of food safety issues, potential health risks, and detrimental effects on soil ecosystems (Zheng et al. 2007). Heavy metals in soil originate from weathering of parent material, from numerous external contaminating sources, or from both. In unpolluted regions, parent materials are the primary source of trace elements (Yaya et al. 2008). The main anthropogenic sources of heavy metals are industrial point sources, such as present and former mining activities, foundries, and smelters, together with diffuse sources (Al-Khashman 2004).

Society must consider various aspects to provide a sustainable environment, including soil free of heavy metal pollution. The first is to identify environments (or areas) in which anthropogenic loading of heavy metals puts ecosystems and their inhabitants at health risk (Romic et al. 2007). One challenge in predictive modeling of heavy metal contents that vary with anthropogenic activities is to achieve predictions over space with acceptable accuracy. To understand the spatial distribution of soil heavy metal contents and identify pollution sources, multivariate analysis (Emmerson et al. 1997; Yaya et al. 2008; Micó et al. 2006), geostatistical methods (Salgueiro et al. 2008; McGrath et al. 2004), and spatial analysis (Barrera-Bassols et al. 2006; Hillyer et al. 2006; Cho and Newman 2005) have been developed and widely applied in soil systems. However, these methods cannot efficiently simulate the spatial distribution of heavy metals that are strongly influenced by human activities, because they first require the spatial autocorrelation between properties at different locations to be quantified so that the information from samples can be weighted into an estimator of the values at unsampled locations (Yao 1999).

As a new modeling approach, the decision tree model has been shown to have high predictive accuracy (Zhang et al. 2006). Geographic information systems (GIS) have increasingly become a valuable management tool, providing an effective infrastructure for managing, analyzing, and visualizing disparate datasets related to soils, topography, land use, land cover, and climate (Liao and Tim 1997; Miller et al. 2004). Integrating a decision tree approach with a GIS therefore offers a potential solution to this challenge (Zhang et al. 2006).
Coupling between models and GIS has been described as loose, tight, or embedded (Jagadeesh et al. 2006). In most cases the GIS is used to process and display the inputs and outputs of an external model that is started either automatically via scripting or manually (i.e. loose coupling). This allows flexibility in the modeling language and does not require the model to be rewritten, apart from simple modifications to process the model inputs and outputs (Tischler et al. 2007). However, the manual process is labor intensive and not conducive to performing the various analyses (Crossman et al. 2007).

There are two types of embedded coupling. One adds a simple GIS to a complex modeling system to display results and provide interactive control; the other writes the model using the analytical engine of the GIS. Although simple models can be developed in the analytical engine of a GIS, testing and debugging source code in this environment is difficult, and because most GIS analytical engines are interpreted rather than compiled, execution is very slow (Jagadeesh et al. 2006).

Tight coupling is usually developed within the GIS and is extremely useful for modelers who depend on the various functions of the GIS. One approach to tight coupling is to integrate both models and GIS into high-level programming languages (VC++, VB, Delphi, Power Builder, etc.), with GIS functionalities accessed through GIS components. The programming environment provides facilities for testing and debugging models during development (Burrough 1996; Wegener 2000).

This study integrated the classification and regression tree (CART) model and ArcGIS Engine 9 using a COM implementation in Visual Basic 6.0 for the assessment of heavy metals pollution. This paper presents a case study assessing the spatial distribution of soil Zn in a severely polluted area. To do this, we used an interface called assessment of heavy-metals pollution (AHP), a tool that integrates geographical information systems with the CART model.
The objective of this paper is to show the performance of AHP in a case study of soil zinc pollution in Fuyang County, which is assumed to be representative of counties in the Yangtze Delta, China, where economic development has been unprecedentedly rapid since the economic reform in 1978. Over the same period, the area has been heavily contaminated by industrial wastes, mining, vehicular emissions, and other sources.

Methodology

Classification and regression tree

Classification and regression tree, a statistical procedure introduced by Breiman et al. (1984), is primarily used as a classification tool, where the objective is to classify an object into two or more populations (Tian-Shyug et al. 2006). The underlying principle behind CART is to identify increasingly homogeneous configurations of predictive variables that should lead to increasingly homogeneous configurations of target variables. Different types of predictive variables (categorical and continuous) can be integrated into the model (Selle et al. 2007). This machine-learning, probabilistic, non-parametric decision-tree method has been extensively exploited for vegetation mapping, ecological modeling, and remote sensing studies such as land use classification based on threshold values of various band data (Bou et al. 2008). A major advantage of CART is that assumptions required for the appropriate use of parametric statistics, such as a Gaussian distribution of predictor variables, do not need to be satisfied (Rothwell et al. 2008). Moreover, it uses an effective algorithm to cope with missing data. It has been demonstrated that CART can still perform reasonably well when the missing data do not exceed 5% (Yong 2006).

The CART methodology consists of three steps. First, an overfitting tree is grown by recursive partitioning of the data. In the second step, called tree pruning, the sequence of nodes that should be eliminated to obtain a set of smaller trees is found. The last step is to select an optimal tree from the pruned trees.

Tree building

The principle behind tree building is to recursively partition the target variable to maximize "purity" in the two child nodes. The procedure checks all possible input variables, as well as all possible values of each input variable, to find the threshold that leads to the greatest improvement in the purity score of the resultant nodes.
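The exhaustive threshold search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AHP implementation (which the paper builds in Visual Basic 6.0): the function names `gini` and `best_threshold` are hypothetical, the Gini index is assumed as the purity score, candidate thresholds are assumed to be midpoints between consecutive distinct values, and the toy pH values and class labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Scan one continuous predictor and return (threshold, impurity_drop).

    Assumption: candidates are midpoints between consecutive distinct values.
    Returns (None, 0.0) when no split reduces the impurity.
    """
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Drop of impurity: parent impurity minus the weighted child impurities.
        drop = parent - (len(left) / n * gini(left) + len(right) / n * gini(right))
        if drop > best[1]:
            best = (threshold, drop)
    return best


# Invented toy data: four pH values with their Zn classes.
print(best_threshold([5.0, 5.5, 6.5, 7.0], ["G1", "G1", "G3", "G3"]))  # -> (6.0, 0.5)
```

A full tree builder would repeat this scan over every predictor at every node and recurse on the two child subsets until a stopping rule (for example a minimum node size) is met.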
The optimal splitting value $s^*$ at node $t$ is chosen from the set of all splitting candidates $S$ so that the drop in impurity is maximized:

$$\Delta E(s^*, t) = \max_{s \in S} \Delta E(s, t),$$

where $\Delta E(s, t)$ is the drop in impurity given by Tran et al. (2008):

$$\Delta E(s, t) = E(t) - \left[ p_L E(t_L) + p_R E(t_R) \right].$$

Here $E(t)$ is the impurity of node $t$, $p_L$ and $p_R$ are the proportions of objects going to the left ($t_L$) or right ($t_R$) child nodes, and $E(t_L)$ and $E(t_R)$ are their impurities.

Several impurity measures have been proposed as splitting criteria. The information or entropy index forms groups within which diversity is minimized, and the entropy impurity of the node is determined as (Caetano et al. 2005):

$$E(t) = -\sum_{j=1}^{k} P_j(t) \ln P_j(t),$$

where $E(t)$ is the entropy impurity of node $t$ and $P_j(t)$ is the fraction of objects in node $t$ that belongs to the $j$th class of the $k$ classes present in the dataset.

Contrary to the entropy index, the Gini index searches for the largest category in the dataset and strives to isolate it from the other categories. The impurity is then determined as (Caetano et al. 2005):

$$i(t) = 1 - \sum_{j=1}^{k} \left[ P_j(t) \right]^2,$$

where $i(t)$ is the impurity of node $t$ and $P_j(t)$ is the fraction of objects in node $t$ that belongs to the $j$th class of the $k$ classes present in the dataset.

The twoing rule can also be used as a splitting criterion, but it is not related to an impurity measure. The algorithm used here for evaluating the quality of the constructed tree is the Gini splitting method, which is the default. The Gini method has been considered slightly better than the entropy tree-fitting algorithm (Breiman 2001).

Tree pruning

The tree obtained in the preceding building phase is biased towards the training data set and may have a large number of branches, which substantially increase the tree's complexity without yielding higher accuracy if they result from noisy data. The method for pruning in CART is based on the principle of minimum cost-complexity, in which both tree accuracy and complexity are considered. The cost-complexity measure $R_\alpha(T)$ is defined for each subtree $T$ as follows (Breiman et al. 1984):

$$R_\alpha(T) = R(T) + \alpha |\tilde{T}|,$$

where $R(T)$ is the average within-node sum of squares, $|\tilde{T}|$ is the tree complexity, defined as the total number of nodes of the subtree, and $\alpha$ is the complexity parameter, which is a penalty for each additional terminal node. During the pruning procedure, $\alpha$ is gradually increased from 0 to 1, and for each value of $\alpha$ the tree that minimizes $R_\alpha(T)$ is selected. For $\alpha$ equal to zero, $R_\alpha(T)$ is minimized by the maximal tree. By gradually increasing $\alpha$, a series of trees with decreasing complexity is obtained (Espa et al. 2006).

Optimal tree selection

The principle behind selecting the optimal tree is to find the tree that performs best with respect to a measure of misclassification cost on the testing dataset (or an independent dataset), so that the information in the learning dataset is not overfitted.

AHP (assessment of heavy-metals pollution) system

The AHP system in this study uses the ArcGIS software as the development platform, mainly because of its open development environment. ArcObjects is a module that contains the ArcGIS software component library used to customize the functionalities of the software (Zeiler 2001). The module can be accessed and launched from Microsoft® Visual Basic through the component object model (COM) (Loo Becky 2006). ArcGIS Engine 9 is a simple, application-neutral programming environment for ArcObjects. It is a complete library of embeddable GIS components for developers to build custom applications. Using ArcGIS Engine, we can build focused custom applications that deliver advanced GIS systems to many users, and we can use the ArcGIS Engine developer kit to build stand-alone applications. Graphical user interface applications make use of the extensive ArcGIS Controls exposed in the developer kit. These controls include everything needed to build a sophisticated front-end application (Euan et al. 2004).

The Microsoft Component Object Model is a platform-independent, distributed, object-oriented system for creating binary software components that can interact. COM is an architecture and infrastructure for building fast, robust, and extensible component-based software. At its lowest level, COM is merely a language-independent binary-level standard defining how software components within a single address space can rendezvous and interact with each other efficiently, while retaining a sufficient degree of separation between these components so that they can be developed and evolved independently. It is therefore possible to extend ArcObjects by writing COM components in any COM-compliant development language. Every part of the ArcObjects architecture can be extended in exactly the same way as ESRI developers do (Zeiler 2001).

Implementation of the CART model (Fig. 1), however, requires integration of ArcGIS Engine 9 and multiple databases for development of the model input parameters and for analysis and visualization of the simulation results. The developed interface performs several activities: extraction of the values of fields and records in the attribute table of the feature class, acquisition of the user input information for heavy metals estimation, and finally storage of the results in a designated folder of the computing system. The data are usually divided into two subsets, one for learning (or training) and the other for testing (or validation).
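Returning to the pruning step of the Methodology, the cost-complexity criterion can be made concrete with a small numeric sketch. It assumes the candidate subtrees and their error values are already known; the names `cost_complexity` and `select_subtree` and the three candidate trees are invented for illustration and do not come from the study.

```python
def cost_complexity(risk, n_nodes, alpha):
    """Cost-complexity measure: R_alpha(T) = R(T) + alpha * |T|."""
    return risk + alpha * n_nodes


def select_subtree(subtrees, alpha):
    """Return the (label, risk, n_nodes) tuple that minimizes R_alpha(T)."""
    return min(subtrees, key=lambda tree: cost_complexity(tree[1], tree[2], alpha))


# Invented candidates: (label, R(T), complexity |T|).
CANDIDATES = [("maximal", 0.10, 15), ("pruned", 0.14, 7), ("root", 0.30, 1)]

# Larger alpha penalizes complexity more, so smaller subtrees are selected.
for alpha in (0.0, 0.01, 0.05):
    print(alpha, select_subtree(CANDIDATES, alpha)[0])
```

With these made-up numbers the selection moves from the maximal tree at alpha = 0 to the intermediate subtree and finally to the root-only tree, which mirrors the series of trees with decreasing complexity described for the pruning procedure. Choosing among those trees by their misclassification on the testing subset is then the optimal tree selection step.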

Fig. 1 Flowchart of the assessment of heavy-metals pollution (AHP) system

Study area and materials

Fuyang County, situated in the north of Zhejiang Province, China, near the southwestern periphery of Hangzhou city, was selected as the study area. The county is located at 119°25′00″ to 120°19′30″ E, 29°44′45″ to 30°11′58.5″ N, and covers an area of 1,831 km². In the past 20 years, economic development in Fuyang has grown rapidly and continuously, and an industrial system with Fuyang characteristics has gradually taken shape, with machinery, telecommunication equipment and electronics, paper manufacturing, and building materials as core industries.

A total of 303 soil samples were collected in March 2005 from locations chosen to give a uniform spatial distribution of samples and to cover the soil types in the study area. The distribution of the 303 soil sampling points is presented in Fig. 2. All samples were taken at a depth of 0–20 cm and air-dried, and stones and coarse plant roots or residues were removed. The samples were thoroughly mixed and ground to pass through a 0.15 mm sieve, then stored in polythene bags for chemical analysis. Zinc was determined by digesting the soil sample with a mixture of nitric acid (HNO₃) and perchloric acid (HClO₄), followed by zinc measurement in the digest by atomic absorption spectrophotometry. Soil pH was determined in a 1:2.5 soil–water ratio, and organic matter by wet oxidation at 180 °C with a mixture of potassium dichromate and sulfuric acid (Agricultural Chemistry Committee of China 1983). The land use map, with information on soil and industrial plant types, was provided by the Bureau of Land and Resource of Fuyang, and the traffic map by the Agriculture Bureau of Fuyang County.

Fig. 2 Location of study area and of sampling points

In our study area, the natural background soil Zn content is 90 mg/kg (Zhejiang Soil Survey Office 1994). The content index was then divided into six classes to indicate the level of Zn contamination (Fig. 3a).

Soil pH (see Table 1) was included in the model because it is strongly correlated with soil heavy metals content. A further reason is that soil pH is often more readily available from soil investigations than heavy metal data, and its values are relatively stable. Agricultural practices such as the use of manure or inorganic fertilizers can add heavy metals to soils, so the agricultural land use practice was also selected to estimate heavy metals content.

There are seven main agricultural land uses in Fuyang County: paddy field (PF), dry land (DL), vegetable land (VT), tea garden (TG), orchard (OR), woodland (WO), and wasteland (WL). The independent variable was named LandUse.

Different industrial plants have different impacts on soil zinc accumulation:
1. Existing and disused smelting factories (SM) often produce and stock significant quantities of smelting waste piles and soil heaps (Jiang et al. 2002)
2. Hardware machining (HM), especially factories related to zinc materials
3. Paper mills (PM), which influence zinc concentrations through paper mill sludge waste (Battaglia et al. 2007)
4. Chemical plants (CM), which influence zinc concentrations through the additives used
5. Others (OT), industrial plants without a relationship to heavy metals
6. NO, soil samples located outside the 500 m buffer zone of any industrial plant
The independent variable was named INType.

Roadway dust receives varying inputs of anthropogenic metals from a variety of mobile or stationary sources, such as vehicular traffic, industrial plants, power generation facilities, residential oil burning, waste incineration, and construction and demolition activities (Emanuela et al. 2006). To represent the influence of roads, soil samples were grouped by their location within the 100 m, 200 m, 300 m, 500 m, and 1,000 m buffer zones of the main roads, or outside the 1,000 m buffer zone. The independent variable was named RoadDist.

Fig. 3 Main views of the CART module

Table 1 Description of predictors used in CART analysis
Name         Description
pH           Soil pH value
LandUse      Agriculture land use, including PF, DL, WL, VT, TG, OR, WD
INType       Industrial type for soil samples located in the 500 m buffer zone: SM, HM, PM, OT (others), NO (there is no industrial plant)
RoadDist 1   Soil samples located in the 100 m buffer zone of the main roads
RoadDist 2   Soil samples located in the 200 m buffer zone of the main roads
RoadDist 3   Soil samples located in the 300 m buffer zone of the main roads
RoadDist 4   Soil samples located in the 500 m buffer zone of the main roads
RoadDist 5   Soil samples located in the 1,000 m buffer zone of the main roads
RoadDist 6   Soil samples located outside the 1,000 m buffer zone of the main roads

Table 2 Criteria to assign soil Zn content to pollution classes G1 to G6
Pollution grade   Zn content (mg/kg)   Pollution index of Zn (Pi)
1st grade (G1)    ≤ 90                 ≤ 1.0
2nd grade (G2)    90 < Zn ≤ 135        1.0 < Pi ≤ 1.5
3rd grade (G3)    135 < Zn ≤ 180       1.5 < Pi ≤ 2.0
4th grade (G4)    180 < Zn ≤ 225       2.0 < Pi ≤ 2.5
5th grade (G5)    225 < Zn ≤ 270       2.5 < Pi ≤ 3.0
6th grade (G6)    > 270                > 3.0
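The class boundaries in Tables 1 and 2 translate directly into lookup code. The sketch below is illustrative only: the function names are hypothetical, the pollution index is assumed to be the measured Zn content divided by the 90 mg/kg background (which reproduces the Zn thresholds of Table 2), and a sample lying exactly on a buffer boundary is assumed to belong to the nearer (smaller) buffer zone.

```python
import bisect

ZN_BACKGROUND = 90.0                         # mg/kg, natural background Zn in the study area
PI_UPPER_BOUNDS = [1.0, 1.5, 2.0, 2.5, 3.0]  # Table 2: upper index limits of G1..G5
ROAD_BUFFERS_M = [100, 200, 300, 500, 1000]  # Table 1: outer limits of RoadDist 1..5


def zn_grade(zn_mg_per_kg):
    """Map a measured Zn content to a pollution grade G1..G6 (Table 2)."""
    index = zn_mg_per_kg / ZN_BACKGROUND     # assumed definition of the pollution index
    return "G%d" % (bisect.bisect_left(PI_UPPER_BOUNDS, index) + 1)


def road_dist_class(distance_m):
    """Map a distance to the nearest main road to a RoadDist code 1..6 (Table 1)."""
    return bisect.bisect_left(ROAD_BUFFERS_M, distance_m) + 1


print(zn_grade(120.0), road_dist_class(250))  # -> G2 3
```

`bisect_left` keeps a value equal to a boundary in the lower class, which matches the "≤" upper limits in Table 2.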

A minimum node size of five, or 1% of the total number of records in the dataset, was applied in the CART module (Fig. 3b), since a simpler tree is easier to understand and faster to use and, more importantly, smaller trees provide greater predictive accuracy for unseen data (Bou et al. 2008). The maximum tree depth or maximum purity can also be specified in the module (Fig. 3b). An overfitting tree may perfectly classify the training data, but will most likely incur significant errors on the testing data and in real-world predictions (Yong 2006). Therefore, the data here are divided into two subsets, one for training and the other for testing (Fig. 3b). The training sample is used to split nodes, while the testing sample is used to compare the misclassification. One shortcoming of holding back a randomly selected subset of the data is that the data set must be large. When the data are limited, another method is needed, and one of the major innovations in CART is the option to use cross-validation (CV) to measure misclassification rates (Waheed et al. 2006).

In the process, users are given a data set of training records (or training data). Each record has a number of variables (Fig. 3b). There is one distinguished variable called the dependent or response variable (i.e., the distribution of soil heavy metals contents, here soil zinc content), and the remaining variables are referred to as the predictor variables (e.g. pH, land use, industrial type, and road buffer distance). Through a series of 'yes/no' questions concerning database fields (predictor variables), CART, which can also be described as a nonparametric, data-driven, rule-generating algorithm, automatically searches for important relationships and uncovers hidden structures (Bøhler et al. 2002). It identifies the variables with the highest correlation with soil zinc content by splitting the data set into the two most dissimilar groups. The splitting of the data set and tree development continue until the data in each group are sufficiently uniform. The method here partitions the data set into six discrete subgroups, based on the classification value of the dependent variable (here the soil zinc content) (Fig. 3a).

Fig. 4 Spatial distribution of soil Zn contents in relation to different main road buffer zones

Table 3 Confusion matrix for CART predictions of soil Zn classes Training data G1 G1 G2 G3 G4 G5 G6 Accuracy (%)
G2
Test data G3
103 11 2
78 3
1 18 1
2 87.29
1 1 93.98
90.00
G4
1 9 2 1 69.23
G5
G6
G1
G2
G3
18 1
1 10 1
1 –
G4
G5
G6
– 1 0.00
5 100.00
1 3 1 75.00
1 25 96.15
Total accuracy: 89.39%; Kappa coefficient: 0.8296 for training data
Total accuracy: 87.18%; Kappa coefficient: 0.8018 for test data
94.74
83.33
0.00
1 50.00

Table 4 Confusion matrix for Kriging predictions of soil Zn classes G1 G2 G3 G4 G5 G6 Accuracy (%)
G1
G2
2 2 1
51 23 7
G3
32 53 23 7 3 0.0000 0.6296 0.4492
G4
G5
G6
2 7 13 4 4 0.4333
6 1 4 5 0.2500
6 1 3 33 0.7674
Total accuracy: 41.79%; Kappa coefficient: 0.2584 for Kriging

The result of the CART analysis associated with each split in the decision tree is a rule involving several predictor variables. The rules are important in two ways. First, they are used to predict the values of the response variables. Second, they contain a wealth of information about the relationship between the response and the predictor variables and the interactions among the predictors (Yong 2006).
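To show what such a rule looks like in practice, the sketch below encodes one root-to-leaf path as a list of yes/no tests on the predictor fields. The thresholds, the field values, and the predicted class are invented purely for illustration and are not results from the study; `HYPOTHETICAL_RULE` and `matches` are hypothetical names.

```python
# One invented root-to-leaf path: each condition is a yes/no test on a predictor.
HYPOTHETICAL_RULE = {
    "conditions": [
        ("INType", lambda value: value == "SM"),   # within 500 m of a smelting factory
        ("RoadDist", lambda value: value <= 2),    # within the 200 m road buffer
        ("pH", lambda value: value < 6.0),         # made-up pH threshold
    ],
    "predicted_class": "G6",                       # made-up outcome
}


def matches(rule, sample):
    """Return True when the sample answers 'yes' to every test in the rule."""
    return all(test(sample[field]) for field, test in rule["conditions"])


sample = {"pH": 5.5, "LandUse": "PF", "INType": "SM", "RoadDist": 1}
if matches(HYPOTHETICAL_RULE, sample):
    print(HYPOTHETICAL_RULE["predicted_class"])
```

Read this way, a fitted tree is simply a set of mutually exclusive rules: each one predicts a class and, at the same time, documents which combination of predictors leads to it.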

Results and discussion

The integration has two primary advantages: (1) spatial representation is critical to environmental problem solving, but GIS currently lacks the predictive and related analytical capabilities necessary to examine complex problems; and (2) modeling tools typically lack sufficiently flexible GIS-like spatial analytic components and are often inaccessible to potential users less expert than their makers (Parks 1993). The developed AHP system seamlessly links ArcGIS Engine 9 and the CART model, automating the transfer of parameters and data and graphically displaying the analysis results. The AHP system also removes the margin for error intrinsic to any manual process.

The confusion matrices (Tables 3 and 4) show the relationship between measured and estimated Zn classes (class criteria in Table 2). Two standard criteria, total accuracy and the Kappa coefficient, were used to assess the prediction accuracy. The total accuracy is the number of correctly inferred Zn classes divided by the total number of samples (for training and test data respectively), and the Kappa coefficient uses all of the information in the confusion matrix, ranging from 0 to 1. As the values of the off-diagonal entries increase, the value of Kappa decreases (Tso and Mather 2001). The overall CART accuracy of assigning samples to the right Zn classes is 89.39% for the training data and 87.18% for the test data, and the corresponding Kappa coefficients are 0.8296 and 0.8018.

The samples used in CART were also used in Kriging to estimate Zn content spatially. Kriging estimates variable values at unknown locations from a semivariogram model and an appropriately sampled data set, using the semivariogram to quantify the spatial variation. Although normality may not be strictly required in Kriging, serious violations of normality, such as high skewness and outliers, can impair the simulation results (McGrath et al. 2004). The simulated semivariogram for the raw Zn data was a horizontal line, indicating that soil Zn content was strongly influenced by exterior factors. The Box-Cox transformation was therefore used to obtain normally distributed transformed soil Zn content. The experimental semivariogram suggested that the Box-Cox transformed Zn contents are best fitted by a Gaussian model dominated by a long-range structure. The range of this semivariogram is 9,508 m, and the determination coefficient (r²) is 0.72.

The total accuracy of assigning kriged estimates of Zn classes to measured values is 41.79%, and the corresponding Kappa coefficient is 0.2584. Unlike with CART, the Kriging misclassification errors are often not to adjacent classes but to classes several steps away. This type of error is more serious; for example, one of the G2 samples was misclassified four classes away as G6. Kriging clearly underperformed compared with CART: its errors can be several classes away, whereas CART errors are mostly one class away, which gives rise to a higher Kappa coefficient. The main reason for the higher accuracy of CART might be that Zn content in this study area is strongly influenced by human activities, leading to localized sharp variations and hotspots that are smoothed over by Kriging with a long-range variogram.

It has been noted that locations close to roads are severely polluted by heavy metals such as Pb, Zn, Cu, and Cd from traffic (Al-Khashman 2004) (see Fig. 4). Tire wear and corrosion of safety fences are the two main sources of traffic-related zinc (Blok 2005). The influence of industrial plants on Zn accumulation can reach from several meters to several kilometers from the point source, depending on the industry involved (see Fig. 5). Emissions from nearby point sources such as smelters, hardware machining, paper mills, and chemical plants had different influences on the soil Zn contents, with smelters being the most severe pollution sources.

Fig. 5 Spatial distribution of soil Zn contents in relation to different types of industrial plants

Admittedly, CART reduces the measurement scale of the raw data to a lower level in order to classify the target. However, decision makers and spatial planners require information on soil quality for different purposes: to locate areas suitable for organic (ecologically clean) farming and agro-tourism; to select sites suitable for conversion of agricultural to nonagricultural land, particularly for urbanization; to set up protection zones for groundwater pumped for drinking water; and to estimate costs of remediation of contaminated areas (Romic et al. 2007).
These classifications are therefore useful when detailed concentrations are not required. CART cannot, of course, replace Kriging for predicting heavy metals concentrations at unsampled points. The two methods each have their own advantages and disadvantages in simulating the spatial distribution of soil heavy metals concentrations.
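The two accuracy criteria used in this section can be computed from any square confusion matrix. The sketch below uses the standard definition of Cohen's Kappa (observed agreement corrected for chance agreement), which the paper names but does not spell out; the function names and the 2×2 toy matrix are invented for illustration and are not the values of Tables 3 or 4.

```python
def total_accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / n


def kappa(matrix):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented two-class example (rows and columns in the same class order).
TOY_MATRIX = [[45, 5],
              [10, 40]]
print(total_accuracy(TOY_MATRIX), round(kappa(TOY_MATRIX), 4))  # -> 0.85 0.7
```

Because Kappa subtracts the agreement expected by chance, it falls faster than total accuracy as off-diagonal counts grow, which is consistent with the gap between the CART and Kriging results reported above.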


Conclusions

Successful implementation of a decision tree model involves the integration of GIS, multiple databases, and visualization tools for extraction of the needed model input parameters and for analysis and visualization of the simulated results (Chansheng et al. 2001). The integration of the CART model with ArcGIS Engine 9 in this study provided an approach for assessing the spatial distribution of soil heavy metals contents with high predictive accuracy and for presenting model predictions over space for further application and investigation. Its advantages are: (1) it is independent of the ArcGIS Desktop application framework for creating maps and visualizing the results; (2) it facilitates interactive sensitivity analyses to assess the impact of subtle changes with simple inputs (e.g. soils, land use, topography, transportation, industry type); and (3) the modeling method can be used to investigate the inter-relationship between heavy metals pollution and environmental and anthropogenic variables.

Acknowledgements Funding for this research was provided by the National Technology Support Foundation, China (2006BAD10A07) and (2006BAJ05A02).

References Agricultural Chemistry Committee of China (1983). Conventional methods of soil and agricultural chemistry analysis. Beijing, China: Science (in Chinese). Al-Khashman, O. A. (2004). Heavy metal distribution in dust, street dust and soils from the work place in Karak Industrial Estate, Jordan. Atmospheric Environment, 38, 6803–6812. doi:10.1016/j.atmosenv.2004.09.011. Barrera-Bassols, N., Zinck, J. A., & Van Ranst, E. (2006). Local soil classification and comparison of indigenous and technical soil maps in a Mesoamerican community using spatial analysis. Geoderma, 135, 140–162. doi:10.1016/j.geoderma.2005.11.010. Battaglia, A., Calace, N., Nardi, E., Petronio, B. M., & Pietroletti, M. (2007). Reduction of Pb and Zn bioavailable forms in metal polluted soils due to paper mill sludge addition. Effects on Pb and Zn transferability to barley. Bioresource Technology, 98, 2993– 2999. doi:10.1016/j.biortech.2006.10.007. Bloemena, M.-L. (1995). The distribution of Cd, Cu, Pb and Zn in topsoils of Osnabriick in relation to land use. The Science of the Total Environment, 166, 137– 148. doi:10.1016/0048-9697(95)04520-B.

Blok, J. (2005). Environmental exposure of road borders to zinc. The Science of the Total Environment, 348, 173–190. doi:10.1016/j.scitotenv.2004.12.073.
Bøhler, T., Karatzas, K., Peinel, G., Rose, T., & San Jose, R. (2002). Providing multi-modal access to environmental data—customizable information services for disseminating urban air quality information in APNEE. Computers, Environment and Urban Systems, 26, 39–61. doi:10.1016/S0198-9715(01)00020-5.
Bou, K. R., Chorowicz, J., Abdallah, C., & Dhont, D. (2008). Soil and bedrock distribution estimated from gully form and frequency: A GIS-based decision-tree model for Lebanon. Geomorphology, 93, 482–492. doi:10.1016/j.geomorph.2007.03.010.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove, CA: Wadsworth.
Burrough, P. A. (1996). Environmental modeling with geographical information systems. In Z. Kemp (Ed.), Innovations in GIS 4 (pp. 143–153). London, England: Taylor & Francis Publisher.
Caetano, S., Aires-de-Sousa, J., Daszykowski, M., & Vander Heyden, Y. (2005). Prediction of enantioselectivity using chirality codes and classification and regression trees. Analytica Chimica Acta, 544, 315–326. doi:10.1016/j.aca.2004.12.012.
Chansheng, H., Changan, S., Changchun, Y., & Brayan, P. A. (2001). A Windows-based GIS-AGNPS interface. Journal of the American Water Resources Association, 37(2), 395–406. doi:10.1111/j.1752-1688.2001.tb00977.x.
Cho, S.-H., & Newman, D. H. (2005). Spatial analysis of rural land development. Forest Policy and Economics, 7, 732–744. doi:10.1016/j.forpol.2005.03.008.
Crossman, N. D., Perry, L. M., Bryan, B. A., & Ostendorf, B. (2007). CREDOS: A conservation reserve evaluation and design optimisation system. Environmental Modelling & Software, 22, 449–463. doi:10.1016/j.envsoft.2005.12.006.
Emanuela, M., Varrica, D., & Dongarra, G. (2006). Metal distribution in road dust samples collected in an urban area close to a petrochemical plant at Gela, Sicily. Atmospheric Environment, 40, 5929–5941. doi:10.1016/j.atmosenv.2006.05.020.
Emmerson, R. H. C., O’Reilly-Wiese, S. B., Macleod, C. L., & Lester, J. N. (1997). A multivariate assessment of metal distribution in inter-tidal sediments of the Blackwater Estuary, UK. Marine Pollution Bulletin, 34(11), 960–968. doi:10.1016/S0025-326X(97)00067-2.
Espa, G., Benedetti, R., De Meo, A., Ricci, U., & Espa, S. (2006). GIS based models and estimation methods for the probability of archaeological site location. Journal of Cultural Heritage, 7, 47–155. doi:10.1016/j.culher.2006.06.001.
Euan, C., Davies, C., Elkins, R., Evans, K., Frankland, A., Gill, S., et al. (2004). ArcGIS engine developer’s guide. California: Environmental Systems Research Institute, Inc.

Hillyer, A. E. M., McDonagh, J. F., & Verlinden, A. (2006). Land-use and legumes in northern Namibia—The value of a local classification system. Agriculture Ecosystems & Environment, 117, 251–265. doi:10.1016/j.agee.2006.04.008.
Jagadeesh, B. A., Thirumalaivasan, D., & Venugopal, K. (2006). STAO: A component architecture for raster and time series modeling. Environmental Modelling & Software, 21, 653–664. doi:10.1016/j.envsoft.2004.11.011.
Jiang, L., Yang, X., Ye, H., Shi, W., & Jiang, Y. (2002). Effect of copper refining on spatial distribution of heavy metal in surrounding soils and crops. Journal of Zhejiang University (Agriculture & Life Science), 28, 689–693 (in Chinese).
Liao, H., & Tim, U. S. (1997). An interactive modeling environment for nonpoint source pollution control. Journal of the American Water Resources Association, 33(3), 591–603. doi:10.1111/j.1752-1688.1997.tb03534.x.
Loo, B. P. Y. (2006). Validating crash locations for quantitative spatial analysis: A GIS-based approach. Accident Analysis and Prevention, 38, 879–886. doi:10.1016/j.aap.2006.02.012.
McGrath, D., Zhang, C., & Carton, O. T. (2004). Geostatistical analyses and hazard assessment on soil lead in Silvermines area, Ireland. Environmental Pollution, 127, 239–248. doi:10.1016/j.envpol.2003.07.002.
Micó, C., Recatalá, L., Peris, M., & Sánchez, J. (2006). Assessing heavy metal sources in agricultural soils of an European Mediterranean area by multivariate analysis. Chemosphere, 65, 863–872.
Miller, R. C., Guertin, D. P., & Heilman, P. (2004). Information technology in watershed management decision making. Journal of the American Water Resources Association, 40(2), 349–357.
Parks, B. O. (1993). The need for integration. In M. F. Goodchild, B. O. Parks, & L. T. Steyaert (Eds.), Environmental modelling with GIS (pp. 31–34). Oxford, England: Oxford University Press.
Romic, M., Hengl, T., Romic, D., & Husnjak, S. (2007). Representing soil pollution by heavy metals using continuous limitation scores. Computers & Geosciences, 33, 1316–1326.
Rothwell, J. J., Futter, M. N., & Dise, N. B. (2008). A classification and regression tree model of controls on dissolved inorganic nitrogen leaching from European forests. Environmental Pollution. doi:10.1016/j.envpol.2008.01.007.
Salgueiro, A. R., Freire Ávila, P., Garcia Pereira, H., & Santos Oliveira, J. M. (2008). Geostatistical estimation of chemical contamination in stream sediments: The case study of Vale das Gatas mine (northern Portugal). Journal of Geochemical Exploration, 98, 15–21. doi:10.1016/j.gexplo.2007.10.005.
Selle, B., Lischeid, G., & Huwe, B. (2007). Effective modeling of percolation at the landscape scale using data-based approaches. Computers & Geosciences, 34, 699–713. doi:10.1016/j.cageo.2007.06.007.
Tian-Shyug, L., Chiu, C.-C., Chou, Y.-C., & Lu, C.-J. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis, 50, 1113–1130. doi:10.1016/j.csda.2005.04.013.
Tischler, M., Garcia, M., Peters-Lidard, C., Moran, M. S., Miller, S., Thoma, D., et al. (2007). A GIS framework for surface-layer soil moisture estimation combining satellite radar measurements and land surface modeling with soil physical property estimation. Environmental Modelling & Software, 22, 891–898.
Tran, V. T., Yang, B.-S., Oh, M.-S., & Tan, A. C. C. (2008). Fault diagnosis of induction motor based on decision trees and adaptive neuro-fuzzy inference. Expert Systems with Applications. doi:10.1016/j.eswa.2007.12.010.
Tso, B., & Mather, P. M. (2001). Classification methods for remotely sensed data. London, England: Taylor & Francis.
Waheed, T., Bonnell, R. B., Prasher, S. O., & Paulet, E. (2006). Measuring performance in precision agriculture: CART—A decision tree approach. Agricultural Water Management, 84, 173–185.
Wegener, M. (Eds.) (2000). Spatial models and GIS: New potential and new models. London, England: Taylor & Francis.

Yao, T. (1999). Nonparametric cross-covariance modeling as exemplified by soil heavy metal concentrations from the Swiss Jura. Geoderma, 88, 13–38.
Yay, O. D., Alagha, O., & Tuncel, G. (2008). Multivariate statistics to investigate metal contamination in surface soil. Journal of Environmental Management, 86, 581–594.
Yong, L. (2006). Predicting materials properties and behavior using classification and regression trees. Materials Science and Engineering A, 433, 261–268. doi:10.1016/j.msea.2006.06.100.
Zeiler, M. (2001). Exploring ArcObjects. California: Environmental Systems Research Institute.
Zhang, B., Valentine, I., Kemp, P., & Lambert, G. (2006). Predictive modelling of hill-pasture productivity: Integration of a decision tree and a geographical information system. Agricultural Systems, 87, 1–17.
Zhejiang Soil Survey Office (1994). Zhejiang soils. Hangzhou, China: Zhejiang Technology (in Chinese).
Zheng, N., Wang, Q., & Zheng, D. (2007). Health risk of Hg, Pb, Cd, Zn, and Cu to the inhabitants around Huludao Zinc Plant in China via consumption of vegetables. Science of the Total Environment, 383, 81–89.