Science of the Total Environment 626 (2018) 1121–1135

Contents lists available at ScienceDirect

Science of the Total Environment journal homepage: www.elsevier.com/locate/scitotenv

Landslide susceptibility modelling using GIS-based machine learning techniques for Chongren County, Jiangxi Province, China

Wei Chen a, Jianbing Peng b, Haoyuan Hong c,d,e,⁎, Himan Shahabi f, Biswajeet Pradhan g,h, Junzhi Liu c,d,e, A-Xing Zhu c,d,e,⁎, Xiangjun Pei i,⁎⁎, Zhao Duan a

a College of Geology & Environments, Xi'an University of Science and Technology, Xi'an 710054, China
b Department of Geological Engineering, Chang'an University, Xi'an 710054, China
c Key Laboratory of Virtual Geographic Environments, Nanjing Normal University, Nanjing 210023, China
d State Key Laboratory Cultivation Base of Geographical Environment Evolution (Jiangsu Province), Nanjing 210023, China
e Jiangsu Center for Collaborative Innovation in Geographic Information Resource Development and Applications, Nanjing 210023, China
f Department of Geomorphology, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran
g School of Systems, Management and Leadership, Faculty of Engineering and IT, University of Technology Sydney, CB11.06.217, Building 11, 81 Broadway, Ultimo NSW 2007 (PO Box 123), Australia
h Department of Energy and Mineral Resources Engineering, Choongmu-gwan, Sejong University, 209 Neungdong-ro, Gwangjin-gu, Seoul 05006, Republic of Korea
i The State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China

Highlights

• Bayes' net, RBF classifier, logistic model tree, and random forest models were applied for landslide susceptibility modelling.
• The information gain method was used to evaluate the relationship between landslides and conditioning factors.
• The random forest model showed the best result in landslide prediction.

Article info

Article history: Received 9 November 2017; Received in revised form 26 December 2017; Accepted 13 January 2018; Available online xxxx
Editor: Ouyang Wei
Keywords: Landslide susceptibility; Bayes' net; Radial basis function classifier; Logistic model tree; Random forest; China

Abstract

The preparation of a landslide susceptibility map is considered to be the first step for landslide hazard mitigation and risk assessment. However, these maps are accepted as end products that can be used for land use planning. The main goal of this study is to assess and compare four advanced machine learning techniques, namely the Bayes' net (BN), radial basis function (RBF) classifier, logistic model tree (LMT), and random forest (RF) models, for landslide susceptibility modelling in Chongren County, China. A total of 222 landslide locations were identified in the study area using historical reports, interpretation of aerial photographs, and extensive field surveys. The landslide inventory data were randomly split into two groups with a ratio of 70/30 for training and validation purposes. Fifteen landslide conditioning factors were prepared for landslide susceptibility modelling. The spatial correlation between landslides and conditioning factors was analyzed using the information gain (IG) method. The BN, RBF classifier, LMT, and RF models were constructed using the training dataset. Finally, the receiver operating characteristic (ROC) curve and statistical measures, including sensitivity, specificity, and accuracy, were employed to validate and compare the predictive capabilities of the models. Out of the tested models, the RF model had the highest sensitivity, specificity, and accuracy values of 0.787, 0.716, and 0.752, respectively, for the training dataset. Overall, the RF model produced an optimized balance for the training and validation datasets in terms of AUC values and statistical measures. The results of this study also demonstrate the benefit of selecting optimal machine learning techniques with proper conditioning factor selection methods for landslide susceptibility modelling.

© 2018 Elsevier B.V. All rights reserved.

⁎ Correspondence to: H. Hong and A.-X. Zhu, Key Laboratory of Virtual Geographic Environments, Nanjing Normal University, Nanjing, 210023, China.
⁎⁎ Corresponding author.
E-mail addresses: [email protected] (H. Hong), [email protected] (A.-X. Zhu), [email protected] (X. J. Pei).

https://doi.org/10.1016/j.scitotenv.2018.01.124
0048-9697/© 2018 Elsevier B.V. All rights reserved.

W. Chen et al. / Science of the Total Environment 626 (2018) 1121–1135

1. Introduction

Landslides are a very complex natural phenomenon that causes severe loss of human life and property worldwide (Lee et al., 2004). Over the years, many government agencies have attempted to mitigate the disastrous consequences of landslides by educating people about their severe effects and by developing appropriate tools for planning and decision making. This process is generally performed by identifying and mapping areas susceptible to landslides. These maps are generated based on an assessment of landslide susceptibility, which is the spatial distribution of probabilities of landslide occurrence in a given area based on local geo-environmental factors (Aleotti and Chowdhury, 1999; Wang et al., 2015). During the last several decades, various techniques and approaches have been proposed and developed to study landslide susceptibility, including weights-of-evidence (Ding et al., 2017; Hong et al., 2017d), the evidential belief function (Pourghasemi and Kerle, 2016), certainty factors (Hong et al., 2017a), the analytical hierarchy process (Pawluszek and Borkowski, 2017), logistic regression (Raja et al., 2017), multivariate regression (Akgün and Türk, 2011; Guzzetti et al., 2006), and multivariate adaptive regression splines (Felicísimo et al., 2013). In recent years, in addition to the above methods, various machine learning techniques have been applied to landslide susceptibility mapping, such as support vector machines (Hong et al., 2017c; Tien Bui et al., 2017b), artificial neural networks (Chen et al., 2017f; Gorsevski et al., 2016), neuro-fuzzy techniques (Chen et al., 2017a; Lee et al., 2015), decision trees (Chen et al., 2017h; Hong et al., 2018b), and random forests (Catani et al., 2013; Chen et al., 2017d; Zhang et al., 2017).
More recently, several hybrid methods have also been developed by combining statistical methods with machine learning approaches, such as the adaptive neuro-fuzzy inference system (Nasiri Aghdam et al., 2016), stepwise weight assessment ratio analysis technique (Dehnavi et al., 2015), ANN-fuzzy logic technique (Kanungo et al., 2006), ANN-MaxEnt-SVM (Chen et al., 2017b), and rough set-SVM (Peng et al., 2014). Some of these methods were reported to perform better than conventional methods (Tien Bui et al., 2016b). The above literature review reveals that several advanced machine learning approaches, such as RBF classifiers and logistic model trees, have seldom been explored for landslide modelling. Additionally, even a small increment in prediction accuracy may have a large impact on the resulting landslide susceptibility zones (Chen et al., 2018a, 2018b; Tien Bui et al., 2016b). Therefore, it is extremely important to investigate and compare machine learning methods and conventional methods to reach reasonable conclusions for landslide susceptibility assessment. Accordingly, this study aims to evaluate and compare the performance of several state-of-the-art machine learning techniques, including the Bayes' net, RBF classifier, logistic model tree, and random forest models, for the spatial prediction of landslides in the Chongren area (China).

2. Study area

The study area is located in Chongren County, Jiangxi Province, China, which lies between latitudes 27°25′ N and 27°56′ N, and longitudes 115°49′ E and 116°17′ E (Fig. 1). This county covers an area of approximately 1520 km². The average annual temperature is 17.7 °C, the average annual rainfall for the period from 1960 to 2012 ranged from 1123.4 mm to 2850.7 mm (http://www.weather.org.cn), and the rainy season lasts from April to June (Hong et al., 2017a). Topographically, the altitudes of the area range from 1 m to 1218 m above sea level and approximately 60% of the study area has a slope gradient <10°. Geologically, the main lithologies are sandstone, slate, shale, limestone, and igneous rocks (Fig. 2). There have been no earthquake-induced landslides reported in the study area prior to the time of this study (Hong et al., 2017a). According to statistics from the government of Chongren County (http://www.jxcr.gov.cn), a total of 3077 people were threatened by landslides in the year 2016 alone in the study area. The damage to property was estimated to be approximately 4 million USD.

3. Data preparation and methods

3.1. Spatial database

Landslide inventory mapping is an important step in landslide susceptibility assessment. In this study, historical records, satellite images, field surveys, and Google Earth® were used to analyze landslide locations. A total of 222 landslide events were prepared for analysis. The size of the smallest landslide site was 2.5 m² and the size of the largest site was 15,000 m². The average size was 841.3 m² (Hong et al., 2017a). There is no clear agreement on the precise causes of landslides because of their complex nature and development (Hong et al., 2016b). However, scientists have studied the relationships between landslide occurrence and several conditioning factors, such as topographical, geological, and climatic conditions (Hong et al., 2016a). Anthropogenic activity also has an important effect on the geological environment (Qiao et al., 2017).
Therefore, based on previous landslide susceptibility studies and analysis of the properties of the Chongren area, the slope, aspect, elevation, plan curvature, profile curvature, stream power index (SPI), sediment transport index (STI), topographic wetness index (TWI), distance to rivers, distance to roads, distance to faults, normalized difference vegetation index (NDVI), land use, lithology, and rainfall properties were considered in this study. A digital elevation model (DEM) with a resolution of 25 m was constructed based on topographic maps at a scale of 1:50,000. Slope, aspect, elevation, plan curvature, profile curvature, SPI, STI, and TWI were extracted using the DEM (Fig. 3a–h). The distance to rivers and distance to roads maps were constructed using the topographical map by buffering river and road lines (Fig. 3i–j). The NDVI and land use maps (Fig. 3l, m) were derived from Landsat 7/ETM+ satellite images with a resolution of 30 m (http://www.gscloud.cn). The lithology and distance to faults maps were prepared using a geological map at a scale of 1:200,000, and the lithology was grouped into 10 classes (Hong et al., 2017a) (Fig. 3k, n). The mean annual precipitation at 18 rainfall stations for the period of 1960–2012 was used to construct the rainfall map (Fig. 3o). Finally, all landslide conditioning factor maps were converted into raster format with a resolution of 25 m.

Fig. 1. Location of the study area and landslide inventory map.

Fig. 2. Geologic map of the study area.

3.2. Preparation of training and validation datasets

In this study, 222 landslide locations (centroids) were randomly split with a ratio of 70/30. Accordingly, 155 landslide points (70%) were used for building models and the other 67 landslide points (30%) were used for model validation. Landslide susceptibility mapping using data mining methods can be considered as binary classification (Bennett et al., 2016). Therefore, the same number of non-landslide points (222) was randomly selected from landslide-free areas and split with a ratio of 70/30. Subsequently, all raster values of the 15 landslide conditioning factors were extracted at the landslide and non-landslide points. Finally, the landslide points were assigned a value of 1 and the non-landslide points were assigned a value of 0 to build the training and validation datasets.

3.3. Selection of landslide conditioning factors

In landslide susceptibility modelling, the selection of landslide conditioning factors is an important step because there may be certain noisy factors that reduce the predictive capability of models. In this study, the information gain (IG) method was used to select landslide conditioning factors. Information gain was first proposed in 1966 for estimating an attribute's quality (Hunt et al., 1966) and has since been widely used by many researchers (Azhagusundari and Thanamani, 2013; Peirolo, 2011; Quinlan, 1986). The information gain value for a landslide conditioning factor $X_i$ and class $Y$ is calculated using the following formula:

$$IG(Y, X_i) = H(Y) - H(Y \mid X_i) \qquad (1)$$
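As a rough illustration of Eq. (1), the following Python sketch computes the information gain of a discretized conditioning factor with respect to a binary landslide label, taking the entropy of the label minus its entropy after grouping by the factor's values. The toy data, the factor names, and the helper functions `entropy` and `information_gain` are hypothetical and are not taken from the study; the sketch does not attempt to reproduce the average merit values reported in Table 1.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y) in bits of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y, X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    # H(Y | X): entropy within each feature value, weighted by its frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy sample: 1 = landslide point, 0 = non-landslide point.
landslide = [1, 1, 0, 0, 1, 0, 1, 0]
elevation_class = ["low", "low", "high", "high", "mid", "mid", "low", "high"]
noise_class = ["a", "b", "a", "b", "a", "a", "b", "b"]

ig_elevation = information_gain(elevation_class, landslide)
ig_noise = information_gain(noise_class, landslide)
print(ig_elevation, ig_noise)
```

In this toy sample the elevation classes separate the two labels almost perfectly, so their information gain is high, whereas the second factor splits the labels evenly within each of its values and carries no information. A factor of the latter kind corresponds to the case of zero contribution that the selection step is meant to filter out.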

In Eq. (1), $H(Y)$ is the entropy of $Y$ and $H(Y \mid X_i)$ is the entropy of $Y$ after associating the values of the landslide conditioning factor $X_i$ (Tien Bui et al., 2016a).

3.4. Bayes' net (BN)

Bayesian networks have proven valuable for modelling uncertainty and supporting decision making processes (Andersen, 1991). The BN has been widely used in many areas, such as building design (Jensen et al., 2009) and landslide stability calculation (Jiang and Dinglong, 2013). Generally, learning a BN is divided into two subtasks: structure learning and parameter learning. The former determines the topology of the network and the latter defines the numerical parameters for a given network topology (Gheisari and Meybodi, 2016).

3.5. RBF classifier

The RBF classifier is a specific type of radial basis function network. Generally, there are three layers: an input layer, a hidden layer, and an output layer. A common strategy is to train the hidden layer of the network using k-means clustering and the output layer using supervised learning.

An RBF classifier is trained by using the Gaussian radial basis function model (Frank, 2014), and the model learned for the $l$-th output unit is:

$$f(x_1, x_2, \ldots, x_n) = h\left(w_{l,0} + \sum_{i=1}^{m} w_{l,i} \exp\left(-\sum_{j=1}^{n} \frac{a_j^2 \left(x_j - c_{i,j}\right)^2}{2\sigma_{i,j}^2}\right)\right) \qquad (2)$$

where $x_1, x_2, \ldots, x_n$ is the vector of landslide conditioning factors, $h(\cdot)$ is the logistic function, $m$ is the number of basis functions, $w_{l,i}$ is the weight for each basis function, $a_j^2$ is the weight of the $j$-th attribute, $c_{i,j}$ are the basis function centers, and $\sigma_{i,j}^2$ are the variances.
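To make the structure of Eq. (2) concrete, the sketch below evaluates a single output unit of a Gaussian RBF model for one input vector. It only shows the forward computation under assumed parameter values; the function name `rbf_output` and all numbers are hypothetical, and the training procedure used by the authors' software is not represented here.

```python
import math


def rbf_output(x, w0, weights, centers, variances, attr_scale):
    """Evaluate one output unit of a Gaussian RBF model as in Eq. (2).

    x          -- input vector (x_1, ..., x_n)
    w0         -- bias weight w_{l,0}
    weights    -- basis function weights w_{l,i}, one per basis function
    centers    -- centers c_{i,j}, one list of length n per basis function
    variances  -- variances sigma^2_{i,j}, same shape as centers
    attr_scale -- attribute scale factors a_j (squared inside the exponent)
    """
    total = w0
    for w_i, c_i, s2_i in zip(weights, centers, variances):
        exponent = -sum(
            (a_j ** 2) * (x_j - c_ij) ** 2 / (2.0 * s2_ij)
            for x_j, c_ij, s2_ij, a_j in zip(x, c_i, s2_i, attr_scale)
        )
        total += w_i * math.exp(exponent)
    # h(.) is the logistic function, which maps the sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical example: one basis function centered at the origin.
at_center = rbf_output([0.0, 0.0], 0.0, [2.0], [[0.0, 0.0]], [[1.0, 1.0]], [1.0, 1.0])
far_away = rbf_output([100.0, 100.0], 0.0, [2.0], [[0.0, 0.0]], [[1.0, 1.0]], [1.0, 1.0])
print(at_center, far_away)
```

An input that coincides with a center activates that basis function fully, while an input far from every center leaves only the bias term, so the output falls back to the logistic function of $w_{l,0}$.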

Fig. 3. (a) Slope; (b) Aspect; (c) Elevation; (d) Plan curvature; (e) Profile curvature; (f) SPI; (g) STI; (h) TWI; (i) Distance to rivers; (j) Distance to roads; (k) Distance to faults; (l) NDVI; (m) Land use; (n) Lithology; (o) Rainfall.


3.6. Logistic model tree (LMT)

The logistic model tree combines a decision tree with a linear logistic regression technique to leverage the advantages of both (Karabulut and Ibrikci, 2014). The LogitBoost algorithm (Hall et al., 2009) is used for fitting the logistic regression functions at a given tree node. It is assumed that the dataset contains input vectors $x$ and $C$ classes (Karabulut and Ibrikci, 2014). The posterior probability for each class can be expressed as:

$$p(c \mid x) = \frac{e^{F_c(x)}}{\sum_{n=1}^{C} e^{F_n(x)}} \qquad (3)$$

where $F_c(x)$ are the linear regression functions to be fitted and $C$ is the number of classes. The least squares method is used to fit $F_c(x)$, and over all classes the functions must sum to 0:

$$\sum_{n=1}^{C} F_n(x) = 0 \qquad (4)$$

In order to achieve the fitting of estimates, LogitBoost uses maximum likelihood to find the smallest possible deviation between observed and predicted values. Compared to a traditional decision tree, LMT employs logistic regression functions to estimate the probability of each class. Therefore, LMT is a probability model that can handle uncertainty.
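The relationship between Eqs. (3) and (4) can be checked numerically. The sketch below, with hypothetical function names and score values, centers a set of class scores so that they sum to zero and then converts them into posterior probabilities. It is not an implementation of LogitBoost or of the tree induction itself.

```python
import math


def center(scores):
    """Shift class scores so that they sum to zero, as required by Eq. (4)."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def posterior(scores):
    """Posterior p(c | x) from class scores F_c(x), as in Eq. (3)."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical scores for the classes (non-landslide, landslide).
raw = [0.3, 1.7]
centered = center(raw)
p_raw = posterior(raw)
p_centered = posterior(centered)
print(centered, p_centered)
```

Centering leaves the posterior unchanged because a constant shift cancels in the ratio of Eq. (3). For two classes the constraint gives $F_1(x) = -F_0(x)$, so the landslide probability reduces to the logistic form $1 / (1 + e^{-2F_1(x)})$.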


3.7. Random forest (RF)

The random forest is a classification method that uses multiple tree predictors in such a way that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. In standard trees, each node is split using the best split among all variables (landslide conditioning factors). In a random forest, each node is split using the best split among a subset of predictors that are randomly chosen at that node. Naturally, it has become a popular method for finding useful but hidden patterns within large volumes of data. To determine the best split at a node, random subsets of the n variables in the training data are considered, and the best node split can be computed using the Gini criterion (Breiman, 2001). This criterion measures the degree of correlation between variables and results. According to the random forest algorithm, the lowest value is considered to be the best split for each node (Kausar and Majid, 2016). The Gini criterion is expressed as:

$$\mathrm{Gini}(k, x_i) = \sum_{i=1}^{m} \frac{a_i}{n_s} I(ku_i) \qquad (5)$$

where $m$ represents the number of landslides at each node $k$ and $n_s$ is the number of training input feature vectors. $I(ku_i)$ represents the distribution of class labels at a node. At node $k$, $x_i \in X$ is a feature variable with values $x_i = \{u_1, u_2, \ldots, u_m\}$, and the value of $I(ku_i)$ can be computed as:

$$I(ku_i) = 1 - \sum_{i=0}^{c} \frac{n_{c_i}^2}{a_i^2} \qquad (6)$$
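The split criterion of Eqs. (5) and (6) can be sketched in a few lines of Python. For each value of a candidate feature, the class impurity (one minus the sum of squared class proportions) is computed and weighted by the share of node samples that take that value; the feature with the lowest weighted impurity is preferred. The function names and the toy data are hypothetical, and the sketch covers only the scoring of one candidate split, not the construction of a forest.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions, in the spirit of Eq. (6)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(feature_values, labels):
    """Impurity of a split on a discrete feature, in the spirit of Eq. (5).

    Each feature value contributes its impurity weighted by the fraction of
    the node's samples that take that value.
    """
    n_s = len(labels)
    groups = {}
    for u, y in zip(feature_values, labels):
        groups.setdefault(u, []).append(y)
    return sum(len(g) / n_s * gini_impurity(g) for g in groups.values())


# Hypothetical samples at one node: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 0, 0, 0, 0]
slope_class = ["steep", "steep", "steep", "gentle", "gentle", "gentle"]
aspect_class = ["N", "S", "N", "S", "N", "S"]

gini_slope = weighted_gini(slope_class, labels)
gini_aspect = weighted_gini(aspect_class, labels)
print(gini_slope, gini_aspect)
```

Here the slope classes give the lower weighted impurity, so a tree would split this node on slope rather than aspect.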

In Eq. (6), $n_{c_i}$ is the number of samples that belong to class $c_i$ with value $u_i$, and $a_i$ is the number of samples with the value $u_i$ at node $k$.

3.8. Model performance evaluation

3.8.1. The receiver operating characteristic curve

The receiver operating characteristic (ROC) curve describes the performance of a binary classifier system as its discrimination threshold changes (Kavzoglu et al., 2015; Wang et al., 2015). The ROC curve represents sensitivity as a function of the false positive rate (1 − specificity). It can be generated by plotting sensitivity on the y-axis against the false positive rate on the x-axis. It has been widely used as a standard tool for evaluating the general performance of models (Chen et al., 2017i; Hong et al., 2017b; Hong et al., 2018a). The area under the ROC curve (AUC) is a quantitative measure of the quality of a model, which can be categorized as poor (0.5–0.6), average (0.6–0.7), good (0.7–0.8), very good (0.8–0.9), and excellent (0.9–1) (Chen et al., 2017c; Yesilnacar, 2005). A higher AUC value indicates a better model and an AUC value of 1 indicates a perfect model (Youssef et al., 2015).

3.8.2. Statistical measures

In landslide susceptibility assessment, most scientists agree that scientific methods should be used to evaluate the performance of landslide models. However, there is no clear agreement regarding which methods should be used. In this study, sensitivity, specificity, and accuracy are used to evaluate the performance of the landslide models. The exact definitions of these statistical measures have been detailed in many landslide studies (Chen et al., 2017g; Tien Bui et al., 2016b). They can be calculated using the following equations:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (7)$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \qquad (8)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (9)$$

where FP (false positive) and FN (false negative) are the numbers of pixels erroneously classified, and TP (true positive) and TN (true negative) are the numbers of pixels that are correctly classified.

4. Results and discussion

4.1. Landslide conditioning factor analysis

For landslide conditioning factor selection, factors with weights exceeding a certain threshold were selected for analyzing landslide susceptibility. In this study, factors with weights of zero or less make no contribution to landslide occurrences and must be removed from further analysis. The predictive ability of the 15 landslide conditioning factors based on the information gain method is shown in Table 1. It can be observed that 12 landslide conditioning factors have positive values correlating to landslide occurrences (AM > 0). Among these factors, elevation has the highest IG value (AM = 0.074), followed by distance to rivers (AM = 0.072), distance to roads (AM = 0.075), STI (AM = 0.070), TWI (AM = 0.067), lithology (AM = 0.063), NDVI (AM = 0.051), distance to faults (AM = 0.041), SPI (AM = 0.038), slope (AM = 0.037), rainfall (AM = 0.036), and aspect (AM = 0.006). In contrast, land use, plan curvature, and profile curvature have no predictive ability (AM = 0). Therefore, in this study, these three factors were removed from further analysis.

4.2. Model construction

The landslide models were constructed using the aforementioned training dataset with 10-fold cross validation. The constructed models were applied to calculate landslide susceptibility indexes for all pixels in the study area. Thereafter, landslide susceptibility maps were constructed and reclassified into five susceptible classes using the geometrical interval method (Frye, 2007) with ratings of very low, low, moderate, high, and very high susceptibility (Fig. 4). The distribution of different landslide susceptibility classes is shown in Table 2. In the case of the BN model, it can be observed that the very low susceptibility class accounts for 22.511% of the study area. The low, moderate, and high susceptibility classes account for 19.542%, 21.173%, and 23.189% of the study area, respectively. The very high susceptibility class accounts for 13.585% of the study area. Regarding the landslide susceptibility map generated by the RBF classifier model, 26.886% of the study area belongs to very low susceptibility class. The low susceptibility class accounts for 24.221% of the study area and the moderate susceptibility class accounts for 23.236% of the study area. The high and very high susceptibility classes account for 16.395% and 9.262% of the study area, respectively. In the landslide susceptibility map produced using the LMT model, the very low susceptibility class accounts for 26.720% of the study area and the low susceptibility class accounts for 25.711% of the study area. 20.975% of the study area falls into the moderate susceptibility class. 16.029% of the study area falls into the high susceptibility class and 10.565% falls into the very high

Table 1
Average IG of conditioning factors.

Number  Landslide conditioning factor  Average merit (AM)  Standard deviation (Sd)
1       Elevation                      0.074               ±0.008
2       Distance to rivers             0.072               ±0.007
3       Distance to roads              0.075               ±0.010
4       STI                            0.070               ±0.010
5       TWI                            0.067               ±0.010
6       Lithology                      0.063               ±0.016
7       NDVI                           0.051               ±0.006
8       Distance to faults             0.041               ±0.006
9       SPI                            0.038               ±0.004
10      Slope                          0.037               ±0.005
11      Rainfall                       0.036               ±0.008
12      Aspect                         0.006               ±0.012
13      Land use                       0                   0
14      Plan curvature                 0                   0
15      Profile curvature              0                   0


Fig. 4. Landslide susceptibility map derived from: (a) BN; (b) RBF classifier; (c) LMT; (d) RF.

susceptibility class. In the landslide susceptibility map generated by the RF model, 21.150% of the study area falls into the very low susceptibility class. 24.376% of the study area falls into the low susceptibility class and 23.772% of the study area falls into the moderate susceptibility class.

Finally, 20.126% of the study area falls into the high susceptibility class and 10.576% of the study area falls into the very high susceptibility class. The general performance of the landslide models based on the ROC curve method is presented in Fig. 5. It can be observed that the LMT and RBF classifier landslide models achieved good performance for

Table 2
Percentages of different landslide susceptibility classes.

Class      Bayes' net  RBF classifier  LMT     RF
Very low   22.511      26.886          26.720  21.150
Low        19.542      24.221          25.711  24.376
Moderate   21.173      23.236          20.975  23.772
High       23.189      16.395          16.029  20.126
Very high  13.585      9.262           10.565  10.576

Table 3
Model performance on the training dataset.

Parameters      BN     RBF classifier  LMT    RF
True positive   109    115             114    122
True negative   102    108             111    111
False positive  53     47              44     44
False negative  46     40              41     33
Sensitivity     0.703  0.742           0.735  0.787
Specificity     0.658  0.697           0.716  0.716
Accuracy        0.681  0.719           0.726  0.752

landslide susceptibility assessment (AUC > 0.8). The LMT model achieved the highest performance (AUC = 0.812), followed by the RBF classifier model (AUC = 0.806), RF model (AUC = 0.795), and BN model (AUC = 0.760). The performance of the landslide models using statistical index-based evaluations is shown in Table 3. The RF model achieved the highest performance for the classification of landslide pixels (sensitivity = 78.7%), followed by the RBF classifier model (sensitivity = 74.2%), LMT model (sensitivity = 73.5%), and BN model (sensitivity = 70.3%). For classification of non-landslide pixels, the highest performance was achieved by the RF and LMT models (specificity = 71.6%), followed by the RBF classifier model (specificity = 69.7%) and BN

model (specificity = 65.8%). In general, RF performs best for the classification of both landslide and non-landslide pixels. For accuracy, the RF model achieved the highest accuracy of 0.752, followed by the LMT model (0.726), RBF classifier model (0.719), and BN model (0.681).

4.3. Model validation

The general performance of the landslide models using the ROC curve on the validation dataset is presented in Fig. 6. It can be observed that the LMT, RBF classifier, and RF landslide models achieved good

Fig. 5. ROC curves and AUC analysis using the training dataset: (a) BN, (b) RBF classifier, (c) LMT, and (d) RF models.


Fig. 6. ROC curves and AUC analysis using the validation dataset: (a) BN, (b) RBF classifier, (c) LMT, and (d) RF models.

performance for landslide susceptibility assessment (AUC > 0.8) (Tien Bui et al., 2016b). The LMT model achieved the best performance (AUC = 0.824), followed by the RBF classifier model (AUC = 0.809), RF model (AUC = 0.800), and BN model (AUC = 0.760). Additional validation and comparison of the four models was also performed using statistical index-based evaluations (Table 4). The LMT model achieved the highest performance for the classification of landslide pixels (sensitivity = 80.6%), followed by the RF model (sensitivity = 73.1%), RBF classifier model (sensitivity = 71.6%), and BN model (sensitivity = 70.1%). For the classification of non-landslide pixels, the highest performance was achieved by the BN model (specificity = 79.1%), followed by the RBF classifier model (specificity = 76.1%), RF model (specificity = 74.6%), and LMT model (specificity = 70.1%). The BN model achieved the highest precision with a value of 0.770, followed by the RBF classifier model (0.750), RF model (0.742), and LMT model (0.730). For accuracy, the LMT model achieved the highest value of 0.754, followed by the BN model (0.746), and the RBF classifier and RF models (0.739).

Table 4
Model performance on the validation dataset.

Parameters      BN     RBF classifier  LMT    RF
True positive   47     48              54     49
True negative   53     51              47     50
False positive  14     16              20     17
False negative  20     19              13     18
Sensitivity     0.701  0.716           0.806  0.731
Specificity     0.791  0.761           0.701  0.746
Accuracy        0.746  0.739           0.754  0.739

4.4. Relative importance of conditioning factors for different models

The selection of landslide conditioning factors using the information gain method indicated that 12 conditioning factors have positive contributions to landslide models. However, standard guidelines for the selection of landslide conditioning factors are still a topic of debate (Tien Bui et al., 2016b). Therefore, in this study, the classifier attribute evaluation method was used to assess variable importance during the modelling processes of the four models (Witten et al., 2011). From Fig. 7, it can be observed that the same conditioning factor can contribute differently to different models. In the case of the BN model, elevation, distance to rivers, distance to roads, STI, and TWI had the highest contributions of 20.690%, 20.033%, 18.062%, 16.256%, and 14.614%, respectively. The other seven conditioning factors had smaller contributions to the BN model. For the RBF classifier, LMT, and RF models, the contributions of the landslide conditioning factors were very similar. In general, distance to rivers, elevation, and TWI had higher contributions to these three models, but the contributions of the other


Fig. 7. Relative importance of conditioning factors for different models.

Table 5
Model performance of the SVM method.

Parameters      Training dataset  Validation dataset
True positive   110               46
True negative   100               51
False positive  55                16
False negative  45                21
Sensitivity     0.710             0.687
Specificity     0.645             0.761
Accuracy        0.677             0.724

conditioning factors differed very little in these models. Therefore, it can be concluded that the predictive capability of a conditioning factor depends on the landslide model used and that further studies are necessary to explore landslide conditioning factor selection methods, as well as improvement of the predictive capability of conditioning factors.

4.5. Comparison of the models using the support vector machine method

The performance of the four models was further compared to the performance of a benchmark method, namely a support vector machine (SVM), using the same study area data. SVMs are a popular choice for landslide spatial prediction and achieve good prediction results (Chen et al., 2017e; Pham et al., 2016; Tien Bui et al., 2017b). For the SVM, a

radial basis function (RBF) kernel was employed and the best pair of parameters for regularization (C) and kernel width (γ) was found using the training dataset. The optimal value for C was 6 and the optimal value for γ was 0.06 (Tien Bui et al., 2017a). The detailed evaluation metrics for the training and validation datasets for the SVM model are shown in Table 5 and Fig. 8. It can be seen that the classification accuracies and AUCs for the training and validation datasets of the SVM model are lower than those of the four models explored in this study.

5. Conclusion

In this study, BN, RBF classifier, LMT, and RF models were employed to assess landslide susceptibility in Chongren County (China). Several experiments were performed using a dataset including a landslide inventory map and 15 landslide conditioning factors. The performance of the landslide models was evaluated using ROC curves and statistical measures. In this case study, the LMT and RBF classifier models achieved higher AUC values for the training and validation datasets, but differed significantly for the statistical measures. The BN model had a lower AUC value and the statistical measures differed significantly for the training and validation datasets. In contrast, the RF model yielded a high degree of fit for both the training and validation datasets. However, all of the models applied in this study are promising methods for landslide susceptibility mapping. The RF model achieved the best results

Fig. 8. ROC curves and AUC analysis using the SVM model: (a) training dataset, (b) validation dataset.

1134

W. Chen et al. / Science of the Total Environment 626 (2018) 1121–1135

overall in this case study. The results of this study may be useful for decision making and land use planning in areas prone to landslides. Acknowledgments We express our thanks to Wei Ouyang, Associate Editor of the journal of Science of the Total Environment and our two anonymous reviewers. With their comments and suggestions, we were able to significantly improve the quality of our paper. This research was supported by a China Postdoctoral Science Foundation funded project (Grant No. 2017M613168), the Key National Basic Research Program of China (973 Program) (Grant No. 2014CB744702), Project funded by Shaanxi Province Postdoctoral Science Foundation (Grant No. 2017BSHYDZZ07), Scientific Research Program Funded by Shaanxi Provincial Education Department (Program No. 17JK0511, 17JK0515), National Science Foundation of China (Grant No. 41702298, 41431177, 41601413), Natural Science Research Program of Jiangsu (Project No. BK20150975, 14KJA170001), the Project Supported by Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2017JQ4020), International Partnership Program of Chinese Academy of Sciences (Grant No. 115242KYSB20170022), Opening fund of State Key Laboratory of Geohazard Prevention and Geoenvironment Protection (Chengdu University of Technology) (Program No. SKLGP2017K010), Supports to A-Xing Zhu through the Vilas Associate Award, the Hammel Faculty Fellow Award, and Manasse Chair Professorship from the University of Wisconsin-Madison, as well as the “One-Thousand Talents” Program of China. References Akgün, A., Türk, N., 2011. Mapping erosion susceptibility by a multivariate statistical method: a case study from the Ayvalık region, NW Turkey. Comput. Geosci. 37 (9), 1515–1524. Aleotti, P., Chowdhury, R., 1999. Landslide hazard assessment: summary review and new perspectives. Bull. Eng. Geol. Environ. 58 (1), 21–44. Andersen, S.K., 1991. Probabilistic reasoning in intelligent systems: networks of plausible inference: Judea Pearl. 
Artif. Intell. 48 (1), 117–124. Azhagusundari, B., Thanamani, D.A.S., 2013. Feature selection based on information gain. Int. J. Innov. Technol. Exploring Eng. 2 (2). Bennett, G.L., Miller, S.R., Roering, J.J., Schmidt, D.A., 2016. Landslides, threshold slopes, and the survival of relict terrain in the wake of the Mendocino Triple Junction. Geology 44 (5), 363–366. Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32. Catani, F., Lagomarsino, D., Segoni, S., Tofani, V., 2013. Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues. Nat. Hazards Earth Syst. Sci. 13 (11), 2815–2831. Chen, W., Panahi, M., Pourghasemi, H.R., 2017a. Performance evaluation of GIS-based new ensemble data mining techniques of adaptive neuro-fuzzy inference system (ANFIS) with genetic algorithm (GA), differential evolution (DE), and particle swarm optimization (PSO) for landslide spatial modelling. Catena 157, 310–324. Chen, W., Pourghasemi, H.R., Kornejady, A., Zhang, N., 2017b. Landslide spatial modeling: introducing new ensembles of ANN, MaxEnt, and SVM machine learning techniques. Geoderma 305, 314–327. Chen, W., Pourghasemi, H.R., Naghibi, S.A., 2017c. A comparative study of landslide susceptibility maps produced using support vector machine with different kernel functions and entropy data mining models in China. Bull. Eng. Geol. Environ. 1–18. Chen, W., Pourghasemi, H.R., Naghibi, S.A., 2017d. Prioritization of landslide conditioning factors and its spatial modeling in Shangnan County, China using GIS-based data mining algorithms. Bull. Eng. Geol. Environ. 1–19. Chen, W., Pourghasemi, H.R., Panahi, M., Kornejady, A., Wang, J., Xie, X., Cao, S., 2017e. Spatial prediction of landslide susceptibility using an adaptive neuro-fuzzy inference system combined with frequency ratio, generalized additive model, and support vector machine techniques. Geomorphology 297, 69–85. Chen, W., Pourghasemi, H.R., Zhao, Z., 2017f. 
A GIS-based comparative study of DempsterShafer, logistic regression and artificial neural network models for landslide susceptibility mapping. Geocarto Int. 32 (4), 367–385. Chen, W., Shirzadi, A., Shahabi, H., Ahmad, B.B., Zhang, S., Hong, H., Zhang, N., 2017g. A novel hybrid artificial intelligence approach based on the rotation forest ensemble and naïve Bayes tree classifiers for a landslide susceptibility assessment in Langao County, China. Geomatics Nat. Hazards Risk 8 (2), 1955–1977. Chen, W., Xie, X., Peng, J., Wang, J., Duan, Z., Hong, H., 2017h. GIS-based landslide susceptibility modelling: a comparative assessment of kernel logistic regression, Naıve-Bayes tree, and alternating decision tree models. Geomat. Nat. Haz. Risk 8, 950–973. Chen, W., Xie, X., Wang, J., Pradhan, B., Hong, H., Tien Bui, B., Duan, Z., Ma, J., 2017i. A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. Catena 151, 147–160.

Chen, W., Shahabi, H., Shirzadi, A., Li, T., Guo, C., Hong, H.H., Li, W., Pan, D., Hui, J.R., Ma, M.Z., Xi, M.N., Ahmad, B.B., 2018a. A novel ensemble approach of bivariate statistical based logistic model tree classifier for landslide susceptibility assessment. Geocarto Int. 1–32. https://doi.org/10.1080/10106049.2018.1425738.
Chen, W., Xie, X.S., Peng, J.B., Shahabi, H., Hong, H.H., Tien Bui, D., Duan, Z., Li, S.J., Zhu, A.X., 2018b. GIS-based landslide susceptibility evaluation using a novel hybrid integration approach of bivariate statistical based random forest method. Catena 1–17. https://doi.org/10.1016/j.catena.2018.01.012.
Dehnavi, A., Aghdam, I.N., Pradhan, B., Varzandeh, M.H.M., 2015. A new hybrid model using step-wise weight assessment ratio analysis (SWARA) technique and adaptive neuro-fuzzy inference system (ANFIS) for regional landslide hazard assessment in Iran. Catena 135, 122–148.
Ding, Q., Chen, W., Hong, H., 2017. Application of frequency ratio, weights of evidence and evidential belief function models in landslide susceptibility mapping. Geocarto Int. 32 (6), 619–639.
Felicísimo, Á.M., Cuartero, A., Remondo, J., Quirós, E., 2013. Mapping landslide susceptibility with logistic regression, multiple adaptive regression splines, classification and regression trees, and maximum entropy methods: a comparative study. Landslides 10 (2), 175–189.
Frank, E., 2014. Fully Supervised Training of Gaussian Radial Basis Function Networks in WEKA. 4. Department of Computer Science, University of Waikato, p. 14 Tech. Rep.
Frye, C., 2007. About the Geometrical Interval Classification Method. http://blogs.esri.com/esri/arcgis.
Gheisari, S., Meybodi, M., 2016. BNC-PSO: structure learning of Bayesian networks by Particle Swarm Optimization. Inf. Sci. 348, 272–289.
Gorsevski, P.V., Brown, M.K., Panter, K., Onasch, C.M., 2016. Landslide detection and susceptibility mapping using LiDAR and an artificial neural network approach: a case study in the Cuyahoga Valley National Park, Ohio. Landslides 13 (3), 467–484.
Guzzetti, F., Reichenbach, P., Ardizzone, F., Cardinali, M., Galli, M., 2006. Estimating the quality of landslide susceptibility models. Geomorphology 81 (1), 166–184.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11 (1), 10–18.
Hong, H., Naghibi, S.A., Pourghasemi, H.R., Pradhan, B., 2016a. GIS-based landslide spatial modeling in Ganzhou City, China. Arab. J. Geosci. 9 (2), 1–26.
Hong, H., Pourghasemi, H.R., Pourtaghi, Z.S., 2016b. Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models. Geomorphology 259, 105–118.
Hong, H., Chen, W., Xu, C., Youssef, A.M., Pradhan, B., Tien Bui, D., 2017a. Rainfall-induced landslide susceptibility assessment at the Chongren area (China) using frequency ratio, certainty factor, and index of entropy. Geocarto Int. 32 (2), 139–154.
Hong, H., Panahi, M., Shirzadi, A., Ma, T., Liu, J., Zhu, A.X., Chen, W., Kougias, I., Kazakis, N., 2017b. Flood susceptibility assessment in Hengfeng area coupling adaptive neuro-fuzzy inference system with genetic algorithm and differential evolution. Sci. Total Environ. https://doi.org/10.1016/j.scitotenv.2017.10.114.
Hong, H., Pradhan, B., Sameen, M.I., Kalantar, B., Zhu, A., 2017c. Improving the accuracy of landslide susceptibility model using a novel region-partitioning approach. Landslides 1:1–20. https://doi.org/10.1007/s10346-017-0906-8.
Hong, H.Y., Ilia, I., Tsangaratos, P., Chen, W., Xu, C., 2017d. A hybrid fuzzy weight of evidence method in landslide susceptibility analysis on the Wuyuan area, China. Geomorphology 290, 1–16.
Hong, H., Tsangaratos, P., Ilia, I., Liu, J., Zhu, A.X., Chen, W., 2018a. Application of fuzzy weight of evidence and data mining techniques in construction of flood susceptibility map of Poyang County, China. Sci. Total Environ. 625, 575–588.
Hong, H., Liu, J., Bui, D.T., Pradhan, B., Acharya, T.D., Pham, B.T., Zhu, A.X., Chen, W., Ahmad, B.B., 2018b. Landslide susceptibility mapping using J48 Decision Tree with AdaBoost, Bagging and Rotation Forest ensembles in the Guangchang area (China). Catena 163, 399–413.
Hunt, E.B., Marin, J., Stone, P.J., 1966. Experiments in Induction. Academic Press.
Jensen, K.L., Toftum, J., Friis-Hansen, P., 2009. A Bayesian Network approach to the evaluation of building design and its consequences for employee performance and operational costs. Build. Environ. 44 (3), 456–462.
Jiang, T., Dinglong, W., 2013. A landslide stability calculation method based on Bayesian network. International Symposium on Instrumentation & Measurement, Sensor Network and Automation, pp. 905–908.
Kanungo, D., Arora, M., Sarkar, S., Gupta, R., 2006. A comparative study of conventional, ANN black box, fuzzy and combined neural and fuzzy weighting procedures for landslide susceptibility zonation in Darjeeling Himalayas. Eng. Geol. 85 (3), 347–366.
Karabulut, E.M., Ibrikci, T., 2014. Effective automated prediction of vertebral column pathologies based on logistic model tree with SMOTE preprocessing. J. Med. Syst. 38 (5), 1.
Kausar, N., Majid, A., 2016. Random forest-based scheme using feature and decision levels information for multi-focus image fusion. Pattern. Anal. Applic. 19 (1), 221–236.
Kavzoglu, T., Sahin, E.K., Colkesen, I., 2015. An assessment of multivariate and bivariate approaches in landslide susceptibility mapping: a case study of Duzkoy district. Nat. Hazards 76 (1), 471–496.
Lee, S., Ryu, J.-H., Won, J.-S., Park, H.-J., 2004. Determination and application of the weights for landslide susceptibility mapping using an artificial neural network. Eng. Geol. 71 (3), 289–302.
Lee, M.J., Park, I., Lee, S., 2015. Forecasting and validation of landslide susceptibility using an integration of frequency ratio and neuro-fuzzy models: a case study of Seorak mountain area in Korea. Environ. Earth Sci. 74 (1), 413–429.
Nasiri Aghdam, I., Varzandeh, M.H.M., Pradhan, B., 2016. Landslide susceptibility mapping using an ensemble statistical index (Wi) and adaptive neuro-fuzzy inference system (ANFIS) model at Alborz Mountains (Iran). Environ. Earth Sci. 75 (7), 1–20.
Pawluszek, K., Borkowski, A., 2017. Impact of DEM-derived factors and analytical hierarchy process on landslide susceptibility mapping in the region of Rożnów Lake, Poland. Nat. Hazards 86 (2), 919–952.
Peirolo, R., 2011. Information gain as a score for probabilistic forecasts. Meteorol. Appl. 18 (1), 9–17.
Peng, L., Niu, R., Huang, B., Wu, X., Zhao, Y., Ye, R., 2014. Landslide susceptibility mapping based on rough set theory and support vector machines: a case of the Three Gorges area, China. Geomorphology 204, 287–301.
Pham, B.T., Tien Bui, D., Dholakia, M., Prakash, I., Pham, H.V., 2016. A comparative study of least square support vector machines and multiclass alternating decision trees for spatial prediction of rainfall-induced landslides in a tropical cyclones area. Geotech. Geol. Eng. 34 (6), 1807–1824.
Pourghasemi, H.R., Kerle, N., 2016. Random forests and evidential belief function-based landslide susceptibility assessment in Western Mazandaran Province, Iran. Environ. Earth Sci. 75 (3), 1–17.
Qiao, W., Li, W., Li, T., Chang, J., Wang, Q., 2017. Effects of coal mining on shallow water resources in Semiarid Regions: a case study in the Shennan mining area, Shaanxi, China. Mine Water Environ. 36, 104–113.
Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn. 1, 81–106.
Raja, N.B., Çiçek, I., Türkoğlu, N., Aydin, O., Kawasaki, A., 2017. Landslide susceptibility mapping of the Sera River Basin using logistic regression model. Nat. Hazards 85 (3), 1323–1346.
Tien Bui, D., Ho, T.-C., Pradhan, B., Pham, B.-T., Nhu, V.-H., Revhaug, I., 2016a. GIS-based modeling of rainfall-induced landslides using data mining-based functional trees classifier with AdaBoost, Bagging, and MultiBoost ensemble frameworks. Environ. Earth Sci. 75 (14), 1–22.
Tien Bui, D., Tuan, T.A., Klempe, H., Pradhan, B., Revhaug, I., 2016b. Spatial prediction models for shallow landslide hazards: a comparative assessment of the efficacy of support vector machines, artificial neural networks, kernel logistic regression, and logistic model tree. Landslides 13 (2), 361–378.
Tien Bui, D., Nguyen, Q.P., Hoang, N.-D., Klempe, H., 2017a. A novel fuzzy K-nearest neighbor inference model with differential evolution for spatial prediction of rainfall-induced shallow landslides in a tropical hilly area using GIS. Landslides 14 (1), 1–17.
Tien Bui, D., Tuan, T.A., Hoang, N.-D., Thanh, N.Q., Nguyen, D.B., Van Liem, N., Pradhan, B., 2017b. Spatial prediction of rainfall-induced landslides for the Lao Cai area (Vietnam) using a hybrid intelligent approach of least squares support vector machines inference model and artificial bee colony optimization. Landslides 14 (2), 447–458.
Wang, L.J., Guo, M., Sawada, K., Lin, J., Zhang, J., 2015. Landslide susceptibility mapping in Mizunami City, Japan: a comparison between logistic regression, bivariate statistical analysis and multivariate adaptive regression spline models. Catena 135, 271–282.
Witten, I.H., Frank, E., Mark, A.H., 2011. Data Mining: Practical Machine Learning Tools and Techniques. Third edition. Morgan Kaufmann, Burlington, USA.
Yesilnacar, E., 2005. The Application of Computational Intelligence to Landslide Susceptibility Mapping in Turkey. Ph.D Thesis. Department of Geomatics, University of Melbourne, p. 423.
Youssef, A.M., Al-Kathery, M., Pradhan, B., 2015. Landslide susceptibility mapping at Al-Hasher area, Jizan (Saudi Arabia) using GIS-based frequency ratio and index of entropy models. Geosci. J. 19 (1), 113–134.
Zhang, K., Wu, X., Niu, R., Yang, K., Zhao, L., 2017. The assessment of landslide susceptibility mapping using random forest and decision tree methods in the Three Gorges Reservoir area, China. Environ. Earth Sci. 76 (11), 405.
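
Supplementary sketch for Section 4.5. The SVM benchmark relies on two quantities named in the text: an RBF kernel with kernel width γ, and the AUC used to compare models. The Python sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes the common parameterization K(x, y) = exp(−γ‖x − y‖²), which the paper does not spell out, and it computes the AUC by pairwise comparison of scores (the Mann-Whitney formulation). Only γ = 0.06 comes from the text; the function names `rbf_kernel` and `auc`, the toy labels, and the toy scores are hypothetical.

```python
import math

GAMMA = 0.06  # kernel width reported in Section 4.5


def rbf_kernel(x, y, gamma=GAMMA):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).

    The parameterization is an assumption; the software used in the
    study may define the kernel width differently.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for landslide, 0 for non-landslide.
    scores: model outputs, higher meaning more susceptible.
    A tie between a positive and a negative score counts as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

if __name__ == "__main__":
    print("kernel value:", rbf_kernel([1.0, 0.0], [0.0, 0.0]))
    print("toy AUC:", auc(labels, scores))
```

The pairwise form was chosen because it needs no threshold sweep and makes the interpretation of the AUC explicit: it is the fraction of landslide/non-landslide pairs that the model ranks correctly. For datasets of realistic size a rank-based implementation would be faster.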