International Journal of Remote Sensing Vol. 33, No. 10, 20 May 2012, 3301–3320

Comparison of artificial neural networks and support vector machine classifiers for land cover classification in Northern China using a SPOT-5 HRG image

XIANFENG SONG†‡, ZHENG DUAN† and XIAOGUANG JIANG*‡

†Graduate University of the Chinese Academy of Sciences, Beijing 100049, PR China
‡Academy of Opto-Electronics, Chinese Academy of Sciences, Beijing 100086, PR China

*Corresponding author. Email: [email protected]


(Received 4 January 2010; in final form 6 February 2011)

This article presents a detailed comparison of two types of advanced non-parametric classifier used in remote sensing for land cover classification. A SPOT-5 HRG image of Yanqing County, Beijing, China, in which agriculture and forest dominate land use, was used. Artificial neural networks (ANNs), including the adaptive backpropagation (ABP) algorithm, Levenberg–Marquardt (LM) algorithm, Quasi-Newton (QN) algorithm and radial basis function (RBF), were carefully tested. The LM–ANN and RBF–ANN, which outperformed the other two, were selected for a detailed comparison with support vector machines (SVMs). The experiments show that well-trained ANNs and SVMs have no significant difference in classification accuracy, although the SVM usually performs slightly better. Analysis of the effect of training set size highlights that the SVM classifier is highly tolerant of small training sets and avoids the insufficient-training problem of ANN classifiers. The testing also illustrates that ANNs and SVMs can vary greatly in training time. The LM–ANN can converge very quickly, but not in a stable manner. By contrast, the training of the RBF–ANN and SVM classifiers is fast and repeatable.

1. Introduction

Land cover mapping and monitoring is essential for discovering changes in the physical material on the surface of the Earth. Analysis of remote-sensing images has become the primary method for extracting land cover information because of the many advantages of remote sensing, such as large-area coverage, cost efficiency and repeatability over time (Pal and Mather 2004). Land cover mapping moves forward with innovations in remote-sensing sensor technology, and such progress in turn advances the development of new classification methods in remote sensing. Among numerous classification methods, the maximum likelihood classifier (MLC) is the most widely used, owing to its simplicity, its availability in most software packages and the acceptable results it generates (Huang et al. 2002, Zhang et al. 2007). However, MLC has a few limitations. First, it is a statistical parametric method that assumes that the data of every class are normally distributed, which is often not true for remote-sensing images (Dixon and Candade 2008).


Secondly, MLC has difficulty handling texture features in high spatial resolution images, particularly when the training set is small (Jin et al. 2005). Thirdly, MLC is subject to the curse of dimensionality, or Hughes effect (Hughes 1968, Pal and Mather 2003): given a fixed number of training samples, its performance declines as the number of dimensions increases, because the estimates of the statistical parameters needed to calculate the probability become less reliable (Oommen et al. 2008).

To overcome the problems of statistical methods, a non-parametric method, artificial neural networks (ANNs), has been introduced for land cover classification (Paola and Schowengerdt 1995, Kavzoglu and Mather 2003, Zhang et al. 2007). ANNs make no assumption about the statistical distribution of the data, thus avoiding the parameter-estimation problems of MLC. Many studies have shown that ANNs yield better results than MLC (Paola and Schowengerdt 1995); furthermore, ANNs perform well with small training sets (Hepner et al. 1990, Foody et al. 1995, Foody and Mathur 2006) and are more resistant to noise or incomplete sampling of spectra in simulated remote-sensing data (Yool 1998). However, the performance of ANNs can vary considerably depending on the choice of network architecture and training algorithm. ANNs with inappropriate configuration and training may even yield significantly lower classification accuracy than MLC, despite their theoretical advantages (Kavzoglu and Mather 1999, 2003).

The other advanced non-parametric method, support vector machines (SVMs), has also been successfully applied to land cover classification (Huang et al. 2002, Zhu and Blumberg 2002, Pal and Mather 2004, 2005). It is based on statistical learning theory and adopts structural risk minimization (SRM) for discriminating class members, which minimizes the probability of misclassifying a previously unseen data point drawn randomly from a fixed but unknown probability distribution (Vapnik 1995). Conventional statistical methods such as MLC, and even ANNs, minimize the misclassification error on the training data by empirical risk minimization (ERM), which can suffer from overfitting or poor generalization, that is, high accuracy on the training data but low accuracy on unseen data. In addition, SVM requires the selection of a kernel function and associated parameters, which may have a significant effect on its performance.

In a test of the above classification methods for land cover classification using ETM+ data in the Littleport area of eastern England and the region of La Mancha Alta in central Spain, SVM achieved higher classification accuracy than ANNs and MLC (Pal and Mather 2005). However, recent experimental work using Landsat-5 Thematic Mapper (TM) data in the coastline area of Florida, USA, indicates that ANNs and SVMs have almost the same classification accuracy, while both outperform MLC (Dixon and Candade 2008). Experiments have also tested the effect of spectral resolution on the discrimination of five important Brazilian sugarcane varieties using different data sources, in which the overall classification accuracy of SPOT-5 HRG (66%) was much lower than that of ETM+ (72%), ASTER (72%) and Hyperion (87%), even though SPOT-5 HRG has a finer spatial resolution (Galvao et al. 2006).
From the above, the performance of classification methods appears to be affected by many factors, including the data source, training set size and selection of parameters (Lu and Weng 2007), and the choice of classification method for a particular application depends on knowledge or experience gained from case-by-case experiments.


This work evaluated the performance of two advanced non-parametric methods, ANNs and SVM, for land cover classification using a SPOT-5 HRG image of a hilly agricultural area of Northern China. The agricultural land cover in such areas of China is fragmented because of the country's specific land allocation policy: the agricultural land was split into many small parcels allocated to households, and farmers are free to decide what to plant under private land use rights, so continuous farmland may exhibit a variety of cropping patterns and growing stages. For land use mapping or precise agricultural investigation in China, SPOT data are often the most extensive data source; they have low dimensionality (often four or five bands) but a fine spatial resolution of 10 m. The assessment of these two advanced classifiers was therefore conducted using SPOT imagery in this particular situation of Northern China.

2. Materials and methods

2.1 Study area and data set

The study area is located in Yanqing County, Beijing, China, covering latitude 40° 30′ 19″ N–40° 34′ 21″ N and longitude 116° 4′ 26″ E–116° 10′ 43″ E (see figure 1). The elevation ranges from 100 to 500 m. The area is characterized by a temperate continental monsoon climate with four distinct seasons. The mean annual temperature is about 8°C, with a mean minimum of −12°C and a mean maximum of 24°C. The annual precipitation is about 0.5 m, and the rainfall mainly occurs between May and September, which is also the period of farming activity. Agriculture dominates this area with arable land cover. The cropping system is one crop per annum, and the main cultivations are maize and soybean. Unlike developed countries, where agricultural land consists of large continuous parcels easily farmed with machinery, the agricultural land in most areas of China is fragmented into small parcels, usually less than 0.2 ha.

Figure 1. (a) The study area in Yanqing, Beijing and (b) the area of a false colour composite image based on a SPOT-5 image (near-infrared, red and green bands). The main areas coloured red, green, blue, black and grey on the image represent forest, crop, builtup, water and bare lands, respectively.


Since the land reforms in rural China in the mid-1980s, these parcels have been allocated to households on the basis of household population, average grain consumption and land locality or quality. With land use rights granted by the government, households decide the crop species and planting time on their parcels themselves, so the small land parcels and different cropping patterns present varied characteristics in remote-sensing images, which makes land cover classification in such areas complex. In addition to the availability of data and our familiarity with the area, this area was selected because it represents typical agricultural land in Northern China. This study tested the performance of two advanced classification methods in handling the complexities of these land use patterns, providing some insight into the selection of an appropriate method for land cover mapping in such areas of Northern China.

SPOT is the most widely used data source for land cover mapping in China. A SPOT-5 HRG image acquired on 23 May 2004 was used in this study. It has a panchromatic band at 2.5 m spatial resolution and four multispectral bands: three bands (green: 0.50–0.59 µm; red: 0.61–0.68 µm; near-infrared (NIR): 0.78–0.89 µm) at 10 m spatial resolution and one shortwave infrared (SWIR) band (1.58–1.75 µm) at 20 m spatial resolution. In this work, the SWIR band was resampled to 10 m using the nearest neighbour algorithm, and the resulting four-band multispectral image was used for land cover classification (Hill et al. 2007).

2.2 Artificial neural networks

The ANN is a non-parametric method that does not rely on the statistical frequency distribution of the data. Among the numerous kinds of neural network, the most widely used in remote-sensing classification is the multilayer perceptron architecture, which consists of one input layer, at least one hidden layer and one output layer (Kavzoglu and Mather 2003). Each layer consists of several neuron nodes, and all the nodes in a layer are connected to the nodes in the adjacent layers, but there are no interconnections between nodes within the same layer. Each interconnection carries an associated weight, and each node computes the weighted sum of its inputs and passes the sum through an activation function that provides the output value of the node (Kavzoglu and Mather 1999).

For image classification in this work, the backpropagation (BP) neural network was adopted. The input layer represents the image, with each node corresponding to one band. The hidden layers are used for computation, and the number of layers and the nodes within each layer are determined by the user. The output layer represents the classification result, with each node corresponding to one class. The network topology is adjusted experimentally. A three-layer network structure was adopted in this work (see figure 2), as a single hidden layer has been proved adequate for land cover classification (Paola and Schowengerdt 1995, Huang et al. 2002). Baum and Haussler (1989) suggested an approach to calculating the number of nodes in the hidden layer, but in practice there are no rules for determining the exact number of neurons (Singh et al. 2010); we used 30 neurons in our application. The activation function of our network is the simple logistic sigmoid function, which has two big advantages: it has a continuous derivative, making it suitable for backpropagation, and its derivative is easily calculated.


Figure 2. Block diagram of a backpropagation neural network with a single hidden layer. Note: NIR, near-infrared; SWIR, shortwave infrared band.

Each node j computes

x_j = \sum_{i=1}^{m} w_{ij} y_i,   (1)

where x_j is the weighted sum of the signals that node j receives from all nodes to which it is connected in the preceding layer, y_i is the output of node i in the preceding layer, w_{ij} is the weight between nodes i and j, and m is the total number of connected nodes. The output of node j and its derivative are given by the logistic sigmoid function:

y_j = \frac{1}{1 + e^{-x_j}}, \qquad y_j' = y_j (1 - y_j),   (2)

where y_j is the output of node j and y_j' is its derivative.

The training of a neural network determines an appropriate set of weights by learning the characteristics of the training samples. The network in this work is a non-linear feed-forward neural network whose weights are iteratively determined by BP algorithms, in which the computed error is propagated backwards through the network and the weights are adjusted accordingly to minimize the error. The error is the sum of squared differences between the network output and the expected (predefined) output, computed over the whole training set. To adjust the weights, the derivative of the error function with respect to the network weights is first calculated, and the weights are then changed so that the error decreases; for this reason, the activation functions of a BP network must be differentiable. The forward and backward passes continue until the network converges to a state where the computed error is less than a specified threshold.
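To make the forward and backward passes concrete, the following sketch implements one BP training iteration for a single-hidden-layer logistic network in NumPy (a minimal illustration under our own assumptions, not the authors' Matlab 6.5 implementation; the layer sizes follow figure 2, while the learning rate and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Equation (2): logistic activation.
    return 1.0 / (1.0 + np.exp(-x))

# Dimensions as in figure 2: 4 input bands, 30 hidden nodes, 9 output classes.
n_in, n_hidden, n_out = 4, 30, 9
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))

def train_step(X, T, eta=0.1):
    """One forward/backward pass over a batch X with one-hot targets T."""
    global W1, W2
    # Forward pass: equation (1) then equation (2) at each layer.
    H = sigmoid(X @ W1)                 # hidden-layer outputs
    Y = sigmoid(H @ W2)                 # output-layer outputs
    # Backward pass: propagate the error derivative, using y' = y(1 - y).
    dY = (Y - T) * Y * (1 - Y)          # output-layer deltas
    dH = (dY @ W2.T) * H * (1 - H)      # hidden-layer deltas
    # Gradient-descent weight adjustment.
    W2 -= eta * H.T @ dY
    W1 -= eta * X.T @ dH
    return 0.5 * np.sum((Y - T) ** 2)   # SSE for this batch

# Toy data: 8 fake pixels with 4 band values each, random class targets.
X = rng.random((8, n_in))
T = np.eye(n_out)[rng.integers(0, n_out, 8)]
for epoch in range(100):
    sse = train_step(X, T)              # iterate until the SSE drops below a threshold
```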

Table 1. The differences of ABP, LM, QN and RBF in their weight adjustment.

Optimization method | Weight adjustment
ABP algorithm | Δw_t = −(η Jᵀe) + α Δw_{t−1}, where η is the learning rate and α the momentum factor
LM algorithm | Δw = −(JᵀJ + μI)⁻¹ Jᵀe, where μ is an adaptive damping factor and I the identity matrix
QN algorithm | Δw = −H_k⁻¹ Jᵀe, where H_k⁻¹ is the inverse Hessian updated at the kth iteration
RBF algorithm | Δw = −H⁻¹ Jᵀe

Notes: ABP, adaptive backpropagation; LM, Levenberg–Marquardt; QN, Quasi-Newton; RBF, radial basis function. The Jacobian (J) and Hessian (H) matrices contain the first and second partial derivatives, respectively, of the error vector e with respect to the network weights, while Δw_t and Δw_{t−1} denote the weight adjustments at iterative steps t and t − 1.

The BP algorithm performs gradient descent on the error surface to minimize the network error using a learning rate η. If η is too high, the network may settle into a local minimum; if it is too small, training takes a very long time (Singh et al. 2005). For better computational performance and global convergence, four popular optimization methods, rather than plain BP, were tested for adjusting the weights of the neural network in this work: the adaptive backpropagation (ABP) algorithm, the Levenberg–Marquardt (LM) algorithm, the Quasi-Newton (QN) algorithm and the radial basis function (RBF) algorithm. Their weight-adjustment formulas are listed in table 1 (Singh et al. 2010).

The ABP adds a momentum factor (α) to the network, which decreases the probability that the network will converge at a local minimum and reduces the training time. The LM is a second-order search method that interpolates between the Gauss–Newton (GN) algorithm and gradient descent by means of the damping factor µ, which is non-negative and adjusted at each iteration step. If an iteration leads to a rapid reduction of the network error, µ is decreased, bringing the algorithm closer to the GN algorithm; otherwise µ is increased, giving a step closer to the gradient descent direction. The QN is an alternative to Newton's method for fast optimization that does not require calculation of second derivatives (the Hessian matrix). Instead of the true Hessian, an initial matrix H₀ is chosen (usually H₀ = I) and subsequently updated by an approximate Hessian at each iteration; the update is computed as a function of the gradient using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) formula.

Numerous setting tests following the recommendations of Kavzoglu and Mather (2003) and Singh et al. (2005, 2010) were conducted to determine appropriate neural network parameters. Matlab 6.5 was used for network training and classification. The predefined networks were trained for long running times to avoid the insufficient training that can result in low classification accuracy (Pal and Mather 2006).
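As an illustration of the second-order update in table 1, the sketch below applies one Levenberg–Marquardt step to a generic least-squares problem (a schematic in NumPy under our own assumptions; the residuals/jacobian callables, the toy fitting problem and the damping schedule are illustrative choices, not the routine used in the Matlab implementation):

```python
import numpy as np

def lm_step(w, residuals, jacobian, mu, mu_up=10.0, mu_down=0.1):
    """One LM update from table 1: dw = -(J^T J + mu*I)^(-1) J^T e."""
    e = residuals(w)                      # error vector at the current weights
    J = jacobian(w)                       # first derivatives of e w.r.t. w
    A = J.T @ J + mu * np.eye(w.size)     # damped Gauss-Newton normal matrix
    dw = -np.linalg.solve(A, J.T @ e)
    if np.sum(residuals(w + dw) ** 2) < np.sum(e ** 2):
        return w + dw, mu * mu_down       # error fell: move toward Gauss-Newton
    return w, mu * mu_up                  # error rose: move toward gradient descent

# Toy usage: fit y = exp(a*x) to noisy data by adjusting the single weight a.
x = np.linspace(0.0, 1.0, 20)
y = np.exp(1.5 * x) + 0.01 * np.random.default_rng(1).normal(size=20)
res = lambda w: np.exp(w[0] * x) - y
jac = lambda w: (x * np.exp(w[0] * x)).reshape(-1, 1)
w, mu = np.array([0.0]), 1.0
for _ in range(50):
    w, mu = lm_step(w, res, jac, mu)      # w[0] approaches 1.5
```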


2.3 Support vector machines

The SVM is also a non-parametric classification method; it finds the hyperplane that optimally separates two classes. The optimal hyperplane is constructed from training data, and its generalization ability is validated using testing data. Given a training data set of two separable classes with r samples, represented by (x_1, y_1), ..., (x_r, y_r), where x ∈ R^N is an N-dimensional feature vector and y ∈ {+1, −1} is the class label, the two classes can be separated by various (N − 1)-dimensional hyperplanes, but only one separating hyperplane maximizes the distance from the hyperplane to the closest training samples of each class (Huang et al. 2002). This hyperplane is defined as the optimal hyperplane, and the points that constrain the width of the margin are called support vectors (see figure 3(a)). A hyperplane is defined by w · x_i + b = 0, where x_i is a point lying on the hyperplane, w is the normal to the hyperplane and b is the bias, indicating the distance of the hyperplane from the origin. For a linearly separable case, a separating hyperplane can be defined for the two classes as

w \cdot x_i + b \geq +1, \quad \text{for } y_i = +1;   (3)

w \cdot x_i + b \leq -1, \quad \text{for } y_i = -1.   (4)

The two inequalities, equations (3) and (4), can be combined into a single inequality:

y_i (w \cdot x_i + b) - 1 \geq 0.   (5)

The training data points on the two hyperplanes defined by w · x_i + b = ±1, which are parallel to the optimal hyperplane, are the support vectors (Mathur and Foody 2008, Kavzoglu and Colkesen 2009). The margin between these planes is 2/‖w‖, so the optimal separating hyperplane, which separates the classes with the maximum margin, can be found by minimizing ‖w‖² under the constraint of equation (5).


Figure 3. Basis of SVM for binary classification. (a) Linearly separable situation and (b) non-linearly separable situation. The squares and triangles refer to two different classes to be separated. The dashed lines drawn parallel to the separating line mark the distance between the separating line and the closest vectors to the line. The vectors that constrain the width of the margin are the support vectors (circled points). Note: SVM, support vector machine.

The optimization problem is thus

\min \left( \frac{1}{2} \|w\|^2 \right).   (6)

If the classes are not linearly separable, the constraints of equation (5) cannot be satisfied, so slack variables ξ_i, i = 1, ..., r, indicating the distance of a sample from the hyperplane passing through the support vectors of its class, are introduced to relax the constraints (Foody and Mathur 2004) (see figure 3(b)). The constraint of equation (5) can then be written as

y_i (w \cdot x_i + b) - 1 + \xi_i \geq 0,   (7)

and the optimal hyperplane can be found by solving the following optimization problem:

\min \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^{r} \xi_i \right).   (8)

Under the inequality constraints of equation (7), the first part of equation (8) maximizes the margin between the classes, as in the linearly separable case, while the second part penalizes samples located on the incorrect side of the hyperplane, with the factor C controlling the relative balance of these two competing objectives. C must be set by the user, and a larger value of C assigns a higher penalty to errors (Pal and Mather 2004). When the separating hyperplane cannot be defined by linear equations on the training data, a kernel function K(x, x_i) can be introduced to convert the non-linear boundaries in the original data space into linear ones in a high-dimensional space. This transformation spreads the data out in such a way that a linear separating hyperplane can be fitted, as shown in figure 4. The classification decision function then becomes

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{r} \alpha_i y_i K(x, x_i) + b \right),   (9)

Figure 4. (a) Transformation from input space into feature space with a kernel function. (b) Separation is complex in input space but easy in feature space as the kernel function converts a non-linear boundary in the input space into a linear one in the feature space.


where α_i, i = 1, ..., r, are Lagrange multipliers and K(x, x_i) is a kernel function; the magnitude of α_i is determined by the parameter C (Foody and Mathur 2004). There are four main types of kernel function: linear, polynomial, RBF and sigmoid (Kavzoglu and Colkesen 2009), and the linear kernel is a special case of the RBF kernel (Chang and Lin 2001). The RBF kernel has been widely used and has been reported to yield the best results in many remote-sensing applications (Oommen et al. 2008); it was therefore applied in this work for land cover classification of the SPOT-5 HRG image. The RBF kernel is

K(x, x_i) = \exp(-\gamma \|x - x_i\|^2), \quad \gamma > 0,   (10)

where γ is the parameter controlling the width of the Gaussian kernel.
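As a concrete reading of equations (9) and (10), the sketch below evaluates the RBF-kernel decision function of a binary SVM from already-trained support vectors (a toy illustration with invented support vectors, multipliers and bias, not values from this study):

```python
import numpy as np

def rbf_kernel(x, xi, gamma):
    # Equation (10): K(x, xi) = exp(-gamma * ||x - xi||^2).
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def decision(x, support_vectors, alphas, labels, b, gamma):
    # Equation (9): f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b).
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return int(np.sign(s + b))

# Toy example: three support vectors in a four-band feature space.
svs = np.array([[0.2, 0.4, 0.6, 0.3],
                [0.8, 0.1, 0.5, 0.9],
                [0.4, 0.7, 0.2, 0.6]])
alphas = np.array([0.5, 0.3, 0.8])    # Lagrange multipliers, bounded by C
labels = np.array([+1, -1, +1])       # class labels of the support vectors
print(decision(np.array([0.3, 0.5, 0.4, 0.4]), svs, alphas, labels, b=-0.1, gamma=3.0))
```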

The SVM was originally designed for binary classification, and two main methods, 'one against one' and 'one against all', have been widely used to extend it to multiclass classification. Both methods convert the multiclass problem into a set of binary problems, enabling the basic SVM approach to be used (Mathur and Foody 2008). In the 'one against one' method, N(N − 1)/2 SVMs are constructed, one for each pair of classes, where N is the number of classes. Applying each SVM to a test vector gives one vote to the winning class, and the pixel is then assigned the label of the class with the most votes (Pal and Mather 2004). The 'one against all' method, by contrast, compares each class with all the others and constructs N SVMs. The multiclass problem can also be formulated directly as a single optimization problem, but the number of parameters to be estimated then increases considerably (Melgani and Bruzzone 2004). The 'one against one' method has been reported to be more suitable for practical use (Hsu and Lin 2002, Huang et al. 2002), so it was selected for land cover classification in this work.

The performance of the SVM also depends on the choice of kernel function and its associated parameters. For the RBF kernel used in this work, two parameters must be defined by the user: the regularization parameter C and the kernel width γ. The grid-search method with cross-validation, as implemented in the LIBSVM package developed by Chang and Lin (2001), was used to determine these two parameters automatically.

2.4 Classification strategies

Seven land cover classes were initially identified in this study: water, builtup, dense (dense forest), sparse (sparse forest), unused (unused land), barehill and agricultural land. However, different plant species and growth stages on agricultural land present different spectral characteristics in a remote-sensing image, so agricultural land was further divided into three distinct subclasses (agr1, agr2 and agr3) according to the spectral characteristics presented in the image. A total of nine land cover classes was thus finally used in the classification by either ANNs or SVMs.

A pool of ground reference data was generated through visual interpretation of the image, aided by the 2.5 m panchromatic band of SPOT HRG and knowledge obtained from a survey of the area. Eight training subsets and one testing subset were generated by randomly sampling the reference data pool. The training subsets contain 450, 900, 1350, 1800, 2250, 2700, 3150 and 3600 samples, respectively; within each subset, the nine classes are equally represented, giving 50, 100, 150, 200, 250, 300, 350 and 400 samples per class for the eight subsets, respectively.

The testing subset contains 1800 samples, with 200 samples per class. It was formed without including any samples from the eight training subsets, thus avoiding the bias in accuracy evaluation caused by using the same pixels for both training and testing.

The ANNs with the ABP, LM, QN and RBF training algorithms were first compared using the above training and testing subsets. The algorithms that outperformed the others were then selected for further comparison with the SVM. To evaluate the effect of training set size on classification by ANNs and SVMs, the same training and testing sample sets were applied to each method. Classification accuracy was assessed using the kappa coefficient, and the significance of the differences among classification results was computed using the Z-statistic described in §3.

3. Evaluation of performance

The confusion matrix for each classification result was first generated by comparison with the testing data set. Based on the confusion matrix, classification accuracy was evaluated at two levels, overall and per category, and the significance of the difference between pairs of classification results was then calculated. At the overall level, the kappa coefficient and its asymptotic variance were calculated from the confusion matrix of each classification result as

\kappa = \frac{n \sum_{i=1}^{k} n_{ii} - \sum_{i=1}^{k} n_{i+} n_{+i}}{n^2 - \sum_{i=1}^{k} n_{i+} n_{+i}},   (11)

\mathrm{var}(\kappa) = \frac{1}{n} \left[ \frac{\theta_1 (1 - \theta_1)}{(1 - \theta_2)^2} + \frac{2 (1 - \theta_1)(2 \theta_1 \theta_2 - \theta_3)}{(1 - \theta_2)^3} + \frac{(1 - \theta_1)^2 (\theta_4 - 4 \theta_2^2)}{(1 - \theta_2)^4} \right],   (12)

where

\theta_1 = \frac{1}{n} \sum_{i=1}^{k} n_{ii}, \quad \theta_2 = \frac{1}{n^2} \sum_{i=1}^{k} n_{i+} n_{+i}, \quad \theta_3 = \frac{1}{n^2} \sum_{i=1}^{k} n_{ii} (n_{i+} + n_{+i}), \quad \theta_4 = \frac{1}{n^3} \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} (n_{j+} + n_{+i})^2,

and where κ is the kappa coefficient, var(κ) is the asymptotic variance of kappa, n is the total number of elements in the confusion matrix, n_ii is the ith diagonal value of the confusion matrix, n_i+ and n_+i are the sums of row i and column i of the confusion matrix, respectively, and k is the number of rows (or columns) of the confusion matrix.

At the category level, the conditional kappa coefficient and its asymptotic variance were calculated from the confusion matrix for each individual class as

\kappa_i = \frac{n n_{ii} - n_{i+} n_{+i}}{n n_{i+} - n_{i+} n_{+i}},   (13)

\mathrm{var}(\kappa_i) = \frac{n (n_{i+} - n_{ii})}{[n_{i+} (n - n_{+i})]^3} \left[ (n_{i+} - n_{ii})(n_{i+} n_{+i} - n n_{ii}) + n n_{ii} (n - n_{i+} - n_{+i} + n_{ii}) \right],   (14)

where κ_i and var(κ_i) are the conditional kappa coefficient and asymptotic variance for the ith class, respectively, and n, n_ii, n_i+ and n_+i are as previously defined.

The Z-statistic (Congalton and Mead 1986) was used to assess the significance of the difference between classifications obtained by two methods. The difference between two classification results, or between the same individual class in two classifications, is considered significant at the 95% confidence level if the absolute value of the Z-statistic exceeds 1.96. The Z-statistic is calculated as

Z_{ab} = \frac{|\kappa_a - \kappa_b|}{\sqrt{\mathrm{var}(\kappa_a) + \mathrm{var}(\kappa_b)}},   (15)

where Z_ab is the Z-statistic for the comparison of classifications a and b, κ_a and κ_b are their kappa coefficients, and var(κ_a) and var(κ_b) are the corresponding asymptotic variances.
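Equations (11), (12) and (15) translate directly into code. The following NumPy sketch (our own rendering of the formulas above, assuming a square confusion matrix with one row and column per class) computes the kappa coefficient, its asymptotic variance and the Z-statistic for a pair of classifications; conditional kappa, equations (13) and (14), follows the same pattern per class:

```python
import numpy as np

def kappa_and_variance(cm):
    """Kappa, equation (11), and its asymptotic variance, equation (12)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    rows, cols = cm.sum(axis=1), cm.sum(axis=0)      # n_i+ and n_+i
    diag = np.diag(cm)                               # n_ii
    kappa = (n * diag.sum() - rows @ cols) / (n**2 - rows @ cols)
    t1 = diag.sum() / n
    t2 = rows @ cols / n**2
    t3 = (diag * (rows + cols)).sum() / n**2
    # theta_4 weights each cell (i, j) by (n_j+ + n_+i)^2.
    t4 = (cm * (rows[None, :] + cols[:, None]) ** 2).sum() / n**3
    var = (t1 * (1 - t1) / (1 - t2)**2
           + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2)**3
           + (1 - t1)**2 * (t4 - 4 * t2**2) / (1 - t2)**4) / n
    return kappa, var

def z_statistic(cm_a, cm_b):
    """Equation (15): Z > 1.96 marks significance at the 95% level."""
    ka, va = kappa_and_variance(cm_a)
    kb, vb = kappa_and_variance(cm_b)
    return abs(ka - kb) / np.sqrt(va + vb)
```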

4. Results and discussion

4.1 Training and testing of ANNs

The training of each neural network algorithm was repeated 10 times using the 1800-sample training set to guard against random effects. All the well-trained ANNs, which converged at a given sum of squared error (SSE), achieved high classification accuracy on the testing sample set. Among the four algorithms, the ABP–ANN and QN–ANN converge very slowly, needing about 200 s even in their best training cases. By contrast, the LM–ANN can converge quickly, as LM uses some second-order information from Newton's method (Singh et al. 2010); however, of its 10 training runs, one converged in under 10 s, four took 50–75 s and five took more than 300 s, which may be caused by randomness or uncertainty in adjusting the gradient descent during the iterative process. Among the four algorithms, the radial basis network achieves relatively higher classification accuracy, although its training could not reach the goal we set even after 100 epochs. It approximates the discriminant functions by dynamically adding new neurons to the hidden layer of the network, an efficient design unlike ABP, LM and QN, whose neurons cannot be changed after network initialization. For each algorithm, the trained network with the best testing result was selected. Table 2 shows their training times and testing accuracies, and figure 5 illustrates their convergence rates. Considering both convergence time and classification accuracy, the LM–ANN and RBF–ANN were selected for a detailed comparison with the SVM.

Table 2. Performances of neural network algorithms.

Algorithm | Training parameters | No. of epochs | Time (s) | Learning speed (s) | Testing kappa coefficient
ABP | α = 0.10; SSE = 0.01 | 3545 | 194.05 | 18.27 | 0.9360
LM | SSE = 0.01 | 18 | 6.44 | 2.80 | 0.9270
QN | SSE = 0.01 | 3151 | 221.44 | 14.23 | 0.9120
RBF | Spread = 0.90; SSE = 0.01 | 100 | 29.42 | 3.40 | 0.9430

Note: SSE, sum of squared error; ABP, adaptive backpropagation; LM, Levenberg–Marquardt; QN, Quasi-Newton; RBF, radial basis function.


Figure 5. Convergence rates of neural network algorithms. Four neural network classifiers were trained using the same training samples: (a) ABP algorithm; (b) LM algorithm; (c) QN algorithm; and (d) RBF. Note: ABP, adaptive backpropagation; LM, Levenberg–Marquardt; QN, Quasi-Newton; RBF, radial basis function; SSE, sum of squared error.

4.2 Training and testing of SVM

The SVM with an RBF kernel, as suggested by Oommen et al. (2008), was tested using the same training and testing sets as the ANNs. The key point in training and testing the SVM is to find the best penalty parameter and kernel parameter (C, γ) for land cover classification. Although a relationship between these two parameters clearly exists, it has no clear mathematical description, so we adopted the simple but effective grid-search approach to find the best (C, γ), using cross-validation accuracy to evaluate the goodness of each candidate pair.
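A minimal sketch of this grid search, assuming scikit-learn's SVC in place of the original LIBSVM command-line tools (SVC wraps the same LIBSVM library, though the interface differs) and assuming training arrays X_train and y_train prepared from the reference samples; the coarse range follows table 3, while the refinement window is an arbitrary illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def grid_search(X, y, log2C, log2gamma, step):
    """Cross-validated search for (C, gamma) on a log2-spaced grid."""
    grid = {"C": 2.0 ** np.arange(log2C[0], log2C[1] + step, step),
            "gamma": 2.0 ** np.arange(log2gamma[0], log2gamma[1] + step, step)}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X, y)        # X: band values per training pixel; y: class labels
    return search.best_params_

# Coarse pass over the full grid, then a finer pass around the best region,
# mirroring the coarse-to-fine strategy reported in table 3.
best = grid_search(X_train, y_train, (-5, 15), (-15, 5), 2.0)
c0, g0 = np.log2(best["C"]), np.log2(best["gamma"])
best = grid_search(X_train, y_train, (c0 - 2, c0 + 2), (g0 - 2, g0 + 2), 0.25)
```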


Table 3. Performance of the SVM algorithm.

Search region (log₂C) | Search region (log₂γ) | Search step | Training time (s) | Best C | Best γ | Testing kappa coefficient
−5 to 15 | −15 to 5 | 2.0 (coarse) | 81.97 | 8.00 | 8.00 | 0.9430
2 to 12 | −2 to 5 | 0.2 (finer) | 596.55 | 128.00 | 3.03 | 0.9440
6 to 8 | 1 to 2 | 0.1 (finest) | 56.48 | 119.43 | 3.03 | 0.9480

Note: SVM, support vector machine.

To avoid an exhaustive search, a coarse grid search is performed first to identify a target region on the grid, and a finer grid search on that region is then conducted until the exact region is identified. As recommended in the instructions for the LIBSVM tools, the search region and search step are defined in terms of the base-2 logarithms of C and γ. In this work, the grid search was run three times, taking about 82, 597 and 56 s, respectively, and each run found the best (C, γ) for its particular search region and step (see table 3). The three search regions used sequentially are illustrated in figure 6. The results in table 3 show that all the (C, γ) pairs identified by the three searches lead to high classification accuracy; compared with the coarse search, the finer searches do not significantly improve the classification accuracy, which also means that high-quality (C, γ) values span a wide range.

4.3 Effect of training set size

The LM–ANN, RBF–ANN and SVM classifiers were each trained using the eight training sets, and the calibrated classifiers were then applied to the testing set. The classification accuracy of all results was evaluated, and figure 7 illustrates the kappa coefficient versus training set size for the three methods. As can be seen in figure 7, the kappa coefficient of the land cover classification by the SVM is significantly higher than that of the ANNs for small training sets. This indicates that the SVM is quite tolerant of training set size, retaining strong generalization capability even with a small set of training samples. The reason is that only the support vectors, the few samples on the margin, determine the SVM classification; the other samples contribute nothing further (Foody and Mathur 2004), so even a small sample set can be split into a few groups to train the SVM and obtain its optimal parameters by cross-validation. The results also indicate that training set size has a substantial effect on the performance of ANNs: ANNs use all training samples for network learning, and a small number of training samples is often insufficient to capture the characteristics of the classes (Foody et al. 1995, Kavzoglu 2009).

Figure 7 also shows that the kappa coefficients of the three classifiers increase with training set size but decrease again after peaking at a particular training set size, which might be caused by overfitting. These experimental results suggest appropriate training set sizes for sufficiently training the three classifiers.



Figure 6. Mining the best (C, γ) for the SVM classifier using a grid-search approach. (a) A coarse grid search on C = 2^−5 to 2^15, γ = 2^−15 to 2^5 is conducted first; (b) a finer grid search on the identified 'better' region C = 2^2 to 2^12, γ = 2^−2 to 2^5 is then carried out; (c) a final grid search is conducted with C = 2^6 to 2^8, γ = 2^1 to 2^2, and the best (C, γ) is found at (2^6.9, 2^1.6). Note: SVM, support vector machine.



Figure 7. Kappa coefficients versus training set sizes. Note: LM, Levenberg–Marquardt; RBF, radial basis function; SVM, support vector machine.

In this experiment, the SVM needed 100 samples per class to be well trained for its highest classification accuracy, whereas the RBF–ANN and LM–ANN needed 200 and 300 samples per class, respectively.

4.4 Kappa analysis

The sufficiently trained LM–ANN, RBF–ANN and SVM classifiers were used to classify the SPOT image. Based on the testing sample data, the confusion matrices were first calculated, and the classification accuracies of the three classifiers were then assessed by computing the kappa coefficient, conditional kappa coefficient and associated asymptotic variances at the overall and category levels. The significance of the differences between the classification results of the three methods was also evaluated by computing the Z-statistic. Table 4 shows the results of the kappa analysis of the three classification results. Both the overall accuracy and the kappa coefficient of the SVM are slightly higher than those of the two ANNs; however, the differences are not significant, since all Z-statistics are smaller than 1.96 (see table 5).

Table 4. Kappa analysis.

 | LM–ANN | RBF–ANN | SVM
Overall accuracy | 0.9494 | 0.9489 | 0.9521
Kappa coefficient | 0.9431 | 0.9425 | 0.9480
Variance | 0.000034 | 0.000034 | 0.000031

Note: LM, Levenberg–Marquardt; RBF, radial basis function; ANN, artificial neural network; SVM, support vector machine.

Table 5. Z-statistics among classification results.

Pair comparison | Z-statistic
LM–ANN vs SVM | 0.6089
RBF–ANN vs SVM | 0.6817
LM–ANN vs RBF–ANN | 0.0728

Note: LM, Levenberg–Marquardt; RBF, radial basis function; ANN, artificial neural network; SVM, support vector machine.

Table 6. Conditional kappa analysis and Z-statistics at the category level.

Class | LM–ANN κᵢ | RBF–ANN κᵢ | SVM κᵢ | Z: LM–ANN vs SVM | Z: RBF–ANN vs SVM | Z: LM–ANN vs RBF–ANN
Water | 1.0000 | 1.0000 | 1.0000 | 0 | 0 | 0
Builtup | 0.8700 | 0.8923 | 0.8850 | 0.4311 | 0.2154 | 0.6429
Dense | 0.9610 | 0.9773 | 0.9770 | 0.8770 | 0.0189 | 0.8922
Sparse | 0.9610 | 0.9508 | 0.9720 | 0.5823 | 1.0561 | 0.4746
Unused | 0.9881 | 0.9940 | 1.0000 | 1.4229 | 1.0006 | 0.5733
Barehill | 0.9886 | 0.9832 | 0.9890 | 0.0354 | 0.4655 | 0.4314
Agr1 | 0.9600 | 0.9375 | 0.9500 | 0.4544 | 0.5116 | 0.9599
Agr2 | 0.8708 | 0.8810 | 0.8740 | 0.0925 | 0.2056 | 0.2996
Agr3 | 0.8957 | 0.8738 | 0.8870 | 0.2702 | 0.3978 | 0.6664

Note: κᵢ, conditional kappa coefficient; LM, Levenberg–Marquardt; RBF, radial basis function; ANN, artificial neural network; SVM, support vector machine.

Table 6 lists the conditional kappa coefficients of the nine classes for the three methods. All three classifiers discriminate the individual classes very well, except for builtup, agr2 and agr3. Nevertheless, there is no significant difference in the labelling of any individual class among the three classifiers, as the Z-statistic for each pair comparison is less than 1.96.

4.5 Visual analysis

The kappa analysis was based on the testing sample data, which are only a part of the whole image, so it cannot present a full view of the performance of the three sufficiently trained classifiers. The total area coverage of the land cover classes on the classification thematic maps was therefore computed to find any differences between them (see figure 8). The histogram reveals no large difference in individual classes among the land cover thematic maps produced by the LM–ANN, RBF–ANN and SVM. Visual comparison of the thematic maps supports this point (see figure 9): the pseudo-colour, shape and distribution of land cover classes in the three maps are very close, except at the few sites highlighted by circles.



Figure 8. Histograms of total area coverage of land cover classes. The statistics of the land cover classes classified on the same image by the three well-trained classifiers illustrate that there is no significant difference in any class among the three classifications. Note: LM, Levenberg–Marquardt; RBF, radial basis function; SVM, support vector machine.

5. Conclusions

Two types of advanced non-parametric classifier, ANNs and SVM, were tested for land cover classification using a SPOT-5 HRG image covering a hilly agricultural area of Northern China. This work tested ANNs with a three-layer architecture and an SVM with the RBF kernel function. All classifiers produced acceptable classification results for all training set sizes, but the SVM performed slightly better than the ANNs according to the kappa analysis on the testing data. The SVM has greater generalization capability than the ANNs, particularly for small training sample sets. However, there is no significant difference in land cover classification accuracy between the two kinds of classifier when both are well trained with appropriately sized sample sets. This agrees with the conclusions drawn by Dixon and Candade (2008), but differs from those of Pal and Mather (2005).

All the ANNs except the RBF–ANN performed somewhat unstably; for example, the LM–ANN can converge very fast but sometimes converges slowly. The SVM is stable, and its training time can be predicted from its search region and search step. The SVM requires only a short training time with a coarse grid-search approach, yet its classification accuracy is competitive. From the above, it can be concluded that the SVM is a good choice for land cover classification in this work.



Figure 9. Classification thematic maps of Yanqing produced by (a) LM–ANN, (b) RBF–ANN and (c) SVM. The land cover maps produced by the three well-trained classifiers are visually very consistent with each other, with only slight differences in some particular areas; a few major differences are highlighted by circles. Note: LM, Levenberg–Marquardt; RBF, radial basis function; SVM, support vector machine; ANN, artificial neural network.


Acknowledgement

This work was supported by the National Natural Science Foundation of China (no. 40771167) and the Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education, China.

References

BAUM, E.B. and HAUSSLER, D., 1989, What size net gives valid generalization? Neural Computation, 1, pp. 151–160.

CHANG, C.C. and LIN, C.J., 2001, LIBSVM: a library for support vector machines. Software available online at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed 20 May 2009).

CONGALTON, R.G. and MEAD, R.A., 1986, A review of 3 discrete multivariate-analysis techniques used in assessing the accuracy of remotely sensed data from error matrices. IEEE Transactions on Geoscience and Remote Sensing, 24, pp. 169–174.

DIXON, B. and CANDADE, N., 2008, Multispectral landuse classification using neural networks and support vector machines: one or the other, or both? International Journal of Remote Sensing, 29, pp. 1185–1206.

FOODY, G.M. and MATHUR, A., 2004, Toward intelligent training of supervised image classifications: directing training data acquisition for SVM classification. Remote Sensing of Environment, 93, pp. 107–117.

FOODY, G.M. and MATHUR, A., 2006, The use of small training sets containing mixed pixels for accurate hard image classification: training on mixed spectral responses for classification by a SVM. Remote Sensing of Environment, 103, pp. 179–189.

FOODY, G.M., MCCULLOCH, M.B. and YATES, W.B., 1995, The effect of training set size and composition on artificial neural-network classification. International Journal of Remote Sensing, 16, pp. 1707–1723.

GALVAO, L.S., FORMAGGIO, A.R. and TISOT, D.A., 2006, The influence of spectral resolution on discriminating Brazilian sugarcane varieties. International Journal of Remote Sensing, 27, pp. 769–777.

HEPNER, G.F., LOGAN, T., RITTER, N. and BRYANT, N., 1990, Artificial neural network classification using a minimal training set: comparison to conventional supervised classification. Photogrammetric Engineering and Remote Sensing, 56, pp. 469–473.

HILL, R.A., GRANICA, K., SMITH, G.M. and SCHARDT, M., 2007, Representation of an alpine treeline ecotone in SPOT 5 HRG data. Remote Sensing of Environment, 110, pp. 458–467.

HSU, C.W. and LIN, C.J., 2002, A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, pp. 415–425.

HUANG, C., DAVIS, L.S. and TOWNSHEND, J.R.G., 2002, An assessment of support vector machines for land cover classification. International Journal of Remote Sensing, 23, pp. 725–749.

HUGHES, G.F., 1968, On mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14, pp. 55–63.

JIN, S., LI, D. and GONG, J., 2005, A comparison of SVMs with MLC algorithms on texture features. In Proceedings of SPIE – The International Society for Optical Engineering, The Fourth International Symposium on Multispectral Image Processing and Pattern Recognition (MIPPR2005), 31 October–2 November 2005, Wuhan, China (Bellingham, WA: SPIE), Vol. 6044, pp. 60442B.1–60442B.6.

KAVZOGLU, T., 2009, Increasing the accuracy of neural network classification using refined training data. Environmental Modelling & Software, 24, pp. 850–858.

KAVZOGLU, T. and COLKESEN, I., 2009, A kernel functions analysis for support vector machines for land cover classification. International Journal of Applied Earth Observation and Geoinformation, 11, pp. 352–359.

KAVZOGLU, T. and MATHER, P.M., 1999, Pruning artificial neural networks: an example using land cover classification of multi-sensor images. International Journal of Remote Sensing, 20, pp. 2787–2803.

KAVZOGLU, T. and MATHER, P.M., 2003, The use of backpropagating artificial neural networks in land cover classification. International Journal of Remote Sensing, 24, pp. 4907–4938.

LU, D. and WENG, Q., 2007, A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28, pp. 823–870.

MATHUR, A. and FOODY, G.M., 2008, Multiclass and binary SVM classification: implications for training and classification users. IEEE Geoscience and Remote Sensing Letters, 5, pp. 241–245.

MELGANI, F. and BRUZZONE, L., 2004, Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42, pp. 1778–1790.

OOMMEN, T., MISRA, D., TWARAKAVI, N.K.C., PRAKASH, A., SAHOO, B. and BANDOPADHYAY, S., 2008, An objective analysis of support vector machine based classification for remote sensing. Mathematical Geosciences, 40, pp. 409–424.

PAL, M. and MATHER, P.M., 2003, An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sensing of Environment, 86, pp. 554–565.

PAL, M. and MATHER, P.M., 2004, Assessment of the effectiveness of support vector machines for hyperspectral data. Future Generation Computer Systems, 20, pp. 1215–1225.

PAL, M. and MATHER, P.M., 2005, Support vector machines for classification in remote sensing. International Journal of Remote Sensing, 26, pp. 1007–1011.

PAL, M. and MATHER, P.M., 2006, Some issues in the classification of DAIS hyperspectral data. International Journal of Remote Sensing, 27, pp. 2895–2916.

PAOLA, J.D. and SCHOWENGERDT, R.A., 1995, A detailed comparison of backpropagation neural-network and maximum-likelihood classifiers for urban land-use classification. IEEE Transactions on Geoscience and Remote Sensing, 33, pp. 981–996.

SINGH, U.K., TIWARI, R.K. and SINGH, S.B., 2005, One-dimensional inversion of geo-electrical resistivity sounding data using artificial neural networks – a case study. Computers & Geosciences, 31, pp. 99–108.

SINGH, U.K., TIWARI, R.K. and SINGH, S.B., 2010, Inversion of 2-D DC resistivity data using rapid optimization and minimal complexity neural network. Nonlinear Processes in Geophysics, 17, pp. 65–76.

VAPNIK, V.N., 1995, The Nature of Statistical Learning Theory (New York: Springer-Verlag).

YOOL, S.R., 1998, Land cover classification in rugged areas using simulated moderate-resolution remote sensor data and an artificial neural network. International Journal of Remote Sensing, 19, pp. 85–96.

ZHANG, Y., GAO, J. and WANG, J., 2007, Detailed mapping of a salt farm from Landsat TM imagery using neural network and maximum likelihood classifiers: a comparison. International Journal of Remote Sensing, 28, pp. 2077–2089.

ZHU, G.B. and BLUMBERG, D.G., 2002, Classification using ASTER data and SVM algorithms: the case study of Beer Sheva, Israel. Remote Sensing of Environment, 80, pp. 233–240.