Optimization in an Error Backpropagation Neural Network

Manfred M. Fischer and Petra Staufer

Optimization in an Error Backpropagation Neural Network Environment with a Performance Test on a Spectral Pattern Classification Problem

This paper attempts to develop a mathematically rigorous framework for minimizing the cross-entropy function in an error backpropagation framework. In doing so, we derive the backpropagation formulae for evaluating the partial derivatives in a computationally efficient way. Various techniques of optimizing the multiple-class cross-entropy error function to train single hidden layer neural network classifiers with softmax output transfer functions are investigated on a real-world multispectral pixel-by-pixel classification problem that is of fundamental importance in remote sensing. These techniques include epoch-based and batch versions of error backpropagation with gradient descent, PR-conjugate gradient, and BFGS quasi-Newton optimization. The method of choice depends upon the nature of the learning task and whether one wants to optimize learning for speed or classification performance. It was found that, comparatively considered, gradient descent error backpropagation provided the best and most stable out-of-sample performance results across batch and epoch-based modes of operation. If the goal is to maximize learning speed and a sacrifice in classification accuracy is acceptable, then PR-conjugate gradient error backpropagation tends to be superior. If the training set is very large, stochastic epoch-based versions of local optimizers should be chosen, utilizing a larger rather than a smaller epoch size to avoid unacceptable instabilities in the classification results.

Spectral pattern classification represents an area of considerable current interest and research. Satellite sensors record data in a variety of spectral ...

The authors gratefully acknowledge Adrian Trapletti for implementing the software routines used in this study. The algorithms are based on the Numerical Recipes in C library (Press et al. 1992). Moreover, the authors would like to thank the manuscript reviewers for their valuable and constructive comments.

Manfred M. Fischer is director of the Institute for Urban and Regional Research, Austrian Academy of Sciences, and professor of economic geography at Wirtschaftsuniversität Wien, where Petra Staufer is assistant professor.

Geographical Analysis, Vol. 31, No. 2 (April 1999). The Ohio State University Press. Submitted: 5/6/98.

... If the training set is very large, stochastic epoch-based rather than deterministic batch modes of operation tend to be preferable.

1. SINGLE HIDDEN LAYER NETWORKS AND THE NETWORK TRAINING PROBLEM

Suppose we are interested in approximating a classification function ℱ: ℝ^N → ℝ^C which estimates the probability that a pattern belongs to one of C a priori known, mutually exclusive classes. The function ℱ is not analytically known; rather, samples S = {s_1, ..., s_K} with s_k = (x^k, y^k) are generated by a process that is governed by ℱ, that is, ℱ(x^k) = y^k. From the available samples we want to build a smooth approximation to ℱ. Note that in real-world applications, only a finite (that is, small) number K of learning examples is available or can be used at the same time. Moreover, the samples contain noise. To approximate ℱ we consider the class Φ of single hidden layer feedforward networks, the leading case of neural network models. Φ consists of a combination of transfer functions φ_j (j = 1, ..., J) and ψ_c (c = 1, ..., C) that are represented by hidden units, and weighted forward connections between the input, hidden, and output units.


The cth output element of Φ is

$$\Phi(x, w)_c = \psi_c\!\left(\sum_{j=0}^{J} w_{cj}\, \varphi_j\!\left(\sum_{n=0}^{N} w_{jn}\, x_n\right)\right), \qquad 1 \le c \le C, \qquad (1)$$
where N denotes the number of input units, J the number of hidden units, and C the number of output elements (a priori given classes). x = (x_0, x_1, ..., x_N) is the input vector augmented with a bias signal x_0 that can be thought of as being generated by a "dummy" unit (with index zero) whose output is clamped at 1. The w_jn represent input-to-hidden connection weights, and the w_cj hidden-to-output weights (including the biases). The symbol w is a convenient shorthand notation for the w = [J(N + 1) + C(J + 1)]-dimensional vector of all the w_jn and w_cj network weights and biases (that is, the model parameters). φ_j(·) and ψ_c(·) are differentiable nonlinear transfer (activation) functions of, respectively, the hidden units (j = 1, ..., J) and the output elements (c = 1, ..., C). One of the major issues in neural network modeling is the problem of selecting an appropriate member of the model class Φ in view of a particular real-world application. This model specification problem involves both the choice of appropriate transfer functions φ_j(·) and ψ_c(·), and the determination of an adequate network topology.
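To make equation (1) concrete, the following minimal sketch evaluates the forward pass of such a network for a single input pattern. It assumes logistic hidden transfer functions φ_j and the softmax output transfer function ψ_c; the function and variable names (forward, W_hidden, W_output) are illustrative and not taken from the authors' C implementation.

```python
import numpy as np

def forward(x, W_hidden, W_output):
    """Evaluate equation (1) for one input pattern x of length N.

    W_hidden : (J, N+1) array of weights w_jn, column 0 holding the hidden biases.
    W_output : (C, J+1) array of weights w_cj, column 0 holding the output biases.
    Returns the C softmax outputs, interpretable as class membership probabilities.
    """
    x_aug = np.concatenate(([1.0], x))               # bias signal x_0 clamped at 1
    hidden_net = W_hidden @ x_aug                    # sum_n w_jn x_n
    hidden_out = 1.0 / (1.0 + np.exp(-hidden_net))   # logistic phi_j (an assumption)
    z_aug = np.concatenate(([1.0], hidden_out))      # hidden bias unit clamped at 1
    output_net = W_output @ z_aug                    # sum_j w_cj phi_j(.)
    exp_net = np.exp(output_net - output_net.max())  # softmax psi_c, numerically stabilized
    return exp_net / exp_net.sum()
```

Because ψ_c is the softmax, the C outputs are nonnegative and sum to one.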
... r = 1, 2, ...   (9)

(iii) update the parameter vector

$$w(r+1) = w(r) + \eta(r)\, d(r), \qquad r = 1, 2, \ldots \qquad (10)$$

(iv) if dE(w)/dw ≠ 0, then set r = r + 1 and go to (ii); else return w(r + 1) as the desired minimizer.
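The generic scheme of steps (iii) and (iv), updating the weight vector along a search direction d(r) with step size η(r) until the gradient of the error function vanishes, can be sketched as follows. The sketch assumes the simplest choice, gradient descent with a constant step size; grad_E, eta, and the stopping tolerance are illustrative placeholders, and PR-CG and BFGS differ only in how d(r) and η(r) are determined.

```python
import numpy as np

def minimize(w, grad_E, eta=0.01, max_iter=100_000, tol=1e-8):
    """Generic iterative minimization of E(w) following equation (10):
    w(r+1) = w(r) + eta(r) * d(r).  Here d(r) = -grad E(w(r)) (gradient
    descent) and eta(r) = eta is held constant, both simplifying assumptions."""
    for r in range(1, max_iter + 1):
        g = grad_E(w)                # dE/dw at the current point
        if np.linalg.norm(g) < tol:  # step (iv): stop once the gradient (numerically) vanishes
            break
        w = w - eta * g              # step (iii): update along d(r) = -g
    return w
```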

... leads to the task of evaluating δ_c. Once again applying the chain rule we obtain

$$\delta_c = \frac{\partial E^k}{\partial \mathrm{net}_c} = \sum_{c'=1}^{C} \frac{\partial E^k}{\partial \hat y_{c'}}\, \frac{\partial \hat y_{c'}}{\partial \mathrm{net}_c}. \qquad (16)$$

From (6) we have

$$\frac{\partial E^k}{\partial \hat y_c} = \frac{\partial}{\partial \hat y_c}\left[\,-\sum_{c'=1}^{C} y_{c'} \ln \hat y_{c'}\right] \qquad (17)$$

and from (4), the softmax output transfer function,

$$\frac{\partial \hat y_{c'}}{\partial \mathrm{net}_c} = \hat y_{c'}\,\bigl(\mathbf{1}\{c' = c\} - \hat y_c\bigr),$$

so that, combining (16) and (17) and using the fact that the target values y_{c'} sum to one, the output-layer error terms take the simple form δ_c = ŷ_c − y_c.
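A minimal sketch of these quantities, assuming one-of-C coded target vectors and the softmax outputs described above; the function names are illustrative. It computes the multiple-class cross-entropy error for a single training pattern together with the output-layer error signals and weight gradients that the backpropagation formulae deliver.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Multiple-class cross-entropy error for one pattern:
    E^k = -sum_c y_c ln(y_hat_c), with y a one-of-C target vector."""
    return -np.sum(y * np.log(y_hat))

def output_deltas(y, y_hat):
    """Output-layer error signals delta_c = dE^k / dnet_c.  With softmax
    outputs, equations (16)-(17) collapse to the simple form y_hat - y."""
    return y_hat - y

def output_weight_gradients(y, y_hat, hidden_out):
    """Gradients dE^k / dw_cj = delta_c * phi_j, with phi_0 = 1 for the bias."""
    z_aug = np.concatenate(([1.0], hidden_out))
    return np.outer(output_deltas(y, y_hat), z_aug)
```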

[Figure 1 shows the batch learning curves of the compared optimization techniques (including BFGS) plotted against the number of iterations (log-scaled).]

FIG. 1. Batch Learning Curves as a Function of Training Time: The effect of different optimization techniques (averaged values of converged simulations)

The average values of the multiple-class cross-entropy function do seem to indicate that PR-CG tended to find better local minima than any other procedure, and this conclusion is corroborated by the fact that the standard deviation after training in the ten runs is significantly lower, as shown in Table 2. BFGS, especially, appears to be more prone to falling into local minima, as indicated by the rather high standard deviation. Moreover, Table 2 clearly indicates that better out-of-sample classification performance is not the result of finding a lower minimum (see also Battiti and Tecchiolli 1994). PR-CG and BFGS seem to utilize information to modify the direction of steepest descent in a way that resulted in significantly poorer out-of-sample classification accuracy on our task. In fact, GD outperforms PR-CG by 9.60 percentage points and BFGS by 8.10 percentage points on average. An interesting conclusion from this comparative study is that out-of-sample classification performance can vary between algorithms, and even between different trials of the same algorithm, despite all of them finding a local minimum. It is important to note that GD out-of-sample performance varies between 80.49 and 86.22 percent classification accuracy, while PR-CG out-of-sample performance varies between 70.21 and 78.54 percent, and BFGS between 67.80 and 79.88 percent only.

The second series of experiments involves epoch-based rather than batch learning, with a range of epoch sizes K* = 900, 600, 300, and 30. The results obtained are summarized in Table 3, along with the corresponding learning curves displayed in Figure 2. Epoch-based learning strategies may be more effective than batch learning, especially when the number K of training examples is very large and many training examples possess redundant information, in the sense that many contributions to the gradient are very similar. Epoch-based updating makes the search path in the parameter space stochastic when the input vector is drawn at random. The main difficulty with stochastic epoch-based learning is its apparent inability to converge on a minimum within the 100,000-iteration limit. While modifications to increase speed are important and attract most attention, classification accuracy is perhaps more important in applications such as pixel-by-pixel classification of remotely sensed data. Important differences in out-of-sample performance between batch and epoch-based learning may be


TABLE 3
Epoch-based Learning: Comparative Performance of Error Backpropagation with Different Optimization Techniques (GD, PR-CG, BFGS)

[Table 3 entries omitted. Performance values represent the mean (standard deviation in brackets) of ten simulations differing in the initial random weights. Function Value: multiple-class cross-entropy function value after 10^5 iterations. In-Sample Classification Accuracy: percentage of training pixels correctly classified after 10^5 iterations. Out-of-Sample Classification Accuracy: percentage of testing pixels correctly classified after 10^5 iterations.]
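The distinction between batch and stochastic epoch-based updating that underlies Table 3 can be made concrete with the following sketch. It assumes a gradient descent search direction with a constant step size; grad_E_single (the per-pattern gradient), eta, and epoch_size (K*) are illustrative names rather than quantities defined in the paper.

```python
import numpy as np

def batch_update(w, grad_E_single, samples, eta):
    """One batch step: the gradient is accumulated over all K training pairs
    before a single weight update is performed."""
    g = sum(grad_E_single(w, x, y) for x, y in samples)
    return w - eta * g

def epoch_based_update(w, grad_E_single, samples, eta, epoch_size, rng):
    """One epoch-based (stochastic) step: the update is computed from a randomly
    drawn subset of K* = epoch_size training pairs, which makes the search path
    in parameter space stochastic."""
    indices = rng.choice(len(samples), size=epoch_size, replace=False)
    g = sum(grad_E_single(w, *samples[i]) for i in indices)
    return w - eta * g

# Usage sketch: rng = np.random.default_rng(0); samples is a list of (x, y) pairs.
```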

[Confusion matrix table: per-class pixel counts with column totals and omission errors in percent; entries omitted.]

multiple-class cross-entropy function in a neighborhood of the current point in parameter space is minimized. This perspective opens the possibility to combine the backpropagation technique with more sophisticated optimization procedures for parameter adjustment, such as the PR-CG and BFGS techniques. The performance of error backpropagation with gradient descent, Polak-Ribiere conjugate gradient, and Broyden-Fletcher-Goldfarb-Shanno quasi-Newton minimization is evaluated and tested in the context of urban land cover classification. The results obtained may be summarized as follows.

First, the choice of optimization strategy (batch versus epoch-based mode of operation) and of the optimization technique (GD, PR-CG, and BFGS) depends on the nature of the learning task and whether one wants to optimize learning for speed or classification performance. Second, if the goal is to maximize learning speed on a pattern classification problem and a sacrifice in classification accuracy is acceptable, then PR-CG error backpropagation, the most mature technique, would be the method of choice. Third, where high classification accuracy and stability are more important than faster learning, GD error backpropagation exhibits superiority over PR-CG and BFGS in view of our pixel-by-pixel pattern classification task, independently of the mode of operation, but requires time-consuming tuning of the learning parameter η to achieve "best" out-of-sample performance. Fourth, if the training set is very large, stochastic epoch-based rather than deterministic batch modes of operation should be chosen, with a larger rather than a smaller epoch size. Much work on such optimizers, however, is still required before they can be utilized with the same confidence and ease with which batch local optimizers are currently used.


LITERATURE CITED

Battiti, R., and G. Tecchiolli (1994). "Learning with First, Second, and No Derivatives: A Case Study in High Energy Physics." Neurocomputing 6, 181-206.