Second Workshop on Application of Artificial Intelligence and Innovations in Engineering Geodesy

Support Vector Machines – Theoretical Overview and First Practical Steps

Michael Heinert
Institut für Geodäsie und Photogrammetrie, Technische Universität Braunschweig
e-mail: [email protected]

Abstract

Besides many established learning algorithms, support vector machines offer a high model capacity, while the well-known over-fitting problem of all modelling techniques – parametric or non-parametric – can be avoided. Originally designed for pattern recognition, support vector machines have since been extended to solve interpolation, extrapolation and non-linear multiple regression on hybrid data. Accordingly, support vector machines have already been used successfully for many geodetic purposes, e.g. for landslide modelling or velocity field interpolation. This paper gives a theoretical overview as well as a "cooking recipe" for realising small examples in MS-Excel. Such worksheets offer a good insight into what is going on inside these computational techniques.

Keywords: Support Vector Machines, Statistical Learning Theory, MS-Excel Exercises

1 Introduction

Within the last years, Support Vector Machines (SVM) have been used successfully for several geodetic purposes. On the one hand, one can find the SVMs – widely unnoticed, by the way – within several programs used by geodesists; in particular, they are included as classification tools in geo-information systems and remote-sensing packages. On the other hand, they have been tested as velocity field interpolation algorithms in recent tectonics and landslide detection (Riedel and Heinert, 2008; Heinert and Riedel, 2010). Accordingly, it seems worthwhile and, maybe, even enjoyable to enter a new field of algorithms. Since the SVMs were developed first for pattern recognition and only afterwards extended to non-linear regression, it is suitable to start with a detour through pattern recognition. A central idea of these algorithms is the implicit transform of the patterns, given by an n-dimensional input vector and a scalar output, into a high-dimensional feature space, x → Φ(x). Usually, the dimension of this kernel Hilbert space H is significantly higher than the pattern vector dimension n + 1. This transform into a space of higher order allows the linear separation of patterns that are originally not linearly separable in the data space X. The decision surface is built by a hyperplane. This mapping Φ : X → H leads to an algorithm that does not depend on basic decisions of


the user: neither do centres of radial basis functions have to be chosen, nor does the architecture of a neural network have to be set up, nor does a number of linguistic fuzzy classes have to be fixed a priori. This is unique among the supervised learning algorithms.

2 Support Vector Machine for linearly separable patterns

The easiest case of a SVM is the separation of linearly separable patterns. Accordingly, this example is suitable to be explained in detail. In this first case the mapping Φ : X → H is trivial, i.e. x → Φ(x) = x: the data space and the feature space are identical.

2.1 Geometrical approach

The decision surface that shatters the two classes of patterns can be described by the definition of the hyperplane with a main point and a normal vector

    w^T (x − x_0) = 0                                        (1)

or, more conveniently, by

    w^T x + b = 0   with   b = −w^T x_0,                     (2)

as can be found in Haykin (1999, p. 319). The separation of the patterns [x_i, y_i] is achieved by inserting x_i into the indicator function given by Eq. (1), which yields

    w^T x_i + b ≥ 0   ∀ y_i = +1                             (3)
    w^T x_i + b < 0   ∀ y_i = −1.                            (4)

A positive or negative y_i assigns a pattern to its class according to the present position of the hyperplane. This means that patterns with a positive y_i are situated on the side of the hyperplane opposite to the origin, those with a negative y_i on the same side as the origin. At this state the parameters w and b are not yet optimal, which leads to several wrong assignments. Let us assume that the hyperplane has already reached an approximately optimal position and no faulty assignment can be found anymore. To create a margin of separation between differently assigned patterns [x_i, y_i], we postulate:

    w^T x_i + b ≥ +1   ∀ y_i = +1                            (5)
    w^T x_i + b ≤ −1   ∀ y_i = −1.                           (6)

Note that patterns [x_i, y_i] for which

    −1 < w^T x_i + b < +1                                    (7)

are situated inside the present margin. Accordingly, its width has to be adjusted as well. Multiplying each of the two inequalities (5) and (6) by the corresponding assignment y_i yields

    y_i ( w^T x_i + b ) ≥ 1.                                 (8)

Remembering Eq. (1), we may write

    y_i ( w^T x_i − w^T x_0 ) ≥ 1.                           (9)

Second Workshop on Application of Artificial Intelligence and Innovations in Engineering Geodesy

wopt ||wopt|| -bopt ||wopt|| xi

wopt

73

wop x + t 0 b = +1 wop x + t 0 b=0 wop x + t 0 b = –1

||wopt||-1

x0 woptxi ||wopt||

Figure 1: SVM for linearly separable patterns: x_0 and w_opt define the position and direction of the hyperplane. The margin is fixed by the position of the support vectors on both sides of the hyperplane.

Using this expression it can be shown that the only variable part is the normal vector w, whereas the training patterns [x_i, y_i] are given and therefore constant. Note that the vector x_0 only describes the position of the hyperplane, but does not influence the width of the margin. Accordingly, the m-dimensional normal vector with its Euclidean norm

    ||w|| = ( w^T w )^{1/2} = ( w_1^2 + w_2^2 + ... + w_m^2 )^{1/2}      (10)

can be normalized to the unit vector

    w_0 = w / ||w||.                                                      (11)

Therefore, both sides of equation (9) have to be divided by ||w||. In the resulting expression

    y_i ( w^T x_i / ||w|| − b / ||w|| ) ≥ 1 / ||w||                       (12)

we find Euclidean distances. The term w_0^T x_i is the projection of a single pattern vector x_i onto the unit normal vector w_0. This corresponds to the distance of x_i to the plane through the origin that is parallel to the wanted hyperplane (Fig. 1). The quotient −b · ||w||^{−1} is the projection of the hyperplane's position vector x_0 onto the unit normal vector w_0 and is equivalent to the distance between hyperplane and origin. The difference of the two terms denotes the perpendicular distance ||w||^{−1} of an arbitrary pattern x_i to the hyperplane. Accordingly, in Eq. (12) we recognize the Hessian normal form of the hyperplane. Herein, the distance of the patterns on the margin is given as

    w_0^T x_i − b ||w||^{−1} = ||w||^{−1}.                                (13)

This expression makes clear that the margin's width 2 · ||w||^{−1} is inversely proportional to the length of the normal vector. Note that this width is twice the distance between the hyperplane and one of the symmetrical margin's edges. If thus a short normal vector leads to a wide margin, then ||w|| has to be minimised in order to maximise the margin. Therefore, minimise the quadratic cost function

    Φ(w) = (1/2) ||w||^2 = (1/2) w^T w.                                   (14)

The geometrical meaning of this instruction is as follows: the hyperplane's distance to the patterns [x_i, y_i] has to increase more and more. Unfortunately, the hyperplane does not achieve this by separating the patterns; instead it departs from all patterns equally and eventually leaves the space between the classes (Fig. 2a). Actually, the normal vector becomes – as demanded – infinitesimally small and, in the limit, it even degenerates to a point. This means that the hyperplane rotates at infinite distance around the centroid of all patterns. More constraints are necessary to force the hyperplane into the space between the two classes. It has therefore to be demanded that all perpendicular distances to the positive as well as to the negative edge of the margin

    Σ_i d_i = Σ_{i=0}^{N} [ y_i ( w^T x_i / ||w|| − b / ||w|| ) − 1 / ||w|| ]      (15)

are to be maximized together. Each condition is only satisfied if a) the labelling y_i is correct, because otherwise the result would be d_i ≤ −1, and if b) the margin has been left, because otherwise the result would be −1 < d_i < 0 instead. A single condition

    d_i = y_i ( w^T x_i − b ) − 1 ≥ 0                                             (16)

can be re-formulated without the normalization by the unit normal vector. All these conditions together force the hyperplane between the two classes [x_i, y_i^+] and [x_i, y_i^−]. But now the hyperplane is able to approach the patterns of one class while the margin becomes infinitesimally small because of an increasing Euclidean norm ||w|| (Fig. 2b). Note that we have to formulate a dual problem. Accordingly, each condition

    α_i [ y_i ( w^T x_i − b ) − 1 ] = 0                                           (17)

gets a pre-factor, namely the Lagrange multiplier

    α_i ≥ 0.                                                                      (18)


Figure 2: a) The minimisation of w^T w forces the hyperplane away from all patterns. b) A maximisation of all y_i (w^T x_i − b) − 1 pushes the hyperplane in the direction of the smaller group of samples and the margin vanishes. c) Both conditions together achieve an optimal result.


Such a factor weights each condition, so that not all patterns participate equally in the solution. The boundary condition α_i ≥ 0 prevents patterns from being labelled wrongly; inevitably, a negative α_i would implicitly assign a pattern to the opposite class (Schölkopf et al., 2007, p. 12). The equations (16), (17) and (18) are the so-called Karush-Kuhn-Tucker (KKT) conditions (Burges, 1998, p. 131). They are generally necessary for the solution of constrained non-linear optimization problems (Kuhn and Tucker, 1951; Hillier and Liebermann, 2002). Now we can construct a Lagrangian function

    L(w, b, α) = (1/2) ||w||^2 − Σ_{i=0}^{N} α_i [ y_i ( w^T x_i − b ) − 1 ].      (19)

This function has to be optimised with respect to the normal vector w and the bias b, so that we may write for the saddle point

    ∂L(w, b, α)/∂w = 0   and   ∂L(w, b, α)/∂b = 0.                                (20)

The differentiation of L(w, b, α) with respect to w yields

    ∂L(w, b, α)/∂w = (1/2) · 2w − Σ_{i=0}^{N} α_i y_i x_i = 0,                    (21)

which is equal to

    w = Σ_{i=0}^{N} α_i y_i x_i.                                                  (22)

So, the computation of w is the result of the sum over the weighted products of the input x_i and output y_i of each pattern. Each product can be seen as the un-normalized covariance of x_i and y_i. The differentiation of L(w, b, α) with respect to b yields

    ∂L(w, b, α)/∂b = − Σ_{i=0}^{N} α_i y_i = 0,                                   (23)

which is equal to

    Σ_{i=0}^{N} α_i y_i = 0                                                       (24)

and has a special meaning, as we may conclude that

    Σ_{i=0}^{n^+} α_i^+ y_i^+ = − Σ_{ι=0}^{n^−} α_ι^− y_ι^−.                      (25)

The numbers n^+ and n^− of patterns on each side of the margin are therefore irrelevant.


The expansion of the Lagrangian function (19),

    L(w, b, α) = (1/2) w^T w − Σ_{i=0}^{N} α_i y_i w^T x_i − b Σ_{i=0}^{N} α_i y_i + Σ_{i=0}^{N} α_i,      (26)

can be simplified – term by term – using the optimality conditions yielded by the derivatives (22) and (24). The third term of the Lagrangian function vanishes by virtue of the condition (24):

    b Σ_{i=0}^{N} α_i y_i = 0.                                                    (27)

The right-hand side of the other optimality condition (22) can be inserted into the Lagrangian function instead of w, so that

    L(w, b, α) = (1/2) Σ_{i=0}^{N} Σ_{j=0}^{N} α_i y_i x_i · α_j y_j x_j − Σ_{i=0}^{N} α_i y_i x_i · Σ_{j=0}^{N} α_j y_j x_j + Σ_{i=0}^{N} α_i.      (28)

2.2 Solution for linearly separable patterns

Eq. (28) contains the optimality conditions given by the minimisation with respect to w and b. Now we are looking for the saddle point, which can be expressed by an optimal set of Lagrange multipliers α_i. Accordingly, we reformulate the objective function as the maximisation of

    Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{(i,j)=1}^{N} α_i α_j y_i y_j x_i^T x_j      (29)

subject to the constraints

    Σ_{i=1}^{N} α_i y_i = 0                                                       (30)

and

    α_i ≥ 0   ∀ i = 1 ... N.                                                      (31)

Herein, all conditions (17) have to be maximized (Haykin, 1999, p. 322). Several well-known algorithms are available for the necessary non-linear optimization (Domschke and Drexl, 2002; Grundmann, 2002; Hillier and Liebermann, 2002; Rardin, 1998).


Whilst their description could fill another paper of this length, it is worth noting that these methods are comparable in the quality of their results. Nevertheless, they differ considerably in computing speed. Currently, it is recommended to maximize the objective function using sequential minimal optimization (SMO) (Platt, 1998, 1999). Suppose that the objective function Q(α) has been maximized with respect to (30) and (31); then the optimal weight vector is given as

    w_opt = Σ_{i=1}^{N} α_opt,i y_i x_i.                                          (32)

Most of the optimal Lagrange multipliers α_opt,i will be zero, so that only a few patterns contribute to the summation. Accordingly, the summation can be reduced to

    w_opt = Σ_{ι=1}^{N^(s)} α_opt,ι y_ι^(s) x_ι^(s)   ∀ α_opt,ι > 0.              (33)

These N^(s) patterns [x_ι^(s), y_ι^(s)] are called the support vectors. They form a subset of patterns lying on the margin and determine the position and direction of the hyperplane. The optimal bias of this hyperplane is given by

    b_opt = y_ι^(s) − w_opt^T x_ι^(s)                                             (34)

using an arbitrary support vector. Analogously we may write

    [x_ι^(s), y_ι^(s)] = [x_i, y_i]   ∀ α_i > 0.                                  (35)
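To make the dual problem (29)–(31) and the recovery of w_opt and b_opt in (32)–(35) concrete outside the spreadsheet, the following Python sketch solves the dual for a small toy set with SciPy's general-purpose SLSQP optimizer instead of SMO. The data values and variable names are illustrative assumptions, not taken from the paper.

    import numpy as np
    from scipy.optimize import minimize

    # Toy patterns: two linearly separable point clouds in the plane (illustrative data).
    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],      # class +1
                  [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])     # class -1
    y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
    N = len(y)

    # H_ij = y_i y_j x_i^T x_j, so that Q(alpha) = sum(alpha) - 0.5 alpha^T H alpha, Eq. (29).
    H = (y[:, None] * X) @ (y[:, None] * X).T

    def neg_Q(alpha):
        # SciPy minimizes, so we return -Q(alpha).
        return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

    res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
                   bounds=[(0.0, None)] * N,                               # alpha_i >= 0, Eq. (31)
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum alpha_i y_i = 0, Eq. (30)

    alpha = res.x
    sv = alpha > 1e-6                              # support vectors: alpha_opt > 0, Eq. (35)
    w_opt = (alpha[sv] * y[sv]) @ X[sv]            # Eq. (33)
    b_opt = y[sv][0] - w_opt @ X[sv][0]            # Eq. (34), using one arbitrary support vector
    print("support vectors:", np.where(sv)[0], "w_opt:", w_opt, "b_opt:", b_opt)

For such a small, separable toy set a general-purpose solver is perfectly adequate; a dedicated SMO implementation only pays off for larger pattern sets.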

2.3 SVM for linearly non-separable patterns

Note that it could be necessary to allow false classifications, especially if the given set of patterns is a priori linearly non-separable. The sharp classification by a hyperplane can be relaxed in such a way that patterns may fall on the wrong side of the decision surface. To construct this kind of soft margin we introduce slack variables ξ_i ≥ 0. Accordingly, the classifier can be extended to

    y_i ( w^T x_i + b ) ≥ 1 − ξ_i   ∀ i = 1 ... N.                                (36)

These newly defined variables ξ_i describe the empirical risk of a wrong classification. Three cases are possible:

• 0 < ξ_i ≤ 1: the pattern falls inside the region of separation, but still on the right side of the decision surface,
• 1 < ξ_i ≤ 2: the pattern falls inside the region of separation on the wrong side of the decision surface,
• ξ_i > 2: the pattern falls into the wrong class.

The indicator function that has to be constructed is non-convex. The related loss function

    Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i                                     (37)


is expanded by a second term as an approximation of the non-convex indicator function (Haykin, 1999, p. 327). The objective function of this classifier (29) stays the same. The derivatives with respect to ∂w, ∂b and ∂ξ together with the KKT conditions (Burges, 1998, p. 136) lead to

    0 ≤ α_i ≤ C   ∀ i = 1 ... N                                                   (38)

instead of the second constraint (31). Surprisingly, the slack variables ξ_i do not occur explicitly; instead they are implicitly defined by the choice of C. The input parameter C acts as a trade-off controller between the machine's complexity and the number of non-separable patterns: with an increasing C the complexity increases as well. So it is up to the user to determine heuristically an optimal balance between complexity and misclassification. To manage this, the algorithm has two basic control outputs beside the optimised empirical risk: first, the number of support vectors, and second, the VC-dimension of the model (Sect. 4). The computation of the weight vector w_opt is the same as for linearly separable patterns (33). However, to compute the bias b, Eq. (34) has to be used for all support vectors; the optimal bias is given by their mean value. The positive slack variables are finally given by

    ξ_i = 1 − y_i ( w^T x_i + b )   ∀ ξ_i > 0.                                    (39)
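A minimal modification of the dual solver sketched above in Sect. 2.2 turns it into a soft-margin machine: only the bounds on the multipliers change from α_i ≥ 0 to 0 ≤ α_i ≤ C (Eq. 38), and the bias is averaged over all support vectors. The snippet below is again an illustrative sketch; it reuses the assumed names X, y, N, H and neg_Q from the previous example and is not the paper's own implementation.

    import numpy as np
    from scipy.optimize import minimize

    C = 1.0                                          # trade-off parameter chosen by the user

    res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,            # box constraint 0 <= alpha_i <= C, Eq. (38)
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])

    alpha = res.x
    sv = alpha > 1e-6
    w_opt = (alpha[sv] * y[sv]) @ X[sv]
    b_opt = np.mean(y[sv] - X[sv] @ w_opt)           # average Eq. (34) over all support vectors

    # Positive slack variables, Eq. (39): patterns inside the margin or misclassified.
    xi = np.maximum(0.0, 1.0 - y * (X @ w_opt + b_opt))
    print("b_opt:", b_opt, "sum of slacks:", xi.sum())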

The size of C is not easy to interpret. Therefore, it is also possible to predefine a positive maximum for ξ_i – of which the user has a better intuition – and to optimise C instead.

2.4 Linear SV pattern recognition using MS-Excel

An example in MS-Excel allows to demonstrate how a SVM works (Fig. 3). The first two columns

[A3:A14], [B3:B14]

contain the training patterns [x_i, y_i] with a two-dimensional input x_i; the a priori known binary classification (y^+, y^−) is put into column

[C3:C14]

as output. The two following columns

[D3]    =IF(K3>0;B3;-999)
[E3]    =IF(K3<0;B3;-999)

separate the patterns of the two classes for the chart. The coordinates of the support vectors are collected in

[L3]    =IF(F3>0;A3;0)
[M3]    =IF(F3>0;B3;0)

They can be used to determine the bias, which means the distance between the hyperplane and the origin. The condition in

[N3]    =IF(F3>0;(C3-(G$1*L3+H$1*M3));" ")

ensures that this computation is carried out only for support vectors. In the last column that depends on the patterns one can find the slack variables ξ_i

[O3]    =IF((1-C3*(A3*$G$1+B3*$H$1+$R$7))>0; (1-C3*(A3*$G$1+B3*$H$1+$R$7));" "),

which are only of interest if they are positive. In a separate block of cells, firstly, the complexity parameter C is given in

[R2]    =C,

which controls the size of the slack variables. In the case of linearly separable patterns the value of C does not matter, because there are no ξ_i to control. The complete objective function Q(α) is computed in the target cell

[R4]    =F1-1/2*SUM(I1:J1).

This cell will be used by the add-in SOLVER as "Set Target Cell" when the optimization is started. The next cells

[R5]    =-G1/H$1,
[R6]    =-R7/H$1

supply the parameters m, X_0 which are necessary to depict the decision surface in a chart. The y-intercept X_0 depends on the bias. In the case of linearly separable patterns a single computation of b with an arbitrary support vector is sufficient. In the case that even one pattern violates this separability condition, it is necessary to compute the average over all support vectors:

[R7]    =AVERAGE(N3:N53)

The next block of cells

[Q10]   =0,                  [Q11]   =MAX(A3:A14),
[R10]   =R6,                 [R11]   =R6+Q11*R5,
[S10]   =-(R7-1)/H1,         [S11]   =S10+Q11*R5,
[T10]   =-(R7+1)/H1,         [T11]   =T10+Q11*R5

contains the functions for the depiction of the decision surface and its margin. After arranging all the cells and the necessary references, the optimization may be started. To solve the objective function within this example, the following inputs have to be filled in in the SOLVER window, as can be seen in Figure 9:

Set Target Cell:             $R$4
Equal to:                    Max
By Changing Cells:           $F$3:$F$14
Subject to the Constraints:  $F$3:$F$14 >= 0, $K$1 = 0
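The worksheet result can be cross-checked outside Excel. The following sketch uses scikit-learn's linear soft-margin classifier on twelve illustrative 2-D patterns; they are stand-ins for the values in columns A:C, since the sheet (Fig. 3) is not reproduced here. The printed w corresponds to the cells G1:H1 and b to the bias cell R7 used in the formulas above.

    import numpy as np
    from sklearn.svm import SVC

    # Illustrative stand-in for the twelve patterns in columns A:C of the worksheet.
    X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [1, 1], [3, 3],
                  [6, 7], [7, 6], [7, 8], [8, 7], [6, 6], [8, 8]], dtype=float)
    y = np.array([+1] * 6 + [-1] * 6)

    clf = SVC(kernel="linear", C=1e3)   # a large C approximates the hard-margin case
    clf.fit(X, y)

    print("w  (cf. cells G1:H1):", clf.coef_[0])
    print("b  (cf. cell R7):    ", clf.intercept_[0])
    print("support vectors:     ", clf.support_)        # indices of patterns with alpha_i > 0
    print("dual coefficients alpha_i*y_i:", clf.dual_coef_[0])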


2.5 Results

The classification of linearly separable patterns with a two-dimensional input x_i and a scalar output y_i is by far the easiest: high values of the parameter C lead to three support vectors on the margin between the classes. The margin itself is empty (Fig. 4a). Higher values of C are without any impact. This is comparable to a solution without slack variables ξ_i. In the chosen example (Fig. 4) they all vanish for C ≥ 2^{−1/2}. Taking smaller values instead, the valid support vectors get slack variables ξ_i ≥ 0. Nevertheless, the hyperplane remains the same, although the margin increases. But with further decreasing values of C the solution changes: more support vectors are added and the position and direction of the decision surface start changing (Fig. 4b). In the case of a priori linearly non-separable patterns, some patterns directly get positive slack variables. With a high value of C the margin vanishes (Fig. 4c). Note that a margin is the result of a sufficiently small C. The smaller C and the wider the according margin, the more support vectors participate in the solution. For infinitesimal C all patterns take part in the solution; accordingly, the two centroids of the patterns determine the hyperplane.


Figure 4: Linear SV pattern recognition: the result for linearly separable patterns in dependence on a maximal or small value of C (a and b) as well as the comparable result for linearly non-separable patterns (c and d). The in-figure labels "Stützvektoren" and "Trennbereich" denote the support vectors and the region of separation.
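The qualitative behaviour described above can be reproduced numerically. The following sketch – with illustrative data, not the worksheet's values – sweeps C and reports the number of support vectors and the margin width 2/||w|| of a linear soft-margin classifier.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two slightly overlapping point clouds, i.e. not strictly linearly separable.
    X = np.vstack([rng.normal([2, 2], 1.0, (20, 2)), rng.normal([5, 5], 1.0, (20, 2))])
    y = np.array([+1] * 20 + [-1] * 20)

    for C in (100.0, 1.0, 0.1, 0.01):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_[0]
        margin = 2.0 / np.linalg.norm(w)          # margin width 2*||w||^-1, cf. Eq. (13)
        print(f"C={C:6.2f}: {clf.support_.size:2d} support vectors, margin width {margin:.2f}")

As C decreases, more patterns become support vectors and the margin widens, which mirrors the tendency described for Fig. 4.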

3 Nonlinear Support Vector Regression

3.1 Kernels

It would be quite costly to map big amounts of data explicitly into a space of higher order and to look for an optimal hyperplane in there. Accordingly, instead of transforming the data into the feature space H, a kernel function is used within the data space. Such a kernel represents the hyperplane from the feature space and re-transforms it into the data space X – this is called "the kernel trick". Such a continuous symmetric function

    K(x_i, x_j) = Σ_{ι=1}^{∞} λ_ι φ_ι(x_i) φ_ι(x_j)                               (40)

with the eigenfunctions φ(x) and the positive eigenvalues λ is defined within the closed interval a ≤ x_i, x_j ≤ b. A kernel function has to converge absolutely and uniformly in the objective function Q(·) to be suitable for a SVM. Therefore this function has to be positive definite. According to Mercer (1909, p. 442) the kernel K(x_i, x_j) is positive definite iff

    ∫_a^b ∫_a^b K(x_i, x_j) ψ(x_i) ψ(x_j) dx_i dx_j ≥ 0.                          (41)

The functions ψ(x) have to be square integrable:

    ∫_a^b ψ^2(x) dx < ∞.                                                          (42)

But Mercer's condition does not describe how to construct a kernel function. It only answers the question whether an already chosen function is a suitable kernel (Haykin, 1999, p. 332). Among these functions one can find the cross-correlation function as well as the weighted sum of a neuron within an artificial neural network (ANN) or the radial basis function (RBF) of the Euclidean norm. Let us revisit once more the objective function (29) with its Lagrange multipliers:

    Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{(i,j)=1}^{N} α_i α_j y_i y_j ⟨x_i, x_j⟩.    (43)

According to the non-linear mapping, a transform of the patterns into the feature space would give

    Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{(i,j)=1}^{N} α_i y_i Φ(x_i) · α_j y_j Φ(x_j),      (44)

which we rejected because of the computational costs. Let

    K(x_i, x_j) = Φ(x_i) · Φ(x_j)                                                 (45)

be a function that is a suitable kernel; then the data transform is done implicitly. A kernel is said to be suitable if it e.g. contains the dot product ⟨x_i, x_j⟩ of the pattern inputs x_i. Even the dot product ⟨x_i, x_j⟩ itself is a suitable kernel, so that we may rewrite Eq. (29) as

    Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{(i,j)=1}^{N} α_i α_j y_i y_j K(x_i, x_j),   (46)

but now with an arbitrary suitable kernel K(x_i, x_j).
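In the dual solver sketched in Sect. 2.2, the kernel trick therefore amounts to replacing the matrix of dot products x_i^T x_j by an arbitrary suitable kernel evaluation. A hedged illustration, reusing the assumed names X and y from that sketch:

    import numpy as np

    def gram(X, kernel):
        # Kernel matrix K_ij = kernel(x_i, x_j), replacing the dot products in Eq. (29)/(46).
        return np.array([[kernel(xi, xj) for xj in X] for xi in X])

    # The linear kernel reproduces Eq. (29); an RBF kernel turns the same dual into a non-linear machine.
    K_lin = gram(X, lambda a, b: a @ b)
    K_rbf = gram(X, lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2)))

    # H_ij = y_i y_j K(x_i, x_j) can be plugged into the same neg_Q / SLSQP routine as before.
    H = (y[:, None] * y[None, :]) * K_rbf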


The – now repeatedly mentioned – dot product kernel

    K(x_i, x_j) = ⟨x_i, x_j⟩ = x_i · x_j                                          (47)

yields good results first of all for linearly separable patterns. Even if there are some misclassifications because of stochastic errors, this kernel can be used successfully. A basic improvement is given by the polynomial kernel (Schölkopf and Smola, 2001, p. 45)

    K(x_i, x_j) = ⟨x_i, x_j⟩^d                                                    (48)

and the inhomogeneous polynomial kernel

    K(x_i, x_j) = ( ⟨x_i, x_j⟩ + 1 )^d.                                           (49)

The latter allows in its quadratic form – i.e. d = 2 – to solve the classical XOR problem (Haykin, 1999, p. 355). A commonly used kernel is the Gaussian or extended RBF kernel

    K(x_i, x_j) = exp( −γ ||x_i − x_j||^2 ).                                      (50)
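A quick numerical way to make Mercer's condition (41) tangible is to build the Gram matrix of a candidate kernel on a finite sample and check that its eigenvalues are non-negative. The sketch below, with illustrative data and assumed parameter values, does this for the dot-product, inhomogeneous polynomial and RBF kernels.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))                   # arbitrary sample of 30 two-dimensional inputs

    kernels = {
        "dot product (47)": lambda a, b: a @ b,
        "polynomial (49)":  lambda a, b: (a @ b + 1.0) ** 2,
        "RBF (50)":         lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2)),
    }

    for name, k in kernels.items():
        K = np.array([[k(xi, xj) for xj in X] for xi in X])
        lam_min = np.linalg.eigvalsh(K).min()      # smallest eigenvalue of the symmetric Gram matrix
        print(f"{name}: smallest eigenvalue {lam_min:+.2e}")

A clearly negative smallest eigenvalue rules a candidate function out; passing the check on one finite sample is of course only a necessary, not a sufficient, indication that the function is a Mercer kernel.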

Principally, the radial basis functions are widely used in interpolations, fuzzy clustering methods or neural RBF networks for modelling purposes. So, they are quite interesting for SVMs as well. Another "import" from neural networks is the neural kernel

    K(x_i, x_j) = tanh( a ⟨x_i, x_j⟩ + b ),                                       (51)

which can also be defined with other activation functions instead of the hyperbolic tangent, e.g. sigmoid functions. The by far most flexible algorithm is the so-called ANOVA kernel

    K_D(x_i, x_j) = Σ_{1 ≤ i_1 < ... < i_D ≤ N} Π_{d=1}^{D} K_{i_d}(x_{i_d}, x_{j_d}).      (52)

In the present example, the full solution of the regression is built by the median

[R1]    =MEDIAN(R3:R100)

of the singularly computed values b_i of each support vector,

[R3]    =IF(E3>0;(Q3-K3); IF(F3>0;(Q3-K3);" "))

to yield a first approximated bias. Now, the solution of the regression with the bias b can be updated in

[L3]    =K3+$R$1.

Accordingly, we get a first approximation of the residuals in

[S3]    =B3-L3,

and of the slack variables ξ_i, ξ_i^* in

[T3]    =IF(E3>0;($S3-$AB$3);" "),
[U3]    =IF(F3>0;(-$S3-$AB$3);" ").

In this computational state they are quite often negative. This is an unacceptable violation of the KKT condition ξ_i, ξ_i^* ≥ 0. Accordingly, we put once more an updated solution of the regression in

[M3]    =K3+$AB$7.

Only for graphical purposes are the following cells in the columns N and O:

[N3]    =M3+$AB$3,
[O3]    =M3-$AB$3,


which contain the limits of the ε-insensitive objective function. Using the optimal solution we can store the residuals and their squares in columns V and W:

[V3]    =B3-M3,
[W3]    =S3^2.

Accordingly, we can compute in

[X3]    =IF(E3>0;($V3-$AB$3);" "),
[Y3]    =IF(F3>0;(-$V3-$AB$3);" ")

the optimal slack variables ξ_i, ξ_i^*. Very rarely one can find negative values here; in these rare cases a further iteration is necessary to get the optimal bias. The final solution is based on the control inputs of the user in the cells

[AB2]   =γ,
[AB3]   =ε,
[AB4]   =C.

The next block of cells

[AB6]   =-AB3*G1+I1-SUM(AD3:AO14),
[AB7]   =R1-MIN(MIN(U3:U14);0)+MIN(MIN(T3:T14);0),
[AB8]   =J1,
[AB9]   =(AVERAGE(W3:W14)*12/11)^(1/2)

contains the output parameters: the present value of the converged objective function Q(α, α*), the optimal bias b = X_0, the surface normal m = w and the empirical risk σ. To solve the objective function within this second example, the following inputs have to be filled in in the SOLVER window:

Set Target Cell:             $AB$6
Equal to:                    Max
By Changing Cells:           $E$3:$F$14
Subject to the Constraints:  $E$1 = $F$1, $E$3:$F$14 >= 0.
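The same ε-insensitive regression can be cross-checked with scikit-learn's SVR. The twelve sine-like patterns below are illustrative stand-ins for the worksheet data, and the parameter combinations only mirror the ones discussed for Fig. 7; both are assumptions, not the paper's exact values.

    import numpy as np
    from sklearn.svm import SVR

    # Twelve patterns situated roughly on a sine function, as in the example of Fig. 7.
    x = np.linspace(0.5, 6.5, 12)
    rng = np.random.default_rng(2)
    y = 3.0 + 2.0 * np.sin(x) + rng.normal(0.0, 0.2, size=x.size)

    for gamma, eps, C in [(0.025, 0.33, 50), (0.025, 0.11, 12), (0.025, 0.99, 50), (0.75, 0.33, 50)]:
        svr = SVR(kernel="rbf", gamma=gamma, epsilon=eps, C=C).fit(x[:, None], y)
        n_sv = svr.support_.size                   # patterns with non-zero alpha_i or alpha_i*
        print(f"gamma={gamma:5.3f}  eps={eps:4.2f}  C={C:3d}  ->  {n_sv:2d} support vectors")

With a small γ and a large ε only a few support vectors remain and the regression is stiff, whereas a large γ produces the zigzag behaviour described below for Fig. 7d.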

3.4 Results

The second chosen example shows the regression of twelve patterns situated roughly on a sine function (Fig. 7). The quality of the solution is determined by the general parameters C and ε of the SV regression as well as by the width parameter γ of the RBF kernel. A first solution is based on a wide and stiff kernel given by a small γ, on a moderate width of the insensitive area given by an ε in the scale of the expected standard deviation between the regression and the patterns, and finally on a moderate force on the slack variables given by an optimal C. This solution has just four support vectors (Fig. 7a). Thus, only every third pattern determines the regression. The approximation of the sine function is constructed by four automatically assembled lines that fit the sine function quite well. If the insensitive area is made smaller by a smaller ε, in combination with a relaxed force on the slack variables given by a smaller C, the approximation of the sine function looks much better. Unfortunately, this solution is less robust as a result of a higher number of support vectors (Fig. 7b).


Figure 7: Non-linear SV regression: the coloured patterns are the support vectors of the regression, the small dots represent all those patterns without an impact on the solution; the label "Sollfunktion" denotes the target sine function. The higher ε, the wider the insensitive area; the smaller γ, the stiffer the regression function; and the higher C, the more patterns are forced into the insensitive area.

The opposite example, one with high robustness, is reached by a minimal number of support vectors. In this very case the minimum is given by two vectors. Therefore, we combine a high ε with wide kernel functions (a small γ) and a moderate force onto the slack variables by an optimal C (Fig. 7c). Small RBF kernels given by a high γ deliver in this very case a bad solution: the a priori continuous sine function can hardly be recognised in such a zigzag regression (Fig. 7d). All solutions have the common characteristic that they yield a non-linear regression automatically. Provided that the user chooses more or less optimal parameters, we may expect that the properties of the unknown function are reflected optimally. It is up to the sophisticated user to find the optimal balance between robustness and complexity of the regression. General suggestions for this balance and for the parameters to reach it cannot be given here; it depends on the actual task that has to be solved.

4 Vapnik-Chervonenkis dimension

Each model – a regression as well – exhibits a certain complexity. This complexity has to be in a suitable balance with the dimension and the number of patterns. In the case of an imbalance – mostly the complexity is too high – the feared effect of over-fitting occurs (Heine, 1999; Miima, 2002; Heinert and Niemeier, 2007; Heinert, 2008a,b). This imbalance happens earlier than a low redundancy in the view of an adjustment (Niemeier, 2008).


Figure 8: An example of VC-dim(Φ(x)) = 3. There exist 2^3 possibilities how h = 3 patterns can be separated error-free into two classes by lines.

Since this problem is that serious, it is necessary to find a numerical measure of a model's complexity. Such a measure is the so-called Vapnik-Chervonenkis dimension, abbreviated VC-dim(F) (Haykin, 1999, p. 95).

Definition: The VC-dimension of an ensemble of dichotomies F = {Φ_w(x) : w ∈ W, Φ : R^m × W → {0, 1}} is the cardinality h = |L| of the largest set L that is shattered by F.

A more explanatory version of this quite compact definition is given by Vapnik himself (1998, p. 147): "The VC-dimension of a set of indicator functions Φ_w(x), w ∈ W, is equal to the largest number h of vectors that can be separated into two different classes in all the 2^h possible ways using this set of functions."

According to these quotations, the VC-dim(F) is the maximal number L of patterns which can be separated without errors into two classes in an n-dimensional data space R^n (Haykin, 1999, p. 94f). Consider that the patterns may have fuzzy degrees of membership and can therefore be elements of two or more sets or classes. Referring to this, h = |L| = VC(N) denotes the cardinality of separable patterns. Remember that the cardinality of a limited set is the sum of all degrees of membership of its elements (Bothe, 1995, p. 35). The membership function of an element describes each set membership with values between 0 and 1, which can be interpreted as a probability as well. The definition of the VC-dim(F) refers to the use of a suitable set w of indicator functions Φ_w(x), which carry out this shattering of all the patterns into two classes. The maximal number h, respectively the cardinality, can be realised in 2^h possible ways (Schölkopf and Smola, 2001, p. 9). Said in other words: every set of functions and their combinations has its own model capacity given by the VC dimension.

The practical computation of this dimension is not solved for all indicator functions in general (Koiran and Sontag, 1996; Elisseeff and Paugam-Moisy, 1997; Sontag, 1998; Schmitt, 2001, 2005). Furthermore, until now it is often only possible to bound this dimension by Bachmann-Landau notations depending on the number of model kernels. Nevertheless, the VC-dim(F) concept is of high importance for the demonstrated method of modelling. It emphasizes the necessity to reduce the model capacity. Why? Generally speaking, the VC dimension in comparison with the number of patterns describes the mental state of the model. Accordingly, we may assign the following terms to a model: "stupid" if the VC-dim(F) is too small, "intelligent" if the VC-dim(F) reaches the lower bound of an optimal size, "wise" if the model is really optimal, "experienced" if the model's VC-dim(F) is around the upper bound of an optimal size and finally "autistic" if it has a far too high size. Especially this last case is often found: the user creates a model with too many weights or parameters. The model will "remember" every single pattern instead of creating any general rules. Given a new pattern, the model will be confused.
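The toy situation of Fig. 8 can be checked by brute force: for h = 3 points in general position every one of the 2^3 labelings can be realised by a linear classifier, while a square of four points cannot be labelled in the XOR-like way. The sketch below is an illustration with assumed point coordinates; it uses a nearly hard-margin linear SVC as the set of indicator functions.

    import itertools
    import numpy as np
    from sklearn.svm import SVC

    def shatterable(points):
        # Check whether a linear classifier realises every two-class labeling of the points.
        for labels in itertools.product([-1, 1], repeat=len(points)):
            y = np.array(labels)
            if len(set(labels)) < 2:
                continue                           # single-class labelings are trivially separable
            clf = SVC(kernel="linear", C=1e6).fit(points, y)   # large C ~ hard margin
            if (clf.predict(points) != y).any():
                return False
        return True

    three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
    four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # contains the XOR labeling

    print("3 points shatterable by lines:", shatterable(three))   # True: VC-dim of lines >= 3
    print("4 points shatterable by lines:", shatterable(four))    # False for this configuration;
    # in fact no 4 points in the plane can be shattered by lines, so their VC dimension is 3.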

5 Résumé

Why is a support vector machine robust against over-fitting, although it uses powerful kernels and, in the case of RBF or ANOVA kernels, the theoretical VC-dim(F) can even be infinite? The decisive step has already been made in the set-up of the hyperplane. However the hyperplane is positioned between the patterns, only a few patterns on the margin or in its vicinity decide over the exact position of the decision surface. This very special group of support vectors is – provided the user chose suitable values for C and ε or kernel parameters like γ – significantly smaller than the number of available patterns. Accordingly, all patterns that are not support vectors are implicitly classified or approximated by the model. That way they do not have any impact on the optimization process, and therefore the ugly effect of model over-fitting is a priori excluded.

References

Bothe, H.-H.: Fuzzy Logic – Einführung in Theorie und Anwendung. 2nd ext. ed., Springer-Verlag, Berlin, Heidelberg, 1995.

Burges, C. J. C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, Kluwer Academic Publishers, 1998.

Domschke, W. and Drexl, A.: Einführung in Operations Research. 5th edition, Springer, Berlin, Heidelberg, 2002.

Elisseeff, A. and Paugam-Moisy, H.: Size of multilayer networks for exact learning: analytic approach. NeuroCOLT Techn. Rep. Series, NC-TR-97-002, 1997.

Grundmann, W.: Operations Research – Formeln und Methoden. Teubner, Stuttgart, Leipzig, Wiesbaden, 2002.

Haykin, S.: Neural Networks – A Comprehensive Foundation. 2nd edition, Prentice Hall, Upper Saddle River NJ, 1999.

Heine, K.: Beschreibung von Deformationsprozessen durch Volterra- und Fuzzy-Modelle sowie Neuronale Netze. PhD thesis, German Geodetic Commission, series C, issue 516, München, 1999.

Heinert, M. and Niemeier, W.: From fully automated observations to a neural network model inference: The Bridge "Fallersleben Gate" in Brunswick, Germany, 1999–2006. J. Appl. Geodesy 1, 2007, pp. 671–80.

Heinert, M.: Systemanalyse der seismisch bedingten Kinematik Islands. PhD thesis, Geod. Schriftenr. TU Braunschweig 22, Brunswick (Germany), 2008a.

Heinert, M.: Artificial neural networks – how to open the black boxes? In: Reiterer, A. and Egly, U. (Eds.): Application of Artificial Intelligence in Engineering Geodesy. Vienna, 2008b, pp. 42–62.

Heinert, M.: Support Vector Machines – Teil 1: Ein theoretischer Überblick. zfv 135 (3), 2010, in press.

Heinert, M. and Riedel, B.: Support Vector Machines – Teil 2: Praktische Beispiele und Anwendungen. zfv 135, 2010, in press.

Hillier, F. S. and Liebermann, G. J.: Operations Research – Einführung. 5th edition, Oldenbourg Verlag, Munich, Vienna, 2002.

Koiran, P. and Sontag, E. D.: Neural Networks with Quadratic VC Dimension. NeuroCOLT Techn. Rep. Series, NC-TR-95-044, 1996.

Kuhn, H. W. and Tucker, A. W.: Nonlinear Programming. Proc. of 2nd Berkeley Symp., pp. 481–492, Univ. of California Press, Berkeley, 1951.

Miima, J. B.: Artificial Neural Networks and Fuzzy Logic Techniques for the Reconstruction of Structural Deformations. PhD thesis, Geod. rep. series Univ. of Technology Brunswick (Germany), issue 18, 2002.

Niemeier, W.: Ausgleichsrechnung – Eine Einführung für Studierende und Praktiker des Vermessungs- und Geoinformationswesens. 2nd rev. and ext. edition, Walter de Gruyter, Berlin, New York, 2008.

Platt, J. C.: Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research, Technical Report MSR-TR-98-14, 1998.

Platt, J. C.: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Schölkopf, B., Burges, C. J. C. and Smola, A. J. (Eds.): Advances in kernel methods: support vector learning. MIT Press, Cambridge (MA), 1999, pp. 185–208.

Rardin, R. L.: Optimization in Operations Research. Prentice Hall, Upper Saddle River (USA), 1998.

Riedel, B. and Heinert, M.: An adapted support vector machine for velocity field interpolation at Baota landslide. In: Reiterer, A. and Egly, U. (Eds.): Application of Artificial Intelligence in Engineering Geodesy. Vienna, 2008, pp. 101–116.

Schmitt, M.: Radial basis function neural networks have superlinear VC dimension. In: Helmbold, D. and Williamson, B. (Eds.): Proceedings of the 14th Annual Conference on Computational Learning Theory COLT 2001 and 5th European Conference on Computational Learning Theory EuroCOLT 2001, Lecture Notes in Artificial Intelligence 2111, pp. 614–630, Springer-Verlag, Berlin, 2001.

Schmitt, M.: On the capabilities of higher-order neurons: A radial basis function approach. Neural Computation 17 (3), pp. 715–729, 2005.

Schölkopf, B. and Smola, A. J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2001.

Schölkopf, B., Cristianini, N., Jordan, M., Shawe-Taylor, J., Smola, A. J., Vapnik, V. N., Wahba, G., Williams, Chr. and Williamson, B.: Kernel-Machines.Org, 2007. URL: http://www.kernelmachines.org.

Sidle, R. C. and Ochiai, H.: Landslides – Processes, Prediction and Land Use. AGU Books Board, Washington, 2006.

Smola, A. J. and Schölkopf, B.: A tutorial on support vector regression. Statistics and Computing 14, pp. 199–222, 2004.

Sontag, E. D.: VC Dimension of Neural Networks. In: Bishop, C. (Ed.): Neural networks and machine learning, pp. 69–95, Springer-Verlag, Berlin, 1998.

Vapnik, V. N.: Statistical Learning Theory. In: Haykin, S. (Ed.): Adaptive and Learning Systems for Signal Processing, Communications and Control, John Wiley & Sons, New York, 1998.

Zhang, J. and Jiang, B.: GPS landslide monitoring of Yunyang Baota. Report of University Wuhan, 2003.


Appendix: Non-linear constrained optimisation using the SOLVER

How can the Lagrange multipliers be optimised under MS-Excel to get a maximised objective function? Within this software one can find under the menu "Extras" an add-in called SOLVER. This add-in, created by the software company Frontline Systems Inc., is not part of the standard installation; it has to be activated via the add-ins manager. The SOLVER enables us to outsource all questions of optimisation while we may concentrate on the algorithmic design of support vector machines. This is quite pleasant, because one can find a lot of books and scientific articles dealing only with model optimisation (Domschke and Drexl, 2002; Rardin, 1998; Platt, 1998, and many others). Accordingly, it is sufficient for the moment to know that the SOLVER carries out the maximisation of the objective function under MS-Excel. To this end the user assigns a "Target Cell" that contains the present value of the objective function (Fig. 9). In our case the option "Max" has to be chosen, so that the optimisation runs in the right direction. Furthermore the "Changing Cells", namely the cells containing the Lagrange multipliers, have to be assigned. Finally, the user enters the Karush-Kuhn-Tucker conditions under "Subject to the Constraints". Quite extended explanations and practical examples are given by Staiger (2007, in German).

Figure 9: SOLVER under the "Extras" menu in MS-Excel: assignment of the Lagrange multipliers as the "Changing Cells" and of the "Target Cell", whose value, representing the objective function, has to be maximised ("Max") "Subject to the Constraints" of the Karush-Kuhn-Tucker conditions.