Pattern Recognition Prof. Christian Bauckhage

outline lecture 19

recap

support vector machines
separable data
non-separable data

summary

binary classification: setting

assume labeled training data

$\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$

where $\mathbf{x}_i \in \mathbb{R}^m$ and

$y_i = \begin{cases} +1, & \text{if } \mathbf{x}_i \in \Omega_1 \\ -1, & \text{if } \mathbf{x}_i \in \Omega_2 \end{cases}$

binary classification: goal

train a classifier / fit a model

$y(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{x} \in \Omega_1 \\ -1 & \text{if } \mathbf{x} \in \Omega_2 \end{cases}$

and use it to classify new data points $\mathbf{x}$

binary classification: possible approach

linear classifier

$y(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{w}^T\mathbf{x} - w_0 > 0 \\ -1 & \text{if } \mathbf{w}^T\mathbf{x} - w_0 < 0 \end{cases}$

[figure: data points and a separating hyperplane with normal $\mathbf{w}$ and offset $w_0 / \lVert\mathbf{w}\rVert$]
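To make this concrete, here is a minimal numpy sketch of the decision rule; the function name classify is illustrative only, and $\mathbf{w}$, $w_0$ are assumed to come from one of the training methods discussed below.

import numpy as np

def classify(x, w, w0):
    # sign of w^T x - w0: +1 for class Omega_1, -1 for class Omega_2 (0 exactly on the plane)
    return np.sign(np.dot(w, x) - w0)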

note

training a linear classifier amounts to determining suitable $\mathbf{w}$ and $w_0$ from the available data

we have already studied
plain vanilla least squares
Fisher’s linear discriminant analysis

next, we shall study yet another method

support vector machines


⇔ a powerful and robust approach to pattern recognition due to Vapnik and Chervonenkis

Vladimir N. Vapnik (∗1936)

A.Y. Chervonenkis (∗1938, †2014)

idea

[figure: linearly separable data and a maximum margin separating hyperplane with normal $\mathbf{w}$ and offset $w_0 / \lVert\mathbf{w}\rVert$]

support vectors

⇔ those data vectors that determine the margin and consequently the separating hyperplane

problem

in general
many data points ($n \gg 1$)
high dimensions ($m \gg 1$)

⇔ we can hardly find the support vectors via visual inspection

solution

consider the margin $\rho$ between the two classes
⇔ determine the projection vector $\mathbf{w}$ that maximizes

$\rho = \min_{\mathbf{x} \in \Omega_1} \mathbf{w}^T\mathbf{x} \;-\; \max_{\mathbf{x} \in \Omega_2} \mathbf{w}^T\mathbf{x}$

[figure: the margin $\rho$ along the direction $\mathbf{w}$]

observe

w.l.o.g. we may scale $\mathbf{w}$ such that

$\min_{\mathbf{x} \in \Omega_1} \mathbf{w}^T\mathbf{x} - w_0 = +1$

$\max_{\mathbf{x} \in \Omega_2} \mathbf{w}^T\mathbf{x} - w_0 = -1$

[figure: separating hyperplane $\mathbf{w}^T\mathbf{x} - w_0 = 0$ and the two margin hyperplanes]


since $y_i \in \{+1, -1\}$, this is to say

$y_i \cdot \big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant 1 \quad \forall\, i = 1, \dots, n$

canonical representation of the separating hyperplane

$y_i \cdot \big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant 1 \quad \forall\, i = 1, \dots, n$

observe

two parallel hyperplanes

$\mathbf{w}^T\mathbf{x} = w_0 + 1$

$\mathbf{w}^T\mathbf{x} = w_0 - 1$

the normal of both planes is $\mathbf{w}$ where generally $\lVert\mathbf{w}\rVert \neq 1$

[figure: the two parallel hyperplanes and points $\mathbf{x}_1 = x_1 \mathbf{w}$, $\mathbf{x}_2 = x_2 \mathbf{w}$ on them]

observe

let $\mathbf{x}_1$ be a point in the 1st plane and a multiple of $\mathbf{w}$

$\mathbf{w}^T\mathbf{x}_1 = x_1\,\mathbf{w}^T\mathbf{w} = w_0 + 1$

let $\mathbf{x}_2$ be a point in the 2nd plane and a multiple of $\mathbf{w}$

$\mathbf{w}^T\mathbf{x}_2 = x_2\,\mathbf{w}^T\mathbf{w} = w_0 - 1$

so that

$x_1 = \dfrac{w_0 + 1}{\lVert\mathbf{w}\rVert^2} \qquad x_2 = \dfrac{w_0 - 1}{\lVert\mathbf{w}\rVert^2}$

observe

the margin or distance between the two hyperplanes is

$\rho = \big\lVert \mathbf{x}_1 - \mathbf{x}_2 \big\rVert = \left\lVert \frac{w_0 + 1}{\lVert\mathbf{w}\rVert^2}\,\mathbf{w} - \frac{w_0 - 1}{\lVert\mathbf{w}\rVert^2}\,\mathbf{w} \right\rVert = \frac{(w_0 + 1) - (w_0 - 1)}{\lVert\mathbf{w}\rVert^2}\,\lVert\mathbf{w}\rVert = \frac{2}{\lVert\mathbf{w}\rVert^2}\,\lVert\mathbf{w}\rVert = \frac{2}{\lVert\mathbf{w}\rVert}$

note

the overall idea of maximizing the margin

$\rho = \frac{2}{\lVert\mathbf{w}\rVert}$

between the two classes thus leads to a constrained optimization problem

$\underset{\mathbf{w},\, w_0}{\operatorname{argmin}} \ \frac{1}{2}\,\lVert\mathbf{w}\rVert^2 \quad \text{s.t.} \quad y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant 1, \quad i = 1, \dots, n$

primal problem of SVM training

$\underset{\mathbf{w},\, w_0}{\operatorname{argmin}} \ \frac{1}{2}\,\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 \geqslant 0, \quad i = 1, \dots, n$
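As a complement, a minimal sketch (not part of the lecture) of solving this primal problem numerically with scipy's SLSQP solver; the data used here is the didactic example introduced below, so the expected solution is $\mathbf{w} = (-0.8, 0.4)^T$, $w_0 = -0.2$.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1., 4., 3.],
              [4., 2., 3.]])                    # training points as columns
y = np.array([1., -1., -1.])

def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)                   # 1/2 w^T w

def constraints(v):
    w, w0 = v[:2], v[2]
    return y * (X.T @ w - w0) - 1.              # y_i (w^T x_i - w0) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': constraints}])
print(res.x)                                    # approx. w = (-0.8, 0.4), w0 = -0.2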

Lagrangian

$L(\mathbf{w}, w_0, \boldsymbol{\mu}) = \frac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{n} \mu_i\,\big[ y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 \big]$

$= \frac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{n} \big( \mu_i y_i \mathbf{w}^T\mathbf{x}_i - \mu_i y_i w_0 - \mu_i \big)$

$= \frac{1}{2}\,\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_{i=1}^{n} \mu_i y_i \mathbf{x}_i + w_0 \sum_{i=1}^{n} \mu_i y_i + \sum_{i=1}^{n} \mu_i$

KKT conditions

KKT 1 and KKT 2

$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \mu_i y_i \mathbf{x}_i \overset{!}{=} 0 \qquad (1)$

$\frac{\partial L}{\partial w_0} = \sum_{i=1}^{n} \mu_i y_i \overset{!}{=} 0 \qquad (2)$

KKT 3 and KKT 4

$\mu_i \geqslant 0 \quad \text{and} \quad \mu_i\,\big[ y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 \big] = 0 \qquad (3)$

didactic example

labeled training data

$X = \big(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\big) = \begin{pmatrix} 1 & 4 & 3 \\ 4 & 2 & 3 \end{pmatrix}$

$\mathbf{y} = \big(y_1, y_2, y_3\big) = \big({+1}, {-1}, {-1}\big)$

for convenience define $\mathbf{z}_i = y_i \mathbf{x}_i$ and $Z = \big(\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3\big)$

[figure: the three training points $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ in the plane]

didactic example (cont.)

from KKT 1 and KKT 2 with active constraints, we have

$\begin{pmatrix} I & \mathbf{0} & -Z \\ \mathbf{0}^T & 0 & \mathbf{y}^T \\ Z^T & -\mathbf{y} & 0 \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ w_0 \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ 0 \\ \mathbf{1} \end{pmatrix}$

where the rows encode (1), (2), and the three active constraints $\mathbf{z}_i^T\mathbf{w} - y_i w_0 = 1$

this yields

$\big(w_1, w_2, w_0, \mu_1, \mu_2, \mu_3\big) = \big({-2},\, {-2},\, {-11},\, 4,\, {-6},\, 10\big)$

which violates KKT 3 since $\mu_2 < 0$

didactic example (cont.)

we therefore consider the reduced system without the (inactive) constraint for $\mathbf{x}_2$, i.e. with $\mu_2 = 0$

$\begin{pmatrix} I & \mathbf{0} & -\mathbf{z}_1 & -\mathbf{z}_3 \\ \mathbf{0}^T & 0 & y_1 & y_3 \\ \mathbf{z}_1^T & -y_1 & 0 & 0 \\ \mathbf{z}_3^T & -y_3 & 0 & 0 \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ w_0 \\ \mu_1 \\ \mu_3 \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ 0 \\ 1 \\ 1 \end{pmatrix}$

and obtain

$\big(w_1, w_2, w_0, \mu_1, \mu_3\big) = \big({-0.8},\, 0.4,\, {-0.2},\, 0.4,\, 0.4\big)$

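The two linear systems of this example are small enough to solve directly; here is a minimal numpy sketch (assuming the column-wise data layout of the slides) that reproduces the numbers above.

import numpy as np

X = np.array([[1., 4., 3.],
              [4., 2., 3.]])                    # training points as columns
y = np.array([1., -1., -1.])
Z = X * y                                       # z_i = y_i x_i

def kkt_system(Z, y):
    # assemble the linear system in (w, w0, mu) assuming all given constraints are active
    m, n = Z.shape
    A = np.zeros((m + 1 + n, m + 1 + n))
    A[:m, :m] = np.eye(m)                       # dL/dw  : w - Z mu = 0
    A[:m, m+1:] = -Z
    A[m, m+1:] = y                              # dL/dw0 : y^T mu = 0
    A[m+1:, :m] = Z.T                           # active constraints: z_i^T w - y_i w0 = 1
    A[m+1:, m] = -y
    b = np.concatenate([np.zeros(m + 1), np.ones(n)])
    return A, b

A, b = kkt_system(Z, y)
print(np.linalg.solve(A, b))                    # -> (-2, -2, -11, 4, -6, 10), mu_2 < 0

A, b = kkt_system(Z[:, [0, 2]], y[[0, 2]])      # drop x_2, i.e. set mu_2 = 0
print(np.linalg.solve(A, b))                    # -> (-0.8, 0.4, -0.2, 0.4, 0.4)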

note

for support vectors, the constraints are active

⇒ for the Lagrange multipliers, we always have

$\mu_i \begin{cases} > 0 & \text{if } \mathbf{x}_i \text{ is a support vector} \\ = 0 & \text{otherwise} \end{cases}$

note

for 3 data points in 2 dimensions, we first had to invert a 6 × 6 matrix and then a 5 × 5 matrix

in general, we have to determine
1 parameter $w_0$
$m$ parameters $\mathbf{w}$
$n$ parameters $\boldsymbol{\mu}$

which would involve inverting an $(m + 1 + n) \times (m + 1 + n)$ matrix

this is at least $\mathcal{O}\big((m + 1 + n)^{2.37}\big)$ and may not be practical

observe

given (1) and (2), we may reconsider the Lagrangian

$L(\mathbf{w}, w_0, \boldsymbol{\mu}) = \frac{1}{2}\,\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \underbrace{\sum_{i=1}^{n} \mu_i y_i \mathbf{x}_i}_{=\,\mathbf{w}} + \, w_0 \underbrace{\sum_{i=1}^{n} \mu_i y_i}_{=\,0} + \sum_{i=1}^{n} \mu_i$

$= -\frac{1}{2}\,\mathbf{w}^T\mathbf{w} + \sum_{i=1}^{n} \mu_i$

$= -\frac{1}{2}\,\mathbf{w}^T\mathbf{w} + \mathbf{1}^T\boldsymbol{\mu}$

$= L(\mathbf{w}, \boldsymbol{\mu})$

observe

using (1) once more, we obtain

$D(\boldsymbol{\mu}) = -\frac{1}{2} \sum_{i,j} \mu_i y_i\, \mathbf{x}_i^T\mathbf{x}_j\, y_j \mu_j + \mathbf{1}^T\boldsymbol{\mu} = -\frac{1}{2}\,\boldsymbol{\mu}^T G\, \boldsymbol{\mu} + \mathbf{1}^T\boldsymbol{\mu}$

where $G_{ij} = y_i\, \mathbf{x}_i^T\mathbf{x}_j\, y_j$

dual problem of SVM training

$\underset{\boldsymbol{\mu}}{\operatorname{argmax}} \ -\frac{1}{2}\,\boldsymbol{\mu}^T G\, \boldsymbol{\mu} + \mathbf{1}^T\boldsymbol{\mu} \quad \text{s.t.} \quad \mathbf{y}^T\boldsymbol{\mu} = 0, \ \ \boldsymbol{\mu} \geqslant \mathbf{0}$

observe

if we can solve this problem, we can finally compute

$\mathbf{w} = \sum_{i=1}^{n} \mu_i y_i \mathbf{x}_i = \sum_{\mu_i > 0} \mu_i y_i \mathbf{x}_i = \sum_{s \in S} \mu_s y_s \mathbf{x}_s$

also, since for a support vector $\mathbf{x}_s$ we have $y_s\big(\mathbf{w}^T\mathbf{x}_s - w_0\big) = 1$, we can directly compute $w_0$

a robust solution is to exploit that $y_s^2 = 1$ and therefore

$\mathbf{w}^T\mathbf{x}_s - w_0 = y_s \ \Leftrightarrow\ w_0 = \mathbf{w}^T\mathbf{x}_s - y_s$

and then to compute

$w_0 = \frac{1}{|S|} \sum_{s \in S} \big(\mathbf{w}^T\mathbf{x}_s - y_s\big)$
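For illustration, a minimal sketch (again using scipy's SLSQP rather than a method from the lecture) that solves this dual for the didactic example and recovers $\mathbf{w}$ and $w_0$ with the two formulas above; the threshold 1e-8 for detecting support vectors is an arbitrary choice.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1., 4., 3.],
              [4., 2., 3.]])                    # training points as columns
y = np.array([1., -1., -1.])
n = y.size
G = (X.T @ X) * np.outer(y, y)                  # G_ij = y_i x_i^T x_j y_j

# maximizing -1/2 mu^T G mu + 1^T mu is minimizing its negative
res = minimize(lambda mu: 0.5 * mu @ G @ mu - mu.sum(),
               x0=np.ones(n) / n,
               jac=lambda mu: G @ mu - np.ones(n),
               method='SLSQP',
               bounds=[(0., None)] * n,                                 # mu >= 0
               constraints=[{'type': 'eq', 'fun': lambda mu: y @ mu}])  # y^T mu = 0
mu = res.x

S = mu > 1e-8                                   # support vectors
w = X[:, S] @ (mu[S] * y[S])                    # w = sum_s mu_s y_s x_s
w0 = np.mean(X[:, S].T @ w - y[S])              # w0 = 1/|S| sum_s (w^T x_s - y_s)
print(w, w0)                                    # approx. w = (-0.8, 0.4), w0 = -0.2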

note

often, the two classes $\Omega_1$ and $\Omega_2$ are not linearly separable; in this case, neither of the above approaches (primal / dual) will work, i.e. neither constrained problem can be solved

idea

still maximize the margin but allow some points to lie on “the wrong side” of the separating plane

⇔ introduce additional slack variables $\xi_i \geqslant 0$ such that

$y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant 1 - \xi_i$

⇔ a misclassified data point $\mathbf{x}_i$ is one where $\xi_i > 1$

observe

since we are trying to reduce the number of erroneous classifications, a reasonable idea is to solve the . . .

primal problem

$\underset{\mathbf{w},\, w_0,\, \boldsymbol{\xi}}{\operatorname{argmin}} \ \frac{1}{2}\,\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{n} \xi_i$

$\text{s.t.} \quad y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 + \xi_i \geqslant 0, \quad i = 1, \dots, n$

$\qquad\ \ \xi_i \geqslant 0, \quad i = 1, \dots, n$

Lagrangian

$L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\mu}, \boldsymbol{\nu}) = \frac{1}{2}\,\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \mu_i\,\big[ y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 + \xi_i \big] - \sum_{i=1}^{n} \nu_i \xi_i$

KKT conditions

$\frac{\partial L}{\partial \mathbf{w}} \overset{!}{=} 0 \ \Rightarrow\ \mathbf{w} = \sum_{i=1}^{n} \mu_i y_i \mathbf{x}_i$

$\frac{\partial L}{\partial w_0} \overset{!}{=} 0 \ \Rightarrow\ 0 = \sum_{i=1}^{n} \mu_i y_i$

$\frac{\partial L}{\partial \xi_i} \overset{!}{=} 0 \ \Rightarrow\ \mu_i = C - \nu_i$

$\xi_i \geqslant 0$

$y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 + \xi_i \geqslant 0$

$\mu_i \geqslant 0, \quad \nu_i \geqslant 0$

$\mu_i\,\big[ y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - 1 + \xi_i \big] = 0$

$\nu_i \xi_i = 0$

note that $\mu_i = C - \nu_i$ together with $\mu_i, \nu_i \geqslant 0$ implies $0 \leqslant \mu_i \leqslant C$

dual problem

$\underset{\boldsymbol{\mu}}{\operatorname{argmax}} \ -\frac{1}{2}\,\boldsymbol{\mu}^T G\, \boldsymbol{\mu} + \mathbf{1}^T\boldsymbol{\mu} \quad \text{s.t.} \quad \mathbf{y}^T\boldsymbol{\mu} = 0, \ \ \mathbf{0} \leqslant \boldsymbol{\mu} \leqslant C\,\mathbf{1}$
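In the SLSQP sketch shown earlier, this box constraint is the only change needed; a minimal fragment, with C = 1.0 and n = 3 as arbitrary example values:

C = 1.0
n = 3
bounds = [(0., C)] * n      # 0 <= mu_i <= C, replacing [(0., None)] * n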

note

the term

$L = C \sum_{i=1}^{n} \xi_i$

is a loss function

we may also consider other loss functions

L2 SVM

primal problem

$\underset{\mathbf{w},\, w_0,\, \boldsymbol{\xi}}{\operatorname{argmin}} \ \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant \rho - \xi_i$

dual problem

$\underset{\boldsymbol{\mu}}{\operatorname{argmax}} \ -\boldsymbol{\mu}^T\big(G + \mathbf{y}\mathbf{y}^T + \tfrac{1}{C}\, I\big)\boldsymbol{\mu} \quad \text{s.t.} \quad \mathbf{1}^T\boldsymbol{\mu} = 1, \ \ \boldsymbol{\mu} \geqslant \mathbf{0}$

training an L2 SVM using Frank-Wolfe

import numpy as np

def trainL2SVM(X, y, C=1., T=1000):
    # X: m x n matrix of n training points in R^m (one point per column)
    # y: vector of n labels in {+1, -1}
    m, n = X.shape
    Z = X * y                                  # z_i = y_i * x_i
    I = np.eye(n)
    # matrix of the L2 SVM dual: G + y y^T + 1/C I, with G_ij = z_i^T z_j
    M = np.dot(Z.T, Z) + np.outer(y, y) + 1./C * I
    mu = 1./n * np.ones(n)                     # start in the center of the simplex
    for t in range(T):
        eta = 2./(t + 2)                       # Frank-Wolfe step size
        grd = 2 * np.dot(M, mu)                # gradient of mu^T M mu
        mu += eta * (I[np.argmin(grd)] - mu)   # step towards the best vertex of the simplex
    w = np.dot(Z, mu)                          # proportional to w  =  1/2 sum_i mu_i z_i
    w0 = -np.dot(mu, y)                        # proportional to w0 = -1/2 y^T mu (see the derivation below)
    return w, w0
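A small usage sketch for trainL2SVM on synthetic data; the Gaussian blobs, the random seed, and C = 10 are arbitrary choices, and classification follows the lecture's rule y(x) = sign(w^T x - w0).

import numpy as np

np.random.seed(0)
n1, n2 = 50, 50
X1 = np.random.randn(2, n1) + np.array([[2.], [2.]])     # class Omega_1
X2 = np.random.randn(2, n2) + np.array([[7.], [7.]])     # class Omega_2
X = np.hstack([X1, X2])                                   # points as columns
y = np.hstack([np.ones(n1), -np.ones(n2)])

w, w0 = trainL2SVM(X, y, C=10., T=2000)
yhat = np.sign(np.dot(w, X) - w0)
print('training accuracy:', np.mean(yhat == y))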

examples

[figures: decision boundaries of perceptron, least squares, LDA, and SVM classifiers on two 2D toy data sets]

summary

we now know about

support vector machines for binary classification
linearly separable data
non-separable data

primal and dual forms of the SVM training problem

an utterly simple implementation of SVM training based on the Frank-Wolfe algorithm

turn page for final words of wisdom . . .

deriving the dual problem of L2 SVM training

the primal problem is given by

$\underset{\mathbf{w},\, w_0,\, \boldsymbol{\xi}}{\operatorname{argmin}} \ \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) \geqslant \rho - \xi_i$

the Lagrangian is

$L(\mathbf{w}, w_0, \boldsymbol{\xi}, \rho, \boldsymbol{\mu}) = \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} - \sum_i \mu_i\,\big[ y_i\big(\mathbf{w}^T\mathbf{x}_i - w_0\big) - \rho + \xi_i \big]$

$= \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} - \mathbf{w}^T \sum_i \mu_i \underbrace{y_i \mathbf{x}_i}_{=\,\mathbf{z}_i} + w_0 \sum_i \mu_i y_i + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \boldsymbol{\xi}^T\boldsymbol{\mu}$

$= \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} - \mathbf{w}^T \sum_i \mu_i \mathbf{z}_i + w_0\,\mathbf{y}^T\boldsymbol{\mu} + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \boldsymbol{\xi}^T\boldsymbol{\mu}$

we have

$\frac{\partial L}{\partial \mathbf{w}} = 2\,\mathbf{w} - \sum_i \mu_i \mathbf{z}_i \overset{!}{=} 0 \ \Rightarrow\ \mathbf{w} = \frac{1}{2} \sum_i \mu_i \mathbf{z}_i$

$\frac{\partial L}{\partial w_0} = 2\,w_0 + \mathbf{y}^T\boldsymbol{\mu} \overset{!}{=} 0 \ \Rightarrow\ w_0 = -\frac{1}{2}\,\mathbf{y}^T\boldsymbol{\mu}$

$\frac{\partial L}{\partial \boldsymbol{\xi}} = 2\,C\,\boldsymbol{\xi} - \boldsymbol{\mu} \overset{!}{=} 0 \ \Rightarrow\ \boldsymbol{\xi} = \frac{1}{2C}\,\boldsymbol{\mu}$

$\frac{\partial L}{\partial \rho} = -1 + \mathbf{1}^T\boldsymbol{\mu} \overset{!}{=} 0 \ \Rightarrow\ \mathbf{1}^T\boldsymbol{\mu} = 1$

the Lagrangian can thus be written as

$L(\mathbf{w}, w_0, \boldsymbol{\xi}, \rho, \boldsymbol{\mu}) = \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} - \mathbf{w}^T \sum_i \mu_i \mathbf{z}_i + w_0\,\mathbf{y}^T\boldsymbol{\mu} + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \boldsymbol{\xi}^T\boldsymbol{\mu}$

$= \mathbf{w}^T\mathbf{w} + w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} - 2\,\mathbf{w}^T\mathbf{w} - 2\,w_0^2 + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \boldsymbol{\xi}^T\boldsymbol{\mu}$

$= -\mathbf{w}^T\mathbf{w} - w_0^2 - \rho + C\,\boldsymbol{\xi}^T\boldsymbol{\xi} + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \boldsymbol{\xi}^T\boldsymbol{\mu}$

$= -\frac{1}{4}\,\boldsymbol{\mu}^T G\,\boldsymbol{\mu} - \frac{1}{4}\,\boldsymbol{\mu}^T\mathbf{y}\mathbf{y}^T\boldsymbol{\mu} - \rho + \frac{C}{4C^2}\,\boldsymbol{\mu}^T\boldsymbol{\mu} + \rho\,\mathbf{1}^T\boldsymbol{\mu} - \frac{1}{2C}\,\boldsymbol{\mu}^T\boldsymbol{\mu}$

$= -\frac{1}{4}\,\boldsymbol{\mu}^T G\,\boldsymbol{\mu} - \frac{1}{4}\,\boldsymbol{\mu}^T\mathbf{y}\mathbf{y}^T\boldsymbol{\mu} - \rho - \frac{1}{4C}\,\boldsymbol{\mu}^T\boldsymbol{\mu} + \rho\,\mathbf{1}^T\boldsymbol{\mu}$

$= -\frac{1}{4}\,\boldsymbol{\mu}^T G\,\boldsymbol{\mu} - \frac{1}{4}\,\boldsymbol{\mu}^T\mathbf{y}\mathbf{y}^T\boldsymbol{\mu} - \rho - \frac{1}{4}\,\boldsymbol{\mu}^T \tfrac{1}{C} I\,\boldsymbol{\mu} + \rho\,\mathbf{1}^T\boldsymbol{\mu}$

$= D(\rho, \boldsymbol{\mu})$

where $G_{ij} = \mathbf{z}_i^T\mathbf{z}_j = y_i\,\mathbf{x}_i^T\mathbf{x}_j\,y_j$

the dual problem therefore is

$\underset{\boldsymbol{\mu}}{\operatorname{argmax}} \ -\frac{1}{4}\,\boldsymbol{\mu}^T\big(G + \mathbf{y}\mathbf{y}^T + \tfrac{1}{C}\, I\big)\boldsymbol{\mu} \quad \text{s.t.} \quad \mathbf{1}^T\boldsymbol{\mu} = 1, \ \ \boldsymbol{\mu} \geqslant \mathbf{0}$

note that we may safely ignore the scaling factor of $\frac{1}{4}$

this then leads exactly to the form we presented above