Pattern Recognition Prof. Christian Bauckhage
outline lecture 19
recap
support vector machines
separable data
non-separable data
summary
binary classification: setting
assume labeled training data
$\{(x_i, y_i)\}_{i=1}^{n}$
where $x_i \in \mathbb{R}^m$ and
$y_i = \begin{cases} +1, & \text{if } x_i \in \Omega_1 \\ -1, & \text{if } x_i \in \Omega_2 \end{cases}$
binary classification: goal
train a classifier / fit a model
$y(x) = \begin{cases} +1 & \text{if } x \in \Omega_1 \\ -1 & \text{if } x \in \Omega_2 \end{cases}$
and use it to classify new data points x
binary classification: possible approach
linear classifier
$y(x) = \begin{cases} +1 & \text{if } w^T x - w_0 > 0 \\ -1 & \text{if } w^T x - w_0 < 0 \end{cases}$
[figure: separating line with normal $w$ at distance $w_0 / \|w\|$ from the origin]
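as a minimal sketch (the helper name and NumPy setup are assumptions of this write-up, not from the lecture), the decision rule reads:

import numpy as np

def linear_classify(x, w, w0):
    # evaluate y(x) = sign(w^T x - w0) for a single data point x
    return 1 if np.dot(w, x) - w0 > 0 else -1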
note
training a linear classifier amounts to determining suitable w and w0 from the available data

we have already studied
plain vanilla least squares
Fisher's linear discriminant analysis
next, we shall study yet another method
support vector machines
support vector machines
⇔ a powerful and robust approach to pattern recognition due to Vapnik and Chervonenkis
Vladimir N. Vapnik (∗1936)
A.Y. Chervonenkis (∗1938, †2014)
idea

maximize the margin between the two classes
[figure: separating hyperplane with normal $w$ at offset $w_0 / \|w\|$]
support vectors
⇔ those data vectors that determine the margin and consequently the separating hyperplane
problem
in general, there are many data points (n ≫ 1) in high dimensions (m ≫ 1)
⇔ we can hardly find the support vectors via visual inspection
solution
consider the margin ρ between the two classes
⇔ determine a projection vector w that maximizes
$\rho = \min_{x \in \Omega_1} w^T x - \max_{x \in \Omega_2} w^T x$
observe

w.l.o.g. we may scale w such that
$\min_{x \in \Omega_1} w^T x - w_0 = +1$
$\max_{x \in \Omega_2} w^T x - w_0 = -1$
[figure: canonical hyperplane $w^T x - w_0 = 0$ with normal $w$]

since $y_i \in \{+1, -1\}$, this is to say
$y_i \cdot \left( w^T x_i - w_0 \right) \geq 1 \quad \forall\, i = 1, \ldots, n$
canonical representation of the separating hyperplane
$y_i \cdot \left( w^T x_i - w_0 \right) \geq 1 \quad \forall\, i = 1, \ldots, n$
observe

two parallel hyperplanes
$w^T x = w_0 + 1$
$w^T x = w_0 - 1$

the normal of both planes is w where generally $\|w\| \neq 1$
observe

let $x_1 = \tilde{x}_1 w$ be a point in the 1st plane and a multiple of w
$w^T x_1 = \tilde{x}_1\, w^T w = w_0 + 1$

let $x_2 = \tilde{x}_2 w$ be a point in the 2nd plane and a multiple of w
$w^T x_2 = \tilde{x}_2\, w^T w = w_0 - 1$

so that
$\tilde{x}_1 = \frac{w_0 + 1}{\|w\|^2} \qquad \tilde{x}_2 = \frac{w_0 - 1}{\|w\|^2}$
observe

the margin or distance between the two hyperplanes is
$\rho = \|x_1 - x_2\| = \left\| \frac{w_0 + 1}{\|w\|^2}\, w - \frac{w_0 - 1}{\|w\|^2}\, w \right\| = \frac{w_0 + 1 - (w_0 - 1)}{\|w\|^2}\, \|w\| = \frac{2}{\|w\|^2}\, \|w\| = \frac{2}{\|w\|}$
note

the overall idea of maximizing the margin
$\rho = \frac{2}{\|w\|}$
between the two classes thus leads to a constrained optimization problem
$\underset{w,\, w_0}{\operatorname{argmin}} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i \left( w^T x_i - w_0 \right) \geq 1, \quad i = 1, \ldots, n$
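since this is a quadratic objective under linear inequality constraints, a generic solver can handle it directly; the following is a minimal sketch (function name and solver choice are assumptions of this write-up, not from the lecture), with the samples as the columns of X as in the slides:

import numpy as np
from scipy.optimize import minimize

def train_svm_primal(X, y):
    # minimize 1/2 w^T w  s.t.  y_i (w^T x_i - w0) >= 1
    # the unknowns are stacked as p = (w, w0)
    m, n = X.shape
    fun = lambda p: 0.5 * np.dot(p[:m], p[:m])
    cons = {'type': 'ineq',
            'fun': lambda p: y * (np.dot(X.T, p[:m]) - p[m]) - 1.}
    res = minimize(fun, np.zeros(m + 1), constraints=[cons])
    return res.x[:m], res.x[m]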
primal problem of SVM training
$\underset{w,\, w_0}{\operatorname{argmin}} \; \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i \left( w^T x_i - w_0 \right) - 1 \geq 0, \quad i = 1, \ldots, n$
Lagrangian

$L(w, w_0, \mu) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \mu_i \left[ y_i \left( w^T x_i - w_0 \right) - 1 \right]$
$= \frac{1}{2} w^T w - \sum_{i=1}^{n} \left( \mu_i y_i w^T x_i - \mu_i y_i w_0 - \mu_i \right)$
$= \frac{1}{2} w^T w - w^T \sum_{i=1}^{n} \mu_i y_i x_i + w_0 \sum_{i=1}^{n} \mu_i y_i + \sum_{i=1}^{n} \mu_i$
KKT conditions

KKT 1
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \mu_i y_i x_i \stackrel{!}{=} 0 \tag{1}$

KKT 2
$\frac{\partial L}{\partial w_0} = \sum_{i=1}^{n} \mu_i y_i \stackrel{!}{=} 0 \tag{2}$

KKT 3 and KKT 4
$\mu_i \geq 0 \quad \text{and} \quad \mu_i \left[ y_i \left( w^T x_i - w_0 \right) - 1 \right] = 0 \tag{3}$
didactic example
labeled training data
$X = \left[ x_1, x_2, x_3 \right] = \begin{bmatrix} 1 & 4 & 3 \\ 4 & 2 & 3 \end{bmatrix}$
$y = \left[ y_1, y_2, y_3 \right] = \left[ +1, -1, -1 \right]$
for convenience, define $z_i = y_i x_i$ and $Z = \left[ z_1, z_2, z_3 \right]$
[figure: scatter plot of $x_1$, $x_2$, $x_3$]
didactic example (cont.)

from KKT 1 and KKT 2 with all constraints active, we have the linear system
$\begin{pmatrix} I & 0 & -Z \\ 0^T & 0 & y^T \\ Z^T & -y & 0 \end{pmatrix} \begin{pmatrix} w \\ w_0 \\ \mu \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \mathbf{1} \end{pmatrix}$

this yields
$\left( w_1, w_2, w_0, \mu_1, \mu_2, \mu_3 \right) = \left( -2, -2, -11, 4, -6, 10 \right)$
which violates KKT 3 (since $\mu_2 < 0$)
didactic example (cont.)

we therefore drop the constraint for $x_2$ (i.e. set $\mu_2 = 0$) and consider the reduced system
$\begin{pmatrix} I & 0 & -\tilde{Z} \\ 0^T & 0 & \tilde{y}^T \\ \tilde{Z}^T & -\tilde{y} & 0 \end{pmatrix} \begin{pmatrix} w \\ w_0 \\ \tilde{\mu} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \mathbf{1} \end{pmatrix} \quad \text{where } \tilde{Z} = \left[ z_1, z_3 \right], \; \tilde{y} = \begin{pmatrix} y_1 \\ y_3 \end{pmatrix}, \; \tilde{\mu} = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}$

and obtain
$\left( w_1, w_2, w_0, \mu_1, \mu_3 \right) = \left( -0.8, \; 0.4, \; -0.2, \; 0.4, \; 0.4 \right)$
[figure: $x_1$ and $x_3$ on the margins, separating line with normal $w$]
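both small systems are easy to check numerically; the following sketch (helper name and setup are assumptions of this write-up, not from the lecture) solves them with NumPy:

import numpy as np

# data of the didactic example; samples are the columns of X
X = np.array([[1., 4., 3.],
              [4., 2., 3.]])
y = np.array([1., -1., -1.])
Z = X * y                                 # columns z_i = y_i x_i

def solve_active(idx):
    # solve KKT 1 and KKT 2 with the constraints in idx taken as active
    Zi, yi = Z[:, idx], y[idx]
    m, k = Zi.shape
    A = np.zeros((m + 1 + k, m + 1 + k))
    A[:m, :m] = np.eye(m)                 # KKT 1:  w - Z mu = 0
    A[:m, m+1:] = -Zi
    A[m, m+1:] = yi                       # KKT 2:  y^T mu = 0
    A[m+1:, :m] = Zi.T                    # active: z_i^T w - y_i w0 = 1
    A[m+1:, m] = -yi
    b = np.zeros(m + 1 + k)
    b[m+1:] = 1.
    return np.linalg.solve(A, b)          # returns (w1, w2, w0, mu, ...)

print(solve_active([0, 1, 2]))            # [-2. -2. -11.  4. -6. 10.], mu_2 < 0
print(solve_active([0, 2]))               # [-0.8  0.4 -0.2  0.4  0.4]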
note
for support vectors, the constraints are active ⇒ for the Lagrange multipliers, we always have
$\mu_i \begin{cases} > 0 & \text{if } x_i \text{ is a support vector} \\ = 0 & \text{otherwise} \end{cases}$
note

for 3 data points in 2 dimensions, we first had to invert a 6 × 6 matrix and then a 5 × 5 matrix

in general, we have to determine
1 parameter $w_0$
m parameters $w$
n parameters $\mu$

which would involve inverting an $(m + 1 + n) \times (m + 1 + n)$ matrix

this is at least $O\left( (m + 1 + n)^{2.37} \right)$ and may not be practical
observe

given (1) and (2), we may reconsider the Lagrangian
$L(w, w_0, \mu) = \frac{1}{2} w^T w - w^T \underbrace{\sum_{i=1}^{n} \mu_i y_i x_i}_{=\, w} + w_0 \underbrace{\sum_{i=1}^{n} \mu_i y_i}_{=\, 0} + \sum_{i=1}^{n} \mu_i$
$= -\frac{1}{2} w^T w + \sum_{i=1}^{n} \mu_i$
$= -\frac{1}{2} w^T w + \mathbf{1}^T \mu$
$= L(w, \mu)$
observe

using (1) once more, we obtain
$D(\mu) = -\frac{1}{2} \sum_{i,j} \mu_i y_i x_i^T x_j y_j \mu_j + \mathbf{1}^T \mu = -\frac{1}{2} \mu^T G \mu + \mathbf{1}^T \mu$
where $G_{ij} = y_i\, x_i^T x_j\, y_j$
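as a side note, G has a one-line vectorized form when the samples are the columns of X (a sketch; the helper name is an assumption of this write-up):

import numpy as np

def gram(X, y):
    # G_ij = y_i x_i^T x_j y_j
    return np.outer(y, y) * np.dot(X.T, X)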
dual problem of SVM training
$\underset{\mu}{\operatorname{argmax}} \; -\frac{1}{2} \mu^T G \mu + \mathbf{1}^T \mu \quad \text{s.t.} \quad y^T \mu = 0, \quad \mu \geq 0$
observe

if we can solve this problem, we can finally compute
$w = \sum_{i=1}^{n} \mu_i y_i x_i = \sum_{\mu_i > 0} \mu_i y_i x_i = \sum_{s \in S} \mu_s y_s x_s$

also, since for a support vector $x_s$, $y_s \left( w^T x_s - w_0 \right) = 1$, we can directly compute $w_0$

a robust solution is to exploit that $y_s^2 = 1$ and therefore
$w^T x_s - w_0 = y_s \;\Leftrightarrow\; w_0 = w^T x_s - y_s$
and then to compute
$w_0 = \frac{1}{|S|} \sum_{s \in S} \left( w^T x_s - y_s \right)$
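to make this concrete, here is a hedged sketch (function name, tolerance, and the use of a generic SLSQP solver are assumptions of this write-up, not the lecture's method) that solves the dual and recovers w and w0 as above:

import numpy as np
from scipy.optimize import minimize

def train_svm_dual(X, y):
    # maximize -1/2 mu^T G mu + 1^T mu  s.t.  y^T mu = 0, mu >= 0
    # (we minimize the negated objective); samples are the columns of X
    n = X.shape[1]
    Z = X * y                                   # columns z_i = y_i x_i
    G = np.dot(Z.T, Z)                          # G_ij = y_i x_i^T x_j y_j
    fun = lambda mu: 0.5 * mu @ G @ mu - mu.sum()
    jac = lambda mu: G @ mu - 1.
    cons = {'type': 'eq', 'fun': lambda mu: y @ mu}
    res = minimize(fun, np.ones(n) / n, jac=jac,
                   bounds=[(0., None)] * n, constraints=[cons])
    mu = res.x
    w = np.dot(Z, mu)                           # w = sum_i mu_i y_i x_i
    S = mu > 1e-6                               # support vector indicator
    w0 = np.mean(np.dot(X[:, S].T, w) - y[S])   # w0 = (1/|S|) sum_s (w^T x_s - y_s)
    return w, w0, mu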
note
often, the two classes Ω1 and Ω2 are not linearly separable; in this case, neither of the above approaches (primal / dual) will work, i.e. neither constrained problem can be solved
idea
still maximize the margin but allow some points to lie on “the wrong side” of the separating plane
⇔ introduce additional slack variables $\xi_i \geq 0$ such that $y_i \left( w^T x_i - w_0 \right) \geq 1 - \xi_i$
⇔ a misclassified data point $x_i$ is one where $\xi_i > 1$
observe
since we are trying to reduce the number of erroneous classifications, a reasonable idea is to solve the . . .
primal problem
$\underset{w,\, w_0,\, \xi}{\operatorname{argmin}} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i$
$\text{s.t.} \quad y_i \left( w^T x_i - w_0 \right) - 1 + \xi_i \geq 0, \quad i = 1, \ldots, n$
$\phantom{\text{s.t.} \quad} \xi_i \geq 0, \quad i = 1, \ldots, n$
Lagrangian
$L(w, w_0, \xi, \mu, \nu) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \mu_i \left[ y_i \left( w^T x_i - w_0 \right) - 1 + \xi_i \right] - \sum_{i=1}^{n} \nu_i \xi_i$
KKT conditions
$\frac{\partial L}{\partial w} \stackrel{!}{=} 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \mu_i y_i x_i$
$\frac{\partial L}{\partial w_0} \stackrel{!}{=} 0 \;\Rightarrow\; 0 = \sum_{i=1}^{n} \mu_i y_i$
$\frac{\partial L}{\partial \xi_i} \stackrel{!}{=} 0 \;\Rightarrow\; \mu_i = C - \nu_i$
$\xi_i \geq 0 \qquad y_i \left( w^T x_i - w_0 \right) - 1 + \xi_i \geq 0 \qquad \mu_i \geq 0 \qquad \nu_i \geq 0$
$\mu_i \left[ y_i \left( w^T x_i - w_0 \right) - 1 + \xi_i \right] = 0 \qquad \nu_i \xi_i = 0$
dual problem
$\underset{\mu}{\operatorname{argmax}} \; -\frac{1}{2} \mu^T G \mu + \mathbf{1}^T \mu \quad \text{s.t.} \quad y^T \mu = 0, \quad 0 \leq \mu \leq C \mathbf{1}$
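in the dual solver sketched earlier (an illustration of this write-up, not the lecture's code), this box constraint would simply replace the non-negativity bounds, i.e. bounds=[(0., C)] * n instead of bounds=[(0., None)] * n.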
note
the term
$L = C \sum_{i=1}^{n} \xi_i$
is a loss function; we may also consider other loss functions
L2 SVM
primal problem
$\underset{w,\, w_0,\, \xi,\, \rho}{\operatorname{argmin}} \; w^T w + w_0^2 - \rho + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i \left( w^T x_i - w_0 \right) \geq \rho - \xi_i$

dual problem
$\underset{\mu}{\operatorname{argmax}} \; -\mu^T \left( G + y y^T + \tfrac{1}{C} I \right) \mu \quad \text{s.t.} \quad \mathbf{1}^T \mu = 1, \quad \mu \geq 0$
training an L2 SVM using Frank-Wolfe
import numpy as np

def trainL2SVM(X, y, C=1., T=1000):
    m, n = X.shape
    Z = X * y                                        # columns z_i = y_i x_i
    I = np.eye(n)
    M = np.dot(Z.T, Z) + np.outer(y, y) + 1./C * I   # G + y y^T + (1/C) I
    mu = 1./n * np.ones(n)                           # feasible start: 1^T mu = 1, mu >= 0
    for t in range(T):
        eta = 2. / (t + 2)                           # Frank-Wolfe step size
        grd = 2 * np.dot(M, mu)                      # gradient of mu^T M mu
        mu += eta * (I[np.argmin(grd)] - mu)         # step towards the best simplex vertex
    w = np.dot(Z, mu)                                # w = Z mu (twice the derived w; harmless scaling)
    w0 = -np.dot(mu, y)                              # w0 = -y^T mu; sign per the appendix (w0 = -1/2 y^T mu)
    return w, w0
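a usage sketch on synthetic toy data (the data setup is an assumption for illustration, not from the lecture):

import numpy as np
np.random.seed(0)

# two well-separated Gaussian blobs; samples are the columns of X
X = np.hstack([np.random.randn(2, 50) + 4., np.random.randn(2, 50) - 4.])
y = np.hstack([np.ones(50), -np.ones(50)])

w, w0 = trainL2SVM(X, y)
pred = np.sign(np.dot(w, X) - w0)   # classify via y(x) = sign(w^T x - w0)
print(np.mean(pred == y))           # should be close to 1.0 for separable blobs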
examples

[figure: decision boundaries of perceptron, least squares, LDA, and SVM classifiers on 2D toy data sets]
summary
we now know about
support vector machines for binary classification
linearly separable data
non-separable data
primal and dual forms of the SVM training problem
an utterly simple implementation of SVM training based on the Frank-Wolfe algorithm
turn page for final words of wisdom . . .
deriving the dual problem of L2 SVM training
the primal problem is given by
$\underset{w,\, w_0,\, \xi,\, \rho}{\operatorname{argmin}} \; w^T w + w_0^2 - \rho + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i \left( w^T x_i - w_0 \right) \geq \rho - \xi_i$

the Lagrangian is
$L(w, w_0, \xi, \rho, \mu) = w^T w + w_0^2 - \rho + C\, \xi^T \xi - \sum_i \mu_i \left[ y_i \left( w^T x_i - w_0 \right) - \rho + \xi_i \right]$
$= w^T w + w_0^2 - \rho + C\, \xi^T \xi - w^T \sum_i \mu_i \underbrace{y_i x_i}_{=\, z_i} + w_0\, y^T \mu + \rho\, \mathbf{1}^T \mu - \xi^T \mu$
$= w^T w + w_0^2 - \rho + C\, \xi^T \xi - w^T \sum_i \mu_i z_i + w_0\, y^T \mu + \rho\, \mathbf{1}^T \mu - \xi^T \mu$

we have
$\frac{\partial L}{\partial w} = 2 w - \sum_i \mu_i z_i \stackrel{!}{=} 0 \;\Rightarrow\; w = \frac{1}{2} \sum_i \mu_i z_i$
$\frac{\partial L}{\partial w_0} = 2 w_0 + y^T \mu \stackrel{!}{=} 0 \;\Rightarrow\; w_0 = -\frac{1}{2}\, y^T \mu$
$\frac{\partial L}{\partial \xi} = 2 C\, \xi - \mu \stackrel{!}{=} 0 \;\Rightarrow\; \xi = \frac{1}{2C}\, \mu$
$\frac{\partial L}{\partial \rho} = -1 + \mathbf{1}^T \mu \stackrel{!}{=} 0 \;\Rightarrow\; \mathbf{1}^T \mu = 1$

the Lagrangian can thus be written as
$L(w, w_0, \xi, \rho, \mu) = w^T w + w_0^2 - \rho + C\, \xi^T \xi - 2\, w^T w - 2\, w_0^2 + \rho\, \mathbf{1}^T \mu - \xi^T \mu$
$= -w^T w - w_0^2 - \rho + C\, \xi^T \xi + \rho\, \mathbf{1}^T \mu - \xi^T \mu$
$= -\frac{1}{4} \mu^T G \mu - \frac{1}{4} \mu^T y y^T \mu - \rho + \frac{C}{4 C^2} \mu^T \mu + \rho\, \mathbf{1}^T \mu - \frac{1}{2C} \mu^T \mu$
$= -\frac{1}{4} \mu^T G \mu - \frac{1}{4} \mu^T y y^T \mu - \rho - \frac{1}{4C} \mu^T \mu + \rho\, \mathbf{1}^T \mu$
$= -\frac{1}{4} \mu^T G \mu - \frac{1}{4} \mu^T y y^T \mu - \rho - \frac{1}{4} \mu^T \tfrac{1}{C} I\, \mu + \rho\, \mathbf{1}^T \mu$
$= D(\rho, \mu)$

where $G_{ij} = z_i^T z_j = y_i\, x_i^T x_j\, y_j$

since the constraint $\mathbf{1}^T \mu = 1$ makes the two terms involving ρ cancel, the dual problem therefore is
$\underset{\mu}{\operatorname{argmax}} \; -\frac{1}{4}\, \mu^T \left( G + y y^T + \frac{1}{C} I \right) \mu \quad \text{s.t.} \quad \mathbf{1}^T \mu = 1, \quad \mu \geq 0$

note that we may safely ignore the scaling factor of $\frac{1}{4}$; this then leads exactly to the form we presented above