Pattern Recognition

Pattern Recognition Prof. Christian Bauckhage

outline lecture 20

the kernel trick
kernel engineering
kernel algorithms
    kernel LSQ
    kernel SVM
    kernel LDA
    kernel PCA
    kernel k-means
what’s really cool about kernels
summary

high dimensionality can also be a blessing

example

ϕ(x) = (x, ‖x‖²) = (x1, x2, x1² + x2²)

[figure: the map ϕ takes the 2D points x ∈ R^2 to ϕ(x) ∈ R^3, where the two classes become linearly separable]

example

ϕ(x) = (x, ‖(xᵀα)α − x‖²)    with    α = (1/√2) (1, 1)ᵀ

[figure: the map ϕ takes the 2D points x ∈ R^2 to ϕ(x) ∈ R^3, where the two classes become linearly separable]

observe

the examples suggest that non-linear transformations ϕ : R^m → R^M where m ≤ M can make data linearly separable

⇒ even for non-linear problems, we may resort to efficient linear techniques (recall lecture 08)

recall

in regression, clustering, and classification, linear approaches are a dime a dozen, for instance

linear classifiers (least squares, LDA, SVM, ...)

    y(x) = +1 if wᵀx ≥ 0
           −1 if wᵀx < 0

nearest neighbor classifiers, k-means clustering, ...

‖x − q‖² = xᵀx − 2 xᵀq + qᵀq
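this identity is what later allows distance-based methods such as nearest neighbor classification and k-means clustering to be kernelized, since squared distances can be computed from inner products alone; a minimal NumPy sketch of our own (not part of the lecture code):

    import numpy as np

    # nearest neighbor of a query q among the columns of X,
    # using only inner products: ||x - q||^2 = x'x - 2 x'q + q'q
    X = np.random.randn(2, 100)        # toy data, columns are points
    q = np.random.randn(2)             # query point

    G  = X.T @ X                       # Gram matrix of pairwise inner products
    xq = X.T @ q                       # inner products between data and query
    d2 = np.diag(G) - 2 * xq + q @ q   # squared distances to q

    nearest = np.argmin(d2)            # index of the nearest neighbor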

the kernel trick

idea

in order to adapt linear functions f(x) = xᵀw to non-linear settings, we may map x ∈ R^m to ϕ(x) ∈ R^M

learn ω ∈ R^M from data

and then consider h(x) = ϕ(x)ᵀω

problems

it is not generally clear what transformation ϕ(·) to choose

depending on the dimension of the target space R^M, it may be expensive to compute ϕ(x) for x ∈ R^m

depending on the dimension of the target space R^M, it may be expensive to compute inner products ϕ(x)ᵀϕ(y)

Mercer’s theorem to the rescue

definitions

a Mercer kernel is a symmetric, positive semidefinite function k : R^m × R^m → R such that

k(x, y) = ϕ(x)ᵀϕ(y) = k(y, x)

a function k : R^m × R^m → R is positive semidefinite if, for every function g with

∫ g²(x) dx < ∞

it holds that

∬ g(x) k(x, y) g(y) dx dy ≥ 0

note

the notion of positive semidefinite functions generalizes the notion of positive semidefinite matrices where

gᵀK g = Σ_{i,j} gi Kij gj ≥ 0

positive semidefinite matrices K have orthogonal eigenvectors ui and non-negative eigenvalues λi ≥ 0 so that

K ui = λi ui
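a quick numerical illustration of the matrix case (a sketch of our own): the Gram matrix of any data set is positive semidefinite

    import numpy as np

    # the Gram matrix K = X'X of pairwise inner products is positive semidefinite:
    # g'Kg >= 0 for any g, and all eigenvalues are >= 0 (up to round-off)
    X = np.random.randn(2, 100)    # columns are data points
    K = X.T @ X                    # Gram matrix

    g = np.random.randn(100)
    print(g @ K @ g >= 0)                          # True
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True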

note

in analogy, positive semidefinite functions have orthogonal eigenfunctions ψi(x) and eigenvalues µi ≥ 0, i = 1, ..., ∞, such that

∫ k(x, y) ψi(y) dy = µi ψi(x)

where

∫ ψi(x) ψj(x) dx = δij

note

in analogy to the spectral representation of a matrix

K = Σ_{i=1}^{m} λi ui uiᵀ

we have

k(x, y) = Σ_{i=1}^{∞} µi ψi(x) ψi(y)

Theorem (Mercer, 1909): for every positive semidefinite function k(x, y), there exists a vector valued function ϕ(x) such that k(x, y) = ϕ(x)ᵀϕ(y)

Proof

k(x, y) = Σ_i µi ψi(x) ψi(y)
        = Σ_i √µi ψi(x) · √µi ψi(y)
        = Σ_i ϕi(x) ϕi(y)
        = ϕ(x)ᵀϕ(y)

where ϕi(x) = √µi ψi(x)

implications

⇒ instead of computing ϕ(x)ᵀϕ(y) on vectors in R^M, we may evaluate a kernel function k(x, y) on vectors in R^m

⇒ we may not have to compute ϕ(x) and ϕ(x)ᵀϕ(y) explicitly, but may directly evaluate a kernel function k(x, y)

⇒ instead of worrying about how to choose ϕ(·), we may worry about choosing a suitable kernel function k(·, ·)

the kernel trick

the kernel trick is, first, to rewrite an algorithm for data analysis or classification in such a way that the input data x only appear in the form of inner products and, second, to replace any occurrence of such inner products by kernel evaluations

this way, we can use linear approaches to solve non-linear problems!

kernel engineering

the linear kernel (⇔ proof that kernel functions exist)

the identity mapping ϕ(x) = x yields a valid Mercer kernel

k(x, y) = ϕ(x)ᵀϕ(y) = xᵀy

because

xᵀy = yᵀx    (symmetric)

and

xᵀx ≥ 0    (psd)

observe

in the following, we assume

x, y ∈ R^m
b > 0 ∈ R
c > 0 ∈ R
d > 0 ∈ R
g : R^m → R
ki(x, y) = ϕ(x)ᵀϕ(y) is a valid kernel for some ϕ

kernel functions

k(x, y) = c · k1(x, y)
k(x, y) = g(x) k1(x, y) g(y)
k(x, y) = xᵀy + b
k(x, y) = k1(x, y) + k2(x, y)
k(x, y) = (xᵀy + b)^d
k(x, y) = k1(x, y)^d
k(x, y) = ‖ϕ(x) − ϕ(y)‖²
k(x, y) = exp(−‖x − y‖² / (2σ²))

note

again, kernel functions k(x, y) implicitly compute inner products ϕ(x)ᵀϕ(y), not in R^m but in R^M
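for instance, for m = 2 and the polynomial kernel with b = 1, d = 2, the feature map ϕ can be written out explicitly and checked numerically; its dimension is M = 6, in line with the table below (a sketch of our own):

    import numpy as np

    def phi(x):
        # explicit feature map of the polynomial kernel (x'y + 1)^2 for x in R^2;
        # it has M = (m + d choose d) = 6 components
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         1.0])

    x = np.array([ 0.3, -1.2])
    y = np.array([-0.7,  2.0])

    k_direct   = (x @ y + 1.0)**2     # kernel evaluated in R^2
    k_implicit = phi(x) @ phi(y)      # inner product computed in R^6

    print(np.isclose(k_direct, k_implicit))   # True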

                          k(x, y)                          M

inhom. linear kernel      xᵀy + b                          m + 1
polynomial kernel         (xᵀy + b)^d                      (m + d choose d)
Gaussian kernel           exp(−‖x − y‖² / (2σ²))           ∞

assignment

to understand and make sense of all of the above, read

C. Bauckhage, “Lecture Notes on the Kernel Trick (I)”, dx.doi.org/10.13140/2.1.4524.8806
C. Bauckhage, “Lecture Notes on the Kernel Trick (III)”, dx.doi.org/10.13140/RG.2.1.2471.6322

kernel algorithms

recall: least squares

given labeled training data {(xi, yi)}_{i=1}^{n} we seek w such that y(x) = xᵀw

for convenience, we let

x ← (1, x)ᵀ
w ← (w0, w)ᵀ

and then solve

w* = argmin_w ‖Xᵀw − y‖²    (1)

note

in today’s lecture, we work with m × n data matrices

X = [x1 x2 ··· xn]

observe

the least squares problem in (1) has two solutions

primal    w* = (XXᵀ)⁻¹ X y    (requires inverting an m × m matrix)

dual      w* = X (XᵀX)⁻¹ y    (requires inverting an n × n matrix)

where XᵀX is a Gram matrix since (XᵀX)_{ij} = xiᵀxj
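a quick numerical check of this equivalence (a sketch of our own; for n > m the n × n Gram matrix XᵀX is rank deficient, so the pseudo-inverse is used for the dual form):

    import numpy as np

    m, n = 3, 50
    X = np.random.randn(m, n)                    # data matrix, columns are samples
    y = np.random.randn(n)                       # target values

    w_primal = np.linalg.inv(X @ X.T) @ X @ y    # inverts an m x m matrix
    w_dual   = X @ np.linalg.pinv(X.T @ X) @ y   # pseudo-inverts an n x n matrix

    print(np.allclose(w_primal, w_dual))         # True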

kernel LSQ

while working with the dual may be costly, it allows for invoking the kernel trick, because in

y(x) = xᵀw* = xᵀX (XᵀX)⁻¹ y

all x, xi occur within inner products and we may rewrite

y(x) = k(x)ᵀ K⁻¹ y

where Kij = k(xi, xj) and ki(x) = k(xi, x)
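a minimal NumPy sketch of this kernelized least squares predictor with a Gaussian kernel (an illustration of our own; the recipe cited below works along the same lines but may differ in details):

    import numpy as np

    def gauss_kernel(A, B, sigma):
        # pairwise Gaussian kernel values between the columns of A (m x p) and B (m x q)
        d2 = (np.sum(A**2, axis=0)[:, None]
              + np.sum(B**2, axis=0)[None, :]
              - 2.0 * A.T @ B)
        return np.exp(-d2 / (2.0 * sigma**2))

    # training data: columns of X, targets y
    X = np.linspace(-5., 7., 50).reshape(1, -1)
    y = 0.5 * X[0]**2 + np.random.randn(50)

    K = gauss_kernel(X, X, sigma=2.5)
    a = np.linalg.solve(K + 1e-6 * np.eye(K.shape[0]), y)   # a = K^{-1} y (small ridge for numerical stability)

    # prediction y(x) = k(x)' K^{-1} y at new points
    Xnew = np.linspace(-5., 7., 200).reshape(1, -1)
    ynew = gauss_kernel(X, Xnew, sigma=2.5).T @ a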

assignment

to see how to implement this idea in numpy / scipy, read C. Bauckhage, “NumPy / SciPy Recipes for Data Science: Kernel Least Squares Optimization (1)”, dx.doi.org/10.13140/RG.2.1.4299.9842

example: regression

[figure: kernel least squares regression fits of the same data with a linear kernel xᵀy + 1, a polynomial kernel (xᵀy + 1)^d, and a Gaussian kernel exp(−‖x − y‖² / (2 · 2.5²))]

example: classification

[figure: decision boundaries obtained with a polynomial kernel, d ∈ {1, 3, 6}]

[figure: decision boundaries obtained with a Gaussian kernel, σ ∈ {0.5, 1.0, 5.0}]

recall: support vector machines

dual problem of L2 SVM training

µ* = argmax_µ −µᵀ (G + yyᵀ + (1/C) I) µ

s.t. 1ᵀµ = 1, µ ≥ 0

where Gij = yi yj xiᵀxj

observe

upon solving for µ, we have for the L2 SVM

w  = Σ_{s∈S} µs ys xs

w0 = Σ_{s∈S} µs ys

and the resulting classifier is

y(x) = sign(xᵀw + w0) = sign( Σ_{s∈S} µs ys xᵀxs + w0 )

kernel SVM

during training and application of an SVM, all occurrences of x, xi are in the form of inner products

we may therefore substitute

Gij = yi yj k(xi, xj)

and

xᵀw = Σ_{s∈S} µs ys k(x, xs)

example

[figure: kernel SVM decision boundaries with a polynomial kernel, d ∈ {3, 5, 7}, C = 2]

[figure: kernel SVM decision boundaries with a Gaussian kernel, σ ∈ {0.25, 0.50, 0.75}, C = 1000]

L2 SVM with polynomial kernel

### training
X = ...   # training data matrix
y = ...   # training label vector

m  = trainL2SVMPolyKernel(X, y, d=3, C=2., T=1000)
s  = np.where(m > 0)[0]
XS = X[:,s]
ys = y[s]
ms = m[s]
w0 = np.dot(ys, ms)

### testing
X = ...   # test data matrix
y = applyL2SVMPolyKernel(X.T, XS, ys, ms, w0, d=3)
y = np.sign(y)

L2 SVM with polynomial kernel

def trainL2SVMPolyKernel(X, y, d, b=1., C=1., T=1000):
    m, n = X.shape

    I = np.eye(n)
    Y = np.outer(y, y)
    K = (b + np.dot(X.T, X))**d      # polynomial kernel matrix
    M = Y * K + Y + 1./C * I         # matrix of the dual objective

    # solve the dual over the simplex by iteratively moving towards
    # the vertex with the smallest gradient entry
    mu = np.ones(n) / n
    for t in range(T):
        eta = 2. / (t+2)
        grd = 2 * np.dot(M, mu)
        mu += eta * (I[np.argmin(grd)] - mu)
    return mu

L2 SVM with polynomial kernel

def applyL2SVMPolyKernel(x, XS, ys, ms, w0, d, b=1.):
    if x.ndim == 1:
        x = x.reshape(len(x), 1)
    k = (b + np.dot(x.T, XS))**d                 # kernel values between test points and support vectors
    return np.sum(k * ys * ms, axis=1) + w0      # sum_s mu_s y_s k(x, x_s) + w0
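the slides only show the polynomial-kernel implementation; a Gaussian-kernel variant might look as follows (a sketch of our own under the assumption that only the kernel matrix changes; the function names are hypothetical):

    import numpy as np

    def gaussKernel(A, B, sigma):
        # pairwise Gaussian kernel values between the columns of A and B
        d2 = (np.sum(A**2, axis=0)[:, None]
              + np.sum(B**2, axis=0)[None, :]
              - 2. * np.dot(A.T, B))
        return np.exp(-d2 / (2. * sigma**2))

    def trainL2SVMGaussKernel(X, y, sigma, C=1., T=1000):
        m, n = X.shape
        I = np.eye(n)
        Y = np.outer(y, y)
        K = gaussKernel(X, X, sigma)     # replaces (b + X'X)**d
        M = Y * K + Y + 1./C * I
        mu = np.ones(n) / n
        for t in range(T):               # same update scheme as in trainL2SVMPolyKernel
            eta = 2. / (t+2)
            grd = 2 * np.dot(M, mu)
            mu += eta * (I[np.argmin(grd)] - mu)
        return mu

    def applyL2SVMGaussKernel(x, XS, ys, ms, w0, sigma):
        if x.ndim == 1:
            x = x.reshape(len(x), 1)
        k = gaussKernel(x, XS, sigma)
        return np.sum(k * ys * ms, axis=1) + w0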

example

training on Xtrain ∈ R^{2×2000} and testing on Xtest ∈ R^{2×134500} took less than a second

[figure: resulting decision regions of the kernel SVM]

recall: linear discriminant analysis

training an LDA classifier is to compute

w* = argmax_w (wᵀ SB w) / (wᵀ SW w) = SW⁻¹ (µ1 − µ2)

where

SB = (µ1 − µ2)(µ1 − µ2)ᵀ

SW = S1 + S2

µj = (1/nj) Σ_{xi∈Ωj} xi

Sj = Σ_{xi∈Ωj} (xi − µj)(xi − µj)ᵀ

observe

in the following, let

Xj = [x1, ..., xnj]
yj = (1/nj, ..., 1/nj)ᵀ

then µj = Xj yj

moreover, let

X = [X1, X2]
y = [y1ᵀ, −y2ᵀ]ᵀ

so that we have

w* = SW⁻¹ (µ1 − µ2) = SW⁻¹ (X1 y1 − X2 y2) = SW⁻¹ X y

⇒ the optimal choice for w is a linear combination of the data vectors in X
⇔ we may substitute w = Xλ

observe

we have

Sj = Σ_{xi∈Ωj} (xi − µj)(xi − µj)ᵀ
   = Σ_{xi∈Ωj} xi xiᵀ − nj µj µjᵀ
   = Xj Xjᵀ − nj Xj yj yjᵀ Xjᵀ
   = Xj Xjᵀ − Xj Nj Xjᵀ

where Nj = (1/nj) 1 and 1 denotes the nj × nj matrix of all ones

observe

  T wT SB w = λT XT X1 y1 − X2 y2 X1 y1 − X2 y2 X λ   T = λT XT X1 y1 − XT X2 y2 XT X1 y1 − XT X2 y2 λ   T = λT K 1 y1 − K 2 y2 K 1 y1 − K 2 y2 λ   T = λT M1 − M2 M1 − M2 λ = λT Mλ   wT SW w = λT XT X1 XT1 − X1 N1 XT1 + X2 XT2 − X2 N2 XT2 X λ   = λT XT XXT − X1 N1 XT1 − X2 N2 XT2 X λ   = λT XT XXT X − XT X1 N1 XT1 X − XT X2 N2 XT2 X λ   = λT K 2 − K 1 N1 K T1 − K 2 N2 K T2 λ = λT N λ

kernel LDA

training kernel LDA is to compute

λ* = argmax_λ (λᵀ M λ) / (λᵀ N λ) = N⁻¹ (M1 − M2)

where it may be necessary to regularize N

N ← N + εI

applying kernel LDA is to compute

y(x) = sign(xᵀX λ* − θ) = sign(k(x)ᵀ λ* − θ)

where θ is determined as in usual LDA
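a minimal NumPy sketch of kernel LDA along these lines, using a Gaussian kernel (an illustration of our own; the kernel, the regularization constant ε, and the choice of θ as the midpoint of the projected class means are assumptions):

    import numpy as np

    def gauss_kernel(A, B, sigma):
        # pairwise Gaussian kernel values between the columns of A and B
        d2 = (np.sum(A**2, axis=0)[:, None]
              + np.sum(B**2, axis=0)[None, :]
              - 2. * A.T @ B)
        return np.exp(-d2 / (2. * sigma**2))

    def trainKernelLDA(X1, X2, sigma=1., eps=1e-3):
        # X1, X2: m x n1 and m x n2 data matrices of the two classes
        n1, n2 = X1.shape[1], X2.shape[1]
        X  = np.hstack([X1, X2])
        K  = gauss_kernel(X, X,  sigma)
        K1 = gauss_kernel(X, X1, sigma)
        K2 = gauss_kernel(X, X2, sigma)
        M1 = K1.mean(axis=1)                          # K1 y1 with y1 = (1/n1, ..., 1/n1)
        M2 = K2.mean(axis=1)
        N  = (K @ K
              - K1 @ np.full((n1, n1), 1./n1) @ K1.T
              - K2 @ np.full((n2, n2), 1./n2) @ K2.T
              + eps * np.eye(n1 + n2))                # regularization N <- N + eps I
        lam   = np.linalg.solve(N, M1 - M2)           # lambda* = N^{-1} (M1 - M2)
        theta = 0.5 * (lam @ M1 + lam @ M2)           # midpoint of the projected class means
        return X, lam, theta

    def applyKernelLDA(x, X, lam, theta, sigma=1.):
        # x: m x q query points; y(x) = sign(k(x)' lambda* - theta)
        return np.sign(gauss_kernel(X, x, sigma).T @ lam - theta)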

example

[figure: kernel LDA decision boundaries with a polynomial kernel, d ∈ {1, 4, 6}]

recall: principal component analysis

given a (zero mean) data matrix

X = [x1, ..., xn] ∈ R^{m×n}

we compute the sample covariance matrix

C = (1/n) X Xᵀ ∈ R^{m×m}

then solve the eigenvector/eigenvalue problem

C u = λ u

and use the resulting eigenvectors for various purposes

observe

we have

C u = λ u  ⇔  (1/(nλ)) X Xᵀ u = u  ⇔  X α = u

with α = (1/(nλ)) Xᵀ u

⇒ each eigenvector u of C is a linear combination of the column vectors xi of X, where we emphasize that u ∈ R^m and α ∈ R^n

observe

we have

C u = λ u
⇔ (1/n) X Xᵀ X α = λ X α
⇔ (1/n) Xᵀ X Xᵀ X α = λ Xᵀ X α
⇔ K² α = λ̃ K α
⇔ K α = λ̃ α

where λ̃ = n λ

moreover, uᵀu = 1

⇒ αᵀ K α = λ̃ αᵀα = 1  ⇒  ‖α‖ = 1 / √λ̃

centering the kernel

kc(xi, xj) = ( ϕ(xi) − (1/n) Σ_{k=1}^{n} ϕ(xk) )ᵀ ( ϕ(xj) − (1/n) Σ_{l=1}^{n} ϕ(xl) )

           = ϕ(xi)ᵀϕ(xj) − ϕ(xi)ᵀ (1/n) Σ_l ϕ(xl) − (1/n) Σ_k ϕ(xk)ᵀϕ(xj) + (1/n²) Σ_{k,l} ϕ(xk)ᵀϕ(xl)

           = k(xi, xj) − (1/n) Σ_l k(xi, xl) − (1/n) Σ_k k(xk, xj) + (1/n²) Σ_{k,l} k(xk, xl)
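in matrix form, the whole centered kernel matrix can be computed in one step as Kc = C K C with the centering matrix C = I − (1/n) 11ᵀ; a tiny sketch of our own:

    import numpy as np

    def center_kernel(K):
        # K: n x n kernel matrix; returns the centered kernel matrix Kc = C K C
        n = K.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n    # centering matrix I - (1/n) 11'
        return C @ K @ C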

kernel PCA

solve

K α = λ̃ α    where Kij = kc(xi, xj)

normalize

α ← α / √λ̃

compute

uᵀx = αᵀXᵀx = Σ_i αi k(xi, x)
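putting the pieces together, a minimal NumPy sketch of Gaussian kernel PCA (an illustration of our own; kernel choice and the number of components are assumptions):

    import numpy as np

    def kernelPCA(X, sigma=1., n_components=2):
        # X: m x n data matrix, columns are samples
        n  = X.shape[1]
        d2 = (np.sum(X**2, axis=0)[:, None]
              + np.sum(X**2, axis=0)[None, :]
              - 2. * X.T @ X)
        K  = np.exp(-d2 / (2. * sigma**2))           # Gaussian kernel matrix
        C  = np.eye(n) - np.ones((n, n)) / n
        Kc = C @ K @ C                               # centered kernel matrix

        lams, alphas = np.linalg.eigh(Kc)            # eigenvalues in ascending order
        lams, alphas = lams[::-1], alphas[:, ::-1]   # reorder to descending

        lams   = lams[:n_components]
        alphas = alphas[:, :n_components] / np.sqrt(lams)   # normalize: ||alpha|| = 1/sqrt(lambda)
        return alphas

    def projectKernelPCA(x, X, alphas, sigma=1.):
        # f(x) = sum_i alpha_i k(xi, x) for each component; x is m x q
        d2 = (np.sum(X**2, axis=0)[:, None]
              + np.sum(x**2, axis=0)[None, :]
              - 2. * X.T @ x)
        return np.exp(-d2 / (2. * sigma**2)).T @ alphas     # q x n_components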

note

on the following slides, we compare standard PCA against kernel PCA

we compute them from m = 2 dimensional data points x1, ..., xn, where n ≫ m

standard PCA will produce m eigenvectors ui ∈ R^m
kernel PCA will produce n eigenvectors αj ∈ R^n

R^m is the data space, R^n is the feature space

since the feature space dimension exceeds 2, we cannot plot the kernel PCA results directly; what we do instead is to consider points x ∈ R^m and visualize standard and kernel PCA in terms of a function f(x), where either f(x) = uiᵀx or f(x) = αjᵀXᵀx

examples: standard PCA

[figure: contour plots of f(x) = uiᵀx over 2D data for the principal directions i = 1, 2]

examples: (Gaussian) kernel PCA

[figure: contour plots of f(x) = αjᵀXᵀx for the leading kernel principal components j = 1, ..., 6]

examples: (Gaussian) kernel PCA

[figure: further contour plots of f(x) = αjᵀXᵀx for the leading kernel principal components j = 1, ..., 6]

kernel k-means

you take it from here . . .

what’s really cool about kernels

example

non-numeric data

S = { ’homer simpson’, ’marge simpson’, ’bart simpson’, ’lisa simpson’, ’maggie simpson’, ’apu nahasapeemapetilon’, ’selma bouvier’, ’patty bouvier’, ’nelson muntz’, ’ralph wiggum’, ’seymour skinner’, ’disco stu’, ’kent brockman’, ’carl carlson’, ’lenny leonard’, ’comic book guy’, ’ned flanders’, ’prof. john frink’, ’barney gumbel’, ’dr. julius hibbert’, ’edna krabappel’, ’krusty the clown’, ’reverend lovejoy’, ’otto mann’, ’martin prince’, ’moe syslak’, ’waylon smithers’, ’charles montgomery burns’, ’milhouse van houten’, ’clancy wiggum’, ’gary chalmers’, ’fat tony’, ’rod flanders’, ’todd flanders’, ’hans moleman’, ’mayor quimby’, ’dr. nick riveria’, ’sideshow bob’, ’snake jailbird’, ’groundskeeper willie’ }


bi-grams, e.g. for ’homer simpson’

B = { ’ho’, ’om’, ’me’, ’er’, ’r ’, ’ s’, ’si’, ’im’, ’mp’, ’ps’, ’so’, ’on’ }

a possible string similarity

s(s1, s2) = 2 · |B1 ∩ B2| / (|B1| + |B2|)

a possible string distance

d(s1, s2) = 1 − s(s1, s2)

a possible string kernel

k(s1, s2) = exp(−d²(s1, s2) / (2σ²))
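a minimal Python sketch of these formulas (our own implementation; the value of σ is an assumption):

    import numpy as np

    def bigrams(s):
        # set of character bi-grams of a string
        return {s[i:i+2] for i in range(len(s) - 1)}

    def similarity(s1, s2):
        B1, B2 = bigrams(s1), bigrams(s2)
        return 2. * len(B1 & B2) / (len(B1) + len(B2))

    def distance(s1, s2):
        return 1. - similarity(s1, s2)

    def string_kernel(s1, s2, sigma=0.5):
        return np.exp(-distance(s1, s2)**2 / (2. * sigma**2))

    print(similarity('homer simpson', 'marge simpson'))   # high bi-gram overlap
    print(similarity('homer simpson', 'moe syslak'))      # low bi-gram overlap

with such a kernel, the kernel matrix over S can be plugged into kernel PCA, which is what produces the embeddings on the next slides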

word2vec via kernel PCA

[figure: 2D kernel PCA embedding of the names in S; names of related characters, e.g. the Simpson family members and the Flanders family members, appear close together]

word2vec via kernel PCA works for OOV (out-of-vocabulary) words !!!

[figure: the same embedding with previously unseen names such as ’kirk van houten’, ’abe simpson’, and ’maude flanders’ placed next to their relatives]

assignment

for details, read E. Brito, R. Sifa, and C. Bauckhage, “KPCA Embeddings: An Unsupervised Approach to Learn Vector Representations of Finite Domain Sequences”, Proc. LWDA-KDML, 2017

summary

we now know about

Mercer kernels and the kernel trick
    rewrite an algorithm for analysis or classification such that the input data x only enter it in the form of inner products
    replace all inner products by kernel evaluations

kernelized versions of algorithms we studied earlier