Pattern Recognition

Prof. Christian Bauckhage

outline (lecture 22)

  recap
  multilayer perceptrons
  back propagation
  best practices
  recurrent neural networks
  deep learning
  summary

mathematical neuron

(figure: a neuron with inputs x_0 = 1, x_1, . . . , x_m, weights w_0, w_1, . . . , w_m, and a summation node with activation f(s) producing the output y)

⇔ synaptic summation and (non-linear) activation

  y(x) = f(w^T x) = f(s)

where

  x ← [1, x],  w ← [w_0, w]

traditional activation functions

linear

  f(s) = s

logistic function

  f(s) = 1 / (1 + e^(−βs))

hyperbolic tangent

  f(s) = (e^(βs) − e^(−βs)) / (e^(βs) + e^(−βs))

(plots of the three functions over s ∈ [−4, 4])

more recent activation functions

rectified linear

  f(s) = max(0, s)

softplus

  f(s) = ln(1 + e^s)

(plots of both functions over s ∈ [−4, 4])
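as a quick illustration, a minimal NumPy sketch of these activation functions (the function names and the evaluation grid are mine, not from the lecture):

```python
import numpy as np

def linear(s):
    return s

def logistic(s, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * s))

def tanh_act(s, beta=1.0):
    # equals (e^{beta s} - e^{-beta s}) / (e^{beta s} + e^{-beta s})
    return np.tanh(beta * s)

def relu(s):
    return np.maximum(0.0, s)

def softplus(s):
    return np.log1p(np.exp(s))  # ln(1 + e^s); log1p for numerical accuracy

s = np.linspace(-4, 4, 9)
for f in (linear, logistic, tanh_act, relu, softplus):
    print(f.__name__, np.round(f(s), 3))
```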

perceptron learning rule (PLR)

assume labeled training data D = {(x_i, y_i)}_{i=1}^n where

  y_i = +1, if x_i ∈ Ω_1
  y_i = −1, if x_i ∈ Ω_2

let f(·) = id(·) and run the following algorithm

  initialize w
  while not converged
    randomly select x_i
    if y_i · x_i^T w < 0        // test for mistake
      w = w + y_i · x_i         // update only if mistake

that is, use

  ∆w = 1/2 · (y_i − sign(x_i^T w)) · x_i
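a minimal sketch of the PLR in NumPy, on synthetic linearly separable data (the data, seed, and class means are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# two linearly separable classes with labels +1 / -1
X1 = rng.normal(loc=[+2, +2], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
X = np.vstack([X1, X2])
X = np.hstack([np.ones((100, 1)), X])        # x <- [1, x] (bias input)
y = np.hstack([np.ones(50), -np.ones(50)])   # y = +1 for class 1, -1 for class 2

w = rng.normal(size=3)                        # initialize w
converged = False
while not converged:
    converged = True
    for i in rng.permutation(len(y)):         # randomly visit examples
        if y[i] * X[i] @ w < 0:               # test for mistake
            w = w + y[i] * X[i]               # update only if mistake
            converged = False

print("learned weights:", w)
```

for linearly separable data this loop is guaranteed to terminate; otherwise it would cycle forever.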

Hebbian learning

synaptic weights should be adapted proportional to correlations between pre- and post-synaptic activity

for example, a perceptron that learns by means of Oja's rule

  ∆w = η f(w^T x) x − η f²(w^T x) w

can identify principal components if f(·) = id(·)
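a sketch of Oja's rule with f(·) = id(·) on zero-mean toy data; it should recover the first principal component up to sign (learning rate, seed, and data are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# zero-mean 2-d data with a dominant direction along the first axis
X = rng.normal(size=(5000, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x                        # f(w^T x) with f = id
    w += eta * (y * x - y**2 * w)    # Oja's rule

# compare with the leading eigenvector of the covariance matrix (up to sign)
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
print("Oja:", w / np.linalg.norm(w))
print("PCA:", eigvecs[:, -1])
```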

learning via iterative error minimization

consider a non-linear neuron

  y(x) = f(w^T x) = f(s)

where, for example,

  f(s) = tanh(βs)

and

  df(s)/ds = β (1 − f²(s))

gradient descent

we have

  E(D, w) = 1/2 Σ_i (y_i − tanh(β w^T x_i))²

and

  ∂E/∂w = −β Σ_i (y_i − tanh(β w^T x_i)) (1 − tanh²(β w^T x_i)) x_i
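the corresponding gradient descent loop as a sketch (β, η, the iteration count, and the toy data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, eta = 1.0, 0.001

# toy data: targets in {-1, +1} from a random linear rule (illustration only)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]))

w = rng.normal(size=2)
for _ in range(500):
    f = np.tanh(beta * (X @ w))
    # dE/dw = -beta * sum_i (y_i - f_i) (1 - f_i^2) x_i
    grad = -beta * ((y - f) * (1.0 - f**2)) @ X
    w -= eta * grad

print("final error:", 0.5 * np.sum((y - np.tanh(beta * (X @ w)))**2))
```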

what a perceptron can do

(figure: a linearly separable two-class problem in the (x_1, x_2) plane and a perceptron with inputs 1, x_1, x_2 and weights w_0, w_1, w_2 realizing a linear decision boundary)

what a perceptron can't do

(figure: a two-class problem that is not linearly separable; no choice of w_0, w_1, w_2 yields a separating line)

what a multilayer perceptron can do

(figure: the same non-linearly separable problem solved by a two-layer network whose hidden units f combine into a non-linear decision boundary)

note

in order for this to work / make sense, the activation function f must be non-linear

otherwise, we would just have yet another linear function

  y = W^2 (W^1 x) = W x

(figure: a two-layer network with inputs x_1, x_2, x_3, hidden activations f, weight matrices W^1 and W^2, and outputs y_1, y_2)

Theorem (universal approximation theorem)

a feed-forward MLP y(x) with a single layer of finitely many hidden neurons and a non-linear, monotonic activation function can approximate any function g : R^m ⊃ X → R, X compact, up to arbitrary precision, i.e.

  max_{x ∈ X} |y(x) − g(x)| < ε

for any ε > 0

Hornik, Stinchcombe, and White, Neural Networks, 2(5), 1989
Cybenko, Mathematics of Control, Signals, and Systems, 2(4), 1989

question

so where is / was the problem?

answer

appropriate parameters W = (W^1, W^2, . . . , W^L) need to be determined

usual approach

learn them from labeled data {(x[i], y[i])}_{i=1}^n by minimizing

  E(W) = 1/2 Σ_{i=1}^n (y[i] − y(x[i], W))²

back propagation

⇔ recursive estimation of weights in a multilayer perceptron

originally due to Bryson and Ho (1969) (for optimal control of dynamic systems)

repeatedly rediscovered for MLP training, in particular by Werbos (1974) and by Rumelhart, Hinton, and Williams (1986)

back propagation

randomly initialize weights

repeat
  propagate training data through the network
  evaluate overall error E
  update weights using gradient descent with step size η

    w_kl^j ← w_kl^j − η · ∂E/∂w_kl^j

until E becomes small

(figure: a feed-forward network with layers j = 0, 1, 2, . . . , L)

back propagation

for stochastic gradient descent, we rewrite

  E = 1/2 Σ_i (y[i] − y(x[i], W))² = Σ_i E[i]

and look at the partial derivatives of the E[i] w.r.t. the weights w_kl^j


back propagation

for the last layer L, we have

  ∂E[i]/∂w_kl^L = ∂E[i]/∂f_k^L · ∂f_k^L/∂s_k^L · ∂s_k^L/∂w_kl^L    (1)

where, with f_k^L = tanh(s_k^L) and s_k^L = Σ_m w_km^L f_m^{L−1}, the three factors are

  ∂E[i]/∂f_k^L = ∂/∂f_k^L [ 1/2 Σ_m (y_m[i] − f_m^L)² ]    (2)

  ∂f_k^L/∂s_k^L = ∂/∂s_k^L tanh(s_k^L)    (3)

  ∂s_k^L/∂w_kl^L = ∂/∂w_kl^L [ Σ_m w_km^L f_m^{L−1} ]    (4)

back propagation

for (2), (3), and (4), we find respectively

  ∂E[i]/∂f_k^L = −(y_k[i] − f_k^L)

  ∂f_k^L/∂s_k^L = 1 − (f_k^L)²

  ∂s_k^L/∂w_kl^L = f_l^{L−1}

back propagation

so far, we therefore have

  ∂E[i]/∂w_kl^L = −(y_k[i] − f_k^L) · (1 − (f_k^L)²) · f_l^{L−1}
                = −δ_k^L · f_l^{L−1}

back propagation

for the last but one layer L − 1, we must consider

  ∂E[i]/∂w_kl^{L−1} = ∂E[i]/∂f_k^{L−1} · (1 − (f_k^{L−1})²) · f_l^{L−2}

and we note that the error signal of a hidden neuron collects contributions from all neurons of the layer above,

  ∂E[i]/∂f_l^{L−1} = Σ_k ∂E[i]/∂f_k^L · ∂f_k^L/∂f_l^{L−1}

(figure: a neuron in layer j feeds all neurons k − 1, k, k + 1, . . . in layer j + 1)

back propagation

in general, we therefore have

  w_kl^j ← w_kl^j + η · δ_k^j · f_l^{j−1}    for j = L, L−1, . . . , 1

where

  δ_k^L = (y_k[i] − f_k^L) · (1 − (f_k^L)²)

and

  δ_l^{j−1} = [ Σ_k δ_k^j w_kl^j ] · (1 − (f_l^{j−1})²)    for j = L, L−1, . . . , 2
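putting these update rules together, a compact sketch of back propagation for a tanh MLP trained on XOR (the architecture, learning rate, and the constant-1 input channel used to provide biases are my choices; training may occasionally stall in a local minimum):

```python
import numpy as np

rng = np.random.default_rng(3)

def train_mlp(X, Y, sizes, eta=0.1, epochs=5000):
    """back propagation for a tanh MLP, following the update rules above"""
    W = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
         for m, n in zip(sizes[:-1], sizes[1:])]
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # forward pass: keep every layer's activation f^j
            f = [X[i]]
            for Wj in W:
                f.append(np.tanh(Wj @ f[-1]))
            # delta^L = (y - f^L)(1 - (f^L)^2)
            delta = (Y[i] - f[-1]) * (1.0 - f[-1] ** 2)
            for j in range(len(W) - 1, -1, -1):
                grad = np.outer(delta, f[j])                      # delta^j (f^{j-1})^T
                if j > 0:                                         # recurse with old weights
                    delta = (W[j].T @ delta) * (1.0 - f[j] ** 2)
                W[j] += eta * grad                                # w <- w + eta delta f
    return W

def predict(W, x):
    for Wj in W:
        x = np.tanh(Wj @ x)
    return x

# XOR with a constant 1 appended to the input so the first layer has biases
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([[-1.0], [1.0], [1.0], [-1.0]])
W = train_mlp(X, Y, sizes=[3, 8, 1])
for x, y in zip(X, Y):
    print(x[1:], y, np.round(predict(W, x), 2))  # outputs should approach targets
```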


note

this is it . . . but naïve BP is a recipe for run-time disasters, slow to converge, and prone to oscillation

⇒ numerous possible improvements

momentum terms

  ∆W(t + 1) = −η · dE/dW(t) + µ · ∆W(t)

weight decay (L2 regularization)

  ∆W = −η · (dE/dW + λW)
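as code, assuming grad holds dE/dW for the current step (a sketch of the two update rules; the hyperparameter values are arbitrary placeholders):

```python
import numpy as np

eta, mu, lam = 0.1, 0.9, 1e-4   # step size, momentum, weight decay (assumed values)

def momentum_step(W, grad, velocity):
    """dW(t+1) = -eta * dE/dW(t) + mu * dW(t)"""
    velocity[:] = -eta * grad + mu * velocity
    W += velocity
    return W, velocity

def weight_decay_step(W, grad):
    """dW = -eta * (dE/dW + lambda * W)"""
    W -= eta * (grad + lam * W)
    return W

# toy usage on a random weight matrix and gradient
rng = np.random.default_rng(0)
W, v = rng.normal(size=(3, 3)), np.zeros((3, 3))
g = rng.normal(size=(3, 3))
W, v = momentum_step(W, g, v)
W = weight_decay_step(W, g)
```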

note

automatic estimation of the learning rate η
  SuperSAB heuristic, resilient backpropagation (RProp), . . .

variants for the computation of gradients
  conjugate gradients, (quasi-) Newton methods, Levenberg-Marquardt methods, . . .

different objective functions
  cross entropy, . . .

note

⇒ “simple” supervised training of a “simple” feed-forward multilayer perceptron is an art rather than a science ;-)

note

all of the following are from

Y. LeCun, L. Bottou, G.B. Orr, and K.-R. Müller, "Efficient BackProp", in G.B. Orr and K.-R. Müller (eds.), Neural Networks: Tricks of the Trade, Springer, 1998

stochastic instead of batch learning

note that computing the average gradient

  ∆W = −η · ∂E/∂W

requires a pass over the whole batch of training data

one can often get faster and better solutions through updates based on single data points or mini-batches

  ∆W = −η · ∂E[i]/∂W
  ∆W = −η · ∂E[i : j]/∂W
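schematically, the three regimes differ only in how many examples enter each gradient step; here is a mini-batch sketch where grad_fn stands for any routine returning the gradient on the given examples (the linear least-squares example is an illustration of mine):

```python
import numpy as np

def minibatch_sgd(W, X, Y, grad_fn, eta=0.1, batch_size=32, epochs=50):
    """update W from mini-batches X[i:j], Y[i:j] instead of the full batch"""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle (see next slide)
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            W = W - eta * grad_fn(W, X[idx], Y[idx])
    return W

# example: mean-squared-error gradient of a linear model y = X w
grad_fn = lambda w, X, y: X.T @ (X @ w - y) / len(y)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(np.round(minibatch_sgd(np.zeros(3), X, y, grad_fn), 2))  # approaches [1, -2, 0.5]
```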

shuffle the training data

networks learn fastest from "unexpected" examples

therefore
  shuffle the training data s.t. successive examples are from different classes
  prefer training examples that produce large errors over those producing small errors

normalize the input

convergence is typically faster if each of the variables (dimensions) in the training data is of zero mean and unit variance


PCA / ZCA whitening

for data {x_i}_{i=1}^n ⊂ R^m, compute

  1. x_i ← x_i − µ
  2. x_i ← U^T x_i    where C = UΛU^T
  3. x_i ← L x_i      where L = Λ^{−1/2}
  4. x_i ← U x_i

(figure: the data cloud is centered, rotated onto its principal axes, scaled to unit variance, and rotated back)
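a sketch of the four steps (stopping after step 3 gives PCA whitening, the final rotation back gives ZCA; the small constant added to the eigenvalues is my addition for numerical stability):

```python
import numpy as np

def zca_whiten(X):
    """X: n x m data matrix; returns centered, ZCA-whitened data"""
    X = X - X.mean(axis=0)                     # 1. subtract the mean
    C = np.cov(X, rowvar=False)                # covariance C = U Lambda U^T
    lam, U = np.linalg.eigh(C)
    X = X @ U                                  # 2. rotate onto the eigenbasis (x <- U^T x)
    X = X / np.sqrt(lam + 1e-8)                # 3. scale to unit variance (L = Lambda^{-1/2})
    return X @ U.T                             # 4. rotate back (x <- U x)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1, -1], [[4, 2], [2, 3]], size=1000)
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))   # approximately the identity matrix
```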


choose the activation function

recall that outputs of one layer are inputs to the next; inputs of zero mean and unit variance are good

the once popular logistic function

  f(s) = (1 + e^(−s))^(−1)

does not accomplish this, because ∀ s : f(s) > 0

choose the activation function

the hyperbolic tangent is symmetric about the origin; in particular, the function

  f(s) = 1.7159 · tanh(2s/3)

has the following properties

  f(±1) = ±1
  the extrema of d²f/ds² lie at s = ±1
  if s is of zero mean and unit variance, then f(s) will be of zero mean and unit variance

(plot of f over s ∈ [−4, 4])

choose the activation function

it may be beneficial to consider

  f(s) = tanh(s) + ε s

with a small ε > 0 to escape from plateaus

(plot of f over s ∈ [−4, 4])

choose the target values

training will drive outputs as close as possible to targets

⇒ if target values are large, weights will have to become large
⇒ for sigmoidal activation f(s), gradients will become small

therefore, set target values to points where d²f/ds² is maximal

(plots: f(s) = tanh(s), the first derivative f′(s), and the second derivative f″(s) over s ∈ [−4, 4])

initialize the weights

recall, large weights will cause sigmoidal activations to saturate (⇔ small gradients, slow learning); likewise for very small weights

therefore, choose weights neither too small nor too large

assuming that the m inputs to a neuron are of zero mean and unit variance, initialize its weights by sampling from a Gaussian with zero mean and standard deviation

  σ = 1/√m
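as a one-liner per layer (a sketch; m is the fan-in of each neuron):

```python
import numpy as np

def init_weights(m, n, rng):
    """weights for n neurons with m inputs each: zero mean, std 1/sqrt(m)"""
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(n, m))

rng = np.random.default_rng(0)
W = init_weights(m=100, n=10, rng=rng)
print(W.std())   # close to 1/sqrt(100) = 0.1
```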

recurrent neural networks

note

feed forward networks are stateless; recurrent networks are stateful

recurrent network

⇔ truly universal function approximator
⇔ dynamical system

  σ_{t+1} = σ_t + F(W σ_t)

⇔ challenging to train
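a minimal sketch of such a state update (the state dimension, the random weights, and the choice F = tanh are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(0.0, 0.5, size=(n, n))    # recurrent weight matrix
sigma = rng.normal(size=n)               # initial state sigma_0

for t in range(5):                       # unroll a few steps
    sigma = sigma + np.tanh(W @ sigma)   # sigma_{t+1} = sigma_t + F(W sigma_t)
    print(t + 1, np.round(sigma, 2))
```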



deep learning

convolutional neural network

neural architecture tailored towards image analysis, popularized by LeCun et al. (1990)

source: LeCun et al. (1995)


convolutional neural network

OK, the idea is "not new", so why the excitement?

Krizhevsky, Sutskever, and Hinton (2012) revolutionized what can be expected from neural networks

they used a neural network with
  5 convolutional layers (some followed by pooling layers)
  followed by a fully connected MLP of 3 layers
  a total of 650,000 neurons and 60,000,000 parameters

and achieved error rates on the ImageNet benchmark (15,000,000 images from 22,000 categories) that were previously unheard of

convolutional neural network

source: L. Brown, nvidia blog, 2014


current best practices for deep learning

require weights in each convolutional layer to be shared

use non-saturating activation functions, for instance rectified linear units f(s) = max(0, s)

use dropout during training

use massive amounts of data for training (recall our study of the VC dimension)

train on GPUs

numerous other breakthroughs

speech / text understanding
speech / text translation
genome analysis
game AI
. . .

things are getting crazier by the day . . .

Google's TensorFlow, Dec 2015

Microsoft’s CNTK, Jan 2016

other recent breakthroughs

variational autoencoders (VAEs)
generative adversarial networks (GANs)

further neural architectures

models not based on perceptrons

echo state networks

radial basis function networks

  y(x) = Σ_i α_i · exp(−‖w_i − x‖² / (2σ²))

self-organizing maps, neural gases, . . .

associative memories, Hopfield networks, . . .

(restricted) Boltzmann machines, deep belief networks, . . .
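a sketch of the RBF network output (the centers w_i, coefficients α_i, and width σ are assumed given; the values here are random placeholders):

```python
import numpy as np

def rbf_net(x, centers, alphas, sigma=1.0):
    """y(x) = sum_i alpha_i exp(-||w_i - x||^2 / (2 sigma^2))"""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances to the centers w_i
    return alphas @ np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))   # the w_i
alphas = rng.normal(size=5)         # the alpha_i
print(rbf_net(np.zeros(2), centers, alphas))
```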

neural networks are not that special


observe

a deep net computes a function

  y = f(W^L f(. . . f(W^2 f(W^1 x))))
    = f(W^L φ(x))

(figure: a deep network; all layers below the top-most weight matrix W^L act as a fixed feature map φ)

⇔ it is just another kernel machine

summary

we now know about

  multilayer perceptrons and the back propagation algorithm
  best practices for training (deep) feed forward networks
  the need to learn more about all this ;-)