
Pattern Recognition
Prof. Christian Bauckhage

outline

additional material for lecture 18

derivative free optimization
the Nelder-Mead method

summary

purpose of this additional material

in lecture 18 of our course on pattern recognition, we discussed algorithms for non-convex optimization

all the approaches we studied there required the computation of gradients or even Hessians, i.e. 1st and 2nd order derivatives of an objective function

however, an often overlooked fact in machine learning / pattern recognition is that there exist optimization algorithms that do not require the computation of derivatives

next, we therefore discuss the idea of derivative free optimization

there are various corresponding techniques which are appealing whenever derivatives are hard to obtain, for instance, in case the objective function is not continuous . . .

derivative free optimization

the general setting is as usual: we are given a function f : R^m → R and are interested in solving

x∗ = argmin_x f(x)

similar to the gradient descent methods we discussed in lecture 18, we proceed iteratively . . .

derivative free optimization

we start with an initial guess x_0 and create a sequence of optimized guesses

x_{t+1} = o(x_t)

that (hopefully) approaches x∗

in other words, we require that each call of the optimization procedure o achieves

f(x_{t+1}) ≤ f(x_t)

and hope for

lim_{t→∞} x_t = x∗
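This generic scheme can be written down directly; the following minimal sketch (function and parameter names are hypothetical, and the update rule o is assumed to be supplied as a Python callable) runs the loop until improvement stalls.

```python
import numpy as np

def iterate(o, f, x0, tol=1e-8, max_iter=1000):
    """Generic iterative minimization: x_{t+1} = o(x_t).

    o  : update rule mapping the current guess to the next one (assumed callable)
    f  : objective function to be minimized
    x0 : initial guess
    """
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        x_new = o(x)
        # require monotone improvement: f(x_{t+1}) <= f(x_t)
        if f(x_new) > f(x):
            break
        # stop once successive guesses barely change
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x
```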

note

for gradient descent methods, we have

x_{t+1} = o(x_t) = x_t − η_t ∇f(x_t)

so that, for appropriate step sizes η_t, we can rest assured that

f(x_{t+1}) ≤ f(x_t)
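As a point of contrast, such a gradient descent update might be sketched as follows (a minimal illustration only; the gradient callable grad_f and the fixed step size are assumptions made here, not part of the material above).

```python
import numpy as np

def gradient_descent_step(x, grad_f, eta=0.1):
    """One gradient descent update x_{t+1} = x_t - eta * grad_f(x_t).

    grad_f is an assumed callable returning the gradient of f at x;
    derivative free methods avoid exactly this ingredient.
    """
    return x - eta * grad_f(x)

# example: f(x) = ||x||^2, so grad_f(x) = 2 x
x = np.array([3.0, -2.0])
for t in range(50):
    x = gradient_descent_step(x, lambda x: 2 * x, eta=0.1)
print(x)  # approaches the minimizer [0, 0]
```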

however, for non-convex f(x), there is no general guarantee that

lim_{t→∞} x_t = x∗

rather, optimization processes depending on an initial guess x_0 might get stuck in a local optimum; this caveat also applies to derivative free methods

derivative free optimization

derivative free optimization methods do not compute ∇f(x_t) but merely evaluate the function f at several points x in the vicinity of the current best estimate x_t

that point x in the vicinity of x_t for which f(x) yields the best value is then used to compute the next best estimate x_{t+1}

in other words, derivative free optimization methods rely on evaluations of f(x) rather than on evaluations of ∇f(x)

however, in order for such methods to work and to converge, the proposal points x near x_t have to be chosen cleverly . . .
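To make the basic idea concrete before turning to a clever choice of proposals, here is a deliberately naive sketch: sample a few points around x_t, evaluate f there, and keep the best one. The sampling scheme, names, and parameter values are illustrative assumptions, not the method discussed next.

```python
import numpy as np

def derivative_free_step(f, x, radius=0.5, num_proposals=20, rng=None):
    """Naive derivative free update: evaluate f at random proposal points
    in the vicinity of x and keep the best one (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    proposals = x + radius * rng.standard_normal((num_proposals, x.size))
    values = np.array([f(p) for p in proposals])
    best = proposals[np.argmin(values)]
    # only accept the proposal if it actually improves on x
    return best if f(best) < f(x) else x
```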

the Nelder-Mead method

the Nelder-Mead method

the Nelder-Mead or simplex downhill method is a venerable derivative free optimization technique that dates back to the 1960s

J.A. Nelder and R. Mead, “A Simplex Method for Function Minimization”, Computer Journal, 7(4), 1965

the Nelder-Mead method

in order to minimize functions f : R^m → R the Nelder-Mead method maintains a set

X = {x_1, x_2, . . . , x_{m+1}}

of m + 1 proposal points which form a simplex in R^m
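One common way to obtain such an initial simplex is to perturb an initial guess along each coordinate axis; the following sketch illustrates this (the offset delta is an assumed value, not prescribed by the method).

```python
import numpy as np

def initial_simplex(x0, delta=0.1):
    """Build m+1 simplex vertices in R^m from an initial guess x0 by
    offsetting x0 along each coordinate axis (delta is an assumed value)."""
    x0 = np.asarray(x0, dtype=float)
    m = x0.size
    return np.vstack([x0, x0 + delta * np.eye(m)])
```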

this simplex is iteratively updated using the following algorithm

algorithm

initialize a set X of proposal points
repeat
    sort the points in X such that f(x_1) ≤ f(x_2) ≤ . . . ≤ f(x_{m+1})    // order
    compute centroid x_µ = (1/m) Σ_{i=1}^{m} x_i
    compute x_r = x_µ + α (x_µ − x_{m+1})                                  // reflect
    if f(x_1) ≤ f(x_r) < f(x_m)
        update x_{m+1} = x_r
        continue
    if f(x_r) < f(x_1)                                                     // expand
        compute x_e = x_µ + γ (x_r − x_µ)
        if f(x_e) < f(x_r)
            update x_{m+1} = x_e
        else
            update x_{m+1} = x_r
        continue
    compute x_c = x_µ + ρ (x_{m+1} − x_µ)                                  // contract
    if f(x_c) < f(x_{m+1})
        update x_{m+1} = x_c
        continue
    for all i = 2, . . . , m + 1                                           // shrink
        update x_i = x_1 + σ (x_i − x_1)
until convergence
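A direct transcription of this pseudocode into Python might look as follows. It is a sketch: the coefficient values α = 1, γ = 2, ρ = 0.5, σ = 0.5 are commonly used defaults rather than values prescribed above, and a fixed iteration budget stands in for a proper convergence test.

```python
import numpy as np

def nelder_mead(f, simplex, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5,
                max_iter=500):
    """Minimize f: R^m -> R with the simplex scheme sketched above.

    simplex: array of shape (m+1, m) holding the initial proposal points.
    alpha, gamma, rho, sigma: reflection, expansion, contraction, and
    shrink coefficients (defaults are commonly used values).
    """
    X = np.array(simplex, dtype=float)
    for _ in range(max_iter):
        # order: sort vertices by objective value
        X = X[np.argsort([f(x) for x in X])]
        f_best, f_second_worst, f_worst = f(X[0]), f(X[-2]), f(X[-1])

        # centroid of all vertices except the worst one
        x_mu = X[:-1].mean(axis=0)

        # reflect the worst vertex at the centroid
        x_r = x_mu + alpha * (x_mu - X[-1])
        f_r = f(x_r)
        if f_best <= f_r < f_second_worst:
            X[-1] = x_r
            continue

        # expand if the reflected point is the new best estimate
        if f_r < f_best:
            x_e = x_mu + gamma * (x_r - x_mu)
            X[-1] = x_e if f(x_e) < f_r else x_r
            continue

        # contract the worst vertex towards the centroid
        x_c = x_mu + rho * (X[-1] - x_mu)
        if f(x_c) < f_worst:
            X[-1] = x_c
            continue

        # shrink all vertices towards the current best one
        X[1:] = X[0] + sigma * (X[1:] - X[0])

    X = X[np.argsort([f(x) for x in X])]
    return X[0]
```

A call such as nelder_mead(f, initial_simplex(x0)), reusing the helper sketched earlier, should then return an approximate minimizer of f.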

discussion

note the many continue statements in the above algorithm

since vertex x_{m+1} always denotes the current worst estimate, reflecting it at the centroid x_µ of the vertices x_1, . . . , x_m might yield an improvement

if the reflected point x_r is the new best estimate among all the vertices, expanding it may yield even better estimates

yet, if the reflected point x_r does not even improve on the second worst estimate x_m, a better estimate might result from contracting the worst vertex x_{m+1} towards the centroid, i.e. into the simplex formed by the x_i

finally, shrinking deals with the (rare) case where a contraction of x_{m+1} produces an even worse estimate; here it is a good idea to move the vertices x_i, i > 1, into the direction of the current best one x_1

discussion

w.r.t. the constants α, γ, ρ, and σ that appear in this algorithm, we note the requirements

α > 0
γ > 1
0 < ρ < 1
0 < σ < 1
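In practice, these constants rarely need to be tuned by hand: SciPy ships a Nelder-Mead implementation, and a call like the following suffices (the test function and starting point here are illustrative choices, not part of the material above).

```python
import numpy as np
from scipy.optimize import minimize

# the Rosenbrock function, a standard non-convex test problem
def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]), method='Nelder-Mead')
print(result.x)  # should end up close to the minimizer [1, 1]
```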