Pattern Recognition Prof. Christian Bauckhage
outline
additional material for lecture 18
derivative free optimization
the Nelder-Mead method
summary
purpose of this additional material
in lecture 18 of our course on pattern recognition, we discussed algorithms for non-convex optimization

all the approaches we studied there required the computation of gradients or even Hessians, i.e. 1st and 2nd order derivatives of an objective function

however, an often overlooked fact in machine learning / pattern recognition is that there exist optimization algorithms that do not require the computation of derivatives

next, we therefore discuss the idea of derivative free optimization; there are various corresponding techniques which are appealing whenever derivatives are hard to obtain, for instance, in case the objective function is not continuous . . .
derivative free optimization
the general setting is as usual: we are given a function f : R^m → R and are interested in solving

x* = argmin_x f(x)
similar to the gradient descent methods we discussed in lecture 18, we proceed iteratively . . .
derivative free optimization
we start with an initial guess x_0 and create a sequence of optimized guesses

x_{t+1} = o(x_t)

that (hopefully) approaches x*

in other words, we require that each call of the optimization procedure o achieves

f(x_{t+1}) ≤ f(x_t)

and hope for

lim_{t→∞} x_t = x*
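this iterative scheme can be sketched in a few lines of Python; the function name `iterate` and the toy update rule are purely illustrative, not part of the lecture material:

```python
def iterate(o, x0, num_steps=100):
    """Generate the sequence x_{t+1} = o(x_t) from an initial guess x0.

    `o` stands for any optimization procedure that maps the current
    estimate to a (hopefully) better one.
    """
    x = x0
    for _ in range(num_steps):
        x = o(x)
    return x

# toy example: halving the estimate monotonically decreases f(x) = x**2,
# so the sequence approaches the minimizer x* = 0
x_star = iterate(lambda x: 0.5 * x, x0=8.0, num_steps=50)
```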
note
for gradient descent methods, we have

x_{t+1} = o(x_t) = x_t − η_t ∇f(x_t)

so that, for appropriate step sizes η_t, we can rest assured that

f(x_{t+1}) ≤ f(x_t)
however, for non-convex f(x), there is no general guarantee that

lim_{t→∞} x_t = x*
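for comparison, a gradient descent update of this form can be sketched as follows; the quadratic objective f(x) = ‖x‖² with gradient 2x is just an illustrative choice:

```python
import numpy as np

def gradient_descent_step(x, grad_f, eta):
    """One update x_{t+1} = x_t - eta * grad_f(x_t)."""
    return x - eta * grad_f(x)

# illustrative objective: f(x) = ||x||^2 with gradient 2x
f = lambda x: float(np.dot(x, x))
grad_f = lambda x: 2.0 * x

x_t = np.array([3.0, -4.0])
x_next = gradient_descent_step(x_t, grad_f, eta=0.1)

# for this f, any step size eta in (0, 1) guarantees f(x_{t+1}) <= f(x_t)
assert f(x_next) <= f(x_t)
```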
rather, optimization processes depending on an initial guess x_0 might get stuck in a local optimum; this caveat also applies to derivative free methods
derivative free optimization
derivative free optimization methods do not compute ∇f(x_t) but merely evaluate the function f at several points x in the vicinity of the current best estimate x_t

that point x in the vicinity of x_t for which f(x) yields the best value is then used to compute the next best estimate x_{t+1}

in other words, derivative free optimization methods rely on evaluations of f(x) rather than on evaluations of ∇f(x)

however, in order for such methods to work and to converge, the proposal points x near x_t have to be chosen cleverly . . .
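to make this concrete, here is a deliberately naive derivative free step that evaluates f at axis-aligned proposal points around x_t and keeps the best; this particular proposal scheme is only an illustration on our part, not the method discussed next:

```python
import numpy as np

def derivative_free_step(f, x_t, step=0.5):
    """Evaluate f at axis-aligned points around x_t and return the best
    candidate (possibly x_t itself); no gradient information is used."""
    candidates = [x_t]
    for i in range(len(x_t)):
        for sign in (+1.0, -1.0):
            x = x_t.copy()
            x[i] += sign * step
            candidates.append(x)
    return min(candidates, key=f)

f = lambda x: float(np.sum(x ** 2))
x_t = np.array([2.0, -1.0])
x_next = derivative_free_step(f, x_t)   # improves f without any gradients
```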
the Nelder-Mead method
the Nelder-Mead method
the Nelder-Mead or simplex downhill method is a venerable derivative free optimization technique that dates back to the 1960s J.A. Nelder and R. Mead, “A Simplex Method for Function Minimization”, Computer Journal, 7(4), 1965
the Nelder-Mead method
in order to minimize functions f : R^m → R, the Nelder-Mead method maintains a set X = {x_1, x_2, . . . , x_{m+1}} of m + 1 proposal points which form a simplex in R^m
this simplex is iteratively updated using the following algorithm
algorithm

initialize a set X of proposal points
repeat
    sort the points in X such that f(x_1) ≤ f(x_2) ≤ . . . ≤ f(x_{m+1})    // order
    compute centroid x_µ = (1/m) Σ_{i=1}^{m} x_i
    compute x_r = x_µ + α (x_µ − x_{m+1})                                  // reflect
    if f(x_1) ≤ f(x_r) < f(x_m)
        update x_{m+1} = x_r
        continue
    if f(x_r) < f(x_1)
        compute x_e = x_µ + γ (x_r − x_µ)                                  // expand
        if f(x_e) < f(x_r)
            update x_{m+1} = x_e
        else
            update x_{m+1} = x_r
        continue
    compute x_c = x_µ + ρ (x_{m+1} − x_µ)                                  // contract
    if f(x_c) < f(x_{m+1})
        update x_{m+1} = x_c
        continue
    for all i = 2, . . . , m + 1                                           // shrink
        update x_i = x_1 + σ (x_i − x_1)
until convergence
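the algorithm translates into a short, self-contained Python implementation; note that the simplex initialization (x_0 plus one perturbed copy per coordinate) and the parameter defaults below are common conventions and are assumptions on our part:

```python
import numpy as np

def nelder_mead(f, x0, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5,
                max_iter=500, tol=1e-10, init_step=1.0):
    """Minimize f : R^m -> R with the Nelder-Mead simplex method."""
    m = len(x0)
    # initialize simplex: x0 plus one perturbed vertex per coordinate
    X = [np.asarray(x0, dtype=float)]
    for i in range(m):
        x = X[0].copy()
        x[i] += init_step
        X.append(x)

    for _ in range(max_iter):
        X.sort(key=f)                           # order
        if abs(f(X[-1]) - f(X[0])) < tol:       # simple convergence test
            break
        x_mu = np.mean(X[:-1], axis=0)          # centroid of the m best vertices
        x_r = x_mu + alpha * (x_mu - X[-1])     # reflect
        if f(X[0]) <= f(x_r) < f(X[-2]):
            X[-1] = x_r
            continue
        if f(x_r) < f(X[0]):
            x_e = x_mu + gamma * (x_r - x_mu)   # expand
            X[-1] = x_e if f(x_e) < f(x_r) else x_r
            continue
        x_c = x_mu + rho * (X[-1] - x_mu)       # contract
        if f(x_c) < f(X[-1]):
            X[-1] = x_c
            continue
        for i in range(1, m + 1):               # shrink towards best vertex
            X[i] = X[0] + sigma * (X[i] - X[0])

    return min(X, key=f)

# minimize a shifted quadratic; the minimum lies at (1, -2)
f = lambda x: float((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)
x_star = nelder_mead(f, np.array([0.0, 0.0]))
```

sorting the list once per iteration realizes the order step, and the continue statements mirror the accept-and-restart structure of the pseudocode.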
discussion
note the many continue statements in the above algorithm

since vertex x_{m+1} always denotes the current worst estimate, reflecting it at the centroid x_µ of the vertices x_1, . . . , x_m might yield an improvement

if the reflected point x_r is the new best estimate among all the vertices, expanding it may yield even better estimates

yet, if the reflected point x_r does not even improve on the second worst estimate x_m, a better estimate might result from contracting into the simplex formed by the x_i

finally, shrinking deals with the (rare) case where a contraction of x_{m+1} produces an even worse estimate; here it is a good idea to move the vertices x_i, i > 1, into the direction of the current best one x_1
discussion
w.r.t. the constants α, γ, ρ, and σ that appear in this algorithm, we note the requirements

α > 0
γ > 1
0 < ρ ≤ 1/2
0 < σ < 1
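with the standard choices α = 1, γ = 2, ρ = σ = 1/2 (commonly used defaults, stated here as an assumption; the centroid and vertex coordinates are hypothetical), the geometry of the trial points is easy to check numerically:

```python
import numpy as np

# commonly used parameter choices (assumed defaults)
alpha, gamma, rho, sigma = 1.0, 2.0, 0.5, 0.5

x_mu = np.array([0.0, 0.0])   # centroid of the m best vertices (hypothetical)
x_w  = np.array([2.0, 0.0])   # worst vertex x_{m+1} (hypothetical)

x_r = x_mu + alpha * (x_mu - x_w)   # reflection: mirror image of x_w at x_mu
x_e = x_mu + gamma * (x_r - x_mu)   # expansion: twice as far beyond x_mu
x_c = x_mu + rho * (x_w - x_mu)     # contraction: halfway between x_mu and x_w

x_best = np.array([-1.0, 0.0])                # best vertex x_1 (hypothetical)
x_shrunk = x_best + sigma * (x_w - x_best)    # shrink: x_w moves halfway to x_1
```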