FINDING OPTIMAL ALGORITHMIC PARAMETERS

0 downloads 0 Views 283KB Size Report
use a class of nonsmooth optimization algorithms to solve this problem which makes provision for an ... We opted for the mesh adaptive direct search (mads) [4] as derivative-free method ..... eter estimation in almost any branch of applied mathematics or engineering. ..... 2May be downloaded from www.gerad.ca/NOMAD.
c 2006 Society for Industrial and Applied Mathematics 

SIAM J. OPTIM. Vol. 17, No. 3, pp. 642–664

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING DERIVATIVE-FREE OPTIMIZATION∗ CHARLES AUDET† AND DOMINIQUE ORBAN† Abstract. The objectives of this paper are twofold. We devise a general framework for identifying locally optimal algorithmic parameters. Algorithmic parameters are treated as decision variables in a problem for which no derivative knowledge or existence is assumed. A derivative-free method for optimization seeks to minimize some measure of performance of the algorithm being fine-tuned. This measure is treated as a black-box and may be chosen by the user. Examples are given in the text. The second objective is to illustrate this framework by specializing it to the identification of locally optimal trust-region parameters in unconstrained optimization. The derivative-free method chosen to guide the process is the mesh adaptive direct search, a generalization of pattern search methods. We illustrate the flexibility of the latter and in particular make provision for surrogate objectives. Locally, optimal parameters with respect to overall computational time on a set of test problems are identified. Each function call may take several hours and may not always return a predictable result. A tailored surrogate function is used to guide the search towards a local solution. The parameters thus identified differ from traditionally used values, and allow one to solve a problem that remained otherwise unsolved in a reasonable time using traditional values. Key words. derivative-free optimization, black-box optimization, parameter estimation, surrogate functions, mesh adaptive direct search, trust-region methods AMS subject classifications. 90C56, 90C90, 90C31 DOI. 10.1137/040620886

1. Introduction. Most algorithms, be it in optimization or any other field, depend more or less critically on a number of parameters. Some parameters may be real numbers, such as an initial trust-region radius, or a scalar that dictates the precision at which a subproblem needs to be solved. These parameters may be required to remain between two, possibly infinite bounds, or may be constrained in a more complex way. They may also be discrete, such as the maximal number of iterations or the number of banned directions in a taboo search heuristic, or even categorical, such as a Boolean indicating whether exact or inexact Hessian is used, or which preconditioner to use. The overall behavior of the algorithm is influenced by the values of these parameters. Unfortunately, for most practical cases, it remains unclear how a user should proceed to determine good, let alone optimal, values for those parameters. We devise a framework for fine-tuning such parameters which is general enough to encompass most numerical algorithms from engineering, numerical analysis and optimization. The design of the framework relies on the observation that measures of performance can be derived from the dependency of the algorithm on its parameters. These measures are context- and problem-dependent and for this reason, we wish to ∗ Received by the editors December 15, 2004; accepted for publication (in revised form) March 20, 2006; published electronically September 15, 2006. http://www.siam.org/journals/siopt/17-3/62088.html † GERAD and D´ ´ epartement de Math´ematiques et de G´enie Industriel, Ecole Polytechnique de Montr´ eal, C.P. 6079, Succ. Centre-ville, Montr´eal, Qu´ebec, H3C 3A7 Canada (Charles.Audet@gerad. ca, http://www.gerad.ca/Charles.Audet; [email protected], http://www.mgi.polymtl.ca/ Dominique.Orban). The work of the first author was supported by NSERC grant 239436-01, AFOSR F49620-01-1-0013, and ExxonMobil R62291. The work of the second author was supported by NSERC grant 299010-04.

642

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

643

treat them as black-boxes in the remainder of this paper. We shall, however, give examples in the context of a particular application. An optimization problem is formulated where such a measure of performance is minimized as a function of the parameters, over a domain of acceptable values. We use a class of nonsmooth optimization algorithms to solve this problem which makes provision for an optional surrogate function to guide the search strategy. A surrogate function is a simplification of the real objective function that possesses similar behavior, but is believed to be less costly to evaluate, or easier to manipulate, in some sense. Surrogate functions range from simplified physical models to approximation surfaces, such as Kriging models. The reader is invited to consult [6] for a general framework for the use of surrogates in an optimization context. As an illustration of how to use this framework, we address the study of standard parameters present in trust-region algorithms for unconstrained optimization and try to identify some locally optimal values. The quality of a set of parameters is measured by the overall computational time required by a trust-region algorithm to solve a significant number of test problems to a prescribed precision. Other possibilities of performance measures include the overall number of function calls, the number of failures, the total number of iterations or the agreement of the algorithm with some prescribed behavior. In the numerical tests presented, the test problems originate from the CUTEr [19] collection. We formulate an optimization problem, where each evaluation of the objective function requires solving a collection of test problems. This objective function is time-consuming to evaluate, is highly nonlinear and no derivative is available or even proved to exist. Moreover, evaluating the objective twice with the same arguments may lead to slightly different function values since the computational time is influenced by the current machine load and fluctuates with network activity. We opted for the mesh adaptive direct search (mads) [4] as derivative-free method for reasons which we explain below. In our context, a surrogate function is obtained by applying the same trust-region algorithm to a set of easier problems. The trustregion parameters obtained by mads allow the solution of a problem which remained otherwise unsolved in reasonable time by the trust-region method. Related work on locally optimal parameter identification in a similar trust-region framework is presented in [17], where the parameter space is discretized and a thorough search is carried out. However, even for a modest number of discretized values, devising a mechanism able to compare so many variants of an algorithm becomes an issue. In recent years, performance profiles [13] have been extensively used to compare algorithms, but it remains unclear how to use them to efficiently compare more than five or six. We circumvent this issue in the present paper by delegating the task to another optimization algorithm. In recent years, pattern-search methods have proved to perform decently on molecular geometry and conformation problems [1, 28]. The paper [23] provides a review of direct search methods. The paper is structured as follows. In section 2, we describe a derivative-free framework, then outline a specific implementation of the mads class of algorithms, and highlight the main convergence results. We also describe how a surrogate function can be used within this algorithm. Section 3 describes a standard trust-region algorithm, and discusses the four algorithmic parameters to be fine-tuned. In section 4, we present a methodology specializing our framework to identify locally optimal trustregion parameters using a mads algorithm. Results are presented in secton 5, and we give concluding remarks in section 7.

644

CHARLES AUDET AND DOMINIQUE ORBAN

2. A framework for nondifferentiable problems. A general optimization problem may be stated as (2.1)

min ψ(p), p∈Ω

with ψ : Ω ⊆ R → R ∪ {+∞}. The nature and structure of the function ψ and the domain Ω limit the type of algorithms that may be used to attempt to solve this problem. Global optimization is often only possible when the problem structure is sufficiently rich and exploitable, and when the problem size is reasonable, but frequently out of reach in acceptable time. Under appropriate smoothness assumptions, we are thus often content with first-order critical solutions. For example, when ψ is continuously differentiable over Ω, an appropriate variant of Newton’s method combined with a globalizing scheme yields a critical point under reasonable assumptions. When ψ is nondifferentiable, discontinuous or fails to evaluate for some values of its argument p ∈ Ω, problem (2.1) cannot be satisfactorily approached by such a method. This is often the case when evaluating the objective entices running a computer code. In order to evaluate, the code may, for instance, have to solve a coupled system of differential equations, and may, for some internal reasons, fail to return a meaningful value. In this case, the function value is simply considered to be infinite. In a helicopter rotor blade design application [5], the objective function failed to return a value two times out of three. Randomness may also be present in the evaluation of a function, as in [29], where two evaluations of ψ at the same point p return slightly different values. In this less optimistic case, the best optimality condition which can be hoped for is to find a refined point. We shed light on this concept in section 2.3. 2.1. A general overview of pattern search type methods with surrogate. In the special case where Ω is defined by finitely many linear inequalities, algorithms from the broad class of generalized pattern search (gps) methods [3] are natural candidates to perform the minimization. They have the advantage of being relatively simple and easy to implement. Algorithms of the pattern search type attempt to locate a minimizer of the function ψ over Ω by means of the barrier function  +∞ if p ∈ Ω ψΩ (p) = (2.2) ψ(p) otherwise. We will refer to ψ as being the truth function or, sometimes, simply the truth. As is frequent in nonlinear programming, a second function, σ : Ω → R ∪ {+∞}, playing the role of a model for ψ, may be used to steer the iterates towards promising regions. In the context of nondifferentiable optimization and pattern search methods, a model is often referred to as a surrogate. The surrogate may be an approximation to the truth, or it may be a simplified function whose behavior is similar to that of the truth. An important feature of the surrogate is that it should be cheaper than the truth, in some sense—less costly in terms of time or other. The previous sentences are left intentionally vague, since the formal convergence analysis is independent of the quality of the approximation of ψ by σ. However, in practice, appropriate surrogates may improve the convergence speed. A barrier surrogate σΩ is defined similarly to (2.2). Pattern search type methods are iterative procedures where each iteration essentially consists of two steps. First, a global exploration of the space of variables

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

645

is conducted in hopes of improving the best feasible solution so far, or incumbent, pk ∈ Ω. This flexible exploration stage, called the search step, returns a set of candidates, but the truth function need not be evaluated at all of them. To decide at which of these the truth ψΩ will be evaluated, they are ordered in increasing surrogate function values. After ordering, the set of candidates L = {q 1 , q 2 , . . . , q m } satisfies σΩ (q 1 ) ≤ σΩ (q 2 ) ≤ · · · ≤ σΩ (q m ). A candidate q j ∈ L will be considered promising if σΩ (q j ) ≤ σΩ (pk ) + v|σΩ (pk )|, where v ∈ R+ ∪ {+∞} is a threshold supplied by the user. Candidates which are not promising are eliminated from the search list. The truth function is then evaluated at the promising candidates in L with increasing values of i, and terminating the process as soon as ψΩ (q i ) < ψΩ (pk ). In this case, an improved incumbent is found and we set pk+1 = q i . In the event where the search fails to identify an improved iterate, a local exploration about pk is performed. This is called the poll step. Again, the surrogate function is used to order the trial points. The convergence analysis relies only on this step and it must obey stricter rules. In particular, it prohibits the pruning of candidates. A inconvenience about gps algorithm is that they require some prior knowledge on the structure of Ω to decide on a set of appropriate search directions. Lack of such knowledge may prevent progress towards a minimizer [26]. So as to make a provision for more general Ω and bypass such a prior knowledge requirement, we opted for an extension of gps named the mesh adaptive direct search (mads) [4], which is also based on the above principles. mads allows a more general exploration of the space of variables ensuring stronger convergence results than gps. Indeed, gps confines the poll at each iteration to a fixed finite set of directions, while the set of directions for the mads poll may vary at each iteration, and in the limit the union of these poll directions over all iterations is dense in the whole space. 2.2. An iteration of a MADS algorithm. We now present a lower-level description of mads. The reader is invited to refer to [4] for a complete algorithmic description, a detailed convergence analysis and numerical comparisons with gps. The version presented here is specialized to our purposes and some algorithmic choices were made. Let S0 ⊂ Ω denote a finite set of initial guesses, provided by the user (a strategy exploiting the surrogate to determine S0 is presented in section 4.4). Set p0 to be the best initial guess in S0 . A mads algorithm is constructed in such a way that any trial point generated by the search or poll step is required to lie on the current mesh, the coarseness of which is governed by a mesh size parameter Δk ∈ R+ . The mesh is formally defined in Definition 2.1. Definition 2.1. At iteration k, the current mesh is defined to be the union    Mk = p + Δk z : z ∈ Z , p∈Sk

where Sk is the set of points where the objective function ψ has been evaluated by the start of iteration k and Z denotes the set of integers. The goal of the iteration is to find a trial mesh point with a lower objective function value than the current incumbent value ψΩ (pk ). Such a trial point is called an improved mesh point, and the iteration is called a successful iteration. There are no sufficient decrease requirements: any improvement in ψΩ leads to a successful iteration. The iteration is said to be unsuccessful if no improved point is found.

646

CHARLES AUDET AND DOMINIQUE ORBAN

s q3

q2 s

S S







r1 s

 

s pk

 

r4

s

S

Ss S

r2

S



 s4

s S Ss

q

r3

q1

s Iteration k (unsuccessful)

Iteration k + 1 with pk+1 = pk

Fig. 1. Two consecutive frames Fk , Fk+1 in R2 with Δk =

1 , 2

Δk+1 =

1 . 8

The poll step evaluates ψΩ at 2 trial points surrounding the current incumbent. These neighboring points are called the frame, and are denoted by (2.3)

Fk = {pk ± Δk d : d ∈ Dk },

where Dk = {d1 , d2 , . . . , d } is a basis in Z . To ensure convergence, the radii (max{ Δk d ∞ : d ∈ Dk }) of successive frames must converge to zero at a slower rate than the mesh size parameter. The construction of the basis Dk proposed in [4] ensures that 

Δk d ∞ = O( Δk ) for all d ∈ Dk . (2.4) Figure 1 shows an example of two consecutive frames in R2 . The figure on the left represents iteration k. The mesh Mk is represented by the intersection of all lines. Suppose that Δk = 12 . The thick lines delimit the frame, i.e., the region in which all four poll points must lie. In this example, the frame points q1 and q3 are obtained by the randomly generated direction d1 = (0, −2), and q2 and q4 are obtained by d2 = (2, 1). The figure on the right displays a possible frame if iteration k is unsuccessful. The mesh is finer at iteration k + 1 than it was at iteration k, and there are more possibilities in choosing a frame. More precisely, Δk+1 = Δ4k = 18 and, as in (2.4), the distance from the boundary of the frame to the incumbent is reduced  by a factor of 2; ri − pk+1 ∞ = 14 q j − pk ∞ for all i, j. The direction used to construct the frame points r1 and r3 are d1 = (−3, 4), and the direction for r2 and r4 are d2 = (4, 0). In the event that iteration k + 1 is successful at the mesh poll point r3 , iteration k + 2 would be initiated at the new incumbent pk+2 = r3 with a larger mesh size parameter Δk+2 = 4Δk+1 = 12 . When the poll step fails to generate an improved mesh point then the frame is called a minimal frame, and the frame center pk is said to be a minimal frame center. At iteration k, the rule for updating the mesh size parameter is ⎧ ⎨ Δk /4 if pk is a minimal frame center, 4Δk if an improved mesh point is found, and Δk ≤ 14 , Δk+1 = (2.5) ⎩ otherwise. Δk

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

647

Algorithm 2.1. [A mads algorithm] Step 0 [Initialization] Let S0 be given, p0 ∈ arg min{ψ(p) : p ∈ S0 }, Δ0 > 0, and v ∈ R+ ∪ {+∞}. Set the iteration counter k = 0, and go to Step 1. Step 1 [search step] Let L = {q 1 , q 2 , . . . , q m } ⊂ Mk be a finite (possibly empty) set of mesh points such that σΩ (q i ) ≤ σΩ (q j ) ≤ σΩ (pk ) + v|σΩ (pk )| when 1 ≤ i < j ≤ m. Let i0 be the smallest i ∈ {1, . . . , m} such that ψΩ (q i ) < ψΩ (pk ). If no such index i0 exists, go to Step 2. Otherwise, declare k successful, set pk+1 = q i0 and go to Step 3. Step 2 [poll step] Construct the frame Fk = {q 1 , q 2 , . . . , q 2 } as in (2.3) and order the points so that σΩ (q i ) ≤ σΩ (q j ) when 1 ≤ i < j ≤ 2. Let i0 be the smallest i ∈ {1, . . . , 2} such that ψΩ (q i ) < ψΩ (pk ). If no such index i0 exists, declare k unsuccessful and go to Step 3. Otherwise, declare k successful, set pk+1 = q i0 and go to Step 3. Step 3 [Parameter update] If iteration k was declared unsuccessful, then pk is a minimal frame center and pk+1 is set to pk . Otherwise pk+1 is an improved mesh point. Update Δk+1 according to (2.5). Increase k ← k + 1 and go back to Step 1.

The previous description is summarized in Algorithm 2.1. Given the functions ψ and σ, the only steps that are not completely defined in Algorithm 2.1 are the selection of the set of initial guesses S0 and the search strategy. Particular choices in the framework of an application are discussed in section 4.4. 2.3. Convergence properties of MADS. We restrict ourselves to the case where Ω is convex and full dimensional, i.e., with nonempty interior. This encompasses the case where Ω is defined by (possibly strict) linear inequalities. It seems reasonable to assume that in practice, most algorithms depend on parameters satisfying this assumption. Moreover, the mads convergence analysis greatly simplifies. This section presents the specialized results. The proofs of these results are special cases of the proofs in [4]. Let cl(A) denote the closure of a set A. Under our conditions, the tangent cone to Ω at some pˆ ∈ cl(Ω), denoted by TΩ (ˆ p), becomes the closure of the set {μ(v − pˆ) : μ ∈ R+ , v ∈ Ω}. The normal cone to Ω at pˆ ∈ cl(Ω) is the polar of the tangent cone NΩ (ˆ p) = TΩ (ˆ p)◦ = {w ∈ R | wT v ≤ 0 ∀v ∈ TΩ (ˆ p)}. If the gradient ∇ψ(ˆ p) exists at pˆ ∈ cl(Ω), we say that pˆ is first-order critical for (2.1) if −∇ψ(ˆ p) ∈ NΩ (ˆ p). For nonconvex domains, more general definitions of tangency from nonsmooth analysis are used in [4]. The analysis relies on Assumption 2.1. Assumption 2.1. At least one initial guess p0 ∈ S0 ⊆ Ω has finite ψ(p0 ) value and all iterates {pk } produced by Algorithm 2.1 lie in a compact set. The mechanism of Algorithm 2.1 ensures the following property. Lemma 2.2. The sequence of mesh size parameters satisfies lim inf k→∞ Δk = 0. Moreover, since Δk shrinks only at minimal frames, it follows that there are infinitely many minimal frame centers. Definition 2.3 specifies important subsequences of iterates and limiting directions. Definition 2.3. A subsequence of the mads iterates consisting of minimal frame centers, {pk }k∈K for some subset of indices K, is said to be a refining subsequence

648

CHARLES AUDET AND DOMINIQUE ORBAN

if {Δk }k∈K converges to zero. Any accumulation point of {pk }k∈K will be called a refined point. Let {pk }k∈K be a convergent refining subsequence, with refined point pˆ, and let v be any accumulation point of the set { ddkk  : pk + Δk dk ∈ Ω, k ∈ K} ⊂ R . Then v is said to be a refining direction for pˆ. Along a refining subsequence, we thus always have ψ(pk+1 ) ≤ ψ(pk ). Note that under Assumption 2.1, there always exists at least one convergent refining subsequence, one refining point and a positive spanning set of refining directions. Recall that the Clarke generalized directional derivative [7, 22] of ψ at pˆ in the direction v = 0 is defined as ψ ◦ (ˆ p; v) ≡

lim sup y → pˆ, y ∈ Ω, t ↓ 0, y + tv ∈ Ω

ψ(y + tv) − ψ(y) . t

The following result presents a hierarchy of optimality conditions satisfied at a refining point. The weakest conclusion is a direct consequence of Definition 2.3. The next conclusions result from increasingly strong assumptions on ψ. The main result is that the Clarke derivatives of ψ at a refined point pˆ are nonnegative for all directions in the tangent cone. Theorem 2.4. Let pˆ ∈ cl (Ω) be a refined point of a refining subsequence {pk }k∈K , and assume that the set of refining directions for pˆ is dense in TΩ (ˆ p). • pˆ is the limit of minimal frame center on frames that become infinitely fine, • if ψ is lower semicontinuous at pˆ, then ψ(ˆ p) ≤ limk∈K ψ(pk ), • if ψ is Lipschitz near pˆ, then ψ ◦ (ˆ p, v) ≥ 0 for every v ∈ TΩ (ˆ p), • if ψ is Lipschitz near pˆ ∈ int(Ω), then 0 ∈ ∂ψ(ˆ p) ≡ {s ∈ R : ψ ◦ (ˆ x; v) ≥ v T s,  ∀v ∈ R }, • if ψ is strictly differentiable [24] at pˆ ∈ Ω, then pˆ is first-order critical for (2.1) and is a KKT point of (2.1). Note that the above convergence results rely only on the poll step, and are independent of the surrogate function and of the search step. Furthermore, even though Algorithm 2.1 is applied to ψΩ instead of ψ, the convergence results are linked to the local smoothness of ψ and not ψΩ , which is obviously discontinuous on the boundary of Ω. Practical implementations of mads ensure that the density assumption of Theorem 2.4 is satisfied [4]. 3. Trust-region methods. In this section, we briefly cover a globally convergent framework for unconstrained programming. This framework depends on a number of parameters which influence the impact of the main convergence result. To successfully tackle a smooth nonlinear nonconvex programming problem from a remote starting guess, the iteration must often be embedded into a globalization scheme, the most popular of which are the linesearch, the trust region [10], and the more recent filters [14, 20]. The first two are probably the most well known and their philosophies may be seen as dual; a linesearch strategy computes a step length along a predetermined direction, while a trust-region strategy considers all acceptable directions but limits the maximal step length. 3.1. A basic trust-region algorithm. Trust-region methods appear to date back to a 1944 paper in which they were used to solve nonlinear least-squares problems [25]. After updating rules were introduced in 1966 [15] for the size of the region,

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

649

global convergence of a particular algorithm was proved in 1970 [30]. Trust-region methods now form one of the most popular globalization schemes and are often praised for their robustness and flexibility. They are used throughout optimization, from regularization problems to derivative-free and interior-point methods. For lists of references, historical notes, and thorough theoretical developments across the whole optimization spectrum we refer the interested reader to [10]. For simplicity, assume we wish to solve the unconstrained programming problem (3.1)

min

x∈Rn

f (x),

where f : Rn → R is a twice-continuously differentiable function which might be highly nonlinear and/or costly to evaluate. For problem (3.1) to be well defined, we will assume that f is bounded below. At iteration k, instead of manipulating f directly, f is replaced by a suitable local model mk which is easier and cheaper to evaluate. A region Bk ⊂ Rn , referred to as the trust region, is defined around xk to represent the extent to which mk is believed to reasonably model f . The trust region is defined as the ball Bk ≡ {xk + s ∈ Rn : s ≤ δk } , where δk > 0 is the current trust-region radius and · represents any norm on Rn . To simplify the exposition, we choose the Euclidean norm but other choices are acceptable [10]. The model mk is approximately minimized within Bk . If the decrease thus achieved is sufficient, and if the agreement between f and mk at the trial point is satisfactory, the step is accepted and the radius δk is possibly increased. Otherwise, the step is rejected and the radius is decreased. This last option indicates that mk might have been trusted in too large a neighborhood of xk . Global convergence of trust-region schemes is ensured by mild assumptions on mk and on the decrease that should be achieved at each iteration. In practice, one of the most popular models is the quadratic model (3.2)

1 mk (xk + s) = f (xk ) + ∇f (xk )T s + sT ∇2 f (xk )s. 2

Sufficient decrease in the model at a trial point xk + s is achieved if the decrease in mk is at least a fraction of that obtained at the Cauchy point xCk —the minimizer of mk along the steepest descent direction d = −∇mk (xk ) within Bk —i.e., (3.3)

mk (xk ) − mk (xk + s) ≥ θ [mk (xk ) − mk (xCk )] ,

where 0 < θ < 1 is independent of k. A typical trust-region framework for problem (3.1) may be stated as Algorithm 3.1. The updating rule (3.4) at Step 3 of Algorithm 3.1 is not the only one used in practice, but most likely the most common one. Other rules involve polynomial interpolation of ρk = ρ(sk ) as a function of the step sk [12], while others devise more sophisticated functions to obtain the new radius [21, 38]. The lengthy subject of how to solve the subproblems at Step 1 of Algorithm 3.1 while ensuring (3.3) is out of the scope of this paper. We shall simply argue in section 4 that the method used in our implementation ensures this. 3.2. Convergence properties of the basic algorithm. We recall in this section the important global convergence properties of Algorithm 3.1 without proof. The proofs may be found in [10].

650

CHARLES AUDET AND DOMINIQUE ORBAN

Algorithm 3.1. [Basic trust-region algorithm] Step 0 [Initialization] An initial point x0 ∈ Rn and an initial trust-region radius δ0 > 0 are given, as well as parameters 0 ≤ η1 < η2 < 1

and

0 < α1 < 1 < α2 .

Compute f (x0 ) and set k = 0. Step 1 [Step calculation] Define the model (3.2) of f (xk + s) in Bk and compute a step sk ∈ Bk which satisfies (3.3). Step 2 [Acceptance of the trial point] Compute f (xk + sk ) and ρk =

f (xk ) − f (xk + sk ) . mk (xk ) − mk (xk + sk )

If η1 ≤ ρk , then set xk+1 = xk + sk ; otherwise, set xk+1 = xk . Step 3 [Trust-region radius update] Set ⎧ if ρk < η1 ⎨ α1 sk

δk if η1 ≤ ρk < η2 δk+1 = (3.4) ⎩ max[α2 sk , δk ] if η2 ≤ ρk . Increment k by one, and go to Step 1. Step 2 of Algorithm 3.1 is often referred to as the computation of achieved versus predicted reduction. Achieved reduction is the actual reduction in the objective f , defined by aredk = f (xk ) − f (xk + sk ). Predicted reduction is the reduction suggested by the model, predk = mk (xk ) − mk (xk + sk ). A step sk is thus accepted whenever aredk ≥ η1 predk , an iteration we refer to as successful. Requirements on the function f and each model mk are gathered in Assumption 3.1. Assumption 3.1. The function f is bounded below and its Hessian matrix remains bounded over a set containing all iterates xk . The first stage in the global convergence analysis of Algorithm 3.1 is usually summarized by Theorem 3.1. Theorem 3.1. Suppose that Assumption 3.1 is satisfied. Then lim inf ∇f (xk ) = 0. k→∞

Theorem 3.1 was first proved in [30] in a framework where η1 = 0, i.e., where all trial points are accepted as soon as they produce a decrease in the objective. This result proves that if {xk } has limit points, at least one of them is critical. In fact, this is as good a convergence result as we can obtain when η1 = 0 [39]. Algorithm 3.1 is more demanding on the trial point—sufficient reduction must be achieved. This sheds some light on the importance of the value of η1 in the framework, for as the next result shows, a much stronger conclusion holds in this case. Theorem 3.2. Suppose Assumption 3.1 is satisfied, and that η1 > 0. Then lim ∇f (xk ) = 0.

k→∞

In other words, Theorem 3.2 shows that all limit points are first-order critical for (3.1). The distinction between Theorem 3.1 and Theorem 3.2 was reinforced by the

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

651

careful example of [39], where it is shown that an algorithm with η1 = 0 may very well produce limit points which are not critical. The importance of the parameters η1 and η2 , but also α1 and α2 of Algorithm 3.1 will be of interest to us in the remainder of this paper. In particular, we shall come back to the issue of reduction versus sufficient reduction. 4. Methodology. An objective of the paper is to address a long-standing question of identifying four optimal parameters found in a trust-region update (3.4), namely η1 , η2 , α1 , and α2 . In this section, we present a general methodology to address this issue. 4.1. A black-box approach to parameter estimation. Suppose that Algorithm A depends on a set of continuous parameters p restricted to lie in Ω ⊂ R , where  is typically small. Let PO = {Pi | i ∈ O} be a set of nO ≥ 1 problem instances believed to be representative of the class of problems for which Algorithm A was designed, or to be of particular interest in the context of Algorithm A. Define a function ψ : Ω → R so that for any p ∈ Ω, ψ(p) is some measure of the performance of Algorithm A in solving the set of problems Pi ∈ PO and such that, the smaller the value of ψ(p), the better the performance of the algorithm, in a context-dependent sense. In an optimization context, examples of a function ψ would include the total CPU time required to solve the complete test set, the number of problems unsuccessfully solved or the cumulative number of iterations or of function evaluations. In other contexts, any appropriate measure may be used. The above description qualifies as a black-box optimization problem in the sense that a computer program must in general be run in order to evaluate ψ(·) at a given parameter value p ∈ Ω. For all allowed values of the parameters p, we seek to minimize a global measure of the performance of Algorithm A. In other words, we wish to solve problem (2.1). Problem (2.1) is usually a small-dimensional nondifferentiable optimization problem with expensive black-box function evaluation. As an additional difficulty, evaluating the objective function of (2.1) twice at the same value of p might produce two sligthly different results.1 It therefore seems natural to use the mads algorithm to approach it. In the present context, there is a natural way to define a less expensive surrogate function that would have an overall behavior similar to that of ψ. Let PS = {Pj | j ∈ S} be a set of nS ≥ 1 easy problems and for any p ∈ Ω, define σ(p) to be the same measure (as with ψ) of performance of Algorithm A in solving the set PS of problems. The quality of an approximation of the behavior of ψ by the surrogate function σ depends on nS and on the features of the problems in PS . The more problems, the better the approximation, but the cost of surrogate evaluations will increase. There is therefore a trade-off between the quality and the cost of a surrogate function. It would thus make sense to include in PS problems that are less expensive to solve and possibly, but not necessarily, we may choose S ⊆ O. Another interesting possibility for a surrogate would be to solve the problems in PS to a looser accuracy. Note that this framework is sufficiently general to encompass algorithmic parameter estimation in almost any branch of applied mathematics or engineering. 1 And

thus technically, ψ is not a function in the mathematical sense.

652

CHARLES AUDET AND DOMINIQUE ORBAN

4.2. An implementation of the basic trust-region Algorithm 3.1. Our implementation of the trust-region method is relatively conventional and relies on the building blocks of the GALAHAD library for optimization [18]. Trust-region subproblems are solved by means of the Generalized Lanczos method for Trust Regions GLTR [16, 18]. This method is attractive for its ability to naturally handle negative curvature and to stop at the Steihaug–Toint point, i.e., the intersection of the trustregion boundary with a direction of sufficiently negative curvature [35, 37]. It ensures satisfaction of (3.3). We now discuss the termination criteria of applying the trust-region Algorithm 3.1 to the unconstrained problem (Pi ) ≡

min

x∈Rni

fi (x),

where fi : Rni → R, for i ∈ O ∪ S. The trust-region algorithm stops as soon as an iterate xk satisfies

∇fi (xk ) 2 ≤ 10−5 . The algorithm is also terminated when this criterion was not met in the first 1000 iterations. We elected against scaling the problem or the stopping test, as this introduced a new parameter in the procedure. Following the Steihaug–Toint procedure, the subproblem in s ∈ Rni min ∇fi (xk )T s + 12 sT ∇2 fi (xk )s s.t. s 2 ≤ δk is successfully solved as soon as

∇fi (xk + s) 2 ≤ min

1 1/2 , ∇fi (xk ) 2 10

or s 2 = δk .

The initial trust-region radius was chosen according to

(4.1)

δ0 = max

 1

∇fi (x0 ) 2 , 1 . 10

4.3. Measures of performance. In our experiments below, a single function ψ(·) is considered, the evaluation of which involves running a computer program which has a given trust-region algorithm for unconstrained optimization solve a series of problems with given values of the parameters. The parameters are p = (η1 , η2 , α1 , α2 ) from Step 3 of Algorithm 3.1,  = 4, and (4.2)

  Ω = p ∈ R4 | 0 ≤ η1 < η2 < 1 and 0 < α1 < 1 < α2 ≤ 10 .

A strategy to fine-tune the parameter δ0 appears in [32]. We have chosen to exclude this parameter from Ω, as the problem-dependent value (4.1) seems more sensible both conceptually and in our tests. This also makes comparisons with the results of [17] more manageable. Note that Assumption 2.1 is satisfied since the domain Ω is a full-dimensional bounded convex set. The upper bound on α2 was introduced on the one hand to

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

653

satisfy Assumption 2.1, and on the other hand because it does not appear intuitive, or useful, to enlarge the trust-region by an arbitrarily large factor on very successful iterations. The domain Ω is handled by mads through the barrier approach, i.e., if a parameter value violates any linear constraint, the resulting value of the objective is set to infinity. To ensure that a change in Δ in Algorithm 2.1 is comparable for all four parameters, the latter are scaled using ⎤ ⎡ ⎤⎡ ⎤ ⎡ 1000 η1 η˜1 ⎥ ⎢ η2 ⎥ ⎢ η˜2 ⎥ ⎢ 100 ⎥=⎢ ⎥⎢ ⎥ ⎢ (4.3) ⎦ ⎣ ⎦ ⎣ α1 ⎦ . ⎣ α 100 ˜1 10 α ˜2 α2 mads then works with the variables (˜ η1 , η˜2 , α ˜1, α ˜ 2 ). Suppose a set of nO unconstrained problems from the CUTEr [19] collection is chosen. An evaluation of the black-box function ψ(·) is defined by the solution of these problems using Algorithm 3.1 and the current parameter values p ∈ Ω. The outcome of this evaluation is either a real number—our measure of performance—or an infinite value, resulting from a computer or algorithmic failure in one or more problems. Failures may occur because the maximum number of iterations has been exceeded, because of memory or system-dependent errors or perhaps because of a floating-point exception. Noticeably, in the context of Algorithm 3.1, the same parameter values 

1 3 1 (4.4) , , ,2 pC = (η1 , η2 , α1 , α2 ) = 4 4 2 are often recommended in the literature [8, 9, 11, 31, 33, 34]. We shall refer to those as the classical parameter values. Our contention is to show that the values (4.4) are arbitrary and that much better options are available. We use Algorithm mads to identify them. In our tests, the initial trial point considered by mads is pC . If evaluating the objective function is relatively cheap compared to the algorithm’s internal work, which might be the case when the dimension of the problem is large and the linear algebra therefore more expensive, then the overall CPU time is an obvious choice for measuring performance. Since this turned out to be the case for several problems in our objective list, it is the measure of performance which we chose. If we denote by τi (p) the CPU time which was necessary to solve problem Pi with the parameters p, we may define  (4.5) τi (p). ψ cpu (p) = i∈O

Obviously, similar performance functions may be defined to minimize the overall number of function evaluations—a measure justified in the frequent case where objective function evaluations are computationally costly and dominate the other internal work of the algorithm. From the mads point of view, each objective function evaluation requires the computation of nO values: τi (·) for i ∈ O. 4.4. The role and choice of surrogate function σ. A surrogate function plays three important roles in the present context. First, it is used as if it were the

654

CHARLES AUDET AND DOMINIQUE ORBAN

real objective function, for the sole purpose of obtaining a better starting point than pC before restarting the procedure with the objective function defined by the objective list. Its purpose is thus to postpone the long computations until a basin containing a local minimizer is reached. No surrogate approximation of the surrogate σ was used. Starting from pC given by (4.4), mads terminates with some solution pS . After having obtained these parameters, we can now apply mads on the truth function ψ and use the surrogate function σ to guide the algorithm. The set of initial guesses was chosen as S0 = {pC , pS }. Second, the surrogate is used to order trial points generated by the mads poll or search steps, as described in section 2.1. If the surrogate is appropriate, the ordering should produce a successful iterate for the real objective function before all directions have been explored. Finally, the third role of the surrogate is to eliminate from consideration the trial search points at which the surrogate function value exceeds the user threshold value v (introduced in section 2.1), which is in our case set to 0.1. In the present application, the search strategy differs from one iteration to another, and goes as follows. When k = 0, the search consists of a 64 point Latin hypercube sampling of Ω in hopes of identifying promising basins [27, 36]. At iteration k ≥ 1, the search consists of evaluating the surrogate barrier function at 8 randomly generated mesh points and at the point produced by the dynamic search described in [4] (when applicable). This dynamic search is only called after a successful iteration, and essentially consists of evaluating ψ at the next mesh point in a previously successful direction. In the present context, a function evaluation consists in solving a list of problems with the trust-region Algorithm 3.1 and combining the results on each problem into a unique real number. The objective function was defined by a list of relatively hard problems of significant dimension. A surrogate function σ, as described in section 4.1, was defined by a set of smaller problems which could be expected to be solved in overall reasonable time. The surrogate function is simply (4.6)

σ cpu (p) =



τj (p).

j∈S

The problems were chosen as follows. The trust-region algorithm described in section 4.2 with the classical parameter values pC was run on the 163 unconstrained regular problems from the CUTEr collection, using the default dimensions. From those, some problems failed to be solved for memory reasons and some reached the maximum number of iterations of 1000. Two test lists were extracted from the results. The surrogate list S consists of those problems for which ni < 1000 and 0.01 ≤ τi (pC ) ≤ 30 (measured in seconds). The surrogate list contains 54 problems of small to moderate dimension,  C 2 ≤ ni ≤ 500 and such that τ i∈S i (p ) = 68.20 seconds. We may thus expect that running through the surrogate list, i.e., evaluating the surrogate, should not take much longer than two minutes. Problems in this list and their dimension are summarized in Table 5. The objective list O consists in those problems for which ni ≥ 1000 ≤ 3600. This yields a list of 55 problems with 1000 ≤ ni ≤ 20000 and and τi (pC ) such that i∈O τi (pC ) = 13461 seconds, which amounts to approximately 3 hours and 45 minutes. The latter is the time that one objective function evaluation may be expected to take. Problems in this list and their dimension are summarized in Table 4.

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

655

Table 1 Minimization of the surrogate function to improve the initial solution. The measure σ cpu (p) is in seconds. #f evals 1 3 57 138 139 141 154 194 224 289

σ cpu (p) 68.20 66.11 56.94 56.53 54.42 53.88 53.67 52.57 52.43 52.35

η˜1 250 538.86719 221.65625 221.65625 221.65625 221.65625 221.65625 221.6875 221.625 221.625

η˜2 75 77.480469 89.175781 89.300781 89.675781 89.675781 90.175781 90.175781 90.203125 90.207031

α ˜1 50 43.765625 39.511719 39.402344 39.074219 39.074219 39.074219 38.996094 38.996094 38.996094

α ˜2 20 17.914062 23.042969 23.136719 23.417969 22.917969 22.667969 22.792969 22.792969 22.792969

5. Numerical results. All tests were run on a 500 MHz Sun Blade 100 running SunOS 5.8. The implementation of mads is a C++ package called nomad.2 5.1. Improving the initial solution with a surrogate function. The first run consisted of applying mads to the surrogate function (4.6) from the classical parameters pC . Results are reported in Table 1. The first column contains the number of mads function evaluations required to improve the measure to the value in the second column. The other columns contain the corresponding parameter values. mads stopped after 310 function evaluations as the mesh size parameter Δk dropped below the stopping tolerance of 10−6 . The best set of parameters pS = (0.221625, 0.90207031, 0.38996094, 2.2792969) was identified at the 289th evaluation. This strategy allowed the improvement of the truth initial value from ψ cpu (pC ) = 13461 to ψ cpu (pS ) = 11498 seconds, i.e., an improvement of approximately 33 minutes or 14.58%. Note that the value of the surrogate at pC had to be evaluated again during the first iteration and that its differred by approximately 1% from the one we had first obtained section 4.4 before building the surrogate. This is an illustration of the nondeterministic aspect of such objective functions. 5.2. Performance profiles. Comparison between the initial and final values of the objective function, i.e., the benchmarking of the trust-region method on a set of nonlinear unconstrained programs for the initial and final values of the parameters, will be presented using performance profiles. Originally introduced in [13], we briefly recall here how to read them. Suppose that a given algorithm Ai from a competing set A reports a statistic uij ≥ 0 when run on problem j from a test set S, and that the smaller this statistic the better the algorithm is considered. Let the function  1 if u ≤ αu∗ ω(u, u∗ , α) = 0 otherwise be defined for all u, u∗ and all α ≥ 1. The performance profile of algorithm Ai is the function  ∗ j∈S ω(uij , uj , α) πi (α) = with α ≥ 1, |S| 2 May

be downloaded from www.gerad.ca/NOMAD.

656

CHARLES AUDET AND DOMINIQUE ORBAN

Table 2 Minimization of the objective function from improved starting point. The measure ψ cpu (p) is in seconds. #f evals 1 5 7 13 40 42 73 74 77 97 98 115 116 142

ψ cpu (p) 11498.02 11241.23 10757.54 10693.04 10691.43 10617.41 10617.41 10279.85 10183.95 10183.95 10195.39 10195.39 10193.14 10193.14

η˜1 221.625 221.375 221.375 219.375 219.25 219.25 219.25 221.25 221.25 221.25 221.25 221.25 221.25 221.25

η˜2 90.207031 90.207031 90.207031 90.207031 90.207031 90.207031 90.207031 94.457031 94.457031 94.457031 94.457031 94.457031 94.457031 94.457031

α ˜1 38.996094 38.871094 38.871094 38.871094 38.871094 38.621094 38.621094 37.996094 37.933594 37.933594 37.933594 37.933594 37.933594 37.933594

α ˜2 22.792969 22.917969 23.417969 24.417969 24.417969 24.417969 24.417969 23.042969 23.042969 23.042969 23.042969 23.042969 23.042969 23.042969

where u∗j = mini∈A uij . Thus, πi (1) gives the fraction of the number of problems for which algorithm Ai was the most effective, according to the statistics uij , πi (2) gives the fraction for which algorithm Ai is within a factor of 2 of the best, and limα→∞ πi (α) gives the fraction of examples for which the algorithm succeeded. 5.3. Minimizing the total computing time. Algorithm 2.1 was restarted using the objective function ψ cpu (·). The starting point was the parameter values suggested by the surrogate function in Table 1. The stopping condition this time was to perform a maximum of 150 truth evaluations. Based on the estimate of roughly 3 hours and 45 minutes per function evaluation, this amounts to an expected total running time of just about three weeks. Fragments of the evolution of ψ cpu (·) are given in Table 2. The final iterate (5.1)

p∗ = (0.22125, 0.94457031, 0.37933594, 2.3042969)

produced by mads gives a value ψ cpu (p∗ ) = 10193.14, and reduces the total computing time to just under 2 hours and 50 minutes; a reduction of more than 11% of the computing time when compared to pS and of almost 25% when compared to pC . This p∗ is clearly in favor of a sufficient decrease condition rather than of η1 ≈ 0. The values ψ cpu (pC ) and ψ cpu (p∗ ) can be visualized in the profile of Figure 2 which compares the cpu-time profiles of Algorithm 3.1 applied to the objective list for the initial and final parameter values. The profile of Figure 3 presents a similar comparison, using the number of function evaluations. 5.4. Interpretation of the results. The above results must be interpreted in light of the objective function used in the minimization procedure. Figures 2 and 3 result from a minimization of ψ cpu (·). A minimization of the total number of trust-region function evaluations ψ fval (·) would likely have produced a different set of parameters. Moreover, the simplicity of ψ fval (·) and ψ cpu (·) is counterbalanced by their disadvantage of computing global measures. More sophisticated objectives in the present application could penalize the fact that a particular problem took a long time to fail for some parameter values while for others, failure was quickly detected.

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

657

Objective function: CPU time 1

0.8

πi(α)

0.6

0.4

0.2

classical mads-solution 0

1

1.5

2

2.5

α

3

3.5

4

Fig. 2. Profile comparing the CPU time required for one evaluation of the mads objective for the initial and final parameter values.

Similarly, they do not treat differently problems which are uniformly solved in a fraction of a second for nearly all parameter values and problems whose running time varies with great amplitude. Such effects might, and do, cause mads to elect against exploring certain regions. To illustrate this point, we note that the parameter values recommended by mads differ significantly from those recommended in [17]. In a separate preliminary series

Objective function: #f evals 1

0.8

πi(α)

0.6

0.4

0.2

classical mads-solution 0

1

1.1

1.2

1.3

1.4

α

1.5

1.6

1.7

Fig. 3. Profile comparing the number of function evaluations required for one evaluation of the mads objective for the initial and final parameter values. The scale of the α axis has been extended.

658

CHARLES AUDET AND DOMINIQUE ORBAN

Fig. 4. Profile comparing the CPU time required for one evaluation of the mads objective for the initial, final, and alternative parameter values.

of tests, the surrogate function σ cpu (·) was minimized over Ω (4.2) without using the scaling (4.3). The final, alternative, set of parameters thus produced (5.2)

pA = (0.000008, 0.9906, 0.3032, 3.4021),

is rather close to the recommendations of [17] and seems to indicate that enforcing sufficient descent is not particularly beneficial in practice. The value ψ cpu (pA ) = 13707 is surprisingly higher than ψ cpu (pC ) = 13461. However, the corresponding profile appears significantly better than the reference algorithm using pC as illustrated by Figures 4 and 5. Problems with long solution times are gathered in Table 3. The first three of those do not appear to influence the behavior of mads by much, as their solution time varies little. Some problems failed to be solved for any values of the parameter. Among those, GENHUMPS is particularly detrimental to the value ψ cpu (pA ) as the failure takes 10 times longer to be detected than at p∗ . Likely, the value pA would produce much better results if GENHUMPS were not present. We see nonetheless in the present case that mads performed its task as it should have and that, perhaps, it is the objective function ψ cpu (·) which should take such outliers into account, since the presence of problems like GENHUMPS cannot be anticipated.

Table 3 Problems with relatively high and varying solution times. A “F” indicates a failure. CPU times are in seconds. Problem Pi DIXON3DQ EIGENALS NCB20B CHAINWOO GENHUMPS

τi (pC ) 2728.24 1768.25 1444.47 1224.25 F 615.29 F

τi (p∗ ) 1572.75 1177.19 1152.80 1224.10 F 444.96 F

τi (pA ) 1286.14 1119.76 964.25 1392.78 F 4028.99 F

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

659

Fig. 5. Profile comparing the number of function evaluations required for one evaluation of the mads objective for the initial, final, and alternative parameter values.

A phenomenon of a much more optimistic kind is revealed by problem CRAGGLVY which could not be solved in less than 1000 iterations using pC , which took 83.3 seconds, but was solved in 20 iterations and 6.59 seconds using p∗ . 6. Discussion. As mentioned earlier, the determination of useful parameter values for the management of the trust region was examined in the past in [17], where, on a much smaller list of 24 problems and after discretizing Ω into 3960 values for the parameters, the final value pA (5.2) was identified. In order to supplement Figures 4 and 5 and to validate the final parameter value p∗ (5.1) produced by mads, we performed an additional series of tests. We restarted mads with the initial value pA to determine whether or not it would identify pA as a local minimum, or escape and move to a different region. To make the task more difficult for mads, it was initialized with a small initial mesh size Δ0 = 4−4 = 0.00390625. The initial search was disabled to force the algorithm to explore only nearby regions. As before, the surrogate was used and all other algorithmic parameters were left unaltered. To limit the effort, we imposed a maximum of 100 truth function evaluations. After almost 16 days of computation and 100 evaluations of the truth function, mads moved away from p∗ and exited with final parameter values p = (0.0023205, 0.999233, 0.356989, 3.39077) and objective value ψ cpu (p) = 11144. Most noticeably, the values of the first and third parameters changed the most, causing an improvement of 42 minutes. From the output of the algorithm, it is likely that it would have kept moving away towards larger values of η1 . We believe this is due to the diverse nature of the problems in O and in particular, GENHUMPS, cf. Table 3. This indicates that, in the present case, pA is not a local minimizer.

660

CHARLES AUDET AND DOMINIQUE ORBAN

To simply compare the computational burden involved in the tests with mads and those of [17], notice that one function evaluation in [17] takes between 8.76 and 63.9 seconds, which is comparable to the cost of our surrogate. According to Table 1, it only took 289 evaluations of the surrogate to identify pS which is already rather close to p∗ . Even if ψ cpu (pS ) is not as good as ψ cpu (p∗ ), it is much lower than ψ cpu (pA ). The overall cost of the tests of section 5.1 on the surrogate function is about 4 hours—only marginally more than an expected evaluation of ψ cpu . We could therefore estimate that the total cost of identifying p∗ is the 142 evaluations reported in Table 2 plus the evaluation of section 5.1, or 143 evaluations. This is 27.7 times less than the 3960 evaluations required by the exhaustive exploration. With an expected 3 hours and 45 minutes per function evluation, discretizing as in [17] and evaluating ψ cpu at the 3960 different parameter values might have taken us as much as 3960 × 3.75 hours ≈ 619 days. Pessimistically speaking, this almost represents 2 years of computation. While this might be perceived as the price to pay for global information, there still is no guarantee that a global minimizer has been found. We prefer to think in terms of regions instead of precise parameter values. 7. Conclusion. We presented a framework for the optimization of algorithmic parameters, which is general enough to be applied to many branches of engineering and computational science. Using the algorithm presented in [2], this framework may also be extended to the case where some parameters are categorical. The framework is illustrated on an example which at the same time addresses the long-standing question of determining locally optimal trust-region parameters in unconstrained minimization. The mads algorithm for nonsmooth optimization of expensive functions [4] is at the core of the framework. The very notion of optimality for such problems is not well defined. Hence, our aim in designing this framework was to suggest values for the parameters which seem to perform better, in a sense specified by the user, on a set of problems which are context-dependent and can also be specified by the user. In real applications, we believe this black-box approach is beneficial since it allows users to take full advantage of their knowledge of the context to design appropriate test sets and performance measures. As our numerical experience indicates, the choice of objective to be optimized will likely influence the results. We reserve the exploration of more elaborate objective functions, making provision for outliers, and the study of scaling strategies for mads for future work. We also wish to explore modifications of the algorithm to accept more general constraints. Appendix. Tables. Tables 4 and 5 report numerical results on the objective and surrogate lists. In these tables, n is the number of variables, #f eval is the number of function evaluations, τi (·) is the CPU time in seconds, and ϕi (·) is the number of trust-region function evaluations. These statistics are reported at the three sets of parameters pC , p∗ , and pA . Acknowledgments. We wish to thank Gilles Couture for developing nomad, the c++ implementation of mads. We also wish to acknowledge the constructive comments of two anonymous referees which helped improve an earlier version of this paper.

661

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

Table 4 Results on the objective list O. Timings τi (·) are in seconds. The measure ϕi (·) is the number of function evaluations in each case. Failures occur when ϕi (·) = 1001. Name ARWHEAD BDQRTIC BROYDN7D BRYBND CHAINWOO COSINE CRAGGLVY DIXMAANA DIXMAANB DIXMAANC DIXMAAND DIXMAANE DIXMAANF DIXMAANG DIXMAANH DIXMAANI DIXMAANJ DIXMAANL DIXON3DQ DQDRTIC DQRTIC EDENSCH EG2 EIGENALS ENGVAL1 EXTROSNB FLETCBV3 FLETCHBV FLETCHCR FMINSRF2 FMINSURF FREUROTH GENHUMPS INDEF LIARWHD MODBEALE NCB20 NCB20B NONCVXU2 NONCVXUN NONDIA NONDQUAR PENALTY1 POWELLSG POWER QUARTC SCHMVETT SINQUAD SPARSQUR SROSENBR TESTQUAD TOINTGSS TQUARTIC TRIDIA WOODS Total

n 5000 5000 5000 5000 4000 10000 5000 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 10000 5000 5000 2000 1000 2550 5000 1000 5000 5000 1000 5625 5625 5000 5000 5000 5000 20000 5010 5000 5000 5000 5000 5000 1000 5000 10000 5000 5000 5000 10000 5000 5000 5000 5000 5000 4000 ψ

τi (pC ) 0.95 5.24 443.70 5.83 1224.25 1.83 83.30 0.53 0.46 0.55 0.64 7.19 37.29 30.91 12.84 178.43 455.60 461.46 2728.24 1.25 2.33 0.98 0.06 1768.25 2.47 78.29 314.52 314.96 77.14 151.81 134.09 2.37 615.29 260.57 1.74 38.46 639.83 1444.47 647.52 708.77 0.61 415.12 0.43 1.85 54.74 2.36 3.85 2.72 17.10 0.56 37.97 3.60 1.20 28.40 6.09 13641.01

τi (p∗ ) 0.92 5.26 242.95 6.35 1224.10 1.63 6.59 0.56 0.51 0.52 0.63 7.87 21.00 7.20 25.33 193.92 399.68 278.59 1572.75 1.13 2.37 1.06 0.07 1177.19 1.77 96.82 420.17 408.22 85.28 166.47 129.05 2.19 444.96 200.16 1.84 23.39 457.02 1152.80 539.34 555.75 0.51 168.80 0.44 1.81 54.96 2.31 3.79 2.77 17.04 0.51 36.40 2.81 1.17 30.07 6.34 10193.14

τi (pA ) 0.93 5.24 205.03 4.87 1392.78 1.70 6.39 0.48 0.51 0.53 0.56 8.90 24.47 23.65 30.62 185.57 317.44 364.61 1286.14 1.07 2.22 1.02 0.06 1119.76 1.71 69.68 454.31 449.52 76.15 177.89 212.60 67.56 4028.99 273.50 1.50 28.13 336.32 964.25 538.52 551.77 0.58 331.19 0.50 1.78 54.65 2.18 3.82 2.78 16.03 0.55 37.64 1.03 1.15 29.15 7.61 13707

ϕi (pC ) 7 18 331 15 1001 14 1001 12 12 14 15 15 26 24 22 17 35 40 10 13 53 21 4 98 18 1001 1001 1001 1001 140 122 18 1001 1001 20 23 86 23 1001 1001 8 164 62 25 44 53 11 16 28 11 18 23 15 17 66 11837

ϕi (p∗ ) 7 18 264 15 1001 12 20 12 13 13 15 15 24 18 24 17 32 29 8 12 50 21 4 77 15 1001 1001 1001 1001 239 169 16 1001 1001 22 16 71 23 1001 1001 8 92 56 24 44 50 10 16 27 10 17 19 14 16 68 10771

ϕi (pA ) 7 17 261 13 1001 13 19 11 12 13 14 16 20 24 27 16 30 32 8 12 47 19 4 80 14 1001 1001 1001 1001 333 488 1001 1001 1001 18 17 55 21 1001 1001 8 123 62 24 43 47 10 17 26 10 16 12 15 15 74 12173

662

CHARLES AUDET AND DOMINIQUE ORBAN

Table 5 Results on the surrogate list S. Timings τi (·) are in seconds. The measure ϕi (·) is the number of function evaluations in each case. Failures occur when ϕi (·) = 1001. Name 3PK ARGLINA BIGGS6 BOX3 BROWNAL BROWNBS BROWNDEN CHNROSNB CLIFF CUBE DECONVU DENSCHND DIXMAANK DJTL ERRINROS GENROSE GROWTHLS GULF HAIRY HEART6LS HEART8LS HIELOW HIMMELBF HUMPS LOGHAIRY MANCINO MARATOSB MEYER3 OSBORNEA OSBORNEB PALMER1C PALMER1D PALMER2C PALMER3C PALMER4C PALMER6C PALMER7C PALMER8C PENALTY2 PFIT1LS PFIT2LS PFIT3LS PFIT4LS SENSORS SISSER SNAIL TOINTGOR TOINTPSP TOINTQOR VARDIM VAREIGVL VIBRBEAM WATSON YFITU Total

n 30 200 6 3 200 2 4 50 2 2 61 3 15 2 50 500 3 3 2 6 8 3 4 2 2 100 2 3 5 11 8 7 8 8 8 8 8 8 200 3 3 3 3 100 2 2 50 50 50 200 50 8 12 3 σ

τi (pC ) 0.14 0.45 0.02 0.03 0.30 0.02 0.02 0.32 0.02 0.02 2.91 0.03 0.02 0.08 0.25 28.54 0.12 0.18 0.04 0.59 0.25 2.73 0.11 0.44 0.44 11.61 0.34 0.54 0.09 0.19 0.68 0.06 0.59 0.31 0.08 0.13 0.41 0.14 0.39 0.28 0.10 0.11 0.21 11.13 0.02 0.04 0.04 0.02 0.02 0.02 0.09 2.39 0.04 0.06 68.20

τi (p∗ ) 0.10 0.46 0.02 0.01 0.30 0.01 0.01 0.29 0.00 0.01 0.75 0.00 0.01 0.07 0.22 25.04 0.12 0.16 0.03 0.59 0.12 1.81 0.07 0.43 0.48 6.13 0.32 0.56 0.08 0.16 0.70 0.02 0.61 0.50 0.39 0.07 0.47 0.11 0.38 0.16 0.09 0.09 0.16 9.47 0.01 0.03 0.04 0.04 0.01 0.05 0.07 2.39 0.03 0.05 54.30

τi (pA ) 0.12 0.42 0.00 0.02 0.30 0.01 0.01 0.54 0.02 0.02 0.57 0.02 0.02 0.08 0.22 21.78 0.08 0.14 0.02 0.57 0.14 1.95 0.08 0.27 0.44 5.54 0.34 0.48 0.08 0.15 0.73 0.02 0.63 0.17 0.09 0.05 0.06 0.08 0.37 0.23 0.11 0.11 0.23 11.21 0.00 0.07 0.04 0.03 0.02 0.03 0.07 2.33 0.04 0.07 51.22

ϕi (pC ) 74 6 27 9 7 26 13 69 30 48 93 34 12 171 74 465 186 58 82 1001 263 15 205 1001 1001 24 1001 1001 74 25 1001 54 1001 651 85 234 1001 273 18 630 247 280 515 22 15 80 12 21 11 31 15 1001 11 77 14381

ϕi (p∗ ) 58 6 26 9 7 24 12 70 30 43 35 34 12 150 73 413 189 49 47 1001 148 10 113 1001 1001 20 1001 1001 80 22 1001 19 1001 1001 1001 112 1001 233 18 363 184 204 317 20 15 76 12 25 10 30 15 1001 11 76 14431

ϕi (pA ) 61 5 25 9 7 18 12 109 30 42 27 35 12 156 72 377 168 44 41 1001 186 11 113 588 1001 18 1001 1001 75 24 1001 19 1001 301 132 100 193 167 18 477 222 215 412 33 15 154 11 19 9 30 14 1001 11 140 11964

FINDING OPTIMAL ALGORITHMIC PARAMETERS USING MADS

663

REFERENCES [1] P. Alberto, F. Nogueira, H. Rocha, and L. N. Vicente, Pattern-search methods for userprovided points: Application to molecular geometry problems, SIAM J. Optim., 14 (2004), pp. 1216–1236. [2] C. Audet and J. E. Dennis, Jr., Pattern search algorithms for mixed variable programming, SIAM J. Optim., 11 (2000), pp. 573–594. [3] C. Audet and J. E. Dennis, Jr., Analysis of generalized pattern searches, SIAM J. Optim., 13 (2003), pp. 889–903. [4] C. Audet and J. E. Dennis, Jr., Mesh adaptive direct search algorithms for constrained optimization, SIAM J. Optim., 17 (2006), pp. 188–217. [5] A. J. Booker, J. E. Dennis, Jr., P. D. Frank, D. B. Serafini, and V. Torczon, Optimization using surrogate objectives on a helicopter test example, in Optimal Design and Control, Progress in Systems and Control Theory, J. Borggaard, J. Burns, E. Cliff, and S. Schreck, eds., Birkh¨ auser, Cambridge, MA, 1998, pp. 49–58. [6] A. J. Booker, J. E. Dennis, Jr., P. D. Frank, D. B. Serafini, V. Torczon, and M. W. Trosset, A rigorous framework for optimization of expensive functions by surrogates, Structural Optimization, 17 (1999), pp. 1–13. [7] F. H. Clarke, Optimization and Nonsmooth Analysis, Wiley, New York, 1983, reissued as Classics in Applied Mathematics 5, SIAM, Philadelphia, 1990. [8] T. F. Coleman and Y. Li, An interior trust region approach for nonlinear minimization subject to bounds, SIAM J. Optim., 6 (1996), pp. 418–445. [9] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Testing a class of methods for solving minimization problems with simple bounds on the variables, Math. Comp., 50 (1988), pp. 399–430. [10] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Trust-Region Methods, SIAM, Philadelphia, 2000. [11] J. E. Dennis, M. Heinkenschloss, and L. N. Vicente, Trust-region interior-point SQP algorithms for a class of nonlinear programming problems, SIAM J. Control Optim., 36 (1998), pp. 1750–1794. [12] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Classics in Applied Mathematics, SIAM, Philadelphia, 1996. ´, Benchmarking optimization software with performance profiles, Math. [13] E. Dolan and J. More Program., 91 (2002), pp. 201–213. [14] R. Fletcher and S. Leyffer, Nonlinear programming without a penalty function, Math. Program., 91 (2002), pp. 239–270. [15] S. M. Goldfeldt, R. E. Quandt, and H. F. Trotter, Maximization by quadratic hillclimbing, Econometrica, 34 (1966), pp. 541–551. [16] N. I. M. Gould, S. Lucidi, M. Roma, and Ph. L. Toint, Solving the trust-region subproblem using the Lanczos method, SIAM J. Optim., 9 (1999), pp. 504–525. [17] N. I. M. Gould, D. Orban, A. Sartenaer, and Ph. L. Toint, Sensitivity of trust-region algorithms, 4OR, 3 (2005), pp. 227–241. [18] N. I. M. Gould, D. Orban, and Ph. L. Toint, GALAHAD—a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization, Trans. ACM Math. Softw., 29 (2003), pp. 353–372. [19] N. I. M. Gould, D. Orban, and Ph. L. Toint, CUTEr and SifDec, a constrained and unconstrained testing environment, revisited, Trans. ACM Math. Softw., 29 (2003), pp. 373–394. [20] N. I. M. Gould, C. Sainvitu, and Ph. L. Toint, A filter-trust-region method for unconstrained optimization, SIAM J. Optim., 16 (2005), pp. 341–357. [21] L. Hei, A self-adaptive trust region algorithm, J. Comput. Math., 21 (2003), pp. 229–236. [22] J. Jahn, Introduction to the Theory of Nonlinear Optimization, Springer, Berlin, 1994. [23] T. G. Kolda, R. M. Lewis, and V. Torczon, Optimization by direct search: New perspectives on some classical and modern methods, SIAM Rev., 45 (2003), pp. 385–482. [24] E. B. Leach, A note on inverse function theorem, Proc. Amer. Math. Soc., 12 (1961), pp. 694–697. [25] K. Levenberg, A method for the solution of certain problems in least squares, Quart. J. Appl. Math., 2 (1944), pp. 164–168. [26] R. M. Lewis and V. Torczon, Pattern search algorithms for bound constrained minimization, SIAM J. Optim., 9 (1999), pp. 1082–1099. [27] M. D. McKay, W. J. Conover, and R. J. Beckman, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 21 (1979), pp. 239–245.

664

CHARLES AUDET AND DOMINIQUE ORBAN

[28] J. C. Meza and M. L. Martinez, Direct search methods for the molecular conformation problem, Journal of Computational Chemistry, 15 (1994), pp. 627–632. [29] M. S. Ouali, H. Aoudjit, and C. Audet, Optimisation des strat´ egies de maintenance, Journal Europ´ een des Syst`emes Automatis´es, 37 (2003), pp. 587–605. [30] M. J. D. Powell, A new algorithm for unconstrained optimization, in Nonlinear Programming, J. B. Rosen, O. L. Mangasarian, and K. Ritter, eds., Academic Press, London, 1970, pp. 31–65. [31] A. Sartenaer, Armijo-type condition for the determination of a generalized Cauchy point in trust region algorithms using exact or inexact projections on convex constraints, Belg. J. Oper. Res. Statist. Comput. Sci., 33 (1993), pp. 61–75. [32] A. Sartenaer, Automatic determination of an initial trust region in nonlinear programming, SIAM J. Sci. Comput., 18 (1997), pp. 1788–1803. [33] Ch. Sebudandi and Ph. L. Toint, Nonlinear optimization for seismic travel time tomography, Geophysical Journal International, 115 (1993), pp. 929–940. [34] J. S. Shahabuddin, Structured Trust-Region Algorithms for the Minimization of Nonlinear Functions, Ph.D. thesis, Department of Computer Science, Cornell University, Ithaca, NY, 1996. [35] T. Steihaug, The conjugate gradient method and trust regions in large scale optimization, SIAM J. Numer. Anal., 20 (1983), pp. 626–637. [36] M. Stein, Large sample properties of simulations using Latin hypercube sampling, Technometrics, 29 (1987), pp. 143–151. [37] Ph. L. Toint, Towards an efficient sparsity exploiting Newton method for minimization, in Sparse Matrices and Their Uses, I. S. Duff, ed., Academic Press, London, 1981, pp. 57–88. [38] J. M. B. Walmag and E. J. M. Delhez, A note on trust-region radius update, SIAM J. Optim., 16 (2005), pp. 548–562. [39] Y. Yuan, An example of non-convergence of trust region algorithms, in Advances in Nonlinear Programming, Y. Yuan, ed., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1998, pp. 205–218.