An efficient gradient method with approximate optimal stepsize for large-scale unconstrained optimization

Zexian Liu · Hongwei Liu

Numerical Algorithms, ISSN 1017-1398, DOI 10.1007/s11075-017-0365-2


Received: 22 January 2017 / Accepted: 12 June 2017 © Springer Science+Business Media, LLC 2017

Abstract In this paper, we introduce a new concept of approximate optimal stepsize for gradient methods, use it to interpret the Barzilai-Borwein (BB) method, and present an efficient gradient method with approximate optimal stepsize for large-scale unconstrained optimization. If the objective function $f$ is not close to a quadratic on the line segment between the current iterate $x_k$ and the previous iterate $x_{k-1}$, we construct a conic model to generate the approximate optimal stepsize, provided the conic model is suitable; otherwise, we construct a new quadratic model or two other new approximation models to generate it. We analyze the convergence of the proposed method under some suitable conditions. Numerical results show that the proposed method is very promising.

Keywords Approximate optimal stepsize · Barzilai-Borwein (BB) method · Quadratic model · Conic model · BFGS update formula

Mathematics Subject Classification (2010) 90C06 · 65K

 Zexian Liu

[email protected]
Hongwei Liu [email protected]

1 School of Mathematics and Statistics, Xidian University, Xi'an 710126, People's Republic of China
2 School of Mathematics and Computer Science, Hezhou University, Hezhou 542899, People's Republic of China


1 Introduction

Consider the following unconstrained optimization problem
$$\min_{x\in R^n} f(x), \qquad (1.1)$$
where $f: R^n \to R$ is continuously differentiable. The gradient method takes the form
$$x_{k+1} = x_k - \alpha_k g_k, \qquad (1.2)$$
where $g_k$ is the gradient of $f$ at $x_k$ and $\alpha_k$ is the stepsize, which depends on the method under consideration. Throughout this paper, $f_k = f(x_k)$ and $\|\cdot\|$ denotes the Euclidean norm. It is generally accepted that the steepest descent (SD) method [1], whose stepsize is given by
$$\alpha_k^{SD} = \arg\min_{\alpha>0} f(x_k - \alpha g_k),$$

is badly affected by ill conditioning and thus converges very slowly, even though the negative gradient direction has some nice properties. However, since the well-known BB method [2] was proposed by Barzilai and Borwein in 1988, interest in gradient methods has been renewed [3] and many efficient stepsizes for gradient methods have been developed. In the BB method [2], the stepsize is given by
$$\alpha_k^{BB1} = \frac{\|s_{k-1}\|^2}{s_{k-1}^T y_{k-1}} \quad\text{or}\quad \alpha_k^{BB2} = \frac{s_{k-1}^T y_{k-1}}{\|y_{k-1}\|^2}, \qquad (1.3)$$
where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$. Clearly, the BB method is in essence a gradient method, but its stepsize differs from $\alpha_k^{SD}$. Due to its simplicity and numerical efficiency, the BB method has seen great development over the years. It was proved to be globally and linearly convergent [4, 5] for strictly convex quadratic functions of any dimension. In 1997, Raydan [6] presented a globalized BB method for general nonlinear unconstrained optimization by incorporating the nonmonotone (GLL) line search [7], and the numerical results of [6] suggested that the BB method was superior to some classical conjugate gradient methods. Since then, a number of modified BB stepsizes have been developed for gradient methods. Dai et al. [8] presented the cyclic Barzilai-Borwein method for unconstrained optimization. Using an interpolation scheme, Dai et al. [9] presented two modified BB stepsizes for gradient methods. Based on some modified secant equations, Xiao et al. [10] designed four modified BB stepsizes. Using a fourth-order model and some modified secant equations, Biglari and Solimanpur [11] presented several modified gradient methods with modified BB stepsizes, and the numerical results of [11] indicated that these BB-like methods were efficient. Miladinović et al. [12] proposed a new stepsize for gradient methods based on the quasi-Newton property and on approximating the inverse Hessian by an appropriate scalar matrix. However, how to design more robust and more efficient gradient methods for unconstrained optimization still remains to be studied [13]. For large-scale problems,
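To make the two BB stepsizes in (1.3) concrete, the following minimal Python sketch computes them from the most recent pair of iterates and gradients; the quadratic test function and all variable names are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def bb_stepsizes(s, y):
    """Return the two Barzilai-Borwein stepsizes (1.3) given
    s = x_k - x_{k-1} and y = g_k - g_{k-1}."""
    sy = s.dot(y)
    bb1 = s.dot(s) / sy          # alpha_k^{BB1} = ||s||^2 / (s^T y)
    bb2 = sy / y.dot(y)          # alpha_k^{BB2} = (s^T y) / ||y||^2
    return bb1, bb2

# illustrative use on a strictly convex quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x_prev, x = np.ones(3), np.full(3, 0.9)
s, y = x - x_prev, grad(x) - grad(x_prev)
print(bb_stepsizes(s, y))
```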


more efficient stepsizes for gradient methods remain to be developed by exploiting the function values, gradients, and stepsizes of previous iterations. We introduce a new type of stepsize, the approximate optimal stepsize, for gradient methods, and use it to interpret the BB method.

Definition 1.1 Let $\phi(\alpha)$ be an approximation model of $f(x_k - \alpha g_k)$. A positive constant $\alpha^*$ is called the approximate optimal stepsize associated with $\phi(\alpha)$ for the gradient method if $\alpha^*$ satisfies
$$\alpha^* = \arg\min_{\alpha>0} \phi(\alpha).$$

The approximate optimal stepsize is different from $\alpha_k^{SD}$, which is expensive to compute. The approximate optimal stepsize is generally easy to calculate and can be applied to unconstrained optimization. In the BB method, for a strictly convex quadratic the stepsize $\alpha_k^{BB1}$ is determined by solving the problem $\min_{\alpha} \left\| \frac{1}{\alpha} s_{k-1} - y_{k-1} \right\|^2$. Suppose that $s_{k-1}^T y_{k-1} > 0$; we take
$$\phi(\alpha) = f(x_k) - \alpha \|g_k\|^2 + \frac{1}{2}\alpha^2 g_k^T \left( \frac{s_{k-1}^T y_{k-1}}{\|s_{k-1}\|^2} I \right) g_k, \qquad (1.4)$$
where $\frac{s_{k-1}^T y_{k-1}}{\|s_{k-1}\|^2} I$ is regarded as an approximation to the Hessian matrix, as the approximation model of $f(x_k - \alpha g_k)$. By imposing $\frac{d\phi}{d\alpha} = 0$, we obtain exactly the BB stepsize $\alpha_k^{BB1}$, namely,
$$\alpha_k^{BB1} = \arg\min_{\alpha>0} \phi(\alpha).$$
Consequently, the BB stepsize $\alpha_k^{BB1}$ is the approximate optimal stepsize associated with the approximation model $\phi(\alpha)$ in (1.4). Let $B_k$ be an approximation to the Hessian matrix. Since in the BB method the scalar approximation matrix $\frac{s_{k-1}^T y_{k-1}}{\|s_{k-1}\|^2} I$ is determined to satisfy the secant equation
$$B_k s_{k-1} = y_{k-1} \qquad (1.5)$$
as closely as possible, the approximation model $\phi(\alpha)$ in (1.4) is reasonable and thus $\alpha_k^{BB1}$ is very effective. This partly explains why the BB method is able to exhibit such surprising behavior.

Due to the effectiveness of $\alpha_k^{BB1}$ and the fact that $\alpha_k^{BB1} = \arg\min_{\alpha>0} \phi(\alpha)$, a natural question to ask is: can one construct more suitable approximation models to generate more efficient approximate optimal stepsizes? This is the purpose of our work. In this paper, we present an efficient gradient method with approximate optimal stepsize for large-scale unconstrained optimization. If the objective function $f(x)$ is not close to a quadratic on the line segment between $x_{k-1}$ and $x_k$, we develop a conic model to generate the approximate optimal stepsize, provided the conic model is suitable. Otherwise, we distinguish two cases: (i) if $s_{k-1}^T y_{k-1} > 0$, we construct a new quadratic model to derive the approximate optimal stepsize; (ii) if $s_{k-1}^T y_{k-1} \le 0$, we construct two other new approximation models to derive the approximate optimal stepsize. We analyze the convergence and the convergence rate of the proposed method, and present numerical results which show that the proposed method is not only superior to the BB method and the SBB4 method [11] but also competitive with CG_DESCENT (5.3) [14] and CGOPT [15].

The remainder of this paper is organized as follows. In Section 2, we derive efficient approximate optimal stepsizes for the gradient method from different approximation models. In Section 3, we present an efficient gradient method with approximate optimal stepsize and analyze its global convergence and convergence rate under suitable conditions. In Section 4, we compare the proposed method with the BB method, the SBB4 method, CG_DESCENT (5.3), and CGOPT. Conclusions and discussions are given in the last section.
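As a quick illustration of Definition 1.1, the sketch below builds the scalar-matrix model (1.4) and checks numerically that its minimizer over $\alpha > 0$ coincides with $\alpha_k^{BB1}$; the data and function names are made up for illustration.

```python
import numpy as np

def approx_optimal_step_bb1(g, s, y, f_k=0.0):
    """Approximate optimal stepsize for the model (1.4):
    phi(a) = f_k - a*||g||^2 + 0.5*a^2 * (s^T y / ||s||^2) * ||g||^2.
    Its minimizer over a > 0 is exactly alpha^{BB1} = ||s||^2 / (s^T y)."""
    curvature = s.dot(y) / s.dot(s)          # scalar Hessian approximation
    phi = lambda a: f_k - a * g.dot(g) + 0.5 * a**2 * curvature * g.dot(g)
    alpha_bb1 = 1.0 / curvature
    # numerical check of the argmin on a grid
    grid = np.linspace(1e-6, 5 * alpha_bb1, 10001)
    assert abs(grid[np.argmin(phi(grid))] - alpha_bb1) < 1e-2
    return alpha_bb1

g = np.array([1.0, -2.0, 0.5])
s = np.array([0.1, 0.05, -0.02])
y = np.array([0.3, 0.2, -0.01])
print(approx_optimal_step_bb1(g, s, y))
```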

2 Derivation of the approximate optimal stepsize

In this section, we derive the approximate optimal stepsize for the gradient method in three cases. According to the definition of the approximate optimal stepsize in Section 1, different approximation models $\phi(\alpha)$ lead to different approximate optimal stepsizes, and the effectiveness of the resulting stepsize relies on the approximation model $\phi(\alpha)$. To design an efficient approximate optimal stepsize, the natural question is how to construct a suitable approximation model. We determine the approximation model based on the following observations. Define
$$\mu_k = \left| \frac{2\left(f_{k-1} - f_k + g_k^T s_{k-1}\right)}{s_{k-1}^T y_{k-1}} - 1 \right|.$$
According to [9], $\mu_k$ measures how close $f(x)$ is to a quadratic on the line segment between $x_{k-1}$ and $x_k$. If the condition
$$\mu_k \le c_1 \quad\text{or}\quad \max\{\mu_k, \mu_{k-1}\} \le c_2, \qquad (2.1)$$
where $c_1$ and $c_2$ are small positive constants with $c_1 < c_2$, holds, we believe that $f(x)$ is very close to a quadratic on the line segment between $x_{k-1}$ and $x_k$. General iterative methods, which are often based on a quadratic model, have been quite successful in solving practical optimization problems [16], since the quadratic model approximates the objective function $f(x)$ well in a small neighborhood of $x_k$ in many cases. Consequently, if $f(x)$ is close to a quadratic on the line segment between $x_{k-1}$ and $x_k$, the quadratic approximation model is preferable. However, far from the minimizer the quadratic model might not work well if the objective function $f(x)$ is highly nonlinear [17, 18]. To address this drawback, conic models [17, 19, 20] have been used to approximate the objective function. Conic functions, which interpolate both function values and gradients at the latest two iterates, can fit exponential functions, penalty functions, or other functions that, like conics, increase rapidly near some $(n-1)$-dimensional hyperplane in $R^n$ [19].
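A small sketch of the quadratic-closeness test (2.1); the default thresholds are taken from the parameter settings reported in Section 4, and the function names are our own.

```python
import numpy as np

def mu(f_prev, f_curr, g_curr, s, y):
    """Quantity from [9] measuring how close f is to a quadratic
    on the segment between x_{k-1} and x_k (numpy arrays expected)."""
    return abs(2.0 * (f_prev - f_curr + g_curr.dot(s)) / s.dot(y) - 1.0)

def close_to_quadratic(mu_k, mu_prev, c1=1e-8, c2=0.07):
    """Condition (2.1): f is treated as nearly quadratic on the segment."""
    return mu_k <= c1 or max(mu_k, mu_prev) <= c2
```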

All of this indicates that when $f(x)$ is not close to a quadratic on the line segment between $x_{k-1}$ and $x_k$, a conic model may serve better than a quadratic model [20].

Case I When $f(x)$ is not close to a quadratic on the line segment between $x_{k-1}$ and $x_k$, we consider the following conic model [17]:
$$\phi_1(\alpha) = f(x_k) - \frac{\alpha\, g_k^T g_k}{1 - \alpha b_k^T g_k} + \frac{\alpha^2\, g_k^T B_k g_k}{2\left(1 - \alpha b_k^T g_k\right)^2},$$
where
$$b_k = -\frac{1-\gamma_k}{\gamma_k\, g_{k-1}^T s_{k-1}}\, g_{k-1}, \qquad \gamma_k = \frac{-g_{k-1}^T s_{k-1}}{\rho_k + f_{k-1} - f_k}, \qquad \rho_k = \sqrt{\Delta_k}, \qquad \Delta_k = (f_{k-1}-f_k)^2 - \left(g_k^T s_{k-1}\right)\left(g_{k-1}^T s_{k-1}\right),$$
and $B_k$ is generated by imposing the generalized BFGS update formula [17] on a positive scalar matrix $D_k$:
$$B_k = D_k - \frac{D_k v_{k-1} v_{k-1}^T D_k}{v_{k-1}^T D_k v_{k-1}} + \frac{r_{k-1} r_{k-1}^T}{v_{k-1}^T r_{k-1}},$$
where $r_{k-1} = \bar y_{k-1}/\gamma_k$, $v_{k-1} = \gamma_k s_{k-1}$ and $\bar y_{k-1} = \gamma_k g_k - \frac{1}{\gamma_k} g_{k-1}$. To improve the numerical performance, we restrict the coefficient of $b_k$ as $-5000 \le \frac{1-\gamma_k}{\gamma_k\, g_{k-1}^T s_{k-1}} \le 5000$. It is easy to verify that if $v_{k-1}^T r_{k-1} > 0$, then $B_k$ is symmetric positive definite. Here, we take the scalar matrix $D_k$ as $D_k = \xi_1 \frac{v_{k-1}^T v_{k-1}}{v_{k-1}^T r_{k-1}} I$, where $\xi_1 \ge 1$.

It is clear that $\alpha = \frac{1}{b_k^T g_k}$ is the singular point of $\phi_1(\alpha)$,
$$\phi_1'(\alpha) = \frac{\alpha\left(g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k\right) - g_k^T g_k}{\left(1-\alpha b_k^T g_k\right)^3},$$
and $\phi_1(\alpha)$ is continuously differentiable on $R \setminus \{1/b_k^T g_k\}$. If $\Delta_k > 0$, $v_{k-1}^T r_{k-1} > 0$ and $g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k \neq 0$, then by imposing $\frac{d\phi_1}{d\alpha} = 0$ we obtain the unique stationary point of $\phi_1(\alpha)$:
$$\alpha_k^{S} = \frac{g_k^T g_k}{g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k}. \qquad (2.2)$$

We analyze the properties of the stationary point $\alpha_k^S$ in the following cases.

(1) The singular point $\frac{1}{b_k^T g_k} < 0$. If $g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k < 0$, we know $\alpha_k^S < 0$ and
$$\lim_{\alpha\to+\infty} \phi_1(\alpha) = f_k + \frac{g_k^T g_k}{b_k^T g_k} + \frac{g_k^T B_k g_k}{2\left(b_k^T g_k\right)^2}.$$
Therefore, there exists no $\alpha^* > 0$ such that $\alpha^* = \arg\min_{\alpha>0} \phi_1(\alpha)$. Consequently, if $g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k \le 0$, we switch to Case II. Here, we only consider the case $g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k > 0$, in which $\alpha_k^S > 0$. If $\alpha > \alpha_k^S$, we obtain $\alpha\left(g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k\right) - g_k^T g_k > 0$, which together with $1 - \alpha b_k^T g_k > 0$ implies that $\phi_1'(\alpha) > 0$ for $\alpha > \alpha_k^S$. By $\phi_1'(0) = -\|g_k\|^2 < 0$, the continuous differentiability of $\phi_1(\alpha)$ on $R\setminus\{1/b_k^T g_k\}$, the uniqueness of the stationary point and $\phi_1'(\alpha_k^S) = 0$, we know $\phi_1'(\alpha) < 0$ for $\alpha \in (0, \alpha_k^S)$. Therefore, the stationary point satisfies
$$\alpha_k^S = \arg\min_{\alpha>0} \phi_1(\alpha),$$
which means that $\alpha_k^S$ is the approximate optimal stepsize associated with $\phi_1(\alpha)$.

(2) The singular point $\frac{1}{b_k^T g_k} > 0$. It is obvious that the stationary point satisfies $0 < \alpha_k^S < \frac{1}{b_k^T g_k}$. If $\alpha_k^S < \alpha < \frac{1}{b_k^T g_k}$, we obtain $1 - \alpha b_k^T g_k > 0$ and $\alpha\left(g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k\right) - g_k^T g_k > 0$, which implies $\phi_1'(\alpha) > 0$ for $\alpha \in \left(\alpha_k^S, \frac{1}{b_k^T g_k}\right)$. By $\phi_1'(0) = -\|g_k\|^2 < 0$, $\phi_1'(\alpha_k^S) = 0$, the continuous differentiability of $\phi_1(\alpha)$ on $R\setminus\{1/b_k^T g_k\}$ and the uniqueness of the stationary point, we know $\phi_1'(\alpha) < 0$ for $\alpha \in (0, \alpha_k^S)$. Therefore, $\alpha_k^S$ is a local minimizer of $\phi_1(\alpha)$ and
$$\phi_1\left(\alpha_k^S\right) = f_k - \frac{\left(g_k^T g_k\right)^2}{2\, g_k^T B_k g_k}.$$
If $\alpha > \frac{1}{b_k^T g_k}$, we have $1 - \alpha b_k^T g_k < 0$,
$$\lim_{\alpha\to \left(1/b_k^T g_k\right)^+} \phi_1(\alpha) = +\infty \quad\text{and}\quad \lim_{\alpha\to+\infty} \phi_1(\alpha) = f_k + \frac{g_k^T g_k}{b_k^T g_k} + \frac{g_k^T B_k g_k}{2\left(b_k^T g_k\right)^2},$$
which together with $\phi_1'(\alpha) < 0$ for $\alpha > \frac{1}{b_k^T g_k}$ implies that
$$\phi_1\left(\alpha_k^S\right) - \lim_{\alpha\to+\infty} \phi_1(\alpha) = -\frac{\left(g_k^T g_k\right)^2}{2\, g_k^T B_k g_k} - \frac{g_k^T g_k}{b_k^T g_k} - \frac{g_k^T B_k g_k}{2\left(b_k^T g_k\right)^2} < 0.$$
Therefore, the stationary point satisfies
$$\alpha_k^S = \arg\min_{\alpha>0} \phi_1(\alpha),$$
which implies that $\alpha_k^S$ is the approximate optimal stepsize associated with $\phi_1(\alpha)$.

It is observed in numerical experiments that bounding $\alpha_k^S$ by the interval $\left[\alpha_k^{BB2}, \alpha_k^{BB1}\right]$ is very beneficial when $s_{k-1}^T y_{k-1} > 0$. Therefore, if the condition (2.1) does not hold and the conditions
$$\Delta_k > 0, \qquad v_{k-1}^T r_{k-1} > 0 \qquad\text{and}\qquad g_k^T B_k g_k + g_k^T g_k\, b_k^T g_k > 0 \qquad (2.3)$$
hold, the approximate optimal stepsize is taken as
$$\alpha_k^{GM\_AOS(1)} = \begin{cases} \max\left\{\min\left\{\alpha_k^S, \alpha_k^{BB1}\right\}, \alpha_k^{BB2}\right\}, & \text{if } s_{k-1}^T y_{k-1} > 0, \\ \alpha_k^S, & \text{if } s_{k-1}^T y_{k-1} \le 0. \end{cases} \qquad (2.4)$$
Otherwise, we switch to Case II.
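The following Python sketch assembles the Case I quantities and returns the safeguarded stepsize (2.4). It is a simplified illustration under the conditions (2.3), with the BFGS-updated $B_k$ applied to $g_k$ through matrix-free products; the function and variable names, and the default $\xi_1 = 2.15$ taken from Section 4, are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def conic_aos_step(f_prev, f_curr, g_prev, g_curr, s, xi1=2.15):
    """Case I: approximate optimal stepsize (2.4) from the conic model phi_1.
    Returns None if the conditions (2.3) fail, signalling a switch to Case II."""
    y = g_curr - g_prev
    delta = (f_prev - f_curr) ** 2 - g_curr.dot(s) * g_prev.dot(s)
    if delta <= 0.0:
        return None
    rho = np.sqrt(delta)
    gamma = -g_prev.dot(s) / (rho + f_prev - f_curr)
    coef = np.clip((1.0 - gamma) / (gamma * g_prev.dot(s)), -5000.0, 5000.0)
    b = -coef * g_prev
    v = gamma * s
    r = (gamma * g_curr - g_prev / gamma) / gamma      # r_{k-1} = ybar_{k-1} / gamma_k
    if v.dot(r) <= 0.0:
        return None
    tau = xi1 * v.dot(v) / v.dot(r)                    # D_k = tau * I
    Bg = tau * g_curr - tau * v * (v.dot(g_curr) / v.dot(v)) + r * (r.dot(g_curr) / v.dot(r))
    denom = g_curr.dot(Bg) + g_curr.dot(g_curr) * b.dot(g_curr)
    if denom <= 0.0:
        return None
    alpha_s = g_curr.dot(g_curr) / denom               # stationary point (2.2)
    sy = s.dot(y)
    if sy > 0.0:
        bb1, bb2 = s.dot(s) / sy, sy / y.dot(y)
        return max(min(alpha_s, bb1), bb2)             # safeguarded form (2.4)
    return alpha_s
```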

Case II It is generally accepted that a quadratic model serves well if $f(x)$ is close to a quadratic on the segment between $x_{k-1}$ and $x_k$, so we do not wish to abandon the quadratic model, given the large amount of practical experience and theoretical work indicating its suitability. If the condition (2.1) holds and $s_{k-1}^T y_{k-1} > 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} > 0$, we consider the following quadratic approximation model:
$$\phi_2(\alpha) = f(x_k) - \alpha \|g_k\|^2 + \frac{1}{2}\alpha^2 g_k^T B_k g_k,$$
where $B_k$ is a symmetric positive definite matrix which can be regarded as an approximation to the Hessian matrix. Taking into account the storage and computational cost, $B_k$ should be generated by imposing the BFGS update formula on a scalar matrix. Taking the scalar matrix as $D_k = \xi_2 \frac{\|y_{k-1}\|^2}{s_{k-1}^T y_{k-1}} I$, where $\xi_2 \ge 1$, and imposing the modified BFGS update formula [21] on $D_k$, we obtain
$$B_k = D_k - \frac{D_k s_{k-1} s_{k-1}^T D_k}{s_{k-1}^T D_k s_{k-1}} + \frac{\bar y_{k-1} \bar y_{k-1}^T}{s_{k-1}^T \bar y_{k-1}},$$
where $\bar y_{k-1} = y_{k-1} + \frac{\bar r_k}{\|s_{k-1}\|^2} s_{k-1}$ and $\bar r_k = 3(g_k + g_{k-1})^T s_{k-1} + 6(f_{k-1} - f_k)$. Since there exists $u_1 \in [0,1]$ such that
$$\bar r_k = 3\left(s_{k-1}^T y_{k-1} - s_{k-1}^T \nabla^2 f(x_{k-1} + u_1 s_{k-1})\, s_{k-1}\right),$$
in order to improve the numerical performance we restrict $\bar r_k$ as
$$\bar r_k = \min\left\{\max\left\{\bar r_k,\, -\bar\eta_1 s_{k-1}^T y_{k-1}\right\},\, \bar\eta_1 s_{k-1}^T y_{k-1}\right\}, \qquad (2.5)$$
where $0 < \bar\eta_1 < 0.1$. It follows from (2.5) that $s_{k-1}^T \bar y_{k-1} = s_{k-1}^T y_{k-1} + \bar r_k \ge (1-\bar\eta_1)\, s_{k-1}^T y_{k-1}$ when $s_{k-1}^T y_{k-1} > 0$, which implies the following lemma.

Lemma 2.1 Suppose that $s_{k-1}^T y_{k-1} > 0$. Then $s_{k-1}^T \bar y_{k-1} > 0$ and $B_k$ is symmetric positive definite.

Imposing $\frac{d\phi_2}{d\alpha} = 0$, we obtain
$$\alpha_k = \frac{g_k^T g_k}{g_k^T B_k g_k} = \frac{\|g_k\|^2}{\xi_2 \dfrac{\|y_{k-1}\|^2}{s_{k-1}^T y_{k-1}}\left(\|g_k\|^2 - \dfrac{\left(g_k^T s_{k-1}\right)^2}{\|s_{k-1}\|^2}\right) + \dfrac{\left(g_k^T y_{k-1} + \dfrac{\bar r_k\, g_k^T s_{k-1}}{\|s_{k-1}\|^2}\right)^2}{s_{k-1}^T \bar y_{k-1}}}. \qquad (2.6)$$
By $s_{k-1}^T y_{k-1} > 0$ and Lemma 2.1, the stepsize $\alpha_k$ in (2.6) satisfies
$$\alpha_k = \arg\min_{\alpha>0} \phi_2(\alpha),$$
which implies that $\alpha_k$ is the approximate optimal stepsize associated with $\phi_2(\alpha)$. It is also observed in numerical experiments that bounding $\alpha_k$ in (2.6) by $\left[\alpha_k^{BB2}, \alpha_k^{BB1}\right]$ is very beneficial. Therefore, if the condition (2.1) holds and $s_{k-1}^T y_{k-1} > 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} > 0$, the approximate optimal stepsize is taken as the truncated form of $\alpha_k$ in (2.6):
$$\alpha_k^{GM\_AOS(2)} = \max\left\{\min\left\{\alpha_k, \alpha_k^{BB1}\right\}, \alpha_k^{BB2}\right\}. \qquad (2.7)$$
Otherwise, we switch to Case III.
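A matrix-free sketch of the Case II stepsize (2.6)-(2.7); the defaults $\xi_2 = 1.07$ and $\bar\eta_1 = 5/3 \times 10^{-5}$ are taken from the parameter settings in Section 4, and everything else (names, structure) is our own illustration rather than the authors' code.

```python
import numpy as np

def quadratic_aos_step(f_prev, f_curr, g_prev, g_curr, s, xi2=1.07, eta1=5.0 / 3e5):
    """Case II: truncated approximate optimal stepsize (2.7) from the
    quadratic model phi_2 with the modified-BFGS scalar update."""
    y = g_curr - g_prev
    sy = s.dot(y)
    assert sy > 0.0, "Case II requires s^T y > 0"
    r_bar = 3.0 * (g_curr + g_prev).dot(s) + 6.0 * (f_prev - f_curr)
    r_bar = min(max(r_bar, -eta1 * sy), eta1 * sy)       # restriction (2.5)
    s_ybar = sy + r_bar                                  # s^T ybar > 0 (Lemma 2.1)
    tau = xi2 * y.dot(y) / sy                            # D_k = tau * I
    gBg = (tau * (g_curr.dot(g_curr) - g_curr.dot(s) ** 2 / s.dot(s))
           + (g_curr.dot(y) + r_bar * g_curr.dot(s) / s.dot(s)) ** 2 / s_ybar)
    alpha = g_curr.dot(g_curr) / gBg                     # formula (2.6)
    bb1, bb2 = s.dot(s) / sy, sy / y.dot(y)
    return max(min(alpha, bb1), bb2)                     # truncation (2.7)
```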

Case III In most BB-like methods, the stepsize is simply set to $\alpha_k = \lambda_{\max}$ when $s_{k-1}^T y_{k-1} \le 0$, where $\lambda_{\max}$ is a pre-fixed large positive constant. This choice is too crude and can cause a large computational cost in the search for a stepsize satisfying the line search condition. If the condition (2.1) holds and $s_{k-1}^T y_{k-1} \le 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} \le 0$, we design other approximation models to derive the approximate optimal stepsize.

Suppose for the moment that $f(x)$ is twice continuously differentiable; the second-order Taylor expansion is
$$f(x_k - \alpha g_k) = f(x_k) - \alpha g_k^T g_k + \frac{1}{2}\alpha^2 g_k^T \nabla^2 f(x_k) g_k + o(\alpha^2).$$
For a very small $\tau_k > 0$, we have
$$\nabla^2 f(x_k) g_k \approx \frac{g(x_k - \tau_k g_k) - g(x_k)}{\tau_k},$$
which gives a new approximation model
$$\phi_3(\alpha) = f(x_k) - \alpha g_k^T g_k + \frac{1}{2}\alpha^2 \left| g_k^T \left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k \right|.$$
If $g_k^T\left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k \neq 0$, then by imposing $\frac{d\phi_3}{d\alpha} = 0$ and using the coefficient of $\alpha^2$ in $\phi_3(\alpha)$, we obtain the approximate optimal stepsize associated with $\phi_3(\alpha)$:
$$\alpha_k^{AOS(3)} = \frac{g_k^T g_k}{\left| g_k^T\left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k \right|}. \qquad (2.8)$$
For the case $g_k^T\left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k = 0$, the stepsize $\alpha_k$ is computed from the stepsize at the latest iterate. It is well known that for a quadratic function the stepsize $\alpha_k^{BB1}$ equals the exact stepsize at the latest iterate, that is,
$$\alpha_k^{BB1} = \frac{s_{k-1}^T s_{k-1}}{s_{k-1}^T y_{k-1}} = \alpha_{k-1}^{SD}.$$
Moreover, it has been shown that if $\alpha_k^{BB1}$ or $\alpha_k^{SD}$ is reused in a cyclic fashion, the convergence rate is accelerated [22]. It appears that the stepsize $\alpha_{k-1}$ may provide important information for the current stepsize. As a result, in this case we set the stepsize to
$$\alpha_k = \delta \alpha_{k-1}, \qquad (2.9)$$
where $\delta$ is a positive parameter.

Obtaining the stepsize in (2.8) costs an extra gradient evaluation, which may result in a large computational cost if this evaluation is invoked frequently. To reduce the cost, we turn to $g_{k-1}$. Since
$$s_{k-1}^T y_{k-1} = -\alpha_{k-1} g_{k-1}^T (g_k - g_{k-1}) = \alpha_{k-1}\left(\|g_{k-1}\|^2 - g_{k-1}^T g_k\right) \le 0,$$
we have $\|g_{k-1}\|^2 \le g_{k-1}^T g_k$, which implies $\frac{\|g_{k-1}\|}{\|g_k\|} \le 1$. If $\frac{\|g_{k-1}\|^2}{\|g_k\|^2} \ge \xi_3$, where $\xi_3 > 0$ is close to 1, then $g_k$ and $g_{k-1}$ tend to be collinear and $\|g_k\|$ and $\|g_{k-1}\|$ are approximately equal. In this case we have
$$g_k^T \nabla^2 f(x_k) g_k \approx g_{k-1}^T \nabla^2 f(x_k) g_{k-1} \approx \frac{\left|\left(g(x_k + \alpha_{k-1} g_{k-1}) - g(x_k)\right)^T g_{k-1}\right|}{\alpha_{k-1}} = \frac{\left|s_{k-1}^T y_{k-1}\right|}{\alpha_{k-1}^2},$$
which implies a new approximation model:
$$\phi_4(\alpha) = f(x_k) - \alpha \|g_k\|^2 + \frac{1}{2}\alpha^2 \frac{\left|s_{k-1}^T y_{k-1}\right|}{\alpha_{k-1}^2}.$$
If $s_{k-1}^T y_{k-1} \neq 0$, then by imposing $\frac{d\phi_4}{d\alpha} = 0$ and using the coefficient of $\alpha^2$ in $\phi_4(\alpha)$, we also obtain the approximate optimal stepsize associated with $\phi_4(\alpha)$:
$$\alpha_k^{AOS(4)} = \frac{\|g_k\|^2}{\left|s_{k-1}^T y_{k-1}\right|}\, \alpha_{k-1}^2. \qquad (2.10)$$
For the case $s_{k-1}^T y_{k-1} = 0$, the stepsize is also computed by (2.9).

Therefore, if the condition (2.1) holds and $s_{k-1}^T y_{k-1} \le 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} \le 0$, the stepsize is determined by
$$\alpha_k = \begin{cases} \dfrac{g_k^T g_k}{\left| g_k^T\left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k \right|}, & \text{if } \dfrac{\|g_{k-1}\|^2}{\|g_k\|^2} < \xi_3 \text{ and } g_k^T\left(g(x_k - \tau_k g_k) - g(x_k)\right)/\tau_k \neq 0, \\[2mm] \dfrac{\|g_k\|^2}{\left|s_{k-1}^T y_{k-1}\right|}\, \alpha_{k-1}^2, & \text{if } \dfrac{\|g_{k-1}\|^2}{\|g_k\|^2} \ge \xi_3 \text{ and } s_{k-1}^T y_{k-1} \neq 0, \\[2mm] \delta \alpha_{k-1}, & \text{otherwise,} \end{cases} \qquad (2.11)$$
where $\delta$ is a positive parameter.
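The Case III rule (2.11) in code form; here `grad` is an assumed gradient oracle, and the defaults $\delta = 10$, $\xi_3 = 0.85$ and $\tau_k = \min\{0.1\alpha_{k-1}, 0.01\}$ follow the parameter settings listed in Section 4. The naming is ours.

```python
def case3_step(grad, x, g_curr, g_prev, s, y, alpha_prev,
               delta=10.0, xi3=0.85):
    """Case III stepsize rule (2.11), used when s^T y <= 0."""
    tau = min(0.1 * alpha_prev, 0.01)
    if g_prev.dot(g_prev) / g_curr.dot(g_curr) < xi3:
        # finite-difference curvature along -g_k: one extra gradient evaluation
        curv = g_curr.dot(grad(x - tau * g_curr) - g_curr) / tau
        if curv != 0.0:
            return g_curr.dot(g_curr) / abs(curv)                   # (2.8)
    else:
        sy = s.dot(y)
        if sy != 0.0:
            return g_curr.dot(g_curr) / abs(sy) * alpha_prev ** 2   # (2.10)
    return delta * alpha_prev                                       # (2.9)
```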

3 Gradient method with approximate optimal stepsize

In this section, we present an efficient gradient method in which the approximate optimal stepsize is determined by (2.4), (2.7), or (2.11); the proposed method is denoted by GM_AOS. Although the GLL line search [7] was first incorporated into the BB method [6], it is observed in numerical experiments that for BB-like methods the nonmonotone line search proposed by Zhang and Hager [23] (the Zhang-Hager line search) is preferable. The strategy (3.2) for a nonmonotone line search [24] is often used to accelerate the convergence of numerical methods. Therefore, we adopt the Zhang-Hager line search with the strategy (3.2) in GM_AOS. We now describe GM_AOS in detail.

Gradient Method with Approximate Optimal Stepsize (GM_AOS)

Step 0 Initialization. Given a starting point $x_0 \in R^n$ and constants $\varepsilon > 0$, $\lambda_{\min}$, $\lambda_{\max}$, $\alpha_0^0$, $\eta_{\min}$, $\eta_{\max}$, $\sigma$, $\delta$, $\xi_1$, $\xi_2$, $\xi_3$, $\bar\eta_1$, $\tau_k$, $c_1$ and $c_2$, set $Q_0 = 1$, $C_0 = f_0$ and $k := 0$.

Step 1 If $\|g_k\|_\infty \le \varepsilon$, stop.

Step 2 If $k = 0$, set $\alpha = \alpha_0^0$ and go to Step 3. If the condition (2.1) does not hold and the conditions (2.3) hold, compute $\alpha_k$ by (2.4); if the condition (2.1) holds and $s_{k-1}^T y_{k-1} > 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} > 0$, compute $\alpha_k$ by (2.7); if the condition (2.1) holds and $s_{k-1}^T y_{k-1} \le 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} \le 0$, compute $\alpha_k$ by (2.11). Set $\alpha_k^0 = \max\{\min\{\alpha_k, \lambda_{\max}\}, \lambda_{\min}\}$ and $\alpha = \alpha_k^0$.

Step 3 Zhang-Hager line search. If
$$f(x_k - \alpha g_k) \le C_k - \sigma \alpha \|g_k\|^2, \qquad (3.1)$$
go to Step 4. Otherwise, update $\alpha$ by [24]
$$\alpha = \begin{cases} \bar\alpha, & \text{if } \bar\alpha > 0.1\alpha_k^0 \text{ and } \bar\alpha \in [0.1\alpha_k^0, 0.9\alpha], \\ 0.5\alpha, & \text{otherwise,} \end{cases} \qquad (3.2)$$
where $\bar\alpha$ is the trial stepsize obtained by quadratic interpolation at $x_k$ and $x_k - \alpha g_k$, and go to Step 3.

Step 4 Choose $\eta_k \in [\eta_{\min}, \eta_{\max}]$ and update $Q_{k+1}$ and $C_{k+1}$ by
$$Q_{k+1} = \eta_k Q_k + 1, \qquad C_{k+1} = \left(\eta_k Q_k C_k + f(x_{k+1})\right)/Q_{k+1}. \qquad (3.3)$$

Step 5 Set $\alpha_k = \alpha$, $x_{k+1} = x_k - \alpha_k g_k$, $k := k + 1$ and go to Step 1.

In what follows, we analyze the convergence and the convergence rate of GM_AOS. Our convergence results use the following assumptions.

A1. $f(x)$ is continuously differentiable on $R^n$.
A2. $f(x)$ is bounded below on $R^n$.
A3. The gradient $g(x)$ is Lipschitz continuous on $R^n$, namely, there exists $L > 0$ such that $\|g(x) - g(y)\| \le L\|x - y\|$ for all $x, y \in R^n$.

Since $d_k = -g_k$, we have $\|d_k\| = \|g_k\|$ and $g_k^T d_k = -\|g_k\|^2$. Therefore, by Theorem 2.2 of [23] we easily obtain the following theorem, which shows that GM_AOS is globally convergent.

Theorem 3.1 Suppose that assumptions A1, A2 and A3 hold, and let $\{x_k\}$ be the sequence generated by GM_AOS. Then
$$\liminf_{k\to\infty} \|g_k\| = 0.$$
Furthermore, if $\eta_{\max} < 1$, then
$$\lim_{k\to\infty} \|g_k\| = 0.$$
Hence, every convergent subsequence of $\{x_k\}$ approaches a stationary point $x^*$.

Similarly, by Theorem 3.1 of [23] we obtain the following theorem, which implies the R-linear convergence of GM_AOS.

Theorem 3.2 Suppose that A1 and A3 hold, $f$ is strongly convex with unique minimizer $x^*$, and $\eta_{\max} < 1$. Then there exists $\zeta \in (0,1)$ such that
$$f(x_k) - f(x^*) \le \zeta^k \left(f(x_0) - f(x^*)\right) \quad\text{for each } k \ge 0.$$
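A compact driver combining Steps 0-5 with the Zhang-Hager nonmonotone test (3.1) and a simplified backtracking in place of the safeguard (3.2). The stepsize choice marked as a placeholder stands in for the Case I-III logic of Section 2 (here a BB1 fallback so the sketch runs on its own), so treat this as an illustration of the framework, not the authors' implementation.

```python
import numpy as np

def gm_aos(f, grad, x0, eps=1e-6, sigma=1e-4, eta=1.0,
           lam_min=1e-30, lam_max=1e30, max_iter=50000):
    """Skeleton of GM_AOS with the Zhang-Hager nonmonotone line search."""
    x, g = x0.copy(), grad(x0)
    Q, C = 1.0, f(x0)
    alpha_prev = 1.0 / max(np.linalg.norm(g, np.inf), 1e-12)    # alpha_0^0
    x_prev = g_prev = None
    for k in range(max_iter):
        if np.linalg.norm(g, np.inf) <= eps:                    # Step 1
            break
        if k == 0:
            alpha = alpha_prev
        else:
            s, y = x - x_prev, g - g_prev
            sy = s.dot(y)
            # placeholder for the Section 2 stepsizes (2.4)/(2.7)/(2.11):
            alpha = s.dot(s) / sy if sy > 0 else 10.0 * alpha_prev
            alpha = max(min(alpha, lam_max), lam_min)           # Step 2 safeguard
        gg = g.dot(g)
        while f(x - alpha * g) > C - sigma * alpha * gg:        # test (3.1)
            alpha *= 0.5                                        # simplified (3.2)
        x_prev, g_prev, alpha_prev = x, g, alpha
        x = x - alpha * g                                       # Step 5
        g = grad(x)
        Q_new = eta * Q + 1.0                                   # update (3.3)
        C = (eta * Q * C + f(x)) / Q_new
        Q = Q_new
    return x

# usage on a simple quadratic
A = np.diag([1.0, 10.0, 100.0])
sol = gm_aos(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.ones(3))
print(np.round(sol, 6))
```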

4 Numerical experiments

In this section we report numerical experiments that assess the performance of GM_AOS. We adopt the same test set of 80 nonlinear unconstrained problems as in [25]; the test functions and their Fortran code can be found on Andrei's website http://camo.ici.ro/neculai/AHYBRIDM. We implemented the six BB-like methods of [11] and found the BB-like method with $\alpha_k^{SBB4}$ (SBB4) to be the most efficient, so the SBB4 method is chosen for comparison with GM_AOS. Besides, the BB method, CGOPT, and CG_DESCENT (5.3) are also included in the comparison.

Fig. 1 Performance profile based on the number of iterations (GM_AOS (cone) vs. GM_AOS (cone) with λmax)

The BB method and the SBB4 method are coded in Fortran 90, and the C code of CG_DESCENT (5.3) can be found on Hager's website http://users.clas.ufl.edu/hager/papers/Software. GM_AOS is written in both Fortran 90 and C. For fairness, the Fortran version of GM_AOS is used in the comparison with the BB method and the SBB4 method, and the C version is used in the comparison with CGOPT and CG_DESCENT (5.3). In the numerical experiments, the dimension of every problem is set to 10,000 and the parameters are chosen as $\alpha_0^0 = 1/\|g_0\|_\infty$, $\varepsilon = 10^{-6}$, $\delta = 10$, $\lambda_{\min} = 10^{-30}$, $\lambda_{\max} = 10^{30}$, $\sigma = 10^{-4}$, $\eta_k = 1$, $\eta_{\max} = \eta_{\min} = 1$, $\bar\eta_1 = 5.0/3 \times 10^{-5}$, $\tau_k = \min\{0.1\alpha_{k-1}, 0.01\}$, $\xi_1 = 2.15$, $\xi_2 = 1.07$, $\xi_3 = 0.85$, $c_1 = 10^{-8}$ and $c_2 = 0.07$.

Fig. 2 Performance profile based on the number of function evaluations (GM_AOS (cone) vs. GM_AOS (cone) with λmax)

Fig. 3 Performance profile based on the number of gradient evaluations (GM_AOS (cone) vs. GM_AOS (cone) with λmax)

GM_AOS, the SBB4 method, and the BB method all adopt the Zhang-Hager line search with the strategy (3.2), and their iterations are stopped when the inequality $\|g_k\|_\infty \le 10^{-6}$ is satisfied, the number of iterations exceeds 50,000, or the number of function evaluations exceeds 80,000. For CG_DESCENT (5.3) and CGOPT, the iteration is stopped when $\|g_k\|_\infty \le 10^{-6}$ is satisfied or the number of iterations exceeds 50,000; these two well-known methods use the default values of their other parameters. The performance profiles introduced by Dolan and Moré [26] are used to display the performance of the methods.

Fig. 4 Performance profile based on CPU time (GM_AOS (cone) vs. GM_AOS (cone) with λmax)

Fig. 5 Performance profile based on the number of iterations (GM_AOS (cone), SBB4, BB)
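For completeness, a performance profile in the sense of Dolan and Moré [26], as plotted in Figs. 1-12, can be computed as in the following sketch; the solver names and the cost matrix are illustrative assumptions only.

```python
import numpy as np

def performance_profile(T, taus):
    """T[i, j]: cost (e.g. iterations or CPU time) of solver j on problem i,
    with np.inf marking a failure. Returns P(tau) for each solver."""
    best = np.min(T, axis=1, keepdims=True)           # best cost per problem
    ratios = T / best                                  # performance ratios r_{i,j}
    return np.array([[np.mean(ratios[:, j] <= tau) for j in range(T.shape[1])]
                     for tau in taus])

# toy example: 4 problems, 2 solvers
T = np.array([[10.0, 12.0], [50.0, 40.0], [np.inf, 30.0], [8.0, 8.0]])
print(performance_profile(T, taus=[1.0, 2.0, 5.0]))
```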

We first examine the effectiveness of the stepsize (2.11). In Figs. 1, 2, 3 and 4, "GM_AOS with λmax" stands for the variant of GM_AOS that differs from GM_AOS only in that, in Step 2, the stepsize $\alpha_k$ is set to $\alpha_k = \lambda_{\max}$ if the condition (2.1) holds and $s_{k-1}^T y_{k-1} \le 0$, or the conditions (2.3) do not hold and $s_{k-1}^T y_{k-1} \le 0$. In the numerical experiments, GM_AOS successfully solves 79 problems, while its variant successfully solves 75 problems. After eliminating those problems for which GM_AOS or its variant is not stopped by the inequality $\|g_k\|_\infty \le 10^{-6}$, 75 problems are left, and only these 75 problems are considered in the following analysis. As shown in Fig. 1, GM_AOS performs slightly better than its variant with respect to the number of iterations, and similar observations can be made in Fig. 3.

Fig. 6 Performance profile based on the number of function evaluations (GM_AOS (cone), SBB4, BB)

Fig. 7 Performance profile based on the number of gradient evaluations (GM_AOS (cone), SBB4, BB)

We can observe from Fig. 2 that GM_AOS requires far fewer function evaluations than its variant, owing to the stepsize (2.11). In Fig. 4, we see that GM_AOS is also much faster than its variant. This indicates that the stepsize (2.11) is very efficient.

In what follows, we divide the numerical experiments into two groups. In the first group, we compare the performance of GM_AOS with that of the SBB4 method and the BB method. The SBB4 method and the BB method each successfully solve 73 problems, 6 fewer than GM_AOS. After eliminating those problems for which at least one method is not stopped by the inequality $\|g_k\|_\infty \le 10^{-6}$, 70 problems are left, and only these 70 problems are considered in the following analysis.

Fig. 8 Performance profile based on CPU time (GM_AOS (cone), SBB4, BB)

Fig. 9 Performance profile based on the number of iterations (GM_AOS (cone), CG_DESCENT (5.3), CGOPT)

As shown in Fig. 5, GM_AOS outperforms the SBB4 method and the BB method: GM_AOS solves about 58% of the problems with the fewest iterations, while the corresponding percentages for the SBB4 method and the BB method are about 38% and 25%, respectively. Similar observations can be made in Fig. 7. We observe from Fig. 6 that GM_AOS has a clear advantage over the SBB4 method and the BB method, since it solves about 69% of the problems with the fewest function evaluations, versus 30% and 24% for the SBB4 method and the BB method, respectively. Figure 8 shows that GM_AOS is also much faster than the SBB4 method and the BB method.

Fig. 10 Performance profile based on the number of function evaluations (GM_AOS (cone), CG_DESCENT (5.3), CGOPT)

Fig. 11 Performance profile based on the number of gradient evaluations (GM_AOS (cone), CG_DESCENT (5.3), CGOPT)

These results indicate that GM_AOS is superior to the SBB4 method and the BB method.

In the second group of numerical experiments, we compare the performance of GM_AOS with that of CG_DESCENT (5.3) and CGOPT. CG_DESCENT (5.3) successfully solves 76 problems, 3 fewer than GM_AOS, while CGOPT successfully solves 79 problems, the same number as GM_AOS. After eliminating those problems for which at least one method is not stopped by the inequality $\|g_k\|_\infty \le 10^{-6}$, 73 problems are left, and only these 73 problems are considered in the following analysis.

Fig. 12 Performance profile based on CPU time (GM_AOS (cone), CG_DESCENT (5.3), CGOPT)

As shown in Fig. 9, GM_AOS is at a disadvantage compared with CG_DESCENT (5.3) and CGOPT when $\tau \le 3$; this is not surprising, since the search directions of CG_DESCENT (5.3) and CGOPT, which incorporate the previous search direction in addition to the negative gradient, are known to be very efficient. However, when $\tau > 3$, GM_AOS still has a slight advantage with respect to the number of iterations. Figure 10 shows that GM_AOS requires far fewer function evaluations than CG_DESCENT (5.3) and CGOPT: GM_AOS solves about 73% of the problems with the fewest function evaluations, while the corresponding percentages for CG_DESCENT (5.3) and CGOPT are about 29% and 5%, respectively. As shown in Fig. 11, GM_AOS also outperforms CG_DESCENT (5.3) and CGOPT with respect to the number of gradient evaluations, and Fig. 12 shows that GM_AOS is faster than both. These results indicate that GM_AOS is competitive with CG_DESCENT (5.3) and CGOPT on this test set.

5 Conclusion and discussion

In this paper, we present an efficient gradient method with approximate optimal stepsize (GM_AOS), in which the approximate optimal stepsize is generated from different approximation models. Numerical results indicate that GM_AOS is not only superior to the SBB4 method and the BB method but also competitive with CG_DESCENT (5.3) and CGOPT on the test set. Although GM_AOS is very efficient, we believe there is still considerable room for constructing more suitable approximation models and thereby designing more efficient approximate optimal stepsizes. Different approximation models lead to different approximate optimal stepsizes, and different approximate optimal stepsizes lead to different gradient methods. Owing to the efficiency of gradient methods with approximate optimal stepsize, we call such a method an approximate optimal gradient method; this class of methods deserves more attention.

Acknowledgements We would like to thank the two anonymous referees for their valuable comments, which helped to improve the quality of this paper. We also thank Professor Y. H. Dai and Dr. Caixia Kou for their C code of CGOPT, and W. W. Hager and H. C. Zhang for their C code of CG_DESCENT (5.3). This research is supported by the National Science Foundation of China (No. 11461021), the Guangxi Science Foundation (Nos. 2014GXNSFAA118028, 2015GXNSFAA139011), the Scientific Research Project of Hezhou University (Nos. 2014YBZK06, 2016HZXYSX03), and the Guangxi Colleges and Universities Key Laboratory of Symbolic Computation and Engineering Data Processing.

References
1. Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 46–89 (1847)
2. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8, 141–148 (1988)
3. De Asmundis, R., di Serafino, D., Riccio, F., et al.: On spectral properties of steepest descent methods. IMA J. Numer. Anal. 33(4), 1416–1435 (2013)
4. Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13, 321–326 (1993)
5. Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
6. Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7, 26–33 (1997)
7. Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton's method. SIAM J. Numer. Anal. 23, 707–716 (1986)
8. Dai, Y.H., Hager, W.W., Schittkowski, K., et al.: The cyclic Barzilai-Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)
9. Dai, Y.H., Yuan, J.Y., Yuan, Y.X.: Modified two-point stepsize gradient methods for unconstrained optimization problems. Comput. Optim. Appl. 22, 103–109 (2002)
10. Xiao, Y.H., Wang, Q.Y., Wang, D., et al.: Notes on the Dai-Yuan-Yuan modified spectral gradient method. J. Comput. Appl. Math. 234(10), 2986–2992 (2010)
11. Biglari, F., Solimanpur, M.: Scaling on the spectral gradient method. J. Optim. Theory Appl. 158(2), 626–635 (2013)
12. Miladinović, M., Stanimirović, P., Miljković, S.: Scalar correction method for solving large scale unconstrained minimization problems. J. Optim. Theory Appl. 151(2), 304–320 (2011)
13. Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive stepsizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
14. Hager, W.W., Zhang, H.C.: A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM J. Optim. 16(1), 170–192 (2005)
15. Dai, Y.H., Kou, C.X.: A nonlinear conjugate gradient algorithm with an optimal property and an improved Wolfe line search. SIAM J. Optim. 23(1), 296–320 (2013)
16. Han, Q.M., Sun, W.Y., Han, J.Y., et al.: An adaptive conic trust-region method for unconstrained optimization. Optim. Methods Softw. 20(6), 665–677 (2005)
17. Sun, W.Y., Xu, D.: A filter-trust-region method based on conic model for unconstrained optimization (in Chinese). Sci. Sin. Math. 55(5), 527–543 (2012)
18. Sun, W.Y.: Optimization methods for non-quadratic model. Asia Pac. J. Oper. Res. 13(1) (1996)
19. Davidon, W.C.: Conic approximations and collinear scalings for optimizers. SIAM J. Numer. Anal. 17(2), 268–281 (1980)
20. Sorensen, D.C.: The Q-superlinear convergence of a collinear scaling algorithm for unconstrained optimization. SIAM J. Numer. Anal. 17(17), 84–114 (1980)
21. Zhang, J.Z., Deng, N.Y., Chen, L.H.: New quasi-Newton equation and related methods for unconstrained optimization. J. Optim. Theory Appl. 102, 147–167 (1999)
22. Friedlander, A., Martínez, J.M., Molina, B., et al.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1), 275–289 (1998)
23. Zhang, H.C., Hager, W.W.: A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optim. 14, 1043–1056 (2004)
24. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
25. Andrei, N.: Accelerated hybrid conjugate gradient algorithm with modified secant condition for unconstrained optimization. Numer. Algorithms 54(1), 23–46 (2010)
26. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)