A Newton’s Method for Perturbed Second-order Cone Programs

Yu Xia



Abstract. We develop optimality conditions for the second-order cone program. Our optimality conditions are well-defined and smooth everywhere. We then reformulate the optimality conditions into several systems of equations. Starting from a solution to the original problem, the sequence generated by Newton’s method converges Q-quadratically to a solution of the perturbed problem under some assumptions. We globalize the algorithm by 1) extending the gradient descent method for differentiable optimization to the minimization of continuous functions that are almost everywhere differentiable; and 2) finding a directional derivative of the equations. Numerical examples confirm that our algorithm is well suited to “warm starting” second-order cone programs; in some cases, the solution of a perturbed instance is hit in two iterations. In the course of developing our algorithm, we also generalize the nonlinear complementarity function approach from two variables to several variables.

Key words. Second-order cone programming, complementarity, warm start, semismooth Newton’s method, nonsmooth optimization.

1 Introduction

The aim of this paper is to develop a “warm starting” algorithm for the second-order cone program. Using “;” to concatenate column vectors, we write a vector in $\mathbb{R}^{n+1}$ indexed from zero as $x = [x_0; \bar{x}]$. A simple second-order cone (SOC) in $\mathbb{R}^{n+1}$ is the following set: $Q_{n+1} \stackrel{\rm def}{=} \{x \in \mathbb{R}^{n+1} : x_0 \ge \|\bar{x}\|_2\}$. We consider a Cartesian product of finitely many simple second-order cones: $Q \stackrel{\rm def}{=} Q_{N_1} \times \cdots \times Q_{N_n}$, where $N_i \in \mathbb{N}$ is the dimension of the $i$th simple second-order cone for $i = 1, \ldots, n$. Other names for the second-order cone include Lorentz cone, ice-cream cone, and quadratic cone. Denote $x \stackrel{\rm def}{=} (x_1^T, \ldots, x_n^T)^T$, where $x_i \in \mathbb{R}^{N_i}$ corresponds to the $i$th second-order cone. Since a second-order cone induces a partial order, we write $x \ge_Q 0$ interchangeably with $x \in Q$.

Let $c \stackrel{\rm def}{=} (c_1; \ldots; c_n)$ and $A \stackrel{\rm def}{=} [A_1 \ \ldots \ A_n]$, with $A_i \in \mathbb{R}^{m \times N_i}$, $c_i \in \mathbb{R}^{N_i}$, and $b \in \mathbb{R}^m$, represent given data. A standard-form second-order cone programming (SOCP) problem is the following:

$$\min \ c^T x \quad \text{s.t.} \quad Ax = b, \ x \ge_Q 0. \tag{1}$$
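As a concrete illustration of this block structure, the following Python sketch (using only NumPy; the block sizes, $A$, $b$, and $c$ are made-up placeholders, not data from this paper) checks membership in the product cone $Q$ and evaluates the objective and equality-constraint residual of (1).

```python
import numpy as np

def in_soc(x, tol=1e-12):
    """Check that x = [x0; x_bar] lies in the simple second-order cone."""
    return x[0] >= np.linalg.norm(x[1:]) - tol

def in_product_cone(x, block_sizes, tol=1e-12):
    """Check membership of the concatenated vector x in Q = Q_{N_1} x ... x Q_{N_n}."""
    offset = 0
    for N in block_sizes:
        if not in_soc(x[offset:offset + N], tol):
            return False
        offset += N
    return True

# Hypothetical data: two blocks of sizes 3 and 2, one equality constraint.
block_sizes = [3, 2]
A = np.array([[1.0, 0.0, 0.0, 1.0, 0.0]])
b = np.array([2.0])
c = np.array([1.0, 0.0, 0.0, 1.0, 0.0])

x = np.array([1.0, 0.5, 0.5, 1.0, 0.3])     # candidate point
print("x in Q:", in_product_cone(x, block_sizes))
print("objective c^T x:", c @ x)
print("equality residual ||Ax - b||:", np.linalg.norm(A @ x - b))
```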

SOCP lies between linear programming (LP) and semidefinite programming (SDP) [4] in terms of approximation ability and computational cost. In fact, LP is a special case of SOCP, and an SOCP model can be embedded into an SDP model. In particular, quadratic programs, convex quadratically constrained quadratic programs, and problems involving fractional quadratic functions can be cast as SOCP problems. A large number of real-world applications arising from control, structural design, statistics, combinatorial optimization, portfolio selection, etc., are modeled as SOCP problems (see [2, 20] for a survey). Recently, quite a few applications of SOCP in machine learning have appeared in the literature; see, for instance, [5] and references therein. These SOCP classification methods are shown to be superior to the popular SVM (Support Vector Machine) in some respects. In fact, since the SVM itself is a quadratic program, it can be formulated as an SOCP problem as well. Due to its versatile applications, SOCP deserves to be studied on its own merit.

∗ Center for Operations Research and Econometrics (CORE), Université catholique de Louvain, 34 Voie du Roman Pays, B-1348 Louvain-la-Neuve, Belgium, [email protected]


For many applications, “warm starting” algorithms may reduce computation cost and time. For example, in [35], the Euclidean facility location (EFL) problem and the Steiner minimal tree (SMT) problem under known topology are approximated by SOCP. When an EFL or an SMT problem needs to be solved under environmental changes, or when a series of similar EFL or SMT problems constitute a bigger project, it is often the case that there is a small change in the cost function of the EFL problem, or a small move of the regular points in the SMT problem. In the machine learning context, the model is modified slightly when some new data points are added to the training set. It is then reasonable to expect that an optimal solution to a slightly perturbed problem is close to the original solution, and to use the old solution to “warm start” the perturbed problem. Although SOCP can be solved by interior point methods (IPM), see [2] for details, it is known that the old solution is unsuitable as a starting point in this context; see [18], for instance, for an explanation. The centering condition of an IPM requires each iterate to stay close to the central path to prevent small stepsizes. Unfortunately, because of the complementarity constraints, in each block of a solution either the primal or the dual variable must be close to the boundary. Consequently, when the starting point is the solution to an earlier problem, the step size of an IPM iteration has to be very small, if not zero, to keep the iterate in the second-order cone. Therefore, current warm-starting IPMs for LP typically employ a perturbed old solution [21] or use a saved, nearly optimal iterate that is well centered from a previous problem [18, 24, 36] to start a new problem. However, the amount of perturbation is heuristic. If the perturbation is too big, the initial point may be far from the optimal solution set; if the perturbation is too small, the iterations may be stuck at an infeasible vertex. Besides, outside of the cutting plane scheme or decomposition methods such as Dantzig-Wolfe or Benders decomposition, which solve a series of slightly different problems and for which intermediate iterates of an old problem are purposely stored when memory is available, it is not always the case that a nearly optimal, well-centered intermediate iterate is available. For instance, a near-optimal solution to an SMT problem with unknown topology, which is at least as difficult as any NP-complete problem [17], may be obtained by some method other than an IPM, such as a heuristic procedure or an approximation algorithm, in the initial stage of a VLSI design. In the final stage of the VLSI design, a perturbed instance of SMT under known topology, arising from adjustment of points, may need to be solved. Even if such an intermediate iterate is available, it is likely to be further from the solution set of the new problem than the old solution is, because of a larger violation of the optimality conditions. See [18] for further drawbacks of these two strategies. A crash-start strategy is proposed in [18] to decompose an LP problem with nested block structures, combining primal block-angular and dual block-angular structures, into smaller subproblems and then to construct an initial point for the master LP from the solutions to the subproblems. Since the initial point is not constructed from the master problem, it cannot be assumed that the starting point is well centered [18]. Extra computation is needed to solve the subproblems arising from the decomposition.
In addition, it is not shown in [18] how to crash start an LP problem without this special decomposable structure. Because of the centering parameter, IPMs are not pure Newton’s methods; hence they do not have a locally Q-quadratic convergence rate. The fastest local convergence rate possible for an IPM is Q-superlinear, attained for instance by the path-following short-step predictor-corrector method for LP under the strict complementarity condition [33]; the situation is different if strict complementarity does not hold. Since Newton’s method for systems of nonlinear equations has no restriction on starting points, and a sequence generated by Newton’s method converges Q-quadratically locally, we reformulate the SOCP model into some systems of nonlinear equations and solve them by the semismooth Newton’s method [23, 29]. In recent years, Newton-type methods have been developed for some nonlinear programming (NLP) problems; see [14, 26, 29] and references therein. The approach works as follows. Under some constraint qualifications, the Karush-Kuhn-Tucker (KKT) system of an NLP problem is reformulated into nonlinear equations via some nonlinear complementarity (NCP) functions (a function $M(a,b)$ is called an NCP-function if $M(a,b) = 0 \Leftrightarrow a \ge 0,\ b \ge 0,\ ab = 0$). Then these equations are solved by some generalized Newton-type method. Under some regularity assumptions, fast local convergence rates of the Newton-type methods are established. Regularity assumptions typically include the linear independence of the gradients of the equality and active inequality constraints at an optimal solution. In addition, the objective function and the constraints are assumed to be differentiable at the optimum; otherwise, the optimality conditions for the NLP might not be representable purely as equalities and inequalities, and the subsequent analysis would be more complicated. An example illustrating the difficulty of nondifferentiability at the optimum is given in [6, §3.1]. Due to its special structure, an SOCP model does not satisfy the assumptions of the above Newton-type methods. The details are as follows. The reformulation of the second-order cone constraint as $x_0 \ge \|\bar{x}\|_2$ is
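To make the NCP-function idea concrete, here is a small Python check (a sketch for illustration only; the sample points are arbitrary) that the min function and the Fischer-Burmeister function $\phi(a,b) = \sqrt{a^2 + b^2} - a - b$, both of which appear later in this paper, vanish exactly on complementary pairs.

```python
import math

def ncp_min(a, b):
    # min(a, b) = 0  <=>  a >= 0, b >= 0, ab = 0
    return min(a, b)

def ncp_fischer_burmeister(a, b):
    # phi(a, b) = sqrt(a^2 + b^2) - a - b, another standard NCP-function
    return math.hypot(a, b) - a - b

def is_complementary(a, b, tol=1e-12):
    return a >= -tol and b >= -tol and abs(a * b) <= tol

samples = [(0.0, 3.0), (2.0, 0.0), (0.0, 0.0),    # complementary pairs
           (1.0, 2.0), (-1.0, 3.0), (0.0, -2.0)]  # non-complementary pairs

for a, b in samples:
    zero_min = abs(ncp_min(a, b)) < 1e-12
    zero_fb = abs(ncp_fischer_burmeister(a, b)) < 1e-12
    assert zero_min == zero_fb == is_complementary(a, b)
print("both NCP-functions vanish exactly on the complementary pairs")
```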

nondifferentiable at points where $\bar{x} = 0$. The SOC can also be represented as $x_0 \ge 0$, $x_0^2 \ge x_1^2 + \cdots + x_n^2$; however, the second inequality is neither convex nor concave. In addition, the gradients of the two inequalities are linearly dependent at a solution where $\bar{x} = 0$. There are some other reformulations of the SOC constraint in the setting of IPM or smoothing Newton’s methods [6, 16]; however, they have drawbacks if Newton’s method is applied to them. The formulations given in [6] for IPM are $\bigl\{ e^{(\|\bar{x}\|_2^2 - x_0^2)/2} - 1 \le 0,\ x_0 \ge 0 \bigr\}$ and $\bigl\{ \frac{\|\bar{x}\|_2^2}{x_0} - x_0 \le 0,\ x_0 \ge 0 \bigr\}$. Again, both have linearly dependent gradients of the constraints at a solution where $\bar{x} = 0$. In addition, it is noted in [6] that the first formulation does not work well in practice, and the second formulation is not well defined at points where $x_0 = 0$. Two reformulations of the second-order cone complementarity constraints for smoothing Newton’s methods are provided in [16]. Let $z$ denote the dual variable. As the smoothing parameter approaches 0, the limits of their formulations are the following.

1) $\phi_0(x, z) = x - \bigl([\lambda_1]_+ u_1 + [\lambda_2]_+ u_2\bigr)$, where $[\alpha]_+ = \max(0, \alpha)$ for $\alpha \in \mathbb{R}$, and for $i = 1, 2$: $\lambda_i = x_0 - z_0 + (-1)^i \|\bar{x} - \bar{z}\|_2$ and
$$u_i = \begin{cases} \frac{1}{2}\Bigl[1,\ (-1)^i \frac{(\bar{x} - \bar{z})^T}{\|\bar{x} - \bar{z}\|_2}\Bigr]^T & \bar{x} \neq \bar{z}, \\ \frac{1}{2}\bigl[1,\ (-1)^i v^T\bigr]^T & \bar{x} = \bar{z}, \end{cases}$$
with $v \in \mathbb{R}^n$ being any vector satisfying $\|v\|_2 = 1$.

2) $\psi_0(x, z) = x + z - \bigl(\sqrt{\lambda_1}\, u_1 + \sqrt{\lambda_2}\, u_2\bigr)$, where, for $i = 1, 2$: $\lambda_i = \|x\|_2^2 + \|z\|_2^2 + 2(-1)^i \|x_0 \bar{x} + z_0 \bar{z}\|_2$ and
$$u_i = \begin{cases} \frac{1}{2}\Bigl[1,\ (-1)^i \frac{(x_0 \bar{x} + z_0 \bar{z})^T}{\|x_0 \bar{x} + z_0 \bar{z}\|_2}\Bigr]^T & x_0 \bar{x} + z_0 \bar{z} \neq 0, \\ \frac{1}{2}\bigl[1,\ (-1)^i v^T\bigr]^T & x_0 \bar{x} + z_0 \bar{z} = 0, \end{cases}$$
with $v \in \mathbb{R}^n$ being any vector satisfying $\|v\|_2 = 1$.

Note that the limit of their first formulation is not well defined when $\bar{x} = \bar{z}$, and that of their second one is not well defined when $x_0 \bar{x} + z_0 \bar{z} = 0$.

In this paper, we derive optimality conditions for (1). Our optimality conditions are well-defined and $C^\infty$ everywhere. However, the complementarity constraints in our optimality conditions involve three terms, so we cannot simply apply an NCP-function to them. We then show how to reformulate complementarity constraints having more than two terms into nonsmooth equations by the min function, by the Jordan decomposition, and by compositions of general NCP-functions. We also use the Q method approach [34] to reformulate the optimality conditions into a system of equations. Our reformulations are simpler than those in [16]. Furthermore, we give assumptions under which each element in the Clarke generalized Jacobian of each reformulation is nonsingular. Thus, we can establish that iterates generated by the semismooth Newton’s method [28] converge Q-quadratically to an optimal solution locally. Our assumptions include primal-dual nondegeneracy [3], but neither linear independence of the gradients of the active constraints nor strict complementarity. As well, we show that Newton’s method is good for re-optimizing perturbed equations, in the sense that, starting from a solution to the original problem, the iterates converge Q-quadratically to a solution of a slightly perturbed problem. The perturbations can be changes of both the parameters and the size of the problem. To globalize the algorithm, we then apply a line search strategy to the unconstrained optimization problem equivalent to the nonlinear equations [10]. However, the objective function of the resulting unconstrained minimization is not smooth (continuously differentiable) everywhere. It has been noted in [27] that Armijo’s line search fails to be convergent if the gradient in the smooth case is replaced with a Clarke generalized directional derivative [8] in the nonsmooth case, because the generalized gradient [8] is not continuous. A counterexample, showing that replacing the gradient in the smooth case with a Clarke generalized directional derivative does not work for functions that are merely locally Lipschitz, is given in [32]. As a result, a globally convergent generalized Newton-type method typically either requires differentiability of the corresponding unconstrained optimization problem, or assumes that the nonsmooth equations are directionally differentiable [25, 26, 28] and sets the $k$th search direction $\Delta w^k$ to be a solution of the equations

$$G(w^k) + D\bigl[G(w^k); \Delta w^k\bigr] = 0, \tag{2}$$

where $w^k$ represents the $k$th iterate, and $D[G(w); v]$ is the directional derivative of the nonlinear equations $G$ at $w$ in the direction $v$. However, (2) is nonlinear and is not easy to solve in general, and none of the above papers addresses how to solve the system. In this paper, we propose a perturbed Armijo-type gradient method for minimizing a continuous function that is almost everywhere differentiable. Note that a locally Lipschitz function is differentiable almost everywhere. We prove global convergence of the method. Unlike the bundle methods of Lemaréchal and Wolfe (see for instance [32]), our algorithm does not require the construction of bundles of generalized gradients, nor driving the bundle-size parameters to zero. Nor does our method have the drawbacks of subgradient

methods (see for instance [27]): a poor convergence rate, iterates that are not necessarily descent steps, and the lack of an implementable stopping criterion. We also show how to solve (2) for our reformulations of (1). The rest of the paper is organized as follows. Some background material is included in §§1.1. In §2, we derive optimality conditions for (1). In §3, we transform the optimality conditions into several systems of equations. For each system, we give sufficient conditions under which each element of its generalized Jacobian is nonsingular; thus, the sequence generated by the semismooth Newton’s method is locally Q-quadratically convergent. In §4, we show that, starting from the old solution, the sequence generated by Newton’s method converges Q-quadratically to a solution of the perturbed problem under some assumptions. Globalization of the algorithm, including the perturbed gradient method and the method to find the direction given by (2), is discussed in §5. Numerical examples are presented in §6. Properties of our method and comparisons of our reformulations are summarized in §7.

1.1 Background

For completeness, we cite relevant results on the semismooth Newton’s method here.

Consider a vector-valued function $G: \mathbb{R}^n \to \mathbb{R}^m$, written in terms of component functions as $G(x) = [G^1(x); G^2(x); \ldots; G^m(x)]$. By Rademacher’s theorem, $G$ is differentiable (i.e., each $G^i$ is differentiable) almost everywhere on any neighborhood of $x$ in which $G$ is Lipschitz. Denote the set of points at which $G$ is nondifferentiable by $\Omega_G$. For a vector $y$ at which the necessary partial derivatives exist, write $\nabla G(y)$ for the usual $m \times n$ Jacobian matrix of partial derivatives.

Denote the Clarke generalized Jacobian [8] of $G$ at $x$ by $\partial G(x) \stackrel{\rm def}{=} \operatorname{conv}\bigl\{ \lim \nabla G(x_i) : x_i \to x,\ x_i \notin \Omega_G \bigr\}$. Let $B$ denote the open unit ball in $\mathbb{R}^{m \times n}$. Then $\partial G$ has the following properties [8, Proposition 2.6.2]: (1) $\partial G(x)$ is a nonempty convex compact subset of $\mathbb{R}^{m \times n}$; (2) $\partial G$ is upper semicontinuous at $x$, i.e., for any $\epsilon > 0$ there is $\delta > 0$ such that, for all $y \in x + \delta B$, $\partial G(y) \subset \partial G(x) + \epsilon B$; (3) $\partial G(x) \subset \partial G^1(x) \times \partial G^2(x) \times \cdots \times \partial G^m(x)$, where the latter denotes the set of all matrices whose $i$th row belongs to $\partial G^i(x)$ for each $i = 1, \ldots, m$.

Denote $\partial_B G(x) \stackrel{\rm def}{=} \bigl\{ \lim_{x_i \to x,\ x_i \notin \Omega_G} \nabla G(x_i) \bigr\}$. Then $\partial G(x) = \operatorname{conv} \partial_B G(x)$.

Semismoothness was originally introduced in [23] for functionals. In [29], the concept is extended to vector-valued functions, where it is shown that semismoothness is equivalent to the uniform convergence of the directional derivatives in all directions. This is important in extending Newton’s method from systems of differentiable functions to semismooth functions [28, 29].

A function $G$ is said to be semismooth [29] at $x$ if
$$\lim_{\substack{h' \to h \\ t \downarrow 0}} \frac{G(x + t h') - G(x)}{t} \;=\; \lim_{\substack{V \in \partial G(x + t h') \\ h' \to h,\ t \downarrow 0}} \{ V h' \}$$
and $G$ is locally Lipschitz continuous at $x$. Semismooth functions include convex functions and smooth functions, and are directionally differentiable [29].

Suppose $G$ is semismooth at $x$. Then $G$ is said to be strongly semismooth [28, 30] at $x$ if there exist a constant $L$ and a neighborhood $N$ of $x$ such that for all $x + h \in N$ and $V \in \partial G(x + h)$, $\|Vh - D[G(x); h]\|_2 \le L \|h\|_2^2$. It is argued in [30] that a vector-valued function is strongly semismooth iff each component is strongly semismooth; a function with a locally Lipschitz derivative (an $LC^1$ function) is strongly semismooth everywhere; and the sum and the min of two $LC^1$ functions are strongly semismooth. The function $G$ is said to be strongly BD-regular [28] at $x$ if every $V \in \partial_B G(x)$ is nonsingular.

Theorem 1.1 ([28, Theorem 3.1]) Suppose that $x^*$ is a solution of $G(x) = 0$, and that $G$ is directionally differentiable in a neighborhood of $x^*$, and strongly semismooth and strongly BD-regular at $x^*$. Then the sequence
$$x^{k+1} = x^k - \bigl(V^k\bigr)^{-1} G(x^k), \qquad \text{where } V^k \in \partial_B G(x^k), \tag{3}$$
is well defined and converges to $x^*$ Q-quadratically in a neighborhood of $x^*$.

Lemma 1.1 (Banach perturbation lemma) Let $A \in \mathbb{R}^{n \times n}$ be an invertible matrix and $\Delta A \in \mathbb{R}^{n \times n}$ a perturbation matrix. Suppose $\|A^{-1}\|\,\|\Delta A\| < 1$. Then $(A + \Delta A)^{-1}$ exists and
$$\|(A + \Delta A)^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\Delta A\|}.$$
Here, $\|\cdot\|$ is any matrix norm.
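As a toy illustration of iteration (3), the following Python sketch (an illustrative example with a made-up piecewise-smooth map, not one of the reformulations developed later) applies the iteration with an element of $\partial_B G$ chosen from the active smooth branch of a min function.

```python
import numpy as np

def G(w):
    a, b = w
    # complementarity between a and b, plus the linear equation a + b = 2
    return np.array([min(a, b), a + b - 2.0])

def element_of_partial_B(w):
    a, b = w
    # one element of the B-subdifferential of G: pick the active branch of min(a, b)
    row_min = np.array([1.0, 0.0]) if a <= b else np.array([0.0, 1.0])
    return np.vstack([row_min, np.array([1.0, 1.0])])

w = np.array([0.7, 0.4])                 # arbitrary starting point
for k in range(10):
    Vk = element_of_partial_B(w)
    w = w - np.linalg.solve(Vk, G(w))    # iteration (3)
    if np.linalg.norm(G(w)) < 1e-12:
        break
print(k + 1, "iterations, solution:", w)  # converges to (2, 0) or (0, 2)
```

Because this toy $G$ is piecewise linear, the iteration terminates after one step once the correct branch is identified; for the strongly semismooth, strongly BD-regular systems of §3, Theorem 1.1 instead gives local Q-quadratic convergence.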


2 Optimality Conditions

In this section we develop optimality conditions for a general convex program with SOC constraints:
$$\min \ f(x) \quad \text{s.t.} \quad x \in H, \ x \ge_Q 0, \tag{4}$$
where $f: \mathbb{R}^{N_1 + \cdots + N_n} \mapsto \mathbb{R}$ is a proper convex function, i.e. $f(x) < +\infty$ for at least one $x$ and $f(x) > -\infty$ for every $x$, and $H$ is a convex set.

Let $R$ correspond to the following diagonal matrix of the proper order:
$$R \stackrel{\rm def}{=} \begin{bmatrix} 1 & & & \\ & -1 & & \\ & & \ddots & \\ & & & -1 \end{bmatrix}.$$
Indicating the normal cone to $H$ at $x$ as $N_H(x)$ and the subdifferential of $f$ at $x$ as $\partial f(x)$, we transform (4) into the following system, with variables $\lambda \stackrel{\rm def}{=} (\lambda_1, \ldots, \lambda_n)^T$, $\omega \stackrel{\rm def}{=} (\omega_1, \ldots, \omega_n)^T$, $z \stackrel{\rm def}{=} (z_1^T, \ldots, z_n^T)^T$:

$$\begin{aligned}
&0 \in \partial f(\lambda_1 z_1; \ldots; \lambda_n z_n) - (\omega_1 R z_1; \ldots; \omega_n R z_n) + N_H(\lambda_1 z_1; \ldots; \lambda_n z_n) && \text{(5a)} \\
&(\lambda_1 z_1; \ldots; \lambda_n z_n) \in H && \text{(5b)} \\
&(z_i)_0 = 1 \quad (i = 1, \ldots, n) && \text{(5c)} \\
&\lambda_i \omega_i \bigl((z_i)_0^2 - \bar{z}_i^T \bar{z}_i\bigr) = 0 \quad (i = 1, \ldots, n) && \text{(5d)} \\
&(z_i)_0^2 - \bar{z}_i^T \bar{z}_i \ge 0 \quad (i = 1, \ldots, n) && \text{(5e)} \\
&\lambda_i \ge 0 \quad (i = 1, \ldots, n) && \text{(5f)} \\
&\omega_i \ge 0 \quad (i = 1, \ldots, n). && \text{(5g)}
\end{aligned}$$

The variables of (5) are $\lambda$, $\omega$, $z$. The next lemma shows the equivalence of (5) and (4).

Lemma 2.1 Assume $f$ is a proper convex function and the objective value of (4) is bounded below on the feasible set. Then:
(i) If $(\omega^*; \lambda^*; z^*)$ solves (5), then $x^* = (\lambda_1^* z_1^*; \ldots; \lambda_n^* z_n^*)$ solves (4).
(ii) Furthermore, assume $H = \cap_{i=1}^m H_i$, where $H_i$ is a polyhedral convex set for $i = 1, \ldots, r$, and $(\cap_{i=1}^r H_i) \cap (\cap_{i=r+1}^m \operatorname{ri} H_i) \cap \operatorname{int} Q \neq \emptyset$. Then for any solution $x^*$ of (4), there exists $(\omega^*; \lambda^*; z^*)$ solving (5), and $x^* = (\lambda_1^* z_1^*; \ldots; \lambda_n^* z_n^*)$.

Following the convention of [1], we partition $Q_{n+1}$ into three disjoint sets: $\{0\}$, $\operatorname{int} Q$, and $\operatorname{bd} Q$, where $\operatorname{bd} Q$ is the boundary of $Q$ excluding $0$. To prove the lemma, we first describe $N_{Q_{n+1}}(\cdot)$, i.e. the normal cone to $Q_{n+1}$.

Proposition 2.1 For any $x \in Q_{n+1}$,
$$N_{Q_{n+1}}(x) = \begin{cases} \{0\} & x \in \operatorname{int} Q_{n+1}, \\ \{\lambda(-x_0, x_1, \ldots, x_n)^T : \lambda \ge 0\} & x \in \operatorname{bd} Q_{n+1}, \\ -Q_{n+1} & x = 0. \end{cases}$$
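The boundary case of Proposition 2.1 can be sanity-checked numerically. The Python sketch below (illustration only; random data, not from the paper) draws a boundary point $x$ of $Q_{n+1}$, forms $z = \lambda(-x_0, x_1, \ldots, x_n)^T$, and verifies the defining inequality $z^T(y - x) \le 0$ of the normal cone on random points $y \in Q_{n+1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A boundary point of Q_{n+1}: x0 = ||x_bar||_2 > 0.
x_bar = rng.standard_normal(n)
x = np.concatenate(([np.linalg.norm(x_bar)], x_bar))

lam = 2.5                                   # any lambda >= 0
z = lam * np.concatenate(([-x[0]], x[1:]))  # candidate normal vector at x

# Check z^T (y - x) <= 0 for many random points y in Q_{n+1}.
worst = -np.inf
for _ in range(10000):
    y_bar = rng.standard_normal(n)
    y0 = np.linalg.norm(y_bar) + abs(rng.standard_normal())  # ensures y in Q
    y = np.concatenate(([y0], y_bar))
    worst = max(worst, z @ (y - x))
print("largest value of z^T(y - x):", worst)  # should not exceed ~0
```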

Proof: We will omit the dimension “$n+1$” and just write “$Q$” in the proof. By definition,
$$N_Q(x) = \{z : z^T(y - x) \le 0, \ \forall \, y \ge_Q 0\}. \tag{6}$$
For $x = 0$ and $x \in \operatorname{int} Q$, the representations are easy to verify. Next, we show the result for $x \in \operatorname{bd} Q$ by techniques similar to those in [1, 12].

First, we prove $N_Q(x) \subseteq \{\lambda(-x_0, x_1, \ldots, x_n)^T : \lambda \ge 0\}$. Since $Q$ is a convex cone, for every $v \in Q$ we have $v + x \in Q$. Setting $y = v + x$ in (6), we get that the normal vector $z$ satisfies
$$\forall \, v \ge_Q 0, \quad \langle z, v \rangle \le 0. \tag{7}$$
Letting $v = (\|\bar{z}\|_2; \bar{z})$ in (7), we have
$$z_0 \le 0, \tag{8}$$
$$z_0^2 \ge \sum_{i=1}^n z_i^2. \tag{9}$$
Letting $y$ in (6) be $0$ and then $2x$, we obtain
$$z_0 x_0 = -\sum_{i=1}^n z_i x_i. \tag{10}$$
Since $x \ge_Q 0$, we have
$$x_0^2 \ge \sum_{i=1}^n x_i^2. \tag{11}$$
For an arbitrary scalar $\alpha$, adding (9), $2\alpha$ times (10), and $\alpha^2$ times (11) together, we have
$$(z_0 + \alpha x_0)^2 \ge \sum_{i=1}^n (z_i - \alpha x_i)^2. \tag{12}$$
Let $\alpha = -\frac{z_0}{x_0}$ in (12). Notice $\alpha \ge 0$ by (8) and $x \ge_Q 0$. The left-hand side of (12) then vanishes, so (12) is valid only if $z_i = -\frac{z_0}{x_0} x_i$ $(i = 1, \ldots, n)$. This shows $N_Q(x) \subseteq \{\lambda(-x_0, x_1, \ldots, x_n)^T : \lambda \ge 0\}$. The other direction, $N_Q(x) \supseteq \{\lambda(-x_0, x_1, \ldots, x_n)^T : \lambda \ge 0\}$, is obvious since $x \in \operatorname{bd} Q$ and $Q$ is self-dual.

Let $\delta(\cdot \,|\, Q)$ denote the indicator function of $Q$, i.e.
$$\delta(x \,|\, Q) = \begin{cases} 0 & x \in Q, \\ +\infty & x \notin Q. \end{cases}$$

Remark 2.1 Another intuitive way to prove the above proposition is the following. By [31], $N_Q(x) = \partial\delta(x \,|\, Q)$. Furthermore, every $y \in \partial\delta(x \,|\, Q)$ is the normal to a nonvertical supporting plane to the graph of $\delta(\cdot \,|\, Q)$ at $[x; \delta(x \,|\, Q)]$. For $x \in \operatorname{bd} Q$, the derivative of $\sum_{i=1}^n x_i^2 - x_0^2$ is the normal to the supporting plane to the graph of $\delta(\cdot \,|\, Q)$ at $[x; \delta(x \,|\, Q)]$.

Now we proceed to prove Lemma 2.1.

Proof: By [31], a necessary and sufficient condition for $x^*$ to belong to the solution set of (4) is
$$0 \in \partial\left[f(x^*) + \delta\left(x^* \,|\, H \cap Q\right)\right]. \tag{13}$$

Note that $\partial\delta(x^* \,|\, H \cap Q) = N_{H \cap Q}(x^*)$. According to [31, Theorem 23.8] and its corollary, $N_{H \cap Q}(x) \supseteq N_Q(x) + N_H(x)$; if, in addition, $(\cap_{i=1}^r H_i) \cap (\cap_{i=r+1}^m \operatorname{ri} H_i) \cap \operatorname{int} Q \neq \emptyset$, then $N_{H \cap Q}(x) = N_Q(x) + N_H(x)$.

We first prove (i). Assume $(\omega^*, \lambda^*, z^*)$ is a solution to (5). Then, in view of (5b), (5f) and (5e), $x^* = (\lambda_1^* z_1^*; \ldots; \lambda_n^* z_n^*)$ is feasible for (4). Moreover, by (5g) and Proposition 2.1, $-\omega_i^* R z_i^* \in N_{Q_i}(x_i^*)$. Hence $x^*$ is an optimal solution to (4) by (13).

Now we prove (ii). Let $x^*$ be a solution to (4). We let $\lambda_i^* = (x_i^*)_0$. By (13), there exist $v^* \in \partial f(x^*)$, $u^* \in N_Q(x^*)$, $w^* \in N_H(x^*)$ such that $v^* + u^* + w^* = 0$. When $x_i^* \in \operatorname{int} Q_{N_i}$, we set $\omega_i^* = 0$ and $z_i^* = \frac{1}{\lambda_i^*} x_i^*$. When $x_i^* \in \operatorname{bd} Q_{N_i}$, let $z_i^* = \frac{1}{\lambda_i^*} x_i^*$; by Proposition 2.1, there exists $\alpha \ge 0$ such that $u_i^* = -\alpha R x_i^*$, and so we set $\omega_i^* = \alpha \cdot \lambda_i^*$ in this case. When $x_i^* = 0$, by Proposition 2.1, $-u_i^* \in Q_{N_i}$. We set $\omega_i^* = -(u_i^*)_0$. If $\omega_i^* \neq 0$, we set $z_i^* = -\frac{1}{\omega_i^*} R u_i^*$; otherwise, we set $z_i^* = (1; 0)$. Thus, we have proved the lemma.

Suppose $H = \{x : Ax = b\}$ and assume that $A$ has full row rank. Then $N_H(x) = \{A^T y\}$ for any $x \in H$. Therefore, in the special case of (1), (5) reduces to the following system:

$$\begin{aligned}
&c - (\omega_1 R z_1; \ldots; \omega_n R z_n) + A^T y = 0 && \text{(14a)} \\
&\lambda_1 A_1 z_1 + \cdots + \lambda_n A_n z_n = b && \text{(14b)} \\
&(z_i)_0 = 1 \quad (i = 1, \ldots, n) && \text{(14c)} \\
&\lambda_i \omega_i \bigl((z_i)_0^2 - \bar{z}_i^T \bar{z}_i\bigr) = 0 \quad (i = 1, \ldots, n) && \text{(14d)} \\
&(z_i)_0^2 - \bar{z}_i^T \bar{z}_i \ge 0 \quad (i = 1, \ldots, n) && \text{(14e)} \\
&\lambda_i \ge 0 \quad (i = 1, \ldots, n) && \text{(14f)} \\
&\omega_i \ge 0 \quad (i = 1, \ldots, n). && \text{(14g)}
\end{aligned}$$
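Conditions (14a)-(14g) are easy to monitor numerically. The Python sketch below (a schematic helper with hypothetical block data supplied by the caller; it only evaluates residuals and is not the algorithm of this paper) reports the worst violation of (14) at a candidate point.

```python
import numpy as np

def R_mat(N):
    # reflection matrix R = diag(1, -1, ..., -1) of order N
    d = -np.ones(N)
    d[0] = 1.0
    return np.diag(d)

def kkt_violation(blocks, A, b, c, omega, lam, z, y):
    """Largest violation of (14a)-(14g); `blocks` lists the block sizes N_i."""
    starts = np.cumsum([0] + blocks)
    dual = np.concatenate([omega[i] * R_mat(N) @ z[starts[i]:starts[i + 1]]
                           for i, N in enumerate(blocks)])
    res = [np.abs(c - dual + A.T @ y).max()]                      # (14a)
    primal = np.concatenate([lam[i] * z[starts[i]:starts[i + 1]]
                             for i, _ in enumerate(blocks)])
    res.append(np.abs(A @ primal - b).max())                      # (14b)
    for i, _ in enumerate(blocks):
        zi = z[starts[i]:starts[i + 1]]
        gap = zi[0] ** 2 - zi[1:] @ zi[1:]
        res.append(abs(zi[0] - 1.0))                              # (14c)
        res.append(abs(lam[i] * omega[i] * gap))                  # (14d)
        res.append(max(0.0, -gap))                                # (14e)
        res.append(max(0.0, -lam[i]))                             # (14f)
        res.append(max(0.0, -omega[i]))                           # (14g)
    return max(res)
```

Feeding the output of any of the reformulations in §3 into such a check is a cheap way to confirm that a computed point satisfies the optimality conditions to the desired tolerance.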


Remark 2.2 Let $\sqrt{\bar{x}_i^T \bar{x}_i} - (x_i)_0 \le 0$ $(i = 1, \ldots, n)$ represent $x_i \in Q_{N_i}$. Applying [31, Theorem 28.2] to (1), one can conclude the existence of Lagrangian multipliers under the same assumptions as in Lemma 2.1; and, using subgradients, the resulting KKT system is equivalent to (14). However, our preceding proof is for a more general problem and is readily extended to SDP and the p-cone program. When $f(x)$ is nonlinear or $H$ is non-polyhedral, (4) can be approximated by the cutting plane method of Cheney-Goldstein-Kelley. Since a linear objective and a polyhedral convex domain are good enough for many applications, we assume a linear objective and a polyhedral feasible set in what follows. Another reason for focusing on (1) is that, because of the linearity, some special assumptions that are necessary to prove the nonsingularity of each element in the generalized Jacobian for our systems in §3 may not be applicable to a general nonlinear case, and vice versa. We give (5) here because we believe that it may help the design of algorithms more efficient than the cutting plane method for (4). In the rest of the paper, we concentrate on (14).

3 Reformulating the Optimality Conditions as Equations

The optimality conditions (14) include inequalities, which cannot be handled directly by Newton’s method for equations. In this section, we further transform (14) into some systems of equations, which can be solved by the semismooth Newton’s method. For each system, we give assumptions under which each element of its Clarke generalized Jacobian is nonsingular, for the sake of the perturbation analysis in §4. Since the Clarke generalized Jacobian is the convex hull of $\partial_B G(x)$ for an operator $G$, we also obtain that the solution is strongly BD-regular. Consequently, we can conclude that sequences generated by the semismooth Newton’s method for our reformulations converge Q-quadratically locally to a solution under these assumptions if $G$ is strongly semismooth.

NCP-functions convert complementarity constraints between two variables into equations. The complementarity constraints of (5), i.e. (5d)-(5g), involve three terms. In this section, we show how to model complementarity constraints involving several terms as equations with the min function in §§3.1, with the Jordan decomposition in §§3.2, and with compositions of NCP-functions in §§3.3. We also give a reformulation of (14) into a system of equations based on the Q method of [34] in §§3.4. For every nonlinear equation reformulation, we assume that its solution satisfies
$$\lambda_i \text{ and } \omega_i \text{ are not both zero } (i = 1, \ldots, n), \qquad \lambda \neq 0. \tag{15}$$

Violation of either of the assumptions above results in the local nonuniqueness of that solution to (14); hence that solution is not BD-regular and we cannot apply Theorem 1.1 to it. To see this, suppose $\lambda = 0$. Then by (14b), $b = 0$. From the full row rank assumption on the matrix $A$ and (14a), we conclude that any $\omega \ge 0$, $z \in Q$ with $(z_i)_0 = 1$, along with $\lambda = 0$, is a solution to (14). When $\lambda_j = \omega_j = 0$, any $z_j$ satisfying (14c) and (14e) gives a solution of (14).

Next, we introduce some notation used in this section. We partition the index set into subsets based on the complementarity conditions (14d)-(14g) at an optimal solution:
$$\begin{aligned}
L_1 &\stackrel{\rm def}{=} \{i : \lambda_i > 0,\ \omega_i > 0,\ z_i \in \operatorname{bd} Q_{N_i}\}, &
L_2 &\stackrel{\rm def}{=} \{i : \lambda_i > 0,\ \omega_i = 0,\ z_i \in \operatorname{bd} Q_{N_i}\}, \\
L_3 &\stackrel{\rm def}{=} \{i : \lambda_i > 0,\ \omega_i = 0,\ z_i \in \operatorname{int} Q_{N_i}\}, &
L_4 &\stackrel{\rm def}{=} \{i : \lambda_i = 0,\ \omega_i > 0,\ z_i \in \operatorname{bd} Q_{N_i}\}, \\
L_5 &\stackrel{\rm def}{=} \{i : \lambda_i = 0,\ \omega_i > 0,\ z_i \in \operatorname{int} Q_{N_i}\}.
\end{aligned}$$
Note that for each $i \in L_1 \cup L_3 \cup L_5$, only one of the three terms $\lambda_i$, $\omega_i$, and $(z_i)_0^2 - \bar{z}_i^T \bar{z}_i$ is zero; for each $i \in L_2 \cup L_4$, two of the terms are zero. Because of (15), we exclude the indices of the blocks for which both $\lambda_i$ and $\omega_i$ are zero. Nor do we include the set $\{i : \lambda_i > 0,\ \omega_i > 0,\ z_i \in \operatorname{int} Q_{N_i}\}$, because by (14d), for each $i = 1, \ldots, n$, at least one of $\lambda_i$, $\omega_i$ and $(z_i)_0^2 - \bar{z}_i^T \bar{z}_i$ is zero at an optimum.

We use $\bar{\bar{x}}$ to represent the subvector of $x$ excluding $x_0$ and $x_1$, i.e., $x = (x_0, \bar{x}^T)^T = (x_0, x_1, \bar{\bar{x}}^T)^T$. We let $\bar{A}$ stand for the submatrix of $A$ excluding its first column, i.e. $A = [a_0, \bar{A}]$. The symbol $\bar{\bar{A}}$ is used to represent the submatrix of $A$ excluding its first two columns: $\bar{A} = [a_1, \bar{\bar{A}}]$. We use the subscript $\bar{1}$ to designate the index set without index 1; thus, $A_{\bar{1}} = [a_0, \bar{\bar{A}}]$ and $z_{\bar{1}} = [z_0; z_2; \ldots; z_n]$.

For any $z \in \mathbb{R}^{n+1}$ with $z_0 = 1$, we define $K_z$ [34] as:

when k¯zk2 6= 0 and z1 6= − k¯ zk2 ,  Kz

1 2 z1 − 2k¯ zk2

1 2  z1 def  2k¯ z k2 =

 

when z1 = − k¯ zk2 or k¯ zk2 = 0,

¯T

¯ z − √2k¯ zk

 √1 I− 2

¯

¯ z 2k¯ z k2

− 2k¯¯zzk

2

1 2 − 21 def  

1 2 1 2



Kz = 

It is easy to verify that Kz−1 = 2KzT , and   1 + k¯zk2 (16) z = Kz 1 − k¯zk2  , 0

3.1



0T

− √12

2

¯ zT z¯ zk2 +z1 ) k¯ zk2 (k¯

   ; 

 √1 I 2

 . 

 1 − k¯zk2 Rz = Kz 1 + k¯zk2  . 0 

A Reformulation Based on the Min Function

In this part, we reformulate (14) into a system of equations via the min function. Then we give sufficient conditions under which each element of its Clarke generalized Jacobian is nonsingular. Thus, we can conclude that the sequence generated by semismooth Newton’s method converges Q-quadratically locally. Using min functions to reformulate (14d)-(14g), we have that (14) is equivalent to the following system: (ω1 Rz1 ; . . . ; ωn Rzn ) − AT y − c = 0

(17a)

λ1 A1 z1 + · · · + λn An zn − b = 0 (zi )0 − 1 = 0 (i = 1, . . . , n)

(17b) (17c)

min(λi , ωi , zTi Rzi ) = 0 (i = 1, . . . , n).

(17d)

Remark 3.1 If the complementarity constraint is {a1 · a2 · · · · · an = 0, ai ≥ 0 (i = 1, . . . , n)}, we can formulate it as min(a1 , . . . , an ) = 0 . Next, we give sufficient conditions under which each element in the Clarke generalized Jacobian of (17) is nonsingular. To this end, we first simplify an element V in the Clarke generalized Jacobian of (14) at an optimum block by block, then consider V as a whole. We left multiply Diag(KzTi ) to block rows (17a), right multiply Diag(Kzi ) to columns corresponding to z. Dropping subscripts for block numbers, we write each block of V as the following.

(17a)

1 2

 ω

1−k¯ z k2 1+k¯ zk2 0

(17b) (17c) (17d)

λ 

ω 2

AKz p



1+k¯ z k2 1−k¯ zk2 0



zT 0 1 10

yT −I

λAKz

1 1 T 2, 2, 0



l 1 − k¯ zk2 , 1 + k¯zk2 , 0T

q

Here,

(18)



  l = 1, p = q = 0      p = 1 − α, l = α, q = 0 p = 1, q = l = 0    q = 1 − α, l = α, p = 0    q = 1, p = l = 0 8

i ∈ L1 , i ∈ L2 , i ∈ L3 , i ∈ L4 , i ∈ L5 ,

−(AKz )T



with α being any real number in [0, 1]. The separation of the index set at an optimum into L1 , . . . , L5 reveals the complementarity structure of (14). In this section, for analysis simplification, we re-partition ∪5i=1 Li based on the following tree. Each case we study below corresponds to a leaf of the tree. These are the only possible cases due to the complementarity (14d)-(14g) and assumptions (15). Note that when ω = 0, λ must not be zero due to (15); hence q = 0. When ω 6= 0 and q = 0, because of (14d), z must be in bd Q; hence l = 1. ∪5i=1 Li

ω 6= 0

q=0

ω=0

q 6= 0 l=0

l 6= 0

l 6= 0

l=0

case 1: ω 6= 0, q = 0. The conditions imply that p = 0, z ∈ bd Q, l = 1. We first use l(1 + k¯ zk2 ) to eliminate row (17d) and column z1 ; then use 21 (1 + k¯ zk2 ) to eliminate column ω and the 2nd row of (17a); next use 21 to eliminate column z0 and row (17c). When λ 6= 0, case 1 includes solely L1 . This block is reduced to (17b)

λ z¯T  0

ω 2λ

−I

(AKz )¯1

yT −(AKz )T¯1

.

When λ = 0, case 1 includes exclusively L4 . This block can be reduced to the following matrix.

(17b)

(AKz )0

yT −(AKz )T0

case 2: ω 6= 0, l = 0. Clearly, p = 0, λ = 0, q = 1. This case includes just L4 and L5 . We first eliminate row (17d) and column λ by q, then subtract column z0 from column z1 to eliminate row (17c) and column z0 by 12 . Next, we add the 1st row of (17a) to the 2nd row of (17a); so we can eliminate the 1st row of (17a) and column z1 by ω2 , the 2nd row of (17a) and column ω by 1, and the remaining by − ω2 I. Thus we needn’t worry about this block when analyzing the nonsingularity of (17) by assumption (15). case 3: ω 6= 0, q 6= 0, l 6= 0. The conditions imply z ∈ bd Q, λ = 0; so this case includes only L4 . zk2 ) to delete column ω and the 2nd row of (17a), then use 21 to eliminate column z0 We first use 12 (1 + k¯ ω and row (17c), and subtract 4α multiplying (17d) from the 1st row of (17a). Then this block is reduced to:

(17b)

− (1−α)ω 8α (AKz )0

yT −(AKz )T0

.

case 4: ω = 0, l = 0. Under (15), λ 6= 0; hence q = 0, p = 1. This block consists of L2 and L3 . Assume dual nondegeneracy. Then (Ai Kzi ) has linearly independent columns (see [3]). Hence the column corresponding to λ is nonzero. We first eliminate column ω and row (17d) by p, then subtract the first two columns of z multiplied by λ1 (1 + k¯ zk2 , 1 − k¯ zk2 ) from column λ. This block is reduced to the following. zT (17b)

AKz 9

yT −(AKz )T

case 5: ω = 0, l 6= 0. For this block, λ 6= 0, q = 0 and z ∈ bd Q. This case includes only L2 . When p = 0, l must be 1. 1+k¯ z k2 to eliminate column ω and the 2nd row of (17a), then use (1 + k¯ zk2 ) to eliminate column We first use 2 z1 and row (17d), use 12 to get rid of column z0 and row (17c). This block is reduced to the following matrix. λ ¯zT

yT −(AKz )T¯1

(AKz )¯1

(17b)

Now we consider p = 1 − α, l = α (0 < α < 1). We first subtract λ2 multiplying column z0 from column λ to eliminate row (17c) and column λ, then subtract 1 1−α multiplying row (17d) from the 2nd row of (17a). This block is reduced to the following matrix. zT

(17b)



0

yT

2α − (1−α)λ

AKz

0



−(AKz )T

Next we give sufficient conditions under which each element in the Clarke generalized Jacobian of (17) is nonsingular. def

Theorem 3.1 Suppose that a solution w∗ = (ω ∗ , λ∗ , z∗ , y∗ ) to (17) satisfies primal-dual nondegeneracy and assumption (15). Then each element in the Clarke generalized Jacobian of (17) is nonsingular at w∗ . Let G(w) = 0 denote (17). Consequently, the sequence (3) converges Q-quadratically to w∗ in a neighborhood of w∗ . Proof: As in [2, 3], we partition the primal variable xi = λi zi into three parts x = (xB ; xI ; xO ), where xB is the collection of all the boundary blocks, xI includes all the interior blocks, and xO collects all the zero blocks. Rearrange the index set so that xB = (x1 , . . . , xr ). It is shown in [2, 3] that primal nondegeneracy means the following matrix having linearly independent rows for all α1 , . . . , αr and ν that are not all zeros. 

A1 α1 (Rx1 )T

(19)

... ...

Ar αr (Rxr )T

AI 0T

AO νT



Since right multiplying a nonsingular matrix to a full row rank matrix doesn’t change the row rank of the zk2 = 1 and λi > 0. By [2, Lemma 19], latter, we right multiply Diag(Kzi ) to (19).   Notice that i ∈ B means k¯ primal nondegeneracy is equivalent to: (A1 Kz1 )0 (A1 Kz1 )2:N1 . . . (Ar Kzr )0 (Ar Kzr )2:Nr AI KzI ,   i.e. ((AKz )L1 L2 )¯1 (AKz )L3 , having linearly independent rows. def

Similarly, we partition the dual variable si = ωi Rzi into s = (sB ; sI ; sO ), with sB = (s1 ; . . . ; ss ) being the concatenation of boundary blocks, sI including all the interior blocks, and sO collecting all the zero blocks. And we partition A in the same way: A = (A˜B , A˜I , A˜O ). By [3], dual nondegeneracy means  (20) A˜1 Rs1 . . . A˜s Rss A˜O

Rsi , by (16), we see that dual nondegeneracy having linearly independent Since A˜i Rsi = A˜i Kzi Kz−1 i   columns.  h i   . . . A˜s Kzs A˜O KzO , i.e. ((AKz )L L ) (AKz )L L , having linearly independent means A˜1 Kz1 0

1

0

4

0

2

3

columns. Adding some columns to a full row rank matrix doesn’t change its rank; and after some columns being deleted fromh a full column rank matrix,  full column rank.  the matrix still has i Hence, primal nondegener  (AK ) (AK ) (AK ) (AK ) acy implies z L2(p=0) z L4(l6=0) z L1 ¯ z L2(p6=0) L3 having linearly independent 1 ¯ 0h 1    i   (AKz )L2(p=0) (AKz )L2(p6=0) L3 rows. And dual nondegeneracy implies (AKz )L1 0 (AKz )L4(l6=0) ¯ 0 1 having linearly independent columns.

10

  We choose all columns of (AKz )L1 L4(l6=0) 0 , (AKz )L2(p=0) ¯1 , and (AKz )L2(p6=0) L3 , along with some columns from ((AKz )L1 )2:n , to form an m by m nonsingular matrix B1 , and collect the remaining columns of ((AKz )L1 )2:n into B2 . The nonsingularity of ∂G is reduced to the following matrix being nonsingular:   I˜1 −B1T  (21) I˜2 −B2T  . B1 B2           (1−αi )ωi 2αi ωi ˜2 = −Diag ωi I ˜ , and I , , . Here, I1 = −Diag 0, 2λi i∈L i∈L 8αi (1−αi )λi 2λi 4 2 i∈L1

(li 6=0,qi 6=0)

i∈L1

(li 6=0,pi 6=0)

By the second condition of (15), B1 is nonempty. We first subtract I˜1 B1−1 left multiplying the 3rd block rows of (21) from the 1st block rows of (21), then add I˜1 B1−1 B2 I˜2−1 multiplying the 2nd block rows to the 1st block rows of (21). Therefore (21) is reduced to B1T + I˜1 B1−1 B2 I˜2−1 B2T = (I + I˜1 B1−1 B2 I˜2−1 B2T B1−T )B1T , which is nonsingular because it is in the form (I + N1 N2 )B1T with both N1 = I˜1 and N2 symmetric negative semidefinite and B1T nonsingular. Even if B2 is empty, (21) is still nonsingular. Besides, it is easy to verify that min(ωi , λi , zTi Rzi ) is strongly semismooth. In view of the arguments following the strongly semismoothness definition in §§1.1, we conclude that (17) is strongly semismooth. By Theorem 1.1, we obtain the locally Q-quadratic convergence rate of the sequence (3).

3.2

A Reformulation Based on Jordan Decomposition

In this part, we reformulate (14) into a system of nonlinear equations by the Jordan decomposition of λ and ω. Then we give conditions under which each element of its Clarke generalized Jacobian is nonsingular. For a scalar a ∈ R, denote its Jordan decomposition as a = [a]+ − [a]− , where [·]+ and [·]− are the def

def

projection functions: [a]+ = max(a, 0), [a]− = min(a, 0).  Assume that (λ; ω; z; y) solves (22); then it is easy to verify that [λ1 ]+ ; . . . ; [λn ]+ ; [ω1 ]+ ; . . . ; [ωn ]+ ; y; z is a solution to (14). ([ω1 ]+ Rz1 ; . . . ; [ωn ]+ Rzn ) − AT y − c = 0

(22a)

[λ1 ]+ A1 z1 + · · · + [λn ]+ An zn − b = 0 (zi )0 − 1 = 0 (i = 1, . . . , n)

(22b) (22c)

[ωi ]− + [λi ]− + zTi Rzi = 0 (i = 1, . . . , n).

(22d)

Remark 3.2 When complementarity constraint is {a1 · · · · · an = 0, ai ≥ 0 (i = 1, . . . n)}, we can reformulate it as [a1 ]− + . . . [an−1 ]− + an = 0 , and replace ai by [ai ]+ for i = 1, . . . , n in the original solution. Next, we give conditions under which each element in the Clarke generalized Jacobian of (22) is nonsingular. We first study an element in the generalized Jacobian block by block. As in the previous section, we left multiply Diag(KzTi ) to block rows (22a), right multiply Diag(Kzi ) to block columns z, and omit subscripts for block numbers. Then each block of V is in the following form.

(22a)

p 2

 ω

1−k¯ zk2 1+k¯ z k2 0

(22b) (22c) (22d)

λ 

[ω]+ 2

qAKz 1−p



1+k¯ zk2 1−k¯ z k2 0



1−q

Here p and q take the values below:   ωi > 0 p = 1 (23) p ∈ [0, 1] ωi = 0 ,   p=0 ωi < 0

zT 0 1 1 0

yT −I

[λ]+ AKz



−(AKz )T

( 12 , 12 , 0T ) (1 − k¯zk2 , 1 + k¯zk2 , 0T )   λi > 0 q = 1 q ∈ [0, 1] λi = 0 .   q=0 λi < 0

For simplicity of analysis, we re-partition the index set L1 ∪ · · · ∪ L5 into the following tree. 11

L1 ∪ · · · ∪ L5

q=1

p=1

p=0

0 0, b = 0) with p 6= 0, q 6= 0. The function M (a, b) can be: 1. 2. 3. 4.

2

−βab + [min(0, a + b)] (for β ∈ (0, 2]) (see [11] ) , min(a, b) √ , φ(a, b) = a2 + b2 − a − b (see [14]) , θ(|a − b|) − θ(a) − θ(b) (see [22]), where θ(t) is a differentiable strictly increasing function from R to R such that θ′ (0) + θ′ (ζ) 6= 0 for all ζ > 0.

We reformulate (14) into a system of nonlinear equations by replacing a, b, and c with λi , ωi , and zTi Rzi for i = 1, . . . , n in M (a, |b|c) or M [a, M (b, c)]. Denote the resulting system as G(w) = 0. Assume that the NCP-function M satisfies (24) and is strongly semismooth. Suppose strict complementarity at optimum. Then at an optimum, the only points at which G is nondifferentiable are those with ωi = 0 for M (λi , |ωi |zTi Rzi ); moreover, it is easy to verify that the structures of the Jacobians of these systems are the same as that of (17) and G is strongly semismooth. Hence we have the following. Theorem 3.3 Let w∗ be an optimal solution to G(w) = 0 satisfying primal-dual nondegeneracy, strict complementarity assumptions, and (15). Then each element in the Clarke generalized Jacobian of G(w∗ ) is nonsingular. Consequently, the sequence (3) converges Q-quadratically to w∗ in a neighborhood of w∗ .

3.4

A Formulation Based on the Q Method

In this part, we reformulate (14) into equations based on the Q method in [34]. In §§§3.4.1, we derive equivalent optimality conditions for (14) using the decomposition for the Q method. The complementarity constraints in the optimality conditions have only two variables; therefore, they can be reformulated into equations by any NCP-function. We then give sufficient conditions for the regularity of the Clarke generalized Jacobian of the equations reformulated by the min function in §§§3.4.2.

13

3.4.1

The Equations Based on the Q Method

In this part, based on a decomposition of λ, ω, z from the Q method [34], we derive optimality conditions for (1). Finally, we reformulate the optimality conditions into equations. Consider a decomposition of a vector z ∈ Rn+1 with z0 = 1 as ¯ ) + β(1; −¯ (25) z = α(1; u u), where α, β ∈ R and k¯ uk2 = 1. Then z ∈ Q iff α, β ≥ 0; z ∈ int Q iff α, β > 0; z ∈ bd Q iff α, β ≥ 0 and one of them is zero [34] . zk2 1+k¯ , Assume α ≥ β. When ¯ z 6= 0, the decomposition (25) is uniquely determined by α = 2 zk2 1−k¯ ¯ z ¯ = k¯zk . We further set u ¯ = (1; 0) if ¯z = 0. , u 2 2 Replacing z by its decomposition (25), we write λz, ωRz as (26)

¯ ) + λ2 (1; −¯ λz = λ1 (1; u u),

β =

¯ ) + ω2 (1; −¯ ωRz = ω1 (1; u u).

Furthermore, assume λ, ω ≥ 0, λ1 , λ2 , ω1 , ω2 ≥ 0. Then λωzT Rz = 0 iff λ1 ω1 = λ2 ω2 = 0 [34]. def ¯ i ] correspond to the decompoLet (λ; ω; z; y) be a solution to (14). Let vi = [(λ1 )i ; (λ2 )i ; (ω1 )i ; (ω2 )i ; u sitions of (λi ; ωi ; zi ) in (25) and (26). Then (v1 ; . . . ; vn ; y) solves the following system.

(27)

¯ i ) + (ωi )2 (1; −¯ (ωi )1 (1; u ui ) − ATi y − ci = 0 (i = 1, . . . , n) ¯ 1 ) + (λ1 )2 (1; −¯ ¯ n ) + (λn )2 (1; −¯ A1 ((λ1 )1 (1; u u1 )) + · · · + An ((λn )1 (1; u u2 )) − b = 0 (λi )1 (ωi )1 = 0, (λi )2 (ωi )2 = 0 (i = 1, . . . , n) (λi )1 ≥ 0, (λi )2 ≥ 0, (ωi )1 ≥ 0, (ωi )2 ≥ 0 (i = 1, . . . , n) ¯ Ti u ¯ i = 1 (i = 1, . . . , n). u

Note that λi and ωi in (14) are decomposed into [(λi )1 , (λi )2 ] and [(ωi )1 , (ωi )2 ] in (27). ¯ , λ1 , λ2 , ω1 , ω2 with λ1 ω1 = λ2 ω2 = 0, we let On the other hand, given u

(28)

λ = λ1 + λ2 ,

  λ −λ  ¯  1; λ11 +λ22 u    ω1 −ω2 z= ¯ u 1; ω1 +ω2    (1; 0)

ω = ω1 + ω2 ,

λ1 + λ2 6= 0 ω1 + ω2 6= 0

otherwise.

Next, we prove that z is well-defined. When λ1 + λ2 6= 0 and ω1 + ω2 6= 0, it is not difficult to verify ω1 −ω2 2 that λλ11 −λ +λ2 = ω1 +ω2 . Now assume λ1 + λ2 = ω1 + ω2 = 0. Then by λ1 ω1 = λ2 ω2 = 0, we get that λ1 = λ2 = ω1 = ω2 = 0. Suppose that (v1 ; . . . ; vn ; y) solves (27), where vi is defined before (28). Obtain (λi ; ωi ; zi ) from (28). Then (λ; ω; z; y) is a solution to (14). Thus, we have proved that (14) is equivalent to (27). Note that the complementarity constraints in (14) can be transformed into equations via any NCPfunctions. Below, we reformulate the complementarity constraints in (27) by the min function. (29a) (29b) (29c) (29d) (29e) 3.4.2

¯ i ) + (ωi )2 (1; −¯ (ωi )1 (1; u ui ) − ATi y − ci = 0 (i = 1, . . . , n) ¯ 1 ) + (λ1 )2 (1; −¯ ¯ n ) + (λn )2 (1; −¯ A1 ((λ1 )1 (1; u u1 )) + · · · + An ((λn )1 (1; u u2 )) − b = 0 min ((λi )1 , (ωi )1 ) = 0 (i = 1, . . . , n) min ((λi )2 , (ωi )2 ) = 0 (i = 1, . . . , n) ¯ Ti u ¯ i = 1 (i = 1, . . . , n) u Regularity of the Generalized Jacobian

In this part, we give conditions under which each element in the Clarke generalized Jacobian of (29) is nonsingular to prove locally Q-quadratic convergence rate of the sequence generated by the semismooth Newton’s method. We first analyze each element V in the Clarke generalized Jacobian of (29) block by block, then consider V as a whole. 14

¯ ) ∈ Rn+1 with k¯ For any (1; u uk2 = 1, define an n by n orthogonal matrix Lu¯ as  ! ¯T  u u − 1   u1 6= −1  T  u ¯ I − u¯ u¯ def ! 1+u1 . Lu¯ =  −1   u1 = −1   I

¯ u¯ Denote AL



¯ 1

   ¯ u¯ excluding the first column. We left multiply Diag 1 LT to as the submatrix of AL ¯ u i

¯ T . Dropping subscripts for block numbers, block rows (29a), right multiply Diag (Lu¯ i ) to block columns u we write each block of V as the following. (29a) (29b) (29c) (29d) (29e)

ω11 1 0

 ω12  −1 0

p1

λ1

¯T u

λ2

yT T

¯ u¯ )1 a0 − (AL

¯ u¯ )1 a0 + (AL q1

p2



(ω1 − ω2 ) 0I ¯ u¯ (λ1 − λ2 ) AL

q2

−aT 0 T ¯ u −(AL ¯)

2(1, 0T )

Here, for (j = 1, 2),

(30)

  0 = (λi )j < (ωi )j ; pj = 0, qj = 1 pj = 1, qj = 0 0 = (ωi )j < (λi )j ;   pj = 1 − α, qj = α (0 ≤ α ≤ 1) 0 = (λi )j = (ωi )j .

To prove that each element V in the Clarke generalized Jacobian at an optimum is nonsingular, we assume λ1 ≥ λ2 ,

(31) (32)

ω1 ≤ ω2 ;

(ω1 − ω2 ) and (λ1 − λ2 ) not be zero at the same time.

We first study the structure of each block of V . For simplicity of analysis, we re-partition the index set ∪5i=1 Li based on the number of zeros in the set {(pi )1 , (pi )2 , (qi )1 , (qi )2 }. In the first three cases, two of them are zero; while in the last two cases, only one of them is zero. The relation q2 = p1 = 0 implies λ1 = ω2 = 0; p1 = 0, p2 6= 0, q1 6= 0, q2 6= 0 implies λ1 = ω1 = ω2 = 0; q2 = 0, q1 6= 0, p1 6= 0, p2 6= 0 implies λ1 = λ2 = ω2 = 0; therefore, under (31) and (32), none of the latter three cases are possible. case 1: p1 = p2 = 0. Hence q1 = q2 = 1 by (30). This case is constituted by L4 and L5 . After adding column ω1 to column ω2 , we find that to analyze regularity of the Clarke generalized Jacobian of (29), we can ignore this case by (15). case 2: q1 = q2 = 0. Thus p1 = p2 = 1 by (30). This case consists of L2 and L3 only. We first add column λ2 to column λ1 , then subtract 12 multiplying column λ1 from column λ2 . To analyze regularity of the Clarke generalized Jacobian, this block can be transformed to the following matrix.

(29b) h A

1

Lu ¯

1

1 1 −1

I

i

h − A

1

yT 1

Lu ¯

1 1 −1

I

iT

case 3: p1 6= 0, q2 6= 0, q1 = p2 = 0. Hence ω1 = λ2 = 0 by (30). We first eliminate column ω1 and row (29c) by p1 , column λ2 and row (29d) by q2 , column u1 and row (29e) by 2. Then we add the first row of (29a) to the second row of (29a) to eliminate column ω2 and the 1st row of (29a). When λ1 6= 0, ω2 = 0, this case includes solely L2 . When λ1 6= 0, ω2 6= 0, this case includes just L1 . This block can be reduced to the following matrix.

15

¯T λ1 u  0

(29b)

yT ¯ ¯ )T −aT 1 0 −(ALu T ¯ u −(AL ¯ )¯ 1

ω

− λ2 I 1

¯ ¯ )¯1 ¯ u a0 +(AL ¯ )1 (ALu

When λ1 = 0, hence ω2 6= 0. Case 3 includes only L4 . This block can be reduced to the following matrix. λ1 (29b)0

¯ u¯ )1 a0 + (AL

−aT0

yT ¯ u¯ )T1 − (AL

case 4: p1 6= 0, q1 6= 0. Hence, q2 = 1, p2 = 0 by (30). This case includes merely L4 . Assume p1 = 1 − α, q1 = α (0 < α < 1). We first remove column λ2 and row (29d) by q2 , column u1 and row (29e) by 2, then add the first row of (29a) α to the second row of (29a) to get rid of column ω2 and the first row of (17a) by 1. Then we subtract 1−α multiplying column ω1 from column λ1 . This block is then reduced to the following matrix. 2α − 1−α

(29b)0

¯ u¯ )1 a0 + (AL

−aT0

yT ¯ u¯ )T1 − (AL

case 5: p2 6= 0, q2 6= 0. Hence, q1 = 0, p1 = 1 by (30). This case consists of L2 . Assume p2 = 1 − α, q2 = α (0 < α < 1). We first eliminate column ω1 and row (29c) by 1, column u1 and row (29e) by 2, then add the 1st row of (29a) to α the 2nd row of (29a). Next, we subtract 1−α multiplying column ω2 from column λ2 to eliminate column ω2 1 and row (29d), subtract 2 multiplying the 2nd row of (29a) from the 1st row of (29a). This block is reduced to

(29b) h A

 1

0

2α − 1−α

Lu ¯



  1 01

1 −1

I

i

h − A

1

Lu ¯

y 1

1 1 −1

I

iT

.

Next we give sufficient conditions for regularity of the Clarke generalized Jacobian of (29). def

def

Theorem 3.4 Denote ω ∗ = [(ω1∗ )1 , (ω1∗ )2 , . . . , (ωn∗ )1 , (ωn∗ )2 ], λ∗ = [(λ∗1 )1 , (λ∗1 )2 , . . . , (λ∗n )1 , (λ∗n )2 ]. Suppose def that a solution w∗ = (ω ∗ , λ∗ , u∗ , y∗ ) to (29) satisfies primal-dual nondegeneracy and assumptions (31) and (32). Then every element in the Clarke generalized Jacobian of (29) is nonsingular at w∗ . Therefore, the sequence (3) converges Q-quadratically to w∗ in a neighborhood of w∗ .   1 1 i   1 1 i h h def 1 −1 1 −1 . After to (19). Denote B = ADiag 1 Lu¯ i Proof: We right multiply Diag 1 Lu¯ i I

I

adding some columns to (19) and deleting some columns from (20), we get that primal nondegeneracy implies [(BL1 L2(p2 =0) )¯1 BL2(p2 6=0) L3 (BL4 (p1 6=0) )0 ] having linearly independent rows, and dual nondegeneracy implies [(BL1 L4(p1 6=0) )0 (BL2 (p2 =0) )¯1 BL2(p2 6=0) L3 ] having linearly independent columns.   The proof of Theorem 3.1 can be carried over here with I˜2 = Diag − (ωi )2 I and I˜1 being (λi )1



 0  2αi  2α − 1−αi , Diag 0, − i 1 − αi “ p1L6=40, ” q1 6=0

16

0

L1

   (ωi )2  I , − . “ L2 ” (λ ) i 1 p2 6=0, L1



q2 6=0

4

Re-optimization

In this section, we will prove the following theorem on re-optimization. def

def

Theorem 4.1 Denote w = (ω; λ; y; z). Let x = (λ1 z1 ; . . . ; λn zn ). Let Gold (w) = 0 represent any of the systems (17), (22), (29), or the systems in § 3.3. Let wold be a solution to Gold (w) = 0. Assume that every element in ∂Gold (wold ) is nonsingular. Suppose that the data are perturbed to (A + ∆ A, b + ∆ b, c + ∆ c). Denote the perturbed system as Gnew (w) = 0. Then there

exist positive scalars ν and υ,  such that if the perturbations (∆ A, ∆ b, ∆ c) satisfy k∆ Ak2 ≤ ν and ∆ c + ∆ AT yold ; ∆ b − ∆ Axold ∞ < υ, then Gnew (w) = 0 is solvable. In addition, starting from wold , the iterates (3) converge Q-quadratically to a solution of Gnew (w) = 0. We can augment A with added or deleted variables or constraints, and treat them as zeros in the old or the perturbed system. Therefore, the perturbation of A can also include addition or deletion of variables or constraints, provided that assumptions in the above theorem are satisfied. To prove the theorem, we first give a lemma. Lemma 4.1 There exist a neighborhood N (wold ) of wold and a constant ρ ≥ 0 independent of w ∈ N (wold ), such that for any w + ∆ w ∈ N (wold ), V ∈ ∂G(w + ∆ w), we have 2

kV ∆ w − D [G(w); ∆ w]k2 ≤ ρ k∆ wk2 ,

(33)

2

kG(w + ∆ w) − G(w) − D [G(w); ∆ w]k2 ≤ ρ k∆ wk2 .

Proof: Suppose G(w) = [G1 (w); G2 (w); . . . ; Gq (w)]. Then by definition, also property 3 of Clarke generalized Jacobian in §§1.1, for any V ∈ ∂G(w + ∆ w), we have V ∆ w = (V1 ∆ w; . . . ; Vq ∆ w), where Vi are the rows of V corresponding to Gi (w). As well, we have D [G(w); ∆ w] = (D [G1 (w); ∆ w] ; . . . ; D [Gq (w); ∆ w]). Assume that for each i = 1, . . . , q, the above lemma is satisfied, i.e., there exist Ni (wold ) and ρi independent of w such that for any w + ∆ w ∈ Ni (wold ) and Vi ∈ ∂Gi (w + ∆ w), the inequalities below are valid: 2 2 kVi ∆ w − D [Gi (w); ∆ w]k2 ≤ ρi k∆ wk2 , kGi (w + ∆ w) − Gi (w) − D [Gi (w); ∆ w]k2 ≤ ρi k∆ wk2 . p P def def q old 2 ). We let N (wold ) = ∩qi=1 Ni (wold ), ρ = i=1 ρi . Note that ρ is independent of w ∈ N (w old Then, for any w + ∆ w ∈ N (w ) and V ∈ ∂G(w + ∆ w), we have 2

kV ∆ w − D [G(w); ∆ w]k2 = kG(w + ∆ w) − G(w) −

D [G(w); ∆ w]k22

q X

=

2

kVi ∆ w − D [Gi (w); ∆ w]k2 ≤

i=1 q X i=1

kGi (w + ∆ w) − Gi (w) −

q X i=1

4

ρ2i k∆ wk2 ,

D [Gi (w); ∆ w]k22



q X i=1

ρ2i k∆ wk42 .

In other words, (33) is satisfied. Thus, we only need to show that every component of G satisfies the lemma, and that the intersection of these neighborhoods Ni (wold ) in the lemma is nonempty. Next, we give the neighborhood and ρ in the above lemma for each component of G. 1. min(a, b, c) Without loss of generality, we assume aold ≤ bold ≤ cold .   When aold < bold ≤ cold , we let the neighborhood N (aold , bold , cold ) of aold , bold , cold satisfy the following condition: for all (a, b, c) ∈ N (aold , bold , cold ) , a < b, a < c.  When aold = bold < cold, we let the neighborhood N (aold , bold , cold ) satisfy the condition: for any (a, b, c) ∈ N (aold , bold , cold ) , a < c and b < c.  When aold = bold = cold , we set N (aold , bold , cold ) = R3 . It is easy to verify that ρ = 0. When a is replaced with zT Rz and ∆ a replaced with ∆ z, we get ρ = 2. 2. Projection functions [·]+ and [·]− When aold < 0, we let N (aold ) = (−∞, 0). When aold > 0, we let N (aold ) = (0, +∞). When aold = 0, we let N (aold ) = R. Then it is easy to see that ρ = 0. 3. The Fischer-Burmeister function φ(λi , ωi ) old 2

old 2



> 0, we require each element in N (λi It is easily verified that ρ = √ old 24 old 2 .

When λi

+ωi

a

+b

17

old



, ωiold )

p satisfy λ2i + ωi2 ≥

1 2

q 2 2 λi old + ωi old .

  When λi old = ωi old = 0, we let N (λi old , ωi old ) = R2 . Hence ρ = 0.

4. Other Functions T It is easy to see that ρ = 2 and the neighborhood Pn is the whole space for zi Rzi and zi zi . All the other T operators employed by G are ωi Rzi − Ai y − ci , i=1 λi Ai zi − b, (zi )0 − 1, or some of their variants, which are either linear or in the form λDx, with λ an unknown scalar, x an unknown vector, and D a given matrix. So the neighborhood N (wold ) for these operators in lemma 4.1 is the whole space and ρ = 21 kDk2 . Now we proceed to prove Theorem 4.1. Proof: We first consider perturbations of b and c. The perturbed operator is Gnew = Gold − (∆ c; ∆ b; 0; 0). Obviously, for any w, ∂Gnew (w) = ∂Gold (w). Therefore, Gnew is Lipschitz near wold , and each element in ∂Gnew (wold ) is nonsingular by assumption. Let ¯ its closure. By [8, Lemma 1, Lemma 2 in Chapter 7], there B indicate the open Euclidean unit ball, and B

exist positive scalars δ and r, such that for any w ∈ wold + rB and V ∈ ∂Gnew (w), V −1 2 ≤ 1δ . And if ¯ kGnew (w1 ) − Gnew (w2 )k ≥ δ kw1 − w2 k . Replacing r by lr with 0 ≤ l ≤ 1 in w1 and w2 lie in wold + rB, 2 2 the proof of [8, Lemma 3 in Chapter 7.1], we have that Gnew (wold + lrB) contains Gnew (wold ) + ( 12 lrδ)B. δ , 1r ), where We can always find 0 < l∗ ≤ 1/2, such that wold + 2l∗ rB ⊆ N (wold ) and l∗ ≤ min( 2ρr 1 ∗ old N (w ) and ρ are defined in the previous lemma. Suppose k(∆ c; ∆ b)k2 < 2 l rδ. Then the new problem has a solution, designated as wnew , contained in wold + l∗ rB. Next, we use induction to prove the Q-quadratic of the sequence (3) to wnew from wold .

k convergence

old new ∗ new Apparently, w − w 2 < l r. Assume w − w 2 < l∗ r. Then

k



w − wold ≤ wk − wnew + wk − wold < 2l∗ r. 2 2 2 Thus, wk ∈ N (wold ). Similar to the proof of [29, Theorem 3.2], by the triangular inequality, we get



k+1

−1

w − wnew 2 = wk − wnew − V k Gnew (wk ) 2

 

k −1  new k new new ≤ V G (w ) − G (w ) − D Gnew (wnew ); wk − wnew 2

2  new new  ρ

k −1  k k new k new + V V (w − w ) − D G (w ); w − w

≤ 2 wk − wnew 2 . δ 2

k

The last inequality is due to lemma 4.1. By assumptions on w − wnew 2 and l∗ , we have

k+1

2 ρ ρ

w − wnew 2 ≤ 2 wk − wnew 2 < 2 (l∗ r)2 ≤ l∗ r. δ δ

Now we add perturbation of A.  Since Ax = (A+∆ A)x−∆ Ax, we have Gnew (wold )−Gold (wold ) = − ∆ c − ∆ AT yold ; ∆ Axold − ∆ b; 0 . Note that after perturbations, ρ in lemma 4.1 may be changed, but not N (wold ). Also observe that only the perturbation of A may affect the value of ρ, and ρ depends linearly on A by (4) of the proof of lemma 4.1. So there exists ν1 > 0, such that when k∆ Ak2 ≤ ν1 , we have ρnew ≤ 2ρold . By the uppersemicontinuousness of ∂Gnew (property 2 in §§1.1), V −1 2 ≤ δ1 for every V ∈ ∂Gnew (w), and the Banach perturbation lemma (Lemma (1.1)), we get that there exists a positive number ν2 , so that

when k∆ Ak2 ≤ ν2 , for any w ∈ wold + 2r B, and any V ∈ ∂Gnew (w), we have V −1 2 ≤ 2δ . Therefore, Gnew (wold + 12 lrB) contains Gnew (wold ) + 81 lrδB. Let ν = min(ν1 , ν2 ). Assume k∆ Ak2 ≤ ν. Then as the proof for the perturbation of b and c, let l∗ ≤ min too.

5

δ , 1, 1 4ρold r r 2

, and υ =

1 ∗ 8 l rδ,

we get Q-quadratic convergence rate of the sequence (3) to wnew ,

Globalization

Iterates generated by pure Newton’s method converge only locally. Let $G$ denote any of the nonlinear equation systems in §3. Denote $w \stackrel{\rm def}{=} (\omega; \lambda; z; y)$. Set $\Psi(w) \stackrel{\rm def}{=} \frac{1}{2} G(w)^T G(w)$; then a root of $G$ is the same thing as a global minimizer of $\Psi$ [10]. Unfortunately, $\Psi(\cdot)$ may not be differentiable everywhere

for our systems. As is discussed in the Introduction, simply replacing the gradient in the smooth case by some generalized gradient in the Armijo line search rule fails to work for nonsmooth optimization. In this section, we use two schemes to circumvent the problem. In §§5.1 we extend the Armijo-type gradient descent method to almost everywhere differentiable functions by perturbation. In §§5.2, we reformulate the directional derivative equations (2) for our reformulations into some linear complementarity problems.
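For concreteness, the merit function and the gradient used in the regions where it is differentiable can be set up as follows (a sketch; `G` and its Jacobian `JG` stand for any of the reformulations of §3 and are assumed to be supplied elsewhere, and the toy system at the end is hypothetical).

```python
import numpy as np

def make_merit(G, JG):
    """Psi(w) = 0.5 * G(w)^T G(w) and, where G is differentiable,
    grad Psi(w) = JG(w)^T G(w)."""
    def Psi(w):
        g = G(w)
        return 0.5 * float(g @ g)
    def grad_Psi(w):
        return JG(w).T @ G(w)
    return Psi, grad_Psi

# Hypothetical toy system with root (1, 2): G(w) = (w0 - 1, min(w1 - 2, w0)).
G = lambda w: np.array([w[0] - 1.0, min(w[1] - 2.0, w[0])])
JG = lambda w: np.array([[1.0, 0.0],
                         [0.0, 1.0] if w[1] - 2.0 <= w[0] else [1.0, 0.0]])
Psi, grad_Psi = make_merit(G, JG)
print(Psi(np.array([1.0, 2.0])))   # 0.0 at a root of G
```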

5.1 Perturbed Armijo-type Gradient Descent Method

In this part, we first introduce our perturbed Armijo line search rule for minimizing continuous and almost everywhere differentiable functions. Note that locally Lipschitz functions, hence semismooth functions, are almost everywhere differentiable. Then, we give convergence results for our method. Our line search strategy is readily extendable to other line search rules, such as the nonmonotone line search rule [19]. The set of points at which $\Psi$ is differentiable is dense in the domain. Therefore, if $\Psi$ is nondifferentiable at the intended starting point, one can always find an initial point that is arbitrarily close to it and at which $\Psi$ is differentiable.

The Perturbed Armijo Rule. Choose constants $0 < s \le 1$, $0 < \sigma < 1$, $\beta \in (0, 1)$, $\gamma \in (\beta, 1)$. For each $k \ge 0$, assume $\Psi$ is differentiable at $w^k$.

1. Set $\alpha_{k,0} = s$, $i = 0$. Let $\Delta w^k$ be the search direction.

2. Find the smallest nonnegative integer $l$ for which

    $\Psi(w^k) - \Psi(w^k + \beta^l \alpha_{k,i} \Delta w^k) \ge -\sigma \beta^l \alpha_{k,i} \nabla\Psi(w^k)^T \Delta w^k.$

3. If $\Psi$ is nondifferentiable at $w^k + \beta^l \alpha_{k,i} \Delta w^k$, find $t \in [\gamma, 1)$ so that $\Psi$ is differentiable at $w^k + t \beta^l \alpha_{k,i} \Delta w^k$, set $\alpha_{k,i+1} = t \beta^l \alpha_{k,i}$, $i + 1 \to i$, and go to step 2; otherwise, set $\alpha_k = \beta^l \alpha_{k,i}$, $w^{k+1} = w^k + \alpha_k \Delta w^k$, $k + 1 \to k$.
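As an illustration, here is a minimal sketch of one application of the rule in Python. The callables Psi, grad_Psi and the differentiability test is_differentiable are placeholders for whatever an implementation provides; the test itself, and the choice t = gamma, are assumptions of this sketch (in practice t is chosen so that 1 − t is near machine precision, cf. Remark 5.1 below).

def perturbed_armijo(Psi, grad_Psi, is_differentiable, w, dw,
                     s=1.0, sigma=1e-4, beta=0.5, gamma=0.9, max_backtracks=60):
    # One application of the perturbed Armijo rule (sketch).
    # Psi: merit function; grad_Psi: its gradient (Psi assumed differentiable at w);
    # is_differentiable: hypothetical test for differentiability of Psi at a point.
    slope = grad_Psi(w) @ dw            # directional derivative grad Psi(w)^T dw
    alpha = s
    while True:
        # Step 2: find the smallest l giving sufficient decrease.
        l = 0
        while Psi(w) - Psi(w + beta**l * alpha * dw) < -sigma * beta**l * alpha * slope:
            l += 1
            if l > max_backtracks:
                raise RuntimeError("line search failed")
        trial = beta**l * alpha
        # Step 3: accept if Psi is differentiable at the trial point,
        # otherwise shrink by t in [gamma, 1) and restart step 2.
        if is_differentiable(w + trial * dw):
            return trial, w + trial * dw
        alpha = gamma * trial           # t = gamma is one admissible choice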

Remark 5.1 In step 3, t can be chosen so that 1 − t is on the order of machine precision, since the set of points at which $\Psi$ is nondifferentiable has measure zero.

Next, we modify the convergence analysis of the gradient method with the Armijo rule (see [7, p. 38, Prop. 1.2.1]) to give the following results.

Proposition 5.1 Suppose that the sequences $\{w^k\}_{k=1}^\infty$ and $\{\Delta w^k\}_{k=1}^\infty$ are bounded. Then $\{w^k\}_{k=1}^\infty$ has limit points. Furthermore, assume that for each subsequence $\{w^{k_i}\}_{i=1}^\infty$ converging to a nonstationary point of $\Psi$,

    $\limsup_{i \to \infty} \nabla\Psi(w^{k_i})^T \Delta w^{k_i} < 0.$    (34)

Then each limit point of $\{w^k\}_{k=1}^\infty$ is a stationary point of $\Psi$.

Proof: Since the sequence $\{w^k\}_{k=1}^\infty$ is bounded, it has limit points. Let $\tilde{w}$ be a limit point of $\{w^k\}$. If $\tilde{w}$ is not a stationary point of $\Psi$, let $\{w^{k_i}\}_{i=1}^\infty$ be a subsequence converging to $\tilde{w}$. Because $\Psi$ is continuous, $\Psi(w^{k_i})$ converges to $\Psi(\tilde{w})$. By the definition of the perturbed Armijo rule, $\Psi(w^k) - \Psi(w^{k+1}) \ge -\sigma \alpha_k \nabla\Psi(w^k)^T \Delta w^k$ for $k = 1, 2, \ldots$. Assume $\tilde{w}$ is not a stationary point. By (34),

    $\alpha_{k_i} \nabla\Psi(w^{k_i})^T \Delta w^{k_i} \to 0.$    (35)

From (34) and (35), we have $\alpha_{k_i} \to 0$. That means there exists $p > 0$ such that for all $i \ge p$ the stepsize is reduced at least once. By the definition of the perturbed Armijo rule, for all $i \ge p$ and some $\varrho_{k_i} \in [\frac{1}{\beta}, \frac{1}{\gamma\beta}]$:

    $\Psi(w^{k_i}) - \Psi(w^{k_i} + \alpha_{k_i} \varrho_{k_i} \Delta w^{k_i}) < -\sigma \alpha_{k_i} \varrho_{k_i} \nabla\Psi(w^{k_i})^T \Delta w^{k_i},$    (36)

and $\Psi$ is differentiable at $w^{k_i} + \alpha_{k_i} \varrho_{k_i} \Delta w^{k_i}$. From (34), one can relabel $\{w^{k_i}\}$ as a subsequence with $\Delta w^{k_i} \ne 0$ if necessary. Denote

    $d^{k_i} \stackrel{\text{def}}{=} \frac{\Delta w^{k_i}}{\|\Delta w^{k_i}\|_2}, \qquad \bar{\alpha}_{k_i} \stackrel{\text{def}}{=} \alpha_{k_i} \varrho_{k_i} \|\Delta w^{k_i}\|_2.$

Since $\|\Delta w^{k_i}\|_2$ is bounded, $\bar{\alpha}_{k_i} \to 0$. Note that (36) can be written as: for all $i \ge p$,

    $\frac{\Psi(w^{k_i}) - \Psi(w^{k_i} + \bar{\alpha}_{k_i} d^{k_i})}{\bar{\alpha}_{k_i}} < -\sigma \nabla\Psi(w^{k_i})^T d^{k_i}.$    (37)

By the chain rule, similar to the proof of Lemma 4.1, we get that there exist $h > 0$, $\rho > 0$, so that for all $\|w' - \tilde{w}\|_2 \le h$ and $\|w'' - \tilde{w}\|_2 \le h$, we have $\|\Psi(w'') - \Psi(w') - D[\Psi(w'); w'' - w']\|_2 \le \rho \|w'' - w'\|_2^2$.

From $w^{k_i} \to \tilde{w}$, $\bar{\alpha}_{k_i} \to 0$, $\|d^{k_i}\|_2 = 1$, we get that there exists $q \ge p$ so that for all $i > q$, we have $\|w^{k_i} - \tilde{w}\|_2 \le \frac{h}{2}$ and $\bar{\alpha}_{k_i} \le \frac{h}{2}$. Hence $\|w^{k_i} + \bar{\alpha}_{k_i} d^{k_i} - \tilde{w}\|_2 \le h$. Therefore,

    $\frac{\Psi(w^{k_i}) - \Psi(w^{k_i} + \bar{\alpha}_{k_i} d^{k_i})}{\bar{\alpha}_{k_i}} \ge -\nabla\Psi(w^{k_i})^T d^{k_i} - \bar{\alpha}_{k_i} \rho \|d^{k_i}\|_2^2.$    (38)

Combining (37) and (38), we get

    $-\frac{\bar{\alpha}_{k_i} \rho \|d^{k_i}\|_2^2 \|\Delta w^{k_i}\|_2}{1 - \sigma} < \nabla\Psi(w^{k_i})^T \Delta w^{k_i}.$    (39)

From $\|d^{k_i}\|_2 = 1$, $\Delta w^{k_i}$ being bounded, and $\bar{\alpha}_{k_i} \to 0$, the left-hand side of (39) converges to zero, which contradicts (34).

Corollary 5.1 At the kth iteration, if the Jacobian of $G$ is nonsingular, let $\Delta w^k = -\nabla G(w^k)^{-1} G(w^k)$; otherwise, let $\Delta w^k = -\left(\nabla G(w^k)^T \nabla G(w^k) + c_k I\right)^{-1} \nabla G(w^k)^T G(w^k)$. Assume that $\{w^k\}_{k=1}^\infty$ and $\{\Delta w^k\}_{k=1}^\infty$ are bounded. Suppose $c_k \to 0$. Then the sequence $\{w^k\}_{k=1}^\infty$ generated by the gradient method with the perturbed Armijo rule has limit points, which are stationary points of $\Psi$.

5.2 Search Direction Determined by the Directional Derivative Equations

In this part, we show how to solve (2) for our reformulations in §3. Since the set of points at which $\Psi$ is nondifferentiable has measure zero for any of our systems, the probability of hitting these points is zero; consequently, the probability of applying the scheme in this section within the algorithm is zero. Therefore, the expected total computation does not include the work below. For simplicity, we drop the iteration number k.

5.2.1 The Min Function

We reformulate (2) with $G$ being the operator defined by (17) in this part. We divide the index set $\{1, 2, \ldots, n\}$ into the following five separate subsets based on complementarity:

    $L_0 \stackrel{\text{def}}{=} \{i : \min(\lambda_i, \omega_i, z_i^T R z_i) \text{ is differentiable}\},$
    $L_{\lambda\omega} \stackrel{\text{def}}{=} \{i : \lambda_i = \omega_i < z_i^T R z_i\},$
    $L_{\lambda z} \stackrel{\text{def}}{=} \{i : \lambda_i = z_i^T R z_i < \omega_i\},$
    $L_{\omega z} \stackrel{\text{def}}{=} \{i : \omega_i = z_i^T R z_i < \lambda_i\},$
    $L_{\lambda\omega z} \stackrel{\text{def}}{=} \{i : \lambda_i = \omega_i = z_i^T R z_i\}.$

It is easy to verify that the directional derivatives are

    $D\left[\min(\lambda_i, \omega_i, z_i^T R z_i); (\Delta\lambda_i; \Delta\omega_i; \Delta z_i)\right] =
    \begin{cases}
    \min(\Delta\lambda_i, \Delta\omega_i) & i \in L_{\lambda\omega}, \\
    \min(\Delta\lambda_i, 2 z_i^T R \Delta z_i) & i \in L_{\lambda z}, \\
    \min(\Delta\omega_i, 2 z_i^T R \Delta z_i) & i \in L_{\omega z}, \\
    \min(\Delta\lambda_i, \Delta\omega_i, 2 z_i^T R \Delta z_i) & i \in L_{\lambda\omega z}.
    \end{cases}$
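As a sketch, the piecewise formula can be evaluated blockwise as follows; R is passed in as the matrix used in the reformulation, and the tie tolerance tol is an assumption of this illustration rather than part of the original description.

import numpy as np

def dmin_block(lam, omega, z, dlam, domega, dz, R, tol=1e-12):
    # Directional derivative of min(lam, omega, z^T R z) in the direction
    # (dlam, domega, dz), covering the cases L0, L_lw, L_lz, L_wz, L_lwz (sketch).
    q = float(z @ (R @ z))            # z^T R z
    dq = 2.0 * float(z @ (R @ dz))    # directional derivative of z^T R z
    vals = np.array([lam, omega, q])
    derivs = [dlam, domega, dq]
    active = np.isclose(vals, vals.min(), atol=tol)   # arguments attaining the minimum
    if active.sum() == 1:             # differentiable case (i in L0)
        return derivs[int(np.argmax(active))]
    return min(d for d, a in zip(derivs, active) if a)  # min over the tied arguments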

Let pj , qj , lj be determined by (18). Then a solution (∆ ω; ∆ λ; ∆ y; ∆ z) to the following linear complementarity problem is equivalent to a solution of (2) for (17). In addition, the linear complementarity terms in the formulation below only appear at blocks where min(λi , ωi , zTi Rzi ) is nondifferentiable.

    $R z_j \Delta\omega_j + \omega_j R \Delta z_j - A_j^T \Delta y = A_j^T y + c_j - \omega_j R z_j$    $(j \in L)$
    $\sum_{i \in L} (\Delta\lambda_i A_i z_i + \lambda_i A_i \Delta z_i) = b - \sum_{i \in L} \lambda_i A_i z_i$
    $\Delta (z_j)_0 = 1 - (z_j)_0$    $(j \in L)$
    $p_j \Delta\omega_j + q_j \Delta\lambda_j + 2 l_j z_j^T R \Delta z_j = -\min(\lambda_j, \omega_j, z_j^T R z_j)$    $(j \in L_0)$
    $u_j = \Delta\lambda_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad v_j = \Delta\omega_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad u_j \ge 0,\ v_j \ge 0,\ u_j v_j = 0$    $(j \in L_{\lambda\omega})$
    $u_j = \Delta\lambda_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad w_j = 2 z_j^T R \Delta z_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad u_j \ge 0,\ w_j \ge 0,\ u_j w_j = 0$    $(j \in L_{\lambda z})$
    $v_j = \Delta\omega_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad w_j = 2 z_j^T R \Delta z_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad v_j \ge 0,\ w_j \ge 0,\ v_j w_j = 0$    $(j \in L_{\omega z})$
    $u_j = \Delta\lambda_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad v_j = \Delta\omega_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad w_j = 2 z_j^T R \Delta z_j + \min(\lambda_j, \omega_j, z_j^T R z_j),\quad u_j \ge 0,\ v_j \ge 0,\ w_j \ge 0,\ u_j v_j w_j = 0$    $(j \in L_{\lambda\omega z})$

5.2.2 The Formulation Based on Jordan Decomposition

In this part, we show how to solve (2) for $G$ being the operator of (22). Define the index sets

    $L_\lambda \stackrel{\text{def}}{=} \{i \in L : \lambda_i = 0\}, \qquad L_\omega \stackrel{\text{def}}{=} \{i \in L : \omega_i = 0\}.$

Note that $D([0]_+; \Delta a) = [\Delta a]_+$ and $D([0]_-; \Delta a) = [\Delta a]_-$. Let $u_j^s = [\Delta s_j]_+$, $v_j^s = -[\Delta s_j]_-$ $(j \in L_s,\ s \in \{\lambda, \omega\})$. Then the Jordan decompositions of $\Delta\lambda_j$ and $\Delta\omega_j$ are the following:

    $\Delta\lambda_j = u_j^\lambda - v_j^\lambda \quad (j \in L_\lambda), \qquad \Delta\omega_j = u_j^\omega - v_j^\omega \quad (j \in L_\omega),$

where

    $u_j^\omega v_j^\omega = 0, \quad u_j^\omega \ge 0, \quad v_j^\omega \ge 0 \quad (j \in L_\omega),$
    $u_j^\lambda v_j^\lambda = 0, \quad u_j^\lambda \ge 0, \quad v_j^\lambda \ge 0 \quad (j \in L_\lambda).$
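As a tiny sketch (a helper of our own, not from the paper), the split of a scalar step into these two nonnegative parts, which satisfy the complementarity condition by construction:

def jordan_split(ds):
    # Split ds into u = [ds]_+ and v = -[ds]_- so that ds = u - v, u >= 0, v >= 0, u*v = 0.
    u = max(ds, 0.0)
    v = max(-ds, 0.0)
    return u, v

# Example: jordan_split(-0.3) returns (0.0, 0.3).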

Let $p_j$ and $q_j$ $(j \in L)$ be as defined in (23). Then (2) with respect to (22) can be reformulated as:

    $u_j^\omega R z_j - A_j^T \Delta y = c_j + A_j^T y$    $(j \in L_\omega)$
    $\Delta\omega_j p_j R z_j + [\omega_j]_+ R \Delta z_j - A_j^T \Delta y = c_j + A_j^T y - [\omega_j]_+ R z_j$    $(j \in L \setminus L_\omega)$
    $\sum_{i \in L_\lambda} u_i^\lambda A_i z_i + \sum_{i \in L \setminus L_\lambda} (\Delta\lambda_i q_i A_i z_i + [\lambda_i]_+ A_i \Delta z_i) = b - \sum_{i \in L} [\lambda_i]_+ A_i z_i$
    $\Delta (z_j)_0 = 1 - (z_j)_0$    $(j \in L)$
    $(1 - p_j) \Delta\omega_j + (1 - q_j) \Delta\lambda_j + 2 z_j^T R \Delta z_j = -z_j^T R z_j - [\omega_j]_- - [\lambda_j]_-$    $(j \in L \setminus (L_\lambda \cup L_\omega))$
    $(1 - p_j) \Delta\omega_j - v_j^\lambda + 2 z_j^T R \Delta z_j = -z_j^T R z_j - [\omega_j]_- - [\lambda_j]_-$    $(j \in L_\lambda \setminus L_\omega)$
    $-v_j^\omega + (1 - q_j) \Delta\lambda_j + 2 z_j^T R \Delta z_j = -z_j^T R z_j - [\omega_j]_- - [\lambda_j]_-$    $(j \in L_\omega \setminus L_\lambda)$
    $-v_j^\omega - v_j^\lambda + 2 z_j^T R \Delta z_j = -z_j^T R z_j - [\omega_j]_- - [\lambda_j]_-$    $(j \in L_\omega \cap L_\lambda)$
    $u_j^\omega v_j^\omega = 0, \quad u_j^\omega \ge 0, \quad v_j^\omega \ge 0$    $(j \in L_\omega)$
    $u_j^\lambda v_j^\lambda = 0, \quad u_j^\lambda \ge 0, \quad v_j^\lambda \ge 0$    $(j \in L_\lambda)$

This is again a linear complementarity problem, and complementarity terms appear only at blocks whose indices are in $L_\lambda \cup L_\omega$.

5.2.3 General Complementarity Functions and the Q Method

In this part, we show how to solve (2) when $G$ is the operator defining any of the systems in §§3.3 and §§3.4. When $\lambda_j = 0$, $D[|\lambda_j|; \Delta\lambda_j] = |\Delta\lambda_j|$. If $\Delta\lambda_j$ only appears in (14b) of (2), we assume $\Delta\lambda_j \ge 0$ when $\lambda_j = 0$, since the sign of $\Delta\lambda_j$ does not affect the result in this case. Thus, we can replace $D[|\lambda_j|; \Delta\lambda_j]$ by $\Delta\lambda_j$ and add the constraint $\Delta\lambda_j \ge 0$. Complementarity conditions in (2) represented by compositions of absolute value functions, min functions, and $[\cdot]_+$ and $[\cdot]_-$ can be reformulated as linear complementarity problems, in the same way as those for (17) and (22) in the previous parts, at points where they are nondifferentiable.

The Fischer-Burmeister function $\phi$ is nondifferentiable at $(0, 0)$, with $D[\phi(0,0); (\Delta a, \Delta b)] = \sqrt{\Delta a^2 + \Delta b^2} - \Delta a - \Delta b$. Therefore, $\phi(0,0) + D[\phi(0,0); (\Delta a, \Delta b)] = 0$ is equivalent to $\Delta a \cdot \Delta b = 0$. Along with the nonnegativity constraints $a + \Delta a \ge 0$, $b + \Delta b \ge 0$, we get that (2) in this case amounts to $\Delta a \cdot \Delta b = 0$, $\Delta a \ge 0$, $\Delta b \ge 0$.

In conclusion, (2) at points where it is nondifferentiable can be formulated as a linear complementarity problem whose complementarity constraints appear only at the blocks whose corresponding components of $G$ are nondifferentiable there.
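A small sketch illustrating this equivalence numerically (the function names are ours):

import math

def fb(a, b):
    # Fischer-Burmeister function: phi(a, b) = sqrt(a^2 + b^2) - a - b.
    return math.hypot(a, b) - a - b

def fb_dir_deriv_at_origin(da, db):
    # D[phi(0,0); (da, db)] = sqrt(da^2 + db^2) - da - db, i.e. phi evaluated at (da, db).
    return fb(da, db)

# phi(0,0) + D[phi(0,0); (da, db)] vanishes exactly when da*db = 0 and da, db >= 0:
print(fb_dir_deriv_at_origin(0.7, 0.0))   # 0.0 (complementary, nonnegative direction)
print(fb_dir_deriv_at_origin(0.3, 0.4))   # approx -0.2 (da*db != 0, so the value is nonzero)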

6 Numerical Experiments

We have implemented the semismooth Newton's algorithm for (14) with different choices of NCP-functions and line search rules. Our initial computational results show that, generally, the numbers of function evaluations and iterations are fewer for the nonmonotone line search if the initial point is far from the optimum, but the numbers are almost the same for the two line search strategies when the algorithm starts close to the optimum. We present some test results with the perturbed nonmonotone line search below.

The algorithm. Let $G$ denote the left-hand side of any of the reformulations in §3. Given positive numbers $\varepsilon$, steptol, itlimit, and conlimit, the algorithm terminates at an $\varepsilon$-solution of (14), when the stepsize is less than steptol, or when the iteration number exceeds itlimit.

Do while $\|G\|_\infty \ge \varepsilon$, $\|w^{k+1} - w^k\|_\infty \ge$ steptol, and $k \le$ itlimit:

1. Calculate $\nabla G(w^k)$, and estimate its condition number.

2. If $\nabla G(w^k)$ is singular or its estimated condition number is larger than conlimit, perform Tikhonov regularization,

    $\Delta w = -\left(\nabla G(w^k)^T \nabla G(w^k) + c_k I\right)^{-1} \nabla G(w^k)^T G(w^k);$

otherwise, let

    $\Delta w = -\nabla G(w^k)^{-1} G(w^k).$

3. Do a line search to determine the step size $\alpha$.

4. Let $w^{k+1} = w^k + \alpha \Delta w$; $k + 1 \to k$.

We use the suggested parameter values in [10]. On the PC running the code, the machine precision is $\tau = 2.220446 \times 10^{-16}$; so we set $\varepsilon = \tau^{1/3} = 6.055454 \times 10^{-6}$, steptol $= \tau^{2/3} = 3.666853 \times 10^{-11}$, itlimit $= 100$.
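A minimal sketch of this loop, assuming callables G and JG for the residual and its (generalized) Jacobian and a line_search routine such as the perturbed Armijo rule of §§5.1; the choice of the regularization weight ck below is only one simple way to let it tend to zero, and the value of conlimit is illustrative — both are our assumptions.

import numpy as np

def semismooth_newton(G, JG, w0, line_search=None, eps=6.055454e-6,
                      steptol=3.666853e-11, itlimit=100, conlimit=1e12):
    # Damped (semismooth) Newton iteration with a Tikhonov-regularized fallback (sketch).
    w = np.asarray(w0, dtype=float)
    for _ in range(itlimit):
        g = G(w)
        if np.linalg.norm(g, np.inf) < eps:        # eps-solution reached
            break
        J = JG(w)
        if np.linalg.cond(J) > conlimit:           # (near-)singular Jacobian
            ck = np.linalg.norm(g)                 # c_k -> 0 as the residual vanishes
            dw = -np.linalg.solve(J.T @ J + ck * np.eye(J.shape[1]), J.T @ g)
        else:
            dw = -np.linalg.solve(J, g)
        alpha = 1.0 if line_search is None else line_search(w, dw)
        step = alpha * dw
        if np.linalg.norm(step, np.inf) < steptol: # stepsize too small to continue
            break
        w = w + step
    return w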

6.1 Example 1

Our first computational example is the Euclidean SMT problem from [35, example 1]. SMT problems arise from facility location, network design, and component placement on circuit boards. The SMT problem is to find a shortest network spanning a set of given points, called regular points. The solution is always a tree, called a Steiner minimal tree (SMT), including some additional vertices, called Steiner points. Assume that the number of regular points is N; then there are at most N − 2 Steiner points, and the degree of each Steiner point is at most 3. A tree whose vertices include just the N given regular points and N − 2 Steiner points, with the degree of each Steiner point being 3, is called a full Steiner topology of the N regular points. In [35], the problem of computing the shortest network under a known full Steiner topology in the Euclidean plane is formulated as an SOCP model, which is then solved approximately by an interior point method. Their model is the following.

Denote by $p \stackrel{\text{def}}{=} 2N - 3$ the number of edges and by $q \stackrel{\text{def}}{=} 2N - 4$ the total number of coordinates of the Steiner points. Let

    $b = \begin{pmatrix} -\mathbf{1}_p \\ \mathbf{0}_q \end{pmatrix}, \qquad c = \begin{pmatrix} (0; c_1) \\ (0; c_2) \\ \vdots \\ (0; c_p) \end{pmatrix}, \qquad A^T = \begin{pmatrix} -1 & & & A_1^T \\ & \ddots & & \vdots \\ & & -1 & A_p^T \end{pmatrix} \in \mathbb{R}^{3p \times (p+q)},$

written in block form, where the $i$th block row has $-1$ in column $i$ of the first $p$ columns and $A_i^T \in \mathbb{R}^{2 \times q}$ in the last $q$ columns; $A_i^T$ is a row of $(N - 2)$ $2 \times 2$ block matrices. The edges are ordered so that each of the first N edges connects a regular point to a Steiner point. For $i = 1, \ldots, N$, $c_i$ is the coordinates of regular point $i_1$, where $i_1$ is the index of the regular point on the $i$th edge; the only nonzero block of $A_i^T$ is the $i_2$nd, which is the identity matrix of order 2, where $i_2$ is the index of the Steiner point on the $i$th edge. For $i = N + 1, \ldots, p$, $c_i = 0$. Let the indices of the two Steiner points on the $i$th edge be $i_1$ and $i_2$. Then the $i_1$st block of $A_i^T$ is $-I_2$, the $i_2$nd block of $A_i^T$ is $I_2$, and the remaining blocks of $A_i^T$ are zero. Therefore, the SMT problem is to find $y$ satisfying the dual SOCP:

    max  $b^T y$
    s.t. $A^T y + s = c$,    (40)
         $s \ge_Q 0$.

We give the coordinates of the regular points from [35, example 1] in Table 1.

Table 1: The coordinates of the 10 regular points in [35, example 1]

index          x             y        index          x             y
  9      2.30946900   9.20821100      14      7.59815200   0.61583600
 10      0.57736700   6.48093800      15      8.56812900   3.07917900
 11      0.80831400   3.51906200      16      4.75750600   3.75366600
 12      1.68591200   1.23167200      17      3.92609700   7.00879800
 13      4.11085500   0.82111400      18      7.43649000   7.68328400

The IPM given in [35] is a feasible method, i.e., all the iterates satisfy the linear constraints. Their accuracy measure is the duality gap. Their algorithm takes 23 iterations to reduce the duality gap to $6.5 \times 10^{-9}$. Our starting points are infeasible, and our accuracy is measured by $\|G\|_\infty \le 6.0 \times 10^{-6}$, which includes both the duality gap and the infeasibility. Examples 1.0-1.1 show that fewer iterations are needed for our algorithm, based on system (17), (22), or (29), to "warm start" the problem than for the algorithm in [35].

Our initial values are as follows. For $i = 1, \ldots, p$, $y_i^0$ is the length of the $i$th edge; $y_{p+1:p+q}^0$ are the coordinates of the Steiner points. For (17) and (22), we set $\omega_i^0 = y_i^0$, $\lambda_i^0 = 1$, $z_i^0 = (1; 0; 0)$. For (29), we set $(\omega_i^0)_1 = 0$, $(\omega_i^0)_2 = y_i^0$, $(\lambda_i^0)_1 = (\lambda_i^0)_2 = 0.5$, $z_i^0 = A_i^T y_{p+1:p+q}^0 - c_i$.

Example 1.0 We tried to solve [35, example 1] using the initial Steiner points given in Table 2. The output is summarized in 'Example 1.0' of Figure 1. We also plot the results from [35] there for comparison. In the figure, the x-axis represents the iteration number, and the y-axis is $\log_{10}$ of $\|G\|_\infty$ for (17), (22), (29), and $\log_{10}$ of $x^T z$ for the algorithm in [35]. The figure shows the locally Q-quadratic convergence rate of our algorithm. Our initial network cost is 25.559761, larger than that at iteration 7 in [35], i.e., 25.48266; but our network cost at the 7th iteration of (17), the 5th of (22), and the 13th of (29) equals 25.356068, the same as that at the 23rd — the last — iteration in [35]. In other words, the amount of network cost reduced by our algorithm in 7, 5, and 13 iterations, respectively, is larger than that reduced by the method of [35] in 16 iterations near an optimum.

[Figure 1: Example 1. Two panels, 'Example 1.0' and 'Example 1.1'. In each panel the x-axis is the iteration number and the y-axis shows the digits of the gap: $\log_{10}$ of $\|G\|_\infty$ for 'min' (17), 'Jordan' (22), and 'Q' (29), and $\log_{10}$ of the duality gap for 'Xue Ye' [35].]

Table 2: Initial coordinates of the 8 Steiner points in example 1.0

index   x-coordinate   y-coordinate     index   x-coordinate   y-coordinate
  1         0.6            6.5            5         7.2            1.8
  2         0.8            3.5            6         5.2            2.1
  3         1.7            1.2            7         2.5            7.5
  4         4.1            0.8            8         3.9            7.0

Table 3: The coordinates of the 10 regular points in example 1.1

index          x             y        index          x             y
  9      2.06225265   9.06259293      14      7.55387796   0.97892289
 10      0.82034497   6.63177002      15      8.92332597   3.05143468
 11      1.24810704   3.85186112      16      5.04443039   3.90964814
 12      1.65588987   1.36153760      17      3.42613689   6.64003516
 13      3.66904285   0.86330140      18      7.43136476   7.22161716

Example 1.1 To test re-optimization, we perturbed each coordinate of the regular points by a scalar in (−0.5, 0.5); see Table 3. Starting from the solution to example 1.0, Newton's method for each of our formulations (17), (22), and (29) found an optimal solution in 4 iterations; see 'Example 1.1' of Figure 1.

Example 1.2 In this example, we set point 9 to (2.5, 9.0). Starting from the old solution to example 1.0, we got an optimal solution in 2 iterations using our algorithm on each of our formulations (17), (22), and (29).

6.2 Example 2

This example consists of randomly generated SOCP problems. The results are summarized in Table 4. Each of our randomly generated problems has ten 10-dimensional SOC blocks. Let 'b' indicate that a block is in bd Q, 'o' that it is zero, and 'i' that it is in int Q. The types of the blocks of a primal solution are [b, o, i, b, b, i, o, o, b, b]; those of a dual solution are [b, i, o, b, b, o, i, b, b, b]. Note that strict complementarity is not satisfied at the 8th block. The number of linear constraints is m = 33. According to [10], the line search Newton direction is independent of the scaling of the dependent or the independent variables. So we set each element of A and y in our randomly generated problems in the range (−1, 1), and we let the first element of each nonzero block of a primal or a dual solution be 1 and its remaining elements be random numbers.

24

Table 4: Output of example 2

 tp   sys    succ   initial gap   final gap    it num   fun eval
  1   (17)   100      0.708798   -14.091132     1.06      2.06
      (22)   100      0.708798   -14.023286     1.00      2.00
      (29)   100      0.708798   -14.114863     1.00      2.00
  2   (17)   100     -0.468689    -7.392421     3.60      4.63
      (22)   100     -0.468689    -8.334704     4.07      5.15
      (29)   100     -0.468689    -7.735485     6.75     16.16
  3   (17)    97     -0.719331    -7.772418     8.41     20.92
      (22)    98     -0.719363    -7.578901    10.32     33.66
      (29)    98     -0.719363    -7.638747    15.44     77.89
  4   (17)   100     -1.675727    -7.835776     3.66      6.30
      (22)   100     -1.676413    -7.879695     4.11      8.28
      (29)   100     -1.676288    -7.314917     7.77     26.85
  5   (17)   100     -0.443995    -7.691475     6.11     10.61
      (22)   100     -0.443996    -7.316729     7.16     17.41
      (29)   100     -0.443996    -7.401255    11.04     45.70
  6   (17)    99      0.040911    -7.873853     4.50      9.19
      (22)   100      0.045377    -7.943580     4.53      6.23
      (29)   100      0.045377    -7.659177     6.70     15.61
  7   (17)    99      0.255684    -7.401473     7.45     20.31
      (22)    99      0.253370    -7.834493     7.27     12.42
      (29)    99      0.254746    -7.694833     9.61     25.11

We randomly generated 100 instances of the problem and solved them by our algorithm for (17), (22), and (29). We used the starting point y = 0, with x and s set to the optimal solution. The output is summarized in the block rows 'type 1' of Table 4. All the instances converge to optimal solutions.

The block rows 'types 2-7' of Table 4 summarize various perturbed problems, as described below (a code sketch of these perturbations is given at the end of this subsection). We randomly generated 100 instances of each perturbed problem and used the solution to the original problem as the starting point.

Type 2: each element of $\Delta b$ is in $\left(-\frac{\|b\|_2}{m}, \frac{\|b\|_2}{m}\right)$.

Type 3: each element of $\Delta c$ is in $\left(-\frac{\|c\|_2}{N}, \frac{\|c\|_2}{N}\right)$, where $N = \sum_{i=1}^n N_i$.

Type 4: each element of $\Delta A$ is in $\left(-\frac{\|A\|_F}{mN}, \frac{\|A\|_F}{mN}\right)$.

Type 5: each element of $\Delta A$ is in $0.8\left(-\frac{\|A\|_F}{mN}, \frac{\|A\|_F}{mN}\right)$, each element of $\Delta b$ is in $\left(-\frac{\|b\|_2}{m}, \frac{\|b\|_2}{m}\right)$, and each element of $\Delta c$ is in $0.5\left(-\frac{\|c\|_2}{N}, \frac{\|c\|_2}{N}\right)$.

Type 6: we add a constraint.

Type 7: we delete the last block.

The second column of Table 4 is the reformulation from which the output in that row is obtained. The column 'succ' shows the number of instances whose $\|G\|_\infty$ is reduced below $\tau^{1/3}$. The columns 'initial gap' and 'final gap' list the average values of $\log_{10} \|G\|_\infty$ at the initial point and at the final solution of a solved instance, respectively. The column 'it num' shows the average number of iterations per solved instance. The column 'fun eval' shows the average number of function evaluations per solved instance. Confirming the analysis in §5, over all the instances the total number of perturbations in the line search to avoid nondifferentiable points is only six. Of our three reformulations, the average number of iterates generated by the semismooth Newton's method based on (17) is the fewest; that based on (29) is the most. For perturbations of types 1, 2, 4, and 6, fewer than 4 iterations are needed to re-optimize an SOCP problem by the semismooth Newton's method based on (17); for types 3, 5, and 7, fewer than 8 iterations. The cost of each iteration of our algorithm is no more than that of an IPM.
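For concreteness, a sketch of drawing perturbations of the stated magnitudes for types 2-5; the uniform distribution is our assumption, since the text specifies only the intervals.

import numpy as np

def perturb(A, b, c, kind, rng=None):
    # Draw (Delta A, Delta b, Delta c) for perturbation types 2-5 (sketch).
    rng = np.random.default_rng() if rng is None else rng
    m, N = A.shape
    dA = np.zeros_like(A); db = np.zeros_like(b); dc = np.zeros_like(c)
    rb = np.linalg.norm(b) / m               # half-width for Delta b entries
    rc = np.linalg.norm(c) / N               # half-width for Delta c entries
    rA = np.linalg.norm(A, 'fro') / (m * N)  # half-width for Delta A entries
    if kind == 2:
        db = rng.uniform(-rb, rb, size=b.shape)
    elif kind == 3:
        dc = rng.uniform(-rc, rc, size=c.shape)
    elif kind == 4:
        dA = rng.uniform(-rA, rA, size=A.shape)
    elif kind == 5:
        dA = 0.8 * rng.uniform(-rA, rA, size=A.shape)
        db = rng.uniform(-rb, rb, size=b.shape)
        dc = 0.5 * rng.uniform(-rc, rc, size=c.shape)
    return A + dA, b + db, c + dc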


7 Discussions

We summarize the properties of our algorithm below.

The total numbers of variables and equations of (14) are about half of those of other systems. Without regularization, the Jacobian at each iteration has the same sparsity pattern, so one can apply techniques for solving large-scale sparse equations to it. Besides, due to the special structure of the nonlinear complementarity reformulation, only a reduced system of (14) needs to be solved (see [9]). To further lessen the work of each iteration while keeping the desired convergence rate, one may use modified Newton's methods, such as periodic or quasi-Newton methods.

The optimal solution to an old problem can be used as an initial point for a new one via Newton-type methods. Furthermore, one can use the decomposition of the Jacobian at the old solution, if it is available, to solve the linear system within Newton-type methods.

For the local convergence analysis, strict complementarity is not needed for (14). Under primal-dual nondegeneracy and some other conditions, its Jacobian is regular at the optimum; hence the solution is numerically stable and accurate.

Because of their Q-quadratic convergence rate and the absence of restrictions on starting points, Newton-type methods are better for re-optimization than IPMs are; but starting from a point far from an optimum, sequences generated by Newton-type methods may be trapped at a local minimum or saddle point of the merit function. So to "cold start" a problem, we suggest using a hybrid algorithm: start with an IPM until the primal and dual infeasibilities and the duality gap are small, then switch to a Newton-type method.

We have extended the above results about the normal cone, perturbation analysis, etc., to semidefinite programming, symmetric cone programming, and p-cone programming. Preliminary numerical results show that, as for SOCP, Newton-type methods are good for "warm starting" these models.

Comparisons of our reformulations. Each of our reformulations of (1) is derived with a different technique. In practice, one may choose the reformulation that best suits the real-world problem, since the different systems differ in computation time, in the regularity assumptions on the generalized Jacobian at the optimum, in the nonsmooth points of the merit functions, etc. See [15] for a comparison of some NCP-functions. The regularity assumptions for (17) and (22) are the same, but they are different from those for (29) (cf. Theorems 3.1, 3.2, 3.4). The nonsmooth points of the merit functions for different reformulations are also different. The merit function is continuously differentiable if the Fischer-Burmeister function is used to formulate the complementarity constraints of (27). Our numerical experiments on (17), (22), and (29) show that the semismooth Newton's method based on (17) is the fastest and that based on (22) is faster than that based on (29).

Acknowledgments

I thank Rutgers University for a dissertation fellowship. I would also like to express my gratitude to my Ph.D. advisor, Professor Farid Alizadeh. This work is supported in part through his research grants — the U.S. National Science Foundation Grant #NSF-CCR-0306558 and Office of Naval Research contract #N00014-03-1-0042. Discussions with him motivated the paper and encouraged me to improve its quality. I also thank Professor Paul Tseng for a restatement of the proof of the Lagrangian multipliers, and him and Professor Jonathan Eckstein for comments on my perturbed line search scheme in a previous draft. Finally, I would like to thank the two anonymous referees for their suggestions and comments, which helped improve the presentation of the paper, especially one referee, for numerous suggestions and comments and for pointing out references [18, 33].

References

[1] I. Adler and F. Alizadeh. Primal-dual interior point algorithms for convex quadratically constrained and semidefinite optimization problems. Technical Report RRR 46-95, RUTCOR, Rutgers University, 1995.

[2] F. Alizadeh and D. Goldfarb. Second-order cone programming. Math. Program., 95(1, Ser. B):3-51, 2003.

[3] F. Alizadeh and S. H. Schmieta. Optimization with semidefinite, quadratic and linear constraints. Technical Report RRR 23-97, RUTCOR, Rutgers University, 1997.

[4] Farid Alizadeh. Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optim., 5(1):13-51, 1995.

[5] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 6, New York, NY, USA, 2004. ACM Press.

[6] Hande Y. Benson and Robert J. Vanderbei. Solving problems with semidefinite and related constraints using interior-point methods for nonlinear programming. Math. Program., 95(2, Ser. B):279-302, 2003.

[7] Dimitri P. Bertsekas. Nonlinear Programming, 2nd edition. Athena Scientific, 1999.

[8] Frank H. Clarke. Optimization and Nonsmooth Analysis. John Wiley & Sons Inc., New York, 1983. A Wiley-Interscience Publication.

[9] Tecla De Luca, Francisco Facchinei, and Christian Kanzow. A theoretical and numerical comparison of some semismooth algorithms for complementarity problems. Comput. Optim. Appl., 16(2):173-205, 2000.

[10] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall Inc., Englewood Cliffs, N.J., 1983.

[11] Yu. G. Evtushenko and V. A. Purtov. Sufficient conditions for a minimum for nonlinear programming problems. Dokl. Akad. Nauk SSSR, 278(1):24-27, 1984.

[12] Jacques Faraut and Adam Korányi. Analysis on Symmetric Cones. The Clarendon Press, Oxford University Press, New York, 1994. Oxford Science Publications.

[13] M. C. Ferris and J. S. Pang. Engineering and economic applications of complementarity problems. SIAM Rev., 39(4):669-713, 1997.

[14] A. Fischer. A special Newton-type optimization method. Optimization, 24(3-4):269-284, 1992.

[15] Andreas Fischer and Houyuan Jiang. Merit functions for complementarity and related problems: a survey. Comput. Optim. Appl., 17(2-3):159-182, 2000.

[16] Masao Fukushima, Zhi-Quan Luo, and Paul Tseng. Smoothing functions for second-order-cone complementarity problems. SIAM J. Optim., 12(2):436-460 (electronic), 2001.

[17] M. R. Garey, R. L. Graham, and D. S. Johnson. The complexity of computing Steiner minimal trees. SIAM J. Appl. Math., 32(4):835-859, 1977.

[18] Jacek Gondzio and Andreas Grothey. Reoptimization with the primal-dual interior point method. SIAM J. Optim., 13(3):842-864 (electronic) (2003), 2002.

[19] L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton's method. SIAM J. Numer. Anal., 23(4):707-716, 1986.

[20] Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Applications of second-order cone programming. Linear Algebra Appl., 284(1-3):193-228, 1998. ILAS Symposium on Fast Algorithms for Control, Signals and Image Processing (Winnipeg, MB, 1997).

[21] Irvin J. Lustig, Roy E. Marsten, and David F. Shanno. Computational experience with a globally convergent primal-dual predictor-corrector algorithm for linear programming. Math. Programming, 66(1, Ser. A):123-135, 1994.

[22] O. L. Mangasarian. Equivalence of the complementarity problem to a system of nonlinear equations. SIAM J. Appl. Math., 31(1):89-92, 1976.


[23] Robert Mifflin. Semismooth and semiconvex functions in constrained optimization. SIAM J. Control Optimization, 15(6):959-972, 1977.

[24] John E. Mitchell and Michael J. Todd. Solving combinatorial optimization problems using Karmarkar's algorithm. Math. Programming, 56(3, Ser. A):245-284, 1992.

[25] J.-S. Pang and L. Qi. A globally convergent Newton method for convex SC^1 minimization problems. J. Optim. Theory Appl., 85(3):633-648, 1995.

[26] Jong-Shi Pang. Newton's method for B-differentiable equations. Math. Oper. Res., 15(2):311-341, 1990.

[27] E. Polak and D. Q. Mayne. Algorithm models for nondifferentiable optimization. SIAM J. Control Optim., 23(3):477-491, 1985.

[28] Li Qun Qi. Convergence analysis of some algorithms for solving nonsmooth equations. Math. Oper. Res., 18(1):227-244, 1993.

[29] Li Qun Qi and Jie Sun. A nonsmooth version of Newton's method. Math. Programming, 58(3, Ser. A):353-367, 1993.

[30] Liqun Qi and Houyuan Jiang. Semismooth Karush-Kuhn-Tucker equations and convergence analysis of Newton and quasi-Newton methods for solving these equations. Math. Oper. Res., 22(2):301-325, 1997.

[31] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, N.J., 1970. Princeton Mathematical Series, No. 28.

[32] Philip Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. Math. Programming Stud., (3):145-173, 1975.

[33] Stephen J. Wright. Primal-Dual Interior-Point Methods. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.

[34] Yu Xia and Farid Alizadeh. The Q method for second-order cone programming. Computers & Operations Research, 2006. doi:10.1016/j.cor.2006.08.009, available on-line Nov. 2006.

[35] Guoliang Xue and Yinyu Ye. An efficient algorithm for minimizing a sum of Euclidean norms with applications. SIAM J. Optim., 7(4):1017-1036, 1997.

[36] E. Alper Yildirim and Stephen J. Wright. Warm-start strategies in interior-point methods for linear programming. SIAM J. Optim., 12(3):782-810, 2002.
