Math. Program., Ser. B (2011) 129:331–355 DOI 10.1007/s10107-011-0471-1 FULL LENGTH PAPER

A block coordinate gradient descent method for regularized convex separable optimization and covariance selection Sangwoon Yun · Paul Tseng · Kim-Chuan Toh

Received: 29 March 2010 / Accepted: 23 February 2011 / Published online: 11 June 2011 © Springer and Mathematical Optimization Society 2011

Abstract We consider a class of unconstrained nonsmooth convex optimization problems, in which the objective function is the sum of a convex smooth function on an open subset of matrices and a separable convex function on a set of matrices. This problem includes the covariance selection problem that can be expressed as an $\ell_1$-penalized maximum likelihood estimation problem. In this paper, we propose a block coordinate gradient descent method (abbreviated as BCGD) for solving this class of nonsmooth separable problems with the coordinate block chosen by a Gauss-Seidel rule. The method is simple, highly parallelizable, and suited for large-scale problems. We establish global convergence and, under a local Lipschitzian error bound assumption, linear rate of convergence for this method. For the covariance selection problem, the method can terminate in $O(n^3/\epsilon)$ iterations with an $\epsilon$-optimal solution. We compare the performance of the BCGD method with the first-order methods proposed by Lu (SIAM J Optim 19:1807–1827, 2009; SIAM J Matrix Anal Appl 31:2000–2016, 2010) for solving the covariance selection problem on randomly generated instances. Our numerical experience suggests that the BCGD method can be efficient for large-scale covariance selection problems with constraints.

Keywords Block coordinate gradient descent · Complexity · Convex optimization · Covariance selection · Global convergence · Linear rate convergence · $\ell_1$-penalization · Maximum likelihood estimation

Mathematics Subject Classification (2000) 49M27 · 49M37 · 65K05 · 90C06 · 90C25 · 90C30 · 90C55

In memory of Paul Tseng, who went missing while on a kayaking trip along the Jinsha river, China, on August 13, 2009, for his friendship and contributions to large-scale continuous optimization.

S. Yun (✉) Korea Institute for Advanced Study, 207-43 Cheongnyangni 2-dong, Dongdaemun-gu, Seoul 130-722, Korea e-mail: [email protected]
P. Tseng Department of Mathematics, University of Washington, Seattle, WA 98195, USA
K.-C. Toh Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076, Singapore e-mail: [email protected]

1 Introduction

It is well known that undirected graphical models offer a way to describe and explain the relationships among a set of variables, a central element of multivariate data analysis. Given $n$ variables drawn from a Gaussian distribution $N(0, C)$ for which the true covariance matrix $C \succ 0_n$ is unknown, we wish to estimate $C$ from a sample covariance matrix $S \succeq 0_n$ by maximizing its log-likelihood. Of particular interest is imposing a certain sparsity on the inverse covariance matrix $C^{-1}$. This is known as the covariance selection (estimation) problem. Let $S^n$ be the set of $n \times n$ symmetric matrices. d'Aspremont, Banerjee, and El Ghaoui [3] formulated this covariance selection problem as follows:

$$\max_{X \in S^n} \Big\{ \log\det X - \langle S, X\rangle - \rho \sum_{i,j=1}^{n} |X_{ij}| \;:\; \alpha I \preceq X \preceq \beta I \Big\}, \tag{1}$$

where $\rho > 0$ is a given parameter controlling the trade-off between log-likelihood and sparsity, and $0 \le \alpha < \beta \le \infty$ are bounds on the eigenvalues of the solution. They proposed a first-order method developed in [10] for solving (1) and a block coordinate descent method, in which a solution of a box constrained quadratic program with $n-1$ variables is required at each iteration, for solving (1) and its dual problem when $\alpha = 0$ and $\beta = +\infty$:

$$\min_{X \in S^n} \big\{ -\log\det X - n \;:\; |(X - S)_{ij}| \le \rho,\ i, j = 1, \dots, n \big\}. \tag{2}$$

(Note that in the above objective function, $\log\det X$ is defined to be $-\infty$ whenever $X$ is not positive definite.) The box constrained quadratic program arises because only one column (and the corresponding row) is updated at each iteration, and is solved by using either an interior point method or the optimal first-order method in [9]. Lu [7] proposed to solve (1) by applying a variant of Nesterov's smooth method [10] to the following generalization of (2):

$$\min_{Z \in S^n} \big\{ h^*(Z) \;:\; |(Z - S)_{ij}| \le \rho,\ i, j = 1, \dots, n \big\}, \tag{3}$$


where $h^*$ denotes the conjugate function of $h(\cdot)$, which is defined by $h(X) = -\log\det X$ if $\alpha I \preceq X \preceq \beta I$ and $h(X) = \infty$ otherwise. Since $h$ is strictly convex, $h^*$ is essentially smooth [12, Section 26] with $\mathrm{dom}\,h^* = S^n$ if $\beta < \infty$ and $\mathrm{dom}\,h^* = S^n_{++}$ if $\beta = \infty$, where $S^n_{++}$ is the set of $n \times n$ symmetric positive definite matrices. Moreover, $h^*(Z)$ and $\nabla h^*(Z)$ can be easily calculated from the spectral decomposition of $Z$. The saddle function format for the problem (1) is different from the one considered in [3]. Friedman, Hastie, and Tibshirani [4] proposed to solve (2) by applying a block coordinate descent method, in which a solution of an $\ell_1$-penalized linear least squares problem with $n-1$ variables is required at each iteration. Dahl, Vandenberghe, and Roychowdhury [2] studied the maximum likelihood estimation of a Gaussian graphical model whose conditional independence is known, and they formulated the problem as follows:

$$\max_{X \in S^n} \big\{ \log\det X - \langle S, X\rangle \;:\; X_{ij} = 0,\ \forall (i,j) \in V \big\}, \tag{4}$$

where $V$ is the collection of all pairs of conditionally independent nodes. We note that $(i,i) \notin V$ for $1 \le i \le n$ and $(i,j) \in V$ if and only if $(j,i) \in V$. They proposed a Newton method to solve (4) when the underlying graph is nearly chordal. During the writing of this paper, Scheinberg and Rish [15] proposed a greedy algorithm based on a coordinate ascent method for solving (1) with $\alpha = 0$ and $\beta = +\infty$. Yuan [20] proposed an alternating direction method for solving (1). Lu [8] considered estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known in advance. The problem in [8] can be formulated as the following constrained $\ell_1$-penalized maximum likelihood estimation problem:

$$\max_{X \in S^n} \Big\{ \log\det X - \langle S, X\rangle - \sum_{(i,j) \notin V} \rho_{ij} |X_{ij}| \;:\; X_{ij} = 0,\ \forall (i,j) \in V \Big\}, \tag{5}$$

where $V$ is as in (4) and $\{\rho_{ij}\}_{(i,j) \notin V}$ is a set of nonnegative parameters. Lu proposed adaptive first-order (adaptive spectral projected gradient and adaptive Nesterov's smooth) methods for the problem (5). His methods consist of solving a sequence of unconstrained penalty problems of the following form:

$$\max_{X \in S^n} \; \log\det X - \langle S, X\rangle - \sum_{i,j} \rho_{ij} |X_{ij}|,$$

where the set of moderate penalty parameters $\{\rho_{ij}\}_{(i,j) \in V}$ is adaptively adjusted until a desired approximate solution is found. We note that (1) with $\alpha = 0$ and $\beta = +\infty$ is a special case of (5) with $\rho_{ij} = \rho$, $V = \emptyset$, and (4) is a special case of (5) with $\rho_{ij} = 0$, $\forall (i,j) \notin V$. The dual problem of (5) is given as follows:

$$\min_{X \in S^n} \big\{ -\log\det X - n \;:\; |(X - S)_{ij}| \le \upsilon_{ij},\ i, j = 1, \dots, n \big\}, \tag{6}$$


where $\upsilon_{ij} = \rho_{ij}$ for all $(i,j) \notin V$ and $\upsilon_{ij} = \infty$ for all $(i,j) \in V$. We note that (2) is a special case of (6) with $\rho_{ij} = \rho$ and $V = \emptyset$. Recently, Wang, Sun, and Toh [18] proposed a Newton-CG primal proximal point algorithm (which employs the essential ideas of the proximal point algorithm, the Newton method, and the preconditioned conjugate gradient solver) for solving log-determinant optimization problems that include the problems (1) with $\alpha = 0$ and $\beta = +\infty$, (2), (4), (5), and (6). In this paper, we consider a more general unconstrained nonsmooth convex optimization problem of the form:

$$\min_{X \in \Re^{m \times n}} \; F(X) \overset{\mathrm{def}}{=} f(X) + P(X), \tag{7}$$

where $\Re^{m \times n}$ is the set of $m \times n$ matrices, $f$ is real-valued and convex smooth (i.e., continuously differentiable) on its domain $\mathrm{dom}\,f = \{X \in \Re^{m \times n} \mid f(X) < \infty\}$, which is assumed to be open, and $P : \Re^{m \times n} \to (-\infty, \infty]$ is a proper, convex, lower semicontinuous (lsc) [12] function. We assume that $\mathrm{dom}\,F \ne \emptyset$. Of particular interest is when $P$ is separable, i.e.,

$$P(X) = \sum_{i,j=1}^{n} P_{ij}(X_{ij}), \tag{8}$$

for some proper, convex, lsc functions $P_{ij} : \Re \to (-\infty, \infty]$. More generally, $P$ can be block-separable; see (23). The problem (5) is a special case of (7) and (8) with

$$f(X) = -\log\det X + \langle S, X\rangle, \qquad P_{ij}(X_{ij}) = \begin{cases} \rho_{ij} |X_{ij}| & \text{if } (i,j) \notin V; \\ \delta_{\{0\}}(X_{ij}) & \text{otherwise}, \end{cases} \qquad X \in S^n, \tag{9}$$

where $\delta_Q$ denotes the indicator function for a set $Q$. The problem (6) is a special case of (7) and (8) with

$$f(X) = -\log\det X, \qquad P_{ij}(X_{ij}) = \begin{cases} 0 & \text{if } L_{ij} \le (X - S)_{ij} \le U_{ij}; \\ \infty & \text{otherwise}, \end{cases} \qquad X \in S^n, \tag{10}$$

where $L_{ij} = -\rho_{ij}$, $U_{ij} = \rho_{ij}$ for all $(i,j) \notin V$ and $L_{ij} = -\infty$, $U_{ij} = \infty$ for all $(i,j) \in V$. In all above examples, $\mathrm{dom}\,f = S^n_{++}$. The problem (3) is a special case of (7) and (8) with $f \equiv h^*$, $P_{ij}$ given as in (10) with $\rho_{ij} = \rho$, $V = \emptyset$, and $Z \in S^n$.

The main contributions of this paper are three-fold. First, we extend the BCGD method proposed in [17] to solve (7), in which $\mathrm{dom}\,f$ is not necessarily $\Re^{m \times n}$. Hence the analysis of the global convergence and the linear convergence rate requires a modification of the Lipschitz continuity assumption on $\nabla f$ used in [17] and a boundedness assumption on the level set associated with the initial value, even though $\Re^{m \times n}$ can be identified with $\Re^{mn}$; see Assumptions 2 and 3. Some care is also needed to choose the stepsize so as to ensure that the updated point is in the subset of $\mathrm{dom}\,f$. Second, we analyze the iteration complexity of the BCGD method with the Gauss-Seidel rule; see Theorem 3. When the BCGD method is applied to solve covariance selection problems (1) with $\alpha = 0$ and $\beta = +\infty$, (5) with $\rho_{ij} > 0$ for $(i,j) \notin V$, and the dual problems (2), (6) with $\rho_{ij} > 0$ for $(i,j) \notin V$, the method can terminate in $O(n^3/\epsilon)$ iterations with an $\epsilon$-optimal solution. Third, we demonstrate its effectiveness on large-scale covariance selection problems. When the BCGD method is applied to solve (5) with $V \ne \emptyset$, we do not need to solve a sequence of unconstrained penalty problems as was done in [8]. This is the advantage of our method over the adaptive first-order methods in [8]. The key advantage of our BCGD method over the methods in [8] or [18] for solving a problem such as (3) lies in the fact that there is no need to perform eigenvalue decompositions or matrix-matrix multiplications in each iteration of our algorithm. Thus our algorithm typically would require less memory than the other competing methods, and hence is able to solve larger problems.

The paper is organized as follows. In Sect. 2, we describe the BCGD method and analyze the convergence properties and the complexity of the BCGD method. In Sect. 3, for the regularized log-likelihood problem, i.e., $f(X) = -\log\det(X) + \langle S, X\rangle$, we study the boundedness of the level set associated with the initial point and the Lipschitz continuity of $\nabla f$ on this level set. We also analyze the complexity of the BCGD method when it is applied to solve covariance selection problems. In Sect. 4, we describe an implementation of our method and report our numerical experience in solving large-scale covariance selection problems with or without constraints on randomly generated instances. Also, we compare our method with first-order methods in [7,8]. We give our conclusions in Sect. 5.

In our notation, $N = \{(i,j) \mid i, j = 1, \dots, n\}$. For any $X, Y \in \Re^{m \times n}$, $\langle X, Y\rangle = \mathrm{trace}(XY^T)$ and $\|X\|_F = \sqrt{\langle X, X\rangle}$. For simplicity, we write $\|X\| = \|X\|_F$. $X_J$ denotes the submatrix of $X$ comprising $X_{ij}$ for $(i,j) \in J \subseteq N$ (we note that we cannot use $X_J$ for every subset $J$ of $N$) and $J(X)$ denotes the element of $\Re^{m \times n}$ obtained by setting the entries of $X$ not indexed by $J$ to zero. A linear mapping $\mathcal{H} : \Re^{m \times n} \to \Re^{m \times n}$ is self-adjoint if $\mathcal{H} = \mathcal{H}^*$, where $\mathcal{H}^*$ denotes the adjoint of $\mathcal{H}$, and is positive definite if $\langle X, \mathcal{H}(X)\rangle > 0$ whenever $X \ne 0$. The identity linear mapping is denoted by $\mathcal{I}$. We let $\lambda_{\min}(\mathcal{H}) = \min_{\|X\|=1} \langle X, \mathcal{H}(X)\rangle$ and $\lambda_{\max}(\mathcal{H}) = \max_{\|X\|=1} \langle X, \mathcal{H}(X)\rangle$. The identity matrix is denoted by $I$. For any $A, B \in S^n$, we write $A \succeq B$ (respectively, $A \succ B$) to mean that $A - B$ is positive semidefinite (respectively, positive definite). Unless otherwise specified, $\{X^k\}$ denotes the sequence $X^0, X^1, \dots$, and $\{X^k\}_{\mathcal{K}}$ denotes a subsequence $\{X^k\}_{k \in \mathcal{K}}$ with $\mathcal{K} \subseteq \{0, 1, \dots\}$.

2 Block coordinate gradient descent method

In this section, we describe the block coordinate gradient descent method for solving (7) and analyze the global convergence and the asymptotic convergence rate of the BCGD method, analogous to those obtained in [17], when the (restricted) Gauss-Seidel rule is used. Also, we give an upper bound on the number of iterations needed for the BCGD method to achieve $\epsilon$-optimality. In our method, we use the gradient $\nabla f$ to build a quadratic approximation of $f$ and apply coordinate descent to generate an improving direction $D = D_{\mathcal{H}}(X; J)$, where


$$D_{\mathcal{H}}(X; J) \overset{\mathrm{def}}{=} \arg\min_{D \in \Re^{m \times n}} \Big\{ \langle \nabla f(X), D\rangle + \tfrac{1}{2} \langle D, \mathcal{H}(D)\rangle + P(X + D) \;:\; D_{ij} = 0,\ \forall (i,j) \notin J \Big\} \tag{11}$$

with $J \subseteq N$ and a self-adjoint positive definite linear mapping $\mathcal{H}$ (approximating the Hessian $\nabla^2 f(X)$), at $X$. By the separability of $P$, if $\mathcal{H}(D) = (H_{ij} D_{ij})_{ij}$ where $H \in S^n$ has $H_{ij} > 0$ for all $i, j \in \{1, \dots, n\}$, then $(D_{\mathcal{H}}(X; N))_{ij}$, the $(i,j)$th entry of $D_{\mathcal{H}}(X; N)$, can be easily computed. In particular,

– if $P$ is given by (8) with $P_{ij}$ as in (10), then $(D_{\mathcal{H}}(X; N))_{ij} = \mathrm{median}\Big\{ L_{ij} - X_{ij},\ -\dfrac{(\nabla f(X))_{ij}}{H_{ij}},\ U_{ij} - X_{ij} \Big\}$;
– if $P(X) = \sum_{i,j} \rho_{ij} |X_{ij}|$, then $(D_{\mathcal{H}}(X; N))_{ij} = -\mathrm{median}\Big\{ \dfrac{(\nabla f(X))_{ij} - \rho_{ij}}{H_{ij}},\ X_{ij},\ \dfrac{(\nabla f(X))_{ij} + \rho_{ij}}{H_{ij}} \Big\}$.

We now describe formally the block coordinate gradient descent method.

BCGD method: Choose $X^0 \in \mathrm{dom}\,F$. For $k = 0, 1, 2, \dots$, generate $X^{k+1}$ from $X^k$ according to the iteration:
Step 1. Choose a nonempty $J^k \subseteq N$ and a self-adjoint positive definite $\mathcal{H}^k$.
Step 2. Solve (11) with $X = X^k$, $J = J^k$ and $\mathcal{H} = \mathcal{H}^k$ to obtain $D^k = D_{\mathcal{H}^k}(X^k; J^k)$.
Step 3. Choose a stepsize $\alpha^k > 0$ and set $X^{k+1} = X^k + \alpha^k D^k$.

For the stepsize rule, we use the following adaptation of the Armijo rule.

Armijo rule: Choose $\alpha^k_{\mathrm{init}} > 0$ such that $X^k + \alpha^k_{\mathrm{init}} D^k \in \mathrm{dom}\,f$ and let $\alpha^k$ be the largest element of $\{\alpha^k_{\mathrm{init}} \beta^j\}_{j=0,1,\dots}$ satisfying

$$F(X^k + \alpha^k D^k) \le F(X^k) + \alpha^k \sigma \Delta^k, \tag{12}$$

where $0 < \beta < 1$, $0 < \sigma < 1$, and

$$\Delta^k \overset{\mathrm{def}}{=} \langle \nabla f(X^k), D^k\rangle + P(X^k + D^k) - P(X^k). \tag{13}$$
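As an illustration of the two closed-form expressions for the entries of the direction stated before the BCGD method, the following Python sketch evaluates a single entry for a diagonal mapping. The function names, the scalar interface, and the generic bounds `lo` and `hi` on the updated entry are assumptions made for this illustration and are not taken from the authors' implementation.

```python
from statistics import median


def direction_box(x, g, h, lo, hi):
    """Entry of D for a box penalty.

    Minimizes g*d + 0.5*h*d**2 subject to lo <= x + d <= hi, with h > 0.
    """
    return median([lo - x, -g / h, hi - x])


def direction_l1(x, g, h, rho):
    """Entry of D for the penalty rho*|x + d|.

    Minimizes g*d + 0.5*h*d**2 + rho*abs(x + d), with h > 0 and rho >= 0.
    """
    return -median([(g - rho) / h, x, (g + rho) / h])


def soft_threshold(z, t):
    """Shrink z toward zero by t; used only to cross-check direction_l1."""
    return max(z - t, 0.0) + min(z + t, 0.0)


# Cross-check: x + d should equal the soft-thresholded gradient step.
x, g, h, rho = 2.0, 0.5, 1.0, 0.5
d = direction_l1(x, g, h, rho)
print(d, soft_threshold(x - g / h, rho / h) - x)
```

The median form and the soft-thresholding form describe the same minimizer; the cross-check at the end compares them on one hypothetical entry.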

Since $\mathcal{H}^k$ is positive definite, we see from [17, Lemma 1] and the identification of $\Re^{m \times n}$ with $\Re^{mn}$ that, for some $\bar\alpha^k \in (0, 1]$ satisfying $X^k + \bar\alpha^k D^k \in \mathrm{dom}\,f$, we have

$$F(X^k + \alpha D^k) \le F(X^k) + \alpha \Delta^k + o(\alpha) \quad \forall \alpha \in (0, \bar\alpha^k],$$

and $\Delta^k \le -\langle D^k, \mathcal{H}^k(D^k)\rangle < 0$ whenever $D^k \ne 0$. Since $0 < \sigma < 1$, this shows that $\alpha^k$ given by the Armijo rule is well defined and positive.
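The backtracking search behind the Armijo rule (12) can be sketched as follows in Python. This is a minimal sketch under stated assumptions: the function name `armijo_stepsize`, the callables `F` and `in_dom_f`, and the scalar toy instance are hypothetical, and the sketch folds the requirement that the trial point lies in the domain of $f$ into the loop instead of assuming a suitable initial stepsize has already been chosen.

```python
import math


def armijo_stepsize(F, in_dom_f, x, d, delta,
                    alpha_init=1.0, beta=0.5, sigma=0.1, max_backtracks=60):
    """Largest alpha in {alpha_init * beta**j} with x + alpha*d in dom f and
    F(x + alpha*d) <= F(x) + sigma*alpha*delta, cf. (12)."""
    fx = F(x)
    alpha = alpha_init
    for _ in range(max_backtracks):
        trial = x + alpha * d
        # The domain test comes first so F is never evaluated outside dom f.
        if in_dom_f(trial) and F(trial) <= fx + sigma * alpha * delta:
            return alpha
        alpha *= beta
    raise RuntimeError("no acceptable stepsize found")


# Scalar toy instance (hypothetical): f(x) = -log(x) + s*x on x > 0, P(x) = rho*|x|.
s, rho = 1.0, 0.5
f = lambda x: -math.log(x) + s * x
P = lambda x: rho * abs(x)
F = lambda x: f(x) + P(x)
grad_f = lambda x: -1.0 / x + s

x, d = 2.0, -2.0                          # direction obtained with a small curvature estimate
delta = grad_f(x) * d + P(x + d) - P(x)   # Delta as in (13)
alpha = armijo_stepsize(F, lambda y: y > 0.0, x, d, delta)
print(alpha, F(x + alpha * d) < F(x))
```

In this toy run the full step leaves the domain $x > 0$, so the search has to shrink the stepsize before the descent test can be evaluated.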


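We say that $X \in \Re^{m \times n}$ is a stationary point of $F$ if $X \in \mathrm{dom}\,F$ and $F'(X; D) \ge 0$ for all $D \in \Re^{m \times n}$. The following lemma gives an alternative characterization of stationarity. Its proof is nearly identical to that of [17, Lemma 2] and is omitted.

Lemma 1 For any self-adjoint positive definite $\mathcal{H}$, an $X \in \mathrm{dom}\,F$ is a stationary point of $F$ if and only if $D_{\mathcal{H}}(X; N) = 0$.

In what follows, we denote

$$\mathcal{X}^0 \overset{\mathrm{def}}{=} \{X \mid F(X) \le F(X^0)\}, \qquad \mathcal{X}^0_\varrho \overset{\mathrm{def}}{=} \big(\mathcal{X}^0 + \varrho B\big) \cap \mathrm{dom}\,P, \tag{14}$$

where $\varrho > 0$ and $B \overset{\mathrm{def}}{=} \{X \in \Re^{m \times n} \mid \|X\| \le 1\}$. For our convergence analysis and complexity analysis (Theorems 1–3), we will make the following assumption on $f$.

Assumption 1 There exist $L \ge 0$ and $\varrho > 0$ such that $\mathcal{X}^0_\varrho \subseteq \mathrm{dom}\,f$ and

$$\|\nabla f(Y) - \nabla f(Z)\| \le L \|Y - Z\| \quad \forall Y, Z \in \mathcal{X}^0_\varrho. \tag{15}$$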

Assumption 1 is satisfied by (3) with $L = \beta^2$ and $\varrho = \infty$ when $h^*$ is replaced by $f$, $P$ is given as in (10) with $\rho_{ij} = \rho$, $\forall (i,j) \notin V$ and $V = \emptyset$, and $\beta < \infty$. This is because $h$ is strongly convex with modulus $1/\beta^2$, so $\nabla h^*$ is Lipschitz continuous on $S^n$ with constant $\beta^2$; see [5]. It is also satisfied by (9) and (10) with $L = 1/((1-\omega)^2 \underline{\zeta}^2)$, $\varrho = \omega \underline{\zeta}$ for $0 < \omega < 1$ when there exists a positive constant $\underline{\zeta}$ such that $\underline{\zeta} I \preceq X$ for all $X \in \mathcal{X}^0$; see Sect. 3.

Since $\Re^{m \times n}$ can be identified with $\Re^{mn}$, by using Assumptions 2 and 3 below, we will extend the global convergence and the linear convergence rate analysis of the BCGD method in [17, Theorems 1 and 2] to the problem (7). However, unlike in [17], $\mathrm{dom}\,f$ need not be $\Re^{m \times n}$ and $\nabla f$ need not be Lipschitz continuous on $\mathrm{dom}\,P$, and so some care is needed in choosing the stepsize $\alpha$ so as to ensure that $X + \alpha D_{\mathcal{H}}(X; J)$ is in $\mathcal{X}^0_\varrho \subseteq \mathrm{dom}\,f$. Consequently, the convergence analysis presented in [17] must be modified for the present BCGD method. Next, we have a lemma concerning stepsizes satisfying the Armijo descent condition (12). Its proof is omitted since it is almost identical to that of [17, Lemma 5(b)].

Lemma 2 Under Assumption 1, for any $X \in \mathcal{X}^0$, nonempty $J \subseteq N$, and self-adjoint positive definite $\mathcal{H}$ with $\lambda_{\min}(\mathcal{H}) \ge \underline{\lambda} > 0$, we have that $D$ satisfies the descent condition

$$F(X + \alpha D) - F(X) \le \sigma \alpha \Delta, \tag{16}$$

for any $\sigma \in (0, 1)$ and any $0 \le \alpha \le \min\{1, \varrho/\|D\|, 2\underline{\lambda}(1-\sigma)/L\}$, where $\Delta = \langle \nabla f(X), D\rangle + P(X + D) - P(X)$ and $D = D_{\mathcal{H}}(X; J)$.

We say that $P$ is block-separable with respect to a nonempty $J$ if

$$P(X) = P_J(X_J) + P_{J^c}(X_{J^c}) \quad \forall X \in \Re^{m \times n}$$


for some proper, convex, lsc functions $P_J$ and $P_{J^c}$. For the convergence of the BCGD method, the index set $J^k$ must be chosen judiciously. As in [17], global convergence can be ensured by choosing $J^k$ in a Gauss-Seidel manner, i.e., $J^0, J^1, \dots$ collectively cover $N$ for every $T$ consecutive iterations, where $T \ge 1$ (also see [16] and references therein), i.e.,

$$J^k \cup J^{k+1} \cup \cdots \cup J^{k+T-1} = N, \quad k = 0, 1, \dots \tag{17}$$
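The covering condition (17) can be checked mechanically for a given sequence of index sets. The Python sketch below does this for a cyclic sweep in which each block is one column together with the corresponding row; the helper names and the 0-based indexing are assumptions made for illustration only.

```python
def column_block(j, n):
    """Index set {(i, j), (j, i) | i = 0, ..., n-1}: column j and the matching row."""
    return {(i, j) for i in range(n)} | {(j, i) for i in range(n)}


def covers_every_window(blocks, n, T):
    """Check that every T consecutive blocks have union N, cf. (17)."""
    full = {(i, j) for i in range(n) for j in range(n)}
    return all(
        set().union(*blocks[k:k + T]) == full
        for k in range(len(blocks) - T + 1)
    )


n = 4
blocks = [column_block(k % n, n) for k in range(3 * n)]  # cyclic sweep over columns
print(covers_every_window(blocks, n, T=n), covers_every_window(blocks, n, T=n - 1))
```

With this choice of blocks a window of $n$ consecutive iterations covers $N$, whereas a window of $n-1$ iterations misses the diagonal entry of the column that was skipped.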

For our convergence rate analysis, we need a more restrictive choice of $J^k$; specifically, there exists a subsequence $\mathcal{T} \subseteq \{0, 1, \dots\}$ such that

$$0 \in \mathcal{T}, \qquad N = \text{disjoint union of } J^k, J^{k+1}, \dots, J^{\tau(k)-1} \quad \forall k \in \mathcal{T}, \tag{18}$$

where $\tau(k) \overset{\mathrm{def}}{=} \min\{k' \in \mathcal{T} \mid k' > k\}$. For example, we have $\tau(0) = n$ for $\mathcal{T} = \{0, n, 2n, \dots\}$. The next theorem establishes, under the following reasonable assumptions on our choice of $\mathcal{H}^k$ and the boundedness of the level set $\mathcal{X}^0$, the global convergence of the BCGD method using the generalized Gauss-Seidel rule (17) to choose $\{J^k\}$.

Assumption 2 $\underline{\lambda} \le \lambda_{\min}(\mathcal{H}^k)$ and $\lambda_{\max}(\mathcal{H}^k) \le \bar\lambda$ for all $k$, where $0 < \underline{\lambda} \le \bar\lambda$.

Assumption 3 There exists a positive constant $\varpi$ such that $\|X\| \le \varpi$ for all $X \in \mathcal{X}^0$.

Note that Assumption 3 is satisfied for the problems (9) with $\rho_{ii} > 0$ for $i = 1, \dots, n$, and (10); see Sect. 3.

Theorem 1 Let $\{X^k\}$, $\{\mathcal{H}^k\}$, $\{D^k\}$ be the sequences generated by the BCGD method under Assumption 2, where $\{\alpha^k\}$ is chosen by the Armijo rule with $\inf_k \alpha^k_{\mathrm{init}} > 0$. Then the following results hold.

(a) $\{F(X^k)\}$ is nonincreasing and $\Delta^k$ given by (13) satisfies

$$-\Delta^k \ge \langle D^k, \mathcal{H}^k(D^k)\rangle \ge \underline{\lambda} \|D^k\|^2 \quad \forall k, \tag{19}$$

$$F(X^{k+1}) - F(X^k) \le \sigma \alpha^k \Delta^k \le 0 \quad \forall k. \tag{20}$$

(b) If $\{J^k\}$ is chosen by the Gauss-Seidel rule (17), $P$ is block-separable with respect to $J^k$ for all $k$, and $\sup_k \alpha^k < \infty$, then every cluster point of $\{X^k\}$ is a stationary point of $F$.

(c) If Assumptions 1 and 3 are satisfied, then we have $\inf_k \alpha^k > 0$. Furthermore, if $\lim_{k \to \infty} F(X^k) > -\infty$, then $\{\Delta^k\} \to 0$ and $\{D^k\} \to 0$.

Proof Parts (a) and (b) can be proved by using nearly the same argument as used in the proof of [17, Theorem 1 (a) and (e)].

(c) Since $\alpha^k$ is chosen by the Armijo rule (12), either $\alpha^k = \alpha^k_{\mathrm{init}}$ or else, by Lemma 2, $\alpha^k/\beta > \min\{1, \varrho/\|D^k\|, 2\underline{\lambda}(1-\sigma)/L\}$. Since $P$ is convex, there is a constant $M_p$ such that

$$\frac{\underline{\lambda}}{4} \|X\|^2 + P(X) \ge M_p \quad \forall X \in \mathrm{dom}\,P. \tag{21}$$

By (11), we have

$$\begin{aligned} 0 &\ge \langle \nabla f(X^k), D^k\rangle + \tfrac{1}{2} \langle D^k, \mathcal{H}^k(D^k)\rangle + P(X^k + D^k) - P(X^k) \\ &\ge \langle \nabla f(X^k), D^k\rangle + \frac{\underline{\lambda}}{2} \|D^k\|^2 - \frac{\underline{\lambda}}{4} \|X^k + D^k\|^2 - P(X^k) + M_p \\ &\ge \frac{\underline{\lambda}}{4} \|D^k\|^2 - \Big( \|\nabla f(X^k)\| + \frac{\underline{\lambda}}{2} \|X^k\| \Big) \|D^k\| - \frac{\underline{\lambda}}{4} \|X^k\|^2 + f(X^k) - F(X^0) + M_p, \end{aligned} \tag{22}$$

where the second inequality uses Assumption 2 and (21), and the third inequality uses the Cauchy-Schwarz inequality and (20). By the continuity of $\nabla f$ and $f$, and Assumption 3, there are constants $M_1$, $M_2$, and $M_3$ such that $\|\nabla f(X^k)\| \le M_1$, $\|X^k\| \le M_2$, $f(X^k) \ge M_3$ for all $k \ge 0$. Then, by (22),

$$\frac{\underline{\lambda}}{4} \|D^k\|^2 - \Big( M_1 + \frac{\underline{\lambda}}{2} M_2 \Big) \|D^k\| - \frac{\underline{\lambda}}{4} M_2^2 + M_3 + M_p - F(X^0) \le 0.$$

Hence there is a positive constant $\vartheta$ such that $\|D^k\| \le \vartheta$ for all $k$. This together with $\inf_k \alpha^k_{\mathrm{init}} > 0$ implies $\inf_k \alpha^k > 0$. If in addition $\lim_{k \to \infty} F(X^k) > -\infty$, then this and (20) imply $\{\Delta^k\} \to 0$, which together with (19) implies $\{D^k\} \to 0$. □

Note that Theorem 1 can readily be extended to any stepsize rule that yields a larger descent than the Armijo rule at each iteration. For example, the limited minimization rule:

$$\tilde\alpha^k \in \arg\min_{0 \le \alpha \le s} \{ F(X^k + \alpha D^k) \mid X^k + \alpha D^k \in \mathrm{dom}\,F \},$$

where $0 < s < \infty$, yields a larger descent than the Armijo rule with $s = \alpha^k_{\mathrm{init}}$. We will use this limited minimization rule in our numerical experiments on the dual covariance selection problems (2) and (6); see Sect. 4.

Let $\bar{\mathcal{X}}$ be the set of stationary points of $F$, and

$$\mathrm{dist}(X, \bar{\mathcal{X}}) = \min_{\bar X \in \bar{\mathcal{X}}} \|X - \bar X\| \quad \forall X \in \Re^{m \times n}.$$

To establish the local linear convergence of our BCGD method with a suitable choice of $\{J^k\}$, we need, in addition to Assumptions 1, 2, and 3, the following assumption, which is a local Lipschitzian error bound condition stating that the distance from $X$ to $\bar{\mathcal{X}}$ is locally of the order of $\|D_{\mathcal{I}}(X; N)\|$.

Assumption 4 $\bar{\mathcal{X}} \ne \emptyset$ and, for any $\xi \ge \min_X F(X)$, there exist scalars $\tau > 0$ and $\bar\epsilon > 0$ such that

$$\mathrm{dist}(X, \bar{\mathcal{X}}) \le \tau \|D_{\mathcal{I}}(X; N)\| \quad \text{whenever } F(X) \le \xi,\ \|D_{\mathcal{I}}(X; N)\| \le \bar\epsilon.$$

Using the notation in (18), we say that $P$ is block-separable if

$$P(X) = \sum_{\ell = k}^{\tau(k)-1} P_{J^\ell}(X_{J^\ell}) \quad \forall X \in \Re^{m \times n},\ k \in \mathcal{T} \tag{23}$$

for some proper, convex, lsc functions $P_{J^\ell}$. This block-separability is satisfied if $P(X) = \sum_\ell \omega_\ell \|X_{J^\ell}\|$ with $\omega_\ell > 0$, or $P(X) = \sum_\ell \omega_\ell \|X_{J^\ell}\|_*$ with $\omega_\ell > 0$, where $\|\cdot\|_*$ denotes the nuclear norm (i.e., the sum of singular values). The next theorem establishes, under Assumptions 1–4, the linear rate of convergence of the BCGD method using the restricted Gauss-Seidel rule (18) to choose $J^k$. We omit its proof since it is nearly identical to that of [17, Theorem 2].

Theorem 2 Let $\{X^k\}$, $\{\mathcal{H}^k\}$, $\{D^k\}$ be the sequences generated by the BCGD method satisfying Assumption 2, where $\{J^k\}$ is chosen by the restricted Gauss-Seidel rule (18) with $\mathcal{T} \subseteq \{0, 1, \dots\}$ and $\{\alpha^k\}$ is chosen by the Armijo rule with $\sup_k \alpha^k_{\mathrm{init}} \le 1$ and $\inf_k \alpha^k_{\mathrm{init}} > 0$. Suppose Assumptions 1 and 3 are satisfied, $F$ satisfies Assumption 4, and $P$ is block-separable. Then either $\{F(X^k)\} \downarrow -\infty$ or $\{F(X^k)\}_{\mathcal{T}}$ converges at least Q-linearly and $\{X^k\}_{\mathcal{T}}$ converges at least R-linearly.

The next theorem gives an upper bound on the number of iterations for the BCGD method to achieve $\epsilon$-optimality. For simplicity, we give the proof for the case when $\{J^k\}$ is chosen by the restricted Gauss-Seidel rule (18); the proof is presented in the Appendix. Extension of this result to the generalized Gauss-Seidel rule (17) can be achieved; see Remark 1. In what follows, we define $e^k \overset{\mathrm{def}}{=} F(X^k) - \inf_X F(X)$, $r^0 \overset{\mathrm{def}}{=} \max_X \{\mathrm{dist}(X, \bar{\mathcal{X}}) \mid X \in \mathcal{X}^0\}$, $t_k \overset{\mathrm{def}}{=} \max\{i \mid \tau^i(0) \le k\}$ with $\tau^i(0) = \tau(\tau^{i-1}(0))$ for a positive integer $i$ and $\tau^0(0) = 0$, and $\lceil \cdot \rceil$ denotes the ceiling function.

Theorem 3 Let $\{X^k\}$, $\{\mathcal{H}^k\}$, $\{D^k\}$ be the sequences generated by the BCGD method under Assumption 2, where $\{J^k\}$ is chosen by the restricted Gauss-Seidel rule (18) with $\mathcal{T} \subseteq \{0, 1, \dots\}$, and $\{\alpha^k\}$ is chosen by the Armijo rule with $\inf_k \alpha^k_{\mathrm{init}} > 0$ and $\sup_k \alpha^k_{\mathrm{init}} < \infty$. Suppose Assumptions 1 and 3 are satisfied, $\inf_X F(X) > -\infty$, $P$ is block-separable, and $X^k + \min\{\alpha^k_{\mathrm{init}}, \alpha^k/\beta\} D^k \in \mathcal{X}^0_\varrho$ for all $k$. If $\sqrt{e^k - e^{\tau(k)}} \ge C_2/C_1$, $\forall k > 0$, then $e^k \le \epsilon$ whenever

$$t_k \ge \left\lceil \ln\Big(\frac{e^0}{\epsilon}\Big) \left( \ln\Big(1 - \frac{1}{2C_1}\Big)^{-1} \right)^{-1} \right\rceil;$$

otherwise, $e^k \le \epsilon$ whenever

$$t_k \ge \left\lceil \frac{4 C_2^2}{\epsilon} \right\rceil,$$

where $C_1$ and $C_2$ are constants depending only on $r^0$, $\underline{\lambda}$, $\bar\lambda$, $\sigma$, $\beta$, $\inf_k \alpha^k_{\mathrm{init}}$, $\sup_k \alpha^k$, $\sup_k \|D^k\|$, and the Lipschitz constant $L_i$ of $\nabla f(\cdot)_{J^i}$ over $\mathcal{X}^0$.
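To get a feel for the two regimes in Theorem 3, the bounds can be evaluated numerically. The Python sketch below assumes the bounds read $t_k \ge \lceil \ln(e^0/\epsilon) / \ln((1 - 1/(2C_1))^{-1}) \rceil$ and $t_k \ge \lceil 4C_2^2/\epsilon \rceil$; the function names and the sample values of $C_1$, $C_2$, $e^0$ and $\epsilon$ are hypothetical and chosen only for illustration.

```python
import math


def sweeps_linear_regime(e0, eps, C1):
    """Ceiling of ln(e0/eps) / ln(1 / (1 - 1/(2*C1))); requires C1 > 1/2."""
    if C1 <= 0.5:
        raise ValueError("C1 must exceed 1/2 so that 1 - 1/(2*C1) lies in (0, 1)")
    return math.ceil(math.log(e0 / eps) / -math.log(1.0 - 1.0 / (2.0 * C1)))


def sweeps_sublinear_regime(eps, C2):
    """Ceiling of 4*C2**2 / eps."""
    return math.ceil(4.0 * C2 ** 2 / eps)


# Hypothetical constants: the first bound grows like log(1/eps), the second like 1/eps.
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, sweeps_linear_regime(1.0, eps, C1=1.0), sweeps_sublinear_regime(eps, C2=1.0))
```

Both functions count values of $t_k$, that is, completed passes over $N$ under rule (18), not individual block updates.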


Remark 1 Extension of Theorem 3 to the generalized Gauss-Seidel rule (17) can be achieved as follows. By (17), $J^k \cup J^{k+1} \cup \cdots \cup J^{k+T-1} = N$ for $k = 0, 1, \dots$. Let $J^{k_1} = J^k$ and $J^{k_i} = J^{k+i-1} - (J^k \cup J^{k+1} \cup \cdots \cup J^{k+i-2})$ for $i = 2, \dots, T$. By excluding empty sets and renumbering if necessary, there is a set $\{J^{k_1}, \dots, J^{k_{r(k)}}\}$ such that $r(k) \le T$, $J^{k_1} \cup J^{k_2} \cup \cdots \cup J^{k_{r(k)}} = N$, $J^{k_i} \cap J^{k_j} = \emptyset$ for $i \ne j$, and $J^{k_i} \ne \emptyset$ for all $i = 1, \dots, r(k)$. Also, for each $i = 1, \dots, r(k)$, there is the smallest integer $q_i^k$ such that $J^{k_i} \subseteq J^{q_i^k}$ and $k \le q_i^k < k + T$. Then $q_i^k \ne q_j^k$ if $i \ne j$. Hence if we assume that

$$P(X) = \sum_{\ell=1}^{r(k)} P_{J^{k_\ell}}(X_{J^{k_\ell}}) \quad \forall X \in \Re^{m \times n},\ k \ge 0$$

instead of (23), then proceeding similarly as in the proof of Theorem 3, we can obtain a result similar to Theorem 3.

3 Regularized log-likelihood problem: covariance selection

In this section, we study the boundedness of the level set

$$\mathcal{X}^0 = \{X \in S^n \mid F(X) \le F(X^0)\} \tag{24}$$

and the Lipschitz continuity of $\nabla f$ on $\mathcal{X}^0$ for the regularized log-likelihood problem, i.e., $f(X) = -\log\det X + \langle S, X\rangle$ with $S \succeq 0_n$, which includes the covariance selection problems (1), (2), (5), and (6). The former ensures the existence of a cluster point of $\{X^k\}$ while the latter will be used for the convergence rate analysis and the complexity analysis of the BCGD method when applied to the regularized log-likelihood problem. In what follows, $f(X) = -\log\det X + \langle S, X\rangle$ with $S \succeq 0_n$. Also we assume $\bar{\mathcal{X}} \ne \emptyset$. Then, since the objective function is strictly convex, the optimal solution is unique and is denoted by $X^*$.

Assumption 5 There exist positive constants $\underline{\zeta}$ and $\bar\zeta$ such that $\underline{\zeta} I \preceq X \preceq \bar\zeta I$ for all $X \in \mathcal{X}^0$, where $\mathcal{X}^0$ is defined in (24).

If Assumption 5 is satisfied, then the function $f$ is strongly convex on $\mathcal{X}^0$, with a convexity parameter of $1/\bar\zeta^2$, in the sense that $\langle Y, \nabla^2 f(X) Y\rangle = \langle Y, X^{-1} Y X^{-1}\rangle \ge \bar\zeta^{-2} \|Y\|^2$ for every $Y \in S^n$. Also, $f$ has a gradient that is Lipschitz continuous with respect to the Frobenius norm on $\mathcal{X}^0$, with Lipschitz constant $L = 1/\underline{\zeta}^2$. The following lemma shows that Assumption 5 is satisfied when $P$ has linear growth asymptotically.

Lemma 3 Suppose $P(X) \ge \varsigma\, \mathrm{tr}(X)$ whenever $X \succeq 0$ and $\mathrm{tr}(X) \ge \zeta$, for some scalars $\varsigma > 0$, $\zeta > 0$. Then Assumption 5 is satisfied.


Proof For any $X \in \mathcal{X}^0$ with $\mathrm{tr}(X) \ge \zeta$, we have

$$\begin{aligned} F(X^0) &\ge -\log\det X + \langle S, X\rangle + P(X) \ge -\log\det X + \varsigma\, \mathrm{tr}(X) \\ &= \sum_{i=1}^{n} \big( -\log(\lambda_i(X)) + \varsigma \lambda_i(X) \big) \\ &\ge -\log(\lambda_i(X)) + \varsigma \lambda_i(X) + (n-1) \min_{t > 0} \big( -\log(t) + \varsigma t \big) \\ &= -\log(\lambda_i(X)) + \varsigma \lambda_i(X) + (n-1)(\log(\varsigma) + 1), \end{aligned} \tag{25}$$

for any $i = 1, \dots, n$, where the first inequality uses $S \succeq 0_n$, $X \succ 0_n$. To obtain an upper bound for $\lambda_i(X)$, we consider first the case when $\lambda_i(X) \ge 2/\varsigma$. Using $\log(t/\bar t) \le t/\bar t - 1$, with $t = \lambda_i(X)$ and $\bar t = 2/\varsigma$, we obtain from (25) that

$$F(X^0) \ge \log(\varsigma/2) + \frac{\varsigma}{2} \lambda_i(X) + 1 + (n-1)(\log(\varsigma) + 1).$$

Rearranging terms yields

$$\lambda_i(X) \le \frac{2}{\varsigma} \Big( F(X^0) - n(\log(\varsigma) + 1) + \log(2) \Big), \quad i = 1, \dots, n.$$

For the case when $\lambda_i(X) \le 2/\varsigma$, the above inequality is also valid. This shows that $X \preceq \bar\zeta I$ for all $X \in \mathcal{X}^0$, where $\bar\zeta$ is the constant on the right-hand side. Since $X \in \mathcal{X}^0$ so that $X \succ 0$, we also have $\lambda_i(X) > 0$, so (25) implies

$$F(X^0) > -\log(\lambda_i(X)) + (n-1)(\log(\varsigma) + 1).$$

Rearranging terms and taking exponentials yield

$$\lambda_i(X) > \exp\big( (n-1)(\log(\varsigma) + 1) - F(X^0) \big), \quad i = 1, \dots, n.$$

Therefore there is a positive constant $\underline{\zeta}$ such that $X \succeq \underline{\zeta} I$ for all $X \in \mathcal{X}^0$. □

The assumption of Lemma 3 is satisfied by $P(X) = \rho \sum_{i,j=1}^{n} |X_{ij}|$ with $\varsigma = \rho$ and $\zeta$ being any positive scalar. It is also satisfied with $P$ given by (8) and $P_{ij}$ as in (10), with $\varsigma$ being any positive scalar and $\zeta$ any scalar greater than $\mathrm{tr}(S + U)$, when $U_{ii} < \infty$ for $i = 1, \dots, n$; and with $P$ given by (8) and $P_{ij}$ as in (9), with $\varsigma = \min_i \{\rho_{ii}\}$ and $\zeta$ being any positive scalar, when $\rho_{ii} > 0$ for $i = 1, \dots, n$.

The following lemma shows that, for any $X \in \mathcal{X}^0$, $\|X - X^*\|$ can be bounded from above by the Frobenius norm of the solution of the subproblem (11) under Assumption 5. It can be proved by proceeding as in the proof of [17, Theorem 4], and hence we omit its proof.

Lemma 4 Suppose Assumption 5 holds. Then, for any $X \in \mathcal{X}^0$,

$$\|X - X^*\| \le \big( \bar\zeta^2 + (\bar\zeta/\underline{\zeta})^2 \big) \|D_{\mathcal{I}}(X; N)\|. \tag{26}$$


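If Assumption 5 is satisfied, then Assumption 3 holds and, by Lemma 4, Assumption 4 holds with $\tau = \bar\zeta^2 + (\bar\zeta/\underline{\zeta})^2$ and $\bar\epsilon = \infty$. Hence Theorems 2 and 3 apply to the regularized log-likelihood problems such as (9) and (10). The next lemma gives an upper bound of $\langle A, B\rangle$ in terms of the maximum eigenvalue of $A$, the rank and the Frobenius norm of $B$ when $A \succeq 0$ and $B \in S^n$. It will be used in Lemma 6.

Lemma 5 Suppose $A \succeq 0$ and $B \in S^n$. Then $\langle A, B\rangle \le \lambda_{\max}(A) \sqrt{\mathrm{rank}(B)}\, \|B\|$.

Proof By Fan's inequality [1, Theorem 1.2.1], we have

$$\begin{aligned} \langle A, B\rangle &\le \sum_{i=1}^{n} \lambda_i(A) \lambda_i(B) \le \sum_{i=1}^{n} \lambda_{\max}(A) |\lambda_i(B)| = \lambda_{\max}(A) \sum_{i=1}^{\mathrm{rank}(B)} |\lambda_i(B)| \\ &\le \lambda_{\max}(A) \sqrt{\mathrm{rank}(B)} \sqrt{\sum_{i=1}^{\mathrm{rank}(B)} (\lambda_i(B))^2} = \lambda_{\max}(A) \sqrt{\mathrm{rank}(B)}\, \|B\|, \end{aligned}$$

where $\lambda_i(A)$ and $\lambda_i(B)$ are the $i$-th eigenvalues of $A$ and $B$, respectively. □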
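As a quick sanity check of Lemma 5, the Python sketch below evaluates both sides of the inequality for symmetric $2 \times 2$ matrices, where the eigenvalues are available in closed form. The helper names and the test matrices are assumptions made for illustration; the check does not replace the proof.

```python
import math


def sym2_eigs(a, b, c):
    """Eigenvalues (ascending) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return mean - rad, mean + rad


def lemma5_sides(A, B, tol=1e-12):
    """Return (<A, B>, lambda_max(A) * sqrt(rank(B)) * ||B||_F) for 2x2 symmetric A, B."""
    inner = sum(A[i][j] * B[i][j] for i in range(2) for j in range(2))
    lam_max = sym2_eigs(A[0][0], A[0][1], A[1][1])[1]
    # Numerical rank: number of eigenvalues of B that are not (nearly) zero.
    rank = sum(abs(lam) > tol for lam in sym2_eigs(B[0][0], B[0][1], B[1][1]))
    fro = math.sqrt(sum(B[i][j] ** 2 for i in range(2) for j in range(2)))
    return inner, lam_max * math.sqrt(rank) * fro


# A is positive semidefinite; the second B has rank one, the kind of low-rank case the lemma targets.
A = [[2.0, 1.0], [1.0, 2.0]]
for B in ([[1.0, 2.0], [2.0, -1.0]], [[1.0, 0.0], [0.0, 0.0]]):
    print(lemma5_sides(A, B))
```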

 

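The next lemma gives an explicit bound on $\|D_{\mathcal{H}}(X; J)\|$ if Assumption 5 is satisfied and $P$ is Lipschitz continuous on $\mathrm{dom}\,P$.

Lemma 6 For any $X \in \mathcal{X}^0$, nonempty $J \subseteq N$, and self-adjoint positive definite $\mathcal{H}$ with $0 < \underline{\lambda} \le \lambda_{\min}(\mathcal{H})$, let $D = D_{\mathcal{H}}(X; J)$ and $G = -X^{-1} + S$. Suppose Assumption 5 is satisfied. If $P$ is Lipschitz continuous on $\mathrm{dom}\,P$ with Lipschitz constant $L_p$, then

$$\|D\| \le C_0 \quad \Big( \overset{\mathrm{def}}{=} 2 \big( \|S\| + \sqrt{\mathrm{rank}(D)}/\underline{\zeta} + L_p \big) / \underline{\lambda} \Big).$$

Moreover, the descent condition (16) is satisfied for $\sigma \in (0, 1)$ whenever $0 \le \alpha \le \min\{1, \omega \underline{\zeta}/C_0, 2\underline{\lambda}(1-\sigma)(1-\omega)^2 \underline{\zeta}^2\}$ with $\omega \in (0, 1)$.

Proof By (11),

$$\langle G, D\rangle + \tfrac{1}{2} \langle D, \mathcal{H}(D)\rangle + P(X + D) \le P(X).$$

Since $G = -X^{-1} + S$ and $\underline{\zeta} I \preceq X$, by Lemma 5 with $A = X^{-1}$ and $B = D$, we have

$$\langle G, D\rangle = \langle S, D\rangle - \langle X^{-1}, D\rangle \ge -\big( \|S\| + \sqrt{\mathrm{rank}(D)}/\underline{\zeta} \big) \|D\|.$$

This together with the Lipschitz continuity of $P$ and $\langle D, \mathcal{H}(D)\rangle \ge \underline{\lambda} \|D\|^2$ implies that

$$-\big( \|S\| + \sqrt{\mathrm{rank}(D)}/\underline{\zeta} \big) \|D\| + \frac{\underline{\lambda}}{2} \|D\|^2 - L_p \|D\| \le \langle G, D\rangle + \tfrac{1}{2} \langle D, \mathcal{H}(D)\rangle + P(X + D) - P(X) \le 0.$$

Hence $\|D\| \le C_0$. If $X \in \mathcal{X}^0$, then, for any $\alpha \in [0, \bar\alpha]$ with $\bar\alpha = \min\{1, \omega \underline{\zeta}/C_0\}$ and $\omega \in (0, 1)$, we have

$$0 \prec (1-\omega) \underline{\zeta} I \preceq (\underline{\zeta} - \bar\alpha C_0) I \preceq X - \bar\alpha \|D\| I \preceq X + \alpha D.$$

Hence $\|(X + \alpha D)^{-1} - X^{-1}\| \le \dfrac{1}{(1-\omega)^2 \underline{\zeta}^2} \alpha \|D\|$. Therefore there is some $\varrho > 0$ such that $f$ satisfies (15) with $L = \dfrac{1}{(1-\omega)^2 \underline{\zeta}^2}$. By Lemma 2, the descent condition (16) is satisfied for $\sigma \in (0, 1)$ whenever $0 \le \alpha \le \min\{\bar\alpha, 2\underline{\lambda}(1-\sigma)(1-\omega)^2 \underline{\zeta}^2\}$. □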
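Remark 2, which follows, relies on the fact that changing a single column (and the corresponding row) of a symmetric matrix allows its determinant to be recomputed in $O(n^2)$ operations through a Schur complement. The Python sketch below illustrates that identity, assuming the determinant and inverse of the submatrix obtained by deleting the updated row and column are already available; the function name and the small test matrix are hypothetical and this is not the authors' implementation.

```python
def det_with_bordered_column(det_sub, inv_sub, col, diag):
    """Determinant of [[W, col], [col^T, diag]] given det(W) and W^{-1}.

    Uses det = det(W) * (diag - col^T W^{-1} col), the Schur complement of W,
    which costs O(n^2) operations for an (n-1) x (n-1) block W.
    """
    m = len(col)
    w_inv_col = [sum(inv_sub[a][b] * col[b] for b in range(m)) for a in range(m)]
    schur = diag - sum(col[a] * w_inv_col[a] for a in range(m))
    return det_sub * schur


# 3x3 check with the last row/column as the updated one (values are arbitrary).
W_det = 6.0                                # determinant of [[2, 0], [0, 3]]
W_inv = [[0.5, 0.0], [0.0, 1.0 / 3.0]]     # its inverse
print(det_with_bordered_column(W_det, W_inv, [1.0, 1.0], 4.0))
```

Because the block `W` does not change when only one column and row are modified, trial stepsizes in the Armijo rule can reuse `det_sub` and `inv_sub` and only recompute the scalar Schur complement.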

Remark 2 When we apply the BCGD method to solve the problems (1), (2), (4), (5), and (6), the specific choices of $J^k$ are guided by the need to efficiently compute $\langle \nabla f(X^k), D^k\rangle$ and $F(X^k + \alpha^k D^k)$ in the Armijo stepsize rule. Hence we want to avoid computing $\det(X^k + \alpha^k D^k)$ and the inverse $(X^k)^{-1}$ appearing in $\nabla f(X^k)$ from scratch, since this would require $O(n^3)$ operations. Following the proposal in [3], we will choose $J^k = \{(i,j), (j,i) \mid i = 1, \dots, n\}$ where $j = k + 1 \pmod n$ for the Gauss-Seidel rule (17) and $J^k = \{(i,j), (j,i) \mid i = 1, \dots, j\}$ where $j = k + 1 \pmod n$ for the restricted Gauss-Seidel rule (18). By these choices, we update only one column (and the corresponding row) of $X$ at each iteration. Then $\det(X^k + \alpha^k D^k)$ can be computed in $O(n^2)$ operations by using the Schur complement of the submatrix produced by removing the $j$th column and row in $X^k + \alpha^k D^k$. Similarly, $(X^{k+1})^{-1}$ can be updated in $O(n^2)$ operations from $(X^k)^{-1}$ by using the Sherman-Woodbury-Morrison formula; see Sect. 4 for details.

Remark 3 For the problem (10) (see also (2) and (6)), Assumption 5 is satisfied and $L_p = 0$. Hence if, for all $k$, we take $\mathcal{H}^k = \mathcal{I}$, $L_i = 1/\underline{\zeta}^2$, $L = 4/\underline{\zeta}^2$, $\alpha^k_{\mathrm{init}} = \min\{1, \bar\alpha\}$ where $\bar\alpha$ is defined as in Lemma 6, and we choose $J^k = \{(i,j), (j,i) \mid i = 1, \dots, n\}$ where $j = k + 1 \pmod n$ or $J^k = \{(i,j), (j,i) \mid i = 1, \dots, j\}$ where $j = k + 1 \pmod n$, then $\mathrm{rank}(D^k) = 2$ and so the iteration bounds in Theorem 3 reduce to

$$O\Big( \frac{n}{\underline{\zeta}^2} \ln\Big( \frac{e^0}{\epsilon} \Big) \Big) \quad \text{or} \quad O\Big( \frac{n^2 \bar\zeta^2}{\epsilon\, \underline{\zeta}^2} \Big).$$

Since $N$ is the union of $J^k, J^{k+1}, \dots, J^{k+n-1}$, the resulting complexity bounds on the number of iterations for achieving $\epsilon$-optimality can be

$$O\Big( \frac{n^2}{\underline{\zeta}^2} \ln\Big( \frac{e^0}{\epsilon} \Big) \Big) \quad \text{or} \quad O\Big( \frac{n^3 \bar\zeta^2}{\epsilon\, \underline{\zeta}^2} \Big).$$

For the problem (9) (see also (1) with α = 0, β = +∞ and (5) with ρi j > 0 for (i, j) ∈ V ), Assumption 5 is satisfied and L p = n ρ˜ with ρ˜ = ρ for (1) and k , and J k as ρ˜ = max(i, j) ∈V {ρi j } for (5). Hence if, for all k, we choose Hk , L i , L , αinit

123

A block coordinate gradient descent method

345

above, then the resulting complexity bounds on the number of iterations for achieving -optimality can be  O

n 2 (1 + ζ ) ζ2



e0 ln 



 or O

n 3 ζ¯ 2 (1 + ζ ) ζ 2

 .

The computational cost per iteration of the BCGD method is $O(n^2)$ operations, except for the first iteration, when it is applied to solve (1) with $\alpha = 0$, $\beta = +\infty$, (2), (5) with $\rho_{ij} > 0$ for $(i, j) \notin V$, and (6); see Sect. 4. Hence the BCGD method can be implemented to achieve $\epsilon$-optimality in

$$O\left(\frac{n^5\bar\zeta^2(1+\zeta)}{\zeta^2\epsilon}\right)$$

operations. In contrast, the worst-case iteration complexity of interior point methods for finding an $\epsilon$-optimal solution to (1) is $O(n\ln(e^0/\epsilon))$, where $e^0$ is an initial duality gap. But each iterate of interior point methods requires $O(n^6)$ arithmetic cost for assembling and solving a typically dense Newton system with $O(n^2)$ variables. Thus, the total worst-case arithmetic cost of the interior point methods for finding an $\epsilon$-optimal solution to (1) is $O(n^7\ln(e^0/\epsilon))$ operations. The first-order method proposed in [7] requires $O(n^3)$ operations per iteration, since it is dominated by the eigenvalue decomposition and matrix-matrix multiplication of $n \times n$ matrices. As indicated in [7], the overall worst-case arithmetic cost of this first-order method for finding an $\epsilon$-optimal solution to (1) is $O(\rho\beta n^4/\sqrt{\epsilon})$ operations.

4 Numerical experiments on covariance selection

In this section, we describe the implementation of the BCGD method and report our numerical results for solving the covariance selection problems (2) and (6) on randomly generated instances. In particular, we report the comparison of the BCGD method with a first-order method (called the ANS method, with the Matlab code publicly available) developed in [7,8] for solving the covariance selection problem of the form (1) with $\alpha = 0$ and $\beta = +\infty$ and the covariance selection problem of the form (5). We have implemented the BCGD method in Matlab. All runs are performed on an Intel Xeon 3.20 GHz, running Linux and Matlab (Version 7.6).

4.1 Implementation of the BCGD method

In our implementation of the BCGD method, we choose a self-adjoint positive definite linear mapping as follows:

$$\mathcal{H}_k(D) = (H^k_{ij} D_{ij})_{ij}, \qquad (27)$$


where $H^k = h^k (h^k)^T$ with $h^k_j = \min\{\max\{((X^k)^{-1})_{jj}, 10^{-10}\}, 10^{10}\}$ for all $j = 1, \dots, n$. If $10^{-10} \le ((X^k)^{-1})_{jj} \le 10^{10}$ for all $j = 1, \dots, n$, then $\langle D, \mathcal{H}_k(D) \rangle = \langle {\rm diag}((X^k)^{-1})\, D\, {\rm diag}((X^k)^{-1}), D \rangle$. By (2), $f(X) = -\log\det X - n$. Therefore $\nabla^2 f(X^k)[D, D] = \langle (X^k)^{-1} D (X^k)^{-1}, D \rangle$. Then the above choice of $\mathcal{H}_k$ can be viewed as a diagonal approximation to the Hessian. This choice also has the advantage that $(X^k)^{-1}$ has already been evaluated when computing the gradient, and $D^k$ is given by a closed form formula which can be computed efficiently in Matlab using vector operations. We tested the alternative choice of $\mathcal{H}_k = \eta^k I$ for some fixed constant $\eta^k$, including 1, but its overall performance was much worse. We also tested another alternative choice of (27) with $H^k = h^k (h^k)^T + W^k \circ W^k$, where "$\circ$" denotes the Hadamard product, $W^k_{ij} = ((X^k)^{-1})_{ij}$ for $i \ne j$, and $W^k_{ij} = 0$ for $i = j$. This choice is a somewhat better diagonal approximation of the Hessian than (27), but its overall performance was similar to that of (27).

We choose the index subset $\mathcal{J}^k$ by the Gauss-Seidel (cyclic) rule, $\mathcal{J}^k = \{(i, j), (j, i) \mid i = 1, \dots, n\}$ where $j = k + 1 \pmod n$. Hence we update only one column (and the corresponding row) at each iteration.

In order to satisfy the Armijo descent condition (12), $\alpha^k$ should be chosen to satisfy $X^k + \alpha^k D^k \succ 0$. Since $X^k \succ 0$, there is a positive scalar $\bar\alpha^k$ that satisfies $X^k + \bar\alpha^k D^k \succ 0$. Hence if $\alpha^k \in (0, \bar\alpha^k]$, then $X^k + \alpha^k D^k \succ 0$. We describe below how to compute $\bar\alpha^k > 0$ that satisfies $X^k + \bar\alpha^k D^k \succ 0$. By permutation and the choice of $\mathcal{J}^k$, we can always assume that we are updating the last column (and the last row). Suppose we partition the matrices $X^k$ and $D^k$ in the following block format:

$$X^k = \begin{pmatrix} V^k & u^k \\ (u^k)^T & w^k \end{pmatrix} \quad\text{and}\quad D^k = \begin{pmatrix} 0_{n-1} & d^k \\ (d^k)^T & r^k \end{pmatrix}, \qquad (28)$$

where $V^k \in S^{n-1}$, $u^k, d^k \in \Re^{n-1}$, and $w^k, r^k \in \Re$. Then, by [6, Theorem 7.7.6], $X^k + \alpha D^k \succ 0$ if and only if $V^k \succ 0$ and $w^k + \alpha r^k - (u^k + \alpha d^k)^T (V^k)^{-1} (u^k + \alpha d^k) > 0$. Since $X^k \succ 0$, we have $V^k \succ 0$. This implies that $X^k + \alpha D^k \succ 0$ if and only if

$$a_1^k \alpha^2 + 2 a_2^k \alpha - a_3^k < 0, \qquad (29)$$

where $a_1^k = (d^k)^T (V^k)^{-1} d^k$, $a_2^k = (u^k)^T (V^k)^{-1} d^k - 0.5\, r^k$ and $a_3^k = w^k - (u^k)^T (V^k)^{-1} u^k$. Note that since $X^k \succ 0$, we have $a_3^k > 0$. If $D^k \ne 0$, then either $d^k \ne 0$, or $d^k = 0$ and $r^k \ne 0$. Now we consider the following cases in setting $\bar\alpha^k$.

Case (1): If $d^k \ne 0$, then $a_1^k > 0$. This together with $a_3^k > 0$ implies that (29) is satisfied for all $\alpha \in (0, \bar\alpha)$, where

$$\bar\alpha = \frac{-a_2^k + \sqrt{(a_2^k)^2 + a_1^k a_3^k}}{a_1^k}.$$

Therefore $X^k + \alpha D^k \succ 0$ for all $\alpha \in (0, \bar\alpha)$. We set $\bar\alpha^k = \max\{0.9\bar\alpha, -a_2^k/a_1^k\}$.

Case (2): If $d^k = 0$ and $r^k > 0$, then $a_1^k = 0$ and $a_2^k = -0.5\, r^k < 0$. Hence (29) is satisfied for all $\alpha > 0$. Therefore $X^k + \alpha D^k \succ 0$ for all $\alpha \in (0, \infty)$. We set $\bar\alpha^k = 10$.

Case (3): If $d^k = 0$ and $r^k < 0$, then $a_1^k = 0$ and $a_2^k = -0.5\, r^k > 0$. Hence (29) is satisfied for all $\alpha \in (0, \bar\alpha)$, where $\bar\alpha = a_3^k/(2 a_2^k)$. Therefore $X^k + \alpha D^k \succ 0$ for all $\alpha \in (0, \bar\alpha)$. We set $\bar\alpha^k = 0.9\bar\alpha$.
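The three cases can be collected into one small routine. The following is a minimal sketch in Python (the paper's own implementation is in Matlab and is not reproduced here); the function names `step_upper_bound` and `is_feasible` are hypothetical, and the coefficients $a_1^k$, $a_2^k$, $a_3^k$ of (29) and the corner entry $r^k$ of $D^k$ are assumed to have been computed already.

```python
import math


def step_upper_bound(a1, a2, a3, r):
    """Return the cap on the stepsize chosen in Cases (1)-(3).

    a1, a2, a3 are the coefficients of the quadratic inequality (29) and
    r is the corner entry r^k of D^k. Assumes a3 > 0, i.e. X^k is
    positive definite.
    """
    if a1 > 0.0:
        # Case (1): d^k != 0. `root` is the positive root of
        # a1*t^2 + 2*a2*t - a3 = 0, so (29) holds on (0, root).
        root = (-a2 + math.sqrt(a2 * a2 + a1 * a3)) / a1
        return max(0.9 * root, -a2 / a1)
    if r > 0.0:
        # Case (2): d^k = 0 and r^k > 0. Every positive step is feasible,
        # so a fixed cap of 10 is used.
        return 10.0
    if r < 0.0:
        # Case (3): d^k = 0 and r^k < 0, hence a2 = -0.5*r > 0.
        return 0.9 * a3 / (2.0 * a2)
    raise ValueError("D^k = 0: no stepsize cap is needed")


def is_feasible(a1, a2, a3, alpha):
    """Check the strict inequality (29) for a given stepsize."""
    return a1 * alpha ** 2 + 2.0 * a2 * alpha - a3 < 0.0
```

For instance, with $a_1^k = 1$, $a_2^k = -1$, $a_3^k = 3$ the positive root is 3, so the cap is $\max\{2.7, 1\} = 2.7$, which satisfies (29).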


The stepsize $\alpha^k$ can be chosen by the Armijo rule (12) by setting $\alpha^k_{\rm init} = \min\{\bar\alpha^k, 1\}$. But, for (2) and (6) (see also (10)) with $0 < \alpha \le 1$, by using (28), we have

$$\begin{aligned}
& f(X^k + \alpha D^k) + P(X^k + \alpha D^k) - f(X^k) - P(X^k) \\
&\quad = f(X^k + \alpha D^k) - f(X^k) \\
&\quad = -\log\det V^k - \log\big(w^k + \alpha r^k - (u^k + \alpha d^k)^T (V^k)^{-1} (u^k + \alpha d^k)\big) - n \\
&\qquad + \log\det V^k + \log\big(w^k - (u^k)^T (V^k)^{-1} u^k\big) + n \\
&\quad = -\log\big(a_3^k - a_1^k \alpha^2 - 2 a_2^k \alpha\big) + \log a_3^k. \qquad (30)
\end{aligned}$$

Hence if $-\log(a_3^k - a_1^k \alpha^2 - 2 a_2^k \alpha) + \log a_3^k < 0$, then $a_3^k - a_1^k \alpha^2 - 2 a_2^k \alpha > a_3^k$, and so $a_1^k \alpha^2 + 2 a_2^k \alpha < 0$. Since $\alpha > 0$, we must have $a_2^k < 0$, and so $0 < \alpha < -2 a_2^k/a_1^k$ if $a_1^k \ne 0$, or $\alpha > 0$ if $a_1^k = 0$. Also $a_1^k = 0$ if and only if $d^k = 0$. This together with $a_2^k < 0$ implies that, for $0 < \alpha \le 1$, the quantity $f(X^k + \alpha D^k) + P(X^k + \alpha D^k) - f(X^k) - P(X^k)$ is minimized when

$$\alpha = \begin{cases} \min\{1, -a_2^k/a_1^k\} & \text{if } d^k \ne 0; \\ 1 & \text{else.} \end{cases}$$
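The closed-form stepsize can be checked numerically against (30). The sketch below is in Python rather than the Matlab used for the reported experiments; `objective_change` and `limited_minimization_step` are hypothetical names, and the coefficients $a_1^k$, $a_2^k$, $a_3^k$ are assumed to be given, with $a_2^k < 0$ as in the derivation.

```python
import math


def objective_change(a1, a2, a3, alpha):
    """Change in the objective along D^k, following (30).

    Requires a3 - a1*alpha^2 - 2*a2*alpha > 0, i.e. a feasible step.
    """
    return -math.log(a3 - a1 * alpha ** 2 - 2.0 * a2 * alpha) + math.log(a3)


def limited_minimization_step(a1, a2):
    """Minimizer of (30) over 0 < alpha <= 1, assuming a2 < 0."""
    if a2 >= 0.0:
        raise ValueError("the formula assumes a2 < 0")
    if a1 > 0.0:
        # d^k != 0: the concave quadratic inside the logarithm peaks at
        # alpha = -a2/a1, truncated to the interval (0, 1].
        return min(1.0, -a2 / a1)
    # d^k = 0: the argument of the logarithm increases with alpha.
    return 1.0
```

With $a_1^k = 4$, $a_2^k = -1$, $a_3^k = 2$ the rule returns $\alpha = 0.25$, and evaluating `objective_change` at 0.1, 0.25, and 0.5 shows that the middle value gives the largest decrease.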

Hence it is better to use the limited minimization rule.

Remark 4 We can directly apply the BCGD method to the primal covariance selection problem (1) with $\alpha = 0$ and $\beta = \infty$. Suppose that the Gauss-Seidel (cyclic) rule, $\mathcal{J}^k = \{(i, j), (j, i) \mid i = 1, \dots, n\}$ where $j = k + 1 \pmod n$, is employed to choose $\{\mathcal{J}^k\}$. Then only one column and the corresponding row are updated at each iteration. Since $P(X) = \rho \sum_{i,j=1}^n |X_{ij}|$, by using (28),

$$\begin{aligned}
& f(X^k + \alpha D^k) + P(X^k + \alpha D^k) - f(X^k) - P(X^k) \\
&\quad = -\log\det V^k - \log\big(w^k + \alpha r^k - (u^k + \alpha d^k)^T (V^k)^{-1} (u^k + \alpha d^k)\big) \\
&\qquad + \log\det V^k + \log\big(w^k - (u^k)^T (V^k)^{-1} u^k\big) \\
&\qquad + 2\big(\|u^k + \alpha d^k\|_1 - \|u^k\|_1\big) + |w^k + \alpha r^k| - |w^k| \\
&\quad = -\log\big(a_3^k - a_1^k \alpha^2 - 2 a_2^k \alpha\big) + \log a_3^k + 2\big(\|u^k + \alpha d^k\|_1 - \|u^k\|_1\big) + |w^k + \alpha r^k| - |w^k|,
\end{aligned}$$

where $a_1^k$, $a_2^k$, and $a_3^k$ are as defined previously. Hence the limited minimization rule can be used for finding a stepsize, but finding such a stepsize is not as simple as for the dual formulation (2). However, the method can still be implemented in $O(n^2)$ operations per iteration. Similarly, we can directly apply the BCGD method to the more general covariance selection problem (5).

Next, we comment on the computational complexity of the BCGD method when it is applied to the problems (2) and (6). We analyze the cost of each iteration of the BCGD method. The main computational cost per iteration is a number of inner products, vector-scalar multiplications and vector additions, each requiring $O(n)$ floating-point operations, plus a number of matrix-vector multiplications and matrix additions, each requiring $O(n^2)$ floating-point operations. In addition, each iteration requires one gradient (the inverse of $X \in S^n$) evaluation to find the direction, and one evaluation of the inverse of an $(n-1) \times (n-1)$ submatrix of $X$ to find the stepsize. Those evaluations generally require $O(n^3)$ floating-point operations. But since we only update one column and the corresponding row of the current matrix, the inverse of $X$ can be updated by using the Sherman-Woodbury-Morrison formula [11, Appendix A.2], and the inverse of the $(n-1) \times (n-1)$ submatrix of $X$ can be evaluated by using the Schur complement formula [14, Subsection 13.2.2]. Both computations only require $O(n^2)$ operations. Therefore we only need to compute the full inverse of the initial matrix at the first iteration, which can be much cheaper than $O(n^3)$ operations if the initial matrix is sparse. For the evaluation of the current function value, the computation of the determinant of an $n \times n$ matrix is needed and generally requires $O(n^3)$ floating-point operations. But, by using (30), it only requires $O(n^2)$ operations. As indicated at the end of Sect. 3, the iteration bound of the BCGD method for achieving $\epsilon$-optimality, when it is applied to (2) and (6), reduces to $O\left(\frac{n^3\bar\zeta^2}{\zeta^2\epsilon}\right)$, where $\bar\zeta$, $\zeta$ are two constants that satisfy Assumption 5. Hence the BCGD method can be implemented to achieve $\epsilon$-optimality in $O\left(\frac{n^5\bar\zeta^2}{\zeta^2\epsilon}\right)$ operations.
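To illustrate why updating one column and the corresponding row costs only $O(n^2)$ operations, the sketch below applies the rank-one Sherman-Morrison identity twice, since the change $e_j v^T + v e_j^T$ has rank two. It is a plain Python illustration under stated assumptions, not the authors' Matlab code: the names `sherman_morrison` and `update_inverse_row_col` are hypothetical, matrices are nested lists, and the intermediate matrix $X + e_j v^T$ is assumed to be nonsingular.

```python
def sherman_morrison(inv, u, w):
    """Return (A + u w^T)^{-1} given inv = A^{-1}, using O(n^2) operations."""
    n = len(inv)
    inv_u = [sum(inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    w_inv = [sum(w[k] * inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(w[k] * inv_u[k] for k in range(n))
    if denom == 0.0:
        raise ZeroDivisionError("the updated matrix is singular")
    return [[inv[i][j] - inv_u[i] * w_inv[j] / denom for j in range(n)]
            for i in range(n)]


def update_inverse_row_col(inv, j, v):
    """Inverse after adding v to row j and to column j of a symmetric matrix.

    The perturbation is e_j v^T + v e_j^T, so entry (j, j) changes by
    2 * v[j]. Two rank-one updates are applied in sequence.
    """
    n = len(inv)
    e = [1.0 if i == j else 0.0 for i in range(n)]
    first = sherman_morrison(inv, e, v)   # adds v to row j
    return sherman_morrison(first, v, e)  # adds v to column j
```

In the notation of (28), taking $v$ with entries $\alpha d^k$ off the diagonal position and $\alpha r^k/2$ at position $j$ reproduces the update $X^k + \alpha D^k$.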

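Throughout our numerical experiments, we terminate the BCGD method when the following conditions are satisfied:

$$\langle D_{\mathcal{H}_k}(X^k; \mathcal{N}),\ \mathcal{H}_k(D_{\mathcal{H}_k}(X^k; \mathcal{N})) \rangle \le \varepsilon_1, \qquad (31)$$

$$\frac{\Big|\langle S, (X^k)^{-1} \rangle + \sum_{(i,j) \notin V} \rho_{ij}\, |((X^k)^{-1})_{ij}| - n\Big|}{1 + \Big|\log\det(X^k) + \langle S, (X^k)^{-1} \rangle + \sum_{(i,j) \notin V} \rho_{ij}\, |((X^k)^{-1})_{ij}|\Big|} \le \varepsilon_2, \qquad (32)$$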

where $\varepsilon_1$ and $\varepsilon_2$ are small positive numbers (in our numerical runs, we set $\varepsilon_1 = 5 \times 10^{-3}$ and $\varepsilon_2 = 10^{-4}$). Here we scale $D_{\mathcal{H}_k}(X^k; \mathcal{N})$ by $\mathcal{H}_k$ to reduce its sensitivity to $\mathcal{H}_k$. The criterion (31) is motivated by Lemma 1 and it is an extension of a criterion that has been suggested previously in [17]. The criterion (32) is based on the relative duality gap between the problems (5) and (6). We should note that the sequence of iterates $\{X^k\}$ generated by the BCGD method is for the dual problem (6), and we use $(X^k)^{-1}$ to estimate an approximate primal optimal solution for (5). The initial iterate for the BCGD method is chosen to be $X^0 = S + \max\{\rho_{ij} \mid (i, j) \notin V\}\, I$.

The stopping conditions for the ANS method are set as suggested in [8]. But we modified the "absolute gap" condition in the ANS method to the "relative gap" condition (32), with the same tolerance $\varepsilon_2 = 10^{-4}$, since the latter is more commonly used in optimization algorithms. Note that each iteration of the ANS method requires the computation of the full eigenvalue decomposition of a symmetric matrix. In its original implementation, the ANS method uses the routine eig.m in Matlab to carry out this task. Here we replaced eig.m by the LAPACK routine dsyevd.f (via mex-interface to Matlab), which is based on a divide-and-conquer strategy. On our machine, the latter routine is about 5–10 times faster than the former.


4.2 Numerical results

In this subsection, we report the performance of the BCGD method and compare it to the ANS method [8] for solving the covariance selection problem of the form (1) (with $\alpha = 0$ and $\beta = +\infty$) and (5) on randomly generated instances. We note that the BCGD method is applied to solve the dual problems (2) and (6). The BCGD method can be directly applied to solve (1) (with $\alpha = 0$ and $\beta = +\infty$) and (5). In this case, the stepsize can also be chosen by the Armijo rule or the limited minimization rule; see Remark 4. But its overall performance was worse than that of using the dual formulation. Hence we only present the numerical results of the BCGD method when applied to the dual formulation.

All the test instances used in this subsection were generated randomly in a similar manner as described in [3,7,8]. First, we generate a random sparse matrix $A \in S^n$ whose nonzero elements are set randomly to be $\pm 1$. Then we generate a sparse inverse covariance matrix $\Sigma^{-1}$ from $A$ as follows:

A = A * A; d = diag(A); T = diag(d) + max(min(A − diag(d), 1), −1);

$$\Sigma^{-1} = T - \min\{1.2\lambda_{\min}(T) - \vartheta, 0\}\, I,$$

where $\vartheta$ is a small positive number. Then we generate a matrix $B \in S^n$ by

$$B = \Sigma + \tau \frac{\|\Sigma\|_F}{\|\Xi\|_F}\, \Xi,$$

where $\Xi \in S^n$ is a random matrix whose elements are drawn from the uniform distribution on the interval $[-1, 1]$, and $\tau$ is a small positive number. Finally, we obtain the following randomly generated sample covariance matrix:

$$S = B - \min\{\lambda_{\min}(B) - \vartheta, 0\}\, I.$$

In our experiments, we set $\tau = 0.15$ and $\vartheta = 10^{-4}$ for generating all instances. Let $\Omega = \{(i, j) \mid (\Sigma^{-1})_{ij} = 0,\ |i - j| \ge 2\}$. For the problem (5), we set $V$ to be a random subset of $\Omega$ such that ${\rm card}(V)$ is about 50% of ${\rm card}(\Omega)$. The regularization parameters $\rho_{ij}$ are set to the constant value $5/n$ for all the instances. The parameter values are chosen empirically to achieve a reasonable recovery of the true inverse covariance matrix $\Sigma^{-1}$. We should note that, in general, it is impossible to recover $\Sigma^{-1}$ accurately based on $S$ by solving (1) or (5). Thus the purpose of solving (1) or (5) is not to recover the true matrix $\Sigma^{-1}$ accurately but to detect the sparsity pattern of $\Sigma^{-1}$ while maintaining a reasonable approximation to the true matrix. We evaluate the recovery success of $\Sigma^{-1}$ by a matrix $X$ based on the following criteria:

$$L_Q \overset{\rm def}{=} \frac{1}{n}\|\Sigma X - I\|_F, \qquad L_E \overset{\rm def}{=} \frac{1}{n}\big(\langle \Sigma, X \rangle - \log\det(\Sigma X) - n\big), \qquad (33)$$

$$\text{Specificity} \overset{\rm def}{=} \frac{\rm TN}{\rm TN + FP}, \qquad \text{Sensitivity} \overset{\rm def}{=} \frac{\rm TP}{\rm TP + FN}, \qquad (34)$$


where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively, with respect to the sparsity pattern of $\Sigma^{-1}$. In our situation, $L_E$ and $L_Q$ (which were considered in [19] without the normalization factor $1/n$) measure the quality of the approximation of $\Sigma^{-1}$ by $X$, and Specificity measures the quality of zero entries while Sensitivity measures the quality of nonzero entries. We should mention that the estimated matrix $X$ would not be sparse in general but would have many small entries. Thus we postprocess the matrix $X$ by setting all entries which are smaller than $5 \times 10^{-2}$ in absolute value to 0.

Table 1 reports the performance of the BCGD and ANS methods. In the table, an entry of the form "6.07 4" means the number "$6.07 \times 10^4$". The dimension $n$ and density (percentage of nonzero entries) of the inverse covariance matrix and the number of constraints $|V| \overset{\rm def}{=} {\rm card}(V)$ are given in the first major column. In the second major column, we report the number of iterations taken by the BCGD and ANS methods. The numbers in each parenthesis correspond to $L_Q$, $L_E$, Specificity, and Sensitivity, respectively. From the Specificity and the Sensitivity values, we see that the solution of (1) or (5) can estimate the sparsity pattern of $\Sigma^{-1}$ very accurately, but it is less accurate in the actual approximation of $\Sigma^{-1}$. The third major column gives the primal objective values. Note that the sub-column under "ANS" gives the difference in the primal objective values between the ANS and the BCGD methods. As we are reporting the primal objective value for (1) or (5), a positive difference means that the ANS method achieved a better objective value than the BCGD method. Conversely, a negative difference would indicate that the BCGD method has achieved a better objective value. The last major column reports the CPU times. From Table 1, we see that the BCGD and the ANS methods are comparable in their performance when solving the problem (1) with $\alpha = 0$ and $\beta = +\infty$. But for the problem (5), the BCGD method substantially outperforms the ANS method in terms of the CPU time taken and the quality of the primal objective values attained.
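The sparsity-pattern measures in (34), combined with the thresholding step described earlier in this subsection, can be sketched as follows. This is an illustrative Python fragment, not part of the reported Matlab experiments; the function name `sparsity_recovery` is hypothetical, matrices are nested lists, a "positive" is taken to be a nonzero entry of the true inverse covariance matrix, and the sketch assumes that the true matrix has at least one zero and one nonzero entry.

```python
def sparsity_recovery(true_inv, estimate, tol=5e-2):
    """Return (specificity, sensitivity) of a thresholded estimate.

    Entries of `estimate` whose absolute value is below `tol` are treated
    as zero, mirroring the postprocessing described in the text.
    """
    tp = tn = fp = fn = 0
    for true_row, est_row in zip(true_inv, estimate):
        for t, x in zip(true_row, est_row):
            predicted_nonzero = abs(x) >= tol
            if t != 0.0:
                if predicted_nonzero:
                    tp += 1   # true nonzero, detected
                else:
                    fn += 1   # true nonzero, missed
            else:
                if predicted_nonzero:
                    fp += 1   # true zero, reported as nonzero
                else:
                    tn += 1   # true zero, detected
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return specificity, sensitivity
```

For a $3 \times 3$ example with true pattern `[[1, 0.5, 0], [0.5, 1, 0], [0, 0, 1]]` and estimate `[[0.9, 0.4, 0.01], [0.4, 1.1, 0.2], [0.01, 0.2, 0.8]]`, two of the four true zeros survive the threshold, so Specificity is 0.5 and Sensitivity is 1.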

Table 1 Comparison of the BCGD and ANS methods in solving the problems (1) and (5) with the data matrix S generated randomly. The regularization parameters ρij are set to ρij = 5/n for all the problems. The numbers in each parenthesis are: L_Q, L_E, Specificity and Sensitivity, respectively. An entry of the form "a b" denotes a × 10^b.

n | density(%) | |V|     Iteration count                                  Primal objective value      Time (s)
                         BCGD                                   ANS      BCGD             ANS        BCGD       ANS
500 | 2.74 | 0           1,662 (2.9 −2|5.2 −1|0.99|0.72)        46       −8.18195357 2    2.42 −2    5.5        11.5
1,000 | 4.15 | 0         8,701 (1.5 −2|2.2 −1|0.99|0.99)        87       −4.32170724 2    8.60 −5    145.7      117.7
1,500 | 4.63 | 0         12,661 (1.4 −2|2.9 −1|0.99|0.98)       84       −4.35887269 2    1.65 −3    491.9      371.1
2,000 | 5.14 | 0         18,781 (1.2 −2|3.2 −1|0.98|0.97)       93       −2.80176287 2    6.03 −4    1,285.1    953.9
500 | 1.97 | 6.07 4      3,601 (2.9 −2|5.5 −1|1.00|0.76)        619      −8.42619444 2    −1.54 −1   12.5       146.9
1,000 | 3.33 | 2.42 5    11,341 (1.6 −2|2.2 −1|1.00|0.99)       807      −4.45131714 2    −3.53 −3   195.8      1,053.4
1,500 | 3.71 | 5.42 5    13,321 (1.4 −2|2.9 −1|1.00|0.99)       969      −4.63013088 2    −1.21 −1   528.2      3,839.5
2,000 | 4.13 | 9.61 5    20,681 (1.3 −2|3.2 −1|1.00|0.99)       1,256    −3.19691367 2    −7.90 −2   1,448.8    10,845.3


5 Conclusions

In this paper we have proposed a block coordinate gradient descent method for solving convex nonsmooth optimization problems on a set of matrices. We also analyzed its computational complexity. Important applications of such convex nonsmooth optimization problems include the covariance selection problem. On both the primal and dual formulations of the covariance selection problems (1) with $\alpha = 0$, $\beta = +\infty$, (2), (5), and (6), this method achieves linear convergence, and terminates in $O(n^5/\epsilon)$ operations with an $\epsilon$-optimal solution. The BCGD method can be applied to solve (5) without solving a sequence of unconstrained penalty problems, as was done in [8]. Our preliminary numerical experiments suggest that our method is efficient in solving the dual formulation of large-scale covariance selection problems, especially when they have many constraints.

There are some topics that can be investigated in a future study. Can the complexity bounds in Sects. 2 and 3 be sharpened? The Gauss-Southwell type rules for choosing $\mathcal{J}^k$ in [17] can be extended to our general problem. We did not consider them here because we do not have a better choice for $\mathcal{J}^k$ than the choice of one column and the corresponding row. Can we still efficiently find $\mathcal{J}^k$ satisfying the Gauss-Southwell type rules?

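Appendix: Proof of Theorem 3

By (18), $\mathcal{N}$ is the disjoint union of $\mathcal{J}^k, \mathcal{J}^{k+1}, \dots, \mathcal{J}^{\tau(k)-1}$. This together with (13) and (23) yields that

$$\sum_{i=k}^{\tau(k)-1} \Delta^i \le \min_{D \in \Re^{m \times n}} \Big\{ \langle \tilde G^k, D \rangle + \frac{1}{2} \langle D, \tilde{\mathcal{H}}^k(D) \rangle + P(X^k + D) - P(X^k) \Big\}, \qquad (35)$$

where $\tilde G^k := \sum_{i=k}^{\tau(k)-1} \mathcal{J}^i(G^i)$, $G^i := \nabla f(X^i)$, and $\tilde{\mathcal{H}}^k : \Re^{m \times n} \to \Re^{m \times n}$ is the self-adjoint positive definite mapping satisfying

$$\langle D, \tilde{\mathcal{H}}^k(D) \rangle = \sum_{i=k}^{\tau(k)-1} \langle \mathcal{J}^i(D), \mathcal{H}_i(\mathcal{J}^i(D)) \rangle \quad \forall D \in \Re^{m \times n}. \qquad (36)$$

Letting $\tilde D^k := D_{\tilde{\mathcal{H}}^k}(X^k; \mathcal{N})$ and $\tilde\Delta^k := \langle G^k, \tilde D^k \rangle + P(X^k + \tilde D^k) - P(X^k)$, we then have from (35) that

$$\begin{aligned}
\sum_{i=k}^{\tau(k)-1} \Delta^i
&\le \langle \tilde G^k, \tilde D^k \rangle + \frac{1}{2} \langle \tilde D^k, \tilde{\mathcal{H}}^k(\tilde D^k) \rangle + P(X^k + \tilde D^k) - P(X^k) \\
&= \tilde\Delta^k + \langle \tilde G^k - G^k, \tilde D^k \rangle + \frac{1}{2} \langle \tilde D^k, \tilde{\mathcal{H}}^k(\tilde D^k) \rangle \\
&\le \frac{\tilde\Delta^k}{2} + \|\tilde G^k - G^k\| \, \|\tilde D^k\| \\
&\le \frac{\tilde\Delta^k}{2} + \frac{1}{\lambda} \|\tilde G^k - G^k\|^2 + \frac{\lambda}{4} \|\tilde D^k\|^2,
\end{aligned}$$

where the second inequality uses $-\tilde\Delta^k \ge \langle \tilde D^k, \tilde{\mathcal{H}}^k(\tilde D^k) \rangle$ and the third inequality uses $ab \le (a^2 + b^2)/2$ for any $a, b \in \Re$. Also, letting $L_i$ be the Lipschitz constant of $\nabla f(\cdot)_{\mathcal{J}^i}$ over $\mathcal{X}^0$, we have

$$\begin{aligned}
\|\tilde G^k - G^k\|^2
&= \sum_{i=k}^{\tau(k)-1} \|\mathcal{J}^i(\nabla f(X^i)) - \mathcal{J}^i(\nabla f(X^k))\|^2 \\
&\le \sum_{i=k}^{\tau(k)-1} L_i \|X^i - X^k\|^2
\le \sum_{i=k}^{\tau(k)-1} L_i \sum_{j=k}^{i-1} \frac{|\Delta^j|}{\lambda} (\alpha^j)^2 \\
&= \sum_{j=k}^{\tau(k)-2} \Big( \sum_{i=j+1}^{\tau(k)-1} L_i \Big) \frac{|\Delta^j|}{\lambda} (\alpha^j)^2
\le \Big( \sup_\ell \alpha^\ell \sum_{i=k}^{\tau(k)-1} L_i \Big) \sum_{j=k}^{\tau(k)-1} \frac{|\Delta^j|}{\lambda} \alpha^j \\
&\le \Big( \sup_\ell \alpha^\ell \max_\ell \Big\{ \sum_{i=\ell}^{\tau(\ell)-1} L_i \Big\} \Big) \sum_{j=k}^{\tau(k)-1} \frac{|\Delta^j|}{\lambda} \alpha^j.
\end{aligned}$$

Using $-|\Delta^i| \ge \alpha^i \Delta^i / \underline{\alpha}$ (since $\alpha^i \ge \underline{\alpha}$, where $\underline{\alpha} \overset{\rm def}{=} \min\{\inf_k \alpha^k_{\rm init},\ \beta \min\{1,\ 2\lambda(1-\sigma)/L,\ \epsilon/\sup_k \|D^k\|\}\}$ by Lemma 2) and the above two inequalities, we have

$$0 \le \frac{\tilde\Delta^k}{2} + \frac{\lambda}{4} \|\tilde D^k\|^2 - C_0 \sum_{j=k}^{\tau(k)-1} \alpha^j \Delta^j \le \frac{1}{4} \tilde\Delta^k - C_0 \sum_{j=k}^{\tau(k)-1} \alpha^j \Delta^j, \qquad (37)$$

where

$$C_0 := \frac{\sup_\ell \alpha^\ell \max_\ell \big\{ \sum_{i=\ell}^{\tau(\ell)-1} L_i \big\}}{\lambda^2} + \frac{1}{\underline{\alpha}},$$

and the second inequality uses $-\tilde\Delta^k \ge \langle \tilde D^k, \tilde{\mathcal{H}}^k(\tilde D^k) \rangle \ge \lambda \|\tilde D^k\|^2$ (it can be seen using Assumption 2 that $\lambda \le \lambda_{\min}(\tilde{\mathcal{H}}^k) \le \lambda_{\max}(\tilde{\mathcal{H}}^k) \le \bar\lambda$). By Fermat's rule [13, Theorem 10.1],

$$\tilde D^k \in \arg\min_D\ \langle G^k + \tilde{\mathcal{H}}^k(\tilde D^k), D \rangle + P(X^k + D).$$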


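Now consider any $\bar X^k \in \bar{\mathcal{X}}$ satisfying $\|X^k - \bar X^k\| = {\rm dist}(X^k, \bar{\mathcal{X}})$. Then

$$\begin{aligned}
\tilde\Delta^k
&\le \langle G^k + \tilde{\mathcal{H}}^k(\tilde D^k), \bar X^k - X^k \rangle + P(\bar X^k) - P(X^k) - \lambda_{\min}(\tilde{\mathcal{H}}^k) \|\tilde D^k\|^2 \\
&\le f(\bar X^k) - f(X^k) + \langle \tilde{\mathcal{H}}^k(\tilde D^k), \bar X^k - X^k \rangle + P(\bar X^k) - P(X^k) - \lambda \|\tilde D^k\|^2 \\
&= F(\bar X^k) - F(X^k) + \langle \tilde{\mathcal{H}}^k(\tilde D^k), \bar X^k - X^k \rangle - \lambda \|\tilde D^k\|^2,
\end{aligned}$$

where the second inequality uses the convexity of $f$, Assumption 2, and (36). Thus

$$\begin{aligned}
F(X^k) - F(\bar X^k)
&\le \langle \tilde{\mathcal{H}}^k(\tilde D^k), \bar X^k - X^k \rangle - \tilde\Delta^k - \lambda \|\tilde D^k\|^2 \\
&\le \sqrt{\bar\lambda \langle \tilde{\mathcal{H}}^k(\tilde D^k), \tilde D^k \rangle}\; r^0 - \tilde\Delta^k - \frac{\lambda}{2} \|\tilde D^k\|^2 \\
&\le \sqrt{-4\bar\lambda C_0 \sum_{j=k}^{\tau(k)-1} \alpha^j \Delta^j}\; r^0 - 2 C_0 \sum_{j=k}^{\tau(k)-1} \alpha^j \Delta^j, \qquad (38)
\end{aligned}$$

where the second inequality uses the self-adjoint positive definite property of $\tilde{\mathcal{H}}^k$ and the last inequality uses $-\tilde\Delta^k \ge \langle \tilde D^k, \tilde{\mathcal{H}}^k(\tilde D^k) \rangle$ and (37).

Finally, we have from the Armijo rule (12) that

$$F(X^{j+1}) - F(X^j) \le \sigma \alpha^j \Delta^j, \qquad j = k, k+1, \dots, \tau(k)-1.$$

Summing this over $j = k, k+1, \dots, \tau(k)-1$ yields that

$$F(X^{\tau(k)}) - F(X^k) \le \sigma \sum_{j=k}^{\tau(k)-1} \alpha^j \Delta^j. \qquad (39)$$

Combining (39) with (38) yields

$$e^k \le C_2 \sqrt{e^k - e^{\tau(k)}} + C_1 (e^k - e^{\tau(k)}), \qquad (40)$$

where $C_1 := 2C_0/\sigma$ and $C_2 := \sqrt{2\bar\lambda C_1}\, r^0$.

Now we consider the two cases: (i) $\sqrt{e^k - e^{\tau(k)}} \ge C_2/C_1$ for all $k \ge 0$; (ii) $\sqrt{e^k - e^{\tau(k)}} < C_2/C_1$ for some $k \ge 0$.

In case (i), (40) implies that we have $e^k \le 2C_1 (e^k - e^{\tau(k)})$, and rearranging terms yields

$$e^{\tau(k)} \le \Big(1 - \frac{1}{2C_1}\Big) e^k.$$

Thus if $\sqrt{e^k - e^{\tau(k)}} \ge C_2/C_1$ $\forall k \ge 0$, then $e^k \le \epsilon$ whenever

$$e^0 \Big(1 - \frac{1}{2C_1}\Big)^{t_k} \le \epsilon \quad\text{or, equivalently,}\quad t_k \ge \ln\Big(\frac{e^0}{\epsilon}\Big) \Big/ \ln\Big(\Big(1 - \frac{1}{2C_1}\Big)^{-1}\Big).$$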


In case (ii), when $\sqrt{e^k - e^{\tau(k)}} < C_2/C_1$, the inequality (40) implies that we have $e^k \le 2C_2 \sqrt{e^k - e^{\tau(k)}}$, and rearranging terms yields

$$e^{\tau(k)} \le e^k - \frac{(e^k)^2}{4C_2^2}. \qquad (41)$$

On the other hand, when $\sqrt{e^k - e^{\tau(k)}} \ge C_2/C_1$, the inequality in (41) also holds if $e^k \le 2C_2^2/C_1 = 4\bar\lambda (r^0)^2$. Note that the latter inequality must hold if we pick $\bar\lambda$ to be sufficiently large. We may assume $e^k > 0$ $\forall k \ge 0$ (otherwise, $e^k \le \epsilon$) in (41). Then we consider the reciprocals $\xi^j = 1/e^j$. By (41) and $e^k > 0$, we have $0 \le e^k/(4C_2^2) < 1$. Thus (41) yields

$$\xi^{\tau(k)} - \xi^k \ge \frac{1}{e^k \big(1 - e^k/(4C_2^2)\big)} - \frac{1}{e^k} = \frac{1}{4C_2^2 - e^k} \ge \frac{1}{4C_2^2}.$$

This implies that if $\sqrt{e^k - e^{\tau(k)}} \le C_2/C_1$ $\forall k \ge 0$, then $\xi^{\tau^{t_k}(0)} = \xi^0 + \sum_{i=0}^{t_k-1} \big(\xi^{\tau^{i+1}(0)} - \xi^{\tau^i(0)}\big) \ge \frac{t_k}{4C_2^2}$ and consequently

$$e^{\tau^{t_k}(0)} = \frac{1}{\xi^{\tau^{t_k}(0)}} \le \frac{4C_2^2}{t_k}.$$

It follows that $e^k \le \epsilon$ whenever

$$t_k \ge \frac{4C_2^2}{\epsilon}.$$
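The two cycle counts obtained in cases (i) and (ii) can be compared numerically. The following Python sketch evaluates both bounds and iterates the worst case of the recursion (41); the function names are hypothetical, and the constants passed in are arbitrary illustrative values rather than quantities taken from the paper.

```python
import math


def cycles_linear(e0, eps, c1):
    """Smallest integer t with e0 * (1 - 1/(2*c1))**t <= eps (case (i))."""
    if e0 <= eps:
        return 0
    rate = 1.0 - 1.0 / (2.0 * c1)
    if rate <= 0.0:
        return 1
    t = math.ceil(math.log(e0 / eps) / math.log(1.0 / rate))
    # Guard against rounding in the logarithms.
    while e0 * rate ** t > eps:
        t += 1
    return t


def cycles_sublinear(eps, c2):
    """Cycle count 4*c2^2/eps from case (ii), rounded up."""
    return math.ceil(4.0 * c2 ** 2 / eps)


def simulate_case_ii(e0, c2, cycles):
    """Iterate the worst case of (41): e <- e - e^2 / (4*c2^2)."""
    e = e0
    for _ in range(cycles):
        e -= e * e / (4.0 * c2 ** 2)
    return e
```

With $e^0 = 1$, $\epsilon = 10^{-3}$ and $C_1 = 2$, the linear-rate count is 25 cycles, whereas the sublinear bound with $C_2 = 1$ already asks for about 40 cycles at the much looser tolerance $\epsilon = 0.1$.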

Acknowledgments We thank two anonymous referees for their detailed comments and suggestions to improve the paper. The first author was supported in part by BSRP (2010-0510-1-3) through NRF funded by the Ministry of Education, Science and Technology.

References

1. Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, New York (2000)
2. Dahl, J., Vandenberghe, L., Roychowdhury, V.: Covariance selection for non-chordal graphs via chordal embedding. Optim. Methods Softw. 23, 501–520 (2008)
3. D'Aspremont, A., Banerjee, O., El Ghaoui, L.: First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 30, 56–66 (2008)
4. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008)
5. Goebel, R., Rockafellar, R.T.: Local strong convexity and local Lipschitz continuity of the gradients of convex functions. J. Convex Anal. 15, 263–270 (2008)
6. Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press, Cambridge (1999)
7. Lu, Z.: Smooth optimization approach for covariance selection. SIAM J. Optim. 19, 1807–1827 (2009)
8. Lu, Z.: Adaptive first-order methods for general sparse inverse covariance selection. SIAM J. Matrix Anal. Appl. 31, 2000–2016 (2010)
9. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Doklady 27, 372–376 (1983)
10. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)
11. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)
12. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
13. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, New York (1998)
14. Saad, Y.: Iterative Methods for Sparse Linear Systems. PWS Publishing Company, Boston (1996)
15. Scheinberg, K., Rish, I.: Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. In: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 6323, pp. 196–212 (2010)
16. Tseng, P.: Convergence of block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109, 473–492 (2001)
17. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. Ser. B 117, 387–423 (2009)
18. Wang, C., Sun, D., Toh, K.-C.: Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J. Optim. 20, 2994–3013 (2010)
19. Wong, F., Carter, C.K., Kohn, R.: Efficient estimation of covariance selection models. Biometrika 90, 809–830 (2003)
20. Yuan, X.: Alternating direction methods for sparse covariance selection. http://www.optimzationonline.org/DB_FILE/2009/09/2390.pdf
