© 2009 Society for Industrial and Applied Mathematics

SIAM J. OPTIM. Vol. 20, No. 3, pp. 1157–1170

A RANDOMIZED INCREMENTAL SUBGRADIENT METHOD FOR DISTRIBUTED OPTIMIZATION IN NETWORKED SYSTEMS*

BJÖRN JOHANSSON†, MABEN RABI†, AND MIKAEL JOHANSSON†

Abstract. We present an algorithm that generalizes the randomized incremental subgradient method with fixed stepsize due to Nedić and Bertsekas [SIAM J. Optim., 12 (2001), pp. 109–138]. Our novel algorithm is particularly suitable for distributed implementation and execution, and possible applications include distributed optimization, e.g., parameter estimation in networks of tiny wireless sensors. The stochastic component in the algorithm is described by a Markov chain, which can be constructed in a distributed fashion using only local information. We provide a detailed convergence analysis of the proposed algorithm and compare it with existing incremental subgradient methods, both deterministic and randomized.

Key words. convex programming, subgradient optimization, distributed optimization, Markov chain

AMS subject classifications. 65K05, 90C25

DOI. 10.1137/08073038X

1. Introduction. We consider the following convex optimization problem

(1.1)
\[
  \operatorname*{minimize}_{x} \;\; \sum_{n=1}^{N} f_n(x)
  \quad \text{subject to} \quad x \in X,
\]

where f_n : R^η → R are convex functions and X ⊆ R^η is a convex set. Let f(x) = \sum_{n=1}^{N} f_n(x), and let f* and x* denote the optimal value and the optimizer of (1.1), respectively. To the problem we associate a connected N-node network, specified by the graph G = (V, E). The problem can now be interpreted as a networked system, where each node incurs a loss f_n(x) of operating at x and the nodes cooperate to find the optimal operating point (the optimizer x* of (1.1)). In other words, each component of the objective function corresponds to a node in a network; see Figure 1.1 for an example setup.

The goal of this paper is to devise and analyze a novel distributed algorithm that iteratively solves (1.1) by passing an estimate of the optimizer between neighboring nodes in the network. There is substantial interest in such algorithms, since centralized algorithms scale poorly with the number of nodes and are less resilient to failure of the central node. Moreover, peer-to-peer algorithms, which only exchange data between immediate neighbors, are attractive, since they make minimal assumptions on the networking support required for message passing between nodes. Application examples include estimation in sensor networks, coordination in multiagent systems, and resource allocation in wireless systems; see, e.g., [8, 17].

*Received by the editors July 16, 2008; accepted for publication (in revised form) April 13, 2009; published electronically August 19, 2009. Parts of this research have previously been published in the proceedings of the IEEE Conference on Decision and Control 2007. The research was partially funded by the Swedish Research Council, the Swedish Foundation for Innovation Systems, and the European Commission. http://www.siam.org/journals/siopt/20-3/73038.html

†School of Electrical Engineering, Royal Institute of Technology (KTH), 100 44 Stockholm, Sweden ([email protected], [email protected], [email protected]).


[Figure 1.1: three nodes, labeled 1, 2, and 3, with associated loss functions f_1(·), f_2(·), f_3(·).]

Fig. 1.1. Example setup with three nodes. A line between nodes implies that they are connected.
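To make the setup concrete, consider a hypothetical instance of (1.1) in the three-node network of Figure 1.1 (our illustration, not part of the original text): distributed least-squares estimation, where node n privately holds data (A_n, b_n) and only the iterate is ever communicated.

\[
  f_n(x) = \|A_n x - b_n\|_2^2,
  \qquad
  g_n(x) = 2 A_n^T (A_n x - b_n) \in \partial f_n(x),
  \qquad n = 1, 2, 3.
\]

Here each f_n is differentiable, so the subdifferential is the singleton {∇f_n(x)}, and the bounded-subgradient assumption introduced later (Assumption 1(iii)) holds, e.g., whenever the constraint set X is compact.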

One popular way of solving (1.1) is to use a subgradient method. Early references on this type of method include [19, 5, 15], while more recent and complete discussions can be found in, e.g., [20, 16, 1]. The popularity of these methods stems from their ease of implementation and their ability to handle nondifferentiable objective functions. Another key property is that subgradient methods can often be executed in a distributed fashion. The prototype subgradient iteration for constrained convex minimization is x_{k+1} = P_X{x_k − α_k h_k}, where P_X{·} denotes the Euclidean projection on the feasible set X, α_k is a stepsize, and h_k is a subgradient of the objective function at x_k; there exist quite a few variations and extensions, but none of them fit our needs. Naturally, the structure of the problem can be exploited, and tailored algorithms, so-called incremental subgradient methods, can be used. This class of algorithms was first proposed in [10] and is based on the iteration

(1.2)

\[
  x_{k+1} = P_X\{x_k - \alpha_k\, g_{n_k}(x_k)\},
\]

where g_{n_k}(x_k) is a subgradient of the function f_{n_k} at x_k, defined by

(1.3)
\[
  g_{n_k}(x_k) \in \partial f_{n_k}(x_k)
  = \bigl\{\, y \in \mathbb{R}^{\eta} \;\big|\; f_{n_k}(z) \ge f_{n_k}(x_k) + y^T (z - x_k),\ \forall z \in X \,\bigr\}.
\]

The set ∂f_{n_k}(x_k) is called the subdifferential; it is a nonempty, closed, and bounded convex set when f_{n_k} is a finite convex function on R^η [18, Theorem 23.4]. Depending on how n_k and α_k are chosen, the resulting algorithms have rather diverse properties, and the stepsize α_k typically needs to be diminishing to ensure asymptotic convergence of the iterates to an optimizer. To the authors' knowledge, most results on deterministic incremental subgradient methods with stepsizes satisfying the very common assumption \sum_{k=1}^{\infty} \alpha_k = \infty are covered and unified in [11].

Although incremental subgradient methods were originally devised to boost convergence speed, they can also be used as decentralized mechanisms for optimization. A simple decentralized algorithm, proposed and analyzed in [13], is to use (1.2) with a fixed stepsize and let n_k cycle deterministically over the set {1, ..., N} in a round-robin fashion. We call this algorithm the deterministic incremental subgradient method (DISM). In [13], another variant, also suitable for distributed implementation, is proposed: a randomized algorithm where n_k is a sequence of independent and identically distributed random variables taking values in {1, ..., N} with equal probability. We call this algorithm the randomized incremental subgradient method (RISM). Both selection rules are illustrated in the sketch below. If the problem setup permits, it is also possible to use incremental gradient methods [2]. In all of these methods, the iterate can be interpreted as being passed around between the nodes in the network. Finally, (1.2) is similar to the iterations used in stochastic approximation [12]. However, in stochastic approximation algorithms the stepsize is typically diminishing, not fixed as in the algorithm we propose and analyze in this paper.
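To make the two selection rules concrete, the following is a minimal sketch of iteration (1.2) (our illustration, not code from the paper); `subgrads`, `project`, and the NumPy-array iterate are assumed to be supplied by the user.

```python
import random

def incremental_subgradient(x0, subgrads, project, alpha, num_iters,
                            rule="cyclic", seed=0):
    """Iteration (1.2) with a fixed stepsize alpha.

    x0       -- initial iterate (a NumPy array), assumed to lie in X.
    subgrads -- list of N callables; subgrads[n](x) returns a
                subgradient of f_n at x.
    project  -- callable implementing the Euclidean projection P_X.
    rule     -- "cyclic" gives the DISM (round-robin n_k);
                "uniform" gives the RISM (i.i.d. uniform n_k).
    """
    rng = random.Random(seed)
    N = len(subgrads)
    x = x0
    for k in range(num_iters):
        n_k = k % N if rule == "cyclic" else rng.randrange(N)
        x = project(x - alpha * subgrads[n_k](x))  # one incremental step
    return x
```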


In the remainder of this paper, we develop an algorithm that is based on (1.2), but where the sequence n_k is constructed such that only neighbors need to communicate with each other (in networking terms, this means that the network does not need to provide any multihop routing mechanism for the message passing). This is in contrast with both the RISM and the DISM, where nodes far apart need to communicate with each other. The outline is as follows: in section 2 we present the novel algorithm, and in section 3 we analyze its convergence properties. In section 4 we compare, in the sense of performance bounds, the novel algorithm with existing algorithms. Finally, section 5 concludes the paper with a discussion.

2. Algorithm. We associate an N-state time-homogeneous Markov chain, MC, with the optimization problem (1.1). We make the following assumptions.

Assumption 1.
(i) The functions f_n : R^η → R are convex and the set X is convex, nonempty, and closed.
(ii) The Markov chain MC is irreducible, aperiodic, and its stationary distribution is uniform.
(iii) The subgradients are bounded: sup{ ||z||_2 | z ∈ ∂f_n(x), n ∈ {1, ..., N}, x ∈ X } ≤ C.

Remark. In general, the subgradients of a convex function are not bounded, and the last assumption may seem rather restrictive. However, in several important cases the assumption holds; e.g., when each f_n(·) is the pointwise maximum of a finite set of linear functions, or when the set X is compact.

We are now ready to define our novel algorithm, which we denote the Markov incremental subgradient method (MISM). The iterations are defined as follows:

(2.1)

\[
  x_{k+1} = P_X\{x_k - \alpha\, g_{w_k}(x_k)\}, \qquad k \ge 0,
\]

where w_k is the state of MC and x_0 ∈ X. Note that x_k ∈ X for all integers k ≥ 0 by construction.

Remark. The iterations (2.1) are interesting in their own right, and they generalize the DISM and the RISM. As we will see in section 4, the MISM reduces to the DISM or the RISM by choosing MC appropriately. However, the MISM is particularly interesting in the context of distributed implementation. As we will see in the next section, by choosing MC in a special way, we can interpret the iterations as an estimate of the optimizer that is passed between neighboring nodes and thereby iteratively improved; a minimal sketch of this token-passing view follows.
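A minimal sketch of iterations (2.1) (our illustration; `subgrads` and `project` are the same hypothetical user-supplied callables as in the sketch in section 1, and P can, e.g., be built as in Lemma 1 below):

```python
import random

def mism(x0, subgrads, project, alpha, P, w0=0, num_iters=1000, seed=0):
    """Iterations (2.1): the active component is the state w_k of a
    Markov chain with row-stochastic transition matrix P, where
    P[i][j] is the probability of jumping from state i to state j.

    In a networked implementation, (x_k, w_k) is a token: node w_k
    performs the local subgradient update and forwards the token to
    a neighbor drawn according to row w_k of P.
    """
    rng = random.Random(seed)
    x, w = x0, w0
    N = len(subgrads)
    for _ in range(num_iters):
        x = project(x - alpha * subgrads[w](x))     # update at node w
        w = rng.choices(range(N), weights=P[w])[0]  # forward the token
    return x
```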


2.1. Markov chain for distributed execution. In this section we investigate what is needed to implement (2.1) in a decentralized fashion. We make the following assumption on the graph associated with the optimization problem (1.1).

Assumption 2. The undirected graph G = (V, E) is composed of N nodes and is connected.

The assumption guarantees that there is a path between any pair of nodes. The question is: how do we construct, using only local information, the transition matrix P of MC such that the iterate x_k only jumps to an adjacent node? If the sparsity constraints [P]_{ij} = 0 for (i, j) ∉ E are fulfilled, then the state of MC can only jump from state i to state j if (i, j) ∈ E. The sparsity constraints therefore imply that the iterate in (2.1) is passed to a node that is adjacent to the current node. It turns out that there is a simple way to find such a Markov chain using the so-called Metropolis–Hastings scheme, and we have the following lemma; see, e.g., [3].

Lemma 1. Under Assumption 2, MC fulfills Assumption 1 and the sparsity constraints [P]_{ij} = 0 for (i, j) ∉ E, if the elements of the transition probability matrix of MC are set to

(2.2)
\[
  [P]_{ij} =
  \begin{cases}
    \min\{1/d_i,\, 1/d_j\} & \text{if } (i, j) \in E \text{ and } i \ne j,\\[2pt]
    \sum_{(i,k) \in E} \max\{0,\, 1/d_i - 1/d_k\} & \text{if } i = j,\\[2pt]
    0 & \text{otherwise},
  \end{cases}
\]

where d_i is the number of edges of node i. Note that each node only needs to know its own number of edges and its neighbors' numbers of edges in order to construct its part of the Markov chain; a construction sketch follows.
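In code, the construction (2.2) can be sketched as follows (ours; the adjacency-list input format is an assumption). Since row i depends only on d_i and the neighbors' degrees, each node can compute its own row locally.

```python
def metropolis_hastings_matrix(neighbors):
    """Transition matrix (2.2) for an undirected connected graph.

    neighbors -- dict mapping node index (0..N-1) to the set of
                 adjacent node indices.
    """
    N = len(neighbors)
    P = [[0.0] * N for _ in range(N)]
    for i in range(N):
        d_i = len(neighbors[i])
        for j in neighbors[i]:
            # off-diagonal entries: min(1/d_i, 1/d_j) for (i, j) in E
            P[i][j] = min(1.0 / d_i, 1.0 / len(neighbors[j]))
        # diagonal entry collects the leftover probability mass
        P[i][i] = sum(max(0.0, 1.0 / d_i - 1.0 / len(neighbors[k]))
                      for k in neighbors[i])
    return P

# Example: the three-node line graph of Figure 1.1 gives rows
# (1/2, 1/2, 0), (1/2, 0, 1/2), (0, 1/2, 1/2).
P = metropolis_hastings_matrix({0: {1}, 1: {0, 2}, 2: {1}})
```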

3. Convergence analysis. To show convergence we need some notation and four lemmas.

3.1. Technical preliminaries. Denote the starting state of MC by i. Under Assumption 1 and with probability 1, all states of MC are visited equally often. Thus, we can form the subsequence {x̂_k}_{k=0}^∞ of {x_l}_{l=0}^∞ by sampling it whenever the Markov chain visits the state i, i.e., whenever w_l = i. For example, if w_0 = i, w_3 = i, and w_5 = i, then

(3.1)
\[
  \begin{array}{l}
    w_0,\; w_1,\; w_2,\; w_3,\; w_4,\; w_5,\; \dots\\[2pt]
    \underbrace{\,x_0 = \hat{x}_0,\; x_1,\; x_2\,}_{R_0^i}\;
    \underbrace{\,x_3 = \hat{x}_1,\; x_4\,}_{R_1^i}\;
    x_5 = \hat{x}_2,\; \dots
  \end{array}
\]

where x̂_0 = x_0, x̂_1 = x_3, and x̂_2 = x_5. In addition, let R_k^i be the recurrence time for state i,

(3.2)
\[
  R_k^i =
  \begin{cases}
    \inf\Bigl\{\, t - \sum_{m=0}^{k-1} R_m^i \;\Big|\; w_t = i,\; t \ge \sum_{m=0}^{k-1} R_m^i + 1,\; t \in \mathbb{N} \,\Bigr\}, & k > 0,\\[6pt]
    \inf\{\, t \mid w_t = i,\; t \ge 1,\; t \in \mathbb{N} \,\}, & k = 0,\\[2pt]
    0, & k < 0.
  \end{cases}
\]

Successive recurrence times to a state form an independent and identically distributed sequence of random variables, due to the strong Markov property [14, Theorem 1.4.2], and we note that the statistics of R_k^i do not depend on k. Also note that R_k^i is independent of x̂_k, x̂_{k−1}, x̂_{k−2}, .... Furthermore, we let v_k^{i,j} be the random number of visits to state j during the time interval [\sum_{m=0}^{k-1} R_m^i + 1, \sum_{m=0}^{k} R_m^i].

The first lemma concerns the average number of visits to other states over a recurrence time.

Lemma 2. Under Assumption 1, we have that
\[
  \bigl( E[v_k^{i,1}], \dots, E[v_k^{i,N}] \bigr) = \mathbf{1}_N^T
  \quad \text{and} \quad
  E\bigl[R_k^i\bigr] = N, \qquad \text{for all } i = 1, \dots, N,\; k \ge 0,
\]
where 1_N denotes the N-dimensional column vector with all entries equal to one.

Proof. Note that the statistics of v_k^{i,j} and R_k^i do not depend on k, due to the strong Markov property. From [14, Theorem 1.7.5], we have that
\[
  \bigl( E[v_k^{i,1}], \dots, E[v_k^{i,N}] \bigr) P = \bigl( E[v_k^{i,1}], \dots, E[v_k^{i,N}] \bigr).
\]


Furthermore, we know that the transition matrix P has only one eigenvector with eigenvalue 1, namely the invariant distribution [9, Theorem 4.1.6]. Since P is assumed to have a uniform stationary distribution, we have that (E[v_k^{i,1}], ..., E[v_k^{i,N}]) = 1_N^T, which is the desired result. Finally, E[R_k^i] = \sum_{j=1}^{N} E[v_k^{i,j}] = N.

The second lemma concerns the second moment of the recurrence times, E[(R_k^i)^2]; see, e.g., [9, Theorem 4.5.2].

Lemma 3. Under Assumption 1, the second moment of the recurrence time R_k^i is finite and given by
\[
  E\bigl[(R_k^i)^2\bigr] = 2[\Gamma]_{ii} N^2 - N,
\]
with Γ = (I_N − P + lim_{n→∞} P^n)^{−1}, where I_N is the N × N identity matrix.

We will also need the additional notion of first passage time, F_j^i = inf{t ∈ N | w_t = i, w_0 = j, t ≥ 1}, which is the number of steps MC needs to hit the state i for the first time after starting in state j. Note that E[R_k^i] = E[F_i^i] and E[(R_k^i)^2] = E[(F_i^i)^2]. The first passage times fulfill the following lemma; see, e.g., [9, Theorem 4.4.7] and [9, Theorem 4.5.1].

Lemma 4. Under Assumption 1, the first and second moments of the first passage times are given by
\[
  E\bigl[F_j^i\bigr] = [A]_{ji} \quad \text{and} \quad E\bigl[(F_j^i)^2\bigr] = [B]_{ji},
\]
with
\[
  A = N\bigl(I_N - \Gamma + E\,\mathrm{diag}(\Gamma)\bigr)
  \quad \text{and} \quad
  B = A\bigl(2N\,\mathrm{diag}(\Gamma) - I_N\bigr) + 2\bigl(\Gamma A - E\,\mathrm{diag}(\Gamma A)\bigr),
\]
where Γ = (I_N − P + lim_{n→∞} P^n)^{−1}, E = 1_N 1_N^T, and diag(X) is the matrix with the same diagonal elements as the matrix X and all other elements set to 0.
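Lemmas 3 and 4 make the recurrence and first-passage moments directly computable from P. A sketch under Assumption 1 (so that lim_{n→∞} Pⁿ = (1/N)·1·1ᵀ), with P a NumPy array:

```python
import numpy as np

def recurrence_second_moments(P):
    """E[(R^i)^2] for each state i (Lemma 3) and K = max_i E[(R^i)^2].

    Under Assumption 1 the stationary distribution is uniform, so
    lim_{n->oo} P^n = (1/N) * ones((N, N)).
    """
    N = P.shape[0]
    P_inf = np.full((N, N), 1.0 / N)
    Gamma = np.linalg.inv(np.eye(N) - P + P_inf)  # fundamental matrix
    second_moments = 2.0 * N**2 * np.diag(Gamma) - N
    return second_moments, second_moments.max()

# Sanity check: for P = (1/N) * ones((N, N)), Gamma = I and every
# second moment equals 2*N**2 - N, the value quoted in section 4.
```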

The last lemma establishes a bounding inequality that we will use in the convergence proof.

Lemma 5. Under Assumption 1 and with probability 1, the sequence {x̂_k}_{k=0}^∞, formed by sampling the sequence {x_l}_{l=0}^∞ generated by (2.1) whenever w_l = i, fulfills, for any y ∈ X,

(3.3)
\[
  E\bigl[\|\hat{x}_{k+1} - y\|_2^2 \,\big|\, \hat{x}_k\bigr]
  \le \|\hat{x}_k - y\|_2^2 - 2\alpha\bigl(f(\hat{x}_k) - f(y)\bigr) + \alpha^2 C^2 K,
\]
with K = max_i E[(R_k^i)^2] < ∞.

Proof. In this proof, we need to use both sequences {x̂_k}_{k=0}^∞ and {x_l}_{l=0}^∞, and we need to keep track of which elements correspond to each other. For this purpose, let l = \sum_{m=0}^{k-1} R_m^i, so that x_l = x̂_k and x_{l+R_k^i} = x̂_{k+1}. Using the nonexpansion property of the Euclidean projection, the definition of a subgradient, and the assumption that the subgradients are bounded, we have that, for any y ∈ X,
\[
  \|x_{l+1} - y\|_2^2
  \le \|x_l - y\|_2^2 - 2\alpha\, g_{w_l}(x_l)^T (x_l - y) + \alpha^2 C^2
  \le \|x_l - y\|_2^2 - 2\alpha\bigl(f_{w_l}(x_l) - f_{w_l}(y)\bigr) + \alpha^2 C^2.
\]


Along the same lines of reasoning, we get the family of inequalities
\[
\begin{aligned}
  \|x_{l+1} - y\|_2^2 &\le \|x_l - y\|_2^2 - 2\alpha\bigl(f_{w_l}(x_l) - f_{w_l}(y)\bigr) + \alpha^2 C^2,\\
  \|x_{l+2} - y\|_2^2 &\le \|x_{l+1} - y\|_2^2 - 2\alpha\bigl(f_{w_{l+1}}(x_{l+1}) - f_{w_{l+1}}(y)\bigr) + \alpha^2 C^2,\\
  &\;\;\vdots\\
  \|x_{l+R_k^i} - y\|_2^2 &\le \|x_{l+R_k^i-1} - y\|_2^2
    - 2\alpha\bigl(f_{w_{l+R_k^i-1}}(x_{l+R_k^i-1}) - f_{w_{l+R_k^i-1}}(y)\bigr) + \alpha^2 C^2.
\end{aligned}
\]
Combining all of them, we get
\[
  \|x_{l+R_k^i} - y\|_2^2
  \le \|x_l - y\|_2^2
  - 2\alpha \sum_{j=0}^{R_k^i-1} \bigl(f_{w_{l+j}}(x_{l+j}) - f_{w_{l+j}}(y)\bigr)
  + R_k^i\, \alpha^2 C^2,
\]
which can be rewritten as

(3.4)
\[
  \|x_{l+R_k^i} - y\|_2^2
  \le \|x_l - y\|_2^2
  - 2\alpha \sum_{j=0}^{R_k^i-1} \bigl(f_{w_{l+j}}(x_{l+j}) - f_{w_{l+j}}(x_l)\bigr)
  - 2\alpha \sum_{j=0}^{R_k^i-1} \bigl(f_{w_{l+j}}(x_l) - f_{w_{l+j}}(y)\bigr)
  + R_k^i\, \alpha^2 C^2.
\]

Notice that
\[
  f_{w_{l+j}}(x_l) - f_{w_{l+j}}(x_{l+j})
  \le \|g_{w_{l+j}}(x_l)\|_2\, \|x_{l+j} - x_l\|_2
  \le C\, \|x_{l+j} - x_l\|_2
  \le \alpha j C^2.
\]
This and the fact that 2α \sum_{j=0}^{R_k^i-1} \alpha j C^2 = ((R_k^i)^2 - R_k^i)\, \alpha^2 C^2 enable us to rewrite inequality (3.4) as follows:
\[
  \|x_{l+R_k^i} - y\|_2^2
  \le \|x_l - y\|_2^2
  - 2\alpha \sum_{j=0}^{R_k^i-1} \bigl(f_{w_{l+j}}(x_l) - f_{w_{l+j}}(y)\bigr)
  + \alpha^2 C^2 (R_k^i)^2.
\]
Using v_k^{i,j} as defined in Lemma 2, we express (3.4) as

(3.5)
\[
  \|x_{l+R_k^i} - y\|_2^2
  \le \|x_l - y\|_2^2
  - 2\alpha \sum_{j=1}^{N} v_k^{i,j} \bigl(f_j(x_l) - f_j(y)\bigr)
  + \alpha^2 C^2 (R_k^i)^2.
\]

Now, due to the Markov property and Lemma 2, we have

(3.6)
\[
  E\Bigl[\, \sum_{j=1}^{N} v_k^{i,j}\bigl(f_j(x_l) - f_j(y)\bigr) \,\Big|\, x_l, w_l \Bigr]
  = \sum_{j=1}^{N} E[v_k^{i,j}]\bigl(f_j(x_l) - f_j(y)\bigr)
  = f(x_l) - f(y).
\]
Define

(3.7)
\[
  K = \max_{j \in \{1, \dots, N\}} E\bigl[(R_k^j)^2\bigr],
\]


which is known to be finite and easily computable¹ from Lemma 3. Note that K ≥ E[(R_k^j)^2] = E[(R_k^j)^2 | x_l, w_l] for any j ∈ {1, ..., N}. By taking the conditional expectation of (3.5) with respect to x_l and w_l, and using (3.6) and (3.7), we obtain the desired result.

3.2. Proof of convergence. We are now ready for the main results of this paper.

Theorem 1. Let {x_l}_{l=0}^∞ be generated by (2.1). Under Assumption 1 and with probability 1, the sequence {x_l}_{l=0}^∞ fulfills
\[
  \begin{cases}
    \liminf_{l \to \infty} f(x_l) = f^\star & \text{if } f^\star = -\infty,\\[4pt]
    \liminf_{l \to \infty} f(x_l) \le f^\star + \dfrac{\alpha C^2 K}{2} & \text{if } f^\star > -\infty.
  \end{cases}
\]

Proof. Denote the starting state of MC by i. With probability 1, all states are visited equally often, and thus we can form the subsequence {x̂_k}_{k=0}^∞ of {x_l}_{l=0}^∞ by sampling it whenever MC visits the state i; see (3.1) for an illustration of this sampling. We attack the problem using an approach similar to the one used in the proof of Proposition 3.1 in [13]. The idea of the proof is to show that the iterates eventually enter a special level set. For this purpose, let M and J be positive integers. We will consider the sequences {x_l}_{l=J}^∞ and {x̂_k}_{k=Ĵ}^∞, where Ĵ is chosen such that the first element of {x̂_k}_{k=Ĵ}^∞ corresponds to the first element of {x_l}_{l=J}^∞. This implies that x̂_Ĵ = x_J and Ĵ ≤ J.

If the function f(·) is such that sup_{x∈X} f(x) < f* + 1/M, then the iterates trivially fulfill
\[
  x_l \in \Bigl\{\, x \in X \;\Big|\; f(x) < f^\star + \frac{1}{M} \,\Bigr\}, \qquad \forall l \ge 0.
\]
Otherwise, if sup_{x∈X} f(x) ≥ f* + 1/M, we can pick y_M ∈ X such that
\[
  f(y_M) =
  \begin{cases}
    -F & \text{if } f^\star = -\infty,\\[2pt]
    f^\star + \dfrac{1}{M} & \text{if } f^\star > -\infty,
  \end{cases}
\]
for some F ≥ M. We now define the special level set, L_M, which the iterates eventually will enter,
\[
  L_M = \Bigl\{\, x \in X \;\Big|\; f(x) \le f(y_M) + \frac{1}{M} + \frac{\alpha C^2 K}{2} \,\Bigr\}.
\]
Note that this set includes y_M. To simplify the analysis, we derive a stopped sequence from {x̂_k}_{k=Ĵ}^∞ by defining the sequence {x̃_k}_{k=Ĵ}^∞ as follows:
\[
  \tilde{x}_k =
  \begin{cases}
    \hat{x}_k & \text{if } \hat{x}_j \notin L_M \;\forall j \in \{\hat{J}, \dots, k\},\\
    y_M & \text{otherwise}.
  \end{cases}
\]

¹The constant K can easily be computed if the transition probability matrix P is known (see the sketch following Lemma 4); in a decentralized implementation, however, the nodes typically do not know P. This is not a problem, since K is not needed in order to execute the algorithm; it is only used in the convergence proofs and the performance bounds.


When x̃_k ∉ L_M, setting y = y_M in (3.3) of Lemma 5 gives
\[
  E\bigl[\|\tilde{x}_{k+1} - y_M\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - y_M\|_2^2 + \alpha^2 C^2 K - 2\alpha\bigl(f(\tilde{x}_k) - f(y_M)\bigr).
\]
On the other hand, whenever x̃_k ∈ L_M, the sequence is forced to stay at y_M, and we have the trivial inequality
\[
  E\bigl[\|\tilde{x}_{k+1} - y_M\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - y_M\|_2^2 + 0.
\]
If we define z_k through
\[
  z_k =
  \begin{cases}
    2\alpha\bigl(f(\tilde{x}_k) - f(y_M)\bigr) - \alpha^2 C^2 K & \text{if } \tilde{x}_k \notin L_M,\\
    0 & \text{if } \tilde{x}_k \in L_M,
  \end{cases}
\]
we can write

(3.8)
\[
  E\bigl[\|\tilde{x}_{k+1} - y_M\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - y_M\|_2^2 - z_k, \qquad \forall k \ge \hat{J}.
\]
When x̃_k ∉ L_M, we have
\[
  f(\tilde{x}_k) - f(y_M) \ge \frac{1}{M} + \frac{\alpha C^2 K}{2},
\]
which is equivalent to

(3.9)
\[
  z_k = 2\alpha\bigl(f(\tilde{x}_k) - f(y_M)\bigr) - \alpha^2 C^2 K \ge \frac{2\alpha}{M}.
\]
If we take the expectation of (3.8), the result is
\[
  E\bigl[\|\tilde{x}_{k+1} - y_M\|_2^2\bigr]
  \le E\bigl[\|\tilde{x}_k - y_M\|_2^2\bigr] - E[z_k], \qquad \forall k \ge \hat{J},
\]
and starting from x̃_Ĵ, we recursively get

(3.10)
\[
  E\bigl[\|\tilde{x}_{k+1} - y_M\|_2^2\bigr]
  \le E\bigl[\|\tilde{x}_{\hat{J}} - y_M\|_2^2\bigr]
  - E\Bigl[\sum_{n=\hat{J}}^{k} z_n\Bigr], \qquad \forall k \ge \hat{J}.
\]
Furthermore, from the iterations (2.1) and the bounded subgradients assumption, we have that ||x̃_Ĵ − y_M||_2 ≤ ||x̂_0 − y_M||_2 + αJC = ||x_0 − y_M||_2 + αJC, which implies that
\[
  E\bigl[\|\tilde{x}_{\hat{J}} - y_M\|_2^2\bigr]
  \le \|x_0 - y_M\|_2^2 + 2\alpha J C \|x_0 - y_M\|_2 + (\alpha J C)^2.
\]
Let τ̃ be the stopping time defined as
\[
  \tilde{\tau} = \inf\{\, t \in \mathbb{N} \mid \tilde{x}_t \in L_M,\; t \ge \hat{J} \,\};
\]
then τ̃ is the number of nonzero elements in the nonnegative sequence {z_k}_{k=Ĵ}^∞. Since z_k is nonnegative, the series \sum_{k=Ĵ}^∞ z_k either converges to a finite real value or diverges to infinity. Thus, from (3.9) it follows that \sum_{k=Ĵ}^∞ z_k ≥ (2α/M) τ̃, where the left-hand side


is always defined. By letting k go to infinity in (3.10) and using the nonnegativity of a norm, we have

(3.11)
\[
  0 \le \|x_0 - y_M\|_2^2 + 2\alpha J C \|x_0 - y_M\|_2 + (\alpha J C)^2
  - E\Bigl[\sum_{n=\hat{J}}^{\infty} z_n\Bigr]
\]

(3.12)
\[
  \phantom{0} \le \|x_0 - y_M\|_2^2 + 2\alpha J C \|x_0 - y_M\|_2 + (\alpha J C)^2
  - \frac{2\alpha}{M}\, E[\tilde{\tau}],
\]
and the bound

(3.13)
\[
  E[\tilde{\tau}] \le \frac{M}{2\alpha}
  \Bigl( \|x_0 - y_M\|_2^2 + 2\alpha J C \|x_0 - y_M\|_2 + (\alpha J C)^2 \Bigr).
\]
Thus, the stopping time τ̃ is almost surely finite, and at least one element of the sequence {x̂_k}_{k=Ĵ}^∞ will be in the set L_M. Since {x̂_k}_{k=Ĵ}^∞ is a subsequence of {x_l}_{l=J}^∞, it follows that at least one element of {x_l}_{l=J}^∞ will be in the set L_M. Therefore, we have that
\[
  \inf_{l \ge J} f(x_l) \le f(y_M) + \frac{1}{M} + \frac{\alpha C^2 K}{2},
\]
and since the choice of J is arbitrary and the right-hand side is independent of J, we have that
\[
  \liminf_{l \to \infty} f(x_l) \le f(y_M) + \frac{1}{M} + \frac{\alpha C^2 K}{2}.
\]

By letting M go to infinity and noting that Lemma 5 holds for all i, we have shown the theorem.

Remark. The case f* = −∞, which implies that the optimal set {x ∈ X | f(x) = f*} is empty, is relevant even under Assumption 1. This case occurs, e.g., when the functions f_n are linear and the feasible set is X = R^η.

Theorem 2. Let {x_l}_{l=0}^∞ be generated by (2.1). If Assumption 1 holds and the set of all optimizers, X* = {x ∈ X | f(x) = f*}, is nonempty, then the sequence {x_l}_{l=0}^∞ almost surely fulfills
\[
  \min_{0 \le l \le \tau} f(x_l) \le f^\star + \frac{\alpha C^2 K}{2} + \delta,
\]
where τ is a stopping time with bounded expected value
\[
  E[\tau] \le \frac{N}{2\alpha\delta}\bigl(\operatorname{dist}_{X^\star}(x_0)\bigr)^2,
\]
where dist_{X*}(x_0) = inf{ ||x_0 − y||_2 | y ∈ X* }.

Proof. Denote the starting state of MC by i. With probability 1, all states are visited equally often, and thus we can form the subsequence {x̂_k}_{k=0}^∞ of {x_l}_{l=0}^∞ by sampling it whenever MC visits the state i; see (3.1) for an illustration of this sampling. The proof idea is the same as for Theorem 1: we show that the iterates eventually enter a special level set. If the function f(·) is such that sup_{x∈X} f(x) ≤ f* + αC²K/2 + δ or f(x_0) ≤ f* + αC²K/2 + δ, then the theorem is trivially fulfilled. Otherwise, let L_δ be the level set defined by
\[
  L_\delta = \Bigl\{\, x \in X \;\Big|\; f(x) \le f^\star + \frac{\alpha C^2 K}{2} + \delta \,\Bigr\}.
\]


Define the sequence {x̃_k}_{k=0}^∞ as follows:
\[
  \tilde{x}_k =
  \begin{cases}
    \hat{x}_k & \text{if } \hat{x}_j \notin L_\delta \;\forall j \le k,\\
    \check{x} \in X^\star & \text{otherwise},
  \end{cases}
\]
where x̌ is an arbitrary point in X*. When x̃_k ∉ L_δ, Lemma 5 gives us
\[
  E\bigl[\|\tilde{x}_{k+1} - \check{x}\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - \check{x}\|_2^2 + \alpha^2 C^2 K - 2\alpha\bigl(f(\tilde{x}_k) - f(\check{x})\bigr).
\]
Otherwise, x̃_k ∈ L_δ, the sequence stays at x̌, and we have the trivial inequality
\[
  E\bigl[\|\tilde{x}_{k+1} - \check{x}\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - \check{x}\|_2^2 + 0.
\]
By defining z_k through
\[
  z_k =
  \begin{cases}
    2\alpha\bigl(f(\tilde{x}_k) - f(\check{x})\bigr) - \alpha^2 C^2 K & \text{if } \tilde{x}_k \notin L_\delta,\\
    0 & \text{if } \tilde{x}_k \in L_\delta,
  \end{cases}
\]
we have z_k ≥ 2αδ if x̃_k ∉ L_δ, and we can write

(3.14)
\[
  E\bigl[\|\tilde{x}_{k+1} - \check{x}\|_2^2 \,\big|\, \tilde{x}_k, w_k\bigr]
  \le \|\tilde{x}_k - \check{x}\|_2^2 - z_k, \qquad \forall k.
\]
Let τ̃ be the stopping time defined as τ̃ = inf{t | x̃_t ∈ L_δ, t ≥ 0, t ∈ N}; then τ̃ is the random number of nonzero elements in the nonnegative sequence {z_k}_{k=0}^∞, and \sum_{k=0}^∞ z_k ≥ 2αδ τ̃, where the series \sum_{k=0}^∞ z_k either converges to a finite real number or diverges to infinity. By letting k go to infinity in (3.14) and using the nonnegativity of a norm, we have

(3.15)
\[
  0 \le \|\tilde{x}_0 - \check{x}\|_2^2 - E\Bigl[\sum_{n=0}^{\infty} z_n\Bigr]
  \le \|\tilde{x}_0 - \check{x}\|_2^2 - 2\alpha\delta\, E[\tilde{\tau}]
\]
and the bound

(3.16)
\[
  E[\tilde{\tau}] \le \frac{1}{2\alpha\delta}\|\tilde{x}_0 - \check{x}\|_2^2
  = \frac{1}{2\alpha\delta}\|x_0 - \check{x}\|_2^2.
\]
Now let τ be the stopping time defined as τ = inf{t | x_t ∈ L_δ, w_t = i, t ≥ 0, t ∈ N}. This means that the stopping condition is fulfilled when x_t is in the set L_δ and the Markov chain is in state i; note that f(x_τ) ≤ f* + αC²K/2 + δ. By using the recurrence time R_k^i, which counts the number of elements in the original sequence {x_l}_{l=0}^∞ between the elements in the sampled sequence {x̂_k}_{k=0}^∞, we can write
\[
  \tau = \sum_{k=1}^{\tilde{\tau}} R_{k-1}^i,
\]


where τ̃ ≥ 1 since x_0 ∉ L_δ by assumption. Since τ̃ is a stopping time for the sequence x̂_0, x̂_1, ..., the occurrence of the event {τ̃ ≥ j} is decided by the sequence x̂_0, ..., x̂_{j−1}. In particular, I{τ̃ ≥ j} = \prod_{m=0}^{j-1} I{x̂_m ∉ L_δ}, where I{·} is the indicator function of the event {·}. Furthermore, due to the construction of {x̂_k}_{k=0}^∞ and {R_k^i}_{k=0}^∞, and the Markov property of the sequence {w_k}_{k=0}^∞, the recurrence times R_{j−1}^i, R_j^i, R_{j+1}^i, ... are independent of x̂_{j−1}, x̂_{j−2}, x̂_{j−3}, .... More specifically, we have that E[I{τ̃ ≥ j} R_{j−1}^i] = E[\prod_{m=0}^{j-1} I{x̂_m ∉ L_δ}\, R_{j−1}^i] = P{τ̃ ≥ j} E[R_{j−1}^i], where P{·} denotes the probability of the event {·}. Using the previous properties and a Wald's identity type of argument (see, e.g., [4, Theorem 5.5.3]), we have
\begin{align}
  E[\tau] &= \sum_{l=1}^{\infty} E\Bigl[ I\{\tilde{\tau} = l\} \sum_{k=1}^{l} R_{k-1}^i \Bigr] \tag{3.17}\\
  &= \sum_{l=1}^{\infty} \sum_{k=1}^{l} E\bigl[ I\{\tilde{\tau} = l\}\, R_{k-1}^i \bigr]
   = \sum_{k=1}^{\infty} \sum_{l=k}^{\infty} E\bigl[ I\{\tilde{\tau} = l\}\, R_{k-1}^i \bigr] \tag{3.18}\\
  &= \sum_{k=1}^{\infty} E\bigl[ I\{\tilde{\tau} \ge k\}\, R_{k-1}^i \bigr]
   = \sum_{k=1}^{\infty} P\{\tilde{\tau} \ge k\}\, E\bigl[ R_{k-1}^i \bigr]
   = E[\tilde{\tau}]\, E[R_0^i] \tag{3.19}
\end{align}

(3.20)
\[
  \phantom{E[\tau]} \le \frac{N}{2\alpha\delta}\, \|x_0 - \check{x}\|_2^2.
\]
The change in the order of summation in (3.18) holds since the series converges absolutely:
\[
  \sum_{k=1}^{\infty} \sum_{l=k}^{\infty} \bigl| E\bigl[ I\{\tilde{\tau} = l\}\, R_{k-1}^i \bigr] \bigr|
  = \sum_{k=1}^{\infty} \sum_{l=k}^{\infty} E\bigl[ I\{\tilde{\tau} = l\}\, R_{k-1}^i \bigr]
  = E[\tilde{\tau}]\, E[R_0^i] < \infty,
\]
where we used the nonnegativity of τ̃ and R_k^i. The relation \sum_{k=1}^∞ P{τ̃ ≥ k} = E[τ̃], used in (3.19), follows from [4, Theorem 3.2.1]. Since (3.20) holds for arbitrary x̌ in X*, we can replace ||x_0 − x̌||_2^2 with (dist_{X*}(x_0))^2.

Remark. The results in this section show that our proposed algorithm (2.1) can solve the optimization problem (1.1) in a distributed fashion relying only on neighbor-to-neighbor communication. Lemma 1 demonstrates how the forwarding probabilities (and hence the complete Markov chain) can be constructed by each node using only information from neighboring nodes. Theorems 1 and 2 establish that the algorithm becomes increasingly accurate as α decreases, while the convergence rate becomes slower.

We also show convergence of the time average of the sampled sequence {x̂_k}_{k=0}^∞,

(3.21)
\[
  \bar{x}_m^i = \sum_{k=0}^{m} \frac{\hat{x}_k^i}{m+1},
\]

where the superscript i is used to explicitly show which state is sampled, and we allow MC to start in an arbitrary state j. In a distributed implementation, where the state of MC can be interpreted as a token with the current x_k being passed between nodes, each node i can maintain the average x̄_m^i by sampling the sequence {x_k}_{k=0}^∞ every time instant the node holds the token, which corresponds to MC visiting the state i; a sketch of this update follows.
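In code, this per-node bookkeeping is a running mean (our sketch; names are ours): node i stores its current average and sample count and updates them each time the token arrives.

```python
def update_node_average(x_bar, m, x_k):
    """Running-mean form of (3.21), executed by node i each time the
    token visits it: x_bar is the average of the m iterates sampled so
    far, x_k is the newly observed iterate.  Returns the updated pair."""
    return (m * x_bar + x_k) / (m + 1), m + 1
```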

Theorem 3. Under Assumption 1, and if the set of all optimizers, X* = {x ∈ X | f(x) = f*}, is nonempty and MC starts in state j, then the sequence {x̄_m^i}_{m=0}^∞ defined by (3.21) and (2.1) almost surely fulfills
\[
  E[f(\bar{x}_m^i)] \le f^\star
  + \frac{1}{2\alpha(m+1)}\Bigl( \bigl(\operatorname{dist}_{X^\star}(x_0)\bigr)^2
    + 2[A]_{ji}\, \alpha C \operatorname{dist}_{X^\star}(x_0)
    + [B]_{ji}\, \alpha^2 C^2 \Bigr)
  + \frac{\alpha C^2 K}{2},
\]
where the matrices A and B are given by Lemma 4.


Proof. With probability 1, MC visits all states equally often, and we can form the sampled sequence {x̂_k^i}_{k=0}^∞. By taking the expectation of (3.3) in Lemma 5, we arrive at, for any y ∈ X,

(3.22)
\[
  E\bigl[\|\hat{x}_{k+1}^i - y\|_2^2\bigr]
  \le E\bigl[\|\hat{x}_k^i - y\|_2^2\bigr]
  - E\bigl[2\alpha\bigl(f(\hat{x}_k^i) - f(y)\bigr)\bigr] + \alpha^2 C^2 K,
\]
and

(3.23)
\[
  E\bigl[2\alpha\bigl(f(\hat{x}_k^i) - f(y)\bigr)\bigr]
  \le E\bigl[\|\hat{x}_k^i - y\|_2^2\bigr] + \alpha^2 C^2 K.
\]
By recursively plugging (3.22) into (3.23), we get
\[
  2\alpha\, E\Bigl[ \sum_{k=0}^{m} \bigl(f(\hat{x}_k^i) - f(y)\bigr) \Bigr]
  \le E\bigl[\|\hat{x}_0^i - y\|_2^2\bigr] + \alpha^2 C^2 K (m+1),
\]
and by dividing by 2α(m + 1) and using the convexity of f we obtain
\[
  E[f(\bar{x}_m^i)]
  \le E\Bigl[ \sum_{k=0}^{m} \frac{f(\hat{x}_k^i)}{m+1} \Bigr]
  \le f(y) + \frac{1}{2\alpha(m+1)}\, E\bigl[\|\hat{x}_0^i - y\|_2^2\bigr] + \frac{\alpha C^2 K}{2}.
\]
Note that x̄_m^i ∈ X due to the convexity of the set X. Since MC starts in the state j, MC, and thereby also the algorithm, has taken F_j^i steps before the sampling starts, and we have ||x̂_0^i − y||_2 ≤ ||x_0 − y||_2 + F_j^i αC, which in combination with Lemma 4 implies that
\[
  E\bigl[\|\hat{x}_0^i - y\|_2^2\bigr]
  \le \|x_0 - y\|_2^2 + 2[A]_{ji}\, \alpha C\, \|x_0 - y\|_2 + [B]_{ji}\, \alpha^2 C^2.
\]
Finally, by minimizing ||x_0 − y||_2 over y ∈ X* and using the results above, we arrive at
\[
  E[f(\bar{x}_m^i)] \le f^\star
  + \frac{1}{2\alpha(m+1)}\Bigl( \bigl(\operatorname{dist}_{X^\star}(x_0)\bigr)^2
    + 2[A]_{ji}\, \alpha C \operatorname{dist}_{X^\star}(x_0)
    + [B]_{ji}\, \alpha^2 C^2 \Bigr)
  + \frac{\alpha C^2 K}{2}.
\]

Remark. The previous theorem implies that lim sup_{m→∞} E[f(x̄_m^i)] ≤ f* + αC²K/2.

4. Comparison with existing incremental subgradient algorithms. For the DISM and the RISM, there exist results of the same type as Theorem 2, i.e.,

(4.1)
\[
  \min_{0 \le l \le \tau} f(x_l) = f^\star + \alpha\beta + \nu
  \quad \text{with} \quad
  E[\tilde{\tau}] \le \frac{\rho}{\alpha\nu},
\]
where β and ρ are positive constants that depend on the algorithm. To compare the algorithms, we need to compute, for each algorithm, the minimum expected number of iterations needed for a given accuracy (min_{0≤l≤τ} f(x_l) = f* + γ). For the general case (4.1), we get the following optimization problem:
\[
  \left.
  \begin{array}{ll}
    \underset{\alpha, \nu}{\text{minimize}} & \dfrac{\rho}{\alpha\nu}\\[6pt]
    \text{subject to} & \alpha\beta + \nu \le \gamma\\
                      & \alpha \ge 0,\; \nu \ge 0
  \end{array}
  \right\}
  \;\Leftrightarrow\;
  \left.
  \begin{array}{ll}
    \underset{\alpha, \nu}{\text{maximize}} & \alpha\nu\\[6pt]
    \text{subject to} & \alpha\beta + \nu = \gamma\\
                      & \alpha \ge 0,\; \nu \ge 0
  \end{array}
  \right\}
  \;\Rightarrow\;
  \begin{cases}
    \alpha^\star = \dfrac{\gamma}{2\beta},\\[8pt]
    \nu^\star = \dfrac{\gamma}{2}.
  \end{cases}
\]
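As a worked step (ours, not spelled out in the original text), substituting the optimal pair into the bound in (4.1) and using β = C²K/2 and ρ = (N/2)(dist_{X*}(x_0))² from Theorem 2 yields the MISM row of Table 4.1 below:

\[
  E[\tilde{\tau}]
  \;\le\; \frac{\rho}{\alpha^\star \nu^\star}
  \;=\; \frac{\rho}{(\gamma/2\beta)(\gamma/2)}
  \;=\; \frac{4\beta\rho}{\gamma^2}
  \;=\; \frac{4 \cdot \frac{C^2 K}{2} \cdot \frac{N}{2}
        \bigl(\operatorname{dist}_{X^\star}(x_0)\bigr)^2}{\gamma^2}
  \;=\; K \underbrace{N C^2 \gamma^{-2}
        \bigl(\operatorname{dist}_{X^\star}(x_0)\bigr)^2}_{D}
  \;=\; K D.
\]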


Table 4.1
Upper bounds on the expected number of iterations, E[τ], needed to reach the accuracy min_{0≤l≤τ} f(x_l) ≤ f* + γ. For brevity², let D = NC²γ⁻²(dist_{X*}(x_0))².

    Algorithm    E[τ]
    DISM         N²D
    RISM         ND
    MISM         KD

²The constant C is defined in a slightly different way for the DISM and the RISM in [13]: there, it is assumed that the norm of the subgradients along the actual trajectory of the algorithm is upper bounded by C. This is more general and less conservative than our definition of C, but it is very hard to check whether it is fulfilled and therefore not practical. Our analysis also holds for the less conservative definition of C.

Using these optimal values, we compute an upper bound on the expected number of iterations, E[τ], needed to reach the accuracy min_{0≤l≤τ} f(x_l) ≤ f* + γ for the DISM, the RISM, and the MISM. The results are presented in Table 4.1. Since
\[
  K \ge E\bigl[(R_k^i)^2\bigr]
  = E\Bigl[\Bigl(\sum_{n=1}^{N} v_k^{i,n}\Bigr)^2\Bigr]
  \ge E\Bigl[\sum_{n=1}^{N} \bigl(v_k^{i,n}\bigr)^2\Bigr]
  \ge E\Bigl[\sum_{n=1}^{N} v_k^{i,n}\Bigr] = N,
\]

where we used the nonnegativity and integrality of v_k^{i,n}, the results in Table 4.1 indicate that the RISM is the best algorithm, and that the ranking between the DISM and the MISM will depend on the topology of the network as well as on the transition probability matrix of the Markov chain. However, it is not only the expected number of iterations that is of interest; in applications, the ease of implementation and the energy consumption are crucial. Numerical experiments show that the MISM has favorable properties in these two respects, as reported in [7, 6], but this topic will not be pursued further in this paper.

It is interesting to note that we can recover the DISM and the RISM from the MISM by choosing the transition probability matrix in the following way:
\[
  P_{\mathrm{DISM}} =
  \begin{pmatrix}
    0 & 1 & & \\
    & 0 & \ddots & \\
    & & \ddots & 1\\
    1 & & & 0
  \end{pmatrix}
  \quad \text{and} \quad
  P_{\mathrm{RISM}} =
  \begin{pmatrix}
    \frac{1}{N} & \cdots & \frac{1}{N}\\
    \vdots & \ddots & \vdots\\
    \frac{1}{N} & \cdots & \frac{1}{N}
  \end{pmatrix},
\]
with E[(R_{DISM})²] = N² and E[(R_{RISM})²] = 2N² − N. The transition matrix P_DISM makes the Markov chain deterministically explore the topology in a logical ring, so that R_k^i = N. Note that the Markov chain corresponding to P_DISM does not satisfy Assumption 1, since it is periodic and the limit lim_{n→∞} P^n therefore does not exist, but the analyses in Theorem 1 and Theorem 2 still apply. The transition matrix P_RISM makes the Markov chain jump to any node in the topology with equal probability at each time step, precisely as in the RISM, and E[(R_RISM)²] = 2N² − N by Lemma 3. The convergence bound given by the MISM analysis for P_DISM is identical to the convergence bound given by the DISM analysis. On the other hand, the convergence bound given by the MISM analysis for P_RISM is much


worse than the original RISM result. This is due to the fact that in the original RISM analysis all iterates are analyzed, while in the MISM analysis only the iterates at the arbitrary starting state are analyzed.

5. Conclusions. We have proposed a novel randomized incremental subgradient method that is well suited for decentralized implementation in distributed systems. The stochastic component in the algorithm is described by a Markov chain, which can be constructed using only local information. Furthermore, the algorithm is a generalization of the RISM and the DISM due to Nedić and Bertsekas; these algorithms can be recovered by choosing the transition probability matrix of the Markov chain in special ways. The algorithm has been analyzed in detail, resulting in a convergence proof as well as a bound on the expected number of iterations needed to reach an a priori specified accuracy. In addition, we have provided a convergence rate analysis of the expected value of the time averages of a subsequence of the iterates, where the subsequence is formed by sampling the original sequence of iterates when the Markov chain is in a particular state. Finally, we have presented a comparison of the convergence rates of the MISM, the RISM, and the DISM.

REFERENCES

[1] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Nashua, NH, 1999.
[2] D. Blatt, A. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim., 18 (2007), pp. 29–51.
[3] S. Boyd, P. Diaconis, and L. Xiao, Fastest mixing Markov chain on a graph, SIAM Rev., 46 (2004), pp. 667–689.
[4] K. L. Chung, A Course in Probability Theory, Academic Press, New York, 1974.
[5] Y. M. Ermol'ev, Methods of solution of nonlinear extremal problems, Cybernet., 2 (1966), pp. 1–17.
[6] B. Johansson, C. Carretti, and M. Johansson, On distributed optimization using peer-to-peer communications in wireless sensor networks, in Proceedings of IEEE SECON, June 2008.
[7] B. Johansson, M. Rabi, and M. Johansson, A simple peer-to-peer algorithm for distributed optimization in sensor networks, in Proceedings of IEEE CDC, Dec. 2007.
[8] B. Johansson, A. Speranzon, M. Johansson, and K. H. Johansson, On decentralized negotiation of optimal consensus, Automatica, 44 (2008), pp. 1175–1179.
[9] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Van Nostrand, New York, 1960.
[10] V. M. Kibardin, Decomposition into functions in the minimization problem, Autom. Remote Control, 40 (1980), pp. 1311–1323.
[11] K. C. Kiwiel, Convergence of approximate and incremental subgradient methods for convex optimization, SIAM J. Optim., 14 (2004), pp. 807–840.
[12] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Springer, Berlin, 2003.
[13] A. Nedić and D. P. Bertsekas, Incremental subgradient methods for nondifferentiable optimization, SIAM J. Optim., 12 (2001), pp. 109–138.
[14] J. R. Norris, Markov Chains, Cambridge University Press, London, 1998.
[15] B. T. Polyak, A general method of solving extremum problems, Soviet Math. Dokl., 8 (1967), pp. 593–597.
[16] B. T. Polyak, Introduction to Optimization, Optimization Software, 1987.
[17] M. Rabbat and R. Nowak, Distributed optimization in sensor networks, in Proceedings of ACM/IEEE IPSN, 2004.
[18] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[19] N. Z. Shor, On the structure of algorithms for the numerical solution of optimal planning and design problems, Ph.D. thesis, Cybernetics Institute, 1964.
[20] N. Z. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, Berlin, 1985.
