An Efficient Stochastic Approximation Algorithm for Stochastic Saddle Point Problems

Arkadi Nemirovski and Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion - Israel Institute of Technology, Haifa 32000, Israel

Abstract

We show that Polyak's (1990) stochastic approximation algorithm with averaging, originally developed for unconstrained minimization of a smooth strongly convex objective function observed with noise, can be naturally modified to solve convex-concave stochastic saddle point problems. We also show that the extended algorithm, considered on general families of stochastic convex-concave saddle point problems, possesses a rate of convergence which is unimprovable in order in the minimax sense. We finally present supporting numerical results for the proposed algorithm.

1 Introduction

We start with the classical stochastic approximation algorithm and its modification given in Polyak (1990).

1.1 Classical stochastic approximation

Classical stochastic approximation (CSA) originates from the papers of Robbins and Monro and of Kiefer and Wolfowitz. It is basically a steepest descent method for solving the minimization problem

min_{x∈X} φ(x),   (1.1)

where the exact gradients of φ are replaced with their unbiased estimates. In the notation of Example 2.1, the CSA algorithm is

x_{t+1} = π_X [x_t − γ_t ξ(x_t, ω_t)],   (1.2)

where x_1 is an arbitrary point of X, π_X(y) is the point of X closest to y (the projection of y onto X), and the stepsizes γ_t are normally chosen as

γ_t = C/t,   (1.3)

C being a positive constant. Under appropriate regularity assumptions (see, e.g., Kushner and Clark (1978)) the sequence x_t converges almost surely and in mean square to the unique minimizer of the objective. Unfortunately, the CSA algorithm possesses poor robustness. In the case of a smooth (i.e., with a Lipschitz continuous gradient) and nondegenerate (i.e., with a nonsingular Hessian) convex objective, the rate of convergence is O(t^{-1}) and is unimprovable, in a certain precise sense. However, to achieve this rate, one should adjust the constant C in (1.3) to the "curvature" of φ; a "bad" choice of C, by an absolute constant factor less than the optimal one, can convert the convergence rate O(t^{-1}) into O(t^{-β}) with some β < 1. Finally, if the objective, although smooth, is not nondegenerate, (1.3) may result in extremely slow convergence. The CSA algorithm was significantly improved by Polyak (1990). In his algorithm, the stepsizes γ_t are larger in order than those given by (1.3) (they are of order of t^{-β} with β ∈ (1/2, 1)), so that the rate of convergence of the trajectory (1.2) to the solution is worse in order than for the usual CSA. The crucial difference between Polyak's algorithm and the CSA is that the sequence (1.2) is used only to collect information about the objective rather than to estimate the solution itself. Approximate solutions x̄_t to (1.1) are obtained by averaging the "search points" x_τ in (1.2) according to

x̄_t = (1/t) Σ_{τ=1}^{t} x_τ.
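To make the scheme concrete, here is a minimal sketch of projected stochastic approximation with Polyak-style averaging (an illustration of ours, not the authors' code; the toy objective, function names and parameter values are our choices):

```python
import numpy as np

def csa_with_averaging(grad_est, project, x1, T, C=1.0, beta=0.75, rng=None):
    """Projected stochastic approximation x_{t+1} = proj_X(x_t - gamma_t * g_t)
    with gamma_t = C * t**(-beta), beta in (1/2, 1), returning the average of
    the search points (Polyak's scheme) as the approximate solution."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x1, dtype=float)
    total = np.zeros_like(x)
    for t in range(1, T + 1):
        x = project(x - C * t ** (-beta) * grad_est(x, rng))
        total += x
    return total / T

# Toy problem: min_{|x| <= 1} x^2, gradient observed with N(0,1) noise.
proj = lambda v: np.clip(v, -1.0, 1.0)
g = lambda x, rng: 2.0 * x + rng.standard_normal(x.shape)
xbar = csa_with_averaging(g, proj, [1.0], T=20000)
```

Despite the unit-variance noise on the gradient, the averaged iterate lands close to the minimizer x = 0, while the raw search points keep fluctuating.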

It turns out that under the same assumptions as for the CSA (smooth nondegenerate convex objective attaining its minimum at an interior point of X), Polyak's algorithm possesses the same asymptotically unimprovable convergence rate as the CSA. At the same time, in Polyak's algorithm there is no need for "fine adjustment" of the stepsizes to the "curvature" of φ. Moreover, Polyak's algorithm with properly chosen stepsizes preserves a "reasonable" (close to O(t^{-1/2})) rate of convergence even when the (convex) objective is nonsmooth and/or degenerate. A somewhat different aggregation in SA algorithms was proposed earlier by Nemirovski and Yudin (1978, 1983). For additional references on the CSA algorithm and its outlined modification, see Ermoliev (1969), Ermoliev and Gaivoronski (1992), L'Ecuyer, Giroux, and Glynn (1994), Ljung, Pflug and Walk (1992), Pflug (1992), Polyak (1990) and Tsypkin (1970) and references therein. Our goal is to extend Polyak's algorithm from unconstrained convex minimization to the saddle point case. We shall show that, although for the general saddle point problems below the rate of convergence slows down from O(t^{-1}) to O(t^{-1/2}), the resulting stochastic approximation saddle point (SASP) algorithm, as applied to stochastic saddle point (SSP) problems, preserves the optimality properties of Polyak's method. The rest of this paper is organized as follows. In Section 2 we define the SSP problem, present the associated SASP algorithm, and discuss its convergence properties. We show that the SASP algorithm is a straightforward extension of its stochastic counterpart with averaging, originally proposed by Polyak (1990) for stochastic minimization problems as in Example 2.1 below. It turns out that in the general case the rate of convergence of the SASP algorithm becomes O(t^{-1/2}) instead of O(t^{-1}), the convergence rate of Polyak's algorithm.
We demonstrate in Section 3 that this slowing down is an unavoidable price for extending the class of problems handled by the method. In Section 4 we present numerical results for the SASP algorithm as applied to the stochastic Minimax Steiner problem and to an on-line queuing optimization problem. The Appendix contains the proofs of the rate of convergence results for the SASP algorithm. It is not our intention in this paper to compare the SASP algorithm with other optimization algorithms suitable for off-line and on-line stochastic optimization, like the stochastic counterpart method (Rubinstein and Shapiro (1993)). Our intention is merely to show the high potential of the SASP method and to promote it for further applications.

2 Stochastic saddle point problem

2.1 The problem

Consider the following saddle point problem:

(SP) Given a function φ(x, y) (x ∈ X ⊂ R^n, y ∈ Y ⊂ R^m), find a saddle point of φ on X × Y, i.e., a point (x*, y*) ∈ X × Y at which φ attains its minimum in x ∈ X and its maximum in y ∈ Y:

find (x*, y*) ∈ X × Y :  φ(x*, y) ≤ φ(x*, y*) ≤ φ(x, y*)  [∀x ∈ X, y ∈ Y].   (2.1)

In what follows, we write (SP) down as

min_{x∈X} max_{y∈Y} φ(x, y).

We make the following

Assumption A. X and Y are convex compact sets, and φ is convex in x ∈ X, concave in y ∈ Y, and Lipschitz continuous on X × Y. Let us associate with (SP) the following pair of functions

φ̄(x) = max_{y∈Y} φ(x, y),   φ̲(y) = min_{x∈X} φ(x, y)

(the primal and the dual objectives, respectively), and the following pair of optimization problems:

(P) min_{x∈X} φ̄(x);   (D) max_{y∈Y} φ̲(y).

It is well known (see, e.g., Rockafellar (1970)) that under Assumption A both problems (P) and (D) are solvable, with optimal values equal to each other, and the set of saddle points of φ on X × Y is exactly the set Argmin_X φ̄ × Argmax_Y φ̲.

2.1.1 Stochastic setting

We are interested in the situation where neither the function φ in (SP) nor the partial derivatives {∂φ(x, y)/∂x, ∂φ(x, y)/∂y} are available explicitly; we assume, however, that at time instant t (t = 1, 2, ...) one can obtain, for every desired point (x, y) ∈ X × Y, "noisy estimates" of the aforementioned partial derivatives. These estimates form a realization of the pair of random vectors

ξ(x, y; ω_t) ∈ R^n,   η(x, y; ω_t) ∈ R^m,   (2.2)

{ω_t}_{t=1}^∞ being the "observation noises". We assume that these noises are independent identically distributed, according to a Borel probability measure P, random variables taking values in a Polish (i.e., complete separable metric) space Ω.

We also make the following

Assumption B. The functions ξ(z, ω), η(z, ω) on (X × Y) × Ω are Borel functions taking values in R^n, R^m, respectively, such that

∀z = (x, y) ∈ X × Y :  φ'_x(z) ≡ E_P ξ(z, ω) ∈ ∂_x φ(z),   φ'_y(z) ≡ E_P η(z, ω) ∈ ∂_y φ(z)   (2.3)

and

L ≡ D_X sup_{z∈X×Y} √(E |ξ(z, ω)|²) + D_Y sup_{z∈X×Y} √(E |η(z, ω)|²) < ∞.   (2.4)

Here

- ∂_x φ(x, y) and ∂_y φ(x, y) are the sub- and super-differentials of φ in x, y, respectively;
- |·| is the standard Euclidean norm on the corresponding R^k;
- D_X and D_Y are the Euclidean diameters of X and Y, respectively.

We refer to ξ and η in (2.2) satisfying Assumption B as a stochastic source of information (SSI) for problem (SP), and refer to problem (SP) satisfying Assumption A and equipped with a particular stochastic source of information ξ, η as a stochastic saddle point (SSP) problem. The associated quantity L will be called the variation of observations of the stochastic source of information. Our goal is to develop a stochastic approximation algorithm for solving the SSP problem.

2.1.2 The accuracy measure

As an accuracy measure of a candidate solution (x, y) ∈ X × Y of problem (SP), we use the function

ε(x, y) = [φ̄(x) − min_X φ̄] + [max_Y φ̲ − φ̲(y)].   (2.5)

Note that ε(x, y) is expressed in terms of the objective function rather than in terms of the distance from (x, y) to the saddle set of φ. It is nonnegative everywhere and equals 0 exactly at the saddle set of φ; this is so since the saddle points of φ are exactly the pairs (x, y) ∈ Argmin_X φ̄ × Argmax_Y φ̲. Note finally that

ε(x, y) = φ̄(x) − φ̲(y),   (2.6)

since min_X φ̄ = max_Y φ̲.
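For intuition, the gap (2.6) can be evaluated by brute force on a toy bilinear problem (our illustration, not from the paper; finite grids stand in for X and Y):

```python
import numpy as np

def eps_gap(phi, X_grid, Y_grid, x, y):
    """Accuracy measure (2.6): eps(x, y) = max_{v in Y} phi(x, v)
                                          - min_{u in X} phi(u, y),
    evaluated by brute force over finite grids approximating X and Y."""
    primal = max(phi(x, v) for v in Y_grid)   # primal objective at x
    dual = min(phi(u, y) for u in X_grid)     # dual objective at y
    return primal - dual

# Bilinear toy: phi(x, y) = x*y on [-1,1] x [-1,1]; the saddle point is (0, 0).
phi = lambda u, v: u * v
grid = np.linspace(-1.0, 1.0, 201)
gap_at_saddle = eps_gap(phi, grid, grid, 0.0, 0.0)   # zero at the saddle point
gap_off = eps_gap(phi, grid, grid, 0.5, 0.0)         # positive away from it
```

The gap vanishes exactly at the saddle point and grows as the candidate moves away from it, as (2.5)-(2.6) promise.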

2.2 Examples We present now several stochastic optimization problems which can be naturally posed in the SSP form.

Example 2.1 Simple single-stage stochastic program. Consider the simplest stochastic programming problem

min_{x∈X⊂R^n} f(x),   f(x) = ∫_Ω F(x, ω) dP(ω)   (2.7)

with convex compact feasible domain X and convex objective f. Here Ω is a Polish space and P is a Borel probability measure on Ω. Assume that the integrand F(x, ω) is sufficiently regular, namely,

- F(x, ·) is P-square summable for every x ∈ X;
- F is differentiable in x for all ω, with ξ(x, ω) ≡ ∇_x F(x, ω) Borel in (x, ω);
- F(·, ω) is Lipschitz continuous in x ∈ X with a certain constant L(ω), and

∫_Ω L(ω) dP(ω) < ∞,   M ≡ (sup_{x∈X} ∫_Ω |ξ(x, ω)|² dP(ω))^{1/2} < ∞.

Assume further that when solving (2.7), we cannot compute f(x) directly, but are given an iid random sample {ω_t}_{t=1}^∞ distributed according to P and know how to compute ξ(x, ω) at every given point (x, ω) ∈ X × Ω. Under these assumptions, program (2.7) can be treated as an SSP problem with

φ(x, y) = f(x)

and a trivial (singleton) set Y (which enables us to set η ≡ 0). It is readily seen that the resulting SSP problem satisfies Assumptions A and B with

L = M D_X,   (2.8)

and that the accuracy measure (2.5) in this problem is just the residual in terms of the objective:

ε(x, y) ≡ ε(x) = f(x) − min_X f.

Example 2.2 Minimax stochastic program. Consider the following system of stochastic inequalities:

find x ∈ X ⊂ R^n s.t. f_i(x) = ∫_Ω F_i(x, ω) dP(ω) ≤ 0,  i = 1, ..., m,   (2.9)

with convex compact domain X and convex constraints f_i, i = 1, ..., m. Here Ω, P are the same as in Example 2.1, and each of the integrands F_i(x, ω) possesses the same regularity properties as in Example 2.1. Clearly, to solve (2.9) it suffices to solve the optimization problem

min_{x∈X} φ̄(x),   φ̄(x) = max_{i=1,...,m} f_i(x),   (2.10)

which is the same as solving the following saddle point problem:

min_{x∈X} max_{y∈Y} φ(x, y),   φ(x, y) = Σ_{i=1}^{m} y_i f_i(x),   Y = {y ∈ R^m_+ | Σ_{i=1}^{m} y_i = 1}.   (2.11)

Note that the latter problem clearly satisfies Assumption A. Similarly to Example 2.1, assume that when solving (2.10), we cannot compute the f_i (and thus φ̄) explicitly, but are given an iid sample {ω_t}_{t=1}^∞ distributed according to P and, given x ∈ X and ω ∈ Ω, can compute F_i(x, ω) and ξ_i(x, ω) = ∇_x F_i(x, ω), i = 1, ..., m. Note that under this assumption an SSI for the saddle point problem (2.11) is given by

ξ(x, y, ω) = Σ_{i=1}^{m} y_i ∇_x F_i(x, ω),   η(x, y, ω) = (F_1(x, ω), ..., F_m(x, ω))^T.

The variation of observations for this source can be bounded from above as

L ≤ D_X M(1) + D_Y M(0),

M(0) = [sup_{x∈X} Σ_{i=1}^{m} ∫_Ω F_i²(x, ω) dP(ω)]^{1/2},
M(1) = [sup_{x∈X, i=1,...,m} ∫_Ω |∇_x F_i(x, ω)|² dP(ω)]^{1/2}.

Note that in this case the accuracy measure ε(·, ·) satisfies the inequality

ε(x, y) ≥ φ̄(x) − min_X φ̄,

so that it presents an upper bound for the residual in (2.10). Although problem (2.9) can be posed as the convex minimization problem (2.10) rather than the saddle point problem (2.11), the former cannot be solved directly. Indeed, to solve (2.10) by a stochastic approximation algorithm, we need unbiased estimates of subgradients of φ̄, and we cannot build estimates of this type from the only estimates available to us, namely the unbiased estimates of the f_i and f_i'. Thus, in the case under consideration the saddle point reformulation seems to be the only one suitable for handling "noisy observations".
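The SSI just described, ξ as the y-weighted combination of the constraint gradients and η as the vector of constraint values, can be sketched as follows (an illustrative sketch of ours; the oracle interfaces F, gradF and the toy example are our assumptions, not from the paper):

```python
import numpy as np

def minimax_ssi(F, gradF, x, y, omega):
    """Stochastic source of information for the saddle point form (2.11)
    of the minimax problem (2.10):
      xi  = sum_i y_i * grad_x F_i(x, omega)     (stochastic subgradient in x),
      eta = (F_1(x, omega), ..., F_m(x, omega))  (stochastic supergradient in y).
    F(i, x, omega) and gradF(i, x, omega) are user-supplied noisy oracles."""
    m = len(y)
    xi = sum(y[i] * gradF(i, x, omega) for i in range(m))
    eta = np.array([F(i, x, omega) for i in range(m)])
    return xi, eta

# Toy check with m = 2, F_i(x, omega) = (x - i)^2 + omega (omega set to 0 here).
Fi = lambda i, x, w: (x[0] - i) ** 2 + w
Gi = lambda i, x, w: np.array([2.0 * (x[0] - i)])
xi, eta = minimax_ssi(Fi, Gi, np.array([1.0]), np.array([0.5, 0.5]), 0.0)
```

Taking expectations over omega reproduces exactly the memberships required by Assumption B: E ξ ∈ ∂_x φ and E η ∈ ∂_y φ for the Lagrangian-type objective (2.11).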

Example 2.3 Single-stage stochastic program with stochastic constraints. Consider the following stochastic program:

f_0(x) = ∫_Ω F_0(x, ω) dP(ω) → min   (2.12)

subject to

x ∈ X ⊂ R^n,   f_i(x) = ∫_Ω F_i(x, ω) dP(ω) ≤ 0,  i = 1, ..., m,   (2.13)

with convex compact domain X and convex functions f_i, i = 0, ..., m, and let Ω, P and the integrands F_i(x, ω) satisfy the same assumptions as in Examples 2.1, 2.2. As above, assume that when solving (2.12)-(2.13), we cannot compute the f_i explicitly, but are given an iid sample {ω_t}_{t=1}^∞ distributed according to P and, given x ∈ X and ω ∈ Ω, can compute F_i(x, ω) and ∇_x F_i(x, ω), i = 0, ..., m. To solve problem (2.12)-(2.13), it suffices to find a saddle point of the Lagrange function

φ(x, y) = f_0(x) + Σ_{i=1}^{m} y_i f_i(x)

on the set X × R^m_+. Note that if (2.12)-(2.13) satisfies the Slater condition, then φ possesses a saddle point on X × R^m_+, and the solutions to (2.12)-(2.13) coincide with the x-components of the saddle points of φ. Assume that we have prior information on the problem which enables us to identify a compact convex set Y ⊂ R^m_+ containing the y-component of some saddle point of φ. Then we can replace the nonnegative orthant R^m_+ in the Lagrange saddle point problem with Y, thus obtaining an equivalent saddle point problem

min_{x∈X} max_{y∈Y} φ(x, y)   (2.14)

with convex compact sets X and Y. Noting that the vectors

ξ(x, y, ω) = ∇_x F_0(x, ω) + Σ_{i=1}^{m} y_i ∇_x F_i(x, ω),   η(x, y, ω) = (F_1(x, ω), ..., F_m(x, ω))^T

form a stochastic source of information for (2.14), we see that (2.12)-(2.13) can be reduced to an SSP. The variation of observations for the associated stochastic source of information clearly can be bounded as

L ≤ D_X M(1) + D_Y M(0),

M(0) = [max_{x∈X} Σ_{i=1}^{m} ∫_Ω F_i²(x, ω) dP(ω)]^{1/2},
M(1) = max_{y∈Y} (1 + Σ_{i=1}^{m} |y_i|) · [max_{i=0,...,m} sup_{x∈X} ∫_Ω |∇_x F_i(x, ω)|² dP(ω)]^{1/2}.

These examples demonstrate that the SSP setting is a natural form of many stochastic optimization problems.

2.3 The SASP algorithm

The SASP algorithm for the stochastic saddle point problem (2.1), (2.3) is as follows:

Algorithm 2.1

z_{t+1} = π_{X×Y} [z_t − γ_t ζ(z_t, ω_t)],   z_t = (x_t, y_t) ∈ X × Y,  t = 1, 2, ...,   (2.15)

where

- π_{X×Y}[x, y] is the projector onto X × Y:

π_{X×Y}[x, y] = (argmin_{u∈X} |x − u|², argmin_{v∈Y} |y − v|²);   (2.16)

- the vector ζ(x, y, ω) ∈ R^n × R^m is

ζ(x, y, ω) = (α_X ξ(x, y, ω); −α_Y η(x, y, ω)),   (2.17)

α_X and α_Y being positive parameters of the method;

- γ_t are positive stepsizes which, in principle, can be either deterministic or stochastic (see Subsections 2.4 and 2.5, respectively);

- the initial point z_1 is an arbitrary (deterministic) point in X × Y.

As an approximate solution of the SSP problem we take the moving average

z̄_t = (1/(t − τ(t) + 1)) Σ_{τ=τ(t)}^{t} z_τ,   (2.18)

where τ(t) is a deterministic function taking, for every t > 1, integer values between 1 and t.

In the two subsections which follow we discuss the rate of convergence of the SASP algorithm and the choice of its parameters. Subsections 2.4 and 2.5 deal with off-line and on-line choice of the stepsizes, respectively.
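Algorithm 2.1 together with the moving average (2.18) can be sketched in a few lines (a minimal illustration of ours, not the authors' code; we assume projection oracles for X and Y are available, take γ_t = C t^{-1/2} and τ(t) = ⌈t/2⌉ as in the setup (2.21), and the toy problem is our choice):

```python
import numpy as np

def sasp(ssi, proj_X, proj_Y, x1, y1, T, alpha_X=1.0, alpha_Y=1.0, C=1.0, rng=None):
    """Algorithm 2.1 with stepsizes gamma_t = C * t**(-1/2) and the moving
    average (2.18) taken over the second half of the trajectory,
    tau(t) = ceil(t/2).

    ssi(x, y, rng) returns the pair (xi, eta) of noisy partial (sub)gradients;
    proj_X, proj_Y are the Euclidean projectors onto X and Y."""
    rng = rng or np.random.default_rng(0)
    x, y = np.asarray(x1, float), np.asarray(y1, float)
    xs, ys = [], []
    for t in range(1, T + 1):
        gamma = C / np.sqrt(t)
        xi, eta = ssi(x, y, rng)
        x = proj_X(x - gamma * alpha_X * xi)   # descent step in x
        y = proj_Y(y + gamma * alpha_Y * eta)  # ascent step in y
        xs.append(x); ys.append(y)
    k = (T + 1) // 2 - 1                       # index of search point tau(T)
    return np.mean(xs[k:], axis=0), np.mean(ys[k:], axis=0)

# Toy problem: phi(x, y) = x^2/2 + x*y - y^2/2 on [-1,1] x [-1,1],
# unique saddle point at (0, 0); gradients observed with N(0, 0.01) noise.
ssi = lambda x, y, rng: (x + y + 0.1 * rng.standard_normal(x.shape),
                         x - y + 0.1 * rng.standard_normal(y.shape))
proj = lambda v: np.clip(v, -1.0, 1.0)
xb, yb = sasp(ssi, proj, proj, [1.0], [1.0], T=5000)
```

Note that it is the averaged pair (xb, yb), not the last search point, that approaches the saddle point; the raw trajectory keeps circling under the noise.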

2.4 Rate of convergence and optimal setup: off-line choice of the stepsizes

Here we consider the case of deterministic sublinear stepsizes. Namely, assume that

γ_t = C t^{-β},   (2.19)

where C > 0 and β ∈ (0, 1]. As we shall see in Section 3, the stepsizes γ_t with properly chosen C and β yield a rate of convergence which is unimprovable in order (in a certain precise sense).

Theorem 1 Under Assumptions A, B and (2.19), the expected inaccuracy ε_N ≡ E ε(z̄_N) of the approximate solutions z̄_N generated by the method can be bounded from above, for every N > 1, as follows:

ε_N ≤ [D_X² α_X^{-1} + D_Y² α_Y^{-1}] / (C N^{-β} (N − τ(N) + 1)) + L / √(N − τ(N) + 1) + C L² [α_X D_X^{-2} + α_Y D_Y^{-2}] τ^{-β}(N).   (2.20)

The proof of Theorem 1 is given in Appendix A. It is easily seen that the parameters minimizing, up to an absolute constant factor, the right hand side of (2.20) are given by the setup

α_X = D_X²,  α_Y = D_Y²,  τ(t) = ⌈t/2⌉,  β = 1/2,  C = 2/L.   (2.21)

Here ⌈z⌉ denotes the smallest integer which is ≥ z. With this setup, (2.20) results in

ε_N ≤ 10 L N^{-1/2},  N > 1.   (2.22)

2.5 Rate of convergence and optimal setup: on-line choice of the stepsizes

Setup (2.21) requires a priori knowledge of the parameters D_X, D_Y, L. When the domains X, Y are "simple" (e.g., boxes, Euclidean balls or perfect simplices), there is no problem in computing the diameters D_X, D_Y. And in actual applications we can handle simple X and Y only, since we should know how to project onto these sets. Computation of the variation of observations L is, however, trickier. Typically the exact value of L is not available, and a bad initial guess for L can significantly slow down the convergence. For practical purposes it might be better to use an on-line policy for updating the guesses for L, and our current goal is to demonstrate that there exists a reasonably wide family of such policies preserving the convergence rate of the SASP algorithm.

We shall focus on stochastic stepsizes of the type (cf. (2.19))

γ_t = C_t t^{-β},   (2.23)

where β ∈ (0, 1] is fixed and C_t > 0 depends on the observations ξ(z_τ, ω_τ), η(z_τ, ω_τ). For the theoretical analysis, we make the following

Assumption C. For every t, C_t depends only on the observations collected at the first t steps, i.e., C_t is a deterministic function of ξ(z_τ, ω_τ), η(z_τ, ω_τ), 1 ≤ τ ≤ t. Moreover, there exist "safety bounds", two positive constants C_* and C^*, such that

C_* ≤ C_t ≤ C^*   (2.24)

for all t. Let us associate with the SASP algorithm (2.15)-(2.18), (2.23) the following (deterministic) sequence:

Δ_t = sup |C_t − C_{t−1}|,  t = 2, 3, ...,   (2.25)

where the supremum is taken over all trajectories associated with the SSP problem in question.

Theorem 2 Let the stepsizes γ_t in the SASP algorithm (2.15)-(2.18) be chosen according to (2.23), and the remaining parameters α_X, α_Y, τ(t) according to (2.21). Then under Assumptions A, B, C the expected inaccuracy ε_N ≡ E ε(z̄_N) of the approximate solution z̄_N generated by the SASP algorithm can be estimated, for every N > 1, as follows:

ε_N ≤ [D_X² α_X^{-1} + D_Y² α_Y^{-1}] [1 + (1/(2C_*)) Σ_{t=τ(N)+1}^{N} Δ_t] / (C_* N^{-β} (N − τ(N) + 1)) + C^* L² [α_X D_X^{-2} + α_Y D_Y^{-2}] τ^{-β}(N) + L / √(N − τ(N) + 1).   (2.26)

Note that (2.20) is a particular case of (2.26) with C_t ≡ C_* = C^* = C. The proof of Theorem 2 is given in Appendix A. We now present a simple example of adaptive choice of C_t. Recalling (see (2.19), (2.21)) that the optimal stepsize is the one with β = 1/2 and C_t ≡ 2/L, where L is the variation of observations, it is natural to choose γ_t as

γ_t = C_t t^{-1/2},   C_t = 2/L_t,   (2.27)

where L_t is our current guess for the unknown quantity L. Since by definition (2.4)

L = D_X M_x + D_Y M_y,   M_x = sup_{z∈X×Y} √(E |ξ(z, ω)|²),   M_y = sup_{z∈X×Y} √(E |η(z, ω)|²),   (2.28)

a natural candidate for the role of L_t is given by

L_t = D_X M_x^t + D_Y M_y^t,   M_x^t = √(t^{-1} Σ_{τ=1}^{t} |ξ(z_τ, ω_τ)|²),   M_y^t = √(t^{-1} Σ_{τ=1}^{t} |η(z_τ, ω_τ)|²)   (2.29)

(the sample means of the "magnitudes of observations"). Such a choice of L_t may violate assumption (2.24), since fluctuations in the observations ξ(z_τ, ω_τ), η(z_τ, ω_τ) may result in L_t being either too small or too large to satisfy (2.24). It thus makes sense to replace (2.29) with its truncated version. More precisely, let

π_x(s) = s for s ∈ [M_x^−, M_x^+];  π_x(s) = M_x^− for s < M_x^−;  π_x(s) = M_x^+ for s > M_x^+;
π_y(s) = s for s ∈ [M_y^−, M_y^+];  π_y(s) = M_y^− for s < M_y^−;  π_y(s) = M_y^+ for s > M_y^+,

where M_x^−, M_x^+ (0 < M_x^− ≤ M_x^+ < ∞) and M_y^−, M_y^+ (0 < M_y^− ≤ M_y^+ < ∞) present some a priori guesses for lower and upper bounds on M_x and M_y, respectively. The truncated version of (2.29) is then

L_t = D_X M_x^t + D_Y M_y^t,   M_x^t = √(t^{-1} Σ_{τ=1}^{t} π_x²(|ξ(z_τ, ω_τ)|)),   M_y^t = √(t^{-1} Σ_{τ=1}^{t} π_y²(|η(z_τ, ω_τ)|)).   (2.30)

Clearly, the stepsize policy (2.27), (2.30) satisfies (2.24): it suffices to take C_* = 2/(D_X M_x^+ + D_Y M_y^+) and C^* = 2/(D_X M_x^− + D_Y M_y^−). In addition, for the truncated version we have

Δ_t ≤ O(1) t^{-1},   (2.31)

where the factor O(1) depends solely on the safety bounds M_x^±, M_y^±. Inequalities (2.26), (2.31) combined with the stepsize policy (2.27), (2.30) result in the same O(N^{-1/2}) order of convergence as in (2.22). Note that the motivation behind the stepsize policy (2.27), (2.30) is, roughly speaking, to choose the stepsizes according to the actual magnitudes of the observations ξ(·, ·), η(·, ·) along the trajectory rather than according to the worst-case "expected magnitudes" M_x, M_y.
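The truncated estimate (2.30) and the resulting stepsize (2.27) amount to a few lines of code (a sketch of ours; the function names are not from the paper):

```python
import numpy as np

def truncated_variation_estimate(xi_norms, eta_norms, DX, DY,
                                 Mx_lo, Mx_hi, My_lo, My_hi):
    """On-line guess L_t of the variation of observations, as in (2.30):
    sample second moments of the observation magnitudes seen so far, each
    magnitude truncated to its a priori safety interval before squaring."""
    px = np.clip(np.asarray(xi_norms, float), Mx_lo, Mx_hi)   # pi_x(|xi|)
    py = np.clip(np.asarray(eta_norms, float), My_lo, My_hi)  # pi_y(|eta|)
    return DX * np.sqrt(np.mean(px ** 2)) + DY * np.sqrt(np.mean(py ** 2))

def adaptive_stepsize(t, L_t):
    """Adaptive stepsize (2.27): gamma_t = C_t * t**(-1/2), C_t = 2 / L_t."""
    return 2.0 / (L_t * np.sqrt(t))

# With D_X = D_Y = 1, one |xi| = 1 (inside its interval [0.5, 2.5]) and one
# |eta| = 2 (inside [1, 3]), the estimate is L_t = 1 + 2 = 3.
L_t = truncated_variation_estimate([1.0], [2.0], 1.0, 1.0, 0.5, 2.5, 1.0, 3.0)
```

Because the magnitudes are clipped to [M^−, M^+] before averaging, L_t (and hence C_t = 2/L_t) automatically stays within fixed bounds, which is exactly what (2.24) requires.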

3 Discussion

3.1 Comparison with Polyak's algorithm

As applied to convex optimization problems (see Example 2.1), the SASP algorithm with the setup (2.21) looks completely similar to Polyak's algorithm with averaging. There is, however, an important difference: the stepsizes γ_t = O(t^{-1/2}) given by (2.21) are not quite suitable for Polyak's method. For the latter, the standard setup is γ_t = O(t^{-β}) with 1/2 < β < 1, and this is the setup for which Polyak's method possesses its most attractive property, the O(t^{-1}) rate of convergence on strongly convex (i.e., smooth and nondegenerate) objective functions. Specifically, let O = O(X, a, A, L) (0 ≤ a < A, L > 0) be the class of all stochastic optimization problems min_{x∈X} f(x) on a compact convex set X ⊂ R^n with twice differentiable objective f satisfying the condition

a|h|² ≤ h^T f''(x) h ≤ A|h|²  ∀x ∈ X, h ∈ R^n,

and equipped with a stochastic source of information with variation of observations not exceeding L. Note that problems from the class O possess uniformly smooth objectives. In addition, if a > 0, which corresponds to the "well-posed" case, the objectives are uniformly nondegenerate as well. For a > 0, Polyak's method with stepsizes C t^{-β}, 1/2 < β < 1, and properly chosen C = C(β, L, D_X, a, A) ensures that the expected error ε_N of the N-th approximate solution does not exceed C_1 N^{-1}, where C_1 > 0 depends only on the data of O. Under the same circumstances the stepsizes O(t^{-1/2}) given by (2.21) result in a slower convergence, namely, ε_N ≤ C_2 N^{-1} ln N. Thus, in the "well-posed" case the SASP method with setup (2.21) is slower, by a logarithmic in N factor, than the original Polyak's method.

The situation changes dramatically when a = 0, that is, when we pass from the "well-posed" case to the "ill-posed" one. Here the SASP algorithm still ensures (uniformly in the problems from O) the rate of convergence ε_N ≤ O(N^{-1/2}), which is not the case for Polyak's method. Indeed, consider the simplest case when X = [0, 1] and assume that observation noises are absent, so that ξ(x, ω) = f'(x). Consider next the subfamily of O([0, 1], 0, 1, 1) comprised of the objectives

f_ε(x) = 4εx²,

where ε ∈ (0, 0.1), and let us apply Polyak's method to f_ε with stepsizes γ_t = t^{-β}, β ∈ [1/2, 1), starting at x_0 = 1. The search points are

x_{t+1} = x_t − 8ε x_t t^{-β}  ⇒  x_t = x_0 Π_{τ=1}^{t} (1 − 8ε τ^{-β}) ≥ exp{−c(β) ε t^{1−β}},

where c(β) is continuous on [1/2, 1). As long as x_N > 1/2, the points x_1, ..., x_N, as well as their averages, belong to the domain where f_ε(x) − min_X f_ε > ε. Therefore, in order to get an ε-minimizer of f_ε, Polyak's algorithm requires at least d(β) ε^{-1/(1−β)} steps. We conclude that in the ill-posed case the worst-case, with respect to O, rate of convergence of Polyak's algorithm cannot be better than O(N^{-(1−β)}). Thus, in the ill-posed case Polyak's setup γ_t = O(t^{-β}) with β ∈ (1/2, 1) results in a rate of convergence which is worse, by a factor of order of N^{β−1/2}, than that of the setup γ_t = O(t^{-1/2}). We believe that the outlined considerations provide enough arguments in favor of the rule γ_t = O(t^{-1/2}) unless we are certain that the problem is "well-posed". As we shall see below, in the case of "genuine" saddle point problems (not reducing to minimization of a convex function via unbiased observations of its subgradients), the rate of convergence of the SASP algorithm with setup (2.21) is unimprovable even for "well-posed" problems.
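The slow-convergence mechanism just described is easy to reproduce numerically (a noise-free sketch of ours; the parameter values eps = 0.01 and beta = 0.9 are our choice):

```python
import numpy as np

def polyak_iterates(eps, beta, T, x0=1.0):
    """Noise-free Polyak recursion x_{t+1} = (1 - 8*eps*t**(-beta)) * x_t
    for the ill-posed objective f_eps(x) = 4*eps*x^2 on [0, 1]."""
    x, xs = x0, []
    for t in range(1, T + 1):
        x *= 1.0 - 8.0 * eps * t ** (-beta)
        xs.append(x)
    return np.array(xs)

# For eps = 0.01, beta = 0.9 the iterates (hence also their averages) remain
# above 1/2, i.e., outside the eps-solution set, for a few hundred steps.
xs = polyak_iterates(0.01, 0.9, 2000)
```

Shrinking eps further stretches this plateau like eps^{-1/(1-beta)}, which is exactly the lower bound on the step count derived above.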

3.2 Optimality issues

We are about to demonstrate that, as far as general families of SSP problems are concerned, the SASP algorithm with setup (2.21) is optimal in order in the minimax sense. To this end, let us define the family of stochastic saddle point problems S(X, Y, L) as that of all SSP instances on X × Y (recall that an SSP problem always satisfies Assumptions A, B) with the variation of observations not exceeding L > 0. Given a positive ε and a subfamily S ⊂ S(X, Y, L), let us denote by Compl(ε, S) the information-based ε-complexity of S, defined as follows.

a) A solution method for the family S is a procedure B which, as applied to an instance p from the family, generates sequences of "search points" z_t ∈ X × Y and "approximate solutions" z̄_t ∈ X × Y, with the pair (z_t, z̄_t) defined solely on the basis of observations along the previous search points:

z_t = Z_t(ξ(z_1, ω_1), η(z_1, ω_1), ..., ξ(z_{t−1}, ω_{t−1}), η(z_{t−1}, ω_{t−1})),
z̄_t = Z̄_t(ξ(z_1, ω_1), η(z_1, ω_1), ..., ξ(z_{t−1}, ω_{t−1}), η(z_{t−1}, ω_{t−1})).

Formally, the method B is exactly the collection of "rules" {Z_t(·), Z̄_t(·) : R^{(n+m)(t−1)} → X × Y}_{t=1}^∞, and the only restriction on these rules is that all function pairs (Z_t, Z̄_t) must be Borel.

b) For a solution method B, its ε-complexity Compl_B(ε, p) on an instance p ∈ S is defined as the smallest N such that

t ≥ N ⇒ E ε_p(z̄_t) ≤ ε,

the expectation being taken over the distribution of observation noises; here ε_p(z) is the inaccuracy measure (2.5) associated with the instance p. The ε-complexity Compl_B(ε, S) of B on the entire family S is

Compl_B(ε, S) = sup_{p∈S} Compl_B(ε, p),

i.e., it is the worst-case, over all instances from S, ε-complexity of B on an instance. For example, (2.22) says that the complexity of the SASP method on the family of stochastic saddle point problems S(X, Y, L) can be bounded from above as

Compl_SASP(ε, S(X, Y, L)) ≤ 100 L²/ε² + 1,   (3.1)

provided that one uses the setup (2.21) with the parameter L set to the bound L. Finally, the ε-complexity Compl(ε, S) of the family S is the minimum, over all solution methods B, of the ε-complexities of the methods on the family:

Compl(ε, S) = inf_B Compl_B(ε, S).

A method B is called optimal in order on S if there exists C < ∞ such that

Compl_B(ε, S) ≤ C Compl(ε, S)  ∀ε > 0.

Optimality in order of a method B on a family S means that S does not admit a solution method "much faster" than B: for every required accuracy ε > 0, B solves every problem from S within (expected) accuracy ε in no more than C Compl(ε, S) steps, while every competing method B', in fewer than Compl(ε, S) steps, fails to solve within the same accuracy at least one (depending on B') problem from S. We are about to establish the optimality in order of the SASP algorithm on the families S(X, Y, L):

Proposition 3.1 The complexity of every nontrivial (with a non-singleton X × Y) family S(X, Y, L) of stochastic saddle point problems admits the lower bound

Compl(ε, S(X, Y, L)) ≥ C^{-1} (L²/ε² + 1),   (3.2)

C being a positive absolute constant.

The proof of Proposition 3.1 is given in Appendix B. Taking into account (3.1), we arrive at

Corollary 3.1 For every pair of convex compact sets X ⊂ R^n, Y ⊂ R^m and every L > 0, the SASP algorithm with setup (2.21) (with the parameter L set to L) is optimal in order on the family S(X, Y, L) of SSP problems.

Remark 3.1 The outlined optimality property of the SASP method means that, as far as the performance on the entire family S(X, Y, L) is concerned, no alternative solution method outperforms the SASP algorithm by more than an absolute constant factor. This fact, of course, does not mean that it is impossible to essentially outperform the SASP method on a given subfamily S of S(X, Y, L).

For example, the family of convex stochastic minimization problems O = O(X, a, A, L) introduced in Subsection 3.1 can be treated as a subfamily of S(X, Y, L) (think of a convex optimization problem as a saddle point problem with objective independent of y). As explained in Subsection 3.1, in the "well-posed" case a > 0 the SASP algorithm is not optimal in order on O (the complexity of the method on O is O(ε^{-1} ln ε^{-1}), while the complexity of O is O(ε^{-1})). In view of Remark 3.1, it makes sense to present a couple of examples of what we call "difficult" subfamilies S ⊂ S(X, Y, L), those for which Compl(ε, S) is of the same order L²ε^{-2} as the complexity of the entire family S(X, Y, L). For the sake of simplicity, let us restrict ourselves to the case X = [−1, 1], Y = [0, 1], L = 10. It is readily seen that if both X, Y are non-singletons, then S_stnd = S([−1, 1], [0, 1], 10) can be naturally embedded into S(X, Y, L), so that "difficult" subfamilies of S([−1, 1], [0, 1], 10) generate "difficult" subfamilies in every family of stochastic saddle point problems S(X, Y, L) with nontrivial X, Y.

1) The first example of a "difficult" subfamily in S_stnd is the family of ill-posed smooth stochastic convex optimization problems O = O(X, a, A, L) associated with X = [−1, 1], a = 0, A = 1, L = 10; see Subsection 3.1. Indeed, consider the 2-parametric family of optimization programs

f_{ε,s}(x) = 4ε(x − s)² → min subject to |x| ≤ 1,   (3.3)

the parameters being s ∈ [−1, 1] and ε ∈ (0, 0.1]. We assume that the family (3.3) is equipped with the stochastic source of information

ξ_{ε,s}(x, ω) = 8ε(x − s) + ω,   ω ∼ N(0, 1).

Thus, we seek to minimize a simple quadratic function of a single variable in the situation when the derivative of the objective function is corrupted by an additive Gaussian white noise of unit intensity. The same reasoning as in the proof of Proposition 3.1 demonstrates that the family {f_{ε,s}} is indeed "difficult", the reason being that the programs (3.3) become more and more ill-posed as ε approaches 0.

2) The second example is a single-parametric family of "genuine" saddle point problems

min_{x∈[−1,1]} max_{y∈[0,1]} f_s(x, y),   f_s(x, y) = (1 − 2y)(x − s) + (1/2)(x − s)² − (1/2)s²,   (3.4)

equipped with the stochastic sources of information h i s (x; y; !) = 1 ? 2y + x ? s ? ! h = @fs@x(x;y) ? !i ; !  N (0; 1); s (x; y; !) = ?2x + 2s + 2! = @fs@y(x;y) + 2!

(3.5)

here the parameter s varies in [?1; 1]. The origin of the stochastic saddle point problem (3.4) { (3.5) is very simple. What we seek in fact is to solve a Minimax problem (see Example 2.2) of the form gs (x)  max[fs;?(x); fs;+(x)] ! min subject to x 2 [?1; 1]; (3.6) fs;? = 21 (x ? s)2 ? (x ? s) ? 21 s2; fs;+ = 21 (x ? s)2 + (x ? s) ? 21 s2

speci ed by an unknown parameter s 2 [?1; 1]. What we can observe are the values and derivatives of fa; at any desired point x. They are corrupted with N (0; 1)-observation noise ! and equal to fa;? (x)?!x+!=1 x2 ? (a + ! )x ? (x ? a ? ! ); 2 fa;+ (x)?!x?!=1 x2 ? (a + ! )x + (x ? a ? ! ); 2 respectively. The observations of fa;0 (x) are, respectively, x ? (a + !) ? 1 = fa;0 ?(x) ? !; x ? (a + !)x + 1 = fa;0 +(x) ? !:

(3.7) (3.8)

Applying the scheme presented in Example 2.2 to the above stochastic Minimax problem, we convert it into an equivalent stochastic saddle point problem, which is readily seen to be identical with (3.4) { (3.5). Note that the family of stochastic saddle point problems S = f(3:4)?(3:5g, (s 2 [?1; 1]) is contained in Sstnd. We claim that the family S is "dicult". Indeed, denoting by "s (x; y) the accuracy measure associated with the problem (3.4) and taking into account (2.5), we have "s(x; y)  f (x) ? min x f (x); 1 (x ? s)2 + jx ? sj ? 1 s2: f (x) = ymax [ yf ( x ) + (1 ? y ) f ) x )] = s; ? s; + 2[0;1] 2 2

It follows that "s(x; y)  jx ? sj. In other words, if there exists a method which solves in N steps every instance from S within an expected inaccuracy ", then this method is capable of recovering, within the same expected inaccuracy, the value of s underlying the instance. On the other hand, from the viewpoint of collected information, the N observations (3.5) used by the method are equivalent to observing a sample of N iid N (s; 1) random variables. Thus, if Compl("; S ) = N , then one can recover, within the expected inaccuracy ", the unknown mean of N (; 1)-distribution from N -point sample drawn from this distribution, regardless of the actual value s 2 [?1; 1] of the mean. It is well-known that the latter is possible only when N  O("?2 ). Thus, Compl("; S )  O("?2 ), as claimed. 16

Note finally that the stochastic Minimax problems (3.6) which give rise to the stochastic saddle point problems from S are "as nice as a Minimax problem can be". Indeed, the components f_{s,±}(·) are just shifts by s of simple quadratic functions on the axis. Moreover, problems (3.6) are perfectly well posed: the solution is "sharp", i.e., the residual g_s(x) - min_{x'} g_s(x') = g_s(x) - g_s(s) ≥ |x - s| is of the first order in the distance |x - s| from a candidate solution x to the exact solution s, provided that this distance is small. We see that in situations less trivial than the one considered in case 1), "difficult" stochastic saddle point problems can arise already from quite simple and perfectly well-posed stochastic optimization models.

4 Numerical Results

In this section we apply the SASP algorithm to a pair of test problems: a stochastic Minimax Steiner problem and on-line optimization of a simple queuing model.

4.1 A Stochastic Minimax Steiner problem

Assume that in a two-dimensional domain X there are n towns of the same population, say equal to 1. The distribution of the population over the area occupied by town i is F_i. All towns are served by a single facility, say, an ambulance. The "inconvenience of service" for town i is measured by the mean distance from the facility to the customers, i.e., by the function

Φ_i(x) = ∫ |x - P| dF_i(P),

x being the location of the facility. The problem is to find a location for the facility which minimizes the worst-case, with respect to all n towns, inconvenience of service. Mathematically, we have the following minimax stochastic program:

min_{x∈X} φ(x),  φ(x) = max_{i=1,...,n} Φ_i(x).

We assume that the only source of information on the problem is a sample

ω_1 = (P_1^1, ..., P_1^n), ..., ω_N = (P_N^1, ..., P_N^n)

with mutually independent entries P_t^i ∈ R^2 distributed according to F_i, i.e., a random sample of N tuples of requests for service, one request per town in a tuple. The above Minimax problem can be naturally posed as an SSP problem (cf. Example 2.2) with the objective

Φ(x, y) = Σ_{i=1}^{n} y_i Φ_i(x) : X × Y → R,  Y = { y ∈ R^n | y ≥ 0, Σ_{i=1}^{n} y_i = 1 },

the observations being n X yi (x ? P i ); (x; y; !) = i i=1 jx ? P j (x; y; !) = (jx ? P 1j; :::; jx ? P n j)T ; ! = (P 1; :::; P n):

In our experiments we chose X to be the square (-10, -10)^T ≤ x ≤ (10, 10)^T and dealt with n = 5 towns placed at the vertices v_i, i = 1, ..., 5, of a regular pentagon, F_i being the normal two-dimensional distribution with mean v_i and the unit covariance matrix. We used the setup (2.19), (2.21) with the parameters

D_X = 20√2,  D_Y = √2,  L = D_X + 10 D_Y,

and ran 2,000 steps of the SASP algorithm, starting at the point x_1 = (10, 10)^T. The results are presented in Fig. 1. We found that the relative inaccuracy

δ = ( φ(x̄_2000) - min_x φ(x) ) / min_x φ(x)

in 20 runs varied from 0.0006 to 0.006.
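For illustration, here is a minimal sketch (our code, not the authors') of the SASP iteration on this Steiner problem. The stepsizes γ_t = 1/√t, the full Cesàro averaging, the pentagon radius of 5 (a value the paper does not report), and the Euclidean simplex projection are simplified stand-ins for the exact setup (2.19), (2.21); by the five-fold symmetry, the optimal location is the center of the pentagon:

```python
import math
import random

def project_box(x, lo=-10.0, hi=10.0):
    return [min(max(v, lo), hi) for v in x]

def project_simplex(y):
    """Euclidean projection onto {y >= 0, sum y = 1} (sort-based)."""
    u = sorted(y, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(v - theta, 0.0) for v in y]

def sasp_steiner(n_steps=2000, seed=1):
    rng = random.Random(seed)
    # towns at the vertices of a regular pentagon (radius 5 is assumed)
    centers = [(5 * math.cos(2 * math.pi * i / 5),
                5 * math.sin(2 * math.pi * i / 5)) for i in range(5)]
    x, y = [10.0, 10.0], [0.2] * 5
    sx, sy = [0.0, 0.0], [0.0] * 5
    for t in range(1, n_steps + 1):
        calls = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
                 for cx, cy in centers]
        # stochastic gradient in x: sum_i y_i (x - P^i)/|x - P^i|
        gx, dists = [0.0, 0.0], []
        for yi, (px, py) in zip(y, calls):
            d = math.hypot(x[0] - px, x[1] - py) or 1e-12
            dists.append(d)
            gx[0] += yi * (x[0] - px) / d
            gx[1] += yi * (x[1] - py) / d
        gamma = 1.0 / math.sqrt(t)
        x = project_box([x[0] - gamma * gx[0], x[1] - gamma * gx[1]])
        # ascent step in y along the observed distances eta = (|x - P^i|)_i
        y = project_simplex([yi + gamma * d for yi, d in zip(y, dists)])
        sx = [a + b for a, b in zip(sx, x)]
        sy = [a + b for a, b in zip(sy, y)]
    return [v / n_steps for v in sx], [v / n_steps for v in sy]

xbar, ybar = sasp_steiner()
print(xbar)  # the averaged iterate ends up near the pentagon's center
```

The averaged iterate settles near the center even though the individual search points keep oscillating, which is the averaging effect discussed in the Introduction.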


[Figure 1. Panel titles: "Centers" of towns (*) and the optimal solution (+); plot of 10,000 random "calls for service", 2,000 calls per town; optimal solution (+) and results of 20 runs of the SASP algorithm.]

4.2 A simple queuing model

Here we apply the SASP algorithm to optimization of simple queuing models in steady state, such as the GI/G/1 queue. We consider the following minimization program:

minimize φ(x), x ∈ X,   (4.1)

where the domain X is an r-dimensional box

X = { x : p ≤ x ≤ q }.   (4.2)

In particular, we assume that the expected performance φ(x) is given as

φ(x) = c ℓ + b^T x,   (4.3)

where ℓ is the expected steady-state waiting time of a customer, c is the cost of a waiting customer, x = (x_1, ..., x_r) are parameters of the distributions of interarrival and service times, b_k is the cost per unit increase of x_k, and b^T is the transpose of b. Note that for most exponential families of distributions (see, e.g., Rubinstein and Shapiro (1993), Chapter 3) the expected performance φ(x) is a convex function of x. To proceed with the program (4.1), consider Lindley's recursive (sample path) equation for the waiting time l_t of a customer in a GI/G/1 queue (e.g., Kleinrock (1975), p. 277):

l_{t+1} = max{0, l_t + Z_t(x)}, t ≥ 1; l_1 = 0,   (4.4)

where, for fixed x, Z_t(x), t = 1, 2, ..., is an iid sequence of random variables (the differences between the service and interarrival times) with distribution F(·; x) depending on the parameter vector x. It is readily seen that the corresponding algorithm for calculating an estimate η_t of ∇φ(x) can be written as follows:

Algorithm 4.1:

1. Generate the output process l_t = l_t(x) using Lindley's recursive equation

l_{t+1}(x) = max{0, l_t(x) + Z_t(x)}, t ≥ 1; l_1 = 0;   (4.5)

here Z_t(x) = F_s^{-1}(u_t^s; x) - F_a^{-1}(u_t^a; x), where F_a(·; x), F_s(·; x) are the distributions of the interarrival and service times, respectively, and u_1^a, u_1^s, u_2^a, u_2^s, ... are independent random variables uniformly distributed on [0, 1].

2. Differentiate (4.5) with respect to x, thus getting a recurrent formula for ∇l_t(x), and use this recurrence to construct the estimates η_t = c ∇l_t(x) + b of ∇φ(x).

Note that under mild regularity assumptions (see, e.g., Rubinstein and Shapiro (1993), Chapter 4) the expectation of η_t - ∇φ(x) converges to 0 as t → ∞. Application of the SASP algorithm (2.15)-(2.18) to the program (4.1)-(4.3) yields

x_{t+1} = π_X[ x_t - γ_t η_t ],  x̄_t = ( t - ⌈t/2⌉ + 1 )^{-1} Σ_{τ=⌈t/2⌉}^{t} x_τ,   (4.6)

where η_t are the estimates of ∇φ(x_t) yielded by Algorithm 4.1 with x in (4.5) replaced by x_t (see (4.6)), γ_X = D_X^2, and D_X = q - p.
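A minimal sketch (our code, not the authors') of steps 1-2 of Algorithm 4.1 for the M/M/1 case treated below: interarrival times are Exp(lam), service times Exp(x), both generated by inverse transform so that the sample path, and hence the waiting times, are differentiable in the service rate x:

```python
import math
import random

def lindley_gradient(x, lam=1.0, n=2000, seed=0):
    """Return (mean waiting time, pathwise derivative d/dx) over n customers
    of the Lindley recursion l_{t+1} = max(0, l_t + S_t - A_t)."""
    rng = random.Random(seed)
    l, dl = 0.0, 0.0            # l_t and d l_t / dx
    tot_l, tot_dl = 0.0, 0.0
    for _ in range(n):
        ua, us = rng.random(), rng.random()
        s = -math.log(1.0 - us) / x         # service time F_s^{-1}(u^s; x)
        ds = math.log(1.0 - us) / (x * x)   # its derivative in x (negative)
        a = -math.log(1.0 - ua) / lam       # interarrival time
        z, dz = s - a, ds
        if l + z > 0.0:                     # customer waits: propagate both
            l, dl = l + z, dl + dz
        else:                               # idle period: waiting resets to 0
            l, dl = 0.0, 0.0
        tot_l += l
        tot_dl += dl
    return tot_l / n, tot_dl / n

lbar, dlbar = lindley_gradient(2.0)
print(lbar, dlbar)
```

For lam = 1, x = 2 this recursion tracks the queueing delay, whose M/M/1 steady-state mean is lam/(x(x - lam)) = 0.5; the pathwise derivative estimate is negative, since increasing the service rate shortens the queue.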

It is important to note that we are now in a situation different from the one postulated in Theorem 1, in the sense that the stochastic estimates of ∇φ used in (4.6) are biased and dependent on each other. The numerical experiments below demonstrate that the SASP algorithm handles these kinds of problems reasonably well. In these experiments we considered an M/M/1 queue, λ and x being the interarrival and service rates; x is the decision variable in the program (4.1)-(4.3). Taking into account that IE_x ℓ = 1/(x - λ), it is readily seen that the value x* which minimizes the performance measure φ(x) in (4.1)-(4.3) is

x* = λ + (c/b)^{1/2}.

We set λ = 1, b = 1, c = 1 (which corresponds to x* = 2 and φ(x*) = 3) and chose as X the segment

X = { x : 1.25 ≤ x ≤ 5 },

which in terms of the traffic intensity ρ = ρ(x) = λ/x is { x : 0.2 ≤ ρ(x) ≤ 0.8 }. To demonstrate the effect of the Cesaro averaging in (4.6), we present below statistical results both for the sequence x_t and for x̄_t (see (2.15)-(2.18)), i.e., with and without Cesaro averaging. We shall call the sequence x_t the crude SASP (CSASP) sequence. Tables 4.1 and 4.2 present 10 realizations of the estimates of x* yielded by the CSASP and SASP algorithms (denoted x_t^c and x_t^r, respectively), along with the corresponding values of the objective; two estimates in the same column correspond to a common simulation. Each of the 10 experiments related to Table 4.1 was started at the point x_1 = 5, which corresponds to ρ = 0.2; the starting point for the experiments from Table 4.2 was x_1 = 1.25 (ρ = 0.8). The observations used in the method were given by the Lindley equation approach; the "memory" τ(t) and the stepsizes γ_t were chosen according to (2.21), with D_X = 5 - 1.25 = 3.75 and L = 125. In each experiment we performed 2000 steps of the algorithm (i.e., simulated 2000 arrivals of customers).
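The closed-form minimizer can be checked by brute force; this fragment (ours, not from the paper) minimizes φ(x) = c/(x - λ) + bx over a fine grid on X = [1.25, 5]:

```python
# sanity check: with lam = b = c = 1, the minimizer of
# phi(x) = c/(x - lam) + b*x on [1.25, 5] is x* = lam + sqrt(c/b) = 2
lam, b, c = 1.0, 1.0, 1.0

def phi(x):
    return c / (x - lam) + b * x

xs = [1.25 + 3.75 * i / 100000 for i in range(100001)]
x_star = min(xs, key=phi)
print(round(x_star, 3), round(phi(x_star), 3))  # -> 2.0 3.0
```

The grid search reproduces x* = 2 and φ(x*) = 3, the values quoted in the text.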
Tables 4.3 and 4.4 summarize the statistics of Tables 4.1 and 4.2; namely, they present the sample means over the 10 runs,

x̃_N^c = (1/10) Σ_{i=1}^{10} x_N^c(i),  φ(x̃_N^c) = (1/10) Σ_{i=1}^{10} φ(x_N^c(i)),
x̃_N^r = (1/10) Σ_{i=1}^{10} x_N^r(i),  φ(x̃_N^r) = (1/10) Σ_{i=1}^{10} φ(x_N^r(i)),

and the associated confidence intervals.

Let w^c and w^r be the widths of the confidence intervals associated with the CSASP and SASP sequences, respectively. The quantity κ = w^c/w^r can be regarded as the efficiency of the SASP sequence relative to the CSASP one. From the results of Tables 4.3 and 4.4 it follows that the efficiency κ is quite significant; e.g., for the experiments presented in Table 4.3 we have κ ≈ 4 for x̃ and κ ≈ 8 for φ(x̃). We also applied to the program (4.1)-(4.3) the SASP algorithm with the adaptive stepsize policy (2.27), (2.30) and used it for various single-node queuing models with different interarrival and service time distributions. In all our experiments we found that the SASP algorithm converges reasonably fast to the optimal solution x*.

Table 4.1. 10 experiments with Lindley's formula, starting from x_1 = 5.

No.    x_t^c    φ(x_t^c)    x_t^r    φ(x_t^r)
 1     1.974    3.000       1.961    3.001
 2     2.466    3.148       1.979    3.000
 3     1.918    3.007       2.008    3.000
 4     1.781    3.061       1.980    3.000
 5     1.851    3.025       2.073    3.005
 6     1.863    3.021       1.952    3.002
 7     1.907    3.009       2.021    3.000
 8     1.885    3.014       2.044    3.001
 9     2.225    3.041       1.933    3.004
10     2.299    3.069       2.147    3.018

Table 4.2. 10 experiments with Lindley's formula, starting from x_1 = 1.25.

No.    x_t^c    φ(x_t^c)    x_t^r    φ(x_t^r)
 1     3.182    3.641       1.978    3.000
 2     2.224    3.041       2.015    3.000
 3     1.918    3.007       2.200    3.033
 4     1.792    3.054       1.988    3.000
 5     2.216    3.038       1.962    3.001
 6     2.270    3.057       1.920    3.006
 7     2.678    3.274       2.038    3.001
 8     2.234    3.044       1.931    3.005
 9     1.698    3.130       2.105    3.010
10     2.234    3.044       2.101    3.009

Table 4.3. Point estimators and confidence intervals for the M/M/1 queue with Lindley's equation, starting from x_1 = 5.

Method    Estimator    Point estimate    Confidence interval
CSASP     x̃_N^c        2.017             [1.875, 2.159]
          φ(x̃_N^c)     3.039             [3.011, 3.067]
SASP      x̃_N^r        2.010             [1.970, 2.050]
          φ(x̃_N^r)     3.003             [3.000, 3.007]

Table 4.4. Point estimators and confidence intervals for the M/M/1 queue with Lindley's equation, starting from x_1 = 1.25.

Method    Estimator    Point estimate    Confidence interval
CSASP     x̃_N^c        2.245             [1.977, 2.512]
          φ(x̃_N^c)     3.133             [3.013, 3.253]
SASP      x̃_N^r        2.024             [1.969, 2.079]
          φ(x̃_N^r)     3.006             [3.000, 3.013]

5 Conclusions

We have shown that

• The SASP algorithm (2.15)-(2.18) is applicable to a wide variety of stochastic saddle point problems, in particular to those associated with single-stage constrained convex stochastic programming problems (Examples 2.1-2.3). The method works under rather mild conditions: it requires only convexity-concavity of the associated saddle point problem (Assumption A) and conditional independence and unbiasedness of the observations (Assumption B).

• In contrast to classical Stochastic Approximation, no smoothness or nondegeneracy assumptions are needed. The rate of convergence of the method is data- and dimension-independent and is optimal in order, in the minimax sense, on wide families of convex-concave stochastic saddle point problems.

• As applied to general saddle point problems, the method seems to be the only stochastic approximation type routine converging without additional smoothness and nondegeneracy assumptions. The only alternative method for treating these problems is the so-called stochastic counterpart method (see Shapiro (1996)), which, however, requires more powerful nonlocal information about the problem. (For more details on stochastic approximation versus the stochastic counterpart method, see Shapiro (1996).)

6 Appendix

6.1 Appendix A: Proof of Theorems 1 and 2

As was already mentioned, the statement of Theorem 1 is a particular case of that of Theorem 2 corresponding to C_t ≡ C_* = C^* = C, so that it suffices to prove Theorem 2.

1°. Let

d(z, z') = √( γ_X^{-1} |x - x'|^2 + γ_Y^{-1} |y - y'|^2 ),  z = (x, y), z' = (x', y') ∈ R^n × R^m,

be the scaled Euclidean distance between z and z'. Note that, due to the standard properties of the projection operator, we have

∀ z ∈ X × Y ∀ z':  d(z, π_{X×Y}[z']) ≤ d(z, z').   (6.1)

We assume, without loss of generality, that (0, 0) ∈ X × Y.

2°. Note that

z_t = z_t(ω^{t-1}),  C_t = C_t(ω^t),  ω^i = (ω_1, ..., ω_i),

where for each t the quantities z_t, C_t are deterministic Borel functions of the indicated arguments. Let us fix z = (x, y) ∈ X × Y and consider the random variable

d_t^2 = d^2(z, z_t).

By the description of the method and (6.1) we have

d_{t+1}^2 ≤ d^2( z, (x_t - γ_t γ_X ξ(z_t, ω_t), y_t + γ_t γ_Y η(z_t, ω_t)) )
 = γ_X^{-1} |x - x_t + γ_t γ_X ξ(z_t, ω_t)|^2 + γ_Y^{-1} |y - y_t - γ_t γ_Y η(z_t, ω_t)|^2
 = d_t^2 + 2γ_t [ (x - x_t)^T ξ(z_t, ω_t) - (y - y_t)^T η(z_t, ω_t) ] + γ_t^2 [ γ_X |ξ(z_t, ω_t)|^2 + γ_Y |η(z_t, ω_t)|^2 ].

Setting

ξ̄(z, ω) = ξ(z, ω) - IE ξ(z, ω) ≡ ξ(z, ω) - Φ'_x(z),  η̄(z, ω) = η(z, ω) - IE η(z, ω) ≡ η(z, ω) - Φ'_y(z),

we can rewrite the resulting relation as

(x_t - x)^T Φ'_x(z_t) - (y_t - y)^T Φ'_y(z_t) ≤ (d_t^2 - d_{t+1}^2)/(2γ_t) + x^T ξ̄(z_t, ω_t) - y^T η̄(z_t, ω_t) + ε_t(ω^t) + (γ_t/2) σ_t^2(ω^t),   (6.2)

where

ε_t(ω^t) = -x_t^T ξ̄(z_t, ω_t) + y_t^T η̄(z_t, ω_t)   (6.3)

and

σ_t^2(ω^t) = γ_X |ξ(z_t, ω_t)|^2 + γ_Y |η(z_t, ω_t)|^2.   (6.4)

Since Φ(x, y) is convex in x and Φ'_x(z_t) ∈ ∂_x Φ(z_t), we have

Φ(x, y_t) - Φ(x_t, y_t) ≥ (x - x_t)^T Φ'_x(z_t);

similarly,

-[Φ(x_t, y) - Φ(x_t, y_t)] ≥ -(y - y_t)^T Φ'_y(z_t),

whence

(x_t - x)^T Φ'_x(z_t) - (y_t - y)^T Φ'_y(z_t) ≥ Φ(x_t, y) - Φ(x, y_t).

Substituting this inequality into (6.2), we obtain

Φ(x_t, y) - Φ(x, y_t) ≤ (d_t^2 - d_{t+1}^2)/(2γ_t) + x^T ξ̄(z_t, ω_t) - y^T η̄(z_t, ω_t) + ε_t(ω^t) + (γ_t/2) σ_t^2(ω^t).   (6.5)

3°. Summing the inequalities (6.5) over t = τ(N), ..., N and applying the Cauchy inequality, we obtain

Σ_{t=τ(N)}^{N} [Φ(x_t, y) - Φ(x, y_t)] ≤ d_{τ(N)}^2/(2γ_{τ(N)}) + Σ_{t=τ(N)+1}^{N} d_t^2 [ 1/(2γ_t) - 1/(2γ_{t-1}) ] - d_{N+1}^2/(2γ_N)
  + |x| |Ξ_N(ω^N)| + |y| |H_N(ω^N)| + R_N(ω^N) + Σ_{t=τ(N)}^{N} (γ_t/2) σ_t^2(ω^t),   (6.6)

where

Ξ_N(ω^N) = Σ_{t=τ(N)}^{N} ξ̄(z_t, ω_t),  H_N(ω^N) = Σ_{t=τ(N)}^{N} η̄(z_t, ω_t),
R_N(ω^N) = Σ_{t=τ(N)}^{N} [ -x_t^T ξ̄(z_t, ω_t) + y_t^T η̄(z_t, ω_t) ].   (6.7)

Applying next Jensen's inequality to the convex functions Φ(·, y) and -Φ(x, ·) and taking into account (2.18), we obtain that

Σ_{t=τ(N)}^{N} [Φ(x_t, y) - Φ(x, y_t)] ≥ (N - τ(N) + 1) [ Φ(x̄^N, y) - Φ(x, ȳ^N) ].

Since (0, 0) ∈ X × Y, we also have |x| ≤ D_X and |y| ≤ D_Y. Clearly,

d_t^2 ≤ D_X^2 γ_X^{-1} + D_Y^2 γ_Y^{-1},

because both z_t and z belong to X × Y. In view of these inequalities we obtain from (6.6)

(N - τ(N) + 1) [ Φ(x̄^N, y) - Φ(x, ȳ^N) ] ≤ (1/2) [ D_X^2 γ_X^{-1} + D_Y^2 γ_Y^{-1} ] [ 1/γ_{τ(N)} + Σ_{t=τ(N)+1}^{N} |1/γ_t - 1/γ_{t-1}| ]
  + D_X |Ξ_N(ω^N)| + D_Y |H_N(ω^N)| + R_N(ω^N) + Σ_{t=τ(N)}^{N} (γ_t/2) σ_t^2(ω^t).   (6.8)

The right-hand side of (6.8) is independent of z = (x, y); consequently, it majorizes the supremum of the left-hand side over z ∈ X × Y. This supremum is equal to

(N - τ(N) + 1) [ max_{y∈Y} Φ(x̄^N, y) - min_{x∈X} Φ(x, ȳ^N) ] = (N - τ(N) + 1) ε_Φ(z̄^N)

(see (2.6)). Thus, we have derived the following inequality:

(N - τ(N) + 1) ε_Φ(z̄^N) ≤ (1/2) [ D_X^2 γ_X^{-1} + D_Y^2 γ_Y^{-1} ] Γ_N(ω^N) + D_X |Ξ_N(ω^N)| + D_Y |H_N(ω^N)| + R_N(ω^N) + Σ_{t=τ(N)}^{N} (γ_t/2) σ_t^2(ω^t),

Γ_N(ω^N) = 1/γ_{τ(N)} + Σ_{t=τ(N)+1}^{N} | 1/γ_t - 1/γ_{t-1} |.   (6.9)

In view of (2.23) and assumption C we have (C_* and C^* being the lower and upper bounds on C_t given by assumption C)

Γ_N(ω^N) = √(τ(N))/C_{τ(N)} + Σ_{t=τ(N)+1}^{N} | √t/C_t - √(t-1)/C_{t-1} |
  ≤ √(τ(N))/C_{τ(N)} + Σ_{t=τ(N)+1}^{N} [ |√t - √(t-1)|/C_{t-1} + √t |C_t - C_{t-1}|/(C_t C_{t-1}) ]
  ≤ (√N/C_*) [ 2 + (1/C_*) Σ_{t=τ(N)+1}^{N} |C_t - C_{t-1}| ].   (6.10)

Consequently, (6.9) yields

(N - τ(N) + 1) ε_Φ(z̄^N) ≤ (1/2) [ D_X^2 γ_X^{-1} + D_Y^2 γ_Y^{-1} ] (√N/C_*) [ 2 + (1/C_*) Σ_{t=τ(N)+1}^{N} |C_t - C_{t-1}| ]
  + D_X |Ξ_N(ω^N)| + D_Y |H_N(ω^N)| + R_N(ω^N) + ( C^*/(2√(τ(N))) ) Σ_{t=τ(N)}^{N} σ_t^2(ω^t)   (6.11)

(we have taken into account that γ_t ≤ C^*/√(τ(N)) for t ≥ τ(N)).

4°. To obtain the desired estimate for IE ε_Φ(z̄^N), it suffices to take the expectation of both sides of (6.11). When doing so, one should take into account that:

• In view of assumption B, the conditional (ω^{t-1} being fixed) expectations of the vectors ξ̄(z_t, ω_t) and η̄(z_t, ω_t) are zero, those of their squared norms do not exceed

M_x^2 ≡ sup_{z∈X×Y} IE|ξ(z, ω)|^2,  M_y^2 ≡ sup_{z∈X×Y} IE|η(z, ω)|^2,

and by construction z_t = (x_t, y_t) is a deterministic function of ω^{t-1}. This implies the inequalities

IE R_N(ω^N) = Σ_{t=τ(N)}^{N} IE{ -x_t^T ξ̄(z_t, ω_t) + y_t^T η̄(z_t, ω_t) } = 0,

IE|Ξ_N(ω^N)| ≤ ( IE{|Ξ_N(ω^N)|^2} )^{1/2}
  = ( Σ_{t=τ(N)}^{N} IE{|ξ̄(z_t, ω_t)|^2} + 2 Σ_{t=τ(N)+1}^{N} IE{ ξ̄^T(z_t, ω_t) Σ_{τ=τ(N)}^{t-1} ξ̄(z_τ, ω_τ) } )^{1/2}
  = ( Σ_{t=τ(N)}^{N} IE{|ξ̄(z_t, ω_t)|^2} )^{1/2} ≤ M_x √(N - τ(N) + 1),

and similarly

IE|H_N(ω^N)| ≤ M_y √(N - τ(N) + 1).

Finally,

IE Σ_{t=τ(N)}^{N} σ_t^2(ω^t) = IE Σ_{t=τ(N)}^{N} [ γ_X |ξ(z_t, ω_t)|^2 + γ_Y |η(z_t, ω_t)|^2 ] ≤ [ γ_X M_x^2 + γ_Y M_y^2 ] (N - τ(N) + 1).

With these inequalities we obtain from (6.11)

IE ε_Φ(z̄^N) ≤ ( √N/(2 C_* (N - τ(N) + 1)) ) [ D_X^2 γ_X^{-1} + D_Y^2 γ_Y^{-1} ] [ 2 + (1/C_*) IE Σ_{t=τ(N)+1}^{N} |C_t - C_{t-1}| ]
  + (D_X M_x + D_Y M_y)/√(N - τ(N) + 1) + (C^*/2) [ γ_X M_x^2 + γ_Y M_y^2 ]/√(τ(N)).

Since, by definition, L = D_X M_x + D_Y M_y and, consequently,

γ_X M_x^2 + γ_Y M_y^2 ≤ [ γ_X D_X^{-2} + γ_Y D_Y^{-2} ] L^2,

we arrive at (2.26).

6.2 Appendix B: Proof of the Proposition

Without loss of generality we may assume that X is not a singleton. By evident homogeneity reasons, we may also assume that the diameter of X is 2 and that X contains the segment I = {te | -1 ≤ t ≤ 1}, e being a unit vector. For a given ε ∈ (0, 0.01L), consider the two problems p_+ and p_- with the functions

Φ_+(x, y) = 4ε x^T e,  Φ_-(x, y) = -4ε x^T e.

Let further

ξ_+(x, y, ω) = (4ε + (L/4)ω) e,  η_+(x, y, ω) ≡ 0,
ξ_-(x, y, ω) = (-4ε + (L/4)ω) e,  η_-(x, y, ω) ≡ 0,

be the associated estimates of the partial (with respect to x and y) derivatives of Φ_+(x, y) and Φ_-(x, y), respectively. Assume, finally, that ω is a standard Gaussian random variable. It is readily seen that the problems indeed belong to S(X, Y, L).

Let N = Compl(ε; S(X, Y, L)). By the definition of complexity, there exists a method B which in N steps solves all problems from S(X, Y, L) (in particular, both p_+ and p_-) with expected inaccuracy at most ε. The method clearly implies a routine R for distinguishing between two hypotheses, H_+ and H_-, on the distribution of an iid sample ζ_1, ..., ζ_N, where H_± states that the distribution of every ζ_t is N(μ_±, σ^2), μ_± = ±4ε, σ^2 = L^2/16. The routine R is as follows:

In order to decide which of the hypotheses takes place, we treat the observed sample as the sequence of coefficients at e in the N subsequent observations of the gradient with respect to x in a saddle point problem on X × Y (and add zero observations of the gradient with respect to y). Applying the first N steps of the method B to these observations, we form the N-th approximate solution z^N and check whether e^T z^N < 0. If this is the case, we accept H_+; otherwise we accept H_-.

It is clear that the probability for R to reject the hypothesis H_+ when it is valid is exactly the probability for B to get, as a result of its work on p_+, a point z^N with e^T z^N ≥ 0. In this case the inaccuracy of z^N, regarded as an approximate solution to p_+, is at least 4ε, and since the expected inaccuracy of B on p_+ is ≤ ε, the indicated probability is at most 1/4. By similar considerations, the probability for R to reject H_- when this hypothesis is valid is also ≤ 1/4.

Thus, the integer N = Compl(ε; S(X, Y, L)) is such that there exists a routine for distinguishing between the aforementioned pair of statistical hypotheses with probability of rejecting the true hypothesis (whether it is H_+ or H_-) at most 1/4. By standard statistical arguments, this is possible only if

N μ_±^2/σ^2 = 256 N ε^2/L^2 ≥ O(1)

with an appropriately chosen positive absolute constant O(1), which yields the sought lower bound N ≥ O(1) L^2/ε^2.
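The reduction in the proof can be imitated numerically. The sketch below (ours; the parameter values are arbitrary) distinguishes between the two means ±4ε by the sign of the sample mean of N observations N(mean, L^2/16), and shows the probability of error dropping below 1/4 only once N is of order L^2/ε^2 (up to an absolute constant):

```python
import random

def error_prob(N, eps, L, trials=4000, seed=0):
    """Monte Carlo probability of misclassifying the sign of the mean
    (+4*eps vs -4*eps) from N observations N(mean, (L/4)^2)."""
    rng = random.Random(seed)
    sigma = L / 4.0
    errors = 0
    for _ in range(trials):
        mu = 4.0 * eps if rng.random() < 0.5 else -4.0 * eps
        xbar = sum(rng.gauss(mu, sigma) for _ in range(N)) / N
        if (xbar > 0) != (mu > 0):
            errors += 1
    return errors / trials

eps, L = 0.01, 1.0
# error prob = Phi(-16*eps*sqrt(N)/L), which falls below 1/4 only for
# N of order L^2/eps^2 (times an absolute constant)
for N in (10, 200):
    print(N, error_prob(N, eps, L))
```

With few observations the tester fails more than a quarter of the time; with enough observations (in the sense of the bound above) the error probability collapses, which is exactly the dichotomy the complexity argument exploits.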

References

[1] Asmussen, S. and Rubinstein, R.Y. (1992). "The efficiency and heavy traffic properties of the score function method in sensitivity analysis of queuing models", Advances of Applied Probability, 24(1), 172-201.
[2] Ermoliev, Yu.M. (1969). "On the method of generalized stochastic gradients and quasi-Fejer sequences", Cybernetics, 5(2), 208-220.
[3] Ermoliev, Yu.M. and Gaivoronski, A.A. (1992). "Stochastic programming techniques for optimization of discrete event systems", Annals of Operations Research, 39, 1-41.
[4] Kleinrock, L. (1975). Queuing Systems, Vols. I and II, Wiley, New York.
[5] Kushner, H.J. and Clark, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, Applied Math. Sciences, Vol. 26.
[6] L'Ecuyer, P., Giroux, N. and Glynn, P.W. (1994). "Stochastic optimization by simulation: Numerical experiments for the M/M/1 queue in steady-state", Management Science, 40, 1245-1261.
[7] Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Stochastic Systems, Birkhäuser Verlag, Basel.
[8] Nemirovski, A. and Yudin, D. (1978). "On Cesaro's convergence of the gradient descent method for finding saddle points of convex-concave functions", Doklady Akademii Nauk SSSR, Vol. 239, No. 4 (in Russian; translated into English as Soviet Math. Doklady).
[9] Nemirovski, A. and Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization, J. Wiley & Sons.
[10] Pflug, G.Ch. (1992). "Optimization of simulated discrete event processes", Annals of Operations Research, 39, 173-195.
[11] Polyak, B.T. (1990). "New method of stochastic approximation type", Automation and Remote Control, 51, 937-946.
[12] Rockafellar, R.T. (1970). Convex Analysis, Princeton University Press.
[13] Rubinstein, R.Y. and Shapiro, A. (1993). Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization via the Score Function Method, to be published by John Wiley & Sons.
[14] Shapiro, A. (1996). "Simulation based optimization: convergence analysis and statistical inference", Stochastic Models, to appear.
[15] Tsypkin, Ya.Z. (1970). Adaptation and Learning in Automatic Systems, Academic Press, New York.


Dear Dr. King,

Please find attached the revised version of our paper

"An Efficient Stochastic Approximation Algorithm"

along with our response to the referee reports on the original version of the paper. As you can see from this response, the result on the non-improvability of the rate of convergence of our algorithm is not contradictory at all. We are very grateful to you and to the referees for the effort spent in processing our paper. We apologize for the delay with the revision, caused by two subsequent sabbatical leaves of both of us.

Yours sincerely,

Arkadi Nemirovski
Reuven Rubinstein

Response to Referee Report # 1

1. Although we do not feel that the well-established notions related to information-based complexity were presented in the original version of the paper "in a non-rigorous manner", we have extended the related part of the paper in order to prevent misunderstandings.

2. There is no contradiction between the result on the non-improvability of our algorithm, with its rate of convergence O(t^{-1/2}), and the classical results on the O(t^{-1}) rate of convergence of the Stochastic Approximation algorithm. Every "non-improvability statement" is in fact a statement about the behaviour of a method on a certain family S of problems rather than on a particular problem. Specifically, given the required level of accuracy ε and a solution method B (the precise description of the latter notion is given in the paper), we look at how many steps it takes the method, in the worst case with respect to instances from S, to solve an instance within accuracy ε. When S is fixed, the resulting number of steps Compl_B(ε) depends on ε and on B. A method B is called optimal in order on S if there exists a constant C such that Compl_B(ε) ≤ C Compl_{B'}(ε) for every other solution method B' and all ε ≤ C^{-1}. The rate of convergence of an optimal method B (the function inverse to Compl_B(ε), i.e., the accuracy guaranteed by B as a function of the number of steps) is called the non-improvable on S rate of convergence. It should be stressed that both the fact of optimality of a given method and the notion of a non-improvable rate of convergence depend on the family S in question. In particular, the larger S is, the slower is the associated non-improvable rate of convergence, and this observation resolves the "contradiction" found by the referee.
Let us restrict ourselves to the case considered by the referee: a convex objective f(x) is minimized on a segment, say, on [-1, 1]; it is known in advance that f is smooth,

0 ≤ ℓ ≤ f''(x) ≤ L < ∞  for all x,

and attains its (unconstrained) minimum on the segment [-1, 1]. Our goal is to approximate the minimizer of f, given observations of f'(·) corrupted by, say, N(0, 1) random noises (the noises of subsequent observations being independent of each other). These assumptions together define a certain family S(ℓ, L) of stochastic minimization problems, and this family depends on the parameters ℓ, L, the only "free" elements of the above description. With this notation, the family considered by the referee ("Suppose for instance that f(x) = (1/2)(x - x*)^2...") is S(1, 1), and the unimprovable rate of convergence for this family is t^{-1}. As an optimal method, one can take either the classical Stochastic Approximation with stepsizes O(t^{-1}), or Polyak's algorithm with stepsizes O(t^{-β}). This is exactly what is stated by the referee, but there is absolutely no contradiction with what is stated in the paper: in the latter we were speaking about the class S(0, 1), and not S(1, 1) 1).

1) In fact, in the original version of the paper no classes S(ℓ, L) were mentioned: we were speaking about a certain explicitly defined class S(X, Y, L). With proper specification of the "parameters" X, Y, L, the class S(X, Y, L) becomes, essentially, S(0, 1), but no setup can convert S(X, Y, L) into S(ℓ, 1) with ℓ > 0.

Note that the family S(0, 1) is much wider than S(1, 1), so it is quite natural that the unimprovable rate of convergence for the former family is much worse than the one for the latter (specifically, O(t^{-1/2}) instead of O(t^{-1})). Although we believe that the "non-improvability" statements and proofs were quite correct already in the original version of the paper, we did our best to update the related parts of the text in order to prevent misunderstandings like the one just discussed. In particular, in the revised version we stress explicitly the fact that the non-improvability of the rate O(t^{-1/2}) takes place on some, not all, families of stochastic saddle point problems, and we provide examples of "natural" families where this phenomenon does/does not take place.


Response to Referee # 2

General comments of the referee: We have corrected the misprints, did our best to improve the Introduction, etc. As for the references, our policy was to indicate the pioneers of the subject, to list a number of monographs reflecting its history and state of the art and, of course, to indicate the preceding papers we are "directly linked to". As for papers which are perhaps quite important and valuable, but not related directly to the particular subject of our paper: we do not pretend to survey the "vast literature on SA"; it is indeed vast...

Specific comments of the referee:

(1) ("The parameters Mx and My...") The observation is absolutely true, but we did not claim that the estimates M_x^t, M_y^t are consistent estimates of M_x, M_y, and in fact we are not interested in consistent estimation of M_x, M_y at all. Indeed, from the convergence analysis of the algorithm it follows that what we are interested in is, roughly speaking, the order of magnitude of the observations along the trajectory of the method, and not their worst-case magnitude. As indicated by the referee, if z_t does converge to the true solution, our adaptive stepsize policy indeed achieves this target; otherwise we still "survive" due to the safeguards we use. We have added the relevant comments.

(2) ("It is stated on page 15...") In the revised text, we assume explicitly the convexity of φ(x); as for the associated sufficient conditions, a precise reference is given. Now, the simulated waiting times l_t we use in the "queuing experiments" are not drawn from the steady-state distribution; this point and other details related to the experiments are explicitly discussed in the revised text.

(3) ("It is stated on page 16...") The (indeed bad) original sentence is now corrected. As for x: it stands for the vector of parameters of the distributions specifying the queuing model; in particular, in the example we process numerically, x is the service rate in the M/M/1 queue (the reciprocal of the expected service time), and not the service time itself.

(4) The text is modified accordingly.