A Perturbation Approach to Approximate Value Iteration for Average Cost Markov Decision Process with Borel Spaces and Bounded Costs

Óscar Vega-Amaya∗

Joaquín López-Borbón†

August 21, 2015

Abstract

The present work proves the convergence of the approximate value iteration (AVI) algorithm and provides performance bounds for the average cost criterion with bounded costs and Borel spaces under standard compactness-continuity conditions and a uniform ergodicity assumption. This is done for the class of approximation operators with the following properties: (i) they give an exact representation of the constant functions; (ii) they are positive linear operators; and (iii) they have a certain continuity property. The main point is that operators with properties (i)-(iii) define transition probabilities on the state space of the controlled system. This fact has the following important consequences: (a) the approximating function is the target function's average value with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be thought of as a perturbation of the original Markov model; (c) the perturbed model inherits the ergodicity properties of the original Markov model. These facts allow us to bound the performance of the AVI algorithm in terms of the accuracy of the approximations given by the operators for the primitive data of the model, namely, the one-step cost function and the system's transition law. The bounds are given in terms of the supremum norm of bounded functions and the total variation norm of finite signed measures. The results are illustrated with numerical approximations for a class of single-item inventory systems with linear order cost, no set-up cost and no back-orders.

Key words. Markov decision processes, average cost criterion, approximate value iteration algorithm, contraction and non-expansive operators, perturbed Markov decision models.

∗ Corresponding author. Departamento de Matemáticas, Universidad de Sonora, Luis Encinas y Rosales s/n, C.P. 83180, Hermosillo, Sonora, México ([email protected]). This work was partially supported by project SEP-CA Red en Control Estocástico.

† Departamento de Matemáticas, Universidad de Sonora, Luis Encinas y Rosales s/n, C.P. 83180, Hermosillo, Sonora, México ([email protected]). This author was supported by Consejo Nacional de Ciencia y Tecnología (scholarship 61702) and by Universidad de Sonora.


AMS subject classifications. 90C40, 90C59, 93E20.

1 Introduction

The approximate value iteration (AVI) algorithms are a class of approximating procedures that attempt to mitigate the so-called curse of dimensionality present in Markov decision processes. The research on approximation procedures for the computation of optimal value functions and optimal control policies is by now a very active subdiscipline of Markov decision processes with a diversity of problems, and the literature dealing with them has experienced an intensive growth in recent years (see, for instance, the books [6, 27, 35, 36] and the survey papers [7, 22, 28, 29, 30, 31, 32]). In fact, in this regard Powell [29] states that this situation "can sometimes seem like a jungle of algorithmic strategies." However, most of the works concentrate either on the discrete-space case (mainly, on the finite case) or on the discounted cost criterion. The recent papers by Bertsekas [7] and Powell and Ma [30] give succinct but comprehensive reviews on approximate policy iteration for the discounted cost criterion; the former deals with finite models, while the latter reviews the continuous-space case. For further results on the discounted case with uncountable state space see [2, 24, 33, 34].

In contrast, the average cost criterion, being by far more involved than the discounted criterion, has been much less studied numerically. In fact, some authors consider the average cost optimal control problem not as a single problem, but as a collection of problems (see [3]). Among the few exceptions we can mention the papers [1, 10, 11, 13], which focus on discrete spaces. The paper [1] shows the convergence of the Q-learning algorithm for finite models, while [13] proposes a Q-learning variant based on policy iteration for finite semi-Markov models. The paper [10] considers finite models and requires two approximate linear programs to get "good" approximate solutions: one to approximate the optimal average cost and the other to control the accuracy of the approximations of the (differential cost) function that solves the optimality equation. The paper [11] also develops an approximate linear programming approach, but it considers countable state spaces in discrete and continuous time. This last paper exploits the fact that the average cost problem is the limit of discounted problems as the discount factor vanishes, so the discounted problem is seen as a perturbed version of the average cost problem.

The present work studies a class of approximate value iteration (AVI) algorithms for the average cost criterion with Borel spaces and bounded costs following the approach recently introduced in [40] for the discounted cost criterion. As is well known, the main approach to solve the optimal control problem, either with respect to the average cost criterion or the discounted cost criterion, consists in finding a solution of the corresponding optimality equation. The average cost optimality equation has the form

ρ∗ + h∗(·) = T h∗(·),   (1)

where ρ∗ is a constant, h∗(·) is a measurable function and T is the dynamic programming operator (see (9) below). If such a pair (ρ∗, h∗) exists and h∗(·) is, for instance, a bounded function, then standard arguments show that ρ∗ is the optimal average cost and also that any stationary policy f∗ that attains the minimum on the right-hand side of the optimality equation is optimal. For general results on the average cost criterion see, for instance, [15, 16, 20, 21, 38, 39].

However, equation (1) can rarely be solved analytically, so its solutions are approximated by several procedures. The major schemes to obtain or approximate a solution of the optimality equation (1) are the value iteration (or successive approximation) algorithm, the policy iteration algorithm, and the linear programming approach ([5, 14, 15, 16, 26]). The present work concerns the first of these, which, roughly speaking, aims to obtain a solution (ρ∗, h∗) as the limit of the sequences

ρn(·) := Jn(·) − Jn−1(·),  hn(·) := Jn(·) − Jn(z),  n ∈ N := {1, 2, . . .},   (2)

where Jn(·) is the n-step optimal cost function (with J0 ≡ 0) and z is an arbitrary but fixed state. Using the dynamic programming operator, these sequences are related by the equations

ρn+1(z) + hn+1(·) = T hn(·)  ∀n ∈ N,   (3)

with h0 ≡ 0. Note that ρn+1(z) = T hn(z) because hn+1(z) = 0. Thus, we have that

hn+1(·) = Tz hn(·)  ∀n ∈ N,   (4)

where the operator Tz is defined as

Tz u(·) := T u(·) − T u(z)   (5)

for functions u(·) belonging to an appropriate space of functions. Observe that (ρ∗, h∗) satisfies (1) with h∗(z) = 0 if and only if h∗(·) is a fixed point of Tz. Thus, the value iteration algorithm (3) or (4) is said to converge if the sequence (ρn(z), hn), n ∈ N, converges to a solution (ρ∗, h∗) of the average cost optimality equation (1). If this is the case and the algorithm (4) is stopped at time n ∈ N, then a greedy stationary policy fn with respect to the function hn(·) is computed–that is, a policy that attains the minimum in T hn(·)–and the optimal value ρ∗ is approximated by the average cost J(fn) induced by the policy fn.

Unfortunately, the procedure (4)–or (2) or (3)–is numerically infeasible for systems with large or continuous spaces if an exact representation of the sequence {(ρn, hn)} is pursued. One can then try to circumvent this obstacle by considering an approximate or fitted value iteration algorithm which interleaves an approximation step between two consecutive applications of the dynamic programming operator T. In many cases the approximation step is represented by an operator L, so Lv(·) is the approximation of the function v(·).
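For a finite model, the recursion (4) can be implemented directly; the following sketch (Python, illustrative only, assuming a cost array C[x, a] and a transition array Q[a, x, y] as inputs; the array names are chosen here for illustration) computes (ρn(z), hn) together with an hn-greedy policy, stopping when the span of ρn falls below a tolerance.

```python
import numpy as np

def relative_value_iteration(C, Q, z=0, tol=1e-8, max_iter=10_000):
    """Relative value iteration h_{n+1} = T_z h_n, cf. equation (4), for a finite MDP.

    C[x, a]    : one-step cost array, shape (S, A)
    Q[a, x, y] : transition law, shape (A, S, S), rows summing to one
    z          : fixed reference state
    Returns (rho, h, f): approximate optimal average cost, relative value
    function with h[z] = 0, and an h-greedy stationary policy.
    """
    S, A = C.shape
    h = np.zeros(S)
    for _ in range(max_iter):
        Th = (C + np.einsum("axy,y->xa", Q, h)).min(axis=1)   # T h(x)
        rho = Th - h                 # rho_{n+1}(x) = T h_n(x) - h_n(x) = J_{n+1}(x) - J_n(x)
        h = Th - Th[z]               # T_z h_n
        if rho.max() - rho.min() < tol:   # span stopping rule on ||rho_{n+1}||_sp
            break
    values = C + np.einsum("axy,y->xa", Q, h)
    return values.min(axis=1)[z], h, values.argmin(axis=1)
```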

There are two slightly different approximate procedures depending on which operator, T or L, acts first. These procedures are given by the composite operators T̂ = T L and T̃ = LT. Thus, the standard value iteration algorithm (4) is substituted by

ĥn+1(·) = T̂z ĥn(·) := T̂ ĥn(·) − T̂ ĥn(z),  n ∈ N0 := {0, 1, . . .},   (6)

or

h̃n+1(·) = T̃z h̃n(·) := T̃ h̃n(·) − T̃ h̃n(z),  n ∈ N0.   (7)

In several approaches, Lv(·) is the projection of the function v(·) onto some class of approximating functions F that depends on finitely many parameters. For instance, F is usually taken as the space spanned by linear combinations of finitely many "basis functions" φ0, φ1, . . . , φN. The class F is called an architecture for the approximation problem and Lv(·) is called the scoring function. The main steps in the projection method are the choice of a good architecture F, which is problem dependent, and the computation of the projections. Usually the projections are very difficult to find analytically and to compute numerically, so they are approximated by a variety of simulation methods (see, e.g., [2, 6, 24, 27]), thus introducing a second source of approximation error. Another kind of approximation operators L are the so-called self-approximating operators, that is, operators that do not need auxiliary methods to produce the approximation Lv(·) for any function v(·). Examples of such operators are given by piecewise constant approximation schemes, linear and multilinear approximations, splines, Chebyshev polynomials, kernel-based approximations, etc. (see, e.g., [31, 34]).

Then, as in the exact VI algorithm (3) or (4), if algorithm (7) is stopped at stage n, it computes an h̃n-greedy policy fn–i.e., a stationary policy that attains the minimum in T h̃n(·)–and approximates the optimal average cost ρ∗ by means of the average cost J(fn) incurred by the policy fn. This approximation algorithm raises several important problems:

P1 The first one is the convergence of the approximate sequence {(ρ̃n, h̃n)};

P2 Once the convergence is ensured, say to (ρ̃, h̃), the second one concerns establishing computable bounds for the approximation errors (i) ρ̃ − ρ∗ and (ii) h̃(·) − h∗(·).

P3 Finally, the third one–perhaps the most important from the practical viewpoint–consists in providing performance bounds for the algorithm, that is, bounds for the approximation error E(fn) := J(fn) − ρ∗.

Obviously, the same problems arise for the algorithm (6), replacing the sequence {(ρ̃n, h̃n)} and (ρ̃, h̃) by {(ρ̂n, ĥn)} and (ρ̂, ĥ), respectively.
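When L acts on functions defined on a finite state set and can be written as a matrix, the iteration (7) simply interleaves this matrix with the exact operator T. The following sketch (same finite-model conventions as the previous one; L_mat is a name introduced here only for illustration) makes this explicit.

```python
import numpy as np

def approximate_value_iteration(C, Q, L_mat, z=0, n_iter=200):
    """AVI iteration (7): h_{n+1} = (L T h_n) - (L T h_n)(z), i.e. using T~ = L T.

    L_mat[x, y] : the approximation operator L written as a matrix acting on
                  state functions (row-stochastic when L is an averager).
    """
    S, A = C.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        Th = (C + np.einsum("axy,y->xa", Q, h)).min(axis=1)   # exact operator T
        LTh = L_mat @ Th                                      # approximation step L
        h = LTh - LTh[z]                                      # T~_z applied to h
    values = C + np.einsum("axy,y->xa", Q, h)
    f = values.argmin(axis=1)                                 # h~_n-greedy policy
    rho_tilde = (L_mat @ values.min(axis=1))[z]               # rho~_n(z) = T~ h~_n(z)
    return rho_tilde, h, f
```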

Concerning problem P1, it was already noted in previous works that the approximate value iterations need not converge even for the discounted criterion–see, e.g., [9, 12, 31]–unless the approximator L enjoys some non-expansive property with respect to a suitable norm. For instance, it is well known that the β-discounted dynamic programming operator Tβ is a contraction operator from the space of bounded measurable functions into itself whenever the one-step cost is a bounded function; thus, if L is non-expansive, the operators T̂β := Tβ L and T̃β := LTβ are contraction operators too. Hence, the Banach fixed-point theorem guarantees the convergence of the approximate sequences. On the other hand, the projection method, which is among the most used approximating procedures, usually leads to expansive operators. (One exception to this rule is the aggregation-projection scheme used in [37].) So, the convergence of the approximating iterates (6) and (7) may fail, and the hope in this case is that the sequences of n-step approximation errors or Bellman residuals ||Lĥn − T Lĥn|| and ||h̃n − T h̃n||, n ∈ N, remain bounded, where || · || is a suitable norm on functions. Note that in case the convergence fails, problem P3 still makes sense–provided of course the Bellman residuals are bounded–but it turns into a pretty difficult task, and the authors are unaware of general computable bounds either for the discounted or the average cost criteria.

The present work addresses problems P1, P2 and P3 assuming the Markov decision model satisfies standard compactness and continuity conditions (Assumptions 2.1 and 2.2, Section 2) and a uniform ergodicity condition (Assumption 2.3), and considering monotone linear operators–as suggested in [12]–which give an exact representation of the constant functions and additionally have the following continuity property: Lvn ↓ 0 provided vn ↓ 0 (Definition 4.1). This kind of approximators is called averagers, following Gordon's terminology [12]. The operators defined by piecewise constant approximation schemes, linear and multilinear interpolation, kernel-based approximators [34], Schoenberg's splines [4], and the state aggregation-projection scheme in [37] satisfy all these properties.

The key point in our approach is that the averagers define transition probabilities on the state space (see Lemma 4.2). This fact allows us to view the approximating action of an averager as a perturbation of the original Markov decision model and then to provide the kind of bounds asked for in P2(i) and P3. The convergence in P1 comes from the fact that the perturbations induced by averagers preserve the continuity and ergodicity properties imposed on the original model (see Assumption 2.3, Section 2). This implies that the operators T̂z and T̃z are contraction mappings with respect to the span semi-norm (see Lemma 4.5), establishing the convergence in problem P1. On the other hand, problem P2(ii) has turned out to be somewhat elusive, and the authors were unable to give sharp bounds. In fact, de Farias and Van Roy [10] develop an approximate linear programming approach based on an aggregation-projection method and show with a simple example that even very accurate approximations of the optimal average cost do not guarantee good bounds in P2(ii), and also that the approximations to h∗(·) may produce bad policies, which is, of course, the actual problem. Then, they introduce a complementary approximate linear program to control the accuracy of the approximations to h∗(·). This misbehavior is due to the rather peculiar features of the average cost criterion and, more specifically,


to the role the function h∗(·) plays in solving the optimal control problem. For additional comments on this issue see Remarks 3.3 and 5.4, Sections 3 and 5.

To the best of the authors' knowledge, the aforementioned link between approximation schemes using averagers and perturbation has largely passed unnoticed except for Gordon's paper [12], which does not take advantage of it. In fact, it is referred to as an "intriguing property" that allows one to view "averagers as Markov processes" (see [12, Section 4]). A remarkable fact is that, once the approximation step is seen as a perturbation, the analysis of problems P1-P3 is direct and the proofs follow by somewhat elementary arguments.

The remainder of the present work is organized as follows. Section 2 introduces the Markov decision model, the average cost criterion and the assumptions under which the approximate value iteration algorithm is studied. Section 3 contains some well-known results on the value iteration algorithm which are our departing point. The core of the work is formed by Sections 4 and 5. Section 4 introduces the averagers and the perturbed models, while Section 5 addresses problems P1-P3; specifically, Theorem 5.2 proves the convergence of the AVI algorithms, among other properties, and Theorem 5.3 gives the performance bounds. Section 6 illustrates the perturbation approach developed in the previous sections with numerical results for a single-item inventory system with finite capacity, no backlog and no set-up production cost. Section 7, the Appendix, collects the proofs of the results.

Throughout the work we use the following notation. For a topological space (S, τ), B(S) denotes the Borel σ-algebra generated by the topology τ and "measurability" will always mean Borel measurability. Moreover, M(S) is the class of all measurable functions on S, whereas Mb(S) denotes the subspace of bounded measurable functions endowed with the supremum norm ||v||∞ := sup_{s∈S} |v(s)|. The subspace of bounded continuous functions is denoted by Cb(S). For each subset A ⊂ S, IA denotes the indicator function of A, that is, IA(s) = 1 for s ∈ A and IA(s) = 0 otherwise. A Borel space Y is a measurable subset of a complete separable metric space endowed with the inherited metric.

2 The average cost criterion

We consider a Markov decision model M = (X, A, {A(x) : x ∈ X}, Q, C) with the usual meaning: the state space X and the action or decision space A are Borel spaces, and A(x) is a measurable subset of A that stands for the admissible action or decision set for the state x ∈ X. As usual, it is assumed that the set of admissible state-action pairs K := {(x, a) ∈ X × A : a ∈ A(x), x ∈ X} is a Borel measurable subset of the product space X × A. The transition law Q(·|·, ·) is a stochastic kernel on X given K, which means that Q(·|x, a) is a probability measure on X for each pair (x, a) ∈ K, and that Q(B|·, ·) is a Borel measurable function on K for each Borel measurable subset B of X. Finally, the one-step cost or running cost function C(·, ·) is a Borel measurable function on K.

As is well known, a Markov decision model can be thought of as a model


of a stochastic dynamical system that evolves as follows: at time n = 0 the decision-maker observes the system in some state x0 = x ∈ X and chooses a decision or action a0 = a ∈ A(x), incurring a cost C(x, a). Then, the system moves to a new state x1 = x′ ∈ X according to the probability measure Q(·|x, a), the decision-maker chooses a new admissible decision a1 = a′ ∈ A(x′), incurring a cost C(x′, a′), and so on. We will refer to the processes {xt} and {at} as the state and decision processes, respectively.

At each decision time n ∈ N0 the decision-maker chooses the decision variable an according to a deterministic or stochastic rule πn, which may depend on the whole previous history hn = (x0, a0, . . . , xn−1, an−1, xn) of the system. Naturally, the rules pick admissible actions with probability one, that is, πn(an ∈ A(xn)|hn) = 1 for all hn ∈ Hn, where Hn := K^n × X for n ∈ N and H0 := X. Note that Hn is the set of all admissible histories up to time n ∈ N0. We will refer to the sequence π = {πn} as an (admissible) decision or control policy. The class of all policies is denoted by Π.

Denote by F the class of measurable selectors, that is, the class consisting of measurable functions f : X → A such that f(x) ∈ A(x) for all x ∈ X. The policy π = {πn} is called a (deterministic) stationary policy if πn(·|hn) = δf(xn)(·) for all histories hn ∈ Hn and n ∈ N0 for some selector f ∈ F; thus, the policy π = {πn} is identified with the selector f and the class of all stationary policies with the family of selectors F.

Let Ω := (X × A)^∞ be the canonical sample space and F the corresponding product σ-algebra. It is well known that for each decision policy π = {πn} and initial state x0 = x ∈ X there exists a probability measure Pxπ on the measurable space (Ω, F) that governs the evolution of the controlled process {(xn, an)} induced by the policy π = {πn}. The expectation operator with respect to the probability measure Pxπ is denoted by Exπ. The expected cost incurred in n steps under the policy π = {πn} ∈ Π given the initial state x0 = x is

Jn(π, x) := Exπ Σ_{k=0}^{n−1} C(xk, ak).

Let J0∗(·) = 0 and Jn∗(·), n ∈ N, be the n-stage optimal cost function, that is,

Jn∗(x) := inf_{π∈Π} Jn(π, x),  x ∈ X.   (8)

The (expected) average cost is given by

J(π, x) := lim sup_{n→∞} (1/n) Jn(π, x),  x ∈ X, π ∈ Π,

and the average cost optimal control problem consists in finding a policy π∗ = {πn∗} such that

J(π∗, x) = J∗(x) := inf_{π∈Π} J(π, x)  ∀x ∈ X.

If such a policy π∗ = {πn∗} exists, it is called (expected) average optimal, while J∗(·) is called the (expected) average cost optimal value function.

For every measurable function v on K and stationary policy f ∈ F, we shall use the following notation: vf(x) := v(x, f(x)), x ∈ X. In particular, for the cost function and the transition law we shall write

Cf(x) := C(x, f(x))  and  Qf(·|x) := Q(·|x, f(x)),  x ∈ X.

Using this notation, Qf^n(·|·) stands for the n-step transition probability for the stationary policy f ∈ F.

The main results of the present work use either Assumption 2.1 or 2.2 below.

Assumption 2.1 (a) C(·, ·) is bounded by a constant K > 0; (b) A(x) is a non-empty compact subset of A for each x ∈ X and the mapping x → A(x) is continuous; (c) C(·, ·) is a continuous function on K; (d) Q(·|·, ·) is weakly continuous on K, that is, the mapping

(x, a) → ∫_X u(y)Q(dy|x, a)

is continuous for each function u ∈ Cb(X).

Assumption 2.2 (a) C(·, ·) is bounded by a constant K > 0; moreover, the following holds for each x ∈ X: (b) A(x) is a non-empty compact subset of A; (c) C(x, ·) is a continuous function on A(x); (d) Q(·|x, ·) is strongly continuous on A(x), that is, the mapping

a → ∫_X u(y)Q(dy|x, a)

is continuous for each function u ∈ Mb(X).

In what follows we assume that either Assumption 2.1 or Assumption 2.2 holds. Thus, C(X) will denote either the space Cb(X) or Mb(X) depending on whether Assumption 2.1 or Assumption 2.2 is being used, respectively. Then, with this convention at hand, the dynamic programming operator

T u(x) := inf_{a∈A(x)} [ C(x, a) + ∫_X u(y)Q(dy|x, a) ],  x ∈ X,   (9)

maps the space C(X) into itself, and for each u ∈ C(X) there exists a selector fu ∈ F such that

T u(x) = Cfu(x) + ∫_X u(y)Qfu(dy|x)  ∀x ∈ X.

The policy fu is called the u-greedy policy. For a proof of these statements see, for instance, [14, Proposition D.3, p. 130]. For each f ∈ F define the operator Tf u(x) := Cf(x) + Qf u(x), x ∈ X, and observe that it maps Mb(X) into itself whenever the one-step cost function C(·, ·) is bounded.

A solution of the so-called average cost optimality equation consists of a pair formed by a constant ρ∗ and a measurable function h∗(·) on X that satisfy the equation

ρ∗ + h∗(x) = T h∗(x).   (10)

If such a solution exists and h∗(·) ∈ C(X), then there is an h∗-greedy policy f∗ ∈ F, that is,

ρ∗ + h∗(·) = T h∗(·) = Tf∗ h∗(·).   (11)

The triplet (ρ∗, h∗, f∗) is called a canonical triplet. Then, standard dynamic programming arguments show that the stationary policy f∗ is an optimal policy and that the constant ρ∗ is the optimal cost, that is, J∗(·) = J(f∗, ·) = ρ∗. Moreover, if the pair (ρ, h) also enters in a canonical triplet, then ρ = ρ∗ and h = h∗ + k for a constant k.

The existence of a solution to the average cost optimality equation and the convergence of the value iteration algorithm are guaranteed under the ergodicity condition in Assumption 2.3 below. The total variation norm of a finite signed measure λ on X is defined as

||λ||TV := sup{ |∫_X v(y)λ(dy)| / ||v||∞ : v ∈ Mb(X), v ≠ 0 }.   (12)

If ν = P1 − P2, where P1 and P2 are probability measures on X, it can be proved that

||ν||TV = 2 sup{ |P1(B) − P2(B)| : B ∈ B(X) }.   (13)

Assumption 2.3 There exists a positive number α < 1 such that

||Q(·|x, a) − Q(·|x′, a′)||TV ≤ 2α  ∀(x, a), (x′, a′) ∈ K.   (14)
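For discrete measures, the norm in (12)-(13) is just a finite sum of absolute differences, so condition (14) can be checked directly; the following minimal sketch (same finite-model array conventions as in the Introduction, illustrative only) is one way to do it.

```python
import numpy as np

def tv_norm(p, q):
    """Total variation norm ||p - q||_TV for discrete measures; by (13) this
    equals 2 * sup_B |p(B) - q(B)|."""
    return np.abs(p - q).sum()

def ergodicity_coefficient(Q):
    """Half of the largest TV distance between the rows Q[a, x, :] over all
    state-action pairs; Assumption 2.3 requires this value to be < 1."""
    rows = Q.reshape(-1, Q.shape[-1])            # all kernels Q(.|x, a) stacked
    return max(tv_norm(p, q) for p in rows for q in rows) / 2.0
```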

Remark 2.4 (cf. [14, Lemma 3.3, p. 57], [23, Thm. 16.0.2, p. 384]) Assumption 2.3 implies that the Markov chain {xn} induced by each stationary policy f ∈ F is uniformly ergodic; specifically, it implies that

sup_{x∈X} ||Qf^n(·|x) − µf(·)||TV ≤ 2α^n  ∀n ∈ N,   (15)

where µf(·) is the unique invariant probability measure for the transition law Qf(·|·), that is,

µf(B) = ∫_X Qf(B|x)µf(dx)  ∀B ∈ B(X).

Property (15) implies

sup_{x∈X} |Exf u(xn) − µf(u)| ≤ 2α^n ||u||∞   (16)

for all u ∈ Mb(X) and f ∈ F, where

µf(v) := ∫_X v(y)µf(dy)  ∀f ∈ F, v ∈ Mb(X).

In turn, (16) yields

lim_{n→∞} (1/n) Exf Σ_{k=0}^{n−1} u(xk) = µf(u)   (17)

for all x ∈ X and u ∈ Mb(X). In particular,

J(f) := J(f, x) = µf(Cf)  ∀x ∈ X, f ∈ F.   (18)

For further discussion on Assumption 2.3 and a proof of (15) the reader is referred to [14, Ch. 3], [23, Ch. 16], [8, 17, 18]. The proof of our main results (Theorem 5.3, Section 5) uses the following result borrowed from [25].

Remark 2.5 Let S(·|·) and R(·|·) be transition probabilities on X. Define

θ := sup_{x∈X} ||S(·|x) − R(·|x)||TV  and  γ := (1/2) sup_{x,y∈X} ||S(·|x) − S(·|y)||TV.

If the Markov chain with transition probability R(·|·) is uniformly ergodic and γ < 1, then

||πS − πR||TV ≤ θ / (1 − γ),

where πS(·) and πR(·) are the invariant probability measures for S(·|·) and R(·|·), respectively.
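The next small numerical check (toy 3-state kernels chosen only for illustration) verifies the inequality of Remark 2.5, computing the invariant measures from the stationarity equations and the total variation norms as in (13).

```python
import numpy as np

def invariant_measure(P):
    """Stationary distribution of a row-stochastic matrix P (unique under ergodicity)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Two nearby kernels S and R on a 3-state space (toy numbers, illustration only)
S = np.array([[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]])
R = np.array([[0.45, 0.35, 0.2], [0.4, 0.35, 0.25], [0.3, 0.35, 0.35]])

theta = max(np.abs(S[i] - R[i]).sum() for i in range(3))
gamma = 0.5 * max(np.abs(S[i] - S[j]).sum() for i in range(3) for j in range(3))
lhs = np.abs(invariant_measure(S) - invariant_measure(R)).sum()
print(lhs <= theta / (1 - gamma))        # the bound of Remark 2.5 holds
```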

3 Value iteration algorithm

Let z ∈ X be an arbitrary but fixed state and Mb0(X) be the subspace of functions u ∈ Mb(X) satisfying the condition u(z) = 0. The subspace Cb0(X) is the class of functions u ∈ Cb(X) with u(z) = 0. Usually, the spaces Mb(X) and Cb(X) are endowed with the supremum norm ||u||∞ := sup_x |u(x)|, but here we also consider the span of bounded functions, which is defined as

||u||sp := sup_{x∈X} u(x) − inf_{x∈X} u(x),  u ∈ Mb(X).

Note that ||·||sp is a semi-norm on Mb(X), but restricted to the subspace Mb0(X) or Cb0(X) it becomes a norm. In fact,

||u||∞ ≤ ||u||sp  ∀u ∈ Mb0(X),   (19)

so the subspaces (Mb0(X), ||·||sp) and (Cb0(X), ||·||sp) are Banach spaces. As above, C0(X) denotes either Cb0(X) or Mb0(X) depending on whether Assumption 2.1 or Assumption 2.2 is being used.

Let (ρ∗, h∗) be a solution of the optimality equation (10) with h∗(z) = 0. Then, ρ∗ = T h∗(z) and the optimality equation can be rewritten as

h∗(x) = T h∗(x) − T h∗(z)  ∀x ∈ X.

Thus, define the operator

Tz u(x) := T u(x) − T u(z),  x ∈ X.

Note that, under either Assumption 2.1 or 2.2, Tz maps the subspace C0(X) into itself and also that a function h∗ ∈ C0(X) solves the optimality equation if and only if it is a fixed point of Tz, that is,

h∗(x) = Tz h∗(x) = T h∗(x) − T h∗(z)  ∀x ∈ X.

Moreover, the n-stage optimal cost functions (8) satisfy the recursive equation (see, e.g., [14])

Jn+1∗(·) = T Jn∗(·)  ∀n ∈ N0,

which leads to the equations

ρn+1(z) + hn+1(·) = T hn(·)  ∀n ∈ N0,   (20)

where hn(·) and ρn(·), n ∈ N, are the functions introduced in (2), that is,

hn(·) = Jn∗(·) − Jn∗(z)  and  ρn(·) = Jn∗(·) − Jn−1∗(·),  n ∈ N.

The value iteration (VI) algorithm is said to converge if the sequence (ρn(z), hn), n ∈ N, converges to a solution (ρ∗, h∗) of the average cost optimality equation. Now recall that, using the operator Tz, the recursive equation (20) becomes

hn+1(·) = Tz hn(·) = Tz^n h0(·)  ∀n ∈ N0,

with h0(·) = 0. The convergence of the VI algorithm is established in Remark 3.1 and Theorem 3.2 below.

Remark 3.1 Suppose that Assumption 2.3 and either one of Assumptions 2.1 or 2.2 holds. Then:

(a) The operator Tz is a contraction from the Banach space (C0(X), ||·||sp) into itself with modulus α (see [14, Lemma 3.5, p. 59]). Thus, from the Banach fixed-point theorem and (11), there exists a canonical triplet (ρ∗, h∗, f∗) with h∗(·) belonging to C0(X) and ρ∗ = T h∗(z).

(b) Moreover,

||Tz^n u − h∗||∞ ≤ ||Tz^n u − h∗||sp ≤ α^n ||h∗||sp → 0

for all u ∈ C(X). In particular, taking u ≡ 0, we have

||hn − h∗||∞ ≤ ||hn − h∗||sp ≤ ||h∗||sp α^n → 0.

To complete the proof of the convergence of the VI algorithm it only remains to establish the convergence of the sequence ρn(z), n ∈ N, to ρ∗. This is shown, among other useful facts, in the next theorem, which comes from [14, Thm. 4.8, p. 64].

Theorem 3.2 Suppose that either one of Assumptions 2.1 or 2.2 holds, and also that Assumption 2.3 holds. Let h∗(·) be as in Remark 3.1 and define the sequences

sn := inf_{x∈X} ρn(x)  and  Sn := sup_{x∈X} ρn(x),  n ∈ N.

Then:

(a) {sn} is nondecreasing and {Sn} is nonincreasing; moreover,

−α^{n−1} ||h∗||sp ≤ sn − ρ∗ ≤ Sn − ρ∗ ≤ α^{n−1} ||h∗||sp  ∀n ∈ N.

Hence, in particular, ρn(z) → ρ∗.

(b) If f is an hn-greedy policy, then

0 ≤ J(f) − ρ∗ ≤ ||ρn||sp ≤ α^{n−1} ||h∗||sp  ∀n ∈ N.

Remark 3.3 Here it is worth commenting on some very specific features of the average cost criterion that make problem P2(ii) somewhat elusive. On one hand, the average cost criterion is grossly underselective, that is, it may not distinguish between policies having quite different finite-horizon behavior. On the other hand, concerning the optimal control problem, note that h∗(·) plays just an instrumental role, that is, it is only a means to determine the optimal stationary policies. However, one can see that h∗(·) is linked to the growth rate of the finite-horizon cost generated by the optimal stationary policies. In fact, iterations of the equation ρ∗ + h∗(·) = Cf∗(·) + Qf∗ h∗(·) and property (16) lead to

B(f∗, x) := h∗(x) − µf∗(h∗) = lim_{n→∞} [Jn(f∗, x) − nρ∗]  ∀x ∈ X.   (21)

The function B(f∗, ·) is called the bias of the optimal policy f∗. To get a more concrete meaning of the bias function, assume that g∗ is another stationary policy that enters in a canonical triplet with ρ∗ and h∗(·) and also that B(f∗, ·) < B(g∗, ·). Then, for each x ∈ X and ε > 0 there exists Nx,ε ∈ N such that

Jn(f∗, x) < Jn(g∗, x)  ∀n ≥ Nx,ε,

which means that policy f∗ has costs with a lower growth rate than policy g∗ for large enough horizons. Obviously, the optimal policy f∗ is preferable to the optimal policy g∗. In fact, it can be shown, under the assumptions of this paper, that there exists a stationary policy with minimal bias in the class of average optimal stationary policies. For a comprehensive discussion on bias optimality and its relations with other criteria sensitive to the growth rate of the finite-horizon cost see [19]. Summarizing, one can say that the classic algorithms to solve the average optimal control problem–such as the value iteration, policy iteration and linear programming algorithms–are just designed to get the average optimal cost and to identify the optimal policies, not to distinguish the "second level" information provided by h∗(·) on the growth rate of the finite-horizon cost incurred by optimal stationary policies.

Remark 3.4 From (16) and (21), it follows that

|B(f∗, x)| ≤ Σ_{n=0}^{∞} |Exf∗ Cf∗(xn) − ρ∗| ≤ 2||Cf∗||∞ Σ_{n=0}^{∞} α^n ≤ 2K / (1 − α).

Hence, ||B(f∗, ·)||∞ ≤ 2K(1 − α)^{−1}. This in turn implies that ||h∗||∞ ≤ 4K(1 − α)^{−1}.

4 Averagers and perturbed models

As was already mentioned, the main concern of the present work is problems P1-P3–stated in the Introduction–for the approximate value iteration algorithms (6) and (7). This is done following the approach introduced in [40], where a class of approximating operators L, called averagers, is considered. These operators are introduced in the following definition.

Definition 4.1 The operator L : Mb(X) → Mb(X) is said to be an averager if it satisfies the following properties:

(a) LIX(·) = IX(·);

(b) L is a linear operator;

(c) L is a positive operator, that is, Lu(·) ≥ 0 for each u(·) ≥ 0 in Mb(X);

(d) L satisfies the following continuity property: vn(·) ↓ 0, vn(·) ∈ Mb(X) =⇒ Lvn(·) ↓ 0.

If in addition L maps Cb(X) into itself, it is called a continuous averager.

The class of averagers may seem somewhat restrictive at first sight, but several classes of approximation operators of common use satisfy all those conditions. Among these are piecewise constant approximators, linear and multilinear approximators, kernel-based operators (see, e.g., [34]), Schoenberg's splines [4] and certain aggregation-projection operators [37].
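As an illustration, the linear interpolation scheme used again in Section 6 satisfies (a)-(d); the following sketch (a finite representation on a set of evaluation points, with names chosen here for illustration) exhibits the operator as a nonnegative matrix with unit row sums, anticipating Lemma 4.2 below.

```python
import numpy as np

def linear_interpolation_averager(states, nodes):
    """Row-stochastic matrix representing piecewise linear interpolation on a grid:
    (Lv)(x) = sum_i L[x, i] * v(nodes[i]).

    Each row has at most two nonzero weights b_i(x) and 1 - b_i(x) and sums to one,
    so the rows can be read as transition probabilities (cf. Lemma 4.2)."""
    L = np.zeros((len(states), len(nodes)))
    for k, x in enumerate(states):
        i = np.searchsorted(nodes, x, side="right") - 1
        i = min(max(i, 0), len(nodes) - 2)
        w = (nodes[i + 1] - x) / (nodes[i + 1] - nodes[i])   # weight b_i(x)
        L[k, i], L[k, i + 1] = w, 1.0 - w
    return L

nodes = np.linspace(0.0, 25.0, 6)        # coarse grid s_0 < ... < s_N
states = np.linspace(0.0, 25.0, 11)      # finer set of evaluation points
L = linear_interpolation_averager(states, nodes)
print(np.allclose(L.sum(axis=1), 1.0))   # property (a): constants are reproduced
print((L >= 0).all())                    # property (c): positivity
```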

The key point is that the approximating step in the AVI algorithms (6) and (7), when L is an averager, can be seen as a perturbation of the original Markov model. To introduce the perturbed models we need the following simple but important result; in fact, the term averager comes from property (22).

Lemma 4.2 Suppose that L is an averager. Then, the mapping

L(B|x) := LIB(x),  x ∈ X, B ∈ B(X),

is a transition probability on X and

∫_X v(y)L(dy|·) = Lv(·)  ∀v ∈ Mb(X).   (22)

The proof of Lemma 4.2 is omitted because it follows standard arguments.

Remark 4.3 If L is an averager, then it is monotone and non-expansive with respect to ||·||∞ and ||·||sp, that is, for all u, v belonging to Mb(X), it holds that ||Lu − Lv||∞ ≤ ||u − v||∞ and ||Lu − Lv||sp ≤ ||u − v||sp. To prove these facts, let u, v ∈ Mb(X) be arbitrary functions satisfying the inequality u ≥ v. Then, 0 ≤ L(u − v) = Lu − Lv, which implies that Lu ≥ Lv. Concerning the second statement, note that it suffices to check such a property for an arbitrary function v due to the linearity of the operator L. Then, the monotonicity and the normalizing condition in Definition 4.1(a) imply that −||v||∞ ≤ Lv ≤ ||v||∞ and also that inf_{x∈X} v(x) ≤ Lv ≤ sup_{x∈X} v(x), which in turn lead to ||Lv||∞ ≤ ||v||∞ and ||Lv||sp ≤ ||v||sp.

We shall consider the following two perturbed Markov models, M̂ = (X, A, {A(x) : x ∈ X}, Q̂, C) and M̃ = (X, A, {C̃f, Q̃f : f ∈ F}), which are defined below and denoted by M̂ and M̃ for short. Recall that M stands for the original Markov model.

The perturbed model M̂. For this model, the state and control spaces, the admissible decision sets, and the one-step cost function are the same as those of the original Markov model M. The transition law is defined as

Q̂(B|x, a) := ∫_X L(B|y)Q(dy|x, a),  (x, a) ∈ K, B ∈ B(X),

which is clearly a stochastic kernel on X given K because it is the composition of stochastic kernels. Thus, given a policy π ∈ Π and initial state x̂0 = x ∈ X, let {(x̂k, âk)} be the resulting controlled process and P̂xπ the corresponding probability measure, both defined on the measurable space (Ω, F). Let Êxπ be the expectation operator with respect to such a probability measure. The (expected) average cost and the

average optimal value for the perturbed model M̂ are given as

Ĵ(π, x) := lim sup_{n→∞} (1/n) Êxπ Σ_{k=0}^{n−1} C(x̂k, âk),  x ∈ X, π ∈ Π,

Ĵ∗(x) := inf_{π∈Π} Ĵ(π, x),  x ∈ X,

respectively. Thus, π∗ is said to be optimal for model M̂ if Ĵ∗(·) = Ĵ(π∗, ·).

The dynamic programming operator T̂ for the perturbed model M̂ is given as

T̂u(x) := inf_{a∈A(x)} [ C(x, a) + ∫_X u(y)Q̂(dy|x, a) ] = T Lu(x),  x ∈ X.

Moreover, define T̂z u(·) = T̂u(·) − T̂u(z).

Remark 4.4 (a) Suppose Assumption 2.1 holds and also that L is a continuous averager. Then, Q̂(·|·, ·) is weakly continuous, T̂(Cb(X)) ⊂ Cb(X) and T̂z(Cb(X)) ⊂ Cb0(X). Moreover, for each u ∈ Cb(X) there exists a policy f ∈ F such that T̂u(·) = T̂f u(·), where

T̂f u(x) := Cf(x) + ∫_X u(y)Q̂f(dy|x),  x ∈ X.

(b) Similarly, if L is an averager and Assumption 2.2 holds, then T̂(Mb(X)) ⊂ Mb(X) and T̂z(Mb(X)) ⊂ Mb0(X), and for each u ∈ Mb(X) there exists a policy f ∈ F such that T̂u(·) = T̂f u(·).

Now define the functions

Ĵn(·) := T̂Ĵn−1(·)  and  ρ̂n(·) := Ĵn(·) − Ĵn−1(·),  n ∈ N,

with Ĵ0(·) = 0. Observe that the functions defined in (6) can also be expressed as

ĥn(·) = Ĵn(·) − Ĵn(z)  and  ρ̂n(z) = T̂ĥn(z),  n ∈ N.

The perturbed model M̃. For this model, the state and control spaces and the admissible decision sets are the same as those of the original Markov model M. The perturbed one-step costs and transition laws are only defined for the class of stationary policies, as follows:

C̃f(·) := LCf(·)  and  Q̃f(B|·) := LQf(B|·)

for each f ∈ F and B ∈ B(X). By Lemma 4.2, Q̃f(·|·), f ∈ F, is a transition probability on X because it is the composition of two of them:

Q̃f(B|x) = ∫_X Qf(B|y)L(dy|x)  ∀x ∈ X, B ∈ B(X).

Thus, for each stationary policy f ∈ F and initial state x̃0 = x ∈ X there exist a Markov chain {x̃n} and a probability measure P̃xf, both defined on the measurable space (Ω, F), such that Q̃f(·|·) is the one-step transition probability of {x̃n}. The expectation operator with respect to P̃xf is denoted by Ẽxf. The corresponding average cost criterion and optimal value function are given as

J̃(f, x) := lim sup_{n→∞} (1/n) Ẽxf Σ_{k=0}^{n−1} C̃f(x̃k),  x ∈ X, f ∈ F,

J̃∗(x) := inf_{f∈F} J̃(f, x),  x ∈ X.

A policy f∗ ∈ F is said to be optimal for the perturbed model M̃ if J̃∗(·) = J̃(f∗, ·).

The dynamic programming operator T̃ for model M̃ is defined as

T̃u(x) := inf_{f∈F} [ C̃f(x) + ∫_X u(y)Q̃f(dy|x) ],  x ∈ X,

for any u ∈ Mb(X). Moreover, define T̃z u(·) = T̃u(·) − T̃u(z), u ∈ Mb(X).

Remark 4.5 (a) Suppose L is a continuous averager and that Assumption 2.1 holds. Then, the dynamic programming operator T̃ maps Cb(X) into itself and T̃ = LT. To verify the last fact, let u ∈ Cb(X) be a fixed function and note that

T u(·) ≤ Cf(·) + ∫_X u(y)Qf(dy|·)  ∀f ∈ F.

Thus, using the monotonicity and linearity of L, Lemma 4.2 implies that

LT u(·) ≤ C̃f(·) + ∫_X u(y)Q̃f(dy|·).

On the other hand, recall that there exists fu ∈ F such that

T u(·) = Cfu(·) + ∫_X u(y)Qfu(dy|·).

Then, using the linearity of L and Lemma 4.2 again,

LT u(·) = C̃fu(·) + ∫_X u(y)Q̃fu(dy|·),

which yields that T̃u = LT u. Hence, T̃ = LT.

(b) If L is an averager and Assumption 2.2 holds, then part (a) holds replacing Cb(X) by Mb(X).

Next, define the functions

J̃n(·) := T̃J̃n−1(·),  J̃0(·) = 0,  and  ρ̃n(·) := J̃n(·) − J̃n−1(·),  n ∈ N,

and observe that the functions in (7) can also be expressed as

h̃n(·) = J̃n(·) − J̃n(z)  and  ρ̃n(z) = T̃h̃n(z),  n ∈ N.

Remark 4.6 To end the description of the perturbed models, note that the convergence of the approximate value iteration algorithms given in (6) and (7) corresponds to the convergence of the standard value iteration algorithms for the perturbed models M̂ and M̃, respectively.

5 Convergence and performance bounds of AVI algorithms

This section addresses problems P1-P3 posed in the Introduction. The main point here is that the models M̂ and M̃ retain practically all the properties of the original model M. For instance, it is shown in Lemma 5.1 below that models M̂ and M̃ satisfy the ergodicity property in Assumption 2.3 provided the original model M does. The proofs of all results stated in this section are given in the Appendix, Section 7.

Lemma 5.1 Suppose that model M satisfies Assumption 2.3. Then,

||Q̂(·|x, a) − Q̂(·|x′, a′)||TV ≤ 2α  ∀(x, a), (x′, a′) ∈ K,

||Q̃f(·|x) − Q̃g(·|x′)||TV ≤ 2α  ∀x, x′ ∈ X, f, g ∈ F.

Hence, properties (15)-(18) hold in both models M̂ and M̃, where µ̂f(·) and µ̃f(·) are the invariant probability measures for Q̂f(·|·) and Q̃f(·|·), respectively, for each f ∈ F. In particular,

Ĵ(f) := Ĵ(f, ·) = ∫_X Cf(y)µ̂f(dy),   J̃(f) := J̃(f, ·) = ∫_X C̃f(y)µ̃f(dy).

Lemma 5.1 combined with the continuity and compactness conditions implies that the operators T̂z and T̃z are contractions from the space C0(X) into itself. This latter fact implies the existence of canonical triplets for the perturbed models and the convergence of the AVI algorithms as well–see Theorem 5.2 below–thus establishing the convergence asked for in P1.

Theorem 5.2 Suppose Assumption 2.3 holds together with either one of the following sets of conditions: (i) L is a continuous averager and Assumption 2.1 holds; (ii) L is an averager and Assumption 2.2 holds. Then:

(a) There exist canonical triplets (ρ̂, ĥ, f̂) and (ρ̃, h̃, f̃) for models M̂ and M̃, respectively, where ĥ(·) and h̃(·) are functions in C0(X);

(b) ρ̂ = ρ̃, ĥ(·) − T h̃(·) = k1 and Lĥ(·) − h̃(·) = k2, where k1 and k2 are constants;

(c) the AVI algorithms (6) and (7) converge to (ρ̂, ĥ) and (ρ̃, h̃), respectively.

Now, to address problems P2-P3, it is first necessary to specify how the accuracy of the approximations given by the operator L is measured. As was already mentioned, the goal is to provide performance bounds for the AVI algorithms in terms of the accuracy of the approximations provided by L for the primitive data of the original Markov model, namely, the one-step cost function C(·, ·) and the transition law Q(·|·, ·).

To this end, let F̂0 be a subset of stationary policies that contains the subclasses F∗ := {f ∈ F : T h∗(·) = Tf h∗(·)}, F̂1 := {f ∈ F : T Lĥ(·) = Tf Lĥ(·)} and F̂2 := {f ∈ F : T Lĥn(·) = Tf Lĥn(·) for some n ∈ N}. Similarly, let F̃0 be a subset of stationary policies that contains F∗ and the subclasses F̃1 := {f ∈ F : T h̃(·) = Tf h̃(·)} and F̃2 := {f ∈ F : T h̃n(·) = Tf h̃n(·) for some n ∈ N}. The accuracy of the approximations given by the operator L is measured by the constant

δQ(F̂0) := sup_{x∈X, f∈F̂0} ||Qf(·|x) − Q̂f(·|x)||TV   (23)

for model M̂, and by the constants

δC(F̃0) := sup_{f∈F̃0} ||Cf − C̃f||∞,   δQ(F̃0) := sup_{x∈X, f∈F̃0} ||Qf(·|x) − Q̃f(·|x)||TV

for model M̃.

With this notation, we are now ready to state our main results.

Theorem 5.3 Suppose that the assumptions in Theorem 5.2 hold and let σ∗ := ρ̂ = ρ̃. Then:

(a) |ρ∗ − σ∗| ≤ (K / (1 − α)) δQ(F̂0);

(b) If f ∈ F is Lĥn-greedy, that is, T Lĥn = Tf Lĥn, then

0 ≤ J(f) − ρ∗ ≤ ||ρ̂n||sp + (2K / (1 − α)) δQ(F̂0);

(c) |ρ∗ − σ∗| ≤ 2[ δC(F̃0) + (K / (1 − α)) δQ(F̃0) ];

(d) If g ∈ F is h̃n-greedy, that is, T h̃n = Tg h̃n, then

0 ≤ J(g) − ρ∗ ≤ ||ρ̃n||sp + 2[ δC(F̃0) + (K / (1 − α)) δQ(F̃0) ].

Remark 5.4 Remark 3.4 implies that the approximation error functions ĥ − h∗ and h̃(·) − h∗ are uniformly bounded by 8K/(1 − α), which is, of course, a poor bound. However, the key point in any approximation scheme for the average cost criterion is not so much to get close approximations, but that the approximating functions are "well aligned" with the target function h∗(·), in the sense that the approximations do not underestimate states with a high cost and then drive the system to them, or overestimate states with a low cost and take the system away from them. This goal can be accomplished, for instance, if the approximations are close enough to the target function h∗(·), but that is not a necessary condition. The next proposition shows that the approximating functions ĥ and h̃ "choose" good stationary policies.

Proposition 5.5 Suppose that the assumptions in Theorem 5.2 hold. Then:

(a) If f ∈ F is Lĥ-greedy, that is, T Lĥ = Tf Lĥ, then

0 ≤ J(f) − ρ∗ ≤ (2K / (1 − α)) δQ(F̂0);

(b) If g ∈ F is h̃-greedy, that is, T h̃ = Tg h̃, then

0 ≤ J(g) − ρ∗ ≤ 2[ δC(F̃0) + (K / (1 − α)) δQ(F̃0) ].

The proof of Proposition 5.5 is practically the same as that of Theorem 5.3(b) and (c); thus, it is omitted.

Remark 5.6 In a first comparison of parts (a) and (c)–or (b) and (d)–in Theorem 5.3 it would seem that the perturbed model M̃ gives better bounds than model M̂, but this may not be the case. For instance, suppose that Q(B|x, a) = 0 for each discrete subset B ⊂ X := [a, b] and (x, a) ∈ K, and also that L is an interpolator on the grid s0 = a < s1 < · · · < sN = b, so v(si) = Lv(si) for i = 0, . . . , N and each function v on X. Then, LIB1(·) = 0 for B1 := X\{s0, s1, . . . , sN}, and so Q̂(B1|x, a) = 0. Thus, δQ(F̂0) = 2 independently of how fine the grid is.

6 An example from inventory systems

The goal of this section is to illustrate the approach developed in the previous sections with some numerical results. An inventory system for which an analytical solution is known was chosen, in order to contrast the numerical solutions and the performance bounds of the approximate algorithm with the exact solution.

Consider a single-item inventory system with no backlog, no set-up cost and finite capacity θ. Let xn and an be the stock of the item and the amount of it ordered from the production unit at the beginning of the nth stage, and wn the product's demand during that period. It is assumed that the quantity an is supplied immediately at the beginning of the nth stage. Since excess demand is not backlogged, the stock evolves according to

xn+1 = (xn + an − wn)^+,  n ∈ N0,

where x0 = x and v^+ := max(v, 0) for each real number v. The demand process {wn} is formed by independent and identically distributed nonnegative random variables with distribution function F(·). The state and control spaces are X = A = [0, θ] and the admissible control set for state x ∈ X is A(x) = [0, θ − x]. In order to guarantee that Assumption 2.3 holds, it is assumed that F(θ) < 1. The one-step cost function is

C(x, a) = pEw0(w0 − x − a)^+ + h(x + a) + ca,  (x, a) ∈ K,

where Ew0 denotes the expectation operator with respect to the distribution of w0 and the constants p, h and c stand for the unit penalty cost for unmet demand, the unit holding cost for the stock at hand and the unit production cost, respectively.

Note that the one-step cost function C(·, ·) and the mapping x → A(x) are continuous. Moreover, the transition law of the system

Q(B|x, a) = Pr[(x + a − w0)^+ ∈ B],  B ∈ B(X), (x, a) ∈ K,

satisfies the equation

∫_X v(y)Q(dy|x, a) = Ew0 v((x + a − w0)^+)

for all (x, a) ∈ K and v ∈ Mb(X). This and the bounded convergence theorem imply that the mapping

(x, a) → ∫_X v(y)Q(dy|x, a)

is continuous for each v ∈ Cb(X). Hence, the inventory system satisfies Assumption 2.1. On the other hand, Assumption 2.3 follows from the inequality

Q({0}|x, a) = Pr[w0 ≥ x + a] ≥ 1 − F(θ) > 0  ∀(x, a) ∈ K.

In fact, if 0 ∈ B then Q(B|x, a) ≥ 1 − F(θ) for all (x, a) ∈ K; if 0 ∉ B, then Q(B|x, a) ≤ Q((0, θ]|x, a) ≤ F(θ). Hence,

|Q(B|x, a) − Q(B|x′, a′)| ≤ F(θ)  ∀(x, a) ∈ K, B ∈ B(X),

which in turn, from (13), implies that Assumption 2.3 holds with α = F(θ).

Now, consider the linear interpolation scheme with evenly spaced nodes s0 = 0 < s1 < . . . < sN = θ. Thus, the approximating operator L is given as

Lv(x) = bi(x)v(si) + b̄i(x)v(si+1),  x ∈ [si, si+1],

for each function v ∈ Mb(X), where

bi(x) := (si+1 − x) / (si+1 − si)  and  b̄i(x) := 1 − bi(x),  x ∈ [si, si+1],

for i = 0, . . . , N − 1. The operator L is clearly a continuous averager, that is, it is an averager that maps Cb(X) into itself.

For the above inventory model, and for its perturbed models as well, there exist base-stock average cost optimal policies. This can be shown following, for instance, the arguments given in [41] combined with the geometric ergodicity property (15). Recall that a stationary policy f is a base-stock policy if f(x) = S − x for x ∈ [0, S], and f(x) = 0 otherwise, where the constant S ≥ 0 is the so-called re-order point. In fact, the optimal re-order point S∗ for the original inventory model satisfies the equation

F(S) = (p − h − c) / (p − c)

if p > h + c, and S∗ = 0 otherwise (see [41]). Moreover, the optimal average cost is

ρ∗ = pEw0(w0 − S∗) + hS∗ + cEw0 min(S∗, w0).

In view of Remark 5.6, we only consider the perturbed model M̃ and take F̃0 as the class of base-stock policies. Now, in order to get estimates for δC(F̃0) and δQ(F̃0), it is assumed that F(·) has a density ρ(·) which is Lipschitz continuous with modulus l > 0 on [0, θ], that is,

|ρ(x) − ρ(y)| ≤ l|x − y|  ∀x, y ∈ [0, θ].

Then,

δC(F̃0) = sup_{f∈F̃0} ||Cf − C̃f||∞ ≤ (p + h + c)Δs,

δQ(F̃0) = sup_{f∈F̃0} ||Qf − Q̃f||TV ≤ (2θl + 4M′)Δs,

where M′ is a bound on the density ρ(·) and Δs = θ/N. These bounds follow by elementary but cumbersome computations, so their derivation is omitted.

The numerical results displayed in Tables 1, 2 and Figure 1 correspond to the parameter values c = 1.5, h = 0.5, p = 3, θ = 25 for grids with N = 10^2, 10^3, 5 × 10^3 and 10^4.

Table 1: Approximate optimal policies and performance bounds for the AVI algorithm with a linear interpolation scheme with N + 1 evenly spaced nodes, for the inventory system with demand exponentially distributed with parameter λ = 0.05 and parameters c = 1.5, h = 0.5, p = 3, θ = 25; tolerance ε = 10^-4.

              N = 10^2     N = 10^3     N = 5 × 10^3   N = 10^4
  n∗          7            7            7              7
  Sn∗         22           21.975       21.975         21.9725
  δC(F̃0)     1.25         0.125        0.025          0.0125
  δQ(F̃0)     0.08125      0.008125     0.001625       0.0008125
  AE          40.60904     4.060904     0.8121807      0.4060904
  TE          40.60906     4.060929     0.8122065      0.4061161

Table 2: Approximate optimal re-order points with a linear interpolation scheme with 1001 evenly spaced nodes, for the inventory system with demand exponentially distributed with parameter λ = 0.05 and parameters c = 1.5, h = 0.5, p = 3, θ = 25.

  n    S̃n         s̃n
  2    51.79333    42.80326
  3    51.03147    49.35696
  4    50.98637    50.80725
  5    50.98612    50.97339
  6    50.98612    50.98547
  7    50.98612    50.98610

Moreover, it is assumed that the product's demand has an exponential density ρ(·) with parameter λ = 0.05. Note that in this case the density ρ(·) is bounded by M′ = λ = 0.05 and also that it has Lipschitz modulus l = λ² = 0.0025. With these parameter values, the above bounds become δC(F̃0) ≤ 5Δs and δQ(F̃0) ≤ 0.325Δs. Moreover, the optimal re-order point and the optimal average cost are S∗ = 21.97225 and ρ∗ = (c + h)λ^{-1} + hS∗ = 50.98612, respectively.

The approximate value iteration algorithm stops once the stopping error SE := ||ρ̃n||sp falls below the tolerance ε = 10^-4. Let n∗ be the first time SE is less than ε and S(n∗) the re-order point of the h̃n∗-greedy policy. Table 1 and Figure 1 show that the algorithm converges very fast and practically obtains the true optimal policy. The quantities AE := 2[δC(F̃0) + K(1 − α)^{-1} δQ(F̃0)] and TE := AE + SE are bounds for the approximation error and the total error, respectively. The optimal average cost ρ∗ is approximated by the sequences s̃n := inf_{x∈X} ρ̃n(x) and S̃n := sup_{x∈X} ρ̃n(x), n ∈ N. Table 2 displays the values of these sequences for the grid with N = 10^3. Figure 2 shows the sequences s̃n and S̃n, and Figure 3 shows the functions h̃n for θ = 120 with N = 10^3.
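A compact sketch of the scheme of this section is the following (Python; the quadrature used for the demand integral and the restriction of the order-up-to levels to the grid nodes are illustrative implementation choices). It applies the iteration (7) with the linear interpolation averager, stores h̃n on the nodes, and reports the approximate re-order point together with the brackets s̃n and S̃n.

```python
import numpy as np

# Parameter values from this section: c = 1.5, h = 0.5, p = 3, theta = 25, lambda = 0.05
c, hc, p, theta, lam = 1.5, 0.5, 3.0, 25.0, 0.05   # hc = holding cost (h names the value function)
N = 1000                                           # N + 1 evenly spaced nodes
nodes = np.linspace(0.0, theta, N + 1)

def expected_v(v_nodes, y):
    """E v((y - w)^+) for exponential demand w and v piecewise linear on the nodes."""
    if y <= 0.0:
        return v_nodes[0]
    w = np.linspace(0.0, y, 400)
    vals = np.interp(y - w, nodes, v_nodes) * lam * np.exp(-lam * w)
    integral = 0.5 * np.sum((vals[:-1] + vals[1:]) * (w[1] - w[0]))
    return integral + v_nodes[0] * np.exp(-lam * y)     # atom at 0: P(w >= y)

# AVI iteration (7) with z = 0, stopping once the span of rho~_n is below eps
h = np.zeros(N + 1)
eps = 1e-4
for n in range(1, 50):
    # G(y) = p E(w - y)^+ + hc*y + c*y + E h((y - w)^+) for order-up-to level y on the grid
    G = np.array([p * np.exp(-lam * y) / lam + (hc + c) * y + expected_v(h, y) for y in nodes])
    Th = np.minimum.accumulate(G[::-1])[::-1] - c * nodes   # T h(x) = min_{y >= x} G(y) - c*x
    rho = Th - h                                            # rho~_n(x) = T h(x) - h(x)
    s_n, S_n = rho.min(), rho.max()                         # brackets s~_n <= rho* <= S~_n
    h = Th - Th[0]                                          # relative value function, h(0) = 0
    if S_n - s_n < eps:
        break

S_reorder = nodes[np.argmin(G)]            # re-order point of the h~_n-greedy base-stock policy
print(n, S_reorder, s_n, S_n)              # compare with S* = 21.97225 and rho* = 50.98612
```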


[Figure 1: Functions h̃n, n = 1, 2, 3, 5, 7, with N = 10^3 and θ = 25.]

7 Appendix

Proof of Lemma 5.1. Assumption 2.3 can be equivalently rewritten as

|∫_X v(y)Q(dy|x, a) − ∫_X v(y)Q(dy|x′, a′)| ≤ 2α||v||∞

for all (x, a), (x′, a′) ∈ K and v ∈ Mb(X). Now, taking v = Lu with u ∈ Mb(X), the above inequality yields

|∫_X Lu(y)Q(dy|x, a) − ∫_X Lu(y)Q(dy|x′, a′)| ≤ 2α||Lu||∞.

From the definition of Q̂(·|·, ·), (22) and the non-expansiveness of L, it follows that

|∫_X u(y)Q̂(dy|x, a) − ∫_X u(y)Q̂(dy|x′, a′)| ≤ 2α||u||∞,

which leads to

||Q̂(·|x, a) − Q̂(·|x′, a′)||TV ≤ 2α  ∀(x, a), (x′, a′) ∈ K.

To prove the second inequality, fix policies f, g ∈ F and B ∈ B(X). Assumption 2.3 implies that

−α + Qg(B|y) ≤ Qf(B|x) ≤ Qg(B|y) + α  ∀x, y ∈ X.

[Figure 2: Sequences S̃n and s̃n with N = 10^3 and θ = 120.]

Fixing y ∈ X, properties (a)-(c) in Definition 4.1 imply

−α + Qg(B|y) ≤ LQf(B|x) ≤ Qg(B|y) + α  ∀x ∈ X.

Now fixing x ∈ X, the latter inequality implies

−α + LQg(B|y) ≤ LQf(B|x) ≤ LQg(B|y) + α  ∀y ∈ X.

Then, from the definition of Q̃f, Q̃g and (22), it follows that

|Q̃f(B|x) − Q̃g(B|y)| ≤ α  ∀x, y ∈ X, B ∈ B(X).

Hence, from (13),

||Q̃f(·|x) − Q̃g(·|y)||TV ≤ 2α  ∀x, y ∈ X, f, g ∈ F. ∎

Figure 3: Functions e hn with N = 103 and θ = 120.

−20

0

20

40

n=20 n=9 n=5 n=3 n=1

0

20

40

60

80

100

120

all conclusion in Theorem 3.2 hold replacing ρ∗ , h∗ , sn , Sn , hn and J(f ) with b ), respectively, where ρb, b h, sbn , Sbn , b hn and J(f sbn := inf ρbn x∈X

and Sbn := sup ρbn , n ∈ N. x∈X

f, the contractiveness of operator Tez on C0 (X) and the Regarding to model M existence of a canonical triplet (e ρ, e h, fe) with e h in C0 (X) and ρe = Tee h(z) follows the same arguments given in [14, Lemma 3.5, p. 59] for the proof of Remark 3.1, while the convergence of the corresponding value iteration algorithm for f follows the arguments given in [14, Thm. 4.8, p. 64] for Theorem 3.2 model M fn , e e ), respectively, replacing ρ∗ , h∗ , sn , Sn , hn and J(f ) with ρe, e h, sen , S hn and J(f where sen := inf ρen (x) and Sen := sup ρen (x), n ∈ N. x∈X

x∈X

Part (c) follows directly since the AVI algorithms (6) and (7) coincide with c and M f, respectively (see the standard value iteration algorithm for model M Rermark 4.6). Thus, it only remains to prove part (b). To do this, first note that ρb + b h= b b b b b T h = T Lh; then, ρb + Lh = LT (Lh), which in turn implies that function v = Lb h satisfies the equation ρb + v = Tev. 25

Hence, ρb = ρe and functions v = Lb h and e h differ by a constant. On the other hand, the pair (e ρ, e h) satisfies the equalities ρe + e h = Tee h = LT e h. e e e Now, observe that ρe+T h = T L(T h), which yields that function u = T h satisfies the equation ρe + u = Tbu, which implies that function u = T e h and b h differ by a constant because ρe = ρb.  Lemma 7.1 Let f ∈ F be an arbitrary stationary policy and K the bound of C(·, ·). Then: b ) = J(f e ); (a) σf := J(f (b) |J(f ) − σf | ≤

K 1−α

b f (·|x)||T V ; supx∈X ||Qf (·|x) − Q

e f (·|x)||T V . ef ||∞ + K supx∈X ||Qf (·|x) − Q (c) |J(f ) − σf | ≤ ||Cf − C 1−α Proof of Lemma 7.1. To prove part (a), let f ∈ F be a fixed policy and define the operators Tbf := Tf L and Tef := LTf . From Lemma 5.1, it follows that operators Tbz,f v := Tbf v − Tbf v(z), Tez,f v := Tef v − Tef v(z) are contractions operators from Mb0 (X) into itself with contraction modulus α. Hence, there exist functions b hf and e hf in Mb0 (X) that satisfies the Poisson equations bf b ρbf + b hf = Cf + Q hf ef + Q ef e ρef + e hf = C hf ,

(24)

b ) = ρbf and where ρbf := Tbf b hf (z) and ρef := Tef e hf (z). Clearly, it holds that J(f e J(f ) = ρef . Equation (24) implies that bf b ρbf + Lb hf = LCf + LQ hf = LCf + LQ(Lb hf ) ef + Q(L e b =C hf ). Hence, ρbf = ρef , which proves part (a). Additionally, note that functions Lb hf b and hf differ only by a constant. To prove part (b), note that property (18) and Lemma 5.1 imply b )| = |µf (Cf ) − µ |J(f ) − J(f bf (Cf )| ≤ K||µf − µ bf ||T V . b f (·|·), θ = supx∈X ||Qf (·|x)− Q e f (·|x)||T V Now taking S(·|·) = Qf (·|·), R(·|·) = Q 1 and γ = 2 supx,y∈X ||Qf (·|x) − Qf (·|y)||T V , Remark 2.5 yields b )| ≤ |J(f ) − J(f

K e f (·|x)||T V , sup ||Qf (·|x) − Q 1 − γ x∈X 26

which implies the first inequality because γ ≤ α. The inequality (b) follows in a similar manner. In fact, note that e )| ≤ |µf (Cf ) − µf (C ef )| + |µf (C ef ) − µ ef )| |J(f ) − J(f ef (C ef ||∞ + ||µf − µ ≤ ||Cf − C ef ||T V . e f (·|·) to obtain the desire Now use Remark 2.5 again but now taking R(·|·) = Q result.  Proof of Theorem 5.3. Parts (a) and (c) follow directly from Lemma 7.1 after noting that b )| ≤ sup |J(f ) − J(f b )|, |ρ∗ − σ ∗ | = | inf J(f ) − inf J(f f ∈b F0



f ∈b F0

f ∈b F0



e )|. e )| ≤ sup |J(f ) − J(f |ρ − σ | = | inf J(f ) − inf J(f f ∈e F0

f ∈e F0

f ∈e F0

To prove the remaining parts recall, as it was discussed in the proof of Lemma c and M f. 5.2, that all the conclusions in Theorem 3.2 remain valid for models M Thus, since f ∈ F is Lb hn -greedy policy, T Lb hn = Tf Lb hn , c. Then, from Theorem 3.2(b), which means that f is b hn -greedy in the model M b ) − ρb ≤ ||b 0 ≤ J(f ρn ||sp .

(25)

On the other hand, b )| + |J(f b ) − ρb| + |b 0 ≤ J(f ) − ρ∗ ≤ |J(f ) − J(f ρ − ρ∗ |. Thus, Theorem 5.3(a), Lemma 7.1(b) and inequality (25) imply 0 ≤ J(f ) − ρ∗ ≤ ||b ρn ||sp +

2K δb (Q), 1 − α F0

which proves (b). The proof of (c) is analogous. First note that e e − ρe| + |e 0 ≤ J(g) − ρ∗ ≤ |J(g) − J(g)|| + |J(g) ρ − ρ∗ |. Now, since g ∈ F is e hn -greedy, then Te hn = Tf e hn , f. Thus, from Theorem 3.2(b), which means that g is e hn -greedy in model M e − ρe ≤ ||e 0 ≤ J(g) ρn ||. Thus, the desire result follows from (26), part (c) and Lemma 7.1(c).  27

(26)

References [1] J. Abounadi, D. Bertsekas and V. S. Borkar (2001), Learning algorithms for Markov decision processes with average cost, SIAM Journal on Control and Optimization 40, 681-698. [2] A. Almudevar (2008), Approximate fixed point iteration with an application to infinite horizon Markov decision processes, SIAM Journal on Control and Optimization 46, pp. 541-561. [3] A. Araposthatis, V. S. Borkar, E. Fern´andez-Guacherand, M. K. Gosh, S. I. Marcus (1993), Discrete-time controlled Markov processes with average cost criterion: a survey, SIAM Journal on Control and Optimization 31, 282-344. [4] L. Beutel, H. Gonska, and D. Kacs´o (2002), On variation-diminishing Shoenberg operators: new quantitative statements, in Multivariate Approximation and Interpolations with Applications, Ed. M. Gasca, Monograf´ıas de la Academia de Ciencias de Zaragoza 20, 9-58. [5] D.P. Bertsekas (1987), Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, N. J. [6] D.P. Bertsekas and J. N. Tsitsiklis (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA. [7] D.P. Bertsekas, Approximate policy iteration: a survey and some new methods, Journal of Control Theory and Applications 9 (2011), 310-335. [8] C. C. Y. Dorea and A. G. C. Pereira (2006), A note on a variations of Doeblin’s condition for uniform ergodicity of Markov chains, Acta Mathematica Hungarica 110, pp. 287-292. [9] D. P. de Farias and B. van Roy (2000), On the existence of fixed points for approximate value iteration and temporal difference learning, Journal of Optimization Theory and Applications 105, pp. 589-608. [10] D. P. de Farias and B. van Roy (2002), Approximate linear programming for average-cots dynamic programming, Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, M. A. [11] D. P. de Farias and B. Van Roy (2006), A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees, Mathematics of Operations Research 31, 597-620. [12] G. J. Gordon (1995), Stable function approximation in dynamic programming, Proceedings of the 12th International Conference on Machine Learning, pp. 261-268.

28

[13] A. Gosavi (2004), A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis, Machine Learning 55, 5-29. [14] O. Hern´ andez-Lerma (1989), Adaptive Markov Control Processes, SpringerVerlag, N. Y. [15] O. Hern´ andez-Lerma and J. B. Lasserre (1996), Discrete-Time Markov Control Processes. Basic Optimality Criteria, Springer-Verlag, N. Y. [16] O. Hern´ andez-Lerma and J. B. Lasserre (1999), Further Topics on DiscreteTime Markov Control Processes, Springer-Verlag, N. Y. [17] O. Hern´ andez-Lerma and J. B. Lasserre (2003), Markov Chains and Invariant Probabilities, Birkhauser Verlag, Basel, Switzerland. [18] O. Hern´ andez-Lerma, R. Montes-de-Oca, R. Cavazos-Cadena (1991), Recurrence conditions for Markov decision processes with Borel spaces: a survey, Annals of Operations Research 29, 29-46. [19] O. Hern´ andez-Lerma, O. Vega-Amaya (1998), Infinite-horizon Markov control processes with undiscounted cost criteria: From average to overtaking optimality, Applicationes Mathematicae 25, 153-178. [20] O. Hern´ andez-Lerma, O. Vega-Amaya, G. Carrasco (1999), Sample-path optimality and variance-minimization of average cost Markov control processes, SIAM Journal on Control and Optimization 38, 79-93. [21] A. Jaskiewicz and A. S. Nowak (2006), On the optimality equation for average cost Markov control processes with Feller transitions probabilities, Journal of Mathematical Analysis and Applications 316 (2), 495-509. [22] J. M. Lee and J. H. Lee (2004), Approximate dynamic programming strategies and their applicability for process control: a review and future directions, International Journal of Control, Automation, and Systems 2(3), 263-278. [23] S. P. Meyn and R. L. Tweedie (1993), Markov Chain and Stochastic Stability, Springer-Verlag, London. [24] R. Munos (2007), Performance bounds in Lp -norm for approximate value iteration, SIAM Journal on Control and Optimization 47, 2303-2347. [25] A. S. Nowak (1998), A generalization of Ueno’s inequality for n-step transition probabilities, Applicationes Mathematicae 25, 295-299. [26] M. L. Puterman (1994), Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, N. Y. [27] W. P. Powell (2007), Approximate Dynamic Programming. Solving the Curse of Dimensionality, John Wiley & Sons, Inc. 29

[28] W. P. Powell, What you should know about approximate dynamic programming, Naval Research Logistics 56 (2009), 239-249. [29] W. P. Powell, Perspectives of approximate dynamic programming, Annals of Operations Reseach, 2012, DOI 10.1007/s10479-012-1077-6. [30] W. P. Powell and J. Ma (2011), A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications, Journal of Control Theory and Applications 9 (2011), 336-352. [31] J. Rust (1996), Numerical dynamic programming in economics, in Handbook of Computational Economics, Vol. 1, H Amman, D. Kendrick and J Rust (eds.), 619-728. [32] M. S. Santos (1998), Analysis of a numerical dynamic programming algorithm applied to economic models, Econometrica 66, 409-426. [33] M. S. Santos and J. Rust (2004), Convergence properties of policy iteration, SIAM Journal on Control and Optimization 42, 2094-2115. [34] J. Stachurski (2008), Continuous state dynamic programming via nonexpansive approximation, Computational Economics 31, 141-160. [35] J. Stachurski (2009), Dynamic Economic: Theory and Computation, The MIT Press, Cambridge, MA. [36] R. S. Sutton and A. G. Barto (1998), Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA. [37] B. Van Roy (2006), Performance loss bounds for approximate value iteration with state aggregation, Mathematics of Operation Reseach 31, 234-244. [38] O. Vega-Amaya (2003), The average cost optimality equation: a fixed-point approach, Bolet´ın de la Sociedad Matem´atica Mexicana 9, 185-195. [39] O. Vega-Amaya (2003), Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation, SIAM Journal on Control and Optimization 42, 1876-1894. [40] O. Vega-Amaya and J. L´opez-Borb´on (2012), Performance bounds for discounted approximate value iteration using averagers, Reporte de Investigaci´ on 13, Departamento de Matem´aticas, Universidad de Sonora, M´exico. [41] O. Vega-Amaya and R. Montes-de-Oca (1998), Application of average dynamic programming to inventory systems, Mathematical Methods of Operations Research 47, pp. 451-471.

30