Approximate Policy Iteration: A Survey and Some New Methods

April 2010 - Revised December 2010 and June 2011

Report LIDS - 2833

A version appears in Journal of Control Theory and Applications, 2011

Dimitri P. Bertsekas †

Abstract

We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms, and aims to unify the available methods in the light of recent research developments, and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods such as LSTD, and iterative methods such as LSPE and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods, and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.

† The author is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139. His research was supported by NSF Grant ECCS-0801549, by the LANL Information Science and Technology Institute, and by the Air Force Grant FA9550-10-1-0412. Many thanks are due to Huizhen (Janey) Yu for extensive helpful discussions and suggestions.

1. INTRODUCTION

In this paper we aim to survey and place in broad context a number of issues relating to approximate policy iteration methods for finite-state, discrete-time, stochastic dynamic programming (DP) problems.

These methods are one of the major approaches for approximate DP, a field that has attracted substantial research interest, and has a wide range of applications, because of its potential to address large and complex problems that may not be treatable in other ways. Among recent works in the extensive related literature, we mention textbooks and research monographs: Bertsekas and Tsitsiklis [BeT96], Sutton and Barto [SuB98], Gosavi [Gos03], Cao [Cao07], Chang, Fu, Hu, and Marcus [CFH07], Meyn [Mey07], Powell [Pow07], Borkar [Bor08], Busoniu, Babuska, De Schutter, and Ernst [BBD10], and the author’s text in preparation [Ber10a]; edited volumes and special issues: White and Sofge [WhS92], Si, Barto, Powell, and Wunsch [SBP04], Lewis, Lendaris, and Liu [LLL08], and the 2007-2009 Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning; and surveys: Barto, Bradtke, and Singh [BBS95], Borkar [Bor09], Lewis and Vrabie [LeV09], and Szepesvari [Sze09].

For an overview of policy iteration methods, let us focus on the $\alpha$-discounted $n$-state Markovian Decision Problem (MDP) with states $1, \ldots, n$, controls $u \in U(i)$ at state $i$, transition probabilities $p_{ij}(u)$, and cost $g(i, u, j)$ for transition from $i$ to $j$ under control $u$. A (stationary) policy $\mu$ is a function from states $i$ to admissible controls $u \in U(i)$, and $J_\mu(i)$ is the cost starting from state $i$ and using policy $\mu$. It is well-known (see e.g., Bertsekas [Ber07] or Puterman [Put94]) that the costs $J_\mu(i)$, $i = 1, \ldots, n$, are the unique solution of Bellman’s equation

$$J_\mu(i) = \sum_{j=1}^{n} p_{ij}\bigl(\mu(i)\bigr) \Bigl( g\bigl(i, \mu(i), j\bigr) + \alpha J_\mu(j) \Bigr), \qquad i = 1, \ldots, n.$$
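As an illustration of Bellman's equation for a fixed policy, the following is a minimal sketch that computes the policy's costs by successive approximation, i.e., by repeatedly applying the right-hand side of the equation until the iterates stop changing. The function name `evaluate_policy`, the tolerance, and the two-state numbers are assumptions made for this example and do not come from the paper; `p[i][j]` stands for the transition probability from i to j under the control chosen by the policy at i, and `g[i][j]` for the corresponding transition cost.

```python
# Successive approximation for policy evaluation on a small discounted MDP.
# All names and numbers here are illustrative assumptions.

def evaluate_policy(p, g, alpha, tol=1e-12, max_iter=100_000):
    """Approximate the cost vector of a fixed policy.

    p[i][j]: probability of moving from state i to j under the policy
    g[i][j]: cost of the transition from i to j under the policy
    alpha:   discount factor, assumed to lie in (0, 1)
    """
    n = len(p)
    J = [0.0] * n
    for _ in range(max_iter):
        # Apply the right-hand side of Bellman's equation to the current J.
        J_new = [
            sum(p[i][j] * (g[i][j] + alpha * J[j]) for j in range(n))
            for i in range(n)
        ]
        # Stop when successive iterates agree to within tol.
        if max(abs(a - b) for a, b in zip(J_new, J)) < tol:
            return J_new
        J = J_new
    return J


# Hypothetical two-state problem under some fixed policy.
p = [[0.9, 0.1], [0.2, 0.8]]
g = [[1.0, 2.0], [0.0, 3.0]]
alpha = 0.9

J_mu = evaluate_policy(p, g, alpha)

# Largest violation of Bellman's equation by the computed vector.
residual = max(
    abs(J_mu[i] - sum(p[i][j] * (g[i][j] + alpha * J_mu[j]) for j in range(2)))
    for i in range(2)
)
print(J_mu, residual)
```

Because the mapping on the right-hand side is a contraction with modulus alpha, the iteration converges from any starting vector. For small n the same costs can also be obtained by solving the n linear equations directly.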

Equivalently, the vector Jµ ∈