Pathologies of Approximate Policy Iteration in Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Summary

We consider policy iteration with cost function approximation. It is used widely but exhibits very complex behavior and a variety of potential pathologies, as in the case of the tetris test problem.

Two types of pathologies:
- Deterministic: due to cost function approximation
- Stochastic: due to simulation errors/noise
We survey the pathologies in:
- Policy evaluation: due to errors in approximate evaluation of policies
- Policy improvement: due to the policy improvement mechanism
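To make the two pathology sources concrete, here is a minimal sketch of approximate policy iteration on a hypothetical 2-state, 2-action discounted MDP. The transition probabilities `P`, costs `g`, discount factor `alpha`, and the additive-noise model of evaluation error are illustrative assumptions, not data from the talk:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative data, not from the talk).
alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # P[u][i, j] = prob(j | i, u)
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([2.0, 1.0]),                  # g[u][i] = one-stage cost
     1: np.array([0.5, 3.0])}

def evaluate(mu, noise=0.0, rng=None):
    """(Approximately) evaluate policy mu: solve J = g_mu + alpha * P_mu J,
    then optionally perturb the result to model evaluation/simulation error."""
    P_mu = np.array([P[mu[i]][i] for i in range(2)])
    g_mu = np.array([g[mu[i]][i] for i in range(2)])
    J = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
    if noise and rng is not None:
        J = J + noise * rng.standard_normal(2)   # evaluation-error pathology
    return J

def improve(J):
    """Policy improvement: mu(i) = argmin_u [g(i,u) + alpha * sum_j p_ij(u) J(j)]."""
    return [min((0, 1), key=lambda u: g[u][i] + alpha * P[u][i] @ J)
            for i in range(2)]

mu = [0, 0]
for _ in range(10):
    mu = improve(evaluate(mu))        # exact evaluation: converges
print("policy:", mu, "cost:", evaluate(mu))
```

With exact evaluation the iteration converges in a few steps; passing `noise > 0` to `evaluate` injects the kind of policy-evaluation error that the survey attributes the stochastic pathologies to.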
Special focus: policy oscillations and local attractors. Causes of the problem in TD/projected equation methods:
- The projection operator may not be monotone
- The projection norm may depend on the policy evaluated
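The non-monotonicity point can be checked numerically. The following sketch (my own illustrative example, not from the talk) projects onto the span of a feature vector with mixed signs and shows that J1 ≤ J2 componentwise does not imply ΠJ1 ≤ ΠJ2:

```python
import numpy as np

# Illustrative check that Euclidean projection onto a subspace need not be
# monotone. Subspace: span of the feature vector phi = (1, -1) in R^2.
phi = np.array([1.0, -1.0])
Pi = np.outer(phi, phi) / (phi @ phi)   # projection matrix onto span{phi}

J1 = np.array([0.0, 0.0])
J2 = np.array([1.0, 0.0])               # J1 <= J2 componentwise

PJ1, PJ2 = Pi @ J1, Pi @ J2
print(PJ1, PJ2)   # second component of PJ2 drops below that of PJ1
```

Since the projected Bellman operator composes Π with Tµ, losing monotonicity of Π is what opens the door to the oscillation phenomena discussed here.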
We discuss methods that address the difficulty
References:
- D. P. Bertsekas, “Pathologies of Temporal Difference Methods in Approximate Dynamic Programming,” Proc. 2010 IEEE Conference on Decision and Control, Atlanta, GA.
- D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 2007, supplementary chapter on approximate DP; on-line, a “living chapter.”
MDP: Brief Review

J∗(i) = optimal cost starting from state i
Jµ(i) = cost starting from state i using policy µ
Denote by T and Tµ the DP mappings that transform a cost function J into TJ and TµJ, respectively.
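As a quick reminder of how T and Tµ act, here is a sketch on a small hypothetical MDP (the 2-state transition and cost data are assumptions for illustration). Repeated application of T is value iteration, which converges to J∗ because T is a contraction under discounting:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative data, not from the talk).
alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # P[u][i, j] = prob(j | i, u)
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([2.0, 1.0]), 1: np.array([0.5, 3.0])}

def T(J):
    """(T J)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]"""
    return np.min([g[u] + alpha * P[u] @ J for u in (0, 1)], axis=0)

def T_mu(J, mu):
    """(T_mu J)(i) = g(i, mu(i)) + alpha * sum_j p_ij(mu(i)) J(j)"""
    return np.array([g[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(2)])

# Value iteration: iterate T from J = 0 until (approximate) convergence to J*.
J = np.zeros(2)
for _ in range(500):
    J = T(J)
print("J* approx:", J)
```

At the fixed point, J = TJ, and for the policy µ attaining the minimum, TµJ = TJ as well.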