University of London
Imperial College of Science, Technology and Medicine
Department of Computing

Learning to Act Stochastically

Luke Dickens

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, November 2009


Abstract

This thesis examines reinforcement learning for stochastic control processes with single and multiple agents, where either the learning outcomes are stochastic policies or learning is perpetual and within the domain of stochastic policies. In this context, a policy is a strategy for processing environmental outputs (called observations) and subsequently generating a response or input-signal to the environment (called actions). A stochastic policy gives a probability distribution over actions for each observed situation, and the thesis concentrates on finite sets of observations and actions. There is an exclusive focus on stochastic policies for two principal reasons: such policies have been relatively neglected in the existing literature, and they have been recognised to be especially important in the field of multi-agent reinforcement learning. For the latter reason, the thesis concerns itself primarily with solutions best suited to multi-agent domains. This restriction proves essential, since the topic is otherwise too broad to be covered in depth without losing clarity and focus.

The thesis is partitioned into three parts, with a chapter of contextual information preceding the first part. Part 1 focuses on analytic and formal mathematical approaches for constructing models and, where tractable, identifying long-term learning outcomes and predicting mid-term learning behaviour on particular models. This includes the formulation of the finite analytic stochastic process (FASP) framework – which brings together a large family of MDP-type processes under one unified description – and the identification of the learning pressure field – a generalisation of the gradient of utility for multiple non-cooperative agents in stochastic games. Part 2 examines the more scalable reinforcement learning techniques: it develops the theory of estimation and optimisation with reinforcement learning, from basic to more recent developments, and presents novel adaptations and algorithms based on reinforcement learning theory. New methods include the Modified Temporal Difference (MTD(λ)) and Pseudo Monte Carlo (PMC) estimation techniques, and the triple-table stochastic optimisation method. All estimation and optimisation techniques covered are then evaluated by applying them to models from Part 1, and experimental results are compared to the analytic solutions. Part 3 summarises the achievements of the thesis and proposes potential extensions which could merge this work with other recent advances. It concludes with a discussion of the wider implications of these developments.

Acknowledgements

I would like to thank my two supervisors Krysia Broda and Alessandra Russo, who supported my ideas throughout my PhD, even when I failed to communicate them very clearly. I would also like to thank my partner Mel, for everything and without whom this would not have been possible. Finally, I would like to thank my family and friends, who against all logic have stuck by me.

Contents

1 Introduction
  1.1 Introduction
  1.2 Domain of Enquiry
  1.3 Stochastic Control Processes
  1.4 Ranking Policies
  1.5 Expected Return
  1.6 Value Functions
  1.7 Stochastic Approximation
  1.8 Contributions
  1.9 Organisation

I Formal Techniques

2 FASPs
  2.1 Stochastic Mappings
  2.2 Markov Decision Processes
  2.3 Modelling POMDPs
  2.4 FASPs
  2.5 Synchronous Actions with Multiple Agents
  2.6 Turn-taking with Multiple Agents

3 Analytics
  3.1 Limit Behaviour for Fixed Policies
  3.2 Predicting Expected Return
  3.3 Learning Pressure Fields
  3.4 Deriving the Value Function
  3.5 Summary

4 Examples
  4.1 Triflop
  4.2 Pentacycle
  4.3 Bowling and Veloso's 2-Step Game
  4.4 Zinkevich's NoSDE Game
  4.5 Two Agent Pentacycle
  4.6 FASP Soccer Examples

II Empirical Approaches

5 Estimation
  5.1 Foundations of RL
  5.2 Monte-Carlo RL
  5.3 TD(0)
  5.4 TD(λ)
  5.5 Pseudo Monte-Carlo
  5.6 Local-Gain Value Function Estimation
  5.7 Summary

6 Optimisation
  6.1 Coupled Methods
  6.2 Actor-Critic - TD(0)
  6.3 Policy Gradient
  6.4 Function Approximation
  6.5 Observations as Features
  6.6 Stochastic Actor-Critic
  6.7 Actor-Critic with AOD Estimates
  6.8 Three Table Actor-Critic Algorithms
  6.9 Summary

7 Experiments
  7.1 Value Table Estimation
  7.2 Optimisation on the triflop
  7.3 Optimisation on the Pentacycle
  7.4 Evidence of the Learning Pressure Field
  7.5 Multi-Agent Reinforcement Learning
  7.6 Summary

III Evaluation

8 Discussion
  8.1 Summary
  8.2 Future Experiments
  8.3 Theoretical Avenues
  8.4 Concluding Remarks

List of Tables

4.1 Reward functions for the triflop.
4.2 Expected returns for the triflop.
4.3 The Triflop's AGD value-functions.
4.4 The Triflop's local-gain value-functions.
4.5 Reward Schemes for the Pentacycle.
4.6 Reward Schemes for Bowling and Veloso's 2-step game.
4.7 Reward Schemes for Zinkevich et al.'s NoSDE game.

List of Figures

2.1 Graph of Simple POMDP or FASP.
2.2 Visualisation of Causal Pathway for AMAFASP.

4.1 Graph of triflop
4.2 S-AOs for triflop
4.3 Expected returns for the Triflop.
4.4 Graph of Pentacycle
4.5 S-AO graphs for Pentacycle
4.6 J1 for Pentacycle
4.7 Γ1 for Pentacycle
4.8 J2 for Pentacycle
4.9 Γ2 for Pentacycle
4.10 J3 for Pentacycle
4.11 Γ3 for Pentacycle
4.12 J4 for Pentacycle
4.13 Γ4 for Pentacycle
4.14 J5 for Pentacycle
4.15 Γ5 for Pentacycle
4.16 J6 for Pentacycle
4.17 Γ6 for Pentacycle
4.18 J7 for Pentacycle
4.19 Γ7 for Pentacycle
4.20 J8 for Pentacycle
4.21 Γ8 for Pentacycle
4.22 J9 for Pentacycle
4.23 Γ9 for Pentacycle
4.24 Graph of Bowling and Veloso's 2-step game
4.25 J1 for 2-step game
4.26 J2 for 2-step game
4.27 Learning Pressure for Bowling and Veloso's 2-step game
4.28 Learning Pressure for Inverted 2-step game
4.29 Zinkevich, Greenwald and Littman's NoSDE game
4.30 Learning Pressure for NoSDE game
4.31 Graph of Dual-Pentacycle
4.32 Γ2,3 for Dual-Pentacycle
4.33 Γ4,5 for Dual-Pentacycle
4.34 Γ6,7 for Dual-Pentacycle
4.35 Γ8,9 for Dual-Pentacycle
4.36 Littman's MDP adversarial soccer game
4.37 Action resolution in MDP soccer

7.1 Q Estimates on Triflop r7, using PMC critic
7.2 Uncorrected Q Estimates on Triflop r7, using TD(λ) critic
7.3 Corrected Q Estimates on Triflop r7, using TD(0.9) critic
7.4 Comparing critic algorithms on Triflop r1, r2 and r3
7.5 Comparing critic algorithms on Triflop r4, r5 and r6
7.6 Comparing critic algorithms on Triflop r7, r8 and r9
7.7 The effect of θ on rate of convergence
7.8 Triflop J1
7.9 Triflop J5
7.10 Triflop J7
7.11 Optimising on Triflop J5 from unbiased θ
7.12 Optimising on Triflop J1 from unbiased θ
7.13 Optimising on Triflop J7 from unbiased θ
7.14 Optimising on Triflop J5 from random θ
7.15 Optimising on Triflop J1 from random θ
7.16 Optimising on Triflop J7 from random θ
7.17 Optimising on Triflop J5 from random θ with primed Q̂
7.18 Optimising on Triflop J1 from random θ with primed Q̂
7.19 Optimising on Triflop J7 from random θ with primed Q̂
7.20 Optimising on Triflop J7 with QP-PMC
7.21 Optimising on Triflop J7 using QP-TD(λ)
7.22 Optimising on Triflop J7 using QP-TDγ(λ)
7.23 Analytic Results for Pentacycle r1
7.24 Comparing choice of γ on Pentacycle J1 using QP-PMC
7.25 Comparing actor-critic choice on Triflop J7
7.26 Comparing actor-critic choice on Pentacycle J1
7.27 Individual optimisation runs for QVP-TD(0.9)
7.28 Optimising on Pentacycle J1 using QP-PMC
7.29 Optimising on Pentacycle J2 using QP-PMC
7.30 Optimising on Pentacycle J3 using QP-PMC
7.31 Optimising on Pentacycle J4 using QP-PMC
7.32 Optimising on Pentacycle J5 using QP-PMC
7.33 Optimising on Pentacycle J6 using QP-PMC
7.34 Optimising on Pentacycle J7 using QP-PMC
7.35 Optimising on Pentacycle J8 using QP-PMC
7.36 Optimising on Pentacycle J9 using QP-PMC
7.37 Average Rewards for gv in the 2-step games.
7.38 Evidence of the LPF for Bowling 2-step game.
7.39 Competing different actor-critic algorithms on Bowling's 2-step game.
7.40 Evidence of the LPF for Inverted 2-step game.
7.41 Competing different actor-critic algorithms on Inverted 2-step game.

Chapter 1
Introduction

1.1 Introduction

This thesis examines reinforcement learning (RL) for Stochastic Control Processes (SCPs) with single and multiple agents, where either the learning outcomes are stochastic policies or the learning is perpetual and within the domain of stochastic policies. In this context, an SCP is a potentially non-deterministic system or environment which accepts input control-signals (called actions) and generates outputs (called observations), where the observations depend in some way on the actions – we discuss these in more detail in Section 1.3. A policy is a strategy for processing observations and subsequently generating a response action. A stochastic policy represents a probability distribution over a set of actions for each observed situation, and for purposes of clarity the thesis concentrates primarily on reactive policies with finite sets of observations and actions; where relevant, potential generalisations to continuous sets and memory-based policies are indicated.

It has long been recognised that in certain single-agent SCPs stochastic policies perform arbitrarily better than functional alternatives [34]; it is also known that many multi-agent problems demand stochastic solutions [14, 53], but the majority of the RL literature is interested in problems with functional solutions (see [9, 65] as examples). In fact, even though many RL approaches require stochastic behaviour while learning, this behaviour appears to be considered simply an exploratory precursor to the inevitable functional learning outcomes [9, 65]. Perhaps this is why many techniques barely distinguish between stochastic policies that are simply linear combinations of two (or more) functional policies, such as ε-greedy and soft-max exploratory policies [9, 65]. This is not entirely without cause; functional policies have been shown sufficient for optimality for processes that satisfy the Markov assumption (see Section 2.2 for a definition), and for more general finite-horizon problems where agents have sufficient memory of past events [47]. Much work in single-agent (SA)RL has therefore concentrated on model-based or memory-based agents, where the participating agents are assumed to have some internal storage capacity which compensates for incomplete system information, obviating the need for anything but functional policies.
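To fix ideas, a reactive stochastic policy over finite observation and action sets can be stored as one probability distribution per observation and sampled at every step. The fragment below is purely illustrative – the observation and action labels are invented for the example and do not come from any model in the thesis:

```python
import random

# A reactive stochastic policy over finite sets: one distribution per observation.
policy = {
    "obs_A": {"left": 0.7, "right": 0.3},
    "obs_B": {"left": 0.2, "right": 0.8},
}

def act(observation):
    """Sample an action from the distribution attached to the current observation."""
    actions, probs = zip(*policy[observation].items())
    return random.choices(actions, weights=probs, k=1)[0]

# A functional (deterministic) policy is the special case in which each
# distribution places probability 1 on a single action.
print(act("obs_A"))
```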

In this thesis, we show that there is often a complex preference ranking over policies even for small problems, and that functional policies are regularly outperformed by stochastic alternatives, especially so in multi-agent domains. In particular, Chapter 4 examines a number of single- and multi-agent examples analytically, and in Chapter 7 we confirm our analytic findings with empirical results. Moreover, recently developed actor-critic and other policy-gradient techniques decouple policy improvement from the preference values they are seeking to optimise [1, 3, 41, 49, 63], and we see in Chapter 6 how this theoretically allows them to freely explore the space of stochastic policies. However, the examples presented in these papers are almost exclusively those where functional policies are sufficient for optimality, even though there is some evidence to suggest that this is the exception rather than the rule, even for single-agent problems with limited ambiguity in state information [61]. This suggests that the stochastic policy is regarded as an intermediate or exploratory phase in learning – a second-class policy type. Even the very recent natural policy-gradient research papers [10, 40, 57], which try in some sense to optimise the exploratory phase, illustrate their new methods by learning functional policies. And, as we shall see in Chapter 6, existing natural gradient algorithms use the TD-error, which suffers from a compound bias under certain non-Markov conditions. This bias is less important in problems where functional policies are sought, but may become more important where accurate stochastic policies are sought (again see Chapter 6).

To summarise, the thesis focuses exclusively on stochastic policies for two principal reasons: such policies have been relatively neglected in the existing literature (see [4], [35] and [61] for notable exceptions), and they have been recognised to be especially important for multi-agent interactions (e.g. see [14]). For this second reason, the thesis favours solutions best suited to multi-agent domains over those that are only sound for single-agent problems.

The thesis is partitioned into three parts, with this chapter of contextual information preceding the first part. Part I begins by constructing a modelling framework which allows us to define a variety of models, including single- and multi-agent problems with partial state knowledge – where multiple agents can observe and act in time with one another, take turns, or observe and act more freely out of time with one another. Each model supports one or many utility measures, allowing us to look at single agents with unique or multiple constraints and at multiple agents acting cooperatively, non-cooperatively or in a combination of these (see Chapter 3). We show how these models can be solved, where tractable, identifying long-term learning outcomes and predicting mid-term learning behaviour on particular models. This is a level of detail about the learning dynamics previously only available for strategic form games (game theory), see [26], and provides insight into the meta-level stochastic process of a learning system.

Part II examines reinforcement learning (RL) techniques, theoretically able to tackle much larger problems than can be solved analytically. We develop the theory of estimation and optimisation with reinforcement learning, from basic to more recent developments, giving in the process an introduction to a family of RL algorithms known as actor-critic techniques, and show how actor-critic algorithms are a good choice for efficiently and accurately learning stochastic policies in problems with partial observability. We then present both novel adaptations and entirely new algorithms for both the estimation and optimisation parts of the actor-critic approach. We go on to evaluate these techniques by applying them to the models from Part I, and compare experimental results with analytic solutions.

Part III summarises the thesis' achievements, proposes extensions to the reinforcement learning results which incorporate the most recent general developments in the field, and concludes with a discussion of potential applications and other future theoretical work.

1.2 Domain of Enquiry

We are principally concerned with constructing reinforcement learning (RL) algorithms which will be of the greatest use in learning stochastic policies in whatever domain these may occur, and this imposes a set of demands on our algorithms. In an abstract sense these demands can be broken down into five criteria, representing the expectations that users of the algorithms may have.

• Accurate. Principally, the user would expect the learning algorithms to be accurate in learning stochastic policies. That is, the algorithms should be able to establish a preference between two policies even when their performances differ only slightly.

• Robust. The algorithms should also be robust, in the sense that they should be resistant to falling into error states or to responding to exceptional situations by producing nonsensical results.

• Efficient. Efficiency and competitiveness are of great importance, particularly in large or multi-agent domains, as we shall see throughout the thesis. In this sense we expect efficient algorithms to make good use of available information.

• Scalable. In the second part of the thesis we are specifically looking at approximate techniques, and these are intended for use in domains where exact approaches are not tractable. For this reason, these algorithms should be scalable, and the larger the domain to which they can be applied the better.

• Exhaustive. Ideally algorithms will be exhaustive (complete), in the sense that they should find the single best solution (or set of solutions) where one exists.

It will not always be possible to maximise an algorithm's fitness to all of these requirements at the same time, and under those circumstances a trade-off will be needed. To this end, the above list is roughly ordered in terms of importance, and this will inform the approach taken in this thesis. Individual users of algorithms will of course have different ideas about the extent to which earlier demands dominate later ones, and different situations will resist the complete satisfaction of each condition to varying degrees.

It is therefore of some use to move away from the abstract and imagine potential application areas. It is impossible to ignore the fact that the area of multiple agents in knowledge-restricted, perpetual, non-cooperative engagement will potentially make the most restrictive demands on the properties of such algorithms (we formalise these concepts in Chapter 2, in particular with the SMAFASP and AMAFASP models). With this in mind, we briefly outline here how this domain contextualises the demands above, and in turn shapes the remainder of the thesis.

1.2.1 Accuracy

Two features of this domain will influence an algorithm's accuracy. First, the agents only have restricted state knowledge, and will consequently conflate different situations which potentially require different responses. This is one reason that stochastic policies will be needed, but it also means that the algorithm must balance the importance of different experiences. We see how this is achieved with the TD(λ) and PMC estimation techniques in Chapter 5, and how this is exploited in Chapter 6.

The second complicating factor is the changing nature of the system induced by different non-cooperative agents. These other agents are potentially learning or changing their policies to further their own ends, and in doing so they change the responses and outcomes of the system with respect to the original agent. For this reason, empirical knowledge of the system is potentially transient and may have to be discarded.

We differentiate our work from that of the existing actor-critic literature, such as Sutton et al. [63], Konda and Tsitsiklis [41] and Bhatnagar et al. [10], in that we are principally interested in algorithms for stochastic policies. While this literature rarely states an explicit focus on functional policies, the examples used to test its algorithms tend to have functional solutions, and in such work an algorithm's accuracy is measured by its ability to learn functional policies accurately. As we shall see throughout this thesis, algorithms that are accurate for functional policies are not necessarily accurate for stochastic policies.

1.2.2 Robustness

Again, the limited access to state information and the presence of other agents will influence the robustness of algorithms, as will the perpetual nature of the engagement. In particular, predetermined or fixed courses of action may induce endless loops or, perhaps more damagingly, could be exploited by other agents. A good algorithm should remain flexible and exploratory, but equally must be able to return to solutions which previously performed well.


In particular, robustness suggests to us the use of incremental algorithms. These are algorithms which continually improve knowledge and adapt appropriately in a step-wise manner. This contrasts with batch or episodic algorithms, which collect experience over a number of steps n (> 1) and only every n steps use this experience to update knowledge and adapt. Some of the best existing algorithms in terms of accuracy and efficiency for stochastic policies, such as those found in Peters and Schaal [57], are not (strictly) incremental, and hence our algorithms differ from theirs in this respect. (In one paper, Peters and Schaal [57] present two algorithms: one is designed for games which are episodic in nature, eNAC, and the other delays the optimisation step until the estimates converge.)
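The contrast between the two update schedules can be made concrete with a simple running estimate of a mean; the sketch below is ours and only illustrates the schedules, not any particular algorithm from the thesis:

```python
def incremental_estimate(samples, step=0.1):
    """Incremental: the estimate is nudged after every single sample."""
    estimate = 0.0
    for x in samples:
        estimate += step * (x - estimate)
    return estimate

def batch_estimate(samples, n=10, step=0.5):
    """Batch/episodic: experience is collected for n steps, then used all at once."""
    estimate, batch = 0.0, []
    for x in samples:
        batch.append(x)
        if len(batch) == n:
            estimate += step * (sum(batch) / n - estimate)
            batch = []
    return estimate

data = [1.0, 2.0, 3.0] * 20
print(incremental_estimate(data), batch_estimate(data))
```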

1.2.3 Efficiency

In some sense, the efficiency requirement is in conflict with the need for robustness. The efficient algorithm should make good use of all experience, but the robust algorithm must abandon useless, false or outdated information. In Chapter 3 we explore one interpretation of the ideal trade-off between these two competing demands, and introduce the learning pressure field, which predicts how agents' policies may change when multiple agents learn together. In Chapters 5 and 6 we return to this theme and explore how existing and novel algorithms might achieve this trade-off.

This trade-off may seem similar to the more familiar exploration-exploitation trade-off often cited in the RL literature [9, 65]. However, for some in the RL community exploration is associated with stochastic policies, and exploitation with the ultimate functional solution. In that context, the stochastic policy is little more than a way of comparing different functional policies until the best one is found; we wish to break that association. Finally, there is some discussion in Chapter 8 of how we can improve on this naive view of the ideal trade-off by considering natural-gradient learners, which explore uncertainty in a more sophisticated way. We go on to show how incremental algorithms which are accurate for stochastic policies in perpetual multi-agent domains can be made natural-gradient learners.

1.2.4 Scalability

It is essential in any multi-agent domain that learning approaches are scalable, but there are many degrees of scalability, and it is important to see where our approaches lie on the continuum from the very small to the very large. For this reason, we believe it makes more sense to talk about scalability as a relative term rather than an absolute one. As mentioned before, the thesis is broken into three parts, and each part concerns itself with a different degree of scalability. Part I is concerned with fully analytic approaches, which do not scale well above about 50 states. However, these are used principally to extract maximum information from a system in order that more scalable approaches can be examined, validated and benchmarked.


In Part II, we concentrate on algorithms which learn over finite observation and action sets. In this context, the best algorithms will be the most efficient: the most efficient algorithms will be able to learn large models more quickly, and hence scalability goes hand in glove with efficiency here. It may be that with hidden-state models the underlying state space continues to increase in size without affecting the observation or action spaces; the best algorithms will then need to manage the inherent uncertainty of each observation, since it will correspond to a large number of underlying system states. However, there is a limit to how useful these approaches will be, and as observation and action sets grow it may be computationally infeasible to consider each observation and action individually.

Later in Part II (Chapter 6), we investigate how traditional RL for functional policies copes with a move from discrete observations and actions to features. This is one way of aggregating a large amount of situational information into a small, dense package, often a vector of real numbers, and this promises a greater degree of scalability. There is a qualitative difference between discrete observations and feature vectors of real numbers, though; features introduce a linear relationship between two distinct situations which did not exist in the discrete domain, and this interdependence makes more rigorous demands on algorithms – requiring stricter guarantees. We show in Chapter 6 how this has been done for the established TD(λ) actor-critic algorithm and indicate how this might be done for our novel PMC actor-critic. It is worth mentioning that one complicating factor of features is that when we aggregate situations we also aggregate uncertainties. We therefore expect algorithms which manage highly uncertain observations well to have properties which will transfer well to the feature domain. This has been shown to be true for the natural actor-critic by Peters and Schaal [57], and is an underlying motivation of our work.

One other way that a problem or task can scale is in terms of the complexity of the underlying learning task. As Ring states in [59], "To design a learning algorithm is to make an assumption. The assumption is that there is structure in the learning task. If there is no structure, then there is no relationship between training data and testing data, there is nothing to be learned, and all learning fails." The structure Ring alludes to may have little or no correspondence to the size of the observation and action space: a problem with a small set of observations and actions may be difficult – and we try to identify such small-but-difficult tasks in Chapter 4 – whereas there exist many simple tasks in the realm of continuous multi-dimensional observations and action responses. To exemplify this difference, consider the following two tasks: learning to play limit poker on a fixed web-service interface, versus making a cup of tea in a stranger's kitchen. The difficulty of the former task is unrelated to how the information is presented: it could be that all the information needed to describe the task is given in a small and efficient logic program, while the observation and action sets in the latter task are much larger, consisting of visual, tactile and audible cues (among others). While (most) humans can easily solve the latter task – even when the stranger's kitchen is laid out very unfamiliarly – it is currently considered very difficult for a robot by the artificial intelligence community. On the other hand, some poker-playing agents have recently competed relatively successfully with professional human players [37] on limit poker.

It is therefore possible for scalability to refer to task difficulty, rather than to the size of the observation or action space. To measure this, it would be necessary to quantify task difficulty in some sense, and this is itself a challenging domain of enquiry (not to mention contentious and possibly inherently subjective). This thesis does not attempt to address this in much more detail, but makes one final point. As we shall see, the algorithms that perform well in the domain of stochastic policies also perform at least comparably in the domain of functional policies. However, algorithms designed for the functional domain can perform arbitrarily badly on some of our manufactured examples, as we shall see.
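The move from discrete observations to feature vectors mentioned above can be pictured as replacing an independent table entry per observation with a shared weight vector. The toy numbers below are invented, and the snippet is only meant to show the interdependence that features introduce:

```python
# Tabular representation: each discrete observation has its own value,
# so updating one entry cannot disturb any other.
table = {"obs_A": 0.0, "obs_B": 0.0}
table["obs_A"] += 0.1 * (1.0 - table["obs_A"])

# Feature representation: situations share a weight vector, so an update
# made for one situation also moves the value of nearby situations.
weights = [0.0, 0.0, 0.0]
phi_A = [1.0, 0.5, 0.0]          # feature vector for one situation
phi_B = [0.9, 0.4, 0.1]          # a similar situation with overlapping features

def value(phi):
    return sum(w * f for w, f in zip(weights, phi))

alpha, target = 0.1, 1.0
error = target - value(phi_A)
weights = [w + alpha * error * f for w, f in zip(weights, phi_A)]
print(table, value(phi_A), value(phi_B))   # value(phi_B) has changed too
```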

1.2.5 Exhaustiveness

The exhaustiveness of an algorithm is perhaps the requirement most at odds with efficiency and scalability. Many of the properties which allow algorithms to learn efficiently for large systems involve a high level of abstraction, and this can restrict the available solution policies to those that use this abstraction. However, there is another way to look at how exhaustive an algorithm is, and that is to restrict the context to the set of consistent policies and describe an exhaustive algorithm as one which fully explores these policies. In this sense, any gradient-ascent technique may well perform badly on a single learning run, since it climbs a single path to the solution. It may be possible to ameliorate this, perhaps by running multiple learning experiments and looking at the best, or aggregate, solutions. In many cases, the application area will profoundly influence how best to structure the overall learning strategy to provide exhaustive coverage of all possible solutions. We do not concentrate in detail on these higher-level learning strategies in this thesis, and consequently any user of the algorithms described here should address the question of exhaustive search in their own way.
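One simple higher-level strategy of the kind alluded to here is to launch several independent gradient-ascent runs from random starting points and keep the best outcome; the toy objective below is invented purely to show why a single climb can stall in a local optimum:

```python
import random

def gradient_ascent(x0, steps=200, lr=0.01):
    """Climb a fixed two-peaked objective from one starting point."""
    f = lambda x: -x**4 + 2 * x**2 + 0.3 * x        # two local maxima
    grad = lambda x: -4 * x**3 + 4 * x + 0.3
    x = x0
    for _ in range(steps):
        x += lr * grad(x)
    return x, f(x)

# A single run follows one path; independent restarts give broader coverage,
# and here we simply keep the best-scoring result.
runs = [gradient_ascent(random.uniform(-2.0, 2.0)) for _ in range(10)]
best_x, best_f = max(runs, key=lambda r: r[1])
print(best_x, best_f)
```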

1.3 Stochastic Control Processes

A central concept in this thesis is the Stochastic Control Process (SCP). Here we define it first loosely, then list certain further constraints that the reader should assume hold unless otherwise stated. The reason for the more general definition is that it represents the contextual scope of all methods and ideas presented in this thesis.

An SCP is a process that accepts control signals (actions) from one or more controlling agents, whose knowledge of the system comes from output signals from the system (observations or observational states). These actions and observations are events, ordered on a time-line, and a system's outputs may only depend probabilistically on past events. A policy is a rule by which a controlling agent generates new actions for a control process, potentially dependent on all previous events.

Commonly, this definition is further constrained in the following ways:

• Action and observation signals alternate and occur at discrete time-steps. An experienced sequence of events (actions and observations) is then referred to as a trace or history.

• There is some underlying system state – sometimes simply referred to as the state – which affects the observations and is in turn affected by actions.

• Each agent prefers some system states over others, and this allows one or more measure (reward/cost) signals to accompany observations (or to be assigned directly to observations). These allow learning/adaptive agents to assess and optimise behaviour.

• The state of the system is sometimes obscured, so that the agent only has partial knowledge of the true state of the system; such systems are called partially observable.

• The SCP in some way represents a repeatable experimental environment. That is to say, if the system is interacted with, restarted and interacted with again, and the sequence of actions is the same in both cases, then the expectation over observations is the same in both cases – as is any sequence of dependent measure signals.

These constraints are remarked on throughout the thesis, and in particular are inherent features of the modelling frameworks used to simulate SCPs. However, where SCPs are found outside of a computer scientist's laboratory, these constraints may only be approximated, or may not apply at all. Therefore the techniques and algorithms described here can only be applied to real-world processes optimistically, without hard guarantees. One further constraint we typically work under is that of finite sets, e.g. for actions, observations and states. These constraints are often fairly straightforward to relax, and where this is the case we indicate how this might be done.
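Seen from the controlling agent's side, an SCP satisfying these constraints reduces to a very small interface. The class below is our own generic rendering of that interface (the two-state dynamics are invented for illustration), not a model used later in the thesis:

```python
import random

class ToySCP:
    """A tiny partially observable SCP: two hidden states, one shared observation."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)   # seeding keeps the experiment repeatable
        self.state = "s0"

    def reset(self):
        self.state = "s0"
        return "obs"                     # both hidden states emit the same observation

    def step(self, action):
        # The hidden state changes probabilistically, depending on the action ...
        if action == "toggle" and self.rng.random() < 0.8:
            self.state = "s1" if self.state == "s0" else "s0"
        # ... and the agent receives only a partial observation plus a reward signal.
        reward = 1.0 if self.state == "s1" else 0.0
        return "obs", reward

scp = ToySCP(seed=0)
observation = scp.reset()
for _ in range(5):
    observation, reward = scp.step(random.choice(["toggle", "stay"]))
```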

1.3.1 Finite- and Infinite-Horizon Problems

Traditionally, the literature splits SCPs into two types: finite- and infinite-horizon problems. A finite-horizon process is one that is guaranteed (perhaps only with certain policies) to reach a terminal state in finite time, and once at the terminal state the experiment is considered to have ended. Obviously, most experiments of interest in the real world fit this description, and there are a number of ways of modelling this with Markov Decision Processes (MDPs) [9], many of which are more complex than similar infinite-horizon models. An infinite-horizon process is one which continues forever, in some fashion, and often allows sufficiently good approximations to finite-horizon problems which run for a long time. These infinite-horizon problems tend to be less complex, because we assume the properties of the system to be independent of time.

This thesis concerns itself exclusively with infinite-horizon problems (or those which have properties independent of time), and justifies this in three ways. Firstly, since infinite-horizon problems are independent of time, so are the associated policies, and this allows us to concentrate on simple reactive policies (those without memory). To see why, consider that even an agent with a very large memory on an infinite time-line must eventually forget (fail to remember) some of its interaction history, and hence shares many properties with an agent that forgets immediately. Finite memories represent an enhanced, but still incomplete, knowledge of the environment's state, and are therefore fundamentally no different from an observation that provided that information. Secondly, in multi-agent non-cooperative problems, where agents compete in a series of episodes, the interaction from previous episodes may inform agents' actions in the current one. So, while the game appears episodic in structure, the ongoing engagement between agents is not. Finally, a solution for an infinite-horizon problem can apply to a finite-horizon problem by assuming that the process is independent of time, although whether this is successful depends on the problem. A finite-horizon solution, where a policy's action choice depends on the time at which that choice is made, cannot be explicitly defined for all time-steps of an infinite-horizon problem. So, while an infinite-horizon approach is not always ideal, it has the broadest applicability to our domain of enquiry. For these reasons, and unless otherwise stated, SCPs examined in this thesis are modelled as infinite-horizon processes.

1.4 Ranking Policies

An agent's policy represents its method for generating action choices for an SCP from its experience. That is to say, at some time-step $t$ an agent has some history of observations and action choices, $h = o_0, a_0, \ldots, o_{t-1}, a_{t-1}, o_t$ – where $o_i$ and $a_i$ respectively represent the agent's observation and action at time $i$ – and the agent's policy, $\pi$, is some way by which this history, $h$, is translated into a new action, $a_t$.

The general purpose of control is to induce desired behaviour, and therefore any evaluation of the effectiveness of a policy must allow policies to be ranked in some way. Arguably the most intuitive way of doing this is to assign some real-valued score to each policy, such that higher-scoring policies are preferred over lower-scoring ones (or vice versa if we think of the score as representing undesirability). It is this real-valued method we focus on in this thesis, and we visit various ways of translating expected reward signals over all possible traces into a real-valued result in Section 1.5. These ideas are key in reinforcement learning; however, few alternatives have been explored, and we discuss some of these in Chapter 8. Throughout this thesis, we consider the way agents rank policies to be an integral part of an SCP definition.
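In symbols, writing $\Pi$ for the set of policies under consideration and $J$ for the real-valued score of Section 1.5 (notation used here only for illustration), the ranking just described amounts to

$$ \pi \succeq \pi' \quad\Longleftrightarrow\quad J(\pi) \ge J(\pi'), \qquad \pi, \pi' \in \Pi , $$

with the inequality reversed if the score is read as a measure of undesirability.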

1.4.1 Process Equivalence

We also briefly explore the notion of process equivalence, i.e. whether two analytically defined SCPs – such as Markov Decision Processes – are the same. This is of particular concern to researchers of transfer learning [54]. A formal measure is defined in the next chapter, as is a looser definition we call policy-equivalence. In short, two SCPs are policy-equivalent if they induce the same ranking over policies.
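Anticipating the formal treatment in the next chapter, one way to write the looser notion down – the subscripts naming the two processes are our own notation – is this: two SCPs $M_1$ and $M_2$ that admit the same policies $\Pi$, with expected returns $J_{M_1}$ and $J_{M_2}$, are policy-equivalent when

$$ J_{M_1}(\pi) \;\ge\; J_{M_1}(\pi') \quad\Longleftrightarrow\quad J_{M_2}(\pi) \;\ge\; J_{M_2}(\pi') \qquad \text{for all } \pi, \pi' \in \Pi . $$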

1.5 Expected Return

The expected return, denoted throughout this thesis by $J$, associates a single real value with each possible policy, which in some way represents an expectation of value over a trace of an entire run of some SCP. Such a trace – from the beginning, until termination if there is one, or ongoing if not – is called a lifetime trace, and is denoted by $\tau$. If the value of some trace $\tau$ is denoted by $\mathcal{V}(\tau)$, then given some policy $\pi$ the expected return is

$$ J(\pi) \;=\; \mathbb{E}_{\tau}\left( \mathcal{V}(\tau) \mid \pi \right) $$

where $\mathbb{E}_{x}(y \mid z)$ is the expected value of some variable $y$, given some condition $z$, with respect to the distribution of variable $x$. A potentially more useful form of this is given in terms of trace probabilities,

$$ J(\pi) \;=\; \sum_{\tau \in T} \Pr\left( \tau \mid \pi \right) \mathcal{V}(\tau) $$

where $T$ represents the set of possible traces.

This leaves us with the difficulty of choosing a function $\mathcal{V}(\cdot)$ which can evaluate a trace, and this is potentially problematic, for instance when we wish to consider infinite-horizon problems. Typically this is done by evaluating each time-step individually with some (utility) measure, say $f_{\tau,t}$ for each time $t$ in $\tau$, and then somehow aggregating these over $\tau$ – where the trace $\tau$ is obvious, this can be written $f_t$.
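When the trace probabilities cannot be enumerated, the same quantity can be approximated by sampling lifetime traces and averaging their values. The sketch below is a minimal illustration and assumes the caller supplies both a trace sampler for the policy and a trace-value function; neither is specified by the thesis at this point:

```python
def estimate_expected_return(sample_trace, trace_value, policy, n_traces=1000):
    """Monte Carlo approximation of J(pi) = sum_tau Pr(tau | pi) V(tau).

    sample_trace(policy) -> one (possibly truncated) lifetime trace under the policy
    trace_value(trace)   -> the real-valued score V(tau) of that trace
    """
    total = 0.0
    for _ in range(n_traces):
        total += trace_value(sample_trace(policy))
    return total / n_traces
```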

1.5.1 Expected Returns on finite-horizon problems

Before we examine expected returns on infinite-horizon problems, it is worth examining how this is done for finite-horizon problems. The ideas developed here help inform our discussion on infinite-horizon problems.


For some terminating stochastic process with terminal state $s_T$, which for some trace $\tau$ and at time-step $t$ is in some state $s_{\tau,t}$, we would define the average expected return as

$$ J_{\mathrm{ave},f}(\pi) \;=\; \mathbb{E}_{\tau}\!\left( \frac{1}{j} \sum_{i=1}^{j} f_{\tau,i} \;\middle|\; j = \operatorname*{argmin}_{k}\,(s_{\tau,k} = s_T) \right) $$

[...]

$$ \lim_{n \to \infty} \; \prod_{i=k}^{n} (1 - \epsilon_i) \;=\; 0 \tag{1.21} $$

These sequences with slower decay rates are of most interest to us, as they weight estimates in favour of more recent information. As we shall see in later chapters, this is particularly useful with sequences of random variables which only satisfy the conditions for Lemma 1.1 after some arbitrary step $n$, or only in the limit – see Chapter 6.

The limit property of the variance is slightly more complicated. Here we first consider step $n = 1$, and using Equation (1.17) we can see that the variance is

$$ \operatorname{Var}(\hat{x}_1) \;=\; \operatorname{Var}\big( (1 - \epsilon_1)\,\hat{x}_0 + \epsilon_1 X_1 \big) \;=\; \epsilon_1^2 \operatorname{Var}(X) \tag{1.22} $$

and at step $n$

$$ \begin{aligned} \operatorname{Var}(\hat{x}_n) &= \operatorname{Var}\big( (1 - \epsilon_n)\,\hat{x}_{n-1} + \epsilon_n X_n \big) \\ &= (1 - \epsilon_n)^2 \operatorname{Var}(\hat{x}_{n-1}) + \epsilon_n^2 \operatorname{Var}(X) \\ &= (1 - \epsilon_n)^2 \big( (1 - \epsilon_{n-1})^2 \operatorname{Var}(\hat{x}_{n-2}) + \epsilon_{n-1}^2 \operatorname{Var}(X) \big) + \epsilon_n^2 \operatorname{Var}(X) \\ &= \left( \sum_{i=1}^{n-1} \epsilon_i^2 \prod_{j=i+1}^{n} (1 - \epsilon_j)^2 \;+\; \epsilon_n^2 \right) \operatorname{Var}(X) \\ &= \mu_n \operatorname{Var}(X) \end{aligned} \tag{1.23} $$

for $\mu_n \in \mathbb{R}$. To show convergence for our estimate w.p.1, we must show that $\lim_{n\to\infty} \mu_n = 0$. Looking at the last line here and recalling Assumption 1.1, we can first see that it is bounded for bounded $\operatorname{Var}(X)$, since

$$ \mu_n \;=\; \cdots \qquad C - \frac{\delta}{2} \;<\; \sum_{i=1}^{N} \epsilon_i^2 \;<\; C, \qquad \text{and} \qquad \sum_{i=1}^{n} \epsilon_i^2 \;\cdots $$
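The recursive estimate analysed above is easy to simulate. The sketch below iterates x̂_n = (1 − ε_n) x̂_{n−1} + ε_n X_n with an illustrative step-size schedule ε_n = n^(−0.7) – our choice, satisfying the usual summability conditions – and the estimate settles around E[X] while retaining some sensitivity to recent samples:

```python
import random

def stochastic_approximation(sample_x, n_steps=10000, x0=0.0):
    """Iterate x_hat_n = (1 - eps_n) * x_hat_{n-1} + eps_n * X_n."""
    x_hat = x0
    for n in range(1, n_steps + 1):
        eps = 1.0 / n ** 0.7          # slowly decaying step size
        x_hat = (1.0 - eps) * x_hat + eps * sample_x()
    return x_hat

# Noisy samples of a quantity with mean 2.0; the estimate converges towards it.
print(stochastic_approximation(lambda: 2.0 + random.gauss(0.0, 1.0)))
```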