Optimal Control Using the Transport Equation

Optimal Control Using the Transport Equation: the Liouville Machine

Ivo Kwee [email protected]

Jürgen Schmidhuber [email protected]

Technical Report No. IDSIA-08-00 September 29, 2000 IDSIA / USI-SUPSI Istituto Dalle Molle di Studi sull'Intelligenza Artificiale Galleria 2 CH-6900 Manno, Switzerland

* IDSIA was founded by the Fondazione Dalle Molle per la Qualità della Vita and is affiliated with both the Università della Svizzera italiana (USI) and the Scuola universitaria professionale della Svizzera italiana (SUPSI) in Lugano. This research is funded by SNF grant 21-55409.98.


Abstract

Transport theory describes the scattering behavior of physical particles such as photons. Here we turn this theory into a novel approach to control. Environments and tasks are defined by physical boundary conditions. Given some task, we define a set of probability densities on continuous state, action and time. From these densities, the goal is to derive an optimal policy such that for all states the most likely action maximizes cumulative reward. Liouville's conservation theorem tells us that the density at time $t$, state $s$ and action $a$ must equal the density at $t + dt$, $s + ds$, $a + da$. Discretization yields a coupled set of partial differential equations that can be solved directly and whose solution corresponds to an optimal policy. Discounted reward schemes are incorporated naturally by taking the Laplace transform of the equations. The Liouville machine quickly solves rather complex maze problems.

1 Introduction

We present an approach to optimal control by modelling the dynamics of an agent in some environment in analogy to the motion of scattering particles, and interpret the well-known transport equation [2] as a formalism for control tasks. Using this equation, our method solves for probability densities over all state-action pairs and subsequently extracts the optimal policy from this field, such that for all states the most likely action maximizes cumulative reward. Researchers in the field of robot planning have proposed the use of potential force fields (e.g., [8]) or a method based on Huygens' principle of light propagation [3]. Our method bears resemblance to these approaches in that it uses a field solution to extract an optimal path. However, while those methods were proposed rather heuristically, we show that the use of the transport and diffusion equations follows from first principles when considering the dynamics of the system. In what follows, we first review Liouville's theorem, which provides an essential constraint on the temporal evolution of the agent's dynamics. We then proceed to the transport equation, taking into account possible scattering and absorption states (such as "death traps"). Illustrative experiments focus on agents solving maze tasks.

2 Statement of the problem

The task of optimal control under delayed reward is formulated as follows. The state of an agent in a certain environment is given by $s \in S$. The agent can perform action $a \in A$. The state of the entire system at time $t$ is then given by the tuple $\{s, a, t\}$. In the maze tasks that we use, $s$ is the position of the agent and the action $a \in \{N, E, S, W\}$ is one of the possible compass directions.


We define a goal state $s^\star$ where the agent will receive a nonnegative amount of reward. Furthermore, the agent possibly incurs a small negative amount of reward for every step it takes. Starting from some state $s_0$, the task for the agent then is to perform a sequence of actions $\{a(0), a(1), \ldots\}$ so that it reaches the goal state $s^\star$ while maximizing its total cumulative reward. In practice, the latter corresponds to finding the time-optimal path leading to the goal.
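In symbols, an illustrative way to write this objective (the symbols $r^\star$ for the goal reward and $c$ for the per-step penalty are introduced here only for this sketch) is

$$ \max_{a(0), a(1), \ldots} \; \big[\, r^\star - c\,T \,\big], \qquad T = \min\{\, t : s(t) = s^\star \,\}, $$

with $r^\star \ge 0$ and $c \ge 0$; for any constant $c > 0$ this is exactly the time-optimal path problem.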

3 Derivation of the Transport Equation

Time-dependent probability density of state-action

We define the action as the unit vector in the direction of the velocity of the agent. Let $f(s, a, t)$ denote the time-dependent probability density of an agent existing at position $s$ in domain $\Omega$, executing action $a \in A$ at time $t$. Of course, if the agent is known to be at some position $s_0$ at time $t_0$ while executing action $a_0$, then its conditional probability is simply one, i.e. $f(s_0, a_0, t_0) = 1$; for an environment with a single agent, the probability of all other states at the same time $t = t_0$ is zero.

Liouville's theorem

After a time step $dt$, the probability density function will generally change because the agent moves. If action $a_0 \neq 0$, the probability at time $t_1 = t_0 + dt$ will now be one for some new position $s_1 = s_0 + ds$, i.e. $f(s_1, a_1, t_1) = 1$, while at the previous position generally $f(s_0, a_0, t_1) = 0$. The function thus changes through time. We can establish a relation describing the evolution of this function in time in analogy to the motion of particles described by transport theory. According to Liouville's theorem, the number of particles in a volume element is conserved along a flow line [5],

$$ f(s + ds,\; a + da,\; t + dt) \;-\; f(s, a, t) \;=\; 0. \qquad (1) $$

In other words, for our single-agent system, this means that in the absence of exploring actions or absorption states (e.g., death pits that eat agents), an agent's probability of existence must remain constant as we track its course in state-action space.

Transport equation

Equation 1 is a difference equation and can be transformed into a partial differential equation (PDE) by letting $dt \to 0$ and using the partial derivatives with respect to all independent variables. Thus, after dividing by $dt$, we obtain



$$ \left[ \frac{\partial}{\partial t} \;+\; \dot{s}\cdot\nabla_s \;+\; \dot{a}\cdot\nabla_a \right] f(s, a, t) \;=\; 0 \qquad (2) $$

where $\nabla_s$ is the gradient operator with respect to position $s$, and $\nabla_a$ is the gradient operator with respect to action $a$. The derivative $\dot{s} \equiv \partial s / \partial t$ is a velocity vector (representing state change per unit time), and $\dot{a} \equiv \partial a / \partial t$ is an acceleration vector (representing velocity change per unit time). In general, the parameters $\{\dot{s}, \dot{a}\}$ are functions of $s$, $a$ and $t$. To include random processes that can deviate the agent from its current motion prescribed by Eq. 2, we add a term on the right-hand side:



$$ \left[ \frac{\partial}{\partial t} \;+\; \dot{s}\cdot\nabla_s \;+\; \dot{a}\cdot\nabla_a \right] f(s, a, t) \;=\; \left( \frac{\partial f}{\partial t} \right)_{\mathrm{coll}}. \qquad (3) $$

The equation above is a generic form of the transport equation, here applied to probability densities. In transport theory, the term on the right-hand side is often referred to as the collision term. Particular


forms of the collision term lead to alternative variants of the transport equation, such as the Boltzmann equation for gas dynamics, the neutron transport equation or the radiative transfer equation. The transport equation is an equation of the Hamilton-Jacobi type; the latter describes a more general class of time-evolution problems characterized by an energy function.

Interpretation of terms

We first take a closer look at the physical meaning of the terms and parameters in the transport equation. We start by rearranging the equation:

$$ \frac{\partial f}{\partial t} \;=\; -\,\dot{s}\cdot\nabla_s f \;-\; \dot{a}\cdot\nabla_a f \;+\; \left( \frac{\partial f}{\partial t} \right)_{\mathrm{coll}}. \qquad (4) $$

The first term on the right-hand side describes local changes in the density caused by the agent's finite inertia and its velocity $\dot{s}$: the agent is arriving at or leaving a small volume element around $s$. The second term describes local density changes due to "external" forces changing the agent's direction. Recall that $\dot{a}$ is an acceleration term, and that external forces on the agent are related to it by Newton's law, $\dot{a} = F/m$, where $m$ is the mass of the agent. If the force results from a potential field $V$, then $F = -\nabla V$. Note that "external" does not necessarily mean "uncontrollable": external forces include gravity as well as those of "muscles" controlled by the agent itself. The third term, $(\partial f / \partial t)_{\mathrm{coll}}$, accounts for all processes that may affect the agent's probability of existence; these include exploratory moves and "sudden deaths." In its most general form, the collision term is a triple convolution integral,

$$ \left( \frac{\partial f}{\partial t} \right)_{\mathrm{coll}} \;=\; \int_0^{\infty} \!\! \int_A \! \int_{\Omega} \sigma(s' \to s,\; a' \to a,\; t - t') \, f(s', a', t') \; ds' \, da' \, dt', \qquad (5) $$

and defines all possible correlations in time, space and action. Hence the approach is not limited to Markovian environments, where the current state in principle conveys all information about the probability of the next state given a particular action; information about previous states is naturally taken into account if necessary.

Diffusion approximation

Because of the (possibly) high dimensionality of the parameters, the full time-dependent transport equation is hard to solve; without any approximations it is practically unsolvable. We now make some simplifying assumptions:

1. First, in the absence of external forces, such as gravity or attracting sources, $\dot{a} = 0$, so that the third term in Eq. 2 vanishes. This means that the dynamics are controlled by the agents themselves.

2. Next, we assume the speed of the agent to be constant and homogeneous in the environment.

3. Furthermore, we assume instantaneous collisions that only permit correlations local in time. In most cases the latter also implies locality in space. This leads to a fundamental assumption of traditional reinforcement learning (RL, [4]), namely the Markov assumption. By relaxing the restriction of locality in time, however, we may also model non-Markovian environments that exhibit state changes depending on events earlier in time. We then have to use the general form of the collision term in Eq. 5 that retains the convolution in time.

4. Isotropic collisions. This means that the agent collides with equal probability in all directions (like undirected exploration in traditional RL). One concrete form of the collision kernel under these assumptions is sketched after this list.
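As an illustrative sketch (not spelled out in this form in the derivation above; the $1/(2\pi)$ normalization assumes isotropy over the two-dimensional circle of action directions), assumptions 3 and 4 amount to a collision kernel of the form

$$ \sigma(s' \to s,\; a' \to a,\; t - t') \;=\; \sigma \, \delta(s - s') \, \delta(t - t') \left[ \frac{1}{2\pi} \;-\; \delta(a - a') \right], $$

which, substituted into Eq. 5, reproduces the simplified collision integral of Eq. 6 below.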


Any of the assumptions above may be relaxed. This will require more computation, but adds little to the understanding of the basic concepts of the Liouville Machine. Under the above assumptions, the collision integral simplifies to

$$ \left( \frac{\partial f}{\partial t} \right)_{\mathrm{coll}} \;=\; \sigma \left[ \frac{1}{2\pi} \int_A f(s, a', t) \, da' \;-\; f(s, a, t) \right], \qquad (6) $$

where $\sigma$ is a parameter that represents the collision rate. The transport equation is then sufficiently described by the diffusion equation [2],





$$ \left[ \frac{\partial}{\partial t} \;-\; \nabla_s \cdot D \, \nabla_s \right] g(s, t) \;=\; 0 \qquad (7) $$

where $D$ is a diffusion parameter and $g(s, t)$ is the diffuse probability. Notice that the quantity $g$ (unlike $f$) no longer depends on the action $a$.

Laplace transform

We can further simplify our calculations by integrating Eq. 7 in time to obtain a time-integrated probability density. However, instead of merely integrating in time, we would like to penalize the agent for each time step it consumes. In the continuous domain, time-weighted integration is obtained by the Laplace transform. The Laplace transformed transport equation becomes

$$ \left[ \lambda \;+\; \dot{s}\cdot\nabla_s \;+\; \dot{a}\cdot\nabla_a \right] F(s, a, \lambda) \;=\; C(s, a, \lambda), \qquad (8) $$

where $F = \mathcal{L}[f]$ and, on the right-hand side, $C$ is the Laplace transform of the collision term. The Laplace transform of the diffusion equation is

$$ \left[ \lambda \;-\; \nabla_s \cdot D \, \nabla_s \right] G(s, \lambda) \;=\; 0, \qquad (9) $$

with $G = \mathcal{L}[g]$. It is important to note that by taking a Laplace (time) integral, we transform the original time-dependent problem into a time-independent one, which is easier to solve. In both equations, the Laplace decay parameter $\lambda$ appears as an artificial absorption term. We would obtain the same equation if we used non-weighted integration and penalized the agent at each time step $dt$ with a factor $\exp(-\lambda \, dt)$. In other words, a Laplace solution for some $\lambda$ corresponds to the same probability that we would get if we assigned a penalty factor or some probabilistic death-causing process to the agent. A solution for $\lambda = 0$ would correspond to a non-penalized integrated probability, but for conservative environments that solution is trivially $f = 1$ everywhere.
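For reference, the time weighting used here is the standard one-sided Laplace transform (see also Appendix A); under a time discretization with step $\Delta t$ the exponential weight becomes a geometric discount factor, which makes the connection to discounted reward explicit. A sketch, with $\lambda$ denoting the decay parameter and $\gamma$ an auxiliary symbol not used elsewhere in this report:

$$ F(s, a, \lambda) \;=\; \int_0^{\infty} e^{-\lambda t} \, f(s, a, t) \, dt \;\approx\; \Delta t \sum_{n=0}^{\infty} \gamma^{\,n} \, f(s, a, n \Delta t), \qquad \gamma = e^{-\lambda \Delta t}. $$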

4 Numerical Solution

Discretization of space and action in a maze

Although the transport and diffusion equations are naturally formulated for continuous space, time and action, for computational solutions we need to discretize them. We use the discrete ordinates method [2] and discretize action space into four directions $a_k$, with $k = 1, \ldots, 4$, corresponding to the four compass directions north, east, south and west, respectively. For other environments we may choose a finer level of discretization, e.g., eight directions (using north-east, south-east, etc.), but here we restrict ourselves to the four principal directions. The transport equation then reduces to a coupled set of four equations:

$$ \mathcal{A}[F_k] \;=\; \sigma \sum_{i=1}^{4} w_i \, p(a_i \to a_k) \, F_i, \qquad \text{for } k = 1, \ldots, 4, \qquad (10) $$


where $F_k = F(s, a_k, \lambda)$, with $\mathcal{A}$ denoting the left-hand-side transport operator of Eq. 8; the weights $w_i$ have to be correctly determined by quadrature. For the spatial variable we use a finite difference scheme. We construct $N$ mesh points in domain $\Omega$ and replace the gradient operators by their discrete counterparts. If we define $f_k^n = F(s_n, a_k, \lambda)$ and represent the density by a vector $\mathbf{f} = (f_1^1, \ldots, f_K^1, \ldots, f_1^N, \ldots, f_K^N)$, the system of equations may be expressed in matrix form as follows:

$$ \mathbf{A}\,\mathbf{f} \;=\; \mathbf{B}\,\mathbf{f}, \qquad (11) $$

where $\mathbf{A}$ represents the discretized operator of the left-hand side, and $\mathbf{B}$ is the discretized form of the right-hand side of Eq. 8.

Initial and boundary conditions

We still need to impose correct initial and boundary conditions before we can solve for $\mathbf{f}$. We start at $t = 0$ with an agent at some initial state-action pair $(s_0, a_0)$. Thus the initial condition is

$$ f(s, a, 0) \;=\; F(s, a_k, \lambda) \;=\; \delta(s_0 - s)\, \delta(a_0 - a). \qquad (12) $$

For the boundary conditions, we require that

$$ f(s, a, t) \;=\; F(s, a, \lambda) \;=\; 0 \qquad (13) $$

whenever $s$ is a wall. This boundary condition states that any agent that bumps into a wall is immediately removed. Alternatively one may model reflective boundaries, or require that the velocity becomes zero at the boundary. Here, however, our goal is just to illustrate and exemplify the basic scheme.

Linear system

Together with the initial and boundary conditions, we obtain a linear system of the form

$$ \mathbf{C}\,\mathbf{f} \;=\; \mathbf{b}, \qquad (14) $$

where $\mathbf{b}$ is a vector containing the boundary values, and $\mathbf{C} = \mathbf{A} - \mathbf{B}$, with $\mathbf{A}$ and $\mathbf{B}$ as defined in Eq. 11.

The solution can be found using conventional methods for solving linear systems. The system matrix is sparse but, for the transport equation, not symmetric; we have used a preconditioned biconjugate gradient sparse matrix solver [6], which is very fast. We note that solving the diffusion equation requires less computation than solving the transport equation, since the latter has four times more unknowns. Solving Eq. 14 is a linear problem for given model parameters in a completely known environment. In practice the exact environment may not be known, and model parameters (e.g. where walls are located) have to be updated in the course of the simulations. In the latter case the problem becomes a fixed-point problem, and we need to iterate between updating the model and solving Eq. 14.
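To make the assembly and solve concrete, the following is a minimal sketch (not the authors' implementation; the function name, the unit grid spacing and the unit source strength are illustrative assumptions). It builds the Laplace-transformed diffusion operator of Eq. 9 on a maze grid with a standard 5-point stencil, treats wall cells as absorbing boundaries in the sense of Eq. 13, places a unit source at the start cell corresponding to the delta initial condition of Eq. 12, and solves the sparse system with SciPy's BiCGStab solver.

# Minimal sketch (illustrative, not the authors' code): assemble and solve the
# Laplace-transformed diffusion equation  [lam - div(D grad)] G = source  on a
# maze grid, with absorbing walls (G = 0) and a unit source at the start cell.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

def solve_laplace_diffusion(free, source_cell, lam=0.1, D=1.0):
    """free: 2-D boolean array, True where the agent may move (False = wall).
    source_cell: (row, col) of the source (start or, for the adjoint, goal).
    Returns the field G with the same shape as `free` (zero on walls)."""
    assert free[source_cell], "source must lie on a free cell"
    ny, nx = free.shape
    n = int(free.sum())
    idx = -np.ones(free.shape, dtype=int)
    idx[free] = np.arange(n)                    # number the free cells
    rows, cols, vals = [], [], []
    b = np.zeros(n)
    for y in range(ny):
        for x in range(nx):
            if not free[y, x]:
                continue
            i = idx[y, x]
            diag = lam                          # absorption term lambda
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                diag += D                       # 5-point stencil, unit spacing
                if 0 <= yy < ny and 0 <= xx < nx and free[yy, xx]:
                    rows.append(i); cols.append(idx[yy, xx]); vals.append(-D)
                # else: neighbour is a wall, where G = 0 (absorbing boundary)
            rows.append(i); cols.append(i); vals.append(diag)
    b[idx[source_cell]] = 1.0                   # discretized delta source
    C = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    g, info = bicgstab(C, b, atol=1e-12)
    G = np.zeros(free.shape)
    G[free] = g
    return G

For the transport variant one would instead assemble four coupled blocks, one per direction as in Eq. 10; that matrix is not symmetric, which is why a biconjugate-gradient-type solver is used.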

5 Application to Optimal Control

We now establish a relationship between the solution of the transport (or diffusion) equation and the problem of estimating the optimal control policy from delayed reward.


Figure 1: Left: maze layout; the goal position is marked with an ’*’. Right: adjoint diffusion solution.


Figure 2: Adjoint solutions to the transport equation. Left to right: probability density maps for the north, east, south and west directions, respectively. Rightmost plot: mean probability averaged over all directions.

Adjoint solution

The solution of either the time-dependent or the Laplace-transformed equations represents the (possibly time-weighted) probability of an agent still existing in $(s, a)$ having started from some given state $(s_0, a_0)$. For our purpose of control, however, the more important quantity is the probability of an agent reaching some goal state. The latter probability is computed by considering the adjoint solution of the equations. The adjoint solution corresponds to the expected value of future discounted reward of an agent in $(s, a)$ reaching the goal $(s^\star, a^\star)$ while pursuing its current policy. The adjoint solution is obtained simply by changing the sign of the time variable in the equations and by setting the initial condition to the goal state $(s^\star, a^\star)$; it corresponds to solving the evolution of probabilities backwards in time. What do the solutions look like? Fig. 1 shows a simple $19 \times 19$ maze and its adjoint solution to the diffusion equation. We can regard the solution as if some reward were "back-diffused" from the goal state into the environment. The solution of the adjoint transport equation, shown in Fig. 2, is a little different. First of all, the solution provides not one probability map but four, one for each of the directions north, east, south and west. Also, the probability distributions for the actions show differences due to the inertia of the agent. For increasing collision parameter $\sigma$ we expect the maps to become more and more similar, and eventually to converge to the diffusion solution.

Extracting the optimal policy: increasing $\lambda$

Now that we can compute the (adjoint) solution to the equations, how can we extract an optimal (and deterministic) policy from it? We demonstrate two methods: one is to probabilistically generate paths for increasing values of $\lambda$, the other is to sample an optimal path by deterministically choosing the best action at each position in the maze. From the probability density solution we can simulate the dynamic behaviour of agents in the maze. At each position we simply sample an action with probability proportional to the ratio of its density and the total of the four action densities at that position.
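In symbols, the sampling rule just described draws, at position $s$, an action $a_k$ with probability proportional to its adjoint density (a sketch in the notation of Eq. 10):

$$ p(a_k \mid s) \;=\; \frac{F(s, a_k, \lambda)}{\sum_{i=1}^{4} F(s, a_i, \lambda)}, \qquad k = 1, \ldots, 4. $$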


Figure 3: Generated paths computed from the adjoint transport solution. (a) Random path for $\lambda = 0.01$ (202 steps). (b) Random path for $\lambda = 0.1$ (42 steps). (c) Shortest path (34 steps). All paths were generated from the adjoint density solution and typically required less than 1 second on an Ultra-SPARC workstation.

Now, by increasing $\lambda$ (after each trial) we force the agent to be more efficient. There are two ways of understanding this. First, viewing $\lambda$ as a penalty parameter, actions are increasingly punished, so that differences in action values become more pronounced. Or, viewing $\lambda$ as a time-weight parameter, the time window of the Laplace transform decreases for increasing $\lambda$. In other words, the Laplace solution will correspond to faster (thus better) agents; in the limit the generated path should converge to the optimal path. We show the effects of increasing $\lambda$ using the simple maze of Fig. 1. Plot (a) in Figure 3 shows a random path for a Laplace parameter $\lambda = 0.01$ that was generated by sampling directions according to the probability density solution; the length of the path was 202 steps. We can force the agent to be more efficient by increasing the Laplace parameter to $\lambda = 0.1$, which corresponds to an increased discount of the reward; plot (b) in the same figure shows the resulting generated path. Notice that the agent "switches" to a different solution that now leads through the lower part of the maze; this policy is almost optimal and takes 42 steps to reach the goal. Finally, the optimal path can be determined by taking the most probable action in each state. Plot (c) shows the shortest path through the maze, which requires a total of 34 steps from start to goal.

Extracting the optimal policy: maximum likely action

Instead of sampling a path according to its probability distribution, we may (for the transport equation) simply select the most likely action at the current position $s$,

$$ a_{\mathrm{ml}}(s) \;=\; \arg\max_{a} \, F(s, a). \qquad (15) $$

We refer to $a_{\mathrm{ml}}$ as the maximum likelihood (ML) action. Then, starting from some $\{s_0, a_0\}$, we repeatedly choose the ML action; if the computed (adjoint) solution is correct, we should end up at the desired goal. For the diffusion equation the selection requires a bit more work, because the solution no longer depends on $a$. In that case we may look at the gradient of the solution and select the action that corresponds to the largest increase of the (adjoint) solution, i.e. we select the next state to be the best neighbour of the current state. Again, provided the diffusion solution is correct, starting from some $\{s_0, a_0\}$ should lead to the goal. We verified that this solution was the same as plot (c) in Fig. 3. In fact, the determined optimal path does not depend on the Laplace parameter as long as $\lambda > 0$.
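Both extraction rules are easy to state in code. The sketch below is illustrative (hypothetical helper names; it assumes a dictionary of the four directional transport maps, or the scalar diffusion field returned by the earlier solve_laplace_diffusion sketch), not the authors' implementation.

# Illustrative sketch: greedy policy extraction from solved adjoint fields.
import numpy as np

ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def ml_action(F, pos):
    """Eq. 15: most likely action. F maps action name -> 2-D density map."""
    return max(ACTIONS, key=lambda a: F[a][pos])

def greedy_step_diffusion(G, free, pos):
    """Diffusion case: step to the free neighbour with the largest value of G."""
    best, best_val = pos, -np.inf
    for dy, dx in ACTIONS.values():
        nxt = (pos[0] + dy, pos[1] + dx)
        if (0 <= nxt[0] < G.shape[0] and 0 <= nxt[1] < G.shape[1]
                and free[nxt] and G[nxt] > best_val):
            best, best_val = nxt, G[nxt]
    return best

def greedy_path(G, free, start, goal, max_steps=10_000):
    """Follow the upward gradient of the adjoint field until the goal is reached."""
    path, pos = [start], start
    while pos != goal and len(path) < max_steps:
        pos = greedy_step_diffusion(G, free, pos)
        path.append(pos)
    return path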


Random versus optimal policy

At first sight, it may seem unclear how the Liouville machine has obtained the optimal policy from the probability distributions of a random walk. It may even seem contradictory to have formulated the explorative behaviour as random, while the obtained optimal path surely is not. Recall that the probability density solution is a cumulation of all possible paths, one that includes both optimal and nonoptimal solutions. However, early-arriving agents must have travelled along short-path solutions, and the Laplace transform acts as an artificial time-weighted filter that is able to extract these "fast" agents. On the other hand, deterministically selecting the most probable action at each state can be regarded as the policy of some better agent that uses the solution of the "confused" agents in a clever way.

Transport-based versus diffusion-based solution

As previously mentioned, to be rigorously correct we should model the agent's dynamics using the full transport equation. However, we have also argued that for strong scattering (i.e. frequent collisions) the dynamics are sufficiently described by the (easier to solve) diffusion equation. A toy example in Fig. 4 demonstrates the differences between the solutions of the two models.

Figure 4: Toy example. Shortest-path solutions between two points in free space (lower left to upper right); all solutions required 22 steps. (a) Transport-based solution with $\sigma = 0.01$, (b) with $\sigma = 0.12$, (c) with $\sigma = 10.0$, and (d) diffusion-based solution.

While the number of steps required to reach the goal is equal for all shown solutions, the actual path taken differs depending on the model and, for the transport-based solutions, on the collision rate parameter $\sigma$. The rightmost panel shows the solution using the diffusion equation; it most closely resembles the transport-based solution with large $\sigma$. Comparing the pictures, we may say that the transport equation accounts for the finite inertia of the agent, while the diffusion equation assumes that the agent is so light that it can swiftly turn in any direction. Equivalently, we may say that the diffusion equation assumes all actions of the agent are exploratory moves; the transport agent mostly goes straight, but sometimes explores.

6 A Bigger Maze Example

We apply the "Liouville machine" to the problem of finding the shortest path to the exit of a bigger maze, given an arbitrary start position within the maze.

Optimal policy

The computation times mentioned below are based on results obtained on a Pentium 600 MHz computer.

Maze model known. When the walls of the maze are known in advance, only one field computation is needed, from which the optimal policy can be derived immediately: using the transport equation, the optimal policy is directly obtained by choosing the most probable action at each position; using the diffusion equation, it is obtained by taking the action closest to the direction of the upward gradient.


Figure 5: A $64 \times 64$ maze with start near the lower left and goal near the upper right. (a) Estimated optimal policy and shortest-path solution at iteration 1; (b) at iteration 3; (c) at iteration 5; and (d) at the final iteration 10. The computed shortest path has 180 steps. Undiscovered walls are left blank.


The total computation time required to compute the optimal policy for all positions is roughly 3 seconds for the maze in Figure 5.

Maze model unknown. When the agent does not know the maze in advance, however, the maze has to be explored (see details below). Solving for the optimal policy is now a fixed-point computation, where we iterate between updating a model of the maze and computing the reward field. Figure 5 shows a $64 \times 64$ maze and the computed optimal policy at iterations 1, 3, 5 and the final iteration 10. Using the diffusion-based solver, a single field computation required about 3 seconds (see above); the full computation including exploration and model updates took about 5 minutes. The more sophisticated transport-based solver required more time: about 1 minute for a single field calculation and about 20 minutes to obtain the optimal policy.

Exploration and model update

To simulate agent exploration, we generated a random walk with actions randomly chosen, weighted by their relative action probabilities. We defined a maximum number of steps but doubled this number if the agent did not find the goal. Initially the agent indeed was not able to find it, as revealed by the "exploration fronts" in Fig. 5, but in later iterations it did find the goal; during the final iteration it needed 19212 steps to reach it. After exploration we updated the agent's maze model with the positions of walls recorded during the simulation. We also employed a heuristic annealing schedule: we set the reward decay parameter $\lambda$ equal to the inverse of the length of the best (successful) trial. In the early iterations the decay parameter was about $10^{-6}$, and it increased to about $10^{-3}$ at the final iteration 10. Finally, we notice that even in the unexplored regions of the maze the agent computes a default policy, assuming no walls in these regions. Also notice that small parts in the southwest of the maze did not get explored, not even during the last iteration.

Optimal path

The optimal path can be generated by taking the most probable action (from the optimal policy) at each step. The shortest paths are drawn in each plot of Fig. 5. In early iterations, the optimal policy is still inconsistent, and we can see that the computed path breaks down around the frontier of the explored region. At the final iteration 10, we found an estimated shortest path of 180 steps.
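The exploration and model-update loop just described can be summarized as follows. This is an illustrative sketch only, not the authors' implementation; the helper names assume_free_model, sample_weighted_walk and record_walls are hypothetical, and the field solve could reuse the solve_laplace_diffusion sketch from Section 4.

# Illustrative sketch of the fixed-point loop: explore, update the wall model,
# anneal the decay parameter, and recompute the field (hypothetical helpers).
def liouville_fixed_point(true_maze, start, goal, n_iter=10, max_steps=1000):
    model = assume_free_model(true_maze.shape)        # default model: no walls assumed
    lam = 1e-6                                        # initial decay parameter
    best_len = None
    for it in range(n_iter):
        field = solve_laplace_diffusion(model, goal, lam=lam)   # adjoint reward field
        steps = max_steps
        while True:
            walk, seen_walls, reached = sample_weighted_walk(
                field, true_maze, start, goal, steps)
            if reached:
                break
            steps *= 2                                # double the step budget and retry
        model = record_walls(model, seen_walls)       # add walls bumped into during the walk
        best_len = len(walk) if best_len is None else min(best_len, len(walk))
        lam = 1.0 / best_len                          # heuristic annealing schedule
    return field, model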

7 Discussion

Relation to traditional RL

As a framework. Our approach to tackling optimal control from delayed rewards is related to traditional reinforcement learning (RL, [4]). Whether and when our method is superior to traditional methods must still be investigated. However, the aim of this paper was not to propose a better algorithm but to propose a new approach that provides a new perspective on the problem of optimal control. We have tried to derive a framework from first principles, believing that this gives a better understanding of the overall problem and is also important in order to quantify model parameters.

Storage complexity. One difference from traditional RL methods is that our agent is required to perform more complex computations than methods based on Bellman's equation [1]. The latter use fairly simple "temporal difference" backups, while our approach requires inversion of a large system matrix. On the other hand, "exploration" is simpler in our approach because the agent only needs to memorize binary values indicating whether a position is "free" or a "wall"; in traditional RL the agent needs to store all real-valued state-action evaluations.


Nonlinearity

While solving the transport or diffusion equation is a linear problem for completely known environments, we have shown that the problem becomes a fixed-point problem if the environment is not known beforehand. Extracting the optimal policy from the probability densities involves another nonlinear operation. We conclude that the overall problem is nonlinear in nature.

Limitations

Most of the simplifications made in this paper are not intrinsic limitations of the Liouville machine. We can account for nonhomogeneous and non-Markovian environments by correctly retaining the full collision integral and sacrificing computational simplicity. Generally speaking, however, our efficient version of the Liouville Machine does not seem applicable to partially observable environments where the sensations of the agent may be ambiguous. But the latter is a common problem for many traditional RL approaches.

Summary

We have found a connection between control and physical transport theory by deriving a transport equation for the time-dependent joint probability of state-action pairs using Liouville's conservation theorem. We argued that in certain cases the diffusion equation is a sufficient approximation; environments with complex physical laws, however, will require the full transport equation. Using these equations, we have shown that the task of estimating the optimal policy from discounted delayed reward is mathematically equivalent to solving the propagation of the reward backwards in time. This provides a novel rigorous framework for a wide variety of control problems involving continuous states and time. In ongoing theoretical work we are trying to establish a closer link between the Liouville machine and more traditional approaches to control. Our formalism directly solves for the joint probability of state-action pairs. How exactly does this relate to Watkins' "Q-values" [7] for discrete reinforcement learning based on the theory of Markov decision processes? In ongoing experimental work we also intend to apply the Liouville machine to increasingly complex problems involving realistic physics and agent dynamics.

Acknowledgments

This work was sponsored by SNF grant 21-55409.98.

References

[1] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

[2] J. J. Duderstadt and W. R. Martin. Transport Theory. John Wiley & Sons, New York, 1979.

[3] R. Glasius, A. Komoda, and S. Gielen. Neural network dynamics for path planning and obstacle avoidance. Neural Networks, 8(1):125–133, 1995.

[4] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[5] C. Kittel and H. Kroemer. Thermal Physics. W. H. Freeman and Co., New York, 1980.


[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 2nd edition, 1992.

[7] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

[8] D. Xiao and R. Hubbold. Navigation guided by artificial force fields. In Proceedings of ACM CHI 98 Conference on Human Factors in Computing Systems, volume 1, pages 179–186. Addison Wesley, 1998.

Appendix A: Some properties of the Laplace transform

Delta function:   $\mathcal{L}[\delta(t)] = 1$
Differentiation:  $\mathcal{L}[f'(t)] = s\,\mathcal{L}[f(t)] - f(0)$
Integration:      $\mathcal{L}\big[\int_0^t f(t')\,dt'\big] = \frac{1}{s}\,\mathcal{L}[f(t)]$
Time-shift:       $\mathcal{L}[f(t-a)] = e^{-as}\,\mathcal{L}[f(t)]$
Time-scale:       $\mathcal{L}[f(at)] = \frac{1}{|a|}\,\mathcal{L}[f(t)]\big|_{s \to s/a}$
Convolution:      $\mathcal{L}[f(t) * g(t)] = \mathcal{L}[f(t)] \cdot \mathcal{L}[g(t)]$