Proceedings of the Mediterranean Conference on Control & Automation, July, Athens, Greece

Policy Iteration for Continuous-Time Systems with Unknown Internal Dynamics

D. Vrabie*, O. Pastravanu** and F. L. Lewis*

* University of Texas at Arlington – ARRI, Arlington, United States
** Technical University “Gh. Asachi” – Automatic Control Department, Iasi, Romania

Abstract—In this paper we propose a new adaptive critic scheme for finding on-line the state-feedback, infinite-horizon, optimal control solution of linear continuous-time systems using only partial knowledge of the system dynamics, i.e. no knowledge of the system A matrix is needed. This is in effect an adaptive control scheme for partially unknown systems that converges to the optimal control solution.

I. INTRODUCTION

Policy iteration names the class of algorithms built on a two-step iteration: policy evaluation and policy improvement. To solve the optimal control problem, instead of directly solving Bellman's equation for the optimal cost and then finding the optimal control policy (i.e. the feedback gain for linear systems), the policy iteration method starts by evaluating the cost of a given initial policy and then uses this information to obtain a new, improved control policy. The two steps are repeated until the policy improvement step no longer changes the current policy. Howard was the first to formulate policy iteration, in [1], and since then the technique has been extensively studied and employed by the computational intelligence and machine learning communities for finding the optimal control solution for Markov decision problems of all sorts. Although the algorithm often converges after a small number of iterations, its major drawback when applied to discrete-state systems is the need to sweep the entire state space before the cost associated with a given control policy can be computed.

For the control engineering community, two results stand out in employing the policy iteration technique, developed in the computational intelligence community, to find the solution of the optimal control problem for continuous-state linear systems: [2], for discrete-time systems, and [3], for continuous-time systems. Algorithms of this sort are particularly affordable for linear systems, since a sweep of the entire state space is no longer necessary: the cost associated with a control policy can be determined using data measured along a single state trajectory. In [2] the authors formulated a policy iteration algorithm that converges to the optimal solution of the discrete-time Linear Quadratic Regulator (LQR) problem using Q-functions. The use of Q-functions at the same time avoids the need for any knowledge of the system model.

This work was supported by the National Science Foundation grant ECS-0501451 and the Army Research Office grant W91NF-05-1-0314.

For continuous-time systems, two model-free policy iteration algorithms were presented in [3]. The model-free quality of the approach was achieved either by evaluating the infinite-horizon cost associated with a control policy along a stable state trajectory, or by using measurements of the state derivatives. Either approach avoids the need to know the internal system dynamics. At the same time, at the cost of augmenting the system state vector and under the requirement of redefining the cost function, knowledge of the system's input-state dynamics is also no longer necessary, since it becomes part of the internal dynamics of the augmented system; the algorithm thus becomes model free.

The convergence of the policy iteration technique to the optimal controller (i.e. the solution of the LQR problem) was established for continuous-time systems by Kleinman in [4] and for discrete-time systems by Hewer in [5]. Both of these methods require the repeated solution of Lyapunov equations, which in turn requires knowledge of the full system dynamics (i.e. the plant input and system matrices). In [2] Bradtke et al. gave a proof of convergence for Q-learning policy iteration for discrete-time systems which, by virtue of using the so-called Q-functions [6], [7], does not require knowledge of the system dynamics.

In this paper we propose a new policy iteration technique that solves the LQR problem for continuous-time systems in an online fashion, along a single state trajectory, using only partial knowledge about the system dynamics (i.e. the internal dynamics of the system, the A matrix, need not be known) and without requiring measurements of the state derivative. This is in fact an adaptive control scheme for partially unknown linear systems that converges to the optimal control solution without knowing the plant system matrix.

II. CONTINUOUS-TIME ADAPTIVE CRITIC SOLUTION FOR THE INFINITE HORIZON OPTIMAL CONTROL PROBLEM

A. Dynamic Programming and LQR

Consider the linear, time-invariant dynamical system

$$\dot{x} = Ax + Bu \qquad (1)$$

with x(t) ∈ R^n, subject to the infinite-horizon optimal control problem

$$V(x_0) = \min_{u(t)} \int_0^{\infty} \big( x^T Q x + u^T R u \big)\, d\tau \qquad (2)$$
with Q ≥ 0, R > 0 and (A, B) controllable. It is known that the control solution of this problem, determined by Bellman's optimality principle, is given by u = −Kx with

$$K = R^{-1} B^T P \qquad (3)$$

where the matrix P is obtained by solving the Algebraic Riccati Equation (ARE)

$$A^T P + P A - P B R^{-1} B^T P + Q = 0. \qquad (4)$$

The solution of the infinite-horizon optimization problem can also be obtained using the Dynamic Programming approach, by solving a finite-horizon optimization problem backwards in time and extending the horizon to infinity. In this case the following Riccati differential equation has to be solved,

$$-\dot{P} = A^T P + P A - P B R^{-1} B^T P + Q, \qquad P(t_f) = P_{t_f}, \qquad (5)$$

the solution of which converges to the solution of the ARE as t_f → ∞. It should be noted that in order to obtain the Dynamic Programming solution of equation (4), complete knowledge of the model of the system is required, i.e. both the system matrix A and the control input matrix B must be known.

B. Policy iteration algorithm

In the following we propose a new policy iteration algorithm that solves for the optimal control gain of the LQR problem without using knowledge of the system internal dynamics (i.e. the system matrix A). Let K be a stabilizing gain for (1), such that ẋ = (A − BK)x is a stable closed-loop system. Then the corresponding infinite-horizon quadratic cost is

$$V(x(t)) = \int_t^{\infty} x^T (Q + K^T R K)\, x \, d\tau = x^T(t)\, P\, x(t), \qquad (6)$$

where P is the real symmetric positive definite solution of the Lyapunov matrix equation

$$(A - BK)^T P + P (A - BK) = -(K^T R K + Q), \qquad (7)$$

and V(x(t)) serves as a Lyapunov function for (1). The cost function (6) can be written as

$$V(x(t)) = \int_t^{t+T} x^T (Q + K^T R K)\, x \, d\tau + V(x(t+T)). \qquad (8)$$

Based on (8), with the cost parameterized as V(x(t)) = x^T P x and considering an initial stabilizing control gain K_0, the following policy iteration scheme can be implemented on-line:

$$x^T(t) P_i x(t) = \int_t^{t+T} x^T (Q + K_i^T R K_i)\, x \, d\tau + x^T(t+T) P_i x(t+T) \qquad (9)$$

$$K_{i+1} = R^{-1} B^T P_i. \qquad (10)$$

Equations (9) and (10) formulate a new policy iteration algorithm motivated by the work of Murray et al. in [3].

C. Online implementation of the algorithm without using knowledge of the system internal dynamics

For the implementation of the iteration scheme given by (9) and (10), only knowledge of the B matrix is needed, for the policy update. The information regarding the system A matrix is embedded in the states x(t) and x(t+T), which are observed online.
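As a compact illustration of how the two-step scheme (9)-(10) might be organized in software, the sketch below assumes a hypothetical routine evaluate_policy(K) that returns the matrix P_i identified from measured trajectory data while the policy u = −Kx is applied (its least-squares construction is the subject of the remainder of this section); the function names, tolerance and iteration cap are illustrative choices, not part of the paper.

```python
import numpy as np

def policy_iteration(B, R, evaluate_policy, K0, tol=1e-6, max_iter=50):
    """Partially model-free policy iteration, eqs. (9)-(10).

    evaluate_policy(K) is assumed to return the cost matrix P_i identified
    from data measured along the trajectory generated by u = -K x; only the
    input matrix B (and the weight R) is needed for the policy update.
    """
    K = K0                              # initial stabilizing gain (K0 = 0 if the plant is stable)
    Rinv = np.linalg.inv(R)
    P_prev = None
    for _ in range(max_iter):
        P = evaluate_policy(K)          # policy evaluation, eq. (9)
        K = Rinv @ B.T @ P              # policy improvement, eq. (10)
        if P_prev is not None and np.max(np.abs(P - P_prev)) < tol:
            break                       # cost parameters stopped changing
        P_prev = P
    return K, P
```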

To find the parameters (matrix P_i) of the cost function for the policy K_i in (9), the term x^T(t) P_i x(t) is written as

$$x^T(t)\, P_i\, x(t) = p_i^T\, \bar{x}(t), \qquad (11)$$

where x̄(t) is the Kronecker product quadratic polynomial basis vector with the elements {x_i(t) x_j(t)}, i = 1,…,n; j = i,…,n, and p = ν(P), with ν(·) a vector-valued matrix function that acts on n×n symmetric matrices and returns a column vector formed by stacking the elements of the diagonal and upper triangular part of the symmetric matrix, the off-diagonal elements being taken as 2P_ij [8].
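The basis vector x̄(t) and the map ν(·) can be realized in a few lines; the sketch below follows the convention just stated (diagonal and upper-triangular entries, off-diagonal entries of P weighted by 2), and the function names are ours, not the paper's.

```python
import numpy as np

def quad_basis(x):
    """x_bar: quadratic basis [x1*x1, x1*x2, ..., xn*xn] over pairs i <= j."""
    x = np.asarray(x).ravel()
    n = x.size
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def vec_sym(P):
    """nu(P): stack diagonal and upper-triangular entries, doubling off-diagonal terms."""
    n = P.shape[0]
    return np.array([P[i, j] if i == j else 2.0 * P[i, j]
                     for i in range(n) for j in range(i, n)])

def mat_sym(p, n):
    """Inverse of vec_sym: rebuild the symmetric matrix P from the parameter vector p."""
    P = np.zeros((n, n))
    k = 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = p[k] if i == j else 0.5 * p[k]
            P[j, i] = P[i, j]
            k += 1
    return P
```

With these conventions, x @ P @ x equals vec_sym(P) @ quad_basis(x), which is exactly the identity that (11) relies on.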

Using (11), equation (9) is rewritten as

$$p_i^T \big( \bar{x}(t) - \bar{x}(t+T) \big) = \int_t^{t+T} x^T(\tau)\, (Q + K_i^T R K_i)\, x(\tau)\, d\tau \equiv d(\bar{x}(t), K_i). \qquad (12)$$

In this equation p_i is the vector of unknown parameters and x̄(t) − x̄(t+T) acts as a regression vector, while the right-hand side, denoted d(x̄(t), K_i),

$$d(\bar{x}(t), K_i) \equiv \int_t^{t+T} x^T(\tau)\, (Q + K_i^T R K_i)\, x(\tau)\, d\tau,$$

is measured based on the system states over the time interval [t, t+T]. At each iteration step, after a sufficient number of state-trajectory points are collected using the same control policy K_i, a least-squares method is employed to solve for the V-function parameters p_i, which then yield the matrix P_i.

The parameter vector p_i is found by minimizing, in the least-squares sense, the error between the target function d(x̄(t), K_i) and the parameterized left-hand side of (12) over a compact set Ω ⊂ R^n. Evaluating the right-hand side of (12) at N ≥ n(n+1)/2 points x̄^i in the state space (N = n(n+1)/2 is the number of independent elements in the matrix P), over the same time interval T, the least-squares solution is obtained as

$$p_i = (X X^T)^{-1} X Y \qquad (13)$$

where

$$X = \big[\, \bar{x}^1_\Delta \;\; \bar{x}^2_\Delta \;\; \dots \;\; \bar{x}^N_\Delta \,\big], \qquad \bar{x}^i_\Delta = \bar{x}^i(t) - \bar{x}^i(t+T),$$

$$Y = \big[\, d(\bar{x}^1, K_i) \;\; d(\bar{x}^2, K_i) \;\; \dots \;\; d(\bar{x}^N, K_i) \,\big]^T.$$
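A minimal sketch of this batch least-squares step is given below. The samples are assumed to be already collected as (x(t), x(t+T), d) triples under the same policy K_i; np.linalg.lstsq is used rather than forming (XX^T)^{-1} explicitly, in line with the remark that follows, and the stacking convention matches (11). The helper is hypothetical, not code from the paper.

```python
import numpy as np

def solve_policy_evaluation(samples, n):
    """Least-squares step (13): identify P_i from interval data.

    samples: iterable of (x_t, x_tT, d), i.e. the state at the beginning and
    at the end of one interval [t, t+T] and the measured cost d over that
    interval, all collected under the same policy K_i.  At least n(n+1)/2
    samples are required; more improves the conditioning.
    """
    iu = np.triu_indices(n)

    def quad_basis(x):                      # x_bar: products x_i * x_j for i <= j
        x = np.asarray(x).ravel()
        return np.outer(x, x)[iu]

    X = np.array([quad_basis(x_t) - quad_basis(x_tT) for x_t, x_tT, _ in samples])
    Y = np.array([d for _, _, d in samples])

    # Solve X p = Y in the least-squares sense (rows of X are the regressors).
    p, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Undo nu(.): the upper triangle holds P_ii and 2*P_ij; symmetrizing halves
    # the off-diagonal entries and leaves the diagonal unchanged.
    P = np.zeros((n, n))
    P[iu] = p
    return 0.5 * (P + P.T)
```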

The least-squares problem can be solved in real time after a sufficient number of data points are collected along a single state trajectory. In practice, the matrix inversion in (13) is not performed; the solution is obtained using algorithms that involve techniques such as Gaussian elimination, back-substitution and Householder reflections. The solution of equation (13) can also be obtained using the Recursive Least Squares (RLS) algorithm, in which case a persistence of excitation condition is required.

The proposed policy iteration procedure requires only measurements of the states at discrete moments in time, t and t+T, as well as knowledge of the observed cost over the time interval [t, t+T], which is d(x̄(t), K_i). Therefore no knowledge about the system A matrix is required for the evaluation of the cost or the update of the control policy. However, the B matrix is required for the update of the control policy, using (10), and this makes the tuning algorithm only partially model free. For the algorithms presented in [3], computing the cost of a given policy required either
- several control experiments to be performed, for different initial conditions, until the system states converged to zero (thus letting T → ∞ in (9)), in order to solve a least-squares problem, or
- directly solving a Lyapunov equation of the form (7), avoiding the use of A matrix knowledge by measuring the system states and their derivatives.

The policy iteration algorithm proposed in this paper avoids the use of A matrix knowledge and at the same time does not require measuring the state derivatives. Moreover, since the control policy evaluation requires measurements of the cost function only over finite time intervals, the algorithm can converge (i.e. optimal control is obtained) while performing measurements along a single state trajectory, provided that there is enough initial excitation in the system. In this case, the control policy is updated at time t+T, after observing the state x(t+T), and is then used for controlling the system during the time interval [t+T, t+2T]; thus the algorithm is suitable for online implementation from the control theory point of view.
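For the recursive least-squares alternative mentioned above, a common textbook form of the update (with an exponential forgetting factor) is sketched here; the regressor φ and the target d are the quantities defined in (12), while the forgetting factor and the initial covariance are illustrative choices rather than values from the paper.

```python
import numpy as np

class RLSEstimator:
    """Standard recursive least squares for p in the model  phi^T p = d  (eq. (12))."""

    def __init__(self, dim, forgetting=1.0, p0=1e3):
        self.theta = np.zeros(dim)      # running estimate of the parameter vector p_i
        self.cov = p0 * np.eye(dim)     # covariance of the estimate
        self.lam = forgetting           # 1.0 means no forgetting

    def update(self, phi, d):
        """phi: regressor x_bar(t) - x_bar(t+T);  d: measured cost over [t, t+T]."""
        phi = np.asarray(phi).ravel()
        denom = self.lam + phi @ self.cov @ phi
        gain = self.cov @ phi / denom
        self.theta = self.theta + gain * (d - phi @ self.theta)
        self.cov = (self.cov - np.outer(gain, phi @ self.cov)) / self.lam
        return self.theta
```

As noted above, the estimate converges to p_i only if the regressors are persistently exciting.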

The structure of the system with the adaptive controller is presented in Fig. 1. The essential point is that the system has to be augmented with an extra state V(t), with V̇ = x^T Q x + u^T R u, in order to extract the information regarding the cost associated with the given policy. Using only limited information about the system states, x, and the augmented system state, V, extracted from the system at specific time instants (i.e. x(t), x(t+T) and V(t+T) − V(t)), the critic is able to evaluate the performance of the system associated with a given control policy. A policy improvement then takes place at time t+T. In this way, over a single state trajectory along which several policy evaluations and updates have taken place, the algorithm can converge to the optimal control policy. It is, however, necessary that sufficient excitation exist in the system's initial state, because the algorithm iterates only on stabilizing policies, which drive the states to zero. If excitation is lost before convergence is obtained (the system has reached the equilibrium point), a new experiment needs to be conducted, having as a starting point the last policy from the previous experiment.

Figure 1. Structure of the system with the adaptive controller: the actor applies u = −Kx to the plant ẋ = Ax + Bu (initial state x_0), the cost state V̇ = x^T Q x + u^T R u is integrated alongside the plant, and the critic samples x and V through a zero-order hold with period T.
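The sampling arrangement of Fig. 1 can be reproduced in simulation by augmenting the plant with the cost integrator and integrating over one interval of length T at a time; the sketch below does this with scipy's solve_ivp for a generic (A, B, Q, R) and gain K, and is only meant to show how the triples (x(t), x(t+T), V(t+T) − V(t)) used by the critic could be generated. Note that A is used here only to simulate the plant; the critic itself never sees it.

```python
import numpy as np
from scipy.integrate import solve_ivp

def run_interval(A, B, Q, R, K, x0, T):
    """Integrate the closed loop and the cost state over one interval of length T.

    Returns x(t+T) and the cost increment d = V(t+T) - V(t), i.e. exactly the
    quantities the critic needs in (12).
    """
    n = A.shape[0]

    def aug_dynamics(t, z):
        x = z[:n]
        u = -K @ x                          # actor: gain held constant over the interval
        dx = A @ x + B @ u                  # plant dynamics (simulation only)
        dV = x @ Q @ x + u @ R @ u          # cost integrator V' = x'Qx + u'Ru
        return np.concatenate([dx, [dV]])

    sol = solve_ivp(aug_dynamics, (0.0, T), np.concatenate([x0, [0.0]]),
                    rtol=1e-8, atol=1e-10)
    zT = sol.y[:, -1]
    return zT[:n], zT[n]                    # x(t+T) and the measured cost d
```

Calling run_interval repeatedly with the same K, feeding each returned state back in as the next x0, yields the N samples needed for the least-squares problem (13).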

The iterations are stopped (i.e. the critic stops updating the control policy) when the error between the system performance evaluated at two consecutive steps drops below a designer-specified threshold. Conversely, whenever this error becomes larger than the threshold, the critic again decides to start tuning the actor parameters. Note that the updates of both the actor and the critic are performed at discrete moments in time. However, the control action is a full-fledged continuous-time control; only its constant gain is updated at certain points in time. Moreover, the critic update is based on observations of the continuous-time cost over a finite sample interval. As a result, the algorithm converges to the solution of the continuous-time optimal control problem, as shown in the next subsection.

D. Analysis of the algorithm – proof of convergence

In this section we analyze the proposed algorithm and prove its convergence.

Lemma 1. Solving for P_i in equation (9) is equivalent to finding the solution of the underlying Lyapunov equation

$$A_i^T P_i + P_i A_i = -(K_i^T R K_i + Q), \qquad (14)$$

where A_i = A − B K_i is stable.

Proof: Since A_i is a stable matrix and K_i^T R K_i + Q > 0, there exists a unique solution P_i > 0 of the Lyapunov equation (14). Since x^T P_i x is a Lyapunov function for the system ẋ = A_i x and

$$\frac{d}{dt}\big( x^T P_i x \big) = x^T \big( A_i^T P_i + P_i A_i \big) x = -x^T \big( K_i^T R K_i + Q \big) x, \qquad (15)$$

the unique solution of the Lyapunov equation satisfies

$$\int_t^{t+T} x^T (Q + K_i^T R K_i)\, x \, d\tau = -\int_t^{t+T} d\big( x^T P_i x \big) = x^T(t) P_i x(t) - x^T(t+T) P_i x(t+T),$$

i.e. equation (9). That is, provided that A_i is asymptotically stable, the solution of (9) is the unique solution of (14). █

Remark 1. Although the same solution is obtained whether equation (14) or equation (9) is solved, solving equation (9) does not require any knowledge of the system matrix A.

From Lemma 1 it follows that the algorithm (9)-(10) is equivalent to iterating between (14) and (10), without using knowledge of the system internal dynamics. Let Ric(P_i) be the matrix-valued function defined as

$$Ric(P_i) = A^T P_i + P_i A + Q - P_i B R^{-1} B^T P_i \qquad (16)$$

and let Ric'_{P_i} denote the Fréchet derivative of Ric(P_i) taken with respect to P_i. The matrix function Ric'_{P_i} evaluated at a given matrix M is

$$Ric'_{P_i}(M) = (A - B R^{-1} B^T P_i)^T M + M (A - B R^{-1} B^T P_i).$$

Lemma 2. The iteration between (9) and (10) is equivalent to Newton's method

$$P_i = P_{i-1} - \big( Ric'_{P_{i-1}} \big)^{-1} Ric(P_{i-1}). \qquad (17)$$

Proof: Equations (14) and (9) are compactly written as

$$A_i^T P_i + P_i A_i = -(P_{i-1} B R^{-1} B^T P_{i-1} + Q). \qquad (18)$$

Subtracting A_i^T P_{i-1} + P_{i-1} A_i on both sides gives

$$A_i^T (P_i - P_{i-1}) + (P_i - P_{i-1}) A_i = -\big( P_{i-1} A + A^T P_{i-1} - P_{i-1} B R^{-1} B^T P_{i-1} + Q \big), \qquad (19)$$

which, making use of the introduced notations Ric(P_i) and Ric'_{P_i}, is the Newton method formulation (17). █

Theorem 3 (convergence). The policy iteration (9)-(10) converges to the optimal control solution given by (3), where the matrix P satisfies the ARE (4).

Proof: In [4] Kleinman showed that, using Newton's method conditioned on an initial stabilizing policy K_0, all subsequent control policies are stabilizing and the iteration (14), (10) converges to the solution of the ARE. Based on the proven equivalence between (9)-(10) and (14)-(10), we conclude that the proposed policy iteration algorithm converges to the solution of the optimal control problem (2), without using knowledge of the internal dynamics of the controlled system (1). █
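For comparison, the model-based iteration (14), (10) that the convergence argument rests on (Kleinman's method [4]) can be written in a few lines with scipy; the sketch below is only an off-line numerical sanity check against the ARE solution and requires full knowledge of A, which is exactly what the on-line scheme avoids. The test system is an arbitrary stable, controllable pair chosen for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def kleinman_iteration(A, B, Q, R, K0, iters=20):
    """Model-based policy iteration: Lyapunov equation (14) plus update (10)."""
    K = K0
    Rinv = np.linalg.inv(R)
    for _ in range(iters):
        Ai = A - B @ K
        # Lyapunov equation  Ai^T P + P Ai = -(K^T R K + Q)   (eq. (14))
        P = solve_continuous_lyapunov(Ai.T, -(K.T @ R @ K + Q))
        K = Rinv @ B.T @ P              # policy improvement (eq. (10))
    return P, K

if __name__ == "__main__":
    A = np.array([[-1.0, 0.5, 0.0],     # open-loop stable, so K0 = 0 is stabilizing
                  [ 0.0, -2.0, 1.0],
                  [ 0.0, 0.0, -3.0]])
    B = np.array([[0.0], [0.0], [1.0]])
    Q, R = np.eye(3), np.eye(1)
    P_pi, _ = kleinman_iteration(A, B, Q, R, K0=np.zeros((1, 3)))
    P_are = solve_continuous_are(A, B, Q, R)
    print(np.allclose(P_pi, P_are, atol=1e-8))   # should print True
```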


III. ONLINE POLICY ITERATION DESIGN FOR F-16 AIRCRAFT

In this section we illustrate the new policy iteration algorithm by designing the optimal controller for the short-period dynamics of an aircraft. We consider the linear model of the F-16 short-period dynamics given in [9].

A. System model and simulation results

The system state vector is x = [α  q  δ_e]^T, where α denotes the angle of attack, q is the pitch rate and δ_e is the elevator deflection angle. The control input is the elevator actuator voltage. The matrices that describe the linearized system dynamics are

$$A = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.1755 \\ 0 & 0 & -20.2 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 20.2 \end{bmatrix}.$$

The simulation was conducted using data obtained from the system every 0.05 s. For the purpose of demonstrating the algorithm, the initial state is taken to be different from zero. Also, since the system to be controlled is stable, the algorithm is initialized without a controller (i.e. K_0 = 0). The cost function parameters, namely the Q and R matrices, were chosen as identity matrices of appropriate dimensions. In order to solve online for the values of the P matrix which parameterizes the cost function, before each iteration step one needs to set up a least-squares problem of the sort described in Section II-C, with the solution given by (13). Since there are six independent elements in the symmetric matrix P, we set up the least-squares problem by measuring the cost function associated with the given control policy over six time intervals of T = 0.05 s, together with the initial state and the system state at the end of each time interval. In this way, every 0.3 s enough data is collected from the system to solve for the value of the matrix P and perform a policy update. The result of applying the algorithm for the F-16 system is presented in Fig. 2.
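The optimal P matrix reported below can be cross-checked off-line from the model and weights just stated (Q = R = I and the A, B matrices above); this uses full knowledge of A and is therefore only a reference computation, not part of the on-line algorithm, and the printed values should be close to, though not necessarily digit-for-digit identical with, the matrices quoted in the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17550],
              [ 0.00000,  0.00000, -20.2000]])
B = np.array([[0.0], [0.0], [20.2]])
Q = np.eye(3)                                   # state weighting, as in the paper
R = np.eye(1)                                   # control weighting, as in the paper

P_opt = solve_continuous_are(A, B, Q, R)        # solution of the ARE (4)
K_opt = np.linalg.inv(R) @ B.T @ P_opt          # optimal gain (3)
print(np.round(P_opt, 4))
print(np.round(K_opt, 4))
```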

Figure 2. Parameters of the P matrix converging to the optimal values (P(1,2), P(2,2) and P(3,3), together with their optimal values, plotted against time).

The value of the P matrix obtained after the third iteration is

$$P = \begin{bmatrix} 1.4117 & 1.1540 & -0.0072 \\ 1.1540 & 1.4191 & -0.0087 \\ -0.0072 & -0.0087 & 0.0206 \end{bmatrix},$$

while the value of the optimal P matrix, i.e. the solution of the ARE, is

$$P = \begin{bmatrix} 1.4116 & 1.1539 & -0.0072 \\ 1.1539 & 1.4191 & -0.0087 \\ -0.0072 & -0.0087 & 0.0206 \end{bmatrix}.$$

The experiment was performed along the state trajectory presented in Fig. 3.

Figure 3. System state trajectories.

From Fig. 2 it is clear that the system controller comes close to the optimal controller after two iteration steps. The controller was no longer updated after the third iteration, as presented in Fig. 4, since the difference between the measured cost and the expected cost fell below the specified threshold of 0.00001.

Figure 4. Updates of the P matrix (i.e. the control policy): 0 – the P matrix was not updated, 1 – the P matrix was updated.

Thus, after 0.9 s, the system is controlled in an optimal fashion with the optimal controller, which was obtained on-line without using knowledge about the system's internal dynamics.

In order to further test the proposed policy iteration algorithm, a second experiment was performed. This experiment was initialized with the same parameters as the first one, except that at time t = 1.05 s, after the controller parameters had reached optimality, the value of the system parameter A(1,1) was changed to −0.69. The evolution of the P matrix parameters P(1,2), P(2,2) and P(3,3) is presented in Fig. 5 and the system states are presented in Fig. 6.

Figure 5. Parameters of the P matrix converging to the optimal values.

It is clear from Fig. 5 that the controller parameters converged to the optimal ones. The optimal controller for the new system was obtained at time t = 2.25 s, after another four parameter updates, as presented in Fig. 7. The newly obtained P matrix is

$$P = \begin{bmatrix} 7.492 & 6.0272 & -0.0375 \\ 6.0272 & 5.3216 & -0.033 \\ -0.0375 & -0.033 & 0.0207 \end{bmatrix}$$

and the optimal control solution is

$$P = \begin{bmatrix} 7.4793 & 6.0166 & -0.0375 \\ 6.0166 & 5.3126 & -0.0329 \\ -0.0375 & -0.0329 & 0.0207 \end{bmatrix}.$$
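As a rough cross-check of the second experiment, the same off-line reference computation can be repeated with the modified entry (A(1,1) = −0.69 in the paper's 1-based indexing, i.e. the (0, 0) entry below); the open-loop and closed-loop eigenvalues can then be inspected with numpy. This is again only an illustration under the stated assumptions, not part of the on-line scheme, and the closed loop printed here uses the new optimal gain rather than the previously learned one discussed below.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A2 = np.array([[-0.69,     0.90506, -0.00215],     # A(1,1) changed to -0.69
               [ 0.82225, -1.07741, -0.17550],
               [ 0.00000,  0.00000, -20.2000]])
B = np.array([[0.0], [0.0], [20.2]])
Q, R = np.eye(3), np.eye(1)

P2 = solve_continuous_are(A2, B, Q, R)             # optimal P for the modified plant
K2 = np.linalg.inv(R) @ B.T @ P2

print(np.round(P2, 4))                             # compare with the matrix above
print(np.linalg.eigvals(A2))                       # open-loop poles of the modified plant
print(np.linalg.eigvals(A2 - B @ K2))              # closed-loop poles with the new optimal gain
```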

Figure 6. System state trajectories.

Figure 7. Updates of the P matrix (i.e. the control policy): 0 – the P matrix was not updated, 1 – the P matrix was updated.

It is important to notice that the change in the parameter of the A matrix affected the stability of the plant, the open-loop poles now being placed at 0.0004, −1.76 and −20.2. However, the closed loop with the optimal controller obtained after the first three iterations remains stable (with poles placed at −0.0175, −1.7722 and −28.5669), and thus the condition required for the algorithm to be applicable is still satisfied.

B. Implementation issues

As mentioned in the previous section, at the point when the difference between the expected cost and the measured cost crosses below a designer-specified threshold (i.e. convergence is obtained), the update of the control policy is no longer necessary. In fact the policy should no longer be updated, for the reason that as soon as the system states come close to zero the excitation in the system is lost. Irrelevant data could then be measured, and solving (9) would no longer be equivalent to solving (14). If such data is used for solving the least-squares problem, not only will the obtained controller no longer be optimal, but the closed-loop system could become unstable. It has to be emphasized that, in order to successfully apply the algorithm, enough excitation must be present in the system.

Another issue, related to using relevant data for each update step, appears in the case of a parametric change in the system. The data acquired for solving the least-squares problem must characterize the system with the same internal dynamics; in other words, if some data was acquired before the system dynamics changed and some after the change took place, then that set of equations must not be used for solving (9) and updating the controller. Various methods to avoid this problem can be imagined, including solving the least-squares problem for several sets of data, collected while using the same control policy, and performing the policy update only if all the results are consistent with each other.
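One simple way to implement the consistency test suggested above is sketched here: solve the least-squares problem separately on several batches of data collected under the same policy and commit to an update only if the resulting estimates of P_i agree. The batching, the tolerance and the reuse of the solve_policy_evaluation helper sketched in Section II-C are our own illustrative choices, not a procedure prescribed by the paper.

```python
import numpy as np

def consistent_policy_evaluation(batches, n, tol=1e-2):
    """Solve (13) on each batch separately; accept the result only if they agree.

    batches: list of sample lists, each usable by solve_policy_evaluation
    (sketched in Section II-C) and all collected under the same policy K_i.
    Returns the averaged P_i if the estimates are mutually consistent, or None
    (skip the update) if they disagree, e.g. because the plant dynamics changed
    while the data was being gathered.
    """
    estimates = [solve_policy_evaluation(batch, n) for batch in batches]
    reference = estimates[0]
    for P in estimates[1:]:
        if np.max(np.abs(P - reference)) > tol:
            return None                 # inconsistent data: do not update the policy
    return sum(estimates) / len(estimates)
```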

IV. CONCLUSION

In this paper we proposed a new policy iteration technique that solves on-line the continuous-time LQR problem without using knowledge of the system's internal dynamics (the system matrix A). The algorithm can be viewed as an on-line adaptive critic scheme in which the actor performs continuous-time control while the critic incrementally corrects the actor's behavior at discrete moments in time until the best performance is obtained. The critic evaluates the actor's performance over a period of time and formulates it in a parameterized form. Based on the critic's evaluation, the actor behavior policy is updated to achieve better control performance. Convergence of the proposed algorithm, under the condition of an initial stabilizing controller, to the solution of the optimal control problem has been established by proving equivalence with the algorithm presented in [4]. The algorithm has been tested in simulation for the F-16 aircraft short-period dynamics, and implementation issues have also been discussed.

REFERENCES

[1] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, Massachusetts, 1960.
[2] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive Linear Quadratic Control Using Policy Iteration”, Proceedings of the American Control Conference, pp. 3475–3476, Baltimore, Maryland, June 1994.
[3] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, “Adaptive Dynamic Programming”, IEEE Trans. on Systems, Man and Cybernetics, vol. 32, no. 2, pp. 140–153, 2002.
[4] D. Kleinman, “On an Iterative Technique for Riccati Equation Computations”, IEEE Trans. on Automatic Control, vol. 13, pp. 114–115, February 1968.
[5] G. Hewer, “An Iterative Technique for the Computation of the Steady State Gains for the Discrete Optimal Regulator”, IEEE Trans. on Automatic Control, vol. 16, pp. 382–384, August 1971.
[6] C. J. C. H. Watkins, Learning from Delayed Rewards, PhD Thesis, University of Cambridge, England, 1989.
[7] P. Werbos, “Neural networks for control and system identification”, Proceedings of the IEEE Conference on Decision and Control, 1989.
[8] J. W. Brewer, “Kronecker Products and Matrix Calculus in System Theory”, IEEE Trans. on Circuits and Systems, vol. 25, no. 9, 1978.
[9] B. L. Stevens and F. L. Lewis, Aircraft Control and Simulation, 2nd ed., Wiley, 2003.