Stochastic optimal control via Bellman's principle

Available online at www.sciencedirect.com

Automatica 39 (2003) 2109 – 2114

www.elsevier.com/locate/automatica

Brief Paper

Luis G. Crespo^a,*, Jian-Qiao Sun^b

^a National Institute of Aerospace, 144 Research Drive, Hampton, VA 23666, USA
^b Department of Mechanical Engineering, University of Delaware, Newark, DE 19711, USA

Received 28 January 2003; received in revised form 27 May 2003; accepted 20 June 2003

Abstract

This paper presents a strategy for finding optimal controls of non-linear systems subject to random excitations. The method is capable of generating global control solutions when state and control constraints are present. The solution is global in the sense that controls for all initial conditions in a region of the state space are obtained. The approach is based on Bellman's principle of optimality, the cumulant-neglect closure method and the short-time Gaussian approximation. Problems with state-dependent diffusion terms, non-closeable hierarchies of moment equations for the states and singular state boundary conditions are considered in the examples. The uncontrolled and controlled system responses are evaluated by creating a Markov chain with a control-dependent transition probability matrix via the generalized cell mapping method. In all numerical examples, excellent controlled performances were obtained. © 2003 Elsevier Ltd. All rights reserved.

Keywords: Stochastic control; Optimality; Non-linearity; Random processes; Numerical algorithms; Random vibrations

This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Ioannis Paschalidis under the direction of Editor Tamer Basar. Corresponding author: Tel.: +1-757-766-1689; fax: +1-757-766-1812. E-mail addresses: [email protected] (L.G. Crespo), [email protected] (J.-Q. Sun).

0005-1098/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/S0005-1098(03)00238-3

1. Introduction

The optimal control of stochastic systems is a difficult problem, particularly when the system is strongly non-linear and there are state and control constraints. Very few closed-form solutions to such problems are available in the literature (Bratus, Dimentberg, & Iourtchenko, 2000; Dimentberg, Iourtchenko, & Bratus, 2000). Given its complexity, we must resort to numerical methods (Kushner & Dupuis, 2001). While some numerical methods of solution to the Hamilton-Jacobi-Bellman (HJB) equation are known, they usually require knowledge of the boundary/asymptotic behavior of the solution in advance (Bratus et al., 2000). Numerical strategies to find deterministic optimal control solutions based on Bellman's principle of optimality (BPO) are available (Crespo & Sun, 2000). In this paper these tools are extended to the stochastic control problem. The method, which involves both analytical and numerical steps, offers several advantages: (i) it can be applied to strongly non-linear systems, (ii) it takes into account state and control constraints, and (iii) it leads to global solutions, from which topological features can be extracted, e.g. switching curves. Former developments can be found in Crespo and Sun (2002), where comparisons with analytical solutions were made. Simulations for the examples presented are available at http://research.nianet.org/~lgcrespo/simulations.html.

2. Stochastic optimal control

2.1. Problem formulation

Consider a system governed by the stochastic differential equation in the Stratonovich sense

dx(t) = m(x(t), u(t))\,dt + \sigma(x(t), u(t))\,dB(t),

where x(t) \in R^n is the state vector, u(t) \in R^m is the control, B(t) is a vector of independent unit Wiener processes, and the functions m(\cdot) and \sigma(\cdot) are in general non-linear functions of their arguments. Itô's calculus (Risken, 1984) leads to the equation

dx(t) = \Big[ m(x,u) + \frac{1}{2}\,\sigma(x,u)^{T} \frac{\partial \sigma(x,u)}{\partial x} \Big]\,dt + \sigma(x,u)\,dB(t). \tag{1}
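The drift correction in Eq. (1) is exactly what a sample-path simulation must implement when the model is posed in the Stratonovich sense. As a minimal sketch (not the paper's algorithm), the following simulates a scalar SDE by Euler-Maruyama applied to the Itô form; the drift m(x) = -x^3 and diffusion sigma(x) = 0.5x are purely illustrative choices:

```python
import numpy as np

# Illustrative scalar case of Eq. (1): a Stratonovich SDE with
# state-dependent diffusion. The Ito drift picks up the correction
# 0.5 * sigma(x) * dsigma/dx. m and sigma below are assumed examples.

def m(x):
    return -x**3                 # drift (assumed for illustration)

def sigma(x):
    return 0.5 * x               # diffusion (assumed for illustration)

def dsigma_dx(x):
    return 0.5                   # derivative of sigma with respect to x

def euler_maruyama(x0, dt, n_steps, rng):
    """Simulate the Ito form of the Stratonovich SDE by Euler-Maruyama."""
    x = np.full(n_steps + 1, x0, dtype=float)
    for k in range(n_steps):
        drift = m(x[k]) + 0.5 * sigma(x[k]) * dsigma_dx(x[k])
        x[k + 1] = x[k] + drift * dt \
                   + sigma(x[k]) * np.sqrt(dt) * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
path = euler_maruyama(x0=1.0, dt=1e-3, n_steps=1000, rng=rng)
```

For sigma independent of x the correction vanishes and the Itô and Stratonovich forms coincide.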


The corresponding Fokker-Planck-Kolmogorov (FPK) equation is given by

\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial x}\Big[\Big(m(x,u) + \frac{1}{2}\,\sigma(x,u)^{T}\frac{\partial \sigma(x,u)}{\partial x}\Big)\rho\Big] + \frac{1}{2}\,\frac{\partial^{2}}{\partial x^{2}}\big[\sigma(x,u)\sigma(x,u)^{T}\rho\big], \tag{2}

where \rho(x, t \mid x_0, t_0) is the conditional probability density function (PDF) of the response. Let the cost functional be

J(u, x_0, t_0, T) = E\Big[\phi(x(T), T) + \int_{t_0}^{T} L(x(t), u(t))\,dt\Big], \tag{3}
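Numerically, the conditional PDF governed by Eq. (2) is later evolved as a Markov chain over cells (the generalized cell mapping idea), with a one-step transition matrix assembled from the short-time Gaussian approximation. The sketch below does this for a hypothetical scalar system with drift m(x) = -x and constant diffusion s, which is not one of the paper's examples:

```python
import numpy as np

# One-step transition matrix from the short-time Gaussian approximation:
# entry P[j, k] is the probability of moving from cell k to cell j in
# one step dt. Drift m(x) = -x and diffusion s are illustrative.
xs = np.linspace(-3, 3, 121)            # cell centers
dx, dt, s = xs[1] - xs[0], 0.05, 0.5

mean = xs + (-xs) * dt                  # short-time mean: x + m(x)*dt
var = s**2 * dt                         # short-time variance
P = np.exp(-(xs[:, None] - mean[None, :])**2 / (2 * var)) * dx
P /= P.sum(axis=0)                      # normalize columns: Pr(k -> j)

p = np.zeros(len(xs))
p[100] = 1.0                            # PDF concentrated off-center
for _ in range(200):
    p = P @ p                           # evolve the cell PDF
```

After enough steps the cell PDF settles near the stationary density of the assumed linear system, mean near zero; normalizing the columns keeps probability mass conserved on the finite domain.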

where E[\cdot] is the expected value operator, [t_0, T] is the time interval of interest, \phi(x(T), T) is the terminal cost and L(x(t), u(t)) is the Lagrangian function. The optimal control problem is to find the control u(t) \in U \subset R^m for t \in [t_0, T] in Eq. (1) that drives the system from the initial condition x(t_0) = x_0 to the target set defined by \psi(x(T), T) = 0 such that the cost functional J(\cdot) is minimized. The fixed final state condition leads to control solutions of the feedback type, i.e. u(x).

2.2. Bellman's principle of optimality

Let V(x_0, t_0, T) = J(u^*, x_0, t_0, T) be the so-called value function or optimal cost function (Yong & Zhou, 1999). The BPO can be stated as

V(x_0, t_0, T) = \inf_{u \in U} E\Big[\int_{t_0}^{\hat{t}} L(x(t), u(t))\,dt + \int_{\hat{t}}^{T} L(x^*(t), u^*(t))\,dt + \phi(x^*(T), T)\Big], \tag{4}

where t_0 \le \hat{t} \le T. Consider the problem of finding the optimal control for a system starting from x_i in the time interval [i\Delta, T], where \Delta is a discrete time step. Define the incremental and the accumulative costs as

J_\Delta = E\Big[\int_{i\Delta}^{(i+1)\Delta} L(x(t), u(t))\,dt\Big], \tag{5}

J_T = E\Big[\phi(x^*(T), T) + \int_{(i+1)\Delta}^{T} L(x^*(t), u^*(t))\,dt\Big], \tag{6}

where \{x^*(t), u^*(t)\} is the optimal solution pair over the time interval [(i+1)\Delta, T]. In this context, the BPO is given by V(x_i, i\Delta, T) = \inf_{u \in U}\{J_\Delta + J_T\}. The incremental cost J_\Delta is the cost for the system to march one time step forward starting from a deterministic initial condition x_i neighboring V(x((i+1)\Delta), (i+1)\Delta, T). The system moves to an intermediate set of the state variables. The accumulative cost J_T is the optimal cost of reaching the target set \psi(x(T), T) = 0
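The recursion V(x_i, i\Delta, T) = \inf_u\{J_\Delta + J_T\} is what the backward search described below implements. A deliberately deterministic caricature (scalar dynamics x' = u, quadratic Lagrangian, terminal cost x^2; all choices illustrative, not the paper's examples) shows the backward march and the reuse of the processed value function:

```python
import numpy as np

# Deterministic caricature of V(x_i, i*dt, T) = min_u {J_dt + J_T}:
# x' = u on a state grid, Lagrangian L = x**2 + u**2, phi = x**2.
xs = np.linspace(-1.0, 1.0, 41)          # state grid
us = np.linspace(-1.0, 1.0, 21)          # admissible controls
dt, n_steps = 0.1, 20

V = xs**2                                 # terminal cost phi(x(T))
policy = []
for _ in range(n_steps):                  # march backward from T
    # candidate next states and incremental costs for every (x, u) pair
    x_next = xs[:, None] + us[None, :] * dt
    J_inc = (xs[:, None]**2 + us[None, :]**2) * dt
    # accumulative cost: interpolate the processed value function
    # (np.interp clamps at the grid edges, a crude boundary treatment)
    J_acc = np.interp(x_next, xs, V)
    total = J_inc + J_acc
    best = total.argmin(axis=1)           # optimal control index per cell
    policy.append(us[best])
    V = total[np.arange(len(xs)), best]   # updated value function
```

The stochastic version replaces the single interpolated successor value by an expectation over the short-time conditional PDF.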

starting from this intermediate set, and is calculated through the accumulation of incremental costs over time intervals between (i+1)\Delta and T, i.e. V(x((i+1)\Delta), (i+1)\Delta, T) for the processed state space.

3. Solution approach

To evaluate the expected values in Eq. (6) for a given control, \rho(x, \Delta \mid x_0, 0) is needed. For a given feedback control law u = f(x), the response x(t) is a stationary Markov process (Lin & Cai, 1995). For a small \Delta, \rho(x, \Delta \mid x_0, 0) is known to be approximately Gaussian within an error of order O(\Delta^2) (Risken, 1984). We can derive dynamic equations for the moments of the states from Eq. (1). Such equations are closed using the cumulant-neglect closure method (Lin & Cai, 1995; Sun & Hsu, 1989) and used by a backward search algorithm to generate the global control solution.

3.1. Backward search algorithm

The backward solution process starts from the last segment of the time interval [T - \Delta, T]. Since the terminal condition for the fixed final state problem is specified, a family of local optimal solutions for all initial conditions x(T - \Delta) is found. The optimal control in the interval [i\Delta, T] is determined by minimizing the sum of the incremental and the accumulative costs leading to V(x_i, i\Delta, T), subject to the condition x((i+1)\Delta) = x^*_{i+1}, where x^*_{i+1} is a random variable whose support is mostly on the processed state region. The condition x((i+1)\Delta) = x^*_{i+1} must be applied in a probabilistic sense. Notice, however, that \rho(x, \Delta \mid x_0, 0) covers the entire state space no matter how small \Delta is. To quantify the continuity condition, let \Gamma be the extended target set such that x^*_{i+1} \in \Gamma. For a given control, define P_\Delta as

P_\Delta = \int_{x \in \Gamma} \rho(x, \Delta \mid x_i, 0)\,dx. \tag{7}

Then P_\Delta is the probability of reaching the extended target set \Gamma in time \Delta starting from x_i. The controlled response x(t) starting from a set of initial conditions x_i becomes a candidate for the optimal solution when P_\Delta is maximal. The numerical procedure is presented next. Discretize a finite state region D \subset R^n into a countable number of parts/cells. Let U be a set consisting of a countable number of admissible controls u_i for i = 1, 2, \ldots, I. The control is assumed to be constant over the time intervals. Let \Gamma \subset R^n denote the discretized target set \psi(x(T), T) = 0 and J_T = E[\phi(x(T), T)] be the terminal cost. In this framework, the algorithm is as follows:

(1) Find all the cells that surround the target set \Gamma. Denote the corresponding cell centers z_j.
(2) Construct the conditional probability density function \rho(x, \Delta \mid z_j, 0) for each control u_i and for all cell centers z_j. Call every combination (z_j, u_i) a candidate pair.


(3) Calculate the incremental cost J_\Delta(z_j, u_i), the accumulative cost J_T(z^*_k, u^*_{\hat{i}}) and P_\Delta for all candidate pairs, where z^*_k is an image cell of z_j in \Gamma and u^*_{\hat{i}} is the optimal control of z^*_k found in previous iterations.
(4) Search for the candidate pairs that minimize J_\Delta(z_j, u_i) + J_T(z^*_k, u^*_{\hat{i}}) and satisfy P_\Delta \ge \gamma \max\{P_\Delta\}, where 0 < \gamma < 1 is a factor set in advance. Denote such pairs as (z^*_j, u^*_i).
(5) Save the minimized accumulative cost function J_T(z^*_j, u^*_i) = J_\Delta(z^*_j, u^*_i) + J_T(z^*_k, u^*_{\hat{i}}) and the optimal pairs (z^*_j, u^*_i).
(6) Expand the target set \Gamma by including the cells z^*_j.
(7) Repeat Steps (1) to (6) until the initial condition x_0 is reached.

As a result, the optimal control solution for all the cells covered by \Gamma is found. The choice of image cells, i.e. x^*_{i+1}, could certainly be biased. This, however, is avoided (i) by using non-uniform integration times such that the growth of \Gamma is gradual, i.e. mapping most of the probability to neighboring cells, and (ii) by restricting the potential optimal pairs to be candidate pairs with high P_\Delta. These considerations led to the same global control solution regardless of the cell size (Crespo & Sun, 1999). The resulting dynamics of the conditional PDF is simulated using the generalized cell mapping (GCM) method (Crespo & Sun, 2002). Notice that if a simulation is run for a long time, all probability will eventually leave D. This is a consequence of using a finite computational domain to model a diffusion process; all numerical simulations face this problem. While some probability is leaking out of D, probability mass is also being brought back from its complement. In this study the computational domain is set such that the leakage of probability during the transient controlled response is very small.

4. Non-analytically closeable terms

Closure methods readily handle polynomial non-linearities. For other types of non-linearities, however, they might not only require tedious and lengthy integrations, but might also lead to expressions that do not admit a closed form. This prevents the integration of the moment equations for the states, even numerically. To overcome this difficulty, the cellular structure of the state space can be used to approximate such non-linearities with multiple Taylor expansions. Once the approximations are available, the infinite hierarchy of moments can be closed.

4.1. Example

The optimal steering of a vehicle in a vortex field is studied next. The vehicle moves on the (x_1, x_2) plane with constant velocity relative to the vortex. The control u is the heading angle with respect to the positive x_1-axis. The


velocity field is given by v(x_1, x_2) = a r / (b r^2 + c), where r = |x| is the radius from the center of the vortex. The vehicle dynamics is given by the Stratonovich equations

\dot{x}_1 = \cos(u) - v x_2 / r + \eta v w_1, \qquad \dot{x}_2 = \sin(u) + v x_1 / r + \eta v w_2, \tag{8}

where \eta is a constant, and w_1 and w_2 are correlated Gaussian white noise processes with zero mean such that E[w_1(t) w_1(t')] = 2 D_1 \delta(t - t'), E[w_2(t) w_2(t')] = 2 D_2 \delta(t - t') and E[w_1(t) w_2(t')] = 2 D_{12} \delta(t - t'). The control objective is to drive the vehicle to the target set \Gamma such that the cost functional

J = E\Big[\int_{t_0}^{T} \lambda\, v(x_1, x_2)\,dt\Big] \tag{9}

is minimized. This cost models the risk associated with the selected path. Details can be found in Crespo (2003). For uncorrelated white noise processes we find
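Before any optimization, the cost (9) can be estimated for a fixed heading by Monte Carlo. The sketch below uses the field parameters of the example (a = 15, b = 10, c = 2) but treats the noise in the Itô sense for brevity (omitting any Stratonovich correction); the cost weight, horizon and initial condition are illustrative stand-ins:

```python
import numpy as np

a, b, c, eta = 15.0, 10.0, 2.0, 1.0   # vortex field parameters (example)
D1 = D2 = 0.05                        # noise intensities (example)
lam, dt, T = 1.0, 0.01, 1.0           # cost weight and horizon (assumed)

def cost_of_heading(u, x0, rng, n_paths=200):
    """Sample average of J = E[ integral of lam * v dt ] for a constant
    heading u, via Euler-Maruyama paths of Eq. (8)."""
    steps = int(T / dt)
    x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    J = np.zeros(n_paths)
    for _ in range(steps):
        r2 = x[:, 0]**2 + x[:, 1]**2
        vr = a / (b * r2 + c)          # v / r, finite even at the origin
        v = vr * np.sqrt(r2)
        dW = rng.standard_normal((n_paths, 2)) * np.sqrt(dt)
        x1 = x[:, 0] + (np.cos(u) - vr * x[:, 1]) * dt \
             + eta * v * np.sqrt(2 * D1) * dW[:, 0]
        x2 = x[:, 1] + (np.sin(u) + vr * x[:, 0]) * dt \
             + eta * v * np.sqrt(2 * D2) * dW[:, 1]
        x[:, 0], x[:, 1] = x1, x2
        r2n = x[:, 0]**2 + x[:, 1]**2
        J += lam * a * np.sqrt(r2n) / (b * r2n + c) * dt
    return float(J.mean())

rng = np.random.default_rng(1)
J_east = cost_of_heading(0.0, x0=(0.8, -0.3), rng=rng)
```

Note that v(r) peaks at r = sqrt(c/b) with value a / (2 sqrt(b c)), so the running cost per unit time is bounded.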

\dot{m}_{10} = \cos(u) - E\Big[\frac{x_2 v}{r}\Big] + \eta^2 D_1 E\Big[v \frac{\partial v}{\partial x_1}\Big],

\dot{m}_{01} = \sin(u) + E\Big[\frac{x_1 v}{r}\Big] + \eta^2 D_2 E\Big[v \frac{\partial v}{\partial x_2}\Big],

\dot{m}_{20} = 2 m_{10} \cos(u) - 2 E\Big[\frac{x_1 x_2 v}{r}\Big] + 2\eta^2 D_1 E\Big[x_1 v \frac{\partial v}{\partial x_1}\Big] + 2\eta^2 D_1 E[v^2], \tag{10}

\dot{m}_{02} = 2 m_{01} \sin(u) + 2 E\Big[\frac{x_1 x_2 v}{r}\Big] + 2\eta^2 D_2 E\Big[x_2 v \frac{\partial v}{\partial x_2}\Big] + 2\eta^2 D_2 E[v^2],

\dot{m}_{11} = \cos(u)\, m_{01} - E\Big[\frac{x_2^2 v}{r}\Big] + \eta^2 D_1 E\Big[x_2 v \frac{\partial v}{\partial x_1}\Big] + \sin(u)\, m_{10} + E\Big[\frac{x_1^2 v}{r}\Big] + \eta^2 D_2 E\Big[x_1 v \frac{\partial v}{\partial x_2}\Big].

Several expected values in these equations cannot be analytically expressed in terms of lower-order moments. Let f(x_1, x_2) denote a non-analytically closeable function in Eq. (10). Second-order Taylor expansions about the cell centers were used (Crespo, 2003). The region D = [-2, 2] \times [-2, 2] is discretized with 1089 cells. Take U = \{-\pi, -14\pi/15, \ldots, 14\pi/15\}, \lambda = 1, a = 15, b = 10, c = 2, \eta = 1, D_1 = 0.05 and D_2 = 0.05. Let \Gamma be the set of cells corresponding to the target set \{x_1 = 2, x_2\}, i.e. \Gamma is the rightmost column of cells in D. Fig. 1 shows the mean vector field of the vortex. A trajectory of the vehicle moving freely is superimposed. The center of the vortex attracts probability due to the state dependence of the diffusion and not to the existence of an attracting point at the origin. This behavior does not have a corresponding counterpart in a deterministic analysis. The time evolution of relevant indices is shown in Fig. 2. The vector field of the mean of the controlled response is shown in Fig. 3.
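The per-cell Taylor device can be sketched in one dimension: expand a non-closeable f to second order about a cell center z_j, after which E[f(x)] involves only the first two moments. The f below is an arbitrary stand-in, not the vortex field, and the derivatives are taken by central differences:

```python
import numpy as np

# Sketch of Section 4: replace a non-polynomial nonlinearity f(x) by a
# second-order Taylor polynomial about a cell center z, so that E[f(x)]
# reduces to first and second moments. f is illustrative only.

def f(x):
    return np.sin(x) / (1.0 + x**2)

def taylor2(f, z, h=1e-4):
    """Value and first two derivatives of f at z via central differences."""
    f0 = f(z)
    f1 = (f(z + h) - f(z - h)) / (2 * h)
    f2 = (f(z + h) - 2 * f0 + f(z - h)) / h**2
    return f0, f1, f2

def closed_expectation(f, mean, var, z):
    """E[f(x)] for x with given mean and variance, using the expansion
    about z: E[f] ~ f0 + f1*E[x - z] + 0.5*f2*E[(x - z)**2]."""
    f0, f1, f2 = taylor2(f, z)
    m1 = mean - z
    m2 = var + m1**2
    return f0 + f1 * m1 + 0.5 * f2 * m2

# within a cell centered at z = 0.5, PDF concentrated near the center
approx = closed_expectation(f, mean=0.5, var=0.01, z=0.5)
```

The approximation is accurate precisely because the short-time PDF is concentrated within a few cells of its mean, which is the regime the backward search operates in.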


Fig. 4. PDF of the controlled response after 0.3 time units.

Fig. 1. Expected vector field of the uncontrolled trajectories.


A controlled trajectory is shown in Fig. 1. In the process, the control keeps the system in D with probability one. A discontinuity in the vector field exists in spite of having an unbounded control set. Such a discontinuity implies a dichotomy in the long-term behavior of the controlled response, whose dynamics depends strongly on the initial state. While for initial conditions above the discontinuity the control drives the vehicle against the velocity field, for initial conditions below it the control moves the vehicle in the direction of the current. Fig. 4 shows the controlled response of a system starting from x = (0.8, -0.32) after 0.3 time units. The system bifurcates when it reaches the discontinuity.


Fig. 2. Uncontrolled trajectories: moments of x1 (- -) and moments of x2 ( – ).

Fig. 3. Expected vector field of the controlled trajectories. Target cells are marked with crosses.

5. Singular boundary conditions

A transformation of variables is used to remove non-smooth state constraints. The optimal control problem, i.e. the system dynamics, the cost functional, the admissible state and control spaces, and the target set, is transformed to a new domain where it is solved by the method. Then, the global feedback control solution is transformed back to the physical domain, where it is evaluated. For optimal bang-bang solutions the inverse transformation of the solution satisfies the control constraints.

5.1. Example

A vibro-impact system with a one-sided rigid barrier subject to a Gaussian white noise excitation is considered. The impact is assumed to be perfectly elastic. The control of the vibro-impact system is then subject to a state constraint given by the one-sided rigid barrier. The equation of motion is given by

\ddot{y} + \varepsilon (y^2 - 1)\dot{y} + 2\zeta \dot{y} + \Omega^2 y = u(t) + w(t), \tag{11}

subject to the impact condition \dot{y}(t^{+}_{impact}) = -\dot{y}(t^{-}_{impact}) at y(t_{impact}) = -h, where t_{impact} is the time instant at which


impact occurs, y(t) is the displacement, w(t) is a Gaussian white noise process satisfying E[w(t)] = 0 and E[w(t) w(t+\tau)] = 2D\delta(\tau), and u(t) is a bounded control satisfying |u| \le \hat{u}. The control objective is to drive the system from any initial condition to rest, i.e. (y, \dot{y}) = (0, 0), while the rate of energy dissipation is maximized. The computational domain is D_y = [-h, a] \times [-v, v]. As suggested in Dimentberg (1988), we introduce the transformation y = |x| - h, leading to

\ddot{x} + \varepsilon (x^2 - 2h|x| + h^2 - 1)\dot{x} + 2\zeta \dot{x} + \Omega^2 x = \mathrm{sgn}(x)\,(u(t) + w(t) + \Omega^2 h). \tag{12}

The transformed domain is given by D_x = [-2h, 2a] \times [-v, v], the new target set is (x, \dot{x}) = (\pm h, 0), and the cost functional is given by

J_x = E\Big[\int_{t_0}^{\infty} \big[(|x| - h)^2 + \dot{x}^2\big]\,dt\Big]. \tag{13}
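The effect of the transformation can be checked numerically in the special case epsilon = zeta = 0, Omega = 1 and u = w = 0: the impacting oscillator of Eq. (11) and the smooth transformed system of Eq. (12) should satisfy y = |x| - h along the whole trajectory. The initial amplitude below is an illustrative choice large enough to cause impacts:

```python
import numpy as np

h, dt, steps = 1.0, 1e-4, 60000

# physical system: y'' + y = 0 with perfectly elastic impacts at y = -h
y, ydot = 1.5, 0.0
ys = np.empty(steps)
for k in range(steps):
    ydot += -y * dt                 # symplectic Euler step
    y += ydot * dt
    if y < -h:                      # elastic reflection at the barrier
        y = -2.0 * h - y
        ydot = -ydot
    ys[k] = y

# transformed system of Eq. (12): x'' + x = sgn(x) * h, no impacts
x, xdot = 1.5 + h, 0.0              # x0 = y0 + h on the x > 0 branch
xs = np.empty(steps)
for k in range(steps):
    xdot += (-x + np.sign(x) * h) * dt
    x += xdot * dt
    xs[k] = abs(x) - h              # map back: y = |x| - h
```

The reflection at y = -h in the physical system corresponds exactly to the smooth zero crossing of x, which is why the transformed problem has no singular boundary.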


(x, \dot{x}) = (\pm 1, 0). Since the optimal control is bang-bang, we have U = \{-1, 1\}. The uncontrolled response is marginally stable since \varepsilon = 0 and \zeta = 0. At impact, the system is reflected back in such a way that y remains the same and the sign of \dot{y} is reversed. The crossing of any other boundary is an irreversible process in the sense that the complement of D_y acts as a sink cell. The uncontrolled response for an initial uniform distribution over [-0.86, -0.78] \times [-0.64, -0.57] \subset D_y is studied first. After 3 time units, only 22% of the probability is within D_y, indicating the strong effect of diffusion on the marginally stable system. Besides, the probability left in D_y is not concentrated about the target. The global optimal control solution in D_x is shown in Fig. 5, where the switching curves are also shown. Approximations of the switching curves were obtained by curve fitting. The control solution is mapped back to D_y, leading to Fig. 6. Notice the qualitative differences in the controlled response for states in the third quadrant of D_y. While for the cells marked with crosses,

For x_1 = x and x_2 = \dot{x}, the Itô equation is given by

dx_1 = x_2\,dt,
dx_2 = \big(-\varepsilon x_1^2 x_2 + 2\varepsilon h |x_1| x_2 - (\varepsilon\beta + 2\zeta) x_2 - \Omega^2 x_1 + \mathrm{sgn}(x_1)\,\nu\big)\,dt + \mathrm{sgn}(x_1)\sqrt{2D}\,dB, \tag{14}

where B(t) is a unit Wiener process, \nu \equiv u + \Omega^2 h and \beta \equiv h^2 - 1. Some manipulations lead to

\dot{m}_{10} = m_{01}, \qquad \dot{m}_{20} = 2 m_{11},

\dot{m}_{01} = -\varepsilon m_{21} + 2\varepsilon h E[|x_1| x_2] - (\varepsilon\beta + 2\zeta) m_{01} - \Omega^2 m_{10} + E[\mathrm{sgn}(x_1)]\,\nu,

\dot{m}_{11} = m_{02} - \varepsilon m_{31} + 2\varepsilon h E[|x_1| x_1 x_2] - (\varepsilon\beta + 2\zeta) m_{11} - \Omega^2 m_{20} + E[x_1\,\mathrm{sgn}(x_1)]\,\nu, \tag{15}

\dot{m}_{02} = -2\varepsilon m_{22} + 4\varepsilon h E[|x_1| x_2^2] - 2(\varepsilon\beta + 2\zeta) m_{02} - 2\Omega^2 m_{11} + 2E[x_2\,\mathrm{sgn}(x_1)]\,\nu + 2D\,E[\mathrm{sgn}(x_1)^2]. \tag{16}

The above equations can be closed and integrated (Crespo, 2003). Now, we solve for optimal controls in the domain D_x using the cost function J_x and the target set (x, \dot{x}). The control solution u^*(x_1, x_2) is then transformed back to the physical domain, i.e. u^*(y, \dot{y}). The optimal control in both domains is bang-bang, being fully determined by switching curves. From u^*(x_1, x_2), an approximation of such curves is built and transformed to the physical domain. As an example, take \hat{u} = 1, h = 1, a = 1, v = 1, \varepsilon = 0, \zeta = 0, \Omega = 1 and D = 0.1. The transformed state space is D_x = [-2, 2] \times [-2, 2], where 849 cells are used. The transformed target set is formed by the cells that contain the points

Fig. 5. Global optimal control solution in D_x. Cells with circles are the regions where the optimal control is u^* = \hat{u}; otherwise u^* = -\hat{u}. Switching curves are superimposed.

Fig. 6. Global optimal control solution in Dy . Previous conventions apply.
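Closing Eqs. (15)-(16) under the short-time Gaussian approximation requires expectations of sgn(x_1) and |x_1|. For a scalar Gaussian variable these have classical closed forms (cross terms such as E[|x_1| x_2] follow from conditioning on jointly Gaussian variables, not shown); a sketch:

```python
from math import erf, exp, pi, sqrt

# Closed-form Gaussian expectations needed to close Eqs. (15)-(16),
# for a scalar X ~ N(mu, sig**2). These are standard results (the
# second is the mean of a folded normal distribution).

def e_sgn(mu, sig):
    """E[sgn(X)] = erf(mu / (sig * sqrt(2)))."""
    return erf(mu / (sig * sqrt(2.0)))

def e_abs(mu, sig):
    """E[|X|] = sig*sqrt(2/pi)*exp(-mu**2/(2 sig**2)) + mu*E[sgn(X)]."""
    return sig * sqrt(2.0 / pi) * exp(-mu**2 / (2 * sig**2)) \
           + mu * erf(mu / (sig * sqrt(2.0)))
```

Note that E[sgn(X)^2] = 1 almost surely, which is why the last term of Eq. (16) reduces to 2D.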


the control speeds up the system, favoring impact, in the others it intends to avoid it. For the same initial condition, the stationary controlled PDF is shown in Fig. 7. Stationarity is reached in 4 time units with 98% of the probability within D_y. As before, the switching curves split a uni-modal PDF into a bi-modal one. The time evolution of relevant indices is shown in Fig. 8. As before, the controlled response (i) converges to the target set with high probability, (ii) minimizes the cost and (iii) maximizes the probability of staying in D_y.

Fig. 7. Stationary PDF of the controlled response.

Fig. 8. Time evolutions of P_D, J_D and the moments m_{10}, m_{01}, m_{20} and m_{02} of the controlled response.

6. Conclusions

This paper proposes a strategy to find optimal controls of non-linear stochastic systems using Bellman's principle of optimality, the cumulant-neglect closure method and the short-time Gaussian approximation. Control problems of several challenging non-linear systems with fixed final state conditions subject to state and control constraints were studied to demonstrate the effectiveness of the approach. Processes with a state-dependent diffusion part, non-analytically closeable equations for the moments of the states and singular boundary conditions are considered. In all cases, the uncontrolled and controlled system responses were evaluated in the space of the conditional PDFs using the generalized cell mapping method. In all cases, excellent controlled performances were obtained. Relaxation of the computational complexity, extensions to high-dimensional systems and a rigorous study of the convergence and stability of the algorithm should be pursued in the future.

References

Bratus, A., Dimentberg, M., & Iourtchenko, D. (2000). Optimal bounded response control for a second-order system under a white-noise excitation. Journal of Vibration and Control, 6, 741-755.
Crespo, L. G. (2003). Stochastic optimal controls via dynamic programming. NASA/CR 2003-2124191.
Crespo, L. G., & Sun, J. Q. (2000). Solution of fixed final state optimal control problems via simple cell mapping. Non-linear Dynamics, 23, 391-403.
Crespo, L. G., & Sun, J. Q. (2002). Stochastic optimal control of non-linear systems via short-time Gaussian approximation and cell mapping. Non-linear Dynamics, 28, 323-342.
Dimentberg, M. (1988). Statistical dynamics of non-linear and time-varying systems. Taunton, UK: Research Studies Press.
Dimentberg, M., Iourtchenko, D., & Bratus, A. (2000). Optimal bounded control of steady-state random vibrations. Probabilistic Engineering Mechanics, 15, 381-386.
Kushner, H. J., & Dupuis, P. (2001). Numerical methods for stochastic control problems in continuous time. New York: Springer.
Lin, Y. K., & Cai, G. Q. (1995). Probabilistic structural dynamics: Advanced theory and applications. New York: McGraw-Hill.
Risken, H. (1984). The Fokker-Planck equation: Methods of solution and applications. New York: Springer.
Sun, J. Q., & Hsu, C. S. (1989). Cumulant-neglect closure method for asymmetric non-linear systems driven by Gaussian white noise. Journal of Sound and Vibration, 135, 338-345.
Yong, J., & Zhou, X. Y. (1999). Stochastic controls: Hamiltonian systems and HJB equations. New York: Springer.

Luis G. Crespo obtained his Ph.D. from the University of Delaware in 2002. He is currently a Staff Scientist of the National Institute of Aerospace (NIA, formerly ICASE) and a member of the Uncertainty Based Methods group of NASA Langley Research Center. His research interests are the dynamics, control and optimization of deterministic and stochastic systems.

Jian-Qiao Sun obtained his Ph.D. in Mechanical Engineering from the University of California at Berkeley in 1988. He is currently a professor of Mechanical Engineering at the University of Delaware. His research interests include non-linear dynamics, structural acoustic control, biomechanics and nano-scale mechanical measurement.