Proceedings of the 33rd Chinese Control Conference July 28-30, 2014, Nanjing, China

Approximate Optimal Tracking Control for Continuous-Time Unknown Nonlinear Systems

Jing Na (1), Yongfeng Lv (1), Xing Wu (1), Yu Guo (1) and Qiang Chen (2)

1. Faculty of Mechanical & Electrical Engineering, Kunming University of Science & Technology, Kunming 650500, P.R. China. E-mail: [email protected]
2. College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P.R. China

Abstract: This paper proposes an online adaptive approximate solution to the infinite-horizon optimal tracking control problem for continuous-time nonlinear systems with unknown dynamics, built on a novel identifier-critic approximate dynamic programming (ADP) structure. To obviate the requirement of complete knowledge of the system dynamics, an adaptive neural network (NN) identifier is designed with a novel adaptive law. A steady-state control, in conjunction with an adaptive optimal control, is proposed to stabilize the tracking error dynamics in an optimal manner. A critic NN is utilized to approximate the optimal value function and to obtain the optimal control action. Novel adaptive laws based on the parameter estimation error are developed to guarantee that both the identifier NN weights and the critic NN weights converge to small neighborhoods of their ideal values. The closed-loop system stability and the convergence to the optimal solution are proved based on Lyapunov theory. Simulation results exemplify the efficacy of the proposed methods.

Key Words: Adaptive control, Optimal control, Approximate dynamic programming, System identification

1  Introduction

Optimal control has been widely used in real industrial applications [1]; it is designed to control the system in an optimal way, i.e., to minimize a cost function. Traditional optimal controllers are usually obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation offline, which requires completely known system dynamics. Recently, with the aim of developing online adaptive optimal control, the principles of reinforcement learning (RL) [2, 3] and adaptive control have been incorporated into optimal control. Werbos [4] introduced an RL-based actor-critic framework, called approximate dynamic programming (ADP), where neural networks (NNs) are trained to approximately solve the optimal control problem. A survey of ADP-based feedback control designs can be found in [5-8]. The iterative nature of ADP lends itself to the design of discrete-time (DT) optimal control [9-11]. However, extending RL-based controllers to continuous-time (CT) systems entails challenges in proving stability and convergence, and in ensuring that the algorithm is online and model-free. Some existing ADP algorithms for CT nonlinear systems lack a rigorous stability analysis [3, 12]. By incorporating NNs into the actor-critic structure, an offline method was proposed in [13] to find approximate solutions of optimal control for CT nonlinear systems. In [14, 15], an integral RL technique was designed to obtain the optimal control online, where the critic NN and actor NN weights are updated in a sequential manner. Vamvoudakis and Lewis [16] further proposed a synchronous ADP algorithm, which involves simultaneous tuning of both actor and critic neural networks. To obviate the requirement for the drift system dynamics in [16], a novel actor-critic-identifier architecture was proposed in [17].*

* This work is supported by the National Natural Science Foundation of China (NSFC) under Grants 61203066, 51365023 and 51265018, and by the Basic Research Planning Project of Yunnan Province under Grant 2013FB028.


Although the identifier states in [17] converge to their true values, convergence of the identifier NN weights was not guaranteed. Moreover, knowledge of the input dynamics is required in [17]. The recent work [18] removed this requirement on the system dynamics and proposed a novel adaptive law based on the experience replay technique. On the other hand, most available ADP methods have been studied for the regulation problem rather than tracking control. In this regard, only [19] addressed the tracking control problem, where a traditional steady-state control is used in conjunction with an optimal control that stabilizes the error dynamics in the transient stage in an optimal way.

In this paper, we propose a new ADP algorithm to achieve optimal tracking control of completely unknown nonlinear systems. Inspired by [17, 19], we design an NN identifier to estimate the unknown system dynamics, where convergence of the NN weights can be proved by introducing a novel law based on the parameter estimation error [20]. Then an adaptive steady-state control is designed to obtain the desired tracking in steady state, and an ADP-based optimal control is proposed to stabilize the tracking error dynamics in an optimal manner. To this end, a critic NN is used to approximate the solution of the HJB equation and to calculate the optimal control action, which leads to an identifier-critic structure. A novel adaptive law based on the parameter estimation error [20] is used to update the identifier and critic NN weights online and simultaneously. The stability of the closed-loop system and the convergence to the optimal solution are proved. Simulation results are given to illustrate the validity of the proposed control scheme.

2  Problem Formulation

Consider the nonlinear continuous-time system described by

    \dot{x} = f(x) + g(x)u    (1)

where $x \in \mathbb{R}^n$ is the measurable system state, $u \in \mathbb{R}^m$ is the control input, $f(x) \in \mathbb{R}^n$ is the unknown system drift dynamics and $g(x) \in \mathbb{R}^{n \times m}$ is the unknown input dynamics. It is assumed that $f(x) + g(x)u$ is Lipschitz continuous on a set $\Omega$ which contains the origin.

The objective is to design an adaptive control $u(t)$ to make the system state $x(t)$ track a given trajectory $x_d(t)$, so that the following cost function is minimized [19]:

    V(e(t)) = \int_t^{\infty} r(e(\tau), u(\tau)) \, d\tau    (2)

where $e(t) = x(t) - x_d(t)$ denotes the tracking error, and $r(e, u) = e^T Q e + u^T R u$ is the utility function with $Q$ and $R$ being symmetric positive definite matrices.

Remark 1: Although some results have recently been developed to address the optimal control of system (1) by means of ADP, e.g. [13-17, 19], at least the input dynamics $g(x)$ are assumed to be known. Moreover, most ADP schemes focus on regulation rather than tracking control.
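To make the problem setup concrete, the following minimal Python sketch encodes the plant (1) and the utility in (2); the function names and interfaces are illustrative placeholders of ours, not anything specified in the paper.

```python
import numpy as np

def plant_rhs(x, u, f, g):
    """Right-hand side of (1): xdot = f(x) + g(x) u."""
    return f(x) + g(x) @ u

def utility(e, u, Q, R):
    """Integrand of the cost (2): r(e, u) = e^T Q e + u^T R u."""
    return float(e @ Q @ e + u @ R @ u)
```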

3  Adaptive NN Identification

To facilitate the control design, an adaptive identifier is established to reconstruct the unknown system dynamics. For this purpose, we assume that the dynamics of system (1) are continuous on the compact set $\Omega$; then NNs [13] can be used to approximate the unknown functions $f(x)$ and $g(x)$ as

    f(x) = \theta \xi(x) + \varepsilon_f    (3)
    g(x) = \psi \sigma(x) + \varepsilon_g    (4)

where $\theta \in \mathbb{R}^{n \times k_\theta}$ and $\psi \in \mathbb{R}^{n \times k_\psi}$ are unknown NN weights, $\xi(x) \in \mathbb{R}^{k_\theta}$ and $\sigma(x) \in \mathbb{R}^{k_\psi \times m}$ are NN basis functions, and $\varepsilon_f$ and $\varepsilon_g$ are approximation errors. The following assumption is used in the paper:

Assumption 1 [13]: The NN weights $\theta$, $\psi$ and the NN errors $\varepsilon_f$ and $\varepsilon_g$ are bounded. According to the Weierstrass higher-order approximation theorem and the claims in [13, 16], $\varepsilon_f$ and $\varepsilon_g$ converge to zero as $k_\theta, k_\psi \to \infty$, i.e. the approximation errors vanish as the number of NN neurons increases.

Then, using (3) and (4), system (1) can be rewritten as

    \dot{x} = \theta \xi(x) + \psi \sigma(x) u + \varepsilon_T    (5)

where $\varepsilon_T = \varepsilon_f + \varepsilon_g u$ denotes the NN approximation error.

To simplify the design of the adaptive law for updating the NN weights, system (5) can be written in the compact form

    \dot{x} = W_1^T \phi_1(x, u) + \varepsilon_T    (6)

where $W_1 = [\theta, \psi]^T \in \mathbb{R}^{d \times n}$ is the unknown NN weight matrix and $\phi_1(x, u) = [\xi^T(x), u^T \sigma^T(x)]^T \in \mathbb{R}^d$ is the NN regressor vector with $d = k_\theta + k_\psi$.

Remark 2: Several adaptive identifiers have been proposed for system (6), where the adaptive laws are all designed by minimizing the residual identifier error (i.e. the error between the system state $x$ and the identifier output $\hat{x}$) based on gradient methods [19] or a modified RISE algorithm [17]. However, convergence of the identifier weights was not guaranteed. As indicated in [18], the convergence of the identifier weights is essential for the convergence of the control. This paper will present a novel adaptive law to directly estimate the unknown NN weights in (6).

To estimate the NN weights $W_1$, we define the filtered variables $x_f$ and $\phi_{1f}$ of $x$ and $\phi_1$ as

    k \dot{x}_f + x_f = x
    k \dot{\phi}_{1f} + \phi_{1f} = \phi_1    (7)

where $0 < k \in \mathbb{R}$ is a scalar constant filter parameter. It can be obtained from (6)-(7) that

    \dot{x}_f = \frac{x - x_f}{k} = W_1^T \phi_{1f} + \varepsilon_{Tf}    (8)

where $\varepsilon_{Tf}$ is the filtered version of $\varepsilon_T$ given by $k \dot{\varepsilon}_{Tf} + \varepsilon_{Tf} = \varepsilon_T$.

Furthermore, we define the auxiliary matrices $P_1 \in \mathbb{R}^{d \times d}$ and $Q_1 \in \mathbb{R}^{d \times n}$ as

    \dot{P}_1 = -\ell P_1 + \phi_{1f} \phi_{1f}^T,  P_1(0) = 0
    \dot{Q}_1 = -\ell Q_1 + \phi_{1f} \left[ \frac{x - x_f}{k} \right]^T,  Q_1(0) = 0    (9)

where $\ell > 0$ is a positive constant. Then the solution of (9) is derived as

    P_1(t) = \int_0^t e^{-\ell(t-r)} \phi_{1f}(r) \phi_{1f}^T(r) \, dr
    Q_1(t) = \int_0^t e^{-\ell(t-r)} \phi_{1f}(r) \left[ \frac{x(r) - x_f(r)}{k} \right]^T dr    (10)

We define the matrix $M_1 \in \mathbb{R}^{d \times n}$ based on $P_1$ and $Q_1$ as

    M_1 = P_1 \hat{W}_1 - Q_1    (11)

Then the adaptive law for $\hat{W}_1$ is provided as

    \dot{\hat{W}}_1 = -\Gamma_1 M_1    (12)

where $\Gamma_1 > 0$ is a constant learning gain. To prove the convergence of the adaptive law (12), we denote by $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ the maximum and minimum eigenvalues of the corresponding matrices, and have:

Lemma 1 [20]: If the regressor vector $\phi_1$ is persistently excited (PE), then the matrix $P_1$ defined in (9) is positive definite, i.e. $\lambda_{\min}(P_1) > \sigma_1 > 0$ for a positive constant $\sigma_1$.

Proof: Please refer to [20] for a detailed proof of Lemma 1.

Then we have the following theorem:

Theorem 1: For system (6) with the adaptive law (12), if $\phi_1$ is PE, then the NN weight error $\tilde{W}_1 = W_1 - \hat{W}_1$ converges to a compact set around zero.

Proof: From (8) and (10), one can obtain

    Q_1 = P_1 W_1 - \nu_1    (13)

where $\nu_1(t) = -\int_0^t e^{-\ell(t-r)} \phi_{1f}(r) \varepsilon_{Tf}^T(r) \, dr$ is a bounded variable, i.e. $\|\nu_1\| \le \varepsilon_{\nu 1}$ for a positive constant $\varepsilon_{\nu 1}$. Then the auxiliary matrix $M_1$ can be written as

    M_1 = P_1 \hat{W}_1 - P_1 W_1 + \nu_1 = -P_1 \tilde{W}_1 + \nu_1    (14)

Consider the Lyapunov function $V_1 = \frac{1}{2} \tilde{W}_1^T \Gamma_1^{-1} \tilde{W}_1$; then its derivative $\dot{V}_1$ can be concluded from (12) and (14) as

    \dot{V}_1 = \tilde{W}_1^T \Gamma_1^{-1} \dot{\tilde{W}}_1 = \tilde{W}_1^T M_1 = -\tilde{W}_1^T P_1 \tilde{W}_1 + \tilde{W}_1^T \nu_1 \le -\|\tilde{W}_1\| (\sigma_1 \|\tilde{W}_1\| - \varepsilon_{\nu 1})    (15)

Then, according to the Lyapunov theorem, the weight error $\tilde{W}_1$ is uniformly ultimately bounded (UUB) within the compact set $\Omega_1 = \{\tilde{W}_1 \,|\, \|\tilde{W}_1\| \le \varepsilon_{\nu 1}/\sigma_1\}$, whose size depends on the bound $\varepsilon_{\nu 1}$ of the approximation error and the excitation level $\sigma_1$.
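To illustrate how the estimator (7), (9), (11), (12) can be realized in discrete time, here is a minimal forward-Euler Python sketch; the class name, sampling step, integration scheme and default gains (taken from the simulation section) are our own assumptions, not code from the paper.

```python
import numpy as np

class FilteredIdentifier:
    """Forward-Euler sketch of the identifier law (7), (9), (11), (12).
    d = dim(phi1), n = dim(x); kf, ell, Gamma1 and dt are design choices."""

    def __init__(self, d, n, kf=0.001, ell=1.0, Gamma1=150.0, dt=1e-3):
        self.kf, self.ell, self.Gamma1, self.dt = kf, ell, Gamma1, dt
        self.xf = np.zeros(n)        # filtered state x_f in (7)
        self.phif = np.zeros(d)      # filtered regressor phi_1f in (7)
        self.P = np.zeros((d, d))    # P_1 in (9)
        self.Q = np.zeros((d, n))    # Q_1 in (9)
        self.W = np.zeros((d, n))    # weight estimate \hat W_1

    def step(self, x, phi):
        dt = self.dt
        # first-order filters (7): k*xf' + xf = x, k*phif' + phif = phi
        self.xf += dt * (x - self.xf) / self.kf
        self.phif += dt * (phi - self.phif) / self.kf
        y = (x - self.xf) / self.kf  # filtered derivative, cf. (8)
        # auxiliary matrices (9)
        self.P += dt * (-self.ell * self.P + np.outer(self.phif, self.phif))
        self.Q += dt * (-self.ell * self.Q + np.outer(self.phif, y))
        # estimation-error-based law (11)-(12)
        M = self.P @ self.W - self.Q
        self.W -= dt * self.Gamma1 * M
        return self.W
```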

4  Approximate Optimal Tracking Control

Based on the identifier of Section 3, system (1) can be expressed as

    \dot{x} = \hat{\theta} \xi(x) + \hat{\psi} \sigma(x) u + \varepsilon_N    (16)

where $\varepsilon_N = \tilde{W}_1^T \phi_1 + \varepsilon_T$ is a bounded identifier error; thus we have $\|\varepsilon_N\| \le b_N$ with $b_N$ being a positive constant.

To achieve optimal tracking control, the control input $u$ is composed of two parts [19] as $u = u_d + u_e$, where $u_d$ is the steady-state control used to retain the tracking error at steady state, and $u_e$ is the adaptive optimal control designed to stabilize the tracking error dynamics in the transient stage in an optimal manner.

4.1  Steady-state Tracking Control

Since $u_d$ is used to retain the tracking error performance in steady state, it should be designed to compensate for the unnecessary dynamics in (16). We therefore design $u_d$ as

    u_d = \hat{g}^{+}(x) [\dot{x}_d - \hat{\theta} \xi(x) - K_e e]    (17)

where $e = x - x_d$ is the tracking error, $K_e > 0$ is the feedback gain, and $\hat{g}^{+}(x) = \{[\hat{\psi}\sigma(x)]^T \hat{\psi}\sigma(x)\}^{-1} [\hat{\psi}\sigma(x)]^T$ denotes the generalized inverse of $\hat{\psi}\sigma(x)$.

From (16) and (17), the dynamics of $e$ are described by

    \dot{e} = \hat{\theta}\xi(x) + \hat{\psi}\sigma(x)(u_d + u_e) - \dot{x}_d + \varepsilon_N = -K_e e + \hat{\psi}\sigma(x) u_e + \varepsilon_N    (18)
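A direct Python sketch of (17), continuing the conventions above; `np.linalg.pinv` plays the role of the generalized inverse $\hat{g}^{+}(x)$, and all argument names (and the default gain, taken from Section 5) are illustrative assumptions.

```python
import numpy as np

def steady_state_control(x, xd_dot, e, theta_hat, psi_hat, xi, sigma, Ke=0.65):
    """Steady-state control (17): u_d = g^+(x) [xd_dot - theta_hat*xi(x) - Ke*e]."""
    g_hat = psi_hat @ sigma(x)   # \hat g(x) = psi_hat * sigma(x)
    return np.linalg.pinv(g_hat) @ (xd_dot - theta_hat @ xi(x) - Ke * e)
```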

4.2  Approximate Optimal Control

The adaptive optimal control $u_e$ is designed to stabilize (18) in an optimal manner. In this case, the infinite-horizon cost function (2) can be rewritten as

    V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau))) \, d\tau    (19)

Define the Hamiltonian of (19) associated with (18) as

    H(e, u_e, V_e) = V_e^T [-K_e e + \hat{\psi}\sigma(x) u_e + \varepsilon_N] + e^T Q e + u_e^T R u_e    (20)

where $V_e = \partial V(e)/\partial e$. The optimal value function is

    V^*(e(t)) = \min_{u_e} \int_t^{\infty} r(e(\tau), u_e(e(\tau))) \, d\tau    (21)

and it satisfies the HJB equation

    0 = \min_{u_e} [H(e, u_e, V_e^*)]    (22)

The optimal control can be obtained from $\partial H(e, u_e^*, V^*)/\partial u_e^* = 0$ in (20) as

    u_e^* = -\frac{1}{2} R^{-1} [\hat{\psi}\sigma(x)]^T \frac{\partial V^*(e)}{\partial e}    (23)

where $V^*$ is the solution of the HJB equation (22).
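For completeness, the stationarity step behind (23) can be spelled out in one line (a standard manipulation, cf. [13, 16]); this restatement is ours and is not an equation numbered in the paper:

```latex
\frac{\partial H(e,u_e,V^{*})}{\partial u_e}
  = [\hat{\psi}\sigma(x)]^{T}\frac{\partial V^{*}(e)}{\partial e} + 2R\,u_e = 0
\;\Longrightarrow\;
u_e^{*} = -\tfrac{1}{2}R^{-1}[\hat{\psi}\sigma(x)]^{T}\frac{\partial V^{*}(e)}{\partial e}.
```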

The HJB equation (22) is a nonlinear partial differential equation (PDE) which is difficult to solve. Thus, as in [13-17], a critic NN is used to approximate the value function $V^*(e)$. Assuming the value function is smooth on the compact set, there exists a single-layer NN [16] such that $V^*(e)$ can be uniformly approximated as

    V^*(e) = W_2^T \phi_2(e) + \varepsilon_v    (24)

and its derivative $\partial V^*(e)/\partial e$ is

    \frac{\partial V^*(e)}{\partial e} = \nabla \phi_2^T W_2 + \nabla \varepsilon_v    (25)

where $W_2 \in \mathbb{R}^l$ is the unknown NN weight vector, $\phi_2(e) \in \mathbb{R}^l$ is the basis function vector, $\varepsilon_v$ is the approximation error, and $l$ is the number of neurons; $\nabla \phi_2 = \partial \phi_2 / \partial e$ and $\nabla \varepsilon_v = \partial \varepsilon_v / \partial e$ are the partial derivatives of $\phi_2$ and $\varepsilon_v$ with respect to $e$, respectively. Then, substituting (24) into (23), $u_e^*$ can be given as

    u_e^* = -\frac{1}{2} R^{-1} [\hat{\psi}\sigma(x)]^T (\nabla \phi_2^T W_2 + \nabla \varepsilon_v)    (26)

Assumption 2 [13]: The ideal critic NN weights $W_2$, the activation function $\phi_2$ and its gradient $\nabla \phi_2$, and the approximation error $\varepsilon_v$ and its gradient $\nabla \varepsilon_v$ are all bounded, i.e. $\|\nabla \phi_2\| \le \phi_M$ for a positive constant $\phi_M$.

Since the ideal weights $W_2$ are unknown, the critic NN is realized with the estimated weights $\hat{W}_2$ as

    \hat{V}(e) = \hat{W}_2^T \phi_2(e)    (27)

and the approximate optimal control is given by

    \hat{u}_e = -\frac{1}{2} R^{-1} [\hat{\psi}\sigma(x)]^T \nabla \phi_2^T \hat{W}_2    (28)

so that the overall control applied to system (1) is

    u = u_d + \hat{u}_e    (29)

Substituting the critic NN (24) into the HJB equation (22) yields

    e^T Q e + \hat{u}_e^T R \hat{u}_e + W_2^T \nabla \phi_2 [-K_e e + \hat{\psi}\sigma(x) \hat{u}_e] = \varepsilon_{HJB}    (30)

where $\varepsilon_{HJB}$ is the residual HJB equation error due to the NN approximation errors $\varepsilon_N$ and $\nabla \varepsilon_v$, which can be made arbitrarily small with sufficiently many NN nodes [13, 16], i.e. $\varepsilon_N \to 0$ for $k_\theta, k_\psi \to \infty$ and $\nabla \varepsilon_v \to 0$ for $l \to \infty$.

We denote the known (measurable) terms as $\Xi = \nabla \phi_2 [-K_e e + \hat{\psi}\sigma(x) \hat{u}_e]$ and $\Theta = e^T Q e + \hat{u}_e^T R \hat{u}_e$, so that (30) is rewritten as

    \Theta = -W_2^T \Xi + \varepsilon_{HJB}    (31)

Remark 3: Available ADP schemes [13-19] are designed using an extra actor NN, whose weights are updated to minimize the residual Bellman error in the approximated HJB equation based on least-squares [17] or modified Levenberg-Marquardt algorithms [16]. In the following, we will design an adaptive law for the critic NN based on (31), so that the critic NN can be used to determine the control action (28) without the actor NN used in [13-19]. Thus, the following development differs from available ADP schemes.

We define the matrix $P_2 \in \mathbb{R}^{l \times l}$ and the vector $Q_2 \in \mathbb{R}^l$ as

    \dot{P}_2 = -\ell P_2 + \Xi \Xi^T,  P_2(0) = 0
    \dot{Q}_2 = -\ell Q_2 + \Xi \Theta,  Q_2(0) = 0    (32)

where $\ell$ is defined as in (9). Another auxiliary vector $M_2$ is obtained based on $P_2$ and $Q_2$ as

    M_2 = P_2 \hat{W}_2 + Q_2    (33)

Then the adaptive law for the critic NN is given as

    \dot{\hat{W}}_2 = -\Gamma_2 M_2    (34)

where $\Gamma_2 > 0$ is the learning gain.
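In the same spirit as the identifier sketch above, here is a hedged forward-Euler Python sketch of the critic law (32)-(34) together with the control (28); `CriticLaw`, `grad_phi2` and the default gains are illustrative assumptions of ours, not the authors' code.

```python
import numpy as np

class CriticLaw:
    """Forward-Euler sketch of the critic adaptive law (32)-(34);
    l is the number of critic neurons."""

    def __init__(self, l, ell=1.0, Gamma2=150.0, dt=1e-3):
        self.ell, self.Gamma2, self.dt = ell, Gamma2, dt
        self.P = np.zeros((l, l))    # P_2 in (32)
        self.Q = np.zeros(l)         # Q_2 in (32)
        self.W = np.zeros(l)         # critic estimate \hat W_2

    def step(self, Xi, Theta):
        dt = self.dt
        self.P += dt * (-self.ell * self.P + np.outer(Xi, Xi))
        self.Q += dt * (-self.ell * self.Q + Xi * Theta)
        M = self.P @ self.W + self.Q       # M_2 in (33)
        self.W -= dt * self.Gamma2 * M     # law (34)
        return self.W

def approx_optimal_control(e, x, psi_hat, sigma, grad_phi2, W2_hat, R_inv):
    """Approximate optimal control (28)."""
    B = psi_hat @ sigma(x)
    return -0.5 * R_inv @ B.T @ grad_phi2(e).T @ W2_hat
```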

We have the following theorem:

Theorem 2: For the critic NN adaptive law (34), if the regressor vector $\Xi$ in (31) is PE, then the critic NN weight error $\tilde{W}_2 = W_2 - \hat{W}_2$ converges to a compact set around zero.

Proof: Similar to Lemma 1, we know that if $\Xi$ is PE, then $P_2$ is positive definite, i.e. $\lambda_{\min}(P_2) > \sigma_2 > 0$ holds for a positive constant $\sigma_2$. Moreover, from (32)-(33), we have

    Q_2 = -P_2 W_2 + \nu_2    (35)

where $\nu_2(t) = \int_0^t e^{-\ell(t-r)} \Xi(r) \varepsilon_{HJB}(r) \, dr$ is bounded by a positive constant $\varepsilon_{\nu 2}$ as $\|\nu_2\| \le \varepsilon_{\nu 2}$. From (35) and (33), we have

    M_2 = -P_2 \tilde{W}_2 + \nu_2    (36)

Consider the Lyapunov function $V_2 = \frac{1}{2} \tilde{W}_2^T \Gamma_2^{-1} \tilde{W}_2$; then

    \dot{V}_2 = -\tilde{W}_2^T P_2 \tilde{W}_2 + \tilde{W}_2^T \nu_2 \le -\|\tilde{W}_2\| (\sigma_2 \|\tilde{W}_2\| - \varepsilon_{\nu 2})    (37)

Then, according to the Lyapunov theorem, $\tilde{W}_2$ converges to the compact set $\Omega_2 = \{\tilde{W}_2 \,|\, \|\tilde{W}_2\| \le \varepsilon_{\nu 2}/\sigma_2\}$, where the bound is determined by $\varepsilon_{\nu 2}$ and $\sigma_2$.

4.3  Stability Analysis

Theorem 3: For system (1) with the adaptive tracking control (29), consisting of (17) and (28), and the adaptive laws (12) and (34), if the initial control action is admissible and the regressor vectors $\phi_1$ and $\Xi$ are PE, then the tracking error $e$ and the NN weight errors $\tilde{W}_1$, $\tilde{W}_2$ are uniformly ultimately bounded, and the control $\hat{u}_e$ in (28) converges to a small bound around its ideal optimal solution $u_e^*$ in (26), i.e. $\|\hat{u}_e - u_e^*\| \le \varepsilon_u$ for a positive constant $\varepsilon_u$.

Proof: Please refer to the Appendix for a detailed proof.

5  Simulations

Consider the nonlinear continuous-time system [21]

    \dot{x} = \begin{bmatrix} -x_1 + x_2 \\ -0.5 x_1 - 0.5 x_2 (1 - (\cos(2x_1) + 2)^2) \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix} u    (38)

In the simulation, $Q$ and $R$ in the cost function (19) are chosen as identity matrices as in [21]. The control objective is to make the system states $x$ track the desired trajectory $x_{1d} = \sin(t)$, $x_{2d} = \cos(t) + \sin(t)$.

We first use the adaptive law (12) to estimate the unknown system parameters

    W_1^T = [\theta, \psi] = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ -0.5 & 0 & -0.5 & 1 & 2 \end{bmatrix}

with $\phi_1(x, u) = [x_1, x_2, x_2(1 - (\cos(2x_1) + 2)^2), u\cos(2x_1), u]^T$ being the known regressor vector. The parameters used in the simulation are set as $k = 0.001$, $\ell = 1$ and $\Gamma_1 = \Gamma_2 = 150$. The initial NN weights are $\hat{W}_1(0) = \hat{W}_2(0) = 0$ and the initial system states are $x_1(0) = 3$, $x_2(0) = -1$. Fig. 1 shows the estimated non-null identifier weights $\hat{W}_1$, which converge to their true values.

[Fig. 1: Convergence of the identifier weights $\hat{W}_1$; the non-null entries $\hat{\theta}_{11}, \hat{\theta}_{12}, \hat{\theta}_{21}, \hat{\theta}_{23}, \hat{\psi}_{21}, \hat{\psi}_{22}$ are plotted over $t \in [0, 20]$ s.]

To achieve tracking control, the steady-state control (17) for system (38) is written as $u_d = \hat{g}^{+}(x)[\dot{x}_d - \hat{\theta}\xi(x) - K_e e]$ with $K_e = 0.65$. Moreover, the optimal control (28) is designed by choosing the optimal value function [16, 21] as

    V^*(e) = \frac{1}{2} e_1^2 + e_2^2    (39)

so that the ideal optimal control is

    u_e^* = -\frac{1}{2} R^{-1} [\hat{\psi}\sigma(x)]^T \frac{\partial V^*(e)}{\partial e} = -(\cos(2e_1) + 2) e_2    (40)

Similar to [16, 17], we select the activation function for the critic NN as $\phi_2(e) = [e_1^2, e_1 e_2, e_2^2]^T$; then the optimal weights $W_2 = [0.5, 0, 1]^T$ can be derived.

The estimated weights $\hat{W}_2$ are shown in Fig. 2 and converge to the optimal values. This means that the designed optimal control converges to the optimal solution (40). Moreover, the system states tracking the given trajectory are shown in Fig. 3, and the tracking errors are given in Fig. 4.

[Fig. 2: Convergence of the critic NN weights $\hat{W}_2$ ($\hat{W}_{21}, \hat{W}_{22}, \hat{W}_{23}$) over $t \in [0, 20]$ s.]

[Fig. 3: Evaluation of tracking performance; $x_1$ versus $x_{1d}$ and $x_2$ versus $x_{2d}$ over $t \in [0, 20]$ s.]

[Fig. 4: Tracking error convergence $e(t)$; $e_1$ and $e_2$ over $t \in [0, 20]$ s.]

From all the aforementioned simulation results, one may conclude that the proposed identifier can precisely estimate the unknown system dynamics, and that the proposed critic NN can approximate the solution of the HJB equation. Thus, optimal tracking control performance is obtained with the proposed control.
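For reproducibility, the pieces above can be assembled into an end-to-end sketch of the Section 5 study; this is our own minimal re-implementation under the reported settings ($k = 0.001$, $\ell = 1$, $\Gamma_1 = \Gamma_2 = 150$, $K_e = 0.65$), reusing the hypothetical `FilteredIdentifier`, `CriticLaw`, `steady_state_control` and `approx_optimal_control` helpers sketched earlier. Convergence in practice still hinges on the PE conditions of Theorems 1-3.

```python
import numpy as np

# True dynamics of (38), used only to simulate the plant.
f = lambda x: np.array([-x[0] + x[1],
                        -0.5 * x[0] - 0.5 * x[1] * (1 - (np.cos(2 * x[0]) + 2) ** 2)])
g = lambda x: np.array([[0.0], [np.cos(2 * x[0]) + 2]])

# Basis functions: xi for f, sigma for g, and the critic activations phi_2.
xi = lambda x: np.array([x[0], x[1], x[1] * (1 - (np.cos(2 * x[0]) + 2) ** 2)])
sigma = lambda x: np.array([[np.cos(2 * x[0])], [1.0]])  # psi @ sigma = [0; cos(2x1)+2]
grad_phi2 = lambda e: np.array([[2 * e[0], 0.0], [e[1], e[0]], [0.0, 2 * e[1]]])

dt, T, Ke = 1e-3, 20.0, 0.65
R_inv = np.eye(1)                       # Q = I, R = 1 as in Section 5
x = np.array([3.0, -1.0])               # x1(0) = 3, x2(0) = -1
ident = FilteredIdentifier(d=5, n=2, dt=dt)
critic = CriticLaw(l=3, dt=dt)

for step in range(int(T / dt)):
    t = step * dt
    xd = np.array([np.sin(t), np.cos(t) + np.sin(t)])
    xd_dot = np.array([np.cos(t), np.cos(t) - np.sin(t)])
    e = x - xd
    # split \hat W_1 into \hat theta (2x3) and \hat psi (2x2) blocks
    theta_hat, psi_hat = ident.W[:3].T, ident.W[3:].T
    ue = approx_optimal_control(e, x, psi_hat, sigma, grad_phi2, critic.W, R_inv)
    ud = steady_state_control(x, xd_dot, e, theta_hat, psi_hat, xi, sigma, Ke)
    u = ud + ue
    # measurable critic quantities, cf. (31)
    Xi = grad_phi2(e) @ (-Ke * e + (psi_hat @ sigma(x)) @ ue)
    Theta = float(e @ e + ue @ ue)
    # update both estimators and integrate the plant one Euler step
    ident.step(x, np.concatenate([xi(x), sigma(x) @ u]))
    critic.step(Xi, Theta)
    x = x + dt * (f(x) + (g(x) @ u))

print("final tracking error:", x - np.array([np.sin(T), np.cos(T) + np.sin(T)]))
```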

6  Conclusions

This paper is concerned with adaptive optimal tracking control for a class of continuous-time nonlinear systems with unknown dynamics. To obviate the requirement on the system dynamics, an adaptive identifier is proposed. A steady-state control for retaining the tracking performance is then combined with an optimal control for stabilizing the error dynamics in an optimal manner. A critic NN is used to learn the solution of the HJB equation online, and its weights are used to derive the optimal control action. Novel adaptive laws based on the parameter estimation error are developed for updating both the identifier weights and the critic NN weights. The stability of the closed-loop system and the convergence to the optimal solution are proved under conventional PE conditions. Simulation results validate the efficacy of the proposed methods.

Appendix: Proof of Theorem 3

Proof: Substituting (28) into (18) and using (26), we can obtain

    \dot{e} = -K_e e + \hat{\psi}\sigma(x) \hat{u}_e + \varepsilon_N
            = -K_e e + \frac{1}{2} B R^{-1} B^T \nabla\phi_2^T \tilde{W}_2 + B u_e^* + \frac{1}{2} B R^{-1} B^T \nabla\varepsilon_v + \varepsilon_N    (41)

where $B = \hat{\psi}\sigma(x)$ is a bounded variable. Consider the Lyapunov function

    V = V_1 + V_2 + V_3 = \frac{1}{2}\tilde{W}_1^T \Gamma_1^{-1} \tilde{W}_1 + \frac{1}{2}\tilde{W}_2^T \Gamma_2^{-1} \tilde{W}_2 + \Gamma e^T e + \eta V^*(e)    (42)

where $V^*(e)$ is the optimal cost function defined in (21) and $\eta > 0$, $\Gamma > 0$ are positive constants. Using the inequality $ab \le \eta a^2/2 + b^2/(2\eta)$ with $\eta > 0$, we can obtain

    \dot{V}_1 = -\tilde{W}_1^T P_1 \tilde{W}_1 + \tilde{W}_1^T \nu_1 \le -\sigma_1\|\tilde{W}_1\|^2 + \|\tilde{W}_1\|\varepsilon_{\nu 1} \le -(\sigma_1 - \frac{1}{2\eta})\|\tilde{W}_1\|^2 + \frac{\eta\varepsilon_{\nu 1}^2}{2}    (43)

and

    \dot{V}_2 = -\tilde{W}_2^T P_2 \tilde{W}_2 + \tilde{W}_2^T \nu_2 \le -(\sigma_2 - \frac{1}{2\eta})\|\tilde{W}_2\|^2 + \frac{\eta\varepsilon_{\nu 2}^2}{2}    (44)

Moreover, one may deduce $\dot{V}_3$ from (21) and (41) as

    \dot{V}_3 = 2\Gamma e^T \dot{e} + \eta \dot{V}^*(e)
              = 2\Gamma e^T \left(-K_e e + \frac{1}{2} B R^{-1} B^T \nabla\phi_2^T \tilde{W}_2 + B u_e^* + \frac{1}{2} B R^{-1} B^T \nabla\varepsilon_v + \varepsilon_N\right) - \eta (e^T Q e + u_e^{*T} R u_e^*)
              \le -\left[K_e\Gamma + \eta\lambda_{\min}(Q) - \Gamma\|B^T R^{-1} B \nabla\phi_2\| - \Gamma\|B^T R^{-1} B\| - 2\Gamma\right]\|e\|^2 + \frac{1}{4}\Gamma\|B^T R^{-1} B \nabla\phi_2\|\|\tilde{W}_2\|^2 - \left[\eta\lambda_{\min}(R) - \Gamma\|B\|^2\right]\|u_e^*\|^2 + \frac{1}{2}\Gamma\|B^T R^{-1} B\|\nabla\varepsilon_v^T\nabla\varepsilon_v + \Gamma\varepsilon_N^T\varepsilon_N    (45)

Consequently, using $\|\nabla\phi_2\| \le \phi_M$, we have

    \dot{V} = \dot{V}_1 + \dot{V}_2 + \dot{V}_3
            \le -(\sigma_1 - \frac{1}{2\eta})\|\tilde{W}_1\|^2 - \left(\sigma_2 - \frac{1}{2\eta} - \frac{1}{4}\Gamma\phi_M\|B^T R^{-1} B\|\right)\|\tilde{W}_2\|^2 - \left[K_e\Gamma + \eta\lambda_{\min}(Q) - \Gamma\|B^T R^{-1} B\|(\phi_M + 1) - 2\Gamma\right]\|e\|^2 - \left[\eta\lambda_{\min}(R) - \Gamma\|B\|^2\right]\|u_e^*\|^2 + \frac{1}{2}\Gamma\|B^T R^{-1} B\|\nabla\varepsilon_v^T\nabla\varepsilon_v + \Gamma\varepsilon_N^T\varepsilon_N + \frac{\eta\varepsilon_{\nu 2}^2}{2} + \frac{\eta\varepsilon_{\nu 1}^2}{2}    (46)

If the parameters $\eta$, $\Gamma$ and $K_e$ are appropriately chosen such that

    \eta > \max\left\{\frac{1}{2\sigma_1}, \frac{1}{2\sigma_2}, \frac{\Gamma\|B\|^2}{\lambda_{\min}(R)}\right\},  \Gamma < \frac{4\eta\sigma_2 - 2}{\eta\phi_M\|B^T R^{-1} B\|},  K_e > \|B^T R^{-1} B\|(\phi_M + 1) + 2 - \frac{\eta\lambda_{\min}(Q)}{\Gamma}

then (46) can be further presented as

    \dot{V} \le -a_1\|\tilde{W}_1\|^2 - a_2\|\tilde{W}_2\|^2 - a_3\|e\|^2 + \gamma    (47)

where $a_1 = \sigma_1 - 1/(2\eta)$, $a_2 = \sigma_2 - 1/(2\eta) - \frac{1}{4}\Gamma\phi_M\|B^T R^{-1} B\|$, $a_3 = K_e\Gamma + \eta\lambda_{\min}(Q) - \Gamma\|B^T R^{-1} B\|(\phi_M + 1) - 2\Gamma$ and $\gamma = \frac{1}{2}\Gamma\|B^T R^{-1} B\|\nabla\varepsilon_v^T\nabla\varepsilon_v + \Gamma\varepsilon_N^T\varepsilon_N + \frac{\eta(\varepsilon_{\nu 1}^2 + \varepsilon_{\nu 2}^2)}{2}$ are all positive constants; $\gamma$ captures the effect of the identifier errors $\varepsilon_N$, $\varepsilon_{\nu 1}$ and the critic NN errors $\nabla\varepsilon_v$, $\varepsilon_{\nu 2}$. Then it can be shown that $\dot{V}$ is negative if

    \|\tilde{W}_1\| > \sqrt{\gamma/a_1},  \|\tilde{W}_2\| > \sqrt{\gamma/a_2},  \|e\| > \sqrt{\gamma/a_3}    (48)

which implies that the tracking error $e$ and the NN weight errors $\tilde{W}_1$ and $\tilde{W}_2$ are all uniformly ultimately bounded. Furthermore, we have

    \hat{u}_e - u_e^* = \frac{1}{2} R^{-1}[\hat{\psi}\sigma(x)]^T \nabla\phi_2^T \tilde{W}_2 + \frac{1}{2} R^{-1}[\hat{\psi}\sigma(x)]^T \nabla\varepsilon_v    (49)

As $t \to \infty$, the upper bound of (49) is

    \lim_{t\to\infty} \|\hat{u}_e - u_e^*\| \le \frac{1}{2}\|R^{-1}[\hat{\psi}\sigma(x)]^T\| (\phi_M \|\tilde{W}_2\| + \|\nabla\varepsilon_v\|) \le \varepsilon_u    (50)

where $\varepsilon_u > 0$ is a positive constant depending on the critic NN approximation error $\nabla\varepsilon_v$ and the weight error $\tilde{W}_2$. This completes the proof.

References

[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. Wiley, 2012.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[3] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol.12, no.1, p.219-245, 2000.
[4] P. J. Werbos, "A menu of designs for reinforcement learning over time," Neural Networks for Control, p.67-95, 1990.


[5] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, Handbook of Learning and Approximate Dynamic Programming. IEEE Press, 2004.
[6] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: an introduction," IEEE Computational Intelligence Magazine, vol.4, no.2, p.39-47, 2009.
[7] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol.9, no.3, p.32-50, 2009.
[8] H.-G. Zhang, X. Zhang, Y.-H. Luo, and J. Yang, "An overview of research on adaptive dynamic programming," Acta Automatica Sinica, vol.39, no.4, p.303-311, 2013.
[9] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol.38, no.4, p.943-949, 2008.
[10] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, 2012.
[11] T. Dierks, B. T. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol.22, no.5, p.851-860, 2009.
[12] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Transactions on Neural Networks, vol.18, no.3, p.631-647, 2007.
[13] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol.41, no.5, p.779-791, 2005.
[14] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol.22, no.3, p.237-246, 2009.
[15] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, "Adaptive optimal control for continuous-time linear systems based on policy iteration," Automatica, vol.45, no.2, p.477-484, 2009.
[16] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol.46, no.5, p.878-888, 2010.
[17] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol.49, no.1, p.82-92, 2013.
[18] H. Modares, F. L. Lewis, and M. B. Naghibi-Sistani, "Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks," IEEE Transactions on Neural Networks and Learning Systems, in press, 2013.
[19] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol.22, no.12, p.2226-2236, 2011.
[20] J. Na, G. Herrmann, X. Ren, M. N. Mahyuddin, and P. Barber, "Robust adaptive finite-time parameter estimation and control of nonlinear systems," in Proceedings of the IEEE International Symposium on Intelligent Control (ISIC), Denver, CO, USA, p.1014-1019, 2011.
[21] V. Nevistic and J. A. Primbs, "Constrained nonlinear optimal control: a converse HJB approach," Technical Report, California Institute of Technology, Pasadena, CA, 1996.