
IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 1, NO. 4, OCTOBER 2014

Online Adaptive Approximate Optimal Tracking Control with Simplified Dual Approximation Structure for Continuous-time Unknown Nonlinear Systems

Jing Na, Guido Herrmann

Abstract—This paper proposes an online adaptive approximate solution for the infinite-horizon optimal tracking control problem of continuous-time nonlinear systems with unknown dynamics. The requirement of the complete knowledge of system dynamics is avoided by employing an adaptive identifier in conjunction with a novel adaptive law, such that the estimated identifier weights converge to a small neighborhood of their ideal values. An adaptive steady-state controller is developed to maintain the desired tracking performance at the steady-state, and an adaptive optimal controller is designed to stabilize the tracking error dynamics in an optimal manner. For this purpose, a critic neural network (NN) is utilized to approximate the optimal value function of the Hamilton-Jacobi-Bellman (HJB) equation, which is used in the construction of the optimal controller. The learning of two NNs, i.e., the identifier NN and the critic NN, is continuous and simultaneous by means of a novel adaptive law design methodology based on the parameter estimation error. Stability of the whole system consisting of the identifier NN, the critic NN and the optimal tracking control is guaranteed using Lyapunov theory; convergence to a near-optimal control law is proved. Simulation results exemplify the effectiveness of the proposed method.

Index Terms—Adaptive control, optimal control, approximate dynamic programming, system identification.

I. INTRODUCTION

AMONG various modern control methodologies, optimal control has been well-recognized and successfully verified in some real-world applications[1]. It is concerned with finding a control policy that drives a dynamical system to a desired reference in an optimal way, i.e., such that a prescribed cost function is minimized. In general, the optimal control

Manuscript received July 7, 2013; accepted March 24, 2014. This work was supported by National Natural Science Foundation of China (61203066). Recommended by Associate Editor Zhongsheng Hou.
Citation: Jing Na, Guido Herrmann. Online adaptive approximate optimal tracking control with simplified dual approximation structure for continuous-time unknown nonlinear systems. IEEE/CAA Journal of Automatica Sinica, 2014, 1(4): 412−422.
Jing Na is with the Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, 650093, China (e-mail: [email protected]).
Guido Herrmann is with the Department of Mechanical Engineering, University of Bristol, BS8 1TR, UK (e-mail: [email protected]).

can be derived by using Pontryagin's minimum principle, or by solving the Hamilton-Jacobi-Bellman (HJB) equation. Although mathematically elegant, traditional optimal control designs are obtained offline and impose the assumption of complete knowledge of the system dynamics[2]. To allow for uncertainties in system dynamics, adaptive control[3−4] has been developed, where the unknown system parameters are updated/estimated online by using the tracking error, such that the tracking error convergence and the boundedness of the parameter estimates can be guaranteed. However, classical adaptive control methods are generally far from optimal. With the wish to achieve adaptive optimal control, one may add optimality features to an adaptive controller, i.e., drive the adaptation by an optimality criterion. An alternative solution is to incorporate adaptive features into an optimal control design, e.g., improve the optimal control policy by means of the updated system parameters[2]. Recently, a bio-inspired method, reinforcement learning (RL)[5−7], developed in the computational intelligence and machine learning communities, has provided a means to design adaptive controllers in an optimal manner. Considering the similarities between optimal control and RL, Werbos[8] introduced an RL-based actor-critic framework, called approximate dynamic programming (ADP), where neural networks (NNs) are trained to approximately solve the optimal control problem based on the so-called value iteration (VI) method. A survey of ADP-based feedback control designs can be found in [9−12]. The discrete/iterative nature of the ADP formulation lends itself naturally to the design of discrete-time (DT) optimal control[13−15]. However, the extension of RL-based controllers to continuous-time (CT) systems entails challenges in proving stability and convergence for a model-free algorithm that can be solved online.
Some of the existing ADP algorithms for CT nonlinear systems lacked a rigorous stability analysis[6, 16]. By incorporating NNs into the actor-critic structure, an offline method was proposed in [17] to find approximate solutions of optimal control for CT nonlinear systems. In [2, 18], an online integral RL technique was developed to find the optimal control for CT systems


without using the system drift dynamics, which led to a hybrid continuous-time/discrete-time sampled-data controller based on policy iteration (PI) with a two time-scale actor-critic learning process. This learning procedure was based on sequential updates of the critic (policy evaluation) NN and actor (policy improvement) NN. Thus, while one NN was tuned, the other one remained constant. Vamvoudakis and Lewis[19] further extended this idea by designing an improved online ADP algorithm called synchronous PI, which involved simultaneous tuning of both actor and critic NNs by minimizing the Bellman error, i.e., both NNs were tuned at the same time by using the proposed adaptive laws to approximately solve the CT infinite horizon optimal control problem. To avoid the need for the complete knowledge of system dynamics in [19], a novel actor-critic-identifier architecture was proposed in [20], where an extra NN of the identifier was employed in conjunction with an actor-critic controller to identify the unknown system dynamics. Although the states of the identifier converged to their true values, the identifier NN weight convergence was not guaranteed. Moreover, the knowledge of the input dynamics was still required. On the other hand, most of the ADP based optimal control methods have been developed to address the stabilization or regulation problem, and only a few results have been reported for optimal tracking control[21−23]. For these results, the key idea is to superimpose an optimal control that stabilizes the error dynamics at the transient stage in an optimal way under the assumption of a traditional steady-state tracking controller (e.g., feedback linearization control, adaptive control). In [23], an observer was adopted to reconstruct unknown system states, while in [21] an adaptive NN identifier was used to online estimate unknown system dynamics.
Although the obtained control input was ensured to be close to the optimal control within a small bound, it was not guaranteed that the NN identifier weights stayed bounded in a compact neighborhood of their ideal values. In this paper, we will provide a solution where the convergence of the identifier weights is guaranteed and the convergence of the critic NN weights to a nearly optimal control solution is shown. To the best of our knowledge, ADP-based optimal tracking control has rarely been designed for CT systems with unknown nonlinear dynamics and guaranteed parameter estimation convergence. In this paper, we propose a new ADP algorithm for solving the optimal tracking control problem of nonlinear systems with unknown dynamics. Inspired by the work of [20−21], the requirement of the complete or at least partial knowledge of system dynamics in the existing ADP algorithms for CT systems is eliminated. This is achieved by constructing an adaptive NN for the identifier of system dynamics; a novel adaptive law based on the parameter estimation error[24] is utilized such that, even in the presence of an NN approximation error, the identifier NN weights are guaranteed to converge to a small region around their true values under a standard persistent excitation (PE) condition or a slightly more relaxed singular value condition for a filtered regressor matrix.


To achieve optimal tracking control, an adaptive steady-state control for maintaining desired tracking at the steady-state is augmented with an adaptive optimal control for stabilizing the tracking error dynamics in an optimal manner. To design such an optimal control, a critic NN is employed to online approximate the solution to the HJB equation. Thus, the optimal value function is obtained, which is then used to calculate the control action. The identifier parameters and critic NN weights are online updated continuously and simultaneously. In particular, a direct parameter estimation scheme is used to estimate NN weights; this is in contrast to the minimization of the Bellman error or the residual approximation error in the HJB equation by using least-squares[20] or the modified Levenberg-Marquardt algorithms[19]. We will also show that the identifier weight estimation error affects the critic NN convergence; the conventional PE condition or again a relaxed condition on a filtered regressor matrix is sufficient to guarantee parameter estimation convergence. To this end, a novel adaptation scheme based on the parameter estimation error that was originally proposed in our previous work [24] is employed for updating both identifier weights and critic NN weights; this may lead to fast convergence and provides an easy online check of the required convergence condition[24]. Finally, the stability of the overall system and the uniform ultimate boundedness (UUB) of the identifier and critic weights are proved by using Lyapunov theory, and the obtained control guarantees the tracking of a desired trajectory, while also asymptotically converging to a small bound around the optimal policy.

The main contributions can be summarized as follows.

1) The optimal tracking control problem of nonlinear CT systems is studied by proposing a new critic-identifier based ADP control configuration. The actor NN is not necessary to prove the overall stability.
Thus, instead of the triple-approximation structure, this introduces a simplified dual-approximation method. To achieve tracking control, a steady-state control is used in conjunction with an adaptive optimal control such that the overall control converges to the optimal solution within a small bound.

2) A novel adaptation design methodology based on the parameter estimation error is proposed such that the weights of both the identifier NN and critic NN are online updated simultaneously. With this framework, all these weights are 'directly' estimated with guaranteed convergence rather than updated to minimize the identifier error and Bellman error by using the gradient-based schemes (e.g., least-squares in [20]). It is shown that the convergence of the identifier weights to their true values in a bounded sense is achieved, which is also important for the convergence of the optimal control.

The paper is organized as follows. Section II provides the formulation of the optimal control problem. Section III discusses the design of the identifier to accommodate unknown system dynamics. Section IV presents the adaptive tracking control design and the closed-loop stability analysis. Section V presents simulation examples that show the effectiveness of the proposed method, and Section VI gives some conclusions.


II. PROBLEM FORMULATION

Consider a continuous-time nonlinear system

$$\dot{x} = F(x, u), \quad (1)$$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$ are the output and input of the studied system, and $F(x, u): \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is a Lipschitz continuous nonlinear function on a compact set $\Omega \subset \mathbb{R}^n \times \mathbb{R}^m$ that contains the origin, such that the solution $x$ of system (1) is unique for any finite initial condition $x_0$ and control $u$. This paper addresses the optimal tracking control for system (1), i.e., finding an adaptive controller $u$ which ensures that the system output $x$ tracks a given trajectory $x_d$ and fulfills the following infinite-horizon cost function[21] (in a sub-optimal sense):

$$V^*(e(t)) = \min_{u(\tau) \in \Psi(\Omega)} \int_t^{\infty} r(e(\tau), u(\tau))\, \mathrm{d}\tau, \quad (2)$$

where $\Psi(\Omega)$ is a set of admissible control policies[17], $e = x - x_d$ is the tracking error, and $r(\cdot, \cdot): \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, $r(e(\tau), u(\tau)) \geq 0$ is the utility function, to be defined later. Note that in this paper the command reference $x_d$ and its derivative $\dot{x}_d$ are both continuous and bounded. It should also be noted that the tracking error $e$ rather than the system state $x$ is used in cost function (2), because the tracking control rather than the regulation problem is studied in this paper.

Remark 1. Many industrial processes can be modeled as system (1), such as missile systems[25], robotic manipulators[4] and biochemical processes[26]. Although some results[2, 11−12, 19−20] have been recently developed to address the optimal regulation problem of (partially unknown) system (1) by means of ADP, only a few results[21−22, 25] have been reported concerning the tracking control of system (1). In [22], the plant system is assumed to be precisely known.

To facilitate the control design, the following assumption is made about system (1).

Assumption 1[27]. The function $F(x, u)$ in (1) is continuous and satisfies a locally Lipschitz condition such that (1) has a unique solution on the set $\Omega$ that contains the origin. The control action $u$ has a control-affine form as in [19−20] with constant input gain $B$.

Since the dynamics of (1) are unknown, the optimal tracking control design presented in this paper is divided into two steps as in [20−21]:

1) Propose an adaptive identifier by using input-output data to reconstruct the unknown system dynamics of (1);

2) Design an adaptive optimal tracking controller based on the identified dynamics and ADP methods.

III. ADAPTIVE IDENTIFIER BASED ON PARAMETER ESTIMATION ERROR

In this section, an adaptive identifier is established to reconstruct the unknown system dynamics using available input-output measurements. From Assumption 1, system (1) can be rewritten in the form of a recursive neural network (RNN)[21, 27−28]:

$$\dot{x} = Ax + Bu + C^{\mathrm{T}} f(x) + \varepsilon, \quad (3)$$

where $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$ are known matrices, $C \in \mathbb{R}^{p \times n}$ is the unknown weight matrix, $\varepsilon \in \mathbb{R}^n$ is a bounded approximation error of the RNN, and $f(x) \in \mathbb{R}^p$ is a nonlinear regressor function vector, which is Lipschitz continuous such that $\|f(x) - f(y)\| \leq \kappa \|x - y\|$ holds for some positive constant $\kappa > 0$.

To determine the unknown parameters $C$, we define the filtered variables $x_f$, $u_f$, $f_f$ of $x$, $u$, $f$ as

$$\begin{cases} k\dot{x}_f + x_f = x, & x_f(0) = 0, \\ k\dot{u}_f + u_f = u, & u_f(0) = 0, \\ k\dot{f}_f + f_f = f, & f_f(0) = 0, \end{cases} \quad (4)$$

where $k \in \mathbb{R}$ is a positive scalar constant filter parameter. Then for any positive scalar constant $\ell > 0$, we define the filtered and 'integrated' regressor matrices $P_1 \in \mathbb{R}^{p \times p}$ and $Q_1 \in \mathbb{R}^{p \times n}$ as

$$\begin{cases} \dot{P}_1 = -\ell P_1 + f_f f_f^{\mathrm{T}}, & P_1(0) = 0, \\ \dot{Q}_1 = -\ell Q_1 + f_f \left( \dfrac{x - x_f}{k} - A x_f - B u_f \right)^{\mathrm{T}}, & Q_1(0) = 0, \end{cases} \quad (5)$$

and another auxiliary matrix $M_1 \in \mathbb{R}^{p \times n}$ calculated based on $P_1$ and $Q_1$ as

$$M_1 = P_1 \hat{C} - Q_1, \quad (6)$$

where $\hat{C}$ is the estimate of $C$. Then the adaptive law for estimating $\hat{C}$ is provided by

$$\dot{\hat{C}} = -\Gamma_1 M_1, \quad (7)$$

with $\Gamma_1 > 0$ being a constant, positive definite learning gain matrix.

Lemma 1[24]. Under the assumption that variables $x$ and $u$ in (3) are bounded, vector $M_1$ in (6) can be reformulated as $M_1 = -P_1 \tilde{C} + \psi_1$ for bounded $\psi_1(t) = -\int_0^t e^{-\ell(t-r)} f_f(r) \varepsilon_f^{\mathrm{T}}(r)\, \mathrm{d}r$, where $\tilde{C} = C - \hat{C}$ is the estimation error.

Proof. For the ordinary matrix differential equation (5), one can obtain its solution as[24]

$$\begin{cases} P_1 = \int_0^t e^{-\ell(t-r)} f_f(r) f_f^{\mathrm{T}}(r)\, \mathrm{d}r, \\ Q_1 = \int_0^t e^{-\ell(t-r)} f_f(r) \left( \dfrac{x(r) - x_f(r)}{k} - A x_f(r) - B u_f(r) \right)^{\mathrm{T}} \mathrm{d}r. \end{cases} \quad (8)$$

On the other hand, by applying the linear filter operation (4) on both sides of (3), it can be obtained that

$$\dot{x}_f = A x_f + B u_f + C^{\mathrm{T}} f_f + \varepsilon_f, \quad (9)$$

where $\varepsilon_f$ is the filtered version of the bounded error $\varepsilon$ in terms of $k\dot{\varepsilon}_f + \varepsilon_f = \varepsilon$ (vector $\varepsilon_f$ will be used only for analysis). Then from the first equation of (4), it is found that

$$\dot{x}_f = \frac{x - x_f}{k}. \quad (10)$$


Consequently, we can obtain from (9) and (10) that

$$\frac{x - x_f}{k} = A x_f + B u_f + C^{\mathrm{T}} f_f + \varepsilon_f. \quad (11)$$

By substituting (11) into (8), we have $Q_1 = P_1 C - \psi_1$ with $\psi_1(t) = -\int_0^t e^{-\ell(t-r)} f_f(r) \varepsilon_f^{\mathrm{T}}(r)\, \mathrm{d}r$ being a bounded variable, i.e., $\|\psi_1\| \leq \varepsilon_{1\psi}$ for a constant $\varepsilon_{1\psi} > 0$, because the NN regressor function $f(x)$ and approximation error $\varepsilon$ are all bounded for bounded $x$ and $u$. Then, (6) can be rewritten as

$$M_1 = P_1 \hat{C} - Q_1 = -P_1 \tilde{C} + \psi_1, \quad (12)$$

where $\tilde{C} = C - \hat{C}$ is the estimation error. □

Moreover, to prove the parameter estimation convergence, we need to analyze the positive definiteness of $P_1$. Denote $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ as the maximum and minimum eigenvalues of the corresponding matrices; then we have the following lemma.

Lemma 2[24]. If the regressor function vector $f(x)$ defined in (3) is persistently excited[3], then the matrix $P_1$ defined in (5) is positive definite, i.e., its minimum eigenvalue satisfies $\lambda_{\min}(P_1) > \sigma_1 > 0$ with $\sigma_1$ being a positive constant.

We refer to [24] for the detailed proof of Lemma 2. Now, we have the following result.

Theorem 1. If $x$ and $u$ in (3) are bounded and the minimum eigenvalue of $P_1$ satisfies $\lambda_{\min}(P_1) > \sigma_1 > 0$ for system (3) with the parameter estimation (7), then we have

1) For $\varepsilon = 0$ (i.e., no reconstruction error), the estimation error $\tilde{C}$ exponentially converges to zero;

2) For $\varepsilon \neq 0$ (i.e., with bounded approximation error), the estimation error $\tilde{C}$ converges to a compact set around zero.

Proof. Consider the Lyapunov function candidate $V_1 = \frac{1}{2}\mathrm{tr}(\tilde{C}^{\mathrm{T}} \Gamma_1^{-1} \tilde{C})$; then the derivative $\dot{V}_1$ is obtained from (7) as

$$\dot{V}_1 = \mathrm{tr}(\tilde{C}^{\mathrm{T}} \Gamma_1^{-1} \dot{\tilde{C}}) = -\mathrm{tr}(\tilde{C}^{\mathrm{T}} P_1 \tilde{C}) + \mathrm{tr}(\tilde{C}^{\mathrm{T}} \psi_1). \quad (13)$$

1) In case $\varepsilon = 0$ and thus $\psi_1 = 0$, (13) reduces to

$$\dot{V}_1 = -\mathrm{tr}(\tilde{C}^{\mathrm{T}} P_1 \tilde{C}) \leq -\sigma_1 \|\tilde{C}\|^2 \leq -\mu_1 V_1, \quad (14)$$

where $\mu_1 = 2\sigma_1 / \lambda_{\max}(\Gamma_1^{-1})$ is a positive constant. Then according to Lyapunov's theorem (Theorem 3.4.1 in [4], p. 110), the parameter estimation error $\tilde{C}$ converges to zero exponentially, where the convergence rate depends on the excitation level $\sigma_1$ and the learning gain $\Gamma_1$.

2) In case there is a bounded approximation error $\varepsilon \neq 0$, (13) can be further presented as

$$\dot{V}_1 = -\mathrm{tr}(\tilde{C}^{\mathrm{T}} P_1 \tilde{C}) + \mathrm{tr}(\tilde{C}^{\mathrm{T}} \psi_1) \leq -\|\tilde{C}\| \big( \tilde{\sigma}_1 \sqrt{V_1} - \varepsilon_{1\psi} \big) \quad (15)$$

for $\tilde{\sigma}_1 = \sigma_1 \sqrt{2 / \lambda_{\max}(\Gamma_1^{-1})}$ being a positive constant. Then according to the extended Lyapunov theorem (Theorem 3.4.3 in [4], p. 111), the parameter estimation error $\tilde{C}$ uniformly ultimately converges to the compact set $\Omega_1 := \{ \tilde{C} \mid \sqrt{V_1} \leq \varepsilon_{1\psi} / \tilde{\sigma}_1 \}$, whose size depends on the bound of the approximation error $\varepsilon_{1\psi}$ and the excitation level $\sigma_1$. This completes the proof. □
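The identifier of Section III can be sketched numerically. The following minimal example is not from the paper: the 1-D plant, regressor choice, gains, and the Euler discretization are all illustrative assumptions. It simulates the RNN model (3) with no approximation error, runs the filters (4), the regressor matrices (5), and the adaptive law (7), and evaluates the minimum eigenvalue of $P_1$ used in Lemma 2.

```python
# Minimal numerical sketch (illustrative assumptions, not the paper's example)
# of the Section III identifier: filters (4), regressor matrices (5),
# auxiliary matrix (6) and adaptive law (7), Euler-discretized.
# The RNN form (3) is exact here (epsilon = 0), so C_hat should approach C.
import numpy as np

n, p = 1, 2
A = np.array([[-1.0]])                     # known matrices in (3)
B = np.array([[1.0]])
C = np.array([[0.5], [-0.3]])              # "unknown" ideal weights, C in R^{p x n}
f = lambda x: np.array([np.sin(x[0]), np.cos(x[0])])   # regressor f(x) in R^p

k, ell = 0.01, 0.5                         # filter constant k, forgetting rate ell
Gamma1 = 100.0 * np.eye(p)                 # learning gain of (7)
dt, T = 1e-3, 30.0

x = np.array([0.1])
xf, uf, ff = np.zeros(n), np.zeros(1), np.zeros(p)
P1, Q1 = np.zeros((p, p)), np.zeros((p, n))
C_hat = np.zeros((p, n))

for i in range(int(T / dt)):
    t = i * dt
    u = np.array([2.0 * np.sin(2.0 * t) + np.cos(5.0 * t)])  # exciting input
    x = x + dt * (A @ x + B @ u + C.T @ f(x))   # plant (3) with epsilon = 0
    # low-pass filters (4): k * zdot_f + z_f = z
    xf = xf + dt * (x - xf) / k
    uf = uf + dt * (u - uf) / k
    ff = ff + dt * (f(x) - ff) / k
    # filtered 'integrated' regressor matrices (5)
    P1 = P1 + dt * (-ell * P1 + np.outer(ff, ff))
    Q1 = Q1 + dt * (-ell * Q1 + np.outer(ff, (x - xf) / k - A @ xf - B @ uf))
    # auxiliary matrix (6) and adaptive law (7)
    M1 = P1 @ C_hat - Q1
    C_hat = C_hat + dt * (-Gamma1 @ M1)

print(np.linalg.eigvalsh(P1).min(), C_hat.ravel())
```

The final check mirrors Remark 3 below: a positive minimum eigenvalue of $P_1$ certifies the convergence condition online. With a nonzero reconstruction error $\varepsilon$, the estimate would instead settle in the residual set $\Omega_1$ of Theorem 1.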


Remark 2. For adaptive law (7), the variable $M_1$ of (6), obtained from $P_1$, $Q_1$ in (5), contains the information on the weight estimation error $P_1 \tilde{C}$ as shown in (12), where the residual error $\psi_1$ vanishes for vanishing NN approximation error $\varepsilon \to 0$. It is well known that $\varepsilon \to 0$ holds for sufficiently large numbers of hidden layer nodes in identifier (3), i.e., $p \to +\infty$. Thus $M_1$ can be used to drive parameter estimation (7). Consequently, the parameter estimate $\hat{C}$ can be directly obtained without using an observer/predictor error, in comparison to [20−21] (see Theorem 1 in [20−21]).

Remark 3. Lemma 2 shows that the required condition (i.e., $\lambda_{\min}(P_1) > \sigma_1 > 0$) for the parameter estimation convergence in this paper can be fulfilled under a conventional PE condition[3]. In general, the direct online validation of the PE condition is difficult, in particular for a nonlinear system. To this end, Lemma 2 provides a numerically verifiable way to online validate the required convergence condition of the novel adaptation law (7), i.e., by calculating the minimum eigenvalue of matrix $P_1$ to test $\lambda_{\min}(P_1) > \sigma_1 > 0$. This condition does not necessarily imply the PE condition of $f(x)$. It is also to be noticed that the PE condition of $f(x)$ can be suitably weakened when a well-designed control is imposed[24, 29], e.g., transformed into an a priori verifiable 'sufficient richness (SR)' requirement on the command reference.

IV. ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL

As shown in Section III, the unknown weight parameter $C$ can be online estimated. Without loss of generality, we assume that there is an unavoidable approximation error $\varepsilon$ in (3), such that the estimated weight matrix $\hat{C}$ converges to a compact set around its true value $C$. In this case, system (1) can be rewritten as

$$\dot{x} = Ax + Bu + \hat{C}^{\mathrm{T}} f(x) + \varepsilon + \varepsilon_N, \quad (16)$$

where $\varepsilon_N = \tilde{C}^{\mathrm{T}} f(x)$ can be taken as an adaptation error whose boundedness will be shown in a later analysis, i.e., $\|\varepsilon_N\| \leq \phi_N$ for a constant $\phi_N > 0$ in the compact set $\Omega$. Then the optimal control of (1) is transformed into the optimal control of (16). In this section, the optimal controller design for (16) will be provided in detail.

To achieve optimal tracking control, it is noted that the overall control $u$ can be composed of two parts as $u = u_s + u_e$, where $u_s$ is the adaptive steady-state control used to maintain the tracking error close to zero at the steady-state stage, and $u_e$ is the adaptive optimal control designed to stabilize the tracking error dynamics in the control transient in an optimal manner[21−22, 25]. Consider the tracking error $e = x - x_d$, so that

$$\dot{e} = Ax + Bu + \hat{C}^{\mathrm{T}} f(x) - \dot{x}_d + \varepsilon + \varepsilon_N. \quad (17)$$

Since the adaptive steady-state control $u_s$ is used to guarantee a steady state at zero for the tracking error, it should be designed to retain the steady-state dynamics $\dot{e} = \dot{x} - \dot{x}_d = 0$ in


(17), i.e., $u = u_s$ needs to guarantee $x = x_d$ for $\varepsilon + \varepsilon_N = 0$. Thus, the steady-state control signal $u_s$ can be selected as

$$u_s = B^{\otimes} \big( \dot{x}_d - A x_d - \hat{C}^{\mathrm{T}} f(x_d) \big), \quad (18)$$

where $B^{\otimes}$ denotes the generalized inverse of $B$. Note that the input gain $B$ in (3) is assumed to be known but may not be invertible (i.e., it may have rank lower than $n$). It is shown that $u_s$ depends on the available variables $x_d$, $\dot{x}_d$, $A$, $\hat{C}$, and thus can be implemented based on identifier (3) with adaptive law (7). Substituting (18) into (17), the tracking error dynamics can be rewritten as

$$\dot{e} = Ae + \hat{C}^{\mathrm{T}} [f(x) - f(x_d)] + B u_e + \varepsilon + \varepsilon_N + \varepsilon_{\varphi}, \quad (19)$$

where $\varepsilon_{\varphi} = (B B^{\otimes} - I)(\dot{x}_d - A x_d - \hat{C}^{\mathrm{T}} f(x_d))$ denotes the residual error due to the generalized inverse of $B$, which is clearly bounded because $x_d$ and $\dot{x}_d$ are bounded and $f$ is Lipschitz continuous, i.e., $\|\varepsilon_{\varphi}\| \leq \phi_p$ for a constant $\phi_p > 0$. It is noted that $\varepsilon_{\varphi}$ will be null under the so-called matching condition, e.g., $(B B^{\otimes} - I)A = 0$ or $(B B^{\otimes} - I)\hat{C}^{\mathrm{T}} = 0$, which is a standard condition raised in nonlinear control when counteracting disturbances[30−31].

As shown above, by using the adaptive steady-state control $u_s$ in (18), the error dynamics can be further presented as (19), which is not necessarily stable, in particular for a system with identifier error $\varepsilon + \varepsilon_N$. In this sense, the tracking problem of system (16) can be reduced to the regulation problem of (19). Hence, the adaptive optimal control $u_e$ will be designed to stabilize the tracking error dynamics (19) in an approximately optimal manner. In this case, the optimal value function (2) for system (1) can be reformulated using $u_e$ from system (19) to provide the value function

$$V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\, \mathrm{d}\tau, \quad (20)$$

where the utility function can be chosen as $r(e(\tau), u_e(e(\tau))) = e^{\mathrm{T}} Q e + u_e^{\mathrm{T}} R u_e$ with $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ being symmetric positive definite matrices. Thus, the tracking problem is optimized by using control $u_e$, which optimally stabilizes $e$. It will be shown below that $u_e$ is a function of the tracking error $e$.

Definition 1[17]. A control policy $\mu(e)$ is defined as admissible with respect to (20) on a compact set $\Omega$, denoted by $\mu(e) \in \Psi(\Omega)$, if $\mu(e)$ is continuous on $\Omega$, $\mu(0) = 0$, $\mu(e)$ stabilizes (19) on $\Omega$, and $V(e)$ is finite $\forall e \in \Omega$.

The remaining problem can be formulated as: given the CT error system (19) with the admissible control set $\mu(e) \in \Psi(\Omega)$ and the infinite-horizon cost function (20), find an admissible control policy $u_e(e) \in \Psi(\Omega)$ such that cost (20) associated with system (19) is minimized. For this purpose, we define the Hamiltonian of system (19) as

$$H(e, u_e, V) = V_e^{\mathrm{T}} \big[ Ae + \hat{C}^{\mathrm{T}} (f(x) - f(x_d)) + B u_e + \varepsilon + \varepsilon_N + \varepsilon_{\varphi} \big] + e^{\mathrm{T}} Q e + u_e^{\mathrm{T}} R u_e, \quad (21)$$

where $V_e := \partial V / \partial e$ denotes the partial derivative of the value function $V$ with respect to $e$. The optimal cost function $V^*(e)$ is defined as

$$V^*(e) = \min_{u_e \in \Psi(\Omega)} \left\{ \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\, \mathrm{d}\tau \right\}, \quad (22)$$

which satisfies the HJB equation

$$0 = \min_{u_e \in \Psi(\Omega)} \big[ H(e, u_e^*, V^*) \big]. \quad (23)$$

Then we can obtain the optimal control $u_e^*$ by solving $\partial H(e, u_e^*, V^*) / \partial u_e^* = 0$ as

$$u_e^* = -\frac{1}{2} R^{-1} B^{\mathrm{T}} \frac{\partial V^*(e)}{\partial e}, \quad (24)$$

where $V^*$ is the solution to the HJB equation (23).

Remark 4. In order to find the optimal control (24), one needs to solve the HJB equation (23) for the value function $V^*(e)$ and then substitute the solution into (24) to obtain the optimal control $u_e^*$. For linear systems with a quadratic cost functional, the equivalent of the HJB equation is the well-known Riccati equation[1]. However, for nonlinear systems, the HJB equation (23) is a nonlinear partial differential equation (PDE), which is difficult to solve. In the literature, there are a number of results concerning the optimal control of (19) in terms of critic-actor based ADP schemes, where two NNs, i.e., a critic NN and an actor NN, are employed to approximate the value function and its corresponding policy. However, some of them run in an offline manner[17] and/or require at least partial knowledge of the system dynamics[2, 18−20, 22]. In the following, an online adaptive algorithm will be proposed to derive the optimal control solution for system (19) using the NN identifier introduced in the previous section and another critic NN for approximating the value function of the HJB equation (23). Instead of sequentially updating the critic and actor NNs[2, 18], both networks are updated simultaneously in real time, thus leading to a synchronous online implementation.

A. Value Function Approximation via NN

Assuming the optimal value function is continuous and defined on compact sets, a single-layer NN can be used to approximate it[19], such that the solution $V^*(e)$ and its derivative $\partial V^*(e) / \partial e$ with respect to $e$ can be uniformly approximated by

$$V^*(e) = W^{*\mathrm{T}} \Phi(e) + \varepsilon_1, \quad (25)$$

and

$$\frac{\partial V^*(e)}{\partial e} = \nabla \Phi^{\mathrm{T}} W^* + \nabla \varepsilon_1, \quad (26)$$

where $W^* \in \mathbb{R}^l$ are the unknown ideal weights, $\Phi(e) = [\Phi_1, \cdots, \Phi_l]^{\mathrm{T}} \in \mathbb{R}^l$ is the NN activation function vector, $l$ is the number of neurons in the hidden layer, and $\varepsilon_1$ is the NN approximation error. $\nabla \Phi := \partial \Phi / \partial e$ and $\nabla \varepsilon_1 := \partial \varepsilon_1 / \partial e$ denote the partial derivatives of $\Phi(e)$ and $\varepsilon_1$ with respect to $e$, respectively.
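As a concrete instance of the parameterization (25)-(26), consider a quadratic basis for $e \in \mathbb{R}^2$; this basis is our choice for illustration (the paper does not prescribe $\Phi$), and it is the natural one since quadratic bases represent LQR-type value functions exactly:

```python
# Sketch of the critic parameterization (25)-(26) with an assumed
# quadratic basis Phi(e) = [e1^2, e1*e2, e2^2]^T for e in R^2.
import numpy as np

def phi(e):
    """Activation vector Phi(e) in R^3."""
    return np.array([e[0] ** 2, e[0] * e[1], e[1] ** 2])

def grad_phi(e):
    """Jacobian dPhi/de in R^{3x2} (the paper's 'nabla Phi')."""
    return np.array([[2 * e[0], 0.0],
                     [e[1],     e[0]],
                     [0.0,      2 * e[1]]])

W = np.array([1.0, 0.2, 0.8])   # example critic weights
e = np.array([0.5, -1.0])
V = W @ phi(e)                  # value estimate W^T Phi(e)      -> 0.95
dV = grad_phi(e).T @ W          # gradient nabla Phi^T W, cf. (26) -> [0.8, -1.5]
```

With this basis, $\nabla\Phi^{\mathrm{T}} W$ is exactly the gradient of $W^{\mathrm{T}}\Phi(e)$, which is the quantity the control law built from (26) consumes.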


Some standard NN assumptions that will be used throughout the remainder of this paper are summarized here.

Assumption 2[17, 19−21]. The ideal NN weights $W^*$ are bounded by a positive constant $W_N$, i.e., $\|W^*\| \leq W_N$; the NN activation function $\Phi(\cdot)$ and its derivative $\nabla \Phi(\cdot)$ with respect to the argument $e$ are bounded, e.g., $\|\nabla \Phi\| \leq \phi_M$; and the function approximation error $\varepsilon_1$ and its derivative $\nabla \varepsilon_1$ with respect to $e$ are bounded, e.g., $\|\nabla \varepsilon_1\| \leq \phi_{\varepsilon}$.

In a practical application, the NN activation functions $\{\Phi_i(e) : i = 1, \cdots, l\}$ can be selected so that, as $l \to +\infty$, $\Phi(e)$ provides a complete independent basis for $V^*(e)$. Then, using Assumption 2 and the Weierstrass higher-order approximation theorem, both $V^*(e)$ and $\partial V^*(e) / \partial e$ can be uniformly approximated by the NNs in (25) and (26), i.e., as $l \to +\infty$, the approximation errors satisfy $\varepsilon_1 \to 0$, $\nabla \varepsilon_1 \to 0$, as shown in [17, 19]. Then the critic NN $\hat{V}(e)$ that approximates the optimal value function $V^*(e)$ is given by

$$\hat{V}(e) = \hat{W}^{\mathrm{T}} \Phi(e), \quad (27)$$

where $\hat{W}$ is the estimate of the unknown weights $W^*$ in critic NN (25), which will be specified by adaptive law (38). In this case, one may obtain the approximated optimal control as

$$\hat{u}_e = -\frac{1}{2} R^{-1} B^{\mathrm{T}} \frac{\partial \hat{V}(e)}{\partial e} = -\frac{1}{2} R^{-1} B^{\mathrm{T}} \nabla \Phi^{\mathrm{T}} \hat{W}, \quad (28)$$

such that the overall optimal tracking control for system (16) can be given as

$$u = u_s + \hat{u}_e, \quad (29)$$

with $u_s$ being the steady-state control given in (18). Consider that the ideal optimal control $u_e^*$ can be determined based on (24) and (26) as

$$u_e^* = -\frac{1}{2} R^{-1} B^{\mathrm{T}} \frac{\partial V^*(e)}{\partial e} = -\frac{1}{2} R^{-1} B^{\mathrm{T}} (\nabla \Phi^{\mathrm{T}} W^* + \nabla \varepsilon_1), \quad (30)$$

and then substitute the estimated optimal control $\hat{u}_e$ of (28) into the error dynamics (19); we have

$$\begin{aligned} \dot{e} &= Ae + \hat{C}^{\mathrm{T}} [f(x) - f(x_d)] + B \hat{u}_e + \varepsilon + \varepsilon_N + \varepsilon_{\varphi} \\ &= Ae + \hat{C}^{\mathrm{T}} [f(x) - f(x_d)] + B \Big[ -\frac{1}{2} R^{-1} B^{\mathrm{T}} \nabla \Phi^{\mathrm{T}} \hat{W} + \frac{1}{2} R^{-1} B^{\mathrm{T}} (\nabla \Phi^{\mathrm{T}} W^* + \nabla \varepsilon_1) \Big] + B u_e^* + \varepsilon + \varepsilon_N + \varepsilon_{\varphi} \\ &= Ae + \hat{C}^{\mathrm{T}} [f(x) - f(x_d)] + \frac{1}{2} B R^{-1} B^{\mathrm{T}} \nabla \Phi^{\mathrm{T}} \tilde{W} + B u_e^* + \frac{1}{2} B R^{-1} B^{\mathrm{T}} \nabla \varepsilon_1 + \varepsilon + \varepsilon_N + \varepsilon_{\varphi}. \end{aligned} \quad (31)$$

Remark 5. Since the overall control (29) is derived using the steady-state control (18) and the approximate optimal control (28) that depends on the estimated optimal value function gradient $\nabla \Phi^{\mathrm{T}} \hat{W}$, the critic NN in (27) can be used to determine the control action without using another NN as the actor in [19−21]. This can reduce the computational cost and improve the learning process. However, alternatively, a separate actor NN, e.g., $\nabla \Phi^{\mathrm{T}} \hat{W}_a$, may be used in a similar way for producing the approximated optimal control action $\hat{u}_e = -\frac{1}{2} R^{-1} B^{\mathrm{T}} \nabla \Phi^{\mathrm{T}} \hat{W}_a$, as shown in [19−21].

B. Adaptive Law for Critic NN

The problem now is to update the critic NN weights $\hat{W}$ such that $\hat{W}$ converges to a small bounded region around the ideal values $W^*$. To derive the adaptive law, we denote $f_d = f(x_d)$, substitute (26) into the Hamiltonian (21), and thus rewrite the HJB equation (23) as

$$0 = H(e, u_e, V^*) = W^{*\mathrm{T}} \nabla \Phi \big[ Ae + \hat{C}^{\mathrm{T}} (f - f_d) + B u_e \big] + e^{\mathrm{T}} Q e + u_e^{\mathrm{T}} R u_e + \varepsilon_{HJB}, \quad (32)$$

where $\varepsilon_{HJB} = \nabla \varepsilon_1 [Ae + \hat{C}^{\mathrm{T}} (f - f_d) + B u_e + \varepsilon + \varepsilon_N + \varepsilon_{\varphi}] + W^{*\mathrm{T}} \nabla \Phi (\varepsilon_N + \varepsilon_{\varphi} + \varepsilon)$ is the residual error due to the NN approximation errors, which can be made arbitrarily small by using a sufficiently large number of NN nodes[17, 19], i.e., $\varepsilon \to 0$ as $p \to +\infty$ and $\nabla \varepsilon_1 \to 0$ as $l \to +\infty$. Equally, Theorem 1 implies that the estimation error $\varepsilon_N = \tilde{C}^{\mathrm{T}} f(x)$ converges to zero as $p \to +\infty$ for bounded control and states. In contrast, $\varepsilon_{\varphi} = 0$ when $B$ is of rank $n$. To facilitate the design of the adaptive law, we denote the known terms in (32) as $\Xi = \nabla \Phi \big[ Ae + \hat{C}^{\mathrm{T}} (f - f_d) + B u_e \big]$ and $\Theta = e^{\mathrm{T}} Q e + u_e^{\mathrm{T}} R u_e$, and then represent (32) as

$$\Theta = -W^{*\mathrm{T}} \Xi - \varepsilon_{HJB}. \quad (33)$$
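To make the quantities in (28) and (33) concrete, here is a minimal sketch of evaluating the control $\hat{u}_e$ and the known pair $(\Xi, \Theta)$ at one point; every dimension, matrix, and the quadratic basis are illustrative assumptions, not the paper's simulation example.

```python
# Sketch (hypothetical values) of the known quantities driving the critic law:
# the approximate control (28) and the regressor/target pair of (33).
import numpy as np

n, m, l = 2, 1, 3
A = np.array([[0.0, 1.0], [-1.0, -2.0]])   # identifier matrices (assumed)
B = np.array([[0.0], [1.0]])
C_hat = np.zeros((2, n))                   # current identifier weights
f = lambda x: np.array([np.sin(x[0]), np.tanh(x[1])])
Q, R = np.eye(n), np.eye(m)

def grad_phi(e):                           # nabla Phi for Phi = [e1^2, e1 e2, e2^2]
    return np.array([[2 * e[0], 0.0], [e[1], e[0]], [0.0, 2 * e[1]]])

x, xd = np.array([0.6, -0.2]), np.array([0.5, 0.0])
e = x - xd
W_hat = np.array([1.0, 0.1, 0.5])          # current critic estimate

# approximate optimal control (28): u_e = -1/2 R^{-1} B^T nabla Phi^T W_hat
u_e = -0.5 * np.linalg.solve(R, B.T @ (grad_phi(e).T @ W_hat))

# known terms of (33): Xi = nabla Phi [A e + C_hat^T (f - f_d) + B u_e],
# Theta = e^T Q e + u_e^T R u_e
Xi = grad_phi(e) @ (A @ e + C_hat.T @ (f(x) - f(xd)) + B @ u_e)
Theta = e @ Q @ e + u_e @ R @ u_e
```

All of $\Xi$ and $\Theta$ are computable from measured signals and current estimates, which is what makes the direct estimation of $W^*$ in the next subsection possible.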

In (33), the unknown critic NN weights $W^*$ appear in a linearly parameterized form, and will be 'directly' estimated in the following development by utilizing the parameter estimation error method proposed in Section III.

Remark 6. It is shown in (32) that the residual HJB equation error $\varepsilon_{HJB}$ is due to the critic NN approximation error $\nabla \varepsilon_1$ in (26), the identifier error $\varepsilon + \varepsilon_N$ in (16), and the matching condition error $\varepsilon_{\varphi}$. As claimed in [17, 19], the critic NN approximation error $\nabla \varepsilon_1$ converges uniformly to zero as the number of hidden layer nodes increases, i.e., $\nabla \varepsilon_1 \to 0$ as $l \to +\infty$; that is, $\forall \mu > 0$, $\exists N(\mu) : \sup \|\nabla \varepsilon_1\| \leq \mu$. Moreover, in case there is no NN approximation error in (3), i.e., $\varepsilon = 0$, the effect of the identifier error $\varepsilon_N$ in (16) will vanish (i.e., $\varepsilon_N \to 0$ as $p \to +\infty$) because $\tilde{C} \to 0$ holds for $\varepsilon = 0$, as proved in Theorem 1 for bounded state $x$ and control input $u$. Finally, $\varepsilon_{\varphi} = 0$ also holds under the matching condition in (19). Consequently, if there are no approximation errors $\varepsilon$ in identifier (16) and $\nabla \varepsilon_1$ in critic NN (26), and the matching condition holds, the residual error in (33) is null, i.e., $\varepsilon_{HJB} = 0$.

Remark 7. Some available ADP based optimal controls are designed to online update the critic NN weights $\hat{W}$ by minimizing the squared residual Bellman error in the approximated HJB equation[19−21, 23], where least-squares[20] or modified Levenberg-Marquardt algorithms[19] are employed. In the following, we will extend our previous results[24] to design an adaptive law that directly estimates the unknown critic NN weights $W^*$ based on (33), rather than reducing the Bellman error.


Similar to Section III, we define the auxiliary filtered regressor matrix P2 ∈ Rl×l and vector Q2 ∈ Rl as

Ṗ2 = −ℓP2 + ΞΞᵀ, P2(0) = 0,
Q̇2 = −ℓQ2 + ΞΘ, Q2(0) = 0,  (34)

where ℓ > 0 is a design parameter. Then one can obtain

P2 = ∫₀ᵗ e^(−ℓ(t−r)) Ξ(r)Ξᵀ(r) dr,
Q2 = ∫₀ᵗ e^(−ℓ(t−r)) Ξ(r)Θ(r) dr.  (35)

Define another auxiliary vector M2 ∈ Rl based on P2 and Q2 in (34) as

M2 = P2Ŵ + Q2,  (36)

where Ŵ is the estimate of W*, which will be given by the adaptive law (38) below. By substituting (33) into (35), we have Q2 = −P2W* + ψ2 with ψ2 = −∫₀ᵗ e^(−ℓ(t−r)) εHJB(r)Ξ(r) dr being a bounded variable for bounded state x and control u, i.e., ‖ψ2‖ ≤ ε2ψ for some ε2ψ > 0. In this case, (36) can be rewritten as

M2 = P2Ŵ + Q2 = −P2W̃ + ψ2,  (37)

where W̃ = W* − Ŵ is the NN weight error. Then the adaptive law for estimating Ŵ is provided by

dŴ/dt = −Γ2M2,  (38)
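As a sanity check (not part of the paper), the filter (34) and the adaptive law (38) can be exercised on a synthetic regressor. All signals, gains and the forward-Euler discretization below are illustrative assumptions; with εHJB = 0, the estimate Ŵ should converge to W* as claimed in Theorem 2.

```python
import numpy as np

# Hypothetical setup: true critic weights and a persistently exciting
# regressor Xi(t) built from sinusoids (three frequencies for three weights).
W_star = np.array([0.5, 0.0, 1.0])
Xi = lambda t: np.array([np.sin(t), np.cos(t), np.sin(2.0 * t)])

ell, Gamma2 = 1.0, 50.0          # filter pole and learning gain (assumed values)
dt, T = 1e-3, 20.0
P2 = np.zeros((3, 3))            # filtered regressor matrix, P2(0) = 0
Q2 = np.zeros(3)                 # filtered regressor vector, Q2(0) = 0
W_hat = np.zeros(3)              # initial weight estimate

for k in range(int(T / dt)):
    xi = Xi(k * dt)
    theta = -W_star @ xi          # Theta = -W*^T Xi, i.e., eps_HJB = 0 in (33)
    P2 += dt * (-ell * P2 + np.outer(xi, xi))   # filter (34)
    Q2 += dt * (-ell * Q2 + xi * theta)
    M2 = P2 @ W_hat + Q2          # (36); equals -P2 W_tilde when eps_HJB = 0
    W_hat += dt * (-Gamma2 * M2)  # adaptive law (38)

print(np.round(W_hat, 3))               # close to W_star = [0.5, 0, 1]
print(np.linalg.eigvalsh(P2).min() > 0)  # excitation check, cf. Lemma 3
```

Note that the weight error obeys dW̃/dt = −Γ2P2W̃ here, so convergence hinges precisely on λmin(P2) staying positive.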

with Γ2 > 0 being a constant gain matrix. Similar to Section III, the condition λmin(P2) > σ2 > 0 is needed if one desires to precisely estimate the unknown critic NN weights W*, so that the approximated value function (27) converges to its true value (25). As shown in [19−20], a small probing noise can be added to the control input to retain the PE condition if it is not fulfilled; the PE condition in turn implies λmin(P2) > σ2 > 0, as stated in Lemma 3.

Lemma 3[24]. If the regressor function vector Ξ defined in (33) is persistently excited, then the matrix P2 defined in (35) is positive definite, i.e., its minimum eigenvalue satisfies λmin(P2) > σ2 > 0.

Then we have the following theorem.

Theorem 2. For the critic NN adaptive law (38) with regressor vector Ξ satisfying λmin(P2) > σ2 > 0 and for bounded state x and control u, one has:
1) For εHJB = 0 (i.e., no NN approximation errors), the estimation error W̃ converges to zero exponentially;
2) For εHJB ≠ 0 (i.e., with bounded approximation errors), the estimation error W̃ converges to a bounded set around zero.

Proof. Consider the Lyapunov function candidate V2 = (1/2)W̃ᵀΓ2⁻¹W̃; then the derivative V̇2 along (38) can be calculated as

V̇2 = W̃ᵀΓ2⁻¹(dW̃/dt) = −W̃ᵀP2W̃ + W̃ᵀψ2.  (39)

1) In case εHJB = 0, i.e., ψ2 = 0, (39) reduces to

V̇2 = −W̃ᵀP2W̃ ≤ −σ2‖W̃‖² ≤ −μ2V2,  (40)

where μ2 = 2σ2/λmax(Γ2⁻¹) is a positive constant. Then according to Lyapunov's theorem (Theorem 3.4.1 in [4], p. 110), the weight estimation error W̃ converges to zero exponentially, where the convergence rate depends on the excitation level σ2 and the learning gain Γ2.

2) In case there are bounded approximation errors, i.e., εHJB ≠ 0, (39) can be written as

V̇2 = −W̃ᵀP2W̃ + W̃ᵀψ2 ≤ −‖W̃‖(σ̃2√V2 − ε2ψ)  (41)

for σ̃2 = σ2√(2/λmax(Γ2⁻¹)). Then according to the extended Lyapunov theorem (Theorem 3.4.3 in [4], p. 111), the weight estimation error W̃ uniformly ultimately converges to the compact set Ω2 := {W̃ | √V2 ≤ ε2ψ/σ̃2}, whose size depends on the bound ε2ψ of the approximation error and the excitation level σ2. This completes the proof. □

C. Stability Analysis

Now, we summarize the main results of this paper as follows.

Theorem 3. For system (3) with controls (18), (28) and adaptive laws (7), (38) being used, if the initial control action is chosen to be admissible and the regressor vectors f and Ξ satisfy λmin(P1) > σ1 > 0 and λmin(P2) > σ2 > 0, then the following semi-global results hold:
1) In the absence of approximation errors, the tracking error e and the parameter estimation errors C̃ and W̃ converge to zero, and the adaptive control ûe in (28) converges to its optimal solution u*e in (24), i.e., ûe → u*e if ∇ε1 = 0.
2) In the presence of approximation errors, the tracking error e and the parameter estimation errors C̃ and W̃ are uniformly ultimately bounded, and the adaptive control ûe in (28) converges to a small bound around its optimal solution u*e in (24), i.e., ‖ûe − u*e‖ ≤ εu for a small positive constant εu.
Please refer to the Appendix for the detailed proof of Theorem 3.

V. SIMULATIONS

In this section, a numerical example is provided to demonstrate the effectiveness of the proposed approach. Consider the following nonlinear continuous-time system

ẋ1 = −x1 + x2,
ẋ2 = −0.5x1 − 0.5x2(1 − (cos(2x1) + 2)²) + cos(2x1) + 2u.  (42)

The results are compared with the exact results in [32]. The weight matrices Q and R of cost (2) are chosen as identity matrices of appropriate dimensions. The control objective is to make the system states x track the desired trajectory x1d = sin(t) and x2d = cos(t) + sin(t).
It is assumed that the system dynamics are partially unknown, and we first use identifier (3) to reconstruct the system dynamics, with A = [−1, 1; −0.5, 0] and B = [0, 2]ᵀ being the known




matrices. The matrix C = [0, 0, 0; −0.5, 0.5, 1]ᵀ collects the unknown identifier weights to be estimated, and the activation function is chosen as f(x) = [x2, x2(cos(2x1) + 2)², cos(2x1)]ᵀ. The parameters for the simulation are set as k = 0.001, ℓ = 1 and Γ1 = 350. The initial weight estimate is set as Ĉ(0) = 0. Two different scenarios are investigated: the adaptive algorithm without and with injection of additional noise. The noise has a uniform distribution and a maximal amplitude of 0.1, acting on the measurements of x1 and x2; it is removed after a duration of 4 seconds. Fig. 1 shows the profile of the estimated identifier weights Ĉ under adaptive law (7), where one may find that the identifier weight estimates converge to their true values C after a transient of about 3.5 seconds without noise. It is evident that the algorithm with noise injection converges slightly faster.
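As a quick numerical check (not part of the paper), one can verify that this choice of (A, B, C) and activation f(x) reproduces the right-hand side of (42) exactly, i.e., ẋ = Ax + Cᵀf(x) + Bu with zero residual:

```python
import numpy as np

A = np.array([[-1.0, 1.0], [-0.5, 0.0]])
B = np.array([0.0, 2.0])
C = np.array([[0.0, 0.0, 0.0], [-0.5, 0.5, 1.0]]).T   # C in R^(3x2)

def f(x):
    # identifier activation functions f(x) from the simulation setup
    x1, x2 = x
    return np.array([x2, x2 * (np.cos(2 * x1) + 2) ** 2, np.cos(2 * x1)])

def rhs(x, u):
    # right-hand side of system (42)
    x1, x2 = x
    return np.array([
        -x1 + x2,
        -0.5 * x1 - 0.5 * x2 * (1 - (np.cos(2 * x1) + 2) ** 2)
        + np.cos(2 * x1) + 2 * u,
    ])

rng = np.random.default_rng(0)
for _ in range(100):
    x, u = rng.normal(size=2), rng.normal()
    assert np.allclose(A @ x + C.T @ f(x) + B * u, rhs(x, u))  # exact match
```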

Fig. 1. Convergence of identifier parameters Ĉ.

The time trace of the online estimation of the critic NN weights Ŵ is shown in Fig. 2; it indicates that Ŵ converges in about 1.5 seconds. In particular, Ŵ3 and Ŵ2 converge close to their optimal values of 1 and 0, respectively, while Ŵ1 in the noise-induced case carries a larger error. Ŵ1 does not affect the closed-loop behavior, but has influence on the value function estimate. This means that the designed adaptive optimal control (28) converges close to its optimal control action in (44). An error in the weights is to be expected since εϕ ≠ 0. The novel identifier and critic NN weight update laws (7) and (38), based on the information of the parameter estimation error, lead to faster convergence of the weights compared to [19]. Moreover, for the noise-free case, the system states tracking the given external command are shown in Fig. 4, the tracking error profile is given in Fig. 5, and the associated control action is provided in Fig. 6. The noise-induced case produces very similar trajectories, which are not displayed here for space reasons.

In the following, the control performance is verified. For this purpose, the adaptive steady-state control (18) for system (42), which maintains the steady-state performance, can be written as

us = [0, 1/2]{ [cos(t); cos(t) − sin(t)] − [−1, 1; −0.5, 0][sin(t); cos(t) + sin(t)] − Ĉᵀ[cos(t) + sin(t); (cos(t) + sin(t))(cos(2 sin(t)) + 2)²; cos(2 sin(t))] }.  (43)
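Numerically, (43) is simply us = B⁺(ẋd − Axd − Ĉᵀf(xd)) with B⁺ = [0, 1/2]. A small sketch, using the true weights C in place of Ĉ purely for illustration:

```python
import numpy as np

A = np.array([[-1.0, 1.0], [-0.5, 0.0]])
B = np.array([[0.0], [2.0]])
C = np.array([[0.0, 0.0, 0.0], [-0.5, 0.5, 1.0]]).T
f = lambda x: np.array([x[1], x[1] * (np.cos(2 * x[0]) + 2) ** 2,
                        np.cos(2 * x[0])])
B_pinv = np.linalg.pinv(B)               # Moore-Penrose inverse, = [0, 0.5]

def u_s(t, C_hat=C):
    # steady-state control (43): us = B+ (xd_dot - A xd - C_hat^T f(xd))
    xd = np.array([np.sin(t), np.cos(t) + np.sin(t)])
    xd_dot = np.array([np.cos(t), np.cos(t) - np.sin(t)])
    return (B_pinv @ (xd_dot - A @ xd - C_hat.T @ f(xd))).item()

print(u_s(0.0))   # -2.0 with the true weights
```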

Fig. 2. Convergence of critic NN weights Ŵ.

Fig. 3. Excitation conditions λmin(P1) and λmin(P2).


As the input matrix B is of rank 1, it is evident that εϕ = (BB⁺ − I)(ẋd − Axd − Ĉᵀf(xd)) is not zero. Thus, the computation of the optimal control (28) using adaptive law (38) for the critic NN may be subject to a small error. To this end, following [19, 32], the optimal value function and the associated optimal control for system (42) are

V*(e) = (1/2)e1² + e2²  and  u*e = −(1/2)R⁻¹Bᵀ ∂V*(e)/∂e = −2e2.  (44)

Similar to [19−20], we select the activation function for the critic NN as Φ(e) = [e1², e1e2, e2²]ᵀ; the optimal weights W* = [0.5, 0, 1]ᵀ can then be derived. Note that only the last nonzero coefficient W*3 = 1 affects the closed loop.
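The pair (44) can be checked directly against the critic parameterization V*(e) = W*ᵀΦ(e): with R = 1 and B = [0, 2]ᵀ, the control −(1/2)R⁻¹Bᵀ∇V*(e) equals −2e2. The finite-difference gradient below is purely for illustration:

```python
import numpy as np

B = np.array([0.0, 2.0])
W_star = np.array([0.5, 0.0, 1.0])
Phi = lambda e: np.array([e[0] ** 2, e[0] * e[1], e[1] ** 2])
V_star = lambda e: W_star @ Phi(e)       # = 0.5*e1^2 + e2^2, cf. (44)

def grad_V(e, h=1e-6):
    # central-difference gradient of V* (exact for quadratics up to rounding)
    g = np.zeros(2)
    for i in range(2):
        ep, em = e.copy(), e.copy()
        ep[i] += h
        em[i] -= h
        g[i] = (V_star(ep) - V_star(em)) / (2 * h)
    return g

e = np.array([0.7, -1.3])
u_opt = -0.5 * B @ grad_V(e)             # R = 1
print(u_opt, -2 * e[1])                  # both approximately 2.6 = -2*e2
```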

A critical issue in using the proposed adaptive laws (7) and (38) is to ensure sufficient excitation of the regressor vectors f(x) and Ξ. This condition is fulfilled in the studied system, as shown in Fig. 3, where the online evolutions of λmin(P1) and λmin(P2) are provided. The scalar λmin(P1) remains positive at all times. The value of λmin(P2) is sufficiently large until the


time instant of 4 seconds, when the noise is removed in the noise-induced case; λmin(P2) for the noise-free case remains sufficiently large until the time instant of 2 seconds, i.e., until after the NN weight convergence is obtained (see Figs. 1 and 2).
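The excitation monitoring shown in Fig. 3 can be reproduced for any recorded regressor by integrating the P-filter and tracking its minimum eigenvalue. A small sketch with two hypothetical regressors, one persistently exciting and one not:

```python
import numpy as np

def min_eig_of_filter(regressor, ell=1.0, dt=1e-3, T=10.0):
    """Integrate Pdot = -ell*P + phi*phi^T, P(0)=0, and return lambda_min(P(T))."""
    n = regressor(0.0).size
    P = np.zeros((n, n))
    for k in range(int(T / dt)):
        phi = regressor(k * dt)
        P += dt * (-ell * P + np.outer(phi, phi))
    return np.linalg.eigvalsh(P).min()

pe = min_eig_of_filter(lambda t: np.array([np.sin(t), np.cos(t)]))
not_pe = min_eig_of_filter(lambda t: np.array([1.0, 1.0]))  # single rank-1 direction
print(pe, not_pe)   # pe clearly positive; not_pe essentially zero
```

A persistently exciting regressor keeps λmin(P) bounded away from zero, which is exactly the condition λmin(P2) > σ2 > 0 required by Lemma 3.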

Fig. 4. Evaluation of tracking performance.

Fig. 5. Convergence of tracking error e = x − xd.

Fig. 6. Control action profile u.

VI. CONCLUSIONS

An adaptive optimal tracking control is proposed for a class of continuous-time nonlinear systems with unknown dynamics. To achieve optimal tracking control, an adaptive steady-state control for maintaining the desired steady-state tracking performance is combined with an adaptive optimal control that stabilizes the tracking error dynamics in the transient stage in an optimal manner. To eliminate the need for precisely known system dynamics, an adaptive identifier is used to estimate the unknown system dynamics. A critic NN is used to learn online the approximate solution of the HJB equation, which then provides the approximately optimal control action. Novel adaptive laws based on the parameter estimation error are developed for updating the unknown weights in both the identifier and the critic NN, such that the online learning of the identifier and of the optimal policy is achieved simultaneously. The PE conditions, or the more relaxed filtered regressor matrix conditions, are required to ensure convergence to a bounded region around the optimal control and stability of the closed-loop system. Simulation results demonstrate the improved performance of the proposed method.

APPENDIX

Proof of Theorem 3. Consider the Lyapunov function candidate

V = V1 + V2 + V3 + V4 = (1/2)tr(C̃ᵀΓ1⁻¹C̃) + (1/2)W̃ᵀΓ2⁻¹W̃ + Γeᵀe + KV*(e) + Σψ2ᵀψ2,  (A1)

where V*(e) is the optimal value function (20), and K, Γ and Σ are positive constants. This Lyapunov function is investigated in a compact set Ω̃ ⊂ Rp×n × Rl × Rn × Rl × Rn × Rn, which contains the element (0, 0, 0, 0, 0, 0) in its interior, for the tuple (C̃, W̃, e, ψ2, xd, ẋd); moreover, (C̃, W̃, e, ψ2, xd, ẋd) ∈ Ω̃ implies (e + xd, us(xd, ẋd) + ue(e)) ∈ Ω. Ω̃ and Ω should

both be chosen sufficiently large but of fixed size. In particular, any temporal initial value of (C̃, W̃, e, ψ2, xd, ẋd) is assumed to be within the interior of Ω̃, while in particular xd and ẋd are chosen to remain within Ω̃. Thus, for any initial trajectory, the state x and control u remain bounded for at least a finite time t ∈ [0, T1], which in particular implies that ψ1 is bounded on this time interval. Thus, considering the inequality ab ≤ a²η/2 + b²/(2η) for η > 0, the derivative V̇1 along (7) is derived as

V̇1 = tr(C̃ᵀΓ1⁻¹(dC̃/dt)) = −tr(C̃ᵀP1C̃) + tr(C̃ᵀψ1) ≤ −(σ1 − 1/(2η))‖C̃‖² + ηε²1ψ/2,  (A2)

and the derivative V̇2 along (38) is derived as

V̇2 = W̃ᵀΓ2⁻¹(dW̃/dt) = −W̃ᵀP2W̃ + W̃ᵀψ2 ≤ −(σ2 − 1/(2η))‖W̃‖² + η‖ψ2‖²/2.  (A3)

Moreover, one may deduce V̇3 from (20) and (31) as

V̇3 = 2Γeᵀė + K(−eᵀQe − u*eᵀRu*e)
= 2eᵀΓ[Ae + Ĉᵀ(f − fd) + (1/2)BR⁻¹Bᵀ∇ΦᵀW̃ + Bu*e + (1/2)BR⁻¹Bᵀ∇ε1 + ε + εN + εϕ] + K(−eᵀQe − u*eᵀRu*e)
≤ −[Kλmin(Q) − Γ(4 + 2‖A‖² + ‖Ĉ‖²κ²)]‖e‖² − [Kλmin(R) − Γ‖B‖²]‖u*e‖² + (1/2)Γ‖BR⁻¹Bᵀ∇Φᵀ‖²‖W̃‖² + (1/2)Γ‖BR⁻¹Bᵀ‖(∇ε1)ᵀ∇ε1 + Γεᵀε + ΓεNᵀεN + Γεϕᵀεϕ.  (A4)

It is evident that ψ̇2 = −ℓψ2 + ΞεHJB. Hence, similar to the parameter η > 0, a parameter μ > 0 is introduced to compute an upper bound on the derivative of V4 = Σψ2ᵀψ2 as

V̇4 = 2Σψ2ᵀψ̇2 = 2Σψ2ᵀ{−ℓψ2 + Ξ[W*ᵀ∇Φ(εN + εϕ + ε) + ∇ε1(Ae + Ĉᵀ(f − fd) + Bue + ε + εN + εϕ)]}
≤ −Σ(2ℓ − 5μ)‖ψ2‖² + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)(εϕ + ε)‖² + (1/μ)Σ‖Ξ∇ε1Ĉᵀ(f − fd)‖² + (1/μ)Σ‖Ξ∇ε1(1/2)BR⁻¹Bᵀ∇ΦᵀŴ‖² + (1/μ)Σ‖Ξ∇ε1Ae‖² + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)‖²‖εN‖².  (A5)

Combining (A2)−(A5), we obtain

V̇ = V̇1 + V̇2 + V̇3 + V̇4 ≤
−[σ1 − 1/(2η) − (Γ + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)‖²)‖f‖²]‖C̃‖²
− [σ2 − 1/(2η) − (1/2)Γφ²M‖BR⁻¹Bᵀ‖² − (1/μ)Σ‖Ξ∇ε1(1/2)BR⁻¹Bᵀ∇Φᵀ‖²]‖W̃‖²
− [Kλmin(Q) − Γ(4 + 2‖A‖² + ‖Ĉ‖²κ²) − (1/μ)Σ(‖Ξ∇ε1Ĉᵀ‖²κ² + ‖Ξ∇ε1A‖²)]‖e‖²
− [Σ(2ℓ − 5μ) − η/2]‖ψ2‖² − [Kλmin(R) − Γ‖B‖²]‖u*e‖²
+ (1/2)ηε²1ψ + (1/2)Γ‖BR⁻¹Bᵀ‖(∇ε1)ᵀ∇ε1 + Γεᵀε + Γεϕᵀεϕ + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)(εϕ + ε)‖² + (1/μ)Σ‖Ξ∇ε1(1/2)BR⁻¹Bᵀ∇ΦᵀW*‖².  (A6)

The design parameters η, μ, Γ, Σ and K are appropriately chosen such that Kλmin(R) − Γ‖B‖² > 0 and the scalars a1, a2, a3 and a4 are positive and larger than a certain positive constant a > 0, where

a1 = σ1 − 1/(2η) − (Γ + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)‖²)‖f‖²,
a2 = σ2 − 1/(2η) − (1/2)Γφ²M‖BR⁻¹Bᵀ‖² − (1/μ)Σ‖Ξ∇ε1(1/2)BR⁻¹Bᵀ∇Φᵀ‖²,
a3 = Kλmin(Q) − Γ(4 + 2‖A‖² + ‖Ĉ‖²κ²) − (1/μ)Σ(‖Ξ∇ε1Ĉᵀ‖²κ² + ‖Ξ∇ε1A‖²),
a4 = Σ(2ℓ − 5μ) − η/2.

This can be achieved by selecting η > 0 and K > 0 large enough, while a > 0, μ > 0, Γ > 0 and Σ > 0 are chosen small enough to satisfy, in particular, min(σ1, σ2) > a > 0 and a4 > a > 0. Note also that Lipschitz continuity of f(·) and smoothness of Φ(·) and V*(·) imply that f(·), Ξ and Φ(·) are bounded on Ω̃. Thus, (A6) can be further presented as

V̇ ≤ −a1‖C̃‖² − a2‖W̃‖² − a3‖e‖² − a4‖ψ2‖² + γ,  (A7)

where γ = Γεϕᵀεϕ + (1/2)Γ‖BR⁻¹Bᵀ‖(∇ε1)ᵀ∇ε1 + Γεᵀε + (1/2)ηε²1ψ + (1/μ)Σ‖Ξ(W*ᵀ∇Φ + ∇ε1)(εϕ + ε)‖² + (1/μ)Σ‖Ξ∇ε1(1/2)BR⁻¹Bᵀ∇ΦᵀW*‖² defines the effect of the identifier errors ε, ψ1, the critic NN approximation error ∇ε1 and the matching error εϕ.

1) In case there are no approximation errors in either the identifier or the critic NN, i.e., εN = ∇ε1 = ψ1 = ψ2 = εϕ = 0, we have γ = 0, such that (A7) reduces to

V̇ ≤ −a1‖C̃‖² − a2‖W̃‖² − a3‖e‖² − a4‖ψ2‖² ≤ 0.  (A8)

Thus, there is a compact set Ω̂ ⊂ Ω̃ in (C̃, W̃, e, ψ2) with (0, 0, 0, 0) in its interior, which is a set of attraction. Then, according to Lyapunov's theorem, V → 0 holds as t → +∞ within Ω̂, such that the estimation errors C̃, W̃ and e all converge to zero.

In this case, by assuming the critic NN approximation error ∇ε1 = 0, we have

ûe − u*e = −(1/2)R⁻¹Bᵀ∇ΦᵀŴ + (1/2)R⁻¹Bᵀ∇ΦᵀW* = (1/2)R⁻¹Bᵀ∇ΦᵀW̃,  (A9)

such that

lim(t→+∞) ‖ûe − u*e‖ ≤ (1/2)φM‖R⁻¹Bᵀ‖ lim(t→+∞) ‖W̃‖ = 0.  (A10)

2) In case there are bounded approximation errors in the identifier and/or the critic NN, we have γ ≠ 0. Consequently, according to (A7), V̇ is negative if

‖C̃‖ > √(γ/a1), ‖W̃‖ > √(γ/a2), ‖e‖ > √(γ/a3), ‖ψ2‖ > √(γ/a4).  (A11)

Then, again for some set Ω̂ ⊂ Ω̃, the estimation errors C̃, W̃, ψ2 and e are all uniformly ultimately bounded according to Lyapunov's theorem within the set of attraction Ω̂.

Next we prove ‖ûe − u*e‖ ≤ εu. Recalling the expressions of u*e from (24) or (30) and ûe from (28), we have

ûe − u*e = −(1/2)R⁻¹Bᵀ∇ΦᵀŴ + (1/2)R⁻¹Bᵀ(∇ΦᵀW* + ∇ε1) = (1/2)R⁻¹Bᵀ∇ΦᵀW̃ + (1/2)R⁻¹Bᵀ∇ε1.  (A12)

When t → ∞, the upper bound of (A12) is

‖ûe − u*e‖ ≤ (1/2)‖R⁻¹Bᵀ∇Φᵀ‖‖W̃‖ + (1/2)‖R⁻¹Bᵀ‖‖∇ε1‖ ≤ εu.  (A13)

Clearly, the upper bound εu depends on the critic NN weight estimation error W̃ and the NN approximation error ∇ε1. Considering that εN = C̃ᵀf(x) with f bounded on Ω̃, the identifier error εN is likewise ultimately bounded by the ultimate bound on C̃, which completes the proof. □


REFERENCES

[1] Lewis F L, Vrabie D, Syrmos V L. Optimal Control. Hoboken: Wiley, 2012.
[2] Vrabie D, Lewis F L. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 2009, 22(3): 237−246
[3] Sastry S, Bodson M. Adaptive Control: Stability, Convergence, and Robustness. New Jersey: Prentice Hall, 1989.
[4] Ioannou P A, Sun J. Robust Adaptive Control. New Jersey: Prentice Hall, 1996.
[5] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[6] Doya K. Reinforcement learning in continuous time and space. Neural Computation, 2000, 12(1): 219−245
[7] Sutton R S, Barto A G, Williams R J. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine, 1992, 12(2): 19−22
[8] Werbos P J. A menu of designs for reinforcement learning over time. Neural Networks for Control. Cambridge, MA: MIT Press, 1990. 67−95
[9] Si J, Barto A G, Powell W B, Wunsch D C. Handbook of Learning and Approximate Dynamic Programming. Los Alamitos: IEEE Press, 2004.
[10] Wang F Y, Zhang H G, Liu D R. Adaptive dynamic programming: an introduction. IEEE Computational Intelligence Magazine, 2009, 4(2): 39−47
[11] Lewis F L, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 2009, 9(3): 32−50
[12] Zhang H G, Zhang X, Luo Y H, Yang J. An overview of research on adaptive dynamic programming. Acta Automatica Sinica, 2013, 39(4): 303−311
[13] Dierks T, Thumati B T, Jagannathan S. Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence. Neural Networks, 2009, 22(5): 851−860
[14] Al-Tamimi A, Lewis F L, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2008, 38(4): 943−949
[15] Wang D, Liu D R, Wei Q L, Zhao D B, Jin N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica, 2012, 48(8): 1825−1832
[16] Hanselmann T, Noakes L, Zaknich A. Continuous-time adaptive critics. IEEE Transactions on Neural Networks, 2007, 18(3): 631−647
[17] Abu-Khalaf M, Lewis F L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica, 2005, 41(5): 779−791
[18] Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis F L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 2009, 45(2): 477−484
[19] Vamvoudakis K G, Lewis F L. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 2010, 46(5): 878−888
[20] Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis K G, Lewis F L, Dixon W E. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica, 2013, 49(1): 82−92
[21] Zhang H G, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks, 2011, 22(12): 2226−2236
[22] Mannava A, Balakrishnan S N, Tang L, Landers R G. Optimal tracking control of motion systems. IEEE Transactions on Control Systems Technology, 2012, 20(6): 1548−1558
[23] Nodland D, Zargarzadeh H, Jagannathan S. Neural network-based optimal adaptive output feedback control of a helicopter UAV. IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(7): 1061−1073
[24] Na J, Herrmann G, Ren X M, Mahyuddin M N, Barber P. Robust adaptive finite-time parameter estimation and control of nonlinear systems. In: Proceedings of IEEE International Symposium on Intelligent Control (ISIC). Denver, CO: IEEE, 2011. 1014−1019
[25] Uang H J, Chen B S. Robust adaptive optimal tracking design for uncertain missile systems: a fuzzy approach. Fuzzy Sets and Systems, 2002, 126(1): 63−87
[26] Krstic M, Kokotovic P V, Kanellakopoulos I. Nonlinear and Adaptive Control Design. New York: Wiley, 1995.
[27] Kosmatopoulos E B, Polycarpou M M, Christodoulou M A, Ioannou P A. High-order neural network structures for identification of dynamical systems. IEEE Transactions on Neural Networks, 1995, 6(2): 422−431
[28] Abdollahi F, Talebi H A, Patel R V. A stable neural network-based observer with application to flexible-joint manipulators. IEEE Transactions on Neural Networks, 2006, 17(1): 118−129
[29] Lin J S, Kanellakopoulos I. Nonlinearities enhance parameter convergence in strict feedback systems. IEEE Transactions on Automatic Control, 1999, 44(1): 89−94
[30] Edwards C, Spurgeon S K. Sliding Mode Control: Theory and Applications. Boca Raton: CRC Press, 1998.
[31] Sira-Ramirez H. Differential geometric methods in variable-structure control. International Journal of Control, 1988, 48(4): 1359−1390
[32] Nevistic V, Primbs J A. Constrained Nonlinear Optimal Control: A Converse HJB Approach. Technical Report CIT-CDS 96-021, California Institute of Technology, Pasadena, CA, 1996.

Jing Na Professor at Kunming University of Science and Technology. He received his Ph.D. degree from Beijing Institute of Technology in 2010. From 2011 to 2012, he was a postdoctoral fellow with the ITER Organization. His research interests cover intelligent control, adaptive parameter estimation, neural networks, repetitive control, and nonlinear control and applications. Corresponding author of this paper.

Guido Herrmann received his Ph.D. degree from the University of Leicester, UK, in 2001. From 2001 to 2003, he was a senior research fellow at the Data Storage Institute in Singapore. From 2003 until 2007, he was a research associate, fellow, and lecturer at the University of Leicester. He joined the University of Bristol, UK, as a lecturer in March 2007, was promoted to senior lecturer in 2009, and to Reader in Control and Dynamics in 2012. He is a Senior Member of the IEEE. His research interests cover the development and application of novel, robust and nonlinear control systems.