Neural dynamic programming based online controller with a novel trim approach

S. Chakraborty and M.G. Simoes

Abstract: Neural dynamic programming (NDP) is a generic online learning control system based on the principle of reinforcement learning. Such a controller can self-tune over a wide range of operating conditions and parametric variations. Implementation details of a self-tuning NDP-based speed controller of a permanent-magnet DC machine, along with the online training algorithm, are given. A simple solution is developed for finding the trim control position for the NDP controller that can be extended to other problems. The DC machine is chosen for the implementation because it can be easily operated under a variety of operating conditions, including parametric variations, to prove the robustness of the controller and its multiobjective capabilities. The simulation results of the NDP controller are compared with the results of a conventional PI controller to assess the overall performance.

1 Introduction

Adaptive critic design (ACD) neural network structures have evolved from a combination of reinforcement learning, dynamic programming and back-propagation, with a long and solid history of work [1–11]. Dynamic programming is a very useful tool for solving nonlinear MIMO control problems, most of which can be formulated as a cost minimisation or maximisation problem. Unfortunately the backward numerical process required for running dynamic programming makes computation and storage very problematic, especially for high-order nonlinear systems; this is the commonly known 'curse of dimensionality' problem [12–14]. Over the years, progress has been made to overcome this by building a system called a critic to approximate the cost function of dynamic programming [6, 9] in the form of ACDs. The basic structures of ACD proposed in the literature are heuristic dynamic programming (HDP), dual heuristic programming (DHP) and globalised dual heuristic programming (GDHP), together with their action-dependent (AD) versions: action-dependent heuristic dynamic programming (ADHDP), action-dependent dual heuristic programming (ADDHP) and action-dependent globalised dual heuristic programming (ADGDHP) [4, 6, 9]. A typical ACD consists of three neural network modules called the action (decision-making module), the critic (evaluation module) and the model (prediction module). In the action-dependent versions the action network is directly connected to the critic without using a model. The basic idea in ACD is to adapt the weights of the critic network to approximate the future reward-to-go function J(t) such that it satisfies the modified Bellman equation used in dynamic programming. Instead of finding

© IEE, 2005. IEE Proceedings online no. 20041119. doi: 10.1049/ip-cta:20041119. Paper first received 24th February and in revised form 18th August 2004. The authors are with the Colorado School of Mines, Engineering Division, Golden, Colorado 80401-1887, USA.

the exact minimum, a neural network is used to get an approximate solution of the following dynamic programming equation:

J^*(X(t)) = \min_{u(t)} \{ J^*(X(t+1)) + g(X(t), X(t+1)) - U_0 \}                           (1)

where X(t) are the states of the system, g(X(t), X(t+1)) is the immediate cost incurred by u(t), the control action at time t, and U_0 is a heuristic balancing term [15]. After solving (1), the optimised control signal u(t) is utilised to train the action neural network. The weights of the action module are adapted so that it produces the desired control signal using the system states as its input. A simple block diagram representing (1) is given in Fig. 1 [9]. The training procedure for both the critic and the action network is described in detail in the following sections of the paper.

To adapt J(X(t)) in the critic network, the target on the right-hand side of (1) must be known a priori. In the case of action-dependent ACD it is necessary to wait for one time-step until the next input becomes available, so that J(X(t+1)) can be calculated by the critic network at time (t+1). A simple block diagram of action-dependent ACD [16] is shown in Fig. 2a. When the problem is of a temporal nature, i.e. waiting for the subsequent time-steps to infer incremental costs is not possible, a pre-trained model network has to be used to calculate X(t+1). The block diagram of this ACD [17] is given in Fig. 2b. The problem with such an approach is training the model network, which becomes complex for nonlinear MIMO systems.

The basis of the proposed online training algorithm lies in the combination of ACD and temporal difference (TD) reinforcement learning [18, 19]. It has been shown in the literature that TD-based reinforcement learning can be combined with ACD to obtain a new structure called neural dynamic programming [20]. In the proposed approach the system model network (the one that predicts the future system states and consequently the cost-to-go for the next time-step) is excluded and no separate model is used for future prediction.

Fig. 1 Simple block diagram for solving modified Bellman equation by ACD

Fig. 2 Basic adaptive critic design block diagrams: a Action-dependent ACD; b Simple ACD

Instead, the previous J values are stored and, together with the current J value, the temporal difference is obtained, along which a reinforcement signal r(t) is used to train the critic network. The signal r(t) is a measure of the instantaneous primary reward of the current action; it adapts the critic in such a way that the actual system states follow the desired system states for optimised operation. The adaptation of the action network is conducted by indirectly back-propagating the error between the critic network output J(t) and a control signal called the ultimate control objective Uc(t). The ultimate control objective function Uc(t) is a weighted cost function which can be defined according to the main control objective for the controller, giving more flexibility in designing the controller.

For a closed-loop NDP controller the nominal control position (sometimes referred to as the trim control position [21, 22]) has to be scheduled as a function of system states and environmental parameters [21]; a novel way to find the trim control position is proposed here.

Fig. 3 Block diagram of proposed NDP structure

The trim is actually an open-loop solution to the control problem, giving good insight into forming a closed-loop solution. It also helps to facilitate realistic simulations where one often wants to start with arbitrary initial conditions. There are existing methods for implementing the trim network, some of which require knowledge of the governing equations of the system, not usually a feasible option for complex systems where detailed system equations are hard to obtain. Another previously used technique is to use the existing NDP neural network framework to design the trim [21]. This method is robust but adds extra complexity to the algorithm and increases the convergence time. The trim controller implemented here is called the correction module, as it corrects the output of the controller. The correction module consists of two simple neural networks that are easily trained offline. The proposed configuration is simpler than the ACD and yet very useful for complex MIMO systems, because the controller can handle continuous states and a large number of discrete states owing to the use of gradient information instead of a search algorithm for finding the actions. Also, the design is robust in the sense that it is insensitive to system parameter variations, the initial weights of the action and critic networks, and the control objective.

2 General structure of NDP controller

Figure 3 shows the block diagram of the NDP controller in the closed loop. A brief description of each block follows.

2.1 Reinforcement signal

The reinforcement signal r(t) is used to train the critic network. It is a measure of the instantaneous primary reward of the current action; it tries to adapt the critic in such a way that the actual system states follow the ideal system states for optimised operation. In systems where explicit feedback is available, the following quadratic reinforcement signal can be used:

r(t) = -\sum_{i=1}^{n} \left( \frac{X_i^* - \hat{X}_i}{(X_i)_{Max}} \right)^2               (2)

where X_i is the ith state of the state vector X, X_i^* is the desired reference state, \hat{X}_i is the actual state and (X_i)_{Max} is the nominal maximum state value.
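As an illustration only, the reinforcement signal of (2) is straightforward to evaluate from measured data; the short Python sketch below assumes the reference, actual and nominal maximum states are available as NumPy arrays (the names x_ref, x_actual and x_max are hypothetical, not taken from the paper).

    import numpy as np

    def reinforcement_signal(x_ref, x_actual, x_max):
        """Quadratic reinforcement signal of (2): always non-positive and
        zero only when every actual state matches its reference."""
        err = (x_ref - x_actual) / x_max          # normalised state errors
        return -float(np.sum(err ** 2))

    # e.g. speed, armature current and La*dIa/dt errors of the DC drive
    r = reinforcement_signal(np.array([100.0, 2.0, 0.0]),
                             np.array([ 90.0, 2.5, 0.1]),
                             np.array([200.0, 10.0, 5.0]))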

2.2 Generation of Uc(t)

The adaptation of the action network is done by indirectly back-propagating the error between the critic network output J(t) and a control signal called the ultimate control objective Uc(t). The ultimate control objective function


Uc(t) is a signal which can be defined according to the main control objective for the controller. Suppose the system has n state variables denoted by X_1, X_2, X_3, ..., X_n, and the multiobjective controller is to be applied so that the weighted quadratic error between m of the actual states and the corresponding desired reference states reaches a minimum. In such a case Uc(t) can be defined in the following way:

U_c(t) = -\sum_{i=1}^{m} \frac{1}{K_i} \left((X_i)_{error}\right)^2                         (3)

where (X_i)_{error} is the error between the actual ith state X_i and its desired value, and 1/K_i is the weight for the ith state error. This ultimate control objective function is flexible because it can be defined depending on the control actions required for the system. The weighting variable K_i helps to give different levels of importance to the different control actions in the case of a multiobjective controller.
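A minimal sketch of the weighted objective (3), assuming the per-state errors and the weights K_i are supplied as plain Python sequences (names chosen here for illustration only):

    def ultimate_control_objective(state_errors, weights_K):
        """Weighted quadratic objective Uc(t) of (3): each squared state
        error is scaled by 1/K_i and the total is negated."""
        return -sum((e ** 2) / k for e, k in zip(state_errors, weights_K))

    # With K = [2, 8, 16] this reproduces the weighting later used in (32)
    Uc = ultimate_control_objective([10.0, 0.5, 0.1], [2.0, 8.0, 16.0])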

2.3 Internal model

For both r(t) and Uc(t) the desired reference states X_i^* are needed. These desired reference states can either be given as external references to the controller (for an explicit multiobjective controller) or be derived internally and used for enhancement of the control actions. In the second case, the internal model works like an observer generating desired pseudostates from the controller output. One important characteristic is that when the differential equations of the system are explicitly known, they can be used in the internal model. But for a system where the system equations are unknown or complex, a simple offline-trained neural network model is preferable.

Fig. 4 Main components of NDP structure: a Critic; b Action

2.4 Critic network

The output of the critic network J(t) approximates the discounted total reward-to-go R(t) given by

R(t) = r(t+1) + \alpha r(t+2) + \alpha^2 r(t+3) + \cdots                                    (4)

where R(t) is the future accumulative reward-to-go value at time t, \alpha is a discount factor for the infinite-horizon problem (0 < \alpha < 1), and r(t+1) is the reinforcement signal value at time t+1. The critic network can be implemented with a simple multilayer feedforward neural network. The critic network has the n measured states, along with the action network output u(t), as inputs, and its output J(t) is an approximation of the cost function R(t). The nodes of the neural network can have either linear or nonlinear sigmoid functions. The three-layer feedforward neural network with one hidden layer shown in Fig. 4a is commonly used as the critic. The equations for the feedforward critic network are

J(t) = \sum_{i=1}^{N_h} W_{ci}^{(2)}(t)\, p_i(t)                                            (5)

p_i(t) = \frac{1 - e^{-q_i(t)}}{1 + e^{-q_i(t)}},   i = 1, 2, 3, ..., N_h (critic)          (6)

q_i(t) = \sum_{j=1}^{n+1} W_{cij}^{(1)}(t)\, x_j(t),   i = 1, 2, 3, ..., N_h (critic)       (7)

where N_h is the number of hidden nodes in the critic, q_i(t) the ith hidden node input of the critic network, p_i(t) the ith hidden node output, n+1 the total number of inputs into the critic network (including the action network output u(t) as one of them), W_{ci}^{(2)}(t) the weight of the connection between the ith hidden node and the output node of the critic network, W_{cij}^{(1)}(t) the weight of the connection between the jth input node and the ith hidden node, and x_j(t) the jth input to the critic network.

The temporal difference method is utilised in the critic training. The previous J values are stored and, together with the current J value, the temporal difference is obtained, along which a reinforcement signal r(t) is used to train the critic network. The prediction error for the critic network can be defined as

e_c(t) = \alpha J(t) - [J(t-1) - r(t)]                                                      (8)

and the objective function that the critic network minimises is

E_c(t) = \frac{1}{2} e_c^2(t)                                                               (9)

where J(t) is the output of the critic network (the J function) as an approximation of R(t), \alpha the discount factor in the range (0, 1), and r(t) the reinforcement signal at time t. The weight update rules for the critic network are based on the gradient-descent algorithm:

W_c(t+1) = W_c(t) + \Delta W_c(t)                                                           (10)

\Delta W_c(t) = l_c(t) \left[-\frac{\partial E_c(t)}{\partial W_c(t)}\right]                (11)

\frac{\partial E_c(t)}{\partial W_c(t)} = \frac{\partial E_c(t)}{\partial J(t)} \frac{\partial J(t)}{\partial W_c(t)}                                       (12)

where l_c(t) is the positive learning rate of the critic network at time t, and W_c is the weight vector of the critic network. For a three-layer feedforward neural network as shown in Fig. 4a, applying the chain rule to (11) and (12), the adaptation of the critic can be summarised as follows.

2.4.1 Weights of hidden to output layer:

\Delta W_{ci}^{(2)}(t) = l_c(t) \left[-\frac{\partial E_c(t)}{\partial W_{ci}^{(2)}(t)}\right]                                                              (13)

\frac{\partial E_c(t)}{\partial W_{ci}^{(2)}(t)} = \alpha\, e_c(t)\, p_i(t)                 (14)

2.4.2 Weights of input to hidden layer:

\Delta W_{cij}^{(1)}(t) = l_c(t) \left[-\frac{\partial E_c(t)}{\partial W_{cij}^{(1)}(t)}\right]                                                            (15)

\frac{\partial E_c(t)}{\partial W_{cij}^{(1)}(t)} = \alpha\, e_c(t)\, W_{ci}^{(2)}(t) \left[\frac{1}{2}\left(1 - p_i^2(t)\right)\right] x_j(t)              (16)

where l_c(t) is the positive learning rate for the critic network.
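For concreteness, the critic forward pass (5)-(7), the TD prediction error (8) and the weight updates (13)-(16) can be written compactly as below. This is an illustrative NumPy sketch following the paper's equations, not the authors' MATLAB implementation; all names are chosen freely.

    import numpy as np

    def bipolar(z):
        # bipolar sigmoid of (6), (17) and (19); saturates at +/-1
        return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

    def critic_forward(Wc1, Wc2, x):
        # x holds the n measured states plus the action output u(t), length n+1
        q = Wc1 @ x                # hidden node inputs, (7)
        p = bipolar(q)             # hidden node outputs, (6)
        return Wc2 @ p, p          # critic output J(t), (5)

    def critic_update(Wc1, Wc2, x, J_prev, r, alpha, lc):
        J, p = critic_forward(Wc1, Wc2, x)
        ec = alpha * J - (J_prev - r)                     # TD error, (8)
        dWc2 = -lc * alpha * ec * p                       # (13), (14)
        dWc1 = -lc * np.outer(alpha * ec * Wc2 * 0.5 * (1.0 - p**2), x)  # (15), (16)
        return Wc1 + dWc1, Wc2 + dWc2, J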

2.5 Action network

The action network generates the desired plant control based on the measurements of the plant states and operates as the actual controller for the system. In a similar way to the critic network, the action network can be implemented with a standard multilayer linear or nonlinear feedforward neural network. The inputs to the action network are the n measured system states and the output is the control action u(t). For a multiobjective controller the control-space dimension defines the number of action network outputs. The three-layer feedforward neural network with one hidden layer shown in Fig. 4b is commonly used as the action network. The equations of the feedforward action network are

u(t) = \frac{1 - e^{-v(t)}}{1 + e^{-v(t)}}                                                  (17)

v(t) = \sum_{i=1}^{N_h} W_{ai}^{(2)}(t)\, g_i(t)                                            (18)

g_i(t) = \frac{1 - e^{-h_i(t)}}{1 + e^{-h_i(t)}},   i = 1, 2, 3, ..., N_h (action)          (19)

h_i(t) = \sum_{j=1}^{n} W_{aij}^{(1)}(t)\, x_j(t),   i = 1, 2, 3, ..., N_h (action)         (20)

where N_h is the number of hidden nodes in the action network, v(t) the input to the output node of the action network, u(t) the output of the action network, h_i(t) the ith hidden node input, g_i(t) the output of the ith hidden node, n the number of inputs to the action network, i.e. the number of states of the system, W_{ai}^{(2)}(t) the weight of the connection between the ith hidden node and the output node of the action network, W_{aij}^{(1)}(t) the weight of the connection between the jth input node and the ith hidden node, and x_j(t) the jth input to the action network.

The adaptation of the action network is to back-propagate the error between the desired ultimate objective Uc(t) and the cost function R(t). If the explicit cost function is available, the actual cost function R(t) is used. When the critic network is used to approximate R(t), the critic output J(t) is used instead of R(t); in this case, back-propagation is done through the critic network. The weights of the action network are updated to minimise the following performance error:

E_{ac}(t) = \frac{1}{2} e_{ac}^2(t)                                                         (21)

e_{ac}(t) = J(t) - U_c(t)                                                                   (22)

where e_{ac}(t) is the prediction error for the action network and E_{ac}(t) is the objective function that the action network tries to minimise. The weight update rules for the action network are also based on the gradient-descent algorithm:

W_a(t+1) = W_a(t) + \Delta W_a(t)                                                           (23)

\Delta W_a(t) = l_a(t) \left[-\frac{\partial E_{ac}(t)}{\partial W_a(t)}\right]             (24)

\frac{\partial E_{ac}(t)}{\partial W_a(t)} = \frac{\partial E_{ac}(t)}{\partial J(t)} \frac{\partial J(t)}{\partial u(t)} \frac{\partial u(t)}{\partial W_a(t)}                     (25)

where l_a(t) is the positive learning rate of the action network at time t, and W_a is the weight vector of the action network. For a three-layer feedforward action network as shown in Fig. 4b, the chain rule is applied to (24) and (25) to obtain the following equations.

2.5.1 Weights of hidden to output layer:

\Delta W_{ai}^{(2)}(t) = l_a(t) \left[-\frac{\partial E_{ac}(t)}{\partial W_{ai}^{(2)}(t)}\right]                                                           (26)

\frac{\partial E_{ac}(t)}{\partial W_{ai}^{(2)}(t)} = e_{ac}(t) \left[\frac{1}{2}\left(1 - u^2(t)\right)\right] g_i(t) \sum_{i=1}^{N_h} \left[W_{ci}^{(2)}(t)\, \frac{1}{2}\left(1 - p_i^2(t)\right) W_{ci,n+1}^{(1)}(t)\right]                               (27)

2.5.2 Weights of input to hidden layer:

\Delta W_{aij}^{(1)}(t) = l_a(t) \left[-\frac{\partial E_{ac}(t)}{\partial W_{aij}^{(1)}(t)}\right]                                                         (28)

\frac{\partial E_{ac}(t)}{\partial W_{aij}^{(1)}(t)} = e_{ac}(t) \left[\frac{1}{2}\left(1 - u^2(t)\right)\right] W_{ai}^{(2)}(t) \left[\frac{1}{2}\left(1 - g_i^2(t)\right)\right] x_j(t) \sum_{i=1}^{N_h} \left[W_{ci}^{(2)}(t)\, \frac{1}{2}\left(1 - p_i^2(t)\right) W_{ci,n+1}^{(1)}(t)\right]                                                   (29)

where p_i(t) is the ith hidden node output from the critic network and l_a(t) the positive learning rate for the action network.

Normalisation is performed in both the action and critic networks to confine the values of the weights to an appropriate range:

W_c(t+1) = \frac{W_c(t) + \Delta W_c(t)}{\| W_c(t) + \Delta W_c(t) \|_1}                    (30)

W_a(t+1) = \frac{W_a(t) + \Delta W_a(t)}{\| W_a(t) + \Delta W_a(t) \|_1}                    (31)
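A matching sketch of the action network adaptation, back-propagating the error (22) through the frozen critic as in (26)-(29) and then normalising the weights as in (30)-(31). It reuses bipolar and critic_forward from the critic sketch above; the per-matrix 1-norm normalisation is one possible reading of (30)-(31), and all names are illustrative.

    import numpy as np

    def action_forward(Wa1, Wa2, x):
        # x holds the n measured system states
        h = Wa1 @ x                 # hidden node inputs, (20)
        g = bipolar(h)              # hidden node outputs, (19)
        return bipolar(Wa2 @ g), g  # control action u(t), (17)-(18)

    def action_update(Wa1, Wa2, Wc1, Wc2, x, J, Uc, la):
        u, g = action_forward(Wa1, Wa2, x)
        eac = J - Uc                                      # (22)
        # dJ/du back-propagated through the critic (u is the last critic input)
        _, p = critic_forward(Wc1, Wc2, np.append(x, u))
        dJ_du = np.sum(Wc2 * 0.5 * (1.0 - p**2) * Wc1[:, -1])
        common = eac * dJ_du * 0.5 * (1.0 - u**2)
        dWa2 = -la * common * g                                      # (26), (27)
        dWa1 = -la * np.outer(common * Wa2 * 0.5 * (1.0 - g**2), x)  # (28), (29)
        Wa1, Wa2 = Wa1 + dWa1, Wa2 + dWa2
        Wa1 /= np.sum(np.abs(Wa1))   # weight normalisation, (30)-(31)
        Wa2 /= np.sum(np.abs(Wa2))
        return Wa1, Wa2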

Following this description of all the component blocks of the NDP controller, it is now possible to understand the interaction between these components in the controller. In the proposed online control design the controller is 'naive' at the beginning, as both the action network and the critic network are randomly initialised in their weights. To speed up convergence the action network can also be pretrained with static data from the system before starting the controller. Once the system with the controller is running, the system states are acquired and the signals r(t) and Uc(t) are calculated. The NDP controller then starts training the critic network while

the action network weights are kept constant. After training is complete the critic network weights are held at their final values. Then the controller starts training the action network. When the action training is complete the action network weights are frozen and the new control signal is generated by feeding the states forward through the action network with those final weights. With this new control signal the critic network is trained again, followed by action network training. This critic-action training process is repeated a few times to ensure proper adaptation of both the action and the critic network, after which the system is run with the final control signal from the action network, new states are observed, and the whole process is repeated for those new state values. The detailed operation of the controller is given in the flowcharts of Section 4, where an application of the controller to command the speed of a permanent-magnet DC motor is presented.

3 System description: Permanent-magnet DC motor

The NDP controller is designed to drive the speed of a permanent-magnet DC motor. A simple model for the DC motor is developed. The input to the system is the voltage V_a(t) and the outputs are the actual speed \hat{\omega} and the actual current \hat{I}_a; the signal \hat{L}_a(dI_a/dt) is also made available. The DC machine parameters used in the simulations are given in Table 1. The DC motor is the chosen plant because it is a simple nonlinear system (considering inductance saturation and parameter variation) that enables our claims about the nonlinear control capability of the proposed controller to be verified. Although the controller is designed for speed control, a simple modification of the ultimate control objective Uc(t) can be used to obtain a better transient response in armature current and a bounded back-EMF. With this system it can easily be verified that the NDP controller works under various operating conditions, parametric variations and load disturbances.

4 Implementation of NDP controller

For this particular system the ultimate control objective Uc(t) is defined as

U_c(t) = -\frac{1}{2}\left((\omega)_{er}\right)^2 - \frac{1}{8}\left((I_a)_{er}\right)^2 - \frac{1}{16}\left(\left(L_a \frac{dI_a}{dt}\right)_{er}\right)^2                          (32)

where (\omega)_{er} is the difference between the actual speed and the reference speed, (I_a)_{er} the difference between the actual and reference armature current, and (L_a dI_a/dt)_{er} the difference between the actual and reference L_a(dI_a/dt). The main objective is to control the speed of the DC motor such that the error in speed is minimised. The minimisation of the error in armature current and in the rate of change of armature current is also implemented, as shown in (32).

Table 1: Parameters of the permanent-magnet DC motor

    Parameter   Value      Parameter   Value
    Ra          1.78 Ω     Kt·φ        0.935
    La          0.03 H     Kv·φ        0.935

Kt is the torque constant; Kv is the voltage constant; φ is the machine internal flux in Weber; Ra is the motor armature resistance; La is the motor armature inductance.

Inputs to the critic network are the state variables \hat{\omega}, \hat{I}_a, \hat{L}_a dI_a/dt and the output of the action network V_a(t). The output of the critic is the cost function J(t). Inputs to the action network are the actual state variables. The output of this network is the control signal V_a(t), i.e. the voltage for the PWM converter of the DC motor or directly the terminal voltage of the DC motor. The internal model has two inputs, V_a(t) and \hat{\omega}; its outputs are the ideal current I_a and L_a(dI_a/dt). The governing equation of the system, assuming constant field excitation (as for a permanent-magnet DC motor), is as follows:

L_a J \ddot{\omega} + L_a (B + K) \dot{\omega} + (K_t \phi)(K_v \phi)\,\omega = K_t \phi (V_a - I_a R_a)                                                   (33)

where J is the moment of inertia of the motor, B the damping constant for the system, K the load torque constant, i.e. T_L = K\omega, R_a and L_a the machine armature resistance and inductance, V_a the machine terminal voltage (here equal to the controller output V_a), I_a the armature current, and \omega the rotor speed in rad/s.

The DC motor model and all other models used are implemented in Simulink, and the programs are written in Matlab version 6.1. Initially the self-tuning controller has no prior knowledge about the plant, only the online measurements. The controller is actually the feedforward action network whose weights are updated on the fly. The variables speed, armature current and rate of change of armature current are scaled to between -0.9 and 0.9 because the neural network layer transfer functions are bipolar. The signals r(t) and Uc(t) are scaled between -0.9 and 0 because, in accordance with the control formulation presented here, they are always negative. The action network output is scaled back to the actual range. The flowcharts for the complete procedure and the individual component training are given in Figs. 5-7. The important training parameters are given in Table 2. The learning rate of both the action and critic networks is increased by a factor of two if the corresponding training error decreases and is decreased by a factor of 0.8 if the training error increases. A variable-step ODE solver with a maximum step size limited to 10 ms is used.
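The alternating critic/action training and the learning-rate adaptation described above (rate doubled while the training error falls, multiplied by 0.8 when it rises, limited below by the minimum of Table 2) might be organised as in the following sketch. It builds on the critic_update, action_update and action_forward sketches given earlier; the cycle count and helper names are illustrative assumptions, not a transcription of the published flowcharts.

    import numpy as np

    def adapt_rate(rate, err, prev_err, rate_min=0.01):
        # x2 if the training error decreased, x0.8 otherwise, never below rate_min
        return max(rate * 2.0 if err < prev_err else rate * 0.8, rate_min)

    def train_one_interval(Wc1, Wc2, Wa1, Wa2, x, r, Uc, J_prev,
                           alpha=0.95, lc=0.4, la=0.3, cycles=5):
        """One control interval: alternate critic and action training a few
        times, then return the final control signal (illustrative sketch)."""
        for _ in range(cycles):
            u, _ = action_forward(Wa1, Wa2, x)            # action frozen
            Wc1, Wc2, J = critic_update(Wc1, Wc2, np.append(x, u),
                                        J_prev, r, alpha, lc)
            Wa1, Wa2 = action_update(Wa1, Wa2, Wc1, Wc2, x, J, Uc, la)
        u_final, _ = action_forward(Wa1, Wa2, x)
        return u_final, Wc1, Wc2, Wa1, Wa2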

4.1 Design of correction module

For a closed-loop NDP controller the nominal control position (sometimes referred to as the trim control position [21, 22]) has to be scheduled as a function of system states and environmental parameters [21]. Previous works on NDP were successful because they were tested on systems (like the inverted pendulum) having zero trim requirements [20]. Finding a trim control position is actually an open-loop solution to the control problem, giving good insight into forming a closed-loop solution. It also helps to facilitate realistic simulations where one often wants to start with an arbitrary initial condition rather than controlling the DC machine while it is running at a particular state [21]. A simple but effective method for determining the trim control position for the DC machine speed control is implemented here; it is called the correction module as it corrects the output of the controller. The correction module consists of two simple offline-trained neural networks. The first neural network takes the scaled reference speed and the scaled armature current Ia as inputs and generates the terminal voltage as output, whereas the second one takes the scaled actual speed and the scaled armature current Ia as inputs and generates the terminal voltage as output.

Fig. 5 Flowchart for complete process
Fig. 6 Flowchart for critic training program
Fig. 7 Flowchart for action training program

Table 2: Summary of important training parameters

    Parameter                                    Value
    Critic: initial learning rate                0.4
    Critic: minimum learning rate                0.01
    Critic: discount factor                      0.95
    Critic: number of nodes in hidden layer      10
    Critic: maximum training cycles              no. of patterns
    Critic: threshold error for training         0.01
    Action: initial learning rate                0.3
    Action: minimum learning rate                0.01
    Action: number of nodes in hidden layer      10
    Action: maximum training cycles              no. of patterns
    Action: threshold error for training         0.01

The outputs of these networks do not need to be optimised, so the target for training can be easily obtained, i.e. the system voltage from the machine running with a suboptimal PI speed controller. The training is conducted offline using the 'nntool' GUI available in Matlab 6.1.

The reason for using two neural networks is that the first one alone works well in most operating conditions except when there is a load torque disturbance. With a load torque increase the armature current increases, as there is no explicit armature current controller (although the NDP system can optimise the armature current in a particular trim position), so with the same reference speed the correction module generates higher trim values than expected. The second neural network takes care of this variation because, with load changes, when the armature current changes the speed also tries to change. In addition, the second neural network captures the dynamics needed to generate the proper trim control position. If only the second neural network were used as the correction module, the speed would settle below the reference speed, because without a reference speed input the system would not generate the proper trim position corresponding to that reference speed. To decide which neural network is to be used, some simple conditional statements are employed. The correction module is portrayed in the block diagram of Fig. 8. The scaling gain blocks are used to denormalise the outputs of the neural networks back to their actual range. They are then low-pass filtered. The transfer function of this filter is based on the mechanical pole of the DC machine:

G(s) = \frac{K/J}{s + K/J}                                                                  (34)

where K is the load torque constant defined by T_L = K\omega. The decision block compares the actual speed with the reference speed to decide which neural network will act. Figure 9 shows the reference speed profile used from the training data set.
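In code the correction module of Fig. 8 reduces to a selection between two offline-trained networks followed by the scaling gain and the first-order filter (34). The sketch below is illustrative only: the two network objects, the selection threshold and the forward-Euler discretisation of the filter are assumptions, not details given in the paper.

    def trim_voltage(nn_ref, nn_act, w_ref, w_act, ia, st):
        """One evaluation of the correction (trim) module; nn_ref and nn_act
        are offline-trained networks returning a scaled voltage."""
        # decision block: fall back to the actual-speed network when the speed
        # is pulled away from the reference, e.g. by a load torque disturbance
        if abs(w_act - w_ref) > st['threshold']:
            v_scaled = nn_act(w_act / st['w_max'], ia / st['ia_max'])
        else:
            v_scaled = nn_ref(w_ref / st['w_max'], ia / st['ia_max'])
        v = v_scaled * st['v_max']                  # scaling gain (denormalise)
        a = st['K'] / st['J'] * st['dt']            # filter of (34), G(s) = (K/J)/(s + K/J)
        st['v_filt'] += a * (v - st['v_filt'])      # forward-Euler low-pass update
        return st['v_filt']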

It is very simple to generate the correction module without changing the training algorithm of the main ACD network. Before designing the controller for a specific application, the correction module has to be trained offline with the static data from that particular system. Once trained offline, there is no need to change its weights during the online training phase.

Fig. 8 Block diagram of correction module showing all components involved in the design

5 Simulation study with NDP controller

The NDP controller was tested at different operating conditions and under different parameter variations, subjected to load torque variations. Some of these results are given along with those of the PI-controlled DC machine to allow a good comparison. The permanent-magnet DC motor transfer function is obtained with the parameters of the motor given in Table 1. Some additional parameters used are J (moment of inertia of the motor) = 0.0012 kg·m², load torque proportional to ω, i.e. T_L = Kω, and K (load torque constant) = 0.2 Nm·s. With these parameters the transfer function for the DC motor is

G_{\omega,V}(s) = \frac{0.935}{(3.54 \times 10^{-5})s^2 + (0.0081)s + 1.23}                 (35)

Fig. 9 Reference speed profile from data set used for offline training of correction module neural networks
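The coefficients of (35) follow from (33) and the parameter values quoted above (with the damping B neglected); the short check below sketches that arithmetic and is not code from the paper.

    # w(s)/Va(s) implied by (33) with B ~ 0:
    #   Kt*phi / (La*J s^2 + (Ra*J + La*K) s + (Ra*K + Kt*phi*Kv*phi))
    Ra, La, Ktphi, Kvphi = 1.78, 0.03, 0.935, 0.935   # Table 1
    J, K = 0.0012, 0.2                                # additional parameters above

    num = Ktphi
    den = (La * J, Ra * J + La * K, Ra * K + Ktphi * Kvphi)
    print(num, den)   # ~0.935 and ~(3.6e-5, 0.0081, 1.23), close to (35)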

It is evident from Fig. 10b that without any controller the system has almost 55% steady-state error, as expected. The overshoot is also quite high, of the order of 20%. The steady-state error as well as the overshoot can be eliminated by implementing a simple PI controller. Using the 'sisotool' GUI in MATLAB 6.1, an optimal PI controller is designed to obtain a settling time of less than 0.1 s in addition to zero steady-state error and no overshoot. The response of the PI-controlled system is given in Fig. 11. The PI

Fig. 10 DC motor without controller: a Root locus; b Step response

Fig. 11 PI-controlled DC motor: a Closed-loop poles in root locus; b Step response


Fig. 12 Step response of PI-controlled DC motor at a different operating point (K = 0.02 Nm·s)

controller parameters are KP = 0.01 and KI = 50. This PI controller is optimally tuned with the machine parameters given in Table 1. The additional parameters used are J = 0.0012 kg·m² and K = 0.2 Nm·s. But when the operating conditions change this simple PI controller is no longer the optimal one. This is clear from Fig. 12, where the load torque constant K is changed to 0.02 Nm·s. All the following simulation studies were done at operating conditions that differ from the condition for which the PI controller was optimised, i.e. K = 0.02 Nm·s is used.
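For reference, the benchmark PI speed controller (KP = 0.01, KI = 50) can be realised in discrete time as below; the sample time is an illustrative choice, not a value specified in the paper.

    class PIController:
        """Discrete PI controller: Va = KP*e + KI*integral(e) (illustrative)."""
        def __init__(self, kp=0.01, ki=50.0, dt=1e-4):
            self.kp, self.ki, self.dt = kp, ki, dt
            self.integral = 0.0

        def step(self, w_ref, w_actual):
            e = w_ref - w_actual
            self.integral += e * self.dt
            return self.kp * e + self.ki * self.integral   # terminal voltage command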

5.1 Speed and load torque change

In this simulation study the speed step-changes from 0 to 100 rad/s and then to 150 rad/s at time 0.5 s, along with an additional step load torque change from 0 to 1 Nm at 1.35 s. To test robustness the simulation studies were conducted at

Fig. 13 Speed profile of PI and NDP controller with step speed and load torque changes

Fig. 14 Armature current profile of PI and NDP controller with step speed change and step load torque change

operating conditions different from the condition for which the PI controller was optimised. From Fig. 13 it can be seen that the speed response of the NDP controller is much better than that of the PI controller. There is no overshoot in speed for a step change in speed with the NDP controller. Again, the oscillation in speed is smaller for the NDP controller when there is a step change in torque. Figure 14 shows that the armature current ripple at standstill and at the step change in speed is much smaller with the NDP controller than with the PI controller. With the torque change, though, the current ripple with the NDP controller is slightly larger than that with the PI controller; however, the oscillation dies out at a much faster rate for the NDP controller.

5.2 Parameter variation analysis

In this case study a step change in speed was imposed from 0 to 100 rad/s and then to 150 rad/s at time 0.5 s. The additional load torque step changed from 0 to 1 Nm at 1.5 s and then returned to 0 at 2.19 s. Also considered here are a 70% increase of the armature resistance from its nominal value due to heating and a change in armature inductance with current due to saturation. This case was also run at operating conditions different from the condition for which the PI controller is optimised. The parametric variations were incorporated to prove the robustness of the self-tuning NDP controller. In Fig. 15 the parameter variations of the DC machine are shown. Figure 16 shows the load torque disturbance. From Figs. 17-20 it can be seen that the speed and current responses with the NDP controller are much better than those with the PI controller, both in the transient and at steady state. In all the simulations a simple PI controller was used for comparison with the NDP-based adaptive controller. As the DC

Fig. 15 Simulated armature resistance change to 70% of nominal value with time, and armature inductance change with armature current

Fig. 16 Load torque changes from 0 to 1 Nm at 1.5 s and then returns to 0 at 2.19 s

Fig. 17 Speed profile of PI and NDP controller with step speed change, periodic load torque change, and armature resistance and inductance variation

Fig. 18 Magnified speed profile of PI and NDP from Fig. 17 to show comparison between controllers when load torque changes

motor model is well known, it is easy to implement an adaptive PI controller to get better results than a simple PI controller. But in general most of these adaptive controller design techniques, such as gain scheduling, require some form of system identification based on a parameterised model, with a set of design equations relating controller parameters to plant parameters [23]. With such techniques it is necessary to impose a model structure on the system, which introduces approximation even when best-fit parameters for such models are available. Furthermore, in some cases the complexity of the system may make the modelling infeasible, a typical problem in all model-reference adaptive control approaches. On the other hand, some PI tuning techniques, such as tuning based on unfalsified control theory, can work without plant models. But they are complex to implement and have the limitation that the set of unfalsified controllers may shrink to a null set if there are no PID controllers capable of meeting the performance specification [24]. In contrast, the NDP controller is robust to parameter variations and load disturbances. It is simpler to implement than most optimal controllers. It does not need any plant model and can be designed to have either explicit or implicit multiobjective control actions, which in turn makes the controller very flexible. The NDP controller can be very useful for complex MIMO systems because it can handle continuous states and a large number of discrete states owing to the use of gradient information instead of a search algorithm for finding the actions. The closed-loop stability analysis of the NDP controller is a topic of ongoing research and is outside the scope of this paper. With detailed analysis it

Fig. 19 Armature current profile of PI and NDP controller with step speed change, periodic load torque change, armature resistance and inductance variation

Fig. 20 Starting current profile of PI and NDP controller with step speed change, periodic load torque change, armature resistance and inductance variation

can be shown that the back-propagation-based weight adaptation of the critic and action networks ensures that the weight estimation errors are uniformly ultimately bounded [7]. The normalisation of the neural network weights in (30) and (31), the scaling of the input variables to the controller, and the use of bipolar layer transfer functions (6), (17) and (19) that saturate at ±1 help to keep the controller output within the acceptable range. Furthermore, the correction module acts as a feedforward controller to enhance the stability of the closed-loop system.

6 Conclusions

A step-by-step procedure for designing an NDP controller has been fully described, analysed and implemented. A DC machine was used as the example plant, but the approach can be extended to other systems. The results from the NDP-controlled system were compared with those of a PI-based system, corroborating the validity of the algorithms used and the superiority of the controller. This work included a novel technique for finding the trim control position by means of a correction module. The results verified that the NDP controller was able to successfully control a nonlinear system under various operating conditions, parametric variations and load disturbances. A noticeable feature is the multiobjective capability of controlling the machine speed with a better response for armature current and a bounded back-EMF. NDP therefore appears to be a good candidate for controlling nonlinear MIMO systems. Future work will concentrate on memory buffer size, network size and a detailed stability analysis.

7 Acknowledgments

The authors are grateful to the National Science Foundation (NSF) for their continuing support of this work.

8 References

1  Werbos, P.J.: 'Advanced forecasting methods for global crisis warning and models of intelligence', in 'General Systems Yearbook', 1977, Vol. 22, pp. 25–38
2  Bertsekas, D.P., and Tsitsiklis, J.N.: 'Neuro-dynamic programming' (Athena Scientific, Belmont, MA, 1996)
3  Lendaris, G.G., and Paintz, C.: 'Training strategies for critic and action neural networks in dual heuristic programming method'. Proc. 1997 IEEE Int. Conf. on Neural Networks, Houston, TX, June 1997, pp. 712–717
4  Prokhorov, D.V.: 'Adaptive critic designs and their applications'. PhD dissertation, Texas Tech University, Lubbock, TX, 1997
5  Prokhorov, D.V., Santiago, R.A., and Wunsch, D.C.: 'Adaptive critic designs: a case study for neurocontrol', Neural Netw., 1995, 8, pp. 1367–1372
6  Prokhorov, D.V., and Wunsch, D.C.: 'Adaptive critic designs', IEEE Trans. Neural Netw., Sept. 1997, 8, pp. 997–1007
7  Lewis, F.W., Campos, J., and Selmic, R.: 'On adaptive critic architectures in feedback control'. Presented at the IEEE Conf. on Decision and Control, Phoenix, AZ, Dec. 1999, pp. 5–10
8  Werbos, P.J.: 'A menu of designs for reinforcement learning over time', in Miller, W.T., Sutton, R.S., and Werbos, P.J. (Eds.): 'Neural networks for control' (MIT Press, Cambridge, MA, 1990), Chapter 3
9  Werbos, P.J.: 'Approximate dynamic programming for real-time control and neural modeling', in White, D.A., and Sofge, D.A. (Eds.): 'Handbook of intelligent control: neural, fuzzy, and adaptive approaches' (Van Nostrand Reinhold, New York, NY, 1992), Chapter 13
10 Ferrari, S., and Stengel, R.F.: 'An adaptive critic global controller'. Presented at the American Control Conf., May 2002
11 Shannon, T.T., and Lendaris, G.G.: 'Adaptive critic based design of a fuzzy motor speed controller'. Proc. IEEE Int. Symp. on Intelligent Control, Mexico City, Mexico, Sept. 2001, pp. 359–363
12 Bryson, A.E.: 'Dynamic optimization' (Addison-Wesley-Longman, Menlo Park, CA, 1999)
13 Bellman, R.E.: 'Dynamic programming' (Princeton Univ. Press, Princeton, NJ, 1957)
14 Dreyfus, S.E., and Law, A.M.: 'The art and theory of dynamic programming' (Academic Press, New York, NY, 1977)
15 Werbos, P.J.: 'Neurocontrol and supervised learning: an overview and evaluation', in White, D.A., and Sofge, D.A. (Eds.): 'Handbook of intelligent control' (Van Nostrand Reinhold, New York, NY, 1992)
16 Liu, D.: 'Adaptive critic designs for problems with known analytical form of cost function'. Proc. INNS-IEEE Int. Joint Conf. on Neural Networks, Honolulu, HI, May 2002, pp. 1808–1813
17 Liu, D., Xiong, X., and Zhang, Y.: 'Action dependent adaptive critic designs'. Proc. INNS-IEEE Int. Joint Conf. on Neural Networks, Washington, DC, July 2001, pp. 990–995
18 Barto, A.G., Sutton, R.S., and Anderson, C.W.: 'Neuronlike adaptive elements that can solve difficult learning control problems', IEEE Trans. Syst. Man Cybern., 1983, 13, pp. 834–847
19 Sutton, R.S.: 'Learning to predict by the methods of temporal differences', Mach. Learn., 1988, 3, pp. 9–44
20 Si, J., and Wang, Y.-T.: 'Online learning control by association and reinforcement', IEEE Trans. Neural Netw., 2001, 12, pp. 264–276
21 Enns, R., and Si, J.: 'Helicopter flight control design using a learning control approach'. Proc. 39th IEEE Conf. on Decision and Control, Sydney, Australia, Dec. 2000, pp. 1754–1759
22 Enns, R., and Si, J.: 'Helicopter tracking control using direct neural dynamic programming'. Proc. Int. Joint Conf. on Neural Networks, 2001, Vol. 2, pp. 1019–1024
23 Ringwood, J.V., and O'Dwyer, A.: 'A frequency-domain based self-tuning PID controller'. Proc. Asian Control Conf., Tokyo, Japan, July 1994, pp. 331–334
24 Jun, M., and Safonov, M.G.: 'Automatic PID tuning: an application of unfalsified control'. Proc. IEEE Symp. on CACSD, Hawaii, August 1999