2016 4th International Conference on Control, Instrumentation, and Automation (ICCIA) 27-28 January 2016, Qazvin Islamic Azad University, Qazvin, Iran

Adaptive Optimal Control via Reinforcement Learning for Omni-Directional Wheeled Robots

Arash Sheikhlar
Department of Electrical, Computer and Biomedical Engineering, Islamic Azad University, Qazvin Branch, Qazvin, Iran
E-mail: [email protected]

Ahmad Fakharian
Department of Electrical, Computer and Biomedical Engineering, Islamic Azad University, Qazvin Branch, Qazvin, Iran
E-mail: [email protected]

Abstract—The main problem of wheeled soccer robots is regulating the low-level controller gains, particularly during competition. The low-level control task is to track the desired angular velocities of the robot wheels, which are generated by the high-level controller. Since the robot's model and environment carry many uncertainties, the gains of a traditional controller must be re-tuned before every match of a competition. In this paper, a linear quadratic tracking (LQT) scheme is designed to solve this problem. The controller regulates its parameters on-line via a policy iteration reinforcement learning algorithm. The output paths of the four-wheeled soccer robot under the adaptive LQT are compared with those of the traditional LQT, and the results show that the proposed method provides superior performance in the presence of uncertainties and nonlinearities.

Keywords—reinforcement learning; omni-directional robot; linear quadratic tracking control; optimal control; adaptive control

I. INTRODUCTION

The omni-directional wheeled robot is a type of mobile robot used in many studies because of its maneuverability, reasonable accuracy and swiftness [1, 2]. The robot's omni-directional wheels allow it to move simultaneously and independently along different paths. Robots with motorized wheels are used in much research and in RoboCup competitions (such as the Small Size League), since they provide a fine real-life example of multi-agent robot collaboration. The major issue for a number of traditional control systems is compensating for uncertainties and the nonlinearities of the robot's model. In addition, one of the significant difficulties for robot teams is the manual tuning of the low-level controller gains before every match of a competition. The reason is that when the parameters of the robot's plant or of the playing surface change, conventional controllers cannot maintain good responses. The plant's parameters can change because of motor heating; on the other hand, the surface parameters can be altered by the robot wheels, which may damage the soccer field's carpet during the competition. These are some of the uncertainties in a Small Size League (SSL)


soccer robot's platform. Consequently, an automatic method that adapts the controller parameters to alterations of the system's parameters is a sensible strategy.

In [3], a torque controller is implemented on an omni-directional four-wheeled robot to solve the tuning problem, but a significant drawback appears when the robot's motors get hot during motion. Several control approaches and optimal trajectory planning methods for omni-directional wheeled robots have been reported in [4-6]. In [7], a fuzzy adaptive PI controller has been tested in simulations. In [8], a delay-compensating fuzzy tracking algorithm has been developed for these robots. Some tracking control structures have formerly been used for speed control in [9, 10].

In this paper, an adaptive optimal tracking controller is designed based on a policy iteration reinforcement learning algorithm to compensate for the uncertainties of a wheeled omni-directional robot. For this purpose, a linear quadratic tracking (LQT) controller is used to track the desired set point of the wheels' angular velocities while satisfying an energy cost function with minimum control effort. Because these robots have many unknown dynamics and some of their parameters can change, reinforcement learning is employed to compensate for the alterations on-line. The learning algorithm employs the policy iteration approach, which has two steps: policy evaluation and policy improvement. The method is implemented as the low-level controller of the four-wheeled omni-directional soccer robot and is designed based on a linearized dynamic model of the robot, whereas the real model is completely nonlinear.

This paper is divided into five sections. The first section includes the introduction and problem description. The second section deals with the mathematical model of the robot. Section 3 explains the methodology of the designed controller. Section 4 demonstrates the simulation results. Finally, Section 5 concludes the paper.


II. MATHEMATICAL MODEL

A four-wheeled omni-directional mobile robot has well-known kinematic and dynamic models. The four wheels yield a four-motor robotic structure; the motors are 50 W brushless DC (BLDC) motors, and each motor carries an encoder that measures the angular velocity of the related wheel. Figure 1(a) shows an MRL Small Size League (SSL) robot as a four-wheeled omni-directional soccer robot.

Figure 1: (a) MRL SSL robot (b) mobile and global frames

A. Kinematic modelling

In this subsection, the kinematics of the four-wheeled omni-directional robot is presented. The robot has two frames: a mobile frame and a global frame. As shown in figure 1(b), the position vector in the two frames is

$$ X_m = \begin{bmatrix} x_m & y_m & \varphi \end{bmatrix}^T, \qquad X_g = \begin{bmatrix} x_g & y_g & \varphi \end{bmatrix}^T \tag{1} $$

To transform from the mobile frame to the global frame, the rotation matrix

$$ {}^{g}_{b}W = \begin{bmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2} $$

is employed. Taking the time derivative of the position vector and using the wheels' angular velocities together with the geometric matrix $H$ yields the velocity vector

$$ \dot{X}_g = {}^{g}_{b}W \, \dot{X}_m = {}^{g}_{b}W \, (H^T)^{-1} r_r \, \omega_L \tag{3} $$

where $r_r$ represents the robot's radius, $\omega_L$ is the vector of wheel angular velocities, and the geometric matrix $H$ is

$$ H = \begin{bmatrix} -\sin\theta_1 & -\sin\theta_2 & -\sin\theta_3 & -\sin\theta_4 \\ \cos\theta_1 & \cos\theta_2 & \cos\theta_3 & \cos\theta_4 \\ l & l & l & l \end{bmatrix} \tag{4} $$

Here $l$ is the wheel base and $\theta_i$ is the angle of wheel $i$ with respect to the local coordinates. Differentiating equation (3) gives the robot's acceleration:

$$ \ddot{X}_g = {}^{g}_{b}\dot{W} (H^T)^{-1} r_r \, \omega_L + {}^{g}_{b}W (H^T)^{-1} r_r \, \dot{\omega}_L \tag{5} $$
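As a concrete illustration of equations (2)-(4), the following Python sketch maps measured wheel angular velocities to the global-frame velocity. The wheel angles and geometry values are placeholder assumptions, not the MRL robot's actual parameters, and since $H^T$ is a non-square 4x3 matrix, the notational inverse $(H^T)^{-1}$ of equation (3) is implemented as a Moore-Penrose pseudo-inverse.

```python
import numpy as np

# Placeholder geometry (assumed values, not the MRL robot's actual ones).
theta = np.deg2rad([45.0, 135.0, 225.0, 315.0])  # wheel angles w.r.t. local frame
l = 0.09    # wheel base (m), assumed
r_r = 0.03  # radius scaling wheel speed to linear speed (m), assumed

# Geometric matrix H of equation (4): one column per wheel.
H = np.array([-np.sin(theta),
              np.cos(theta),
              np.full(4, l)])            # shape (3, 4)

def rot_W(phi):
    """Rotation matrix W of equation (2), mobile frame -> global frame."""
    return np.array([[np.cos(phi), -np.sin(phi), 0.0],
                     [np.sin(phi),  np.cos(phi), 0.0],
                     [0.0,          0.0,         1.0]])

def global_velocity(omega_L, phi):
    """Equation (3): Xdot_g = W(phi) (H^T)^+ r_r omega_L.
    H^T is 4x3, so a pseudo-inverse replaces the notational inverse."""
    Xdot_m = np.linalg.pinv(H.T) @ (r_r * omega_L)  # mobile-frame velocity
    return rot_W(phi) @ Xdot_m

# With these symmetric wheel angles, equal wheel speeds rotate the robot
# in place: vx = vy = 0 and phidot = r_r * omega / l.
print(global_velocity(np.array([10.0, 10.0, 10.0, 10.0]), phi=0.2))
```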

B. Dynamic modelling

The first step in understanding the behavior of a robot is extracting its dynamic model. The kinematic model does not account for the robot's volume and inertia, so kinematics alone cannot explain the robot's motion behavior from its inputs and outputs. Among the available methods for deriving robot dynamics, Newton's second law is used here. Motor dynamic constraints and velocity-dependent friction are the most significant nonlinearities of a wheeled mobile robot, and they affect its behavior during acceleration and deceleration. The goal of the following expressions is to relate the driver's torque, actuating voltage, angular velocity and angular acceleration. The exerted force matrix in the moving coordinates and the traction force matrix are related by

$$ F_m = H F_t \tag{6} $$

Combining equations (2) and (6) with Newton's second law and the mass matrix $M$ gives

$$ F_g = M \ddot{X}_g = {}^{g}_{b}W H F_t \tag{7} $$

$$ M = \mathrm{diag}(m, m, m, J) \tag{8} $$

where $m$ and $J$ represent the robot's mass and rotational inertia. Using the orthonormality of (2),

$$ ({}^{g}_{b}W)^{-1} = {}^{g}_{b}W^T \tag{9} $$

From equations (5) and (7), the robot's traction force matrix is obtained as

$$ F_t = H^{-1} \, {}^{g}_{b}W^T M \left[\, {}^{g}_{b}\dot{W} (H^T)^{-1} r_r \, \omega_L + {}^{g}_{b}W (H^T)^{-1} r_r \, \dot{\omega}_L \,\right] \tag{10} $$

Referring each wheel's inertia to the related motor shaft, the rotational dynamics can be modeled as

$$ \tau_L = J_L \dot{\omega}_L + c_L \omega_L + F_t r_r \tag{11} $$

$$ \tau_m = J_m \dot{\omega}_m + c_m \omega_m + (1/n)\, \tau_L \tag{12} $$

$$ \omega_L = (1/n)\, \omega_m \tag{13} $$

where $\omega_m = [\omega_{m1} \; \omega_{m2} \; \omega_{m3} \; \omega_{m4}]^T$ contains the rotational velocities of the four motors, $\tau$ denotes torque, $c$ is the rotational damping coefficient, $J$ is inertia, and $n$ is the motor gearhead's built-in gear ratio; the indices $m$ and $L$ refer to motor and load, respectively.


Substituting (10)-(13) into (12) gives

$$ \tau_m = \left[ \left( J_m + \frac{J_L}{n^2} \right) I_{4\times 4} + \frac{r_r^2}{n^2} H^{-1} \, {}^{g}_{b}W^T M \, {}^{g}_{b}W (H^T)^{-1} \right] \dot{\omega}_m + \left[ \left( c_m + \frac{c_L}{n^2} \right) I_{4\times 4} + \frac{r_r^2}{n^2} H^{-1} \, {}^{g}_{b}W^T M \, {}^{g}_{b}\dot{W} (H^T)^{-1} \right] \omega_m \tag{14} $$

Equation (15) is a simplified version of equation (14); the elements of the $N$ and $M$ matrices are listed in the Appendix.

$$ \tau_m = N \dot{\omega}_m + M \omega_m \tag{15} $$

The $N$ and $M$ matrices are

$$ N = \begin{bmatrix} s_1 & s_2 & s_3 & s_4 \\ s_2 & s_1 & s_4 & s_3 \\ s_3 & s_4 & s_5 & s_6 \\ s_4 & s_3 & s_6 & s_5 \end{bmatrix} \tag{16} $$

$$ M = \begin{bmatrix} s_7 & -s_8\dot{\varphi} & s_9\dot{\varphi} & -s_{10}\dot{\varphi} \\ s_8\dot{\varphi} & s_7 & -s_{10}\dot{\varphi} & -s_9\dot{\varphi} \\ -s_9\dot{\varphi} & s_{10}\dot{\varphi} & s_7 & -s_{11}\dot{\varphi} \\ -s_{10}\dot{\varphi} & s_9\dot{\varphi} & s_{11}\dot{\varphi} & s_7 \end{bmatrix} \tag{17} $$

To simulate the torque control, the driver's dynamic behavior in terms of its actuating voltage, back EMF, terminal resistance and motor torque constant is also required for the next stages:

$$ \tau_m = \frac{k_m}{R_t} \left( V - e_{emf} \, \omega_m \right) \tag{18} $$

where $V$ is the vector of motor voltages, $R_t$ the terminal resistance, $k_m$ the motor torque constant and $e_{emf}$ the back-EMF constant. Noting that the system inputs are the motor voltages and the outputs are the wheel angular velocities, combining equations (15) and (18) yields a coupled nonlinear state-space model of the four-wheeled omni-directional robot:

$$ \dot{x} = -N^{-1} \left( \frac{k_m e_{emf}}{R_t} I_{4\times 4} + M \right) x + \frac{k_m}{R_t} N^{-1} u \tag{19} $$

where $x = [x_1 \; x_2 \; x_3 \; x_4]^T = [\omega_{m1} \; \omega_{m2} \; \omega_{m3} \; \omega_{m4}]^T$ is the state variable and also the system output, and $u = [u_1 \; u_2 \; u_3 \; u_4]^T = [V_1 \; V_2 \; V_3 \; V_4]^T$ is the control input vector.
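To make equation (19) concrete, the sketch below integrates the voltage-to-wheel-speed dynamics with forward Euler. The matrices N and M and the motor constants are illustrative stand-ins (the real structure follows equations (16)-(17) and the Appendix), and M, which in the paper varies with the yaw rate, is frozen at a constant value here for simplicity.

```python
import numpy as np

# Illustrative stand-ins; the true N and M follow equations (16)-(17).
N = np.diag([0.02, 0.02, 0.02, 0.02])    # inertia-like matrix, assumed values
M = 0.001 * np.eye(4)                     # damping-like matrix, frozen (really depends on phidot)
k_m, e_emf, R_t = 0.025, 0.025, 1.2       # torque const., back-EMF const., resistance (assumed)

N_inv = np.linalg.inv(N)
A_c = -N_inv @ ((k_m * e_emf / R_t) * np.eye(4) + M)  # drift term of (19)
B_c = (k_m / R_t) * N_inv                              # input term of (19)

def step(x, u, dt=1e-3):
    """One forward-Euler step of xdot = A_c x + B_c u."""
    return x + dt * (A_c @ x + B_c @ u)

x = np.zeros(4)                        # wheel angular velocities (rad/s)
u = np.array([6.0, 6.0, -6.0, -6.0])   # motor voltages (V), arbitrary test input
for _ in range(2000):                  # simulate 2 s
    x = step(x, u)
print(x)
```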

III. METHODOLOGY

This section details the designed controller and its purpose. First, the low-level digital linear quadratic tracking (DLQT) controller is designed using a policy iteration (PI) reinforcement learning (RL) adaptive algorithm. The low-level control task is to track the desired angular velocities of the robot wheels, which are generated by the high-level controller. The method introduced in [11] is used to formulate the infinite-horizon DLQT control problem: a reference trajectory generated by a linear command generator is assumed, and the problem's value function can be represented as a quadratic expression in terms of $x_k$ and $r_k$. A Bellman equation is then obtained to solve the LQT problem, and an augmented form of the LQT algebraic Riccati equation (ARE) is given. These considerations prepare the LQT problem for reinforcement learning. Consider the discrete linear dynamic system

$$ x(k+1) = A x(k) + B u(k), \qquad y(k) = C x(k) \tag{20} $$

where $x(k)$ is the n-dimensional state vector, $u(k)$ the control signal vector with m elements, $y(k)$ the output vector with p elements, and A, B and C constant system matrices. Suppose that the infinite-horizon performance index of the above system is

$$ J(k) = \frac{1}{2} \sum_{i=k}^{\infty} \left[ \left( C x(i) - r(i) \right)^T Q \left( C x(i) - r(i) \right) + u^T(i) R u(i) \right] \tag{21} $$

where Q and R are symmetric positive definite matrices. It is assumed that the reference trajectory is generated by the following command generator model:

$$ r(k+1) = F r(k) \tag{22} $$

where $r(k)$ is the reference input trajectory vector with p elements and F is a $p \times p$ matrix. Combining the dynamic system (20) and the reference trajectory dynamics (22), the augmented system is

$$ X(k+1) = \begin{bmatrix} x(k+1) \\ r(k+1) \end{bmatrix} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} \begin{bmatrix} x(k) \\ r(k) \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} u(k) \equiv T X(k) + B_1 u(k) \tag{23} $$

where $X(k) = [x(k)^T \; r(k)^T]^T$ is the augmented state. The performance index (21) can only be used when F is Hurwitz; in some cases this limitation makes it impractical. For example, if the command trajectory is a unit step or a sinusoidal signal, the performance index diverges. To avoid this drawback, a discount factor $\gamma$ is used in the performance index:

$$ V\left( x(k), r(k) \right) = \frac{1}{2} \sum_{i=k}^{\infty} \gamma^{\,i-k} \left[ \left( C x(i) - r(i) \right)^T Q \left( C x(i) - r(i) \right) + u^T(i) R u(i) \right] \tag{24} $$

where $0 < \gamma \le 1$. In terms of the augmented state, the value function can be written as

$$ V\left( x(k), r(k) \right) = \frac{1}{2} \sum_{i=k}^{\infty} \gamma^{\,i-k} \left[ X^T(i) Q_1 X(i) + u^T(i) R u(i) \right] \tag{25} $$

where $Q_1 = C_1^T Q C_1$ and $C_1 = [C \; -I]$. Under these assumptions, for a stabilizing policy of the form

$$ u(i) = K_x x(i) + K_r r(i) \tag{26} $$

it is proven that the value function (24) can be expressed as [11]

$$ V\left( x(k), \mathrm{r}(k) \right) = \frac{1}{2} X^T(k) P X(k) \tag{27} $$

where P is a symmetric positive definite matrix. In the next stage, we derive the Bellman and algebraic Riccati equations of the DLQT problem from the augmented system (23) and P in (27). Using equations (24) and (27), we obtain
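The augmentation in (23) and the weight $Q_1 = C_1^T Q C_1$ of (25) are mechanical to build; a minimal sketch is given below for generic A, B, C, F. The dimensions and numbers are small placeholders (the robot itself has n = m = p = 4); F = I models a constant (step) set point of wheel speeds, which is exactly the case where the discount factor of (24) is needed.

```python
import numpy as np

def build_augmented(A, B, C, F, Q):
    """Build T, B1, Q1 of equations (23) and (25) for the
    augmented state X = [x; r]."""
    n, m = B.shape
    p = C.shape[0]
    T = np.block([[A,                np.zeros((n, p))],
                  [np.zeros((p, n)), F               ]])
    B1 = np.vstack([B, np.zeros((p, m))])
    C1 = np.hstack([C, -np.eye(p)])      # C1 = [C  -I]
    Q1 = C1.T @ Q @ C1                   # Q1 = C1^T Q C1
    return T, B1, Q1

# Toy matrices for illustration only.
A = 0.9 * np.eye(4); B = 0.1 * np.eye(4); C = np.eye(4)
F = np.eye(4)                            # constant reference: r(k+1) = r(k)
Q = 10.0 * np.eye(4)
T, B1, Q1 = build_augmented(A, B, C, F, Q)
print(T.shape, B1.shape, Q1.shape)       # (8, 8) (8, 4) (8, 8)
```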


$$ V\left( x(k), r(k) \right) = \frac{1}{2} \left( C x(k) - r(k) \right)^T Q \left( C x(k) - r(k) \right) + \frac{1}{2} u^T(k) R u(k) + \frac{1}{2} \sum_{i=k+1}^{\infty} \gamma^{\,i-k} \left[ \left( C x(i) - r(i) \right)^T Q \left( C x(i) - r(i) \right) + u^T(i) R u(i) \right] \tag{28} $$

which generates the Bellman equation of the LQT problem:

$$ V\left( x(k), r(k) \right) = \frac{1}{2} \left( C x(k) - r(k) \right)^T Q \left( C x(k) - r(k) \right) + \frac{1}{2} u^T(k) R u(k) + \gamma V\left( x(k+1), r(k+1) \right) \tag{29} $$

Using equations (27) and (29), the LQT Bellman equation can also be written in terms of the P matrix:

$$ X^T(k) P X(k) = X^T(k) Q_1 X(k) + u^T(k) R u(k) + \gamma X^T(k+1) P X(k+1) \tag{30} $$

From equations (22), (23) and (27), it has been proved that any optimal policy giving a causal solution of the DLQT problem has the form [11]

$$ u(k) = -K_1 X(k) \tag{31} $$

where $K_1 = \left( R + \gamma B_1^T P B_1 \right)^{-1} \gamma B_1^T P T$ and P satisfies the following LQT ARE, whose solution is stabilizing:

$$ Q_1 - P + \gamma T^T P T - \gamma^2 T^T P B_1 \left( R + \gamma B_1^T P B_1 \right)^{-1} B_1^T P T = 0 \tag{32} $$

In the next stage, the introduced causal LQT form is used to develop the policy iteration reinforcement learning (RL) algorithm. For a control policy $K_1$, the augmented Bellman equation can be written as the LQT Lyapunov equation

$$ Q_1 - P + K_1^T R K_1 + \gamma \left( T - B_1 K_1 \right)^T P \left( T - B_1 K_1 \right) = 0 \tag{33} $$

The following policy iteration RL algorithm repeatedly solves equation (30) instead of solving the LQT ARE (32) directly.

Algorithm (Policy Iteration RL)
Initialize an arbitrary stabilizing control policy $K_1^0$.
1. Policy evaluation: solve for $P^{j+1}$ via the LQT Lyapunov equation

$$ X^T(k) P^{j+1} X(k) = X^T(k) \left( Q_1 + (K_1^j)^T R K_1^j \right) X(k) + \gamma X^T(k+1) P^{j+1} X(k+1) \tag{35} $$

2. Policy improvement:

$$ K_1^{j+1} = \left( R + \gamma B_1^T P^{j+1} B_1 \right)^{-1} \gamma B_1^T P^{j+1} T \tag{36} $$

It has been proved that $P^j$ in the above algorithm converges to the solution of the LQT ARE (32) and that $K_1^j$ is stabilizing at each iteration [12].
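A minimal model-based rendition of this policy iteration is sketched below, assuming T, B1, Q1 and R are known. Policy evaluation here solves the Lyapunov equation implied by (35), P = Q1 + K^T R K + gamma*(T - B1 K)^T P (T - B1 K), exactly via Kronecker vectorization; in the data-driven setting of [11], the same P would instead be estimated from measured trajectories by least squares. The toy matrices are the placeholders from the previous sketch.

```python
import numpy as np

def policy_iteration_lqt(T, B1, Q1, R, gamma, K0, iters=50):
    """Policy iteration for the discounted DLQT problem.
    Evaluation: solve P = Q1 + K^T R K + gamma Ac^T P Ac, Ac = T - B1 K,
    by vectorization (model-based stand-in for equation (35)).
    Improvement: K = (R + gamma B1^T P B1)^{-1} gamma B1^T P T, eq. (36)."""
    N = T.shape[0]
    K = K0
    for _ in range(iters):
        Ac = T - B1 @ K                     # closed-loop augmented dynamics
        Qk = Q1 + K.T @ R @ K
        # Row-major vec: vec(Ac^T P Ac) = (Ac^T kron Ac^T) vec(P)
        lhs = np.eye(N * N) - gamma * np.kron(Ac.T, Ac.T)
        P = np.linalg.solve(lhs, Qk.reshape(-1)).reshape(N, N)
        P = 0.5 * (P + P.T)                 # symmetrize against round-off
        K = np.linalg.solve(R + gamma * B1.T @ P @ B1,
                            gamma * B1.T @ P @ T)
    return P, K

# Toy augmented system (same placeholders as the previous sketch).
A = 0.9 * np.eye(4); B = 0.1 * np.eye(4); C = np.eye(4); F = np.eye(4)
Q = 10.0 * np.eye(4); R = np.eye(4); gamma = 0.8
T = np.block([[A, np.zeros((4, 4))], [np.zeros((4, 4)), F]])
B1 = np.vstack([B, np.zeros((4, 4))])
C1 = np.hstack([C, -np.eye(4)]); Q1 = C1.T @ Q @ C1
P, K1 = policy_iteration_lqt(T, B1, Q1, R, gamma, K0=np.zeros((4, 8)))
print(K1.round(3))    # control: u(k) = -K1 X(k)
```

With gamma < 1, the discount also makes K0 = 0 admissible for this toy system even though F = I is not Schur, which is exactly the role the discount factor plays in (24).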

Figure 2: angular velocity of 1st wheel and related control signal

IV. RESULTS

In this section, simulation results of the DLQT controller are presented. The proposed low-level controller has been tested


in tracking a rectangular path. The voltages of the four motors and the output angular velocities of the four wheels are shown and analyzed. During this motion, the linear and angular velocities of the robot are 2 m/s and 1 rad/s, respectively. The applied adaptive LQT is an energy-efficient method because of its minimum control effort. Figures (2), (3), (4) and (5) show the wheel angular velocities during the robot's rectangular motion together with the related actuating voltages of the motors. The simulation has been performed in a noise-polluted environment: to examine the robustness of the presented adaptive LQT, a randomly distributed noise matrix with amplitude 0.1 has been added to the system's dynamic matrices A and B. The noise assumption follows [8, 10] and captures several uncertainties such as motor heating and surface-friction alterations. As figures (6), (7) and (8) make clear, the presented strategy reduces the robot's deviation from the desired path. Figure (6) shows the position and orientation of the robot over time along the rectangular trajectory for the two schemes. As depicted in figure (7), the adaptive LQT shows better tracking performance than the conventional LQT in the presence of a random disturbance acting as an additive uncertainty. Furthermore, figure (8) and Table 1 show that the adaptive LQT has a lower position and orientation tracking error than the LQT controller: the mean tracking error of the robot's movements is reduced by 46% for the adaptive LQT compared with the pure LQT.
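A minimal sketch of this robustness-test setup is given below. The amplitude 0.1 matches the text; the uniform distribution is an assumption, since the text only says "randomly distributed".

```python
import numpy as np

def perturb(A, B, amplitude=0.1, rng=np.random.default_rng(0)):
    """Additive random perturbation of the system matrices, emulating
    uncertainties such as motor heating and surface-friction changes.
    A uniform distribution on [-amplitude, amplitude] is assumed."""
    dA = rng.uniform(-amplitude, amplitude, size=A.shape)
    dB = rng.uniform(-amplitude, amplitude, size=B.shape)
    return A + dA, B + dB
```

Each Monte-Carlo run would then repeat the tracking simulation with the perturbed (A, B) pair while the adaptive LQT re-learns its gains on-line.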


Figure 3: angular velocity of 2nd wheel and related control signal

Figure 4: angular velocity of 3rd wheel and related control signal

Figure 7: Rectangular path tracking


Figure 5: angular velocity of 4th wheel and related control signal

Figure 6: Position and orientation of the robot in rectangular path tracking

Figure 8: Tracking error of position and orientation of the robot in the rectangular path tracking

Table 1: Position and orientation error analysis

Controller (axis)             Mean error
Adaptive LQT (x direction)    1.8 cm
LQT (x direction)             4.35 cm
Adaptive LQT (y direction)    0.58 cm
LQT (y direction)             0.69 cm
Adaptive LQT (φ direction)    0.03 rad
LQT (φ direction)             0.04 rad

V. CONCLUSION

In this paper, a reinforcement learning based LQT controller is applied to a four-wheeled omni-directional mobile robot to realize adaptive control. Simulation tests are performed to examine the effectiveness of this controller. The results demonstrate that the proposed controller reduces the path-following error compared with the traditional LQT controller in the presence of uncertainties.

APPENDIX

The elements of the N and M matrices shown in equations (16) and (17) are given below. These parameters relate the robot's various dynamic characteristics for analyzing its dynamic behavior.

$$ s_1 = J_m + \frac{J_L}{n^2} + \frac{r_r^2}{n^2}\,(0.2810\,m + 11.75\,J) $$

$$ s_2 = \frac{r_r^2}{n^2}\,(0.3821\,m + 11.75\,J) $$

$$ s_3 = \frac{r_r^2}{n^2}\,(-0.2619\,m + 9.050\,J) $$

$$ s_4 = \frac{r_r^2}{n^2}\,(-0.5717\,m + 9.050\,J) $$

$$ s_5 = J_m + \frac{J_L}{n^2} + \frac{r_r^2}{n^2}\,(0.2459\,m + 6.9710\,J) $$

$$ s_6 = \frac{r_r^2}{n^2}\,(0.07323\,m + 6.9710\,J) $$

$$ s_7 = c_m + \frac{c_L}{n^2} $$

$$ s_8 = 0.2784\,\frac{r_r^2}{n^2}\,m $$

$$ s_9 = 0.02184\,\frac{r_r^2}{n^2}\,m $$

$$ s_{10} = 0.2566\,\frac{r_r^2}{n^2}\,m $$

$$ s_{11} = 0.2347\,\frac{r_r^2}{n^2}\,m $$
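Assembled in code, the Appendix expressions and the matrix patterns of (16)-(17) look as follows. All numeric physical parameters below are placeholders, and the sign pattern of M follows the structure given in (17).

```python
import numpy as np

# Placeholder physical parameters (illustrative only).
m, J = 2.5, 0.02           # robot mass (kg) and rotational inertia (kg m^2)
J_m, J_L = 1e-5, 5e-5      # motor and load inertias
c_m, c_L = 1e-5, 5e-5      # motor and load damping coefficients
r_r, n = 0.03, 4.0         # radius r_r and gear ratio n

k = r_r**2 / n**2
s = np.zeros(12)           # s[1]..s[11] as in the Appendix (s[0] unused)
s[1] = J_m + J_L / n**2 + k * (0.2810 * m + 11.75 * J)
s[2] = k * (0.3821 * m + 11.75 * J)
s[3] = k * (-0.2619 * m + 9.050 * J)
s[4] = k * (-0.5717 * m + 9.050 * J)
s[5] = J_m + J_L / n**2 + k * (0.2459 * m + 6.9710 * J)
s[6] = k * (0.07323 * m + 6.9710 * J)
s[7] = c_m + c_L / n**2
s[8], s[9], s[10], s[11] = (0.2784 * k * m, 0.02184 * k * m,
                            0.2566 * k * m, 0.2347 * k * m)

N = np.array([[s[1], s[2], s[3], s[4]],
              [s[2], s[1], s[4], s[3]],
              [s[3], s[4], s[5], s[6]],
              [s[4], s[3], s[6], s[5]]])     # equation (16)

def M_of(phidot):
    """Velocity-dependent damping matrix of equation (17)."""
    p = phidot
    return np.array([[ s[7],     -s[8]*p,   s[9]*p,  -s[10]*p],
                     [ s[8]*p,    s[7],    -s[10]*p, -s[9]*p ],
                     [-s[9]*p,    s[10]*p,  s[7],    -s[11]*p],
                     [-s[10]*p,   s[9]*p,   s[11]*p,  s[7]   ]])

print(N.round(6)); print(M_of(1.0).round(6))
```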

REFERENCES

[1] H. Kim and B. K. Kim, "Online minimum-energy trajectory planning and control on a straight-line path for three-wheeled omnidirectional mobile robots," IEEE Transactions on Industrial Electronics, vol. 61, no. 9, pp. 4771-4779, 2014.
[2] H. A. Samani, A. Abdollahi, H. Ostadi, and S. Z. Rad, "Design and development of a comprehensive omni-directional soccer robot," International Journal of Advanced Robotic Systems, vol. 1, no. 3, pp. 191-200, 2004.
[3] K. Sukvichai, P. Wasuntapichaikul, and Y. Tipsuwan, "Implementation of torque controller for brushless motor on omni-directional wheeled robot," ITC-CSCC 2010, Pattaya, Thailand, 2010, pp. 19-22.
[4] M.-H. Lee and T.-H. S. Li, "Kinematics, dynamics and control design of 4WIS4WID mobile robots," The Journal of Engineering, vol. 1, no. 1, 2015.
[5] O. Purwin and R. D'Andrea, "Trajectory generation and control for four wheeled omnidirectional vehicles," Robotics and Autonomous Systems, vol. 54, no. 1, pp. 13-22, 2006.
[6] Y. Liu, J. J. Zhu, R. L. Williams II, and J. Wu, "Omni-directional mobile robot controller based on trajectory linearization," Robotics and Autonomous Systems, vol. 56, no. 5, pp. 461-479, 2008.
[7] A. Sheikhlar, A. Fakharian, and A. Adhami-Mirhosseini, "Fuzzy adaptive PI control of omni-directional mobile robot," 13th Iranian Conference on Fuzzy Systems (IFSC), IEEE, 2013.
[8] A. Sheikhlar, A. Fakharian, M. Zarghami, and M. B. Menhaj, "Delay compensation on fuzzy trajectory tracking control of omni-directional mobile robots," Amirkabir International Journal of Modeling, Identification, Simulation & Control, in press.
[9] S. Yang, A. Zhu, and G. Yuan, "A bioinspired neurodynamics-based approach to tracking control of mobile robots," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3211-3220, 2012.
[10] E. Hashemi, M. G. Jadidi, and N. G. Jadidi, "Model-based PI-fuzzy control of four-wheeled omni-directional mobile robots," Robotics and Autonomous Systems, vol. 59, no. 11, pp. 930-942, 2011.
[11] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, "Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics," Automatica, vol. 50, no. 4, pp. 1167-1175, 2014.
[12] G. A. Hewer, "An iterative technique for the computation of steady state gains for the discrete optimal regulator," IEEE Transactions on Automatic Control, vol. 16, no. 4, pp. 382-384, 1971.