Reinforcement Learning for Penalty Avoiding Policy Making

Kazuteru Miyazaki
National Institution for Academic Degrees
3-29-1 Otsuka, Bunkyo
Tokyo, Japan, 112-0012
[email protected]

Shigenobu Kobayashi
Tokyo Institute of Technology
4259 Nagatsuta, Midori
Yokohama, Japan, 226-8502
[email protected]

Abstract

Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment with a clue of rewards. In general, the purpose of a reinforcement learning system is to acquire an optimal policy that maximizes the expected reward per action. However, this is not always what matters in every environment. Especially, if we apply a reinforcement learning system to engineering, we expect the agent to avoid all penalties. In Markov Decision Processes, we call a rule a penalty rule if and only if it gets a penalty or it can transit to a penalty state, where it does not contribute to getting any reward. After suppressing all penalty rules, we aim to make a rational policy whose expected reward per action is larger than zero. In this paper, we propose the Penalty Avoiding Rational Policy Making algorithm, which suppresses any penalty as stably as possible while obtaining a reward constantly. By applying the algorithm to the tick-tack-toe, its effectiveness is shown.

1 Introduction

Reinforcement learning (RL) is a kind of machine learning. It aims to adapt an agent to a given environment with a clue of rewards. If we tell the agent what it should do (its purpose) and what it should not do (its restriction), it can learn how to satisfy them. In RL, it is important how rewards are designed. In most recent RL systems [2], a positive reward is given to the agent when it has achieved a purpose, and a negative one is given when it has violated a restriction. However, if we set incorrect reward values, the agent will learn unexpected behavior. For example, in a two-player game, consider the case where a positive reward is given to the winner and a negative one to the loser. If we have designed incorrect reward values, the agent may lose the game even if there is a victory strategy. This is because the two types of rewards are treated in the same dimension.

In this paper, we make a distinction between a positive reward for achievement of a purpose and a negative one for violation of a restriction. We call a positive reward a reward and a negative one a penalty. We call a policy rational if and only if its expected reward per action is larger than zero. Furthermore, a penalty avoiding policy is defined as a policy that cannot get any penalty. If there is a rational policy among the penalty avoiding policies, we should learn that policy, and its expected reward per action should be as large as possible. Otherwise, we should learn the policy whose expected penalty per action is the smallest among all policies.

Section 2 describes the problem, the method and notation. Section 3 proposes the Penalty Avoiding Rational Policy Making algorithm, which suppresses any penalty as stably as possible while obtaining a reward constantly. Section 4 applies the algorithm to the tick-tack-toe as a numerical example. Section 5 concludes.

2 The Domain

2.1 Problem Formulation

Consider an agent in some unknown environment. At each time step, the agent gets information about the environment through its sensors and chooses an action. We denote the agent's sensory inputs as x, y, ... and its actions as a, b, .... As a result of some sequence of actions, the agent gets a reward or a penalty from the environment. We assume that the environment is a Markov Decision Process (MDP). A pair of a sensory input and an action is called a rule. We denote the rule `if x then a' as xa. The function that maps sensory inputs to actions is called a policy.

Figure 1: E1: An environment of 3 sensory inputs and 2 actions.

Figure 2: An example of episodes and detours in E1.

Figure 3: An example of penalty rules (xa, ya) and a penalty state (y).

Figure 4: E2: An environment of 2 rewards (R1, R2) and a penalty (P).
We call a policy rational if and only if its expected reward per action is larger than zero. We call a sequence of rules selected between the previous reward (or penalty) and the current one an episode. For example, when the agent selects xb, xa, ya, za, yb (reward), xa, zb, xa and yb (reward) in figure 1, there are two episodes (xb, xa, ya, za, yb) and (xa, zb, xa, yb) (see figure 2). We call a subsequence of an episode a detour when the sensory input of the first firing rule and the sensory input observed after the last firing rule are the same though both rules are different. For example, episode 1 in figure 2 has two detours (xb) and (ya, za). A rule that does not exist on a detour in some episode is rational; otherwise, a rule is called irrational. We call a rule a penalty rule if and only if it gets a penalty or it can transit to a penalty state, that is, a state in which there is no rational rule. For example, in figure 3, xa and ya are penalty rules, and state y is a penalty state. Furthermore, a penalty avoiding policy is a policy that does not contain any penalty rule. In this paper, we aim to make a rational policy that avoids transiting to penalty states as stably as possible.
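The detour notion can be made concrete with a short sketch. The following is only an illustration under the assumption that an episode is recorded as the ordered list of fired rules; it is not code from the paper, and it classifies a single episode (a rule counted off a detour in any episode would be treated as rational overall).

```python
# A minimal sketch (an assumed reading, not the authors' code) of splitting
# the rules of one episode into detour rules and the remaining rules by
# erasing loops over sensory inputs.
def split_episode(rules):
    """rules: the fired rules of one episode as (sensory_input, action) pairs,
    in order; the sensory input of rule i+1 is the one observed after rule i."""
    path = []                 # loop-erased prefix of the episode
    detour_rules = set()
    for state, action in rules:
        # returning to a sensory input already on the path closes a detour
        for i, (s, _) in enumerate(path):
            if s == state:
                detour_rules.update(path[i:])
                path = path[:i]
                break
        path.append((state, action))
    return set(path), detour_rules

# Episode 1 of figure 2: (xb, xa, ya, za, yb)
on_path, detours = split_episode([('x', 'b'), ('x', 'a'), ('y', 'a'), ('z', 'a'), ('y', 'b')])
# on_path == {('x', 'a'), ('y', 'b')}; detours == {('x', 'b'), ('y', 'a'), ('z', 'a')}
```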

2.2 Previous Works

Q-learning (QL) [7] and the Policy Iteration Algorithm (PIA) [1] guarantee optimality in the sense of maximizing the expected reward per action. In QL and PIA, we must design appropriate values for rewards and penalties. However, it is difficult to design them so as to get the expected results. For example, in E2 (figure 4), if we set (R1, R2, P) = (50, 100, -50), QL and PIA make the policy in which a1 is selected in S0, even though there is a penalty avoiding rational policy in which a0 is selected in S0. In this case, if we assign a very large negative value to the penalty (for example P = -10000), we can get the penalty avoiding rational policy. However, we cannot always get a rational policy in this way. For example, in E2, if we set (R1, R2, P) = (0, 100, -10000), they make a policy in which a0 is selected in S0, but it is not rational: though it is a penalty avoiding policy, it cannot get any reward. Thus, it is a difficult problem to design reward and penalty values. Therefore, we treat a reward and a penalty independently, and do not assign any value to them.
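The sensitivity to the chosen values can be illustrated with a one-step caricature of the choice at S0. The transition probabilities below are assumptions made for illustration only, not the actual dynamics of E2; the point is only that the greedy choice flips with the penalty magnitude.

```python
# Hypothetical one-step simplification of the choice at S0 in E2; the 0.65 /
# 0.35 split is an assumed value, not taken from the paper's environment.
def expected_return(action, R1, R2, P):
    if action == 'a0':
        return 0.65 * R1            # assumed: reach R1 with prob. 0.65, else nothing
    return 0.65 * R2 + 0.35 * P     # assumed: reach R2 with prob. 0.65, else the penalty

for R1, R2, P in [(50, 100, -50), (50, 100, -10000), (0, 100, -10000)]:
    best = max(('a0', 'a1'), key=lambda a: expected_return(a, R1, R2, P))
    print((R1, R2, P), '->', best)
# (50, 100, -50)    -> a1  : the small penalty is outweighed by the larger reward
# (50, 100, -10000) -> a0  : the huge penalty forces the penalty avoiding choice
# (0, 100, -10000)  -> a0  : penalty avoiding, but no reward can ever be obtained
```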

Figure 5: The Penalty Rule Judgment algorithm (PRJ).

Figure 6: An example of PRJ.

2.3 Approaches

In this paper, we treat a reward and a penalty in the following manner. (1) If all penalties can be avoided, we aim to learn a penalty avoiding rational policy that gets as many rewards as possible. (2) Otherwise, we aim to learn a rational policy whose transition probabilities to penalty states are the lowest among all policies. We propose the Penalty Avoiding Rational Policy Making algorithm to realize this in the next section.

3 Proposal of the Penalty Avoiding Rational Policy Making algorithm

3.1 Basic Idea

In this paper, we aim to make a rational policy that avoids a penalty as stably as possible. To avoid all penalties, we should suppress all penalty rules in the rule set. In section 3.2, we propose the Penalty Rule Judgment algorithm, which finds all penalty rules in the current rule set. After suppressing all penalty rules, it is important how to make a rational policy. In section 3.3, we discuss this by considering whether there is a rational policy among the penalty avoiding policies or not. In section 3.4, we propose the Penalty Avoiding Rational Policy Making algorithm by combining the contents of sections 3.2 and 3.3.

3.2 The Penalty Rule Judgment algorithm

For each episode, we apply the Penalty Rule Judgment algorithm shown in figure 5. PRJ finds all penalty rules in the current rule set. It uses marks to find them. First, we set a mark on every rule that has directly received a penalty (see figure 6a). Second, we set a mark on every state in which there is no rational rule or no rule that can transit to an unmarked state (see figure 6b). Last, we set a mark on every rule whose possible successor states are all marked (see figure 6c). A marked rule is regarded as a penalty rule. We can find all penalty rules in the current rule set by continuing the above process until there is no new mark.
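Figure 5 is not legible in this copy, so the following is only a sketch of the marking procedure as described above, under assumed data structures (a rule is a (sensory input, action) pair, and for each rule we record the set of states it has been observed to reach). It is not the authors' code.

```python
# Sketch of the Penalty Rule Judgment (PRJ) marking procedure; data structures
# are assumptions, not the paper's.
#   successors: dict rule -> set of states the rule has been observed to reach
#   state_of:   dict rule -> sensory input in which the rule fires
#   penalized:  rules that have directly received a penalty      (step a)
#   rational:   rules currently judged rational
def prj(successors, state_of, penalized, rational):
    marked_rules = set(penalized)
    marked_states = set()
    changed = True
    while changed:
        changed = False
        # step (b): mark states without a rational, unmarked rule leading to an unmarked state
        for s in {state_of[r] for r in successors}:
            if s in marked_states:
                continue
            usable = [r for r in successors
                      if state_of[r] == s and r in rational and r not in marked_rules]
            if not any(successors[r] - marked_states for r in usable):
                marked_states.add(s)
                changed = True
        # step (c): mark rules all of whose possible successor states are marked
        for r, succ in successors.items():
            if r not in marked_rules and succ and succ <= marked_states:
                marked_rules.add(r)
                changed = True
    return marked_rules, marked_states   # penalty rules, penalty states
```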

3.3 How to Make a Rational Policy

If there is a rational policy among the penalty avoiding policies, we can make a rational policy that does not contain any penalty after suppressing all penalty rules by PRJ. In section 3.3.1, we propose the Rational Policy Improvement algorithm to make a rational policy in this situation. On the other hand, if there is no rational policy among the penalty avoiding policies, we need a new mechanism that can avoid a penalty stochastically. We discuss it in section 3.3.2.

3.3.1 The Rational Policy Improvement algorithm

We show the Rational Policy Improvement algorithm (RPI) in figure 7. It can make a rational policy that does not contain any penalty when there is a rational policy among the penalty avoiding policies. It uses marks to make a policy. First, we take the rules in the current policy that have directly obtained a reward, and set marks on them (see figure 8a). Second, we take into the current policy the following rules (see figure 8b): rules all of whose successor states have already been marked, and whose own state has no rule in the current policy yet. We then set a mark on the state that is covered by the current policy. We continue the above process until there is no new mark.

Some rules have stochastic state transitions. Therefore some states that can be reached by such a rule may have no mark even if the rule is important to get a reward. For example, in figure 9, the state x cannot be marked, even though there is a mark on the state y, since the state z is marked only after the state x has been marked. We call such a rule the mark transition stop rule (MTSR). The MTSR is important to get a reward. Therefore, after a mark transition stops, we take an MTSR into the current policy and set a mark on the state that can select the rule, so that the mark transition continues.

Figure 7: The Rational Policy Improvement algorithm (RPI).

Figure 8: An example of RPI.

Figure 9: An example of the Mark Transition Stop Rule.
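Figures 7 and 8 are not legible in this copy, so the following is only one possible reading of RPI, sketched under assumed data structures; the handling of the mark transition stop rule is omitted for brevity. It should not be read as the authors' implementation.

```python
# Sketch of the Rational Policy Improvement (RPI) idea: build a policy by
# backward chaining from the rules that obtain a reward directly, over a rule
# set from which penalty and irrational rules have already been removed.
#   successors:   dict rule -> set of states the rule may reach
#   state_of:     dict rule -> sensory input in which the rule fires
#   reward_rules: rules observed to obtain a reward directly
def rpi(successors, state_of, reward_rules):
    policy = {}                       # sensory input -> adopted rule
    marked_states = set()
    for r in reward_rules:            # step (a): reward rules enter the policy
        policy.setdefault(state_of[r], r)
        marked_states.add(state_of[r])
    changed = True
    while changed:                    # step (b): adopt rules whose successor
        changed = False               # states are all marked, for states that
        for r, succ in successors.items():   # have no policy rule yet
            s = state_of[r]
            if s not in policy and succ and succ <= marked_states:
                policy[s] = r
                marked_states.add(s)
                changed = True
    return policy
```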

3.3.2 Estimation of Transition Probabilities to Penalty States

If there is no rational policy among the penalty avoiding policies, RPI cannot make a rational policy. In this case, we must select some penalty rules to make a rational policy. In this section, we add a new mechanism to RPI. It is important to select the penalty rule that is least likely to transit to a penalty state. In this paper, we assume that all penalty states are to be avoided equally. Then, we select the penalty rule whose transition probability to penalty states is the smallest among all penalty rules at the state that can select the rule. We call the state transition probability to penalty states a penalty level. We use the following upper bound of the interval estimation to estimate it:

P_{ub} = 1 - \frac{\frac{x}{n} + \frac{Z^2}{2n} + \frac{Z}{\sqrt{n}}\sqrt{\frac{x}{n}\left(1 - \frac{x}{n}\right) + \frac{Z^2}{4n}}}{1 + \frac{Z^2}{n}},    (1)

where n is the number of times the rule has been selected, x is the number of transitions to non-penalty states, Z = 1.96 if the confidence is 95% and Z = 2.58 if the confidence is 99%. The initial value of P_{ub} is 0. If P_{ub} is 0, the rule is not a penalty rule. The closer P_{ub} is to 1, the higher its transition probability to penalty states. If there is no non-penalty rule in the current state, we estimate the penalty levels of all penalty rules in it and select the rule whose penalty level is the smallest. In this way we can get a rational policy whose transition probability to penalty states is the smallest.
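The penalty level is straightforward to compute; the sketch below is a direct transcription under the reading of equation (1) given above.

```python
import math

def penalty_level(n, x, confidence=0.95):
    """Penalty level P_ub of equation (1), as reconstructed above.
    n: number of times the rule has been selected,
    x: number of transitions to non-penalty states."""
    if n == 0:
        return 0.0                                   # initial value of P_ub
    z = 1.96 if confidence == 0.95 else 2.58
    p = x / n
    num = p + z**2 / (2*n) + (z / math.sqrt(n)) * math.sqrt(p * (1 - p) + z**2 / (4*n))
    return 1.0 - num / (1 + z**2 / n)

# A rule that has never reached a penalty state keeps P_ub = 0:
# penalty_level(20, 20) == 0.0, while penalty_level(20, 12) is about 0.22 and
# penalty_level(20, 0) is about 0.84, growing toward 1 with more penalty transitions.
```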

Figure 10: The Penalty Avoiding Rational Policy Making algorithm (PARP).

3.4 The Penalty Avoiding Rational Policy Making algorithm

In this section, we integrate the contents of the above two sections. We show the Penalty Avoiding Rational Policy Making algorithm (PARP) in figure 10. We initialize two memories: the 1st memory is used to judge irrational rules and the 2nd memory is used to preserve the current rational policy. Initially we consider all rules irrational and enter them in the irrational rule set. Furthermore, we initialize the penalty rule set and the memory that is used to estimate penalty levels. At each time step, the agent senses the state xi and executes an action ai according to some action selector, such as random selection, the minimal select method [4], the k-certainty exploration method [4], or action selection based on the 2nd memory. Then, the action ai is written into the xi entry of the 1st memory. If the agent gets a reward, the rules recorded in the 1st memory are rational. Therefore, after removing these rules from the irrational rule set, we find all penalty rules by PRJ. PRJ must be applied for each episode, since an irrational rule can change into a rational rule. We apply RPI to the rule set that contains neither penalty rules nor irrational rules. If there is a state for which the policy cannot be decided, we calculate the penalty levels of the rules that can be selected in that state, and the rule with the smallest penalty level is copied to the current policy. Then, the current policy is copied to the 2nd memory. To improve the current policy, the agent senses the next state and updates the memory used to estimate the penalty levels. We can get a rational policy that does not contain any penalty in the 2nd memory.
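Figure 10 is not legible in this copy, so the following only sketches, under assumed data structures, how the pieces above could be combined when a reward ends an episode; it reuses the prj, rpi and penalty_level sketches and is not the authors' implementation.

```python
# Sketch of the policy rebuilding step of PARP when a reward is obtained;
# reuses the prj, rpi and penalty_level sketches above.  All structures are
# assumptions: first_memory maps a sensory input to the action fired in this
# episode, reward_rules is the set of rules observed to obtain a reward
# directly, and stats maps a rule to its (n, x) counts for equation (1).
def rebuild_policy(first_memory, successors, state_of,
                   irrational_rules, penalized_rules, reward_rules, stats):
    episode_rules = set(first_memory.items())
    irrational = irrational_rules - episode_rules        # fired rules become rational
    rational = set(successors) - irrational
    penalty_rules, _ = prj(successors, state_of, penalized_rules, rational)
    # apply RPI to the rules that are neither penalty nor irrational
    usable = {r: s for r, s in successors.items()
              if r not in penalty_rules and r not in irrational}
    policy = rpi(usable, state_of, reward_rules & set(usable))
    # states RPI left undecided: take the rule with the smallest penalty level
    undecided = {state_of[r] for r in successors} - set(policy)
    for s in undecided:
        candidates = [r for r in successors if state_of[r] == s]
        policy[s] = min(candidates,
                        key=lambda r: penalty_level(*stats.get(r, (0, 0))))
    return policy, irrational          # the policy is copied to the 2nd memory
```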

4 A Numerical Example

4.1 Application to the Tick-Tack-Toe

We apply PARP to the tick-tack-toe. In this paper, we give the following knowledge to the opponent player. (1) If there is an action that makes him the winner, he selects that action. (2) Otherwise, he selects an action that blocks the action that would make the learning player the winner. This knowledge is not given to the learning agent. If the opponent player cannot use the above knowledge, he selects an action at random.

In general, the result of RL depends on how rewards and penalties are set. In this paper, we use two design plans. In plan 1, we give a reward for a win; when the agent cannot win, a penalty is given. In plan 2, we give a penalty for a loss; when the agent wins or draws, a reward is given. If both players are clever, the result should be a draw, since there is no victory strategy in the tick-tack-toe. Therefore, if we use plan 1, the agent tries to win but it may be defeated. On the other hand, if we use plan 2, the agent can learn a strategy that is never defeated, but it cannot always win. If there were a victory strategy, we would not need plan 2; we use plans 1 and 2 because there is no victory strategy in the tick-tack-toe. First, we suppress penalty rules according to plan 2 and make a policy that does not contain them. This policy guarantees that the agent is never defeated. After suppressing all penalty rules according to plan 2, we try to make a policy according to plan 1. If we can make such a policy, it is a victory strategy.

The learning agent selects an action based on the 2nd memory. When the agent cannot select an action because there is no rational policy among the penalty avoiding policies, PARP uses the penalty levels discussed in section 3.3.2. On the other hand, a comparative method, PARP-, selects an action at random in that situation.
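For concreteness, the opponent heuristic and the two reward design plans can be sketched as follows; the board encoding and the marks ('X' for the learner, 'O' for the opponent) are assumptions made for illustration, not the authors' setup.

```python
import random

# Sketch of the opponent heuristic and of the two reward design plans; the
# board is a list of 9 cells holding 'X' (learner), 'O' (opponent) or ' '.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winning_cell(board, mark):
    """Return a cell that completes a line for `mark`, or None."""
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(mark) == 2 and ' ' in cells:
            return line[cells.index(' ')]
    return None

def opponent_move(board):
    move = winning_cell(board, 'O')          # (1) win immediately if possible
    if move is None:
        move = winning_cell(board, 'X')      # (2) otherwise block the learner
    if move is None:                         # otherwise play at random
        move = random.choice([i for i in range(9) if board[i] == ' '])
    return move

def feedback(result, plan):
    if plan == 1:
        return 'reward' if result == 'win' else 'penalty'   # plan 1: win, or be penalized
    return 'penalty' if result == 'lose' else 'reward'      # plan 2: never lose
```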

Figure 11: The result when the learning agent moves first (100 runs; PARP S.D. = 2.82, PARP- S.D. = 3.31).

Figure 12: The result when the learning agent moves second (100 runs; PARP S.D. = 4.27, PARP- S.D. = 3.90).

4.2 Simulation Results

Figures 11 and 12 show the victory and defeat rates plotted against the number of plays. We made 100 runs with different random seeds. The learning agent moves first in figure 11 and second in figure 12. Both PARP and PARP- learn a policy that is never defeated in any game; therefore the defeat rates in all plots reach zero. The victory rate of PARP is higher than that of PARP-. This means that the estimation of the penalty levels discussed in section 3.3.2 is useful. We can confirm the effectiveness of our method through this example.

5 Conclusions

In general, the purpose of a reinforcement learning system is to acquire an optimal policy. However, this is not always what matters in every environment. Especially, if we apply a reinforcement learning system to engineering, we expect the agent to avoid all penalties. In this paper, we proposed the Penalty Avoiding Rational Policy Making algorithm (PARP), which suppresses any penalty as stably as possible while obtaining a reward constantly. By applying the algorithm to the tick-tack-toe, its effectiveness was shown. We are now confirming the effectiveness of PARP on the game of Othello. In future work, we intend to combine PARP with Profit Sharing [3]. Furthermore, we aim to extend PARP to more difficult domains such as Partially Observable Markov Decision Problems [5] and multi-agent systems [6].

References

[1] Bertsekas, D. P.: Dynamic Programming and Stochastic Control, Mathematics in Science and Engineering 125, Academic Press, New York (1976).

[2] Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press (1998).

[3] Miyazaki, K., Yamamura, M. and Kobayashi, S.: On the Rationality of Profit Sharing in Reinforcement Learning, Proc. of 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, pp. 285-288 (1994).

[4] Miyazaki, K., Yamamura, M. and Kobayashi, S.: k-Certainty Exploration Method: An Action Selector to Identify the Environment in Reinforcement Learning, Artificial Intelligence 91, pp. 155-171 (1997).

[5] Miyazaki, K. and Kobayashi, S.: Learning Deterministic Policies in Partially Observable Markov Decision Processes, International Conference on Intelligent Autonomous Systems 5, pp. 250-257 (1998).

[6] Miyazaki, K. and Kobayashi, S.: Rationality of Reward Sharing in Multi-agent Reinforcement Learning, Second Pacific Rim International Workshop on Multi-Agents, pp. 111-125 (1999).

[7] Watkins, C. J. C. H. and Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 55-68 (1992).