Learning automata approach to hierarchical

In this paper, we introduce a new model of such decision processes and a methodology based on learning automata theory for dealing with their asymptotic behavior.
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 21, NO. 1, JANUARY/FEBRUARY 1991


An automaton is said to be optimal if $\lim_{n\to\infty} E[M(n)] = d_l \triangleq \max_i\{d_i\}$, $\epsilon$-optimal if $\lim_{n\to\infty} E[M(n)] > d_l - \epsilon$ for arbitrary $\epsilon > 0$, and absolutely expedient if $E[M(n+1) - M(n) \mid p(n)] > 0$ for all $n$ and all $p_i(n) \in (0,1)$. For further details, the reader is referred to [11]. The same performance measures are used in more complex problems as well, including the one considered in this paper. In particular, we will be interested in absolutely expedient schemes. For ease of discussion, we shall use a linear reward-inaction ($L_{R-I}$) learning algorithm, which is absolutely expedient and $\epsilon$-optimal. If an automaton uses an $L_{R-I}$ scheme, $E[M(n)]$ tends to a value greater than $d_l - \epsilon$ as $n \to \infty$ for any $\epsilon > 0$, provided the step size $a$ of the algorithm is sufficiently small.

Q and S Models: In the above description of the learning automaton, the response of the environment was assumed to be binary. This is referred to as a P-model. In situations where finer distinctions have to be made among the responses of the environment, Q and S models are defined. In Q models, the output $\beta(n)$ can assume a finite number of values in the interval $[0,1]$, while in S models, $\beta(n)$ is a continuous random variable over the same interval.



Fig. 5. Interaction of two hierarchies at two levels.


Fig. 6. Behavior of action probabilities for $a = 0.05$ and $\lambda = 1$.

The learning algorithms developed for P models have by and large been extended to Q and S models.
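As an illustration of the single-automaton case, the following sketch simulates a two-action automaton updating its action probabilities with a linear reward-inaction rule in a stationary P-model environment. The reward probabilities, step size, and function names are illustrative assumptions, not values taken from the paper.

```python
import random

def lr_i_step(p, action, beta, a):
    """Linear reward-inaction (L_R-I) update for a binary (P-model) response.

    On reward (beta == 1) the chosen action's probability moves toward 1;
    on penalty (beta == 0) nothing changes.
    """
    if beta == 1:
        p = [pj + a * (1.0 - pj) if j == action else pj * (1.0 - a)
             for j, pj in enumerate(p)]
    return p

# Illustrative stationary environment: reward probability d[i] for action i.
d = [0.4, 0.8]          # assumed values; action 1 is optimal
p = [0.5, 0.5]          # initial action probabilities
a = 0.05                # step size

for n in range(5000):
    action = random.choices(range(2), weights=p)[0]
    beta = 1 if random.random() < d[action] else 0
    p = lr_i_step(p, action, beta, a)

print(p)   # p[1] approaches 1 for small step size (epsilon-optimality)
```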

B. Multiple Environments

In many cases learning has to be effected with a vector of outputs from the environment rather than a scalar as in Section II-A. This can be described in terms of a single automaton operating in $N$ environments simultaneously (Fig. 2). If the automaton has $r$ actions and the reward probability of the $j$th environment to the $i$th action is $d_i^j$ $(i = 1,2,\ldots,r;\ j = 1,2,\ldots,N)$, the entire system is described by $Nr$ probabilities. In [3], [5], and [6], different assumptions are made about the

environments and the corresponding performance indices are defined. For our purpose, the result reported by Baba in [6] is most relevant. According to Baba, the problem can be stated in terms of Q models. If $\sum_{j=1}^{N} d_l^j > \sum_{j=1}^{N} d_i^j$ for all $i \neq l$ $(i = 1,2,\ldots,r)$, then the $l$th action corresponds to the optimum. It is shown in [6] that if an absolutely expedient algorithm (for Q models) is used by the automaton (e.g., the $SL_{R-I}$ scheme), $\epsilon$-optimality can be achieved. In this case, the composite environment can be considered in some sense as the average of all the constituent environments. In particular, the expected payoff of the composite environment for action $\alpha_i$ is $(1/N)\sum_{j=1}^{N} d_i^j$. At any stage $n$, if $M$ of the $N$ environments provide a positive or successful response, the output $\beta(n) = M/N$.
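A minimal sketch of this averaging, under assumed reward probabilities: the $N$ binary responses are combined into a single Q-model output $\beta(n) = M/N$, which then drives a reward-inaction update whose increment is scaled by $\beta(n)$.

```python
import random

def slr_i_step(p, action, beta, a):
    """S/Q-model linear reward-inaction update: increment scaled by beta in [0, 1]."""
    return [pj + a * beta * (1.0 - pj) if j == action else pj * (1.0 - a * beta)
            for j, pj in enumerate(p)]

# Assumed reward probabilities d[j][i] of environment j for action i (N = 3, r = 2).
d = [[0.3, 0.7], [0.5, 0.9], [0.2, 0.6]]
p = [0.5, 0.5]
a = 0.05

for n in range(5000):
    i = random.choices(range(2), weights=p)[0]
    responses = [1 if random.random() < dj[i] else 0 for dj in d]
    beta = sum(responses) / len(responses)      # beta(n) = M / N
    p = slr_i_step(p, i, beta, a)

print(p)   # converges toward the action with the largest average reward probability
```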



Fig. 7. Behavior of action probabilities for $a = 0.05$ and $\lambda = 0$.

C. Automata in Hierarchical Systems

When the number of actions of a learning automaton becomes large, the time taken for the action probability vector to converge also increases. Under such circumstances, a hierarchical structure of decision making can be used to improve speed. Fig. 3 shows a simple hierarchical system with two levels and two actions for each automaton. $A$ acts first, choosing either $A_1$ or $A_2$, which in turn selects one of its two actions. The latter elicits a response from a stationary random environment. This response is used by all automata in the selected path to update their probabilities; one such cycle is sketched below. This method is generalized in [5]. A hierarchical system can be thought of as a single automaton whose actions are the union of the actions of all the automata at the bottom level of the hierarchy. Hence, the concept of absolute expediency can be extended to a hierarchical system by requiring that the equivalent automaton be absolutely expedient. In [12] it is shown that this can be achieved with different step sizes at the various levels. In [5] this has been generalized further to situations where each automaton acting at any level receives a response from a local environment in addition to the global response obtained at the end of a cycle.
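The following sketch, with assumed reward probabilities and helper names, walks one cycle of such a two-level hierarchy: the first-level automaton picks a branch, the chosen second-level automaton picks a leaf action, and every automaton on the selected path applies the same reward-inaction update to the single environment response.

```python
import random

def lr_i(p, chosen, beta, a):
    """L_R-I update of a two-action probability vector for a binary response."""
    if beta == 1:
        return [pj + a * (1 - pj) if j == chosen else pj * (1 - a)
                for j, pj in enumerate(p)]
    return p

# Assumed reward probabilities of the four leaf actions, keyed by (branch, leaf).
d = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}

p_top = [0.5, 0.5]                      # automaton A
p_sub = [[0.5, 0.5], [0.5, 0.5]]        # automata A_1 and A_2
a = 0.05

for n in range(10000):
    branch = random.choices(range(2), weights=p_top)[0]
    leaf = random.choices(range(2), weights=p_sub[branch])[0]
    beta = 1 if random.random() < d[(branch, leaf)] else 0
    # All automata on the selected path use the same response.
    p_top = lr_i(p_top, branch, beta, a)
    p_sub[branch] = lr_i(p_sub[branch], leaf, beta, a)

print(p_top, p_sub)   # the path leading to the best leaf gains probability
```

This sketch keeps the step size constant at both levels for simplicity; as noted above, [12] and Section IV-D use level-dependent step sizes so that the equivalent single automaton is absolutely expedient.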

D. Games of Automata

Fig. 4 shows a game situation in which more than one automaton operates in the same environment. Let $A_i$ $(i = 1,2,\ldots,N)$ represent the $N$ automata. At stage $n$, every automaton chooses one action from its action set and the environment provides a response that depends upon the set of actions chosen at that instant. In an identical payoff game, the response obtained from the environment is the same for all the automata. If automaton

$A_i$ has $r_i$ actions $\{\alpha_1^i, \alpha_2^i, \ldots, \alpha_{r_i}^i\}$ in its action set, the environment can receive one of $\prod_{i=1}^{N} r_i$ action combinations. Assuming automaton $A_1$ chooses its $i_1$th action, $A_2$ chooses its $i_2$th action, and so on, the reward probability of the environment is $d_{i_1,i_2,\ldots,i_N} \triangleq \mathrm{Prob}(\beta = 1 \mid \alpha^1(n) = \alpha_{i_1}^1, \alpha^2(n) = \alpha_{i_2}^2, \ldots, \alpha^N(n) = \alpha_{i_N}^N)$. The problem then is to determine the algorithms that the different automata should use so that $\lim_{n\to\infty} M(n) = d - \epsilon$, where $d = \max_{i_1,i_2,\ldots,i_N}\{d_{i_1,i_2,\ldots,i_N}\}$. It is assumed that after each play, every automaton observes the outcome and updates its action probabilities as though it were operating in a stationary random environment. The group behavior of the $N$ automata can be judged by the expected payoff at stage $n$ given by $M(n) \triangleq E[\beta(n) \mid p^1(n), p^2(n), \ldots, p^N(n)]$. In [9] it is shown that if all the automata use an $L_{R-I}$ scheme, $\Delta M(n) \triangleq E[M(n+1) - M(n) \mid p^1(n), p^2(n), \ldots, p^N(n)]$ is nonnegative for all $n$. This result has been shown to be true for any absolutely expedient scheme in [10] for a pair of learning automata.
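A small sketch of such an identical payoff game, with an assumed joint reward table: each of $N = 2$ automata updates with $L_{R-I}$ using only the common response and no knowledge of the other player.

```python
import random

def lr_i(p, chosen, beta, a):
    if beta == 1:
        return [pj + a * (1 - pj) if j == chosen else pj * (1 - a)
                for j, pj in enumerate(p)]
    return p

# Assumed joint reward probabilities d[i1][i2] for the identical payoff game.
d = [[0.3, 0.6],
     [0.5, 0.8]]

p1, p2 = [0.5, 0.5], [0.5, 0.5]
a = 0.05

for n in range(10000):
    i1 = random.choices(range(2), weights=p1)[0]
    i2 = random.choices(range(2), weights=p2)[0]
    beta = 1 if random.random() < d[i1][i2] else 0   # same payoff to both players
    p1 = lr_i(p1, i1, beta, a)                       # decentralized updates
    p2 = lr_i(p2, i2, beta, a)

print(p1, p2)   # expected payoff M(n) is nondecreasing at every stage [9]
```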

E. Summary

In the four cases considered above, each automaton uses an absolutely expedient scheme or, more specifically, an $L_{R-I}$ scheme. At every stage, the action probabilities of the automata are updated strictly on the basis of the response of the environment and not on the basis of any knowledge regarding the other automata or their strategies. In Section II-A, the single automaton uses an $L_{R-I}$ scheme and is $\epsilon$-optimal; in Section II-B, it is once again $\epsilon$-optimal, but this is achieved by converting the responses of the different environments into a single equivalent response. The automata in Section II-C use $L_{R-I}$ schemes but with different step sizes. The proof that the hierarchy is $\epsilon$-optimal



Fig. 8. Behavior of action probabilities for $a = 0.05$ and $\lambda = 1/2$.

is based on the fact that the action probabilities at the lowest level are updated precisely as in an $L_{R-I}$ scheme. The most important result dealing with automata games in Section II-D states that the collective of automata involved in an identical payoff game improves its performance at every step, even when each automaton behaves as though it is operating in a stationary random environment. As mentioned earlier, all these results are relevant for the problem considered in this paper.

The collective of automata that we shall consider contains $M$ hierarchies, each of which has $N$ levels (assuming that all hierarchies have the same number of levels is merely for convenience of notation and involves no loss of generality). Considering the $i$th hierarchy, the single automaton at the first level is denoted by $A^i$. The finitely many actions of $A^i$ are denoted by $A_1^i, A_2^i, \ldots$, and are themselves automata. Similarly, the actions of $A_1^i$ are denoted by $A_{11}^i, A_{12}^i, \ldots$, and so on. It is assumed that at any instant $n$, the automata in the $N$ levels are involved in $N$ games. The automata $A^i$ at the first level play an identical payoff game $G_1$ at level 1. In addition, each automaton $A^i$ chooses some action according to a probability distribution $p^i(n)$, and these $M$ chosen automata in turn play a game $G_2$. The process continues to the lowest level, so that $N$ games in all are played, each having $M$ automata participating in it. The outcome of each game is a success or a failure. The probability of success is determined by the automata taking part in the game at each level. The end of a cycle occurs when the $N$th game is played. The $N$ outcomes of the games are fed back to all the


automata participating in the games in that cycle. These automata update their action probabilities based on the responses obtained, and the next cycle commences at stage $(n+1)$ with the first level. This collective of automata and environments will be referred to as $\mathcal{A}_N^M$ in the following sections. Assuming that the automaton $A^i$ has $r_i$ actions, it is seen that $r_i$ games of $M$ automata can be played at the second level. Of these, only one game is played in each cycle. Obviously, the number of possible games increases geometrically at each level. The objective is to determine how the automata should use the information received in each cycle to update their action probabilities so that the expected payoff increases monotonically with the stage number. Even a qualitative statement of the problem as given above indicates the complexity of the overall problem and the rather cumbersome notation needed to state it precisely. Hence we shall first confine ourselves to the analysis of a collective $\mathcal{A}_2^2$ consisting of only two hierarchies and two levels. This problem is considered in detail in Section IV, and the principal result is stated as Theorem 1. How the above result can be generalized to the collective of automata $\mathcal{A}_N^M$ is considered briefly in Section VI.


IV. INTERACTION OF TWO HIERARCHIES OF AUTOMATA AT TWO LEVELS

A. The Collective $\mathcal{A}_2^2$

Fig. 5 shows a collective of automata consisting of only two hierarchies and two levels. At stage $n$, automaton $A$ performs one of two actions $\alpha_1$ or $\alpha_2$ with probabilities $p_1(n)$ and $p_2(n) = 1 - p_1(n)$, respectively. Similarly, automaton $B$ performs $\gamma_1$ or $\gamma_2$ with probabilities $q_1(n)$ and $q_2(n) = 1 - q_1(n)$. Let



Fig. 9. Behavior of the action probabilities for $a = 0.2$ and $\lambda = 1/2$.

$p(n) = (p_1(n), p_2(n))^T$ and $q(n) = (q_1(n), q_2(n))^T$. Based on the actions chosen by $A$ and $B$, environment $E_1$ provides a response $\beta^1(n) \in \{0,1\}$; the probability of receiving a reward is given by $d_{ik}^1$, where $d_{ik}^1 \triangleq \mathrm{Prob}(\beta^1(n) = 1 \mid \alpha(n) = \alpha_i, \gamma(n) = \gamma_k)$ $(i,k = 1,2)$. Performing action $\alpha_i$ by $A$ and $\gamma_k$ by $B$ is equivalent to choosing $A_i$ and $B_k$ at the second level. If the action pair chosen at the first level is $(\alpha_i, \gamma_k)$, at the second level automata $A_i$ and $B_k$ play a $2\times 2$ game. $A_i$ chooses one of two actions $\alpha_{i1}$ or $\alpha_{i2}$ with probabilities $p_{i1}(n)$ and $p_{i2}(n) = 1 - p_{i1}(n)$, respectively, while $B_k$ chooses between $\gamma_{k1}$ and $\gamma_{k2}$ using the probability distribution $(q_{k1}(n), 1 - q_{k1}(n))$. The action pair $(\alpha_{ij}, \gamma_{kl})$ elicits a response $\beta^2(n) \in \{0,1\}$ from environment $E_2$; the probability of receiving a reward is given by $d_{ij,kl}^2$, where $d_{ij,kl}^2 \triangleq \mathrm{Prob}(\beta^2(n) = 1 \mid \alpha_i, \alpha_{ij}, \gamma_k, \gamma_{kl})$. Automata $A$ and $B$ at the first level update their action probabilities based on the actions performed and the two responses $\beta^1(n)$ and $\beta^2(n)$. At the second level, only those automata that were chosen update their action probabilities. Our aim is to determine the learning algorithms that the various automata should use to improve the performance at every stage. The following performance measure is used to state this problem quantitatively.
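To make the interaction concrete, the sketch below simulates one stage of this two-hierarchy, two-level structure with assumed reward tables d1 and d2; it only generates the two responses, leaving the updating rules to Sections IV-C and IV-D.

```python
import random

def one_stage(p, q, p_sub, q_sub, d1, d2):
    """Sample first- and second-level actions and the two environment responses."""
    i = random.choices([0, 1], weights=p)[0]         # A chooses alpha_i
    k = random.choices([0, 1], weights=q)[0]         # B chooses gamma_k
    j = random.choices([0, 1], weights=p_sub[i])[0]  # A_i chooses alpha_ij
    l = random.choices([0, 1], weights=q_sub[k])[0]  # B_k chooses gamma_kl
    beta1 = 1 if random.random() < d1[i][k] else 0
    beta2 = 1 if random.random() < d2[(i, j)][(k, l)] else 0
    return (i, j, k, l), beta1, beta2

# Assumed reward probabilities for the two environments.
d1 = [[0.7, 0.4], [0.3, 0.5]]
d2 = {(i, j): {(k, l): random.uniform(0.1, 0.9) for k in (0, 1) for l in (0, 1)}
      for i in (0, 1) for j in (0, 1)}

actions, b1, b2 = one_stage([0.5, 0.5], [0.5, 0.5],
                            [[0.5, 0.5]] * 2, [[0.5, 0.5]] * 2, d1, d2)
print(actions, b1, b2)
```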


B. Performance Measure

At the first level, the two automata with pure strategy (or action) sets $\{\alpha_1, \alpha_2\}$ and $\{\gamma_1, \gamma_2\}$ are involved in a sequential stochastic game. The game (or, equivalently, the environment $E_1$) can be described by the game matrix $D^1 = [d_{ik}^1]$. The expected payoff from the game at stage $n$ is then denoted by $M_1(n)$, where

$$M_1(n) = p^T(n)\, D^1 q(n).$$

Similarly, at the second level, the environment $E_2$ is described by a $4\times 4$ game matrix $D^2 = [d_{ij,kl}^2]$.

Since the action chosen at this level is conditioned on the action chosen at the previous level, it is clear that only four actions are involved at any stage of the game (i.e., only one of the four subgame matrices is involved). If the absolute probabilities of the actions $\alpha_{ij}$ and $\gamma_{kl}$ are, respectively, $\bar p_{ij}(n)$ and $\bar q_{kl}(n)$, then $\bar p_{ij}(n) = p_i(n)\, p_{ij}(n)$ $(i,j = 1,2)$ and $\bar q_{kl}(n) = q_k(n)\, q_{kl}(n)$ $(k,l = 1,2)$, and the expected payoff at the second level is

$$M_2(n) = \bar p^T(n)\, D^2 \bar q(n)$$

where $\bar p(n) = (\bar p_{11}(n), \bar p_{12}(n), \bar p_{21}(n), \bar p_{22}(n))^T$ and $\bar q(n) = (\bar q_{11}(n), \bar q_{12}(n), \bar q_{21}(n), \bar q_{22}(n))^T$. The total expected payoff at stage $n$ is defined as

$$M(n) \triangleq \lambda M_1(n) + (1-\lambda) M_2(n), \qquad \lambda \in [0,1]$$

where the value of $\lambda$ determines the weight associated with the two game matrices. While the choice of $\lambda$ will depend on the prior information available in any situation, in the following analysis we shall assume that the two games are equally weighted, so that $\lambda = 1/2$ and

$$M(n) = \tfrac{1}{2}\left[ M_1(n) + M_2(n) \right].$$


Fig. 10. (a) Sample path of $M(n)$ with $a = 0.05$ and $\lambda = 1$. (b) Average $M(n)$.

Since $M(n)$ is a random variable, the performance measure that we will use, as described in Section I, is the expected payoff $E[M(n)]$. $\Delta M(n) \triangleq E[M(n+1) - M(n) \mid \bar p(n), \bar q(n)]$ denotes the change in $M(n)$ at stage $n$. If $\Delta M(n) \geq 0$, so that $M(n)$ is a submartingale, the entire system is absolutely expedient by definition. Theorem 1 is related to learning algorithms that result in such absolutely expedient behavior. Since the total expected payoff is a weighted sum of the expected payoffs from the two environments, the response of the composite environment $E_{12}$ can be taken as the weighted sum of the two responses $\beta^1(n)$ and $\beta^2(n)$, where the weights are $\lambda$ and $(1-\lambda)$, respectively. Hence

$$\beta(n) = \lambda \beta^1(n) + (1-\lambda)\beta^2(n) \qquad (1)$$

and the composite environment is a Q-model environment with $\beta(n) \in \{0, \lambda, 1-\lambda, 1\}$. For the specific case $\lambda = 1/2$, $\beta(n) \in \{0, 1/2, 1\}$.
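The performance measure is straightforward to evaluate numerically. The sketch below computes $M_1(n)$, $M_2(n)$, and $M(n)$ for assumed game matrices and probability vectors; the matrix values are placeholders, not the ones used in the paper's simulations.

```python
import numpy as np

def total_payoff(p, q, p_sub, q_sub, D1, D2, lam=0.5):
    """M(n) = lam * M1(n) + (1 - lam) * M2(n) for the two-level collective."""
    p, q = np.asarray(p), np.asarray(q)
    # Absolute second-level probabilities: p_bar_ij = p_i * p_ij, q_bar_kl = q_k * q_kl.
    p_bar = np.concatenate([p[i] * np.asarray(p_sub[i]) for i in range(2)])
    q_bar = np.concatenate([q[k] * np.asarray(q_sub[k]) for k in range(2)])
    M1 = p @ D1 @ q
    M2 = p_bar @ D2 @ q_bar
    return lam * M1 + (1 - lam) * M2

D1 = np.array([[0.7, 0.4], [0.3, 0.5]])          # assumed 2x2 first-level matrix
D2 = np.random.uniform(0.1, 0.9, size=(4, 4))    # assumed 4x4 second-level matrix

M = total_payoff([0.5, 0.5], [0.5, 0.5],
                 [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]], D1, D2)
print(M)
```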

C. The $SL_{R-I}$ Algorithm

In what follows, we shall assume that after each cycle, all the automata in the selected paths use a linear reward-inaction ($SL_{R-I}$) scheme, which is known to be absolutely expedient and $\epsilon$-optimal in all stationary random environments. For a learning automaton with two actions $\{\alpha_1, \alpha_2\}$ and action probability vector $p^T(n) = (p_1(n), p_2(n))$, the $SL_{R-I}$ algorithm can be represented as

$$p_i(n+1) = p_i(n) + \eta(n)\,\beta(n)\,[1 - p_i(n)], \qquad p_j(n+1) = p_j(n) - \eta(n)\,\beta(n)\,p_j(n) \quad (j \neq i), \qquad (2)$$

when $\alpha(n) = \alpha_i$, where $\alpha(n)$ is the action performed at stage $n$, $\beta(n) \in [0,1]$ is the composite response from the environment, and $\eta(n)$ is the step size used by the automaton at stage $n$. This scheme is applicable to both Q and S model environments.

D. Choice of Step Size

In the problem as stated above, the step sizes of automata $A$ and $B$ remain constant and have a value $a \in (0,1)$. The step sizes of automata $A_1$, $A_2$, $B_1$, and $B_2$ vary with time; at stage $n = 0$, $\eta(0)$ for all the automata has the value $a$. At any stage $n$, the step size of the automaton taking part in the game is changed to

$$\eta(n+1) = \frac{a}{p_i(n+1)} \quad \text{for } A_i \ (i = 1,2), \qquad \eta(n+1) = \frac{a}{q_i(n+1)} \quad \text{for } B_i \ (i = 1,2). \qquad (3)$$

It has been shown in [13] that the division by $p_i(n+1)$ or $q_i(n+1)$ in (3) does not pose any problems.
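A minimal sketch of the update machinery of Sections IV-C and IV-D, under the assumption that equation (2) is the standard $SL_{R-I}$ rule: first-level automata keep the constant step size $a$, while the selected second-level automata divide it by the updated probability of the branch that leads to them.

```python
def slr_i(p, chosen, beta, eta):
    """SL_R-I update (2): move toward the chosen action in proportion to eta * beta."""
    return [pj + eta * beta * (1 - pj) if j == chosen else pj * (1 - eta * beta)
            for j, pj in enumerate(p)]

def update_hierarchy(p, p_sub, i, j, beta, a):
    """Update first-level vector p and the selected second-level vector p_sub[i].

    The first level uses the constant step size a; the second level uses
    eta(n) = a / p_i(n + 1), as in (3).
    """
    p_new = slr_i(p, i, beta, a)
    eta_second = a / p_new[i]
    p_sub = [row[:] for row in p_sub]
    p_sub[i] = slr_i(p_sub[i], j, beta, eta_second)
    return p_new, p_sub

# One illustrative update with composite response beta = 0.5 (a Q-model value).
p, p_sub = update_hierarchy([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], 0, 1, 0.5, 0.05)
print(p, p_sub)
```

The division by $p_i(n+1)$ mirrors Step 1 of the proof below, where it is argued that the second-level probabilities are then effectively updated by an $SL_{R-I}$ scheme.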

E. Performance of the Collective $\mathcal{A}_2^2$

With the structure of the collective as stated in Section IV-A, a performance measure $M(n)$ as defined in Section IV-B, and algorithms for updating the probabilities of all the automata participating in the games as given in Sections IV-C and IV-D, the principal result can be stated as Theorem 1.

Theorem 1: Let every automaton in the collective $\mathcal{A}_2^2$ described above update its action probabilities at every stage using the $SL_{R-I}$ algorithm (2), the composite response $\beta(n)$ as defined in (1), and the step size $\eta(n)$ given in (3). Then there exists a constant $a^* \in (0,1]$ such that for all $0 < a < a^*$ the overall system is absolutely expedient, i.e., $\Delta M(n) \geq 0$ for all $n$.

Proof: The proof of Theorem 1 can be given in the following four steps.

Step 1: If the automata taking part at any stage $n$ update their action probabilities using the step sizes given in (3), the action probabilities at the second level can be shown to be updated according to an $SL_{R-I}$ scheme. The proof of this follows along the same lines as in [13] and is based on the fact that the changes in the probabilities depend only upon the response and not on whether it is the output of a stationary random environment or of a game of automata.

Step 2: The updating of the automata at any stage is based on the outcomes of two games. While the game matrix $D^1$ corresponding to the game at the first level belongs to $\mathcal{B}^{2\times 2}$, the game matrix $D^2$ at the second level belongs to $\mathcal{B}^{4\times 4}$. Considering only the actions at the lowest levels of the hierarchies, the game matrix $D^1$ can be replaced by an equivalent game matrix $\bar D^1 \in \mathcal{B}^{4\times 4}$ whose entry corresponding to the action pair $(\alpha_{ij}, \gamma_{kl})$ is $d_{ik}^1$. This is because, irrespective of the action chosen at the lower level, the outcome of the first game depends only on the actions chosen by the automata $A$ and $B$. The action probabilities at the second level are then updated on the basis of the responses of the two games whose game matrices are $\bar D^1$ and $D^2$, respectively.


Fig. 11. (a) Sample path of $M(n)$ with $a = 0.05$ and $\lambda = 0$. (b) Average $M(n)$.

Step 3: The outcomes $\beta^i(n)$ $(i = 1,2)$ of each of the two games are nonnegative, and it can easily be verified that each belongs to $\{0,1\}$. Assuming that the outcomes are given equal weight (i.e., $\lambda = 1/2$), the effective outcome $(\beta^1(n) + \beta^2(n))/2$ belongs to the set $\{0, 1/2, 1\}$. We shall refer to such a stochastic game as a Q-model game.

Step 4: On the basis of Steps 2 and 3, the behavior of the collective $\mathcal{A}_2^2$ can be expressed as a Q-model game of two automata in which each automaton has four actions. Let two automata with actions $\alpha_{ij}$ $(i,j = 1,2)$ and $\gamma_{kl}$ $(k,l = 1,2)$ play a Q-model game, and for the action pair $(\alpha_{ij}, \gamma_{kl})$ let $u_{ij,kl}$ and $v_{ij,kl}$ denote the probabilities that the composite response takes the values $1$ and $1/2$, respectively. The matrices $U$ and $V$ can be computed from $\bar D^1 = [\bar d_{ij,kl}^1]$ and $D^2 = [d_{ij,kl}^2]$; in particular,

$$u_{ij,kl} = \bar d_{ij,kl}^1\, d_{ij,kl}^2.$$

The expected reward at any stage $n$ can be expressed in terms of the matrix $D = (\bar D^1 + D^2)/2$, which represents the expected value of $\beta(n) = (\beta^1(n) + \beta^2(n))/2$; note that $D$ can also be written in terms of $U$ and $V$ as $D = U + \tfrac{1}{2}V$.

In Step 1, it was shown that the action probabilities $\bar p(n)$ and $\bar q(n)$ are updated according to an $SL_{R-I}$ scheme with the updating algorithm as in (2) and (3). For any stage $n$, let $\Delta M(n) = E[M(n+1) - M(n) \mid \bar p(n), \bar q(n)]$ be the expected change in $M(n)$. Using the notation $\bar p(n+1) = \bar p(n) + \delta p(n)$ and $\bar q(n+1) = \bar q(n) + \delta q(n)$, it is easy to show that $\Delta M(n)$ is made up of three terms:

$$\Delta M(n) = \Delta p^T(n)\, D\, \bar q(n) + \bar p^T(n)\, D\, \Delta q(n) + E\big[\delta p^T D\, \delta q \mid \bar p(n), \bar q(n)\big] \qquad (4)$$

where $\Delta p(n) = E[\delta p \mid \bar p(n), \bar q(n)]$ and $\Delta q(n) = E[\delta q \mid \bar p(n), \bar q(n)]$. The first two terms in (4) correspond to the incremental gain due to each player when the action probabilities of the other player are held constant. Since $\bar p(n)$ and $\bar q(n)$ are updated according to an $SL_{R-I}$ scheme that is absolutely expedient in all stationary random environments, these two terms, which are of order $a$, are nonnegative; the first can be written in the form (5), and $\Delta q^T D^T \bar p$ has a similar form. Taking expectation over all strategy pairs, it can be shown that the interaction term $E[\delta p^T D\, \delta q]$ is of order $a^2$ and is of the form given in (6) and (7).
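The three-term structure of (4) follows directly from expanding the bilinear payoff; the short derivation below is a reconstruction from the definitions above, not a quotation of the paper's own algebra.

$$M(n+1) - M(n) = (\bar p + \delta p)^T D (\bar q + \delta q) - \bar p^T D \bar q = \delta p^T D \bar q + \bar p^T D\, \delta q + \delta p^T D\, \delta q.$$

Taking conditional expectations given $(\bar p(n), \bar q(n))$ and writing $\Delta p = E[\delta p \mid \bar p, \bar q]$, $\Delta q = E[\delta q \mid \bar p, \bar q]$ gives (4); the first two terms are of order $a$ because $\delta p$ and $\delta q$ are each of order $a$, while the interaction term, which couples the two increments, is of order $a^2$.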

It is clear that all the terms in $\Delta M(n)$ are nonnegative except for this interaction term. Thus $\Delta M(n)$ can be expressed as $\Delta M(n) = a f_1(p,q) + a^2 f_2(p,q)$, where $f_1(p,q) \geq 0$ and $f_2(p,q)$ can be obtained from (5), (6), and (7). Further, $f_2(p,q) \to 0$ with $p$ and $q$ at least as fast as $f_1(p,q)$. Hence, for a sufficiently small $a$, $a f_1(p,q) > a^2 |f_2(p,q)|$ for all $p$ and $q$. More precisely, there exists a constant $a^* \in (0,1]$ such that $\Delta M(n) \geq 0$ whenever $0 < a < a^*$.

Comments: 1) The constant factor $a$ in the step size of all the automata in $\mathcal{A}_2^2$ has been assumed to be the same in the above analysis. While the principle used in adjusting all the probabilities within a hierarchy requires that the constant factor $a$ be the same for all the algorithms, this factor can be different for different hierarchies. In such a case, the maximum $a_{\max}$ of all the step sizes must satisfy the condition of Theorem 1. 2) For the updating of the probabilities at stage $n$ at the second level, the probability $p_i(n+1)$ of the action at the first level is required. In practice this information need not be transferred from the first level to the automata at the second level; it can be generated by them directly using a model of the algorithm used by the automaton at the first level and the responses of the environments.


Fig. 12. (a) Sample path of $M(n)$ with $a = 0.05$ and $\lambda = 1/2$. (b) Average $M(n)$.

3) In a single stochastic game between two learning automata, both of which use an $L_{R-I}$ scheme, it was found in [9] that the expression for $\Delta M(n)$ is of the form $\Delta M(n) = a f_1(p,q) + a^2 f_2(p,q)$, where $f_1$ and $f_2$ are both nonnegative functions of their arguments. In contrast, when the two automata play two or more games simultaneously, or equivalently a Q-model game as in this section, this property of $\Delta M(n)$ is not retained: while $f_1(p,q)$ is sign definite, $f_2(p,q)$ is not. The arguments used in this section therefore depend upon the magnitude of $a$ to achieve absolute expediency. The fact that two absolutely expedient algorithms playing an identical payoff game with a P-model achieve absolute expediency for all $a \in (0,1)$, but that the same does not carry over to a Q-model, is an important result that has come out of the analysis. This, however, does not affect the conclusions in either case, since arguments based on slow learning invariably need a small value of the step size $a$.

V. SIMULATION RESULTS

The results presented in the previous section are best illustrated by considering a simple example. The two game matrices corresponding to the two environments $E_1$ and $E_2$ were chosen as a $2\times 2$ matrix $D^1$ and a $4\times 4$ matrix $D^2$, respectively.

According to $D^1$, the optimal action pair at the first level is $(\alpha_1, \gamma_1)$, while according to $D^2$, the action pair at the second level that is optimal is $(\alpha_{11}, \gamma_{22})$. The latter implies that $(\alpha_1, \gamma_2)$ must be chosen at the first level. The behavior of the collective with the above game matrices was simulated for several values of the step size $a$ as well as of the weighting factor $\lambda$. Three typical responses observed are shown in the figures.

$\lambda = 1$: In this case, the entire weight is attached to the matrix $D^1$, and hence the performance is determined by the choice of the actions at the first level. Since the probability 0.7 corresponding to the action pair $(\alpha_1, \gamma_1)$ is the only one that is the maximum both in its row and in its column, absolute expediency also implies $\epsilon$-optimality for sufficiently small values of $a$. The behavior of the action probabilities for $a = 0.05$ is shown in Fig. 6. Since the performance is independent of the action chosen at the second level, it is interesting to note that these probabilities need not converge to any specific value. However, as a consequence of the convergence of the actions of the automata at the first level, it can be stated that the actions at the second level converge w.p.1 to the limit set $\{(\alpha_{11}, \gamma_{11}), (\alpha_{11}, \gamma_{12}), (\alpha_{12}, \gamma_{11}), (\alpha_{12}, \gamma_{12})\}$.

$\lambda = 0$: In this case, the performance is determined entirely by $D^2$. For sufficiently small values of the step size $a$ $(= 0.05)$, the optimal action pair $(\alpha_1, \gamma_2)$ is chosen at the first level and the action pair $(\alpha_{11}, \gamma_{22})$ is chosen at the second level. This is shown in Fig. 7. Theoretically, this corresponds to the case where the two hierarchies play a single game, and the results are identical to those presented in [9].

$\lambda = 1/2$: When the two matrices are weighted equally, the equivalent game matrix has the form $D = (\bar D^1 + D^2)/2$. Once again, the same actions as in the previous case are optimal. The reinforcement at the second level is seen to prevail in this case, and the decision shifts from that determined by $D^1$ to the one determined by $D^2$, as can be seen from the behavior of the action probabilities in Fig. 8. Fig. 9 shows the behavior of the action probabilities when the step size $a$ is increased to 0.2. As the magnitude of $a$ is increased, the speed of response increases but, as predicted by the theory, the probability of convergence to the optimal pair decreases.

For each of the cases considered above, the value of $M(n)$ along a sample path and the average value of $M(n)$ are shown in Figs. 10-13. As predicted by the theory, the expected value of $M(n)$ (averaged over 100 trials in this experiment) increases monotonically.

VI. INTERACTION OF MULTIPLE HIERARCHIES OF AUTOMATA

The results presented in the previous section can be extended to situations where each automaton has more than two actions, where each hierarchy has more than two levels, and finally where there are more than two hierarchies. In this section we merely indicate the changes that this necessitates in the implementation of the algorithms, as well as the modifications that have to be made in the analysis to prove absolute expediency. We first consider the collective $\mathcal{A}_2^2$ with two hierarchies and two levels where each of the automata has more than two actions. The only consequence of this is a change in the dimension of the game matrix at each level. The automata taking part at any stage are updated exactly as in Section IV. If the two hierarchies have $N$ levels each, we have a collective $\mathcal{A}_N^2$ of automata, and in this case the entire problem reduces to two automata playing $N$ stochastic games simultaneously. The actions of the two automata correspond to the actions at the lowest levels of the two hierarchies, and both automata use $SL_{R-I}$ schemes to update their action probabilities. The equivalent game matrix $\bar D$ of the two automata is now computed as

$$\bar D = \frac{1}{N}\sum_{i=1}^{N} \bar D_i$$

where $\bar D_i$ is an $(r_1 \times r_2)$ matrix ($r_1$ and $r_2$ being the numbers of actions of the two hierarchies at the lowest level) and can be obtained from $D_i$ using a process similar to that in Section IV.
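A sketch of this construction for the two-level case ($N = 2$): the first-level matrix is expanded so that its entry for $(\alpha_{ij}, \gamma_{kl})$ is $d_{ik}^1$, and the equivalent matrix is the average of the expanded matrices. The expansion via a Kronecker product with an all-ones block is an implementation choice, not notation from the paper.

```python
import numpy as np

def expand_first_level(D1, branch_actions=2):
    """Expand a 2x2 first-level matrix to the 4x4 lowest-level action space.

    The entry for (alpha_ij, gamma_kl) is D1[i][k], independent of j and l.
    """
    ones = np.ones((branch_actions, branch_actions))
    return np.kron(D1, ones)

D1 = np.array([[0.7, 0.4], [0.3, 0.5]])           # assumed first-level matrix
D2 = np.random.uniform(0.1, 0.9, size=(4, 4))     # assumed second-level matrix
D1_bar = expand_first_level(D1)
D_equiv = (D1_bar + D2) / 2                       # equal weighting of the N = 2 games
print(D_equiv)
```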


Fig. 13. (a) Sample path of $M(n)$ with $a = 0.2$ and $\lambda = 1/2$. (b) Average $M(n)$.

The response of the overall environment, $\beta(n)$, at stage $n$ is the average of the outcomes of the games at the $N$ levels. As in the $\mathcal{A}_2^2$ case, each automaton updates its step size on the basis of the probability of the action at the previous level. Finally, if there are $M > 2$ hierarchies, the identical payoff game at each level is played among $M$ automata. Since it has been shown in [9] that absolute expediency results when all the automata use $SL_{R-I}$ schemes, the same proof carries over to the collective $\mathcal{A}_N^M$. In this case, the expression for $\Delta M(n)$ has the form

$$\Delta M(n) = a f_1(p,q) + a^2 f_2(p,q) + \cdots + a^M f_M(p,q)$$

where only $f_1(p,q)$ is known to be positive semidefinite. However, since the ratio $f_i(p,q)/f_1(p,q)$ tends to a constant $C_i$ $(i = 2,3,\ldots,M)$, the same arguments can be used to show that the collective behavior results in absolute expediency for some sufficiently small value of $a$. On the basis of the above discussion, we can state the following theorem for a collective $\mathcal{A}_N^M$ of automata involved in an identical payoff game.


Theorem 2: If in a collective $\mathcal{A}_N^M$ of automata of the type described above, $N$ automata are involved in each of the $M$ hierarchies, at any stage $n$ all the $MN$ automata are aware of the outcomes of the $N$ games played at the $N$ levels, and the step size of the automaton chosen at stage $n$ and level $L+1$ is $\eta_{L+1}(n) = \eta_L(n)/p(n+1)$, where $p(n+1)$ is the probability of the action corresponding to that automaton being chosen, then there exists a constant $a^* \in (0,1]$ such that for all $0 < a < a^*$ the overall system is absolutely expedient.

We do not provide a proof of the above theorem since it follows exactly along the same lines as the proof of Theorem 1.

Comments: 1) The hierarchies in the above analysis need not be uniformly bifurcating. The various automata in the collective may have different numbers of actions, and the hierarchies themselves may have different numbers of levels. The same theory can be applied, with minor modifications, to such situations as well. 2) The analysis in Section IV was carried out using $\lambda = 1/2$. The same analysis applies for any constant value of $\lambda$ with $\lambda \in [0,1]$ and can be extended to any convex combination of the outputs of the different games. This results in an equivalent global output

$$\beta(n) = \sum_{i=1}^{N} \lambda_i\, \beta^i(n), \qquad \lambda_i \geq 0, \quad \sum_{i=1}^{N} \lambda_i = 1,$$

and the corresponding game matrix is

$$\bar D = \sum_{i=1}^{N} \lambda_i\, \bar D_i.$$


3) $SL_{R-I}$ algorithms are used in Sections IV and VI at all the levels for convenience. The same result is also obtained when any absolutely expedient algorithm is used. 4) In complex problems, decision makers may obtain information from different sources. For example, one response to an action may be the outcome of a stochastic game while another is the output of a stationary random environment. By suitably defining the performance function as well as the response $\beta(n)$ to be used in the learning algorithms, the same theory can be extended to a wider class of problems.

VII. AN EXAMPLE

In this section we present one specific example in the area of image recognition. While the example might appear somewhat contrived, it nevertheless serves to illustrate how the suggested method may be applicable to multilevel, multiresolution problems in the recognition of images. The problem of detecting objects in an image and labeling them consistently is well known. We consider the case where an image is available at two resolutions, and the problem is to determine whether there are objects belonging to the set {car, tank, ship, plane} in it and whether the background belongs to the set {sky, mountains, sea, flat land}. At a low level of resolution, cars and tanks look alike, while planes and ships look similar. Hence, the actions of the automaton at the first level may be $\alpha_1 = \{\text{car, tank}\}$ and $\alpha_2 = \{\text{plane, ship}\}$. The actions $\alpha_{11}$, $\alpha_{12}$, $\alpha_{21}$, and $\alpha_{22}$ then correspond to the individual elements car, tank, plane, and ship, respectively. Similarly, choosing $\gamma_1 = \{\text{sky, mountain}\}$ and $\gamma_2 = \{\text{sea, flat land}\}$ results in the four elements corresponding to the actions $\gamma_{11}$, $\gamma_{12}$, $\gamma_{21}$, and $\gamma_{22}$. At the first level, assuming $\alpha_i$ and $\gamma_k$ are chosen, the response is good ($\beta = 1$) or bad ($\beta = 0$) with probabilities $d_{ik}$ and $(1 - d_{ik})$, depending on the consistency of the labeling. Similarly, at the second level, the actions $\alpha_{11}$ (car) and $\gamma_{22}$ (flat land) are chosen from the image with higher resolution, and the response once again depends on the consistency of the two labels. The action probabilities are updated according to the rules specified in the previous sections and the process continues.

If the simulation results given in Section V correspond to the above problem, it implies that at the first level the action pairs $\alpha_1 = \{\text{car, tank}\}$, $\gamma_1 = \{\text{sky, mountain}\}$ and $\alpha_1 = \{\text{car, tank}\}$, $\gamma_2 = \{\text{sea, flat land}\}$ have higher reward probabilities than the other actions, with the pair $(\alpha_1, \gamma_1)$ being slightly better. However, at the lower level and with the higher resolution image, the pair $(\alpha_{11}, \gamma_{22})$, which corresponds to (car, flat land), has a significantly higher reward probability. In view of this, the overall system converges to this decision after repeated trials.
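As a concrete (and entirely hypothetical) instantiation, the matrices below encode the qualitative structure described in this section; they are placeholders for the unavailable matrices of Section V, intended only to show how the example maps onto the game matrices $D^1$ and $D^2$.

```python
import numpy as np

# Assumed consistency (reward) probabilities reflecting the description above:
# at low resolution, the object set alpha_1 = {car, tank} pairs well with either
# background set, with (alpha_1, gamma_1) slightly better ...
d1 = np.array([[0.7, 0.6],     # alpha_1 vs gamma_1, gamma_2
               [0.3, 0.4]])    # alpha_2 vs gamma_1, gamma_2

# ... while at high resolution the pair (alpha_11, gamma_22) = (car, flat land)
# is clearly the most consistent labeling.
d2 = np.full((4, 4), 0.3)
d2[0, 3] = 0.9                 # row alpha_11, column gamma_22

# These matrices play the roles of D^1 and D^2 in the collective of Section IV;
# feeding them to the earlier update sketches drives the hierarchy toward the
# (car, flat land) decision after repeated trials.
print(d1, d2, sep="\n")
```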


REFERENCES

[1] M. L. Tsetlin, "On the behavior of finite automata in random media," Automat. Remote Contr., vol. 22, pp. 1210-1219, 1962.
[2] V. I. Varshavskii and I. P. Vorontsova, "On the behavior of stochastic automata with variable structure," Automat. Remote Contr., vol. 24, pp. 327-333, 1963.
[3] D. E. Koditschek and K. S. Narendra, "Fixed structure automata in a multiteacher environment," IEEE Trans. Syst. Man Cybern., vol. SMC-7, pp. 616-624, 1977.
[4] M. A. L. Thathachar and R. Bhaktavatsalam, "Learning automata working in parallel environments," J. Cybern. Inform. Sci., vol. 1, pp. 121-127, 1978.
[5] K. R. Ramakrishnan, "Hierarchical systems and cooperative games of learning automata," Ph.D. dissertation, Dept. Engineering, Indian Institute of Science, Bangalore, India, 1982.
[6] N. Baba, New Topics in Learning Automata Theory and Applications. New York: Springer-Verlag, 1984.
[7] K. S. Narendra and M. A. L. Thathachar, "On the behavior of learning automata in a changing environment with application to telephone traffic routing," IEEE Trans. Syst. Man Cybern., vol. SMC-10, pp. 262-269, 1980.
[8] P. R. Srikantakumar and K. S. Narendra, "A learning model for routing in telephone networks," SIAM J. Contr. Optimization, vol. 20, pp. 34-57, 1982.
[9] K. S. Narendra and R. M. Wheeler, "An N-player sequential stochastic game with identical payoffs," IEEE Trans. Syst. Man Cybern., vol. SMC-13, pp. 1154-1158, 1983.
[10] M. A. L. Thathachar and K. R. Ramakrishnan, "A cooperative game of a pair of learning automata," Automatica, vol. 20, pp. 797-801, 1984.
[11] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[12] M. A. L. Thathachar and K. R. Ramakrishnan, "A hierarchical system of learning automata," IEEE Trans. Syst. Man Cybern., vol. SMC-11, pp. 236-241, 1981.


Benefits of Gain: Speeded Learning and Minimal Hidden Layers in Back-Propagation Networks

John K. Kruschke and Javier R. Movellan

Abstract—The gain of a node in a connectionist network is a multiplicative constant that amplifies or attenuates the net input to the node. The objective of this work is to explore the benefits of adaptive gains in back propagation networks. First it is shown that gradient descent with respect to gain greatly increases learning speed by amplifying those directions in weight space that are successfully chosen by gradient descent on weights. Adaptive gains also allow normalization of weight vectors without loss of computational capacity, and we suggest a simple modification of the learning rule that automatically achieves weight normalization. Finally, a method for creating small hidden layers by making hidden node gains compete according to similarities between nodes, with the goal of improved generalization performance, is described. Simulations show that this competition method is more effective than the special case of gain decay.

Manuscript received December 10, 1988; revised June 30, 1990. J. K. Kruschke was with the Department of Psychology, University of California at Berkeley, Berkeley, CA. He is now with the Department of Psychology, Indiana University, Bloomington, IN 47405. J. R. Movellan was with the Department of Psychology, University of California at Berkeley, Berkeley, CA. He is now with the Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA 15213. IEEE Log Number 9039978.

The back propagation learning algorithm [1]-[5] has become a very popular method for training connectionist networks. Two of the appealing properties of back propagation are its tolerable learning speed and its ability to generalize to novel inputs. Unfortunately, back propagation is sometimes too slow, and generalization is not always good. In this article we introduce a new parameter, gain, into back propagation networks and show that it can yield benefits for learning speed and generalization.

Consider a multilayer feed-forward network, as in standard back propagation. Let $a_i^s$ be the activation of the $i$th node of layer $s$, and let $a^s = [a_1^s \cdots a_n^s]^T$ be the column vector of activation values in layer $s$. The input layer is layer 0. Let $w_{ij}^s$ be the weight on the connection from the $j$th node in layer $s-1$ to the $i$th node in layer $s$, and let $w_i^s = [w_{i1}^s \cdots w_{in}^s]^T$ be the column vector of weights from layer $s-1$ to the $i$th node of layer $s$. The net input to the $i$th node of layer $s$ is defined as $\mathrm{net}_i^s = \langle w_i^s, a^{s-1} \rangle = \sum_k w_{ik}^s a_k^{s-1}$, and let $\mathrm{net}^s = [\mathrm{net}_1^s \cdots \mathrm{net}_n^s]^T$ be the column vector of net input values in layer $s$. The activation of a node is given by a function of its net input,

$$a_i^s = f(g_i^s\, \mathrm{net}_i^s) \qquad (1)$$

where $f$ is any function with a bounded derivative, and $g_i^s$ is a real number called the gain of the node.

Suppose that for a particular input pattern $a^0$, the desired output is the teacher pattern $t = [t_1 \cdots t_n]^T$, and the actual output is $a^L$, where $L$ denotes the output layer. Define an error function on that pattern, $E = \frac{1}{2}\sum_i (t_i - a_i^L)^2$. The overall error on the training set is simply the sum, across patterns, of the pattern error $E$. We then perform gradient descent on $E$ with respect to $w_{ij}^s$. The chain rule yields

$$\frac{\partial E}{\partial w_{ij}^s} = \frac{\partial E}{\partial a_i^s}\; f'(g_i^s\, \mathrm{net}_i^s)\; g_i^s\; a_j^{s-1} \qquad (2)$$

where $\delta_i^s \triangleq -\partial E/\partial \mathrm{net}_i^s$. In particular, the first three factors of (2) indicate that

$$\delta_i^s = g_i^s\, f'(g_i^s\, \mathrm{net}_i^s) \sum_k \delta_k^{s+1} w_{ki}^{s+1}. \qquad (3)$$

The recursive formula (3) for $\delta_i^s$ is the same as in standard back propagation [1], [2] except for the appearance of the gain parameter. Combining (2) and (3) yields the learning rule for weights:

$$\Delta w_{ij}^s = \epsilon_w\, \delta_i^s\, a_j^{s-1} \qquad (4)$$

where $\epsilon_w$ is a small positive constant called the "step size" of gradient descent with respect to the weights. Gradient descent on the error with respect to the gains can also be computed. Using the chain rule as previously, it is easy to compute that

$$\frac{\partial E}{\partial g_i^s} = -\Big(\sum_k \delta_k^{s+1} w_{ki}^{s+1}\Big) f'(g_i^s\, \mathrm{net}_i^s)\, \mathrm{net}_i^s. \qquad (5)$$

Then

$$\Delta g_i^s = \epsilon_g\, \delta_i^s\, \mathrm{net}_i^s / g_i^s \qquad (6)$$

where $\epsilon_g$ is the step size of the gains. The learning rule for gains (6) is easily incorporated into standard back propagation programs. In particular, all the quantities that appear in (6) are locally available at the affected gain $g_i^s$. An equivalent method was first introduced by Movellan [6] and independently proposed by Tawel [7]. Other authors (e.g.,

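A compact sketch of rules (1)-(6) for a one-hidden-layer network, with a sigmoid $f$ and illustrative step sizes; the XOR data, bias handling, and all hyperparameters are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 1.0 / (1.0 + np.exp(-x))      # logistic activation
df = lambda x: f(x) * (1.0 - f(x))          # its bounded derivative

# One hidden layer with 2 hidden nodes; a constant 1 is appended to each layer's
# activations so that the last weight column acts as a bias.
W1 = rng.normal(0.0, 0.5, (2, 3))           # hidden weights (incl. bias column)
W2 = rng.normal(0.0, 0.5, (1, 3))           # output weights (incl. bias column)
g1, g2 = np.ones(2), np.ones(1)             # per-node gains, initialized to 1
eps_w, eps_g = 0.5, 0.05                    # assumed step sizes

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)   # XOR, an illustrative task

def forward(x):
    a0 = np.append(x, 1.0)
    net1 = W1 @ a0
    a1 = np.append(f(g1 * net1), 1.0)
    net2 = W2 @ a1
    a2 = f(g2 * net2)
    return a0, net1, a1, net2, a2

for epoch in range(10000):
    for x, t in zip(X, T):
        a0, net1, a1, net2, a2 = forward(x)
        delta2 = g2 * df(g2 * net2) * (t - a2)                 # output-layer delta
        delta1 = g1 * df(g1 * net1) * (W2[:, :2].T @ delta2)   # eq. (3)
        W2 += eps_w * np.outer(delta2, a1)                     # eq. (4)
        W1 += eps_w * np.outer(delta1, a0)
        g2 += eps_g * delta2 * net2 / g2                       # eq. (6)
        g1 += eps_g * delta1 * net1 / g1                       # gains stay near 1 here

for x in X:
    print(x, forward(x)[-1])                # outputs should approach the XOR targets
```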