Learning Level-k Play in Noncooperative Games

By Roger A. McCain
Drexel University

For presentation at the conference of the Eastern Economic Association, Philadelphia, PA, 2010

While the Nash equilibrium is the best established and understood solution concept for noncooperative games, it has been argued that it is not an appropriate concept for one-off play (e.g., Bernheim, 1984), since the information necessary to correct mistaken expectations about the play of other agents would not be available in such play. However, even Bernheim's rationalizable strategies assume unbounded rationality and are, indeed, cognitively difficult. (For an introductory discussion, see McCain, 2010, Ch. 11.) In the light of experimental evidence that human rationality is indeed bounded, Crawford (with several co-authors) has proposed the level-k theory of strategy choice. The level-k theory proposes that agents are of different types, arranged in a hierarchy: level 0 players do not think strategically or model the counterpart's choice at all, while, for any k>0, level k players choose the best response to play by level k-1 players. This is a theory of boundedly rational play for one-off games, and it has been the subject of several experimental studies.1 However, the proportions of players of each type are assumed given, somewhat as a constant of nature.

Here is a prima facie criticism: if people gain more experience with the choice of strategies in games in general, does it not seem that they might learn to play in increasingly sophisticated ways? If so, then the frequency of occurrence of the different types should be subject to explanation as the product of such a learning process. The objective of this paper is to explore this possibility in the context of agent-based computer simulation.


1. Level-k Theory

This section is expository and reviews some illustrative examples and issues of level-k theory. A recognized issue for level-k theory is how to specify level 0 play. What does it mean to choose without strategic thinking? One obvious possibility is that the level 0 player chooses a strategy at random. However, in some experiments (Crawford et al., 2008), the observations suggest that level 0 play is better modeled in terms of cognitive salience. That is, some strategies are thought to have a cognitive salience that leads a naïve chooser to choose them systematically. The experiments also suggest that there are few if any level 0 agents; level 0 is not so much a model of actual agents as a model of the way that certain agents (level 1 agents) model the behavior of their counterparts. The two possibilities represent rather different hypotheses on this point: on the one hand, the hypothesis that the counterpart's decision is simply unpredictable, and on the other, that the counterpart's decision is predictable on the assumption that both counterparts share recognition of a pattern. Pattern recognition is a highly sophisticated cognitive skill (from the point of view of computer science) but one for which human beings seem to share an unconscious and arguably prerational capability. Since it is prerational, its inclusion in a theory of boundedly rational choice is not contradictory and, indeed, is unavoidable. Nevertheless, for the purposes of this study, cognitive salience will be ignored and level 0 play will be treated as random play.

To illustrate level-k reasoning, consider Game 1, Table 1, a two-player game with four strategies per player designed for this paper and entitled Tarbaby. Tarbaby is designed for complexity in the context of level-k theory. While it has a unique Nash equilibrium in pure strategies at (1, I), this equilibrium is reached only with level 4 play, and levels 1-4 all lead to different strategies, as shown in Table 2 below. (Like the Tarbaby in the story, one who struggles with this game becomes more deeply entangled in it.) Strategies 2 and III are the best responses to random choice among the four strategies; however, if A (at level 1) chooses 2, B's best response (at level 2) is strategy IV, for a payoff of 5; but that, in turn, elicits a best response (at level 3) from A of strategy 3, and so on.

Table 1. Game 1: Tarbaby

                     B
A          I         II        III       IV
1        10, 10     9, 0      3, 8      1, 5
2         8, 3      8, 0      0, 4      8, 5
3         0, 9      7, 7      0, 8      9, 0
4         5, 1      7, 5      6, 4      1, 1

Table 2. Levels and Strategies in Tarbaby

At level 1, A's strategy is 2 and B's strategy is III.
At level 2, A's strategy is 4 and B's strategy is IV.
At level 3, A's strategy is 3 and B's strategy is II.
At level 4, A's strategy is 1 and B's strategy is I.
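To make the best-response hierarchy concrete, the following Python sketch (illustrative only, not the author's code) computes the level-k strategies for Tarbaby from the payoffs in Table 1, treating level 0 as a uniform random mix over the four strategies; running it reproduces Table 2.

# A minimal sketch of level-k reasoning in Tarbaby.
# Level 0 is modeled as a uniform random mix over the four strategies;
# level k is the best response to the counterpart's level k-1 strategy.

# Payoffs (A, B) from Table 1; rows = A's strategies 1-4, columns = B's I-IV.
PAYOFFS = [
    [(10, 10), (9, 0), (3, 8), (1, 5)],   # A plays 1
    [(8, 3),   (8, 0), (0, 4), (8, 5)],   # A plays 2
    [(0, 9),   (7, 7), (0, 8), (9, 0)],   # A plays 3
    [(5, 1),   (7, 5), (6, 4), (1, 1)],   # A plays 4
]

def best_vs_random(player):
    """Level-1 strategy: best response to a uniform mix by the counterpart."""
    if player == "A":
        scores = [sum(PAYOFFS[r][c][0] for c in range(4)) for r in range(4)]
    else:
        scores = [sum(PAYOFFS[r][c][1] for r in range(4)) for c in range(4)]
    return scores.index(max(scores))

def best_response(player, other_strategy):
    """Best response to a specific pure strategy of the counterpart."""
    if player == "A":
        scores = [PAYOFFS[r][other_strategy][0] for r in range(4)]
    else:
        scores = [PAYOFFS[other_strategy][c][1] for c in range(4)]
    return scores.index(max(scores))

def level_k_strategies(max_level=4):
    """Return {k: (A's strategy index, B's strategy index)} for k = 1..max_level."""
    a, b = best_vs_random("A"), best_vs_random("B")
    result = {1: (a, b)}
    for k in range(2, max_level + 1):
        a, b = best_response("A", b), best_response("B", a)
        result[k] = (a, b)
    return result

roman = ["I", "II", "III", "IV"]
for k, (a, b) in level_k_strategies().items():
    print(f"level {k}: A plays {a + 1}, B plays {roman[b]}")
# Output matches Table 2: (2, III), (4, IV), (3, II), (1, I).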

For a second example, consider the familiar Prisoner's Dilemma, shown in Table 3. For this game, at any positive level, strategies 2 and II – the dominant strategies – are chosen. This illustrates an important fact about level-k solutions: if at any level (in this case level 1) a Nash equilibrium is played, then the Nash equilibrium is recapitulated at every higher level. We will consider one other example, another familiar two-by-two game, the Battle of the Sexes, shown in Table 4. At level 1, A chooses strategy 1 and B chooses strategy II; at level 2 they are reversed, and they continue alternating over odd and even k for all higher k. This illustrates that 1) even if there is a Nash equilibrium in pure strategies, it may not be played at any level, so that 2) a Nash equilibrium is played only when players at different levels are matched.

Table 3. The Prisoner's Dilemma

                B
A          I         II
1        5, 5      0, 6
2        6, 0      1, 1

Table 4. The Battle of the Sexes

                B
A          I         II
1        6, 5      3, 3
2        2, 2      5, 6

The experimental studies sometimes allow for other "types," including those who play a unique Nash equilibrium spontaneously and those "sophisticated" players who attempt to estimate the probability of being matched with a player of each different "type" and choose a best response on that basis. These are observed with low frequencies, if at all. There also appear to be few, if any, players at level 3 or higher. Thus, in practice, we might limit our experimental categories to types 1 and 2, who respectively treat the counterpart's play as unpredictable or as a best response to unpredictable play, along with equilibrium and sophisticated players. This study will, however, allow for types at levels 0-4. For our purposes, if types 0, 3, and 4 are uncommon, this should be a result, not an assumption, of the model. Equilibrium and sophisticated types will not be considered here but are left for future research.


2. Learning

The learning model for this paper is in the broad tradition of reinforcement learning (Bush and Mosteller, 1955; Camerer and Ho, 1999) or Q-learning (Tuyls and Nowe, 2005, see esp. p. 87), as reflected in recent work in artificial intelligence (Anderson et al., 2004) and the neurological study of decision processes (Montague, 2006). For a model of this kind, the decision among a number of alternatives is made probabilistically, with the probability of a particular alternative taken from a probability distribution that associates higher probability with alternatives that have, in the agent's past experience, resulted in greater reinforcement (higher payoffs, profits, or utility). As the agent gains experience, the probability distribution becomes more concentrated on one particular alternative, commonly the optimal one. Reinforcement learning has sometimes been rejected as too slow to account (alone) for real human learning, but when associated with other forms of neural information processing it has come to play a role in the neurology of decision-making (Montague) and can be the basis for the choice made among heuristic rules in artificial intelligence models (Anderson et al.). In this study the heuristic rules are defined by play at level 0, 1, 2, 3, or 4. Expected value payoffs for each level of play are computed by a Koyck averaging process on past experience. The probability of choosing level i is then the fraction

\frac{EV_i^{\kappa}}{\sum_{j=0}^{4} EV_j^{\kappa}}

where EV_j is the expected value payoff for play at level j and κ is a constant that determines the reliability of the decision in choosing the level with the greatest expected value. On the basis of the available empirical evidence, in this study κ = 2.2 (Anderson et al., 2004). Figure 1 shows the influence of κ on the probability that a given strategy is chosen, for a choice between just two alternatives. With just two alternatives, if EV_i/(EV_i + EV_j) > 0.5, then i is the alternative with the greater expected value; thus, if the expected values are equal, the probabilities are 50/50. Otherwise, the solid line shows the probability of choosing strategy i with a κ of 5, the gray line shows the same for a κ of 2, and the cross-hatched line for a κ of 50. As we see, a greater κ implies a greater probability that the strategy with the larger expected value will be chosen.


Figure 1. The Influence of Payoff on Probability of Choice

For this study, the level of play is chosen according to this algorithm with 95% probability and chosen at random with equal probabilities in 5% of cases, to assure that agents will eventually have some experience of all alternatives, and thus avoid locally stable equilibria that are artifacts of the initialization or very early experience.
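As a concrete illustration of the learning rule described in this section, the following Python sketch keeps a Koyck (exponentially weighted) average of the payoff experienced at each level, chooses a level with probability proportional to its expected value raised to the power κ = 2.2, as in the fraction above, and explores uniformly at random in 5% of cases. It is a sketch under stated assumptions rather than the author's program: the smoothing weight and the initial expected values are not reported in the paper and are assumed here.

import random

KAPPA = 2.2              # reliability constant, as in the text
EPSILON = 0.05           # 5% random exploration, as in the text
LAMBDA = 0.1             # Koyck smoothing weight -- an assumed value
LEVELS = list(range(5))  # levels 0 through 4

class LevelLearner:
    def __init__(self, initial_ev=1.0):
        # Initial expected values are an assumption; any positive start works.
        self.ev = {k: initial_ev for k in LEVELS}

    def choose_level(self):
        if random.random() < EPSILON:
            return random.choice(LEVELS)                 # occasional exploration
        weights = [self.ev[k] ** KAPPA for k in LEVELS]  # the fraction given above
        return random.choices(LEVELS, weights=weights)[0]

    def update(self, level, payoff):
        # Koyck averaging: a weighted mix of the old average and the new payoff.
        self.ev[level] = (1 - LAMBDA) * self.ev[level] + LAMBDA * payoff

# Example: an agent paid more for level 2 than for the other levels gradually
# concentrates its choices (and its expected values) on level 2.
agent = LevelLearner()
for _ in range(2000):
    k = agent.choose_level()
    agent.update(k, payoff=8.0 if k == 2 else 3.0)
print({k: round(v, 2) for k, v in agent.ev.items()})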


3. Results

a. Tarbaby

The first two experiments are designed primarily to validate the simulation program. We first explore a simulation of the Tarbaby game, as an illustration of the approach and the kinds of results we may find. For this example, there were 100 simulated agents, and at each iteration two agents chosen at random played the two unsymmetrical roles in the Tarbaby game. There were 10,000 iterations. Agents had information only on their own experience and there was no imitative learning, nor was there a cognitive cost associated with higher levels such as levels 3 and 4. The evolution of the levels of play by 100 randomly matched players over 10,000 iterations is shown in Figure 2.

Figure 2. Number of Agents Choosing Each Level of Play
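A rough, self-contained sketch of the experimental design just described might look as follows. It is not the author's program: it compresses the learning rule of Section 2 (with the same assumed smoothing weight and initial values) and takes the level-to-strategy map directly from Table 2 rather than recomputing best responses.

import random
from collections import Counter

PAYOFFS = [  # Table 1, payoffs (A, B); rows = A's strategies, columns = B's
    [(10, 10), (9, 0), (3, 8), (1, 5)],
    [(8, 3),   (8, 0), (0, 4), (8, 5)],
    [(0, 9),   (7, 7), (0, 8), (9, 0)],
    [(5, 1),   (7, 5), (6, 4), (1, 1)],
]
# Table 2 as 0-based indices: level -> (A's strategy, B's strategy)
LEVEL_STRATEGY = {1: (1, 2), 2: (3, 3), 3: (2, 1), 4: (0, 0)}
KAPPA, EPSILON, LAMBDA = 2.2, 0.05, 0.1   # LAMBDA (smoothing) is an assumed value

def choose_level(ev):
    if random.random() < EPSILON:
        return random.randrange(5)
    return random.choices(range(5), weights=[v ** KAPPA for v in ev])[0]

def strategy(level, role):
    if level == 0:
        return random.randrange(4)        # level 0: random play
    a, b = LEVEL_STRATEGY[level]
    return a if role == "A" else b

agents = [[1.0] * 5 for _ in range(100)]  # each agent: expected value by level
for _ in range(10_000):
    i, j = random.sample(range(100), 2)   # i takes role A, j takes role B
    ka, kb = choose_level(agents[i]), choose_level(agents[j])
    row, col = strategy(ka, "A"), strategy(kb, "B")
    pay_a, pay_b = PAYOFFS[row][col]
    agents[i][ka] = (1 - LAMBDA) * agents[i][ka] + LAMBDA * pay_a
    agents[j][kb] = (1 - LAMBDA) * agents[j][kb] + LAMBDA * pay_b

# Crude end-of-run summary: the level each agent is most likely to choose.
print(Counter(max(range(5), key=lambda k: ev[k]) for ev in agents))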


Initialization artifacts are largely eliminated in the first 200 iterations (indeed, the first 50) and so are not shown. What we see is that play at levels 1, 2, and 4 remains most common, with little distinction among them; level 0 play tends to be at 10% or less, and level 3 play is even less common, usually less than 5% of agents. The reason for this is far from mysterious. Figure 3 shows the evolution of the expected payoff by level of play, averaged over the 100 players. Here we see that the expected payoff for level 3 play is considerably less than that for the other levels, including level 0.

Figure 3. Average Expected Value Payoffs by Levels

A careful examination of the game lends some insight as to why this happens. Level 3 play pays rather well if both players play at that level, with payoffs of 7, 7; but otherwise it is something of a disaster. Consider Table 5:


Table 5. Payoffs for Level 3 Play in Tarbaby

A plays at level   B plays at level   Payoff to the level-3 player
1                  3                  0
3                  1                  0
2                  3                  5
3                  2                  0
4                  3                  0
3                  4                  0

With a distribution of types, then, this particular game is highly prejudiced against level 3 play. The low frequency of level 3 play is then a (relatively) rational response to the context of the choice. This result illustrates the following points: 1) we can expect that the results of learning will differ from one game to another; 2) learning in these simulations does respond in a predictable way to opportunities; and 3) despite a favorable payoff to common play at a relatively high level (level 4), and despite the fact that there are no explicit cognitive costs of high-level play, the distribution of play over levels 1-4 remains rather stable, and the lower levels (such as 1 and 2) are not eliminated in this instance. The random number seed for this experiment was 310545, but a number of other experiments gave qualitatively similar results for Tarbaby.


b. Four Familiar Two-By-Two Games

In this experiment, agents were paired to play games drawn from four familiar two-by-two games: the Prisoner's Dilemma, the Battle of the Sexes, and two variants on the Stag Hunt. The Prisoner's Dilemma and Battle of the Sexes games have been shown as Tables 3 and 4 above. For the two variants of the Stag Hunt, see Tables 6 and 7 below. For the first version of the Stag Hunt, the strategy chosen at all positive levels is strategy 1 (which corresponds to the payoff-dominant Nash equilibrium), and for the second version it is strategy 2 (which corresponds to the payoff-dominated equilibrium). For these games, apart from level 0, there is little advantage in higher rather than lower levels of play. For the Prisoner's Dilemma, the strategy played at all levels except level 0 is the dominant strategy.

Table 6. A Stag Hunt Game

                B
A          1         2
1        8, 8      0, 3
2        3, 0      4, 4

Table 7. Another Stag Hunt Game

                B
A          1         2
1        6, 6      0, 3
2        3, 0      4, 4

At each iteration, agents were matched at random to play one of these four games, with the game also chosen at random. Figure 4 below shows the number of agents choosing each level of play in this experiment. We see that, apart from an anomaly at about 80,000 repetitions, level 0 play is largely eliminated, but there is little if any trend in the other levels, very much as we would expect. The average expected value payoffs are shown in Figure 5. Here again we see little tendency for one level to be more successful than another. Figure 6 shows the frequency with which the four games were played in each reporting period of 200 iterations. The random number seed for this series was 310545. This experiment illustrates how experience based on the play of a number of different games in random alternation may influence the choice of a level of play. Comparing the first two experiments, we find that the simulations may or may not result in different tendencies to choose different levels, and that these tendencies correspond in reasonable ways to the different structures of the games.

Figure 4. Numbers Choosing Levels in Playing One of Four Common Two-By-Two Games


Figure 5. Expected Values of Levels with Four Common Two-by-Two Games

Figure 6. Frequency of Play of the Four Familiar Games


c. Twenty Randomly Generated Four-Strategy Games

The software used in this study includes a capability to generate games at random. The games are two-person games with four strategies each, not in general symmetrical, and to that extent similar to the Tarbaby game. For each randomly generated game, each payoff number is chosen by an equiprobable random selection from the integers 0 to 10. In some preliminary experiments, four such games were generated, and at each iteration of the simulation the game played was randomly chosen. However, within these small sets of randomly selected games, results varied widely, as in some cases the small set included highly complex games, while in other cases relatively simple games predominated. As the experiment with Tarbaby shows, a particular game may lead to extreme results, and this seems to be true also of relatively small sets of randomly generated games. This section reports an experiment with twenty randomly generated games. Of these games, ten have one or more Nash equilibria in pure strategies and converge to the Nash equilibria at level 1 (four cases), level 2 (four cases), or level 3 (two cases). Five have no Nash equilibria in pure strategies. The remaining five have one or more Nash equilibria but do not converge to them at any level; of those five, three have unique Nash equilibria and two have multiple Nash equilibria in pure strategies, and in at least two cases the failure to converge is connected with the weak character of the Nash equilibria. The number of agents choosing each level is shown in Figure 7. The expected value payoffs are shown in Figure 8. To verify that games were randomly chosen, the number of times each game was chosen in each 200-iteration reporting period is shown in Figure 9. The random number seed was 765234.
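The following Python sketch illustrates, under stated assumptions and not as the author's actual generator, how such games can be drawn and classified: payoffs are independent uniform draws from the integers 0-10, pure-strategy Nash equilibria are found by checking mutual (possibly weak) best responses, and level-k play is traced from a uniform level 0 to see at which level, if any, it first lands on an equilibrium. The resulting counts will of course differ from the paper's twenty games, since the pseudorandom series is different.

import random
from itertools import product

def random_game(n=4, rng=random):
    # Each payoff is an equiprobable draw from the integers 0..10.
    return [[(rng.randint(0, 10), rng.randint(0, 10)) for _ in range(n)]
            for _ in range(n)]

def pure_nash(game):
    n = len(game)
    return [(r, c) for r, c in product(range(n), range(n))
            if all(game[r][c][0] >= game[r2][c][0] for r2 in range(n))
            and all(game[r][c][1] >= game[r][c2][1] for c2 in range(n))]

def level_path(game, max_level=4):
    n = len(game)
    # Level 1: best responses to a uniform random counterpart.
    a = max(range(n), key=lambda r: sum(game[r][c][0] for c in range(n)))
    b = max(range(n), key=lambda c: sum(game[r][c][1] for r in range(n)))
    path = [(a, b)]
    for _ in range(2, max_level + 1):
        a, b = (max(range(n), key=lambda r: game[r][b][0]),
                max(range(n), key=lambda c: game[a][c][1]))
        path.append((a, b))
    return path

def classify(game):
    eq, path = set(pure_nash(game)), level_path(game)
    hits = [k + 1 for k, cell in enumerate(path) if cell in eq]
    if not eq:
        return "no pure-strategy Nash equilibrium"
    if hits:
        return f"level-k play reaches a Nash equilibrium at level {hits[0]}"
    return "Nash equilibria exist, but levels 1-4 do not reach them"

rng = random.Random(765234)   # any seed; the paper's seed applies to its own software
for g in range(20):           # twenty illustrative games
    print(g, classify(random_game(rng=rng)))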

Figure 7. Levels Chosen with a Large Set of Randomly Generated Games

Figure 8. Expected Values with a Large Set of Randomly Generated Games


Figure 9. Repetitions of Each Game in 200 Iterations

In this example, we see that level 0 play is relatively scarce, although it rarely constitutes less than 5% of play. The expected payoff of level 0 play is consistently less than the expected payoffs from play at positive levels. Apart from that, however, there is little sign of any tendency for any of the positive levels to predominate or to be eliminated. Further experiments along the same lines produced similar results.

The level-k theory is justified partly by the idea that lower levels in the hierarchy conserve cognitive resources. There are, however, no differences of cognitive effort in these simulations. This may account for the tendency of the higher levels (3 and 4) to persist in about equal proportions with the lower levels (1 and 2). Indeed, we observe that in these simulations there is no tendency for the lower levels to be eliminated either, despite the fact that the higher levels require no more effort. This seems to reflect the facts that 1) the higher levels seem to yield no higher payoffs, even in randomly generated games, a few of which converge to their Nash equilibria only at level 4, and 2) the choices made by agents drawn at random from a population of level 1-4 players may approximate random play nearly enough that level 1 play does rather well. Conversely, the persistence of level 1 play means that level 2 players do fairly well, and so on through level 4, so that roughly equal distributions could be a stable situation. The simulation results reported in sections b and c are representative of a number of other experiments with other pseudorandom number series, which are not reported in the interest of brevity.

4. Learning Level-k Play with a Cognitive Cost

As noted, the simulations discussed in the previous sections assumed that playing at a higher level involves no particular cognitive effort or cost. This section revisits the simulation with the 20 randomly generated games, allowing for increasing cognitive effort costs associated with higher-level play. For these simulations, cognitive effort cost is determined by a constant z, so that the payoff to level 2 play is decreased by z, the payoff to level 3 play by 2z, and the payoff to level 4 play by 3z. There is no cognitive effort cost associated with either level 0 or level 1 play. In some experiments not reported in detail, when a cognitive effort cost was associated with level 1 play, level 0 play became common; since the objective of this exercise is to reproduce – if possible – the experimental finding that levels 0, 3, and 4 are uncommon, this seemed counterproductive. As before, the same twenty randomly generated games were played in random rotation. As an example of the results, consider Figures 11 and 12, which show the frequency of play of the different levels and the expected value of payoffs by level when z = 0.75.
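In code, this cost schedule amounts to a one-line adjustment of the payoff an agent experiences before its Koyck average is updated. A minimal sketch (illustrative only, not taken from the author's program):

def net_payoff(gross_payoff, level, z):
    """Payoff net of cognitive effort cost: z per level above level 1."""
    return gross_payoff - z * max(level - 1, 0)

# With z = 0.75, as in Figures 11 and 12: levels 0 and 1 keep the full payoff,
# while levels 2, 3, and 4 give up 0.75, 1.5, and 2.25 respectively.
for level in range(5):
    print(level, net_payoff(8.0, level, z=0.75))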


Figure 11. Frequency of Play by Level with z=0.75

Figure 12. Expected Values of Payoffs by Level with z=0.75

The value of 0.75 for the cognitive effort cost is substantial, in that it reduces the expected value payoff for level 3 play in the simulation with z=0 by about 40%. Figures 13 and 14 show how the frequency of, and net payoff to, level 3 play evolve as z varies from 0 to 1. We see that smaller values of z leave level 3 play at a frequency that is comparable to that of levels 1 and 0. With z at 0.75 or 1, we see that about half of agents play at level 1, and fewer at level 2, while the other levels remain at frequencies in the neighborhood of 5% or less. Level 4 play declines more rapidly than level 3, not surprisingly.

Figure 13. Frequency of Level 3 Play as z Varies

Figure 14. Expected Value of Level 3 Play as z Varies


While the simulations with relatively high cognitive effort cost do lead to relatively frequent play at levels 1 and 2 and the relative disappearance of levels 3 and 4, as well as of level 0 play, the assumptions necessary to generate this result are fairly extreme. On the one hand, we suppose that there is no cognitive cost for level 1 play, by comparison at least with level 0 play; on the other, the cognitive effort costs at each further step are a very substantial proportion of the expected value of payoffs from each level of play.

5. Further Research, Conclusion and Summary

The results here have been based on relatively few experiments, and further experiments might extend or confirm the results reported here. There are some refinements of the program that might best be undertaken before this is done. Agent "types" who spontaneously choose Nash equilibrium strategies, and sophisticated agents, are not included in this study; it would be desirable to extend the model to include agents of these sorts. Finally, the model as it stands treats weak Nash equilibria rather arbitrarily, and a better treatment of weak Nash equilibria is desirable.

Nevertheless, the study suggests some important conclusions. The striking point about these results is that, in general, more cognitively complex decision processes (i.e., levels 3 and 4, which first require that the level 1 and 2 solutions be computed) are not more advantageous, even in the absence of any cognitive cost. For a particular game (such as Tarbaby) they may be positively disadvantageous. Conversely, in a complex game such as Tarbaby, a mixed population including players at levels 2-4 may approximate a random distribution over strategies, so that level 1 play could approximate sophisticated play and be relatively successful! This being so, and supposing also that level 3 and higher play does require greater cognitive effort than level 1 or 2 play, it is far from surprising that play at level 3 or higher is at best very uncommon. Even if the cognitive cost of such play is relatively slight, there being no benefits, higher-level play will seldom if ever be reinforced.


References

Bernheim, B. Douglas (1984), "Rationalizable Strategic Behavior," Econometrica v. 52, no. 4 (July), pp. 1007-1028.

Brocas, Isabelle, Camerer, Colin, Carrillo, Juan D., and Wang, Stephanie W. (2009), Measuring Attention and Strategic Behavior in Games with Private Information (CEPR Discussion Papers: 7529).

Bush, R. and F. Mosteller (1955), Stochastic Models of Learning (New York: Wiley and Sons).

Camerer, C. and T. H. Ho (1999), "Experience-Weighted Attraction Learning in Normal Form Games," Econometrica v. 67, no. 4 (July), pp. 827-874.

Charness, Gary and Levin, Dan (2007), The Origin of the Winner's Curse: A Laboratory Study (Department of Economics, University of California at Santa Barbara, Economics Working Paper Series: 17-07c).

Costa-Gomes, Miguel A., Crawford, Vincent P., and Iriberri, Nagore (2000), "Comparing Models of Strategic Thinking in Van Huyck, Battalio, and Beil's Coordination Games," Journal of the European Economic Association v. 7, no. 2-3 (April-May), pp. 365-76.

Crawford, Vincent P., Gneezy, Uri, and Rottenstreich, Yuval (2008), "The Power of Focal Points Is Limited: Even Minute Payoff Asymmetry May Yield Large Coordination Failures," American Economic Review v. 98, no. 4 (September), pp. 1443-1458.

Georganas, Sotiris and Nagel, Rosemarie (2008), English Auctions with Toeholds: An Experimental Study (Department of Economics and Business, Universitat Pompeu Fabra, Economics Working Papers).

Koyck, L. M. (1954), Distributed Lags and Investment Analysis (Amsterdam: North-Holland Publishing Company).

Montague, Read (2006), Why Choose This Book? (New York: Dutton).

Tuyls, Karl and Ann Nowe (2005), "Evolutionary Game Theory and Multi-Agent Reinforcement Learning," The Knowledge Engineering Review v. 20, no. 1, pp. 63-90.


Endnote

1. E.g., Costa-Gomes et al. (2000), Crawford et al. (2008), Charness and Levin (2007), Brocas et al. (2009), and Georganas and Nagel (2008).
