Performance Management in Competitive Distributed Web Search

Rinat Khoussainov and Nicholas Kushmerick
Department of Computer Science, University College Dublin
Belfield, Dublin 4, Ireland
{rinat, nick}@ucd.ie

1. Introduction

Distributed heterogeneous search environments are an emerging phenomenon in Web search. They can be viewed as a federation of independently controlled metasearchers and many specialised search engines. Specialised search engines provide focused search services in a specific domain (e.g. a particular topic). Metasearchers help to process user queries effectively and efficiently by distributing them only to the search engines most suitable for each query. Compared to traditional search engines like Google or AltaVista, specialised search engines (together) provide access to arguably much larger volumes of high-quality information resources, frequently called the “deep” or “invisible” Web [10]. We envisage that such heterogeneous environments will become more popular and influential.

Previous research has mainly targeted the performance of such environments from the user’s perspective (e.g. improved quality of search results). However, a provider of search services is more interested in parameters like the number of queries processed versus the amount of resources used. Performance optimisation of search engines from the service provider’s point of view is the focus of our work.

An important factor that affects search engine performance in heterogeneous search environments is competition with other independently controlled search engines. When there are many engines available, users are interested in sending queries to those that would provide the best possible results. Thus, the service offered by one search engine influences the queries received by its competitors. Multiple search providers can be viewed as participants in a search services market competing for user queries.

We examine the problem of performance-maximising behaviour for non-cooperative specialised search engines in heterogeneous search environments. In particular, we analyse a scenario in which individual search engines compete for queries by choosing which documents to index. We provide a game-theoretic analysis of a simplified version of the problem and motivate the use of the concept of “bounded rationality” [9]. Bounded rationality assumes that decision makers are unable to act optimally in the game-theoretic sense due to incomplete information about the environment and/or limited computational resources. We then cast our problem as a reinforcement learning task, where the goal of a specialised search engine is to exploit the sub-optimal behaviour of its competitors to improve its own performance.

2. Problem Formalisation

Performance metric. We adopt an economic view of search engine performance. Performance is the difference between the value of the search service provided (income) and the cost of the resources used to provide the service. In our simplified version of the problem we use the following formula for the search engine performance:

P = α1 Q − α2 QD − α3 C − α4 D,

where Q is the number of queries received in a given time interval, D is the number of documents in the engine’s index, C is the number of new documents added to the index during the given time interval, and the αx are constants. The term α1 Q represents the service value: if the price of processing one search request for a user is α1, then α1 Q is the total income from service provisioning. The term α2 QD represents the cost of processing search requests: if x amount of resources is sufficient to process Q queries, then we need 2x to process twice as many queries in the same time, and, similarly, twice as many resources are needed to search twice as many documents in the same time. Thus, the amount of resources is proportional to both Q and D, and the cost can be expressed as α2 QD, where α2 reflects the resource costs. The example of the FAST search engine confirms that this cost model is not far from reality [7]. The term α3 C is the cost of crawling and indexing new documents. While the cost of indexing new documents is indeed proportional to the number of documents added, the cost of crawling can vary between documents, so we assume an average crawling cost. Finally, α4 D is the cost of document storage and maintenance. This includes storing a copy of each document as well as keeping the document description up to date.

Engine selection model. Let us assume that users only issue queries on a single topic, e.g. “cars”; we will see later how this extends to multiple topics. It is reasonable to assume that users would like to send queries to the search engines that contain the most relevant documents on the topic, and the more of them, the better. Search engine indices are populated by topic-specific (focused) Web robots [3]. Suppose that search engines have “ideal” Web robots which, for a given D, can find the D most relevant documents on the topic. Then the engine with the largest number of documents on the topic will be the most suitable for user queries. This model can be extended to multiple topics if the state of search engine i is represented by the number of documents Dit that it indexes for each topic t. A query on topic t is sent to the engine i with the largest Dit. Of course, such an extension ignores possible overlaps between topics in both queries and documents. On the other hand, if we associate each topic with a search term, the whole engine selection model closely reflects how some real-life engine selection algorithms work [1].

Decision making process. For each time interval, search engines simultaneously and independently decide how many documents to index on each topic and allocate resources to process the expected search queries (thus incurring the corresponding costs). Since search engines cannot have unlimited crawling resources, we presume that they can only make incremental adjustments to their index contents, which take the same time for all engines. The user queries are allocated to the engines based on their index parameters (Dit) as described above. An important detail is that search engines cannot know in advance how many queries users will actually submit during the time interval or how many queries a given search engine will receive (since this also depends on the competitors). We assume that if an engine allocates more resources than it receives queries, the cost of the idle resources is wasted. If, on the contrary, a search engine receives more queries than expected (i.e. more queries than it can process), the excess queries are simply rejected and the search engine does not benefit from them. Such a model reflects the natural inertia in resource allocation: we cannot buy or sell computing resources “on the fly” for the processing of each individual user query. Thus, the performance of search engine i in a given time interval is

Pi = α1 min(Qi, Q̂i) − α2 Q̂i Di − α3 Ci − α4 Di,

where Q̂i is the number of queries expected by engine i in this time interval, Qi is the number of queries actually received, Di = Σ_t Dit is the number of documents in the engine’s index, and Ci is the number of documents added to the index during the given time interval.
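As a concrete illustration of the per-interval performance defined above, the following sketch computes Pi for one engine. It is a minimal example only: the α constants used here are arbitrary illustrative values, not those used in our simulations.

```python
def engine_performance(queries_received, queries_expected, docs_indexed,
                       docs_added, alpha1=1.0, alpha2=0.01, alpha3=0.1, alpha4=0.05):
    """P_i = a1*min(Q_i, Qhat_i) - a2*Qhat_i*D_i - a3*C_i - a4*D_i.

    The engine only earns income for queries it both expected (allocated
    resources for) and actually received; idle resources and rejected
    queries bring no benefit.  The alpha constants are illustrative.
    """
    income = alpha1 * min(queries_received, queries_expected)
    processing_cost = alpha2 * queries_expected * docs_indexed
    crawling_cost = alpha3 * docs_added
    storage_cost = alpha4 * docs_indexed
    return income - processing_cost - crawling_cost - storage_cost


# Example: an engine that expected 100 queries, received 80,
# indexes 50 documents and added 5 new ones this interval.
print(engine_performance(80, 100, 50, 5))
```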

In our experiments, the number of user queries expected by engine i on topic t at stage k equals the number of queries actually submitted on this topic at the previous stage: Q̂it(k) = Qt(k − 1). Then, taking into account that a search engine only expects queries on the topics for which it has documents indexed (Dit > 0), the total number of queries expected is

Q̂i = Σ_{t : Dit > 0} Q̂it.
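As a small illustration of this estimate (the data structures below are hypothetical), the sketch sums the previous day's per-topic query counts over the topics the engine currently indexes.

```python
def expected_queries(prev_day_queries, docs_per_topic):
    """Qhat_i = sum of yesterday's query counts over topics with D_i^t > 0.

    prev_day_queries: dict topic -> number of queries submitted yesterday
    docs_per_topic:   dict topic -> number of documents engine i indexes
    """
    return sum(q for topic, q in prev_day_queries.items()
               if docs_per_topic.get(topic, 0) > 0)


print(expected_queries({"cars": 40, "java": 25}, {"cars": 3, "java": 0}))  # -> 40
```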

We assume that the metasearcher(s) provide statistics on the number of previously submitted user queries. The number of queries on topic t actually received by engine i can be represented as

Qit = 0, if there exists an engine j with Djt > Dit;
Qit = Qt / |B|, if i ∈ B, where B = {b : Dbt = maxj Djt},

where B is the set of search engines indexing the most documents for topic t, and Qt is the number of queries on topic t actually submitted by the users. That is, the search engine does not receive any queries if it indexes fewer documents than some competitor, and receives its appropriate share when it is among the “best” engines for topic t. Thus, the total number of queries received by search engine i is

Qi = Σ_{t : Dit > 0} Qit.
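The engine selection and query allocation rule above can be sketched as follows; the function and data layout are illustrative only and do not correspond to an actual metasearcher implementation.

```python
def allocate_queries(docs_per_engine, queries_per_topic):
    """Split each topic's queries equally among the engines that index
    the most documents for that topic (the set B in the text).

    docs_per_engine:   dict engine -> dict topic -> D_i^t
    queries_per_topic: dict topic -> Q^t
    Returns dict engine -> total queries received Q_i.
    """
    received = {engine: 0.0 for engine in docs_per_engine}
    for topic, q_total in queries_per_topic.items():
        sizes = {e: docs.get(topic, 0) for e, docs in docs_per_engine.items()}
        best = max(sizes.values())
        if best == 0:
            continue  # no engine indexes this topic
        winners = [e for e, d in sizes.items() if d == best]
        for e in winners:
            received[e] += q_total / len(winners)
    return received


# A and B tie on "cars" and split its queries; B alone receives "java" queries.
print(allocate_queries({"A": {"cars": 5}, "B": {"cars": 5, "java": 2}},
                       {"cars": 30, "java": 10}))
```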

3. Analysis and Solution Approach

The decision-making process can be modelled as a multi-stage game. At each stage, a matrix game is played in which the players are the search engines, the actions are the values of (Dit), and player i receives the payoff Pi (as above). If player i knew the actions of its opponents at a future stage k, it could calculate the optimal response as the one maximising Pi(k). In reality, players do not know the future. One possible way around this would be to agree on (supposedly mutually beneficial) actions in advance. To avoid deception, players would have to agree on playing a Nash equilibrium of the game [5], since then there would be no incentive for them not to follow the agreement. Agreeing to play a Nash equilibrium, however, becomes problematic when the game has multiple such equilibria. Players would be willing to agree on a Nash equilibrium yielding the highest (expected) payoffs, but the task of characterising all Nash equilibria of a game is NP-hard even given complete information about the game (as follows from [4]). These NP-hardness results, and the possibility that players may not have complete information about the game, lead to the idea of “bounded rationality”, in which players may not use the optimal strategies in the game-theoretic sense [9].

Our proposal is to cast the problem of optimal behaviour in the game as a learning task, where the player has to learn a strategy that performs well against the given (possibly sub-optimal) opponents. Learning in games has been studied extensively in game theory and machine learning; examples include fictitious play and opponent modelling [8, 2]. We apply a more recent algorithm from reinforcement learning called GAPS [6]. In GAPS, the learner plays a parameterised strategy represented by a finite state automaton, where the parameters are the probabilities of actions and state transitions. GAPS implements stochastic gradient ascent in the space of policy parameters: after each learning trial, the parameters of the policy are updated by following the payoff gradient. GAPS has a number of advantages important for our domain. It works in partially observable games. It also scales well to multiple topics by modelling decision-making as a game with factored actions (where action components correspond to topics). The action space in such games is the product of the factor spaces for each action component. GAPS, however, allows us to reduce the learning complexity: rather than learning in the product action space, separate GAPS learners can be used for each action component. It has been shown that such distributed learning is equivalent to learning in the product action space [6]. We call a search engine controller that uses the proposed approach COUGAR, which stands for COmpetitor Using GAPS Against Rivals.

COUGAR: adaptive engine controller. The task of the search engine controller is to manage the number of documents indexed on each topic so as to maximise the engine’s performance. When making decisions, the controller receives information about the current characteristics of its own search engine and of the external environment in the form of observations. The observations are partial, e.g. they are not sufficient to determine the exact state of the opponents. The COUGAR controllers are modelled by a set of non-deterministic Moore automata (Mt), one for each topic, functioning synchronously. The outputs of each automaton are actions changing the number of documents Dit indexed for the corresponding topic t. The available actions are: Grow (increase the number of documents indexed on topic t by one); Same (do not change the number of documents on topic t); and Shrink (decrease the number of documents on topic t by one). The resulting action of the controller is the product of the actions (one per topic) produced by the individual automata. The automata inputs are observations of the state of the controller’s own search engine and of the opponents’ state. The observations of its own state reflect the number of documents in the search engine’s index for each topic. We do not assume that search engines know the contents of each other’s indices. Instead, the observations of the opponents’ state reflect the relative size of the opponents’ indices for each topic t: Winning – there are opponents indexing more documents for topic t than our search engine; Tying – the opponents index either the same number of documents for topic t as our search engine, or fewer; Losing – our search engine indexes more documents for topic t than all opponents. Each Moore automaton Mt receives observations only for the corresponding topic t. Therefore, for T topics the controller’s inputs consist of 2T observations. One may ask how the controller can obtain information about the relative index sizes of its opponents. This can be done by sending a query on the topic of interest to the metasearcher (as a user) and requesting a list of the best search engines for the query (i.e. using the normal functionality of a metasearcher).

Training to compete. Training of the COUGAR controller to compete against various opponents is performed in a series of simulation trials. Each simulation trial consists of 100 days, where each day corresponds to one stage of the multi-stage game. The search engines start with empty indices and then, driven by their controllers, adjust their index contents. At the beginning of each day, the engine controllers receive observations and simultaneously produce control actions (change their document indices). Users issue a stream of search queries for the day, and the metasearcher distributes the queries between the engines according to their index parameters on that day. At the end of the day, the search engines collect rewards according to the performance formula from Section 2. On the next day, the search engines continue from their current states, and users issue the next batch of queries. The resulting performance (reward) for the whole trial is calculated as the sum of discounted rewards from each day. The experience from simulation trials is used by COUGAR to improve its performance against the opponents. Initially, all automata in the COUGAR controller are initialised so that all actions and transitions have equal probabilities. After each trial, the COUGAR controller updates its strategy using the GAPS algorithm; that is, the action and state transition probabilities of the controller’s automata are modified using the payoff gradient (due to lack of space, see [6] for details of the update mechanism). Repeating trials many times allows COUGAR to gradually improve its performance, i.e. to derive a controller strategy that performs well.
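To make the policy representation concrete, the following sketch shows a per-topic stochastic automaton with learnable (softmax-parameterised) action and transition probabilities. It is an illustration only: the state count, observation encoding and names are our own, and the GAPS gradient update itself is omitted (see [6]).

```python
import math
import random

ACTIONS = ["Grow", "Same", "Shrink"]           # per-topic index adjustments
OPPONENT_OBS = ["Winning", "Tying", "Losing"]  # opponents' standing on this topic


class TopicAutomaton:
    """Stochastic finite-state controller for one topic (GAPS-style policy).

    Real-valued parameters define softmax distributions over the action
    emitted in each state and over state transitions for each
    (state, observation) pair.  GAPS would adjust these parameters by
    stochastic gradient ascent on the discounted trial reward; that update
    is not shown here.  All sizes below are illustrative.
    """

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0
        # one weight per (state, action)
        self.theta_act = [[0.0] * len(ACTIONS) for _ in range(n_states)]
        # one weight vector per (state, own-obs, opponent-obs), created lazily
        self.theta_trans = {}

    @staticmethod
    def _sample_softmax(weights):
        exps = [math.exp(w) for w in weights]
        r, acc = random.random() * sum(exps), 0.0
        for i, e in enumerate(exps):
            acc += e
            if r <= acc:
                return i
        return len(exps) - 1

    def act(self):
        """Emit a Grow/Same/Shrink action for the current state."""
        return ACTIONS[self._sample_softmax(self.theta_act[self.state])]

    def observe(self, own_docs, opponent_obs):
        """Transition to the next state given this topic's observations."""
        key = (self.state, min(own_docs, 10), opponent_obs)  # cap the own-size observation
        weights = self.theta_trans.setdefault(key, [0.0] * self.n_states)
        self.state = self._sample_softmax(weights)


# A COUGAR-like controller is one automaton per topic, stepped synchronously;
# the controller's joint action is the tuple of per-topic actions.
controllers = {topic: TopicAutomaton() for topic in ["cars", "java"]}
print({topic: a.act() for topic, a in controllers.items()})
```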

4. Experimental Results

We evaluated our approach in a number of simulation experiments, which demonstrated COUGAR’s ability to compete successfully with different opponents. For the sake of brevity, we present here results from two such experiments.

Figure 1. “Wimp” vs COUGAR, sample trial (documents indexed by each engine for topics 1 and 2 over the 100 days of the trial).

In the first experiment, two search engines compete in a scenario with two topics: one engine uses a fixed strategy, called “Wimp”, and the other uses the COUGAR controller. In the second experiment, 10 COUGARs compete with each other in an environment with 10 different topics.

The experimental setup consisted of three main components: the generator of search queries, the metasearcher, and the search engines. The search engine component included a document index, an income counter, and an engine controller. The state of the document index is represented by a vector (Dit) of the numbers of documents indexed by the search engine for each topic; this is the information used by the metasearcher to select search engines. To simulate user search queries, we used HTTP logs obtained from a Web proxy of a large ISP. We developed extraction rules individually for 47 well-known search engines. The total number of queries extracted was 657,861, collected over a period of 190 days. We associated topics with search terms in the logs. To simulate queries for T topics, we extracted the T most popular terms from the logs; the number of queries on topic t for a given day was equal to the number of queries containing term t in the logs belonging to that day.

COUGAR against “Wimp”. Consider the “Wimp” strategy first for the case of a single topic. “Wimp” divides the set of all possible document index sizes into three non-overlapping sequential regions: “Confident”, “Unsure”, and “Panic”. Its behaviour in each region is as follows: Confident – increase the document index size until it exceeds that of the opponent; once this goal is achieved, “Wimp” stops growing and keeps the index unchanged; Unsure – keep the index unchanged if it is bigger than or the same size as the opponent’s index, otherwise retire (i.e. reduce the index size to 0); Panic – retire straight away. The overall idea is that “Wimp” tries to outperform its opponent while in the “Confident” region by growing the index. When the index grows into the “Unsure” region, “Wimp” prefers retirement to competition, unless it is already winning over or tying with the opponent. This reflects the fact that the potential losses in the “Unsure” region (if the opponent wins) become substantial, so “Wimp” does not dare to take the risk. In the case of multiple topics, “Wimp” did not differentiate between topics in either queries or documents: when assessing its own and the opponent’s index sizes, it simply added the documents for the different topics together, and it changed its index size synchronously for all topics.

Common sense tells us that one should behave aggressively against “Wimp” in the beginning, to knock it out of the competition, and then enjoy the benefits of monopoly. This is exactly what the COUGAR controller learned to do: it learned to increase the index until “Wimp” retires, and then to reduce the index size to cut down the costs in the absence of competition (recall from Section 2 that a smaller index incurs lower resource costs).
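For concreteness, a one-topic sketch of the “Wimp” decision rule is given below. The region boundaries are arbitrary values chosen purely for illustration; they are not the thresholds used in our experiments, and the multi-topic summation described above is omitted.

```python
def wimp_action(own_docs, max_opponent_docs, unsure_at=5, panic_at=10):
    """One-topic view of the fixed "Wimp" strategy.

    own_docs:           total documents Wimp currently indexes
    max_opponent_docs:  largest opponent index size
    unsure_at, panic_at: illustrative region boundaries (not from the paper)
    Returns "Grow", "Same" or "Retire" (i.e. shrink the index towards 0).
    """
    if own_docs >= panic_at:                      # Panic region: retire straight away
        return "Retire"
    if own_docs >= unsure_at:                     # Unsure region
        return "Same" if own_docs >= max_opponent_docs else "Retire"
    # Confident region: grow until strictly ahead of the opponent
    return "Grow" if own_docs <= max_opponent_docs else "Same"


for day in range(3):
    print(wimp_action(own_docs=day, max_opponent_docs=2))
```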

Figure 2. “Wimp” vs COUGAR: learning curve (trial reward over learning trials for COUGAR and Wimp).

Figure 1 visualises a testing trial between “Wimp” and COUGAR by showing the number of documents indexed by the engines on each day of the trial (the top half of the Y axis shows the number of documents for topic 1, the bottom half for topic 2). Figure 2 shows how COUGAR’s performance improved during learning.

COUGAR in self play: scaling up. One of the key requirements for a successful player in the Web search game is the ability to cope with potentially large numbers of topics and opponents. An advantage of GAPS is its ability to handle multiple topics well by modelling decision-making as a game with factored actions. To analyse the scalability of our approach with the number of topics and opponents, we simulated 10 COUGAR learners competing in a scenario with 10 different topics. From the theoretical point of view, gradient-based learning is not guaranteed to converge in self play; in practice, however, we observed that the learners converged to relatively stable strategies. Figure 3 presents the obtained learning curves. Analysis showed that the COUGARs “agreed” to split the query market (see Table 1).

Figure 3. 10 COUGARs competing for 10 topics (trial reward over learning trials).

Table 1. Market split between 10 COUGARs

Topic  Term        Engine(s)
1      ireland     4, 7
2      dublin      1, 9
3      irish       8
4      free        5
5      java        5
6      map         2
7      university  3
8      mp3         2
9      download    6
10     jobs        10

An important detail is that more popular topics attracted more search engines. This confirms that the market split was really profit-driven and demonstrates COUGAR’s “economic thinking”. However, though not indicated explicitly in Figure 3, the most profitable engine actually specialised in the relatively unpopular topic 3, while the four engines that competed for the two most popular topics ended up earning substantially less. This result illustrates that naive strategies such as blindly indexing popular documents can be suboptimal in large-scale heterogeneous search environments.

5. Conclusions and Future Work

Performance management for specialised search engines in heterogeneous Web search systems is a difficult task. The utility of any local content change depends on the state and actions of the other search engines in the system. Thus, naive strategies (e.g., blindly indexing lots of popular documents) are ineffective, because a rational search engine’s indexing decisions should depend on the (unknown) decisions of its opponents.

In this paper, we provided a theoretical analysis of the problem and proposed a method, utilising reinforcement learning techniques, for automatically managing search engine content. Our adaptive search engine, COUGAR, has shown the potential to derive non-trivial and economically sensible behaviour strategies. Given the complexity of the problem domain, attempting to make our analysis fully realistic from the very beginning is neither feasible nor reasonable. Though our preliminary results are not immediately applicable in practice, they demonstrate the feasibility of the proposed reinforcement learning approach.

Clearly, we have made many strong assumptions in our models. One future direction will be to relax these assumptions to make our simulations more realistic (e.g. we intend to perform experiments with real documents and with existing metasearch algorithms). While we are motivated by optimal behaviour for search services over document collections, our approach is applicable in more general scenarios involving services that must weigh the cost of their inventory of objects against the expected inventories of their competitors and the anticipated needs of their customers. For example, it would be interesting to apply our ideas to large retail e-commerce sites, which must decide what products to stock.

Acknowledgements

This research was supported by grant SFI/01/F.1/C015 from Science Foundation Ireland, and grant N00014-03-10274 from the US Office of Naval Research.

References

[1] J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proc. of the 18th Annual Intl. ACM SIGIR Conf., pages 21–28. ACM Press, July 1995.
[2] D. Carmel and S. Markovitch. Learning models of intelligent agents. In Proc. of the 13th National Conf. on AI, pages 62–67, Menlo Park, CA, 1996.
[3] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In Proc. of the 8th WWW Conf., 1999.
[4] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, Carnegie Mellon University, 2002.
[5] M. Osborne and A. Rubinstein. A Course in Game Theory. The MIT Press, 1999.
[6] L. Peshkin. Reinforcement Learning by Policy Search. PhD thesis, MIT, 2002.
[7] K. Risvik and R. Michelsen. Search engines and web dynamics. Computer Networks, 39:289–302, June 2002.
[8] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296–301, 1951.
[9] A. Rubinstein. Modelling Bounded Rationality. The MIT Press, 1997.
[10] C. Sherman and G. Price. The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Independent Publishers Group, 2001.