Distributed Online Filtering

Costin Raiciu, David S. Rosenblum, Mark Handley

University College London
{c.raiciu|d.rosenblum|m.handley}@cs.ucl.ac.uk
Project URL: http://www.cs.ucl.ac.uk/staff/craiciu/onlinefiltering/

Web search is a reactive process: search engines index the information published on many millions of servers, and users send queries to find sites that contain particular information. While this approach is adequate for static content, it fails to capture the temporal nature of online information. Sports scores, networking research preprints, stormy blog discussions, and indeed news are all examples of information whose value is strictly tied to its timely delivery to users. Moreover, it is somewhat awkward that the many millions of web servers publish information and essentially hope that the (tens of) thousands of machines owned by search companies will make their content available to the world. Perhaps the web servers, which have spare capacity most of the time (except in the rare case of flash crowds), can eliminate the middleman and deliver information to interested users directly?

This work focuses on online filtering of documents by the web servers themselves, organized in an overlay. Users express their long-term interests as queries and register them with the web servers. Content published by the web servers is matched against users' interests, and a decision is made whether a given document should be delivered to a user. This approach is proactive, as decisions are taken online based on content. It is not meant to replace search, but to complement it by enabling timely delivery of information to the users who have expressed interest in it.

Requirements and Metrics. Our main metric is throughput, measured as the number of documents processed per second by the whole network. We require the system to be scalable in this metric: adding more nodes (i.e., web servers) while keeping the number of user queries fixed should improve the maximum throughput.

Constraints. Measurements presented elsewhere [1] show that, for a given node, the product of the number of queries stored and the number of documents processed per second is almost constant: this product is the node's computing power. Further, each node has a limited bandwidth budget it can use. Both constraints can be measured directly; alternatively, a budget can be specified to avoid degrading the performance of the web server.
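To make the computing-power constraint concrete, here is a minimal sketch of the capacity model, assuming the 10^7 (docs x queries)/sec budget used in the analysis below; the function name and numbers are ours, purely for illustration:

```python
# Capacity model sketch: per node, (queries stored) x (docs matched per
# second) is roughly constant -- this product is the node's computing power.
# The 10^7 budget matches the figure used in the analysis; illustrative only.

CAPACITY = 10_000_000  # (docs * queries) / sec, assumed budget

def max_doc_rate(stored_queries: int, capacity: float = CAPACITY) -> float:
    """Documents per second a node can match, given its stored queries."""
    return capacity / stored_queries

# A node storing 100,000 queries can match about 100 documents per second:
print(max_doc_rate(100_000))  # -> 100.0
```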

Figure 1. Basic Algorithms: (a) the Keyword Algorithm, in which a load balancer routes a Document {K1,K2,K3} and a Query {K1,K4} to the nodes responsible for keywords K1-K5; (b) the Matrix Algorithm, with its own load balancer.

Solution. There are two classes of solutions: content-sensitive and content-agnostic. Content-sensitive solutions use keywords to route documents and to store queries in the overlay. Designs range from assigning each single keyword to a node to assigning all possible combinations of keywords to nodes (CAN-style). In this space we create the Keyword algorithm, which assigns every keyword to a single node, DHT-style (Fig. 1.a). Queries are broken up into keywords and stored on each of the nodes in charge of those keywords. The most important keywords of a document (25-50 keywords) are extracted and used to route it. Keywords in documents and queries are considered in alphabetical order, which makes it easy to determine both when a document has been matched against all significant queries and whether a document should be sent to a particular user upon a match. We considered several other content-sensitive algorithms, but found them to be worse than this simple approach [1]. Content-agnostic solutions simply aim to rendez-vous each document with all the queries. The most general algorithm replicates subscriptions R times and routes each document to N/R nodes. Nodes are arranged in a matrix with R columns and N/R rows, replicating queries horizontally and routing documents vertically: this is the Matrix algorithm (Fig. 1.b). Sketches of both algorithms follow.

Comparison. We now compare Matrix and Keyword analytically. We consider a simple theoretical model that assumes nodes are homogeneous, and we compute the maximum loss-free throughput of the two algorithms. For the Matrix algorithm, we derive the value of R that minimizes bandwidth usage; this value depends on the relative frequencies and sizes of documents and queries, and on the number of nodes. We briefly review our assumptions for the analysis.
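As a first illustration, here is a minimal, single-process sketch of the Keyword algorithm, assuming a SHA-1 hash stands in for DHT routing; delivering only at the node of the alphabetically first shared keyword reflects the ordering trick described above (all identifiers are ours, not from the system):

```python
import hashlib

NODES = 64  # illustrative overlay size

def node_for(keyword: str) -> int:
    """DHT-style mapping from a keyword to the node in charge of it."""
    return int(hashlib.sha1(keyword.encode()).hexdigest(), 16) % NODES

# Per-node query store: node id -> list of (user, query keyword set).
stored = {}

def register_query(user: str, keywords: set) -> None:
    """Store the query on every node in charge of one of its keywords."""
    for k in keywords:
        stored.setdefault(node_for(k), []).append((user, frozenset(keywords)))

def publish(doc_keywords: set) -> set:
    """Route the document's significant keywords, in alphabetical order,
    to their nodes; deliver only at the node of the alphabetically first
    keyword shared with the query, so each user is notified exactly once."""
    matched = set()
    for k in sorted(doc_keywords):
        for user, query in stored.get(node_for(k), []):
            if query <= doc_keywords and min(query & doc_keywords) == k:
                matched.add(user)
    return matched

register_query("alice", {"chord", "dht"})
print(publish({"chord", "dht", "overlay"}))  # -> {'alice'}
```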

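For contrast, here is a sketch of the Matrix arrangement, together with one plausible closed form for the bandwidth-minimizing R (a square-root trade-off between replicating queries R times and routing documents to N/R nodes; treat the cost model in `optimal_r` as our assumption, not the exact derivation):

```python
import math
import random

def query_replicas(row: int, R: int) -> list:
    """A query is replicated horizontally: one copy per column of its row."""
    return [(row, col) for col in range(R)]

def document_targets(n_rows: int, R: int) -> list:
    """A document is routed vertically: every row of one random column,
    so it meets each query's row in exactly one node."""
    col = random.randrange(R)
    return [(row, col) for row in range(n_rows)]

def optimal_r(n: int, doc_rate: float, doc_size: float,
              query_rate: float, query_size: float) -> int:
    """Minimize bandwidth ~ query_rate*query_size*R + doc_rate*doc_size*n/R
    over R (our assumed cost model), which gives a square-root form."""
    r = math.sqrt(n * doc_rate * doc_size / (query_rate * query_size))
    return max(1, min(n, round(r)))

# 1024 nodes; documents 10x rarer but 100x larger than queries:
print(optimal_r(1024, doc_rate=1, doc_size=100, query_rate=10, query_size=1))  # -> 101
```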
Figure 2. Analysis and Improvements: (a) Comparison, plotting throughput (docs/s, up to 60,000) against the number of nodes (up to 10,000) for Keyword and Matrix with 0.5M, 1M and 2M users; (b) ROAR: Rendez-Vous On A Ring, showing queries being stored and documents being forwarded around the ring.

User interests are soft-state, power-law distributed, and change infrequently. To extract document features we use an RSS data set obtained by polling 80k RSS feeds between January and March 2006. Using these inputs, we derived an analytical model that captures the load-balancing behavior of Keyword (see [1] for more details). The computations assume a bandwidth budget of 100 kbps and a computation capacity of 10^7 (docs x queries)/sec. We vary the network size from 1 to 10,000 nodes; the trend is similar when the number of nodes is much greater. We vary the number of users from half a million to 2 million.

Discussion. The results (Fig. 2.a) show that Keyword does better when N < 5000, but its throughput saturates quickly beyond 2000 nodes due to the highly skewed distribution of documents and queries. The content-agnostic algorithm, on the other hand, scales well with the number of nodes. There is no clear winner: a combination of the two probably yields the best results. Next, we discuss ways of implementing these algorithms in practice.

ROAR: Rendez-Vous On A Ring. Matrix ties the network structure to the replication level, making it difficult to change that level, which optimal solutions require. We propose ROAR, a novel algorithm that separates structure from replication by using a circular ID space virtually partitioned into R intervals. ROAR runs on top of Chord; it stores a copy of each query in each interval, while documents are routed to all the nodes in a randomly chosen interval (Fig. 2.b). ROAR can change the replication level at minimal bandwidth cost while maintaining matching correctness. In practice, web servers are heterogeneous, so resource-poor nodes become bottlenecks. To address this, ROAR aims to keep the sum of node bandwidths constant across intervals, closely approximating the optimal solution for bandwidth-bound nodes.

Load balancing Keyword. We use ROAR's dynamic partitioning and replication to fix Keyword's load-balancing issue. Busy nodes in the overlay are replaced with sub-networks running the ROAR algorithm: this is the KeyRing algorithm. In KeyRing, the randomized join and leave operations are modified to direct new nodes toward existing hotspots of the network. This is achieved by having each node monitor the distribution of keywords in the documents it routes and instructing new nodes to join the network according to this distribution.
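To show the separation of structure from replication, here is a minimal sketch of the ring partitioning that ROAR (and each KeyRing sub-network) relies on, in a simplified model where a node's interval is determined directly by its ID rather than by Chord successor pointers, and where the protocol for re-replicating queries when R changes is elided (all names are illustrative):

```python
import random
from collections import defaultdict

class RoarRing:
    """Circular ID space in [0, 1), virtually partitioned into R intervals.

    A query is copied into every interval; a document is sent to all nodes
    of one randomly chosen interval, so the two always rendezvous. Changing
    R only moves the interval boundaries -- node IDs never change.
    """

    def __init__(self, node_ids, r):
        self.r = r
        self.intervals = defaultdict(list)  # interval index -> node IDs
        for nid in node_ids:
            self.intervals[int(nid * r)].append(nid)

    def query_replicas(self, query_id: float) -> set:
        """One copy per interval: the node closest to the query's offset,
        replayed at the same relative position in every interval."""
        offset = (query_id % 1.0) / self.r
        return {min(nodes, key=lambda n: abs(n % (1 / self.r) - offset))
                for nodes in self.intervals.values()}

    def document_targets(self) -> set:
        """All nodes of one randomly chosen interval."""
        return set(random.choice(list(self.intervals.values())))

ring = RoarRing([random.random() for _ in range(32)], r=4)
assert ring.query_replicas(0.42) & ring.document_targets()  # they meet
```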

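KeyRing's biased join could look like the following sketch, where the join probability is proportional to observed keyword load; this weighting rule is our assumption, as the abstract only says joins follow the monitored distribution:

```python
import random
from collections import Counter

class JoinDirector:
    """Sketch of KeyRing's biased join: each node counts the keywords of
    the documents it routes and steers new nodes toward the hottest ones."""

    def __init__(self):
        self.keyword_counts = Counter()

    def observe(self, doc_keywords):
        """Record the keywords of one routed document."""
        self.keyword_counts.update(doc_keywords)

    def join_target(self):
        """Pick a keyword with probability proportional to its observed
        load; the new node joins the sub-network for that keyword."""
        keywords = list(self.keyword_counts)
        weights = [self.keyword_counts[k] for k in keywords]
        return random.choices(keywords, weights=weights)[0]

d = JoinDirector()
d.observe({"chord", "dht"})
d.observe({"dht", "overlay"})
print(d.join_target())  # most likely 'dht'
```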
Ongoing and Future Work. We are running simulations to compare the performance of the KeyRing algorithm against Keyword and ROAR running in isolation. In the future we intend to prototype the algorithms and measure their performance in a controlled deployment.

Related Work. The works closest to ours are SIFT [3] and Cobra [2]. SIFT focuses on centralized filtering. Cobra is a distributed system, but its filtering uses only replication to scale; when the number of users is realistic (e.g., hundreds of millions), the memory of a single node is too small to store all queries, making partitioning a requirement.

References

[1] C. Raiciu. On Distributed Online Filtering. PhD transfer report, University College London, 2006.
[2] I. Rose, R. Murty, P. Pietzuch, J. Ledlie, M. Roussopoulos, and M. Welsh. Cobra: Content-Based Filtering and Aggregation of Blogs and RSS Feeds. In Proc. NSDI '07.
[3] T. W. Yan and H. Garcia-Molina. The SIFT Information Dissemination System. ACM Trans. Database Syst., 24(4), 1999.
