Designing and Developing Large-Scale Network Services*




Adolfo Rodriguez, Ken Yocum, Dejan Kostić, Charles Killian, Kashi Vishwanath, David Becker, Priya Mahadevan, Jeffrey Chase, and Amin Vahdat
Department of Computer Science, Duke University

ABSTRACT

The current state-of-the-art in building and evaluating large-scale network services suffers from a long and problematic development cycle. Implementation is difficult, tedious, and redundant. Experimentation does not capture realism, scale, and network-level insight simultaneously. Evaluation tends to focus too much on the implementation of algorithms and not enough on the algorithms' behavior in real network settings. We present a methodology called Tumi that leverages two cooperative components to qualitatively improve this process. First, MACEDON enables the concise specification of the high-level behavior of overlays, providing language and runtime support for a variety of common distributed computing tasks. Next, we seek to extract network-level performance data from ModelNet, a tool that allows for the realistic emulation of large-scale network services. Specialized evaluation tools complement these components and help guide algorithmic refinement. We have used an initial version of the infrastructure to develop performance-tuned versions of AMMO, Bullet, Chord, NICE, Overcast, Pastry, RanSub, Scribe, and SplitStream. While still in development, our current results indicate that our completed infrastructure will transform overlay research from a frustrating, ad-hoc process to an effective, streamlined methodology.

1. INTRODUCTION

Large-scale distributed services are becoming increasingly important. Researchers are steadily leveraging the rich functionality provided by these systems. Examples include application-layer multicast, content distribution networks, and distributed hash tables (DHTs). Typically, the emphasis is on the creation of highly scalable and adaptive overlay networks, logical networks built on top of the underlying IP substrate and aimed at meeting the high demands of today's network applications. Unfortunately, this emergence is inhibited by a number of challenges plaguing the design, development, deployment, and evaluation of overlay services. This paper outlines these issues and proposes mechanisms for addressing them based on our experience with overlays.

We view current overlay research as following a cycle consisting of four phases, as depicted in figure 1, each of which suffers from a number of challenges. First, an overlay researcher designs an algorithm to meet specific performance goals, optimizing for certain underlying substrate metrics such as latency and bandwidth and providing for application behavior such as O(lg n) routing hops in DHTs. From this description, one or more implementations are created that are used to evaluate the performance of the algorithm under certain network conditions. For example, many researchers create a hand-crafted simulator capable of evaluating performance under large numbers of nodes and an extensive live implementation for evaluation in real settings. Such implementations are tedious and difficult, both due to the sheer size of the software components needed to build scalable implementations and the complexity of such functionality.

* This research is supported in part by the National Science Foundation (EIA-99772879, ITR-0082912), Hewlett Packard, IBM, Intel, and Microsoft. Vahdat is also supported by an NSF CAREER award (CCR-9984328).

Figure 1: Current overlay research cycle. (The figure shows four phases: Algorithm; Implementation, with live code of 5000+ lines and a simulator of 1000+ lines; Experimentation, via live deployment and simulation; and Evaluation, with post-processing tools.)
Using an algorithm's implementation, researchers use experimentation to gather run-time performance data. Usually, this includes a combination of simulation (such as with the network simulator, ns [13]) and small-scale live Internet runs (e.g., PlanetLab [14]). Unfortunately, simulation is unable to completely capture the complex behavior of real applications and networks. Live experiments suffer from scale limitations and the inability to extract vital network-level performance data. As a result, experimentation cannot fully generate the information necessary to gain a complete understanding of overlays.

The evaluation phase of the cycle processes the information generated through experimentation. Researchers process performance data with hand-crafted tools and subsequently modify their implementation in light of code bugs or sub-optimal performance. Because researchers employ disparate implementation techniques, the evaluation of competing overlays reflects differences in implementation methodologies as opposed to algorithmic principles. Further, it is difficult to complete the cycle appropriately since the focus is on implementation rather than the algorithm itself.

The Tumi methodology overcomes challenges in each phase of the overlay development cycle (figure 2), using two collaborative components, MACEDON [15] and ModelNet [20]. First, we are designing a simple language and operating framework, MACEDON, for specifying the high-level behavior of network algorithms from which we generate fully functional code. While still under development, we have used MACEDON to develop performance-tuned versions of NICE [3], SplitStream [4], Overcast [9], RanSub [10], Bullet [11], AMMO [16], Pastry [17], Scribe [18], and Chord [19]. These MACEDON-encoded specifications typically consist of 200-600 lines, greatly decreasing development and debugging efforts. Further, because the generated C++ code shares the same baseline implementation for probing, joining, failure detection, etc., relatively fair comparisons can be carried out between competing systems, isolating differences in algorithms rather than implementation artifacts.

Figure 2: Streamlined, Tumi-enabled overlay research cycle. (The figure shows a multicast/DHT application and an overlay.mac specification of under 600 lines feeding the MACEDON code generator; the generated protocols, such as SplitStream, Bullet, RandTree, AMMO, NICE, Overcast, Scribe, and DHTs like Chord/Pastry, layer over the MACEDON API and a network substrate (TCP/IP, ns) for simulation, live deployment, and network-aware emulation, with MACEDON post-processing supporting evaluation.)

Second, ModelNet allows for the evaluation of large-scale network services in a controlled cluster environment. ModelNet emulates packets hop-by-hop through a user-specified network topology, capturing the effects of latency, congestion, network and end-host failures, and per-hop queuing disciplines. ModelNet enables a more complete evaluation of network services since key topological and routing characteristics can be extracted from the ModelNet environment and used to evaluate services along network metrics.

The Tumi methodology promotes a synthesis of static analysis and real-time system introspection, allowing for both live and post-mortem analysis. By specifying application- and network-level expectations of system behavior, Tumi monitors the many facets of overlay performance. This work is in the spirit of other tools that infer performance bottlenecks, bugs, or bad design choices from runtime observations [1, 2, 5]. In the end, Tumi provides the researcher with control, scalability, introspection, performance, realism, and reproducibility.

The remainder of this paper describes how Tumi influences each phase in the overlay cycle. Section 2 provides an overview of the MACEDON language, the way we specify the behavior of overlays. The process of generating an implementation from this language is described in section 3. Section 4 elaborates on experimentation using ModelNet. Our evaluation tools are discussed in section 5. We conclude in section 6.

2. ALGORITHM SPECIFICATION

The main goal of MACEDON is to enable designers of large-scale networked systems to focus on the design of their distributed algorithm rather than on the challenges of building and maintaining robust distributed system implementations. In support of this goal, MACEDON provides an API that describes how applications and overlays interact with one another and a specification language that describes the complex behavior of overlays.

2.1 MACEDON API

Application programmers use the MACEDON API to invoke functions in MACEDON protocol implementations. We argue that providing such a common API (as proposed in [6]) is key to enabling applications to seamlessly move from one protocol to another without modification. At a high level, overlays support either route or multicast primitives that transmit data from a source to one or more destinations through the overlay. Typically, overlays provide upcalls at each routing hop that enable intermediate nodes to perform application-specific functions.

Figure 3: Layering protocols in MACEDON (pattern-filled protocols are optional)

For example, an intermediate Scribe node receiving a join request for a particular group (through the underlying DHT) will add that group to its list of sessions and allow the request to propagate toward the destination, thus building a multicast distribution tree. Using a general-purpose API, MACEDON enables layering, where protocols use services of lower layers while providing services to higher layers (figure 3). Protocols range from our simple RandTree protocol, which creates a random overlay tree, to complex algorithms such as SplitStream and Bullet that make heavy use of underlying layers. Higher layers in MACEDON (including the application layer itself) register with lower layers to receive callbacks during certain operations. To support routing through overlays, there are forward and deliver callbacks that are used when a message is to be forwarded through the overlay or has arrived at its final destination. Other callbacks include notify, used by the lower layer to inform the higher layer that its neighbor sets have changed, and error for error processing.
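To make the callback structure concrete, the following C++ sketch shows one plausible rendering of these upcalls as an abstract handler class. The class name, address type, and method signatures are our own illustrative assumptions rather than the actual MACEDON API, which is defined in [15] and follows the common API proposed in [6].

    // Hypothetical sketch of the upcall interface described above; names and
    // signatures are illustrative assumptions, not the actual MACEDON API.
    #include <cstdint>
    #include <string>
    #include <vector>

    typedef uint32_t macedon_addr_t;   // overlay address (e.g., a hashed identifier)

    class MacedonHandler {
    public:
        virtual ~MacedonHandler() {}

        // Invoked at each intermediate hop; returning false suppresses forwarding.
        virtual bool forward(macedon_addr_t src, macedon_addr_t dest,
                             std::string& payload) { return true; }

        // Invoked when a message reaches its final overlay destination.
        virtual void deliver(macedon_addr_t src, macedon_addr_t dest,
                             const std::string& payload) {}

        // Invoked by the lower layer when this node's neighbor sets change.
        virtual void notify(const std::vector<macedon_addr_t>& neighbors) {}

        // Invoked on errors (e.g., an unreachable destination).
        virtual void error(macedon_addr_t dest, int error_code) {}
    };

In this rendering, a higher layer (or the application itself) would subclass the handler and register it with the layer below, mirroring the layering shown in figure 3.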

2.2 MACEDON Language

MACEDON provides a domain-specific language that can be used to accurately and concisely represent a variety of overlay algorithms. Individual participants in a distributed system are each in one of a set of user-defined node states. Participants transition to other states based on a variety of external events, including the reception of a message from a remote participant, the expiration of a timer, or an API request from a higher layer. Participants also maintain auxiliary state, a set of user-defined variables that change based on event occurrences.

We now briefly describe some of the important features of the MACEDON language. Figure 4 provides an outline of a MACEDON specification. The specification begins with a protocol statement that identifies the protocol name and which protocol, if any, is used by this protocol. The type of addressing (in this case, hashed addresses) is specified next. The description then specifies the set of valid node states. Overlays also specify neighbor types that represent peer relationships. For example, an overlay tree would have parent and children neighbor types. These neighbor types can be designated as failure detected, allowing application-specific code to execute upon the detection of node failure, as evidenced by a node's inability to respond to messages. MACEDON messages include a list of fields specified within. Message declarations allow MACEDON to marshal fields automatically on behalf of the overlay.

    protocol XXX uses YYY
    addressing hash
    states { }
    neighbor_types { }
    messages { }
    auxiliary_state { }
    transitions {
        STATE recv MESSAGE_NAME { }
        STATE timer TIMER_NAME { }
        API API_NAME { }
    }

Figure 4: Structure of overlay specifications in MACEDON.

The most important portion of a MACEDON algorithm specification is the set of transitions that describe the actions to be performed in moving from one state to another. Actions include setting node state or auxiliary state, sending a message, and scheduling a timer. For example, an Overcast node would use auxiliary state to track which of its siblings has reported the best connectivity thus far. Transitions identify the state in which the transition occurs and the stimulus of each transition (with the recv keyword for received messages, the timer keyword for timer expirations, or the API keyword for API downcalls).
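As a rough illustration of what generated event-handling code for such a specification might look like, the following C++ sketch dispatches a received message based on the current node state. The state names, message type, and handler structure are our own assumptions for illustration, not MACEDON's actual generated code.

    // Illustrative sketch only: a hand-written analogue of the state-machine
    // dispatch that a MACEDON-generated protocol might perform.
    #include <cstdint>

    enum NodeState { INIT, JOINING, JOINED };   // hypothetical user-defined states

    struct JoinReply { uint32_t parent_addr; }; // hypothetical message with one field

    class ExampleProtocol {
    public:
        void recv_join_reply(const JoinReply& msg) {
            switch (state_) {
            case JOINING:
                // Transition action: update auxiliary state and node state.
                parent_ = msg.parent_addr;
                state_  = JOINED;
                break;
            default:
                // Messages received in unexpected states are ignored here; a real
                // specification would define transitions for them as needed.
                break;
            }
        }

    private:
        NodeState state_ = INIT;
        uint32_t  parent_ = 0;
    };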

3. IMPLEMENTATION PHASE

From an algorithm's specification, MACEDON generates C++ code capable of functioning in the ns [13] simulation environment and natively over TCP/IP. The generated code can run live across the Internet, on top of a deployment testbed such as PlanetLab [14], and over an emulated wide-area network such as ModelNet [20], described in section 4.

3.1 The Engine

This generated code is linked with a high-performance engine that implements the common functionality required of overlay implementations, such as threading, marshalling messages, and interfacing with the sockets API. We provide a scheduler that is responsible for managing a pool of engine threads that process system events, timer expirations, and received messages. We employ a serialization model for engine and application-level (API-issuing) threads that differentiates between control and data paths. By default, control paths are exclusively locked with a write lock, while data paths are locked with a read lock, though this behavior can be modified by the algorithm designer.

The MACEDON engine uses the MACEDON API to access the transport uniformity layer. Transport implementations code to this API regardless of their communication semantics (connection-oriented vs. connectionless). Currently, MACEDON provides interfaces for TCP, UDP, TFRC [7], and a simple UDP-based protocol aimed at overcoming some of TCP's implementation restrictions (e.g., the number of socket descriptors). We use non-blocking socket calls in all implementations for sending and receiving data, alleviating the need for large numbers of threads.
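A minimal sketch of the described locking discipline, assuming a standard readers-writer lock, appears below: control-path handlers acquire the lock exclusively, while data-path handlers share it. This is our illustration of the default policy described above, not the engine's actual implementation.

    // Control paths take an exclusive (write) lock; data paths take a shared
    // (read) lock, so multiple data-path threads may proceed concurrently.
    #include <mutex>
    #include <shared_mutex>

    class EngineLock {
    public:
        // Control path (e.g., membership or neighbor-set changes).
        template <typename F>
        void run_control(F&& handler) {
            std::unique_lock<std::shared_mutex> guard(mutex_);
            handler();
        }

        // Data path (e.g., forwarding payload messages).
        template <typename F>
        void run_data(F&& handler) {
            std::shared_lock<std::shared_mutex> guard(mutex_);
            handler();
        }

    private:
        std::shared_mutex mutex_;
    };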

3.2 Reusable Libraries

MACEDON provides libraries for functionality common to a variety of distributed systems. For example, our target applications often probe other participants for bandwidth, latency, or packet loss rate. While efficient and accurate network probing is an active area of research [8], we implement a variety of probing techniques in the MACEDON library, allowing application developers to choose from the available set or to implement their own as necessary. Ideally, any refinements to the probing strategy made by an application developer will become available to all applications. Other common functionality includes node locking to effect an overlay transformation, Bloom filters, hash address location caching, and sequenced data bitmapping/caching.
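The pluggable-probe idea could be pictured as an abstract interface like the one below. The names, result fields, and callback style are our own assumptions, chosen only to illustrate how applications might select or supply a probing technique; they are not the library's actual API.

    // Hypothetical sketch of a pluggable probing interface.
    #include <functional>
    #include <string>

    struct ProbeResult {
        double latency_ms;        // estimated round-trip latency
        double bandwidth_kbps;    // estimated available bandwidth
        double loss_rate;         // estimated packet loss rate in [0, 1]
    };

    class Prober {
    public:
        virtual ~Prober() {}
        // Asynchronously probe a peer and invoke the callback with the result.
        virtual void probe(const std::string& peer_addr,
                           std::function<void(const ProbeResult&)> done) = 0;
    };

    // Applications pick an implementation (or write their own) and register it,
    // so refinements to one probing strategy can be reused by all protocols.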

4. EXPERIMENTATION PHASE

In this section, we provide a brief description of ModelNet and the scalability, accuracy, and functional challenges we see in supporting our integrated development and evaluation environment.

The ModelNet architecture is composed of edge and core nodes. Edge nodes may execute arbitrary operating systems, or multiple virtual machines, running on arbitrary hardware architectures. They run native IP stacks and function as they would in real environments, with the exception that they are configured to route all IP traffic through cores. To decrease the number of edge machines required for large-scale evaluations, ModelNet allows for virtual edge nodes (VNs) that enable the multiplexing of multiple application instances on a single client machine.

Cores are responsible for the emulation of topology-specific network characteristics. Upon receiving a packet from a client, each core determines the set of pipes (links in the topology) the packet must traverse in being delivered from source to destination by consulting an n² matrix mapping each source-destination pair (contained in the IP packet header) to a series of pipes. Each pipe is associated with values for packet queue size, bandwidth, latency, and loss rate. The emulation executes in real time, with cores delaying or dropping input packets according to the characteristics of each pipe. Hence, packets traverse the emulated network with the same rates, delays, and losses as they would in a real network. When a packet exits the chain of pipes, the core transmits the packet to the edge node hosting the destination VN. In this way, ModelNet captures the end-to-end characteristics of packet delivery and the effects of per-hop congestion and queuing disciplines.
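To make the per-pipe emulation concrete, the following simplified C++ sketch models a pipe and the computation of a packet's exit time from it. This is our own illustration of hop-by-hop emulation under simplified assumptions (a single pipe, backlog approximated from the link's busy time), not ModelNet's actual code.

    // Simplified illustration of one emulated pipe: a packet is either dropped
    // (queue overflow or random loss) or assigned an exit time that reflects
    // queuing, serialization, and propagation delay.
    #include <cstdlib>

    struct Pipe {
        int    queue_limit_bytes;   // maximum bytes allowed to wait on this link
        double bandwidth_bps;       // link bandwidth in bits per second
        double latency_s;           // propagation delay in seconds
        double loss_rate;           // random drop probability in [0, 1]
        double free_at;             // time at which the link is next free
    };

    // Returns the packet's exit time from this pipe, or a negative value on drop.
    double traverse_pipe(Pipe& p, int pkt_bytes, double arrival_time) {
        // Approximate queue occupancy by the backlog still waiting to be sent.
        double backlog_bytes = p.free_at > arrival_time
            ? (p.free_at - arrival_time) * p.bandwidth_bps / 8.0 : 0.0;
        if (backlog_bytes + pkt_bytes > p.queue_limit_bytes) return -1.0;  // tail drop
        if ((double)std::rand() / RAND_MAX < p.loss_rate)    return -1.0;  // random loss

        double start = arrival_time > p.free_at ? arrival_time : p.free_at;
        double tx    = pkt_bytes * 8.0 / p.bandwidth_bps;   // serialization delay
        p.free_at    = start + tx;                          // link busy until then
        return start + tx + p.latency_s;                    // exit after propagation
    }

A core would apply this computation for each pipe on the packet's source-to-destination path, carrying the exit time forward as the arrival time at the next hop.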

4.1 Scalability Considerations

ModelNet exhibits per-hop and per-packet overheads. On our hardware, a single core node can process 140,000 packets/sec, each traversing 1 hop, with this number dropping to 100,000 packets/sec for an 8-hop diameter network [20]. For full-sized packets, this corresponds to an aggregate bisection bandwidth of 1 Gb/s, saturating the CPU of a 1.4 GHz Pentium III.

We scale ModelNet's emulation capacity, the number of packets emulated per unit time, by using multiple core nodes and assigning disjoint regions of the target topology across them. This topology partitioning capitalizes on the fact that network emulation is a simple distributed computation of packet delays, and can leverage traditional parallel computing scaling techniques. Core nodes hand off emulation responsibility when a packet's next hop is assigned to a remote core node. This cross-core communication is the principal source of overhead in a multi-core system. Intelligent partitioning algorithms attempt to reduce the total amount of cross-core traffic [21]. However, optimal topology partitioning is NP-hard and assumes perfect knowledge of the application's communication patterns. Partitioning algorithms depend on accurate estimations of network flows to provide good partitions. But communication patterns are often dynamic, as applications may adjust to network conditions such as flash crowds or network partitions. Hence, we are also investigating dynamic partitioning techniques that repartition the target topology during an emulation based on observed communication patterns.
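As a rough sanity check on these figures (our own back-of-the-envelope arithmetic, assuming 1500-byte full-sized packets): 100,000 packets/sec x 1500 bytes x 8 bits/byte is about 1.2 Gb/s, consistent with the roughly 1 Gb/s aggregate bisection bandwidth quoted above.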

Interestingly, there are cases in which the benefits of partitioning are limited. For instance, consider a simple dumbbell topology where the middle link exceeds the emulation capacity of a single core. In such scenarios, we are exploring the benefits of pipe replication, where the responsibility for emulating the same pipe is distributed among multiple cores. While pipe replication increases scalability, it could adversely impact accuracy because packets emulated at one pipe replica do not interact with packets at another replica. We are investigating lightweight techniques to periodically synchronize state among replicas to maintain TCP-friendly accuracy while limiting cross-core communication overhead.


4.2 Emulation Realism

A goal of this work is to subject network services to accurate representations of Internet network dynamics. Currently, we use simple uniform assumptions about competing traffic in ModelNet. We do so in a scalable fashion by adjusting link capacities and drop rates, albeit in an ad-hoc fashion. However, this lacks fidelity with respect to real competing traffic, which may exhibit bursty, long-range dependent behavior. An alternate approach is to use background traffic generators as additional end applications creating real background noise. Unfortunately, with network topologies of sufficient size, the background traffic would consume the majority of the emulation capacity. Hence, we are researching ways to accurately capture the effects of various communication patterns without requiring the actual execution of code to inject these patterns. Our approach is to emulate a given application or set of applications using models of user behavior. We then hope to measure and characterize router queue occupancy as a function of time. We can then use the resulting distributions to mark certain queues as occupied within the core node based on a distribution of per-router "background" packet arrival rates.

Another concern is that ModelNet currently assumes shortest-path (as measured by hop count) routing between all pairs of hosts, calculated offline. This approach leads to two limitations. First, the O(n²) routing table limits practical system deployments to approximately 10,000 VNs. Second, we do not capture the details of routing protocols such as BGP and OSPF. One recent study indicates that routes may take minutes to converge after some failures [12]. Similarly, we are not able to capture suboptimal routes that may result from policy or from routing protocols that abstract all routers within a large AS to a single hop. We have already extended ModelNet to support emulation of ad hoc routing protocols and are adding support for a simple adaptation of BGP.

4.3 Gathering Performance Data

ModelNet provides a powerful environment for network monitoring. As it emulates the network, it may gather detailed statistics about hops, packets, and flows. General statistics include drop rates, flow counts, link utilization, and route usage. Tumi provides a unique opportunity to both reduce the cost of real-time data acquisition and log application-specific state from the network. Observations are triggered by the violation of MACEDON expectations. These are transformed into specific passive or active logging events that are dynamically inserted into the running emulation. Passive events log for a specified duration (an amount of time or number of bytes), while active events begin logging when unexpected network conditions arise. Such precision logging reduces the volume of uninteresting events. Further, because MACEDON standardizes packet formats, we can log application-specific metrics, e.g., the ratio of control to data traffic.

Figure 5: Components of the Tumi evaluation. (The overlay algorithm runs in ModelNet, which subjects packets to a network topology; the resulting application state and event logs and network performance data feed post-processing.)

5. EVALUATION PHASE

The goal of Tumi's evaluation phase is to gain an understanding of an overlay's behavior. The major components are illustrated in figure 5. First, we have complete access to topological and routing information from the target topology used by ModelNet. Second, MACEDON provides space-efficient snapshots of overlay state and events (timers expiring, message transmissions, etc.), enabling examination of an overlay's global context throughout the experiment. Third, ModelNet generates network behavior logs containing per-flow statistics, packet loss events, and queuing effects. Currently, Tumi evaluation consists of post-processing tools that combine the available information and present it in a way that enables insight into algorithm performance.

5.1 Static Analysis

To form the basis of post-processing, we import network topology information from ModelNet. This includes per-link latency, bandwidth, and loss rates as well as node routing tables. This global topological knowledge enables us to compute a wide variety of static graph-theoretic structures. For example, if we are interested in constructing a single-source multicast overlay that provides optimal latency, we determine the shortest-path tree (SPT). In addition, we can construct IP multicast trees using the specified topology. By observing the constructed overlay, we can compare its latency to that of optimal overlay structures and IP multicast. Given that we know the list of physical links traversed by each overlay edge, we can compute overlay stress for each physical link, i.e., the number of times a data packet traverses that link.
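A minimal sketch of the stress computation, assuming each overlay edge has already been resolved to its underlying physical-link path, follows; the names and types are our own illustrative assumptions.

    // Stress of a physical link = number of overlay edges whose underlying
    // path crosses it, i.e., how many times one data packet traverses the link.
    #include <map>
    #include <vector>

    typedef int LinkId;                          // identifier of a physical link
    typedef std::vector<LinkId> PhysicalPath;    // links traversed by one overlay edge

    std::map<LinkId, int> compute_stress(const std::vector<PhysicalPath>& overlay_edges) {
        std::map<LinkId, int> stress;
        for (const PhysicalPath& path : overlay_edges)
            for (LinkId link : path)
                ++stress[link];
        return stress;
    }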

5.2 Application Events

Although helpful, static post-processing does not suffice in many cases. Instead, the algorithm designer can turn her attention to the dynamic overlay algorithm behavior captured in MACEDON logs. We have applied this technique in our AMMO [16] work. AMMO uses RanSub to construct loop-free trees via a mechanism called TreeMaint. While developing TreeMaint, we encountered numerous synchronization problems attributed to race conditions. Occasionally, a protocol error would cause a loop to form in the tree. To debug the protocol, we employed a systematic approach using the overlay log to discern the long causality chain of messages that caused the error condition to occur.

To streamline this process, we propose storing such overlay-level data in structured sets of XML files, or a database, and providing access via a programmatic interface. In this manner, the algorithm designer could leverage existing code for manipulating overlay state and reexamine runtime decisions by looking at system-wide state. Further, this would automate the process of identifying the explicit chains of events leading to error scenarios.

5.3 Performance Data

To understand why a seemingly correct algorithm displays suboptimal performance, we propose importing network behavior logs from ModelNet. This would allow us to answer why certain overlay edges do not provide the expected performance. Post-processing using complete state and event logs along with ModelNet performance data is powerful enough to enable the designer to identify even the most subtle performance problems. However, this flexibility comes at the cost of considerable disk space to store the logs and computation to generate performance data.

We propose expectation-based evaluation to significantly reduce the time and disk space required to debug and fine-tune an overlay algorithm. Expectations are similar to assertions in that they signify a violation of some invariant. Unlike assertions, a protocol violating an expectation continues to function normally. Application-level expectations allow the overlay designer to insert statements into the MACEDON specification expressing high-level properties of per-node state. A violation of an application-level expectation causes Tumi to permanently log the state and events preceding the violation. Once the experiment completes, only information regarding expectation violations is maintained for post-processing. Although we expect this type of expectation to identify significant problems in overlays, it is possible that this technique does not capture all the intricacies of large-scale distributed systems.

By allowing explicit specification of network behavior, network-level expectations enable the overlay designer to express expected network performance behavior. For example, an expectation could refer to the number of data objects expected to be transferred per time unit or the expected latency along a given overlay edge. A violation of a network-level expectation instructs ModelNet to save the relevant state and network statistics. Much like application expectations, only network expectation violations are maintained and processed. In this manner, this functionality leverages a tight coupling between the overlay-centric nature of MACEDON and the network-aware capabilities of ModelNet.
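To illustrate the distinction from assertions, the C++ sketch below shows one way such a check could behave: a violated expectation records the violation for later post-processing but lets the protocol continue running. The macro name and logging hook are our own illustrative assumptions, not Tumi's actual interface.

    // Illustrative sketch: an expectation logs a violation and keeps running,
    // unlike assert(), which would abort the process.
    #include <cstdio>

    // Hypothetical logging hook that would persist recent state and events.
    void tumi_log_violation(const char* expr, const char* file, int line) {
        std::fprintf(stderr, "expectation violated: %s (%s:%d)\n", expr, file, line);
    }

    #define EXPECT(cond)                                            \
        do {                                                        \
            if (!(cond))                                            \
                tumi_log_violation(#cond, __FILE__, __LINE__);      \
        } while (0)

    // Example use inside a (hypothetical) tree-protocol handler: a node should
    // never list itself as its own parent; if it does, log it but keep running.
    void check_tree_invariant(int self_addr, int parent_addr) {
        EXPECT(parent_addr != self_addr);
    }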

6. CONCLUSION

In this paper, we present a unified development and evaluation methodology, called Tumi, for creating large-scale network services. The infrastructure consists of two cooperating components. MACEDON allows for the concise specification of distributed systems and generates functional code, easing development and enabling consistent overlay evaluation. ModelNet emulates unmodified code according to the characteristics of user-specified network topologies. While this work is still in progress, we report on our successes and outline strategies to address the challenges in each phase of overlay research. Ultimately, we believe the Tumi methodology will qualitatively improve the overlay research cycle, providing researchers with a streamlined approach to achieving control, scalability, introspection, performance, realism, and reproducibility.

7. REFERENCES

[1] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[2] Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau. Information and Control in Gray-Box Systems. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, October 2001.
[3] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable Application Layer Multicast. In Proceedings of ACM SIGCOMM 2002, pages 165-175, 2002.
[4] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: High-Bandwidth Multicast in Cooperative Environments. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[5] Mike Chen, Emre Kiciman, Eugene Fratkin, Eric Brewer, and Armando Fox. Pinpoint: Problem Determination in Large, Dynamic Internet Services.
[6] Frank Dabek, Ben Zhao, Peter Druschel, John Kubiatowicz, and Ion Stoica. Towards a Common API for Structured Peer-to-Peer Overlays. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003.
[7] Sally Floyd, Mark Handley, Jitendra Padhye, and Jorg Widmer. Equation-Based Congestion Control for Unicast Applications. In Proceedings of ACM SIGCOMM 2000, pages 43-56, Stockholm, Sweden, August 2000.
[8] Manish Jain and Constantinos Dovrolis. End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput. In Proceedings of ACM SIGCOMM, August 2002.
[9] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole Jr. Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of Operating Systems Design and Implementation (OSDI), October 2000.
[10] Dejan Kostić, Adolfo Rodriguez, Jeannie Albrecht, Abhijeet Bhirud, and Amin Vahdat. Using Random Subsets to Build Scalable Network Services. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, March 2003.
[11] Dejan Kostić, Adolfo Rodriguez, Jeannie Albrecht, and Amin Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[12] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian. An Experimental Study of Delayed Internet Routing Convergence. In Proceedings of ACM SIGCOMM, August 2000.
[13] The network simulator - ns-2. http://www.isi.edu/nsnam/ns/.
[14] Larry Peterson, Tom Anderson, David Culler, and Timothy Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In Proceedings of ACM HotNets-I, October 2002.
[15] Adolfo Rodriguez, Sooraj Bhat, Charles Killian, Dejan Kostić, and Amin Vahdat. MACEDON: Methodology for Automatically Creating, Evaluating, and Designing Overlay Networks. Technical Report CS-2003-09, Duke University, September 2003.
[16] Adolfo Rodriguez, Dejan Kostić, and Amin Vahdat. Scalability in Adaptive Multi-Metric Overlays. In The 24th International Conference on Distributed Computing Systems (ICDCS), March 2004.
[17] Antony Rowstron and Peter Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329-350, Heidelberg, Germany, November 2001.
[18] Antony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. SCRIBE: The Design of a Large-Scale Event Notification Infrastructure. In Third International Workshop on Networked Group Communication, November 2001.
[19] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM 2001, August 2001.
[20] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostić, Jeff Chase, and David Becker. Scalability and Accuracy in a Large-Scale Network Emulator. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), December 2002.
[21] Ken Yocum, Ethan Eade, Julius Degesys, David Becker, Jeff Chase, and Amin Vahdat. Toward Scaling Network Emulation Using Topology Partitioning. In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2003.