A Generalized Framework for Network Performance Management using End-to-end Mechanisms

Prashant Pradhan, Manpreet Singh*, Debanjan Saha and Sambit Sahu
IBM T.J. Watson Research Center    *Cornell University
{ppradhan, dsaha, sambits}@us.ibm.com    [email protected]

Abstract

This paper argues for moving away from viewing network performance management as a problem of providing and using network QoS mechanisms. Instead, we look at it as a constrained resource planning problem, where applications should use end-to-end knobs (e.g., server selection, ISP selection, overlay routing) to intelligently map their traffic requirements onto the network capacity and paths available to them. We assert that this problem has a clean and generalized structure, which allows the creation of a service that cleanly abstracts the setting of these end-to-end control knobs away from the application. The key building blocks of such a service are a network monitoring component that must adequately capture constraints on routing and network capacity, and a planning component that formulates the setting of control knobs as optimization problems and uses the monitoring data to come up with a setting that satisfies the application's requirements. We present the architecture of this service and the design of its monitoring and planning components.

1 Introduction

Network performance and availability are key determinants of an application's end-to-end performance and user experience. However, the distributed and decentralized nature of network resources makes performance management a difficult problem. Network performance management has traditionally been thought of as a problem of providing QoS capabilities in the network, which can be used by applications by going through brokers that understand the underlying network structure and traffic demands, and can thus do the requisite planning and admission control [15]. However, the needed brokering and QoS infrastructure remains an unrealized vision.

We assert that network performance management is more of a constrained resource planning problem, where an application must intelligently map its traffic demand onto the capacity available on the network paths it may use. These paths are in turn constrained by IP routing between the application's endpoints. The knobs the application may use to pack its traffic over these paths are end-to-end techniques like server selection [11], ISP selection [13], overlay routing [12, 2] and TCP aggregation [1].

Further, on a longer time scale, application traffic demands should be used to scale and evolve the network in an informed manner. This is possible by making application traffic demand patterns available to network providers for long-term capacity planning. This inherent division of responsibility, where the application maps its traffic onto the available capacity over a short term and network providers make informed capacity scaling decisions over a long term, represents a viable approach to the network performance management problem.

There is a clear motivation for abstracting away the complexity of network monitoring and planning from the application into a service. Such a service can then also act as the bridge between applications and network providers by providing application traffic demand matrices to the network provider.

While the motivation for designing such a service has been recognized [5, 12], the point solutions proposed so far in the literature either address only a part of the problem, or lack key elements, which makes them unfit for applications and end-to-end knobs different from the ones they target. We assert in this paper that the network performance management problem has a clean and generalized structure, which allows a general solution applicable to a broad set of control knobs and applications. The contribution of this work is to expose this generalized structure, identify the requirements from monitoring and planning to realize a general solution, and provide such a solution.

The rest of the paper is organized as follows. In section 2 we describe the general structure inherent in the problem of using end-to-end control knobs to manage network performance. We then discuss where proposed solutions lack key elements of a general solution. A high-level architecture and API of our service is then presented. Section 3 describes the monitoring component of our service, which addresses key limitations of existing monitoring solutions. Section 4 describes the planning component, which uses the monitoring data to provide applications with a setting of end-to-end control knobs. We conclude in section 5 by describing the implementation status of our service and the applications we are interfacing it with.

2 Generalizing the Network Performance Management Problem

Network performance requirements for most applications are soft, and can be suitably captured by an average bandwidth demand and an average per-request latency. Our problem statement is to map an application's traffic onto the underlying network paths such that its bandwidth demand fits within the bandwidth available on these paths, and the latency on these paths does not exceed a certain limit.

Solving this problem has two essential elements. The first is a representation of the underlying network between the application's endpoints as a graph, with the following key pieces of information. First, routing constraints between pairs of endpoints must be captured by creating an IP-level topological map¹. This graph expresses constraints on routing and exposes path sharing, which are both critical for capturing constraints on packing the application's bandwidth demand within the network capacity. Second, edges in the resulting graph must be annotated with latency and available bandwidth². Bandwidth annotation is required to express capacity constraints, whereas latency annotation is required for expressing performance constraints. Finally, this graph must be derived efficiently, and purely from end-to-end measurements.

Most monitoring schemes proposed in the literature fall short of one or more of these requirements. Monitoring services that provide latency maps of the network with [8] or without [6] explicit topology representation are adequate for applications that do not have significant bandwidth demands. While voice is a typical example of such applications, voice conferencing services with a large number of participants do not fall in this category. Some solutions rely on mechanisms like SNMP inside the network to identify the bottleneck, and otherwise have no mechanism to isolate bottlenecks in a path [5]. Other solutions are unaware of topology and model the underlying network as a fully connected mesh [2], thus ignoring sharing in the planning decision.

The second aspect of the problem is to plan how to use a given end-to-end mechanism to map traffic onto this capacity, under the constraints of the underlying network capacity and routing. As we show in section 4, given the underlying graph, the setting of the end-to-end control knobs can be deduced by solving a small set of optimization problems. Many proposed monitoring solutions leave the responsibility of using the monitored data to manage performance to the application [5, 6, 8]. Some efforts have tried to close the loop between monitoring and planning [2], but their monitoring solutions lack topology information. Others [12] only plan for an application's fault response by providing alternate routes, but do not plan for bandwidth and latency constraints.

In the remainder of this section, we describe the high-level architecture and API of the service built upon the above mentioned monitoring and planning components. We call this service Network Performance Manager (NPM).

Figure 1: Architecture of a network performance management service

2.1 Application Interface

A key aspect of designing the NPM service is to design the right interface for the application to specify its demand and performance requirements. Matching most applications' abstract view of their network connectivity, NPM only requires the application to provide its communication endpoints and the traffic demand between these endpoints. NPM simplifies the specification of the endpoints and the traffic demand by providing the application with a callback that notifies NPM about the source, destination and size of all network transfers it performs. This allows NPM to maintain and populate the traffic demand matrix internally. Note that this gives NPM the added flexibility to aggregate multiple endpoints (e.g., clients) into prefixes representing the networks that the endpoints belong to, thus making monitoring and planning more scalable. The application thus provides its bandwidth requirements to NPM implicitly. The average delay requirement, however, is specified explicitly. The application also specifies one of several end-to-end control knobs to NPM, for which NPM returns the appropriate setting to meet the application's performance requirements.
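For illustration, the sketch below shows what this interface might look like from the application's side. The class and method names (NPMClient, report_transfer, request_plan) are hypothetical assumptions for this sketch, not part of NPM as described in the paper.

```python
# Hypothetical sketch of the NPM application interface described above.
# Class and method names (NPMClient, report_transfer, request_plan) are
# illustrative assumptions, not part of the NPM service as published.
from collections import defaultdict

class NPMClient:
    def __init__(self):
        self.endpoints = set()
        self.demand = defaultdict(float)   # (src, dst) -> bytes transferred

    def report_transfer(self, src, dst, nbytes):
        """Callback invoked by the application for every network transfer;
        NPM aggregates these reports into its internal traffic demand matrix."""
        self.endpoints.update((src, dst))
        self.demand[(src, dst)] += nbytes

    def request_plan(self, knob, max_avg_delay_ms):
        """Ask NPM for a setting of one end-to-end control knob (e.g.
        'server_selection', 'isp_selection', 'overlay_routing') that meets
        the explicitly specified average delay requirement."""
        raise NotImplementedError("computed by the NPM control program")

npm = NPMClient()
npm.report_transfer("10.1.0.0/16", "server-A.example.com", 2_000_000)
# plan = npm.request_plan("server_selection", max_avg_delay_ms=80)
```

The key design point visible here is that the application never describes the network; it only reports its own transfers and states a delay target, leaving topology and capacity entirely to NPM.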

2.2 High-level Architecture

The high-level architecture of NPM is shown in figure 1. It consists of a centralized control program that interacts with multiple NPM agents. The role of the NPM agent is twofold. It provides a simple socket endpoint so that the control program can orchestrate monitoring by running probe flows between selected pairs of endpoints. The agent also receives the setting of the end-to-end control knobs recommended by NPM and makes the corresponding local configuration changes, e.g., populating DNS servers to achieve the appropriate server selection, or populating application-layer routing tables in overlay nodes. The agents may run as a thread within the application or as a separate process, and are started upon application startup on a server, or as an applet in a client browser using the application.

The data acquisition component of the control program orchestrates the monitoring and collects monitoring data from the agents to create the annotated graph between the endpoints. This data feeds into an optimization solver that finds a setting of a chosen end-to-end control knob to meet the application's performance requirement. Finally, these settings are converted by a deployment engine into XML configuration files and sent to the NPM agents, which carry out the necessary local configuration and control actions.

¹ Note that for structured topologies, as generated by IP routing, the complexity of such a graph is a small multiple of the number of endpoints.
² By available bandwidth, we refer to the amount of bandwidth that an application can get by opening one or more TCP flows. Note that this is different from the raw bandwidth of the link, or the bandwidth obtained by a UDP flow.

3 Netmapper: A graph annotation service for networked applications

The underlying network structure connecting an application's endpoints, and the capacities of the links in that network, are the key inputs needed for planning the setting of an application's end-to-end control knobs. This input is provided by Netmapper, a tool that takes as input the set of endpoints of a network application, and outputs an annotated graph representing the network connectivity and capacity between these endpoints. Netmapper uses pairwise traceroutes between the endpoints to get the IP-layer topological map of the network. It then annotates the edges of this graph, to the extent possible, with their available bandwidth and latency. Assuming that we can get the latency annotation using the per-hop delays reported by traceroute, we focus on bandwidth annotation in the remaining part of this section.

Netmapper must utilize end-to-end probe flows to annotate links inside the network. The extent to which Netmapper is able to annotate links depends upon the set of endpoints available to it for running probe traffic. As more endpoints become available from a set of applications, Netmapper is able to refine the annotation with more detail.
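As a small illustration of the first step, pairwise traceroute outputs can be merged into a latency-annotated graph along the following lines. The data layout and the function name are assumptions made for this sketch, not Netmapper's actual implementation.

```python
# Minimal sketch: merge per-pair traceroute hop lists into an annotated
# IP-level graph. Data layout and function name are illustrative assumptions.
from collections import defaultdict

def build_graph(traceroutes):
    """traceroutes: dict mapping (src, dst) -> list of (hop_ip, cumulative_rtt_ms),
    as obtained from pairwise traceroutes between application endpoints."""
    graph = defaultdict(dict)   # graph[u][v] = {'latency_ms': ..., 'avail_bw': None}
    for (src, dst), hops in traceroutes.items():
        prev_node, prev_rtt = src, 0.0
        for hop_ip, rtt in hops:
            edge = graph[prev_node].setdefault(hop_ip, {'latency_ms': None, 'avail_bw': None})
            # Per-hop delay approximated from the difference of cumulative RTTs.
            edge['latency_ms'] = max(rtt - prev_rtt, 0.0)
            prev_node, prev_rtt = hop_ip, rtt
    return graph

# Example: two endpoints sharing the first hop, exposing path sharing.
trs = {
    ("A", "B"): [("10.0.0.1", 2.0), ("10.0.1.1", 7.0), ("B", 11.0)],
    ("A", "C"): [("10.0.0.1", 2.1), ("10.0.2.1", 9.0), ("C", 15.0)],
}
g = build_graph(trs)
```

In the toy example, the shared first hop 10.0.0.1 appears as a common edge out of A, which is exactly the path sharing that the planning step needs to see.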

3.1 Basic Algorithm

To understand Netmapper's operation, we must first understand the fundamental method for annotating internal links of a graph when we only know the bandwidth on a given set of paths. The way to annotate a link in the network with its capacity is to saturate the link, such that no more traffic can be driven through it. The indication of saturation is that the received traffic rate is less than the combined transmission rate of all flows driven through the link, and the capacity is then simply the received traffic rate. Probe flows can be orchestrated in a way that links of the network are successively saturated and annotated. The goal in the selection of these flows should be to minimize the amount of probe traffic in the network, the number of simultaneous probe flows in the network³, and the number of steps in the algorithm. The implementation of such an algorithm is easily amenable to dynamic programming, where the maximum amount of traffic that can be driven into a link from a set of annotated subgraphs is computed, and the computed flow is then driven through the link to check whether it saturates the link. However, the algorithm generates progressively more simultaneous flows in the network as the size of the subgraphs increases.

³ The share of network bandwidth taken by a set of TCP probe flows, and hence their intrusiveness, is proportional to the number of simultaneous flows.
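A minimal sketch of this saturation-probing idea for a source-rooted tree is shown below. The data layout and the run_probes helper (which starts one probe flow from the source to each given leaf and returns the aggregate received rate) are assumptions made for illustration; as in the discussion above, the recorded value is only a lower bound on an edge's capacity when the probes fail to saturate it.

```python
# Sketch of the basic saturation algorithm on a source-rooted tree.
# children[node] = [(child, edge_id), ...]; cap[edge_id] holds annotations;
# exact[edge_id] records whether the annotation came from a saturated edge.
# run_probes(leaves) is an assumed helper: it drives one probe flow from the
# source to each leaf and returns the aggregate received rate.

def leaves_under(node, children):
    kids = children.get(node, [])
    return [node] if not kids else [lf for c, _ in kids for lf in leaves_under(c, children)]

def subtree_offer(node, children, cap):
    """Max rate that can be pushed into the subtree rooted at 'node', given
    that the edges strictly below 'node' have already been annotated."""
    kids = children.get(node, [])
    if not kids:
        return float("inf")               # an endpoint sinks whatever arrives
    return sum(min(cap[e], subtree_offer(c, children, cap)) for c, e in kids)

def annotate(node, children, cap, exact, run_probes):
    """Post-order traversal: annotate each subtree first, then the edge feeding it."""
    for child, edge in children.get(node, []):
        annotate(child, children, cap, exact, run_probes)
        offered = subtree_offer(child, children, cap)       # DP step: max drivable traffic
        received = run_probes(leaves_under(child, children))
        cap[edge] = received               # capacity if saturated, else a lower bound
        exact[edge] = received < offered   # True iff 'edge' was actually saturated
```

The recursion mirrors the dynamic-programming view in the text: annotations of the subtrees bound how much traffic can be offered to the edge above them, and a probe round then checks whether that offered load saturates the edge.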

Figure 2: An illustration of the annotation algorithm.

3.2 An Improved Algorithm with Bottleneck Identification

Netmapper uses a powerful primitive that can significantly reduce the number of probes and the amount of probe traffic required by the basic algorithm. This primitive is bottleneck identification. By identifying a bottleneck on a path, we are not only able to annotate the bottleneck link in one step, we also eliminate a number of probe steps proportional to the size of the subgraph hanging off the bottleneck link. We illustrate this point using the example graph shown in Figure 2. The example shows a directed subgraph with a tree structure; the first figure shows the actual link capacities in the graph. We show the steps that the basic algorithm would take to annotate the graph, compared to the algorithm which uses bottleneck identification.

Assuming the basic algorithm starts with a probe between N0 and N4, the path bandwidth is 10, which means at least 10 units of bandwidth are available on edges E1, E2 and E4. No more information can be gathered from this path, hence the algorithm runs a probe between N0 and N5, which gives a bandwidth of 5. Now the algorithm must test how much can be pumped through edge E2, which can be up to 15. Running both flows together yields 10, and hence E2 is annotated with 10. Running a probe between N0 and N6 yields 5, which annotates E3 and E6 with 5. Running a probe between N0 and N7 yields 10, which annotates E3 and E7 with 10. We must then test how much can be pumped through E3, which can be up to 15. Running both flows together yields 10, hence E3 is annotated with 10 as well. Finally, edge E1 must be tested for saturation at 15, by starting flows in each subtree, which yields 10 as the annotation for E1. The total number of steps taken by the algorithm is 7, 3 of them requiring multiple simultaneous probe flows to be driven into the network.

Now consider the steps with bottleneck identification. When the flow from N0 to N4 is started, we identify E1 as the bottleneck and annotate it in one step. More importantly, the steps needed to annotate the subtrees are significantly reduced. Since E1 is bottlenecked at 10, there is no way to pump more than 10 units into any path. Hence we only need to see if any link in the subtree rooted at N1 has less bandwidth than 10. By running one probe each to N4, N5, N6 and N7, and identifying bottleneck links along the path, this refines the annotation of links E5 and E6, giving us the complete annotation in 4 steps, each requiring only one probe flow. Note that if a link in the subtree which turns out to be a bottleneck has another subtree hanging off it, the same simplification applies to that subtree.

The complete algorithm for the general case can be derived simply as follows. When we identify a link as the bottleneck, with capacity C, we consider all paths of which the link is a part, and annotate each edge on such paths with C, up to the point where another path is incident on this path. The rationale is that for edges in this part of the path, there is no way to drive more traffic into the edge than C. Since IP routing from a source to a set of destinations has a tree structure, this yields a significant reduction in the probe traffic that needs to be generated to annotate links downstream from a bottleneck link. Note that the stability of network performance metrics over reasonable lengths of time [7] leads us to believe that Netmapper can provide a stable map of network performance that can be updated on a coarse time scale.
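A small sketch of this propagation rule, under the assumption that the measured paths are available as ordered edge lists; the names and the upper-bound bookkeeping (ub) are illustrative.

```python
# Sketch: when a bottleneck edge with capacity C is identified, propagate C as
# an annotation along the paths crossing it, stopping where a path that does
# not cross the bottleneck joins (more than C could then be driven in).
# paths: dict path_id -> ordered list of edge ids; ub: edge -> bandwidth bound.

def propagate_bottleneck(bneck_edge, C, paths, ub):
    crossing = {pid for pid, edges in paths.items() if bneck_edge in edges}
    for pid in crossing:
        edges = paths[pid]
        for e in edges[edges.index(bneck_edge):]:        # bottleneck and downstream
            # Stop once a path that does not cross the bottleneck also uses 'e'.
            if any(e in paths[q] for q in paths if q not in crossing):
                break
            ub[e] = min(ub.get(e, float("inf")), C)      # exact for the bottleneck itself
```

In the example above, detecting E1 as the bottleneck of the N0-N4 path immediately bounds E1 and, since every measured path in the single-source tree crosses E1, every downstream edge at 10, which is why only four single-flow probes remain.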

Figure 3: Topology on the Emulab network used to show that BFind often reports a different bottleneck link. (A, B, C and D are end hosts; routers R1, R2 and R3 are connected by a 2 Mbps R1-R2 link and a 1 Mbps R2-R3 link.)

3.3 Bottleneck Identification

We now describe our bottleneck identification mechanism, which is key to the annotation algorithm. The key challenge in bottleneck identification is that we can only rely on end-to-end measurements to isolate an internal network property. An interesting approach to the problem, introduced in [3], is to correlate an increase in the traffic rate of a flow with an increase in the per-hop delay reported along its path by traceroute, and to identify the hop showing the maximum increase in delay as the bottleneck link. A key limitation of the approach is that it uses an unresponsive UDP flow to ramp up the traffic on a path. Though UDP is the only viable choice when we have control over only one of the endpoints of the path, several undesirable properties result from this choice. Firstly, since the UDP flow is unresponsive, it may push away background traffic from the bottleneck link, especially TCP flows with small windows. This causes the probe flow to "open up" the bottleneck, potentially leading to incorrect identification of the bottleneck. Secondly, a collection of UDP probes can be extremely intrusive to other traffic in the network. Finally, we mentioned before that we attempt to annotate the network with the bandwidth available to a TCP flow, assuming that the application will use network-friendly TCP or TFRC flows. A link being the bottleneck for a TCP flow depends upon several complex factors, for instance the RTTs and window sizes of the background TCP traffic. Hence, in general, the bottleneck for a UDP flow may not be the bottleneck for a TCP flow.

The desirable solution would be to use TCP's own bandwidth ramp-up to induce and detect a bottleneck in the network. The intuition behind developing a TCP-friendly variant of BFind, the tool proposed in [3], is derived from TCP Vegas [9]. TCP Vegas reacts proactively to impending congestion by reacting to an increase in RTT before incurring a loss event. The premise is that there is a stable, detectable increase in queuing delay before a loss occurs. However, this increase in queuing delay lasts only for a few RTTs, and hence must be detected within that timescale. Thus, in order to identify bottlenecks in conjunction with TCP, traceroute must be able to sample the increased delay quickly. Standard traceroute, however, can take time proportional to the number of hops in the path times the RTT to get its result. Hence, we modify standard traceroute to send multiple ICMP packets with increasing TTL values in parallel. This allows us to get the per-hop delays in one RTT, and provides the sampling frequency needed for bottleneck identification to work with TCP.
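A parallelized traceroute of this kind could be prototyped, for example, with the scapy packet library. This is an illustrative sketch under that assumption, not the implementation used in the paper; it sends ICMP echo probes with TTLs 1..max_ttl back-to-back and collects per-hop delays from the ICMP responses.

```python
# Illustrative sketch (not the paper's implementation): a one-RTT traceroute
# that sends ICMP echo probes with TTL = 1..max_ttl back-to-back and collects
# per-hop delays from the ICMP time-exceeded / echo-reply responses.
# Requires root privileges and the scapy package.
from scapy.all import IP, ICMP, sr

def parallel_traceroute(dst, max_ttl=20, timeout=2.0):
    probes = IP(dst=dst, ttl=(1, max_ttl)) / ICMP()        # one packet per TTL value
    answered, _ = sr(probes, timeout=timeout, verbose=0)   # sent together, not hop by hop
    hops = {}
    for sent, reply in answered:
        rtt_ms = (reply.time - sent.sent_time) * 1000.0
        hops[sent.ttl] = (reply.src, rtt_ms)
    return [hops[t] for t in sorted(hops)]                 # [(hop_ip, rtt_ms), ...]
```

Feeding such per-hop delay samples to the Vegas-style detector once per RTT of the TCP probe provides the rate-versus-delay correlation described above.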

To demonstrate the improved properties of a TCP-based bottleneck estimator, we set up the topology shown in Figure 3 on the Emulab [10] network. We open two long-running TCP flows from C to D. A TCP flow between A and B then has R1-R2 as its bottleneck link, since 1/3rd of that link's bandwidth (0.67 Mbps) is available to each of the 3 flows. However, running BFind between A and B reports R2-R3 as the bottleneck, because it pushes away the traffic of the 2 TCP flows between C and D on the link R2-R3, occupying more than its fair share of 0.67 Mbps on R2-R3.

We also conducted extensive experiments with Netmapper over the real Internet, using the IBM IntraGRID [17] nodes as the endpoints. The IBM IntraGRID is a network of nodes for Grid computing, with over 50 nodes spread all over the world. In these experiments we found that BFind converged to a bottleneck at a probe bandwidth 3-4 times higher than the TCP-based variant. In many cases, BFind was not able to converge to a bottleneck at all, because either standard traceroute was not able to sample the induced increase in delay, or the probe pushed away most of the competing background traffic. In fact, during one of our BFind experiments, two of the IBM IntraGRID nodes in Europe were taken off the network by AT&T, because their network surveillance system classified the excessive bandwidth consumed by the UDP traffic as worm-like behavior by the endpoints!

A final refinement we make in Netmapper is that we maintain, as part of a link's annotation, a rough estimate of the number of background flows bottlenecked at the link, together with the available bandwidth. This is because if we measure C as the bandwidth available to a single TCP probe flow through a link at which N background flows are bottlenecked, then the bandwidth available to each of K application flows bottlenecked at that link will be given roughly by C(N + 1)/(N + K).
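As a worked example of this adjustment, under the reading that the estimate is per application flow and with purely illustrative numbers:

```python
def tcp_share(C, N, K):
    """If a single TCP probe measured C with N background flows bottlenecked
    at a link, the link's TCP-fair capacity is roughly C * (N + 1), so each
    of K application flows added at that link gets about C*(N+1)/(N+K)."""
    return C * (N + 1) / (N + K)

# Example: the probe saw 2 Mbps alongside N = 3 background flows, so the link
# is roughly 8 Mbps TCP-fair; K = 5 application flows then get ~1 Mbps each.
print(tcp_share(2.0, 3, 5))   # -> 1.0
```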

4 End-to-end Control Settings as Optimization Problems

In this section, we describe how the problem of setting the typical end-to-end control knobs available to an application can be formulated as a small set of optimization problems. The set of end-to-end knobs available may be richer than what is considered here; however, optimization problems can be devised for other end-to-end knobs in a similar manner. Note that the planning is done on a coarse time scale using averaged values of traffic demand and network capacity. The time constraints on the planning are thus not very stringent.

We first discuss the case of server selection. We are given a set C = c1, ..., cn of clients (or client networks) and a set S = s1, ..., sm of servers. Each client ci has a traffic demand fi. We need to come up with an allocation of each client to exactly one server, such that the traffic demands of the maximum number of clients pack within the network capacity.

We define indicator variables Ii,j such that Ii,j = 1 if server j is chosen for client i, and 0 otherwise. We also modify our graph by adding an edge to each client node i with capacity fi. There are m·n indicator variables. We arrange them in a column vector X such that Xm·i+j = Ii,j; thus, the indices of X correspond to I written in row-major form. Since each (i, j) combination corresponds to a path, this gives us an ordering on these m·n paths.

If there are E edges in the graph, we define a matrix A with E rows and m·n columns, where Ai,j equals the demand of the client that path j belongs to (the client j div m; recall that path j goes between server j mod m and client j div m) if path j passes through edge i, and 0 otherwise. We also define the capacity column vector B, where Bi is the capacity of edge i. Note that by introducing the extra edge of capacity fi to client i, we make sure that the feasible settings of the indicator variables are such that exactly one of the Ii,j will be 1, i.e. exactly one server will be selected for each client. We want to maximize the number of clients whose demand can be packed in the network by choosing an appropriate server mapping. Since a feasible solution turns on only one indicator variable for each client, it is sufficient to maximize the sum of all indicator variables. Our formulation is thus given by the following integer linear program:

    max   Σi Xi
    s.t.  A X ≤ B
          Xi ∈ {0, 1}   for all i
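For illustration, this ILP can be written down directly for an off-the-shelf solver. The sketch below uses the PuLP library and toy data, and enforces "at most one server per client" with an explicit constraint instead of the extra client edge used in the text; none of the names or numbers come from NPM itself.

```python
# Hedged sketch of the server-selection ILP using PuLP (toy data, not NPM code).
# paths[(i, j)] lists the edges used by the route from server j to client i.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

clients = {0: 5.0, 1: 3.0}                      # client -> demand f_i (Mbps)
servers = [0, 1]
cap = {"e1": 6.0, "e2": 4.0, "e3": 10.0}        # edge -> available bandwidth B
paths = {(0, 0): ["e1", "e3"], (0, 1): ["e2", "e3"],
         (1, 0): ["e1"],       (1, 1): ["e2"]}

prob = LpProblem("server_selection", LpMaximize)
x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary) for (i, j) in paths}   # I_{i,j}

prob += lpSum(x.values())                       # maximize number of packed clients
for i in clients:                               # at most one server per client
    prob += lpSum(x[(i, j)] for j in servers) <= 1
for e, b in cap.items():                        # AX <= B: demand routed over each edge
    prob += lpSum(clients[i] * x[(i, j)]
                  for (i, j) in paths if e in paths[(i, j)]) <= b

prob.solve()
assignment = {i: j for (i, j), var in x.items() if var.value() == 1}
```

On the toy data the solver assigns client 0 to server 0 and client 1 to server 1, packing both demands within the edge capacities.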

Since we are trying to pack flows between designated endpoints, a latency requirement can be factored in simply by pruning the set of flows whose end-to-end delay does not meet the latency requirement.

Let us now consider the problem of ISP selection for a multi-homed server. We are given a set C = c1, ..., cn of clients and a set S = s1, ..., sm of ISPs that the server connects to. Each client ci has a flow request fi. We need to come up with an allocation of each client to exactly one ISP. Note that this problem is mathematically exactly the same as the one above; hence, the same problem formulation can be used to solve for ISP selection.

While solutions to ILPs are not efficient in general, note that efficient approximation algorithms exist for packing problems. The above ILP is essentially a generalized packing problem, and similar approximation algorithms can be used for it; we do not discuss them in this paper.

Finally, we consider overlay routing. Note that when the problem is to route traffic between two given endpoints with bandwidth and latency constraints, simple (often polynomial-time) solutions exist [16]. However, routing while packing traffic demand between all pairs of endpoints is a hard problem. We present the general formulation below, and later suggest a more efficient heuristic.

We are given a set C = c1, ..., cN of N nodes which are the application's endpoints. For generality, assume all of the endpoints can act as intermediate routing nodes. We are also given a traffic demand matrix A between these endpoints, where Ai,j is the traffic demand from ci to cj. To formulate the optimization problem, we use two constraints. One is flow conservation, which is an elegant way of capturing routing choices in the problem. The other constraint is edge capacities. To formulate flow conservation, we define an incidence matrix I such that Ii,j = 1 if edge j enters node i and Ii,j = −1 if edge j leaves node i. If we define a flow column vector F such that Fi represents the flow on edge i, then I·F is a vector whose ith entry is 0 if node i is an intermediate node, a positive number equal to the net flow into the node if i is a pure sink node, and a negative number equal to the net flow out of the node if i is a pure source node.

The traffic demand between any two endpoints is treated as a separate flow, akin to a multi-commodity flow problem. Using Fp to represent the vector of edge flows for commodity p, we can write the flow conservation constraint for each flow as I·Fp = Kp, where Kp is a vector with value −Ai,j at the source node i, value Ai,j at the sink node j, and 0 elsewhere. The capacity constraint is straightforward, and is given by Σp Fp ≤ B, where B is the edge capacity vector as defined in the previous formulation. These two constraints capture the feasible values of the flows. We would like to pick flows that also minimize latency, which can be done by assigning a unit cost to each link and minimizing the total cost, where C is the (all-ones) link cost vector. The formulation thus becomes:

    min   Σp C·Fp
    s.t.  I·Fp = Kp          for all p
          Σq Fq ≤ B
          Fp[i] ≥ 0           for all p, i
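To see this multi-commodity formulation in executable form, here is a hedged sketch on a toy three-node graph, again assuming the PuLP library; the edge names, the demand matrix and the unit-cost objective are illustrative only.

```python
# Hedged sketch of the multi-commodity flow formulation with unit link costs.
# Toy directed graph; not NPM code. edges: (u, v) -> capacity B.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

edges = {("a", "b"): 10, ("b", "c"): 10, ("a", "c"): 4}
nodes = {"a", "b", "c"}
demand = {("a", "c"): 8}                      # commodity p: A[a][c] = 8

prob = LpProblem("overlay_routing", LpMinimize)
F = {(p, e): LpVariable(f"f_{p[0]}{p[1]}_{e[0]}{e[1]}", lowBound=0)   # F_p[e] >= 0
     for p in demand for e in edges}

prob += lpSum(F.values())                     # unit cost per link: minimize total flow
for p, amount in demand.items():              # flow conservation I.F_p = K_p
    src, dst = p
    for n in nodes:
        net_in = (lpSum(F[(p, e)] for e in edges if e[1] == n)
                  - lpSum(F[(p, e)] for e in edges if e[0] == n))
        prob += net_in == (amount if n == dst else -amount if n == src else 0)
for e, b in edges.items():                    # capacity: sum_p F_p <= B
    prob += lpSum(F[(p, e)] for p in demand) <= b

prob.solve()
flows = {k: v.value() for k, v in F.items()}
```

With the toy numbers, the solver splits the 8 units of a-to-c demand into 4 units on the direct link and 4 units via b, which is the unit-cost optimum under the 4 Mbps capacity of the direct edge.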

Since the size of this problem can be considerable, in practice we utilize a heuristic developed by Lakshman et al. [18], where the traffic demand between pairs of endpoints is accommodated sequentially, while identifying and keeping unloaded the "critical" links: links that, if heavily loaded, would make it impossible to satisfy future demands between certain ingress-egress pairs.

5 Conclusions

In this paper we have argued for a different approach to the network performance management problem where, instead of utilizing network QoS mechanisms, an application uses end-to-end mechanisms to map its traffic demands onto the available network capacity. We have asserted that if appropriate network monitoring information is available, the problem can be generalized by feeding this information into a set of optimization problems corresponding to the different end-to-end controls available. We have discussed how related solutions in the literature either solve the problem partially, or lack key pieces of information from the monitoring data. We have presented a service called NPM (Network Performance Manager), described its architecture and its monitoring and planning components, and argued that it can provide a general solution to the problem of using end-to-end mechanisms to control the network performance of a broad class of applications.

Currently we are interfacing two applications with NPM. One is a peer-to-peer voice conferencing service, which stresses our algorithms both in terms of managing bandwidth and latency for the application. The second is a distributed network gaming application that needs network-aware placement and mapping of servers to clients. Both of these studies are being driven by traces from production conferencing and gaming services. The characteristics of these applications are sufficiently different that they allow us to demonstrate the broad applicability of NPM as a network performance management service for different kinds of applications.

References

[1] M. Singh, P. Pradhan and P. Francis, MPAT: Aggregate TCP Congestion Management as a Building Block for Internet QoS, ICNP 2004.
[2] Y. Chu, S. Rao, S. Seshan and H. Zhang, Enabling Conferencing Applications on the Internet Using an Overlay Multicast Architecture, SIGCOMM 2001.
[3] A. Akella, S. Seshan and A. Shaikh, An Empirical Evaluation of Wide-Area Internet Bottlenecks, IMC 2003.
[4] N. Hu, L. Li, Z. Mao, P. Steenkiste and J. Wang, Locating Internet Bottlenecks: Algorithms, Measurements, and Implications, SIGCOMM 2004.
[5] N. Miller and P. Steenkiste, Collecting Network Status Information for Network-Aware Applications, INFOCOM 2000.
[6] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt and L. Zhang, IDMaps: A Global Internet Host Distance Estimation Service, IEEE Transactions on Networking, Oct 2001.
[7] Y. Zhang and N. Duffield, On the Constancy of Internet Path Statistics, IMW 2001.
[8] Y. Chen, D. Bindel, H. Song and R. H. Katz, An Algebraic Approach to Practical and Scalable Overlay Monitoring, SIGCOMM 2004.
[9] L. Brakmo, S. O'Malley and L. Peterson, TCP Vegas: New Techniques for Congestion Detection and Avoidance, SIGCOMM 1994.
[10] Emulab test-bed. http://www.emulab.net
[11] http://www.akamai.com
[12] D. Andersen, H. Balakrishnan, M. Kaashoek and R. Morris, Resilient Overlay Networks, SOSP 2001.
[13] A. Akella, B. Maggs, S. Seshan, A. Shaikh and R. Sitaraman, A Measurement-Based Analysis of Multihoming, SIGCOMM 2003.
[14] S. Savage, Sting: A TCP-based Network Measurement Tool, USITS 1999.
[15] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang and W. Weiss, An Architecture for Differentiated Services, RFC 2475, Oct 1998.
[16] S. Chen and K. Nahrstedt, An Overview of Quality-of-Service Routing for the Next Generation High-Speed Networks, IEEE Network Magazine, Dec 1998.
[17] https://intragrid.webahead.ibm.com
[18] M. Kodialam, T. Lakshman and S. Sengupta, Online Multicast Routing with Bandwidth Guarantees, SIGMETRICS 2000.