A Load-balancing Scheme for Parallel Internet Forwarding Systems

W. Shi, M. H. MacGregor, P. Gburzynski
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada.
Phone: (780) 492-1930  Fax: (780) 492-1071

Abstract

By investigating the flow level characteristics of Internet traffic, we are able to trace the root of load imbalance in hash-based load-splitting schemes. We show that workloads with Zipf-like distributions cannot be balanced using simple hashing. We have developed a novel packet scheduler that can achieve balanced load for parallel forwarding systems. Our scheduler capitalizes on the non-uniform flow reference pattern and especially the presence of a few high-rate flows in the typical Internet traffic mix. We show that detecting and scheduling these flows is very effective in balancing workloads among network processors. Since the number of high-rate flows is small, reassigning them to different processors causes the least disruption to the states of the processors, and reorders fewer packets within individual network flows. Our ideas are validated by simulation results.

Key words: Internet forwarding systems, parallel forwarding, load balancing, hashing

1 Introduction

The continuing Internet bandwidth explosion and the advent of new applications have created great challenges for Internet routers. These devices have to offer high throughput, computational power, and flexibility. One answer to these challenges is network processors (NP's) [1], which provide the right balance between performance and flexibility.

Email addresses: [email protected] (W. Shi), [email protected] (M. H. MacGregor), [email protected] (P. Gburzynski).


Fig. 1. A Multi-processor Forwarding System: a scheduler dispatches incoming traffic into queues feeding forwarding engines FE 1 through FE 4, each with its own cache.

To achieve high throughput, NP's are optimized for key packet forwarding algorithms and high-speed I/O. More importantly, multiple network processors are employed to forward packets in parallel to achieve scalability. Although designs from vendors vary, Fig. 1 shows a generalized model where the forwarding engines (FE) are the basic packet processing units. A scheduler that dispatches packets to the FE's is essential to multi-FE system performance [2]. The scheduler must distribute workloads in a load-balanced manner so that the parallel system can achieve its full forwarding potential.

Two general packet dispatching schemes are popular: round robin (RR) and hashing. Given the number of FE's, N, and the index i of the FE that processed the current packet, the RR scheduler delivers the next incoming packet to FE (i + 1) % N, where % is the modulo operation. If we assume that the FE's are identical in processing capability and that each packet requires the same amount of processing, load balancing is naturally achieved with RR. In the real world, where these assumptions cannot be made, other approaches deliver the next packet to the least-loaded processor. One drawback of these schemes, however, is that they dilute the temporal locality in the scheduled traffic, which results in poor cache performance [3]. Another disadvantage is that these schemes do not preserve packet ordering within a network flow, which is detrimental to end-to-end protocol performance [4, 5].

A dispatching scheme based on hashing uses several packet header fields as inputs to a hash function; the return value is used to select the FE that will forward the packet. Hashing actually improves locality in flows and can preserve packet order within network flows [3]. The major disadvantage of simple hashing is that it does not balance workloads.
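To make the two dispatching policies concrete, the following is a minimal sketch; the destination-address flow key, the CRC32 hash, and the four-FE setup are illustrative assumptions, not the scheme of any particular vendor.

```python
# Minimal sketch of the two dispatching policies described above.
# The flow key (destination IP) and the CRC32 hash are illustrative choices.
import zlib

NUM_FES = 4

class RoundRobinDispatcher:
    """Sends the next packet to FE (i + 1) % N regardless of flow membership."""
    def __init__(self, num_fes):
        self.num_fes = num_fes
        self.last = -1

    def dispatch(self, packet):
        self.last = (self.last + 1) % self.num_fes
        return self.last

class HashDispatcher:
    """Keeps every packet of a flow on one FE by hashing the destination address."""
    def __init__(self, num_fes):
        self.num_fes = num_fes

    def dispatch(self, packet):
        return zlib.crc32(packet["dst_ip"].encode()) % self.num_fes

if __name__ == "__main__":
    pkts = [{"dst_ip": ip} for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]]
    rr, hd = RoundRobinDispatcher(NUM_FES), HashDispatcher(NUM_FES)
    print([rr.dispatch(p) for p in pkts])  # spreads packets, may split a flow
    print([hd.dispatch(p) for p in pkts])  # same flow always maps to the same FE
```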

In the short term, this imbalance can be caused by traffic dynamics, since data network traffic is known to be bursty [6, 7]. More importantly, as we show in this paper, load imbalance is caused in the long run by the skewed distribution of network flow rates.

In the context of IP forwarding, we define a flow as a sequence of packets with the same destination IP address. This represents a coarser aggregation of network traffic than some popular flow definitions, e.g., using the five-tuple of source and destination IP addresses, source and destination transport-layer port numbers, and transport-layer protocol identifier to identify a flow. Our definition is targeted at the packet forwarding process, where IP destination address lookup [8–11] is the bottleneck. Our approach is extensible to finer flow definitions; a more detailed discussion can be found in Section 6.4.

The rest of the paper is organized as follows. Section 2 discusses relevant studies on Internet workload characterization and parallel forwarding system load balancing. Section 3 shows the flow level characteristics of Internet traffic and explores the implications for load balancing. Section 4 describes the design of our scheduler. Section 5 discusses techniques for identifying high-rate flows. Section 6 shows simulation results and Section 7 summarizes the work.

2 Related Work

2.1 Load Balancing for WWW Caches

Load balancing is important to the performance of Web sites and Web caches that use multiple servers to process user requests. For Web cache systems, a hash-based scheme called highest random weight (HRW) is proposed in [12] to achieve a high Web cache hit rate, load balancing, and minimum disruption in the face of server failure or reconfiguration. To request a Web object, HRW uses the object's name and the identifiers of the cache servers, e.g., their IP addresses, as the keys of a hash function to produce a list of weights. The server whose identifier produces the largest weight is selected to serve the request. If a server fails, only the object requests that were mapped to that server are migrated to other servers, while the other requests are not affected. This achieves minimum disruption. HRW has been extended to accommodate heterogeneous server systems [13] and leads to the popular cache array routing protocol (CARP). The idea is to assign multipliers to cache servers to scale the return values of HRW. A recursive algorithm is provided to calculate the multipliers such that the object requests are divided among the servers according to a pre-defined distribution.
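A small sketch of the HRW selection rule may help; the SHA-1 hash and the multiplier scaling that approximates the CARP-style extension are illustrative assumptions, not the exact functions used in [12,13].

```python
# Sketch of highest-random-weight (HRW) selection: hash the (object, server)
# pair and pick the server with the largest weight. Multipliers approximate
# the CARP-style extension for heterogeneous servers; the hash is illustrative.
import hashlib

def hrw_weight(object_name: str, server_id: str) -> int:
    digest = hashlib.sha1(f"{object_name}|{server_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def hrw_select(object_name, servers, multipliers=None):
    multipliers = multipliers or {s: 1.0 for s in servers}
    return max(servers, key=lambda s: multipliers[s] * hrw_weight(object_name, s))

servers = ["10.1.0.1", "10.1.0.2", "10.1.0.3"]
print(hrw_select("http://example.com/logo.png", servers))
# Removing a server only remaps the objects that had chosen it (minimum disruption):
print(hrw_select("http://example.com/logo.png", servers[:-1]))
```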

2.2 Internet Workload Characterization

Ref. [14] examines Internet traffic at the connection level. It is found that the burstiness of Internet traffic is not due to a large number of flows being active at the same time, as assumed in some Internet traffic models [15], but rather is caused by a few large files transmitted over high-bandwidth links. These connections contribute to alpha traffic and the rest create beta traffic. Ref. [16] studies the flow patterns of measured Internet traffic, and points out that network streams can be classified by both size (elephants and mice) and lifetime (tortoises and dragonflies). Tortoises are flows that last more than 15 minutes and contribute a small portion of the number of flows (one or two percent), but carry fifty to sixty percent of the total traffic.

2.3 Load Balancing for Parallel Forwarding Systems

Ref. [2] describes a load balancer for parallel forwarding systems. Traffic is split according to a two-step table-based hashing scheme [17]. Packet header fields that uniquely identify a flow are used as the key of a hash function. The return value is used as an index into a lookup memory to derive the index of the processor to which the packet should be forwarded. Flows that yield the same return value are called a flow bundle and are associated with one processor. Three techniques are combined in [2] to achieve load balancing. First, a time stamp is kept for each flow bundle and updated at every packet arrival. Before the update, the time stamp is compared with the current system time; if the difference is larger than a pre-configured value, the flow bundle is assigned to the processor that is currently least loaded. Second, flow reassignment monitors the states of the input queues of the processors. Flow bundles are redirected from their current, overloaded processor to the processor with the minimum number of packets in its queue. Third, high-rate flow bundles are detected and repeatedly assigned to the least-loaded processors. This is called packet spraying.

Ref. [18] proposes a scheduling algorithm for parallel IP packet forwarding. The scheme is based on HRW [12,13]. It is noted that although HRW provides load balancing over the request object space, load imbalance can still occur due to uneven popularities of the individual objects. An adaptive scheduling scheme is introduced in [18] to cope with this problem. It includes two parts: the triggering policy and the adaptation. Periodically, the utilization of the system is calculated and compared to a pair of thresholds to determine whether the system is under- or over-utilized. In either condition, an adaptation is invoked which adjusts the weights (called multipliers in [13]) for the FE's to affect load distribution. In other words, the algorithm treats over- or under-load conditions as changes in the processing power of the FE's. It is proved that the adaptation algorithm keeps the minimal disruption property of HRW.

Table 1
Traces Used in Experiments

FUNET (100,000 entries): A destination address trace, used in evaluating the LC-trie routing table lookup algorithm in [11], from the Finnish University and Research Network (FUNET).
UofA (1,000,000 entries): A 71-second packet header trace recorded in 2001 at the gateway connecting the University of Alberta campus network to the Internet backbone.
Auck4 (4,504,396 entries): A 5-hour packet header trace from the National Laboratory for Applied Network Research (NLANR). This is one of a set of traces (AuckIV) captured at the University of Auckland Internet uplink by the WAND research group between February and April 2000.
SDSC (31,518,464 entries): A 2.7-hour packet header trace from NLANR, extracted from outgoing traffic at the San Diego Supercomputer Center (SDSC) around the year 2000.
IPLS (44,765,243 entries): A 2-hour packet header trace from NLANR. This is from a set of traces (Abilene-I) collected from an OC48c Packet-over-SONET link at the Indianapolis router node.
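For concreteness, the two-step table-based split that forms flow bundles (as used in [2,17] and described above) can be sketched as follows; the table size, hash function, and flow key are assumed values, and the time-stamp and queue-monitoring adaptations are omitted.

```python
# Sketch of two-step table-based hashing: a hash over the flow key indexes a
# lookup table, and the table entry names the FE. All flows hitting the same
# entry form a "flow bundle"; re-balancing rewrites table entries, not flows.
import zlib

TABLE_BITS = 8                      # assumed lookup-table size: 256 bundles
NUM_FES = 4
table = [i % NUM_FES for i in range(1 << TABLE_BITS)]   # initial bundle -> FE map

def bundle_of(flow_key: str) -> int:
    return zlib.crc32(flow_key.encode()) & ((1 << TABLE_BITS) - 1)

def dispatch(flow_key: str) -> int:
    return table[bundle_of(flow_key)]

def remap_bundle(flow_key: str, least_loaded_fe: int) -> None:
    """Redirect the whole bundle containing this flow to another FE."""
    table[bundle_of(flow_key)] = least_loaded_fe

print(dispatch("192.0.2.7"))
remap_bundle("192.0.2.7", 3)        # e.g., after detecting an overloaded FE
print(dispatch("192.0.2.7"))
```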

3 Flow Level Internet Traffic Characteristics

To study flow level Internet traffic characteristics, we have experimented with traces collected from networks ranging from campuses to major Internet backbones [19]. In this paper, we show the results for five traces (see Table 1). When results are shown for only one or two traces, it is implied that the results are similar for the other traces.

Many objects on the Internet have been observed to follow Zipf-like distributions [20], and IP addresses are one example. For instance, [21] uses a Zipf-like distribution for IP destination address frequency to produce synthetic address sequences. Given that Web requests dominate Internet traffic, the Zipf-like distribution of IP destination addresses can be the result of Zipf-like Web request size distributions [22, 23]. In this section, we show that the flow popularity distribution in Internet traffic can be approximated by a Zipf-like function [24] and that dominating high-rate flows exist in every trace that we examine. We then explore the implications of both observations for load balancing in parallel forwarding systems.

Fig. 2. IP Address Popularity Distribution: address frequency vs. rank on a log-log scale for the five traces.

Fig. 3. Fitting IP Address Popularity Distribution with Zipf-like Functions. Fitted lines (log(Frequency) vs. log(Rank)): FUNET 8.81993 − 0.929069x; UofA 11.2473 − 1.03856x; Auck4 14.5115 − 1.65917x; SDSC 13.7552 − 0.90524x; IPLS 15.5189 − 1.21439x.

3.1 Zipf-like Distribution of IP Address Frequency

The address frequency distributions of the five traces are shown in Fig. 2. The outstanding common feature of these traces is that popular flows are very popular, and tiny flows are many. Although their scales differ, each curve can be approximated by a straight line in the log-log plot. This is a Zipf-like function:

P(R) \sim 1/R^a,    (1)

where a ≈ 1. To get a proper fit, we bin the data into exponentially wider bins [25] so that they appear evenly spaced on a log scale, as shown in Fig. 3. The slopes fitted for the five traces, SDSC, FUNET, UofA, IPLS, and Auck4, are -0.905, -0.929, -1.04, -1.21, and -1.66, respectively.
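One way to reproduce such a fit, assuming only a list of per-address packet counts, is sketched below; the number of bins and the NumPy least-squares fit are illustrative choices, not the exact procedure behind Fig. 3.

```python
# Sketch of the fitting procedure: rank addresses by frequency, bin the
# rank-frequency data into exponentially wider bins, and fit a line in
# log-log space. Input is an assumed list of per-address packet counts.
import numpy as np

def zipf_fit(frequencies):
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]   # descending
    edges = np.unique(np.logspace(0, np.log10(len(freqs)), 30).astype(int))
    bin_rank, bin_freq = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        bin_rank.append(np.sqrt(lo * hi))          # geometric midpoint of the bin
        bin_freq.append(freqs[lo - 1:hi].mean())   # mean frequency inside the bin
    slope, intercept = np.polyfit(np.log(bin_rank), np.log(bin_freq), 1)
    return slope, intercept                        # slope is approximately -a

counts = (1e6 / np.arange(1, 10001) ** 1.1).astype(int) + 1   # synthetic Zipf-like data
print(zipf_fit(counts))
```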

3.2 The Presence of Dominating High-rate Flows

Also common to all traces is the presence of several popular addresses dominating a large number of less popular addresses. Table 2 shows the number of packets in the ten most popular flows in each trace. Measurements of 94 traces from various campus and backbone networks indicate that this occurs in all the traces we have collected. This is the motivation for our load balancing scheme (see Section 4).

3.3 Simple Hashing Cannot Achieve Load Balancing

Table 2
No. of Packets of 10 Largest Flows in the Traces

R    FUNET   UofA      Auck4     SDSC        IPLS
1    8,233   158,707   640,730   1,183,834   2,788,273
2    7,424   24,245    440,149   581,495     944,253
3    2,971   20,769    196,513   524,542     919,088
4    2,470   17,482    194,757   235,363     808,773
5    2,298   15,146    186,095   212,150     732,339
6    1,614   14,305    177,388   168,384     582,367
7    1,387   13,308    135,286   160,798     570,316
8    1,317   12,348    135,033   138,657     510,043
9    1,309   12,028    132,812   125,531     473,562
10   1,258   11,824    104,716   125,389     470,072

Let m be the number of identical FE's in a parallel forwarding system and let K be the number of distinct addresses. Let p_i (0 < i ≤ K) be the popularity of address i and let q_j (0 < j ≤ m) be the number of addresses distributed to FE j. The load imbalance of the system can be expressed as the coefficient of variation (CV) [26] of q_j (Ref. [12], Theorem 4):

CV[q_j]^2 = \frac{m - 1}{K - 1} CV[p_i]^2.    (2)

The system is balanced if CV[q_j]^2 approaches zero as K and the number of packets approach infinity. It is also required that CV[p_i] be finite. The discrete-form probability density function (PDF) of a Zipf-like distribution such as the distribution in Eq. 1 representing address frequency is:

P(X = i) = \frac{1}{Z} i^{-\alpha}, \quad i = 1, 2, \ldots, K, \quad \alpha > 1,    (3)

where Z is a normalizing constant:

Z = \sum_{i=1}^{\infty} i^{-\alpha}.

Given that the average frequency of the K objects, E[p_i], is 1/K, we have

CV[p_i]^2 = \frac{Var(p_i)}{E[p_i]^2} = \frac{E[p_i^2] - E[p_i]^2}{E[p_i]^2}    (4)
          = \frac{\frac{1}{K} \sum_{i=1}^{K} \frac{1}{Z^2} i^{-2\alpha}}{\frac{1}{K^2}} - 1
          = \frac{K}{Z^2} \sum_{i=1}^{K} i^{-2\alpha} - 1.

Substituting CV[p_i]^2 in Eq. 2, we have

CV[q_j]^2 \sim \frac{K(m - 1)}{Z^2 (K - 1)} \sum_{i=1}^{K} i^{-2\alpha}.

As α > 1 and K → ∞, Z and \sum_{i=1}^{K} i^{-2\alpha} converge, and thus CV[q_j]^2 is non-zero. This is the reason that a hash-based scheme, such as HRW [12], is not able to achieve load balancing in parallel forwarding systems when the population distribution of objects in its input space is Zipf-like.
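The argument can be checked numerically. The sketch below, under assumed parameters (α = 1.2, 16 FE's), draws Zipf-like popularities, assigns addresses to FE's uniformly at random as an idealized hash would, and shows that the CV of the per-FE load does not approach zero as K grows.

```python
# Numerical check of the argument above: with Zipf-like popularities (alpha > 1),
# the coefficient of variation of per-FE load does not vanish as the number of
# distinct addresses K grows, even though the hash spreads addresses uniformly.
import numpy as np

def cv_of_loads(K, m, alpha, seed=0):
    rng = np.random.default_rng(seed)
    popularity = 1.0 / np.arange(1, K + 1) ** alpha      # p_i proportional to i^(-alpha)
    popularity /= popularity.sum()
    fe = rng.integers(0, m, size=K)                      # idealized uniform "hash"
    loads = np.bincount(fe, weights=popularity, minlength=m)
    return loads.std() / loads.mean()

for K in (10**3, 10**4, 10**5, 10**6):
    print(K, round(cv_of_loads(K, m=16, alpha=1.2), 3))  # CV stays bounded away from 0
```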

3.4 Adaptive Load Balancing

Typically, when a system is unbalanced, an adaptive mechanism in the scheduler is invoked to manage the mapping from the system's input space to its output space [2,18]. The result is that some flows will be shifted from the most loaded (source) FE's to less loaded (target) ones. Flow shifting causes disturbance to target FE's and resource waste in source FE's. For example, for target FE's, flow immigration causes cache cold starts, and for source FE's, flow emigration renders established cache entries useless. As the performance of modern processors depends to a large degree on high-speed caching, flow migration could seriously hurt overall forwarding performance. For these reasons it is desirable that the number of flows shifted be small, to achieve minimum adaptation disruption.

It is worth noting that the disruption here is not the same as that in HRW. The latter tries to migrate as few flows as possible during system configuration changes, when servers are removed or added. Here, on the other hand, we are concerned with disruptions caused by migrating flows in order to re-balance workloads.

Most state-of-the-art schedulers migrate flows without considering their rates [2, 18], but this is ineffective. The probability of shifting low-rate flows is high since there are many of them. Shifting these flows does not help balance the system much, but causes unnecessary disruption to FE states. The high-rate flows are few, so it is less likely that they would be shifted first, but it is usually these flows that cause trouble. According to [14], traffic peaks are due to a few aggressive flows rather than the aggregation of many flows. In a hash-based dispatching scheme, such potent flows, representing a large portion of the traffic, are scheduled onto a few FE's out of the whole FE pool. This is a major cause of load imbalance in hash-based parallel forwarding systems.

It is worth noting that the packet spraying mechanism in [2] is similar in spirit to the proposal here. Packet spraying, however, was proposed to deal with "rare" and "emergency" situations when a single flow bundle exceeds the processing power of one FE: it does not actively spray to balance load. The measurements in the previous sections indicate that with simple hashing one FE will often face a workload significantly higher than the others. Moreover, packet spraying operates on bundles. A finer control, e.g., scheduling individual high-rate flows, will be much less disruptive. These are the major departures of the load balancing scheme discussed in the rest of this paper from packet spraying.

Fig. 4. Load Balancing Packet Scheduler: incoming IP traffic passes a flow classifier and a hash splitter; a load adapter and a selector choose the input queue feeding each FE.

4 Load Balancer

The Zipf-like flow size distribution, and in particular the small number of dominating addresses, indicates that scheduling the high-rate flows should be effective in balancing workloads among parallel forwarding processors. Since there are few high-rate flows, the adaptation disruption should be small. Our scheduler design divides Internet flows into two categories: the high-rate flows and the rest. By applying different forwarding policies to the two classes of flows, the scheduler achieves load balancing effectively and efficiently.

Fig. 4 shows the design of our packet scheduler. When the system is in a balanced state, packets flow through the hash splitter to be assigned to an FE. When the system is unbalanced, the load adapter may decide to override the decisions of the hash splitter. When making its decisions, the load adapter refers to a table of high-rate flows developed by the flow classifier.

The hash splitter uses the packet's destination address as input to a hash function. The packet is assigned to the FE whose identifier is returned by the hash function. There are several possible choices for the hash function. For example, the function could use the low-order bits of the address and compute the FE index as the address modulo the number of FE's.

The load adapter becomes active when the system is unbalanced. It looks up the destination address of each passing packet in a flow table to see whether it belongs to one of the high-rate flows identified by the classifier. If so, the load adapter sets the packet to be forwarded to the FE that is least loaded at that instant. Any forwarding decisions made by the load adapter override those from the hash splitter: the selector gives priority to the decisions of the load adapter. In this sense, the hash splitter decides the default target FE for every flow.

As noted above, the load balancer functions only when the system is unbalanced. Periodically, the system is checked and, if it is unbalanced, the load balancer is activated: the least loaded FE is identified and high-rate flows are shifted to it from their default FE's. Later, if the system becomes balanced as a result of the adaptation, the balancer is deactivated and, consequently, the high-rate flows are automatically shifted back to their default FE's. As an alternative, the balancer can remain active all the time, which eliminates flows shifting back when the system becomes balanced.

An important design parameter is F, the size of the balancer's flow table. Generally, shifting more high-rate flows by having more flows in the table is more effective as far as load balancing is concerned. Nevertheless, to reduce cost, speed up the lookup operation, and minimize adaptation disruption, the flow table should be as small as possible. Another component in the system that is critical to the success of the load balancing scheme is the flow classifier (see Fig. 4). The flow classifier monitors the incoming traffic to decide which flows should be put in the balancer's flow table. We discuss this procedure next.
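A minimal sketch of the splitter/adapter/selector interplay described above follows; the flow-table contents, queue-length readings, and modulo hash are assumptions for illustration, not the exact implementation.

```python
# Sketch of the scheduler in Fig. 4: the hash splitter picks a default FE,
# and when the system is unbalanced the load adapter overrides that choice
# for packets whose destination is in the high-rate flow table.
NUM_FES = 4

def hash_splitter(dst_ip: int) -> int:
    return dst_ip % NUM_FES                  # low-order bits, as described above

def select_fe(dst_ip, queue_lengths, high_rate_flows, unbalanced):
    default_fe = hash_splitter(dst_ip)
    if unbalanced and dst_ip in high_rate_flows:
        return min(range(NUM_FES), key=lambda fe: queue_lengths[fe])
    return default_fe                        # selector keeps the splitter's choice

queues = [12, 3, 7, 9]
flow_table = {0x0A000001}                    # assumed entry: one high-rate flow
print(select_fe(0x0A000001, queues, flow_table, unbalanced=True))   # least-loaded FE 1
print(select_fe(0x0A000002, queues, flow_table, unbalanced=True))   # default hash FE
```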

5 Identifying High-rate Flows

Accurately identifying the high-rate flows in incoming traffic is critical to the success of our design. In this section, we describe high-rate flow characteristics and propose a scheme to detect high-rate flows.

5.1 High-rate Flows

Flows that are both large and fast can be the source of long-term load imbalance, but they can also be used effectively to balance load. These flows are similar to the alpha flows in [14]. In addition, taking the bursty nature of Internet traffic into consideration, we also classify as high-rate those flows that are smaller in size but fast enough to cause short-term load imbalance or buffer overflow. It is pointed out in [16] that flow size and lifetime are independent dimensions. There might be a correlation between flow size and rate, but generally the notion of long-lived flows in most previous studies is not accurate enough for our purposes. As a result, short-cut establishment triggering [27] for long-lived flows cannot be used to detect high-rate flows. Instead, we need a mechanism that takes into account both the number of packets and the length of time over which the packets arrive.

5.2 Detecting High-rate Flows

We define the window size, W, as the total number of packets over which flow information is collected. The incoming traffic is divided into a sequence of non-overlapping windows: W_1, W_2, ..., W_n, n → ∞, each containing W packets. Suppose we are receiving packets in W_i. We find the set F_i that contains the largest flows in W_i. The number of flows in F_i equals the size of the flow table, F, i.e., |F_i| = F, with F_0 = {}. At the end of W_i, we replace the flows in the flow table by those in F_i. This mechanism benefits from the phenomenon of temporal locality in network traffic. Due to the packet train [28] behavior of network flows, it is highly likely that flows in F_i are also some of the largest ones over the next W packets; that is, F_i ∩ F_{i+1} ≠ {}. Let δ_i = |F_{i-1} ∩ F_i|. To measure the effect of W on the continuity of the content of the flow table due to temporal locality, we define

\Delta = \frac{\sum_{i=1}^{n} \delta_i / F}{n},

where

n = \frac{N_F}{W}

and N_F is the number of packets forwarded. Thus, 0 ≤ ∆ ≤ 1. The larger the value of ∆, the better the flow information collected in the current window predicts the high-rate flows of the next window. Small W values are preferred when the input buffer size is limited and load adjustments must be made to reflect the existence of short-term bursty flows. Larger W values can be used when the system employs larger buffers to tolerate the load imbalances caused by bursts of small flows.

Fig. 5 shows the effect of W on ∆ for the first one million entries of the four larger traces in Table 1 with F = 5. The larger the value of W, the better the current high-rate flows predict the future. This is critical to the success of the flow classifier. Our experiments show that the largest flow of the entire trace is almost always identified as the largest flow of every window. We show in Section 6 that shifting even only the largest flow is very effective in balancing workloads. To implement high-rate flow detection, another traffic model, the hyperbolic footprint curve [29, 30], u(W) = A W^{1/θ}, A > 0, θ > 1, could be used to relate W to u(W), the total number of flows expected.

Fig. 5. Effect of W on ∆ (F = 5): ∆ vs. window size (packets) for the IPLS, UofA, SDSC, and Auck4 traces.
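A sketch of the window-based classifier described above, with assumed values of W and F:

```python
# Sketch of the window-based classifier: count packets per destination over
# non-overlapping windows of W packets and keep the F largest flows as the
# flow table for the next window. W and F below are assumed values.
from collections import Counter

W, F = 1000, 5

def classify(packets):
    """Yield the flow table (set of top-F destinations) after each window."""
    counts = Counter()
    for n, dst in enumerate(packets, start=1):
        counts[dst] += 1
        if n % W == 0:
            yield {dst for dst, _ in counts.most_common(F)}
            counts.clear()

# Example: one destination sends ten times more packets than the others.
trace = (["10.0.0.1"] * 10 + ["10.0.0.2", "10.0.0.3"]) * 500
for table in classify(trace):
    pass
print(table)   # the dominating destination appears in every window's table
```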

6 Simulations

In this section, we present simulation results to validate our ideas. We use real world traces to drive a forwarding system with multiple FE’s.

The adapter implements the scheduling scheme that decides when to remap flows (the triggering policy), what flows to remap, and where to direct the packets. To effectively achieve load balancing with minimum adaptation disruption the adapter only schedules packets in the high-rate flows. Packets in the normal flows are mapped to FE’s by the hash splitter.

There are multiple choices for the triggering policy that decides when the adapter should be activated to redirect packets. For example, the adapter can be triggered by a clock after a fixed period of time. This scheme is easy to implement as it does not require any load information from the system. Periodic remapping may not be efficient, however, as far as minimizing adaptation disruption is concerned since it could shift load when the system is not unbalanced. As an alternative, the adapter can monitor the lengths of the input queues and remapping can be triggered by events indicating that the system is unbalanced to some degree, based on the input buffer occupancy, the largest queue length, or the CV of the queue length growing above some pre-defined threshold. The system load condition could be checked at every packet arrival for fine-grained control or periodically to reduce the checking overhead.
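The two triggering styles can be sketched as follows; the period, occupancy threshold, and CV threshold are assumed values, not the settings used in the simulations.

```python
# Sketch of two triggering policies for the adapter: a periodic timer and a
# threshold on buffer occupancy or queue-length imbalance. Values are assumed.
import statistics

def periodic_trigger(packet_count: int, period: int = 50) -> bool:
    return packet_count % period == 0

def occupancy_trigger(queue_lengths, buffer_size: int,
                      occupancy_threshold: float = 0.6,
                      cv_threshold: float = 0.5) -> bool:
    occupancy = sum(queue_lengths) / buffer_size
    mean = statistics.mean(queue_lengths)
    cv = statistics.pstdev(queue_lengths) / mean if mean > 0 else 0.0
    return occupancy > occupancy_threshold or cv > cv_threshold

print(periodic_trigger(150))                                # True every 50th packet
print(occupancy_trigger([30, 2, 1, 1], buffer_size=200))    # queue imbalance detected
```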

As another design dimension, the remapping policy decides to which processor(s) the high-rate flows should be migrated. The solution in our simulations is to redirect all the high-rate flows to the FE with the shortest queue.

The hash splitter implements a simple modulo operation to dispatch a packet, i.e., FE_ID = (IP address) % N, where N is the number of FE's. This is another advantage of scheduling high-rate flows: we do not need complex hash functions that generate perfectly uniformly distributed random FE identifiers. The reason is that although there are many low-rate flows, they do not contribute as significantly as a few dominating flows, so a uniform distribution of the input addresses is not as important as the distribution of the high-rate flows. As a result, our scheduling scheme is capable of balancing the load with low-complexity, efficient hash splitters.

Fig. 6. UofA Trace (T=50): CV of workloads vs. the number of flows scheduled (F), for 4 to 128 FE's.

6.1 Simulations with Periodic Remapping

6.1.1 Parameters

In the packet scheduler shown in Fig. 4, the major design parameters include

• the number of entries in the flow table, F
• the number of FE's, N
• the re-scheduling interval T for periodic invocation
• the size of the system buffer

6.1.2 Results

Figs. 6 and 7 show the simulation results for the UofA trace and the first one million entries of the IPLS trace. CV[q_j] (Section 3.3) is the load balance metric. For the UofA trace, scheduling the largest flow is all that is needed to reduce the CV by more than one order of magnitude. This is especially true when the number of FE's is small. For the IPLS trace, we have similar results, although the scale differs. Two trends are obvious from the figures. First, as the number of FE's, N, increases, more high-rate flows need to be scheduled to achieve the same CV as with smaller values of N. Second, when more high-rate flows are scheduled, the load balancing results are better. For most configurations, re-scheduling just the largest flow reduces the CV by one or more orders of magnitude. This is particularly true for the UofA trace, where the largest flow is especially dominant.

Fig. 7. First One Million Entries of the IPLS Trace (T=50): CV of workloads vs. the number of flows scheduled (F), for 4 to 128 FE's.

The results for larger rescheduling intervals show similar trends. Larger T values relax the demand for computing resources by the scheduler and reduce disruption to the processors. On the other hand, larger T values might lead to temporarily undetected load imbalance, or even packet loss. An optimal T, small enough to capture bursts on short time scales, depends on the peculiarities of the traffic and the configuration of the system, e.g., the number of FE's and the size of the input buffer.

6.2 Simulations of an Occupancy-Driven Adapter

Periodic remapping is simple to implement, but may incur unnecessary remapping or undetected imbalance. Instead, we can use some indication of load imbalance from the processors. A combination of a measure of system load or load imbalance and a threshold can be effective: when the metric is over the threshold, the adapter is invoked. This demand-adaptive policy enhances our load balancing scheme. In this section, we present simulation results for a 4-FE system; the threshold used to invoke the adapter is the input buffer occupancy. The FE's all share one physical input buffer, but each has its own virtual queue. We are concerned with the effects of different threshold values, buffer sizes, and numbers of shifted high-rate flows on packet drop rates. In the simulations, the classifier uses a window W of 1000 packets to detect high-rate flows.

For the UofA trace, we assume a fixed service time (1/μ) of 200 milliseconds for each packet at the individual FE's; the mean interarrival time (1/λ) of this trace is about 71 milliseconds. This gives the overall utilization, ρ, of the system:

ρ = λ/μ = (1/71) / (1/(200/4)) = 0.7142.

Fig. 8. Drop Rates in a 4-FE System (UofA Trace): drop rate vs. buffer size for simple hashing and for hashing plus the 1, 3, or 5 largest flows at thresholds 0.4, 0.6, and 0.8.

Fig. 9. Drop Rates in a 4-FE System (First 1 Million Entries of the IPLS Trace): same curves as in Fig. 8.

Fig. 8 shows the results. First, scheduling high-rate flows outperforms simple hashing by a large margin, especially as the buffer size increases. Large buffers are not desirable for latency-sensitive applications, but they are often needed for today's high-speed links to relax the demand for processing power on the FE's. This is due to the gap between optical transmission capacity and electronic processing power in store-and-forward devices. Second, with the same threshold, scheduling more than one of the high-rate flows only modestly improves performance compared with scheduling just the largest flow. On the other hand, lowering the threshold is relatively more effective. This comes, however, at the cost of increasing the frequency of remapping, which causes more disruption to FE states. At the extreme, the adapter is invoked at every packet arrival when the threshold value is 0.

Next, we use the first one million entries of the IPLS trace to drive the simulation, with a system utilization of 0.8. The other parameters are identical to those in the simulations with the UofA trace. Fig. 9 shows the results. Again, compared with hashing only, shifting high-rate flows improves performance by a large margin. The performance of hashing is not as poor as in Fig. 8. This is due to the difference in the peculiarities of the traffic used in the experiments. For example, compared to the IPLS trace, the largest flow in the UofA trace is much more dominant.

6.3 Adaptation Disruption and Packet Reordering

To illustrate the advantage of shifting dominating high-rate flows, we compare the results of two simulations: scheduling only the most dominating flow (MDF) and scheduling only less dominating flows (LDF) to achieve similar drop rates. The MDF is simply the flow that is identified by the flow classifier as the largest flow during a certain period. The LDF’s are the second largest, the third largest, etc. We simulate a periodic mapping policy where after a fixed checking period (T=20) the flows in the flow table are mapped onto the least-loaded FE.

6.3.1 Metrics

The decision of the adapter to re-map high-rate flows to the least loaded FE causes disruption. If the flows in the flow table are not currently destined to the target FE, flow shifts occur. We use ζ to quantify adaptation disruption:

\zeta = \frac{\text{No. of flow shifts}}{\text{Trace length}}.

Let L_i be a flow in a trace, where 0 < i ≤ |S| and S is the set that contains all the flows in the trace. Let P_{i,j} be a packet in L_i, where 0 < j ≤ N_i and N_i is the number of packets in L_i. Let T_{i,j} be the time that the packet P_{i,j} is observed. At the input port, T_{i,j} < T_{i,j+1}, 0 < j < N_i. At the output port, however, due to possible packet reordering, T_{i,j} might be larger than T_{i,j+1}. If

r(i, j) = \begin{cases} 1 & \text{if } T_{i,j} > T_{i,j+1} \\ 0 & \text{otherwise} \end{cases}

then the packet reordering rate R_r for N_F packets forwarded is

R_r = \frac{\sum_{i=1}^{|S|} \sum_{j=1}^{N_i} r(i, j)}{N_F}.

Table 3
Comparison between Shifting Only the Most Dominating High-rate Flow and Shifting Only Less Dominating Ones

Simulation   No. of Flows   CV[q_i]   η       ζ        R_r
Auck4 MDF    1              .172      .133    .0413    .0612
Auck4 LDF    37             .265      .133    1.84     .146
IPLS MDF     1              .0782     .0103   .0357    .00965
IPLS LDF     439            .137      .0121   22.2     .0626
SDSC MDF     1              .181      .0904   .0342    .00546
SDSC LDF     2              .164      .0749   .0714    .00740
UofA MDF     1              .143      0       .0357    .00965
UofA LDF     500            .288      .103    25.2     .0511

6.3.2 Results
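As an illustration, R_r as defined above could be computed from per-flow departure timestamps roughly as follows; the input format (flow id mapped to a list of departure times, in arrival order) is an assumption.

```python
# Sketch of computing the reordering rate R_r defined above from per-flow
# departure timestamps recorded in arrival order.
def reorder_rate(departures_by_flow):
    reordered = 0
    total = 0
    for times in departures_by_flow.values():
        total += len(times)
        # r(i, j) = 1 when packet j leaves after packet j+1 of the same flow
        reordered += sum(1 for a, b in zip(times, times[1:]) if a > b)
    return reordered / total if total else 0.0

log = {"flowA": [0.1, 0.2, 0.5, 0.4], "flowB": [0.3, 0.6]}   # one swap in flowA
print(reorder_rate(log))   # 1 reordered pair out of 6 forwarded packets
```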

Table 3 summarizes the results of applying MDF and LDF to the four traces. With similar packet drop rates, η, scheduling the most dominating flow always causes less adaptation disruption, ζ, and fewer packet reorders, R_r. For the Auck4, IPLS, and UofA traces, scheduling the most dominating flow also achieves a smaller CV[q_i]. More than one lower-rate flow is always needed to achieve similar packet drop rates. The extreme case is the UofA trace, where the highest-rate flow represents around 16 per cent of the aggregate traffic. When it is scheduled onto an FE, even if the rest of the traffic is spread evenly among the other seven FE's, the system is still not perfectly balanced.

It is worth noting that packet reordering in the proposed scheduler occurs only in high-rate flows. In light of maximizing the utilization of system resources, this may not be unfair: as an analogy to the taxation practice in human societies, the flows that consume substantially more resources should be "taxed" more heavily in a public network system.

Table 4
Number of Flows with Different Flow Definitions

Trace   Destination IP   Five-Tuple
UofA    5798             23847
Auck4   5299             210510
IPLS    241774           688460

Table 5
Zipf Parameter a with Different Flow Definitions

Trace   Destination IP   Five-Tuple
UofA    -0.905           -0.901
Auck4   -1.66            -0.840
IPLS    -1.21            -1.01

6.4 Simulations with a Finer Flow Definition

In this section, we demonstrate by simulation that our methods are applicable to finer flow definitions. We define a flow as packets that share the same five-tuple: source and destination IP addresses, source and destination port numbers, and protocol number. It is important to understand the implications of finer definitions of flows. First, the finer the flow-identification method, the more flows are found in the traffic, which indicates that K in Eq. 2 can be significantly larger and a good hash function can spread load more evenly over the FE's. Therefore, we expect smaller gains from scheduling large flows. The numbers of flows in the three traces (IPLS, Auck4, and UofA) with the two flow definitions, i.e., IP destination address and the five-tuple, are listed in Table 4. In addition, measurements show that the absolute value of the Zipf parameter a (Eq. 1) decreases with the finer flow classification. A comparison of the Zipf parameters for the two flow definitions is shown for the three traces in Table 5. The decrease in the absolute value of a indicates that the flow population distributions are more homogeneous with the finer flow definition.
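The finer definition only changes the key fed to the hash splitter and the flow classifier; a sketch with illustrative field names:

```python
# Sketch of the two flow definitions: the coarse key hashes only the destination
# address, the fine key hashes the full five-tuple. Field names are illustrative.
import zlib

def coarse_key(pkt) -> bytes:
    return pkt["dst_ip"].encode()

def five_tuple_key(pkt) -> bytes:
    return "|".join([pkt["src_ip"], pkt["dst_ip"], str(pkt["src_port"]),
                     str(pkt["dst_port"]), str(pkt["proto"])]).encode()

def dispatch(pkt, num_fes, key_fn):
    return zlib.crc32(key_fn(pkt)) % num_fes

pkt = {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.9",
       "src_port": 40000, "dst_port": 80, "proto": 6}
print(dispatch(pkt, 8, coarse_key), dispatch(pkt, 8, five_tuple_key))
```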

At the same time, the largest flows in a trace are much less dominant. For example, in the UofA trace, the largest flow under the five-tuple definition accounts for only 1.4% of the total number of packets, instead of 15.9% with the classification using destination IP addresses. Simulation results are shown in Fig. 10 for a system with eight FE's. The system implements periodic remapping with T = 20 and a system utilization of 0.8. With a finer flow definition, hashing itself becomes more effective. Identifying and scheduling the largest flows still improves performance by a decent margin, although it seems that more flows are needed to further reduce drop rates.

7 Conclusions

Workload characterization is essential in system design and often leads to insights for system performance improvement. In this paper, we have studied flow popularity distribution characteristics in Internet traffic, and we have found that this distribution can be approximated by a Zipf-like function. We show that, contrary to the common belief [12, 18], a simple hashing scheme cannot balance workloads with Zipf-like distributions. The existence of a few dominating high-rate flows in every trace that we examined motivated a novel hash-based adaptive load balancing scheduler for parallel forwarding systems. Our scheduler achieves effective load balancing by identifying and shifting high-rate flows.

Our scheme has low complexity and thus is efficient. The storage demand is trivial and, moreover, as high-rate flows are few, the disruption to system states is minimal. Minimizing adaptation disruption is an important goal of the scheduling scheme described in this paper. Adaptation disruption is quantified as the ratio of flow shifts to trace length, to measure the degree of disturbance caused by different load balancing schemes. Another method, particularly targeted at cache performance evaluation, is to measure the temporal locality of the traffic seen at each FE [3].

It is also interesting to see how the performance of the proposed scheduler, in terms of adaptation disruption and packet reordering, compares with that of other schemes. We have shown that to maintain similar drop rates, schedulers that shift lower-rate flows would have to remap more flows than in our proposal. Thus the scheduler proposed here compares favorably with such alternatives.

Fig. 10. Scheduling Largest Flows with a Finer Flow Definition with the Auck4 Trace (above) and the UofA Trace (below): drop rate vs. buffer size for hashing alone and for scheduling the 1, 3, or 5 largest flows.

References

[1] N. Shah, Understanding network processors, Master's thesis, U. C. Berkeley (September 2001).
[2] G. Dittmann, A. Herkersdorf, Network processor load balancing for high-speed links, in: 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2002), San Diego, CA, USA, 2002, pp. 727–735.
[3] W. Shi, M. H. MacGregor, P. Gburzynski, Effects of a hash-based scheduler on cache performance in a parallel forwarding system, in: Communication Networks and Distributed Systems Modeling and Simulation Conference (CNDS 2003), Orlando, FL, USA, 2003, pp. 130–138.
[4] J. Bennett, C. Partridge, N. Shectman, Packet reordering is not pathological network behavior, IEEE/ACM Transactions on Networking 7 (6) (1999) 789–798.
[5] E. Blanton, M. Allman, On making TCP more robust to packet reordering, ACM Computer Communication Review 32 (1) (2002) 20–30.
[6] W. E. Leland, M. S. Taqq, W. Willinger, D. V. Wilson, On the self-similar nature of Ethernet traffic, in: ACM SIGCOMM 1993, San Francisco, CA, USA, 1993, pp. 183–193.
[7] V. Paxson, S. Floyd, Wide-area traffic: The failure of Poisson modeling, IEEE/ACM Transactions on Networking 3 (3) (1995) 226–244.
[8] R. Jain, A comparison of hashing schemes for address lookup in computer networks, IEEE Transactions on Communications 40 (3) (1992) 1570–1573.
[9] V. Srinivasan, G. Varghese, Fast address lookup using controlled prefix expansion, in: ACM SIGMETRICS 1998, Madison, WI, USA, 1998, pp. 1–10.
[10] M. Degermark, A. Brodnik, S. Carlsson, S. Pink, Small forwarding tables for fast routing lookups, in: ACM SIGCOMM 1997, Cannes, France, 1997, pp. 3–14.
[11] S. Nilsson, G. Karlsson, IP-address lookup using LC-tries, IEEE Journal on Selected Areas in Communications 17 (6) (1997) 1083–1092.
[12] D. G. Thaler, C. V. Ravishankar, Using name-based mappings to increase hit rates, IEEE/ACM Transactions on Networking 6 (1) (1998) 1–14.
[13] K. W. Ross, Hash routing for collections of shared Web caches, IEEE Network 11 (7) (1997) 37–44.
[14] S. Sarvotham, R. Riedi, R. Baraniuk, Connection-level analysis and modeling of network traffic, in: ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, USA, 2001, pp. 99–103.
[15] W. Willinger, M. Taqqu, R. Sherman, D. Wilson, Self-similarity through high-variability: Statistical analysis of Ethernet LAN traffic at the source level, IEEE/ACM Transactions on Networking 5 (1) (1997) 71–86.
[16] N. Brownlee, K. Claffy, Understanding Internet traffic streams: Dragonflies and tortoises, IEEE Communications 40 (10) (2002) 110–117.
[17] Z. Cao, Z. Wang, E. Zegura, Performance of hashing-based schemes for Internet load balancing, in: IEEE INFOCOM 2000, Tel-Aviv, Israel, 2000, pp. 332–341.
[18] L. Kencl, J. Le Boudec, Adaptive load sharing for network processors, in: IEEE INFOCOM 2002, New York, NY, USA, 2002, pp. 545–554.
[19] National Laboratory for Applied Network Research (NLANR) Measurement and Operations Analysis Team (MOAT), Packet header traces, http://moat.nlanr.net.
[20] L. A. Adamic, B. A. Huberman, Zipf's law and the Internet, Glottometrics 3 (2002) 143–150.
[21] M. Aida, T. Abe, Stochastic model of Internet access patterns: coexistence of stationarity and Zipf-type distributions, IEICE Transactions on Communications E85-B (8) (2002) 1469–1478.
[22] P. Barford, M. Crovella, Generating representative Web workloads for network and server performance evaluation, in: ACM SIGMETRICS 1998, Madison, WI, USA, 1998, pp. 151–160.
[23] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web caching and Zipf-like distributions: evidence and implications, in: IEEE INFOCOM 1999, New York, NY, USA, 1999, pp. 126–134.
[24] G. K. Zipf, Human Behavior and the Principle of Least-Effort, Addison-Wesley, Cambridge, MA, 1949.
[25] L. A. Adamic, Zipf, power-laws, and Pareto - a ranking tutorial, http://ginger.hpl.hp.com/shl/papers/ranking/ranking.html (April 2000).
[26] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons, Inc., 1991.
[27] A. Feldmann, J. Rexford, R. Cáceres, Efficient policies for carrying Web traffic over flow-switched networks, IEEE/ACM Transactions on Networking 6 (6) (1998) 673–685.
[28] R. Jain, S. Routhier, Packet trains: Measurements and a new model for computer network traffic, IEEE Journal on Selected Areas in Communications SAC-4 (6) (1986) 986–995.
[29] D. Thiebaut, J. L. Wolf, H. S. Stone, Synthetic traces for trace-driven simulation of cache memories, IEEE Transactions on Computers 41 (4) (1992) 388–410.
[30] W. Shi, M. H. MacGregor, P. Gburzynski, Synthetic trace generation for the Internet, in: The 4th IEEE Workshop on Workload Characterization (WWC-4), Austin, TX, USA, 2001, pp. 169–174.
