MEASURING THE BANDWIDTH OF PACKET SWITCHED NETWORKS

A dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Kevin I-Sen Lai November 2002

© Copyright by Kevin I-Sen Lai 2003. All Rights Reserved.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mary Baker (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Cheriton

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Nick McKeown

Approved for the University Committee on Graduate Studies:


Abstract

Measuring the link bandwidth along a path is important for many applications. Some applications can adapt their content to exchange quality for speed. Some routing architectures route based on the bandwidth of links. Users and organizations benchmark the bandwidth of networking equipment and service providers for purchasing decisions and verifying service level agreements. Researchers measure bandwidth to understand the behavior of the Internet.

However, the diversity of the Internet makes bandwidth measurement challenging. The capacity, utilization, and latency of Internet links vary over several orders of magnitude. In addition, the heterogeneity of administrative domains inhibits the deployment of measurement infrastructure at the network core and end hosts.

The contribution of this dissertation is to increase the accuracy and decrease the overhead of doing end-to-end bandwidth measurement. Our thesis is that it is possible to do accurate and efficient end-to-end bandwidth measurement in the current heterogeneous Internet. We present a variety of techniques to deal with these challenges. Packet Tailgating is a technique to measure all the link bandwidths along a path. It requires no modifications to existing routers and typically requires 50% fewer probe packets than existing techniques, although the estimates of all current end-to-end techniques (including packet tailgating) can deviate from the nominal by as much as 100%. Receiver Only Packet Pair measures the bottleneck bandwidth along a path without requiring measurements from the sending host. Adaptive Kernel Density Estimation Filtering improves the accuracy of bottleneck link measurement by filtering out the effect of cross traffic, even when link bandwidths differ by several orders of magnitude. Potential Bandwidth Filtering improves the accuracy of passive bottleneck link bandwidth measurement by filtering out the effect of applications that send packets at a low rate. The combination of these techniques to measure bottleneck link bandwidth generates estimates that are within 10% of the nominal.


Acknowledgements

For Ching, Grace, and Jessica, who have supported me all these years.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Motivation
    1.1.1 Continuing Importance of Bandwidth
  1.2 Networking and Routing
    1.2.1 Networking Terms
    1.2.2 Routing
  1.3 Approach
    1.3.1 Measure Link Bandwidth
    1.3.2 Measure from End Hosts
    1.3.3 Measure Passively When Possible
  1.4 Challenges
  1.5 Contributions
    1.5.1 Structure
2 Background and Related Work
  2.1 Network Models
    2.1.1 Single Packet Model
    2.1.2 Packet Pair Model
    2.1.3 Multi-Packet Model
  2.2 Attributes of Bandwidth Measurement Techniques
    2.2.1 Active/Passive Measurement
    2.2.2 Measurement Nodes
  2.3 Measuring All Link Bandwidths
    2.3.1 Single Packet Techniques
    2.3.2 Comparison with Multi-Packet Techniques
  2.4 Measuring the Bottleneck Link Bandwidth
    2.4.1 Packet Pair Technique Overview
    2.4.2 Fair Queueing
    2.4.3 NetDyn
    2.4.4 Bprobe
    2.4.5 Tcpanaly
    2.4.6 Pathrate
  2.5 Network Monitoring
    2.5.1 Packet Filter
    2.5.2 Berkeley Packet Filter
    2.5.3 RMON
    2.5.4 Windmill
  2.6 Nonparametric Density Estimation
    2.6.1 Histogram
    2.6.2 Kernel Density Estimation
    2.6.3 Adaptive Kernel Density Estimation
3 Inferring All Link Bandwidths
  3.1 Packet Tailgating Derivation
  3.2 Technique
    3.2.1 Metrics of the Entire Path
    3.2.2 Metrics of the Link
    3.2.3 Metric of Part of the Path
  3.3 Complexities
    3.3.1 Inter-packet Transmission Time
    3.3.2 Clock Skew
    3.3.3 Using Round Trip Measurements
    3.3.4 Inducing Acknowledgements
    3.3.5 Dropped Packets
    3.3.6 Invisible Nodes
  3.4 Qualitative Analysis
    3.4.1 Efficiency
    3.4.2 Robustness
    3.4.3 Large Difference in Adjacent Link Capacities
    3.4.4 Susceptibility to Downstream Queueing
    3.4.5 Accumulation of Error
  3.5 Measurements
    3.5.1 Methodology
    3.5.2 Results
    3.5.3 Analysis
  3.6 Conclusion
4 Inferring the Bottleneck Link Bandwidth
  4.1 Packet Pair Property
    4.1.1 Derivation
  4.2 Measurement Techniques
    4.2.1 Passive Measurement Host(s)
  4.3 Packet Pair Filtering Techniques
  4.4 Cross Traffic Queueing
    4.4.1 Adaptive Kernel Density Estimation Filtering
    4.4.2 Implementation
    4.4.3 Smoothing Parameter
  4.5 Packets Sent at a Low Rate
    4.5.1 Received/Sent Bandwidth Filtering
    4.5.2 Implementation
  4.6 Continuous Measurement
    4.6.1 Age Based Filtering
    4.6.2 Implementation
    4.6.3 Analysis
  4.7 Measurements
    4.7.1 Methodology
    4.7.2 Varied Bottleneck Link
    4.7.3 Resistance to Cross Traffic
    4.7.4 Different Packet Pair Techniques
    4.7.5 Agility
  4.8 Conclusion
5 Conclusions
  5.1 Summary
  5.2 Future Directions
  5.3 Availability
A Distributed Packet Capture Architecture
  A.1 Approach
  A.2 Design and Implementation
    A.2.1 DPCap Packet Formats
    A.2.2 Creating a DPCapServer
    A.2.3 Creating a DPCapClient
    A.2.4 Measuring Timing Granularity
    A.2.5 Packet Information Matching
    A.2.6 Flow Definition
    A.2.7 Bandwidth Consumption
  A.3 Measurements
B Separated Nettimer Bottleneck Results
Bibliography

List of Tables

1.1 Common Link Technologies in the Internet
2.1 Variable Definitions
3.1 Program Versions: This table lists the versions of the programs we used.
3.2 Short Path: This table lists the results of running the link bandwidth programs on the short path. TTL is the distance of the link from the source. C is the number of channels in the link. BW/C is the actual physical bandwidth per channel. Columns 3-6 are bandwidths given in Mb/s.
3.3 Network Load: This table lists the total number of packets transferred during the short and medium path probes.
3.4 Medium Path: This table lists the results of running the link bandwidth programs on the medium path. TTL is the distance of the link from the source. C is the number of channels in the link. BW/C is the actual physical bandwidth per channel. Columns 3-6 are bandwidths given in Mb/s.
4.1 This table shows the different path characteristics used in the experiments. The Short and Long columns list the number of hops from host to host for the short and long path respectively. The RTT columns list the round-trip times of the short and long paths in milliseconds.
4.2 This table shows the different software versions used in the experiments. The release column gives the RPM package release number.
4.3 This table summarizes nettimer results over all the times and days. "Type" lists the different bottleneck technologies. "D" indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. "P" indicates whether the (l)ong or (s)hort path is used. "N" lists the nominal bandwidth of the technology. "TCP" lists the TCP throughput. "RBPP" lists Receiver Based Packet Pair results. "ROPP" lists the Receiver Only Packet Pair results. "SBPP" lists the Sender Based Packet Pair results. (σ) lists the standard deviation over the different traces.
A.1 This table shows the CPU cycles consumed by nettimer and the application it is measuring (scp). "User" lists the user-level CPU seconds consumed. "System" lists the system CPU seconds consumed. "Elapsed" lists the elapsed time that the program was running. "% CPU" lists (User + System) / scp Elapsed time.
B.1 This table shows the 18:07 PST 12/01/2000 nettimer results. "Type" lists the different bottleneck technologies. "D" indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. "P" indicates whether the (l)ong or (s)hort path is used. "N" lists the nominal bandwidth of the technology. "TCP" lists the TCP throughput. "RBPP" lists Receiver Based Packet Pair results. "ROPP" lists the Receiver Only Packet Pair results. "SBPP" lists the Sender Based Packet Pair results. Each of the "Error" columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.
B.2 This table shows the 16:36 PST 12/02/2000 nettimer results; the columns are as in Table B.1.
B.3 This table shows the 11:07 PST 12/04/2000 nettimer results; the columns are as in Table B.1.
B.4 This table shows the 18:39 PST 12/04/2000 nettimer results; the columns are as in Table B.1.
B.5 This table shows the 12:00 PST 12/05/2000 nettimer results; the columns are as in Table B.1.

List of Figures

2.1 This figure shows a model of the network.
2.2 This figure shows the amount of time a packet spends on links 0 and 1. In this example, $s^0$ = 6000 bits, $b_0$ = 2 Mb/s, $d_0$ = 2 ms, $b_1$ = 3 Mb/s, $d_1$ = 5 ms. This packet travels across these two links in $t_2^0 = \sum_{i=0}^{1}(s^0/b_i + d_i)$ = 12 ms.
2.3 This figure shows two packets of the same size traveling from the source to the destination. The wide part of the pipe represents a high bandwidth link while the narrow part represents a low bandwidth link. The spacing between the packets caused by queueing at the bottleneck link remains constant downstream because there is no additional downstream queueing.
2.4 This figure shows the amount of time several packets from a flow spend on links $l-1$ and $l$. In this example, $s^{k-1}$ = 4000 bits, $s^k$ = 2000 bits, $s^{k+1}$ = 12,000 bits, $b_{l-1}$ = 2 Mb/s, $d_{l-1}$ = 2 ms, $b_l$ = 1 Mb/s, $d_l$ = 3 ms. Packet $k$ is queued at link $l$ for $q_l^k = \max(0, 11 - 3 - 5)$ = 3 ms. Packet $k+1$ arrives 1 ms after packet $k$ leaves because something delayed it earlier in the path.
2.5 This figure illustrates how the single packet model and linear regression are used to estimate bandwidth. The graph shows several hypothetical measurements of the round trip delay of packets of different sizes traveling along the same path. The gray samples experienced queueing. The black samples did not experience queueing. The line is the linear regression of the black samples. The inverse of the slope is an estimate of the bandwidth of the path.
4.1 This figure shows four cases of how the spacing between a pair of packets changes as they travel along a path. The black boxes are packets traveling from a source on the left to a destination on the right. Underneath each pair of packets is their spacing relative to the spacing caused by the bottleneck link. The gray boxes indicate cross traffic that causes one or both of the packets to queue.
4.2 The left graph shows some Packet Pair samples plotted using their received bandwidth against their sent bandwidth. "A" samples correspond to case A, etc. The right graph shows the distribution of different values of received bandwidth after filtering out the samples above the x = y line. In this example, density estimation indicates the best result.
4.3 This is a graph of the distribution of bandwidth samples collected from a cross country path with a 10 Mb/s bottleneck link bandwidth. The x-axis is the bandwidth of the sample. The y-axis is a count of bandwidth samples. "Hist 1000000" is a histogram plot with a 1 Mb/s bin width. "Hist 1000" is a histogram plot with a 1 Kb/s bin width.
4.4 This is a graph of the distribution of bandwidth samples collected from a cross country path with a 10 Mb/s bottleneck link bandwidth. The x-axis is the bandwidth of the sample. The y-axis is a count of bandwidth samples. "Hist 1000000" is a histogram plot with a 1 Mb/s bin width. "Kernel" is a plot using an adaptive kernel density estimation function.
4.5 The left graph shows some Packet Pair samples plotted using their received bandwidth against their sent bandwidth. "A" samples correspond to case A, etc. In this example, the ratio of received bandwidth to sent bandwidth is a better indicator than density estimation.
4.6 This is a graph of bandwidth samples collected from a cross country path with a 10 Mb/s bottleneck bandwidth. These samples are computed from the data packets of one TCP flow and the acknowledgements of another TCP flow in the reverse direction. Both the x and y axes are on a log scale.
4.7 This graph shows the bandwidth reported by nettimer using RBPP and ROPP as a function of time. The measurements come from a long path towards a 100 Mb/s Ethernet bottleneck. The Y-axis shows the bandwidth in b/s on a log scale. The X-axis shows the number of seconds since the connection began.
A.1 This C structure defines the format of the configuration information sent from the client to the server.
A.2 This C structure defines the format of the initial configuration packet sent from a server to a client.
A.3 This C structure defines the format of data packets sent from a server to a client. Data packets contain captured packet data.
A.4 This C structure is used to specify parameters in the creation of a distributed packet capture server.

Chapter 1

Introduction

One key aspect of the Internet's performance is the bandwidth at which its users can communicate with each other. We use bandwidth to mean the data-rate (in bits/second) of a link or flow. Although the most highly connected users have access to exponentially greater amounts of bandwidth over time, bandwidth continues to be a limited resource because of 1) the growth of transfer sizes to fill the available bandwidth, 2) the slow adoption of high bandwidth technologies (only 16M users connected to the Internet with a bandwidth of 1Mb/s or more in the United States during April 2001 compared with the 167.1M total U.S. Internet users in May 2001 [net01]), and 3) the proliferation of low bandwidth wireless technologies (95M cellular phone subscribers in the U.S. in May 2000 [Bue00]).

Since bandwidth is limited, administrators, users, and applications need to manage it carefully by replacing, routing around, and/or adapting to the bandwidth-limiting component along a path (the bottleneck). We therefore need a way to measure the bandwidth along a path efficiently, quickly, and accurately. This is not as easy as it may seem at first. Simply deploying measurement equipment in the Internet is impractical because of the network's decentralized administration. But performing measurements from the edge of the network is fraught with difficulties, too. For example, introducing probe traffic alters the very bandwidth we are trying to measure, and can be very inefficient. Routes can change frequently and it is difficult to detect when they do. And link bandwidths in use today vary across five orders of magnitude, requiring accuracy over a wide dynamic range of measurement. All these factors make bandwidth measurement difficult in the Internet.

In this thesis, we present techniques to measure link bandwidth that are faster, more accurate, less obtrusive, and more easily deployable than existing techniques in technologically and administratively heterogeneous networks like the Internet. The result is an implementation of a flexible bandwidth measurement architecture. It makes no assumptions about application or transport protocols, the amount of data transferred, or the stability of Internet traffic, and can therefore be tailored to a wide variety of application requirements.

In the remainder of this introduction, we further motivate measuring bandwidth (Section 1.1), provide some networking background (Section 1.2), describe our approach (Section 1.3), identify the challenges to bandwidth measurement (Section 1.4), and summarize our contributions and the structure of the remainder of the thesis (Section 1.5).

1.1 Motivation

It is important to measure link bandwidth because it is a limited resource. Since it is limited, an application's bandwidth needs may exceed the available capacity, thus limiting the application's performance. For example, a web browsing client or a file transfer client will suffer slow transfers, or a video streaming application will suffer dropped frames or poor resolution. By measuring link bandwidth, applications and users can take action to deal with limited bandwidth:

Benchmarking Equipment and Services  Networking equipment and services are bought at least partially based on the link bandwidth they can support. The businesses that sell equipment and services compete for customers, and therefore have an incentive to claim a higher link bandwidth for lower cost than their competitors. Experience with the Standard Performance Evaluation Corporation (SPEC) benchmark, in which system manufacturers designed their systems to perform better specifically for the benchmark [DR99], suggests that networking companies will eventually hide or overstate the link bandwidth of their equipment or services. Without independent and accurate verification, the vendor's specification of link capacity is suspect.

Bandwidth-Aware Routing  There are frequently multiple paths to the ultimate destination of a communication because the destination is replicated [CC96a] or simply because there are multiple paths between the source and destination hosts. In both cases, understanding the bandwidth of the links along the possible paths allows hosts to predict the performance along them. Understanding the bandwidth of links also allows building efficient overlay networks (e.g., providing resilient routing [ABKM01] or multicast [RM99]).

Adapting Content  Many applications have flexibility in the amount of data that they send and can trade off the fidelity of their content for timeliness of delivery. For example, a web server could scale the size and quality of its pictures, sound, and video depending on the bottleneck bandwidth of the path to the client [FGBA96].

Characterizing Networks  Understanding the structure (the topology and characteristics of links and nodes) of the Internet is valuable for locating and removing bottlenecks, determining realistic simulation parameters, and understanding applications that we cannot conceive of today. However, the Internet has grown beyond the size where anyone has a complete understanding of its state at any point in time. Therefore, it is important to sample its structure (which includes the bandwidth of links) on a continuing basis.

Analyzing Protocols  Link layer and transport protocols deliver data while trying to maximize throughput and meet other design goals (e.g., fairness or congestion control). Knowledge of bandwidth allows protocol developers to determine the efficiency of their protocols and act accordingly.

1.1.1 Continuing Importance of Bandwidth

It is currently the case that Dense Wavelength Division Multiplexing (DWDM) allows link bandwidth to triple each year while Internet traffic is only doubling each year [CO02]. One might hypothesize that if this continues, bandwidth will no longer be limited relative to application demands, and therefore bandwidth measurement will no longer be necessary. However, it is not likely that link bandwidths will grow at a sustained rate that exceeds the growth rate of applications' bandwidth demands [CO02].

Although there have been localized shortages and gluts in time (e.g., Internet traffic doubled every three or four months in 1995-1996, and link bandwidths did not keep pace [CO02]) and space (e.g., there is currently believed to be a glut of link bandwidth in the core of the Internet), historical imbalances did not persist and current imbalances are not likely to persist. Instead, the supply of link bandwidth and applications' demand for it seek an equilibrium point (but do not necessarily reach one) [CO02]. Demand is not likely to greatly exceed supply because the resulting network congestion would lower the performance of applications and reduce demand. Supply is not likely to exceed demand because deploying more link bandwidth costs money and companies only do so if they believe there is sufficient demand to pay the costs. In addition, if supply temporarily exceeds demand, then the performance of existing applications will increase, thus causing demand to increase, and new applications will be created, thus increasing demand even more.

For example, the unusual growth of Internet traffic in 1995-1996 was caused by the popularity of the relatively new World Wide Web application [CO02]. This shortage led companies to increase the growth rate of link capacities. However, the growth in traffic did not persist, although telecommunications companies believed it would [CO02]. This led to the current glut in the Internet's core, which has led those companies to slow their deployments of more link bandwidth. It may be the case that given sufficient funds, DWDM technology could triple link bandwidth each year, but there will only be sufficient funds if there is sufficient traffic. Thus, it is not likely that the sustained growth rate of link bandwidths will exceed the growth rate of applications' traffic such that bandwidth measurement is unnecessary.

1.2 Networking and Routing

In this section, we describe the basic structure and characteristics of a network. This provides background for the description of our approach in measuring the link bandwidth of all the links along a path (in Section 1.3.1) or of just the bottleneck link (in Section 1.3.1).

1.2.1 Networking Terms

Link bandwidth is the maximum data-rate that an end host could transmit across a link if every other link along the path from the sender to the receiver had a higher link bandwidth than that link. For example, a link has a link bandwidth of 10Mb/s if that is the maximum data-rate that an end host could transmit across it even though the other links on the path had a link bandwidth of 100Mb/s. As defined here, a link's bandwidth may be limited by the router it is connected to, and therefore not an inherent property of the link. We define link bandwidth from this end-to-end perspective because we are interested in bandwidth as it affects application performance, and so whether the link bandwidth is a property of the link itself or the combination of the link and its attached router is not relevant in this context. Link bandwidth is affected by anything that contributes packet-length dependent delays to a packet as it travels across one link and to another. Some examples of sources of packet-length dependent delays are the signaling rate of the underlying physical technology, the overhead consumed by link layer headers and media access control protocols, and the rate at which a router can copy data from one link to another.

Nominal link bandwidth is the link bandwidth reported by the manufacturer for a link. We assume that this is the actual link bandwidth in the absence of other evidence.

Bottleneck link bandwidth along a path is the smallest link bandwidth along a path. It limits the rate that the sending node can send along the path.

Available bandwidth is the link bandwidth that other traffic does not consume. More precisely, available bandwidth is the maximum amount of data that a host can transfer over a link during a time interval so that the queue on that link maintains the same length.

Bottleneck available bandwidth along a path is the minimum of the available bandwidths of all the links along the path. The bottleneck available bandwidth depends on both the behavior of traffic and the link bandwidths, but it will never exceed the bottleneck link bandwidth along that path.
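To make the relationship between these path-level terms concrete, the following minimal sketch (not from the dissertation; the struct link representation and the example numbers are illustrative assumptions) computes the bottleneck link bandwidth and the bottleneck available bandwidth as minimums over the links of a path:

```c
/* Illustrative sketch of the definitions above: each link is described by
 * its link bandwidth and the bandwidth its cross traffic consumes, in b/s. */
#include <stdio.h>

struct link {
    double link_bw;       /* link bandwidth (b/s) */
    double cross_traffic; /* bandwidth consumed by other traffic (b/s) */
};

/* Bottleneck link bandwidth: the smallest link bandwidth along the path. */
static double bottleneck_link_bw(const struct link *path, int n) {
    double min = path[0].link_bw;
    for (int i = 1; i < n; i++)
        if (path[i].link_bw < min)
            min = path[i].link_bw;
    return min;
}

/* Bottleneck available bandwidth: the minimum, over all links, of the
 * bandwidth left over after cross traffic; it never exceeds the bottleneck
 * link bandwidth. */
static double bottleneck_avail_bw(const struct link *path, int n) {
    double min = path[0].link_bw - path[0].cross_traffic;
    for (int i = 1; i < n; i++) {
        double avail = path[i].link_bw - path[i].cross_traffic;
        if (avail < min)
            min = avail;
    }
    return min;
}

int main(void) {
    struct link path[] = {
        { 100e6, 20e6 },  /* 100 Mb/s Ethernet carrying 20 Mb/s of cross traffic */
        { 1.5e6, 0.5e6 }, /* 1.5 Mb/s ADSL downlink with 0.5 Mb/s of cross traffic */
        { 10e6,  1e6  },  /* 10 Mb/s Ethernet with 1 Mb/s of cross traffic */
    };
    int n = sizeof(path) / sizeof(path[0]);
    printf("bottleneck link bandwidth: %.0f b/s\n", bottleneck_link_bw(path, n));
    printf("bottleneck available bandwidth: %.0f b/s\n", bottleneck_avail_bw(path, n));
    return 0;
}
```

In this hypothetical path, the ADSL downlink sets both quantities: the bottleneck link bandwidth is 1.5 Mb/s and the bottleneck available bandwidth is 1 Mb/s.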

1.2.2 Routing

There are several aspects of network routing that motivate measuring bandwidth frequently, inexpensively, and separately in each direction. Routers decide how to route traffic by exchanging information about which hosts they connect to. As links and routers fail, routers propagate routing changes throughout the network. This may cause the links along a path to change, causing the link bandwidths along a path to change. One study [Pax97b] observes routes changing as frequently as once every few minutes. Another source of route changes is mobile hosts. These hosts may change location several times a day, with a completely different connection to the Internet in each case. As a consequence of these frequent route changes, measuring bandwidth should be inexpensive enough in node and network resources so that measurement can be done frequently.

Another aspect of network routing is that sometimes routes are asymmetric, i.e. traffic in one direction takes a different path than traffic in the opposite direction. Thus, it is important to measure the bandwidth along a path separately in each direction.

1.3 Approach

Our approach to the applications described above is to measure both the link bandwidth of all the links along a path and the bottleneck link bandwidth along a path, to measure from end hosts, and to measure passively when possible and actively otherwise.

1.3.1 Measure Link Bandwidth

We measure link layer bandwidth to obtain a throughput metric that is not dependent on a specific higher layer protocol (e.g. TCP or HTTP). This allows measurement across the link technologies that exist in the Internet and is a first step towards the development of available bandwidth algorithms.

By developing techniques to measure bandwidth at a lower layer, we can measure bandwidth regardless of the protocol used at higher layers. If a network layer adheres to this strict definition of layering, where layers can only communicate with their neighbors, then lower layers are insulated from higher layers. This is important because there are many higher layer protocols and their relative popularity has changed several times in the history of the Internet. The most popular application layer protocol was once the File Transfer Protocol (FTP) [Ste94], became the Hypertext Transfer Protocol (HTTP) [Ste96], and is now the Gnutella [Kan01] peer-to-peer file sharing protocol. The most popular transport protocol has generally been the Transmission Control Protocol (TCP) [Ste94], but there have been several different versions of TCP with different performance characteristics [FF96].

We measure the bandwidth at the boundary between the link layer and the network layer (the link bandwidth). This is the lowest layer bandwidth that can be measured across the variety of technologies that exist in the Internet because of the hourglass shape of the Internet architecture. By definition, all link technologies in the Internet adhere to the IP protocol at the networking layer but may vary in their implementations and interfaces for layers below that. Therefore, techniques to measure bandwidth lower than the network layer would require understanding the characteristics of all current link technologies and would have to be changed when new ones are deployed in the future.

The disadvantage of measuring bandwidth at a lower layer compared to a higher layer is that the actual bandwidth that a higher layer sees may be very different from the lower layer bandwidth. Bandwidth can only decrease in higher layers. However, using link bandwidth, other lower layer metrics (e.g. latency, cross traffic load, and packet loss), and models for the performance of higher layer protocols (e.g. TCP [MSMO97]), we can calculate the performance of higher layer protocols. The advantage of this approach is that we can use lower layer metrics as building blocks in the modeling of higher layer metrics and therefore do not have to repeat the development of lower layer metric measurement techniques.

The final reason we measure link bandwidth is as a first step towards measuring available bandwidth. Available bandwidth is a better predictor of actual application performance, but we measure link bandwidth because 1) link bandwidth is useful in some cases where available bandwidth is not, 2) link bandwidth results are easier to verify in the Internet, and 3) measuring link bandwidth helps in the measurement of available bandwidth. Link bandwidth is more useful than available bandwidth for benchmarking and characterization (described in Section 1.1) because those applications are more interested in the characteristics of the underlying network than the load on the network. Link bandwidth results are easier to verify in the Internet because traffic load (which affects available bandwidth) changes on smaller time scales (sometimes less than 1ms) and is difficult to verify independently without instrumenting routers. Instrumenting routers on a realistic path in the Internet is far more difficult than determining the nominal link bandwidth of the links on that path. Finally, measuring link bandwidth helps in the measurement of available bandwidth because available bandwidth depends on both cross traffic behavior and link bandwidths (as described in Section 1.2).

Measure the Bottleneck Link Bandwidth

We measure the bottleneck link bandwidth because many of the applications described in Section 1.1 only want to know the maximum rate at which they can send from one host to another. In addition, the bottleneck link bandwidth can be measured in less time, using fewer packets, and with more accuracy than the bandwidth of all the links along a path.
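As an example of the kind of higher layer model cited above, the well-known approximation from [MSMO97] relates steady-state TCP throughput to the maximum segment size ($MSS$), the round trip time ($RTT$), and the packet loss rate ($p$). The form below is the standard statement of that published result, quoted here for illustration rather than taken from this dissertation:

\[
\text{throughput} \approx \frac{MSS}{RTT} \cdot \frac{C}{\sqrt{p}}, \qquad C \approx \sqrt{3/2}
\]

Feeding measured lower layer quantities (bandwidth, latency, loss) into such a model bounds the throughput an application can expect from TCP without having to measure TCP itself.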


Measure the Bandwidth of All Links Along a Path

We measure the bandwidth of all the links along a path for applications that need to know where the bottleneck is along a path (e.g. routing, described in Section 1.1) and applications that need to understand all the sources of delay (e.g. characterization, described in Section 1.1).

1.3.2 Measure from End Hosts

We measure from the end hosts instead of relying on some measurement functionality in the routers. Measuring from end hosts is more challenging than relying on router help, but router help is difficult to deploy and unworkable in the presence of heterogeneous administrative domains.

Router help is difficult to deploy because it requires modifying all the routers along a path, and the Internet has no widely accepted method for querying link capacities from routers. Several standards have been developed [CFSD90] [web00], but none has been widely deployed. In addition, none is likely to be deployed in the future because of the heterogeneity of administration in the Internet. Any one path in the Internet may pass through several domains administered by different organizations (e.g. companies, universities, or countries). These organizations usually have contracts to carry their neighbors' traffic. However, organizations rarely have contracts with organizations with which they do not have direct network links, even though their traffic may traverse many such non-neighboring domains. Consequently, domains have no incentive to report the bandwidth of links to non-neighboring domains accurately even though their traffic may traverse those links. In fact, different domains are frequently competitors, and so have an incentive to avoid revealing the performance characteristics of their internal links.

1.3.3 Measure Passively When Possible

We measure by listening to existing traffic (passively) when possible and by generating our own probe traffic (actively) otherwise. We measure passively when possible to minimize traffic injected into the network, because probe traffic reduces the bandwidth available not only to our own applications, but also to other users of the network. If a technique were used by a significant fraction of the Internet, then the difference between no packets and a few hundred packets per hour could be the difference between a scalable system and an inoperable system. Even if no other application is using the bandwidth, in some cases users must pay for transmissions in money (e.g. a per bit contract with an ISP) or in watts (e.g. a portable host with a limited power supply). However, in some cases, there is insufficient existing traffic to do the desired measurement. In such cases, we generate our own probe traffic, but we try to minimize the amount.

Table 1.1: Common Link Technologies in the Internet

Name      Bandwidth            Bandwidth Symmetry  Wireless  Link Layer Protocol
Ethernet  1Gb/s                symmetric           no        IEEE 802.3z/802.3ab
Ethernet  100Mb/s              symmetric           no        IEEE 802.3u
Ethernet  10Mb/s               symmetric           no        IEEE 802.3
802.11b   11Mb/s               symmetric           yes       IEEE 802.11b
WaveLAN   2Mb/s                symmetric           yes       WaveLAN
ADSL      1.5Mb/s, 128Kb/s     asymmetric          no        ATM/PPPoE
Modem     56Kb/s, 33.3Kb/s     asymmetric          no        V.90/PPP
CDMA      14.4Kb/s             symmetric           yes       CDMA/PPP

1.4 Challenges

In this section, we outline the challenges to doing link bandwidth measurement.

Heterogeneity of links  The heterogeneity of link technologies in the Internet means that we can make very few assumptions about the links we are measuring. Table 1.1 shows some examples of the varying bandwidth, bandwidth symmetry, and link layer protocols of common Internet links. Since the examples shown in the table have bandwidths that vary by five orders of magnitude (from 14Kb/s to 1Gb/s), an algorithm cannot safely assume that an estimate will fall in a restricted range (e.g. 1-100Mb/s). The asymmetry of some links means that we must measure separately in each direction on a link. The different link layer protocols prevent us from relying on knowledge of the workings of any one link layer protocol.

Heterogeneity of traffic  Since link bandwidths can vary by several orders of magnitude, so can the traffic load on those links. This adds another source of highly variable delay that can interfere with bandwidth measurement.

No Router Help  Because of our approach of measuring from end hosts (Section 1.3.2), we cannot know absolutely what is happening at the links along a path. Consequently, we can only infer link characteristics from observing how traffic has been perturbed by the time it reaches the end hosts. However, other traffic along the path, not related to the link we wish to measure, can also perturb the measurement traffic, so we must deal with this interference.

Difficulty of deploying at both sender and receiver  Users and organizations can usually deploy new software at their own hosts, but only rarely deploy software at the host at the other end of their communications. This is because the other host is rarely under their administrative control. Consequently, a measurement technique must be able to cope with only observing the transmission of packets, but not their reception, or the reverse.

Need to minimize measurement probe traffic  Sending massive amounts of measurement probe traffic across a link defeats the purpose of doing the measurement in the first place: to transmit data more efficiently across that link. Active techniques send probe traffic; passive techniques do not. For many networks and applications, it would be preferable to use passive techniques or active techniques that send minimal probe traffic. However, existing traffic may make it difficult or impossible to compute an estimate. Similarly, fewer probe packets may not be sufficient to filter out the interference from cross traffic.


Route and Link Changes  Route and link characteristic changes in the Internet may cause link bandwidth estimates to become invalid. Route changes cause the set of links along a path to change and therefore the bandwidth along that path to change. Stationary hosts may change route as frequently as once a day [Pax97b]. Mobile hosts may change location much more frequently than that, causing their routes to other hosts to change. Link characteristics may change for wireless users because of changes in the signal-to-noise ratio (SNR), which can be caused by changes in attenuation (due to distance between the sender and receiver) and various interference levels (due to fading, multipath, scattering, crosstalk, etc.). For example, the nominal bandwidth of an 802.11b link changes from 11Mb/s to 5.5Mb/s to 2Mb/s as the SNR drops. Route and link changes interact poorly with the need to minimize measurement probe traffic: coping with frequent changes requires doing measurements frequently, which may increase the amount of probe traffic.

1.5 Contributions

In this section, we list our solutions to the challenges described in Section 1.4.

Packet Tailgating  We introduce the Packet Tailgating technique to estimate all the link bandwidths along a path. Packet Tailgating requires no modifications to existing routers and requires fewer probe packets than existing active techniques that address the same problem. We analytically derive the Packet Tailgating technique from a deterministic model of packet delay, showing that it applies to every packet-switched, store-and-forward, FCFS-queueing network. Using measurements on Internet paths, we show that although Packet Tailgating requires 50% fewer probe packets than existing techniques, the estimates of all current end-to-end techniques (including packet tailgating) can deviate from the nominal by as much as 100%.

Analytical Derivation of the Packet Pair Property  To address the problem of measuring just the bottleneck bandwidth along a path, we analytically derive the Packet Pair Property from a deterministic model and show that it applies to every packet-switched, store-and-forward, FCFS-queueing network. The Packet Pair Property has previously been empirically shown to be valid in the Internet.

Receiver Only Packet Pair  To address the problem of only being able to deploy software at one host, we developed the Receiver Only Packet Pair technique. Using the Packet Pair Property, this passive technique measures the bottleneck bandwidth along a path without requiring measurements from the sending host.

Adaptive Kernel Density Estimation Filtering  To cope with the heterogeneity of link bandwidths and cross traffic loads, we developed a kernel density estimation-based technique to filter Packet Pair samples. Simulation and measurements suggest that it is robust across bottleneck link bandwidths and cross traffic that vary by five orders of magnitude.

Potential Bandwidth Filtering  To cope with the problem of having poorly conditioned traffic for doing passive measurement, we developed the Potential Bandwidth Filtering algorithm to filter Packet Pair samples. We show in simulation and measurements that it is robust across a wide variety of traffic conditions.
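The adaptive filtering algorithm itself is developed in Chapter 4. As a rough illustration of the underlying idea only, the following sketch is a generic, fixed-width kernel density estimate, not the dissertation's adaptive algorithm, and its sample values and smoothing width are made up. It scores each Packet Pair bandwidth sample with a Gaussian kernel density estimate and reports the densest sample as the bandwidth estimate:

```c
/* Generic kernel density sketch: the estimate is the sample lying in the
 * densest region of the Packet Pair bandwidth samples. */
#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

/* Sum of Gaussian kernels of width h centered on each sample, evaluated at x. */
static double kernel_density(double x, const double *samples, int n, double h) {
    double density = 0.0;
    for (int i = 0; i < n; i++) {
        double u = (x - samples[i]) / h;
        density += exp(-0.5 * u * u);
    }
    return density / (n * h * sqrt(2.0 * PI));
}

/* Return the sample with the highest estimated density (a crude mode). */
static double densest_sample(const double *samples, int n, double h) {
    double best = samples[0], best_d = -1.0;
    for (int i = 0; i < n; i++) {
        double d = kernel_density(samples[i], samples, n, h);
        if (d > best_d) { best_d = d; best = samples[i]; }
    }
    return best;
}

int main(void) {
    /* Hypothetical samples in Mb/s: most cluster near a 10 Mb/s bottleneck,
     * a few are pushed low or high by cross traffic. */
    double samples[] = { 9.6, 9.8, 10.1, 10.0, 9.9, 3.2, 22.5, 10.2, 9.7 };
    int n = sizeof(samples) / sizeof(samples[0]);
    printf("estimate: %.1f Mb/s\n", densest_sample(samples, n, 0.5));
    return 0;
}
```

The adaptive variant described in Chapter 4 varies the smoothing width with the local density of samples instead of using a single fixed width, which is what lets it work across bandwidths spanning several orders of magnitude.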

1.5.1 Structure

The remainder of the thesis is organized into the following chapters: Background and Related Work (Chapter 2), Distributed Packet Capture Architecture (Chapter A), Inferring All Link Bandwidths (Chapter 3), Inferring the Bottleneck Link Bandwidth (Chapter 4), and Conclusions (Chapter 5).

In Chapter 2, we describe different kinds of packet delay models and previous work to measure network bandwidth. In Chapter A, we describe the design and implementation of a distributed packet capture architecture to aggregate and correlate measurements from multiple nodes in the network. We evaluate the architecture using Internet measurements. In Chapter 3, we derive the Packet Tailgating technique from a packet delay model and analyze it analytically. Using simulation and Internet measurements, we evaluate its effectiveness. In Chapter 4, we describe the Packet Pair Property for FCFS-queueing networks and use it to measure bottleneck link bandwidth. We show the consequences of poor filtering and how our filtering algorithms produce better results. Finally, in Chapter 5, we conclude and enumerate areas for future work.

Chapter 2

Background and Related Work

This chapter provides the context for this work. We separate this work into the following areas: network models, measuring the bandwidth of all the links along a path, measuring the bandwidth of the bottleneck link along a path, distributed network measurement, and density estimation.

2.1 Network Models

In this section, we give an overview of different network models of packet delay. We use these models to derive properties about the network in Chapters 3 and 4, and then use the properties as the basis for measurement techniques. We use models instead of inferring properties directly from measurements (e.g. through curve-fitting) because it is difficult to know whether a set of measurements is representative of the entire Internet and therefore whether a property holds in the entire Internet. Instead, we make some assumptions about the network and use the models that arise from these assumptions to derive useful properties. We describe the Single Packet model (Section 2.1.1), the Packet Pair model (Section 2.1.2), and the Multi-Packet Model (Section 2.1.3). All of these models have the following assumptions:

Packet-Switched Network  In a packet-switched network, network switches decide how to forward each packet independently of other packets, even packets of the same flow. The alternative is a flow-switched network. The Internet is a packet-switched network even though some parts of it run over flow-switched networks (e.g. the telephone network).

Store-and-Forward Switching  Store-and-forward switching means that switches must receive the last bit of a packet before forwarding the first bit. The alternative is cut-through switching. Store-and-forward switching introduces more delay than cut-through, but is much simpler to implement. Most of the Internet uses store-and-forward switching.

FCFS and Drop Tail queueing policy  A queueing policy governs how packets are removed from a queue both to be serviced and dropped. The most common servicing policy is First-Come, First-Served (FCFS). This means that routers service queues by sending packets in the order that they arrived. The most common dropping policy is drop tail. This means that when the queue size exceeds a threshold, the last packet to arrive is dropped. (A minimal sketch of this policy appears below, after Figure 2.1.)

No effect from other traffic  Non-measurement traffic can cause measurement traffic to queue in unpredictable ways. The models we describe in this section assume that there is no non-measurement traffic. Instead, we account for the effect of non-measurement traffic by filtering both the data to be fed to the model and the results that the model produces (Sections 2.3, 2.4, and Chapters 3 and 4). Alternatively, these models could assume that the non-measurement traffic follows a particular distribution for arrival time and packet size. However, such a distribution is not likely to be valid everywhere in the Internet.

Single channel link at the network layer  A packet traversing a single channel link has a transmission delay equal to the packet's size divided by the bandwidth of that link. In contrast, each channel in a multi-channel link has a separate bandwidth. The aggregate bandwidth of the link is the sum of the bandwidths of the channels. As a result, a packet traversing a multi-channel link has a transmission delay equal to the packet's size divided by the bandwidth of the channel that the packet traveled across, not the aggregate bandwidth of the link. This assumption only applies to how the link appears at the network layer, regardless of how it is implemented at lower layers. For example, a router could stripe packet by packet across each of the channels of a multi-channel link. This will appear as a multi-channel link at the network layer. However, if the router stripes each packet byte by byte (or some other sub-packet unit) across each of the links, then the link will appear to be a single channel link at the network layer. Examples of multi-channel links are a BRI ISDN link (composed of two 64Kb/s channels in parallel) [Pax97a] and Cisco's EtherChannel [Inc02] (composed of any number of Ethernet channels). These links may reorder packets, thus causing unnecessary TCP retransmissions and decreasing TCP throughput.

The models in this section use the terms shown in Figure 2.1. Link $l$ has bandwidth $b_l$ and delay $d_l$. There are $n$ links. Packet $k$ has size $s^k$ and there are $p$ packets. Table 2.1 summarizes the variables we use.

Figure 2.1: This figure shows a model of the network.
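The following is the minimal sketch referred to above. It is not from the dissertation; the fixed queue capacity and packet identifiers are illustrative. It shows an FCFS queue with a drop tail policy: packets are served in arrival order, and a packet arriving at a full queue is dropped.

```c
/* Illustrative FCFS, drop tail queue: service in arrival order, drop on full. */
#include <stdio.h>

#define QUEUE_CAPACITY 4

struct fifo {
    int packets[QUEUE_CAPACITY]; /* packet identifiers, in arrival order */
    int count;
};

/* Returns 1 if the packet was enqueued, 0 if it was dropped (drop tail). */
static int enqueue(struct fifo *q, int packet_id) {
    if (q->count == QUEUE_CAPACITY)
        return 0;
    q->packets[q->count++] = packet_id;
    return 1;
}

/* Removes and returns the oldest packet (FCFS service); -1 if empty. */
static int dequeue(struct fifo *q) {
    if (q->count == 0)
        return -1;
    int first = q->packets[0];
    for (int i = 1; i < q->count; i++)
        q->packets[i - 1] = q->packets[i];
    q->count--;
    return first;
}

int main(void) {
    struct fifo q = { {0}, 0 };
    for (int id = 1; id <= 6; id++)
        if (!enqueue(&q, id))
            printf("packet %d dropped (queue full)\n", id);
    int id;
    while ((id = dequeue(&q)) != -1)
        printf("packet %d serviced\n", id);
    return 0;
}
```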

2.1.1 Single Packet Model

Bellovin [Bel92], Jacobson [Jac97b], and Downey [Dow99] use what we call the single packet model for packet delay to measure link bandwidths. Kleinrock ([Kle76], p. 297) describes a similar model. We call this the single packet model because it assumes that there is only a single packet traveling from the source to the destination. This model is mainly interesting as a predecessor to the multi-packet model described in Section 2.1.3.

Table 2.1: Variable Definitions.

$n$ links            hop length of the path
$d_l$ sec.           fixed delay of link $l$
$d_l$ sec.           sum of fixed delays up to and including link $l$
$b_l$ bits/sec.      bandwidth of link $l$
$s^k$ bits           size of packet $k$
$t_l^k$ sec.         time when packet $k$ fully arrives at link $l$
$q_l^k$ sec.         amount of time packet $k$ is queued at link $l$
$lbn$ link number    the bottleneck link

Single Packet Model Equation

The single packet model uses the following equation (variables defined in Table 2.1 and Section 1.2):

\[
t_l^0 = t_0^0 + \sum_{i=0}^{l-1} \left( \frac{s^0}{b_i} + d_i \right), \qquad (2.1)
\]

where $t_l^0$ is the time needed for a single packet to travel across the $l-1$ links before the $l$th node. The single packet model makes several assumptions. It assumes that there is no load dependent delay. Load dependent delay is delay caused by traffic from other flows. Load independent delay is the minimum (or best-case delay [Bel92]) that a packet experiences in an otherwise unloaded network. Load independent delay is composed of the transmission delay ($s^0/b_i$) and fixed delay ($d_i$) of each link. Transmission delay is the delay component that varies with packet size. The single packet model and later models in this chapter assume that the transmission delay is linearly proportional to packet size ($s^0$) and inversely proportional to the link bandwidth ($b_i$). Transmission delay includes the time to transmit a packet onto a link and any packet length dependent delays that routers may contribute. Fixed delay is the component of packet delay that is constant for a link. An example of a fixed delay source is the propagation delay due to the speed of propagation of an electromagnetic wave along the medium. Figure 2.2 shows an example of using the single packet model to obtain packet delay.

Figure 2.2: This figure shows the amount of time a packet spends on links 0 and 1. In this example, $s^0$ = 6000 bits, $b_0$ = 2 Mb/s, $d_0$ = 2 ms, $b_1$ = 3 Mb/s, $d_1$ = 5 ms. This packet travels across these two links in $t_2^0 = \sum_{i=0}^{1}(s^0/b_i + d_i)$ = 12 ms.
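As a quick check of Equation 2.1, the following minimal sketch (not from the dissertation) computes the load independent delay of Figure 2.2's example by summing each link's transmission delay and fixed delay:

```c
/* Equation 2.1 for a single packet: sum of s0/b_i + d_i over the links crossed. */
#include <stdio.h>

/* Time in seconds for a packet of size_bits to cross the first n_links links,
 * given per-link bandwidths (b/s) and fixed delays (s). */
static double single_packet_delay(double size_bits,
                                  const double *bw, const double *fixed,
                                  int n_links) {
    double t = 0.0;
    for (int i = 0; i < n_links; i++)
        t += size_bits / bw[i] + fixed[i];
    return t;
}

int main(void) {
    /* Figure 2.2's example: s0 = 6000 bits, b0 = 2 Mb/s, d0 = 2 ms,
     * b1 = 3 Mb/s, d1 = 5 ms, giving 3 ms + 2 ms + 2 ms + 5 ms = 12 ms. */
    double bw[]    = { 2e6, 3e6 };
    double fixed[] = { 0.002, 0.005 };
    printf("delay = %.3f ms\n", 1000.0 * single_packet_delay(6000, bw, fixed, 2));
    return 0;
}
```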


2.1.2 Packet Pair Model

[Figure 2.3: packets traveling in the flow direction from a source to a destination.]
(t^1_0 − t^0_0), then the packets will arrive at the destination with the same spacing (t^1_n − t^0_n) as when they exited the bottleneck link (s^1/b_{l_{bn}}). The spacing will remain the same because the packets are the same size and no link downstream of the bottleneck link has a lower bandwidth than the bottleneck link (as shown in Figure 2.3, which is a variation of a figure from [Jac88]). Here we state the packet pair equation more formally as a property of the network models that we consider:

Theorem 2.1.1 (Packet Pair Property) Let b_{min(l)} \le b_i (\forall i, 0 \le i \le l); then if we send two packets of the same size (s^0 = s^1) with a small time difference (t^1_0 − t^0_0 \le s^1/b_{min(n-1)}) and there is no cross traffic, they will arrive with a difference in time equal to the size of the second packet divided by the smallest bandwidth on the path (t^1_n − t^0_n = s^1/b_{min(n-1)}).
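To make the property concrete, the following Python sketch converts the arrival spacing of one packet pair into a bottleneck bandwidth sample. It is an illustration only (the function name and units are ours), and it assumes the pair actually queued together at the bottleneck link and encountered no cross traffic afterward, which are exactly the conditions that the filtering techniques of Chapter 4 try to verify.

    def packet_pair_sample(s1_bits, t0n, t1n):
        """Bottleneck bandwidth sample from one packet pair.

        s1_bits  -- size of the second packet, s^1, in bits
        t0n, t1n -- arrival times t^0_n and t^1_n at the receiver, in seconds

        Under the Packet Pair Property, t^1_n - t^0_n = s^1 / b_min(n-1),
        so the returned value is an estimate of the smallest link bandwidth.
        """
        spacing = t1n - t0n
        if spacing <= 0:
            return None  # reordered or identically timestamped packets: discard
        return s1_bits / spacing  # bits per second

    # 12,000-bit packets arriving 12 ms apart imply roughly a 1 Mb/s bottleneck.
    print(packet_pair_sample(12000, 0.100, 0.112))  # approximately 1e6 b/s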

Packet Pair Model Assumptions

In addition to the assumptions listed above (Section 2.1), this model also assumes that the two packets are of sufficient size and are sent close enough together in time that they queue together at the bottleneck link. Carter and Crovella actively send


traffic for measurement purposes to guarantee that this is the case. However, when doing purely passive measurement, this cannot be guaranteed, so some previous work tries to detect and filter out traffic that does not meet this condition (Section 2.4). The problem is that this may filter out the only available samples. We present an algorithm that solves this problem in Chapter 4. Although we assume that routers use FIFO-queueing, if the router uses weighted fair queueing, then packet pair measures the available bandwidth of the bottleneck link [Kes91a]. Although we assume that links are single channel, an extension to the packet pair model (Packet Bunch Mode) avoids this assumption by considering bunches of more than two packets [Pax97a]. We do not consider this extension here because of the rarity of multi-channel links and because larger groups of packets are more likely to encounter interference from cross traffic.

Packet Pair Model Limitations

The limitation of the packet pair model is that it does not take into account end-to-end delay. As a result, this model misses a metric that is important to some applications, including most interactive applications like IP telephony, video conferencing, and distributed games.

2.1.3 Multi-Packet Model

Paxson [Pax97b], Stoica [SZ99], and Banerjee and Agrawala [BA00] describe what we call the multi-packet model. Kleinrock [Kle75] notes the pipelining effect of smaller packets, which is a consequence of the multi-packet model, but deals with average packet sizes instead of distinct packet sizes. In contrast to the previous models, the multi-packet model takes into account both other packets in the same flow and the end-to-end delay. We show in Chapter 4 that the single packet and packet pair models are special cases of the multi-packet model. We use the multi-packet model to derive the packet tailgating technique (Section 3.1) to measure all the link bandwidths along a path.


Figure 2.4: This figure shows the amount of time several packets from a flow spend on links l − 1 and l. In this example, s^{k-1} = 4000 bits, s^k = 2000 bits, s^{k+1} = 12,000 bits, b_{l-1} = 2 Mb/s, d_{l-1} = 2 ms, b_l = 1 Mb/s, d_l = 3 ms. Packet k is queued at link l for q^k_l = max(0, 11 − 3 − 5) = 3 ms. Packet k + 1 arrives 1 ms after packet k leaves because something delayed it earlier in the path.

Multi-Packet Model Equation

The multi-packet model consists of a delay equation derived from two other equations: an arrival time equation and a queueing delay equation. The following arrival time equation is a slight variation on the single packet equation (2.1) (variables defined in Table 2.1):

t^k_l = t^k_0 + \sum_{i=0}^{l-1} \left( \frac{s^k}{b_i} + d_i + q^k_i \right).    (2.3)

This equation predicts that packet k arrives at link l at its transmission time (t^k_0) plus the sum over all the previous links of the latencies (d_i), transmission delays (s^k/b_i), and queueing delays (q^k_i) of those links. Equation 2.3 differs from the single packet equation (2.1) in that it considers the k − 1 packets in the same flow and the queueing delays of those packets. We model the queueing delay due to other packets in the same flow using the following equation:

q^k_l = \max\left(0, t^{k-1}_{l+1} - d_l - t^k_l\right).    (2.4)

This equation predicts that packet k is queued at the router just before link l from the time it arrives at that router (t^k_l) until it can begin transmitting, which is the time when the previous packet (k − 1) arrives at the next router (t^{k-1}_{l+1}) minus the fixed delay of this link (d_l). We assume that the first packet is never queued (q^0_0 = \cdots = q^0_{n-1} = 0).

Figure 2.4 shows an example of using Equation 2.4 to compute queueing delay. Notice that packet k + 1 in the figure is not queued at all because it arrives at link l after packet k has been transmitted. Queueing delay cannot be negative, so the max() function in the queueing equation causes it to be 0 in this case. We combine Equations 2.3 and 2.4 to form the multi-packet delay equation:

t^k_l = t^k_0 + \sum_{i=0}^{l-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right).    (2.5)

The multi-packet equation is at least as powerful as both the single packet and packet pair models because we can derive both of those models from Equation 2.5. To reduce the multi-packet equation to the single packet equation (2.1), we take k = 0. We derive the packet pair model from the multi-packet model in Section 4.1.1.
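Because the multi-packet model is defined recursively, it is straightforward to evaluate numerically. The sketch below is illustrative only (it is not nettimer code); it applies Equations 2.3 and 2.4 link by link to hypothetical link bandwidths, delays, and packet sizes, and with a single packet it reduces to the single packet model as described above.

    def multi_packet_arrivals(send_times, sizes, bandwidths, delays):
        """Arrival times under the multi-packet delay model (Equations 2.3-2.5).

        send_times[k] -- transmission time t^k_0 of packet k (seconds)
        sizes[k]      -- size s^k of packet k (bits)
        bandwidths[i] -- bandwidth b_i of link i (bits/second)
        delays[i]     -- fixed delay d_i of link i (seconds)
        Returns t, where t[k][l] is the time packet k fully arrives at link l.
        """
        n = len(bandwidths)
        t = []
        for k, (t0, s) in enumerate(zip(send_times, sizes)):
            row = [t0]
            for l in range(n):
                # Queueing behind the previous packet of the same flow (Equation 2.4).
                q = 0.0 if k == 0 else max(0.0, t[k - 1][l + 1] - delays[l] - row[l])
                row.append(row[l] + s / bandwidths[l] + delays[l] + q)
            t.append(row)
        return t

    # The single packet example of Figure 2.2: 6000 bits over 2 Mb/s and 3 Mb/s links.
    print(multi_packet_arrivals([0.0], [6000], [2e6, 3e6], [0.002, 0.005])[0][2])  # about 0.012 s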

2.2 Attributes of Bandwidth Measurement Techniques

In the following sections, we describe different bandwidth measurement techniques. In this section, we describe some general attributes of those techniques as a framework for understanding how these techniques relate to each other.

2.2.1 Active/Passive Measurement

An active technique sends packets to be measured while a passive technique relies on packets sent by other applications. An active technique can be further rated by how


many packets it sends into the network for measurement. We do not count packets that are sent for other reasons, such as distribution of measurement results or clock synchronization, because they are dependent on a particular implementation of the measurement algorithm. An active technique may have a large effect on the network. Some active techniques send large amounts of data into the network to collect sufficient data to filter out the effect of transient conditions like congestion and process scheduling on the measurement hosts. These probe packets can saturate a low bandwidth link and deny service to other applications running on the same host or on other hosts in the same network. A technique that sends large amounts of data into the network could not be widely deployed because its traffic would consume a significant fraction of the total bandwidth in the Internet. An advantage of active techniques over passive techniques is that they may provide accurate results more quickly. Given that two techniques both require n packets to calculate bandwidth, the active technique can simply generate the n packets, while the passive technique must wait for n packets to be generated. If the passive technique has to wait so long that one or more of the transient network conditions has changed, then its result will be less accurate than the active programs. If the passive technique continues to see insufficient traffic, and network conditions continue to change, then it may never converge on an accurate answer. In addition, an active technique can create exactly the kind of traffic that it needs to garner an accurate bandwidth measurement. For example, both the techniques described in Section 2.3 and Chapter 3 require such specialized traffic that they are virtually impossible to run using existing traffic patterns. In addition, even the passive techniques described in Section 2.4 and Chapter 4 can only use a subset of the available packets. They must reject packets that are not the same size and packets that are not acknowledged and therefore do not have a round trip time (if the technique is using round trip time instead of one-way delay). An example of traffic that is usually not acknowledged is multicast traffic (acknowledgements would cause ack-implosion). Even packets that meet these requirements may not be optimal for the packet pair property (Section 4.5).

2.2.2 Measurement Nodes

A bandwidth measurement technique must take measurements (packet source, destination, transmission time, arrival time, and size) at specific measurement nodes along a path and of specific packets at those nodes. The measurement nodes attribute of a bandwidth measurement technique is specified relative to an arbitrary flow travelling from a source to a destination. Examples of measurement nodes are the sending node only, the receiving node only, both the sending and the receiving nodes, or all the nodes along a path. Most nodes can measure both the packets they send and the packets they receive; the issue is whether the measurement technique makes use of that information.

The measurement nodes attribute affects ease of deployment and accuracy. A measurement technique that requires fewer measurement nodes is easier to deploy than a technique that requires more nodes. For example, a technique that needs to measure packets at every router along a path in the Internet would be very difficult, if not impossible, to deploy. Even if different administrative domains cooperated, the software would have to be ported to many different kinds of routers, and the router would probably be slower or more expensive than if it did not have to perform such functions. It is easier to deploy an implementation at only the endpoints, and even easier to deploy it at only one endpoint.

Unfortunately, accuracy suffers when measuring at only the endpoints, and even more so when measuring at only one endpoint. For example, cross traffic interferes with measuring the bandwidth of all the links along a path. The endpoints may not have sufficient knowledge of when and where the interference occurred to filter it. Another limitation is that if only one endpoint can measure, then which endpoint it is determines which direction of bandwidth can be measured. More specifically, a sender-based technique can only measure bandwidth away from the measurement node, while a receiver-based technique can only measure bandwidth toward the measurement node. This is a problem for paths with asymmetric routes or asymmetric links. Another problem with measuring at only one endpoint is that the Packet Pair Property requires both the transmission and arrival times of a pair of packets. This


is impossible to do when taking measurements at only one node. Sender-based techniques work around this by using round trip times instead of arrival times. Receiver-based techniques work around it by simply assuming that the transmission times meet the Packet Pair Property's requirements. In both cases, the solutions could significantly impair accuracy. In the sender-based case, round trip time measurements are susceptible to interference from cross traffic on the return path. In the receiver-based case, the result may be polluted by invalid bandwidth samples which did not meet the Packet Pair Property's requirements.

2.3 Measuring All Link Bandwidths

In this section, we describe previously developed techniques to measure the bandwidth of all the links along a path.

2.3.1 Single Packet Techniques

Bellovin [Bel92] and Jacobson [Jac97b] use the single packet delay model (Section 2.1.1) to develop a technique for measuring link bandwidths. Although Equation 2.1 specifies the one-way delay, Bellovin and Jacobson instead use the round-trip delay to successive routers along a path. The round-trip delay can be modeled as the sum of the one-way delay for the initial packet and that of a packet sent in response to the initial packet (the acknowledgement packet or ack ). Bellovin sends ICMP ECHO packets to routers, which respond with ICMP ECHO-REPLY packets. Jacobson sends UDP packets to the destination with a time-to-live (TTL) field equal to the distance in hops of the particular router. When the TTL expires at that router, the packet is dropped and the router responds with an ICMP TIME-EXPIRED packet. Bellovin and Jacobson resolve the problematic assumption about no queueing by observing that queueing caused by additional traffic can only increase delays. Therefore, the minimum of several observed delays of a particular packet size fits the model. Their technique is to send several packets for each of several different packet sizes, plot the delays of these packets versus their sizes, and then use linear

Figure 2.5: This figure illustrates how the single packet model and linear regression are used to estimate bandwidth. The graph shows several hypothetical measurements of the round trip delay (in seconds) of packets of different sizes (in bytes) traveling along the same path. The gray samples experienced queueing and are filtered out. The black samples did not experience queueing; the minimum delay for each size is kept. The line is the linear regression of the black samples, and its slope is 1/bandwidth: the inverse of the slope is an estimate of the bandwidth of the path.


regression (Figure 2.5) to obtain the slope of the graph. The inverse of the slope is the bandwidth. In practice, the problems with this technique are that 1) linear regression is expensive, 2) routers may not send acknowledgement packets in a timely manner, 3) some nodes are “invisible”, and 4) the reverse path adds noise.

Cost of the Single Packet Technique

The linear regression described above is expensive in the number of packets it must send. Jacobson [Jac97a] provides pathchar as an implementation of the single packet technique. In this section, we analyze the time taken and bandwidth consumed by pathchar [LB99]. Pathchar sends packets varying in size from 64 bytes to the path MTU with a stride of 32 bytes. Therefore, the number of different packet sizes pathchar sends is

s = \left\lfloor \frac{MTU}{32} \right\rfloor - 1.    (2.6)

For Ethernet, the MTU is 1500 bytes, so s is 45. In addition, it sends p packets per size for every hop. In the default configuration, p = 32. It must wait for each packet it sends to be acknowledged before sending the next packet. Thus, the total time for pathchar to run is

\sum_{i=1}^{h} p \cdot s \cdot l_i,    (2.7)

where h is the number of hops and l_i is the round trip fixed delay from the sender to hop i. We assume that the receiver immediately sends an ack in response to a packet and that the sender immediately sends out the next packet when an ack arrives. For a 10-hop Ethernet network with an average round trip fixed delay of 10ms, pathchar would run in 144 seconds. This is too slow for a host to run it for every TCP connection, or even every 10 minutes. It can be configured to send fewer packets of each size, but at the cost of accuracy.


More importantly, pathchar consumes considerable amounts of network bandwidth. The average bandwidth used for probing a particular hop is

\frac{\text{average packet size}}{\text{round trip fixed delay}} = \frac{32 \cdot s / 2 + 32}{l_i}    (2.8)

in bytes/s, where l_i is the round trip fixed delay (in seconds) across that hop. For a 1-hop Ethernet network with a fixed delay of 1ms, the average bandwidth consumed is 6.02Mb/s. This would be a considerable imposition on a 10Mb/s Ethernet. Farther hops would consume less bandwidth, but pathchar always has to probe closer hops before farther hops. Furthermore, the total data transferred is

p \cdot h \left( \sum_{i=2}^{s} 32 i \right),    (2.9)

where h is the number of hops. For the 10-hop Ethernet network mentioned before, pathchar sends 10 MB of data. In fact, pathchar will send 10 MB of data on a 10-hop network regardless of the bandwidth of the network, since it only depends on the number of hops, the path MTU, and p. If the path MTU is high and one of the early hops is a low bandwidth network link, such as a 56K modem, then pathchar can consume most of the bandwidth of that link for an extended amount of time. This means that we would have problems scaling pathchar usage up to a large number of hosts. Although some of this expense can be mitigated by adapting the number of packets sent to the observed variance in measurements, the remainder of the expense is inherent to the single packet technique. Downey [Dow99] uses adaptive statistical methods to detect the convergence of a link bandwidth estimate to avoid sending further packets. He shows that this reduces the number of packets required when there is little cross traffic. However, the single packet technique still requires that p be sufficiently large to perform an accurate linear regression. For p = 2, even small variations in the measurements can cause large differences in bandwidth estimates, so typically p ≥ 4. We describe how to remove this limitation in Section 3.2.
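Equations 2.6 through 2.9 are simple enough to evaluate directly. The sketch below (ours, not pathchar code) reproduces the numbers quoted above for a 10-hop Ethernet path with p = 32 and a 10 ms round trip fixed delay per hop.

    def pathchar_overhead(mtu_bytes, p, hop_rtts):
        """Probing cost of the single packet technique, from Equations 2.6-2.9.

        mtu_bytes -- path MTU in bytes
        p         -- packets sent per size per hop
        hop_rtts  -- round trip fixed delay l_i to each hop, in seconds
        """
        s = mtu_bytes // 32 - 1                               # packet sizes    (2.6)
        total_time = sum(p * s * l for l in hop_rtts)         # seconds         (2.7)
        avg_bw = [((32 * s) / 2 + 32) / l for l in hop_rtts]  # bytes/second    (2.8)
        total_data = p * len(hop_rtts) * sum(32 * i for i in range(2, s + 1))  # bytes (2.9)
        return s, total_time, avg_bw, total_data

    s, secs, bw, data = pathchar_overhead(1500, 32, [0.010] * 10)
    print(s, secs, data)  # 45 sizes, about 144 seconds, 10,588,160 bytes (roughly 10 MB)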


Timely Acknowledgements

The single packet technique requires getting timely acknowledgements from routers. This has the advantage that no special software needs to be deployed on routers to gather timing information, but unfortunately, it may not work in all parts of the Internet. Because of malevolent use of ICMP packets, some routers and hosts either rate-limit them or filter them out [Sav99], thus slowing down or precluding measurement.

Invisible Nodes

Bridges, host operating systems (OSs), and network interface cards (NICs) are usually store-and-forward nodes but do not decrement the IP TTL and are not individually addressable in IP. Consequently, the IP TTL decrement method cited above cannot detect or measure links corresponding to these “invisible” nodes. There is a node between the source application and the source operating system because the sending OS usually must copy packets from the application's address space to the kernel's. In addition, the source OS's network card driver usually must copy the sent packet from kernel address space across the system bus to the NIC. Finally, if the destination is a PC, the packet usually must be copied from the destination's NIC to the destination's kernel address space. The application–kernel, kernel–NIC, and NIC–kernel copies usually must be individually complete before the packet can be forwarded any further in the pipeline. These invisible nodes cause error in the measurement of the next link.

Reverse Path Interference

Relying on acknowledgements and round-trip delays means that there is twice the possibility that queueing could corrupt a sample when compared to a technique that relies only on one-way delay. Queueing in the reverse path can delay the acknowledgement, even if there is no queueing in the forward path. As a result, many packets may be required to filter out the effect of other traffic and calculate a regression with high confidence. To use one-way delay, single packet-based techniques would need to deploy new functionality at every router on a path.

2.3.2 Comparison with Multi-Packet Techniques

The packet tailgating technique described in Section 3.2 can overcome most of these limitations. It performs linear regression only once instead of once for each link. Because it does not rely on timely delivery of acknowledgements from routers, it is robust against routers that generate ICMP packets inconsistently. Finally, it can increase accuracy and reduce the number of packets sent by measuring one-way delay instead of round trip delay, without requiring new software at routers. Packet tailgating still suffers from the invisible node problem; we partially address it in our implementation described in Section 3.3.

2.4 Measuring the Bottleneck Link Bandwidth

In this section, we describe previous work in measuring just the bottleneck link bandwidth (the bottleneck bandwidth) along a path. Although this bandwidth is a subset of the results of the techniques described in the previous section, we consider measuring the bottleneck bandwidth a separate problem because it is more useful and it can be measured in fewer packets than measuring the bandwidth of all the links along a path. The bottleneck bandwidth is more useful to know because it limits the available bandwidth along a path, which is a limiting metric for many applications.

2.4.1 Packet Pair Technique Overview

Bolot [Bol93], Carter and Crovella [CC96a], Paxson [Pax97b], and Dovrolis et al. [DRM01] all use the packet pair model (Section 2.1.2) to measure bottleneck bandwidth. The general form of these techniques is 1) to actively generate probe traffic or passively listen to existing traffic, 2) capture packet timings and sizes at one or both of the sending host and the receiving host, 3) use the packet pair property to calculate samples from the captured information, and 4) filter out samples which have been affected by queueing delays from cross traffic.

2.4.2 Fair Queueing

Keshav [Kes91b] uses packet pair as a technique to measure congestion in a fair-queueing network. A transport end-point sends two packets at a rate higher than the bottleneck available bandwidth. Keshav proves that in a fair-queueing network, the separation of the acknowledgements of the two packets is inversely proportional to the bottleneck available bandwidth. This can then be used to prevent the sender from overloading the network. Our work differs from Keshav's in that we are interested in FIFO-queueing networks. In such networks, the packet pair technique measures the bottleneck link bandwidth instead of available bandwidth.

2.4.3 NetDyn

Bolot measures bottleneck bandwidth using the NetDyn tool, developed by Sanghi and Agrawala [AS92]. NetDyn actively sends 32 byte UDP packets from a source host to an intermediate host at regular intervals. The intermediate host (where NetDyn software has been deployed) forwards the packets to the destination host. At each point, the packet is timestamped. In this case, the destination host is the source host, allowing the total delay (the round trip time) to be measured without clock skew. The round trip time of packet n is plotted against packet n + 1. Through manual inspection, a line is fitted to the plot. The intercept of the plot is the transmission interval and the inverse of the slope is the bottleneck bandwidth.

Analysis

This is an early application of the packet pair property to FIFO-networks. It provides the key insight of how to use the packet pair property to measure link bandwidth. However, it is not clear how to set the transmission interval to robustly measure different link bandwidths. In addition, Bolot raises but does not address the issue of how to automate the filtering process.

2.4.4 Bprobe

Carter and Crovella build on Bolot’s work to develop the bprobe tool. Bprobe actively sends ICMP ECHO packets to the destination host (which can be any Internet host) causing it to respond with an ICMP ECHO-REPLY. Bprobe uses the response packet to measure the round trip time. Using the packet pair property, bprobe computes bandwidth samples from the round trip times. Bprobe sends packets in 7 phases. Each phase consists of 10 packets of the same size. The packet size for each phase varies from 124 bytes to 8000 bytes. The packet size variance is designed to handle varying Maximum Transmission Unit (MTU) values on different paths. To do the filtering, Carter and Crovella make a key insight. They theorize that cross traffic is not highly correlated in size or arrival rate, and therefore samples that encounter cross traffic queueing will not be highly correlated while samples that did not encounter cross traffic queueing will be highly correlated. As a result, bprobe filters the bandwidth samples by attempting to find correlation. The technique is to 1) compute a set of intervals from each phase, 2) start with the set of intervals from the largest packet size and combine it with sets from successively smaller packet sizes by taking the union of intervals, and 3) select the interval from the combined set that is the union of the most intervals from different sets. The intervals are initially computed by expanding each bandwidth sample until an adequate bandwidth estimate is computed. Carter and Crovella validate bprobe on a variety of Internet paths with bottleneck bandwidths of 56Kb/s, 1.54Mb/s, and 10Mb/s. They find that their techniques are more accurate on shorter paths and smaller link speeds. On some paths, they find that there are multiple modes in the distribution of bandwidth samples. Although the largest mode is close to the expected result, the presence of the other modes indicates that cross traffic is sometimes correlated in size or arrival time.


Analysis

Bprobe resolves the issue raised by Bolot about how to build a tool that automates the measurement process and provides the key insight about how to do the filtering. However, several issues remain unresolved or unclear: 1) packet size, 2) interval weighting, 3) interval expansion, and 4) robustness with widely varying bandwidths. Carter and Crovella note that larger packet sizes are better sources of samples because they are more likely to cause queueing at the bottleneck link. Bprobe considers this by combining sets from larger packets before those from smaller packets. We provide a more comprehensive framework for this in Section 4.5.

Another issue is that the final interval is chosen based on how many sets contribute to it instead of how many samples contribute to it. It is not clear what the basis for this decision is. It is more intuitive to count the samples, which is what we do in Section 4.4.1. In addition, it is not clear how the intervals are expanded. If they are expanded by a fixed amount, then two intervals may be combined that are otherwise relatively far apart. For example, intervals A and B are close and intervals C and D are far apart. At one step of the interval expansion, neither A and B nor C and D are combined. At the next step of expansion, A is combined with B and C is combined with D. Because of the fixed rate of expansion, an intermediate step where A is combined with B but C is not combined with D is skipped. Consequently, the expansion rate may change the result by an arbitrary amount. We revisit this issue in Section 4.4.1.

The final issue is robustness with widely varying bandwidths. It is not clear how the filtering algorithm handles widely varying bandwidths. In the previous example, if A = 50 Kb/s, B = 100 Kb/s, C = 10 Mb/s, and D = 10.05 Mb/s, then in absolute distance, A and B are just as highly correlated as C and D, but in relative distance, C and D are far more highly correlated than A and B. We resolve this issue in Section 4.4.1.

2.4.5 Tcpanaly

Paxson extends packet pair-based bottleneck link measurement with the tcpanaly tool. Tcpanaly is a passive tool that uses traces collected at one or more nodes in the network. Using tcpanaly, Paxson studies 17,575 traces of TCP flows taken from 40 different points in the Internet. Paxson identifies several challenges to measuring bottleneck bandwidth: packet reordering, multi-channel links, poor timing precision, changing bandwidth, and asymmetric bandwidth. Paxson notes that packet reordering disrupts application of the packet pair property. It indicates a more general problem of successive packets not traversing the same path. Reordering can be detected by comparing the ordering of the transmission times to the ordering of the receive times or by examining TCP sequence numbers. Paxson addresses reordering by filtering out reordered packets. Multi-channel links cause significant problems as we describe in Section 2.1. Paxson addresses this by extending the packet pair model to consider bunches of more than two packets using an algorithm called Packet Bunch Mode (PBM). However, he notes that larger groups of packets are more likely to encounter interference from cross traffic. Poor timing precision interferes with bottleneck bandwidth measurement because it causes error in the packet transmission and reception times. Paxson also addresses this by considering bunches of more than two packets. Another issue is changing bandwidth. The bottleneck bandwidth along a path could change because of a route change or because a changing electro-magnetic environment affects a wireless link. This is a problem for a technique that monitors continuously because a route change is likely to happen eventually. Paxson addresses this by computing multiple estimates over a trace and determining whether the estimates overlap in time. If they do not then this is considered a bandwidth change. Paxson also identifies the challenge of paths with asymmetric bottleneck bandwidth. As mentioned above (Section 2.2.2), this is a problem when measuring from only one node in the network. Paxson calls measurement from just the data sender Sender Based Packet Pair (SBPP). The solution is measuring at both the sender and the receiver. This technique is called Receiver Based Packet Pair (RBPP), although


a more appropriate name would be Sender Receiver Based Packet Pair (SRBPP).

Analysis

The Paxson study is comprehensive and thorough. However, our study differs from Paxson's in 1) the consideration of multi-channel links, 2) the measurement timing, 3) the degree of study control, and 4) the use of heuristics. Paxson considers and studies multi-channel links, but we do not because of their rarity at the time of both Paxson's and our study and the accuracy cost of considering them. Paxson finds that PBM and RBPP give exactly the same result 80% of the time and differ by more than 20% only 2-3% of the time. This indicates that multi-channel links were rare on the paths Paxson examined. For our study, we could not acquire the multi-channel technology that Paxson examined (BRI ISDN) and an ISP that would support it. In addition, as mentioned above, to consider multi-channel links requires considering bunches of more than two packets. This increases the likelihood of noise.

Paxson studies traces, but we study techniques to do measurement in real time. Tcpanaly assumes that it is operating on a complete trace of a 100Kb bulk transfer TCP connection. For some applications, it is useful to get an estimate before transferring large amounts of data. Consequently, we study techniques that can generate an estimate after seeing only a few packets and then efficiently update that estimate after seeing later packets. It is not clear what kind of an estimate tcpanaly will generate after seeing only a few packets and it likely has to be run from scratch when later packets arrive. In addition to working on small numbers of packets, the techniques we study can also see packets and generate new estimates indefinitely. It is likely that tcpanaly will run out of memory given a sufficiently long trace. This is not simply an implementation issue because we encountered significant problems designing a continuously running algorithm (see Chapter 4).

Paxson's study is larger but lacks a controlled bottleneck link, while ours is smaller but includes a controlled bottleneck link. Paxson examines 17,575 different traces across a multitude of different paths. However, since so many different nodes are involved, it is difficult to verify manually that the estimated bottleneck bandwidth is


close to the nominal bottleneck bandwidth as reported by the manufacturer. Instead, we examine 280 different traces across only a few different paths, but we manually set the bottleneck link technology so that we can verify the accuracy of different techniques. Finally, Paxson uses several heuristics in filtering. For example, the filtering algorithm tests the “expansion factor” of bandwidth samples and accepts samples so that .2 ≤ f ≤ .95. It is not clear what the significance of these values is or how they should be changed. Similarly, the custom density estimation algorithm drops clusters of size less than 4. It is not clear what happens if all the clusters are of size less than 4. Another example is that the algorithm tests the samples for consistency against the overall population of samples. If the consistency is less than .2 or .3, then the sample is dropped. Again, it is not clear what the significance of this value is. It is inevitable for algorithms that operate on a heterogeneous system like the Internet to have odd constants (as the algorithms in later sections certainly do). However, our approach is both to minimize their use and to indicate the impact of changing these constants so that there can be guidelines for setting them.

2.4.6 Pathrate

Dovrolis, Ramanathan, and Moore [DRM01] study packet pair techniques using active measurements and simulation. They determine that when using a histogram to find correlation in the bandwidth sample distribution, large amounts of cross traffic can create modes in the histogram that are larger than the mode caused by the bottleneck bandwidth. This significantly complicates finding correlation of bandwidth samples using a histogram. In addition, they observe that neither maximum nor minimum size packets are optimal for gathering packet pair bandwidth samples. Maximum size packets have a better resistance to poor timing precision because the timing error is usually fixed relative to the packet size. Therefore, larger packets will generate bandwidth samples with smaller error. However, smaller packets are less likely to encounter queueing between the two packets of the packet pair because that spacing is smaller for smaller


packets. Thus, the optimal measurement packet size depends on both the timing precision of the measurement hosts and the cross traffic along the path. These observations motivate the development of the pathrate tool. Pathrate deals with the problem of multiple modes in the histogram by using heuristics to select the correct mode. It deals with the packet size problem by using 800 byte packets as a compromise between the typical Internet path Maximum Transmission Unit (MTU) size of 1500 bytes and the minimum size of 20 bytes.

Analysis

Dovrolis et al. attribute the multiple mode histogram problem to the underlying distribution. However, there are several problems with using histograms for density estimation (Section 2.6), one of which could cause multiple modes. It is not clear whether the multiple modes actually exist in the underlying distribution or are an artifact of using a histogram. In addition, the restricted variation of bandwidths (25Mb/s and 100Mb/s) measured by Dovrolis et al. allows them to use a fixed histogram bin size of 1Mb/s. This would not be effective for bandwidths that fall significantly outside this range, e.g. 56Kb/s or 1Gb/s. We use kernel density estimation, which does not suffer from the same problems as histograms.

Banerjee and Agrawala

Banerjee and Agrawala [BA00] improve the filtering of packet pair-generated bandwidth samples by noting that samples that have encountered some queueing have a longer round trip time or one way delay than those samples that did not. They filter out all samples except those whose round trip times are below the 1-2 percentile of the round trip time distribution of all the samples. Their analysis shows that this filtering produces significantly different results, but it is not clear what the exact increase in accuracy is.
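Banerjee and Agrawala's delay-based filter is easy to restate in code. The sketch below is our reading of the description above (it is not their implementation): it keeps only the bandwidth samples whose round trip time falls at or below a low percentile of all observed round trip times.

    def low_delay_filter(samples, percentile=0.02):
        """Keep bandwidth samples whose RTT lies in the lowest `percentile` of RTTs.

        samples -- list of (rtt_seconds, bandwidth_sample) pairs
        """
        if not samples:
            return []
        rtts = sorted(rtt for rtt, _ in samples)
        cutoff_index = max(0, int(len(rtts) * percentile) - 1)
        cutoff = rtts[cutoff_index]
        return [bw for rtt, bw in samples if rtt <= cutoff]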

2.5 Network Monitoring

The algorithms described above need to know the sizes, transmission times, and arrival times of packets from specific flows. One way to gather this information is to use a distributed network monitoring infrastructure. Distributed network monitoring consists of local monitoring at two or more nodes, distributing the monitored information to a remote node, and correlating the reports from different nodes together. Previous work in this area has focused on local network monitoring (Sections 2.5.1 and 2.5.2), distributing the monitored information to a remote node (Section 2.5.3), or both (Section 2.5.4).

2.5.1 Packet Filter

Mogul, Rashid, and Accetta [MRA87] develop a packet filter mechanism to allow efficient user-level implementations of network protocols in kernel-based operating systems like UNIX. They also mention network monitoring as an application. Their work derives from earlier work that used packet filters to demultiplex packets for a single address space operating system (the Xerox Alto). They do not consider the problem of gathering together packet information from distributed filters. Mogul et al. argue that user-level implementations of network protocols are easier to implement and maintain and are more portable than kernel implementations. However, user-level protocol implementations without packet filtering are inefficient because packets have to be copied from kernel space to the user-level demultiplexing process, back into the kernel, and then to the destination process. Their solution is to reduce the number of packet copies by pushing the demultiplexing into the kernel. This reduces the number of software layers between the filtering software and the network hardware.

2.5.2 Berkeley Packet Filter

McCanne and Jacobson [MJ93] build on the work of Mogul et al. by developing a filter language that can be efficiently implemented on RISC architectures and by


filtering packets in interrupt context to reduce intra-kernel copies. Filtering packets in interrupt context reduces intra-kernel copies because some kernels must copy a packet before it can be processed outside of interrupt context. This continues the approach of Mogul et al. of reducing the number of software layers between the filtering and the network hardware. We use libpcap, a cross-platform derivative of the Berkeley Packet Filter, for local packet capture.

2.5.3 RMON

RMON [Wal00] is an SNMP [CFSD90] MIB that specifies protocols for managing remote network monitors and distributing the monitored information. It defines statistics that are calculated locally at the monitor and allows raw packets to be captured and sent to SNMP clients. RMON does not specify how packet capture information from different nodes can be correlated together. The main advantage of using RMON is compatibility with existing monitors and querying applications. However, RMON cannot efficiently and reliably transfer large amounts of packet capture data [WRWF96]. RMON uses UDP without retransmissions to send data to clients and therefore may drop packets. In addition, each packet transferred requires a MIB traversal and explicit client request. We could have used RMON as simply a control protocol and devised our own reliable and efficient data transfer protocol, but we could not retain compatibility with existing monitors and applications, thereby removing the advantage of using RMON. Consequently, we use our own simple, reliable, and efficient protocol (described in Appendix A).

2.5.4 Windmill

Malan and Jahanian develop an extensible distributed passive monitoring architecture called Windmill [MJ98]. A Windmill experiment engine understands a variety of transport and application-level protocols so that it can distill a high bandwidth packet stream into a low bandwidth stream of protocol-specific events. Only these events are disseminated to clients. The advantage of this approach is that it can reduce the


bandwidth required for dissemination by five orders of magnitude. We take a different approach in Appendix A.

2.6 Nonparametric Density Estimation

The bottleneck bandwidth measurement algorithms described above filter out the effect of cross traffic by finding modes in a distribution of bandwidth samples. One class of statistical techniques that can be used to find modes is nonparametric density estimation [WJ95] [Sco92]. Given a random variable X with probability density function f and samples of X, x_0, x_1, \ldots, x_{n-1}, a univariate density estimator estimates f from x_0, x_1, \ldots, x_{n-1}. A parametric density estimator would assume that f has a particular distribution form (e.g. normal) and then estimate the unknown parameters of that form from the data. A nonparametric density estimator does not make such an assumption.

2.6.1 Histogram

One nonparametric density estimator is a histogram. The idea is to form uniformly sized bins over X’s range and assign the samples to bins according to their value. As a result, each bin has a value equal to the number of samples that fell into its range. Regardless of whether X is a continuous or discrete random value, a histogram generates a discrete probability density function. The advantages of histograms are ease of implementation and speed. Inserting new values into a histogram takes O(1) time and finding modes takes O(b), where b is the number of bins. However, histograms have the disadvantages of fixed bin widths, fixed bin alignment, and uniform weighting of points within a bin. Fixed bin widths make it difficult to choose an appropriate bin width without making assumptions about the distribution. For example, a bin width of 100 would not be able to distinguish between a cluster of samples at 1 and a cluster at 8. On the other hand, a bin width of 1 could make a cluster of samples around 100 seem


like smaller clusters of samples at 99, 100, and 101. In general, fixed bin widths are a problem for any data set with a very large range relative to the magnitude of the samples. Another disadvantage is fixed bin alignment. For example, two points could lie very close to each other, but on either side of a bin boundary. The bin boundary ignores that relationship and does so arbitrarily because the bin boundary depends on where the bin alignment begins. Usually bin alignment begins at 0 or −binwidth/2 and proceeds at intervals of the bin width, but continuous data is rarely so regular. Finally, uniform weighting of points within a bin means that points close together will have the same density as points that are at opposite ends of a bin. This problem is exacerbated by fixed bin widths because an overly large bin with many samples uniformly distributed within it would seem to have a high density.
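A fixed-bin histogram density estimate is only a few lines of code, which makes the bin width and alignment problems easy to see. The sketch below is ours; the example repeats the case discussed above, where a bin width of 100 merges clusters at 1 and 8 into a single bin.

    def histogram_density(samples, bin_width, origin=0.0):
        """Count samples per fixed-width bin (a simple histogram density estimate)."""
        bins = {}
        for x in samples:
            b = int((x - origin) // bin_width)
            bins[b] = bins.get(b, 0) + 1
        return bins

    # Bin width 100 cannot distinguish the clusters at 1 and 8: both land in bin 0.
    print(histogram_density([1, 1.2, 8, 8.3, 100, 101], 100))  # {0: 4, 1: 2}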

2.6.2 Kernel Density Estimation

Kernel density estimation (KDE) [WJ95] [Sco92] avoids the alignment and uniform weighting problems of histograms. The idea is to define a kernel function K(t) with the following property:

\int_{-\infty}^{+\infty} K(t)\,dt = 1.    (2.10)

The density at a received bandwidth sample x is

d(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),    (2.11)

where h is the window width (usually called the bandwidth by statisticians, but doing so would be confusing in this context), n is the number of points within h of x, and x_i is the ith such point. The window width controls the smoothness of the density function. It is analogous to the bin width of histograms. KDE avoids the fixed bin alignment problem of histograms by aligning the window on x, the point at which the density is to be estimated, instead of fixed points along the range of X. KDE can avoid the weighting problem by selecting a kernel function that gives greater weight to points close to the origin of the kernel space. The kernel space is the space in which the kernel function is defined. One example is the triangular kernel function:

K(t) = \begin{cases} 0 & t < -1 \\ 1 + t & -1 \le t \le 0 \\ 1 - t & 0 < t \le 1 \\ 0 & t > 1 \end{cases}    (2.12)

2.6.3 Adaptive Kernel Density Estimation

Adaptive kernel density estimation (AKDE) [Sai94] is a variation of kernel density estimation that, in addition to KDE's other properties, avoids the fixed bin width problem of histograms. The idea is to define the window width h as the function h(x) so that the kernel density function becomes

d(x) = \frac{1}{n h(x)} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h(x)} \right).    (2.13)

An example of a simple adaptive window width function is

h(x) = cx,    (2.14)

where c is the smoothing parameter. This resolves the fixed bin width problem, but introduces the question of how much to smooth. Techniques to do optimal and automatic smoothing parameter selection for arbitrary distributions are beyond the scope of this study. However, a simple adaptive window width function like (2.14) is sufficient for distributions where distinct modes are generally a ratio of c apart from each other [Sai94]. For example, with c = .1, (2.14) allows this technique to distinguish between distinct modes at 1 and 1.1 or 100 and 110, but not between modes at 1 and 1.01 or 100 and 101. This is sufficient for our purposes (see Section 4.4.1).
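A direct transcription of Equations 2.12 through 2.14 is short. The sketch below is illustrative only (Chapter 4 describes how the actual filtering algorithm uses the estimator); following the definitions above, n counts only the samples that fall within the window around x.

    def triangular_kernel(t):
        """The triangular kernel of Equation 2.12."""
        return 1.0 - abs(t) if abs(t) <= 1.0 else 0.0

    def akde_density(x, samples, c=0.1):
        """Adaptive kernel density at x, with window width h(x) = c * x (Eqs. 2.13-2.14)."""
        h = c * x
        if h <= 0.0:
            return 0.0
        near = [xi for xi in samples if abs(x - xi) <= h]
        if not near:
            return 0.0
        return sum(triangular_kernel((x - xi) / h) for xi in near) / (len(near) * h)

    # Bandwidth samples in b/s: a cluster near 10 Mb/s and an isolated 30 Mb/s sample.
    samples = [1.2e6, 9.8e6, 9.9e6, 10.0e6, 10.1e6, 30e6]
    print(akde_density(10.0e6, samples) > akde_density(30e6, samples))  # True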

Chapter 3

Inferring All Link Bandwidths

Measuring all the link bandwidths along a path is important for a variety of applications including selecting from multiple unicast routes, building multicast routing trees, benchmarking routing services provided by ISPs, and characterizing networks. Unfortunately, existing techniques require sending an excessive number of packets for even a moderate level of accuracy. In this chapter, we present the Packet Tailgating [LB00] technique for measuring all the link bandwidths along a path. Packet Tailgating requires fewer packets than existing techniques for a similar accuracy. Although Packet Tailgating can reduce the number of packets required for the existing level of accuracy, the accuracy of known techniques is not sufficient for many applications. We describe a variety of possible causes for Packet Tailgating's inaccuracy and some solutions. In Section 3.1, we derive the theoretical basis for Packet Tailgating from the multi-packet delay model described in Section 2.1.3. Using this basis, we develop the basic technique in Section 3.2. We expand on the basic technique to deal with some complexities in Section 3.3. In Section 3.4, we analyze the properties of Packet Tailgating. In Section 3.5, we measure the performance of an implementation of Packet Tailgating in the Internet. We conclude and describe opportunities for future research in Section 3.6.


3.1 Packet Tailgating Derivation

We develop the Packet Tailgating technique by using the multi-packet model (Equation 2.5) and making certain assumptions that we believe can be satisfied in practice. We use the multi-packet model because it captures the bandwidth and fixed delay of links and intra-flow queueing of packets, while omitting the complexity of modeling the behavior of extra-flow queueing. We make the following assumptions:

1. We can send one packet (packet k − 1) with no queueing.

2. We can send a second packet (packet k) that queues behind the first packet at a specific link (link m), but not at any later link.

Since we are deriving from the multi-packet model (Equation 2.5), we have all of its assumptions, too (see Section 2.1). It is possible that a different set of assumptions than the two listed above would result in a better technique, but those listed above are sufficient. We describe how these assumptions can be satisfied in practice in Section 3.2. Given these assumptions, we use the multi-packet delay model to solve for the bandwidth b_m of the link m at which queueing occurs. We rewrite the multi-packet model (Equation 2.5) to give the time packet k takes to arrive at the destination link n (with variables defined in Table 2.1):

t^k_n = t^k_0 + \sum_{i=0}^{n-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right).    (3.1)

We split the packet's delay into three parts: time spent traveling up to the link we want to measure, time spent at that link, and time spent traveling from that link to the destination:

t^k_n = t^k_0 + \sum_{i=0}^{m-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right)
      + \left[ \frac{s^k}{b_m} + d_m + \max\left(0, t^{k-1}_{m+1} - d_m - t^k_m\right) \right]
      + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right).

Using Equation 3.1, we can simplify the first part:

t^k_n = t^k_m + \left[ \frac{s^k}{b_m} + d_m + \max\left(0, t^{k-1}_{m+1} - d_m - t^k_m\right) \right]
      + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right).

By assumption 2, packet k queues at link m, so in the second part of the equation, we use the second part of the maximum function, and simplify:

t^k_n = t^k_m + \frac{s^k}{b_m} + d_m + t^{k-1}_{m+1} - d_m - t^k_m + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right)
      = \frac{s^k}{b_m} + t^{k-1}_{m+1} + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i + \max\left(0, t^{k-1}_{i+1} - d_i - t^k_i\right) \right).

Again by assumption 2, there is no queueing after link m, so we can drop the queueing terms from the last part:

t^k_n = \frac{s^k}{b_m} + t^{k-1}_{m+1} + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i \right).

We substitute using Equation 3.1:

t^k_n = \frac{s^k}{b_m} + \sum_{i=0}^{m} \left( \frac{s^{k-1}}{b_i} + d_i + \max\left(0, t^{k-2}_{i+1} - d_i - t^{k-1}_i\right) \right) + t^{k-1}_0 + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i \right).


By assumption 1, the first packet experiences no queueing, so we can eliminate the queueing terms:

t^k_n = \frac{s^k}{b_m} + \sum_{i=0}^{m} \left( \frac{s^{k-1}}{b_i} + d_i \right) + t^{k-1}_0 + \sum_{i=m+1}^{n-1} \left( \frac{s^k}{b_i} + d_i \right).

We rearrange some terms:

t^k_n = \frac{s^{k-1}}{b_m} + \sum_{i=0}^{m-1} \left( \frac{s^{k-1}}{b_i} \right) + \sum_{i=m}^{n-1} \left( \frac{s^k}{b_i} \right) + t^{k-1}_0 + \sum_{i=0}^{n-1} \left( d_i \right).

We define the following variables for more compact notation:

\bar{d}_l = \sum_{i=0}^{l} d_i    (3.2)

and

\frac{1}{\bar{b}_l} = \sum_{i=0}^{l} \left( \frac{1}{b_i} \right).    (3.3)

Using these definitions, we continue simplifying the equation:

t^k_n = \frac{s^{k-1}}{b_m} + s^{k-1} \sum_{i=0}^{m-1} \left( \frac{1}{b_i} \right) + s^k \sum_{i=m}^{n-1} \left( \frac{1}{b_i} \right) + t^{k-1}_0 + \bar{d}_{n-1}
      = \frac{s^{k-1}}{b_m} + \frac{s^{k-1}}{\bar{b}_{m-1}} + s^k \left( \sum_{i=0}^{n-1} \left( \frac{1}{b_i} \right) - \sum_{i=0}^{m-1} \left( \frac{1}{b_i} \right) \right) + t^{k-1}_0 + \bar{d}_{n-1}
      = \frac{s^{k-1}}{b_m} + \frac{s^{k-1}}{\bar{b}_{m-1}} + s^k \left( \frac{1}{\bar{b}_{n-1}} - \frac{1}{\bar{b}_{m-1}} \right) + t^{k-1}_0 + \bar{d}_{n-1}.

Solving for b_m and collecting terms,

b_m = \frac{s^{k-1}}{t^k_n + \frac{s^k - s^{k-1}}{\bar{b}_{m-1}} - \frac{s^k}{\bar{b}_{n-1}} - t^{k-1}_0 - \bar{d}_{n-1}}.    (3.4)


This shows that we can compute the bandwidth of the link m at which queueing occurs (b_m) from the sizes of the two packets (s^{k-1}, s^k), the arrival time of the second packet at the destination (t^k_n), the transmission time of the first packet (t^{k-1}_0), the bandwidth of all links before the measured link (\bar{b}_{m-1}), the bandwidths of all the links along the path (\bar{b}_{n-1}), and the delay of all the links along the path (\bar{d}_{n-1}). In the next section, we describe how to measure these quantities in practice.
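For illustration, Equation 3.4 can be restated directly in code. The sketch below is not the nettimer implementation; the parameter names are invented here, and the example values describe a small synthetic two-link path (2 Mb/s followed by 1 Mb/s, with 2 ms and 3 ms fixed delays) whose arrival time was computed from the multi-packet model under the two assumptions above.

    def tailgating_bandwidth(s_prev, s, t_kn, t0_prev,
                             inv_bbar_before, inv_bbar_path, dbar_path):
        """Bandwidth b_m of the measured link, from Equation 3.4.

        s_prev          -- size s^{k-1} of the tailgated packet (bits)
        s               -- size s^k of the tailgater packet (bits)
        t_kn            -- arrival time t^k_n of the tailgater at the destination (s)
        t0_prev         -- transmission time t^{k-1}_0 of the tailgated packet (s)
        inv_bbar_before -- sum of 1/b_i over the links before link m (Equation 3.3)
        inv_bbar_path   -- sum of 1/b_i over every link on the path
        dbar_path       -- sum of the fixed delays d_i over every link on the path
        """
        denominator = (t_kn
                       + (s - s_prev) * inv_bbar_before
                       - s * inv_bbar_path
                       - t0_prev
                       - dbar_path)
        return s_prev / denominator

    # Measuring the second link of the synthetic path: recovers the 1 Mb/s bottleneck.
    print(tailgating_bandwidth(12000, 1000, 0.024, 0.0,
                               1 / 2e6, 1 / 2e6 + 1 / 1e6, 0.002 + 0.003))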

3.2 Technique

Of the metrics in Equation 3.4 that we need to know to compute b_m, some (\bar{b}_{n-1}, \bar{d}_{n-1}) are metrics of the entire path to be measured, one is a metric of part of that path (\bar{b}_{m-1}), and the rest are metrics of the particular link to be measured (s^{k-1}, s^k, t^k_n, t^{k-1}_0). We use separate techniques to measure each of these groups. Our nettimer prototype implements these techniques.

3.2.1 Metrics of the Entire Path

To compute the metrics of the entire path (\bar{b}_{n-1}, \bar{d}_{n-1}), we use a separate phase (the sigma phase) to distinguish it from the techniques that follow. Since the sigma phase does not compute any metrics that are specific to the link to be measured, the two assumptions given in Section 3.1 do not apply. It may seem impossible to compute \bar{b}_{n-1} without a priori knowledge because it depends on the bandwidths of all the links along the path (Equation 3.3), which is what we are trying to compute in the first place. However, we do not have to know the specific bandwidth of each link along the path to compute \bar{b}_{n-1}. Instead, we can measure \bar{b}_{n-1} and \bar{d}_{n-1} directly. We do this by sending single packets of different sizes from the source to the destination. We measure the one-way delay of each packet and assume there is no clock skew (we remove these assumptions in Section 3.3). In this context, clock skew is offset between the clock at the sender and the clock at the receiver. For any particular packet size, we save the minimum delay over several trials. We use linear


regression on the delay samples with packet size as the domain and the minimum delay as the range. From the resulting line, \bar{b}_{n-1} is the inverse of the slope and \bar{d}_{n-1} is the y-intercept. We continue sending packets until the confidence of the linear regression exceeds some threshold (nettimer defaults to 99%). This is the same technique that Jacobson and Downey use (see Section 2.3.1). As mentioned before, this procedure requires many packets. While previous work does this for every link along a path, we only run the sigma phase once from the source to the destination, resulting in considerable packet savings.
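The sigma phase's estimation step can be written down directly: keep the minimum observed one-way delay for each packet size, fit a line through the minima, and read \bar{b}_{n-1} from the inverse of the slope and \bar{d}_{n-1} from the intercept. The sketch below is ours and hand-rolls an ordinary least-squares fit; nettimer's actual implementation, including its confidence test, is not shown.

    def sigma_phase_estimate(samples):
        """Estimate the combined path bandwidth and fixed delay from delay samples.

        samples are (packet_size_bits, one_way_delay_seconds) measurements.  The
        minimum delay per size is kept, a least-squares line
        delay = slope * size + intercept is fitted, and the function returns
        (1/slope, intercept): estimates of the combined bandwidth and fixed delay.
        """
        minima = {}
        for size, delay in samples:
            if size not in minima or delay < minima[size]:
                minima[size] = delay
        if len(minima) < 2:
            raise ValueError("need delays for at least two distinct packet sizes")
        xs, ys = zip(*sorted(minima.items()))
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        intercept = mean_y - slope * mean_x
        return 1.0 / slope, intercept  # the inverse of the slope is the bandwidth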

3.2.2 Metrics of the Link

We compute the metrics (s^{k-1}, s^k, t^k_n, t^{k-1}_0) specific to the link we are trying to measure in the tailgating phase. The purpose of the tailgating phase is to compute these metrics while satisfying the two assumptions from Section 3.1:

1. We can send one packet (packet k − 1) with no queueing.

2. We can send a second packet (packet k) that queues behind the first packet at a specific link (link m), but not at any later link.

Satisfying the first assumption is not a problem. Satisfying the second assumption is more difficult. We cause queueing at link m by sending a very large first packet (the tailgated packet) and a very small second packet (the tailgater packet). The second packet will generally have a much smaller transmission delay than the larger first packet, so it will continuously catch up to and queue behind the first packet. The second packet is like a car with an impatient driver who drives too closely to the car in front of it, thus the name. This allows us to satisfy the first part of the second assumption. We satisfy the second part of the second assumption by setting the IP TTL field of the first packet to be equal to m (the number of hops from the source to the link to be measured). This causes the first packet to be dropped at the mth link, allowing the second packet to travel unimpeded to the destination.
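In practice the tailgated/tailgater pair can be sent with ordinary UDP sockets, using the IP TTL socket option to drop the large packet at hop m. The sketch below is ours and only illustrates the packet construction; the destination, port, and payload sizes are placeholders rather than nettimer's actual probe format, and it assumes a platform that exposes the IP_TTL socket option.

    import socket

    def send_tailgating_pair(dest_ip, m, large_bytes=1472, small_bytes=40, port=9):
        """Send one tailgated/tailgater probe pair toward dest_ip.

        The large (tailgated) packet carries TTL = m, so routers drop it at the
        mth hop; the small (tailgater) packet keeps the default TTL and continues
        to the destination.  Port 9 (discard) is only a placeholder.
        """
        tailgated = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        tailgated.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, m)
        tailgater = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            tailgated.sendto(b"\x00" * large_bytes, (dest_ip, port))
            tailgater.sendto(b"\x00" * small_bytes, (dest_ip, port))  # back to back
        finally:
            tailgated.close()
            tailgater.close()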


As with the sigma phase, we assume there is no clock skew and take the minimum of several one-way delay samples of the tailgater packet (we remove these assumptions in Section 3.3). In our nettimer implementation, we randomly probe different links until the error for all of them drops below 2%. We chose this value to balance the time to finish with the accuracy of the results. We calculate the error for a particular link by using the bootstrap method [ET93]. We randomly re-sample with replacement from the set of delays for a link until we have a new set of samples of 25% of the size of the original. Using this new set, we compute a new minimum delay. We repeat this process 20 times and then compute the variance of all the new delays. The error is this variance divided by the actual minimum delay of the set. We selected these values so that the CPU overhead of the computation is unnoticeable at typical network latencies.
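The bootstrap error computation just described can be restated in a few lines. This is our sketch of the procedure as described in the text, not nettimer source; the 25% resample size, 20 iterations, and 2% threshold are the constants quoted above.

    import random
    import statistics

    def bootstrap_error(delays, resamples=20, fraction=0.25):
        """Relative error of the minimum one-way delay, via the bootstrap [ET93].

        Resamples the delays with replacement, takes the minimum of each
        resample, and divides the variance of those minima by the actual
        minimum of the original set.
        """
        k = max(2, int(len(delays) * fraction))
        minima = [min(random.choices(delays, k=k)) for _ in range(resamples)]
        return statistics.variance(minima) / min(delays)

    def link_converged(delays, threshold=0.02):
        """Stop probing a link once its bootstrap error drops below 2%."""
        return bootstrap_error(delays) < threshold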

3.2.3 Metric of Part of the Path

As with the metrics of the entire path, the metric of part of the path (\bar{b}_{m-1}) would seem to require already knowing the quantity we are trying to measure. However, since \bar{b}_{m-1} is composed of the bandwidths of the links before the link we are trying to measure, we can take an inductive approach. We start by running the tailgating phase for the closest link. This link has a \bar{b}_{m-1} of 0 since there are no links before the first link. We feed this result into the tailgating computation for the bandwidth of the second link, then we feed the bandwidths of the first and second links into the computation for the bandwidth of the third, and so on.
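The induction can be written as a single loop that threads the running 1/\bar{b}_{m-1} term through successive tailgating computations. The sketch below is ours; tailgating_bandwidth() is the hypothetical Equation 3.4 helper sketched in Section 3.1, and per_link_probes stands for the (s^{k-1}, s^k, t^k_n, t^{k-1}_0) measurements gathered in the tailgating phase.

    def all_link_bandwidths(per_link_probes, inv_bbar_path, dbar_path):
        """Inductively compute b_0, b_1, ... from tailgating measurements.

        per_link_probes[m] -- (s_prev, s, t_kn, t0_prev) measured for link m
        inv_bbar_path      -- sigma-phase estimate of the sum of 1/b_i over every link
        dbar_path          -- sigma-phase estimate of the sum of d_i over every link
        """
        inv_bbar_before = 0.0  # sum of 1/b_i over links before link m; 0 for the first link
        bandwidths = []
        for s_prev, s, t_kn, t0_prev in per_link_probes:
            b_m = tailgating_bandwidth(s_prev, s, t_kn, t0_prev,
                                       inv_bbar_before, inv_bbar_path, dbar_path)
            bandwidths.append(b_m)
            inv_bbar_before += 1.0 / b_m  # extend the induction to the next link
        return bandwidths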

3.3 Complexities

In addition to the basic technique described above, we have to deal with several additional complexities: 1) calculating the maximum inter-packet transmission time, 2) clock skew, 3) using round trip measurements, 4) causing acknowledgements to be sent, 5) detecting that packets were dropped, and 6) dealing with invisible nodes.


3.3.1 Inter-packet Transmission Time

As discussed in Section 3.2, we must send the tailgated and tailgater packets so that queueing takes place at the link to be measured. We could guarantee this by sending the packets back to back, but that is difficult in practice. The source host's operating system and architecture usually cause a small delay before the second packet can be sent. Here we derive the maximum bound on this delay before it prevents the two packets from queueing. We want packet k to queue at link m:

    $q_m^k > 0$.

We substitute using Equation 2.4:

    $\max\left(0, t_{m+1}^{k-1} - d_m\right) - t_m^k > 0$

    $t_m^k < t_{m+1}^{k-1} - d_m$.

We substitute using Equation 2.3 and the assumption that the tailgated and tailgater packets experience no queueing before $l_q$:

    $t_0^k + \sum_{i=0}^{l_q-1} \left( \frac{s^k}{b_i} + d_i \right)$





Figure 4.1: This figure shows four cases of how the spacing between a pair of packets changes as they travel along a path. The black boxes are packets traveling from a source on the left to a destination on the right. Underneath each pair of packets is their spacing relative to the spacing caused by the bottleneck link. The gray boxes indicate cross traffic that causes one or both of the packets to queue.

In addition, the lack of transmission timing information prevents us from using some of the filtering algorithms described in Section 4.3. This further impairs accuracy.

4.3 Packet Pair Filtering Techniques

After gathering measurements and applying the Packet Pair Property to generate bandwidth samples, we filter out samples that do not satisfy the Packet Pair Property’s assumptions. Examples of causes for failing to meet the assumptions are cross traffic causing queueing of measurement traffic after the bottleneck link, hosts not sending acknowledgements consistently, and measurement packets that did not queue at the bottleneck link. The goal of a filtering algorithm is to detect and remove samples that are affected by these conditions. Before describing our filtering functions, we differentiate between the kinds of


samples we want to keep and those we want to filter out. Figure 4.1 shows one case that satisfies the assumptions of the Packet Pair Property and three cases that do not. There are other possible scenarios but they are combinations of these cases.

Case A shows the ideal Packet Pair case: the packets are sent sufficiently quickly to queue at the bottleneck link and there is no queueing after the bottleneck link. In this case, the bottleneck bandwidth is equal to the received bandwidth and we do not need to do any filtering.

In case B, one or more packets queue between the first and second packets, causing the second packet to fall farther behind than would have been caused by the bottleneck link. In this case, the received bandwidth is less than the bottleneck bandwidth by some unknown amount, so we should filter this sample out.

In case C, one or more packets queue before the first packet after the bottleneck link, causing the second packet to follow the first packet closer than would have been caused by the bottleneck link. In this case, the received bandwidth is greater than the bottleneck bandwidth by some unknown amount, so we should filter this sample out.

In case D, the sender does not send the two packets close enough together, so they do not queue at the bottleneck link. In this case, the received bandwidth is less than the bottleneck bandwidth by some unknown amount, so we should filter this sample out. Active techniques can avoid case D samples by sending large packets with little spacing between them, but passive techniques are susceptible to them. Examples of case D traffic are TCP acknowledgements, voice over IP traffic, remote terminal protocols like telnet and ssh, and instant messaging protocols.

4.4 Cross Traffic Queueing

Figure 4.2: The left graph shows some Packet Pair samples plotted using their received bandwidth against their sent bandwidth. "A" samples correspond to case A, etc. The right graph shows the distribution of different values of received bandwidth after filtering out the samples above the x = y line. In this example, density estimation indicates the best result.

To filter out the effect of cross traffic queueing (case B and C), we use some previously noted insights, but implement our own algorithm. The first insight is that samples influenced by cross traffic will tend not to correlate with each other, while the case A samples will correlate strongly with each other (Carter and Crovella's assumption described in Section 2.4.4). We assume that cross

traffic will have a uniform distribution of packet sizes and interarrival times at the links along the path. The other insight is that packets sent with a low bandwidth that arrive with a high bandwidth are definitely from case C and can be filtered out (Paxson’s assumption from Section 2.4.5). Figure 4.2 shows a hypothetical example of how we apply these insights. Using the second insight, we eliminate the case C samples above the received bandwidth = sent bandwidth (x = y) line. To implement the first insight, we calculate a smoothed distribution from the samples and select the point with the highest density as the bandwidth. Finding the highest density point requires density estimation. Since we do not want to make further assumptions about the distribution of bandwidth samples, we use nonparametric density estimation (Section 2.6). To avoid the problems associated with histograms described in that section, we use adaptive kernel density estimation.
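The fragment below sketches these two insights in simplified form: it discards samples above the x = y line and then picks the received bandwidth whose neighborhood in log-bandwidth space is densest. It uses a fixed-width Gaussian kernel rather than the adaptive kernel described next, and the names and window width are illustrative, not nettimer's.

#include <math.h>
#include <stddef.h>

/*
 * Simplified sketch of the filtering insights (fixed-width kernel, not the
 * adaptive kernel estimator nettimer uses).  sent[] and recv[] are the sent
 * and received bandwidths of the packet pair samples.  Samples with
 * recv > sent (case C) are discarded; among the rest, the sample whose
 * log-bandwidth neighborhood is densest is returned as the estimate.
 */
static double
densest_received_bw(const double *sent, const double *recv, size_t n,
                    double log_window /* e.g. 0.05 decades */)
{
    double best_bw = 0.0, best_density = -1.0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (recv[i] > sent[i])          /* above the x = y line: case C */
            continue;
        double density = 0.0;
        for (j = 0; j < n; j++) {
            if (recv[j] > sent[j])
                continue;
            double u = (log10(recv[j]) - log10(recv[i])) / log_window;
            density += exp(-0.5 * u * u);   /* Gaussian kernel */
        }
        if (density > best_density) {
            best_density = density;
            best_bw = recv[i];
        }
    }
    return best_bw;
}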

4.4.1 Adaptive Kernel Density Estimation Filtering

Figure 4.3: This is a graph of the distribution of bandwidth samples collected from a cross country path with a 10Mb/s bottleneck link bandwidth. The x-axis is the bandwidth of the sample. The y-axis is a count of bandwidth samples. "Hist 1000000" is a histogram plot with a 1Mb/s bin width. "Hist 1000" is a histogram plot with a 1Kb/s bin width.

Figure 4.4: This is a graph of the distribution of bandwidth samples collected from a cross country path with a 10Mb/s bottleneck link bandwidth. The x-axis is the bandwidth of the sample. The y-axis is a count of bandwidth samples. "Hist 1000000" is a histogram plot with a 1Mb/s bin width. "Kernel" is a plot using an adaptive kernel density estimation function.

Adaptive kernel density estimation allows us to filter bandwidth samples without assuming that the bottleneck link bandwidth falls within a particular range. This is

important in the Internet, where link bandwidths may vary by four or more orders of magnitude. In Section 2.6.3, we describe the theory behind adaptive kernel density estimation. In this section, we apply the theory to the problem of filtering bottleneck link bandwidth samples.

Figures 4.3 and 4.4 illustrate the problems associated with histograms in the context of bandwidth measurement. We gathered the data for these plots according to the methodology described in Section 4.7.1. Figure 4.3 is an example of the problem with fixed bin widths. The nominal bottleneck bandwidth on this path is 10Mb/s, so a bin width of 1Mb/s is appropriate to obtain a result with a graphing error ≤ 1Mb/s. Using the 1Mb/s bin width plot, we would conclude that the bottleneck link bandwidth is 5.5Mb/s (the source of the discrepancy of this plot is explained below). However, without knowing the bottleneck link bandwidth a priori, it is difficult to choose a useful bin width. A 1Kb/s bin width would have been appropriate if the bottleneck link bandwidth were 33Kb/s. If we had used a 1Kb/s bin width for this set of bandwidth samples (as shown in the other plot), then we would not have been able to produce an estimate because there is no clear mode. Bottleneck link bandwidths in the Internet vary across many orders of magnitude, so this is a significant problem.

Figure 4.4 is an example of the problems of fixed alignment, fixed bin width, and uniform weighting of points. As in the previous graph, the nominal bandwidth is 10Mb/s. The fixed bin alignment problem of histograms results in a relatively small, but measurable error. The kernel density estimation plot shows that the true mode of this data set is at 9.7Mb/s. Even if the histogram plot accurately determined the mode, the best result it could give is 9.5Mb/s because its bins are aligned at 0. More significantly, fixed bin width and uniform weighting of points combine to cause the histogram to calculate an erroneous estimate of 5.5Mb/s. This can be seen by comparing the difference in density estimates of histograms and adaptive kernel density estimation (AKDE) at the bins starting at 0Mb/s, 4Mb/s, and 9Mb/s. At 0Mb/s, the histogram gives a density of 8, while AKDE gives a density of 1 or 2. This indicates that the samples in this region are distributed almost uniformly from 0Mb/s to 1Mb/s. The histogram shows a correlation among samples that fall within its fixed bin width (in this case, samples between 0b/s and 1Mb/s), while the adaptive

density estimation algorithm shows correlation only among samples that are within the window width of each other (in this region, with c = .1, the window width varies between .1b/s and 100Kb/s). A similar effect occurs at 4Mb/s. In this case, the histogram bin width of 1Mb/s is closer to the AKDE window width of 400Kb/s to 500Kb/s, so the histogram is closer to the AKDE estimate, but the histogram still overestimates the density. At 9Mb/s, the estimates are nearly identical because the widths are 1Mb/s and 900Kb/s to 1Mb/s, respectively. Note that the plot of the AKDE algorithm shows that the assumption that cross traffic will cause a uniform distribution of samples that experienced queueing is not entirely valid. Several modes in the distribution almost reach the global mode in Figure 4.4. This observation and a possible cause are noted in previous work (Section 2.4.6).

4.4.2 Implementation

As mentioned at the beginning of this chapter, two of our requirements for bottleneck link bandwidth measurement are real time measurement and being able to measure over long time scales. Therefore, nettimer requires an AKDE implementation that supports these requirements. Existing implementations of AKDE are designed for offline processing of data files. They assume that they can access their entire data set before producing a result and that their entire data set fits into memory. However, real time measurement requires producing a result using a data set that must be continuously and quickly updated as new bandwidth samples arrive. In addition, measuring over long time scales requires that samples be deleted from the data set so that only the most relevant samples remain in memory (the meaning of “relevant” is described in Section 4.6). To these ends, we use a new implementation of the AKDE algorithm, which we describe in this section. This implementation is part of a utility library (libkl) that is linked with nettimer. The libkl AKDE implementation supports three main operations: insert a data sample, delete a data sample, and read the kernel values of the samples. In the case of


the first two, the kernel values of all samples within the window width of the inserted or deleted sample must be updated (see Section 2.6.2). Let $s$ be the total number of samples, $s_{affected}$ be the number of samples affected by the changed sample, and $s_{width}$ be the average number of samples that are within the window width of each of the samples affected by the changed sample. The total number of sample accesses required for one insertion or deletion is $s_{affected} \cdot s_{width}$.

The main issue is worst case performance. In the worst case, all the samples are within the window width of each other and the total number of sample accesses is $s^2$. This could be unacceptably slow for a large number of points. Our approach is to optimize worst case performance at the cost of some precision. We observe that in the worst case, all the samples are tightly clustered together and consequently have very similar kernel values. Since the kernel values are similar and their sample values are similar, calculating their kernel values separately is redundant.

Our solution is to compute kernel values at fixed points using bins similar to histogram bins. The kernel value of the bin represents the kernel values of all the samples that fall inside the bin. When using the kernel value of a bin, we use the bandwidth value of the midpoint of the bin. For example, suppose a bin from 100b/s to 110b/s contains 30 bandwidth samples. When we insert a new sample at 101b/s, we only have to update the kernel value of the bin, instead of the other 30 points separately. When using the kernel value of the bin, we use a bandwidth value of 105b/s. We call the bin width parameter the resolution. The loss of precision introduced by this implementation is equal to resolution/2 because poor bin alignment could cause the true mode to be at the left or right edge of the bin. A larger resolution allows higher performance. In fact, if the resolution exceeds the window width, the algorithm is similar to a histogram. A smaller resolution reduces error and comes closer to a pure adaptive kernel density estimation algorithm.
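The following sketch illustrates the binned bookkeeping described above. It is not libkl's code: the bin count, the resolution, the Gaussian kernel shape, and the use of a single window width per update are simplifying assumptions of ours.

#include <math.h>

/*
 * Sketch of the binned kernel bookkeeping (illustrative, not libkl's API).
 * Bins are RESOLUTION decades wide in log10(bandwidth); each bin keeps a
 * sample count and an accumulated kernel value.  Inserting or deleting a
 * sample only touches the bins within the window width of that sample, so
 * tightly clustered samples share one bin update instead of s^2 work.
 */
#define NBINS      2048
#define RESOLUTION 0.01            /* bin width, in decades */
#define LOG_MIN    0.0             /* 1 b/s */

struct kbin {
    int    count;                  /* samples whose value falls in this bin */
    double kernel;                 /* accumulated kernel (density) value    */
};

static struct kbin bins[NBINS];

static int bin_of(double bw) {
    int b = (int)((log10(bw) - LOG_MIN) / RESOLUTION);
    return b < 0 ? 0 : (b >= NBINS ? NBINS - 1 : b);
}

/* Add (sign = +1) or remove (sign = -1) one sample's kernel contribution. */
static void update(double bw, double window /* decades */, int sign)
{
    int center = bin_of(bw);
    int span = (int)(window / RESOLUTION) + 1;
    int b;

    bins[center].count += sign;
    for (b = center - span; b <= center + span; b++) {
        if (b < 0 || b >= NBINS)
            continue;
        double u = (b - center) * RESOLUTION / window;
        bins[b].kernel += sign * exp(-0.5 * u * u);
    }
}

/* The estimate is the midpoint of the bin with the largest kernel value. */
static double read_mode(void)
{
    int b, best = 0;
    for (b = 1; b < NBINS; b++)
        if (bins[b].kernel > bins[best].kernel)
            best = b;
    return pow(10.0, LOG_MIN + (best + 0.5) * RESOLUTION);
}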


Figure 4.5: The left graph shows some Packet Pair samples plotted using their received bandwidth against their sent bandwidth. “A” samples correspond to case A, etc. In this example, the ratio of received bandwidth to sent bandwidth is a better indicator than density estimation.

4.4.3 Smoothing Parameter

In this section, we describe the significance of the smoothing parameter (c) described in Section 2.6.3 for bandwidth measurement. The smoothing parameter and the adaptive window width function (Equation 2.14) are important because they can have a significant effect on the precision and performance of the AKDE algorithm. A smaller smoothing parameter allows faster performance because it causes window widths to be narrower, resulting in fewer samples within each window. In addition, a smaller smoothing parameter allows distinguishing between narrower distinct modes. However, a larger smoothing parameter is less susceptible to noise. A smaller smoothing parameter is susceptible to an isolated spike because it will separate samples that are farther apart than the window width. A larger smoothing parameter will smooth those samples together so that they can overcome isolated spikes. We currently use a default value of 1.1 for the smoothing parameter because distinct bandwidths in the Internet are at least 10% apart. Table 1.1 shows that the closest common technologies are 10Mb/s Ethernet and the 11Mb/s 802.11b protocol.

4.5 Packets Sent at a Low Rate

Low bit-rate applications and TCP acknowledgements can disrupt regular Packet Pair filtering algorithms because they send packets at a low rate. Examples of low bit-rate applications are IP telephony, ssh/telnet, and instant messaging. TCP acknowledgements are sent at a low bit-rate because they are only sent for every other data packet and are only 40 bytes long. These types of packets will not queue together at a high capacity bottleneck and will therefore produce case D samples from Figure 4.1. Furthermore, these case D samples will be highly correlated because the sender sends them at a consistent rate. This will deceive the kernel density algorithm because it assumes that incorrect samples are not highly correlated. If a flow consists of only case D samples, then no filtering algorithm can help because there are no samples that reflect the bottleneck link bandwidth. Instead, we concentrate on the situation where there are a few case A samples among the case D samples. Figure 4.5 shows a hypothetical graph of many case D and a few case A bandwidth samples. In this example, density estimation would indicate a bandwidth lower than the correct one because there are so many case D samples. However, there are still some case A samples that indicate the bottleneck bandwidth.

4.5.1 Received/Sent Bandwidth Filtering

Our approach [LB99] is to observe that case D samples have a sent bandwidth close to their received bandwidth. Graphically, this means that in Figure 4.5, the case D samples lie along the x = y line. On the other hand, case A samples are sent with a high bandwidth but received with a lower bandwidth, and both are greater than the sent and received bandwidths of case D samples. Therefore, case A samples lie to the right of the x = y line, and may be much farther away from it. In addition, for a given sent bandwidth, case A samples are above the case D samples.

Figure 4.6 shows that this applies to data collected from the Internet. We gathered the data for this plot according to the methodology described in Section 4.7.1.

Figure 4.6: This is a graph of bandwidth samples collected from a cross country path with a 10Mb/s bottleneck bandwidth. These samples are computed from the data packets of one TCP flow and the acknowledgements of another TCP flow in the reverse direction. Both the x and y axes are on a log scale.

The samples along the x = y line are calculated from the TCP acknowledgements of the TCP connection with data flowing in the reverse direction. These samples have a ratio of sent to received bandwidth of nearly 1. The samples off the x = y line and centered on the y = 10Mb/s line are calculated from the TCP data packets of the connection with data flowing in the forward direction. These samples reflect the bottleneck link bandwidth of 10Mb/s. They have a ratio of sent to received bandwidth of 100 to 1 in this example, but in general the ratio could be arbitrarily high.

4.5.2 Implementation

Using this insight we define the received/sent bandwidth ratio of a received bandwidth sample x to be

    $p(x) = 1 - \frac{\ln(x)}{\ln(s(x))}$,    (4.4)

where s(x) is the sent bandwidth of x. We take the logarithm of the bandwidths because bandwidths may differ by orders of magnitude (as Figure 4.6 shows). If we did not take the log, then samples from the far right of Figure 4.6 would be favored too much over samples from the middle of the graph. Unfortunately, given two samples with the same sent bandwidth, Equation 4.4 favors the one with the smaller received bandwidth. To counteract this, we define the received bandwidth ratio to be

    $r(x) = \frac{\ln(x) - \ln(x_{min})}{\ln(x_{max}) - \ln(x_{min})}$,    (4.5)

where $x_{min}$ and $x_{max}$ are the minimum and maximum values of x in the sample set. We compose the kernel density algorithm and the received/sent bandwidth ratio algorithm together by normalizing the values of the kernel density function, Equation 4.4, and Equation 4.5 and taking their linear combination:

    $f(x) = 0.4 \cdot \frac{d(x)}{d(x)_{max}} + 0.3 \cdot p(x) + 0.3 \cdot r(x)$,    (4.6)

where $d(x)_{max}$ is the maximum kernel density value. By choosing the value of x that maximizes f(x) as the bottleneck link bandwidth, we can take into account both the density and the received/sent bandwidth ratio without favoring smaller values of x. The weighting of each of the components is arbitrary. Although it is unlikely that this is the optimal weighting, the results in Section 4.7 indicate that these weightings work well under typical Internet conditions.
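A small sketch of Equations 4.4 through 4.6 applied to one candidate sample follows. The density values are assumed to come from the kernel density estimator, and the function name is ours, not nettimer's.

#include <math.h>

/*
 * Sketch of Equations 4.4-4.6 (illustrative, not nettimer's code).
 * recv/sent are the received and sent bandwidths of the candidate sample,
 * recv_min/recv_max are the extreme received bandwidths in the sample set,
 * and density/density_max come from the kernel density estimator.
 */
static double
filter_score(double recv, double sent,
             double recv_min, double recv_max,
             double density, double density_max)
{
    double p = 1.0 - log(recv) / log(sent);                      /* Eq 4.4 */
    double r = (log(recv) - log(recv_min)) /
               (log(recv_max) - log(recv_min));                  /* Eq 4.5 */
    return 0.4 * (density / density_max) + 0.3 * p + 0.3 * r;    /* Eq 4.6 */
}

The estimate is then the received bandwidth of the sample with the largest score.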

4.6 Continuous Measurement

Continuous measurement allows highly available and accurate passive measurement in spite of a variety of temporary network conditions. An application may need a bandwidth measurement at an arbitrary time and may not have time to wait for a probe of the network path. Such a probe may be lengthy or inaccurate because of temporary congestion along the path. Instead, a passive technique can observe continuously and may have collected sufficient measurements by the query time to produce a high confidence result. The issues with doing this are bandwidth changes, cross traffic queueing, memory consumed, calculation performance, packets that did not queue at the bottleneck link, and sample age. The bottleneck link bandwidth along a path may change because of route changes or host mobility. The longer measurement continues, the more likely a bandwidth change will occur. On the other hand, cross traffic queueing may produce enough samples to appear as a bandwidth change. The more samples a continuous measurement technique uses, the more likely it is to overcome temporary congestion, but memory limits the number of samples kept. In addition, more samples are slower to operate on. However, if only a small group of samples are kept, samples from packets that did not queue at the bottleneck link (case D samples from Section 4.5) could pollute the group. Finally, very old samples are unlikely to be relevant to the current bandwidth because of bandwidth changes.


4.6.1 Age Based Filtering

Our approach to continuous measurement is to consider the age of samples in addition to their correlation (Section 4.4) and received/sent ratio (Section 4.5). The age of a sample is the number of seconds ago that the sample was collected. Older samples are less likely to be valid than newer samples because it is more likely that a route change has occurred since they were gathered. The correlation of a sample is measured by the AKDE algorithm and is important for distinguishing a sample that did not experience cross traffic queueing from one that did. The received/sent ratio of a sample indicates whether the sample was likely to have queued at the bottleneck link.

4.6.2 Implementation

Our solution is to keep a limited number of samples (the window) from which we select the sample that maximizes Equation 4.6. This takes into account the correlation and received/sent bandwidth ratio. When the window is full, we evict samples based on their age and received/sent ratio. We define the age (z) of a sample (x) to be the number of seconds ago that a sample was collected. We compute the following normalized function of the age:

    $t(z(x)) = 1 - \frac{\log_{10}(z(x))}{\log_{10}(z_0)}$.    (4.7)

This function is similar to Equation 4.4. We take the logarithm of the sample age because ages may vary from milliseconds to thousands of seconds. The minimum age can be arbitrarily small. The maximum sample age depends on the length of measurement. $z_0$ controls the scale over which samples should be considered. We compose the age function with the received/sent bandwidth ratio algorithm (Equations 4.4 and 4.5) by normalizing their values and taking their linear combination:

    $g(x) = 0.4 \cdot t(z(x)) + 0.3 \cdot p(x) + 0.3 \cdot r(x)$.    (4.8)

We evict the sample from the window which has the lowest value for g. This sample


is least likely to be relevant because it is old and/or did not cause queueing at the bottleneck link.
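The eviction rule can be sketched as follows, assuming p and r (Equations 4.4 and 4.5) have already been computed for the sample; the function name is ours, not nettimer's.

#include <math.h>

/*
 * Sketch of the eviction score (Equations 4.7 and 4.8); illustrative, not
 * nettimer's code.  age is z(x) in seconds, age_scale is z_0 (86,400 by
 * default), and p and r are the already-normalized values of Equations 4.4
 * and 4.5 for the sample.  When the window is full, the sample with the
 * lowest score is evicted.
 */
static double
eviction_score(double age, double age_scale, double p, double r)
{
    double t = 1.0 - log10(age) / log10(age_scale);   /* Eq 4.7 */
    return 0.4 * t + 0.3 * p + 0.3 * r;               /* Eq 4.8 */
}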

4.6.3 Analysis

The algorithm described above allows the user or application to explicitly control two kinds of trade-offs: 1) quickly detecting bandwidth changes while resisting transient cross traffic and 2) using recent samples that queued at the bottleneck link.

We balance bandwidth change detection speed and cross traffic resistance by using the window size. A large window resists temporary cross traffic queueing more than a small window because it will include periods without cross traffic and different sources of cross traffic that are unlikely to correlate with a particular received bandwidth. However, a large window detects changes in bandwidth more slowly than a small window because it has more samples of the previous bandwidth. The algorithm will not switch to reporting the new bandwidth until either enough of the old samples have been evicted from the window or enough new samples are added to give a high enough AKDE value. It is not clear if there is an optimal window size. The rate at which routes and congestion change varies greatly in the Internet. Nettimer defaults to a window size of 128 samples.

We balance recent samples and samples that queued at the bottleneck link by using the age scale ($z_0$) in Equation 4.7 and the weighting values in Equation 4.8. Large values of $z_0$ and the weight for t(z(x)) allow very old samples to be more relevant than recent samples that did not experience queueing. As with the window size, these values depend on the rate of route changes. Nettimer defaults to a $z_0$ of 86,400 seconds (the number of seconds in a day).

4.7 Measurements

In this section, we examine the accuracy and agility of our techniques for measuring bottleneck link bandwidth as they are implemented in nettimer. Many of these examples show the utility in measuring bottleneck link bandwidth.

Table 4.1: This table shows the different path characteristics used in the experiments. The Short and Long columns list the number of hops from host to host for the short and long path respectively. The RTT columns list the round-trip-times of the short and long paths in milliseconds.

Type of bottleneck link | Short | RTT | Long | RTT
Ethernet 100 Mb/s       |   4   |   1 |  17  |  74
Ethernet 10 Mb/s        |   4   |   1 |  17  |  80
WaveLAN 2 Mb/s          |   3   |   4 |  18  | 151
WaveLAN 11 Mb/s         |   3   |   4 |  18  | 151
ADSL                    |  14   |  19 |  19  | 129
V.34 Modem              |  14   | 151 |  18  | 234
CDMA                    |  14   | 696 |  18  | 727

4.7.1 Methodology

Our approach in gathering these measurements is to take tcpdump traces on pairs of machines while we transfer a file between them. We vary the bottleneck link bandwidth, path length, and workload. We run the traces through nettimer and analyze the results. Our methodology consists of the network topology, the hardware and software platform, accuracy measurement, the network application workload, and the network environment.

Network Topology

Our network topology consists of a variety of paths (listed in Table 4.1) where we vary the bottleneck link technology and the length of the path. WaveLAN [wav00] is a wireless local area network technology made by Lucent. ADSL (Asymmetric Digital Subscriber Line) is a high bandwidth technology that uses phone lines to bring connectivity into homes and small businesses. We tested the Pacific Bell/SBC [dsl00] ADSL service. V.34 is an International Telecommunication Union (ITU) [itu00] standard for data communication over analog phone lines. We used the V.34 service of


Table 4.2: This table shows the different software versions used in the experiments. The release column gives the RPM package release number.

Name             | Version | Release
GNU/Linux Kernel | 2.2.16  | 22
RedHat           | 7.0     | 10
tcpdump          | 3.4     | 1
tcptrace         | 5.2.1   | 4
openssh          | 2.3.0p1 | 1
nettimer         | 2.1.0   |

Stanford University. CDMA (Code Division Multiple Access) is a digital cellular technology. We tested CDMA service by Sprint PCS [spr00] with AT&T Global Internet Services as the Internet service provider. These are most of the link technologies that are currently available for users in the United States. In all cases the bottleneck link is the link closest to one of the hosts. This allows us to measure the best and worst cases for nettimer as described below. The short paths are representative of local area and metropolitan area networks while the long paths are representative of a cross-country, wide area network. We did not obtain access to a tracing machine outside of the United States.

Hardware and Software Platform

All the tracing hosts use Intel Pentium processors ranging from 266MHz to 500MHz. The versions of software used are listed in Table 4.2.

Accuracy Measurement

We measure network accuracy by showing a lower bound (TCP throughput on a path with little cross traffic) and an upper bound (the nominal bandwidth specified by the manufacturer). TCP throughput by itself is insufficient because it does not include the bandwidth consumed by link level headers, IP headers, TCP headers and retransmissions. The nominal bandwidth is insufficient because the manufacturer


usually measures under conditions that may be difficult to achieve in practice. Another possibility would be for us to measure each of the bottleneck link technologies on an isolated test bed. However, given the number and types of link technologies, this would have been difficult. In addition, we wanted to examine the performance of our techniques when dealing with the kinds of cross traffic loads found in the operational Internet.

Network Application Workload

The network application workload consists of using scp (a secure file transfer program from openssh) to copy a 7,476,723 byte MP3 file once in each direction along a path. The transfer is terminated after five minutes even if the file has not been fully transferred. We copy the file in both directions because 1) the ADSL technology is asymmetric and we want to measure both bandwidths and 2) we want to take measurements where the bottleneck link is the first link and the last link. A first-link bottleneck is the worst case for nettimer because it provides the most opportunity for cross traffic to interfere with the Packet Pair Property. A last-link bottleneck is the best case for the opposite reason. We copy a 7,476,723 byte file as a compromise between having enough samples to work with and not having so many samples that traces are cumbersome to work with. We terminate the tracing after five minutes so that we do not have to wait hours for the file to be transferred across the lower bandwidth links.

Network Environment

The network environment centers around the Stanford University campus but also includes the networks of Pacific Bell, Sprint PCS, Harvard University and the ISPs that connect Stanford and Harvard. We ran five trials so that we could measure the effect of different levels of cross traffic during different times of day and different days of the week. The traces were started at 18:07 PST 12/01/2000 (Friday), 16:36 PST 12/02/2000 (Saturday), 11:07 PST


12/04/2000 (Monday), 18:39 PST 12/04/2000 (Monday), and 12:00 PST 12/05/2000 (Tuesday). We believe that these traces cover the peak traffic times of the networks that we tested on: commute time (Sprint PCS cellular), weekends and nights (Pacific Bell ADSL, Stanford V.34, Stanford residential network), and work hours (Stanford and Harvard Computer Science Department networks). Within the limits of our resources, we have selected as many different values for our experimental parameters as possible to capture some of the heterogeneity of the Internet.

4.7.2 Varied Bottleneck Link

One goal of this work is to determine whether nettimer can measure across a variety of network technologies. Dealing with different network technologies is not just a matter of dealing with different bandwidths because different technologies have different link and physical layer protocols that could affect bandwidth measurement. Using Table 4.3, we examine the Receiver Based Packet Pair results for the different technologies. This table gives the mean result over all the times and days of the TCP throughput and the Receiver Based result reported by nettimer.

Table 4.3: This table summarizes nettimer results over all the times and days. "Type" lists the different bottleneck technologies. "D" indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. "P" indicates whether the (l)ong or (s)hort path is used. "N" lists the nominal bandwidth of the technology. "TCP" lists the TCP throughput. "RBPP" lists Receiver Based Packet Pair results. "ROPP" lists the Receiver Only Packet Pair results. "SBPP" lists the Sender Based Packet Pair results. (σ) lists the standard deviation over the different traces.

High bandwidth technologies (Mb/s):

Type     | D | P | N     | TCP (σ)      | RBPP (σ)     | ROPP (σ)     | SBPP (σ)
Ethernet | t | s | 100.0 | 21.22 (0.13) | 93.61 (0.01) | 68.89 (0.32) | 63.16 (0.34)
Ethernet | t | l | 100.0 | 2.09 (0.41)  | 69.29 (0.01) | 58.73 (0.05) | 29.50 (0.16)
Ethernet | a | s | 100.0 | 19.92 (0.05) | 95.58 (0.01) | 87.21 (0.08) | 107.82 (0.02)
Ethernet | a | l | 100.0 | 1.51 (0.58)  | 95.76 (0.01) | 92.86 (0.03) | 48.23 (0.45)
Ethernet | t | s | 10.0  | 6.56 (0.06)  | 9.66 (0.00)  | 9.65 (0.00)  | 228.13 (0.17)
Ethernet | t | l | 10.0  | 1.85 (0.14)  | 9.65 (0.00)  | 9.64 (0.00)  | 941.14 (0.80)
Ethernet | a | s | 10.0  | 7.80 (0.03)  | 9.63 (0.00)  | 9.62 (0.00)  | 75.92 (0.54)
Ethernet | a | l | 10.0  | 1.66 (0.21)  | 11.82 (0.03) | 9.17 (0.09)  | 9.54 (0.06)
WaveLAN  | t | s | 11.0  | 4.67 (0.03)  | 8.67 (0.08)  | 8.18 (0.02)  | 6.13 (0.08)
WaveLAN  | t | l | 11.0  | 1.58 (0.13)  | 8.38 (0.01)  | 8.62 (0.00)  | 5.58 (0.16)
WaveLAN  | a | s | 11.0  | 5.03 (0.01)  | 6.46 (0.02)  | 5.65 (0.00)  | 9.08 (0.01)
WaveLAN  | a | l | 11.0  | 1.30 (0.23)  | 6.53 (0.05)  | 5.17 (0.03)  | 4.58 (0.07)
WaveLAN  | t | s | 2.0   | 1.38 (0.01)  | 1.49 (0.01)  | 1.47 (0.02)  | 1.52 (0.00)
WaveLAN  | t | l | 2.0   | 1.05 (0.09)  | 1.48 (0.02)  | 1.47 (0.02)  | 1.49 (0.00)
WaveLAN  | a | s | 2.0   | 1.07 (0.05)  | 1.21 (0.01)  | 1.21 (0.01)  | 1.21 (0.00)
WaveLAN  | a | l | 2.0   | 0.87 (0.26)  | 1.18 (0.01)  | 1.20 (0.01)  | 1.16 (0.06)
ADSL     | t | s | 1.5   | 1.21 (0.01)  | 1.24 (0.00)  | 1.24 (0.00)  | 1.22 (0.02)
ADSL     | t | l | 1.5   | 1.16 (0.01)  | 1.24 (0.00)  | 1.24 (0.00)  | 1.22 (0.01)

Low bandwidth technologies (Kb/s):

Type | D | P | N     | TCP (σ)       | RBPP (σ)      | ROPP (σ)      | SBPP (σ)
ADSL | a | s | 128.0 | 96.87 (0.19)  | 110.04 (0.00) | 109.34 (0.00) | 108.86 (0.00)
ADSL | a | l | 128.0 | 107.00 (0.01) | 109.71 (0.00) | 109.47 (0.00) | 107.98 (0.00)
V.34 | t | s | 33.6  | 26.43 (0.04)  | 27.03 (0.04)  | 26.94 (0.03)  | 26.73 (0.03)
V.34 | t | l | 33.6  | 26.77 (0.04)  | 27.55 (0.04)  | 27.35 (0.04)  | 27.28 (0.04)
V.34 | a | s | 33.6  | 27.98 (0.01)  | 28.84 (0.01)  | 28.44 (0.01)  | 28.99 (0.01)
V.34 | a | l | 33.6  | 28.05 (0.00)  | 28.91 (0.00)  | 28.54 (0.00)  | 28.77 (0.01)
CDMA | t | s | 19.2  | 5.30 (0.05)   | 11.04 (0.04)  | 9.93 (0.13)   | 39.55 (1.14)
CDMA | t | l | 19.2  | 5.15 (0.09)   | 10.58 (0.06)  | 9.33 (0.13)   | 11.48 (0.82)
CDMA | a | s | 19.2  | 4.76 (0.24)   | 14.83 (0.29)  | 17.04 (0.06)  | 4.09 (0.44)
CDMA | a | l | 19.2  | 3.50 (0.53)   | 14.98 (0.35)  | 11.46 (0.70)  | 19.87 (0.70)

Ethernet

For both 100Mb/s and 10Mb/s Ethernet, the TCP throughput is significantly less than the nominal bandwidth. In general, this could be caused by cross traffic, not being able to open the TCP window enough, bottlenecks in the disk, inefficiencies in the operating system, and/or the encryption used by the scp application. In the long path cases, the low throughput is caused by small TCP window sizes and long round trip delays. The TCP windows of both the sender and receiver default to a maximum size of 32KB. Since the round trip time on the long path is approximately 73ms and TCP has a maximum throughput of windowsize/rtt, the maximum throughput TCP can achieve on this path is 3.59Mb/s, regardless of bottleneck link bandwidth. This is a widespread and persistent problem. In addition to the 2.2 Linux kernel,

Windows 95/98, Solaris, and Linux 2.4 also default to a small maximum TCP window size. Operating system developers probably restrict the default maximum to resist denial-of-service (DoS) attacks. Some DoS attacks consist of sending many SYN packets with a spoofed source address to the victim host. The victim host does not receive a reply to most of its SYN ACK packets. Consequently, much of its memory is consumed by the buffers and structures reserved for half-open TCP sockets that are waiting to time out. Operating systems allocate socket buffers equal to the maximum window size because TCP senders are allowed to have a window's worth of packets in flight, so a buffer must be large enough to accommodate those packets or be forced to drop packets. Thus, larger window sizes allow spoofed SYN-based DoS attacks to be more effective by consuming larger amounts of the victim host's memory.

The discrepancy between the TCP throughput and the nettimer estimate shows the utility of measuring the bottleneck link bandwidth. Without nettimer, the low TCP throughput could be attributed to a low bandwidth bottleneck. Instead, nettimer shows that the low throughput is due to a problem in TCP.

Another anomaly is the 69.29Mb/s estimate for 100Mb/s Ethernet towards the bottleneck in Stanford. This is due to a bandwidth shaper implemented by the Stanford network administrators to limit bandwidth entering the Stanford graduate residences (where one of the measurement hosts was located). The administrators did this because downloads of MP3 files through Napster were consuming bandwidth allotted for academic purposes. This again shows the utility of measuring bottleneck link bandwidth. The existence of the bandwidth shaper had not been announced to the Stanford graduate residents when it was detected by nettimer.

WaveLAN/802.11b

In the WaveLAN cases, both the nettimer estimate and the TCP throughput estimate deviate significantly from the nominal. Another study [BPSK96] reports a peak TCP throughput over 2Mb/s WaveLAN of 1.39Mb/s. In general, this may be attributed to electromagnetic interference, signal attenuation, cross traffic, and link layer overhead. The technical specification for the wireless equipment claims a


maximum range of 300m in an open space with attenuation effects beginning at 30m. We took the traces with a distance of less than 2m between the wireless node and the base station. There were no other obvious sources of electromagnetic radiation nearby. We took measurements early in the morning when there was little cross traffic, with similar results to those shown in Table 4.3. The remaining explanation is link layer overhead. The WaveLAN documentation [wav00] indicates that the 11Mb/s nominal bandwidth is the raw signaling rate of the 802.11b protocol. Link layer headers and channel access protocols consume at least 17% of this. In practice, the actual bottleneck link bandwidth is no more than 9.13Mb/s.

The discrepancy between the 11Mb/s nominal bandwidth reported in WaveLAN's marketing and sales literature and the actual link bandwidth is another example of the utility of nettimer. Network technology manufacturers have an economic incentive to mis-report the bandwidths of their technologies. Only tools like nettimer can discover this kind of deception.

Another anomaly is that the nettimer measured WaveLAN bandwidths are consistently higher in the direction towards the bottleneck (and the wireless client) than away from it. The hardware in the PCMCIA NICs used in the host and the base station is identical. This asymmetry in bandwidth is due to an asymmetry in the channel access protocol that gives priority to packets sent by the base station over packets sent by the wireless client.

ADSL/V.34

Although nettimer is able to measure the appropriate bandwidth in each direction of the asymmetric ADSL link, the bandwidths consistently deviate from the nominal by 15%-17%. Since the TCP throughput is very close to the nettimer measured bandwidth, this deviation is most likely due to the overhead from PPP headers and byte-stuffing (Pacific Bell/SBC ADSL uses PPP over Ethernet) and the overhead of encapsulating PPP packets in ATM (Pacific Bell/SBC ADSL modems use ATM to communicate with their switch). Link layer overhead is also the likely cause of the deviation in V.34 results.


CDMA

The CDMA results exhibit an asymmetry similar to the WaveLAN results. However, we believe that the base station hardware is different from our client transceiver, and this may explain the difference. It may also be due to an interference source close to the client and hidden from the base station. In addition, since the TCP throughputs are far from both the nominal and the nettimer measured bandwidth, the deviation may be due to nettimer measurement error.

4.7.3 Resistance to Cross Traffic

One key issue with using Packet Pair Property-based bottleneck link bandwidth measurement is whether filtering techniques can compensate for the effect of cross traffic. Table 4.3 shows that the filtering techniques described in Section 4.3 can compensate for the cross traffic along cross-country paths between academic networks during peak usage times.

We would expect that the long paths would have more cross traffic interference than the short paths because there are more routers where there is an opportunity for cross traffic queueing. In addition, there is more opportunity for cross traffic interference for paths leading away from the bottleneck than towards the bottleneck because queueing after the bottleneck is more likely to interfere than queueing before the bottleneck. As a result, to determine the extent of cross traffic interference, we examine the long paths leading away from the bottleneck (the "a l" rows of Table 4.3). There is no systematic difference between the "a l" rows and the other rows, indicating that nettimer's filtering algorithms are able to filter out the effect of cross traffic when using RBPP.

Although nettimer is accurate for the cases described above, there are many other cases where that may not be so. For example, a busy web server may have highly correlated packet sizes and arrival times, which would violate the Packet Pair Property's assumptions. Another example is a saturated network link shared by many flows. Such a link may rarely transmit two packets of the same flow back-to-back, which is a requirement of the Packet Pair Property. This is also the case


with any round-robin multi-channel link technology (e.g. ISDN). However, we did not encounter any of these situations in our testing.

4.7.4 Different Packet Pair Techniques

In this section, we examine the relative accuracy of the different Packet Pair techniques. The right two columns of Table 4.3 show the Receiver-Only and Sender-Based results averaged over all the traces. Sender Based Packet Pair is not particularly accurate, reporting 20%-1000% of the nominal bandwidth, even on the short paths. As described in Section 4.2.1, this is most likely the result of interference from cross traffic in the reverse path and TCP delayed acknowledgements. An active SBPP technique could compensate for these problems. Receiver Only Packet Pair is almost as accurate as RBPP, even away from the bottleneck, which gives 17-18 hops of opportunity for post-bottleneck queueing. This means that when a distributed packet capture server cannot be deployed at a remote host, and most of the packets are sent with a high bandwidth (i.e. filtering using the received/sent bandwidth ratio is unnecessary), ROPP can still give accurate results. The main disadvantage of ROPP is that, unlike RBPP, it cannot deal with packets sent at a low rate using the sent/received filtering algorithm described in Section 4.5.

4.7.5 Agility

One key advantage of using the Packet Pair Property to measure bottleneck link bandwidth is that it converges quickly. Figure 4.7 shows the bandwidth that nettimer reports at the beginning of a connection using RBPP and ROPP. The TCP throughput is shown for comparison.

Figure 4.7: This graph shows the bandwidth reported by nettimer using RBPP and ROPP as a function of time. The measurements come from a long path towards a 100Mb/s Ethernet bottleneck. The Y-axis shows the bandwidth in b/s on a log scale. The X-axis shows the number of seconds since the connection began.

For the first 2.25 seconds of the connection, TCP is setting up the connection and scp is authenticating and setting up the encryption. At 2.38 seconds, the first large data packets arrive at the receiver. At this point, 3453 bytes have been received and nettimer reports a bandwidth of 74Mb/s. After some fluctuation, nettimer converges to 96Mb/s at 2.55 seconds. This is 170ms after the first large data packet


arrives. In contrast, the TCP throughput does not converge until 4 seconds have elapsed, or 1.62 seconds after the first large data packet arrives. This is 9.53 times slower than nettimer. Converging within 3453 bytes would allow an adaptive web server to measure bandwidth using just the text portion of most web pages and then adapt its images based on that measurement.

4.8 Conclusion

In this chapter, we derive the Packet Pair Property from the multi-packet model. We use this property as the basis for a technique to measure the bottleneck link bandwidth along a path. We describe techniques for generating, gathering, and filtering bandwidth samples. We show that these techniques allow measurement in real time, from traffic that is passively or actively generated, while varying the number and location of measurement points, and while varying the bottleneck link technology. We present measurements while varying the following:

• path length
• distance of the bottleneck from the destination
• whether the bottleneck link is wired or wireless
• symmetry of bottleneck bandwidth
• whether the bottleneck link is point-to-point or shared media
• bottleneck bandwidth (from 14Kb/s to 100Mb/s)

The following are areas for future research:

Active SBPP: The problems with passive measurement using SBPP suggest that active SBPP would be more effective. Active SBPP combined with the filtering techniques described here could be nearly as accurate as the passive RBPP results.


Active ROPP: In some cases, a user or application would like to measure the bandwidth to a local host without deploying software at the remote host and without existing traffic from the remote host to the local host. One solution would be to cause the remote host to send traffic suitable for measurement [Sav99]. Hosts may restrict this capability in the future because it is also the basis for denial-of-service attacks.

Continuous Measurement: Although we describe continuous measurement techniques in Section 4.6 that are implemented in nettimer, we did not comprehensively measure their effectiveness under a variety of traffic loads.

Parameterized Filtering: It may be possible to derive a model of the effectiveness of the Packet Pair filtering techniques under certain assumptions about the distribution of cross traffic packet sizes and arrival times.

Chapter 5

Conclusions

We conclude by summarizing our contributions and proposing areas for future work.

5.1 Summary

Measuring bandwidth in the Internet is challenging for the following reasons:

Heterogeneity of links: We can make few assumptions about links because of the heterogeneity of link technologies in the Internet. In Section 4.7, we measure bandwidth across seven different link technologies, each with varying characteristics. Since the different technologies vary over five orders of magnitude, the bandwidth estimate could have a large error. Also, the asymmetry of some links requires measuring a link in both directions. Finally, link latencies differ, and this difference must be distinguished from the delay caused by varying bandwidth.

Heterogeneity of traffic: Links with bandwidths that vary by several orders of magnitude can also carry traffic that varies by the same amount. This adds another source of highly variable delay that can interfere with bandwidth measurement.

No Router Help: Since we measure from end hosts, we cannot know with certainty


what is happening at the links along a path. We can observe how traffic is perturbed when it reaches the end hosts. However, other traffic along the path that is not related to the link we wish to measure can also perturb the measurement traffic and we must deal with this interference.

Difficulty of deployment at both sender and receiver: Users and organizations can usually deploy new software at their own hosts, but can only rarely deploy software at the host at the other end of their communications. This is because the other host is rarely under their administrative control. Consequently, a measurement technique must be able to cope with only observing the transmission or reception of packets.

Need to minimize measurement probe traffic: Sending massive amounts of measurement probe traffic across a link defeats the purpose of doing the measurement in the first place: to transmit data more efficiently across that link. Active techniques send probe traffic. Passive techniques listen to existing traffic. For many networks and applications, it would be preferable to use passive techniques or active techniques that send minimal probe traffic. However, existing traffic may be insufficient in quantity or quality for passive techniques. Similarly, fewer active probe packets may not be sufficient to filter out the interference from cross traffic.

Route and Link Changes: Route and link characteristic changes in the Internet may cause link bandwidth estimates to become invalid. Route changes cause the set of links along a path to change and therefore the bandwidth along that path to change. Stationary hosts may change routes as frequently as once a day [Pax97b]. Mobile hosts may change location much more frequently than that, causing their routes to other hosts to change. Link characteristics may change for wireless users because of changes in the distance from hosts to base stations and changing interference. For example, the nominal bandwidth of 802.11b links changes from 11Mb/s to 5Mb/s to 2Mb/s as the distance from the host to the base station increases.


Route and link changes interact poorly with the need to minimize measurement probe traffic. Coping with frequent changes requires measuring frequently, which may increase the amount of probe traffic.

We address these challenges by using a variety of end-to-end techniques:

Packet Tailgating: We introduce the Packet Tailgating technique to estimate all the link bandwidths along a path. Packet Tailgating requires no modifications to existing routers and requires fewer probe packets than existing active techniques that address the same problem. We analytically derive the Packet Tailgating technique from a deterministic model of packet delay, showing that it applies to every packet-switched, store-and-forward, FCFS-queueing network. Using measurements on Internet paths, we show that although Packet Tailgating requires 50% fewer probe packets than existing techniques, the estimates of all current end-to-end techniques (including packet tailgating) can deviate from the nominal by as much as 100%.

Analytical Derivation of the Packet Pair Property: To address the problem of measuring just the bottleneck bandwidth along a path, we analytically derive the Packet Pair Property and show that it applies to every packet-switched, store-and-forward, FCFS-queueing network. The Packet Pair Property had previously been empirically shown to be valid in the Internet.

Receiver Only Packet Pair: To address the problem of only being able to deploy software at one host, we present the Receiver Only Packet Pair technique. Using the Packet Pair Property, this passive technique allows measuring the bottleneck bandwidth along a path without measurements at the sending host.

Adaptive Kernel Density Estimation Filtering: To cope with the heterogeneity of link bandwidths and cross traffic loads, we develop a kernel density estimation-based technique to filter Packet Pair samples. We show in simulation and measurements that it is robust across bottleneck link bandwidths and cross traffic that vary by five orders of magnitude.


Potential Bandwidth Filtering: To cope with the problem of having poorly conditioned traffic for doing passive measurement, we develop the Potential Bandwidth Filtering algorithm to filter Packet Pair samples. We show in simulation and measurements that it is robust across a wide variety of traffic conditions.

5.2 Future Directions

The following are areas for future research:

Understanding the Accuracy of Link Bandwidth Measurement: The results from Section 3.5 show that none of the existing techniques to measure all the link bandwidths along a path are accurate. This indicates that the likely source of inaccuracy is a disconnect between the models and the way routers and links actually behave in a real network. A key area of future research is understanding the source of the discrepancy and building more realistic models.

Link Layer Specific Measurement: The results from Section 3.5 show that link layer generic measurement may not be sufficiently accurate for some applications. An investigation of how the different link layer technologies perturb the tailgating technique might discover a way to correct the error systematically.

IP Measurement Infrastructure: The techniques we present use end-to-end measurement, which means they are easier to deploy and more difficult to subvert than infrastructure-based techniques. However, infrastructure-based techniques can be more accurate. The challenge is to design an incrementally deployable and trustable infrastructure.

Active SBPP: The problems with passive measurement using SBPP suggest that active SBPP would be more effective. Active SBPP combined with the filtering techniques described here could be nearly as accurate as the passive RBPP results.

Active ROPP: In some cases, a user or application would like to measure the bandwidth to a local host without deploying software at the remote host and without


existing traffic from the remote host to the local host. One solution would be to cause the remote host to send traffic suitable for measurement [Sav99]. Hosts may restrict this capability in the future because it is also the basis for denial-of-service attacks.

Continuous Measurement: Although we describe continuous measurement techniques in Section 4.6 that are implemented in nettimer, we did not comprehensively measure their effectiveness under a variety of traffic loads.

Parameterized Filtering: It may be possible to derive a model of the effectiveness of the packet pair filtering techniques under certain assumptions about the distribution of cross traffic packet sizes and arrival times (e.g., Poisson packet interarrival times).

5.3 Availability

All of the software developed in the course of this thesis and associated documentation are available at the following address: http://mosquitonet.stanford.edu/~laik/projects/nettimer/

Appendix A

Distributed Packet Capture Architecture

In this appendix, we describe our distributed packet capture architecture. Accurate bandwidth measurement requires some form of distributed network monitoring to collect packet sizes, transmission times, and arrival times. A distributed packet capture architecture gathers this packet information from different nodes in the network at a processing host, where reports about the same packet arriving at or leaving from different nodes are matched. This matching is necessary to determine both the transmission time and arrival time of each packet. Given that reports may arrive out of order and that packets may be duplicated, matching with complete accuracy may be impossible, but an architecture should at least minimize inaccuracies. In addition, the system should measure and collect the timing granularities of the monitoring nodes so that clients can estimate the error in their calculations.

Our goals for such an architecture are to 1) have flexibility in the kinds of calculations done so that we can change bandwidth measurement algorithms without having to redeploy software at monitoring nodes, 2) minimize resources (CPU cycles, memory, network bandwidth) consumed, and 3) minimize the latency between the observation of a packet and its use in a calculation so that we can do the measurement in as close to real time as possible.

APPENDIX A. DISTRIBUTED PACKET CAPTURE ARCHITECTURE

113

In the following sections, we describe our approach, design, and measurements.

A.1 Approach

Our approach is to push analysis functionality out of the network monitors and into the clients of those monitors. Packet capture servers only capture and filter packet headers and forward them to packet capture clients, which do all of the analysis. The advantages of this approach are fewer CPU cycles consumed on the packet capture servers, more flexible analysis, and greater security. The cost is the network bandwidth consumed by forwarding packet information (consisting of packet headers and timing information). Delegating the analysis functionality to the packet capture clients reduces the compute burden on the packet capture servers because the analysis is typically CPU intensive (as is the case with the filtering algorithms described in Section 4.3). This is especially important if the packet capture server is collocated with other servers (e.g., a web server). Delegating to the clients also allows more flexibility in analysis because only the client needs to be modified to deploy new analysis algorithms. Finally, delegating to the clients avoids the possible security problems of allowing client code to be run on the server. In addition, some operating systems (e.g., UNIX) require that packet capture code run with root privileges. By separating the client and server code, only the server needs to run with root privilege while the client can run as a normal user process. This reduces the amount of code that must run as root, and consequently the possibility that a bug allows an attacker to gain root privilege.

A.2 Design and Implementation

In this section, we describe the design and implementation of our distributed packet capture architecture. We structure our implementation as a distributed packet capture library (libdpcap) built on top of a local packet capture library (libpcap) [MJ93]. As a result, the nettimer tool can measure live in the Internet or from tcpdump traces. The libdpcap library implements the creation of distributed packet capture servers (DPCapServers) and clients (DPCapClients), the measurement of timing precision, and the matching of packet information.

typedef struct DPCapClientConfig {
    uint32_t version_major;    /* Client's major version number */
    uint32_t version_minor;    /* Client's minor version number */
    uint32_t version_tertiary; /* Client's tertiary version number */
} DPCapClientConfig;

Figure A.1: This C structure defines the format of the configuration information sent from the client to the server.

typedef struct DPCapServerConfig {
    uint32_t version_major;    /* Server's major version number */
    uint32_t version_minor;    /* Server's minor version number */
    uint32_t version_tertiary; /* Server's tertiary version number */
    uint32_t cap_len;          /* Len of data to capture from packets. */
    uint32_t clock_resolution; /* Maximum clock resolution. */
} DPCapServerConfig;

Figure A.2: This C structure defines the format of the initial configuration packet sent from a server to a client.

A.2.1 DPCap Packet Formats

Before describing the Application Programming Interface (API) of the library, we describe the format of the packets that the client and server exchange. Figure A.1 shows the format of the initial packet from the client to the server. The client's version information allows the server to decide if it will support a client of the specified version. Figure A.2 shows the format of the server's response. The server's version information allows the client to decide if it will support a server of the specified version. The cap_len field specifies the maximum amount of data that the server will capture from a single packet. The clock_resolution field specifies the timing granularity of the server's packet timings.


typedef struct DPCapPacketInfo {
    struct timespec time_stamp;    /* When the packet was captured. */
    unsigned short capture_length; /* The amount of data captured. */
    unsigned short flags;          /* PacketCaptureFlagType flags. */
    /* The captured packet data goes here. */
} DPCapPacketInfo;

Figure A.3: This C structure defines the format of data packets sent from a server to a client. Data packets contain captured packet data.

The data that a server sends to a client consists of a stream of records. Each record contains information about one captured packet and is formatted as shown in Figure A.3. The time_stamp field contains the time at which the packet was captured. The capture_length field is the amount of captured data; this can be less than the maximum amount that a server allows if the packet's length is shorter than the maximum. The flags field contains information about how the packet was captured.
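To make the record stream concrete, the following sketch shows how a client might walk a buffer of these variable-length records. It is a minimal sketch under our own assumptions: that the wire layout matches the C structure above with no compiler padding and that byte ordering is handled elsewhere. The walk_records() helper is ours, not part of the libdpcap API.

#include <stddef.h>
#include <time.h>

/* Mirrors Figure A.3; we assume the wire layout matches this structure. */
typedef struct DPCapPacketInfo {
    struct timespec time_stamp;    /* When the packet was captured. */
    unsigned short capture_length; /* The amount of data captured. */
    unsigned short flags;          /* PacketCaptureFlagType flags. */
    /* capture_length bytes of captured packet data follow. */
} DPCapPacketInfo;

/* Hypothetical helper (not part of libdpcap): walk a buffer of records and
 * hand each captured header to a callback. Returns the number of complete
 * records consumed; a trailing partial record is left for the next read. */
static size_t
walk_records(const unsigned char *buf, size_t len,
             void (*handle)(const DPCapPacketInfo *info,
                            const unsigned char *data))
{
    size_t off = 0, count = 0;

    while (off + sizeof(DPCapPacketInfo) <= len) {
        /* A real implementation would copy into an aligned structure. */
        const DPCapPacketInfo *info = (const DPCapPacketInfo *)(buf + off);
        size_t rec_len = sizeof(DPCapPacketInfo) + info->capture_length;

        if (off + rec_len > len)
            break;                 /* partial record: wait for more data */
        handle(info, buf + off + sizeof(DPCapPacketInfo));
        off += rec_len;
        count++;
    }
    return count;
}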

A.2.2 Creating a DPCapServer

To create a DPCapServer, the programmer fills in the structure shown in Figure A.4 and calls the DPCapServerNew() function. The most interesting fields are send_thresh, send_interval, filter_cmd, and cap_len. send_thresh is the number of bytes of packet information the server will buffer before sending them to the client. This should usually be at least the TCP maximum segment size to minimize the number of less-than-full-size packets sent. Setting this to larger values also reduces the number of calls to the send() system call. send_interval is the amount of time to wait before sending the buffered packet headers regardless of whether there is enough data to fill a packet. This prevents packet information from languishing at the server waiting for enough data to exceed send_thresh. The server sends the buffer when send_interval or send_thresh is exceeded. send_interval allows the administrator of the server to control the trade-off between minimizing the latency of packet information and maximizing network bandwidth utilization efficiency.


typedef struct DPCapServerDef {
    KLEventManager *manager;                  /* KLEvent manager */
    const char *listen_service;               /* Port the server is listening on */
    unsigned int send_thresh;                 /* # bytes to buffer before sending */
    struct timespec send_interval;            /* Time to wait before sending */
    const char *filter_cmd;                   /* Filters packets */
    int cap_len;                              /* Length of packets to capture */
    int verbosity;                            /* Verbosity of status messages */
    DPCapServerInitFinishedFun init_finished; /* Callback when server init finished */
    void *init_finished_client_data;          /* Data for callback */
    DPCapServerType type;                     /* Read traces or live capture */
    union {
        const char *trace_name;               /* File to read traces from */
        const char *interface;                /* Net interface to capture from */
    } type_data;
} DPCapServerDef;

Figure A.4: This C structure is used to specify parameters in the creation of a distributed packet capture server.

Using the libpcap filter language, filter_cmd specifies which packets the server should capture. This can cut down on the amount of unnecessary data sent to the clients. For example, to capture only TCP packets between cs.stanford.edu and eecs.harvard.edu, the filter_cmd would be “host cs.stanford.edu and host eecs.harvard.edu and tcp”. cap_len specifies how much of each packet to capture. A larger number allows interpretation of higher layer information, but increases network bandwidth consumption.
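For illustration, a minimal sketch of server creation follows. It assumes a DPCapServerNew() call that takes a pointer to the definition and returns a non-NULL handle on success, and a DPCapServerTypeLive constant for live capture; those names, the port number, and the interface name are our assumptions, since only DPCapServerDef itself is documented above.

#include <stddef.h>
#include <string.h>

/* Minimal sketch; assumes the DPCapServerDef and KLEventManager types from
 * Figure A.4 are in scope. The prototype of DPCapServerNew() and the name
 * of the live-capture type constant are assumptions, not documented API. */
int start_example_server(KLEventManager *manager)
{
    DPCapServerDef def;

    memset(&def, 0, sizeof(def));
    def.manager        = manager;
    def.listen_service = "7070";      /* hypothetical port for clients */
    def.send_thresh    = 1460;        /* roughly one full TCP segment of reports */
    def.send_interval.tv_sec  = 1;    /* flush buffered reports at least once a second */
    def.send_interval.tv_nsec = 0;
    def.filter_cmd     = "host cs.stanford.edu and host eecs.harvard.edu and tcp";
    def.cap_len        = 60;          /* link, IP, and TCP headers */
    def.init_finished  = NULL;        /* no initialization callback in this sketch */
    def.type           = DPCapServerTypeLive;  /* assumed constant name */
    def.type_data.interface = "eth0"; /* capture live from this interface */

    return DPCapServerNew(&def) != NULL ? 0 : -1;  /* assumed to return a handle */
}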

A.2.3 Creating a DPCapClient

To start a libdpcap client, the application specifies a set of servers to connect to and its own filter_cmd. The client sends this filter_cmd to each server to which it connects, which further restricts the types of packet headers that the client receives.

A.2.4 Measuring Timing Granularity

After a client connects to a server, the server responds with its timing granularity. Granularity is important because it allows the client to estimate the error in its calculations. Different machines and operating systems may have very different timing granularities when capturing packets. For example, Linux kernels before version 2.2.0 have a granularity of 10 ms, while kernels at version 2.2.0 or later have a granularity of less than 20 microseconds, a difference of almost a factor of a thousand. This can make a significant difference in the accuracy of a calculation. The servers measure timing granularity by sending 200 packets through the loopback interface and measuring their transmission and arrival times. The smallest non-zero interval between successive events of the same type (transmission or arrival) is taken as the timing granularity. This gives a lower bound on the possible error.
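The granularity estimate itself is simple; the sketch below shows the computation for an array of capture (or transmission) timestamps, assuming they are struct timespec values like those in the packet records. The function name and the unit are our choices; nettimer's internal interface may differ.

#include <time.h>

/* Given the timestamps of the loopback probe packets (200 in nettimer's
 * case), return the smallest non-zero gap between successive events of the
 * same type, in nanoseconds. */
static long long min_nonzero_gap_ns(const struct timespec *ts, int n)
{
    long long best = -1;

    for (int i = 1; i < n; i++) {
        long long gap =
            (long long)(ts[i].tv_sec - ts[i - 1].tv_sec) * 1000000000LL +
            (ts[i].tv_nsec - ts[i - 1].tv_nsec);
        if (gap > 0 && (best < 0 || gap < best))
            best = gap;
    }
    return best;   /* -1 if every pair of successive events is identical */
}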

A.2.5 Packet Information Matching

Matching packet information at the client requires solving two difficult problems: determining which packet reports correspond to each other and determining the order of the packet capture servers in a packet's path. Each server gives the client packet reports with the stipulations that the reports may be received in a different order than they were generated and that the reports may contain only a limited amount of packet header. We limit the amount of packet header that reports contain to reduce bandwidth consumption. Given this information, the client must determine which reports correspond to the same packet. This is difficult because the shorter packet headers are, the more likely it is that two packet headers from different packets will be identical. One solution is to use parts of the packet header to distinguish packets. However, there are complications in practice. Several parts of the IP and TCP header of the same packet may change as the packet travels a path in the Internet. The IP time-to-live (TTL) must be decremented at each hop along a path. In addition, the IP header checksum depends on the TTL, so it changes too. Various schemes co-opt the IP identification field and change its value along a path.


Some Point-to-Point Protocol over Ethernet (PPPoE) gateways rewrite the TCP maximum segment size (MSS) field. They do this because PPPoE consumes a few additional bytes in each packet for its header between the Ethernet and IP headers, thus reducing the IP maximum transmission unit (MTU) size along a path. Normally path MTU probing would handle this, but that relies on ICMP, which some firewalls block. In the same vein, some routers rewrite the TCP window size field as a form of congestion control. Finally, IP Network Address Translation (NAT) gateways rewrite even the IP source and destination addresses and the TCP source and destination ports. Consequently, it is difficult to design a solution that works everywhere in the Internet. Instead, our solution works in the portion of the Internet that does not use NAT. We uniquely identify packets based on the IP source and destination addresses and some transport protocol specific information. For TCP, this is the source and destination port, the sequence number, the acknowledgement sequence number, and the segment length. This distinguishes between all packets, except those that have passed through a NAT gateway, exact retransmissions, and packets that have wrapped around the sequence number space. In practice, only the packets that have passed through a NAT gateway have been a problem because a significant number of hosts are behind NAT gateways. Two possible future solutions to this problem are explicit matching and a modified TCP checksum. For explicit matching, the user specifies the IP address of a host and the IP address of the NAT gateway the host is behind. This would mainly be effective for static, controlled situations where there are few hosts to be measured. Another solution is to use a modified TCP checksum. The unmodified TCP checksum is not sufficient to accurately match packet information because it is computed using the IP source and destination addresses, the TCP source and destination port numbers, and TCP options (e.g., the MSS). As noted above, some types of gateways modify this information, thus causing a packet to have a different TCP checksum when received than it did when sent. The solution is to compute a modified TCP checksum by subtracting these fields from the original TCP checksum. Since the TCP checksum is computed over the entire payload, the likelihood of a collision is small, assuming that payloads are not likely to be the same. This would allow working around NAT gateways, at a cost of a greater probability of collision.
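For concreteness, the sketch below spells out the TCP matching identifier described above as a C structure with an equality test; the structure layout and the comparison function are our illustration rather than libdpcap's internal representation.

#include <stdbool.h>
#include <stdint.h>

/* The fields we use to match TCP packet reports, as listed above. */
typedef struct TCPMatchKey {
    uint32_t src_addr;  /* IP source address */
    uint32_t dst_addr;  /* IP destination address */
    uint16_t src_port;  /* TCP source port */
    uint16_t dst_port;  /* TCP destination port */
    uint32_t seq;       /* TCP sequence number */
    uint32_t ack;       /* TCP acknowledgement number */
    uint16_t seg_len;   /* TCP segment length */
} TCPMatchKey;

/* Two reports refer to the same packet when every field matches. As noted
 * in the text, this fails only for NAT-translated packets, exact
 * retransmissions, and sequence number wraparound. */
static bool same_packet(const TCPMatchKey *a, const TCPMatchKey *b)
{
    return a->src_addr == b->src_addr && a->dst_addr == b->dst_addr &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->seq == b->seq && a->ack == b->ack &&
           a->seg_len == b->seg_len;
}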

A.2.6 Flow Definition

Many applications organize packet information into flows. The libdpcap library already organizes packet information into flows to do efficient matching, so it is convenient to export that functionality to applications. The main issue with doing so is how a flow is defined. The libdpcap library defines a flow to be packets that have the same (source IP address, destination IP address) tuple (a network level flow). We could also have defined it to be packets that have the same (source IP address, source port number, destination IP address, destination port number) tuple (a transport level flow). If the Internet were truly end-to-end and adhered to its original design and intent, then there would be no disadvantage to using network level flows. Network level flows have the advantage of being able to aggregate the traffic of multiple transport level flows (e.g., TCP connections) so that the application has more packets to work with. However, the Internet has evolved since its conception, so that NAT gateways have become common. In the presence of NAT gateways, a single network level flow could be composed of several transport level flows, each with very different performance characteristics. For example, suppose an external host is connected to a NAT gateway via a 10Mb/s link. An internal host is connected to the NAT gateway via a 56Kb/s link and another internal host is connected via a 1Mb/s link. Both internal hosts are communicating with the external host. Aggregating the packet information of both internal hosts into one network level flow would present conflicting information about the bottleneck bandwidth experienced by the external host. A transport level flow avoids the problem by segregating the packets of different transport sessions. The libdpcap library implements network level flows because when we started implementing, NAT gateways were not widespread while popular WWW browsers would open several short TCP connections with servers. As with packet matching above, a possible future feature is to allow users of the library to explicitly specify whether network or transport level flows should be used.
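The two flow definitions correspond to the following keys (the structure names are our illustration); libdpcap implements the first.

#include <stdint.h>

/* Network level flow: all packets between a pair of hosts. */
typedef struct NetworkFlowKey {
    uint32_t src_addr;   /* source IP address */
    uint32_t dst_addr;   /* destination IP address */
} NetworkFlowKey;

/* Transport level flow: packets of a single transport session. */
typedef struct TransportFlowKey {
    uint32_t src_addr;   /* source IP address */
    uint16_t src_port;   /* source port number */
    uint32_t dst_addr;   /* destination IP address */
    uint16_t dst_port;   /* destination port number */
} TransportFlowKey;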

A.2.7 Bandwidth Consumption

As mentioned in Section A.1, the main disadvantage of our approach is the bandwidth consumed by forwarding the packet information. We estimate this by computing the size of each report: cap_len + sizeof(timestamp) (8 bytes) + sizeof(capture_length) (2 bytes) + sizeof(flags) (2 bytes). For TCP traffic, nettimer needs at least 40 bytes of packet header. In addition, the report includes the link layer header, which varies in size for different link technologies. To be safe, we set the capture length to 60 bytes, so each libdpcap packet report consumes 72 bytes. 20 of these headers fit in a 1460 byte TCP payload, so one component of the overhead is 1500/(20 × 1500) = 5.00% per client. In addition, a server captures its own traffic because it does not distinguish between the data packets and its own packet header traffic, so it captures the headers of packets containing the headers of packets containing headers and so on. This can be modeled as an infinite geometric series:

\[
\sum_{i=1}^{\infty} \left(\frac{1}{20}\right)^{i}.
\]

We simplify:

\[
\sum_{i=1}^{\infty} \left(\frac{1}{20}\right)^{i}
= \sum_{i=0}^{\infty} \left(\frac{1}{20}\right)^{i} - 1
= \frac{1}{1 - \frac{1}{20}} - 1
\approx 0.0526.
\]

We experimentally verify this cost in Section A.3. On a heavily loaded network or for many clients, this could be a problem.
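The arithmetic above can be checked with a few lines of C; the constants are the ones used in this section.

#include <stdio.h>

/* Reproduces the overhead arithmetic above: 72-byte reports, 20 reports per
 * full-size report packet, plus the self-capture geometric series. */
int main(void)
{
    const double report_len      = 60 + 8 + 2 + 2;   /* capture + timestamp + length + flags */
    const double reports_per_pkt = 20.0;             /* reports per 1460-byte TCP payload */
    const double first_order     = 1.0 / reports_per_pkt;                     /* 5.00% */
    const double with_self       = 1.0 / (1.0 - 1.0 / reports_per_pkt) - 1.0; /* ~5.26% */

    printf("report size: %.0f bytes\n", report_len);
    printf("overhead per client, ignoring self-capture: %.2f%%\n", 100.0 * first_order);
    printf("overhead per client, with self-capture:     %.2f%%\n", 100.0 * with_self);
    return 0;
}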


Table A.1: This table shows the CPU cycles consumed by nettimer and the application it is measuring (scp). “User” lists the user-level CPU seconds consumed. “System” lists the system CPU seconds consumed. “Elapsed” lists the elapsed time that the program was running. “% CPU” lists (User + System) / scp Elapsed time.

Name     User   System   Elapsed   % CPU
server   .31    .43      32.47     4.52%
client   9.28   .15      26.00     57.6%
scp      .050   .21      16.37     1.59%

However, if we are only interested in a pre-determined subset of the traffic, we can use the packet filter to reduce the number of packet reports. Another solution is to use compression, although we do not implement this. Compression of a stream of packet headers can achieve high compression ratios because of the high degree of redundancy between successive packets in the same flow. One study [LYBS00] shows that packet headers can be compressed by a factor of 5. This would reduce the overhead to roughly 1% per client. Another solution is to multicast the packet information to the clients. This would make the total overhead 5% regardless of the number of clients.

A.3 Measurements

In this section, we report measurements of the resources consumed by our libdpcap implementation. We use the nettimer bandwidth measurement program to drive the libdpcap library. We deployed a libdpcap server on an otherwise unloaded 366MHz Pentium II and a libdpcap client on an otherwise unloaded 266MHz Pentium II. The two machines were four hops apart with a 100Mb/s Ethernet bottleneck link. We used scp to copy a 7,476,723 byte file between the two machines while nettimer was running and measured the CPU time and bandwidth consumed.

Table A.1 lists the CPU resources consumed by each of the components. The CPU cycles consumed by the distributed packet capture server are negligible, even for a 366MHz processor on a 100Mb/s link. The distributed packet capture client does consume a substantial number of CPU seconds to classify packets into flows and run the filtering algorithm. Transferring the packet headers from the libdpcap server to the client consumed 473,926 bytes. Given that the file transferred is 7,476,723 bytes, the overhead is 6.34%. This is higher than the 5.26% predicted in Section A.2 because 1) scp transfers some extra data for connection setup, 2) some data packets are retransmitted, and most significantly, 3) scp sends data packets that are less than the TCP maximum segment size. These results show that the compute burden of doing analysis can be 50% or more of the cycles on a typical machine. On the other hand, the extra bandwidth consumed by forwarding packet headers is typically less than 7% per client of the bandwidth of the flows being measured. This shows that our approach is viable when additional flexibility in analysis, reduction in compute burden on the server, and/or additional security are more important than a few percent increase in consumed bandwidth.

Appendix B

Separated Nettimer Bottleneck Results

The tables in this appendix show the nettimer results broken down by measurement period. We were not able to gather traces for every technology for every time period. The failures were caused by human error, network failure, and/or host failure. Failed traces are shown in the tables as rows where all of the bandwidths are 0.


Table B.1: This table shows the 18:07 PST 12/01/2000 nettimer results. “Type” lists the different bottleneck technologies. “D” indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. “P” indicates whether the (l)ong or (s)hort path is used. “N” lists the nominal bandwidth of the technology. “TCP” lists the TCP throughput. “RBPP” lists Receiver Based Packet Pair results. “ROPP” lists the Receiver Only Packet Pair results. “SBPP” lists the Sender Based Packet Pair results. Each of the “Error” columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.

High bandwidth technologies (Mb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
Ethernet  t  s  100   24     93 (.19)    .3 (.57)    31.1 (.68)  1.8 (21)    27 (.02)     .1 (.17)
Ethernet  t  l  100   3.0    69 (.12)    .4 (.18)    61.8 (.28)  .7 (3.4)    31 (1.1)     9.0 (1.2)
Ethernet  a  s  100   21     96 (.06)    .3 (.19)    84.2 (.17)  .4 (.65)    100 (.21)    .7 (.30)
Ethernet  a  l  100   2.1    95 (.05)    .3 (.18)    89.4 (.13)  6.4 (29)    50 (.56)     2.3 (2.2)
Ethernet  t  s  10    6.8    9.7 (.03)   .6 (1.2)    9.6 (.04)   1.4 (2.9)   160 (.35)    3.3 (4.4)
Ethernet  t  l  10    2.3    9.6 (.03)   14 (.39)    9.6 (.04)   2.5 (5.8)   2000 (.43)   45 (2.7)
Ethernet  a  s  10    7.8    9.6 (.04)   .1 (.97)    9.6 (.04)   .2 (7.9)    51 (2.2)     19 (.52)
Ethernet  a  l  10    2.3    12 (.27)    1.2 (.33)   8.9 (.26)   3.1 (16)    9.6 (1.1)    5.5 (.22)
WaveLAN   t  s  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
WaveLAN   t  l  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
WaveLAN   a  s  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
WaveLAN   a  l  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
WaveLAN   t  s  2     1.4    1.5 (.02)   .4 (.48)    1.5 (.03)   .2 (1.1)    1.5 (.00)    25 (.72)
WaveLAN   t  l  2     1.1    1.5 (.03)   .5 (.45)    1.5 (.04)   .3 (2.3)    1.5 (.05)    .5 (.35)
WaveLAN   a  s  2     1.0    1.2 (.03)   .2 (.31)    1.2 (.03)   .2 (1.3)    1.2 (.09)    .3 (.20)
WaveLAN   a  l  2     .41    1.2 (.02)   .3 (.16)    1.2 (.04)   .4 (.08)    1.0 (.09)    .3 (.17)
ADSL      t  s  1.5   1.2    1.2 (.03)   .2 (.90)    1.2 (.03)   .1 (1.3)    1.3 (.05)    .1 (.10)
ADSL      t  l  1.5   1.2    1.2 (.03)   .3 (.43)    1.2 (.04)   .2 (6.2)    1.2 (.06)    .1 (.53)

Low bandwidth technologies (Kb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
ADSL      a  s  128   65     110 (.03)   .1 (1.40)   109 (.04)   .1 (1.6)    1100 (.06)   .2 (.58)
ADSL      a  l  128   107    110 (.02)   .1 (1.50)   110 (.04)   .1 (1.4)    1100 (.06)   .2 (.23)
V.34      t  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
V.34      t  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
V.34      a  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
V.34      a  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
CDMA      t  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
CDMA      t  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
CDMA      a  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
CDMA      a  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)


Table B.2: This table shows the 16:36 PST 12/02/2000 nettimer results. “Type” lists the different bottleneck technologies. “D” indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. “P” indicates whether the (l)ong or (s)hort path is used. “N” lists the nominal bandwidth of the technology. “TCP” lists the TCP throughput. “RBPP” lists Receiver Based Packet Pair results. “ROPP” lists the Receiver Only Packet Pair results. “SBPP” lists the Sender Based Packet Pair results. Each of the “Error” columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.

High bandwidth technologies (Mb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
Ethernet  t  s  100   18     93 (.18)    1.2 (.19)   81 (.30)    2.9 (22)    8.5 (.33)    1.6 (.40)
Ethernet  t  l  100   2.4    7.4 (.10)   .4 (.24)    6.5 (.30)   .6 (1.3)    26 (1.4)     18 (1.5)
Ethernet  a  s  100   19     94 (.13)    .3 (.33)    78 (.21)    .4 (.36)    110 (.19)    1.2 (.28)
Ethernet  a  l  100   2.2    95 (.05)    .3 (.16)    92 (.09)    18 (2.5)    67 (.25)     1.5 (2.2)
Ethernet  t  s  10    6.7    9.7 (.03)   1.2 (.96)   9.6 (.03)   .7 (3.4)    260 (.26)    .5 (.63)
Ethernet  t  l  10    1.5    9.7 (.03)   1.1 (.52)   9.6 (.05)   .8 (14)     13 (1.1)     51 (1.8)
Ethernet  a  s  10    8.2    9.6 (.03)   .1 (.76)    9.6 (.03)   .2 (.22)    79 (1.7)     9.4 (.57)
Ethernet  a  l  10    1.6    12 (.43)    1.4 (.34)   8.5 (.29)   3.2 (17)    1.7 (2.0)    8.4 (.17)
WaveLAN   t  s  11    4.7    8.3 (.11)   .3 (.17)    7.9 (.25)   .3 (.20)    6.9 (3.0)    1.4 (.38)
WaveLAN   t  l  11    1.5    8.3 (.05)   .3 (.13)    8.6 (.26)   .3 (.17)    4.7 (.33)    5.5 (.42)
WaveLAN   a  s  11    5.1    6.6 (.06)   .2 (.15)    5.6 (.09)   .4 (18)     9.1 (.27)    .3 (.43)
WaveLAN   a  l  11    1.5    6.4 (.16)   .4 (.18)    5.3 (.19)   .8 (.30)    5.0 (.17)    .5 (.20)
WaveLAN   t  s  2     1.4    1.5 (.05)   .6 (.33)    1.5 (.03)   .2 (.85)    1.5 (.03)    .2 (.31)
WaveLAN   t  l  2     1.0    1.5 (.07)   .5 (.41)    1.5 (.04)   .3 (3.2)    1.5 (.06)    1.0 (.49)
WaveLAN   a  s  2     1.1    1.2 (.02)   .2 (.63)    1.2 (.04)   .2 (3.0)    1.2 (.00)    .2 (.41)
WaveLAN   a  l  2     1.0    1.2 (.02)   .1 (.39)    1.2 (.05)   .3 (13)     1.2 (.03)    .2 (.14)
ADSL      t  s  1.5   1.2    1.2 (.03)   4.2 (.39)   1.2 (.03)   1.2 (.87)   1.2 (.05)    .1 (.33)
ADSL      t  l  1.5   1.2    1.2 (.03)   4.8 (.32)   1.2 (.03)   1.1 (1.0)   1.2 (.05)    .2 (.28)

Low bandwidth technologies (Kb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
ADSL      a  s  128   107    110 (.02)   1.5 (.21)   110 (.04)   .8 (3.5)    110 (.02)    .1 (.63)
ADSL      a  l  128   107    110 (.03)   1.6 (.28)   110 (.03)   .2 (2.1)    110 (.03)    .1 (.57)
V.34      t  s  33.6  26     27 (.06)    .2 (.94)    26 (.05)    .1 (1.3)    26 (.10)     .4 (.09)
V.34      t  l  33.6  28     29 (.06)    .2 (.99)    20 (.09)    .1 (1.3)    28 (.11)     .3 (.11)
V.34      a  s  33.6  28     29 (.07)    .2 (.79)    29 (.07)    .2 (.82)    29 (.06)     .3 (.10)
V.34      a  l  33.6  28     29 (.05)    .4 (.11)    29 (.10)    .2 (2.3)    29 (.01)     .2 (.11)
CDMA      t  s  19.2  5.4    11 (.14)    .3 (.15)    1.8 (.28)   1.1 (2.9)   9.1 (.42)    27 (1.4)
CDMA      t  l  19.2  5.5    11 (.14)    .3 (.10)    1.6 (.32)   .9 (2.6)    1.5 (.2)     .5 (.98)
CDMA      a  s  19.2  6.2    20 (.10)    .3 (.10)    18 (.26)    1.4 (3.7)   3.3 (1.2)    42 (.48)
CDMA      a  l  19.2  6.0    19 (.14)    .4 (.90)    18 (.27)    .8 (2.3)    40 (2.9)     48 (.53)


Table B.3: This table shows the 11:07 PST 12/04/2000 nettimer results. “Type” lists the different bottleneck technologies. “D” indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. “P” indicates whether the (l)ong or (s)hort path is used. “N” lists the nominal bandwidth of the technology. “TCP” lists the TCP throughput. “RBPP” lists Receiver Based Packet Pair results. “ROPP” lists the Receiver Only Packet Pair results. “SBPP” lists the Sender Based Packet Pair results. Each of the “Error” columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.

High bandwidth technologies (Mb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
Ethernet  t  s  100   19     95 (.07)    .2 (.51)    82 (.28)    2.9 (23)    69 (.32)     1.8 (.30)
Ethernet  t  l  100   2.3    7.1 (.06)   .4 (.11)    59 (.28)    15 (15)     24 (1.3)     23 (1.4)
Ethernet  a  s  100   19     96 (.10)    .3 (.28)    94 (.12)    4.0 (29)    11 (.15)     1.1 (.19)
Ethernet  a  l  100   1.8    97 (.03)    .3 (.10)    96 (.06)    .3 (.08)    63 (.30)     .9 (.46)
Ethernet  t  s  10    6.3    9.7 (.03)   3.2 (1.0)   9.7 (.03)   .8 (2.6)    210 (.34)    1.1 (.81)
Ethernet  t  l  10    1.9    9.7 (.02)   5.6 (1.3)   9.6 (.04)   1.2 (2.6)   260 (1.3)    4.7 (2.4)
Ethernet  a  s  10    7.9    9.6 (.02)   .1 (.65)    9.6 (.03)   .2 (.21)    12 (2.4)     18 (.47)
Ethernet  a  l  10    1.3    12 (.37)    1.4 (.33)   1.9 (1.3)   2.5 (13)    9.1 (.06)    4.7 (.25)
WaveLAN   t  s  11    4.7    8.3 (.08)   .3 (.19)    8.3 (.20)   .2 (.22)    5.9 (.06)    12 (.34)
WaveLAN   t  l  11    1.8    8.5 (.10)   .3 (.14)    8.6 (.23)   .4 (11)     7.1 (3.1)    7.5 (.38)
WaveLAN   a  s  11    5.1    6.3 (.09)   .2 (.23)    5.7 (.08)   .3 (4.7)    8.9 (.26)    .3 (.28)
WaveLAN   a  l  11    1.5    7.1 (.24)   .4 (.25)    5.2 (.20)   1.2 (.47)   4.4 (.24)    .5 (.19)
WaveLAN   t  s  2     1.4    1.5 (.04)   .6 (.24)    1.5 (.03)   .3 (.82)    1.5 (.03)    .7 (.50)
WaveLAN   t  l  2     1.2    1.5 (.05)   .7 (.27)    1.5 (.04)   .3 (3.4)    1.5 (.04)    1.7 (.25)
WaveLAN   a  s  2     1.1    1.2 (.01)   .2 (.49)    1.2 (.04)   .4 (3.4)    1.2 (.02)    .2 (.32)
WaveLAN   a  l  2     .99    1.2 (.02)   .1 (.41)    1.2 (.04)   .3 (2.7)    1.2 (.00)    .2 (.14)
ADSL      t  s  1.5   1.2    1.2 (.02)   .6 (1.3)    1.2 (.03)   .2 (2.2)    1.2 (.07)    .3 (.16)
ADSL      t  l  1.5   1.2    1.2 (.03)   .4 (.55)    1.2 (.04)   .2 (4.5)    1.2 (.04)    .1 (.35)

Low bandwidth technologies (Kb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
ADSL      a  s  128   110    110 (.03)   .1 (.70)    110 (.04)   .1 (.99)    110 (.03)    .1 (.62)
ADSL      a  l  128   110    110 (.03)   .1 (1.1)    110 (.04)   .1 (.88)    110 (.04)    .2 (.21)
V.34      t  s  33.6  26     26 (.05)    .2 (.95)    26 (.05)    .1 (1.3)    26 (.10)     .4 (.08)
V.34      t  l  33.6  26     26 (.07)    .2 (.96)    26 (.10)    .2 (2.2)    26 (.11)     .4 (.08)
V.34      a  s  33.6  28     28 (.05)    .2 (.80)    28 (.08)    .1 (1.0)    29 (.04)     .2 (.30)
V.34      a  l  33.6  28     29 (.04)    .2 (.69)    29 (.10)    .2 (2.7)    29 (.07)     .3 (.13)
CDMA      t  s  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
CDMA      t  l  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
CDMA      a  s  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)
CDMA      a  l  0     0      0 (.00)     0 (.00)     0 (.00)     0 (.00)     0 (.00)      0 (.00)


Table B.4: This table shows the 18:39 PST 12/04/2000 nettimer results. “Type” lists the different bottleneck technologies. “D” indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. “P” indicates whether the (l)ong or (s)hort path is used. “N” lists the nominal bandwidth of the technology. “TCP” lists the TCP throughput. “RBPP” lists Receiver Based Packet Pair results. “ROPP” lists the Receiver Only Packet Pair results. “SBPP” lists the Sender Based Packet Pair results. Each of the “Error” columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.

High bandwidth technologies (Mb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
Ethernet  t  s  100   24     94 (.18)    .3 (.66)    8.9 (.27)   .4 (.57)    77 (.31)     1.2 (.34)
Ethernet  t  l  100   .69    68 (.16)    .4 (.26)    54 (.48)    59 (3.6)    37 (1.2)     32 (1.8)
Ethernet  a  s  100   21     97 (.05)    .3 (.19)    92 (.13)    .4 (.23)    110 (.15)    .9 (.30)
Ethernet  a  l  100   .01    96 (.09)    .4 (.07)    95 (.15)    4.5 (2.0)   13 (1.6)     180 (.77)
Ethernet  t  s  10    5.9    9.7 (.03)   13 (.60)    9.7 (.03)   1.6 (2.7)   240 (.33)    .6 (.35)
Ethernet  t  l  10    1.8    9.6 (.04)   2.3 (2.0)   9.6 (.04)   .9 (2.7)    1600 (.66)   15 (3.7)
Ethernet  a  s  10    7.4    9.6 (.02)   .1 (.62)    9.6 (.03)   .2 (.30)    120 (1.3)    16 (.75)
Ethernet  a  l  10    1.6    11 (.33)    1.7 (.31)   8.8 (.34)   2.1 (12)    9.2 (.05)    5.2 (.36)
WaveLAN   t  s  11    4.4    8.2 (.09)   .3 (.14)    8.3 (.24)   .2 (.19)    5.8 (.06)    6.0 (.49)
WaveLAN   t  l  11    1.7    8.4 (.07)   .3 (.18)    8.6 (.24)   .3 (.14)    5.3 (.63)    6.0 (.60)
WaveLAN   a  s  11    5.0    6.3 (.08)   .2 (.19)    5.6 (.10)   .3 (4.4)    9.2 (.26)    .3 (.34)
WaveLAN   a  l  11    1.4    6.2 (.26)   .5 (.96)    4.9 (.21)   1.3 (1.3)   4.1 (.39)    .7 (.82)
WaveLAN   t  s  2     1.4    1.5 (.06)   .8 (.15)    1.5 (.03)   .3 (.81)    1.5 (.04)    .7 (.15)
WaveLAN   t  l  2     .96    1.5 (.03)   .6 (.27)    1.5 (.04)   .4 (1.1)    1.5 (.04)    .9 (.30)
WaveLAN   a  s  2     1.1    1.2 (.02)   .2 (.63)    1.2 (.03)   .5 (4.0)    1.2 (.02)    .2 (.28)
WaveLAN   a  l  2     1.0    1.2 (.03)   .1 (.41)    1.2 (.06)   .3 (9.2)    1.2 (.02)    .2 (.13)
ADSL      t  s  1.5   1.2    1.2 (.03)   1.6 (.91)   1.2 (.04)   .4 (2.5)    1.2 (.05)    .1 (.35)
ADSL      t  l  1.5   1.1    1.2 (.03)   .1 (.41)    1.2 (.04)   .1 (6.9)    1.2 (.05)    .1 (.64)

Low bandwidth technologies (Kb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
ADSL      a  s  128   108    11 (.02)    .0 (2.4)    110 (.04)   .0 (4.7)    110 (.03)    .1 (.81)
ADSL      a  l  128   107    110 (.02)   .1 (1.1)    110 (.04)   .1 (1.2)    110 (.02)    .1 (.43)
V.34      t  s  33.6  26     27 (.04)    .2 (.85)    26 (.07)    .1 (1.4)    26 (.09)     .4 (.08)
V.34      t  l  33.6  26     26 (.08)    .2 (.95)    26 (.10)    .2 (2.1)    26 (.09)     .3 (.11)
V.34      a  s  33.6  28     29 (.04)    .1 (.97)    29 (.08)    .2 (1.3)    29 (.06)     .3 (.12)
V.34      a  l  33.6  28     29 (.04)    .2 (.43)    29 (.08)    .2 (.70)    29 (.07)     .3 (.13)
CDMA      t  s  19.2  5.5    11 (.16)    .3 (.72)    1.8 (.29)   .6 (2.0)    100 (2.4)    25 (1.4)
CDMA      t  l  19.2  5.5    1.7 (.29)   .4 (.23)    9.7 (.44)   1.4 (2.6)   24 (4.2)     510 (.84)
CDMA      a  s  19.2  3.5    16 (.53)    1.6 (1.5)   17 (.43)    3.1 (2.9)   2.4 (.40)    3.0 (.50)
CDMA      a  l  19.2  3.1    18 (.25)    .4 (.24)    16 (.47)    4.8 (2.9)   9.5 (1.1)    3.0 (1.0)


Table B.5: This table shows the 12:00 PST 12/05/2000 nettimer results. “Type” lists the different bottleneck technologies. “D” indicates whether the bandwidth is being measured (a)way from or (t)owards the bottleneck end of the path. “P” indicates whether the (l)ong or (s)hort path is used. “N” lists the nominal bandwidth of the technology. “TCP” lists the TCP throughput. “RBPP” lists Receiver Based Packet Pair results. “ROPP” lists the Receiver Only Packet Pair results. “SBPP” lists the Sender Based Packet Pair results. Each of the “Error” columns lists the error estimate of nettimer for the appropriate technique. (σ) lists the standard deviation over the length of the trace.

High bandwidth technologies (Mb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
Ethernet  t  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
Ethernet  t  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
Ethernet  a  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
Ethernet  a  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
Ethernet  t  s  10    7.1    9.7 (.03)   7.6 (1.2)   9.7 (.03)   1.8 (3.0)   270 (.15)    .6 (6.1)
Ethernet  t  l  10    1.7    9.7 (.03)   12 (.55)    9.6 (.04)   1.6 (3.2)   650 (1.4)    61 (1.8)
Ethernet  a  s  10    7.7    9.6 (.03)   .1 (.75)    9.6 (.03)   .2 (.20)    120 (1.3)    9.9 (.70)
Ethernet  a  l  10    1.5    12 (.38)    1.4 (.26)   8.7 (.33)   1.5 (1.6)   9.2 (.04)    5.7 (.34)
WaveLAN   t  s  11    4.8    9.8 (.30)   .3 (.24)    8.2 (.25)   .3 (.20)    5.9 (.06)    14 (.35)
WaveLAN   t  l  11    1.3    8.3 (.04)   .3 (.14)    8.7 (.23)   .3 (.18)    5.2 (.29)    11 (.95)
WaveLAN   a  s  11    5.0    6.5 (.10)   .2 (.37)    5.7 (.09)   .3 (1.0)    9.1 (.27)    .3 (.39)
WaveLAN   a  l  11    .79    6.4 (.18)   .4 (.20)    5.2 (.25)   1.2 (2.2)   4.8 (.23)    .6 (.18)
WaveLAN   t  s  2     1.4    1.5 (.07)   .6 (.25)    1.5 (.03)   .3 (.81)    1.5 (.04)    .4 (.18)
WaveLAN   t  l  2     .96    1.5 (.07)   .6 (.38)    1.5 (.04)   .4 (3.5)    1.5 (.11)    1.1 (.62)
WaveLAN   a  s  2     1.1    1.2 (.01)   .2 (.52)    1.2 (.04)   .2 (4.6)    1.2 (.02)    .2 (.30)
WaveLAN   a  l  2     .95    1.2 (.02)   .1 (.38)    1.2 (.05)   .3 (9.8)    1.2 (.08)    .2 (.17)
ADSL      t  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
ADSL      t  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)

Low bandwidth technologies (Kb/s):
Type      D  P  N     TCP    RBPP (σ)    Error (σ)   ROPP (σ)    Error (σ)   SBPP (σ)     Error (σ)
ADSL      a  s  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
ADSL      a  l  0     0      .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)    .0 (.00)     .0 (.00)
V.34      t  s  33.6  28     29 (.05)    .2 (.79)    29 (.06)    .1 (1.5)    28 (.09)     .3 (.13)
V.34      t  l  33.6  28     29 (.07)    .2 (.92)    28 (.09)    .1 (2.3)    28 (.11)     .4 (.08)
V.34      a  s  33.6  28     29 (.05)    .2 (.91)    28 (.09)    .1 (1.9)    30 (.11)     .3 (.16)
V.34      a  l  33.6  28     29 (.04)    .3 (.37)    29 (.08)    .2 (.71)    29 (.04)     .3 (.12)
CDMA      t  s  19.2  5.0    1.4 (.35)   .4 (1.3)    8.1 (.65)   5.4 (1.5)   6.1 (.96)    1.3E5 (1.3)
CDMA      t  l  19.2  4.5    9.7 (.44)   .7 (1.9)    7.7 (.73)   1.1 (1.4)   .4 (3.6)     1.0E4 (1.2)
CDMA      a  s  19.2  4.6    9.3 (.62)   1.7 (1.2)   16 (.46)    3.6 (2.3)   6.6 (.93)    23 (1.6)
CDMA      a  l  19.2  1.5    7.6 (1.2)   21.6 (1.1)  .1 (.42)    38 (.53)    1.6 (1.0)    .4 (.84)

Bibliography

[ABKM01] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. Resilient overlay networks. In Symposium on Operating Systems Principles, pages 131–145, 2001. URL citeseer.nj.nec.com/andersen01resilient.html.

[AS92]

K. Agrawala and D. Sanghi. Network dynamics: An experimental study of the internet. In Proceedings of GLOBECOM, December 1992. URL citeseer.nj.nec.com/agrawala92network.html.

[BA00]

Suman Banerjee and Ashok Agrawala. Estimating available capacity of a network connection. In IEEE International Conference on Networks, September 2000.

[Bel92]

Steve M. Bellovin. A Best-Case Network Performance Model. 1992. URL http://www.research.att.com/~smb/papers/netmeas.ps.

[Bol93]

Jean-Chrysostome Bolot. End-to-End Packet Delay and Loss Behavior in the Internet. In Proceedings of ACM SIGCOMM, 1993.

[BPK97]

Hari Balakrishnan, Venkata N. Padmanabhan, and Randy H. Katz. The effects of asymmetry on TCP performance. In Mobile Computing and Networking, pages 77–89, 1997. URL citeseer.nj.nec.com/261063.html.

[BPSK96]

Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, and Randy Katz. A Comparison of Mechanisms for Improving TCP Performance over Wireless Links. In Proceedings of ACM SIGCOMM, 1996.


[Bue00]


Maryanne Murray Buechner. Cell phone nation. Time Magazine, 5(4), August 2000. URL http://www.time.com/time/digital/magazine/articles/0,4753,50423,00.html.

[CC96a]

Robert L. Carter and Mark E. Crovella. Dynamic Server Selection using Bandwidth Probing in Wide-Area Networks. Technical Report BU-CS-96-007, Boston University, 1996.

[CC96b]

Robert L. Carter and Mark E. Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Technical Report BU-CS-96-006, Boston University, 1996.

[CFSD90]

J.D. Case, M. Fedor, M.L. Schoffstall, and C. Davin. Simple network management protocol (snmp), May 1990. RFC 1157.

[CO02]

K. G. Coffman and Andrew M. Odlyzko. Handbook of Massive Data Sets, chapter Internet growth: Is there a ”Moore’s Law” for data traffic?, pages 47–93. Kluwer, 2002. URL citeseer.nj.nec.com/277050.html.

[Dow99]

Allen B. Downey. Using pathchar to Estimate Internet Link Characteristics. In Proceedings of ACM SIGCOMM, 1999.

[DR99]

Kaivalya Dixit and Jeff Reilly. SPEC CPU95 Q&A, September 1999. URL http://www.specbench.org/osg/cpu95/qanda.html.

[DRM01]

Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. What do packet dispersion techniques measure? In Proceedings of IEEE INFOCOM, April 2001.

[dsl00]

DSL. 2000. URL http://www.pacbell.com/DSL.

[ET93]

Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[FF96]

Kevin Fall and Sally Floyd. Simulation-based comparisons of tahoe, reno, and sack tcp. ACM Computer Communications Review, 1996.

[FGBA96] Armando Fox, Steven D. Gribble, Eric A. Brewer, and Elan Amir. Adapting to Network and Client Variability via On-Demand Dynamic Distillation. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.

[Inc02]

Cisco Systems Inc. Understanding EtherChannel Load Balancing and Redundancy on Catalyst Switches, April 2002. URL http://www.cisco.com/warp/public/473/4.html.

[itu00]

ITU. 2000. URL http://www.itu.int.

[Jac88]

Van Jacobson. Congestion Avoidance and Control. In Proceedings of ACM SIGCOMM, 1988.

[Jac97a]

Van Jacobson. pathchar. 1997. URL ftp://ftp.ee.lbl.gov/pathchar/.

Van Jacobson. pathchar – a tool to infer characteristics of Internet paths. Presented at the Mathmatical Sciences Research Institute, 1997.

[Kan01]

Gene Kan. Gnutella. In Andy Oram, editor, Peer-to-Peer: Harnessing the Power of Disruptive Technologies, pages 94–122. O’Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472, first edition, March 2001.

[Kes91a]

Srinivasan Keshav. A Control-Theoretic Approach to Flow Control. In Proceedings of ACM SIGCOMM, 1991.

[Kes91b]

Srinivasan Keshav. Congestion Control in Computer Networks. PhD thesis, University of California, Berkeley, August 1991.

[Kle75]

Leonard Kleinrock. Queueing Systems, Volume I: Theory. John Wiley and Sons, 1975.

BIBLIOGRAPHY

[Kle76]

132

Leonard Kleinrock. Queueing Systems, Volume II: Computer Applications. John Wiley and Sons, 1976.

[KP87]

Phil Karn and Craig Partridge. Improving Round-Trip Time Estimates in Reliable Transport Protocols. Computer Communication Review, 17 (5):2–7, August 1987.

[LB99]

Kevin Lai and Mary Baker. Measuring Bandwidth. In Proceedings of IEEE INFOCOM, March 1999.

[LB00]

Kevin Lai and Mary Baker. Measuring Link Bandwidths Using a Deterministic Model of Packet Delay. In Proceedings of ACM SIGCOMM, August 2000.

[LYBS00]

Jeremy Lilley, Jason Yang, Hari Balakrishnan, and Srinivasan Seshan. A unified header compression framework for low-bandwidth links. In Proceedings of MobiCOM, pages 131–142, 2000. URL citeseer.nj.nec. com/lilley00unified.html.

[Mah00]

Bruce A. Mah. pchar. 2000. URL http://www.ca.sandia.gov/\verb+ ~+bmah/Software/pchar/.

[MJ93]

Steve McCanne and Van Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proceedings of the 1993 Winter USENIX Technical Conference, 1993.

[MJ98]

G. Robert Malan and Farnam Jahanian. An Extensible Probe Architecture for Network Protocol Performance Measurement. In Proceedings of ACM SIGCOMM, 1998.

[MRA87]

Jeffery C. Mogul, Richard F. Rashid, and Michael J. Accetta. The packet filter: An efficient mechanism for user-level network code. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP), volume 21, pages 39–51, 1987. URL citeseer.nj.nec. com/mogul87packet.html.

133

BIBLIOGRAPHY

[MSMO97] Matthew Mathis, Jeffery Semke, Jamshid Mahdavi, and Teunis Ott. The macroscopic behavior of the TCP Congestion Avoidance algorithm.

Computer Communications Review, 27(3), July 1997.

URL

citeseer.nj.nec.com/mathis97macroscopic.html. [net01]

Nielsen//NetRatings. 2001. URL http://www.netratings.com.

[Pax97a]

Vern Paxson. End-to-End Internet Packet Dynamics. In Proceedings of ACM SIGCOMM, 1997.

[Pax97b]

Vern Paxson. Measurements and Analysis of End-to-End Internet Dynamics. PhD thesis, University of California, Berkeley, April 1997.

[RM99]

Sylvia Ratnasamy and Steven McCanne. Inference of multicast routing trees and bottleneck bandwidths using end-to-end measurements. In Proceedings of IEEE INFOCOM ’99, 1999.

[Sai94]

S. Sain. Adaptive kernel density estimation, 1994. URL citeseer.nj. nec.com/sain94adaptive.html.

[Sav99]

Stefan Savage. Sting: a TCP-based Network Measurement Tool. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, 1999.

[Sco92]

Dave Scott. Multivariate Density Estimation: Theory, Practice and Visualization. Addison Wesley, 1992.

[spr00]

Sprint PCS. 2000. URL http://www.sprintpcs.com/.

[Ste94]

W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley Publishing Company, 1994.

[Ste96]

W. Richard Stevens. TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols. Addison-Wesley Publishing Company, 1996.

BIBLIOGRAPHY

[SZ99]

134

Ion Stoica and Hui Zhang. Providing guaranteed services without per flow management. In Proceedings of ACM SIGCOMM, 1999.

[Wal00]

Steven Waldbusser. Remote network monitoring management information base, may 2000. rfc2819.

[wav00]

WaveLAN. 2000. URL http://www.wavelan.com/.

[web00]

Automatic bandwidth delay product discovery. 2000. URL http://www. web100.org/docs/bdp.discovery.php.

[WJ95]

Matt P. Wand and M. Chris Jones. Kernel Smoothing. Chapman and Hall, 1995.

[WRWF96] Mustapa Wangsaatmadja, Windiaprana Ramelan, Bradley Williamson, and Craig Farrell. Remote packet capture using rmon. In AUUG 96 and Asia Pacific World Wide Web, 1996.