
An Optimal Median Calculation Algorithm for Estimating Internet Link Delays From Active Measurements

Dima Feldman and Yuval Shavitt
Tel-Aviv University, Ramat Aviv 69978, Israel

Abstract. Delay estimation in the Internet can improve the performance of many applications, e.g., web browsing, peer-to-peer applications and distributed games. For this purpose it was suggested to build an Internet distance service that can efficiently supply applications with delay information based on an Internet delay map. This can be achieved by deploying a large-scale measurement infrastructure, such as the DIMES project, where internal delay information is extracted from end-to-end measurements. We suggest estimating internal Internet link delays by the median of the differences between the measurements to the link's end points. We suggest a very efficient algorithm for this calculation, which works in linear time and constant space; prove its correctness; and compare its performance to the much slower intuitive algorithms on real Internet data.

Key words: link delay, median, median estimation

1 Introduction

As the Internet has evolved rapidly in the last decade, so has the interest in measuring and studying its structure. The delay between end nodes of the network is an important parameter that influences the experience of applications such as web browsing and online games. Knowledge of the delays between network nodes is also likely to improve the network utilization of peer-to-peer and content distribution applications. Several works [1–5] suggested using end-to-end delays between a small number of nodes to estimate the distance from a host to other hosts.

The DIMES project [6, 7] produces a vast amount of measurement results from thousands of locations worldwide. DIMES software agents that reside on volunteer host machines use both traceroute and ping measurements to discover the Internet connectivity and internal delays. The results of these measurements are stored in a MySQL [8] database in raw format.

All the works that suggested using Internet delay estimation [1–5] assume that obtaining link delays from measurements is a trivial task. We claim that this task is not trivial, especially if we want to estimate the delay on an internal Internet link for which we do not have a direct end-to-end measurement. Thus, we suggest using the median of the differences between the traceroute measurements to the link end-nodes as a good estimate of the delay on a link. More importantly, if one aims at building an Internet-wide delay infrastructure, the calculation of millions of link delays from many measurements needs to be efficient, especially when done in parallel; this is the major contribution of this paper.

In this work we use DIMES results to estimate link delays in the core of the network. We have developed a robust and effective technique for link delay estimation using a fast and memory-efficient algorithm for median estimation. We tested our algorithm on one US-wide provider, and show that the results agree with direct, inefficient delay calculations and with geographic distances.

2 Link delay estimation and Median calculation

Link delay is composed of queueing delay, transmission delay, and propagation delay. The transmission delay is negligible for today's high-speed links and the queueing delay is dynamic in many cases; thus, propagation delay is the stable and important measure for many applications. The propagation delay is linearly related to the geographic line length¹ and is roughly equal to 1 ms of RTT between points at a distance of 100 km. In measurements of RTT delays (e.g., using traceroute), there is additional noise due to delays caused by the CPU load on the responding nodes. When measuring the propagation delay, we can treat the queueing delay and the delay due to CPU load as additive positive noise [9].
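The rule of thumb above can be turned into a small worked example. The following is only a sketch: the function names and the constant `KM_PER_MS_RTT` are our own illustrative choices, and the additive-noise reading (measured RTT difference = propagation part + non-negative noise) is the simplified model described in this section.

```python
# Rule of thumb from the text: roughly 1 ms of RTT per 100 km of line length.
KM_PER_MS_RTT = 100.0  # illustrative constant name, not from the paper


def min_rtt_ms(line_length_km: float) -> float:
    """Approximate propagation RTT (in ms) for a link of the given line length."""
    if line_length_km < 0:
        raise ValueError("line length must be non-negative")
    return line_length_km / KM_PER_MS_RTT


def additive_noise_ms(measured_rtt_ms: float, line_length_km: float) -> float:
    """Part of a measured RTT difference not explained by propagation delay.

    Under the additive positive noise model this covers queueing delay and
    CPU load on the responding routers.
    """
    return measured_rtt_ms - min_rtt_ms(line_length_km)
```

For instance, under this rule a 2500 km line corresponds to about 25 ms of RTT, so a measured difference of 31 ms on such a link would leave about 6 ms attributed to noise.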

Fig. 1. Link Delay Measurement

Currently, the DIMES project has over 11,000 agents spread over 91 countries, on all continents except Antarctica [7]. Agents continuously download script files that can be tailored to their location. The script includes a list of commands for the agent (such as ping and traceroute) and target IP addresses. The use of traceroute gives us the RTT delay from an agent to each one of the hosts on the route to some designated address. It should be noted that we have no control over the route to the destination; therefore, by tracerouting to a host on the other side of the globe, the agent measures delays to the intermediate hosts along a route selected by both the intra- and inter-AS routing algorithms. This route may not be unique, for example due to load balancing, or may change during the measurement.

¹ Geographic line length should be distinguished from the air distance since fibers usually follow railways and highways.

Our goal in this paper is to estimate the delay between each pair of nodes (routers) that appear consecutively in any traceroute measurement. The straightforward method for link delay estimation between two nodes S and D, as shown in Figure 1, is the difference of the minimum RTTs as measured by the agents. For agent $j$,

$$\mathrm{delay}_j(S, D) = \min(RTT(D_j)) - \min(RTT(S_j)).$$

To combine the results of multiple agents, we chose a weighted median algorithm, which gives a larger weight to an agent that made more measurements of a specific link². This method, which we term the Min Min algorithm, has two major disadvantages: non-revertability of the data and a high storage requirement³. In a dynamic network, a topology change in layer 2 may cause a layer 3 link delay to change. Since $a = \min(a, b)$ is a non-increasing function, it might never reflect a change that increases the delay. The other problem is the storage requirement for the calculation. In the two-day period of DIMES measurements we examined, each network edge was measured by 10 agents on average, which forces us to keep on average 10 records for each edge.

The alternative method we propose is a statistical analysis of the $RTT(D_i) - RTT(S_i)$ values, where $i$ is a measurement number. Notice that we ignore the measurement source. The classical approach is averaging:

$$\mathrm{delay}(D, S) = \frac{1}{N} \sum_{i=1}^{N} \left( RTT(D_i) - RTT(S_i) \right).$$

However, for this to give a good estimate, the data should have Gaussian noise and no outliers.
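To make the Min Min procedure concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the implementation used with DIMES: the `(agent, rtt_s, rtt_d)` tuple layout and the function names are hypothetical, and for an even total weight we return the lower of the two middle values.

```python
from collections import defaultdict


def weighted_median(values, weights):
    """Median of the multiset in which values[i] is repeated weights[i] times.

    Assumption of this sketch: for an even total weight the lower middle
    value is returned.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    total = sum(weights)
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if 2 * running >= total:
            return value
    raise ValueError("weights must be positive")


def min_min_delay(measurements):
    """Min Min estimate for one link S -> D.

    measurements: iterable of (agent, rtt_s, rtt_d) tuples (hypothetical layout).
    One record (min RTT to S, min RTT to D, count) is kept per agent, which is
    the per-edge storage cost discussed in the text.
    """
    per_agent = defaultdict(lambda: [float("inf"), float("inf"), 0])
    for agent, rtt_s, rtt_d in measurements:
        record = per_agent[agent]
        record[0] = min(record[0], rtt_s)  # running minimum RTT to S
        record[1] = min(record[1], rtt_d)  # running minimum RTT to D
        record[2] += 1                     # number of measurements by this agent
    delays = [rec[1] - rec[0] for rec in per_agent.values()]
    counts = [rec[2] for rec in per_agent.values()]
    return weighted_median(delays, counts)
```

Because each per-agent record only ever decreases, the sketch also shows why the estimate cannot follow a delay increase after a topology change.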

Fig. 2. Histogram of $RTT(D_i) - RTT(S_i)$ for the 216.140.17.53 → 216.140.17.110 link. The mean of the distribution is 24.5 ms, while the median is 31 ms.

² For value and weight vectors $\bar{v}$ and $\bar{w}$, the weighted median is defined as the median of the values of $\bar{v}$ when $v_i$ is repeated $w_i$ times.
³ When working with large amounts of data, a large storage requirement also translates to non-negligible computation/access time.



It is widely agreed that for many practical measurements, especially those with ‘long tail’ distributions, the median gives a better estimate of the measured value than a linear operator such as the mean [10]. Median filtering is also widely used in image processing to remove impulse noise [11]. However, exact median calculation requires sorting the data and then selecting the middle value. The time complexity of such a calculation is O(N log(N)) and the storage complexity is O(N). This makes direct median calculation impractical for large amounts of data, such as in the DIMES case, especially when simultaneous calculations are performed.

Many median estimation methods are mentioned by Battiato et al. [12] and the references therein; some of them have a running time of O(N) but also require O(N) storage. Rousseeuw et al. [10] propose a practical algorithm for median estimation using a construction termed the Remedian, with calculation complexity of O(N log(N)) and storage complexity of O(log(N)), but it requires a predetermined number of samples. Many communication and signal processing algorithms utilize ‘windowing’ to reduce the required memory and to give a higher priority to the latest samples (and are therefore more sensitive to noise in the last samples) [13]. For link delay estimation there is no reason to give the last samples more importance than the first samples (we refer to dynamic macroscopic changes later in the paper). Thus, the ‘windowing’ approach is inappropriate.

We suggest here our Fast Algorithm for Median Estimation (FAME), which decreases the required storage for each link to only two double-precision variables while giving a good median estimate of i.i.d. samples. Our algorithm is linear in the number of samples with a small constant. Similar but simpler algorithms were suggested in [14, 15]; however, they are highly vulnerable to noise in the last samples of the data and have a significantly larger convergence time.
To illustrate the advantage of using the median in calculations of delay measurements, observe Figure 2, which shows a histogram of the delay difference for a typical link. The histogram shows the typical concentration of the samples in the vicinity of the true value and the existence of many outliers. In this case, as in many others, $\mathrm{median}(RTT(D_i) - RTT(S_i))$ gives a reliable approximation of the link delay.
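The effect of outliers on the two estimators can be reproduced with a few lines of Python. The sample values below are invented for illustration only (they are not DIMES measurements), and `summarize` is a hypothetical helper built on the standard `statistics` module, which computes the exact median by sorting.

```python
import statistics


def summarize(differences):
    """Return (mean, median) of a list of RTT(D_i) - RTT(S_i) values in ms."""
    return statistics.mean(differences), statistics.median(differences)


# Made-up differences: most samples cluster near 31 ms, two are outliers.
samples = [30, 31, 31, 32, 31, 30, 32, 31, -20, 5]
mean_ms, median_ms = summarize(samples)
print(f"mean = {mean_ms:.1f} ms, median = {median_ms} ms")
```

In this toy data set the two outliers pull the mean down to 23.3 ms while the median stays at 31, inside the cluster of typical values.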

3 The Fast Algorithm for Median Estimation

FAME uses two variables: the step size, step, and the median estimator, M. For every new data sample d, M is increased if d is larger than M and decreased if it is smaller, namely $M = M + step \cdot \mathrm{sign}(d - M)$. If the data sample d is close to M, the step is halved: if $d \in (M - step, M + step)$ then $step = step/2$.

While the presentation in this paper concentrates on pseudo code, which can be implemented in a high-level language such as C or Java, FAME can be easily and efficiently implemented in hardware or with database query languages. The hardware implementation is important for calculating the median of measurement values that arrive on the fly, e.g., in a programmable NIC. The MySQL implementation is the one we used with DIMES and is also efficient due to the use of the "insert ... on duplicate ..." statement, which enables writing to the database only when a new estimation is needed.

The proposed median estimation algorithm has two important features. First, the convergence time depends on the quality of the data: the larger the number of outliers and the larger the variance of the data, the longer it takes to reduce the step variable to a small value. Second, the step size as a function of the data size gives a qualifier for the estimation accuracy.

Algorithm 1: Fast Algorithm for Median Estimation
 1: Initialization:
 2:   M = data(1)
 3:   step = max(|data(1)/2|, b) // b is a minimal initial step
 4: For each new item i:
 5:   if M > data(i) then
 6:     M = M − step
 7:   else if M < data(i) then
 8:     M = M + step
 9:   end if
10:   if |data(i) − M| < step then
11:     step = step/2
12:   end if
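Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the MySQL implementation used with DIMES: the class name `Fame`, the method `update` and the default value of `min_step` (the parameter b) are our own choices, and we assume that the first sample only initializes the estimator and is not processed again by the update loop.

```python
class Fame:
    """Streaming median estimator that keeps only two numbers: median and step."""

    def __init__(self, first_sample: float, min_step: float = 1.0):
        # Initialization of Algorithm 1; min_step plays the role of b.
        self.median = float(first_sample)
        self.step = max(abs(first_sample / 2.0), min_step)

    def update(self, sample: float) -> float:
        """Process one new sample and return the current median estimate."""
        # Move the estimate one step towards the sample.
        if self.median > sample:
            self.median -= self.step
        elif self.median < sample:
            self.median += self.step
        # Halve the step when the sample is closer to the estimate than one step.
        if abs(sample - self.median) < self.step:
            self.step /= 2.0
        return self.median


def fame_median(samples, min_step: float = 1.0) -> float:
    """Run FAME over an iterable of samples and return the final estimate."""
    iterator = iter(samples)
    try:
        estimator = Fame(next(iterator), min_step)
    except StopIteration:
        raise ValueError("at least one sample is required") from None
    for sample in iterator:
        estimator.update(sample)
    return estimator.median
```

As a hand-checkable example, for the samples 8, 12, 9 the estimate starts at 8 with step 4, moves to 12 (step becomes 2), and then to 10 (step becomes 1).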

There are two modifications that can be applied to FAME:

- In order to eliminate overshooting in the median prediction when the number of samples is small, we apply the following change: when a new sample is in the range M ± step, M = M ± step (lines 6 and 8) is replaced with M = data(i). We term this variation "FAME NO OS".
- To allow the algorithm to follow changes in the link statistics over time, it is possible to multiply the step by 1 + ε for some small ε at every step (for a hardware implementation one can alternatively use a larger ε every few steps). This gives the algorithm a windowing flavor.

3.1 Proof sketch

In this section we show that the algorithm converges to the median and that its accuracy increases with the amount of data. The proof assumes that all samples are i.i.d.

Lemma 1. Let $C$ be a one-dimensional Markov chain with transition probabilities $P_r(i) = 1 - P(x < X(i))$ and $P_l(i) = P(x < X(i))$, as shown in Figure 3. Then the mode of the steady-state distribution $\pi(n)$ of the Markov chain is one of the two nodes adjacent to the median of the distribution $P(x < X)$, i.e., if $X(i) \le \mathrm{median}(x) \le X(i+1)$ then $i \le \operatorname{argmax}(\pi(n)) \le i + 1$.



Fig. 3. 1-D Markov chain. $P(x)$ is the probability density function of $x$. $P_l(i) = P(x < X(i))$ is the transition probability from state $i$ to state $i - 1$. $P_r(i) = 1 - P(x < X(i))$ is the transition probability from state $i$ to state $i + 1$.

Proof. Let $m$ be the median of the distribution function $P(x \le X)$, i.e., $P(m \le X) = \frac{1}{2}$. According to the Markov chain definition, $\forall X(i) > m$: $P_r(i) \le P_l(i+1)$, but in steady state $\pi(i) \cdot P_r(i) = \pi(i+1) \cdot P_l(i+1)$. Therefore the ratio of the steady-state probabilities is $\frac{\pi(i+1)}{\pi(i)} = \frac{P_r(i)}{P_l(i+1)} \le 1$. In the same manner, $\forall X(i) < m$: $\frac{\pi(i+1)}{\pi(i)} \ge 1$. Hence $i \le \operatorname{argmax}(\pi(n)) \le i + 1$ when $X(i) \le m \le X(i+1)$. □
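Lemma 1 can be checked numerically with a short Python sketch. The code below builds the steady-state distribution directly from the balance equation $\pi(i) \cdot P_r(i) = \pi(i+1) \cdot P_l(i+1)$ used in the proof. The function names, the finite state grid and the choice of an exponential distribution with median 0.5 are assumptions made for this illustration; the sketch also assumes that the states are increasing and that the CDF is positive at every state except possibly the first.

```python
import math


def exp_cdf(x: float, lam: float) -> float:
    """CDF of the exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0


def steady_state(states, cdf):
    """Steady-state distribution of the 1-D chain with P_l(i) = cdf(X(i)).

    Uses pi(i+1) = pi(i) * P_r(i) / P_l(i+1) and normalizes at the end.
    """
    weights = [1.0]
    for i in range(len(states) - 1):
        p_right = 1.0 - cdf(states[i])  # P_r(i)
        p_left = cdf(states[i + 1])     # P_l(i+1), assumed > 0
        weights.append(weights[-1] * p_right / p_left)
    total = sum(weights)
    return [w / total for w in weights]


def mode_index(pi):
    """Index of the most probable state."""
    return max(range(len(pi)), key=pi.__getitem__)


if __name__ == "__main__":
    lam = math.log(2) / 0.5               # exponential with median 0.5
    states = [0.3 * i for i in range(8)]  # X(0) = 0, X(1) = 0.3, X(2) = 0.6, ...
    pi = steady_state(states, lambda x: exp_cdf(x, lam))
    print("mode at state value", states[mode_index(pi)])
```

With this grid the median 0.5 lies between X(1) = 0.3 and X(2) = 0.6, so the lemma predicts that the mode is one of these two states.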

For a Markov chain $C$, we define the refined Markov chain $C'$. $C'$ inherits all the states of $C$ and, in addition, has exactly one state between every pair of inherited states of $C$. The transition probabilities are calculated using the same formula as in $C$.

Lemma 2. Let $X(m)$ and $X(n)$, $m < n$, be the two states with the highest steady-state probability in a Markov chain $C$, and let $C'$ be a refined Markov chain of $C$. Then $\forall i \le m$, $CDF'(i') < CDF(i)$ when $X'(i') = X(i)$. Namely, the error probability in the refined Markov chain $C'$ is smaller than the one in $C$.

The proof is omitted due to space limitations. The phenomenon of the probability density function becoming more concentrated as the number of nodes in the Markov chain increases is demonstrated in Figure 4.

Fig. 4. Steady-state CDF of Markov chains of different lengths, for the exponential distribution with $\lambda = \ln(2)/0.5 \approx 1.38629$ (median = 0.5, mean $= \lambda^{-1} \approx 0.7213$).



As can be seen in Figure 5, FAME can be treated as a multi-level 1-D Markov chain, where each level is a median estimator and the accuracy increases as the algorithm receives more data samples. The smaller the variance, the faster the step is reduced and the faster the algorithm descends the Markov chain hierarchy. As a result, the convergence speed of FAME depends on the ‘quality’ of the data.

Fig. 5. FAME state machine. Solid lines are transitions inside a single dimension of the chain; dashed lines are transitions between chains with different steps.

4 Tests on Internet data

We have applied the algorithms Min Min, FAME, FAME NO OS, and the direct median of $RTT(D_i) - RTT(S_i)$ to the raw data collected by DIMES during a period of 2 days in July 2006. We used the Min Min algorithm as a yardstick for the other methods. Figure 6 shows that for links with more than 10 measurement results, more than 95% of the results differ by less than 2 ms from the Min Min results. Comparing all the results, we can see that FAME NO OS is more accurate than FAME, and for 4 ms accuracy it is even better than the direct median calculation. Overall, both FAME and FAME NO OS are shown to calculate the median with extremely good quality.

Fig. 6. CDF of estimation error

It should be noted that the runtime of our algorithm over 32,000,000 measurements of 677,000 edges was about 1 hour, while Min Min ran for about 24 hours and took 10 times more storage for the same measurements on the same system (MySQL 5 database, 2x 2.66 GHz Intel Xeon CPU with 1 GB RAM).

Next, we performed a sanity check of the delay estimates we obtained. We selected Broadwing, an ISP with a nation-wide presence in the US, which publishes its network structure [16]. Out of all internal measurements of the Broadwing network, we selected the 178 IP edges for which we were able to perform a reverse DNS resolution of both end points. Although in some cases there is an inconsistency between DNS resolution and geographic location [17], we used manual examination and the DIMES router aliasing to reduce geographic mapping errors⁴. Then we compared the geographic distances (air-line) to the calculated link delays. Figure 7 (left) shows that for most links with a positive link delay⁵, the estimated link delay is highly correlated with the geographic distance. The histogram in Figure 7 (right) demonstrates that the majority of the results are above the theoretical limit, and that many results are in close proximity to the theoretical limit. From the layer 2 map of Broadwing [16] it is clear that in many cases the physical route is longer than the air-line distance and even than the driving distance. Thus, we conclude that our delay estimation is reasonable, though a more exhaustive study is needed.

We drew the measured links on the geographic map of the continental US to show the connectivity of the network at the IP level. In Figure 8 the line width represents the number of measurements we obtained for each link: the more data we have, the wider the line. In most cases there is more than one IP-level link between a pair of cities; therefore only the most measured link is observable.
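The comparison against the theoretical limit can be sketched as follows. This is a hedged illustration: we assume the air-line distance is computed with the haversine formula on a sphere of radius 6371 km, which the paper does not specify, and all function and constant names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch
KM_PER_MS_RTT = 100.0     # theoretical limit from the text: 100 km per 1 ms of RTT


def air_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (air-line) distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def delay_distance_ratio(delay_ms, distance_km):
    """Ratio of an estimated RTT delay to the theoretical limit for the distance.

    A value of 1 is the theoretical limit, 2 is twice the limit, and so on.
    """
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return delay_ms / (distance_km / KM_PER_MS_RTT)
```

For example, an estimated delay of 20 ms on a 1000 km link gives a ratio of 2, i.e., twice the theoretical limit, which is plausible when the fiber route is much longer than the air-line distance.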

5 Conclusion

We have shown that FAME is efficient and accurate in calculating the median of delay measurements. We also showed that the delay measurements seem to represent the propagation delay on the links well. Some links have negative delay results; most of these are small and represent short-distance links with different router response times. Some links exhibit consistently large negative delays, which can be explained, for example, by a different return path. For these links, traceroute measurements are not effective for delay estimation. With the growing ability of DIMES to measure from different vantage points, this effect might be mitigated.

⁴ Various GEOIP services were not useful in obtaining geographic locations of internal nodes of this network. In this and other networks, large portions of the IP address space were mapped to the network HQ.
⁵ Obviously there is no true link with negative delay, and in such cases the estimated delay does not represent the distance.



Fig. 7. Left: Estimated RTT delay vs. geographic link length. The lower diagonal line is the theoretical limit, assuming a maximum distance of 100 km per 1 ms of RTT; the upper line is twice the theoretical limit. Right: A histogram of the link delay/distance ratio; 1 stands for the theoretical limit, 2 means the link delay is twice the theoretical limit, etc.

Fig. 8. A map of the connections between Broadwing ‘named’ routers.



References

1. Francis, P., Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., Zhang, L.: IDMaps: A global internet host distance estimation service. IEEE/ACM Transactions on Networking 9(5) (2001) 525–540
2. Shavitt, Y., Tankel, T.: Big-Bang simulation for embedding network distances in Euclidean space. IEEE/ACM Transactions on Networking 12(6) (2004) 993–1006
3. Shavitt, Y., Tankel, T.: On the curvature of the internet and its usage for overlay construction and distance estimation. In: Infocom, Hong Kong (2004)
4. Dabek, F., Cox, R., Kaashoek, M.F., Morris, R.: Vivaldi: a decentralized network coordinate system. In: SIGCOMM. (2004) 15–26
5. Lim, H., Hou, J.C., Choi, C.H.: Constructing internet coordinate system based on delay measurement. IEEE/ACM Transactions on Networking 13(3) (2005) 513–525
6. Shavitt, Y., Shir, E.: DIMES: Let the internet measure itself. In: ACM SIGCOMM Computer Communication Review. Volume 35. (2005) 71–74
7. DIMES: http://www.netdimes.org/
8. MySQL: http://www.mysql.com/
9. Shavitt, Y., Sun, X., Wool, A., Yener, B.: Computing the unmeasured: An algebraic approach to internet mapping. IEEE Journal on Selected Areas in Communications 22(1) (2004) 67–78
10. Rousseeuw, P.J., Bassett, G.W., Jr.: The remedian: A robust averaging method for large data sets. Journal of the American Statistical Association 85(409) (1990) 97–104
11. Tukey, J.: Nonlinear (nonsuperposable) methods for smoothing data. In: Congr. Rec. EASCOM 74. (1974) 673–681
12. Battiato, S., Cantone, D., Catalano, D., Cincotti, G., Hofri, M.: An efficient algorithm for the approximate median selection problem. In: CIAC. (2000) 226–238
13. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6) (2002) 1794–1813
14. Manzanera, A., Richefeu, J.C.: A robust and computationally efficient motion detection algorithm based on sigma-delta background estimation. In: ICVGIP. (2004) 46–51
15. McFarlane, N.J.B., Schofield, C.P.: Segmentation and tracking of piglets in images. Machine Vision and Applications 8(3) (1995) 187–193
16. Broadwing network map: http://www.broadwing.com/about-b4.html
17. Zhang, M., Ruan, Y., Pai, V.S., Rexford, J.: How DNS misnaming distorts internet topology mapping. In: Proceedings of the 2006 Usenix Annual Technical Conference, Boston, MA, USA (2006)