Improving the accuracy of measurement-based ... - CiteSeerX

ARTICLE IN PRESS

Computer Networks xxx (2004) xxx–xxx www.elsevier.com/locate/comnet

Improving the accuracy of measurement-based geographic location of Internet hosts Artur Ziviani

a,*

, Serge Fdida b, Jose´ F. de Rezende c, Otto Carlos M.B. Duarte

c

a

Laborato´rio Nacional de Computaçaõ Cientı´fica (LNCC), Av. Getu´lio Vargas, 333, 25651-075 Petro´polis, Brazil Laboratoire dInformatique de Paris 6 (LIP6), Universite´ Pierre et Marie Curie (Paris 6), 8, Rue du Capitaine Scott, 75015 Paris, France Grupo de Teleinforma´tica e Automaçaõ (GTA), COPPE/EE, Universidade Federal do Rio de Janeiro, P.O. Box 68504, 21945-970 Rio de Janeiro, Brazil b

c

Received 16 July 2003; received in revised form 22 March 2004; accepted 30 August 2004

Responsible Editor: Z.-L. Zhang

Abstract Location-aware applications take into account from where the users are accessing and thereby can offer novel functionalities in the Internet. This paper focuses on improving the accuracy of a geographic location service that relies on delay measurements to locate Internet hosts. Host locations are inferred by comparing delay patterns of geographically distributed landmarks, which are hosts with a known geographic location, with the delay pattern of the target host to be located. We deal with two problems that influence the accuracy of the resulting location estimation: (i) the placement of the landmarks and the probe machines that perform the delay measurements; and (ii) how to best measure the similarity between the delay patterns of the landmarks and the one observed for the target host. For the landmark placement problem, we propose a demographic approach to improve the representativeness of each landmark with respect to the hosts to be located. Given a limited number of landmarks, results show that a demographic placement provides closer landmarks and more accurate location estimations for most hosts. Concerning the placement of probe machines, we show that they have to be sparsely placed to avoid gathering redundant data. Furthermore, we define and evaluate three similarity models. Experiments show that other similarity models outperform the commonly adopted Euclidean distance, resulting then in a more accurate geographic location of Internet hosts. 2004 Published by Elsevier B.V. Keywords: Internet host location; Measurement; Demographic placement; Similarity models

*

Corresponding author. Tel.: +55 24 2233 6199; fax: +55 24 2233 6198. E-mail addresses: [email protected], [email protected] (A. Ziviani), [email protected] (S. Fdida), [email protected] (J.F. de Rezende), [email protected] (Otto Carlos M.B. Duarte). 1389-1286/$ - see front matter 2004 Published by Elsevier B.V. doi:10.1016/j.comnet.2004.08.013

ARTICLE IN PRESS

2

A. Ziviani et al. / Computer Networks xxx (2004) xxx–xxx

1. Introduction Knowing the geographic location of an Internet host from an identification of that host, such as a name or IP address, enables a whole new class of location-aware applications. These applications take into account where the users are accessing from. Examples of novel location-aware applications are: local advertising on web pages, automatic selection of a language to first display the content, accounting the incoming users based on their positions, restricted content delivery following regional policies, and authorization of transactions only when performed from pre-established locations. In the current Internet, however, there is no direct relationship between host identification and the host physical location. The novel locationaware applications then require the deployment of a geographic location service for Internet hosts. A DNS-based approach to provide a geographic location service of Internet hosts is proposed in RFC 1876 [1]. Nevertheless, the adoption of the DNS-based approach is restricted since it requires changes in the DNS records and administrators have no motivation to register new location records. Tools such as IP2LL [2] and NetGeo [3] query Whois databases in order to obtain the location information recorded therein and then infer the geographic location of a host. However, if a large and geographically dispersed block of IP addresses is allocated to a single entity, the Whois databases may contain just a single entry for the entire block. Padmanabhan and Subramanian [4] investigate three important techniques to infer the geographic location of an Internet host. The first technique infers the location of a host based on the DNS name of the host or another nearby node obtained using traceroute. For example, the name bcr1-so2-0-0.Paris.cw.net indicates a router located in Paris, France. This technique is the base for GeoTrack [4], VisualRoute [5], and GTrace [6]. The creation and management of parsing rules is, however, a challenging task as there is no standard to follow. As the position of the last recognizable router in the path toward the host to be located is used to estimate the position of such a host, a lack of accuracy is also expected. The sec-

ond technique splits the IP address space into clusters such that all hosts with an IP address within a cluster are likely to be co-located. An example of such a technique is GeoCluster [4]. Nevertheless, this technique is based on information that may be inaccurate because the databases rely on data provided by users, which may be unreliable to provide correct location information. The third technique is based on delay measurements and the exploitation of a possible correlation between geographic distance and network delay. Such a technique is the base for GeoPing [4]. The location estimation of a host is based on the assumption that hosts with similar network delays to some fixed probe machines tend to be located near each other. Given a set of landmarks with a well-known geographic location, the location estimation for a target host to be located is the location of the landmark presenting the most similar delay pattern to the one observed for the target host. In this paper, we focus on improving the accuracy of the geographic location estimation of Internet hosts inferred from delay measurements. Hence, we investigate the correlation between geographic distance and network delay. This correlation is weak to moderate if considered worldwide, whereas we show it is stronger in regions with richer connectivity. We use the term rich, or poor, connectivity to represent the variety of connectivity and transit options found in a certain region at both router and autonomous system levels. An environment with rich connectivity is expected to be able to find more geographically straightforward paths from source to destination. Moreover, we identify two key points that influence the accuracy of the Internet host location from delay measurements. The accuracy basically depends on the placement of landmarks and probe machines as well as on how efficiently the similarity between delay patterns is evaluated. Therefore, we aim at improving the accuracy of the host location estimation by: (i) strategically placing landmarks and probe machines [7]; and (ii) selecting models to best measure the similarity between the delay pattern of each landmark and the one of the target host [8]. Landmarks are expected to reflect where most users and hosts are. The number and position of the landmarks are key points for the accuracy of

ARTICLE IN PRESS


the location estimation and for the impact on network load due to measurements. The problem of finding the best location to place Internet resources, to reduce both the traffic load on networks and the delay perceived by end users, was addressed by Krishnan et al. [9] for caches, Qiu et al. [10], Radoslavov et al. [11], Cronin et al. [12], and Bassali et al. [13] for mirrors. We use, in this paper, similar concepts to find the best placement of landmarks and probe machines to provide accurate geographic location estimations for the majority of Internet hosts. We propose and evaluate a demographic placement approach that considers the geographic distribution of users, and consequently of hosts to be located, to place landmarks and probe machines. Results show that the demographic placement provides a relatively small number of landmarks able to represent a large portion of users (hosts) within a limited coverage distance. Adopting fewer landmarks to locate a host implies in a lower amount of measurement traffic. We also verify that for a limited number of landmarks, the demographic placement improves the representativeness of each landmark. This improvement results in closer landmarks and more accurate location estimations for the most part of hosts to be located. We also apply our demographic approach to place the probe machines. Probe machines are placed on sites likely to have enough network infrastructure to make their deployment feasible and in a fashion to avoid gathering redundant measurement data. Another key issue concerning the location estimation accuracy is the evaluation of how similar a landmark and the target host delay patterns are. Measuring the similarity between different items is fundamental for accuracy of recommender systems [14] and pattern analysis [15]. In our case, a similarity model compares the delay patterns and determines the landmark with the most similar delay pattern with respect to the one of the target host. Hence, we investigate different similarity models and evaluate their accuracy for the measurement-based geographic location of an Internet host. Padmanabhan and Subramanian [4] adopt the Euclidean distance as a way to assess the similarity of the observed delay patterns. One of our similarity models considers the previously adopted

3

Euclidean distance and we use this distance as a reference for evaluation. As is further detailed in Section 5.2, Euclidean distance tends to be less robust to violations of the triangle inequality that are present in some parts of the Internet [16–18]. Results show that the city-block distance outperforms the Euclidean distance, thus providing more accurate location estimations of the target host. This paper is organized as follows. A formalization of the host location inference based on delay measurements is introduced in Section 2. In Section 3, we study the correlation between geographic distance and network delay. Section 4 presents and evaluates the demographic placement proposition. Section 5 defines and compares the similarity models we evaluate. In Section 6, we present our conclusions.

2. Inferring host geographic locations from delay measurements GeoPing [4] infers a host geographic location from delay measurements. In general, a moderate correlation between distance and delay prevents the capture of such a relationship under a mathematical model. Therefore, GeoPing adopts an empirical approach based on the observation that hosts sharing similar delays to other fixed hosts tend to be near each other geographically. We formalize the problem of inferring a host location from delay measurements as follows. Consider a set L ¼ fL1 ; L2 ; . . . ; LK g of K landmarks. Landmarks are reference hosts with a well-known geographic location. Consider a set P ¼ fP 1 ; P 2 ; . . . ; P N g of N probe machines. Fig. 1 illustrates the steps in inferring a host location from delay measurements, which are detailed along this section. Dotted lines represent the measurements taken by the probe machines while the solid lines indicate information exchange. The probe machines periodically determine the network delay, which is actually the minimum delay of several measurements, to each landmark (Fig. 1(a)). Therefore, each probe machine Px, 1 6 x 6 N, keeps a delay vector dx = (d1x, d2x, . . ., dKx), where dix is the delay between the probe machine

ARTICLE IN PRESS

4


Fig. 1. Inferring a host location from delay measurements.

Px and the landmark Li 2 L. Suppose one wants to determine the location of a given target host T. A location server that knows the landmark set L and the probe machine set P is then contacted. The location server asks the N probe machines to measure the delay to host T (Fig. 1(b)). Each probe machine Px, 1 6 x 6 N, returns a delay vector d0x ¼ ðd 1x ; d 2x ; . . . ; d Kx ; d Tx Þ, i.e., the delay vector dx plus the just measured delay to host T (Fig. 1(c)). After receiving the delay vectors from the N probe machines, the location server is able to construct the delay matrix D with dimensions (K + 1) · N: 3 2 d 11 d 12 . . . d 1N 7 6 6 d 21 d 22 . . . d 2N 7 7 6 6 . .. 7 .. .. ð1Þ D ¼ 6 .. 7 . . . 7 6 7 6 4 d K1 d K2 . . . d KN 5 d T 1 d T 2 . . . d TN The delay vectors gathered by the demanding location server from the probe machines correspond to the columns of the delay matrix D. The location server then compares the lines of the delay matrix D to estimate the location of host T. To infer the location of host T, the landmark L presenting the most similar delay pattern with respect to the delay pattern of host T is determined. The corresponding location of the landmark L is the location estimation of host T (Fig. 1(d)). The delay matrix D combined with the knowledge of the location of the landmarks of the set L compose a delay map recording the relationship between network delay and geographic location. The placement problem we deal with involves the number of needed landmarks and where to

place such landmarks and probe machines. We are thus interested in where to place a finite number of landmarks to maximize the representativeness of each placed landmark. Fewer landmarks imply a lower amount of measurement traffic injected in the network. Note that probe machines may perform the measurements toward the set of landmarks in a unsynchronized way, avoiding a scalability problem of measuring distances to all hosts at the same time. Furthermore, the initiative of performing measurements is kept at the probe machines to allow the use of oblivious hosts as landmarks. The amount of measurements may be evaluated as follows. Let D denote the time interval adopted by the probe machines to periodically gather the delay from the landmarks of the set L. The total number of measurements M to estimate the location of h hosts in a time interval s is l s m

Mðh; sÞ ¼ 2N K þh : ð2Þ D It should be noted that each measurement may consist of one to several delay samples, but only the minimum value is considered to not take into account delays due to network congestion. In the case we send p ping packets to estimate the minimum RTT between a probe machine and a landmark, the amount of measurement traffic injected in the network is actually given by pM.

3. Correlation between geographic distance and network delay In order to study the correlation between geographic distance and network delay, we adopt two datasets:

ARTICLE IN PRESS


The geographic locations of the hosts composing both the LibWeb and the RIPE datasets are well known. Given the latitude and the longitude of two points, the geographic distance between them is derived using Vincentys formulae [21].

700 1 600

3

400 300 200 100 0 0

2000

4000

6000

8000

10000 12000 14000 16000 18000

Distance (km)

Fig. 2. Scatter plot of geographic distance and network delay (LibWeb).

one order of magnitude. For example, points 1–3 correspond to hosts located in Algeria, Turkey, and Iran, respectively. The ping packets from Paris to Algeria (point 1), for instance, actually make their way through routers located in New York. Geographic properties of Internet routing are studied in further detail in [22]. Routes toward some locations take directions far from the straightforward physical direction. Spring et al. [23] show that interconnection policies between ISPs directly contribute to end-to-end paths being significantly longer than necessary. This inconvenience is strongly reinforced by the poor connectivity in certain regions. Poor connectivity weakens the correlation between geographic distance and network delay. 250 Possible outliers 200

RTT (ms)

3.1. Experimental results for the LibWeb dataset Fig. 2 presents the scatter plot of the geographic distance and the minimum delay between our probe machine and each target host for the LibWeb dataset. A weak correlation is observed between geographic distance and network delay worldwide, resulting in a coefficient of correlation R = 0.2971. Some points are significantly away from the others. For about the same geographic distance, the observed delay may be greater by

2

500 RTT (ms)

• LibWeb—delay measurements performed from the LIP6 laboratory located in Paris, France, to 135 target hosts with well-known locations all over the world in June 2002. The set of target hosts is mainly composed of university sites extracted from library web (LibWeb) servers around the world [19]. The geographic distribution of the target hosts is as follows: 56 in North America, 44 in Western Europe, 11 in Asia, seven in Eastern Europe, 7 in Latin America, 4 in the Middle East, 3 in Africa, and 3 in Oceania. From the 135 original target hosts, 109 hosts have answered the ping requests. The considered delay is the minimum of several measurements to not take into account delays due to congestion in intermediate routers. • RIPE—data collected from the Test Traffic Measurements (TTM) project of the RIPE network [20]. The dataset we consider is composed by the 2.5 percentile of the delay observed from each RIPE host to each other host in the set during a period of 10 weeks from early December 2002 until February 2003. All 55 hosts on the RIPE network are equipped with a GPS card, thus allowing their exact geographic position to be known. The hosts in the RIPE network are geographically distributed as follows: 42 in Western Europe, 5 in the US, 3 in Eastern Europe, 2 in the Middle East, 2 in Oceania, and 1 in Asia.

5

150

North America Cluster

100 R = 0.7283 R = 0.8534 50 Western Europe Cluster 0 0

2000

4000

6000

8000

10000

Distance (km)

Fig. 3. Correlation between distance and delay (richer connectivity in LibWeb).

ARTICLE IN PRESS


3.2. Experimental results for the RIPE dataset In contrast with the LibWeb dataset used in the experiment of Section 3.1, where one probe machine gathers delay measurements from 135 landmarks, the RIPE dataset gives us the possibility of considering multiple measurement points. As we have the measured delay from each single host to any other host in the RIPE network, any host in the set may be viewed as a probe machine, as a landmark, or both. It should be noted that RIPE hosts measure the one-way delay between them. This is possible because the RIPE hosts are equipped with GPS cards, allowing them to be enough synchronized. As a consequence, the delays observed between each pair of RIPE hosts are asymmetric, presumably due to asymmetries in routing and in concurrent traffic load. Therefore, we consider individually the viewpoint of each host toward the other hosts in the set. The correlation between geographic distance and network de-

450 400 350 300 Delay (ms)

From the set of 109 answering hosts, the 80 hosts located in North America and Western Europe have been identified. These regions have the richest connectivity linking their hosts [24–26]. Fig. 3 shows the correlation between geographic distance and network delay considering only the hosts located in North America and Western Europe. The regions that have a richer connectivity have a much stronger correlation between geographic distance and network delay. Even if there are a few isolated points away from the other points, two main clusters are observed in Fig. 3. These clusters correspond to hosts located in Western Europe and North America. Excluding the points that clearly remain out of the pack for both clusters (one outlier for the Western Europe cluster and two outliers for the North America cluster) results in the following coefficients of correlation: R = 0.7283 for the Western Europe cluster and R = 0.8534 for the North America cluster. These outliers result from hosts in regions that are likely to have poorer connectivity than the average of each group. For example, the outliers in North America are located in Hawaii and Alaska in contrast to the remaining hosts in this region that are located in the continental U.S. and Canada.

250 200 150 100 R = 0.6272

50

y = 0.011587 * x

0 0

2000

4000

6000

8000 10000 12000 14000 16000 18000 20000 Distance (km)

Fig. 4. Correlation between distance and delay for the RIPE dataset.

lay within the whole RIPE dataset is shown in Fig. 4. For the whole RIPE dataset, there is a moderate correlation between distance and delay (R = 0.6272). Nevertheless, we have observed that one single host in the RIPE network significantly contributes to weaken such a correlation. Fig. 5 presents the correlation for the RIPE dataset without the outlier host. A significantly stronger correlation between distance and delay (R = 0.8983) is observed disregarding the outlier host. Performing a traceroute toward this outlier host, one single link on the path strongly contributes to increase the end-to-end delay. In accordance with the experiments presented in Section 3.1, we observe once more that a richer connectivity 250

200

Delay (ms)

6

150

100

50 R = 0.8983 y = 0.010756 * x

0 0

2000

4000

6000

8000 10000 12000 14000 16000 18000 20000 Distance (km)

Fig. 5. Correlation disregarding the outlier host in the RIPE dataset.

ARTICLE IN PRESS


strengths the correlation between geographic distance and network delay. 3.3. Considerations on the host location from delay measurements From our study we observe a weak to moderate correlation between geographic distance and network delay worldwide. Similar results are presented in [27]. Nevertheless, we observe a much stronger correlation on regions with richer connectivity, indicating that the correlation becomes stronger as connectivity becomes richer. Furthermore, within regions with richer connectivity, we observe that hosts located in nearby areas present similar delays to fixed measurement points. Therefore, we verify the assumption that hosts with similar network delays to some fixed probe machines tend to be located near each other. Recent findings [24,25] indicate a strong correlation between population and router density in economically developed countries. Moreover, most users, and consequently most hosts to be located, are likely to be in regions presenting richer connectivity, whereby a stronger correlation between distance and delay holds. Two main aspects contribute to the robustness of the host inference from delay measurements against factors that may weaken the correlation between distance and delay. First, delay is measured from multiple geographically distributed probe machines rather than from one single location. Second, the minimum delay, among several delay samples, is considered rather than an individual delay sample.

4. Demographic placement of landmarks and probe machines A landmark is a reference to be used as location estimation of a certain quantity of hosts supposed to be nearly located. As a landmark is a reference for the geographic position of a set of nearby hosts, it should ideally represent the position of as many hosts as possible. Areas with high host density with a co-located landmark provide location estimations that reflect more accurately the

7

positions of a large number of hosts. Therefore, landmarks are expected to indicate where most users (and hosts) are. A strong correlation between population and router density is found in economically developed countries and in urban areas characterized by population density peaks [24,25]. Most users, and consequently most hosts to be located, are likely to be in regions with rich connectivity. We propose a demographic placement approach to address the issue of placing landmarks and probe machines. In this approach, we place landmarks and probe machines according to the user (host) population distribution. The main urban agglomerations spread worldwide are considered since they offer the highest concentration of users (hosts to be located). We consider all urban agglomerations with more than one million inhabitants [28], totalizing 407 locations. For illustration, Table 1 shows the top 10 urban agglomerations worldwide. It is known that the Internet infrastructure varies dramatically across different regions throughout the world. Therefore, we weight the populations of the different agglomerations with the number of Internet users in the country the agglomeration belongs to over the total population of the country. In applying such a weight, we estimate the main user agglomerations worldwide to be covered by the demographic placement. Table 2 presents the top 10 user agglomerations worldwide out of the total 407 considered agglomerations. Data on estimations of the Internet users and total population of each country are available in [29]. Denoting the set of user agglomerations as A, the landmark

Table 1 Top 10 urban agglomerations worldwide Rank

Urban agglomeration

Country

Total population

1 2 3 4 5 6 7 8 9 10

Tokyo New York Seoul Mexico City Saõ Paulo Bombay Osaka Delhi Los Angeles Jakarta

Japan USA South Korea Mexico Brazil India Japan India USA Indonesia

34,900,000 21,600,000 21,150,000 20,750,000 20,250,000 18,150,000 18,000,000 17,150,000 16,800,000 15,850,000

ARTICLE IN PRESS

8


tions A. The resulting locations, however, might include sites with no network infrastructure such as rural areas, forests, oceans, and alike. Limiting the placement of landmarks to the locations of the agglomerations is thus justified as it is suitable to place landmarks where is likely to exist a previous network infrastructure. Furthermore, agglomerations with a high level of network infrastructure provide a large set of hosts able to be used as landmarks. Eligible landmarks are any host able to echo ping messages and known to be located within the indicated agglomerations, thus resulting in a potentially large number of eligible landmarks in large user agglomerations. Such landmarks may even be unsuspecting participants in the procedure. As a consequence, for each considered urban agglomeration, a set of oblivious hosts may be selected to constitute a pool of eligible landmarks. As probe machines need to use one landmark from a certain urban agglomeration, such probe machines pick one of the eligible landmarks located in the urban agglomeration. The association between a geographic location and the pool of eligible landmarks located therein may be carried out using DNS, performing load balancing among co-located landmarks. Each pool of landmarks might be named new-york.netlandmark.info, london.netlandmark.info, and so forth rather than adding IP addresses into the code of probe machines, thus enhancing robustness against individual landmark failures. We consider two complementary approaches to determine the placement of the set of landmarks L

Table 2 Top 10 user agglomerations worldwide Rank

User agglomeration

Country

User population

1 2 3 4 5 6 7 8 9 10

New York Los Angeles Tokyo Seoul Chicago Washington London San Francisco Osaka Philadelphia

USA USA Japan South Korea USA USA UK USA Japan USA

11,496,837 8,941,984 7,449,579 6,755,013 5,003,253 4,178,248 3,868,029 3,858,892 3,842,190 3,353,244

placement problem can now be formally defined. The set A represents the 407 user agglomerations, totalizing 173,696,253 estimated users [28,29]. The cumulative user distribution over A is presented in Fig. 6a. The mean distance between each pair of elements in A is 8167 km. The cumulative distribution of such a distance is shown in Fig. 6b. The landmark placement problem is to find a set of landmarks L A with K landmarks subject to an optimization condition f ðLÞ. The possible locations of landmarks are limited to the locations of the agglomerations (L A). From the viewpoint of the optimization condition f ðLÞ, the solution to the placement problem under this restriction may be inferior to a solution that allows landmarks to be placed anywhere, i.e. not necessarily on the locations of the agglomerations. Allowing landmarks to be placed anywhere might provide geographically closer sites to place landmarks with respect to the fixed set of agglomera-

90 80 70 60 50 40 30 20 10 0 1 10

(a)

100 Percentage of agglomeration pairs (%)

Percentage of agglomerations (%)

100

2

10

3

4

5

10 10 10 Population in number of users

6

10

10

90 80 70 60 50 40 30 20 10 0

7

(b)

2500 5000 7500 10000 12500 15000 17500 20000 Distance between a pair of agglomerations (km)

Fig. 6. Cumulative user and distance distributions over the set A. (a) User distribution and (b) distance distribution.

ARTICLE IN PRESS


9

Table 3 Adopted notation for placing landmarks and probe machines gij hi G W M aij Xi Zi Yij Qi

geographic distance between agglomerations i and j number of users at agglomeration i geographic coverage distance maximum distance from any agglomeration to the nearest landmark minimum distance between any pair of placed probe machines 1 if agglomeration i can cover demands at agglomeration j 0 if not 1 if there is a landmark on agglomeration i 0 if not 1 if agglomeration i is covered 0 if not 1 if agglomeration i is assigned to a landmark at site j 0 if not 1 if there is a probe machine on agglomeration i 0 if not

of size K taking into account the concentration of users in A. Such approaches have been used to determine the placement of different kinds of facilities like fire stations, hospitals, or police departments [30]. In the first approach, given the maximum distance from an agglomeration to a landmark, we want to determine the minimum number of landmarks needed to cover all agglomerations. If the condition of covering all agglomerations is relaxed, a smaller number of landmarks may cover a large portion of the considered space of users as the user distribution is unequal throughout the agglomerations. Thus, given the maximum distance between an agglomeration and a landmark, we want to know how many landmarks are needed to cover at least a certain portion of the considered users, if not all. This model is known as the maximum covering location model [31]. In the second approach, after fixing the number of landmarks (K) to be located, we minimize the maximum distance from any agglomeration to its nearest landmark. This problem is known as the K-center problem [31]. The placement of probe machines is also investigated. We apply the demographic approach to place probe machines on sites with enough network infrastructure to make their deployment feasible and in a fashion to avoid gathering redundant measurement data. Table 3 presents the adopted notation to model the problem of placing landmarks and probe machines.

4.1. Maximum covering location model The maximum covering location model is obtained when the number of covered hosts to be located is maximized, considering a limited number of landmarks and a fixed coverage distance of each landmark. A landmark covering a certain region is a location estimation for the hosts within that region. The demographic placement of landmarks takes into account the concentration of users within the agglomerations. Using the notation from Table 3, the maximum covering location model is expressed by the objective function X f1 ðLÞ ¼ max hi Z i ð3Þ i

with the following constraints: X Zi 6 aij X j 8i;

ð4Þ

j

X

X j 6 K;

ð5Þ

j

X j ¼ 0; 1

8j;

ð6Þ

Z i ¼ 0; 1

8i:

ð7Þ

The constraint (4) states that users at the agglomeration i are covered if at least one site that covers agglomeration i is selected to host a landmark. The constraint (5) stipulates that we locate no more

ARTICLE IN PRESS

10


than K landmarks. The constraints (6) and (7) are the constraints of integrality for the decision variables X and Z. We adopt a standard greedy approach [31] to obtain the maximum coverage with time complexity OðjAj2 KÞ. After placing the first landmark to cover the most uncovered demand, the algorithm greedily looks for the best location for the next landmark until K landmarks are placed. Fig. 7 shows the percentage of covered users achieved for three policies of landmark placement and coverage distances of 10 and 500 km. The used policies were: random placement, geographic placement, and demographic placement. Under random placement, the landmarks are randomly selected, disregarding both the concentration of users and the proximity between agglomerations.

100 90

Demographic

Percentage of covered users (%)

80 70 60 Geographic

50 40

Random

30 20

Demographic - 10 km Geographic - 10 km Random - 10 km

10 0 1 10 25

50

75

(a)

100

150

200

250

Number of landmarks

100 Demog. 90 Percentage of covered users (%)

Geographic 80 Random

70 60 50

4.2. K-center problem

40 30 20

Demographic - 500 km Geographic - 500 km Random - 500 km

10 0 1 10 25

(b)

Geographic placement disregards the user concentration (hi = 1, "i) within the candidate agglomerations to locate landmarks. The number of agglomerations within the range given by the coverage distance G around an agglomeration determines the weight of the agglomeration. Among agglomerations with equal weight, the elected agglomeration to locate the landmark is randomly chosen. Error bars in the results from the random and geographic placement represent the 99% confidence interval. The demographic placement of landmarks significantly improves the representativeness of the chosen landmarks in terms of hosts to be eventually located when compared with the random and the geographic placement policies. The proposed methodology provides not only the number of needed landmarks to cover a certain percentage of users, but also the specific geographic locations of such landmarks. The geographic distribution of the landmarks indicated by the demographic placement approach is visualized using the GeoPlot tool [32]. Fig. 8 illustrates the demographic placement of 50 landmarks to cover almost 90% of the considered number of users for a coverage distance of 250 km. In Fig. 8, landmarks are concentrated in three main regions: USA, Western Europe, and Japan. Such results are consistent with recent findings [24,25] that show a high density of Internet infrastructure on such regions. The high density of placed landmarks on these three regions reflects the concentration of users therein. There are some areas of relatively high user concentration elsewhere, as in Brazil and Australia. Results indicate where to place landmarks to improve the representativeness of each landmark given the constraints on the accuracy and the number of landmarks to be placed.

50

75

100 150 Number of landmarks

200

250

Fig. 7. Covered users in demographic, geographic, and random placement. (a) G = 10 km and (b) G = 500 km.

We now tackle the problem of minimizing the maximum distance between an agglomeration and the nearest landmark while considering a fixed number of landmarks. One can notice that even if an agglomeration i is within the coverage distance of a landmark j (aij = 1), it may be assigned to another closer landmark k (Yij = 0 and Yik = 1).

ARTICLE IN PRESS


11

Fig. 8. Placement of 50 landmarks for a coverage distance of 250 km.

Therefore, aij P Yij because there is no reason to assign an agglomeration to a landmark other than the closest one. This minimization problem is known as the K-center problem [31] and may be formulated using the notation defined in Table 3 by the objective function f2 ðLÞ ¼ min W ;

ð8Þ

and it is subject to the following constraints: X X j ¼ K; ð9Þ j

Y ij 6 X j 8i; j; X gij Y ij 8i; W P

ð10Þ ð11Þ

In the demographic approach, in order to consider the concentration of users in each agglomeration, P constraint (11) should be replaced by W P hi j gij Y ij ; 8i. Otherwise, the minimization problem leads to a geographic placement of landmarks. One solution to the problem is to enumerate each possible subset of size K out of the jAj candidate locations and then verify which one provides the minimum value of W. Nevertheless, even for moderate values of jAj and K such an enumeration is not realistic, as the K-center problem is known to be NP-Complete [33]. An alternative binary search algorithm [31] to approximately solve the K-center problem may be outlined as follows:

j

X j ¼ 0; 1

8j;

ð12Þ

Y ij ¼ 0; 1

8i; j:

ð13Þ

The constraint (9) stipulates that K landmarks are to be located. The constraint (10) states that agglomeration i can only be covered by site j if site j has a landmark. The constraint (11) defines the lower bound on the maximum distance, which is being minimized. In other words, the maximum distance (W) must be greater or equal than the distance between any agglomeration i and the landmark j to which the agglomeration i is assigned. The constraint (12) is the integrality constraint on the decision variable X. The constraint (13) requires the agglomeration i to be assigned to only one landmark j.

1: Gmax maxi,j{gij} 2: GL 0; GH Gmax 3: while (GH 5 GL) do 4: G b(GH+GL)/2c 5: Compute the smallest set of landmarks LðGÞ that covers the entire set of agglomerations A for a coverage distance of G. 6: if ðjLðGÞj 6 KÞ then 7: GH G 8: else 9: GL G+1 10: end if 11: end while 12: GL is the solution to the objective function and LðGL Þ provides the locations of the K landmarks for the solution.

ARTICLE IN PRESS

A. Ziviani et al. / Computer Networks xxx (2004) xxx–xxx 10000

Demographic Geographic Random

9000 Random

8000 Maximum distance (km)

Using the binary search approach, we are able to solve the K-center problem with time complexity OðjAjK 2 log Gmax Þ. This approach defines a lower bound GL and an upper bound GH on the maximum distance between an agglomeration and the nearest landmark. The approach then successively narrows the range between such bounds until they converge into the smallest coverage distance that allows a set of K landmarks to cover all agglomerations. Such a coverage distance is the smallest maximum distance between an agglomeration and the nearest landmark for K placed landmarks, thus being the approximate solution of the objective function (8). The set of K landmarks that covers all agglomerations for such a coverage distance is the solution to the K-center problem. The binary search algorithm considers unweighted distances, resulting in a geographic placement of landmarks. In the demographic strategy, to consider demandweighted distances, step 1 in the binary search algorithm should state Gmax [maxi,j{gij}] [maxi{hi}]. Furthermore, in computing the smallest set of landmarks that covers the entire set of agglomerations (step 5 in the binary search algorithm), an agglomeration j is able to cover an agglomeration i if gijhi 6 G. We compare the results from the random, geographic, and demographic placement policies in Fig. 9. The geographic placement performs the best, providing the lowest maximum geographic distances between agglomerations and the nearest landmark. The demographic approach provides higher maximum and average distances, but such results mask the concentration of users within the agglomerations. Adding more landmarks may even keep the maximum distance observed under the demographic placement policy unchanged. A remote agglomeration with low user concentration may be kept far from the nearest landmark as additional landmarks are used to decrease the distance between more user-populated agglomerations to the nearest landmark. One observes such a situation in Fig. 9 between 50 and 150 placed landmarks as the maximum distance keeps leveled off. In spite of that, as shown in Fig. 9, the average distance to the nearest landmark keeps decreasing in this same range as the

7000 6000 5000 4000 3000

Demographic

2000 Geographic 1000 0 10

50

100

(a)


2500

250

300

350


2250 2000 Average distance (km)

12

1750 1500 1250

Demographic

1000 750 500 Random 250

Geographic

0

(b)

10

50

100


250

300

350

Fig. 9. Distance from an agglomeration to the nearest landmark (km). (a) Maximum geographic distance and (b) average geographic distance.

density of landmarks in denser user areas increases. The demographic placement considers the user concentrations within the different agglomerations. As a consequence, the demographic placement pushes the worst-case distances toward the farthest and least user concentrated agglomerations. High user concentrated agglomerations are assigned to closer landmarks and the agglomerations with the highest user concentrations host the landmarks. These results are presented in Fig. 10 using 50 and 100 placed landmarks. The demographic strategy provides smaller distances between the agglomerations and their nearest landmark for the most part of users at the expense of leaving farther landmarks for the remote and least user concentrated agglomerations.

ARTICLE IN PRESS


to remote agglomerations with the lowest concentrations of users. The result is a dense placement of landmarks in the regions presenting the highest concentrations of users like the USA, Western Europe, and Japan.

100 Demographic

Percentage of users covered (%)

90 80 Random

70 60 50 40

4.3. Placement of probe machines

Geographic

30 20 Demographic Geographic Random

10 0 0

(a)

250

500

750

1000

1250

1500

1750

2000

2250

2500

Distance from agglomeration to the nearest landmark (km) 100

Demog.

Percentage of users covered (%)

90 80 Random

70 60 50

Geographic

40 30 20


10 0 0

(b)

13

250

500

750

1000

1250

1500

1750

2000

2250

2500

Distance from agglomeration to the nearest landmark (km)

Fig. 10. Covered users as a function of the distance to the nearest landmark. (a) K = 50 and (b) K = 100.

Fig. 11 illustrates the difference in adopting either unweighted (geographic placement) or weighted (demographic placement) distances to be minimized in the K-center problem. Fig. 11 presents the placement of 50 landmarks considering the geographic placement, thus disregarding the concentration of users. Such a disregard of the concentration of users leads to a more uniform geographic distribution of the 50 landmarks worldwide and to a lower maximum unweighted or purely geographic distance W between any agglomeration and its nearest landmark. In the other hand, Fig. 11 presents the placement of 50 landmarks considering weighted distances under the demographic placement. The maximum (unweighted or purely geographic) distance W between an agglomeration and the nearest landmark is higher, but the worst cases are pushed

In this subsection, we address the issue of where to place the probe machines. Probe machines measure the delay to a target host and regularly gather the delays to landmarks. Probe machines that are near each other may share common paths to some remote landmarks or target hosts, thus providing redundant information to the location estimation decision. Therefore, our first goal is to make a geographically sparse distribution of probe machines to avoid shared paths. This sparse placement of probe machines avoids a possible correlation between the delay patterns gathered from the landmarks. In order to make the deployment of probe machines feasible, they are placed on agglomerations with better network infrastructure. The second goal is thus to maximize the number of users on the agglomerations selected to host probe machines. We adopt the number of users (hosts) as a means to reflect the level of network infrastructure on a given agglomeration. The problem of placing probe machines is to find a set P A with N probe machines that maximizes the minimum weighted distance between any pair of placed probe machines. We adopt the weighted distance between agglomerations to consider the user concentration at each agglomeration and thus place probe machines on agglomerations that have better network infrastructure. Using the notation defined in Table 3, the problem of placing probe machines is given by the objective function f3 ðPÞ ¼ max M

ð14Þ

and is subject to the following constraints: X Qi ¼ N ;

ð15Þ

i

M 6 hi

X

gij Qj ;

if Qi ¼ 1; 8i;

ð16Þ

j

Qi ¼ 0; 1

8i:

ð17Þ

ARTICLE IN PRESS

14


Fig. 11. K-center solution for placing 50 landmarks. (a) Geographic placement (disregards the concentration of users) and (b) demographic placement (considers the concentration of users).

The constraint (15) states that N probe machines are to be located. The constraint (16) defines the upper bound on the minimum distance between two placed probe machines, which is being maximized. The minimum weighted distance M must be lesser or equal than the distance between any pair of probe machines i and j. This constraint makes no sense if Qi = 0 (no probe machine at agglomeration i) as M is the weighted distance between a pair of probe machines i and j. The constraint (17) is the integrality constraint on the decision variable Q. We propose a greedy approach to solve the problem of placing probe machines with time com2 plexity OðjAj N Þ. This greedy approach is as follows: 1: A0 A 2: P A 2 A with maxi{hi} 3: while ðjPj < N Þ do 4: Find P A 2 A0 that maximizes D ¼ i distðA; P i Þ, 8P i 2 P

5: P P[A A0 A 6: A0 7: end while 8: P is the set of N probe machines The set of probe machines is initialized with the agglomeration that presents the highest user concentration. Afterwards the algorithm greedily places a new probe machine to maximize the weighted distance between the new probe machine and the previously placed probe machines until N probe machines are placed. The first placed probe machine influences the provided solution. Nevertheless, the decision to place the first probe machine on the agglomeration with the highest user concentration is consistent with the argument of placing probe machines where there exists enough network infrastructure. We compare the results provided to the problem of placing probe machines by the demographic, geographic, and random placements in Fig. 12. Under the geographic placement, to con-

ARTICLE IN PRESS


Average distance between probe machines (km)

14000

structure. The demographic placement largely outperforms the geographic and random placements. In the demographic placement, probe machines are placed on agglomerations that have a sufficient network infrastructure to make the deployment of probe machines feasible. Meanwhile, the demographic placement is still able to provide a relatively sparse distribution of probe machines to avoid the shared paths toward landmarks and hosts to be located as shown in Fig. 12a.


13000 12000 11000

Geographic

10000 9000 8000 7000

Demographic

6000 Random

5000

15

4000 3000 2000 1000 0 3

Percentage of users co-located with probe machines (%)

(a) 100

(b)

10

20

30 40 50 60 70 Number of probe machines

80

90

100

5. Measuring the similarity between delay patterns In this section, we investigate how to best measure the similarity between the delay pattern of each landmark and the one observed for the target host. The delay patterns result from the partial viewpoints gathered by the distributed probe machines. The landmark that presents the most similar delay pattern with respect to the one of the target host provides the location estimation of that host. Measuring the similarity of the concerned delay patterns is thus a key point for the accuracy of the host location from delay measurements.


90 80

Demographic

70 60 50 40

Geographic

30 20 10

Random

0 3

10

20

30

40 50 60 70 Number of probe machines

80

90

100

Fig. 12. Placing probe machines. (a) Distance between probe machines and (b) users co-located with probe machines.

sider unweighted distances, the constraint (16) is P replaced by M 6 j gij Qj , if Qi = 1, for all i. In Fig. 12a, we compare the average distance observed between all pairs of the N placed probe machines for the three placement policies. The geographic placement performs the best in sparsely distributing the probe machines as it presents the largest average distances. Nevertheless, such results place probe machines on remote user agglomerations that do not necessarily have a good network infrastructure since the geographic placement disregards user concentration throughout the agglomerations. Fig. 12b shows that the demographic placement locates the probe machines on agglomerations with high density of users, which is our adopted criterion to measure the level of network infra-

5.1. Similarity models We define the similarity function Sðx; yÞ : RN ! ½0; 1 to measure the degree of similarity between two delay patterns x and y where N is the number of adopted probe machines. We adopt such a function to evaluate the degree of similarity between the delay patterns gathered by the probe machines from each landmark and from the target host to be located. In this subsection, we investigate the adoption of three known similarity models [34]: Cosine-based, correlation-based, and distance-based. Each similarity model provides its own manner of implementing the similarity function Sðx; yÞ. The resulting similarity level between the delay patterns falls in the interval 0 6 Sðx; yÞ 6 1. The closer the similarity level is to 1, more similar the delay patterns are. 5.1.1. Cosine-based similarity In the first similarity model, the two delay patterns are thought of as two vectors in a

ARTICLE IN PRESS

16


N-dimensional delay space. The similarity between them is measured by computing the cosine of the angle h between these two vectors. The cosinebased similarity between the delay patterns x and y, denoted by Scos ðx; yÞ, is then given by xy ; ð18Þ Scos ðx; yÞ ¼ cos h ¼ kxkkyk where ‘‘Æ’’ denotes the dot-product of the two vectors and kxk is the Euclidean size of vector x 2 RN , qffiffiffiffiffiffiffiffiffiffiffiffiffiffi PN 2ffi i.e. kxk ¼ i¼1 xi . It should be noted that the cosine of the angle h is already in the range of [0, 1] because the delay patterns x and y are both positive vectors. 5.1.2. Correlation-based similarity In this model, similarity is measured by computing the coefficient of correlation between the two delay patterns x and y. The coefficient of correlation is defined as corrðx; yÞ ¼

covðx; yÞ ; rx ry

ð19Þ

where cov(x, y) denotes the covariance between delay patterns x and y, and rx is the standard deviation of x. The correlation-based similarity, denoted by Scor ðx; yÞ, is scaled to the interval [0, 1] and given by Scor ðx; yÞ ¼

corrðx; yÞ þ 1 2

ð20Þ

5.1.3. Distance-based similarity We represent the distance u(x, y) between two delay patterns x and y by " #1=p N X p uðx; yÞ ¼ Lp ¼ jxi y i j ; p > 0: ð21Þ i¼1

The distance function u(x, y) belongs to the Lp family of functions. When p = 1, we have the Manhattan or city-block distance. In contrast, for p = 2, we have the Euclidean distance. Furthermore, 0 < p < 1 results in a non-metric distance function adequate to be used if distances do not satisfy the triangle inequality [35]. Shepard [36] argues in favor of an exponential decay function for a distance-based similarity model. A flexible dis-

tance-based similarity model, denoted by Sdis ðx; yÞ, which includes the exponential decay function is given by a

Sdis ðx; yÞ ¼ eðuðx;yÞ=bÞ ;

b > 1; a > 0;

where b is a scaling factor defined as P 11 0 0P yj xi 1 j i AA þ b ¼ max @1; @ i;j 2 kxk kyk

ð22Þ

ð23Þ

Since Sdis ðx; yÞ should decrease as u(x, y) increases, then a > 0. The same kind of function is explored for measuring similarity in different contexts like pattern analysis and similarity theory [15,35,37].

5.2. Experimental results for the similarity models In this subsection, we analyze the performance of the different similarity models to provide a location estimation of a target host to be located. We adopt the RIPE dataset to evaluate the three similarity models. The RIPE hosts are equipped with GPS cards, allowing us to know their accurate geographic location. This allows us to compare the estimated locations with the real ones and, as a consequence, derive our performance results. The hosts in the RIPE dataset are considered the target host to be located one at a time. The remaining hosts in the set are then used as probe machines and landmarks to perform the location estimation of the target host. We repeat this procedure to evaluate the resulting location estimation of each host in the dataset. 5.2.1. Ranking landmarks As we know the geographic position of each host in the RIPE dataset due to the GPS cards, we are able to determine the ideal landmark to be chosen by the similarity models being evaluated. The ideal landmark is the geographically closest landmark to the target host. We analyze the performance of the similarity models by ranking the ideal landmark for each target host. Each model provides a list of landmarks in descending order of similarity presented by their delay patterns with respect to the delay pattern of the target

ARTICLE IN PRESS

Sdis (α = 1)

p = 0.25 p = 0.5 p = 0.75 p=1 p=2 p=3 p=4 p=5 p=6

Average Rank

25

20

15

10

5

Scos

Scor

0

Fig. 13. Average rank of the ideal landmark.

host. We then observe the rank of the ideal landmark on the ordered list resulting from each evaluated similarity model. This rank is obtained for each host in the RIPE dataset being considered as a target at a time. Fig. 13 presents the average rank of the ideal landmark for each similarity model. The best performance in ranking the landmarks is obtained by the distance-based similarity model when p = 1, equivalent to the city-block distance. The city-block distance also has the lowest sensitivity to the considered values of a since it presents a similar performance independent of the value of this parameter. Moreover, the city-block distance outperforms the Euclidean distance previously adopted in [4] as well as the cosine-based and correlation-based similarity models. Furthermore, the previously adopted Euclidean distance performs similarly to the correlation-based similarity model. For data following a standard normal distribution, i.e. zero mean and unit variance, the correlation is indeed a linear transformation of the squared Euclidean distance. This relationship is given by corrðx; yÞ ¼ 1 L22 ðx; yÞ=2N , where L2(x, y) is the Euclidean distance between x and y. Even if in our experiments the delay patterns x and y clearly do not follow the standard normal distribution, our findings indicate a certain level of correlation between the correlation-based and the distancebased (with Euclidean distance) similarity models. We note that the distance-based similarity model is more sensitive to the variation of the distance

17

parameter p for larger values of a. The adopted values of p and a exert a great influence on the performance of the distance-based similarity model. The results obtained by the correlation-based and cosine-based similarity models are even comprised within the performance range of the distance-based model. For a = 0.1, relatively good performances are achieved by the distance-based model with non-metric distances (0 < p < 1). Some experiments show that the triangle inequality does not hold over all parts of the Internet [16–18]. These experiments indicate that the adoption of non-metric distances may be more adequate than metric distances on specific parts of the Internet. Euclidean distance as well as higher order distances tend to be less robust to violations of the triangle inequality that are present in some parts of the Internet. The cumulative probability of ranking the ideal landmark for different similarity models is depicted in Fig. 14. We compare the distance-based similarity model (presented in the form Sdis –p–a) for city-block and Euclidean distances with the correlation-based and cosine-based similarity models. As the Euclidean distance has been previously adopted, we use it as a performance reference for comparison. We observe that the distance-based similarity model with city-block distances outperforms the remaining ones. The correlation-based similarity model provides an

100

80 Cumulative Probability (%)

Sdis (α = 0.1)

p = 0.25 p = 0.5 p = 0.75 p=1 p=2 p=3 p=4 p=5 p=6


60

40

Sdis-1-1 Sdis-2-1 Scor Scos

20

0 1

5

10

15

20

25

30

Rank

Fig. 14. Cumulative probability of the rank of the ideal landmark.

ARTICLE IN PRESS

18


equivalent performance to the distance-based similarity model with Euclidean distance.

100

Sdis (α = 0.1)

Sdis (α = 1)

Average Distance (km)

2000 1750 1500 1250 1000

Cumulative Probability (%)

80

5.2.2. Distance accuracy We now evaluate the distance accuracy of the location estimation of each considered similarity model. The location estimation corresponds to the location of the landmark chosen by each similarity model, i.e. the first ranked landmark in the resulting ordered list of each similarity model. Therefore, we evaluate the performance of the different similarity models by comparing the error distance from the selected landmark and the target host to be located. Fig. 15 presents the average distance between the target host and the elected closest landmark. Such results are the average of each RIPE host being considered a target host one at a time. The remaining hosts are then used as landmarks and probe machines. As in the ranking landmarks evaluation, the best performance is achieved by the distance-based similarity model for the city-block distance (p = 1) and a = 1. Nevertheless, the insensitivity to a is not the same for the distance accuracy as in the evaluation of ranking the ideal landmark. The correlation-based similarity model presents a similar performance to the non-metric distances with a = 0.1, outperforming the distance-based similarity model with the reference Euclidean distance. The cumulative probability of the distance between the target host and the selected closest land-

60

40 Upper Bound Sdis-1-1 Sdis-2-1 Scos Scor

20

0 0 250 500

1000

1500

2000

3000

4000

Distance (km)

Fig. 16. Cumulative probability of the distance (km).

mark is shown in Fig. 16. Again, we compare the correlation-based and the cosine-based similarity models with the distance-based similarity model for city-block and Euclidean distances. The ‘‘Upper Bound’’ is the best possible performance for the RIPE dataset. This upper bound consists of the distances from the target hosts to their respective ideal landmarks. The city-block distance (Sdis –1–1) outperforms the remaining models, presenting the closest performance to the upper bound. It should be noted that the RIPE dataset is relatively small. Using one host as the target host results in 54 landmarks to choose from to infer a location estimation. Even if the elected landmark is the geographically closest landmark to the target host, not necessarily it is nearby that target host. For example, the RIPE host in Tokyo is the only one in Asia and its ideal landmark is located in Finland, 7819 km away. As a consequence of the limited number of elements in the RIPE dataset, the average distance from the ideal landmarks to their respective target hosts is 405 km. This distance is the best average distance the similarity models can reach for the RIPE dataset.

750 500

6. Conclusion

250 Scos

Scor

p = 0.25 p = 0.5 p = 0.75 p=1 p=2 p=3 p=4 p=5 p=6

p = 0.25 p = 0.5 p = 0.75 p=1 p=2 p=3 p=4 p=5 p=6

0

Fig. 15. Average distance between the target host and the location estimation.

This paper focuses on a geographic location service of Internet hosts based on a technique that infers host locations using delay measurements from geographically distributed landmarks. We

ARTICLE IN PRESS


aim at improving the accuracy of the geographic location estimation of the target host to be located. The contributions of this paper are based on three key points. First, we study the correlation between geographic distance and network delay. Poor connectivity weakens such a correlation. We observe for two different datasets that the correlation between distance and delay becomes stronger as connectivity becomes richer. Thereafter, we identify two key issues that influence the accuracy of the resulting geographic location estimation of a target host: the placement of landmarks and probe machines, and the similarity model that compares the observed delay patterns. Thereby we address the problem of strategically placing landmarks and probe machines to improve the accuracy of the geographic location estimation. We propose and evaluate a demographic placement approach that considers the geographic distribution of users (hosts) to place landmarks and probe machines. Results show that the proposed demographic placement allows a relatively small number of landmarks to represent a large portion of users within a limited coverage distance. Fewer landmarks also imply a lower amount of measurement traffic in the network. These landmarks are placed on areas of high user density, thus providing closer landmarks and more accurate location estimations for most hosts to be located. Probe machines are placed on locations that have enough network infrastructure and sparsely distributed to avoid gathering redundant data. Furthermore, in order to address the second key issue on accuracy, we investigate three different similarity models and how accurate they are for Internet host location from delay measurements. This investigation is carried out using experimental data from the RIPE measurement infrastructure. The similarity models select the landmark that provides the location estimation of the target host to be located, thus being a key component for accuracy. Despite such a fact, Euclidean distance was the only similarity model that has been previously adopted. Our findings show that the distance-based similarity model with a = 1 and cityblock distance outperforms in accuracy the remaining similarity models. This result includes the Euclidean distance that has been used as a ref-

19

erence for comparison. Therefore, strategically placing the landmarks and probe machines as well as using a more adequate similarity model contribute to improve the accuracy of the measurementbased geographic location of Internet hosts. Acknowledgments This work is supported by CAPES/COFECUB, FUJB, and CNPq. During the development of this work, Artur Ziviani had a scholarship from CAPES/Brazil to support his Ph.D. studies under a cooperation between LIP6 and UFRJ. The authors are thankful to Timur Friedman and Tijani Chahed for the access to the RIPE network.

References [1] C. Davis, P. Vixie, T. Goodwin, I. Dickinson, A means for expressing location information in the domain name system, Internet RFC 1876. [2] University of Illinois at Urbana-Champaign, IP Address to Latitude/Longitude, Available from . [3] D. Moore, R. Periakaruppan, J. Donohoe, K. Claffy, Where in the world is netgeo.caida.org?, in: Proceedings of the INET2000, Yokohama, Japan, 2000. [4] V.N. Padmanabhan, L. Subramanian, An investigation of geographic mapping techniques for Internet hosts, in: Proceedings of the ACM SIGCOMM2001, San Diego, CA, USA, 2001. [5] Visualware Inc., VisualRoute, Available from . [6] CAIDA, GTrace, Available from . [7] A. Ziviani, S. Fdida, J.F. de Rezende, O.C.M.B. Duarte, Demographic placement for Internet host location, in: Proceedings of the IEEE GLOBECOM2003, San Francisco, CA, USA, 2003. [8] A. Ziviani, S. Fdida, J.F. de Rezende, O.C.M.B. Duarte, Similarity models for Internet host location, in: Proceedings of the IEEE International Conference on Networks— ICON2003, Sydney, Australia, 2003, pp. 81–86. [9] P. Krishnan, D. Raz, Y. Shavitt, The cache location problem, IEEE/ACM Transactions on Networking 8 (5) (2000) 568–582. [10] L. Qiu, V.N. Padmanabhan, G.M. Voelker, On the placement of web server replicas, in: Proceedings of the IEEE INFOCOM2001, Anchorage, AK, USA, 2001. [11] P. Radoslavov, R. Govindan, D. Estrin, Topologyinformed Internet replica placement, Computer Communications 25 (4) (2002) 384–392.

ARTICLE IN PRESS

20


[12] E. Cronin, S. Jamin, C. Jin, A.R. Kurc, D. Raz, Y. Shavitt, Constrained mirror placement on the Internet, IEEE Journal on Selected Areas in Communications 20 (7) (2002) 1369–1382. [13] H.S. Bassali, K.M. Kamath, R.B. Hosamani, L. Gao, Hierarchy-aware algorithms for CDN proxy placement in the Internet, Computer Communications 26 (3) (2003) 251–263. [14] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the International World Wide Web Conference—WWW10, Hong Kong, 2001. [15] S. Santini, R. Jain, Similarity measures, IEEE Transactions on Pattern Analysis and Machine Inteligence 21 (9) (1999) 871–883. [16] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, L. Zhang, IDMaps: A global Internet host distance estimation service, IEEE/ACM Transactions on Networking 9 (5) (2001) 525–540. [17] S. Savage, A. Collins, E. Hoffman, J. Snell, T. Anderson, The end-to-end effects of Internet path selection, in: Proceedings of the ACM SIGCOMM99, Cambridge, MA, USA, 1999. [18] S. Banerjee, T.G. Griffin, M. Pias, The interdomain connectivity of PlanetLab nodes, in: Proceedings of the Passive and Active Measurement Workshop—PAM2004, Antibes Juan-les-Pins, France, Lecture Notes in Computer Science, vol. 3015, Springer, Berlin, 2004. [19] LibWeb, Available from . [20] RIPE Test Traffic Measurements, Available from . [21] T. Vincenty, Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations, Survey Review 22 (176) (1975) 88–93. [22] L. Subramanian, V.N. Padmanabhan, R. Katz, Geographic properties of Internet routing, in: Proceedings of the USENIX 2002, Monterey, CA, USA, 2002. [23] N. Spring, R. Mahajan, T. Anderson, Quantifying the causes of path inflation, in: Proceedings of the ACM SIGCOMM2003, Karlsruhe, Germany, 2003. [24] A. Lakhina, J.W. Byers, M. Crovella, I. Matta, On the geographic location of Internet resources, IEEE Journal on Selected Areas in Communications 21 (6) (2003) 934–948. [25] S.-H. Yook, H. Jeong, A.-L. Baraba´si, Modeling the Internets large-scale topology, Proceedings of the National Academy of Sciences 99 (2002) 13382–13386. [26] A. Ziviani, S. Fdida, J.F. de Rezende, O.C.M. B. Duarte, Toward a measurement-based geographic location service, in: Proceedings of the Passive and Active Measurement Workshop—PAM2004, Antibes Juan-les-Pins, France, Lecture Notes in Computer Science, vol. 3015, Springer, Berlin, 2004, pp. 43–52. [27] G. Ballintijn, M. van Steen, A.S. Tanenbaum, Characterizing Internet performance to support wide-area application development, Operating Systems Review 34 (4) (2000) 41–47.

[28] T. Brinkhoff, City Population, Available from . [29] Central Intelligence Agency (CIA), The World Factbook 2002, Available from , January 2002. [30] C. Toregas, R. Swain, C. Revelle, L. Bergman, The location of emergency service facilities, Operations Research 19 (6) (1971) 1363–1373. [31] M.S. Daskin, Network and Discrete Location, Wiley, New York, 1995. [32] CAIDA, GeoPlot, Available from . [33] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979. [34] A.D. Gordon, Classification: Methods for the Exploratory Analysis of Multivariate Data, Chapman and Hall, London, 1981. [35] D.W. Jacobs, D. Weinshall, Y. Gdalyahu, Classification with non-metric distances: Image retrieval and class representation, IEEE Transactions on Pattern Analysis and Machine Inteligence 22 (6) (2000) 583–600. [36] R.N. Shepard, Toward a universal law of generalization for psychological science, Science 237 (1987) 1317–1323. [37] D.M. Ennis, J.J. Palen, K. Mullen, A multidimensional stochastic theory of similarity, Journal of Mathematical Psychology 32 (4) (1988) 449–465.

Artur Ziviani received the B.Sc. degree in Electronics Engineering in 1998 and the M.Sc. degree in Electrical Engineering (with emphasis in Computer Networking) in 1999, both from the Federal University of Rio de Janeiro (UFRJ), Brazil. In December 2003, he received the Ph.D. degree in Computer Science from the University Pierre et Marie Curie (Paris 6), Paris, France, where he has also been a lecturer during the 2003–2004 academic year. Since September 2004, he is with the National Laboratory for Scientific Computing (LNCC), located in Petropolis, Brazil. His major research interests are quality of service, mobile and wireless computing, Internet measurements, and grid computing.

Serge Fdida is a professor at the University Pierre et Marie Curie (Paris) since 1991. He received the Doctorat de 3eme Cycle in 1984, and the Habilitation a Diriger des Recherches specializing in Modelling of computer networks in 1989 from the University Pierre et Marie Curie. From 1989 to 1995, he was a Full Professor to the University Rene Descartes (Paris). His research interests are in the area of high speed networking, pervasive communication, resource management and performance analysis. He is heading the Network and Performance group of the LIP6 Laboratory (CNRS-University

ARTICLE IN PRESS

A. Ziviani et al. / Computer Networks xxx (2004) xxx–xxx of Paris 6). He was a Visiting Scientist at IBM Research during the 1990/91 academic year. He was the chairman (or co-chair) of the following events: IFIP Modelling Techniques and Tools87, IFIP High Performance Networking94 (HPN94), Performance of Data Communication95 (PCN95) and European Conference on Multimedia Applications, Services and Techniques97 (ECMAST97), Networked Group Communication (NGC99) and IFIP Networking2000. He was the editor of the proceedings of these conferences and is the author of a book on performance evaluation and a book on Networking. He is involved in many research projects in High Performance Networking in France and Europe. He was heading the RHDM Action in France for 8 years and the COST264 Action in Europe (98–02). He belongs to the FP6 network of Excellence ENEXT. He is a senior member of IEEE, a member of ACM and also involved in two IFIP working groups on networking. He is also the Co-Director of EURONETLAB, a joint laboratory established in 2001, between University Paris 6, CNRS, THALES and 6WIND, developing research and development work on ‘‘QoS Routers’’ and ‘‘Radio Routers’’.

Jose´ Ferreira de Rezende received the B.Sc. and M.Sc. degrees in Electronic Engineering from the Universidade Federal do Rio de Janeiro (UFRJ) in 1988 and 1991, respectively. He received the Ph.D. degree in Computer Science from the Universite´ Pierre et Marie-Curie, Paris, France in 1997. He was an associate researcher at LIP6 (Laboratoire dInformatique de Paris 6) during 1997. Since 1998 he is an Associate Professor at UFRJ. His research interests are in distributed multimedia applications, multipeer communication, performance evaluation and QoS aspects of high-speed, wireless and sensor networks.

21

Otto Carlos M.B. Duarte was born in Rio de Janeiro, Brazil, on October 23, 1953. He received the Electronic Engineer degree and the M.Sc. degree in Electrical Engineering from Universidade Federal do Rio de Janeiro, Brazil, in 1976 and 1981, respectively, and the Dr. Ing. degree from ENST/ Paris, France, in 1985. Since 1978 he is a Professor at Universidade Federal do Rio de Janeiro. From January 1992 to June 1993 he has worked at MASI Laboratory in Paris 6 University. In 1995, he has spent three months at International Computer Science Institute (ICSI) associated to the University of California at Berkeley. Presently, he is heading the computer network group (Grupo de Teleinforma´tica e Automaçaõ—GTA). His major research interests are in high speed communications, mobility, security and QoS guarantees.