Scheduling deadline-constrained bulk data transfers to minimize network congestion

Bin Bin Chen

Pascale Vicat-Blanc Primet

Department of Computer Science National University of Singapore Email: [email protected]

LIP, UMR CNRS-ENS Lyon-INRIA-UCB Lyon 5668, École Normale Supérieure de Lyon, France Email: [email protected]

Abstract— Tight coordination of resource allocation among end points in Grid networks often requires a data mover service to transfer a voluminous dataset from one site to another within a specified time interval. At its most flexible, the transfer can start at any time after its arrival and use any, even time-varying, bandwidth value, as long as it completes before its deadline. Given a set of such tasks, we study the Bulk Data Transfer Scheduling (BDTS) problem, which searches for the optimal bandwidth allocation profile of each task so as to minimize the overall network congestion. We show that multi-interval scheduling, which divides the active window of a task into multiple intervals and assigns a bandwidth value independently in each of them, is both sufficient and necessary to attain optimality in BDTS. Specifically, we show that BDTS can be solved in polynomial time as a Maximum Concurrent Flow Problem. The optimal solution obtained takes the form of a multi-interval schedule with an upper-bounded number of intervals. Simulations are conducted over several representative topologies to demonstrate the significant advantage of the optimal solutions.

I. INTRODUCTION

A. Motivations

Grid computing is a promising technology that brings together geographically distributed resources to build very high performance computing environments for data-intensive or computing-intensive applications. One major requirement of grid computing is to control precisely how these large amounts of shared resources are used. An efficient and reliable network data transfer service plays a key role in building such environments. For example, in the Large Hadron Collider (LHC) Computing Grid project at CERN [1], massive quantities of data (15 Petabytes each year) are expected to be produced and distributed around the world for analysis. Such projects often build dedicated high speed grid networks. In these networks, the bandwidth demand from a single source-sink pair can easily reach hundreds of Mbps or even several Gbps. Such giant tasks introduce a relatively low multiplexing level, while consuming a large portion of the bandwidth of the underlying network. In addition, tight coordination of resource allocation among end points in Grid networks often requires a data mover service to carry out a giant task in a specified time interval. For example, as shown in Figure 1, 200Gb of data produced in site A needs to be moved to site B for processing. The CPU and disk resources in site B have been reserved in advance from 200s to 400s and from 0s to 400s, respectively. If the transfer can begin only once the disk in site B is allocated, and there is no pipelining between the transfer and computing services, the bulk data transfer task r1 needs to move 200Gb of data from site A to site B in the time interval [0s, 200s] in order to fully use the CPU resources. As another example, a 200Gb file is stored in site D, whose lease expires in 300s; the owner of the data reserves new storage space in site C, whose lease begins at 100s. The bulk data transfer task r2 therefore needs to move 200Gb of data from site D to site C in the time interval [100s, 300s].

Fig. 1. Bulk data transfer scheduling example: r1 moves 200Gb from site A to site B in [0s, 200s] (CPU at B reserved for [200s, 400s], disk for [0s, 400s]); r2 moves 200Gb from site D to site C in [100s, 300s] (disk at D leased until 300s, disk at C leased from 100s).

Compared with the high-end Grid context, best effort data transfer in the Internet has neither clear volume information nor a strict deadline. Distributed transport protocols such as TCP are used to statistically share the available bandwidth among flows in a "fair" way. This core-stateless approach performs well unless the total demand approaches the full capacity of a bottleneck link, which rarely happens. For example, an OC12 link (622Mbps) can concurrently support hundreds to thousands of flows from DSL lines (around 2Mbps each). However, TCP/IP technology does not provide any guarantee on the completion time of a data transfer, and thus cannot meet the requirement of Grid networks to support accurate end-point resource co-allocation. Variations of the incoming traffic load can easily cause tasks to miss their deadlines [2]. Making things worse, as the per-flow bandwidth-delay product increases, TCP becomes inefficient and prone to instability [3]. To avoid both user dissatisfaction and end point resource underutilization, we believe that per-flow bandwidth allocation is


TABLE I
COMPARISON OF INTERNET AND HIGH-END GRID NETWORKS

Network    Multiplexing    QoS    TCP efficiency    Per-flow state complexity
Internet   high            no     high              high
Grid       low             yes    low               low

necessary (and also practical) in high-end Grid networks, which are characterized by high QoS requirements and a low multiplexing level. The differences between the Internet and high-end Grid networks are summarized in Table I.

Bandwidth reservation has been studied extensively for real-time applications [4], which are often approximately modelled as reserving a fixed amount of bandwidth from a given start time. In comparison, bulk data transfer tasks are specified in terms of volume and active window (from arrival time to deadline). At its most flexible, a bulk data transfer task can start at any time after its arrival and use any, even time-varying, bandwidth value, as long as it is completed before its deadline. In the following discussion, the value of the allocated bandwidth as a function of time is termed the bandwidth allocation profile. The flexibility of choosing the bandwidth allocation profile can be exploited to improve system performance. For example, to complete more tasks before their deadlines, sharing the instantaneous bandwidth fairly among all active flows may not be optimal [5]. In some cases it is beneficial to allow a connection with a larger pending volume and an earlier deadline to grab more bandwidth, similar to Earliest Deadline First scheduling in real-time systems [6]. Note that this does not necessarily cause unfairness, as a bulk data transfer's performance is normally measured by its average rate rather than its instantaneous transfer rate. Let us illustrate the goal of this work in detail with a simple example.

Fig. 2. Bulk data transfer scheduling example: bandwidth allocation profiles of r1 and r2 under setting (a) and setting (b).

Example 1: Bulk data transfer tasks r1 and r2 share a single bottleneck link. r1's volume is 200Gb and its active window is [0s, 200s]. r2's volume is also 200Gb, and its active window is [100s, 300s]. Their task specifications are available at time 0s. As shown in Figure 2, setting (a) allocates 1Gbps to each of them throughout their own active windows, while in setting (b) r1 is allocated 4/3Gbps in [0s, 150s] and r2 is allocated 4/3Gbps in [150s, 300s]. The bandwidth allocation profiles of the two tasks are represented by the two different patterns in the figure. The link bandwidth allocation profile is the time-wise sum of the profiles of both tasks, and is shown in grey. If link capacity is provisioned based on the scheduling result, setting (a) requires a link capacity of 2Gbps, while setting (b) only requires 4/3Gbps, which it uses fully throughout [0s, 300s].

Assume now that the link has a fixed capacity of 2Gbps and that there is an incoming task r3 in [0s, 300s]. If r3's volume is 120Gb and its active window is predicted to be a random interval ω ⊂ [0s, 300s] with length |ω| = 200s, then r3 will be blocked in setting (a) unless preemption (modification of existing bandwidth reservation profiles) is allowed. Setting (b) can accept r3, because it distributes the residual capacity evenly over [0s, 300s]. Instead, if r3's volume is 50Gb and its active window is a random interval ω ⊂ [0s, 100s] or ω ⊂ [200s, 300s] with |ω| = 50s, where [0s, 100s] and [200s, 300s] are considered peak intervals for incoming sporadic transfers, setting (a) can accept r3 because it keeps all residual capacity in the peak intervals, whereas r3 will be blocked in setting (b) unless preemption is allowed. Finally, if r3's volume is 200Gb and its active window is ω = [0s, 300s], both settings have enough accumulated bandwidth in ω for r3. However, r3 can be served at a single rate of 2/3Gbps continuously through ω in setting (b), while it must be divided and served in two disjoint intervals [0s, 100s] and [200s, 300s] in setting (a). If the connection does not support the flexibility of dividing its transfer demand among multiple disjoint intervals, r3 will be blocked in setting (a). This example shows the performance difference under different bulk data transfer scheduling decisions. It also demonstrates the potential benefits of supporting multi-interval transfer flexibility and preemption.

Remark 1: Note that bandwidth reservation does not preclude a bulk data transfer from opportunistically using unreserved bandwidth with a suitable transport protocol. If a bulk data transfer progresses faster using extra bandwidth, its future bandwidth reservation profile can be reduced accordingly.
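To make the trade-off concrete, the following short Python sketch (ours, not from the paper) rebuilds the link bandwidth allocation profiles of settings (a) and (b) on a one-second grid and reports the residual capacity left for r3; the helper names and the 1 s granularity are assumptions made only for illustration.

```python
# Sketch (ours): per-second link load of the two settings of Example 1 and the
# residual capacity left for an incoming task r3 on a 2 Gbps link.
LINK_CAPACITY = 2.0      # Gbps
HORIZON = 300            # seconds

def link_load(reservations, horizon=HORIZON):
    """reservations: list of (rate_gbps, start_s, end_s) on the same link."""
    load = [0.0] * horizon
    for rate, start, end in reservations:
        for t in range(start, end):
            load[t] += rate
    return load

# Setting (a): 1 Gbps for r1 over [0s, 200s] and 1 Gbps for r2 over [100s, 300s].
setting_a = link_load([(1.0, 0, 200), (1.0, 100, 300)])
# Setting (b): 4/3 Gbps for r1 over [0s, 150s] and 4/3 Gbps for r2 over [150s, 300s].
setting_b = link_load([(4.0 / 3.0, 0, 150), (4.0 / 3.0, 150, 300)])

for name, load in (("(a)", setting_a), ("(b)", setting_b)):
    residual = [LINK_CAPACITY - x for x in load]
    print(name,
          "peak load:", round(max(load), 2), "Gbps;",
          "residual volume in [0s,300s]:", round(sum(residual)), "Gb;",
          "residual volume in [100s,200s]:", round(sum(residual[100:200])), "Gb")
```

Both settings leave the same total residual volume (200 Gb), but setting (a) concentrates it in [0s, 100s] and [200s, 300s] while setting (b) spreads it evenly, which is exactly what makes the three r3 scenarios above succeed or fail.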

B. Our results

For a given set of bulk data transfer tasks, we formalize the Bulk Data Transfer Scheduling (BDTS) problem as selecting a bandwidth allocation profile for each task so as to minimize the network congestion factor, i.e., the maximum link utilization over the time axis in the network. In Theorem 1, we show that the optimal network congestion factor does not change if we restrict the solution space to practical multi-interval scheduling schemes, which only use bandwidth allocation profiles in the form of step functions; i.e., a multi-interval scheduling scheme divides the active window of each task into multiple intervals and reserves a constant bandwidth value (possibly zero) independently in each of them. As a result, BDTS can be solved in polynomial time as a Maximum Concurrent Flow Problem (MCFP), and the


number of intervals used by each task's bandwidth reservation profile is upper bounded in the optimal solution attained. Theorem 2 shows that BDTS becomes NP-Complete if the bandwidth allocation profile is restricted to the constant-value single-interval form. To demonstrate the significant advantage of the optimal solutions, simulations are conducted over several representative topologies.

The rest of the paper is organized as follows. In section II we give the system model and formalize the BDTS problem. Section III studies the computational complexity of BDTS. Section IV evaluates the performance of different schemes over four representative topologies, and section V discusses the application of our results in Grid networks. Related works are briefly summarized in section VI before we conclude with section VII.

II. MODEL AND PROBLEM FORMULATION

Definition 1: A network is represented by a connected graph G(V, E), consisting of node set V and edge set E, with edge capacity µ(e) : E → R+ − {0}, where R+ is the set of nonnegative real numbers.

Definition 2: A path φ = (v0, v1, . . . , vh) is a finite sequence of nodes such that for 0 ≤ i < h, (vi, vi+1) ∈ E. This paper assumes all paths are simple, i.e., the nodes in a path are distinct.

Definition 3: A bulk data transfer task r = (νr, ωr, Φr) is a triple, where νr is the volume of the dataset, ωr = [ηr, ψr] is the active window (from arrival time ηr to deadline ψr), |ωr| = ψr − ηr is the length of the active window, and Φr = {φr^1, φr^2, . . . , φr^|Φr|} is the set of paths connecting source sr and destination dr.

Briefly speaking, a bulk data transfer task r demands that νr volume of data, available at sr from time ηr, be moved to dr before time ψr along the paths in Φr. We denote the set of all tasks as R, the union of all tasks' active windows as Ω = ∪r∈R ωr, and the union of all tasks' paths as Φ = ∪r∈R Φr. To simplify the discussion, we assume that each path φ ∈ Φ is associated with only one task r ∈ R, which we denote as r(φ). A task r, or a path φ ∈ Φr, is active at time t if t ∈ ωr; similarly, it is active in an interval π if π ⊆ ωr. We denote the set of paths active at time t as Φ(t), the set of paths active in an interval π as Φπ, and the set of paths passing through an edge e as Φe.

Remark 2: As shown by Theorem IV.1 in [7], Problem 0-1 TB, which searches for the optimal path for a bulk data transfer task so as to minimize its transfer time, is NP-complete under advance reservation. The 0-1 TB problem can be transformed into a variation of the BDTS problem that requires a single path to be selected from Φr for each task r, which makes that variation of BDTS NP-Complete; the proof is omitted here for brevity. To focus on the problem of bulk data transfer scheduling in the time domain, this paper assumes that path information is pre-determined, and that aggregate bandwidth can be achieved through multiple paths. In our problem formulation, we allow bulk data transfer scheduling to be optimized jointly with multi-path routing.
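A minimal code representation of the objects in Definitions 1-3 may help fix notation before the bandwidth allocation profiles are introduced; this is our own sketch, and the field names are not part of the paper's formalism.

```python
# Sketch (ours) of the model objects in Definitions 1-3; field names are not
# part of the paper's notation.
from dataclasses import dataclass, field

@dataclass
class Network:
    nodes: set                    # V
    capacity: dict                # µ(e) > 0, keyed by edge (u, v)

@dataclass
class Task:
    volume: float                 # ν_r (e.g. in Gb)
    arrival: float                # η_r
    deadline: float               # ψ_r
    paths: list = field(default_factory=list)   # Φ_r, each path a tuple of nodes

    @property
    def window_length(self) -> float:
        return self.deadline - self.arrival      # |ω_r| = ψ_r - η_r

# The two tasks of Figure 1, each with a single hypothetical direct path.
r1 = Task(volume=200.0, arrival=0.0, deadline=200.0, paths=[("A", "B")])
r2 = Task(volume=200.0, arrival=100.0, deadline=300.0, paths=[("D", "C")])
print(r1.window_length, r2.window_length)       # 200.0 200.0
```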

As noted above, a bandwidth allocation profile is a function λ(t) : ω → R+ which gives the value of the bandwidth allocated at time t. We consider three types of bandwidth allocation profiles (task, path and link) in this paper.

Definition 4: The bandwidth allocation profile of a task r is λr(t) : ωr → R+. Task r is entitled to transfer at rate λr(t) at time t.

Definition 5: The bandwidth allocation profile of a path φ is λφ(t) : ωr(φ) → R+, which specifies the amount of bandwidth reserved on every link e ∈ φ for path φ at time t.

Definition 6: The bandwidth allocation profile of a link e is λe(t) : Ω → R+, the total amount of bandwidth reserved on link e by all active paths passing through it at time t, i.e., λe(t) = Σ_{φ∈Φe∩Φ(t)} λφ(t). Link e's congestion factor at time t is fe(t) = λe(t)/µ(e), and its congestion factor over a time interval ω is fω^e = max_{t∈ω} fe(t). We define the network congestion factor over ω as fω^G = max_{e∈E} fω^e.

Now we are ready to define the Bulk Data Transfer Scheduling (BDTS) problem as follows:

Definition 7: Given a set of tasks R over a network G, the Bulk Data Transfer Scheduling problem BDTS(R, G) is to select a bandwidth allocation profile λr for each task r ∈ R so that the overall network congestion factor is minimized. Formally:

minimize fΩ^G
s.t. Σ_{φ∈Φr} ∫_{ηr}^{ψr} λφ(t) dt = νr,  ∀r ∈ R;   (1)
     Σ_{φ∈Φe∩Φ(t)} λφ(t) ≤ fΩ^G · µ(e),  ∀e ∈ E, ∀t ∈ Ω;
     λφ(t) : ωr(φ) → R+,  ∀φ ∈ Φ.

Minimizing the network congestion factor is selected as the objective function because:
• It answers the fundamental feasibility question: if the minimum network congestion factor is greater than 1, there is no feasible solution that completes all tasks before their deadlines, and some mechanism should be employed to select and reject a subset of tasks. The design of such mechanisms is an interesting problem in itself, and falls outside the scope of this paper.
• If network capacity is provisioned based on the scheduling result (e.g., in an optical network or an overlay network), minimizing the congestion factor reduces the amount of capacity required, thus achieving higher system utilization and leaving more bandwidth to other uses.
• Minimizing congestion also reduces the average packet delay experienced by coexisting interactive traffic.

It is interesting to consider other objective functions, for example the total provisioning cost Σ_{e∈E} (max_{t∈Ω} λe(t)). Roughly speaking, scheduling optimization problems with different objective functions can be treated with the same techniques as in this paper, as long as the objective function can be integrated into the constrained optimization problem in a linear form.
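The quantities in Definitions 5 and 6, and hence the objective of Formula (1), can be computed directly from the reserved profiles once time is discretized. The small sketch below is ours and assumes 1 s slots.

```python
# Sketch (ours, 1 s slots): the link profile of Definition 6 as the sum of the
# path profiles crossing the link, and the network congestion factor f_Omega^G.
def link_profile(path_profiles, horizon):
    """path_profiles: per-slot rate lists of the paths crossing one link."""
    return [sum(p[t] for p in path_profiles) for t in range(horizon)]

def network_congestion_factor(link_profiles, capacities, horizon):
    """link_profiles and capacities are dicts keyed by link id."""
    return max(
        link_profiles[e][t] / capacities[e]
        for e in capacities
        for t in range(horizon)
    )

# Setting (a) of Example 1 on the single 2 Gbps link: the factor is 1.0.
prof = link_profile([[1.0] * 200 + [0.0] * 100, [0.0] * 100 + [1.0] * 200], 300)
print(network_congestion_factor({"e": prof}, {"e": 2.0}, 300))
```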


The first constraint in Formula (1) is the volume demand requirement: the integral of a task's bandwidth allocation profile, which is the sum of the integrals over all of its paths, equals its volume. The second one is the capacity constraint, which bounds the sum of the profiles of all active paths passing through a link. The third one gives the constraint on the solution space.

III. COMPUTATIONAL COMPLEXITY OF BDTS

We first look at the special case of BDTS in which all tasks in R share a common active window Ω and each has a single path, i.e., ∀r ∈ R, ωr = Ω and |Φr| = 1.

Definition 8: To transfer ν volume of data in an interval ω over a path φ, spaghetti scheduling reserves a constant bandwidth value of ν/|ω| on all links e ∈ φ throughout ω.

Lemma 1: Spaghetti scheduling is an optimal solution for BDTS if ∀r ∈ R, ωr = Ω and |Φr| = 1.

Proof: Applying spaghetti scheduling to the given task set, the bandwidth allocation profile of every path, and thus of every link, is a constant function in Ω, i.e., ∀e ∈ E, ∀t ∈ Ω, λe(t) = fΩ^e · µ(e). Assume that link e* is the most congested link in the network, i.e., fΩ^{e*} = fΩ^G. Denoting the set of tasks whose (only) paths pass through link e* as R*, we have Σ_{r∈R*} νr = fΩ^G · µ(e*) · |Ω|. Suppose that another scheduling solution s achieves a smaller network congestion factor f'Ω^G < fΩ^G. Since there is only a single path for each task, the volume of data passing through link e* remains the same, so Σ_{r∈R*} νr ≤ f'Ω^G · µ(e*) · |Ω| < fΩ^G · µ(e*) · |Ω| = Σ_{r∈R*} νr, which is a contradiction.

We now consider BDTS in its general form, where each task has multiple paths and its own arrival time and deadline. Similar to [8], we use these arrival times and deadlines to divide the time line into 2|R| − 1 intervals. Let ti be the ith smallest value among the |R| arrival times and |R| deadlines (for i = 1, 2, . . . , 2|R|). The ith interval πi is the time period from ti to ti+1. We denote the set of all intervals as Π = {π1, . . . , π_{2|R|−1}}. The active window ωr of a task r consists of a subset of Π, which we denote as Πr.

Definition 9: For an interval π and a path φ ∈ Φπ, an interval-path decoupling profile νπ^φ is a nonnegative value which gives the volume of data transferred in π through φ.

Theorem 1: BDTS(R, G) is in P, and it possesses an optimal solution in which the bandwidth reservation profile of each task is a step function with O(|R|) intervals.

Proof: The proof proceeds by showing that BDTS has the same optimal objective value as the following linear programming problem:

minimize fΩ^G
s.t. Σ_{π∈Πr} Σ_{φ∈Φr} νπ^φ = νr,  ∀r ∈ R;   (2)
     Σ_{φ∈Φe∩Φπ} νπ^φ ≤ fΩ^G · µ(e) · |π|,  ∀e ∈ E, ∀π ∈ Π;
     νπ^φ ≥ 0,  ∀π ∈ Π, φ ∈ Φπ.

Note that ∀t ∈ π, Φ(t) = Φπ, because neither an arrival time nor a deadline falls inside any constructed interval.

Assume that all the interval-path decoupling profiles are given. A path φ active in interval π can then be viewed as an atom-task with specified volume νπ^φ, single path φ and active window π. Thus, according to Lemma 1, spaghetti scheduling can be applied over all atom-tasks in each interval to derive the minimum network congestion factor over that interval, and the minimum network congestion factor over Ω is then simply the maximum of the network congestion factors over all spaghetti-scheduled intervals. Hence, to solve BDTS, it is sufficient to determine the optimal decoupling profiles, which are subject to the constraints in Formula (2). As in Formula (1), the first constraint is the volume demand requirement, the second one is the capacity constraint, and the third one gives the constraint on the solution space.

Formula (2) is a linear programming (LP) problem. If every task has a bounded number of paths, i.e., ∀r ∈ R, |Φr| = O(1), Formula (2) has O(|R|²) variables and O(|R| · |E|) constraints. Problems enjoying LP formulations can be roughly classified as effectively solvable, as in practice the simplex method works in time proportional to the number of optimization variables. The optimal solution attained in this way configures the bandwidth allocation profile of each task in the form of a step function: the active window of a task r is divided into the intervals π ∈ Πr, and a constant bandwidth value νπ^φ/|π| is reserved on all links of path φ in each interval π. The number of intervals across a task's active window is no larger than 2|R| − 1. If we consider a slotted time axis, in which tasks can specify their arrival times and deadlines only at slot boundaries, the number of intervals across a task's active window is further bounded by min(2|R| − 1, |ωr|/|t|), where |t| is the length of a time slot, i.e., the minimum time granularity.

Remark 3: In fact, Formula (2) represents a Maximum Concurrent Flow Problem (MCFP). MCFP is to find the largest λ such that there is a multicommodity flow which routes at least λ · νr units of commodity r concurrently for all commodities r ∈ R. It is easy to show that for every feasible λ there is a feasible configuration achieving fΩ^G = 1/λ. Besides standard linear programming techniques, a series of progressively faster polynomial-time algorithms have been devised for MCFP; see for example [9].

Example 2: Figure 3 shows a simple instance of BDTS(R, G), where R = {r1, r2, r3} and G is an undirected four-node ring graph with V = {w, x, y, z} and E = {(w, x), (x, y), (y, z), (z, w)}. The paths are φr1 = (w, z, y), φr2 = (x, y, z), and φr3 = (w, x, y), as shown in the upper corner of the figure. The tasks have different active windows, and their arrival times and deadlines divide the time axis into 5 intervals Π = {π1, . . . , π5}, as shown in the bottom-left corner of the figure. To construct the graph G' = (V', E') for MCFP, we first generate 5 copies of G, namely G1, . . . , G5, as subgraphs of G', each corresponding to an interval. For every edge e ∈ E with capacity µ(e), the corresponding edge ei in Gi has a capacity of µ(e) · |πi|. Every task r ∈ R corresponds to a type-r commodity with demand νr. Note that both the link capacity unit (LCU) and the demand volume unit (DVU) are expressed as data volumes rather than bandwidth. We add a pair of nodes s'r and d'r for each type-r commodity in G' as its source and destination. For every interval πi ∈ Πr, we add an edge with infinite capacity between s'r and s_r^i, where s_r^i is the node in Gi corresponding to r's source node sr in G, and an edge with infinite capacity between d_r^i and d'r, where d_r^i is the node in Gi corresponding to r's destination node dr in G. We connect s_r^i and d_r^i by every path φi in Gi that corresponds to a path φ ∈ Φr in G. The constructed G' is shown in the bottom-right corner of the figure, where vertices and edges without any path passing through them are omitted for simplicity.


Fig. 3. Maximum concurrent multicommodity flow formalization
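For the special case of a single bottleneck link, the LP of Formula (2) is small enough to write down explicitly. The sketch below is our illustration, assuming SciPy's linprog is available: it builds the intervals from the arrival times and deadlines, solves the LP, and recovers the step-function profiles. On the data of Example 1 it returns the optimal congestion factor 2/3 (an aggregate of 4/3 Gbps on the 2 Gbps link), the same value achieved by setting (b).

```python
# Sketch (ours): Formula (2) for the special case of a single bottleneck link,
# assembled and solved with SciPy's linprog. Tasks are (volume, arrival, deadline).
from scipy.optimize import linprog

def solve_bdts_single_link(tasks, capacity):
    # Interval construction: arrival times and deadlines are the breakpoints of Π.
    points = sorted({t for (_, eta, psi) in tasks for t in (eta, psi)})
    intervals = list(zip(points[:-1], points[1:]))
    lengths = [b - a for a, b in intervals]
    n, m = len(tasks), len(intervals)

    # Variables: x[r*m + i] = volume of task r moved in interval i; last variable is f.
    c = [0.0] * (n * m) + [1.0]                      # minimize f

    # Volume constraints: the volumes over the active intervals sum to ν_r.
    A_eq, b_eq = [], []
    for r, (vol, eta, psi) in enumerate(tasks):
        row = [0.0] * (n * m + 1)
        for i, (a, b) in enumerate(intervals):
            if eta <= a and b <= psi:
                row[r * m + i] = 1.0
        A_eq.append(row)
        b_eq.append(vol)

    # Capacity constraints: sum_r x[r,i] - f * µ * |π_i| <= 0 in every interval.
    A_ub, b_ub = [], []
    for i in range(m):
        row = [0.0] * (n * m + 1)
        for r in range(n):
            row[r * m + i] = 1.0
        row[-1] = -capacity * lengths[i]
        A_ub.append(row)
        b_ub.append(0.0)

    # A task may only use intervals inside its own active window.
    bounds = []
    for _, eta, psi in tasks:
        for a, b in intervals:
            bounds.append((0, None) if eta <= a and b <= psi else (0, 0))
    bounds.append((0, None))                         # f >= 0

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    assert res.success, res.message
    f = res.x[-1]
    # Step-function profiles: constant rate x[r,i] / |π_i| within interval i.
    profiles = [[res.x[r * m + i] / lengths[i] for i in range(m)] for r in range(n)]
    return f, intervals, profiles

# Example 1: two 200 Gb tasks on a 2 Gbps link; the optimum is f = 2/3.
f, intervals, profiles = solve_bdts_single_link(
    [(200.0, 0.0, 200.0), (200.0, 100.0, 300.0)], capacity=2.0)
print(round(f, 4), intervals, profiles)
```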

When R includes tasks with fixed path bandwidth allocation profiles, for example circuit-switched applications or already accepted tasks that are non-preemptive, we have:

Corollary 1: BDTS is in P when the bandwidth allocation profiles of a subset of paths are given as problem input in the form of step functions.

Proof: Fixing the bandwidth allocation profile of a path is equivalent to introducing a new constraint into Formula (2). As the given path bandwidth allocation profile is a step function, the new constraint is still linear, so BDTS(R, G) with the new constraint can still be solved in polynomial time as a linear programming problem.

When there is only one flexible task in R, for example in a non-preemptive online scheduling system, we have:

Corollary 2: BDTS can be computed via a simple interval-path water-filling construction when the bandwidth allocation profiles of all paths are given as problem input in the form of step functions, except for the paths of a single request r* ∈ R.

Allowing the step function form is necessary to attain optimality, as shown below:

Theorem 2: BDTS is NP-Complete if the bandwidth allocation profile is restricted to the constant-value single-interval form.

Proof: Clearly, BDTS belongs to NP; its completeness is proved by reduction from 2-PARTITION. Consider an instance B1 of 2-PARTITION: given n integers {ν1, ν2, . . . , νn}, determine whether there is a subset I of indices such that Σ_{i∈I} νi = Σ_{i∉I} νi. Assuming, without loss of generality, that

D = Σ_{i=1}^{n} νi is an even integer and 1 ≤ νi ≤ D/2 for 1 ≤ i ≤ n, we build the following instance B2 of BDTS: there are n + 1 tasks on a single link with capacity 1. For the ith task ri, 1 ≤ i ≤ n, the active window is [0, D + 1] and the volume is νi. The last task r_{n+1} has a different active window, [D/2, D/2 + 1], and its volume is 1. It is easy to show that B1 has a solution if and only if the optimal network congestion factor fΩ^G = 1 in B2, under the constraint that the bandwidth allocation profile of each task is in constant-value single-interval form.

IV. PERFORMANCE EVALUATION

In this section, we compare the optimal network congestion factor attained by BDTS to that attained by spaghetti scheduling. The latter is chosen for comparison because it is a commonly used scheme that introduces less network congestion than other constant-value single-interval schemes, by greedily reserving as low an instantaneous bandwidth as possible. We consider the four topologies shown in Figure 4: single link, star, ring, and a random topology. The single link topology represents a network with only one bottleneck, while the star topology models a network in which multiple sites are interconnected through an over-provisioned (for example, using the "hose model") core network; the core network is abstracted as the central node, to which each site (peripheral node) is connected via its access link. The ring topology is also common, especially in optical networks; it is used to demonstrate the impact of different path lengths and the joint optimization of scheduling and multi-path load balancing. Finally, the random graph includes both the star and the ring topologies as substructures; it is used to eliminate artificial effects, if any, caused by the above symmetric topologies. In our experiments, the BDTS solution is calculated using the Matlab LP solver. All experiments are repeated multiple times and the average value is reported.
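For reference, the baseline of Definition 8 is trivial to reproduce; the sketch below (ours, single-link case with 1 s slots) computes the link profile produced by spaghetti scheduling, which on Example 1 coincides with setting (a) and peaks at 2 Gbps.

```python
# Sketch (ours): the spaghetti-scheduling baseline of Definition 8 on a single
# link with 1 s slots; each task reserves the constant rate ν_r / |ω_r| for its
# whole active window.
def spaghetti_link_profile(tasks, horizon):
    """tasks: list of (volume, arrival, deadline) whose single path crosses the link."""
    load = [0.0] * horizon
    for volume, eta, psi in tasks:
        rate = volume / (psi - eta)
        for t in range(int(eta), int(psi)):
            load[t] += rate
    return load

# On Example 1, spaghetti scheduling is exactly setting (a): the peak load is
# 2 Gbps, i.e. a congestion factor of 1 on the 2 Gbps link, versus 2/3 for the optimum.
print(max(spaghetti_link_profile([(200.0, 0.0, 200.0), (200.0, 100.0, 300.0)], 300)))
```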

Fig. 4. Network topologies: (a) single link, (b) star, (c) ring, (d) random

The task set R is generated randomly over an interval Ω. For a task r ∈ R, its volume νr is a random variable uniformly distributed in [j ∗ ν, ν], where j ∈ [0, 1] is termed volume variation factor with default value 0.5. The active window ωr of r is a random variable generated independently of νr . ωr ⊆ Ω. The active window length |ωr | is uniformly distributed in [k ∗ |Ω|, |Ω|], where k ∈ [0, 1] is termed window length factor. When k = 1, all tasks have identical active window Ω. For



Fig. 5. Single link topology (network congestion factor vs. window length factor for Spaghetti and BDTS)

Next, we vary the network size of star topology to investigate its impact over congestion factor. Figure 6 shows that the network congestion factor increases with network size. When the number of links increases from 2 to 6, the network congestion factors of both schemes increase by 30%. After that, the increasing rate slows down. The congestion factors converge when the number of links increases further. The increasing of the network congestion factor can be explained from two points: (1) the network congestion factor is defined as the maximum of the link congestion factors of all its links, even when distribution of each link’s congestion factor keeps unchanged and is independent with each other, the network

congestion factor grows when the number of links grows; (2) as each path passes through 2 links, which are shared by different sets of paths, the optimization problem in a star topology is under more and stricter constraints compared to the single link topology. (3) The randomness in selecting source and sink nodes causes a certain degree of statistical traffic imbalance among links. However, the figure shows that the relative performance of BDTS solution compared to spaghetti scheduling is almost not affected by the size of network. In this experiment and the next one, the window length factor is k = 0.5. Similar trends are also observed using other values of window length factor. 1.5 network congestion factor

a chosen window length, the offset of window is uniformly distributed. In all topologies, source and destination of tasks are two distinct nodes uniformly selected from all nodes (except the central node in star topology). The average number of tasks passing through each link is termed multiplexing level l with default value 40 (the task number over a link is estimated by assuming each task uses and only uses one of its shortest path). Results presented below are also observed under a range of low to medium volume variation factors and low to medium multiplexing levels. Link capacity is given by ν∗l |Ω| , thus, the lower bound of the network congestion factor is (j +1)/2 (0.75 in experiments below), which is attained when all loads are distributed evenly both in the time axis and in the network. The number of tasks generated is |R| = l ∗ |E|/|φ|, where |φ| is the mean of the shortest path length between any pair of nodes chosen uniformly in the network. Note that to maintain a constant multiplexing level, |R| is varied together with the network topology and size in experiments below. Figure 5 plots the impact of window length factor in the single link topology. The BDTS solution approximately attains the lower bound in this case in the whole range of window length factors. In comparison, spaghetti scheduling’s performance is poor, especially when the window length factor is small. As shown in the figure, the BDTS solution reduces the network congestion factor by more than 30% when k = 0.1. The performance difference decreases when tasks’ active windows are stretched. In the extreme case of uniform window (k = 1), spaghetti scheduling attains optimality.


Fig. 6. Star topology (network congestion factor vs. number of links for Spaghetti and BDTS)

Figure 7 plots the network congestion factor in ring topologies with an increasing number of links. In a ring topology there are two disjoint paths between any pair of nodes; we consider rings with an odd number of links to ensure that the shortest path of any pair is unique. The Spaghetti-single path and BDTS-single path settings only use the shortest path, while in the BDTS-multiple paths setting the bandwidth over the two paths is aggregated. Compared to the star topology, increasing the ring size not only increases the chance that tasks passing through the same link come from, or go to, different links, but also increases the average number of links in a path. This explains why the network congestion factor in a ring topology also increases with the network size. The figure also shows that the joint optimization of task scheduling and multi-path load balancing can reduce congestion significantly. Finally, Figure 8 shows a similar performance gain of the BDTS solution in the random topology. Note that there are multiple shortest paths between some pairs of nodes; the Spaghetti-single path and BDTS-single path settings use a single shortest path, selected arbitrarily if there is more than one, while the BDTS-multiple paths setting distributes the load optimally among all paths. The network congestion factor of this random topology is larger than in both the ring and star topologies because of its asymmetric structure. However, a larger set of candidate routes, as shown in the BDTS-multiple paths setting, reduces its network congestion factor to a level comparable to the symmetric structures.


Fig. 7. Ring topology (network congestion factor vs. number of links for Spaghetti-single path, BDTS-single path and BDTS-multiple paths)

Fig. 8. Random topology (network congestion factor vs. window length factor for Spaghetti-single path, BDTS-single path and BDTS-multiple paths)

V. APPLICATION IN GRID NETWORKS

This section discusses the application of the above results to the design of Grid networks. For simplicity, we assume that a central bandwidth broker manages the reservation of bandwidth in the whole network. This architecture is practical for high-end Grid networks with a relatively low multiplexing level. When the network size and the number of coexisting demands grow, a careful selection of the bandwidth allocation granularity, over the time domain, the flow population (aggregate reservation) and the network topology (e.g., a hierarchy of bandwidth brokers), can be employed to strike a balance between the control overhead and the reduction in data transfer congestion.

There are both advance reservation transfers and immediate transfers in Grid networks. As the name suggests, the task definition of an advance reservation transfer is available to the bandwidth broker before the transfer can actually be scheduled. For example, the raw data produced by a scientific instrument may first be collected in a nearby data repository and then distributed to research institutes worldwide for analysis, say, every midnight. Note that periodic tasks are special advance reservation transfers that recur periodically. On the other hand, immediate transfers arrive dynamically at the moment their transfers are ready to start; for example, a scientist may sporadically request a remote dataset. When periodic transfers have high priority, they are first scheduled offline using Theorem 1 to minimize congestion.

To support the multi-interval scheduling decision without complicating the design of the transport layer, advance reservation transfers may be pre-processed and divided into a batch of sub-transfer tasks, one per constant-bandwidth interval. Appropriate network provisioning can also be carried out based on the attained solution. If tasks cannot be preempted, sporadic tasks can be scheduled as described in Corollary 1 (for a batch of sporadic transfers) or Corollary 2 (for a single sporadic transfer). Postponing the scheduling decision of a sporadic advance reservation transfer until its eligible time can increase the batch size, and thus the flexibility that can be exploited. On the other hand, if tasks can be preempted, i.e., a task's bandwidth allocation profile can be modified throughout its lifetime, the procedure described in Theorem 1 can be invoked every time a new task arrives, with the input task set consisting of the new task and all active preemptible tasks (with their remaining volumes). If there is no feasible solution that accepts the new task, an appropriate subset of tasks should be selected and rejected, subject to task priorities and system policy. Whether to support preemption is an implementation trade-off between system complexity and potential performance gain.
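The preemptive on-line procedure described above can be expressed as a small wrapper around the LP of Theorem 1. The function below is our illustration: the solver argument stands for a routine such as the single-link LP sketch of Section III and is not an interface defined by the paper.

```python
# Sketch (ours) of the preemptive on-line use of Theorem 1: re-run the scheduling
# LP on every arrival over the newcomer plus the remaining volumes of the active
# preemptible tasks. `solve_bdts` is a hypothetical solver argument, e.g. the
# single-link LP sketch given earlier.
def on_task_arrival(new_task, active_tasks, now, capacity, solve_bdts):
    """active_tasks: (remaining_volume, arrival, deadline); windows are clamped to now."""
    remaining = [(vol, max(eta, now), psi) for (vol, eta, psi) in active_tasks]
    f, intervals, profiles = solve_bdts(remaining + [new_task], capacity)
    if f > 1.0:
        return None                  # no feasible schedule: reject (or apply a policy)
    return intervals, profiles       # new step-function reservations for all tasks
```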

VI. RELATED WORKS

Gu and Grossman [10] propose UDT (UDP-based Data Transfer) to address the problem of transferring large volumetric datasets over high bandwidth-delay product optical networks. Like some TCP variants such as [3], [11], UDT employs a new congestion control algorithm targeting uncontrolled shared networks. Any adaptive protocol that can fully exploit the reserved high capacity will perform well in precisely controlled dedicated networks.

To provide bulk data transfer with QoS as an Agreement-Based service, Zhang et al. [12] evaluate the mechanisms of traffic prediction, rate limiting and priority-based adaptation. In this way, agreements guaranteeing that, within a certain confidence level, a file transfer will complete within a specified time are supported. In comparison, we consider dedicated networks and use more precise control (bandwidth reservation) to provide deadline-constrained bulk data transfer with a deterministic QoS guarantee.

Advance reservation of bandwidth [13] allows requesting bandwidth before the actual transfer is ready to happen; for example, a scheduled tele-conference may reserve bandwidth for a specified future time interval. Burchard et al. [14] show that advance reservation causes bandwidth fragmentation along the time axis (e.g., no capacity remains in [100s, 200s] in setting (a) of Figure 2), which may significantly reduce the acceptance probability of requests arriving later. To address the problem, they propose the concept of malleable reservation, which specifies a range (or set) from which the start time and the value of the single rate can be selected. However, they consider neither the volume and deadline requirements nor the time-variant bandwidth allocation flexibility, which are specific characteristics of bulk data transfer.

Seventh IEEE International Symposium on Cluster Computing and the Grid(CCGrid'07) 0-7695-2833-3/07 $20.00 © 2007

Guérin and Orda [7] address advance reservation from the routing perspective. They study the bulk data transfer service model, where the goal of a task is to transmit a given amount of data in the minimum amount of time. They assume discrete time slots and consider the case where the connection is capable of transmitting at any, even time-variant, bandwidth value that is available through the network. They prove that allowing advance reservation makes the optimal routing problem NP-hard, and path selection remains intractable in the more general case where nodes are allowed to buffer data en route. For this reason, our paper assumes that path information is pre-determined and that aggregate bandwidth can be achieved through multiple paths, so that we can focus on bulk data transfer scheduling in the time domain.

To minimize congestion, bulk data transfer scheduling shifts traffic from peak times to off-peak times in the time domain. This is similar to multi-path routing with load balancing [15], which deviates traffic from hot spots to unused network resources. In our problem formulation, we allow bulk data transfer scheduling to be optimized jointly with multi-path routing; essentially, both can be roughly classified as multi-commodity network flow problems. Kalaba and Juncosa [16] were the first to address communication network dimensioning problems using a multi-commodity flow approach.

Coffman et al. [17] also consider the problem of scheduling a given set of bulk data (file) transfer tasks. In their model, there are no deadline constraints and all vertices are directly connected: each file is transferred directly between its endpoints and forwarding is not allowed. They formalize the optimization problem as minimizing the total time (makespan) of the overall transfer process while respecting the port constraint of each vertex, i.e., the maximum number of simultaneous file transfers that the vertex can engage in. They also assume that once a transfer begins, it continues without interruption until it is completed, and the amount of time needed to transfer a file does not depend on the scheduling decision. They formalize it as a problem of scheduling the edges of a weighted multigraph and show that the general problem is NP-complete. While they put capacity in the constraints and the makespan in the objective function, we use deadlines as constraints and minimize the required capacity; our approach lets us consider the case where tasks have different active windows. Marchal et al. [18] also consider the scheduling of bulk data transfers with different active windows and specified volumes. Their objective is to optimize the acceptance rate and network resource utilization in a specific network topology by manipulating the transfer start time and the value of the single rate used; the formulated optimization problem is proven NP-complete.

VII. CONCLUSION

This paper studies the scheduling of bulk data transfers with specified volume, active time window and paths. Our results give insight into the computational complexity of advance reservation services in high performance networks. Theorem 1 and Theorem 2 show that supporting multi-interval transfer capability is both sufficient and necessary to minimize network congestion. Specifically, multi-interval scheduling not only extends the solution space to include the optimal solution, but also reduces the computational complexity of attaining optimality. Numerical results over representative topologies show that, compared to spaghetti scheduling, the optimal solutions obtained through an LP solver achieve more than 30% reduction in congestion in common settings. As future work, we expect to implement multi-interval scheduling in existing and planned dedicated networks, including the Grid'5000 network in France.

VIII. ACKNOWLEDGMENT

This work has been funded by INRIA, the French Ministry of Education and Research and CNRS, via ACI GRID's Grid'5000 project.

REFERENCES

[1] I. Bird et al. LHC computing grid technical design report. Technical Report CERN-LHCC-2005-024, June 2005.
[2] B. Chen and P. Primet. A flexible bandwidth reservation framework for bulk data transfers in grid networks. Research report, INRIA, June 2006.
[3] D. Katabi, M. Handley, and C. Rohrs. Congestion control for high bandwidth-delay product networks. In ACM SIGCOMM, Pittsburgh, PA, August 2002.
[4] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin. Resource ReSerVation Protocol (RSVP), IETF RFC 2205, September 1997.
[5] S. Gorinsky and N. Rao. Dedicated channels as an optimal network support for effective transfer of massive data. In High-Speed Networking (HSN), Barcelona, Spain, April 2006.
[6] J. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo. Implications of classical scheduling results for real-time systems. IEEE Computer, 28(6):16–25, 1995.
[7] R. Guérin and A. Orda. Networks with advance reservations: The routing perspective. In IEEE INFOCOM, Tel-Aviv, Israel, March 2000.
[8] C. Martel. Preemptive scheduling with release times, deadlines, and due times. Journal of the ACM, 29(3):812–829, July 1982.
[9] F. Shahrokhi and D. Matula. The maximum concurrent flow problem. Journal of the ACM, 37(2):318–334, April 1990.
[10] Y. Gu and R. Grossman. UDT: UDP-based data transfer for high-speed wide area networks. Computer Networks, special issue on hot topics in transport protocols for very fast and very long distance networks, January 2007.
[11] L. Xu, K. Harfoush, and I. Rhee. Binary increase congestion control for fast long-distance networks. In IEEE INFOCOM, Hong Kong, March 2004.
[12] H. Zhang, K. Keahey, and W. Allcock. Providing data transfer with QoS as agreement-based service. In IEEE International Conference on Services Computing (SCC), Shanghai, China, August 2004.
[13] D. Wischik and A. Greenberg. Admission control for booking ahead shared resources. In IEEE INFOCOM, San Francisco, CA, April 1998.
[14] L. Burchard, H. Heiss, and C. De Rose. Performance issues of bandwidth reservations for grid computing. In Symposium on Computer Architecture and High Performance Computing (CAHPC), pages 82–90, Sao Paulo, Brazil, November 2003.
[15] R. Banner and A. Orda. Multipath routing algorithms for congestion minimization. In NETWORKING 2005: 4th International IFIP-TC6 Networking Conference, Waterloo, Canada, May 2005.
[16] R. Kalaba and M. Juncosa. Optimal design and utilization of communication networks. Management Science, 3(1):33–44, October 1956.
[17] E. Coffman, Jr., M. Garey, D. Johnson, and A. LaPaugh. Scheduling file transfers in a distributed network. In Proceedings of the Second Annual ACM Symposium on Principles of Distributed Computing, pages 254–266, Montreal, Quebec, Canada, 1983.
[18] L. Marchal, P. Primet, Y. Robert, and J. Zeng. Optimal bandwidth sharing in grid environment. In IEEE High Performance Distributed Computing (HPDC), Paris, France, June 2006.
