Cluster 2005, Boston, Massachusetts, September 2005

Search-based Job Scheduling for Parallel Computer Workloads

Sangsuree Vasupongayya, Su-Hui Chiang, and Bart Massey
Computer Sciences Department, Portland State University
{vsang, suhui, bart}@cs.pdx.edu

Abstract

To balance performance goals and allow administrators to declaratively specify high-level performance goals, we apply complete search algorithms to design on-line job scheduling policies for workloads that run on parallel computer systems. We formulate a hierarchical two-level objective that captures two goals commonly placed on parallel computer systems: (1) minimizing the total excessive wait; (2) minimizing the average slowdown. Ten monthly workloads that ran on a Linux cluster (IA-64) at NCSA are used in our policy simulations. A wide range of measures is used for performance evaluation, including the average slowdown, average wait, maximum wait, and new measures based on excessive wait. For the workloads studied, our results show that the best search-based scheduling policy reported here (DDS/lxf/dynB) simultaneously beats both FCFS-backfill and LXF-backfill, which roughly provide a lower bound on the maximum wait and on the average slowdown, respectively, among backfill policies.

Keywords: parallel job scheduling, backfill, discrepancy-based search.

1. Introduction

Scheduling for high-performance parallel computer workloads typically involves numerous performance goals that conflict with each other. For example, giving priority to short jobs improves responsiveness but may starve long jobs; favoring small jobs may improve throughput but has the potential to starve large-resource jobs. The problem is particularly challenging for non-preemptive job scheduling policies, an important class of policies currently used on many production parallel systems. These systems typically schedule jobs using some manually tuned priority function, which is not only complex but also often ineffective. Ideally, administrators should be able to declaratively specify high-level performance goals, while the scheduler automatically optimizes for those goals.


Typically, priority job schedulers use either queue-based or job-based priority. Under queue-based priority schedulers (e.g., PBS [11], LSF [12]), administrators can give higher priority to certain queues (e.g., short jobs). However, jobs in low-priority queues may starve. Although measures may be taken to reduce starvation, it is difficult to tune the performance [4]. To better balance performance goals, many schedulers use a weighted priority function, as in the Maui Scheduler [9]. The job priority is a weighted sum of job measures, such as the current job waiting time, estimated runtime, and requested number of processors. The weights can be adjusted to change the relative importance of the measures. While this is more flexible, it is still difficult for administrators to tune the priority weights, due to the complex interactions among the different priority components. Furthermore, even if a set of priority weights works well for one period of time, it may perform poorly for another.

In this study, we develop a goal-oriented job scheduler that allows administrators to specify high-level performance goals and that automatically determines job priority to achieve those goals, adapting to workload changes without manual adjustment. Our new policies are based on combinatorial search techniques. Many good heuristics for exploring the potentially large search space have been proposed and applied to planning and scheduling in AI and other areas of computer science. The key design issues and questions to be addressed are: (1) How should multiple performance goals be formulated in an objective function such that search techniques can be applied to find the 'best' schedule? (2) Within a reasonable time constraint, how do different search algorithms compare in the context of the parallel job scheduling problem? (3) How well do search-based, goal-oriented scheduling policies perform compared with traditional priority-based scheduling policies, in particular the commonly used backfill policies [10]?

To our knowledge, this is the first paper that applies complete search algorithms to deal with multiple, potentially conflicting performance goals for scheduling jobs on parallel computer systems.

We evaluate the performance of scheduling policies by simulation, using ten monthly workloads that ran on a Linux cluster (IA-64) at NCSA during 2003-2004. In the next section, we discuss the objective function used and the new search-based policies studied. Section 3 provides information on the workloads used and reviews related work. Section 4 discusses our evaluation methodology. Sections 5-6 present the policy performance evaluation results. Section 7 provides a summary.

2. Search algorithms and new policies

Given an objective function, parallel job scheduling can be viewed as a special case of combinatorial optimization, in that we are looking for the optimal schedule from a finite set of possible schedules according to the objective. On the systems considered, each job is submitted with a required number of nodes and a runtime. Upon each job arrival and departure, the scheduler determines which waiting jobs can be started at the current time. The number of possible schedules at each decision point is potentially large. We apply search algorithms to find a good schedule within a given time constraint. In this section, we discuss the objective function used, the search algorithms, and the new search-based policies.

2.1. Formulating the Objective

A wide range of performance measures may be considered for optimization. In this study, we focus on two goals commonly placed on parallel computers: (1) preventing any job from incurring an 'excessive' wait; (2) minimizing the average slowdown. The slowdown of a job is its turnaround time divided by its actual runtime. The excessive wait time of a job is the job's wait time in excess of a given threshold, also called the target wait bound. The first goal may thus be formulated as minimizing the total excessive wait time over all jobs. In this study, we consider both fixed and dynamic thresholds, discussed in Section 5.

Given the two potentially conflicting goals, the design issue is how to optimize for both simultaneously. One option is to define the objective as a weighted function of the two measures. However, this can be complex because it requires choosing the weights. A simpler alternative, sufficient for our purpose, is a hierarchical two-level objective: (1) minimize the total excessive wait; (2) minimize the average slowdown. In effect, we attempt to prevent jobs from incurring an excessive wait, while optimizing the average slowdown whenever possible. Specifically, schedule A is better than schedule B if A has a smaller total excessive wait, or if the two schedules have the same total excessive wait but A has a lower average slowdown.
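To make the comparison rule concrete, the following minimal Python sketch (our own illustration, not the paper's code; the names ScheduleScore, excessive_wait, and better_schedule are ours) expresses it as a lexicographic ordering on the pair (total excessive wait, average slowdown):

```python
from dataclasses import dataclass

@dataclass
class ScheduleScore:
    total_excessive_wait: float  # sum over all jobs of max(0, wait - target bound)
    avg_slowdown: float          # mean over all jobs of (wait + runtime) / runtime

def excessive_wait(wait: float, target_bound: float) -> float:
    """Per-job wait time in excess of the target wait bound (0 if within the bound)."""
    return max(0.0, wait - target_bound)

def better_schedule(a: ScheduleScore, b: ScheduleScore) -> bool:
    """Hierarchical two-level comparison: schedule a beats schedule b if it has a
    smaller total excessive wait, with ties broken by a lower average slowdown."""
    if a.total_excessive_wait != b.total_excessive_wait:
        return a.total_excessive_wait < b.total_excessive_wait
    return a.avg_slowdown < b.avg_slowdown
```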

[Figure 1. Illustration of the search tree and the LDS and DDS algorithms for four jobs: (a) the full tree of 4! = 24 paths, with the 0th iteration of LDS and DDS in bold; (b) LDS 1st iteration; (c) LDS 2nd iteration; (d) tree size versus number of waiting jobs (4 jobs: 24 paths, about 64 nodes; 10 jobs: about 3,629K paths and 9,864K nodes; 15 jobs: about 1,307,674M paths and 3,554,627M nodes; K = 1,000, M = 1,000,000); (e) DDS 1st iteration; (f) DDS 2nd iteration.]

2.2. Search algorithms

To explore a potentially large search space within a time limit, various search algorithms have been proposed. Roughly speaking, there are two classes of search algorithms: local search and complete search. Both have been applied successfully to constraint satisfaction problems and resource-constrained project scheduling problems [6] in AI, bioinformatics, and other areas of computer science. In this study, we apply complete search algorithms. Our future work will study combining complete search with local search, to possibly improve the solution, as suggested in [5].

Complete search algorithms organize the possible solutions into a search tree according to a branching heuristic and systematically explore the solutions in the tree. We consider two commonly used discrepancy-based search algorithms: (1) limited discrepancy search (LDS) [7, 8]; (2) depth-bounded discrepancy search (DDS) [17]. They differ in the order in which they explore the solutions.

To explain the structure of our search tree and the search algorithms, consider four waiting jobs, numbered 1 to 4 in their arrival order. There are 4! (i.e., 24) possible orders (e.g., 1-2-3-4, 2-1-3-4) in which the four jobs can be considered for scheduling. Note that the order in which jobs are considered for scheduling is not necessarily the order in which they start. Figure 1(a) shows a tree containing the 24 job orders. Each node (i.e., job) is labeled by the job identifier, except the root. Each path (also called a schedule) from the root to a leaf constitutes a possible order in which the jobs are considered for scheduling. There are four branches (1, 2, 3, 4) at the root (i.e., depth 0), three branches at each depth-one node, and so on. In general, given n jobs, there are n! paths and O(n^n) nodes; each node at depth i has n - i branches. The tree size grows exponentially with n, as shown in Figure 1(d) for several values of n.

In the example tree, the fcfs branching heuristic is used to order the branches from left to right at each node: only the left-most branch follows the heuristic (e.g., at the root, branch 1 follows the heuristic); any other branch breaks the heuristic and is, by convention, called a discrepancy. Note that the left-most path, 0-1-2-3-4, shown in bold in Figure 1(a), is the only path that contains no discrepancy. The assumption behind discrepancy-based search algorithms is that a good heuristic is likely to make only a few mistakes.

Both LDS and DDS proceed in iterations, exploring and comparing paths until the given time or node limit is reached. Figure 1 shows in bold the paths to be explored, from left to right, in the 0th, 1st, and 2nd iterations. LDS explores the paths that contain the fewest discrepancies first. On the 0th iteration, LDS always branches left with the heuristic, i.e., the path 0-1-2-3-4 highlighted in Figure 1(a). On the 1st iteration, LDS visits the six paths containing one discrepancy, e.g., 0-1-2-4-3 shown in Figure 1(b), in which the branch from node 2 to node 4 is the only discrepancy. Similarly, on the 2nd iteration, LDS explores the eleven paths containing two discrepancies, shown in Figure 1(c).

DDS biases the search toward discrepancies high in the tree. On the ith iteration, DDS explores the branches on which a discrepancy occurs at depth i; discrepancies are prohibited below depth i but are allowed above it. On the 0th iteration, DDS explores the left-most branches, the same as LDS. On the 1st iteration, shown in Figure 1(e), DDS explores the three paths containing one discrepancy at depth one, i.e., branch 2, 3, or 4 at the root, and no discrepancy below. On the 2nd iteration, shown in Figure 1(f), DDS explores the eight paths containing any branch at depth one, one discrepancy at depth two, and no discrepancy below, e.g., 0-1-3-2-4 or 0-2-3-1-4, in which branching to 3 is a discrepancy at depth two.

If the branching heuristic is not good enough, then by biasing the search toward discrepancies high in the tree, DDS has the potential to find good solutions sooner than LDS does. For example, 0-4-3-1-2 (the last path in the 2nd iteration of both algorithms), containing two discrepancies at depths one and two, is the 12th path explored by DDS but the 18th path explored by LDS. The difference between when such a path is explored under DDS and under LDS grows exponentially with the number of jobs in the tree; within a given time or node limit, such a path may be explored under DDS but not under LDS.
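As an illustration of how the iterations partition the paths, the short Python sketch below (our own, not the paper's simulator; it enumerates all permutations, which is only sensible for tiny examples such as the four-job tree) counts discrepancies against the fcfs order and reproduces the path counts quoted above:

```python
from itertools import permutations

def discrepancy_depths(path, heuristic_order):
    """Depths (1-based) at which 'path' departs from the branching heuristic.
    At each node the heuristic prefers the earliest remaining job in
    heuristic_order; choosing any other job is one discrepancy."""
    rank = {job: i for i, job in enumerate(heuristic_order)}
    remaining = list(heuristic_order)
    depths = []
    for depth, job in enumerate(path, start=1):
        if job != min(remaining, key=lambda j: rank[j]):
            depths.append(depth)
        remaining.remove(job)
    return depths

jobs = [1, 2, 3, 4]                # fcfs (arrival) order of four waiting jobs
paths = list(permutations(jobs))   # 4! = 24 possible scheduling orders

# LDS iteration k explores the paths with exactly k discrepancies, anywhere in the tree.
lds_iter1 = [p for p in paths if len(discrepancy_depths(p, jobs)) == 1]
lds_iter2 = [p for p in paths if len(discrepancy_depths(p, jobs)) == 2]

# DDS iteration k (k >= 1) explores the paths whose deepest discrepancy is at depth k.
def dds_iteration(k):
    return [p for p in paths
            if (d := discrepancy_depths(p, jobs)) and max(d) == k]

print(len(lds_iter1), len(lds_iter2))                 # 6 and 11, as in Figure 1(b)-(c)
print(len(dds_iteration(1)), len(dds_iteration(2)))   # 3 and 8, as in Figure 1(e)-(f)
```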

2.3. New search-based scheduling policies

We implement search-based policies using both the LDS and DDS algorithms. Two branching heuristics are used: fcfs and lxf (i.e., largest slowdown first), consistent with the two criteria in the objective defined in Section 2.1. Note that the lxf branching order is also the order of the jobs if they are prioritized using the hierarchical two criteria of the objective. Thus, there are four policies: LDS/fcfs, LDS/lxf, DDS/fcfs, and DDS/lxf. Each policy may use a fixed or a dynamic target wait bound, to be explained in Section 5. Below, we discuss how schedules are compared, the stopping criterion for the search algorithms, and some limitations of search-based policies.

For each schedule, the start time of each job, say J, is computed in the order it appears on the path, according to the currently executing jobs and the already determined start times of the jobs that appear above J on the given path. The solution (i.e., the total excessive wait and the average slowdown) of each schedule is computed once the start times of all jobs on the path are determined.

Iterative search algorithms have the anytime property: the longer the algorithm runs, the better the solution available, and the current best solution can be extracted at any time. For comparison purposes, we impose a limit on the number of nodes visited, L, at each scheduling decision point, rather than a time limit. We vary L between 1K and 100K, which covers only a tiny fraction of the tree under the high-load workloads studied, in which there are at least 10 waiting jobs at most scheduling decision points. For example, a tree of 10 waiting jobs contains almost 10 million nodes (Figure 1(d)); L = 1K covers only 0.01% of those nodes, and even L = 100K covers only 1%. Thus, the success of the search algorithms depends on whether they explore good schedules soon enough. The scheduling overhead increases with the number of nodes visited and, to a lesser extent, with the number of waiting jobs. In our simulation, it takes 30-65 milliseconds to visit 1K-8K nodes in a tree of 30 jobs, which is more jobs than are waiting at 70% of the decision points. Our simulator is implemented in Java and runs on a 2-GHz Intel Pentium 4 machine under Windows XP with 512MB of memory.

Like backfill policies, our search-based policies use job runtimes to make scheduling decisions, and user-estimated runtimes are known to be inaccurate (e.g., [1]). To understand the full potential of search-based and backfill policies, we focus on results that use actual runtimes in this study; results using requested runtimes are also provided. Another limitation is that our search-based policies make each decision using only the jobs currently present in the system. Future work includes incorporating prediction of job runtimes and arrivals to address these limitations and improve performance.
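The schedule evaluation described above can be sketched as follows. This is a minimal Python illustration under our own simplifying assumptions (the paper's simulator is written in Java, and the names Job and evaluate_path are ours): each job on the path is placed at the earliest time enough nodes are free, given the running jobs and the jobs already placed earlier on the path, and the resulting schedule is scored with the two-level objective.

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: float   # submission time (hours)
    runtime: float   # runtime used by the scheduler (hours)
    nodes: int       # requested number of nodes

def evaluate_path(path, running, now, capacity, target_bound):
    """Place the waiting jobs in 'path' order, each at the earliest time enough
    nodes are free, and return (total excessive wait, average slowdown).

    'running' is a list of (end_time, nodes) pairs for currently executing jobs.
    """
    intervals = [(now, end, n) for end, n in running]  # (start, end, nodes) allocations
    total_excess, slowdowns = 0.0, []

    def free_nodes(t):
        return capacity - sum(n for s, e, n in intervals if s <= t < e)

    for job in path:
        # The earliest feasible start is 'now' or a time at which an allocation ends.
        candidates = sorted({now} | {e for _, e, _ in intervals if e > now})
        for start in candidates:
            # Check free nodes at every point where usage can change within the window.
            window = [start] + [t for s, e, _ in intervals
                                for t in (s, e) if start < t < start + job.runtime]
            if all(free_nodes(t) >= job.nodes for t in window):
                break
        intervals.append((start, start + job.runtime, job.nodes))
        wait = start - job.arrival
        total_excess += max(0.0, wait - target_bound)
        slowdowns.append((wait + job.runtime) / job.runtime)

    return total_excess, sum(slowdowns) / len(slowdowns)
```

Under the hierarchical objective, the search keeps the path whose returned pair is lexicographically smallest, and the best pair found so far can be reported whenever the node limit L is reached.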

3. Background

This section provides information about the system and workloads used and reviews relevant parallel job scheduling policies. For convenience, Table 1 defines the notation used.

Table 1. Notation
  N    Number of nodes requested by a job
  T    Actual job runtime
  R    Requested job runtime in the trace
  R*   Requested job runtime used in the simulation
  ρ    Offered load in the simulation
  L    Maximum number of nodes visited by the search algorithms

3.1. Workloads

An Intel Itanium (IA-64) Linux cluster (a.k.a. Titan) at NCSA is studied. Table 2 summarizes the system capacity and job limits. There are 128 dual-processor nodes, and a node is the smallest allocation unit. Note that the runtime limit was increased from 12 to 24 hours in December 2003.

Table 2. Capacity and job limits on IA-64
  Capacity: 128 nodes
  Period 6/03 - 11/03:  job limit N ≤ 128, R ≤ 12h
  Period 12/03 - 3/04:  job limit N ≤ 128, R ≤ 24h

Table 3 shows the load and job mix in each month. The "Total" column for each month shows that the processor demand of all jobs (i.e., N × T of the jobs as a fraction of the total processor time available during the month) is typically in the range of 70-80%, but July 2003 has a higher load (89%). For each month, the number of jobs and the load in each range of requested nodes are shown as a fraction of the monthly "Total" column. The table shows that the job mix varies from month to month; the symbol "*" highlights particularly high values. In particular, (1) one-node jobs contribute over 10% of the processor demand in a few months (12/03, 1/04), but under 5% in most months; (2) July 2003 has a much larger demand from the largest jobs (N > 64) than other months: the largest jobs in July 2003 account for about 50% of the processor demand and 8.5% of the submitted jobs, compared to under 20% of the processor demand and under 2% of the submitted jobs in many other months.

Table 4 provides information about the distribution of actual job runtimes. Two ranges of actual runtime are shown: T ≤ 1 hour and T > 5 hours. For each month, jobs are partitioned into five classes according to the requested number of nodes, and the table gives the fraction of all jobs in each range of requested nodes and actual runtime. The sum over all ranges of requested nodes for each month is given in the row "all". The January 2004 workload stands out with the following features. First, a large fraction (32.7%, marked by "*") of the jobs in 1/04 are relatively long (T > 5 hours), compared to under 15% in most other months. Second, the majority of these long jobs in 1/04 are one-node jobs. Third, a large fraction (20.5%, marked by "*") of the jobs in 1/04 are relatively wide and short (N = 9-32 and T ≤ 1 hour), versus under 7% in most other months. The distinct features of 7/03 and 1/04 present a great challenge for scheduling policies, as shown in Section 6.

Table 3. Overview of monthly job mix on NCSA/IA-64 (each entry is the percentage of the monthly total; "*" marks particularly high values)

  Month  Measure        Total      1      2    3-4    5-8   9-16  17-32  33-64  65-128
  6/03   #jobs           2191  26.7%  11.3%  29.8%   6.3%   8.5%  10.5%   3.7%    2.4%
         proc. demand     82%   0.3%   0.1%   1.3%   1.1%  23.0%  37.4%* 21.7%*  14.6%
  7/03   #jobs           1399  26.2%   9.1%   6.9%  18.4%   7.9%  13.2%   8.4%    8.5%*
         proc. demand     89%   0.5%   0.2%   0.4%   3.6%   6.7%  16.9%  21.3%   49.7%*
  8/03   #jobs           3220  74.6%   5.4%   1.3%   4.9%   4.9%   4.6%   1.8%    2.1%
         proc. demand     79%   1.7%   0.7%   0.1%   3.5%   9.6%  30.8%* 17.9%   35.5%*
  9/03   #jobs           3056  58.0%  10.4%   6.4%   5.8%   6.6%   8.4%   1.1%    2.9%
         proc. demand     72%   3.1%   0.5%   0.5%   4.3%   8.8%  35.4%* 12.4%   34.6%*
  10/03  #jobs           4149  53.8%  20.5%   5.8%   8.8%   5.5%   3.6%   1.6%    0.3%
         proc. demand     71%   4.7%   6.6%   1.6%  10.1%  17.3%  25.3%  24.1%   10.2%
  11/03  #jobs           3446  60.1%  17.4%   4.9%   5.3%   3.6%   4.1%   3.7%    0.8%
         proc. demand     73%   8.0%   3.7%   0.9%   4.4%  11.6%  11.1%  37.0%*  23.3%*
  12/03  #jobs           3517  64.1%  12.5%   6.8%   3.5%   3.7%   5.9%   2.7%    0.9%
         proc. demand     74%  11.0%*  5.1%   7.6%   2.1%   9.5%  18.9%  39.7%*   6.1%
  1/04   #jobs           3154  39.0%  18.3%   8.0%   4.6%   9.2%  18.1%   1.7%    1.2%
         proc. demand     73%  12.0%*  8.8%   5.3%   3.7%  17.3%  17.9%  17.1%   18.0%
  2/04   #jobs           3969  44.1%  31.8%* 10.0%   4.5%   4.6%   2.5%   1.7%    0.8%
         proc. demand     74%   7.7%   9.9%  11.7%   7.0%  18.8%  20.3%   8.1%   16.4%
  3/04   #jobs           3468  57.5%  13.1%  10.3%   7.6%   5.8%   2.3%   1.6%    1.7%
         proc. demand     75%   2.8%   4.6%   8.3%   7.7%  37.6%* 16.8%   6.3%   15.9%
  (Columns 1 through 65-128 give the range of requested number of nodes.)

3.2. Previous parallel job scheduling policies

In this paper, we focus on non-preemptive policies, of which backfill policies are perhaps the most promising policies studied. Several widely used production schedulers (Maui, LSF, PBS, LoadLeveler) also have a backfill feature. Below, we review backfill policies and several variations.

Under priority backfill policies, jobs are considered for scheduling in the given priority order, using backfill to schedule lower-priority jobs on resources that would otherwise be idle in the strict priority schedule. A given number of highest-priority waiting jobs are each given a scheduled start time, which is the earliest time at which enough resources will be available to start the job. The number of priority jobs that receive scheduled start times is a parameter of the policy. Many priority functions have been proposed for backfill policies, including FCFS (first come first served), SJF (shortest job first), LXF (largest slowdown first), and LXF&W (i.e., LXF plus a very small weight for job wait time). The key conclusion drawn from our previous papers [4, 1, 3] is that giving priority to short jobs can significantly improve the average wait and slowdown of FCFS-backfill, but may degrade the maximum wait. Thus, there is a tradeoff between minimizing the maximum wait and minimizing average-performance measures. For example, LXF-backfill significantly improves the average slowdown and wait of FCFS-backfill but has a poor maximum wait. In the extreme case, SJF-backfill has a starvation problem for long-running jobs [4] and thus is not a practical policy.

Several variations of backfill policies have been proposed to improve on FCFS-backfill, including Slack-based backfill [15], Relaxed backfill [18], Selective backfill [14], Adaptive backfill [16], and Lookahead [13]. They were shown to have a lower average slowdown and/or average wait than FCFS-backfill; however, they potentially incur a worse maximum wait. To verify this, we studied Selective backfill and Lookahead using the NCSA/IA-64 workloads. We found that Selective backfill performs very similarly to LXF-backfill, while Lookahead is very similar to FCFS-backfill (not shown, to conserve space).
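The priority functions named above can be written down compactly. The Python sketch below is our own illustration, not code from any of the cited schedulers; a job is assumed to expose its arrival time and its (estimated) runtime, as in the Job sketch of Section 2.3, higher values mean higher priority, and the weight w is an assumed illustrative value.

```python
def fcfs_priority(job, now):
    """First come first served: the earliest-arrived job has the highest priority."""
    return -job.arrival

def sjf_priority(job, now):
    """Shortest job first, by the runtime estimate known to the scheduler."""
    return -job.runtime

def lxf_priority(job, now):
    """Largest slowdown (expansion factor) first: (wait + runtime) / runtime."""
    wait = now - job.arrival
    return (wait + job.runtime) / job.runtime

def lxf_w_priority(job, now, w=0.01):
    """LXF&W: LXF plus a very small weight w for the job wait time (w is illustrative)."""
    return lxf_priority(job, now) + w * (now - job.arrival)
```

A priority backfill policy sorts the queue by one of these functions, gives the highest-priority waiting job(s) a reservation at the earliest feasible start time, and starts lower-priority jobs early only if doing so does not delay the reservation(s).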

4. Evaluation methodology

We evaluate job scheduling policies using event-driven simulation of the ten monthly IA-64 job traces discussed in Section 3.1. The new search-based scheduling policies are compared against FCFS-backfill and LXF-backfill, which roughly provide the envelope for the maximum wait and the average slowdown, respectively, based on the discussion in Section 3.2. In our simulation, both backfill policies give only one priority job a scheduled start time, as we did not find that more reservations improve performance.

An extensive set of measures is used for performance evaluation, including the average wait, maximum wait, and average bounded slowdown, as well as several normalized excessive wait measures that provide information about jobs that incur long waits and evaluate the extent to which the search-based policies minimize the total excessive wait. These measures are computed over all jobs and over each job class (defined by T and N). As in previous papers (e.g., [10, 14]), we use the bounded slowdown instead of the slowdown, to reduce the dramatic effect of very short jobs on the average slowdown measure. We use 1 minute to lower-bound the actual runtime; i.e., the bounded slowdown of a job shorter than 1 minute is 1 + its wait time in minutes, the same as for a 1-minute job.

The normalized excessive wait time of each job is the job's wait time in excess of a threshold t. For simplicity, "normalized" is omitted in the text. Note that if a job has a wait time ≤ t, the job has no excessive wait. To compare different policies in each month, two values are used for t: the maximum wait and the 98th-percentile wait of FCFS-backfill in the given month. We denote the excessive wait with respect to each t by E^max_fcfs-bf and E^98%_fcfs-bf, respectively. From the per-job excessive waits, we compute the total and the average excessive wait over all jobs that have an excessive wait. By definition, FCFS-backfill has a zero total E^max_fcfs-bf in any month.

Two levels of load are simulated: (1) ρ = original load and (2) ρ = 0.9, studied in Section 5 and Section 6, respectively. Recall that most of the IA-64 monthly loads studied are 70-80%. The ρ = 0.9 results are used to examine the impact of a high load, artificially created by shrinking job interarrival times, as in previous papers (e.g., [14, 2]). To be realistic, each simulation of a given month includes a one-week warm-up (from the previous month) and a one-week cool-down (from the next month). Performance measures for a month are computed over the jobs submitted during that month.
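A minimal Python sketch of these two measures follows (our own illustration; the function names are not from the paper). It uses the 1-minute lower bound on runtime for the bounded slowdown and an arbitrary threshold t for the excessive wait:

```python
def bounded_slowdown(wait_h, runtime_h, bound_h=1.0 / 60.0):
    """Bounded slowdown with the runtime lower-bounded at 1 minute (times in hours)."""
    r = max(runtime_h, bound_h)
    return (wait_h + r) / r

def excessive_wait_stats(waits_h, t_h):
    """Total and average (normalized) excessive wait over the jobs whose wait exceeds t."""
    excess = [w - t_h for w in waits_h if w > t_h]
    total = sum(excess)
    return total, (total / len(excess) if excess else 0.0)

# In the paper, t is either the maximum wait or the 98th-percentile wait of
# FCFS-backfill in the same month (E^max_fcfs-bf and E^98%_fcfs-bf).
```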

Table 4. Distribution of actual job runtime in the monthly NCSA/IA-64 workloads (each percentage is the fraction of all jobs in the given month)

  T ≤ 1 hour
  # Nodes   6/03   7/03   8/03   9/03  10/03  11/03  12/03   1/04   2/04   3/04
  1        24.9%  20.9%  68.8%  42.6%  37.5%  33.7%  36.0%  12.9%  34.1%  53.2%
  2        11.1%   7.7%   4.3%   9.8%   8.3%  12.5%   6.5%   6.0%  20.5%  10.1%
  3-8      34.7%  18.5%   4.7%   9.9%  10.1%   6.8%   6.2%   7.1%   9.9%  13.9%
  9-32      6.2%  13.4%   4.6%  10.9%   4.9%   5.1%   7.0%  20.5%*  4.6%   4.5%
  33-128    3.0%   9.4%   1.8%   2.4%   0.7%   2.1%   1.7%   1.9%   1.9%   2.5%
  all      80.0%  69.9%  84.1%  75.6%  61.6%  60.2%  57.4%  48.4%  71.0%  84.1%

  T > 5 hours
  # Nodes   6/03   7/03   8/03   9/03  10/03  11/03  12/03   1/04   2/04   3/04
  1         0.3%   2.4%   2.5%   3.9%   4.1%   8.7%  14.0%  23.1%*  6.8%   3.0%
  2         0.0%   0.4%   0.7%   0.4%   3.1%   4.4%   4.4%   5.0%   3.6%   2.6%
  3-8       0.7%   3.0%   1.0%   1.3%   2.1%   1.4%   2.7%   2.4%   3.3%   3.2%
  9-32      7.0%   5.0%   3.5%   2.9%   3.3%   1.9%   1.7%   1.5%   1.7%   2.9%
  33-128    1.7%   4.6%   1.4%   1.2%   0.8%   1.6%   1.0%   0.7%   0.3%   0.3%
  all       9.8%  15.4%   9.1%   9.7%  13.4%  18.0%  23.8%  32.7%* 15.8%  12.0%

[Figure 2. Sensitivity to the fixed target bound ω (50, 100, and 300 hours). DDS/lxf; R* = T; ρ = original load; L = 1K. Panels: (a) maximum wait (hr) and (b) average bounded slowdown, for each month 6/03-3/04.]

5. Results under the original load

In this section, we evaluate the performance of the search-based policies under the original IA-64 monthly workloads. Of the four policies discussed in Section 2.3, we find that DDS/lxf performs the best with respect to the objective used; comparisons of the search algorithms are provided in Section 6.3. We therefore report only the results of DDS/lxf, using fixed and dynamic target bounds in Sections 5.1 and 5.2, respectively. Actual job runtimes are used by the schedulers for the results reported in this section.

5.1. Performance of a fixed target wait bound

For the original monthly workloads, the maximum wait under FCFS-backfill is about 50 hours for all months except July 2003, shown in the next section. Thus, we vary the fixed target wait bound, ω, from 0 to 300 hours to study the sensitivity of DDS/lxf to the target bound used in the first criterion of the objective, defined in Section 2.1.

Figure 2 shows the performance of DDS/lxf using ω = 50, 100, and 300 hours for each of the ten months, shown for L = 1K. The results of using a larger node limit, up to 10K, are very similar (not shown). As shown in Figure 2(a), the maximum wait time increases in most months as ω increases from 50 to 300 hours, and approaches the given ω in many months. On the other hand, Figure 2(b) shows that the average slowdown is not very sensitive to ω in any month except July 2003. The results for the average wait (not shown) are similar, with a smaller performance difference between the different ω. The same trend is observed for DDS/fcfs and the LDS policies (not shown).

Obviously, the maximum wait time cannot be reduced arbitrarily. For example, using ω = 25 hours (not shown) performs almost the same as ω = 50 hours in each month studied. In the extreme case of ω = 0 (not shown), the first-level objective (i.e., minimizing the total excessive wait) is equivalent to minimizing the average wait, resulting in a maximum wait of more than 300 hours in several months.

[Figure 3. Performance comparisons under the original load; R* = T; L = 1K. Panels: (a) average wait (hr), (b) maximum wait (hr), and (c) average bounded slowdown, for FCFS-backfill, LXF-backfill, and DDS/lxf/dynB in each month 6/03-3/04.]

The key conclusion is that the performance of the search-based policies is sensitive to the target wait bound, and that using a target bound that is too small or too large can be detrimental. Since the workload can change over a short period of time (say, a few hours), a target bound that automatically adapts to the workload is more appropriate.

5.2. Performance of a dynamic target wait bound

In this section, we consider a dynamic target wait bound, defined as the waiting time of the job that has currently been waiting the longest in the queue. We compare DDS/lxf using the dynamic target bound (i.e., DDS/lxf/dynB) against LXF-backfill and FCFS-backfill under the original load. Again, L = 1K is found to be sufficient for DDS/lxf/dynB.

Figure 3(a)-(c) plot the average wait, maximum wait, and average bounded slowdown, respectively, for each policy. As expected, LXF-backfill has a lower average wait and average slowdown than FCFS-backfill (graphs (a) and (c)), but FCFS-backfill has a lower maximum wait than LXF-backfill (graph (b)). In comparison, DDS/lxf/dynB not only has the best or close to the best average wait and average slowdown (except in 7/03 and 1/04), but also the best maximum wait in every month. In addition, DDS/lxf/dynB has the best or close to the best total excessive wait w.r.t. both the maximum and the 98th-percentile wait of FCFS-backfill (not shown, to conserve space). The results are encouraging, even though the performance differences are not dramatic under the original load. Next, we study the impact of a higher load on performance.
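As a one-line illustration (ours, with an assumed job attribute as in the earlier sketches), the dynamic target bound at a decision point is simply the age of the oldest waiting job:

```python
def dynamic_target_bound(waiting_jobs, now):
    """Wait time of the currently longest-waiting job (0 if the queue is empty)."""
    return max((now - job.arrival for job in waiting_jobs), default=0.0)
```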

6. Further performance comparisons

This section provides further performance comparisons under high-load workloads. Sections 6.1-6.2 study the impact of high-load workloads and the impact of the number of nodes visited on DDS/lxf/dynB. Section 6.3 compares the search algorithms. Finally, Section 6.4 studies the impact of using inaccurate requested runtimes.

6.1. Performance under high load

In this section, we compare DDS/lxf/dynB against the two baseline backfill policies under ρ = 0.9. We find that L = 1K is still sufficient for all months except January 2004, in which DDS/lxf/dynB requires exploring more than 1K nodes to reduce the total excessive wait, due to a larger backlog in that month than in the other months under ρ = 0.9 (shown in Figure 4(d)). To demonstrate the potential of goal-oriented scheduling policies based on search, we show the results of DDS/lxf/dynB using L = 8K in January 2004 and L = 1K in the other months. The impact of the number of nodes visited on search performance in January 2004 is presented in Section 6.2; other search algorithms are compared with DDS/lxf/dynB in Section 6.3.

Figure 4(a)-(c) plot the average wait, maximum wait, and average slowdown of each policy for each month under ρ = 0.9. The results are qualitatively similar to those under the original load (Figure 3(a)-(c)), except that the performance differences between policies are larger under high load. Figure 4(e)-(f) plot the total excessive wait w.r.t. the 98th-percentile and the maximum wait of FCFS-backfill. Figure 4(g)-(h) plot the number of jobs with an excessive wait and the average excessive wait w.r.t. the maximum wait of FCFS-backfill. The results show that (1) DDS/lxf/dynB has close to zero total excessive wait w.r.t. the maximum wait of FCFS-backfill in each month except 1/04; (2) DDS/lxf/dynB has a lower total excessive wait w.r.t. the 98th-percentile wait of FCFS-backfill than LXF-backfill and even FCFS-backfill in each month except 1/04; and (3) under LXF-backfill, the 'unfortunate' jobs incur an average of 20-60 hours in excess of the maximum wait of FCFS-backfill in many months.

We note that 7/03 and especially 1/04 present a great challenge for all policies studied. As shown in Figure 4(a)-(b), FCFS-backfill has a much worse average wait and LXF-backfill has a much worse maximum wait in these two months than in the others; DDS/lxf/dynB also makes some compromises in these two months. The challenge is likely due to the large demand of long jobs and medium-wide jobs in 1/04 and of wide jobs in 7/03, as shown in Section 3.1.

To further understand the behavior of each policy, Figure 5(a)-(c) plot the average wait for each job class under FCFS-backfill, LXF-backfill, and DDS/lxf/dynB for July 2003. Jobs are partitioned according to five ranges of actual job runtime and five ranges of requested nodes, as shown. These graphs demonstrate a trend observed in most months: (1) FCFS-backfill tends to provide poor performance for wide jobs (N > 32), even if they are short (T ≤ 1 hour); (2) LXF-backfill significantly improves short-wide jobs (T ≤ 1 hour and N > 32), at a great cost to long jobs that also have a relatively large number of nodes (T > 8 hours and N > 8); (3) DDS/lxf/dynB improves the short-wide jobs of FCFS-backfill, but not so much as to sacrifice long-wide jobs the way LXF-backfill does.

Understandably, one might argue in favor of LXF-backfill for its prompt service to short jobs (≤ 1h). If favoring short jobs were the only objective, LXF-backfill would be sufficient. However, the key point here is that search-based policies can effectively optimize for the given two-level objective studied in this paper. If desirable, a target wait bound defined as a function of job runtime could be used in the objective to further improve short jobs; this is left to future work.

The conclusion is that DDS/lxf/dynB achieves a low average wait and slowdown similar to those of LXF-backfill, while having a low maximum wait similar to that of FCFS-backfill, for most months studied, and that the performance differences between policies are larger under a higher load.

[Figure 4. Performance comparisons under high load (ρ = 0.9); R* = T. DDS/lxf/dynB uses L = 1K for each month except 1/04, where L = 8K. Panels: (a) average wait (hr), (b) maximum wait (hr), (c) average bounded slowdown, (d) average queue length, (e) total E^98%_fcfs-bf, (f) total E^max_fcfs-bf, (g) number of jobs with E^max_fcfs-bf, (h) average E^max_fcfs-bf, for FCFS-backfill, LXF-backfill, and DDS/lxf/dynB in each month 6/03-3/04.]

[Figure 5. Average wait performance of different job classes (N × T) under each policy; R* = T; ρ = 0.9; July 2003. Panels: (a) FCFS-backfill, (b) LXF-backfill, (c) DDS/lxf/dynB (L = 1K); each panel plots the average wait (hr) by actual runtime class (10 minutes to 12 hours) and number of nodes (1 to 128).]

6.2. Impact of the number of nodes visited

As mentioned earlier, L, the number of nodes visited, has an impact on DDS/lxf/dynB for January 2004 under high load. Figure 6(a)-(d) plot the total excessive wait, maximum wait, average wait, and average bounded slowdown of DDS/lxf/dynB for 1/04 as a function of L, in the range of 1K to 100K; the two backfill policies are also included. As shown in Figure 6(a)-(b), DDS/lxf/dynB improves the total excessive wait and the maximum wait as L increases. With L = 100K nodes, DDS/lxf/dynB achieves the same maximum wait and total E^max_fcfs-bf as FCFS-backfill. The improvement comes at the cost of only a slight increase in the average wait and slowdown, which, although larger than those of LXF-backfill, are still much better than those of FCFS-backfill. The performance of DDS/lxf/dynB using L = 100K is perhaps the best achievable performance for 1/04 w.r.t. the given two-level objective.

[Figure 6. January 2004: impact of the number of nodes visited (L) on DDS/lxf/dynB under ρ = 0.9; R* = T. Panels: (a) total E^max_fcfs-bf, (b) maximum wait (hr), (c) average wait (hr), (d) average bounded slowdown, each plotted against L = 1K-100K, with FCFS-backfill and LXF-backfill shown for comparison.]

[Figure 7. Effect of the search algorithms and branching heuristics; R* = T; ρ = 0.9; L = 2K. Panels: (a) average bounded slowdown and (b) total E^max_fcfs-bf, for DDS/fcfs/dynB, DDS/lxf/dynB, and LDS/lxf/dynB in each month 6/03-3/04.]

6.3. Comparisons of search algorithms

This section compares the search algorithms (DDS vs. LDS) and branching heuristics (lxf vs. fcfs). Figure 7(a) and (b) compare, respectively, the average bounded slowdown and the total excessive wait of DDS/lxf/dynB against DDS/fcfs/dynB and LDS/lxf/dynB, using L = 2K, which shows the trend observed over the range of L simulated. First, with the fcfs branching heuristic, DDS/fcfs/dynB exhibits behavior similar to FCFS-backfill, i.e., a poor average bounded slowdown in most months, showing that lxf branching is better than fcfs for DDS w.r.t. our objective; the results are similar for LDS (not shown). Second, LDS/lxf/dynB has a worse total excessive wait in 1/04 but a (slightly) lower average bounded slowdown in each month, showing that LDS/lxf/dynB follows the lxf branching more closely than DDS/lxf/dynB does, as expected. The results also suggest that lxf is not a good enough branching heuristic either; otherwise, LDS/lxf/dynB would have found a similar, if not lower, total excessive wait than DDS/lxf/dynB. Thus, for the objective studied, (1) neither the lxf nor the fcfs branching heuristic is good enough, but lxf is the better of the two for both DDS and LDS; (2) DDS is somewhat more effective than LDS; and (3) for the range of L studied (1K-100K), the branching heuristic is a more dominant factor than the search algorithm (DDS vs. LDS) in the effectiveness of the search.

6.4. Impact of inaccurate requested runtimes

So far, we have assumed that the schedulers use actual job runtimes to make scheduling decisions. In this section, we study to what extent inaccurate requested job runtimes impact the relative policy performance. For this experiment, we use DDS/lxf/dynB with L = 4K in all months. Figure 8 compares DDS/lxf/dynB and the two baseline backfill policies using the user-provided requested job runtimes (i.e., R* = R). Figure 8(a)-(c) plot the average wait, maximum wait, and average slowdown for each policy. The results are still qualitatively similar to those using R* = T (Figure 4), except that the performance differences between policies are somewhat smaller now that the inaccurate requested runtimes are used.

[Figure 8. Performance using inaccurate requested runtimes; R* = R; ρ = 0.9; L = 4K. Panels: (a) average wait (hr), (b) maximum wait (hr), (c) average bounded slowdown, (d) total E^max_fcfs-bf, for FCFS-backfill, LXF-backfill, and DDS/lxf/dynB in each month 6/03-3/04.]

7. Conclusions

In this study, we formulated two conflicting performance goals of job scheduling for parallel computer systems as a hierarchical two-level objective: (1) minimizing the total excessive wait time; (2) minimizing the average slowdown. To optimize for this objective, we applied two well-known complete search algorithms, DDS and LDS, to design scheduling policies. We compared our new search-based policies against FCFS-backfill and LXF-backfill, which provide rough lower bounds on the maximum wait and the average slowdown, respectively. Policies were evaluated by simulation using ten monthly workloads that ran on the NCSA/IA-64 cluster. Both the original-load and artificially created high-load (ρ = 0.9) workloads were studied. A wide range of policy performance measures was used, including the average performance measures, the maximum wait, and several total excessive wait measures.

Our results are encouraging. In particular, we showed that DDS/lxf/dynB, the best search-based scheduling policy studied in this paper, simultaneously beats LXF-backfill and FCFS-backfill, achieving the best maximum wait as well as the best average slowdown and average wait for almost all workloads studied. This is impressive because there typically exists a trade-off between optimizing these average measures and the maximum wait among traditional priority-based policies.

The ability to declaratively specify performance objectives for parallel computer systems and to automatically optimize for those objectives through search should increase parallel computer performance while decreasing administrator effort and error. The work reported here represents a strong step in that direction. Possible directions for future work include further improving scheduling efficiency by developing more intelligent search algorithms, possibly with branch-and-bound heuristics for pruning the search tree; applying job runtime prediction techniques to improve the accuracy of the estimated runtimes used for scheduling; and incorporating special priorities and fair share into the scheduling objective.

References

[1] S.-H. Chiang, A. Arpaci-Dusseau, and M. K. Vernon. The impact of more accurate requested runtimes on production job scheduling performance. In Proc. 8th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Edinburgh, Scotland, July 2002. Springer Verlag, LNCS 2537.
[2] S.-H. Chiang and C. Fu. Re-evaluating reservation policies for backfill scheduling on parallel systems. In Proc. 16th IASTED Int'l Conf. on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2004.
[3] S.-H. Chiang and C. Fu. Benefit of limited time-sharing in the presence of very large parallel jobs. In Proc. 19th IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS), Denver, April 2005.
[4] S.-H. Chiang and M. K. Vernon. Production job scheduling for parallel shared memory systems. In Proc. 15th IEEE IPDPS, San Francisco, April 2001.
[5] J. M. Crawford. Solving satisfiability problems using a combination of systematic and local search. In 2nd DIMACS Challenge: Cliques, Coloring, and Satisfiability, Rutgers Univ., NJ, 1993.
[6] J. M. Crawford. An approach to resource constrained project scheduling. In Proc. 1996 Artificial Intelligence and Manufacturing Research Planning Workshop, 1996.
[7] W. D. Harvey and M. L. Ginsberg. Limited discrepancy search. In Proc. 14th Int'l Joint Conf. on Artificial Intelligence, Montreal, Canada, August 1995.
[8] R. E. Korf. Improved limited discrepancy search. In Proc. 13th National Conf. on Artificial Intelligence (AAAI-96), pages 209-215, Portland, OR, August 1996.
[9] MOAB. http://www.supercluster.org/moab.
[10] A. W. Mu'alem and D. G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel and Distributed Syst., 12(6):529-543, June 2001.
[11] PBS Scheduler. http://www.nas.nasa.gov/Software/PBS/.
[12] Platform LSF. http://www.platform.com/products/LSFfamily/.
[13] E. Shmueli and D. G. Feitelson. Backfilling with lookahead to optimize the performance of parallel job scheduling. In Proc. 9th Workshop on JSSPP, Seattle, June 2003. Springer Verlag, LNCS 2862.
[14] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Selective reservation strategies for backfill job scheduling. In Proc. 8th Workshop on JSSPP, Edinburgh, Scotland, July 2002. Springer Verlag, LNCS 2537.
[15] D. Talby and D. G. Feitelson. Supporting priorities and improving utilization of the IBM SP2 scheduler using slack-based backfilling. In Proc. 13th Int'l Parallel Processing Symp., pages 513-517, San Juan, April 1999.
[16] D. Talby and D. G. Feitelson. Improving and stabilizing parallel computer performance using adaptive backfilling. In Proc. 19th IEEE IPDPS, Denver, April 2005.
[17] T. Walsh. Depth-bounded discrepancy search. In Proc. 15th Int'l Joint Conf. on Artificial Intelligence, Nagoya, Japan, 1997.
[18] W. A. Ward, Jr., C. L. Mahood, and J. E. West. Scheduling jobs on parallel systems using a relaxed backfill strategy. In Proc. 8th Workshop on JSSPP, Edinburgh, Scotland, July 2002. Springer Verlag, LNCS 2537.