SCHEDULING OF REGULAR AND IRREGULAR COMPUTATIONS: A System and Applications

Tao Yang
Department of Computer Science
University of California
Santa Barbara, CA 93106
[email protected]

Apostolos Gerasoulis, Jia Jiao
Department of Computer Science
Rutgers University
New Brunswick, NJ 08903
{gerasoulis, [email protected]

September 16, 1995

Abstract

We study the performance of task graph scheduling for both regular and irregular parallel computations. We present a scheduling tool named PYRROS, which takes as input a task graph together with compile-time estimates of computation and communication weights, and produces a schedule and parallel code based on that schedule. We analyze the effect of compile-time estimation errors on run-time scheduling performance: for coarse grain task graphs, small errors at compile time have a minimal impact on run-time performance. We use PYRROS to produce parallel code for Gauss-Jordan (GJ) and Gaussian Elimination (GE), sparse matrix computation, and the fast multipole method for irregular n-body simulations. The performance of GJ and GE on dense matrices is comparable to optimized hand-written codes. For irregular problems such as n-body simulation, the scheduling overhead can become significant, but because of the iterative nature of such problems the overhead can be amortized over many iterations, and the overall performance with automatic scheduling is better than that of optimized manually derived schedules, which incur no overhead.

1 Introduction

There are two fundamental problems in automatic program parallelization: first, program partitioning and parallelism detection at the granularity level of the parallel machine, e.g. [8, 16, 42, 43]; and second, the efficient execution of the detected parallelism. We focus on scheduling a class of parallel computations modeled as directed acyclic task graphs (DAGs). Task graph scheduling can effectively balance computational loads and reduce unnecessary communication. Finding optimal scheduling solutions is possible only for a small class of task graphs, and in general the problem is NP-hard. Many scheduling heuristic algorithms have been proposed in the literature, e.g. [5, 9, 15, 20, 28, 41, 44, 34].

The work presented here was in part supported by ARPA contract DABT-63-93-C-0064 under the "Hypercomputing and Design" project, by the Office of Naval Research under grant N00014-93-1-0944, and by NSF RIA CCR-9409695. The content of the information herein does not necessarily reflect the position of the Government, and official endorsement should not be inferred.


However, only a few automatic scheduling systems have been developed. SCHEDULER by Dongarra and Sorensen [13] uses a centralized dynamic scheduling scheme. HYPERTOOL by Wu and Gajski [44] and TASKGRAPHER by El-Rewini and Lewis [15] use compile-time scheduling algorithms but do not produce code for parallel machines. We present a programming tool called PYRROS which schedules tasks and produces parallel code for distributed memory architectures. The distinguishing feature of PYRROS is its low-complexity scheduling algorithms, which are competitive with existing higher-complexity algorithms. Low complexity is necessary when applying scheduling to practical scientific problems.

We address two important issues: how well compile-time scheduling performs for scientific applications, and what effect compile-time estimation errors in the weights have on the actual run-time execution of the schedule. We consider two classes of scientific applications: regular computations such as Gaussian Elimination and Gauss-Jordan for dense matrices, and irregular computations such as sparse matrix Cholesky decomposition and n-body simulation using the fast multipole method.

Scheduling becomes beneficial when the overhead remains small compared to the total amount of computation, which is the case for coarse grain task graphs or for iterative problems where the scheduling overhead can be amortized over many iterations. For regular task graphs, schedules with good performance can be derived manually without overhead, but it is much harder to do the same for irregular task graphs. For example, manually optimized schedules for Gaussian elimination based on block and wrap mapping perform as well as the PYRROS scheduling algorithms, but PYRROS is better than block and wrap mapping for the irregular n-body computation even when scheduling overhead is included. Using the PYRROS algorithms, other researchers have demonstrated performance improvements of up to 75% over schedules based on block and wrap mapping for sparse matrix computation with fine grain partitions; the overhead for fine grain computation is high, however, and needs to be amortized over many iterations [6]. These results clearly demonstrate that scheduling provides benefits in practice as long as the overhead is kept low.

The paper is organized as follows. Section 2 discusses the macro-dataflow task graph model for representing parallel computations and the organization of PYRROS. Section 3 summarizes the PYRROS scheduling algorithms. Section 4 discusses the run-time execution of a static schedule and its performance. Section 5 presents experiments with PYRROS for Gaussian and Gauss-Jordan elimination. Section 6 discusses the scheduling performance and overhead for n-body simulations. Section 7 compares the performance of the PYRROS algorithms with the well-known, higher-complexity ETF scheduling algorithm.

2 The Task Graph Model and the PYRROS System

A directed acyclic task graph (DAG) is defined by a tuple G = (V, E, C, T), where V is the set of task nodes and v = |V| is the number of nodes, E is the set of communication edges and e = |E| is the number of edges, C is the set of edge communication costs, and T is the set of node computation costs. The value c_{i,j} ∈ C is the communication cost incurred along the edge e_{i,j} = (n_i, n_j) ∈ E, which is zero if both nodes are mapped to the same processor.

The value τ_i ∈ T is the execution time of node n_i ∈ V. PRED(n_x) is the set of immediate predecessors of n_x and SUCC(n_x) is the set of immediate successors of n_x.

A task is an indivisible unit of computation, which may be an assignment statement, a subroutine or even an entire program. We assume that tasks are convex, which means that once a task starts its execution it runs to completion without interruption for communication, Sarkar [41]. In this task computation model, a task waits to receive all data in parallel before it starts its execution, and as soon as it completes its execution it sends the output data to all successors in parallel.
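As a rough illustration of this receive-compute-send discipline, the sketch below (ours, not part of PYRROS) shows how a single convex task might be executed on its assigned processor; recv_input and send_output stand in for the underlying message-passing layer and are assumed placeholders.

    /* Hedged sketch of the macro-dataflow execution of one convex task.
       recv_input/send_output are placeholder communication routines,
       not PYRROS primitives. */
    void recv_input(int from_task, void *buf);
    void send_output(int to_task, const void *buf);

    void execute_task(int task_id,
                      const int *pred, int npred,   /* PRED(n_x) */
                      const int *succ, int nsucc,   /* SUCC(n_x) */
                      void (*body)(void *), void *data)
    {
        /* 1. Receive all input data from immediate predecessors. */
        for (int i = 0; i < npred; i++)
            recv_input(pred[i], data);

        /* 2. Run to completion without interruption (convexity assumption). */
        body(data);

        /* 3. Send output data to all immediate successors. */
        for (int i = 0; i < nsucc; i++)
            send_output(succ[i], data);
    }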

Scheduling is defined by a processor assignment mapping, PA(n_j), of the tasks onto the p processors and by a starting time mapping, ST(n_j), of all nodes onto the set of positive real numbers. CT(n_j) = ST(n_j) + τ_j is defined as the completion time of task n_j in this schedule.
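To make these definitions concrete, the following minimal C sketch shows one possible in-memory representation of a weighted DAG and of a schedule; the structure names and fields are our own illustration, not PYRROS data structures.

    /* Minimal sketch of a weighted task graph and a schedule (illustrative only).
       Node i has computation cost tau[i]; edge (i,j) has communication cost
       c[i][j], which is taken as 0 when both ends are on the same processor. */
    typedef struct {
        int      v;      /* number of task nodes                               */
        double  *tau;    /* tau[i]: execution time of node n_i                 */
        double **c;      /* c[i][j]: cost of edge (n_i, n_j), or -1 if absent  */
    } TaskGraph;

    typedef struct {
        int    *PA;      /* PA[i]: processor assigned to node n_i              */
        double *ST;      /* ST[i]: starting time of node n_i                   */
    } Schedule;

    /* Completion time CT(n_i) = ST(n_i) + tau_i. */
    double completion_time(const TaskGraph *g, const Schedule *s, int i)
    {
        return s->ST[i] + g->tau[i];
    }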

Figure 1(a) shows a weighted DAG with all computation weights assumed to be equal to 1. Figure 1(b) shows a processor assignment using 2 processors. Figure 1(c) shows a Gantt chart of a schedule for this DAG. The Gantt chart completely describes the schedule since it defines both PA(n_j) and ST(n_j).


Figure 1: (a) A DAG with node weights equal to 1. (b) A processor assignment of nodes. (c) The Gantt chart of a schedule.

The PYRROS system is a parallel programming tool for scheduling task graphs and generating parallel code that executes task schedules. The organization of PYRROS is shown in Figure 2. The current tool has the following components: a task graph language, with an interface to C and Fortran, allowing users to define partitioned programs and data; a scheduling system for clustering, load balancing, physical mapping, and communication/computation ordering; an X-window graphic displayer for visualizing task graphs and scheduling results; and a code generator that inserts synchronization primitives and performs code optimization for the nCUBE-2, Meiko CS-2 and Intel parallel machines.

The input of PYRROS is a weighted task graph and the associated sequential C or Fortran code. For example, Figure 3 shows the block column LU factorization algorithm, in which an n × n matrix is divided into N block columns and each block column consists of N submatrices of size r × r, where r = n/N is assumed to be an integer. The dependence structure is shown in Figure 4. When partial pivoting is introduced, task T_{k,k} performs the additional operation of finding the pivot element in each column, which is local to that task. The row exchange information due to pivoting is used in the updating operation of T_{k,j}, and the dependence structure remains the same as in Figure 4.

[Figure 2 depicts the PYRROS pipeline: a C/Fortran program with dependence information is fed through the task graph language parser to the program scheduler (clustering, mapping to processors) and the code generator (mapping data/program, optimizing communication/memory, inserting primitives), producing NX/2 code for the Meiko CS-2 and iPSC/860 and nCUBE code for the nCUBE-2; an X-window interface provides graph and schedule displayers.]

Figure 2: The system organization of PYRROS.

    for k = 1 to N
        T_{k,k}: { Factorize A_{k,k} as L_k * U_k
                   Compute L_k^{-1} and U_k^{-1}
                   for i = k+1 to N
                       A_{i,k} = A_{i,k} * U_k^{-1}
                   end }
        for j = k+1 to N
            T_{k,j}: { A_{k,j} = L_k^{-1} * A_{k,j}
                       for i = k+1 to N
                           A_{i,j} = A_{i,j} - A_{i,k} * A_{k,j}
                       end }
        end
    end

Figure 3: Block column LU factorization and its task partitioning.
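For concreteness, a sequential C sketch of the computation in Figure 3 is given below. It only illustrates the block operations each task performs; the r × r block kernels (factor_block, invert_lower, and so on) and the BLK indexing macro are assumed helpers of our own, not part of PYRROS or of the paper, and block indices are 0-based here.

    #include <stdlib.h>

    /* Hypothetical r x r block kernels -- assumed, not defined in the paper. */
    void factor_block(double *Akk, double *Lk, double *Uk, int r);    /* Akk = Lk * Uk  */
    void invert_lower(const double *Lk, double *Linv, int r);         /* Linv = Lk^{-1} */
    void invert_upper(const double *Uk, double *Uinv, int r);         /* Uinv = Uk^{-1} */
    void mul_right(double *A, const double *B, int r);                /* A = A * B      */
    void mul_left(const double *B, double *A, int r);                 /* A = B * A      */
    void sub_mul(double *C, const double *A, const double *B, int r); /* C = C - A * B  */

    /* BLK selects the r x r block in block row i, block column j of an N x N block matrix. */
    #define BLK(A, i, j, N, r) ((A) + (((i) * (N) + (j)) * (r) * (r)))

    void block_column_lu(double *A, int N, int r)
    {
        double *Lk   = malloc(sizeof(double) * r * r);
        double *Uk   = malloc(sizeof(double) * r * r);
        double *Linv = malloc(sizeof(double) * r * r);
        double *Uinv = malloc(sizeof(double) * r * r);

        for (int k = 0; k < N; k++) {
            /* Task T_{k,k}: factor the diagonal block and scale block column k. */
            factor_block(BLK(A, k, k, N, r), Lk, Uk, r);
            invert_lower(Lk, Linv, r);
            invert_upper(Uk, Uinv, r);
            for (int i = k + 1; i < N; i++)
                mul_right(BLK(A, i, k, N, r), Uinv, r);     /* A_{i,k} = A_{i,k} * U_k^{-1} */

            /* Tasks T_{k,j}: update the remaining block columns. */
            for (int j = k + 1; j < N; j++) {
                mul_left(Linv, BLK(A, k, j, N, r), r);      /* A_{k,j} = L_k^{-1} * A_{k,j} */
                for (int i = k + 1; i < N; i++)
                    sub_mul(BLK(A, i, j, N, r),
                            BLK(A, i, k, N, r),
                            BLK(A, k, j, N, r), r);         /* A_{i,j} -= A_{i,k} * A_{k,j} */
            }
        }
        free(Lk); free(Uk); free(Linv); free(Uinv);
    }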


Figure 4: The task graph of block LU factorization.

Techniques for automatically generating coarse grain task graphs are described in [8, 16]. The weights for the LU DAG are estimated as follows, assuming that ω is the cost of a multiplication and an addition. In Figure 3, task T_{k,k} computes L_k and U_k at a cost of r^3 ω / 3, finds L_k^{-1} and U_k^{-1} at a cost of r^2 ω, and then performs N − k matrix multiplications with a triangular matrix at a cost of (N − k) r^3 ω / 2. Thus the total cost of T_{k,k} is about (N − k) r^3 ω / 2, ignoring lower order terms. Similarly, the cost of task T_{k,j} is about (N − k) r^3 ω_1, since the matrix operation A_{i,j} = A_{i,j} − A_{i,k} * A_{k,j} contains an additional subtraction which costs about 0.5ω, so that ω_1 = 1.5ω. We use the standard linear model α + βm to estimate the communication cost of sending a message of size m, where α is the startup cost and β is the per-unit transmission cost. From T_{k,k} to T_{k,k+1}, the communication weight is α + β(N − k + 1)r^2, since L_k^{-1} and the A_{i,k} are communicated. The column partial pivoting cost is of lower order, O((N − k)r), and adds only a small amount to the cost of T_{k,k}.

In the PYRROS task graph language, each task declares its weight and its data accesses; the specification of this computation begins as follows:

    struct Dataitem colbk[n];
    struct Dataitem partcolbk[n];
    task P[k] {
        int bc;
        set_weight((n-k)/2*r*r*r);
        read(&colbk[k], n*4*r*r);
        task_kk(&colbk[k], k, &partcolbk[k]);
        for (bc = k+1; bc
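As a rough illustration of how these compile-time estimates could be computed when building the task graph, the helper functions below encode the formulas above; ω, α, and β are assumed machine parameters and the functions are our own, not PYRROS routines. With ω taken as the unit, the T_{k,k} estimate appears to correspond to the set_weight((n-k)/2*r*r*r) call in the fragment above.

    /* Hedged sketch: compile-time weight estimates for the block LU task graph,
       following the formulas in the text.  omega is the cost of one multiply-add,
       alpha the communication startup cost, beta the per-unit transmission cost. */
    double weight_Tkk(int N, int k, int r, double omega)
    {
        /* (N - k) * r^3 * omega / 2, ignoring lower order terms */
        return (N - k) * (double)r * r * r * omega / 2.0;
    }

    double weight_Tkj(int N, int k, int r, double omega)
    {
        /* (N - k) * r^3 * omega_1, with omega_1 = 1.5 * omega */
        return (N - k) * (double)r * r * r * 1.5 * omega;
    }

    double comm_weight_Tkk(int N, int k, int r, double alpha, double beta)
    {
        /* alpha + beta * (N - k + 1) * r^2: L_k^{-1} plus the A_{i,k} blocks */
        return alpha + beta * (N - k + 1) * (double)r * r;
    }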