Sequencing Asynchronous Pipeline Tasks

Technical Report CS96-497, CSE Dept., UCSD, November 1996

Val Donaldson and Jeanne Ferrante
Computer Science and Engineering Department
University of California, San Diego
La Jolla, California 92093-0114
{vdonalds,[email protected]

Abstract

Asynchronous pipelining is a form of parallelism in which processors execute different loop tasks, as opposed to different loop iterations. An asynchronous pipeline schedule for a loop is a generalization of a noniterative DAG schedule, and consists of a processor assignment (an assignment of loop tasks to processors) plus a task sequence for each processor (an order on instances of tasks assigned to a processor). This variant of pipeline parallelism is particularly relevant in distributed memory systems (since pipeline control may be distributed across processors), but may also be used in shared memory systems. Results for asynchronous pipelining may also provide insights into other forms of pipelining, such as software pipelining. We show that the problem of finding an optimal asynchronous pipeline schedule is NP-hard under three different definitions of optimality, even when the loop body is self-cyclic, i.e., has no multi-task dependence cycles. A number of pipeline scheduling subproblems are themselves NP-hard, but given a processor assignment, the problem of finding an asymptotically optimal task sequence for a self-cyclic loop can be solved in polynomial time. Self-cyclic loops represent a significant subset of loops which may contain cross-iteration dependences, I/O, and simple reductions. Assuming that a processor assignment for a loop is specified, we design a heuristic task sequencing algorithm which can be customized to trade off algorithm execution time against the quality of the resulting task sequence. The algorithm is asymptotically optimal for self-cyclic loops, and can be used as a component of a complete pipeline scheduling algorithm which generates schedules for self-cyclic loops that are guaranteed to be within 18% of optimal. For loops which contain multi-task cycles, the algorithm generates task sequences which average within 3.5-6.8% of optimal for a set of 10,800 random loops. The primary determinant of the execution time of the algorithm is the time required to determine the maximum cycle ratio of a loop, which in practice takes O(ve) time for a loop with v tasks and e dependence edges.

1 Introduction

Pipelining is an "assembly line" form of parallelism in which instances of subcomputations or tasks of a repeated computation, such as statements in a loop body, are executed concurrently. Pipeline parallelism exploits concurrency both within and across loop iterations, complementing other forms of parallelism. A computation which cannot be parallelized using doall parallelism [15], which assigns different loop iterations to different processors, or noniterative DAG (directed acyclic graph) parallelism [16, 18], may permit pipeline parallelization. Loops which contain I/O or reductions, for example, may be pipelined. In asynchronous pipelining [6], pipeline tasks are scheduled for execution as soon as processor and data resources are available. This form of pipelining has been studied in the context of digital signal processing algorithms [2, 11], as well as other contexts [17]. The goal of accommodating execution on distributed memory systems motivates a number of the assumptions made in asynchronous pipelining. Most notably, communication times between tasks may be nonzero, and pipeline control is local to each processor, to avoid the overhead of global control. Although asynchronous pipelining is particularly appropriate for distributed memory systems, it may also be used in shared memory architectures. Many of the assumptions made in

asynchronous pipelining also apply to software pipelining [1, 13], so results for asynchronous pipelining may provide insights into software pipelining issues as well. An asynchronous pipeline schedule is a generalization of a noniterative DAG schedule [18], and consists of a processor assignment (an assignment of loop tasks to processors) and a task sequence for each processor (an order on instances of tasks assigned to a processor). We show that the problem of finding an optimal asynchronous pipeline schedule is NP-hard, under three different definitions of optimality, even when the loop body is self-cyclic, i.e., the loop has no multi-task dependence cycles. We also show that a number of asynchronous pipeline scheduling subproblems, including processor assignment and task sequencing, are also NP-hard in most cases, but task sequencing under the most common definition of optimality can be solved in polynomial time for self-cyclic loops. Self-cyclic loops represent a significant subset of loops which may contain cross-iteration dependences, I/O, and simple reductions.

The task sequence component of a pipeline schedule can itself be broken into two subcomponents: a task order, which is an order on tasks (rather than task instances) assigned to a processor, and which gives the steady-state sequence of task instances on the processor; and an integer stage assignment, which controls initial and terminal task sequencing. Existing asynchronous pipeline scheduling heuristics [2, 9, 11, 17] first generate stage assignments for all tasks, and then determine processor assignments and task orders. For a given stage assignment, finding an optimal processor assignment and task order combination is NP-complete even for self-cyclic loops. The comparatively restrictive algorithms in [2, 11] (all stage assignments are effectively set to zero) give no theoretical performance guarantees. The algorithms in [9, 17] generate schedules which are guaranteed to be within 100% of optimal (i.e., a factor of 2 of optimal) when interprocessor communication times are zero, although this guarantee is with respect to optimal noniterative DAG scheduling for an unrolled loop, rather than optimal pipeline scheduling. When communication times are nonzero, the guarantee is weakened in proportion to the ratio of communication to computation times in a loop.

Our approach is to first determine a processor assignment for a loop, and then determine task sequences for processors. In this paper we address the task sequencing problem. Assuming that a processor assignment is given, we design a heuristic algorithm for solving the task sequencing problem, which can be customized to trade off algorithm execution time against the quality of the resulting pipeline schedule. The algorithm is optimal for self-cyclic loops, and can be used as a component of a complete pipeline scheduling algorithm which generates schedules which are guaranteed to be within 18% of optimal in this case. For loops which contain multi-task cycles, the simplest, fastest version of our algorithm generates schedules which average within 6.82% of optimal for a set of 10,800 random loops. With more computational effort, schedules may be found which average within 4.54% or even 3.45% of optimal. The primary determinant of the execution time of the task sequencing algorithm is the execution time of the component algorithm used for finding the maximum cycle ratio of a loop, which is approximately O(ve) for a loop with v tasks and e dependence edges.
Although our focus is on minimizing the execution time of a task sequence, other schedule aspects such as memory requirements could also be addressed. The paper is organized as follows. Section 2 summarizes the asynchronous pipelining framework from [6], which provides the foundation for both the theoretical results and algorithm development in the body of the paper. Section 3 discusses the computational complexity of asynchronous pipeline scheduling problems and subproblems, including task sequencing. Section 4 presents a single-pass "generic" task sequencing algorithm, which is the basis of subsequent "concrete" algorithms, and shows that it produces pipeline schedules which will not deadlock. Section 5 discusses sequencing for self-cyclic loops, and Section 6 considers sequencing for loops which contain arbitrary cycles. We summarize our contributions and conclusions in Section 7.

2 Asynchronous Pipelining

This section summarizes the notation, terminology, and primary results used for specifying and analyzing asynchronous pipeline schedules; see [6] for more details and discussion. A loop body may be modeled as a data dependence graph (DDG) [15].

Definition 1 A data dependence graph (DDG) is a weighted directed multigraph G = (V, E, f_vtime, f_etime, f_dist), where:

- V is a set of vertices or tasks. We use v to denote |V|.

- E is a multiset of e = |E| directed edges between tasks in V, where (X, Y) ∈ E denotes that task Y is data dependent on task X.

- f_vtime is a function from V to the nonnegative reals denoting the execution time of tasks. For a task X ∈ V, we use X.time to denote f_vtime(X).

- f_etime is a function from E to the nonnegative reals denoting the communication time of data from task X to task Y. We use (X, Y).time to denote f_etime((X, Y)). This definition of communication time abstracts away details such as communication startup times and the volume of data transmitted.

- f_dist is a function from E to the integers which gives the dependence distance of an edge. We use (X, Y).dist to denote f_dist((X, Y)). If (X, Y).dist = d, then iteration k + d of task Y is dependent on iteration k of task X. □

Tasks in Definition 1 may be of arbitrary size, from simple statements to complex compound statements or subroutine calls. Task execution times may be real, rather than integral or rational, and may vary from iteration to iteration, although for analysis and scheduling purposes we assume that there is an expected execution time for each task. Allowing task times to be zero rather than requiring that they be strictly positive allows dummy tasks to be added to a DDG to resolve scheduling problems. Dependence distances may be negative, which is particularly useful for representing pipeline scheduling constraints. In computational complexity discussions we assume that e = O(v). Figure 1a is an example loop from [9] with the corresponding DDG in Figure 1b.

Since a DDG is a directed multigraph, a number of graph-theoretical definitions are applicable. A (simple) cycle in a DDG is a closed path which starts and ends at the same task, such that each task is the source of exactly one edge in the cycle. A self-cycle is an edge from a task to itself. A strongly connected component (SCC) is a maximal set of tasks such that there is a path from any task in the set to any other task in the set. The level of an SCC is a nonnegative integer which is the maximum number of predecessor SCC's visited in any path in the DDG ending at any task in the given SCC. The level of a task is the level of the SCC containing the task. A DAG edge is an edge which is not contained in any cycle. A cycle edge is an edge which is contained in at least one cycle. An acyclic DDG is a DDG in which all edges are DAG edges. A task graph is an acyclic DDG in which all edges have dependence distances of zero. A self-cyclic DDG is a DDG in which all edges are DAG edges or self-cycle edges (this includes acyclic DDG's). A cyclic DDG is a DDG which contains one or more multi-task cycles, i.e., self-cycles are ignored in classifying a DDG as cyclic. This division of DDG's into cyclic and self-cyclic DDG's is significant, since loops which contain reductions or I/O are not acyclic but may be self-cyclic, and it is usually possible (and advantageous) to treat self-cyclic loops the same as acyclic loops.

Pipeline execution of a loop may be analyzed in terms of several quantities defined on DDG's. A DDG G satisfies the positive cycle constraint iff the sum of dependence distances in any cycle is positive. If G satisfies the positive cycle constraint, G.mcr, the maximum cycle ratio of G, is the maximum over all cycles c in G of the sum of all task execution and data communication times in c, divided by the sum of all dependence distances in c. If G is acyclic, G.mcr = 0. The maximum cycle ratio of a DDG can be found in polynomial time; discussions of a number of algorithms can be found in [3, 6, 17]. The theoretical complexity of Burns' primal-dual algorithm [3] is unknown, but empirically it runs in O(ve) time. The DDG in Figure 1b satisfies the positive cycle constraint, and has an iteration interval b = G.mcr = 1.5, from cycles such as (A, C, D).

There is an important relationship between the maximum cycle ratio of a DDG and a family of auxiliary graphs. For any DDG G and any nonnegative real value b*, define

(X, Y).weight_{b*} = X.time + (X, Y).time - b*·(X, Y).dist

Graph G_{b*} is the directed graph consisting of tasks and edges from G, with edge weights (X, Y).weight_{b*}. The weight of a path or cycle in G_{b*} is the sum of all edge weights in the path or cycle, where the null path from a task to itself has weight zero. If graph G_{b*} contains cycles, b* < G.mcr iff G_{b*} has positive cycles, and b* ≥ G.mcr iff G_{b*} does not contain positive cycles. Graph G_{b*} can be checked for positive cycles in O(ve) time by solving a single-source longest path problem [5]. This transforms the maximum cycle ratio problem into a longest path problem, and is the basis of several maximum cycle ratio algorithms.
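The positive-cycle check above can be turned directly into a (slow but simple) maximum cycle ratio procedure. The following is a minimal Python sketch, not Burns' primal-dual algorithm from [3]: it binary-searches on b*, using a Bellman-Ford-style longest-path relaxation to test whether G_{b*} contains a positive cycle. The dictionary/tuple encoding of the Figure 1b DDG is illustrative, not the paper's implementation.

```python
# Edges are (src, dst, communication_time, dependence_distance) tuples.
def has_positive_cycle(vtime, edges, b_star):
    """Longest-path relaxation on G_{b*}; True iff a positive-weight cycle exists."""
    longest = {x: 0.0 for x in vtime}            # null path from a task to itself has weight zero
    for _ in range(len(vtime)):                  # v relaxation rounds suffice for simple paths
        changed = False
        for (x, y, etime, dist) in edges:
            w = vtime[x] + etime - b_star * dist # (X,Y).weight_{b*}
            if longest[x] + w > longest[y] + 1e-12:
                longest[y] = longest[x] + w
                changed = True
        if not changed:
            return False
    return True                                  # still relaxing after v rounds => positive cycle

def max_cycle_ratio(vtime, edges, iters=60):
    # G.mcr is at most the sum of all task execution and communication times (distance sums >= 1).
    lo, hi = 0.0, sum(vtime.values()) + sum(e[2] for e in edges)
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if has_positive_cycle(vtime, edges, mid) else (lo, mid)
    return hi

# Figure 1b: task times A..E = 1,2,1,1,1; e.g., cycle (A,C,D) gives (1+1+1)/2 = 1.5.
vtime = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 1}
edges = [("A","B",0,0), ("A","C",0,0), ("C","D",0,0), ("B","D",0,1), ("B","E",0,0),
         ("D","A",0,2), ("E","B",0,2), ("E","C",0,2), ("E","E",0,1)]
print(round(max_cycle_ratio(vtime, edges), 3))   # ~1.5
```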

for i := 1 to n do
   A[i] := f1(A[i],D[i-2]);
   B[i] := f2(A[i],E[i-2]);
   C[i] := f3(A[i],E[i-2]);
   D[i] := f4(C[i],B[i-1]);
   E[i] := f5(E[i],B[i]);
endfor

Task        A     B     C     D     E
est_{1.5}   0     1     1     2     3
ect_{1.5}   1     3     2     3     4
lst_{1.5}   0     1     1     2     3
lct_{1.5}   1     3     2     3     4
moment      0.5   2.0   1.5   2.5   3.5
stage       0     1     1     1     2
offset      0.5   0.5   0     1     0.5

Figure 1: (a) An example loop from [9]. (b) The corresponding DDG G. Unbracketed numbers are task execution times and bracketed numbers are dependence distances; omitted dependence distances and data communication times are zero. G.mcr = 1.5. (c) Earliest/latest starting/completion times for G using b* = 1.5, plus values derived in Step 6 of the Base Algorithm. (d) Using moments and divisor b* = 1.5 in Step 6 of the Base Algorithm to generate stage assignments and offsets; stage 0 covers moments in [0, 1.5), stage 1 covers [1.5, 3), and stage 2 covers [3, 4.5). (e) The scheduled DDG G_s derived from the schedule ⟨A:0, E:2⟩, ⟨B:1⟩, and ⟨C:1, D:1⟩; dashed edges are scheduling edges. G_s.mcr = G_s.pmax = 2.



This fact may also be used to determine in O(ve) time if a DDG satisfies the positive cycle constraint. If DDG G does satisfy the positive cycle constraint, all cycles have sums of dependence distances of at least one, so G.mcr is at most the sum of all task execution and data communication times in G. Therefore, if b* is chosen to be the sum of all DDG component times, then G violates the positive cycle constraint iff G_{b*} contains a positive cycle.

Graph G_{b*} may also be used to define several quantities which are generalizations of the corresponding quantities used in noniterative DAG scheduling, which we will use in our task sequencing algorithm. Graph G^R_{b*} is the reverse weighted graph which is similar to G_{b*} except that the direction of all edges is reversed before edge weights are determined.

Definition 2 Let G be a DDG satisfying the positive cycle constraint, and let b* ≥ G.mcr. For each task X, the earliest/latest starting/completion times of X with respect to b*, and the critical path length of G_{b*}, are:

X.est_{b*} = the largest weight over all paths from any task in G_{b*} to X
X.ect_{b*} = X.est_{b*} + X.time
G_{b*}.cpl = max_X {X.ect_{b*}}
X.lct_{b*} = G_{b*}.cpl - (the largest weight over all paths from any task in G^R_{b*} to X)
X.lst_{b*} = X.lct_{b*} - X.time   □
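These quantities are again longest-path values, and can be checked against Figure 1c. The following minimal Python sketch (illustrative encoding, not the paper's implementation) computes them for the Figure 1 DDG with b* = 1.5, using one relaxation on G_{b*} for est/ect and one on the reverse-weighted graph G^R_{b*} for lct/lst:

```python
vtime = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 1}
edges = [("A","B",0,0), ("A","C",0,0), ("C","D",0,0), ("B","D",0,1), ("B","E",0,0),
         ("D","A",0,2), ("E","B",0,2), ("E","C",0,2), ("E","E",0,1)]
b_star = 1.5

def longest_to(vtime, edges):
    """Largest path weight ending at each task; the null path has weight zero."""
    dist = {x: 0.0 for x in vtime}
    for _ in range(len(vtime)):                    # enough rounds when there is no positive cycle
        for (x, y, etime, d) in edges:
            w = vtime[x] + etime - b_star * d      # edge weight from the source task of this edge
            dist[y] = max(dist[y], dist[x] + w)
    return dist

est = longest_to(vtime, edges)
ect = {x: est[x] + vtime[x] for x in vtime}
cpl = max(ect.values())
rev = longest_to(vtime, [(y, x, etime, d) for (x, y, etime, d) in edges])   # G^R_{b*}
lct = {x: cpl - rev[x] for x in vtime}
lst = {x: lct[x] - vtime[x] for x in vtime}
print(est, lct)   # est: A 0, B 1, C 1, D 2, E 3;  lct: A 1, B 3, C 2, D 3, E 4 (Figure 1c)
```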


for i := q.min_stage + 1 to q.max_stage + n do
   for each task X on q in task order do
      j := i - X.stage
      if j ≥ 1 and j ≤ n then
         • receive/read X's jth iteration input (if any)
         • execute jth iteration of X
         • send/write X's jth iteration output (if any)
      endif
   endfor
endfor

Figure 2: Pipeline control code for sequencing tasks for n iterations of a loop on a processor q, with iterations indexed from 1 to n.

These quantities can be determined for all tasks in G in O(ve) time by solving two single-source longest path problems, one each for G_{b*} and G^R_{b*}. As a special case, when the DDG is a task graph, these values are the well-known quantities defined for DAG's (the value of b* is irrelevant when all dependence distances are zero). In Figure 1b, G.mcr = 1.5, and for b* = 1.5 the earliest/latest starting/completion times for all tasks are shown in Figure 1c.

Asynchronous pipeline execution is a form of macro-dataflow execution [16]. Execution of a task instance is atomic. An instance of a task begins execution at the earliest time that (1) previously scheduled iterations of tasks assigned to the same processor, including previous iterations of the task itself, are complete; and (2) current iteration input data from DDG predecessors is available. Tasks are assigned statically to processors, and all instances of a task are executed on the same processor.

An asynchronous pipeline schedule for a DDG G on p processors consists of two components. The first component is a processor assignment, which is a function from tasks in G to the p processors. A processor which has multiple tasks assigned to it is a shared processor, and any task assigned to a shared processor is a sharing task. G.pmax is the maximum sum of task times assigned to any processor (shared or not). The second component is a local task sequence for each processor, which is an order on instances of tasks assigned to the processor. A local task sequence consists in turn of two subcomponents. A local task order is a total order on tasks assigned to a processor, and gives the steady-state sequence of task instances on the processor. A local stage assignment is a function from tasks assigned to a processor to the integers, which controls initial and terminal task sequencing on the processor. The stage assignment of a task X is X.stage. An asynchronous pipeline schedule for a DDG on p processors is thus a processor assignment for p processors, plus local task orders and stage assignments for each processor.

An asynchronous pipeline schedule can be specified as a set of p local task sequence specifications. The local task sequence specification q: ⟨X1:s1, X2:s2, ..., Xk:sk⟩ means that the k tasks X1, X2, ..., Xk are assigned in that order to processor q, X1.stage = s1, and similarly for the remaining tasks. A task sequence specification for processor q can be used to sequence tasks on q by executing the pipeline control code in Figure 2 at runtime. The control code makes repeated passes in task order through the tasks assigned to the processor, and uses the stage assignments to determine whether the next instance of each task should be executed in the current pass. As an example, the task sequence specification ⟨A:0, B:3, C:1⟩ generates the sequence A AC AC ABC ABC ... ABC ABC BC B B.

A schedule is implemented at runtime in terms of local components, but scheduling algorithms such as those in [9, 17] and the task sequencing algorithms below generate "global" schedules. A global task order is a total order on all tasks in a DDG, and a global stage assignment is a stage assignment for all tasks. A pipeline schedule can also be defined as a processor assignment, a global task order, and a global stage assignment.
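For concreteness, the Figure 2 control code is easy to simulate. The following minimal Python sketch (illustrative names, not from the paper) reproduces the example sequence for ⟨A:0, B:3, C:1⟩ above:

```python
# spec is a task sequence specification: an ordered list of (task, stage) pairs for one processor.
def pipeline_sequence(spec, n):
    min_stage = min(s for _, s in spec)
    max_stage = max(s for _, s in spec)
    passes = []
    for i in range(min_stage + 1, max_stage + n + 1):
        executed = ""
        for task, stage in spec:            # repeated passes in task order
            j = i - stage                   # iteration of this task due in pass i
            if 1 <= j <= n:                 # skip iterations outside 1..n
                executed += task            # "execute jth iteration of X"
        passes.append(executed)
    return passes

print(" ".join(pipeline_sequence([("A", 0), ("B", 3), ("C", 1)], n=6)))
# A AC AC ABC ABC ABC BC B B
```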
The global task sequence specification ⟨X1:s1, X2:s2, ..., Xv:sv⟩ for a DDG with v tasks, together with processor assignments for all v tasks, is equivalent to a set of p local task sequence specifications. The local task sequence specifications can be generated by using the processor assignment to project the global task sequence specification onto p local task sequence specifications, which retain the same task orders and stage assignments as in the global task sequence specification. Although the process is nontrivial, it is also possible to generate a global task sequence specification from a set of local task sequence specifications. We may therefore blur the distinction between local and global task orders, stage assignments, and task sequence specifications, and drop the "local" and "global" qualifiers in cases where the distinction is irrelevant or available from context.

Scheduling edges may be added to a DDG to represent scheduling information for each local task sequence specification ⟨X1:s1, X2:s2, ..., Xk:sk⟩ as follows. For each i ∈ [1, k-1], a scheduling edge (Xi, Xi+1) is created, with (Xi, Xi+1).dist = si - si+1. A scheduling edge (Xk, X1) is also created, with (Xk, X1).dist = sk - s1 + 1. All scheduling edges are given communication times of zero. The set of scheduling edges for a processor forms a simple cycle called a scheduling cycle, which has a sum of dependence distances of 1. For nonsharing tasks, the scheduling cycle is a self-cycle edge with dependence distance 1. A scheduled DDG is a DDG in which each task is contained in exactly one scheduling cycle. Figure 1e is an example of a scheduled DDG.

The first of two fundamental issues which must be addressed in asynchronous pipelining is the possibility that a schedule might deadlock. When a scheduled DDG G has p scheduling cycles, the phrase "execution of G on p processors" implies the obvious processor assignment, and also implies that instances of tasks sharing the same processor are executed in the unique sequence generated by executing the control code in Figure 2 with any task sequence specification which produces the corresponding scheduling cycle. Theorem 1 may be used to determine if a pipeline schedule for a DDG will deadlock, and is also the key to proving that the task sequencing algorithm in Section 4 always produces schedules which do not deadlock.

Theorem 1 [6] Let G be a scheduled DDG with p scheduling cycles. Then pipeline execution of G on p processors is free of deadlock iff G satisfies the positive cycle constraint. □

If a schedule will not deadlock, the second fundamental issue is to determine its execution time. If G is a scheduled DDG, then G.time_n is the execution time of n iterations of G using the schedule implicit in G. We can define constants a and b such that G.time_n ≤ a + bn for all n ≥ 1, and G.time_n = a + bn for at least one value of n, as follows.

Definition 3 Let G be a DDG satisfying the positive cycle constraint. Then the iteration interval b and startup time a for pipeline execution of G are

b = lim_{n→∞} G.time_n / n   and   a = max{G.time_n - bn | n ≥ 1}   □

The iteration interval b of a pipeline schedule is the average time between completion of successive iterations of the DDG using the schedule, and is the primary measure of the performance of a pipeline schedule. Theorem 2 is the key to determining the iteration interval of a pipeline schedule.

Theorem 2 [6] Let G be a scheduled DDG with p scheduling cycles, satisfying the positive cycle constraint. Then the iteration interval b for pipeline execution of G on p processors is G.mcr. □

The startup time a is a secondary performance measure, which will assume a differentiating role in our complexity results. If G is a scheduled DDG with G.mcr = b, then G_b.cpl - b is an upper bound on a.

i

i

i

k

i

i

k

i

k

k

n

n

n

n

n

n

b

3 Computational Complexity

Gasperoni and Schwiegelshohn [9] sketch a proof that one form of the asynchronous pipeline scheduling problem (the \asymptotic" form below) is NP-hard. In this section we show that asynchronous pipeline scheduling is NP-hard under three di erent de nitions of optimality, even when a loop is self-cyclic. We also show that a number of scheduling subproblems involving combinations of processor assignment, task ordering, and stage assignment are similarly NP-hard. In Section 5 we will show that one subproblem has a polynomial time algorithm. 6

The three forms of optimality we consider are asymptotic, xed-count, and variable-count optimality, which are di erentiated by the assumptions made regarding optimality of the startup time a. Asymptotic optimality requires only the iteration interval of a schedule to be optimal, ignoring the e ect of the startup time. This is usually reasonable when the loop is executed for a relatively large number of iterations. Asymptotic optimality is the simplest form of optimality, and appears to be the only form considered in prior work. Fixed-count optimality requires a schedule to be optimal for some speci ed number of iterations n, e ectively requiring optimality of a combination of the iteration interval and startup time \customized" for n iterations. This might appear to be an instance of noniterative DAG scheduling (after unrolling the loop n times), but recall that all instances of a task must execute on the same processor using the control code in Figure 2. Variable-count optimality is a direct generalization of asymptotic optimality, where the startup time a as well as the iteration interval b must be optimal. The asynchronous pipeline scheduling (decision) problem under these three variants of optimality are de ned as follows.

 Asynchronous Pipeline Scheduling with Asymptotic Optimality Instance: DDG G, number of processors p 2 Z + , and an iteration interval b 2 R+0 . Question: Is there a p-processor asynchronous pipeline schedule for G which has an iteration interval of b or less, i.e., is there a schedule s for G such G :mcr  b?  Asynchronous Pipeline Scheduling with Fixed-Count Optimality Instance: DDG G, number of processors p 2 Z + , number of iterations n 2 Z + , and a deadline D in s

R+0 .

Question: Is there a p-processor asynchronous pipeline schedule s for G such that G :time  D?  Asynchronous Pipeline Scheduling with Variable-Count Optimality Instance: DDG G, number of processors p 2 Z , an iteration interval b 2 R , and a startup time a 2 R. Question: Is there a p-processor asynchronous pipeline schedule for G which has an iteration interval of b or less and a startup time of a or less, i.e., is there a schedule s for G such that G :mcr  b, and for any integer n 2 Z , G :time  a + bn? s

+

+

n

+ 0

s

s

n

Since an asynchronous pipeline schedule consists of three components: a processor assignment, a task order, and a stage assignment, we can also de ne six scheduling subproblems, where one or two of the three components are speci ed, and the remaining component(s) must be determined. We are particularly interested in three of these six subproblems. The kernel scheduling problem assumes that a stage assignment is given, and a processor assignment and a task order must be found. The scheduling algorithms in [2, 9, 11, 17] determine the stage assignment rst, and then generate the other two schedule components. The algorithms in [9, 17] solve the latter two subproblems simultaneously using a noniterative DAG scheduling algorithm on an acyclic subgraph of the DDG, which in [17] is called a \kernel graph." Our approach is to rst nd a processor assignment, and then solve the task sequencing problem, which are the other two scheduling subproblems of primary interest. Note that in contrast to the full scheduling problem, even if the DDG used as input to some of these subproblems satis es the positive cycle constraint and will therefore not deadlock when executed without processor sharing, when one or two schedule subcomponents are speci ed it may be the case that any choice of the remaining schedule components will result in a schedule which deadlocks. The following de nitions of these three scheduling subproblems are stated in terms of asymptotic optimality.

 Asynchronous Pipeline Kernel Scheduling with Asymptotic Optimality Instance: DDG G, number of processors p 2 Z + , a stage assignment for G, and an iteration interval + b 2 R0 . Question: Is there an assignment of tasks in G to p processors and a task order for G, such that the schedule consisting of the processor assignment, the task order, and the speci ed stage assignment has an iteration interval of b or less? 7

Problem Scheduling Kernel Scheduling Processor Assignment Task Sequencing

General DDG's Asymptotic Fixed NP-complete NP-hard NP-complete NP-hard NP-complete NP-hard NP-complete NP-hard

Self-Cyclic DDG's Variable Asymptotic Fixed Variable NP-hard NP-complete NP-hard NP-hard NP-hard NP-complete NP-hard NP-hard NP-hard NP-complete NP-hard NP-hard NP-hard O(v + e) NP-hard NP-hard

Table 1: Summary of computational complexity results for asynchronous pipeline scheduling.

 Asynchronous Pipeline Processor Assignment with Asymptotic Optimality Instance: DDG G, number of processors p 2 Z + , a task order and a stage assignment for G, and an + iteration interval b 2 R0 . Question: Is there an assignment of tasks in G to p processors such that the schedule consisting of the processor assignment and the speci ed task order and stage assignments has an iteration interval of b or less?

 Asynchronous Pipeline Task Sequencing with Asymptotic Optimality Instance: DDG G, number of processors p 2 Z + , an assignment of tasks in G to p processors, and + an iteration interval b 2 R0 . Question: Is there a task sequence speci cation such that the schedule consisting of the task sequence speci cation and the speci ed processor assignment has an iteration interval of b or less?

All of these computational problems may also be de ned for input DDG's which are restricted to be selfcyclic. Considering all combinations, there are four basic problems: scheduling, kernel scheduling, processor assignment, and task sequencing; three forms of optimality: asymptotic, xed-count, and variable-count; and two forms of graphs: general, unrestricted DDG's, and self-cyclic DDG's. This results in a total of 24 computational problems. Our results on the complexity of these problems are summarized in Table 1. The 23 problems which are NP-hard (or NP-complete) are considered next, and the lone problem with a polynomial time algorithm is considered in Section 5 (Corollary 8). We will use two reference problems in our proofs. The rst is the well-known multiprocessor scheduling problem [8], where a set of independent tasks (tasks with no intertask dependences) are assigned to processors with the goal of balancing the load on each processor. This problem is strongly NP-complete (a problem is strongly NP-complete if it is NP-complete regardless of the magnitude of numerical values in a problem instance [8]).

 Multiprocessor Scheduling Instance: Set V of tasks, number p 2 Z + of processors, for each task X 2 V an execution time X:time 2 Z + , and a deadline D 2 Z + . Question: Is there a partition V = V1 [ V2 [  [ V of V into disjoint sets such that for all i 2 [1; p], P 2 X:time  D? Theorem 3 The asynchronous pipeline scheduling, kernel scheduling, and processor assignment problems p

X

Vi

are strongly NP-hard under asymptotic, xed-count, and variable-count optimality, even when the DDG is self-cyclic. Further, the problems which assume asymptotic optimality are strongly NP-complete.

Proof We will prove the NP-hardness claim for the asynchronous pipeline scheduling problem under asymptotic optimality for general DDG's using a reduction from the multiprocessor scheduling problem, and then argue that the same reduction applies to the other problems as well. An instance of the multiprocessor scheduling problem is a special case of the asynchronous pipeline scheduling problem. The set of tasks from an instance of the multiprocessor scheduling problem can be 8

viewed as a DDG G with no dependence edges. With no dependence edges, all task sequences for a set of tasks assigned to the same processor execute in the same amount of time for any iteration count n: exactly n times the sum of the execution times of the tasks assigned to the processor. The iteration interval of a pipeline schedule will be the largest sum of task times on any processor. Therefore, the iteration interval of a pipeline schedule will be D or less i one iteration executes in time D or less, which corresponds directly to a partition of the tasks in the original multiprocessor scheduling instance, so asynchronous pipeline scheduling with asymptotic optimality is NP-hard. Without dependence edges, all task sequences execute in the same amount of time, so even if the stage assignment is speci ed as in the kernel scheduling problem, or both the stage assignment and task order are speci ed as in the processor assignment problem, there is no real distinction between the asynchronous pipeline scheduling problem and these two problems in this reduction. Similarly, the startup time of any pipeline schedule will be zero, so any schedule which is optimal under one of the three forms of optimality is optimal under the others as well. Further, a DDG without edges is self-cyclic, so the 18 problems covered in the statement of the theorem are all NP-hard. By Theorem 2, the problem of nding the iteration interval of a given pipeline schedule can be reduced to the problem of nding the maximum cycle ratio of a DDG, which can be solved in polynomial time, so the problems which assume asymptotic optimality are in NP. We are not aware of any polynomial time algorithms for determining the exact execution time of n iterations of a pipeline schedule, or for exactly determining the startup time of a schedule, so it is an open question whether or not the remaining problems are in NP. (When processors are not shared, the exact execution time of n iterations of a pipeline schedule can be found by solving an integer linear program [14], so this special case of the problem is in NP.) 2 Our second reference problem, which we will call \noniterative task ordering," is from Hoogeveen et al. [12], and is essentially our asynchronous pipeline task sequencing problem restricted to execution of a single iteration of a task graph. Tasks are executed once in task order (so stage assignments are irrelevant, or may be assumed to be zero). Hoogeveen et al. only claim strong NP-hardness for the problem, but the execution time of a noniterative DAG schedule can be determined in O(v + e) time, so the problem is also strongly NP-complete. Recall that a task graph is an acyclic DDG in which all dependence distances are zero.

 Noniterative Task Ordering Instance: Task graph G with zero data communication times, number of processors p 2 Z + , an + assignment of tasks in G to p processors, and a deadline D 2 Z0 . Question: Is there a task order for G such that the noniterative DAG schedule consisting of the task order and the speci ed processor assignment has an execution time of D or less?

Theorem 4 The asynchronous pipeline task sequencing problem is strongly NP-hard under xed-count and variable-count optimality, and strongly NP-complete under asymptotic optimality.

Proof The result for xed-count optimality is immediate by choosing n = 1, but the following proof applies to all three forms of optimality. An initial task in a task graph is a task without predecessors, and a nal task is a task without successors. For an instance of the noniterative task ordering problem, assume without loss of generality that the task graph G has a unique initial task, and a unique nal task. (If G violates this assumption, add a new \superinitial" task to G with zero execution time, with dependence edges to all original initial tasks with zero data communication times, and similarly for a new \super nal" task.) Generate an instance of the pipeline task sequencing problem by adding a back edge from the nal task of G to the initial task, with a data communication time of zero and a dependence distance of one to get a DDG G0 . Because of this back edge, all tasks in one iteration of G0 must execute before any task in the next iteration may execute. Therefore, if we restrict the stage assignments of all tasks to be zero (or any other xed value), then to avoid deadlock we must choose a task order which is a topological sort of tasks in G. Further, this accounts for all valid task sequences, since the only other possibility is to \cyclically shift" the tasks in a topological task order, with a corresponding increment or decrement of the stage assignments of the tasks shifted past the beginning or end of the order; these shifted sequence speci cations are equivalent to an unshifted sequence speci cation ([6] discusses this \shifting" in more detail). Because of the back edge, the execution time of n iterations of any pipeline schedule for G0 is exactly n times the iteration interval 9

of the schedule (the startup time of the schedule will be zero), so the iteration interval for a schedule for G0 is exactly the execution time of the corresponding schedule for the noniterative task graph G, and the NP-hardness claim follows for all three forms of optimality. The NP-completeness claim follows as in the proof of Theorem 3. 2

Theorem 5 The asynchronous pipeline task sequencing problem for self-cyclic DDG's is strongly NP-hard under xed-count and variable-count optimality.

Proof Again the result for xed-count optimality is immediate by choosing n = 1, but the following proof

applies to both forms of optimality. Let G be the task graph of an instance of the noniterative task ordering problem, and assume without loss of generality that G has a unique nal task. De ne G:vtime to be the sum of the execution times of all tasks in G. Generate an instance of the pipeline task sequencing problem by (a) adding a new nal task F with F:time = G:vtime to G, with an edge from the original nal task to F which has a data communication time and a dependence distance of zero to get a DDG G0; and (b) if the noniterative task graph instance assumes that p processors are available, use p + 1 as the processor count for the pipeline task sequencing instance. If we retain the processor assignment for G for the corresponding tasks of G0 and assign task F to the \extra" processor, then execution of task F dominates pipeline execution under any task sequence. For any pipeline schedule s, the rst instance of F begins execution at time a, the time that the rst instance of the original nal task of G completes. For b = F:time = G:vtime, the rst instance of F nishes at time a + b. Because b  a, G :mcr = b, and for any iteration count n, G :time = a + bn. The execution time of any pipeline schedule for G0 therefore equals the execution time of the noniterative schedule plus bn, and the theorem follows. 2 Although we have focused explicitly on the computational complexity of kernel scheduling, processor assignment, and task sequencing, the three asynchronous pipeline scheduling subproblems of immediate interest, the reductions used in the theorems in this section also provide answers for some of the remaining subproblems. The reduction in the proof of Theorem 3 applies to the case where a task order is given, and processor and stage assignments must be determined (we do not have a name for this problem). The reductions in the proofs of Theorems 4 and 5 apply to the asynchronous pipeline task ordering problem, where processor and stage assignments are known, and a task order must be determined. The result of Corollary 8 in Section 5 below applies to the asynchronous pipeline stage assignment problem, where a processor assignment and a task order are known, and a stage assignment must be determined. This still leaves one variant of the task ordering problem, and ve variants of the stage assignment problem open. In the remainder of the paper the unquali ed term \optimal" will refer to asymptotic optimality. s

s

n

4 Generic Task Sequencing Algorithm

Because the task sequencing problem is NP-complete for general loops (under asymptotic optimality in particular), our goal is to design a heuristic algorithm which takes a loop and a processor assignment as input, and generates a task sequence such that the iteration interval of the resulting pipeline schedule, i.e., the maximum cycle ratio of the resulting scheduled DDG, is as small as possible. As a signi cant special case, the algorithm should be optimal for self-cyclic loops. Task sequencing can not change the cycle ratio of cycles from the original DDG G, which establishes a lower bound of G:mcr on the iteration interval of a schedule. Nor can task sequencing change the cycle ratio of scheduling cycles, since these are determined by the processor assignment, which establishes a second lower bound of G:pmax on the iteration interval. The goal of task sequencing is therefore to minimize the cycle ratios of mixed cycles in the scheduled DDG|cycles which contain both original DDG edges and scheduling edges. Our basic approach to task sequencing is to choose a positive divisor b , which is the \size" of pipeline stages, and a nonnegative moment for each task. In an approximate analogy to mechanics, X:moment is a summary \signature value" or \center of mass" for task X, which in the simplest case is a value between X:est  and X:lct  . The stage assignment for X is X:stage = bX:moment =bc, so that tasks whose moments lie within the same multiple of the divisor are in the same stage. The remainder or relative o set of a task's moment from the \lower bound" of its stage is used to determine the task's position in its task order. b

b

10

Figure 1d illustrates the generation of stage assignments and o sets for tasks using divisor b = 1:5 and task moments from Figure 1c; the derived stage assignments and o sets are shown in the nal rows of Figure 1c. There is a great deal of freedom available for choosing divisors and task moments. Figure 3 is the code for a single-pass Generic Task Sequencing Algorithm which leaves a number of actual algorithm parameters unspeci ed. After an informal discussion of the steps in the Generic Algorithm, we will prove that schedules generated by the algorithm will not deadlock. In Sections 5 and 6 we will discuss speci c choices at each step. In the following discussion we will use the loop in Figure 1 as an example, with tasks fA; E g, fB g, and fC; Dg assigned to three processors. Step 1 of the Generic Algorithm allows almost any DDG G0 to be used as the graph which is manipulated in the remainder of the algorithm, as long as G0 is topologically equivalent to the input DDG G, i.e., has the same tasks, edges, and dependence distances, with the possible exception of self-cycles. The rationale for this is that the characterization of deadlock in Theorem 1, which is the key to showing that pipeline schedules generated by the Generic Algorithm will not deadlock, is concerned solely with the sum of dependence distances in cycles. Task execution and data communication times are not relevant to deadlock characterization, although they are very relevant to nding a schedule with the smallest possible execution time. Self-cycles may be removed since the control code in Figure 2 always sequences instances of the same task in order. Deleting self-cycles often improves the generated schedule, and may also decrease the execution time of the task sequencing algorithm, but is technically optional rather than required. Note that if G is self-cyclic and self-cycles are deleted, then G0 is acyclic. Allowing tasks in the input DDG G to have execution times which are zero (from De nition 1) increases the options available for scheduling, but slightly complicates task sequencing. The requirement in Step 1d that all tasks of DDG G0 have strictly positive execution times is a simple technique for avoiding these complications. This constraint might be relaxed if compensating changes are made in other algorithm steps. In the example in Figure 1 we may choose G0 = G. In Step 2 any positive value b may be chosen as the divisor for dividing the DDG into stages, as long as b  G0:mcr . This ensures that the graph G0  will have no positive cycles, and the earliest/latest starting/completion times from De nition 2 determined in Step 3 will be well-de ned. Weights are chosen in Step 4 for combining these values to get a moment for each task which is somewhere between the earliest starting and the latest completion times of a task (before accounting for the choices made in Step 5). In Figure 1 we may choose b = G0 :mcr = 1:5 in Step 2 to get the task times in the upper part of Figure 1d, and choose west = wlct = 1=2 and wlst = wect = 0 in Step 4. As a general observation, choosing larger moments for tasks and/or a smaller divisor b will increase the number of stages into which the DDG is divided, which generally increases the sum of dependence distances in mixed cycles in the scheduled DDG, decreasing the iteration interval of the resulting schedule. 
When DDG G0 in the Generic Algorithm contains cycles, G0:mcr is a lower bound on b , but when G0 contains multiple SCC's, the bound only applies within SCC's, rather than across SCC's. We may therefore arbitrarily increase the moments of tasks at higher levels, relative to the moments of tasks at lower levels, as allowed in Step 5. The two types of increments chosen in Steps 5a and 5b perform slightly di erent functions. Moment increments are added directly to task moments and a ect both stage assignments and task orders, whereas stage increments a ect only stage assignments, allowing the two task sequence components to be partially decoupled. Task moments, stage assignments, and o sets are determined in Step 6 using information from earlier steps, and the task order is determined in Step 7 by sorting task o sets. In Figure 1, if we set all moment and stage increments to zero, then the moments, stage assignments, and o sets shown in Figure 1d are generated in Step 6. Ordering task A before task E in Step 7 (both have o sets of 0.5), the generated schedule is hA :0; E : 2i, hB : 1i, and hC :1; D :1i. The corresponding scheduled DDG G is shown in Figure 1e. The iteration interval of the schedule is b = G :mcr = 2. We will look at possible choices for each of Steps 1, 2, 4, 5a, and 5b in greater detail in subsequent sections, after proving that schedules generated by the Generic Algorithm will not deadlock. b

s

s

Theorem 6 Let G be a DDG satisfying the positive cycle constraint, with an assignment of tasks to processors. Then execution of the Generic Task Sequencing Algorithm on G produces a pipeline schedule which is free of deadlock. Proof Note rst that for any DDG G which satis es the positive cycle constraint: (a) all pipeline schedules 11

Generic Task Sequencing Algorithm Input

1. A DDG G which satis es the positive cycle constraint 2. An assignment of tasks in G to processors Output

1. A task order for each processor 2. Stage assignments for all tasks Method

1. Choose a topologically equivalent DDG G0 which has (a) The same tasks as G (b) The same edges as G, except that self-cycles are deleted (c) The same data dependence distances as G (d) Positive, but otherwise arbitrary task execution times (e) Arbitrary (nonnegative) data communication times 2. Choose a positive real divisor b  G0:mcr 3. Generate G0  and for each task X, nd X:est  , X:lst  , X:ect  , and X:lct  4. Choose nonnegative real weights west , wlst , wect , and wlct such that west + wlst + wect + wlct = 1 5. Let G01; G02; : : :; G0 be the `  1 SCC's of G0. Choose (a) Nonnegative real values G01 :moment increment ;G02:moment increment ;: : :;G0 :moment increment (b) Nonnegative integer values G01:stage increment ; G02:stage increment ; : : :; G0 :stage increment subject to the constraint that for any i; j 2 [1; `], i 6= j, if there is a path in G0 from G0 to G0 , then G0 :moment increment  G0 :moment increment and G0 :stage increment  G0 :stage increment 6. For each task X, where G0 is the SCC containing X, de ne (a) X:moment = G0 :moment increment +west (X:est  )+wlst (X:lst  )+wect (X:ect  )+wlct (X:lct  )   + G0 :stage increment (b) X:stage = moment    (c) X:o set = X:moment ? b moment  7. Task order on each processor is given by sorting X:o set times in ascending order b

b

b

b

b

`

`

`

i

j

i

i

j

j

i

b

i

b

X:

b

i

X:

b

Figure 3: Generic Task Sequencing Algorithm.

12

b

b

for G satisfy self-dependences, since the control code in Figure 2 executes the instances of any task in sequential order, regardless of the task's stage assignment and its position in its processor's task order; and (b) the characterization of deadlock in Theorem 1 is independent of task execution and edge communication times. Therefore, a schedule for either G or the alternate DDG G0 chosen in Step 1 is free of deadlock i the same schedule is also deadlock-free for the other DDG. DDG's G and G0 are identical in many ways, so that statements or discussion referring to one DDG often apply to the other as well (for example, both DDG's have the same tasks.) To simplify notation, we will use X:time to refer to the execution time of task X in G0, not G, and similarly for data communication times for edges. Let G0 be the scheduled DDG derived from the Generic Algorithm, and assume initially that all stage increments from Step 5b are zero, i.e., G01:stage increment =G02 :stage increment =    =G0 :stage increment = 0. With this restriction, we would like to nd a lower bound on the dependence distance of an edge (X; Y ) in G0 in terms of task moments and graph component times, which can be used to show that the sum of the dependence distances of edges in any simple cycle is positive. There are two cases to consider: either (X; Y ) is an edge of G0 , or (X; Y ) is a scheduling edge. s

`

s

Case 1 (X; Y ) is an edge of G0 . Because b  G0:mcr (Step 2), G0  has no positive weight cycles, so Y:est   X:est  + (X; Y ):weight  = X:est  + X:time + (X; Y ):time ? b (X; Y ):dist b

b

b

b

b

Rearranging terms and accounting for the fact that dependence distances are integral,     time + (X; Y ):time (X; Y ):dist  X:est ? Y:est + bX:  By similar reasoning it is also the case that   (X; Y ):dist  X:ect  ? Y:ect  + bY: time + (X; Y ):time    ? Y:lst  + X:time + (X; Y ):time X: lst (X; Y ):dist  b     (X; Y ):dist  X:lct ? Y:lct + bY: time + (X; Y ):time Let G0 and G0 be the SCC's containing X and Y , respectively. Since (X; Y ) is an edge of G0, either G0 = G0 , or G0 is an ancestor of G0 , so from Step 5a, G0 :moment increment  G0 :moment increment . Using the de nition of weights and moments from Steps 4 and 6, i

i

b

b

b

b

b

b

b

b

j

j

i

j

i

j

X:moment ? Y:moment = [G0 :moment increment + west (X:est  ) + wect (X:ect  ) + wlst (X:lst  ) + wlct (X:lct  )] ? [G0 :moment increment + west (Y:est  ) + wect (Y:ect  ) + wlst (Y:lst  ) + wlct (Y:lct  )]  west (X:est  ? Y:est  ) + wect (X:ect  ? Y:ect  ) + wlst (X:lst  ? Y:lst  ) + wlct (X:lct  ? Y:lct  )  maxfX:est  ? Y:est  ; X:ect  ? Y:ect  ; X:lst  ? Y:lst  ; X:lct  ? Y:lct  g b

b

since weights sum to one, so

b

b

b

b



b

b

b

b

b

b

b

b

b

b

b

j

b

b

b

b

i

b

b

X:moment ? Y:moment + minfX:time ; Y:time g + (X; Y ):time b and since minfX:time ; Y:time g > 0, for G0:vmin = min fX:time g,   + G0:vmin (X; Y ):dist  X:moment ? Y:moment b (X; Y ):dist 

X

13



b

Case 2 (X; Y ) is a scheduling edge. There are two subcases to consider, depending on the relative positions of X and Y in the task order for the processor they share.

Subcase 2.1 X is the immediate predecessor of Y in the task order. In this case X:o set  Y:o set (Step 7). Under the assumption that all stage increments are zero, from Step 6, X:o set = X:moment ? b (X:stage ), and similarly for task Y . Substituting these expressions for X:o set and Y:o set and rearranging terms, X:stage ? Y:stage  X:moment b? Y:moment By de nition (X; Y ):dist = X:stage ? Y:stage . Therefore, since dependence distances are integral,   X: moment ? Y: moment (X; Y ):dist  b Subcase 2.2 X is the last task in the task order, and Y is the rst (X = Y for a nonshared processor).

Since all task o sets are less than b (Step 6), X:o set < Y:o set + b , so using reasoning similar to that in Subcase 2.1, X:stage ? Y:stage + 1 > X:moment b? Y:moment For this subcase (X; Y ):dist = X:stage ? Y:stage + 1, and again   X: moment ? Y: moment (X; Y ):dist  b so this inequality holds for all scheduling edges. Given these lower bounds on the dependence distance of any edge, for k  1 let c = (X1 ; X2 ; : : :; X ) be any simple cycle in G0 . If c consists solely of scheduling edges then c must be a scheduling cycle for some processor, so the sum of dependence distances in c is one, which satis es the positive cycle constraint. Otherwise, c must contain at least one edge from G0. For any edge (X; Y ) in G0 , de ne (X; Y ): = G0:vmin if (X; Y ) is an edge of G0, and (X; Y ): = 0 if (X; Y ) is a scheduling edge. Since c contains at least one edge from G0, the  value for this edge is G0:vmin , which from Step 1d is positive. For any real values x and y, dxe + dye  dx + ye, so the sum of dependence distances in c is k

s

s

?1 X

k

(X ; X +1 ):dist + (X ; X1):dist i

i=1



i

k

?1  X :moment X i

k

i=1

 =

&P

k

 1

k

i

i

b

i

k

b

k

?1(X :moment ? X +1 :moment + (X ; X +1 ):) + X :moment ? X1 :moment + (X ; X1): ' b

i=1

&P

? X +1 :moment + (X ; X +1 ):  +  X :moment ? X1 :moment + (X ; X1 ): 

i

?1(X ; X

i=1

i

i

): + (X ; X1): b

i+1

i

i

k

k

'

k

We can conclude that if all stage increments in Step 5b are zero, G0 satis es the positive cycle constraint. To complete the proof, we want to show that all cycles still satisfy the positive cycle constraint if nonzero stage increments are allowed. Let c be any simple cycle in the scheduled DDG derived by applying the Generic Algorithm to G assuming that stage increments are zero, and let c0 be the corresponding cycle s

14

in the scheduled DDG derived using the same algorithm choices except that nonzero stage increments are allowed. We have already shown that cycle c satis es the positive cycle constraint; we now want to show that cycle c0 does as well. There are two cases to consider, depending on whether or not c contains tasks from multiple SCC's. Note that in any case, the dependence distances of original edges from G0 are constant. Only the dependence distances of scheduling edges may change, due to changes in stage assignments.

Case 1 All tasks in cycle c are contained in one SCC G0 of G0. In this case, if the dependence distance of a scheduling edge (X; Y ) in c is (X; Y ):dist = X:stage ? Y:stage , then the dependence distance of (X; Y ) in i

c0 is

(X:stage + G0 :stage increment ) ? (Y:stage + G0 :stage increment ) = X:stage ? Y:stage i

i

The dependence distance of any edge in c0 therefore equals the dependence distance of the corresponding edge in c, and c0 satis es the positive cycle constraint.

Case 2 Cycle c contains tasks from two or more SCC's. As in Case 1, the dependence distances of edges

local to an SCC are constant, so we need only focus on edges between distinct SCC's. There are two subcases to consider, depending on whether or not all edges between SCC's are scheduling edges.

Subcase 2.1 All edges in c between SCC's of G0 are scheduling edges. In this case, any edge in c entering

an SCC G0 can be uniquely paired with the next edge in the simple cycle which leaves G0 . In cycle c0, G0 :stage increment is subtracted from the dependence distance of the incoming edge and is added to the dependence distance of the outgoing edge. Since c is a cycle, any edge between SCC's will assume the role of an incoming edge which is paired with an outgoing edge exactly once, and will also assume the role of an outgoing edge which is paired with an incoming edge exactly once. Therefore, there is no net change in the sum of dependence distances in c0 relative to the sum in c, and c0 satis es the positive cycle constraint. i

i

i

Subcase 2.2 Cycle c contains at least one DAG edge. Since DAG edges alone cannot form cycles, c must also contain at least one scheduling edge between SCC's. This case is similar to Subcase 2.1, except that the edge with which a scheduling edge is paired need not necessarily be adjacent to the same SCC. A scheduling edge in c entering an SCC G′_i is paired with the next scheduling edge in c which is directed out of an SCC G′_j, possibly crossing one or more SCC's along the way (a scheduling edge might be paired with itself). If the next edge in c which leaves G′_i is a scheduling edge, then G′_i = G′_j, and G′_i:stage_increment is subtracted from the dependence distance of the edge in c′ entering G′_i and is added to the dependence distance of the outgoing edge, with no net effect on the sum of dependence distances in c′. But if the next edge in c which leaves G′_i is a DAG edge, G′_i ≠ G′_j. In this case, there is a path from G′_i to G′_j whose inter-SCC edges are DAG edges from G′, so from Step 5b, G′_j:stage_increment ≥ G′_i:stage_increment. Since G′_i:stage_increment is subtracted from the dependence distance of a scheduling edge, and G′_j:stage_increment is added to the dependence distance of an edge, the net effect is that the sum of dependence distances in c′ may increase relative to the sum in c but may not decrease, so c′ satisfies the positive cycle constraint.

All cycles in the scheduled DDG derived from the schedule generated by the Generic Algorithm therefore satisfy the positive cycle constraint, and the theorem follows by applying Theorem 1. □

The primary goal of the remainder of the paper is to design a practical task sequencing algorithm based on the Generic Algorithm. In the following sections we will consider in detail specific choices for the various steps in the Generic Algorithm, as well as other options for improving the quality of generated schedules.

5 Task Sequencing for Self-Cyclic Loops

Our first step toward a fully specified task sequencing algorithm based on the Generic Algorithm is to focus on the role of DAG edges in task sequencing, which as a limiting case includes task sequencing for self-cyclic loops. The key observation regarding DAG edges is that if we have a deadlock-free schedule for a DDG, we can arbitrarily increase the stage assignment of the sink task of any DAG edge relative to the stage assignment of the source task of the edge without introducing deadlock. In the following theorem, there is no assumption that the input schedule was generated by the Generic Algorithm. Also note that if a DDG is a subgraph of another DDG, with the same tasks but containing only a subset of the edges of the original DDG, then any deadlock-free pipeline schedule for the full DDG is also a deadlock-free schedule for the subgraph DDG.

Theorem 7 Let G be a DDG satisfying the positive cycle constraint, and let Ĝ be the subgraph of G consisting of all tasks in G and all cycle edges in G. If there is a deadlock-free p-processor pipeline schedule s for G such that Ĝ_s:mcr = b, then there is also a p-processor schedule s′ for G such that Ĝ_s′:mcr = b.

Proof For p processors q_1, q_2, ..., q_p, let

q_1: ⟨X_{1,1} : X_{1,1}:stage, X_{1,2} : X_{1,2}:stage, ..., X_{1,k_1} : X_{1,k_1}:stage⟩
q_2: ⟨X_{2,1} : X_{2,1}:stage, X_{2,2} : X_{2,2}:stage, ..., X_{2,k_2} : X_{2,k_2}:stage⟩
...
q_p: ⟨X_{p,1} : X_{p,1}:stage, X_{p,2} : X_{p,2}:stage, ..., X_{p,k_p} : X_{p,k_p}:stage⟩

be a deadlock-free pipeline schedule s for G, such that the iteration interval of s applied to subgraph Ĝ is b = Ĝ_s:mcr. Define G:total_time = Σ_{X ∈ G} X:time + Σ_{(X,Y) ∈ G} (X,Y):time. Note that b = 0 iff G:total_time = 0, so if G:total_time = 0, the iteration interval of Ĝ_s is already zero. Therefore, assume that G:total_time > 0 and b > 0. For each SCC G_i of G, define

G_i:stage_increment = G_i:level · ⌈G:total_time / b⌉

For each task X, where G_i is the SCC containing X, define

X:stage′ = X:stage + G_i:stage_increment

and consider the pipeline schedule s′ for G, where s′ is

q_1: ⟨X_{1,1} : X_{1,1}:stage′, X_{1,2} : X_{1,2}:stage′, ..., X_{1,k_1} : X_{1,k_1}:stage′⟩
q_2: ⟨X_{2,1} : X_{2,1}:stage′, X_{2,2} : X_{2,2}:stage′, ..., X_{2,k_2} : X_{2,k_2}:stage′⟩
...
q_p: ⟨X_{p,1} : X_{p,1}:stage′, X_{p,2} : X_{p,2}:stage′, ..., X_{p,k_p} : X_{p,k_p}:stage′⟩

(schedules s and s′ are identical except for the stage assignments). We want to show that Ĝ_s′:mcr = b, which implies the weaker claim that schedule s′ is free of deadlock. Let c be any simple cycle in Ĝ_s, and let c′ be the corresponding cycle in Ĝ_s′. We want to show that cycle c′ has a cycle ratio of b or less. The proof has the same underlying structure as the part of the proof of Theorem 6 which accounts for nonzero stage increments. There are two cases to consider, depending on the composition of edges in c. Note that the dependence distances of edges from G are constant in any schedule. Only the dependence distances of scheduling edges may change, due to changes in stage assignments. Also note that if cycle c contains only cycle edges from G and scheduling edges (no DAG edges from G), then c is also a cycle in Ĝ_s.

Case 1 All tasks in cycle c are contained in one SCC G_i of G. In this case, the dependence distance of a scheduling edge (X,Y) in c is (X,Y):dist = X:stage − Y:stage, and the dependence distance of (X,Y) in c′ is

(X:stage + G_i:stage_increment) − (Y:stage + G_i:stage_increment) = X:stage − Y:stage

The dependence distance of any edge in c′ therefore equals the dependence distance of the corresponding edge in c. Since all edges in c are cycle edges in G or scheduling edges, c is also a cycle in Ĝ_s, so c′ has a cycle ratio of b or less.

Case 2 Cycle c contains tasks from two or more SCC's. As in Case 1, the dependence distances of edges local to an SCC are constant, so we need only focus on edges between distinct SCC's. There are two subcases to consider, depending on whether or not all edges between SCC's are scheduling edges.

Subcase 2.1 All edges in c between SCC's of G are scheduling edges. In this case, any edge in c entering an SCC G_i can be uniquely paired with the next edge in the simple cycle which leaves G_i. In cycle c′, G_i:stage_increment is subtracted from the dependence distance of the incoming edge and is added to the dependence distance of the outgoing edge. Since c is a cycle, any edge between SCC's will assume the role of an incoming edge which is paired with an outgoing edge exactly once, and will also assume the role of an outgoing edge which is paired with an incoming edge exactly once. Therefore, there is no net change in the sum of dependence distances in c′ relative to the sum in c. In this subcase, c is again a cycle in Ĝ_s, so c′ has a cycle ratio of b or less.

Subcase 2.2 Cycle c contains at least one DAG edge. Since DAG edges alone cannot form cycles, c must also contain at least one scheduling edge between SCC's. This case is similar to Subcase 2.1, except that the edge with which a scheduling edge is paired need not necessarily be adjacent to the same SCC. A scheduling edge in c entering an SCC G_i is paired with the next scheduling edge in c which is directed out of an SCC G_j, possibly crossing one or more SCC's along the way (a scheduling edge might be paired with itself). If the next edge in c which leaves G_i is a scheduling edge, then G_i = G_j, and G_i:stage_increment is subtracted from the dependence distance of the edge in c′ entering G_i and is added to the dependence distance of the outgoing edge, with no net effect on the sum of dependence distances in c′. But if the next edge in c which leaves G_i is a DAG edge, G_i ≠ G_j. In this case, G_i:level < G_j:level, so from the definition of stage increments, G_j:stage_increment is at least ⌈G:total_time / b⌉ larger than G_i:stage_increment, and this difference is added to the sum of dependence distances in c′, relative to the sum in c (the sum in c is at least one, since c satisfies the positive cycle constraint). Since the sum of task execution and data communication times in any cycle is at most G:total_time, the cycle ratio for c′ is at most

G:total_time / (1 + ⌈G:total_time / b⌉) < b
All cycles in G 0 therefore have cycle ratios of b or less. Further, there is a one-to-one correspondence between cycles in G and the subset of cycles in G 0 considered in Case 1 and Subcase 2.1. Therefore, at least one cycle c0 in Case 1 or Subcase 2.1 must have a cycle ratio of exactly b. 2 For a DDG G and a speci ed processor assignment, there are two fundamental lower bounds on the iteration interval of an optimal schedule for G. Assuming that data communication times between tasks sharing the same processor have been adjusted as necessary (typically zeroed), the rst lower bound is G:mcr , the maximum cycle ratio of the original unscheduled DDG. The second lower bound is G:pmax , the largest sum of task times on any processor. For self-cyclic DDG's, G:pmax  G:mcr , so any schedule with an iteration interval of G:pmax is asymptotically optimal. The following corollary to Theorem 7 shows that in contrast to the NP-hardness of the other complexity questions considered in Section 3, the asynchronous pipeline task sequencing problem for self-cyclic DDG's under asymptotic optimality has a polynomial solution. Corollary 8 Let G be a self-cyclic DDG, with an assignment of tasks to processors. Then a pipeline schedule for G can found in O(v + e) time which has an asymptotically optimal iteration interval b = G:pmax . Proof The Generic Algorithm can be applied to nd a deadlock-free schedule s for G in O(v + e) time as follows. 1. G0 := G minus self-cycles; set X:time = 1 for all tasks and (X; Y ):time = 0 for all edges s

s

s

17

2. Choose b = 1 3. X:est 1 = X:level (the other quantities need not be computed) 4. Choose west = 1; wect = wlst = wlct = 0 5. Set all moment and stage increments to zero 6. X:stage = X:level ; X:o set = 0 7. Since all task o sets are zero, any arbitrary order of tasks on processors is valid Since G0 is acyclic, all steps can be performed in O(v+e) time. Applying Theorem 7 to schedule s, subgraph G consists solely of tasks with no edges, so all cycles in scheduled DDG G are scheduling cycles and G :mcr = G:pmax . Stage increments in the proof of Theorem 7 can also be found in O(v + e) time. 2 Corollary 8 is an interesting theoretical result, but the startup times of schedules derived from the algorithm in the corollary are potentially very large, so the algorithm is unsatisfactory from a practical point of view. Although it is usually reasonable to focus on asymptotic rather than xed-count or variable-count optimality, we don't want to completely ignore schedule startup times. As a practical alternative, Theorem 7 may be interpreted as a guarantee that an asymptotically optimal schedule exists when stage increments are \large enough." Empirically, for schedules with the same iteration interval, schedules with smaller stage assignments do not necessarily have smaller startup times, but optimal startup times do correspond to relatively small stage assignments. We may therefore generate a sequence of schedules with increasing stage increments for tasks, until a schedule is found which is optimal. Checking to see if a schedule has an iteration interval of b (or less) is simpler than actually nding the iteration interval of the schedule. We need only check in O(ve) time if graph G  has any positive weight cycles. The Base Task Sequencing Algorithm in Figure 4 follows this approach not only for self-cyclic DDG's, but for all DDG's, and will also be the starting point for our discussion of task sequencing for cyclic DDG's in Section 6. The Base Algorithm is an instantiation of the Generic Algorithm, so the correctness of the Base Algorithm, i.e., the fact that schedules generated by the algorithm will not deadlock, follows from Theorem 6. When applied to self-cyclic DDG's, the Base Algorithm runs in O(ve) time, and produces asymptotically optimal schedules with reasonable startup times. For an input DDG which is self-cyclic, Step 1 retains the graph component times from G, except that tasks with zero execution times are given small positive execution times (the for loop only applies to cyclic DDG's). Step 2 sets the divisor b for dividing the DDG into stages to G0:pmax , which is the optimal iteration interval for G0 (and is positive). In Step 3 it is actually only necessary to nd the earliest starting and latest completion times for tasks, since Step 4 sets weights west = wlct = 1=2, and wect = wlst = 0. These weights are less important for self-cyclic DDG's than for general DDG's, but can be found in O(v + e) time rather than O(ve) time in this case. Step 5 sets all moment increments to zero. Step 6 stores the original stage value for each task X as X:base stage to facilitate the computation of nal stage values in Step 8. Step 7 uses o set values as the primary sort key, and breaks ties using moments as a secondary key, to generate an initial \provisional" schedule. These steps may be illustrated using the self-cyclic loop in Figure 5a, with tasks fA; Dg, fB g, and fC; E g assigned to three processors. 
The DDG for the loop is shown in Figure 5b, and Figure 5c is the acyclic alternate DDG G0, with G0:pmax = 100. Figure 5d lists earliest/latest starting/completion times for tasks, as well as the task moments, stage assignments, and o sets generated in Step 6. Figure 5e is the scheduled DDG corresponding to the schedule generated at the end of Step 7: hA : 0; D : 2i, hB : 0i, and hE : 2; C : 0i. Step 8 is the only step in the Base Algorithm which potentially performs multiple trials of a step from the Generic Algorithm. This step determines schedule times directly for the original DDG G, rather than for the alternate DDG G0 . The body of the while loop in Step 8 is only executed when the maximum SCC level is at least one, which is true i the DDG contains DAG edges, and generates a sequence of increasingly large stage increments for tasks as given by the oor expression. The pre x \sig" stands for \stage increment generation." The variable sig count is a positive real value which determines how quickly stage increments increase. For a large number of DDG's and algorithm parameters, we have never found a s

s

b

18

Base Task Sequencing Algorithm Input | A DDG G which satis es the positive cycle constraint, and a processor assignment for G Output | A task sequence for G, and the iteration interval b of the resulting schedule for G Method

1.

Algorithm Step 1 | Generate topologically equivalent DDG G0 */ 0 G := G minus self-cycles; tasks with execution time 0 are given execution time  > 0 in G0 for each cycle edge (X; Y ) in G0, where G0 is the SCC in G0 containing (X; Y ) do (X; Y ):time := (X; Y ):time + (G0:mcr =G0 :mcr ? 1)(X:time + (X; Y ):time ) /* Generic

i

i

endfor

2.

/* Generic Algorithm Step 2 | Choose divisor b */ if G0 is acyclic then b := G0:pmax else b := G0:mcr

3.

/* Generic

Algorithm Step 3 | Find earliest/latest starting/completion times for tasks */ Generate G0  ; nd G0  :cpl and for each task X nd X:est  , X:ect  , X:lst  , and X:lct  4. /* Generic Algorithm Step 4 | Choose task weights */ west = wlct = 21 ; wect = wlst = 0 5. /* Generic Algorithm Step 5a | Choose moment increments */ for each SCC G0 , 0  i < # SCC's, where G0 is the ith SCC in an increasing topological order do if G0 is acyclic then G0 :moment increment := 0 else G0 :moment increment := 14 (G0  :cpl )i / (# SCC's) b

b

b

i

b

b

b

i

i

endfor

6.

i

b

/* Generic Algorithm Step 6 | Find base stages for each task X, where X is in SCC G0i do

and o sets for tasks */

X:moment :=G0 :moment increment + west (X:est  ) + wect (X:ect  ) + wlst (X:lst  ) + wlct (X:lct  ) X:stage := X:base stage := bX:moment =b c X:o set := X:moment ? b (X:base stage ) b

i

b

b

endfor

7.

/* Generic

Algorithm Step 7 | Find task order */ Task order on each processor is given by sorting (X:o set ,X:moment ) times in ascending order 8. /* Generic Algorithm Step 5b | Find iteration interval b and nalize stages for tasks */ if G0 is acyclic then b := G:pmax else b := G :mcr , where G is G minus DAG edges if maximum SCC level > 0 then sig count := 4; sig bias := 21 ; sig multiplier := 0 sig delta := 1= minfmaximum SCC level; sig count g while G :mcr > b do sig multiplier := sig multiplier + sig delta for each task X do X:stage := X:base stage + bsig multiplier  X:level + sig bias c s

s

endfor endwhile endif

Figure 4: Base Task Sequencing Algorithm.

19

b

for i := 1 to n do
   read(A); B := f1(A); C := f2(A);
   D := f3(B,C); E := f4(D,E);
endfor

(a) source loop   (b) DDG G   (c) alternate DDG G′   [graph drawings omitted; task times A:20, B:100, C:90, D:80, E:10; edge times A→B:20, A→C:20, B→D:40, C→D:30, D→E:10]

(d) Values for G′ with b = G′:pmax = 100:

           A    B    C    D    E
  est_100   0   40   40  180  270
  ect_100  20  140  130  260  280
  lst_100   0   40   60  180  270
  lct_100  20  140  150  260  280
  moment   10   90   95  220  275
  stage     0    0    0    2    2
  offset   10   90   95   20   75

(e), (f) scheduled DDG's [graph drawings omitted]

Figure 5: (a) A self-cyclic loop with tasks {A, D}, {B}, and {C, E} assigned to three processors. (b) The corresponding DDG G. Unbracketed numbers are task execution and data communication times and bracketed numbers are dependence distances; omitted values are zero. (c) The topologically equivalent DDG G′ generated in Step 1 of the Base Algorithm by removing self-cycle edges. (d) Earliest/latest starting/completion times for G′ using b = G′:pmax = 100, and values derived in Step 6. (e) The scheduled DDG G_s for the schedule generated in Step 7: ⟨A:0, D:2⟩, ⟨B:0⟩, and ⟨E:2, C:0⟩. G_s:mcr = 110 from the loop (C, D, E). (f) The scheduled DDG G_s for the final generated schedule ⟨A:0, D:3⟩, ⟨B:0⟩, and ⟨E:3, C:0⟩. G_s:mcr = G_s:pmax = 100.
case requiring more than ⌈sig_count⌉ additional trials beyond the original schedule before an optimal schedule is found (hence the suffix "count"). This corresponds to at most doubling all original stage assignments, so it is natural to conjecture that this might be a theoretical bound, which could be used to improve the algorithm in Corollary 8. The actual number of stage increment generation trials required for a DDG will vary in response to both DDG characteristics and other algorithm parameters. We use sig_count = 4 in the remainder of the paper. Because stage assignments are integral rather than real, the variable sig_bias from the real interval [0, 1) is used to distribute the stage increments more uniformly among SCC levels. We use sig_bias = 1/2. Both empirical evidence and the examples in Figure 6 support this choice. The asterisks in the table in Figure 6b (sig_bias = 1/2) are more uniformly distributed across columns than in the table in Figure 6a (sig_bias = 0), which corresponds to a more uniform distribution of stage increments across SCC levels. The variables sig_multiplier and sig_delta are used to actually generate the stage increments inside the while loop. In the examples in Figure 6, sig_delta = 1/sig_count = 1/4, and sig_multiplier increases from 0 to 1 in increments
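The stage increment generation described above (the while loop of Step 8) can be sketched as follows. This is an illustrative sketch under assumed names: level[t] is a task's SCC level, base_stage holds the Step 6 stages, and mcr_exceeds is a caller-supplied test of whether the scheduled DDG's maximum cycle ratio still exceeds the target b (in practice a longest-path check, as discussed earlier).

import math

def finalize_stages(tasks, level, base_stage, max_scc_level, b, mcr_exceeds,
                    sig_count=4, sig_bias=0.5):
    """Step 8 sketch: bump stages in proportion to SCC level until the
    scheduled DDG's maximum cycle ratio no longer exceeds b."""
    stage = dict(base_stage)
    if max_scc_level == 0:
        return stage                       # no DAG edges, nothing to do
    sig_delta = 1.0 / min(max_scc_level, sig_count)
    sig_multiplier = 0.0
    while mcr_exceeds(stage, b):           # true while G_s:mcr > b
        sig_multiplier += sig_delta
        for t in tasks:
            stage[t] = base_stage[t] + math.floor(sig_multiplier * level[t] + sig_bias)
    return stage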

(a) Stage increments for sig_bias = 0:

                               SCC level
  sig_multiplier    0  1  2  3  4  5  6  7  8
        0           0  0  0  0  0  0  0  0  0
       1/4          0  0  0  0 *1  1  1  1 *2
       1/2          0  0 *1  1 *2  2 *3  3 *4
       3/4          0  0 *1 *2 *3  3 *4 *5 *6
        1           0 *1 *2 *3 *4 *5 *6 *7 *8

(b) Stage increments for sig_bias = 1/2:

                               SCC level
  sig_multiplier    0  1  2  3  4  5  6  7  8
        0           0  0  0  0  0  0  0  0  0
       1/4          0  0 *1  1  1  1 *2  2  2
       1/2          0 *1  1 *2  2 *3  3 *4  4
       3/4          0 *1 *2  2 *3 *4 *5  5 *6
        1           0 *1 *2 *3 *4 *5 *6 *7 *8

Figure 6: Two examples of stage increment generation parameters. The maximum SCC level = 8 and sig_count = 4 in both examples. Numeric entries indicate the stage increments added to tasks at each SCC level for five sig_multiplier values. Asterisks (*) mark adjacent values in each row which are different. The number of asterisks in a column is more uniformly distributed in (b) when sig_bias = 1/2 than in (a) when sig_bias = 0.
of 1/4. The initial schedule in Figure 5e has a suboptimal iteration interval of 110 from the mixed cycle (C, D, E), which contains two DAG edges. One iteration of the while loop in Step 8 increases the stage assignments of tasks D and E by one, producing the scheduled DDG in Figure 5f, which has an optimal iteration interval G_s:mcr = G:pmax = 100.

To get a complete O(ve) pipeline scheduling algorithm for self-cyclic loops we may combine the Base Algorithm with a multiprocessor scheduling algorithm which ignores the dependence edges in a DDG, and assigns tasks to processors with the simpler goal of balancing the execution load across processors (this was one of our reference problems in Section 3). Graham's "largest processing time first" (LPT) multiprocessor scheduling algorithm [10] has a performance guarantee of 33%; Coffman, Garey, and Johnson's MULTIFIT algorithm [4] has a 20% guarantee; and Friesen and Langston's improved MULTIFIT (MFI) algorithm [7] has an 18% guarantee. All three of these algorithms execute in O(v log v) time, although the constants involved increase as the guarantee improves. Although the asynchronous pipeline scheduling problem under asymptotic optimality restricted to self-cyclic DDG's is NP-complete, the performance guarantee of the component processor assignment algorithm carries over to the pipeline scheduling problem. Therefore, under the assumption that log v is O(e), this pipeline scheduling problem has a practical O(ve) time algorithm which is guaranteed to be within 18% of optimal. The O(v log v) "impractical" self-cyclic pipeline scheduling algorithm from Corollary 8 also inherits the 18% performance guarantee, as would an O(v log v) algorithm based on the conjecture that doubling the original stage assignments of tasks in the Base Algorithm produces optimal task sequences.
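As an illustration of the processor-assignment component referenced above, here is a minimal sketch of Graham's LPT rule (largest processing time first). The task representation is assumed for illustration; this is not the paper's implementation, and the MULTIFIT variants with better guarantees are not shown.

import heapq

def lpt_assignment(task_times, p):
    """Graham's LPT rule: consider tasks in decreasing execution-time order
    and place each on the currently least-loaded processor.
    task_times maps task name -> execution time."""
    loads = [(0.0, q) for q in range(p)]   # (current load, processor id)
    heapq.heapify(loads)
    assignment = {}
    for task, t in sorted(task_times.items(), key=lambda kv: -kv[1]):
        load, q = heapq.heappop(loads)
        assignment[task] = q
        heapq.heappush(loads, (load + t, q))
    return assignment

The largest resulting per-processor load is G:pmax for that assignment, which is the iteration-interval target the Base Algorithm works toward for self-cyclic loops.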

6 Task Sequencing for Cyclic Loops

The Base Algorithm in Figure 4 may also be used with cyclic DDG's, and is the recommended single-pass version of our algorithm. The \single-pass" designation refers to the steps of the algorithm before Step 8, since this step can be viewed as a post-optimization step for DAG edges (even multiple-pass algorithm variants need execute the while loop in Step 8 only once). In this section we will empirically evaluate the performance of the Base Algorithm on cyclic graphs. We will also consider algorithm variants which make alternate choices at various steps as allowed by the Generic Algorithm, variants which generate multiple candidate schedules for a DDG using multiple passes with di erent algorithm choices, and variants which relax the requirement that candidate schedules must not deadlock.

6.1 Base Algorithm

Executing the Base Algorithm for cyclic graphs is similar to execution for self-cyclic graphs. Two differences are the treatment of cycle edges in Step 1 and the choice of moment increments in Step 5, which are discussed in detail in Sections 6.2 and 6.5 below. Step 2 and the if statement in Step 8 which sets b each require a maximum cycle ratio calculation when the input DDG is cyclic. Therefore, the primary determinant of the computational complexity of the algorithm for cyclic DDG's is the complexity of the component maximum cycle ratio algorithm, which as discussed in Section 2 generally takes O(ve) time in practice. This also holds for the multiple-pass algorithms discussed in later sections, since there is always a constant bound on the number of passes. Note that since all task times are positive, b will always be positive in Step 2. Also note that an O(ve) time single-source longest path calculation still suffices for evaluating the while loop condition in Step 8; comparing G_s:mcr to b does not require a maximum cycle ratio calculation. Figure 1 illustrates execution of the Base Algorithm on a cyclic DDG.

We use a set of random DDG's to empirically evaluate the performance of the Base Algorithm and other task sequencing algorithm variants. The random DDG's are generated from a set of seven parameters representing different graph characteristics.

    

Number of tasks (4 classes): random integer value in 4{10, 11{50, 51{100, or 101{250 Task execution times (5 classes): 1, random integer value in 1{10, 90{100, 1{100, or 1{10,000 Maximum edge/task ratio (2 classes): random real value in 0.5{3.5 or 3.5{10 Maximum edge distance (3 classes): 1, 2, or random integer value in 3{10 Probability that an edge has a positive dependence distance (3 classes): random value in 5{20%, 20{50%,

or 50{100%

 Edge communication times (5 classes): 0, 1, random integer value in 1{10, 90{100, or 1{100  Number of processors (3 classes): random integer value in 2{4, 5{10, or 11{20 Two graphs are generated for each of the 4  5  2  3  3  5  3 = 5,400 combinations of these parameters for

a total of 10,800 graphs as follows. The edge/task ratio is used to generate a target edge count. This target edge count is reduced if necessary to (v2 ? v)=2, the number of edges in a complete directed graph, since no multigraphs are generated. Edges are then generated between randomly chosen task pairs. Each task is limited to a maximum in-degree of 10, and a maximum out-degree of 10. If 100 attempts are made with no successful new edge creation, edge insertion is terminated. The actual average edge/task ratios for DDG's generated in each of the two classes of the third parameter are 2.2 and 5.5 edges/task. No self-cycles are generated, as these do not a ect cyclic task sequencing. Any generated DDG which is not cyclic is discarded, and replacement DDG's are generated using the same parameters until a cyclic graph is generated. The number of processors is taken to be the minimum of the selected value and v ? 1, to guarantee that at least one processor is shared. For each generated DDG, tasks are assigned to processors using Graham's O(v logv) LPT multiprocessor scheduling algorithm [10]. Data communication times between tasks assigned to the same processor are zeroed. We use Burns primal-dual algorithm [3] for computing the maximum cycle ratio of a DDG, modi ed to use the solution of a longest path problem to provide the initial feasible solution when the DDG has negative dependence distances, or there is an upper bound on the maximum cycle ratio from earlier trials. We use a variant of the Bellman-Ford longest path algorithm [5] for longest paths. Statistics for execution of the Base Algorithm on the 10,800 cyclic DDG's are shown in Table 2. Each group of rows in the table summarizes algorithm execution for all 10,800 DDG's from the perspective of one parameter as listed in the rst column. Each row summarizes algorithm execution for the subset of DDG's corresponding to one class of the speci ed parameter, as indicated in the second column. The number of DDG's in each subset is given in the third column. The next three columns summarize the quality of the task sequences generated for the group. For each graph, bopt lb = maxfG:mcr ; G:pmax g is a lower bound on the iteration interval of a schedule, and (b=bopt lb ? 1)100 is an upper bound on the percent by which a schedule may be nonoptimal. Columns 4 and 5 list the arithmetic mean and the maximum of these values. Column 6 shows the percent of DDG's in each group which have iteration intervals which exceed the bopt lb bound. In columns 4{6, smaller values correspond to schedules with smaller average iteration intervals. The nal three columns list the mean, median, and maximum wall clock execution times for execution on an Intel Pentium Pro processor. 22

# (b=bopt lb ? 1)100 % Classes DDG's Mean Max b > bopt lb 4{10 2,700 1.58 99.3 8.7 11{50 2,700 5.53 87.4 33.1 51{100 2,700 9.34 85.4 47.3 101{250 2,700 10.83 92.6 48.0 Task Execution 1 2,160 2.92 66.7 20.5 Times 1{10 2,160 5.48 85.4 32.8 90{100 2,160 7.91 97.4 37.9 1{100 2,160 7.92 87.4 39.4 1{10000 2,160 9.89 99.3 40.9 Max Edge/Task 0.5{3.5 5,400 3.04 99.3 18.8 Ratios 3.5{10.0 5,400 10.61 92.6 49.8 Max Edge 1 3,600 7.43 99.3 37.3 Distances 2 3,600 6.85 97.4 35.1 3{10 3,600 6.19 87.4 30.4 Positive Distance 5{20% 3,600 7.30 86.3 36.1 Probabilities 20{50% 3,600 8.03 99.3 40.0 50{100% 3,600 5.14 85.4 26.7 Edge 0 2,160 9.30 99.3 37.5 Communication 1 2,160 8.70 87.4 38.6 Times 1{10 2,160 6.96 92.6 35.0 90{100 2,160 4.38 80.7 31.8 1{100 2,160 4.78 89.3 28.4 # Processors 2{4 3,600 6.54 99.3 36.7 5{10 3,600 8.02 86.3 36.5 11{20 3,600 5.90 92.6 29.6 Total 10,800 6.82 99.3 34.3 Parameter # Tasks/DDG

Execution Time (secs) Mean Med Max .0001 .0001 .0007 .0013 .0009 .0162 .0049 .0040 .0300 .0181 .0132 .1458 .0056 .0019 .1458 .0063 .0020 .1447 .0060 .0020 .1085 .0063 .0022 .1187 .0064 .0021 .1303 .0043 .0016 .0727 .0080 .0027 .1458 .0057 .0019 .1187 .0059 .0020 .1458 .0067 .0024 .1447 .0053 .0020 .0886 .0061 .0021 .1458 .0069 .0021 .1447 .0056 .0020 .0886 .0054 .0020 .0906 .0060 .0021 .1085 .0070 .0021 .1458 .0065 .0021 .1123 .0044 .0018 .0987 .0058 .0022 .1123 .0080 .0023 .1458 .0061 .0021 .1458

Table 2: Base Algorithm Statistics for 10,800 cyclic DDG's. Each group of rows summarizes algorithm execution for all 10,800 DDG's from the perspective of one parameter. Execution times are wall clock times on an Intel Pentium Pro processor. In the remainder of the paper, we will use the arithmetic mean of the expression (b=bopt lb ? 1)100 for the schedules generated for the 10,800 random DDG's as a summary value to compare task sequencing algorithm variants. From the last row of Table 2, this upper bound on the average percent by which schedules are nonoptimal is 6.82% for the Base Algorithm itself. In Step 7 of the Base Algorithm, when two or more tasks have the same o set, we use task moments as a secondary sort key, and order tasks with smaller moments before tasks with larger moments. If instead we use whatever arbitrary order our sorting algorithm generates, the resulting schedules average within 6.88% of optimal (a smaller average is better). If we use task moments as a secondary sort key, but order tasks with larger moments before tasks with smaller moments, schedules average within 6.89% of optimal. All algorithm variants in Sections 6.2{6.6 below are derived from the Base Algorithm by making alternative choices for just one step of the algorithm, and possibly by performing multiple trials using di erent choices for a speci c step. Section 6.7 considers simultaneous alternative choices for multiple steps.

6.2 Topologically Equivalent DDG's

Step 1 of the Generic Algorithm allows us to choose almost any DDG G0 as the graph which is manipulated by subsequent steps in the algorithm, as long as G0 is topologically equivalent to the input DDG G, i.e., has the same tasks, edges, and dependence distances, with the (optional) exception of self-cycles. Subject only 23

to the constraint that tasks must have positive execution times, task execution and data communication times in G0 may be arbitrary (and the restriction to positive task times might be relaxed if compensating changes are made in other steps). However, it seems reasonable to expect that using the component times from G as the component times for G0 would be the best choice for generating the schedule with the smallest iteration interval. Although this is often a reasonable assumption, there is at least one situation where it is advantageous to choose di erent component times. As noted in Section 4, choosing a smaller divisor b generally results in a schedule with a smaller iteration interval, but we must choose b  G0:mcr . When a DDG contains more than one SCC, if the maximum cycle ratio of one or more SCC's is less than the maximum cycle ratio of the entire DDG (i.e., less than the maximum cycle ratio of some other SCC), it would be advantageous to consider each SCC in isolation, using the maximumcycle ratio of each SCC as its divisor. Although we do not know how to do this directly, we can approximate this e ect by \scaling" the task execution and data communication times of any SCC G0 where G0 :mcr < G0:mcr . The obvious way to do this is to multiply all task execution and data communication times in G0 by G0 :mcr =G0 :mcr . Note that because all task times are positive, the divisor in this expression is positive, and the ratio is at least 1. Unfortunately, scaling task times in this manner also a ects DAG edges, since a task in an SCC may be the source of one or more DAG edges, as well as cycle edges. We can avoid this unwanted side e ect by adding the increase in both task execution and data communication times to cycle edges, as speci ed in the for loop in Step 1 of the Base Algorithm. Note that SCC scaling does not change G0 :mcr . From Table 2, the upper bound on the average percent by which schedules are nonoptimal is 6.82% for the Base Algorithm. Without modi cation of cycle edges in Step 1 of the Base Algorithm, generated schedules average within 6.86% of optimal (a smaller average is better). The minimal improvement in average schedule quality when cycle edges are modi ed is largely due to the fact that only 14% of the test DDG's have SCC's with maximum cycle ratios that are less than the maximum cycle ratio of the entire DDG. When restricted to this subset of DDG's, the average schedule is within 2.66% of optimal when cycle edges are modi ed, and within 2.92% of optimal when cycle edges are not modi ed. It is also possible to take the notion of scaling to increase the stage assignments of tasks one step further. For any SCC G0 , if we temporarily undirect the directed edges in G0 we can then partition the edges in G0 into biconnected components [5], which are subcomponents of G0 which have at most one task in common with each of the other subcomponents. The edges in each biconnected component can then be modi ed as in Step 1, but using the maximum cycle ratio of the biconnected component as the denominator of the scaling expression. Biconnected components can be found in O(v + e) time, but we did not pursue this algorithm modi cation. We may also generate multiple task sequences (i.e., multiple schedules) for a DDG, retaining a schedule with the smallest iteration interval as the nal schedule. 
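As a rough illustration of the SCC scaling applied to cycle edges in Step 1, the following sketch shows one way the edge-time adjustment could be coded. The graph fields (scc_of, scc_mcr, task_time, edge_time) are assumptions for illustration; this is not the authors' code.

def scale_scc_cycle_edges(cycle_edges, scc_of, scc_mcr, g_mcr, task_time, edge_time):
    """For each cycle edge (X, Y) in SCC i with scc_mcr[i] < g_mcr, add the
    growth of (X.time + (X,Y).time) under the factor g_mcr/scc_mcr[i] to the
    edge time, so that task times (and hence DAG edges) are left untouched."""
    new_edge_time = dict(edge_time)
    for (x, y) in cycle_edges:
        i = scc_of[x]                       # cycle edge, so scc_of[x] == scc_of[y]
        factor = g_mcr / scc_mcr[i] - 1.0   # >= 0 because g_mcr >= scc_mcr[i]
        new_edge_time[(x, y)] += factor * (task_time[x] + edge_time[(x, y)])
    return new_edge_time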
One way to generate multiple schedules in the context of SCC scaling is to determine G0:min scc mcr , the smallest maximum cycle ratio of any SCC in G0 (after self-cycles have been deleted, but before modifying cycle edges), and if G0 :min scc mcr < G0:mcr , use scaling factors for all SCC's evenly distributed over the interval [1; G0:mcr =G0:min scc mcr ], rounding scaling factors down to G0:mcr =G0 :mcr when the current scaling bound is greater than this value for a particular SCC G0 . For example, if G0 :mcr = 5 and G0:min scc mcr = 2, four trials would have maximum scaling factors of 1, 1.5, 2, and 2.5, and a particular SCC G0 with G0 :mcr = 1:7 would use scaling factors of 1, 1.5, 1.7, and 1.7. If G0 :min scc mcr = G0:mcr , then scaling does not apply, and only one trial is performed. When multiple trials are performed, if any trial generates a schedule which has an iteration interval bopt lb , which is therefore known to be optimal, algorithm execution terminates early without considering additional trials. Table 3 summarizes the average percent by which schedules may be nonoptimal for all 10,800 random test DDG's, using selected numbers of scaling trials from 1 to 100. As in the previous discussion, the fact that only 14% of the test DDG's have SCC's with maximum cycle ratios that are less than the maximum cycle ratio of the entire DDG limits the e ectiveness of this technique. Executing ten trials for just the DDG's where G0 :min scc mcr < G0 :mcr results in schedules which average within 2.07% of optimal, compared to 2.66% using a single trial. i

i

i

i

i

i

i

i

i

i

24

i

i

Maximum # Trials 1 2 3 4 5 6 7 8 9 10 20 50 100 (b=bopt lb ? 1)100 6.82 6.77 6.76 6.75 6.75 6.74 6.74 6.74 6.74 6.74 6.74 6.73 6.73 Table 3: Task sequencing statistics using selected numbers of SCC scaling trials. Maximum # Trials 1 2 3 4 5 6 7 8 9 10 20 50 100 (b=bopt lb ? 1)100 6.82 6.61 6.56 6.52 6.51 6.48 6.47 6.45 6.44 6.43 6.36 6.31 6.29 Table 4: Task sequencing statistics using selected numbers of divisors.

6.3 Divisors

Step 2 of the Generic Algorithm permits any positive divisor b to be chosen for dividing the DDG G0 into global stages, subject only to the constraint that b  G0 :mcr . (Note that b is always positive in the Base Algorithm, since all tasks have positive execution times.) The Base Algorithm sets b = G0:mcr for cyclic DDG's, since this maximizes the number of stages spanned by the resulting schedule, which generally decreases the resulting iteration interval. Note however, that when G0 :mcr < G0:pmax , it is impossible to generate a schedule with an iteration interval less than G0:pmax , so we might consider setting b = maxfG0:mcr ; G0:pmax g instead. 46% of the 10,800 test DDG's fall into this category. If the Base Algorithm is modi ed with this one change, the iteration intervals for the schedules generated for the random test DDG's average within 8.70% of optimal, rather than 6.82% for the unmodi ed Base Algorithm, which supports choosing b to be as small as possible. We can also consider a multiple-pass algorithm variant which uses di erent divisors in di erent trials. When G0:mcr  G0:pmax we just perform one trial with b = G0:mcr . But when G0 :mcr < G0:pmax , we may choose any number of divisors evenly distributed over the interval [G0:mcr ; G0:pmax ]. Table 4 summarizes execution of the Base Algorithm modi ed to use multiple trials with selected numbers of divisors when G0:mcr < G0:pmax (the statistics in the table are for all 10,800 DDG's).

6.4 Task Weights

Step 3 of the Generic Algorithm permits any choice of nonnegative real weights west , wect , wlst , and wlct which sum to 1. The four weights are not completely independent, since for any task X, X:est  +X:time = X:ect  , and X:lst  +X:time = X:lct  . For example, choosing weights west = wect = wlst = wlct = 1=4 is equivalent to choosing west = wlct = 1=2 and wect = wlst = 0, since b

b

b

b

1=4(X:est  ) + 1=4(X:ect  ) + 1=4(X:lst  ) + 1=4(X:lct  ) = 1=4(X:est  ) + 1=4(X:est  + X:time ) + 1=4(X:lct  ? X:time ) + 1=4(X:lct  ) = 1=2(X:est  ) + 1=2(X:lct  ) b

b

b

b

b

b

b

b

b

b

One consequence of the interdependence between weights is that any set of weights is equivalent to a set of weights in which at least one of the weights is zero. The second to last row of Table 5 shows statistics for execution of the Base Algorithm modi ed in Step 4 to use one of 24 weight sets, which are intended to represent a diverse choice of sets. The 24th entry in Table 5 uses random weights generated individually for each DDG. It is not surprising that the rst weight set, with west = wlct = 1=2 and wect = wlst = 0 generally produces schedules with smaller iteration intervals. The equivalence class of weight sets that this set represents is unique. It is the only set (class) which precisely balances the early and late component times of each task's moment. Particularly when X:est  < X:lst  for a task X, this tends to allow a task to \shift" to accommodate adjacent tasks in the scheduled DDG, without unnecessarily increasing the iteration interval of a schedule. This informal reasoning is supported by an examination of the results for other weight sets. b

b

25

Weight Set Index west wect wlst wlct (b=bopt lb ? 1)100 Cumulative Trials Weight Set Index west wect wlst wlct (b=bopt lb ? 1)100 Cumulative Trials

1 1/2 0 0 1/2 6.82 6.82 13 0 0 1/2 1/2 9.31 4.61

2 2/3 2/9 0 1/9 7.63 5.92 14 7/8 0 0 1/8 7.91 4.60

3 1/9 0 2/9 2/3 7.55 5.44 15 1/8 0 0 7/8 7.89 4.58

4 1/3 1/3 0 1/3 7.08 5.22 16 1 0 0 0 10.35 4.57

5 1/3 0 1/3 1/3 7.11 5.06 17 0 0 0 1 10.33 4.55

6 1/2 0 1/2 0 7.85 4.98 18 1/3 1/3 1/3 0 7.05 4.54

7 0 1/2 0 1/2 7.79 4.89 19 0 1/3 1/3 1/3 7.03 4.52

8 2/3 1/12 0 1/4 7.18 4.82 20 3/4 0 0 1/4 7.30 4.52

9 1/4 0 1/12 2/3 7.16 4.75 21 1/4 0 0 3/4 7.26 4.51

10 5/8 0 0 3/8 6.93 4.70 22 0 1 0 0 11.13 4.50

11 3/8 0 0 5/8 6.92 4.67 23 0 0 1 0 11.24 4.50

12 1/2 1/2 0 0 9.29 4.64 24

   

7.88 4.49

Table 5: Task sequencing statistics using 24 selected task weight sets; the 24th column uses random weights generated for each individual DDG. The second to last row is the average percent by which schedules may be nonoptimal for a single-pass algorithm using the speci ed weight set. The last row is the average for a multiple-pass algorithm which chooses the best schedule from all schedules generated from weight sets with indices no greater than the speci ed index. Sets such as 10 and 11 which are almost balanced between early and late times do well, whereas sets 16, 17, 22, and 23, which are highly unbalanced, produce the worst averages. We can also use di erent weight sets in di erent trials of a multiple-pass sequencing algorithm. Looking at all 276 unordered pairs of the 24 weight sets, the highest ranking pairs, such as pair (8,9), generate schedules which average within 5.82% of optimal. The best pair which includes weight set 1 is 22nd out of the 276 pairs, and is within 5.90% of optimal, so if two sets are used, set 1 is no longer necessarily a preferred candidate for inclusion. Except for set 1, all other sets may be grouped in pairs, with one set favoring early times and the other set favoring late times in complementary fashion. We might expect that using two complementary sets would produce good results, and this is often the case, as with pair (8,9). More generally, we might expect that pairing a set favoring early times with some set favoring late times (not necessarily the complementary set) would be a reasonable policy. This is also the case. The 65th pair is the highest ranking pair which violates this informal guideline without including set 1, and is within 6.10% of optimal. Comparing multiple-pass algorithms which use three weight sets, many of the best combinations include set 1 plus an early and a late weight set. The highest ranking combination is (1,8,9), which is within 5.40% of optimal. For combinations of four weight sets, the best combinations are within 5.20% of optimal. For ve weight sets, the best combinations, including (1,2,3,4,5), are within 5.06% of optimal. The best combinations of seven weight sets are within 4.87% of optimal. Results for a multiple-pass variant of the Base Algorithm, which chooses the best schedule from all schedules generated from weight sets with indices no greater than a given index are shown in the nal row of Table 5. Keeping in mind that a top ranking combination of k1 sets cannot necessarily be extended into a top ranking combination of k2 > k1 sets simply by adding k2 ? k1 extra sets, the selection and order of the rst seven weight sets in Table 5 is based on experimentation with most of the possible combinations of up to seven sets (of the 23 nonrandom sets in the table), with the goal of generating good schedules for any successive cumulative subset of weight combinations. The pairs (8,9) through (22,23) are ordered by successively choosing the remaining pair which produces the best iteration interval averages when included with the preceding sets of weights. We can extrapolate several general trends from experimentation with di erent combinations of weight sets. However many weight sets are used, better results are obtained by choosing diverse combinations of 26

sets. As the number of weight sets increases, the number of reasonable combinations also increases. All choices are subject to statistical averages, and small di erences between observed averages are generally not signi cant. Some DDG's can be optimally scheduled with any weight set, and any particular weight set will produce good schedules for some DDG's. When using a relatively small number of weight sets, random weight sets are not very competitive. However, at and beyond a threshold of around 15{20 of the weight sets in Table 5, adding random weight sets equals and then surpasses the ability of nonrandom sets to generate the best schedules for our set of test DDG's. Adding random sets (generated for each DDG) to the 23 nonrandom sets in Table 5 to reach combinations of 25, 30, 50, and 100 weight sets results in schedules which average within 4.47%, 4.43%, 4.33%, and 4.25% of optimal, respectively. Note that multiple-pass algorithms need not necessarily repeat all steps in all trials. In the case of multiple trials with di erent weight sets, the results of Steps 1{3 and 5 of the Base Algorithm do not change. Only Steps 4, 6, 7, and the calculation of b in Step 8 need be repeated. The while loop in Step 8 need only be executed once as a post-optimization step for DAG edges. This observation applies in varying degrees to other multiple-pass algorithm variants as well.

6.5 Moment Increments

Step 5a of the Generic Algorithm allows any nonnegative moment increment to be added to the moment of a task, as long as all tasks in the same SCC have the same moment increment, tasks in predecessor SCC's do not have larger moment increments, and tasks in successor SCC's do not have smaller moment increments. Like stage increments in Step 5b, moment increments can be used to guarantee schedule optimality with respect to DAG edges by increasing stage assignments, but unlike stage increments, moment increments a ect the task order of a schedule, as well as the stage assignment. Modifying the task order of a task sequence for a cyclic DDG when DAG edges are \post optimized" may increase the iteration interval of a schedule, so stage increments are better suited to DAG edge optimization than are moment increments. It is possible, however, to generate schedules with smaller iteration intervals using nonzero moment increments. The technique we use for generating moment increments is to rst choose a nonnegative real \moment increment generation bound" mig bound to bound the relative magnitude of moment increments. In Step 5 of the Base Algorithm, mig bound = 1=4. From De nition 2, G  :cpl = max fX:ect  g is the length of any longest path through graph G  . If all SCC's are indexed from 0 to one less than the number of SCC's in an increasing topological order, and if SCC G0 has index i in this order, then we may add X

b

b

b

i

G0 :moment increment = mig bound (G0  :cpl )i=(#SCC0 s) b

i

to the moment of any task X in G0 . If a DDG contains just a single SCC, the moment increment of the SCC will be zero, but otherwise, the position of tasks in the task order in one SCC may change relative to tasks in other SCC's. 64% of our 10,800 test DDG's contain two or more SCC's. De ning moment increments in terms of G0  :cpl rather than some absolute number minimizes the di erentiation between ne-grained and coarse-grained DDG's. The primary question to consider with this moment increment generation technique is what value to choose for mig bound. Table 6 gives (b=bopt lb ? 1)100 statistics for our set of test DDG's for selected values of mig bound. One possible explanation for the larger iteration intervals with mig bound = 0 is that without using moment increments, tasks in di erent SCC's are more likely to have the same increments in Step 7, which might increase the impact of arbitrarily choosing one of multiple similar task orders. It is not surprising that the iteration intervals of schedules generated using di erent positive values of mig bound do not monotonically decrease as mig bound increases, since we expect that using a variety of moment increments, and hence a variety of task orders for a set of DDG's will sometimes increase, and sometimes decrease the iteration interval of any particular schedule generated. We generally prefer smaller over larger moment increments (to reduce schedule memory requirements), so with reference to Table 6 we use mig bound = 1=4 in the Base Algorithm, producing schedules which are within 6.82% of optimal. We can also use mig bound as a multiple-pass bound on moment increment generation, with bounds for individual trials distributed uniformly over the interval [0,mig bound]. For example, with mig bound = 2, ve trials would have individual trial bounds of 0, 0.5, 1, 1.5, and 2. Table 7 gives summary statistics for selected numbers of trials with selected values of mig bound. As with choosing a single-pass value for mig bound, the i

b

27

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 2 3 (b=bopt lb ? 1)100 6.92 6.88 6.85 6.81 6.82 6.83 6.88 6.85 6.83 6.82 6.82 6.80 6.83 mig bound

Table 6: Task sequencing statistics for a single-pass sequencing algorithm using selected values of the moment increment generation bound mig bound.

mig bound

0.5 1.0 1.5 2.0 2.5 3.0 5.0 10.0

1 6.83 6.82 6.81 6.80 6.83 6.83 6.82 6.84

2 6.44 6.43 6.43 6.42 6.45 6.42 6.44 6.43

3 6.34 6.27 6.25 6.25 6.26 6.26 6.27 6.26

4 6.30 6.20 6.18 6.17 6.16 6.16 6.17 6.18

5 6.27 6.17 6.13 6.11 6.11 6.11 6.11 6.12

Maximum # Trials 6 7 8 6.26 6.25 6.24 6.14 6.12 6.11 6.10 6.08 6.07 6.08 6.06 6.04 6.07 6.05 6.03 6.07 6.04 6.02 6.06 6.03 6.02 6.07 6.05 6.02

9 6.23 6.10 6.05 6.03 6.01 6.01 5.99 6.00

10 6.23 6.09 6.05 6.02 5.99 5.99 5.98 5.98

20 6.21 6.06 6.01 5.97 5.95 5.93 5.91 5.91

50 6.19 6.04 5.98 5.94 5.92 5.90 5.87 5.85

100 6.19 6.04 5.97 5.93 5.90 5.88 5.86 5.83

Table 7: Task sequencing statistics for selected numbers of trials of selected values of mig bound . smallest values do not perform well, and increasing values do not correspond to monotonically decreasing iteration intervals. However, compared to the single-pass case, the threshold above which good results are expected is larger, at around 2.0.

6.6 Relaxing Deadlock Requirements

Sections 6.2{6.5 examined the space of options available at Steps 1, 2, 4, and 5a of the Generic Algorithm, and investigated both single-pass and multiple-pass variants of the Base Algorithm. All of the algorithm variants in these sections had one feature in common: Theorem 6 guaranteed that any schedule generated would be free of deadlock. However, the converse is not true. If constraints in the Generic Algorithm are violated, the resulting schedule will not necessarily deadlock. For example, if no processors are shared, then all local task orders are trivial orders on singleton sets, so the global task order and stage assignments are irrelevant, and any task sequence will be free of deadlock. When processors are shared, there are still a variety of circumstances where Generic Algorithm constraints may be violated without generating schedules which deadlock. We can compensate for the conservative nature of the Generic Algorithm in a multiple-pass task sequencing algorithm by generating schedules which might deadlock, if we can determine whether or not a schedule deadlocks, and at least one schedule is generated which is guaranteed not to deadlock. Generating schedules which might deadlock gives us more latitude, potentially allowing us to nd schedules with smaller iteration intervals with less computational e ort than if we only consider schedules which are guaranteed not to deadlock. Theorem 1 provides the basis for determining whether or not a pipeline schedule deadlocks. Schedule s for DDG G deadlocks i the scheduled DDG G violates the positive cycle constraint. If all dependence distances are nonnegative, G can be checked for deadlock in O(v + e) time by checking for cycles in the subgraph obtained by removing all edges with positive dependence distances. In the general case where dependence distances may be negative, since the smallest sum of dependence distances satisfying the positive cycle constraint is one, the maximum cycle ratio of any schedule which does not deadlock is no greater than the sum of all task execution and data communication times in G. We can therefore check for deadlock in O(ve) time by setting b to the sum of all graph component times, generating graph G  (De nition 2), and performing a longest path computation to check for positive cycles. We use this method in our variant of Burns' primal-dual algorithm [3] (if the schedule does not deadlock, the solution to the longest path problem s

s

b

28

mexp bound

1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 2.00

2 6.25 6.17* 6.22 6.30 6.37 6.44 6.49 6.54 6.56 6.60 6.73

3 6.18 5.99 5.95* 5.97 6.02 6.07 6.10 6.15 6.19 6.24 6.46

4 6.16 5.92 5.84* 5.84* 5.85 5.89 5.91 5.96 5.98 6.02 6.23

5 6.15 5.88 5.78 5.77* 5.77* 5.79 5.80 5.83 5.85 5.89 6.06

Maximum # Trials 6 7 8 9 6.13 6.13 6.13 6.12 5.86 5.84 5.84 5.83 5.75 5.73 5.72 5.70 5.72* 5.70 5.66 5.65 5.72* 5.68* 5.66 5.62* 5.72* 5.70 5.66 5.63 5.74 5.69 5.65* 5.63 5.76 5.71 5.67 5.64 5.77 5.72 5.68 5.65 5.80 5.75 5.70 5.67 5.94 5.87 5.80 5.74

10 6.12 5.82 5.69 5.65 5.62 5.60* 5.61 5.62 5.62 5.64 5.72

20 6.11 5.79 5.65 5.58 5.54 5.53 5.52* 5.52* 5.52* 5.52* 5.53

50 6.10 5.78 5.62 5.55 5.50 5.48 5.46 5.46 5.45 5.46 5.44*

100 6.10 5.77 5.61 5.54 5.49 5.47 5.44 5.44 5.43 5.43 5.41*

Table 8: Task sequencing statistics for selected numbers of trials of selected values of mexp bound. Asterisks (*) mark the smallest values in each column. conveniently provides an initial feasible solution to the linear programming problem). Some maximum cycle ratio algorithms can detect deadlock with little or no change to the basic algorithm. The monotonic search algorithm in [6], for example, will explicitly nd a cycle which violates the positive cycle constraint. As discussed in preceding sections, we generally expect that schedules with larger stage assignments are more likely to have smaller iteration intervals. Larger stage assignments can be generated in Step 6 of the Generic Algorithm by multiplying the moments of all tasks by some expansion factor (or equivalently, by choosing a smaller divisor). For any \moment expansion bound" mexp bound > 1 we can perform multiple task sequencing trials by multiplying all task moments by values uniformly distributed over the interval [1; mexp bound ]. For example, to perform four trials with mexp bound = 1:3, the expansion factors for the trials would be 1, 1.1, 1.2, and 1.3. Using an expansion factor of 1 corresponds to choosing unmodi ed moments, so the schedule generated for this trial will not deadlock. Table 8 summarizes the results of performing di erent numbers of trials for a variety of values of mexp bound. The smallest values in each column are marked with asterisks. Although there a number of minor exceptions, increasing the value of mexp bound as the number of trials increases generally produces schedules which are closer to optimal. For 2 and 100 trials with mexp bound = 1:05, 27% and 20%, respectively, of the trials which use expanded moments generate schedules which deadlock. For 2 and 100 trials with mexp bound = 2:00, 95% and 76% of the trials deadlock.

6.7 Parameter Combinations

The multiple-pass variants of the Base Algorithm investigated in Sections 6.2{6.6 all focused on a single step of the algorithm. With reference to the summary tables in these sections, we can compare the relative e ectiveness of the di erent multiple-pass algorithms examined. Using a well-chosen set of diverse coecient weight sets in Step 4 is by far the most e ective way to decrease the iteration interval of the generated schedule from the 6.82% average of the single-pass Base Algorithm. Not only is this true for any number of trials, but using just 4 di erent weight sets outperforms 100 trials varying any other parameter. As a speci c data point, with 10 trials using di erent weight sets, schedules average within 4.70% of optimal. Relaxing deadlock constraints by varying moment expansion bounds in Step 6 is the next most e ective technique, where executing 10 trials generates schedules which average within 5.60% of optimal, when mexp bound = 1:30. Varying moment increments in Step 5a (mig bound = 5), divisors in Step 2, and SCC scaling factors in Step 1 are successively less e ective, with averages within 5.98%, 6.43%, and 6.73% of optimal, respectively, for 10 trials. We might further ask if it is possible to get better results by using trials which vary combinations of parameters, rather than just a single parameter. We will examine two speci c instances of this general 29

question.

How close can we get to generating optimal schedules using a large number of trials? A simple brute force approach can be used to address this question. Any or all parameters can be varied at the same time, independent of the choices for other parameters. Favoring more choices for parameters which are more e ective at lowering the average iteration interval of schedules as suggested in previous sections, we might choose:

    

Step 1 (2 choices) | perform and omit SCC scaling Step 2 (2 choices) | use b = G0:mcr and b = G0 :pmax Step 4 (25 choices) | use the 23 nonrandom weight sets in Table 5 and two random weight sets Step 5 (10 choices) | use 10 trials with mig bound = 5 Step 6 (10 choices) | use 10 trials with mexp bound = 1:3

Using all combinations of these choices results in a maximum of 10,000 trials for each DDG. (Recall that fewer trials are necessary when no SCC has a maximum cycle ratio less than G0.mcr in Step 1, when G0.mcr = G0.pmax in Step 2, when there is only one SCC in Step 5, or when a schedule is found whose iteration interval equals the lower bound b_opt^lb.) With these choices, schedules average within 3.45% of optimal, all schedules are within 68.8% of optimal, and at most 24.0% of schedules are nonoptimal. This algorithm requires an average of 4.3 seconds per DDG, and at most 2.9 minutes per DDG (wall clock times on an Intel Pentium Pro processor). The corresponding values for the single-pass Base Algorithm are given in Table 2. It is likely that the iteration intervals of generated schedules can be further decreased by increasing the number of choices at each step, and perhaps by choosing alternative parameter values.

Can we do better than just using different weight sets in Step 4 for a small number of trials? A possibility here is to consider combinations of parameters in both Steps 4 and 6, which are the two steps which individually make the most effective use of multiple trials. As an alternative to the "try all possible combinations" approach, we can use a two-phase approach to avoid a combinatorial increase in the maximum number of trials required. In the first phase, mexp_bound is set to 1 in Step 6 to guarantee that schedules will not deadlock, and some number of trials is performed to select a weight set which generates a schedule with a small iteration interval. In the second phase, this weight set is selected in Step 4, but the value of mexp_bound is varied in Step 6. To implement this two-phase approach it is necessary to choose which weight sets to consider in the first phase, how many trials to perform in the second phase, and what value of mexp_bound to use in the second phase.

As in Table 8, where moment expansion bounds were considered in isolation, smaller values of mexp_bound perform better for smaller numbers of trials, and larger values of mexp_bound perform slightly better for larger numbers of trials. However, compared to the earlier case, the preferred value for mexp_bound in this two-phase algorithm is smaller. Table 9 shows the results for a number of choices of weight sets ordered as in Table 5, and selected numbers of moment expansion trials with mexp_bound = 1.20. Asterisks mark the smallest values for any fixed number of trials. For up to about 3 trials, using weight sets alone is the preferred choice. For approximately 4-7 trials, setting mexp_bound = 1.10 is the best choice. For 7-12 trials, mexp_bound = 1.15 gives the best results, and for 9-25 trials, mexp_bound = 1.20 gives the best results. For any fixed total number of trials, increasing mexp_bound also shifts the preferred balance between the number of weight set trials and the number of moment expansion trials slightly in favor of more moment expansion trials. As a reference data point, choosing 8 weight set trials, setting mexp_bound = 1.15, and performing 2 second-phase moment expansion trials, for a maximum of 10 trials, generates schedules which average within 4.54% of optimal. All schedules with these algorithm choices are within 73.2% of optimal, and at most 27.8% of schedules are nonoptimal. This algorithm requires an average of 0.0188 seconds per DDG, and at most 0.4413 seconds per DDG. As a second reference point from Table 9, schedules average within 4.15% of optimal for thirty trials (20 weight set trials plus 10 moment expansion trials).
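The two-phase approach described above can be sketched as follows. Again this is illustrative only: run_trial and its keyword parameters are hypothetical stand-ins for one sequencing trial with a given weight set and moment expansion bound, not interfaces defined in this report.

    def two_phase_search(ddg, run_trial, weight_sets, mexp_trials, mexp_bound=1.20):
        # Phase 1: mexp_bound = 1 in Step 6 (which guarantees schedules will not
        # deadlock); select the weight set giving the smallest iteration interval.
        best_ws, best_sched, best_ii = None, None, float("inf")
        for ws in weight_sets:
            sched, ii = run_trial(ddg, weight_set=ws, mexp_bound=1.0)
            if ii < best_ii:
                best_ws, best_sched, best_ii = ws, sched, ii
        # Phase 2: keep the selected weight set and spend the remaining trials
        # varying the randomized moment expansion with the chosen bound.
        for _ in range(mexp_trials):
            sched, ii = run_trial(ddg, weight_set=best_ws, mexp_bound=mexp_bound)
            if ii < best_ii:
                best_sched, best_ii = sched, ii
        return best_sched, best_ii

The total number of trials is the number of weight sets tried plus mexp_trials; the 10-trial reference point above corresponds to 8 weight-set trials followed by 2 trials with mexp_bound = 1.15.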


                     Maximum # Additional Trials with mexp_bound = 1.20
# Weight Sets     0      1      2      3      4      5      6      7      8      9     10
       1        6.82*  6.30   5.97   5.84   5.77   5.72   5.70   5.66   5.65   5.65   5.63
       2        5.92*  5.61   5.38   5.29   5.23   5.19   5.17   5.15   5.13   5.12   5.12
       3        5.44*  5.21*  5.04   4.95   4.90   4.87   4.85   4.83   4.82   4.81   4.80
       4        5.22   5.01*  4.86*  4.78   4.74   4.71   4.69   4.67   4.66   4.65   4.65
       5        5.06   4.86*  4.72*  4.65*  4.61   4.59   4.58   4.56   4.55   4.54   4.53
       6        4.98   4.81   4.67   4.60*  4.56   4.54   4.53   4.51   4.50   4.49   4.48
       7        4.89   4.74   4.61   4.55*  4.51   4.49   4.47   4.46   4.45   4.44   4.44
       8        4.82   4.67   4.55*  4.49   4.45   4.43   4.42   4.40   4.39   4.38   4.38
       9        4.75   4.60   4.48*  4.43*  4.38*  4.37   4.35   4.34   4.33   4.32   4.32
      10        4.70   4.56   4.44   4.39   4.34*  4.33   4.32   4.30   4.29   4.29   4.28
      11        4.67   4.53   4.41   4.36   4.32*  4.30*  4.29   4.28   4.26   4.26   4.25
      12        4.64   4.51   4.40   4.34   4.30*  4.29   4.28   4.26   4.25   4.25   4.24
      13        4.61   4.48   4.37   4.32   4.28*  4.26*  4.25*  4.24*  4.23*  4.22*  4.22
      14        4.60   4.47   4.36   4.31   4.27   4.25*  4.24*  4.23*  4.22*  4.21   4.21
      15        4.58   4.45   4.35   4.30   4.26   4.24*  4.23*  4.22*  4.21   4.20   4.20
      16        4.57   4.44   4.34   4.29   4.25   4.23*  4.22*  4.21   4.20   4.19   4.18
      17        4.55   4.43   4.33   4.28   4.24   4.22*  4.21   4.19*  4.19   4.18   4.17
      18        4.54   4.41   4.31   4.26   4.22*  4.20*  4.19*  4.18*  4.17*  4.16*  4.16
      19        4.52   4.40   4.30   4.25   4.21   4.19*  4.18*  4.17*  4.16*  4.15*  4.15*
      20        4.52   4.39   4.29   4.25   4.21   4.19   4.18   4.17   4.16   4.15*  4.15*

Table 9: Task sequencing statistics for selected numbers of weight set trials followed by selected numbers of trials with mexp_bound = 1.20. Asterisks (*) mark the smallest values for any fixed number of trials.
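As a small illustration of how the asterisks are determined (not part of the report's algorithms), the best split of a fixed trial budget between weight-set trials and moment expansion trials can be read off a table like Table 9 as follows; the nested-dict representation of the table is an assumption made for this sketch.

    def best_split(table, total_trials):
        # table[w][m] = average percent above optimal for w weight-set trials
        # followed by m moment expansion trials (the cells of Table 9).
        candidates = [(table[w][m], w, m)
                      for w in table
                      for m in table[w]
                      if w + m == total_trials]
        return min(candidates)  # (percent above optimal, weight-set trials, mexp trials)

    # Example with a few Table 9 cells for a 10-trial budget:
    # best_split({8: {2: 4.55}, 9: {1: 4.60}, 10: {0: 4.70}}, 10) returns (4.55, 8, 2)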

7 Conclusion

Asynchronous pipelining is a form of pipeline parallelism which can be used in shared memory environments, but is particularly relevant in distributed memory environments, since it allows pipeline control to be distributed across processors using the generic control code in Figure 2. An asynchronous pipeline schedule is a generalization of a noniterative DAG schedule, and consists of an assignment of tasks to processors, plus a task sequence for each processor. We showed that the asynchronous pipeline scheduling problem is NP-hard under asymptotic, fixed-count, and variable-count optimality, even when the loop body is self-cyclic. With one exception, the kernel scheduling, processor assignment, and task sequencing subproblems are also all NP-hard under all three forms of optimality, for both cyclic and self-cyclic loops. The exception is the task sequencing problem for self-cyclic loops under asymptotic optimality, which has an "impractical" O(v + e) time algorithm that typically generates schedules with very large startup times, and a "practical" O(ve) time algorithm that generates schedules with reasonable startup times.

Our investigation of the computational complexity of asynchronous pipeline scheduling raises several additional unanswered questions. An affirmative answer to the first and second of the following questions would imply an affirmative answer to the third question.

- Is there a polynomial time algorithm for determining the exact execution time of a fixed number of iterations of a pipeline schedule? (The integer linear program in [14] for the case where processors are not shared suggests that this problem is probably in NP.)
- Is there a polynomial time algorithm for determining the startup time a of a pipeline schedule?
- Are the NP-hard entries in Table 1 also NP-complete?
- Is the task ordering problem for self-cyclic loops under asymptotic optimality NP-hard?
- Is the stage assignment problem for cyclic loops under the three forms of optimality NP-hard? Is the stage assignment problem for self-cyclic loops under fixed-count or variable-count optimality NP-hard?
- Will doubling the base stage assignments of tasks in the Base Algorithm guarantee schedule optimality with respect to DAG edges?

The Generic Task Sequencing Algorithm in Figure 3 is a task sequencing algorithm which leaves a number of parameter choices unspecified. We showed that schedules generated by any instantiation of the Generic Algorithm will not deadlock. The Base Task Sequencing Algorithm in Figure 4 is an instantiation of the Generic Algorithm which generates asymptotically optimal task sequences for self-cyclic loops in O(ve) time. The Base Algorithm can be used in conjunction with a multiprocessor scheduling algorithm to provide the processor assignment component of a complete pipeline schedule, such that the resulting schedule inherits any optimality guarantee provided by the multiprocessor scheduling algorithm. This implies that a pipeline schedule can be generated for a self-cyclic loop which is guaranteed to be within 18% of optimal. The "post-optimization" of DAG edges in the Base Algorithm also applies to DAG edges in cyclic DDG's, so DAG edges may be ignored in the earlier phases of any scheduling algorithm for general DDG's. This observation might be applied to improve the algorithms in [9, 17].

When applied to a set of 10,800 random cyclic DDG's, the Base Algorithm generates task sequences which average within 6.82% of optimal. Multiple-pass variants of the Base Algorithm generate schedules which average within 4.54% of optimal for 10 trials, 4.15% of optimal for 30 trials, and 3.45% of optimal for 10,000 trials. All algorithm variants execute in expected O(ve) time.

The most important question for future research is to determine whether task sequencing can be used as a component of a complete, competitive scheduling algorithm for cyclic loops. This approach to asynchronous pipeline scheduling might also lead to new techniques for software pipeline scheduling.

Acknowledgments

This work was supported in part by an IBM Cooperative Fellowship.

References

[1] Alexander Aiken and Alexandru Nicolau. Optimal loop parallelization. Proc. SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988, pp. 308-317.

[2] Sati Banerjee, Takeo Hamada, Paul M. Chau, and Ronald D. Fellman. Macro pipelining based scheduling on high performance heterogeneous multiprocessor systems. IEEE Transactions on Signal Processing 43:8 (June 1995), pp. 1468-1484.

[3] Steven M. Burns. Performance Analysis and Optimization of Asynchronous Circuits. Ph.D. Thesis, California Institute of Technology, Pasadena, California, 1991.

[4] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM Journal on Computing 7:1 (February 1978), pp. 1-17.

[5] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.

[6] Val Donaldson and Jeanne Ferrante. Determining asynchronous pipeline execution times. Proc. 9th Workshop on Languages and Compilers for Parallel Computing, San Jose, CA, August 1996.

[7] D. K. Friesen and M. A. Langston. Evaluation of a MULTIFIT-based scheduling algorithm. Journal of Algorithms 7:1 (March 1986), pp. 35-59.

[8] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, 1979.

[9] Franco Gasperoni and Uwe Schwiegelshohn. Scheduling loops on parallel processors: a simple algorithm with close to optimum performance. Second Joint International Conference on Vector and Parallel Processing (Parallel Processing: CONPAR 92-VAPP V), Lyon, France, September 1992, pp. 625-636.

[10] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics 17:2 (March 1969), pp. 416-429.

[11] Phu D. Hoang and Jan M. Rabaey. Scheduling of DSP programs onto multiprocessors for maximum throughput. IEEE Transactions on Signal Processing 41:6 (June 1993), pp. 2225-2235.

[12] J. A. Hoogeveen, S. L. van de Velde, and B. Veltman. Complexity of scheduling multiprocessor tasks with prespecified processor allocations. Discrete Applied Mathematics 55:3 (December 1994), pp. 259-272.

[13] Monica Lam. Software pipelining: an effective scheduling technique for VLIW machines. Proc. SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988, pp. 318-328.

[14] Tsuneo Nakanishi, Kazuki Joe, Constantine D. Polychronopoulos, Akira Fukuda, and Keijiro Araki. Estimating parallel execution time of loops with loop-carried dependences. Proc. 25th International Conference on Parallel Processing, Ithaca, NY, August 1996, Vol. III, pp. 61-69.

[15] David A. Padua and Michael J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM 29:12 (December 1986), pp. 1184-1201.

[16] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, 1989.

[17] Tao Yang, Cong Fu, Apostolos Gerasoulis, and Vivek Sarkar. Mapping iterative task graphs on distributed memory machines. Proc. 24th International Conference on Parallel Processing, Oconomowoc, WI, August 1995, Vol. II, pp. 151-158.

[18] Tao Yang and Apostolos Gerasoulis. DSC: scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems 5:9 (September 1994), pp. 951-967.
