Probabilistic Rotation: Scheduling Graphs with Uncertain Execution Time

Sissades Tongsima

Chantana Chantrapornchai

Abstract

This paper proposes an algorithm called probabilistic rotation scheduling, which takes advantage of loop pipelining to schedule tasks with uncertain computation times onto a parallel processing system. Such tasks normally occur when conditional instructions are employed and/or the inputs of a task influence its computation time. We show that, based on our loop scheduling algorithm, the length of the resulting schedule can be guaranteed for a given probability. The experiments show that the resulting schedule length for a given confidence probability can be significantly better than the schedules obtained under worst-case or average-case assumptions.

1 Introduction

In many practical applications, such as interface systems, fuzzy systems, and artificial-intelligence systems, many tasks have uncertain computation times. Such tasks normally contain conditional instructions and/or operations whose computation time varies with the input. A dynamic scheduling scheme may be considered to address this problem; however, the decisions of a run-time scheduler, which depend on local on-line knowledge, may not give a good overall schedule. Although many static scheduling techniques can thoroughly check for the best assignment of dependent tasks, the existing methods are not able to deal with such uncertainty. Therefore, either the worst-case or the average-case computation time of a task, which may not reflect the real operating situation, is usually assumed.

For iterative applications, statistics on the computation times of these uncertain tasks are not difficult to collect. In order to take advantage of these statistical data and of loop pipelining, a novel loop scheduling algorithm, called probabilistic rotation scheduling (PRS), is introduced in this paper. The algorithm attempts to expose the parallelism of the certain and uncertain tasks (collectively called probabilistic tasks) within each iteration; synchronization is then applied at the end of each iteration. Such a parallel computing style is also known as synchronous parallelism [2, 11]. The proposed algorithm takes an input application modeled as a hierarchical data-flow graph (DFG), where a node corresponds to a task, e.g., a collection of statements, and a set of edges represents dependencies

This work was supported in part by NSF grant MIP 95-01006. Dept. of Computer Sci. and Engr., University of Notre Dame, Notre Dame, IN 46556; Midwestern State University, Wichita Falls, TX 76308.

Edwin H.-M. Sha

Nelson Passos

between these tasks. The dependency distances, also called delays, between tasks in different iterations are represented by short bars on those edges. The computation time of a node can be either fixed or varied, and a probability model is employed to represent the timing of the probabilistic tasks.

Considerable research has been conducted in the area of scheduling directed acyclic graphs (DAGs) onto multiprocessor systems. Such graphs are obtained by ignoring edges containing one or more delays. Many heuristics have been proposed to schedule the DAG, e.g., list scheduling and graph decomposition [5, 6]. These methods neither explore the parallelism across iterations nor address the problem of probabilistic tasks. In [7, 8], Ku and De Micheli proposed the relative scheduling method, which handles tasks with unbounded delays. Their approach, however, takes a DAG as the input graph and does not explore the parallelism across iterations. Furthermore, even if statistics on the computation times of uncertain nodes can be collected, their method does not use this statistical information.

For the class of global scheduling, software pipelining [9] is used to overlap instructions, i.e., to expose the parallelism across iterations. This technique, however, expands the graph by unfolding it. Furthermore, such approaches solve the problem without considering the uncertainty of the computation times [3, 9]. A rotation scheduling technique was developed by Chao, LaPaugh, and Sha [1]. This technique assigns nodes from a DFG to a system with a limited number of functional units. It implicitly applies traditional retiming [10] in order to reduce the total computation time of the nodes along the longest paths (also called the critical paths) in the DFG. In this paper, the rotation scheduling technique is extended so that it can deal with uncertain tasks.
An application with probabilistic execution times is transformed into a graph model, called the probabilistic data-flow graph (PG), which is a generalization of the DFG model. After an initial execution order and functional-unit assignment are given, probabilistic rotation scheduling is applied so that the total computation time of the final schedule, at a given confidence probability, can be reduced by exploring parallelism across iterations. As an example, Figures 1(a) and 1(b) show a PG and the computation time distribution of each of its nodes. Since the computation times in a PG are random variables, the total computation time of the graph is also a random variable, and the concept of a control step, i.e., the synchronization time of the tasks within each iteration, is no longer applicable. A schedule conveys only the execution order of the tasks executed in a functional unit and/or between different units. Our technique

Nodes A, B, C, and D form the PG of part (a); part (b) lists the computation time distribution of each node:

    Time    A      B      C      D
     1      0      0      0      0
     2      1     0.80    1     0.75
     3      0      0      0      0
     4      0     0.20    0     0.25

Fig. 1: PG and its computation time
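The discrete distributions of Figure 1(b) can be encoded directly. The following is a minimal sketch, assuming a dict-of-dicts encoding (time mapped to probability) of our own choosing; only the node names and probabilities come from the figure.

```python
# Computation-time distributions of Figure 1(b): time -> probability.
# The encoding (nested dicts) is illustrative, not from the paper.
dists = {
    "A": {2: 1.0},
    "B": {2: 0.80, 4: 0.20},
    "C": {2: 1.0},
    "D": {2: 0.75, 4: 0.25},
}

def is_distribution(dist):
    """Each node's probabilities must sum to 1, as required of a PG vertex."""
    return abs(sum(dist.values()) - 1.0) < 1e-9

assert all(is_distribution(d) for d in dists.values())
```

A node with a single fixed time, such as A, is simply a distribution with one entry of probability 1, matching the paper's treatment of ordinary DFG nodes as a special case.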

will give a good initial schedule whose length is guaranteed for a given probability. Therefore, the resulting schedule most likely satisfies the system constraints, and the number of redesign cycles can be reduced. In order to compute the total computation time of this execution order, a probabilistic task-assignment graph (PTG) is constructed. Such a graph is obtained from the original PG by ignoring non-zero-delay edges and assigning each node to a specific functional unit in the system. The PTG also contains some extra edges, called flow-control edges, each of which is established between two independent tasks u and v where u is executed right before v within the same functional unit.

2 Background

A probabilistic data-flow graph (PG) is a vertex-weighted, edge-weighted, directed graph G = ⟨V, E, d, T⟩, where V is the set of vertices representing tasks, E is the set of edges representing the data dependencies between vertices, d is a function from E to the non-negative integers giving the number of delays on an edge, and Tv is a random variable representing the computation time of a node v ∈ V. Note that an ordinary DFG is a special case of the PG. The probability distribution of T is assumed to be discrete in this paper; the granularity of the distribution, if necessary, depends on the required accuracy. The notation P(T = x) is read "the probability that the random variable T assumes the value x". The probability function maps each possible value x to its probability, i.e., p(x) = P(T = x). Each vertex v ∈ V of the PG is weighted with the probability distribution of its computation time Tv, a discrete random variable over the set of possible computation times of v such that Σ∀x P(Tv = x) = 1. For a node with only one computation time t in the PG, P(T = t) = 1. An edge e ∈ E from node u to node v is denoted by u →e v, and a path p starting from node u and ending at node v by u ⇝p v.

An iteration is the execution of each node in V exactly once. Iterations are identified by an index i starting from 0. Inter-iteration dependencies are represented by delay-weighted edges, and an iteration is associated with a static schedule, which must obey the precedence relations defined by the data-flow graph. For any iteration j, an edge e from u to v with delay d(e) conveys that the computation of node v at iteration j depends on the execution of node u at iteration j - d(e). An edge with no delays represents a data dependency within the same iteration. A legal data-flow graph must have strictly positive delay cycles, i.e., the summation of the delays along any cycle cannot be less than or equal to zero.

The execution order of the PG can be defined as follows. A probabilistic task-assignment graph (PTG) G = ⟨V, E, w, T, b⟩ is a vertex-weighted, edge-weighted, directed acyclic graph, where V is the set of vertices representing tasks, E is the set of edges representing the data dependencies between vertices, w is an edge-type function from e ∈ E to {0, 1}, where 0 denotes a dependency edge and 1 denotes a flow-control edge, Tv is a random variable representing the computation time of a node v ∈ V, and b is a processor-binding function from v ∈ V to {PEi : 1 ≤ i ≤ n}, where PEi is processing element i and n is the total number of processing elements.

For example, Figure 2 shows a PTG and its corresponding execution order when two processing elements are available. Nodes B and D are assigned to PE0, that is, b(B) = b(D) = PE0, while b(C) = b(A) = PE1. The edge set consists of C →e1 A, C →e2 D, and B →e3 D, where w(e1) = 1 and w(e2) = w(e3) = 0.

Fig. 2: A PTG and its execution order. (a) The PTG over nodes A, B, C, and D. (b) The static execution order: PE0 executes B then D, and PE1 executes C then A.
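The flow-control-edge construction just described can be sketched as follows. The encoding (edge tuples, the `build_ptg` helper) is ours, independence of two tasks is approximated by checking only for a direct dependency edge, and the delayed edge D → A below is a hypothetical inter-iteration edge added purely to show that non-zero-delay edges are dropped; the rest follows the Figure 2 example.

```python
def build_ptg(pg_edges, delays, order):
    """Sketch: derive a PTG from a PG.
    pg_edges: list of (u, v) edges; delays: dict (u, v) -> delay count;
    order: dict PE name -> list of tasks in execution order on that PE."""
    ptg = []  # list of (u, v, edge_type); type 0 = dependency, 1 = flow-control
    for (u, v) in pg_edges:
        if delays[(u, v)] == 0:       # non-zero-delay edges are ignored
            ptg.append((u, v, 0))     # dependency edge, w(e) = 0
    for pe, tasks in order.items():
        for u, v in zip(tasks, tasks[1:]):
            if (u, v, 0) not in ptg:  # only between independent tasks
                ptg.append((u, v, 1)) # flow-control edge, w(e) = 1
    return ptg

# Figure 2 example: B, D bound to PE0; C, A bound to PE1.
edges = [("C", "D"), ("B", "D"), ("D", "A")]
delays = {("C", "D"): 0, ("B", "D"): 0, ("D", "A"): 1}  # D->A delay is hypothetical
ptg = build_ptg(edges, delays, {"PE0": ["B", "D"], "PE1": ["C", "A"]})
# -> [("C", "D", 0), ("B", "D", 0), ("C", "A", 1)]
```

The result reproduces Figure 2: w(C → A) = 1 (flow-control, from the execution order on PE1) and w(C → D) = w(B → D) = 0 (dependency edges), while no extra edge is added between B and D because they are already dependent.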

3 Probabilistic Rotation Scheduling (PRS)

Recall the definition of a probabilistic data-flow graph (PG) from Section 2. Since the computation time of each vertex of a PG is a random variable, the traditional notion of a fixed global cycle period for a PG G is no longer valid. Therefore, a random variable called the maximum reaching time (mrt) is introduced. The notation mrt(u, v) represents the probabilistic critical-path length of the portion of the graph between nodes u and v. Likewise, mrt(G) is the maximum reaching time of graph G, which represents the probabilistic cycle period of G, i.e., its probabilistic schedule length; in particular, the mrt of a PTG represents the probabilistic schedule length of that PTG. Algorithm 3.1 calculates the mrt of a PTG. In order to simplify the calculation, two dummy vertices with zero computation time, vs and vd, are added to the graph: a set of zero-delay edges connects vertex vs to all root nodes and connects all leaf nodes to vertex vd. Therefore, mrt(vs, vd) gives the overall maximum reaching time of the graph and is used to compute the schedule length of the given PTG. This schedule length describes the possible computation times of the graph.

Algorithm 3.1 (Maximum reaching time)

Input : PTG G = ⟨V, E, w, T, b⟩
Output: mrt(G) = mrt'(vs, vd)

1  G' = ⟨V', E', d, T⟩ such that V' = V + {vs, vd} and
   E' = E + {vs →e v : ∀v ∈ Vr} + {u →e vd : ∀u ∈ Vl},
   where Vr and Vl are the root and leaf nodes of G
2  mrt'(vs, u) = 0, ∀u ∈ V'; Tvs = Tvd = 0
3  Queue = {vs}
4  while Queue ≠ ∅ do
5     u = get(Queue)
6     mrt'(vs, u) = mrt'(vs, u) + Tu
7     foreach u →e v do
8        indegree(v) = indegree(v) - 1
9        mrt'(vs, v) = max(mrt'(vs, u), mrt'(vs, v))
10       if indegree(v) = 0 then put(v, Queue) fi
11    od
12 od
In this algorithm, the graph is traversed in topological order and the mrt of each node v is computed with respect to vs. The value mrt'(vs, v), initially zero, is updated as each parent of v is dequeued. Note that the operations "+" and "max" in Lines 6 and 9 operate on two random variables. Finally, when vd is extracted from the queue, mrt(G) has been computed. Using the mrt, the probabilistic schedule length psl(G, θ) can be defined with respect to a confidence level θ as the smallest computation time c such that P(mrt(G) ≥ c) < 1 - θ. Consider the probability distribution of mrt(G):
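The "+" and "max" operations on random variables in Lines 6 and 9 can be sketched in Python, assuming the node times are independent discrete distributions (an assumption the paper does not state explicitly). The dict encoding and function names are ours; the example values are the B and D distributions from Figure 1, combined along the B → D dependency of the running example.

```python
from itertools import product

def rv_add(a, b):
    """'+' on two independent discrete random variables (a convolution)."""
    out = {}
    for (x, p), (y, q) in product(a.items(), b.items()):
        out[x + y] = out.get(x + y, 0.0) + p * q
    return out

def rv_max(a, b):
    """'max' on two independent discrete random variables."""
    out = {}
    for (x, p), (y, q) in product(a.items(), b.items()):
        m = max(x, y)
        out[m] = out.get(m, 0.0) + p * q
    return out

# Serial dependency B -> D with the Figure 1 distributions: mrt = T_B + T_D.
TB = {2: 0.80, 4: 0.20}
TD = {2: 0.75, 4: 0.25}
mrt = rv_add(TB, TD)   # {4: 0.60, 6: 0.35, 8: 0.05}
```

The full algorithm interleaves `rv_add` (Line 6) and `rv_max` (Line 9) along the topological traversal; when two paths share a node, their lengths are not independent, so this product-form sketch is only an approximation in that case.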

Possible computation time   ...     13        14        15        16
Prob.                       ...   0.18194   0.04365   0.02293   0.00875
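Reading psl(G, θ) off such a distribution amounts to a tail sum: the smallest time c whose tail probability P(mrt(G) ≥ c) falls below 1 - θ. The sketch below uses a hypothetical complete distribution (the table above is only partially shown, so the mass at 12 and below is ours) with the same tail values.

```python
def psl(dist, theta):
    """Smallest time c with P(mrt >= c) < 1 - theta, per the worked example.
    dist: full distribution as a dict time -> probability."""
    for c in sorted(dist):
        if sum(p for t, p in dist.items() if t >= c) < 1.0 - theta:
            return c
    return max(dist)

# Hypothetical full distribution; the entries at 13..16 match the table.
mrt_dist = {12: 0.74273, 13: 0.18194, 14: 0.04365, 15: 0.02293, 16: 0.00875}
assert abs(sum(mrt_dist.values()) - 1.0) < 1e-6
```

With θ = 0.8, `psl(mrt_dist, 0.8)` returns 14: the tail at 13 is 0.25727 (too large), while the tail at 14 is 0.07533 < 0.2.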

With θ = 0.8, psl(G, 0.8) is 14, because 14 is the smallest possible computation time c for which P(mrt(G) ≥ c) < 0.2 (here 0.04365 + 0.02293 + 0.00875 = 0.07533 < 0.2). Note that c could also be 15 or 16, but 14 is the smallest. Therefore, with above 80% confidence, the computation time of G is less than 14.

In traditional rotation scheduling, a re-mapping (re-scheduling) heuristic plays an important role in reducing the schedule length. For probabilistic rotation scheduling, the template scheduling (TS) heuristic is applied to find a place to re-schedule a task. In this approach, a weight called the degree of flexibility is assigned to each node in the PTG. To compute this weight, the expected computation time of each node is used to build a template, which implies not only the execution order but also the control-step assignment of each task. From this template, one can estimate how long (in control steps) each processing element would be idle, so the template scheduling scheme can decide where to re-schedule a node. In order to determine the degree of flexibility dflex(u, i), we first compute the expected control step Ecs(v) of each node. Traversing the PTG in topological order, we compute Ecs(v) = max(Ecs(ui) + ETui) over all edges ui →e v ∈ E, where ETu denotes the expected computation time of node u and Ecs(vi) = 0 for all root nodes vi ∈ V. We assume here that a node v can start executing as soon as all of its parents have finished. Then dflex(u, i) is computed as dflex(u, i) = Ecs(v) - Ecs(u) - ETu, where u →e v ∈ E and u and v are assigned to PEi. Note that the degree of flexibility of the node executed last on any PE is undefined. The template scheduling, described in Algorithm 3.2, seeks the processing element that yields the shortest possible psl when re-scheduling a node.

Algorithm 3.2 (Template scheduling)

Input : PTG G = ⟨V, E, w, T, b⟩ with pre-computed ETu, ∀u ∈ V,
        re-mapped node v, and θ
Output: new G with the assignment of v

1  compute Ecs(u), ∀u ∈ V
2  compute dflex(u), ∀u ∈ V
3  Gtemp = G; Gbest = NULL
4  for PEi := PE0 to PEn do
5     x = node with maximum dflex(x, i) among nodes bound to PEi
6     Gtemp = temporarily assign v after x
7     if psl(Gtemp, θ) < psl(Gbest, θ) then Gbest = Gtemp fi
8     remove the assignment and retry on the next PE
9  od
10 return(Gbest)

Algorithm 3.3 presents the probabilistic rotation scheduling (PRS).

Algorithm 3.3 (Probabilistic Rotation Scheduling)

Input : PG G = ⟨V, E, d, T⟩ and θ
Output: a shortest possible PTG Gs = ⟨V, E, w, T, b⟩

1  compute ETu, ∀u ∈ V
2  Gs = Init_Schedule(G)              // construct the initial schedule (DAG)
3  Gbest = Gs
4  for i = 1 to 2|V| do
5     R = Extract_Roots(Gs)
6     (Gs, G, u) = Select_Rotate(R, Gs)   // select a node and retime it
7     Gs = Re-map(Gs, u, θ)               // template scheduling
8     if psl(Gs, θ) < psl(Gbest, θ) then Gbest = Gs fi
9  od
10 return(Gbest)

Line 2 constructs an initial schedule using any DAG scheduling method, e.g., list scheduling, modified to return a PTG. The rotation phase spans Lines 4–9. This phase loops 2|V| times in the hope that every node in the graph gets a chance to be re-scheduled at least once. Extract_Roots returns the set of roots that can be legally retimed. Select_Rotate then selects a node u to rotate using any priority function: one delay is drawn from all incoming edges of node u and pushed onto all outgoing edges of node u, and the PTG Gs is updated accordingly, i.e., its flow-control and dependency edges are modified. The node is then re-mapped using the TS heuristic proposed above. If the resulting probabilistic schedule is better than the current best, the better PTG is saved, and the rotation continues.
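The delay-moving step inside Select_Rotate can be sketched as a one-step retiming of the selected node. The edge encoding (a dict mapping (src, dst) to a delay count) and the example delay values are illustrative, not taken from any figure in the paper.

```python
def rotate(delays, u):
    """Retime node u by one: draw one delay from every incoming edge of u
    and push one delay onto every outgoing edge of u."""
    assert all(d >= 1 for (s, t), d in delays.items() if t == u), \
        "u must be a legal root: every incoming edge must carry a delay"
    new = dict(delays)
    for (s, t) in delays:
        if t == u:
            new[(s, t)] -= 1
        if s == u:
            new[(s, t)] += 1
    return new

# Illustrative two-edge fragment: D -> A carries one delay, A -> B none.
d = {("D", "A"): 1, ("A", "B"): 0}
d2 = rotate(d, "A")   # -> {("D", "A"): 0, ("A", "B"): 1}
```

After the rotation, A's instance from the next iteration is pulled into the current one, which is what lets PRS explore parallelism across iterations while keeping every cycle's total delay unchanged.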

4 Experiments

To be realistic and easy to compare with traditional rotation scheduling, we tested the PRS algorithm on some well-known benchmarks: (1) the differential equation solver, (2) the three-stage IIR filter, (3) the lattice filter, (4) the Volterra filter, and (5) the fifth-order elliptic filter. The computation time of each node in these benchmark graphs is obtained from [4]. Table 1 demonstrates the effectiveness of our approach on both 2-adder, 1-multiplier and 2-adder, 2-multiplier systems. The

                              θ = 0.9                      θ = 0.8
                              PRS                          PRS
Spec.            Ben.   PL    AS    ET    TS        PL    AS    ET    TS
2 Adds. 1 Mul.   (1)   169   152   133   133       165   147   131   131
                 (2)   188   184   151   151       184   179   147   147
                 (3)   229   225   142   141       225   220   138   138
                 (4)   526   468   361   361       519   461   354   354
                 (5)   318   298   293   293       314   294   289   289
2 Adds. 2 Muls.  (1)   120   103    83    90       117   100    83    91
                 (2)   124   120    87    87       120   110    83    82
                 (3)   229   225   140   139       225   220   136   136
                 (4)   359   270   237   259       353   265   221   256
                 (5)   288   288   274   271       284   274   270   267

Table 1: List scheduling vs. PRS

performance of PRS is evaluated when the algorithm applies three different re-mapping heuristics: template scheduling (TS), exhaustive trial (ET), and as-late-as-possible scheduling (AS). The ET approach tries to re-map a node to every possible legal location and returns the assignment that yields the minimum psl(G, θ). The AS method attempts to schedule a task at the farthest legal position in each functional unit (adder or multiplier), while the TS heuristic legally places a task after the node with the highest degree of flexibility in each functional unit. Columns θ = 0.8 and θ = 0.9 show the results for the probabilistic case with confidence probabilities 0.8 and 0.9. Column "PL" presents the psl after list scheduling is applied to the benchmarks. After running PRS with the re-mapping heuristics ET, AS, and TS, Columns ET, AS, and TS show the resulting psl. Among the three heuristics, the TS scheme produces better results than AS, which uses the simplest criteria. Further, TS yields results as good as, and sometimes better than, those given by the ET approach, while taking less time to select a re-scheduling position for a node. This is because the ET method finds the locally optimal place in each iteration, but scheduling nodes to these positions does not always lead to the globally optimal schedule length.

In Table 2, based on the system with 2 adders and 1 multiplier, we compare the results of applying list scheduling, traditional rotation scheduling, probabilistic rotation scheduling using TS, and traditional rotation scheduling with expected computation times to the benchmarks. Columns "L" and "R" show the schedule lengths obtained from list scheduling and from traditional rotation scheduling using TS under the worst-case scenario. Clearly, considering the probabilistic case gives a significant improvement in schedule length over the worst-case scenario. Also, column "PL" presents the initial schedule lengths obtained from list scheduling under the probabilistic computation-time model. The results in column TS are taken from Table 1. In column "Exp", the psl is computed using the PG configuration obtained by running traditional rotation scheduling on the benchmarks with the expected computation time of each node in place of its probability distribution. These results demonstrate that considering the probabilistic situation while performing rotation scheduling consistently gives better schedules than considering only the worst-case or average-case scenario.

        worst case         θ = 0.9                  θ = 0.8
                                 PRS                       PRS
Ben.     L      R       PL     TS    Exp        PL     TS    Exp
(1)     228    180     169    133    136       165    131    131
(2)     252    204     188    151    163       184    147    179
(3)     312    204     229    141    153       225    138    149
(4)     750    510     526    361    526       519    354    519
(5)     438    396     318    293    299       314    289    294

Table 2: Worst case, average case, vs. probabilistic case

5 Conclusion

We have presented the probabilistic rotation scheduling algorithm, in which the probabilistic concept and loop pipelining are integrated to optimize a task schedule. A probabilistic data-flow graph is used to model an application, allowing probabilistic computation times. The concept of the probabilistic schedule length is presented to measure the total computation time of the tasks scheduled in one iteration. Probabilistic rotation scheduling is applied to the initial schedule in order to optimize it, producing an optimized schedule with respect to a given confidence probability. The re-mapping heuristic, template scheduling, is incorporated in the algorithm to find the re-scheduling position for a node.

REFERENCES

[1] L. Chao, A. LaPaugh, and E. Sha. Rotation scheduling: A loop pipelining algorithm. In 30th DAC, pp. 566–572, June 1993.
[2] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1994.
[3] E. M. Girczyc. Loop winding—a data flow approach to functional pipelining. In ISCAS, pp. 382–385, May 1987.
[4] Texas Instruments. The TTL Data Book, volume 2. Texas Instruments Incorporated, 1985.
[5] R. A. Kamin, G. B. Adams, and P. K. Dubey. Dynamic list-scheduling with finite resources. In ICCD, pp. 140–144, Oct. 1994.
[6] A. A. Khan, C. L. McCreary, and M. S. Jones. A comparison of multiprocessor scheduling heuristics. In ICPP, pp. 243–250, 1994.
[7] D. Ku and G. De Micheli. High-Level Synthesis of ASICs under Timing and Synchronization Constraints. Kluwer Academic, 1992.
[8] D. Ku and G. De Micheli. Relative scheduling under timing constraints: Algorithms for high-level synthesis. IEEE Trans. CAD/ICAS, pp. 697–718, June 1992.
[9] M. Lam. Software pipelining. In ACM SIGPLAN '88, pp. 318–328, June 1988.
[10] C. E. Leiserson and J. B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5–35, 1991.
[11] B. P. Lester. The Art of Parallel Programming. Prentice-Hall, Englewood Cliffs, NJ, 1993.