Scheduling on Heterogeneous Message Passing Parallel Architectures

Stella C. S. Porto†
Universidade Federal Fluminense, 24210 Niteroi, RJ, and PUC-RIO, 22453 Rio de Janeiro, RJ, Brazil

Daniel A. Menasce
Department of Computer Science, George Mason University, Fairfax, VA 22030-4444

Abstract

Cost-effective multiprocessor designs may be obtained by combining in the same architecture processors of different speeds (heterogeneous architecture), so that the serial and critical portions of the application may benefit from a fast single processor. The observed trend in the supercomputing industry indicates that future generations of high speed machines will be built by combining a few very fast processors with hundreds or even thousands of slower identical VLSI processors, interconnected through a message passing scheme. Scheduling algorithms for such environments should be designed so that applications benefit from the heterogeneity of the architecture. At the same time, the presence of communication delays in the task scheduling problem means that two opposing criteria must be considered: communication costs and processor load. This paper proposes novel static heuristic processor assignment algorithms for heterogeneous message passing architectures and investigates their performance through the use of Markov chain based analytic models. The execution time of different parallel applications is computed as a function of the degree of heterogeneity of the architecture, the application communication/processing ratio, and the fraction of sequential processing.

* This work was performed while the author was at UMIACS, Univ. of Maryland at College Park, on leave from PUC-RIO. Partially supported by NSF (CCR-9002351), IBM Brasil and CAPES.
† This work was partially executed while the author was visiting UMIACS, Univ. of Maryland at College Park. Partially supported by CNPq (RHAE) and CAPES.


1 Introduction

Since the advent of VLSI technology, parallel processing has been an important and challenging approach to the design of flexible and scalable high-performance computers. However, parallel processing poses several problems which deserve special attention and proper design solutions for the achievement of the desired performance levels. For that matter, performance modeling has been of utmost importance when attempting to estimate and predict the behavior of parallel systems. Benefiting from the potential advantages of such systems requires that: problems (applications or jobs) be partitioned into smaller subproblems (tasks), processors be statically or dynamically allocated to concurrent jobs (in multiprogrammed environments), tasks be statically or dynamically assigned to processors, and tasks be synchronized according to the application's data stream. To accomplish these requirements while exploiting the parallelism available in a multiprocessor, one must consider several important issues which may cause performance degradation, such as: overhead due to communication between processes, contention for hardware and/or software resources, task synchronization overhead, processor allocation and task scheduling policies, interference between different applications in multiprogrammed environments, and overhead due to task preemption and migration between processors. Although there are many tradeoffs between these issues, most performance models generally focus on some particular aspects in order to obtain performance metrics such as: overall execution time, speedup, efficiency, efficacy, performance degradation, processor utilization, and others.

A performance model of a parallel system may be decomposed into two submodels: the workload model, which represents the set of jobs to be executed, and the system model, which represents the hardware and basic software architecture. We define a discrete workload model as one in which the internal structure of the application is explicitly described, usually in the form of a task graph. In contrast to a discrete workload representation, we define a parameterized workload model as one in which an application is described through high-level parameters such as the fraction of sequential processing, average parallelism, and minimum and maximum parallelism [13]. In this case, the internal structure of a job is not explicitly described. Usually, this type of modeling is more adequate for problems in which there is no concern with the internal dynamics of the application. This work considers discrete workload models. The system model is concerned with the system architecture and with the details of the interconnection between processors. For the purpose of this work, we assume that processes running at different processors can only communicate through the exchange of messages. In other words, there is no shared memory.

One of the most important issues to be considered in order to achieve better performance in a parallel system is the proper assignment of the tasks of a parallel job to the various processors. Depending on the structure of the parallel application and on the interconnection structure of the multiprocessor, this problem may assume different aspects, and its solution may take different approaches. Interprocessor communication plays an important role in this context. The associated communication overhead, which may result from data dependencies among the tasks that compose a single parallel application, depends directly on the multiprocessor interconnection structure.

Mapping versus Scheduling. Two basically distinct approaches to handle communication delays [18] can be found in the literature: mapping and scheduling. The first approach formulates the problem in graph theoretic terms [4, 12, 3]. Two undirected graphs are considered. The nodes of the first, called Task Interaction Graph (TIG), are associated with the set T = {t_1, ..., t_n} of tasks of the parallel program. The weights {ω_1, ..., ω_n} associated with the nodes represent known or estimated computation costs. The edges of the TIG indicate that the linked tasks interact during their lifetime, i.e. communicate with each other; the edge weights c_ij reflect the relative amounts of communication (measured, for example, in number of message packets) between tasks t_i and t_j, without capturing any temporal execution dependencies. The second graph represents the multiprocessor architecture. The set of nodes of this graph corresponds to a set of processors P = {p_1, ..., p_m}. The weight d_pq of an edge between processors p_p and p_q represents the cost of exchanging a unit message between them. Processors are assigned to program tasks through a mapping function M(t) = p. A good mapping should reduce the total interprocessor communication time while balancing the workload among the processors in order to minimize the overall parallel job completion time. The TIG approach can be used to approximately model the behavior of complex parallel programs by lumping together all estimated communications between pairs of tasks, and ignoring the temporal execution dependencies. There is also a class of parallel programs, called iterative parallel programs, where the TIG model is quite accurate. With this class of programs, execution proceeds as a sequence of sequential iterations. In each iteration, all parallel tasks can execute independently, but each task then needs to communicate values computed during that iteration with the tasks it is connected to in the TIG, before it can start its next iteration.

The second approach considers the allocation problem as a pure scheduling problem. It regards the program graph as an acyclic directed graph, called Task Precedence Graph (TPG). Again, a set of nodes T = {t_1, ..., t_n} represents the program tasks and their weights {ω_1, ..., ω_n} reflect known or estimated computation times. In contrast to the TIG model, the directed arcs in the TPG indicate a one-way communication between a pair of tasks (t_i, t_j), determining a precedence relationship t_i → t_j, which means that t_i is a direct predecessor of t_j or, equivalently, t_j is a direct successor of t_i. The set of direct predecessors of a task t_i is called pred_i and the set of its direct successors is called succ_i. Task t_i will only be ready to start execution after all its predecessors have been executed and their corresponding messages have been received. These precedence relations explicitly represent the execution time dependencies. The edge weights c_ij also measure the communication cost between t_i and t_j, but in a more restrictive way than in the mapping problem case described above. In the scheduling case, the communication between t_i and each of its successors occurs only after t_i has finished its computation. At this moment, t_i transfers data to all of its successors according to a certain communication protocol. The processors are also organized as a non-oriented graph with nodes P = {p_1, ..., p_m} corresponding to processors and edge weights d_pq reflecting the cost of transmitting a unit message from p_p to p_q. A schedule is an assignment of each task to a processor (A(t) = p), such that, among others, precedence constraints and communication delays are taken into account. In this paper, we are interested in schedules aimed at minimizing the completion time of parallel programs.

Approaches to Assignment Algorithms. According to [10], the different assignment strategies that have been proposed in the literature are based on one of the following approaches: mathematical programming, graph theory, queuing theory, and heuristics. The first three approaches give approximate optimal solutions but are time consuming, given that the assignment problem is NP-complete. To speed up the search through the solution space, approximate algorithms have been used; they are based on one of the above optimal approaches limited by the search time used. Another solution to the problem is obtained by using heuristics. They may be divided into two classes [10]: greedy and ascendent/descendent. Greedy algorithms are initialized with a partial solution and try to extend this solution until a complete assignment is achieved. At each step, one task assignment is performed and this decision cannot be changed in the remaining steps. On the contrary, ascendent/descendent algorithms start from a complete initial assignment and try to improve it by analyzing neighbor solutions.

The same kind of taxonomy for heuristic algorithms is found in [12]. There, it is stated that assignment schemes can also be formulated through a minimization procedure for some cost function. However, in some contexts, these schemes have been proposed using intuition about the specific assignment context and do not use any explicit cost functions in the assignment procedure. Such schemes, which are then called domain-heuristic schemes, are computationally very efficient. In contrast, schemes that explicitly optimize cost functions are often computationally time consuming, but are more generally applicable and potentially capable of obtaining better assignments. The primary problem with most of the approaches based on explicit minimization of a cost function is the exponentially large configuration space of possible assignments that must be selectively searched in seeking to optimize the function. Many proposed heuristics that explicitly minimize cost functions have been shown to work relatively well on small task graphs (with a few tens of tasks), but have not been evaluated on large task graphs (with hundreds or thousands of tasks).

Scheduling Policies with Communication Delays. The presence of communication delays in the task scheduling problem implies that two opposing criteria must be taken into account:

- Communication costs: intraprocessor communication costs are negligible when compared to those of interprocessor communication. Thus, minimizing the latter means assigning all tasks to a single processor.
- Processor load: to take advantage of the parallelism inherent to the parallel program, tasks need to be allocated to different processors.

If we now add to this environment the heterogeneity aspect, the scheduling problem becomes even more complex. In a homogeneous environment one has to be able to determine the optimum number of processors to be allocated to an application (processor allocation), as well as which tasks are to be assigned to each processor (processor assignment) [1]. In a heterogeneous setting, we have to determine not only how many but also which processors are allocated to an application, as well as which processor is to be assigned to each task [8]. In [8], several algorithms were proposed for processor assignment of parallel applications modeled as TPGs in multiprocessors with a heterogeneous architecture. The heuristics proposed in that work did not explicitly consider, for scheduling purposes, any information concerning the amount of data transmitted between tasks.

This paper proposes new scheduling algorithms for loosely-coupled message passing heterogeneous multiprocessors. These algorithms are extensions to previous work by the same authors on scheduling in heterogeneous environments [8]. The rest of this work is organized as follows. Section 2 introduces the basic assumptions about the system and the workload model. Section 3 presents the outline of new scheduling policies for this environment, considering the greedy approach. A performance evaluation framework based on Markov chains is described in section 4. The results of the performance analysis are given in section 5. Finally, section 6 presents the concluding remarks.

2 Basic Models and Assumptions

2.1 System Model

For the purpose of this work, a heterogeneous multiprocessor architecture Φ with a message-passing communication structure is based on the model used in our previous work [8], extended to a Processor Interconnection Graph (PIG). Let P(Φ) = {p_1, p_2, ..., p_m} be the set of m partially interconnected processors of Φ and G(Φ) its non-oriented Processor Interconnection Graph. Each node in this graph represents one of the processors of the multiprocessor. Arcs in this graph represent the paths between processors. Each processor has an instruction set partitioned into I execution time equivalent classes. The instruction execution times of a given processor p_j are represented by a vector δ_j = (δ_1j, δ_2j, ..., δ_Ij), where δ_ij is the execution time of a type i instruction at processor p_j. A value of ∞ for δ_ij indicates that processor p_j does not execute instructions of type i. We assume that a message is decomposed into equal sized packets before transmission through the network. Associated with each undirected arc (i, j), linking nodes p_i and p_j, there is a scalar θ_ij which represents the delay of transmitting a packet (packet delay) between p_i and p_j. The following assumptions about the communication between processors are worth noting:

Symmetric Delays: the packet transmission time between two processors is the same in both directions (θ_ij = θ_ji ∀ i, j).

Negligible Intraprocessor Delay: the delay to send a message between two tasks running in the same processor is negligible when compared to interprocessor communication delays (θ_ii ≪ θ_ij ∀ i, j with i ≠ j).

Logically Fully Connected Network: there is always a path through which messages can be routed between any two processors, even when there is no direct link between them in the interconnection network ((i, j) is an arc of G(Φ) ∀ i, j = 1, ..., m, and θ_ij ≠ ∞).
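The following sketch illustrates one way this system model could be represented in code. It is our own illustration, not taken from the paper; the class and field names (Architecture, exec_times, packet_delay) are ours. It simply stores the instruction execution time vectors and the packet delays, and checks the three assumptions listed above.

```python
# Minimal sketch (not from the paper) of the Section 2.1 system model:
# per-instruction-class execution times plus a Processor Interconnection
# Graph (PIG) given as a matrix of per-packet delays theta[i][j].
import math

class Architecture:
    def __init__(self, exec_times, packet_delay):
        # exec_times[j][i]: execution time of a type-i instruction on processor p_j
        # (math.inf means p_j does not execute instructions of type i).
        # packet_delay[i][j]: delay of transmitting one packet from p_i to p_j.
        self.exec_times = exec_times
        self.packet_delay = packet_delay
        m = len(packet_delay)
        # Symmetric delays: theta[i][j] == theta[j][i] for all i, j.
        assert all(packet_delay[i][j] == packet_delay[j][i]
                   for i in range(m) for j in range(m))
        # Logically fully connected network: every pair has a finite delay.
        assert all(packet_delay[i][j] != math.inf
                   for i in range(m) for j in range(m) if i != j)
        # Negligible intraprocessor delay (modeled here as zero).
        assert all(packet_delay[i][i] == 0 for i in range(m))
```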

2.2 Workload Model

Based on our previous work [8], the workload model is also extended to a complete TPG (Task Precedence Graph), in which communication demands are also represented. A parallel application π is a set of partially ordered, interrelated tasks. Let T(π) = {t_1, t_2, ..., t_n} be the set of tasks of π and G(π) its acyclic directed precedence graph. Each node in this graph represents one of the tasks of the application. Arcs in the graph link a task t_k to its immediate successors in the execution sequence. A task t_k is said to become executable when all its immediate predecessors pred_k in the graph finish their execution. Associated with each task t_k we define a vector Γ_k = (γ_1k, ..., γ_Ik) such that γ_ik is the average number of instructions of type i executed by task t_k. This vector is called the service demand vector. Associated with each directed arc (i, j) of G(π), linking task t_i to task t_j, there is a scalar μ_ij which represents the average number of packets that compose the message sent from t_i to t_j when t_i finishes its execution. This is the communication demand between t_i and t_j.

2.3 The Assignment Function

The assignment is a function A : T → P which associates tasks with processors, such that A(t_k) = p_j indicates that task t_k has been assigned to, and will be executed by, processor p_j. Considering the interrelation between the workload and system models, the following can be stated:

- The average execution time of a task t_k at a processor p_j, denoted by ε(t_k, p_j), is given by the dot product Γ_k · δ_j. The set of average estimated task execution times at each processor is given by the n × m matrix E.
- The average communication delay between task t_k at processor p_j and task t_i at processor p_l is given by the product μ_ki × θ_jl.
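As a concrete illustration of these two quantities, the short sketch below computes the execution time as a dot product and the communication delay as packets times per-packet delay. The function names and the numbers in the example are ours, chosen only for illustration.

```python
# Illustrative helpers (our names, not the paper's) for the two quantities above.

def execution_time(gamma_k, delta_j):
    """epsilon(t_k, p_j) = Gamma_k . delta_j (dot product over instruction classes)."""
    return sum(g * d for g, d in zip(gamma_k, delta_j))

def communication_delay(mu_ki, theta_jl):
    """Average delay of the message t_k -> t_i when A(t_k) = p_j and A(t_i) = p_l."""
    return mu_ki * theta_jl

# Example with I = 2 instruction classes (hypothetical numbers):
gamma_k = [100.0, 40.0]    # average instruction counts of task t_k per class
delta_j = [1.0, 2.5]       # per-instruction execution times on processor p_j
print(execution_time(gamma_k, delta_j))   # 100*1.0 + 40*2.5 = 200.0
print(communication_delay(8, 3.0))        # 8 packets * 3.0 delay/packet = 24.0
```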

2.4 Execution Model

Given the workload and system models, one has to examine what happens during execution after a static assignment, or during execution with a dynamic assignment. In this section we propose two simple schemes for representing the execution of the parallel application under static and dynamic scheduling procedures.

2.4.1 Static Scheduling

Static scheduling is done prior to execution and does not incur any overhead in execution time. When actual execution starts, all tasks are already assigned to processors. When a task t_k finishes its execution, it starts communicating with the processors to which its immediate successors are allocated. This is possible because the designated processors of these successors have already been determined. It is assumed that messages will be buffered at the destination processor if needed. If we assume that communication and computation overlap, the communication overhead is given by the message transmission delay μ_ki × θ_A(t_k),A(t_i), where t_i ∈ succ_k. However, during this communication interval, different situations may occur:

- The successor t_i is still not executable (i.e., not all of its predecessors have been executed), because one or more of its predecessors have not finished execution. This means that task t_i cannot take control of processor A(t_i) immediately. In this case, the data sent by t_k to t_i will be buffered in the local memory of A(t_i) until t_i is scheduled to start execution.
- The successor t_i is executable, but its execution is delayed because of other tasks also assigned to A(t_i). Again, the data sent by t_k to t_i is buffered in A(t_i) until t_i is scheduled to start execution.
- The successor t_i is executable and is also ready to start execution instantly on A(t_i). In this case, t_i will only have to wait for the transmission delay μ_ki × θ_A(t_k),A(t_i).

2.4.2 Dynamic Scheduling

Under dynamic scheduling, on the contrary, the processor assignment is done during the actual execution, generating scheduling delays in the overall execution time. If communication is not explicitly modeled, tasks are only scheduled to processors when they are executable and there is an available processor to execute them. However, considering the communication between tasks and their successors, an important question arises: to which processor should a task t_k, which has finished its execution, send the messages destined to its successors, if their assignment has not yet been determined? To address this, we assume the following procedure. When a task t_k finishes its execution, its successors t_i ∈ succ_k that are still not scheduled (A(t_i) not determined) must be scheduled at this moment. This way, messages can have their destination correctly established. This assumption changes considerably the dynamic scheduling approach considered in the related literature. In fact, it means that tasks will frequently be scheduled before they are capable of taking control of an available processor and starting execution. Moreover, processors will have task queues assigned to them, even though the scheduling procedure is being done dynamically.
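The sketch below illustrates the rule just described. It is our own pseudocode-style illustration; choose_processor stands for whatever heuristic the dynamic scheduler uses and is a hypothetical placeholder.

```python
# Rough sketch (ours) of the dynamic-scheduling rule above: when t_k finishes,
# any still-unassigned successor is scheduled immediately, so the messages
# produced by t_k have a known destination processor.

def on_task_finish(t_k, successors, assignment, choose_processor):
    for t_i in successors[t_k]:
        if t_i not in assignment:              # A(t_i) not yet determined
            assignment[t_i] = choose_processor(t_i)
    # Messages from t_k are now sent to assignment[t_i] for each successor;
    # they are buffered there until t_i actually takes control of the processor.
    return [(t_i, assignment[t_i]) for t_i in successors[t_k]]
```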

3 New Scheduling Algorithms

An important set of algorithms for heterogeneous parallel architectures was built with the systematic construction methodology proposed in [8]. That methodology was based on a meta-algorithm composed of two main elements, namely an Envelope and a Heuristic. A static heuristic algorithm was then defined as a loop which executes until all tasks have been assigned to processors. Inside this loop, a procedure called Domain Selection determines the subset of the not yet assigned tasks (Task Domain T⁺) and the subset of the processors (Processor Domain P⁺) that will be considered as input to the Heuristic. The Heuristic itself can be considered as a function H : T⁺ × P⁺ → (t_k, p_j), where the pair (t_k, p_j), which can be null in some cases, indicates that task t_k will be assigned to processor p_j. After the Heuristic is executed, a Domain Update step must be executed in order to update the task and processor domains according to the result of the heuristic decision. Therefore, the envelope is the combination of Domain Selection and Domain Update.
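The loop structure of this meta-algorithm can be summarized with the following sketch. This is our own illustration; the function names select_domains, heuristic and update_domains are placeholders for the envelope and heuristic components described above.

```python
# Skeleton (ours) of the envelope/heuristic meta-algorithm of Section 3.

def static_schedule(tasks, processors, select_domains, heuristic, update_domains):
    assignment = {}                          # A(t_k) = p_j
    task_domain, proc_domain = set(), set()  # T+ and P+
    while len(assignment) < len(tasks):      # loop until every task is assigned
        task_domain, proc_domain = select_domains(task_domain, proc_domain)
        pair = heuristic(task_domain, proc_domain)   # may be None (null pair)
        if pair is not None:
            t_k, p_j = pair
            assignment[t_k] = p_j
        task_domain, proc_domain = update_domains(task_domain, proc_domain, pair)
    return assignment
```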

3.1 DES/C Envelope

The DES (Deterministic Execution Simulation) envelope was shown [8] to achieve the best performance among the proposed envelopes. Domain Selection and Update follow the precedence relations given in G(π), considering all tasks as having a deterministic execution time equal to the average values given by the matrix E. Given the precedence graph, DES simulates the execution of the application. Any heuristic can be used to select the appropriate task-processor pair to be scheduled at any given point in time. So, in a sense, DES acts as a dynamic scheduler run prior to actual execution, using the estimated execution times instead of the actual execution times. DES must be modified in order to incorporate the communication delay feature added to the execution model. This modified version of DES is called Deterministic Execution Simulation with Communication (DES/C). If DES/C used exactly the same philosophy employed by DES of simulating a dynamic execution prior to the actual execution, it might run into the same sort of problems mentioned in section 2.4.2. In other words, it would be difficult to compute the communication time between a task and its successors, since they may not have been scheduled yet. In order to overcome this problem, DES/C adopts a rule under which tasks are only assigned to processors during DES/C when all their predecessors have already been executed. This will become clearer in the paragraphs that follow. During this simulated execution, tasks will be assigned to processors, and will have their execution started and finished. Tasks can be in the following different states, in this order:

- t_k is said to be in scheduled state when it has already finished its execution in processor A(t_k).
- t_k is in a non-schedulable state when pred_k ≠ ∅ and at least one of its predecessors is still not in scheduled state.
- t_k is in a schedulable state when either pred_k = ∅ or all of its predecessors are already in scheduled state.
- t_k is said to be in executing state in either one of the two following situations:
  - t_k has already been assigned to a processor A(t_k), but is still receiving data from one or more of its predecessors before starting proper execution, or
  - t_k has already received the necessary messages from all of its predecessors and its execution has already started but has not yet finished.

Although communication between t_k and any of its successors t_i starts immediately after the completion of t_k, it is only accounted for, during the DES/C simulation, after t_i is schedulable, assigned to a certain available processor, and ready to start its execution immediately. Similarly to DES, the Domain Selection and Domain Update determine T⁺ and P⁺ according to task completion events. The heuristic procedure selects a task-processor pair (possibly null) from T⁺ × P⁺. The main difference is that the updating of task and processor states has to consider the new communicating state, which tasks must go through before they are able to start their deterministic execution period. In order to provide a formal procedural description of DES/C, some additional definitions are necessary. A variable called clock indicates the evolution of time as the simulated execution progresses. start(t_k) and finish(t_k) are the clock values when t_k is scheduled and when it finishes its execution, respectively. delay(t_k) is the time that task t_k, which has already been assigned to a processor, has to wait to complete the reception of all messages from its predecessors. Processors can be in two states: free and busy. A processor p_j is said to be busy at a certain time instant if there is a task in executing state assigned to that processor at that time during the DES/C. Otherwise, the processor is said to be in free state. Let freet(p_j) be the next instant of time when processor p_j will go into free state, given that it is in busy state. Figure 1 shows the procedural description of DES/C.

Initialization {Set clock to zero and all processors as free}
    clock ← 0; ∀ p_j ∈ P do state(p_j) ← free

Domain Selection {Using the task graph G(π), set the task domain as the set of schedulable tasks, and the processor domain as the set of all free processors. Sets T⁺ and P⁺ are only updated when they are empty.}
    If T⁺ = ∅ ⇒ T⁺ ← {t_k ∈ T | state(t_k) ≠ scheduled and (pred_k = ∅ or state(t_i) = scheduled ∀ t_i ∈ pred_k)};
    If P⁺ = ∅ ⇒ P⁺ ← {p_j ∈ P | state(p_j) = free}.

Domain Update {Let (t_k, p_j) be the task-processor pair selected by the heuristic. If (t_k, p_j) is not null, compute the start and finish times of the selected task, update the state of the task and of its assigned processor, and remove the task and the processor from their respective domains. When any of the input domains becomes empty, the clock variable is updated to the finish time of the first task to finish. The state of all the processors which are supposed to become free at this time is updated.}
    If (t_k, p_j) ≠ null ⇒
        start(t_k) ← clock;
        delay(t_k) ← max{0, max_{t_i ∈ pred_k} {finish(t_i) + μ_ik × θ_A(t_i),A(t_k) − clock}};
        finish(t_k) ← start(t_k) + ε(t_k, p_j) + delay(t_k);
        freet(p_j) ← finish(t_k);
        state(t_k) ← executing; state(p_j) ← busy;
        T⁺ ← T⁺ − {t_k}; P⁺ ← P⁺ − {p_j}.
    If T⁺ = ∅ or P⁺ = ∅ ⇒
        clock ← min_{t_j | state(t_j) = executing} {finish(t_j)};
        state(t) ← scheduled ∀ t ∈ T | finish(t) = clock;
        state(p) ← free ∀ p ∈ P | freet(p) = clock.

Figure 1: Description of the DES/C

3.2 New Heuristics

We propose in this section new heuristics based on the ones presented in [11], which were shown to achieve good performance when combined with the DES envelope. We present two main families of heuristics, namely MFT/C and SEETF/C, which are distinguished essentially by their different criteria for processor selection. The resulting heuristics in each family may use different approaches to select tasks from the task domain during the DES/C loop.

The MFT/C family is composed of heuristics which are based on the same idea: selecting a processor p_j which determines the minimum finish time for a previously selected task t_k. In other words, p_j is such that finish'(t_k) = min_{p_i ∈ P} {freet(p_i) + delay'(t_k) + ε(t_k, p_i)}, where finish'(t_k) is the finish time of t_k if it were to be assigned to p_i; freet(p_i) is, as previously defined, the next instant of time when processor p_i becomes free in the deterministic execution simulation; and delay'(t_k) is the communication delay that would be imposed on t_k if it were to be assigned to p_i, given by max_{t_l ∈ pred_k} {0, finish(t_l) + μ_lk × θ_A(t_l),p_i − clock}. This search is done over the entire set of processors P. If the selected processor p_j is busy, then the task-processor pair is made null, meaning that no selection has been made. The reason why this heuristic is able to benefit from the heterogeneity is that it performs a look-ahead during the deterministic simulation and decides whether it is advantageous to wait for a fast processor to become available, even though there might be some free slower processors to which the task could be assigned.

On the other hand, in the SEETF/C family the selected processor p_j is the one which determines the minimum estimated execution plus communication time for a previously selected task t_k. In other words, p_j is such that ε(t_k, p_j) + max_{t_i ∈ pred_k} {μ_ik × θ_A(t_i),p_j} = min_{p_l ∈ P⁺} {ε(t_k, p_l) + max_{t_i ∈ pred_k} {μ_ik × θ_A(t_i),p_l}}.
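To make the two selection rules concrete, the following sketch expresses them in code. This is our own rendering, not the authors' implementation: eps, mu and theta stand for the estimated execution time ε(t_k, p), the per-arc packet counts and the per-packet delays, and freet, finish, assignment, clock and busy are the DES/C bookkeeping described above.

```python
# Hedged sketch (ours) of the MFT/C and SEETF/C processor-selection rules.

def comm_delay_if_assigned(t_k, p, preds, finish, assignment, mu, theta, clock):
    # delay'(t_k) = max_{t_l in pred_k} max(0, finish(t_l) + mu(l,k)*theta(A(t_l),p) - clock)
    return max((max(0.0, finish[t_l] + mu(t_l, t_k) * theta(assignment[t_l], p) - clock)
                for t_l in preds[t_k]), default=0.0)

def mft_c_select(t_k, processors, preds, finish, freet, assignment,
                 eps, mu, theta, clock, busy):
    # Minimum finish time over ALL processors; the pair is null if the winner is busy.
    def finish_if(p):
        d = comm_delay_if_assigned(t_k, p, preds, finish, assignment, mu, theta, clock)
        return freet[p] + d + eps(t_k, p)
    p_best = min(processors, key=finish_if)
    return None if busy[p_best] else (t_k, p_best)

def seetf_c_select(t_k, free_processors, preds, finish, assignment, eps, mu, theta):
    # Minimum estimated execution-plus-communication time over the free processors P+.
    def cost(p):
        comm = max((mu(t_i, t_k) * theta(assignment[t_i], p) for t_i in preds[t_k]),
                   default=0.0)
        return eps(t_k, p) + comm
    return (t_k, min(free_processors, key=cost))
```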

4 Performance Evaluation Framework

In this section, we propose a framework for evaluating the performance of the scheduling algorithms previously described. This framework consists of some simplifying assumptions for the general workload and system models, evaluation parameters, and comparison parameters based on performance comparison goals and performance evaluation techniques.

4.1 Simplifying Assumptions

Based on the models presented in section 2, we describe additional assumptions that make the performance analysis simpler to carry out. These assumptions are presented in two parts: the first related to the heterogeneous aspect of the architecture model [8], and the second related to the communication aspect.

4.1.1 Heterogeneity Aspect

- Any processor is able to execute any task.
- There is only one heterogeneous or fast processor in P (p_het), which has the highest processing capacity. The remaining m − 1 processors are called homogeneous or slow processors (p_hom).
- The ratio between the execution times of any two tasks on any two processors is constant: ∀ k, l ∈ [1, m], ∀ i, j ∈ [1, n]: ε(t_i, p_k)/ε(t_i, p_l) = ε(t_j, p_k)/ε(t_j, p_l). With this assumption it is possible to consider that the instruction set has only one partition (I = 1).
- Given the two previous assumptions, and also for simplicity, we assume without loss of generality that δ_hom = 1 and δ_het = 1/PPR, where PPR is the Processor Power Ratio defined in [9], which measures the ratio between the speed of the fast processor and the speed of each of the slow ones.

as is the case with several current machines [6].  We assume that the heterogeneity is restricted to the speed of the processors. Moreover, according to the previous assumption, the communication overhead is essentially due to the delay of message propagation through the network. 12

 We also assume that there is a very small variability in the delay experienced by packets sent between di erent source-destination pairs. This is true in machines which have interconnection networks with small diameters if compared with the number of processors. In other words, we consider that the message delay is solely a function of the number of packets in the message and not a function of the pair of processors communicating.

4.2 Evaluation and Comparison Parameters The evaluation parameter considered here is the average execution time of the application when it is submitted to the di erent scheduling policies. Considering di erent applications, it would be dicult to make a fair comparison using the absolute values of the execution times. So, we use a normalized execution time Trel . The normalized value is obtained by dividing the absolute execution time by the total execution time (Tnorm ) of the same application when it is submitted to a homogeneous system with an in nite number of processors all identical to phom [13]. The performance evaluation aims at analyzing how the proposed scheduling algorithms behave as a function of the two main aspects of this environment, namely heterogeneity and communication demand. In this sense, we now propose several parameters, which are an attempt to quantify these aspects of the architecture and of the application.

 The heterogeneity of the architecture is measured by the , previously de ned, processor power

ratio (PPR).  The heterogeneity of the application is measured by its intrinsic serial fraction, Fs , which is obtained through the same procedure used in calculating Tnorm [13].  The intrinsic communication/processing ratio [15], CPR, which is obtained dividing the overall communication demand of the application by the overall computation demand.

4.3 Evaluation Techniques We use an analytical method to obtain the overall execution time for a parallel application submitted to a certain scheduling policy. This analytical method may be described as the following three step procedure:

Step 1 The parallel application is submitted to a scheduling algorithm under the constraints of a

certain multiprocessor architecture, which may be characterized by the tuple (G(); G( ); nm; 13

T; P ). The result of this step is the complete assignment of all tasks tk 2 T , such that A(tk ) = pj ; pj 2 P . Step 2 The original TPG, G(), is transformed into a modi ed task graph, MTPG. In the MTPG, the communication between tasks, whose estimated average delay is already known, as a result of the assignment, is explicitly represented through the insertion of communication tasks, tci;j , such that ti is a predecessor of tj . Communication tasks are inserted between tasks in the original task graph G() if necessary. The tasks that belong to the original graph are, for this purpose, now called computation tasks. A communication task tci;j will be inserted between ti and tj if and only if ti ?! tj and A(ti ) 6= A(tj ). The latter condition is necessary since we are assuming that the intraprocessor communication time is negligible with respect to interprocessor communication. Step 3 A Markov chain is generated from the MTPG using an algorithm based on the one proposed by Thomasian in [16]. A state in this Markov chain is a tuple which represents the set of concurrently active tasks. This Markov chain can be eciently solved as the states are generated [16, 7]. The execution time can be obtained by applying Little's Law to the initial state of this Markov chain.
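The MTPG transformation in Step 2 has a simple mechanical form; the sketch below is our own illustration of it (data-structure choices and names are ours). Each interprocessor arc is split by a communication task whose duration is the estimated message delay.

```python
# Sketch (ours) of Step 2: build the MTPG by inserting a communication task on
# every precedence arc whose endpoints were assigned to different processors.

def build_mtpg(arcs, assignment, mu, packet_delay):
    """arcs: list of (t_i, t_j) precedence arcs of the original TPG G(pi).
    mu[(t_i, t_j)]: average number of packets sent from t_i to t_j."""
    mtpg_arcs, comm_task_duration = [], {}
    for t_i, t_j in arcs:
        if assignment[t_i] != assignment[t_j]:
            tc = ("comm", t_i, t_j)                            # communication task tc_ij
            comm_task_duration[tc] = mu[(t_i, t_j)] * packet_delay
            mtpg_arcs += [(t_i, tc), (tc, t_j)]
        else:
            # Intraprocessor communication is negligible: keep the original arc.
            mtpg_arcs.append((t_i, t_j))
    return mtpg_arcs, comm_task_duration
```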

5 Performance Analysis

All the new algorithms analyzed use the DES/C envelope. Their heuristics belong to one of the two families described above, namely MFT/C and SEETF/C. Therefore, the resulting algorithms are labeled according to their heuristic. Different algorithms of the same family differ in the task selection criteria they use. The following task selection approaches were considered:

1. The Combined (Comb) task selection heuristic selects task t_k simultaneously with the selection of processor p_j. This means that the Comb MFT/C heuristic selects the pair (t_k, p_j) which will determine the minimum finish time, and the Comb SEETF/C heuristic selects the task-processor pair which yields the shortest estimated execution time.
2. The Largest Task First (LTF) task selection heuristic selects the task from the domain with the largest number of instructions. Only then is the processor selected.
3. The highest Weighted Level First (WLF) criterion selects the task from the task domain which has the highest weighted level [14], also prior to processor selection.
4. The Highest Level First (HLF) criterion selects the task from the task domain with the highest level [8], prior to processor selection.

For the sake of comparison, we also included one of the best algorithms obtained for the original model without communication aspects, which we called DES+MFT. This algorithm uses the DES envelope, which is similar to the DES/C envelope except that it does not consider any communication aspect present in the task system. Also, the MFT heuristic is very similar to the Comb MFT/C heuristic, but communication is not considered in the task-processor selection procedure. Before presenting the curves and the performance analysis which follows, we anticipate some of the results in order to explain why some algorithms are not explicitly represented with individual curves. In most cases, all the heuristics belonging to the same family achieved the same performance, so only the family labels (MFT/C or SEETF/C) are presented. In the few cases where a single heuristic in a family presented differences in performance, its identifying label is shown, while the rest of the heuristics are still represented under the respective family label. It is worth noting that, in the present analysis, the differences in performance are more subtle than in our previous work [8]. This can be explained by the fact that the original algorithms which were extended and modified to include the communication characteristics of the new model were those which achieved the best performance in the previous analysis. The algorithms with poor performance in the original model were not considered in this work.

5.1 Basic Task Graphs

Two basic topologies were used in our analysis, namely MVA [17] (figure 2) and MATRIX [5] (figure 3). Different applications were generated by changing the service demands (number of instructions) of the tasks in each topology. In the MVA topology, different values of the serial fraction were obtained by varying the ratio between the service demands of the tasks on the central vertical axis of the task graph (the dark ones in figure 2) and the service demands of the rest of the tasks in the graph. On the other hand, in the MATRIX case, different F_s values were obtained by varying the ratio between the service demand of the 8 fork tasks and the service demand of the initial and final tasks, which effectively determine the serial portion of the execution. The Communication/Processing Ratio (CPR) is calculated by dividing the total communication demand between consecutive tasks (the sum of the weights of the arcs of the task graph) by the total processing demand (the sum of the service demands of the tasks in the graph). When F_s varies, the CPR also varies. Therefore, in order to obtain a range of values for F_s with a constant value of CPR, it is necessary to determine new values for the communication demand, as sketched below. To simplify this calculation, the communication demand is equally distributed over all the arcs in the task graph.
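The sketch below shows this bookkeeping concretely (our own helper, with illustrative numbers): given fixed service demands and a target CPR, the total communication demand is spread equally over the arcs.

```python
# Sketch (ours) of the CPR calculation described above: choose the per-arc
# communication demand so that CPR = total communication / total processing
# hits a target value, with the demand distributed equally over all arcs.

def per_arc_comm_demand(service_demands, num_arcs, target_cpr):
    total_processing = sum(service_demands)       # sum of task service demands
    total_comm = target_cpr * total_processing    # required total communication demand
    return total_comm / num_arcs                  # equal share for every arc

# Hypothetical example: 10 tasks of 100 instructions each, 18 arcs, target CPR = 1.6.
print(per_arc_comm_demand([100] * 10, num_arcs=18, target_cpr=1.6))  # ~88.9 packets/arc
```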

Figure 2: MVA Topology (the pattern is repeated for x iterations; parallelism increases up to 6 tasks, then decreases)

5.2 Serial Fraction Analysis

We present in this section the numerical results considering the variation of the serial fraction for each of the two kinds of topologies, followed by some concluding remarks.

MVA Topology. In figure 4 (CPR = 2.667, m = 3, PPR = 2), the difference in performance between DES+MFT and the MFT/C family is evident for F_s < 0.5. For applications with a low degree of parallelism (F_s > 0.5), DES+MFT and the algorithms in the MFT/C family yield similar performance. This can be explained by the fact that, under those circumstances, the MFT feature will tend to assign all tasks to the fast processor, eliminating any task communication. The SEETF/C family has the worst performance. This fact demonstrates that DES+MFT is very robust, even though it does not consider any information about the communication between tasks. The difference in performance between the SEETF/C family and DES+MFT is even greater for higher values of the PPR, as illustrated by figure 5 (CPR = 2.667, m = 3, PPR = 4).

Figure 3: MATRIX Topology (an initial task forks into n_fork parallel tasks, which join at a final task)

MATRIX Topology. The three graphs in figures 6, 7 and 8 show the influence of processor contention on the overall execution time. The SEETF/C family curve exhibits a greedy behavior when the number of available processors grows. This means that at each iteration of the DES/C envelope, these heuristics tend to assign all the free processors in the processor domain to all the executable tasks in the task domain. When the number of processors reaches 10, the 8 fork tasks are placed in the task domain after the initial task has been allocated, and the SEETF/C family algorithms schedule these 8 tasks in a single DES/C iteration. The final task only enters the task domain after all fork tasks have been allocated and all the processors are free again. Thus, in this case, the communication has no influence over the assignment decision. The fast processor is also not used conveniently. On the other hand, the difference in performance between DES+MFT and the MFT/C family is due exclusively to the fact that MFT/C considers interprocessor communication. With the increase in processor contention, the SEETF/C curve tends to show a better performance. Finally, when the number of processors is equal to 3 (see figure 8), the performance of SEETF/C is slightly better than that of the DES+MFT algorithm. This happens only because the PPR is still low and the CPR is reasonably high.

5.3 Communication Processing Ratio Analysis

We present in this section the numerical results obtained by varying the Communication/Processing Ratio (CPR) for the two kinds of topologies, followed by some concluding remarks.

Figure 4: MVA; m = 3; PPR = 2; CPR = 2.667 (T_rel versus F_s for SEETF/C, LTF SEETF/C, LTF MFT/C, DES+MFT and MFT/C)

MVA Topology. As in the F_s analysis case, processor contention does not have any significant influence on the performance of applications with an MVA topology. Therefore, we do not present graphs with different values of the number of processors. In figure 9 (m = 3, PPR = 4, F_s = 0.2917), it can be observed that the SEETF/C family exhibits worse performance than the DES+MFT algorithm, even for high CPR values. It is worth noting that interprocessor communication only takes place when consecutive tasks are assigned to different processors. The importance of considering the communication overhead in the scheduling decisions is more apparent for the higher range of CPR values. Thus, there is a significant difference in performance between DES+MFT and the MFT/C family. Nevertheless, it is surprising to notice that even under this circumstance, the DES+MFT algorithm demonstrates once again its superiority with respect to other algorithms which do not have the MFT processor selection procedure.

MATRIX Topology. The graphs in figures 10 (m = 10, PPR = 2, F_s = 0.3794) and 11 (m = 5, PPR = 2, F_s = 0.3794) show the same relative behavior between the SEETF/C, MFT/C and DES+MFT curves that was observed in the serial fraction analysis with respect to processor contention. The performance differences in the present case are not as significant as in the previous one because of the lower value of the PPR. When the number of processors is 5, SEETF/C and DES+MFT have an identical behavior, but when this number reaches 10, SEETF/C is not able to use the heterogeneous processor adequately due to its greedy scheduling decisions, and its performance is inferior. As mentioned before, the main difference between MFT/C and DES+MFT is that the former takes communication into account when making scheduling decisions. The effect of this consideration becomes more apparent as the CPR increases.

Figure 5: MVA; m = 5; PPR = 4; CPR = 2.667 (T_rel versus F_s for Comb SEETF/C, SEETF/C, MFT/C, DES+MFT and Comb MFT/C)

5.4 Processor Power Ratio

We present in this section the numerical results considering the variation of the Processor Power Ratio (PPR) for the two kinds of topologies, followed by some concluding remarks.

MVA Topology. Figure 12 shows the most interesting graph obtained. A very high value of the CPR was necessary to make it possible to notice differences in performance between the MFT/C family and DES+MFT. Nevertheless, DES+MFT presents a superior performance in comparison to the SEETF/C family, although the latter takes interprocessor communication into account when making its scheduling decisions. Again, this demonstrates the robustness of the MFT heuristic. The MFT/C family combines both distinguishing characteristics: the MFT feature, which enables it to take advantage of the heterogeneity of the architecture, and the /C feature present in the DES/C envelope and in the MFT/C heuristics, which enables it to consider more accurate information concerning interprocessor communication.

Figure 6: MATRIX; m = 10; PPR = 2; CPR = 1.6 (T_rel versus F_s for SEETF/C, DES+MFT and MFT/C)

MATRIX Topology. The graph in figure 13 shows that for a sufficiently high value of CPR, the MFT/C family has a better performance than DES+MFT over almost the whole range of PPR values considered. When the PPR is extremely high, DES+MFT shows the same behavior as the MFT/C family, due to its MFT feature, which is able to take advantage of the system heterogeneity. As one can see from figure 13, when the PPR becomes very high, all scheduling algorithms with the MFT feature will tend to serialize the application, i.e. run all tasks on the fast processor. Therefore, DES+MFT and MFT/C will exhibit the same behavior, since the communication will no longer make any difference. Once again, the SEETF/C family exhibits worse performance than DES+MFT, even though the latter does not consider interprocessor communication at all in its scheduling decisions. This demonstrates the greater importance of the MFT processor selection procedure in comparison to the /C aspect, even for communication intensive applications.

Figure 7: MATRIX; m = 5; PPR = 2; CPR = 1.6 (T_rel versus F_s for SEETF/C, DES+MFT and MFT/C)

Figure 8: MATRIX; m = 3; PPR = 2; CPR = 1.6 (T_rel versus F_s for SEETF/C, MFT/C and DES+MFT)

Figure 9: MVA; m = 3; PPR = 4; F_s = 0.2917 (T_rel versus CPR for Comb SEETF/C, SEETF/C, DES+MFT, Comb MFT/C and MFT/C)

Figure 10: MATRIX; m = 10; PPR = 2; F_s = 0.3794 (T_rel versus CPR for SEETF/C, DES+MFT and MFT/C)

Figure 11: MATRIX; m = 5; PPR = 2; F_s = 0.3794 (T_rel versus CPR for SEETF/C, MFT/C and DES+MFT)

Figure 12: MVA; m = 3; CPR = 2.667; F_s = 0.2917 (T_rel versus PPR for Comb SEETF/C, SEETF/C, DES+MFT and MFT/C)

Figure 13: MATRIX; m = 5; CPR = 3.2; F_s = 0.3794 (T_rel versus PPR for SEETF/C, DES+MFT and MFT/C)

6 Concluding Remarks

The advantages of heterogeneous parallel processing have been demonstrated in the recent literature [2, 9]. Recent announcements by supercomputer manufacturers indicate that the industry is moving toward architectures which combine a few powerful CPUs with hundreds or even thousands of VLSI processors. Scheduling plays an important role in taking advantage of the benefits of a heterogeneous architecture. This paper has proposed extensions to static heuristic scheduling algorithms previously proposed by the authors [8]. The extensions modified the Deterministic Execution Simulation (DES) envelope in order to consider interprocessor communication on message passing parallel architectures. The performance of the modified DES, called DES/C, combined with several task and processor selection heuristics was modeled with the use of Markov chain based techniques. The Markov chain is derived automatically from an augmented task graph which represents the application. A tool was developed to generate and solve for the unnormalized Markov chain steady state probabilities as the states are generated. In order to optimize the execution time of the Markov chain solver, a B-tree based data structure was used to store the states at each level of the Markov chain. The family of algorithms which exhibited the best performance when compared with various others is MFT/C, which stands for DES/C as an envelope combined with Minimum Finish Time/C as the processor selection heuristic. As the degree of heterogeneity of the architecture increases, MFT/C type scheduling algorithms perform much better than non-MFT algorithms. The DES+MFT algorithm [8], which does not take communication into account, was compared with all the others, which do consider the communication overhead. It was observed that in most cases DES+MFT outperformed algorithms that did not use an MFT based processor selection heuristic, such as SEETF/C. As applications become more communication intensive, MFT/C algorithms exhibit a much better performance than DES+MFT, as expected. Finally, it was observed that as the degree of heterogeneity of the architecture becomes sufficiently high, MFT/C and DES+MFT behave similarly, since most tasks will tend to be assigned to the fast processor, eliminating most of the communication overhead.

Acknowledgements

The authors would like to express their gratitude to the Department of Computer Science and to the Institute of Advanced Computer Studies of the University of Maryland at College Park where this work was performed while the authors were on leave from their original institutions.


References

[1] Almeida, Virgílio A. and Ivo M. M. Vasconcelos, A Simulation Study of Processor Scheduling Policies in Multiprogrammed Parallel Systems, in Proceedings of the 1991 Summer Computer Simulation Conference (SCSC 91), Baltimore, July 1991.

[2] Andrews, John B. and Constantine D. Polychronopoulos, An Analytical Approach to Performance/Cost Modeling of Parallel Computers, Journal of Parallel and Distributed Computing, 12, 1991.

[3] Bowen, N., C. Nikolaou, and A. Ghafoor, On the Assignment Problem of Arbitrary Process Systems to Heterogeneous Distributed Computer Systems, IEEE Transactions on Computers, 41(3):257-273, 1992.

[4] Ercal, F., J. Ramanujam and P. Sadayappan, Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning, Journal of Parallel and Distributed Computing, 10, 1990.

[5] Fox, Geoffrey C., Mark A. Johnson, Gregory A. Lyzenga, Steve W. Otto, John K. Salmon, and David W. Walker, Solving Problems on Concurrent Processors: General Techniques and Regular Problems, volume 1, Prentice-Hall, Inc., 1988.

[6] Hillis, W. Daniel, The Connection Machine, MIT Press, 1985.

[7] Menasce, D. A., D. Saha, S. C. da Silva Porto, V. A. F. Almeida, and S. K. Tripathi, Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures, Journal of Parallel and Distributed Computing, accepted for publication.

[8] Menasce, Daniel A. and Stella C. S. Porto, Processor Assignment in Heterogeneous Parallel Architectures, in Proceedings of the International Parallel Processing Symposium '92, IEEE Computer Society, March 1992, pages 186-191.

[9] Menasce, Daniel A. and Virgílio Almeida, Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures, in Proceedings of the Supercomputing '90 Conference, New York, November 1990.

[10] Muntean, T. and E-G. Talbi, A Parallel Genetic Algorithm for Process-Processors Mapping, in High Performance Computing II, North-Holland, October 1991. Also Proceedings of the Second Symposium on High Performance Computing, Montpellier, France, 7-9 October 1991, pages 71-82.

[11] Porto, Stella C. S., Heuristic Task Scheduling Algorithms in Multiprocessors with Heterogeneous Architectures: a Systematic Construction and Performance Evaluation, Master's thesis, Departamento de Informatica da PUC-RIO, Rio de Janeiro, July 1991.

[12] Sadayappan, P., F. Ercal and J. Ramanujam, Cluster Partitioning Approaches to Mapping Parallel Programs onto a Hypercube, Parallel Computing, 13, 1990.

[13] Sevcik, Kenneth C., Characterizations of Parallelism in Applications and Their Use in Scheduling, Performance Evaluation Review, 17(1), May 1989.

[14] Shirazi, B., M. Wang, and G. Pathak, Analysis and Evaluation of Heuristic Methods for Static Task Scheduling, Journal of Parallel and Distributed Computing, 10:222-232, November 1990.

[15] Stone, Harold S., High-Performance Computer Architecture, Addison-Wesley Publishing Company, 1987.

[16] Thomasian, Alexander and Paulo Bay, Analytical Queueing Network Models for Parallel Processing of Task Systems, IEEE Transactions on Computers, 35(12), December 1986.

[17] Vaswani, Raj and John Zahorjan, The Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors, Technical Report 91-03-03, Department of Computer Science and Engineering, University of Washington, March 1991.

[18] Veltman, B., B. J. Lageweg, and J. K. Lenstra, Multiprocessor Scheduling with Communication Delays, Parallel Computing, 16, 1990.
