Congress: A Dynamic Distributed Task Allocation Environment

Nickolas J. G. Falkner and Michael J. Oudshoorn
Department of Computer Science
The University of Adelaide
South Australia 5005, AUSTRALIA
{jnick,michael}@cs.adelaide.edu.au

Abstract: The implementation of Annex E (Distributed Systems) of the Ada95 Language Reference Manual is discussed, with a mechanism suggested for the control of distribution in order to achieve optimal use of available resources. This approach uses compile-time complexity analysis of the source to permit the dynamic allocation of tasks to processing nodes based on their assumed computational intensity. The implementation of this structure using the Java Virtual Machine as the compilation target is discussed.

1 Introduction

A heterogeneous node cluster connected by a local area network (LAN) is a viable platform for a distributed system, provided that the allocation of processes to nodes is near-optimal and is managed so that it is responsive to changes in the distributed environment. A critical decision in any distributed system is the allocation of processes to processing nodes. A static allocation can be determined by an analysis of the intercommunication costs and relative processor performance of the component nodes; however, such an allocation fails to respond to changes in the environment. Within a coarse-grain, heterogeneous system there are no guarantees of uniform performance across the nodes.

This paper explores a dynamic node allocation scheme consistent with Annex E of the Ada95 Language Reference Manual (LRM) [1]. It introduces a run-time heuristic for node allocation based on compile-time analysis of the underlying network, the relative processor speed of the component nodes and the complexity of the tasks to be distributed, with the goal of avoiding any modifications to the Ada95 language other than the addition of compiler directives (pragmas). It then explores the Java Virtual Machine (JVM) [6] as a suitable compilation target to demonstrate the viability of the dynamic allocation strategy.

Section 2 discusses the approach to distributed systems in Ada95 espoused in Annex E of the LRM and suggests static alternatives to a dynamic allocation scheme. Section 3 discusses distributed systems and the methods of assessing individual components of the system and the network connections between them. Section 4 introduces the concept of a congress of nodes and motivates a compilation strategy employing a

pre-processor and code-rewriting techniques to take best advantage of the available resources. A congress is a set of heterogeneous nodes and the network which connects them. Within a congress all nodes will accept tasks distributed to them and at least one node accepts responsibility for the distribution. Section 5 discusses the implementation of this environment using the Java Virtual Machine as a compilation target. Section 6 concludes the paper with a discussion of the advantages of a dynamic allocation strategy and explores future directions.

2. Distributed Systems in Ada95

Annex E of the Ada95 Language Reference Manual describes a mechanism for the implementation of distributed systems in Ada95, where a distributed system is defined as an interconnection of one or more processing nodes (with computational and storage capabilities) and zero or more storage nodes (which have only storage capabilities and can be addressed by at least one processing node). An Ada program consists of a set of partitions which are capable of executing in parallel with each other, possibly in separate address spaces and possibly on different nodes. It is possible to designate a subprogram as the main subprogram of a partition, in the sense that, after the environment of the partition has been set up, the main subprogram is the first subprogram executed. It logically follows that a minimal subprogram containing a single Ada95 task is a valid subprogram to execute in a separate partition. Similarly, since the partition may execute on another node, it may run remotely. The allocation of processes to processing nodes is referred to as configuring the partitions of the program. For the purposes of this paper, we consider a restricted set of the Ada95 language for distribution: all language components of Ada95 required to support the task construct, including the task construct itself.

2.1 Relationship to Annex E

The focus of this paper is on the distribution of active partitions, where an active partition has at least one thread of control. This paper is concerned with distributing tasks across the system rather than the separate problem of maintaining distributed data structures. Of particular interest are the consistency of the system, the (effective) calling of remote routines, and the communication mechanism used for the transmission of data and the reception of results.
Annex E of the Ada95 LRM outlines a distribution mechanism with scope for a number of implementations, employing static or dynamic allocation of processes to nodes. The approach discussed in this paper is consistent with Annex E.

2.2 Consistency Issues

It is essential in any distributed system that all distributed processes are derived from the same version of the system. It is undesirable for any distributed task to outlive the distributed program that produced it, especially if the source code has been recompiled to alter the behaviour of the program; unpredictable behaviour could result.

Annex E suggests two program unit attributes, Version and Body_Version. These give the version of the compilation unit that contains the declaration and the body of the program unit, respectively. If these values match those of the other distributed program units then the program is consistent.

In addition to version control, the other issue is the lifetime of a remote task. Although a distributed task should not continue executing beyond the completion of the distributed program, it may fail to terminate due to an error in the task. If a new version of the code commences execution and is distributed, there may be two versions of the same process executing simultaneously and, although the first version may never be called, it may still attempt to communicate with the now updated instance of a task on another node. If an expired task, one with an older version number, attempts to communicate with a current task, it recognises the consistency problem and terminates cleanly. Hence, the first action carried out by communicating tasks can be rendered into an algorithm. Given two remote tasks, A and B, which exist on separate nodes:
1. Tasks A and B must establish a rendezvous to communicate.
2. Tasks A and B exchange Version information.
3. If the Version numbers are identical then normal execution continues; otherwise the expired task terminates its connections with the newer task.
4. If an expired task is identified, it attempts to complete any outstanding internal processing and communication with other tasks.

In theory, such an expired task cannot be written, as the original program would also not terminate. However, if a partition of the program crashes while other partitions are active, then these partitions may wait, unaware of the failure, until such time as outstanding rendezvous attempts are made or processing completes.
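The four-step handshake above can be sketched in a few lines. This is a minimal illustration of the control flow only; the Task class, rendezvous method and version fields are hypothetical stand-ins for the Ada95 constructs, not the paper's implementation.

```python
# Sketch of the version-consistency handshake. All names here are
# illustrative stand-ins for the Ada95 rendezvous and the Version
# attribute described in Annex E.

class Task:
    def __init__(self, name, version):
        self.name = name
        self.version = version      # corresponds to the Version attribute
        self.terminated = False

    def rendezvous(self, other):
        """Exchange Version information before any normal communication."""
        if self.version == other.version:
            return True             # versions match: continue normally
        # The expired (older) task terminates its connections cleanly.
        expired = self if self.version < other.version else other
        expired.terminate()
        return False

    def terminate(self):
        # An expired task completes outstanding work, then stops.
        self.terminated = True

a = Task("A", version=2)
b = Task("B", version=1)
assert not a.rendezvous(b)          # mismatch detected
assert b.terminated and not a.terminated
```

The newer task continues unharmed; only the expired instance shuts down, matching step 4 of the algorithm.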
This motivates the use of a control host which maintains and manages the system, providing both initial distribution control and ongoing distribution management, including the termination of rogue tasks, namely tasks that can no longer be guaranteed to perform in a predictable fashion.

2.3 The Ada95 Task Construct

The existing syntax for the Ada95 task makes no direct reference to locating the task within a particular partition; the final location and task environment are left to the implementor. A static solution to the node allocation problem is to extend the existing definition of a task in two ways. The first is to add compiler support for static allocation by providing a list of hosts suitable for distribution and assigning 'marked' tasks to hosts. The second is to provide a language construct for designating tasks as suitable for, or requiring, automatic distribution. A suggested solution for compiler support is a Distribution_Hosts pragma, which provides the compiler with a list of hosts and their associated performance metrics. (These metrics are developed further in the next section.)

The second component of the solution is the addition of a distributed task construct as a natural extension to Ada95's task construct, thereby explicitly indicating that the task is to be distributed. Together these two components provide a basis for the static allocation of tasks. To make this function effectively we require a scheduling algorithm which assigns more tasks to the more capable nodes in the list and, in particular, assigns the more intensive tasks to the nodes better suited to them. Already the need to refine and develop the simple proposed language solution has arisen; possibly the addition of an on hostname clause, which forces the distribution onto a more suitable node (if available), should be considered. The primary issues with respect to static allocation are that:
• the environment may be evolving (either by varying processor load or network saturation), and
• the behaviour of tasks will be dependent on the size of the data set they execute with.
Thus a task with O(n^4) time complexity only justifies distribution to a high performance node if n is suitably large. A static allocation of this task to a high performance node may commit the resources of the node inefficiently. Alternatively, if two computationally expensive tasks are assigned to the same high performance node in the belief that one will complete before the other begins, competition between the tasks will take place should this assumption fail. The next section discusses the metrics required to carry out the complex task of load balancing the execution of a distributed program and further motivates the need for a dynamic solution.

3 Measures of Performance

The distributed system proposed by this paper is composed of two distinct elements:
• the processor nodes (of no fixed type), and
• the network that connects them (again of no fixed type).
The assumption is made that any processor can service a distributed task or execute the main partition, although no assumption is made about a minimum performance standard. Similarly, all nodes are assumed to be connected by a network which can transfer packets of arbitrary size, with no assumption made regarding delivery time.

3.1 Relative Node Performance

The question of computational performance as an absolute is not relevant: the fact that nodes are slow relative to an external benchmark makes no difference to how the distribution of tasks is carried out. Rather, the relative performance of each node within the distributed system is what matters, as slow processors should not be used for expensive tasks and fast processors should not be wasted on tasks of low priority or low computational intensity, particularly if the fast processors incur high communication costs [2]. It is important to determine an approximation of the relative performance measurements in order to drive an appropriate load balancing algorithm. Similar metrics are essential for a dynamic solution, as the load balancing is carried out 'on the fly'.

On the assumption that these performance measures may be carried out a number of times, possibly during the execution of a distributed program, there is a set of requirements that the performance measurement system must satisfy:
1. The measures must produce reliable results.
2. They must measure the performance of the areas of concern to the program (a measure of integer performance is of little interest when the tasks are floating-point intensive).
3. The measurement technique must be non-trivial.
4. The measurements must complete in a short time (as they may be executed repeatedly and are always run at start-up).

Requirements 3 and 4 potentially compete with each other, as non-trivial measurement code which adequately evaluates some nodes may not be suitable for testing others. For example, if a multiprocessor supercomputer is being evaluated as part of a node group comprising significantly less powerful machines, then a task which is non-trivial on the supercomputer node may take some minutes to complete on the other nodes. A simple and naive solution is to use iterative loops employing both integer and floating point arithmetic. Since, at this stage, no decision has been made about the target architecture, no conjecture can be made about what level of optimization will be carried out, such as whether any loop unrolling or additional pipelining is applied [7]. For example, using two simple 1,000,000-iteration loops, one performing integer multiplication and one performing floating point multiplication, on a set of three Sun Microsystems machines, the average timing results in Table 1 were obtained. Although there is the possibility that the Ultra is being undertasked by this measure, the time to complete the task on the SparcStation 2 is greater than a second for a single trial of the floating point test.
If a supercomputer was added to the group it would have to be explicitly labelled as such to avoid the test software swamping the less powerful nodes. Therefore, we add a final requirement to the list:
5. Bearing in mind 3 and 4, the performance measure must be sufficiently complex to cause different nodes to produce noticeably different measures.
Thus, if a node is either orders of magnitude faster in completing tasks or does not degrade in performance as the complexity of the test program increases, then it is labelled as 'high performance' and is not tested again unless it fails to complete tasks within an expected time. The final clause of 5 becomes important when load balancing, and an expectation of task completion time based on historical data, is discussed.

Machine        Int    FP     Total
Sparc 2        0.69   1.27   1.96
Sparc 10/41    0.16   0.27   0.43
Ultra 140      0.12   0.19   0.31

Table 1 - Performance Results (seconds).
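A minimal version of the naive benchmark described above can be written directly: two 1,000,000-iteration loops, one integer and one floating point, timed with a monotonic clock. The loop bodies are our own choices; absolute numbers will of course differ from the 1990s SPARC results in Table 1, but the Int/FP/Total structure is the same.

```python
# A sketch of the naive two-loop benchmark: time an integer loop and a
# floating-point loop of 1,000,000 iterations each.
import time

N = 1_000_000

def int_loop():
    x = 1
    for _ in range(N):
        x = (x * 3) % 1_000_003      # integer multiplication
    return x

def fp_loop():
    x = 1.0
    for _ in range(N):
        x = (x * 1.0000001) % 10.0   # floating-point multiplication
    return x

def timed(f):
    start = time.perf_counter()
    f()
    return time.perf_counter() - start

int_time = timed(int_loop)
fp_time = timed(fp_loop)
print(f"Int {int_time:.2f}s  FP {fp_time:.2f}s  Total {int_time + fp_time:.2f}s")
```

Such a measure satisfies requirements 1 and 4 on commodity hardware, but, as the text notes, it risks undertasking a very fast node.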

3.2 Network Performance

As no assumptions about the type or capacity of the network connecting the nodes have been made, it is essential to assess the effect the network will have upon the distribution. A very fast node on a very slow network will be an inefficient choice for high-intensity, large-data-set tasks, as the time taken to traverse the network may drop the effective performance below that of a slower node located on a fast network. We define network distance as an approximate measure of the network connection between nodes (see Table 2). If this model is extended to cover wide area networks (WANs) then the effective bandwidths mentioned above become critical, as there are quality-of-service issues as well as concerns about saturation effects in older networking technologies. For example, if a node in Sydney can maintain a 10 Mbps connection with Perth then it will be rated as a medium-distance node by the Perth node (and vice versa). Compare this with a node in Adelaide communicating with a node in regional South Australia via a shared 2 Mbps link: despite being geographically closer, its network distance is regarded as far. A simple measure can be obtained by transferring the maximum packet size the link is capable of carrying a fixed number of times and measuring the round-trip time. This can be done using each node's interface information and the Unix ping command to send packets to other nodes.

3.3 Combining Measures

Combining the two measures of processor performance and network distance admits a rating of individual nodes in a simple way. With static allocation we are committed to this model for the duration of the execution − indeed, for every execution after a given compilation. For a dynamic allocation, however, it provides a first approximation to the solution of the load balancing problem.
Looking at the combination of measures, we can produce a set of nodes rated as Peak, High, Average and Low based on their performance (where a Peak node outperforms other nodes to the extent that it has no relative rating), and as Close, Medium and Far by their network distance. Thus, a supercomputer located in Canberra may be Peak Medium while a high performance workstation in Adelaide may be High Close. Given the size of the datasets being used, the node in Adelaide may be the preferred choice for intensive computation, as the likely saving in communication time is high.

Type of Network                                                  Network Distance
High Speed Point-to-Point LAN (ATM)                              Close
High Speed Broadcast LAN (unsaturated 100 Mbps)                  Close
Low Speed LAN (10 Mbps) or saturated High Speed Broadcast LAN    Medium
Any connection with an effective throughput lower than 5 Mbps    Far

Table 2 - Network Distance.
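The Table 2 categories can be expressed as a simple classification over a measured effective throughput. The thresholds follow the table; the function name and parameters are our own, and a real implementation would derive the throughput from the ping-based round-trip measurement described above.

```python
# Map a measured effective throughput (in Mbps) onto the network-distance
# categories of Table 2. Thresholds are taken directly from the table.
def network_distance(mbps, saturated=False, point_to_point=False):
    if point_to_point or (mbps >= 100 and not saturated):
        return "Close"       # ATM point-to-point, or unsaturated 100 Mbps LAN
    if mbps >= 5:
        return "Medium"      # 10 Mbps LAN, or saturated high-speed LAN
    return "Far"             # effective throughput below 5 Mbps

assert network_distance(155, point_to_point=True) == "Close"   # ATM link
assert network_distance(10) == "Medium"    # Sydney-Perth 10 Mbps example
assert network_distance(2) == "Far"        # shared 2 Mbps regional link
```

The two worked examples from the text (Sydney-Perth at 10 Mbps, Adelaide-regional at 2 Mbps) fall into Medium and Far respectively, as expected.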

The advantage of a dynamic strategy is that it can respond to changes in the environment. For example, if the ATM link to Canberra collapses and the link speed drops back to, say, shared 10 Mbps, then the Canberra node is now Peak Far and should not be used except where absolutely necessary. Detecting this requires either regular polling of performance measures (which is wasteful and mostly unnecessary) or an estimate of completion time which is used to check the performance of tasks during execution. If we can measure the relative performance of a node and, more importantly, the comparative complexity of the task, then we can estimate its completion time. Knowing the size of the data being transferred to and from the remote node allows the incorporation of communication time within the expected time to complete. Although this is an approximate time, it provides a minimum-level-of-service estimate. If a task returns outside this time then both the node and the network should be re-polled to determine their current performance levels and the node reclassified appropriately. If a node is reclassified to a lower state it is said to be demoted. If a task returns well within time, the node is also reassessed and may be reclassified to a higher state (promoted) for processor or network performance. If a node is repeatedly promoted and demoted and fails to achieve a steady state, then it is locked to the lower setting and is excluded from promotion for a fixed time.

The requirement for the collection of performance data and the control of distributed tasks again motivates the need for a designated control host. The control host runs the distributed program efficiently and effectively, thereby eliminating the problem of n nodes attempting a possibly contentious load balancing scheme and failing to find a solution. This is discussed in more detail in the next section.
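The promote/demote/lock cycle can be sketched as a small state machine. This is a hedged illustration, not the paper's implementation: it borrows the concrete promotion threshold given later in Section 3.6 (three tasks completed in under a quarter of the estimate), and the oscillation lock is simplified to a count of rating changes.

```python
# Sketch of node promotion/demotion. Thresholds (3 fast tasks, 1/4 of
# the estimate) follow Section 3.6's guideline; the oscillation-lock
# rule (lock after 4 rating changes) is our own simplification.
RATINGS = ["Low", "Average", "High", "Peak"]

class Node:
    def __init__(self, rating):
        self.rating = rating
        self.fast_finishes = 0   # tasks finished in under 1/4 of estimate
        self.flips = 0
        self.locked = False

    def report(self, actual, estimate):
        if self.locked:
            return
        if actual > estimate:               # missed the estimate: demote
            self._move(-1)
            self.fast_finishes = 0
        elif actual < estimate / 4:         # well within time
            self.fast_finishes += 1
            if self.fast_finishes >= 3:     # 3 fast tasks => promote
                self._move(+1)
                self.fast_finishes = 0

    def _move(self, step):
        i = RATINGS.index(self.rating)
        j = min(max(i + step, 0), len(RATINGS) - 1)
        if j != i:
            self.rating = RATINGS[j]
            self.flips += 1
            if self.flips >= 4:             # oscillating: lock at this level
                self.locked = True

n = Node("Average")
for _ in range(3):
    n.report(actual=1.0, estimate=5.0)      # three fast completions
assert n.rating == "High"
n.report(actual=10.0, estimate=5.0)         # a miss demotes it again
assert n.rating == "Average"
```

The large tolerance in the estimates (Section 3.5) means the `actual > estimate` branch fires only on significant slowdowns, which damps oscillation before the lock is ever needed.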
The final issue for performance measurement is the relative complexity of the tasks involved in the computation. The complexity analysis of the tasks is necessary since a task of O(n) complexity typically requires fewer resources than a task of O(n^4) using the same dataset. This also forces a change in allocation strategy, as the second task is more likely to require a peak or high performance node. Thus, as well as knowing the nodes and network of the system, the computational requirement of component tasks relative to their dataset sizes during program execution must also be determined.

3.4 A Pre-processing System for Assessing System Performance

A pre-processing system for the dynamic allocation scheme must, given source code and a set of nodes for distribution, perform the following actions:
1. Carry out an analysis of the network connecting the nodes (see Section 4).
2. Carry out performance analysis of the component nodes.
3. Perform task complexity analysis on the source code using heuristics (see Section 3.4.1).
4. Using the results of 1, 2 and 3, modify the source code to mark tasks as distributed or not and to initialize the node rating table. Identify the control host accordingly.
5. Pass the modified source code to the compiler.
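The five steps above can be sketched as a driver function. Every helper here is a stub with a hypothetical name; real implementations would run the network, benchmark and complexity analyses described in Sections 3 and 4 and emit rewritten Ada95 source.

```python
# A skeletal driver for the five pre-processing steps. All helpers are
# illustrative stubs, not the paper's implementation.
def analyse_network(nodes):      return {n: "Close" for n in nodes}     # step 1
def analyse_performance(nodes):  return {n: "Average" for n in nodes}   # step 2
def analyse_tasks(source):       return {"worker": 3}  # task -> O(n^k) exponent
def rewrite(source, net, perf, cx):
    # Step 4: mark tasks as distributed and embed the node rating table.
    return source + f"\n-- rating table: {perf}\n-- complexity: {cx}"
def compile_source(new_source):  return ("object-code", new_source)     # step 5

def preprocess(source, nodes):
    net  = analyse_network(nodes)
    perf = analyse_performance(nodes)
    cx   = analyse_tasks(source)
    new_source = rewrite(source, net, perf, cx)
    return compile_source(new_source)

obj, src = preprocess("task body Worker is ...", ["hostA", "hostB"])
assert "rating table" in src
```

The point of the structure is that steps 1-3 are independent measurements whose results only meet in the rewriting step, which is what allows them to be re-run individually at run time.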

This new program is now compiled and commences execution with an acceptable approximation to a balanced solution. This scheme assumes that there is no significant delay between first compilation and execution, as the original performance measures lose validity as the delay increases. If the system had static performance characteristics there would be no requirement for dynamic load balancing.

3.4.1 A Discussion of Step 3 - Heuristic Based Complexity Analysis

The processor overhead of a detailed complexity analysis of each task overshadows the cost savings implied by a dynamic strategy. A simpler solution is to identify those elements which will dominate the task during execution and establish a 'rough' complexity analysis based on these [5]. The analysis must provide a measure which is relative to the volume of the data being passed to the task, as this must be taken into account when considering network transmission costs. The complexity analysis is undertaken as follows:
1. Identify the tasks within the source code.
2. Step through each task and use a purpose-built parser to analyse the number of lines containing operations, and the number and type of operations.
3. Using this, and any other information available about the type and size of the data passed to the task, estimate how many operations will be carried out per line.
4. If an estimator has approximate complexity mO(n^K) then it is assigned order n^K, unless m closely approaches cn for some integer c, in which case it is assigned order n^(K+1).
5. Take all remaining estimates and generate an upper bound complexity estimator which is always at least as great as the sum of the estimates.
6. Retain this single estimate and pass it to the code rewriter to insert into the new source code.
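Steps 4 and 5 can be made concrete as follows. This is a sketch under our own reading of the rule: a multiplier that grows with n bumps the exponent by one, and the single upper-bound estimator is taken as the largest resulting exponent, which dominates the sum of the parts for large n. The function names are ours.

```python
# Sketch of steps 4 and 5 of the heuristic complexity analysis.
def assign_order(m, k, n):
    # m * O(n^k): if m closely approaches c*n for some integer c,
    # fold the multiplier into the order, giving n^(k+1).
    return k + 1 if m >= n else k

def upper_bound(estimates, n):
    # A single estimator at least as great as the sum of the estimates:
    # for large n the largest exponent dominates.
    return max(assign_order(m, k, n) for m, k in estimates)

# e.g. a loop nest costing 2n * O(n^2) is treated as order n^3
assert assign_order(2 * 100, 2, 100) == 3
assert upper_bound([(5, 1), (200, 2)], 100) == 3
```

Keeping only the exponent is what makes the later task-load arithmetic in Section 3.6 cheap: log_n of the estimator is just this integer.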
Using a code rewriter, the complexity measures can be inserted as part of a constant one-dimensional array indexed by task name and containing the complexity estimator as the cell contents.

3.5 A Simple Scheduling Algorithm for the Dynamic Allocation Scheme

Once an estimate of complexity is obtained, an initial scheduling policy can be generated. Given a list of hosts with relevant performance measures, a lookup function which specifies the best host for a given task is produced. The lookup function takes a task name and a dataset size as parameters and returns a record containing a hostname and an estimated duration. The lookup function can access all of the task complexities, as these were determined during the pre-processing stage and compiled into the amended source code. As the complexity analysis is only an estimator of actual task complexity, the completion time is estimated with a large tolerance: an overestimation of a processor's capacity is only detected if the task runs significantly more slowly. This has the beneficial side-effect of preventing load balancing oscillations from occurring as

processors undergo small-scale performance fluctuations or networks have variations in traffic. Internally, the lookup function tracks not only the performance characteristics of the nodes but also how many tasks are currently allocated to each node. To do this, a check-in/check-out system is introduced for monitoring which tasks are allocated to which nodes. This also provides a mechanism for easily checking the completion time of tasks and, hence, whether they are achieving their performance estimates, as they check out prior to execution and check in on completion. When the lookup function probes the check register it is provided with information on how well tasks have been keeping to their performance estimates. The complexity of each task is also retained, so that the relative loading of a node can be estimated by determining how many high- and low-complexity tasks it is executing. Ideally, each node should be undertaking only one task which occupies its resources to near capacity. Thus, if two nodes have identical characteristics but one already has a task assigned to it, then the other will be selected when a second task requires scheduling. If two nodes are identical for all intents and purposes then an arbitrary choice is made [3].

3.6 Application of the Dynamic Scheduling Algorithm

The behaviour of a node may vary during execution and the initial estimate of performance may be in error. Recall that the initial estimates are optimistic. During execution, the lookup function consults the complexity data to assess the computational intensity of the task and checks the size of the dataset to determine network load. The lookup function then consults the taskable queue to find a suitable node for the task. The taskable queue is a table which contains all of the nodes and their performance characteristics. It also contains a measure of the task load already allocated to each node.
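The check-in/check-out register described above can be sketched as a small bookkeeping class. The class and method names are ours; the essential behaviour is that tasks check out before execution and check in on completion, so the register can report both per-node load and whether estimates were met.

```python
# A minimal check-in/check-out register: check out on dispatch,
# check in on completion, query load per node. Names are illustrative.
import time

class CheckRegister:
    def __init__(self):
        self.running = {}      # task -> (node, start time, estimate)
        self.history = []      # (task, node, actual, estimate)

    def check_out(self, task, node, estimate):
        self.running[task] = (node, time.perf_counter(), estimate)

    def check_in(self, task):
        node, start, estimate = self.running.pop(task)
        actual = time.perf_counter() - start
        self.history.append((task, node, actual, estimate))
        return actual <= estimate       # did it meet its estimate?

    def load(self, node):
        # How many tasks are currently allocated to this node.
        return sum(1 for n, _, _ in self.running.values() if n == node)

reg = CheckRegister()
reg.check_out("t1", "hostA", estimate=10.0)
reg.check_out("t2", "hostA", estimate=10.0)
assert reg.load("hostA") == 2
assert reg.check_in("t1")               # finished well within 10 s
assert reg.load("hostA") == 1
```

The `history` list is what the lookup function probes for the performance-to-estimate comparison, and what the bold-adjustment rule of Section 3.6 would draw on.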
For each task assigned to a node, its individual task load is the log_n of its complexity estimator multiplied by the log_10 of the size of the data; hence a task with a complexity estimator O_e(n^3) and 10 items in the dataset has an individual task load of log_10(10) x log_n(n^3) = 1 x 3 = 3. The task load of a node is the sum of the individual task loads, normalised according to the processor type. Each processor category is assumed to be capable of handling a task load 1-2 times greater than that of its immediately inferior processor. Thus, a task load of 4 is 4 on a low performance node, 3 on an average node, 2 on a high performance node and 0-1 on a peak performance processor. The taskable queue is sorted by normalised task load, then by processor rating, and finally by network performance. Hence, a High performance processor with an actual task load of 3 will be placed ahead of a Low performance processor with an actual task load of 3. The lookup function scans the taskable queue using an 'accept first-tolerable' algorithm, outlined below.
1. Identify the minimum task requirements, namely that a large dataset is allocated to a close node and complex computations are allocated to fast nodes.
2. Get the first queue element.

3. If it meets the processor and network requirements, accept it, update the task load, re-order the taskable queue and then return.
4. If not, examine the next entry. If this is not the final entry, go to 3.
5. Always accept the final entry. Update the task load, re-order the table and return.

When the lookup function returns, the selected task is immediately launched to the node specified in the record, with the calculated duration. The calculated duration defines the sole performance requirement for the executing system. When a node independently fails to meet performance requirements, the assessment programs are automatically launched at that node and the node is removed from the taskable queue. If the results indicate that the node should retain its current performance rating, then it is returned to the taskable queue as is, but a note is made of its failure to perform. If the results indicate a degradation of either processor or network performance, the node is demoted and then re-entered into the taskable queue. If all tasks fail to meet, or all exceed, requirements and there is a detectable constant factor of non-performance or over-performance, then the historical data is used to carry out a bold adjustment: a weighting factor, delta, is introduced into the completion time calculations to adjust them by the predicted delay factor. The guideline for promotion is that a node undercuts the completion time requirement for a series of tasks; at least three tasks should be completed in a quarter of the calculated completion time to indicate consistent behaviour and eligibility for promotion. The core of this scheme is that it uses simple (inexpensive) arithmetic and simple rules to provide a fast and efficient means of allocating processes to nodes. The management of this scheme demands a consistent, fast management process located somewhere within the distributed system.
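The task-load formula and the 'accept first-tolerable' scan above can be sketched together. All names are ours; the taskable queue is assumed to be pre-sorted as described (by normalised load, then rating, then network distance), and the re-sort after acceptance is omitted for brevity. Note that for an estimator O_e(n^k), log_n(n^k) is simply k.

```python
# Sketch of Section 3.6: individual task load plus the
# 'accept first-tolerable' queue scan. Names are illustrative.
import math

def individual_load(k, data_size):
    # log_10(|data|) * log_n(n^k) = log_10(|data|) * k
    return k * math.log10(data_size)

RATING = {"Low": 0, "Average": 1, "High": 2, "Peak": 3}
DISTANCE = {"Far": 0, "Medium": 1, "Close": 2}

def tolerable(node, needs):
    return (RATING[node["rating"]] >= RATING[needs["rating"]] and
            DISTANCE[node["distance"]] >= DISTANCE[needs["distance"]])

def accept_first_tolerable(queue, needs, load):
    # Steps 2-5: scan in order, accept the first tolerable node;
    # the final entry is always accepted. (Re-sorting the queue after
    # acceptance, step 3 of the text, is omitted here.)
    for node in queue[:-1]:
        if tolerable(node, needs):
            node["load"] += load
            return node
    queue[-1]["load"] += load
    return queue[-1]

# The worked example: O_e(n^3) with 10 data items gives 1 x 3 = 3.
assert individual_load(3, 10) == 3.0

queue = [{"name": "a", "rating": "Average", "distance": "Far",   "load": 0},
         {"name": "b", "rating": "High",    "distance": "Close", "load": 1},
         {"name": "c", "rating": "Low",     "distance": "Close", "load": 3}]
needs = {"rating": "High", "distance": "Close"}
chosen = accept_first_tolerable(queue, needs, load=3)
assert chosen["name"] == "b" and chosen["load"] == 4
```

The always-accept-the-last-entry rule guarantees the scan terminates with an assignment even when no node meets the requirements, which is what keeps the scheduler inexpensive.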
This is the final justification for the existence of a control host and is discussed in the next section.

3.7 The Distributed Task Construct with Dynamic Scheduling

Recall the distributed task construct introduced in Section 2.3. This is still valid as a compile-time language construct if it is added by the pre-processor rather than by the programmer. The dynamic scheduling aspect means that there is no need to specify where a particular task is assigned. The pragma is still necessary to alert the pre-processor to the need to prepare the program for processing.

4. A Congress of Nodes

The distributed system described above has now been shown to have (or require) the following characteristics:
1. A heterogeneous set of nodes.
2. A heterogeneous network connecting them.
3. A scheduling algorithm.
4. A reliable and correct management process to manage the scheduling algorithm.

The managing process occupies a fixed position within the system, as it must be available at all times to provide management − it can never be in transit between nodes. It is possible to have two nodes sharing the responsibility for a set of tasks but

this requires either defined areas of control (each node manages only a subset of the complete task set) or 'on-the-fly' arbitration to decide which manager's decisions take effect in the case of a clash. To simplify the system, only a single management node is considered; this is designated the control host. This node/control-host set connected by a network is referred to as a First Level Congress of Nodes, where:
1. No node communicates directly with another node unless it is a Control Host.
2. The Control Host is connected to every node.
3. No assumption is made about node type or network type.

Figure 1 illustrates a First Level Congress. The term congress is used to illustrate the idea of a diverse group of nodes performing program execution in a co-operative manner. The First Level Congress structure is called a Spider due to its (passing) physical resemblance to an arachnid. The key point is that all nodes are at most one hop from the control host − the control host itself may be used as a node. Recall that the main program environment is resident on one node while the tasks may be distributed across other nodes. In addition, prior to execution the source code must be pre-processed and compiled. It follows that the control host is responsible for this and continues system management throughout the life of the program, including the termination of expired tasks when a new version of the program commences execution. The data integrity of the previous program, which generated the expired tasks, is guaranteed by the version number attributes.

4.1 An n-th Level Congress of Nodes

To adequately implement the defined partition system of Ada95 it is necessary to have the ability to divide the execution environment of the master program into logically distinct, partitionable segments.

[Figure: a control host connected by ATM, 100BaseTX, 100BaseFL and 10BaseT links to SS2, SS10 and x86 nodes.]

Figure 1. - A First Level Congress of Nodes (A Spider).

Consider extending the number of control hosts, with each control host specifically allocated a subset of the tasks to manage. Every control host must be capable of providing a reliable service and, given that there may be data structures shared between control hosts to provide the greater program environment, each control host must be connected by a 'close' network arrangement. Each control host must have a reliable connection to every other control host. From there, each control host has a single connection to each of its allocated nodes. A group of control hosts is logically identical to a single control host with multiple processors. There are no contention issues, as no node will be allocated to more than one control host at a time.

4.1.1 Scheduling in an n-th Level Congress

The same scheduling algorithm is used in an n-th level congress as in a spider, except that every control host manages a subset of the nodes. Although all nodes are still available for any control host to use, any node under the management of another control host is automatically classified with a network distance of far. The intercommunication between control hosts is outside the scope of the dynamic scheduling algorithm.
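The n-th level rule above amounts to a one-line override of the measured network distance. A minimal sketch, with names of our own choosing:

```python
# A node managed by another control host is automatically treated as
# network-distance Far, regardless of its measured distance.
def effective_distance(node, my_nodes, measured):
    return measured[node] if node in my_nodes else "Far"

my_nodes = {"n1", "n2"}
measured = {"n1": "Close", "n2": "Medium", "n3": "Close"}
assert effective_distance("n1", my_nodes, measured) == "Close"
assert effective_distance("n3", my_nodes, measured) == "Far"  # other host's node
```

This keeps the scheduling algorithm unchanged while strongly biasing each control host toward its own nodes.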

5 Using the Java Virtual Machine as a Compilation Target

Existing commercial work has already proven that Ada95 can be implemented in Java Byte Code. The AppletMagic compiler from Intermetrics has demonstrated that this is feasible, although support for multi-tasking has to date been provided by a direct translation of Ada95 multi-tasking to Java multi-threading, with no attempt to distribute threads of control across different systems. The advantage of using the JVM as a target architecture is that there is no requirement to recompile for each physical architecture: provided the JVM has been ported to a physical architecture, the code may be distributed to it. Figure 2 illustrates the compilation mechanism of Congressional Ada95.

6 Conclusions and Further Work

Heterogeneous computing environments remain suitable architectures for the distribution of programs in an effort to gain performance enhancements. Annex E of the Ada95 LRM defines an implementable solution to the distributed systems problem, namely the use of partitions to encapsulate executing portions of the run-time system. A congress of nodes provides a low-overhead, reliably managed system for implementing partitions which scales with the size of the distributed program and the number of distinct processors required, without overloading the control hosts. The work developed to date has dealt with the ‘Spider’ structure, a First Level Congress of Nodes. Although a basis has been provided for an n-th Level Congress, work is still in progress and results are expected shortly for a 2nd Level Congress.

[Figure 2 depicts the compilation stage: Ada95 source code undergoes performance analysis and complexity analysis, new source code is generated, and this is compiled to Java Byte Codes, producing executable code (in Java Byte Code, with all code located on the compiling host).]

Figure 2 - The Compiler Structure for the Ada95 to Java Byte Code compiler.
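The pipeline of Figure 2 can be sketched as a sequence of source-to-source stages followed by a back end. The stage bodies below are placeholder stubs, since the paper does not describe the compiler's internals; only the stage names and their order come from Figure 2:

```python
# Placeholder stages: each annotates the source in lieu of real analysis.
def performance_analysis(src):
    return src + "\n-- network/processor measurements attached"

def complexity_analysis(src):
    return src + "\n-- per-task complexity estimates attached"

def generate_new_source(src):
    return src + "\n-- distribution pragmas inserted"

def compile_to_byte_code(src):
    return src.encode("utf-8")  # stand-in for the real Ada-to-JVM back end

STAGES = [performance_analysis, complexity_analysis,
          generate_new_source, compile_to_byte_code]

def compile_congressional_ada(source):
    """Run the source through every stage of Figure 2, in order."""
    artifact = source
    for stage in STAGES:
        artifact = stage(artifact)
    return artifact

byte_code = compile_congressional_ada("procedure Main is begin null; end Main;")
assert isinstance(byte_code, bytes)  # all output resides on the compiling host
```

The key structural point is that distribution decisions are injected as generated source (pragmas) before the ordinary compile step, so the back end itself needs no knowledge of the congress.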

Experiments with a First Level Congress of 6 nodes and a set of 50 tasks yield the results shown in Table 3. The entries in each column indicate the number of tasks assigned to each node during the execution of the trial. Note that the final time to execute the 50 tasks on 6 nodes is 25.8% of the time taken to execute them on a single Sun UltraSparc 140 [4]. Refinement of the complexity analysis heuristics and improvement of the completion-time estimation heuristics will ensure near-optimal use of the available resources.
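The 25.8% figure can be checked directly from the per-set execution times reported in Table 3:

```python
# Total execution times (seconds) for node sets 0 through 4, from Table 3.
times = [153.2, 65.3, 52.2, 46.8, 39.6]

# Ratio of the six-node time (set 4) to the single-U140 time (set 0).
ratio = times[-1] / times[0]
assert round(100 * ratio, 1) == 25.8  # matches the figure quoted in the text
```

Equivalently, the six-node configuration delivers roughly a 3.9x speedup over the single UltraSparc 140.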

Node       Set 0   Set 1   Set 2   Set 3   Set 4
U140          50      15       9       6       6
U2170          X      21      17      16      15
U2170          X       X       X       X      12
Pentium        X      14      10      10       6
SS10           X       X      14      13       7
SS2            X       X       X       5       4
Time (s)   153.2    65.3    52.2    46.8    39.6

Table 3: The Performance of Congress.

Acknowledgments

The authors thank Ms Katrina Kerry for her assistance in the initial discussion of this work.

References

[1] Ada95 Reference Manual, International Standard ANSI/ISO/IEC-8652:1995, January 1995.
[2] S. G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
[3] V. C. Barbosa, An Introduction to Distributed Algorithms, The MIT Press, Cambridge, Massachusetts, 1996.
[4] N. J. G. Falkner, Distribution in Ada95 Supported by the Java Virtual Machine, Masters Thesis, Department of Computer Science, The University of Adelaide.
[5] N. J. G. Falkner and M. J. Oudshoorn, A Discussion of Performance Based Dynamic Scheduling in Distributed System, Departmental Technical Report TR97-09, Department of Computer Science, The University of Adelaide.
[6] T. Lindholm and F. Yellin, The Java™ Virtual Machine Specification, Addison-Wesley, September 1996.
[7] C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic Publishers, 1988.