
Programming Environments for Parallel Computing: A Comparison of CPS, Linda, P4, PVM, POSYBL, and TCGMSG

Timothy G. Mattson
Intel Corporation, Supercomputer Systems Division
Beaverton, OR 97006
email: [email protected]

Abstract

In this paper, six portable parallel programming environments are compared. For each environment, communication bandwidths are reported for simple two node and four node benchmarks. Reproducibility was a top priority, so these tests were run on an isolated ethernet network of identical SPARCstation 1 workstations. Earlier reports of this work omitted opinions reached during the benchmarking about the effectiveness of these environments. These opinions are included in this paper since they are based on a unique experience: the experience of running two programs on six different environments.

1 Introduction

Parallel computing, ranging from clusters of workstations to tightly coupled distributed memory machines, is the dominant form of supercomputing. Software development for these computers, however, is difficult and seriously lags behind developments in parallel hardware. To address this situation, numerous groups have designed programming models that span all MIMD multiprocessor systems. Programming environments based upon these models [15, 6] let the application programmer write portable programs for MIMD systems. With so many environments available to the programmer, the question naturally arises, "which environment is the best?" A general answer to that question doesn't exist. Which environment is the "best" depends on the specific algorithm at hand as well as a programmer's skills and personal tastes. Diverse issues such as ease of use, expressiveness and efficiency must be considered to really answer the question.

Given the impossibility of an objective and general ranking of portable parallel programming environments, comparisons usually emphasize a single trait: runtime efficiency. While providing only a partial answer, this approach is useful as efficiency is the only measure that can be applied objectively. Past efforts to compare runtime efficiencies, however, have been unsatisfactory and contradictory. Attempts to reproduce these comparisons usually fail due to insufficient descriptions of both the experiments and what was being measured. The result is a general state of confusion regarding the runtime efficiency of various programming environments for distributed computing.

This paper summarizes the results from a project carried out at Yale University [7, 11] to provide "painstakingly correct" comparisons of several environments for parallel and distributed computing. The following environments were evaluated in this project:

- CPS
- C-Linda
- P4
- POSYBL
- PVM
- TCGMSG

These experiments consisted of two different communication benchmarks. First, tests were carried out by bouncing a simple message between two isolated nodes. These benchmarks measured conflict-free communication and provided information about the raw performance of each programming environment. These low level, two node tests, however, were highly artificial, and extrapolating the results to actual applications was difficult. Therefore, tests with more complicated communication patterns were considered by looking at simultaneous shifts of messages around a ring of four nodes.

The experiments described in this paper were unique for a number of reasons. First, the tests were carried out on isolated workstation networks. While this does not match the conditions encountered in a typical distributed computing environment, the isolation of the network made the experiments more reproducible and let every facet of the testbed network be controlled. Furthermore, great pains were taken to protect the experiments from bias. For each environment under study, a member of the group responsible for that tool was provided with a copy of the benchmark software. By doing this, it was possible to make sure that each environment was being used to its best advantage. Finally, the raw data and actual test programs are available by anonymous ftp (anonymous ftp at casper.na.cs.yale.edu). The result is that any group can reproduce the results of this study.

Because these experiments were conducted in a "painstakingly correct" manner, the results of this project will hopefully resolve the question of relative communication efficiency for these environments, at least for the network computing case. Extensions to tightly coupled MIMD systems such as the Intel Paragon are in progress and will be reported in a future paper.

While these experiments are valuable, it would be a serious mistake to overemphasize them. These experiments only test simple communication patterns between small numbers of nodes. This simplicity was needed since realistic benchmarks would have precluded the consistent treatment of so many environments in a single project. Even if the communication tests were conducted for more realistic benchmarks, it is important to note that communication efficiency is not always the most important issue to consider. A programming environment must be evaluated in terms of the full software lifecycle, which includes ease of use, expressiveness, ease of debugging, and support. These issues are difficult to include in an unbiased comparison project since precise objective measures are not available. However, opinions on these matters could not be avoided as the two benchmark programs were ported from one environment to the next. Therefore, while the technical reports detailing the raw data [7, 11] avoid subjective comparisons, this paper will include them.

The remainder of this paper begins by briefly describing each of the programming environments studied and comparing their use of network protocols. This is followed by a discussion of the two node and four node experiments. Finally, the experiences encountered as these two programs were ported to each environment are described. It is hoped that the combination of quantitative and qualitative data will help readers make informed choices as they select a programming environment for their own work.

2 Portable Parallel Programming Environments

It is easy to be overwhelmed by the number and diversity of programming environments available to the parallel programmer [15, 6]. However, only a small number of these programming environments have seen substantial use outside of the research groups that created them. By restricting attention to these "mainstream" environments, the time spent selecting a tool can be minimized. Four prominent, "mainstream" programming environments (PVM, P4, TCGMSG and Linda) were studied in this project, along with two lesser known systems, POSYBL and CPS. In the following subsections, the origin and overall form of each environment will be indicated. A full introduction, however, will not be included. For more complete introductions, internet-accessible sources of information are listed for each environment.

2.1 CPS release 2.7.2

CPS [8] (Cooperative Processes Software) is a parallel programming environment designed to support the RISC processor farms at the Fermi National Accelerator Laboratory. The CPS library includes routines for message passing, remote procedure calls and process synchronization. In addition, the CPS environment supports bulk data transfers and batch processing queues. These last two items are absent from the other environments discussed in this paper. CPS is optimized for applications requiring asynchronous I/O of large data blocks from devices operating over a large range of communication bandwidths, from high speed disks to relatively slow tape drives. Hence, the tests carried out in this study are outside of the domain for which the CPS design was optimized. To learn more about CPS, send e-mail to [email protected].

2.2 C-Linda release 2.5

Linda [4] is an associative, virtual shared memory system. The combination of Linda with C leads to the C-Linda programming language. Linda's operations act upon Linda's shared memory to provide the process management, synchronization, and communication functionality required to control MIMD computers. The version of Linda used in this project is produced and supported by Scientific Computing Associates, Incorporated [13]. For more details about C-Linda, send e-mail to [email protected].
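As a flavor of the model, here is a minimal C-Linda sketch; it is illustrative rather than taken from the benchmark codes, and the names worker, x, and y are invented for the example. The operations out(), in(), and eval() deposit, withdraw, and create tuples; a real program is compiled with the C-Linda compiler, not a plain C compiler.

    /* Minimal C-Linda sketch (illustrative). C-Linda programs
       start at real_main() rather than main(). */
    int worker(void)
    {
        int x;
        in("ping", ?x);        /* block until a ("ping", int) tuple exists */
        out("pong", x + 1);    /* reply by depositing a new tuple */
        return 0;
    }

    int real_main(int argc, char *argv[])
    {
        int y;
        eval("worker", worker());   /* live tuple: spawns a process */
        out("ping", 41);
        in("pong", ?y);             /* "?" marks a formal field; y becomes 42 */
        return 0;
    }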

2.3 p4 release 1.2

p4 [3] is a distributed computing environment providing constructs to program a variety of MIMD architectures. p4 includes both monitors for shared memory systems and message passing for distributed memory systems. In addition, p4 includes support for computing across clusters of shared memory computers. p4 was produced at Argonne National Laboratory as a follow-on to the m4 project [1] and is available by anonymous ftp from info.mcs.anl.gov in the directory pub/p4.

2.4 POSYBL release 1.102

POSYBL [12] is a public domain version of Linda developed at the University of Crete. POSYBL is one of the first public domain Linda programming environments. It is also one of the best, since it is the only public domain Linda system that supports a distributed tuple space rather than a centralized tuple server. A major difference between POSYBL and the commercially supported versions of Linda is the fact that POSYBL is implemented strictly in terms of a library and therefore cannot utilize the optimizations possible with compiler-based Linda systems. This makes it less efficient than a compiler-based Linda system; however, as the results in this paper indicate, the performance of POSYBL is still high enough to make the system quite useful. POSYBL is available by anonymous ftp from ariadne.csi.forth.gr in the directory posybl.

2.5 PVM release 2.4.1

PVM [14] is a message passing system. PVM has more users than any other portable parallel programming environment and is a de facto standard for message passing environments. In addition to its exceptionally large user base, PVM further distinguishes itself from other message passing systems by being specifically designed to handle heterogeneous networks of computers.

PVM was originally developed at Oak Ridge National Laboratory and the University of Tennessee. Currently, it comes in two flavors: PVM 2.4 and PVM 3. While the name is the same, the application programming interface for the two systems has completely changed. Therefore, PVM 2.4 will continue to be used for a number of years as code migrates to the new interface. Both versions of PVM are available by anonymous ftp from netlib2.ornl.gov in the directory pvm or pvm3. When the experiments described in this paper began, only PVM 2.4 was stable enough to be used for performance benchmarking. Hence, all of the programs described in this paper use the PVM 2.4 interface.

PVM 2.4 includes two classes of message passing routines. The first, snd/rcv, passes all messages through intermediate daemons. This has the advantage of better scalability, but at the price of substantial additional overhead. The more efficient method, vsnd/vrcv, uses direct TCP socket connections between communicating processes and is considerably faster.

Regardless of which message passing routines are used, PVM differs from the other systems studied in this project by forcing the programmer to explicitly pack and unpack the communication buffers. This unpacking must occur before the data within a message is available to a program. Most other message passing environments take care of message buffer packing/unpacking for the programmer. Therefore, the benchmarks discussed in this paper included buffer packing/unpacking in the PVM communication times.

Finally, while these benchmarks did not use PVM 3, the expectation is that the performance of PVM 3 will match the vsnd/vrcv performance of PVM 2.4. This is expected because the underlying implementation of PVM 3 includes the same functionality as vsnd/vrcv. Eventually, the performance of PVM 3 will surpass that of PVM 2.4, since a whole stage of buffer copying will be eliminated once the PvmDataInPlace encoding is available in PVM 3.
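To make the pack/send/receive/unpack cycle concrete, here is a hedged sketch written against the PVM 3 calling conventions (the PVM 2.4 routine names differ, but the shape of the code, and the extra copies, are the same). The tag value and the partner task id are illustrative.

    #include <pvm3.h>

    /* Sketch of PVM's explicit buffer management: pack before
       sending, unpack before the received data is usable.
       `partner` is a task id obtained elsewhere (illustrative). */
    void exchange(int partner, double *data, int n)
    {
        int tag = 7;                  /* arbitrary message tag */

        pvm_initsend(PvmDataRaw);     /* claim and reset a send buffer */
        pvm_pkdouble(data, n, 1);     /* copy data into the buffer */
        pvm_send(partner, tag);

        pvm_recv(partner, tag);       /* blocks until a message arrives */
        pvm_upkdouble(data, n, 1);    /* copy out of the receive buffer */
    }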

2.6 TCGMSG release 4.02

TCGMSG [9] (Theoretical Chemistry Group Message passing system) is a simple message passing system that has risen to a position of prominence among computational chemists. It is very efficient for the two node experiments discussed in this paper, with communication taking place over direct, point-to-point TCP/IP sockets. It was developed initially at Argonne National Laboratory and is now maintained at Pacific Northwest Laboratory. TCGMSG is available by anonymous ftp from ftp.tcg.anl.gov in the directory pub/tcgmsg.

3 Network protocols

All of the programming environments studied in this project achieve the same result: the nodes of a workstation cluster are made to act like members of a tightly coupled, parallel computer. In every case, this is done by mapping some higher level model onto low level network protocols.

With TCGMSG, point-to-point TCP sockets are set up between each pair of nodes. These socket connections are established when the program is initiated and are not reclaimed in the course of the calculation. This approach is called the static TCP socket method or the TCP socket crossbar. The static TCP socket method can have problems scaling up to large numbers of nodes, since the number of open file descriptors per node grows as twice the number of nodes. However, it is fast, simple, and has the advantage of hiding the startup time required to establish connections between all of the nodes.

PVM and p4 both use dynamic TCP sockets. This means they establish a socket between two communicating nodes when they first communicate with each other. The sockets are generally not recycled in the course of a computation. This method has the advantage that it will scale well onto systems with very large numbers of nodes. The down side of dynamic TCP relative to static TCP is that the first communication is significantly slower than subsequent communications.

PVM differs from p4 by offering the programmer a second option for message passing. PVM lets the programmer choose between dynamic TCP sockets and daemon mediated, UDP communication. In PVM 2.4.2, the choice is made by programmers when they select snd/rcv (UDP) or vsnd/vrcv (TCP). PVM 3.x only has one set of snd/rcv routines. By default, the system provides the daemon mediated, UDP communication. To get the more efficient TCP socket based communication, the PVM programmer uses the call

    pvm_advise(PvmRouteDirect);

prior to the communication. The use of TCP sockets is clearly more efficient than the UDP/daemon based approach when the communication between a pair of nodes takes place many times and any startup costs can be amortized. When two nodes only communicate a few times, however, it may be faster to use the daemon based, UDP method.

Characterizing these tradeoffs and providing guidelines concerning the conditions under which the two methods should be used is an active research area within the PVM group.

The version of C-Linda used in these tests employs UDP to communicate directly with the processes involved in the parallel computation. All the processes involved in a computation agree on a UDP port and use it during the computation. C-Linda does not use TCP sockets or an intermediate daemon. Instead, a signal handler initiates a context switch to a tuple space handler to manage the local virtual shared memory and communicate with the other nodes.

The POSYBL runtime system uses a daemon on each node to manage tuple space. This daemon was implemented in terms of a standard RPC package that sits on top of TCP and UDP. TCP sockets are used for larger tuples, UDP is used for small tuples, and UDP broadcasts are used for templates. This implementation limits the number of nodes in a computation to the number supported by the RPC package (the number of file descriptors available to RPC, which is usually around 32).
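To ground these descriptions, the following sketch shows how one point-to-point TCP connection of the kind pre-established by the static crossbar might be made with the standard BSD socket calls. This is illustrative only; it is not code from any of the systems discussed.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect to one peer; the crossbar repeats this for every
       pair of nodes at startup and keeps the sockets open. */
    int connect_to_peer(unsigned long peer_addr, unsigned short port)
    {
        struct sockaddr_in peer;
        int one = 1;
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0)
            return -1;

        /* Disable Nagle coalescing, as latency-sensitive message
           passing systems typically do. */
        setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);

        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_addr.s_addr = peer_addr;   /* network byte order */
        peer.sin_port = htons(port);

        if (connect(s, (struct sockaddr *)&peer, sizeof peer) < 0) {
            close(s);
            return -1;
        }
        return s;   /* held open for the life of the computation */
    }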

4 Communication Benchmarks

In order to measure communication performance for these environments, two different experiments were carried out. The first measured raw communication performance by bouncing a message between two nodes. The second benchmark measured the average bandwidth for each node in a four member ring to simultaneously shift a message around the ring.

In every case, the programs were compiled with the standard SunOS 4.1.3 C compiler and run on identical Sun SPARCstation 1 workstations connected by ethernet. Each workstation had an identical, complete file system, was not running a network file system, and had 8 MB of random access memory. No optimization switches were set during compilation or linking. The impact of these switches was investigated early in the project and found to have no impact on the measured times. In addition, every programming environment was used as provided. No attempt was made to tune each environment for the networks used in the tests.

There are a number of ways to define time. In these benchmarks, the most rigorous definition was used by looking at elapsed, wall clock times. For each benchmark program, a common clock routine based on the standard UNIX function gettimeofday() was used.
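As an illustration (not the study's actual clock routine), a wall-clock timer built on gettimeofday() can be as simple as:

    #include <sys/time.h>

    /* Elapsed wall-clock time in seconds. Two calls bracket the
       timed region; the difference, minus the measured cost of the
       calls themselves, gives the corrected elapsed time. */
    double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, (struct timezone *)0);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }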

Finally, communication benchmarks are very sensitive to other traffic on the network as well as other processes running on the workstations. To make these tests as reproducible as possible, the workstations were isolated on their own ethernet network. It was therefore possible to control the hardware and network environments completely. Prior to each measurement, the process status was checked on each node to assure that no user processes were executing.

4.1 Two node benchmarks

The first benchmark measured the raw communication performance for each of the programming environments. This was done with a simple program that bounced a message between two nodes: the so-called ping/pong program. The code for each programming environment is described in detail in [7, 11]. Each programming environment was tested by considering 100 iterations and collecting the round trip communication time for each iteration. These times were corrected for the overhead associated with calling the clock routine. This overhead was computed within each test program and was always insignificant (on the order of 0.13 milliseconds) compared to the measured round-trip communication times. Once timings were collected for each iteration, a common statistical analysis routine was used to find the average bandwidth as well as a number of other statistical quantities describing the timing run.

In Table 1, the average bandwidths (in megabytes per second) are presented. The raw data is available in [7]. Missing from Table 1 are results for CPS [11]. CPS did not perform well for these tests. The CPS times include startup costs on the order of half a second per communication, leading to relatively flat performance across the full range of message sizes. For example, 100 byte messages had a round trip communication time of 1 second, as did 100,000 byte messages. This compares with times of 4 and 201 milliseconds for the analogous TCGMSG cases. Clearly, CPS is not appropriate to use as a general purpose programming environment for parallel computing on workstation clusters.

It is important to note that CPS was never designed to serve as a parallel programming environment. CPS was designed for the needs of extremely coarse-grained, distributed applications which require asynchronous transfer of very large data blocks with slow tape I/O systems. Under these circumstances, the observed startup costs for CPS are insignificant.
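Schematically, the ping/pong kernel has the following shape. The send_msg()/recv_msg() routines are hypothetical stand-ins for whichever environment's primitives are being timed, and, unlike this sketch, the study timed and corrected each iteration individually.

    extern double wall_seconds(void);                   /* timer sketch above */
    extern void send_msg(int node, char *buf, int n);   /* hypothetical */
    extern void recv_msg(int node, char *buf, int n);   /* hypothetical */

    /* Returns average bandwidth in bytes/second; each round trip
       moves 2*nbytes across the wire. */
    double pingpong(int partner, int initiator, char *buf, int nbytes, int iters)
    {
        double t0 = wall_seconds();
        int i;

        for (i = 0; i < iters; i++) {
            if (initiator) {
                send_msg(partner, buf, nbytes);   /* ping */
                recv_msg(partner, buf, nbytes);   /* pong */
            } else {
                recv_msg(partner, buf, nbytes);
                send_msg(partner, buf, nbytes);
            }
        }
        return (2.0 * nbytes * iters) / (wall_seconds() - t0);
    }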

Table 1: Average data transfer rates (in megabytes per second) for 100 ping/pong iterations between two nodes. PVM used vsnd/vrcv.

    kBytes   TCGMSG     P4    PVM   Linda  POSYBL
       0.1     .056   .041   .035    .025    .013
       0.4     .163   .154   .119    .088    .045
       1       .339   .317   .217    .183    .103
       4       .635   .519   .452    .379    .262
      10       .855   .610   .471    .373    .311
      40      1.001   .648   .543    .474    .293
     100       .992   .649   .561    .514    .159
     400      1.007   .659   .578    .536    .094
   1,000      1.011   .660   .575    .539      --

For the other programming environments, the data in Table 1 shows that TCGMSG was clearly the fastest programming environment for all message sizes. The performance of TCGMSG was particularly striking, as it came quite close to ethernet's theoretical maximum performance of 1.25 megabytes/second (10 Mbit/s divided by 8 bits per byte). In terms of overall performance, P4, PVM and C-Linda (in that order) represent a middle range. Finally, POSYBL was the slowest system, and it failed outright for the largest message size.

The most surprising result from Table 1 is the persistence of the performance differences as the message sizes vary. This directly contradicts the conventional wisdom in distributed parallel computing, namely that the poor bandwidth of ethernet would dominate communication and equalize performance as messages increase in size. This was clearly not the case. While many factors are involved, the most likely source of performance differences is the buffer management at either end of the communication. This conclusion follows from the observation that once the sockets are established, P4, TCGMSG and PVM all use the same network protocols.

While useful as a measure of raw system performance, these two node benchmarks are too simple. Parallel application programs utilize far more complicated communication patterns. Therefore, one should be very careful when trying to extrapolate these results to one's own application programs.

5 Four node studies

As mentioned in the previous section, the two node tests were quite surprising. Evidence based on a number of benchmarks [10, 5] has consistently shown that applications generally don't display significant performance variation as different programming environments are used. Could it be that the addition of network contention would equalize the performance of these programming environments? To test this idea, an isolated four node SPARCstation 1 network was assembled. The benchmark program used on this network did the following:

- Start a program on each of the four nodes.
- Construct an array on each node.
- Each node passes its array to its neighbor, i.e. the nodes shift the data around the ring.
- Repeat for some number of shifts.

We timed 100 of these shifts and reported a net bandwidth for the process. This benchmark was called the ring test. The code for each programming environment is described in detail in [7].

Unlike the simple ping/pong program, the ring test benchmark permitted two different solutions. One solution, the SPLIT method, divided the nodes into two parts: half the nodes did a send/receive while the other half did a receive/send. This was required for the TCGMSG program since TCGMSG communication on a network is synchronous. The other option was to post all the sends and then all the receives. This is referred to as the NO-SPLIT method (both orderings are sketched at the end of this section). Both of these options were considered for P4 and PVM. C-Linda and POSYBL hide low level message passing details from the user. Therefore, it was not expected that the two ring test methods would yield different performance, and the only method used was NO-SPLIT.

In all cases, the same timing routine was used as in the two node tests. Furthermore, the ring test timed multiple shifts, so timing corrections were insignificant and hence omitted.

The key results from this study are given in Table 2, where the average bandwidth (in megabytes per second) is given as a function of message size for each of the programming environments (CPS was omitted from the four node studies). In every case, the results reported in Table 2 are for the best case: SPLIT for TCGMSG and PVM; NO-SPLIT for P4, POSYBL, and Linda.

Table 2: Average data transfer rates (in megabytes per second) for the ring test, four node studies. PVM and TCGMSG used the SPLIT algorithm while all other systems used the NO-SPLIT algorithm. PVM used vsnd/vrcv message passing.

    kBytes   TCGMSG     P4    PVM   Linda  POSYBL
       0.8    0.504  0.034  0.521   0.384   0.100
       8      1.123  0.250  0.747   0.695   0.323
      80      1.058  0.852  0.784   0.921   0.367
     800      1.051  0.835  0.255   0.878      --

In Table 2 it is clear that TCGMSG is still the fastest programming environment. However, consistent with the few application benchmarks available that compare programming environments [10, 5], the performance differences between TCGMSG and the other systems are generally decreased in the four node benchmarks relative to the two node benchmarks. The two node tests showed TCGMSG as being approximately 50% faster for the largest message sizes. The four node tests, however, have TCGMSG only 20% faster. We suspect this effect follows from the fact that TCGMSG communication on a network is synchronous. The rest of the network programming environments support some low level overlap of the sends and receives, but TCGMSG must synchronize at each send/receive pair. This effect was not present in the two node tests since that test was inherently synchronous.

A surprising result in the ring test benchmark was the serious performance drop-off experienced by PVM. This problem was reproducible [7] and was observed on both the isolated networks and on the internal workstation network at the Supercomputer Systems Division of Intel. Therefore, this is not an aberration and points to a serious problem with PVM. The most likely source of trouble is once again the buffer management within PVM. This effect is under investigation and will be discussed in a future paper.
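The sketch below contrasts the two orderings, again using hypothetical send_msg()/recv_msg() stand-ins rather than any one environment's calls.

    extern void send_msg(int node, char *buf, int n);   /* hypothetical */
    extern void recv_msg(int node, char *buf, int n);   /* hypothetical */

    /* One shift around the ring; left/right are this node's neighbors. */
    void ring_shift(int my_rank, int left, int right,
                    char *buf, int nbytes, int split)
    {
        if (!split || my_rank % 2 == 0) {
            /* NO-SPLIT (all nodes), or the sending half under SPLIT:
               post the send, then the receive. */
            send_msg(right, buf, nbytes);
            recv_msg(left, buf, nbytes);
        } else {
            /* The receiving half under SPLIT: a receive is waiting
               for every send, so strictly synchronous systems such
               as TCGMSG cannot deadlock. */
            recv_msg(left, buf, nbytes);
            send_msg(right, buf, nbytes);
        }
    }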

6 Qualitative Comparisons

In the original technical reports describing this work [7, 11], the numbers were left to speak for themselves. With the project taking place at Yale and involving "Linda insiders," it was decided that the safest course of action was to avoid all appearances of bias. Therefore, matters outside of the timing data were left unaddressed.

While defensible, this decision has been met with some frustration. A number of readers of the timing benchmark reports [7, 11] have expressed frustration at the complete lack of qualitative comparisons, noting that it would be impossible to port two programs to six different programming environments without forming a number of opinions about the relative utilities of the programming environments. In this paper, opinions about the utility and effectiveness of these programming environments will be provided. The following areas will be discussed:

- Support
- Ease of use (coding)
- Ease of use (debugging)

Before addressing these specific items, a word of warning is offered to anyone considering a similar project. This project was far more difficult than anyone anticipated. Any single environment is not that difficult to work with; but to obtain, load, learn, and maintain six different environments was a time consuming and frustrating task.

6.1 Support

Of the environments studied in this project, most required some level of support from the system developers. While support is an official part of only one of the systems (the commercial C-Linda system), in every case support was provided in a timely and effective manner.

The best support was provided for C-Linda. This is not surprising since it is a commercial product. What was surprising was how good the support was for PVM. PVM is a public domain programming environment and has no funding in place to provide official support. The unofficial support by email and through the comp.parallel.pvm newsgroup, however, was excellent. One advantage PVM has over all other environments is "anecdotal support," i.e. support based on the experiences of a large and accessible user base. PVM has by far the most users, so when a problem is encountered (either with PVM or with the expression of some algorithm in PVM), the chances are good that you will be able to find someone who has already encountered and solved a similar problem.

6.2 Ease of use: Coding

Of all the comparisons, the issue of ease of use is the most difficult to address objectively. Linda (both C-Linda and POSYBL) was the easiest system with which to develop the benchmarks discussed in this paper. With only six operations, Linda was the only system that did not require constant reference to a user manual during coding. While this sounds trivial, when so many environments are being used, the constant need to reference a manual for each instruction was irritating.

An advantage for C-Linda (but not POSYBL) follows from the fact that it is a language, not a library, and the syntax can therefore be much more flexible. As an example of this flexibility, consider the passing of a simple message between two nodes. With Linda, programmers can choose any arrangement of fields they desire to construct the message or to specify where the message should go. Message passing systems based on libraries, however, require specific types and locations within the argument list for each message.

C-Linda, POSYBL, and PVM all share a major oversight: these programming environments lack global combine operations (e.g. global summation). Global combine operations are heavily used in many applications (e.g. parallel molecular dynamics codes [10]). They are difficult to code correctly and even more difficult to code efficiently (a sketch of even the simplest approach follows this section). Consequently, the authors of TCGMSG and P4 made global combine operations an integral part of their packages. C-Linda, POSYBL and PVM, by forcing their users to write these themselves, greatly complicate the coding process.

Finally, it is important to point out that the PVM 2.4 programs were the most awkward to code. Buffers had to be packed or unpacked at either end of the communication. Since every library call is another avenue for errors to enter a code, these extra calls complicated the programming. Also, PVM 2.4 forces the programmer to manage, for each node, both the name of the spawned executable and an instance number. This was awkward for the programmer. Unfortunately, the PVM daemon remembered the names of spawned processes and assigned instance numbers based on previous runs of the program. This meant that the PVM daemon had to be cleaned up after every benchmark. It would have been possible to construct the program to avoid this problem, but one of the goals was to make the various codes as similar as possible, and accounting for PVM's interaction with its daemon couldn't be achieved without significantly restructuring the code.

These problems with PVM 2.4 are a direct result of PVM's support for heterogeneity. Stated in other terms: PVM's support for heterogeneous computing is the source of both its greatest advantage and its greatest disadvantage. The new syntax in PVM 3.X improves the situation somewhat by forcing the programmer to maintain tid arrays that essentially combine the instance number and process name into a single opaque object. While an improvement, this still makes PVM more complicated than the message passing provided by both TCGMSG and P4.

So when you put all of this together, it's very hard to select a single environment that provides the greatest ease of coding. The Linda systems are probably the easiest to use if you don't need global combine operations. If these are needed (as they usually are), TCGMSG or P4 are the easiest to use.
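As an illustration of what users of C-Linda, POSYBL, and PVM had to write for themselves, here is a deliberately naive global sum built from the same hypothetical send_msg()/recv_msg() primitives used in the earlier sketches. Efficient versions use tree-structured combining; that is precisely the code TCGMSG and P4 supply ready-made.

    extern void send_msg(int node, char *buf, int n);   /* hypothetical */
    extern void recv_msg(int node, char *buf, int n);   /* hypothetical */

    /* Naive global sum: gather all values to node 0, then broadcast
       the total. Correct, but O(nnodes) serial steps. */
    double global_sum(int my_rank, int nnodes, double local)
    {
        double total = local, incoming;
        int i;

        if (my_rank == 0) {
            for (i = 1; i < nnodes; i++) {
                recv_msg(i, (char *)&incoming, sizeof incoming);
                total += incoming;
            }
            for (i = 1; i < nnodes; i++)
                send_msg(i, (char *)&total, sizeof total);
        } else {
            send_msg(0, (char *)&local, sizeof local);
            recv_msg(0, (char *)&total, sizeof total);
        }
        return total;
    }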

6.3 Ease of use: Debugging

Program debugging did not play a big role in this project. The benchmark programs were simple enough that once written they usually didn't take long to debug. However, enough debugging was required to form some definite opinions on this matter.

The only system in the project that had a full-fledged debugger and program visualization tool was Linda [2]. This is not surprising, as Linda was the only commercial product used in this project. The other systems provided some level of debugging support, but for the most part the programs were debugged with print statements.

Two systems deserve further mention with regard to debugging. First, P4 provides a command line switch to turn on communication tracing within a program. This was valuable in the course of the project. At one point there was an error in the P4 version of the ring test that caused a deadlock to occur. With the debug command-line switch in P4, it was trivial to locate and fix this error.

Finally, the PVM group has plans to add first rate support for debugging in PVM 3.X (where X is used to indicate some release of the product). By setting the PvmTaskDebug flag in the pvm_spawn() command, the spawning process will initiate a dbx window attached to the spawned process. The Linda systems from Scientific Computing Associates, Inc. also provide this capability, but it is most noteworthy when high quality debugging tools of this sort appear in a public domain system.

7 Conclusions

The goal of this paper was to provide unbiased information about the relative merits of several major parallel programming environments. This comparison project is ongoing, and several key tasks remain to be done (such as extending the tests to higher speed networks and MPP machines). However, the data gathered to date does support a number of important conclusions with regard to execution over workstation clusters.

First, the cost of buffer management at either end of the communication is more important than expected. Rather than all systems performing the same for simple communication patterns and large messages, significant and substantial differences in performance were found. Second, as communication patterns became more complex, the differences between these environments decreased substantially. With the two node tests, differences between the systems ranged up to 50%. However, for the four node tests, the differences dropped to 5%.

Finally, based on the experience with all six programming environments, an attempt was made to provide unbiased opinions about the merits of the various programming environments. A reasonable question to ask someone who has used so many different environments is, "which programming environment is the best?" My answer to that question is "none of them are the best." What I really want in a programming environment is a merged system that provides:

- The high speed and simple message passing of TCGMSG.
- The algorithmic expressiveness and ease of debugging of Linda.
- The support for heterogeneity from PVM.
- The shared memory (i.e. shared address space) support from P4.
- The global combine operations of TCGMSG and P4.

The only way to actually piece together such a hybrid environment is to work with source code. This is the motivation for studying POSYBL so closely. A POSYBL/TCGMSG hybrid would satisfy most of the items in my programming environment wish-list.

Acknowledgements

I would like to thank Craig Douglas and Martin Schultz for many valuable discussions about the topics discussed in this paper. I would particularly like to extend a special thank you to Craig Douglas, who took over running the benchmarks on the isolated workstations after my move from Yale to Intel.

References

[1] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, and R. Stevens. Portable Programs for Parallel Processors. Holt, Rinehart, and Winston, 1987.

[2] P. Bercovitz and N. Carriero, "TupleScope: A Graphical Monitor and Debugger for Linda-Based Parallel Programs," Yale University Computer Science Department, Technical Report RR-782, 1990.

[3] R. Butler and E. Lusk, "User's Guide to the p4 Programming System," Argonne National Laboratory, Technical Report ANL-92/17, 1992.

[4] N. Carriero and D. Gelernter, How to Write Parallel Programs: A First Course. Cambridge: MIT Press, 1990.

[5] N. Carriero, D. Gelernter, T. G. Mattson, and A. H. Sherman, "The Linda Alternative to Message-Passing Systems," submitted to Parallel Computing, 1993.

[6] D. Y. Cheng, "A Survey of Parallel Programming Languages and Tools," NASA Ames Research Center, Technical Report RND-93-005, 1993.

[7] C. C. Douglas, T. G. Mattson, and M. H. Schultz, "Parallel Programming Systems for Workstation Clusters," Yale University Computer Science Department, Technical Report YALEU/DCS/TR-975, 1993.

[8] M. Fausey, F. Rinaldo, S. Wolbers, D. Potter, and B. Yeager, "CPS and CPS Batch Reference Guide," Fermi National Accelerator Laboratory, GA0008, 1992.

[9] R. J. Harrison, "Portable Tools and Applications for Parallel Computers," International Journal of Quantum Chemistry, Vol. 40, pp. 847-863, 1991.

[10] T. G. Mattson and G. Ravishankar, "Parallel Molecular Dynamics with Wesdyn," in preparation, 1993.

[11] T. G. Mattson, C. C. Douglas, and M. H. Schultz, "A Comparison of CPS, Linda, P4, POSYBL, and TCGMSG: Two Node Communication Times," Yale University Computer Science Department, Research Report, May 1993.

[12] G. Schoinas, "Issues on the implementation of Programming SYstem for distributed applications," University of Crete, draft technical report, 1992.

[13] Scientific Computing Associates, Inc., C-Linda Reference Manual, 1992.

[14] V. S. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, Vol. 2, pp. 315-339, 1990.

[15] L. Turcotte, "A Survey of Software Environments for Exploiting Networked Computing Resources," Mississippi State University, Technical Report MSM-EIRS-ERC-93-2, 1993.