LESSONS LEARNED FROM IMPLEMENTING BSP

Programming Research Group
Jonathan M.D. Hill, Oxford University Computing Laboratory, Oxford, U.K. ([email protected])
D.B. Skillicorn, Department of Computing and Information Science, Queen's University, Kingston, Canada ([email protected])
PRG-TR-21-96



Oxford University Computing Laboratory Wolfson Building, Parks Road, Oxford OX1 3QD

Abstract

We focus on two criticisms of Bulk Synchronous Parallelism (BSP): that delaying communication until specific points in a program causes poor performance, and that frequent barrier synchronisations are too expensive for high-performance parallel computing. We show that these criticisms are misguided, not just about BSP but about parallel programming in general, because they are based on misconceptions about the origins of poor performance. The main implication for parallel programming is that higher levels of abstraction do not only make software construction easier; they also make high-performance implementation easier.

1 Introduction

Bulk Synchronous Parallelism [4, 5], or BSP, is a model of parallel computation whose aim is general-purpose parallel programming. It provides a level of abstraction that makes programs portable across the full range of parallel architectures, at the same time providing implementations that are efficient [3]. Unlike most parallel programming environments in use today, it works well for many different kinds of applications, and on many different parallel computers. Implementations exist for the SGI PowerChallenge, Cray T3D, IBM SP2, SUN multiprocessor systems, Hitachi SR2001, Convex Exemplar, Digital Alpha Farm, and Parsytec GC.

BSP provides an abstract machine whose operations are much higher-level than those of most parallel programming models. Rather than having to arrange details such as the mapping of processes to processors and the timing of individual communication actions, BSP programmers use operations on the target architecture as a whole. This might naively be expected to make it difficult to achieve good performance. In fact, the opposite is true: the gap between the abstract machine operations and the capabilities of today's architectures introduces significant opportunities for performance enhancement. This counter-intuitive observation is the subject of this paper. We show two ways in which the BSP model's abstraction improves performance:

1. postponing communication permits the run-time system to combine and schedule it to minimise start-up costs and congestion;

2. when communication is scheduled, there is little further overhead to barrier synchronisation.

2 The BSP Model

BSP programs have both a horizontal structure and a vertical structure. The vertical structure expresses the progress of a computation through time. For BSP, this is a sequential composition of global supersteps (Figure 1), which occupy the full width of the executing architecture. Each superstep is subdivided conceptually into three ordered phases consisting of:

1. computation locally in each processor, using only values stored in its local memory;

2. communications among the processors, involving transfers of data;

3. a barrier synchronisation, which ends the superstep and makes the transferred data available in the local memories of the destination processors for the next superstep.

[Figure 1: A Superstep. The virtual processors perform local computations, then global communications, then a barrier synchronisation.]

The horizontal structure expresses the concurrency in the program, and consists of a fixed-size bag or multiset of processes, whose size we will denote by p. BSP implementations avoid locality by making the mapping of processes to processors random. Patterns in the program's structure are then unlikely to interact with the pattern implicit in the target processor's interconnection topology. Thus the destination processors of a set of communications approximate a permutation of processor addresses with high probability.

The highly-structured nature of a BSP program makes it possible to predict the delivery time of the global communications. This is where most existing parallel programming models fail, because it is hard to capture the effects of, for example, congestion on delivery times at the level of individual communications. Treating communication actions globally, together with BSP's randomised placement of processes, makes it possible to determine the average effect of their interactions, and hence to bound their collective delivery time.

During a particular communication phase, each processor sends some set of messages to other processors, and each receives some set from other processors. Call the maximum size of any of these sets h, and the complete set of messages an h-relation. The delivery time for the h-relation can be captured by a single architectural parameter, called g, which measures, intuitively, the permeability of the network to continuous traffic to uniformly-random destinations. It is defined to be that value such that an h-relation of single-word values is delivered in time hg, and is usually expressed in units of the processor instruction execution time to make it easier to compare architectures. A small value of g means that an architecture is good at delivering permutations. A small sketch computing h for one superstep is given below.
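To make the definition of h concrete, the following C sketch computes h for one superstep from a traffic matrix and the delivery time hg that the cost model then predicts. The traffic pattern and the value of g are invented for illustration; they are not measurements from the paper.

```c
/* h is the larger of the maximum number of words sent by any process and the
 * maximum number received by any process; the cost model then predicts the
 * h-relation is delivered in time h*g (in units of instruction execution time). */
#include <stdio.h>
#define P 4

int main(void) {
    int words[P][P] = {            /* words[i][j]: words sent from i to j (invented) */
        { 0, 10,  5,  0 },
        { 8,  0, 12,  4 },
        { 0,  6,  0, 20 },
        { 7,  0,  9,  0 },
    };
    int h = 0;
    for (int i = 0; i < P; i++) {
        int out = 0, in = 0;
        for (int j = 0; j < P; j++) {
            out += words[i][j];    /* total sent by process i     */
            in  += words[j][i];    /* total received by process i */
        }
        if (out > h) h = out;
        if (in  > h) h = in;
    }
    double g = 1.6;                /* hypothetical benchmarked value of g */
    printf("h = %d, predicted delivery time = %.1f instruction times\n", h, h * g);
    return 0;
}
```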

The parameter g depends on the interconnection network's bisection bandwidth, but it is also affected by the communication protocols and routing algorithms of the interconnection network, by the buffer management system, and by the BSP runtime system. Because these factors interact in complex ways, the value of g is determined, in practice, by running suitable benchmarks on each architecture. A protocol for doing so is described in [4]. Because g is defined under conditions of continuous traffic, it can be inaccurate when only small amounts of data are transferred. This is because start-up costs may dominate for short message transfers, and because they may leave the network quite empty and hence with different permeability characteristics.

A second architectural parameter, l, captures the cost of barrier synchronisation. At the barrier each process reaches a consistent state and knows that all other processes have also reached that state. For many architectures, synchronisation is a special case of communication, but some provide special-purpose hardware to implement it. The actual cost of synchronisation is again determined by benchmarking and, as before, l is expressed in units of the instruction execution time.

From the text of a superstep and these two architectural parameters, it is possible to compute the cost of executing a program on a given architecture as follows. The time taken for a single superstep is the sum of the (maximum) time taken for the local computations in each process during the superstep, the time taken to deliver the h-relation, and the time required for the barrier synchronisation at the end. We can express this as:

    superstep execution time = max_i w_i + (max_i h_i) g + l

where i ranges over processes, and w_i is the computation time of process i. Note that the addition is sensible because each term is expressed in the same units of instruction execution time. Often the maxima in this formula are assumed and execution times are expressed in the simpler form w + hg + l. The execution time for a BSP program is the sum of the execution times of each superstep. A small worked sketch of this calculation is given at the end of this section.

In the next section we show how postponing communication until after local computations makes it possible to improve delivery times, and hence to reduce the effective value of g. In Section 4, we show how barrier synchronisation adds little cost once communication rescheduling has been done, and hence that the value of l need not be large.
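As a worked illustration of the cost formula above, this C sketch evaluates max_i w_i + (max_i h_i) g + l for each superstep and sums over supersteps. The per-process work, traffic sizes, g and l are all invented values, not benchmark results from the paper.

```c
/* Predicted BSP cost: per superstep, max_i w_i + (max_i h_i)*g + l;
 * the program cost is the sum over supersteps.  All terms are in units
 * of the instruction execution time. */
#include <stdio.h>

#define P 4          /* processes  */
#define S 3          /* supersteps */

static double superstep_cost(const double w[P], const double h[P],
                             double g, double l) {
    double wmax = 0.0, hmax = 0.0;
    for (int i = 0; i < P; i++) {
        if (w[i] > wmax) wmax = w[i];
        if (h[i] > hmax) hmax = h[i];
    }
    return wmax + hmax * g + l;
}

int main(void) {
    /* per-superstep, per-process local work and h-relation sizes (invented) */
    double w[S][P] = { {500, 480, 510, 470}, {200, 210, 190, 205}, {900, 880, 910, 870} };
    double h[S][P] = { { 64,  64,  32,  64}, {128,  96, 128, 112}, {  8,   8,   8,   8} };
    double g = 1.6, l = 1000.0;   /* hypothetical benchmarked parameters */

    double total = 0.0;
    for (int s = 0; s < S; s++)
        total += superstep_cost(w[s], h[s], g, l);
    printf("predicted program cost = %.1f instruction times\n", total);
    return 0;
}
```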

3 Postponing Communication

The current implementation of BSP is an SPMD library, BSPlib [1], callable from C or Fortran. It provides a facility for one-sided direct remote memory access, where a process can copy data directly into a remote process's memory without its active participation. These one-sided communication actions, bsp_hpput calls, may appear textually interspersed with ordinary instructions in the computation part of each process. The semantics of bsp_hpput permits them to begin communicating concurrently with the local computation; the superstep structure of BSP only guarantees that the messages are delivered at the barrier synchronisation that identifies the end of the superstep. A minimal usage sketch follows.
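The sketch below shows the shape of such a superstep: one-sided puts interleaved with local computation, with the transferred data guaranteed visible only after the next bsp_sync. It assumes the BSPlib primitives as later standardised (bsp_begin, bsp_push_reg, bsp_hpput(pid, src, dst, offset, nbytes), bsp_sync); it is illustrative code, not code from the library described in the paper.

```c
#include <stdio.h>
#include "bsp.h"                       /* BSPlib header (assumed name) */

int main(void) {
    bsp_begin(bsp_nprocs());
    int p = bsp_nprocs(), pid = bsp_pid();

    int incoming = 0;                  /* destination area on every process */
    bsp_push_reg(&incoming, sizeof(incoming));
    bsp_sync();                        /* registration takes effect here */

    int outgoing = 0;                  /* local computation producing a value */
    for (int i = 0; i < 1000; i++)
        outgoing += (pid + i) % 7;

    /* one-sided put into the next process; the library may start the transfer
     * at any point from here on, so outgoing must not be changed again, and
     * the data is only guaranteed visible after the next bsp_sync() */
    bsp_hpput((pid + 1) % p, &outgoing, &incoming, 0, sizeof(int));

    double other = 0.0;                /* unrelated local work can continue */
    for (int i = 0; i < 1000; i++)
        other += i * 0.5;

    bsp_sync();                        /* barrier: ends the superstep */
    printf("process %d of %d received %d (other=%g)\n", pid, p, incoming, other);

    bsp_end();
    return 0;
}
```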

However, the implementation does not perform the communication concurrently with computation, because the performance advantage of not overlapping them is large. This contradicts current practice in communication libraries, where overlapping is considered a good thing. This is partly because systems that deal with communication at the level of single messages have no opportunity to improve performance by altering the schedule of communication. BSP's bulk treatment of communication provides two main opportunities for improving performance, because the runtime system sees the total outgoing communication demand at each processor. They are:

Combining the data from all of the messages between each processor pair. This means that the transmission overhead of message start-up is only paid once per superstep, instead of once per message. Estimates of delivery will also be more accurate, since only entire supersteps with small total communication will be misestimated, not those with small messages. The only drawback to doing this is that memory is required for buffering. However, the effects on memory can be alleviated by only combining small messages, since large messages can be transferred at near-asymptotic bandwidth. Figure 2 shows the results of combining, by plotting the achieved value of g, on a superstep-by-superstep basis, for h-relations that route a message of size 16384 words as 16384 single-word communications on the left-hand side of the graph, through to a single message of size 16384 words on the right-hand side. The performance advantages of combining can be clearly seen from the graph, as the asymptotic communication bandwidth g of a machine can be achieved over a wider mix of message sizes. A sketch of the combining step is given below.

[Figure 2: The effect on g when combining messages on the IBM SP2. The plot compares "Immediate: y communications as y messages" with "Combined: y communications as 1 message"; the x-axis is (16384/x) puts of x words, where 1 <= x <= 800, and the y-axis is the achieved g.]
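The following C sketch shows the idea behind combining inside a BSP runtime: puts issued during a superstep are appended to one buffer per destination, and each buffer leaves as a single message at the end of the superstep, so the start-up cost is paid once per destination per superstep. The buffer sizes, the enqueue_put/flush_combined names and the stubbed send are invented for illustration; this is not the BSPlib internals.

```c
#include <stdio.h>
#include <string.h>

#define P        4                       /* number of processes (illustrative) */
#define BUF_MAX  65536                   /* per-destination combining buffer   */

static unsigned char outbuf[P][BUF_MAX]; /* one combining buffer per destination */
static size_t        outlen[P];

/* hypothetical transport: in a real runtime this is one network send */
static void send_one_message(int dest, const void *data, size_t nbytes) {
    (void)data;
    printf("send %zu bytes to process %d as a single message\n", nbytes, dest);
}

/* called for every put issued during the superstep */
static void enqueue_put(int dest, const void *data, size_t nbytes) {
    if (outlen[dest] + nbytes > BUF_MAX) {        /* large data: don't combine,   */
        send_one_message(dest, data, nbytes);     /* send at asymptotic bandwidth */
        return;
    }
    memcpy(outbuf[dest] + outlen[dest], data, nbytes);
    outlen[dest] += nbytes;
}

/* called once, at the barrier that ends the superstep */
static void flush_combined(void) {
    for (int dest = 0; dest < P; dest++) {
        if (outlen[dest] > 0)
            send_one_message(dest, outbuf[dest], outlen[dest]);
        outlen[dest] = 0;
    }
}

int main(void) {
    int x = 42, y[8] = {0};
    enqueue_put(2, &x, sizeof x);   /* several small puts to process 2 ...       */
    enqueue_put(2, y, sizeof y);    /* ... are delivered as one combined message */
    enqueue_put(1, &x, sizeof x);
    flush_combined();
    return 0;
}
```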

Reordering messages to different destination processors reduces congestion effects in the network. The runtime system can arrange patterns of transmission to prevent hotspots. This kind of reordering reduces both the absolute delivery times in typical networks and the variance in delivery times. The first improves the efficiency of the network; the second improves the fit between the cost model and observed performance. Two mechanisms for scheduling message transmission have been investigated. They are: (1) randomly scheduling messages to reduce the a priori probability of troublesome patterns, and (2) using a latin square to guarantee a schedule that avoids contention. A latin square is a p × p square in which each of the values from 1 to p appears p times, with no repetition in any row or column. In such a square, the ith row may be used as the list of destinations for process i, one per time step. These mechanisms both schedule messages at the level of entry to the interconnection network. Thus they do not guarantee not to introduce contention at some internal point in the network, a point to which we return later. Which mechanism performs better depends on the precise design of each architecture and must be discovered by experimentation. A sketch of a simple cyclic latin-square schedule is given below.
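A minimal sketch of such a schedule, assuming the simple cyclic construction L[i][t] = (i + t + 1) mod p (the paper describes a full p × p latin square over the values 1 to p; this variant also skips self-sends): row i gives the order in which process i transmits its combined messages, and no two processes target the same receiver in the same time step.

```c
/* Cyclic latin-square transmission schedule: at time step t, process i sends
 * the combined message destined for process (i + t + 1) mod P.  Each
 * destination appears exactly once per row and once per column. */
#include <stdio.h>
#define P 8

int main(void) {
    for (int i = 0; i < P; i++) {          /* row i = schedule for process i      */
        for (int t = 0; t < P - 1; t++) {  /* P-1 steps; a process skips itself   */
            int dest = (i + t + 1) % P;
            printf("%d ", dest);
        }
        printf("\n");
    }
    return 0;
}
```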

The use of such mechanisms has a major effect on performance. We illustrate by comparing the performance of communication patterns transmitted immediately, and delayed and reordered. Two patterns are used, one quite sparse and the other quite dense.

Sparse: A cyclic shift in which each process sends its local array of n integers to its neighbouring processor. As a single message of size n leaves and enters each process, the communication realises an n-relation with cost l + ng. This communication pattern forms a base-line against which the effects of contention can be measured, as no contention arises at the processing nodes. Such a message pattern should achieve better delivery times than the cost model predicts, since it sends big messages, but few of them, and with no contention. (A BSPlib sketch of this pattern is given below.)

Dense: The odd-numbered processes each communicate an array of size m to all odd-numbered processes, and the even-numbered processes communicate to all even-numbered processes. When m is chosen to be n/(p/2 - 1), then, as each process sends and receives p/2 - 1 messages, the cost of the communication is l + (n/(p/2 - 1))(p/2 - 1)g = l + ng. This pattern more closely resembles a typical communication pattern.

Consider the messages arriving at a particular processor, and suppose that there are h of them. If the scheduling is such that the h other processors that have a message for this processor try to send it as their first communication action, then there will be congestion at or around that processor. This congestion may either interfere with the delivery of other messages or prevent the senders from going on to send other messages. Such a strategy for delivering an h-relation could take time as large as h². Therefore, in practice, message order may be important when routing communications, but capturing this in the cost model would complicate it.

Table 1 shows the results of these test communications. The first two columns of results show the effects of the sparse and dense communication patterns when the runtime system neither combines nor reschedules them, but transmits them as soon as a communication request is encountered during a superstep. The last two columns show the results when the runtime system delays and reorders messages using the latin-square scheme.
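As an illustration, the sparse pattern can be written in a few lines of BSPlib (again assuming the primitives as later standardised): each process issues exactly one large put, so there is no node contention and the predicted cost is l + ng.

```c
#include "bsp.h"                          /* BSPlib header (assumed name) */
#define N 32768                           /* n integers per process       */

static int local[N], shifted[N];

int main(void) {
    bsp_begin(bsp_nprocs());
    int pid = bsp_pid(), nprocs = bsp_nprocs();

    bsp_push_reg(shifted, sizeof(shifted));
    bsp_sync();                           /* registration takes effect */

    for (int i = 0; i < N; i++)           /* some local data to shift */
        local[i] = pid * N + i;

    /* one message of n integers leaves and enters each process: an n-relation */
    bsp_hpput((pid + 1) % nprocs, local, shifted, 0, sizeof(local));
    bsp_sync();                           /* data visible after the barrier */

    bsp_end();
    return 0;
}
```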

       p   Immediate transmission      BSPlib delaying and reordering
           sparse      dense           sparse      dense
       4     1932       3453             1918       2604
       8     2133       4002             2126       2272
      16     2425       5129             2426       2585
      32     2970       6550             2780       3739
      64     2473       7347             2560       4956
     128     2704       8044             2683       5468

Table 1: The effects of node contention on the Cray T3D. Entries are in microseconds for routing an n = 32768 relation plus barrier synchronisation; i.e., at p = 128, 63 messages containing 260 integers enter and leave each process for the dense communication pattern.

The sparse communication pattern, as expected, shows no congestion phenomena, since only a single message is destined for each processor. It therefore makes little difference whether communication begins immediately or is postponed to the end of the superstep. For the dense communication pattern, reordering improves delivery times quite dramatically compared to immediate, overlapped communication. This is to be expected: when h is about p/2, the potential for collisions at receiving processors is high. Explicitly controlling the order of message transmission, in this case using a latin square, has a marked effect on overall performance. This example also illustrates that g is a function not only of the architecture but also of the implementation of the library.

Naively, one might expect that beginning communication as soon as the need for it was established would be a performance enhancement, since it spreads the communication out over time and should, theoretically, reduce congestion. However, we have shown that this expectation is misleading and that, for today's architectures, congestion is best reduced by combining messages to avoid start-up costs and by actively scheduling data transfers to reduce contention. Even when the contention properties of an interconnection network are not well understood, there is some evidence to suggest that randomising delivery schedules will still help.

Congestion in the network arises in two ways: contention at the network boundary (that is, at or near the processors themselves), and contention inside the network. Reordering reduces contention of the first kind. However, software builders can do little about contention of the second kind, since routing algorithms, internal buffer allocations and so on are not under their control. We now examine how serious a problem this is. We do this by examining the delivery times for a series of broadcast-like operations of increasing traffic density, beginning with a broadcast by a single processor, then simultaneous broadcasts by two processors, and so on, up to simultaneous broadcasts by all p processors (that is, total exchange). The results for three popular architectures are shown in Table 2.

    Machine                        i procs broadcast   Time (microseconds)
                                                       delayed    immediate
    Cray T3D (128 procs)           i = 1                  9101        8803
                                       2                 13794       13559
                                       4                 17657       17639
                                       8                 22782       25169
                                      16                 26057       31118
                                      32                 30840       38738
                                      64                 35576       49970
                                     128                 41437       64731
    IBM SP2 (8 procs)              i = 1                 28498       28411
                                       2                 39294       46738
                                       4                 61367       77224
                                       8                104719      114474
    SGI PowerChallenge (8 procs)   i = 1                  9057        8979
                                       2                 11580       14962
                                       4                 18274       31901
                                       8                 34390       47657

Table 2: Communication times for a variety of 262144-relations (i.e., at most 1 Mbyte of data leaving or entering each process).

Each of these operations is a p-relation, and so the BSP cost model predicts that they should all cost the same. This is clearly not the case: these machines are suffering from contention effects within the network due to the increasing density of message traffic. Delaying message transmission and scheduling help a great deal, as comparison of the delayed and immediate columns shows, but software alone cannot increase the bandwidth at critical points in the communication network. Better communication networks are required! Delivery times do not increase nearly as fast as the applied load: a p-processor broadcast does not take p times longer than a 1-processor broadcast. For the T3D the cost of increasingly dense broadcasts grows logarithmically with i, and for the other architectures growth is sub-linear. From this data, it appears that these architectures have been optimised for extremely sparse communication patterns.

4 Barrier Synchronisation

Barrier synchronisations are important tools in parallel programming because they reduce the state space of a program. When each process exits the barrier it knows a global state of the computation that is small compared to the possible states of a less-regimented program. There are two myths about barrier synchronisations: first, that they are inherently expensive, and second, that they are best implemented by a dependency tree. We have shown elsewhere [2] that both of these are false. Barrier synchronisation can be cheap, and the best implementation is far from obvious and requires experimentation. Table 3 shows typical values for the cost of barrier synchronisation on today's architectures. The techniques mentioned in the table are:

CACHE: a shared-memory barrier in which each process writes a value to a unique location, with all of these locations on the same cache line. After writing its value, each process spins reading the set of locations. This technique uses the architecture's cache-coherence mechanism to handle mutual exclusion, since only one processor at a time can hold the line in write mode. (A sketch of this technique is given after the table.)

PWAY: a distributed-memory barrier in which each process sends a message to a single designated process, and waits for a receipt message. The designated process accepts all p - 1 messages from the other processes before sending p - 1 receipt messages.

TOTXCH: a distributed-memory barrier in which every process sends p - 1 messages, one to each other process, and expects p - 1 messages in return.

HW: a hardware barrier (which on the T3D performs well on submachines whose size is a power of two, but almost a factor of ten worse on submachines of other sizes).

                          Processors
                         1        2        3        4    Technique
    SGI shared memory    0.39     3.21     4.48     6.59  CACHE
    SUN shared memory    0.43     1.05     1.45     2.16  CACHE
    SGI MPI              0.91    47.51    53.91    91.80  PWAY
    SP2 switch           3.20    63.33    91.99   124.26  TOTXCH
    SP2 Ethernet      1064.00  1601.09  2803.67  3475.94  PWAY
    Cray T3D             0.85     1.91              1.91  HW

Table 3: Execution time for barrier synchronisation. Times in microseconds.
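The CACHE technique can be sketched in a few lines of C. This is an illustrative reconstruction, not the library's code: C11 atomics and POSIX threads stand in for the platform-specific memory-ordering and process-creation details of the machines in Table 3, and the slot array simply relies on the slots being adjacent in memory.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define P 4
static atomic_int slot[P];                   /* one word per process/thread */

static void cache_barrier(int pid) {
    static _Thread_local int epoch = 0;      /* this thread's barrier count */
    epoch++;
    atomic_store(&slot[pid], epoch);         /* announce arrival */
    for (int i = 0; i < P; i++)
        while (atomic_load(&slot[i]) < epoch)
            ;                                /* spin; cache coherence delivers the update */
}

static void *worker(void *arg) {
    int pid = (int)(long)arg;
    for (int step = 0; step < 3; step++) {
        /* ... local work for this superstep would go here ... */
        cache_barrier(pid);
        if (pid == 0) printf("everyone passed barrier %d\n", step + 1);
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    return 0;
}
```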

These data show that, particularly for shared-memory architectures, barrier synchronisation is cheap. For distributed-memory architectures this is less so, partly because such architectures do not provide low-level access to the hardware. Availability of, for example, active messages might be expected to reduce the cost of barriers on such architectures substantially. Techniques based on a dependence tree perform poorly because of the number of synchronisations they introduce (linear in p). These typically require lock or semaphore management and context switches, which are expensive on all of the architectures we have tested [2]. This may change when multithreaded architectures become common, but until then the performance advantage is with implementations that contain as few synchronisations as possible.

BSP has a major advantage over less-structured programming models, because barrier synchronisations always occur logically after a collective communication operation and before a period of local computation. This knowledge can be used by a BSP implementation to 'fold in' the barrier synchronisation with the collective communication and reduce the effective cost of the two together. The exact method of doing so depends on the target architecture involved.

Distributed-memory architectures with non-blocking send and blocking receive message-passing operations (IBM SP2, Hitachi SR2001, Alpha Farm, Parsytec GC). Receiving processors do not know how much data to expect. The technique used by the library is to:

1. Perform a total exchange of information about the number, sizes, and destination addresses of messages. Then tag the messages with a small identifier and send them to the remote processors. The up-front total exchange allows direct remote memory access to be implemented by copying incoming data into the correct remote address on receipt of the message, and it also serves as the barrier synchronisation for the superstep.

2. Allow each process to continue when it knows that all messages destined for it have arrived. It does this by counting incoming messages. Incoming and outgoing messages can be interleaved to reduce the total buffer requirements.

The barrier synchronisation is free for such architectures, since an equivalent operation has to be done just to implement the global communication substep. (A sketch of this protocol, assuming a message-passing transport, is given at the end of this section.)

Shared-memory architectures (SGI PowerChallenge, Sun). On such architectures, as we have seen, barriers are cheap. However, shared memory is a resource about which collective decisions, for example about allocation, have to be made, and barrier synchronisations can simplify this process. The following strategy is used in the current implementation of the library:

1. As each call to a communication operation occurs in the code, information about the size and destination of the message is placed in a table in shared memory.

2. A barrier synchronisation takes place to freeze this table. Each process extracts the information about what communication will take place.

3. Data is exchanged in a message-passing style by copying data from each process's private memory into buffers in shared memory associated with the destination process. Each destination process then copies the data from shared memory into its own private address space. Once again, information about what global communication pattern is taking place is used to limit contention in the shared memory.

4. The table is cleared and a further barrier synchronisation allows it to be re-used.

Here we use the existence of a cheap barrier synchronisation to reduce the cost of the global communication substep.

Distributed-memory machines with remote-memory access (Cray T3D). Such architectures are, in some ways, the most difficult to build implementations for, since they provide neither low-level access to memory (which we exploit via cache coherence) nor inherently self-synchronising messages. Fortunately, the only example in this class that makes a suitable target for BSP is the Cray T3D, which provides barrier synchronisation in hardware.
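To make the first of these concrete, here is a minimal sketch of the superstep-ending protocol for message-passing machines. The paper does not name the underlying transport, so this sketch assumes an MPI-like layer, exchanges only message sizes in the total exchange (the library also exchanges counts and destination addresses), and invents the function and variable names; it is not the BSPlib implementation itself.

```c
#include <mpi.h>
#include <stdlib.h>

/* Deliver everything queued for this superstep; returning implies the barrier
 * has been passed, because the size exchange is itself a total exchange. */
static void end_superstep(const unsigned char *payload[], const int nbytes[],
                          int p, int rank, MPI_Comm comm) {
    int *recv_bytes = malloc(p * sizeof(int));
    MPI_Request *reqs = malloc(p * sizeof(MPI_Request));
    int nreq = 0;

    /* 1. total exchange of sizes: also serves as the barrier synchronisation */
    MPI_Alltoall((void *)nbytes, 1, MPI_INT, recv_bytes, 1, MPI_INT, comm);

    /* 2. send the combined payload destined for each other process */
    for (int dest = 0; dest < p; dest++)
        if (dest != rank && nbytes[dest] > 0)
            MPI_Isend((void *)payload[dest], nbytes[dest], MPI_BYTE,
                      dest, 0, comm, &reqs[nreq++]);

    /* 3. count in the messages we now know are coming, copying each into place */
    for (int src = 0; src < p; src++) {
        if (src == rank || recv_bytes[src] == 0) continue;
        unsigned char *buf = malloc(recv_bytes[src]);
        MPI_Recv(buf, recv_bytes[src], MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
        /* ... unpack buf into the registered destination addresses ... */
        free(buf);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    free(recv_bytes);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* toy superstep: send 4 bytes to every other process */
    unsigned char block[4] = { (unsigned char)rank, 1, 2, 3 };
    const unsigned char **payload = malloc(p * sizeof(*payload));
    int *nbytes = malloc(p * sizeof(int));
    for (int d = 0; d < p; d++) { payload[d] = block; nbytes[d] = (d == rank) ? 0 : 4; }

    end_superstep(payload, nbytes, p, rank, MPI_COMM_WORLD);

    free(payload); free(nbytes);
    MPI_Finalize();
    return 0;
}
```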

5 Implications for Parallel Programming

This paper is a response to criticisms of the cost of the BSP style of programming, which have themselves been based on misconceptions about parallel computer performance. These misconceptions are:

1. Postponing communication creates poor performance. We have shown that exactly the opposite is the case. Postponing communication provides the opportunity to arrange for it to take place efficiently, by combining small messages, and by scheduling communication globally to reduce congestion and contention.

2. Barrier synchronisation is too expensive for routine use. We have shown that barrier synchronisation is not expensive for shared-memory architectures, although it requires a subtle implementation to achieve good performance. Barrier synchronisation remains fairly expensive for distributed-memory architectures, mostly because today's low-level software does not give access to the hardware at an appropriate level. In any case, the cost of barrier synchronisation can be partly offset by folding it into global communication, reducing the cost of both.

It is BSP's clear cost model that shows which issues are critical. It concentrates attention on the factors that really affect the performance of parallel architectures. Intuition about these factors has not proven a very realistic guide so far. The results we have shown here have implications for the design of all parallel programming models, languages, and libraries, not just BSP.

References

[1] M.W. Goudreau, J.M.D. Hill, K. Lang, W.F. McColl, S.D. Rao, D.C. Stefanescu, T. Suel, and T. Tsantilas. A proposal for a BSP Worldwide standard. BSP Worldwide, http://www.bsp-worldwide.org/, April 1996.

[2] J.M.D. Hill and D.B. Skillicorn. Practical barrier synchronisation. Technical Report TR-16-96, Oxford University Computing Laboratory, August 1996.

[3] W.F. McColl. General purpose parallel computing. In A.M. Gibbons and P. Spirakis, editors, Lectures on Parallel Computation, Cambridge International Series on Parallel Computation, pages 337-391. Cambridge University Press, Cambridge, 1993.

[4] D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Technical Report TR-15-96, Oxford University Computing Laboratory, August 1996.

[5] L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
