An Integrated Compiler-Time/Run-Time Approach to Reducing Contention in OpenMP Programs

Y. Charlie Hu†, HongHui Lu‡, Alan Cox†, and Willy Zwaenepoel†
† Department of Computer Science
‡ Department of Electrical and Computer Engineering
Rice University
{ychu, hhl, alc, [email protected]}

Abstract

Contention is one of the largest obstacles to the scalability of many software DSM programs. It occurs when multiple threads request data from the same thread simultaneously. In this paper, we present an integrated compiler/runtime approach to reducing communication contention in software DSM programs, in the context of supporting OpenMP on networks of workstations. Our approach relies on a combination of compiler and runtime support to precisely predict the global communication patterns for both regular and irregular applications. The compiler uses regular section analysis to compute each thread's access pattern between synchronization points. The access pattern is then combined with runtime information to derive global communication patterns, which are used by the runtime system to optimize communication. The optimizations include multicast and communication staggering. We measure the effect of the optimizations on a 32-node Pentium-II cluster for four applications: Modified Gramm-Schmidt, 3D-FFT, Integer Sort, and the irregular application Moldyn. The optimizations improve the running time of the four applications by 220%, 38%, 7.4%, and 1.5%, respectively.

1 Introduction

The OpenMP Application Programming Interface (API) is an emerging standard for portable and scalable shared-memory programming. Until recently, it existed only for shared memory architectures. An efficient and scalable implementation on distributed memory machines, and in particular on a network of workstations, will lend increased portability to OpenMP programs and thereby further its acceptance. In our previous paper [20], we developed a preliminary implementation of OpenMP on a network of workstations. The implementation relies on a software distributed shared memory (DSM) system to provide a shared memory abstraction on top of distributed memory machines, and translates OpenMP programs to parallel programs in the fork-join model.

In this paper, we focus on automatically optimizing the compiler-generated programs to achieve better performance on a large number of processors. Our optimizations reduce communication contention, identified as one of the largest factors accounting for the poor scalability of software DSM systems [12]. There are two main causes of contention in OpenMP programs. First, in the fork-join style execution of OpenMP programs, slave threads in the parallel execution following a sequential execution may simultaneously read the data modified during the sequential execution by the master thread. Second, between two parallel loops, the access pattern of the parallel code may result in several threads reading from the same thread simultaneously, whether for the same data or for different data.

Our approach relies on a combination of compiler and runtime support to precisely predict the global communication pattern for both regular and irregular applications. The compiler support required for our approach is simple regular section analysis: for regular applications, it suffices to determine the part

of the data array accessed by each thread [9]. For irregular applications, the indirection array is identified, in addition to the section of the indirection array being accessed by each thread. Access to the indirection array is usually regular. After a global synchronization, the runtime state is combined with each thread's future accesses to draw a complete picture of the communication pattern. According to the communication pattern, the runtime system either multicasts data to a subset of threads, or staggers the communication to reduce contention.

There have been several approaches to generating collective communication for distributed memory systems. Pattern matching has been used to generate collective communication for data parallel programs (see, for example, [18]). However, pattern matching can only recognize a limited number of static communication patterns. Automatic parallelizing compilers such as [6, 3, 1] can generate collective message passing communication through whole-procedure analysis. Again, these compilers handle only regular applications. Inspector-executor methods have been proposed as a way to efficiently execute irregular computations on distributed memory machines [21]. However, sophisticated compiler analysis [25, 2, 11] is required to determine when and where to insert inspectors in order to reduce the high overhead of executing them.

Our compiler analysis is based on the SUIF toolkit [4]. We extend the interface of the TreadMarks runtime DSM system [5] to take advantage of the compiler analysis. We have measured the performance of these techniques on a 32-node Pentium-II cluster for three regular applications, Modified Gramm-Schmidt (MGS), 3D-FFT, and Integer Sort (IS), and an irregular application, Moldyn. Our experiments show that the optimized programs achieve performance improvements of 220%, 38%, 7.4%, and 1.5%, respectively.

The rest of this paper is organized as follows. Section 2 presents some background on the TreadMarks system, the OpenMP to TreadMarks translator, and the causes of contention in OpenMP programs. Section 3 describes the augmented runtime interface. Section 4 presents the compiler analysis used to generate calls to the augmented runtime interface. Section 5 presents the performance results. Finally, we survey related work in Section 6 and conclude in Section 7.

2 Background

2.1 The OpenMP API

OpenMP [14] provides a set of directives that allow the user to annotate a sequential program to indicate how it should be executed in parallel. OpenMP is based on a fork-join model of parallel execution, where the sequential code sections are executed by a single thread, called the master thread, and the parallel code sections are executed by all threads, including the master thread. The OpenMP directives appear either as special Fortran comments or as C pragmas.

OpenMP provides three kinds of directives: parallel and work-sharing directives, data environment directives, and synchronization directives. We only explain the directives relevant to this paper, and refer interested readers to the OpenMP standard [14] for the full specification. The two basic parallel directives are parallel and parallel for. The parallel directive defines a parallel region, which is a block of code to be executed by multiple threads in parallel. The parallel for directive specifies a parallel region that contains a single for loop. It can be followed by a schedule clause to specify how iterations of the loop are distributed. The data environment directives specify which variables are shared or private, and how variables are initialized in a parallel region. The synchronization directives include barrier and critical. A barrier directive causes a thread to wait until all threads in the parallel region have reached this point. A critical directive restricts access to the enclosed code to one thread at a time.
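As a brief illustration (not taken from the paper), the following C fragment exercises the directives just described: a parallel for with a static, chunk-size-one schedule, a parallel region, a critical section, and a barrier. The array, its size, and the critical-section accumulation are choices made only for this example.

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void)
{
    double a[N], sum = 0.0;

    /* parallel for: iterations are handed out cyclically, one at a time */
    #pragma omp parallel for schedule(static, 1) shared(a)
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* parallel region: every thread, including the master, executes this block */
    #pragma omp parallel shared(a, sum)
    {
        double local = 0.0;
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += a[i];

        #pragma omp critical        /* one thread at a time updates sum */
        sum += local;

        #pragma omp barrier         /* wait until every thread has added its share */
    }

    printf("sum = %f\n", sum);
    return 0;
}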


2.2 TreadMarks Distributed Shared Memory

TreadMarks [5] is a user-level DSM system that runs on most commonly available Unix systems and on Windows NT. It provides a global shared address space on top of physically distributed memories. The parallel threads synchronize via primitives similar to those used on hardware shared memory machines: barriers, mutex locks, and condition variables. TreadMarks relies on user-level memory management techniques provided by the operating system to detect accesses to shared memory at the granularity of a page.

A lazy invalidate version of release consistency (RC) and a multiple-writer protocol are employed to reduce the amount of communication involved in implementing the shared memory abstraction. RC is a relaxed memory consistency model. In RC, ordinary shared memory accesses are distinguished from synchronization accesses, with the latter category divided into acquire and release accesses. RC requires ordinary shared memory updates by a thread p to become visible to another thread q only when a subsequent release by p becomes visible to q via some chain of synchronization events. In practice, this model allows a thread to buffer multiple writes to shared data in its local memory until a synchronization point is reached. The lazy implementation delays the propagation of consistency information until the time of an acquire. Furthermore, the releaser notifies the acquiring thread of which pages have been modified, causing the acquiring thread to invalidate its local copies of these pages. A thread incurs a page fault on the first access to an invalidated page, and obtains the up-to-date value for that page from previous releasers. With the multiple-writer protocol, two or more threads can simultaneously modify their own copies of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effect of false sharing.

To support OpenMP-style environments, recent versions of TreadMarks include Tmk_fork and Tmk_join primitives, specifically tailored to the fork-join style of parallelism expected by OpenMP and most other shared memory compilers [4]. For performance reasons, all threads are created at the beginning of the execution. During sequential execution, the slave threads are blocked waiting for the next Tmk_fork issued by the master.
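For concreteness, the fragment below sketches the barrier-based usage implied above, using the TreadMarks primitives described in [5] (Tmk_startup, Tmk_malloc, Tmk_barrier, Tmk_proc_id, Tmk_nprocs, Tmk_exit). The header name and the omission of pointer distribution are simplifications, so treat the exact calls and signatures as assumptions rather than a verbatim TreadMarks program.

#include "Tmk.h"          /* TreadMarks header (name assumed) */

#define N 1024

int *shared_a;            /* points into TreadMarks shared memory */

int main(int argc, char **argv)
{
    Tmk_startup(argc, argv);

    if (Tmk_proc_id == 0) {
        shared_a = (int *) Tmk_malloc(N * sizeof(int));
        /* the real code would distribute this pointer to the other
           processes (Tmk_distribute in the TreadMarks API) */
        for (int i = 0; i < N; i++)   /* master writes during sequential execution; */
            shared_a[i] = i;          /* under lazy RC the writes stay buffered locally */
    }

    Tmk_barrier(0);   /* write notices propagate here; other copies are invalidated */

    /* every thread now faults on its first access to an invalidated page and
       fetches the up-to-date page from the master: the contention scenario
       this paper targets */
    int my_sum = 0;
    for (int i = Tmk_proc_id; i < N; i += Tmk_nprocs)
        my_sum += shared_a[i];

    Tmk_barrier(1);
    Tmk_exit(0);
    return 0;
}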

2.3 An OpenMP to TreadMarks Translator

We have developed a translator for a subset of OpenMP, based on the SUIF toolkit [4]. The translator targets the TreadMarks software DSM system [5]. The translation process is relatively simple, because TreadMarks already provides a shared memory API on top of a workstation cluster. Basically, the compiler encapsulates each parallel region into a separate subroutine. This subroutine also includes code, generated by the compiler, that allows each thread to determine, based on its thread identifier, which portions of a parallel region it needs to execute. Tmk_fork is called by the master before the parallel region, at which time a pointer to this subroutine is passed to the slaves. All threads call Tmk_join at the end of the parallel function, and control is returned to the master thread. Figure 1 shows an OpenMP version of the MGS program, and Figure 2 lists the TreadMarks MGS program translated by our OpenMP translator.

2.4 Contentions in OpenMP Programs

Contention was identified as one of the largest factors accounting for the poor scalability of many software DSM programs by de Lara et al. [12]. In software DSM, contention happens when several threads try to request data from the same thread at once. We summarize two scenarios that may cause contention in OpenMP programs.

float v_[N][N];

void mgs()
{
    ......
    for (i = 0; i < N; i++) {
        /* compute row i .... */
#pragma omp parallel for schedule (static, 1) shared(v_)
        for (j = i; j < N; j++) {
            comp = 0.0;
            for (k = 0; k < N; k++)
                comp += v_[i][k] * v_[j][k];
            for (k = 0; k < N; k++)
                v_[j][k] -= comp * v_[i][k];
        }
    }
}

Figure 1: Pseudo-code for an OpenMP Modified Gramm-Schmidt (MGS) program.

• First, in fork-join style OpenMP programs, access to shared data in the parallel execution following a sequential section may cause contention, because data modified by the master during the sequential section may be read by several slaves simultaneously. For example, in Figure 2, the computation of the pivot row in MGS is performed by the master thread. All threads read the pivot row at the beginning of the parallel region following the sequential code. As a result, all threads fault on the page containing the pivot row and send requests to the master simultaneously. The contention on the master thread causes prolonged waiting time for the slave threads.

• A second scenario of contention arises between two parallel loops. Depending on the access pattern, many-to-many or even all-to-all communication may take place. Contention occurs if several threads happen to access data owned by one thread simultaneously. Take the transpose of a two-dimensional array as an example, and assume the transpose is written as a simple doubly nested loop over the assignment A[j][i] = B[i][j], with the outer loop (over i) parallelized. Contention occurs because all threads read data from one thread at once, as sketched below.
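A minimal sketch of that transpose (illustration only, not code from the paper); the array size and the static schedule are assumed.

#define N 2048

extern double A[N][N], B[N][N];   /* shared arrays kept in DSM memory */

void transpose(void)
{
    /* Thread t handles a block of rows i, so its reads of B[i][j] stay within
       its own block. Its writes to A[j][i], however, sweep through every row
       of A in the same order on every thread, so all threads fault on (and
       fetch) the pages currently held by thread 0 first, then those held by
       thread 1, and so on; this is the contention described above. */
    #pragma omp parallel for schedule(static) shared(A, B)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[j][i] = B[i][j];
}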

3 Augmented Run-Time System

To facilitate the optimizations described in this paper, we introduce three new runtime interfaces, as shown in Figure 3. In summary, Validate_request_direct and Validate_request_indirect build, for each thread, a prefetch schedule that contains the prefetch requests by all threads between two synchronization points. Using this global prefetch pattern, Validate_reply then tries to schedule the replies to avoid contention by using multicast and/or coordinated staggered pushes.

3.1 Validate_request_direct

Validate_request_direct builds communication schedules for regular data accesses. It takes as its argument an access descriptor consisting of the schedule number, pid, access type, and section. The schedule number is an index to the array of schedules. The pid specifies the requesting thread id.


float v_[N][N];

void mgs()
{
    for (i = 0; i < N; i++) {
        /* compute row i .... */
        fork(mgs_func);
        join();
    }
}

void mgs_func(int proc_id, int nprocs, int i)
{
    int j, k;
    float comp;
    int begin = cyclic_begin(i, nprocs, proc_id);
    int end = N;

    for (j = begin; j < end; j += nprocs) {
        comp = 0.0;
        for (k = 0; k < N; k++)
            comp += v_[i][k] * v_[j][k];
        for (k = 0; k < N; k++)
            v_[j][k] -= comp * v_[i][k];
    }
}

Figure 2: Pseudo-code for the TreadMarks MGS program translated from the OpenMP program.

The access type that is relevant to reducing contention is one of READ and READ&WRITE. The section contains the regular section descriptor of the accesses to the data array. Since the access pattern of direct accesses is completely known at compile time, the access descriptors passed to Validate_request_direct can be generated directly by the compiler for all threads, thus avoiding communicating them at runtime.

3.2 Validate_request_indirect

Validate_request_indirect builds communication schedules for irregular data accesses, i.e., accesses through indirection arrays. It takes the same list of arguments as Validate_request_direct, but it differs from the latter in the content of section, and in the way it builds the schedule. First, the section contains a regular section descriptor of the accesses to the indirection array, and an additional parameter, base, points to the data array. In the case of multiple levels of indirection, the base could point to another indirection array, and additional pairs of section and base could be added to express the additional indirections. Second, Validate_request_indirect exchanges prefetch access requests at runtime. This is because the indirection array could have been modified by another thread before the synchronization, in which case the thread needs to re-compute its prefetch requests. Once the prefetch requests are computed, they have to be exchanged with the other threads to build a global schedule. To build a schedule, each thread needs to know whether it has received all the requests. Therefore, we use a global reduction followed by a broadcast to implement the all-to-all exchange of prefetch requests and the acknowledgment of the end of all requests. For subsequent iterations, the schedule is reused if the indirection array remains unchanged, thus avoiding communicating prefetch requests.

In irregular applications, indirection arrays are often generated once and used for many iterations of the computation. Our integrated compile-time/run-time approach can therefore accurately reuse prefetch requests without the sophisticated compiler analysis required by the inspector/executor approach [11].


/* for direct accesses, requests by all processors are generated
   for all processors, thus avoiding communication */
Validate_request_direct(
    int sched_num,      /* schedule number */
    int pid,            /* processor id */
    int access_type,    /* READ, WRITE, READ&WRITE, WRITE_ALL, or READ&WRITE_ALL */
    RSD section         /* section of shared data (through indirection array if necessary) */
)

/* for indirect accesses, requests sent at runtime */
Validate_request_indirect(
    int sched_num,
    int pid,
    int access_type,
    RSD section,
    char *base,
    ... ...
)

/* for carrying out the actual updates, using multicast and
   balanced scheduling as appropriate */
Validate_reply(int sched_num)

Figure 3: Augmented runtime interfaces.
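As an illustration, in the same pseudo-code style as Figure 4, the fragment below shows what the compiler might generate for an irregular access A[index[i]] executed by each thread over its own iteration range. The helpers block_begin and block_end and the bracketed section notation are hypothetical and follow the conventions of Figure 4 rather than a real implementation.

/* each thread describes only its own indirect accesses; the runtime exchanges
   these requests (reduction followed by broadcast) the first time the schedule
   is built, and reuses the schedule while index[] is unchanged */
lo = block_begin(proc_id, nprocs);
hi = block_end(proc_id, nprocs);
Validate_request_indirect(schedule, proc_id, READ,
                          index[lo:hi-1],     /* section of the indirection array read */
                          (char *) A);        /* base: the data array reached through it */
Validate_reply(schedule);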

3.3 Validate_reply

Validate_reply takes a schedule number as its argument. It summarizes all the prefetch requests in that schedule and tries to schedule the replies to avoid contention, by taking advantage of multicast and coordinated staggered pushes of replies. Since the underlying optimal scheduling problem is NP-complete, we use a simple heuristic in our implementation. The same scheduling algorithm is executed by all threads. It first histograms the amount of data that each thread needs to send to and receive from every other thread. If there are requests from multiple threads for data from the same thread, and only one such pattern exists, a multicast is scheduled. Otherwise, the scheduler schedules all replies as staggered pushes. Since the experimental platform we used (Section 5.1) does not support hardware multicast reliably, we implement the multicast communication primitive in software, using a binomial spanning tree [17]. To implement the staggered pushes of replies, every thread i first pushes its replies to thread i + 1, then to thread i + 2, and so on.
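The following is a minimal sketch of this heuristic, not the TreadMarks implementation itself; the request matrix need[r][s] (bytes that requester r prefetches from sender s), the thread count P, and the function name are assumptions made for illustration.

#include <stdio.h>

#define P 4   /* number of threads (assumed) */

void schedule_replies(long need[P][P])
{
    int s, r, hot = -1, npatterns = 0;

    /* Histogram: for each sender, count how many distinct requesters want its data. */
    for (s = 0; s < P; s++) {
        int requesters = 0;
        for (r = 0; r < P; r++)
            if (r != s && need[r][s] > 0)
                requesters++;
        if (requesters > 1) {        /* several threads pull from the same thread */
            npatterns++;
            hot = s;
        }
    }

    if (npatterns == 1) {
        /* Exactly one one-to-many pattern: serve it with a multicast,
           implemented in software over a binomial spanning tree. */
        printf("thread %d multicasts its data to the requesters\n", hot);
        return;
    }

    /* Otherwise stagger the pushes: in round d, thread s pushes to thread
       (s + d) mod P, so no thread receives from more than one sender at a time. */
    for (int d = 1; d < P; d++)
        for (s = 0; s < P; s++) {
            r = (s + d) % P;
            if (need[r][s] > 0)
                printf("round %d: thread %d pushes %ld bytes to thread %d\n",
                       d, s, need[r][s], r);
        }
}

In the actual system each thread runs the same algorithm and only carries out the sends in which it is the sender, which is what makes the staggering coordinated.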

4 Compiler Analysis

The compiler analyzes the translated parallel code. It first performs local access pattern analysis, and then generates calls to feed this information to the runtime system.

The access analysis is based on regular section analysis, which produces regular section descriptors (RSDs) [16] that represent the data accesses as linear expressions of the upper and lower loop bounds along each dimension. For regular applications, regular section analysis determines the data accessed in the parallel region at compile time.

static void mgs_func(int proc_id, int nprocs, int i)
{
    int j, k, pid;
    int begin, end;
    float comp;

    for (pid = 0; pid < nprocs; pid++) {
        begin = cyclic_begin(i, nprocs, pid);
        end = N;
        Validate_request_direct(schedule, pid, READ,
                                v_[i, 0:N-1]);
        Validate_request_direct(schedule, pid, READ&WRITE,
                                v_[begin:end:nprocs, 0:N-1]);
    }
    Validate_reply(schedule);

    begin = cyclic_begin(i, nprocs, proc_id);
    end = N;
    for (j = begin; j < end; j += nprocs) {
        comp = 0.0;
        for (k = 0; k < N; k++)
            comp += v_[i][k] * v_[j][k];
        for (k = 0; k < N; k++)
            v_[j][k] -= comp * v_[i][k];
    }
}

Figure 4: Pseudo-code for the compiler-optimized translated TreadMarks MGS program.

In irregular applications in which data arrays are accessed via indirection arrays, e.g., A[index[i]], the compiler support required in our approach [19] involves determining the indirection array used to access shared data, and the part of the indirection array being accessed. The latter is usually a regular section, and hence can be handled by the compiler framework for regular accesses. Our approach can also be extended to multiple levels of indirection in the access pattern.

For the purpose of reducing contention, calls to Validate are inserted only at the beginning of a parallel region, or right after a barrier. The compiler inserts calls to Validate_request to create a global schedule. Afterwards, all threads call Validate_reply to communicate data accordingly. For parallel for loops with static loop partitions, we take advantage of the fact that all access patterns are known at compile time. The compiler generates code so that each thread duplicates the computation of every thread's access pattern. As a result, runtime exchanges of data requests can be avoided.

Figure 4 shows the transformations applied to the mgs_func subroutine in MGS. The compiler inserts calls to Validate_request_direct, which allow every thread to log all prefetch requests, and a call to Validate_reply, which performs the replies. Since Validate_reply has global knowledge of all the prefetch requests, it can detect multiple prefetches from different threads for the same data that the local thread owns, and use the multicast primitive to implement the replies, reducing the effect of contention.
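One plausible encoding of an RSD, shown only to make the bracketed notation in Figure 4 concrete; the field names and the fixed dimension limit are assumptions, not definitions from [16] or from the implementation.

/* Each dimension of a regular section is a lower:upper:stride triple, so the
   descriptor v_[begin:end:nprocs, 0:N-1] in Figure 4 would be encoded as
   dims[0] = {begin, end, nprocs} and dims[1] = {0, N-1, 1}. */
typedef struct {
    long lower;    /* first index accessed in this dimension        */
    long upper;    /* last index accessed in this dimension         */
    long stride;   /* distance between consecutive accessed indices */
} RSD_dim;

typedef struct {
    void    *array;    /* base address of the shared array          */
    int      ndims;    /* number of dimensions actually used        */
    RSD_dim  dims[4];  /* one triple per dimension (limit assumed)  */
} RSD;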

5 Experiments

5.1 Platform

Our experimental environment is a switched, full-duplex 100 Mbps Ethernet network of 32 300-MHz Pentium II-based computers. Each computer has a 512-Kbyte secondary cache and 256 Mbytes of memory. All of the computers run FreeBSD 2.2.6 and communicate through UDP sockets. On this platform, the round-trip latency for a 1-byte message is 126 microseconds. The time to acquire a lock varies from 178 to 272 microseconds. The time for a 32-processor barrier is 1,333 microseconds. The time to obtain a diff varies from 313 to 1,544 microseconds, depending on the size of the diff. The time to obtain a full page is 1,308 microseconds.

Application | Size/Iteration              | Sequential Time (sec.) | OpenMP Parallel Directives
MGS         | 2K x 2K                     | 343.9                  | parallel for
3D-FFT      | 7 x 7 x 7, 10               | 109.5                  | parallel for
IS          | 64M keys, 64K buckets, 50   | 836.0                  | parallel region/for
Moldyn      | 16384, 20 iters, 2 rebuilds | 287.4                  | parallel for

Table 1: Applications, input data sets, sequential execution times, and parallel directives in the OpenMP programs.

5.2 Applications and Their OpenMP Implementation

We use four applications in this study: MGS, NAS 3D-FFT, NAS IS, and Moldyn. Table 1 summarizes the problem sizes, the sequential running times, and the parallel directives used in the OpenMP implementations of the applications. The sequential running times are used as the basis for the speedup figures reported in the next section.

Modified Gramm-Schmidt (MGS) computes an orthonormal basis for a set of N-dimensional vectors. At the ith iteration, the algorithm first sequentially normalizes the ith vector, then makes all vectors j > i orthogonal to vector i in parallel. The updates on the vectors are assigned to the threads in a cyclic manner to balance the load. All threads synchronize at the end of each iteration. In OpenMP, the normalization of each vector is performed by the master thread. The parallel updates on all vectors with the cyclic assignment are expressed using parallel for with a static schedule and a chunk size of one.

3D-FFT from the NAS benchmark suite [7] solves a partial differential equation using three-dimensional forward and inverse FFTs. The program has three shared arrays of data elements and an array of checksums. The computation is decomposed so that every iteration includes local computation and a global transpose, with both expressed using the parallel for directive.

Integer Sort (IS) [7] from the NAS benchmarks ranks an unsorted sequence of keys using bucket sort. The parallel version of IS divides the keys among the threads. First, each thread counts its own keys and writes the result into a private array of buckets. Next, the threads compute the global array of buckets by adding the corresponding elements of each private array of buckets. Finally, all threads rank their keys according to the global array of buckets. To obtain good parallelism, the bucket array is divided equally into p blocks, where p is the number of threads. The global buckets are computed in p steps. In each step, a thread works on one of the blocks, and moves on to another one in the next step. A barrier synchronizes all threads after the updates, and each thread then reads the final result in the shared array of buckets and ranks its keys. In OpenMP, the per-thread blocking of the shared bucket array is expressed using the parallel region directive.
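A sketch of the p-step blocked merge described above; the specific rotation (thread t updates block (t + step) mod p in step step) and all identifiers are assumptions made for illustration, not the NAS IS source.

#include <omp.h>

#define NBUCKETS (64 * 1024)

extern int global_bucket[NBUCKETS];       /* shared bucket array */
extern int private_bucket[][NBUCKETS];    /* one private bucket array per thread */

void merge_buckets(int p)                 /* assumes p divides NBUCKETS */
{
    #pragma omp parallel num_threads(p)
    {
        int t = omp_get_thread_num();
        int blk = NBUCKETS / p;

        for (int step = 0; step < p; step++) {
            int b = (t + step) % p;       /* each thread owns a different block in each step */
            for (int k = b * blk; k < (b + 1) * blk; k++)
                global_bucket[k] += private_bucket[t][k];
            #pragma omp barrier           /* keep the steps in lockstep so no block is shared */
        }
    }
}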

Moldyn is a molecular dynamics simulation. Its computational structure resembles the non-bonded force calculation in CHARMM [8]. Non-bonded forces are long-range interactions existing between each pair of molecules. CHARMM approximates the non-bonded calculation by ignoring all pairs that are beyond a certain cutoff radius. The cutoff approximation is achieved by maintaining an interaction list of all the pairs within the cutoff distance, and iterating over this list at each time-step.


(Figure 5 is a bar chart of speedups, on a scale of 0 to 32, for MGS, FFT, IS, and Moldyn, with four bars per application: Base/16-node, Optimized/16-node, Base/32-node, and Optimized/32-node.)

Figure 5: Performance comparison of directly translated TreadMarks programs and compiler-optimized TreadMarks programs.

The interaction list is used as an indirection array to identify interacting partners. Since molecules change their spatial locations every iteration, the interaction list must be periodically updated. The original code uses symmetry in evaluating the forces among the molecules: in constructing its interaction lists, a thread only needs to look at the next half of the total molecules (including wrap-around). The only way to express this symmetry is to write explicit parallel code using parallel region. Such code also requires extra synchronized reductions to sum up the contributions from different threads. Interestingly, the use of symmetry forces the code to stagger the parallel accesses to the molecule array, and the resulting code has no contention in communication. Since we are interested in assessing contention in code that is not highly optimized, we modified Moldyn not to use symmetry. The construction step then needs to examine all pairs of molecules. The code is expressed using the parallel for directive, and the inner loop goes through all molecules from the first to the last. On 32 nodes, this simple code is only 20% slower than the one using symmetry, thanks to the elimination of the reductions.
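For reference, a compact sketch (not the Moldyn source) of the two access patterns the compiler has to handle in this code: the construction loop reads the coordinate array directly, while the force loop reaches it only through the interaction list. The identifiers, the bounds, and the trivial force kernel are all placeholders.

typedef struct { int i, j; } pair_t;

/* one thread's share of the (non-symmetric) construction step:
   rows lo..hi-1 of the all-pairs loop, with direct, regular accesses to x[] */
int build_interactions(double x[][3], int n, int lo, int hi,
                       pair_t list[], double cutoff2)
{
    int count = 0;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < n; j++) {      /* inner loop goes over all molecules */
            if (j == i) continue;
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            if (dx * dx + dy * dy + dz * dz < cutoff2)
                list[count++] = (pair_t){ i, j };
        }
    return count;
}

/* the force loop: x[] is reached indirectly through the interaction list,
   which is why the compiler emits indirect prefetch requests for it
   (the force kernel itself is a placeholder) */
void compute_forces(double x[][3], double f[][3], pair_t list[], int nlist)
{
    for (int k = 0; k < nlist; k++) {
        int i = list[k].i, j = list[k].j;
        double dx = x[i][0] - x[j][0];
        f[i][0] += dx;
    }
}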

5.3 Results

Figure 5 shows the speedups achieved by the directly translated TreadMarks programs and by the compiler-optimized TreadMarks programs for all applications. Overall, the compiler optimizations substantially reduce the execution time of the optimized code in comparison to the unoptimized code, by 1.5% to 220% on 32 processors.

Among all our applications, MGS achieves the largest reduction in execution time: a factor of 1.36 on 16 processors and a factor of 2.2 on 32 processors. In fact, the only communication in MGS happens after each sequential normalization of a row, when all threads read the newly normalized row from the master. The optimization generates a multicast to send the row to all threads. In moving from 16 processors to 32 processors, the contention is further aggravated. As a result, the unoptimized code suffers a slowdown, while the optimized code sustains a speedup of 30%.

Note that MGS is just one instance of a class of applications that exhibit contention from concurrent requests for data updated during a sequential execution. Other examples include Gauss and LU [26, 23], for which our optimization is also applicable.

3D-FFT suffers from the second type of contention during the transpose phase, when all threads request different data, all from thread 0 first, then from thread 1, and so on. Through regular section analysis, our compiler generates prefetch requests for all data to be accessed during the transpose phase, and the augmented runtime system schedules the updates in a staggered fashion to eliminate contention. The optimized code achieves 23% and 38% reductions in execution time on 16 and 32 processors, respectively.

In IS, after the parallel updates of the shared bucket array, each thread is the single writer of the block that it just updated. When every thread reads the whole bucket array in the same order, all starting from the beginning, contention occurs on one thread at a time, similar to that in FFT. Through regular section analysis, the compiler generates prefetch requests. The augmented runtime scheduler detects that the updates can be performed either as p multicasts, where p is the number of threads, or as a sequence of p - 1 parallel staggered pushes. A lower bound on the number of parallel receives (or sends) is p - 1, since the p threads need to receive a total of p(p - 1) messages. Since staggering achieves the lower bound without causing any potential contention, while parallel multicasts can cause contention on intermediate threads, staggering is used by the runtime system for IS. The optimized IS code achieves 8.5% and 7.4% improvements on 16 and 32 processors, respectively, compared with the unoptimized code.

Moldyn is an example of an irregular application. The construction of the interaction lists directly accesses the coordinates of all molecules. The accesses in the subsequent iterations, before the next reconstruction of the interaction lists, are indirect through the interaction array. Our compiler generates prefetch requests for both the direct and the indirect accesses. The indirect prefetch requests are communicated at runtime in the first iteration after the reconstruction of the interaction lists. Afterwards, the schedule remains unchanged during subsequent iterations, until the next reconstruction of the interaction lists. The prefetches for the direct accesses form an all-to-all communication, and are scheduled as staggered pushes. For the indirect accesses in each of the subsequent iterations, the three pages on each thread are requested by 18, 9, and 18 other threads, respectively. All prefetch requests are scheduled as staggered pushes. The improvement of the optimized code, 1.5% on 32 nodes, however, is almost negligible.

The poor performance improvement can be understood by looking closely at the construction of the interaction lists. After the contention on the first page, the threads fall into a convoy formation, and each thread has to perform a substantial amount of computation before going after the second page. Because of this formation, the requests for the second page arrive at staggered time intervals, and thus incur no contention. The convoy continues since there is no synchronization throughout the construction step. The effect of the contention on the first page thus amounts to only a few milliseconds.

6 Related Work

There have been several papers on integrated compile-time/run-time approaches to optimizing software DSM programs. Dwarkadas et al. [13] were the first to propose the integrated approach to optimize regular software DSM programs. Their optimizations focused on aggregating data communication, aggregating data communication with synchronization, eliminating consistency overhead, and replacing global synchronization with point-to-point synchronization. Lu et al. [19] extended communication aggregation to irregular software DSM programs. Chandra and Larus [10] performed a study similar to that of Dwarkadas et al., but in the context of compiling HPF programs to fine-grain software DSM. Their optimizations emphasize aggregating data transfer and avoiding runtime overhead. Han et al. [15] implemented optimizations for compiler-generated software DSM programs. Their optimization eliminates redundant barrier synchronizations or replaces a barrier with nearest-neighbor synchronization.

In this paper, we extend the integrated compile-time/run-time approach to solve a new problem, reducing communication contention, for both regular and irregular OpenMP programs. The key observation is that global communication patterns can be inferred by combining the local access patterns generated by the compiler with runtime information, and can then be used by the runtime system to optimize communication collectively.

Brazos [22] is a pure runtime system that exploits hardware multicast on a small cluster of bus-connected NT machines to reduce contention. The issuing of multicasts is speculative and is only effective when the communication pattern repeats, i.e., the same piece of data will be sent to the same thread again. Applications like MGS, in which the one-to-all communication involves a different pivot row from one iteration to the next, do not satisfy this restriction.

There have been several compiler techniques for generating collective communication for distributed memory systems. Li and Chen [18] used pattern matching to generate collective communication for data parallel programs. Their algorithm can only handle a limited number of regular applications. Automatic parallelizing compilers such as [6, 3, 1] contain the analysis to generate collective message passing communication. Again, these compilers handle only regular applications. In contrast to the per-thread local regular section analysis in our method, their compilers have to analyze a larger portion of the program to identify the communication pattern, and are thus more complicated.

The inspector-executor method has been proposed as a way to efficiently execute irregular computations on distributed memory machines [21]. A separate loop, the inspector, precedes the actual computational loop (called the executor). The inspector loop determines the data read and written by the individual threads executing the computational loop. This information is then used to compute a communication schedule, moving the data from the producers to the consumers at the beginning and/or end of each loop. Communication optimization can be applied when exchanging the data. In order to further reduce overhead, an attempt is made to execute the inspector loop only once for a large number of iterations of the executor loop. It has been argued that part or all of the above procedure can be automated by a compiler [25]. The compiler analysis involved can, however, be quite complicated [2, 11, 24].

7 Conclusion

We have demonstrated an integrated compiler/runtime approach to reducing contention in OpenMP programs on a network of workstations. The compiler uses simple regular section analysis to compute each thread's access pattern between synchronization points. The access pattern is then combined with runtime information to derive precise global communication patterns for both regular and irregular applications. The runtime system exploits the communication patterns to optimize communication using multicast and communication staggering.

We measured the effect of the optimizations on a 32-node Pentium-II cluster for four applications: Modified Gramm-Schmidt, 3D-FFT, Integer Sort, and the irregular application Moldyn. For all of these applications, our combined system is able to uncover optimization opportunities. The actual performance improvement depends on the level of contention in the programs. Specifically, the optimizations improve the running time of the four applications by 220%, 38%, 7.4%, and 1.5%, respectively.

References

[1] A. Agarwal, D. Kranz, and V. Natarajan. Automatic partitioning of parallel loops and data arrays for distributed shared memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1995.

[2] G. Agarwal and J. Saltz. Interprocedural compilation of irregular applications for distributed memory machines. In Proceedings of Supercomputing '95, December 1995.


[3] S. Amarasinghe and M. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the ACM SIGPLAN 93 Conference on Programming Language Design and Implementation, June 1993.

[4] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.

[5] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, February 1996.

[6] J. Anderson and M. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN 93 Conference on Programming Language Design and Implementation, June 1993.

[7] D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS parallel benchmarks. Technical Report TR RNR-91002, NASA Ames, August 1991.

[8] B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4:187, 1983.

[9] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. Journal of Parallel and Distributed Computing, 5:517-550, 1988.

[10] S. Chandra and J. Larus. Optimizing communication in HPF programs on fine-grain distributed shared memory. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 100-111, June 1997.

[11] R. Das, P. Havlak, J. Saltz, and K. Kennedy. Index array flattening through program transformation. In Proceedings of Supercomputing '95, December 1995.

[12] E. de Lara, Y.C. Hu, A.L. Cox, and W. Zwaenepoel. Scalability of page-based software shared memory systems. Submitted to PPoPP99, October 1999.

[13] S. Dwarkadas, A.L. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 186-197, October 1996.

[14] The OpenMP Forum. OpenMP Fortran Application Program Interface, Version 1.0. http://www.openmp.org, October 1997.

[15] H. Han, C.-W. Tseng, and P. Keleher. Eliminating barrier synchronization for compiler-parallelized codes on software DSMs. International Journal of Parallel Programming, October 1998.

[16] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[17] S. Lennart Johnsson and Ching-Tien Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249-1268, September 1989.

[18] J. Li and M. Chen. Compiling communication-efficient programs for massively parallel machines. IEEE Transactions on Parallel and Distributed Systems, 2(3):361-376, July 1991.

[19] H. Lu, A.L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. Software distributed shared memory support for irregular applications. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 48-56, June 1996.

[20] H. Lu, Y. C. Hu, and W. Zwaenepoel. OpenMP on networks of workstations. In Proceedings of Supercomputing '98, November 1998.

[21] J. Saltz, H. Berryman, and J. Wu. Multiprocessors and run-time compilation. Concurrency: Practice and Experience, 3(6):573-592, December 1991.

[22] W.E. Speight and J.K. Bennett. Using multicast and multithreading to reduce communication in software DSM systems. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.

[23] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997.


[24] R. von Hanxleden and K. Kennedy. Give-N-Take: a balanced code placement framework. In Proceedings of the ACM SIGPLAN 94 Conference on Programming Language Design and Implementation, June 1994.

[25] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proceedings of the 5th Workshop on Languages and Compilers for Parallel Computing, August 1992.

[26] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.
