HPF { State of the Art

HPF { State of the Art Harald J. Ehold1 Wilfried N. Gansterer2 Christoph W. Ueberhuber2 1

VCPC, European Centre for Parallel Computing at Vienna, Liechtensteinstrasse 22, A-1090 Vienna, Austria E-Mail: [email protected]

Institute for Applied and Numerical Mathematics, Technical University of Vienna, Wiedner Hauptstrasse 8-10, A-1040 Vienna, Austria E-Mail: [email protected], [email protected] 2

The work described in this report was supported by the Special Research Program SFB F011 \AURORA" of the Austrian Science Fund.

Preface Programming parallel computers is an intricate task. As an alternative to explicit message passing programming, which is error-prone and time consuming, several attempts were made to develop and to standardize high level parallel programming languages that hide details of particular architectures from the programmer. Examples are High Performance C++ (HPC++) and Java. At the moment, best suited for numerical high performance parallel computing seems to be High Performance Fortran (HPF), which is based on Fortran, the classic (sequential) programming language for high performance numerical computing. In this report, the advantages and disadvantages of the HPF approach are identi ed and the progress in the realization of the concept in the form of compilers for the language is evaluated. This report is the result of an Austrian research project called AURORA which focuses on high-level software for high performance computing systems. The related research issues cover the range from models, applications, and algorithms to languages, compilers, and programming environments. Synopsis. The report is organized as follows: In Chapter 1 the evolution of HPF is described. In Chapter 2 an overview of recent publications related to HPF is given. Chapter 3 summarizes the results of two surveys. In one of the surveys some experts were asked about their views on the present state and the future evolution of HPF. In a second survey, users of HPF were asked to prioritize their expectations and demands from HPF in the near future. In Chapter 4 results of some performance oriented experiments made with HPF are presented. Finally, in Chapter 5 our view on the present state of HPF and conceivable future developments are summarized. The appendix gives an overview of HPF compilers and tools, institutions where HPF is used, Web sites that oer training and courses, projects dealing with HPF, a glossary and a comprehensive bibliography. Acknowledgments. We would like to thank everybody who took the time to answer our e-mail survey. Wolfgang Moser contributed the numerical experiments of Section 4.1. Dieter Kvasnicka provided useful comments on an early draft of this paper. John Merlin has proofread the manuscript and helped us to improve the quality of this paper. We wish to express our thanks to all of them. In addition, we would like to acknowledge the nancial support of the Austrian Science Foundation (FWF). February 1998

Harald J. Ehold Wilfried N. Gansterer Christoph W. Ueberhuber

The authors of this report would appreciate comments from readers concerning the language HPF and its future. Harald J. Ehold Wilfried N. Gansterer Christoph W. Ueberhuber

[email protected] [email protected] [email protected]

Contents 1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1.1

1.2 1.3

The Evolution of HPF . . . . . . . . . . . . . 1.1.1 HPF 1.0 . . . . . . . . . . . . . . . . 1.1.2 HPF 1.1 . . . . . . . . . . . . . . . . 1.1.3 HPF 2.0 . . . . . . . . . . . . . . . . 1.1.4 HUG . . . . . . . . . . . . . . . . . . Language Elements of HPF . . . . . . . . . . 1.2.1 Subset HPF . . . . . . . . . . . . . . 1.2.2 HPF 1.1 . . . . . . . . . . . . . . . . 1.2.3 HPF 2.0 . . . . . . . . . . . . . . . . Language Elements of HPF+ . . . . . . . . . 1.3.1 Comparison of HPF+ with HPF 2.0 .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

2 Literature Survey : : : : : : : : : : : : : : : : : : : : : : : : : : : : 2.1

2.2 2.3 2.4 2.5 2.6

Academic Research . . . . . . . . . . . . . . 2.1.1 Language Advancement . . . . . . . 2.1.2 Compiler Development . . . . . . . . 2.1.3 Load Balancing . . . . . . . . . . . . Comparisons, Evaluations, and Benchmarks . Applications . . . . . . . . . . . . . . . . . . Tools . . . . . . . . . . . . . . . . . . . . . . Language Surveys . . . . . . . . . . . . . . . Interfacing to Libraries . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

1 2 2 2 3 3 4 4 5 5 6 8 9 9 9 10 11 12 12 14 15 16

3 Expert Opinions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 17 3.1

3.2

E-mail Survey . . . . 3.1.1 Questions . . 3.1.2 Responses . . 3.1.3 List of People HUG Survey . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

17 17 18 25 25

4 Case Studies : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 28 4.1

4.2

Parallel Matrix-Matrix Multiplication 4.1.1 The Test Program . . . . . . 4.1.2 HPF on a DEC Alpha Server 4.1.3 HPF on a Meiko CS-2 . . . . Parallel Gaussian Elimination . . . . 4.2.1 The Test Program . . . . . . 4.2.2 Performance Results . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

28 28 30 32 34 34 37

5 Conclusions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 42

A HPF Compilers and Tools : : : : : : : : : : : : : : : : : : : : : : : 44

A.1 Commercial HPF Compilers . . . . . . . . . . . . . . . . . . . . 44 A.2 Research HPF Compilers . . . . . . . . . . . . . . . . . . . . . . 45 A.3 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

B Institutions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 46

B.1 Universities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 B.2 Research Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 B.3 Commercial Institutions . . . . . . . . . . . . . . . . . . . . . . . 47

C Training Material and Courses D Projects : : : : : : : : : : : : : : Glossary : : : : : : : : : : : : : : : : Bibliography : : : : : : : : : : : : : Books . . . Journals . . Proceedings Theses . . . Reports . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

: : : : : : : : : : : : : : : : : : : : 48 : : : : : : : : : : : : : : : : : : : : 49 : : : : : : : : : : : : : : : : : : : : 50 : . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

: . . . . .

52 52 54 71 97 107

Author Index : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 109

Chapter 1

Introduction High Performance Fortran (HPF) is a language de nition designed to simplify the task of programming data parallel applications. Originally it was intended for distributed memory MIMD machines. Recently, aspects of parallel programming on dierent types of parallel architectures, like shared memory symmetric multiprocessing (SMP) and distributed shared memory (DSM), have become increasingly important for HPF. It is generally accepted that the largest obstacle to the widespread use of distributed memory message-passing systems is the diculty encountered in programming them. It is necessary to explicitly partition data, organize message passing instructions, handle boundary cases, etc., which is a very complicated, time-consuming and error-prone task. It also impairs the adaptability and portability of the resulting programs. HPF removes this burden from the programmer. It comprises a set of extensions to the standard Fortran language. The central idea of HPF is to augment a standard Fortran program with directives that specify the distribution of data across memory, which is physically distributed but can be shared from the programming point of view (virtually shared). The HPF compiler then handles the intricate problem of partitioning the data according to the data distribution directives and allocating computation to processors according to the locality of the data references involved. This includes inserting instructions necessary for data transfer in an implementation dependent way, for example, by message-passing or by a (possibly virtual) shared memory mechanism. HPF is also designed to be largely architecture independent. It can be implemented across the whole spectrum of multi-processor architectures: distributed and shared memory computers, MIMD and SIMD computers, vector computers, workstation networks, etc. It can even be implemented on single-processor systems because data distribution is speci ed by means of directives , i. e., structured comments which do not aect program semantics. These directives are signi cant only to an HPF compiler, but not to an \ordinary" Fortran compiler. HPF was designed to make the programming of distributed memory systems easier without sacri cing the portability of the resulting programs. This paper not only addresses the current HPF standard, HPF 2.0, but it also contains information about HPF+, a language de nition which extends the scope of HPF. In Section 1.3 a short description of this language is given and in Section 1.3.1 it is compared to HPF 2.0.

2

1. Introduction

1.1 The Evolution of HPF The HPF language de nition was developed by the High Performance Fortran Forum (HPFF), a working group comprising representatives of most parallel computer manufacturers, several compiler vendors, and a number of governmental and academic research groups in the eld of parallel computation. From 1992 to 1997 there have been three rounds of HPFF meetings so far.

1.1.1 HPF 1.0 The rst version of the language de nition, HPF 1.0, was nished in 1993 following an initial meeting convened in January 1992 by Ken Kennedy from Rice University and Georey Fox from Syracuse University. The design of HPF's data distribution features was strongly in uenced by Fortran D and Fortran 90D (Fox et al. [195], Wu and Fox [205]) and Vienna Fortran (Zima et al. [206]). There were several predecessors of HPF, which provided signi cant contributions to the language development. They extended the Fortran programming language, for example, by allowing to specify data distribution. Examples of these are Distributed Fortran 90 (Merlin [139, 201]) and Pandore (Andre et al. [78]). HPF was also in uenced by various Fortran dialects from vendors such as Digital (DEC [194]), Convex, Cray (Pase et al. [202]), IBM (Sanz et al. [203]), MasPar (MasPar [200]) and Thinking Machines Corporation (TMC [204]). Input into the HPF development also came from a variety of other sources (see HPF Forum [34] for further references). The HPF 1.0 language de nition was published by the HPF Forum [34]. This document also de ned an ocial subset of the language, Subset HPF , to facilitate early implementation. In particular, Subset HPF did not contain the full Fortran 90 basis of HPF 1.0, but only Fortran 77 with a subset of the Fortran 90 features. In practice, all early HPF implementations, which started to appear in about 1994, concentrated on Subset HPF features rather than attempting to support all of HPF. A number of features that were considered but not accepted into HPF 1.0 were presented in a separate document, the \HPF Journal of Development" (HPF Forum [35]). These features were rejected because of lack of time or consensus, or in order to minimize direct extensions to Fortran 90, rather than because of technical aws, and so were documented to be used in future language design activities. A textbook about HPF 1.0 has also been published (Koelbel et al. [3]) as have a number of HPF language tutorials, e. g., Merlin and Hey [51].

1.1.2 HPF 1.1 A second set of meetings of the HPFF from April to October 1994 concentrated on corrections, clari cations, and interpretations of HPF 1.0. These activities resulted in HPF 1.1, a revised and corrected version of the language speci cation

1.1 The Evolution of HPF

3

(HPF Forum [196]). Some requirements for future versions were also identi ed at this time.

1.1.3 HPF 2.0 A third set of meetings of the HPFF was held from January 1995 to February 1997 to develop further extensions to HPF. The aim of these was to broaden HPF's applicability by providing features such as enhanced data distribution, task parallelism and computation control, parallel I/O, and directives to assist communication optimization (HPF Forum [198]). However, it became clear that vendors were reluctant to greatly extend the basic HPF language for fear of delaying commercial implementations and/or encouraging partial implementations, thus undermining HPF's credibility and use. The outcome was the de nition of a new HPF base standard, HPF 2.0 , which is quite similar to Subset HPF in terms of its HPF extensions. Probably the major dierence between Subset HPF and HPF 2.0 is that the former was based on Fortran 77 and a subset of Fortran 90 features that were considered important for parallelism (e. g., array syntax), while HPF 2.0 is based on full Fortran 95. All other HPF extensions not part of HPF 2.0, both new and old, are designated HPF 2.0 Approved Extensions . The idea was that a standard conforming HPF 2.0 compiler should provide full support for the HPF 2.0 features, but should not be required to support any of the Approved Extensions. Presumably the Approved Extensions will tend to be provided only if there is sucient demand from users. The status \Approved Extension" should make it possible to be more exible to deviate from the detailed speci cation, possibly allowing improvements in the light of implementation and user experience. Those features that turn out to be widely used can then be incorporated into the next version of the HPF base standard. The speci cation of the HPF 2.0 language and Approved Extensions can be found in HPF Forum [197].

1.1.4 HUG A rst meeting of HPF users was held in Santa Fe, New Mexico, USA on 24{ 26 February 1997. At this meeting the HPF User Group (HUG) was formally established. It was set up as a forum for the exchange and dissemination of information about HPF, e. g., HPF compilers and tools, application development, projects, training, benchmarking and performance evaluation, research etc. The HPF User Group intends to hold meetings annually. The second HUG meeting will be held in association with the conference VECPAR '98 on 25{26 June 1998 in Porto, Portugal. More details can be found on the HUG Web site http://www.vcpc.univie.ac.at/information/HUG/.

4

1. Introduction

1.2 Language Elements of HPF HPF 1.0 extends Fortran 90 for data parallel programming on NUMA architectures. The following is a summary of its core, \Subset HPF". For precise details the reader is referred to the HPF 1.0 language de nition at the Web site http://www.crpc.rice.edu/HPFF/hpf1/index.html or at http://www.vcpc. univie.ac.at/information/mirror/HPFF/hpf1/index.html.

1.2.1 Subset HPF

The language de nition of Subset HPF contains All Fortran 77 standard conforming features, except for storage and sequence association, and the following Fortran 90 features. { The Fortran 90 de nitions of MIL-STD-1753 features. | The DO WHILE statement; | The END DO statement; | The IMPLICIT NONE statement; | The INCLUDE line; | The scalar bit manipulation intrinsic procedures: IOR, IAND, NOT, IEOR, ISHFT, ISHFTC, BTEST, IBSET, IBCLR, IBITS, MVBITS; | Binary, octal and hexadecimal constants for use in DATA statements; { Arithmetic and logical array features; { Dynamic storage allocation; { Intrinsic procedures; { Nearly all of the new Fortran 90 intrinsic procedures; { Explicit procedure interfaces; { Lexical improvements, e. g., names up to 31 characters, \!" initiated comments, use of \ " in names; The basic data distribution and alignment directives: ALIGN, DISTRIBUTE, PROCESSORS and TEMPLATE; The FORALL statement (but not the FORALL construct); The INDEPENDENT directive; The SEQUENCE and NO SEQUENCE directives; The system inquiry intrinsic functions NUMBER OF PROCESSORS and PROCESSORS SHAPE; The computational intrinsic function ILEN, and the HPF extended Fortran intrinsics (additional argument `DIM') MAXLOC and MINLOC with the restriction that any actual argument expression corresponding to an optional DIM argument must be an initialization expression.

1.2 Language Elements of HPF

5

1.2.2 HPF 1.1

In addition to Subset HPF, the following features were part of HPF 1.1.

The REALIGN, REDISTRIBUTE, and DYNAMIC directives; The INHERIT directive used with a dist-format-clause or a transcriptive

(\lone star", either explicitly or implicitly) dist-target; The PURE function attribute; The FORALL construct; The HPF library and the HPF LIBRARY module; Actual argument expressions corresponding to optional DIM arguments in the Fortran 90 MAXLOC and MINLOC intrinsic functions which are not initialization expressions; The EXTRINSIC function attribute.

1.2.3 HPF 2.0

After revising HPF 1.1, the current version HPF 2.0 was designed. It consists of Standard Fortran used as the base language for extensions. This implies that HPF 2.0 includes all features added to Fortran in the 1995 revision. HPF 1.1 and loop-based reductions. The REALIGN, REDISTRIBUTE, and DYNAMIC directives as well as distribution of storage-associated variables were taken out. The Approved Extensions subsume the following features. { Extensions for Data Mapping | The REALIGN directive; | The REDISTRIBUTE directive; | The DYNAMIC directive; | The GEN BLOCK directive; | The INDIRECT directive; | The RANGE directive; | The SHADOW directive; | The Mapping to processor subsets; | The explicit mapping of pointers and components of derived types; { Extensions for Data and Task Parallelism | The ON directive; | The RESIDENT directive;

6

1. Introduction

| The TASK REGION directive;

{ Extensions for Asynchronous I/O | An additional parameter for READ/WRITE statements; | The WAIT statement; { Extensions to Intrinsic and Library Procedures | Additional intrinsic procedures: ACTIVE NUM PROCS,

ACTIVE PROCS SHAPE; | The generalized intrinsic procedure TRANSPOSE; | The new HPF LIBRARY procedures HPF MAP ARRAY and HPF NUMBER MAPPED; | The revised HPF LIBRARY procedures HPF ALIGNMENT, HPF DISTRIBUTION, and HPF TEMPLATE; { Approved Extensions for HPF Extrinsics | Interfaces designed to facilitate interoperability with other languages; | Interfaces for dierent models of parallelism (LOCAL for SPMD parallel, SERIAL for single-process sequential and F77 LOCAL especially for Fortran 77 codes).

Currently, research is being conducted into extending the scope of HPF 2.0, e. g., task parallelism, interoperability with other programming languages and/or with other programming paradigms. One important example of a language extending HPF 2.0 for more general applications is HPF+.

1.3 Language Elements of HPF+ HPF+ is based on Standard Fortran 90/95 and on the HPF 1.0 and HPF 2.0 language speci cations (HPF Forum [34, 197]). Many of the new HPF+ language constructs are already included in Vienna Fortran (Zima et al. [206]) and Vienna Fortran 90 (Benkner [171]). They were originally designed as proposals for HPF 2.0 (HPF Forum [197]). The data distribution features of HPF 2.0 are extended in HPF+ by introducing new mechanisms which are applicable to irregular data structures. The handling of processors is generalized in order to facilitate the speci cation of distributions and to in uence load balancing . The concept of independent loops has been extended primarily to handle irregular access patterns . This includes a facility for specifying reusable communication schedules. The provision of non-standard and irregular distributions, parallel loops with indirect array accesses, and additional language support for work distribution, reduction operations, and the reuse of communication schedules is fundamental to many industrial applications. Other features were designed to correct current drawbacks and de ciencies in HPF 2.0, e. g., the handling of processors, argument transfer at procedure boundaries, and others.

1.3 Language Elements of HPF+

7

The following list summarizes the language features available in the current version of HPF+. All changes to HPF 2.0 are marked accordingly: features that have been added new , features that have been extended or modi ed modified . New features that are also contained in the Approved Extensions of HPF 2.0 are written in italics. The handling of processor arrays { Establishing relationships between processor arrays new ; { De ning and naming subsets and reshaping processor arrays new . Basic Data Mapping Mechanisms { BLOCK(m), CYCLIC(m) distributions; { Alignment; { Static Mappings and Dynamic Mappings. Extension of Data Mapping Mechanisms { Mapping to processor subsets; { General block distributions; { Indirect distributions; { Mapping derived type components; { Mapping pointers; { Mapping ranges for dynamically mapped objects; { Shadow speci cations; { Halo speci cations new . Independent Loops { New-clauses; { Reduction clauses and reduction statements; { On-clauses for explicit work distribution; { Distribute-clauses for explicit work distribution new . Mapping Dummy Arrays { Explicit mappings; { Inherited mappings modified ; { INHERIT attribute modified ; { RANGE attribute; { INTENT attribute. Other Features { Support for communication schedule reuse new ; { Purest procedures new . (

)

(

(

)

(

(

)

(

(

)

(

)

(

(

)

)

)

)

)

8

1. Introduction

1.3.1 Comparison of HPF+ with HPF 2.0

Table 1.1 compares the language elements of HPF+ with the language elements provided by HPF 2.0 and the Approved Extensions. Although the syntax and semantics of HPF+ is closely related to existing HPF 2.0 extension proposals, certain features of HPF+ have a dierent syntax and/or semantics than in HPF 2.0.

Language Feature Processors

HPF+

HPF 2.0

HPF 2.0 AE

Processor views Processor subgroups

+ +

{ {

{ {

BLOCK distributions CYCLIC distributions Total replication GENERAL BLOCK Dimensional INDIRECT Multi-dimensional INDIRECT SHADOW speci cations HALO speci cations Distribution to processor sections Dynamic mappings Mapping of derived type components Mapping of pointers

+ + + + + + + + + + + +

+ + + { { { { { { { { {

{ { { + + { + { + + + +

Inherited mappings Distribution ranges Prescriptive mappings Descriptive mappings Transcriptive mappings Procedures returning mappings

+ + + { { +

+ { + + + {

{ + { { { +

Work distribution for parallel loops PUREST procedures Schedule reuse

+ + +

{ { {

{ { {

Data Mapping Mechanisms

Procedure Interface

Other Features

Table 1.1: A comparison of selected language features in HPF+, HPF 2.0 and the Approved Extensions (AE) of HPF 2.0. \+" and \{" respectively indicate the presence or absence of a language feature.

Chapter 2

Literature Survey This chapter contains an overview of the most important current research activities dealing with HPF. The information presented summarizes the literature referenced in the bibliography (see page 52 .). In order to organize this chapter properly, the references are categorized according to their main emphasis. Obviously some references deal with several aspects and are therefore mentioned in more than one category.

2.1 Academic Research

2.1.1 Language Advancement

The survey of available literature shows that many authors think that HPF should not be restricted to data parallel computing, but should also support task parallelism or heterogeneous computing adequately. One group of researchers has developed a library-based approach to support task parallelism. They use MPI as a coordination layer for HPF (Foster et al. [114, 115]). Foster [26] gives some reasons why task parallelism is considered important for parallel computing. E. g., for software engineering reasons, the use of task parallelism in complex parallel programs allows to program in a modular fashion or to execute dierent program components on dierent computers in a heterogeneous network. Task parallelism can also be used to enhance locality and hence performance by executing dierent components of a problem concurrently on disjoint sets of processors. The third reason mentioned by the author is the ability to specify computation and communication schedules that could not be discovered by a compiler. This paper also describes two candidate task parallel models, the Fortran M extensions to Fortran 77 and the HPF/MPI approach. Chapman et al. [18] describe Opus, a coordination language designed as a set of extensions to HPF and Fortran 90. This language allows to express coarsely grained task parallelism on top of nely grained data parallelism. Kamachi et al. [39] describe a compilation system called Kemari, which extends the language de nition of Subset HPF in various aspects. Kemari provides directives for dynamic data distribution through user-de ned mappings, which are needed for unstructured problems, and it has the ability to integrate MPI calls into HPF. In Chapman et al. [20] a set of extension to HPF, named HPF+, is presented. The aim of HPF+ is to provide additional functionality especially for irregular problems and for dynamic load balancing. HPF+ modi es and extends the data distribution features of HPF 2.0 by introducing special features for irregular data

10

2. Literature Survey

structures. Moreover, language elements for handling of processor arrays and for reusing communication schedules are introduced (cf. Section 1.3). Another proposal of a parallel language based on Fortran 90, F??, is based on a single syntactic extension to Fortran instead of using compiler directives (De Dinechin et al. [112]). The dimension attribute is modi ed to distinguish between accesses to local data and accesses to non-local data which (implicitly) invoke message passing calls.

2.1.2 Compiler Development

This section contains references to techniques and ideas currently under consideration for improving the eciency of HPF compilers.

Implementation of DO Loops and FORALL The ecient automatic parallelization of nested DO loops is dicult for any HPF compiler. Boulet and Brandes [83] present some performance results and describe the bene ts of dierent techniques and optimization methods for dealing with this problem. In source-to-source converters for HPF the FORALL construct is often translated into DO loops. This has to be done carefully since naive translation into DO loops can increase the loop overhead immensely. Koo et al. [131, 132] propose some techniques for converting FORALL constructs into optimized DO loops. Many compilers use a technique called privatization of data to parallelize loops by eliminating storage-related dependencies. Gupta [117] presents a framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. The author shows that there are numerous alternatives available for mapping privatized variables and that the choice of mapping can signi cantly aect the performance of the program. As another method for generating ecient communication patterns when compiling HPF programs, Kamachi et al. [126] introduce the concept of an iteration template corresponding to an iteration space. An HPF compiler which performs the loop iteration mapping using a two-level mapping of the iteration template in the same way as the data mapping is performed in HPF is presented.

Distribution of Arrays and DO Loops The task of implementing distributed array structures and their access on message passing computers is not straightforward. Methods for ecient local index enumeration, local storage, and communication for distributed arrays are presented by van Reeuwijk et al. [56]. Techniques for local set enumeration provide the basis for constructing local iteration and communication sets. When generating the code for distributed array statements, the sequence of local memory addresses accessed by each processor and the sequence of communication operations for a given processor to access non-local data have to be computed. Thirumalai and Ramanujam [65] present an approach to solving the

2.1 Academic Research

11

address sequence generation problem using the theory of integer lattices. The set of elements referenced can be generated by integer linear combinations of basis vectors. Closed form expressions for the basis vectors as a function of the mapping of data are derived. Using these basis vectors and exploiting the fact that there is a repeating pattern in the access sequence, highly optimized codes that generate the pattern at runtime can be developed.

Remapping Data Applications with varying array access patterns require array mappings to be changed dynamically on distributed-memory parallel machines. Coelho and Ancourt [23] present a new technique for compiling HPF remappings onto messagepassing parallel architectures. Special optimized algorithms for redistributing data from CYCLIC(x) to CYCLIC(y) (for x = ky or y = kx, k 2 IN) are proposed by Thakur et al. [64]. Moreover, the paper describes two algorithms for redistributing a CYCLIC(x) distribution to a CYCLIC(y) distribution with no particular relation between x and y. Walker and Otto [70] consider the problem of changing the block size in blockcyclic distributions. Message passing algorithms for this redistribution operation are expressed in MPI.

Communication Communication requirements are the main performance limiting factor in parallel computing. One important technique for improving the performance of parallel programs is to overlap communication and computation. Li and Harada [168] investigate language elements to control this overlapping and an asynchronous data transmission mechanism for HPF compilers. Based on an exact data- ow analysis on individual array element accesses, a compiler inserts data transmission requests to a message queue into SPMD codes. Another attempt to generate ecient SPMD code from an HPF program is described by Ancourt et al. [9]. They use matrix operations as a framework to encode HPF directives and to synthesize distributed code with space ecient array allocation, tight loop bounds, and vectorized communications for INDEPENDENT loops.

2.1.3 Load Balancing Some work has also been done on integrating load balancing features into a data parallel language like HPF. Perez [148] shows that the HPF virtual processors may well be used as the unit of migration when distributing load onto physical processors. Three possible implementations are compared.

12


2.2 Comparisons, Evaluations, and Benchmarks The scope of an HPF compiler (which features of the language standard it implements) and the performance of the machine code generated are the most important criterions for its quality. It is, therefore, important to know how to measure and compare the performance of the codes generated by dierent HPF compilers. The Data Parallel Fortran (DPF) benchmark suite is presented by Hu et al. [123]. It provides a set of data parallel Fortran codes for evaluating compilers for data parallel languages. The benchmarks can be used on any parallel architecture with shared or distributed memory. In addition to comparing dierent HPF compilers it is interesting to compare the HPF approach with other programming models, primarily with explicit message passing or with shared memory programming. Van der Wiel et al. [165] try to evaluate quantitatively the relative ease of use of dierent parallel programming models. In particular, some aspects of a comparison between message passing and HPF are discussed. Chapman et al. [20] outline the major elements of Vienna Fortran and compare it to HPF. The lack of support for many advanced applications which require irregular data distributions and dynamic load balancing is considered a signi cant weakness of HPF. To provide the desired functionality, HPF+ is introduced as an extended version of HPF based on Vienna Fortran (cf. Sections 1.3 and 2.1.1). O'Hallaron et al. [32] extract some simple tests from a class of applications (sensor based applications , cf. Section 2.3), which implementers and users can employ to measure the eciency of loops, reductions and index permutations generated by an HPF compiler. Rodriguez et al. [149] compare two approaches to parallelizing weather prediction models. An implementation of a complete parallel weather prediction model using a high level parallel library serves as a reference point for HPF implementations.

2.3 Applications This section gives an overview of publications dealing with HPF applications. There is a wide range of HPF applications, and some of them are quite satisfactory. Applications which are primarily of academic interest (including small \kernel" routines which, for example, implement algorithms for standard mathematical tasks), as well as larger practical applications are included. Hu and Johnsson [36] show how several optimization techniques for managing data distribution and data references in hierarchical N -body algorithms can be expressed in HPF. Brieger and Lecca [93] are working on modeling ground water ow. They present a multigrid preconditioner for the conjugate gradient kernel of a three dimensional nite element model of water ow in porous media. In addition, they describe a subdomain multigrid strategy as a parallel preconditioner for the

2.3 Applications

13

global conjugate gradient solver on systems distributed over several processors of a parallel machine. Subdomains are de ned and distributed so as to achieve the best processor load balance and to reduce interprocessor communication time. Parallel routines were coded in HPF along with the MPI communication library in order to get a portable parallel code which can be used for machines from clusters of workstations to parallel supercomputers. Programs that get their input from sensors are an important application domain for HPF. Sensor-based applications typically perform regular operations on dense arrays and often have latency and throughput requirements that can only be achieved with parallel machines. O'Hallaron et al. [32] describe a study of sensor-based applications, including fast Fourier transforms, synthetic aperture radar imaging, narrowband tracking radar processing, multibaseline stereo imaging, and medical magnetic resonance imaging. The performance of the codes is mainly determined by three operations: independent DO loops, reductions and index permutations. Cabitza et al. [94] give an overview of their experience with HPF for various scienti c applications, including basic kernel codes and parallel libraries as well as large time-consuming practical applications (seismic modeling, molecular dynamics, etc.). The authors assess HPF as a good medium to long-term investment. Onesti and Srevik [144] describe the parallelization of a local area ocean circulation model using HPF. The authors nd that this particular problem leads to a typical data parallel algorithm and is, hence, easily expressed in HPF. Tests on three dierent platforms with dierent HPF compilers were performed. Problems occurred mainly due to the restrictions in the HPF compilers available at that time. The computer code Parmila has been the primary tool for the design of proton and ion linear accelerators in the United States for nearly three decades. Previously it was sucient to perform simulations with thousands of particles; recently, however, the need to perform high resolution studies for next-generation, high intensity, linear accelerators has made it necessary to perform simulations with hundreds of millions of particles. Hence, these problems can only be dealt with using parallel computers. Ryne [150] describes the development of a parallel version of Parmila and its performance. Several users have been employing CM Fortran codes in production codes on the CM-2 and the CM-5 for the last ve to six years, investing a lot of time and money. With Thinking Machines Corporation's decision to withdraw from the hardware business and with the decommissioning of many CM-2 and CM-5 machines, the best way to protect this substantial investment in CM Fortran codes is to port the codes to HPF on highly parallel systems. HPF is very similar to CM Fortran and thus represents a natural transition. The Numerical Aerodynamic Simulation (NAS) program, located at NASA Ames Research Center, is a path nder in high performance computing for NASA and is dedicated to advancing the science of computational aerodynamics. Experiences in porting CM Fortran codes on the CM-5 to HPF are presented by Saini [151]. The author does

14


not present any runtime comparisons between HPF and CM Fortran, but tries to assess the porting eort, and criticizes a couple of points: the lack of parallel I/O in HPF, the lack of pro ler and debugging tools, and the lack of parallel mathematical and scienti c libraries which can be used in HPF codes. He concludes that HPF is not the ultimate solution for parallel programming, but that it still seems to oer good support for maintaining a single portable code version on dierent computing platforms.

2.4 Tools One very important requirement for the widespread acceptance of a programming language among users is adequate tool support. Given the diculties of programming parallel computers eciently, this especially applies in the case of HPF. This section focuses on research tools and on related literature. Further information about HPF related tools, in particular about commercial tools, is given in Appendix A.3. A list of currently available HPF compilers with pointers to further information is also given in Appendix A.

Visualization

The visualization of the parallelism introduced by the compiler is an important aid in the optimization phase of code development. Koppler et al. [44] propose a graphical toolkit for the visualization of HPF data structures, which should support ecient utilization of HPF's data mapping facilities. It consists of exploratory and estimation tools which allow the programmer to navigate through complex distributions and to obtain graphical ratings with respect to load distribution and communication. Another example of a tool which provides high-level access to distributed data structures is given by Hackstadt and Malony [120]. The Distributed Array Query and Visualization (DAQV) implementation utilizes the HPF language, compiler, and runtime system to support program-level access to array values.

Programming Environments Brandes et al. [14] present the HPFIT project whose aim was to provide a set of interactive tools integrated in a single environment to help users parallelize scienti c applications using HPF on distributed memory parallel computers. HPFIT is built around a restructuring tool called TransTOOL which is made up of an editor, a parser, a dependence analysis tool, and an optimization kernel. Brandes et al. [15] also present other parts of the HPFIT project: a data structure visualization tool called Visit, and language extensions to HPF for irregular problems.

2.5 Language Surveys

15

Kemari is a compilation system which extends HPF in various ways (Ka-

machi et al. [39]). The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer.

Debugger Brezany and Sowa [193] present the Vienna Symbolic Debugger (VSD). VSD is a high-level symbolic debugger which enables the programmer to locate, analyze, and correct suspected faults at the HPF+ level. The authors follow an approach referred to as a sequential view of parallel execution; the real parallel code is executed, but a corresponding source code level interface is presented to the programmer.

Code Generation Tools DDT is a tool for automatic data distribution (Ayguade et al. [10]). It accepts

programs written in Fortran 77 and generates HPF directives to map arrays onto processor-memories and to parallelize loops. If required, it also generates executable statements to remap these arrays. Di Martino [50] presents Migrator which is a tool that supports the translation of Fortran 90 or partially annotated HPF programs into fully annotated HPF codes with insertion of calls to optimized parallel libraries.

Performance Evaluation and Prediction

Fahringer [2, 25] presents P T , a static performance estimator for a subset of HPF to guide the selection of ecient data distribution strategies and pro table code transformations. This tool provides a way to estimate and tune parallel program performance. Scala is a system which combines an instrumentation tool and a post execution performance analysis tool. The compiler-integrated instrumentation tool (SIS) allows an automatic and selective instrumentation of irregular codes parallelized by the Vienna Fortran Compiler. During the code parallelization SIS inserts automatically runtime measurement functions based on the compilation/analysis information. The post execution tool analyzes the trace les produced by SIS and provides performance information at dierent levels of detail. 3

2.5 Language Surveys In this section some examples of papers are given that describe the language HPF or show how to use it eciently. Schreiber [154] rst introduces the ideas that underly the data parallel language HPF. Then he reviews HPF's key language elements and the important problem of data mapping at the procedure call interface. He also discusses interoperability with other programming models, including the SPMD model. The latter part of the paper is a review of the development of HPF 2.0.

16


Zima et al. [169] outline state-of-the-art of compilation systems for HPF and related languages. They also discuss the limitations of the current version of HPF and propose an integrated programming environment that supports the automatic tuning of Fortran programs for a given parallel target architecture. Clark et al. [22] summarize some guidelines to keep in mind when developing software that may at some time in its life cycle be parallelized using a data parallel compiler. Their paper focuses on the diculty they encountered when they ported a molecular dynamics program from Fortran 77 to Fortran D. Knies et al. [43] discuss issues that need to be addressed when an ecient production quality compiler for a language like HPF is developed. They examine some speci c issues that are related to HPF's model of computation and analyze several implementation issues. After outlining the evolution of the Fortran programming language up to Fortran 90, Schaller [59] gives a detailed description of the new elements introduced by HPF. In the author's opinion HPF has a better chance of being successful than some other programming models for parallel computers. Zima [73] describes some high performance languages for parallel computing, including HPF, Vienna Fortran, and Fortran D.

2.6 Interfacing to Libraries It becomes clear in this report that a very important factor for the general acceptance of HPF among users in the natural sciences and in large-scale technical applications is the availability of parallel mathematical libraries which are HPF compatible. The functionality of HPF can potentially simplify the use of a parallel library by providing a convenient framework for specifying data distributions, etc. Brandes and Greco [90] and Lorenzo et al. [136] have explored this issue by creating an HPF interface to the ScaLapack message passing library. They discuss the following problems. Dierent programming models and incompatible data distributions (ScaLapack does not support all of HPF's data distributions) have to be handled, and data might have to be redistributed for performance reasons. Lorenzo et al. [136] also demonstrate the performance bene ts from interfacing to ecient ScaLapack routines by making a comparison between them and equivalent HPF coded functions.

Chapter 3

Expert Opinions 3.1 E-mail Survey We sent out an e-mail questionnaire to obtain the opinions of people (program developers, users, etc.) interested in HPF. We sent the e-mail out to members of the HPF Forum (HPFF ), to members of the HPF User Group (HUG ), and also to several others (researchers, compiler developers, potential users, etc.), who might be interested in HPF. The questionnaire was sent out in May and June 1997 to about 740 people, and 25 responses (3.3 %) were received up to the end of July 1997.

3.1.1 Questions

Depending on the recipient, two slightly dierent versions of the survey were sent out. We list here a \merged" form of the two questionnaires, i. e., the following list contains all the questions we asked. 1. Are you currently doing any HPF code development or any HPF related research ? If so, please give a brief description. 2. Do you know of any other code development in HPF or any (ongoing) HPF related research ? If so, please give a brief description. Please also provide a contact address if possible or forward this questionnaire to a person involved in such activities. 3. Do you know of groups or institutions (computing centers, etc.) using HPF ? 4. Compiler and tool vendors: Are you developing/planning to develop an HPF compiler or HPF related tools ? If the answer is no, what are the reasons ? If the answer is yes, please answer the following where relevant:

Do you currently have an HPF compiler which complies with the current

standard ? If not, when do you expect having one available ? What improvements will you make to your HPF compiler in the near future ? What are your major priorities ? What do you feel are the most important extensions needed to HPF ? What tools do you oer that could be useful to an HPF application developer ?

18

3. Expert Opinions

What other tools or programming support do you intend to oer in the future ?

5. What do you think is the most important tool support needed by HPF users ? 6. If you have used HPF in application development, have you been satis ed with the performance you have achieved ? If performance results were disappointing, how did this aect your opinion of HPF ? When confronted with disappointing performance results of actual HPF codes, what is your reaction ? In particular, do you believe that there are good chances that advances in compiler technology will be sucient to lead to satisfying performance in the near future (without sacri cing portability of the codes) ? 7. Do you think that the HPF approach will be successful in the long run ? In other words, do you think that HPF is going to \survive" ?

3.1.2 Responses

In the following we give a brief summary of the responses we received to each question. Of course not all questions were answered by everybody to the same degree of detail; therefore our summary is sometimes based on a rather small set of responses. Appendix B gives a list of academic and commercial institutions and projects involved in HPF-related activities (including the respondents of our e-mail survey).

HPF Code Development (Questions 1 and 2)

In the responses we received, the following applications were mentioned.

Irregularly structured sparse matrix problems (e. g., resulting from PDE discretizations);

Block-structured CFD codes; The simulation of heart stimulation by electric current; Environmental models; The processing of seismic data in order to image the earth's subsurface; Retrieval programs for atmospheric remote sensing; The numerical solution of transonic viscous turbulent ows in axial turbines and compressors (using structured grids of the so called \H type", which lead to simply structured communication which can be done very easily in HPF);

3.1 E-mail Survey

19

Poisson equation, Vlasov equation, and Boltzmann equation solvers (which are ported from existing codes);

The simulation of a Rayleigh-Taylor instability; 1D FFT, Monte Carlo method, and various other small programs and kernels from applications in computational sciences.

Sometimes it was explicitly stated that attempts to use HPF in applications were not successful.

HPF directives were added to an existing ocean model but were eventually not used because no compiler could give the required performance.

Attempts to convert and parallelize a code for a Gauss-Seidel lower upper

reduction technique using HPF (and established HPF compilers) were unsuccessful. People experimented with HPF for a few months, and they obtained incorrect answers or the compiler crashed. It also turned out that additional constructs for load balancing in conjunction with distributed arrays are necessary. In this particular case, people had to develop their own language/compiler.

HPF Related Research (Questions 1 and 2) HPF related research is being pursued at several institutions (see Appendix B). Projects deal with compiler and tool development and with the evaluation of HPF with respect to functionality, performance, and tool availability. Furthermore, research is being done on nding more ecient techniques for HPF compilation, possible new language extensions for data parallel programs, the development of advanced parallelization transformations for HPF compilers, optimization of \semi-regular" codes, migration and porting of Fortran 77 applications to Fortran 90 and/or HPF, interference of HPF implementations with parallel tools, in uences of HPF on the resulting software engineering technology, and performance modeling. Current Activities. NAG Ltd. is currently investigating the feasibility and desirability of developing an HPF interface for the NAG Parallel Library (based on the EXTRINSIC mechanism in HPF), but no commercial product has been planned yet. For the development of the NAG Parallel Library, the combination of Blacs with MPI/PVM is preferred over coding in HPF. Hitachi is developing an HPF compiler for its machines. NCSA is using HPF as their primary migration path for CM-5 users to DSM computers. Last, but not least, one response mentioned providing and developing HPF training material in order to better establish the language as an important ongoing activity.

20

3. Expert Opinions

Groups or Institutions Using HPF (Question 3)

In summary, HPF today is seldom used for serious production codes. It is mainly used in research environments and is still at a benchmarking and testing stage. It was interesting to hear that many National Laboratories, Supercomputing Centers, and all the ASCI machines in the United States have or plan to have an HPF compiler and that they generally plan to use them on a variety of hardware architectures, ranging from workstation clusters to \traditional" supercomputers.

Compilers and Tools (Question 4)

Development of HPF compilers, status of products. Appendix A lists ex-

isting HPF compilers (research compilers as well as commercial products). The following covers all the responses received in the survey, including plans for developing an HPF compiler.

Priorities for Improvements. The following topics were mentioned by several

compiler developers as major areas of development. Naturally, they depend on the status and maturity of the respective compiler, but the responses still give a general idea of the main direction the current development is taking. At the moment, compiler developers/vendors intend to

generally improve the eciency of the generated machine code, in particular to improve performance for stride access loops and for indirect access loops in order to achieve higher eciency for irregular codes (e. g., codes with indirect addressing);

implement KIND, derived types, nested FORALL, WHERE within FORALL, and mapped pointers (an HPF 2.0 Approved Extension);

provide support for generating mixed shared-memory/distributed-memory (i. e., thread/message-passing) code for clusters of SMPs.

One compiler developer said that his main goal is to produce excellent code for current benchmarks, including the NAS Parallel Benchmarks. At the research level, the following activities were mentioned.

Attempts to generate very ecient code for data parallel loops (including loops

with implicit parallelism) by developing a framework for deriving computation partitions and communication operations.

Attempts to unify the methods used for regular and irregular computations. Attempts to develop compiler techniques for compiling parallel out-of-core programs.

3.1 E-mail Survey

21

Most Important Extensions Needed. Among the current Approved Exten-

sions, EXTRINSIC(HPF LOCAL), EXTRINSIC(HPF SERIAL), mapped components of derived types, and mapped pointers were considered to be among the most important. Support for clusters of SMPs and for shared-memory directives was also mentioned. One of the responses concentrated on further extensions to the Approved Extensions. It was remarked that better multitasking facilities are required. The need to deal with parallel I/O was also mentioned by two respondents. Finally, high-quality parallel mathematical scienti c libraries (with standard calling sequences) were considered necessary for the wide-spread acceptance of HPF. One of the participants in the survey remarked that concepts for mixing task and data parallelism should be developed for applications not suited to the data parallel approach. The Approved Extension TASK REGION was said to be a good start, but still in need of re nement.

Tools Oered to Application Developers. Appendix A.3 contains pointers

to existing tools, such as debuggers, pro lers, execution tools (for example, for load monitoring), translators between dierent languages, etc.

Plans for Tools/Programming Support. Of course, people are working on

improving current tools, and there are plans to oer automatic program parallelizing tools. We noticed a strong emphasis on performance visualization and evaluation, for instance, on performance analysis tools with a graphical user interface for workstations and PCs.

Most Urgently Needed Tools (Question 5)

The two responses we received to this question mentioned both source level and parallel debuggers and tools for performance analysis (pro lers, etc.) as being most urgently needed.

The Performance of HPF Codes (Question 6)

Performance Satisfaction. The survey showed a broad spectrum of opinions

about current HPF compiler performance. One response (from a commercial compiler developer) showed satisfaction with the results in a porting project and stated that current HPF compiler technology is already good enough to achieve industrial ports which are ecient and easy to maintain { for a wide range of codes, though not all codes. At the same time a dierent participant (a commercial software developer) responded that \. . . HPF development is being looked at very skeptically because of lack of quality of HPF compilers with regards to performance of the code produced, stability and conformity with the standard. Furthermore, block-cyclic data distributions of HPF have proven insucient for many applications. More general data distributions cannot be easily implemented eciently. . . "

22

3. Expert Opinions

Reactions to Disappointing Performance. Among the various responses to

this question, the following points were the most interesting.

\. . . Most of the commercial vendors cannot aord to take the time and the effort to implement all the features of HPF, because it is a low-margin business, and there are not enough users to make it pro table enough. . . "

\. . . A normal Fortran compiler takes about 3 years to reach good performance.

HPF is about 5 years away from acceptance. If vendors do not give up too soon it can make it. . . "

\. . . Compiler technology has improved and will continue to advance and mature and thus improve performance. Some commercial compilers improved signi cantly during the past twelve months. . . "

\. . . Compiler technology will not provide sucient support. The only way to get better performance is to give the user more control. This will probably hinder portability. . . "

\. . . One of today's problems is that one has to learn how to use HPF ef-

fectively (some constructs work better than others, dierent compilers give dierent performance results). Hence, performance portability is de nitely not yet achieved. Some convergence process in these aspects will improve understanding greatly. In the meanwhile the programmer has to properly understand the compiler's strengths and weaknesses. . . " Some compiler \. . . is still full of severe bugs. . . ", another one \. . . gives excellent performance and scalability to a large number of processors, but it only runs on one type of machines. But the situation will improve in the long run. . . "

\. . . It is really crucial to remove any unnecessary communication in the algorithm and to write code in a strictly data parallel way. . . "

As a reaction to disappointing performance \. . . one can move to message passing. . . "

\. . . Very good performance can already be achieved on a few codes. When

one has poor performance then this basically indicates that the code involved is not a good candidate for HPF coding. . . "

\. . . Performances of our codes are not disappointing. (We use a Portland Group compiler and we widely use EXTRINSIC(F77 LOCAL)). . . "

\. . . Poor performance is mostly due to bad node performance. Since there

is no performance portability between codes tuned for vector-based architectures and codes tuned for memory hierarchy based architectures, in most of the poorly performing examples granularity and communication/computation ratio are not appropriate. . . "

3.1 E-mail Survey

23

\. . . Disappointing performance of actual HPF codes mostly relates to `bad'

coding (for example directly taking over Fortran 77 codes, inherently sequential algorithms, I/O, etc.) and to lousy compilers. In general, a complete redesign of Fortran 77 codes is required to fully exploit the potential of HPF. First, a vectorizing step (to Fortran 90) is an absolute necessity before the HPF compiler can do its job. If done carefully one can get codes that excellently scale across a wide variety of hardware platforms and thus prove that the concepts underlying HPF are really true. To get this far, however, it is required to spend a lot of time in redesigning and learning. . . "

The Future of the HPF Approach (Question 7)

As expected, this was one of the most controversial questions of the survey. Out of eighteen responses to this question, nine were optimistic, seven were more or less undecided or doubtful, and two were strongly pessimistic.

Optimistic. Seven of the responses in this category came from users of HPF

(although one of them has not yet developed any HPF codes himself, and two others only ran very small HPF programs, e. g., kernels of real applications). The eighth works at an institution which produces commercial compilers. He promotes HPF, saying that they have had \extremely encouraging results". Another response pointed out that HPF is of course widely applicable and very well designed for all problems which can be formulated in a data parallel fashion (as originally intended) and that lots of actual problems use structured grids and, therefore, can be solved eciently using HPF. For this reason, HPF satis es the needs of a very speci c target group { scientists who are more interested in the data parallel (often vector) formulation of their problems than in the details of the parallel machines. Additionally, he emphasized that HPF can provide portability across dierent parallel platforms, which is a big advantage. One of the more cautious responses remarked that programmers will de nitely choose HPF over MPI or PVM once HPF has reached a certain level of maturity. Moreover, it is expected that the restrictions of HPF will be reduced by the features of HPF 2.0, so that HPF can be used in a much larger class of applications. In general, the performance aspect was hardly addressed in these responses. One person admitted that HPF will \. . . never be the answer to all parallel problems. . . " and that \. . . in particular, it is doubtful that bleeding-edge users will accept it. . . "

Undecided. These responses came from four users of HPF (three of them cur-

rently only working on very small problems and codes), and from three compiler researchers. Two of them work for institutions which are either presently developing commercial HPF compilers or planning to develop them in the near future. It was pointed out that for large real-world applications programming message passing is unmanageably complex, and those involved in this activity truly hope that ecient HPF compilers will be available soon. It was also mentioned that a

24

3. Expert Opinions

basic prerequisite for the success of HPF is the acceptance and thorough establishment of Fortran 90 (Fortran 95). Starting from this basis, HPF would be a very natural way towards parallelization. Recent trends in hardware design were seen as potentially bene cial to HPF; for example, ecient compilation might be easier on NUMA architectures. Others stated that there might be a place for an HPF dialect, but not necessarily for HPF itself, and that HPF might survive only for several years. With slightly more pessimism, it was remarked that HPF will only survive as long as it is the only solution around { in the future there could be something called \HPF", but which looks entirely dierent from the current version. It was also pointed out that in order to get a reasonable share of the parallel software market and in order to develop enough dynamics to survive, several successful instances of usage of HPF need to be publicized. So far this has possibly not happened enough. Portability of HPF codes was seen as very important and helpful for being accepted. Furthermore, it was remarked that one of the big uncertainties in the HPF world is whether good programming environment tools will be developed, which is de nitely another prerequisite for wide-spread acceptance. The following statement addressed crucial issues. \The success of HPF depends on (1) a wide-spread (industrial) acceptance of Fortran 90, and (2) the availability of ecient HPF compilers. Since we are currently taking care of (2) we pray for (1)."

Pessimistic. HPF was criticized for not addressing issues related to unstruc-

tured grids and for \horrible performance" compared to explicit message passing programming. Another reason for pessimism was pointed out. So far HPF has not reached enough of a \critical mass" to make vendor support worthwhile. It is interesting to note that both negative remarks came from users, who tried to use HPF in their large-scale applications.

3.2 HUG Survey

25

3.1.3 List of People

This is the list of individuals who replied to our e-mail survey: Name

Aliation

E-mail Address

Jean-Yves Berthou Richard Bulle Luca Cazzola Carlie J. Coats Andrew Cook James Cownie Mike Delves Jiri Furst Israel Gale Steve Hammond Philip W. Jones Youngbae Kim Charles Koelbel Arnold Krommer Peter R. Maarleveld Carl Oner Hiroshi Ohta Jean Louis Pazat Matt Rosing Michael Schuh Eric de Sturler Makoto Suzuki Matthijs van Waveren Zhao-feng Zhan Mary E. Zosel

CEA/DI-CISI Langley Research Center, NASA AGIP spa, Milano N.Carolina Supercomp. Center Los Alamos Nat'l Lab Dolphin Interconnect Solutions NA Software Ltd., Liverpool CTU Prague DEC Nat'l Cent. for Atmosph. Res. Argonne National Lab Lawrence Berkeley Nat'l Lab CRPC, Rice University NAG Parallel Libraries Group TNO Inst. of Applied Physics HPF Compiler Dev., DEC Systems Dev. Lab, Hitachi IRISA { INSA { NASA SCSC { ETH Zuerich { Fujitsu Europe New Mexico State University Lawrence Livermore Nat'l Lab

[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] o[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]

3.2 HUG Survey At the rst meeting of the HPF User Group on 24-26 February 1997, participants were asked to rate the following (not necessarily yet implemented) features of HPF or proposed extensions with respect to their importance on a scale from 5 (essential ) to 1 (not important ), 0 denoting cannot determine yet or don't know .

Features of HPF 2.0 Full Fortran 90 support; More ecient support for the HPF 2.0 core language; Ecient support for the full HPF library; Ecient automatic parallelization of DO-loops; Ecient support for EXTRINSIC calls; Support for CYCLIC(k).

26

3. Expert Opinions

Approved Extensions Support for INHERIT; Support for HPF LOCAL; Support for HPF SERIAL; Support for DYNAMIC distributions (i. e., redistributions); Distribution of pointer-based arrays and derived types; INDIRECT distributions; Variable-size block distributions; Task parallelism; ON-clause computation distributions; Support for the RESIDENT assertion; F77 LOCAL. Other Areas Support for HPF CRAFT; C interoperability; Development of mathematical libraries (FFT, dense and sparse linear al-

gebra); Asynchronous I/O; Ecient parallel I/O libraries; Performance analysis and debugging tools. The results of the survey are summarized in Table 3.1. As expected, high priority was given to full Fortran 90 support, which theoretically should not be a big issue at this time. But unfortunately, not even Fortran 90 compilers are fully mature yet. The very high priority given to various improvements in eciency of basic features was very interesting. This again supports the general impression we have. Conceptually good work has been done, but the realization and implementation of the concept leaves a lot to be desired. In order to gain widespread acceptance among potential users, a lot of eort is required to actually take the crucial step from the abstract language de nition to a practically useful programming language. It is interesting that this survey also showed the great demand among users for a library providing routines for mathematical standard tasks (cf. Section 2.6). Since the problem of a source-to-source translation from Fortran 77 to ecient HPF code will most likely not be solved in the near future, HPF codes for mathematical standard operations would have to be developed from scratch. It would of course be extremely important to have HPF libraries available or to be able to integrate other existing parallel libraries into HPF codes where applicable.

3.2 HUG Survey

27

# of Responses Average Rating Feature 18 28 30 27 24 27 29 27 14 24 27 25 25 21 24 25 18 18 24 25 21 25 21

4.72 4.71 4.23 4.07 4.00 3.96 3.93 3.55 3.50 3.41 3.37 3.32 3.24 3.10 3.00 3.00 2.94 2.89 2.83 2.72 2.38 2.36 1.43

full Fortran 90 support Tools Math libraries HPF LOCAL full ecient HPF 2.0 core ecient EXTRINSIC interfaces ecient HPF library parallel I/O libraries F77 LOCAL HPF SERIAL ecient parallelization of DO-loops distribution of pointers and derived types generalized BLOCK-distributions INHERIT task parallelism asynchronous I/O ON-clause RESIDENT-assertion CYCLIC(k)-distribution C interoperability indirect distributions dynamic distributions HPF CRAFT

Table 3.1: Results of the HUG survey

Putting it more bluntly, without the availability of libraries, HPF is unlikely to be accepted by large-scale users simply because of the immense eort needed for coding. Finally, we want to point out several features and aspects which are, surprisingly, not among the top priorities. For example, task parallelism was given only medium priority, and interoperability between HPF and the C language was given quite low priority.

Chapter 4

Case Studies This chapter describes some of the experiences the authors have had with HPF compilers. Section 4.1 shows some of the earliest experiments, which were done to compare several ways to express matrix-matrix multiplication operations in Fortran 90 and HPF. Compilers steadily improve, and new compiler versions have become available in the meanwhile. Therefore, we recently decided to repeat a similar set of experiments. This time we investigated Gaussian elimination. We used the newest compiler versions available. These experiments are described in Section 4.2. The second set of experiments not only gives an idea of the quality of current HPF compilers, but it also shows how much progress has been made in the past one or two years.

4.1 Parallel Matrix-Matrix Multiplication In this section we summarize some experiments we did when the rst commercial HPF compilers which promised reasonable functionality and performance became available. We compared matrix-matrix multiplication operations using two early implementations of HPF on two dierent computer systems.

4.1.1 The Test Program

We concentrated on the matrix-matrix multiplication operation C = AB , where A; B; C 2 IRnn and compared dierent ways to implement this Level 3 operation in a Fortran program. 1. The simplest version used the Fortran 90 intrinsic function MATMUL, which takes two matrices as its arguments and returns the result of the multiplication. C = MATMUL(A, B)

In practice, this function should perform well independently of the size of the matrices; if the operation is parallelized by the compiler, a reasonable parallel speedup should be obtained. 2. Conventional coding of the unblocked matrix-matrix multiplication uses three nested DO loops; for the sake of eciency the innermost loop accesses the matrices A and C columnwise (because of columnwise storage in Fortran); the elements of B are stored in a temporary variable that can be held in a register.

4.1 Parallel Matrix-Matrix Multiplication

29

DO j = 1, n DO k = 1, n temp = B(k,j) DO i = 1, n C(i,j) = C(i,j) + temp * A(i,k) END DO END DO END DO

Since the operations of the innermost loop are independent of each other, there is signi cant potential for parallelization. This version of the test program detects whether or not the compiler performs the dependence analysis of DO loops. 3. Fortran 90 array assignment syntax allows the same function to be coded as follows. DO j = 1, n DO k = 1, n temp = B(k,j) C(:,j) = C(:,j) + temp * A(:,k) END DO END DO

Here the result of the array assignment is the same as that of the DO loop in the previous version. However, the underlying programming logic diers considerably since the array assignment presupposes that there are no sequential dependencies between the array elements, while a DO loop may contain sequential data dependencies. Accordingly, array assignments can always be parallelized, while DO loops must be checked for data dependencies by the compiler or must be declared by the programmer as parallelizable using the INDEPENDENT directive. 4. Although the matrix-matrix multiplication operation can be coded in a very simple way, satisfactory performance is often only achieved by using an optimized library routine. The BLAS library has become a de-facto standard in scienti c computing and is available on many computers in an optimized version; the function call performing a matrix-matrix multiplication looks like CALL SGEMM('N','N', n, n, n, 1.0, A, n, B, n, 0.0, C, n)

This test program indicates the eciency of the BLAS implementation. Of course parallelized versions of the BLAS are not available on all machines. The matrices A and B were initialized with random numbers. The program was executed for the matrix orders n = 250, 350, 500, 700, 1000, 1400, 2000, and 2800; each time the amount of data to be processed is approximately doubled. For the parallel execution appropriate HPF directives were used to specify the data distribution pattern. In the following program the parameter n was set to one of the matrix orders stated above and the data distribution directives had to be adjusted to the version used.

30

4. Case Studies

PROGRAM main !

Constant declarations INTEGER, PARAMETER :: n = 2000

!

Variable declarations INTEGER :: i, j, k REAL :: temp REAL, DIMENSION(n,n) :: A, B, C

! HPF compiler directives !HPF$ PROCESSORS P(4) !HPF$ ALIGN WITH A :: C !HPF$ DISTRIBUTE A(BLOCK, *) ONTO P !

Executable instructions ! Initialization CALL RANDOM_SEED() CALL RANDOM_NUMBER(A) CALL RANDOM_NUMBER(B) C = 0.0 ! Execute matrix-matrix multiplication ...

END

4.1.2 HPF on a DEC Alpha Server

Digital Equipment Corporation (DEC) released an HPF compiler for the operating system Digital Unix 3.2 in March 1995. It is a native Fortran 90 compiler, i. e., produces object code without (intermediate) preprocessing stages. DEC Fortran 90 comprises selected HPF features like data mapping directives, the FORALL statement, parallel intrinsic and standard library routines, and the EXTRINSIC mechanism. This compiler version does not support full HPF 1.1 (see Section 1.2.2) or full Subset HPF (see Section 1.2.1). The hardware platform used was a DEC AlphaServer 2100 4/275 with 4 Alpha 21064A processors running at a clock rate of 275 MHz; the peak performance of one processor is 275 M op/s.

The Performance of the Test Program Table 4.1 summarizes the performance data obtained using the test program described earlier. For each routine three numbers are given for every matrix order. 1. The response time . Since the programs were run in single user mode, the overall execution time (in seconds) is used as the primary information about the computational speed of each version.


31

2. The percentage of the initialization overhead . Obviously the test program spends a certain amount of time prior to the actual computational task for the initialization of the matrices. This overhead, together with the time required for the program startup, was determined by removing the matrix-matrix multiplication operation from the source code and measuring the runtime of the remaining program. 3. The eciency of the computational kernel. When determining the eciency of a program, only the runtime for the part of the code which executes the matrix-matrix multiplication operation was taken into account. The operation count of C = A B is 2n ? n (for matrix order n); with the given hardware peak performance of 275 M op/s per node, the empirical oatingpoint eciency (in percent of the peak performance) is determined by 1

3

Eciency =

2

100 M op/s PerformanceP eak Number of Processors : Matrix Orders

250 350 500 700 1000 1400 2000 2800 MATMUL 1 Processor

Time (s) 0.5 1.0 2.7 7.1 20.0 Overhead (%) 32.7 24.0 14.7 13.5 13.4 Eciency (%) 33.7 40.9 39.4 40.6 41.9

56.8 9.8 38.9

DO


96.1 280.4 5.8 4.2 22.0 21.7

1 Processor

Response Time (s) 0.9 2.5 6.3 16.8 48.5 Overhead (%) 18.7 9.3 7.5 5.7 5.5 1 Processor Eciency (%) 15.5 13.7 15.6 15.7 15.9

Array

{ { {

{ { { 599.8 4.2 27.8

237.9 744.8 2222.9 2.3 1.6 1.1 8.6 7.9 7.3

Array

4 Processors


39.1 109.9 26.1 18.8 17.3 16.3

SGEMM

1 Processor


34.2 16.3 69.7

94.2 12.7 70.7

247.1 10.3 72.0

SGEMM


44.1 123.2 12.7 9.7 25.9 26.1

239.2 10.6 37.3

2 Processors

{ { {

Table 4.1: The performance of dierent implementations of the matrix-matrix multiplication on a DEC AlphaServer 2100. see C. W. Ueberhuber: Numerical Computation. Springer-Verlag, Berlin Heidelberg New York Tokyo 1997. 1

32

4. Case Studies

The following observations of these early experiments are important to point out. 1. The intrinsic function MATMUL was not parallelized at all. In fact, if MATMUL is called for distributed arrays, the performance totally collapses. So this version was only run sequentially, i. e., as a sequential Fortran 90 program. The eciency ranges from 30 % to 40 %, which is reasonable but far from perfect. For small matrices the initialization overhead is signi cant. 2. The DO loop version was not parallelized either; and, in fact, collapsed like the MATMUL function for distributed arrays. The performance of the serial version is poor. This indicates that the programmer should not leave the whole optimization task to the compiler. 3. The array assignment version is the only version that was parallelized. In Table 4.1 the results of the serial version and the four processor version are shown. A comparison of the response times shows that the algorithm achieves even super-linear speedup for the matrix order n = 2000; for n = 2800 the parallel version exceeds the memory limit and is terminated by the operating system. The eciency of this version is the lowest; the compiler completely fails to produce ecient code, and for larger matrices the performance decreases signi cantly. Even the response time of the parallelized version is larger than the response time of the fastest serial program (the BLAS routine). This indicates that, although the compiler actually produces parallel code, its exploitation of the hardware resources is far from satisfactory. 4. The only version that performed well was the (serial) BLAS subroutine SGEMM. Its performance increases steadily as the matrix orders increase and reaches the satisfactory eciency value of 72 % for n = 2800. This makes it clear how important optimized standard software is.

4.1.3 HPF on a Meiko CS-2

The Portland Group Inc. (PGI) has developed an HPF compiler based on the Fortran 90D work of Syracuse University. The PGI compiler is based on a software interface which has been implemented on top of PVM, MPI, Parmacs and some proprietary communication systems like the Elan Widget library on the Meiko CS-2. Prior to the actual compilation, a preprocessor transforms the HPF source code to Fortran 77 code with calls to the communication library. The version of the PGI HPF compiler (pghpf v2.0 ) we used at that time implemented only a fraction of the HPF language speci cation.

Hardware and Operating System

The computer used for these experiments was the Meiko CS-2 at the Vienna Centre for Parallel Computing (VCPC). It consists of 128 50 MHz SuperSPARC


33

scalar processors without I/O and 8 processors of the same type for interactive use. The hardware peak performance of these processors is 50 M op/s each, i. e., one oating-point operation per clock cycle. Communication in the distributed memory system is done by Elan communication and Elite switching chips. The CS-2 operating system is based on Solaris 2.3 from SunSoft. For use in the parallel computer, the operating system was augmented to support a parallel le system, ecient resource management, and inter-processor communication.

The Performance of the Test Program Table 4.2 summarizes the performance data obtained using the same test program as in the previous section. Again, for each routine three values { the response time, the initialization overhead, and the eciency of the computational kernel { are given for all matrix orders. Matrix Orders

250 350 500 700 1000 1400 2000 2800 MATMUL 1 Processor

Time (s) 4.9 16.1 51.8 162.7 582.3 1734.3 Overhead (%) 11.3 6.7 4.3 2.7 1.5 1.1 Eciency (%) 14.4 11.4 10.1 8.7 7.0 6.4

{ { {

{ { {

Response Time (s) 6.2 12.4 32.6 Overhead (%) 9.0 8.8 6.8 Eciency (%) 11.1 15.1 16.4

86.2 247.1 5.1 3.6 16.8 16.8

667.6 2.6 16.9

{ { {

{ { {

Array

1 Processor

Time (s) 4.1 10.8 30.8 Overhead (%) 13.3 10.1 7.2 Eciency (%) 17.5 17.6 17.5

84.4 245.5 5.2 3.7 17.1 16.9

664.1 2.6 16.9

{ { {

{ { {

Array

4 Processors


20.6 17.4 20.1

57.8 12.6 19.8

158.7 458.1 10.1 6.6 19.2 18.7

Array

16 Processors


8.1 47.2 20.0

20.1 39.8 20.7

49.1 232.7 342.8 36.2 11.7 9.7 21.9 9.7 17.7

Array


5.6 60.3 19.3

13.4 52.3 19.5

30.9 44.2 19.9

DO

1 Processor

32 Processors

{ { {

91.4 233.4 31.5 22.5 16.0 15.2

Table 4.2: The performance of dierent implementations of the matrix-matrix multiplication operation on a Meiko CS-2.

A serious drawback for numerical application programming on that particular machine was that an optimized BLAS library was not available. The MATMUL and the DO loop version were only run serially because the performance degraded dramatically when parallelizing them. So the only way to achieve parallel speedup in this test series was to use the array assignment statement. Since a larger number

34

4. Case Studies

of processors than on the DEC AlphaServer was available, the parallelization has been extended from 4 to 16 and 32 nodes. The results of the experiments can be summarized in the following points. 1. The intrinsic function MATMUL was not parallelized. 2. The DO loop version was twice as fast as MATMUL for larger matrices. In the environment of the test series both program versions reached the memory limit for matrix orders n 1400. 3. The memory limit also aected the array assignment version. However, when the number of processors increases, the matrix orders that can be handled also increase. So the use of the parallel assembly not only speeds up the computational task but also makes it possible to accomplish tasks that could not be handled on a sequential computer. The relative initialization overhead increases as the number of processors increases, while the absolute time it requires remains approximately constant.

4.2 Parallel Gaussian Elimination In this section we give the results of a recent case study which was meant to nd the best way (in terms of performance achieved by current versions of some compilers) to express nested loops in HPF. The experiments were done on the same Meiko CS-2 system described in Section 4.1.3 in the previous case study, with one exception, there was an operating system upgrade from Solaris 2.3 to Solaris 2.5. Four dierent compilers were used, one Fortran 90 compiler and three HPF compilers.

apf90 v4.0 , the Apogee Fortran 90 compiler; pghpf v2.1 , the PGI HPF compiler (optimized for the CS-2); pghpf v2.2 , the PGI HPF compiler (a generic version, not optimized for the CS-2); and

hpf v2.01 , the N.A.Software HPF compiler. For all compilers the highest possible optimization level was used, i. e., , and .

apf90 -fast testfile.f90 pghpf -fast -Mmpi testfile.f90 hpf -O testfile.f90

4.2.1 The Test Program

The algorithm investigated is Gaussian elimination without pivoting. Hence the elimination procedure can be written as

4.2 Parallel Gaussian Elimination

35

DO k = 1, n-1 k1 = k + 1 DO i = k1, n A(i,k) = A(i,k) / A(k,k) ENDDO DO j = k1, n DO i = k1, n A(i,j) = A(i,j) - A(i,k) * A(k,j) ENDDO ENDDO ENDDO

Apart from using three explicit nested loops, Fortran 90 provides several alternative ways to code this problem. In all of the code versions, A(k1:n,k) = A(k1:n,k) / A(k,k)

was used instead of DO i = k1, n A(i,k) = A(i,k) / A(k,k) ENDDO

DO/DO. The INDEPENDENT directive was used to support the compiler since the HPF compilers were not expected to do proper dependence analysis for DO loops. !HPF$ INDEPENDENT DO j = k1, n !HPF$ INDEPENDENT DO i = k1, n A(i,j) = A(i,j) - A(i,k) * A(k,j) ENDDO ENDDO

This directive tells the compiler that there are no data dependencies in the loops. Thus the loops can be parallelized.

DO/ARRAY. A DO loop and Fortran 90 array syntax are used. DO j = k1, n A(k1:n,j) = A(k1:n,j) - A(k1:n,k) * A(k,j) ENDDO

DO/FORALL. The array syntax is recoded as a FORALL loop. DO i = k1, n FORALL (i = k1:n) A(i,j) = A(i,j) - A(i,k) * A(k,j) ENDDO

36

4. Case Studies

FORALL/FORALL. Using a FORALL statement, the nested DO loops can be written in one line.

FORALL (i = k1:n, j = k1:n) A(i,j) = A(i,j) - A(i,k) * A(k,j)

FORALL/ARRAY. The inner loop within the FORALL statement is expressed using array syntax.

FORALL (j = k1:n) A(k1:n,j) = A(k1:n,j) - A(k1:n,k) * A(k,j)

ARRAY/ARRAY. The Fortran 90 intrinsic SPREAD is used. A(k1:n,k1:n) = A(k1:n,k1:n) - ( SPREAD (A(k1:n,k), 2, n-k) & * SPREAD (A(k,k1:n), 1, n-k) )

The results indicate whether this Fortran 90 intrinsic is implemented eciently. The program gauss starts with an initialization phase; then the computation is done; and nally the measured time is written as output. For all versions the time for the actual elimination process was measured using the Fortran 90 intrinsic SYSTEM CLOCK. The tests were done using the values n = 1024 and n = 2000 for the dimension of the square matrix A. The code looks like PROGRAM gauss INTEGER, PARAMETER :: n = 1024 INTEGER :: i, j REAL, DIMENSION (n,n) :: A DOUBLE PRECISION :: time_start, time_end, time ! ! Initialization ! CALL RANDOM_NUMBER(a) ! ! Elimination ! CALL timer(time_start) ! elimination process ... CALL timer(time_end) time = time_end-time_start ! ! Output ! write (6,*) write (6,*) 'Time with System_Clock: ' write (6,'(f5.1,1x,a4)') time, ' sec' END PROGRAM gauss


37

The subroutine timer returns the elapsed time. SUBROUTINE timer (t) DOUBLE PRECISION, INTENT (OUT) :: t INTEGER :: cnt, rate, max ! CALL SYSTEM_CLOCK (cnt, rate, max) t = dble(cnt) / dble(rate) ! END SUBROUTINE timer

4.2.2 Performance Results

The Apogee Fortran 90 Compiler

Timing results for the Apogee Fortran 90 compiler are listed in Table 4.3. It shows that the way nested loops are expressed has a crucial impact on performance. Even the performance of sequential execution varies widely (by a factor of 4).

Loop Structure DO/DO DO/ARRAY DO/FORALL FORALL/FORALL FORALL/ARRAY ARRAY/ARRAY

Runtime Eciency 119 s 68 s 68 s 284 s

y

123 s

12 % 21 % 21 % 5% { 12 %

y. . . This version caused a segmentation fault ; Apogee con rmed this as a bug in the Fortran 90 to Fortran 77 translator.

Table 4.3: Performance of Gaussian elimination using apf90 and a matrix order n = 1024.

The PGI 2.1 HPF Compiler The PGI 2.1 compiler (see Figures 4.1 and 4.2) seems to produce only two different codes { a \good" one which scales linearly and a \bad" one which shows a slow down when the number of processors increases. Note, however, that the percentage of peak performance achieved by the \good" codes, their eciency, is terribly low (less than 10 %).

The PGI 2.2 HPF Compiler A comparison with the new compiler release PGI 2.2 reveals a similar performance behavior of the \good" versions but slightly longer runtimes of the \bad" ones (see Figure 4.3). For the PGI 2.2 compiler we also tested the in uence of the problem size on the scalability (see Figures 4.3{4.6). The \good" code versions

38

4. Case Studies

Runtime

500 s

array/array, forall/array, forall/forall do/array, do/forall

400 s 300 s 200 s 100 s 0s

1

2

4 Number of Processors

8

16

Figure 4.1: Runtime of Gaussian elimination using pghpf 2.1 and a matrix order n = 1024. Floating-Point Performance

100 %


80 % 60 % 40 % 20 % 0%

1

2


8

16

Figure 4.2: Eciency of Gaussian elimination using pghpf 2.1 and a matrix order n = 1024.

show linear speed up for all matrix sizes. This time, however, the \bad" versions show an improvement: the slow down decreases as the problem size increases. Comparing the two dierent curves of the \good" and \bad" versions, it is clear that the runtime on one processor diers proportionally as the matrix size increases. For a square matrix of order 2000, the runtime dierence is already about 400 seconds (see Figure 4.5).

The NAS HPF Compiler The NAS compiler is able to produce three dierent versions of executables (see Figures 4.7 and 4.8); unfortunately only one way of expressing nested DO loops, the ARRAY/ARRAY version results in performance gains with an increasing number of processors. There is, however, a serious drawback of the AR-


39

Runtime

500 s


400 s 300 s 200 s 100 s 0s

1

2


8

16


100 %


80 % 60 % 40 % 20 % 0%

1

2


8

16


RAY/ARRAY version. Running on one processor it is about four times slower than, for example, the DO/ARRAY version. So when comparing the three versions, the following observations can be made. One version is very slow on one processor but scales linearly. The second version is fast on one processor but does not scale at all, and the third version is very slow on one processor and does not scale.

Summary What most users would expect of a decent HPF compiler, to produce object code that is fast on one processor and scales linearly, is currently not available.

40

4. Case Studies

Runtime

1400 s


1200 s 1000 s 800 s 600 s 400 s 200 s 0s

1

2


8

16


100 %


80 % 60 % 40 % 20 % 0%

1

2


8

16



41

Runtime

500 s

array/array do/array, do/forall forall/array

400 s 300 s 200 s 100 s 0s

1

2


8

16

Figure 4.7: Runtime of Gaussian elimination using hpf and a matrix order n = 1024. Floating-Point Performance

100 %

array/array do/array, do/forall forall/array

80 % 60 % 40 % 20 % 0%

1

2


8

16

Figure 4.8: Eciency of Gaussian elimination using hpf and a matrix order n = 1024.

Chapter 5

Conclusions There have been several attempts to use HPF in \real" applications. However, portable parallel performance , one of the main goals of the HPF approach, has clearly not yet been achieved. On most of the larger applications, today's HPF compilers achieve unsatisfactory performance. This seems to have several reasons.

Most of the substantial and innovative constructs of HPF, which are very interesting for parallel algorithms, are very hard to implement eciently in HPF compilers. A lot of eort has to be put into developing compiler technology to solve this problem.

Many advanced and challenging HPF features are integrated only in academic and research compilers, whose scope is usually very limited in terms of the \basic" features (e. g., Fortran 90 functionality). Hence they are not ready for serious professional use.

Commercial compiler developers and vendors are not willing to invest a great deal of eort into the development of HPF compilers as long as they are doubtful about the pro ts they get out of it. Because there are currently so few real users of HPF, it is not clear yet whether the market will ever be big enough.

At the same time, users do not start using HPF at a large scale as long as

no mature, complete, and reliable compilers, which are easy and convenient to use and which yield at least decent performance (requirements which can most likely only be satis ed by commercial HPF compilers), are available.

There is a certain common-sense, minimum requirement which HPF compilers will have to meet in order to obtain more acceptance. Some performance degradation of HPF codes compared to explicit message-passing programming will be tolerated as long as it is outweighed by the disadvantages of the message-passing programming model (enormous coding/debugging eort and lack of exibility).

The basis of HPF, Fortran 90/95 is not yet well established and fully accepted

at the moment. Nevertheless, \dusty deck" Fortran 77 programs will require recoding using Fortran 90 features in order to be converted into HPF. Some Fortran 77 coding techniques (storage and sequence association, explicit declaration of workspace, etc.), which are used in many \dusty deck" codes, cannot be handled eciently by HPF compilers.

5. Conclusions

43

We think that, in the near future, the following aspects will be crucial for the success of HPF.

HPF was originally intended to be a language more or less restricted to data

parallel applications. In this eld, it has really been a promising approach, and signi cant progress, including successful applications, has been made. However, there is a tendency to try to make HPF a comprehensive language for parallel programming because of the strong demand from various application areas. Unfortunately, compiler technology seems not to be ready yet to support such an ambitious goal, in particular with respect to object code performance. There seems to be a small, but clearly de ned, market segment where HPF can be successfully used and where it will be established as the most ecient way for parallel programming. Many features which are currently demanded for more general application areas can probably not be integrated in the near future. In order to proceed successfully, we think that three main aspects are crucial and, hence, should be given highest priority. { The ecient implementation of full HPF 2.0 plus HPF LOCAL and HPF SERIAL (postponing attempts to implement the Approved Extensions and further ideas of extending the language). { The integration of well established parallel libraries (e. g., the interface to ScaLapack) into HPF in order to provide better support for the development of large codes. { Closely related to the previous point, the integration of interfaces to BLAS routines in order to optimize single processor performance.

Appendix A

HPF Compilers and Tools This appendix lists a variety of relevant Web sites of commercial and research HPF compilers, as well as Web sites with information about tools. By the nature of such surveys the information given here is incomplete and is only intended as a starting point for further investigation. We apologize for omissions. HPF compiler surveys are available at

http://www.ac.upc.es/HPFSurvey/ http://www.crpc.rice.edu/HPFF/hpfcompiler/index.html http://www.irisa.fr/pampa/HPF/survey.html http://www.npac.syr.edu/hpfa/compilers.html

A.1 Commercial HPF Compilers DIGITAL Fortran (DEC)

http://www.digital.com/info/hpc/fortran/hpf.html

EXPERT HPF (ACE)

http://www.ace.nl/

HPF Mapper (EPC)

http://www.epc.co.uk/hpf.html

HPFPlus (NAS)

http://www.nasoftware.co.uk/

pghpf (PGI)

http://www.pgroup.com/

SUN HPF (SUN)

http://www.sun.com/servers/hpc/software/configuration.html

VAST-HPF (PSR)

http://www.psrv.com/vast/vasthpf.html

xHPF (APR)

http://www.apri.com/

XL HPF (IBM)

http://www.software.ibm.com/ad/fortran/xlhpf/index.html

A.2 Research HPF Compilers

45

A.2 Research HPF Compilers Adaptor (GMD-SCAI)

http://www.gmd.de/SCAI/lab/adaptor/adaptor home.html

Annai (CSCS/NEC)

http://www.irisa.fr/EuroTools/SIG/HPF/HPF/annai.html

EPPP (CRIM Montreal)

http://www.CRIM.CA/apar/Group/English/Projects/EPPP.html

Fortran D (Rice University)

http://www.cs.rice.edu/ dsystem/

FX Compiler (Carnegie Mellon University)

http://www.cs.cmu.edu/ fx

VFC (University of Vienna)

http://www.par.univie.ac.at/hpf%2b/index.html

HPFC (Ecole des Mines de Paris)

http://www.cri.ensmp.fr/ coelho/hpfc.html

Paradigm (University of Illinois at Urbana-Champaign) http://www.crhc.uiuc.edu/Paradigm/

SHPF (University of Southampton and VCPC)

http://www.vcpc.univie.ac.at/information/software/shpf/

A.3 Tools

http://www.dolphinics.com/toolworks http://www.nasoftware.co.uk/hpf/index.html http://www.psrv.com/deep.html http://www.sun.com/servers/hpc/software/configuration.html http://www.irisa.fr/EuroTools/SIG/HPF/HPF/survey.html http://www.npac.syr.edu/hpfa/

Appendix B

Institutions This appendix summarizes institutions with ongoing HPF related activities (including institutions mentioned in the e-mail survey of Section 3).

B.1 Universities CTU Prague Delft University of Technology (DUT) ETH Zuerich Rice University (Center for Research on Parallel Computation CRPC) Southampton University Technical University of Vienna University of Amsterdam University of Vienna University of West Bohemia

B.2 Research Centers Argonne National Laboratory CEA/DI-CISI, Gif-sur-Yvette CRS4 (Center for Advanced Studies, Research and Development in Sardinia) Daresbury Laboratory Fujitsu Europe GMD { Forschungszentrum Informationstechnik GmbH High Performance Application Centre (HPaC) in Delft Hitachi INRIA/IRISA { INSA Lawrence Berkeley National Laboratory (LBNL) Lawrence Livermore National Laboratory (LLNL) Los Alamos National Laboratory (LANL) NASA Ames NAS division NASA Langley Research Center National Center for Atmospheric Research National Center for Supercomputing Applications (NCSA) National Parallel Architectures Center NERSC (National Energy Research Scienti c Computing Center) North Carolina Supercomputing Center

B.3 Commercial Institutions

Paci c Northwest Labs Pittsburg Supercomputer Center (PSC) RIST Sandia National Laboratory SARA (Amsterdam Supercomputer Centre) TNO (Netherlands Organization for Applied Scienti c Research)

B.3 Commercial Institutions AGIP spa Associated Compiler Experts (ACE) DEC Dolphin Interconnect Solutions IBM Matra-defense NAG Ltd. (Parallel Libraries Group) N.A.Software Ltd. NEC The Portland Group, Inc.

47

Appendix C

Training Material and Courses Here we summarize some of the training material available on the Web.

DEC:

http://www.digital.com/info/hpc/f90/hpf tutorial.ps

EPCC:

http://www.epcc.ed.ac.uk/epcc-tec/documents/

Liverpool:

http://www.liv.ac.uk/HPC/HPFpage.html

Syracuse:

http://www.npac.syr.edu/users/haupt/tutorial/tutorial.html

Appendix D

Projects

This appendix gives some Web sites of projects dealing with HPF. and

http://www.crpc.rice.edu/HPFF/projects.html http://www.vcpc.univie.ac.at/information/mirror/HPFF/projects.html

give a survey of Web sites of HPF related projects.

ESPRIT project PHAROS:

http://www.vcpc.univie.ac.at/activities/projects/PHAROS.html

ESPRIT project PREPARE:

http://www.irisa.fr/pampa/PREPARE/prepare.html

ESPRIT project PPPE:

http://cs.1.soton.ac.uk/ProjectDir/tools/pppe.html

ESPRIT project HPC Standards:

http://www.ccg.ecs.soton.ac.uk/hpc-stds/index.html

ESPRIT project HPF+:

http://www.par.univie.ac.at/hpf%2b/index.html

AURORA:

http://www.vcpc.univie.ac.at/aurora/

VFCS:

http://www.par.univie.ac.at/project/vfcs.html

SHPF:

http://www.vcpc.univie.ac.at/information/software/shpf/

HPFC:

http://www.cri.ensmp.fr/ coelho/hpfc.html

PANDORE:

http://www.irisa.fr/pampa/PANDORE/pandore.html

ADAPTOR:

http://www.gmd.de/SCAI/lab/adaptor/adaptor home.html

Glossary ALIGN: HPF directive for specifying a relation between the distribution of two objects;

APPROVED EXTENSIONS: extends the language speci cation of HPF 2.0; ASCI: Accelerated Strategic Computing Initiative; BLACS: basic linear algebra communication subroutines; BLAS: basic linear algebra subroutines; CFD: computational uid dynamics; CM: Connection Machine; a series of computers by Thinking Machines Corp.; CM Fortran: Connection Machine Fortran; DISTRIBUTE: HPF directive for specifying distribution of variables or templates over processors; DSM: distributed shared memory; EXTRINSIC: HPF keyword declaring a non-HPF procedure; FFT: fast Fourier transform; FORALL: Fortran 95 data parallel assignment or set of assignments; HPC: high performance computing; HPF: High Performance Fortran;

HPFF: HPF Forum; HPF LOCAL: a type of extrinsic interface; HPF SERIAL: a type of extrinsic interface; HUG: HPF User Group; INDEPENDENT: HPF directive declaring that the iterations of a DO loop are data independent; MIMD: multiple instructions, multiple data; MPI: message passing interface; MPP: massively parallel processing;

Glossary

51

NCSA: National Center for Supercomputing Applications; NUMA: non uniform memory access; ON HOME: an HPF Approved Extension; an optional clause of the INDEPENDENT directive specifying which processors the iterations are to be executed on; PDE: partial dierential equation; PROCESSORS: HPF directive declaring an abstract processor arrangement; PURE: Fortran 95 attribute for procedures; PVM: parallel virtual machine; a message passing package; SIMD: single instruction, multiple data; SMP: shared memory symmetric multiprocessing; SPMD: single program, multiple data; SUBSET HPF: subset of HPF 1.0/1.1 for early compiler implementations.

Bibliography Where available, we included the abstracts in this bibliography. In most cases we made some minor changes, and we also had to translate some of them into English.

Books [1] G. S. Almasi and A. Gottlieb: Highly Parallel Computing, 2nd ed. Benjamin/Cummings Publ., Redwood City 1994.

The book incorporates extensive coverage of current architectural concepts. Describes parallel machines from Intel, Thinking Machines, KSR, MasPar, IBM, Cray Research, and other companies. Covers parallel programming environments, including HPF, Express, PVM, and Linda, as well as functional and logic programming languages.

[2] T. Fahringer: Automatic Performance Prediction of Parallel Programs. Kluwer Academic Publishers, Boston 1996.

The goal of this book is to demonstrate that performance information, on the one hand, is critical for the parallelization and optimization of parallel programs, and, on the other hand, it can be automatically and eciently generated at compile time with very good accuracy. This book presents a new approach to the problem of performance prediction which is organized around two major components: pro ling and performance parameters. The author incorporates a pro le run to derive program unknown data for branching probabilities, loop iteration and statement execution counts. Large amounts of the pro le data can be successfully adapted for most important program transformations without redoing the pro le run. A parameter based approach is presented to the problem of automatic performance prediction for parallel (in particular HPF) programs. The author developed P 3 T , a Parameter based Performance Prediction Tool, which models four of the most important performance aspects of parallel programs: load balance, data locality, communication and computation overhead. This tool has been evaluated for various HPF kernels as part of an HPF compilation system.

[3] C. H. Koelbel, D. B. Loveman, R. Schreiber, G. L. Steele Jr. and M. E. Zosel: The HPF Handbook. MIT Press, Cambridge 1994.

This handbook functions as a user's guide to HPF. It provides clear explanations of both the purpose and the function of the new constructs, accompanied by numerous examples to illustrate both basic concepts and subtle eects. The handbook includes a review of new features of Fortran 90, concentrating on those that have an impact on HPF. Other chapters describe features for mapping data to parallel processors, specifying data parallel operations, and interfacing HPF programs to other programming languages.

[4] G. R. Perrin and A. Darte: The Data Parallel Programming Model { Foundations, HPF Realization, and Scienti c Applications. Lecture Notes in Computer Science 1132, Springer-Verlag, Berlin, 1996. This book presents a general introduction to the data parallel programming model, its expression in HPF, and some of its uses in scienti c applications. The rst part of lectures focus on prevalent compiling techniques and on existing programming environments. The

Books

53

second part investigates ways that the data parallel model aects programming style, implementation eciency, abstraction capability, and program proofs.

[5] R. W. Sebesta: Concepts of Programming Languages, 3rd ed. AddisonWesley, Reading 1996.

Following the usual classi cation of programming languages into imperative, functional, logic, and object-oriented languages typical concepts and constructs are described and evaluated. Examples from particular languages are presented to illustrate dierent appearances of related constructs and to analyze their origins, their design issues and design alternatives. The dierences between the third and the earlier editions result mainly from continuing evolution of currently popular languages. Additional sections are devoted new constructs of Ada 95, C++, and HPF.

[6] M. Wolfe: High-Performance Compilers for Parallel Computing. AddisonWesley, Amsterdam 1996.

This textbook does not discuss the usual topics of compiler books, e. g., lexical and syntactic analysis. It focuses on generating optimized parallel code, given the source program and the target architecture. Of course, this includes discovery of parallelism from sequential programs, but this is only a minor point. The rst chapter is an introduction into the problem of matrix multiplication on a variety of hardware architectures. (Unfortunately, there is a misprint in all high-level formulations: the multiplication sign is replaced by a plus.) This chapter impressively shows the in uence of dierent architectures on optimizing. The next chapter is concerned with programming language features. The author considers Fortran 90, HPF, and C. He compares these languages with respect to storage association, argument passing conventions, and dierent types of loops. Chapters 3 and 4 present mathematical prerequisites. Chapters 5 through 9 cover the main aspects of optimizing on the language level. Chapters 10 through 14 are concerned with dierent machine architectures.

Publications in Journals [7] G. Agrawal and J. Saltz: Interprocedural Data Flow Based Optimizations for Distributed Memory Compilation. Software-Practice & Experience 27 (1997), pp. 519{545. In this paper, the authors present the interprocedural optimizations required for compiling applications having irregular data access patterns, when coded in data parallel languages like HPF. They have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. They also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines. Then it is described how program slicing can be used for further applying IPRE in more complex scenarios. The authors have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. They present experimental results from two codes compiled using their system to demonstrate the ecacy of the presented schemes.

[8] G. Agrawal, A. Sussman and J. Saltz: An Integrated Runtime and CompileTime Approach for Parallelizing Structured and Block Structured Applications. IEEE Transactions on Parallel & Distributed Systems 6 (1995), pp. 747{754.

In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an ecient and machine-independent fashion are presented. The authors have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several dierent systems. The authors have also developed compiler analysis for determining data access patterns at compile-time and inserting calls to the appropriate runtime routines. Their methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the ecacy of this approach, the authors have implemented their compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. They have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Their experimental results show that their primitives have low runtime communication overheads and the compiler parallelized codes perform within 20 % of the codes parallelized by manually inserting calls to the runtime library.

[9] C. Ancourt, F. Coelho, F. Irigoin and R. Keryell: A Linear Algebra Framework for Static HPF Code Distribution. Scienti c Programming 6 (1997), pp. 3{27.

It is shown in this publication that linear algebra is a powerful framework to encode HPF directives and to synthesize distributed code with space ecient array allocation, tight loop bounds, and vectorized communications for INDEPENDENT loops. The generated code includes traditional optimizations such as guard elimination, message vectorization and aggregation, and overlap analysis. The systematic use of an ane framework makes it possible to prove the compilation scheme correct.

[10] E. Ayguade, J. Garcia, M. Girones, M. Luz Grande and J. Labarta: DDT:

Journals

55

A Research Tool for Automatic Data Distribution in HPF. Scienti c Programming 6 (1997), pp. 73{94.

The article describes the main features and implementation of an automatic data distribution research tool. The tool (DDT) accepts programs written in Fortran 77 and generates HPF directives to map arrays onto the memories of the processors and parallelize loops, and executable statements to remap these arrays. DDT works by identifying a set of computational phases (procedures and loops). The algorithm builds a search space of candidate solutions for these phases which is explored looking for the combination that minimizes the overall cost; this cost includes data movement cost and computation cost. The movement cost re ects the cost of accessing remote data during the execution of a phase and the remapping costs that have to be paid in order to execute the phase with the selected mapping. The computation cost includes the cost of executing a phase in parallel according to the selected mapping and the owner computes rule. The tool supports interprocedural analysis and uses control ow information to identify how phases are sequenced during the execution of the application.

[11] S. Bae and S. Ranka: PACK/UNPACK on Coarse-Grained Distributed Memory Parallel Machines. Journal of Parallel and Distributed Computing 38 (1996), pp. 204{216. PACK/UNPACK are Fortran 90/HPF array construction functions that derive new arrays from existing arrays. The authors present algorithms for performing these operations on coarse-grained parallel machines. Their algorithms are relatively architecture independent and can be applied to arrays of arbitrary dimensions with arbitrary distribution along every dimension. Experimental results are presented on the CM-5.

[12] V. Bouchitte, P. Boulet, A. Darte and Y. Robert: Evaluating Array Expressions on Massively Parallel Machines with Communication/Computation Overlap. International Journal of Supercomputer Applications 9 (1995), pp. 205{219. This paper deals with the problem of evaluating HPF style array expressions on massively parallel distributed-memory computers. This problem has been addressed by Chatterjee et al., 1992, 1993 under the strict hypothesis that computations and communications cannot overlap. As such a model appears to be unnecessarily restrictive for modeling state-of-the-art parallel distributed-memory computers, the authors relax the restriction and allow for simultaneous computations and communications. This simple modi cation has a tremendous eect on the complexity of the optimal evaluation of array expressions. The paper rst shows that even a simple version of the problem is NP-complete. Then, the authors present some heuristics that they can guarantee in some important cases in practice, namely, for coarse-grain or ne-grain computations.

[13] Z. Bozkus, L. F. Meadows, S. Nakamoto, V. Schuster and M. Young: PGHPF { An Optimizing HPF Compiler for Distributed Memory Machines. Scienti c Programming 6 (1997), pp. 29{40.

The article describes a production quality HPF compiler for a set of parallel machines. Compilation techniques such as data and computation distribution, communication generation, run time support, and optimization issues are elaborated as the basis for an HPF compiler implementation on distributed memory machines. The performance of this compiler on benchmark programs demonstrates that high eciency can be achieved executing HPF code on parallel architectures.

[14] T. Brandes, S. Chaumette, M. C. Counilh, J. Roman, A. Darte, F. Desprez and J. C. Mignot: HPFIT: A Set of Integrated Tools for the Parallelization of Applications Using HPF; Part I: HPFIT and the TransTOOL

56

Bibliography Environment. Parallel Computing 23 (1997), pp. 71{87.

The authors present the HPFIT project whose aim is to provide a set of interactive tools integrated in a single environment to help users to parallelize scienti c applications to be run on distributed memory parallel computers. HPFIT is built around a restructuring tool called TransTOOL which includes an editor, a parser, a dependence analysis tool and an optimization kernel. Moreover, they provide a clean interface to help developers of tools around HPF to integrate their software within the tool.

[15] T. Brandes, S. Chaumette, M. C. Counilh, J. Roman, F. Desprez and J. C. Mignot: HPFIT: A Set of Integrated Tools for the Parallelization of Applications Using HPF; Part II: Data-Structure Visualization and HPF Extensions for Irregular Problems. Parallel Computing 23 (1997), pp. 89{105. For Part I see [14]. The paper presents a data structure visualization tool called Visit, and HPF extensions for irregular problems.

[16] W. N. Celmaster: Modern Fortran Revived as the Language of Scienti c Parallel Computing. Digital Technical Journal 8 (1996), pp. 39{45.

This paper describes Fortran 90 and the standardized language extensions for both shared-memory and distributed-memory parallelism. In particular, three case studies are examined, showing how the distributed-memory extensions (HPF) are used both for data parallel algorithms and for single-program-multiple-data algorithms.

[17] S. Chakrabarti, M. Gupta and J.-D. Choi: Global Communication Analysis and Optimization. SIGPLAN Notices 31 (1996), pp. 68{78.

Reducing communication cost is crucial to achieving good performance on scalable parallel machines. The authors present a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Their algorithm is distinct from existing approaches in that rather than handling loop-nests and array references one by one, it considers all communication in a procedure and their interactions under dierent placements before making a nal decision on the placement of any communication. It exploits the exibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buers, all in a uni ed framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive data ow analysis on array sections can generate too many messages or cache and buer contention. The algorithm has been implemented in the IBM pHPF compiler for HPF. During compilation, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. The paper presents performance results for the IBM SP2 and a network of Sparc workstations. In many cases, the communication cost is reduced by a factor of two.

[18] B. M. Chapman, M. Haines, P. Mehrotra, J. Van Rosendale and H. P. Zima: Opus: A Coordination Language for Multidisciplinary Applications. Scienti c Programming 6 (1997), to appear. Many advanced scienti c and engineering applications are multidisciplinary and heterogeneous in nature, and thus do not t well into the data parallel paradigm. In this paper the authors present Opus, a language designed to ll this gap. The central concept of Opus is a mechanism called ShareD Abstractions (SDA). An SDA can be used as a computation server, i. e., a locus of computational activity, or as a data repository for sharing data between asynchronous tasks. SDAs can be internally data parallel, can thus be used to express multidisciplinary applications in a natural and ecient way. The paper describes the features of the language through a series of examples and gives an overview of the runtime support required to implement these concepts in parallel and

Journals

57

distributed environments.

[19] B. M. Chapman, P. Mehrotra and H. P. Zima: Programming in Vienna Fortran. Scienti c Programming 1 (1992), pp. 31{50.

Vienna Fortran is a language extension of Fortran which provides the user with a wide range of facilities for mapping of data structures. Programs in Vienna Fortran are written using global data references. Thus, the user has the advantages of a shared memory programming paradigm while explicitly controlling the data distribution. In this paper, the authors present the language features of Vienna Fortran for Fortran 77, together with examples illustrating the use of these features.

[20] B. M. Chapman, P. Mehrotra and H. P. Zima: Vienna Fortran and the Path towards a Standard Parallel Language. IEICE Transactions on Information and Systems E80-D (1997), 409-416. Vienna Fortran was the rst fully-speci ed data-parallel language for highly parallel scalable multiprocessing systems that provided features for the speci cation of data distribution and alignment at a high level of abstraction. In this paper, the authors outline the major elements of Vienna Fortran and compare it to HPF. A signi cant weakness of HPF is its lack of support for many advanced applications which require irregular data distributions and dynamic load balancing. The paper introduces HPF+, an extension of HPF based on Vienna Fortran, that provides the required functionality.

[21] S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber and S. H. Teng: Generating Local Addresses and Communication Sets for Data-Parallel Programs. Journal of Parallel and Distributed Computing 26 (1995), pp. 72{84.

Generating local addresses and communication sets is an important issue in distributedmemory implementations of data-parallel languages such as HPF. The authors demonstrate a storage scheme for an array A anely aligned to a template that is distributed across p processors with a cyclic(k) distribution that does not waste any storage, and show that, under this storage scheme, the local memory access sequence of any processor for a computation involving the regular section A(l : h : s) is characterized by a nite state machine of at most k states. They present fast algorithms for computing the essential information about these state machines, and they extend the framework to handle multidimensional arrays. The authors also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.

[22] T. W. Clark, R. von Hanxleden and K. Kennedy: Experiences in DataParallel Programming. Scienti c Programming 6 (1997), pp. 153{158.

To eciently parallelize a scienti c application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization eort reinforced this observation and motivated this correspondence. Speci cally, the authors have transformed a Fortran 77 version of Gromos, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation they have encountered a number of diculties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. The experience with Gromos suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. The paper presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or HPF compilers.

[23] F. Coelho and C. Ancourt: Optimal Compilation of HPF Remappings. Journal of Parallel and Distributed Computing 38 (1996), pp. 229{236.

58

Bibliography Applications with varying array access patterns require to dynamically change array mappings on distributed-memory parallel machines. HPF provides such remappings explicitly through realign and redistribute directives and implicitly at procedure calls and returns. However, such features are left out of HPF 2.0 for eciency reasons. This paper presents a new technique for compiling HPF remappings onto message-passing parallel architectures. First, useless remappings that appear naturally are removed. Second, the SPMD generated code takes advantage of replication to shorten the remapping time. Communication is proved optimal: a minimal number of messages, containing only the required data, is sent over the network. The technique is fully implemented in the authors' HPF compiler and was experimented on a DEC Alpha farm.

[24] J. J. Dongarra and A. J. G. Hey: The ParkBench Benchmark Collection. Supercomputer 11 (1995), pp. 94{114. The ParkBench benchmark suite comprises codes that vary from low-level benchmarks measuring basic machine parameters, through important application kernels, to compact research applications. This hierarchical structure allows information derived from the simpler codes to be used in explaining the performance characteristics of the more complicated codes. Thus the benchmark suite can be used to evaluate performance on a range of levels from simple machine parameters to full applications where eects due to non-parallelizable sections of code, and memory, communication or I/O bottlenecks may become important. The synthetic compiler benchmark suite described here is an addition to the benchmark kernels and applications described previously. It is not meant as a tool to evaluate the overall performance of the compiler generated codes. It has been introduced as an aid for compiler developers and implementors to address some selected aspect of the HPF compilation process. In the current version, the suite does not comprise a comprehensive sample of HPF codes. Actually, it addresses only the HPF subset.

[25] T. Fahringer: Estimating and Optimizing Performance for Parallel Programs. IEEE Computer 28 (1995), pp. 47{56.

P 3 T is an interactive performance estimator that assists users in performance tuning of

scienti c Fortran programs. It detects performance bottlenecks in the program, identi es the cause of performance problems, and advises the user on how to gain performance. Four of the most critical performance aspects of parallel programs are estimated: load balance, cache locality, communication and computation overhead. P 3 T is an integrated tool of the Vienna Fortran Compilation System, which enables the estimator to aggressively exploit considerable knowledge about the compiler's analysis information and code restructuring strategies. P 3 T 0s graphical user interface directs the user to bottlenecks in a computation that prevent the program from performing well. Furthermore, it allows ltering and visualizing of performance data at various levels of detail. A new performance parameter, which models computation time is discussed. Furthermore, P 3 T has been extended to handle a subset of HPF programs.

[26] I. Foster: Task Parallelism and High-Performance Languages. IEEE Parallel & Distributed Technology 2 (1994), pp. 27{36. HPF currently lacks support for explicit task parallelism. This article explains why task parallelism is important, and describes one precandidate task-parallel model: the Fortran M extensions to Fortran 77.

[27] M. Le Fur: Scanning Parameterized Polyhedron Using Fourier-Motzkin Elimination. Concurrency: Practice and Experience 8 (1996), pp. 445{460. The paper presents two algorithms for computing a control structure whose execution enumerates the integer vectors of a parameterized polyhedron de ned in a given context.

Journals

59

Both algorithms reconsider the successive projection method, based on Fourier-Motzkin pairwise elimination, de ned by Ancourt and Irigoin. The way redundant constraints are removed in their algorithm is revisited in order to improve the computation time for the enumeration code of higher order polyhedrons as well as their execution time. The algorithms presented here are at the root of the code generation in the HPF compiler Pandore developed at IRISA, France; a comparison of these algorithms with the one de ned by Ancourt and Irigoin is given in the class of polyhedrons manipulated by the Pandore compiler.

[28] T. Gross, D. R. O'Hallaron and J. Subhlok: Task Parallelism in a HPF Framework. IEEE Parallel & Distributed Technology 2 (1994), pp. 16{26.

Exploiting both data and task parallelism in a single framework is the key to achieving good performance for a variety of applications.

[29] M. Gupta and P. Banerjee: Compile-Time Estimation of Communication Costs of Programs. Journal of Programming Languages 2 (1994), pp. 191{ 225. One of the most challenging problems in compiling for distributed memory machines is to determine how data for a program should be distributed across processors. Any compiler that makes data partitioning decisions needs a mechanism for estimating communication and computational costs of programs to compare dierent alternatives. A methodology for estimating communication costs of programs written in global address space is presented. In this approach, the compiler analyzes programs before generating communication, and yet takes into account important communication optimizations that will be performed. The notion of traversal properties of array references in loops is introduced, helping to identify the nature and extent of data movement in terms of high-level communication primitives. This enables the compiler to obtain more precise information about the global state of communication, and in a largely machine-independent manner. The methodology described is implemented in a compiler, Paradigm, that automatically determines data partitioning for Fortran programs. The results obtained with Paradigm con rm the importance of this analysis for making good data partitioning decisions. The techniques developed for recognizing communication primitives that best characterize the data movement are quite general, and also form the basis of generation of communication in Ptran-II, a prototype compiler for HPF.

[30] S. K. S. Gupta, S. D. Kaushik, C.-H. Huang and P. Sadayappan: Compiling Array Expressions for Ecient Execution on Distributed-Memory Machines. Journal of Parallel and Distributed Computing 32 (1996), pp. 155{172.

Array statements are often used to express data-parallelism in scienti c languages such as Fortran 90 and HPF. In compiling array statements for a distributed-memory machine, ecient generation of communication sets and local index sets is important. The authors show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be eciently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an ecient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or blockwise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits dierent schemes based on the combination of the virtual processor views chosen for the dierent arrays involved in an array statement. These virtualization schemes have dierent indexing

60

Bibliography overhead. The authors present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, ecient code can be generated for execution of array statements involving block-cyclically distributed arrays.

[31] M. Gupta, E. Schonberg and H. Srinivasan: A Uni ed Framework for Optimizing Communication in Data-Parallel Programs. IEEE Transactions on Parallel & Distributed Systems 7 (1996), pp. 689{704.

This paper presents a framework, based on global array data- ow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. The authors introduce available section descriptors, a novel representation of communication involving array sections. This representation allows them to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, the authors are able to capture optimizations like (1) vectorizing communication, (2) eliminating communication that is redundant on any control ow path, (3) reducing the amount of data being communicated, (4) reducing the number-of processors to which data must be communicated, and (5) moving communication earlier to hide latency, and to subsume previous communication. They show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more ecient. The paper presents results from a preliminary implementation of this framework, which are extremely encouraging, and demonstrate the eectiveness of this analysis in improving the performance of programs.

[32] D. R. O'Hallaron, J. Webb and J. Subhlok: Performance Issues in HPF Implementations of Sensor-Based Applications. Scienti c Programming 6 (1997), pp. 59{72.

Applications that get their inputs from sensors are an important and often overlooked application domain for HPF. Such sensor-based applications typically perform regular operations on dense arrays, and often have latency and throughput requirements that can only be achieved with parallel machines. The article describes a study of sensor-based applications, including the fast Fourier transform, synthetic aperture radar imaging, narrowband tracking radar processing, multibaseline stereo imaging, and medical magnetic resonance imaging. The applications are written in a dialect of HPF developed at Carnegie Mellon, and are compiled by the Fx compiler for the Intel Paragon. The main results of the study are that (1) it is possible to realize good performance for realistic sensor-based applications written in HPF and (2) the performance of the applications is determined by the performance of three core operations: independent loops (i. e., loops with no dependences between iterations), reductions, and index permutations. The article discusses the implications for HPF implementations and introduces some simple tests that implementers and users can use to measure the eciency of the loops, reductions, and index permutations generated by an HPF compiler.

[33] J. Harris, J. A. Bircsak, M. R. Bolduc, J. A. Diewald, I. Gale, N. W. Johnson, S. Lee, C. A. Nelson and C. D. Oner: Compiling HPF for Distributed Memory Systems. Digital Technical Journal 7 (1995), pp. 5{23.

Digital's DEC Fortran 90 compiler implements most of HPF version 1.1, a language for writing parallel programs. The compiler generates code for distributed-memory machines consisting of interconnected workstations or servers powered by Digital's Alpha microprocessors. The DEC Fortran 90 compiler eciently implements the features of Fortran 90 and HPF that support parallelism. HPF programs compiled with Digital's compiler yield performance that scales linearly or even superlinearly on signi cant applications on both distributed-memory and shared-memory architectures.

Journals

61

[34] HPFF: High Performance Fortran Language Speci cation Version 1.0. Scienti c Programming 2 (1) (1993), pp. 1{170. Also available at http://www.crpc.rice.edu/HPFF/hpf1/index.html or www.vcpc.univie.ac.at/information/mirror/HPFF/hpf1/index.html. [35] HPFF: High Performance Fortran Journal of Development. Scienti c Programming 2 (2) (1993), pp. 1{44. Also available at http://www.crpc.rice.edu/HPFF/hpf1/index.html or www.vcpc.univie.ac.at/information/mirror/HPFF/hpf1/index.html. [36] Y. Hu and S. L. Johnsson: Implementing O(N) N-Body Algorithms Eciently in Data-Parallel Languages. Scienti c Programming 5 (1996), pp. 337{364.

The optimization techniques for hierarchical O(N ) N -body algorithms described here focus on managing the data distribution and the data references, both between the memories of dierent nodes and within the memory hierarchy of each node. The paper shows how the techniques can be expressed in data-parallel languages, such as HPF and CM Fortran. The eectiveness of the used techniques is demonstrated on an implementation of Anderson's (1992) hierarchical O(N ) N -body method for the Connection Machine system CM-5/5E. Of the total execution time, communication accounts for about 10-20%, with the average eciency for arithmetic operations being about 40 % and the total eciency (including communication) being about 35 %. For the CM-5E, a performance in excess of 60 M op/s per node (peak 160 M op/s per node) has been measured.

[37] Y.-S. Hwang, B. Moon, S. D. Sharma, R. Ponnusamy, R. Das and J. Saltz: Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed-Memory Machines. Software { Practice & Experience 25 (1995), pp. 597{621.

In many scienti c applications, arrays containing data are indirectly indexed through indirection arrays. Such scienti c applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called Chaos, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The Chaos library provides ecient runtime primitives for distributing data and computation over processors; it supports ecient index translation mechanisms and provides users high-level mechanisms for optimizing communication. Chaos subsumes the previous Parti library and supports a larger class of applications. In particular, it provides ecient support for parallelization of adaptive irregular programs where indirection arrays are modi ed during the course of computation. To demonstrate the ecacy of Chaos, two challenging real-life adaptive applications were parallelized using Chaos primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, Chaos can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how Chaos can be eectively used in such a framework. By embedding Chaos primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.

[38] E. T. Kalns and L. M. Ni: Processor Mapping Techniques toward Ecient Data Redistribution. IEEE Transactions on Parallel & Distributed Systems 6 (1995), pp. 1234-1247.

62

Bibliography Run-time data redistribution can enhance algorithm performance in distributed-memory machines. Explicit redistribution of data can be performed between algorithm phases when a dierent data decomposition is expected to deliver increased performance for a subsequent phase of computation. Redistribution, however, represents increased program overhead as algorithm computation is discontinued while data are exchanged among processor memories. In this paper, the authors present a technique that minimizes the amount of data exchange for block to cyclic (or vice-versa) redistributions of arbitrary number of dimensions. Preserving the semantics of the target (destination) distribution pattern, the technique manipulates the data to logical processor mapping of the target pattern. When implemented on an IBM SP, the mapping technique demonstrates redistribution performance improvements of approximately 40 % over traditional data to processor mapping. Relative to the traditional mapping technique, the proposed method aords greater exibility in specifying precisely which data elements are redistributed and which elements remain on-processor.

[39] T. Kamachi, A. Muller, R. Ruhl, Y. Seo, K. Suehiro and M. Tamura: Kemari: A Portable HPF System for Distributed Memory Parallel Processors. Scienti c Programming 6 (1997), pp. 41{58.

The authors have developed a compilation system which extends HPF in various aspects. They support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives give both additional control to the user and simplify the compilation process. For the support of unstructured problems, they provide directives for dynamic data distribution through user-de ned mappings. The compiler also allows integration of MPI primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, the authors describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.

[40] T. Kamachi, K. Kusano, K. Suehiro, Y. Seo, M. Tamura, S. Sakon, Y. Watanabe and Y. Shiroto: HPF Compiler for Parallel Computers { Implementation and Performance Evaluation on Cenju-3. NEC Research & Development 36 (1995), pp. 325{334.

This paper presents compilation techniques for HPF used in the authors' prototype HPF compiler for parallel computers. The techniques include computation mapping, communication generation, and run-time support. Extensions of HPF with special directives for specifying computation mapping are also proposed. Preliminary experimental results of some benchmark programs on a NEC Cenju-3 demonstrate the eciency of the authors' implementation.

[41] S. D. Kaushik, C.-H. Huang and P. Sadayappan: Ecient Index Set Generation for Compiling HPF Array Statements on Distributed-Memory Machines. Journal of Parallel and Distributed Computing 38 (1996), pp. 237{ 247.

In languages such as HPF, array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, ecient enumeration of local index sets and communication sets is important. A method based on a virtual processor approach has been proposed for ecient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to rst develop closed forms in terms

Journals

63

of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an ecient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are rst aligned with a template at an oset and a stride and the template is then distributed among the processors using a regular data distribution. The introduction of a non unit stride in the alignment creates \holes" in the distributed arrays which leads to wasting memory. In this paper, using simple mathematical properties of regular sections, the authors extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. They develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings. Performance results on the Cray T3D are presented to demonstrate the ecacy of the extensions and identify various trade-os associated with the proposed method.

[42] K. Kennedy, N. Nedeljkovic and A. Sethi: A Linear-Time Algorithm for Computing the Memory Access Sequence in Data-Parallel Programs. SIGPLAN Notices 30 (1995), pp. 102{111. In this paper, the authors present an improved algorithm for nding the local memory access sequence in computations involving regular sections of arrays with cyclic(k) distributions. After establishing the fact that regular section indices correspond to elements of an integer lattice, they show how to nd a lattice basis that allows for simple and fast enumeration of memory accesses. The complexity of the used algorithm is shown to be lower than that of the previous solution for the same problem. In addition, the experimental results demonstrate the eciency of this method in practice.

[43] A. Knies, M. O'Keefe and T. MacDonald: HPF: A Practical Analysis. Scienti c Programming 3 (1994), pp. 187{199. In this article, the authors discuss issues that need to be addressed before an ecient production quality compiler will be available for HPF. They examine some speci c issues that are related to HPF's model of computation and analyze several implementation issues. The paper also provides some results from another data parallel compiler to help gain insight on some of the implementation issues that are relevant to HPF. Finally, the authors provide a summary of options currently available for application developers in industry.

[44] R. Koppler, S. Grabner and J. Volkert: Visualization of Distributed Data Structures for HPF-like Languages. Scienti c Programming 6 (1997), pp. 115{126.

The article motivates the usage of graphics and visualization for ecient utilization of HPF's data distribution facilities. It proposes a graphical toolkit consisting of exploratory and estimation tools which allow the programmer to navigate through complex distributions and to obtain graphical ratings with respect to load distribution and communication. The toolkit has been implemented in a mapping design and visualization tool which is coupled with a compilation system for the HPF predecessor Vienna Fortran. Since this language covers a superset of HPF's facilities, the tool may also be used for visualization of HPF data structures.

[45] J. K. Lee, I.-K. Tsaur and S.-Y. Hwang: Parallel Array Object I/O Support on Distributed Environments. Journal of Parallel and Distributed Computing 40 (1997), pp. 227{241. This paper presents a parallel le object environment to support distributed array store

64

Bibliography on shared-nothing distributed computing environments. The authors' environment enables programmers to extend the concept of array distributions from memory levels to le levels. It allows parallel I/O that facilitates the distribution of objects in an application. When objects are read and/or written by multiple applications using dierent distributions, a novel scheme is presented to help programmers to select the best data distribution pattern according to a minimum amount of remote data movements for the storage of array objects on distributed le systems. The authors' selection scheme, to the best of their knowledge, is the rst work to attempt to optimize the distribution patterns in the secondary storage for HPF-like programs with inter-application cases. This is especially important for a class of problems called multiple disciplinary optimization problems. The used test bed is built on an 8-node DEC Farm connected with an ethernet, FDDI, or ATM switch. The experimental results with scienti c applications show that not only the used parallel le system can provide aggregate bandwidths, but also the selection scheme eectively reduces the communication trac for the system.

[46] J. M. Levesque: Applied Parallel Research's xHPF System. IEEE Parallel and Distributed Technology, Systems and Applications 2 (1994), p. 71. Applied Parallel Research (APR) takes a somewhat dierent approach to HPF than do other implementers. APR feels the real power of HPF is in its comment line directives by which the user can drive an automatic parallelization system. HPF can be a powerful aid for automatic parallelization of existing Fortran 77 programs. HPF's data-distribution directives let one provide batch automatic parallelization tools, such as xHPF. The xHPF system will parallelize very complex Fortran 77 DO loops rather than relying on the user to explicitly expose parallel operations by translating to Fortran 90 array syntax. xHPF also accepts Fortran 90 array syntax and extends HPF data-distribution rules. APR's approach has been to provide HPF compilation systems that let users more easily port existing sequential Fortran 77 programs to massively parallel processor (MPP) systems. APR feels that the market for tools that port existing Fortran programs to MPP systems far exceeds the market for tools to develop parallel programs from scratch.

[47] W. Li and K. Pingali: Access Normalization: Loop Restructuring for NUMA Computers. ACM Transactions on Computer Systems 11 (1993), pp. 353{375.

In this article, the authors start with a language like HPF and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. The authors demonstrate the power of their techniques using routines from the BLAS library. An important feature of this approach is that they model loop transformations using invertible matrices and integer lattice theory.

[48] L. M. Liebrock and K. Kennedy: Automatic Data Distribution for Composite Grid Applications. Scienti c Programming 6 (1997), pp. 95{113.

Problem topology is the key to ecient parallelization support for partially regular applications. Speci cally, problem topology provides the information necessary for automatic data distribution and regular application optimization of a large class of partially regular applications. Problem topology is the connectivity of the problem. The research focuses on composite grid applications and strives to take advantage of their partial regularity in the parallelization and compilation process. Composite grid problems arise in important application areas, e. g., reactor and aerodynamic simulation. Related physical phenomena are inherently parallel and their simulations are computationally intensive. The authors present algorithms that automatically determine data distributions for composite grid problems. The algorithm's alignment and distribution speci cations may be used as input to an HPF program to apply the mapping for execution of the simulation code. These algorithms eliminate the need for user-speci ed data distribution for this

Journals

65

large class of complex topology problems. They test the algorithms using a number of topological descriptions from aerodynamic and water-cooled nuclear reactor simulations. Speedup-bound predictions with and without communication, based on the automatically generated distributions, indicate that signi cant speedups are possible using these algorithms.

[49] L. M. Liebrock and K. Kennedy: Parallelization Support for Coupled Grid Applications with Small Meshes. Concurrency: Practice and Experience 8 (1996), pp. 581{615.

Composite grid problems arise in important application areas, e. g., reactor simulation. Related physical phenomena are inherently parallel and their simulations are computationally intensive. Unfortunately, parallel languages, such as HPF, provide little support for these problems. The authors illustrate topological connections via a coupling statement, develop a programming style and transformation system to support composite grid code development, and develop an algorithm that automatically determines distributions for composite grid problems with small meshes. A mesh is classi ed as small if the amount of computational work associated with the mesh is less than the amount of work to be assigned to a single processor. Precompiler transformations, such as cloning for alignment speci cation, are described. Excerpts from an HPF program before and after transformation illustrate user programming style and transformation issues. The used distribution algorithm's alignment and distribution speci cations are input to the transformed HPF program which applies the mapping for execution of the simulation code. Some advantages of this approach are: transformations are applied before compilation and allow communication optimization; data distribution may be determined for any number of problems without recompilation; user determined distribution for parallelization is unnecessary; portability is improved. The authors validate the topology-based data distribution algorithm using a number of reactor con gurations. Two random distribution algorithms provide a basis of comparison with measures of load balance and communication cost. Experiments show that the topology-based distribution algorithm almost always obtains load balance at least as good as, and often signi cantly better than, random algorithms while reducing the total communication per iteration from 50% to as much as a factor of ten.

[50] B. Di Martino: Algorithmic Concept Recognition Support for Automatic Parallelization: A Case Study for Loop Optimization and Parallelization. Journal of Information Science and Engineering, Special Issue on Compiler Techniques for High-Performance Computing, to appear in March 1998. Migrator is a reverse engineering tool supporting the translation of Fortran 90 or par-

tially annotated HPF programs into fully annotated HPF codes with insertion of calls to optimized parallel libraries. The core analysis of Migrator is centered around an algorithmic recognition subsystem that proposes suitable restructuring transformations for solving alignment con icts, improving data locality, etc. Furthermore, calls to speci c HPF intrinsic functions, or BLAS and Linpack library functions can be inserted which replace the code implementing the corresponding function.

[51] J. H. Merlin and A. J. G. Hey: An Introduction to HPF. Scienti c Programming 4 (1995), pp. 87{113. This article provides a tutorial introduction to the main features of HPF. It outlines the background and motivations for the development of HPF, and gives an overview of the HPF programming model. Then it provides an in-depth description of the data distribution extensions and the extensions for expressing data-parallelism and concurrency, and summarizes the remaining HPF extensions. Finally the authors discuss some pros and cons of the HPF approach.

66

Bibliography

[52] T. Nakatani: Status and Issues of HPF Compilers. Systems and Computers in Japan 26 (1995), pp. 1{7.

The author has implemented HPF on the basis of the IBM XL Fortran 90 compiler. The paper presents an overview of the author's compilation strategy and preliminary results obtained on an IBM Scalable POWERparallel System.

[53] M. Parashar and S. Hariri: Compile-Time Performance Prediction of HPF/ Fortran 90D. IEEE Parallel & Distributed Technology: Systems & Applications 4 (1996), pp. 57{73. The compile-time HPF/Fortran 90D performance prediction framework allows accurate, cost-eective performance prediction in high-performance computing environments. The framework implements an interpretative approach to performance prediction and helps select appropriate HPF/Fortran 90D compiler directives, debug application performance, and experiment with runtime and system parameters.

[54] R. Ponnusamy, J. Saltz, A. Choudhary and Y.-S. Hwang: Runtime Support and Compilation Methods for User-Speci ed Irregular Data Distributions. IEEE Transactions on Parallel & Distributed Systems 6 (1995), pp. 815{ 831.

This paper describes two new ideas by which an HPF compiler can deal with irregular computations eectively. The rst mechanism invokes a user speci ed mapping procedure via a set of proposed compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a conservative method for compiling irregular loops in which dependence arises only due to reduction operations. This mechanism in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e. g., communication schedules, loop iteration partitions, and information that associates o-processor data copies with on-processor buer locations). This paper also presents performance results for these mechanisms from a Fortran 90D compiler implementation.

[55] R. Ponnusamy, Y.-S. Hwang, R. Das, J. Saltz, A. Choudhary and G. Fox: Supporting Irregular Distributions Using Data-Parallel Languages. IEEE Parallel & Distributed Technology: Systems & Applications 3 (1995), pp. 12{24.

Languages such as Fortran D provide irregular distribution schemes that can eciently support irregular problems. Irregular distributions can also be emulated in HPF. Compilers can incorporate runtime procedures to automatically support these distributions.

[56] K. van Reeuwijk, W. Denissen, H. J. Sips and E. M. R. M. Paalvast: An Implementation Framework for HPF Distributed Arrays on Message-Passing Parallel Computer Systems. IEEE Transactions on Parallel and Distributed Systems 7 (1996), pp. 897{914.

In this paper, an implementation framework is presented for HPF distributed arrays on message passing computers. Methods are presented for ecient (in space and time) local index enumeration, local storage, and communication. Techniques for local set enumeration provide the basis for constructing local iteration sets and communication sets. It is shown that both local set enumeration and local storage schemes can be derived from the same equation. Local set enumeration and local storage schemes are shown to be orthogonal, i. e., they can be freely combined. Moreover, for linear access sequences generated by the authors' enumeration methods, the local address calculations can be moved out of the enumeration loop, yielding ecient local memory address generation.

Journals

67

The local set enumeration methods are implemented by using a relatively simple general transformation rule for absorbing ownership tests. This transformation rule can be repeatedly applied to absorb multiple ownership tests. Performance gures are presented for local iteration overhead, a simple communication pattern, and storage eciency.

[57] B. Rodriguez, L. Hart and T. Henderson: Parallelizing Operational Weather Forecast Models for Portable and Fast Execution. Journal of Parallel and Distributed Computing 37 (1996), pp. 159{170.

This paper describes a high-level library (The Nearest Neighbor Tool, NNT) that has been used to parallelize operational weather prediction models. NNT is part of the Scalable Modeling System (SMS), developed at the Forecast Systems Laboratory (FSL). Programs written in NNT rely on SMS's run-time system and port between a wide range of computing platforms, performing well in multiprocessor systems. The authors show, using examples from operational weather models, how large Fortran 77 codes can be parallelized using NNT. The paper compares the ease of programmability of NNT and HPF. The paper also discusses optimizations like data movement overlap (in interprocessor communication and I/O operations), and the minimization of data exchanges through the use of redundant computations. The authors show that although HPF provides a simpler programming interface, NNT allows for program optimizations that increase performance considerably and still keeps a simple user interface. These optimizations have proven essential to run weather prediction models in real time, and HPF compilers should incorporate them in order to meet operational demands. Throughout the paper, the authors present performance results of weather models running on a network of workstations, the Intel Paragon, and the SGI Challenge. Finally, the authors study the cost of programming global address space architectures with NNT's local address space paradigm.

[58] R. Ryne and S. Habib: Beam Dynamics Calculations and Particle Tracking Using Massively Parallel Processors. Particle Accelerators 54 (1996), pp. 365{374.

The acceptance of MPPs has been slow for a variety of reasons. For example, some algorithms are not easily parallelizable. Also, in the past these machines were dicult to program. But in recent years the development of Fortran-like languages such as CM Fortran and HPF have made MPPs much easier to use. The paper describes how MPPs can be used for beam dynamics calculations and long term particle tracking.

[59] C. Schaller: Galoppierender Dinosaurier HPF. UNIX open 2 (1996), pp. 80{84.

Es wird die Programmiersprache HPF vorgestellt. Zunachst wird ein kurzer Abri uber die Entwicklung von Fortran bis zur aktuellen Version Fortran 90 gegeben. Danach wird ausfuhrlich auf die Erweiterungen von HPF eingegangen. Abschlieend werden HPF gute Chancen fur eine breite Akzeptanz gegenuber anderen Programmiermodellen fur Parallelrechner eingeraumt. This article introduces HPF. It describes the development of Fortran up to Fortran 90 and gives a detailed description of the HPF extensions. The author believes that HPF has a good chance to be widely used for programming parallel computers.

[60] T. Shindo, H. Iwashita, T. Doi, J. Hagiwara and S. Y. Kaneshiro: FLopS: A Parallelizing Compiler for Distributed Memory Parallel Computers. Transactions of the Information Processing Society of Japan 37 (1996), pp. 2030{ 2038. The paper presents the design and implementation of FLoPS { a parallelizing compiler for distributed memory parallel computers. FLoPS complies both HPF and VPP For-

68

Bibliography tran programs as its input and generates parallelized code for the AP1000. There are two important features implemented in FLoPS. The rst is machine independent optimizations based on VPP Fortran language features. The second is a code generation and optimization technique using direct remote data access mechanism supported by the AP1000 hardware. Based on experiments performed on the AP1000, the paper shows the eects of the new code generation and optimization techniques.

[61] A. Stewart: Reasoning about Data-Parallel Array Assignment. Journal of Parallel and Distributed Computing 27 (1995), pp. 79{85.

Three representations of data-parallel array assignment { generalized array assignment, Fortran 90 array assignment and HPF array assignment { are compared by deriving their axiomatic inference rules. The goals are (i) to identify shortcomings of representations of data-parallel array assignment in existing programming languages and to suggest improvements and (ii) to clarify the semantics of particular formulations of array assignment.

[62] E. De Sturler and V. Strumpen: Scienti c Programming with HPF: A Case Study Using the xHPF Compiler. Scienti c Programming 6 (1997), pp. 127{152.

The rst commercial HPF subset compilers have recently appeared. The authors report on their experiences with the xHPF compiler of Applied Parallel Research, version 1.2, for the Intel Paragon. At this stage, they do not expect very high performance from the HPF programs, even though performance will eventually be of paramount importance for the acceptance of HPF. Instead, their primary objective is to study how to convert large Fortran 77 programs to HPF such that the compiler generates reasonably ecient parallel code. They report on a case study that identi es several problems when parallelizing code with HPF; most of these problems aect current HPF compiler technology in general, although some are speci c for the xHPF compiler. They discuss their solutions from the perspective of the scienti c programmer, and present timing results on the Intel Paragon. The case study comprises three programs of dierent complexity with respect to parallelization. They use the dense matrix-matrix product to show that the distribution of arrays and the order of nested loops signi cantly in uence the performance of the parallel program. They use Gaussian elimination with partial pivoting to study the parallelization strategy of the compiler. There are various ways to structure this algorithm for a particular data distribution. This example shows how much eort may be demanded from the programmer to support the compiler in generating an ecient parallel implementation.

[63] J. Subhlok and G. Vondran: Optimal Mapping of Sequences of Data Parallel Tasks. SIGPLAN Notices 30 (1995), pp. 134{143. Many applications in a variety of domains including digital signal processing, image processing, and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. The authors formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The rst algorithm is based on dynamic programming and nds the optimal mapping of k tasks onto P processors in O(P 4 k2 ) time. The paper also presents a heuristic algorithm that is linear in the number of processors and establish with theoretical and practical results

Journals

69

that the solutions obtained are optimal in practical situations. The entire framework is implemented as an automatic mapping tool for the Fx parallelizing compiler for HPF. The authors present experimental results that demonstrate the importance of choosing a good mapping and show that the methods presented yield ecient mappings and predict optimal performance accurately.

[64] R. Thakur, A. Choudhary and J. Ramanujam: Ecient Algorithms for Array Redistribution. IEEE Transactions on Parallel and Distributed Systems 7 (1996), pp. 587{594.

Dynamic redistribution of arrays is required very often in programs on distributed presents ecient algorithms for redistribution between dierent cyclic(k) distributions, as de ned in HPF. First special optimized algorithms for a cyclic(x) to cyclic(y) redistribution are proposed when x is a multiple of y, or y is a multiple of x. The authors then propose two algorithms, called the GCD method and the LCM method, for the general cyclic(x) to cyclic(y) redistribution when there is no particular relation between x and y. They have implemented these algorithms on the Intel Touchstone Delta, and nd that they perform well for dierent array sizes and number of processors.

[65] A. Thirumalai and J. Ramanujam: Ecient Computation of Address Sequences in Data Parallel Programs Using Closed Forms for Basis Vectors. Journal of Parallel and Distributed Computing 38 (1996), pp. 188{203.

Arrays are mapped to processors through a two-step process-alignment followed by distribution-in data-parallel languages such as HPF. This process of mapping creates disjoint pieces of the array that are locally owned by each processor. An HPF compiler that generates code for array statements must compute the sequence of local memory addresses accessed by each processor and the sequence of sends and receives for a given processor to access nonlocal data. In this paper, the authors present an approach to the address sequence generation problem using the theory of integer lattices. The set of elements referenced can be generated by integer linear combinations of basis vectors. Unlike other work on this problem, the authors derive closed form expressions for the basis vectors as a function of the mapping of data. Using these basis vectors and exploiting the fact that there is a repeating pattern in the access sequence, they derive highly optimized code that generates the pattern at runtime. The code generated uses table-lookup of the pattern. Experimental results show that this approach is faster than other solutions to this problem.

[66] J. Vanderlip: Paci c Sierras Vast-HPF and Vast/77TOHPF. IEEE Parallel & Distributed Technology 2 (1994), p. 74. [67] J. L. Wagener: HPF. Computer Standards & Interfaces 18 (1996), pp. 371{ 377. The paper brie y describes basic parallel architectures and programming paradigms, and sketches the nature of HPF.

[68] A. Wakatani: A New Approach to Reduce Communication Latency for Distributed Memory Multicomputers. Transactions of the Information Processing Society of Japan 37 (1996), pp. 204{213. This paper describes a new technique to reduce communication latency for distributed memory multicomputers. In order to hide communication behind computation, the paper introduces an optimization scheme called `message strip-mining'. By using this scheme, the communication overhead is almost completely overlapped with the subsequent computation. It is also shown that this scheme is easily implemented with HPF-type compilers, such as the Fortran D compiler.

[69] A. Wakatani and M. Wolfe: Optimization of Array Redistribution for Dis-

70

Bibliography tributed Memory Multicomputers. Parallel Computing 21 (1995), pp. 1485{ 1490.

To enhance parallelism and to reduce communication, it is sometimes bene cial for an HPF programmer to change the distribution between phases of the algorithm. The authors introduce a new mapping strategy called the spiral mapping, that reduces the communication overhead of array redistribution. Redistribution using the spiral mapping exploits communication locality and reduces global communication con icts. They implemented redistribution using the standard linear mapping and the spiral mapping for two dimensional arrays; for 10241024 arrays, redistribution using the spiral mapping is 36 % faster than using the linear mapping on a 16 node Intel iPSC/860.

[70] D. W. Walker and S. W. Otto: Redistribution of Block-Cyclic Data Distributions Using MPI. Concurrency: Practice and Experience 8 (1996), pp. 707{728. Arrays that are distributed in a block-cyclic fashion are important for many applications in the computational sciences since they often lead to parallel algorithms with good load balancing properties. The authors consider the problem of redistributing such an array to a new block size. This operation is directly expressible in HPF and will arise in applications written in this language. Ecient message passing algorithms are given for the redistribution operation, expressed in the standardized message passing interface, MPI. The algorithms are analyzed and performance results from the IBM SP-1 and Intel Paragon are given and discussed. The results show that redistribution can be done in time comparable to other collective communication operations, such as broadcast and MPI-ALLTOALL.

[71] K. H. Warren: PDDP, a Data Parallel Programming Model. Scienti c Programming 5 (1996), pp. 319{327.

PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements HPF compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

[72] B. J. N. Wylie: Annai Tool Environment. Crosscuts (1996) special issue, pp. 6{9.

The joint CSCS/NEC Collaboration in Parallel Processing developed the Annai integrated parallel program engineering environment, combining several component tools: Parallelization Support Tool; Parallel Debugging Tool; Performance Monitor and Analyzer. All tools support \data-parallel" HPF and explicit programming with MPI as well as the use of standard Fortran and C. The integration of the dierent component tools (and their run-time libraries) in the Annai environment, with its common user and machine interfaces, is shown.

[73] H. P. Zima: High-Performance Languages for Parallel Computing. IEEE Computational Science and Engineering 3 (1996), pp. 63{65.

The paper discusses some high performance languages for parallel computing, including HPF, Vienna Fortran and Fortran D.

Publications in Proceedings [74] G. Agrawal, G. Edjlali, A. Sussman, J. Humphries and J. Saltz: Runtime Support for Programming in Adaptive Parallel Environments. Languages, Compilers and Run-Time Systems for Scalable Computers, Kluwer Academic Publishers, Boston 1996, pp. 241{252.

There has been an increasing trend towards using a network of non-dedicated workstations for parallel programming. In such an environment, the number of processors available for parallel programming may change during the execution of a program. The authors are developing compiler and runtime support for data parallel programming in such an adaptive environment. Executing data parallel programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. The authors have developed a runtime library to provide this support. They also brie y discuss how this runtime support can be used by compilers of HPF-like languages. They present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2.

[75] G. Agrawal and J. Saltz: Interprocedural Data Flow Based Optimizations for Compilation of Irregular Problems. Languages and Compilers for Parallel Computing, Springer-Verlag, Berlin 1996, pp. 465{479. The authors present the interprocedural optimizations required for compiling applications having irregular data access patterns, when coded in such data parallel languages. The authors have developed an interprocedural partial redundancy elimination algorithm for optimized placement of runtime preprocessing routine and collective communication routines inserted for managing communication in such codes. They also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines.

[76] F. Andre, P. Brezany, O. Cheron, W. Denissen, J.-L. Pazat and K. Sanjari: A New Compiler Technology for Handling HPF Data Parallel Constructs. Languages, Compilers and Run-time Systems for Scalable Computers, Kluwer Academic Publishers, Norwell 1996, pp. 279{282.

The paper concentrates on the description of the Parallelization Engine (PE) that is responsible for the restructuring of HPF programs and for the SPMD code generation. The methods used by the PE for processing HPF codes are presented: the implementation of HPF distributed arrays and the compilation techniques are discussed. According to the access functions used in distributed array references (regular or irregular case), the compiler applies two dierent techniques which are detailed. The work described in the paper was carried out as part of the European ESPRIT project EP 6516 PREPARE.

[77] F. Andre, M. Le Fur, Y. Maheo and J.-L. Pazat: The Pandore Data-Parallel Compiler and its Portable Runtime. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 176{83. This paper presents an environment for programming distributed memory computers using HPF. Emphasis is put on compilation techniques and distributed array management. Results are shown for some well known numerical algorithms.

[78] F. Andre, J.-L. Pazat and H. Thomas: Pandore: A System to Manage Data Distribution. International Conference on Supercomputing, ACM Press, New York 1990, pp. 380{388.

72

Bibliography

[79] F. Andre and J.-L. Pazat: A Multi-Threads Runtime for the Pandore DataParallel Compiler. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 994{996.

HPF and its precursors are based on a sequential language where data parallelism is expressed through the use of array assignments or parallel loops. HPF compilers generate code according to the SPMD model and the owner-computes rule: each processor executes only the statements that modify the data assigned to it by the distribution. This approach constitutes the basis of several compilers and is also applied in the Pandore compiler. For each parallel nested loop, the Pandore compiler produces a SPMD code that comprises a communication part and a computation part. It performs the restriction of iteration domains and the vectorization of messages. Thanks to the multithreads implementation the authors are able to eciently overlap the communication and the computation phases. The unnecessary waiting times that may occur when using a sequential process are avoided. The limited number of threads involved in the authors implementation prevents the system overhead that in many cases counterbalances the bene t of the parallelization. The authors think that these improvements will contribute to a better maturity for data parallel compilers and so automatic code generation for distributed memory parallel architectures will become a realistic means of programming these architectures for application users.

[80] M. Annaratone, D. B. Loveman and C. D. Oner: HPF on Workstation Farms. 8th Int. Parallel Processing Symp., IEEE Press, Los Alamitos 1994, pp. 664{669.

HPF debuggers and performance pro lers present interesting technical challenges, because what an HPF compiler has decomposed and multithreaded must be `reassembled' by the debugger and by the pro ler into a single-threaded, global address space representation. The degree of success of HPF will be dictated by the delivered performance. A message of the paper has been that system issues are the determinant factor in the performance department, and that there is a limit to what a compiler or a language can do. The statement is obvious, but it is imperative to understand that parallel systems are much less forgiving than single-threaded systems in this respect. Ignoring the balance between parallel architecture, language, and parallelizing compiler can have very serious consequences indeed.

[81] S. Bae and S. Ranka: PACK/UNPACK on Coarse-Grained Distributed Memory Parallel Machines. 10th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1996, pp. 320{324.

PACK/UNPACK are Fortran 90/HPF array construction functions which derive new arrays from existing arrays. The authors present algorithms for performing these operations on coarse-grained parallel machines. Their algorithms are relatively architecture independent and can be applied to arrays of arbitrary dimensions with arbitrary distribution along every dimension. Experimental results are presented on the CM-5.

[82] P. Boulet: Bouclettes: A Fortran Loop Parallelizer. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 784{791. To help the user to extract the parallelism in a given program, an automatic loop parallelizer has been developed: Bouclettes. Bouclettes has been written to validate some scheduling and mapping techniques that are mentioned in this paper. A Fortran 77 loop nest is rewritten into a HPF loop nest with explicit parallel loops (INDEPENDENT loops) and explicit data distribution (using DISTRIBUTE and ALIGN directives). The

Proceedings

73

dierent stages of parallelization (analysis, dependence analysis, scheduling, mapping and code generation) and their implementation issues are described in this paper.

[83] P. Boulet and T. Brandes: Evaluation of Automatic Parallelization Strategies for HPF Compilers. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 778{783.

In the data parallel programming style the user usually speci es the data parallelism explicitly so that the compiler can generate ecient code without enhanced analysis techniques. In some situations it is not possible to specify the parallelism explicitly or this might be not very convenient. This is especially true for loop nests with data dependences between the data of distributed dimensions. In the case of uniform loop nests there are scheduling, mapping and partitioning techniques available. Some dierent strategies have been considered and evaluated with existing HPF compilation systems. This paper gives some results about the performance and the bene ts of the dierent techniques and optimizations. The results are intended to direct the future development of data parallel compilers.

[84] B. Boulter: Performance Evaluation of HPF for Scienti c Computing. HighPerformance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 652{657. The author considers the application of HPF for solving scienti c computing problems. He shows how an ecient elliptic partial dierential equation solver can be stated in Fortran 90 and how it can be made to run in parallel with HPF's data distribution directives. Performance gures are then presented showing ecient parallel implementation on SMP and workstation cluster architectures.

[85] Z. Bozkus, L. F. Meadows, D. Miles, S. Nakamoto, V. Schuster and M. Young: Techniques for Compiling and Executing HPF Programs on SharedMemory and Distributed-Memory Parallel Systems. First International Workshop on Parallel Processing, Tata McGraw-Hill, New Delhi 1995, pp. 515{520. The diversity of memory system architectures that characterize today's parallel computing systems poses a challenge for the compilation and execution of portable HPF programs. This paper discusses these issues and describes the approaches selected for PGHPF (Portland Group HPF), the HPF compiler product of the Portland Group Inc. (PGI).

[86] Z. Bozkus, L. F. Meadows, S. Nakamoto, V. Schuster and M. Young: Compiling HPF. Seventh SIAM conference on parallel processing for scienti c computing, SIAM Press, Philadelphia 1995, pp. 704{709. [87] T. Brandes: Compiling HPF to Message Passing. 2nd European PVM Users Group Meeting, Hermes, Paris 1995, pp. 31{40.

Adaptor is a public domain HPF compilation system that translates data parallel

programs with their global view of data to programs that work on the local parts of the distributed data and exchange the other needed data via message passing. This paper describes the design and the implementation of the Adaptor system and discusses its portability by using a portable message passing interface. The advantages of using MPI are outlined. Furthermore, the combination of data parallel programs with ecient message passage programs is discussed that will allow the use of existing libraries.

[88] T. Brandes: Evaluation of HPF on Some Real Applications. High Performance Computing and Networking, International Conference and Exhibi-

74

Bibliography tion, Springer-Verlag, Berlin Heidelberg New York 1994, pp. 417{422.

Some real application programs have been investigated to show the usefulness of HPF. Therefore these programs have been transformed into the data parallel style and a prototype compilation system has been used to translate them to message-passing programs running on dierent platforms. The results verify that the generated message-passing programs scale and can achieve nearly the same performance as their hand-written counterparts. The experiences have also shown where the data parallel programming model has diculties and where additional optimizations are required.

[89] T. Brandes and F. Desprez: Implementing Pipelined Computation and Communication in an HPF Compiler. Second International Euro-Par Conference, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 459{462.

Many scienti c applications can bene t from pipelining computation and communication. The authors' aim is to provide compiler and runtime support for HPF applications that could bene t from these techniques. The paper describes the integration of a library for pipelined computations in the runtime system. Results on some application kernels are given.

[90] T. Brandes and D. Greco: Realization of an HPF Interface to ScaLapack with Redistributions. High-Performance Computing and Networking. International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 834{839.

The data parallel programming paradigm is not sucient for all kind of applications and message passing programs are more ecient in some cases. Therefore it is desirable to have an interface from HPF to existing parallel libraries based on the ecient message passing paradigm. In this paper the authors describe the realization of such an interface to ScaLapack, a library of high performance linear algebra routines based on message passing. The interface is realized in such a way that it applies redistribution routines from the HPF runtime system to its arguments if ScaLapack does not support a certain HPF distribution or if it can increase the performance. In fact, the high level speci cation and the powerful functionality of HPF will simplify the use of a parallel library dramatically and make its use much more convenient.

[91] P. Brezany, O. Cheron, K. Sanjari and E. van Konijnenburg: Processing Irregular Codes Containing Arrays with Multi-Dimensional Distributions by the Prepare HPF Compiler. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 526{531.

The automatic parallelization of irregular codes for massively parallel systems is a challenging problem of growing importance. The paper describes techniques for processing irregular codes that are being implemented in the Prepare HPF compiler. The authors' work makes two contributions: 1 it constructs a library called Parti+ that is used for the runtime analysis of codes with runtime data dependent accesses to arrays with multidimensional distributions, and 2 it develops the appropriate compiler support allowing optimizations.

[92] P. Brezany, T. A. Mueck and E. Schikuta: A Software Architecture for Massively Parallel Input-Output. Third International Workshop PARA '96, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 85{96. For an increasing number of data intensive scienti c applications, parallel I/O concepts are a major performance issue. Tackling this issue, the authors provide an outline of an input/output system designed for highly ecient, scalable and conveniently usable

Proceedings

75

parallel I/O on distributed memory systems. The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of HPF eorts. Speci cally, the design is presented in the context of the Vienna Fortran Compilation System.

[93] L. Brieger and G. Lecca: Parallel Multigrid Preconditioning for Finite Element Models of Groundwater Flow. 11th International Conference on Computational Methods in Water Resources, Computational Mechanics Publ., Southampton 1996, pp. 505{513.

This paper presents a multigrid preconditioner for the conjugate gradient kernel of a three dimensional nite element model of ow in porous media. This preconditioner consists of damped Jacobi relaxation and coarse grid correction. The authors have restricted themselves to regular tetrahedral meshes, for which grid coarsening and the incumbent restriction and prolongation operations are straightforward. With coarse grid corrections on even very few grid levels and a reduced number of relaxation steps per level, the conjugate gradient solver is enhanced by multigrid's convergence rate, which is roughly independent of mesh size. In addition, a subdomain multigrid strategy is presented as a parallel preconditioner for the global conjugate gradient solver on systems distributed over several processors of a parallel machine. Subdomains are de ned and distributed so as to achieve best processor load balance and reduce interprocessor communication time. The authors' parallel routines are coded using Fortran 90 and the HPF programming language, along with the MPI communication library, all emerging standards for parallel programming. They thus achieve portability of the parallel code for machines from clusters of workstations to parallel supercomputers.

[94] G. Cabitza, C. Nardone, C. Bagaini, A. Balzano, E. Bonomi, L. Brieger, M. Ennas, E. Garau, D. Greco, G. Lecca, E. Pieroni and C. Rossi: Experiences with HPF for Scienti c Applications. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 290{296. A wide spectrum of activities around HPF has been carried out and planned for the future at CRS4, ranging from basic tools to parallel libraries, kernel codes and time-consuming real applications. Examples are the HPParLib++ project (a run-time system for HPF), codes for seismic migration and modeling, a molecular dynamics code of liquid water and a shallow water transport code for the simulation of lagoon systems. The strategy behind this work is to advance both in basic computer science and in the development of end-user applications in order to bene t from the synergetic exchange of ideas and solutions.

[95] S. Chakrabarti, M. Gupta and J.-D. Choi: Global Communication Analysis and Optimization. ACM SIGPLAN'96 Conference on Programming Language Design and Implementation, ACM Press, New York 1996, pp. 68{78.

Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. The authors' algorithm is distinct from existing approaches in that rather than handling loop-nests and array references one by one, it considers all communication in a procedure and their interactions under dierent placements before making a nal decision on the placement of any communication. It exploits the exibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buers, all in a uni ed framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive data ow analysis on array sections can generate too many messages or cache and buer contention. The algorithm

76

Bibliography has been implemented in the IBM pHPF compiler for HPF. During compilation, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. The authors present performance results for the IBM SP2 and a network of Sparc workstations connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.

[96] B. M. Chapman, P. Mehrotra and H. P. Zima: High-Level Languages for Parallel Scienti c Computing. 22nd Seminar on Current Trends in Theory and Practice of Informatics, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 292{314.

In this paper, the authors outline the state-of-the-art in the distribution of data across the processors of a machine and provide a detailed description of HPF. A signi cant weakness of current HPF is its lack of support for many advanced applications, which require irregular data distributions, dynamic load balancing, or task parallelism. The authors introduce HPF+, an extension of HPF based on Vienna Fortran, that addresses these problems and provides the required functionality.

[97] B. M. Chapman, P. Mehrotra and H. P. Zima: Why HPF is Not Useful for Advanced Numerical Applications { Directions for Future Development. Massive Parallelism, Hardware, Software and Applications, World Sci. Publishing, Singapore River Edge London 1994, pp. 321{336.

HPF was de ned with the objective of providing support for the development of ecient data parallel programs for distributed-memory architectures. The authors believe that the current version of the language has failed to reach this goal to a sucient degree. While the basic distribution functions oered by the language can support regular numerical algorithms, advanced algorithms such as multiblock and particle-in-cell codes cannot be expressed adequately. Furthermore, the basic language contains a number of

aws and lacks sucient generality. Finally, there is no mechanism to support task parallelism and to integrate it with the data parallel paradigm. This paper discusses some of the data distribution and alignment issues, outlines possible future paths of development, and points out a number of problems with the basic language.

[98] S. Chatterjee, J. R. Gilbert, R. Schreiber and T. J. She er: Automatic Distribution in HPF. Second Workshop on Environments and Tools for Parallel Scienti c Computing, SIAM Press, Philadelphia 1994, pp. 11{18.

One of the major decisions in compiling data-parallel programs for distributed-memory parallel computers is the mapping of data and computation to the multiple processors of the machine. A good mapping minimizes program completion time by balancing the opposing needs of parallelism and communication: spreading the data and work over many processors increases available parallelism, but also increases communication time. Most compilation systems, e. g., HPF, divide the data mapping problem into two phases: alignment, in which the relative positions of arrays are determined, and distribution, in which arrays are partitioned and mapped to a processor grid. The authors' recent work in compiler and run-time mapping of the array data in a distributed-memory, data-parallel program has concentrated on choosing the parameters of the distribution. They show how a compiler may choose between cyclic, block and block-cyclic distributions and also decide how to allocate the available processors to array axes.

[99] C. Clemencon, K. M. Decker, V. R. Deshpande, A. Endo, J. Fritscher, P. A. R. Lorenzo, N. R. Masuda, A. Muller, R. Ruhl, W. Sawyer, B. J. N. Wylie and F. Zimmermann: Tools-Supported HPF and MPI Parallelization of the NAS Parallel Benchmarks. Sixth Symposium on the Frontiers of Massively Parallel Computing, IEEE Press, Los Alamitos 1996, pp. 309{

Proceedings

77

318.

HPF compilers and communication libraries with MPI are becoming widely available, easing the development of portable parallel applications. The Annai tool environment supports programming, debugging and tuning of both HPF- and MPI-based applications. Considering code development time to be as important as nal performance, it is addressed how sequential versions of the familiar NAS parallel benchmark kernels can be expediently parallelized with appropriate tool support. While automatic parallelization of scienti c applications written in traditional sequential languages remains largely impractical, Annai provides users with high-level language extensions and integrated program engineering support tools. Respectable performance and scalability in most cases are obtained with this straightforward parallelization strategy on the NEC Cenju-3 distributed-memory parallel processor even without recourse to platform-speci c optimizations or major program transformations.

[100] C. Clemencon, K. M. Decker, V. R. Deshpande, A. Endo, J. Fritscher, P. A. R. Lorenzo, N. R. Masuda, A. Muller, R. Ruhl, W. Sawyer, B. J. N. Wylie and F. Zimmermann: Tool-Supported Parallel Application Development. IEEE Fifteenth Annual International Phoenix Conference on Computers and Communications, IEEE Press, Piscataway 1996, pp. 294{302.

The goal of this paper is to ease the parallelization of applications on distributed-memory parallel processors. Part of the authors' team is implementing parallel kernels common to industrially signi cant applications using HPF and MPI. They are assisted in this activity by a second group developing an integrated tool environment, Annai, consisting of a parallelization support tool, a debugger, and a performance monitor and analyzer. These two groups interact closely, with application developers de ning requirements and evaluating prototypes of the tool environment. The Annai environment is unique in providing a suite of general-purpose tools for scienti c application development, encompassing comprehensive support for programming irregular problems and the debugging and tuning of HPF and/or MPI-based parallel programs. Addressing application developer requirements, while making leading-edge technology available in a genuinely-usable form, has resulted in the timely provision of powerful parallel programming tools. The paper describes goals, achievements and perspectives of the project, illustrating with speci c case studies of three application kernels how the tool environment assists in the parallelization process: development eort and resulting performance are discussed.

[101] C. Clemencon, A. Endo, J. Fritscher, A. Muller and B. J. N. Wylie: Annai Scalable Run-Time Support for Interactive Debugging and Performance Analysis of Large-Scale Parallel Programs. Second International Euro-Par Conference, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 64{69.

The Annai tool environment helps exploit distributed-memory parallel computers with HPF and/or explicit communication, using MPI as a portable machine interface. Integration within a uni ed environment allows the component parallelization and compilation support, debugging and performance tools to synergetically use common facilities. Additionally, massive quantities of partitioned data and execution information, from largescale applications on multiple processors, needs to be eectively managed and presented during program engineering. This has been achieved by scalable design and cooperative integration of tool component run-time libraries, supporting exible interactive debugging and performance analysis.

[102] C. Clemencon, A. Endo, J. Fritscher, A. Muller, R. Ruhl and B. J. N. Wylie: The Annai Environment for Portable Distributed Parallel Programming. Twenty-Eighth Hawaii International Conference on System Sciences, IEEE Press, Los Alamitos 1995, pp. 242{251.

78

Bibliography The authors are developing a portable integrated tool environment, called Annai, for distributed memory parallel processors (DMPPs), as part of the joint CSCS-ETH/NEC collaboration in parallel processing. As they design and implement a sequence of tool prototypes, these are used and evaluated by a team of application developers. This close interaction allows rapid feedback, as requests from the application developers for functionality enhancements can be promptly considered for inclusion in future tool designs and prototypes. The tool environment consists of an extended HPF compiler, a performance monitor and analyzer and a source-level debugger for distributed programs, all sharing a common user interface. MPI serves as the authors' low-level machine interface and allows portability independent of the target hardware architecture. The key features of the tool environment comprise HPF language extensions to allow dynamic data distributions, and support for both high-level data-parallel and low-level message-passing programming paradigms by the interactive debugger and the interactive performance monitor and analyzer. The language extensions are particularly useful for the ecient parallelization of unstructured problems, and they are fully supported by both the debugger and the performance analyzer. The authors outline the structure of the tool environment, and summarize the functionality of its components. They also demonstrate an example user session and present performance results on two DMPPs.

[103] C. Clemencon, J. Fritscher, R. Ruhl, B. J. N. Wylie, K. M. Decker, N. R. Masuda, W. Sawyer, F. Zimmermann, A. Endo, A. Muller and E. De Sturler: Application-Driven Development of an Integrated Tool Environment for Distributed-Memory Parallel Processors. First International Workshop on Parallel Processing, Tata McGraw-Hill, New Delhi 1995, pp. 110{ 116.

The Joint CSCS-ETHINEC Collaboration in Parallel Processing comprises the development of an integrated tool environment together with applications and algorithms for distributed-memory parallel processors (DMPPs). Tool and application developers interact closely: the requirements of the tools are de ned by the needs of the application developers, and once an application requirement becomes an integral part of the tool environment, the tools ease parallelization of similar applications and whole application classes. Additional features of the project are the use of a standardized DMPP highlevel programming language (HPF) and low-level MPI. The tool environment integrates parallelization support, a parallel debugger and a performance monitor and analyzer. Applications already investigated include unstructured problems. This paper summarizes the tool and application development eorts and shows preliminary performance results of three applications eectively parallelized on two DMPP platforms with the assistance of the authors' tool environment.

[104] F. Coelho: Compilation of I/O Communications for HPF. The Fifth Symposium on the Frontiers of Massively Parallel Computation, IEEE Press, Los Alamitos 1994, pp. 102{109.

This paper focuses on the compilation of I/O communications for HPF. Data must be eciently collected to or updated from I/O nodes with vectorized messages, for any possible mapping. The problem is solved using standard polyhedron scanning techniques. The code generation issues to handle the dierent cases are addressed. Then the method is improved and extended to parallel I/Os. This work suggests new HPF directives for parallel I/Os.

[105] F. Coelho: Discussing HPF Design Issues. Second International Euro-Par Conference, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 571{ 578. This paper provides comprehensive criteria for analyzing HPF features: adequacy to ap-

Proceedings

79

plications, aesthetics and soundness in a language, and implementability. Some features already in HPF or being currently discussed are analyzed according to these criteria. They are shown as not balanced. Thus new or improved features are suggested to solve the outlined de ciencies: namely a scope provider, multiple mapping declarations and simpler remappings.

[106] F. Coelho: Experiments with HPF Compilation for a Network of Workstations. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1994, pp. 423{428.

An early implementation of a prototype HPF optimizing compiler is described. Experiments of PVM 3-based generated code on a network of workstations are analyzed and discussed. It is shown that if such systems can provide very good speedups at low cost, they cannot allow scalable performance without speci c communication hardware. Other early implementations of HPF compilers from academic and commercial groups are presented and compared to the authors' work.

[107] F. Coelho, C. Germain and J.-L. Pazat: State of the Art in Compiling HPF. Data Parallel Programming Model; Foundations, HPF Realization, and Scienti c Applications, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 104{133. Proposing to the user a nice programming model based on the data-parallel paradigm is one thing. Running the resulting applications very fast is the next issue for a language aiming at high performance on massively parallel machines. This paper discusses the issues involved in HPF compilation and presents optimization techniques, targeting the message-passing SPMD programming model of distributed memory MIMD architectures.

[108] J. Cownie, A. Dunlop, S. Hellberg, A. J. G. Hey and D. Pritchard: Portable Parallel Programming Environments { The ESPRIT PPPE Project. Massively Parallel Processing Applications and Development; EUROSIM Conference, Elsevier, Amsterdam 1994, pp. 135{142.

The ESPRIT PPPE project aims to produce a portable, parallel programming environment available across many high performance and parallel computers such that these systems become available for mainstream applications development. Very broadly there are two sides to the environment: oering run-time and the cross-development functionality. Each side has a portability layer upon which all the components are based: PCTE for the cross-development environment and Parmacs/MPI for the run-time environment. The initial tools being developed include a parallel debugger, a performance monitor and analyzer and an HPF mapper and debugger. This paper describes the PPPE programming environment and tools.

[109] K. Dincer and G. Fox: An HPF Programming Environment on the WorldWide Virtual Machine. ISCA International Conference, Int. Soc. Comput. & Their Appl.-ISCA, Raleigh 1996, pp. 193{198.

In today's high-performance computing arena there is a strong trend toward building virtual computers from heterogeneous resources on a network. The authors describe their experiences in building a parallel/distributed virtual machine called the World-Wide Virtual Machine (WWVM) using Web and high-performance computing and communications technologies. They ported their HPF compilation system onto this virtual platform to provide a Web-based collaboratory HPF programming environment.

[110] K. Dincer, G. Fox and K. A. Hawick: HPF and possible extensions to support conjugate gradient algorithms. Fifth IEEE International Symposium

80

Bibliography on High Performance Distributed Computing, IEEE Press, Los Alamitos 1996, pp. 69{77. The authors evaluate the HPF language for the compact expression and ecient implementation of conjugate-gradient iterative matrix-solvers on high-performance computing platforms. They discuss the use of intrinsic functions, data distribution directives and explicitly parallel constructs to optimize performance by minimizing communications requirements in a portable manner. They focus on implementations using the existing HPF de nitions but also discuss issues arising that may in uence a revised de nition for HPF 2. Some of the codes discussed are available on the World Wide Web at http://www.npac.syr.edu/hpfa/, along with other educational and discussion material related to applications in HPF.

[111] P. A. Dinda and D. R. O'Hallaron: The Performance Impact of Address Relation Caching. Languages, Compilers and Run-time Systems for Scalable Computers; Third Workshop, Kluwer Academic Publishers, Boston 1996, pp. 213{26. An important portion of end-to-end latency in data transfer is spent in address computation, determining a relation between sender and receiver addresses. In deposit model communication, this computation happens only on the sender and some of its results are embedded in the message. Conventionally, address computation takes place on-line, as the message is assembled. If the amount of address computation is signi cant, and the communication is repeated, it may make sense to remove address computation from the critical path by caching its results. However, assembling a message using the cache uses additional memory bandwidth. The authors present a ne grain analytic model for simple address relation caching in deposit model communication. The model predicts how many times a communication must be repeated in order for the average end-to-end latency of an implementation which caches to break even with that of an implementation which doesn't cache. The model also predicts speedup and those regimes where a caching implementation never breaks even. The model shows that the eectiveness of caching depends on CPU speed, memory bandwidth and the complexity of the address computation. The authors verify the model on the iWarp and the Paragon and nd that, for both machines, caching can improve performance even when address computation is quite simple (one instruction per data word on the iWarp and 16 instructions per data word on the Paragon). To show the practical bene t of address relation caching, the authors examine the performance of an HPF distributed array communication library that can be con gured to use caching. In some cases, caching can double the performance of the library.

[112] B. D. De Dinechin, G. Elsesser, G. Fischer, B. H. Johnson, T. MacDonald, R. W. Numrich, and J. L. Steidel: De nition of the F?? Extension to Fortran 90. 10th International Workshop on Languages and Compilers for Parallel Computing, Springer-Verlag, Berlin Heidelberg New York, to appear.

F?? is an extended version of Fortran 90 which allows to express explicit parallelism. It is built on the SPMD model of parallel computation and uses a single syntactic extension to the Fortran DIMENSION attribute to express data decomposition. The programmer controls work through intrinsic synchronization procedures. Being closer to message-passing models than to directive-based models for parallel programming, this approach gives programmers more explicit control over parallelism and therefore should achieve a high level of parallel eciency.

[113] G. Edjlali, A. Sussman and J. Saltz: Interoperability of Data Parallel Runtime Libraries. 11th International Parallel Processing Symposium, IEEE

Proceedings

81

Press, Los Alamitos 1997, pp. 451{459.

This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote sensing image database applications. An application can consist of one program or multiple programs that use dierent libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using HPF, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pC++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries eciently and that MetaChaos provides eective support for complex applications.

[114] I. Foster, D. R. Kohr Jr., R. Krishnaiyer and A. Choudhary: Communicating Data-Parallel Tasks: An MPI Library for HPF. 3rd International Conference on High Performance Computing, IEEE Press, Los Alamitos 1996, pp. 433{438. HPF does not support task parallelism or heterogeneous computing adequately. This paper presents a summary of the authors' work on a library-based approach to support task parallelism, using MPI as a coordination layer for HPF. This library enables a wide variety of applications, such as multidisciplinary simulations and pipeline computations, to take advantage of combined task and data parallelism. An HPF binding for MPI raises several interface and communication issues. The authors discuss these issues and describe their implementation of an HPF/MPI library that operates with a commercial HPF compiler. They also evaluate the performance of their library using a synthetic communication benchmark and a multiblock application.

[115] I. Foster, D. R. Kohr Jr. and R. Krishnaiyer: MPI As a Coordination Layer for Communicating HPF Tasks. Second MPI Developer's Conference, IEEE Press, Los Alamitos 1996, pp. 68{78.

Data-parallel languages such as HPF present a simple execution model in which a single thread of control performs high-level operations on distributed arrays. These languages can greatly ease the development of parallel programs. Yet there are large classes of applications for which a mixture of task and data parallelism is most appropriate. Such applications can be structured as collections of data-parallel tasks that communicate by using explicit message passing. Because MPI de nes standardized, familiar mechanisms for this communication model, the authors propose that HPF tasks communicate by making calls to a coordination library that provides an HPF binding for MPI. The semantics of a communication interface for sequential languages can be ambiguous when the interface is invoked from a parallel language; it is shown how these ambiguities can be resolved by describing one possible HPF binding for MPI. The authors then present the design of a library that implements this binding, discuss issues that in uenced their design decisions, and evaluate the performance of a prototype HPF/MPI library using a communications microbenchmark and application kernel. Finally, the authors discuss how MPI features might be incorporated into their design framework.

[116] P. Fritzson, R. Wismuller, O. Hansen, J. Sala and P. Skov: A Parallel Debugger with Support for Distributed Arrays, Multiple Executables and Dynamic Processes. Compiler Construction; 6th International Conference, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 341{355. This paper present the parallel debugger Detop with special emphasis on new support for debugging of programs with distributed data structures such as arrays that have

82

Bibliography been partitioned over a number of processors. The new array visualizer within Detop supports transparent browsing and visualization of distributed arrays which occur in languages such as HPF. Visualization of sparse arrays is supported through an array mapper facility, as well as transparent visualization of arrays which have been partitioned by hand for applications in C or Fortran 77. Color coding makes the visualization more expressive and easier to read. Detop provides a graphical user interface that is simple to use even for inexperienced users and supports not only static data parallel programs, but also dynamic programs and parallel applications based on functional decomposition. The combination of support for applications that include dynamic process creation, multiple executables, processes and threads, and distributed data structures, makes Detop rather unique among parallel debuggers. Detop has been implemented for Parsytec PowerPC based multicomputers with Sparcstation front-ends. Ongoing eorts include portable versions of Detop for PVM and MPI run-time environments.

[117] M. Gupta: On Privatization of Variables for Data-Parallel Execution. 11th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1997, pp. 533{541. Privatization of data is an important technique that has been used by compilers to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the bene ts of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. The author shows that there are numerous alternatives available for mapping privatized variables and the choice of mapping can signi cantly aect the performance of the program. He presents an algorithm that attempts to preserve parallelism and minimize communication overheads. He also introduces the concept of partial privatization of arrays that combines data partitioning and privatization, and enables ecient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, it is shown how the ideas of privatization apply to the execution of control ow statements as well. An implementation of these ideas in the pHPF prototype compiler for HPF on the IBM SP2 machine has shown impressive results.

[118] M. Gupta, S. Midki, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching and T. Ngo: HPF Compiler for the IBM SP2. ACM/IEEE Supercomputing Conference, IEEE Press, Los Alamitos 1995, pp. 1944{ 1984.

The authors describe pHPF, a research prototype HPF compiler for the IBM SP series of parallel machines. The compiler accepts as input Fortran 90 and Fortran 77 programs, augmented with HPF directives; sequential loops are automatically parallelized. The compiler supports symbolic analysis of expressions. This allows parameters such as the number of processors to be unknown at compile-time without signi cantly aecting performance. Communication schedules and computation guards are generated in a parameterized form at compile-time. Several novel optimizations and improved versions of wellknown optimizations have been implemented in pHPF to exploit parallelism and reduce communication costs. These optimizations include elimination of redundant communication using data-availability analysis; using collective communication; new techniques for mapping scalar variables; coarse-grain wavefronting; and communication reduction in multi-dimensional shift communications. The authors present experimental results for some well-known benchmark routines. The results show the eectiveness of the compiler in generating ecient code for HPF programs.

[119] M. Gupta and E. Schonberg: Static Analysis to Reduce Synchronization Costs in Data-Parallel Programs. 23rd ACM SIGPLAN-SIGACT Sympo-

Proceedings

83

sium on Principles of Programming Languages, ACM Press, New York 1996, pp. 322{332.

For a program with sucient parallelism, reducing synchronization costs is one of the most important objectives for achieving ecient execution on any parallel machine. The paper presents a novel methodology for reducing synchronization costs of programs compiled for SPMD execution. This methodology combines data ow analysis with communication analysis to determine the ordering between production and consumption of data on dierent processors, which helps in identifying redundant synchronization. The resulting framework is more powerful than any that have been previously presented, as it provides the rst algorithm that can eliminate synchronization messages even from computations that need communication. The authors show that several commonly occurring computation patterns such as reductions and stencil computations with reciprocal producer consumer relationship between processors lend themselves well to this optimization, an observation that is con rmed by an examination of some HPF benchmark programs. The framework also recognizes situations where the synchronization needs for multiple data transfers can be satis ed by a single synchronization message. This analysis, while applicable to all shared memory machines as well, is especially useful for those with a exible cache coherence protocol, as it identi es ecient ways of moving data directly from producers to consumers, often without any extra synchronization.

[120] S. T. Hackstadt and A. D. Malony: Distributed Array Query and Visualization for HPF. Second International Euro-Par Conference, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 55{63. This paper describes the design and implementation of the Distributed Array Query and Visualization (DAQV) system for HPF, a project sponsored by the Parallel Tools Consortium. DAQV's implementation utilizes the HPF language, compiler, and runtime system to address the general problem of providing high-level access to distributed data structures. DAQV supports a framework in which visualization and analysis clients connect to a distributed array server (i. e., the HPF application with DAQV control) for program-level access to array values. Implementing key components of DAQV in HPF itself has led to a robust and portable solution.

[121] K. A. Hawick and G. Fox: Exploiting HPF for Computational Fluid Dynamics. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 413{419.

The authors discuss HPF as an aid to software engineering and as a tool for exploiting high performance computing systems for computational uid dynamics applications. They discuss the use of intrinsic functions, data distribution directives and explicitly parallel constructs to optimize performance by minimizing communications requirements in a portable manner. In particular the authors use an implicit method such as the ADI algorithm to illustrate the major issues. They focus on regular mesh problems, since these can be eciently represented by the existing HPF de nition, but also discuss issues arising from the use of irregular meshes that are in uencing a revised de nition for HPF 2. Some of the codes discussed are available on the Web at http://www.npac.syr.edu/hpfa/ along with other educational and discussion material related to applications in HPF.

[122] S. Hiranandani, K. Kennedy, J. Mellor-Crummey and A. Sethi: Compilation Techniques for Block-Cyclic Distributions. International Conference on Supercomputing, ACM Press, New York 1994, pp. 392{403. Compilers for data-parallel languages such as Fortran D and HPF use data alignment and distribution speci cations as the basis for translating programs for execution on MIMD distributed-memory machines. This paper describes techniques for generating ecient

84

Bibliography code for programs that use block-cyclic distributions. These techniques can be applied to programs with symbolic loop bounds, symbolic array dimensions, and loops with nonunit strides. This paper presents algorithms for computing the data elements that need to be communicated among processors both for loops with unit and non-unit strides, a linear-time algorithm for computing the memory access sequence for loops with nonunit strides, and experimental results for a hand-compiled test case using block-cyclic distributions.

[123] Y. Hu, S. L. Johnsson, D. Kehagias and N. Shalaby: DPF: A Data Parallel Fortran Benchmark Suite. 11th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1997, pp. 219{226. This paper presents the data parallel Fortran (DPF) benchmark suite, a set of data parallel Fortran codes for evaluating data parallel compilers appropriate for any target parallel architecture, with shared or distributed memory. The codes are provided in basic, optimized and several library versions. The functionality of the benchmarks cover collective communication functions, scienti c software library functions, and application kernels that re ect the computational structure and communication patterns in uid dynamic simulations, fundamental physics and molecular studies in chemistry or biology. The DPF benchmark suite assumes the language model of HPF, and provides performance evaluation metrics of busy and elapsed times and op rates, op count, memory usage, communication patterns, focal memory access, and arithmetic eciency as well as operation and communication counts per iteration. An instance of the benchmark suite was fully implemented in CM Fortran and tested on the CM-5.

[124] G.-H. Hwang, J. K. Lee and D.-C. R. Ju: Array Operation Synthesis to Optimize HPF Programs. International Conference on Parallel Processing, IEEE Press, Los Alamitos 1996, pp. 1{8.

The synthesis of consecutive array operations or array expressions into a composite access function of the source arrays at compile time can reduce the redundant data movement, temporary storage usage and loop synchronization overhead on at shared-memory parallel machines with uniform memory accesses. However, it remains open how the synthesis scheme can be incorporated into optimizing HPF programs on distributed-memory machines by taking into account communication costs. In this paper, the authors propose solutions to address this open problem. They rst apply the array operation synthesis (developed for Fortran 90 programs) to HPF programs and demonstrate its performance bene ts on distributed-memory machines. In addition, to prevent a situation they call the \synthesis performance anomaly", they derive a cost model and present an optimal solution based on this model to guide the array synthesis process on distributed-memory machines. The authors also show that the optimal problem is NP-hard. Therefore, they develop a practical heuristic algorithm for compilers to devise a synthesis strategy on distributed-memory machines with HPF programs. Experimental results show a significant performance improvement over the base codes for HPF code fragments from real applications on a DEC Alpha processor farm by incorporating the proposed optimizations.

[125] K. Ishizaki and H. Komatsu: A Loop Parallelization Algorithm for HPF Compilers. Languages and Compilers for Parallel Computing, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 176{190.

This paper presents a formalized loop parallelization algorithm for eectively extracting parallelism from data to be allocated to processors with array decomposition directives, in languages such as HPF. The authors de ne a communication dependence vector that shows data dependence among processors, and use it in their algorithm to formalize and unify the detection of vector prefetch communication and vector pipeline communication for loop parallelization. The paper also presents a method, based an their algorithm for

Proceedings

85

generating vector communications. The authors implemented the algorithm in their HPF compiler and carried out experiments with two applications on an IBM RS/6000 scalable POWERparallel system.

[126] T. Kamachi, K. Kusano, K. Suehiro, Y. Seo, M. Tamura and S. Sakon: Generating Realignment-Based Communication for HPF Programs. 10th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1996, pp. 364{371. This paper presents methods for generating communication on compiling HPF programs for distributed-memory machines. The authors introduce the concept of an iteration template corresponding to an iteration space. Their HPF compiler performs the loop iteration mapping through the two-level mapping of the iteration template in the same way as the data mapping is performed in HPF. Making use of this uni ed mapping model of the data and the loops, communication for nonlocal accesses is handled based on data-realignment between the user-declared alignment and the optimal alignment, which ensures that only local accesses occur inside the loop. This strategy results in eective means of dealing with communication for arrays with unde ned mapping, a simple manner for generating communication, and high portability of the HPF compiler. Experimental results on the NEC Cenju-3 distributed-memory machine demonstrate the eectiveness of the authors' approach: the execution time of the compiler-generated program was within 10 % of that of the hand-parallelized program.

[127] S. Y. Kaneshiro and T. Shindo: Pro ling Optimized Code: A Pro ling System for an HPF Compiler. 10th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1996, pp. 469{473. To achieve high performance, an HPF compiler must optimize the code, which may result in a signi cant change to the original code structure. Because the performances of the optimized and non-optimized codes dier, pro ling HPF programs with compiler optimizations must be supported. This requires additional support in the compiler to report the pro le data with respect to the original code as opposed to the transformed code. The authors developed a pro ling system which maintains a history of code transformations, maps the original code to the optimized code, and reports pro le data based on the original code. A version of the pro ling system was successfully implemented for the Fujitsu HPF compiler and tested for several benchmarks on the Fujitsu AP1000.

[128] S. D. Kaushik, C.-H. Huang and P. Sadayappan: Compiling Array Statements for Ecient Execution on Distributed-Memory Machines: Two-Level Mappings. Languages and Compilers for Parallel Computing, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 209{223.

In compiling HPF array statements for distributed-memory machines, ecient enumeration of local index sets and communication sets is important. The virtual processor approach, among several other methods, has been proposed for ecient enumeration of these index sets. In this paper, using simple mathematical properties of regular sections, the authors extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping supported by HPF. Performance results on the Cray T3D are presented to demonstrate the ecacy of the extensions and identify various tradeos associated with the proposed method.

[129] K. Kennedy and U. J. Kremer: Automatic Data Layout for HPF. ACM/ IEEE Supercomputing Conference, IEEE Press, Los Alamitos 1995, pp. 2088{2109.

A framework is described for automatic data layout for use in a data layout assistant tool

86

Bibliography for data-parallel languages such as HPF. The envisioned tool can be used to generate a rst data layout for a sequential Fortran program without data layout statements, or to extend a partially speci ed data layout in a HPF program to a totally speci ed data layout. The framework can use techniques that may be too computationally expensive to be included in a compiler. A prototype data layout assistant tool based on the framework has been implemented as part of the D system currently under development at Rice University.

[130] D. Kimelman, P. Mittal, E. Schonberg, P. F. Sweeney, K.-Y. Wang and D. Zernik: Visualizing the Execution of HPF Programs. 9th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1995, pp. 750{ 759.

Although the data placement programming model frees programmers from writing low level communication code, it also makes it harder for them to anticipate the performance of the code they write. In particular, the communication that is required by a given data distribution, and its impact on performance, may not be obvious from the program source. Many existing performance visualization systems could be used in attempts to investigate the performance of data placement programs; however, no published system would provide views that relate communication activity to the abstraction that is the primary focus of data placement programming: data distribution. The paper introduces new high level views of HPF program behavior that show communication activity in the context of the array distribution from which the compiler derived the communication. With these views, programmers can see not only when and where communication takes place, but also what subarrays are moving between which processors by virtue of the communication. These views provide programmers with feedback in terms of the programming model-the programmers' frame of reference. The programming model as the level programmers most readily comprehend, and it is the level at which they must improve program performance by altering data distribution. Thus, these views provide a basis for eective tuning of data placement programs.

[131] M.-S. Koo, S.-S. Park, H.-G. Yook and M.-S. Park: A New Transformation Method to Generate Optimized DO Loops from FORALL Construct. Second AIZU International Symposium on Parallel Algorithms/Architecture Synthesis, IEEE Press, Los Alamitos 1997, pp. 240{247. Most HPF compilers are source-to-source translators because they can be easily implemented. However, these source-to-source compilers produce signi cant amount of ineffective codes. In particular, the FORALL construct is converted into several DO loops, so its loop overhead is increased. Therefore, the authors propose some techniques for converting FORALL construct to optimized DO loop. For this, they de ne and use relation distance vector which can represent both data dependence information and ow information. Then they evaluate and analyze execution time for the codes converted by their method and by Paradigm method.

[132] M.-S. Koo, S.-S. Park, H.-G. Yook and M.-S. Park: A Transformation Method to Reduce Loop Overhead in HPF Compiler. HPC Asia '97, IEEE Press, Los Alamitos 1997, pp. 306{311.

Most HPF compilers are source-to-source translators because they can be implemented easily. Also, they currently generate signi cant amount of ineective codes. In particular, the FORALL construct is converted into several DO loops, so loop overhead, which exacerbates program performance of these codes is substantially increased. Therefore HPF source-to-source compilers require some techniques to produce an optimized DO loop from the FORALL construct. The authors propose a transformation method to reduce loop overhead for the FORALL construct. For this, they de ne and use a relation

Proceedings

87

distance vector which can represent both data dependence information and ow information. Then they evaluate and analyze the execution time for the codes converted by their method and by the Paradigm method in aspects of array size and dependence distance.

[133] R. Koppler, S. Grabner and J. Volkert: Graphical Support for Data Distribution in SPMD Parallelization Environments. IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, IEEE Press, Los Alamitos 1996, pp. 21{28.

The paper introduces a programming environment for HPF-like languages with emphasis on graphical support for data distribution. A novel component of this environment is a mapping design and visualization tool. The tool provides visualization of HPF array objects such as data arrays and logical processor arrays and creates a number of diagrams based on information that is gathered from other components of the environment such as the compiler or a debugger. The diagrams relate to crucial issues such as load distribution and communication. Furthermore the authors show how their environment facilitates seamless integration of additional components.

[134] E. A. Kornkven and L. V. Kale: Ecient Implementation of HPF via Adaptive Scheduling { An Overview. First International Workshop on Parallel Processing, Tata McGraw-Hill, New Delhi 1995, pp. 136{141. The authors have developed a compiler for a subset of HPF. They have shown that generating message-driven code provides an opportunity for improved eciency in the presence of communication. By utilizing a notation called Dagger, they are able to schedule work eciently without relying on complicated and unreliable compile-time approaches. This paper gives an overview of the authors' approach and reports on the project's status.

[135] W. Krotz-Vogel and H.-C. Hoppe: The PALLAS Portable Parallel Programming Environment. Second International Euro-Par Conference, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 899{903.

Pallas is an independent German software company specializing in high performance computing. Apart from consulting and training services, a programming environment for developing, porting and tuning of parallel applications is available. It consists of: Vampir (versatile performance analysis of MPI programs); TotalView (parallel debugger for MPI/Parmacs/PVM programs); and HPF The Portland Group). The programming environment presented provides quality and functionality across a wide range of parallel platforms from workstations to MPP systems.

[136] P. A. R. Lorenzo, A. Muller, Y. Murakami and B. J. N. Wylie: HPF Interfacing to ScaLapack. Third International Workshop PARA '96, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 457{466.

The ScaLapack numerical library for MIMD distributed-memory parallel computers comprises highly ecient and robust parallel dense linear algebra routines, implemented using explicit message passing. HPF was developed as an alternative to the messagepassing paradigm; it extends Fortran 90 with directives to automatically distribute data and to parallelize loops, such that all required inter-processor communication is generated by the compiler. While HPF can ease parallelization of many applications, it still does not make sense to re-program existing libraries like ScaLapack. Rather, programmers should have the opportunity to use them from within HPF programs. HPF interfaces to routines in the ScaLapack library are presented which are simpli ed considerably through exploitation of Fortran 90 array features. Substantial performance bene ts from interfacing to ecient ScaLapack routines are also demonstrated via a comparison with equivalent HPF-coded functions. Finally, standard ScaLapack optimizations, tuning

88

Bibliography block sizes and processor topology/mapping, are found to be equally eective from HPF.

[137] D. B. Loveman: Fortran: A Modern Standard Programming Language for Parallel Scalable High Performance Technical Computing. ICPP Workshop on Challenges for Parallel Processing, IEEE Press, Los Alamitos 1996, pp. 140{148.

This paper provides a quick overview of \modern" Fortran for the development of numerically intensive technical computing applications and looks at some simple examples. These examples are chosen to be illustrative of a data parallel coding style that is readable and understandable, performs well on a single processor system, and scales well on multiple processor shared memory and distributed memory systems. Compilers supporting this style of programming are available from a number of vendors.

[138] L. F. Meadows, D. Miles and M. Young: Performance Results of Several HPF Benchmarks. 9th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1995, pp. 516{517.

The pghpf HPF compiler from The Portland Group, Inc. (PGI) has been ported to a variety of shared- and distributed-memory platforms. To track compiler improvements and assist in tuning of the pghpf runtime system, three types of benchmarks are used. Low level runtime benchmarks measure the eciency of the runtime system relative to hardware speci cations. Kernel benchmarks, usually only a few hundred lines long, measure eectiveness on sequences of code that dominate the execution time of many common applications. Full application benchmarks measure eectiveness on productionclass codes from a variety of disciplines. Application benchmarks are typically longrunning, include I/O of varying amounts, and are signi cantly more complicated than the kernel benchmarks. The paper describes these benchmarks below in further detail.

[139] J. H. Merlin: Adapting Fortran 90 Array Programs for Distributed Memory Architectures. 1st Int. Conf. of the ACPC, Springer-Verlag, Berlin Heidelberg New York 1991, pp. 184{200.

The author describes a system whose purpose is to automatically transform data parallel Fortran 90 programs for execution on MIMD distributed memory architectures. The system is called Adapt (for `Array Distribution Automatic Parallelisation Tool'). Programs for the system should make full use of the array features of Fortran 90, as parallelism is automatically extracted from the array syntax. Parallelization is by data-partitioning, guided by `distribution' declarations that the user inserts in his program|these being the only additions required to standard Fortran 90 programs. This paper gives a brief overview of the array features of Fortran 90, describes the `distribution' declarations required by Adapt, and gives details of the parallelization scheme.

[140] M. Metcalf: HPF. 1995 CERN School of Computing, CERN, Geneva 1995, pp. 193{197.

This paper presents the main features of HPF, a language to write data-parallel programs in a machine-independent way.

[141] D. Middleton, P. Mehrotra and J. Van Rosendale: Expressing Direct Simulation Monte Carlo Methods in HPF. Seventh SIAM Conference on Parallel Processing for Scienti c Computing, SIAM Press, Philadelphia 1995, pp. 698{703. HPF can readily express a broad spectrum of scienti c applications and may achieve ecient parallel execution on most of them. However, the current language contains little support for programs whose data structures rely heavily on irregular or pointerbased data structures. The authors look at direct simulation Monte Carlo methods, an

Proceedings

89

important scienti c application in this category. They focus rst on an explicitly parallel implementation of this algorithm and then examine possible HPF expressions of this algorithm. The goal is to use HPF to achieve eective performance, while requiring as little reprogramming as possible.

[142] A. Muller and R. Ruhl: Communication-Buers for Data-Parallel, Irregular Computations. Languages, Compilers and Run-Time Systems for Scalable Computers, Kluwer Academic Publishers, Boston 1996, pp. 295{298. The authors are currently developing the portable integrated tool environment Annai (Clemencon et al. [102]) for distributed memory parallel processors. Annai consists of a HPF compiler, extended for the support of unstructured computations, a performance monitor and analyzer (Wylie and Endo [166]), and a source level debugger (Clemencon et al. [102]) for distributed programs, all sharing a common user interface. MPI serves as a low level machine interface. The HPF extensions are implemented by a compiler generically called Parallelization Support Tool (PST). All PST HPF extensions are fully supported by both the performance analyzer and debugger. Much like Arf, Fortran D, Kali, and Oxygen, PST supports unstructured computations through the run time preprocessing of critical code segments. In a previous report (Muller and Ruhl, 1995) the authors described PST's language extensions, compilation mechanisms and run time support in detail. Performance of compiler generated parallel programs was discussed and execution time overhead was measured. Here they focus on one of PST's advanced features, namely the availability of dierent buer organizations for distributed data.

[143] C. A. Nelson: Compiler Optimizations for Red-Black HPF Codes. Third International Workshop PARA '96, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 534{542.

A case study is presented of the compiler optimizations that have been used to generate ecient code for a red-black relaxation problem coded in HPF and targeted at distributed memory. HPF gives the programmer the ability to express the parallelism at a high level without entering into the low-level details of message-passing and synchronization thereby reducing the time and eort required for parallel program development. Because the HPF compiler is responsible for scheduling the parallel operations on the physical machines, HPF opens up a vast area of optimizations which the compiler must perform in order to generate ecient code. These are optimizations which would otherwise have been performed by the programmer at a lower level (using explicit message passing). Some timings from the Digital Fortran 90 compiler showing the eect of these optimizations are presented.

[144] L. Onesti and T. Srevik: Parallelization of a Local Area Ocean Model. Third International Workshop PARA '96, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 543{554. This paper describes the parallelization of a local area ocean circulation model using HPF. The authors nd that this particular problem, leading to a typical data parallel algorithm is easily expressed in HPF. The problems which occur are mainly due to restrictions in the current HPF compilers. The authors have run their program on three dierent platforms, using dierent HPF compilers, and will report on the status of these compilers as measured by their problem.

[145] M. Parashar, S. Hairi, T. Haupt and G. Fox: Design of an Application Development Toolkit for HPF/Fortran 90D. First International Workshop on Parallel Processing, Tata McGraw-Hill, New Delhi 1995, pp. 482{487. This paper describes the design and operation of a toolkit for HPF/Fortran 90D application development. The toolkit incorporates the following systems: 1 ESP: An

90

Bibliography Interpretive Framework for HPF/Fortran 90D Performance Prediction; 2 ESP-i: A HPF/Fortran 90D Functional Interpreter; and 3 ESPial: An Integrated Environment for HPF/Fortran 90D Application Development & Execution. The toolkit has been implemented on the iPSC/860 hypercube system, and is supported by an interactive, graphical user interface (ESPView) which provides application developers with the following functionality: design evaluation capability, functional veri cation capability, performance visualization support, experimentation capability, compilation support, and execution support.

[146] M. Parashar, S. Hairi, T. Haupt and G. Fox: Interpreting the Performance of HPF/Fortran 90D. Supercomputing Conference, IEEE Press, Los Alamitos 1994, pp. 743{752. The authors present a novel interpretive approach for accurate and cost-eective performance prediction in a high performance computing environment, and describe the design of a source-driven HPF/Fortran 90D performance prediction framework based on this approach. The performance prediction framework has been implemented as part of a HPF/Fortran 90D application development environment. A set of benchmarking kernels and application codes are used to validate the accuracy, utility, usability, and cost-eectiveness of the performance prediction framework. The use of the framework for selecting appropriate compiler directives and for application performance debugging is demonstrated.

[147] J.-L. Pazat: Tools for HPF: A Survey. Data Parallel Programming Model; Foundations, HPF Realization, and Scienti c Applications, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 134{158. This survey has been constructed interactively on the Web with tool developers. Most of the descriptions of tools presented in this survey have been provided `as is' by tools designers, and some have been partly rewritten. The current version can be accessed at the Web site http://www.irisa.fr/EuroTools/SIG/HPF/HPF/survey.html.

[148] C. Perez: Load Balancing HPF Programs by Migrating Virtual Processors. Second International Workshop on High-Level Parallel Programming Models and Supportive Environments, IEEE Press, Los Alamitos 1997, pp. 85{92. The paper explores the integration of load balancing features in HPF, targeting semi regular applications. The author shows that the HPF virtual processors are good candidates to be the unit of migration. Then, he compares three possible implementations and show that threads provide a good trade-o between eciency and ease of implementation. Finally a preliminary implementation is described. The experimental results, obtained with Gaussian elimination using partial pivoting are promising.

[149] B. Rodriguez, L. Hart and T. Henderson: Comparing Scalable Programming Techniques for Weather Prediction. 2nd International Conference on Programming Models for Massively Parallel Computers, IEEE Press, Los Alamitos 1995, pp. 111{120.

In this paper the authors study the parallelization of weather prediction models. They compare parallelization using a high level library (the Nearest Neighbor Tool: NNT) and a high level language/directive approach using HPF. They report on the performance of a complete weather prediction model coded using NNT. The authors quantify the performance eects of optimizations possible with NNT that must be performed by an HPF compiler.

[150] R. Ryne: Beam Dynamics Simulations Using a Parallel Version of Parmila.

Proceedings

91

XVIII International Linear Accelerator Conference, CERN, Geneva 1996, pp. 234{236.

The computer code Parmila has been the primary tool for the design of proton and ion linear accelerators in the United States for nearly three decades. Previously it was sucient to perform simulations with of order 10 000 particles, but recently the need to perform high resolution halo studies for next-generation, high intensity linear accelerators has made it necessary to perform simulations with of order 100 million particles. With the advent of massively parallel computers such simulations are now within reach. Parallel computers already make it possible, for example, to perform beam dynamics calculations with tens of millions of particles, requiring over 10 GByte of core memory, in just a few hours. Also, parallel computers are becoming easier to use thanks to the availability of mature, Fortran-like languages such as CM Fortran and HPF. The author will describe his experience developing a parallel version of Parmila and the performance of the new code.

[151] S. Saini: NAS Experiences of Porting CM Fortran Codes to HPF on IBM SP2 and SGI Power Challenge. 10th International Parallel Processing Symposium, IEEE Press, Los Alamitos 1996, pp. 873{880.

Current CM Fortran codes developed for the CM-2 and the CM-5 represent an important class of parallel applications. Several users have employed CM Fortran codes in the production mode on the CM-2 and the CM-5 for the last ve to six years, constituting a heavy investment in terms of cost and time. With Thinking Machines Corporation's decision to withdraw from the hardware business and with the decommissioning of many CM-2 and CM-5 machines, the best way to protect the substantial investment in CM Fortran codes is to port the codes to HPF on highly parallel systems. HPF is very similar to CM Fortran and thus represents a natural transition. The Numerical Aerodynamic Simulation (NAS) Program, located at NASA Ames Research Center, is a path nder in high-performance computing for NASA and is dedicated to advancing the science of computational aerodynamics. Their experiences with the conversion issues involved in porting CM Fortran codes on the CM-5 to HPF are presented. Several CM Fortran codes have been ported to Subset HPF on the IBM SP2 and the SGI Power Challenge. Speedup ratios versus number of processors for the linear solver and DSMC (direct simulation Monte Carlo) code are presented.

[152] R. Sampath, J. Fritscher and B. J. N. Wylie: Porting the Annai Tool Environment to Workstation Clusters. Workstations and their applications; SIWORK '96, vdf Hochschulverlag an der ETH, Zurich 1996, pp. 299{310.

The realization of support for workstation clusters with the integrated tool environment Annai is discussed. Annai prototypes developed CSCS support parallel program development with standard paradigms: implicitly-parallel HPF and explicit communication with MPI. The port to this new platform required only minor changes to the authors' MPI library and parallel debugger, due to the clean separation of Annai components into portable and platform-dependent parts. Initial results show promising performance for the further exploitation of cost-eective workstation clusters with real MPI and HPF applications.

[153] M. Sato, T. Hirooka, K. Wada and F. Yamamoto: Program Partitioning Optimizations in an HPF Prototype Compiler. Twentieth Annual International Computer Software and Applications Conference, IEEE Press, Los Alamitos 1996, pp. 124{131.

Two kinds of program partitioning optimizations are proposed. The rst extends program partitioning using array subscript functions to deal with two cases about array elements on the left hand side of assignment statements in the loop: (a) the same loop

92

Bibliography control variables exist in dierent dimensions, (b) plural loop control variables exist in one dimension. The second removes statement guards generated in a loop by program partitioning in two cases about loop iteration ranges of statements in the loop: (c) ranges of plural statements for one processor are potentially dierent, (d) ranges of one statement for plural processors are potentially dierent.

[154] R. Schreiber: An Introduction to HPF. Data Parallel Programming Model; Foundations, HPF Realization, and Scienti c Applications, SpringerVerlag, Berlin Heidelberg New York 1996, pp. 27{44.

This paper reviews HPF's key language elements: elemental array parallelism and data mapping pragmas; the relationship between data mapping and implicit communication; the FORALL and INDEPENDENT loop mechanisms for more general data parallelism; and the standard HPF library, which adds to the richness of the array operators at the disposal of the HPF programmer. It reviews the important problem of data mapping at the procedure call interface. It also discusses interoperability with other programming models, including SPMD programming. The latter part of the paper is a review of HPF 2.0.

[155] G. Schulz-Ziemer and A. Geiger: HPF on Intel Paragon and CRAFT on CRAY T3D: Basic Performance Measurements and Experiments with a Block-Sparse CG-Algorithm. High-Performance Computing and Networking, International Conference and Exhibition, Springer-Verlag, Berlin Heidelberg New York 1995, pp. 618{625.

CRAFT is the CRay Adaptive ForTran of Cray Research. In addition to the shared variable concept of HPF, CRAFT allows also `private' variables. Therefore CRAFT can be mixed with message passing and explicit shared memory functions and can perform shared to private coercion. Also, it is possible to share work on subroutine level, de ne sequential regions and explicit synchronization-points. This paper presents experiences and results with the APR and PGI HPF-Compilers on the intel Paragon and CRAFT on the Cray T3D. Motivated by the wide use of unstructured discretizations in CFD and structural mechanics, the authors examine the parallelization of a block-sparse conjugate gradient algorithm. An overview over the adapted BCCS-format and the corresponding data-distribution is given. They describe the diculties, restrictions and results of using this storage format for eciently calculating the sparse matrix-vector-product, which is the dominating operation in the conjugate gradient algorithm.

[156] H. J. Sips, K. van Reeuwijk and W. Denissen: Analysis of Local Enumeration and Storage Schemes in HPF. International Conference on Supercomputing, ACM Press, New York 1996, pp. 10{17.

In this paper, the authors analyze the eciency of three local enumeration and three storage compression schemes for cyclic(m) data distributions in HPF. They show that for linear array access sequences, ecient enumeration and storage compression schemes can be derived. Furthermore, local enumeration and storage techniques are shown to be orthogonal, if the local storage compression scheme is collapsible. Performance gures of the methods are given for a number of dierent processors.

[157] H. Sivaraman and C. S. Raghavendra: ADDT: Automatic Data Distribution Tool for Porting Programs to PVM. Twenty-Ninth Hawaii International Conference on System Sciences, IEEE Press, Los Alamitos 1996, pp. 557{ 564. HPF requires the user to specify the distribution of the data in the application. This distribution is speci ed over the processors on which the application is to execute. A good data distribution depends on the number of processors, the communication latency

Proceedings

93

and the underlying compilation system. The authors present a method for automatically generating a dynamic data distribution and alignments for a distributed memory environment. This method has been implemented in an automatic data distribution tool called ADDT. ADDT reads in a shared memory parallel program in Fortran and generates as output a HPF program. This method works by breaking up the input program into sub parts called distribution blocks (DB). It then obtains a set of data distribution for the arrays in each DB using the data layouts from preceding DB's. The algorithms used to obtain a set of data distribution (called candidate distributions) for a DB are discussed. The nal data layout for the program has to be selected from these candidate distributions. This selection is formulated as a linear optimization problem which minimizes the communication while minimizing loss of parallelism. The authors present experimental results obtained by running ADDT on some example programs.

[158] H. Sivaraman and C. S. Raghavendra: Parallelizing Sequential Programs to a Cluster of Workstations. ICPP Workshop on Challenges for Parallel Processing, IEEE Press, Los Alamitos 1996, pp. 38{41. This paper presents an overview of the design of an automatic parallelization system for a cluster of workstations. The compilation system the authors have put together is called Aztec. Aztec consists of three components, the Parafrase-2 parallelizing compiler, a data distribution tool called ADDT, and an HPF compiler, Adaptor. Aztec has a simple GUI to support the selection of a source le and machines to automatically port a given sequential program to parallel PVM code. The authors have successfully tested their Aztec system on several benchmark programs and obtained modest speedups.

[159] T. Suganuma, H. Komatsu and T. Nakatani: Detection and Global Optimization of Reduction Operations for Distributed Parallel Machines. International Conference on Supercomputing, ACM Press, New York 1996, pp. 18{25.

This paper presents a new technique for detecting and optimizing reduction operations for parallelizing compilers. The technique presented here can detect reduction constructs in general complex loops, parallelize the loops containing reduction constructs, and optimize communications for multiple reduction operations. The optimizations proposed here can be applied not only to individual reduction loops, but also to multiple loop nests throughout a program. The techniques have been implemented in an HPF compiler, and their eectiveness is evaluated on an IBM SP2 using a set of standard benchmarking programs. Although the experimental results are still preliminary, it is shown that the authors' techniques for detecting and optimizing reductions are eective on practical application programs.

[160] S. Sumana and U. N. Shenoy: Porting Realistic Applications to HPF { A Case Study. Proceedings of the First International Workshop on Parallel Processing, Tata McGraw-Hill, New Delhi 1995, pp. 370{375. The paper reports a study of Perfect Club Benchmarks (sequential programs written in standard Fortran 77) in the context of converting them to HPF. The authors expect that this study not only throws more light on the problems a programmer might face in converting existing sequential programs to HPF, but also gives several clues to the compiler writer about the structure of real application programs.

[161] R. Thakur, R. R. Bordawekar and A. Choudhary: Compiler and Runtime Support for Out-Of-Core HPF Programs. International Conference on Supercomputing, ACM Press, New York 1994, pp. 382{391.

This paper describes the design of a compiler which can translate out-of-core programs written in a data parallel language like HPF. Such a compiler is required for compiling

94

Bibliography large scale scienti c applications, such as the Grand Challenge applications, which deal with enormous quantities of data. The authors propose a framework by which a compiler together with appropriate runtime support can translate an out-of-core HPF program to a message passing node program with explicit parallel I/O. They describe the basic model of the compiler and the various transformations made by the compiler. We also discuss the runtime routines used by the compiler for I/O and communication. In order to minimize I/O, the runtime support system can reuse data already fetched into memory. The working of the compiler is illustrated using two out-of-core applications, namely a Laplace equation solver and LU Decomposition, together with performance results on the Intel Touchstone Delta.

[162] A. Thirumalai and J. Ramanujam: Fast Address Sequence Generation for Data-Parallel Programs Using Integer Lattices. Languages and Compilers for Parallel Computing, Springer-Verlag, Berlin Heidelberg New York 1996, pp. 191{208.

In data-parallel languages such as HPF and Fortran D, arrays are mapped to processors through a two step process involving alignment followed by distribution. A compiler that generates code for each processor has to compute the sequence of local memory addresses accessed by each processor and the sequence of sends and receives for a given processor to access non-local data. In this paper a novel approach based on integer lattices is presented. The set of elements referenced can be generated by integer linear combinations of basis vectors. This linear algorithm determines the basis vectors as a function of the mapping. Using the basis vectors, a loop nest is derived that enumerates the addresses, which are points in the lattice generated by the basis vectors. Experimental results show that this approach is better than that of a recent linear time solution to this problem.

[163] A. Thirumalai, J. Ramanujam and A. Venkatachar: Communication Generation and Optimization for HPF. Languages, Compilers and Run-Time Systems for Scalable Computers, Kluwer Academic Publishers, Norwell 1996, pp. 311{316.

Fortran D and HPF include directives that allow programmers to specify data mappings. In these languages, arrays are aligned to an abstract Cartesian grid called a template; the template is then distributed across the various processors. A compiler for HPF that generates code for each processor has to compute the sequence of local memory addresses accessed by each processor and the sequence of sends and receives for a given processor to access non local data. The distribution of computation in most compilers follows the owner computes rule. That is, a processor performs only those computations (or assignments) for which it owns the left hand side variable. Access to non local right hand side variables is achieved by inserting sends and receives. The authors present a novel approach to communication generation for general block-cyclic distributions based on the fact that the set of elements referenced form a lattice, i. e., they can be generated by integer linear combinations of basis vectors and that this set has a repeating pattern. Experiments on an IBM SP-2 show that the communication table generation times are small.

[164] M. Ujaldon, E. L. Zapata, B. M. Chapman and H. P. Zima: Data-Parallel Language Features for Sparse Codes. Languages, Compilers and Run-Time Systems for Scalable Computers. Kluwer Academic Publishers, Norwell 1996, pp. 253{264. This paper proposes a new approach to improve data-parallel languages in the context of sparse and irregular computation. The authors analyze the capabilities of HPF and Vienna Fortran, and identify a set of problems leading to sub-optimal parallel code

Proceedings

95

generation for such computations on distributed-memory machines. Finally, they propose extensions to the data distribution facilities in Vienna Fortran which address these issues and provide a powerful mechanism for eciently expressing sparse algorithms.

[165] S. P. Van der Wiel, D. Nathanson and D. J. Lilja: Complexity and Performance in Parallel Programming Languages. Second International Workshop on High-Level Parallel Programming Models and Supportive Environments, IEEE Press, Los Alamitos 1997, pp. 3{12.

This paper borrows techniques from the software engineering eld to quantify the complexity of three predominate programming models: shared memory, message passing and HPF. It is concluded that traditional software complexity metrics are eective indicators of the relative complexity of parallel programming languages. The impact of complexity on run-time performance is also discussed in the context of message passing versus HPF on an IBM SP2.

[166] B. J. N. Wylie and A. Endo: Annai/PMA Multi-Level Hierarchical Parallel Program Performance Engineering. First International Workshop on HighLevel Programming Models and Supportive Environments, IEEE Press, Los Alamitos 1996, pp. 58{67. The Annai tool environment helps exploit distributed memory parallel computers with HPF and/or explicit communication, using MPI as the machine interface for portability. Integration within a uni ed environment allows the Performance Monitor and Analyzer component (PMA) to use common facilities to conveniently interact with source code browsers and the loaded parallel program. Information furnished by the parallelization and compilation systems is exploited for in depth insight, and maps low level measurements back to appropriate objects in the high level source program. Powerful, directed analysis begins with scalable, graphical pro le summaries of a range of metrics related to program routines and loops, progressing through processor based `balance' displays on to detailed evolution charts of the time varying behavior of individual processes and communication events. As the performance engineering investigation advances, increasingly more detailed information is provided in a context where it can be most readily understood.

[167] B. J. N. Wylie and A. Endo: The Annai/PMA Performance Monitor and Analyzer. Fourth International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, IEEE Press, Los Alamitos 1996, pp. 186{191.

The Annai integrated tool environment helps exploit the inherent power of distributedmemory parallel computers with standardized languages and convenient programming support. Portable application development is supported in HPF and/or with explicit message-passing, using MPI as the machine interface. Integration within a uni ed tool environment allows the performance monitor and analyzer (PMA) component to interact with source code browsers and the loaded executable on the same terms as when using the parallel debugger. Data distribution and other program information furnished by the parallelization support and compilation systems is also exploited for additional insight. Powerful, directed analysis and interactive graphical summaries address scalability, while detailed charts of the time-varying behavior of individual processes and communication events can also be browsed when desired, always retaining essential reference to the original program source code.

[168] Xiaojie Li and K. Harada: Evaluations of an Asynchronous Data Transmission Mechanism for HPF Compilers. IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, IEEE Press, Los

96

Bibliography Alamitos 1996, pp. 430{437.

Reducing the overhead of data transmissions is crucial to harnessing the potential of distributed memory multiprocessors. Some optimization techniques such as overlapping communication and computation have been proposed. However, the most overlapping techniques are still limited by the following factors. There is no linguistic support for overlapping communication and computation in data parallel languages; and there is no global analysis performed for array references over dierent nested loops, which will decide how to overlap communication and computation. This paper proposes an asynchronous data transmission mechanism for data parallel languages. It describes a new linguistic support, called an N-level message queue that is designed to overlap communication and computation in a program. Based on an exact data- ow analysis on individual array element accesses, the compiler inserts data transmission requests to message queue into SPMD code. Experiments show that introducing an asynchronous transmission mechanism into data parallel language compilers is eective.

[169] H. P. Zima, B. M. Chapman and M. Pantano: Compiler Technology for Scalable Parallel Architectures { A Short Overview. Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scienti c, Singapore 1995, pp. 240{249.

The authors discuss compilation systems for HPF and its siblings. The authors outline the state-of-the-art, discuss limitations of current HPF, and propose an integrated programming environment that supports the automatic tuning of Fortran programs for a given parallel target architecture.

[170] M. E. Zosel: HPF: An Overview. 38th Annual IEEE Comput.Soc. Int. Comput. Conf., IEEE Press, Los Alamitos 1993, pp. 132{136.

The purpose of this paper is to give an overview of the work of the HPFF. This group of industry, academic, and user representatives has been meeting to de ne a set of extensions for Fortran dedicated to the special problems posed by high performance computers, especially parallel computers. The paper describes the HPFF eort and its goals and gives a brief description of the functionality of HPF.

Theses [171] S. Benkner: Vienna Fortran 90 and its Compilation. University of Vienna, September 1994.

In this thesis, the author de nes Vienna Fortran 90, an advanced data parallel language based on Fortran 90 and Vienna Fortran, that enables the user to program distributed memory architectures almost as easily as sequential computers using global addresses only. Vienna Fortran oers a variety of high level features for data distribution and thus combines the advantages of a shared memory programming paradigm with mechanisms for explicit user control of those aspects that have the greatest impact on eciency. Vienna Fortran 90 provides a number of new features not found in any other comparable language such as HPF or Fortran D. This includes concepts for distributing user de ned data structures, distribution of pointer objects, and a general framework for specifying arbitrary data and work distribution. Having speci ed the language, the author develops the major compilation techniques for restructuring Vienna Fortran 90 programs into parallel message passing programs for execution on distributed memory machines. These techniques are general enough to compile programs for a parameterized number of processors. From the user speci ed data distributions mathematical formulas are derived that determine the ownership and osets of local data, the work distribution and the communication between processors. Novel techniques for the management of distributed data, including runtime representation of distributions, new methods for index conversion, and static as well as dynamic schemes for handling non-local data, have been developed.

[172] R. R. Bordawekar: Techniques for Compiling I/O Intensive Parallel Programs. Syracuse University, December 1996.

This dissertation investigates several issues in providing compiler support for I/O intensive parallel programs. In this dissertation, the author focuses on satisfying two I/O requirements, namely, support for accessing multidimensional arrays and support for out-of-core computations. He analyzes working spaces in I/O intensive programs and propose three execution models to be used by users or compilers for developing ecient I/O intensive parallel programs. Dierent phases in compiling out-of-core parallel programs are then described. Three dierent methods for performing communication are presented and validated using representative application templates. It is illustrated that communication in out-of-core programs may require both inter-processor communication and le I/O. The author shows that using the copy-in-copy-out semantics of the HPF scFORALL construct, extra le I/O incurred in communication can be completely eliminated by reordering in-core computations. Two dierent approaches for reordering in-core computations are presented, namely, integrated tiling and scheduling heuristic, and data ow framework for placing communication and I/O calls. The discussion is supplemented with experimental performance results of representative stencil applications. Finally, an overview of the prototype Passion (Parallel And Scalable Software for I/O) compiler is presented. This compiler takes an annotated out-of-core HPF program as input and generates the corresponding node+message-passing program with calls to the parallel I/O runtime library. The author illustrates various functionalities of the compiler using example programs and supplement them by experimental results.

[173] Z. Bozkus: Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers. Syracuse University, August 1996.

This thesis describes an advanced compiler that can generate ecient parallel programs when the source programming language naturally represents an application's parallelism. Fortran 90D/HPF described in this thesis is such a language. Using Fortran 90D/HPF,

98

Bibliography parallelism is represented with parallel constructs, such as array operations, where statements, forall statements, and intrinsic functions. The language provides directives for data distribution. Fortran 90D/HPF gives the programmer powerful tools to express a problem with natural data parallelism. To validate this hypothesis, a prototype of Fortran 90D/HPF was implemented. The compiler is organized around several major units: language parsing, partitioning data and computation, detecting communication and generating code. The compiler recognizes the presence of communication patterns in the computations in order to generate appropriate communication calls. Speci cally, this involves a number of tests on the relationships among subscripts of various arrays in a statement. The compiler includes a specially designed algorithm to detect communications and to generate appropriate collective communication calls to execute array assignments and forall statements. The Fortran 90D/HPF compiler performs several types of communication and computation optimizations to improve the performance of the generated code. Empirical measurements show that the performance of the output of the Fortran 90D/HPF compiler is comparable to that of corresponding hand-written codes on several systems. The author hopes that this thesis assists in the widespread adoption of parallel computing technology and leads to a more attractive and powerful software development environment to support application parallelism that many users need.

[174] S. Chakrabarti: Ecient Resource Scheduling in Multiprocessors (NP Hard). University of California, Berkeley, 1996.

The focus of this thesis is in between the theory and practice of scheduling: it includes modeling, performance analysis and practical algorithmics. The author presents a variety of new techniques for scheduling problems relevant to parallel scienti c computing. The thesis progresses from new compile-time algorithms for message scheduling through new runtime algorithms for processor scheduling to a uni ed framework for allocating multiprocessor resources to competing jobs while optimizing both individual application performance and system throughput. The compiler algorithm schedules network communication for parallel programs accessing distributed arrays. By analyzing and optimizing communication patterns globally, rather than at the single statement level, communication costs are often reduced by factors of two to three in an implementation based on IBM's HPF compiler. The best parallelizing compilers at present support regular, static, array-based parallelism. But parallel programmers are out-growing this model. Many scienti c and commercial applications have a two-level structure: the outer level is a potentially irregular and dynamic task graph, and the inner level comprises relatively regular parallelism within each task. The author gives new runtime algorithms for allocating processors to such tasks. The result can be a twofold increase in eective mega ops, as seen from an implementation based on ScaLapack, a library of scienti c software for scalable parallel machines. Compilers and runtime systems target single programs. Other system software must do resource scheduling across multiple programs. For example, a database scheduler or a multiprocessor batch queuing system must allocate many kinds of resources between multiple programs. Some resources like processors may be traded for time, others, like memory, may not. Also, the goal is not to nish a xed set of programs as fast as possible but to minimize the average response time of the programs, perhaps weighted by a priority. The author presents new algorithms for such problems. Most of the above results assume a central scheduler with global knowledge. When the setting is distributed, decentralized techniques are needed.

[175] T. S. Chyczewski Jr.: A Time-Dependent, Three-Dimensional Numerical Study Of Supersonic Rectangular Jet Flow And Noise Using The Full Navier-Stokes Equations. The Pennsylvania State University, October 1996. A national interest in High Speed Civil Transports (HSCT) coupled with strict airport noise regulations has prompted the scienti c community to investigate new and improved

Theses

99

noise prediction strategies. Meeting these airport regulations is considered to be a major design challenge for the HSCT. In light of this eort, a direct simulation strategy for predicting supersonic jet noise is developed in this thesis. Direct simulations are quickly becoming the method of choice due to their generality and ever decreasing expense associated with the development of parallel processors. Supersonic jet noise is known to be dominated by the growth and decay of large scale turbulent structures. The direct simulation approach used here consists of solving the full Navier Stokes equations using high order nite dierence techniques to simulate the evolution of these structures and the noise they radiate to the acoustic near eld. This near eld solution is then extrapolated to the far eld using a Kirchho method. The numerical algorithm uses a fourth order Runge-Kutta method for the time integration. The spatial derivatives are approximated by a sixth order central scheme. A sixth order lter is used at each interior mesh point to damp frequencies that cannot be resolved by the spatial scheme. Second order ltering is provided only where required for stability. It is found to be con ned to speci c locations in the jet core and should have no eect on the acoustic solution. Characteristic based nonre ecting conditions are used to minimize re ections at the far eld boundaries and have proven to be eective. Additional boundary conditions are required in the form of it model for the nozzle exit ow. The characteristics of the nozzle exit ow can have a signi cant impact on the noise radiation. This dependence is unfortunate since comprehensive experimental data is not available in this region of the jet. A model is developed here that addresses a variety of factors (such as initial disturbance amplitude, corner vortices and modal excitation) that can signi cantly in uence the jet development. A comprehensive analysis of the eect of these factors is provided. The method has been implemented and optimized on the CM-5 using the data parallel paradigm. These optimizations are quite eective and result in an ecient parallel algorithm. The code is written in CM Fortran, which is a dialect of HPF, and can be easily ported to the latest generation parallel processors. The supersonic ow from a cold, ideally expanded rectangular ori ce is calculated to validate the code. The geometry (jet aspect ratio of 3) and ow conditions (Mjet = 1:54; Rejet = 25 000) are selected to match those of a rectangular jet experiment that has recently concluded at The Pennsylvania State University. Detailed information on the experimental setup as well as experimental results have been provided. The results presented in this thesis indicate that the direct simulation approach is capable of predicting many of the features of supersonic rectangular jets. This includes the evolution of large scale turbulent structures and the noise they radiate to the far eld.

[176] T. W. Clark: A Parallel Programming Approach for Scienti c Applications. University of Houston, 1996.

To address the software engineering problem for parallel scienti c programs, the IPFortran language and compiler were developed. This thesis discusses the design, development and testing of the IPFortran compiler. IPFortran grew out of an eort to ease the task of parallelizing complex scienti c codes, especially pre-existing ones which are often poorly structured. IPFortran has been found to be very successful in parallelizing molecular dynamics, quantum dynamics and Brownian dynamics applications, providing a low-level, but machine independent paradigm for performing operations that routinely arise in parallel scienti c programming. The author discusses the molecular dynamics algorithms in general and in the speci c context of IPFortran and Fortran D, a prototype for the HPF languages. He demonstrates that the complex interplay from high-level details of problem decomposition down to the low-level details in machine mapping are addressable with IPFortran. Through his experiences in parallelizing some of the molecular dynamics algorithms with Fortran D, he provides guidelines for structuring Fortran programs to facilitate parallelizing them with HPF-like languages.

[177] P. Crooks: An Automatic Program Translator for Distributed Memory MIMD Machines. Queen's University of Belfast (Northern Ireland), Sum-

100

Bibliography mer 1996.

The subject of this thesis is the design and implementation of an automatic data distribution tool for a subset of regular data parallel programs. The tool is a source-to-source translator which determines regular data distributions and redistributions for the arrays of the source program, into which it inserts the distribution scheme in the form of annotations. This annotated code is then output by the tool and may be executed on a distributed memory MIMD machine using a high-level compiler, such as the Vienna Fortran and HPF systems. The source language is a subset of Fortran 90. The tool uses pattern-matching techniques to detect common reference patterns in the source program; these are then used to index a look-up table, which returns an ecient distribution for the arrays referenced in a pattern. Performance estimation is used to arbitrate between con icting distributions. As the problem of nding a redistribution scheme to minimize total execution cost (including redistribution costs) is NP-complete, the tool employs an heuristic approach to dynamic distribution. The fundamental contribution of the project is that, for a particular range of programs, the distribution tool enables a programmer to port a sequential program on to a distributed memory MIMD machine without having to consider parallel issues. The tool has been demonstrated for several applications which display a range of distribution needs.

[178] W. N. Gansterer: Hochleistungsrechnen mit HPF. Techn. Univ. of Vienna, 1994.

High performance computing requires careful analysis of the relations between three components: algorithm, hardware, and software. The consequences of the structure of modern computer architectures on the development and implementation of ecient parallel algorithms using HPF are discussed in this thesis. After a detailed description of the basic concepts of modern computer architectures, an extensive discussion of performance analysis and measurement is given. Parallel versions of matrix-matrix multiplication and LU factorization are investigated as prototypical problems. After a brief overview of Fortran 90's most important elements for data parallel programming, concepts and language elements of HPF 1.0 are outlined.

[179] D. A. Hudson: Numerical Simulation of a Con ned Supersonic Shear Layer. The Pennsylvania State University, February 1997.

Numerical simulations of con ned supersonic shear layers have been conducted using advanced massively parallel computing systems and a high performance scienti c programming language. The fundamental capability sought was the ability to model the steady and unsteady behavior of con ned compressible shear layer mixing, at least to within the conventional guidelines for the resolution of important physical phenomena. The overall objective was fully achieved. The development of the software was accomplished on a combination of a CM-200a situated at Penn State, the Numerical Aerodynamic Simulation Program's CM-5 at NASA Ames, and the National Center for Supercomputing Application's CM-5 at the University of Illinois at Urbana-Champaign. A system speci c version of HPF, CM Fortran, was used to code the software. The Euler equations were integrated with the MacCormack 2-4 numerical scheme applied over a Cartesian grid. Several new developments such as a modi ed Jameson Arti cial Viscosity scheme, a new spatial extrapolation scheme, and new unsteady inlet boundary conditions, resulted in excellent comparison with experimental data. The supersonic shear layers were simulated using dense grids to provide a ne-grain resolution of the mixing layer. Grid densities were chosen to resolve the fundamental Kelvin-Helmholtz instability mode and the very thin shear layer near the inlet. The ne grain solutions utilize approximately 200 000 grid points in the 2-D cases and 5 250 000 grid points in the 3-D case. Two compressible mixing layers were simulated and compared to experimental data collected under a separate eort. Similarly, single frequency excitation simulations were made to pro-

Theses

101

vide comparisons with linear stability theory. Both shear layers have moderate Reynolds numbers (Rey=sb[=delta=omega]) of 6600 and 9500, and have convective Mach numbers of 0.5 and 0.64, respectively. Very good agreement between the two-dimensional simulation and experimental results were obtained for the mean velocity pro les, FFT spectra, shear layer growth rate, momentum uctuation quantities and the Reynolds stresses. Examination of instantaneous and time-averaged eld variables provided informative insight into; shear layer growth rate behavior, double peaked turbulence intensity pro les that are commonly observed in experimental data, and shear layer excitation by standing Mach waves re ected from the con ning channel walls.

[180] E. T. Kalns: Scalable Data Redistribution Services for Distributed-Memory Machines. Michigan State University, April 1996.

This dissertation investigates the pertinent issues that aect the performance of data redistribution on distributed-memory machines, focusing on four primary areas. First, the partitioning (or mapping) of data are addressed onto processor memories. A technique that facilitates the minimal amount of data exchange among processor memories during redistribution between HPF's regular patterns is proposed. Second, the design of a portable and communication-ecient data redistribution library are presented whose implementation is portable among a large class of distributed-memory machines. Portability is enhanced through MPI communication primitives. Third, a framework for quantifying the scalability of parallel algorithms together with the machines upon which they execute are developed. Fourth, the framework to quantify the scalability of the data redistribution library for a large range of processor con gurations and data set sizes on selected distributed-memory machines is applied.

[181] S. D. Kaushik: Compile-Time and Run-Time Strategies for Array Statement Execution on Distributed-Memory Machines. The Ohio State University, December 1995.

In HPF, array statements are used to express data-parallelism. In this thesis, the author develops methods for the ecient execution of array statements on distributed-memory machines. In compiling array statements for a distributed-memory machine, ecient generation of communication sets and local index sets is important. It is shown that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be eciently enumerated as closed forms using regular sections. First, closed form solutions are developed for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an ecient solution for arrays with block-cyclic distributions. Performance results on a Cray T3D system demonstrate the ecacy of the virtual processor approach. To eciently perform array redistribution, precise closed forms for enumerating the communication sets are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general case for array redistribution involving block-cyclically distributed arrays can be expressed in terms of these special cases. Using the closed forms, a distributed algorithm for scheduling the communication for redistribution to eliminate node contention is developed. The algorithm has a lower communication and scheduling overhead than those presented in the literature. Based on the closed forms, a cost model for estimating the communication overhead for array redistribution is developed. Using this model, a multi-phase approach for reducing the communication cost of array redistribution is presented. Experimental results on the IBM SP2 and Cray T3D validate the proposed cost model and demonstrate the ecacy of the multi-phase approach.

[182] U. J. Kremer: Automatic Data Layout for Distributed Memory Machines. Rice University, November 1996.

This thesis discusses the design and implementation of a data layout selection tool that

102

Bibliography generates Fortran D or HPF style data layout speci cations automatically. Because the tool is not embedded in the target compiler and will be run only a few times during the tuning phase of an application, it can use techniques that may be considered too computationally expensive for inclusion in today's compilers. The proposed framework for automatic data layout selection builds and examines explicit search spaces of candidate data layouts. A candidate layout is an ecient layout for some part of the program. After the generation of search spaces, a single candidate layout is selected for each program part, resulting in a data layout for the entire program. A good overall data layout may require the remapping of arrays between program parts. A performance estimator based on a compiler model, an execution model, and a machine model is used to predict the execution time of each candidate layout and the costs of possible remappings between candidate data layouts. The machine model uses the novel training set approach which determines the costs of arithmetic operations and simple communication patterns. In the proposed framework, instances of NP-complete problems are solved during the construction of candidate layout search spaces and the nal selection of candidate layouts from each search space. Rather than resorting to heuristics prematurely, the framework capitalizes on state-of-the-art 0-1 integer programming technology to compute optimal solutions of these NP-complete problems. A prototype of the data layout assistant tool has been implemented. Experiments indicate that good data layouts can be determined eciently.

[183] L. M. Liebrock: Using Problem Topology in Parallelization. Rice University, June 1996.

Problem topology is the key to ecient parallelization support for partially regular applications. Speci cally, problem topology provides the information necessary for automatic data distribution and regular application optimization of partially regular applications. Problem topology is the dimensionality, size, and connectivity of the problem. Problem topology has traditionally been used in explicit parallelization of regular problems such as physical simulation applications. In languages such as HPF, problems that are regular allow many optimizations not applicable to irregular application codes. Unfortunately, many applications must sacri ce regularity to some extent for computational eciency. This research focuses on partially regular problems and strives to take advantage of partial regularity in the parallelization and compilation process. This dissertation uses topology for automatic, natural-topology, data distribution in linearized and composite grid (or multiblock) applications. For linearized applications, Fortran D is extended with logical topology and index array speci cations. With the information provided in these speci cations, it is shown how regular, linearized applications can be parallelized automatically in Fortran D as regular computations using their natural topology. In composite grid problems, meshes are coupled to form larger, more complex topology structures. Composite grid problems arise in important application areas such as uid

ow simulation, aerodynamic simulation, electric circuit simulation, and nuclear reactor simulation. Such physical phenomenon are inherently parallel and their simulations are computationally intensive. This dissertation presents a programming style and template for writing HPF programs for these applications, algorithms for automatic distribution of composite grid applications with mesh con gurations included in the input, and a discussion of compiler issues for composite grid problems. The automatically generated distributions for composite grid applications guarantee that all communication associated with the distribution of any given mesh will remain regular and nearest-neighbor in the mapping to processors. This research allows HPF compilers to perform regular application optimizations on the codes for this class of partially regular applications. Finally, the research is supported by experimental results, which indicate that substantial performance improvements are possible when topology is used in the parallelization of partially regular applications.

[184] W. Moser: The Portability of Parallel Software. Techn. Univ. of Vienna,

Theses

103

1996.

This thesis discusses the most important aspects of parallel software portability, a key issue in the future development of parallel computing. The general technique which enables the transfer of programs from one computer system to another one is based on abstract software layers which hide the details of the underlying computer hardware and which can be addressed by application programs in a uniform way on every computer for which they are implemented. It will turn out in the course of the following discussion that such techniques are still in an experimental or prototypical stage. In particular, the high-level approach of HPF is potentially very useful for a certain class of problems, but there remains a lot to be done in the development of satisfactory compilers.

[185] Y. Ozyoruk: Sound Radiation from Ducted Fans Using Computational Aeroacoustics on Parallel Computers. The Pennsylvania State University, July 1996. As a component of a more advanced, new generation fan noise prediction technology, a computational aeroacoustics algorithm has been developed using an entirely new approach. Unlike previous approaches, the current method accounts for the nonuniform background ow and aerodynamic-acoustic coupling issues by solving the 3-D, timedependent, full nonlinear Euler equations (although the developed computer program is a Navier-Stokes solver). The equations are solved on a 3-D body tted curvilinear coordinate system using temporally and spatially 4th-order accurate nite dierence, Runge-Kutta time integration. The time-accurate ow eld is determined only in a relatively small physical domain using nonre ecting boundary conditions on its outer boundaries. A moving surface Kirchho method using the formulation of Farassat and Myers has been developed and coupled to the ow solver for far- eld noise predictions. The acoustic eld is obtained by subtracting the mean eld from the total eld. To establish the mean ow eld, steady state solutions are required and Jameson's full approximation storage multigrid method has been extended to make use of the current high resolution algorithm for obtaining such solutions fast. Formulations in cylindrical coordinates together with cell-centered nite dierencing are used to eectively treat the grid singularity along the centerline. Well designed grids aid this treatment. A 3-D grid generator has been developed using the conformal mappings of Ives and Menor to provide the hybrid radiation code with capabilities for very rapid and good quality mesh generation. The hybrid radiation code has been written in CM Fortran, which is essentially HPF. Some novel optimization procedures have been developed and implemented in the code, which runs eciently on the CM-200 and CM-5 parallel computers. The code has been tested solving a large variety of problems, ranging from an oscillating piston problem to Stokes' second problem to prediction of noise radiation from realistic aircraft engine inlet con gurations. Comparisons with available exact or approximate solutions and experimental data indicate excellent agreement.

[186] M. Parashar: Interpretive Performance Prediction for High Performance Parallel Computing. Syracuse University, July 1995.

The key factor contributing to the complexity of parallel application development and the poor utilization of current high performance computing systems is the increased degrees of freedom that have to be resolved in such an environment. The primary objective of the authors' research is to address this software development bottleneck. In this research the interpretive approach to performance prediction is developed. The essence of this approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. A comprehensive system characterization methodology is de ned to hierarchically abstract the HPC system into a set of parameters which represent its performance. A corresponding application characterization methodology is de ned to abstract a high-level application description into a set of parameters which represent its behavior. Performance predic-

104

Bibliography tion is then achieved by interpreting the execution costs of the abstracted application in terms of the parameters exported by the abstracted system. Models and heuristics are de ned to handle accesses to the memory hierarchy, overlap between computation and communication, and user experimentation with system and run-time parameters. This thesis concentrates on distributed memory HPC systems and uses such a system to illustrate and validate the developed approach. An interpretive toolkit is designed and implemented to support HPF/Fortran 90D application development. It incorporates the following three systems: (1) ESP: An Interpretive Framework for HPF/Fortran 90D Performance Prediction; (2) ESP-i: A HPF/Fortran 90D Functional Interpreter; and (3) ESPial: An Integrated Environment for HPF/Fortran 90D Application Development & Execution. The toolkit is supported by an interactive, graphical user interface (ESPView) and provides the developer with the following functionality: design evaluation capability, functional veri cation capability, performance visualization support, experimentation capability, compilation support, and execution support. A set of application codes and benchmarking kernels are used to validate the accuracy, utility, cost-eectiveness, and usability of the interpretive framework. The interpretive approach provides an accurate and cost-eective (in terms of time and resources required) evaluation methodology that can be used by any tool supporting HPC (e. g., intelligent compilers, mapping and load-balancing tools, and system design evaluation tools) which has to optimize available design options.

[187] R. Ponnusamy: Run-Time Support and Compilation Methods for Irregular Computations on Distributed Memory Parallel Machines. Syracuse University, August 1996.

There exists a class of scienti c and engineering applications, called irregular applications, in which many of the performance optimizations can be done only at runtime. This constraint presents a greater challenge for compilers. This research provides solutions for compiling irregular problems. This thesis presents a combined runtime and compile-time approach for parallelizing this general class of applications on distributed memory machines. It presents a runtime system that has been designed and implemented for parallelizing these applications on distributed memory machines. Methods by which compilers for HPF-style parallel programming languages can automatically generate calls to the runtime system are also presented. The runtime system supports the partitioning of loop iterations to maintain data locality, the coupling of data partitioners to obtain non-standard distribution, the remapping of data structures and optimizations such as vectorization, aggregation and schedule reuse. The compiler techniques have been implemented in the Fortran 90D/HPF compiler being developed at Syracuse University. The runtime and compile-time approaches have been evaluated using templates from real scienti c applications. Performance results of Fortran 90D compiler-parallelized codes are compared with that of hand-parallelized codes. It is observed that the compiler-generated codes perform within 15 % of the hand-parallelized codes.

[188] S. Ramaswamy: Simultaneous Exploitation of Task and Data Parallelism in Regular Scienti c Applications. University of Illinois at UrbanaChampaign, February 1997. In this thesis the author explores a new compiler optimization for regular scienti c applications { the simultaneous exploitation of task and data parallelism. An optimization is implemented as part of the Paradigm HPF compiler framework and as part of a Matlab compiler framework the author has developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, program execution is made more ecient and therefore

Theses

105

faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as the experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be signi cantly faster that its execution using data parallelism alone. This makes the proposed optimization practical and extremely useful.

[189] G. H. Roth: Optimizing Fortran 90D/HPF for Distributed-Memory Computers. Dept of Computer Science, Rice University, April 1997.

This dissertation presents and analyzes a set of advanced optimizations designed to improve the execution performance of HPF programs on distributed-memory architectures. Presented is a methodology for performing deep analysis of Fortran 90 programs, eliminating the reliance upon pattern matching to drive the optimizations as is done in many Fortran 90 compilers. The optimizations address the overhead of data movement, both interprocessor and intraprocessor movement, that results from the translation of Fortran 90 array constructs. Additional optimizations address the issues of scalarizing array assignment statements, loop fusion, and data locality. The combination of these optimizations results in a compiler that is capable of optimizing dense matrix stencil computations more completely than all previous eorts in this area. This work is distinguished by advanced compile-time analysis and optimizations performed at the whole-array level as opposed to analysis and optimization performed at the loop or array-element levels.

[190] A. Roy-Chowdhury: Manual and Compiler Assisted Methods for Generating Fault-Tolerant Parallel Programs. University of Illinois at UrbanaChampaign, October 1996.

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating faulttolerance into existing applications. Applications are modi ed to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modi cation to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they have been located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this dissertation, rst a general scheme for performing fault-location and recovery under the ABFT framework is presented. The authors' fault model assumes that a faulty processor can corrupt all of the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a distributed memory multicomputer, which demonstrate acceptably low overheads for the single- and double-fault location and recovery cases. For a class of algorithms performing ane transformations, the author automates the process of generating an error-detecting version at compile time. The compiler is used to identify loops that perform ane transformations on array elements. These loops are then checked by computing a checksum over the array elements being transformed and transforming the checksums appropriately, which typically results in much smaller overheads than checking the entire code by duplication. Portions of code in the program that are not ane transformations are checked by duplication. An existing source-to-source compiler, Parafrase-2, has been modi ed to take in programs written in HPF and output an error-detecting version of the same. Data distributions for the new arrays and checksums introduced are speci ed by inserting additional HPF directives in the program. The modi ed program can then be input to a parallelizer for distributed memory machines, such as Paradigm, to

106

Bibliography obtain an error-detecting parallel program. The results are demonstrated on three numerical programs by executing the error-detecting versions generated by a compiler on a distributed memory multicomputer.

[191] R. Thakur: Runtime Support for In-Core and Out-Of-Core Data-Parallel Programs. Syracuse University, March 1996. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater eciency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like HPF to node programs for distributed memory machines. In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents ecient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results. A number of applications deal with very large data sets which cannot t in main memory, and hence have to be stored in les on disks, resulting in out-of-core programs. This thesis also describes the design and implementation of ecient runtime support for out-of-core computations. Several optimizations for accessing out-of-core data are presented. An Extended TwoPhase Method is proposed for accessing sections of out-of-core arrays eciently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out-of-core programs on the Touchstone Delta are presented.

[192] J.-J. Wu: Optimization and Transformation Techniques for HPF. Yale University, February 1996.

This dissertation studies optimization issues crucial to implementing HPF on massively parallel machines. Two major challenges imposed by distributed-memory implementation of HPF are reducing communication overhead and improving code performance at the processor level. The author introduces a new two-phase transformative framework for optimizing HPF programs: an abstract, algebraic transformation and runtime method for reducing data movement, and a set of more machine-dependent transformations for optimizing explicit, partitioned loops. The algebraic transformation framework allows a compiler to optimize data movement at the abstract level without going into machinedependent details. The set of post-partition transformations can serve as a basis for automated optimizations. These optimizations are demonstrated in the context of a HPF compiler for CM-5 although they are applicable to other data parallel languages and other massively parallel machines. The author reports results from experiments on the Connection Machine CM-5. The results demonstrate that these optimizations can achieve signi cant target performance improvement.

Reports [193] P. Brezany, K. Sowa: Symbolic Debugger for HPF+ { Functional Speci cation (Version 1.0) Internal Report, Institute for Software Technology and Parallel Systems, University of Vienna, February 1998. [194] Digital Equipment Corporation: DECmpp 12000 Sx|High Performance Fortran Reference Manual. Report [AA-PMAHC-TE], Digital Equipment Corporation, Maynard, Massachusetts, February 1993. [195] G. Fox, S. Hiranandani, K. Kennedy, C. H. Koelbel, U. J. Kremer, C.-W. Tseng and M.-Y. Wu: Fortran D Language Speci cation. Technical Reports COMP TR90-141, Dept. of Comp. Sci, Rice Univ, Houston, TX, Dec. 1990, and SCCS-42c, Syracuse Center for Comp. Sci., Syracuse University, Syracuse, NY, April 1991. [196] HPFF: High Performance Fortran Language Speci cation Version 1.1. http://www.crpc.rice.edu/HPFF/hpf1/index.html or http://www.vcpc.univie.ac.at/mirror/HPFF/hpf1/index.html. [197] HPFF: High Performance Fortran Language Speci cation Version 2.0. January 1997. http://www.crpc.rice.edu/HPFF/hpf2/index.html or http://www.vcpc.univie.ac.at/mirror/HPFF/hpf2/index.html. [198] HPFF: HPF-2 Scope of Activities and Motivating Applications. November 1994. ftp://softlib.rice.edu/pub/HPF/hpf2-requirements.ps.gz. [199] W. Joppich and G. Ostwald: Comparison of Transposition Algorithms on a MIMD-Computer. Vergleich von Transpositionsalgorithmen auf einem MIMD-Rechner. GMD, Sankt Augustin 1996. [200] MasPar Computer Corporation: MasPar Fortran Reference Manual. MasPar Computer Corporation, 749 North Mary Avenue, Sunnyvale, California, May 1991. [201] J. H. Merlin: Techniques for the Automatic Parallelisation of Distributed Fortran 90. Technical report SNARC 92-02, Dept. of Electronics and Comp. Science, Univ. of Southampton, November 1991. [202] D. M. Pase, T. MacDonald and A. Meltzer: MPP Fortran Programming Model. Cray Research Inc., Eagan, Minnesota, August 1992. [203] J. Sanz: Data Parallel Fortran. Technical Report, IBM Almaden Research Center, March 1992. [204] Thinking Machines Corporation: CM Fortran Reference Manual Thinking Machines Corporation, Cambridge, Massachusetts, July 1991. [205] M.-Y. Wu and G. Fox: Fortran 90D Compiler for Distributed Memory MIMD Parallel Computers. Technical Report SCCS-88b, Syracuse Center for Comp. Sci., Syracuse University, Syracuse, NY, July 1991. [206] H. P. Zima, P. Brezany, B. M. Chapman, P. Mehrotra and A. Schwald: Vienna Fortran|A Language Speci cation. ICASE Interim Report 21,

108

Bibliography ICASE NASA Langley Research Center, Hampton, Virginia 23665, March 1992.

Author Index Agrawal, G. Almasi, G. S. Ancourt, C. Andre, F. Annaratone, M. Ayguade, E. Bae, S. Bagaini, C. Balzano, A. Banerjee, P. Benkner, S. Bircsak, J. A. Bolduc, M. R. Bonomi, E. Bordawekar, R. R. Bouchitte, V. Boulet, P. Boulter, B. Bozkus, Z. Brandes, T. Brezany, P. Brieger, L. Cabitza, G. Celmaster, W. N. Chakrabarti, S. Chapman, B. M. Chatterjee, S. Chaumette, S. Cheron, O. Ching, W.-M.

[7], [8], [74], [75] [1] [9], [23] [76], [77], [78], [79] [80] [10] [11], [81] [94] [94] [29] [171] [33] [33] [94] [172], [161] [12] [12], [82], [83] [84] [173], [13], [85], [86] [14], [15], [83], [87], [88], [89], [90] [76], [91], [92], [193], [206] [93], [94] [94] [16] [174], [17], [95] [18], [19], [20], [96], [97], [164], [169], [206] [21], [98] [14], [15] [76], [91] [118]

Choi, J.-D. Choudhary, A. Chyczewski Jr., T. S. Clark, T. W. Clemencon, C. Coelho, F. Counilh, M. C. Cownie, J. Crooks, P. Darte, A. Das, R. Decker, K. M. Denissen, W. Deshpande, V. R. Desprez, F. Diewald, J. A. Dincer, K. Dinda, P. A. Dinechin De, B. D. Doi, T. Dongarra, J. J. Dunlop, A. Edjlali, G. Elsesser, G. Endo, A. Ennas, M. Fahringer, T. Fischer, G. Foster, I.

[17], [95] [54], [55], [64], [114], [161] [175] [176], [22] [99], [100], [101], [102], [103] [9], [23], [104], [105], [106], [107] [14], [15] [108] [177] [4], [12], [14] [37], [55] [99], [100], [103] [56], [76], [156] [99], [100] [14], [15], [89] [33] [109], [110] [111] [112] [60] [24] [108] [74], [113] [112] [99], [100], [101], [102], [103], [166], [167] [94] [2], [25] [112] [26], [114], [115]

110 Fox, G. Fritscher, J. Fritzson, P. Gale, I. Gansterer, W. N. Garau, E. Garcia, J. Geiger, A. Germain, C. Gilbert, J. R. Girones, M. Gottlieb, A. Grabner, S. Greco, D. Gross, T. Gupta, M. Gupta, S. K. S. HPFF Habib, S. Hackstadt, S. T. Hagiwara, J. Haines, M. Hairi, S. Hansen, O. Hanxleden von, R. Harada, K. Hariri, S. Harris, J. Hart, L. Haupt, T. Hawick, K. A. Hellberg, S. Henderson, T. Hey, A. J. G. Hiranandani, S. Hirooka, T.

Author Index [55], [109], [110], [121], [145], [146], [195], [205] [99], [100], [101], [102], [103], [152] [116] [33] [178] [94] [10] [155] [107] [21], [98] [10] [1] [44], [133] [90], [94] [28] [17], [29], [31], [95], [117], [118], [119] [30] [34], [35], [196], [197], [198] [58] [120] [60] [18] [145], [146] [116] [22] [168] [53] [33] [57], [149] [145], [146] [110], [121] [108] [57], [149] [24], [51], [108] [122], [195] [153]

Hoppe, H.-C. Hu, Y. Huang, C.-H. Hudson, D. A. Humphries, J. Hwang, G.-H. Hwang, S.-Y. Hwang, Y.-S. Irigoin, F. Ishizaki, K. Iwashita, H. Johnson, B. H. Johnson, N. W. Johnsson, S. L. Joppich, W. Ju, D.-C. R. Kale, L. V. Kalns, E. T. Kamachi, T. Kaneshiro, S. Y. Kaushik, S. D. Kehagias, D. Kennedy, K. Keryell, R. Kimelman, D. Knies, A. Koelbel, C. H. Kohr Jr., D. R. Komatsu, H. Konijnenburg van, E. Koo, M.-S. Koppler, R. Kornkven, E. A. Kremer, U. J. Krishnaiyer, R. Krotz-Vogel, W. Kusano, K. Labarta, J. Le Fur, M.

[135] [36], [123] [30], [41], [128] [179] [74] [124] [45] [37], [54], [55] [9] [125] [60] [112] [33] [36], [123] [199] [124] [134] [180], [38] [39], [40], [126] [60], [127] [181], [30], [41], [128] [123] [22], [42], [48], [49], [122], [129], [195] [9] [130] [43] [3], [195] [114], [115] [125], [159] [91] [131], [132] [44], [133] [134] [182], [129], [195] [114], [115] [135] [40], [126] [10] [27], [77]

Author Index Lecca, G. Lee, J. K. Lee, S. Levesque, J. M. Li, W. Liebrock, L. M. Lilja, D. J. Long, F. J. E. Lorenzo, P. A. R. Loveman, D. B. Luz Grande, M. MacDonald, T. Maheo, Y. Malony, A. D. Martino Di, B. Masuda, N. R. Meadows, L. F. Mehrotra, P. Mellor-Crummey, J. Meltzer, A. Merlin, J. H. Metcalf, M. Middleton, D. Midki, S. Mignot, J. C. Miles, D. Mittal, P. Moon, B. Moser, W. Mueck, T. A. Muller, A. Murakami, Y. Nakamoto, S.

111 [93], [94] [45], [124] [33] [46] [47] [183], [48], [49] [165] [21] [99], [100], [136] [3], [80], [137] [10] [43], [112], [202] [77] [120] [50] [99], [100], [103] [13], [85], [86], [138] [18], [19], [20], [96], [97], [141], [206] [122] [202] [51], [139], [201] [140] [141] [118] [14], [15] [85], [138] [130] [37] [184] [92] [39], [99], [100], [101], [102], [103], [136], [142] [136] [13], [85], [86]

Nakatani, T. Nardone, C. Nathanson, D. Nedeljkovic, N. Nelson, C. A. Ngo, T. Ni, L. M. Numrich, R. W. O'Hallaron, D. R. O'Keefe, M. Oner, C. D. Onesti, L. Ostwald, G. Otto, S. W. Ozyoruk, Y. Paalvast, E. M. R. M. Pantano, M. Parashar, M. Park, M.-S. Park, S.-S. Pase, D. M. Pazat, J.-L. Perez, C. Perrin, G. R. Pieroni, E. Pingali, K. Ponnusamy, R. Pritchard, D. Raghavendra, C. S. Ramanujam, J. Ramaswamy, S. Ranka, S. Reeuwijk van, K. Robert, Y. Rodriguez, B. Roman, J. Rosendale van, J. Rossi, C. Roth, G. H. Roy-Chowdhury, A. Ruhl, R.

[52], [159] [94] [165] [42] [33], [143] [118] [38] [112] [28], [32], [111] [43] [33], [80] [144] [199] [70] [185] [56] [169] [186], [53], [145], [146] [131], [132] [131], [132] [202] [76], [77], [78], [79], [107], [147] [148] [4] [94] [47] [187], [37], [54], [55] [108] [157], [158] [64], [65], [162], [163] [188] [11], [81] [56], [156] [12] [57], [149] [14], [15] [141] [94] [189] [190] [39], [99],

112

Ryne, R. Sadayappan, P. Saini, S. Sakon, S. Sala, J. Saltz, J. Sampath, R. Sanjari, K. Sanz, J. Sato, M. Sawyer, W. Schaller, C. Schikuta, E. Schonberg, E. Schreiber, R. Schulz-Ziemer, G. Schuster, V. Schwald, A. Sebesta, R. W. Seo, Y. Seshadri, V. Sethi, A. Shalaby, N. Sharma, S. D. She er, T. J. Shenoy, U. N. Shields, D. Shindo, T. Shiroto, Y. Sips, H. J. Sivaraman, H. Skov, P. Srevik, T. Sowa, K. Srinivasan, H. Steele Jr., G. L.

Author Index [100], [102], [103], [142] [58], [150] [30], [41], [128] [151] [40], [126] [116] [7], [8], [37], [54], [55], [74], [75], [113] [152] [76], [91] [203] [153] [99], [100], [103] [59] [92] [31], [118], [119], [130] [3], [21], [98], [154] [155] [13], [85], [86] [206] [5] [39], [40], [126] [118] [42], [122] [123] [37] [98] [160] [118] [60], [127] [40] [56], [156] [157], [158] [116] [144] [193] [31] [3]

Steidel, J. L. Stewart, A. Strumpen, V. Sturler de, E. Subhlok, J. Suehiro, K. Suganuma, T. Sumana, S. Sussman, A. Sweeney, P. F. Tamura, M. Teng, S. H. Thakur, R. Thirumalai, A. Thomas, H. Tsaur, I.-K. Tseng, C.-W. Ujaldon, M. Van Rosendale, J. Vanderlip, J. Venkatachar, A. Volkert, J. Vondran, G. Wada, K. Wagener, J. L. Wakatani, A. Walker, D. W. Wang, K.-Y. Warren, K. H. Watanabe, Y. Webb, J. Wiel van der, S. P. Wismuller, R. Wolfe, M. Wu, J.-J. Wu, M.-Y. Wylie, B. J. N.

[112] [61] [62] [62], [103] [28], [32], [63] [39], [40], [126] [159] [160] [8], [74], [113] [130] [39], [40], [126] [21] [191], [64], [161] [65], [162], [163] [78] [45] [195] [164] [18] [66] [163] [44], [133] [63] [153] [67] [68], [69] [70] [118], [130] [71] [40] [32] [165] [116] [6], [69] [192] [195], [205] [72], [99], [100], [101], [102], [103], [136], [152], [166], [167]

Author Index Xiaojie Li Yamamoto, F. Yook, H.-G. Young, M. Zapata, E. L. Zernik, D. Zima, H. P.

Zimmermann, F. Zosel, M. E.

113 [168] [153] [131], [132] [13], [85], [86], [138] [164] [130] [18], [19], [20], [73], [96], [97], [164], [169], [206] [99], [100], [103] [3], [170]