PARALLEL OCEAN CIRCULATION MODELING ON CEDAR

BY

LUIZ ANTÔNIO DE ROSE

Bach., Universidade de Brasília, 1978
M.Stat., Universidade de Brasília, 1982

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1992

Urbana, Illinois


ABSTRACT

The simulation of ocean circulation is a problem of great importance in the environmental sciences of today. Realistic simulations impose great demands on existing computer systems, requiring considerable computational power and storage capability. An implementation of an ocean general circulation model on the Cedar multicluster architecture is presented. It is based on the GFDL three-dimensional model, which was adapted for simulation of the Mediterranean. The model simulates the basic aspects of large-scale, baroclinic ocean circulation, including treatment of irregular bottom topography. The data and computational mapping strategies and their effect on performance are discussed. The Cedar version of the code, using four clusters and 32 processors, has demonstrated significant speedup compared to a single cluster and compared to a single processor.


To my parents, Julio and Niomar De Rose


ACKNOWLEDGEMENTS

This work was supported by the U.S. Department of Energy under Grant No. DOE DE-FG02-85ER25001, with additional support provided by the State of Illinois Department of Commerce and Community Affairs, State Technology Challenge Fund, under Grant No. SCCA 91-108, by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, and by the National Science Foundation under Grant No. NSF CCR900000N for the use of the Cray Y-MP/48 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

I am indebted to the many people who helped me during all these years. First of all I would like to thank my advisors, Stratis Gallopoulos and Kyle Gallivan, for all their support, advice, assistance, and patience during the development of this work. Thanks to Dr. Antonio Navarra for his help and for providing the code. I would also like to thank Professor Ahmed Sameh, who made it possible for me to develop my research at CSRD. I wish to thank all the CSRD staff, and my fellow graduate students and officemates, Alan, Bret, Brian, George, Henry, Jose, Peter, and Xiaoge, for their friendship, help, and support during the course of this work. Also, I would like to thank my friends from Brazil who encouraged and helped me to come to Illinois for my graduate studies, especially Jairo, Moura, Luiz Antônio, and the staff from CNPq. Thanks to my parents for their continuous support during all these years in graduate school. Last but not least, I would like to express my deepest gratitude to four special people in my life, my wife Jane and my children Pedro, Lígia, and Luiza: my wife for understanding me and giving me encouragement and support, especially during the difficult times, and my children for adding joy and happiness to my life.


TABLE OF CONTENTS

1  INTRODUCTION

2  DESCRIPTION OF THE MODEL
   2.1  Model equations
   2.2  Time stepping scheme

3  OCEAN MODELING HISTORY AND ORIGINAL CODE DESIGN
   3.1  History of past efforts
   3.2  Efforts using parallel supercomputers
   3.3  Original code design
        3.3.1  Grid representation
        3.3.2  Structure of the code
        3.3.3  Data management

4  DESIGN AND IMPLEMENTATION OF THE CODE ON CEDAR
   4.1  Cedar system
   4.2  Design of the code for the baroclinic phase
        4.2.1  The issue of land points
        4.2.2  Grid partitioning
        4.2.3  Computational, synchronization and communication issues
        4.2.4  Memory utilization
   4.3  Design of the code for the barotropic phase
        4.3.1  Description of the successive over-relaxation
        4.3.2  Initialization issues
        4.3.3  Spike algorithm
        4.3.4  Hole relaxation
        4.3.5  Overall structure of the multicluster relaxation
   4.4  Control flow and structure of the code
   4.5  Implementation details

5  RESULTS
   5.1  Introduction
   5.2  Memory usage
   5.3  Execution time for the relaxation routine
   5.4  Performance results
   5.5  Conclusions about data placement strategy and the influence of prefetch
   5.6  Comparisons between data partitioning schemes and vector length issues
   5.7  Granularity using four clusters
   5.8  Problems with cluster memory bandwidth

6  CONCLUSIONS AND FUTURE WORK

APPENDIX A: MATHEMATICAL FORMULATION OF OCEAN SIMULATION
   A.1  Continuous formulation
        A.1.1  Continuous equations of the model
        A.1.2  Boundary conditions
        A.1.3  The stream function
   A.2  Finite difference formulation
        A.2.1  Integral constraints
        A.2.2  Grid representation
        A.2.3  Finite difference equations
        A.2.4  Finite difference form of the stream function
        A.2.5  Finite difference form of the tracer equations

REFERENCES

LIST OF TABLES

5.1:  Memory usage in Mbytes for the model P4L8.
5.2:  Memory usage in Mbytes for the model P8L16.
5.3:  Distribution of the global memory usage in Mbytes for the CC version of data set P8L16.
5.4:  Average number of page faults per cluster per time step for data set P8L16.
5.5:  Average time per time step for the relaxation routine.
5.6:  Original program, average runtime per time step.
5.7:  Model P4L8 - GC version with prefetch.
5.8:  Model P4L8 - GC version without prefetch.
5.9:  Model P4L8 - CC version.
5.10: Model P4L8 - GG version with prefetch.
5.11: Model P4L8 - GG version without prefetch.
5.12: Model P8L16 - GC version with prefetch.
5.13: Model P8L16 - GC version without prefetch.
5.14: Model P8L16 - CC version.
5.15: Model P8L16 - GG version with prefetch.
5.16: Model P8L16 - GG version without prefetch.
5.17: Differences in execution times from the GG to the GC versions, using model P8L16 with 8 CEs.
5.18: Multi-CE speedups for model P8L16, using 1 and 4 clusters with 8 CEs per cluster.

LIST OF FIGURES

3.1: Representation of the basin.
3.2: Structure of the original code.
4.1: Cedar architecture.
4.2: Representation of the grids with row partitioning.
4.3: Representation of the grids with column partitioning.
4.4: Representation of the grids with two-dimensional partitioning.
4.5: Algorithm for one iteration of the SOR.
4.6: Multicluster structure of the relaxation.
4.7: Structure of the multicluster code.
5.1: Slowdown for model P4L8, varying the global memory access time.
5.2: Slowdown for model P8L16, varying the global memory access time.
5.3: Multi-CE efficiency (E_vCE(1, p)) for the one-cluster code using model P8L16.
A.1: Horizontal grid.
A.2: Vertical grid.


CHAPTER 1

INTRODUCTION

The numerical modeling of ocean circulation is a task of great importance for climate prediction studies. Due to the interaction between the atmosphere and the oceans, developing adequate predictive skill for the sea surface temperature is an important step towards better climate studies. There are many interconnections between the physics and dynamics of the atmosphere and the oceans, inasmuch as the fundamental equations that govern the motion of the atmosphere and the oceans are derived from the same basic laws of physics. Hence, some techniques for numerical weather prediction have been successfully applied to the simulation of general ocean circulation models. In spite of the similarities between the two models, there are considerable differences in the characteristic time and space scales of these two media. For example, the oceanic time scale is much larger than the atmospheric one, because the atmosphere tends to respond faster to adjustments in external conditions than the oceans do. Therefore, while atmospheric simulations need to cover periods on the order of days or months, ocean models require simulations on the order of years or decades. Also, seawater is nearly incompressible, while the assumption of incompressibility does not hold for the atmosphere. Another fundamental difference is the horizontal resolution: ocean models require a larger number of grid points than atmospheric models. This requirement is due to the order of the internal radius of deformation¹, which is much smaller in the ocean than in the atmosphere. These time and space requirements necessitate the utilization of more powerful computational environments to simulate oceanic general circulation models and coupled ocean-atmosphere models. Due to these "computational" reasons, it is only recently that realistic global and detailed regional ocean circulation models have become feasible. Not surprisingly, such achievements were intimately linked to the availability of high-performance computer architectures.

¹ The internal radius of deformation L is defined as the ratio of the phase speed of gravity waves (c_gr) to the Coriolis parameter: L = c_gr / f.

There has been previous work done on the topic of ocean circulation models on high-performance computer architectures. These models are all based on the primitive equations model designed by Bryan and Cox [Bry69]. Two major reorganizations of the code were due to Cox [Cox84], suitable for vector machines with long start-up times (e.g., the Cyber-205), and to Semtner [Sem74], for register-to-register vector architectures of the Cray class. Recently, after a decade of work on organizing the basic code for the use of vector processing, attention is focusing on the exploitation of parallelism. Although we have described advances made on supercomputer systems, interest is also growing in other systems, ranging from massively parallel systems to (low-priced) minisupercomputers and interconnected workstations.

In this work we present our implementation of an ocean general circulation model on Cedar, a multicluster architecture with fully hierarchical memory that is being developed at the Center for Supercomputing Research and Development of the University of Illinois. With its scalable high-performance multiprocessor system, Cedar represents a model for the future of parallel computers. We believe that the implementation of real applications on Cedar, and the full exploitation of the options given by its architecture, is an excellent laboratory experiment for the development of applications aimed at the next generation of parallel computers. The main goal in this work was to implement on Cedar an application with large memory requirements that demands computational power and represents the state of the art in its field. Our first objective was to exploit Cedar's scalable architecture by experimenting with different implementations of the same code, trying to detect the weaknesses and the strengths of the architecture. Our second objective was to take advantage of the fact that Cedar can simulate different parallel programming paradigms to study the best way of implementing a parallel version of the chosen application on some of today's parallel computers.

The Ocean General Circulation Model used in this work is based on a basic model of the Geophysical Fluid Dynamics Laboratory (GFDL) [Cox84], as adapted at the Istituto per lo Studio delle Metodologie Geofisiche Ambientali (IMGA-CNR) to the Mediterranean basin geometry [PN88]. The model simulates the basic aspects of large-scale, baroclinic ocean circulation, including treatment of irregular bottom topography. It is used in climate studies and also to study the development of mid-ocean eddies. Temperature, salinity, and the prediction of currents are the main physical phenomena of interest. We also present some comparisons with the same model running on the Cray Y-MP and on the Alliant FX/80, a vector multiprocessor. A preliminary study of this model running on the Alliant² FX/8 can be found in [DGG89].

² The main difference between the Alliant FX/8 and the Alliant FX/80 is that the former uses up to 8 CEs (Computational Elements), while the latter is a newer version that uses up to 8 ACEs (Advanced CEs), a faster processor with a smaller startup time for vector operations.

The structure of this thesis is as follows. Chapters 2 and 3 introduce the basic information about the model and the original code that is necessary for the following chapters. Chapter 2 presents a brief description of the ocean simulation model, leaving a more detailed mathematical formulation of the problem to Appendix A. The original code design is presented in Chapter 3. A reader who already knows the original design of the code may skip these two chapters without loss of continuity. An overview of the Cedar architecture, and the design and implementation of the code on Cedar, are described in Chapter 4. Chapter 5 contains the experimental results, and finally the conclusions and remarks for further work are presented in Chapter 6.


CHAPTER 2

DESCRIPTION OF THE MODEL

This chapter presents a summary of the mathematical model of ocean simulation, followed by a description of the time stepping scheme. A complete mathematical formulation of the model, along with the finite difference formulation, is presented in Appendix A.

2.1 Model equations

The mathematical model uses the Navier-Stokes equations with three basic assumptions: the Boussinesq approximation, in which density differences are neglected except in the buoyancy term; the hydrostatic assumption, where local acceleration and other terms of equal order are eliminated from the equation of vertical motion; and the turbulent viscosity hypothesis, in which stresses exerted by scales of motion too small to be resolved by the grid are represented as an enhanced molecular mixing. Temperature and salinity are calculated using conservation equations, and the equations are linked by a simplified equation of state. The equations are written in the spherical coordinate system (\lambda, \phi, z), with depth z defined as negative downward from z = 0 at the surface. The model equations are:

1. Momentum:

\frac{\partial \vec{u}_h}{\partial t} + \vec{u}\cdot\nabla\vec{u}_h + \vec{f}\times\vec{u}_h = -\frac{1}{\rho_0}\nabla_h p + A_h \nabla^2 \vec{u}_h + (A_v \vec{u}_{h,z})_z   (2.1)

2. Hydrostatic approximation:

p_z = -\rho g   (2.2)

3. Incompressibility:

\nabla\cdot\vec{u} = 0   (2.3)

4. Tracers (temperature, salinity):

\frac{\partial T}{\partial t} + \vec{u}\cdot\nabla T = K_h \nabla^2 T + (K_v T_z)_z   (2.4)

\frac{\partial S}{\partial t} + \vec{u}\cdot\nabla S = K_h \nabla^2 S + (K_v S_z)_z   (2.5)

5. Equation of state:

\rho = \rho(T, S, p)   (2.7)

where \vec{u} = (u, v, w) is the velocity vector, \vec{u}_h its horizontal components, p and \rho are pressure and density, \vec{f} = 2\Omega\sin\phi\,\vec{k}, T and S are the temperature and salinity tracers, and A_{h,v}, K_{h,v} are the turbulent diffusion coefficients.

The boundary conditions for momentum and tracer fluxes are as follows. At the ocean surface (z = 0):

\rho_0 A_v (u_{h,z}, v_{h,z}) = \vec{\tau}   (2.8)

K_v \frac{\partial (T, S)}{\partial z} = 0   (2.9)

w = 0   (2.10)

where \vec{\tau} is a seasonal wind stress. At the bottom z = -H(\lambda, \phi):

w = -\vec{u}_h \cdot \nabla H   (2.11)

(T_z, S_z) = 0   (2.12)

(u_z, v_z) = 0   (2.13)

At the side wall boundaries, the normal and tangential horizontal velocities and the horizontal fluxes of sensible temperature and salinity are set to zero.

Under the rigid-lid boundary condition, and using vertically averaged velocity components (u, v), the external mode of momentum is written in terms of a volume transport stream function \psi for the vertically integrated flow:

u = -\frac{1}{Ha}\frac{\partial \psi}{\partial \phi} ,   (2.14)

v = \frac{1}{Ha\cos\phi}\frac{\partial \psi}{\partial \lambda} .   (2.15)

A Poisson-type prognostic equation for \psi is given by

\frac{1}{a^2}\left[\frac{\partial}{\partial\lambda}\left(\frac{1}{H\cos\phi}\frac{\partial^2\psi}{\partial\lambda\,\partial t}\right) + \frac{\partial}{\partial\phi}\left(\frac{\cos\phi}{H}\frac{\partial^2\psi}{\partial\phi\,\partial t}\right)\right] = \frac{1}{a}\left[\frac{\partial^2 v'}{\partial\lambda\,\partial t} - \frac{\partial}{\partial\phi}\left(\cos\phi\,\frac{\partial u'}{\partial t}\right)\right] .   (2.16)

The boundary conditions for \psi at the lateral walls are

\psi = \frac{\partial\psi}{\partial t} = 0 .   (2.17)


2.2 Time stepping scheme

An explicit, second order scheme is used for the time stepping. Hence, data from three time steps are required during the computation of any grid point, because the leapfrog method uses data from the current and the previous time steps to predict values for the future time step. It is well known, however, that the repeated use of leapfrog tends to separate the solutions of the even and odd steps. A standard technique to correct this is to periodically execute a simple forward Euler step ([Kas77]). The time step is divided into two parts: the baroclinic part, a three-dimensional phase that sweeps the grid from south to north, predicting the new values and accumulating average values for each vertical line, and the barotropic part, which consists of the solution of the resulting two-dimensional Poisson equation for the mass transport stream function.
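As an illustration of this scheme, the following sketch shows a leapfrog update with a periodic forward Euler step for a single prognostic field. The variable names, loop bounds, the Euler interval, and the right-hand side are placeholders invented for this sketch; they are not the thesis code.

! Illustrative sketch (not the thesis code): leapfrog time stepping for a
! single prognostic field, with a forward Euler step taken periodically to
! suppress the separation of the even and odd leapfrog solutions.
program leapfrog_sketch
  implicit none
  integer, parameter :: n = 100          ! number of grid points (placeholder)
  integer, parameter :: nsteps = 1000    ! number of time steps (placeholder)
  integer, parameter :: neuler = 17      ! Euler mixing interval (placeholder)
  real :: phi_old(n), phi_cur(n), phi_new(n), dt
  integer :: istep

  dt = 0.01
  phi_old = 0.0
  phi_cur = 0.0

  do istep = 1, nsteps
     if (mod(istep, neuler) == 0) then
        ! Forward Euler step: uses only the current time level
        phi_new = phi_cur + dt * rhs(phi_cur)
     else
        ! Leapfrog step: centered in time, spanning 2*dt, and therefore
        ! needing data from both the current and the previous time levels
        phi_new = phi_old + 2.0 * dt * rhs(phi_cur)
     end if
     phi_old = phi_cur      ! time level n becomes n-1
     phi_cur = phi_new      ! time level n+1 becomes n
  end do

contains

  ! Placeholder right-hand side; the real model evaluates here the finite
  ! difference form of the momentum and tracer equations.
  function rhs(f) result(r)
    real, intent(in) :: f(:)
    real :: r(size(f))
    r = -f
  end function rhs

end program leapfrog_sketch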


CHAPTER 3

OCEAN MODELING HISTORY AND ORIGINAL CODE DESIGN

This chapter reviews the history of past efforts in numerical ocean simulation, as presented in [WP86, Sem86b], followed by references to some recent work using vector supercomputers, and finishes with the presentation of the original code design.

3.1 History of past efforts

The first global numerical ocean simulation models were developed by Bryan [Bry63] in the United States and by Sarkisyan [Sar66] in the Soviet Union. In 1967, the results from the first numerical experiments with a primitive equation, three-dimensional ocean model were presented by Bryan and Cox [BC67]; these were followed by a description of the physics and numerics involved in the code by Bryan [Bry69]. Although ocean models have since been developed by several research groups, most of them are derived from the aforementioned original work of Bryan and Cox. Semtner [Sem74] improved the structure of the code and added various features to the mathematical formulation, such as hole relaxation [Tak74] for treating islands. Later, Cox [Cox84] further modified Bryan's model to improve the numerical and computational efficiency of the code. Pinardi and Navarra adapted Cox's program to the Mediterranean basin geometry [PN88].

3.2 Efforts using parallel supercomputers

Andrich, Madec et al. in [AM88, ADLM88] describe a code written for the Cray 2, applied to sections of the Atlantic and of the Mediterranean. An interesting aspect of their work is their emphasis on efficient elliptic solvers for the two-dimensional mass transport stream function.

Singh and Hennessy in [SH89] discuss the parallel simulation of a two-level quasi-geostrophic model on a sixteen processor Encore MULTIMAX. Despite the simplicity of the model, their work is very interesting because its point of view is from the computer science perspective, with a detailed discussion of the parallelization and memory management strategies.

Recently, Semtner and Chervin modified Semtner's model [CS88, Sem86a, SC88] to combine microtasking and vector processing, thus achieving impressive performance in terms of MFLOPS on a 4-CPU Cray X-MP in several global ocean circulation experiments. Smith, Dukowicz, and Malone [SDM91] implemented the Semtner-Chervin model on the Connection Machine CM-2, obtaining performance similar to that of the Cray X-MP. They also developed and implemented a new formulation of the barotropic equations which involves the surface pressure field rather than the stream function, and used a parallel preconditioned conjugate-gradient method. With this new model, they reported an improvement of 70% over their original version. A new generation of computationally more demanding models is under development in the context of the CHAMMP activity [CHA90]. Another model that considers parameterizations, especially surface energy balance and vertical turbulent mixing, was recently developed by Pacanowski, Dixon, and Rosati [PDR90] at the Geophysical Fluid Dynamics Laboratory, for the Cray Y-MP.

3.3 Original code design

The original program was first written in Fortran for the IBM 360/91, and was subsequently redesigned for vector computers. Among the design principles cited in [Bel81, CS88] for the design of the original program [Sem74] are the following: a small number of operations in each loop; data transfer organized as if only a small amount of main memory were available; and the avoidance of if statements within loops, to improve the performance of the vector statements. The program has been ported to several machines, including the Cray series and the now discontinued Cyber-205 series. There exist two versions of the model, an in-core version and an out-of-core version, differing only in one subroutine, which is responsible for the transfer of data between memory and disk¹. The code accepts as input several parameters, such as grid size, number of islands, etc., in order to simulate basins with different sizes and characteristics. The main aspects of the implementation of the code on the Cyber 205 at the Geophysical Fluid Dynamics Laboratory [Cox84] are presented in this section, starting with an overview of the grid representation and the program flow, and finishing with the description of the data management.

¹ Following the documentation of the program ([Cox84]), the description in this chapter is for the out-of-core version. The in-core version has the same structure, but uses a large buffer to simulate the disk.

Figure 3.1: Representation of the basin. (The figure shows the grid indices: I runs from west to east, J from south to north, and K downward with depth.)

3.3.1 Grid representation

The Cyber 205 is a memory to memory vector supercomputer that requires large vector lengths to realize high performance. This requirement is due to the large vector start-up time of its pipelines. In order to avoid computations with small vector lengths, the design philosophy was to use a grid with rectangular shape containing land areas. The data is organized in J \slabs" of xed latitude. Each slab consists of all I  K grid cells lying on a given longitude-depth plane, as shown in Figure 3.1. The computations in the three-dimensional phase are executed over both ocean and land areas, with masks being used to distinguish the cells that correspond to ocean from land. The cells with ocean data have their masks set to one, while the others are set to zero. Thus, by multiplying the mask by any array containing data, the result for the land areas will be set to zero. In order to avoid any special consideration at the boundaries, such as if statements to detect the border of the basin, one extra latitude or longitude cell is used at each side of the basin. These cells are also treated as land areas, and have their mask set to zero.
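A minimal sketch of the masking idea described above, with assumed array names and shapes (not the actual code), is:

! Illustrative sketch: zeroing out land cells in one longitude-depth slab by
! multiplying it with a 0/1 mask, so that no if statements appear in the loop
! and the vector length stays equal to the full longitudinal extent.
subroutine apply_mask(field, mask, imt, km)
  implicit none
  integer, intent(in) :: imt, km          ! longitudinal points, vertical levels
  real, intent(inout) :: field(imt, km)   ! computed data for one slab
  real, intent(in)    :: mask(imt, km)    ! 1.0 over ocean cells, 0.0 over land
  field = field * mask                    ! land cells are forced to zero
end subroutine apply_mask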


3.3.2 Structure of the code

The structure of the code, shown in Figure 3.2, consists of the main program and nine subroutines, which perform the following functions²:

Main starts from scratch or continues a previous run. It reads the input data, performs the initialization, prints the initial configuration of the model, and calls the subroutine Step once per time step.

Step controls the time stepping. It is divided into four parts. The first part corresponds to initialization, during which the time step counter is updated and various quantities used to analyze the solution are initialized. The second part is the bootstrap procedure, where the second and third slabs are read from disk³, several arrays including the data for the boundary are initialized, and some quantities for the second slab are computed. The third part corresponds to the slab-by-slab computation: it updates all the slabs except the borders by calling the subroutines Clinic and Tracer, prints the solution of the newly computed slab (j), writes it to disk, and gets a new slab (j+2) from disk. Finally, the last part calls the subroutine Relax and saves the relaxation results to disk.

Stinit is called by Step to load the appropriate normalization constants and coefficients into arrays of proper dimension. These vector movements are done to try to improve the vectorization in the subsequent calls to State and Clinic.

Clinic computes, for one slab, the internal mode component of the u and v velocity fields (equations A.53 and A.54), as well as the vorticity driving function (A.65), used later by Relax.

Tracer is called by Step for every slab to compute temperature and salinity, according to equation (A.66).

State and Statec are called by Clinic and Tracer respectively, once per slab, to compute the third order polynomial approximation to the Knudsen formula (A.11). State is also called by Step once per slab during the bootstrap procedure.

Relax is called once by Step at the end of each time step. It takes the vorticity driving function computed in Clinic and, using successive over-relaxation, solves a two-dimensional Poisson equation for the mass transport stream function (A.65). It computes an initial guess for the relaxation by extrapolating the solutions of the two previous time steps, and uses the maximum absolute residual to test for convergence.

Matrix is a two-dimensional array printing routine, and is called by Step on specified time steps.

Odam is the Ocean Direct Access Manager, which consists of three subroutines that handle the transfer of data between memory and disk. The routine Ostart is called once by program Main to open the files used by the model, and the routines Oput and Oget are responsible for writing and reading the files, respectively.

² The equations referenced in this section are presented in Appendix A, which contains the mathematical formulation of the model.
³ The first slab is set to zero, due to the boundary considerations described in Section 3.3.1.


Figure 3.2: Structure of the original code. (The figure shows the calling structure among Main, Step, Stinit, State, Clinic, Tracer, Statec, Relax, Matrix, and the Odam routines.)

In summary, program Main initializes the model and calls Step. Step organizes the data in memory and calls Clinic and Tracer for the slab-by-slab computation, from south to north through the basin. After the completion of the last slab, Step calls Relax and returns control to Main, which may continue the loop by calling Step for another time step.
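The control flow just summarized can be sketched as below. The subroutine names follow the description above, but the argument lists, loop bounds, and empty stubs are placeholders used only to make the sketch self-contained.

! Illustrative sketch of the original control flow: Main loops over time
! steps, Step sweeps the slabs from south to north calling Clinic and Tracer,
! and Relax solves the stream function equation at the end of each step.
program main_sketch
  implicit none
  integer :: n
  integer, parameter :: ntsteps = 10      ! number of time steps (placeholder)
  ! ... read input data, initialize the model, print initial configuration ...
  do n = 1, ntsteps
     call step()
  end do
contains
  subroutine step()
    integer :: j
    integer, parameter :: jmt = 64        ! number of latitude rows (placeholder)
    ! initialization and bootstrap: read slabs 2 and 3, set up boundary arrays
    do j = 2, jmt - 1                     ! slab-by-slab sweep, south to north
       call clinic(j)                     ! internal mode velocities, vorticity driving function
       call tracer(j)                     ! temperature and salinity
       ! write slab j to disk, read slab j+2 from disk
    end do
    call relax()                          ! SOR solution for the mass transport stream function
  end subroutine step
  subroutine clinic(j)
    integer, intent(in) :: j              ! stub standing in for the real computation
  end subroutine clinic
  subroutine tracer(j)
    integer, intent(in) :: j              ! stub
  end subroutine tracer
  subroutine relax()                      ! stub
  end subroutine relax
end program main_sketch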

3.3.3 Data management

Three-stage buffer

During the computation of the prognostic variables in any grid cell, the leapfrog method reads and writes data from three time steps, namely the predicted one (n+1), the current one (n), and the previous one (n−1), which consequently must be present in memory. As no data from other time steps is required, a three-stage buffering method is used to store and access the data. The first buffer is used to write the data for the future time step, and the second and third buffers are used to read the data from the current and previous time steps. At the end of each time step, the indices of the buffers are permuted in such a way that the buffer containing the oldest time step values (n−1) becomes the buffer designated for the next write, and the other two buffers become read buffers.

The buffers contain the values of temperature, salinity, and the velocity components u and v for every grid point in the three-dimensional grid. They also contain the two-dimensional topography and wind stress data arrays. These latter values are time invariant, but are kept in the disk buffers to reduce memory requirements. In this way, instead of having to store in the work space one array of dimension I × J for each time-invariant quantity, the program needs only three vectors of dimension I, one for each row that is being used during the computation of the second-order accurate finite-difference approximation. This can be done because the computations are executed by handling the J index independently. Hence, by keeping this invariant data in the disk buffers, the row dimension (J) of these two-dimensional variables is reduced to three. On the other hand, this memory saving generates more I/O traffic and more data movement, because these variables must be read and written for every slab.

For the solution of the Poisson equation, two disk buffers are used to store the solutions of the two previous time steps. The two disk buffers are read at the beginning of the relaxation, and the new solution is stored in the buffer that contains the older data.
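The index permutation at the end of each time step can be sketched as follows; the names are assumptions for this illustration, and the real program permutes disk buffer indices in an equivalent way rather than copying any data.

! Illustrative sketch of the three-stage buffer rotation described above.
! ibuf(1): buffer being written for the future time step (n+1)
! ibuf(2): buffer read for the current time step (n)
! ibuf(3): buffer read for the previous time step (n-1)
subroutine rotate_buffers(ibuf)
  implicit none
  integer, intent(inout) :: ibuf(3)
  integer :: ioldest
  ioldest = ibuf(3)      ! buffer holding the oldest data (n-1)
  ibuf(3) = ibuf(2)      ! current step (n) becomes the previous step (n-1)
  ibuf(2) = ibuf(1)      ! newly written step (n+1) becomes the current step (n)
  ibuf(1) = ioldest      ! the oldest buffer is reused for the next write
end subroutine rotate_buffers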

Common blocks

The computation of the second-order accurate finite-difference approximation of the horizontal derivatives of a variable at each grid point requires the values of the variable at the two latitudinal and the two longitudinal neighbors. Each slab contains all the longitudinal data from a given latitude (j), and the latitudinal neighbors of a grid point in slab (j) are stored in slabs (j−1) and (j+1). Hence, for the computation of the variables of any row from a given latitude (j), the program uses a work space containing seven slabs: three slabs with the values of the variables from latitudes (j−1), (j), and (j+1) for the current time step (n), three slabs with the values of the variables from these latitudes but from the previous time step (n−1), and one slab containing the data for latitude (j) which is being computed for time step (n+1). The work space also contains the time-invariant data described above, and temporary arrays that are used during the slab-by-slab computation.

In order to reduce storage requirements, the three-dimensional phase of the code shares its work space with the relaxation routine⁴. This is possible because the relaxation procedure is executed after the three-dimensional phase. The values of the stream function for the current and previous time steps, and an array containing the change in vorticity that is updated during the slab-by-slab computation, are stored in a different common block, which is used during the relaxation procedure. The other common blocks used by the program are for control variables, scalar values, masks describing boundaries and land areas, one-dimensional quantities, and diagnostic variables. These latter ones can be considered as temporary storage, because they can be computed directly from stored variables.

⁴ This common block will be referred to as the relax work space (R) when it is being used inside the relaxation procedure, and as the standard work space (S) otherwise.

Memory requirements

The largest common block used by the program is the one containing the work space. Defining P to be 4 or 8, depending on whether the program uses 32-bit or 64-bit floating point precision, the memory requirements of this common block depend on the size of the grid. If I, J, and K are the total numbers of points in the longitude, latitude, and vertical directions respectively, the memory requirement for the standard work space in bytes is

Mem(S) = P · (96·I·K + 57·I) ;   (3.1)

letting nisle be the number of islands in the model, the memory requirement in bytes for the work space used during the relaxation is

Mem(R) = P · (8·I·J + I + nisle) .   (3.2)

If no island is present, the work space used by the relaxation routine will be smaller than the standard work space if J < 12·K + 7. The total work space requirement is the maximum of (3.1) and (3.2). The next largest common block, of size P · (4·I·J) bytes, contains the values of the stream function from the two previous time steps, the change of vorticity across one time step, and the reciprocal of the total depth at u and v points. The remaining common blocks have negligible size compared to the previous two. Thus, the total memory requirement for the out-of-core model is approximately

max[Mem(S), Mem(R)] + P · (4·I·J) bytes.   (3.3)

The in-core version of the code requires an additional common block, in order to simulate the disk. The memory requirement for this common block is

Size(disk) = P · (20 + 7·I·J + 3·[J·(4·I·K + 2·I)]) bytes.   (3.4)

This corresponds to the data stored by the out-of-core model in the five disk files used by the program.
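As a worked illustration of formulas (3.1)-(3.4), the small program below evaluates them for an arbitrary grid. The grid dimensions in the example are made up and do not correspond to any of the data sets used later in the thesis.

! Illustrative evaluation of the memory formulas (3.1)-(3.4) above.
! The grid size is a placeholder; P = 8 assumes 64-bit precision.
program memory_sketch
  implicit none
  integer, parameter :: P = 8                    ! bytes per floating point word
  integer :: I, J, K, nisle
  real :: memS, memR, memTotal, memDisk

  I = 122; J = 62; K = 16; nisle = 0             ! made-up grid dimensions

  memS     = real(P) * (96*I*K + 57*I)                        ! (3.1) standard work space
  memR     = real(P) * (8*I*J + I + nisle)                    ! (3.2) relaxation work space
  memTotal = max(memS, memR) + real(P) * (4*I*J)              ! (3.3) out-of-core total (approx.)
  memDisk  = real(P) * (20 + 7*I*J + 3*(J*(4*I*K + 2*I)))     ! (3.4) in-core "virtual disk"

  print '(A, F12.1)', 'Mem(S)  bytes: ', memS
  print '(A, F12.1)', 'Mem(R)  bytes: ', memR
  print '(A, F12.1)', 'Total   bytes: ', memTotal
  print '(A, F12.1)', 'Disk    bytes: ', memDisk
end program memory_sketch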


CHAPTER 4

DESIGN AND IMPLEMENTATION OF THE CODE ON CEDAR

This chapter presents the design and implementation of a parallel version of the basic Ocean General Circulation Model of GFDL [Cox84] on Cedar. The first section reviews the Cedar system, describing the major components of its architecture, the operating system, and the Cedar Fortran language. For a more detailed description of the Cedar system see [KDLS86, EPY89, Yew86, KTV+91]. Section 4.2 contains the design of the Cedar code for the baroclinic phase; it discusses the three grid partitioning strategies considered in our design and addresses the different approaches to placing the data structures to take advantage of Cedar's hierarchical memory. Section 4.3 describes the modifications in the relaxation routine, which led to more than a threefold performance improvement for this part of the code. To complete the description of the multicluster design, the control flow and the high-level structure of the code are described in Section 4.4. Finally, Section 4.5 covers some implementation details.

4.1 Cedar system

The Cedar system, which is being developed at the Center for Supercomputing Research and Development of the University of Illinois, has as its main characteristic the hierarchical organization of its computational capabilities and memory system. It is a multicluster-based architecture (see Figure 4.1) with four clusters, where each cluster is a modified Alliant FX/8 machine with 8 computational elements (CEs). Three levels of parallelism can be applied. First, vectorization can be used within a CE. Second, small grain parallelism can be exploited by using concurrency within the cluster. Finally, medium and large grain parallelism can be used across the clusters. The memory system has four levels in its hierarchy, namely registers for each CE, cache for each cluster, cluster memory, and global memory. As expected, the cost of access increases at each level. Each CE has its own set of scalar and vector registers. The CEs in each cluster share the cache and cluster memory. At the highest level of the hierarchy, a global memory is shared by the CEs of all the clusters.

Figure 4.1: Cedar architecture. (The figure shows the four Alliant-designed clusters, each with its computational elements, CE and IP caches, cluster switch, memory bus, cluster memory, interactive processors, and concurrency control bus, connected through the CSRD-designed global interfaces and global network to the global memory modules.)

Each cluster has 16 Mbytes of cluster memory and 512 Kbytes of cache. The global memory size is 64 Mbytes (32 memory banks). One problem of this Cedar configuration is that a significant portion of the cluster memory is used by the operating system, leaving less than 10 Mbytes available for the user. However, there are plans to increase the cluster memory in the future.

Cedar utilizes pipelining in the global memory system, and can have multiple outstanding memory requests from each CE. Each Alliant processor has a restriction on the number of outstanding read requests, which would make it impossible to fully utilize the shared memory system pipeline. The solution adopted to overcome this limitation was the use of data prefetching, which allows the processor to start a block move from the global memory and continue execution regardless of how many read requests are outstanding. For testing purposes, this feature can be turned off with the use of compiler flags. For more details on the prefetch unit see [KTV+91].

The processors and the global memory are connected via the global interconnection network, which consists of two unidirectional packet-switched networks. These switching networks are 2-stage Omega networks built from 8 × 8 crossbar switches. For a more complete discussion of the memory system see [ME87].

The Cedar operating system, Xylem, extends the Alliant Concentrix operating system to include multitasking and virtual memory management of the Cedar memory hierarchy (see [Emr85]). A Xylem process consists of one or more independently scheduled program segments, which execute asynchronously across the Cedar system. These program segments are called cluster-tasks. System calls are provided by Xylem for starting and stopping tasks, waiting for tasks to finish, and for inter-task synchronization.

The Cedar Fortran language [Hoe91, GPHL90, EHJP90] is derived from Alliant's FX/Fortran [All87], which is Fortran 77 with vector constructs such as those proposed for the next Fortran standard (Fortran 90). Cedar Fortran has extensions for memory allocation, concurrency control, multitasking, and synchronization. It allows the specification of the location and visibility of the data, reflecting the Xylem memory access and locality structure. Concurrent execution of loops within a single cluster or across clusters is provided with doall and doacross constructs¹. Vector concurrency, conditional vector statements, and vector reduction functions are also provided by the Cedar Fortran language. Cedar Fortran provides facilities for creating and synchronizing cluster-tasks. Three groups of synchronization routines are provided: doacross loop synchronization; Zhu-Yew synchronization primitives (see [ZY87]); and Cray-style synchronization operations (see [Cra85]).

¹ doall loops may perform their iterations in any order, and synchronization between iterations is not allowed. The iterations of a doacross are guaranteed to start in the same order as they would if the loop were serial.

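The fragment below illustrates, in standard Fortran, where the three levels of parallelism described above would be applied to one block of the grid. The Cedar Fortran constructs themselves (the doall loop and the cluster-task calls) are only indicated in comments rather than written out, and the routine and array names are assumptions for this sketch.

! Illustrative sketch only: the three levels of parallelism described in the
! text, with comments marking where the Cedar Fortran constructs would apply.
subroutine update_block(t, imt, km)
  implicit none
  integer, intent(in) :: imt, km
  real, intent(inout) :: t(imt, km)
  integer :: k
  ! Level 3 (across clusters): each cluster would run this routine as a
  ! separate Xylem cluster-task on its own block of the basin.
  ! Level 2 (within a cluster): in Cedar Fortran this k loop would be a
  ! doall, spreading the vertical levels over the 8 CEs of the cluster.
  do k = 1, km
     ! Level 1 (within a CE): the longitudinal direction is written as a
     ! vector operation, executed by the CE's vector unit.
     t(1:imt, k) = 0.5 * (t(1:imt, k) + 1.0)   ! placeholder computation
  end do
end subroutine update_block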

4.2 Design of the code for the baroclinic phase

As described before, the main goal in this work was to implement on Cedar an application with large memory requirements that demands computational power and represents the state of the art in its field². In this way, we would be able to exploit Cedar's scalable architecture by experimenting with different implementations of the same code, trying to detect the weaknesses and the strengths of the architecture, and to take advantage of the fact that Cedar can simulate different parallel programming paradigms to study the best way of implementing a parallel version of the chosen application on some of today's parallel computers.

Considering our objectives, we tried to preserve the high level code structure, which was described in Chapter 3. Nevertheless, to take advantage of parallelism, vectorization, and Cedar's hierarchical memory, we were required to make some changes to the flow of computation and to the data structures. The only major algorithmic change was made for the solution of the two-dimensional Poisson equation for the mass transport stream function, because the original algorithm was not efficiently parallelizable. We leave the discussion of our implementation of the Poisson solver to Section 4.3.

The Cedar multicluster code was designed to allow different subdivisions of the basin. It was also parameterized in the number of subdivisions so that scalability experiments could be attempted. Therefore, it is easy to run the code with a different number of clusters without changing the program itself. Currently the partitioning scheme is decided beforehand, and the data grid is divided equally among the clusters. We leave as future work the design of partitioning schemes that can be decided at run time, considering load balancing and taking into account the topography of the basin.

The issue of land areas is discussed in Section 4.2.1, the grid partitioning issue is addressed in Section 4.2.2, and the synchronization, communication, and computational issues are discussed in Section 4.2.3. The placement of the data structures (described in Section 3.3.3) was also an important consideration during the design. The discussion about the location of the work space and of the three-dimensional data containing the slabs is presented in Section 4.2.4. This three-dimensional data will be referred to from now on as the virtual disk.

² At that time, the new GFDL MOM code [PDR90] was not yet available.

4.2.1 The issue of land points

One aspect that should be considered when designing an ocean model is the handling of land points. These points represent the coastline of the basin being studied, the islands, and the bottom topography. In earlier models ([Cox84]), the standard approach was to consider a basin with regular shape, perform the computations over the land areas, and then use masks to zero out the land points at the end of the time step. The main reason for this approach was the need to obtain long vectors, and thus high performance, on vector computers with large startup times for their vector pipelines. It was therefore more effective to perform the computation using a large vector containing land points than to use several small vectors containing only the water points.

In simulations over "land-rich" basins with realistic topography, this approach may cause more than 50% of the computation to be unnecessary, due to land areas (mainly the coastline and the bottom topography). Hence, with the advent of vector supercomputers that have very fast startup times for their pipelines, the treatment of land points became an important issue, and in some of the new models [PDR90] this issue is already being considered. In the multiprocessor case, the avoidance of computation over land areas generates problems of theoretical and practical importance, such as new partitioning schemes, load balance, and scheduling. The solutions for the first two topics are much easier when the original approach is taken, because of the regularity of the domain.

In our current design, we decided to follow the same approach taken in the original code regarding land areas. This postpones any decisions about load balance to the future, since it allows equi-sized partitions of data. Moreover, the new model performs roughly the same number of floating point operations as the original code, providing a standard for comparisons.

4.2.2 Grid partitioning

The partitioning of the problem domain is a critical factor in the design of parallel codes, and many papers have been presented on this topic. Bell and Patterson [BP87] describe data space partitioning strategies for large numerical codes, with a survey of the approach used in these codes. Their focus is on the logical design of data structures for very efficient transfer among multiple memory levels and multiple processors. Thune [Thu90] discusses the use of slices and rectangles as partitioning shapes for the case of explicit difference methods on two-dimensional problems, on MIMD computers with distributed memory. Also for multiple processor systems, Reed, Adams, and Patrick [RAP87] use the stencil structure to choose the appropriate partitioning shape. Berger and Bokhari [BB87] consider partitioning strategies that balance the workload for problems on a domain with nonuniform work estimates.

Slab partitioning in terms of independent longitude/depth (vertical) sections was introduced in ocean simulation codes for Bryan's model. This partitioning scheme was used to solve two major problems of the first generation of vector supercomputers, namely the necessity of having long enough vectors for efficient vectorization, and the lack of enough memory space to store large data sets. The first problem was solved because the vectorization could be done efficiently by using the longitudinal direction as the first index of the arrays. The second problem was solved by keeping the three-dimensional data out of main memory and bringing into memory just the slabs necessary to perform the computation. The slab partitioning in vertical sections is still used in the recent version of Semtner [CS88]. Andrich, Madec et al. [AM88, ADLM88], using the Cray-2 (and its very large memory), take a different approach, with the use of horizontal slabs for all the horizontal operators and vertical slabs for the vertical operators. In each case, each particular processor computes one slab at a time.

In the context of multiprocessing, there are some other methods of partitioning the domain that were not referenced so far. One can divide the grid into several blocks containing approximately the same number of grid points and distribute the blocks among the processors. Each processor would solve the equations independently for its part of the grid, and a barrier synchronization would be used before the end of each time step. The partitioning of the domain can be done in one, two, or three dimensions. In one dimension, the division of the grid can be in the East-West direction, including complete rows of the same latitude; in the North-South direction, including complete columns of the same longitude; or in the vertical direction, including complete planes of the same depth. There are three possible combinations for the two-dimensional partitioning: we can subdivide the grid in two directions, namely latitudinal/longitudinal, latitudinal/vertical, and longitudinal/vertical. Finally, the idea of a three-dimensional partitioning is that the blocks are formed by subdividing the data grid in all three directions.

When the issue of avoiding computations over land areas is considered, the partitioning of the grid requires some thinking about balancing. This is still an open problem, and additional work will be required before a good solution is implemented. We are planning, for our next version, to address this problem by introducing new partitioning schemes that take into account the topography file, and by studying the implications for load balancing.

In our work we considered two one-dimensional partitioning schemes, one dividing the grid in the East-West direction and the other in the North-South direction. We also considered a latitudinal/longitudinal two-dimensional partitioning. These partitioning schemes will be referred to from now on as row, column, and 2D (or two-dimensional) partitioning, respectively. We divide the grid into only as many blocks as the number of clusters we are using, with each cluster being assigned to one of the blocks. Each cluster subdivides its block into vertical slabs and performs the computation in a manner similar to the original model, taking advantage of its multiple CEs and vector arithmetic to compute each block as if it were a smaller basin. Whenever possible, each cluster executes the computations in concurrent vector mode, using the longitudinal direction for vectorization and the vertical direction for concurrent execution. Our implementation of the two-dimensional partitioning scheme was for the case of four clusters, where each row contains half the number of grid points in the latitude and each column contains half the number of grid points in the longitude. Naturally, the idea of the two-dimensional partitioning can be extended to configurations that have more than four clusters, or to an off-center partitioning, where each block can have a different number of rows or columns. This latter scheme becomes interesting when the issue of land points is considered. The schemes that were implemented are depicted for a four cluster partitioning in Figures 4.2, 4.3, and 4.4. These partitioning schemes were used for the baroclinic equations.

Figure 4.2: Representation of the grids with row partitioning. (The figure shows the basin divided from south to north into four east-west strips; cluster 1 holds the southernmost strip and cluster 4 the northernmost.)

As mentioned earlier, for the barotropic phase we used a different approach, which will be presented in Section 4.3. Since the baroclinic equations are solved explicitly with finite differences, the computation at each grid point during one time step is independent of the other grid points in the same time step. Hence, there would be no problem in implementing any of the partitioning schemes described above. However, we decided not to implement schemes that divide the grid in the vertical direction, because they would require major changes in the code, and that was not the purpose of this work. As described in Section 3.3.2, the computation in the original code is performed slab by slab, where each slab contains the data of a fixed latitude. In general, the computations in each slab are performed using doubly nested loops, varying the depth in the outer loop and the longitude in the inner loop. Considering that the vertical dimension is usually much smaller than the other dimensions, the division of the grid in the vertical direction would lead to a poor load balance inside each cluster. It would therefore be necessary to interchange each doubly nested loop for a better utilization of the CEs, but that would compromise the vectorization. A discussion of some of the advantages and disadvantages of each of the partitioning schemes that we implemented is presented in the following subsections.

Figure 4.3: Representation of the grids with column partitioning. (The figure shows the basin divided from west to east into four north-south strips; cluster 1 holds the westernmost strip and cluster 4 the easternmost.)

Figure 4.4: Representation of the grids with two-dimensional partitioning. (The figure shows the basin divided into four quadrants: cluster 1 in the southwest, cluster 2 in the southeast, cluster 3 in the northwest, and cluster 4 in the northeast.)


Row partitioning

In the row partitioning (Figure 4.2), the grid containing J rows is divided into C blocks of contiguous slabs. Each block, containing ⌈J/C⌉ + 2 slabs, is assigned to one of the C clusters. The two extra slabs contain data for the South and North borders of the partition. This data is necessary to compute the five point stencil of the second-order finite-difference approximation of the derivatives when the first and last rows of the partition are being updated. During each time step, each cluster performs the computations for its complete set of slabs³ and then synchronizes before executing the relaxation.

This scheme is not very attractive for models with high resolution, which require a large work space, because the memory requirements for the work space cannot be distributed among the clusters. We should remember that, because the discretization scheme is second order with a five point stencil, each cluster needs to work with three full slabs in its work space. The memory problem appears because the computation is executed slab by slab, and each slab contains all the data for a given row. So, if we were using for example J clusters, where J is the number of latitude grid points, the memory requirement for the work space would be J times the memory size required for the computation of each slab, which is given by equation (3.1). One solution for this problem would be to have slabs of fixed longitude, with grid cells lying on a given latitude-depth plane, and to change the direction of the computation, which would then be from West to East (or from East to West), instead of slabs of fixed latitude with the computation being executed from South to North. However, this modification would make the row partitioning scheme behave similarly to the column partitioning scheme, described next.

³ No computation is performed over the border rows.

Column partitioning

In the column partitioning (Figure 4.3), the data grid is divided in the zonal direction, in a similar fashion to the row partitioning. The I longitudinal points are divided equally among the clusters, and two extra columns are provided for the East and West borders. During each time step, the slab-by-slab computation of each subdivision of the basin is executed independently by each cluster.

One advantage of this scheme over the row partitioning is that with the column partitioning the work space is distributed among the clusters, demanding less memory and obtaining more data locality, because each of the C clusters works with only roughly 1/C of the work space that must be present in memory for the computation of each row. One drawback of this scheme is that if the basin has a small number of East-West points, or if a large number of clusters is used, the performance of the code will be affected, because the vector lengths corresponding to each cluster will become short. Hence, we can notice that one-dimensional partitioning schemes tend to create a problem of memory usage versus granularity of computation. This problem appears because of the lack of flexibility of these partitioning schemes, leading us to infer that a two-dimensional partitioning scheme will perform better than a one-dimensional partitioning. One point in favor of the column partitioning over the row partitioning is that, as the number of grid points is directly proportional to the resolution of the model, and as the tendency is to have even bigger models in the future, this scheme could be an attractive option if one-dimensional partitioning is chosen.

2D partitioning

The two-dimensional partitioning (Figure 4.4) is a combination of the two previous methods, and is of great interest when at least four clusters are available. In the four cluster case, the two horizontal dimensions are divided equally and each quadrant is assigned to a different cluster. The idea of two-dimensional partitioning can be easily extended when more than four clusters are available, with more divisions in the zonal direction, the meridional direction, or both. This scheme is used as a solution for the problem of memory versus granularity that was discussed for the previous two schemes. It has the advantage of being more scalable than the others when a larger number of clusters is available, because the vector length does not become as small as in the column partitioning, nor do the memory requirements become as large as in the row partitioning. As a disadvantage, for a larger number of clusters, this scheme requires two extra rows and two extra columns to store the border information. This scheme appears to be more suitable than the others when the issue of computation over land areas is considered. The main reason is that off-center partitioning can be used in both directions, to obtain a better distribution of water points for each cluster and, in consequence, a reasonable load balance.
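To make the block sizes concrete, the small program below computes the per-cluster block dimensions implied by the three schemes just described. The grid dimensions are placeholders, the rounding up for non-divisible sizes is an assumption of this sketch, and the two extra border rows and columns in the 2D case follow the description given above for the general case.

! Illustrative computation of the block sizes implied by the row, column,
! and 2D partitioning schemes described above (placeholder grid dimensions).
program partition_sketch
  implicit none
  integer :: I, J, C, rows_per_cluster, cols_per_cluster, rows_2d, cols_2d

  I = 122; J = 62      ! made-up numbers of longitude and latitude points
  C = 4                ! number of clusters

  ! Row partitioning: ceiling(J/C) slabs per cluster plus 2 border slabs
  rows_per_cluster = (J + C - 1) / C + 2
  ! Column partitioning: about I/C columns per cluster plus 2 border columns
  cols_per_cluster = (I + C - 1) / C + 2
  ! 2D partitioning with four clusters: half of each horizontal dimension
  ! per quadrant, plus two extra rows and two extra columns for the borders
  rows_2d = (J + 1) / 2 + 2
  cols_2d = (I + 1) / 2 + 2

  print *, 'row    partitioning: slabs per cluster   =', rows_per_cluster
  print *, 'column partitioning: columns per cluster =', cols_per_cluster
  print *, '2D     partitioning: block size          =', rows_2d, 'x', cols_2d
end program partition_sketch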

4.2.3 Computational, synchronization and communication issues

One important aspect to consider when parallel processing is being used is the overhead that is added due to the required synchronization among processors and the necessity to exchange information between processes. In the case of explicit schemes, like the one that is used in the three-dimensional phase, the synchronization is independent of the partitioning that is being used, because once the data that is required from previous time steps is obtained, all the computations in each block can be performed without the necessity of synchronization. On the other hand, the communication activity between clusters is dependent on the partitioning scheme, because at the beginning of each time step each cluster needs to have information from its neighbors, in order to compute the nite-di erence approximation of the derivatives. To obtain a better performance on a multicluster environment, one should try to maximize the number of operations in each cluster and minimize the intercluster communication, or to maximize the ratio between computations and communications per cluster (Rcc ). This ratio is important because it

allows one to verify whether the performance of the program is being degraded by an excessive amount of intercluster communication. Much work has been done to analyze communication overhead in parallel architectures, especially for the use of explicit difference methods on regular grids. A detailed study of the relationships between stencils, partitioning schemes, architecture, and data structure management was done by Reed, Adams and Patrick [RAP87]. They studied the effects of different partitioning schemes and discretization stencils on interprocessor communication, for both distributed and shared memory architectures. A follow-up study, using Reed et al. as a framework, was done by Nicol and Willard [NW87]. They studied the relationship between problem size and architecture, and analytically quantified the relationship between stencil type, partitioning scheme, grid size, execution time and type of communication network. They showed that optimal performance is not always achieved by using all available processors.

In this section we review the cost of communication between clusters for the baroclinic phase, considering the five point stencil and the three partitioning schemes used in our model. For a grid containing I longitudinal points, J latitudinal points, and K vertical levels, we approximate the number of floating point computations performed during the three-dimensional phase by $\alpha \cdot I \cdot J \cdot K$, where $\alpha$ is the average number of floating point operations performed for each grid point⁴ during each time step. As described above, the three partitioning schemes divide the computation equally among the clusters. Therefore, using C clusters, each cluster will perform approximately $\alpha \cdot I \cdot J \cdot K / C$ floating point operations, independently of the partitioning scheme.

In the row partitioning, the two clusters containing the data from the edge of the basin require one extra slab for the boundary values of their only neighbor. All other clusters require data from the southward and northward neighboring slabs. Therefore, in the worst case, each cluster needs to send values⁵ from $2 \cdot I \cdot K$ cells to its two neighbors, and the ratio between computation and communication per cluster ($R_{cc_r}$) in the worst case for row partitioning will be given by
$$R_{cc_r} = \frac{\alpha \cdot J}{2 \cdot \beta \cdot C} . \qquad (4.1)$$

We take a similar approach to compute the ratio between computation and communication per cluster for the column partitioning ($R_{cc_c}$). In the worst case, each cluster is required to send values from $2 \cdot J \cdot K$ cells to its neighbors, and the number of floating point operations per item of information exchanged in the worst case for the column partitioning will be given by
$$R_{cc_c} = \frac{\alpha \cdot I}{2 \cdot \beta \cdot C} . \qquad (4.2)$$

⁴ We have counted for our original program that $\alpha$ is approximately 300.
⁵ Each value, or item of information, can be considered as a data word (4 bytes in single precision, or 8 bytes in double precision); $\beta$ denotes the number of data words exchanged per cell. In our original program $\beta$ is approximately 40.

As expected, by comparing equations (4.1) and (4.2) one can notice that if one-dimensional partitioning is used, the ratio between computation and communication will be larger if the grid is divided along the larger dimension.

In the 2D partitioning, assuming an arbitrary number of clusters, each data block may have two, three or four neighbors. Let the grid be divided into $C = i \cdot j$ blocks, where i and j are the numbers of subdivisions along the zonal and meridional directions respectively, each block having I/i longitudinal points and J/j latitudinal points⁶. Each cluster will perform $\alpha \cdot (I/i) \cdot (J/j) \cdot K$ floating point operations. In the worst case, each cluster is required to send values from $2 \cdot (I/i + J/j) \cdot K$ cells to its neighbors. The ratio between computation and communication per cluster ($R_{cc_{2d}}$) for the 2D partitioning will be a function of the area and the perimeter of the block, given by
$$R_{cc_{2d}} = \frac{\alpha \cdot \frac{I}{i} \cdot \frac{J}{j}}{2 \cdot \beta \cdot \left(\frac{I}{i} + \frac{J}{j}\right)} , \qquad (4.3)$$
or, after simplifying, given that $C = i \cdot j$,
$$R_{cc_{2d}} = \frac{\alpha \cdot I \cdot J}{2 \cdot \beta \cdot C \cdot \left(\frac{I}{i} + \frac{J}{j}\right)} . \qquad (4.4)$$

We can generalize equations (4.1) and (4.2), substituting J and I respectively by $D_1$, the number of grid points along the dimension that is being used for the partitioning. In this way, the ratio between computation and communication per cluster ($R_{cc_{1d}}$) will be given by
$$R_{cc_{1d}} = \frac{\alpha \cdot D_1}{2 \cdot \beta \cdot C} . \qquad (4.5)$$

We can also generalize equation (4.4) to be
$$R_{cc_{2d}} = \frac{\alpha \cdot D_1 \cdot D_2}{\beta \cdot C \cdot P} , \qquad (4.6)$$
where $D_1$ and $D_2$ are the numbers of horizontal grid points, $P = 2 \cdot \left(\frac{D_1}{d_1} + \frac{D_2}{d_2}\right)$ is the perimeter of the block⁷, and $d_1$ and $d_2$ are the numbers of subdivisions in the two dimensions respectively. Comparing equations (4.5) and (4.6) we obtain that in the general case, with $d_1$ and $d_2 > 2$, for a basin with $D_1 \cdot D_2$ horizontal grid points, a two-dimensional partitioning

⁶ For simplicity we are assuming that the numbers of longitudinal (I) and latitudinal (J) points are multiples of i and j respectively. This assumption does not affect our analysis, because we are considering the worst case for communication. If I and J were not multiples of i and j, we would have to consider the perimeter of the larger block as $2 \cdot \left(\lceil I/i \rceil + \lceil J/j \rceil\right)$.
⁷ We are considering the perimeter of the projection of the block onto the plane formed by the zonal and meridional dimensions.

will have a higher ratio of computation over communication than the one-dimensional partitioning (dividing the grid in the direction that has $D_1$ grid points) when $P < 2 \cdot D_2$.

In the special case of four clusters, each one has only two neighbors, and during each time step it is necessary to send data values from $K \cdot (I + J)/2$ cells; hence the ratio between computation and communication is given by
$$R_{cc_{2d}(C=4)} = \frac{2 \cdot \alpha \cdot I \cdot J}{\beta \cdot C \cdot (I + J)} \qquad (4.7)$$
floating point operations per item of information exchanged, or
$$R_{cc_{2d}(C=4)} = \frac{2 \cdot \alpha \cdot D_1 \cdot D_2}{\beta \cdot C \cdot P} , \qquad (4.8)$$
and, as expected, the ratio between computation and communication is twice the ratio obtained for the general case, given by equation (4.6).

The 2D partitioning can be used on a massively parallel architecture, down to a granularity limit for each processor of only K grid points in the depth direction. In this case, using a five point stencil, each processor will execute $\alpha \cdot K$ floating point operations and will need to receive $4 \cdot \beta \cdot K$ data values from its neighbor cells, so the ratio between computation and communication will be $\alpha/(4 \cdot \beta)$. Of course, there will be other problems associated with this partitioning scheme, such as the totalization of scalar variables like energy and the volume averaging of the absolute changes of temperature and salinity, which require information from all over the grid and will increase the communication cost substantially.
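As a rough illustration (not from the thesis text; it simply substitutes into equations (4.1), (4.2) and (4.7) the Mediterranean grid of Chapter 5, I = 334 and J = 118, with C = 4 clusters and the approximate values $\alpha \approx 300$ and $\beta \approx 40$ quoted in the footnotes, rounding the results):
$$R_{cc_r} \approx \frac{300 \cdot 118}{2 \cdot 40 \cdot 4} \approx 111, \qquad R_{cc_c} \approx \frac{300 \cdot 334}{2 \cdot 40 \cdot 4} \approx 313, \qquad R_{cc_{2d}(C=4)} \approx \frac{2 \cdot 300 \cdot 334 \cdot 118}{40 \cdot 4 \cdot 452} \approx 327,$$
so for this grid the column partitioning is clearly preferable to the row partitioning, and the four-cluster 2D partitioning is slightly better still.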

4.2.4 Memory utilization

The decision of where to keep the virtual disk data and the work space is an important aspect for better utilization of the memory system, in this case Cedar's memory hierarchy. Four combinations are possible, and will be referred to as "CC", "GC", "GG" or "CG", where the first letter indicates where the virtual disk is stored and the second letter the location of the work space; "C" and "G" stand for cluster and global memory. The versions "CC" and "GG" resemble the behavior of two memory paradigms, namely distributed memory and shared memory. In the first case most of the data is stored in the cluster local memories, and the communication must be done through message passing, which is simulated in our case by the use of the global memory. In our design, the CC version also uses the global memory to accumulate and store scalar quantities such as energy, temperature, and salinity, which are partially computed by each cluster. In the GG version it is not necessary to exchange messages between processors, because all the data is stored in the shared memory; access times to this memory are, however, slower. As presented by Gallivan et al. in [GJT+91], the potential improvement from using the prefetch unit to offset latency may bring the global memory access rate close to the cluster memory access rate. The use of Cedar in the GG version is analogous to machines that have a large shared memory, such as the Cray-2.

The "GC" version is a hybrid, and appears from the experiments to be the one that best utilizes Cedar's hierarchical memory: the faster local memory holds the work space, which is accessed most frequently, while global memory holds the three-dimensional data, which is accessed more sparsely during each time step. An analogy can be made between this hierarchical memory usage on Cedar and the use of the SSD on the Cray Y-MP system, or the disk model of the original code. The ratio between the number of floating point operations using cluster memory data and the number of global memory accesses is an important measure for the hybrid versions (GC and CG). This ratio matters because in general the cost of accessing global memory is higher than the cost of accessing cluster memory; hence it is desirable that each global memory access be accompanied by several floating point operations, in order to amortize the extra cost of accessing global memory. In the GC version, during the baroclinic phase approximately 12 global memory accesses (8 reads and 4 writes) are necessary for each point in the three-dimensional grid, while almost 300 floating point operations are executed. Hence, the 3D phase achieves a ratio of 25 floating point operations on cluster memory data for each global memory access.

The "CG" version is the second hybrid type. It uses the local memory to store the virtual disk, and the global memory to store the work space. There is no current machine to make an analogy with, and we did not expect good performance from this version, because it appears to combine the disadvantages of the CC and GG versions, namely the need to exchange messages between processes and the large number of global memory accesses, respectively. Hence, we did not implement the CG version.
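For clarity, the figure of 25 quoted above is simply the quotient of the two per-point counts:
$$\frac{\approx 300 \ \text{floating point operations per grid point}}{\approx 12 \ \text{global memory accesses per grid point}} = 25 .$$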

4.3 Design of the code for the barotropic phase

The successive over-relaxation (SOR) appears to be a good sequential algorithm for the solution of the two-dimensional Poisson equation associated with the barotropic phase. Although its execution time depends on the convergence parameter, given as an input, we observed in a preliminary study of this ocean model [DGG89] that the relaxation routine was responsible for only a small percentage of the total execution time. We also noticed that a small number of iterations was in general sufficient to provide the required precision. However, due to the inherent recurrence of the algorithm used, this part of the code is difficult to parallelize, especially for parallelizing compilers, and thus becomes a bottleneck when parallel execution is considered. This problem was confirmed by another result in our preliminary study, which showed a speedup of only between 1.5 and 2.5 using Alliant's parallelizing compiler [All87] on the Alliant FX/8 with 8 CEs. Similar results were obtained by other authors, and much work is being done to implement new algorithms for the barotropic phase. Andrich, Madec

et al. in [AM88, ADLM88] implemented a parallel SOR and a parallel conjugate gradient method for the Cray 2 with four CPUs, obtaining better performance with the latter method. The code used by Singh and Hennessy in [SH89] was based on the software package FISHPAK [SS75]⁸ for the solution of the two-dimensional Poisson equation. Smith, Dukowicz, and Malone [SDM91], in their implementation on the Connection Machine, also focused on the barotropic equations, because approximately two-thirds of the total execution time of their code was spent in the SOR. They developed a new numerical formulation of the barotropic equations, and also implemented a parallel preconditioned conjugate-gradient solver for the elliptic equation, obtaining an improvement of more than 70%. Even though it appears to us that the best approach would be to implement a new algorithm for the solution of the two-dimensional Poisson equation, developing a new algorithm was not among the main goals of this work. This is partly because our objective, for this phase of the work, was to obtain results for the data sets corresponding to the Mediterranean, for which the relaxation converged in few iterations. Hence, we used the successive over-relaxation method as in the original program, but modified it to enhance the parallel performance of the code. We believe that using the same routine provides a better standard for comparison, although we plan to study and implement different algorithms for the solution of the elliptic equation in future work. A description of the SOR, with a discussion of the data dependences and the problems that arise for parallelization, is presented in Section 4.3.1. The initialization issues and the approach taken to avoid redundant computation during the initialization are discussed in Section 4.3.2. Section 4.3.3 gives a brief description of the algorithm that was used to obtain better SOR performance. The multicluster design of the hole relaxation is discussed in Section 4.3.4, and finally the overall structure of the multicluster relaxation is addressed in Section 4.3.5.

4.3.1 Description of the successive over-relaxation

The relaxation routine in the original code consists of three parts: the initialization of the work area, the iterative phase, and the update of the stream function. During the initialization phase, the solution from the last two time steps is read, masks are formed to distinguish land areas from ocean, the depth of the field is calculated, and the arrays of coefficients for the relaxation are generated. The iterative part of the relaxation consists of the solution of a block tridiagonal system, followed by the hole relaxation. Since the matrix arises from the discretization

⁸ Not being able to parallelize block cyclic reduction, which is the underlying FISHPAK algorithm,

they opted for relaxation type methods, also experimenting with the effect of asynchronous schemes. We note that in the meantime, the methods developed by Gallopoulos, Saad and Sweet in [GS89, Swe88] now allow the effective parallelization of block cyclic reduction, and could be used instead. The use of FISHPAK, however, was strongly dependent on the simplicity of the basin topography, taken to be an islandless rectangular domain in the horizontal.

of a Poisson operator written in spherical coordinates using a five point stencil, each of the J diagonal block submatrices is tridiagonal of size I × I, and the off-diagonal blocks are diagonal matrices.

A simplified algorithm for one iteration of the solution of the tridiagonal system is presented in Figure 4.5. To simplify the discussion, this description ignores the land issue; it should be remembered, however, that the original program has conditional code within the loops to avoid computations over land areas. The data dependence analysis [Ban76, KKP+81, PW86] for the loops in this algorithm shows that loops 1 and 5 can be executed as vector concurrent loops, as long as loop 1 precedes loop 5. The problems for parallelization occur with loops 2, 3 and 4. There is a flow dependence and an output dependence from S2 to S3, and a flow dependence from S3 to S2. These dependences indicate that loops 2, 3, and 4 cannot be executed in parallel; they also show that we cannot split loop 4, and that loop 2 should precede loop 3. Finally, there are flow dependences from S2 to S2 and from S3 to S3. In the first case the dependence is carried by the outer loop, so loop 2 can be executed in vector mode or in parallel. In the latter case, however, the dependence is carried by the inner loop, due to a recurrence relation, and the loop needs to be restructured to be executed in parallel. Loops 2, 3 and 4 form the kernel of the solution of the block tridiagonal system. Loop 4 controls the iteration over the blocks and must be executed sequentially. Loop 2 performs the correction of the southern points and, if the dimension I is large enough, can be executed in vector concurrent mode; otherwise it can be executed in vector mode or in parallel. Finally, loop 3 executes the correction of the western points and is the loop that deserves the most attention, because it can be neither executed in parallel nor vectorized. To obtain better performance for this loop, some parallelizing compilers call a special library routine to compute the first order recurrence relation. In our algorithm we replaced loop 3 by the Spike algorithm, which is described briefly in Section 4.3.3. For a more detailed description of the algorithm see [CKS78].

4.3.2 Initialization issues

To save memory space, by using the same work space for the three-dimensional phase and the relaxation routine (as described in Section 3.3.3), the major part of the initialization work is repeated at each time step. This part of the initialization will be referred to as the set up of the relaxation constants. Considering that during normal utilization of the program one year of simulation may require close to 10,000 time steps⁹, and that in general ocean simulations are performed for several years, the only reason for repeating this redundant computation at every time step is memory savings. This was a standard approach in many codes written for memory-limited machines. When memory space is no longer a major problem, as in most current machines, there is little sense in executing this redundant computation at every time step. Our approach was to perform this computation once at the beginning of the program, and to store the initialized data using extra memory.

⁹ Considering each time step as one hour of simulation.


      do 1 j = 2, J-1
        do 1 i = 2, I-1
 S1:      res(i,j) = c_n(i,j)*dpsi(i,j+1) + c_s(i,j)*dpsi(i,j-1)
                   + c_e(i,j)*dpsi(i+1,j) + c_w(i,j)*dpsi(i-1,j)
                   + omega*(dpsi(i,j) + dzeta(i,j))
 1    continue
      do 4 j = 2, J-1
        do 2 i = 2, I-1
 S2:      res(i,j) = res(i,j) + c_s(i,j)*res(i,j-1)
 2      continue
        do 3 i = 2, I-1
 S3:      res(i,j) = res(i,j) + c_w(i,j)*res(i-1,j)
 3      continue
 4    continue
      do 5 j = 2, J-1
        do 5 i = 2, I-1
 S5:      dpsi(i,j) = dpsi(i,j) + res(i,j)
 5    continue

Where: res is the residual, dpsi is the change of the stream function, omega is the over-relaxation parameter, dzeta is the change of vorticity across one time step, and c_n, c_s, c_e, c_w are the matrix coefficients for the northern, southern, eastern and western points.

Figure 4.5: Algorithm for one iteration of the SOR.

This data required 16 new arrays¹⁰ of the size of the horizontal grid (I × J). In our version of the relaxation, the initialization consists only of the computation of the initial guess, which is performed using the two previous solutions and can be executed using all the clusters available.

4.3.3 Spike algorithm

The Spike algorithm was proposed by Chen, Kuck and Sameh [CKS78] for the parallel solution of linear recurrence systems (or banded triangular systems). The algorithm is well suited to recurrences of low order and to machines with a limited number of processors. In this section we present the algorithm for first order recurrence relations; for a complete description we refer to their paper.

Each execution of loop 3 in Figure 4.5 can be considered as the solution of a banded linear system Ax = r, where A is unit lower triangular of order n = I − 1, with bandwidth 1 and [−c_w(i,j)] on the subdiagonal; r is a vector containing the residuals; the vector x represents the new residual values after one execution of the loop; and j is the index of the outer loop (loop 4). To simplify the discussion, from now on we denote −c_w(i,j) by $\lambda_i$, ignoring the index j, which is constant during each execution of loop 3. Hence, the loop can be written in matrix notation as
$$\begin{bmatrix} 1 & & & & \\ \lambda_2 & 1 & & & \\ & \lambda_3 & 1 & & \\ & & \ddots & \ddots & \\ & & & \lambda_n & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_n \end{bmatrix} \qquad (4.9)$$

The Spike algorithm is divided into four steps, namely spike factorization, block diagonal solve, solution of the reduced system, and recovery of the solution vector. The banded linear system is partitioned into p blocks¹¹ of order k = n/p, and the resulting system has the form
$$\begin{bmatrix} L_1 & & & & \\ R_2 & L_2 & & & \\ & R_3 & L_3 & & \\ & & \ddots & \ddots & \\ & & & R_p & L_p \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_p \end{bmatrix} \qquad (4.10)$$

¹⁰ In this first implementation of the code, this extra data is stored in global memory in all three versions (CC, GG, and GC), so there is no communication overhead.
¹¹ To simplify the presentation we are assuming that n is a multiple of p, but this is not a requirement of the algorithm.

where, for i = 1 to p and j = (i − 1)·n/p,
$$L_i = \begin{bmatrix} 1 & & & \\ \lambda_{j+2} & 1 & & \\ & \ddots & \ddots & \\ & & \lambda_{j+n/p} & 1 \end{bmatrix}, \qquad R_i = \begin{bmatrix} 0 & \cdots & 0 & \lambda_{j+1} \\ 0 & \cdots & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & 0 \end{bmatrix},$$
$y_i = (x_{j+1}, x_{j+2}, \ldots, x_{j+n/p})^T$, and $f_i = (r_{j+1}, r_{j+2}, \ldots, r_{j+n/p})^T$.

In the first step the spikes are created by multiplying the system by, effectively, the following block diagonal matrix consisting of the $L_i$ inverses:
$$\begin{bmatrix} L_1^{-1} & & & \\ & L_2^{-1} & & \\ & & \ddots & \\ & & & L_p^{-1} \end{bmatrix} \qquad (4.11)$$

The new system becomes
$$\begin{bmatrix} I & & & & \\ S_2 & I & & & \\ & S_3 & I & & \\ & & \ddots & \ddots & \\ & & & S_p & I \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} L_1^{-1} f_1 \\ L_2^{-1} f_2 \\ L_3^{-1} f_3 \\ \vdots \\ L_p^{-1} f_p \end{bmatrix} \qquad (4.12)$$
where $S_i = L_i^{-1} R_i$ is a matrix with all but the last column equal to zero; it can therefore be stored as a vector $s_i$. As each $s_i$ depends only on the matrix $L_i$, these vectors need to be computed only once for all executions of the algorithm that use the matrix A. The next step, the block diagonal solve, is the update of the right hand side with the solution of the systems $h_i = L_i^{-1} f_i$. This step can be executed in parallel for all i = 1 to p, and the new system is:

$$\begin{bmatrix}
1 & & & & & & & & & \\
& \ddots & & & & & & & & \\
& & 1 & & & & & & & \\
& & s_{k+1} & 1 & & & & & & \\
& & \vdots & & \ddots & & & & & \\
& & s_{2k} & & & 1 & & & & \\
& & & & & & \ddots & & & \\
& & & & & & s_{n-k+1} & 1 & & \\
& & & & & & \vdots & & \ddots & \\
& & & & & & s_{n} & & & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_k \\ x_{k+1} \\ \vdots \\ x_{2k} \\ \vdots \\ x_{n-k+1} \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} h_1 \\ \vdots \\ h_k \\ h_{k+1} \\ \vdots \\ h_{2k} \\ \vdots \\ h_{n-k+1} \\ \vdots \\ h_n \end{bmatrix} \qquad (4.13)$$

After this step the first block (containing the first k = n/p unknowns) is solved, and the solution of each of the other blocks depends only on the value of the last element of the preceding block. So, by solving serially the reduced system (formed by the last row of each block in the previous representation), the recovery of the solution vector can be executed in parallel for the remaining p − 1 blocks. This phase consists of the update of the right hand side of each block, subtracting the product of the spikes with the last value of the preceding block. Hence, for the first k − 1 elements of the block starting after row i we need to execute the following operation:
$$\begin{bmatrix} x_{i+1} \\ x_{i+2} \\ \vdots \\ x_{i+k-1} \end{bmatrix} = \begin{bmatrix} h_{i+1} \\ h_{i+2} \\ \vdots \\ h_{i+k-1} \end{bmatrix} - x_i \begin{bmatrix} s_{i+1} \\ s_{i+2} \\ \vdots \\ s_{i+k-1} \end{bmatrix} \qquad (4.14)$$

The implementation of the Spike algorithm was parameterized in the number of clusters. The reason for this choice is that this part of the code depends on the topography and on the horizontal grid size; thus, depending on the granularity of the computation, it might be better to use fewer clusters, avoiding the cost of intercluster communication. As the coefficients for the relaxation are known at the beginning of the program, the spikes need to be evaluated only once, and can be stored for later use during the relaxation. This evaluation can be executed during the set up of the relaxation constants at the beginning of the program, and requires two extra arrays¹² of the size of the horizontal grid to store the results.

¹² The Spike algorithm requires one extra array per matrix of coefficients that is used, and due to the leapfrog method two matrices are used during the execution of the program.
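To make the four steps concrete, the following is a minimal, self-contained sketch (not the thesis code) of the Spike idea applied to a first order recurrence of the form x(i) = r(i) + a(i)*x(i-1), which is the form of loop 3. The problem size, block count and test data are arbitrary, and the signs are folded into the recurrence coefficient a rather than written as in equations (4.9)-(4.14); on Cedar, steps 1, 2 and 4 would be distributed over the clusters, but here everything is written serially and checked against the plain recurrence.

```fortran
      program spike
c     Illustrative sketch only: Spike solution of the first order
c     recurrence x(i) = r(i) + a(i)*x(i-1), checked against the serial loop.
      integer n, p, k
      parameter (n = 12, p = 3, k = n/p)
      double precision a(n), r(n), g(n), h(n), x(n), xs(n)
      integer i, b, j0
c     arbitrary test data; a(1) is never used (x(1) has no predecessor)
      do 10 i = 1, n
         a(i) = 0.1d0*i
         r(i) = 1.0d0
 10   continue
c     step 1: spike factorization, one spike vector g per block b > 1
      do 30 b = 2, p
         j0 = (b-1)*k
         g(j0+1) = a(j0+1)
         do 20 i = 2, k
            g(j0+i) = a(j0+i)*g(j0+i-1)
 20      continue
 30   continue
c     step 2: block diagonal solve, h = local solution ignoring the coupling
      do 50 b = 1, p
         j0 = (b-1)*k
         h(j0+1) = r(j0+1)
         do 40 i = 2, k
            h(j0+i) = r(j0+i) + a(j0+i)*h(j0+i-1)
 40      continue
 50   continue
c     step 3: serial solution of the reduced system (last row of each block)
      x(k) = h(k)
      do 60 b = 2, p
         j0 = (b-1)*k
         x(j0+k) = h(j0+k) + g(j0+k)*x(j0)
 60   continue
c     step 4: recovery of the remaining unknowns, independent per block
      do 70 i = 1, k-1
         x(i) = h(i)
 70   continue
      do 90 b = 2, p
         j0 = (b-1)*k
         do 80 i = 1, k-1
            x(j0+i) = h(j0+i) + g(j0+i)*x(j0)
 80      continue
 90   continue
c     check against the straightforward serial recurrence
      xs(1) = r(1)
      do 100 i = 2, n
         xs(i) = r(i) + a(i)*xs(i-1)
 100  continue
      do 110 i = 1, n
         print *, i, x(i), xs(i)
 110  continue
      end
```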

35

4.3.4 Hole relaxation

The hole relaxation [Tak74], used for the treatment of the islands, is executed for each island at the end of each SOR iteration. Since no two islands intersect, the hole relaxation can be executed in parallel without running into the problem of the same grid point being updated by different tasks. A multicluster version of the hole relaxation was defined, using a queue with the island indexes and a dynamic scheduling algorithm: each cluster that becomes available removes the first island from the queue and executes the hole relaxation for it, and the process is repeated until all the islands are processed. This procedure also accepts as a parameter the number of clusters to be used. Due to the small grain of the computation, the execution time of this multicluster implementation may be dominated by intercluster communication and synchronization overhead. A second approach, to be implemented later, will execute the algorithm using only one cluster, but with a Cdoall loop performing the hole relaxation for each island in parallel.
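A minimal sketch of this queue-based scheduling (not the thesis code, and written serially) is shown below. The island count is arbitrary, the print statement stands in for the actual hole relaxation of one island, and on Cedar the two statements that take an island off the queue would have to be executed atomically (for example with the Zhu-Yew synchronization primitives) so that clusters running the loop concurrently each obtain a distinct island.

```fortran
      program holeq
c     Illustrative sketch only: dynamic scheduling of islands from a queue.
      integer nisl, nexti, isl
      nisl  = 9
      nexti = 1
c     each cluster would execute this loop; taking an island off the queue
c     (the two statements below) must be atomic in the parallel version
 10   isl   = nexti
      nexti = nexti + 1
      if (isl .gt. nisl) goto 20
c     ... the hole relaxation for island number isl would be performed here
      print *, 'relaxing island', isl
      goto 10
 20   continue
      end
```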

4.3.5 Overall structure of the multicluster relaxation

The last phase of the relaxation is the update of the solution. This can be executed with simple vector operations that are easily split among the clusters with a small cost of communication. The detailed structure of the multicluster relaxation is presented in Figure 4.6. The computation of the first guess for the relaxation, extrapolated forward in time from the two previous solutions, is done during the initialization phase. The next step is the iterative process that updates the residuals, executes the Spike algorithm, and performs the hole relaxation for each island. This iterative process is repeated until convergence or until a maximum number of iterations is reached. Finally the stream function is updated from the relaxation solution.

4.4 Control flow and structure of the code

Among the options offered by the Cedar system for parallel execution of the code, we decided to use macrotasking, with one task for each subdivision of the basin. Considering that the program is supposed to simulate several time steps, the cost of task initialization is practically insignificant, because it is incurred only once at the beginning of the program. The cost of task synchronization, which is done using "busy waiting", is also irrelevant, because each task performs computations of large enough grain to amortize that cost. Also, the program was written in a modular way, so that it is straightforward to switch to another synchronization mechanism. To construct the busy waiting loops, we used the atomic functions provided by the Zhu-Yew synchronization primitives (see [ZY87]).

[Figure 4.6 (flow diagram): Multicluster structure of the relaxation. After a per-cluster initialization (relax init.), each cluster iterates over the computation of the residuals (comp res.), the Spike solver, and the hole relaxation; a convergence test decides whether another iteration is needed, and the final update of the solution is again split among the clusters.]

Figure 4.6: Multicluster structure of the relaxation.

[Figure 4.7 (flow diagram): Structure of the multicluster code. The main task performs the global initialization and starts one task per cluster; within each time step every cluster executes the bootstrap (boot), the row-by-row computation (rowc), which calls clinic and tracer, and the computation of the vorticity driving function (vort.), followed by the relaxation (relax).]

Figure 4.7: Structure of the multicluster code.

To improve the multicluster code we performed some restructuring of the flow of computations. The structure of the code, with a detailed description of the three-dimensional phase, is shown in Figure 4.7; the detailed structure of the relaxation was presented in Figure 4.6. The first phase of the multicluster version is executed once at the beginning of the program, and consists of starting the tasks, reading the data files, and performing the global and cluster initialization. This phase also includes the set up of the relaxation constants, described in Section 4.3. At the beginning of each time step, the bootstrap procedure (Boot) is executed. This procedure reads from the virtual disk the data for the south boundary of each partition, and performs the initialization necessary for the row by row computation. Due to the partitioning of the basin, and the use of the extra memory locations containing the boundary values described in Section 4.2.2, each cluster can compute its share of the three-dimensional phase without any intercluster synchronization. The row by row computation starts with the routine Rowc, which has functionality similar to the Step routine in the original code. It rotates the buffers in the work space, reads the new slab, computes masks, updates the scalar quantities for each cluster, and calls the subroutines Clinic and Tracer. After the row by row computation there is a synchronization point, at which the scalar quantities that were partially computed by each cluster are added together. In selected time steps the program also prints appropriate fields and other computed values after this synchronization point. In the original program, the vorticity driving function (A.65), which is required as an input for the solution of the Poisson equation, was computed inside the subroutine Clinic, after the row by row computation. We can obtain a better load distribution, increasing the granularity of the initialization phase of the relaxation, by performing this initialization together with the computation of the vorticity function (Vort.) at the beginning of the barotropic phase. In this case the cost of synchronization relative to the cost of computation of each of the two phases becomes insignificant, because in both cases a complete sweep through the entire three-dimensional grid is executed.

4.5 Implementation details

The implementation of the multicluster version was divided into several phases. Some of these phases were related to the history of Cedar's development. The first step was the conversion of the original code to one cluster Cedar Fortran. This task was done using the KAP [Kuc87] source-to-source restructuring program. All subsequent multicluster versions were hand written using the one cluster KAP output as a guide. The use of a source-to-source transformation program was a good way to start, because this kind of program can do some of the tedious work of converting the language constructs, analyzing data dependences, and detecting parallel and vector loops. However, the multicluster version is much more complex because it requires global knowledge of the problem being implemented; hence a semi-automatic source-to-source transformation program that accepts some user interaction would be more appropriate to help the development of a multicluster code. For example, we had to manually detect certain artificial data dependences in the original code to allow the memory savings described in Section 3.3.3. Similar intervention was needed for the higher level division of the problem, such as the grid partitioning, rather than obtaining parallelization only at the loop level. We are considering for future work the development of some semi-automatic source-to-source transformation tools.

Boundary points issues

One of the first problems that arises when grid partitioning and execution of independent tasks are considered is how to handle the computation of the boundary points of each subdivision of the basin. As described in Section 4.2.2, during the computation of the second-order finite-difference approximation of the horizontal derivatives with a five point stencil, data from the two latitudinal and two longitudinal neighbors are needed for each grid point. Therefore, to perform the computation at the boundary of each partition of the basin, communication between clusters would be necessary. To minimize this communication, whenever necessary we extended the work space in each cluster with extra grid points at each border of its subdivision of the basin. This data is read and updated by each cluster at the beginning of the computation of each new row in the three-dimensional phase. Because the required data is from previous time steps, due to the explicit scheme and the three stage buffering scheme described in Section 3.3.3, read/write conflicts cannot occur: the data computed during one time step will only be read by a different cluster in the following time step.
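The idea can be sketched as follows (illustrative only, not the thesis code, with made-up array sizes and an interior longitude range so that both neighbors exist): the slab owned by one cluster is copied from the shared array playing the role of the virtual disk into a work-space array that has one extra column on each side, and the extra columns are filled with the neighbors' boundary data from the previous time step.

```fortran
      program halo
c     Illustrative sketch only: work space of one cluster extended with
c     one ghost column on each side of its longitude range i1..i2.
      integer itot, k, i1, i2, i, l
      parameter (itot = 20, k = 4, i1 = 6, i2 = 10)
      double precision vdisk(itot,k), work(i1-1:i2+1,k)
c     fill the "virtual disk" with some previous-time-step values
      do 10 l = 1, k
         do 10 i = 1, itot
            vdisk(i,l) = i + 100*l
 10   continue
c     copy the owned points and the two ghost columns into the work space;
c     a cluster at the edge of the basin would skip the missing neighbor
      do 20 l = 1, k
         do 15 i = i1, i2
            work(i,l) = vdisk(i,l)
 15      continue
         work(i1-1,l) = vdisk(i1-1,l)
         work(i2+1,l) = vdisk(i2+1,l)
 20   continue
      print *, work(i1-1,1), work(i1,1), work(i2,1), work(i2+1,1)
      end
```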

Implementation of the CC version

The first multicluster version to be implemented was the CC version. Note that when our project was started, Cedar had a very small global memory, so none of the other versions could be implemented. One major concern for the CC version was how to perform the intercluster communication, especially how to provide the boundary data. In the CC version the boundary values were saved in extra vectors in global memory during one time step, and read into the work space of the cluster that required the data at the beginning of the next time step. The data necessary for the relaxation routine in the CC version could be stored in cluster memory, but due to cluster memory size restrictions this would considerably degrade the performance of the code when running a problem with fine resolution. Thus, in this version we decided to store all the data necessary for the relaxation in global memory.


Implementation of the GC and GG versions

The next phase of the work was influenced by a concurrent development in Cedar's design, namely the increase in the size of the global memory. This increase made possible the third step of our work, the implementation of the GC and GG versions. These versions did not need to use extra vectors in global memory to communicate the boundary values, because the virtual disk was already stored in global memory. Thus, the update of the work space boundaries was done directly from the virtual disk at the beginning of the computation of each row, when a new slab was read. The data necessary for the relaxation routine in the GC version could also be stored in cluster memory, but to make the comparison between the results of the CC and GC versions fair, we decided to store the relaxation data in global memory in the latter version as well.

The Spike algorithm

The last phase of the work was the implementation of the Spike algorithm. Two versions of the algorithm were implemented. The first isolates the water segments and executes the algorithm for each of these segments, similarly to the original code. In the second implementation land points are partially avoided, by not considering the first and last land segments in each row and by masking off the computation on land points between water segments. The latter version has larger vector lengths, producing less startup overhead, and results in a faster routine than the former. We are considering as a future project a new multicluster version of the Spike algorithm that divides the grid taking into account the horizontal topography of the basin. The performance of the algorithm can be improved by selecting the first element of each subdivision to be a land point; in that case the parallelism of the algorithm increases, because the spikes will be null.


CHAPTER 5

RESULTS

5.1 Introduction

In this chapter we present the performance results of the Cedar multicluster ocean simulation code. Some of the results will be compared with the performance of the original code optimized with VAST, the parallelizing compiler for the Alliant FX series, and with the single cluster version of the code using cluster memory optimized with KAP. These programs will be referred to respectively as the Cedar multicluster (or hand optimized), VAST, and KAP versions. The Cedar multicluster program was parameterized to be able to run with different numbers of clusters and partitioning schemes, as well as different numbers of horizontal grid points, vertical levels, and islands. It expects as input initial and boundary conditions, along with topography and wind information. Our experiments were conducted using the following two models adapted to simulate the Mediterranean basin geometry, the data sets being denoted by PnLk, where 1/n is the horizontal resolution in degrees and k is the number of vertical levels.

P8L16: This model uses a grid spacing of 0.125° (approximately 13.87 km). The grid size is 334 × 118 in the horizontal direction, with 16 levels in the vertical direction. Nine islands are represented in this model, and each time step simulates one hour.

P4L8: This coarse model uses a grid spacing of 0.25° (approximately 27.75 km). The grid size is 167 × 57 in the horizontal direction, with 8 levels in the vertical direction. Five islands are represented in this data set, and each time step simulates three hours.

The coarse model is used primarily to allow performance comparisons with the VAST version running on the Alliant FX/8, and also to allow performance evaluation of the CC version and of the single cluster KAP version. In all these cases the model using the larger data set is slowed down considerably by excessive cluster memory paging. Also, due to memory limitations, the computations were performed in single precision (32

bit). We compared the program output when using single precision with the output from the original program using double precision; in both models there was good agreement between the results. All reported timings on Cedar correspond to wall clock time, collected by the Cedar routines hrcget and hrcdelta, in single user mode. Timings on the Alliant FX/8 correspond to CPU times collected in single user mode, using the Alliant Fortran library routine etime. Timings on the Alliant FX/80 are CPU times, also collected with etime, but in multiuser mode. The routines hrcget and hrcdelta are part of a high-resolution timing facility (hrtime) implemented for the Cedar system [Mal87]. Hrtime is an extension of the Concentrix user and system process time measurements; it times both execution and non-execution process states with 10 µsec accuracy. The Alliant Fortran library routine etime returns the elapsed CPU time, also with 10 µsec accuracy (see [All87]). Timing results are in seconds per time step, and were derived after running 12 time steps and averaging the last 10 time steps for the model P8L16, and the last 14 time steps out of 16 for P4L8. The first two time steps were not taken into account in order to eliminate the effect of startup overhead, which is negligible for long simulations. For each version of the code and for each data set, we executed several runs, varying the number of clusters and the number of CEs in each cluster. As described earlier, the program uses single precision, so the current size of global memory was adequate for paging not to be a problem for the data configurations under consideration. In the versions that used global memory to store the virtual disk, we ran a second set of experiments simulating a slowdown of the global memory access time, by explicitly disabling the use of the prefetch unit. Subsequently, unless stated otherwise, the prefetch unit is assumed to be in use. The memory usage of the code for each version is discussed in Section 5.2; Section 5.3 presents some timing results for the relaxation routine, and Section 5.4 presents the performance results for the code using the two data sets. In the remaining sections we discuss the best data placement strategy to exploit Cedar's hierarchical memory; discuss the influence of data prefetching; compare the results for the data partitioning schemes; and present some considerations about speedup and granularity using four clusters, vector length, and cluster memory bandwidth.

5.2 Memory usage

The amounts of memory used by each version of the code are presented in Tables 5.1 and 5.2 for models P4L8 and P8L16 respectively. In these tables, "pc", "pg", and "sg" refer to cluster memory that is private for each cluster (private cluster), global memory that is private for each cluster (private global), and global memory that is shared (shared global). The "pc" and "pg" values are given per cluster. Hence, for example, the GG version using row partitioning with four clusters requires a total of 7.6 Mbytes of global memory for private global data. Notice that the private memory requirement for

                                  Number of Clusters
 Mem.          1        2               3               4
                    (row)  (col)   (row)  (col)   (row)  (col)  (2D)
 CC version
  pc          4.9     2.9    2.5     1.9    1.7     1.6    1.3    1.4
  pg          0       0      0       0      0       0      0      0
  sg          2.1     2.2    2.1     2.4    2.2     2.5    2.3    2.3
 GC version
  pc          1.0     0.7    0.5     0.5    0.4     0.5    0.3    0.3
  pg          0       0      0       0      0       0      0      0
  sg          5.6     5.6    5.7     5.7    5.7     5.7    5.7    5.7
 GG version
  pc          0       0      0       0      0       0      0      0
  pg          1.0     0.7    0.5     0.6    0.4     0.5    0.3    0.4
  sg          5.6     5.7    5.7     5.7    5.7     5.7    5.7    5.7

 Table 5.1: Memory usage in Mbytes for the model P4L8.

the column partitioning is approximately the same for any number of clusters, while the corresponding requirement for the row partitioning increases linearly with the number of clusters. As discussed in Section 4.2.2, this difference is due to the division of the work space that is possible when the column partitioning is used, but not with the row partitioning.

Cedar's operating system uses more than 6 Mbytes of each cluster's memory. During our experiments the cluster memory capacity was 16 Mbytes, leading to virtual memory paging on all runs that required more than 10 Mbytes of cluster memory. Such is the case, for example, for the CC version when using P8L16. As discussed in Section 4.5, in order to compensate for the cluster memory size limitations, we designed the CC code to utilize some global memory. This is shown in Tables 5.1 and 5.2, while a detailed account of the global memory usage by the data structures of the CC version is shown in Table 5.3. In fact, the CC version requires space in global memory only for communication of the boundaries (first row of the table), and to store partial values for the totalization of the scalar quantities (represented as "others" in the table). The remaining global memory is used to store wind information and data used during the relaxation, such as the matrix coefficients, the spikes, the work space, and the stream function values from the previous two time steps that are stored in the virtual disk.

The paging activity can be confirmed in Table 5.4, which shows the average number of page faults in private cluster and shared global memory, per cluster, per time step, for the model P8L16, using the CC and GC versions. We notice that the GC version has practically no problems with virtual memory activity, while almost all runs with the CC version using model P8L16 had a large number of page faults per time step. The only two exceptions are the runs using four clusters with column or two-dimensional partitioning.


                                  Number of Clusters
 Mem.          1        2               3               4
                    (row)  (col)   (row)  (col)   (row)  (col)  (2D)
 CC version
  pc         34.9    18.3   17.5    12.9   11.7    10.1    8.8    9.2
  pg          0       0      0       0      0       0      0      0
  sg          7.5     8.5    8.2     9.1    8.5     9.6    8.7    8.8
 GC version
  pc          3.9     2.5    2.0     2.1    1.4     1.9    1.0    1.3
  pg          0       0      0       0      0       0      0      0
  sg         37.8    37.8   37.9    37.9   38.1    37.9   38.2   37.9
 GG version
  pc          0       0      0       0      0       0      0      0
  pg          4.0     2.7    2.0     2.2    1.4     1.9    1.0    1.3
  sg         37.8    37.8   37.9    37.9   38.1    37.9   38.2   37.9

 Table 5.2: Memory usage in Mbytes for the model P8L16.

                                       Number of Clusters
 data                    1        2               3               4
                              (row)  (col)   (row)  (col)   (row)  (col)  (2D)
 boundaries            0.00    0.98   0.64    1.50   0.83    2.02   1.02   1.17
 wind data             3.61    3.61   3.61    3.61   3.61    3.61   3.61   3.61
 relax work space      0.76    0.76   0.76    0.76   0.76    0.76   0.76   0.76
 matrix coefficients   2.41    2.41   2.41    2.41   2.41    2.41   2.41   2.41
 relax virtual disk    0.30    0.30   0.30    0.31   0.31    0.31   0.31   0.31
 spikes                0.30    0.31   0.31    0.31   0.31    0.32   0.32   0.32
 others                0.15    0.15   0.21    0.15   0.27    0.15   0.31   0.20
 total                 7.53    8.52   8.24    9.05   8.49    9.58   8.74   8.78

 Table 5.3: Distribution of the global memory usage in Mbytes for the CC version of data set P8L16.

                                     Number of Clusters
 Mem.             1         2                 3                 4
                       (row)   (col)     (row)   (col)    (row)  (col)  (2D)
 CC version
  pc         19434.6   6051.4  5929.9   1857.2  1687.9    337.1   0.0    0.0
  sg           377.2     81.7    50.1     48.2    31.7     25.4   6.1    5.4
 GC version
  pc             0.0      0.0     0.0      0.0     0.0      0.0   0.0    0.0
  sg             0.0      3.2     1.9      4.3     3.1      5.4   4.4    4.7

 Table 5.4: Average number of page faults per cluster per time step for data set P8L16.

As shown in Table 5.2, these are the only configurations that require less than 10 Mbytes of cluster memory when using model P8L16.

5.3 Execution time for the relaxation routine As discussed in Section 4.3, much work was done on the relaxation routine, and considerable performance improvement was obtained. We measured the time spent during the ve phases of the relaxation in our multicluster version and in the single cluster KAP version. We compared to the KAP version, instead of the VAST version, because KAP was able to recognize the recurrence relation that appears in loop 3 of Figure 4.5 and exploit a recurrence solver subroutine written in assembler, improving the performance of the relaxation routine. This substitution was the only major di erence between the VAST and the KAP versions, all the other changes being at the loop level. The ve phases timed during the relaxation were: initialization, computation of the residuals and correction of the stream function (loops 1 and 5 in Figure 4.5), linear system solution (loops 2, 3, and 4 in Figure 4.5), hole relaxation, and update of the solution. The initialization and update phases are executed once per time step, while the other phases are executed during an iterative process that is repeated until convergence is reached. The changes between the original (KAP) version and our nal version for each of the phases were described in Section 4.3. Table 5.5 presents for each model the average time per time step spent by each phase of the relaxation in the single cluster KAP version, and the one and four cluster hand optimized versions, running on Cedar. For our data sets, models P4L8 and P8L16 executed an average of 2.5 and 1.8 iterations per time step. One of the program's inputs is the upper limit of iterations for the relaxation. In our input data this limit was set to 25, but as we can see, the average number of iterations was well below this limit for both data sets. The idea behind the redundant computation executed during the initialization of the relaxation in the original program, as described in Section 4.3.2, was that the time spent

                        model P4L8                        model P8L16
 Phase          KAP      hand optimized            KAP      hand optimized
                       1 cluster  4 clusters              1 cluster  4 clusters
 init.         0.099     0.005      0.003        0.523      0.015      0.005
 comp. res.    0.055     0.055      0.047        0.244      0.244      0.086
 solver        0.112     0.082      0.150        0.384      0.226      0.426
 hole relax.   0.017     0.018      0.014        0.038      0.041      0.061
 update        0.006     0.006      0.004        0.031      0.031      0.008
 total         0.289     0.166      0.218        1.220      0.557      0.586
 best time     0.289     0.166      0.154        1.220      0.557      0.366

 Table 5.5: Average time per time step for the relaxation routine.

For our test data this is not true. Indeed, as mentioned above, the average number of iterations during the relaxation was small enough for the time spent in the initialization phase to degrade the overall performance of the relaxation. The first row of Table 5.5 shows the considerable improvement obtained by executing the major part of the initialization once at the beginning of the program and saving it in memory. Further improvements were obtained with the four-cluster version for the computation of the residuals and for the update of the solution. The third line of Table 5.5 shows that the use of the Spike algorithm also brought gains in performance compared to the recurrence solver used by KAP. Nevertheless, this gain was obtained in the single cluster version; the multicluster version was slower, due to the small size of the problem relative to the amount of communication required between clusters. Finally, the multicluster version of the hole relaxation phase with dynamic scheduling did not perform better than the single cluster KAP version, due to the small grain size of the routine and the overhead necessary for control and communication. As one can see in Table 5.5, the total time when four clusters are used is larger than the total time for a single cluster. This is mainly due to the Spike algorithm, which was faster when a single cluster was used, as explained above, and also due to the execution time of the hole relaxation routine. Hence, in our final multicluster version we executed these routines in a single cluster, obtaining the execution time for the relaxation that is shown in the last line of the table (called "best time"). Finally, we should mention that there are some discrepancies in the times for model P4L8 due to the accuracy of the timing routines and the granularity of the model.
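Summarizing the "best time" row of Table 5.5, the overall gain of the final relaxation over the single cluster KAP version is roughly
$$\frac{0.289}{0.154} \approx 1.9 \ \ \text{for P4L8} \qquad \text{and} \qquad \frac{1.220}{0.366} \approx 3.3 \ \ \text{for P8L16}.$$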

5.4 Performance results

To account for the hierarchical nature of Cedar, in the following discussion we first introduce some notation for four different instances of speedup and efficiency, namely

overall, multi-CE, multicluster, and true. Let $T_v(C, p)$ denote the run time of version "v" using C clusters and p CEs per cluster. We define the overall speedup and efficiency for C clusters and p CEs per cluster using version "v" as
$$S_v^{O}(C,p) = \frac{T_v(1,1)}{T_v(C,p)} , \qquad (5.1)$$
$$E_v^{O}(C,p) = S_v^{O}(C,p)/(C \cdot p) ; \qquad (5.2)$$
the multi-CE speedup and efficiency for C clusters and p CEs per cluster using version "v" as
$$S_v^{CE}(C,p) = \frac{T_v(C,1)}{T_v(C,p)} , \qquad (5.3)$$
$$E_v^{CE}(C,p) = S_v^{CE}(C,p)/p ; \qquad (5.4)$$
and the multicluster speedup and efficiency for C clusters and p CEs per cluster using version "v" as
$$S_v^{cl}(C,p) = \frac{T_v(1,p)}{T_v(C,p)} , \qquad (5.5)$$
$$E_v^{cl}(C,p) = S_v^{cl}(C,p)/C . \qquad (5.6)$$
Finally, the true speedup and efficiency for p CEs are
$$S'_p = \frac{t'_1}{t'_p} \qquad (5.7)$$
and
$$E'_p = S'_p/p , \qquad (5.8)$$
where $t'_1$ is the best execution time using one CE and $t'_p$ is the best execution time using p CEs.

We next present and discuss results for both models. Table 5.6 gives the times for the baseline code, running on one CE of the Alliant FX/8 (model P4L8) and of the Alliant FX/80 (both models), and the results for the original code compiled using VAST, running model P4L8 on the Alliant FX/8 and both models on the Alliant FX/80. We used scalar optimization, concurrency, vectorization, and associativity transformations as VAST optimization options (see [All87]). Table 5.6 also gives the run times for the single cluster KAP version running model P4L8 on Cedar¹. Due to the size of the cluster memory it was not possible to run model P8L16 with the single cluster KAP or VAST versions without having virtual memory activity dominate the performance. For model P4L8, the results for the GC version with the prefetch unit turned on and off are in Tables 5.7 and 5.8 respectively. The results for the CC version are presented in Table 5.9, and the results for the GG version, with and without prefetch, are presented in Tables 5.10 and 5.11.

¹ As described earlier, one cluster of Cedar is similar to an Alliant FX/8.

 Model            P4L8      P4L8      P4L8      P8L16
 Machine          FX/8      FX/8      FX/80     FX/80
 baseline        31.09       --       26.64     216.97
 CEs \ Opt.     (VAST)     (KAP)     (VAST)     (VAST)
 1               10.50     11.92       6.11      54.83
 2                5.88      6.26       3.72      35.38
 3                4.43      4.57       2.94      27.37
 4                3.48      3.53       2.32      22.20
 6                3.25      3.13       2.25      18.21
 8                2.50      2.27       1.78      15.72

 Table 5.6: Original program, average runtime per time step.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        12.41    6.51   6.53    4.63   4.60    3.76   3.86   3.53
 2         6.41    3.38   3.43    2.44   2.48    2.02   2.09   1.91
 3         4.71    2.49   2.50    1.82   1.85    1.51   1.57   1.43
 4         3.50    1.86   1.90    1.37   1.40    1.14   1.20   1.09
 6         3.09    1.66   1.68    1.23   1.26    1.03   1.08   0.98
 8         2.12    1.14   1.18    0.86   0.90    0.73   0.80   0.71

 Table 5.7: Model P4L8 - GC version with prefetch.

For model P8L16, Tables 5.12 and 5.13 give the results for the GC version with and without prefetch. The results for the CC version are presented in Table 5.14, and the results for the GG version, with and without prefetch, are presented in Tables 5.15 and 5.16.

The best performance using model P4L8 (see Table 5.7) averages 0.71 seconds per time step ($S'_{32}$ = 14.8), while model P8L16 (Table 5.12) averages 4.17 seconds per time step, with a true speedup $S'_{32}$ of 22.81 and a rate of approximately 50 MFLOPS (using single precision). This result for model P8L16 was only three times slower than the run time obtained running the original program on a single CPU of a Cray Y-MP 4/464 (1.4 seconds per time step).

As expected, virtual memory activity was a major problem for the runs using the CC version with model P8L16. The only runs that were not affected by paging were the ones using 4 clusters with column and two-dimensional partitionings. The performance of the CC version for other configurations using this model is severely degraded by the large number of page faults (see Table 5.4). Hence there will be no further reference to the runs using these configurations.
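For reference, the true speedups quoted above follow directly from definition (5.7) and the best single-CE and 32-CE entries of Tables 5.6, 5.7 and 5.12:
$$S'_{32} = \frac{t'_1}{t'_{32}} \approx \frac{10.50}{0.71} \approx 14.8 \ \ \text{(P4L8)}, \qquad S'_{32} \approx \frac{95.12}{4.17} \approx 22.8 \ \ \text{(P8L16)}.$$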

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        12.96    6.83   6.79    4.87   4.79    3.97   3.94   3.71
 2         6.75    3.56   3.54    2.58   2.58    2.11   2.15   2.00
 3         4.93    2.60   2.62    1.92   1.94    1.58   1.62   1.50
 4         3.67    1.94   1.96    1.45   1.46    1.20   1.25   1.14
 6         3.27    1.72   1.76    1.29   1.31    1.07   1.12   1.05
 8         2.26    1.19   1.21    0.90   0.93    0.77   0.82   0.74

 Table 5.8: Model P4L8 - GC version without prefetch.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        12.46    6.53   6.54    4.68   4.68    3.85   3.92   3.64
 2         6.52    3.41   3.45    2.52   2.53    2.09   2.15   1.99
 3         4.84    2.55   2.55    1.88   1.89    1.56   1.63   1.50
 4         3.56    1.91   1.94    1.41   1.44    1.20   1.26   1.16
 6         3.15    1.70   1.73    1.29   1.31    1.07   1.15   1.05
 8         2.20    1.22   1.22    0.90   0.94    0.80   0.86   0.76

 Table 5.9: Model P4L8 - CC version.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        16.36    8.60   8.56    6.18   6.50    5.35   5.54   4.88
 2         8.70    4.60   4.62    3.31   3.46    2.87   2.97   2.61
 3         6.26    3.30   3.37    2.39   2.48    2.03   2.17   1.90
 4         4.57    2.43   2.47    1.80   1.89    1.56   1.66   1.46
 6         3.99    2.13   2.18    1.57   1.64    1.31   1.44   1.26
 8         2.56    1.37   1.46    1.06   1.14    0.93   1.04   0.90

 Table 5.10: Model P4L8 - GG version with prefetch.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        27.69   14.37  14.19   10.07   9.83    8.05   7.72   7.55
 2        14.18    7.42   7.34    5.26   5.16    4.22   4.12   3.99
 3        10.33    5.41   5.38    3.88   3.82    3.12   3.08   2.97
 4         7.35    3.86   3.87    2.81   2.79    2.30   2.27   2.18
 6         6.68    3.50   3.54    2.55   2.55    2.07   2.06   1.96
 8         3.99    2.13   2.18    1.59   1.61    1.31   1.35   1.27

 Table 5.11: Model P4L8 - GG version without prefetch.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        95.12   48.36  47.30   33.74  31.61   25.84  24.18  24.50
 2        49.54   25.27  24.32   17.77  16.35   13.57  12.57  12.77
 3        36.86   18.79  18.14   13.29  12.29   10.18   9.29   9.59
 4        27.54   14.16  13.48    9.95   8.96    7.69   6.96   7.15
 6        21.54   11.00  10.39    7.89   6.94    6.11   5.32   5.62
 8        17.11    8.89   8.23    6.29   5.47    4.94   4.17   4.38

 Table 5.12: Model P8L16 - GC version with prefetch.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1        98.51   50.56  49.15   35.28  32.90   26.77  25.38  25.54
 2        51.18   26.39  25.50   18.54  17.03   14.13  13.04  13.41
 3        38.24   19.56  18.92   13.87  12.68   10.64   9.71  10.01
 4        28.32   14.62  13.96   10.37   9.30    8.00   7.17   7.50
 6        22.18   11.44  10.76    8.26   7.71    6.34   5.56   5.80
 8        17.75    9.16   8.43    6.56   5.60    5.05   4.33   4.54

 Table 5.13: Model P8L16 - GC version without prefetch.

 CEs/                          Number of Clusters
 cluster     1         2                3                4
                  (row)   (col)   (row)   (col)   (row)  (col)   (2D)
 1       478.10  268.42  263.62  194.61  151.18   60.45  24.88  24.95
 2       443.47  176.50  172.10   97.79   91.12   42.93  13.12  13.15
 3       504.63  162.98  187.37   53.90   43.27   27.43   9.77   9.80
 4       483.86  183.78  211.89   52.34   40.42   22.44   7.43   7.42
 6       487.53  176.86  159.94   55.59   41.58   16.96   5.72   5.82
 8       462.95  173.09  169.64   55.73   41.08   16.69   4.69   4.61

 Table 5.14: Model P8L16 - CC version.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1       117.65   59.91  61.14   42.92  46.14   35.79  35.91  35.22
 2        62.30   31.88  33.37   22.40  23.93   18.58  18.63  18.28
 3        43.90   22.61  23.13   16.06  16.23   12.61  12.74  12.47
 4        32.11   16.62  17.00   11.81  12.51    9.77  10.17   9.66
 6        23.55   12.19  12.53    8.73   8.89    6.86   7.03   6.86
 8        16.69    8.75   9.05    6.32   6.72    5.29   5.45   5.25

 Table 5.15: Model P8L16 - GG version with prefetch.

 CEs/                         Number of Clusters
 cluster     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 1       212.86  107.72 108.43   74.97  74.30   56.72  56.14  55.79
 2       108.40   54.81  55.27   38.57  38.05   29.27  28.98  28.84
 3        78.84   40.02  40.37   28.22  27.82   21.53  21.32  21.21
 4        55.58   28.08  28.39   19.92  19.72   15.28  15.23  15.16
 6        39.95   20.71  20.92   14.71  14.61   11.33  11.33  11.27
 8        27.80   14.66  14.86   10.61  10.49    8.26   8.30   8.14

 Table 5.16: Model P8L16 - GG version without prefetch.

                              Number of Clusters
 prefetch     1        2               3               4
                  (row)  (col)   (row)  (col)   (row)  (col)   (2D)
 off       56.6%  60.0%  76.3%   61.7%  87.3%   63.6%  91.7%  79.3%
 on        -2.5%  -1.6%  10.0%    0.4%  22.9%    7.1%  30.7%  19.9%

 Table 5.17: Differences in execution times from the GG to the GC versions, using model P8L16 with 8 CEs.
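The entries of Table 5.17 are consistent with the relative difference $(T_{GG} - T_{GC})/T_{GC}$ computed from the 8-CE rows of Tables 5.12 to 5.16; for example, for four clusters with 2D partitioning,
$$\frac{8.14 - 4.54}{4.54} \approx 79.3\% \ \ \text{(prefetch off)}, \qquad \frac{5.25 - 4.38}{4.38} \approx 19.9\% \ \ \text{(prefetch on)}.$$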

5.5 Conclusions about data placement strategy and the influence of prefetch

The use of global memory to store the three-dimensional data and of the cluster memory to store the work space exploited Cedar's hierarchical memory well, giving the best performance among the data placement strategies we tried. This is confirmed by the results presented in Tables 5.7 and 5.12, which show the best execution times obtained using four clusters with models P4L8 and P8L16, respectively.

Gallivan et al. [GJT+91] predicted that using the prefetch unit to offset latency could bring the global memory access rate close to the cluster memory access rate. We can confirm their prediction by comparing the results from the GC and GG runs using model P8L16 with 8 CEs. Table 5.17 shows the percentage difference in execution time between the runs using the GG version and the runs using the GC version, with the prefetch unit turned on and off. The difference dropped considerably when the prefetch unit was turned on, and in some cases the GG version was even faster than the GC version. These results show that the global memory access rate became close to the cluster memory access rate when the prefetch unit was used.

Table 5.17 also shows that, as more clusters were added, the percentage difference in execution time for the row partitioning increased only slightly, probably due to the increase in global memory contention, while the column partitioning produced more significant differences. When the prefetch unit was turned on, the performance degradation was also much smaller for the runs using the row partitioning. The main reason for this different behavior between the row and the column partitioning is vector length, as will be discussed in Section 5.6.

The results for the GC version were consistently better than the ones for the CC version, but always by less than 10%. One possible reason for the similarity is that the extra time spent accessing the global memory in the GC version is compensated by the communication between clusters, carrying the border information, that is necessary in the CC version, as described in

Section 4.2.3. The most important reason, though, is the ratio of the number of floating point operations on cluster memory data to the number of global memory accesses. This ratio is roughly 25, and as a result the extra cost of accessing the global memory is largely amortized during the computation of the three-dimensional phase. Also, as noted above, with the use of the prefetch unit the cost of a global memory access approaches the cost of a cluster memory access.

Comparing the results from Tables 5.12 and 5.13, we observe that in the GC version the runs that used the prefetch unit were at most 5% faster than the runs that did not. This indicates that, even when the global memory access is slowed down by a small factor [2], the total time is not considerably affected. Hence, considering the small performance degradation caused by the use of the global memory, we conclude that, as expected, the prefetch mechanism and the placement of data partitions in cluster memory are complementary techniques for avoiding performance degradation due to global memory latency, for the present Cedar cost parameters.

As described in Section 3.3.2, two subroutines are used to transfer data between the work space and the virtual disk (cluster and global memory, respectively, in the GC version). During the baroclinic phase the global memory is accessed only when these routines are called. Hence, we were able to vary the global memory access time artificially by calling these routines more than once whenever a read or a write was required. As an experiment, we ran the GC version slowing down the global memory access time by factors ranging from 1 (no slowdown, i.e., the original code) to 50. The slowdown ratios are given by t_k/t_1, where t_k is the run time with slowdown factor k and t_1 is the execution time without slowdown. The results are plotted in Figures 5.1 and 5.2 for models P4L8 and P8L16, respectively; the dashed lines show the run times using the column partitioning, the dotted lines the row partitioning, and the solid lines the two-dimensional partitioning. The best results without prefetch were obtained with the column partitioning for model P8L16 and with the 2D partitioning for model P4L8; hence the line for the 2D partitioning and the line for the column partitioning are duplicated in Figures 5.1 and 5.2, respectively, representing the runs with and without prefetch (lower and upper line). Using these plots one can estimate what the program's run time would be with a slower global memory. In both figures the slope of the line for the run without prefetch is clearly larger than the others; thus these plots also show the increasing need for data prefetch as global memory becomes more remote.

Finally, another result that shows the importance of data prefetch is the drop in performance of the GG version when the prefetch unit was turned off. Comparing the best results for the GG version with and without prefetch (Tables 5.10 and 5.11 for model P4L8, and Tables 5.15 and 5.16 for model P8L16), we notice that the performance dropped approximately 40% and 55% for models P4L8 and P8L16, respectively, when prefetch was not used. These results show the importance of data prefetch when the ratio between the number of floating point operations and global memory accesses is small (in the GG version this ratio is less than one).

[2] According to [GJT+91], the global memory access time is 6 times slower than the cache access time, and by using the prefetch unit to offset latency the global memory access time can be improved by a factor of 3.

[Figure 5.1: Slowdown for model P4L8, varying the global memory access time. Dotted line: row partitioning; dashed line: column partitioning; solid line: 2D partitioning.]
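The mechanism of the slowdown experiment can be illustrated with a small sketch in plain Fortran; the actual code uses the Cedar Fortran dialect and different routine names, so everything below is illustrative only.

    ! Sketch of the artificial slowdown: all global-memory reads go through
    ! one transfer routine, so repeating the copy k times emulates a global
    ! memory roughly k times slower (k = 1 reproduces the original program).
    ! A real compiler may remove the redundant copies; the actual experiment
    ! repeated the library copy call instead.
    subroutine slow_read(gm, work, n, k)
      implicit none
      integer, intent(in) :: n, k
      real, intent(in)    :: gm(n)    ! stands for the global-memory slab
      real, intent(out)   :: work(n)  ! stands for the cluster-memory work space
      integer :: j
      do j = 1, k
         work = gm
      end do
    end subroutine slow_read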

5.6 Comparisons between data partitioning schemes and vector length issues

Table 5.12 shows that, using four clusters with 8 CEs each, the column and the two-dimensional partitioning schemes performed 18.5% and 12.8% faster, respectively, than the row partitioning. This confirms that the column and 2D partitionings are better strategies for large grid sizes. The same table shows that the time difference between the column and the two-dimensional partitioning was only 5%; this difference is not large enough to lead to a definite conclusion about which scheme is better. However, Figures 5.1 and 5.2 show that, even with prefetch, the slopes of the curves corresponding to the column partitioning are larger than the slopes of the curves for the other two partitioning schemes. Therefore, performance under the column partitioning strategy would be more affected by a slower global memory than under the other two partitionings. Further evidence for this conclusion are the differences presented in Table 5.17.

[Figure 5.2: Slowdown for model P8L16, varying the global memory access time. Dotted line: row partitioning; dashed line: column partitioning; solid line: 2D partitioning.]

The three partitioning methods read and write the same amount of data from and to global memory. However, they issue a different number of read or write calls, and the amount of data transferred per call is also different. Using four clusters, each read or write in the column partitioning contains one fourth of a slab, while the row partitioning executes one fourth of the number of reads, each read containing a full slab. The two-dimensional partitioning executes half the number of reads and writes executed by the column partitioning, each one containing twice the amount of data. The time to read a vector of length n from global memory can be approximated by

    t(n) = S + \lceil n/r \rceil F(r)                                        (5.9)

where S (startup time) is the overhead due to latency, and F(r) (fetch time) is the time to move one block of size r from global to cluster memory. Considering the global memory accesses performed during each time step, and that the vector length for the row partitioning is I (the number of grid points along the zonal direction), the approximate times T_r and T_c for the row and the column partitioning to access global memory, using C clusters, are given by

    T_r = C \left( S + \lceil I/r \rceil F(r) \right)                        (5.10)

and

    T_c = C \left( S + \left\lceil \lceil I/C \rceil / r \right\rceil F(r) \right).   (5.11)

When the prefetch unit is turned on, the fetch time is optimized, but the startup time does not change. Therefore, the row partitioning shows the largest performance improvement with the prefetch unit, because the fetch time component in equation (5.10) is approximately C times larger than the same component in equation (5.11). Hence, we can infer that the column partitioning induced better data locality than the other partitionings, because it uses smaller vectors, whereas by utilizing longer vectors the row partitioning had the smallest time degradation when the prefetch unit was turned off, and the best performance improvement when the unit was turned on.

When four clusters were used with model P4L8, the results for the row partitioning and the two-dimensional partitioning using 8 CEs were close, with an average difference of 3.6%, while the results for the column partitioning were at least 10% slower. One reason for these differences is the small vector length that results from the column partitioning [3], which affects the performance of vector operations and vector-concurrent loops. When larger vector lengths were used by the column partitioning, as in the three- and two-cluster versions, the results became similar to those of the row partitioning. This observation is confirmed by the results of model P8L16 using the column partitioning, where the length of the vectors was no longer a problem.

[3] Model P4L8 uses vectors of length 167; thus each cluster in a four-cluster column partitioning uses vectors of length 42.
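For illustration, the access-time model (5.9)-(5.11) can be evaluated with a few lines of Fortran. The machine parameters S, F(r) and r below are placeholders, not measured Cedar values; only the vector length I = 167 of model P4L8 is taken from the text.

    ! Sketch of the access-time model (5.10)-(5.11) with assumed parameters.
    program access_model
      implicit none
      real,    parameter :: s = 10.0, f = 4.0  ! startup and per-block fetch time (arbitrary units)
      integer, parameter :: r = 32             ! block size (assumed)
      integer, parameter :: i_pts = 167        ! zonal grid points of model P4L8
      integer, parameter :: c = 4              ! number of clusters
      integer :: len_col
      real    :: t_row, t_col
      len_col = (i_pts + c - 1) / c                          ! ceil(I/C) = 42
      t_row = real(c) * (s + real((i_pts + r - 1)/r) * f)    ! eq. (5.10)
      t_col = real(c) * (s + real((len_col + r - 1)/r) * f)  ! eq. (5.11)
      print *, 'T_row =', t_row, '   T_col =', t_col
    end program access_model

With these (assumed) parameters the fetch component of T_row is roughly C times that of T_col, which is the behavior exploited by the prefetch unit as argued above.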

5.7 Granularity using four clusters

We can conclude that even the coarser model P4L8 had sufficient granularity for the computation to be divided among the four clusters, and that in both models we obtained close to the optimum multicluster speedup for the baroclinic phase using four clusters. As described above, the vector length in the row partitioning is independent of the number of clusters and is the same as in the one-cluster version; therefore, we prefer to analyze the granularity of the multicluster version by comparing the results of the row partitioning.

The last line of Table 5.7, which presents the best results for model P4L8 using the row partitioning, shows that the multicluster speedup for four clusters, S_GC^cl(4,8), is approximately 2.90. This value might lead one to think that the small model does not have enough granularity to be divided among the clusters. This result, however, is due to the fact that during the relaxation the Spike algorithm is executed using only one cluster, which, according to Table 5.5, takes roughly 12% of the total execution time of the program. Therefore, considering that only 88% of the code can be divided among the clusters, and that the overhead of the communication is compensated by the extra memory and memory bandwidth that become available, one can show that this speedup is practically the maximum that can be obtained with four clusters.

The same analysis can be done for model P8L16. Comparing the results in Table 5.12 for the row partitioning using 32 CEs with the one-cluster version using 8 CEs, we obtain a multicluster speedup S_GC^cl(4,8) of approximately 3.5. This speedup is again practically the maximum that can be obtained, considering that the sequential part of the relaxation (solver and hole relaxation in Table 5.5) takes roughly 5.4% of the time. These results confirm the need for further improvements in the relaxation routine, such as the design of a new multicluster solver or the use of different orderings (red-black, wavefront, etc.). Another option would be to solve the two-dimensional Poisson equation with a different algorithm, such as a preconditioned conjugate gradient method.
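The claim that these speedups are "practically the maximum" can be made explicit with the usual serial-fraction bound, using the percentages quoted above from Table 5.5:

    S^{cl}_{max}(C) = \frac{1}{s + (1-s)/C}, \qquad \frac{1}{0.12 + 0.88/4} \approx 2.94, \qquad \frac{1}{0.054 + 0.946/4} \approx 3.44,

which is consistent with the measured multicluster speedups of approximately 2.90 and 3.5.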

5.8 Problems with cluster memory bandwidth

The comparison of the multi-CE speedups for all versions and partitionings, presented in Table 5.18 for model P8L16 using 1 and 4 clusters with 8 CEs per cluster, reveals a characteristic of the Alliant architecture: cluster memory bandwidth restricts the maximum speedup. This limitation can be observed more clearly in Figure 5.3, which shows the single-cluster multi-CE efficiency E_v^CE(1,p) for the code using the GG version with and without prefetch (dotted and solid lines, respectively) and the GC version using prefetch (dashed line). These lines were drawn using the times for 1, 2, 4, and 8 CEs only, because in general the concurrent loops are executed over the vertical dimension, of size 16, and we wanted to avoid timing distortions due to the load imbalance that arises when the number of CEs does not divide the number of vertical levels.

The single-cluster GG version without prefetch showed good multi-CE speedup; it also did not have many problems with global memory contention, especially when the memory access time was increased, as in the runs without prefetch, for which the multi-CE efficiency was practically constant. When prefetch was used in the GG version, the curve had a very small slope, with a single-cluster multi-CE efficiency of almost 90% for 8 CEs. The limitation of memory bandwidth appears in the version that uses cluster memory, where the multi-CE efficiency drops considerably with the addition of more CEs.

 Version                    S_v^CE(1,8)        S_v^CE(4,8)
                                           (row)   (col)   (2D)
 GG without prefetch           7.66         6.87    6.76    6.85
 GG with prefetch              7.05         6.77    6.59    6.71
 GC without prefetch           5.55         5.30    5.86    5.63
 GC with prefetch              5.56         5.23    5.79    5.59
 CC                             --          3.62    5.30    5.41
 Original code on FX/80        3.49          --      --      --

 Table 5.18: Multi-CE speedups for model P8L16, using 1 and 4 clusters with 8 CEs per cluster.

[Figure 5.3: Multi-CE efficiency E_v^CE(1,p) for the one-cluster code using model P8L16. Dotted line: GG with prefetch; solid line: GG without prefetch; dashed line: GC with prefetch.]
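The efficiencies plotted in Figure 5.3 follow directly from the speedups in Table 5.18, since E_v^CE(1,p) = S_v^CE(1,p)/p; for example, at 8 CEs,

    E_v^{CE}(1,8) = \frac{7.66}{8} \approx 0.96 \ \text{(GG, no prefetch)}, \qquad \frac{7.05}{8} \approx 0.88 \ \text{(GG, prefetch)}, \qquad \frac{5.56}{8} \approx 0.70 \ \text{(GC, prefetch)}.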


CHAPTER 6

CONCLUSIONS AND FUTURE WORK

We implemented a multicluster ocean general circulation model on Cedar and experimented with three partitioning schemes, using several data placement strategies and mappings to the components of Cedar. The experiments simulated the Mediterranean basin geometry, using two data sets with different grid sizes. The best result using the model with the coarse grid averages 0.71 seconds per time step, while the original model running with compiler optimizations on an Alliant FX/8 averaged 2.5 seconds per time step. The best result with the larger data set averages 4.17 seconds per time step, with a speedup of 22.81 for 32 CEs and a rate of approximately 50 MFLOPS (using 32-bit precision). This performance was roughly one third of the speed that we obtained on a single CPU of a Cray Y-MP 4/464.

The GC version, which uses the global memory to store the three-dimensional data and the cluster memory to store the work space, appears to be the best data placement strategy for exploiting Cedar's control and memory structures. The experiments showed that the use of the prefetch unit can mitigate the effect of the latency of the global memory system, bringing the times for the GG version close to the times for the GC version. This was especially true for partitionings that use longer vectors. However, for the present Cedar hardware, the prefetch mechanism did not appear to be decisive for better performance when there was a large ratio between floating point operations on cluster memory data and global memory accesses.

By comparing the single-cluster multi-CE efficiencies, we observed that, due to memory contention, the cluster memory bandwidth restricts the maximum speedup. We also noticed that when a single cluster was used, this effect did not occur with the global memory system.

For the current Cedar architecture, we were not able to determine which partitioning scheme was best, since for the larger data set the difference in run time between the column and the two-dimensional partitioning was only 5%. However, by performing experiments that artificially degrade the global memory access time, we observed that the column partitioning strategy would be more affected by a slower global memory than the other two partitioning schemes. These experiments also confirmed the importance of data prefetching as global memory becomes more remote.

The changes to the SOR code for the Poisson solver improved the time of the barotropic phase by a factor of almost two for model P4L8 and more than three for model P8L16. The use of the Spike algorithm improved the time by a factor of 1.7 over the recurrence relation routine used by KAP. Nevertheless, this improvement was obtained with a single-cluster version; due to the small granularity of the problem relative to the amount of communication required between clusters, the multicluster Spike was slower than the routine used by KAP. These results, together with the observation that the multicluster speedup of the code was restricted by the use of a single cluster for the Spike algorithm, confirm the need for further improvements in the solution of the two-dimensional Poisson equation.

There are many interesting issues left for the continuation of this work. One is related to the avoidance of computation over land areas. Here we are studying and developing schemes for topography-dependent partitionings. These schemes are based on the partitionings implemented so far, and take scheduling strategies and load balance into consideration as important factors. We are considering static and dynamic grid partitionings, in order to achieve self-adapting strategies that allow run-time optimization.

Regarding the solution of the two-dimensional Poisson equation, we are planning to implement a new multicluster version of the Spike algorithm that takes into consideration the horizontal topography of the basin. In this way we expect to increase the parallelism of the algorithm by avoiding some of the inter-cluster synchronization cost. We are also considering the implementation of a multicluster Spike algorithm for the complete block tridiagonal system, instead of executing the algorithm for the solution of each of the blocks. The use of different approaches for the solution of the elliptic equation is also under consideration for future work.

We also observed the desirability of better tools for program transformation. We believe that the development of a semi-automatic source-to-source transformation program that accepts some user interaction would be of great help for porting this class of large-scale computational programs to new parallel architectures.

Finally, we are planning to expand our effort to the new Modular Ocean Model (MOM), developed at the Geophysical Fluid Dynamics Laboratory. This program belongs to the new generation of computationally more demanding ocean models, and will be an excellent test case for understanding and evaluating the performance of Cedar.


APPENDIX A

MATHEMATICAL FORMULATION OF OCEAN SIMULATION

The formulation of Bryan's model is presented here based on Cox's and Semtner's modifications [Sem74, Cox84, PN88]. The continuous equations of the model are presented in Section A.1, following the description given in [Cox84, WP86], and the finite difference formulation is addressed in Section A.2, according to [Bry69, Cox84].

A.1 Continuous formulation

The mathematical model uses the Navier-Stokes equations with three basic assumptions: the Boussinesq approximation, in which density differences are neglected except in the buoyancy term; the hydrostatic assumption, in which local acceleration and other terms of equal order are eliminated from the equation of vertical motion; and the turbulent viscosity hypothesis, in which stresses exerted by scales of motion too small to be resolved by the grid are represented as an enhanced molecular mixing. Temperature and salinity are calculated using conservation equations, and the equations are linked by a simplified equation of state. The equations are written in the spherical coordinate system, with depth z defined as negative downward from z = 0 at the surface.

A.1.1 Continuous equations of the model

The equations of motion for the ocean are

    \frac{\partial u}{\partial t} + \Gamma(u) - fv = -\frac{1}{\rho_0 a\cos\phi}\,\frac{\partial p}{\partial\lambda} + F^u ,      (A.1)

    \frac{\partial v}{\partial t} + \Gamma(v) + fu = -\frac{1}{\rho_0 a}\,\frac{\partial p}{\partial\phi} + F^v ,      (A.2)

where

    F^u = A_v\,\frac{\partial^2 u}{\partial z^2} + A_h\left\{ \nabla^2 u + \frac{(1-\tan^2\phi)\,u}{a^2} - \frac{2\sin\phi}{a^2\cos^2\phi}\,\frac{\partial v}{\partial\lambda} \right\} ,      (A.3)

    F^v = A_v\,\frac{\partial^2 v}{\partial z^2} + A_h\left\{ \nabla^2 v + \frac{(1-\tan^2\phi)\,v}{a^2} + \frac{2\sin\phi}{a^2\cos^2\phi}\,\frac{\partial u}{\partial\lambda} \right\} ,      (A.4)

    \Gamma(\mu) = \frac{1}{a\cos\phi}\left[ \frac{\partial}{\partial\lambda}(u\mu) + \frac{\partial}{\partial\phi}(\cos\phi\, v\mu) \right] + \frac{\partial}{\partial z}(w\mu) ,      (A.5)

and \mu is any scalar quantity, \phi is the latitude, \lambda is the longitude, a is the radius of the earth, (u, v, w) is the velocity vector, f = 2\Omega\sin\phi, A_v is the vertical eddy viscosity coefficient, A_h is the horizontal eddy viscosity coefficient, p is the pressure, and \rho_0 is a constant approximation to the density of seawater, taken to be unity. The local pressure p is given by the hydrostatic relation

    p_z = p_s + \int_z^{0} g\rho\, dz ,      (A.6)

where p_s is the pressure at the surface of the ocean. The continuity equation is

    \frac{\partial w}{\partial z} = -\frac{1}{a\cos\phi}\left[ \frac{\partial u}{\partial\lambda} + \frac{\partial}{\partial\phi}(v\cos\phi) \right] .      (A.7)

The conservation equations are

    \frac{\partial T}{\partial t} + \Gamma(T) = K_v\,\frac{\partial^2 T}{\partial z^2} + K_h\,\nabla^2 T ,      (A.8)

and

    \frac{\partial S}{\partial t} + \Gamma(S) = K_v\,\frac{\partial^2 S}{\partial z^2} + K_h\,\nabla^2 S ,      (A.9)

where T and S are temperature and salinity, and the vertical and horizontal eddy diffusivity coefficients are represented by K_v and K_h, respectively. The right-hand sides of equations (A.8) and (A.9) represent the effects of turbulent mixing (F^T) on the tracers.

Convection is introduced into the model by assuming infinite mixing under statically unstable conditions and uniform mixing under statically stable conditions. Therefore, considering \rho_z as the local vertical density gradient, at each grid point

    K_v \to \infty \ \text{ if } \rho_z > 0 ; \qquad K_v = \text{value} \ \text{ if } \rho_z < 0 .      (A.10)

The equation of state is

    \rho = \rho(T, S, z) ,      (A.11)

where \rho(T, S, z) is taken to be a nine-term, third-order polynomial approximation to the Knudsen formula for the density of seawater, as described in [BC72].

A.1.2 Boundary conditions

The boundary conditions at the ocean surface (z = 0) are

    \rho_0 A_v\,\frac{\partial}{\partial z}(u, v) = (\tau^\lambda, \tau^\phi) ,      (A.12)

    K_v\,\frac{\partial}{\partial z}(T, S) = 0 ,      (A.13)

and

    w = 0 ,      (A.14)

where \tau^\lambda and \tau^\phi are the zonal and meridional components of the surface stress. The "rigid-lid" assumption of zero vertical motion at the surface (A.14) filters out high-frequency motions, which would otherwise have restricted the time step through the CFL condition. This filtering allows the use of a reasonably large time step in the numerical integration. It is important to note that these high-speed external gravity waves have practically no influence on the lower-frequency motions of climate processes.

At the bottom of the basin, z = -H(\lambda, \phi), the boundary conditions are

    \rho_0 A_v\,\frac{\partial}{\partial z}(u, v) = (\tau_B^\lambda, \tau_B^\phi) ,      (A.15)

    \frac{\partial}{\partial z}(T, S) = 0 ,      (A.16)

where \tau_B^\lambda and \tau_B^\phi are bottom stresses. Equation (A.16) implies that there is no vertical flux of sensible temperature or salinity at the bottom of the ocean. The bottom boundary condition on the vertical velocity is

    w = -\frac{u}{a\cos\phi}\,\frac{\partial H}{\partial\lambda} - \frac{v}{a}\,\frac{\partial H}{\partial\phi} .      (A.17)

At the side wall boundaries, the normal and tangential horizontal velocities and the horizontal fluxes of sensible temperature and salinity are set to zero.

A.1.3 The stream function

Combining equations (A.1) and (A.2) with (A.6), we have

    \frac{\partial u}{\partial t} = \frac{\partial u'}{\partial t} - \frac{1}{a\cos\phi}\,\frac{\partial p_s}{\partial\lambda} ,      (A.18)

    \frac{\partial v}{\partial t} = \frac{\partial v'}{\partial t} - \frac{1}{a}\,\frac{\partial p_s}{\partial\phi} ,      (A.19)

where

    \frac{\partial u'}{\partial t} = -\Gamma(u) + fv - \frac{g}{a\cos\phi}\int_z^0 \rho_\lambda\, dz' + F^u ,      (A.20)

    \frac{\partial v'}{\partial t} = -\Gamma(v) - fu - \frac{g}{a}\int_z^0 \rho_\phi\, dz' + F^v .      (A.21)

Using the vertical averaging operator

    \bar{\mu} = \frac{1}{H}\int_{-H}^{0} \mu\, dz ,      (A.22)

the velocities u and v can be written as

    u = \hat{u} + \bar{u} ,      (A.23)

    v = \hat{v} + \bar{v} ,      (A.24)

and, since p_s is not a function of depth,

    \frac{\partial\hat{u}}{\partial t} = \frac{\partial u'}{\partial t} - \overline{\frac{\partial u'}{\partial t}} ,      (A.25)

    \frac{\partial\hat{v}}{\partial t} = \frac{\partial v'}{\partial t} - \overline{\frac{\partial v'}{\partial t}} .      (A.26)

All terms on the right of equations (A.20) and (A.21) are known; hence, under the rigid-lid boundary condition, and using the vertically averaged velocity components (\bar{u}, \bar{v}), the external mode of momentum may be written in terms of a volume transport stream function \psi,

    \bar{u} = -\frac{1}{Ha}\,\frac{\partial\psi}{\partial\phi} ,      (A.27)

    \bar{v} = \frac{1}{Ha\cos\phi}\,\frac{\partial\psi}{\partial\lambda} .      (A.28)

A prognostic equation for \psi is obtained by forming the vertical averages of (A.18) and (A.19) and eliminating the terms in p_s by taking the vertical component of the curl operator, given by

    \mathrm{curl}_z(v_t, u_t) = \frac{1}{a\cos\phi}\left[ \frac{\partial^2 v}{\partial t\,\partial\lambda} - \frac{\partial}{\partial\phi}\!\left( \frac{\partial u}{\partial t}\cos\phi \right) \right] .      (A.29)

Substituting equations (A.27) and (A.28), we obtain

    \frac{1}{a^2}\left( \frac{\partial}{\partial\lambda}\left[ \frac{1}{H\cos\phi}\,\frac{\partial^2\psi}{\partial\lambda\,\partial t} \right] + \frac{\partial}{\partial\phi}\left[ \frac{\cos\phi}{H}\,\frac{\partial^2\psi}{\partial\phi\,\partial t} \right] \right) = \frac{1}{a}\left[ \frac{\partial^2\bar{v}'}{\partial\lambda\,\partial t} - \frac{\partial}{\partial\phi}\!\left( \frac{\partial\bar{u}'}{\partial t}\cos\phi \right) \right] .      (A.30)

Equation (A.30) is a prognostic equation for \psi, but it also requires the solution of an elliptic equation for \partial\psi/\partial t. The boundary conditions for \psi at the lateral walls are

    \psi = \frac{\partial\psi}{\partial t} = 0 .      (A.31)

If the basin has no islands, the boundaries may be considered as a continent and \psi may be held constant in time along the land mass. If islands are present, the value of \psi must be spatially constant along each individual coastline, but the associated constant varies for each island, reflecting the changing circulation; hence it must be predicted by the governing equations. Hole relaxation is used, and p_s is required to be a single-valued function, in such a way that the line integral of the quantity \nabla p_s around the coastline of each island vanishes. The predictive equation is

    \oint \left[ \frac{1}{H\cos\phi}\,\frac{\partial^2\psi}{\partial t\,\partial\lambda}\, d\phi - \frac{\cos\phi}{H}\,\frac{\partial^2\psi}{\partial t\,\partial\phi}\, d\lambda \right] = a\oint \left[ \frac{\partial\bar{v}'}{\partial t}\, d\phi + \frac{\partial\bar{u}'}{\partial t}\cos\phi\, d\lambda \right] ,      (A.32)

and is obtained by applying the vertical averaging operator (A.22) to equations (A.18) and (A.19), integrating the first equation over \lambda and the second over \phi around the coast of each island, adding the results, and imposing the line-integral condition. Finally, applying Stokes' theorem, we obtain the following equation, which is an area integral of equation (A.30) taken over the islands:

    \frac{1}{a^2}\int_A \left( \frac{\partial}{\partial\lambda}\left[ \frac{1}{H\cos\phi}\,\frac{\partial^2\psi}{\partial\lambda\,\partial t} \right] + \frac{\partial}{\partial\phi}\left[ \frac{\cos\phi}{H}\,\frac{\partial^2\psi}{\partial\phi\,\partial t} \right] \right) dA = \frac{1}{a}\int_A \left[ \frac{\partial^2\bar{v}'}{\partial\lambda\,\partial t} - \frac{\partial}{\partial\phi}\!\left( \frac{\partial\bar{u}'}{\partial t}\cos\phi \right) \right] dA .      (A.33)

A.2 Finite Difference Formulation

A.2.1 Integral constraints

The initial value problem described in Section A.1 may be solved numerically using finite difference techniques. Nevertheless, it is necessary to ensure that certain integral constraints are maintained during the numerical solution of the problem. In the absence of dissipative effects, momentum, energy, and the variance of temperature and salinity should be conserved. Arakawa [Ara66] was the first to work with a finite difference formulation maintaining the integral constraints, and this approach was generalized by Bryan [Bry69], allowing the arrangement of the cells to be chosen in any manner that is convenient for the problem.

The first constraint that must be satisfied is mass conservation within each cell. It is shown by Arakawa [Ara66] that if integral constraints on energy are maintained, nonlinear instability can be avoided. Considering each cell as a regular rectangular array, the number of neighbor cells is 6. In this case,

    \sum_{b=1}^{6} V_b A_b = 0 ,      (A.34)

where A_b is the area of interface b and V_b is the velocity normal to interface b into the cell.

To guarantee conservation of momentum, temperature, and salinity, the volume integral I' of any conserved quantity q must remain unchanged by the advective process. Therefore

    \frac{dI'}{dt} = -\sum_{n=1}^{N}\sum_{b=1}^{6} q_b V_b A_b = 0 ,      (A.35)

where N is the total number of cells and q_b is the value of q on interface b. This integral vanishes because the contributions of adjacent interfaces are equal and of opposite sign. Hence, the various terms on the right-hand side appear twice and cancel when the sum is taken over the entire volume; the terms that appear only once are the boundary terms, for which V_b is zero.

The conservation of kinetic energy and of the variance of temperature and salinity is guaranteed if the volume integral I'' of the square of q is unchanged by advection:

    \frac{dI''}{dt} = -2\sum_{n=1}^{N}\sum_{b=1}^{6} q_b\, Q\, V_b A_b = 0 ,      (A.36)

where Q is the average of q within the cell. The right-hand side of this integral does not vanish for all definitions of q_b; however, using Q_b as the average of q in the cell adjacent to the interface, and the following interpolation formula for the interface value q_b,

    q_b = \frac{1}{2}\,(Q + Q_b) ,      (A.37)

we can rewrite the integral as

    \frac{dI''}{dt} = -\sum_{n=1}^{N}\left[ \sum_{b=1}^{6} Q^2 V_b A_b + \sum_{b=1}^{6} Q\, Q_b V_b A_b \right] .      (A.38)

The first sum inside the brackets vanishes because of the continuity relation (A.34), and the second term is zero due to the same cancelling process that occurs in (A.35).

The fourth constraint is that the system must maintain a balance between the kinetic energy that is gained (or lost) through the pressure term of the momentum equations and the loss (or gain) of potential energy through the advection terms of the conservation equation for density. The continuous form of this constraint is

    \int_V \left[ \frac{u}{a\cos\phi}\left(\frac{p}{\rho_0}\right)_{\!\lambda} + \frac{v}{a}\left(\frac{p}{\rho_0}\right)_{\!\phi} \right] dV = -\int_V g z\, \Gamma(\rho)\, dV .      (A.39)

The last constraint is due to the existence of an insulating boundary condition for the temperature at all boundaries of the basin other than the surface. Therefore the following equation must hold:

    \int_V F^T\, dV = \int_A \dots\, dA ,      (A.40)

where the integral on the right is taken over the area of the surface of the basin.
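The cancellation argument behind (A.34)-(A.38) is easy to verify numerically. The self-contained Fortran sketch below (a one-dimensional periodic row of cells with a uniform interface velocity, so that (A.34) holds trivially in each cell) shows that advective tendencies built with the interpolation (A.37) change neither the first moment nor the variance of Q, up to round-off; names and the simplified geometry are illustrative only.

    program conservation_check
      implicit none
      integer, parameter :: n = 64
      real :: q(n), tend(n), qp, qm, v
      integer :: i, ip, im
      v = 1.0                              ! uniform interface velocity
      call random_number(q)
      do i = 1, n
         ip = mod(i, n) + 1                ! periodic neighbours
         im = mod(i - 2 + n, n) + 1
         qp = 0.5 * (q(i) + q(ip))         ! interface value, eq. (A.37)
         qm = 0.5 * (q(im) + q(i))
         tend(i) = -v * (qp - qm)          ! advective tendency of cell i
      end do
      print *, 'd/dt sum(Q)    =', sum(tend)          ! ~0, cf. (A.35)
      print *, 'd/dt sum(Q**2) =', sum(2.0*q*tend)    ! ~0, cf. (A.36)-(A.38)
    end program conservation_check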


A.2.2 Grid representation

The finite difference method uses a staggered grid with indices i, j, and k representing the eastward, northward, and downward positions, respectively, and cells with dimensions \Delta_i\lambda, \Delta_j\phi, and \Delta_k z. Horizontally, the tracer quantities (temperature and salinity) and the two-dimensional stream function are positioned at the center of the cell, whereas the horizontal velocity components are placed at the corners, as shown in Figure A.1. In the vertical grid, presented in Figure A.2, the tracer quantities and the velocities u and v are positioned at the center of the vertical dimension of the cell. The quantities W^T and W^v are the vertical velocities that are computed; the first is used for the computation of the tracers, and the second for the computation of u and v. In each case, the velocity is calculated at the intersection of the horizontal interface of the cell with the vertical line associated with the prognostic variable T, S, u, or v.

A.2.3 Finite difference equations

In this description of the finite difference formulation the following finite difference operators will be used:

    \delta_\alpha(\mu_i) = \frac{\mu_{i+1/2} - \mu_{i-1/2}}{\Delta_i\alpha} ,      (A.41)

    \bar{\mu}_i^{\,\alpha} = \frac{1}{2}\left( \mu_{i+1/2} + \mu_{i-1/2} \right) ,      (A.42)

    \max^\alpha(\mu_i) = \text{maximum of } \left( \mu_{i+1/2},\ \mu_{i-1/2} \right) ,      (A.43)

    \min^\alpha(\mu_i) = \text{minimum of } \left( \mu_{i+1/2},\ \mu_{i-1/2} \right) ,      (A.44)

where \alpha is either the longitudinal (\lambda) or the meridional (\phi) direction. In the remainder of the text the indices i and j denote the position of the variable being considered. Therefore, in some equations i and j are full integers, corresponding to the tracers, and in others they are half integers, corresponding to u and v; the correct position is implied by the variable being indexed. Furthermore, if the indices are omitted, the values i, j, and k are the ones being considered.

The depth of each cell, H^v, is the minimum of the depths of the four vertical columns of the cell, and the total depth of the basin, H^T, is defined as the minimum of the depths of the cells:

    H^T = \min^\lambda\!\left( \min^\phi(H^v) \right) .      (A.45)

The discrete form of the vertical averaging operator (A.22) is

    \bar{\mu} = \frac{1}{H}\sum_{k=1}^{K} \mu\, \Delta_k z .      (A.46)

 T,S,

j+2

j + 3=2

u; v

j + 1T,S,

j + 1=2

j

u; v

u; v

i ? 1=2

u; v

T,S,

u; v

T,S,

j?1 i?1

u; v

T,S,

T,S,

j ? 1=2

T,S,

T,S,

u; v

T,S,

T,S,

u; v

T,S, T,S, i + 1=2 i i+1 Figure A.1: Horizontal grid.



u; v

i + 3=2

i+2

70

k ? 1=2 W v

WT

 Wv

u; v

T,S

u; v

T,S

u; v

k + 1=2 W v

WT

Wv

WT

Wv z

k+1

u; v

T,S

u; v

T,S

u; v

k + 3=2 W v

WT

Wv

WT

Wv

k

WT

Wv

i ? 1=2 i i + 1=2 i + 1 Figure A.2: Vertical grid.

i + 3=2

Equations (A.23) and (A.24) can be rewritten as

    u = \hat{u} + \bar{u} ,      (A.47)

    v = \hat{v} + \bar{v} ,      (A.48)

and the external mode of momentum, written in terms of the stream function \psi in (A.27) and (A.28), becomes

    \bar{u} = -\frac{1}{Ha}\,\delta_\phi\psi ,      (A.49)

    \bar{v} = \frac{1}{Ha\cos\phi}\,\delta_\lambda\psi .      (A.50)

Using centered differencing in time, equations (A.25) and (A.26) are written as

    \delta_t\hat{u} = \delta_t u' - \overline{\delta_t u'} ,      (A.51)

    \delta_t\hat{v} = \delta_t v' - \overline{\delta_t v'} ,      (A.52)

and, with N+1 indicating the time step being predicted, equations (A.20) and (A.21) may be written as

    \delta_t u' - f\gamma\, v'^{\,N+1} = G^u ,      (A.53)

    \delta_t v' + f\gamma\, u'^{\,N+1} = G^v ,      (A.54)

where

    G^u = -\Gamma(u) + f(1-\gamma)\, v^{N-1} - \frac{g}{a\cos\phi}\sum_{\kappa=1/2}^{k-1/2} \delta_\lambda\bar{\rho}\ \Delta_\kappa z + F^u ,      (A.55)

    G^v = -\Gamma(v) - f(1-\gamma)\, u^{N-1} - \frac{g}{a}\sum_{\kappa=1/2}^{k-1/2} \delta_\phi\bar{\rho}\ \Delta_\kappa z + F^v .      (A.56)

The advective operator \Gamma is defined as

    \Gamma(\mu) = \frac{1}{a\cos\phi}\left[ \delta_\lambda\{ u^*\,\bar{\mu}^\lambda \} + \delta_\phi\{ v^*\,\bar{\mu}^\phi \} \right] + \delta_z\!\left( w^v\,\bar{\mu}^z \right) ,      (A.57)

where u^* and v^* are written as

    u^* = \hat{u} - \frac{\delta_\phi\psi}{a\,\max^\phi(H)} ,      (A.58)

    v^* = \hat{v}\cos\phi + \frac{\delta_\lambda\psi}{a\,\max^\lambda(H)} ,      (A.59)

and w^v is given by

    \delta_z(w^v) = -\frac{1}{a\cos\phi}\left[ \delta_\lambda u^* + \delta_\phi v^* \right] ,      (A.60)

along with the boundary condition

    w^v_{k=1/2} = 0 .      (A.61)

The frictional terms F^u and F^v are delayed by one time step to avoid numerical instability [RM67], and are written as

    F^u = A_v\,\delta_z(\delta_z u^{N-1}) + \frac{A_h}{a^2}\left[ \nabla^2 u^{N-1} + (1-\tan^2\phi)\, u^{N-1} - \frac{2\sin\phi}{\cos^2\phi}\,\delta_\lambda v^{N+1} \right] ,      (A.62)

    F^v = A_v\,\delta_z(\delta_z v^{N-1}) + \frac{A_h}{a^2}\left[ \nabla^2 v^{N-1} + (1-\tan^2\phi)\, v^{N-1} + \frac{2\sin\phi}{\cos^2\phi}\,\delta_\lambda u^{N+1} \right] ,      (A.63)

where

    \nabla^2(\mu) = \delta_\lambda(\delta_\lambda\mu) + \frac{1}{\cos\phi}\,\delta_\phi(\cos\phi\,\delta_\phi\mu) .      (A.64)

Equation (A.60) automatically satisfies the constraints (A.34) and (A.35), and the constraint (A.36) is satisfied by an approach similar to (A.37), using the average of the neighboring values to express the advective quantity \mu at the various interfaces. The special quantities (A.58) and (A.59) are used to satisfy the bottom boundary condition on the vertical velocity, defined by (A.17).

The nite di erence form of (A.30) for the stream function is given by " # # " cos  1 t t   2  t +   2  t ? H a cos  Ha " " # #   f f    Ha  N +1  = ?  Ha  N +1  = = 1  (Gv )= ?  (Gu  cos )= : (A.65)  a  The boundary condition (A.31) is satis ed by setting constant along the two rows of cells at the boundaries of the basin. This constant may be zero if the basin is simply connected. The nite di erence form of (A.33) must be used if islands are considered. In this case, the approach is to take an area weighted sum of (A.65), considering all cells in which takes island values.

A.2.5 Finite di erence form of the tracer equations

The predictive equations for salinity and temperature have similar forms, therefore only the nite di erence form for a general tracer T will be written, tT t = ?(T ) + F T  ; (A:66) where ?(T ) and the di usive operator F T  are written as     i  h  ?(T ) = cos   uT  = +  v cos T  = + z wT T z ; (A:67) a (A:68) F T  = Kv z z T + Ka2h r2T ; and wT is given by      i h  z wT = ? cosa   u = +  v cos  = ; (A:69) along with the boundary condition wkt =1=2 = 0 : (A:70) The constraints (A.34) and (A.35) are satis ed by (A.70), and constraint (A.36) is satis ed by the same argument used for the advective operator of momentum. The additional weighting by  and  under the bar operators, is required to satisfy constraint (A.39). Finally, a vertical mixing with in nite mixing coecient is simulated at the end of each time step, to achieve the convective mixing described in (A.10). This process is performed by testing the vertical static stability of each column of cells, taking the volume average of the tracer values for all cells that are found to be unstable, and resetting each cell to the average.


REFERENCES

[ADLM88] P. Andrich, P. Delecluse, C. Levy, and G. Madec. A multitasked general circulation model of the ocean. In Proceedings Fourth International Symposium, Cray Research, pages 407-428, 1988.
[All87] Alliant Computer Systems Corporation, 42 Nagog Park, Acton, MA 01720. FX/Fortran Language Manual, 1987.
[AM88] P. Andrich and G. Madec. Performance evaluation for an ocean general circulation model: vectorization and multitasking. In 1988 International Conference on Supercomputing, pages 295-302, St. Malo, France, July 1988. ACM.
[Ara66] A. Arakawa. Computational design for long-term numerical integration of the equations of fluid motion: two-dimensional incompressible flow. Part 1. Journal of Computational Physics, 1:119-143, 1966.
[Ban76] U. Banerjee. Data dependence in ordinary programs. Master's thesis, University of Illinois at Urbana-Champaign, Dept. of Computer Science, November 1976.
[BB87] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, C-36:570-580, May 1987.
[BC67] K. Bryan and M. D. Cox. A numerical investigation of the oceanic general circulation. Tellus, XIX:54-80, 1967.
[BC72] K. Bryan and M. D. Cox. An approximate equation of state for numerical models of ocean circulation. Journal of Physical Oceanography, 2:510-514, 1972.
[Bel81] J. Bell. Report of interview with Bert Semtner re: Oceanic GCM. Unpublished, March 1981.
[BP87] J. L. Bell and G. S. Patterson Jr. Data organization in large numerical computations. The Journal of Supercomputing, 1:105-136, 1987.

[Bry63] K. Bryan. A numerical investigation of a nonlinear model of a wind-driven ocean. Journal of Atmospheric Sciences, 20:594-606, 1963.
[Bry69] K. Bryan. A numerical method for the study of the circulation of the world ocean. Journal of Computational Physics, 4:347-376, 1969.
[CHA90] Building an advanced climate model: Program plan for the CHAMMP climate modeling program. Technical Report DOE/ER-0479T, U.S. Dept. of Energy, Washington, D.C., Dec. 1990.
[CKS78] S. C. Chen, D. J. Kuck, and A. H. Sameh. Practical parallel band triangular system solvers. ACM Trans. on Mathematical Software, 4(3):270-277, Sept. 1978.
[Cox84] M. D. Cox. A primitive equation, 3-dimensional model of the ocean. Technical Report 1, Geophysical Fluid Dynamics Laboratory/NOAA, Princeton University, Princeton, NJ 08542, August 1984.
[Cra85] Cray Research Inc. Multitasking User Guide, January 1985.
[CS88] R. M. Chervin and A. J. Semtner Jr. An ocean modelling system for supercomputer architectures of the 1990s. In M. E. Schlesinger, editor, Proceedings of the NATO Advanced Research Workshop on Climate-Ocean Interaction, pages 87-95. Kluwer Academic Publishers, 1988.
[DGG89] L. De Rose, K. Gallivan, and E. Gallopoulos. Trace analysis of the GFDL ocean circulation model: A preliminary study. Technical Report 863, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801, 1989.
[EHJP90] R. Eigenmann, J. Hoeflinger, G. Jaxon, and D. Padua. Cedar Fortran and its restructuring compiler. Technical Report 1041, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801, 1990.
[Emr85] P. Emrath. Xylem: an operating system for the Cedar multiprocessor. IEEE Software, 2(4):30-37, July 1985.
[EPY89] P. Emrath, D. Padua, and P. Yew. Cedar architecture and its software. In 22nd Hawaii International Conference on System Sciences, 1989.
[GJT+91] K. Gallivan, W. Jalby, S. Turner, A. Veidenbaum, and H. Wijshoff. Preliminary basic performance analysis of the Cedar multiprocessor memory systems. Proceedings of ICPP'91, St. Charles, IL, I:71-75, August 12-16, 1991.

[GPHL90] M. D. Guzzi, D. A. Padua, J. P. Hoeflinger, and D. H. Lawrie. Cedar Fortran and other vector and parallel Fortran dialects. Journal of Supercomputing, pages 37-62, March 1990.
[GS89] E. Gallopoulos and Youcef Saad. Parallel block cyclic reduction algorithm for the fast solution of elliptic equations. Parallel Computing, 10(2):143-160, 1989.
[Hoe91] Jay Hoeflinger. Cedar Fortran programmer's handbook. Technical report, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Res. & Dev., October 1991. CSRD Report No. 1157.
[Kas77] A. Kasahara. Computational aspects of numerical models for weather prediction and climate simulation. Methods in Computational Physics, 17:2-66, 1977.
[KDLS86] D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh. Parallel supercomputing today and the Cedar approach. Science, 231:967-974, February 1986.
[KKP+81] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. Proc. of the 8th ACM Symp. on Principles of Programming Languages (POPL), pages 207-218, Jan. 1981.
[KTV+91] J. Konicek, T. Tilton, A. Veidenbaum, C. Zhu, E. Davidson, R. Downing, M. Haney, M. Sharma, P. Yew, P. Farmwald, D. Kuck, D. Lavery, R. Lindsey, D. Pointer, J. Andrews, T. Beck, T. Murphy, S. Turner, and N. Warter. The organization of the Cedar system. In 1991 Int'l Conference on Parallel Processing, 1991.
[Kuc87] Kuck and Associates, Inc., Savoy, IL 61874. KAP User's Guide, 4th edition, 1987.
[Mal87] A. D. Malony. High resolution process timing user's manual. Technical report, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Res. & Dev., June 1987. CSRD Report No. 676.
[ME87] R. McGrath and P. Emrath. Using memory in the Cedar system. Technical Report 655, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801, 1987.
[NW87] D. M. Nicol and F. H. Willard. Problem size, parallel architecture, and optimal speedup. Technical Report 87-7, ICASE - Institute for Computer Applications in Science and Engineering, April 1987.

[PDR90] R. C. Pacanowski, K. Dixon, and A. Rosati. GFDL MOM 1.0. Geophysical Fluid Dynamics Laboratory/NOAA, December 1990.
[PN88] N. Pinardi and A. Navarra. A brief review of global Mediterranean wind-driven general circulation experiments. Technical Report 132, IMGA-CNR, Modena, Italy, 1988.
[PW86] D. Padua and M. Wolfe. Advanced compiler optimization for supercomputers. CACM, 29(12):1184-1201, December 1986.
[RAP87] D. A. Reed, L. M. Adams, and M. L. Patrick. Stencils and problem partitionings: Their influence on the performance of multiple processor systems. IEEE Transactions on Computers, C-36:845-858, July 1987.
[RM67] R. D. Richtmyer and K. W. Morton. Difference Methods for Initial Value Problems. Interscience, second edition, 1967.
[Sar66] A. S. Sarkisyan. Osnovy teorii i raschet okeanicheskyky techeny (Fundamentals of the theory and calculation of ocean currents). Gidrometeoizdat, Moscow, 1966.
[SC88] A. J. Semtner Jr. and R. M. Chervin. A simulation of the global ocean circulation with resolved eddies. Journal of Geophysical Research, 93(C12):15502-15522, 1988.
[SDM91] R. D. Smith, J. K. Dukowicz, and R. C. Malone. Massively parallel global ocean modeling. Technical Report LA-UR-91-2583, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, 1991.
[Sem74] A. J. Semtner Jr. An oceanic general circulation model with bottom topography. Technical Report 9, UCLA Dept. of Meteorology, 1974.
[Sem86a] A. J. Semtner Jr. Finite-difference formulation of a world ocean model. In J. J. O'Brien, editor, Proc. NATO Institute of Advanced Physical Oceanographic Numerical Modelling, pages 187-202. D. Reidel Publishing Co., 1986.
[Sem86b] A. J. Semtner Jr. History and methodology of modelling the circulation of the world ocean. In J. J. O'Brien, editor, Proc. NATO Institute of Advanced Physical Oceanographic Numerical Modelling, pages 23-32. D. Reidel Publishing Co., 1986.
[SH89] J. P. Singh and J. L. Hennessy. Parallelizing the simulation of ocean eddy currents. Technical Report CSL-TR-89-388, Computer Systems Laboratory, Stanford University, Stanford, CA 94305-4055, August 1989.

[SS75] P. Swarztrauber and R. Sweet. Efficient Fortran subprograms for the solution of elliptic partial differential equations. Technical Report TN/IA-109, NCAR, National Center for Atmospheric Research, July 1975.
[Swe88] R. A. Sweet. A parallel and vector cyclic reduction algorithm. SIAM J. Sci. Statist. Comput., 9(4):761-765, July 1988.
[Tak74] K. Takano. A general circulation model for the world ocean. Technical Report 8, UCLA Dept. of Meteorology, 1974.
[Thu90] Michael Thune. A partitioning strategy for explicit difference methods. Parallel Computing, 15:147-154, 1990.
[WP86] W. M. Washington and C. L. Parkinson. An Introduction to Three-Dimensional Climate Modeling. University Science Books, 1986.
[Yew86] P. C. Yew. Architecture of the Cedar parallel supercomputer. Technical Report 609, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801, 1986.
[ZY87] C. Zhu and P. Yew. A scheme to enforce data dependence on large multiprocessor systems. IEEE Transactions on Software Engineering, SE-13(6):726-739, June 1987.