A Parallel Algebraic Multigrid Solver on Graphics Processing Units*

Gundolf Haase (1), Manfred Liebmann (1,2), Craig C. Douglas (2), and Gernot Plank (3)

(1) Institute for Mathematics and Scientific Computing, University of Graz
(2) Department of Mathematics, University of Wyoming
(3) Computing Laboratory, Oxford University

Abstract. The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.

1 Introduction

Key elements of numerical algorithms in scientific computing are now accessible on graphics processing units (GPUs) due to recent hardware and software advancements, specifically the support of IEEE 754 double precision floating point arithmetic and the introduction of the Nvidia CUDA technology. These hardware and software enhancements remove the limitations of the stream computing concepts of the previous generation of graphics processors and significantly simplify the design and implementation of complex numerical algorithms on many-core GPU architectures. The paper presents the design and implementation of a parallel preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) [6, 3, 14, 9, 8]. The PCG-AMG algorithm requires symmetric positive definite system matrices, which typically appear in finite element simulations of potential problems in electrodynamics or in fluid dynamics simulations. A performance comparison of the PCG-AMG solver on different hardware platforms shows a significant performance advantage for GPU based systems over traditional shared memory servers and cluster computers. A benchmark for an elliptic model problem derived from a 3D virtual heart simulation [13] shows that a single Nvidia Tesla C1060 board delivers the same performance as a sixteen node Infiniband cluster with 32 Opteron CPU cores.

* This publication is based on work supported in part by NSF grants OISE-0405349, ACI-0305466, CNS-0719626, and ACI-0324876, by DOE project DE-FC26-08NT4, by FWF project SFB032, by BMWF project AustrianGrid 2, and Award No. KUSC1-016-04, made by King Abdullah University of Science and Technology (KAUST).

A GPU server with four Nvidia Geforce GTX 295 dual-GPU boards is about 100 times faster than a typical AMD Opteron CPU core. A detailed performance comparison of the PCG-AMG solver for the elliptic model problem on a variety of hardware is given in Table 1. Table 2 details the parallel performance of the setup of the algebraic multigrid (AMG) preconditioner. The key performance ingredient of the GPU implementation of the PCG-AMG algorithm is an efficient implementation of the sparse matrix-vector multiplication. An interleaved compressed row storage (ICRS) data format for sparse matrices, introduced in §3.2, provides the basis for an efficient matrix-vector multiplication scheme for unstructured matrices. Previous work on multigrid algorithms on graphics processors can be found in [8, 15].
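To fix notation for the following sections, the sketch below shows one way to organize the PCG iteration with the AMG V-cycle acting as preconditioner. It is host-side C++ with the matrix-vector product and the V-cycle passed in as callbacks; all names (pcg, apply_A, apply_M, Vec, Op) are illustrative and not taken from the authors' implementation.

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    using Op  = std::function<void(const Vec&, Vec&)>;  // y = Op(x)

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Preconditioned conjugate gradients: apply_A is the sparse matrix-vector
    // product of Sect. 3, apply_M one AMG V-cycle (Algorithm 1) used as
    // preconditioner z ~ A^{-1} r.  Returns the number of iterations.
    int pcg(const Op& apply_A, const Op& apply_M, const Vec& f, Vec& u,
            int max_it = 64, double tol = 1e-8) {
        const std::size_t n = f.size();
        Vec r(n), z(n), p(n), q(n);
        apply_A(u, q);
        for (std::size_t i = 0; i < n; ++i) r[i] = f[i] - q[i];   // r = f - A u
        apply_M(r, z);                                            // z = M^{-1} r
        p = z;
        double rz = dot(r, z);
        for (int it = 1; it <= max_it; ++it) {
            apply_A(p, q);                                        // q = A p
            const double alpha = rz / dot(p, q);
            for (std::size_t i = 0; i < n; ++i) { u[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            if (std::sqrt(dot(r, r)) < tol) return it;
            apply_M(r, z);
            const double rz_new = dot(r, z);
            const double beta = rz_new / rz;
            rz = rz_new;
            for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
        return max_it;
    }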

2 Algebraic Multigrid

We solve a sparse system of linear equations Au = f with an unknown solution vector u and a square, positive definite system matrix A derived from a finite element discretization of a partial differential equation in a 3D domain. We solve this system of linear equations using multigrid, in particular an algebraic version. We give a brief sketch of the methods (see [4, 14, 16] for more details).

2.1 Multigrid

The standard multigrid V-cycle is formulated as a recursive scheme in Algorithm 1. The algorithm requires as inputs the matrix hierarchy Al (A0 := A), the prolongation and restriction operators Pl and Rl, and the pre- and post-smoothers Sl and Tl, which are typically Jacobi or forward/backward Gauss-Seidel smoothers, for 0 ≤ l < L. Furthermore, a coarse grid solver, denoted AL^{-1}, and the right hand side on the finest level f0 := f are required. The algorithm returns the solution at the finest level u := u0. Implementation details for these algorithms can be found in [7, 4, 5]. Most steps of Alg. 1 require a sparse matrix-vector multiplication, which motivates us to investigate this operation in more detail in §3.2.

2.2 Algebraic Multigrid

Algorithm 1 assumes that all coarser discretizations, all matrices Al, and all intergrid transfer operators Rl and Pl are known. If only the finest matrix A0 is given, then all of the missing information must be created in a setup process [14, 16]. AMG performs the following setup steps on each level l = 0, 1, ..., L−1:

1. Coarsen the set of nodes ωl, such that ωl+1 ⊂ ωl.
2. Derive the restriction Rl : R^{ωl} → R^{ωl+1} and the prolongation Pl : R^{ωl+1} → R^{ωl} for the nodes in ωl and ωl+1 from their entries in the matrix Al.
3. Calculate the coarse matrix by the sparse triple product Al+1 = Rl · Al · Pl.

The parallelization of multigrid and AMG on distributed memory systems follows [10, 12].
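For illustration, the setup loop over the three steps above can be organized as in the following host-side sketch. SparseMatrix, Transfer, amg_setup, and the two callbacks are placeholders introduced here; the coarsening and the Galerkin product are not reproduced and are assumed to be supplied by the caller.

    #include <functional>
    #include <vector>

    struct SparseMatrix { /* cnt, dsp, col, ele arrays, see Sect. 3.3 */ };
    struct Transfer     { SparseMatrix R, P; };   // restriction R_l and prolongation P_l

    // Callbacks standing in for steps 1-2 (coarsening and transfer operators)
    // and step 3 (sparse Galerkin triple product A_{l+1} = R_l * A_l * P_l).
    using CoarsenFn  = std::function<Transfer(const SparseMatrix&)>;
    using GalerkinFn = std::function<SparseMatrix(const SparseMatrix& R,
                                                  const SparseMatrix& A,
                                                  const SparseMatrix& P)>;

    // Build the hierarchy A_0,...,A_L and the transfer operators from the finest matrix.
    void amg_setup(const SparseMatrix& A0, int L,
                   const CoarsenFn& coarsen, const GalerkinFn& galerkin,
                   std::vector<SparseMatrix>& A, std::vector<Transfer>& T) {
        A.assign(1, A0);
        T.clear();
        for (int l = 0; l < L; ++l) {
            T.push_back(coarsen(A[l]));                    // steps 1 and 2
            A.push_back(galerkin(T[l].R, A[l], T[l].P));   // step 3
        }
    }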

Algorithm 1 Sequential Multigrid Algorithm: multigrid(l)
Require: fl, ul, rl, sl, Al, Rl, Pl, Sl, Tl, f0 ← f
  if l < L then
    ul ← 0
    ul ← Sl(ul, fl)        {pre-smoothing}
    rl ← fl − Al ul        {residual}
    fl+1 ← Rl rl           {restriction}
    multigrid(l+1)         {recursion on the coarser level}
    sl ← Pl ul+1           {prolongation}
    ul ← ul + sl           {coarse grid correction}
    ul ← Tl(ul, fl)        {post-smoothing}
  else
    uL ← AL^{-1} fL        {coarse grid solver}
  end if
  u ← u0
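Read as code, Algorithm 1 is a short recursive routine in which the residual, restriction, and prolongation steps are exactly the sparse matrix-vector products discussed in Sect. 3. The sketch below uses placeholder declarations for the smoothers, the coarse solver, and spmv; it is an illustration, not the authors' implementation.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<double>;
    struct SparseMatrix;  // CRS/ICRS matrix, see Sect. 3.3

    // Placeholders for S_l, T_l, A_L^{-1} and the matrix-vector product.
    void smooth_pre  (const SparseMatrix& A, Vec& u, const Vec& f);
    void smooth_post (const SparseMatrix& A, Vec& u, const Vec& f);
    void coarse_solve(const SparseMatrix& A, Vec& u, const Vec& f);
    void spmv(const SparseMatrix& A, const Vec& x, Vec& y);

    struct Level { SparseMatrix* A; SparseMatrix* R; SparseMatrix* P; Vec u, f, r, s; };

    void multigrid(std::vector<Level>& lev, std::size_t l) {
        Level& c = lev[l];
        if (l + 1 < lev.size()) {                       // l < L
            std::fill(c.u.begin(), c.u.end(), 0.0);     // u_l <- 0
            smooth_pre(*c.A, c.u, c.f);                 // u_l <- S_l(u_l, f_l)
            spmv(*c.A, c.u, c.r);                       // r_l <- f_l - A_l u_l
            for (std::size_t i = 0; i < c.r.size(); ++i) c.r[i] = c.f[i] - c.r[i];
            spmv(*c.R, c.r, lev[l + 1].f);              // f_{l+1} <- R_l r_l
            multigrid(lev, l + 1);                      // recursion
            spmv(*c.P, lev[l + 1].u, c.s);              // s_l <- P_l u_{l+1}
            for (std::size_t i = 0; i < c.u.size(); ++i) c.u[i] += c.s[i];
            smooth_post(*c.A, c.u, c.f);                // u_l <- T_l(u_l, f_l)
        } else {
            coarse_solve(*c.A, c.u, c.f);               // u_L <- A_L^{-1} f_L
        }
    }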

3

3.1

      ...(parms);                //GPU kernel launch
      cudaUnbindTexture(tex_u);  //Unbind the texture
    }
  }
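The surviving host fragment above launches a kernel and then unbinds a texture tex_u, suggesting that the input vector u is read through the texture cache. A minimal sketch of such a CRS matrix-vector kernel and its launch follows; the kernel name, the block size of 128, and the int2-based double fetch (the usual workaround on compute capability 1.3 hardware) are assumptions, not the authors' code.

    #include <cuda_runtime.h>

    // Legacy texture reference for the input vector u; doubles are fetched as
    // int2 and reassembled, since tex1Dfetch does not return doubles directly.
    texture<int2, 1, cudaReadModeElementType> tex_u;

    __device__ double fetch_u(int i)
    {
        int2 v = tex1Dfetch(tex_u, i);
        return __hiloint2double(v.y, v.x);
    }

    // One thread per row, standard (non-interleaved) CRS with 0-based column
    // indices; cnt, dsp, col, ele as defined in Sect. 3.3.
    __global__ void spmv_crs(int n, const int* cnt, const int* dsp,
                             const int* col, const double* ele, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        double sum = 0.0;
        int off = dsp[row];
        for (int k = 0; k < cnt[row]; ++k)
            sum += ele[off + k] * fetch_u(col[off + k]);
        y[row] = sum;
    }

    // Host side in the style of the fragment above (d_* are device pointers).
    void spmv_crs_gpu(int n, const int* d_cnt, const int* d_dsp, const int* d_col,
                      const double* d_ele, const double* d_u, double* d_y)
    {
        cudaBindTexture(0, tex_u, d_u, n * sizeof(double));  // bind u to the texture
        dim3 block(128), grid((n + 127) / 128);
        spmv_crs<<<grid, block>>>(n, d_cnt, d_dsp, d_col, d_ele, d_y);  // GPU kernel launch
        cudaUnbindTexture(tex_u);                            // unbind the texture
    }

The legacy texture reference API shown here matches the era of the Tesla C1060; on current CUDA versions one would use texture objects or simply read u from global memory through the read-only cache.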

3.3 Interleaved Compressed Row Storage Data Format

The standard compressed row storage (CRS) data format for a matrix A ∈ R^{n×n} with m = #A non-zero elements is defined by the three vectors cnt ∈ N^n, col ∈ N^m, and ele ∈ R^m. The vector cnt stores the number of non-zero elements per row, the vector col stores the column indices of the non-zero matrix entries row-wise in sequential order, and ele stores the non-zero matrix elements corresponding to the column indices. The vector dsp complements the data structure with the offsets to the first element of each matrix row. Using the traditional vector of pointers to the first element of a matrix row would violate the coalescence restrictions and lead to bad performance. The adapted CRS format is depicted in Fig. 2 for the simple example matrix from Fig. 1.

      4   0  -2   0   0   0  -1   0
     -1   4   0   0   0   0   0  -1
      0   0   3  -1   0  -1   0   0
      0   0   0   4   0  -1   0   0
     -1   0   0   0   2   0  -1   0
      0   0  -1   0   0   4   0   0
     -1   0   0   0   0   3  -1   0
      0   0  -1   0   0  -1   0   4

Fig. 1. A sample matrix with the rows colored in different hues.

    cnt:  3   3   3   2   3   2   3   3
    dsp:  0   3   6   9  11  14  16  19

    col:  1  3  7 |  1  2  8 |  3  4  6 |  4  6 |  1  5  7 |  3  6 |  1  6  7 |  3  6  8
    ele:  4 -2 -1 | -1  4 -1 |  3 -1 -1 |  4 -1 | -1  2 -1 | -1  4 | -1  3 -1 | -1 -1  4

Fig. 2. CRS data structure (cnt, dsp, col, ele) for the sample matrix with the count and displacement vector on top and the column indices and matrix entries below.
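The displacement vector is simply the exclusive prefix sum of the count vector. The following small host program, an illustration rather than part of the original code, reproduces the dsp row of Fig. 2 from the cnt row.

    #include <cstdio>

    int main()
    {
        const int n = 8;
        const int cnt[n] = {3, 3, 3, 2, 3, 2, 3, 3};   // non-zeros per row (Fig. 2)
        int dsp[n];
        int offset = 0;
        for (int i = 0; i < n; ++i) {
            dsp[i] = offset;        // offset of the first non-zero of row i
            offset += cnt[i];
        }
        for (int i = 0; i < n; ++i) std::printf("%d ", dsp[i]);
        std::printf("\n");          // prints: 0 3 6 9 11 14 16 19
        return 0;
    }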

The interleaved compressed row storage (ICRS) data format leaves the vector cnt untouched but rearranges the other three vectors. A block of L rows is grouped together by storing the column indices of the first non-zero elements of all rows in sequential order, then the second ones, and then the third ones, up to the maximum row length in the block. If not all rows in the block have the same number of non-zero elements, the data structure contains holes. Although this reduces the storage efficiency of the data format, for finite element matrices the number of non-zero elements per row is roughly the same, so the overhead is small. Even in the case of larger variations in the row lengths the data structure adapts, since only blocks of L rows are grouped together. Typically L is chosen in the range of 16-256 as a multiple of sixteen, which corresponds to the minimum coalescence requirement. The example matrix stored in the ICRS data structure with eight rows interleaved is shown in Fig. 3.

    cnt:  3   3   3   2   3   2   3   3
    dsp:  0   1   2   3   4   5   6   7

    col:  1   1   3   4   1   3   1   3 |  3   2   4   6   5   6   6   6 |  7   8   6   .   7   .   7   8
    ele:  4  -1   3   4  -1  -1  -1  -1 | -2   4  -1  -1   2   4   3  -1 | -1  -1  -1   .  -1   .  -1   4

Fig. 3. ICRS data structure (cnt, dsp, col, ele) for the sample matrix in Fig. 1 with the count and displacement vector on top and the interleaved column indices and matrix entries below. The L = 8 interleaved matrix rows create holes in the data structure, represented by the dots.
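With this layout a GPU kernel can assign one thread per row and stride by L between the non-zeros of a row, so that the k-th loads of L neighbouring threads touch consecutive addresses in col and ele. A minimal sketch is given below; it assumes 0-based indices, reads u directly from global memory, and takes dsp to hold the absolute position of each row's first non-zero as in Fig. 3. It is an illustration, not the authors' kernel.

    // ICRS matrix-vector product, one thread per matrix row.  Within a group
    // of L consecutive rows the k-th non-zeros are stored contiguously, so the
    // accesses of neighbouring threads to col and ele are coalesced.
    __global__ void spmv_icrs(int n, int L, const int* cnt, const int* dsp,
                              const int* col, const double* ele,
                              const double* u, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        double sum = 0.0;
        int pos = dsp[row];                      // first non-zero of this row
        for (int k = 0; k < cnt[row]; ++k, pos += L)
            sum += ele[pos] * u[col[pos]];       // stride L inside the interleaved block
        y[row] = sum;
    }

For the holes created by shorter rows the loop simply stops after cnt[row] entries, so the padding is never touched.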

4 Benchmarks for Elliptic Model Problems

4.1 Performance Results

The key result of this paper is the performance of the complete PCG-AMG algorithm running on a GPU server when compared to traditional cluster computers and servers. The system matrix used in the tests originates from the unstructured 3D finite element discretization of an elliptic subproblem of the bidomain equations describing a virtual heart simulation [13]. The resulting matrix has n = 862,515 rows and m = 12,795,209 non-zero entries. The benchmark times the setup of the AMG preconditioner (Table 2) and 64 iterations of the PCG-AMG algorithm (Table 1).

#cores    kepler  liebmann   mgser1      gtx  gpusrv1 |  mgser1     gtx  gpusrv1
                         CPU cores                    |        GPU cores
     1    29.239    30.253   22.615   17.026    9.607 |   1.217   1.016    1.238
     2    14.428    15.954   11.999    9.709    5.662 |           0.612    0.726
     4     7.305     7.544    8.490    6.562    3.885 |           0.367    0.409
     8     3.607     4.054    8.226             4.105 |                    0.284
    16     1.909     3.493
    32     1.167

Speedup wrt. one core
            28.7      29.8     22.3     16.8      9.5 |     1.2     1.0      1.2
Parallel speedup
            25.0       8.7      2.8      2.6      2.5 |             2.8      4.4

Table 1. PCG-AMG solver times in seconds for the servers kepler (AMD Opteron 248 Infiniband cluster), liebmann (AMD Opteron 8347 quad-socket shared memory server), mgser1 (Intel Xeon E5405 dual-socket server), gtx (AMD Phenom 9950 server), gpusrv1 (Intel Core i7 965 server) and for the GPU configurations: mgser1 (1x Nvidia Tesla C1060), gtx (4x Nvidia Geforce GTX 280), gpusrv1 (4x Nvidia Geforce GTX 295).

We compare the performance of the Infiniband cluster kepler with 16 dual-socket single-core Opteron 2.2 GHz nodes, the quad-socket shared memory server liebmann with quad-core Opteron 1.9 GHz processors, the dual-socket quad-core Xeon 2.6 GHz server mgser1, the single-socket quad-core Phenom 2.6 GHz server gtx, and the single-socket quad-core Intel Core i7 3.2 GHz server gpusrv1 with three GPU server setups. The first GPU server, mgser1, has a single Tesla C1060 board with 4 GB of memory and 240 processing cores. The second, custom-built GPU server gtx has four Nvidia GTX 280 boards with 1 GB of memory each and 960 processing cores. The third and most powerful GPU server, gpusrv1, has four Nvidia GTX 295 dual-GPU boards with 1.8 GB of memory each and 1920 processing cores. The performance of the GPU server gpusrv1 with all eight GPUs active is about 100 times that of a single Opteron core on the cluster computer or the shared memory machine. Even compared to the fastest Intel Core i7 CPU with all four cores active, the GPU server is more than 13 times faster. A single Tesla board is essentially as fast as the whole Infiniband cluster with all 32 CPU cores active, and the fastest GPU server beats the Infiniband cluster by a factor of four. The setup phase of AMG requires too many data synchronizations between the GPU cores, and therefore this part of the algorithm is not accelerated by the GPU. Nevertheless, the whole setup process runs on the GPU and all of the data is available in GPU memory afterwards. This saves time when starting the AMG solver and allows a cheap update of the AMG components on the GPU in case of a nonlinear problem.

5 Conclusions

We showed that even for algorithms on unstructured data the GPU outperforms the CPU, and that even a complex algorithm such as AMG can be completely implemented on the GPU. Considering the GPU as a processor with high inherent parallelism and combining several of them into clusters opens new opportunities for low-budget parallel computing. Not all parts of the algorithm should be transferred to the GPU, as the differences between Tables 1 and 2 show. Over time, GPUs may evolve so that all parts can be productively transferred, however.

kepler liebmann mgser1 gtx gpusrv1 mgser1 gtx gpusrv1 CPU–cores GPU–cores 6.302 3.280 1.601 0.787 0.501 0.426

5.812 3.182 1.640 0.877 0.564

3.836 3.583 2.106 5.452 4.714 2.888 2.209 2.069 1.217 2.765 1.745 1.239 1.165 0.684 1.698 1.118 0.850 0.581 1.159

Parallel speedup 14.8 10.3 4.5 3.1 3.6 2.8 2.6 Table 2. Timings in seconds for the parallel algebraic multigrid (AMG) preconditioner setup for the same hardware as in Table 1

References

1. M. M. Baskaran and R. Bordawekar. Optimizing sparse matrix-vector multiplication on GPUs. IBM Technical Report RC24704, 2008.
2. N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, 2008.
3. F. A. Bornemann and P. Deuflhard. The cascadic multigrid method for elliptic problems. Numer. Math., 75:135-152, 1996.
4. W. L. Briggs, V. E. Henson, and S. McCormick. A Multigrid Tutorial. SIAM, second edition, 2000.
5. C. C. Douglas. Madpack: A family of abstract multigrid or multilevel solvers. Comput. Appl. Math., 14:3-20, 1995.
6. C. C. Douglas, G. Haase, and U. Langer. A Tutorial on Elliptic PDE Solvers and Their Parallelization. SIAM, Philadelphia, 2003.
7. R. Barrett et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994.
8. Dominik Göddeke, Robert Strzodka, Jamal Mohd-Yusof, Patrick McCormick, Hilmar Wobker, Christian Becker, and Stefan Turek. Using GPUs to improve multigrid solver performance on a cluster. International Journal of Computational Science and Engineering, 4(1):36-55, 2008.
9. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. The MIT Press, Cambridge, 1999.
10. Gundolf Haase, Michael Kuhn, and Stefan Reitzinger. Parallel AMG on distributed memory computers. SIAM J. Sci. Comput., 24(2):410-427, 2002.
11. Eun-Jin Im, Katherine Yelick, and Richard Vuduc. Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl., 18(1):135-158, 2004.
12. Manfred Liebmann. Efficient PDE Solvers on Modern Hardware with Applications in Medical and Technical Sciences. PhD thesis, University of Graz, Department of Mathematics and Scientific Computing, July 2009.
13. G. Plank, M. Liebmann, R. Weber dos Santos, E. J. Vigmond, and G. Haase. Algebraic multigrid preconditioner for the cardiac bidomain model. IEEE Transactions on Biomedical Engineering, 54(4):585-596, 2007.
14. J. W. Ruge and K. Stüben. Efficient solution of finite difference and finite element equations by algebraic multigrid (AMG). In Multigrid Methods for Integral and Differential Equations, The Institute of Mathematics and Its Applications Conference Series, pages 169-212. Clarendon Press, Oxford, 1985.
15. Peter Thoman, editor. Multigrid Methods on GPUs. VDM, Saarbrücken, 2008.
16. Panayot S. Vassilevski. Multilevel Block Factorization Preconditioners: Matrix-based Analysis and Algorithms for Solving Finite Element Equations. Springer, New York, 1st edition, 2008.

Fig. 4. Parallel solver and setup performance data for the kepler cluster and the servers liebmann, mgser1, gtx, and gpusrv1