Implementation of multigrid on QPACE
Matthias Bolten¹, Daniel Brinkers², Markus Stürmer², Ulrich Rüde²
¹ Bergische Universität Wuppertal
² Friedrich-Alexander-Universität Erlangen-Nürnberg
2011/09/26
Outline
- QPACE
- Multigrid
- Network model
- Measured results
- Conclusion
QPACE
- QCD parallel computing on Cell
- Based on the enhanced Cell BE processor (PowerXCell 8i)
- Custom network processor (based on an FPGA)
- Direct coupling between the Cell BE and the network processor
- Developed within "Sonderforschungsbereich" (special research area) SFB/TR 55 of the Universities of Regensburg and Wuppertal (development led by the University of Regensburg, in cooperation with IBM and other industry partners)
- Two installations:
  - University of Wuppertal (3+ racks)
  - Jülich Supercomputing Centre (1 rack + 3 racks owned by the University of Regensburg)
- Ranked #1 – #3 in November 2009's Green500 (#7 – #9 today)
QPACE architecture
- Hierarchical cluster architecture:
  - Rack, consisting of up to 8 backplanes and 2 superroot cards
  - Backplane, consisting of up to 32 node cards and 2 root cards
  - Superroot card (power supply control and management, global tree network)
  - Root card (services for nodes, handling of the global clock)
  - Node card (used for the actual computations)
- Networks:
  - Ethernet (user I/O, Linux boot)
  - Interrupt tree network (evaluation of global conditions, global interrupts, synchronisation)
  - 3D torus network (nearest-neighbor communication)
  - Global clock tree
Implementation of the torus network
- The NWP is realized in a Xilinx Virtex-5 LX110T FPGA
- Coupled to the Cell BE processor via its FlexIO interface
- Connected to 6 network PHYs
- Each link provides up to 10 GiB/s
- Each PHY can use two different links, allowing reconfiguration of the system in software
- Allows for partitioning into sizes of [4, 8, 16] × [4, 8] × [4, 8]
- Each link supports up to 4 virtual channels to distinguish logical links
Cell BE processor
- PowerXCell 8i (latest and last incarnation of the CBEA)
- Contains 1 PowerPC Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs)
- Interconnected by the Element Interconnect Bus (EIB)
- The EIB ring also connects the Memory Interface Controller (MIC) and the Broadband Interface Controller (BIC)
- The PPE consists of a PowerPC core with 64 KiB L1 and 512 KiB L2 cache; it supports multithreading, but only in-order execution
- Each SPE consists of a specialized SIMD core, the Synergistic Processing Unit (SPU), a DMA engine, the Memory Flow Controller (MFC), and 256 KiB of Local Storage (LS)
- The SPU cannot access main memory directly, only via DMA to the LS
Elements of the Cell BE processor

[Figure: block diagram of the processor. Eight SPEs (each an SPU with its MFC) and the PPE (PPU with PPSS) are attached to the EIB ring, which also connects the MIC (to the DDR2 main memory) and the BIC (to the external FlexIO links).]
Communication in QPACE
- Communication is initiated via DMA to the network processor
- Consequence: the PPE cannot communicate directly via the torus network
- An accelerator-based programming model is "natural"
- Communication is credit based:
  - The receiving SPE provides a credit to the NWP
  - When the transfer is done, the receiver is notified by the NWP
  - Consequence: either small messages only, or sending only after a credit has been supplied
- Limitations due to the Cell DMA engine: message size ≤ 16 KiB and a multiple of 128 bytes
- No routing implemented, neither in hardware nor in software!
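The credit-based protocol above can be sketched as a toy model. This is illustrative Python, not the QPACE API; the names `Channel`, `provide_credit`, and `send` are made up, and only the rules stated on the slide (credit before transfer, size ≤ 16 KiB and a multiple of 128 bytes) are encoded.

```python
MAX_MSG = 16 * 1024   # Cell DMA limit: message size <= 16 KiB
ALIGN = 128           # ... and a multiple of 128 bytes

class Channel:
    """One virtual channel between two SPEs, with credit-based flow control."""

    def __init__(self):
        self.credits = []     # buffer sizes announced by the receiver
        self.delivered = []   # messages handed to the receiver

    def provide_credit(self, size):
        # Receiver tells the NWP it has a buffer of `size` bytes ready.
        self.credits.append(size)

    def send(self, payload):
        # Sender may only transmit if the message fits the DMA limits
        # and the receiver has already supplied a matching credit.
        if len(payload) > MAX_MSG or len(payload) % ALIGN != 0:
            raise ValueError("message must be <= 16 KiB and a multiple of 128 B")
        if not self.credits or self.credits[0] < len(payload):
            return False      # no credit yet: the sender must wait
        self.credits.pop(0)
        self.delivered.append(payload)  # receiver is notified on completion
        return True

ch = Channel()
msg = b"\x00" * 1024
assert not ch.send(msg)   # sending before a credit was supplied fails
ch.provide_credit(1024)
assert ch.send(msg)       # after the credit, the transfer goes through
```

The model makes the slide's consequence explicit: without a standing credit, a sender must either restrict itself to pre-credited small messages or block until the credit arrives.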
Path of a message through the torus network

[Figure: the receiving Cell supplies a credit to its NWP; the data travels from the sending Cell through the two NWPs to the receiver; an ACK returns to the sender and the receiving Cell is notified.]
Multigrid

Motivation for implementing multigrid
- The torus network allows for nearest-neighbor communication
- Other applications need more than that
- Multigrid was chosen to evaluate QPACE for other communication patterns

What is multigrid?
- An iterative solver for linear systems of equations
- It makes use of the smoothing property of other iterative methods
- Systems are solved on a hierarchy of levels
The multigrid V-cycle

[Figure: V-cycle schematic. Smoothing on the finest grid, restriction down to coarser grids with fewer DOFs (first coarse grid and below), then prolongation back up to the finest grid.]
Model problem
- Sought is the solution of −∆u(x) = f(x) for all x ∈ ℝ³/ℤ³
- Corresponds to solving inside the domain [0, 1)³ and imposing periodic boundary conditions
- Discretized with grid spacing h and the well-known 7-point stencil

  \frac{1}{h^2}\left[\begin{pmatrix}0&0&0\\0&-1&0\\0&0&0\end{pmatrix}\begin{pmatrix}0&-1&0\\-1&6&-1\\0&-1&0\end{pmatrix}\begin{pmatrix}0&0&0\\0&-1&0\\0&0&0\end{pmatrix}\right]

- Yields Au* = f, where A ∈ ℝ^{(n³)×(n³)} and u*, f ∈ ℝ^{(n³)}
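Assuming a flat array layout, the stencil above can be applied as follows. This is a plain-Python illustration of the operator with periodic indexing, not the vectorized SPU kernel.

```python
def apply_A(u, n, h):
    """Apply the 7-point stencil for -Delta with periodic BC on an n^3 grid.

    u is a flat list; grid point (x, y, z) lives at index (x*n + y)*n + z.
    """
    def at(x, y, z):
        # modular indexing realizes the periodic boundary conditions
        return u[((x % n) * n + (y % n)) * n + (z % n)]

    out = []
    for x in range(n):
        for y in range(n):
            for z in range(n):
                s = (6.0 * at(x, y, z)
                     - at(x - 1, y, z) - at(x + 1, y, z)
                     - at(x, y - 1, z) - at(x, y + 1, z)
                     - at(x, y, z - 1) - at(x, y, z + 1))
                out.append(s / h**2)
    return out

# On the torus, A annihilates constants (the stencil rows sum to zero):
assert max(abs(v) for v in apply_A([1.0] * 27, 3, 1.0)) == 0.0
```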
Multigrid cycle

Algorithm: multigrid cycle u_i ← MG_i(u_i, f_i)

  u_i ← S_i^{ν₁}(u_i, f_i)                  (pre-smoothing)
  r_i ← f_i − A_i u_i                       (residual)
  r_{i+1} ← I_i^{i+1} r_i                   (restriction)
  e_{i+1} ← 0
  if i + 1 = l_max then
      e_{l_max} ← A_{l_max}^{−1} r_{l_max}  (solve on the coarsest level)
  else
      e_{i+1} ← MG_{i+1}(e_{i+1}, r_{i+1})  (recursion)
  end if
  e_i ← I_{i+1}^{i} e_{i+1}                 (prolongation)
  u_i ← u_i + e_i                           (coarse-grid correction)
  u_i ← S̃_i^{ν₂}(u_i, f_i)                 (post-smoothing)
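The cycle's control flow can be sketched in Python. For brevity this uses a 1D Poisson problem with Dirichlet boundaries instead of the periodic 3D model problem; the components are textbook choices (ω-Jacobi with ω = 2/3, full weighting, linear interpolation), not the actual QPACE kernels.

```python
def residual(u, f, h):
    # r = f - A u for the 1D Laplacian [-1, 2, -1]/h^2, zero Dirichlet BC
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = f[i] - (2.0 * u[i] - left - right) / h**2
    return r

def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    # weighted Jacobi: u <- u + omega * D^{-1} r, with D = 2/h^2
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [u[i] + omega * h**2 / 2.0 * r[i] for i in range(len(u))]
    return u

def restrict(r):
    # full weighting onto every second interior point
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
            for i in range((len(r) - 1) // 2)]

def prolong(e, n_fine):
    # linear interpolation of the coarse correction back to the fine grid
    u = [0.0] * n_fine
    for i, v in enumerate(e):
        u[2 * i + 1] += v
        u[2 * i] += 0.5 * v
        u[2 * i + 2] += 0.5 * v
    return u

def v_cycle(u, f, h, nu1=2, nu2=2):
    u = jacobi(u, f, h, nu1)               # pre-smoothing
    if len(u) <= 1:                        # coarsest level: (near-)exact solve
        return jacobi(u, f, h, 50)
    r = restrict(residual(u, f, h))        # restrict the residual
    e = v_cycle([0.0] * len(r), r, 2 * h)  # recursion on the error equation
    u = [ui + ei for ui, ei in zip(u, prolong(e, len(u)))]  # correction
    return jacobi(u, f, h, nu2)            # post-smoothing
```

One V(2,2)-cycle on a 63-point grid already reduces the residual markedly, which is the behavior the QPACE implementation exploits on each level of the hierarchy.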
Details of the implemented multigrid method
- Cell-based discretization instead of node-based (no communication necessary for restriction and prolongation)
- ω-Jacobi smoother with optimal smoothing parameter ω = 2/3
- System sizes (n³) × (n³) with n = 2^k, k ∈ ℕ
- l_max = k − 1, so the coarsest system has dimension (2³) × (2³)
- The temporary buffer of the ω-Jacobi method is reused to compute the solution on coarser levels
- The compute kernels are vectorized, and the ω-Jacobi and restriction kernels are optimized by loop unrolling
- The kernels are compiled for different grid sizes
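The relation n = 2^k with l_max = k − 1 fixes the whole level hierarchy. A small sketch, using a 0-based level index (our reading of the slide's indexing):

```python
def level_sizes(k):
    """Grid side length n_i = 2^(k - i) on level i = 0 .. lmax = k - 1."""
    lmax = k - 1
    return [2 ** (k - i) for i in range(lmax + 1)]

# For the measured 128^3 global grid (k = 7) the hierarchy ends at 2^3:
assert level_sizes(7) == [128, 64, 32, 16, 8, 4, 2]
```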
Implementation on QPACE
- Accelerator-centric programming model
- Only the Local Storage is used
- Domain splitting, i.e. each SPU handles a part of the domain
- The limited LS restricts the local domain to 16³
- Measurements were carried out on a 4³ partition, resulting in a global grid size of 128³
- The QPACE torus network library provides functionality for inter-node communication only
- An intermediate software layer was introduced to allow intra-node communication in the same (credit-based) manner
[Figure: domain decomposition on the partition. Each Cell hosts eight SPUs, with a 16³ grid on every SPU; neighboring Cells are connected by one link carrying four virtual channels.]
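A quick consistency check of the decomposition, using the values from the slides:

```python
# A 4^3 node partition, 8 SPUs per Cell, and a 16^3 local grid per SPU
# must tile the 128^3 global grid exactly.

NODES = 4 ** 3          # 4 x 4 x 4 QPACE partition
SPUS_PER_CELL = 8
LOCAL = 16 ** 3         # local domain per SPU, limited by the 256 KiB LS

global_points = NODES * SPUS_PER_CELL * LOCAL
assert global_points == 128 ** 3   # 2,097,152 grid points in total
```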
Network model
- The network cannot be modelled solely by bandwidth and latency
- The bandwidth of the NWP is also limited
- This bandwidth limit becomes effective when either multiple SPUs or multiple links are used
- The NWP can only handle about 0.75 messages per µs
- Measurements:
  - Latency: 2.95 µs
  - Link bandwidth: ca. 891 MiB/s
  - NWP bandwidth: ca. 1130 MiB/s
- This results in three bounds, of which the worst is in effect
Three bounds
1. Link bound: every link has a bandwidth limit of ca. 891 MiB/s and a latency of ca. 2.95 µs, so the time needed for one message of size s is
   2.95 µs + s / (891 MiB/s).
2. NWP bound: the NWP has a bandwidth limit of ca. 1130 MiB/s. The latency per message stays the same.
3. Message bound: the NWP has a throughput of 0.75 messages/µs.
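The three bounds can be combined into a small model. The constants are the measured values from the slides; how the bounds compose for a batch of messages over several links is our assumption for illustration, following "the worst bound is in effect".

```python
MIB = 1024 * 1024
LATENCY_US = 2.95              # per-message latency (microseconds)
LINK_BW = 891 * MIB / 1e6      # link bandwidth (bytes per microsecond)
NWP_BW = 1130 * MIB / 1e6      # NWP bandwidth (bytes per microsecond)
MSG_RATE = 0.75                # NWP throughput (messages per microsecond)

def time_one_message_us(s):
    # Bound 1 for a single message, exactly as on the slide
    return LATENCY_US + s / LINK_BW

def time_batch_us(links, m, s):
    """m messages of s bytes on each of `links` links of one node."""
    link_bound = m * s / LINK_BW           # links transfer in parallel
    nwp_bound = links * m * s / NWP_BW     # the one NWP serves all links
    message_bound = links * m / MSG_RATE   # NWP message throughput
    return LATENCY_US + max(link_bound, nwp_bound, message_bound)
```

With a single link the link bound dominates for large messages; as soon as a second link is active, the NWP bandwidth takes over, matching the slide's remark that the NWP limit becomes effective with multiple SPUs or links.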
Performance of the kernels
- One line of ω-Jacobi (theoretical bound: 72 cycles):

  variant:  naive  SIMD  loop unroll  reordering
  cycles:    2502   603          594         120

- On the next coarser level: 91 cycles (theoretical bound: 40 cycles)
- Residual calculation: reduced from 92 cycles per grid point to 115 cycles per line (theoretical bound: 63 cycles)
- On the next coarser level: 84 cycles (theoretical bound: 35 cycles)
- Restriction: reduced from 103 cycles per grid point to 70 cycles per line
- The restriction is limited by the number of loads, stores, and shuffles
- The prolongation and update kernels do not use unrolling
- Overall: the kernels without communication need about 95 µs
Overall
- Time for one V(2,2)-cycle: 1050 µs
- Theoretical performance of the 4 × 4 × 4 QPACE partition: 6.5 teraflop/s
- Number of grid points per SPU: 16³ + 8³ + 4³ + 2³ + 1³ + (1/2)³ = 4681.125
- Number of flops per SPU: 228790.296875
- Peak performance of our implementation: 0.218 gigaflop/s
- Percentage of theoretical peak performance: 1.7 percent
- So the timing for one V-cycle is impressive, but the percentage of theoretical peak performance is not
- Obviously a large portion of the time is spent in communication
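The slide's numbers can be reproduced directly. The per-SPU peak used for the percentage is derived here by dividing the 6.5 teraflop/s partition peak by its 4³ × 8 = 512 SPUs, which is an assumption consistent with the 1.7 percent figure.

```python
# Grid points per SPU, summed over the levels (fractional points on the
# coarsest levels are shared between SPUs):
points_per_spu = 16**3 + 8**3 + 4**3 + 2**3 + 1**3 + (1 / 2)**3
assert points_per_spu == 4681.125

flops_per_spu = 228790.296875
cycle_time_s = 1050e-6                       # one V(2,2)-cycle
achieved = flops_per_spu / cycle_time_s      # flop/s per SPU
assert abs(achieved / 1e9 - 0.218) < 0.001   # ~0.218 gigaflop/s

spus = 4**3 * 8                              # 512 SPUs in the partition
peak_per_spu = 6.5e12 / spus
assert round(100 * achieved / peak_per_spu, 1) == 1.7
```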
Communication
- A V-cycle without computation needs 848 µs
- The modified network model has to be taken into account
- The effective bound changes from level to level:
  - Level 1 (128³): only ω-Jacobi, NWP bound
  - Levels 2 – 5 (64³ – 8³): only ω-Jacobi, message bound
  - Level 6 (4³): ω-Jacobi and grid transfer, message bound
  - Level 7 (2³): ω-Jacobi and grid transfer, message and link bound
- Time predicted by the model: 736 µs
Conclusion
- The accelerator-centric programming model is suitable for this kind of problem
- The network has to be available to the accelerator directly
- We need low latency
- We need a lot of messages, i.e. a high message rate
- The Cell BE architecture is limiting (memory!)
- Routing can be implemented in software (for simple communication patterns)
- Although the performance is not optimal, QPACE can handle this workload as well
- The performance of the NWP is the limiting factor