Implementation of multigrid on QPACE
Matthias Bolten¹, Daniel Brinkers², Markus Stürmer², Ulrich Rüde²
¹ Bergische Universität Wuppertal
² Friedrich-Alexander-Universität Erlangen-Nürnberg
2011/09/26
Outline
- QPACE
- Multigrid
- Network model
- Measured results
- Conclusion
QPACE
- QCD parallel computing on Cell
- Based on the enhanced Cell BE processor (PowerXCell 8i)
- Custom network processor (based on an FPGA)
- Direct coupling between the Cell BE and the network processor
- Developed within "Sonderforschungsbereich" (special research area) SFB/TR 55 of the Universities of Regensburg and Wuppertal (development led by the University of Regensburg, in cooperation with IBM and other industry partners)
- Two installations:
  - University of Wuppertal (3+ racks)
  - Jülich Supercomputing Centre (1 rack + 3 racks owned by the University of Regensburg)
- Ranked #1 – #3 in November 2009's Green500 (#7 – #9 today)
QPACE architecture
- Hierarchical cluster architecture:
  - Rack, consisting of up to 8 backplanes and 2 superroot cards
  - Backplane, consisting of up to 32 node cards and 2 root cards
  - Superroot card (power supply control and management, global tree network)
  - Root card (services for nodes, handling of the global clock)
  - Node card (used for the actual computations)
- Networks:
  - Ethernet (user I/O, Linux boot)
  - Interrupt tree network (evaluation of global conditions, global interrupts, synchronisation)
  - 3D torus network (nearest-neighbor communication)
  - Global clock tree
Implementation of the torus network
- The NWP is realized in a Xilinx Virtex-5 LX110T FPGA
- Coupled to the Cell BE processor via its FlexIO interface
- Connected to 6 network PHYs
- Each link provides up to 10 GiB/s
- Each PHY can use two different links, allowing reconfiguration of the system in software
- Allows for partitioning into sizes of [4, 8, 16] × [4, 8] × [4, 8]
- Each link supports up to 4 virtual channels to distinguish logical links
Cell BE processor
- PowerXCell 8i (latest and last incarnation of the CBEA)
- Contains 1 PowerPC Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs)
- Interconnected by the Element Interconnect Bus (EIB)
- The EIB ring also connects the Memory Interface Controller (MIC) and the Broadband Interface Controller (BIC)
- The PPE consists of a PowerPC core with 64 KiB L1 and 512 KiB L2 cache; it supports multithreading, but only in-order execution
- Each SPE consists of a specialized SIMD core, the Synergistic Processing Unit (SPU), a DMA engine, the Memory Flow Controller (MFC), and 256 KiB of Local Storage (LS)
- The SPU cannot access main memory directly, only via DMA to the LS
Elements of the Cell BE processor

[Figure: block diagram of the processor. Eight SPEs (each an SPU with its MFC) and the PPE (PPU with PPSS) are attached to the EIB ring, which also connects the MIC (to the DDR2 main memory) and the BIC (to the external FlexIO links).]
Communication in QPACE
- Communication is initiated via DMA to the network processor
- Consequence: the PPE cannot communicate directly via the torus network
- An accelerator-based programming model is "natural"
- Communication is credit based:
  - The receiving SPE provides a credit to the NWP
  - When the transfer is done, the receiver is notified by the NWP
  - Consequence: either small messages only, or sending only after a credit has been supplied
- Limitations due to the Cell DMA engine: message size ≤ 16 KiB and a multiple of 128 bytes
- No routing implemented, neither in hardware nor in software!
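The credit-based protocol above can be sketched as a toy model. This is illustrative Python, not the QPACE API; the names `Channel`, `provide_credit`, and `send` are made up, and only the rules stated on the slide (credit before transfer, size ≤ 16 KiB and a multiple of 128 bytes) are encoded.

```python
MAX_MSG = 16 * 1024   # Cell DMA limit: message size <= 16 KiB
ALIGN = 128           # ... and a multiple of 128 bytes

class Channel:
    """One virtual channel between two SPEs, with credit-based flow control."""

    def __init__(self):
        self.credits = []     # buffer sizes announced by the receiver
        self.delivered = []   # messages handed to the receiver

    def provide_credit(self, size):
        # Receiver tells the NWP it has a buffer of `size` bytes ready.
        self.credits.append(size)

    def send(self, payload):
        # Sender may only transmit if the message fits the DMA limits
        # and the receiver has already supplied a matching credit.
        if len(payload) > MAX_MSG or len(payload) % ALIGN != 0:
            raise ValueError("message must be <= 16 KiB and a multiple of 128 B")
        if not self.credits or self.credits[0] < len(payload):
            return False      # no credit yet: the sender must wait
        self.credits.pop(0)
        self.delivered.append(payload)  # receiver is notified on completion
        return True

ch = Channel()
msg = b"\x00" * 1024
assert not ch.send(msg)   # sending before a credit was supplied fails
ch.provide_credit(1024)
assert ch.send(msg)       # after the credit, the transfer goes through
```

The model makes the slide's consequence explicit: without a standing credit, a sender must either restrict itself to pre-credited small messages or block until the credit arrives.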
Path of a message through the torus network

[Figure: the receiving Cell supplies a credit to its NWP; the data travels from the sending Cell through the two NWPs to the receiver; an ACK returns to the sender and the receiving Cell is notified.]
Multigrid

Motivation for implementing multigrid
- The torus network allows for nearest-neighbor communication
- Other applications need more than that
- Multigrid was chosen to evaluate QPACE for other communication patterns

What is multigrid?
- An iterative solver for linear systems of equations
- It makes use of the smoothing property of other iterative methods
- Systems are solved on a hierarchy of levels
The multigrid V-cycle

[Figure: V-cycle schematic. Smoothing on the finest grid, restriction down to coarser grids with fewer DOFs (first coarse grid and below), then prolongation back up to the finest grid.]
Model problem
- Sought is the solution of −∆u(x) = f(x) for all x ∈ ℝ³/ℤ³
- Corresponds to solving inside the domain [0, 1)³ and imposing periodic boundary conditions
- Discretized with grid spacing h and the well-known 7-point stencil

  \frac{1}{h^2}\left[\begin{pmatrix}0&0&0\\0&-1&0\\0&0&0\end{pmatrix}\begin{pmatrix}0&-1&0\\-1&6&-1\\0&-1&0\end{pmatrix}\begin{pmatrix}0&0&0\\0&-1&0\\0&0&0\end{pmatrix}\right]

- Yields Au* = f, where A ∈ ℝ^{(n³)×(n³)} and u*, f ∈ ℝ^{(n³)}
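Assuming a flat array layout, the stencil above can be applied as follows. This is a plain-Python illustration of the operator with periodic indexing, not the vectorized SPU kernel.

```python
def apply_A(u, n, h):
    """Apply the 7-point stencil for -Delta with periodic BC on an n^3 grid.

    u is a flat list; grid point (x, y, z) lives at index (x*n + y)*n + z.
    """
    def at(x, y, z):
        # modular indexing realizes the periodic boundary conditions
        return u[((x % n) * n + (y % n)) * n + (z % n)]

    out = []
    for x in range(n):
        for y in range(n):
            for z in range(n):
                s = (6.0 * at(x, y, z)
                     - at(x - 1, y, z) - at(x + 1, y, z)
                     - at(x, y - 1, z) - at(x, y + 1, z)
                     - at(x, y, z - 1) - at(x, y, z + 1))
                out.append(s / h**2)
    return out

# On the torus, A annihilates constants (the stencil rows sum to zero):
assert max(abs(v) for v in apply_A([1.0] * 27, 3, 1.0)) == 0.0
```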
Multigrid cycle

Algorithm: multigrid cycle u_i ← MG_i(u_i, f_i)

  u_i ← S_i^{ν₁}(u_i, f_i)                  (pre-smoothing)
  r_i ← f_i − A_i u_i                       (residual)
  r_{i+1} ← I_i^{i+1} r_i                   (restriction)
  e_{i+1} ← 0
  if i + 1 = l_max then
      e_{l_max} ← A_{l_max}^{−1} r_{l_max}  (solve on the coarsest level)
  else
      e_{i+1} ← MG_{i+1}(e_{i+1}, r_{i+1})  (recursion)
  end if
  e_i ← I_{i+1}^{i} e_{i+1}                 (prolongation)
  u_i ← u_i + e_i                           (coarse-grid correction)
  u_i ← S̃_i^{ν₂}(u_i, f_i)                 (post-smoothing)
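The cycle's control flow can be sketched in Python. For brevity this uses a 1D Poisson problem with Dirichlet boundaries instead of the periodic 3D model problem; the components are textbook choices (ω-Jacobi with ω = 2/3, full weighting, linear interpolation), not the actual QPACE kernels.

```python
def residual(u, f, h):
    # r = f - A u for the 1D Laplacian [-1, 2, -1]/h^2, zero Dirichlet BC
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = f[i] - (2.0 * u[i] - left - right) / h**2
    return r

def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    # weighted Jacobi: u <- u + omega * D^{-1} r, with D = 2/h^2
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [u[i] + omega * h**2 / 2.0 * r[i] for i in range(len(u))]
    return u

def restrict(r):
    # full weighting onto every second interior point
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
            for i in range((len(r) - 1) // 2)]

def prolong(e, n_fine):
    # linear interpolation of the coarse correction back to the fine grid
    u = [0.0] * n_fine
    for i, v in enumerate(e):
        u[2 * i + 1] += v
        u[2 * i] += 0.5 * v
        u[2 * i + 2] += 0.5 * v
    return u

def v_cycle(u, f, h, nu1=2, nu2=2):
    u = jacobi(u, f, h, nu1)               # pre-smoothing
    if len(u) <= 1:                        # coarsest level: (near-)exact solve
        return jacobi(u, f, h, 50)
    r = restrict(residual(u, f, h))        # restrict the residual
    e = v_cycle([0.0] * len(r), r, 2 * h)  # recursion on the error equation
    u = [ui + ei for ui, ei in zip(u, prolong(e, len(u)))]  # correction
    return jacobi(u, f, h, nu2)            # post-smoothing
```

One V(2,2)-cycle on a 63-point grid already reduces the residual markedly, which is the behavior the QPACE implementation exploits on each level of the hierarchy.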
Details of the implemented multigrid method
- Cell-based discretization instead of node-based (no communication necessary for restriction and prolongation)
- ω-Jacobi smoother with optimal smoothing parameter ω = 2/3
- System sizes (n³) × (n³) with n = 2^k, k ∈ ℕ
- l_max = k − 1, so the coarsest system has dimension (2³) × (2³)
- The temporary buffer of the ω-Jacobi method is reused to compute the solution on coarser levels
- The compute kernels are vectorized, and the ω-Jacobi and restriction kernels are optimized by loop unrolling
- The kernels are compiled for different grid sizes
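The relation n = 2^k with l_max = k − 1 fixes the whole level hierarchy. A small sketch, using a 0-based level index (our reading of the slide's indexing):

```python
def level_sizes(k):
    """Grid side length n_i = 2^(k - i) on level i = 0 .. lmax = k - 1."""
    lmax = k - 1
    return [2 ** (k - i) for i in range(lmax + 1)]

# For the measured 128^3 global grid (k = 7) the hierarchy ends at 2^3:
assert level_sizes(7) == [128, 64, 32, 16, 8, 4, 2]
```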
Implementation on QPACE
- Accelerator-centric programming model
- Only the Local Storage is used
- Domain splitting, i.e. each SPU handles a part of the domain
- The limited LS restricts the local domain to 16³
- Measurements were carried out on a 4³ partition, resulting in a global grid size of 128³
- The QPACE torus network library provides functionality for inter-node communication only
- An intermediate software layer was introduced to allow intra-node communication in the same (credit-based) manner
[Figure: domain decomposition on the partition. Each Cell hosts eight SPUs, with a 16³ grid on every SPU; neighboring Cells are connected by one link carrying four virtual channels.]
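A quick consistency check of the decomposition, using the values from the slides:

```python
# A 4^3 node partition, 8 SPUs per Cell, and a 16^3 local grid per SPU
# must tile the 128^3 global grid exactly.

NODES = 4 ** 3          # 4 x 4 x 4 QPACE partition
SPUS_PER_CELL = 8
LOCAL = 16 ** 3         # local domain per SPU, limited by the 256 KiB LS

global_points = NODES * SPUS_PER_CELL * LOCAL
assert global_points == 128 ** 3   # 2,097,152 grid points in total
```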
Network model
- The network cannot be modelled solely by bandwidth and latency
- The bandwidth of the NWP is also limited
- This bandwidth limit becomes effective when either multiple SPUs or multiple links are used
- The NWP can only handle about 0.75 messages per µs
- Measurements:
  - Latency: 2.95 µs
  - Link bandwidth: ca. 891 MiB/s
  - NWP bandwidth: ca. 1130 MiB/s
- This results in three bounds, of which the worst is in effect
Three bounds
1. Link bound: every link has a bandwidth limit of ca. 891 MiB/s and a latency of ca. 2.95 µs, so the time needed for one message of size s is
   2.95 µs + s / (891 MiB/s).
2. NWP bound: the NWP has a bandwidth limit of ca. 1130 MiB/s. The latency per message stays the same.
3. Message bound: the NWP has a throughput of 0.75 messages/µs.
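The three bounds can be combined into a small model. The constants are the measured values from the slides; how the bounds compose for a batch of messages over several links is our assumption for illustration, following "the worst bound is in effect".

```python
MIB = 1024 * 1024
LATENCY_US = 2.95              # per-message latency (microseconds)
LINK_BW = 891 * MIB / 1e6      # link bandwidth (bytes per microsecond)
NWP_BW = 1130 * MIB / 1e6      # NWP bandwidth (bytes per microsecond)
MSG_RATE = 0.75                # NWP throughput (messages per microsecond)

def time_one_message_us(s):
    # Bound 1 for a single message, exactly as on the slide
    return LATENCY_US + s / LINK_BW

def time_batch_us(links, m, s):
    """m messages of s bytes on each of `links` links of one node."""
    link_bound = m * s / LINK_BW           # links transfer in parallel
    nwp_bound = links * m * s / NWP_BW     # the one NWP serves all links
    message_bound = links * m / MSG_RATE   # NWP message throughput
    return LATENCY_US + max(link_bound, nwp_bound, message_bound)
```

With a single link the link bound dominates for large messages; as soon as a second link is active, the NWP bandwidth takes over, matching the slide's remark that the NWP limit becomes effective with multiple SPUs or links.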
Performance of the kernels
- One line of ω-Jacobi (theoretical bound: 72 cycles):

  variant:  naive  SIMD  loop unroll  reordering
  cycles:    2502   603          594         120

- On the next coarser level: 91 cycles (theoretical bound: 40 cycles)
- Residual calculation: reduced from 92 cycles per grid point to 115 cycles per line (theoretical bound: 63 cycles)
- On the next coarser level: 84 cycles (theoretical bound: 35 cycles)
- Restriction: reduced from 103 cycles per grid point to 70 cycles per line
- The restriction is limited by the number of loads, stores, and shuffles
- The prolongation and update kernels do not use unrolling
- Overall: the kernels without communication need about 95 µs
Overall
- Time for one V(2,2)-cycle: 1050 µs
- Theoretical performance of the 4 × 4 × 4 QPACE partition: 6.5 teraflop/s
- Number of grid points per SPU: 16³ + 8³ + 4³ + 2³ + 1³ + (1/2)³ = 4681.125
- Number of flops per SPU: 228790.296875
- Peak performance of our implementation: 0.218 gigaflop/s
- Percentage of theoretical peak performance: 1.7 percent
- So the timing for one V-cycle is impressive, but the percentage of theoretical peak performance is not
- Obviously a large portion of the time is spent in communication
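The slide's numbers can be reproduced directly. The per-SPU peak used for the percentage is derived here by dividing the 6.5 teraflop/s partition peak by its 4³ × 8 = 512 SPUs, which is an assumption consistent with the 1.7 percent figure.

```python
# Grid points per SPU, summed over the levels (fractional points on the
# coarsest levels are shared between SPUs):
points_per_spu = 16**3 + 8**3 + 4**3 + 2**3 + 1**3 + (1 / 2)**3
assert points_per_spu == 4681.125

flops_per_spu = 228790.296875
cycle_time_s = 1050e-6                       # one V(2,2)-cycle
achieved = flops_per_spu / cycle_time_s      # flop/s per SPU
assert abs(achieved / 1e9 - 0.218) < 0.001   # ~0.218 gigaflop/s

spus = 4**3 * 8                              # 512 SPUs in the partition
peak_per_spu = 6.5e12 / spus
assert round(100 * achieved / peak_per_spu, 1) == 1.7
```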
Communication
- A V-cycle without computation needs 848 µs
- The modified network model has to be taken into account
- The effective bound changes from level to level:
  - Level 1 (128³): only ω-Jacobi, NWP bound
  - Levels 2 – 5 (64³ – 8³): only ω-Jacobi, message bound
  - Level 6 (4³): ω-Jacobi and grid transfer, message bound
  - Level 7 (2³): ω-Jacobi and grid transfer, message and link bound
- Time predicted by the model: 736 µs
Conclusion
- The accelerator-centric programming model is suitable for this kind of problem
- The network has to be available to the accelerator directly
- We need low latency
- We need a lot of messages, i.e. a high message rate
- The Cell BE architecture is limiting (memory!)
- Routing can be implemented in software (for simple communication patterns)
- Although the performance is not optimal, QPACE can handle this workload as well
- The performance of the NWP is the limiting factor