Center for Information Services and High Performance Computing (ZIH)
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
International Conference on Parallel Processing, Beijing, September 4th, 2015
Daniel Molka ([email protected])
Daniel Hackenberg ([email protected])
Robert Schöne ([email protected])
Wolfgang E. Nagel ([email protected])
Motivation
Complex composition of the processor
[Figure: 12-core Haswell-EP package; the 12-core die connects cores 0–11 (each with L1 and L2), distributed L3 slices, QPI, PCI Express, and two IMCs driving DDR4 channels A–D via ring queues]
12-core Intel Haswell-EP architecture
– Multiple cores
– Segmented shared resources
How does performance scale with the number of used cores?
Motivation
New energy efficiency features
– Reduced frequency in AVX mode
– Uncore frequency scaling
How does this influence performance?
Several modes of the coherence protocol
– Source or home snooping
– Optional clustering into two NUMA domains
Unclear which is the optimal configuration
Outline
Haswell-EP architecture
– Architecture overview
– Cache coherence protocol
Benchmark Design and Implementation
– Measurement Routines
– Data Placement
– Coherence State Control
Results
– Latency
– Bandwidth
– Application Performance
Summary
Haswell-EP Architecture
Core Architecture
– ISA extensions: AVX2 (256-bit integer SIMD), FMA
– Reduced frequency when processing 256-bit instructions
– Enhanced out-of-order execution compared to the predecessor
  • More instructions in flight
  • More load/store buffers
– Additional AGU and wider data paths to the L1 and L2 caches
Uncore
– L3 consists of multiple slices (all slices shared by all cores)
– Two integrated memory controllers (IMCs), each with two DDR4 channels
– Two interconnected rings connect cores, L3 slices, and IMCs
– Uncore Frequency Scaling (UFS): hardware-controlled DVFS
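For illustration, a minimal C sketch (an assumption, not taken from the talk's benchmarks) of the 256-bit FMA operations whose sustained execution triggers the reduced AVX frequency mentioned above; the function name and loop structure are mine:

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical example: one 256-bit fused multiply-add per iteration,
 * as enabled by Haswell's AVX2/FMA support.
 * Compile with: gcc -O2 -mavx2 -mfma */
void saxpy_fma(float *y, const float *x, float a, size_t n)
{
    __m256 va = _mm256_set1_ps(a);            /* broadcast scalar a */
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats from x */
        __m256 vy = _mm256_loadu_ps(y + i);   /* load 8 floats from y */
        vy = _mm256_fmadd_ps(va, vx, vy);     /* vy = a*x + y (vfmadd) */
        _mm256_storeu_ps(y + i, vy);
    }
}
```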
Cluster-On-Die Mode
[Figure: 12-core Haswell-EP package in Cluster-on-Die mode; cores, L3 slices, and the two IMCs (DDR4 channels A–D) are partitioned into Cluster 1 and Cluster 2]
Each IMC is visible as a separate NUMA domain
– Software view (2x 6 cores) does not match the actual layout (8+4 cores)
– Higher distance to the QPI interconnect for the second cluster
Cache Coherence Protocol (I)
Snooping-based coherence protocol (MESIF)
[Figure: four-node system; the requesting core's CA in the source node issues requests to the responsible CAs in the peer nodes and to the responsible HA in the home node, which send responses back]
Implemented by "caching agents" (CA) in the L3 slices and "home agents" (HA) in the memory controllers
Cache Coherence Protocol (II)
What about the CAs in the peer nodes?
a) Source Snooping
– CA in the source node broadcasts the request to all nodes
– Cached copies in the states E, M, or F are forwarded to the requester
– HA sends data from memory if required
→ lowest latency
b) Home Snooping
– CA forwards the request to the HA
– HA broadcasts the request
→ higher latency
– HA can use a directory to filter unnecessary snoops
→ lower bandwidth demand
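As a rough illustration of the bandwidth trade-off, a toy message-count model in C (an assumption for exposition only, not the real QPI accounting):

```c
/* Toy model: snoop messages needed to resolve one read miss in a system
 * with `nodes` caching agents. `known_sharers` is what a directory would
 * report; source snooping has no directory and always broadcasts. */
enum snoop_mode { SOURCE_SNOOP, HOME_SNOOP };

static int snoop_messages(enum snoop_mode mode, int nodes, int known_sharers)
{
    if (mode == SOURCE_SNOOP)
        return nodes - 1;     /* broadcast to all peers immediately */
    if (known_sharers == 0)
        return 0;             /* directory says no remote copies: no snoops */
    return known_sharers;     /* snoop only the recorded sharers */
}
```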
Cache Coherence Protocol (III)
Directory support – already available in Sandy Bridge
– 2 bits per cache line stored in ECC bits
  • no additional memory access
  • high latency until directory information becomes available
– Three possible states:
  • "remote invalid" – no snoops required
  • "snoop-all" – broadcast snoops for read and write requests
  • "shared" – broadcast snoops only for write requests
Directory Caches (HitME cache) – new in Haswell
– 14 KiB per home node
– Stores 8-bit vectors of presence bits
– Entries are allocated when cache lines are forwarded between nodes
– In-memory directory bits are set to "snoop-all" when a HitME entry is created
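The three in-memory directory states translate into a simple snoop decision. The following C sketch is an illustration; the enum encoding is an assumption, only the states and their rules come from the slide:

```c
/* Sketch of the 2-bit in-memory directory states described above. */
enum dir_state { DIR_REMOTE_INVALID, DIR_SNOOP_ALL, DIR_SHARED };

/* Returns nonzero if the HA must broadcast snoops for this access. */
static int needs_snoop_broadcast(enum dir_state s, int is_write)
{
    switch (s) {
    case DIR_REMOTE_INVALID: return 0;        /* no remote copies exist */
    case DIR_SNOOP_ALL:      return 1;        /* reads and writes snoop */
    case DIR_SHARED:         return is_write; /* only writes must snoop */
    }
    return 1; /* unreachable; be conservative */
}
```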
Benchmark Overview
Implemented using BenchIT
– Framework to develop and run microbenchmarks (http://www.benchit.org)
Assembler implementation
– Measurement routines: latency, single-threaded bandwidth, and aggregated bandwidth
– Precise timestamps using the Time Stamp Counter (rdtsc)
– Synchronization of concurrently running threads (cmpxchg)
Data placement
– Data is placed in the cache hierarchy of a certain core
– Measurement performed on the same or another core
– Optional cache flushes to measure individual cache levels
Coherence state control mechanisms
– Defined coherence state of the accessed data
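A minimal C sketch of the measurement principle for latency (the actual routines are hand-written assembler; rdtsc serialization and warm-up are omitted here, and the function names are illustrative):

```c
#include <stdint.h>

/* Read the Time Stamp Counter (real code also serializes with cpuid/lfence). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Pointer chasing: each load depends on the previous one, so the average
 * cycles per iteration approximate the access latency of the level that
 * holds the chain. */
uint64_t chase_latency(void **chain, long accesses)
{
    void **p = chain;
    uint64_t start = rdtsc();
    for (long i = 0; i < accesses; i++)
        p = (void **)*p;                  /* dependent load */
    uint64_t stop = rdtsc();
    __asm__ __volatile__("" :: "r"(p));   /* keep the chase alive */
    return (stop - start) / accesses;     /* average cycles per access */
}
```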
Data Placement
Data set size determines the cache level
Distance between cached data and the accessing core is controlled via the CPU affinity of the participating threads (sched_setaffinity())
Defined memory affinity (using libnuma) to investigate NUMA characteristics
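A minimal sketch of this placement mechanism, assuming Linux with libnuma installed; the helper name and core/node numbers are illustrative:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sched.h>   /* sched_setaffinity, CPU_SET */
#include <numa.h>    /* libnuma; link with -lnuma */

/* Pin the calling thread to one core and allocate the buffer on a
 * chosen NUMA node (check numa_available() first in real code). */
void *place_buffer(int core, int node, size_t bytes)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
    return numa_alloc_onnode(bytes, node);   /* free with numa_free() */
}
```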
Coherence State Control (I)
Intel and AMD use enhanced versions of the MESI protocol
– MESIF (Intel): the additional state Forward enables forwarding of shared cache lines between cores
– MOESI (AMD): the additional state Owned enables forwarding of modified cache lines without write-back
[Figure: MESIF [1] and MOESI [2] state transition diagrams; transitions between the states on Read Hit/Miss, Write Hit/Miss, Snoop Read, and Snoop Write events]
[1]: H. Hum and J. Goodman, Forward state for use in cache coherency in a multiprocessor system
[2]: AMD64 Architecture Programmer's Manual Volume 2: System Programming, Rev. 3.22
Coherence State Control (II)
Coordinated access sequences, possibly involving multiple threads, ensure that data is cached in the intended coherence state
Generating a certain coherence state in the caches of core N:
– Modified:
  - core N writes the data (invalidates other copies)
– Exclusive:
  - enforce Modified state in the caches of core N
  - flush caches (clflush)
  - core N reads the data
– Shared:
  - enforce Exclusive state in the caches of core N
  - another core reads the data
– Forward:
  - enforce Exclusive state in the caches of a core != N
  - core N reads the data
– Owned:
  - enforce Modified state in the caches of core N
  - another core reads the data
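A minimal C sketch of two of these sequences, assuming x86 clflush via intrinsics; thread pinning and the cross-core synchronization are omitted and all helper names are mine:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

static volatile long *line;   /* one cache-line-aligned shared variable */

/* Modified on core N: core N writes, invalidating all other copies. */
static void make_modified(void) { *line = 1; }

/* Exclusive on core N: write, flush everywhere, then re-read. */
static void make_exclusive(void)
{
    *line = 1;                          /* -> Modified on core N */
    _mm_clflush((const void *)line);    /* write back and invalidate */
    _mm_mfence();                       /* order the flush before the reload */
    (void)*line;                        /* read miss -> Exclusive */
}

/* Shared, Forward, and Owned follow the same pattern: establish
 * Exclusive or Modified first, then let another core issue a read. */
static void read_on_other_core(void) { (void)*line; }
```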
Test System (I)
Bull SAS bullx R421 E4
2x Xeon E5-2680 v3
– 12 cores / 24 threads
– 2.5 GHz base frequency (2.1 GHz for AVX)
– 30 MiB L3, variable frequency up to 3 GHz
– 4-channel DDR4 memory controller
– two 9.6 GT/s QPI links → up to 76.8 GB/s between sockets (38.4 GB/s in each direction)
128 GiB PC4-2133P-R → up to 68.3 GB/s per socket
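As a back-of-the-envelope check (not on the slide): each QPI link transfers 2 bytes per direction per cycle, so 9.6 GT/s × 2 B = 19.2 GB/s per link and direction; two links yield 38.4 GB/s per direction, i.e., 76.8 GB/s aggregate. Each DDR4-2133 channel delivers 2133 MT/s × 8 B ≈ 17.1 GB/s, so four channels give ≈ 68.3 GB/s per socket.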
Test System (II)
Default configuration:
– two NUMA nodes (one per socket)
– source snoop
Early Snoop disabled:
– two NUMA nodes
– home snoop
– no directory support
Cluster-on-Die mode:
– four NUMA nodes (two per socket)
– home snoop
– directory support enabled
– HitME caches enabled
[Figure: per-configuration NUMA topology of the two-socket system, including memory and I/O attachment]
Latency Results (I)
[Figure: latency measurements; annotations 1 and 2 mark the effects listed below]
Disabling early snoop:
1. increases the latency of remote cache accesses
2. increases the latency of local memory accesses
Latency Results (II)
[Figure: latencies in Cluster-on-Die mode; annotations 1–3 mark the effects listed below]
1. Reduction in local L3 and memory latency (up to 15% / 7%)
2. Accesses to the second L3 partition take considerably longer
3. Spread of remote latencies depending on the number of hops
Latency Results (III)
[Figure: latencies with directory support; annotations 1–3 mark the effects listed below]
1. The home node provides data if the directory cache indicates multiple copies
2. Snoop broadcasts occur if the directory cache misses ("snoop-all" directory state)
3. Memory latency increases due to outdated directory information
Bandwidth Results (I)
L3 performance scales almost linearly with the core count
L3 bandwidth depends on the uncore frequency
– 26.2–278 GB/s for nonrecurring accesses (28.2–292 GB/s with HT)
– 29.8–343 GB/s for repeated accesses (32.4–368 GB/s with HT)
Memory bandwidth reaches 63.1 GB/s, saturated with 6 cores
Bandwidth Results (II)
Coherence mode influence
– Single-threaded memory bandwidths mirror the latency results: default: 10.3 GB/s, early snoop disabled: 9.6 GB/s, COD: 12.6 GB/s
– Concurrent accesses of all cores achieve comparable per-socket bandwidths in all modes
Interconnect bandwidth
– Much higher remote bandwidth if early snoop is disabled

Concurrent read accesses, CPU-bind node 0, membind node 1 [GB/s]:
Cores            1     2     3     4     5     6     7     8     9    10    11    12
default         8.0  13.1  14.1  14.7  15.2  15.6  15.9  16.3  16.5  16.6        16.8
no early snoop  8.2  16.1  24.0  28.3  30.2  30.4  30.6
Application Performance (I)
[Figure: SPEC MPI2007 runtimes normalized to the default configuration (y-axis 0.9–1.1) for default, ES disabled, and COD; benchmarks 104.milc through 137.lu]
Message passing → good locality of memory accesses
Disabling early snoop is disadvantageous
– Reduced local memory performance outweighs the higher remote bandwidth
Small performance improvement due to COD in some benchmarks
Application Performance (II)
[Figure: SPEC OMP2012 runtimes normalized to the default configuration (y-axis 0.7–1.3) for default, ES disabled, and COD; benchmarks 350.md through 376.kdtree]
Shared memory paradigm → not necessarily NUMA-aware
The more complex NUMA topology in COD mode can reduce performance
Disabling early snoop is beneficial in some cases
– Higher interconnect bandwidth accelerates remote accesses
Summary
Micro-benchmarks show some huge differences between the coherence protocol modes
– Large increase of worst-case latencies in COD mode
– Much higher inter-socket bandwidth if early snoop is disabled
Significant differences in memory performance have surprisingly little influence on application performance
The default configuration is a good compromise for mixed workloads
– The increased latency in home snooping mode (early snoop disabled) mostly outweighs the increased remote bandwidth
– MPI applications reflect the small improvements in local memory performance and thus slightly favor COD mode
– However, COD mode can decrease the performance of shared memory applications
Thanks for your Attention
Benchmarks and the BenchIT framework are available as open source
– https://fusionforge.zih.tu-dresden.de/projects/benchit/
Backup
DRAM latency
Anomalous behavior for small data set sizes
– Presumably caused by DRAM organization, i.e., faster access to already-opened pages