Center for Information Services and High Performance Computing (ZIH)

Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
International Conference on Parallel Processing (ICPP), Beijing, September 4, 2015

Daniel Molka ([email protected]) Daniel Hackenberg ([email protected]) Robert Schöne ([email protected]) Wolfgang E. Nagel ([email protected])

Motivation
Complex composition of the processor

[Figure: 12-core Haswell-EP package; the 12-core die with QPI and PCI Express links, twelve cores (each with private L1 and L2), twelve L3 slices connected by ring queues, and two integrated memory controllers (IMCs) with four DDR4 channels (A–D)]

12-core Intel Haswell-EP architecture

– Multiple cores

– Segmented shared resources

→ How does performance scale with the number of cores used?


Motivation
New energy efficiency features
– Reduced frequency in AVX mode
– Uncore frequency scaling
→ How does this influence performance?

Several modes of the coherence protocol
– Source or home snooping
– Optional clustering into two NUMA domains
→ Unclear which is the optimal configuration


Outline
Haswell-EP architecture
– Architecture overview
– Cache coherence protocol

Benchmark Design and Implementation
– Measurement Routines
– Data Placement
– Coherence State Control

Results
– Latency
– Bandwidth
– Application Performance

Summary


Haswell-EP Architecture
Core Architecture
– ISA extensions: AVX2 (256-bit integer SIMD), FMA (illustrated in the sketch below)
– Reduced frequency when processing 256-bit instructions
– Enhanced out-of-order execution compared to the predecessor
  • More instructions in flight
  • More load/store buffers
– Additional AGU and wider data paths to the L1 and L2 caches

Uncore
– L3 consists of multiple slices (all slices shared by all cores)
– Two integrated memory controllers (IMCs), each with two DDR4 channels
– Two interconnected rings connect cores, L3 slices, and IMCs
– Uncore Frequency Scaling (UFS): hardware-controlled DVFS
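For illustration, the two ISA extensions named above can be exercised with compiler intrinsics. A minimal sketch, assuming GCC or Clang with -mavx2 -mfma; it is not taken from the presentation, whose kernels are hand-written assembler:

#include <immintrin.h>  /* AVX2 and FMA intrinsics */
#include <stddef.h>

/* One 256-bit FMA per iteration: y[i] = a * x[i] + y[i].
 * Executing such 256-bit instructions triggers the reduced AVX frequency. */
void axpy_fma(float *y, const float *x, float a, size_t n)
{
    __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i),
                                    _mm256_loadu_ps(y + i));
        _mm256_storeu_ps(y + i, vy);
    }
}

/* AVX2: 256-bit integer SIMD, e.g., eight packed 32-bit additions. */
void vec_add_epi32(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i vd = _mm256_loadu_si256((const __m256i *)(dst + i));
        __m256i vs = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(vd, vs));
    }
}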


Cluster-On-Die Mode

[Figure: 12-core Haswell-EP package partitioned into two clusters, Cluster 1 with one IMC (DDR4 channels A/B) and Cluster 2 with the other IMC (DDR4 channels C/D); the rings, cores 0–11 with their L1/L2, and the L3 slices are split between the clusters]

Each IMC is visible as a separate NUMA domain
– The software view (2× 6 cores) does not match the actual layout (8+4 cores)
– Higher distance to the QPI interconnect for the second cluster


Cache Coherence Protocol (I)
Snooping-based coherence protocol (MESIF)

[Figure: request/response flow of the snooping protocol across two nodes; the requesting core in the source node issues a request that is handled by the responsible caching agents (CAs), snooped in the peer nodes, and answered by the responsible home agent (HA) in the home node]

Implemented by "caching agents" (CA) in the L3 slices and "home agents" (HA) in the memory controllers

Cache Coherence Protocol (II)
What about the CAs in the peer nodes?

a) Source Snooping
– The CA in the source node broadcasts the request to all nodes
– Cached copies in state E, M, or F are forwarded to the requester
– The HA sends data from memory if required
→ Lowest latency

b) Home Snooping
– The CA forwards the request to the HA
– The HA broadcasts the request
→ Higher latency
– The HA can use a directory to filter unnecessary snoops
→ Lower bandwidth demand


Cache Coherence Protocol (III)
Directory support (already available in Sandy Bridge)
– 2 bits per cache line, stored in the ECC bits
  • No additional memory access
  • High latency until the directory information becomes available
– Three possible states (modeled in the sketch below):
  • "remote invalid": no snoops required
  • "snoop-all": broadcast snoops for read and write requests
  • "shared": broadcast snoops only for write requests

Directory Caches (HitME cache), new in Haswell
– 14 KiB per home node
– Stores 8-bit vectors of presence bits
– Entries are allocated when cache lines are forwarded between nodes
– In-memory directory bits are set to "snoop-all" when a HitME entry is created
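The filtering decision can be summarized in a toy model. This is a hypothetical sketch of the logic only, not Intel's implementation:

/* Toy model of the snoop-filtering decision based on the 2-bit in-memory
 * directory; the state names follow the slide, the code is hypothetical. */
typedef enum {
    DIR_REMOTE_INVALID, /* no remote copies: no snoops required         */
    DIR_SNOOP_ALL,      /* broadcast snoops for read and write requests */
    DIR_SHARED          /* broadcast snoops only for write requests     */
} dir_state_t;

static int needs_snoop_broadcast(dir_state_t state, int is_write)
{
    switch (state) {
    case DIR_REMOTE_INVALID: return 0;
    case DIR_SNOOP_ALL:      return 1;
    case DIR_SHARED:         return is_write;
    }
    return 1; /* unknown state: snoop conservatively */
}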


Outline
Haswell-EP architecture
– Architecture overview
– Cache coherence protocol

Benchmark Design and Implementation
– Measurement Routines
– Data Placement
– Coherence State Control

Results
– Latency
– Bandwidth
– Application Performance

Summary


Benchmark Overview
Implemented using BenchIT
– Framework to develop and run microbenchmarks (http://www.benchit.org)
Assembler implementation
– Measurement routines: latency, single-threaded bandwidth, and aggregated bandwidth
– Precise timestamps using the Time Stamp Counter (rdtsc)
– Synchronization of concurrently running threads (cmpxchg); see the sketch below
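The timing and synchronization primitives can be sketched in C. A minimal sketch, assuming GCC or Clang on x86_64; the actual measurement routines are hand-written assembler:

#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), _mm_lfence() */

/* Serialized timestamp: lfence keeps out-of-order execution from moving
 * instructions across the rdtsc, so intervals are measured precisely. */
static inline uint64_t timestamp(void)
{
    _mm_lfence();
    uint64_t tsc = __rdtsc();
    _mm_lfence();
    return tsc;
}

/* Spin barrier built on atomic read-modify-write operations (compiled to
 * lock-prefixed instructions such as xadd/cmpxchg), analogous to the
 * synchronization of concurrently running benchmark threads. */
static void barrier_wait(volatile int *arrived, int nthreads)
{
    __sync_fetch_and_add(arrived, 1);
    while (__sync_val_compare_and_swap(arrived, nthreads, nthreads)
           != nthreads)
        ; /* spin until all threads have arrived */
}

A measurement then brackets the access kernel between two timestamp() calls and divides the cycle difference by the number of accesses.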

Data placement
– Data is placed in the cache hierarchy of a certain core
– The measurement is performed on the same or another core
– Optional cache flushes to measure individual cache levels

Coherence state control mechanisms
– Defined coherence state of the accessed data


Data Placement

– The data set size determines the cache level
– The distance between cached data and the accessing core is controlled by the CPU affinity of the participating threads (sched_setaffinity())
– Defined memory affinity (using libnuma) to investigate NUMA characteristics (sketch below)
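A minimal sketch of these mechanisms, assuming Linux with libnuma; the core and node numbers are illustrative:

#define _GNU_SOURCE
#include <sched.h>  /* sched_setaffinity(), CPU_SET */
#include <numa.h>   /* numa_alloc_onnode(); link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }

    /* Pin the calling thread to core 0 so the distance between the
     * accessing core and the cached data is well defined. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Allocate the buffer on NUMA node 1 to measure remote accesses. */
    size_t size = 64UL * 1024 * 1024; /* larger than the 30 MiB L3 */
    char *buf = numa_alloc_onnode(size, 1);
    if (buf == NULL) { perror("numa_alloc_onnode"); return 1; }
    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 0; /* touch every page so it is actually backed */

    /* ... run the latency/bandwidth kernel on buf ... */
    numa_free(buf, size);
    return 0;
}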


Coherence State Control (I)
Intel and AMD use enhanced versions of the MESI protocol
– MESIF (Intel): the additional state Forward enables forwarding of shared cache lines between cores
– MOESI (AMD): the additional state Owned enables forwarding of modified cache lines without write-back

[Figure: MESIF [1] and MOESI [2] state-transition diagrams, showing the transitions between the M, E, S, I (plus F or O) states on read/write hits and misses and on snoop probes]

[1]: H. Hum and J. Goodman, Forward state for use in cache coherency in a multiprocessor system
[2]: AMD64 Architecture Programmer's Manual Volume 2: System Programming, Rev. 3.22

Coherence State Control (II)
Coordinated access sequences, which can involve multiple threads, ensure that data is cached in the intended coherence state.

Generating a certain coherence state in the caches of core N:
– Modified: core N writes the data (invalidates other copies)
– Exclusive: enforce Modified state in the caches of core N; flush caches (clflush); core N reads the data
– Shared: enforce Exclusive state in the caches of core N; another core reads the data
– Forward: enforce Exclusive state in the caches of a core other than N; core N reads the data
– Owned: enforce Modified state in the caches of core N; another core reads the data
(A sketch of two of these sequences follows below.)
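Two of the sequences, sketched in C under the assumption that the threads are already pinned as on the data placement slide; the real benchmarks implement this in assembler:

#include <emmintrin.h> /* _mm_clflush(), _mm_mfence() */
#include <stddef.h>

/* Modified: the target core writes every cache line, which invalidates
 * all other copies. */
static void make_modified(volatile char *buf, size_t size)
{
    for (size_t i = 0; i < size; i += 64) /* one store per 64 B line */
        buf[i] = (char)i;
}

/* Exclusive: enforce Modified first, flush the lines back to memory
 * (clflush), then re-read them so they are fetched in a clean state. */
static void make_exclusive(volatile char *buf, size_t size)
{
    make_modified(buf, size);
    for (size_t i = 0; i < size; i += 64)
        _mm_clflush((const void *)(buf + i));
    _mm_mfence(); /* order the flushes before the reloading reads */
    for (size_t i = 0; i < size; i += 64) {
        volatile char sink = buf[i]; /* reload the line */
        (void)sink;
    }
}

Shared, Forward, and Owned then only require running the final read step on a different core than the one that established the preceding state.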


Outline
Haswell-EP architecture
– Architecture overview
– Cache coherence protocol

Benchmark Design and Implementation
– Measurement Routines
– Data Placement
– Coherence State Control

Results
– Latency
– Bandwidth
– Application Performance

Summary


Test System (I)
Bull SAS bullx R421 E4

2× Xeon E5-2680 v3
– 12 cores / 24 threads
– 2.5 GHz base frequency (2.1 GHz for AVX)
– 30 MiB L3, variable frequency up to 3 GHz
– 4-channel DDR4 memory controller
– Two 9.6 GT/s QPI links
→ up to 76.8 GB/s between sockets (38.4 GB/s in each direction)

128 GiB PC4-2133P-R
→ up to 68.3 GB/s per socket (see the worked numbers below)
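The peak bandwidths follow from the link parameters. A worked check, assuming the standard widths of 2 B per QPI transfer and direction and 8 B per DDR4 transfer:

\begin{align*}
2 \times 9.6\,\mathrm{GT/s} \times 2\,\mathrm{B} &= 38.4\,\mathrm{GB/s} \text{ per direction} \;\Rightarrow\; 76.8\,\mathrm{GB/s} \text{ bidirectional} \\
4 \times 2133\,\mathrm{MT/s} \times 8\,\mathrm{B} &\approx 68.3\,\mathrm{GB/s} \text{ per socket}
\end{align*}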


Test System (II)

[Figure: NUMA topologies of the two sockets, each with four memory channels and I/O links, in the three configurations]

Default configuration:
– two NUMA nodes
– source snoop

Early Snoop disabled:
– two NUMA nodes
– home snoop
– no directory support

Cluster-on-Die mode:
– four NUMA nodes
– home snoop
– directory support enabled
– HitME caches enabled

Latency Results (I)

[Figure: latency measurements, annotations 1–2 referenced below]

Disabling early snoop:
1. increases the latency of remote cache accesses
2. and the latency of local memory accesses

Latency Results (II)

[Figure: latency measurements, annotations 1–3 referenced below]

1. Reduction in local L3 and memory latency (up to 15% and 7%, respectively)
2. Accesses to the second L3 partition take considerably longer
3. Spread of remote latencies depending on the number of hops

Latency Results (III)

[Figure: latency measurements, annotations 1–3 referenced below]

1. The home node provides data if the directory cache indicates multiple copies
2. Snoop broadcasts if the directory cache misses ("snoop-all" directory state)
3. Memory latency increases due to outdated directory information

Bandwidth Results (I)

L3 performance scales almost linearly with the core count
L3 bandwidth depends on the uncore frequency
– 26.2–278 GB/s for nonrecurring accesses (28.2–292 GB/s with HT)
– 29.8–343 GB/s for repeated accesses (32.4–368 GB/s with HT)

Memory bandwidth reaches 63.1 GB/s, saturated with 6 cores


Bandwidth Results (II)
Coherence mode influence
– Single-threaded memory bandwidths mirror the latency results:
  default: 10.3 GB/s, early snoop disabled: 9.6 GB/s, COD: 12.6 GB/s
– Concurrent accesses by all cores achieve comparable per-socket bandwidths in all modes

Interconnect bandwidth
– Much higher remote bandwidth if early snoop is disabled

Cores            1    2    3    4    5    6    7    8    9    10   11   12
default          8.0  13.1 14.1 14.7 15.2 15.6 15.9 16.3 16.5 16.6 …    16.8
no early snoop   8.2  16.1 24.0 28.3 30.2 30.4 30.6 …    …    …    …    …

Concurrent read accesses, CPU bound to node 0, memory bound to node 1 [GB/s]

Application Performance (I)

[Figure: normalized runtime (0.9–1.1) of the SPEC MPI2007 benchmarks (104.milc, 107.leslie3d, 113.GemsFDTD, 115.fds4, 121.pop2, 122.tachyon, 126.lammps, 127.wrf2, 128.GAPgeofem, 129.tera_tf, 130.socorro, 132.zeusmp2, 137.lu) for the default, ES disabled, and COD configurations]

Message passing → good locality of memory accesses
Disabling early snoop is disadvantageous
– The reduced local memory performance outweighs the higher remote bandwidth
Small performance improvement due to COD in some benchmarks

Application Performance (II)

[Figure: normalized runtime (0.7–1.3) of the SPEC OMP2012 benchmarks (350.md, 351.bwaves, 352.nab, 357.bt331, 358.botsalgn, 359.botsspar, 360.ilbdc, 362.fma3d, 363.swim, 367.imagick, 370.mgrid331, 371.applu331, 372.smithwa, 376.kdtree) for the default, ES disabled, and COD configurations]

Shared memory paradigm → not necessarily NUMA-aware
The more complex NUMA topology in COD mode can reduce performance
Disabling early snoop is beneficial in some cases
– The higher interconnect bandwidth accelerates remote accesses

Summary
Micro-benchmarks show some large differences between the coherence protocol modes
– Large increase of worst-case latencies in COD mode
– Much higher inter-socket bandwidth if early snoop is disabled

Significant differences in memory performance have surprisingly little influence on application performance

The default configuration is a good compromise for mixed workloads
– The increased latency in home snooping mode (early snoop disabled) mostly outweighs the increased remote bandwidth
– MPI applications benefit from the small improvements in local memory performance and thus slightly favor COD mode
– However, COD mode can decrease the performance of shared memory applications


Thanks for your Attention

Benchmarks and the BenchIT framework are available as open source at https://fusionforge.zih.tu-dresden.de/projects/benchit/


Backup


DRAM latency

Anomalous behavior for small data set sizes
– Presumably caused by the DRAM organization, i.e., faster access to already-open pages
