ECE 4750 Computer Architecture
Topic 6: Cache Microarchitecture

Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750

slide revision: 2013-10-01-22-23


Agenda

  • Single-Bank Cache Microarchitecture
  • Multi-Bank Cache Microarchitecture
  • Basic Optimizations
  • Cache Examples

Single-Bank Cache Microarchitecture

Direct-Mapped Cache Microarchitecture
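
Since the lecture figure is not reproduced here, the following C sketch (purely illustrative; the 256-set, 16-byte-line geometry is an assumption, not the course's design) shows the essence of a direct-mapped lookup: split the address into offset, index, and tag, select one line with the index, and declare a hit on a single valid-bit and tag comparison.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   256   /* assumed geometry: 256 sets      */
#define LINE_BYTES 16    /* assumed geometry: 16-byte lines */

typedef struct {
  bool     valid;
  uint32_t tag;
  uint8_t  data[LINE_BYTES];
} line_t;

static line_t cache[NUM_SETS];

/* Return true on a hit and copy the requested byte into *byte. */
bool dm_read( uint32_t addr, uint8_t* byte )
{
  uint32_t offset = addr & (LINE_BYTES-1);           /* addr bits [3:0]   */
  uint32_t index  = (addr / LINE_BYTES) % NUM_SETS;  /* addr bits [11:4]  */
  uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);  /* addr bits [31:12] */

  line_t* line = &cache[index];
  if ( line->valid && line->tag == tag ) {
    *byte = line->data[offset];   /* tag check plus data read; hardware may
                                     do these serially or in parallel      */
    return true;
  }
  return false;  /* miss: refill the line from the next level, then retry */
}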


Set-Associative Cache Microarchitecture
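
As an illustrative follow-on (again an assumption, reusing line_t, NUM_SETS, and LINE_BYTES from the sketch above), a set-associative lookup selects one set and compares the tags of all ways; the replacement policy is omitted.

#define NUM_WAYS 4   /* assumed: four ways per set */

typedef struct {
  line_t ways[NUM_WAYS];   /* line_t as defined in the sketch above */
} set_t;

static set_t sa_cache[NUM_SETS];

/* Return the matching way index on a hit, or -1 on a miss. */
int sa_lookup( uint32_t addr )
{
  uint32_t index = (addr / LINE_BYTES) % NUM_SETS;
  uint32_t tag   = addr / (LINE_BYTES * NUM_SETS);

  /* Hardware compares the tags of all ways in parallel; software models
     the parallel comparators with a loop over the ways of the set.      */
  for ( int w = 0; w < NUM_WAYS; w++ ) {
    line_t* line = &sa_cache[index].ways[w];
    if ( line->valid && line->tag == tag )
      return w;
  }
  return -1;  /* miss: pick a victim way (e.g., LRU) and refill it */
}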


Fully Associative Cache Microarchitecture


Synchronous SRAMs


Direct-Mapped Parallel Read Hit Path


Direct-Mapped Pipelined Write Hit Path


Set-Associative Parallel Read Hit Path


Processor-Cache Interaction

[Figure: processor pipeline datapath with the primary instruction cache accessed during instruction fetch and the primary data cache accessed during the memory stage; hit? signals from both caches feed the stall logic. The entire CPU stalls on a data cache miss, and cache refill data arrives from the lower levels of the memory hierarchy through the memory controller.]


Zero-Cycle Hit Latency with Tightly Coupled Interface

[Figure: five-stage processor datapath with the instruction memory (imem) tightly coupled into the Fetch (F) stage and the data memory (dmem) tightly coupled into the Memory (M) stage, where the tag check is followed by the data access. Stages: Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), Writeback (W).]


Two-Cycle Hit Latency with Val/Rdy Interface

[Figure: the same datapath restructured so that the instruction and data caches are accessed through val/rdy memreq/memresp interfaces; fetch is split into two stages (F0/F1) and memory into two stages (M0/M1) for tag check followed by data access, and a not-ready (!rdy) response stalls the pipeline. Stages: Fetch (F0/F1), Decode & Reg Read (D), Execute (X), Memory (M0/M1), Writeback (W).]


Parallel Read, Pipelined Write Hit Path

[Figure: datapath in which the data cache performs the tag check in parallel with the read access during the Memory (M) stage, while the write access is pipelined into the Writeback (W) stage; the slide asks what new hazards this creates. Stages: Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M) — Tag Check ∥ Read Access, Writeback (W) — Write Access.]

Multi-Bank Cache Microarchitecture


Multicore PARC w/ Shared Unified I/D $


Multicore PARC w/ Shared Single-Bank I$ and D$


Multicore PARC w/ Shared Multi-Bank I$ and D$
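
To give a concrete sense of how a multi-bank cache steers requests (an illustrative assumption, not a description of the PARC memory system in the figures), banks are commonly selected by a few low-order bits of the block address so that consecutive cache lines interleave across banks. LINE_BYTES is reused from the direct-mapped sketch earlier.

#define NUM_BANKS 4   /* assumed: four banks, a power of two */

/* Interleave consecutive cache lines across banks: the bank index comes
   from the low-order bits of the block (line) address.                  */
static inline uint32_t bank_index( uint32_t addr )
{
  uint32_t block_addr = addr / LINE_BYTES;   /* strip the byte offset */
  return block_addr % NUM_BANKS;             /* low-order block bits  */
}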


Multicore PARC w/ Private I$ and D$


Multicore PARC w/ Private I$ and Shared D$


Basic Optimizations


Reduce Average Memory Access Time

Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )   (a worked example follows the list below)

• Reduce hit time
  - Small and simple caches
• Reduce miss rate
  - Large block size
  - Large cache size
  - High associativity
  - Compiler optimizations
• Reduce miss penalty
  - Multi-level cache hierarchy
  - Prioritize reads
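
As a quick worked example with assumed numbers (not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty, Avg Mem Access Time = 1 + 0.05 × 40 = 3 cycles. Halving the miss rate to 2.5% lowers this to 1 + 0.025 × 40 = 2 cycles, whereas shaving one cycle off the miss penalty only lowers it to 1 + 0.05 × 39 = 2.95 cycles, which is why techniques targeting all three terms are listed above.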


Reduce Hit Time: Small & Simple Caches


Reduce Miss Rate: Large Block Size

Benefits:
  • Less tag overhead
  • Exploit fast burst transfers from DRAM
  • Exploit fast burst transfers over wide on-chip busses

Drawbacks:
  • Can waste bandwidth if data is not used
  • Fewer blocks → more conflicts
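
To make the tag-overhead point concrete with assumed parameters (not taken from the slides): a 32 KB direct-mapped cache with 32-bit addresses has 17-bit tags regardless of line size, but with 16 B lines it needs 2048 tags while with 64 B lines it needs only 512, a 4× reduction in tag storage — paid for with larger refills and, with fewer lines, potentially more conflict misses.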


Reduce Miss Rate: Large Cache Size

Empirical Rule of Thumb: if cache size is doubled, the miss rate usually drops by about √2.
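
For example, by this rule of thumb (an empirical approximation, not a guarantee), growing a cache from 32 KB to 64 KB would be expected to lower a 4% miss rate to roughly 4%/√2 ≈ 2.8%.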


Reduce Miss Rate: High Associativity

Empirical Rule of Thumb: a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.
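
By this rule of thumb, for instance, a 64 KB direct-mapped cache and a 32 KB two-way set-associative cache would be expected to have roughly comparable miss rates.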


Reduce Miss Rate: Compiler Optimizations

• Restructuring code affects the data block access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
  - Useful for variables that will only be accessed once before eviction
  - Needs a mechanism for software to tell hardware not to cache data
    ("no-allocate" instruction hints or page table bits)
• Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead-cache locations


Loop Interchange

Before:

  for ( j = 0; j < N; j++ ) {
    for ( i = 0; i < M; i++ ) {
      x[i][j] = 2 * x[i][j];   /* inner loop strides down a column: large stride */
    }
  }

After:

  for ( i = 0; i < M; i++ ) {
    for ( j = 0; j < N; j++ ) {
      x[i][j] = 2 * x[i][j];   /* inner loop walks consecutive elements of a row */
    }
  }

What type of locality does this improve?


Loop Fusion

Before:

  for ( i = 0; i < N; i++ )
    a[i] = b[i] * c[i];
  for ( i = 0; i < N; i++ )
    d[i] = a[i] * c[i];

After:

  for ( i = 0; i < N; i++ ) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];   /* a[i] and c[i] are reused while still cached */
  }

What type of locality does this improve?


Reduce Miss Penalty: Multi-Level Caches

[Figure: processor → L1 cache → L2 cache → main memory, contrasting an L1 hit with an L1 miss that hits in the L2.]

Avg Mem Access Time = Hit Time of L1 + ( Miss Rate of L1 × Miss Penalty of L1 )
Miss Penalty of L1  = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )   (worked example below)

• Local miss rate = misses in cache / accesses to cache
• Global miss rate = misses in cache / processor memory accesses
• Misses per instruction = misses in cache / number of instructions
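
As a worked example with assumed numbers (not from the slides): if L1 hits take 1 cycle with a 10% local miss rate, L2 hits take 10 cycles with a 20% local miss rate, and main memory takes 100 cycles, then Miss Penalty of L1 = 10 + 0.20 × 100 = 30 cycles and Avg Mem Access Time = 1 + 0.10 × 30 = 4 cycles. The global L2 miss rate here is 0.10 × 0.20 = 2% of processor memory accesses, even though its local miss rate is 20%.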


Reduce Miss Penalty: Multi-Level Caches

• Use a smaller L1 if there is also an L2
  - Trade increased L1 miss rate for reduced L1 hit time and L1 miss penalty
  - Reduces average access energy
• Use a simpler write-through L1 with an on-chip L2
  - The write-back L2 cache absorbs write traffic, so it does not go off-chip
  - Simplifies the processor pipeline
  - Simplifies on-chip coherence issues
• Inclusive multilevel cache
  - Inner cache holds copies of data in the outer cache
  - External coherence is simpler
• Exclusive multilevel cache
  - Inner cache holds data not in the outer cache
  - Swap lines between inner and outer cache on a miss


Reduce Miss Penalty: Prioritize Reads

[Figure: CPU with register file and data cache, with a write buffer between the data cache and the unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.]

• The processor is not stalled on writes, and read misses can go ahead of writes to main memory
• The write buffer may hold the updated value of a location needed by a read miss
  - On a read miss, wait for the write buffer to be empty, or
  - Check write buffer addresses and bypass
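
A minimal C sketch of the address check described above, assuming a hypothetical small write buffer (the structure and names are illustrative, not the course's implementation): before sending a read miss onward, the controller compares the miss address against every buffered write and bypasses the newest matching data if one is found.

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4   /* assumed: four-entry write buffer */

typedef struct {
  bool     valid;
  uint32_t addr;   /* address of the buffered write */
  uint32_t data;
} wb_entry_t;

static wb_entry_t write_buf[WB_ENTRIES];   /* entry 0 is oldest (assumed ordering) */

/* Before sending a read miss to the next level, check whether any buffered
   write matches the miss address; if so, bypass the newest matching data.  */
bool write_buffer_bypass( uint32_t miss_addr, uint32_t* data )
{
  for ( int i = WB_ENTRIES-1; i >= 0; i-- ) {   /* newest first */
    if ( write_buf[i].valid && write_buf[i].addr == miss_addr ) {
      *data = write_buf[i].data;
      return true;   /* read serviced from the write buffer */
    }
  }
  return false;  /* no match: the read miss goes to the unified L2 */
}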


Cache Optimizations Impact on Average Memory Access Time

  Technique                Hit Time   Miss Penalty   Miss Rate   HW
  Parallel read hit           −                                   0
  Pipelined write hit         −                                   1
  Smaller caches              −                          ++       0
  Large block size                         +              −       0
  Large cache size            ++                          −       1
  High associativity          ++                          −       1
  Compiler optimizations                                  −       0
  Multi-level cache                        −                      2
  Prioritize reads                         −                      1

  ( − = the technique decreases that term, + = increases it;
    HW = relative hardware complexity )

Cache Examples


Itanium-2 On-Chip Caches
Intel Itanium-2 (Intel/HP, 2002)

  • Level 1: 16 KB, 4-way set-associative, 64 B lines, quad-port (2 load + 2 store), single-cycle latency
  • Level 2: 256 KB, 4-way set-associative, 128 B lines, quad-port (4 load or 4 store), five-cycle latency
  • Level 3: 3 MB, 12-way set-associative, 128 B lines, single 32 B port, twelve-cycle latency


IBM Power-7 On-Chip Caches [IBM 2009]

  • 32 KB L1 I$ per core and 32 KB L1 D$ per core, 3-cycle latency
  • 256 KB unified L2$ per core, 8-cycle latency
  • 32 MB unified shared L3$ in embedded DRAM, 25-cycle latency to the local slice


Acknowledgements

Some of these slides contain material developed and copyrighted by:
Arvind (MIT), Krste Asanović (MIT/UCB), Joel Emer (Intel/MIT),
James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)

MIT material derived from course 6.823
UCB material derived from courses CS152 and CS252
