ECE 4750 Computer Architecture
Topic 6: Cache Microarchitecture
Christopher Batten
School of Electrical and Computer Engineering, Cornell University
http://www.csl.cornell.edu/courses/ece4750 (slide revision: 2013-10-01-22-23)
Agenda
• Single-Bank Cache Microarchitecture •
Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
ECE 4750
T06: Cache Microarchitecture
2 / 36
Direct Mapped Cache Microarchitecture
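The datapath figure for this slide does not survive text extraction. As a rough, non-authoritative sketch of the idea (all parameters and names below are illustrative assumptions, not the lecture's design), a direct-mapped lookup uses the index bits of the address to select exactly one line, then compares the stored tag against the address tag:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative parameters (assumptions, not from the slide). */
#define NUM_LINES  16   /* direct-mapped: one candidate line per address */
#define LINE_BYTES 64   /* 6 offset bits, 4 index bits                   */

typedef struct {
    bool     valid;
    uint32_t tag;
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Address = | tag | index | offset | */
static uint32_t addr_index(uint32_t a) { return (a / LINE_BYTES) % NUM_LINES; }
static uint32_t addr_tag  (uint32_t a) { return a / (LINE_BYTES * NUM_LINES); }

/* Hit iff the single candidate line is valid and the tags match. */
bool dm_lookup(uint32_t addr)
{
    cache_line_t *line = &cache[addr_index(addr)];
    return line->valid && (line->tag == addr_tag(addr));
}

/* Refill the candidate line on a miss (writeback not modeled). */
void dm_refill(uint32_t addr)
{
    cache_line_t *line = &cache[addr_index(addr)];
    line->valid = true;
    line->tag   = addr_tag(addr);
}
```

Because only one line can hold a given address, two addresses with the same index but different tags conflict even when the rest of the cache is empty.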
Set Associative Cache Microarchitecture
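The figure for this slide is likewise lost. A set-associative lookup checks every way of the selected set in parallel; a miss refills the victim chosen by the replacement policy. The sketch below is a toy two-way model with assumed parameters and an assumed LRU policy (the lecture's policy is not recoverable here):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative 2-way set-associative sketch (assumed parameters). */
#define NUM_SETS   8
#define NUM_WAYS   2
#define LINE_BYTES 64

typedef struct { bool valid; uint32_t tag; } way_t;

typedef struct {
    way_t way[NUM_WAYS];
    int   lru;            /* index of the least-recently-used way */
} set_t;

static set_t sets[NUM_SETS];

static uint32_t sa_index(uint32_t a) { return (a / LINE_BYTES) % NUM_SETS; }
static uint32_t sa_tag  (uint32_t a) { return a / (LINE_BYTES * NUM_SETS); }

/* Check all ways of one set (hardware does this in parallel);
 * on a hit, update the LRU state. On a miss, refill the LRU way. */
bool sa_access(uint32_t addr)
{
    set_t *s = &sets[sa_index(addr)];
    for (int w = 0; w < NUM_WAYS; w++) {
        if (s->way[w].valid && s->way[w].tag == sa_tag(addr)) {
            s->lru = 1 - w;   /* the other way is now LRU */
            return true;      /* hit */
        }
    }
    int victim = s->lru;
    s->way[victim].valid = true;
    s->way[victim].tag   = sa_tag(addr);
    s->lru = 1 - victim;
    return false;             /* miss */
}
```

Two addresses that conflict in a direct-mapped cache can coexist here, since each set holds two tags.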
Fully Associative Cache Microarchitecture
Synchronous SRAMs
Direct-Mapped Parallel Read Hit Path
Direct-Mapped Pipelined Write Hit Path
Set-Associative Parallel Read Hit Path
Processor-Cache Interaction

[Datapath figure: five-stage pipeline (PC, IR, decode/register fetch, ALU, MD1/MD2) with a primary instruction cache on the fetch path and a primary data cache on the memory path; each cache produces a hit? signal, and refill data arrives from the lower levels of the memory hierarchy via the memory controller.]

Stall the entire CPU on a data cache miss.
Zero-Cycle Hit Latency with Tightly Coupled Interface

[Pipeline diagram: the instruction and data caches are embedded directly in the five-stage datapath — Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M: tag check → data access), Writeback (W) — so a cache hit adds no extra cycles to the pipeline.]
Two-Cycle Hit Latency with Val/Rdy Interface

[Pipeline diagram: the instruction and data caches sit behind val/rdy memreq/memresp interfaces, adding a pipeline stage on each side — Fetch (F0/F1), Decode & Reg Read (D), Execute (X), Memory (M0/M1: tag check → data access), Writeback (W); a !rdy memory request stalls the requesting stage.]
Parallel Read, Pipelined Write Hit Path

[Pipeline diagram: reads perform the tag check and the data read in parallel in the Memory (M) stage, while writes check the tag in M and perform the data write in Writeback (W). New hazards? A read in M may need data a preceding write has not yet committed.]
Agenda
Single-Bank Cache Microarchitecture
• Multi-Bank Cache Microarchitecture •
Basic Optimizations
Cache Examples
Multicore PARC w/ Shared Unified I/D $
Multicore PARC w/ Shared Single-Bank I$ and D$
Multicore PARC w/ Shared Multi-Bank I$ and D$
Multicore PARC w/ Private I$ and D$
Multicore PARC w/ Private I$ and Shared D$
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
• Basic Optimizations •
Cache Examples
Reduce Average Memory Access Time
Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )
• Reduce hit time
   - Small and simple caches
• Reduce miss rate
   - Large block size
   - Large cache size
   - High associativity
   - Compiler optimizations
• Reduce miss penalty
   - Multi-level cache hierarchy
   - Prioritize reads
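The equation above can be exercised with a toy calculation (the numbers are made up for illustration, not from the lecture):

```c
/* AMAT = hit time + miss rate * miss penalty (times in cycles).
 * Example: a 1-cycle hit, 5% miss rate, and 50-cycle penalty
 * average out to 3.5 cycles per access. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

The three bullet groups map directly onto the three terms: each optimization attacks exactly one factor of this product-plus-sum.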
Reduce Hit Time: Small & Simple Caches
Reduce Miss Rate: Large Block Size
+ Less tag overhead
+ Exploit fast burst transfers from DRAM
+ Exploit fast burst transfers over wide on-chip busses
− Can waste bandwidth if data is not used
− Fewer blocks → more conflicts
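The "less tag overhead" point can be made concrete with a small assumed-parameter calculation: for a fixed direct-mapped capacity and 32-bit addresses, growing the block size shrinks the line count (and hence total tag storage) while the per-line tag width stays constant:

```c
/* Total tag storage for a direct-mapped cache, 32-bit addresses.
 * Parameters must be powers of two (illustrative sketch only). */
static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int total_tag_bits(unsigned cache_bytes, unsigned block_bytes)
{
    unsigned lines  = cache_bytes / block_bytes;
    int offset_bits = log2i(block_bytes);
    int index_bits  = log2i(lines);
    int tag_bits    = 32 - index_bits - offset_bits;  /* per line */
    return (int)lines * tag_bits;                     /* whole cache */
}
```

For an assumed 4KB cache, moving from 16B to 64B blocks cuts total tag bits from 5120 to 1280: index + offset together always cover log2(capacity) bits, so the tag width is unchanged while the line count drops 4×.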
Reduce Miss Rate: Large Cache Size
Empirical Rule of Thumb: if the cache size is doubled, the miss rate usually drops by about √2 (e.g., from 2% to roughly 1.4%)
Reduce Miss Rate: High Associativity
Empirical Rule of Thumb: a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
Reduce Miss Rate: Compiler Optimizations
• Restructure code to change the data block access sequence
   - Group data accesses together to improve spatial locality
   - Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
   - Useful for variables that will only be accessed once before eviction
   - Needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page-table bits)
• Kill data that will never be used again
   - Streaming data exploits spatial locality but not temporal locality
   - Replace into dead cache locations
Loop Interchange

Before:
    for (j = 0; j < N; j++) {
      for (i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
      }
    }

After:
    for (i = 0; i < M; i++) {
      for (j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
      }
    }

What type of locality does this improve?
Loop Fusion

Before:
    for (i = 0; i < N; i++)
      a[i] = b[i] * c[i];
    for (i = 0; i < N; i++)
      d[i] = a[i] * c[i];

After:
    for (i = 0; i < N; i++) {
      a[i] = b[i] * c[i];
      d[i] = a[i] * c[i];
    }

What type of locality does this improve?
Reduce Miss Penalty: Multi-Level Caches

[Figure: processor backed by an L1 cache, an L2 cache, and main memory; a hit is serviced by L1, while an L1 miss that hits in L2 is serviced by L2.]
Avg Mem Access Time = Hit Time of L1 + ( Miss Rate of L1 × Miss Penalty of L1 )
Miss Penalty of L1 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )
• Local miss rate = misses in this cache / accesses to this cache
• Global miss rate = misses in this cache / processor memory accesses
• Misses per instruction = misses in this cache / number of instructions
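The two equations above compose directly, with each level characterized by its local miss rate. A sketch with illustrative (assumed) numbers:

```c
/* Two-level AMAT, composing the lecture's two equations:
 *   miss_penalty_L1 = hit_L2 + local_miss_L2 * mem_latency
 *   AMAT            = hit_L1 + local_miss_L1 * miss_penalty_L1
 * All times in cycles; miss rates are local (per-level) rates. */
double amat_two_level(double hit_l1, double local_miss_l1,
                      double hit_l2, double local_miss_l2,
                      double mem_latency)
{
    double miss_penalty_l1 = hit_l2 + local_miss_l2 * mem_latency;
    return hit_l1 + local_miss_l1 * miss_penalty_l1;
}
```

With an assumed 1-cycle L1 hit, 5% local L1 miss rate, 8-cycle L2 hit, 20% local L2 miss rate, and 100-cycle memory, the L1 miss penalty is 28 cycles and the AMAT is 2.4 cycles; the global L2 miss rate is 0.05 × 0.20 = 1% of processor accesses.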
Reduce Miss Penalty: Multi-Level Caches
• Use a smaller L1 if there is also an L2
   - Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty
   - Reduces average access energy
• Use a simpler write-through L1 with an on-chip L2
   - Write-back L2 cache absorbs write traffic, doesn't go off-chip
   - Simplifies the processor pipeline
   - Simplifies on-chip coherence issues
• Inclusive multi-level cache
   - Inner cache holds a copy of data in the outer cache
   - External coherence is simpler
• Exclusive multi-level cache
   - Inner cache holds data not in the outer cache
   - Swap lines between inner and outer cache on a miss
Reduce Miss Penalty: Prioritize Reads

[Figure: the CPU/register file and data cache feed a write buffer, which holds evicted dirty lines (write-back cache) or all writes (write-through cache) on their way to the unified L2 cache.]

• The processor is not stalled on writes, and read misses can go ahead of writes to main memory
• The write buffer may hold the updated value of a location needed by a read miss
   - Simple: on a read miss, wait for the write buffer to empty
   - Faster: check write-buffer addresses and bypass the matching entry
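The bypass option can be sketched as a toy model (the structure and names are assumptions for illustration, not the lecture's hardware):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy write buffer: read misses must check it, since the freshest
 * copy of a location may still be waiting to drain to memory. */
#define WBUF_ENTRIES 4

typedef struct { bool valid; uint32_t addr; uint32_t data; } wbuf_entry_t;
static wbuf_entry_t wbuf[WBUF_ENTRIES];

/* Buffer a store on its way to the next level. A full buffer would
 * stall the store in real hardware (not modeled here). */
void wbuf_insert(uint32_t addr, uint32_t data)
{
    for (int i = 0; i < WBUF_ENTRIES; i++) {
        if (!wbuf[i].valid) {
            wbuf[i] = (wbuf_entry_t){ true, addr, data };
            return;
        }
    }
}

/* On a read miss, bypass from a matching write-buffer entry instead of
 * waiting for the whole buffer to drain. Returns true on a match. */
bool wbuf_bypass(uint32_t addr, uint32_t *data_out)
{
    for (int i = 0; i < WBUF_ENTRIES; i++) {
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            *data_out = wbuf[i].data;
            return true;
        }
    }
    return false;
}
```

The address check is the cost of letting reads overtake writes: without it, a read miss could fetch a stale value from memory.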
Cache Optimizations: Impact on Average Memory Access Time

Technique                Hit Time   Miss Rate   Miss Penalty   HW Complexity
Parallel read hit           −                                        0
Pipelined write hit         −                                        1
Smaller caches              −           ++                           0
Large block size                        −            +               0
Large cache size            ++          −                            1
High associativity          ++          −                            1
Compiler optimizations                  −                            0
Multi-level cache                                    −               2
Prioritize reads                                     −               1

(− decreases the quantity, + increases it; HW complexity ranges from 0 = simplest to 2 = most complex)
Agenda
Single-Bank Cache Microarchitecture
Multi-Bank Cache Microarchitecture
Basic Optimizations
• Cache Examples •
Itanium-2 On-Chip Caches

Intel Itanium-2 (Intel/HP, 2002)
• Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
• Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
• Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency
IBM Power-7 On-Chip Caches [IBM 2009]
• 32KB L1 I$ per core, 32KB L1 D$ per core, 3-cycle latency
• 256KB unified L2$ per core, 8-cycle latency
• 32MB unified shared L3$ (embedded DRAM), 25-cycle latency to the local slice
Acknowledgements

Some of these slides contain material developed and copyrighted by: Arvind (MIT), Krste Asanović (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB). MIT material derived from course 6.823; UCB material derived from courses CS152 and CS252.