Massively Parallel Processor (MPP) Architectures

CS/ECE 757: Advanced Computer Architecture II
(Parallel Computer Architecture)

Scalable Multiprocessors: Case Studies (Chapter 7)

Copyright 2001 Mark D. Hill, University of Wisconsin-Madison

Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks! Several case-study slides are (C) 2003 J. E. Smith.

Massively Parallel Processor (MPP) Architectures

[Figure: MPP node. Processor, cache, network interface, and main memory sit on the memory bus; an I/O bridge connects an I/O bus with disk controllers and disks; the network attaches through the network interface.]

• Network interface typically close to processor
  – Memory bus:
    » locked to specific processor architecture/bus protocol
  – Registers/cache:
    » only in research machines
• Time-to-market is long
  – processor already available, or work closely with processor designers
• Maximize performance (and cost)


Network of Workstations

[Figure: workstation node. Processor and cache connect through the core chip set to main memory and to an I/O bus holding disk controllers, disks, a graphics controller, and the network interface; the NI raises interrupts and attaches to the network.]

• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP


Transaction Interpretation

• Simplest MP: assist doesn’t interpret much if anything
  – DMA from/to buffer, interrupt or set flag on completion
• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to network
  – may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)
• Cache coherence: even more so
  – stay tuned


Spectrum of Designs

(Moving down the spectrum: increasing HW support, specialization, intrusiveness, and performance (???))

• None: physical bit stream
  – physical DMA (nCUBE, iPSC, . . .)
• User/system
  – user-level port (CM-5, *T)
  – user-level handler (J-Machine, Monsoon, . . .)
• Remote virtual address
  – processing, translation (Paragon, Meiko CS-2, Myrinet)
  – reflective memory (?) (Memory Channel, SHRIMP)
• Global physical address
  – proc + memory controller (RP3, BBN, T3D, T3E)
• Cache-to-cache (later)
  – cache controller (Dash, KSR, Flash)


Net Transactions: Physical DMA

[Figure: on each node, DMA channels (addr, length, rdy registers) move data between memory and the network under a cmd register; the transaction carries data plus a dest field, and completion raises status/interrupt on the receiver.]

• Physical addresses: OS must initiate transfers
  – system call per message on both ends: ouch
• Sending OS copies data to kernel buffer w/ header/trailer (send path sketched below)
  – can avoid copy if interface does scatter/gather
• Receiver copies packet into OS buffer, then interrupts
  – user message then copied (or mapped) into user space
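A minimal sketch of the OS-mediated send path, assuming invented register names and a one-word header; a real driver is device- and OS-specific.

```c
#include <stdint.h>
#include <string.h>

struct dma_regs { volatile uint32_t addr, length, rdy; };

extern struct dma_regs *ni_dma;   /* memory-mapped NI DMA channel (assumed) */
static char kbuf[4096];           /* kernel message buffer                  */

/* Entered via a system call: one kernel crossing per message. */
int sys_net_send(uint32_t dest, const void *ubuf, uint32_t len)
{
    /* Copy user data into the kernel buffer behind a header; a
     * scatter/gather-capable interface could avoid this copy. */
    memcpy(kbuf, &dest, sizeof dest);            /* header: destination */
    memcpy(kbuf + sizeof dest, ubuf, len);       /* payload             */

    ni_dma->addr   = (uint32_t)(uintptr_t)kbuf;  /* assume 1:1 phys map */
    ni_dma->length = sizeof dest + len;
    ni_dma->rdy    = 1;                          /* start the transfer  */
    while (ni_dma->rdy) ;    /* or field a completion interrupt instead */
    return 0;
}
```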


Conventional LAN Network Interface

[Figure: NIC on the I/O bus, with a controller, transceiver (TX/RX), addr/len registers, and a DMA engine to host memory across the memory bus; host memory holds linked rings of descriptors (addr, len, status, next) for transmit and receive. The descriptor layout is sketched below.]
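A sketch of the descriptor-ring idiom such a NIC uses; the field widths, OWN-bit convention, and helper below are assumptions, not any specific device's layout.

```c
#include <stdint.h>

struct nic_desc {
    uint32_t addr;              /* physical address of the buffer */
    uint16_t len;               /* buffer length in bytes         */
    uint16_t status;            /* ownership bit, error flags, .. */
    struct nic_desc *next;      /* link to the next ring entry    */
};

#define DESC_OWN_NIC 0x8000     /* set by host, cleared by the NIC */

/* Host hands a buffer to the NIC by filling a descriptor and flipping
 * ownership; the controller DMAs the data and clears the OWN bit.
 * (A real driver would issue a memory barrier before the handoff.) */
static void post_tx(struct nic_desc *d, void *buf, uint16_t len)
{
    d->addr   = (uint32_t)(uintptr_t)buf;   /* assume 1:1 phys map */
    d->len    = len;
    d->status = DESC_OWN_NIC;
}
```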


User Level Ports

[Figure: network ports beside each processor and memory carry user/system data with a dest field; status and interrupts report arrivals.]

• map network hardware into user’s address space
  – talk directly to network via loads & stores (see the sketch below)
• user-to-user communication without OS intervention: low latency
• protection: user/user & user/system
• DMA hard… CPU involvement (copying) becomes bottleneck
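A minimal sketch of sending through a memory-mapped user-level port; the port layout and status bit are invented for illustration.

```c
#include <stdint.h>

struct net_port {                 /* mapped into the user address space */
    volatile uint32_t dest;       /* destination node                   */
    volatile uint32_t data;       /* store pushes a word into out FIFO  */
    volatile uint32_t status;     /* assumed: bit 0 = FIFO has space    */
};

static void user_send(struct net_port *p, uint32_t dst,
                      const uint32_t *msg, int nwords)
{
    p->dest = dst;                 /* plain stores, no system call */
    for (int i = 0; i < nwords; i++) {
        while (!(p->status & 1)) ; /* spin until the FIFO has space */
        p->data = msg[i];
    }
}
```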


Case Study: Thinking Machines CM5

• Follow-on to CM-2
  – Abandons SIMD, bit-serial processing
  – Uses off-the-shelf processors/parts
  – Focus on floating point
  – 32 to 1024 processors
  – Designed to scale to 16K processors
• Designed to be independent of specific processor node
• “Current” processor node
  – 40 MHz SPARC
  – 32 MB memory per node
  – 4 FP vector chips per node


CM5, contd.

• Vector Unit
  – Four FP processors
  – Direct connect to main memory
  – Each has a 1-word data path
  – FP unit can do scalar or vector
  – 128 MFLOPS peak; 50 MFLOPS Linpack


Interconnection Networks

• Two networks: Data and Control
• Network interface
  – Memory-mapped functions
  – Store data => data
  – Part of address => control
  – Some addresses map to privileged functions


Interconnection Networks

• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch

[Figure: CM-5 system organization. Diagnostics, control, and data networks span processing partitions, control processors, and an I/O partition. Node detail: SPARC + FPU with cache controller and SRAM, NI on the MBUS, vector units, and DRAM controllers with DRAM.]


Message Management

• Typical function implementation (protocol sketched below)
  – Each function has two FIFOs (in and out)
  – Two outgoing control registers: send and send_first
    » send_first initiates a message
    » send sends any additional data
  – read send_ok to check for a successful send; else retry
  – read send_space to check space prior to send
  – incoming register receive_ok can be polled
  – read receive_length_left for message length
  – read receive for input data in FIFO order
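A sketch of this protocol in C, using the register names from the slide; treating them as plain memory-mapped variables is an assumption.

```c
#include <stdint.h>

extern volatile uint32_t send_first, send, send_ok, send_space;
extern volatile uint32_t receive_ok, receive_length_left, receive;

static void cm5_send(uint32_t first_word, const uint32_t *rest, int n)
{
    do {
        send_first = first_word;   /* initiates the message              */
        for (int i = 0; i < n; i++)
            send = rest[i];        /* push any additional data           */
    } while (!send_ok);            /* send failed: retry the whole message */
}

static int cm5_poll_receive(uint32_t *buf)
{
    if (!receive_ok) return 0;     /* nothing pending                    */
    int n = (int)receive_length_left;
    for (int i = 0; i < n; i++)
        buf[i] = receive;          /* drain words in FIFO order          */
    return n;
}
```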


User Level Network Ports

[Figure: net output and input ports appear in the processor’s virtual address space alongside status registers and the program counter.]

• Appears to user as logical message queues plus status
• What happens if no user pop?


Control Network

• In general, every processor participates
  – A mode bit allows a processor to abstain
• Broadcasting
  – 3 functional units: broadcast, supervisor broadcast, interrupt
    » only 1 broadcast at a time
    » broadcast message is 1 to 15 32-bit words
• Combining
  – Operation types:
    » reduction
    » parallel prefix
    » parallel suffix
    » router-done (big OR)
  – Combine message is a 32- to 128-bit integer
  – Operator types: OR, XOR, max, signed add, unsigned add
  – Operation and operator are encoded in the send_first address (see the sketch below)
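A hypothetical global sum via the combine function. The port offset and the operation/operator encoding below are invented; only the overall shape (operands in through send_first, combined result polled back) follows the slide.

```c
#include <stdint.h>

#define COMBINE_PORT    0x800u        /* assumed combine port offset  */
#define OP_REDUCE       (0u << 6)     /* operation field (invented)   */
#define OPR_SIGNED_ADD  (3u << 4)     /* operator field (invented)    */

extern volatile uint32_t *cn_ports;   /* mapped control-network ports */
extern volatile uint32_t cn_receive_ok, cn_receive;

uint32_t global_sum(uint32_t my_value)
{
    /* Every participating processor contributes one operand; the
     * network combines them and returns the result to all nodes. */
    cn_ports[(COMBINE_PORT | OP_REDUCE | OPR_SIGNED_ADD) >> 2] = my_value;
    while (!cn_receive_ok) ;          /* poll for the combined result */
    return cn_receive;
}
```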


Control Network, contd.

• Global operations
  – Big OR of 1 bit from each processor
  – three independent units: one synchronous, two asynchronous
  – synchronous unit useful for barrier synch
  – asynchronous units useful for passing error conditions independent of other control-unit functions
• Individual instructions are not synchronized
  – each processor fetches its own instructions
  – barrier synch via the control network is used between instruction blocks
  – => support for a loose form of data parallelism


Data Network

• Architecture
  – Fat tree
  – Packet switched
  – Bandwidth to neighbor: 20 MB/sec
  – Latency in large network: 3 to 7 microseconds
  – Can support message passing or global address space
• Network interface
  – One data network functional unit
  – send_first gets destination address + tag
  – 1 to 5 32-bit chunks of data
  – tag can cause interrupt at receiver
• Addresses may be physical or relative
  – physical addresses are privileged
  – relative address is bounds-checked and translated


Data Network, contd.

• Typical message interchange (sketched below):
  – alternate between pushing onto the send FIFO and receiving on the receive FIFO
  – once all messages are sent, assert this with a combine function on the control network
  – alternate between receiving more messages and testing the control network
  – when router_done goes active, the message-passing phase is complete
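The interchange as a sketch; every helper is a placeholder for the FIFO or control-network operation named above.

```c
extern int  have_more_to_send(void);
extern void push_next_message(void);   /* push onto the send FIFO          */
extern int  try_receive(void);         /* drain one message, if present    */
extern void assert_done_combine(void); /* big-OR "I am done" (control net) */
extern int  router_done(void);         /* active once all nodes are done   */

void message_phase(void)
{
    while (have_more_to_send()) {      /* interleave sends and receives so */
        push_next_message();           /* the network never backs up       */
        try_receive();
    }
    assert_done_combine();             /* announce completion              */
    while (!router_done())             /* messages may still be in flight  */
        try_receive();
}
```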


User Level Handlers

[Figure: like user-level ports, but each transaction carries user/system data, a dest field, and an address that selects the handler; ports sit beside each processor and memory.]

• Hardware support to vector to address specified in message
  – message ports in registers
  – alternate register set for handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)


J-Machine

• Each node a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task


J-Machine Message-Driven Processor

• MIT research project
• Targets fine-grain message passing
  – very low message overheads allow:
    » small messages
    » small tasks
• J-Machine architecture
  – 3D mesh direct interconnect
  – Global address space
  – up to 1 Mword of DRAM per node
  – MDP single chip supports:
    » processing
    » memory control
    » message passing
  – not an off-the-shelf chip


Features/Example: Combining Tree

• Each node collects data from lower levels, accumulates the sum, and passes the sum up the tree when all inputs are done (see the sketch below)
• Communication
  – SEND instructions send values up the tree in small messages
  – On arrival, a task is created to perform the COMBINE routine
• Synchronization
  – message arrival dispatches a task
  – example: a combine message invokes the COMBINE routine
  – presence bits (full/empty) on state
  – value set empty; reads are blocked until it becomes full
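A sketch of one combining-tree node; the structure and SEND wrapper are illustrative, with a pending counter standing in for the presence-bit (full/empty) synchronization.

```c
#include <stdint.h>

struct combine_node {
    int32_t sum;      /* accumulated so far ("full" once all inputs arrive) */
    int     pending;  /* children that have not reported yet                */
    int     parent;   /* node address of the parent; -1 at the root         */
};

extern void send_combine(int dest_node, int32_t value);  /* SEND wrapper */

/* Dispatched as a task when a combine message arrives. */
void combine_handler(struct combine_node *n, int32_t child_value)
{
    n->sum += child_value;
    if (--n->pending == 0 && n->parent >= 0)
        send_combine(n->parent, n->sum);   /* pass the sum up the tree */
}
```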


Features/Example, contd.

• Naming
  – segmented memory, and translation instructions
  – protection via segment descriptors
  – allocate a segment of memory for each combining node


Instruction Set

• Separate register sets for three execution levels:
  – background: no messages
  – priority 0: lower-priority messages
  – priority 1: higher-priority messages
• Each register set:
  – 4 GPRs
  – 4 address registers: contain segment descriptors (base + field length)
  – 4 ID registers: P0 and P1 only
  – instruction pointer: includes status info
• Addressing
  – an address is an offset + address register


Instruction Set, contd.

• Tags
  – 4-bit tag per 32-bit data value
  – data types, e.g. integer, boolean, symbol
  – system types, e.g. IP, addr, msg
  – 4 user-defined types
  – full/empty types: Fut and Cfut
  – interrupt on access to a Cfut value
• MDP can do operand type checking and interrupt on mismatch
  – => program does not have to do software checking
• Instructions
  – 17-bit format (two per word)
  – three operands
  – words not tagged as instructions are loaded into R0 automatically


Instruction Set, contd.

[Figure: MDP instruction formats.]

Instruction Set, contd.

• Naming
  – ENTER instruction places a translation into the TLB
  – XLATE instruction does a TLB translation
• Communication
  – SEND instructions
  – Example:
    » SEND R0,0           ; send net address, priority 0
    » SEND2 R1,R2,0       ; send two data words
    » SEND2E R3,[3,A3],0  ; send two more words, end message


Instruction Set, contd.

• Task scheduling
  – Message at head of queue causes the MDP to create a task
  – opcode loaded into IP
  – length field and message head loaded into A3
  – takes three cycles
  – low-latency messages are executed directly
  – others specify a handler that locates the required method and transfers control to it


Network Architecture

• 3D mesh; up to 64K nodes
• No torus => faces used for I/O
• Bidirectional channels
  – channels can be turned around on alternate cycles
  – 9 bits data + 6 bits control => 9-bit phit
  – 2 phits per flit (granularity of flow control)
  – each channel: 288 Mbps
  – bisection bandwidth (1024 nodes): 18.4 Gbps
• Synchronous routers
  – clocked at 2X processor clock
  – => 9-bit phit per 62.5 ns
  – messages route at 1 cycle per hop


Dedicated Message Processing Without Specialized Hardware

[Figure: each node pairs a user compute processor P with a system-level message processor MP, sharing memory and an NI on the node bus; transactions travel between nodes over the network.]

• Msg processor performs arbitrary output processing (at system level)
• Msg processor interprets incoming network transactions (in system)
• User processor and msg processor share memory
• Msg processor talks to remote msg processors via system network transactions


Levels of Network Transaction

[Figure: the same node organization as above: memory, NI, user processor P, and system-level message processor MP on each node.]

• User processor stores cmd / msg / data into shared output queue
  – must still check for output queue full (or grow dynamically)
• Communication assists make the transaction happen
  – checking, translation, scheduling, transport, interpretation
• Avoid system call overhead
• Multiple bus crossings likely bottleneck


Example: Intel Paragon

[Figure: Paragon node. Compute processor P and message processor MP (each an i860XP: 50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) share a 64-bit, 400 MB/s memory bus with memory (holding rte, MP handler, and var data queues) and an NI with 2048 B buffering plus send/receive DMA (sDMA/rDMA); 175 MB/s duplex links into the network; a service network connects I/O nodes and their devices.]


Dedicated MP w/specialized NI: Meiko CS-2

• Integrate message processor into network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal OS, restart
    » meanwhile, nack the other node
• Problem: processor is slow, time-slices threads
  – fundamental issue with building your own CPU


Myricom Myrinet (Berkeley NOW)

• Programmable network interface on I/O bus (Sun SBus or PCI)
  – embedded custom CPU (“LANai”, ~40 MHz RISC CPU)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
  – includes source-based routing protocol
• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too… w/ OS help
  – poll to check for sends from user
• Bottom line: I/O bus still the bottleneck, CPU could be faster


Shared Physical Address Space

• Implement SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local
  – programming paradigm looks more like message passing than Pthreads
    » yet low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes
• Implementation (sketched below):
  – “pseudo-memory” acts as memory controller for remote mem, converts accesses to network transactions (requests)
  – “pseudo-CPU” on remote node receives requests, performs them on local memory, sends replies
  – split-transaction or retry-capable bus required (or dual-ported mem)
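A sketch of the pseudo-memory / pseudo-CPU pair; the request format and helper functions are invented for illustration.

```c
#include <stdint.h>

struct read_req { uint64_t addr; int src_node; };

extern void     net_send(int node, const void *msg, int len);
extern uint64_t local_mem_read(uint64_t addr);

/* Pseudo-memory: fields a bus read whose address decodes to a remote
 * node and converts it into a network request transaction. The bus
 * must be split-transaction (or support retry) while the reply is
 * outstanding. */
void pseudo_mem_read(uint64_t global_addr, int home_node, int my_node)
{
    struct read_req r = { global_addr, my_node };
    net_send(home_node, &r, sizeof r);
}

/* Pseudo-CPU: on the home node, performs the access against local
 * memory and sends the reply back to the requester. */
void pseudo_cpu_handle(const struct read_req *r)
{
    uint64_t data = local_mem_read(r->addr);
    net_send(r->src_node, &data, sizeof data);
}
```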


Case Study: Cray T3D

• Processing element nodes
• 3D torus interconnect
• Wormhole routing
• PE numbering
• Local memory
• Support circuitry
• Prefetch
• Messaging
• Barrier synch
• Fetch and inc.
• Block transfers


Processing Element Nodes

• Two processors per node
• Shared block transfer engine
  – DMA-like transfer of large blocks of data
• Shared network interface/router
• Synchronous 6.67 ns clock


Communication Links

• Signals:
  – Data: 16 bits
  – Channel control: 4 bits
    » request/response, virtual-channel buffer
  – Channel acknowledge: 4 bits
    » virtual-channel buffer status


Routing

• Dimension-order routing
  – may go in either + or - direction along a dimension
• Virtual channels
  – four virtual-channel buffers per physical channel
  – => two request channels, two response channels
• Deadlock avoidance (see the sketch below)
  – in each dimension, specify a "dateline" link
  – packets that do not cross the dateline use virtual channel 0
  – packets that cross the dateline use virtual channel 1
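A sketch of the dateline rule, simplified to one dimension and the + direction, with the dateline treated as a node index rather than a link.

```c
/* Choose the virtual channel for a packet travelling the + ring of one
 * dimension: crossing the dateline forces it onto channel 1, which
 * breaks the cyclic buffer dependence that causes deadlock. */
int pick_virtual_channel(int src, int dst, int dateline, int ring_size)
{
    for (int n = src; n != dst; n = (n + 1) % ring_size)
        if (n == dateline)
            return 1;     /* crossed the dateline: use VC 1 */
    return 0;             /* never crossed: stay on VC 0    */
}
```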


Packets

• Size: 8 physical units (phits)
  – 16 bits per phit
• Header:
  – routing info
  – destination processor
  – control info
  – source processor
  – memory address
• Read request
  – header: 6 phits
• Read response
  – header: 6 phits
  – body: 4 phits (1 word) or 16 phits (4 words)


Processing Nodes

• Processor: Alpha 21064
• Support circuitry:
  – address interpretation
  – reads and writes (local/non-local)
  – data prefetch
  – messaging
  – barrier synch.
  – fetch and incr.
  – status


Address Interpretation

• T3D needs:
  – 64 MB memory per node => 26 bits
  – up to 2048 nodes => 11 bits
• Alpha 21064 provides:
  – 43-bit virtual address
  – 32-bit physical address (+2 bits for memory-mapped devices)
• => Annex registers in the DTB
  – external to the Alpha
  – map the 32-bit address onto 48 bits: node + 27-bit address
  – 32 annex registers
  – annex registers also contain function info, e.g. cache / non-cache accesses
  – DTB modified via load-locked / store-conditional instructions


Reads and Writes

• Reads
  – note: no hardware cache coherence
  – may be cacheable or non-cacheable
  – two types of non-cacheable:
    » normal and atomic swap
    » swap uses the "swaperand" register
  – three types of cacheable reads:
    » normal, atomic swap, cached readahead
    » cached readahead buffers and pre-reads 4-word blocks of data
• Writes
  – Alpha uses a write-through, read-allocate cache
  – write counter to help with consistency:
    » increments on write request
    » decrements on ack


Data Prefetch

• Architected prefetch queue
  – 1 word wide by 16 deep
• Prefetch instruction:
  – Alpha prefetch hint => T3D prefetch
• Performance
  – allows multiple outstanding read requests
  – (normal 21064 reads are blocking)


Messaging

• Message queues
  – 256 KB reserved space in local memory
  – => 4080 message packets + 16 overflow locations
• Sending processor:
  – uses Alpha PAL code
  – builds message packets of 4 words
  – plus address of receiving node
• Receiving node
  – stores message
  – interrupts processor
  – processor sends an ack
  – processor may execute routine at address provided by message (active messages)
  – if message queue is full, a NACK is sent
  – also, error messages may be generated by support circuitry


Barrier Synchronization

• For barrier or eureka
• Hardware implementation
  – hierarchical tree
  – bypasses in tree to limit its scope
  – masks for barrier bits to further limit scope
  – interrupting or non-interrupting


Fetch and Increment

• Special hardware registers
  – 2 per node
  – user accessible
  – used for auto-scheduling tasks, often loop iterations (see the sketch below)

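The classic use is self-scheduling loop iterations across processors; fetch_inc() stands in for a read of the hardware register.

```c
extern unsigned fetch_inc(void);  /* atomically returns the counter, then +1 */
extern void do_iteration(unsigned i);

/* Every processor runs this loop; the shared counter hands each
 * iteration index to exactly one processor. */
void self_scheduled_loop(unsigned n)
{
    for (unsigned i = fetch_inc(); i < n; i = fetch_inc())
        do_iteration(i);
}
```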


Block Transfer

• Special block transfer engine
  – does DMA transfers
  – can gather/scatter among processing elements
  – up to 64K packets
  – 1 or 4 words per packet
• Types of transfers
  – constant-stride read/write
  – gather/scatter
• Requires system call
  – for memory protection => big overhead


Cray T3E

• T3D post mortem
• T3E overview
• Global communication
• Synchronization
• Message passing
• Kernel performance


T3D Post Mortem

• Very high performance interconnect
  – 3D torus worthwhile
• Barrier network "overengineered"
  – barrier synch not a significant fraction of runtime
• Prefetch queue useful; should be more of them
• Block transfer engine not very useful
  – high overhead to set up
  – yet another way to access memory
• DTB Annex difficult to use in general
  – one entry might have sufficed
  – every processor must have the same mapping for a physical page


T3E Overview

• Alpha 21164 processors
• Up to 2 GB per node
• Caches
  – 8K L1 and 96K L2 on-chip
  – supports 2 outstanding 64-byte line fills
  – stream buffers to enhance the cache
  – only local memory is cached
  – => hardware cache coherence straightforward
• 512 (user) + 128 (system) E-registers for communication/synchronization
• One router chip per processor


T3E Overview, contd.

• Clock:
  – system (i.e. "shell") logic at 75 MHz
  – processor at some multiple (initially 300 MHz)
• 3D torus interconnect
  – bidirectional links
  – adaptive multiple-path routing
  – links run at 300 MHz


Global Communication: E-registers

• Extend address space
• Implement "centrifuging" function
• Memory-mapped (in I/O space)
• Operations:
  – loads/stores between E-registers and processor registers
  – global E-register operations
    » transfer data to/from global memory
    » messaging
    » synchronization


Global Address Translation

• E-reg block holds base and mask
  – previously stored there as part of setup
• Remote memory access (memory-mapped store):
  – data bits: E-reg pointer (8) + address index (50)
  – address bits: command + src/dest E-reg


Global Address Translation, contd.

[Figure: the global address translation datapath.]

Address Translation, contd.

• Translation steps (the centrifuge step is sketched below)
  – address index centrifuged with mask => virtual address + virtual PE
  – offset added to base => vseg + seg offset
  – vseg translated => gseg + base PE
  – base PE + virtual PE => logical PE
  – logical PE through lookup table => physical routing tag
• GVA: gseg (6) + offset (32)
  – goes out over the network to the physical PE
    » at the remote node, global translation => physical address
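A bit-serial sketch of the centrifuge: bits of the address index selected by the mask gather into the virtual PE number, the remaining bits into the virtual address.

```c
#include <stdint.h>

void centrifuge(uint64_t index, uint64_t mask,
                uint64_t *virt_pe, uint64_t *virt_addr)
{
    uint64_t pe = 0, addr = 0;
    int pe_bit = 0, addr_bit = 0;
    for (int b = 0; b < 64; b++) {
        uint64_t bit = (index >> b) & 1;
        if ((mask >> b) & 1)
            pe   |= bit << pe_bit++;    /* mask bit set: PE field      */
        else
            addr |= bit << addr_bit++;  /* mask bit clear: address bit */
    }
    *virt_pe = pe;
    *virt_addr = addr;
}
```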


Global Communication: Gets and Puts

• Get: global read
  – word (32 or 64 bits)
  – vector (8 words, with stride)
  – stride in access E-reg block
• Put: global store
• Full/empty synchronization on E-regs
• Gets/puts are pipelined
  – up to 480 MB/sec transfer rate between nodes


Synchronization

• Atomic ops between E-regs and memory
  – Fetch & Inc
  – Fetch & Add
  – Compare & Swap
  – Masked Swap
• Barrier/Eureka synchronization
  – 32 BSUs per processor
  – accessible as memory-mapped registers
  – BSUs have states and operations
  – state transition diagrams
  – barrier/eureka trees are embedded in the 3D torus
  – use highest-priority virtual channels


Message Passing

• Message queues
  – arbitrary number
  – created in user or system mode
  – mapped into memory space
  – up to 128 MB
• Message Queue Control Word in memory (fields sketched below)
  – tail pointer
  – limit value
  – threshold triggers interrupt
  – signal bit set when interrupt is triggered
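The control word's fields as accessor macros over one 64-bit memory word; the slide names the fields only, so the positions and widths below are assumptions.

```c
#include <stdint.h>

#define MQCW_TAIL(w)       ((uint32_t)((w) & 0x7FFFFFFu))          /* bits 0-26  */
#define MQCW_LIMIT(w)      ((uint32_t)(((w) >> 27) & 0x7FFFFFFu))  /* bits 27-53 */
#define MQCW_THRESHOLD(w)  ((uint32_t)(((w) >> 54) & 0x1FFu))      /* bits 54-62 */
#define MQCW_SIGNAL(w)     ((uint32_t)((w) >> 63))                 /* bit 63     */

/* The signal bit is set when the threshold-crossing interrupt fires;
 * software clears it after servicing the queue. */
int queue_needs_service(uint64_t mqcw)
{
    return MQCW_SIGNAL(mqcw) != 0;
}
```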

• Message send
  – from a block of 8 E-regs
  – Send command, similar to a put
• Queue head management done in software
  – swap allows managing the queue in two segments


T3E Summary

• Messaging is done well...
  – within constraints of a COTS processor
• Relies more on high communication and memory bandwidth than on caching
  – => lower perf on dynamic irregular codes
  – => higher perf on memory-intensive codes with large communication
• Centrifuge: probably unique
• Barrier synch uses dedicated hardware but NOT a dedicated network


DEC Memory Channel (Princeton SHRIMP)

[Figure: a send region in the sender's virtual address space maps through physical pages on both nodes to a receive region in the receiver's physical memory and virtual address space.]

• Reflective memory
• Writes on sender appear in receiver's memory
  – send & receive regions
  – page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers


Performance of Distributed Memory Machines

• Microbenchmarking
• One-way latency of small (five-word) message
  – echo test (sketched below)
  – round-trip time divided by 2
• Shared memory remote read
• Message-passing operations
  – see text
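A sketch of the echo test; msg_send/msg_recv are placeholders for whichever transport is being measured.

```c
#include <time.h>

extern void msg_send(int node, const void *buf, int len);
extern void msg_recv(int node, void *buf, int len);

/* One-way latency estimated as half the averaged round-trip time. */
double one_way_latency_us(int peer, int iters)
{
    char msg[20] = {0};                       /* five 32-bit words */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        msg_send(peer, msg, sizeof msg);      /* ping */
        msg_recv(peer, msg, sizeof msg);      /* pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return us / iters / 2.0;                  /* round-trip / 2 */
}
```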


Network Transaction Performance

[Figure 7.31: bar chart, microseconds (0 to 14) of gap, L, Or, and Os for CM-5, Paragon, CS-2, NOW, and T3D.]


Remote Read Performance

[Figure 7.32: bar chart, microseconds (0 to 25) of gap, L, and issue for CM-5, Paragon, CS-2, NOW, and T3D.]


Summary of Distributed Memory Machines

• Convergence of architectures
  – everything “looks basically the same”
  – processor, cache, memory, communication assist
• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?
• Network transaction
  – input & output buffering
  – action on remote node
