Switching Techniques, Adaptive Routing and Deadlock ... - IEEE

Parallel Architectures Group Grupo de Arquitecturas Paralelas (GAP)

Switching Techniques, Adaptive Routing and Deadlock Handling in Interconnection Networks

Jose Duato
Dept. de Ingeniería de Sistemas, Computadores y Automática
Universidad Politécnica de Valencia, Spain


Outline

- Introduction
- Switching techniques
- Optimized switching techniques
- Deadlock handling
- Theory of deadlock avoidance
- Design methodologies
- Application to deadlock recovery
- Application to networks of workstations
- Performance evaluation


Introduction (From W. J. Dally)

- The performance of most digital systems today is limited by their communication or interconnection, not by their logic or memory.
- Most of the power is used to drive wires, and most of the clock cycle is spent on wire delay, not gate delay.
- As technology improves, pin density and wiring density are scaling at a slower rate than the components themselves. Also, the frequency of communication between components is lagging far behind the clock rates of modern processors.
- These factors combine to make interconnection the key factor in the success of future digital systems.

Introduction (From W. J. Dally)

- As designers strive to make more efficient use of scarce interconnection bandwidth, interconnection networks are emerging as a nearly universal solution to the system-level communication problems for modern digital systems.
- Originally developed for the demanding communication requirements of multicomputers, interconnection networks are beginning to replace buses as the standard system-level interconnection.
- Interconnection networks are also replacing dedicated wiring in special-purpose systems as designers discover that routing packets is both faster and more economical than routing wires.

Introduction

Interconnection networks are currently being used for many different applications, ranging from internal buses in VLSI circuits to wide area computer networks. These applications include:

- System area networks
- Telephone switches
- Internal networks for ATM switches
- Processor/memory interconnects for vector supercomputers
- Interconnects for multicomputers
- Interconnects for distributed shared-memory multiprocessors
- Clusters of workstations
- Computer networks: local area networks, metropolitan area networks, wide area networks

Introduction

- Parallel computers should be designed using commodity components to be cost-effective.
- Unfortunately, commodity communication subsystems have been designed to meet a different set of requirements, i.e., those arising in computer networks.
- Designing high-performance interconnection networks becomes a critical issue to exploit the performance of parallel computers.
- Most manufacturers designed custom interconnection networks.
- Recently, several high-performance switches have been developed to build inexpensive parallel computers by connecting cost-effective computers through those switches.

Main design parameters

- Topology: Defines how the nodes are interconnected by channels.
  - Direct networks, switch-based networks
- Routing algorithm: Determines the path selected by a message to reach its destination.
  - Deterministic routing, adaptive routing
- Switching technique: Determines how and when buffers are reserved and switches are configured.
  - Packet switching, circuit switching, wormhole, virtual cut-through

Interconnection Networks

- Shared-Medium Networks
  - Local Area Networks
    - Contention Bus (Ethernet)
    - Token Bus (Arcnet)
    - Token Ring (FDDI Ring, IBM Token Ring)
  - Backplane Bus (Sun Gigaplane, DEC AlphaServer8X00, SGI PowerPath-2)
- Direct Networks (Router-Based Networks)
  - Strictly Orthogonal Topologies
    - Mesh
      - 2-D Mesh (Intel Paragon)
      - 3-D Mesh (MIT J-Machine)
    - Torus (k-ary n-cube)
      - 1-D Unidirectional Torus or Ring (KSR first-level ring)
      - 2-D Bidirectional Torus (Intel/CMU iWarp)
      - 3-D Bidirectional Torus (Cray T3D, Cray T3E)
    - Hypercube (Intel iPSC, nCUBE)
  - Other Topologies: Trees, Cube-Connected Cycles, de Bruijn, Star Graphs, etc.
- Indirect Networks (Switch-Based Networks)
  - Regular Topologies
    - Crossbar (Cray X/Y-MP, DEC GIGAswitch, Myrinet)
    - Multistage Interconnection Networks
      - Blocking Networks
        - Unidirectional MIN (NEC Cenju-3, IBM RP3)
        - Bidirectional MIN (IBM SP, TMC CM-5)
      - Nonblocking Networks: Clos Network
  - Irregular Topologies (DEC Autonet, Myrinet, ServerNet)
- Hybrid Networks
  - Multiple-Backplane Buses (Sun XDBus)
  - Hierarchical Networks (Bridged LANs, KSR)
  - Cluster-Based Networks (Stanford DASH, HP/Convex Exemplar)
  - Other Hypergraph Topologies: Hyperbuses, Hypermeshes, etc.

Direct networks

[Figure: (a) 2-ary 4-cube (hypercube); (b) 3-ary 2-cube; (c) 3-ary 3-D mesh]
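The three direct topologies named in the figure differ only in the radix k, the number of dimensions n, and whether wraparound links exist. The following is a minimal sketch of how neighbor sets could be computed; the function names and the tuple-of-digits node encoding are assumptions made for illustration, not notation from the slides.

```python
def torus_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube.

    `node` is a tuple of n digits, each in range(k). Every dimension has
    wraparound links, so coordinates are incremented modulo k.
    """
    result = set()
    for dim, digit in enumerate(node):
        for step in (-1, 1):
            moved = list(node)
            moved[dim] = (digit + step) % k
            result.add(tuple(moved))
    result.discard(node)  # only relevant for the degenerate case k == 1
    return result


def mesh_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-dimensional mesh (no wraparound)."""
    result = set()
    for dim, digit in enumerate(node):
        for step in (-1, 1):
            if 0 <= digit + step < k:
                moved = list(node)
                moved[dim] = digit + step
                result.add(tuple(moved))
    return result
```

With k = 2 the two directions of a dimension reach the same node, so a 2-ary 4-cube node has 4 neighbors, which is the hypercube case. In the mesh, the node degree depends on position: corner nodes have fewer neighbors than interior nodes.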


Multistage interconnection networks

[Figure: a multistage butterfly network and an Omega network, each with 16 ports labeled 0000 to 1111]

Switch-based irregular topologies

[Figure: a switch-based network with bidirectional links, switches 0 to 7, and attached processing elements, shown next to its graph representation]

Generalized MIN model

[Figure: N ports on one side and M ports on the other, connected through stages G0, G1, ..., Gg-1 and connection patterns C0, C1, ..., Cg]

Unified view

- Some manufacturers developed switches that are suitable to implement either direct or indirect networks (Inmos C104, SGI SPIDER).
- We can view networks using point-to-point links as a set of interconnected switches, each one connected to zero, one, or more nodes:
  - Direct networks correspond to the case where every switch is connected to a single node.
  - Crossbar networks correspond to the case where there is a single switch connected to all the nodes.
  - Multistage interconnection networks correspond to the case where switches are arranged into several stages and the switches in intermediate stages are not connected to any processor.

Router organization

[Figure: router with input channels, output channels, an injection channel, and an ejection channel, each with a link controller (LC), connected through a switch that is driven by a routing and arbitration unit]

Switching

- Switching: Determines how and when buffers are reserved and switches are configured.
- Flow control: Synchronization protocol for transmitting and receiving a unit of information.
- Unit of flow control: Portion of the message whose transfer must be synchronized.
- Flow control occurs at two levels: message flow control and physical channel flow control.

Packet switching and circuit switching

[Figure: time-space diagrams (channel vs. time) for packet switching and for circuit switching]

Virtual cut-through and wormhole switching

[Figure: time-space diagram (channel vs. time) of a message made of a header flit (H), data flits (D), and a tail flit (T)]

Virtual channels

[Figure: a physical channel shared by several virtual channels, with flit buffers at both ends, a channel multiplexor (from switch), a channel demultiplexor (to switch), and a virtual channel controller]

Performance of switching techniques

- Packet switching is well suited for very short messages.
- Circuit switching is well suited for very long messages.
- Virtual cut-through switching is well suited for messages of any length but requires splitting messages into fixed-size packets.
- Wormhole switching is well suited for messages of any length but saturates at moderate loads. Virtual channels alleviate this situation.
- Wormhole switching has been preferred for electronic routers because buffers can be small and the resulting circuits are compact and fast.
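To make the role of message length concrete, here is a first-order, contention-free latency sketch. It is not taken from the slides: the unit costs (`t_route`, `t_flit`), the probe-plus-acknowledgment set-up for circuit switching, and the function names are all simplifying assumptions, and the model ignores buffer occupancy and saturation, which are the effects the slide actually compares.

```python
def store_and_forward(hops, flits, t_route=1, t_flit=1):
    """Packet switching: the whole packet is buffered at every hop."""
    return hops * (t_route + flits * t_flit)


def cut_through(hops, flits, t_route=1, t_flit=1):
    """Virtual cut-through or wormhole without contention.

    Only the header pays the per-hop routing cost; the remaining
    flits follow it in a pipeline.
    """
    return hops * (t_route + t_flit) + (flits - 1) * t_flit


def circuit(hops, flits, t_route=1, t_flit=1):
    """Circuit switching: a probe sets up the path, an acknowledgment
    returns, and then the data is streamed over the reserved circuit."""
    setup = hops * (t_route + t_flit) + hops * t_flit
    return setup + flits * t_flit
```

Under these assumptions, for 8 hops a 16-flit message takes 136, 31, and 40 cycles respectively, while a 1024-flit message takes 8200, 1039, and 1048 cycles: the circuit set-up cost becomes negligible for long messages, whereas the store-and-forward cost grows with the product of distance and length.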

Optimized switching techniques

- Traffic from real applications may be bimodal and may vary over time.
- Wormhole switching can be used for short messages.
- Circuit switching can be used for very long messages.
- Path set-up can be overlapped with useful computation and/or circuits can be reused.
- Physical circuits do not need buffers at intermediate routers and can be made much faster than conventional links, either by using wave pipelining or optical technology.

Optimized router organization

[Figure: router combining a wormhole switch S0 (wormhole control unit) with pipelined circuit switches S1 to Sk (PCS control unit); the figure also shows input and output channels, pipelined input and output channels with synchronizers, control channels, multiplexors, and the ports from/to the local processor]

Performance for multimedia applications

[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 2. Curves: CS 28+4, WSNR 16+16, WH. 10% short messages (16 flits), 90% long messages (1024 flits).]

[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 3. Curves: CS 28+4, WSNR 16+16, WH. 10% short messages (16 flits), 90% long messages (1024 flits).]

[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 4. Curves: CS 28+4, WSNR 16+16, WH. 10% short messages (16 flits), 90% long messages (1024 flits).]

[Figure: average latency (cycles) vs. long-message traffic (1024 flits long, 90%). Curves: WS 16+16, WH, WH 2 VC, WH 3 VC. Only long messages are shown.]

[Figure: average latency (cycles) vs. short-message traffic (16 flits long, 10%), traffic axis from 0 to 0.03. Curves: WS 16+16, WH, WH 2 VC, WH 3 VC. Only short messages are shown.]

[Figure: average latency (cycles) vs. short-message traffic (16 flits long, 10%), traffic axis from 0 to 4.0e-4. Curves: WS 16+16, WH, WH 2 VC, WH 3 VC. Only short messages are shown.]

[Figure: average latency (cycles) vs. traffic for 256 Gbps link bandwidth, 10% short messages (16 flits) and 90% long messages (1024 flits). Curves: 32, 16, 8, 4, 2, and 1 slots.]

[Figure: average latency vs. traffic for 256 Gbps link bandwidth, 40% short messages (16 flits) and 60% long messages (1024 flits). Curves: 32, 16, 8, 4, 2, and 1 slots.]

Routing Algorithms

- Number of Destinations: Unicast Routing, Multicast Routing
- Routing Decisions: Centralized Routing, Source Routing, Distributed Routing, Multiphase Routing
- Implementation: Table Lookup, Finite-State Machine
- Adaptivity: Deterministic Routing, Adaptive Routing
- Progressiveness: Progressive, Backtracking
- Minimality: Profitable, Misrouting
- Number of Paths: Complete, Partial

Situations that may prevent packet delivery

Undeliverable Packets
- Deadlock: Prevention, Avoidance, Recovery
- Livelock: Minimal Paths, Restricted Nonminimal Paths, Probabilistic Avoidance
- Starvation: Resource Assignment Scheme

Deadlock handling

- Deadlock prevention: Backtracking
- Deadlock avoidance: Acyclic graph, acyclic subgraph
- Regressive deadlock recovery: Message removal, message abortion
- Progressive deadlock recovery: Disha

Main goal: Design of efficient deadlock-free fully adaptive routing algorithms

Deadlocked configuration

[Figure: four nodes N0 to N3 whose buffers are full of flits labeled 0 to 3]

Messages wait for resources held by other messages in a cyclic way => Removing cyclic dependencies will avoid deadlock

Allowing cyclic dependencies

Example for the unidirectional ring:
- cAi channels can be used to forward messages to all the destinations.
- cHi channels can only be used if the destination is higher than the current node.

[Figure: unidirectional ring with nodes n0 to n3, channels cA0 to cA3, and channels cH0 to cH2]

- There exist cyclic dependencies between cAi channels.
- However, cHi channels have no cyclic dependencies.
- There is no deadlock because messages waiting for resources can always escape by using cHi channels.
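The two dependency claims in the ring example can be checked mechanically. The sketch below encodes the routing function as stated on the slide for a 4-node ring, builds the direct channel dependency graph, and looks for cycles. The channel encoding as `("A", i)` and `("H", i)` tuples and all function names are illustrative assumptions. The check covers only the two statements about cycles; it does not verify the remaining conditions of the theory (escape-path connectivity and indirect dependencies).

```python
N = 4  # nodes n0..n3 of the unidirectional ring


def route(current, dest):
    """Channels offered at node `current` for a message addressed to `dest`."""
    if current == dest:
        return []
    allowed = [("A", current)]               # cAi: usable for every destination
    if dest > current and current < N - 1:   # cHi: only if destination is higher
        allowed.append(("H", current))       # (n3 has no H channel)
    return allowed


def dependency_graph(usable):
    """Direct dependencies c -> c2, restricted to channels accepted by `usable`.

    c -> c2 exists if some message may hold c at one node and then
    request c2 at the next node of the ring.
    """
    graph = {}
    for node in range(N):
        nxt = (node + 1) % N
        for dest in range(N):
            for c in route(node, dest):
                if not usable(c):
                    continue
                graph.setdefault(c, set())
                for c2 in route(nxt, dest):
                    if usable(c2):
                        graph[c].add(c2)
    return graph


def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(v):
        color[v] = GREY
        for w in graph.get(v, ()):
            state = color.get(w, WHITE)
            if state == GREY or (state == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color.get(v, WHITE) == WHITE and visit(v) for v in list(graph))
```

Calling `has_cycle(dependency_graph(lambda c: True))` reports a cycle (through cA0, cA1, cA2, cA3), while restricting the graph to H channels with `lambda c: c[0] == "H"` yields an acyclic chain cH0, cH1, cH2.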


Theory of deadlock avoidance (informal)

[Figure: interconnection network]

Adaptive routing function and selection function

[Figure: routing function and selection function, with inputs nc (current node) and nd (destination node)]

Routing subfunction

- Network channels can be split into two subsets: adaptive and escape channels.
- The routing function will be referred to as the routing subfunction when restricted to escape channels.

Approach to avoid deadlock

An adaptive routing function may allow cyclic dependencies between channels as long as:
- There exists a subset of channels (escape channels) that have no cyclic dependencies between them.
- It is possible to establish a path from the current node to the destination node using only escape channels.
- For wormhole switching, when a message reserves an escape channel and then an adaptive channel, it must be able to select an escape channel at the current node, i.e., escape channels should have no cyclic dependencies indirectly through adaptive channels.

Deadlock produced by indirect dependencies

- A set of messages are cyclically waiting for channels occupied by other messages in the set.
- Some messages are able to use escape channels but reach another cycle.
- Messages using escape channels are cyclically waiting indirectly through adaptive channels => There is a deadlock

Design methodology

- Based on the extension of other routing functions
- Allows the use of all the alternative minimal paths
- Does not increase the number of physical channels
- Provides a way to:
  - Extend the network topology and the routing function
  - Guarantee the absence of deadlocks

Design methodology

Steps:
- Given an interconnection network I1, define a deadlock-free routing function R1 that is connected using minimal paths.
- Split each physical channel into a set of additional virtual channels. The new routing function can use any of the new channels belonging to a minimal path or, alternatively, the channels supplied by R1.
- Verify that the extended channel dependency graph for R1 is acyclic. If it is, the routing algorithm is valid. Otherwise, it must be discarded. This step is not required for store-and-forward and virtual cut-through.

Design example

Routing algorithm for n-dimensional meshes

- Basic algorithm: Dimension-order routing
- Step 2: Split each physical channel $c_i$ into $k$ virtual channels $a_{i,1}, a_{i,2}, \ldots, a_{i,k-1}, b_i$
- New algorithm: Route over any minimal path using any of the $a$ channels. Alternatively, route over the lowest useful dimension using the corresponding $b$ channel.
- The MIT Reliable Router uses two virtual channels for fully adaptive minimal routing and two virtual channels for dimension-order routing in the absence of faults (on a 2-D mesh).
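The new algorithm for meshes can be sketched as a routing function that returns two groups of candidate channels. The channel encoding `(dimension, direction, name)`, the parameter `adaptive_vcs` (the k-1 adaptive virtual channels per physical channel), and the selection policy that tries adaptive channels before the escape channel are assumptions made for illustration; the slide does not specify a selection function.

```python
def candidate_channels(current, dest, adaptive_vcs=1):
    """Return (adaptive, escape) output channels at `current` for `dest`.

    Nodes are coordinate tuples of an n-dimensional mesh. A channel is
    encoded as (dimension, direction, virtual_channel_name).
    """
    # Useful dimensions are those where the coordinates still differ.
    useful = [
        (dim, 1 if d > c else -1)
        for dim, (c, d) in enumerate(zip(current, dest))
        if c != d
    ]
    # "a" channels: any useful dimension, so every minimal path is allowed.
    adaptive = [
        (dim, sign, f"a{v}")
        for dim, sign in useful
        for v in range(1, adaptive_vcs + 1)
    ]
    # "b" channel: dimension-order routing, lowest useful dimension only.
    escape = [(useful[0][0], useful[0][1], "b")] if useful else []
    return adaptive, escape


def select(current, dest, is_free, adaptive_vcs=1):
    """Assumed selection function: first free adaptive channel, then escape."""
    adaptive, escape = candidate_channels(current, dest, adaptive_vcs)
    for channel in adaptive + escape:
        if is_free(channel):
            return channel
    return None  # nothing free: the header waits and retries later
```

For example, at node (0, 0) with destination (2, 1), the adaptive candidates cover both dimensions while the escape candidate is restricted to dimension 0, which is what keeps the escape paths in dimension order.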

Example routing paths for 2-D meshes

[Figure: 4 x 4 mesh with nodes 0 to 15, showing a source node, a destination node, and the channels supplied by R]

Extended channel dependency graph for R1

[Figure: extended channel dependency graph whose vertices are the channels bxy (from node x to node y) of a 3 x 3 mesh with nodes 0 to 8]

Performance evaluation for the 2-D mesh

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (1 vc), Deterministic (2 vc), Adaptive (2 vc). Network size: 256 processors. Message length: 16 flits. Random traffic.]

Performance evaluation for the 3-D mesh

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (1 vc), Deterministic (2 vc), Adaptive (2 vc). Network size: 512 processors. Message length: 16 flits. Random traffic.]

Performance evaluation for the 2-D torus

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 256 processors. Message length: 16 flits. Random traffic.]

Performance evaluation for the 3-D torus

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Part-Adaptive (3 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Random traffic.]

Performance evaluation for the 3-D torus (II)

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Local traffic.]

Performance evaluation for the 3-D torus (III)

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Bit-reversal traffic pattern.]

Accurate performance evaluation for the 3-D torus

[Figure: average latency (ns) vs. traffic (flits/node/us). Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Random traffic.]

Accurate performance evaluation for the 3-D torus (II)

[Figure: average latency (ns) vs. traffic (flits/node/us). Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Local traffic.]

Accurate performance evaluation for the 3-D torus (III)

[Figure: average latency (ns) vs. traffic (flits/node/us). Curves: Deterministic (2 vc), Part-Adaptive (2 vc), Adaptive (3 vc). Network size: 512 processors. Message length: 16 flits. Bit-reversal traffic pattern.]

Application to deadlock recovery

- Routing resources (channels or buffers) are split into two classes: adaptive and escape.
- Adaptive resources can be freely used by all the packets.
- When a packet is waiting for longer than a timeout, it moves to an escape resource.
- Once a packet uses an escape resource, it cannot use an adaptive resource again.
- This routing scheme eliminates all the indirect dependencies between adaptive and escape resources.
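The recovery rule can be read as a small per-packet state machine. The sketch below is one possible reading of the slide, not the Disha implementation: the `Packet` fields, the function names, and the choice to count waiting time in cycles are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    waiting_cycles: int = 0   # cycles spent blocked at the current router
    on_escape: bool = False   # True once an escape resource has been used


def allowed_class(packet, timeout):
    """Resource class the packet may request in this cycle."""
    if packet.on_escape:
        return "escape"       # never returns to adaptive resources
    if packet.waiting_cycles > timeout:
        return "escape"       # presumed deadlocked: move to an escape resource
    return "adaptive"


def advance(packet, granted):
    """Update the packet after one cycle; `granted` is a class name or None."""
    if granted is None:
        packet.waiting_cycles += 1
    else:
        packet.waiting_cycles = 0
        packet.on_escape = packet.on_escape or granted == "escape"
```

Because `on_escape` is never cleared, a packet cannot hold an escape resource and later wait for an adaptive one, which is the property the last bullet relies on.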

Router organization for Disha

[Figure: router with input channels, output channels, an injection channel, and an ejection channel, each with a link controller (LC), a switch, a routing and arbitration unit, and an additional deadlock buffer]

Routing on edge and deadlock buffers

[Figure: two 4 x 4 meshes whose nodes are labeled 0 to 15 along a snake-like path (0 1 2 3 / 7 6 5 4 / 8 9 10 11 / 15 14 13 12)]

- Deadlock buffers can only be used in increasing label order.
- When a deadlock is detected, the packet header can be routed to the deadlock buffer.
- Edge buffers allow fully adaptive minimal routing.
- Escape channels are defined so that the routing subfunction is able to deliver messages for any destination (including deadlock buffers).

Extended channel dependency graph for edge buffers

[Figure: nodes n0 to n8 and channels c10, c21, c32, c41, c50, c65, c74, c83]

Performance evaluation

[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Avoidance-Det (2 VC), Recovery-Det (2 VC), Avoidance-Adap (3 VC), Recovery-Adap (3 VC).]

Injection limitation

- Prevents performance degradation at saturation
- Reduces the frequency of deadlock occurrence to negligible values

[Figure: a busy output channels counter, updated by RESERVE and RELEASE signals, feeds a comparator together with a THRESHOLD; the comparator output is INJECTION PERMITTED]
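The mechanism in the figure amounts to a counter and a comparison. The following is a minimal sketch under stated assumptions: the class name is hypothetical, and whether the comparator uses a strict or non-strict comparison against the threshold is not recoverable from the slide, so the strict form below is a guess.

```python
class InjectionLimiter:
    """Counts busy output channels and gates the injection of new messages."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.busy_output_channels = 0

    def reserve(self):
        """An output channel has been reserved by some message."""
        self.busy_output_channels += 1

    def release(self):
        """An output channel has been released."""
        if self.busy_output_channels == 0:
            raise ValueError("release without a matching reserve")
        self.busy_output_channels -= 1

    def injection_permitted(self):
        # Assumed comparator: inject only while fewer than `threshold`
        # output channels are busy.
        return self.busy_output_channels < self.threshold
```

A router would consult `injection_permitted()` before accepting a new message from the local node; messages already in the network are not affected by the limiter.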


Improved injection limitation mechanism

[Figure: per physical channel, the virtual channel status bits V0 to Vn-1 are combined with a bitwise OR; RESERVE and RELEASE signals update a busy output channels counter; a translation table and a comparator produce the INJECTION PERMITTED signal]

Improved deadlock detection mechanism

[Figure: switch with input channels and output channels 0 to 3; counters compared against thresholds are attached to the channels]

Application to networks of workstations

- Networks of workstations are emerging as a cost-effective alternative to parallel computers.
- Switch-based interconnects like Autonet, Myrinet and ServerNet have been proposed to build networks of workstations with irregular topology.
- The irregularity provides:
  - Wiring flexibility.
  - Scalability.
  - Incremental expansion capability.

- Drawback: The irregularity makes deadlock avoidance and routing quite complicated.
- Simplest solution: Avoid deadlock by eliminating all the cyclic dependencies between channels => Many messages are routed following non-minimal paths.
  - Higher message latency
  - Waste of resources
  - Lower throughput
- Alternative solution: Allow cyclic dependencies between channels
  - Reduces contention by increasing routing adaptivity
  - Allows more messages to follow minimal paths

Switch-based networks with irregular topologies

[Figure: a switch-based network with bidirectional links, switches 0 to 7, and attached processing elements, shown next to its graph representation]

The Autonet routing algorithm

General characteristics:
- Deadlock-free routing scheme (up/down routing).
- Provides partially adaptive communication between nodes.
- Distributed.
- Implemented using table lookup.

The up/down routing algorithm

[Figure: irregular network with switches 0 to 7; an arrow marks the "up" direction]

- Routing is based on an assignment of direction to the operational links.
- Routing rule: a legal route must traverse zero or more links in the "up" direction followed by zero or more links in the "down" direction.
  - Each cycle has at least one link in the "up" direction and one link in the "down" direction.
  - Cyclic dependencies are avoided: messages cannot cross a link in the "up" direction after one in the "down" direction.
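The routing rule can be checked on a concrete path once directions are assigned. The slide does not say how directions are chosen; the sketch below assumes the rule usually described for Autonet, where the "up" end of a link is the switch closer to the root of a breadth-first spanning tree, with ties broken by the lower switch identifier. The example graph is hypothetical and is not the network drawn on the slide.

```python
from collections import deque


def bfs_levels(adjacency, root):
    """Distance of every switch from the root of the spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(u, v, level):
    """True if moving from u to v traverses the link in the "up" direction."""
    return (level[v], v) < (level[u], u)


def is_legal(path, level):
    """Zero or more "up" links followed by zero or more "down" links."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        if is_up(u, v, level):
            if gone_down:
                return False  # "up" after "down" is forbidden
        else:
            gone_down = True
    return True


# Hypothetical 4-switch network used only to exercise the functions.
EXAMPLE = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
```

In the example, the path 3, 1, 0, 2 is legal (two "up" links, then one "down" link), whereas 1, 3, 2 is illegal because it goes "down" to switch 3 and then "up" to switch 2.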


Routing efficiency

[Figure: the same network with switches 0 to 7 and its "up" direction]

- From 7 to 0: OK
- From 2 to 5: lack of adaptivity
- From 4 to 1: non-minimal routing

The basic routing rule prevents the use of minimal routing and adaptivity in most cases because of "down" to "up" conflicts. The probability of non-minimal routing increases with network size.

A design methodology for adaptive routing algorithms

Interconnection network + deadlock-free routing function => (new methodology) => physical channels duplicated or split into two virtual channels (original and new) + extended routing function

Extended routing function

- Newly injected messages can use the new channels without any restriction. For performance reasons, only minimal paths are allowed.
- Original channels are used exactly in the same way as in the original routing function.
- Once a message reserves one of the original channels, it cannot use any of the new channels again.
- When the routing table provides both kinds of channels, give preference to new channels.

The extended routing function is deadlock-free.

Improving the efficiency of the methodology

Idea:
- Focus on minimal routing, even if adaptivity is reduced
- Restrict the transition from new channels to original channels
- Improved adaptive routing function:
  - Newly injected messages can only use new channels
  - At intermediate switches, a higher priority is assigned to the new channels belonging to minimal paths
  - If all the new channels are busy, then an original channel belonging to a minimal path (if any) is selected
  - If none exists, then the one that provides the shortest path is used (this ensures deadlock-freedom)
- Once a message reserves an original channel, it can no longer reserve a new one
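The priority order of the improved routing function can be written as a single selection step. This is a sketch of one reading of the rules above; the function name, the argument layout, and the representation of original-channel options as `(channel, path_length, on_minimal_path)` tuples are assumptions made for illustration.

```python
def choose_output(at_source, used_original, free_new_minimal, original_options):
    """Pick the next channel for a message header.

    free_new_minimal: free new channels that lie on a minimal path.
    original_options: (channel, path_length, on_minimal_path) tuples
        supplied by the original (up/down) routing function.
    Returns (channel, used_original_after) or None if the header must wait.
    """
    # New channels are only available until an original channel is reserved.
    if not used_original and free_new_minimal:
        return free_new_minimal[0], False
    # Newly injected messages can only use new channels.
    if at_source and not used_original:
        return None
    if not original_options:
        return None
    # Prefer an original channel on a minimal path, if any.
    minimal = [opt for opt in original_options if opt[2]]
    if minimal:
        return minimal[0][0], True
    # Otherwise take the original channel that gives the shortest path.
    best = min(original_options, key=lambda opt: opt[1])
    return best[0], True
```

The second element of the result records that the message has switched to original channels, so later calls pass `used_original=True` and skip the new channels.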

Performance evaluation

Evaluation of four routing schemes:
- Basic up/down routing scheme (UD).
- Up/down routing scheme using two virtual channels per physical channel (UD-2VC).
- Adaptive routing scheme using two virtual channels per physical channel (A-2VC).
- Improved adaptive routing scheme using two virtual channels per physical channel (MA-2VC).

Performance evaluation carried out by simulation.

Network model:
- Topology generated randomly (8-port switches)
- 4 nodes (processors) connected to each switch
- Two adjacent switches are connected by a single link
- One routing control unit per switch (assigned in a round-robin fashion)
- Message destination is randomly chosen among nodes
- It takes one clock cycle to compute the routing algorithm, to transfer one flit from an input buffer to an output buffer, or to transfer one flit across a physical channel

Simulation results (I)

[Figure: average latency (cycles) vs. traffic (flits/cycle/node). Curves: UD, UD-2VC, A-2VC, MA-2VC. Network size: 16 switches. Message length: 16 flits.]

Simulation results (II)

[Figure: average latency (cycles) vs. traffic (flits/cycle/node). Curves: UD, UD-2VC, A-2VC, MA-2VC. Network size: 32 switches. Message length: 16 flits.]

Simulation results (III)

[Figure: average latency (cycles) vs. traffic (flits/cycle/node). Curves: UD, UD-2VC, A-2VC, MA-2VC. Network size: 64 switches. Message length: 16 flits.]

Simulation results (IV)

[Figure: average latency (cycles) vs. traffic (flits/cycle/node). Curves: UD, UD-2VC, A-2VC, MA-2VC. Network size: 64 switches. Message length: 64 flits.]

Simulation results (V)

[Figure: average latency (cycles) vs. traffic (flits/cycle/node). Curves: UD, UD-2VC, A-2VC, MA-2VC. Network size: 64 switches. Message length: 256 flits.]

Simulation results for application traces

[Figure: number of messages vs. time. Traces from Barnes-Hut executed on 64 processors.]

Simulation results for application traces

[Figure: latency (cycles) vs. time (cycles). Curves: MA-2VC, UD-2VC, UD.]

Zoom of the first peak

[Figure: latency (cycles) vs. time (cycles), from 1.8e+07 to 2e+07. Curves: MA-2VC, UD-2VC, UD.]

Zoom of the second peak

[Figure: latency (cycles) vs. time (cycles), from 3.7e+07 to 3.9e+07. Curves: MA-2VC, UD-2VC, UD.]

Zoom of the third peak

[Figure: latency (cycles) vs. time (cycles), from 5.2e+07 to 5.4e+07. Curves: MA-2VC, UD-2VC, UD.]

Final Remarks

- Hybrid switching techniques may considerably increase performance by using the appropriate switching technique for each message class.
- Circuit switching can take advantage of wave pipelining and optical technology to increase link bandwidth.
- Flexible deadlock avoidance and recovery schemes allow the design of more efficient routing algorithms.
- These routing algorithms have been implemented in the MIT Reliable Router and the Cray T3E.
- Adaptive routing and virtual channels are especially interesting when applications produce bursty traffic that saturates the network during some time intervals (usually prior to synchronization points).
- Adaptive routing and virtual channels must be implemented efficiently to minimize the increase in clock cycle time.
