Switching Techniques, Adaptive Routing and Deadlock Handling in Interconnection Networks
Jose Duato
Dept. de Ingeniería de Sistemas, Computadores y Automática
Universidad Politécnica de Valencia, Spain
Parallel Architectures Group / Grupo de Arquitecturas Paralelas (GAP)

Outline
- Introduction
- Switching techniques
- Optimized switching techniques
- Deadlock handling
- Theory of deadlock avoidance
- Design methodologies
- Application to deadlock recovery
- Application to networks of workstations
- Performance evaluation

Introduction (from W. J. Dally)
- The performance of most digital systems today is limited by their communication or interconnection, not by their logic or memory.
- Most of the power is used to drive wires, and most of the clock cycle is spent on wire delay, not gate delay.
- As technology improves, pin density and wiring density are scaling at a slower rate than the components themselves. Also, the frequency of communication between components is lagging far behind the clock rates of modern processors.
- These factors combine to make interconnection the key factor in the success of future digital systems.

Introduction (from W. J. Dally)
- As designers strive to make more efficient use of scarce interconnection bandwidth, interconnection networks are emerging as a nearly universal solution to the system-level communication problems for modern digital systems.
- Originally developed for the demanding communication requirements of multicomputers, interconnection networks are beginning to replace buses as the standard system-level interconnection.
- Interconnection networks are also replacing dedicated wiring in special-purpose systems as designers discover that routing packets is both faster and more economical than routing wires.

Introduction
Interconnection networks are currently being used for many different applications, ranging from internal buses in VLSI circuits to wide area computer networks. These applications include:
- System area networks
- Telephone switches
- Internal networks for ATM switches
- Processor/memory interconnects for vector supercomputers
- Interconnects for multicomputers
- Interconnects for distributed shared-memory multiprocessors
- Clusters of workstations
- Local area networks
- Metropolitan area networks
- Computer networks
- Wide area networks

Introduction
- Parallel computers should be designed using commodity components to be cost-effective.
- Unfortunately, commodity communication subsystems have been designed to meet a different set of requirements, i.e., those arising in computer networks.
- Designing high-performance interconnection networks becomes a critical issue to exploit the performance of parallel computers.
- Most manufacturers designed custom interconnection networks.
- Recently, several high-performance switches have been developed to build inexpensive parallel computers by connecting cost-effective computers through those switches.

Main design parameters
- Topology: Defines how the nodes are interconnected by channels
  - Direct networks, switch-based networks
- Routing algorithm: Determines the path selected by a message to reach its destination
  - Deterministic routing, adaptive routing
- Switching technique: Determines how and when buffers are reserved and switches are configured
  - Packet switching, circuit switching, wormhole, virtual cut-through

Interconnection Networks
- Shared-Medium Networks
  - Local Area Networks
    - Contention Bus (Ethernet)
    - Token Bus (Arcnet)
    - Token Ring (FDDI Ring, IBM Token Ring)
  - Backplane Bus (Sun Gigaplane, DEC AlphaServer8X00, SGI PowerPath-2)
- Direct Networks (Router-Based Networks)
  - Strictly Orthogonal Topologies
    - Mesh
      - 2-D Mesh (Intel Paragon)
      - 3-D Mesh (MIT J-Machine)
    - Torus (k-ary n-cube)
      - 1-D Unidirectional Torus or Ring (KSR first-level ring)
      - 2-D Bidirectional Torus (Intel/CMU iWarp)
      - 3-D Bidirectional Torus (Cray T3D, Cray T3E)
    - Hypercube (Intel iPSC, nCUBE)
  - Other Topologies: Trees, Cube-Connected Cycles, de Bruijn, Star Graphs, etc.
- Indirect Networks (Switch-Based Networks)
  - Regular Topologies
    - Crossbar (Cray X/Y-MP, DEC GIGAswitch, Myrinet)
    - Multistage Interconnection Networks
      - Blocking Networks
        - Unidirectional MIN (NEC Cenju-3, IBM RP3)
        - Bidirectional MIN (IBM SP, TMC CM-5)
      - Nonblocking Networks: Clos Network
  - Irregular Topologies (DEC Autonet, Myrinet, ServerNet)
- Hybrid Networks
  - Multiple-Backplane Buses (Sun XDBus)
  - Hierarchical Networks (Bridged LANs, KSR)
  - Cluster-Based Networks (Stanford DASH, HP/Convex Exemplar)
  - Other Hypergraph Topologies: Hyperbuses, Hypermeshes, etc.

Direct networks
[Figure: (a) 2-ary 4-cube (hypercube); (b) 3-ary 2-cube; (c) 3-ary 3D-mesh]

Multistage interconnection networks
[Figure: a multistage butterfly network and an Omega network, each with ports labeled 0000 to 1111]

Switch-based irregular topologies
[Figure: a switch-based network (switches 0 to 7 joined by bidirectional links, with processing elements attached) and its graph representation]

Generalized MIN model
[Figure: generalized MIN with N ports on one side and M ports on the other, switch stages G0 to Gg-1, and connection patterns C0 to Cg]

Unified View
- Some manufacturers developed switches that are suitable to implement either direct or indirect networks (Inmos C104, SGI SPIDER).
- We can view networks using point-to-point links as a set of interconnected switches, each one connected to zero, one, or more nodes:
  - Direct networks correspond to the case where every switch is connected to a single node.
  - Crossbar networks correspond to the case where there is a single switch connected to all the nodes.
  - Multistage interconnection networks correspond to the case where switches are arranged into several stages and the switches in intermediate stages are not connected to any processor.

Router organization
[Figure: router with input channels, output channels, an injection channel and an ejection channel, each with a link controller (LC), connected through a switch under a routing and arbitration unit]

Switching
- Switching: Determines how and when buffers are reserved and switches are configured
- Flow control: Synchronization protocol for transmitting and receiving a unit of information
- Unit of flow control: Portion of the message whose transfer must be synchronized
- Flow control occurs at two levels: message flow control and physical channel flow control

Packet switching and circuit switching
[Figure: time-space diagrams (channel vs. time) for packet switching and for circuit switching]

Virtual cut-through and wormhole switching
[Figure: time-space diagram (channel vs. time) of a message drawn as a sequence of flits labeled H, D, ..., D, T]

Virtual channels
[Figure: a physical channel shared by several virtual channels; components shown are flit buffers, a channel multiplexor (from switch), a channel demultiplexor (to switch), and a virtual channel controller]

Performance of switching techniques
- Packet switching is well suited for very short messages.
- Circuit switching is well suited for very long messages.
- Virtual cut-through switching is well suited for messages of any length but requires splitting messages into fixed-size packets.
- Wormhole switching is well suited for messages of any length but saturates at moderate loads. Virtual channels alleviate this situation.
- Wormhole switching has been preferred for electronic routers because buffers can be small and the resulting circuits are compact and fast.
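As a rough illustration of why message length matters, the sketch below compares no-load latency under a deliberately simplified model. The model and every name in it (`hops`, `flits`, `t_route`, `t_flit`) are assumptions made for illustration and do not come from the slides: store-and-forward packet switching pays the full transfer time at every hop, while cut-through and wormhole switching pipeline the body behind the header.

```python
def store_and_forward_latency(hops, flits, t_route=1, t_flit=1):
    """Assumed model: the whole packet is received at each hop before moving on."""
    return hops * (t_route + flits * t_flit)


def cut_through_latency(hops, flits, t_route=1, t_flit=1):
    """Assumed model: the header advances hop by hop, the body follows in a pipeline."""
    return hops * (t_route + t_flit) + (flits - 1) * t_flit


if __name__ == "__main__":
    for flits in (1, 16, 1024):
        print(flits,
              store_and_forward_latency(8, flits),
              cut_through_latency(8, flits))
```

Under these assumptions the two techniques coincide for one-flit messages and diverge as messages grow, which is consistent with packet switching being suited only to very short messages.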

Optimized switching techniques
- Traffic from real applications may be bimodal and may vary over time.
- Wormhole switching can be used for short messages.
- Circuit switching can be used for very long messages.
- Path set-up can be overlapped with useful computation and/or circuits can be reused.
- Physical circuits do not need buffers at intermediate routers and can be made much faster than conventional links, either by using wave pipelining or optical technology.

Optimized router organization
[Figure: router with switches S0, S1, ..., Sk, a wormhole control unit, a PCS control unit, multiplexors, synchronizers, input and output channels, pipelined input and output channels, control channels, and links from/to the local processor]

Performance for multimedia applications
Workload: 10% short messages (16 flits), 90% long messages (1024 flits). Curves: 'CS 28+4', 'WSNR 16+16', 'WH'.
[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 2]
[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 3]
[Figure: average latency (cycles) vs. traffic (flits/node/cycle), CLK x 4]

Performance for multimedia applications
Workload: 10% short messages (16 flits), 90% long messages (1024 flits). Curves: 'WS 16+16', 'WH', 'WH 2 VC', 'WH 3 VC'.
[Figure: average latency (cycles) vs. long messages traffic (1024 flits long, 90%); only long messages are shown]
[Figure: average latency (cycles) vs. short messages traffic (16 flits long, 10%), range 0 to 0.03; only short messages are shown]
[Figure: average latency (cycles) vs. short messages traffic (16 flits long, 10%), range 0 to 4.0e-4; only short messages are shown]

Performance for multimedia applications
Link bandwidth: 256 Gbps. Curves: '32 slots', '16 slots', '8 slots', '4 slots', '2 slots', '1 slot'.
[Figure: average latency (cycles) vs. traffic; 10% short messages (16 flits), 90% long messages (1024 flits)]
[Figure: average latency vs. traffic; 40% short messages (16 flits), 60% long messages (1024 flits)]

Routing Algorithms
- Number of Destinations: Unicast Routing, Multicast Routing
- Routing Decisions: Centralized Routing, Source Routing, Distributed Routing, Multiphase Routing
- Implementation: Table Lookup, Finite-State Machine
- Adaptivity: Deterministic Routing, Adaptive Routing
- Progressiveness: Progressive, Backtracking
- Minimality: Profitable, Misrouting
- Number of Paths: Complete, Partial

Situations that may prevent packet delivery
Undeliverable Packets
- Deadlock
  - Prevention
  - Avoidance
  - Recovery
- Livelock
  - Minimal Paths
  - Restricted Nonminimal Paths
  - Probabilistic Avoidance
- Starvation
  - Resource Assignment Scheme

Deadlock handling
- Deadlock prevention: Backtracking
- Deadlock avoidance: Acyclic graph, acyclic subgraph
- Regressive deadlock recovery: Message removal, message abortion
- Progressive deadlock recovery: Disha
Main goal: Design of efficient deadlock-free fully adaptive routing algorithms

Deadlocked configuration
[Figure: nodes N0, N1, N2 and N3 connected in a cycle, with every channel buffer filled by flits labeled 0 to 3]
Messages wait for resources held by other messages in a cyclic way ⇒ Removing cyclic dependencies will avoid deadlock

Allowing cyclic dependencies
Example for the unidirectional ring:
- cAi channels can be used to forward messages to all the destinations.
- cHi channels can only be used if the destination is higher than the current node.
[Figure: unidirectional ring of nodes n0 to n3 with channels cA0 to cA3 and cH0 to cH2]
- There exist cyclic dependencies between cAi channels.
- However, cHi channels have no cyclic dependencies.
- There is no deadlock because messages waiting for resources can always escape by using cHi channels.
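The ring example can be checked mechanically. The following is a minimal sketch, assuming a 4-node ring and the routing rule exactly as stated on the slide; the tuple encoding of channels and the helper names (`route`, `dependencies`, `has_cycle`) are hypothetical. It enumerates direct channel dependencies and confirms that the cAi channels form a cycle while the cHi channels do not.

```python
from itertools import product

N = 4  # nodes n0..n3 of the unidirectional ring (assumed size)


def route(cur, dest):
    """Channels offered at node `cur` for destination `dest`."""
    channels = [("A", cur)]          # cAi: usable for every destination
    if dest > cur:
        channels.append(("H", cur))  # cHi: only if the destination is higher
    return channels


def dependencies():
    """Pairs (c1, c2): a message may hold c1 and then request c2."""
    deps = set()
    for cur, dest in product(range(N), repeat=2):
        nxt = (cur + 1) % N
        if cur == dest or nxt == dest:
            continue                 # no further channel is requested
        for c1 in route(cur, dest):
            for c2 in route(nxt, dest):
                deps.add((c1, c2))
    return deps


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph of edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    state = {}                       # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for succ in graph.get(node, []):
            if state.get(succ) == "visiting":
                return True
            if succ not in state and visit(succ):
                return True
        state[node] = "done"
        return False

    return any(visit(n) for n in list(graph) if n not in state)


deps = dependencies()
a_deps = {(a, b) for a, b in deps if a[0] == "A" and b[0] == "A"}
h_deps = {(a, b) for a, b in deps if a[0] == "H" and b[0] == "H"}
```

This sketch looks only at direct dependencies within each channel class; it does not examine indirect dependencies between escape channels through adaptive channels.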

Theory of deadlock avoidance (informal)
[Figure: interconnection network]

Adaptive routing function and selection function
[Figure: the routing function and the selection function, drawn with the current node nc and the destination node nd]

Routing subfunction
- Network channels can be split into two subsets: adaptive and escape channels.
- The routing function will be referred to as routing subfunction when restricted to escape channels.

Approach to avoid deadlock
An adaptive routing function may allow cyclic dependencies between channels as long as:
- There exists a subset of channels (escape channels) that have no cyclic dependencies between them.
- It is possible to establish a path from the current node to the destination node using only escape channels.
- For wormhole switching, when a message reserves an escape channel and then an adaptive channel, it must be able to select an escape channel at the current node, i.e., escape channels should have no cyclic dependencies indirectly through adaptive channels.

Deadlock produced by indirect dependencies
- A set of messages are cyclically waiting for channels occupied by other messages in the set.
- Some messages are able to use escape channels but reach another cycle.
- Messages using escape channels are cyclically waiting indirectly through adaptive channels ⇒ There is a deadlock

Design methodology
- Based on the extension of other routing functions
- Allows the use of all the alternative minimal paths
- Does not increase the number of physical channels
- Provides a way to:
  - Extend the network topology and the routing function
  - Guarantee the absence of deadlocks

Design methodology
Steps:
1. Given an interconnection network I1, define a minimal path connected deadlock-free routing function R1.
2. Split each physical channel into a set of additional virtual channels. The new routing function can use any of the new channels belonging to a minimal path or, alternatively, the channels supplied by R1.
3. Verify that the extended channel dependency graph for R1 is acyclic. If it is, the routing algorithm is valid. Otherwise, it must be discarded. This step is not required for store-and-forward and virtual cut-through.

Design example
Routing algorithm for n-dimensional meshes
- Basic algorithm: Dimension-order routing
- Step 2: Split each physical channel ci into k virtual channels ai,1, ai,2, ..., ai,k-1, bi
- New algorithm: Route over any minimal path using any of the a channels. Alternatively, route over the lowest useful dimension using the corresponding b channel
- The MIT Reliable Router uses two virtual channels for fully adaptive minimal routing and two virtual channels for dimension-order routing in the absence of faults (on a 2-D mesh)
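The new algorithm can be sketched as a routing function that returns the candidate output channels for a message. This is only an illustration under assumptions: nodes are coordinate tuples, a channel is encoded as a hypothetical `(name, dimension, direction)` tuple, and `num_adaptive` stands for the k-1 adaptive (a) channels per physical channel.

```python
def candidate_channels(cur, dest, num_adaptive=1):
    """Return (adaptive, escape) channels for a message at `cur` going to `dest`.

    adaptive: a channels on every dimension that brings the message closer.
    escape:   the b channel on the lowest dimension that still needs
              correcting (dimension-order routing), or None at the destination.
    """
    adaptive = []
    escape = None
    for dim, (c, d) in enumerate(zip(cur, dest)):
        if c == d:
            continue                      # this dimension is already resolved
        direction = 1 if d > c else -1
        for idx in range(1, num_adaptive + 1):
            adaptive.append((f"a{idx}", dim, direction))
        if escape is None:                # lowest useful dimension only
            escape = ("b", dim, direction)
    return adaptive, escape


if __name__ == "__main__":
    print(candidate_channels((0, 0), (2, 1)))
    print(candidate_channels((2, 0), (2, 1)))
```

A selection function would then pick one free channel from this set; preferring the a channels and falling back to the b channel when they are busy is one possible policy, not one prescribed by the slide.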

Example routing paths for 2-D meshes
[Figure: 2-D mesh with nodes 0 to 15, a source node and a destination node, showing the channels supplied by R]

Extended channel dependency graph for R1
[Figure: extended channel dependency graph for R1 on a mesh with nodes 0 to 8; the vertices are the channels bij between adjacent nodes (b01, b10, b12, b21, b03, b30, b14, ...)]

Performance evaluation for the 2-D mesh
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 256 processors. Message length: 16 flits. Random traffic. Curves: 'Deterministic (1 vc)', 'Deterministic (2 vc)', 'Adaptive (2 vc)']

Performance evaluation for the 3-D mesh
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 512 processors. Message length: 16 flits. Random traffic. Curves: 'Deterministic (1 vc)', 'Deterministic (2 vc)', 'Adaptive (2 vc)']

Performance evaluation for the 2-D torus
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 256 processors. Message length: 16 flits. Random traffic. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Performance evaluation for the 3-D torus
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 512 processors. Message length: 16 flits. Random traffic. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Part-Adaptive (3 vc)', 'Adaptive (3 vc)']

Performance evaluation for the 3-D torus (II)
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 512 processors. Message length: 16 flits. Local traffic. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Performance evaluation for the 3-D torus (III)
[Figure: average latency (cycles) vs. normalized accepted traffic. Network size: 512 processors. Message length: 16 flits. Bit-reversal traffic pattern. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Accurate performance evaluation for the 3-D torus
[Figure: average latency (ns) vs. traffic (flits/node/µs). Network size: 512 processors. Message length: 16 flits. Random traffic. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Accurate performance evaluation for the 3-D torus (II)
[Figure: average latency (ns) vs. traffic (flits/node/µs). Network size: 512 processors. Message length: 16 flits. Local traffic. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Accurate performance evaluation for the 3-D torus (III)
[Figure: average latency (ns) vs. traffic (flits/node/µs). Network size: 512 processors. Message length: 16 flits. Bit-reversal traffic pattern. Curves: 'Deterministic (2 vc)', 'Part-Adaptive (2 vc)', 'Adaptive (3 vc)']

Application to deadlock recovery
- Routing resources (channels or buffers) are split into two classes: adaptive and escape.
- Adaptive resources can be freely used by all the packets.
- When a packet is waiting for longer than a timeout, it moves to an escape resource.
- Once a packet uses an escape resource, it cannot use an adaptive resource again.
- This routing scheme eliminates all the indirect dependencies between adaptive and escape resources.
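The rules above can be read as a small per-packet state machine. The sketch below is an illustration only: the `Packet` fields, the function name and the default timeout of 16 cycles are assumptions, not values from the slides.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    waiting_cycles: int = 0   # cycles spent blocked on adaptive resources
    on_escape: bool = False   # True once the packet has used an escape resource


def choose_resource_class(packet, adaptive_free, escape_free, timeout=16):
    """Return "adaptive", "escape", or None (keep waiting) for this cycle."""
    if packet.on_escape:
        # Once on escape resources, the packet never returns to adaptive ones.
        return "escape" if escape_free else None
    if adaptive_free:
        packet.waiting_cycles = 0
        return "adaptive"
    packet.waiting_cycles += 1
    if packet.waiting_cycles > timeout and escape_free:
        packet.on_escape = True
        return "escape"
    return None
```

For example, with `timeout=2` a blocked packet waits two cycles, moves to an escape resource on the third, and keeps using escape resources even if an adaptive one becomes free later.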

Router organization for Disha
[Figure: router with input channels, output channels, an injection channel and an ejection channel, each with a link controller (LC), a switch, a routing and arbitration unit, and a deadlock buffer]

Routing on edge and deadlock buffers
[Figure: two 4 x 4 meshes whose nodes are labeled row by row as 0 1 2 3 / 7 6 5 4 / 8 9 10 11 / 15 14 13 12]
- Deadlock buffers can only be used in increasing label order.
- When a deadlock is detected, the packet header can be routed to the deadlock buffer.
- Edge buffers allow fully adaptive minimal routing.
- Escape channels are defined so that the routing subfunction is able to deliver messages for any destination (including deadlock buffers).

Extended channel dependency graph for edge buffers
[Figure: extended channel dependency graph over nodes n0 to n8 with channels c10, c21, c32, c41, c50, c65, c74, c83]

Performance evaluation
[Figure: average latency (cycles) vs. normalized accepted traffic. Curves: Avoidance-Det (2 VC), Recovery-Det (2 VC), Avoidance-Adap (3 VC), Recovery-Adap (3 VC)]

Injection limitation
- Prevents performance degradation at saturation
- Reduces the frequency of deadlock occurrence to negligible values
[Figure: a busy output channels counter, driven by RESERVE and RELEASE signals, and a THRESHOLD feed a comparator whose output is INJECTION PERMITTED]
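The mechanism in the figure can be sketched in a few lines. The class and method names are hypothetical, and the comparison rule (injection is allowed while the busy count is strictly below the threshold) is an assumption, since the slide does not state the exact condition.

```python
class InjectionLimiter:
    """Counts busy output channels and gates message injection."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.busy = 0                 # busy output channels counter

    def reserve(self):
        """An output channel has been reserved by some message."""
        self.busy += 1

    def release(self):
        """An output channel has been released."""
        self.busy -= 1

    def injection_permitted(self):
        # Comparator (assumed rule): inject only while few channels are busy.
        return self.busy < self.threshold
```

A node would call `injection_permitted()` before injecting a new message and hold the message in its source queue otherwise.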

Improved injection limitation mechanism
[Figure: physical channels with virtual channels V0 to Vn-1, a bitwise OR, RESERVE and RELEASE signals, a busy output channels counter, a translation table, a comparator, and the INJECTION PERMITTED output]

Improved deadlock detection mechanism
[Figure: switch with input channels 0 to 3 and output channels; counters are compared against a threshold]

Application to networks of workstations
- Networks of workstations are emerging as a cost-effective alternative to parallel computers.
- Switch-based interconnects like Autonet, Myrinet and ServerNet have been proposed to build networks of workstations with irregular topology.
- The irregularity provides:
  - Wiring flexibility.
  - Scalability.
  - Incremental expansion capability.
- Drawback: The irregularity makes deadlock avoidance and routing quite complicated.
- Simplest solution: Avoid deadlock by eliminating all the cyclic dependencies between channels ⇒ Many messages are routed following non-minimal paths.
  - Higher message latency
  - Waste of resources
  - Lower throughput
- Alternative solution: Allow cyclic dependencies between channels
  - Reduces contention by increasing routing adaptivity
  - Allows more messages to follow minimal paths

Switch-based networks with irregular topologies
[Figure: a switch-based network (switches 0 to 7 joined by bidirectional links, with processing elements attached) and its graph representation]

The Autonet routing algorithm
General characteristics:
- Deadlock-free routing scheme (up/down routing).
- Provides partially adaptive communication between nodes.
- Distributed.
- Implemented using table lookup.

The up/down routing algorithm
[Figure: network of switches 0 to 7 with the "up" direction of the links marked]
- Routing is based on an assignment of direction to the operational links.
- Routing rule: a legal route must traverse zero or more links in the "up" direction followed by zero or more links in the "down" direction.
- Each cycle has at least one link in the "up" direction and one link in the "down" direction.
- Cyclic dependencies are avoided: messages cannot cross a link in the "up" direction after one in the "down" direction.
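The routing rule can be checked with a short sketch. The slide does not say how directions are assigned, so the code assumes a common convention: build a breadth-first spanning tree from a root switch and call a traversal "up" when it moves to a switch closer to the root, breaking ties by lower switch id. The function names and the 4-switch example graph are hypothetical.

```python
from collections import deque


def bfs_levels(adjacency, root):
    """Distance of every switch from the root of a BFS spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neigh in adjacency[node]:
            if neigh not in level:
                level[neigh] = level[node] + 1
                queue.append(neigh)
    return level


def is_up(src, dst, level):
    """Assumed convention: 'up' means toward the root (ties: lower id)."""
    return (level[dst], dst) < (level[src], src)


def is_legal_route(path, level):
    """Legal: zero or more 'up' links followed by zero or more 'down' links."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        if is_up(src, dst, level):
            if gone_down:
                return False          # 'up' after 'down' is forbidden
        else:
            gone_down = True
    return True


# Hypothetical 4-switch network used only to exercise the rule.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level = bfs_levels(adjacency, root=0)
```

In this example the route 1, 3, 2 is rejected even though it is minimal, because it goes "down" to switch 3 and then "up" to switch 2; the legal alternative is 1, 0, 2.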

Routing efficiency
[Figure: network of switches 0 to 7 with the "up" direction marked. From 7 to 0: OK. From 2 to 5: lack of adaptivity. From 4 to 1: non-minimal routing.]
- The basic routing rule prevents the use of minimal routing and adaptivity in most cases because of "down" to "up" conflicts.
- The probability of non-minimal routing increases with network size.

A design methodology for adaptive routing algorithms
interconnection network + deadlock-free routing function
⇒ (new methodology) ⇒
physical channels duplicated or split into two virtual channels (original and new) + extended routing function

Extended routing function
- Newly injected messages can use the new channels without any restriction. For performance reasons, only minimal paths are allowed.
- Original channels are used exactly in the same way as in the original routing function.
- Once a message reserves one of the original channels, it cannot use any of the new channels again.
- When the routing table provides both kinds of channels, give preference to new channels.
The extended routing function is deadlock-free.

Improving the efficiency of the methodology
- Idea: Focus on minimal routing, even if adaptivity is reduced
- Restrict the transition from new channels to original channels
- Improved adaptive routing function:
  - Newly injected messages can only use new channels
  - At intermediate switches, a higher priority is assigned to the new channels belonging to minimal paths
  - If all the new channels are busy, then an original channel belonging to a minimal path (if any) is selected
  - If none exists, then the one that provides the shortest path is used (this ensures deadlock-freedom)
- Once a message reserves an original channel, it can no longer reserve a new one
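The priority order of the improved routing function can be sketched as a selection function. This is a simplified illustration: the function name, its arguments, and the encoding of an original channel as a `(name, distance_to_destination, is_minimal)` tuple are assumptions, and the sketch works on lists of currently free channels rather than modeling waiting.

```python
def select_channel(new_minimal_free, original_free,
                   used_original=False, newly_injected=False):
    """Pick an output channel name, or None if the message must wait.

    new_minimal_free: names of free new channels on minimal paths.
    original_free:    free original channels allowed by the original routing
                      function, as (name, distance_to_destination, is_minimal).
    """
    if newly_injected:
        # Newly injected messages can only use new channels.
        return new_minimal_free[0] if new_minimal_free else None
    if not used_original and new_minimal_free:
        # Highest priority: new channels belonging to minimal paths.
        return new_minimal_free[0]
    minimal = [c for c in original_free if c[2]]
    if minimal:
        # Next: an original channel on a minimal path, if any.
        return minimal[0][0]
    if original_free:
        # Otherwise: the original channel giving the shortest path.
        return min(original_free, key=lambda c: c[1])[0]
    return None
```

The `used_original` flag captures the last rule on the slide: after a message has reserved an original channel, new channels are no longer considered.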

Performance evaluation
Evaluation of four routing schemes:
- Basic up/down routing scheme (UD).
- Up/down routing scheme using two virtual channels per physical channel (UD-2VC).
- Adaptive routing scheme using two virtual channels per physical channel (A-2VC).
- Improved adaptive routing scheme using two virtual channels per physical channel (MA-2VC).
Performance evaluation carried out by simulation.

Network model:
- Topology generated randomly (8-port switches)
- 4 nodes (processors) connected to each switch
- Two adjacent switches are connected by a single link
- One routing control unit per switch (assigned in a round-robin fashion)
- Message destination is randomly chosen among nodes
- It takes one clock cycle to compute the routing algorithm, to transfer one flit from an input buffer to an output buffer, or to transfer one flit across a physical channel

Simulation results (I) to (V)
All plots show average latency (cycles) vs. traffic (flits/cycle/node) for UD, UD-2VC, A-2VC and MA-2VC.
[Figure (I): network size 16 switches, message length 16 flits]
[Figure (II): network size 32 switches, message length 16 flits]
[Figure (III): network size 64 switches, message length 16 flits]
[Figure (IV): network size 64 switches, message length 64 flits]
[Figure (V): network size 64 switches, message length 256 flits]

Simulation results for application traces
Traces from Barnes-Hut executed on 64 processors.
[Figure: amount of messages vs. time]
[Figure: latency (cycles) vs. time (cycles) for MA-2VC, UD-2VC and UD]
[Figure: zoom of the first peak, time 1.8e+07 to 2e+07 cycles, for MA-2VC, UD-2VC and UD]
[Figure: zoom of the second peak, time 3.7e+07 to 3.9e+07 cycles, for MA-2VC, UD-2VC and UD]
[Figure: zoom of the third peak, time 5.2e+07 to 5.4e+07 cycles, for MA-2VC, UD-2VC and UD]

Final Remarks
- Hybrid switching techniques may considerably increase performance by using the appropriate switching technique for each message class.
- Circuit switching can take advantage of wave pipelining and optical technology to increase link bandwidth.
- Flexible deadlock avoidance and recovery schemes allow the design of more efficient routing algorithms.
- These routing algorithms have been implemented in the MIT Reliable Router and the Cray T3E.
- Adaptive routing and virtual channels are especially interesting when applications produce bursty traffic that saturates the network during some time intervals (usually prior to synchronization points).
- Adaptive routing and virtual channels must be implemented efficiently to minimize the increment in clock cycle time.