Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip
Bilal Zafar, Jeff Draper and Timothy Pinkston
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA
ICPP '10: 39th International Conference on Parallel Processing
San Diego, CA
Sep. 16, 2010
CMPs & Network-on-Chip: Then & Now
In the beginning … (2001)
• IBM released its first commercial dual-core CPU (Power4)
• Bill Dally presents “Route Packets Not Wires” at DAC
And, now …
• Multi-core CMPs with 6-12 cores commercially available
• Many-core CMPs research chips
  o Intel Teraflops Research Chip (80 cores; 2D Mesh)
  o MIT RAW Chip (16 cores; 2D Mesh)
  o UT Austin TRIPS (2x25 tiles; 2D Mesh)
9/18/2010
Network-on-Chip Challenges
Power
• Too much power in the network: 20-36% [Kundu, OCIN Workshop 2006]
• Leakage power is high, especially in buffers
Performance
• Bandwidth is not a problem; sufficient wire bandwidth
• Latency per hop is high
Resilience
• Fixed routing causes a single point of failure
• Expensive to exploit path diversity (blame deadlock avoidance)
Power Reduction
Three main classes of methods to reduce power [M. Horowitz]
• Cheat
  o Lower the performance of the design
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results
• Reformulate Problem
  o Reduce work to be done
Power Reduction in NoCs
Three main classes of methods to reduce power [M. Horowitz]
• Cheating
  o Lower the performance of the design = Increase latency
    – Dynamic Voltage and Frequency Scaling of links [Shang03, Sing04]
    – Dynamic Link width reduction [Alonso04]
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results = Clock gating
    – Leakage reduction in buffers [Chen03]
• Problem Reformulation
  o Reduce Work
  o Our Approach: Power-Gate Segments; Decrease Bandwidth
Key Insight
Assumptions:
• Physical k-ary n-cube torus network
• Turning off links & ports saves power
• 8-ary 2-cube Torus
• 64 Routers
• Each Router has 5 ports (4 Network + 1 Local)
• No. of Bi-directional Links = 128
• Average Distance = 4
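The figures on this slide follow from the standard k-ary n-cube formulas. The sketch below is illustrative only: the function name `torus_parameters` is made up here, it assumes k ≥ 3 (so that each ring has k distinct links), and it averages distance over all node pairs including a node paired with itself, which is the convention that reproduces the slide's value of 4.

```python
def torus_parameters(k: int, n: int) -> dict:
    """Closed-form size figures for a k-ary n-cube torus (assumes k >= 3)."""
    nodes = k ** n
    links = n * k ** n            # bi-directional links
    rings = n * k ** (n - 1)      # each ring holds k links
    # Mean shortest distance around one k-node ring, over all k offsets (0 included).
    per_dimension = sum(min(d, k - d) for d in range(k)) / k
    return {
        "nodes": nodes,
        "links": links,
        "rings": rings,
        "ports_per_router": 2 * n + 1,   # 2 network ports per dimension + 1 local
        "avg_distance": n * per_dimension,
    }


if __name__ == "__main__":
    print(torus_parameters(8, 2))
```

For the 8-ary 2-cube this gives 64 routers, 128 links, 5 ports per router and an average distance of 4.0.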
Key Insight: Illustration
4-ary 2-cube Torus
[Figure, animated over frames with Y-Links Removed = 1 through 7: Average Distance vs. Turned Off Links]
Let’s Exploit This!
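The trend in the plot can be reproduced approximately with a breadth-first search over a 2D torus in which some Y rings are switched off. This is a minimal sketch, not the authors' tool: the names `average_distance` and `sweep` are hypothetical, rings are removed from the highest-numbered column downward (an arbitrary order chosen here), and distance is averaged over all ordered node pairs, self-pairs included.

```python
from collections import deque


def average_distance(k, y_ring_columns):
    """Mean hop count in a k x k torus whose Y rings exist only in the given columns."""
    def neighbors(x, y):
        # X rings (rows) are always on.
        yield ((x + 1) % k, y)
        yield ((x - 1) % k, y)
        # A Y ring is on only in the listed columns.
        if x in y_ring_columns:
            yield (x, (y + 1) % k)
            yield (x, (y - 1) % k)

    total = 0
    for sx in range(k):
        for sy in range(k):
            dist = {(sx, sy): 0}
            queue = deque([(sx, sy)])
            while queue:
                node = queue.popleft()
                for nxt in neighbors(*node):
                    if nxt not in dist:
                        dist[nxt] = dist[node] + 1
                        queue.append(nxt)
            if len(dist) != k * k:
                raise ValueError("network is disconnected")
            total += sum(dist.values())
    return total / (k * k) ** 2


def sweep(k):
    """(rings removed, average distance) as Y rings are switched off one at a time."""
    results = []
    for removed in range(k):          # always keep at least one Y ring
        kept = set(range(k - removed))
        results.append((removed, average_distance(k, kept)))
    return results


if __name__ == "__main__":
    for removed, avg in sweep(8):
        print(removed, round(avg, 3))
```

With all rings on, the 8 x 8 case returns 4.0, matching the Key Insight slide; the remaining points depend on which rings are removed.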
Power-Bandwidth Tradeoff
Power-Bandwidth tradeoff:
• Increase in off links/ports is linear
• Increase in average distance is non-linear
• Power-Bandwidth Tradeoff works across different sizes
[Figure labels: Latency Reduction; Bandwidth Increase]
Cubic Ring Network
What?
• A network topology and routing function …
Why?
• … designed to operate at multiple power-performance points to meet the bandwidth demands of the application …
How?
• … by dynamically reconfiguring the logical topology.
Agenda
• The Topology
• The Routing
• The Results
Topology: Informal Description
A Cubic Ring topology (logical) is obtained from a torus topology (physical) by removing a subset of rings …
• in all but one dimension
• in a hierarchical fashion, where each level is connected to the next higher level (if one exists) through at least one node
Example: Consider the 4x4x4 torus … morphed into one possible cRing configuration
Topology: Formal Description
k-ary n-cube Torus: n-dimensional, radix-k
• k^n nodes connected via n·k^n bi-directional links organized as n·k^{n-1} rings
k-ary n-cube R-ring Cubic Ring
• R = {r_{n-1}, r_{n-2}, … , r_1, r_0}, each r_i is a k-bit string corresponding to a specific set of one or more torus rings in the i-th dimension
• The value of each bit of r_i indicates the presence (if ‘1’) or absence (if ‘0’) of the corresponding set of rings in the cRing
• r_0[l] = 1, for all 0 ≤ l ≤ k-1 … all rings in one dimension are connected
• r_i[l] = 1, for all 0 ≤ i ≤ n-1, for at least one value of l … each level of the hierarchy is connected to the next higher level through at least one node
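The two constraints on R can be checked mechanically. The sketch below is an illustration under stated assumptions: `is_valid_cring` is a name invented here, R is passed as a list of strings in the slide's order (r_{n-1} first, r_0 last), and only the two stated conditions are tested; the mapping from bits to specific torus rings is not modeled.

```python
def is_valid_cring(R, k):
    """Check the two cRing conditions on R = [r_{n-1}, ..., r_1, r_0]."""
    if not R:
        return False
    # Every r_i must be a k-bit string of '0'/'1'.
    for r in R:
        if len(r) != k or set(r) - {"0", "1"}:
            return False
    # r_0[l] = 1 for all l: every ring of the lowest dimension is present.
    if R[-1] != "1" * k:
        return False
    # Each r_i has at least one '1': every level reaches the next one.
    return all("1" in r for r in R)


if __name__ == "__main__":
    print(is_valid_cring(["0001", "1111"], 4))          # 4-ary 2-cube example
    print(is_valid_cring(["0001", "0101", "1111"], 4))  # 4-ary 3-cube example
    print(is_valid_cring(["0000", "1111"], 4))          # no ring in dimension 1
```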
Topology: Formal Description
Examples:
• 4-ary 2-cube R-cRing with R = {0001, 1111}
Topology: Formal Description
Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {0101, 1111}
Topology: Formal Description
Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {1010, 1111}
• 4-ary 3-cube R-cRing with R = {0001, 0101, 1111}
Agenda
• The Topology
• The Routing
• The Results
Routing Function
Routing derives naturally from the topology
• Messages travel up the hierarchy and then down the hierarchy
• cRing routing is NOT up/down routing
  o Rings at each level, not nodes
Example:
• Source = {0,1,1}
• Destination = {2,3,2}
• Route: s → u1 → u2 → u3 → u4 → u5 → u6 → d
  o Source local ring hop: s → u1
  o Source intermediate ring hop: u1 → u2
  o Global ring hop: u2 → u3 → u4
  o Destination intermediate ring hop: u4 → u5
  o Destination local ring hop: u5 → u6 → d
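The up-then-down pattern can be illustrated on the simpler two-level (2D) case. This is a sketch under assumptions, not the routing function from the talk: `route_2d` and its helpers are hypothetical names, X rings are taken as the local rings and Y rings as the global rings, and the rule for picking a global ring (fewest X-ring hops) is a choice made here for illustration.

```python
def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = (b - a) % k
    return min(d, k - d)


def ring_steps(a, b, k):
    """Positions visited from a to b along the shorter arc (a itself excluded)."""
    forward = (b - a) % k
    step = 1 if forward <= k - forward else -1
    hops = min(forward, k - forward)
    return [(a + step * i) % k for i in range(1, hops + 1)]


def route_2d(src, dst, y_ring_columns, k):
    """Path from src to dst; Y (global) rings exist only in y_ring_columns."""
    (sx, sy), (dx, dy) = src, dst
    path = [src]
    if sy == dy:
        # Source and destination share a local (X) ring.
        return path + [(x, sy) for x in ring_steps(sx, dx, k)]
    # Assumed selection rule: the global ring that costs the fewest X hops.
    col = min(sorted(y_ring_columns),
              key=lambda c: ring_distance(sx, c, k) + ring_distance(c, dx, k))
    path += [(x, sy) for x in ring_steps(sx, col, k)]   # up: source local ring
    path += [(col, y) for y in ring_steps(sy, dy, k)]   # global ring
    path += [(x, dy) for x in ring_steps(col, dx, k)]   # down: destination local ring
    return path


if __name__ == "__main__":
    print(route_2d((1, 1), (2, 3), {0}, 4))
```

In a 4-ary 2-cube with a single global ring in column 0, the route from (1,1) to (2,3) first moves along its row to column 0, then along the global ring to row 3, then along that row to the destination.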
Deadlock-Freedom
Proof of Deadlock-Freedom
• No cycles within rings
  o Guaranteed by Bubble Flow Control (BFC) [Carrion’97]
  o A message can be injected into a ring iff after injection there is at least one empty message buffer (“bubble”) in the ring in the direction of the message
  o Applies to newly injected and turning messages
• No deadlocks in the up segment (VC0)
  o Dimensions are travelled in the increasing order
  o For example: x+/x- → y+/y- → z+/z-
• No deadlocks in the down segment (VC1)
  o Dimensions are travelled in the decreasing order (e-cube routing)
  o For example: z+/z- → y+/y- → x+/x-
• No deadlocks in the network
  o VC0 → VC1 and once in VC1 messages sink at their destination
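The bubble condition and the dimension-order rule can each be written as a one-line check. The sketch below is illustrative only: both function names are hypothetical, buffer space is counted in whole message-sized buffers, and dimensions are numbered x = 0, y = 1, z = 2.

```python
def bfc_can_enter(free_buffers_in_ring_direction: int) -> bool:
    """Bubble condition: after the message takes one buffer, one must stay empty."""
    return free_buffers_in_ring_direction - 1 >= 1


def follows_dimension_order(up_dims, down_dims):
    """Up segment (VC0) visits dimensions in increasing order,
    down segment (VC1) in decreasing order."""
    up_ok = all(a < b for a, b in zip(up_dims, up_dims[1:]))
    down_ok = all(a > b for a, b in zip(down_dims, down_dims[1:]))
    return up_ok and down_ok


if __name__ == "__main__":
    print(bfc_can_enter(2))                            # one bubble remains
    print(bfc_can_enter(1))                            # would fill the ring
    print(follows_dimension_order([0, 1, 2], [1, 0]))  # x, y, z then y, x
```

The same check applies whether the message is newly injected or turning from another ring, since both cases enter a ring.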
Agenda
• The Topology
• The Routing
• The Results
Power
Router Power Estimate
• Router Verilog, synthesized using Synopsys Design Compiler for TSMC 90nm
• Two virtual channels, 64-bit flit size, 8-flit input buffers
• Fixed configurations: upper limit of the power-savings
Performance Evaluation
Evaluated Networks
  o Flit-level simulation using detailed network simulator, SICOSYS
  o 4-stage Bubble Adaptive Router, 8-flit input buffers, 2-flit packets
  o 4/8-ary 2-cube Torus Network with Bubble Adaptive routing
  o 4/8-ary 2-cube R-cRing w/ R = {0001, 1111}, {0101, 1111} and {0111, 1111}
  o 4/8-ary 2-cube Torus Network with West-Last East-Last routing
On/Off Links with West-Last East-Last Routing [Soteriou’04]
• Alternate Row-Column on/off links
• Each router must have one out-going link in each dimension connected
• West-Last East-Last routing
Performance Evaluation: 64 nodes

                     Torus   WLEL   cRing, ry=     cRing, ry=      cRing, ry=
                                    {00000001}     {000100001}     {01001001}
# On Links           128     64     72             80              88
# 3-port Routers     0       0      56             48              40
# 5-port Routers     64      64     8              16              24
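The cRing columns of this table depend only on how many Y rings are on. The sketch below recomputes them under stated assumptions: `cring_2d_resources` is a hypothetical name, all X rings are on, a router on an active Y ring keeps all 5 ports, and every other router drops to 3 ports (2 network + 1 local). The two-ring string used in the example is an assumed 8-bit value with two 1 bits.

```python
def cring_2d_resources(k: int, r_y: str) -> dict:
    """Link and router counts for a k-ary 2-cube cRing with Y-ring bitstring r_y."""
    y_rings = r_y.count("1")
    on_links = k * k + y_rings * k       # all X-ring links + links of active Y rings
    five_port = y_rings * k              # routers sitting on an active Y ring
    three_port = k * k - five_port       # routers with both Y ports off
    off_fraction = 1 - on_links / (2 * k * k)
    return {
        "on_links": on_links,
        "three_port_routers": three_port,
        "five_port_routers": five_port,
        "off_link_fraction": off_fraction,
    }


if __name__ == "__main__":
    for r_y in ("00000001", "00010001", "01001001", "11111111"):
        print(r_y, cring_2d_resources(8, r_y))
```

For one, two and three active Y rings in the 8-ary case this gives 72, 80 and 88 on links, in line with the table; with four active rings in a 16-ary case it gives 37.5% of links off.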
Power-Bandwidth Tradeoff Independent of Size …
• With 4 Global Rings, < 5% increase in latency
For a 16x16 Torus
• About 37% off links
• Less than 3% increase in average distance
For Fault Tolerance
Failed Resource:
• Network link (fully or partially)
• Router port: input buffer, VC control, routing unit, output VC control, link control
Resolution:
• Disable the ring involving the failed resource
• Assign the disabled ring to highest dimension
Fault Coverage
• Network remains fully-connected as long as all links in at least one dimension are working
• When no dimension has all links connected, disconnect a ring (in lowest dimension). Does not affect routing elsewhere
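The resolution steps can be sketched for the 2D case. This is an illustration under assumptions rather than the scheme from the talk: `reconfigure_for_faults` is a hypothetical name, a fault is given as a (dimension, ring index) pair, and the final case on the slide (no dimension fully working) is not modeled and simply returns None.

```python
def reconfigure_for_faults(k, failed_links):
    """Pick a cRing configuration that avoids every ring containing a fault.

    failed_links: iterable of (dim, ring) pairs, with dim 0 = X and dim 1 = Y.
    Returns None when neither dimension has all of its rings fault-free.
    """
    failed = {0: set(), 1: set()}
    for dim, ring in failed_links:
        failed[dim].add(ring)
    for full_dim in (0, 1):
        partial_dim = 1 - full_dim
        if failed[full_dim]:
            continue                      # this dimension cannot keep all rings on
        rings_on = set(range(k)) - failed[partial_dim]
        if rings_on:
            # Faulty rings end up in the partially populated (higher) dimension.
            return {
                "full_dimension": full_dim,
                "partial_dimension": partial_dim,
                "rings_on": rings_on,
            }
    return None


if __name__ == "__main__":
    print(reconfigure_for_faults(4, [(1, 2)]))          # one faulty Y ring
    print(reconfigure_for_faults(4, [(0, 1)]))          # one faulty X ring
    print(reconfigure_for_faults(4, [(0, 1), (1, 2)]))  # faults in both dimensions
```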
Conclusion & Continuing Work
Conclusion
• Dynamically reconfigurable NoCs offer two key advantages:
  o Power-Bandwidth tradeoff, rather than Power-Latency
  o Same mechanism used for power-reduction & fault tolerance
• Cubic Ring Networks are an example of how efficient dynamically reconfigurable NoCs can be implemented
• Must select topology, routing and flow control with an eye toward deadlock-free dynamic reconfiguration
Continuing Work
  o Dynamic Reconfiguration Scheme
  o When to trigger reconfiguration?
  o Quantify fault tolerance capability of cRings
Questions?