Cubic Ring Networks: A Polymorphic Topology for ... - Semantic Scholar

Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip
Bilal Zafar, Jeff Draper and Timothy Pinkston
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA
ICPP '10: 39th International Conference on Parallel Processing, San Diego, CA
Sep. 16, 2010

CMPs & Network-on-Chip: Then & Now
▪ In the beginning … (2001)
  • IBM released its first commercial dual-core CPU (Power4)
  • Bill Dally presents “Route Packets Not Wires” at DAC
▪ And, now …
  • Multi-core CMPs with 6-12 cores commercially available
  • Many-core CMP research chips
    o Intel Teraflops Research Chip (80 cores; 2D Mesh)
    o MIT RAW Chip (16 cores; 2D Mesh)
    o UT Austin TRIPS (2x25 tiles; 2D Mesh)

9/18/2010

Network-on-Chip Challenges
▪ Power
  • Too much power in the network: 20-36% [Kundu, OCIN Workshop 2006]
  • Leakage power is high, especially in buffers
▪ Performance
  • Bandwidth is not a problem; sufficient wire bandwidth
  • Latency per hop is high
▪ Resilience
  • Fixed routing creates single points of failure
  • Expensive to exploit path diversity (blame deadlock avoidance)

Power Reduction
▪ Three main classes of methods to reduce power [M. Horowitz]
  • Cheat
    o Lower the performance of the design
  • Reduce Waste
    o Stop wasting energy on stuff that doesn’t produce results
  • Reformulate Problem
    o Reduce work to be done


Power Reduction in NoCs
▪ Three main classes of methods to reduce power [M. Horowitz]
  • Cheating
    o Lower the performance of the design = Increase latency
      - Dynamic Voltage and Frequency Scaling of links [Shang03, Sing04]
      - Dynamic link width reduction [Alonso04]
  • Reduce Waste
    o Stop wasting energy on stuff that doesn’t produce results = Clock gating
      - Leakage reduction in buffers [Chen03]
  • Problem Reformulation
    o Reduce Work
▪ Our Approach (slide callouts): Decrease Bandwidth; Power-Gate Segments

Key Insight
▪ Assumptions:
  • Physical k-ary n-cube torus network
  • Turning off links & ports saves power
  • 8-ary 2-cube Torus
    o 64 Routers
    o Each Router has 5 ports (4 Network + 1 Local)
    o No. of Bi-directional Links = 128
    o Average Distance = 4
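The figures quoted for the 8-ary 2-cube torus can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not taken from the slides: the function names `ring_distance` and `torus_stats` are hypothetical, and the average is assumed to include the zero-length distance from a node to itself, which is the convention that yields exactly 4.

```python
from itertools import product


def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = abs(a - b)
    return min(d, k - d)


def torus_stats(k, n):
    """Node, link, and ring counts plus average distance of a k-ary n-cube torus."""
    nodes = k ** n
    links = n * k ** n        # bi-directional links (assumes k > 2)
    rings = n * k ** (n - 1)
    # The torus is node-symmetric, so measuring from node 0 is enough.
    # Assumption: the average includes the source-to-itself distance of 0.
    total = sum(
        sum(ring_distance(0, c, k) for c in coords)
        for coords in product(range(k), repeat=n)
    )
    return {
        "nodes": nodes,
        "links": links,
        "rings": rings,
        "avg_distance": total / nodes,
    }


stats = torus_stats(8, 2)
print(stats)
```

For `k = 8`, `n = 2` this gives 64 nodes, 128 links and an average distance of 4.0, matching the slide.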

Key Insight: Illustration
[Figure: 4-ary 2-cube Torus alongside a plot of Average Distance vs. Turned Off Links, stepped through Y-Links Removed = 1 to 7]
Let’s Exploit This!

Power-Bandwidth Tradeoff
▪ Power-Bandwidth tradeoff:
  • Increase in off links/ports is linear
  • Increase in average distance is non-linear
  • Power-Bandwidth tradeoff works across different sizes
[Figure: plot annotated with “Latency Reduction” and “Bandwidth Increase”]
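The linear versus non-linear behaviour can be checked on a small model. The sketch below is an illustration under stated assumptions, not the authors' evaluation: it models an 8x8 torus in which all X rings stay on and whole Y rings are switched off one at a time, uses shortest-path distance (breadth-first search) rather than the cRing routing function, and picks a hypothetical removal order. The name `cring_avg_distance` is invented for this example.

```python
from collections import deque


def cring_avg_distance(k, y_rings):
    """Average shortest-path hop count over all ordered node pairs (self-pairs included).

    Assumed model: a k x k torus in which every X ring (row) stays on and the
    Y ring in column x is on only if x is in y_rings.
    """
    if not y_rings:
        raise ValueError("at least one Y ring must stay on to keep the rows connected")

    def neighbors(x, y):
        yield (x + 1) % k, y
        yield (x - 1) % k, y
        if x in y_rings:
            yield x, (y + 1) % k
            yield x, (y - 1) % k

    total = 0
    for source in ((x, y) for x in range(k) for y in range(k)):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search
            node = queue.popleft()
            for nxt in neighbors(*node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    return total / (k * k) ** 2


K = 8
removal_order = [1, 5, 3, 7, 2, 6, 4]  # hypothetical order; column 0 is never removed
on = set(range(K))
results = [(K * K + K * len(on), cring_avg_distance(K, on))]
for column in removal_order:
    on.discard(column)
    results.append((K * K + K * len(on), cring_avg_distance(K, on)))

for links_on, avg in results:
    print(f"{links_on:3d} links on -> average distance {avg:.3f}")
```

Under these assumptions the number of on links falls by a fixed 8 per removed ring, while the average distance rises from 4.0 with all rings on to 5.75 with a single Y ring left, with the largest jump on the final removal.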

Cubic Ring Network
▪ What?
  • A network topology and routing function …
▪ Why?
  • … designed to operate at multiple power-performance points to meet the bandwidth demands of the application …
▪ How?
  • … by dynamically reconfiguring the logical topology.

Agenda
▪ The Topology
▪ The Routing
▪ The Results



Topology: Informal Description
▪ A Cubic Ring topology (logical) is obtained from a torus topology (physical) by removing a subset of rings …
  • in all but one dimension
  • in a hierarchical fashion, where each level is connected to the next higher level (if one exists) through at least one node
▪ Example: Consider the 4x4x4 torus … morphed into one possible cRing configuration

Topology: Formal Description
▪ k-ary n-cube Torus: n-dimensional, radix-k
  • $k^n$ nodes connected via $n k^n$ bi-directional links organized as $n k^{n-1}$ rings
▪ k-ary n-cube R-ring Cubic Ring
  • $R = \{r_{n-1}, r_{n-2}, \ldots, r_1, r_0\}$, where each $r_i$ is a k-bit string corresponding to a specific set of one or more torus rings in the i-th dimension
  • The value of each bit of $r_i$ indicates the presence (if ‘1’) or absence (if ‘0’) of the corresponding set of rings in the cRing
  • $r_0[l] = 1$ for all $0 \le l \le k-1$ … all rings in one dimension are connected
  • $r_i[l] = 1$, for all $0 \le i \le n-1$, for at least one value of $l$ … each level of the hierarchy is connected to the next higher level through at least one node
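The two constraints on R translate directly into a validity check on the bit strings. This is a minimal sketch under assumptions: the function name `is_valid_cring` is hypothetical, and R is passed as a list of strings ordered as on the slide, from the highest dimension down to r0 in the last position.

```python
def is_valid_cring(R, k):
    """Check the two constraints on R = [r_(n-1), ..., r_1, r_0].

    Each element must be a k-character string of '0' and '1'.
    Assumption: the last element of the list is r_0.
    """
    if not R:
        return False
    for r in R:
        if len(r) != k or set(r) - {"0", "1"}:
            return False
    # Constraint 1: every bit of r_0 is 1 (all rings in one dimension are present).
    all_lowest_rings_on = all(bit == "1" for bit in R[-1])
    # Constraint 2: every r_i has at least one bit set, so each level of the
    # hierarchy reaches the next higher level.
    every_level_connected = all("1" in r for r in R)
    return all_lowest_rings_on and every_level_connected


print(is_valid_cring(["0001", "1111"], 4))          # 4-ary 2-cube example
print(is_valid_cring(["0001", "0101", "1111"], 4))  # 4-ary 3-cube example
print(is_valid_cring(["0000", "1111"], 4))          # no ring left in the upper level
```

The check only looks at the bit strings. Which physical rings each bit selects in three or more dimensions is not modelled here.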

Topology: Formal Description
▪ Examples:
  • 4-ary 2-cube R-cRing with R = {0001, 1111}
  • 4-ary 2-cube R-cRing with R = {1000, 1111}
  • 4-ary 2-cube R-cRing with R = {0101, 1111}
  • 4-ary 2-cube R-cRing with R = {1010, 1111}
  • 4-ary 3-cube R-cRing with R = {0001, 0101, 1111}


Routing Function
▪ Routing derives naturally from the topology
  • Messages travel up the hierarchy and then down the hierarchy
  • cRing routing is NOT up/down routing
    o Rings at each level, not nodes
▪ Example:
  • Source = {0,1,1}
  • Destination = {2,3,2}
  • Route: s → u1 → u2 → u3 → u4 → u5 → u6 → d
    o Source local ring hop: s → u1
    o Source intermediate ring hop: u1 → u2
    o Global ring hop: u2 → u3 → u4
    o Destination intermediate ring hop: u4 → u5
    o Destination local ring hop: u5 → u6 → d
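The up-then-down pattern is easiest to see in two dimensions, where the hierarchy has only local (X) rings and global (Y) rings. The sketch below is a simplified illustration, not the routing function from the talk: the names `ring_walk` and `route_2d` are hypothetical, virtual-channel assignment is omitted, and the rule for choosing which global ring to use (fewest total hops) is an assumption, since the slides do not state one.

```python
def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = abs(a - b)
    return min(d, k - d)


def ring_walk(a, b, k):
    """Positions visited after leaving a, moving to b the shorter way around a k-node ring."""
    step = 1 if (b - a) % k <= (a - b) % k else -1
    path = []
    while a != b:
        a = (a + step) % k
        path.append(a)
    return path


def route_2d(src, dst, k, y_rings):
    """Node sequence from src to dst in a k x k cRing with Y rings in columns y_rings."""
    (sx, sy), (dx, dy) = src, dst
    if sy == dy:
        # Same local (X) ring: no need to climb the hierarchy.
        return [src] + [(x, sy) for x in ring_walk(sx, dx, k)]
    # Assumed policy: use the global (Y) ring that minimizes total hops.
    col = min(
        sorted(y_rings),
        key=lambda c: ring_distance(sx, c, k) + ring_distance(c, dx, k),
    )
    up = [(x, sy) for x in ring_walk(sx, col, k)]      # source local ring
    across = [(col, y) for y in ring_walk(sy, dy, k)]  # global ring
    down = [(x, dy) for x in ring_walk(col, dx, k)]    # destination local ring
    return [src] + up + across + down


print(route_2d((1, 0), (1, 2), 8, {0}))
```

With a single Y ring in column 0, a message from (1, 0) to (1, 2) first moves along its local ring to column 0, crosses on the global ring, then moves along the destination's local ring, even though source and destination share a column.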

Deadlock-Freedom
▪ Proof of Deadlock-Freedom
  • No cycles within rings
    o Guaranteed by Bubble Flow Control (BFC) [Carrion’97]
    o A message can be injected into a ring iff, after injection, there is at least one empty message buffer (“bubble”) in the ring in the direction of the message
    o Applies to newly injected and turning messages
  • No deadlocks in the up segment (VC0)
    o Dimensions are travelled in increasing order
    o For example: x+/x- → y+/y- → z+/z-
  • No deadlocks in the down segment (VC1)
    o Dimensions are travelled in decreasing order (e-cube routing)
    o For example: z+/z- → y+/y- → x+/x-
  • No deadlocks in the network
    o VC0 → VC1 and once in VC1 messages sink at their destination
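The injection rule and the dimension-order rule can each be written as a one-line predicate. This is a minimal sketch under assumptions: both function names are hypothetical, buffer occupancy is counted in whole-message buffers for one ring direction, and dimensions are numbered x = 0, y = 1, z = 2.

```python
def can_enter_ring(empty_message_buffers):
    """Bubble condition for a newly injected or turning message.

    The message takes one buffer, and at least one more must stay empty
    afterwards, so two empty buffers are needed before entry.
    """
    return empty_message_buffers >= 2


def follows_dimension_order(up_dims, down_dims):
    """True if the up segment (VC0) never steps to a lower dimension and the
    down segment (VC1) never steps to a higher one."""
    return (
        list(up_dims) == sorted(up_dims)
        and list(down_dims) == sorted(down_dims, reverse=True)
    )


print(can_enter_ring(1))                                # False: entry would consume the bubble
print(can_enter_ring(2))                                # True
print(follows_dimension_order([0, 1, 2], [2, 1, 0]))    # True
print(follows_dimension_order([1, 0], [2]))             # False: up segment goes y then x
```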


Power
▪ Router Power Estimate
  • Router Verilog, synthesized using Synopsys Design Compiler for TSMC 90nm
  • Two virtual channels, 64-bit flit size, 8-flit input buffers
  • Fixed configurations: upper limit of the power-savings

Performance Evaluation
▪ Evaluated Networks
    o Flit-level simulation using detailed network simulator, SICOSYS
    o 4-stage Bubble Adaptive Router, 8-flit input buffers, 2-flit packets
    o 4/8-ary 2-cube Torus Network with Bubble Adaptive routing
    o 4/8-ary 2-cube R-cRing w/ R = {0001, 1111}, {0101, 1111} and {0111, 1111}
    o 4/8-ary 2-cube Torus Network with West-Last East-Last routing
▪ On/Off Links with West-Last East-Last Routing [Soteriou’04]
  • Alternate Row-Column on/off links
  • Each router must have one out-going link in each dimension connected
▪ West-Last East-Last routing

Performance Evaluation: 64 nodes

                    Torus   WLEL   cRing, ry = {00000001}   cRing, ry = {000100001}   cRing, ry = {01001001}
# On Links          128     64     72                       80                        88
# 3-port Routers    0       0      56                       48                        40
# 5-port Routers    64      64     8                        16                        24
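The cRing columns of the table follow from counting how many Y rings stay on. The sketch below is an illustration under assumptions: the function name `cring_2d_counts` is hypothetical, and it assumes a k x k network in which all X rings are on, each enabled Y ring contributes k links, and a router keeps all four network ports only if it sits on an enabled Y ring.

```python
def cring_2d_counts(k, y_rings_on):
    """On-link and router-port counts for a k-ary 2-cube cRing.

    Assumptions: all k X rings are on (k links each), y_rings_on of the
    k Y rings are on (k links each), and every router has one local port.
    """
    if not 1 <= y_rings_on <= k:
        raise ValueError("between 1 and k Y rings must be on")
    on_links = k * k + k * y_rings_on
    five_port = k * y_rings_on        # 4 network ports + 1 local port
    three_port = k * k - five_port    # 2 network ports + 1 local port
    return {"on_links": on_links, "three_port": three_port, "five_port": five_port}


for rings in (1, 2, 3, 8):
    print(rings, cring_2d_counts(8, rings))
```

With 1, 2 and 3 Y rings on, this yields 72, 80 and 88 on links, matching the three cRing columns. With all 8 rings on it reduces to the full torus column.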

Power-Bandwidth Tradeoff
▪ Independent of Size …
  • With 4 Global Rings, < 5% increase in latency
▪ For a 16x16 Torus
  • About 37% off links
  • Less than 3% increase in average distance
[Figure: plot annotated with 37% and 3%]

For Fault Tolerance
▪ Failed Resource:
  • Network link (fully or partially)
  • Router port: input buffer, VC control, routing unit, output VC control, link control
▪ Resolution:
  • Disable the ring involving the failed resource
  • Assign the disabled ring to highest dimension
▪ Fault Coverage
  • Network remains fully-connected as long as all links in at least one dimension are working
▪ When no dimension has all links connected, disconnect a ring (in lowest dimension). Does not affect routing elsewhere

Conclusion & Continuing Work
▪ Conclusion
  • Dynamically reconfigurable NoCs offer two key advantages:
    o Power-Bandwidth tradeoff, rather than Power-Latency
    o Same mechanism used for power-reduction & fault tolerance
  • Cubic Ring Networks are an example of how efficient dynamically reconfigurable NoCs can be implemented
  • Must select topology, routing and flow control with an eye toward deadlock-free dynamic reconfiguration
▪ Continuing Work
    o Dynamic Reconfiguration Scheme
    o When to trigger reconfiguration?
    o Quantify fault tolerance capability of cRings

Questions?
