A 1.1 GOPS/mW FPGA Chip with Hierarchical ... - Semantic Scholarhttps://www.researchgate.net/.../A-11-GOPS-mW-FPGA-chip-with-hierarchical-interc...

A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric Cheng C. Wang, Fang-Li Yuan, Henry Chen, Dejan Marković Electrical Engineering Department, University of California, Los Angeles, CA Abstract A 2048 look-up-table FPGA with a radix-2 hierarchical interconnect network is realized in 3.94mm2 in 65-nm CMOS. It has an interconnect-to-logic area ratio of 1:1, which is a 3–4x reduction from modern FPGAs while allowing up to 100% resource utilization. As a proof of concept, it is designed with standard cells, achieving 16.4 GOPS/mm2 at 370MHz. Peak energy efficiency of 1.1 GOPS/mW is measured at 0.5V. Introduction Field-programmable gate arrays (FPGAs) are effective for rapid verification and prototyping of VLSI designs. They are also used in products that require periodic hardware changes and short time to market. However, FPGAs incur penalties in area (17–54x), speed (2.5–6.7x), and power (5.7–62x) over standard-cell ASICs [1], hindering their expansion into ASIC markets. The overhead is primarily due to interconnects, which account for over 75% of area and delay. For over 20 years, FPGAs have used 2D-mesh interconnects, where look-up tables (LUTs) are placed in configurable logic blocks (CLBs), and arrays of switch boxes are placed at interconnect crossings (Fig. 1). Since a full array requires too much area, various heuristics are used to simplify switch-box arrays at the cost of resource utilization. Yet 80% of the 1.1B transistors on Virtex-5 are used for interconnects [2]. This paper demonstrates an FPGA with hierarchical interconnects where interconnect area is 51%, a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs. The chip is tested up to 400MHz. Hierarchical Interconnect Architecture The key issue with 2D-mesh is scalability; the number of switch boxes grows as O(N2) with the number of LUTs. Using Rent’s rule, interconnect complexity is still O(N1.75) for random logic, requiring FPGA size to scale much faster than Moore’s Law. In the proposed hierarchical interconnect, a folded Beneš network is employed to reduce the complexity to O(N·logN) [3]: 4 LUTs are connected via 2 stages of switch matrices (SMs), and another 4 LUTs are connected with a 3rd SM stage (Fig. 2a). Each SM has 4 unidirectional connections per direction. Although this architecture reduces interconnect complexity, each SM stage doubles the routing congestion. This O(N) congestion makes physical design difficult. To alleviate congestion, routing is alternated between x-y directions to reduce congestion to O(N0.5) (Fig. 2b). At every hierarchy, the LUTs near the center are interconnected to create shorter routes, and the edge routes are longer. This gives routing tools options for faster paths on timing-critical routes. The test chip has 2048 4-input LUTs: 1024 LUTs form 256 Logic CLBs, 896 LUTs form 224 DSP CLBs, and 128 LUTs form 16 Block RAMs (BRAMs) of 1kb each. In practice, the majority of the logic connections are local, requiring fewer connections on upper hierarchies. Therefore full connectivity is preserved up to 6 SM stages (Fig. 3a), then half-connectivity SMs are used to reduce the complexity of upper hierarchies. This partitions the interconnect into 3 sub-networks: N8:2, N6:2, and N6:1. The chip is divided into 16 macros (Fig. 3b). Macros N8:2 are centered for shorter top-level routing, branching into N6:2 and N6:1. Each of the macros contains 32 CLBs—a combination of Logic, DSP, and BRAM (Fig. 3c). Circuit Implementation The CLBs include four 4-input LUTs with selectable asynchronous/synchronous output stages (Fig. 4a). Each LUT

is configurable as one 4-input LUT or two 3-input LUTs with up to 4 unique inputs. A Logic CLB includes a carry chain to support 4b additions where Propagate and Generate are driven from LUTs. The Logic CLB is especially useful when two outputs per bit are required, such as in 3:2 compressors. The DSP CLB (Fig. 4b) has a LUT combiner to support 5/6-input LUTs, and a carry chain that is configurable as one 8b or two 4b adders. The adder cells are shared with a 4b×4b Wallace-tree multiplier. Based on the configuration, the appropriate outputs are sent to the output stage. Due to the level of configurability, the synthesized CLB has 50 logic gates on its critical path (shaded), amounting to a 1.1ns delay. Configuration bits are required to control CLBs and SMs, but traditional SRAM arrays are not suitable because all bits cannot be accessed simultaneously. A scan chain is adopted in [4] to control 6 CLBs, but it is not scalable to larger designs. Therefore an SRAM-based bit cell (BC) is designed where the output of each BC is directly routed to the configuration inputs of CLBs and SMs (Fig. 5a). The BC area is 5x smaller than a DFF-based scan cell. The bit-line (BL) and word-line (WL) controls are implemented as scan chains to write one row of BCs at a time. The BC arrays are local to each CLB, so only the BL and WL controls are propagated to top level. Overall, the memory area is reduced (Fig. 5b), and total interconnect area is 51%, a 3–4x reduction over 2D-mesh [5] for a fixed logic area. Automated Mapper An automated mapper is developed to map RTL onto this FPGA. A standard-cell library of LUT functions is created to enable logic synthesis using commercial tools. The LUT netlist is imported into an automated, custom place-and-route tool that generates the bitstream for FPGA programming. This tool is also used during architecture design to evaluate interconnect connectivities by mapping Toronto20 benchmarks. Measurement Results Our chip achieves 16.4 GOPS/mm2 when all Logic and DSP CLBs are utilized, executing 175 16b accumulators at 370MHz. Since a 16b adder uses 2 DSP CLBs or 4 Logic CLBs, the DSP adders are faster, reaching 400MHz. Performance is hindered by equipment limitations due to a 0.25ns input-clock jitter at 400MHz. The energy-delay curve and the power breakdowns for minimum delay and minimum energy are shown in Fig. 6. In comparison, [4] has no interconnects, the full-custom CLB in 32-nm LVT is 2.5x faster, but achieves 2.6 GOPS/mW at 0.34V for 8b operations, which is 0.65 GOPS/mW for 16b (2 CLBs per operation at half the speed). With interconnects, our 65-nm chip reaches 1.1 GOPS/mW at 0.5V. Leakage is well-controlled even without power gating. A 1.08 GOPS/mW is attainable with only 112 DSP accumulators active and most of the Logic CLBs idle (Table I). The FIR filter achieves 274MHz due to longer routing, but interconnect delay is still under 50%. The 2×2 MIMO FFT uses 10 BRAMs to implement various delay lines. With many control signals and a critical path of 11 CLBs, the FFT achieves 83MHz. Figure 7 shows the die photo. The top 3 metal layers (out of 9) are sparsely used, leaving ample room for larger designs. Acknowledgments We thank STMicroelectronics and C. Yang for helpful discussions. References [1] I. Kuon et al., Found. Trends in Elec. Design, 2008. [2] I. Bolsens, MPSOC, 2006. [3] V. Konda, U.S. Patent 2010/0172349. [4] A. Agarwal et al., ISSCC Dig. Tech. Papers, 2010. [5] M. Lin et al., FPGA ’06.

Congestion: O(N)

I/O Connection LUT Box Switch-box LUT

CLB LUT LUT LUT LUT Bi‐directional Switch Box

SM

Congestion: O(√N) LUT

LUT

LUT

LUT

Array

LUT

LUT LUT LUT LUT LUT CLB To 4 LUTs a) b) Figure 1: 2D-mesh interconnect. Figure 2: a) Hierarchical routing of 8 LUTs (4 shown) using SMs, b) alternated x-y routing. LUT LUT

N6:2

LUT

: Full SM : Half SM

N6:1 1

6

7

8

9

(SM Stage)

10 DSP Logic

N8:2

b)

a)

N6:2

BRAM

DSP

BRAM

Logic DSP

c)

N6:1

(from Hier. Network)

Output Stage

Cin

a)

5/6 input LUT Combiner (DSP CLB only) CO1

LUT

inD

Logic

B[3] P23 D[3]

A[2] P12

B[2] D[2]

LUT

A[1] P20

B[1]

C[1] P11

4

b)

D[1] P30

P02

LUT

A[0] P10 0/1 CI1 C[0]

Critical LUT Path:

+ ×7+

B[0] D[0]

C3 CI2 0/1

WL

BL+ BC Output

SM & CLB

Bit Cell Area: 1.4μm × 1.8μm

Interconnect Area Interconnects + Routing 43%

Intercon. + Mem Routing 36% 15% 13%

Memory 35%

3–4× reduction in interconnect area

Mem

2.8 370MHz 2.6 1.0V Clock 2.4 28% Active 2.2 46% Leak. 2 26% 1.8 300MHz 0.88V 1.6 Power Ratio 200MHz 1.4 @ Fmax 0.74V 1.2 100MHz 1 0.59V 0.80 5 10

Clock 18% Leak. 44%

Active 38%

Power Ratio @ Emin

55MHz 0.5V 15

Delay (ns)

Figure 6: Energy-delay curve of the mapped 175 16b accumulator with power breakdown at Fmax and Emin (insets).

LUT5_CD C2 C6 S2 LUT4_C S6 LUT3_C M4

P03

4

inA

M7 C8 M6 S8 C3 C7 S3 LUT4_D S7 LUT3_D M5

P33 A[3] P31

C[2] P21

inB

CO2

P13

LUT

inC

CO0

Mem 8%

BL‐

b) 35% Figure 5: a) Bit cell (BC) configuration circuitry, b) area comparisons of 2D-mesh vs. this chip for a fixed logic area.

OutD OutC OutB 2 OutA

P32

C[3] P22

4

BC BC … BC

This Work

To Output Configurable Adder / Mult Stage

4

FF Logic 14%

(to Hier. Network)

Cout

3‐/4‐ input LUT

BC BC … BC

2D‐ Mesh

Figure 3: a) Interconnect architecture of 2048 LUTs, floorplan of b) SM network, c) CLB placement. inD inC inB inA 4

FF

To next CLB BC Outputs

Logic Area

N6:1 N6:2

… FF

WL Control WSE WIN WEV BC BC … BC FF

N8:2

LUT

a)

BSE BL Control BIN FF FF

NP:Q: P stages of full SM + Q stages of half SM

N6:1

LUT6 C1 C5 S1 LUT4_B S5 LUT3_B M3 LUT4_A LUT3_A LUT5_AB C0 C4 S0M1 M0 S4M2

×8

Figure 4: a) CLB block diagram and b) DSP CLB schematic.

Result Design 175 Logic+DSP 16b Accum. 112 DSP 16b Accum. 32‐tap 16b FIR Filter 2×2 MIMO 64‐point FFT

TABLE I: MEASUREMENT RESULTS. Resource Utilization Performance Logic DSP BRAM Power VDD Freq. (256) (224) (16) (mW) (V) (MHz) 179 1.0 370 256 224 0 8.6 0.50 55 123 1.0 400 4 224 0 6.2 0.51 60 120 1.0 274 132 209 0 10.2 0.56 50 82.7 1.0 83 196 93 10 26.5 0.78 40

GOPS /mW 0.36 1.13 0.57 1.08 0.21 0.45 0.05 0.07

3.06mm Technology 65nm 1P9M CMOS Core VDD 0.34 to 1.0V Frequency 40 to 400MHz I/Os 75 bidirectional Core Size 2.52mm × 1.56mm Gate Count 2.73M CLB Count 256 Logic, 224 DSP Block RAMs 16 128×8b Config. Bits 297,472b

Figure 7: Die micrograph and chip summary.

A 1.1 GOPS/mW FPGA Chip with Hierarchical ... - Semantic Scholarhttps://www.researchgate.net/.../A-11-GOPS-mW-FPGA-chip-with-hierarchical-interc...

A 1.1 GOPS/mW FPGA Chip with Hierarchical ... - Semantic Scholarhttps://www.researchgate.net/.../A-11-GOPS-mW-FPGA-chip-with-hierarchical-interc...

Suggest Documents

Real-time Stereo Vision FPGA Chip with Low Error Rate

A Cohesive FPGA-Based System-on-Chip Design Curriculum

A Single-Chip FPGA Implementation of Real-time Adaptive ... - Core

FPGA Emulation and Prototyping of a CyberPhysical-System-On-Chip ...

Two Approaches for a Single-Chip FPGA Implementation ... - CiteSeerX

A System-On-Chip FPGA Implementation of Embedded ... - CiteSeerX

A System-On-Chip FPGA Implementation of Embedded ... - CiteSeerX

Hybrid breadth-first search on a single-chip FPGA-CPU ...

A non-volatile Flip-Flop in Magnetic FPGA chip

A System-On-Chip FPGA Implementation of Embedded ... - CiteSeerX

FPGA-Array with Bandwidth-Reduction ... - Semantic Scholar

HiWA: A Hierarchical Wireless Network-on-Chip ... - IEEE Xplore

A Hierarchical Test Scheme for System-on-Chip Designs - CECS

An FPGA Based SIMD Processor With A Vector ... - Semantic Scholar

A Study of FPGA-based System-on-Chip Designs ... - Semantic Scholar

BAG2 Interferes with CHIP-Mediated ... - Semantic Scholar

Prototyping with a bio-inspired recon .gurable chip ... - Semantic Scholar

Hierarchical Agglomerative Clustering with ... - Semantic Scholar

Agglomerative Hierarchical Clustering with ... - Semantic Scholar

Hierarchical Design Rewriting with Maude - Semantic Scholar

hierarchical disparity estimation with ... - Semantic Scholar

Semantic Image Classification with Hierarchical Feature Subset

Exploring OLAP Aggregates with Hierarchical ... - Semantic Scholar

mining actor correlations with hierarchical ... - Semantic Scholar