Implementing and Evaluating Stream Applications on the Dynamically Reconfigurable Processor

Noriaki Suzuki, Shunsuke Kurotaki, Masayasu Suzuki, Naoto Kaneko, Yutaka Yamada, Katsuaki Deguchi, Yohei Hasegawa, Hideharu Amano
Dept. of Information and Computer Science, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
[email protected]

Kenichiro Anjo, Masato Motomura
NEC Electronics, 1753 Shimo Numabe, Kakahara, Kawasaki 221-8668, Japan

Kazutoshi Wakabayashi, Takeo Toi, Toru Awashima
NEC System Devices Research Lab., 1753 Shimo Numabe, Kakahara, Kawasaki 221-8668, Japan

Abstract

The Dynamically Reconfigurable Processor (DRP) [1], developed by NEC Electronics, is a coarse-grain reconfigurable processor that selects a data path from an on-chip repository of sixteen circuit configurations, or contexts, to implement different logic on a single DRP chip. Several stream applications have been implemented on DRP-1, the first prototype chip, and evaluation results are presented. By computing in parallel using the Processing Elements (PEs) and distributed memory modules, DRP-1 outperformed the Pentium III/4 and the embedded CPU MIPS64 in some stream application examples. We also present programming techniques applicable to reconfigurable processors and discuss their feasibility in boosting system performance.

1 DRP Overview

DRP is a coarse-grain reconfigurable processor core that can be integrated into ASICs and SoCs. The primitive unit of the DRP Core is called a ‘Tile’, and a DRP Core consists of an arbitrary number of Tiles; the array of Tiles can be expanded horizontally and vertically. The primitive modules of a Tile are processing elements (PEs), a State Transition Controller (STC), 2-ported memories (VMEMs: Vertical MEMories), VMEM Controllers (VMCtrl), and 1-ported memories (HMEMs: Horizontal MEMories). The structure of a Tile is shown in Figure 1. Each Tile contains 8×8 PEs. Each PE has an 8-bit ALU, an 8-bit DMU, an 8-bit×16-word register file, and an 8-bit flip-flop. A DRP Core consisting of several Tiles can change its context every cycle by instruction pointer distribution from the STCs. Each STC can also run independently when the STCs are programmed with different FSMs.

Figure 1. Structure of a Tile (block labels: Mem, VMU, VmCtrl, State Transition Controller)

DRP-1 is the prototype chip, using a DRP Core with 4×2 Tiles. It is fabricated in a 0.15-µm 8-metal-layer CMOS process. It consists of the 8-Tile DRP Core, eight 32-bit multipliers, an external SRAM controller, a PCI interface, and 256-bit I/Os. The maximum operating frequency is 133MHz. An integrated design environment for DRP is provided, which includes a high-level synthesis tool, a design mapper for DRP, simulators, and a layout/viewer tool. Applications can be written in a C-based high-level hardware description language, synthesized, and mapped onto the DRP.
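To make the per-cycle context switching described above more concrete, the following is a minimal toy model in Python. It is only a sketch under stated assumptions: the class `ToySTC`, the function `run`, and the three example "datapaths" are hypothetical names invented for illustration, and the real STC, FSM encoding, and PE datapath are not specified in this paper. Only the constants (sixteen contexts, 8×8 PEs per Tile, 4×2 Tiles in DRP-1, 8-bit data width) come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

NUM_CONTEXTS = 16       # size of the on-chip context repository (from the abstract)
PES_PER_TILE = 8 * 8    # 8x8 PEs per Tile
TILES_DRP1 = 4 * 2      # DRP-1 uses a DRP Core with 4x2 Tiles


@dataclass
class ToySTC:
    """Hypothetical stand-in for a State Transition Controller.

    `transitions` maps the current context index to the next one. The real
    STC runs a programmed FSM whose details are not given in the text.
    """
    transitions: Dict[int, int]
    current: int = 0

    def step(self) -> int:
        """Return the context active in this cycle, then advance the FSM."""
        active = self.current
        self.current = self.transitions[active]
        return active


def run(stc: ToySTC, contexts: List[Callable[[int], int]],
        value: int, cycles: int) -> int:
    """Apply one context (one datapath configuration) per cycle to `value`."""
    assert len(contexts) <= NUM_CONTEXTS
    for _ in range(cycles):
        # Mask to 8 bits because the PEs operate on 8-bit data.
        value = contexts[stc.step()](value) & 0xFF
    return value


# Three illustrative datapaths, visited in the order 0 -> 1 -> 2 -> 0 ...
contexts = [lambda x: x + 3, lambda x: x * 2, lambda x: x ^ 0x0F]
stc = ToySTC(transitions={0: 1, 1: 2, 2: 0})
result = run(stc, contexts, value=1, cycles=3)   # (1 + 3) * 2 = 8, 8 ^ 0x0F = 7

# Total number of PEs on DRP-1 implied by the figures in the text.
total_pes = PES_PER_TILE * TILES_DRP1            # 64 * 8 = 512
```

The point of the sketch is that the same hardware resources are reused by a different configuration in every cycle, so the logic available to an application is bounded by the number of contexts as well as by the number of PEs.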

2 Design Examples

2.1 Direction-pass filter for an intelligent robot

Although DRP-1 itself is a stand-alone reconfigurable device, DRP Tiles can also be used as an IP macro alongside an embedded processor. Here, a small chip combining an embedded processor with one Tile is assumed, and the front-end part of the sound detection system for an intelligent robot has been developed on it. One of the time-critical jobs is the direction-pass filter, which separates sound sources using two microphones placed a certain distance apart. In the direction-pass filter, the IPD (Interaural Phase Difference) is computed by a combination of DFT (Discrete Fourier Transform) and other specific functions [2]. Here, we introduce the FFT (Fast Fourier Transform) implementation on the DRP Tile developed for the filter. Since the other parts of the direction-pass filter are independent, only the FFT is implemented on a DRP Tile.

2.1.1 Performance Comparison

The required clock cycles, operating frequency, and achieved number of FFTs per second for two implementation examples are shown in Table 1 and compared with a high-end embedded processor, MIPS64, and a DSP. MIPS64 is a four-way in-order superscalar architecture that runs at 500MHz. It provides a 32KB L1 instruction cache and a 32KB L1 data cache, together with a 512KB L2 cache. An FFT program identical to the one implemented on the DRP is written in C and compiled by gcc with the -O2 option. TI's VLIW-style DSP C6713 is tested on the TMS320C6713 evaluation board, which operates at 225MHz. It provides a 4KB L1 instruction cache and a 4KB data cache, together with a 256KB L2 cache. The same C code is compiled by Code Composer Studio Version 2.20.05 with the -O2 option.

Table 1. Performance comparison

            DRP (Simple)   DRP (Improved)   MIPS64     DSP
Clocks      26624          11776            248047     83997
Freq.       50MHz          33MHz            500MHz     225MHz
FFT/sec     1878           2802             2015       2678

As shown in Table 1, the improved version (max. 59 PEs) achieves about 1.5 times the throughput of the simple implementation (max. 19 PEs) by making better use of parallel and pipelined execution, and it outperforms the high-end embedded CPU and the DSP while using only the area of a single Tile. Considering the limited resources used and the low operating frequency, these results demonstrate that acceleration using DRP is an efficient approach.

2.2 Other Examples of Stream Processing

The following applications are implemented as other examples of stream applications: Discrete Wavelet Transform (DWT), Inverse Modified Discrete Cosine Transform (IMDCT) for MP3, a Viterbi decoder, an Alpha Blender with anti-aliasing capabilities, and the block cipher RC6.

2.2.1 Performance Comparison

A summary of the performance and the required number of contexts is shown in Table 2. For performance comparison, the high-end embedded processor MIPS64 is used under the same conditions as in Section 2.1.1. Pentium III (600MHz) and Pentium 4 (2.5GHz) are also used as competitors representing high-performance implementations. Software implemented on MIPS64 and Pentium III/4 was compiled by gcc with the -O3 optimization option.

Table 2. Performance evaluation

Application   # of Ctxt   Frq.(MHz)   Relative Performance
α Blender     5           38          3 × Pentium 4
RC6           13          32          6 × MIPS64
DWT           14          61          2.2 × Pentium III
IMDCT         16          36          1.8 × Pentium III
Viterbi       12          33          5 × Pentium 4

Although various competitors are used, the applications on DRP-1 achieve much better performance than the embedded CPU. The normal operating frequency of DRP-1 with our applications is between 30MHz and 60MHz, much lower than those of the competitors. This shows that parallel computation and access to distributed memory are the main factors in the speedup.

3 Conclusion

In this paper, we have implemented and evaluated several stream applications on the DRP. Our implementation shows that a single DRP Tile can outperform a high-end embedded processor. Also, when all eight Tiles are utilized efficiently, the performance of DRP-1 can surpass that of the Pentium III/4.
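As a cross-check on the single-Tile result, the FFT/sec row of Table 1 appears to be consistent with dividing the clock frequency by the number of clock cycles per FFT and truncating to an integer. The short Python sketch below reproduces that arithmetic; the dictionary `TABLE1` and the function `ffts_per_second` are names introduced here for illustration, and the truncation rule is an assumption inferred from the published numbers, not something the paper states.

```python
# Values copied from Table 1: (clock cycles per FFT, clock frequency in Hz).
TABLE1 = {
    "DRP simple":   (26624, 50_000_000),
    "DRP improved": (11776, 33_000_000),
    "MIPS64":       (248047, 500_000_000),
    "DSP":          (83997, 225_000_000),
}


def ffts_per_second(cycles: int, freq_hz: int) -> int:
    """Throughput = frequency / cycles per FFT, truncated (assumed rounding)."""
    return freq_hz // cycles


throughput = {name: ffts_per_second(c, f) for name, (c, f) in TABLE1.items()}

# Ratio between the two DRP implementations, the "1.5 times" in the text.
drp_gain = throughput["DRP improved"] / throughput["DRP simple"]

for name, value in throughput.items():
    print(f"{name:13s} {value:5d} FFT/sec")
print(f"improved / simple = {drp_gain:.2f}")
```

Seen this way, the improved DRP version wins on cycle count rather than clock rate: it needs 11776 cycles per FFT against 248047 on MIPS64, which more than offsets the difference between 33MHz and 500MHz.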

Acknowledgments

The authors are grateful to Mr. H. Goto of Nokia Corp. for the countless discussions that have enriched our work.

References

[1] M. Motomura, "A Dynamically Reconfigurable Processor Architecture," Microprocessor Forum, Oct. 2002.
[2] K. Nakadai, H. G. Okuno, H. Kitano, "Auditory Fovea Based Speech Separation and Its Application to Dialog System," Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-2002), pp. 1314-1319, Oct. 2002.