FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations based on Finite Difference Methods

KENTARO SANO, WANG LUZHOU, YOSHIAKI HATSUDA, TAKANORI IIZUKA and SATORU YAMAMOTO
Graduate School of Information Sciences, Tohoku University

For scientific numerical simulation that requires a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and therefore custom-computing machines (CCMs) are one of the promising approaches to provide bandwidth-aware structures tailored for individual applications. In this paper, we propose a scalable FPGA-array with a bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of the SCMA provides scalability of both memory bandwidth and arithmetic performance according to the array size, we introduce a homogeneous-partitioning approach so that the SCMA is extensible over a 1D or 2D array of FPGAs connected with a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose BRM, which is based on time-division multiplexing. BRM decreases the required number of communication channels between adjacent FPGAs at the cost of delay cycles. We formulate the trade-off between bandwidth and delay of inter-FPGA data-transfer with BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement an SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA running at 106MHz has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performances of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of the computing units. The SCMA has complete scalability to the increasing number of FPGAs due to the highly localized computation and communication. In addition, we also demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy of the same computations performed by a 3.4GHz Pentium4 processor. With software simulation, we show that BRM works effectively for the benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be utilized to construct a scalable FPGA-array.

Categories and Subject Descriptors: B.6.1 [Logic Design]: Design Styles—Cellular arrays and automata; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)—Array and vector processors; C.3 [Special-Purpose and Application-based Systems]: Microprocessor/microcomputer applications; I.6.8 [Simulation and Modeling]: Types of Simulation—Parallel; J.2 [Physical Sciences and Engineering]: Aerospace; Engineering

Author's address: Graduate School of Information Sciences, Tohoku University, 6-6-01 Aramaki Aza Aoba, Aoba-ku, Sendai 980-8579, JAPAN. Email: {kentah, limgys, hatsuda, iizuka, yamamoto}@caero.mech.tohoku.ac.jp


General Terms: Design, Experimentation, Performance
Additional Key Words and Phrases: Finite difference methods, FPGAs, reconfigurable computing, scalable array

1. INTRODUCTION

Scientific numerical simulation based on finite difference methods is one of the major applications requiring high-performance computing (HPC) with floating-point operations; it includes thermal propagation problems, fluid dynamics problems, electromagnetic problems and so on [Ferziger and Perić 1996]. These simulations numerically solve the partial differential equations (PDEs) constituting the governing equations of the physics, which are approximated by applying difference schemes to discrete values defined at 2D or 3D grid points. Since the computation at each grid-point requires data from its multiple neighbors, such simulations are memory-intensive applications. Therefore, not only peak arithmetic performance but also scalable memory bandwidth is necessary for high-performance scientific simulation.

For HPC, supercomputers or PC clusters comprised of general-purpose microprocessors are commonly utilized. However, such general-purpose computers have a structural problem in terms of bandwidth. First, microprocessors have a limitation of memory bandwidth, the so-called von Neumann bottleneck. While the amount of hardware resources available on a chip has been growing steadily with technology scaling, the bandwidth through the chip I/O pins has improved only slowly. Thus, even though a cache memory is effective, the off-chip main memory is still very far from the processor core, and its bandwidth is inherently insufficient for memory-intensive applications [Williams et al. 2009]. Second, multiprocessor systems consisting of such inefficient processors are also confronted with a scalability problem. For example, due to conflicts in the shared memory and/or the overhead of communication and synchronization via an interconnection network, only a fraction of the peak performance of a general-purpose system is exploited for actual applications.

Under these circumstances, custom computing machines (CCMs) are expected to achieve efficient and scalable computation with data-paths, memory systems and a network customized for each individual application. In particular, field-programmable gate arrays (FPGAs) have become very attractive devices for implementing CCMs for HPC. Thanks to remarkably advanced FPGA technology, more and faster resources, e.g., logic elements (LEs), DSP blocks for integer multiplication, embedded memories and I/O blocks, have been integrated on an FPGA with less power consumption. Consequently, the potential performance of FPGAs now rivals, or exceeds, that of general-purpose microprocessors for floating-point computations [Underwood and Hemmert 2004; Underwood 2004]. Accordingly, many researchers have been trying to exploit the potential of FPGAs for floating-point applications for years [Shirazi et al. 1995; deLorimier and DeHon 2005; Zhuo and Prasanna 2005; Dou et al. 2005; Durbano et al. 2004; He et al. 2004; He et al. 2005; Scrofano et al. 2006].


In our previous work [Sano et al. 2007], we proposed the systolic computational-memory (SCM) architecture for FPGA-based scalable simulation of computational fluid dynamics (CFD), and demonstrated that a single-FPGA implementation of a systolic computational-memory array (SCMA) achieves higher performance and higher efficiency than those of a general-purpose microprocessor. The decentralized memories coupled with local processing elements (PEs) in the SCM architecture theoretically provide complete scalability of both the total memory bandwidth and the arithmetic performance as the array size increases. However, the performance available from a single FPGA is finite. A scalable system with multiple FPGAs is indispensable to implement a larger-scale SCMA for higher performance. Therefore, we need to know whether the SCM architecture is still scalable for a multiple-FPGA implementation, and if so, how to design the system with multiple FPGAs. Moreover, we should also provide flexibility to allow such a large-scale SCMA to compute similar but different problems without sacrificing performance.

In this paper, we give the answers to these questions. We propose a scalable FPGA-array allowing a generalized SCMA to be extended over multiple FPGAs for power-efficient HPC of scientific simulations based on finite difference methods. First, we present our design concept of the FPGA-based SCMAs as programmable CCMs constructed from a configurable hardware part and a software part for a wider range of target applications including CFD. The hardware part, with a minimum of programmability, is given by a static structure customized commonly for the group of target applications. The software part provides dynamic flexibility to control the hardware part with a microprogram for various computations. With this concept, we describe our design of the SCMA not only for CFD, but also for computations based on finite difference methods in general. Second, by introducing homogeneous partitioning, we map sub-arrays of an SCMA to FPGAs connected with a 1D or 2D mesh network. To keep the performance of each FPGA in the 1D or 2D FPGA-array, the inter-FPGA bandwidth is important. Since the computation of a sub-array requires a certain bandwidth to transfer data between adjacent FPGAs, insufficient I/O bandwidth of an FPGA chip decreases the single-FPGA performance until it is balanced with that bandwidth and, as a result, spoils the scalability. To avoid such performance degradation, we propose a bandwidth-reduction mechanism (BRM) based on time-division multiplexing for inter-FPGA data-transfer. The design parameters of BRM influence the actual bandwidth and delay of data-transfer. We derive the constraint on these parameters from the available I/O bandwidth of an FPGA device and the permissible delay in a computing program.

Through implementation with two FPGAs, we show that the SCMA keeps a high utilization of the computing units for multiple-FPGA operation, resulting in complete scalability. Thereby two FPGAs allow the SCMA to achieve double the performance of a single FPGA. We also show that the FPGA-based SCMA has the advantage of low power consumption in comparison with an actual microprocessor. In spite of the computational speedup, the power consumption of the entire system including microprocessors and FPGAs is less than that of the system without FPGAs performing the same computation on a microprocessor. Moreover, we evaluate the effectiveness of BRM with a software simulator.
By obtaining the feasible design-parameters of BRM, we estimate the actual bandwidth-reduction of inter-FPGA data-transfer for typical computing programs. Based on this estimation, we illustrate that we can construct completely scalable SCMAs with a 2D array of commercially available high-end to low-end FPGAs.

This paper is organized as follows. Section 2 summarizes related work. Section 3 describes the target computations, the architecture and design of an SCMA, BRM for inter-FPGA communication, and the parameter constraint of BRM. Section 4 explains the implementation using two ALTERA Stratix II FPGAs and discusses the performance with three benchmark computations and the feasibility with commercially available FPGAs. Finally, Section 5 gives conclusions and future work.

2. RELATED WORK

As FPGAs have been getting more and faster components of LEs, DSPs and embedded RAMs [Compton and Hauck 2002], their potential performance for floating-point operations has increased rapidly. There have been many reports on floating-point computations with FPGAs: fundamental research on floating-point units on FPGAs [Shirazi et al. 1995], linear algebra kernels [deLorimier and DeHon 2005; Zhuo and Prasanna 2005; 2007; Zhuo et al. 2007; Dou et al. 2005] and performance evaluation, analysis or projections [Underwood and Hemmert 2004; Underwood 2004; Strenski et al. 2008]. FPGA-based acceleration of individual floating-point applications has been presented for iterative solvers [Hemmert and Underwood 2005], FFT [Morris et al. 2006], cellular automata simulations [Murtaza et al. 2008], acceleration of the lattice Boltzmann method [Sano et al. 2007], adaptive beam-forming in sensor array systems [Walke et al. 2000], seismic migration [He et al. 2004], transient waves [He et al. 2005], molecular dynamics [Patel et al. 2006; Scrofano et al. 2006; 2008; Chiu et al. 2008] and finance problems [Kaganov et al. 2008; Woods and VanCourt 2008].

There has been work attempting to use FPGAs for acceleration of numerical simulations based on finite difference methods: an initial investigation to build an FPGA-based flow solver [Hauser 2005], an overview toward an FPGA-based accelerator for CFD applications [Smith and Schnore 2003], the design of FPGA-based arithmetic pipelines with a memory hierarchy customized for a part of the CFD subroutines [Morishita et al. 2008], and proposals for FPGA-based acceleration of the finite-difference time-domain (FDTD) method [Schneider et al. 2002; Durbano et al. 2004; Chen et al. 2004]. However, they did not give sufficient discussion and evaluation of system scalability, particularly for multiple-FPGA implementation.

In contrast to the above previous work, the approach proposed in this paper is based on a discussion of architectures suitable for obtaining both scalable arithmetic performance and scalable memory bandwidth according to the size of a system. In particular, we focus on scalability with respect to the increasing number of FPGAs, providing a promising mechanism to connect FPGAs with relatively narrow I/O bandwidth. We also quantitatively evaluate the utilization of the computing units and the power-efficiency, as well as the scalability, of our FPGA-based CCMs. In addition, while the previous work aims only at specific computations, we consider versatile CCMs. We design our FPGA-based CCMs so that they can handle a group of computations based on finite difference methods, instead of only a single specific computation.


We should also mention the FPGA-based programmable active memory (PAM) [Vuillemin et al. 1996] as a similar approach to our FPGA-based SCMA, although it was not concerned with numerical simulation using floating-point computation. The similarity is that PAM is based on the concept of a programmable active memory: PAM, composed of a 2D FPGA-array and an external local memory, behaves as a memory for a host machine while processing the stored data. In addition, the extensibility of PAM is considered by connecting it with I/O modules or other PAMs. On the other hand, our SCMA and its concept have the following differences from PAM. First, PAM is not specialized for floating-point computation. Second, the constructive unit of our SCMA is different from that of PAM. The constructive unit of PAM is PAM itself, which is composed of an FPGA-array and a local memory, and customized circuits are configured over the FPGAs of PAM. In our SCMA, each FPGA is the basic unit and has the same hardware design as a module. The array of FPGAs forms a scalable SCMA, and therefore we can easily extend the system by adding FPGAs, for example implemented as "stackable mini-FPGA-boards". Next, we give the SCMA the hardware layer of a systolic array to execute systolic algorithms. The hardware layer has only a minimum of programmability, which is given with a set of microinstructions for versatility. Partitioning the systolic array into sub-array modules allows the SCMA to be extended in units of an FPGA. Lastly, we propose and evaluate a mechanism for inter-FPGA communication so that the inter-FPGA bandwidth does not spoil the performance of the SCMA consisting of an FPGA array. The proposed mechanism for bandwidth reduction makes it feasible to implement a large-scale SCMA with commercially available low-end FPGAs.

One may consider that GPGPUs [Fatahalian et al. 2004], multicore processors [Williams et al. 2009] and their parallel systems can fully exploit their peak performance for the coarse-grain parallelism of our target applications. However, these general-purpose processors are designed for high arithmetic performance without sufficient memory bandwidth, and therefore cannot meet the requirement of the applications: stencil computations for finite difference methods. These applications have low operational intensity [Williams et al. 2009], which means that they require relatively high off-chip memory bandwidth per arithmetic operation. Such memory-bound applications often limit the actual performance of general-purpose processors to a few percent of their peak performance [Williams et al. 2009], resulting in very low performance per cost or power. On the other hand, we design our SCMA so that its application-specific structure balances the memory bandwidth with the peak arithmetic performance based on the systolic computational-memory architecture. Furthermore, we can extend the FPGA-based accelerator subsystem with high scalability, keeping the memory bandwidth per arithmetic operation. Thus our proposal for FPGA-based acceleration provides a guideline for efficient high-performance computation of applications with low operational intensity.
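To put "low operational intensity" in concrete terms, a rough back-of-the-envelope estimate (ours, not the paper's) for the neighboring accumulation of Eq. (7) in Section 3.1 is about five multiplications and five additions per grid point against roughly six single-precision words of data traffic:

    I \approx \frac{10\ \mathrm{flops}}{6 \times 4\ \mathrm{bytes}} \approx 0.4\ \mathrm{flop/byte},

so unless almost all operands stay on chip, off-chip bandwidth, not arithmetic, bounds the achievable performance.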


3. SCALABLE FPGA-ARRAY AND ARCHITECTURE FOR CUSTOM COMPUTING

3.1 Target computation

The target computation of our FPGA-based CCMs is numerical simulation based on finite difference methods, which is one of the major application groups requiring high-performance floating-point computations. The simulation numerically solves the governing equations, which are the partial differential equations (PDEs) modeling the physics. The PDEs are numerically approximated by applying difference schemes with discrete values defined at 2D or 3D grid points for discrete time-steps. For example, an incompressible and viscous flow is governed by the following equations:

    \nabla \cdot V = 0,    (1)

    \frac{\partial V}{\partial t} + (V \cdot \nabla)V = -\nabla\phi + \nu\nabla^2 V,    (2)

where V, \phi, \nu and t are the velocity vector (V \equiv (u, v) for 2D), the pressure (p) divided by the density (\rho) (\phi = p/\rho), the kinematic viscosity and the time, respectively. The fractional-step method [Kim and Moin 1985; Strikwerda and Lee 1999], which is one of the typical and widely-used numerical methods for simulating the fluid, is composed of the following three steps to solve Eqs.(1) and (2) and obtain V and \phi for the successive time-steps n = 0, 1, 2, ....

Step 1: Calculate the tentative velocity V^* with Eq.(2) ignoring the pressure term:

    V^* = V^n + \Delta t \left( -(V^n \cdot \nabla)V^n + \nu\nabla^2 V^n \right),    (3)

where \Delta t denotes the time interval between the time-steps n and (n+1), and V^n means V at the time-step n.

Step 2: Calculate \phi^{n+1} with V^* by solving the following Poisson's equation:

    \nabla^2 \phi^{n+1} = \frac{\nabla \cdot V^*}{\Delta t}.    (4)

Step 3: Calculate V^{n+1} with V^* and \phi^{n+1}:

    V^{n+1} = V^* - \Delta t \, \nabla\phi^{n+1}.    (5)

In the case of 2D flows, the central-difference schemes of 2nd-order accuracy give the following approximations on the 2D collocated grid shown in Fig. 1:

    \frac{\partial u}{\partial x} \simeq \frac{u_{i+1} - u_{i-1}}{2\Delta x}, \qquad \frac{\partial^2 u}{\partial x^2} \simeq \frac{u_{i-1} - 2u_i + u_{i+1}}{\Delta x^2}.    (6)

With these approximations, Eqs.(3) to (5) are expressed in the following common form [Sano et al. 2007]:

    q_{i,j}^{new} = c_0 + c_1 q_{i,j} + c_2 q_{i+1,j} + c_3 q_{i-1,j} + c_4 q_{i,j+1} + c_5 q_{i,j-1},    (7)

where q_{i,j} is a certain value at grid-point (i, j), and c_0 to c_5 are constants or values obtained only with values at (i, j). We refer to this computation as neighboring accumulation. In the case of the 3D grid, the accumulation contains at most eight terms.
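As a point of reference for readers more used to software, a plain-C sketch of one sweep of the neighboring accumulation of Eq.(7) might look as follows (the array names, sizes and coefficient layout are our own illustrative assumptions, not the paper's implementation):

    /* One sweep of the neighboring accumulation of Eq. (7).
     * q    : current values on an (M+2) x (N+2) grid including halo points
     * qnew : updated values
     * c0..c5 : coefficients (taken as global constants here for brevity)
     */
    void neighboring_accumulation(int M, int N,
                                  const float q[M + 2][N + 2],
                                  float qnew[M + 2][N + 2],
                                  float c0, float c1, float c2,
                                  float c3, float c4, float c5)
    {
        for (int i = 1; i <= M; i++) {
            for (int j = 1; j <= N; j++) {
                qnew[i][j] = c0
                           + c1 * q[i][j]
                           + c2 * q[i + 1][j]
                           + c3 * q[i - 1][j]
                           + c4 * q[i][j + 1]
                           + c5 * q[i][j - 1];
            }
        }
    }

Every update touches only the four nearest neighbors, which is exactly the locality and parallelism that the architecture described below exploits.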

Fig. 1. 2D collocated grid. (The values φ_{i,j}, u_{i,j} and v_{i,j} are collocated at each grid point (i, j), surrounded by its eight neighbors (i±1, j±1).)

Thus, the difference schemes allow the numerical simulations to be performed by computing the neighboring accumulations. We can also compute higher-order difference schemes with a combination of neighboring accumulations. Eq.(7) means that every grid-point requires only accumulation computations with data from its adjacent grid-points. The computations at all grid-points are independent, so they can be performed in parallel. To exploit these properties of locality and parallelism, an array of processing elements (PEs) performing parallel computation on decomposed subdomains is suitable [Hoshino et al. 1983]. Moreover, due to the computational homogeneity among grid-points, computations based on the difference schemes can be described as a systolic algorithm, which can efficiently be performed in parallel by a systolic array [Kung 1982; Johnson et al. 1993].

3.2 Architecture

Since recent advancements in semiconductor technology provide VLSIs abundant in transistors but with only slightly increased I/O pins, performance improvement is not limited by the computation itself, but by the memory bandwidth, i.e., the von Neumann bottleneck. Hence, computing systems for HPC should be designed to have scalable memory bandwidth by flexibly introducing custom-computing architectures appropriate to the target problems, instead of the conventional computing-prioritized structure. Reconfigurable computing with FPGAs is a promising approach to build such custom-computing machines (CCMs) tailored for target problems.

Then, what architecture is suitable for designing CCMs with scalable memory bandwidth for HPC? Our answer to this question is the systolic computational-memory (SCM) architecture, which is the combination of the systolic architecture [Sano et al. 2004; Sano et al. 2005; Sano et al. 2006a; 2006b; 2007; Sano et al. 2008], or other 2D array architectures [Hoshino et al. 1983], and the computational-memory approach. The systolic array [Kung 1982; Johnson et al. 1993] is a regular arrangement of many processing elements (PEs) in an array, where data are processed and synchronously flow across the array between neighbors. Since such an array is suitable for pipelining and spatially parallel processing with input data passing through the array, it gives scalable arithmetic performance according to the array size. However, external-memory access can still be a bottleneck of the performance improvement for memory-intensive computations.

Fig. 2. Systolic computational-memory array over a 2D FPGA array. (Each FPGA holds a sub-array of PEs; each PE couples computing units, a local memory and a switch, controlled by a controller for the PEs.)

We believe that the computational-memory approach is one of the solutions to this problem. This approach is similar to the computational RAM (C*RAM) or "processing in memory" concepts [Vuillemin et al. 1996; Patterson et al. 1997; Elliott et al. 1999], where computing logic and memory are arranged very close to each other. In our SCM architecture, the entire array behaves as a memory that not only stores data but also performs floating-point operations on them by itself. The memory is partitioned into local memories decentralized to the PEs, which concurrently perform computation with the data stored in their local memories. This structure allows the internal memory bandwidth of the array to be wide and scalable with its size, without the bottleneck of external-memory access.

As we described in [Sano et al. 2007; Sano et al. 2008], the accumulation denoted by Eq.(7) can efficiently be computed by a systolic algorithm due to its locality, parallelism and regularity. Accordingly, by decomposing a computational grid into sub-grids, we can parallelize the computations on the grid for a systolic array with an appropriate network topology. Since present semiconductor technology relies upon planar integration on a chip, the systolic array with the 2D mesh network shown in Fig.2 is suitable. We can map not only 2D grids, but also 3D ones to the 2D systolic array by applying 2D grid-decomposition.

In [Sano et al. 2007], we presented the systolic computational-memory array (SCMA) for computational fluid dynamics (CFD), and showed its effectiveness through prototype implementation with a single FPGA. The prototyped SCMA is connected to a host machine. Since the data transfer between the host machine and the SCMA should not be a bottleneck in computing, we designed the SCMA so that it performs the entire CFD computation without the host machine once the necessary data are transferred. Thus we demonstrated that the SCMA implemented with a single FPGA works with high utilization of its arithmetic units. However, multiple-FPGA implementation is indispensable to scale the performance further, and the scalability for multiple FPGAs was not well discussed. Moreover, the computational generality of the SCM architecture has not been demonstrated yet.


In this paper, we generalize the SCMA for FPGA-based HPC by extending the target problem from CFD alone to a group of numerical computations based on finite difference methods, and by giving a framework to obtain scalability with tightly-coupled FPGAs.

For multiple-FPGA implementation, partitioning of the array needs to be discussed: which part of the array should be implemented with each FPGA. We choose the homogeneous partitioning shown in Fig.2, considering that state-of-the-art FPGAs have become large enough to integrate many PEs. For the homogeneous partitioning, we uniformly partition the array of PEs into sub-arrays. Each FPGA takes charge of one sub-array, while the connected FPGAs form the entire array. As we discuss later, the communication bandwidth between adjacent PEs can be much lower than the local-memory bandwidth of each PE. Since an FPGA's off-chip I/O bandwidth still remains insufficient to substitute for the internal bandwidth of the embedded memories, our design choice is reasonable for multiple-FPGA implementation. The homogeneous partitioning also gives high productivity, because it allows the FPGAs to share almost the same design.

Fig.2 also shows the functional blocks and the controllers of the PEs. We design the CCM to be composed of a hardware (HW) part with a static configuration and a software (SW) part that dynamically controls the data-paths with microprograms, in order to give computational versatility and facility to the SCMA. Even though we use FPGAs, it is reasonable to share common HW components as much as possible among different problems, because it takes a long time to design and compile the HW part. Therefore, we implement computing units, data-paths, local memories and a network as the HW part of the CCMs. Of course, these HW components should be specialized for the target problems if necessary. For example, the network should have an appropriate topology for frequently-appearing communication patterns, or the data-path should have computing units dedicated to the operations that dominate the computation. Next, in order to achieve various computations with the HW part, we employ sequencers to control the data-paths with our defined microinstructions. A microprogram written with the microinstructions provides the dynamic part of our CCMs. This software is neither almighty nor overly flexible: it is simple, defining only the functions necessary for the time-shared use of the static functions on limited hardware. Thanks to this approach, we can use common units for various computations, and consequently achieve high utilization of them, which is very important for HPC with limited HW resources. The programmability also allows us to easily develop CCMs for different computing problems. Complex and specialized computations, such as boundary computations, can also be performed with the same HW part by virtue of the SW layer. Thus, almost all the computations of a problem can be performed independently of a host machine. For the SW part, we give a necessary number of sequencers to the SCMA. One sequencer can control multiple PEs. We consider the sequencer-to-PE correspondence to be part of the HW part, but it can be configured appropriately for the target problems.

3.3 Design of processing elements

In this section, we summarize our design criteria for the SCMA, and describe an actual design of the PEs in their HW and SW aspects.
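To make the homogeneous partitioning concrete, the following C sketch maps a global grid point to the FPGA, the PE inside that FPGA, and the local position inside that PE's sub-grid (the function and field names are ours; the paper does not prescribe this addressing):

    /* Homogeneous partitioning: the global grid is split uniformly into
     * per-PE sub-grids, and the PE array is split uniformly into per-FPGA
     * sub-arrays (cf. Fig. 2). All sizes are illustrative.
     */
    typedef struct {
        int fpga_x, fpga_y;   /* position of the FPGA in the FPGA array   */
        int pe_x,   pe_y;     /* position of the PE inside that FPGA      */
        int loc_x,  loc_y;    /* position of the point in the PE sub-grid */
    } owner_t;

    owner_t locate(int gx, int gy,   /* global grid point                 */
                   int n_grid,       /* sub-grid size per PE              */
                   int n_pe)         /* sub-array size per FPGA           */
    {
        owner_t o;
        int pex = gx / n_grid, pey = gy / n_grid;   /* owning PE, global  */
        o.loc_x  = gx % n_grid;  o.loc_y  = gy % n_grid;
        o.fpga_x = pex / n_pe;   o.fpga_y = pey / n_pe;
        o.pe_x   = pex % n_pe;   o.pe_y   = pey % n_pe;
        return o;
    }

Because the cut is uniform, every FPGA holds the same amount of data and runs almost the same microprogram, which is what gives the design both its productivity and its scalability.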

Fig. 3. Pipelined data-path of PE. (Eight stages: MS (Microinstruction Sequence), MR (Memory Read), EX1 to EX5 occupied by the five-stage floating-point MAC, and WB (Write Back); with the sequencer, the dual-read-port local memory, and the N-, S-, E- and W-communication FIFOs connected to the adjacent PEs.)

3.3.1 Hardware part. Although the computations based on the difference schemes commonly consist of the neighboring accumulations of Eq.(7), each accumulation can have a different number of terms with different constants. Therefore, if we implemented a fixed data-path accumulating a fixed number of terms, it would result in lower utilization of the units for a different number of terms. To achieve high utilization, we chose a programmable approach where a single MAC (Multiplication and ACcumulation) unit is used sequentially to accumulate any number of terms. Moreover, since the accumulation results are often used in the computations for the adjacent grid-points, we designed the data-path so that the output of the MAC unit can be transferred directly to the adjacent PEs.

Fig.3 shows the data-path of the PE. The data-path is composed of a sequencer, a local memory and a MAC unit. The local memory stores all the necessary data for the sub-grid allocated to the PE, as well as temporary or intermediate results of computations. The sequencer controls the programmable PE. The sequencer has a sequence memory to store a microprogram, and a program counter (PC) to specify the microinstruction output to the rest of the data-path. Since all the target numerical computations have loop structures for an iterative solver and time marching, we designed a hardwired loop-control mechanism including multiple loop counters for nested loops. Basically, a sequencer is not dedicated to one PE; rather, multiple PEs share one sequencer, as mentioned above. The number of PEs sharing the same sequencer depends on the type of application.

The data-path is pipelined with eight stages: the MS (Microinstruction Sequence) stage, the MR (Memory Read) stage, five EX (Execution) stages and the WB (Write Back) stage, where the MAC unit occupies the five EX stages. The MAC unit performs multiplication and accumulation of single-precision floating-point numbers. In the accumulation mode, the MAC unit computes a × b with the two inputs a and b, and then adds or subtracts ab with its output.


Table I. Defined microinstructions of the processing element.

    Opcode    Dst1   Dst2    Src1    Src2   Description
    mulp      -      L1      SFIFO   L2     MACout = 0 + SFIFO × M[L2], M[L1] := MACout.
    mulm      SN     -       L2      L3     MACout = 0 - M[L2] × M[L3], FIFOs of {S&N}-PEs := MACout.
    accp      -      L1      L2      L3     MACout = MACout + M[L2] × M[L3], M[L1] := MACout.
    nop                                     No operation.
    halt                                    Halt the array processor.
    lset      num, addr                     LC[p]* := num, JAR[p]* := addr.
    bnz                                     Branch to JAR[p] if LC[p] ≠ 0.
    accpbnz   -      L1      L2      L3     Execute accp and bnz simultaneously.

    * LC[p]: loop counter, JAR[p]: jump-address register (for the p-th nested loop).
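The following C fragment is a purely behavioral sketch of what the computing instructions in Table I do to the architectural state (MAC output register, local memory, outgoing FIFOs); it reflects our reading of the table, not the sequencer or data-path RTL, and it ignores pipelining:

    #include <stdint.h>

    enum dst1 { DST_NONE, DST_N, DST_S, DST_E, DST_W, DST_SN, DST_WS, DST_ES };

    typedef struct {
        float mem[256];   /* local memory of one PE (256 words, cf. Section 4.1) */
        float mac;        /* MAC output register                                 */
    } pe_state_t;

    /* mulp: mac = 0 + src1 * M[src2]; src1 may come from memory or a FIFO. */
    static void mulp(pe_state_t *pe, float src1, uint8_t src2_addr)
    {
        pe->mac = 0.0f + src1 * pe->mem[src2_addr];
    }

    /* accp: mac = mac + M[src1] * M[src2] (mulm would subtract instead).   */
    static void accp(pe_state_t *pe, uint8_t src1_addr, uint8_t src2_addr)
    {
        pe->mac = pe->mac + pe->mem[src1_addr] * pe->mem[src2_addr];
    }

    /* After EX5, the result is optionally written back (Dst2) and/or pushed
     * into the FIFOs of the adjacent PEs selected by Dst1.                 */
    static void write_back(pe_state_t *pe, enum dst1 d1, int has_dst2,
                           uint8_t dst2_addr, void (*send)(enum dst1, float))
    {
        if (has_dst2)       pe->mem[dst2_addr] = pe->mac;
        if (d1 != DST_NONE) send(d1, pe->mac);
    }

A `send()` callback stands in here for the hardware FIFO write toward the selected neighbor(s).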

For this accumulation, the MAC unit has a forwarding path from EX5 to EX2. This three-stage forwarding forces the inputs for an accumulation to be fed every three cycles. This means that three instances of Eq.(7) have to be performed concurrently in order to fully utilize the multiplier and the adder of the MAC unit. The output of the MAC unit is written into the local memory, and/or sent to the adjacent PEs through the communication FIFOs (first-in first-out queues). In the 2D mesh network of the array, each PE is connected to its four adjacent PEs with the north (N-), south (S-), west (W-) and east (E-) FIFOs. These FIFOs allow PEs to avoid too rigorous a requirement for synchronization in sending and receiving data to/from the adjacent PEs. In the current design, each FIFO has 32 entries.

3.3.2 Software part. To describe microprograms that perform numerical computations with the aforementioned HW part, we defined an assembly language based on the following requirements:

(1) The MAC unit takes its two inputs from the memory or the FIFOs.
(2) The MAC unit multiplies, and then adds or accumulates.
(3) The output of the MAC unit is written to the memory and/or the FIFOs.
(4) Computations are repeated by nested loops.

Table I shows the microinstruction set of the PE, which is composed of computing instructions and controlling instructions. There is no comparison instruction, since none is necessary for our present target computations. A computing instruction takes an operation code (Opcode), two destinations (Dst1 and Dst2), and two sources (Src1 and Src2).

     1:        lset     1000, LOOP
     2:  LOOP: mulp     -,  -,     WFIFO, C_1
     3:        mulp     -,  -,     F_0_0, C_1
     4:        mulp     -,  -,     F_1_0, C_1
     5:        accp     -,  -,     SFIFO, C_2
     6:        accp     -,  -,     SFIFO, C_2
     7:        accp     -,  -,     SFIFO, C_2
     8:        accp     -,  -,     F_0_1, C_3
     9:        accp     -,  -,     F_1_1, C_3
    10:        accp     -,  -,     F_2_1, C_3
    11:        accp     WS, F_0_0, F_1_0, C_4
    12:        accpbnz  -,  F_1_0, F_2_0, C_4
    13:        accp     ES, F_2_0, EFIFO, C_4
    14:        halt

    Fig. 4. Example of a microprogram.

The opcodes mulp, mulm and accp are multiply-and-add with zero, multiply-and-subtract with zero, and multiply-and-accumulate with the previous output, respectively. The first destination, Dst1, specifies the PEs to which the computing result is sent; S, N, E and W in Dst1 correspond to the south, north, east and west PEs, respectively. The second destination, Dst2, specifies the address of the local memory where the computing result is written. The first and second sources, Src1 and Src2, specify the addresses of the local memory or the FIFOs from which values are read into the MAC unit. As controlling instructions, we have the nop, halt, lset and bnz instructions. The lset and bnz instructions are dedicated to nested-loop control. The lset instruction sets a loop counter, LC, and a jump-address register, JAR, in the sequencer to num and addr, respectively. Then, when bnz is executed, the program counter is set to the address stored in JAR if LC is not equal to zero, and LC is simultaneously decremented. As with the accpbnz instruction, the combination of a computing instruction and bnz performs both instructions in the same clock cycle.

For multiple nested loops, we designed the sequencer so that it has multiple LCs and JARs, LC[p] and JAR[p] for 1 ≤ p ≤ N_loop, where N_loop is the number of nested loops. Initially, p = 0. When entering the outermost loop, p is incremented so that LC[1] and JAR[1] are used by the lset and bnz instructions for the first-level loop. Thus, p is incremented on entering a loop and decremented on leaving it, while LC[p] and JAR[p] are used for the nested-loop control. We implement the sequencers with N_loop = 2 for the benchmark computations described in Section 4.

Fig.4 shows an example of a microprogram, which repeats the following accumulations 1000 times:

    f_{0,0} = c_1 f_{-1,0} + c_2 f_{0,-1} + c_3 f_{0,1} + c_4 f_{1,0},    (8)
    f_{1,0} = c_1 f_{0,0}  + c_2 f_{1,-1} + c_3 f_{1,1} + c_4 f_{2,0},    (9)
    f_{2,0} = c_1 f_{1,0}  + c_2 f_{2,-1} + c_3 f_{2,1} + c_4 f_{3,0},    (10)


where c_1 to c_4 are constants, and f_{i,j} is a value defined at point (i, j) of a 3 × 3 grid. In this example, we assume that f_{-1,0}, f_{0,-1}, f_{1,-1}, f_{2,-1} and f_{3,0} are not the data of this PE: f_{-1,0} and f_{3,0} are sent from the west PE and the east PE, respectively, while f_{0,-1}, f_{1,-1} and f_{2,-1} are sent from the south PE. The lset instruction in the 1st line initializes the loop counter with 1000 and registers the jump address with the label "LOOP". The 2nd, 5th, 8th and 11th instructions sequentially compute Eq.(8), where "WFIFO" is the reserved name for the FIFO connected to the west PE, and "F_*_*" and "C_*" are labels specifying the memory addresses assigned to f_{*,*} and c_*, respectively. The 11th instruction writes the result of the accumulation, f_{0,0}, to both "F_0_0" and the FIFOs of the west and south PEs. Note that only such writes to FIFOs transfer data to the adjacent PEs; in this example, the 11th and 13th instructions cause inter-PE communication. The 12th instruction executes bnz to return to "LOOP" until the loop counter becomes zero, while its accp part is also executed. The 13th instruction fills the delay slot of the bnz instruction. In consequence, the instructions from the 2nd to the 13th lines are repeated. At last, the halt instruction stops the PE.

3.4 Inter-FPGA communication by time-division multiplexing

3.4.1 Requirements for inter-FPGA communication. To obtain scalable performance, we adopt the approach of implementing a large SCMA with multiple FPGAs by homogeneously partitioning the entire array, as shown in Fig.2. For this approach, we have to connect the FPGAs with a 1D or 2D mesh network. Fig.5 shows the connection for a 2D FPGA-array. We refer to the communication channels between two adjacent FPGAs as a link. In a 1D or 2D FPGA-array, each FPGA requires 2 or 4 links, respectively.

Here, we obtain the unidirectional bandwidth required for inter-FPGA communication. For simplicity, we suppose that each FPGA contains a sub-array of N_PE × N_PE PEs, which is connected to the north, south, west and east FPGAs. Let b denote the number of bits of a datum that a PE can send to an adjacent PE in each cycle. For single-precision floating-point numbers, b is 33 bits including a 1-bit control signal. Due to the homogeneous partitioning of the entire array, the N_PE PEs on each edge of the sub-array send data to the adjacent FPGA. When the PEs operate at f MHz, the maximum unidirectional bandwidth in Mbits/s required for a link is given by:

    W_{max}^{link} = b f N_{PE}.    (11)

Since each FPGA has 2d links for a d-dimensional FPGA-array, the maximum unidirectional bandwidth of each FPGA for inter-FPGA communication is obtained as follows:

    W_{max}^{F} = 2d \, b f N_{PE}.    (12)

Let W_{avail}^{F} denote the available unidirectional I/O bandwidth of an FPGA. To directly connect PEs between adjacent FPGAs, the following condition must be satisfied so that the PEs can send data at every clock cycle:

    W_{max}^{F} = 2d \, b f N_{PE} \le W_{avail}^{F}.    (13)
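Plugging in numbers makes the requirement tangible; the sketch below evaluates Eqs.(11) and (12) for the parameters used later in Section 4.1 (b = 33 bits, f = 106 MHz, N_PE = 8 PEs per edge, d = 1), purely as an illustration:

    #include <stdio.h>

    int main(void)
    {
        const double b    = 33.0;    /* bits per datum incl. 1 control bit */
        const double f    = 106.0;   /* clock frequency in MHz             */
        const double n_pe = 8.0;     /* PEs on one edge of the sub-array   */
        const int    d    = 1;       /* dimensionality of the FPGA array   */

        double w_link = b * f * n_pe;       /* Eq. (11), in Mbit/s         */
        double w_fpga = 2 * d * w_link;     /* Eq. (12), in Mbit/s         */

        printf("W_max^link = %.0f Mbit/s (about %.1f GByte/s)\n",
               w_link, w_link / 8.0 / 1000.0);
        printf("W_max^F    = %.0f Mbit/s\n", w_fpga);
        /* Eq. (13): direct connection is possible iff W_max^F <= W_avail^F */
        return 0;
    }

This gives W_max^link = 27984 Mbit/s, about 3.5 GByte/s; since the prototype in Section 4 offers W_avail^F / 2 = 4.98 GByte/s per link, the two FPGAs there can be connected directly without BRM.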

Fig. 5. 2D FPGA array and bandwidth requirement of each FPGA. (Each FPGA hosts an N_PE × N_PE sub-array of PEs, each PE holding an N_grid × N_grid sub-grid; the N_PE PEs on an edge send at most 2 N_PE b f Mbits/s over the link to the neighboring FPGA.)

When this condition is not met, the PEs cannot be connected directly because of the insufficient bandwidth. The maximum required bandwidth can be reduced by decreasing f N_PE; however, the peak performance of the sub-array on each FPGA also decreases, because it follows O(f N_PE^2). We should avoid this performance degradation to construct a scalable and efficient FPGA-array.

3.4.2 Bandwidth reduction by time-division multiplexing. The maximum required bandwidth is necessary only when PEs send data at every cycle, which is not common in actual computation. As shown in Fig.4, data-transfer instructions are intermittent in a typical program and the ratio of such instructions to the total cycles is small. This is because each PE computes a sub-grid: Fig.5 shows that only the grid points in the rightmost column of the N_grid × N_grid sub-grid require communication with the right FPGA. Furthermore, the neighboring accumulation of Eq.(7) takes N_inst > 1 cycles for each grid point. Therefore, the data-transfer to the right FPGA takes place at intervals of N_grid N_inst cycles on average. This means that the unidirectional bandwidth can actually be restricted to

    W_{min}^{link} = O\!\left( \frac{W_{max}^{link}}{N_{grid} N_{inst}} \right).    (14)

In order to connect FPGAs that do not satisfy Eq.(13), we propose a bandwidth-reduction mechanism (BRM) that exploits the above feature. Fig.6 shows an overview of the mechanism. BRM has FIFO queues to accept data-transfers at various intervals. The (b N_PE)-bit data in the FIFOs assigned to the PEs are read out together to the buffer, which is partitioned into m segments for time-division multiplexing. The data in the buffer are serialized into m segments and sent to the adjacent FPGA over m cycles. While the transferred segments are on the link, the data-available signal is asserted. The receiving FPGA deserializes the segments and reconstructs the (b N_PE)-bit data in its buffer, which are written to the communication FIFOs of the destination PEs. Since the data from the N_PE PEs are sent together, null data are transferred when the data-transfers of the PEs are not synchronized.
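A minimal software model of the serializing side of BRM is sketched below (the byte-oriented segment width, the packing, and the function names are our simplifications; the real design is the FIFO/serializer hardware of Fig. 6 working on bit-level segments):

    #include <stdint.h>

    /* Serialize one (b * N_PE)-bit word into m segments, one per cycle.
     * The word is modelled as N_PE 33-bit items packed into a byte buffer.
     */
    #define N_PE       8
    #define B_BITS     33
    #define WORD_BYTES ((N_PE * B_BITS + 7) / 8)   /* 33 bytes for 264 bits */

    typedef struct {
        uint8_t bytes[WORD_BYTES];
    } brm_word_t;

    /* send_segment() stands in for the physical inter-FPGA link. */
    static void brm_serialize(const brm_word_t *w, int m,
                              void (*send_segment)(const uint8_t *seg, int len))
    {
        int seg_len = (WORD_BYTES + m - 1) / m;
        for (int s = 0; s < m; s++) {              /* one segment per cycle */
            int off = s * seg_len;
            int len = (off + seg_len <= WORD_BYTES) ? seg_len
                                                    : WORD_BYTES - off;
            if (len > 0)
                send_segment(&w->bytes[off], len);
        }
    }

While the m segments are on the link the data-available signal is asserted; the receiver reassembles the word and distributes it to the communication FIFOs of the destination PEs, marking any unsynchronized entries as null data via the control bit.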

Fig. 6. Bandwidth-reduction mechanism (BRM) for a link between adjacent FPGAs. (FIFOs of the N_PE border PEs feed a buffer, which is serialized into m segments by SER & Tx, sent over the link with a valid signal, and deserialized by Rx & DES on the receiving FPGA; the associated delays are D_SER, D_FPGA, D_DES and D_PE.)

The null data can be distinguished by the receiving FPGA with the control signal included in the b bits. However, we expect almost no efficiency degradation in data transfer, because the PEs are basically controlled by the same or similar sequences. Moreover, we expect the FIFO queues to absorb the variation of intervals in most cases.

3.4.3 Bandwidth and delay constraint. In designing BRM, we have to determine the number of segments, m, and the depth of the FIFOs, d_FIFO, as design parameters. Here, we describe the constraints on these parameters. By serializing data with m segments, the required bandwidth for a link is reduced to W_{reduc}^{F} = W_{max}^{F}/m + W_{control}, where W_{control} is for control signals including the data-available signal. Since W_{control} is generally much smaller than W_{max}^{F}, we ignore it. For bandwidth reduction based on time-division multiplexing, the FPGA with the I/O bandwidth of W_{avail}^{F} requires the following constraint:

    W_{reduc}^{F} = \frac{W_{max}^{F}}{m} = \frac{2d \, b f N_{PE}}{m} \le W_{avail}^{F}.    (15)

With this equation, we obtain

    \frac{2d \, b f N_{PE}}{W_{avail}^{F}} \le m.    (16)

Next, we model the delay cycles of transferring data with BRM to give its constraint. Let D_PE be the number of cycles required for transferring data between directly connected PEs. After executing a data-transfer instruction at the i-th cycle, the data can be read by the receiving PE with an instruction executed at the (i + D_PE)-th cycle or later. Since the write-back stage is the 8th stage, as shown in Fig.3, the data-transfer instruction fetched at cycle i writes the MAC output to the FIFO of the receiving PE at cycle (i + 7). At cycle (i + 8), the data written to the FIFO can be read in the memory-read stage of the receiving PE by the instruction fetched at cycle (i + 7). Therefore, the current design of the PE gives D_PE = 7. We also define D_FPGA as the number of cycles required for transferring data between the transceiver (Tx) and the receiver (Rx) of the connected FPGAs.

Fig. 7. Timing chart examples of data transfer with the bandwidth-reduction mechanism.

D_FPGA depends on the I/O unit used for inter-FPGA communication; we will obtain D_FPGA through the prototype implementation in Section 4. The total delay, D, of a data transfer with BRM is given by the following equation, as shown in Fig.6:

    D = D_{SER} + D_{FPGA} + D_{DES} + D_{PE},    (17)

where D_SER and D_DES are the numbers of cycles for serialization and deserialization, respectively. Apparently, D_DES = 1, just for writing the received data to the buffer. Let n_rem denote the number of data remaining in the FIFO when a new data-transfer begins. D_SER is a function of m and n_rem, while D_FPGA, D_DES and D_PE are constants.

Fig.7 shows examples of transferring intermittent or successive data with serialization into m = 4 segments. The FIFO input at cycle 3 gives a single data-transfer with n_rem = 0. As depicted, writing the FIFO, reading out to the buffer and transferring the segments take 1, 1, and 4 (= m) cycles, respectively. Accordingly, D_SER = 6 for a single data-transfer. On the other hand, a succeeding data-transfer takes more than 6 cycles when n_rem ≥ 1, because the transferred data remain in the FIFO before being read out to the buffer. A data-transfer initiated less than 4 cycles after the preceding data-transfer has n_rem ≥ 1. The FIFO input at cycle 20 gives a data-transfer with n_rem = 1, causing its data to stay in the FIFO for 4 (= m) cycles. The serialization delay of this data-transfer is obtained by:

    D_{SER}(n_{rem} = 1) = 4 + 5 = 9.    (18)

Similarly, the data transfer at cycle 36 with n_rem = 3 has the serialization delay of:

    D_{SER}(3) = 3 \times 4 + 5 = 17,    (19)

because the data remain in the FIFO for at most 3 × m cycles.


Based on the above consideration, we have the following function for the maximum serialization delay of the data transfer with m and n_rem:

    D_{SER}(m, n_{rem}) = (n_{rem} + 1)m + 1.    (20)

By substituting Eq.(20) into Eq.(17), we obtain:

    D = (n_{rem} + 1)m + 1 + D_{FPGA} + D_{DES} + D_{PE}.    (21)

In the next section, we will estimate the maximum delay of BRM by obtaining the maximum n_rem in actual computation. Here we derive the constraint on m from the bandwidth model and the delay model of BRM. Let D_max be the maximum permissible delay given by each computing program. Note that D_max depends on computing programs, while n_rem depends on both computing programs and m. Given D_max and n_rem, we have the delay constraint:

    D = (n_{rem} + 1)m + 1 + D_{FPGA} + D_{DES} + D_{PE} \le D_{max},    (22)

or

    m \le \frac{D_{max} - 1 - D_{FPGA} - D_{DES} - D_{PE}}{n_{rem} + 1}.    (23)

Eqs.(16) and (23) give the following constraint for m:

    \frac{2d \, b f N_{PE}}{W_{avail}^{F}} \le m \le \frac{D_{max} - 1 - D_{FPGA} - D_{DES} - D_{PE}}{n_{rem} + 1}.    (24)
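A small helper that evaluates the two sides of Eq.(24) shows how a feasible m would be picked; the delay constants below (D_PE = 7, D_DES = 1, D_FPGA = 6) are the values reported for the prototype in Section 4, and the remaining arguments are free parameters:

    #include <math.h>

    /* Returns 1 and stores the smallest feasible m if Eq. (24) can be met. */
    int choose_m(int d, double b_bits, double f_mhz, double n_pe,
                 double w_avail_mbits, int d_max, int n_rem, int *m_out)
    {
        const int d_pe = 7, d_des = 1, d_fpga = 6;          /* delay cycles */
        double m_low  = (2.0 * d * b_bits * f_mhz * n_pe) / w_avail_mbits;
        double m_high = (double)(d_max - 1 - d_fpga - d_des - d_pe)
                        / (double)(n_rem + 1);
        int m = (int)ceil(m_low);
        if (m < 1) m = 1;
        if ((double)m > m_high)
            return 0;               /* no integer m satisfies Eq. (24)      */
        *m_out = m;                 /* smallest m that still fits            */
        return 1;
    }

Choosing the smallest feasible m keeps the added transfer delay of Eq.(21) low while still fitting within the available I/O bandwidth of Eq.(15).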

If there exists an m that satisfies this requirement with the given parameters, we can connect FPGAs to construct a scalable SCMA without any performance degradation. As described in Section 4.5, we can determine an m satisfying Eq.(24) with D_max and the maximum n_rem for typical computations. The parameters f, N_PE, W_{avail}^{F}, D_FPGA and D_DES are given by the FPGA devices, and d, b and D_PE are design parameters of the SCMA, while D_max and n_rem depend on the computing programs. We will quantitatively evaluate them in the next section.

Once m is determined, we select the depth of the BRM FIFOs, d_FIFO, so that it is sufficiently larger than the maximum n_rem of typical computations. As described in Section 4.5, for example, the two benchmarks give an n_rem of at most 3 for the available m, and a d_FIFO of 32 is expected to be sufficiently large for most sequences. In our design policy, PEs should have simple mechanisms; e.g., a PE does not stall due to FIFO overflow. Thus we use FIFOs without overflow detection for BRM. Therefore, sequences must always have their maximum n_rem less than d_FIFO. We can insert nop instructions in a sequence to reduce n_rem if necessary; however, we consider that most sequences do not require nop-insertion for a sufficiently large d_FIFO, such as 32.

4. IMPLEMENTATION AND PERFORMANCE EVALUATION

In this section, we discuss the feasibility of our SCMA while evaluating the computational performance, the power consumption and the scalability for available FPGA products through prototype implementation and cycle-accurate software simulation.

Fig. 8. FPGA prototyping board: DN7000K10PCI, and its block diagram. (Three Stratix II FPGAs with DDR2 memories, external connectors, a PCI controller, a configuration FPGA, JTAG, USB and CF interfaces.)

4.1 Implementation

We implemented the SCMA with the FPGA prototyping board DN7000K10PCI shown in Fig.8. The board has two ALTERA Stratix II EP2S180-5 FPGAs, FPGA-A and FPGA-B, DDR2 memories and a PCI controller. Each Stratix II FPGA on the board has a total of 143520 adaptive look-up tables (ALUTs), which can emulate up to 1.2 million ASIC gates. The FPGA also contains a total of 96 embedded 36-bit multipliers (DSP blocks) and three types of configurable SRAMs: 512-bit M512 blocks, 4-Kbit M4K blocks and 512-Kbit M-RAM blocks. We wrote Verilog codes and compiled them with ALTERA Quartus II version 8.0 SP1.

Fig.9 shows an overview of the implemented SCMA with 24 × 8 PEs in all. On each FPGA, we implement a 12 × 8 SCMA, the array controller including nine sequencers, and the communication unit without BRM. For the communication unit, we use LVDS (low-voltage differential signaling) transceivers (Tx) and receivers (Rx), which are embedded units of the Stratix II FPGA. These embedded Tx and Rx provide x8 serialization and deserialization, resulting in 4.98 GByte/s of unidirectional bandwidth. Since this bandwidth is sufficient, we directly connect the eight PEs on FPGA-A with the eight PEs on FPGA-B without BRM. There is still room for implementing one more LVDS-based communication unit on each FPGA. Therefore, this Stratix II FPGA can have a total available bandwidth of W_{avail}^{F} = 4.98 × 2 = 9.96 GByte/s on the prototyping board, which is sufficient to connect any number of FPGAs in a 1D array without BRM. The LVDS-based communication unit has a delay of D_FPGA = 6 cycles.

In the present implementation, the size of the local memory on each PE is 1 KByte, where up to 256 single-precision floating-point numbers are stored. The MAC unit contains an adder and a multiplier for single-precision floating-point numbers in the IEEE754 format, except denormalized numbers. The MAC unit is implemented with one embedded 36-bit multiplier. The four communication FIFOs of a PE each have 32 entries, which are enough for send/receive synchronization in the microprograms. The 12 × 8 SCMA on each FPGA consumes 81449 ALUTs (56.8%), 96 36-bit DSP blocks (100%), 386 (≈ 4 × 96) M4K blocks (50.3%) and 706 (≈ 7.4 × 96) M512 blocks (75.9%). The M4K blocks are utilized to implement the dual-read-port local memories of the PEs. The M512 blocks are used to implement the communication FIFOs. The 96 PEs share the nine sequencers, which are implemented using 1585 ALUTs (1.1%) and 9 M-RAM blocks (100%). The size of each sequence memory is 64 KBytes, where up to 8192 microinstructions can be stored. These nine sequencers are allocated to the 60 internal PEs, the 6 left PEs, the 6 right PEs, the 10 top PEs, the 10 bottom PEs, and the top-left, top-right, bottom-left and bottom-right PEs, respectively. This allocation is useful for different computations at the grid boundary.

Fig. 9. Implemented SCMA with the two Stratix II FPGAs. (Each FPGA holds a 12 × 8 systolic array and an array controller with 9 sequencers for the upper, lower, left, right, corner and internal PE groups; the FPGAs are connected by LVDS SER/DES links of 4.98 GB/s, and FPGA-A is attached to the PCI bus through the PCI controller.)

The implemented SCMA operates at f = 106 MHz, though further optimization is probably possible. Each PE then provides 212 MFlops (= 106 MHz × 2), and the sub-array on each FPGA provides 20.35 (= 0.212 × 96) GFlops. The maximum unidirectional bandwidth required for the link between FPGA-A and FPGA-B is W_{max}^{link} = b f N_{PE} = 33 × 106 × 8 = 27984 Mbit/s = 3.5 GByte/s. Since W_{avail}^{F} / 2 = 4.98 GByte/s > W_{max}^{link}, the two FPGAs can fully operate at 106 MHz while connected without BRM. Accordingly, the double-sized array on the two FPGAs achieves a peak performance of 40.7 (= 0.212 × 192) GFlops.

We refer to the 12 × 8 SCMA implemented with FPGA-A as the single-FPGA SCMA. From software, the 24 × 8 PEs implemented over FPGA-A and FPGA-B are also transparently seen as a single array, which is referred to as the double-FPGA SCMA. These SCMAs have an idle mode and a computing mode. In the idle mode, all the local memories and the sequence memories on FPGA-A and FPGA-B are arranged in a single memory space, which is accessed via the PCI bus. We use this mode to initialize all the local memories and write microprograms into the sequencers before computation, and to read the computational results from the local memories after computation. By writing a 1-bit '1' to a control register, which is also mapped to a certain memory address, we switch to the computing mode. In the computing mode, the SCMA performs computation under the control of the sequencers.

4.2 Benchmarks

For benchmarks, we use the applications summarized in Table II. The red-black SOR (successive over-relaxation) method [Hageman and Young 1981], RB-SOR, is one of the iterative solvers for Poisson's or Laplace's equation. In the red-black SOR method, the grid points are treated as a checkerboard with red and black points, and each iteration is split into a red-step and a black-step.

Table II. Benchmark computations.

    Red-black-SOR: iterative numerical solver of Laplace's equation, ∇²φ = 0. A simple time-independent heat-conduction problem is computed on a 2D M × N grid.

    Fractional-step method: numerical method to compute incompressible viscous flow. The 2D square driven-cavity flow on an M × N grid (stationary walls with u = 0, v = 0, and the upper surface moving with u = u_f, v = 0) is computed, giving a time-independent result.

    FDTD method: numerical method to solve Maxwell's equations for electromagnetic problems. 2D propagation of electromagnetic waves from a wave source of Hz at (x_s, y_s), with an absorbing boundary condition on the border, is computed on an M × N grid.

The red- and black-steps compute the red points and the black points, respectively. We solve the heat-conduction problem on the 2D square plate shown in Table II by solving Laplace's equation. For the single-FPGA SCMA and the double-FPGA SCMA, we compute 2.0 × 10^5 iterations with a 96 × 96 grid, and 1.0 × 10^5 iterations with a 192 × 96 grid, respectively, where the PEs each take charge of an 8 × 12 sub-grid in both cases. Figs.10a and 10b show the computational results of RB-SOR for the 192 × 96 grid; the latter is the converged solution.

The fractional-step method [Kim and Moin 1985; Strikwerda and Lee 1999], FRAC, is a typical and widely-used numerical method for computing incompressible viscous flows by numerically solving the Navier-Stokes equations. We compute the 2D square driven-cavity flow shown in Table II with the kinematic viscosity ν = 0.025 and Re (Reynolds number) = 40. The left, right and lower walls of the square cavity are stationary, and only the upper surface is moving to the right with a velocity of u = 1.0. For the single-FPGA SCMA and the double-FPGA SCMA, we compute 8000 time-steps with a 48 × 48 grid, and 4000 time-steps with a 96 × 48 grid, respectively, where each PE is in charge of a 4 × 6 sub-grid in both cases. In each time-step, we solve the Poisson's equation of Eq.(4) with 250 iterations of the Jacobi method. The viscosity and the non-slip condition of the upper surface finally cause a vortex flow in the square cavity, as shown in Figs.10c and 10d.

The finite-difference time-domain (FDTD) method [Yee 1966; Allen and C. 1996], FDTD, is a powerful and widely-used tool to solve a wide variety of electromagnetic problems; it provides a direct time-domain solution of Maxwell's equations discretized by difference schemes on a uniform grid and at discrete time intervals. We compute the 2D electromagnetic-wave propagation shown in Table II. At the left-bottom corner, we put a square-wave source with an amplitude of 1 and a period of 80 time-steps. On the border, Mur's first-order absorbing boundary condition is applied.
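For reference, the red-black ordering described above can be sketched in plain C as follows (the grid layout, the relaxation factor omega and the boundary handling are illustrative; the paper's version runs as a microprogram on the SCMA):

    /* One red-black SOR iteration for the 2D Laplace equation on an
     * (M+2) x (N+2) grid with fixed boundary values. color 0 = red, 1 = black.
     */
    static void rb_sor_sweep(int M, int N, float phi[M + 2][N + 2],
                             float omega, int color)
    {
        for (int i = 1; i <= M; i++) {
            for (int j = 1; j <= N; j++) {
                if (((i + j) & 1) != color)
                    continue;                       /* checkerboard split */
                float gs = 0.25f * (phi[i - 1][j] + phi[i + 1][j] +
                                    phi[i][j - 1] + phi[i][j + 1]);
                phi[i][j] += omega * (gs - phi[i][j]);
            }
        }
    }

    void rb_sor_iteration(int M, int N, float phi[M + 2][N + 2], float omega)
    {
        rb_sor_sweep(M, N, phi, omega, 0);   /* red step   */
        rb_sor_sweep(M, N, phi, omega, 1);   /* black step */
    }

Each sweep is again a neighboring accumulation in the sense of Eq.(7), which is why all three benchmarks map onto the same PE data-path.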


Fig. 10. Computational results of the red-black-SOR (a and b), the fractional-step method (c and d), and the FDTD method (e to h): a. red-black-SOR, 2.0 × 10^2 iterations; b. red-black-SOR, 1.0 × 10^5 iterations; c. fractional-step method, 120 time-steps; d. fractional-step method, 4000 time-steps; e. FDTD, 120 time-steps; f. FDTD, 240 time-steps; g. FDTD, 360 time-steps; h. FDTD, 620 time-steps. For the red-black-SOR, φ (temperature) is visualized in color. For the fractional-step method, the pressure per density and the velocity vectors are visualized. For the FDTD method, the norm of the electric field is visualized.


Table III. Performance, power and energy results.

                                          single-FPGA SCMA (96 PEs)              double-FPGA SCMA (192 PEs)
                                          RB-SOR      FRAC        FDTD           RB-SOR      FRAC        FDTD
 grid size                                96 × 96     48 × 48     72 × 72        192 × 96    96 × 48     144 × 72
 iterations                               200000      8000        560000         100000      4000        280000
 Pentium4, 3.4GHz
   gcc -O3               time [s]         10.44       33.69       31.78          10.32       33.92       32.00
                         power [W]        -           125.73      125.99         -           125.91      125.95
                         energy [J]       -           4222.4      4009.9         -           4229.8      4038.5
   icc -O3 -ipo -xP -static
                         time [s]         9.94        19.80       9.50           9.87        20.46       9.49
                         speedup to gcc   1.05        1.70        3.35           1.05        1.66        3.37
 FPGA-based SCMA, 106MHz
                         MAC util.        82.7%       87.6%       80.2%          82.9%       87.7%       80.6%
                         GFlops           16.8        17.8        16.3           33.7        35.7        32.8
                         total cycles     115200045   245512041   332668023      57600045    122756041   166334023
                         time [s]         1.087       2.316       3.138          0.543       1.158       1.569
                         speedup to gcc   9.60        14.55       10.13          19.0        29.3        20.4
                         speedup to icc   9.14        8.55        3.03           18.2        17.7        6.05
                         power [W]        -           86.56       90.21          -           101.17      109.81
                         energy [J]       -           200.12      282.46         -           117.03      171.96

For the single-FPGA SCMA and the double-FPGA SCMA, we compute 5.6 × 10^5 time-steps with a 72 × 72 grid and 2.8 × 10^5 time-steps with a 144 × 72 grid, respectively, where each PE is in charge of a 6 × 9 sub-grid in both cases. Figs.10e to 10h show the computed time-dependent results.
For comparison, we wrote programs of these benchmarks in C to be executed by a Linux PC, an hp ProLiant ML310 G3 with an Intel Pentium4 processor model 550, which has a 1MB L2 cache and a single core operating at 3.4GHz. All the floating-point computations are performed in single precision. We compiled the programs using gcc version 3.3.2 and Intel icc version 10.0, and measured the execution time of the core computation, which is the same part as that executed by the FPGAs, with the gettimeofday system call. We used gcc with "-O3", and icc with "-O3 -ipo -xP (enabling SSE3) -static".
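The CPU-side measurement can be sketched as the usual gettimeofday pattern around the core computation; run_core_computation below is a hypothetical placeholder for the benchmark kernel, not the actual benchmark code.

    #include <stdio.h>
    #include <sys/time.h>

    /* Placeholder for the benchmark kernel: in the real measurement the
     * RB-SOR, FRAC or FDTD core loop would be called here.              */
    static void run_core_computation(void)
    {
        volatile float s = 0.0f;
        for (long i = 0; i < 10000000L; i++)
            s += 1.0e-7f;
    }

    int main(void)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);            /* start of the core computation */
        run_core_computation();
        gettimeofday(&t1, NULL);            /* end of the core computation   */

        double sec = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6;
        printf("core computation time: %.3f s\n", sec);
        return 0;
    }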


For FPGAs, we wrote sequences of the benchmark computations. As shown in Fig.9, the sub-array implemented on each FPGA has nine sequencers, which are allocated to the PE groups containing the upper, lower, left and right boundaries, the upper-left, upper-right, lower-left and lower-right corners, and the internal grid-points, respectively, because these groups need different sequences for different boundary computations. Although the MAC unit does not support denormalized numbers, the FPGA-based computation gave almost the same results as the software computation with the Pentium4 processor, which are accurate enough for users of these scientific simulations.

4.3 Performance evaluation of the implemented SCMA

Table III shows the performance results of the 3.4GHz Pentium4 processor and the 106MHz FPGA-based SCMAs. The implemented SCMA requires polling its status register to know when the SCMA finishes computation. Since the polling delay makes the measurement inaccurate, we count the number of cycles executed by each PE and calculate the exact execution time for both the single-FPGA SCMA and the double-FPGA SCMA, instead of measuring the elapsed time with the gettimeofday system call. Accordingly, the results in Table III evaluate the performance of the core computation itself.
For RB-SOR, FRAC and FDTD, the utilization of the MAC unit is around 83%, 88% and 80%, respectively. Although multiplying the first member of each neighboring accumulation causes about a 12 to 20% loss of utilization, the customized data-path allows the applications to enjoy these high utilizations. These MAC utilizations give sustained performances of 16.8 GFlops, 17.8 GFlops and 16.3 GFlops for RB-SOR, FRAC and FDTD, respectively, on the single-FPGA SCMA. The double-FPGA SCMA achieves 33.7 GFlops, 35.7 GFlops and 32.8 GFlops, respectively. These sustained performances clearly exceed the peak single-precision performance of the 3.4GHz Pentium4 processor: 13.6 GFlops given by SSE3 (streaming SIMD extension 3) instructions. In contrast, neither gcc nor icc comes close to this peak performance. Consequently, the double-FPGA SCMA provides 19 to 29 times faster computation than gcc. Although icc generates better code utilizing SSE3 instructions, the SCMA is still 6 to 18 times faster than icc.
Since newer microprocessors and FPGAs have higher peak performance, the comparison of absolute performance with the Pentium4 processor is no more than a reference point. However, note that the MAC utilization is kept almost the same for both the single-FPGA SCMA and the double-FPGA SCMA. In general, microprocessor cores suffer from low utilization of floating-point units due to the lack of instruction-level parallelism, the overhead of complicated but indispensable data/program structures, e.g., address computations for array accesses and loop control, and limited memory bandwidth. Moreover, parallel computers employing many processor cores also have difficulty in keeping high efficiency. The experimental results show that our SCMA on an FPGA-array connected with sufficient bandwidth provides complete scalability, keeping the parallel-processing efficiency around 100%.
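The bookkeeping behind Table III can be reproduced as follows: the execution time follows from the counted cycles and the 106MHz clock, and the sustained performance from the peak rate scaled by the MAC utilization. The example below uses the double-FPGA FRAC numbers; it is a sketch of this arithmetic, not additional measured data.

    #include <stdio.h>

    /* Derive execution time and sustained GFlops from a cycle count, as done
     * for Table III.  Example: double-FPGA FRAC (192 PEs, 106 MHz).          */
    int main(void)
    {
        const double    freq_hz            = 106.0e6;      /* SCMA clock               */
        const long long cycles             = 122756041LL;  /* counted execution cycles */
        const int       num_pes            = 192;
        const double    flops_per_pe_cycle = 2.0;          /* one MAC = 2 flops        */
        const double    mac_util           = 0.877;        /* measured utilization     */

        double time_s  = (double)cycles / freq_hz;
        double peak    = num_pes * flops_per_pe_cycle * freq_hz / 1.0e9;
        double sustain = peak * mac_util;

        printf("time = %.3f s, peak = %.1f GFlops, sustained = %.1f GFlops\n",
               time_s, peak, sustain);   /* ~1.158 s, 40.7 GFlops, ~35.7 GFlops */
        return 0;
    }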


Fig. 11. Power measurement of the host PC and the SCMA. All the DC inputs to the main board are measured. The host is an hp ProLiant ML310 G3 (Intel E7230 chipset, Intel Pentium4 550 3.4GHz, 2GB PC4200 DDR2, on-board graphics, GbE, and an 80GB SATA HDD HDS728080PLA380), with the FPGA board DN7000K10PCI (Stratix II EP2S180) on the PCI bus. A HIOKI Memory HiCorder 8855 records the DC +3.3V, +5V and +12V voltages and currents supplied from the power supply unit to the main board.

4.4 Power consumption

We evaluate the actual power and energy consumption of the FPGA-based SCMAs compared to the software computation using gcc. To obtain the power consumption, we use a digital oscilloscope, a HIOKI Memory HiCorder 8855, which can measure and record samples of DC voltage and current. Fig.11 shows the power measurement of the host PC and the FPGA prototyping board. We measure the DC inputs to the main board: DC +3.3V, DC +5V and DC +12V. The sampling rate is 1.0 × 10^3 samples/sec. The oscilloscope also calculates the power from the measured voltage and current. Thus, we observe the net power consumption of the system including the chipset, the CPU, the main memories, the other peripherals and the FPGA board; we remove the FPGA board when we measure the power for software computation.
Fig.12 shows the records of the DC power for the fractional-step computation of 4000 time-steps with the 96 × 48 grid. Software computation with the Pentium4 model 550 processor, whose TDP is 115 W, has an average power consumption of 125.9 W, while computation with the double-FPGA SCMA consumes 101.2 W. Since the SCMA reduces the computing time, the total energy for the entire computation with the SCMA is only 117 J, which is 2.8% of the 4230 J consumed for software computation. Table III summarizes the average power and the total energy for the FRAC and FDTD computations with the Pentium4 processor and the single-FPGA and double-FPGA SCMAs. The SCMAs consume 69% to 87% of the power of the Pentium4 processor, and their computational speedup allows the SCMAs to require only 2.8% to 7.0% of the total energy of the Pentium4 processor. Note that the total energy for the double-FPGA SCMA is less than that for the single-FPGA SCMA, while the Pentium4 processor consumes almost the same energy for these benchmarks. This shows that the net energy consumed only by the FPGAs is very low.
To evaluate the energy consumption of scalable systems, we estimate the energy-delay product, EDP, which weights the energy consumption by the computing time [Feng et al. 2005][Sano et al. 2008]. We compare the EDPs for a CPU cluster and an FPGA cluster with a single CPU node, shown in Fig.13.


Fig. 12. Records of total DC power in computing the fractional-step method (96 × 48, 4000 time-steps): a. Pentium4 3.4GHz without the FPGA board (computation by Pentium4: 33.6 sec, power = 125.9 W, energy = 4230 J; idle power = 69.9 W); b. the double-FPGA SCMA at 106MHz (computation by FPGAs, triggered by the software "Run": 1.16 sec, power = 101.2 W, energy = 117 J; idle power = 85.9 W with the FPGAs configured).

Suppose that the CPU cluster with n nodes achieves the same computing performance as the SCMA on the FPGA cluster with N Stratix II FPGAs. Let P_1^CPU and P_1^IDLE denote the average computing power and the average idle power of a single node only with a microprocessor, respectively. Let P_1^FPGA be the average computing power of the FPGA board per FPGA. Since the measured power of the double-FPGA SCMA is (P_1^IDLE + 2 P_1^FPGA), we can obtain P_1^FPGA from the measured power consumption. Let D_1^CPU and D_1^FPGA denote the delay, or the computing time, of a single node only with a microprocessor and the delay of an SCMA with a single FPGA, respectively. For simplicity, we assume that each of the n CPU nodes consumes the same power as P_1^CPU, and that the total computing time with n nodes, D_n^CPU, is modeled as

    D_n^CPU = (1 / (n e)) D_1^CPU,                                          (25)

where e is the parallel-processing efficiency (0 < e ≤ 1). If we ignore the power for the interconnection network, the computing power consumption of n nodes is P_n^CPU = n P_1^CPU. Therefore the EDP of n CPU nodes is given by

    EDP_n^CPU = P_n^CPU (D_n^CPU)^2 = n P_1^CPU (D_n^CPU)^2.                (26)

Assuming complete scalability for the SCMA,

    D_N^FPGA = (1 / N) D_1^FPGA.                                            (27)

The EDP of the SCMA with N FPGAs and a single host is

    EDP_N^FPGA = (P_1^IDLE + N P_1^FPGA) (D_N^FPGA)^2.                      (28)


Fig. 13. A CPU cluster and an FPGA cluster for EDP estimation. We assume no power consumption for the network. a. CPU cluster: n CPU nodes (P_n^CPU, D_n^CPU), each with power P_1^CPU and delay D_1^CPU, connected by an interconnection network. b. FPGA cluster with a host CPU: N FPGAs (P_N^FPGA, D_N^FPGA) with per-FPGA power P_1^FPGA and single-FPGA delay D_1^FPGA, attached to a CPU node with idle power P_1^IDLE.

Fig. 14. EDP ratio of CPUs to FPGAs for equal computing speed of the fractional-step method, where e denotes the parallel-processing efficiency. The ratio is plotted against the number of Stratix II FPGAs for e = 0.2, 0.3, 0.5, 0.7 and 1.0.

Since we assume that the n CPUs achieve the same computing performance as the SCMA with N Stratix II FPGAs, D_n^CPU = D_N^FPGA, and therefore n = N D_1^CPU / (e D_1^FPGA). Accordingly, the EDP ratio of EDP_n^CPU to EDP_N^FPGA, R, is given by

    R = EDP_n^CPU / EDP_N^FPGA
      = n P_1^CPU (D_n^CPU)^2 / ( P_N^FPGA (D_N^FPGA)^2 )
      = N P_1^CPU D_1^CPU / ( e (P_1^IDLE + N P_1^FPGA) D_1^FPGA ).         (29)

Here, we obtain the parameters for the fractional-step method. As Table III shows for FRAC on the 96 × 48 grid, P_1^CPU = 125.91 W, D_1^CPU = 33.92 sec and D_2^FPGA = 1.158 sec. As shown in Fig.12a, P_1^IDLE = 69.9 W for a single-CPU node without FPGAs, and therefore P_2^FPGA = 101.17 − P_1^IDLE = 31.27 W. We simply estimate that P_1^FPGA = P_2^FPGA / 2 = 15.635 W and D_1^FPGA = D_2^FPGA × 2 = 2.32 sec. By substituting these parameters into Eq.(29), we obtain the following EDP ratio for FRAC:

    R ≈ 117.74 N / ( e (4.47 + N) ).                                        (30)

Fig.14 graphs Eq.(30) for various N and e. The EDP of the CPU cluster is at least 50 times that of the FPGA-based SCMA with four FPGAs for the same performance, even if we assume the CPU cluster has complete scalability with e = 1.0. This is due to the power consumption of the Pentium4 processor and the other peripherals of each node, although processors and peripherals with better performance per energy exist. Thus Fig.14 illustrates that the FPGA-based accelerator is much more efficient than the CPU cluster, and very promising for low-power and high-performance computation. Note, in particular, that the FPGA-based SCMA can operate independently of the host PC after it is initialized: we can utilize a low-performance but power-efficient host PC, or put the microprocessor to sleep while the FPGAs are computing.
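The EDP model of Eqs.(25)-(30) can be transcribed directly into C with the measured FRAC parameters; the sketch below merely reproduces the curves of Fig.14 and introduces no new measurements.

    #include <stdio.h>

    /* EDP ratio R of a CPU cluster to an FPGA cluster delivering equal speed,
     * Eq.(29), with the FRAC parameters measured in Section 4.4 (cf. Eq.(30)). */
    static double edp_ratio(double n_fpga, double e)
    {
        const double p_cpu  = 125.91;   /* P_1^CPU  [W]               */
        const double d_cpu  = 33.92;    /* D_1^CPU  [s]               */
        const double p_idle = 69.9;     /* P_1^IDLE [W]               */
        const double p_fpga = 15.635;   /* P_1^FPGA [W] (per FPGA)    */
        const double d_fpga = 2.32;     /* D_1^FPGA [s] (single FPGA) */

        return (n_fpga * p_cpu * d_cpu) /
               (e * (p_idle + n_fpga * p_fpga) * d_fpga);
    }

    int main(void)
    {
        const double eff[] = { 0.2, 0.3, 0.5, 0.7, 1.0 };

        for (int n = 1; n <= 4; n++) {
            for (int k = 0; k < 5; k++)
                printf("N=%d e=%.1f R=%6.1f  ", n, eff[k], edp_ratio(n, eff[k]));
            printf("\n");
        }
        return 0;   /* R ~ 117.74*N / (e*(4.47 + N)), cf. Eq.(30) */
    }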


Table IV. D_max, n_rem and availability of m by software simulation of a PE.

                              m = 2^3    m = 2^4    m = 2^5    m = 2^6
 RB-SOR   D_max               116 (west direction)
          max. n_rem          1          1          1          > 127
          availability        yes        yes        yes        no
 FDTD     D_max               82 (south direction)
          max. n_rem          3          3          6          8
          availability        yes        yes        no         no

4.5 Bandwidth and delay evaluation of BRM

In the prototype implementation with the two Stratix II FPGAs of Fig.9, the direct connection is sufficient for the 1D array of FPGA-A and FPGA-B. However, a 2D array of FPGAs with less I/O bandwidth than the Stratix II FPGA requires bandwidth reduction, because each FPGA needs twice as many links for the 2D mesh connection. Here we evaluate the number of serialization cycles, m, for a 2D FPGA-array, i.e., how far our bandwidth-reduction mechanism (BRM) can decrease the bandwidth for benchmark programs with the actual hardware design.
Eq.(23) gives the maximum m that reduces the bandwidth to W_reduc^F = W_max^F / m. To obtain the right-hand side of Eq.(23), we need the following parameters: D_max, D_FPGA, D_DES, D_PE and n_rem. As described in Section 3.4.3, D_DES = 1 and D_PE = 7 in our design. Section 4.1 shows that D_FPGA = 6 for the LVDS-based communication unit. D_max and n_rem depend on the benchmark programs. To obtain D_max and n_rem, we implement a cycle-accurate software simulator of a single PE and run sequences of RB-SOR and FDTD with the simulator to record the actual D_max and n_rem. We optimize the simulated sequences by hand for BRM of the 2D FPGA-array, while no optimization is necessary for the double-FPGA SCMA without BRM. In the optimized sequences, data-transfer instructions are spread almost evenly over cycles in each of the N, S, W and E communication directions. These sequences therefore differ from those used for the experiments summarized in Table III, although their computations are equivalent except for the grid size. The simulated sequences for RB-SOR and FDTD compute internal sub-grids of 8 × 24 and 6 × 9 points with a single PE, respectively.
Table IV shows the results of the simulation. The maximum permissible delay, D_max, of RB-SOR is 116 for the west direction. This means that the transferred data are used by the adjacent PE at least 116 cycles after the data are sent. Similarly, D_max of FDTD is 82 for the south direction. Next, we obtain the maximum number of data remaining in the FIFOs of BRM, n_rem, by using the simulator for m = 2^3, 2^4, 2^5 and 2^6. In the case of RB-SOR, n_rem is 1 for m = 2^3, 2^4 and 2^5, but more than 127 for 2^6.


This is because the sequence is well optimized in terms of data-transfer distribution, but 2^6 = 64 serialization cycles are too many to keep n_rem small. On the other hand, FDTD gives n_rem of 3, 3, 6 and 8 for m = 2^3, 2^4, 2^5 and 2^6, respectively. The gradually increasing n_rem means that the data transfers are not perfectly distributed and sometimes continue locally, but the serialization cycles do not reach the critical level of the FDTD sequence.
With these parameters, we evaluate the availability of each m. If all the parameters for m satisfy Eq.(23), that m is available for BRM in terms of the delay requirement. For example, m = 2^3 of RB-SOR satisfies Eq.(23) as follows:

    2^3 = 8 ≤ (D_max − 1 − D_FPGA − D_DES − D_PE) / (n_rem + 1)
            = (116 − 1 − 6 − 1 − 7) / 2 = 50.5.                             (31)

Accordingly, m = 2^3 is available for RB-SOR with BRM. On the other hand, m = 2^6 does not satisfy Eq.(23):

    2^6 = 64 > (116 − 1 − 6 − 1 − 7) / (127 + 1) ≈ 0.79.                    (32)

Thus, m = 2^6 is not available. Table IV summarizes the availability of m for RB-SOR and FDTD.
For both RB-SOR and FDTD, the bandwidth required for an FPGA can be reduced to W_reduc^F = W_max^F / 16 because m = 2^4 is available, thanks to the large number of maximum permissible delay cycles. Since the factor of bandwidth reduction is, in theory, determined by N_grid and N_inst as shown in Eq.(14), a similar m is expected to be available for other benchmark programs with a similar grid size and optimization. If we construct a 2D FPGA-array for the prototype SCMA, each FPGA requires W_max^F = 4 W_max^link = 4 × 3.5 = 14.0 GByte/s, which exceeds W_avail^F = 4.98 × 2 = 9.96 GByte/s. However, BRM with m = 2^4 decreases the required bandwidth to W_reduc^F = W_max^F / 16 = 14.0 / 16 = 0.875 GByte/s, which is much less than W_avail^F = 9.96 GByte/s. Thus, the 2D FPGA-array with BRM is feasible in terms of bandwidth and delay of inter-FPGA communication for actual benchmark programs.

4.6 Feasibility of SCMA with commercially available FPGAs

In this subsection, we discuss the inter-FPGA bandwidth and the feasibility of SCMAs over a 2D FPGA-array for commercially available FPGAs. Table V shows examples of high-end and low-end FPGAs of different generations and their hardware specifications [Altera Corporation 2008]: the number of LEs (logic elements), the total size of embedded memories in KBytes, the number of 36-bit DSP blocks, the number of LVDS channels for Tx and Rx each, and the bandwidth per LVDS channel. For Cyclone III, we estimated the number of 36-bit DSP blocks by dividing the number of 18-bit DSP blocks by 4.
We obtained the total unidirectional I/O bandwidth of an FPGA, W_avail^F, by calculating N_LVDS-ch × W_LVDS-ch, where N_LVDS-ch is the number of LVDS channels for Tx and Rx each, and W_LVDS-ch is the unidirectional bandwidth per channel. In this table, Stratix II EP2S180 has higher I/O bandwidth than that of our implementation because the FPGA prototyping board uses the I/O pins not only for the inter-FPGA connection, but also for the on-board memory and the PCI interface.


Table V. Estimation of array size and required bandwidth for a 2D FPGA-array based on hardware resources of FPGAs.

                                Stratix IV E    Stratix III L   Stratix II    Cyclone III
                                EP4SE680        EP3SL340        EP2S180       EP3C120
 LEs                            681,100         337,500         179,400       119,088
 Memory [KBytes]                3,936           2,034           1,145         486
 36-bit DSPs                    340             144             96            288/4
 LVDS chs. (Tx, Rx each)        132             132             156           110
 BW per ch [MB/s]               200             156.25          125           80
 W_avail^F [GB/s]               26.4            20.6            19.5          8.8
 Assumed PE freq [MHz]          106             106             106           106
 Estimated # of PEs             340             144             96            72
 Peak GFlops                    72.1            30.5            20.3          15.3
 W_max^F [GB/s]                 32.2            21.0            17.1          14.8
 W_reduc^F [GB/s]    m = 2      16.1            10.5            8.55          7.40
 with BRM            m = 4      8.05            5.25            4.28          3.70
                     m = 8      4.03            2.62            2.14          1.89
                     m = 16     2.01            1.31            1.07          0.93

Since single-precision floating-point numbers are used, b = 33. We conservatively assume that all these FPGAs can operate at f = 106MHz for SCMAs. This assumption is based on synthesis reports of a single PE, where the maximum frequency is 145.3MHz, 166.5MHz, 163.7MHz and 118.8MHz for Stratix IV, Stratix III, Stratix II and Cyclone III, respectively. We estimate the number of PEs implemented on an FPGA, N_PE^2, from the number of 36-bit DSP blocks. The peak GFlops row shows the peak performance of each FPGA. We then obtained the maximum unidirectional bandwidth required for one link as W_max^link = N_PE b f / 8000 GByte/s. For a 2D FPGA-array, the total required bandwidth is W_max^F = 4 W_max^link because each FPGA has 2d links in a d-dimensional FPGA-array.
Under these assumptions and estimates, the Stratix series of ALTERA's high-end FPGAs have I/O bandwidth close to the required bandwidth, while the low-end Cyclone III FPGA has almost half of the required bandwidth. This means that the high-end FPGAs have I/O bandwidth better balanced with their computing performance than the low-end FPGAs. However, the available I/O bandwidth can be lower than the required bandwidth if we do not use BRM. Thus BRM is essential to implement scalable FPGA-arrays without performance degradation of each FPGA. If we assume that 2 ≤ m ≤ 16 based on the evaluation of BRM, all the FPGAs sufficiently satisfy the condition that W_reduc^F ≤ W_avail^F, giving us more design choices.
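The bandwidth rows of Table V can be reproduced with a short sketch like the one below, which also checks W_reduc^F ≤ W_avail^F for a chosen m; taking the PE-array edge as the square root of the estimated PE count is our reading of the estimation above, with b = 33 and f = 106MHz as stated.

    #include <stdio.h>
    #include <math.h>

    /* Estimate the inter-FPGA bandwidth requirement of a 2D FPGA-array,
     * following the estimation used for Table V (Stratix II EP2S180 example). */
    int main(void)
    {
        const double b       = 33.0;    /* bits per transferred word (b = 33)      */
        const double f       = 106.0e6; /* PE clock [Hz]                           */
        const int    num_pes = 96;      /* estimated # of PEs on the FPGA          */
        const double w_avail = 19.5;    /* W_avail^F [GB/s] from the LVDS channels */
        const int    m       = 16;      /* BRM serialization factor                */

        double n_pe_edge = sqrt((double)num_pes);       /* PEs along one edge */
        double w_link    = n_pe_edge * b * f / 8.0e9;   /* per link [GB/s]    */
        double w_f       = 4.0 * w_link;                /* 2d links, d = 2    */
        double w_reduc   = w_f / m;                     /* with BRM           */

        printf("W_link = %.2f GB/s, W_max^F = %.1f GB/s, W_reduc^F = %.2f GB/s\n",
               w_link, w_f, w_reduc);
        printf("fits available I/O bandwidth: %s\n",
               w_reduc <= w_avail ? "yes" : "no");
        return 0;
    }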


This result means that our SCMA has complete scalability with multiple FPGAs arranged in a 2D array for real computing problems. Note that the required bandwidth does not grow as fast as the peak performance of each FPGA. This is because the peak performance and the required bandwidth scale as O(N_PE^2) and O(N_PE), respectively. Thus our strategy of homogeneously partitioning an SCMA over FPGAs is feasible and suitable for commercially available FPGAs with such balanced I/O bandwidth and computing performance.

5. CONCLUSIONS

In this paper, we have proposed the scalable FPGA-array with bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient custom-computing machines (CCMs) for scientific simulations based on finite difference methods. The FPGA-based CCMs are designed as the programmable systolic computational-memory array (SCMA), which achieves both scalable arithmetic performance and scalable memory bandwidth. The minimum programmability of the SCMA provides flexibility and high productivity to the CCM for various computing kernels and boundary computations. By introducing the homogeneous partitioning, we allowed SCMAs to be extensible over an array of multiple tightly-coupled FPGAs. A large SCMA implemented on an FPGA-array provides scalable and high-performance computation with high parallel-processing efficiency.
The bandwidth of inter-FPGA communication is the key to implementing such a scalable SCMA: the bandwidth required by the computing performance of each FPGA must be provided. To satisfy this requirement for FPGAs without sufficiently wide bandwidth, we have proposed BRM based on time-division multiplexing. We described the behavior of BRM, and then formulated the constraint model for bandwidth and delay of communication with BRM.
For feasibility demonstration and quantitative evaluation, we designed and implemented the SCMA of 24 × 8 = 192 PEs with the two Stratix II FPGAs on the same board. The SCMA operates at 106MHz, and the implemented communication units using LVDS provide sufficient inter-FPGA bandwidth for the 1D FPGA-array to fully perform computations. The two FPGAs have a peak performance of 40.7 GFlops for single-precision floating-point computations, and they give sustained performances of 32.8 to 35.7 GFlops with high utilization of the MAC units for the benchmark computations. The double-FPGA SCMA achieves 6 to 29 times faster computation than software computation with the Pentium4 processor operating at 3.4GHz. In particular, the two FPGAs have complete scalability with an ideal parallel-processing efficiency of almost 100%, in addition to the high utilization of the MAC units.
Furthermore, the SCMAs demonstrate much higher power efficiency: the FPGA-based computation requires 69% to 87% of the power and only 2.8% to 7.0% of the energy of the same computations performed in software with the 3.4GHz Pentium4 processor. By estimating the energy-delay product (EDP), we also showed that an FPGA-based SCMA has a much smaller EDP than a CPU cluster with the same computing speed. This means that FPGAs can provide power-efficient acceleration of scientific simulations.
We also discussed the scalability and feasibility of a 2D FPGA-array with BRM. By using the software simulator, we obtained the necessary parameters for the


constraint model of the serialization cycles in BRM. The simulation results show that commercially available high-end and low-end FPGAs can be used to construct a scalable SCMA thanks to the effective bandwidth reduction. This is because the actual computation permits the delay-cycle increase required by BRM, and data-transfer instructions can be spread sufficiently over cycles. Thus, optimization of the computing programs is important for BRM to work effectively. Currently we optimize programs by hand; however, the ideal optimization is not always obtained. A scheduler or compiler for SCMAs is part of our future work. The limited size of the local memories is also a problem of the FPGA-based SCMAs, although Table V shows that the situation is improving with the favorably growing size of embedded memories in FPGAs. We will also consider implementation with external DRAMs, which corresponds to another style of SCMA implementation. We are now developing a larger-scale SCMA with stackable FPGA-boards like the ALTERA DE3 [Altera Corporation 2008]. Furthermore, we will consider 3D computation on a large-scale SCMA, and data-paths/networks of PEs more dedicated to specific target computations.

Acknowledgments

This research was supported by MEXT Grant-in-Aid for Young Scientists (B) No. 20700040 and MEXT Grant-in-Aid for Scientific Research (B) No. 19360078. We thank Professor Takayuki Aoki and Associate Professor Takeshi Nishikawa, Tokyo Institute of Technology, and Assistant Professor Hiroyuki Takizawa, Tohoku University, for the power measurement.

REFERENCES

Taflove, A. and Hagness, S. C. 1996. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Norwood, MA: Artech House Inc.
Altera Corporation. 2008. http://www.altera.com/literature/.
Chen, W., Kosmas, P., Leeser, M., and Rappaport, C. 2004. An FPGA implementation of the two-dimensional finite-difference time-domain (FDTD) algorithm. Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA2004), 213–222.
Chiu, M., Herbordt, M., and Langhammer, M. 2008. Performance potential of molecular dynamics simulations on high performance reconfigurable computing systems. Proceedings of the International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08), DOI: 10.1109/HPRCTA.2008.4745685.
Compton, K. and Hauck, S. 2002. Reconfigurable computing: A survey of systems and software. ACM Computing Surveys 34, 2 (June), 171–210.
deLorimier, M. and DeHon, A. 2005. Floating-point sparse matrix-vector multiply for FPGAs. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 75–85.
Dou, Y., Vassiliadis, S., Kuzmanov, G. K., and Gaydadjiev, G. N. 2005. 64-bit floating-point FPGA matrix multiplication. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 86–95.
Durbano, J. P., Ortiz, F. E., Humphrey, J. R., Curt, P. F., and Prather, D. W. 2004. FPGA-based acceleration of the 3D finite-difference time-domain method. Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 156–163.
Elliott, D. G., Stumm, M., Snelgrove, W., Cojocaru, C., and Mckenzie, R. 1999. Computational RAM: Implementing processors in memory. IEEE Design & Test of Computers 16, 1 (January–March), 32–41.


Fatahalian, K., Sugerman, J., and Hanrahan, P. M. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 133–137.
Feng, X., Ge, R., and Cameron, K. W. 2005. Power and energy profiling of scientific applications on distributed systems. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium.
Ferziger, J. H. and Perić, M. 1996. Computational Methods for Fluid Dynamics. Springer-Verlag Berlin Heidelberg.
Hageman, L. A. and Young, D. M. 1981. Applied Iterative Methods. Academic Press.
Hauser, T. 2005. A flow solver for a reconfigurable FPGA-based hypercomputer. AIAA Aerospace Sciences Meeting and Exhibit, AIAA-2005-1382.
He, C., Lu, M., and Sun, C. 2004. Accelerating seismic migration using FPGA-based coprocessor platform. Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 207–216.
He, C., Zhao, W., and Lu, M. 2005. Time domain numerical simulation for transient waves on reconfigurable coprocessor platform. Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 127–136.
Hemmert, K. S. and Underwood, K. D. 2005. An analysis of the double-precision floating-point FFT on FPGAs. Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 171–180.
Hoshino, T., Kawai, T., Shirakawa, T., Higashino, J., Yamaoka, A., Ito, H., Sato, T., and Sawada, K. 1983. PACS: A parallel microprocessor array for scientific calculations. ACM Transactions on Computer Systems 1, 3 (August), 195–221.
Johnson, K. T., Hurson, A., and Shirazi, B. 1993. General-purpose systolic arrays. Computer 26, 11 (November), 20–31.
Kaganov, A., Chow, P., and Lakhany, A. 2008. FPGA acceleration of Monte-Carlo based credit derivative pricing. Proceedings of the International Conference on Field Programmable Logic and Applications, 329–334.
Kim, J. and Moin, P. 1985. Application of a fractional-step method to incompressible Navier-Stokes equations. Journal of Computational Physics 59, 308–323.
Kung, H. T. 1982. Why systolic architectures? Computer 15, 1, 37–46.
Morishita, H., Osana, Y., Fujita, N., and Amano, H. 2008. Exploiting memory hierarchy for a computational fluid dynamics accelerator on FPGAs. Proceedings of the International Conference on Field-Programmable Technology (FPT2008), 193–200.
Morris, G. R., Prasanna, V. K., and Anderson, R. D. 2006. A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer. Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 30–12.
Murtaza, S., Hoekstra, A., and Sloot, P. 2008. Floating point based cellular automata simulations using a dual FPGA-enabled system. Proceedings of the International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08), DOI: 10.1109/HPRCTA.2008.4745686.
Patel, A., Madill, C. A., Saldana, M., Comis, C., Pomes, R., and Chow, P. 2006. A scalable FPGA-based multiprocessor. Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 111–120.
Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., and Yelick, K. 1997. A case for intelligent RAM: IRAM. IEEE Micro 17, 2 (March/April), 34–44.
Patterson, D., Asanovic, K., Brown, A., Fromm, R., Golbus, J., Gribstad, B., Keeton, K., Kozyrakis, C., Martin, D., Perissakis, S., Thomas, R., Treuhaft, N., and Yelick, K. 1997. Intelligent RAM (IRAM): the industrial setting, applications, and architectures. Proceedings of the International Conference on Computer Design, 2–9.
Sano, K., Iizuka, T., and Yamamoto, S. 2006a. Massively parallel processor based on systolic architecture for high-performance computation of difference schemes. Proceedings of the International Conference on Parallel Computational Fluid Dynamics, 174–177.


Sano, K., Iizuka, T., and Yamamoto, S. 2006b. Systolic computational-memory architecture for an FPGA-based flow solver. Proceedings of the 49th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS2006), CD-ROM proceedings (paper 3213, 5 pages).
Sano, K., Iizuka, T., and Yamamoto, S. 2007. Systolic architecture for computational fluid dynamics on FPGAs. Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM2007), 107–116.
Sano, K., Luzhou, W., Hatsuda, Y., and Yamamoto, S. 2008. Scalable FPGA-array for high-performance and power-efficient computation based on difference schemes. Proceedings of the International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08), DOI: 10.1109/HPRCTA.2008.4745679.
Sano, K., Nishikawa, T., Aoki, T., and Yamamoto, S. 2008. Evaluating power and energy consumption of FPGA-based custom computing machines for scientific floating-point computation. Proceedings of the International Conference on Field-Programmable Technology (FPT2008), 301–304.
Sano, K., Pell, O., Luk, W., and Yamamoto, S. 2007. FPGA-based streaming computation for lattice Boltzmann method. Proceedings of the International Conference on Field-Programmable Technology (FPT2007), 233–236.
Sano, K., Takagi, C., Egawa, R., Suzuki, K., and Nakamura, T. 2004. A systolic memory architecture for fast codebook design based on MMPDCL algorithm. Proceedings of the International Conference on Information Technology (ITCC2004), 572–578.
Sano, K., Takagi, C., and Nakamura, T. 2005. Systolic computational memory approach to high-speed codebook design. Proceedings of the 5th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT2005), 334–339.
Schneider, R. N., Turner, L. E., and Okoniewski, M. M. 2002. Application of FPGA technology to accelerate the finite-difference time-domain (FDTD) method. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA2002), 97–105.
Scrofano, R., Gokhale, M. B., Trouw, F., and Prasanna, V. K. 2006. A hardware/software approach to molecular dynamics on reconfigurable computers. Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 23–34.
Scrofano, R., Gokhale, M. B., Trouw, F., and Prasanna, V. K. 2008. Accelerating molecular dynamics simulations with reconfigurable computers. IEEE Transactions on Parallel and Distributed Systems 19, 6 (June), 764–778.
Shirazi, N., Walters, A., and Athanas, P. 1995. Quantitative analysis of floating point arithmetic on FPGA based custom computing machines. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 155–162.
Smith, W. D. and Schnore, A. R. 2003. Towards an RCC-based accelerator for computational fluid dynamics applications. The Journal of Supercomputing 30, 3 (December), 239–261.
Strenski, D., Simkins, J., Walke, R., and Wittig, R. 2008. Evaluating FPGAs for floating point performance. Proceedings of the International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08), DOI: 10.1109/HPRCTA.2008.4745680.
Strikwerda, J. C. and Lee, Y. S. 1999. The accuracy of the fractional step method. SIAM Journal on Numerical Analysis 37, 1 (November), 37–47.
Underwood, K. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 171–180.
Underwood, K. D. and Hemmert, K. S. 2004.
Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance. Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 219–228.
Vuillemin, J. E., Bertin, P., Roncin, D., Shand, M., Touati, H. H., and Boucard, P. 1996. Programmable active memories: reconfigurable systems come of age. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 4, 1 (March), 56–69.
Walke, R. L., Smith, R. W. M., and Lightbody, G. 2000. 20-GFLOPS QR processor on a Xilinx Virtex-E FPGA. Proceedings of SPIE: Advanced Signal Processing Algorithms, Architectures and Implementations X 4116, 300–310.


Williams, S., Waterman, A., and Patterson, D. 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52, 4, 65–76.
Woods, N. A. and VanCourt, T. 2008. FPGA acceleration of quasi-Monte Carlo in finance. Proceedings of the International Conference on Field Programmable Logic and Applications, 335–340.
Yee, K. S. 1966. Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Transactions on Antennas and Propagation 14, 302–307.
Zhuo, L., Morris, G. R., and Prasanna, V. K. 2007. High-performance reduction circuits using deeply pipelined operators on FPGAs. IEEE Transactions on Parallel and Distributed Systems 18, 10 (October), 1377–1392.
Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 63–74.
Zhuo, L. and Prasanna, V. K. 2007. Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems. IEEE Transactions on Parallel and Distributed Systems 18, 4 (April), 433–448.
