Automatic Generation of Dataflow-Based Reconfigurable Co-processing Units Carlo Sau

Francesca Palumbo

DIEE - Microelectronics and Bioengineering Lab University of Cagliari, ITA Email: [email protected]

PolComIng - Information Engineering Unit University of Sassari, ITA Email: [email protected]

Abstract—Hardware accelerators are widely adopted to speed up computationally onerous applications. However, their design is not trivial, especially if multiple applications/kernels need to be served. To this aim, the Multi-Dataflow Composer (MDC) tool can be adopted to generate the internal computing core of flexible and reconfigurable hardware accelerators. Nevertheless, MDC is not able, as it is, to deploy ready-to-use accelerators. To address this limitation, we conceived a fully automated design flow for coarse-grained reconfigurable and memory-mapped hardware accelerators, which required: a) the definition of a generic co-processing template, the Template Interface Layer; b) the extension of the MDC tool to characterize such a template and deploy the accelerators. Within the MPEG Reconfigurable Video Coding studies, this methodology represents the first framework for the automatic generation of reconfigurable hardware accelerators and, as will be discussed, it may be beneficial also in other contexts of execution with fairly limited adjustments. Results validated the proposed approach in a real use case scenario, comparing the automatically generated co-processor with a previous custom one.

I. INTRODUCTION

Embedded applications nowadays integrate complex functions that must be executed efficiently on portable devices. Design-time estimation of system behaviour is becoming challenging, and runtime adaptivity, though improving the overall system flexibility, complicates the design. Adaptivity is defined as the ability of a system to autonomously react to a context modification, but it still represents an open issue in the embedded system domain. At the hardware level, heterogeneous systems seem to cope better with highly flexible scenarios. Reconfigurable computing has proven to be extremely suitable for multifarious domains, characterized by diverse processing behaviours and by the unpredictable nature of complex applications. These architectures utilize resources in a time-multiplexed manner. They are classified into coarse-grained and fine-grained: the former facilitate data-dominant applications by providing word-level reconfigurable elements; the latter are beneficial for control-flow dominant applications and bit-level computations. Traditionally, reconfigurable architectures have been used for hardware acceleration. Coarse-grained architectures have mainly been adopted to speed up filter algorithms that exhibit a high level of instruction and/or data parallelism and little control flow [1] [2], whereas fine-grained ones have been used to map algorithms with a larger amount of control flow [3]. Several studies have then tried to exploit mixed-grained approaches to optimally serve complex

data/control multifarious applications [4] [5]. At the design flow level, integrated tool chains capable of fast adaptive system deployment are still missing. Within this context, this work leverages the dataflow formalism at the base of the MPEG Reconfigurable Video Coding standard (MPEG-RVC) [6] to automatically generate runtime reconfigurable coarse-grained co-processing units. The contribution of this paper is twofold. On the one hand, an integrated design flow for memory-mapped accelerators is presented. On the other, an architectural co-processing template, the Template Interface Layer (TIL), is introduced. The TIL is compliant with the MPEG-RVC communication protocol. Starting from the high-level dataflow specifications of the multiple applications/kernels to be served, the proposed flow is capable of characterizing the TIL and deploying a ready-to-use reconfigurable hardware accelerator. With respect to other state of the art approaches, this methodology guarantees the automatic deployment of application-specific flexible accelerators (providing efficient mapping, high-level synthesis and runtime support), rather than simply retrieving multiple units from a given library [7] or not targeting the entire system characterization [5]. The rest of this paper is organized as follows. Section II presents the background of this work. Section III illustrates the design flow, to be combined with the TIL (Sect. IV) and the proposed software extension (Sect. V), to automatically generate reconfigurable co-processing units. Finally, Sect. VI presents the achievable results, prior to concluding in Sect. VII.

II. BACKGROUND

A. The dataflow formalism and the MPEG-RVC standard

The dataflow paradigm has been deeply studied since the first works by Dennis [8] and Kahn [9] in 1974. Dataflow specifications are provided as directed graphs, where the vertices are functional units (actors) and the edges are dedicated links (channels or connections) used to exchange streams of data packets (tokens). At the state of the art, several dataflow models of computation have been identified [9] [10] [11], providing different features and exhibiting different constraints on timing, parameters and blocking events. The Dataflow Process Network model with firing rules [11] has been adopted by the MPEG group in the definition of the MPEG-RVC standard [6]. The MPEG-RVC framework [12] constitutes a development environment for audio/video codecs providing, along with the formalism to describe codecs as dataflows,

also a library of common actors, the Video Tool Library, to enable quicker codec specification/upgrade. This high-level formalism and the modularity of the MPEG-RVC approach, coupled with the development of a complete set of normative and non-normative tools, not only facilitate codec deployment, but also favour their long-term adaptivity. Moreover, dataflow specifications, by enabling an easier expression of parallelism, facilitate application mapping on many-core, even heterogeneous, platforms. Although MPEG-RVC was standardised only a few years ago, several research groups have been active on this topic and many tools have been developed. The Open RVC-CAL Compiler (Orcc) is the most important one [13]. It generates, from a given high-level specification, the related source code for different platforms: hardware, software or mixed. Orcc transforms the input dataflow model into a Java Intermediate Representation that can then be used to feed several other tools, such as: Xronos [14], which performs high-level synthesis for Xilinx FPGAs; Turnus [15], which offers design space exploration capabilities; and the MDC tool [16], which provides automatic deployment of runtime reconfigurable systems.

B. Co-processing units: automatic generation

Computationally intensive applications commonly rely on hardware accelerators to speed up execution. Accelerators improve efficiency through highly specialized circuits, but their design and debugging are not trivial. They require a detailed knowledge of the given application and, when not effectively supported, may be extremely time consuming. Architecture Description Languages [17] have been widely used to simplify and speed up embedded systems development and deployment. Both in academia and industry, such languages have been exploited for automatic tool generation, processor verification and High Level Synthesis [18].
High Level Synthesis is an automated process that generates hardware or software descriptions from high-level input specifications. Xronos and Synflow Studio [20], for example, provide respectively Xilinx-compliant and target-independent RTL descriptions starting from different dataflow specifications. High Level Synthesis has also been largely adopted for automatic co-processor generation. Schreiber et al. [19] presented a methodology to generate co-processing units that speed up different portions of an application described in C code; the target of this process is fixed on a specific system. This limitation has been relaxed by Bond et al. [21] who, starting from C++ source code, provided a way to define accelerators to be deployed over different target devices. With respect to the work presented in this paper, none of these works addresses reconfigurable co-processing units, capable of adapting their computation at runtime to the current context of execution. Other research works aimed at increased flexibility [7] and/or at speeding up different applications [22]. Milakovic et al. [7] presented the implementation of a co-processor capable of switching among different datapaths: based on the input workload, the co-processor can opt for the most efficient datapath among those available. The MDC approach is different: shared resources are (de)multiplexed over a unique

datapath, and the co-processing unit can be automatically derived using the design flow we are proposing. Rutten et al. [22] also considered a multi-application case: common functions (computational kernels) are identified exploiting a dataflow formalism. Kernels are then mapped onto different non-reconfigurable hardware accelerators, alternatively invoked when required by the current execution. MDC relies on the coarse-grained reconfigurable paradigm: if explicitly required, different accelerators assembled with our approach may be adopted in parallel, but each of them will be able to compute several kernels over the same substrate.

III. RECONFIGURABLE COARSE-GRAINED DATAPATH: THE MULTI-DATAFLOW COMPOSER TOOL

Design and management of efficient reconfigurable coarse-grained architectures is not trivial [23] [24]. Manual management is doable but inefficient; therefore, automatic methodologies are of paramount importance. It has already been demonstrated in [25] and [26] that combining the dataflow formalism with the coarse-grained design paradigm may be extremely beneficial to generate efficient and flexible datapaths, which can be integrated within co-processing units. The tool used within those studies, the Multi-Dataflow Composer (MDC) tool, is a software framework for the automatic generation of reconfigurable platforms. The limitation of adopting MDC to derive hardware accelerators is that the automatic generation of the reconfigurable computing core of a co-processing unit is not sufficient. In order to interface such a computing core with a host processor, it is necessary to define an ad-hoc wrapper, which may require a design effort proportional both to the number of kernels to be accelerated and to their characteristics. In this paper, we address this issue by defining a customizable hardware template and a software framework, connected to MDC, for its automatic deployment.

A. Generalities

The MDC tool leverages the MPEG-RVC formalism [27]. It aims at relieving designers of the tedious and error-prone process of manually composing coarse-grained reconfigurable platforms. Given an input set of dataflow specifications, it assembles the resource-minimal multi-dataflow network. Common actors are shared through low-overhead switching elements (Sboxes), placed on the incoming and outgoing edges of the shared actors or actor chains. At the hardware level, these Sboxes are logically very simple and provide single-cycle reconfiguration. Working at a high level of abstraction, MDC also keeps track of the configuration necessary to implement the input specifications on the composed platform.
The tool then maps the multi-dataflow specification onto an RTL system, composed of all the physical units necessary to ensure computing correctness. MDC is not a high-level to hardware synthesizer: it needs, as input, the HDL Components Library. Although MDC was born under the MPEG-RVC standard, its employment is not bound to the MPEG-RVC application field. The MDC approach, as already demonstrated in [26] [28] [29], is orthogonal with respect to the video coding application field. MDC is in principle able to process any type of dataflow and, accordingly, the designer must specify as input the communication protocol to be implemented among the physical units. In our work we have coupled MDC with Orcc and Xronos, as can be seen in Fig. 1. The MDC front-end processes the Java Intermediate Representations generated by the Orcc front-end to assemble a single multi-dataflow specification. The MDC back-end then creates the corresponding HDL coarse-grained reconfigurable hardware, mapping actors onto functional units. The latter are provided to MDC within the HDL Components Library, automatically generated by Xronos, which uses the Orcc Intermediate Representations to synthesise the library. Since Orcc and Xronos obey the MPEG-RVC standard, the communication protocol passed to MDC necessarily has to be RVC-compliant as well. For the purposes of this work, since we are targeting a fully automated generator for reconfigurable co-processing units, we opted for the automatic generation of the HDL Components Library, though this is not mandatory. Users can employ their own libraries, keeping in mind that the communication protocol will have to be coherently defined.

Fig. 1. MDC tool: an overview.

B. Multi-Dataflow Datapath

MDC provides an RTL description of the multi-dataflow system specification, which depends on the characteristics of the given input networks and on the communication protocol to be adopted. The former define the number and the type of I/O ports of the generated platform, while the latter determines how these ports communicate with the external world (i.e. the handshake to be implemented).

1) Multi-Dataflow Specification: To understand the variability of the platforms generated by MDC, Fig. 2 provides an example of multi-dataflow specification composition. Two simple input specifications are given: AddTop (Fig. 2.a) and SplitTop (Fig. 2.b). The AddTop network has two input ports and one output port, while SplitTop has one input port and two output ports. The sharable resources are: the Input1 input port, the Output output port and the Abs actor. The multi-dataflow specification (created by the MDC front-end, see Fig. 1) is depicted in Fig. 2.c. It is composed of two input ports and two output ports, plus an additional ID input port. The ID port is added by MDC to allow runtime (re)configuration of the multi-dataflow platform, through encoded values, generated by the MDC front-end, that uniquely reference the given specifications. The AddTop network, while executing, will drive just Output of the two output ports of the multi-dataflow network, whereas the SplitTop network will read just Input1 of the two input ports.

2) RTL Description: Once the multi-dataflow specification is composed, the MDC back-end maps it into the corresponding RTL top module. At this stage, the definition of the communication protocol influences the system implementation. The protocol is intrinsically peculiar to the adopted dataflow formalism; therefore: 1) it has to be implemented by all the physical units of the HDL Components Library; 2) the Sboxes have to be built accordingly by the MDC back-end, to correctly connect the above-mentioned units; 3) the I/O ports have to support the given handshake, since the external environment has to properly feed and/or retrieve data to/from the coarse-grained reconfigurable core.
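The composition illustrated in Fig. 2 can be approximated by a small graph-merging pass. The sketch below is a strongly simplified, hypothetical rendition of what the MDC front-end does (it ignores actor chains, fan-out and port types, and all function and variable names are our own): same-named actors are shared, and where two networks drive the same actor input an Sbox is inserted, with the per-network select values collected in a configuration LUT addressed by the network ID.

```python
def merge_dataflows(networks):
    """Merge dataflow graphs into one multi-dataflow network.

    networks: dict name -> list of edges (src, dst, dst_port).
    Same-named actors are shared; where different networks drive the
    same (actor, port), a switching element (Sbox) is inserted and the
    select value needed by each network ID is recorded in a LUT.
    Simplified sketch only: no actor chains, no fan-out handling.
    """
    producers = {}                           # (dst, port) -> ordered srcs
    for edges in networks.values():
        for src, dst, port in edges:
            srcs = producers.setdefault((dst, port), [])
            if src not in srcs:
                srcs.append(src)

    merged, lut = set(), {}
    for net_id, edges in enumerate(networks.values(), start=1):
        lut[net_id] = {}
        for src, dst, port in edges:
            srcs = producers[(dst, port)]
            if len(srcs) > 1:                # contended input: share via Sbox
                sbox = f"sbox_{dst}_{port}"
                merged |= {(src, sbox), (sbox, dst)}
                lut[net_id][sbox] = srcs.index(src)
            else:
                merged.add((src, dst))
    return merged, lut

# Toy versions of the Fig. 2 networks (port names are invented).
nets = {
    "AddTop":   [("Input1", "Add", "a"), ("Input2", "Add", "b"),
                 ("Add", "Abs", "a"), ("Abs", "Output", "a")],
    "SplitTop": [("Input1", "Abs", "a"), ("Abs", "Split", "a"),
                 ("Split", "Output", "a"), ("Split", "Output2", "a")],
}
edges, lut = merge_dataflows(nets)
print(lut[1])  # {'sbox_Abs_a': 0, 'sbox_Output_a': 0}
print(lut[2])  # {'sbox_Abs_a': 1, 'sbox_Output_a': 1}
```

As in the paper, Sboxes end up on the incoming and outgoing edges of the shared Abs actor, and the ID-indexed LUT plays the role of the configuration trace kept by the front-end.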

Considering the design flow presented in Sect. III-A, we deal with MPEG-RVC compliant systems. The HDL Components Library, created using Xronos, obeys the FIFO-based MPEG-RVC communication protocol. For any actor, each input port is driven by an input FIFO (queue), while each output port is connected to a forwarding module (fanout). The handshake among actors involves the following signals:

• DATA - the data channel of the connection;
• SEND - single-bit signal meaning that DATA is valid;
• ACK - single-bit signal meaning that the DATA received has been consumed;
• RDY - single-bit signal meaning that the downstream actor is ready to receive;
• COUNT - size of the transmitted packet of data.

A data transfer occurs when the source actor, having seen a high RDY signal, sets the SEND one. The target actor consumes the DATA token by setting ACK while SEND is high.
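This exchange can be rendered as a behavioural sketch. The signal names below come from the paper, but the Python structure is invented for illustration: in hardware these are wires driven cycle by cycle, whereas here one call models one complete token transfer.

```python
class HandshakePort:
    """Sketch of the RVC-style SEND/ACK/RDY handshake between two
    actors; timing is collapsed to one token per call."""
    def __init__(self):
        self.rdy = True      # downstream actor ready to receive
        self.send = False    # upstream actor: DATA is valid
        self.ack = False     # downstream actor: DATA consumed
        self.data = None

def transfer(port, token, sink):
    """One data transfer: the source waits for RDY and raises SEND with
    DATA; the target raises ACK while SEND is high, consuming the token."""
    if not port.rdy:
        return False                     # downstream not ready: no transfer
    port.data, port.send = token, True   # source drives the channel
    port.ack = port.send                 # target acknowledges while SEND high
    if port.ack:
        sink.append(port.data)           # token consumed by the target
        port.send = port.ack = False     # channel returns to idle
    return True

received = []
p = HandshakePort()
for t in (7, 8, 9):
    transfer(p, t, received)
print(received)  # [7, 8, 9]
```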

IV. ARCHITECTURAL CO-PROCESSOR TEMPLATE

One of the objectives of this paper is to present the RVC-compliant architectural co-processing template, called Template Interface Layer (TIL). In contexts different from the RVC one, this template would still be useful to integrate memory-mapped co-processing units, by providing an adapter for the communication protocol implemented within their computing core. The TIL is composed of some fixed modules (that depend neither on the number of integrated kernels nor on the number of I/O ports of the coarse-grained datapath) and some variable ones (that depend on the number of integrated kernels and/or on the number of I/O ports of the coarse-grained datapath). TIL-based co-processing units are conceived as memory-mapped peripherals connected to the system bus of the architecture. Two address ranges are available: one for the register bank configuring the co-processor, the other for a local dual-port memory (l_mem) storing the data to be processed and the achieved results. The template is composed of: 1) a front-end in charge of the data transfers from l_mem to the coarse-grained reconfigurable datapath, and 2) a back-end managing the data transfers from the coarse-grained reconfigurable datapath to l_mem.
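From the host side, such a peripheral boils down to two address ranges and a write ordering. The following sketch shows a possible host-side driver for a TIL with one input and one output port; every base address and register offset is invented for illustration (the paper only fixes the register names and the rule that writing a non-zero kernel ID starts the unit).

```python
# Hypothetical memory map: register bank range plus l_mem range.
REG_BASE, MEM_BASE = 0x40000000, 0x40010000

REGS = {
    "kernel_ID":     REG_BASE + 0x00,  # non-zero value starts the unit
    "in1_size":      REG_BASE + 0x04,  # tokens expected on input port 1
    "in1_baseaddr":  REG_BASE + 0x08,  # operand offset inside l_mem
    "size_pckt":     REG_BASE + 0x0C,  # tokens per packet
    "out1_size":     REG_BASE + 0x10,  # tokens produced on output port 1
    "out1_baseaddr": REG_BASE + 0x14,  # result offset inside l_mem
}

class MockBus:
    """Stand-in for the system bus: records memory-mapped writes."""
    def __init__(self):
        self.mem = {}
    def write(self, addr, word):
        self.mem[addr] = word

def run_kernel(bus, kernel_id, data):
    """Load operands into l_mem, program the register bank, and finally
    write kernel_ID, which doubles as the start signal."""
    for i, word in enumerate(data):
        bus.write(MEM_BASE + 4 * i, word)     # operands into l_mem
    bus.write(REGS["in1_size"], len(data))
    bus.write(REGS["in1_baseaddr"], 0)
    bus.write(REGS["size_pckt"], len(data))   # single packet per run
    bus.write(REGS["out1_size"], len(data))
    bus.write(REGS["out1_baseaddr"], 4 * len(data))
    bus.write(REGS["kernel_ID"], kernel_id)   # last write activates the unit

bus = MockBus()
run_kernel(bus, kernel_id=2, data=[10, 20, 30])
```

The only binding design choice mirrored from the paper is that the kernel ID write comes last, so that all sizes and base addresses are already latched when the peripheral is activated.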

Fig. 2. MDC tool composition example where a) AddTop and b) SplitTop networks are combined into the c) multi-dataflow.

A. The TIL Front-End

The block scheme of the TIL front-end is depicted in Fig. 3. The configuration registers, an input register bank, are used by the host processor to configure the co-processor for the demanded computation. The kernel_ID register value serves to identify the issued computation and also acts as the start signal for the co-processor: when the default zero value is updated with a valid kernel ID, the peripheral is activated.

A transaction over an I/O port corresponds to the transmission of X different packets of Y data each. Therefore, X ∗ Y data is the total number of exchanged items, over a port, per execution. The inN_size register value determines the total number of data (X ∗ Y) to be processed by each input port. The inN_baseaddr register value is exploited by the Address Generator to compute the address of the data to be read from l_mem. The size_pckt register gives the size Y of each data packet.

The fsm manages the data transfers from l_mem to the coarse-grained reconfigurable datapath. It is triggered by the start signal. Port loading is performed one input port at a time, since the second port of l_mem is used by the TIL back-end to store the results. The Port Selector identifies the input port to be loaded considering, for each of them, its corresponding RDY signal (coming from the coarse-grained datapath) and the value of the data counters (indicating, for each port, whether other data still have to be loaded). The input port stored within the Port Selector defines all the data paths within the TIL front-end, driving all the muxes and demuxes, since input-related signals (such as the SEND signal or the DATA signal) are generic. After the port selection process, the fsm starts reading data from l_mem at the address computed by the Address Generator (as addr = inN_baseaddr + dataCountN, summing the N-th port inN_baseaddr value to the current data counter value). Read data is forwarded to the selected coarse-grained datapath input port. When all the data in a packet have been transferred (the pckt counter has finished), even if new data can be accepted (the RDY signal is high), a new port is selected according to a simple round-robin selection policy. When all the requested transfers have been accomplished, the fsm goes into a stalling state, waiting for the finish signal from the TIL back-end to reset all the information within the co-processing unit, preparing for a new transaction.

B. The TIL Back-End

The block scheme of the TIL back-end is depicted in Fig. 4. It is similar to the TIL front-end and exploits some of the configuration registers. The outM_size register value determines the total number of data to be transferred by each output port. The outM_baseaddr register value is exploited by the Address Generator to compute the address at which the data have to be stored in l_mem.

There is no start signal in the back-end: results are stored as soon as they are produced. Since just one memory is present in the template, each output port is served by a FIFO to decouple result production from memory arbitration.

The fsm is responsible for the transfer of the results from the FIFOs to l_mem. The Port Selector, operating in a round-robin manner, identifies the non-empty FIFO from which the data have to be retrieved. Again, the port stored within the Port Selector defines all the data paths within the TIL back-end, driving all the muxes and demuxes, since output-related signals are generic. The Address Generator is in charge of computing the correct address (as addr = outM_baseaddr + dataCountM) at which results have to be stored. As soon as the currently selected FIFO has been emptied, a new output port can be served (if there is at least one data counter not finished); otherwise, the storing phase can terminate and the fsm sets the finish signal.

C. Template Adaptability

The described template is adaptable to any MDC-generated datapath. As highlighted in Fig. 3 and Fig. 4, the template presents fixed modules, port-dependent modules and extendable modules. Fixed modules do not change for any number of integrated kernels nor any number of I/O ports to be served. Port-dependent modules are fixed modules that are replicated as many times as the number of I/O ports of the coarse-grained datapath. Extendable modules are those modules whose interface is affected by the number of I/O ports of the coarse-grained datapath, while their number of instances is fixed. Table I provides an idea of the number of TIL resources requested for different MDC datapaths. This table is not representative of the whole hardware overhead: the interface of the extendable modules (the multiplexers, except the 2x1 ones, the demultiplexers, the Port Selectors and the Address Generators)
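The front-end loading behaviour described in Sect. IV-A (round-robin port selection plus addr = inN_baseaddr + dataCountN address generation) can be sketched at the transaction level. This is not the real fsm, which works cycle by cycle on RDY/SEND signals; the structure and names below are ours.

```python
def load_ports(l_mem, ports):
    """Transaction-level sketch of the TIL front-end loading loop.

    Ports are served one packet at a time in round-robin order, and the
    read address is formed as addr = inN_baseaddr + dataCountN.
    ports: list of dicts with 'baseaddr', 'size', 'pckt' and 'sink'
    (the sink stands in for the selected datapath input port).
    """
    counters = [0] * len(ports)                   # one data counter per port
    selected = 0
    while any(counters[i] < p["size"] for i, p in enumerate(ports)):
        p = ports[selected]
        if counters[selected] < p["size"]:        # port still expects data
            for _ in range(p["pckt"]):            # one packet per grant
                if counters[selected] >= p["size"]:
                    break
                addr = p["baseaddr"] + counters[selected]
                p["sink"].append(l_mem[addr])     # feed the datapath port
                counters[selected] += 1
        selected = (selected + 1) % len(ports)    # round-robin to next port

l_mem = list(range(100, 112))                     # toy local memory contents
in1, in2 = [], []
load_ports(l_mem, [
    {"baseaddr": 0, "size": 4, "pckt": 2, "sink": in1},
    {"baseaddr": 4, "size": 4, "pckt": 2, "sink": in2},
])
print(in1, in2)  # [100, 101, 102, 103] [104, 105, 106, 107]
```

Note how the two ports alternate packet by packet, exactly because the fsm switches port when the packet counter expires even if the current port could accept more data.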

Fig. 3. Front-End of the Template Interface Layer.

Fig. 4. Back-End of the Template Interface Layer.
TABLE I. VARIATION OF THE TIL NUMBER OF RESOURCES WITH RESPECT TO THE DATAPATH NUMBER OF INPUT (N) AND OUTPUT (M) PORTS, WHERE N=M.

                       number of I/O ports
resource            |  1 |  2 |  4 |  8 | 16 |  32
registers           |  6 | 10 | 18 | 34 | 66 | 130
counters            |  3 |  5 |  9 | 17 | 31 |  65
multiplexers 2x1    |  4 |  4 |  4 |  4 |  4 |   4
multiplexers Nx1    |  2 |  2 |  2 |  2 |  2 |   2
multiplexers Mx1    |  3 |  3 |  3 |  3 |  3 |   3
demultiplexers 1xN  |  3 |  3 |  3 |  3 |  3 |   3
demultiplexers 1xM  |  2 |  2 |  2 |  2 |  2 |   2
FIFOs               |  1 |  2 |  4 |  8 | 16 |  32
port selectors      |  2 |  2 |  2 |  2 |  2 |   2
address generators  |  2 |  2 |  2 |  2 |  2 |   2
memories            |  1 |  1 |  1 |  1 |  1 |   1
FSMs                |  2 |  2 |  2 |  2 |  2 |   2
linearly increases, in terms of number of ports, with the number of I/O ports of the coarse-grained datapath. Moreover, the port depth is determined according to the port type within the high-level dataflow specification: if the port is boolean, a 1-bit data channel is adopted; if it is an integer or a float, a 32-bit data channel is instantiated; whereas a double requires a 64-bit data channel. For integers it is also possible to specify a custom depth at the dataflow level, which is assigned as it is at the hardware level.

V. CO-PROCESSOR GENERATOR

The second contribution of this work is the definition of an MDC extension for the automatic generation of co-processing units. This extension adapts the described co-processor template to the MDC-generated coarse-grained reconfigurable datapaths. To this aim, we created a parser for the multi-dataflow specification of the multi-datapath computing core, capable of characterizing the TIL architecture based on:

• the number of I/O ports of the multi-datapath;
• the size of the I/O ports of the multi-datapath.
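The TIL characterization derived by such a parser can be sketched as follows. The network description format is a hypothetical dict stand-in (the real parser walks the multi-dataflow specification produced by the MDC front-end); the width rules follow Sect. IV-C: bool maps to 1 bit, int and float to 32, double to 64, and integers may carry an explicit custom width.

```python
def til_parameters(network):
    """Derive the TIL characterization (port counts and widths) from a
    simplified multi-dataflow network description."""
    def width(port):
        if "width" in port:              # custom integer depth, kept as-is
            return port["width"]
        return {"bool": 1, "int": 32, "float": 32, "double": 64}[port["type"]]

    return {
        "n_inputs":   len(network["inputs"]),
        "n_outputs":  len(network["outputs"]),
        "in_widths":  [width(p) for p in network["inputs"]],
        "out_widths": [width(p) for p in network["outputs"]],
    }

# Toy network shaped like the use case of Sect. VI-B: one 32-bit input
# port and one 96-bit (custom-width) output port.
net = {"inputs":  [{"name": "Input1", "type": "int"}],
       "outputs": [{"name": "Output", "type": "int", "width": 96}]}
print(til_parameters(net))
# {'n_inputs': 1, 'n_outputs': 1, 'in_widths': [32], 'out_widths': [96]}
```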

The proposed template and parser may be easily adapted to different contexts, characterized by different dataflow models. This would imply the interposition of a protocol translator between the TIL itself and the computing core. Nevertheless, this would not affect the conceived software extension and, at the same time, it would be doable thanks to the intrinsic generality of the MPEG-RVC FIFO-based communication protocol. To the best of our knowledge, the proposed extension to the design flow discussed in Sect. III-B makes this synthesis flow the first automatic generation framework for coarse-grained runtime reconfigurable co-processing units. It can be used for any dataflow-based design, reconfigurable or not (overcoming the limitation experienced in [19] and [21]). Moreover, in the reconfigurable context, the exploitation of MDC guarantees optimal system deployment, without any resource redundancy or performance compromise as in [7]. Finally, the complete design flow allows automatically defining any type of platform. If data parallelism needs to be managed, as in [22], more than one reconfigurable co-processor (even heterogeneous ones, derived from different starting high-level descriptions) can be instantiated.

TABLE II. VARIATION OF THE CO-PROCESSOR OCCUPIED RESOURCES WITH RESPECT TO THE DATAPATH NUMBER OF INPUT (N) AND OUTPUT (M) PORTS, WHERE N=M. IN BRACKETS THE PERCENTAGE OF OCCUPIED RESOURCES WITH RESPECT TO THE AVAILABLE ONES.

number of | Slice Regs | Slice LUTs | BUFGs   | BRAMs     | frequency
I/O ports | (207360)   | (207360)   | (32)    | (288)     | [MHz]
    1     |  153 (0.1) |  277 (0.1) | 1 (3.1) | 65 (22.6) |   243.8
    2     |  261 (0.1) |  430 (0.2) | 1 (3.1) | 65 (22.6) |   243.8
    4     |  475 (0.2) |  751 (0.4) | 1 (3.1) | 65 (22.6) |   239.0
    8     |  901 (0.4) | 1558 (0.8) | 1 (3.1) | 65 (22.6) |   208.9
   16     | 1757 (0.9) | 2760 (1.3) | 1 (3.1) | 65 (22.6) |   197.6
   32     | 3454 (1.7) | 5339 (2.6) | 2 (6.3) | 65 (22.6) |   158.4

The automation of the generation process, from high level to RTL, makes it possible to couple this flow with dataflow-based partitioning algorithms. This would lead to optimal mapping (in terms of execution efficiency) with minimal designer effort, even considering heterogeneous substrates, possibly composed of dedicated co-processing units and one or more host processors.

VI. EXPERIMENTAL RESULTS

In this section we present the obtained experimental results. In Sect. VI-A the overhead of the adaptability is discussed, whereas Sect. VI-B evaluates the proposed approach over the FPGA implementation of an image processing application scenario. An ad-hoc co-processing unit, manually assembled in [26], is compared with the automatically generated one.

A. Adaptability Evaluation

Here we discuss the physical overhead of the TIL when the number of I/O ports of the coarse-grained reconfigurable datapath is varied. Table II shows the synthesis results obtained with the Xilinx Synthesis Technology (XST) tool targeting a Virtex 5 FPGA board, the XC5VLX330. Only the overhead of the TIL is reported. Results show an average increment of 80% in Slice Regs and LUTs as the number of I/O ports is doubled. The number of occupied BRAMs does not change, even though the TIL also involves a number of FIFO memories (in the back-end) directly proportional to the number of output ports. In terms of operating frequency, it is fundamental that the TIL does not limit the system efficiency. The more I/O ports the coarse-grained datapath has, the longer the TIL-related critical path. XST reports an average decrease of 8% in the operating frequency when the number of ports is doubled. Nevertheless, in our experiments, the system critical path was always located within the coarse-grained datapath (especially when long Sbox chains are present or when the actor granularity is large).

B. Design Flow Evaluation

The automated design flow presented in this work has been validated on the test-bed discussed in [26]. As already said, the limit of that work was the need of manually defining the coarse-grained computing core wrapper to enable the communication with a host processor. In this work, we overcome this issue by automating the co-processor generation.

In brief, the considered scenario is related to image processing. Four different applications (anti-aliasing filter, adaptive zoom, motion estimation and deblocking/deringing filter) have been profiled and, within each of them, different computational kernels have been isolated; 20 kernels have been identified overall. Their diverse functions led to highly different RVC-compliant specifications (in terms of granularity and of number of actors per model), but composed of common sharable actors. In [26] the modelled networks were used to assemble the coarse-grained reconfigurable computing core of the manually defined memory-mapped co-processor, hereafter called custom copr. Here, we have used the same specifications to automatically generate the entire accelerator, hereafter named auto copr. In both cases the assembled computing core is composed of 364 actors, 280 of which are Sboxes. The isolated kernels involve 181 actors overall: the MDC reconfigurable approach thus saves 97 computational actors. The generated coarse-grained datapath has one 32-bit input port and one 96-bit output port. For further details about the reference application scenario and the kernel composition please refer to [26]. The auto copr datapath presents the composition shown in the first column of Tab. I (see Sect. IV-C). The custom copr is composed as follows:

• a central finite state machine monitoring the entire processor/co-processor communication;
• input and output FIFO memories to temporarily store the data to be computed by the datapath and the related results;
• two finite state machines used respectively to read the input FIFO, feeding data to the reconfigurable computing core of the co-processor, and to write the computed results to the output FIFO.

All the modules involved in the custom copr were manually designed and are highly specialized for the considered kernel set. In particular, the finite state machines in charge of transferring the data to/from the coarse-grained reconfigurable computing core were built with specific knowledge of each single transaction per kernel. On the contrary, the two fsm units of the TIL totally ignore how the single transactions are composed and leverage just the information of the internal counters to complete them. At first, we have compared the auto copr and the custom copr stand-alone, leaving aside (as an externally connected module) the coarse-grained reconfigurable datapath (the same for both designs). They have been evaluated in terms of resource occupancy on the targeted FPGA board and in terms of maximum operating frequency. Table III shows the synthesis results retrieved by the XST tool. The auto copr, despite being generic, occupies fewer slices than the custom copr: in particular, it saves more than 50% of both slice registers and LUTs, and it also needs half of the BRAMs occupied by the custom copr. The generalized solution requires fewer resources since it does not treat all the kernels differently, whereas the custom copr adopted dedicated logic for each of the involved kernels. As a consequence, the operating frequency is also improved, with an increase of 46.2% with respect to the custom copr.

TABLE III. COMPARISON BETWEEN THE OCCUPIED RESOURCES AND THE MAXIMUM OPERATING FREQUENCY OF THE custom copr AND THE auto copr DESIGNS.

resource (available)    custom copr (%)   auto copr (%)   auto vs. custom (%)
Slice Regs (207360)     352 (0.2)         163 (0.1)       -53.7
Slice LUTs (207360)     1041 (0.5)        327 (0.2)       -68.6
BUFGs (32)              1 (3.1)           1 (3.1)         0.0
BRAMs (288)             8 (2.8)           4 (1.4)         -50.0
frequency [MHz]         155.1             226.7           +46.2
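As a quick arithmetic cross-check, the last column of Tab. III can be recomputed from the raw figures (a minimal script; `rel_delta` is a helper name of our own, not part of any tool flow):

```python
def rel_delta(custom, auto):
    # Signed percentage difference of the auto copr w.r.t. the custom copr,
    # rounded to one decimal as in Tab. III.
    return round((auto - custom) / custom * 100, 1)

rows = {                       # (custom copr, auto copr) pairs from Tab. III
    "Slice Regs":      (352, 163),
    "Slice LUTs":      (1041, 327),
    "BUFGs":           (1, 1),
    "BRAMs":           (8, 4),
    "frequency [MHz]": (155.1, 226.7),
}
for name, (c, a) in rows.items():
    print(f"{name}: {rel_delta(c, a):+.1f}%")
```

Running it reproduces the -53.7%, -68.6%, 0.0%, -50.0% and +46.2% values reported in the table.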

TABLE IV. COMPARISON BETWEEN THE LOADING/STORING OVERHEAD OF THE custom copr AND THE auto copr DESIGNS.

                      custom copr          auto copr
packet size           1     4     16       1     4     16
loading [# cycles]    3     6     18       3     9     33
loading [µs]          0.19  0.39  1.16     0.13  0.40  1.46
storing [# cycles]    1     -     -        1     -     -
storing [µs]          0.06  -     -        0.04  -     -

While generalization improves resource utilization and operating frequency, it affects execution, since the automated flow is not specialized on the use case. To explore this aspect we have measured the overhead of the loading and storing phases in both the considered designs. For the loading phase, this overhead is defined as the number of clock cycles the co-processor spends in managing the input data transactions, that is:

•	in the auto copr, the number of cycles from the assertion of the start signal until all the data counters expire;

•	in the custom copr, the number of cycles from the first to the last read operation on the input FIFOs.
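To make the counter-based behaviour concrete, the following Python sketch models, in a heavily simplified way, how a generic counter-driven loading fsm completes a transaction and how its overhead is counted. The function name and structure are ours; the actual TIL logic is an HDL module, not Python.

```python
def til_load(packet, counter_init):
    """Behavioural sketch of a counter-driven loading fsm: it pushes one
    datum per cycle and stops when its internal counter expires, with no
    kernel-specific knowledge of the transaction it is serving."""
    buffer, cycles, counter = [], 0, counter_init
    while counter > 0:
        buffer.append(packet[len(buffer)])  # one FIFO read per cycle
        counter -= 1
        cycles += 1                         # cycles spent = loading overhead
    return buffer, cycles

# Loading a 4-data packet: the fsm only needs the counter value (4),
# regardless of which kernel will consume the data.
data, overhead = til_load([10, 20, 30, 40], counter_init=4)
```

The same fsm serves any kernel by simply reprogramming `counter_init`, which is exactly why it needs no per-kernel dedicated logic.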

For the storing phase, the overhead is the number of cycles the co-processor spends in transferring the results to the FIFOs associated (in both designs) with the output ports, from the first to the last valid output datum. Table IV summarizes the loading/storing overhead for the considered designs, reported both as a number of cycles and as elapsed time (number of cycles times the clock period at the maximum operating frequency). The overhead has been measured for three different loading packet sizes (those used by the involved computational kernels): 1, 4 and 16 data. The storing packet size is always 1. The auto copr has a slower loading phase than the custom copr in terms of clock cycles, and values get worse as the packet size increases. However, the elapsed time of the auto copr benefits from the higher operating frequency, so that it is worse than that of the custom copr (by 21%) only when the maximum packet size is considered. In the storing phase the two designs take the same number of clock cycles, which results in a 32% improvement of the auto copr elapsed time. These overheads are indicative and give an idea of the performance difference between the automatically generated co-processor and a custom solution. Obviously, the activation frequency and the execution time of each computational kernel will have a strong impact on the presented results, but such an analysis goes beyond the scope of this paper. Moreover, since both designs start from the same dataflow descriptions, this impact would be practically the same on the custom copr and on the auto copr.
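The quoted 21% and 32% figures can be reproduced from Tabs. III and IV, assuming (our reading) that both percentages are taken relative to the slower design:

```python
# Loading, 16-data packet: elapsed times from Tab. IV (auto copr is slower).
loading_penalty = (1.46 - 1.16) / 1.46 * 100   # auto copr penalty, in %

# Storing: 1 cycle in both designs (Tab. IV), so the auto copr gain is
# purely the ratio of the clock periods (frequencies from Tab. III).
period_custom, period_auto = 1 / 155.1, 1 / 226.7
storing_gain = (period_custom - period_auto) / period_custom * 100

print(f"loading penalty ~ {loading_penalty:.0f}%, "
      f"storing gain ~ {storing_gain:.0f}%")
```

Rounded to integer percentages, these come out at 21% and 32%, matching the text.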

VII. CONCLUSION

In this paper, we have faced the problem of cutting down the design time of efficient coarse-grained reconfigurable hardware accelerators. To this aim, a fully automated design flow has been presented, leveraging the MPEG-RVC dataflow formalism and some already existing tools of the RVC framework. The latter have been exploited to generate the reconfigurable computing core of a hardware accelerator. On our side, we have defined a co-processing template adaptable to reconfigurable dataflow-based cores. We have also extended the abovementioned generation flow to instantiate the entire co-processing unit, wrapping the dataflow-based cores with our template and creating a ready-to-use memory-mapped co-processing peripheral. The adaptivity of the template with respect to the different generated datapaths has been evaluated. Moreover, an automatically assembled co-processing unit has been compared, on a real use case, with a manually assembled one. Synthesis results showed significant resource savings and a higher maximum operating frequency for the automatically generated solution, but the loading phase still has to be optimized to be competitive with an ad-hoc solution.

For the time being, the MDC-based design flow is under evaluation in a multi-decoder scenario [16], testing also a dedicated low-power extension [30]. It would be useful to assess also the proposed accelerator-oriented extension in that context. Moreover, future developments will broaden the accelerator generation process to different contexts of application: a parser capable of automatically dealing with other commonly used dataflow models would be beneficial.

ACKNOWLEDGMENT

Dr. Carlo Sau is grateful to Sardinia Regional Government for funding the RPCT Project (L.R. 7/2007, CRP-18324) that led to these results. He is also grateful to Sardinia Regional Government for supporting his PhD scholarship (P.O.R. F.S.E., European Social Fund 2007-2013 - Axis IV Human Resources).

REFERENCES

[1] G. J. M. Smit, P. M. Heysters, M. Rosien, and B. Molenkamp, "Lessons learned from designing the MONTIUM - a coarse grained reconfigurable processing tile," in Intl. Symp. on System-on-Chip, 2004.
[2] P. Master, "Reconfigurable hardware and software architectural constructs for the enablement of resilient computing systems," ASAP06: Intl. Conf. on Appl.-specific Syst., Arch. and Proc., 2006.
[3] A. Cappelli, A. Lodi, M. Bocchi, et al., "XiSystem: a XiRisc-based SoC with reconfigurable IO module," IEEE Jrnl. of Solid-State Circuits, vol. 41, no. 1, 2006.
[4] F. Thoma, M. Kuhnle, and P. Bonnot, "MORPHEUS: Heterogeneous reconfigurable computing," in Intl. Conf. on Field Progr. Logic and Appl., 2007.
[5] R. Koenig, L. Bauer, T. Stripf, et al., "KAHRISMA: A Novel Hypermorphic Reconfigurable-Instruction-Set Multi-grained-Array Architecture," Design, Automation and Test in Europe Conf. and Exhib. (DATE), 2010.
[6] ISO/IEC 23001-4 (2009), MPEG systems tech. Part 4: Codec configuration representation.
[7] A. Milakovich, V. S. Gopinath, R. Lysecky, and J. Sprinkle, "Automated Software Generation and Hardware Coprocessor Synthesis for Data-Adaptable Reconfigurable Systems," in Intl. Conf. and Work. on Engineering of Computer-Based Systems, 2012.
[8] J. B. Dennis, "First version of a data flow procedure language," in Symp. on Programming, pp. 362-376, 1974.
[9] G. Kahn, "The Semantics of a Simple Language for Parallel Programming," in IFIP Congress, pp. 471-475, 1974.
[10] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. Comput., vol. 36, no. 1, 1987.
[11] E. Lee and T. Parks, "Dataflow process networks," Proc. of the IEEE, vol. 83, no. 5, pp. 773-801, 1995.
[12] S. S. Bhattacharyya, J. Eker, J. W. Janneck, C. Lucarz, M. Mattavelli, and M. Raulet, "Overview of the MPEG reconfigurable video coding framework," Signal Proc. Systems, vol. 63, no. 2, 2011.
[13] Open RVC-CAL Compiler (Orcc), http://orcc.sourceforge.net/.
[14] E. Bezati, M. Mattavelli, and J. Janneck, "High-level Synthesis of Dataflow Programs for Signal Processing Systems," in Proc. of the 8th Intl. Symp. on Image and Signal Proc. and Analysis (ISPA), 2013.
[15] S. C. Brunet, M. Mattavelli, and J. W. Janneck, "Turnus: A design exploration framework for dataflow system design," in Intl. Symp. on Circuits and Systems (ISCAS), 2013.
[16] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated Design Flow for Coarse-Grained Reconfigurable Platforms: an RVC-CAL Multi-Standard Decoder Use-Case," (to appear in) Proc. of the Intl. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2014.
[17] N. Dutt and P. Mishra, "Architecture description languages for programmable embedded systems," in IEEE Proc.: Computers and Digital Techniques, 2005.
[18] Z. E. Rakossy, A. A. Aponte, and A. Chattopadhyay, "Exploiting architecture description language for diverse IP synthesis in heterogeneous MPSoC," Conf. on Reconfigurable Computing and FPGAs (ReConFig), 2013.
[19] R. Schreiber, S. Aditya, B. R. Rau, V. Kathail, S. Mahlke, S. Abraham, and G. Snider, "High-Level Synthesis of Nonprogrammable Hardware Accelerators," in Application Specific Systems, Architectures and Processors, 2000.
[20] M. Wipliez, N. Siret, N. Carta, F. Palumbo, and L. Raffo, "Design IP Faster: Introducing the C∼ High-Level Language," IP-SOC: IP-Embedded System Conf. and Exhib., 2012.
[21] B. Bond, K. Hammil, L. Litchev, and S. Singh, "FPGA Circuit Synthesis of Accelerator Data-Parallel Programs," in Intl. Symp. on FPGA for Custom Computing Machines (FCCM), 2010.
[22] M. Rutten, O. P. Gangwal, J. van Eijndhoven, E. Jaspers, and E.-J. Pol, "Application Design Trajectory towards Reusable Coprocessors - MPEG Case Study," in Embedded System for Real-Time Multimedia, 2004.
[23] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Jrnl. VLSI Signal Process. Syst., vol. 44, pp. 135-152, 2006.
[24] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Jrnl. Appl. Signal Process., 2006.
[25] F. Palumbo, N. Carta, and L. Raffo, "The multi-dataflow composer tool: A runtime reconfigurable HDL platform composer," Conf. on Design and Architectures for Signal and Image Processing (DASIP), 2011.
[26] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Jrnl. of Real Time Image Proc., vol. 9, issue 1, pp. 233-249, 2014.
[27] F. Palumbo, D. Pani, E. Manca, L. Raffo, M. Mattavelli, and G. Roquier, "RVC: A multi-decoder CAL Composer tool," in Proc. of DASIP, 2010.
[28] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Neural Engineering (NER), 2013 6th Intl. IEEE/EMBS Conf. on, 2013.
[29] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proc. of DASIP, 2013.
[30] F. Palumbo, C. Sau, and L. Raffo, "Coarse-Grained Reconfiguration: dataflow-based power management," (to appear) IET Computers & Digital Techniques, Special Issue on Energy efficient computing with adaptive and heterogeneous architectures, 2014.