A Prototyping Method of Embedded Real Time Systems for Signal Processing Applications

Luc Bianco†, Michel Auguin†, Alain Pegatoquet‡
† I3S, Université de Nice Sophia Antipolis, 41 Blv Napoléon III, 06041 Nice cedex, France
‡ VLSI Technology, 505 Route des Lucioles, Sophia Antipolis, 06560 Valbonne

Abstract
Since the complexity of embedded applications is continuously increasing, system designers face an increasingly difficult task: selecting and interconnecting the right system components to implement a system functionality so that timing and design constraints are satisfied. The lack of methods and tools able to explore various system solutions contributes to lengthening the time to market. In this paper we present an efficient algorithm for a semi-automatic exploration of the system design space that helps the designer determine the specifications of a cost-efficient system architecture satisfying the performance and design constraints.

1. Introduction
Due to the increasing complexity of embedded applications, it is of prime importance to have efficient methods for exploring a huge system design space [5],[9]. Recent works investigate this problem, e.g. [6],[9]. The dimension of this space results from the large number of possible mappings of the application functionality onto system components (e.g., processor cores, custom hardware blocks) and from the interdependent optimization criteria: performance, area and power [10]. Some codesign approaches [8],[7],[1] target system architectures composed of a single processor connected to coprocessors. Since heterogeneous multiprocessor embedded systems are now widespread, prototyping methods must target more complex architectures [9],[2]. Our method considers such architectures and is able to start either from scratch or from a description of a predesigned system composed of heterogeneous processor cores interconnected with coprocessors and sharable hardware accelerators. This facility allows the designer to perform a design space exploration or to optimize a particular architecture issued from a previous design.

The method, based on a time critical path metric [4],[3], provides specifications that satisfy the time constraints and takes advantage of the scheduling freedom of the tasks to perform an area optimization that includes the RAM and ROM memories. It performs an iterative partitioning and scheduling (local choices) of the tasks of the application based on a global metric (as introduced in [8]). The low execution time of the algorithm and the interactivity of the method allow the designer to perform a fast system design space exploration and to investigate and optimize various potential architectures. This paper is structured as follows: Sections 2 and 3 describe our models of applications and system architectures. Section 4 details our partitioning algorithm based on a time and area metric. Experimental results are presented in Section 5.

2. Application model
We focus on signal processing applications dealing with streams of samples that must be processed within a sampling period. The data flow graph (DFG) model is commonly used to describe such applications [8] since it is well suited to represent functional dependencies and the firing of operators as soon as their input data are available. Some annotations are introduced in the model in order to describe a) the time constraint Tmax(τi) on each terminal node or task τi of the data flow model, b) the size of the data dτi,τj produced on each outgoing edge of a node, c) the width wτi,τj, in number of bytes, of each data word exchanged between the tasks τi and τj and, d) the set π(τi) of types of units able to execute the task τi. A simple example of an application model is given in Fig. 1. For the sake of simplicity we consider in the subsequent sections a single global value of Tmax set on all the end nodes of the graph, but the extension to multiple time constraints is straightforward.

(Figure 1 shows a DFG with five tasks τ1 to τ5 connected by the edges τ1→τ2, τ2→τ3, τ2→τ5 and τ3→τ4, each labeled with the data size dτi,τj and the word width wτi,τj; π(τ1)={U1,U2}, π(τ2)={U1}, π(τ3)={U3}, π(τ4)=π(τ5)=*; Tmax(τ4)=25, Tmax(τ5)=18. U1, U2, U3 are types of HW or SW units; * denotes the whole set of potential units, e.g. {U1,U2,U3}.)

Fig. 1. Example of a data flow graph specification
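To make the annotations concrete, a minimal Python sketch of how such an annotated DFG could be encoded is given below; the dictionary layout and the numeric d/w values on the edges are illustrative assumptions, not part of the model.

# Hypothetical encoding of the DFG of Fig. 1 (names and edge values are illustrative).
UNITS = {"U1", "U2", "U3"}   # types of HW or SW units
ANY = set(UNITS)             # '*' : the whole set of potential units

tasks = {
    #        pi(task)            Tmax (time constraint on terminal nodes)
    "t1": {"pi": {"U1", "U2"}, "Tmax": None},
    "t2": {"pi": {"U1"},       "Tmax": None},
    "t3": {"pi": {"U3"},       "Tmax": None},
    "t4": {"pi": ANY,          "Tmax": 25},
    "t5": {"pi": ANY,          "Tmax": 18},
}

# (producer, consumer) -> d: number of data items, w: word width in bytes
edges = {
    ("t1", "t2"): {"d": 256, "w": 2},   # d and w are placeholder values
    ("t2", "t3"): {"d": 256, "w": 2},
    ("t2", "t5"): {"d": 256, "w": 2},
    ("t3", "t4"): {"d": 256, "w": 2},
}

def succ(task):
    """Direct successors of a task in the DFG."""
    return [dst for (src, dst) in edges if src == task]

def terminal_tasks():
    """Tasks without successors, i.e., the nodes carrying a time constraint."""
    return [t for t in tasks if not succ(t)]

print(terminal_tasks())   # ['t4', 't5']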

3. Target architecture model
As there is a continuous pressure to embed more sophisticated services and better quality in telecommunication systems, the architectures of embedded systems are increasingly parallel, including on a single chip IP or in-house components such as RISC and DSP processor cores connected to coprocessors and sharing hardware accelerators (e.g. ASIP [11]). Therefore, we consider a heterogeneous target architecture model. The interconnection between components is based on a bus structure rather than on a point-to-point connection model [12] in order to provide extension capabilities and potential reuse of the architecture. In order to support asynchronous transfers, single or dual port communication memories can be incorporated in the interconnection. Software components may have DMA capabilities on their I/O ports and a memory hierarchy composed of a cache memory or a local/external memory system. A coprocessor can be attached to a single software component and can be accessed by this component only. An accelerator is a sharable hardware component with an internal controller; it can execute a task in parallel with the tasks executed by the other components of the system.

4. Partitioning method
The inputs of the method are the annotated DFG, the library of HW and SW units uk, the set of implementation models of the tasks τi on the units uk, the area constraints and an optional predesigned system. The predesigned system consists of a set of HW and/or SW instances interconnected with buses and communication memories. Without the specification of a predesigned system the method starts from scratch and must instantiate units. Otherwise, the method prioritizes the utilization of the instances of the predesigned system to map the tasks of the DFG. A task may have several implementation models on the same unit (e.g. FFT radix-2 and FFT radix-4 on a given DSP are two implementation models of the FFT). For the sake of simplicity we consider a single implementation model of τi on uk; the generalization of the method to several implementation models is straightforward.

The partitioning method is based on a time criticality metric that guides the scheduling/allocation process. According to the time criticality of the execution of a task, the partitioning heuristic performs either a time optimization, an area optimization, or a mix of both. This metric is based on the evaluation of the minimum and maximum path lengths from a task to the terminal tasks of the graph.

4.1. Time criticality

4.1.1. Path length. The path length evaluated in our method is based on the path analysis introduced in [4]. However, in [4] the aim is to determine the fastest implementation of an application whatever its area cost. In order to benefit from the temporal freedom of execution of the tasks for area optimization, we evaluate both minimum and maximum path lengths of the tasks. Let uk be a unit able to realize a task τi of the DFG. For an implementation model of τi on uk we define these path lengths as:

plmin(τi,uk) = texe(τi,uk) + Max_{τj ∈ succ(τi)} ( Min_{ul} ( tc_{uk,ul}(τi,τj) + plmin(τj,ul) ) )

plmax(τi,uk) = texe(τi,uk) + Max_{τj ∈ succ(τi), ul} ( tc_{uk,ul}(τi,τj) + plmax(τj,ul) )

where texe(τi,uk) is the execution time of the implementation model and succ(τi) represents the direct successors of the task τi in the DFG. The value tc_{uk,ul}(τi,τj) denotes the communication time needed to transfer the set of dτi,τj data between τi and τj. If k ≠ l,

tc_{uk,ul}(τi,τj) = Max_{m=k,l} ( dτi,τj × ⌈ wτi,τj / wm ⌉ × wm × Tcm / thm )

otherwise (k = l):

tc_{uk,ul}(τi,τj) = 0

The I/O ports with the maximum throughputs thm of uk and ul are considered. The value wm represents the width of the I/O port of um and Tcm the clock period of the unit. The minimum and maximum path lengths are evaluated for every task and for every unit listed in the set π(τi) of potential units able to execute the task. The evaluation of these paths is done recursively from the terminal nodes of the DFG.
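For illustration, this communication-time model can be transcribed directly in Python; rounding each data word up to a whole number of port words is our assumption, and the throughput unit (bytes per cycle) is chosen for the example only.

import math

def tc(d, w_data, ports):
    """Communication time between two different units.
    d: number of data items, w_data: width of a data word in bytes,
    ports: iterable of (w_port, Tc, th) for the two I/O ports considered,
           with w_port the port width in bytes, Tc the clock period and
           th the port throughput in bytes per cycle (assumed unit)."""
    return max(d * math.ceil(w_data / w_port) * w_port * Tc / th
               for (w_port, Tc, th) in ports)

# Example: 256 words of 2 bytes over two 2-byte-wide ports.
print(tc(256, 2, [(2, 10e-9, 2.0), (2, 12e-9, 2.0)]))   # 3.072e-06 seconds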

The computation of these path lengths is simple, as illustrated by the example of Figure 2. The execution times of each task on the units U1, U2 and U3 are given in Figure 2.b. For the sake of simplicity, the edges of the graph (Figure 2.a) are labeled with constant data transfer times. These values are used when the tasks connected to an edge are executed on different units. Using the definitions of plmin and plmax, we get the values listed in the table of Figure 2.c. The minimum and maximum path lengths of the end tasks are equal to their execution times; the recursive computation of the path lengths starts from these values.

a) Data flow graph: the tasks τ1 to τ5 of Fig. 1, with constant transfer times on the edges: τ1→τ2: 1, τ2→τ3: 1, τ2→τ5: 2, τ3→τ4: 2.

b) Execution times of tasks:

        U1    U2    U3
τ1      5     3     -
τ2      3     -     -
τ3      -     -     3
τ4      5     8     4
τ5      6     5     9

c) Evaluation of plmin/plmax:

        U1       U2       U3
τ1      16/23    15/22    -
τ2      11/18    -        -
τ3      -        -        7/13
τ4      5/5      8/8      4/4
τ5      6/6      5/5      9/9

Fig. 2. Example of path length evaluation

As an example, consider the computation of plmin(τ2,u1):

plmin(τ2,u1) = 3 + max( min(7+1), min(6+0, 5+2, 9+2) ) = 11
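The recursion can be sketched in a few lines of Python. The sketch below reuses the execution times of Figure 2.b and our reading of the constant edge transfer times of Figure 2.a; it assumes the simplified constant-cost model of the example rather than the general tc formula.

# Execution times texe[task][unit] of Fig. 2.b (None = task not realizable on the unit).
texe = {
    "t1": {"U1": 5,    "U2": 3,    "U3": None},
    "t2": {"U1": 3,    "U2": None, "U3": None},
    "t3": {"U1": None, "U2": None, "U3": 3},
    "t4": {"U1": 5,    "U2": 8,    "U3": 4},
    "t5": {"U1": 6,    "U2": 5,    "U3": 9},
}
# Constant transfer times of Fig. 2.a (our reading of the figure).
edge_cost = {("t1", "t2"): 1, ("t2", "t3"): 1, ("t2", "t5"): 2, ("t3", "t4"): 2}

def succ(t):
    return [b for (a, b) in edge_cost if a == t]

def tc(uk, ul, ti, tj):
    # Transfer cost is paid only when producer and consumer run on different units.
    return 0 if uk == ul else edge_cost[(ti, tj)]

def units_of(t):
    return [u for u, x in texe[t].items() if x is not None]

def pl_min(ti, uk):
    # texe plus, for each successor, the cheapest continuation; keep the worst successor.
    worst = 0
    for tj in succ(ti):
        worst = max(worst, min(tc(uk, ul, ti, tj) + pl_min(tj, ul) for ul in units_of(tj)))
    return texe[ti][uk] + worst

def pl_max(ti, uk):
    # texe plus the worst continuation over all successors and units.
    worst = 0
    for tj in succ(ti):
        worst = max(worst, max(tc(uk, ul, ti, tj) + pl_max(tj, ul) for ul in units_of(tj)))
    return texe[ti][uk] + worst

print(pl_min("t2", "U1"))   # 11, as in the worked example above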

The largest plmax(τi,uk) over the source nodes τi of the DFG is noted PLMAX. The partitioning heuristic is based on a scheduling algorithm that determines, at a time t, an assignment and a schedule for the data ready tasks of the DFG. A data ready task has all its predecessors already scheduled and completed before t. The time criticality metric of a task is based on the values of the path lengths from this task and on its available time with respect to the time constraints.

4.1.2. Available times of tasks. Let pred(τi) be the set of tasks that are the direct predecessors of the task τi. All the tasks in pred(τi) have been previously scheduled. If all the instances p(τj) on which the tasks τj ∈ pred(τi) are mapped have a DMA capability, the available time Ta(τi,uk) of a task τi associated with a new instance uk with a DMA unit is:

Ta(τi,uk) = Tmax - Max_{τj ∈ pred(τi)} ( tend(τj) + tc_{p(τj),uk}(τj,τi) )

where tend(τj) denotes the end of execution of τj and tc_{p(τj),uk}(τj,τi) the communication time needed to transfer the dτj,τi data between the instance p(τj) and uk. This available time depends only on the availability of the data issued from the predecessors. The available time Ta(τi,uk) associated with an existing instance uk of the architecture under construction also depends on the utilization of uk by the tasks that have been previously scheduled on it:

Ta(τi,uk) = Tmax - Max( Max_{τj ∈ pred(τi)} ( tend(τj) + tc_{p(τj),uk}(τj,τi) ), Max_{τj : p(τj) = uk} ( tend(τj) ) )
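For illustration, a minimal Python sketch of the two evaluations of Ta is given below; the bookkeeping structures (tend, p) and the unit communication cost are assumptions about data produced by the scheduler, not part of the method's definition.

# Assumed bookkeeping produced by the scheduler (illustrative values).
Tmax = 25
t_end = {"t1": 5, "t2": 9}         # end of execution of already scheduled tasks
p = {"t1": "P1", "t2": "P1"}       # instance on which each scheduled task is mapped

def tc_inst(src_inst, dst_inst, tj, ti):
    # Communication time between instances; 0 if both tasks share the instance.
    return 0 if src_inst == dst_inst else 1   # placeholder cost

def Ta_new_instance_with_dma(ti, uk, pred):
    # Only the availability of the predecessors' data matters.
    return Tmax - max(t_end[tj] + tc_inst(p[tj], uk, tj, ti) for tj in pred)

def Ta_existing_instance(ti, uk, pred):
    # Also account for the tasks already scheduled on uk.
    data_ready = max(t_end[tj] + tc_inst(p[tj], uk, tj, ti) for tj in pred)
    busy_until = max((t_end[tj] for tj in t_end if p[tj] == uk), default=0)
    return Tmax - max(data_ready, busy_until)

print(Ta_new_instance_with_dma("t3", "P2", ["t2"]))   # 25 - (9 + 1) = 15
print(Ta_existing_instance("t3", "P1", ["t2"]))       # 25 - max(9 + 0, 9) = 16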

If the instances do not have DMA facilities, a more accurate value of the available time is evaluated by analyzing the times at which the utilizations of the instances end. Furthermore, the actual value of Ta(τi,uk) is evaluated by scanning the idle time slots of uk in order to find the earliest time slot larger than texe(τi,uk) and such that the data required by τi are available. Two relations between the path lengths and the task available times are obvious (Fig. 3):
(a) If plmin(τi,uk) < Ta(τi,uk) and plmax(τi,uk) < Ta(τi,uk), then the scheduling of τi on uk is not harmful for the successive partitioning steps.
(b) If plmin(τi,uk) > Ta(τi,uk), then the scheduling of τi on uk leads to the obvious failure of the partitioning algorithm due to time constraint violations.

Fig. 3. Available times of tasks: relative positions on the time axis of plmin(τi,uk), plmax(τi,uk), Ta(τi,uk) and δ(τi,uk) for the three cases (a), (b) and (c)

However, these relations are not sufficient to guide the partitioning algorithm in all cases (Fig. 3.c). Other metrics are introduced to assist the assignment process.

4.1.3. Time discriminating factor. The selection of a pair (τi,uk) by the partitioning process, for the tasks falling in the case depicted in Fig. 3.c, is based on a time discriminating factor. We first define the maximum freedom of schedule δ(τi,uk) by:

δ(τi,uk) = Ta(τi,uk) - plmin(τi,uk)

This value gives the maximum available time frame to schedule τi (and the successors of τi) on uk (Fig. 3). Units that have a large value of δ(τi,uk) impose the lowest constraints on the partitioning algorithm. The time discriminating factor γ(τi,uk) used to characterize each pair (τi,uk) is:

γ(τi,uk) = 1 - plmax(τi,uk) / ( δ(τi,uk) × PLMAX )

This factor is weighted by the relative time position of the task in the DFG (the plmax(τi,uk)/PLMAX term) in order to lessen its influence as the algorithm proceeds in the analysis of the tasks of the DFG. The time discriminating factor allows the partitioning algorithm to estimate the impact on performance of the selection of a unit with respect to the time constraints imposed on the application graph.
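A direct Python transcription of these two definitions, assuming plmin, plmax, Ta and PLMAX have already been evaluated; the numeric example reuses task τ2 on U1 from Fig. 2.c, with Ta and PLMAX values chosen for illustration.

def delta(Ta, pl_min):
    # Maximum freedom of schedule of the pair (task, unit).
    return Ta - pl_min

def gamma(Ta, pl_min, pl_max, PL_MAX):
    # Time discriminating factor, weighted by the relative position pl_max / PL_MAX.
    return 1.0 - pl_max / (delta(Ta, pl_min) * PL_MAX)

# Example with plmin = 11 and plmax = 18 (τ2 on U1 in Fig. 2.c), assuming
# Ta = 25 and PL_MAX = 23 (the largest plmax of the source node τ1).
print(gamma(25, 11, 18, 23))   # about 0.944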

4.2. Area evaluation
The area cost associated with the mapping of a task τi is either the estimation of the actual area increment σ(τi,uk) resulting from the assignment of the task to an existing unit uk of the system under construction, or a weighted area cost S(τi,uk) associated with a new instance uk, which takes into account the potential reusability of this instance for the succeeding tasks of the DFG. In the former case the estimated area increment σ(τi,uk) is equal to the difference between the available RAM and ROM memories (internal or external) of the instance uk and the memory size needed by the implementation model of τi on uk. The area cost S(τi,uk) resulting from the mapping of τi on a new instance uk is a weighted sum of the core area s(uk) of this instance, given in the library, and of an additional area cost resulting from the potential mapping on this instance of succeeding tasks of the DFG. The computation of S(τi,uk) is based on an estimate of the reusability of the instance uk in the interval Ta(τi,uk). Let Ṽ be the set of the tasks of the graph that remain to be partitioned. Let Φ be the set of the task types from which the tasks of Ṽ are issued (e.g. if there are two FFT nodes in the DFG, they are issued from the FFT task type). Let Γk = { τi ∈ Ṽ | uk realizes τi } denote the subset of tasks that the unit uk is able to execute. In the same way, Φk = { ϕ ∈ Φ | uk realizes ϕ } is the restriction to the types of tasks. We define the mean execution time of the tasks in Γk by:

tk = ( 1 / card(Γk) ) × Σ_{τi ∈ Γk} texe(τi,uk)

This mean time makes it possible to evaluate the average number of tasks that can be scheduled in the available time Ta(τi,uk) associated with uk:

β(τi,uk) = ( Ta(τi,uk) / tk ) × ( card(Γk) / card(Ṽ) )

The ratio card(Γk)/card(Ṽ) allows the units uk able to execute the greater number of tasks of Ṽ to be prioritized in the partitioning process.

The area cost S(τi,uk) resulting from the mapping of τi on a new instance uk includes a measure of its reusability:

S(τi,uk) = s(τi,uk) + ( Σ_{ϕ ∈ Φk} σ(ϕ,uk) ) / β(τi,uk)

The area discriminating factor θ(τi,uk) permits prioritizing the units that introduce the least area increment, taking into account their potential reuse:

θ(τi,uk) = 1 - S(τi,uk) / Max_{ui ∈ U, τj ∈ Vn} ( S(τj,ui) )

where U is the set of the potential new instances and Vn the set of data ready tasks. We define a global time/area discriminating factor Ω(τi,uk):

Ω(τi,uk) = (1 - α) × γ(τi,uk) + α × θ(τi,uk)

where α is a user-defined parameter, 0 ≤ α ≤ 1.
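The area-related factors can be sketched in the same way; the function names, the task-type areas and the input values below are placeholders for the data maintained by the tool, not values taken from the paper.

def beta(Ta, t_mean, n_realizable, n_remaining):
    # Average number of remaining tasks schedulable on uk in Ta, weighted by
    # the fraction of remaining tasks the unit is able to execute.
    return (Ta / t_mean) * (n_realizable / n_remaining)

def S(core_area, sigma_by_type, b):
    # Weighted area cost of a new instance: core area plus the area of the
    # realizable task types, divided by the expected reuse beta.
    return core_area + sum(sigma_by_type.values()) / b

def theta(S_ik, S_max):
    # Area discriminating factor: closer to 1 for cheap candidates, 0 for the largest one.
    return 1.0 - S_ik / S_max

def omega(gamma_ik, theta_ik, alpha):
    # Global time/area factor; alpha in [0,1] balances time versus area.
    return (1.0 - alpha) * gamma_ik + alpha * theta_ik

# Illustrative values only.
b = beta(Ta=1000.0, t_mean=200.0, n_realizable=4, n_remaining=6)
s1 = S(core_area=11.0, sigma_by_type={"FFT": 1.5, "FIR": 0.8}, b=b)
print(omega(gamma_ik=0.9, theta_ik=theta(s1, S_max=20.0), alpha=0.5))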

4.3. Partitioning heuristic
Let τi and An be, respectively, a data ready task and the system architecture under construction at step n. The system architecture A0 corresponds to the predesigned system, which may be empty. In a first phase, the algorithm encourages the reuse of the instances of An. Let ϕ(τi) be the type of a task τi. Based on the instances uk ∈ An able to realize τi, the algorithm constructs four lists (Fig. 4): L1reuse, L2reuse, L1new and Lothers:
- L1reuse: list of pairs (τi,uk) such that plmax(τi,uk) ≤ Ta(τi,uk), uk ∈ An and ϕ(τi) was previously mapped on uk.
- L2reuse: list of pairs (τi,uk) such that γ(τi,uk) ≥ Treuse, uk ∈ An and ϕ(τi) was previously mapped on uk. Treuse is a threshold value in [0,1] set up by the designer. Setting Treuse close to 0 enforces the reuse of instances. On the contrary, setting Treuse close to 1 attempts to provide solutions with lower execution times, since the algorithm is then constrained to instantiate new units.
- L1new: list of pairs (τi,uk) such that γ(τi,uk) ≥ Tnew, uk ∈ An but ϕ(τi) is new on uk. Tnew is a second threshold value set up by the designer. If Tnew is close to 1 the algorithm attempts to provide solutions with fast units. If Tnew is close to 0 all the new instances, including those with low execution times, are considered and the algorithm attempts to minimize the total area.
- Lothers: list of pairs (τi,uk) such that γ(τi,uk) < Treuse, uk ∈ An.
With the new instances, the algorithm constructs a new list L2new and fills up the list Lothers:
- L2new: list of pairs (τi,uk) such that γ(τi,uk) ≥ Tnew, uk ∉ An.
- Lothers: list of pairs (τi,uk) such that γ(τi,uk) < Tnew, uk ∉ An.

Among the data ready tasks, the partitioning algorithm selects the most critical task, i.e., the task τi such that the value Max_{uk}( Ta(τi,uk) - plmin(τi,uk) ) is minimum. For this task, the algorithm selects the best implementation by scanning the lists as depicted in Fig. 4. If τi has no possible realization, the algorithm fails since the minimum path length is larger than the time constraint (Fig. 3.b).

Order of scanning of the lists (from first to last):
L1reuse: tasks already mapped on instances ∈ An; pairs (τi,uk) with plmax(τi,uk) ≤ Ta(τi,uk); list sorted in decreasing order of (Ta - plmax).
L2reuse: tasks already mapped on instances ∈ An; pairs (τi,uk) with γ(τi,uk) ≥ Treuse; list sorted in decreasing order of γ.
L1new: new tasks on instances ∈ An; pairs (τi,uk) with γ(τi,uk) ≥ Tnew; list sorted in decreasing order of Ω.
L2new: new tasks on new instances; pairs (τi,uk) with γ(τi,uk) ≥ Tnew; list sorted in decreasing order of Ω.
Lothers: new instances or instances ∈ An; pairs (τi,uk) with γ(τi,uk) < Treuse or γ(τi,uk) < Tnew; list sorted in decreasing order of Ω.

Fig. 4. Distribution of implementation models
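The list construction and scanning order of Fig. 4 can be summarized by the following Python sketch; the candidate dictionaries and the helper functions are assumptions about how the tool's data could be organized, not the tool's actual interface.

def build_lists(candidates, T_reuse, T_new):
    """candidates: iterable of dicts with keys 'in_An' (unit already instantiated),
    'type_mapped' (phi(ti) already mapped on uk), 'pl_max', 'Ta', 'gamma', 'Omega'."""
    L1r, L2r, L1n, L2n, Lo = [], [], [], [], []
    for c in candidates:
        if c["in_An"] and c["type_mapped"] and c["pl_max"] <= c["Ta"]:
            L1r.append(c)
        elif c["in_An"] and c["type_mapped"] and c["gamma"] >= T_reuse:
            L2r.append(c)
        elif c["in_An"] and c["gamma"] >= T_new:
            L1n.append(c)
        elif not c["in_An"] and c["gamma"] >= T_new:
            L2n.append(c)
        else:
            Lo.append(c)
    L1r.sort(key=lambda c: c["Ta"] - c["pl_max"], reverse=True)
    L2r.sort(key=lambda c: c["gamma"], reverse=True)
    for L in (L1n, L2n, Lo):
        L.sort(key=lambda c: c["Omega"], reverse=True)
    return [L1r, L2r, L1n, L2n, Lo]   # scanning order of Fig. 4

def select_implementation(candidates, T_reuse, T_new):
    # Scan the lists in order and return the first candidate found, or None.
    for L in build_lists(candidates, T_reuse, T_new):
        if L:
            return L[0]
    return None

example = [{"in_An": True, "type_mapped": False, "pl_max": 12, "Ta": 15,
            "gamma": 0.7, "Omega": 0.6}]
print(select_implementation(example, T_reuse=0.8, T_new=0.5))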

5. Partitioning results
In order to illustrate this algorithm we consider a simplified AC3 audio decoder whose DFG is given in Fig. 5. The time constraint is set to 1.5 ms in order to keep room for other applications on the system. The characteristics of the implementation models of the tasks are detailed in Table I. Two processors P1, P2 and a HW accelerator are considered. These units have core areas of 11 mm2, 5 mm2 and 3.5 mm2 respectively. The minimum RAM and ROM blocks used to organize the processor memories have a size of 256 bytes, with areas of 0.5 mm2 and 0.3 mm2 respectively.

(Figure 5 shows the decoder DFG: Exponent (Exp), BitAlloc (BA), Decode Mantissas (DM), Decoupling (DC), Rematrixing (RX) and ITDAC tasks applied to the Left, Center, Right and Coupling channels, with block sizes of 256 or 216 values and word widths of 4, 5, 7, 16 and 18 bits annotated on the edges.)

Fig. 5. DFG of a simplified AC3 audio decoder

TABLE I: Execution times (µs) / ROM size (bytes) / RAM size (bytes)

Task    Processor P1                    Processor P2            HW accelerator
Exp     Sw: 69 / 774 / 258              Sw: 81 / 920 / 258      -
BA      Sw: 532 / 1092 / 200            Sw: 589 / 1680 / 200    Hw: 73 / 0 / 0
        Sw+coprocessor: 197 / 800 / 200
DM      Sw: 237 / 274 / 150             Sw: 213 / 392 / 150     -
DC      Sw: 47 / 120 / 48               Sw: 52 / 190 / 48       -
RX      Sw: 10 / 20 / 0                 Sw: 13 / 42 / 0         -
ITDAC   Sw: 130 / 776 / 384             -                       -
        Sw+coprocessor: 31 / 311 / 384

Characteristics of five system architectures are given in Fig. 6. These systems are obtained for different values of the parameters Tnew, Treuse and α. Compared to system (1), solution (3) is obtained with a higher value of Treuse, which limits the reuse of previous instances and task mappings by the algorithm. For example, the task Exp is duplicated on the instances P1 and P2 in system (3), increasing its total area. In system (1) the algorithm first selects the processor P2, since it has a lower area and realizes numerous tasks of the DFG. In system (2) the parameter α is set to 0, disabling area optimizations: the task ITDAC is mapped on the two processors, with a SW-only implementation and with a mixed SW+coprocessor model on P2. In order to optimize this system we used, as the predesigned system input of the algorithm, the description of system (2) previously provided by the tool. We obtain system (2-1), in which only the implementation of the task ITDAC with the coprocessor is used; the sizes of the ROM and RAM memories are thus reduced. This example illustrates that the solutions provided by the partitioning algorithm are obviously not optimal, since local choices are performed based on a criticality metric. However, the tool is able to improve each solution by using its description as a predesigned system. The benefit of this facility is also illustrated by system (4), which is obtained from an initial system including a processor P1 and a HW accelerator. The declaration of the HW accelerator avoids the instantiation of a second processor, as performed in systems (1), (2) and (3). The contributions of the core, RAM, ROM and coprocessor areas to the total area are detailed for each solution in the right part of Fig. 6. Note that the five architectures have an execution time close to the time constraint; this illustrates the ability of the partitioning algorithm to take advantage of the whole time frame to perform area optimizations. The schedule of the tasks on system (4), which has the lowest area, is given in Fig. 7. The partitioning algorithm determines the set of data communications between units while the tasks are scheduled; all the communications of system (4) can be supported by a single bus. The partitioning algorithm is fast enough to allow the exploration of different potential architectures: for example, each solution depicted in Fig. 6 is constructed in less than 2 seconds on a SparcStation 5.

(Figure 6 plots, for each system, the elapsed time in ms against the total area in mm2, with the corresponding Tnew/Treuse/α settings: (1) 0.8/0.8/1.0, (2) 0.8/0.8/0.0, (2-1) 0.8/0.8/1.0, (3) 0.8/0.99/1.0, (4) 0.8/0.8/1.0. The right part details, for each system, the contributions of the core, RAM, ROM and coprocessor areas to the total area. The architectures are: (1) P1 + P2 + copro. Itdac + HW accelerator; (2) 2xP1 + copro. Itdac + coproPSD; (2-1) 2xP1 + copro. Itdac + coproPSD; (3) P1 + P2 + copro. Itdac + HW accelerator; (4) P1 + copro. Itdac + HW accelerator.)

Fig. 6. Partitioning results

(Figure 7 shows the schedule: the processor P1 executes Inputs, Exp, Decode Mantissas, ITDAC and Outputs, while the HW accelerator executes Bit Allocation.)

Fig. 7. Schedule of tasks on system (1)

6. Conclusion
Our partitioning/scheduling algorithm is able to determine system specifications from a functional description of an application. In these system specifications, important architecture parameters are accurately defined, e.g., the sizes of the RAM and ROM memories of the processors and the use of DMA or CPU controlled data transfers. This capability permits successive refinements of the specifications with limited feedback loops due to ill-defined estimates of the characteristics of the initial system. The algorithm is able to start from scratch or from an initial system architecture. The interactivity provided by the method allows the designer to explore a system design space corresponding to different performance/cost trade-offs. Future improvements of the method include interconnection synthesis according to different bus characteristics, as well as an extension of the application model to support mixed data flow/control flow applications.

7. References
[1] S. Agrawal, R. Gupta, Data-flow Assisted Behavioral Partitioning for Embedded Systems, Proc. 34th Design Automation Conference, Anaheim, 9-13, June 1997.
[2] A. Bender, Design of an optimal loosely coupled heterogeneous multiprocessor system, Proc. of the European Design and Test Conference, 1996.
[3] L. Bianco, M. Auguin, G. Gogniat, A. Pegatoquet, A Path Analysis Based Partitioning for Time Constrained Embedded Systems, Proc. Workshop on Hardware/Software Codesign, Seattle, March 15-18, pp. 84-90, 1998.
[4] P. Bjørn-Jørgensen, J. Madsen, Critical path driven cosynthesis for heterogeneous target architectures, Proc. 5th Codes/CASHE'97, 15-19, Braunschweig, March 1997.
[5] G. De Micheli, R. Gupta, Hardware/Software Co-Design, Proc. IEEE, vol. 85, no. 3, pp. 349-364, 1997.
[6] J. Da Silva et al., Efficient System Exploration and Synthesis of Applications with Dynamic Data Storage and Intensive Data Transfer, Proc. 35th Design Automation Conference, San Francisco, 15-18, June 1998.
[7] J. Henkel, R. Ernst, A Hardware/Software Partitioner using a Dynamically Determined Granularity, Proc. 34th Design Automation Conference, Anaheim, 9-13, June 1997.
[8] A. Kalavade, E. Lee, The extended partitioning problem: hardware/software mapping and implementation-bin selection, Proc. Int. Workshop on Rapid System Prototyping, Chapel Hill, NC, June 7-9, pp. 12-18, 1995.
[9] I. Karkowski, H. Corporaal, Design Space Exploration Algorithm for Heterogeneous Multi-processor Embedded System Design, Proc. 35th Design Automation Conference, San Francisco, 15-18, June 1998.
[10] Y. Li, J. Henkel, A Framework for Estimating and Minimizing Energy Dissipation of Embedded HW/SW Systems, Proc. 35th Design Automation Conference, San Francisco, 15-18, June 1998.
[11] P. Paulin, DSP design tool requirements for embedded systems: a telecommunication industrial perspective, Journal of VLSI Signal Processing, vol. 9, pp. 23-47, Jan. 1995.
[12] S. Vercauteren, B. Lin, H. De Man, Constructing Application Specific Heterogeneous Embedded Architectures from Custom HW/SW Applications, Proc. 33rd Design Automation Conference, Las Vegas, NV, June 1996.