Polynomial-Time Algorithm for On-Chip Scratchpad Memory Partitioning

Federico Angiolini [email protected]

Luca Benini [email protected]

Alberto Caprara [email protected]

Dipartimento di Elettronica, Informatica e Sistemistica (DEIS), University of Bologna, Viale Risorgimento 2, 40134 Bologna, Italy

ABSTRACT


Focusing on embedded applications, scratchpad memories (SPMs) look like a best-compromise solution when taking into account performance, energy consumption and die area. The main challenge in SPM design is mapping memory locations to scratchpad locations. This paper describes an algorithm to optimally solve such a mapping problem by means of Dynamic Programming applied to a synthesizable hardware architecture. The algorithm works by mapping segments of external memory to physically partitioned banks of an on-chip SPM; this architecture provides significant energy savings. The algorithm does not require any user-set bound on the number of partitions and takes partitioning overhead into account. Improving on previous solutions, its execution time is polynomial in the input size. Strategies to optimize the memory requirements and speed of the algorithm are also presented. Additionally, we integrate this algorithm into a complete and automated design, simulation and synthesis flow.


Categories and Subject Descriptors F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes; C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.4 [Performance of Systems]: Design studies, Performance attributes; C.5.3 [Computer System Implementation]: Microcomputers

General Terms Algorithms, Performance, Design

Keywords Scratchpad Memory, Partitioning Algorithm, Dynamic Programming, Embedded Design, Memory Hierarchy, Power Saving, Design Automation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES’03, Oct. 30–Nov. 2, 2003, San Jose, California, USA. Copyright 2003 ACM 1-58113-676-5/03/0010 ...$5.00.

1. INTRODUCTION

Advances in manufacturing processes are driving the semiconductor industry towards miniaturization and integration of chip design. While allowing for the implementation of many more features on the same die, ultimately leading to cheap yet very powerful System-on-a-Chip (SoC) products, this evolution has problematic side effects. One of them is the need to optimally manage the power consumption of such complex, multi-million-transistor designs; another is the growing relative cost (from the point of view of both energy and performance) of accessing off-chip components, among which external memory certainly takes one of the most prominent spots.

Many solutions have been architected to counter such difficulties. Possibly the most significant one is the implementation of memory hierarchies. In this approach, the overall available physical memory is split into layers, going from the highest-capacity ones, which are usually relatively slow because of engineering constraints, to the smallest ones, which are also the fastest because the processing units are allowed direct access to them. Every layer is a subset of the previous one, and efficiency is achieved through the use of intelligent algorithms to map, statically or dynamically, the most accessed content into the layers nearest to the computing resources. These layers get small enough (usually in the range of kilobytes to hundreds of kilobytes) to make it possible to integrate them onto the same die as the execution units, thus reducing cost, access latency and energy consumption.

The most typical implementation makes use of cache memories. While extremely versatile and very fast, caches are not always the best choice in terms of energy efficiency and die area, due to their large overhead of control logic. When focusing on embedded systems and applications, one of the biggest advantages of caches, versatility, is often unneeded, while power consumption and cost play much more important roles. For this reason, the alternative solution of scratchpad memories (SPM) has been developed. These are quite similar to caches in terms of size and speed (ideally one-cycle access time), but have no dedicated logic for dynamic swapping of contents. Instead, it is the designer's responsibility to explicitly map addresses of external memory to locations of the SPM. While impractical in general-purpose architectures, this process becomes feasible in embedded contexts, where designers usually have fine control over both the software and the underlying hardware, and are able to optimally match them. Additionally, it is also possible to implement both a scratchpad and a cache at the same time, exploiting their respective advantages. It is not difficult to add scratchpad memories at the hardware

level; they usually require just an SRAM array and decoders to separate SPM accesses from RAM accesses. All of these can be easily generated with macros. The hardest problem is optimally mapping the most frequently accessed content to the scratchpad. This paper will focus on providing a general optimal solution to this problem, while using acceptable processing resources in its determination. An additional goal of today's design flows, made more pressing by the growing complexity of architectures, is the automation of design steps. This paper will also show how to integrate the partitioning of an SPM into a pre-established simulation and synthesis flow. The rest of this paper is structured as follows. Section 2 discusses previous work in the area of memory hierarchies, with emphasis upon scratchpad memories. Section 3 discusses the general design of our implementation and the ties to the underlying simulation platform. Section 4 focuses upon the partitioning algorithm we developed. Section 5 details the experiments performed to validate the proposed algorithm, discussing benchmark data. Finally, Section 6 summarizes the results of our work.

2. RELATED WORK

A significant amount of literature is available on the subject of memory hierarchies and, more specifically, of scratchpad memories. For example, in [1], [2] and [3] some possible architectures for mid-to-high-end embedded processors with SPM are described. Scratchpad memories, in these works, can act in cooperation with caches, either by taking the role of fast buffers for data transfer or by helping prevent cache pollution thanks to intelligent data management.

2.1 Software-assisted SPM mapping

The problem of optimally allocating critical data onto a fast but small memory has been thoroughly investigated since the early eighties, with the appearance of caches. Many software-based techniques for cache allocation have been developed over this time frame; a comprehensive review of compiler-based approaches may be found in [4]. More recently, this research has been revisited and applied to SPMs as well. For example, the works [5], [6] and [7] deal with the allocation of data onto scratchpad memories and/or caches; however, their approach, while fast, is quite suboptimal, as it covers only data, not code, and not with the fine granularity one could desire. Scalar and array variables are separately managed, and arrays are entities to be monolithically mapped in the addressing space. Additionally, these works only take speed into account, not energy saving. One more difference with our approach is that the authors postulate availability of the source code of the application and try to fit it to fixed hardware, while our solution starts from application binaries to synthesize optimal hardware. The above-mentioned works are extended in [8] with power estimation, but no focus is put upon scratchpads. Power, area and speed optimization through the use of SPMs are instead all explored in [9] and [10]; in these works, the authors rely on the support provided by specific compilers, again postulating availability of source code for a target application but invariant hardware. Their results are extremely interesting, especially considering the purely software nature of their approach. Compared to a cache subsystem, an SPM architecture with compiler-driven address mapping onto the SPM proves better, with percentage improvements above 20%, in every respect at the same time: execution cycles, energy consumption and die area. Once more, however, the granularity of the mapping is quite coarse; the authors only work on blocks of code and global variables.

Interesting ways of optimally dealing with SPMs are detailed in [11], [12], [13], [14] and [15], which explore choices of banking, tiling, allocation and even sharing in multiprocessor environments. Reference [12] is especially interesting, as it describes ways to dynamically move blocks of data from RAM to SPM and vice versa at runtime. The authors show benchmark results proving much greater efficiency than statically mapped SPMs; it should be noted, however, that their comparison is based upon a static mapping which should be noticeably less effective than the one described in the present paper.

2.2 Synthesis of SPM hardware

The works to which we will most often refer are [16] and [17], as they present the same synthesizable architecture this paper builds upon. A complete toolset is presented to obtain the layout of an optimized SPM platform starting from application binaries (source code allows additional flexibility, but is not required); back-annotation functions are provided. Power estimations, very reliable because validated on actual layout, show 35% savings on average. However, it should be underlined that, when dealing with the problem of mapping memory locations to scratchpad locations, the authors propose an algorithm with exponential worst-case running time, subject to an arbitrary bound on the maximum number of partitions; these limits are lifted in the present work. References [18] and [19], on the other hand, are also very important, as they lay the foundation for the integration of our partitioning algorithm into a complete platform.

3. DESIGN FLOW AND ARCHITECTURE

This paper is aimed at the development of efficient embedded systems, so some assumptions will be made: flexibility of the resulting product will be less important than power and area minimization; the target hardware architecture will be fully open for design, up to the layout level; optimization tools will be evaluated according to their output more than to their processing times, as processing penalties will be incurred only once during the design phase and not during everyday operation of the device.

3.1 Scratchpad data allocation

As previously said, scratchpad memories look like an optimal solution for embedded designs, and the actual process of synthesizing an SPM can be quite simple. Assuming an architecture similar to the one outlined in [16], where the scratchpad does not have a separate addressing space but simply maps portions of the addressing space of external memory, all that is required is an SRAM array and a decoder (see Fig. 1; notice that the SPM may or may not be contiguously mapped in the global memory space: we took the second route, as shown in the diagram). The most challenging task in SPM design, as previously said, is optimizing the mapping of data onto the scratchpad itself, so as to minimize expensive external RAM accesses. Multiple approaches have been explored in this regard; as the taxonomy in Fig. 2 shows, two main roads exist: the first makes use of software optimizations, the second focuses on hardware. Compiler-assisted techniques are outlined e.g. in [9], [10] and [12]; assuming access to the source code of both the target application and the compiler, these approaches attempt to optimally place application code and/or data inside a predefined, fixed-size hardware scratchpad. This placement can either be statically decided once and for all, or dynamically adapted by inserting additional functions in the application code to transfer data between the SPM and the external RAM at runtime. All these techniques have the big advantage of being software-based, but the disadvantage of suboptimally exploiting scratchpad potential because of coarse-grained object allocation.

Figure 1: Possible memory architecture with both a cache and a partitioned SPM

Figure 2: Main branches of SPM design approaches

In contrast, hardware fitting techniques take the opposite route: no application source code is required, and hardware access is instead assumed, as in [16] and [17]. In this case, optimal hardware is synthesized based upon the application's execution trace, with a granularity which can be as fine as a single memory location. Our work belongs to this second sector of research. Compiler-based techniques all involve some variant of the Knapsack Problem ([10]), a well-known NP-complete problem dealing with optimally filling a fixed-size knapsack with objects having different weights and profits (see [20]). Although NP-complete, the problem can be solved in a time which is polynomial in the number of objects and the size of the knapsack. Since the encoding length of the latter is proportional to its logarithm, such a running time is exponential in the encoding length (and therefore in the input size), and is also called a pseudopolynomial running time. In any case, as long as the size is not huge, the problem can be solved effectively. This is not the case for some strongly NP-complete variants of the problem, such as those with multiple knapsacks (instead of just one) and/or dynamic objects having a limited lifespan. All the optimal approaches to these variants are based on algorithms whose worst-case running time is exponential in the number of objects, and thus have limited scalability. Hardware fitting techniques, on the other hand, exhibit slight differences with respect to the allocation problem; this is due to an added degree of freedom in choosing object boundaries, because single memory locations are under scrutiny, while a compiler can only deal with whole routines, loops, and data structures. A more detailed discussion will follow in Section 4. What is worth noting here, however, is that this second problem, which was believed to require exponential time as well, can in fact be solved in polynomial time, as this paper will show, making hardware fitting techniques even more interesting.
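For concreteness, the classical pseudopolynomial dynamic program for the 0/1 Knapsack Problem mentioned above can be sketched as follows. This is a textbook illustration, not code from any of the cited works; the instance in main() is a made-up toy example in which weights are block sizes in words and profits are access counts.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Classical 0/1 knapsack DP: pseudopolynomial in the capacity C.
    // best[c] holds the maximum profit achievable with total weight <= c.
    int knapsack(const std::vector<int>& weight,
                 const std::vector<int>& profit, int capacity) {
        std::vector<int> best(capacity + 1, 0);
        for (size_t m = 0; m < weight.size(); ++m)       // M objects
            for (int c = capacity; c >= weight[m]; --c)  // C capacities
                best[c] = std::max(best[c], best[c - weight[m]] + profit[m]);
        return best[capacity];
    }

    int main() {
        // Toy instance: three memory blocks with sizes (words) and access counts.
        std::vector<int> weight = {4, 3, 2};
        std::vector<int> profit = {900, 500, 400};
        std::printf("max intercepted accesses: %d\n", knapsack(weight, profit, 5));
        return 0;
    }

The table has Θ(M · C) cells, which is polynomial in C itself but exponential in its encoding length log C, which is exactly the pseudopolynomial behaviour discussed above.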

3.2 Scratchpad partitioning

When the actual problem of hardware SPM design is examined, some choices have to be made. First of all, since no software support has been assumed, it is clear that all address decoding activity to separate external RAM accesses from SPM accesses has to be done in hardware. This means that if the SPM is expected to map multiple address ranges, which is a key point to achieve efficiency (see Section 5), a decoder with multiple comparators is required. At this point, since address-dependent logic has to be synthesized anyway, partitioning the SPM into physical banks, instead of just mapping several memory ranges onto a single physical SPM buffer, is the next logical step, as it allows power savings (only the accessed bank needs to be active at any time, while the others are turned off). On the other hand, the drawback is that each additional partition carries an overhead in terms of die area, power dissipation and delay, due to the greater complexity of the decoder and wiring; Fig. 1 shows this situation by using thick lines for partition boundaries, representing the space wasted every time a partition is added. This means that the overall efficiency, plotted against the number of partitions, has an optimum, the position of which mainly depends on the application and on the manufacturing process. The main purpose of this work is to develop a tool to optimally partition an SPM, and to link it to a pre-existing simulation environment in order to get its inputs and provide its outputs.
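As an illustration of the decoding step, the following fragment models the comparator-based decoder in software; it is a behavioral sketch only (the real decoder is synthesized logic), and the Partition type, the decode() function and the example ranges are all hypothetical.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One entry per SPM partition: the external-memory range it maps.
    // In hardware this corresponds to a pair of comparators per partition.
    struct Partition {
        uint32_t base;   // first mapped external address (inclusive)
        uint32_t limit;  // last mapped external address (inclusive)
    };

    // Returns the index of the bank serving `addr`, or -1 for external RAM.
    // Only the matching bank would be activated; the others stay off.
    int decode(const std::vector<Partition>& parts, uint32_t addr) {
        for (size_t b = 0; b < parts.size(); ++b)
            if (addr >= parts[b].base && addr <= parts[b].limit)
                return static_cast<int>(b);
        return -1;  // miss: access goes to external memory
    }

    int main() {
        std::vector<Partition> spm = {{0x1000, 0x13FF}, {0x8000, 0x87FF}};
        std::printf("0x1200 -> bank %d\n", decode(spm, 0x1200));  // bank 0
        std::printf("0x4000 -> bank %d\n", decode(spm, 0x4000));  // -1 (RAM)
        return 0;
    }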

3.3 Underlying target platform

Our work builds upon a fully parametric multiprocessor platform. Our group (see [19]) has implemented a multi-core ARM device, the interconnect of which is currently an AMBA bus. The number of processors is fully customizable. Every processor has a private memory, and an additional shared memory is available for interprocessor communication. The platform is realized in SystemC, and it is fully accurate at the signal and timing level. The ARM cores themselves are actually simulated via freely available C++ routines, but are transparently fitted into the SystemC infrastructure by means of specific wrappers.

3.4 Design flow

The SPM partitioner requires several inputs. Some of them are application-independent parameters, and thus can be configured independently: the most obvious are the size of the target SPM and a quantitative estimate of the hardware overhead associated with any further partitioning of the scratchpad. If necessary, the latter can be accurately extrapolated with back-annotation and, according to the designer's priorities, can be adjusted to more accurately reflect area, delay or power overhead, or an average of them all; see [17] and related work. In addition to such designer-provided inputs, the partitioner has been linked to the underlying development platform in two ways (see Fig. 3). First, some modifications have been made to the simulation engine in order to provide, as an output after a simulation cycle, the execution traces of the target application.

Figure 3: Schematic simulation flow

If the source code of the application is available, the optional use of markers allows trimming the traces to span only the critical routines or portions of code instead of the entire application running time. These traces can be collected in parallel for every processor in a multiprocessor system. Eventually, the traces are passed to the SPM partitioner for analysis. Once the best scratchpad partitioning is found, the second link to the design platform comes into play. The algorithm's output is fed back to the simulation engine, and a second simulation run is launched. Specific detection routines check whether each memory access is intercepted by the SPM partitioning just found, thus providing complete reporting on the efficiency of the implementation. It also becomes possible to compare the results of an SPM against those of a cache memory. All of the described steps can be fully scripted and automated, providing a powerful environment for parameter evaluation and, if useful, bracketing.
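A detection routine of the kind just described can be sketched as follows; this is our own minimal illustration (names, ranges and the trace are hypothetical, not the platform's actual API), assuming the partitioning is given as a sorted list of disjoint address ranges.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Range { uint32_t base, limit; };  // disjoint, sorted by base

    // True if `addr` falls inside one of the mapped SPM ranges.
    bool intercepted(const std::vector<Range>& map, uint32_t addr) {
        // Find the first range starting beyond addr, then check its predecessor.
        auto it = std::upper_bound(map.begin(), map.end(), addr,
            [](uint32_t a, const Range& r) { return a < r.base; });
        return it != map.begin() && addr <= (it - 1)->limit;
    }

    int main() {
        std::vector<Range> map = {{0x1000, 0x13FF}, {0x8000, 0x87FF}};
        std::vector<uint32_t> trace = {0x1004, 0x4000, 0x8010, 0x8800};
        int hits = 0;
        for (uint32_t a : trace) hits += intercepted(map, a);
        std::printf("intercepted %d of %zu accesses\n", hits, trace.size());
        return 0;
    }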

4. PARTITIONING ALGORITHM

As outlined in previous sections, solutions to the problem of mapping locations onto an SPM have already been proposed. However, they either postulate invariant hardware (reducing the potential for optimization), work with coarse-grained chunks of the memory space, or impose arbitrary constraints and have exponential running time in the worst case. For example, the approach discussed in [17] works by splitting the available SPM space into a predefined number of partitions, and then assigns optimal ranges to each partition; this means that multiple explorations have to be done to find solutions with a variable number of partitions, and there is no way to be certain that the resulting set of solutions includes the globally optimal one.

4.1 Dynamic Programming

A Dynamic Programming (DP) [21] approach, achieving optimal results with only polynomial complexity, is now presented. A thorough discussion of dynamic programming is beyond the scope of this paper; we will just say that DP is a technique aimed at solving optimization problems by breaking them into steps, and then recursively solving them step by step.

At first glance, the SPM partitioning problem (SPP) may resemble a case of the Knapsack Problem (KP) mentioned above. Much research has been devoted to the KP, and effective solutions to it, including algorithms based upon Dynamic Programming, have been found. However, the SPP and the KP show fundamental intrinsic differences:

• Partitioning a memory into banks has a physical overhead, while assigning memory blocks to a contiguous memory does not. This means that SPP involves a penalty every time non-contiguous ranges are mapped to the scratchpad, while the KP assignment does not. Penalties in the SPP can be avoided only by forcing a single partition, which translates into the inability to map non-contiguous address ranges.

• SPP is much more flexible, as it allows an arbitrary granularity of the locations to be mapped onto the scratchpad; KP needs a predefined set of candidate blocks as its input, and these blocks have fixed boundaries.

• SPP and KP result in different power usage schemes; SPP gains power by adding the ability to selectively deactivate memory banks, but pays an overhead power cost due to extra decoder circuitry and wiring.

These facts also explain the complexity differences, which will be discussed below. One fundamental premise holds true in both problems. Every time a decision has to be made about whether a certain address range should belong to the optimal SPM partitioning, two factors get evaluated: the profit it would bring (related to the total number of accesses it is subject to during application execution), and its weight (expressed in terms of SPM allocation, and due to both its length and the fixed overhead for adding a new partition).

The first step we took, before actually getting to the SPP itself, was some preprocessing of the execution trace. Starting from a dump of every memory location accessed during the execution of benchmark software, we first rounded addresses so as to align them to proper word boundaries when necessary (e.g. single-byte reads at odd addresses were treated like word reads at word-aligned addresses, and so on). We then merged all of the entries pertaining to identical memory locations, keeping a counter of such merges; the resulting trace file had a single line per accessed memory location, composed of an address and of the number of accesses it received. The trace was then sorted by address in increasing order. It is easy to see that this preprocessing has negligible impact upon overall execution time when compared to the algorithm solving the SPP itself. Additionally, it is important to notice that, by merging entries related to the same addresses, the sorted trace, which is the main input of the SPP, gets significantly reduced in size; the number of its entries equals the memory footprint of the benchmark application, expressed in words.

Formally, the SPP input consists of the number N of addresses in the sorted execution trace, along with the actual address α_i and the profit π_i associated with the i-th sorted trace entry, the size C of the SPM, and the penalty τ for mapping non-contiguous ranges in the SPM (C and τ are both expressed in 32-bit words). For each pair [i, j], 1 ≤ i ≤ j ≤ N, let p_ij := Σ_{h=i}^{j} π_h and w_ij := (α_j − α_i + 1) + τ be the profit and the weight (space cost) associated with the range of addresses from the i-th to the j-th. The objective of the problem is to select a set of mutually disjoint address ranges whose overall profit is maximized and whose overall weight does not exceed C.
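The preprocessing described above (word alignment, merging of per-address entries, sorting) is straightforward; a minimal sketch, with hypothetical names and byte addresses aligned to 32-bit words, could look as follows.

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    // Sorted trace entry: a word-aligned address and its access count (profit).
    struct Entry { uint32_t addr; long count; };

    // Collapse a raw access dump into the sorted, merged trace the SPP uses.
    std::vector<Entry> preprocess(const std::vector<uint32_t>& raw) {
        std::map<uint32_t, long> merged;           // keeps keys sorted
        for (uint32_t a : raw)
            ++merged[a & ~0x3u];                   // align to 32-bit word boundary
        std::vector<Entry> trace;
        for (auto& kv : merged)
            trace.push_back({kv.first, kv.second});
        return trace;                              // one entry per accessed word
    }

    int main() {
        std::vector<uint32_t> raw = {0x1001, 0x1000, 0x1003, 0x2000, 0x2000};
        for (const Entry& e : preprocess(raw))
            std::printf("0x%08X %ld\n", e.addr, e.count);  // 0x1000:3, 0x2000:2
        return 0;
    }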

Figure 4: Principal data structure of the algorithm

The method we propose would be capable of handling arbitrary values of p_ij and w_ij, without requiring the special structure these values have in our application.

We next illustrate a polynomial-time algorithm to solve SPP, whose running time complexity is Θ(N · C²). This is not too far from the Θ(M · C) complexity required to solve KP by dynamic programming, with M ≤ N being the number of blocks and C again the SPM size (see [20])¹. At this point we would like to stress that the main advantage of our approach, when compared to the KP, is not that we solve a polynomial problem (SPP) instead of an NP-complete one (KP), since the time to solve the latter is smaller, but that we solve a much more general problem, as discussed above, in a comparable running time. On the other hand, it is also worth mentioning that previously documented solutions to the SPP itself had exponential complexity, so this approach may definitely be considered an improvement on them.

Our algorithm can be described in terms of a matrix-shaped data structure with C rows and N columns, see Fig. 4. Every cell of the matrix contains two pieces of information about a possible partitioning of the SPM: its overall profit (number of intercepted external memory accesses) and its list of ranges. The cell in row k and column j contains data about the best partitioning found so far whose last address is less than or equal to the j-th address of the trace and whose overall size amounts to k 32-bit words in the SPM. In this way, the weight of the partitioning is also implicitly stored, and equals the row index k. The recursive definition of the profit values P[k][j] for this matrix is given by the DP equation:

    P[k][j] := max{ P[k][j − 1], max_{i: i ≤ j, k ≥ w_ij} (P[k − w_ij][i] + p_ij) }    (1)

for j = 1, …, N and k = 1, …, C, whose initialization is P[k][0] := 0 for k = 1, …, C. The definition of the corresponding optimal ranges is straightforward, although extremely space consuming in a naïve implementation, as discussed later. A corresponding naïve forward DP implementation is:

    for i := 1 to N do
        for j := i to N do
            if w_ij > C then break
            for k := 1 to C − w_ij do
                if P[k][i] + p_ij > P[k + w_ij][j] then
                    P[k + w_ij][j] := P[k][i] + p_ij
        copyprofits(i + 1)

¹Please note, in particular, that the running time is polynomial in the input size in the SPP case, since C ≤ N, i.e. the size of the SPM is expected to be no larger than the number of addresses. On the other hand, for the KP case the number of blocks may be smaller than the SPM size, and therefore the Θ(M · C) time complexity turns out to be exponential in the input size, which is roughly Θ(M · log C) (see again [20] for details).

The algorithm analyzes the target application's execution trace, taking into account every possible address range as a candidate partition for the desired optimal SPM partitioning. The outer loops (i and j) both sweep all of the addresses in the execution trace; the i-th address acts as the starting one for the current candidate range, while the j-th acts as the finishing one. Observing the behaviour of the outer loops, when the i-th start address is being considered, all possible address ranges ending at or before the i-th address have certainly already been swept. If the range under examination gets too long to fit into the SPM (overhead taken into account via the w_ij values), the algorithm shifts to a new starting address. Otherwise, the inner k loop is performed.

The P[][] matrix acts as storage for all of the processing results found so far; some of its contents will be overwritten during subsequent processing, some will be suboptimal, some will be predecessors of the optimal solution, and some (possibly just one cell) will represent the optimal partitioning of the SPM. Columns of the matrix are swept according to the i and j loops, while its rows are addressed via the inner k loop. The matrix contents get updated as loops i and j progress, and, when the i-th start address is being processed, the matrix already contains complete information about partitionings whose last address is at most the i-th. In other words, the columns up to the i-th already contain final computation results and will not be further modified. Columns from the (i + 1)-th to the last, on the other hand, contain temporary data, i.e. speculative partitionings spanning beyond the i-th address which are currently potentially optimal but which could be refuted by further analysis as the outer loops progress.

The [i, j] range is tentatively appended to all of the partitionings found to be best up to the i-th iteration. The resulting profits are compared to those of the best partitionings already found having the j-th location as their last and requiring the same SPM size; if the former prove greater, the latter get overwritten. Although not shown here for simplicity, references to the actual ranges composing the partitioning are updated too.

The call to the function copyprofits() is required by the way the algorithm proceeds. New potentially optimal partitionings are stored in a column of the matrix according to their last address, but when data are retrieved from the structure, the algorithm expects to find the optimal partitionings whose last address is less than or equal to the column index, not just equal to it. For this reason, it is necessary to periodically forward data in the matrix, so that the i-th column always contains the best data extracted from all of the first i columns. At the end of the execution, the last column contains the optimal partitionings of the SPM for every SPM size up to C.

It is easy to see that the algorithm has a time and space complexity of Θ(N · C²), that is, polynomial in the input size. This is a definite improvement in comparison to previously documented exponential approaches ([17]). Additionally, no artificial bound is imposed upon the maximum number of partitions allowed in a partitioning.
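To make the procedure concrete, here is a compact, self-contained sketch of the forward DP, computing optimal profits only (the list of ranges, kept via pointers in the real implementation, is omitted). All names are ours; as one implementation choice, the base column is indexed at i − 1 so that an appended range is strictly disjoint from the partitioning it extends, and the column-forwarding step plays the role of copyprofits().

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // addr/prof: sorted trace (addresses in word units, access counts).
    // C: SPM size in words; tau: per-partition overhead in words.
    long solve_spp(const std::vector<uint32_t>& addr,
                   const std::vector<long>& prof, int C, int tau) {
        const int N = static_cast<int>(addr.size());
        // P[k][j]: best profit of disjoint ranges with total weight k
        // (mapped words plus tau per range) ending at or before the j-th
        // trace address.  Column 0 is a sentinel (empty partitioning).
        std::vector<std::vector<long>> P(C + 1, std::vector<long>(N + 1, 0));
        std::vector<long> pre(N + 1, 0);          // prefix sums for p_ij
        for (int h = 0; h < N; ++h) pre[h + 1] = pre[h] + prof[h];

        for (int i = 1; i <= N; ++i) {
            for (int j = i; j <= N; ++j) {
                long w = (long)(addr[j - 1] - addr[i - 1] + 1) + tau;  // w_ij
                if (w > C) break;              // range too long: next start
                long p = pre[j] - pre[i - 1];  // p_ij
                for (int k = 0; k + w <= C; ++k)
                    // Append range [i..j] to the best partitioning of
                    // weight k ending strictly before address i.
                    P[k + w][j] = std::max(P[k + w][j], P[k][i - 1] + p);
            }
            // copyprofits: forward column i - 1 into column i, so column i
            // holds the best over all partitionings ending at or before i.
            for (int k = 0; k <= C; ++k)
                P[k][i] = std::max(P[k][i], P[k][i - 1]);
        }
        long best = 0;
        for (int k = 0; k <= C; ++k) best = std::max(best, P[k][N]);
        return best;
    }

    int main() {
        // Toy trace: two hot clusters separated by a gap (word addresses).
        std::vector<uint32_t> addr = {100, 101, 102, 500, 501};
        std::vector<long>     prof = {50, 60, 40, 80, 70};
        std::printf("best profit: %ld\n", solve_spp(addr, prof, 8, 2));
        return 0;  // prints 260: ranges [100..101] and [500..501]
    }

Note that the break on w_ij > C is what bounds the j loop at Θ(C) iterations per start address, yielding the Θ(N · C²) running time discussed above.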

4.2 Algorithm implementation

The straightforward implementation of the above-described algorithm imposes a heavy load upon system resources. Assuming applications (or rather, critical sections of applications) with memory footprints in the range of hundreds of kilobytes, e.g. 256 kB, and an SPM of 16 kB, the matrix data structure has 64 k columns and 4 k rows, totaling 256 M cells. As said, every cell must store information about the profit of a partitioning (at least 32 bits), plus some way of tracing the actual ranges composing the partitioning itself. Even by using pointers, it is impossible to take less than 16 bytes of memory per cell; this results in a grand total of 4 GB for the matrix. This is clearly unacceptable, as processing on a common workstation would prove impossible.

To solve this problem, the matrix has to be discarded and replaced with a dynamic structure. We have implemented a mechanism based upon an array of lists, where memory gets allocated only when required and released as soon as it is no longer useful. Pointers have been heavily exploited to reduce memory content duplication as much as possible, and a configurable buffering mechanism has been implemented to further cut down on memory usage. Of course, list management and dynamic memory allocation (i.e. malloc() and free() calls) impose a heavy load upon processing time, so some kind of tradeoff has to be accepted. As the results in Section 5 will show, we have managed to significantly cut memory requirements without slowing down execution to an unacceptable level. Additionally, we provide a function which, on a best-effort basis, tries to keep memory usage below a configurable threshold by releasing allocated data that is no longer needed, so as to avoid disk swapping. Provided the threshold is not unreasonably low, the memory footprint of the algorithm can be kept within the boundaries of the workstation's physical RAM, granting a good compromise on performance. As a side effect of this cleaning process, the amount of data to be processed in subsequent iterations is reduced, often balancing or reversing the performance penalty due to the garbage collection itself.
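The actual implementation uses an array of lists with buffering and garbage collection; the following much-simplified sketch illustrates only the core idea, namely replacing each dense C-entry column with a sparse structure that stores a cell only when it is actually reached, and prunes entries dominated by a lighter partitioning with at least the same profit. All names are hypothetical.

    #include <cstdio>
    #include <iterator>
    #include <map>

    // Sparse DP column: stores (weight k -> profit) pairs actually reached,
    // instead of a dense C-sized array.  A Pareto prune keeps an entry only
    // if no lighter partitioning achieves at least the same profit.
    using Column = std::map<int, long>;  // key: weight k, value: best profit

    // Insert a candidate (k, p), discarding it (or older entries) if dominated.
    void update(Column& col, int k, long p) {
        // Dominated by an entry with weight <= k and profit >= p?
        auto it = col.upper_bound(k);
        if (it != col.begin() && std::prev(it)->second >= p) return;
        // Remove entries with weight >= k that are now dominated.
        auto jt = col.lower_bound(k);
        while (jt != col.end() && jt->second <= p) jt = col.erase(jt);
        col[k] = p;
    }

    int main() {
        Column col;
        update(col, 5, 100);
        update(col, 7, 90);    // dominated: heavier and less profitable
        update(col, 4, 120);   // dominates the (5, 100) entry
        for (auto& kv : col) std::printf("k=%d p=%ld\n", kv.first, kv.second);
        return 0;              // prints only k=4 p=120
    }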

5. EXPERIMENTAL RESULTS

Many tests were run to assess the functionality and performance of our algorithm. These included:

• Functionality tests.
• Partitioning performance tests.

We chose to base our analysis upon benchmark traces taken from real-world data processing algorithms (transformations, filters, etc.); most of them are derived from [22]. The majority of our tests focused upon three traces: Dhrystone (3423 locations), DFT (3766 locations) and FilterBank (13053 locations). These numbers translate respectively into memory footprints of about 13 kB, 15 kB and 51 kB, spanned over non-contiguous ranges.

5.1 Functionality tests

We tested the algorithm with varying values of the OVERHEAD² parameter. The results (see Fig. 5) show, as expected, an inverse relation between the number of resulting partitions and the OVERHEAD value: the greater the partitioning overhead, the fewer the partitions. Moreover, we found no upper limit imposed on the number of partitions generated.

Figure 5: Partition number versus partitioning overhead, 4 kB SPM

Additionally, we studied the effectiveness of our mapping strategy by measuring the number of intercepted external memory accesses versus varying target SPM sizes; the results are shown in Fig. 6. Our algorithm proved capable of delivering high rates of access hits already with very small scratchpad memories. With a partitioning overhead set to 240 bytes, we observed that the Dhrystone benchmark showed 46% of intercepted accesses already with a 512 B SPM, and 76% with a 1 kB SPM; DFT, 80% and 82% respectively; FilterBank, 58% and 83%. These percentages gradually increased with bigger SPMs. Lower overhead values led to even better hit rates.

Figure 6: Intercepted accesses with varying SPM sizes, 240 B overhead

²Area, power and latency overhead are inherent to partitioning. The algorithm takes them into account as an area-only equivalent overhead, expressed in terms of the number of bytes of SPM which would take up as much space as the added wiring and decoding complexity for a new partition.

5.2 Partitioning performance tests

With a variety of benchmark traces and target SPM sizes, we recorded the maximum memory footprint of our partitioner. It was then easy to compare the results of such test runs with the memory requirements of a "naïve", matrix-based implementation. Execution times³ were also recorded; see Fig. 7 and Fig. 8. As expected, both charts show a heavy dependency upon SPM size and a more moderate dependency upon execution trace length. Deviation from theoretical trends is explained by trace characteristics (contiguity of accessed addresses) and implementation choices (dynamic memory management). Compared to a matrix implementation, our dynamic memory management showed excellent memory savings, ranging from 60% to an allocation reduction of two orders of magnitude.

³Speed benchmarks were taken on a Pentium® 4 1.6 GHz notebook with 512 MB of RAM.

Figure 7: Algorithm and matrix implementation memory usage, 240 B overhead

Figure 8: Algorithm execution time, 240 B overhead

Figure 9: Algorithm memory usage, with varying algorithm parameters

Figure 10: Algorithm execution time, with varying algorithm parameters

We investigated the tuning of the command-line parameters of our software, which offer wide potential for control of memory usage and running time⁴. The charts in Fig. 9 and Fig. 10 report memory usage and computation time for the Dhrystone and DFT benchmark traces, with different sets of execution parameters. The settings discussed here attempt to prune some candidate solutions earlier than the basic algorithm would dictate, thus avoiding the computation of the huge number of additional candidate solutions which would derive from them. This is achieved by comparing new potential partitionings against a greater range of already-found solutions, including some which have not yet been confirmed as optimal themselves. Doing so increases the chances of detecting useless computation branches as soon as possible, with savings in execution time and memory; however, it also imposes a comparison overhead. As a consequence, the aggressiveness of this policy must be tuned. In our empirical tests over two benchmarks, Setup #4 proved the most efficient overall, and for this reason was chosen as the default for our algorithm; all other benchmarks in this paper are based on it.

⁴Speed benchmarks were taken on a Pentium® III 1 GHz dual-processor server with 512 MB of RAM.

Additional tests were also done to assess the utility of our memory cleaning routine, which deallocates as much memory as possible as soon as it detects that a predefined threshold has been exceeded. This routine sweeps the already-processed portion of the algorithm's memory structure, searching for solutions, and thus memory allocations, which have become suboptimal because of the processing done in the meantime. Being quite an intensive task, it is not recommended to run this feature during every iteration, but it gives very good results if used when truly necessary, i.e. when physical memory is running low. In this case, as a side effect, it may even slightly improve execution times by pruning some solutions which would act as the base for subsequent processing. While it is not possible to keep memory usage arbitrarily low, the routine allows huge additional memory savings. As an example, the Dhrystone trace could be processed using as little as 54 MB of memory, as opposed to about 131 MB with already optimized settings; execution time actually went down too, by about 10%. The DFT benchmark also showed noticeable, albeit less dramatic, improvements.

Fig. 11 shows estimated energy utilization in different implementations of scratchpad memories. The comparison covers our partitioned SPM and a hypothetical unpartitioned SPM of the same 8 kB size, to which the best possible single range of addresses is mapped. Figures for access energy are taken from [10], additionally assuming 10 nJ as the energy of an external RAM access (which translates into a very typical power consumption of 1 W at 100 MHz).

Figure 11: Energy estimation for partitioned and non-partitioned SPMs, 240 B overhead

In the case of the partitioned memory, two different estimations are provided: the first interpolates the energy values discussed in [10] to exactly match the partition sizes generated by our partitioner, while the second assumes pessimistic values, corresponding to power-of-two memory cuts immediately above the size of the partitions. Even despite a partitioning overhead assumed to be around 3% per partition (equivalent to 240 bytes of memory space), which reduces the overall available space of our SPM somewhat below 8 kB, our partitioning strategy proves very successful in reducing energy requirements, with improvements ranging from 39% to 84%. This is due both to an optimization in scratchpad energy (a smaller chunk of memory is accessed during every cycle instead of the whole SPM) and to a reduction of the number of external RAM accesses, made possible by the ability to map multiple address ranges inside the SPM.
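A first-order version of this energy comparison can be reproduced in a few lines. The sketch below uses the 10 nJ external-access figure quoted above, but the per-bank SPM energies and access counts are invented placeholders, not the values interpolated from [10].

    #include <cstdio>
    #include <vector>

    // First-order energy model in the spirit of the comparison above.
    // Per-access energies are illustrative placeholders, NOT the figures
    // from [10]; only the 10 nJ external-RAM cost is taken from the text.
    struct Bank { long hits; double nj_per_access; };

    double total_energy_nj(const std::vector<Bank>& banks, long ext_accesses) {
        const double kExternalNj = 10.0;  // external RAM access, 10 nJ
        double e = ext_accesses * kExternalNj;
        for (const Bank& b : banks)
            e += b.hits * b.nj_per_access;  // only the accessed bank is active
        return e;
    }

    int main() {
        // Hypothetical 8 kB SPM split into two banks vs. one monolithic bank.
        std::vector<Bank> part = {{300000, 0.3}, {150000, 0.4}};  // partitioned
        std::vector<Bank> mono = {{380000, 0.8}};                 // single range
        std::printf("partitioned: %.0f uJ\n", total_energy_nj(part,  50000) / 1e3);
        std::printf("monolithic:  %.0f uJ\n", total_energy_nj(mono, 120000) / 1e3);
        return 0;
    }

Even in this toy instance the two effects named above are visible: the partitioned SPM wins both because each access activates a smaller, cheaper bank and because more accesses are intercepted, avoiding the 10 nJ external cost.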

6. CONCLUSIONS

We have presented a dynamic programming algorithm to solve the problem of optimally partitioning a scratchpad memory for use in an embedded design. This algorithm has been integrated in a complete design, simulation and synthesis flow. Much care has been put into ensuring the generality of the algorithm, which can handle an arbitrary number of partitions and a variable partitioning overhead, in contrast to previous approaches. Additionally, the complexity of the algorithm is polynomial, improving on previously documented exponential solutions. While a "naïve" implementation of such an algorithm would have required impractical memory resources, some optimization work, especially based upon runtime memory management, has shown that our solution is indeed viable.

An important tradeoff of the approach we are proposing is the fact that optimization is done with reference to a specific input to the target application. There is no guarantee that optimality will be preserved when the application is run with different inputs. Excellent results should however be expected with applications doing little I/O, or with applications which do I/O with repeating patterns: data content is not relevant, only data position in memory is. Additionally, code location in memory does not depend on input data. Our preliminary investigations show good results with different inputs, but more research has to be done in this respect, and robust methodologies must be developed. At the very least, it would certainly be possible, with trivial effort, to extend our approach so as to choose a partitioning based upon more than a single execution trace, and thus different inputs.

Additional future research work, in the same direction, could revolve around studying mechanisms to dynamically adjust the SPM address space mapping, to better follow the execution flow of complex applications and the variability of their inputs. It may even become possible to reuse the hardware optimally synthesized for one application so as to provide good performance in a totally different context. It must be remembered, however, that the focus of this research is on embedded designs, so some flexibility should usually be traded off for efficiency.

7. REFERENCES

[1] Raam, F.M.; Agarwal, R.; Malik, K.; Landman, H.A.; Tago, H.; Teruyama, T.; Sakamoto, T.; Yoshida, T.; Yoshioka, S.; Fujimoto, Y.; Kobayashi, T.; Hiroi, T.; Oka, M.; Ohba, A.; Suzuoki, M.; Yutaka, T.; Yamamoto, Y., "A High Bandwidth Superscalar Microprocessor for Multimedia Applications", Digest of Technical Papers of the 1999 IEEE International Solid-State Circuits Conference, pp. 258-259, 1999.

[2] Suzuoki, M.; Kutaragi, K.; Hiroi, T.; Magoshi, H.; Okamoto, S.; Oka, M.; Ohba, A.; Yamamoto, Y.; Furuhashi, M.; Tanaka, M.; Yutaka, T.; Okada, T.; Nagamatsu, M.; Urakawa, Y.; Funyu, M.; Kunimatsu, A.; Goto, H.; Hashimoto, K.; Ide, N.; Murakami, H.; Ohtaguro, Y.; Aono, A., "A Microprocessor with a 128-bit CPU, Ten Floating-Point MAC's, Four Floating-Point Dividers, and an MPEG-2 Decoder", IEEE Journal of Solid-State Circuits, Volume 34 Issue 11, Nov 1999, pp. 1608-1618, 1999.

[3] Koyama, T.; Inoue, K.; Hanaki, H.; Yasue, M.; Iwata, E., "A 250-MHz Single-Chip Multiprocessor for Audio and Video Signal Processing", IEEE Journal of Solid-State Circuits, Volume 36 Issue 11, Nov 2001, pp. 1768-1774, 2001.

[4] Kennedy, K.; Allen, J.R., "High-Performance Compilers", Elsevier Science and Technology Books, 2001.

[5] Panda, P.R.; Dutt, N.D.; Nicolau, A., "Efficient Utilization of Scratch-pad Memory in Embedded Processor Applications", Proceedings of the European Design and Test Conference, pp. 7-11, 1997.

[6] Panda, P.R.; Dutt, N.D.; Nicolau, A., "Local Memory Exploration and Optimization in Embedded Systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 18 Issue 1, Jan 1999, pp. 3-13, 1999.

[7] Panda, P.R.; Dutt, N.D.; Nicolau, A.; Catthoor, F.; Vandecappelle, A.; Brockmeyer, E.; Kulkarni, C.; De Greef, E., "Data Memory Organization and Optimizations in Application-Specific Systems", IEEE Design and Test of Computers, Volume 18 Issue 3, May 2001, pp. 56-68, 2001.

[8] Shiue, W.-T.; Chakrabarti, C., "Memory Exploration for Low Power, Embedded Systems", Proceedings of the 36th Design Automation Conference, pp. 140-145, 1999.

[9] Banakar, R.; Steinke, S.; Lee, B-S.; Balakrishnan, M.; Marwedel, P., "Scratchpad Memory: a Design Alternative for Cache On-Chip Memory in Embedded Systems", Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pp. 73-78, 2002.

[10] Steinke, S.; Wehmeyer, L.; Lee, B-S.; Marwedel, P., "Assigning Program and Data Objects to Scratchpad for Energy Reduction", Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pp. 409-415, 2002.

[11] Kim, S.; Vijaykrishnan, N.; Kandemir, M.; Sivasubramaniam, A.; Irwin, M.J.; Geethanjali, E., "Power-Aware Partitioned Cache Architectures", Proceedings of the International Symposium on Low Power Electronics and Design, pp. 64-67, 2001.

[12] Kandemir, M.; Ramanujam, J.; Irwin, M.J.; Vijaykrishnan, N.; Kadayif, I.; Parikh, A., "Dynamic Management of Scratch-Pad Memory Space", Proceedings of the Design Automation Conference, pp. 690-695, 2001.

[13] Kandemir, M.; Choudhary, A., "Compiler-Directed Scratch Pad Memory Hierarchy Design and Management", Proceedings of the 39th Design Automation Conference, pp. 628-633, 2002.

[14] Kandemir, M.; Kadayif, I.; Sezer, U., "Exploiting Scratch-Pad Memory Using Presburger Formulas", Proceedings of the 14th International Symposium on System Synthesis, pp. 7-12, 2001.

[15] Kandemir, M.; Ramanujam, J.; Choudhary, A., "Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems", Proceedings of the 39th Design Automation Conference, pp. 219-224, 2002.

[16] Benini, L.; Macii, A.; Macii, E.; Poncino, M., "Increasing Energy Efficiency of Embedded Systems by Application-Specific Memory Hierarchy Generation", IEEE Design and Test of Computers, Volume 17 Issue 2, Apr-Jun 2000, pp. 74-85, 2000.

[17] Benini, L.; Macchiarulo, L.; Macii, A.; Poncino, M., "Layout-Driven Memory Synthesis for Embedded Systems-on-Chip", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 10 Issue 2, Apr 2002, pp. 96-105, 2002.

[18] Benini, L.; Bertozzi, D.; Bruni, D.; Drago, N.; Fummi, F.; Poncino, M., "Legacy SystemC Co-Simulation of Multi-Processor Systems-on-Chip", Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 494-499, 2002.

[19] Bertozzi, D.; Poletti, F.; Benini, L., "Performance Analysis of Arbitration Policies for SoC Communication Architectures", to be published in Design Automation for Embedded Systems, Special Issue on Covalidation of Embedded Hardware/Software Systems, 2003.

[20] Martello, S.; Toth, P., "Knapsack Problems", John Wiley & Sons, Chichester, 1990.

[21] Bellman, R.E., "Dynamic Programming", Princeton University Press, Princeton, NJ, 1957.

[22] The Ptolemy Project, http://ptolemy.eecs.berkeley.edu