Hardware-Software Partitioning of Multifunction Systems

Abhijit Prasad, Wangqi Qiu, Rabi Mahapatra
Department of Computer Science, Texas A&M University, College Station, TX 77843-3112
Email: {abhijitp,wangqiq,rabi}@cs.tamu.edu

ABSTRACT The problem of hardware-software partitioning for systems designed as multifunction systems is addressed. Simulated annealing is known to generate very good solutions for most optimization problems; however, its running time can be very high. We apply a modified simulated annealing approach to the partitioning problem that results in a smaller search space yet still yields good partitions. Experimental results show that our approach produces better solutions than existing multifunction partitioning approaches within acceptable running time.

Keywords Hardware/software partitioning, Multifunction systems, Simulated annealing.

1. INTRODUCTION The problem of hardware-software partitioning is a version of the classical graph-partitioning problem, which is essentially an optimization problem. Given an application consisting of tasks, the partitioning task is to identify which tasks should be implemented in hardware and which in software, such that the specified constraints are met. Typical objectives are minimizing chip area or power; the timing constraint on the application must, of course, be met in all cases. The application is described as a directed acyclic graph G = (N, E), where the nodes N represent computations (tasks or processes chosen in a granularity-selection stage) and the arcs E represent data and control dependencies between the nodes. Each node also carries information that helps the algorithm decide the “cost” of a given partition, typically the execution times, power utilization, and area required for the hardware and software implementations of the node. There is also information, specified at the node level, about the communication cost and area across the hardware-software interface. The final solution to the problem simply states which nodes should be implemented in hardware and which in software. The problem described above is the simple (single-function) partitioning problem. However, if the goal is to design a system that performs multiple functions (e.g., a device that combines the functions of a PDA and a cell phone), the partitioning problem becomes more complex: the timing constraints of all applications must be met, in addition to minimizing chip area or energy consumption. Such a problem is known as the multifunction partitioning problem. Considerable research has gone into single-function hardware/software partitioning; the partitioning of multifunction systems, however, is in its infancy and needs further attention. This is the primary motivation of this paper.
All partitioning algorithms require a cost function, i.e., a way to estimate the cost of a partition so as to decide whether one partition is better than another. Different implementations use different cost functions. In [1] the primary goal of the partition is to minimize the hardware cost. The primary objective in [2] is to minimize the communication cost between the hardware and software partitions. In [3], the cost function combines the communication cost and the execution time. Low power consumption is the primary constraint in [4]. Throughput combined with hardware cost is the constraint in [5]. In the global criticality/local phase driven algorithm of [6] and [7], minimization of chip area is the primary objective while still meeting timing constraints. Thus a combination of constraints can be used in formulating the cost function. Typically, the cost function is minimized by keeping one constraint as a hard requirement while minimizing the other factors; a similar approach is followed in this paper. The simple partitioning problem has been proven to be NP-complete. The multifunction partitioning problem is at least as difficult, and thus no polynomial-time algorithm is known for it. Heuristics are therefore used for such problems, and we adopt one popular technique, simulated annealing. Simulated annealing is known to yield good results on most optimization problems, but its running time is high because the search space is very large. Other algorithms, such as the Global Criticality/Local Phase (GCLP) driven algorithm [6, 7], have low running times but their partition quality is not as

good as that obtained with simulated annealing. In this paper we present a simulated-annealing-based technique that offers high partition quality as well as affordable running time. The rest of this paper is organized as follows. Section 2 discusses the background of multifunction partitioning with brief reviews of previous efforts. Section 3 formally formulates the multifunction partitioning problem. Section 4 explains our new algorithm in detail. Section 5 presents experimental results and a comparison to existing methods. Section 6 concludes the paper.

2. BACKGROUND Simulated annealing has been used for partitioning problems in [3, 8, 9, 10]. It has been one of the most popular algorithms for optimization problems, as it is easy to implement, can make “uphill” moves in the solution space, and thus gives adequate results. However, tuning its parameters takes a considerable amount of effort, and a poorly tuned algorithm yields poor results. It also has a high running time, making it a popular choice only when processing time is not a constraint. The Kernighan-Lin (KL) partitioning algorithm is used in [2]. Its running time is considerably less than that of simulated annealing, but the quality of its solutions is not as good. The Fiduccia-Mattheyses (FM) min-cut algorithm, an extension of the KL algorithm, has been modified for hardware-software partitioning in [11]; its running time is an order of magnitude better than KL's. Both algorithms are greedy: they move only in the direction of a more optimal solution and thus tend to get caught in local minima. Algorithms that do not always move toward a more optimal solution can escape local minima and are called hill-climbing algorithms; simulated annealing is one such algorithm. Greedy algorithms are compared with hill-climbing algorithms in [1]. Other optimization techniques, such as integer programming and genetic algorithms, have also been applied to partitioning problems. An integer programming formulation of the partitioning problem is derived in [12]. Genetic algorithms have been applied to partitioning problems on distributed systems in [13]. In [6, 7] the Global Criticality/Local Phase (GCLP) driven algorithm is introduced; it uses both global and local criteria of each node to determine the node's best implementation.
This algorithm yields better results than ordinary greedy algorithms and is reasonably fast, with linear running time. We compare our solution to this algorithm and show that ours yields better results, albeit with a small penalty in running time.

3. PROBLEM FORMULATION Given a set of k applications AP = {A1, A2, …, Ak}, only one application is active at any given time. We consider each application to have a repetitive behavior with fixed timing constraints. One iteration through application Ai is specified by a Directed Acyclic Graph (DAG) G = (N, E), where the nodes N specify computations and the edges E specify data and control precedence between nodes. Each edge eij of the DAG carries the communication cost between nodes i and j when they are implemented in software and hardware respectively. For ease of programming, the communication cost between nodes i and j is assumed to be the same in both directions (Costi-j = Costj-i); however, the approach can easily be extended to allow different values for the two directions. Extraction of the DAG from the application specification is a separate problem and outside the scope of this work; for example, the Ptolemy co-design environment takes a description of the application in Synchronous Data Flow (SDF) form and generates a DAG. We do not consider granularity estimation or pre-clustering of basic blocks for the application; this is assumed to have been done before the graph is generated. Each node in the DAG represents some computation and therefore carries its execution times for both the hardware and the software implementation, as well as its area cost in each implementation. We assume that only one processor is available, hence a single value each for the execution time and the code size of each node. The hardware area is the sum of the areas of all nodes implemented in hardware. The estimation of these values is another step in the co-design process; they are available in the form of a library with pre-estimated parameter values for all the applications.
The constraints on the system are assumed to be execution time and area, i.e., the maximum execution times of the different applications and the maximum chip area are known a priori. The chip area is to be minimized while meeting the timing constraints of all the applications. The partitioning task consists of determining the mapping of nodes to either hardware or software. To make the description of the system clearer, an example is considered.
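The graph model above can be captured with a few simple data structures. The following Python sketch is illustrative only; the class and field names (Node, App, th/ts/ah/as_) are ours, not the paper's implementation:

```python
# Hypothetical data structures for the multifunction partitioning problem:
# per-node hardware/software costs plus per-application DAG information.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    th: float   # execution time if implemented in hardware
    ts: float   # execution time if implemented in software
    ah: float   # area if implemented in hardware
    as_: float  # code size if implemented in software ("as" is a keyword)

@dataclass
class App:
    name: str
    nodes: list       # names of the nodes this application uses
    edges: dict       # (i, j) -> (time_overhead, area_overhead), symmetric
    deadline: float   # timing constraint T_i

# A partition maps each node name to "HW" or "SW"; the hardware area is
# the sum of ah over all nodes mapped to hardware.
def hardware_area(nodes, partition):
    return sum(n.ah for n in nodes.values() if partition[n.name] == "HW")
```

The single-processor assumption is what lets each node carry one software time and one code size rather than a table per processor.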

Figure 1 shows a partitioning problem for an embedded system that can run 3 applications. Let there be 10 basic computational units (numbered C0 through C9); the three applications each use a subset of these units, as shown in Figure 1.

[Figure omitted: the DAGs of Applications 1, 2, and 3, built from the shared units C0-C9.]

Figure 1. An example of a 3-application system partitioning

Timing constraints for the three applications are T1, T2 and T3 respectively. The applications need to conform to these timing specifications, and the area of the system must be minimal. The values associated with each node of the graphs are:
1. th: execution time if implemented in hardware;
2. ts: execution time if implemented in software;
3. ah: area when implemented in hardware;
4. as: code size when implemented in software.
Each of these specifications for a computation unit (say, a floating-point multiplier) is the same across applications. For example, if the floating-point multiplier takes an area Ah when implemented in hardware for Application 1, it takes the same area in the other two applications when implemented in hardware. There is also a communication cost associated with each possible edge of the graph. In this paper “Ci ↔ Cj” denotes the communication cost between nodes i and j when they have different implementations, i.e., node i in software and j in hardware, or vice versa. This cost comprises a time overhead and an area overhead: the time overhead is caused by communication across the hardware-software interface, and the area overhead is the interconnect area of the interface. Let t(Ci ↔ Cj) denote the time overhead and s(Ci ↔ Cj) the area overhead. As an example, let the current configuration be as follows:
C0 – SW; C1 – HW; C2 – HW; C3 – SW; C4 – SW;
C5 – HW; C6 – SW; C7 – HW; C8 – SW; C9 – HW.
Consider Application 2. After scheduling, we know that the path “C6” and the path “C2-C7” can be executed in parallel, since they have no resource conflict. Suppose task C6 finishes later than C7; the critical path in Application 2 is then C0-C6-C4-C5. Let the timing constraint on this application be T2. The execution time for this application is: Timeapp2 = ts(C0) + ts(C6) + ts(C4) + t(C4 ↔ C5) + th(C5). Similarly, the times for the other two applications, Timeapp1 and Timeapp3, are calculated.
Let T1 and T3 be the timing constraints on applications 1 and 3. The timing constraints are: Timeappi ≤ Ti (i = 1, 2, 3).
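The execution-time formula above can be evaluated mechanically once a critical path is known. The sketch below assumes the critical path has already been extracted by scheduling; the function name and the cost-table layout are ours:

```python
# Evaluate the execution time of one application along its critical path:
# each node contributes its HW or SW time, and each edge that crosses the
# hardware/software boundary adds its communication-time overhead.
def path_time(path, times, partition, comm_time):
    """times[n] = (ts, th); comm_time[(i, j)] = t(Ci <-> Cj), symmetric."""
    total = 0.0
    for k, n in enumerate(path):
        ts, th = times[n]
        total += th if partition[n] == "HW" else ts
        if k + 1 < len(path):
            m = path[k + 1]
            if partition[n] != partition[m]:  # crosses the HW/SW interface
                total += comm_time.get((n, m), comm_time.get((m, n), 0.0))
    return total

# The Section 3 example: critical path C0-C6-C4-C5 with only C5 in HW.
# (The numeric values here are made up for illustration.)
times = {"C0": (4, 1), "C6": (5, 2), "C4": (3, 1), "C5": (6, 2)}
part = {"C0": "SW", "C6": "SW", "C4": "SW", "C5": "HW"}
comm = {("C4", "C5"): 1}
# Timeapp2 = ts(C0) + ts(C6) + ts(C4) + t(C4<->C5) + th(C5) = 4+5+3+1+2 = 15
```

A partition is feasible exactly when this value is at most Ti for every application.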

4. PARTITIONING ALGORITHM 4.1 Simulated Annealing We implement the partitioning using the simulated annealing algorithm, a popular algorithm for such optimization problems. The algorithm starts off at a certain “temperature”. New partitioning configurations are generated by a perturb function, and the cost of a partition is determined by an evaluate function (in this work, the total chip area). If the new configuration decreases the cost, the perturbation is accepted and the next perturbation is tried after incrementing the loop count. If the perturbation increases the cost, the configuration is accepted with probability e^(-ΔC/T), where ΔC is the change in cost and T is the temperature of the system: if a random number generated between 0 and 1 is less than this quantity, the new configuration is accepted. This procedure lets the system move consistently towards lower-cost states, yet still “jump” out of local minima through the probabilistic acceptance of some uphill moves.
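The acceptance rule just described is the standard Metropolis criterion; a minimal sketch (function name ours):

```python
# Metropolis acceptance: downhill moves are always accepted; an uphill
# move of size delta_cost is accepted with probability e^(-delta_cost/T).
import math
import random

def accept(delta_cost, temperature, rng=random.random):
    if delta_cost <= 0:
        return True
    return rng() < math.exp(-delta_cost / temperature)
```

At high temperature almost any move is accepted; as T falls, uphill moves become exponentially rarer, which is what drives the search toward a minimum.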

The number of iterations at a given temperature fixes the inner loop criterion. At the end of each complete run of the inner loop, the temperature of the system is reduced. The stopping criterion is typically when one complete run of the inner loop does not yield a better solution. At this point the system is said to be frozen and the algorithm ends. The cost function is simply defined as the area of the chip for the given partition.
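The loop structure above (fixed inner-loop length per temperature, cooling between inner loops, stop when frozen) can be sketched as follows. The perturb, evaluate, and accept functions are assumed to be supplied by the caller, and the cooling rate and temperatures are illustrative defaults, not the paper's tuned values:

```python
# Annealing schedule sketch: run inner_iters perturbations per temperature,
# cool geometrically, and stop ("freeze") when a full inner loop produces
# no new best solution.
def anneal(initial, perturb, evaluate, accept,
           t0=100.0, cooling=0.9, inner_iters=200):
    config, cost = initial, evaluate(initial)
    best, best_cost = config, cost
    temp = t0
    while True:
        improved = False
        for _ in range(inner_iters):
            cand = perturb(config)
            c = evaluate(cand)
            if accept(c - cost, temp):
                config, cost = cand, c
                if c < best_cost:
                    best, best_cost = cand, c
                    improved = True
        if not improved:        # system is frozen
            return best, best_cost
        temp *= cooling         # reduce temperature for the next inner loop
```

With a purely greedy accept function (reject all uphill moves) this degenerates into hill descent, which is a handy way to sanity-check the loop.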

4.2 Perturb Function We define our perturbation based on a bias value. The bias value is a number between 0 and 1 that reflects the probability of a task node being implemented in hardware; a task with a low bias value is more likely to be implemented in software. In a traditional simulated annealing algorithm, a new configuration is generated by randomly moving a task node from its current implementation to the opposite one, i.e., from software to hardware or vice versa. This completely random process, however, makes the search space very large. In our approach, we limit the search space by controlling the random process with the bias value. For example, a task node with a high bias value is biased toward hardware implementation and has a low probability of moving into software if it has already been mapped to hardware; conversely, if a task node with a high bias value is currently mapped to software, it is more likely to be selected in the perturbation. The bias value is calculated as a weighted sum of three parameters: Commonality, Performance-Area Ratio, and Urgency.
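One way to realize this bias-controlled perturbation is weighted random selection: nodes whose bias disagrees with their current mapping get high selection weight, well-placed nodes are rarely disturbed. The weighting scheme below is our illustrative reading, not the paper's exact mechanism:

```python
# Bias-controlled perturbation sketch: a node in SW with high bias (or in
# HW with low bias) is a preferred candidate for flipping.
import random

def biased_perturb(partition, bias, rng=random):
    """partition: node -> 'HW'/'SW'; bias: node -> value in [0, 1]."""
    nodes = list(partition)
    weights = [bias[n] if partition[n] == "SW" else 1.0 - bias[n]
               for n in nodes]
    chosen = rng.choices(nodes, weights=weights, k=1)[0]
    new = dict(partition)
    new[chosen] = "HW" if new[chosen] == "SW" else "SW"
    return new
```

Compared with uniform node selection, this concentrates the search on moves that the static analysis already suggests are promising, which is exactly how the search space shrinks.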

4.2.1 Commonality (Com) Some task nodes are common across different applications and can be implemented on the same resource. Identifying common “task groups” would be more useful, but it is extremely hard. We implement the idea in a simple way, defining commonality as the number of times a particular node appears across all the applications. Intuitively, common tasks are more likely to be implemented in hardware. We give each task a commonality value Com, which is used in computing its hardware-software bias value.

4.2.2 Performance-Area Ratio (PAR) This is defined as the ratio Δt/(ah − as), where Δt is the difference between the task's software and hardware execution times. If the additional area for hardware implementation is small and the performance gain is high, the task has a high performance-area ratio and is thus biased toward hardware implementation.

4.2.3 Urgency (Ur) This is the sum of the lengths of all paths through a task across all the applications. If the task node appears on the critical path of many applications, its urgency value is high, and the task is more likely to be implemented in hardware. A weighted bias value is defined as w1*Com + w2*PAR + w3*Ur, where w1, w2, and w3 are the weights. We call this the static bias value. For two consecutive tasks, the communication costs (area, time) are higher if one is implemented in software and the other in hardware; we therefore try to avoid such splits. The dynamic bias value of a task is defined as the average of its neighbor tasks' current implementations (software = 0 and hardware = 1). For example, if all the neighbors of a task are implemented in software, the task is unlikely to be implemented in hardware. The bias value we use in our experiments is a weighted sum of the static and dynamic bias values.
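Putting the pieces together, the combined bias can be sketched as below. Com, PAR, and Ur are assumed to be pre-normalized to [0, 1]; the static/dynamic split weights ws and wd are hypothetical, since the paper does not state them (the w1/w2/w3 defaults are the values used later in Section 5.2):

```python
# Combined bias-value sketch: static term (weighted Com/PAR/Ur) plus a
# dynamic term that follows the neighbors' current implementations.
def static_bias(com, par, ur, w1, w2, w3):
    return w1 * com + w2 * par + w3 * ur

def dynamic_bias(node, neighbors, partition):
    # Average of the neighbors' current implementations (SW = 0, HW = 1).
    impl = [1.0 if partition[m] == "HW" else 0.0 for m in neighbors[node]]
    return sum(impl) / len(impl) if impl else 0.0

def bias(node, com, par, ur, neighbors, partition,
         w1=0.2, w2=0.8, w3=0.0, ws=0.7, wd=0.3):
    return (ws * static_bias(com, par, ur, w1, w2, w3)
            + wd * dynamic_bias(node, neighbors, partition))
```

The dynamic term pulls a node toward whatever side its neighbors currently occupy, discouraging HW/SW splits between communicating tasks.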

4.3 Scheduling and Validation Once the implementation of each task node is decided, we schedule each application. Since this problem is NP-complete, we use a simple scheduling algorithm. The purpose of scheduling is to check whether the timing constraints are met in all the applications. If not, the partitioning solution is invalid and must be repaired. The validation procedure repeatedly moves the software task node with the maximum performance-area ratio into hardware until all the applications' timing constraints are met.
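The validation loop can be sketched as follows, assuming helper functions app_time (the per-application execution time from scheduling) and par (a node's performance-area ratio); both names and the termination handling are ours:

```python
# Validation sketch: while any application misses its deadline, move the
# SW node with the highest performance-area ratio into hardware.
def validate(partition, apps, app_time, par):
    """Returns a partition meeting all deadlines, or None if even the
    all-hardware mapping cannot meet them."""
    part = dict(partition)
    while any(app_time(a, part) > a.deadline for a in apps):
        sw_nodes = [n for n, impl in part.items() if impl == "SW"]
        if not sw_nodes:
            return None
        part[max(sw_nodes, key=par)] = "HW"
    return part
```

Each iteration buys the largest time reduction per unit of extra area, so the repaired partition tends to stay close in area to the original.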

4.4 User Controllable Parameters The inner-loop count is the number of iterations made at a particular temperature. This parameter is user-controllable and can be used to control the running time of the algorithm. The algorithm stops when there is no decrease in chip area for n consecutive iterations; the parameter n is also user-controllable and likewise controls the running time.

5. EXPERIMENTS We generated 2 task libraries, containing 100 and 1000 task nodes respectively. For each experiment, we assume 3 applications constructed from the task nodes of one library, and we generated random DAGs to represent the applications. For each application, let t1 be the completion time if every node is implemented in hardware and t2 the completion time if every node is in software. We set the application's timing deadline to (t1+t2)/2, which ensures that after partitioning some nodes are implemented in hardware and some in software. The experimental program is implemented in Microsoft Visual C++. We also implemented the GCLP algorithm for the multifunction partitioning problem on the same task sets and compare its results with our algorithm's.

5.1 Weight Assignment We use a simple greedy algorithm to decide the weights for the three factors (Commonality, Performance-Area Ratio, and Urgency) of the static bias value: initially all nodes are placed in hardware, then the nodes with the least bias values are moved to software one by one until the timing constraint of at least one application is violated. We tried different weights for Commonality, Performance-Area Ratio, and Urgency in the bias value and ran the greedy algorithm for each combination. The results are listed in Tables 1 and 2: Table 1 shows the 100-node case and Table 2 the 1000-node case.
Table 1. Different weight assignment for the 100-node case

w1    w2    w3    Area
0.0   0.0   1.0   6,206
0.0   0.2   0.8   5,943
0.0   0.4   0.6   5,724
0.0   0.6   0.4   5,810
0.0   0.8   0.2   5,258
0.0   1.0   0.0   6,236
0.2   0.0   0.8   5,704
0.2   0.2   0.6   5,232
0.2   0.4   0.4   5,059
0.2   0.6   0.2   4,991
0.2   0.8   0.0   3,901
0.4   0.0   0.6   5,183
0.4   0.2   0.4   5,193
0.4   0.4   0.2   4,676
0.4   0.6   0.0   4,532
0.6   0.0   0.4   5,148
0.6   0.2   0.2   5,247
0.6   0.4   0.0   4,532
0.8   0.0   0.2   5,449
0.8   0.2   0.0   4,560
1.0   0.0   0.0   5,218

Table 2. Different weight assignment for the 1000-node case

w1    w2    w3    Area
0.0   0.0   1.0   64,598
0.0   0.2   0.8   64,598
0.0   0.4   0.6   64,196
0.0   0.6   0.4   63,240
0.0   0.8   0.2   62,884
0.0   1.0   0.0   31,361
0.2   0.0   0.8   62,712
0.2   0.2   0.6   59,083
0.2   0.4   0.4   55,442
0.2   0.6   0.2   47,957
0.2   0.8   0.0   34,266
0.4   0.0   0.6   55,982
0.4   0.2   0.4   51,800
0.4   0.4   0.2   42,050
0.4   0.6   0.0   34,574
0.6   0.0   0.4   45,794
0.6   0.2   0.2   42,224
0.6   0.4   0.0   35,756
0.8   0.0   0.2   43,064
0.8   0.2   0.0   35,752
1.0   0.0   0.0   42,269
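The greedy weight-selection procedure of Section 5.1 can be sketched as below. The function name and the shape of the area/feasibility callbacks are ours; the logic follows the description above (all-hardware start, move least-biased nodes to software until a deadline breaks):

```python
# Greedy weight-evaluation sketch: for one (w1, w2, w3) candidate, start
# all-hardware and move the least-biased nodes to software one by one;
# stop (and undo) when some application's deadline is violated, then
# report the resulting chip area for that weight triple.
def greedy_area(nodes, bias, area, feasible):
    """bias: node -> static bias value; area(part) -> chip area;
    feasible(part) -> True iff every application meets its deadline."""
    part = {n: "HW" for n in nodes}
    for n in sorted(nodes, key=lambda n: bias[n]):  # least-biased first
        part[n] = "SW"
        if not feasible(part):
            part[n] = "HW"   # undo the move that broke a deadline
            break
    return area(part)
```

Running this for each weight combination and comparing the reported areas is what populates Tables 1 and 2.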

We can see that any weight assignment with w3=0 achieves a fairly good solution; the best solution is achieved with w1=0.2 and w2=0.8 in the 100-node case and with w1=0 and w2=1 in the 1000-node case. The results indicate that the Performance-Area Ratio is the most important factor and that Urgency does not help at all. In both experiments we changed the spread of the Performance-Area Ratio values, but the results remained the same.

5.2 Annealing Results

We set w1=0.2, w2=0.8 and w3=0 in our algorithm to compute the bias value, and ran the simulated annealing. Figures 2 and 3 show the simulated annealing curves; Sn denotes n iterations at each temperature (e.g., S20 means 20 iterations per temperature). In simulated annealing, we found that the dynamic bias value does not help; instead, it leads the solution into local minima.

[Figures omitted: chip area vs. solutions tried for schedules S20/S200/S2000 (100-node case) and S200/S2000/S10000 (1000-node case).]

Figure 2. SA curves for the 100-node case

Figure 3. SA curves for the 1000-node case

5.3 Algorithm Performance Comparison Table 3 shows the performance of the 4 algorithms we tried; times are given in min:sec. The proposed approach achieves the best chip area, though its running time is longer than the other algorithms'. The running time is nevertheless acceptable: in practice, even at instruction-level granularity, most embedded systems have fewer than 1000 task nodes, and for larger systems the running time can be controlled simply by coarsening the granularity.
Table 3. Different algorithm performance comparison

                        100-node case       1000-node case
Algorithm               Area     Time       Area      Time
Random Partitioning     7,911    00:01      70,828    00:01
Simple Greedy           3,901    00:01      31,361    00:01
GCLP                    3,555    00:01      27,053    00:51
Proposed Approach       2,775    00:08      24,607    08:51

6. CONCLUSIONS AND FUTURE WORK We developed a simulated-annealing-based algorithm to solve the multifunction partitioning problem. Simulated annealing yields very good results for optimization problems with large solution spaces, and we used “bias values” to reduce the search space remarkably (otherwise the simulated annealing curves have very long tails and take a very long time to converge). We found that the Performance-Area Ratio is the most important factor for the hardware-software bias. Commonality helps to some extent, but Urgency and the dynamic bias value do not help at all with respect to minimizing chip area. Experiments show that the proposed algorithm obtains better results than the existing algorithms within reasonable time. Intuitively, non-greedy algorithms that permit uphill moves appear to yield better results than greedy algorithms. To our knowledge, no prior work has applied simulated annealing to the partitioning of multifunction systems. This work can be extended to N-way partitioning, i.e., allowing a task node to be implemented in one of several types of hardware, and more constraints beyond time and chip area can be added to the partitioning in the future.

7. REFERENCES
[1] Frank Vahid, Jie Gong, and Daniel Gajski, “A Hardware-Software Partitioning Algorithm for Minimizing Hardware,” European Design Automation Conference (EURO-DAC), 1994.
[2] Samir Agarwal and Rajesh K. Gupta, “Data-flow Assisted Behavioral Partitioning for Embedded Systems,” Proc. 34th Design Automation Conference, 1997.
[3] Jörg Henkel et al., “Adaptation of Partitioning and High-Level Synthesis in Hardware/Software Cosynthesis,” ICCAD, 1994.
[4] Jörg Henkel, “A Low Power Hardware/Software Partitioning Approach for Core-based Embedded Systems,” DAC, 1999.
[5] Smita Bakshi and Daniel Gajski, “Hardware/Software Partitioning and Pipelining,” DAC, 1997.
[6] Asawaree Kalavade and Edward A. Lee, “A Global Criticality/Local Phase Driven Algorithm for the Constrained Hardware/Software Partitioning Problem,” Proc. Codes/CASHE '94, Third IEEE International Workshop on Hardware/Software Codesign, Grenoble, France, Sept. 22-24, 1994, pp. 42-48.
[7] Asawaree Kalavade and Edward A. Lee, “The Extended Partitioning Problem: Hardware/Software Mapping, Scheduling, and Implementation-bin Selection,” Journal of Design Automation of Embedded Systems, vol. 2, no. 2, pp. 125-163, Mar. 1997.
[8] Jörg Henkel and Rolf Ernst, “An Approach to Automated Hardware/Software Partitioning Using a Flexible Granularity that is Driven by High-Level Estimation Techniques,” IEEE Transactions on VLSI Systems, vol. 9, no. 2, April 2001.
[9] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli, “System Level Hardware/Software Partitioning Based on Simulated Annealing and Tabu Search,” Kluwer Journal on Design Automation for Embedded Systems, vol. 2, no. 1, January 1997, pp. 5-32.
[10] Jörg Henkel, Rolf Ernst, and Thomas Benner, “Hardware-Software Cosynthesis for Microcontrollers,” IEEE Design & Test of Computers, vol. 10, no. 4, December 1993, pp. 64-75.
[11] Frank Vahid, “Modifying Min-Cut for Hardware and Software Functional Partitioning,” CODES, 1997.
[12] I. Karkowski and R.H.J.M. Otten, “An Automatic Hardware-Software Partitioner Based on the Possibilistic Programming,” European Design and Test Conference, March 1996.
[13] Robert P. Dick and Niraj K. Jha, “MOGAC: A Multiobjective Genetic Algorithm for Hardware-Software Co-Synthesis of Distributed Embedded Systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1998.

Wangqi Qiu Rabi Mahapatra Department of Computer Science Texas A&M University College Station, TX 77843-3112 Email: {abhijitp,wangqiq,rabi}@cs.tamu.edu

ABSTRACT The problem of hardware-software partitioning for systems that are being designed as multifunction systems is addressed. Simulated annealing is known to generate very good solutions for most optimization problems, however the running time of this algorithm can be very high. We apply a modified simulated annealing approach to the partitioning problem resulting in a smaller search space, yet yielding good partitions. We show experimental results that yield better solutions in comparison to the existing multifunction partitioning approaches within acceptable running time.

Keywords Hardware/software partitioning, Multifunction systems, Simulated annealing.

1. INTRODUCTION The problem of hardware-software partitioning is a version of the classical graph-partitioning problem, which is essentially an optimization problem. Given an application consisting of tasks, the partitioning task is to identify which tasks should be implemented in hardware and which in software, such that the specified constraints are met. Constraints are typically minimizing chip area or power. It goes without saying that the timing constraint on the application must be met in all cases. The application is described as a directed acyclic graph G = (N, E) where the nodes N represent computations (tasks or processes chosen in the granularity selection stage above), and the arcs E represent data and control dependencies between the nodes. Each node also contains information that helps the algorithm decide the “cost” of a given partition. This information is typically about the execution times, power utilization, and area required for the hardware and software implementation of the given node. There is also information about the cost of communication and area across the hardware-software interface, specified at the node level. The final solution to this problem simply tells us which node should be implemented in hardware and which in software. The problem described above is the simple partitioning problem (single function). However, if the goal is to design a system that performs multiple functions, (e.g. a device that combines functions of a PDA and cell phone), the partitioning problem becomes more complex. The timing constraints on both applications need to be met besides minimizing chip area or energy consumption. Such a problem is known as the multifunction partitioning problem. Considerable research has gone into the single function to hardware/software partitioning. However, the partitioning of multifunction systems is in its infancy and needs further attention. This is the primary motivation of this paper. 
All partitioning algorithms require a cost function, i.e. a way to estimate the cost of a partition to decide whether the resulting partition is better than a given partition. Different implementations have different such cost functions. In [1] the primary goal of the partition is to minimize the hardware cost. The primary objective in [2] is to minimize the communication cost between the hardware and software partitions. In [3], the cost function is a combination of the communication cost and the execution time. Low power consumption is the primary constraint in [4]. Throughput combined with hardware cost is the constraint for the implementation in [5]. In the global criticality/local phase driven algorithm in [6] and [7], minimization of chip area is the primary objective while still meeting timing constraints. Thus we see that a combination of constraints can be used in the formulation of the cost function. Typically the way minimization of the cost function is done, is by keeping one constraint as a minimal requirement, and trying to minimize the other factors. A similar approach has been followed in this paper. The simple partitioning problem has been proven to be NP-complete. The multi-functional partitioning problem is at least as difficult as the simple partitioning problem, and thus no polynomial time algorithm exists for it. Approximation algorithms are used for all NP-complete problems and we use one such popular approximation technique called simulated annealing. Simulated annealing is known to yield good results with all optimization problems but its running time is high, since the search space is very large. Other algorithms, such as the Global Criticality/Local Phase (GCLP) driven algorithm [6, 7], have low running times but their partition quality is not as

good as results due to the use of simulated annealing. In this paper we present a simulated annealing based technique, which has high partition quality as well as affordable running time. The rest of this paper is organized as follows. Section 2 discusses the background of the multifunction partition with brief reviews on previous efforts. In Section 3, we formally formulate the multifunction partition problem. In Section 4 we explain our new algorithm in detail. In Section 5 experimental results and comparison to the existing methods are presented. Section 6 concludes the paper.

2. BACKGROUND Simulated annealing has been used in [3, 8, 9, 10] for partitioning problems. Simulated annealing has been one of the most popular algorithms for use in optimization problems as it easy to implement, has the capability of make “uphill” moves in the solution space and thus gives adequate results. However, the tuning of the parameters in the algorithm takes a considerable amount of effort and poorly tuned algorithms yield poor results. Besides this it has a high running time, making it a popular choice only when processing time is not a constraint. The Kernighan-Lin (KL) partitioning algorithm is used in [2]. The running time of this algorithm is considerably less than that of simulated annealing, however the quality of solutions is not as good. The Fiduccia-Mattheyses (FM) min-cut algorithm that is an extension of the KL algorithm has been modified for hardware-software partitioning in [11]. The running time of this algorithm is an order better than KL. Both the algorithms mentioned above are greedy algorithms. They always move only in the direction of a more optimal solution. These algorithms tend to get caught in local minima. Other algorithms which do not always move in the direction of a more optimal solution have the advantage of being able to come out of local minima and are called hill-climbing algorithms. Simulated annealing is one such algorithm. Greedy algorithms are compared with hill-climbing algorithms in [1]. Other optimization techniques such as integer programming and genetic algorithms have also been applied to partitioning problem. An integer programming formulation of the partitioning problem is derived in [12]. Genetic algorithms have also been applied to partitioning problems on distributed systems in [13]. In [6, 7] the Global Criticality/Local Phase (GCLP) driven algorithm is introduced. This algorithm makes use of the global as well as local criteria of each node to determine the best implementation of the particular node. 
This algorithm yields better results than ordinary greedy algorithms and is reasonably fast, with linear running time. We compare our solution to this algorithm and show that ours yields better results, albeit with a small penalty in running time.

3. PROBLEM FORMULATION Given a set of k applications AP = {A1, A2, …, Ak}, only one application is active at any given time. We consider each application to have a repetitive behavior with fixed timing constraints. One iteration through the application Ai is specified by a Directed Acyclic Graph (DAG) G = (N, E), where the nodes N specify computations and the edges E specify data and control precedence between nodes. Each edge eij of the DAG carries the communication cost between nodes i and j when one is implemented in software and the other in hardware. For ease of programming, the communication cost between nodes i and j is assumed to be the same in both directions (Cost(i→j) = Cost(j→i)); however, the approach can easily be extended to use different values for the two directions. Extraction of the DAG from the specification of the application is a separate problem and is outside the scope of this work; the Ptolemy co-design environment, for instance, takes a description of the application in Synchronous Dataflow (SDF) form and generates a DAG. We do not consider any granularity estimation or pre-clustering of basic blocks for the application; it is assumed that this has been done before the graph is generated. Each node in the DAG represents some computation and therefore carries its execution times for both the hardware and the software implementation, as well as its area cost in each implementation. We assume that only one processor is available, and hence one value each for the execution time and the code size of each node is specified. The area of the hardware is the sum of the areas of all nodes implemented in hardware. The estimation of these values is another step in the co-design process; they are available in the form of a library with pre-estimated parameter values for all the applications.
The constraints on the system are assumed to be execution time and area, i.e. the maximum execution time of each application and the maximum area of the chip are known a priori. The area of the chip is to be minimized while meeting the timing constraints of all the applications. The partitioning task consists of determining the mapping of nodes to either hardware or software. To make the description of the system clearer, an example is considered.
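To make the formulation concrete, the following sketch shows one possible encoding of the node parameters and of the area cost of a partition (the Task fields and all numeric values here are illustrative assumptions of ours, not data from the paper):

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One computational unit (a node of the DAG)."""
    name: str
    th: float    # execution time if implemented in hardware
    ts: float    # execution time if implemented in software
    ah: float    # area when implemented in hardware
    asz: float   # code size when implemented in software

def hardware_area(tasks, mapping):
    """Chip area of a partition: the sum of the hardware areas of
    all tasks mapped to hardware (mapping[name] is 'HW' or 'SW')."""
    return sum(t.ah for t in tasks if mapping[t.name] == "HW")

# A toy two-task system with C0 in software and C1 in hardware.
tasks = [Task("C0", th=2, ts=10, ah=30, asz=5),
         Task("C1", th=1, ts=8,  ah=20, asz=4)]
mapping = {"C0": "SW", "C1": "HW"}
print(hardware_area(tasks, mapping))  # -> 20 (only C1's hardware area counts)
```

The code size of software tasks (asz) does not enter the cost function here, matching the formulation above, where only chip area is minimized.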

Figure 1 shows a partitioning problem for an embedded system that can run 3 applications. Let there be 10 basic computational units (numbered C0 thru C9); the three applications each use a subset of these basic units, as shown in Figure 1.

[Figure 1. An example for a 3-application system partitioning: the DAGs of Applications 1, 2 and 3, built from the shared computational units C0 thru C9]

Timing constraints for the three applications are T1, T2 and T3 respectively. The applications need to conform to these timing specifications, and the area of the system must be minimal. The values associated with each node of the graphs are:
1. th: execution time if implemented in hardware;
2. ts: execution time if implemented in software;
3. ah: area when implemented in hardware;
4. as: code size when implemented in software.
Each of these specifications for a computational unit (say a floating-point multiplier) is the same across applications. For example, if the floating-point multiplier takes an area Ah when implemented in hardware for Application 1, it takes the same area in the other two applications if implemented in hardware. There is also a communication cost associated with each possible edge of the graph. In this paper “Ci ↔ Cj” denotes the communication between nodes i and j when they have different implementations, i.e. node i is in software and j in hardware, or vice versa. This cost comprises an area overhead and a time overhead: the time overhead is caused by communication across the hardware-software interface, and the area overhead is the interconnect area of the interface. Let t(Ci ↔ Cj) denote the time overhead and s(Ci ↔ Cj) the area overhead. As an example, let the current configuration be as follows: C0 – SW; C1 – HW; C2 – HW; C3 – SW; C4 – SW; C5 – HW; C6 – SW; C7 – HW; C8 – SW; C9 – HW. Consider Application 2. After scheduling, we know that the path “C6” and the path “C2-C7” can be executed in parallel, since they have no resource conflict. Suppose task C6 finishes later than C7; the critical path in Application 2 is then C0-C6-C4-C5. Let the timing constraint on this application be T2. The execution time for this application is: Time_app2 = ts(C0) + ts(C6) + ts(C4) + t(C4 ↔ C5) + th(C5). The times for the other two applications, Time_app1 and Time_app3, are calculated similarly.
Let T1 and T3 be the timing constraints on applications 1 and 3. The timing constraints are: Time_appi ≤ Ti (i = 1, 2, 3).
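The execution-time computation above can be checked mechanically. The sketch below evaluates Time_app2 along the critical path C0-C6-C4-C5 under the configuration given in the text; since the paper gives no concrete numbers, all execution and communication times here are hypothetical:

```python
# Hypothetical execution times for the nodes on the critical path
# C0-C6-C4-C5 of Application 2 (C0, C6, C4 in software, C5 in hardware,
# as in the configuration given in the text).
ts = {"C0": 5.0, "C6": 7.0, "C4": 4.0}   # software execution times
th = {"C5": 2.0}                          # hardware execution time
t_comm = {("C4", "C5"): 1.5}              # HW/SW interface time overhead

# Time_app2 = ts(C0) + ts(C6) + ts(C4) + t(C4 <-> C5) + th(C5)
time_app2 = ts["C0"] + ts["C6"] + ts["C4"] + t_comm[("C4", "C5")] + th["C5"]
print(time_app2)  # -> 19.5

T2 = 25.0                # a hypothetical timing constraint on Application 2
print(time_app2 <= T2)   # the constraint Time_app2 <= T2 holds -> True
```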

4. PARTITIONING ALGORITHM 4.1 Simulated Annealing We implement the partitioning using the simulated annealing algorithm, a popular algorithm for such optimization problems. The algorithm starts at a certain “temperature”. New configurations of the partition are generated by a perturb function, and the cost of a partition is determined by an evaluate function (in this work, the total chip area). If the new configuration decreases the cost, the perturbation is accepted and the next perturbation is tried after incrementing the loop count. If the perturbation increases the cost, the new configuration is accepted with probability e^(-ΔC/T), where ΔC is the change in cost and T is the temperature of the system: if a random number generated between 0 and 1 is less than this quantity, the new configuration is accepted. This procedure lets the system move consistently towards lower-cost states, yet still “jump” out of local minima through the probabilistic acceptance of some uphill moves.

The number of iterations at a given temperature fixes the inner loop criterion. At the end of each complete run of the inner loop, the temperature of the system is reduced. The stopping criterion is typically when one complete run of the inner loop does not yield a better solution. At this point the system is said to be frozen and the algorithm ends. The cost function is simply defined as the area of the chip for the given partition.
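The loop structure described above can be sketched as follows. This is a generic skeleton; the initial temperature, cooling factor and inner-loop count below are our own illustrative choices, since the paper leaves them as tunable parameters:

```python
import math
import random

def simulated_annealing(cost, perturb, state, t0=100.0, alpha=0.9,
                        inner_loops=50, seed=0):
    """Generic SA skeleton following the description above: accept all
    downhill moves; accept an uphill move of size dC with probability
    exp(-dC / T); cool after each inner loop; stop ("freeze") when a
    full inner loop yields no better solution."""
    rng = random.Random(seed)
    best = state
    t = t0
    while True:
        improved = False
        for _ in range(inner_loops):
            cand = perturb(state, rng)
            d_cost = cost(cand) - cost(state)
            if d_cost < 0 or rng.random() < math.exp(-d_cost / t):
                state = cand                 # accept the perturbation
            if cost(state) < cost(best):
                best = state
                improved = True
        if not improved:
            return best                      # the system is frozen
        t *= alpha                           # reduce the temperature

# Toy usage: minimize x**2 over the integers with +/-1 moves.
result = simulated_annealing(cost=lambda x: x * x,
                             perturb=lambda x, r: x + r.choice([-1, 1]),
                             state=17)
print(result)
```

In the partitioning context, `state` would be a hardware/software mapping, `cost` the chip area of a valid (repaired) partition, and `perturb` the bias-guided move described in the next subsection.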

4.2 Perturb Function We define our perturbation based on a bias value. The bias value is a number between 0 and 1 that reflects the probability of a task node being implemented in hardware; a task with a low bias value is more likely to be implemented in software. In a traditional simulated annealing algorithm, a new configuration is generated by randomly moving a task node from its current implementation to the opposite one, i.e. from software to hardware or vice versa. However, this completely random process makes the search space very large. In our approach, we limit the search space by controlling this random process with the bias value. For example, a task node with a high bias value is biased toward hardware implementation and has a low probability of moving to software if it has already been mapped to hardware; conversely, if a task node with a high bias value is currently mapped to software, it is more likely to be selected in the perturbation. The bias value is calculated as a weighted summation of three parameters: Commonality, Performance-Area Ratio, and Urgency.
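One possible implementation of this biased selection is sketched below. The way we turn bias values into selection weights is our own assumption; the paper only states that high-bias tasks currently in software are likely to be picked, while high-bias tasks already in hardware are not:

```python
import random

def biased_perturb(mapping, bias, rng):
    """Pick one task to flip, guided by its bias value. The selection
    weight of a task is its probability of "wanting the other side":
    bias if it currently sits in software, 1 - bias if in hardware."""
    names = list(mapping)
    weights = [bias[n] if mapping[n] == "SW" else 1.0 - bias[n]
               for n in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    new_mapping = dict(mapping)
    new_mapping[chosen] = "HW" if mapping[chosen] == "SW" else "SW"
    return new_mapping

rng = random.Random(1)
mapping = {"C0": "SW", "C1": "HW"}
bias = {"C0": 0.9, "C1": 0.9}   # both tasks strongly biased to hardware
# C0 (in software, high bias) should be flipped far more often than C1.
flips = sum(biased_perturb(mapping, bias, rng)["C0"] == "HW"
            for _ in range(1000))
print(flips)  # roughly 900 of 1000 perturbations move C0 to hardware
```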

4.2.1 Commonality (Com) Some task nodes are common across different applications and can be implemented on the same resource. Identifying common “task groups” would be more useful, but it is extremely hard. We implement this idea in a simple way, defining commonality as the number of times a particular node appears across all the applications. Intuitively, common tasks are more likely to be implemented in hardware. We give each task a commonality value Com, which is used in computing its hardware-software bias value.

4.2.2 Performance-Area Ratio (PAR) This is defined as Δt/(ah − as), where Δt is the difference between the task's software and hardware execution times. If the additional area for hardware implementation is small and the performance gain is high, the task has a high performance-area ratio and is thus biased toward hardware implementation.

4.2.3 Urgency (Ur) This is the summation of the lengths of all paths through a task across all the applications. If the task node appears on the critical path of many applications, its urgency value is high, and the task is more likely to be implemented in hardware. A weighted bias value is defined as w1*Com + w2*PAR + w3*Ur, where w1, w2, and w3 are the weights; we call this value the static bias value. For two consecutive tasks, the communication costs (area, time) are higher when one is implemented in software and the other in hardware, so we try to avoid such splits. The dynamic bias value of a task is defined as the average of its neighbor tasks' current implementations (software = 0 and hardware = 1); for example, if all the neighbors of a task are implemented in software, the task is not likely to be implemented in hardware. The bias value used in our experiments is a weighted summation of the static and dynamic bias values.
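Putting the pieces together, the bias value of a task could be computed as below. This is a sketch: the normalization of Com, PAR and Ur to [0, 1] and the static/dynamic weights (w_static, w_dynamic) are our assumptions, since the paper does not specify them:

```python
def static_bias(com, par, ur, w1, w2, w3):
    """Static bias: weighted sum of Commonality, Performance-Area
    Ratio and Urgency (all assumed pre-normalized to [0, 1] here)."""
    return w1 * com + w2 * par + w3 * ur

def dynamic_bias(neighbors, mapping):
    """Average of the neighbors' current implementations
    (software = 0, hardware = 1)."""
    return sum(1 if mapping[n] == "HW" else 0
               for n in neighbors) / len(neighbors)

def bias(com, par, ur, neighbors, mapping,
         w1=0.2, w2=0.8, w3=0.0, w_static=0.7, w_dynamic=0.3):
    # Overall bias = weighted sum of the static and dynamic parts.
    # w_static/w_dynamic are illustrative; the paper does not fix them.
    return (w_static * static_bias(com, par, ur, w1, w2, w3)
            + w_dynamic * dynamic_bias(neighbors, mapping))

mapping = {"C1": "HW", "C2": "HW", "C3": "SW"}
b = bias(com=0.5, par=0.9, ur=0.2,
         neighbors=["C1", "C2", "C3"], mapping=mapping)
print(round(b, 3))  # -> 0.774: 0.7*(0.2*0.5 + 0.8*0.9) + 0.3*(2/3)
```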

4.3 Scheduling and Validation Once the implementation of each task node is decided, we schedule each application. Since this scheduling problem is NP-complete, we use a simple scheduling algorithm. Its purpose is to check whether the timing constraints are met in all the applications; if not, the partitioning solution is not a valid one and must be repaired. The validation procedure repeatedly moves the software-implemented task node with the maximum performance-area ratio into hardware, until the timing constraints of all the applications are met.
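The validation step can be sketched as a simple repair loop. The function names are ours, and `meets_all_deadlines` stands in for the scheduler described above; the toy deadline check is purely illustrative:

```python
def repair(tasks, mapping, par, meets_all_deadlines):
    """Validation step: while some application misses its deadline,
    move the software task with the highest Performance-Area Ratio
    into hardware."""
    while not meets_all_deadlines(mapping):
        sw = [t for t in tasks if mapping[t] == "SW"]
        if not sw:
            raise RuntimeError("infeasible: all tasks already in hardware")
        best = max(sw, key=lambda t: par[t])   # highest PAR in software
        mapping[best] = "HW"
    return mapping

# Toy check: deadlines are met once C2 (the highest-PAR task) is in hardware.
par = {"C0": 0.1, "C1": 0.4, "C2": 0.9}
mapping = {"C0": "SW", "C1": "SW", "C2": "SW"}
done = repair(list(par), mapping, par,
              meets_all_deadlines=lambda m: m["C2"] == "HW")
print(done)  # -> {'C0': 'SW', 'C1': 'SW', 'C2': 'HW'}
```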

4.4 User Controllable Parameters The number of inner loops is the number of iterations that are to be made at a particular temperature. This parameter is user controllable and can be used to control running time of the algorithm. The criterion to stop the algorithm is when there is no decrease in chip area for n consecutive iterations. The parameter n is also user controllable and in turn controls the running time of the solution.

5. EXPERIMENTS We generated 2 task libraries, containing 100 task nodes and 1000 task nodes respectively. For each experiment, we assume there are 3 applications constructed from the task nodes of one library, and we generated random DAGs to represent the applications. For each application, let t1 be the time for the application to finish if every node is implemented in hardware, and t2 the time if every node is in software. We set the timing deadline of the application to (t1+t2)/2, which guarantees that after partitioning some of the nodes will be implemented in hardware and some in software. The experimental program is implemented in Microsoft Visual C++. We also implemented the GCLP algorithm for the multifunction partitioning problem on these example task sets, and compare its results with our algorithm's.

5.1 Weight Assignment We use a simple greedy algorithm to decide the weights of the 3 factors (Commonality, Performance-Area Ratio, and Urgency) in the static bias value: initially we put all the nodes in hardware, then move the nodes with the least bias values to software one by one until the timing constraint of at least one application is no longer met. We tried different weights for Commonality, Performance-Area Ratio, and Urgency to calculate the bias value and ran this simple greedy algorithm for comparison. The results are listed in Table 1 (the 100-node case) and Table 2 (the 1000-node case). Table 1. Different weight assignment for the 100-node case

w1   w2   w3   Area      w1   w2   w3   Area
0.0  0.0  1.0  6,206     0.4  0.0  0.6  5,183
0.0  0.2  0.8  5,943     0.4  0.2  0.4  5,193
0.0  0.4  0.6  5,724     0.4  0.4  0.2  4,676
0.0  0.6  0.4  5,810     0.4  0.6  0.0  4,532
0.0  0.8  0.2  5,258     0.6  0.0  0.4  5,148
0.0  1.0  0.0  6,236     0.6  0.2  0.2  5,247
0.2  0.0  0.8  5,704     0.6  0.4  0.0  4,532
0.2  0.2  0.6  5,232     0.8  0.0  0.2  5,449
0.2  0.4  0.4  5,059     0.8  0.2  0.0  4,560
0.2  0.6  0.2  4,991     1.0  0.0  0.0  5,218
0.2  0.8  0.0  3,901

Table 2. Different weight assignment for the 1000-node case

w1   w2   w3   Area      w1   w2   w3   Area
0.0  0.0  1.0  64,598    0.4  0.0  0.6  55,982
0.0  0.2  0.8  64,598    0.4  0.2  0.4  51,800
0.0  0.4  0.6  64,196    0.4  0.4  0.2  42,050
0.0  0.6  0.4  63,240    0.4  0.6  0.0  34,574
0.0  0.8  0.2  62,884    0.6  0.0  0.4  45,794
0.0  1.0  0.0  31,361    0.6  0.2  0.2  42,224
0.2  0.0  0.8  62,712    0.6  0.4  0.0  35,756
0.2  0.2  0.6  59,083    0.8  0.0  0.2  43,064
0.2  0.4  0.4  55,442    0.8  0.2  0.0  35,752
0.2  0.6  0.2  47,957    1.0  0.0  0.0  42,269
0.2  0.8  0.0  34,266

We can see that weight assignments with w3=0 generally achieve fairly good solutions; the best solution is achieved with w1=0.2 and w2=0.8 in the 100-node case and with w1=0 and w2=1 in the 1000-node case. The results indicate that the Performance-Area Ratio is the most important factor and that Urgency does not help at all. In both experiments we varied the Performance-Area Ratio values, and the results remained the same.

5.2 Annealing Results

We set w1=0.2, w2=0.8 and w3=0 in our algorithm to compute the bias value, and ran the simulated annealing. Figure 2 and Figure 3 show the simulated annealing curves; S20 means there are 20 iterations at each temperature, and S200, S2000 and S10000 are read analogously. In simulated annealing, we found that the dynamic bias value does not help; instead, it leads the solution into local minima.

[Figure 2. SA curves for the 100-node case (Area vs. Solutions)]

[Figure 3. SA curves for the 1000-node case (Area vs. Solutions)]

5.3 Algorithm Performance Comparison Table 3 shows the performance of the 4 algorithms we tried (times are given as min:sec). The proposed approach achieves the best chip area, but its running time is longer than that of the other algorithms. The running time is still acceptable because, in practice, even at instruction-level granularity most embedded systems have fewer than 1000 task nodes; for systems with more than 1000, the granularity can simply be coarsened to control the running time. Table 3. Different algorithm performance comparison

                       100-node case       1000-node case
                       Area     Time       Area      Time
Random Partitioning    7,911    00:01      70,828    00:01
Simple Greedy          3,901    00:01      31,361    00:01
GCLP                   3,555    00:01      27,053    00:51
Proposed Approach      2,775    00:08      24,607    08:51

6. CONCLUSIONS AND FUTURE WORK We developed a simulated annealing based algorithm to solve the multifunction partitioning problem. Simulated annealing yields the best results for optimization problems with large solution spaces, and we used “bias values” to reduce the search space remarkably (otherwise the simulated annealing curves have very long tails and take a very long time to terminate). We found that the Performance-Area Ratio is the most important factor in the hardware-software bias. Commonality helps to some extent, but Urgency and the dynamic bias value do not help at all with respect to minimizing the chip area. Experiments show that the proposed algorithm obtains better results than the existing algorithms, within reasonable time. Intuitively, non-greedy algorithms that permit uphill moves yield better results than greedy algorithms. To our knowledge, no previous work applies simulated annealing to the partitioning problem of multifunction systems. This work can be extended to N-way partitioning, i.e., allowing a task node to be implemented in several types of hardware, and more constraints beyond time and chip area can be added to the partitioning in the future.

7. REFERENCES
[1] Frank Vahid, Jie Gong, and Daniel Gajski, “A Hardware-Software Partitioning Algorithm for Minimizing Hardware”, European Design Automation Conference (EURO-DAC), 1994.
[2] Samir Agarwal and Rajesh K. Gupta, “Data-flow Assisted Behavioral Partitioning for Embedded Systems”, Proc. 34th Design Automation Conference, 1997.
[3] Jörg Henkel et al., “Adaptation of Partitioning and High-Level Synthesis in Hardware/Software Cosynthesis”, ICCAD, 1994.
[4] Jörg Henkel, “A Low Power Hardware/Software Partitioning Approach for Core-based Embedded Systems”, DAC, 1999.
[5] Smita Bakshi and Daniel Gajski, “Hardware/Software Partitioning and Pipelining”, DAC, 1997.
[6] Asawaree Kalavade and Edward A. Lee, “A Global Criticality/Local Phase Driven Algorithm for the Constrained Hardware/Software Partitioning Problem”, Proc. of Codes/CASHE '94, Third IEEE International Workshop on Hardware/Software Codesign, Grenoble, France, Sept. 22-24, 1994, pp. 42-48.
[7] Asawaree Kalavade and Edward A. Lee, “The Extended Partitioning Problem: Hardware/Software Mapping, Scheduling, and Implementation-bin Selection”, Design Automation for Embedded Systems, vol. 2, no. 2, pp. 125-163, March 1997.
[8] Jörg Henkel and Rolf Ernst, “An Approach to Automated Hardware/Software Partitioning Using a Flexible Granularity that is Driven by High-Level Estimation Techniques”, IEEE Transactions on VLSI Systems, vol. 9, no. 2, April 2001.
[9] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli, “System Level Hardware/Software Partitioning Based on Simulated Annealing and Tabu Search”, Design Automation for Embedded Systems, vol. 2, no. 1, January 1997, pp. 5-32.
[10] Jörg Henkel, Rolf Ernst, and Thomas Benner, “Hardware-Software Cosynthesis for Microcontrollers”, IEEE Design & Test of Computers, vol. 10, no. 4, December 1993, pp. 64-75.
[11] Frank Vahid, “Modifying Min-Cut for Hardware and Software Functional Partitioning”, CODES, 1997.
[12] I. Karkowski and R.H.J.M. Otten, “An Automatic Hardware-Software Partitioner Based on the Possibilistic Programming”, European Design and Test Conference, March 1996.
[13] Robert P. Dick and Niraj K. Jha, “MOGAC: A Multiobjective Genetic Algorithm for Hardware-Software Co-Synthesis of Distributed Embedded Systems”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1998.