EMBEDDED SYSTEM CO-DESIGN: SYNTHESIS AND VERIFICATION

LUCIANO LAVAGNO
Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy

ALBERTO SANGIOVANNI-VINCENTELLI
Department of EECS, University of California, Berkeley, CA 94720

AND

HARRY HSIEH
Department of EECS, University of California, Berkeley, CA 94720

1. Introduction

The electronics industry has been growing at an impressive rate for the past few years. A reason for its growth is the use of electronic components in almost all traditional systems, such as automobiles, home appliances, and personal communication devices. In this framework, objects assume an electronic dimension that makes them more effective, more reliable, and less expensive. Home personal computers will not remain as pervasive as they are today, because dedicated electronic components will be more appealing and cost-effective for the final user. Because of the nature of the new applications of electronics, it will be necessary to cope with specifications that can change continuously and with more and more stringent time-to-market constraints. This calls for the use of software-programmable components whose behavior can be fairly easily changed. Such systems, which use a computer to perform a specific function, but are neither used nor perceived as a computer, are generically known as embedded systems. Typically, embedded systems perform a single function and are used to control a larger heterogeneous system.

More specifically, we are interested in reactive real-time embedded systems, which maintain a permanent interaction with the external environment and are subject to external timing constraints. Embedded systems are often used in life-critical situations, so reliability and safety are more important criteria than performance.

Today, embedded systems are designed with an ad hoc approach that is heavily based on earlier experience with similar products and on manual design. In the past, an embedded system designer had to cope with low-performing micro-processors, and was therefore forced to program in low-level languages such as assembly. Ad hoc software was written to handle resources and I/O, thus avoiding the overhead of operating systems. Recently, more and more powerful micro-processors have become available at reasonable prices. The trend is to use 32-bit micro-controllers for complex applications such as automotive engine control. In this situation, there is no need to hand-optimize code; in fact, higher-level languages such as C are now used in several cases. However, even if the number of errors goes down with the use of higher-level languages, there are still issues related to the reliability of embedded systems. In addition, the methodology in use today prevents the full exploitation of the available computing power.

Interest in methods for the design of embedded systems that fix some of the reliability and time-to-market problems has been growing recently. We believe that any design approach should be based on the use of a formal model to describe the behavior of the system before a decision on its decomposition into hardware and software components is taken. In addition, we believe that extensive verification should be performed on the behavior of the system at the highest level of abstraction, to reduce the risk of a non-performing product. Finally, the final implementation of the system into hardware and software components should be obtained as much as possible using automatic synthesis techniques. In this paper, we review some of the existing approaches, pointing out the strengths and weaknesses of each. Our emphasis is on methods that focus on one or more of the points described above.

2. Co-design Models and Languages

A formal model is an essential ingredient of a sound design methodology. It allows one to unambiguously analyze the functionality of the design. Moreover, with a formal model, the effect of an operation or transformation on a design is always well defined. The transitions along the design flow can be smooth, with few or no special conditions that are design-dependent.

At every stage of the design flow, one can also argue about the correctness of the intermediate result or final implementation, because of the existence of this underlying formal model.

2.1. FORMAL MODELS

Many formal models have been proposed for real-time reactive systems. In this section, we propose a general taxonomy to classify them, based on the notion of a hierarchical process network. The taxonomy is not based on rigorous mathematical definitions, but relies on intuition to compare the expressiveness, compactness and practical implications of the various models.

A process network is a structural model that consists of a set of nodes, representing processes and/or sub-networks, and a set of edges, representing communication links. Its semantics is determined by a node model and a communication model.

Communication among processes involves writing and reading information among partners. Sometimes just the act of reading or writing is used for synchronization purposes, and the transmitted information is ignored. We distinguish between two basic mechanisms, depending on whether a single piece of information (written once) can be read once or many times:
1. Destructive read, if the information written by one partner can be read at most once by the other partner. In this case, we further distinguish four cases, depending on whether the write and read operations are blocking or not. A write operation is blocking if it cannot complete when the current or previous information would be lost (e.g., because a buffer is full or because a partner is not ready to perform an unbuffered communication). A read operation is blocking if it cannot complete until some data has been or is being written by a partner.
2. Non-destructive read, if the information written by one partner can be read many times by the other partner. The simplest form is a variable shared among partners.

Communication can be:
1. Synchronous, if the write and read operations must occur simultaneously, and
2. Asynchronous, otherwise. In this case there can be a finite or an infinite number of buffer locations, where the information is stored between the instant in which it is written and the instant in which it is read.

Note that non-destructive read is generally asynchronous, and implies a buffer with a single cell (the shared variable).

The standard distinctions between uni-directional and bi-directional communication, as well as between point-to-point and broadcast, are also useful for classification purposes.

Each node of a network can be:
1. A process ("leaf" node), which reads data from incoming edges, computes a function based on those data, and writes information along outgoing edges.
2. An instance of a network. The instantiation mechanism can be static, if the total number of nodes in a given hierarchy is fixed, or dynamic, if it is variable. Dynamic instantiation may increase the expressive power of the model, while static instantiation is only a convenient mechanism to reduce the size of the representation. We can further distinguish two cases of dynamic instantiation:
(a) recursive instantiation, in which a node can instantiate a network that includes the node itself,
(b) non-recursive instantiation.
In both cases, an appropriate mechanism to dynamically update the communication structure must also be provided.

We can now proceed to classify various models described in the literature along these lines.

2.1.1. Finite State Machines

A Finite State Machine (FSM) is a process whose input/output function can be computed by a finite automaton. The edges of the automaton are labeled with input/output data pairs. A network of FSMs uses broadcast synchronous non-blocking communication among the leaf nodes.

A network of extended synchronous FSMs (EFSMs) is similar to a network of FSMs, but also allows non-destructive synchronous communication of integer values. The (no longer finite) state of each EFSM is the Cartesian product of a finite-valued variable and a set of integer variables. The next-state and output functions of the nodes are usually described, for notational convenience, using relational, arithmetic and Boolean operators. The descriptive power of EFSMs is equivalent to that of Turing machines.

Sometimes in the literature, FSMs which can perform arithmetic and relational tests on integer subrange variables are also referred to as "extended FSMs". In this case, though, the set of possible states of the system is finite, so this "extension" does not augment the power of FSMs beyond regular languages. The only advantage is that the next-state and output functions can be written more compactly by using arithmetic and relational operators.
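To make the leaf-node FSM model concrete, the following is a minimal sketch in C (our own illustration; the state, input and output names are invented and not taken from any of the cited systems) of an FSM process written as a transition function over finite enumerations. Adding integer variables and arithmetic tests to such a function would turn it into one of the extended FSMs discussed above.

    #include <stdio.h>

    /* Illustrative leaf FSM: a trivial request/acknowledge controller.
     * One reaction maps (state, input) to (next state, output). */

    typedef enum { IDLE, BUSY } state_t;
    typedef enum { REQ, DONE, NONE } input_t;
    typedef enum { ACK, NAK, SILENT } output_t;

    static output_t fsm_react(state_t *state, input_t in)
    {
        switch (*state) {
        case IDLE:
            if (in == REQ) { *state = BUSY; return ACK; }
            return SILENT;
        case BUSY:
            if (in == DONE) { *state = IDLE; return SILENT; }
            if (in == REQ)  { return NAK; }   /* busy: refuse new requests */
            return SILENT;
        }
        return SILENT;
    }

    int main(void)
    {
        state_t s = IDLE;
        input_t trace[] = { REQ, REQ, DONE, REQ };
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("output %d\n", fsm_react(&s, trace[i]));
        return 0;
    }

In a synchronous network of such FSMs, all the transition functions would be evaluated together at each instant, with each machine broadcasting its outputs to the others.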

A Behavior Finite State Machine (BFSM [56]) is an FSM whose inputs and outputs are partially ordered in time, according to a set of linear inequalities (i.e., computation can take time). There can be many implementations that are consistent with the partial order and timing of a given BFSM; these are called the register transfer FSMs (RTFSMs) that implement the BFSM. A scheduled BFSM is a BFSM (actually an RTFSM) in which the relative timing between all input and output events is fixed. The system behavior is specified as a network of BFSMs with synchronous communication, and the network behavior is defined by a simulation procedure. Finding a schedule for the network is therefore not equivalent to finding schedules for the individual BFSMs. BFSMs can hence specify timing constraints within an FSM framework, and compositionality is easily defined when only linear inequalities are present.

A Co-design Finite State Machine (CFSM [14]), like a classical Finite State Machine, transforms a set of inputs into a set of outputs with only a finite amount of internal state. The difference between the two models is that the synchronous communication model of concurrent FSMs implies that all the FSMs change state exactly at the same time, whereas a software implementation of a set of FSMs generally interleaves them in time. Each transition of a CFSM, hence, is triggered by communications along the input edges and causes, after an unbounded non-zero amount of time, communications along the output edges. Transforming a CFSM into an implementation amounts to choosing a particular range of values for the initially unbounded reaction delays. Hence, the correctness of the partitioning into heterogeneous components can be shown by construction.

2.1.2. Control/data flow-based models

We next turn our attention to process modeling with Control/Data Flow Graphs (CDFGs). While FSM-based specification naturally separates hardware from software at the task or process level, the use of control/data flow graphs facilitates partitioning at the operation level. The former is usually called coarse-grain partitioning and the latter fine-grain partitioning. The next two approaches are both geared toward automatic fine-grain partitioning.

A hierarchical Flow Graph model is used for all partitioning, scheduling and synthesis manipulations in the Vulcan system [23]. The model consists of nodes and edges, where nodes denote operations and edges denote dependencies. Operations can be conditional operations, computational operations, wait, and loop. Conditional operations allow data-dependent paths in the graph model; computational operations include logical, arithmetic, and relational operations.

Both non-blocking and blocking read operations, called "receive" and "wait" respectively, are available. The blocking read can result in a non-deterministic delay; data-dependent loops can also have a non-deterministic delay.

COSYMA [21] is a design system for co-processor generation. It uses as its internal model a syntax graph, extended with a symbol table and local data and control dependencies, called an ES graph. It is a directed acyclic graph describing a sequence of declarations, definitions and statements. A basic scheduling block is then overlaid on top of the ES graph to describe data dependency.

2.1.3. Petri nets and data-flow networks

A Petri net is a flat hierarchy. Nodes (usually called "transitions") "fire" by reading from each input and writing to each output. Communication is asynchronous, with infinite buffers (usually called "places"), blocking read, and non-blocking write. In the pure Petri net model no value is transferred by communications, the only significant information being the possible transition firing sequences.

A data-flow network is similar to a Petri net, but each communication can transfer a value (e.g., an integer or Boolean), and buffers have FIFO semantics. Little or no restriction is placed on the function computed by each leaf node in response to a set of communications on its input edges, apart from terminating in finite time. Note that, due to the blocking read communication, a node cannot test an input buffer for emptiness; nonetheless, nodes can decide from which input(s) and to which output(s) communications will be performed. The main result concerning data-flow networks, which is directly connected with the blocking read communication mechanism, is their determinacy. This means that the sequence of output values at each node does not depend on the order in which nodes are selected to execute computations and communications, as long as the order does not cause deadlock (e.g., by scheduling a process when one of the buffers it will read from is empty) or starvation (e.g., by never scheduling a process with non-empty input buffers).

2.1.4. Process networks

An SDL process network ([51]) is similar to an extended FSM network, but communication is asynchronous, with blocking read, non-blocking write and infinite buffers. Leaf nodes can dynamically instantiate networks (i.e., create new FSMs), and hence the communication structure is dynamically defined. A network of Communicating Sequential Processes ([29]) is composed of extended FSMs, with the additional possibility of recursion (via dynamic self-instantiation); communication is synchronous and blocking.
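The blocking-read discipline underlying the determinacy of the data-flow networks described above can be illustrated with a small C sketch (our own; the FIFO and node names are invented). A node may be fired only when every input buffer it will read from is non-empty, so any legal firing order produces the same output stream.

    #include <stdio.h>

    #define CAP 16

    typedef struct { int data[CAP]; int head, tail, count; } fifo_t;

    static void put(fifo_t *f, int v)      /* non-blocking write (buffer assumed large enough) */
    {
        f->data[f->tail] = v; f->tail = (f->tail + 1) % CAP; f->count++;
    }

    static int can_fire(const fifo_t *f)   /* the scheduler checks this before firing a node */
    {
        return f->count > 0;
    }

    static int get(fifo_t *f)              /* blocking read: only called when non-empty */
    {
        int v = f->data[f->head]; f->head = (f->head + 1) % CAP; f->count--;
        return v;
    }

    /* Leaf node: consumes one token from each input, produces their sum. */
    static void adder_fire(fifo_t *a, fifo_t *b, fifo_t *out)
    {
        put(out, get(a) + get(b));
    }

    int main(void)
    {
        fifo_t a = {0}, b = {0}, out = {0};
        put(&a, 1); put(&a, 2); put(&b, 10); put(&b, 20);

        /* Any firing order allowed by can_fire() yields the same output stream. */
        while (can_fire(&a) && can_fire(&b))
            adder_fire(&a, &b, &out);

        while (can_fire(&out))
            printf("%d\n", get(&out));     /* prints 11 then 22 */
        return 0;
    }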

SOLAR is an internal format based on communicating processes, which is used in the COSMOS [58] system and is derived from an SDL specification. The basic unit in SOLAR is a Design Unit (DU). Each DU can consist of a hierarchical FSM (like StateCharts, described in Section 2.2.1), or of other DUs with Channel Units performing communication for these DUs. A third entity in SOLAR is a Functional Unit, which is a combinational circuit that performs data computation (such as addition, multiplication, etc.). By using Channel Units, the communication is effectively separated from the rest of the design and can then be optimized and reused. Channel Units also allow SOLAR to be effectively a superset of most extended FSM models and communicating process models. The complexity of this model may lead to difficulties in analysis and synthesis across boundaries (e.g., hierarchical DUs).

The UNITY model ([8]) is composed of a flat hierarchy of leaf nodes that implement a set of guarded commands. A guarded command is an assignment of a set of expressions to a set of variables, which is enabled (can be executed) whenever a Boolean expression (its "guard") is true. Communication is non-destructive, and synchronization is defined by a fixed-point computation. A fixed point is reached in a UNITY specification whenever the values of the variables can no longer change due to enabled commands.

2.2. FORMAL LANGUAGES

In this section, we examine in detail the different programming languages used as front ends of various co-design systems, as well as the application domain that each modeling approach specifically tackles.

2.2.1. FSM-based languages

Esterel [4] belongs to a family of synchronous languages which also includes Lustre and Signal. The synchronous hypothesis, common to all these languages, states that time is a sequence of instants, between which nothing interesting occurs. In each instant, some events occur in the environment, and a reaction is computed instantly by the modelled system. This means that computation and internal communication take no time. This hypothesis is very convenient, because it allows the complete system to be modeled as a single FSM, which has the advantage that the behavior is totally predictable, since there is no problem of synchronizing or interleaving concurrent processes. One problem is that this hypothesis entails the implicit or explicit computation of a fairly large FSM, which may be problematic for large specifications with a lot of concurrency.

The Esterel language is very simple and includes constructs for hierarchy, pre-emption, concurrency and sequencing. Hierarchy is handled via procedure calls and module instantiations, as in normal programming languages. Pre-emption consists of two basic exception-raising and exception-handling constructs, one which allows the module to terminate its computation for the current instant, and one that does not. Concurrency is specified by using a parallel composition construct (otherwise the module behavior is sequential but instantaneous).

StateCharts [18] is a graphical specification language that extends classical Finite State Machines by allowing hierarchical decomposition, timing specifications, concurrency, synchronization, and subroutines. Hierarchical decomposition is accomplished by clustering states through the use of AND or OR operations. AND clustering models concurrency (it corresponds to the classical FSM product), while OR clustering models sequencing (the FSM can be in only one of several OR states at a time). Transitions in and out of states or hierarchical states can be specified without restriction. The emphasis of this hierarchical composition is on condensing information. Timing is specified by allowing the use of linear inequalities as timing constraints on states at any hierarchical level. Concurrent FSMs can be synchronized through transition edges either explicitly with conditions, or implicitly by going through hierarchical states. An implicit stack allows a state to "call" a sequence of states and then "return" to the calling state.

2.2.2. High-level languages

Various high-level textual languages have been used for co-specification of embedded systems, generally using a control/data flow internal model. For example, the entry language for COSYMA [21] is C x, which is ANSI C extended with minimum/maximum delays, tasks, and task communication. A C x specification is translated into the ES graph internal representation, as discussed above. Another example is Hardware-C, which is translated into a flow graph as described in [35].
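As an illustration of the OR-clustering used by StateCharts above, the following minimal C sketch (our own, not generated by any of the cited tools) shows a hierarchical state RUNNING containing two sequential substates, flattened into nested switch statements; AND-clustering would correspond to evaluating several such functions side by side in the same instant.

    #include <stdio.h>

    /* RUNNING is an OR-state with substates WARMUP and STEADY; it is itself
     * a sibling of OFF.  Flattening the hierarchy yields nested switches. */

    typedef enum { OFF, RUNNING } top_state_t;
    typedef enum { WARMUP, STEADY } running_sub_t;
    typedef enum { POWER, TIMER_EXPIRED, FAULT } event_t;

    static void step(top_state_t *top, running_sub_t *sub, event_t e)
    {
        switch (*top) {
        case OFF:
            if (e == POWER) { *top = RUNNING; *sub = WARMUP; }  /* default entry substate */
            break;
        case RUNNING:
            if (e == FAULT) { *top = OFF; break; }              /* leaves the whole OR-state */
            switch (*sub) {                                     /* behavior of the current substate */
            case WARMUP: if (e == TIMER_EXPIRED) *sub = STEADY; break;
            case STEADY: break;
            }
            break;
        }
    }

    int main(void)
    {
        top_state_t top = OFF; running_sub_t sub = WARMUP;
        event_t trace[] = { POWER, TIMER_EXPIRED, FAULT };
        for (unsigned i = 0; i < 3; i++) {
            step(&top, &sub, trace[i]);
            printf("top=%d sub=%d\n", top, sub);
        }
        return 0;
    }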

3. Specification and implementation validation

The concurrent design process for mixed hardware/software embedded systems involves, according to most authors, solving the following sub-problems:
1. specification,
2. validation at various specification and implementation levels,
3. mapping from the specification to an architecture, that is:
(a) partitioning,
(b) synthesis, scheduling and optimization of hardware, software and interface components.

The solution to these subproblems is generally arranged in (possibly concurrent) steps, which can be part of optimization loops involving iteration and design space exploration. In this section and in the next one we examine in detail the different approaches that have been proposed to tackle each aspect of co-design, using a horizontal ("by problem") rather than vertical ("by methodology") organization.

3.1. CO-SIMULATION

Simulation is still the main tool used in practice to validate a model (e.g., a high-level language program, or a logic netlist) against a set of specifications (a set of test vectors hopefully covering most of the desired behavior of the target system). The basic problem in co-simulation is how to reconcile two apparently contrasting requirements:
1. to execute the software as fast as possible, often on a host machine that may be faster than the final embedded CPU, and that is probably different from it;
2. to keep the hardware and software simulations synchronized, so that they interact just as they will in the target system.

For this purpose, one could in principle (and often does) simulate a mixed system by running, on a general-purpose hardware simulator (based on VHDL or Verilog), a simulation model of the target CPU, and executing the software program on this simulation model. This method can be used at various levels of accuracy:
- true gate-level models of the CPU, which are extremely inefficient and can no longer be used for 32- and 64-bit processors,
- cycle-based simulators, in which only the bus interface is simulated with timing, while the program itself is executed on a high-level instruction interpreter that provides information only about the number of clock cycles required for a given sequence of instructions between a pair of I/O operations.

None of these methods, though, can achieve the full speed of the target processor. So, apart from using hardware emulators, one has to resort to compiling the software code for a host processor, which also runs a hardware simulator for the remaining part of the system, and somehow synchronizing these two processes.

Another problem is the accurate modeling of a controlled (e.g., electro-mechanical) system, which is generally governed by a set of differential equations. This may require, for example, synchronizing some sort of numerical integrator (e.g., MATLAB or SPICE) with the hardware and software simulator(s).

3.1.1. Overview of co-simulation methods

A first co-simulation method, proposed for example by Gupta et al. in [22], relies on a single custom simulator for hardware and software. This simulator uses a single event queue and a high-level, bus-cycle model of the target CPU.

A second method, described by Wilson et al. in [60], by Thomas et al. in [52], and by Rowson in [48], loosely links a hardware simulator with a software process. Synchronization is achieved by using the standard inter-process communication mechanisms offered by the host operating system. One of the problems with this approach is that the relative clocks of the software and hardware simulations are not synchronized. This requires the use of handshaking protocols, which may impose an undue burden on the implementation, for example because the real hardware and software would not need such handshaking, since the hardware part runs much faster in reality than in the simulation.

A third method, described in [53], keeps track of time in software and hardware independently, using various mechanisms to synchronize them periodically. If the software is master, then it decides when to send a message, tagged with the current software clock cycle, to the hardware simulator. If the hardware time is already ahead, the simulator may need to back up, a capability that few hardware simulators currently have. If the hardware is master, then the hardware simulator calls communication procedures, which in turn call user software code.

A similar approach is used by Kalavade et al. in [33] and by Lee et al. in [39]. In both cases, the simulation and design environment Ptolemy, described elsewhere in this book, is used to provide an interfacing mechanism between different domains. In [33] co-simulation is done at the specification level by using a data-flow model, and at the implementation level by using a bus-cycle model of the target DSP and a hardware simulator (both built within Ptolemy). In [39] the specification is simulated by using concurrent processes communicating via queues. The same message-exchanging mechanism is retained in the implementation (using a mix of micro-processor-based boards, DSPs, and ASICs), thus allowing one to debug a partial implementation together with a simulation model of the remainder of the system. For example, the control software running on the micro-processor can also run on a host computer, while the DSP software runs on the DSP itself.
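A minimal C sketch of the "software is master" synchronization scheme described above for [53] may help fix ideas. All names and the message format are our own invention, and a real implementation would exchange the messages through operating-system inter-process communication rather than a direct function call; both sides live in one process here purely for illustration.

    #include <stdio.h>

    /* The host-compiled software keeps its own cycle count and, whenever it
     * performs I/O, sends a message tagged with that count to the hardware
     * simulator, which advances its local time to the tag before processing
     * the access. */

    typedef struct { unsigned long sw_cycle; int addr; int data; } hw_msg_t;

    static unsigned long hw_time = 0;

    static void hw_sim_receive(hw_msg_t m)
    {
        if (m.sw_cycle < hw_time)
            printf("warning: hardware time already ahead, roll-back needed\n");
        else
            hw_time = m.sw_cycle;          /* catch up to the software clock */
        printf("HW @%lu: write 0x%x to reg %d\n", hw_time, m.data, m.addr);
    }

    int main(void)
    {
        unsigned long sw_cycle = 0;
        for (int i = 0; i < 3; i++) {
            sw_cycle += 100;               /* estimated cycles spent computing */
            hw_msg_t m = { sw_cycle, 4, i };
            hw_sim_receive(m);             /* synchronize only at I/O operations */
        }
        return 0;
    }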

Finally, Sutarwala et al. in [50] describe an environment (coupled with a re-targetable compiler described in [38]) for cycle-based simulation of a user-definable DSP architecture. The user only provides a description of the DSP structure and functionality, while the environment generates a behavioral bus-cycle VHDL model for it, which can then be used to run the code on a standard hardware simulator.

3.2. FORMAL VERIFICATION FOR EMBEDDED SYSTEMS

Formal verification is the process of checking that the behavior of a system, described using a formal model, satisfies a given property, also described using a formal model. The two models may or may not be the same, but must share a common semantic interpretation. Generally this process is classified as:
1. Specification verification, checking whether a high-level model satisfies some abstract property (e.g., whether a protocol modeled as a network of communicating FSMs can ever reach a deadlock condition).
2. Implementation verification, checking whether a relatively low-level model correctly implements a higher-level model or satisfies some implementation-dependent property (e.g., whether a piece of hardware correctly implements a given FSM, or whether a given data-flow network implementation on a given DSP completely processes an input sample before the next one arrives).

While simulation could theoretically fall under this definition (if the property is "the behavior under this stimulus is as expected"), the term formal verification is usually reserved for checking:
- safety properties, stating that no matter what inputs are given, and no matter how non-deterministic choices are resolved inside the system model, the system will not get into a specific "bad" configuration (e.g., deadlock, emission of undesired outputs, ...);
- liveness properties, stating that some "good" configuration will be visited eventually or infinitely often (e.g., the expected response to an input will eventually be emitted, ...).

More complex checks, such as the correct implementation of a specification, can usually be expressed in terms of these basic properties. For example, Dill [20] describes a method to define and check correct implementation in an automata-theoretic framework.

Even though formal verification is emerging as a validation tool, its application to embedded reactive real-time systems is still in its infancy.

One of the reasons why formal verification is needed in this field is that these systems are often safety-critical. At the same time, formal verification is more feasible for embedded systems than for general-purpose algorithms, because their function is often specified in a much more restricted way (e.g., without recursion, "wild" pointers, and so on). In this section we only summarize the major approaches that have been or can be applied to embedded system verification.

A first basic distinction must be made between:
1. Theorem-proving-based methods, which provide an environment that assists the designer in carrying out a formal proof of specification or implementation correctness. The assistance can be either in the form of checking the correctness of the proof, or in performing some steps of the proof automatically ([24], [7]). The main problems of this approach are the undecidability of higher-order logics, whose power must be restricted in order to partially or totally automate the proof, and the size of the search space even for decidable logics.
2. Finite-automata-based methods, which restrict the power of the model in order to automate the proofs. The automata used to model the system can be:
(a) Classical automata on finite or infinite strings ([54]). In this case, methods for verifying whether the model satisfies a property generally fall into one of the following main categories:
i. Language-containment-based methods, in which both the system and the property are described as a synchronous composition of automata. The proof is carried out by testing whether the language of one is contained in the language of the other ([36]). One particularly simple case occurs when comparing a synchronous FSM with its hardware implementation: then both automata are on finite strings, and the proof of equivalence can be performed by traversing the state space of their product ([13]).
ii. Model-checking-based methods ([5]), in which the system is modeled as a synchronous composition of automata, and the property is described as a formula in some temporal logic ([46]). The proof is carried out by traversing the state space of the automaton and marking the states which satisfy the formula.
(b) Automata with infinite state spaces, for which some minimization to a finite form is possible.

The most notable example of this class are the so-called timed automata ([2]), in which a set of real-valued clocks is used to measure the passage of time. The main properties that make this model decidable are:
- clocks can only be tested and started as part of the input and output labels of the edges of a finite automaton,
- clocks can only be compared against integer values and initialized to integer values.

In this case, it is possible to show that only a finite set of equivalence-class representatives is sufficient to represent exactly the behavior of the timed automaton ([2]). Recently, [44] introduced the notion of suspension (i.e., stopping and re-starting a clock without re-initializing it), which extends the class of systems that can be modeled with variations of timed automata. This extension is significant because it allows, in principle, the verification of timing constraints in a pre-emptive scheduling framework (i.e., one in which a low-priority process can be stopped in the middle of a computation by a high-priority one).

The main obstacles to the widespread application of finite-automata-based methods are the inherent complexity of the problem, and the difficulty for designers, who are generally used to simulation-based models, of formally modeling the system or its properties. The synchronous product of automata, which is the basis of all known automata-based methods, is inherently sensitive to the number of states in the component automata, since the size of the total state space is the product of the sizes of the component state spaces. Among the techniques that have been developed to tackle this problem, the most promising one seems to be abstraction, which is the replacement of some system components with simpler versions exhibiting non-deterministic behavior. Non-determinism in this case is used to reduce the size of the state space, without losing the ability to verify the desired property. The basic idea is to build provably conservative approximations of the exact behavior of the system model, such that the complexity of the verification is lower, but no false positive results are possible. That is, the verification system may say that the approximate model does not satisfy the property while the original one did, thus requiring a better approximation, but it will never say that the approximate model satisfies the property while the original one did not ([11]).

Another interesting family of techniques, which can be useful for heterogeneous systems with multiple concurrent agents, is based on the notion of partial ordering between computations in an execution of a process network. Direct use of available concurrency information can reduce the number of states explicitly explored during verification.

For example, to analyze safety properties of a Petri net model, it is enough to store for each transition only one representative (a minimal one according to a partial order based on history) of the set of states in which the transition is enabled ([42]).

Model checking and language containment have been especially useful in verifying the correctness of protocols, which are particularly well suited to the finite automaton model due to their relative data independence. The verification problem becomes much more difficult when either the values of data and the operations performed on them, or the timing properties of the system, must be taken into account. The first problem can be tackled by separating concerns between:
- verification of the desired property, assuming equality of arithmetic functions with the same name used at different levels of modeling (e.g., specification and implementation, [6]),
- verification that a given piece of hardware correctly implements a given arithmetic function ([3]).

The second problem still needs to be formulated in a way that allows practical problems to be solved in a reasonable amount of space and time. One possibility in this direction could be, rather than building the full-blown timed automaton right from the beginning, to incrementally add timing constraints to an initially untimed model. This addition should be done iteratively, to gradually eliminate all "false" violations of the desired properties caused by ignoring some timing properties of the model. The iteration can be shown to converge, but the speed of convergence still depends heavily on the ingenuity of the designer in providing "hints" to the verification system about the next timing information to consider ([10]).
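For reference, in a branching-time temporal logic such as CTL, the safety and liveness properties introduced at the beginning of this section are commonly written as follows (a standard textbook formulation, not quoted from the cited works):

    \mathrm{safety:}\quad \mathbf{AG}\,\lnot\mathit{bad} \qquad\qquad
    \mathrm{liveness:}\quad \mathbf{AF}\,\mathit{good}
    \quad\text{or}\quad \mathbf{AG}\,\mathbf{AF}\,\mathit{good}

Here AG p states that p holds in every state along every execution path, and AF p states that every path eventually reaches a state satisfying p; AG AF p therefore expresses that the "good" configuration is visited infinitely often.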

4. Mapping from specification to architecture

The problem of architecture selection is one of the key aspects of hardware/software co-design. Supporting the designer in choosing the right mix of components and implementation technologies is essential to the success of the final product, and hence of the methodology that was used to design it. Generally speaking, the mapping problem takes as input a functional specification and produces as output:
1. an architecture composed of:
- hardware black boxes: micro-processors, micro-controllers, memories, I/O devices, ASICs, FPGAs, ...
- software black boxes: operating system, device drivers, procedures, concurrent programs, ...
- interconnection media and mechanisms: abstract channels, busses, shared memories, ...
The implementation of these black boxes is determined during software and hardware synthesis and optimization.
2. a mapping of functional units to architectural units.

The cost function optimized by the mapping process is a mixture (in which the importance of the individual components depends heavily on the type of application) of:
- time cost (often more critical for software), measured either as execution time for an algorithm, or as missed deadlines for a real-time system,
- area (often more critical for hardware), measured as chip, board, and/or memory size,
- communication cost, measured as time and/or area overhead.

Note that each component of the cost function may assume two aspects, again depending on the type of application:
1. A constraint, which must absolutely be met in order to satisfy the specification. For example, the maximum response time of an embedded controller may be dictated by the time constants of the controlled system.
2. A cost, which should be minimized in order to increase the revenue from the final product. An example is the response time of a FAX machine to a user command, which should be minimized while satisfying other, strict constraints (say, the international standards it must obey).

Current synthesis-based co-design methods almost invariably impose some restrictions on the target architecture, in order to make the mapping problem manageable:
1. The architecture of the final system is limited to a library of pre-defined choices. In particular, the structure of the components surrounding the micro-controller is fixed, and relatively few papers have been published on automating the design of, say, a memory hierarchy or an I/O subsystem based on standard components. Notable exceptions to this rule are papers dealing with re-targetable compilation (e.g., [55]), or with a very abstract formulation of partitioning for co-design (e.g., [32], [47], [57]). The structure of the application-specific hardware components, on the other hand, is generally much less constrained.

2. The communication mechanisms are often standardized for a given co-design methodology. Few choices (often closely tied to the communication mechanism used at the specification level) are offered to the designer. Again, one exception is papers dealing with the design of interfaces (e.g., [17]).

The next sections review the main steps used to transform the specification into a mixed hardware/software implementation.

4.1. PARTITIONING

Partitioning methods can be classified according to four main characteristics:
1. The representation model. Generally it is some sort of labelled graph obtained by abstracting the specification. Some authors also consider set-based and other mathematical models.
2. The level of granularity, defining the partitioning units. Most authors use either basic statements of the specification language, or design units (such as processes, tasks or loops).
3. The cost function. It can be based on a classification of the types of operations, on profiling, or on static estimates of cost and performance.
4. The algorithm. It can be based on:
- more or less greedy heuristics,
- clustering methods,
- iterative improvement,
- mathematical programming.

4.1.1. Overview of partitioning methods

Ernst et al. in [21, 28] use a graph-based model, with nodes corresponding to elementary operations (statements in the C x specification they use). The cost is derived:
- by profiling, aimed at discovering the bottlenecks that can be eliminated from the initial, all-software partition by moving some operations to hardware,
- by estimating the closeness between operations:
  - control closeness, based on the distance (number of control nodes) between activations of the same operation in the control flow graph (e.g., operations that occur immediately after one another are close),
  - data closeness, based on the number of common variables among operations (e.g., the initialization, increment and comparison of a loop index variable are close),
  - operator closeness, based on similarities between operators (e.g., an add and a subtract are close),
- by estimating the communication overhead incurred when blocks are moved across partitions. This is approximated by the (static) number of data items exchanged among partitions, assuming a simple memory-mapped communication mechanism between hardware and software.

Partitioning is done in two loops. The inner loop uses simulated annealing, with a quick function estimating the cost gain obtained by moving an operation between hardware and software, to improve an initial partition. The outer loop uses synthesis to refine the estimates used in the inner loop.

Olokutun et al. in [45] perform performance-driven partitioning on a block-by-block basis. The specification model is a hardware description language, rather than C. This allows them to use synthesis for hardware cost estimation, and profiling of a compiled-code simulator for software cost estimation. Partitioning is done together with scheduling, since the overall goal is to minimize response time, in the context of using emulation to speed up simulation:
1. List scheduling is performed first.
2. An initial partition is obtained by classifying blocks according to whether or not they are synthesizable and whether the communication overhead justifies a hardware implementation (this step determines some blocks which must go into either software or hardware).
3. Uncommitted blocks are assigned to hardware or software, starting from the block which has the most to gain from a specific choice.
4. The initial partition is then improved by a Kernighan-and-Lin-like iterative swapping procedure.

Kumar et al. in [32], on the other hand, consider partitioning in a very general and abstract form. They use a complex, set-based representation of the system, of its various implementation choices and of the various costs associated with them. Cost attributes are determined mainly by profiling. The system being designed is represented by four sets:
1. a set of available software functions,
2. a set of hardware resources,
3. a set of communications between the (software and/or hardware) units,
4. a set of functions to be implemented, each of which can be assigned a set of software functions, hardware resources and communications.

Such an assignment means that the given software runs on the given hardware and uses the given communications to implement the function. The partitioning process is followed by a decomposition of each function into virtual instruction sets, followed by the design of an implementation for the set using the available resources, and by an evaluation phase.

D'Ambrosio et al. in [27] tackle the problem of choosing a set of processors on which a set of cooperating tasks can be executed while meeting real-time constraints. They also use a mathematical formulation, but provide an optimal solution procedure by using branch-and-bound. The cost of a software partition is estimated as a lower and an upper bound on processor utilization. The upper bound is obtained by rate-monotonic analysis ([37]), while the lower bound is obtained by various refinements of the sum of task computation times divided by task periods. The branch-and-bound procedure uses these bounds to prune the search space while looking for optimal assignments of functions to components that satisfy the timing constraints. Other optimization criteria can be included besides schedulability, such as response times of tasks with soft deadlines (in real-time systems terminology, a task has a hard deadline if missing it means a serious system malfunction, and a soft deadline if missing it means only a system performance degradation), hardware costs, expandability (which favors software solutions), and so on.

Another mathematical-programming-based method, for partitioning software tasks on a multi-processor system, is presented by Vahid et al. in [57]. It starts from an acyclic data-flow graph, in which subtasks that must be mapped onto the same processor are grouped together. The cost function considers processor and communication costs, under timing constraints. The system model includes causality constraints among sub-tasks, and computation and communication time estimates. The solution is optimal within the chosen model, because the method uses a Mixed Integer-Linear Program solver.

Barros et al. in [8] use a graph-based fine-grained representation, with each unit corresponding to a simple statement in the UNITY specification language. They cluster units according to a variety of (sometimes vague) criteria:
- concurrency (control and data independence),
- sequencing (control or data dependence),
- mutual exclusion,
- vectorization of a sequence of related assignments.
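As a point of reference for the rate-monotonic analysis used as an upper bound by [27] above (and by the RMA-based scheduling approach discussed later in Section 4.2), the classical Liu and Layland sufficient schedulability condition for n independent periodic tasks with computation times C_i and periods T_i under rate-monotonic (static) priorities is

    U \;=\; \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n}-1\right)

The bound decreases towards ln 2 (about 0.69) as n grows; task sets whose utilization exceeds it may still be schedulable, but require a more exact analysis, such as the refinements mentioned above.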

The clustering procedure of Barros et al. minimizes the cost of cuts in the clustering tree, and the result is then improved by considering:
- pipelining opportunities,
- allocations done at the previous stage,
- cost savings due to resource sharing.

Kalavade et al. in [34] use an acyclic dependency graph to simultaneously map each node (task) to software or hardware and schedule the execution of the tasks. The approach, unlike [19] which used a global scheduling algorithm, is heuristic, and hence can give an approximate solution to very large problem instances. It uses two measures assigned to each node to guide the search process:
1. "Criticality" is defined as the probability that a yet un-mapped node will have to be assigned to hardware to meet the timing constraints. At each iteration, the critical path (the one with the tightest timing budget) of the causality graph is computed, and the most critical node is mapped.
2. "Repeller value" is a measure of the unsuitability of a node for a particular implementation, e.g., bit manipulations for software or random memory accesses for hardware. It is used to bias the mapping towards a smaller solution which still satisfies the timing constraints.

Steinhausen et al. in [49, 59] describe a complete co-synthesis environment, in which a CDFG representation is derived from an array of specification formats, such as Verilog, VHDL and C. The CDFG is partitioned by hand, based on the results of profiling, and then mapped onto an architecture that can include:
- general-purpose micro-processors,
- Application-Specific Instruction Processors (ASIPs, software-programmable components designed ad hoc for an application), and
- Application-Specific Integrated Circuits (ASICs).
An interesting aspect of this approach is that the architecture itself is not fixed, because synthesis is driven by a user-defined structural description. ASIC synthesis is done by using a commercial tool, while software synthesis, both for general-purpose and specialized processors, is done by using a re-targetable compiler ([26]).

Finally, Chou et al. in [17] describe a specialized, scheduling-based algorithm for interface partitioning. The algorithm is based on a graph model derived from a formalized timing diagram. Nodes represent low-level events in the interface specification. Edges represent constraints, and can either be derived from causality links in the specification, or be added during the partitioning process (for example, to represent events that occur on the same wire, and hence should move together).

The cost function is time for software and area for hardware. The algorithm is based on a min-cut procedure applied to the graph, in order to reduce "congestion". Congestion is defined as software being required to produce events more rapidly than the target processor can do so (which implies the need for some hardware assistance).

4.2. SYNTHESIS

After partitioning (and sometimes before partitioning, in order to provide cost estimates) the hardware and software components of the embedded system must be implemented. The inputs to the problem are a specification, a set of resources and possibly a mapping onto an architecture. The objective is to realize the specification with the minimum cost. Generally speaking, the constraints and optimization criteria for this step are the same as those used during partitioning. Area and code size must be balanced against performance, which often dominates due to the real-time characteristics of many embedded systems.

Cost considerations generally suggest the use of software running on off-the-shelf processors, whenever possible. This choice, among other things, allows one to separate the software synthesis process from the hardware synthesis process, relying on some form of pre-designed or customized interfacing mechanism. One exception to this rule are authors who propose the simultaneous design of a computer architecture and of the program that must run on it (e.g., [40], [41], [59], ...). In this summary we only consider those who synthesize an Application-Specific Instruction Processor (ASIP) and the micro-code that runs on it, because the problems tackled during the design of a general-purpose CPU are very different from those facing the embedded system designer. In the former case, a designer must worry about backward compatibility, compiler support, and optimal performance for a wide variety of applications, while in the latter case a designer must worry about future addition of new functionality, user interaction, and satisfaction of a specific set of timing constraints.

Note that by using an ASIP rather than a standard Application-Specific Integrated Circuit (ASIC), which generally has very limited programming capabilities, the embedded system designer can couple some of the advantages of hardware and software; for example, performance and flexibility can be improved simultaneously. Another method to achieve the same goal is to use re-programmable hardware, such as Field Programmable Gate Arrays (FPGAs).

FPGAs can be re-programmed either off-line, as is commonly done to replace embedded software by changing a ROM, or on-line, to speed up the currently executing algorithm.

The hardware synthesis task for ASICs used in embedded systems (whether they are implemented on FPGAs or not) is generally performed according to classical high-level and logic synthesis methods. We will not elaborate further on this problem, referring the reader to, e.g., [43].

The software synthesis task for embedded systems, on the other hand, is a relatively new problem. Traditionally, software synthesis has been regarded with suspicion, mainly due to the excessive claims made in earlier times. In fact, the problem is much more constrained in the case of embedded systems than in the case of general-purpose computing. For example, embedded software can use neither unconstrained dynamic memory allocation nor virtual memory. This is due to physical constraints (the absence of a swapping device), to real-time constraints, and to the need to partition the specification between software and hardware. For some highly critical applications even the use of a stack may be forbidden, and everything must be handled by polling and static variables. Algorithms also tend to be simpler, with a clear division into cooperating tasks, each solving one specific problem, from digital filtering to control and decision-making. In particular, the problem of translating cooperating Finite State Machines into a software implementation has been successfully solved in a number of ways.

Software synthesis methods that have been proposed in the literature can be classified along the following general lines:
1. the specification formalism, which may be more or less similar to a programming language,
2. the specification-level and implementation-level interfacing mechanisms,
3. the scheduling method.

Some sort of scheduling is required by almost all software synthesis methods (except, as described below, by [4], which resolves all concurrency at compilation time) to sequence the execution of a set of originally concurrent tasks. Concurrent tasks are an excellent specification mechanism, but cannot be implemented as such on a standard CPU. The scheduling problem (see, e.g., [30] for a review) amounts to finding a linear execution order for the elementary operations composing the tasks, so that all the timing constraints are satisfied. Depending on how and when this linearization is performed, scheduling algorithms can be classified as:
1. pre-run-time, if a fixed execution order is chosen at compilation time,
2. run-time, if the order is chosen at execution time, by a scheduler process controlling the tasks.

Run-time schedulers are further classified as:
(a) off-line, if the scheduler relies only on pre-computed information to determine the order,
(b) on-line, if the scheduler uses run-time information to determine the order.
Moreover, a run-time scheduler may be:
(a) pre-emptive, if a running task may be interrupted during its execution,
(b) non-pre-emptive, otherwise.

Pre-run-time, off-line, non-pre-emptive policies all tend to:
- increase the reliability and predictability of software performance,
- reduce scheduling flexibility,
- reduce scheduling overhead.
The last two factors mean that a highly flexible policy may or may not increase processor utilization, because more flexibility comes at the price of more complexity. Run-time scheduling is often performed by assigning a priority to each task, and then executing at each instant the currently enabled task with the highest priority. Off-line and on-line priority-based schedulers are also called static and dynamic, respectively.

We first describe software analysis and synthesis methods which output a program in a high-level or assembly language, and then summarize some re-targetable compilation approaches.

4.2.1. Overview of co-synthesis methods

Berry et al. in [4] take a rather unique approach to software synthesis. They start from Esterel and derive a single FSM from a collection of initially concurrent modules. All the communication within and among the modules is resolved immediately by the synchronous hypothesis (stating that any reaction to external inputs must take zero time). This requires a fairly complex causality analysis of each source program, to determine exactly which set of reactions is caused by each possible set of input events in any possible source program state. The generated software is basically an emulation of a hardware circuit derived syntactically from the Esterel program, replacing causality with wires, and then heavily optimized. One major advantage of this approach is that, by completely avoiding a scheduler (every time an event is sensed, the complete reaction to all the simultaneously present events is computed), it allows a very precise determination of performance.

Most other approaches to software synthesis for embedded systems divide the computation into cooperating tasks and perform some scheduling of those tasks.

This scheduling can be done either using classical algorithms (see, e.g., [37], [30]), or by developing new techniques based on better knowledge of the domain (embedded systems with fairly restricted specification paradigms, instead of fully general algorithms written in an arbitrary high-level language).

The former approach is advocated, for example, by Cochran in [16], who uses Rate Monotonic Analysis (RMA, [37]) to perform schedulability analysis. In the RMA model, tasks can be pre-empted, have deadlines equal to their invocation period, and system overhead (context switching, interrupt response time, and so on) is negligible. The basic result in [37] states that under these hypotheses, if a given set of tasks can be scheduled by a static priority algorithm, then it can be scheduled by assigning off-line priorities to tasks sorted by the inverse of the invocation period (with the highest priority given to the task with the shortest period). The basic RMA model must be augmented in order to be practically applicable. For example, Cochran uses other results from the real-time scheduling literature to describe a practical scheduling environment which can handle:
- process synchronization requirements,
- high-priority I/O handling by interrupt service routines,
- context switching overhead,
- deadlines different from the task invocation period,
- mode changes, which may cause a change in the number and/or deadlines of tasks,
- multi-processors. Multi-processor support consists of analyzing the schedulability of a given assignment of tasks to processors, providing the designer with feedback about potential bottlenecks and sources of deadlocks.

The latter approach is used by Chou et al. in [17] to find a valid schedule of processes specified in Verilog, under given timing constraints. This approach, like that of Gupta et al. described below, and unlike classical task-based scheduling methods, can take into account both fine-grained and coarse-grained timing constraints. The specification style in this case uses Verilog constructs that provide structured concurrency with watchdog-style pre-emption, as in Esterel. In this style, multiple computation branches are started in parallel, and some of them (the watchdogs) can "kill" others upon occurrence of a given condition. The model of time, though, is different from Esterel, because operations consume time. Hence a set of "safe recovery points" is defined for each branch, and pre-emption is allowed only at those points. Timing constraints are specified by using modes, which represent different "states" of the computation (e.g., initialization, normal operation and error recovery). Constraints on the minimum and maximum time separation between events (even of the same type, to describe occurrence rates) can be defined either within a mode or among events in different modes. Scheduling is performed within each mode, by finding a cyclic order of operations which preserves I/O rates and timing constraints. Each mode is transformed into an acyclic partial order by unrolling, and possibly by splitting if it contains parallel loops with harmonically unrelated repetition counts. Then the partial order is linearized by using a longest-path algorithm to check feasibility and assign start times to the operations.

The same group in [12] describes a technique for device driver synthesis. It is targeted towards micro-controllers with specialized I/O ports, and takes as input a specification of the system to be implemented, as described above, plus a description of the function and structure of each I/O port as:
- a list of bits and directions,
- a list of communication instructions,
- a list of specialized functions, such as the implicit latching offered by some ports, parallel/serial and serial/parallel conversion capabilities, and so on.
The algorithm assigns communications in the specification to physical entities in the micro-controller. It first tries to use special functions, then assigns I/O ports, and finally resorts to the more expensive memory-mapped I/O for overflow communications. It takes into account resource conflicts (e.g., among different bits of the same port), and allocates hardware components to support memory-mapped I/O. The output of the algorithm is a netlist of hardware components, initialization routines, and I/O driver routines that can be called by the software generation procedure whenever a communication between software and hardware must take place.

Gupta et al. started their work on software synthesis and scheduling by analyzing, in [23], various implementation techniques for embedded software. Their specification model is a set of threads, extracted from a Hardware-C program. Threads are concurrent loop-free routines, which invoke each other as a basic synchronization mechanism. Statements within a thread are scheduled statically, at compile time, while threads are scheduled dynamically, at run time. By using a concurrent variant of C as the specification language, the translation problem becomes easier, and the authors can concentrate on the scheduling problem, i.e., on simulating the concurrency of threads. The authors compare the inherent advantages and disadvantages of two main techniques for implementing threads: coroutines and a single case statement (in which each branch implements a thread).

Gupta et al. started their work on software synthesis and scheduling by analyzing, in [23], various implementation techniques for embedded software. Their specification model is a set of threads, extracted from a Hardware-C program. Threads are concurrent loop-free routines, which invoke each other as a basic synchronization mechanism. Statements within a thread are scheduled statically, at compile-time, while threads are scheduled dynamically, at run-time. Because a concurrent variant of C is used as the specification language, the translation problem becomes easier, and the authors can concentrate on the scheduling problem, that is, on simulating the concurrency of threads. The authors compare the inherent advantages and disadvantages of the two main techniques to implement threads: coroutines and a single case statement (in which each branch implements a thread). The coroutine-based approach is more flexible (coroutines can be nested, e.g., to respond to urgent interrupts), but more expensive (due to the need to switch context) than the case-based approach.
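The following C fragment is a minimal sketch of the "single case statement" scheme, not the code generated in [23]: each branch implements one loop-free thread, and a run-time scheduler repeatedly picks the next ready thread. The ready-queue stand-in and the thread bodies are invented for illustration.

```c
/* A minimal sketch of the case-based thread implementation (illustrative only):
 * one switch branch per thread, driven by a simple scheduler stub. */
#include <stdio.h>

enum thread_id { T_SAMPLE, T_FILTER, T_OUTPUT, T_NONE };

static enum thread_id next_ready(void) {
    /* stand-in for the run-time scheduler: run each thread once, in order */
    static enum thread_id order[] = { T_SAMPLE, T_FILTER, T_OUTPUT, T_NONE };
    static int i = 0;
    return order[i++];
}

int main(void) {
    int sample = 0, filtered = 0;
    for (;;) {
        switch (next_ready()) {          /* one branch per thread */
        case T_SAMPLE:
            sample = 42;                 /* e.g., read an input port */
            break;
        case T_FILTER:
            filtered = sample / 2;       /* loop-free computation */
            break;
        case T_OUTPUT:
            printf("output = %d\n", filtered);
            break;
        case T_NONE:
            return 0;                    /* nothing left to run */
        }
    }
}
```

A coroutine-based implementation would instead keep a separate stack (or saved context) per thread, which is what makes nesting and pre-emption easier at the cost of context-switch overhead.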

In [25] the same group uses the thread model to develop a scheduling method for reactive real-time systems. The input specification is a CDFG derived from a set of Hardware-C processes (which play a role similar to the modes used by Chou et al.). The cost model takes into account:
- the processor type (Instruction Set Architecture),
- the memory model (storage alignment, ...),
- the instruction execution time, derived bottom-up from the CDFG by assigning a processor- and memory-dependent cost to each leaf operation in the CDFG.

Some operations have an unbounded execution time, because they are either data-dependent loops or synchronization (I/O) operations. Timing constraints are basically data rate constraints on externally visible input/output operations. Bounded-time operations within a process are linearized within a thread by a heuristic method (the problem is known to be NP-complete). The linearization procedure selects the next operation to be executed among those whose predecessors have all been scheduled, according to:
- whether or not their immediate selection for scheduling can cause some timing constraint to be missed,
- a measure of "urgency" that performs some limited timing constraint lookahead.

Unbounded-time operations, on the other hand, are implemented by a call to the runtime scheduler, which may cause a context switch in favor of another, more urgent thread.

Chiodo et al. in [15] also propose a software synthesis method from a CFSM specification. The method takes advantage of optimization techniques from the hardware synthesis domain, and is somewhat similar to that of Berry et al. mentioned above. The main difference is that by allowing multiple communicating CFSMs, rather than a single FSM, Chiodo et al. can handle systems with widely varying data rates and response time requirements. The method of Berry et al. requires testing of all inputs whenever one of them is detected, while by partitioning the specification, Chiodo et al. can separate it into blocks with different priority levels, and schedule them according to classical algorithms. Their software synthesis technique is based on a simplified acyclic CDFG, representing the transition and output functions of the CFSM (a sketch of the kind of code this yields follows the list below). The nodes of the CDFG can only be of two types:

- TEST nodes, which evaluate an expression and branch according to its result,

- ASSIGN nodes, which evaluate an expression and assign its result to a variable.
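The C fragment below shows the general shape of code one could obtain from such a CDFG: TEST nodes become conditional branches and ASSIGN nodes become assignments. The CFSM, its events, and its state encoding are invented for illustration; this is not output of the tool described in [15].

```c
/* Shape of code obtainable from a TEST/ASSIGN CDFG for one CFSM transition
 * function (invented example): TEST nodes become branches, ASSIGN nodes
 * become assignments. */
static int state;            /* current CFSM state          */
static int ev_start, x, y;   /* input event flag and values */

void cfsm_transition(void)
{
    if (state == 0) {                 /* TEST: current state        */
        if (ev_start) {               /* TEST: input event present? */
            y = x + 1;                /* ASSIGN: output computation */
            state = 1;                /* ASSIGN: next state         */
        }
    } else {
        y = 0;                        /* ASSIGN */
        state = 0;                    /* ASSIGN */
    }
}
```

Because the graph is acyclic and the node types are so simple, counting the nodes of each type along an executed path gives a direct estimate of code size and execution time, which is exactly what the cost model described next exploits.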

The authors develop a mapping from classical Binary Decision Diagram ([9]) and logic circuit ([43]) representations of the transition and output functions to the CDFG form, and can thus use a wide body of logic optimization techniques to minimize memory occupation and/or execution time. The simple CDFG form also allows an easy and relatively accurate prediction of software cost and performance, based on cost assignment to each CDFG node. The cost (code and data memory occupation) and performance (clock cycles) of each node type can be evaluated with a good degree of accuracy by using a handful of system-specific parameters (e.g., the cost of a variable assignment, of an addition, of a branch). These parameters can be derived by compiling and running a few carefully designed benchmarks on the target processor, or on a cycle-accurate emulator or simulator. The accuracy unfortunately decreases drastically whenever the effects of a memory hierarchy (which is generally considered harmful for real-time systems) come into play.
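A sketch of this style of estimation is given below: each node type carries a per-processor cycle and byte cost, and an estimate is obtained by summing over the nodes of a path (or of the whole graph). The parameter values and node counts are invented, not measured figures from [15].

```c
/* Illustrative cost estimation in the spirit described above (all numbers
 * are invented): per-node-type parameters summed over a CDFG path. */
#include <stdio.h>

enum node_type { N_TEST, N_ASSIGN, N_TYPES };

/* hypothetical per-node parameters for some target processor,
 * obtained by benchmarking: {cycles, code bytes} */
static const int cost[N_TYPES][2] = {
    [N_TEST]   = {3, 4},
    [N_ASSIGN] = {2, 6},
};

int main(void) {
    /* node types along one execution path of the CDFG */
    enum node_type path[] = { N_TEST, N_TEST, N_ASSIGN, N_ASSIGN };
    int cycles = 0, bytes = 0;
    for (unsigned i = 0; i < sizeof path / sizeof path[0]; i++) {
        cycles += cost[path[i]][0];
        bytes  += cost[path[i]][1];
    }
    printf("estimated cycles = %d, code bytes = %d\n", cycles, bytes);
    return 0;
}
```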

Antoniazzi et al. in [1] describe a synthesis environment in which an EFSM-based specification is interactively modified to optimize the software or hardware implementation. Operations include local restructuring of EFSM state transition conditions and mapping of timeout constraints onto counter-based timers. The software synthesis task is then performed on each EFSM individually, using a virtual instruction set as an intermediate representation on which operations such as bit packing into a single word are performed. The virtual instruction set is used for performance estimation, and is translated into processor-specific assembly code. Hardware synthesis is done by commercial tools from VHDL code generated from the internal EFSM description.

Ben Ismail et al. and Voss et al. in [31, 58] solve the communication synthesis problem by using channels. Abstract channels used by the SOLAR internal representation are allocated to physical channels by a channel binding operation. The designer is offered a choice of implementation mechanism (serial, parallel, bus-based, ...) for each channel, and interactively selects one based on the performance requirements. The system then automatically generates software and/or hardware for the implementation, depending on the style chosen for the channel and on the implementation chosen for the communicating processes.

Liem et al. in [38] tackle a very different problem, that of re-targetable compilation for a generic processor architecture. They focus their optimization techniques on highly asymmetric processors, such as commercial Digital Signal Processors (in which, for example, one register may only be used for multiplication, another only for memory addressing, and so on). Their register assignment scheme is based on the notion of classes of registers, describing which type of operation can use which register. This information is used during CDFG covering with processor instructions to minimize the number of moves required to save registers into temporary locations. A similar CDFG covering approach is also used by Marwedel in [41], and by Menez et al. in [40]. The latter work uses an exhaustive scheduling algorithm that can take into account pipelining (unlike [38]) and communication costs.
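As a concrete rendering of the register-class idea underlying these covering approaches (an assumption-laden sketch, not the data structures of [38], [41], or [40]), the following fragment records which registers each operation class may use, so that a covering step can tell when a given operand placement would force an extra move:

```c
/* Register classes for an asymmetric (DSP-like) target.  The register set,
 * classes, and operations are invented for illustration. */
#include <stdio.h>

enum reg { R_ACC, R_MUL, R_ADDR, N_REGS };
enum op  { OP_ADD, OP_MUL, OP_LOAD, N_OPS };

/* usable[o][r] == 1 iff operation o may take register r as an operand */
static const int usable[N_OPS][N_REGS] = {
    [OP_ADD]  = { [R_ACC] = 1 },                 /* additions only through the accumulator */
    [OP_MUL]  = { [R_ACC] = 1, [R_MUL] = 1 },    /* multiplier needs the dedicated register */
    [OP_LOAD] = { [R_ADDR] = 1 },                /* memory addressing only via R_ADDR       */
};

/* During CDFG covering, a value living in the "wrong" class forces a move. */
static int needs_move(enum op o, enum reg current_home) {
    return !usable[o][current_home];
}

int main(void) {
    printf("MUL with operand in R_ACC : %s\n",
           needs_move(OP_MUL, R_ACC) ? "move required" : "ok");
    printf("ADD with operand in R_ADDR: %s\n",
           needs_move(OP_ADD, R_ADDR) ? "move required" : "ok");
    return 0;
}
```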

5. Conclusions

In this chapter we outlined some important aspects of the hardware/software co-design process, such as formal representation models, co-simulation, formal verification, hardware and software synthesis and optimization, scheduling, and partitioning. For each topic, we provided a brief description of the key contributions, encapsulated in a general reference scheme. While it was impossible to include all the contributions to the field, we attempted to be as comprehensive as possible, given the page limits.

References

1. S. Antoniazzi, A. Balboni, W. Fornaciari, and D. Sciuto. A methodology for control-dominated systems codesign. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
2. R. Alur and D. Dill. Automata for modeling real-time systems. In Automata, Languages and Programming: 17th Annual Colloquium, volume 443 of Lecture Notes in Computer Science, pages 322-335, Warwick University, July 16-20, 1990.
3. R.E. Bryant and Y-A. Chen. Verification of arithmetic circuits with Binary Moment Diagrams. In Proceedings of the Design Automation Conference, pages 535-541, 1995.
4. G. Berry, P. Couronne, and G. Gonthier. The synchronous approach to reactive and real-time systems. IEEE Proceedings, 79, September 1991.
5. J. Burch, E. Clarke, K. McMillan, and D. Dill. Sequential circuit verification using symbolic model checking. In Proceedings of the Design Automation Conference, pages 46-51, 1990.
6. J.R. Burch and D.L. Dill. Automatic verification of pipelined microprocessor control. In Proceedings of the Sixth Workshop on Computer-Aided Verification, pages 68-80, 1994.
7. R.S. Boyer, M. Kaufmann, and J.S. Moore. The Boyer-Moore theorem prover and its interactive enhancement. Computers & Mathematics with Applications, pages 27-62, January 1995.

8. E. Barros, W. Rosenstiel, and X. Xiong. Hardware/software partitioning with UNITY. In Proceedings of the International Workshop on Hardware-Software Codesign, October 1993.
9. R. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691, August 1986.
10. F. Balarin and A. Sangiovanni-Vincentelli. A verification strategy for timing-constrained systems. In Proceedings of the Fourth Workshop on Computer-Aided Verification, pages 148-163, 1992.
11. J.R. Burch. Automatic Symbolic Verification of Real-Time Concurrent Systems. PhD thesis, Carnegie Mellon University, August 1992.
12. P. Chou and G. Borriello. Software scheduling in the co-synthesis of reactive real-time systems. In Proceedings of the Design Automation Conference, June 1994.
13. O. Coudert, C. Berthet, and J.C. Madre. Verification of sequential machines using Boolean functional vectors. In IMEC-IFIP Int'l Workshop on Applied Formal Methods for Correct VLSI Design, pages 111-128, November 1989.
14. M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-Vincentelli. A formal methodology for hardware/software codesign of embedded systems. IEEE Micro, August 1994.
15. M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-Vincentelli. Synthesis of software programs from CFSM specifications. In Proceedings of the Design Automation Conference, June 1995.
16. M. Cochran. Using the rate monotonic analysis to analyze the schedulability of ADARTS real-time software designs. In Proceedings of the International Workshop on Hardware-Software Codesign, September 1992.
17. P. Chou, E.A. Walkup, and G. Borriello. Scheduling for reactive real-time systems. IEEE Micro, 14(4):37-47, August 1994.
18. D. Drusinski and D. Har'el. Using statecharts for hardware description and synthesis. IEEE Transactions on Computer-Aided Design, 8(7), July 1989.
19. J.G. D'Ambrosio and X.B. Hu. Configuration-level hardware/software partitioning for real-time embedded systems. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
20. D.L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits. The MIT Press, Cambridge, Mass., 1988. An ACM Distinguished Dissertation, 1988.
21. R. Ernst and J. Henkel. Hardware-software codesign of embedded controllers based on hardware extraction. In Proceedings of the International Workshop on Hardware-Software Codesign, September 1992.
22. R.K. Gupta, C.N. Coelho Jr., and G. De Micheli. Synthesis and simulation of digital systems containing interacting hardware and software components. In Proceedings of the Design Automation Conference, June 1992.
23. R.K. Gupta, C.N. Coelho Jr., and G. De Micheli. Program implementation schemes for hardware-software systems. IEEE Computer, pages 48-55, January 1994.
24. M.J.C. Gordon and T.F. Melham, editors. Introduction to HOL: a theorem proving environment for higher order logic. Cambridge University Press, 1992.
25. R.K. Gupta and G. De Micheli. Constrained software generation for hardware-software systems. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
26. J. Hoogerbrugge and H. Corporaal. Transport-triggering vs. operation-triggering. In 5th International Conference on Compiler Construction, April 1994.

27. X. Hu, J.G. D'Ambrosio, B.T. Murray, and D-L. Tang. Codesign of architectures for powertrain modules. IEEE Micro, 14(4):48-58, August 1994.
28. J. Henkel, R. Ernst, U. Holtmann, and T. Benner. Adaptation of partitioning and high-level synthesis in hardware/software co-synthesis. In Proceedings of the International Conference on Computer-Aided Design, November 1994.
29. C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, pages 666-677, August 1978.
30. W.A. Halang and A.D. Stoyenko. Constructing predictable real time systems. Kluwer Academic Publishers, 1991.
31. T.B. Ismail, M. Abid, and A.A. Jerraya. COSMOS: a codesign approach for communicating systems. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
32. S. Kumar, J.H. Aylor, B. Johnson, and W. Wulf. Exploring hardware/software abstractions and alternatives for codesign. In Proceedings of the International Workshop on Hardware-Software Codesign, October 1993.
33. A. Kalavade and E.A. Lee. Hardware/software co-design using Ptolemy - a case study. In Proceedings of the International Workshop on Hardware-Software Codesign, September 1992.
34. A. Kalavade and E.A. Lee. A global criticality/local phase driven algorithm for the constrained hardware/software partitioning problem. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
35. D. Ku and G. De Micheli. High level synthesis of ASICs under timing and synchronization constraints. Kluwer Academic Publishers, 1992.
36. R.P. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, 1994.
37. C. Liu and J.W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1):44-61, January 1973.
38. C. Liem, T. May, and P. Paulin. Register assignment through resource classification for ASIP microcode generation. In Proceedings of the International Conference on Computer-Aided Design, November 1994.
39. S. Lee and J.M. Rabaey. A hardware-software co-simulation environment. In Proceedings of the International Workshop on Hardware-Software Codesign, October 1993.
40. G. Menez, M. Auguin, F. Boeri, and C. Carriere. A partitioning algorithm for system-level synthesis. In Proceedings of the International Conference on Computer-Aided Design, November 1992.
41. P. Marwedel. Tree-based mapping of algorithms to predefined structures. In Proceedings of the International Conference on Computer-Aided Design, November 1993.
42. K. McMillan. Symbolic model checking. Kluwer Academic, 1993.
43. G. De Micheli. Synthesis and optimization of digital circuits. McGraw-Hill, 1994.
44. J. McManis and P. Varaiya. Suspension automata: a decidable class of hybrid automata. In Proceedings of the Sixth Workshop on Computer-Aided Verification, pages 105-117, 1994.
45. K. Olokutun, R. Helaihel, J. Levitt, and R. Ramirez. A software-hardware cosynthesis approach to digital system simulation. IEEE Micro, 14(4):48-58, August 1994.
46. A. Pnueli. The temporal logics of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science. IEEE Press, May 1977.
47. S. Prakash and A. Parker. Synthesis of application-specific multi-processor architectures. In Proceedings of the Design Automation Conference, June 1991.

48. J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference, pages 439-440, 1994.
49. U. Steinhausen, R. Camposano, H. Gunther, P. Ploger, M. Theissinger, et al. System-synthesis using hardware/software codesign. In Proceedings of the International Workshop on Hardware-Software Codesign, October 1993.
50. S. Sutarwala and P. Paulin. Flexible modeling environment for embedded systems design. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
51. R. Saracco, J.R.W. Smith, and R. Reed. Telecommunications systems engineering using SDL. North-Holland - Elsevier, 1989.
52. D.E. Thomas, J.K. Adams, and H. Schmit. A model and methodology for hardware-software codesign. IEEE Design and Test of Computers, 10(3):6-15, September 1993.
53. K. ten Hagen and H. Meyr. Timed and untimed hardware/software cosimulation: application and efficient implementation. In Proceedings of the International Workshop on Hardware-Software Codesign, October 1993.
54. W. Thomas. Automata on infinite objects. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science. Elsevier, 1990.
55. M. Theissinger, P. Stravers, and H. Veit. CASTLE: an interactive environment for hw-sw co-design. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
56. A. Takach and W. Wolf. An automaton model for scheduling constraints in synchronous machines. IEEE Transactions on Computers, 44(1):1-12, January 1995.
57. F. Vahid and D.G. Gajski. Specification partitioning for system design. In Proceedings of the Design Automation Conference, June 1992.
58. M. Voss, T. Ben Ismail, A.A. Jerraya, and K-H. Kapp. Towards a theory for hardware-software codesign. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
59. J. Wilberg, R. Camposano, and W. Rosenstiel. Design flow for hardware/software cosynthesis of a video compression system. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.
60. J. Wilson. Hardware/software selected cycle solution. In Proceedings of the International Workshop on Hardware-Software Codesign, 1994.