IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 38, NO. 3, MARCH 2003


A Reconfigurable System Featuring Dynamically Extensible Embedded Microprocessor, FPGA, and Customizable I/O

Michele Borgatti, Francesco Lertora, Benoit Forêt, and Lorenzo Calí

Abstract—A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded field-programmable gate array (FPGA). Application-specific bus-mapped coprocessors and flexible input/output peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed as well as the design flows for pre- and post-silicon design and customization. The silicon area required by the system is 20 mm² in a 0.18-µm CMOS technology. The embedded FPGA accounts for about 40% of the system area.

Index Terms—Application-specific integrated circuits (ASICs), digital signal processors, field-programmable gate arrays (FPGAs), integrated circuit design, multimedia computing, reconfigurable architectures.

I. INTRODUCTION

Currently, we are witnessing two conflicting trends in the electronic industry. On one hand, the economics of system integration pushes logic suppliers toward ever more complex system-chip devices. On the other hand, increasing design complexity and its associated risks, increase of nonrecurrent engineering expenses, and shorter time-to-market and product life are causing original equipment manufacturers to look for faster turnaround and lower risk solutions for design and technology.

The recent introduction of embedded programmable logic allows application-specific integrated circuit (ASIC) and application-specific standard product (ASSP) vendors to broaden the appeal of their products. Also, hardware programmability can be exploited by system integrators for product customization.

The reconfigurable computing paradigm is a topic of research that has been very active in recent years [7]–[9], [13]. In particular, extending a microprocessor architecture with configurable hardware was shown to deliver large speedups while keeping the ease of the programmer’s model [9], [12].

Manuscript received April 16, 2002; revised October 22, 2002. This work was supported in part under IST-2001-34607. M. Borgatti, F. Lertora, and L. Calí are with STMicroelectronics, 20041 Agrate Brianza, Italy (e-mail: [email protected]). B. Forêt is with STMicroelectronics Central Research and Development, 38921 Crolles, France. Digital Object Identifier 10.1109/JSSC.2002.808288

In this paper, we present a pragmatic approach to introduce flexibility in system-chip design and exploit embedded programmable silicon fabrics to enhance system performances. In particular, enabling application-specific configurations to adapt the underlying hardware architecture to time-varying application demands can improve execution speed and reduce power consumption compared to a general-purpose programmable solution. In the proposed system, the embedded programmable logic allows static or dynamic configuration of the instruction set of an embedded microprocessor, the creation of bus-mapped application-specific hardware coprocessors and accelerators, and the customization of the system input/output (I/O). The latter feature allows the device to potentially connect to any external unit/sensor given that its communication protocol can be mapped to the on-chip programmable logic. Also, some computations can be performed “on the fly” when data is captured. The design approach described in this paper mainly consists of the usage of state-of-the-art off-the-shelf IP and system design methodologies. Emphasis is given to the methodology for building reconfigurable systems using ASIC-derived design methodology. A configurable and extensible processor [1] with associated tools and an embedded field-programmable gate array (e-FPGA) [2] were used. The resulting system has been developed to target image and voice processing and recognition application domains. Specific design flows for system exploration and implementation are also introduced. This paper is organized as follows. Section II describes the integration of embedded programmable logic into ASIC, and in Section III the architecture of the system is detailed and the main system components are described. Particular attention is paid to the embedded FPGA microprocessor interface and some details of the embedded FPGA are given. 
As an application example, an embedded face recognition system will be described in Section IV. In Section V, details on the system-chip integration with the description of the system-to-RTL and RTL-to-layout design flows are given. Details on chip implementation and some experimental results are reported in Section VI.

0018-9200/03$17.00 © 2003 IEEE

II. EMBEDDED PROGRAMMABLE LOGIC IN ASICS

Integration of embedded programmable logic into ASIC has the overall goal of building silicon systems that bring together many advantages, namely, hardware efficiency (hardwired), software programmability, and hardware configurability on a single die. A specific hybrid architecture needs to be targeted to a specific family of applications. Enabling technologies are off-the-shelf configurable processors and embedded FPGA.

On the logic design side, the use of embedded programmable logic in ASICs introduces many design challenges because performance in terms of density, speed, and power consumption is significantly worse than in cell-based design (30–50 times in area, 1–10 times in speed). This means that in order to design efficient programmable platforms, the use of configurable logic must be limited to what really needs to be programmable. This heavily impacts the architecture definition of such systems.

In addition to lower performance, the integration of programmable hardware in systems-on-chip (SoCs) introduces changes in the design flow. The implementation of e-FPGA logic is usually done after silicon fabrication. Different implementations can also be repeated to produce different configurations for the same chip. There are two main consequences. First, the signoff of the whole system must be done for each configuration, since for every configuration the logic implemented in the e-FPGA must operate correctly with the hardwired logic and possible timing violations need to be avoided. This is usually an automated task and can be repeated even for a large number of different configurations. Second, the logic synthesis and optimization must be performed separately twice, for the hardwired logic and for the configuration logic, respectively. In order to handle this suboptimal partition, a careful propagation of constraints needs to be implemented. These issues are discussed in detail in Section V.

Fig. 1. (a) Timing constraints at ASIC/FPGA borders. (b) Silicon compilation. (c) e-FPGA compilation.

In Fig. 1(a), an example of a synchronous logic circuit is represented as a mix of combinatorial logic (CL) and registers. This example circuit is implemented by mixing conventional hardwired logic (cell-based) and e-FPGA. The latter logic is intrinsically slower. The optimization of the standard cell domain happens first and the e-FPGA is considered as a black box [see Fig. 1(b)]. The goal at this point is to “squeeze” the delay of the standard cell domain at the boundary of the e-FPGA as much as possible, to leave the maximum time available for the e-FPGA logic. Then the e-FPGA content is optimized [see Fig. 1(c)], usually after silicon fabrication, and this requires that the constraints capturing the result of the optimization of the hardwired logic are put on the boundary of the e-FPGA. The e-FPGA implementation might fail for two reasons: insufficient logic resources or delay exceeding the constraints. In the former case, splitting the logic design into different contexts can be a solution for certain applications, in addition to logic simplification and restructuring. In the latter case, timing-driven logic optimization and trading logic resources for maximum speed may prevent system speed reduction.

To maximize system performance, clock phases internal to the e-FPGA must be balanced with the clock phases in the hardwired logic so that minimum skew is obtained and minimal clock period imbalance is produced at the e-FPGA borders. The clock tree synthesis methodology for the hardwired logic must consider the maximum and minimum clock insertion delay of the e-FPGA. A key characteristic of an e-FPGA suitable for integration in an ASIC is that clock insertion delay is not dependent on e-FPGA configuration. In this way, maximum clock skew can be guaranteed for any configuration applied after silicon manufacturing.
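The constraint propagation described above amounts to simple arithmetic on each path that crosses the e-FPGA boundary: whatever part of the clock period is not consumed by the hardwired logic (and by clock skew) is the budget left for the logic mapped into the e-FPGA. The Python sketch below is illustrative only. The `BoundaryPin` type, the pin names, and all delay values are hypothetical, and a real signoff relies on static timing analysis rather than on a script like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryPin:
    """One signal crossing the hardwired/e-FPGA boundary (hypothetical model)."""
    name: str
    hardwired_delay_ns: float  # worst-case delay spent in standard-cell logic


def efpga_budget_ns(clock_period_ns: float, pin: BoundaryPin,
                    skew_ns: float = 0.0) -> float:
    """Time left for e-FPGA logic on a register-to-register path through `pin`."""
    return clock_period_ns - pin.hardwired_delay_ns - skew_ns


def find_violations(clock_period_ns: float, pins: list[BoundaryPin],
                    efpga_delays_ns: dict[str, float],
                    skew_ns: float = 0.0) -> list[str]:
    """Return the names of pins whose mapped e-FPGA delay exceeds its budget."""
    violations = []
    for pin in pins:
        budget = efpga_budget_ns(clock_period_ns, pin, skew_ns)
        if efpga_delays_ns[pin.name] > budget:
            violations.append(pin.name)
    return violations


# Illustrative numbers only: a 10-ns clock (100 MHz) and 0.5 ns of skew.
pins = [BoundaryPin("ext_operand", 1.5), BoundaryPin("ext_result", 3.0)]
mapped = {"ext_operand": 7.5, "ext_result": 7.0}
print(find_violations(10.0, pins, mapped, skew_ns=0.5))  # ['ext_result']
```

In this toy case the second path has only 6.5 ns available, so a 7.0-ns e-FPGA implementation fails. That is the situation in which the text suggests timing-driven optimization or trading logic resources for speed.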

BORGATTI et al.: RECONFIGURABLE SYSTEM FEATURING EMBEDDED MICROPROCESSOR, FPGA, AND CUSTOMIZABLE I/O


Fig. 2. System architecture block diagram.

III. SYSTEM ARCHITECTURE

One of the main goals of this work was to build a flexible architecture, working at a reasonably high clock frequency, built around an embedded FPGA and an extensible 32-bit microprocessor. The base processor is a specific customization of that described in [1]. It comes with a complete set of tools for configuration and performance analysis. The main features of the processor core used in our system are a five-stage pipeline, direct-mapped data/instruction caches, a 24- or 16-bit instruction format for improved code density, a 64-bit processor interface (PIF) with burst transfers for cache-page refill, and 13 interrupt lines organized in four priority interrupt levels.

The system architecture is illustrated in Fig. 2. The PIF/AHB bridge translates processor cycles to the AMBA advanced high-performance bus (AHB) [3] with support for fast burst and locked transfers. An external memory interface (EMI) exploits the available peak throughput of the fastest commercial external nonvolatile flash memories. It allows a wide range of burst-mode and page-mode configurations under software control and supports low-voltage low-swing operations. If required, an external RAM port allows the extension of the on-chip 48-kB SRAM.

The heart of the system is an embedded FPGA and its multiple interfaces to main system units. In particular, the functional purposes of the e-FPGA programmable logic are the following.
• Extension of the processor datapath supporting a set of additional special-purpose instructions. This is done by connecting the processor datapath through a wide bus and a specific interface (instruction extension interface in Fig. 2).
• Bus-mapped coprocessor. Hardware units mapped into the e-FPGA can be interfaced to the system bus through an AHB bus master/slave.
• Flexible I/O. The programmable general-purpose I/O pads interface is used to connect external units or sensors with their application-specific communication protocol.

All these possibilities may be mixed in a single configuration for the FPGA; this results in a highly configurable device. To accelerate communication between the configurable hardware and software tasks running on the processor, four interrupt channels can be driven by logic mapped into the e-FPGA. A two-way hardware/software communication can be implemented by the joint usage of these interrupt channels and dedicated AMBA advanced peripheral bus (APB) registers [3]. Download of the FPGA bitstream is performed by a flexible programming interface with support of a specific direct memory access (DMA) connected to the AMBA AHB bus. To allow validation of the FPGA configuration, the bitstream may be read back by hardware support.

Most audio or video applications require storage buffers to interface fast decoding hardware and slower software running on the processor. With this concept in mind, a 1-kB dual-port buffer has been added and organized as 4 × 256-byte rows. One port of this buffer is connected to the AHB bus while the second port is directly accessed by the FPGA dual-port buffer interface. The AMBA APB bus connects all the configuration/general-purpose registers to the system. On the same bus, an I2C master interface has been added to connect external devices or sensors such as LCD displays or CMOS cameras.
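The role of the 1-kB dual-port buffer can be pictured with a small software model in which hardware in the e-FPGA fills a row while software reads bytes through the bus-side port. This is only a sketch: the row-major address mapping, the class name `DualPortBuffer`, and the method names are assumptions made for illustration and are not details given in the paper.

```python
ROWS = 4
ROW_BYTES = 256


class DualPortBuffer:
    """Toy model of a 1-kB buffer organized as 4 rows of 256 bytes.

    One port stands for the e-FPGA side, the other for software access
    over the AHB bus. Row-major byte addressing is an assumption.
    """

    def __init__(self) -> None:
        self._mem = bytearray(ROWS * ROW_BYTES)

    @staticmethod
    def split_address(addr: int) -> tuple[int, int]:
        """Map a linear byte address (0..1023) to a (row, offset) pair."""
        if not 0 <= addr < ROWS * ROW_BYTES:
            raise ValueError("address out of range")
        return divmod(addr, ROW_BYTES)

    def fpga_write_row(self, row: int, data: bytes) -> None:
        """e-FPGA port: fill one whole row, e.g. with decoded samples."""
        if not 0 <= row < ROWS:
            raise ValueError("row out of range")
        if len(data) != ROW_BYTES:
            raise ValueError("a row holds exactly 256 bytes")
        start = row * ROW_BYTES
        self._mem[start:start + ROW_BYTES] = data

    def ahb_read(self, addr: int) -> int:
        """Bus port: read one byte at a linear address."""
        row, offset = self.split_address(addr)
        return self._mem[row * ROW_BYTES + offset]


buf = DualPortBuffer()
buf.fpga_write_row(2, bytes(range(256)))  # hardware side fills row 2
print(buf.ahb_read(2 * ROW_BYTES + 7))    # software side reads it back: 7
```

With four rows, the hardware can keep filling one row while software drains another, which is one plausible way such a buffer decouples fast decoding logic from slower software.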

Fig. 3. Embedded FPGA–microprocessor interface.

A programmable general-purpose I/O module features mono I/O and bidirectional pads under the control of both the e-FPGA and the microprocessor.

A. Microprocessor–FPGA Interface

Different architectures that interface reconfigurable logic with microprocessors can be found in the literature [9]–[12]. Our design uses a single-context embedded FPGA to extend the instruction set of a commercial microprocessor architecture [1] which allows adding user-defined instructions. In the proposed architecture, this capability was mapped exclusively into the e-FPGA, allowing runtime reconfiguration of the instruction set. This implies that the number of user-defined instructions available at a given time is limited by the e-FPGA logic capacity and instruction logic complexity. However, a set of additional instructions can be defined to target specific application needs. If the logic size of the set of additional instructions exceeds the logic capacity of the e-FPGA, it might be split into a number of contexts fitting the size constraints of the e-FPGA. These contexts might be used to dynamically reprogram the FPGA to support application needs.

The flexibility advantage of this architecture implies a speed penalty for the part of logic mapped inside the e-FPGA. In particular, specific processor instructions mapped in the reconfigurable fabric may be one to ten times slower than their equivalent implementation in standard cells. Fig. 3 details the processor–FPGA interface, showing how instruction extensions are mapped inside the FPGA and how synchronization between the microprocessor and the e-FPGA is guaranteed.

As the additional instruction set is part of the processor pipeline [1], slowing down this logic results in a drastic reduction of processor maximum speed, hence affecting processor performance when using the baseline general-purpose instruction set. A mechanism is introduced to allow the processor to be clocked at its maximum speed while executing standard instructions, whereas it is slowed down by a programmable instruction-dependent number of clock cycles (1–16) when executing processor instructions mapped into the FPGA. A clock control system allows the processor to be synchronized with the e-FPGA for the number of cycles during which the instruction is executed. A dedicated module is able to identify instructions whose performance is not aligned with the processor. As each of these instructions needs to be associated with its execution time, the set is partitioned.

The synchronization mechanism for two different opcode types is shown in Fig. 4. When an opcode of an additional instruction of type 1 (2) is issued, the processor clock is stalled for the number of clock cycles associated with that type to wait for valid results; in this case, for 1 (4) clock cycles. A predefined map-table divides into four the whole set of opcodes reserved for user-defined instructions. For each set that belongs to a configuration, a number defines the number of times the clock needs to be stretched to properly synchronize the execution of the pipeline between the e-FPGA and the base processor. Thus, the system allows executing a set of instruction extensions among a panel of four user-defined speed penalties for any e-FPGA configuration. In this way, the processor is tied to the e-FPGA speed only for the strictly required number of cycles. Table I shows how 32 additional opcodes are classified according to their delay. Note that the slowdown factors are binary numbers programmed as constants into the e-FPGA so that they can be changed with the e-FPGA configuration, making this scheme very flexible in maximizing the performance of both hardwired and programmable logic.

Fig. 4. e-FPGA–microprocessor synchronization.

TABLE I MICROPROCESSOR EXTENSIONS CLASS AND RELATED SLOWDOWN FACTORS

B. Block Description of the e-FPGA

The architecture of the e-FPGA [2] is organized as a hierarchical multilevel interconnect network (see Fig. 5). An array of logic elements called the multifunction logic cell (MFC) allows implementation of digital logic. The MFC is a four-input/one-output programmable structure associating a four-input lookup table and a storage element. There are 3K MFCs, shared among 24 clusters. The global interconnect network links the clusters together and to the input and output pad peripheral cells. At a lower level, a local interconnect network links the MFCs together and to the global network. The architecture allows defining up to one clock signal per cluster. The MFC clock is one of three global signals defined to be connected to any input of the cluster. This ensures a low skew between cluster clocks and full I/O assignment flexibility. In particular, insertion delay for clock pins does not depend on e-FPGA configuration, as it accounts for the global network delay. The input (respectively, output) pin set counts 384 independent and fully equivalent inputs (respectively, outputs).

Fig. 5. Block diagram of the e-FPGA.

IV. APPLICATION EXAMPLE: A FACE RECOGNITION SYSTEM

To evaluate our reconfigurable architecture, a face recognition application has been used.
An important issue of this target application is the recognition speed. Recognition time must be in the order of one second to use this algorithm in interactive applications like smart toys or advanced human–machine interfaces.

The face-recognition image-processing flow can be divided into three main stages (see Fig. 6). First, an image is acquired using a color CMOS camera. A dedicated processing is applied to get the filtered and gray-level images out of the grabbed image in the Bayer format. The filtered image is obtained using a filter that excludes all the background colors. This second image is eroded by clearing the borders. Second, a face-location algorithm [5] is performed on the eroded image by using edge detection and a generalized Hough transform. The latter step returns the coordinates of an ellipse marking the border of the face to be recognized. After the face-location stage, feature extraction [6] is performed. Here, the face (gray level) is cut with a given number of random lines using the ellipse as a border. Every line is divided into a fixed number of segments and a single feature is calculated as the average value of the pixels covered by a line segment. Each feature is collected into a vector of average values with a number of items equal to the number of segments. The extracted features are added to a database during a training session or compared with the database in a recognition session.

Fig. 6. Image-processing flow.

To implement the face-recognition system, a video camera interface, a parallel-port interface, and more than 20 new microprocessor opcodes have been designed as “soft RTL” inside the e-FPGA. Fig. 7 shows a simple example of a microprocessor extension and how the hardware inside the e-FPGA is interfaced with the processor pipeline. This new opcode is an optimized fixed-point calculation of a square root.
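The paper does not say how the square-root opcode is computed. As an illustration of the kind of arithmetic that maps naturally onto lookup-table logic, the sketch below uses the classic digit-by-digit (shift-and-subtract) integer square root, which needs only shifts, additions, subtractions, and comparisons. The function names and the choice of 8 fractional bits are assumptions made for this example, not details of the authors' implementation.

```python
def isqrt_bitwise(value: int) -> int:
    """Floor of the square root of a non-negative integer.

    Digit-by-digit method: one result bit is decided per iteration using
    only shifts, adds, subtracts, and compares.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    # Largest power of four that does not exceed value (or 1 for small inputs).
    bit = 1
    while (bit << 2) <= value:
        bit <<= 2
    result = 0
    remainder = value
    while bit != 0:
        if remainder >= result + bit:
            remainder -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


def fixed_sqrt(raw: int, frac_bits: int = 8) -> int:
    """Square root of an unsigned fixed-point number.

    `raw` encodes x as x * 2**frac_bits; the result uses the same format.
    """
    return isqrt_bitwise(raw << frac_bits)


# 2.25 in a format with 8 fractional bits is 576; its square root is 1.5.
print(fixed_sqrt(576) / 256)  # 1.5
```

In hardware, each loop iteration would correspond to one stage of compare-and-subtract logic, which is why this family of algorithms is a common choice for fixed-point square roots.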

Fig. 7. Microprocessor extension example.

The most complex microprocessor extension has been developed to speed up the most demanding task. The face-recognition process requires a distance to be calculated for every line within two vectors of features (1). Each item in the vectors of features is stored in a byte. The following parameters have been chosen for the face-recognition target application:
• 20 classes with three pictures per class;
• 250 lines per picture;
• 24 segments per line.

Five operations (load, load, subtract, multiply, accumulate) are required for each segment in (1). As a consequence, the recognition process requires M operations. To execute these operations without any instruction extensions, our architecture takes about 1 G clock cycles. To cope with this specific computational kernel, we designed a set of microprocessor extensions that speed up the face-recognition algorithm more than 10 times. The overall performance improvements are shown in Table II. While the face-recognition algorithm kernel was rewritten to achieve the best performance, the edge and face detection speedups are mainly due to a set of specialized fixed-point arithmetic implementations, among these a fixed-point square root (Fig. 7) and an efficient fixed-point implementation of the arc tangent function.

It is important to note that switching between algorithm stages requires only one reconfiguration of the e-FPGA. Reconfiguration time is always negligible. In [11], a general discussion can be found on the tradeoff between the time to execute instruction extensions on a reconfigurable hardware platform and the overhead related to hardware reconfiguration. Still, according to [11], all tasks reported in Table II feature a coarse grain size. The speed-up factors take into account the clock-stretching penalty on the processor.

TABLE II FACE RECOGNITION PERFORMANCES AT 100 MHZ

V. DESIGN FLOW AND SYSTEM INTEGRATION

A. System-to-RTL Design Flow

In Fig. 8, the design flow used for system architecture exploration and integration is described. The starting point is an untimed model of the system written in C/C++ code describing the desired functionality; at this stage, the verification is done with simulations in the CoWare N2C environment [4]. This methodology allows designers to validate the system specifications and then, through a progressive refinement of the functional blocks into hardware and software (partitioning process) and the generation of the hardware/software interface (interface synthesis), to verify the system at a cycle-accurate abstraction level. The microprocessor core is abstracted in the coverification with its instruction set simulator (ISS) integrated into the simulation engine. Extensive simulations of the system with the usage of the profiler (memory accesses, CPU load, exceptions) help in finding the computational kernels of the software running on the core (performance analysis). At this point, it is possible to group the most time-consuming segments of code as new instructions of the extensible processor. The impact of such complex instructions can be evaluated using the simulation framework described above. Instruction extensions described using high-level code can be translated to ISS extensions and HDL code for direct hardware implementation [1]. Such extensions of the instruction set can be mapped on the e-FPGA, as can the VHDL code that results from the refinement process done during the partitioning phase. In particular, application-specific coprocessors and peripherals can be refined from C descriptions down to RTL [4].
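As a hand-worked counterpart to the performance-analysis step, the size of the face-recognition kernel from Section IV can be estimated directly from the application parameters. The sketch below rests on two assumptions that are not stated explicitly in the text: that each line of the probe picture is compared with each line of every stored picture, and that the per-line distance is a sum of squared differences (which matches the listed subtract, multiply, and accumulate operations). The function names are hypothetical.

```python
def line_distance(a: bytes, b: bytes) -> int:
    """Sum of squared differences between two feature vectors.

    Assumed form of the distance: per segment it costs two loads, one
    subtract, one multiply, and one accumulate.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kernel_operations(classes: int, pictures_per_class: int,
                      lines_per_picture: int, segments_per_line: int,
                      ops_per_segment: int = 5) -> int:
    """Operation count, assuming all-pairs line comparison per stored picture."""
    stored_pictures = classes * pictures_per_class
    line_pairs = stored_pictures * lines_per_picture * lines_per_picture
    return line_pairs * segments_per_line * ops_per_segment


ops = kernel_operations(classes=20, pictures_per_class=3,
                        lines_per_picture=250, segments_per_line=24)
baseline_cycles = 1_000_000_000  # "about 1 G clock cycles" without extensions
print(ops)                              # 450000000
print(round(baseline_cycles / ops, 1))  # 2.2 cycles per operation
```

Under these assumptions the kernel amounts to 450 M operations, which is the same order of magnitude as the roughly 1 G clock cycles quoted for the processor without extensions. An estimate of this kind indicates which loops are worth fusing into a single extended instruction before any detailed simulation is run.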
The system integration flow produces
• soft hardware (reconfigurable logic) to be mapped on the e-FPGA: HDL RTL code of instruction extensions, bus-mapped coprocessors, and special-purpose I/O peripherals;
• conventional fixed hardware (hardwired logic): microprocessor RTL code, AHB/APB bus, and peripherals;
• embedded software (C code): application software and low-level drivers for the hardware platform.

The C code generated by the flow described above becomes the final application while the RTL of the system with the e-FPGA hard macro goes into the system integration flow.


Fig. 8. System to RTL.

Fig. 9. RTL to layout.

B. RTL-to-Layout Design Flow

In Fig. 9, both the silicon implementation flow and the e-FPGA configuration flow are shown. These flows are run at different times. Once the silicon implementation flow has produced the routed database, it is possible to run the e-FPGA flow, which can be repeated for each different function built as a soft macro. The RTL code of the CPU core, IP blocks, and interface modules (system bus) are synthesized and integrated with RAM blocks and the FPGA hard macro in the floor-planning environment. To meet timing requirements at the boundary of the e-FPGA, special care was taken during the synthesis process for the logic cells that interface the e-FPGA with the rest of the system. A particular set of constraints was specified to reach minimum delay of the hardwired logic. After the place and route stage, the final database was statically and dynamically verified against the RTL simulations in order to verify the design at all levels of abstraction. The timed database used for the verification, built after a parasitic extraction and a delay calculation process, gives the effective delays at the boundary of the e-FPGA hard macro (all e-FPGA I/O pins are characterized with the static timing analyzer in the worst-case condition). This information is exported to the e-FPGA flow as a constraint file and used during synthesis/mapping of the soft hardware by specific e-FPGA tools.

This is done to correctly constrain the logic mapped on the e-FPGA with the real timing budget. Finally, the generation of the bitstream and a timed view of the macro can be used for the final signoff. Static-timing analysis of the e-FPGA results in both a back-annotated netlist and a timing view for full-chip static-timing analysis.

VI. SYSTEM IMPLEMENTATION AND TEST

The full chip has been implemented in a standard CMOS 1.8-V/3.3-V 0.18-µm technology featuring six metal layers. The layout of the system has been integrated using commercial place and route tools for digital ASICs. To avoid external multiple power supplies, an internal dc 3–1.8-V voltage regulator has been integrated. The chip has been completely tested and is fully functional at a clock rate of 175 MHz.

The processor system is able to reconfigure the e-FPGA at full speed. Reconfiguration takes about 500 µs at a clock rate of 100 MHz. From a performance standpoint, reconfiguration overhead has minimal impact on processing speed. The device requires at most one reconfiguration of the e-FPGA to set up its hardware resources and execute each algorithmic stage. Recalling Table II, the impact of reconfiguration time is always negligible except for edge detection (the overall speedup reduces from 1.8× to 1.5×). The overall impact of reconfiguration time on total execution time for the whole algorithm execution is less than 0.2%. During reconfiguration, the average throughput sustained by external memories, EMI, and the programming interface is 50 MB/s.

Device performances and power consumption are summarized in Table III. Technology and device characteristics are summarized in Table IV, and a chip micrograph is shown in Fig. 10 with a floorplan view of system components.

TABLE III DEVICE PERFORMANCES AND POWER CONSUMPTION

TABLE IV TECHNOLOGY AND DEVICE CHARACTERISTICS

Fig. 10. Chip micrograph.

VII. CONCLUSION

In this paper, a novel system architecture based on a reconfigurable microprocessor has been presented and its implementation using embedded FPGA technology has been described. It is the authors’ belief that this is the first silicon example of integration of a reconfigurable processor using off-the-shelf IP blocks. Application-specific bus-mapped coprocessors and flexible I/O peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The silicon area required by the system is 20 mm². The embedded FPGA accounts for about 40% of the system area and, in the considered application, it implements about 100K equivalent ASIC gates split in several contexts by time multiplexing. The area overhead compared to a full ASIC solution with equivalent functionality is about 42%.

The system has been tested using both a face-recognition application and a speech-recognition application. We reported speedups ranging from 2× to 10× using instruction extensions to accelerate face-recognition computing kernels. These speedups include the overhead introduced by clock speed adaptation for the e-FPGA, and reconfiguration time is negligible. Additional 1.5–2× performance improvements are reported on specific I/O-intensive tasks that interface an external CMOS camera and perform some image-processing computations on the fly using the e-FPGA.

Particular emphasis was given to the extension of the current ASIC-oriented design flow to take advantage of embedded reconfigurable logic. In particular, specific design flows for system exploration and implementation were also discussed in detail.

The impact of dynamic hardware configuration on the energy efficiency of the computing system needs to be investigated in detail. This will be an object of future work. In fact, conflicting effects are expected to have opposite impacts on energy efficiency. On one hand, FPGAs have inherently reduced energy efficiency compared to hardwired logic due to lookup-table-based logic and programmable interconnect overheads; also, the energy overhead of reconfiguration needs to be quantified. On the other hand, the increased architectural efficiency of special-purpose instructions and the substantial reduction of the stream of instructions to be decoded in order to complete a task suggest better energy efficiency. Moreover, it is commonly understood that reducing the execution time of a processor-based system is beneficial for reducing overall energy consumption for task execution even when peak power consumption increases. We expect the energy efficiency of this class of systems to lie between conventional ASIP/DSP computing systems and dedicated hardware implementations using configurable logic. Preliminary measurements show the best energy efficiency in the range of several MOPS/mW at 1.8-V supply.

ACKNOWLEDGMENT

The authors would like to thank G. Repetto, C. Gazzina, and L. Fumagalli for their valuable help and support. They also thank O. Lepape, J. Barbier, and F. Reblewsky at M2000, K. Ahluwalia, D. Tilley, and M. Woodward at CoWare, J. Massingham at Tensilica, and B. Campbell. Special thanks go to P. L. Rolandi and A. Kramer for their support and encouragement.


Michele Borgatti was born in Finale Emilia, Italy, in 1968. He received the Dr.Eng. degree (with honors) in electrical engineering and the Ph.D. degree in electrical and computer engineering from the University of Bologna, Bologna, Italy, in 1994 and 1998, respectively. In 1998, he joined STMicroelectronics Central Research and Development, Agrate Brianza, Italy, where he led several innovative projects in the area of low-power signal processing. Since 2001, he has been a Design Manager in the Non-Volatile Memory Design Platform Department, Central Research and Development. His research interests include system and integrated circuit digital design for speech and image processing, multimedia portable applications, reconfigurable computing, and design methodologies for hardware/software codesign.

Francesco Lertora was born in Chiavari, Italy, in 1971. He received the Laurea degree in electronic engineering from the University of Genoa, Genoa, Italy, in 1998. He joined STMicroelectronics Central Research and Development, Agrate Brianza, Italy, in September 2000, where he is currently a System Architect. His main research interests include system-on-chip architectures, digital signal processors, software for embedded systems, and compilers.

REFERENCES

[1] R. E. Gonzalez, “Xtensa: A configurable and extensible processor,” IEEE Micro, vol. 20, pp. 60–70, Mar./Apr. 2000.
[2] M2000. Flexeos Family Technical Manual. [Online]. Available: http://www.m2000.fr
[3] AMBA™ Specification Rev. 2.0, ARM Ltd., 1999.
[4] I. Bolsens, H. De Man, B. Lin, C. Van Rompaey, S. Vercauteren, and D. Verkest, “Hardware/software co-design of digital telecommunication systems,” Proc. IEEE, vol. 85, pp. 391–418, Mar. 1997.
[5] D. Maio and D. Maltoni, “Real-time face location on gray-scale static images,” Pattern Recognit., vol. 33, no. 9, pp. 1525–1539, Sept. 2000.
[6] O. De Vel and S. Aeberhard, “Line-based face recognition under varying pose,” IEEE Trans. Pattern Anal. Machine Intell., vol. 21, pp. 1081–1088, Oct. 1999.
[7] S. Hauck, “The roles of FPGAs in reprogrammable systems,” Proc. IEEE, vol. 86, pp. 615–638, Apr. 1998.
[8] E. Sanchez, M. Sipper, J.-O. Haenni, J.-L. Beuchat, A. Stauffer, and A. Perez-Uribe, “Static and dynamic configurable systems,” IEEE Trans. Comput., vol. 48, pp. 556–564, June 1999.
[9] A. DeHon, “DPGA-coupled microprocessors: Commodity ICs for the early 21st century,” in Proc. IEEE Workshop FPGAs for Custom Computing Machines, Napa Valley, CA, Apr. 1994, pp. 31–39.
[10] M. J. Wirthlin and B. L. Hutchings, “A dynamic instruction set computer,” in Proc. IEEE Symp. FPGAs for Custom Computing Machines, Napa Valley, CA, Apr. 1995, pp. 122–128.
[11] R. D. Wittig and P. Chow, “OneChip: An FPGA processor with reconfigurable logic,” in Proc. IEEE Symp. FPGAs for Custom Computing Machines, Napa Valley, CA, Apr. 1996, pp. 126–135.
[12] J. R. Hauser and J. Wawrzynek, “Garp: A MIPS processor with a reconfigurable coprocessor,” in Proc. IEEE Symp. FPGAs for Custom Computing Machines, Napa Valley, CA, Apr. 1997, pp. 12–21.
[13] “Cover: Configurable computing,” IEEE Comput., vol. 33, pp. 39–77, Apr. 2000.

Benoit Forêt received the M.S. degree in electrical engineering from the University of Montpellier, Montpellier, France, in 2000. In 1999, he joined STMicroelectronics Central Research and Development, Agrate Brianza, Italy, working on innovative system-on-chip designs. Since 2002, he has been with STMicroelectronics Central Research and Development, Crolles, France. He has investigated dynamically extending instruction sets of an application-specific processor. His research interests include hardware/software codesign and field-programmable gate arrays usage in reconfigurable systems.

Lorenzo Calí was born in Reggio Calabria, Italy, in 1971. He received the Dr. Eng. degree in electrical engineering from the University of Bologna, Bologna, Italy, in 1997. In 1999, he joined STMicroelectronics Central Research and Development, Agrate Brianza, Italy, working on several projects in the Innovative Systems Design Group. Since 2001, he has been a Senior Designer within the Non-Volatile Memory Design Platform, Central Research and Development. His research interests cover both system and integrated circuit digital design for speech and image processing, multimedia portable applications, reconfigurable computing, and design methodologies for hardware/software codesign.