Guest Editors’ Introduction

ENERGY-EFFICIENT DESIGN

Kunio Uchiyama, Hitachi
Pradip Bose, IBM

Energy efficiency has been a key design constraint for microprocessor development teams since the late 1990s. The fundamental technological issues that have led to this point are by now well understood by industry and academia. Although both the active (or dynamic) and passive (or standby) components of the net power equation are of concern, in recent years the latter (leakage) aspect of chip power has been escalating at a much faster rate than active power. In fact, as we write, leakage power has almost equaled active power in the total power breakdown of a typical microprocessor. This means that a 100 W chip in today's technology will burn about 50 W with just power on and no program running! Moreover, the highest-performance, multicore, server-class processors are already close to 200 W in maximum power consumption. This equates to power densities at the very edge of what air-cooled systems can handle. So, without investing in liquid-cooled systems and their corresponding packaging (at significantly higher cost), high-end microprocessors targeted for traditional air-cooled server boxes are pretty much at the end of the road, barring a major paradigm shift in design and/or packaging technology. In fact, such a basic paradigm shift has been in the works for a few years, beginning, for example, with the introduction of IBM's dual-core Power4 chip in 2000. The industry in general has recently made a clear shift toward lower-frequency, multicore architectures for general-purpose high-performance microprocessors. Intel, AMD, and Sun Microsystems have all announced future product roadmaps that embrace the multicore paradigm. Although such a shift has enabled design groups to keep going for a little while, the need for building power efficiency into the chip's noncore components continues unabated. We have also recently witnessed a trend toward finer
levels of clock gating in all designs, and the increasing use of power-gated modes to reduce leakage. Academic research in low-power design techniques has evolved from lower-level issues related to the underlying device and circuit technologies to higher-level knobs available in the microarchitecture, architecture, and even the application and software layers. In addition to the established International Symposium on Low Power Electronics and Design (ISLPED), other smaller conferences and workshops have emerged to highlight the latest research, especially at the architecture and design levels. That's why IEEE Micro, in putting together a special issue on this important theme, decided to focus on two recently held (and relatively new) conferences: Cool Chips VIII, held April 2005 in Yokohama, Japan, and the Fourth Annual Austin Conference on Energy Efficient Design (ACEED), held March 2005 in Austin, Texas. (Both conferences consist of presentations by speakers, without formal proceedings of written papers.) As our readers might know, Cool Chips covers talks on exciting new processor products or test chips that have "low power" as a primary constraint; ACEED, on the other hand, deals more with general research topics in this field. As technical program chairs of these two conferences, we teamed up to organize this theme issue, inviting an initial set of articles from selected speakers and then screening those further into a set of seven final choices through a due review process, per IEEE Micro guidelines. We initially received a total of 17 submissions (10 from Cool Chips and seven from ACEED). Each article received at least two independent reviews; many received more. After the review deadline, we guest editors made initial recommendations. One of us (Pradip Bose, who is also editor in chief of IEEE Micro)

Published by the IEEE Computer Society

0272-1732/05/$20.00 © 2005 IEEE

then made the final decisions about inclusion in the theme issue. Three of the six selected articles turned out to be related to the Cell processor developed by Sony, Toshiba, and IBM. As Peter Hofstee of IBM describes in the "First-Generation Cell Processor" sidebar, the Cell chip was designed as a heterogeneous, multicore system-on-chip, but one using custom CMOS silicon-on-insulator technology, targeted to achieve leadership frequency and performance at affordable power for the game market. The challenge of delivering such high performance at a power level that permits using the chip in a game console or set-top system is understandably steep. In their article, Takahashi et al. describe the power-aware design principles behind each of the synergistic processor element (SPE) cores within the full chip. Maeda et al. deal with some of the challenging issues in the programming model, paying particular attention to power conservation in a real-time computing scenario. In the article by Asano et al., we find a treatment of the low-level design issues involved in achieving a high-performance SRAM design (again, for the SPE cores within Cell) at low power. The other three selections in this issue are research articles, reflecting some of the leading-edge academic work in power-aware microarchitecture design. In their article on duration prediction, Isci et al. address the problem of predicting the length (or time duration) of each distinct phase of an executing workload. In a setting where dynamic voltage/frequency scaling (DVFS) helps manage power efficiently, the ability to accurately predict a phase's duration enables precise deployment of the underlying voltage-change mechanism at low overhead. In the article on formal control techniques, Wu et al.
touch on a topic of significant current interest within the field of energy-efficient design: on-chip adaptive control techniques and algorithms, an area that has received considerable coverage in recent academic research as a basic mechanism for managing power in the presence of changing workload demands. The authors address the formal control-theoretic aspects of such mechanisms, pointing the community toward an era of mathematically provable, robust control algorithms. Finally, the article by Marculescu et al. takes up a topic of increasing importance to the chip design community: uncertainty in design caused by the increased variability and failure rates of component devices and building blocks. How this emerging constraint interplays with the now well-known constraint of power dissipation limits is the interesting subject of their article.
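The amortization logic behind duration-driven DVFS is easy to sketch: a voltage/frequency transition only pays off when the predicted phase outlasts the switching overhead by a comfortable margin. The toy last-value predictor and all constants below are our own illustrative assumptions, not the scheme from Isci et al.'s article:

```python
# Hedged sketch: deciding whether a predicted phase duration justifies a
# DVFS transition. The predictor (last observed duration per phase ID) and
# all constants are illustrative assumptions, not the article's scheme.

SWITCH_OVERHEAD_US = 20.0   # assumed cost of one voltage/frequency transition
MIN_AMORTIZE_RATIO = 10.0   # only switch if the phase dwarfs the overhead

class LastValuePredictor:
    """Predict a phase's duration as the last duration seen for that phase."""
    def __init__(self):
        self.history = {}

    def predict(self, phase_id, default_us=0.0):
        return self.history.get(phase_id, default_us)

    def update(self, phase_id, observed_us):
        self.history[phase_id] = observed_us

def should_scale(predictor, phase_id):
    """Scale voltage/frequency only if the predicted duration amortizes
    the transition overhead."""
    predicted = predictor.predict(phase_id)
    return predicted >= MIN_AMORTIZE_RATIO * SWITCH_OVERHEAD_US

p = LastValuePredictor()
p.update("memory_bound", 5000.0)   # a long, memory-bound phase: worth scaling
p.update("compute_burst", 50.0)    # a short burst: not worth the overhead
print(should_scale(p, "memory_bound"))   # True
print(should_scale(p, "compute_burst"))  # False
```

A real predictor would, of course, use richer phase history than a single last value; the point here is only the overhead-amortization test that makes accurate duration prediction valuable.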

We hope you enjoy this theme issue on energy-efficient design. We would like to thank all the anonymous reviewers who helped us select this excellent collection of articles from two very relevant and interesting conferences. We are also grateful to all the authors who took the time and effort to submit written versions of their original talks to this special issue.

Kunio Uchiyama is a senior chief researcher in Hitachi's Central Research Laboratory, where he is responsible for the research and development of microprocessors and SoCs. His interests include microprocessors, SoC architecture, and design automation tools. Uchiyama has a PhD in advanced applied electronics from the Tokyo Institute of Technology. He was awarded the Japanese national Medal of Honor with Purple Ribbon, and is a senior member of the IEEE.

Pradip Bose is a research staff member and manager at the IBM T.J. Watson Research Center; he is also editor in chief of IEEE Micro. His research interests include high-performance computer architectures, power- and reliability-aware design, and computer-aided design. Bose has a PhD in electrical and computer engineering from the University of Illinois, Urbana-Champaign. He is a senior member of the IEEE and the IEEE Computer Society, and a member of the ACM.

Direct questions and comments about this special issue to Pradip Bose, pbose@us.ibm.com. For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

SEPTEMBER–OCTOBER 2005


First-Generation Cell Processor
H. Peter Hofstee, IBM

The Cell Broadband Engine processor is the product of a collaboration among IBM, Sony, and Toshiba. The three companies were looking at the future market for entertainment applications that would rely on real-time multimedia processing and broadband Internet communication. Sony set a goal for future applications to eventually run at 1,000 times the performance of Sony's PlayStation 2 Emotion Engine processor, developed with Toshiba.1,2 As a first step, the first-generation Cell's objective was to achieve 100 times PlayStation 2's performance. In summer 2000, participants at a critical meeting determined that a conventional organization, and even a homogeneous chip multicore processor, would not deliver sufficient computational power.3 Besides increased performance, Cell had to consume much less power per operation to realize the performance improvements while meeting power constraints. Because Cell is targeted at devices and systems in a multi-standard, interconnected world, flexibility and programmability are key aspects; therefore, it was not possible to improve efficiency simply by specializing and limiting the application domain. It is these opposing needs—high performance, high efficiency, and good flexibility and programmability—that made designing Cell an interesting challenge.

Choosing an architectural organization

In choosing the basic organization for the Cell processor, the design team considered the several hurdles that make improvements in single-thread performance increasingly difficult to come by:

• the inability of DRAM latency to keep up with processor speeds (the memory wall);
• the fact that in almost all systems thermal output has reached its practical limit, while transistor scaling limits challenge voltage (and hence power) reduction (the power wall); and
• the decreased performance gains from higher frequencies and longer pipelines (the frequency wall).

As a result of these factors, single-thread performance growth has become so challenged that the industry is switching en masse to multicore arrangements, which improve performance per watt on applications that can execute in parallel. Multiple cores also permit more memory requests in flight per chip, but they do not address the inefficiencies that arise within each core as a result of the memory and frequency walls. Heterogeneous multicore designs permit specialization and further increases in efficiency.
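The power-wall arithmetic behind that en-masse switch can be made concrete with the textbook switching-power relation P ≈ αCV²f: because supply voltage can drop along with frequency, two slower cores can beat one fast core in performance per watt on parallel work. The constants below are made-up illustration values, not Cell figures:

```python
# Illustrative sketch of the power-wall arithmetic that motivates multicore.
# Dynamic switching power follows P ~ alpha * C * V^2 * f; all constants
# here are assumed for illustration, not measured from any real chip.

def dynamic_power(alpha, cap, volts, freq_ghz):
    """Classic CMOS switching-power model: activity * capacitance * V^2 * f."""
    return alpha * cap * volts ** 2 * freq_ghz

# One core pushed to 4 GHz at 1.2 V.
p_single = dynamic_power(alpha=1.0, cap=1.0, volts=1.2, freq_ghz=4.0)
perf_single = 4.0                      # throughput ~ frequency (assumed)

# Two cores at 2.8 GHz, which (we assume) permits dropping to 1.02 V.
p_dual = 2 * dynamic_power(alpha=1.0, cap=1.0, volts=1.02, freq_ghz=2.8)
perf_dual = 2 * 2.8                    # perfectly parallel workload (assumed)

print(round(perf_single / p_single, 3))  # 0.694 perf/watt, single fast core
print(round(perf_dual / p_dual, 3))      # 0.961 perf/watt, two slower cores
```

The two-core configuration delivers roughly 40 percent more work per watt in this toy model, which is the essence of the argument; it does nothing, however, for the per-core inefficiencies the text goes on to discuss.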

Basic Cell Broadband Engine Architecture

Because a shared, coherent system address space was deemed essential for programmability, and because efficient SMP architectures take a long time to develop, the initial architectural team decided to build the Cell
Broadband Engine Architecture (http://www.ibm.com/developerworks/ power/cell) on a system organization inherited from the Power Architecture (http://www.ibm.com/developerworks/eserver/library/es-archguidev2.html). The Power Processor Element (PPE) implements the 64-bit Power instruction set architecture and provides the operating system and control functions. Accompanying the PPE are processors optimized to run applications. We chose to call these Synergistic Processor Elements (SPEs) because we designed them to have a mutual dependence on the PPE, working in harmony to perform tasks more efficiently than either type of processing element alone. Other key aspects of Cell include enhanced real-time controls to allow real-time and non-real-time operating systems and applications to run concurrently; and a hardware architecture to support privacy, security, and digital-rights-management applications.

PPE

The PPE is a 64-bit Power processor optimized for performance per watt and performance per area while matching the frequency of the SPEs. Designers implemented the PPE as a dual-threaded core, and it includes the floating-point and vector media extensions of the Power architecture. The processor contains 32-Kbyte instruction and data caches, a 512-Kbyte L2 cache, and on-chip bus interface logic. A new, ground-up implementation, this core uses an extended pipeline to achieve a low fan-out-of-4 (FO4) delay per stage, matching the SPEs, and it is an enhanced in-order design. To support real-time operations, the PPE was extended with resource management tables for the L2 and translation caches.

SPE

With the central PPE handling the operating system and other control-intensive tasks, we could design the SPEs for efficient general-purpose application processing. The SPE instruction set combines all data types in a single 128-entry, 128-bit register file; special-purpose registers (link, count, and condition registers) are unified with this register file as well. With 128 registers, the SPE tolerates deep pipelines better than conventional processors, and a rational microarchitecture further helps in attacking the frequency wall. For applications that fit in the local store, however, the processor is, from a programmer's perspective, not fundamentally different from other scalar processors with single-instruction, multiple-data (SIMD) extensions. The most fundamental aspect of the SPE is that it manages two levels of store (registers and local-store memory) in software. Because direct memory access (DMA) transfers between system memory and local-store memory are asynchronous, a fundamental break with sequential semantics, SPEs can have many main-memory accesses in flight without resorting to speculation. DMA transactions use standard Power effective addressing to refer to system memory; these addresses are translated into a virtual and finally a real address using the standard Power architecture page and segment table caches. Like Power loads and stores, DMAs
are coherent in the system. The DMA unit can also fetch DMA commands from local-store memory, thus acting as a separate data-moving processor. Although programmers can use conventional programming models and languages with appropriate compiler support, we anticipate that the asynchronous streaming-DMA aspect of the SPEs will also lead to the introduction of new programming paradigms. The SPE ISA includes a software branch-hint instruction, allowing implementations of the architecture with minimal hardware branch prediction support. The implementation of the SPE on Cell realizes a processor with leading performance on compute-intensive applications in just 10 mm² (15 mm² including the DMA unit). We optimized the current implementation of the SPEs for integer and single-precision floating-point computation. With a dual-threaded PPE and eight SPEs, the first-generation Cell Broadband Engine processor is capable of 10 simultaneously executing threads (18 including the DMA units) and over 100 outstanding memory requests. The high degree of software control in the SPE is a double-edged sword. On the one hand, because we have largely removed speculative mechanisms from the implementation, the behavior of the SPE is highly predictable, a boon for real-time programming and for programmers, including compiler writers, who want to optimize their code. At the same time, software management of the local-store memory and of branches presents a challenge to compilers. Although Cell operates in excess of 4 GHz under laboratory conditions, the high-frequency, efficient design of the processor, made possible by the operating system and application specialization of the cores, is intended to allow the most efficient operation at minimum operating voltage. The Cell Broadband Engine processor is 235 mm² in 90-nm SOI (silicon-on-insulator) technology. At its minimum operating voltage, it dissipates power comparable to that of high-end PC processors and operates in excess of 3 GHz, but in many cases it delivers an order of magnitude more performance than conventional processors, and in some cases more than that. Future work will focus primarily on broadening the reach of the architecture by pursuing implementations with reduced power and die size.
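The asynchronous DMA model described above is what enables the classic double-buffering idiom on the SPE: fetch the next block of data into one local-store buffer while computing on the other. The sketch below simulates only the overlap pattern in plain Python; it does not model the Cell SDK's actual MFC intrinsics (mfc_get, tag waits, and so on):

```python
# Hedged sketch: double-buffered streaming, the idiom the SPE's asynchronous
# DMA engine enables. A plain-Python simulation of the overlap pattern, not
# the Cell SDK's MFC intrinsic API.

def stream_process(system_memory, block_size, compute):
    """Process system_memory in blocks using two local-store buffers.
    On real hardware, the fetch of the next block would proceed in the
    background (an async DMA 'get') while the current block is computed."""
    results = []
    buffers = [None, None]          # two local-store buffers
    n_blocks = (len(system_memory) + block_size - 1) // block_size

    # Kick off the first "DMA get" into buffer 0.
    buffers[0] = system_memory[0:block_size]

    for i in range(n_blocks):
        cur = i % 2
        nxt = (i + 1) % 2
        # Start the asynchronous fetch of the next block (overlaps compute).
        if i + 1 < n_blocks:
            start = (i + 1) * block_size
            buffers[nxt] = system_memory[start:start + block_size]
        # A "wait for DMA tag completion" would go here on real hardware;
        # then compute on the current, already-resident buffer.
        results.extend(compute(x) for x in buffers[cur])
    return results

data = list(range(10))
out = stream_process(data, block_size=4, compute=lambda x: x * 2)
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the fetch and the compute target different buffers, the two can overlap fully; this is how an SPE keeps many memory requests in flight without any speculation hardware.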

H. Peter Hofstee is the chief scientist for the Cell project and the chief architect for the Synergistic Processor at the SCEI/Sony Toshiba IBM Design Center, Austin, Texas. Contact him at [email protected].


References

1. K. Kutaragi et al., "A Micro Processor with a 128b CPU, 10 Floating-Point MACs, Floating-Point Dividers, and an MPEG2 Decoder," Int'l Solid-State Circuits Conf. Digest of Technical Papers (ISSCC 99), IEEE Press, 1999, pp. 256-257.
2. A. Kunimatsu et al., "Vector Unit Architecture for Emotion Synthesis," IEEE Micro, vol. 20, no. 2, Mar.-Apr. 2000, pp. 40-47.
3. J.H. Kahle et al., "Introduction to the Cell Multiprocessor," IBM J. Research and Development, vol. 49, no. 4/5, July-Sept. 2005, pp. 589-604.
