Invasive Manycore Architectures

Jörg Henkel∗, Andreas Herkersdorf‡, Lars Bauer∗, Thomas Wild‡, Michael Hübner∗, Ravi Kumar Pujari‡, Artjom Grudnitsky∗, Jan Heisswolf∗, Aurang Zaib‡, Benjamin Vogel∗, Vahid Lari†, Sebastian Kobbe∗

∗Karlsruhe Institute of Technology, †University of Erlangen-Nuremberg, ‡Technical University of Munich (http://invasic.de)

Abstract—This paper introduces a scalable hardware and software platform for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure, while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasion-specific hardware assist functions is shown analytically, and their integration into the overall manycore environment is described.

Index Terms—invasive computing, manycore architectures, multiprocessor system on a chip, reconfigurable adaptive processors, networks on chip, resource-aware programming

I. INTRODUCTION

With ongoing technology progress and as a consequence of the power wall, processors no longer scale towards higher frequencies. Instead, the major trend is the integration of more and more processor cores per chip (Moore's Law of multicore). From a CMOS technology perspective, the integration of several hundred cores will be feasible in the foreseeable future. Different manycore architectures have already been presented, e.g. Tilera [1], Intel's TeraFlop [2] or SCC [3]. The most important problem related to manycore processors, however, is still unsolved: how to efficiently exploit the nominally abundant processing power of the available cores. How applications can efficiently use manycore processors is a widely tackled research topic in industry and academia. This challenge was one of the major motivating factors for proposing a new programming paradigm called invasive computing [4].

In invasive computing, an application may dynamically expand onto parallel cores when the algorithm allows parallel execution. In such a situation, cores are invaded (i.e. resources are reserved) and infected (i.e. resources are used) with the appropriate program binary. When less parallelism is possible, the application retreats again to give room for the expansion of other applications. In other words, an invasive program dynamically makes use of the available processing resources according to its current requirements. If applications running concurrently on an invasive platform behave in a cooperative way (realized by a resource management system), a better exploitation of the computing resources can be expected. From the application programmer's point of view, an invasive program has to be written with respect to the resources currently available to the application.

This research program is supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (TR-SFB 89, http://invasic.de).

Therefore, this paradigm is called resource-aware programming. To ease invasive programming, the programmer is supported in managing and utilizing the available processing, communication, and memory resources by invasive hardware/software middleware layers. As these mechanisms operate in a distributed way, an inherently higher robustness compared to a centralized approach can be expected. See [4] for more details.

To make invasive computing efficient, we propose in this paper a specifically adapted manycore computation platform architecture. It consists of decentralized agent-based middleware, OS, and hardware-assist layers for enabling invasion. These layers relieve the programmer of the burden of specifying invade, infect, and retreat operations at design or compile time, and take care of them dynamically at runtime. The extensions are built on top of a hardware platform based on compute tiles connected by an invasion-enhanced NoC. In the following section we first give an overview of the invasive manycore architecture. Then we describe the specific aspects of the invasive middleware and the proposed hardware extensions for performing computation and communication within our invasive platform.

II. INVASIVE MULTI-/MANYCORE PLATFORM

Several variants of manycore architectures have been designed and have been in use for years. Some of these architectures are homogeneous, comprising uniform standard components, while others are built using special hardware accelerators, forming heterogeneous platforms. The Intel SCC [3] is one example of a homogeneous 48-core architecture: 24 tiles, each comprising two Pentium-class cores, are connected in a mesh network over a standard NoC. It supports the formation of special voltage and frequency domains, comes with a configurable memory access range, and performs inter-tile communication via dedicated hardware message buffers. Applications from the high-performance computing (HPC) domain can make use of such homogeneous architectures and exploit their parallelism. In IBM's Cell Broadband Engine [5], 8 synergistic processing elements (SPEs) and a master Power processor (PPE) are connected over a bus. Similarly, special DSP engines are incorporated in TI's DaVinci [6] architecture, which comes along with an ARM core. These architectures are especially suited for multimedia and embedded applications. Other special multicore platforms are also available, such as network processors tailored for IP packet processing, or graphics processing units more suited for single-instruction, multiple-data (SIMD) applications.

In this paper we introduce a tile-based manycore platform architecture (Fig. 1) that is specifically adapted to support invasive computing. To exploit both loop-level and thread-level parallelism in applications, we envision heterogeneous compute tiles. Compute tiles comprise loosely coupled RISC cores [7] (either off-the-shelf or with dynamically reconfigurable, application-specific instruction set extensions, the iCore [8]), tightly coupled processor arrays (TCPA) [9] especially adapted to data-driven applications, and special-purpose hardware accelerators for dynamic thread assignment (the Core ilet Controller, CiC). The components within a RISC compute tile communicate using standard buses, whereas tile-external communication is performed via an invasive network on chip (iNoC). Additionally, on-chip memory tiles, off-chip memory tiles (accessible via memory controllers), and I/O tiles for interfacing with the peripherals are provided.

Memory Hierarchy: The platform consists of the following hierarchically distributed memory subcomponents (Fig. 1):

• L1 instruction and data caches, with cache coherence among cores within the same tile.
• Scratchpad memory, private to one core; it provides fast access and is not cached.
• Tile-local memory, accessible by all cores within a tile over the bus.
• Shared on-chip memory, constituted by one or more memory tiles accessible over the iNoC.
• External memory, accessible via memory controllers on I/O tiles.
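To make the access-cost implications of this hierarchy concrete, the following C sketch models the five levels with purely illustrative latency figures; the type names and cycle counts are our assumptions, not measured values of the platform:

    #include <stdint.h>

    /* Hypothetical model of the five memory levels of Fig. 1.
     * The latency figures are illustrative assumptions only. */
    typedef enum {
        MEM_L1_CACHE,      /* per-core, coherent within a tile    */
        MEM_SCRATCHPAD,    /* private to one core, not cached     */
        MEM_TILE_LOCAL,    /* shared by all cores of a tile (bus) */
        MEM_ONCHIP_SHARED, /* memory tile, reached via the iNoC   */
        MEM_EXTERNAL       /* off-chip, via I/O-tile controllers  */
    } mem_level_t;

    /* Assumed access latencies in CPU cycles (illustrative). */
    static const uint32_t mem_latency[] = { 1, 2, 20, 120, 300 };

    /* Cost of transferring a block, in cycles, under this model. */
    uint32_t access_cost(mem_level_t lvl, uint32_t words)
    {
        return words * mem_latency[lvl];
    }

The growing per-word cost across the levels is what makes the placement of ilet code and data a first-order concern for the resource manager.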

These different memory subcomponents are physically distributed and are accessed over the bus and the iNoC; they hence represent a non-uniform memory access (NUMA) architecture.

[Fig. 1: Invasive manycore architecture. The figure shows a mesh of NoC routers connecting heterogeneous tiles: RISC compute tiles (CPUs, iCores, a CiC, and tile-local memory), a TCPA tile, memory tiles, and a combined memory/I/O tile; network adapters attach each tile to its router.]

The external memory can either be logically partitioned into different address spaces or used as a single shared memory region, depending on the application's requirements. Processes running on cores within one compute tile communicate using shared memory regions (either tile-local memory or external memory). Inter-process communication between processes running on different tiles is achieved by message-passing services.

III. INVASIVE MIDDLEWARE ARCHITECTURE

The invasive Run-Time Support System (iRTSS) [10], [11] acts as the resource manager and as a hardware abstraction layer. It provides the basic infrastructure to support invasion, memory management, network access via message-passing APIs, I/O access, etc. The main challenge in using manycore architectures is to manage their resources at runtime in a scalable manner. When targeting 1024 cores or even more, mapping applications to cores can become a bottleneck, because on such large systems typically several applications execute at the same time and compete for the available resources.

In the scope of this work, an application is parallelized by executing several so-called ilets. The application programmer or compiler indicates in the source code which algorithm can be parallelized by how many ilets. From the application's perspective, each ilet acts as a remote executable (together with its code and data) that needs to be assigned to a core for execution. To provide the required degree of scalability in manycore architectures, we use a decentralized multi-agent system for the resource management decisions [11].

As interface to the resource management system within the iRTSS, we use the introduced concepts of invade, infect, and retreat. An application that wants to parallelize its execution issues an invade request to the resource management system that contains the following information:

1) Which resources are requested, e.g. general-purpose cores, memories, fine-grained reconfigurable cores (see Section IV-B), tightly coupled processor arrays (see Section IV-C), or communication links (see Section IV-D).
2) How much the application benefits from these resources, e.g. the speedup per core, provided as a so-called trade-off curve.
3) Further constraints, e.g. whether the cores need to be homogeneous or shall be located near to each other.

The resource management system evaluates the request and, if it can be fulfilled to some degree, returns a claim, i.e. a set of resources that are now reserved for the application. The application can then infect these resources (managed by the Core ilet Controller described in Section IV-A) to start executing the ilets. After their computations are completed, the application may retreat from the resources or reuse them directly for parallelizing other algorithms, i.e. infecting the cores with further ilets. A sketch of this pattern is given below.
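The concrete iRTSS interface is not reproduced in this paper; the following C sketch merely illustrates the invade/infect/retreat pattern from the application's side. All type and function names, as well as the stubbed resource-management behavior, are our illustrative assumptions, not the real API:

    #include <stdlib.h>

    /* Hypothetical application-side view of invade/infect/retreat. */
    typedef void (*ilet_fn)(void *);

    typedef struct {
        int   min_cores, max_cores;  /* requested general-purpose cores   */
        const float *speedup_curve;  /* trade-off curve: speedup per core */
        int   require_homogeneous;   /* example of a further constraint   */
    } invade_req_t;

    typedef struct { int ncores; } claim_t;  /* reserved resources */

    /* Stubbed resource-management calls, for illustration only. */
    static claim_t *invade(const invade_req_t *req) {
        claim_t *c = malloc(sizeof *c);
        if (c) c->ncores = req->min_cores; /* pretend the minimum is granted */
        return c;
    }
    static void infect(claim_t *c, ilet_fn f, void *arg) {
        for (int i = 0; i < c->ncores; i++) f(arg); /* run ilets (here: sequentially) */
    }
    static void retreat(claim_t *c) { free(c); }    /* release the resources */

    static void my_ilet(void *arg) { (void)arg; /* parallel work goes here */ }

    void parallel_phase(void *data) {
        invade_req_t req = { .min_cores = 2, .max_cores = 16,
                             .speedup_curve = NULL, .require_homogeneous = 1 };
        claim_t *c = invade(&req);    /* may be granted only to some degree */
        if (c) {
            infect(c, my_ilet, data); /* execute ilets on the claimed cores */
            retreat(c);               /* or reuse the claim for further ilets */
        }
    }

The essential point of the pattern is that the application must be prepared for a claim smaller than requested, which is exactly what the trade-off curve lets the agent system reason about.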

[Fig. 2: three panels ("1st Request", "2nd Request", "Resulting Mapping") showing applications A, B, and C competing for cores in a manycore architecture [11].]

Fig. 2 shows an example of three applications A, B, and C competing for resources. In the first request, application A requests additional cores by issuing an invade request. The agents examine the system state in a distributed manner (by negotiation) and finally decide that some cores are taken away from application B and assigned to application A. Application B is informed about its loss of cores and can either continue executing with the same number of ilets on the reduced number of cores, or adapt its internal parallelization accordingly. This is another example of resource-aware programming. Applications that can efficiently adapt to more or fewer available resources at runtime (so-called malleable applications) benefit from this adaptivity, whereas applications that are not aware of the resources still work correctly but cannot exploit the benefits provided by the system. In the second request shown in Fig. 2, application C requests more cores, and the agents decide to take away some cores from applications A and B.

These agent decisions (determining the number of cores for each application and the binding of ilets to particular cores) are based on estimates of the performance of the applications for a certain number of cores, using the provided trade-off curves and the online monitoring data collected by the Core ilet Controller described in Section IV-A. The aforementioned resource constraints and the provided trade-off curves are also utilized to directly minimize the power consumption. By using the invade/infect/retreat constructs, the application developer can express the computational requirements of the application to the agent system/iRTSS.

Resources that are not needed by an application, i.e. idle resources, can be put into different deep-sleep modes. The transition into and, more importantly, out of a sleep mode requires considerable energy and time: the transition overhead. It is therefore essential to determine beforehand the points at which the application will need additional resources; the transition overhead should be hidden from the application. The transition between the active mode and the (differently deep) sleep modes is handled by a dynamic power management (DPM) policy. State-of-the-art DPM policies predict the resources' idle time with stochastic or history-based approaches in the operating system or the hardware [12]. Conceptually, the operating system as well as the hardware can only observe the symptoms of the application's current behavior; the application's computational requirements cannot be determined in these system layers. Sudden variations in the workload requirements, e.g. an additional invade operation, are likely to lead to costly mispredictions. Several problems can arise due to mispredictions:

• The resources might be activated too early, when the application does not yet need them.
• The resources might be activated too late, so that the application suffers from performance degradation or does not need the resources anymore.
• The resources might be deactivated too late, which reduces the savings potential.

With the power/performance trade-off curves, the agent system can trade off the application's resource requirements against the resulting power consumption. With the invade/infect/retreat constructs, the transition overhead can become another factor in the decision process of the agent system. The DPM policy can now work cooperatively with the application. Thereby, the agent system/iRTSS has, besides being able to manage the resources in a decentralized manner, the possibility to improve the power/performance trade-off. A sketch of such a cooperative mode decision follows.
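As an illustration only (not the actual iRTSS policy), a cooperative DPM decision could weigh the wake-up overhead of each sleep mode against the application's announced time of next use, which the invade/infect/retreat constructs make available as a hint. The mode table and all figures below are assumptions:

    #include <stdint.h>

    /* Illustrative sleep-mode table: deeper modes draw less power
     * but need more time to wake up. All figures are assumptions,
     * not measured values of the invasive platform. */
    typedef struct {
        uint32_t wakeup_cycles;  /* transition overhead to leave the mode */
        uint32_t idle_power_mw;  /* power drawn while resting in the mode */
    } sleep_mode_t;

    static const sleep_mode_t modes[] = {
        {      0, 200 },  /* active idle   */
        {   1000,  50 },  /* shallow sleep */
        { 100000,   5 },  /* deep sleep    */
    };

    /* Cooperative DPM: pick the deepest mode whose wake-up still
     * completes before the application's announced next invade. */
    static int choose_sleep_mode(uint32_t cycles_until_next_use)
    {
        int best = 0;
        for (int i = 1; i < (int)(sizeof modes / sizeof modes[0]); i++)
            if (modes[i].wakeup_cycles < cycles_until_next_use)
                best = i;
        return best;
    }

Because the hint comes from the application rather than from a stochastic predictor, the transition overhead can be hidden instead of mispredicted.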

A. Cost Analysis of Monitor Aggregation and Thread Assignment

Whenever an application wants to invade new resources, the agent system/iRTSS layer probes hardware and software monitors in order to obtain the current availability and operation status of processing, communication, and memory resources. The following overhead estimate assumes that this probing and decision making is done entirely by software mechanisms. In Section IV-A we will then show the benefits of a balanced software/hardware partitioning for these tasks.

Let us assume we have m different monitor values (availability, temperature, reliability, utilization, operating frequency, etc.) of processing, communication, and memory resources to be considered during invasion. This information has to be gathered from n_l local cores (within a compute tile) and n_r remote cores (from neighboring tiles at an average hop distance h_r), consuming σ_bus and σ_NoC processor clock cycles per access, respectively. Once this information is available, the agent/iRTSS evaluates a priority-sorted rule set of cardinality K_rules with a basic structure as shown below:

    IF ((avg_temp(Core[i]) < 50 C) AND (max_load(Core[i]) < 20 %))
        select Core[i]
    ELSE IF ((max_hops(Core[j]) < 5 hops) OR (energy_budget(Core[j]) > 20 Wh))
        ...

The evaluation of each less/greater comparison and each AND/OR conjunction consumes CPI (clocks per instruction) CPU clock cycles, and the rules are executed in a strictly sequential fashion. Thus the total time τ_d required by the agent system/iRTSS for processing an invasion request is given by:

    τ_d = [m · n_l · σ_bus] + [m · n_r · (σ_bus + h_r · σ_NoC)] + [m · K_rules · (n_l + n_r) · CPI]    (1)

The first and second terms of Eq. 1 correspond to the gathering of status information, while the last term represents the evaluation effort. As an example, assume m = 8 monitors (such as temperature and load), each reporting minimum, maximum, and average values, and just K_rules = 3 rules to evaluate. Further assume an ideal CPI of 1 cycle, 20 cycles of access time for the tile-internal bus, and 100 cycles for the iNoC. The total time τ_d for selecting a core within a search space of 4 local cores and 16 remote cores in immediately neighboring tiles (one hop over the iNoC) is then about 48480 cycles.
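The arithmetic can be checked with the short C program below. Note that reproducing the quoted 48480 cycles requires counting all three statistics (min/max/avg) per monitor in the two gathering terms while evaluating the rules over the m = 8 monitors; this decomposition is our reading of the example, not stated explicitly in the text:

    #include <stdio.h>

    /* Numeric check of Eq. 1 with the example parameters. */
    int main(void)
    {
        int m = 8;             /* monitors (temperature, load, ...) */
        int stats = 3;         /* min, max, avg per monitor         */
        int n_l = 4, n_r = 16; /* local / remote cores              */
        int h_r = 1;           /* average hop distance              */
        int s_bus = 20, s_noc = 100; /* bus / iNoC access cycles    */
        int k = 3, cpi = 1;    /* rules, clocks per instruction     */

        long gather = (long)m * stats * n_l * s_bus
                    + (long)m * stats * n_r * (s_bus + h_r * s_noc);
        long eval   = (long)m * k * (n_l + n_r) * cpi;

        printf("tau_d = %ld cycles\n", gather + eval); /* prints 48480 */
        return 0;
    }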

Fig. 3 shows the overhead of the thread assignment decision τ_d of Eq. 1 relative to the ilet's runtime τ_ilet. Even for our simple search, the overhead is above 30% for an ilet with 100 Kcycles of runtime. Applications also have to take into account that an invasion may not necessarily succeed, in which case the search space has to be expanded further over the iNoC. The agent system/iRTSS additionally has to consider the overhead for scheduling, the memory overhead for loading the ilet into the target tile's memory, the context switching overhead, and the delay until the ilet starts its execution. Owing to these overheads, invasion would become prohibitively slow. These overheads can be reduced, or at least hidden from the application, by offloading some latency-critical functionality to hardware. Section IV explains in detail many of these special hardware enhancements, which are provisioned as part of the architectural exploration for invasive computing.



IV. INVASIVE HARDWARE EXTENSIONS

A. Core ilet Controller

To minimize the delay of ilet assignment (Section III-A) and to hide the overhead latencies (Fig. 3), we perform a hardware/software functional partitioning. We propose to implement i) global, coarse-level ilet mapping decisions in software, for flexibility and scalability reasons, and ii) finer, localized monitor aggregation and thread assignment decisions in hardware. A dedicated hardware unit, the Core ilet Controller (CiC), shown in Fig. 4, is provisioned to take care of the finer, localized invasion and infection decisions. Using the CiC, the functional partitioning of thread assignment is done as follows:

• During invasion, the agent system/iRTSS identifies a target tile (n_rtile) by evaluating only a subset of the rules (annotated as K_rules↓). The target tile is identified using only abstracted status information (m_abstracted