Low-Power Adaptive Pipelined MPSoCs for Multimedia: An H.264 Video Encoder Case Study

Haris Javaid†, Muhammad Shafique‡, Sri Parameswaran†, Jörg Henkel‡

†School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
‡Chair for Embedded Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany

{harisj, sridevan}@cse.unsw.edu.au, {muhammad.shafique, henkel}@kit.edu

ABSTRACT

Pipelined MPSoCs provide a high-throughput implementation platform for multimedia applications, with reduced design time and improved flexibility. Typically, a pipelined MPSoC is balanced at design time using worst-case parameters. Under a widely varying workload, such designs consume an exorbitant amount of power. In this paper, we propose a novel adaptive pipelined MPSoC architecture that adapts itself to varying workloads. Our architecture consists of Main Processors and Auxiliary Processors with a distributed run-time balancing approach, where each Main Processor, independently of the other Main Processors, decides for itself the number of Auxiliary Processors it requires at run-time, depending on its varying workload. The proposed run-time balancing approach is based on off-line statistical information along with workload prediction and run-time monitoring of the current and previous workloads' execution times. We exploited the adaptability of our architecture through a case study on an H.264 video encoder supporting HD720p at 30 fps, where clock- and power-gating were used to deactivate idle Auxiliary Processors during low workload periods. The results show that, compared to a design-time balanced pipelined MPSoC, an adaptive pipelined MPSoC provides energy savings of up to 34% and 40% for clock- and power-gating based deactivation of Auxiliary Processors respectively, with a minimum throughput of 29 fps.

Categories and Subject Descriptors
C.1.3 [Other Architectural Styles]: Adaptable architectures, Heterogeneous (hybrid) systems, Pipeline processors; C.4 [Performance of Systems]: Modeling techniques, Design studies

General Terms
Algorithms, Design, Performance

Keywords
Adaptive MPSoCs, Low-Power Design

1. INTRODUCTION

The increasing use of multimedia applications in portable devices, the demanding workload of these applications, and the reduced design turnaround times required for these devices have given rise to specialized (application-specific) Multiprocessor System-on-Chip (MPSoC) platforms. Multimedia applications are typically characterized by several kernels which are executed repeatedly on the incoming data stream, favoring their implementation on pipelined MPSoCs [1, 2].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2011, June 5-10, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0636-2/11/06 ...$10.00.

[Figure 1 plot omitted: number of SADs (0-500) computed per MB for the first 200 MBs, with low workload periods marked.]

Figure 1: No. of SADs Computed for Different MBs of 'pedestrian' Video Sequence

A pipelined MPSoC is a system where processors are connected in a pipeline configuration [1, 2]. It is divided into several stages, where each stage contains one or more processors. These processors are connected via FIFOs and execute different sub-tasks of a multimedia application. Hence, in a pipelined MPSoC, the incoming data, read by the first stage, streams through the subsequent stages before being written out by the last stage. Typically, a high-throughput pipelined MPSoC is achieved by distributing the workload equally across the stages, that is, by balancing the pipelined MPSoC at design time. The state-of-the-art techniques [3, 4] for design-time balancing are based on Application Specific Instruction set Processors (ASIPs) such as the Xtensa, Nios, and ARC 600 and 700 core families [5, 6, 7]. They extend the processors executing computationally intensive sub-tasks with specialized instructions to reduce their execution time, and reduce resources in processors executing less intensive sub-tasks (for example, by shrinking cache sizes). As a result, different processors (ASIPs) are selected for each sub-task such that the execution times of all the stages are close to each other, creating a design-time balanced pipelined MPSoC. Typically, design-time balanced pipelined MPSoCs are designed considering worst-case execution times so that the required throughput is always guaranteed when the system is deployed. A design-time balanced pipelined MPSoC, however, lacks adaptability to run-time variations in workload and thus suffers from inefficient resource utilization and high energy consumption. Let us examine these limitations in more detail through a case study of motion estimation in the H.264 encoder, a fundamental component of advanced multimedia applications.
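The stage-to-stage streaming through FIFO-connected processors described above can be mimicked with a toy software sketch (our construction, not the paper's implementation; real stages run concurrently on separate processors, whereas this sketch drains each stage in turn):

```python
from queue import Queue

def run_pipeline(data, subtasks):
    """Stream `data` through `subtasks` stage by stage, coupled by FIFOs."""
    fifos = [Queue() for _ in range(len(subtasks) + 1)]
    for item in data:                      # the first stage reads the input
        fifos[0].put(item)
    for stage, task in enumerate(subtasks):
        while not fifos[stage].empty():    # each stage consumes its input FIFO
            fifos[stage + 1].put(task(fifos[stage].get()))
    return [fifos[-1].get() for _ in range(len(data))]

# e.g. three sub-tasks: scale, offset, clamp
out = run_pipeline([1, 2, 3], [lambda x: 2 * x, lambda x: x + 1, lambda x: min(x, 6)])
# out == [3, 5, 6]
```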

1.1 Motivational Example

Motion estimation is one of the most computationally intensive sub-tasks of the H.264 encoder. It is performed on each MacroBlock (MB) of the incoming frame, where the Sum of Absolute Differences (SAD) is used to compare the current MB with the reference MBs to find the best possible match. The number of SADs that need to be computed for an MB depends heavily on the motion contained in that particular MB: an MB containing fast-moving objects will require more SADs than an MB containing slow-moving objects. Figure 1 shows the number of SADs that were computed for the first 200 MBs of the second frame (the first frame does not require motion estimation) of the 'pedestrian' video sequence. It is obvious that the workload of the 'motion estimation' sub-task varies significantly at run-time: the number of computed SADs goes as high as 450 and as low as 10, with an average of 154 and a standard deviation of 153 SADs. Consider the motion estimation stage of a design-time balanced pipelined MPSoC, which contains 16 processors in parallel, where each processor can compute 30 SADs within the allocated time budget. In total, the motion estimation stage is thus capable of computing 16 × 30 = 480 SADs, which is enough to sustain the throughput (the worst case is 450 SADs) at all times. However, in such a pipelined MPSoC, low workload periods (marked on the graph in Figure 1) require the computation of fewer than 30 SADs, which can be handled by a single processor. Thus, during low workload periods the remaining 15 processors are idle, resulting in inefficient utilization of resources and increased energy consumption of the pipelined MPSoC. In contrast, in an adaptive pipelined MPSoC, a resource-aware approach would share the idle processors with other stages at run-time, while an energy-aware approach would deactivate the idle processors at run-time to save energy. To summarize, this example shows that design-time balanced pipelined MPSoCs [3, 4] do not provide a resource- or energy-aware implementation platform for advanced multimedia applications such as H.264/AVC [8], AVS [9] and VC-1 [10], which exhibit huge run-time variations in workload due to the adaptive nature of their algorithms. As a result, the applicability of design-time balanced pipelined MPSoCs as a platform for multimedia applications in portable devices, where both area and energy consumption are important design parameters, is limited.
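The numbers above can be reproduced with a small sketch (the helper names are ours; the per-processor capacity of 30 SADs per time budget is taken from this example):

```python
def sad(current, reference):
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(c - r) for c, r in zip(current, reference))

def procs_needed(sads, sads_per_proc=30):
    """Processors required to finish `sads` SAD computations in one time budget."""
    return -(-sads // sads_per_proc)  # ceiling division

# Worst case observed in Figure 1: 450 SADs keeps 15 processors busy,
# while a low-workload MB (10 SADs) keeps only 1 busy and leaves the rest idle.
```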

1.2 Basic Idea and Novel Contributions

We address the limitations of design-time balanced pipelined MPSoCs by introducing adaptability through a novel run-time balancing approach, whereby the pipelined MPSoC adapts itself to run-time varying workloads. To this end, we propose a novel adaptive pipelined MPSoC architecture, where stages with significant run-time variations in workload are implemented using Main Processors (MPs) and Auxiliary Processors (APs). The APs are used by the MPs only when the stage workload exceeds the MPs' capacities, that is, the additional workload of the MPs is offloaded to the APs. The number of APs to be used by an MP is decided by the run-time balancing approach considering the workload variations. To achieve a quick response time (as shown in Section 6.2), the proposed run-time balancing approach combines workload prediction from the application with run-time monitoring of the execution time of previous and current workloads (to predict the number of APs that will be required in the future). Finally, we demonstrate the applicability of our approach through a case study on an H.264 video encoder, where the adaptability of the pipelined MPSoC is exploited to reduce energy consumption by deactivating idle APs using clock- and power-gating techniques. In summary, the contribution of our work is three-fold:

∙ an adaptive pipelined MPSoC for multimedia applications, where MPs are augmented with APs to accommodate run-time varying workloads;

∙ a novel run-time balancing approach to determine the number of APs to be used by MPs, based on workload prediction from the application and monitoring of the previous and current workloads' execution time; and

∙ a case study on an H.264 video encoder supporting HD720p at 30 fps, implemented using a commercial design environment from Tensilica [11].
In this case study, APs were activated and deactivated using clock- and power-gating techniques to reduce the energy consumption of the adaptive pipelined MPSoC. The rest of the paper is organized as follows. Section 2 provides the necessary literature review. Section 3 explains the adaptive pipelined MPSoC's architecture. Section 4 states the problem, and the proposed run-time approach is explained in Section 5. Sections 6.1 and 6.2 present the implementation details and results of the case study on the H.264 video encoder, followed by the conclusion in Section 7.

2. RELATED WORK

Pipelining is a well-known technique for high-throughput systems, and has been deployed at different levels of design. Various uses of pipelining at the system level include exploitation of loop pipelining and pipelined scheduling of tasks on multiprocessor systems to speed up applications [12, 13, 14, 15, 16, 17, 18]. However, none of these works considered processors connected in a pipeline configuration (a pipelined MPSoC), which has recently emerged as a viable platform for high-throughput implementation of multimedia applications [2, 3, 4, 19, 20]. In [19], functional pipelining was explored by mapping loops which lie high in the hierarchy of an application (written in C) onto a pipeline of ASIPs. In contrast to our work, [19] focused on parallelization of the application for a pipelined MPSoC. The authors in [3, 4] proposed design-time balancing techniques for pipelined MPSoCs. Each processor in a pipeline stage is an ASIP for which several configurations are available. Heuristics are used to search through these ASIP configurations to select those which provide the best balancing of the pipelined MPSoC. The authors in [2, 20] have recently shown the application of a feedback-based approach for Dynamic Voltage and Frequency Scaling (DVFS) in pipelined MPSoCs. Each processor is associated with a dedicated controller which monitors the occupancy level of the queues to determine when to increase or decrease the voltage-frequency levels of the processor. Our novel adaptive pipelined MPSoC architecture is different from the one considered in [2, 20]. Furthermore, our run-time balancing approach uses workload prediction from the application, in contrast to the purely feedback-based approach used in their work. Run-time approaches based only on feedback (that is, without any prediction) exhibit a high response time (as shown in Section 6.2) and thus are not suitable for multimedia applications requiring fine-grained run-time management (which is the case with macroblock-based video coding). To the best of our knowledge, there is no prior work on run-time balancing of pipelined MPSoCs using clock- and power-gating techniques. To recapitulate, our work differs in that we use a novel adaptive architecture for pipelined MPSoCs, augmented with a novel run-time balancing approach.

[Figure 2 diagram omitted: five stages S1-S5 of FIFO-connected ASIPs with local memories; MPs (MP1, MP2, MP3, MP4-1, MP4-2, MP5) with attached APs (AP2-1, AP2-2, AP4-1-1, AP4-2-1).]

Figure 2: Adaptive Pipelined MPSoC's Architecture (details of the run-time balancing approach are not shown for the sake of simplicity)

3. ADAPTIVE PIPELINED MPSOCS

Figure 2 shows a typical pipelined MPSoC, comprised of various pipeline stages. Parallel pipeline stages contain more than one processor operating in parallel to increase the system throughput, for example, MP4-1 and MP4-2 in stage S4. Each processor is an ASIP with separate instruction and data caches, which are connected to its local memory. The use of ASIPs can significantly increase the throughput of a pipelined MPSoC [1]. In addition to local memories, shared memories can be used where common data needs to be shared among different stages. A typical multimedia application contains several kernels, and hence can be partitioned into sub-tasks. These sub-tasks are then mapped onto the ASIPs, where the special instructions for each ASIP are designed according to the sub-task(s) mapped onto it. Note that the partitioning and mapping of a multimedia application is not the focus of this paper. Each sub-task is executed as many times as there are input data items, where the total number of times a sub-task is executed is termed the number of


iterations of the multimedia application or the pipelined MPSoC.

We introduce adaptability in a pipelined MPSoC through Auxiliary Processors (APs). All the processors in an adaptive pipelined MPSoC are divided into two categories: Main Processors (MPs) and Auxiliary Processors (APs). A processor is categorized as an MP if its sub-task is executed for every iteration of the multimedia application, that is, it is always active. On the other hand, an AP is a processor whose mapped sub-task is executed for at most the total number of iterations. Thus, stages with significant run-time variation in workload are implemented with a combination of MPs and APs, where APs are connected to their corresponding MP through FIFOs. For example, stage S4 contains two MPs (MP4-1 and MP4-2) which are active at all times, and which use their corresponding APs (AP4-1-1 and AP4-2-1, respectively) only when the workload increases beyond the MPs' capacities. In other words, MPs handle the nominal workload while APs handle high workloads by working in parallel with their corresponding MPs. Treating all the APs of an adaptive pipelined MPSoC as MPs essentially yields a design-time balanced pipelined MPSoC where all the processors are always active (hence containing only MPs). It should also be noted that stages with more or less constant workload do not need APs and are implemented with MPs only, for example, stage S3 in Figure 2. The proposed adaptive pipelined MPSoC is a hybrid system due to the co-existence of MPs and APs, and thus provides an implementation platform for advanced multimedia applications which contain stages with both constant and run-time varying workloads. More importantly, the adaptability of our pipelined MPSoC can be exploited in several different ways. For example, a resource-aware run-time approach could be deployed to allocate the idle APs of one stage to another stage with currently high workload, resulting in efficient resource utilization¹. As another example, an energy-aware run-time approach could deactivate APs during idle iterations to reduce energy consumption. In this paper, we focus on exploiting the adaptability for energy reduction, considering clock- and power-gating (two well-known power reduction techniques) based deactivation of idle APs.

¹Resource sharing would require connecting an AP to multiple MPs. Since we do not consider resource sharing in this paper, each AP is connected to only one MP, though multiple APs can be connected to a single MP.

Figure 3: Design Flow for Adaptive Pipelined MPSoCs

3.1 Design Flow

The basic design flow to create an adaptive pipelined MPSoC is shown in Figure 3. The partitioned sub-tasks of a multimedia application are mapped one-to-one onto the processors of a pipelined MPSoC. The pipelined MPSoC is then passed through a customization process, where each processor is highly customized according to the sub-task(s) mapped onto it, to obtain an ASIP-based pipelined MPSoC [4]. The throughput constraint of the multimedia application is used to derive the maximum number of clock cycles each stage can consume for one iteration, termed 𝑇𝑐. The ASIP-based pipelined MPSoC is then profiled with various data inputs to gather statistical information such as the minimum, maximum and average workload, and clock cycle information of each ASIP. For example, for the motivational example given in Section 1.1, a minimum of 10 and a maximum of 450 SADs are computed, with a standard deviation of 153 SADs. Furthermore, suppose the motion estimation ASIP can only compute 30 SADs in 𝑇𝑐 clock cycles. Using this information, it can be concluded that if 450 SADs need to be computed in 𝑇𝑐 clock cycles, 450/30 = 15 processors are required to work in parallel. Hence the motion estimation stage can be implemented with one MP (since the minimum is 10 SADs, which can be handled by one processor) and 14 APs. A similar procedure is used for the other stages. To summarize, statistical information from profiling is used to decide the number of MPs and APs for each stage of the adaptive pipelined MPSoC. Finally, the information gathered off-line (statistical, architectural and application) is used by the run-time balancing approach, in addition to run-time monitoring of execution time, to activate and deactivate APs in the presence of run-time workload variations.

4. PROBLEM STATEMENT

Given an adaptive pipelined MPSoC and the off-line gathered information, the goal is to determine at run-time "when" and "how many" APs to activate and deactivate for each MP under run-time varying workload, so that the required throughput is delivered with minimal degradation and maximal reduction in energy consumption. The challenge is to predict the correct number of APs required for an iteration, because using the wrong number of APs will either result in a loss of throughput (when fewer APs than required were used) or an increase in energy consumption (when more APs than required were used, which could have been deactivated). A simple feedback-based approach suffers from slow response because the run-time system cannot detect a workload variation until the current iteration has finished, and thus may result in a heavy loss of throughput (as shown in Section 6.2). Furthermore, multiple activations/deactivations of an AP within the same iteration will lead to an energy increase rather than a saving, due to the overhead of activation and deactivation. Thus, a sophisticated run-time approach is required to decide the number of APs that should be activated for an MP, considering more than just the previous iteration's execution time. In addition, such a run-time approach should have low run-time and energy overhead.
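To illustrate the slow response of a purely feedback-based scheme (a toy model of our own, not the approach of [2, 20]), consider a controller that sizes the stage only from the previous iteration's measured load:

```python
def feedback_only(workloads, per_proc):
    """Return per-iteration overruns when the processor count tracks only the
    previous iteration's load (feedback with no prediction)."""
    overruns = []
    active = 1  # start with the MP alone
    for load in workloads:
        capacity = active * per_proc
        overruns.append(load > capacity)        # too late to react this iteration
        active = max(1, -(-load // per_proc))   # resize only *after* it finishes
    return overruns

# A spike from 30 to 450 SADs always overruns the spike iteration itself:
# feedback_only([30, 450, 450, 30], 30) -> [False, True, False, False]
```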

5. RUN-TIME BALANCING APPROACH

Our novel run-time balancing approach combines run-time monitoring of previous and current iterations' clock cycles with workload prediction from the application, in addition to statistical information from the profiling phase, to decide "when" and "how many" APs should be activated and deactivated. The proposed approach is distributed: each MP adapts to the varying workload by activating/deactivating its APs, independently of the other MPs. Thus, for the sake of simplicity, the following explanation is written from the perspective of one MP; the approach applies equally to the other MPs of the adaptive pipelined MPSoC. The run-time approach exploits the fact that an MP is not allowed to exceed 𝑇𝑐 clock cycles for an iteration, in order to guarantee the required throughput. Thus, at certain time instants, the MP checks whether there is a possibility of violating the 𝑇𝑐 constraint during the current iteration, based on the workload prediction and the previous iteration's clock cycles. If so, more APs are activated; if not, the number of active APs is left unchanged or reduced. The following terms are defined (with their calculation method), in addition to 𝑇𝑐 (from Section 3.1), to explain the run-time balancing approach:

∙ W𝑀: Maximum workload that can be handled by the MP in 𝑇𝑐 clock cycles, available from the statistical information.

∙ W𝑆𝐷: Standard deviation of the workload of the MP, also available from the statistical information.

∙ CC[k]: Clock cycles spent by the MP in its k-th iteration, monitored at run-time.

∙ APCount: Number of currently active APs.

∙ AP𝑀: Total number of APs for the MP.

∙ AP𝑇: Minimum number of APs that will be activated or deactivated at an instant, considering the standard deviation of the workload. The value of AP𝑇 affects the reactivity of the MP to highly varying workload. For example, a high value of AP𝑇 will enable a quick response by activating a large number of APs, lowering the impact on the throughput; however, a very high value (close to AP𝑀) will result in most of the APs being active at all times. We compute AP𝑇 so that the MP can respond to a variation of W𝑆𝐷/2 (half the workload standard deviation) within 𝑇𝑐 clock cycles, allowing a quick response. Consider APs are


activated when the current iteration's clock cycles have reached 𝑇𝑐/2 and 3𝑇𝑐/4 (further explained later):

W𝑆𝐷/2 = (W𝑀/2) × AP𝑇 + (W𝑀/4) × AP𝑇

The factors (W𝑀/2) × AP𝑇 and (W𝑀/4) × AP𝑇 refer to the workload that can be distributed to the APs (without exceeding 𝑇𝑐 clock cycles) at the 𝑇𝑐/2 and 3𝑇𝑐/4 time instants, respectively. Simplifying, AP𝑇 = 2W𝑆𝐷/(3W𝑀). Further variations in the workload are covered by the workload prediction from the application.

∙ W𝑃: Current iteration's workload prediction from the application sub-task, obtained by analyzing input data features. The predicted workload categorizes the current iteration as a high, medium or low workload iteration. Note that any number of categories could be used; however, typically no more than 3 categories are available in real applications [21]. Statistical analysis of the profiling results is used to obtain the minimum number of APs required to handle low, medium and high workload iterations, which are saved in the minAPs array. For example, the motion estimation sub-task can analyze the homogeneity of the input MB to categorize it as a high, medium or low motion MB, and according to the statistical analysis at least 3, 8 and 12 APs should be used to process low, medium and high motion MBs, respectively. Note that the actual workload can be smaller or greater than the predicted workload.

∙ AP𝐷: The maximum difference between consecutive elements of the minAPs array.

Using the terms defined above, two functions, activateAPs and activateOrDeactivateAPsInAdvance, are used to activate and deactivate APs during the current iteration. The activateAPs function is triggered at the 𝑇𝑐/2, 3𝑇𝑐/4 and 𝑇𝑐 time instants. When triggered at 𝑇𝑐/2 and 3𝑇𝑐/4 (lines 1–6), it checks whether the last or the second-last iteration's (line 2) clock cycles were greater than 𝑇𝑐.
If so, addAP extra processors are activated, where the value of addAP equals the integer division of the previous iteration's clock cycles by 𝑇𝑐, which indicates how many APs should have been active (line 3). The next two lines (lines 4–5) ensure that at least AP𝑇 APs are activated, and that the number of active APs does not exceed AP𝑀. When one of the previous two iterations' clock cycles (two iterations are used to avoid negative spikes) exceeded 𝑇𝑐 and the current iteration is taking more than 𝑇𝑐/2 or 3𝑇𝑐/4 clock cycles, the possibility of exceeding 𝑇𝑐 again is high. Thus, lines 1–6 ensure quick activation of APs to cope with the increased workload. The activateAPs function is also triggered at the 𝑇𝑐 time instant, which corresponds to the case when the workload was too high and the activation of APs during the iteration was not enough. Thus, at the 𝑇𝑐 time instant, AP𝐷 APs are activated, which ensures that the remaining workload will be handled within the next 𝑇𝑐 clock cycles, introducing a maximum penalty of only 𝑇𝑐 clock cycles. This is because if, for example, the current iteration was predicted to be a low workload iteration, it cannot require the activation of more than AP𝐷 APs; otherwise it would have been categorized as a medium workload iteration.

The activateOrDeactivateAPsInAdvance function activates or deactivates APs before the start of the current iteration. It uses the prediction from the sub-task to obtain the minimum number of APs that will be required during the current iteration (stored as predAP in line 1). If predAP is greater than the number of currently active APs (APCount), then (predAP – APCount) APs are activated. When predAP is less than APCount, some or all of the APs need to be deactivated. The decision on how many APs to deactivate (subAP) is not based solely on the predAP value, as the actual workload can differ from the predicted workload. Instead, the value of subAP is also based on the previous iteration's clock cycles (line 7). The factor (CC[k–1] × (APCount + 1))/𝑇𝑐 gives the number of APs that should be active according to the previous iteration's workload, where the addition of 1 to APCount accounts for the MP itself. If the value of subAP is less than APCount, then (APCount – subAP) APs are deactivated. Line 10 ensures that no more than AP𝑇 APs are deactivated, so that the MP can respond quickly to a sudden increase in workload. The APs are only deactivated when the execution time of each of the last two iterations is less than 𝑇𝑐/2, indicating a low workload period.

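As a numeric check of the AP𝑇 sizing rule above (AP𝑇 = 2W𝑆𝐷/(3W𝑀)); rounding up to a whole AP is our assumption, since processors are discrete:

```python
import math

def ap_t(w_sd, w_m):
    """Minimum APs to (de)activate at once: AP_T = 2*W_SD / (3*W_M),
    rounded up to a whole processor (our assumption)."""
    return math.ceil(2 * w_sd / (3 * w_m))

# Motion estimation stage: W_SD = 153 SADs, W_M = 30 SADs per T_c
# -> AP_T = ceil(306 / 90) = 4
```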

Function activateAPs
1    if CC[k] = 𝑇𝑐/2 ∥ CC[k] = 3𝑇𝑐/4 then
2        if CC[k–1] > 𝑇𝑐 ∥ CC[k–2] > 𝑇𝑐 then
3            addAP = CC[k–1]/𝑇𝑐;
4            addAP = max(addAP, AP𝑇);
5            addAP = min(addAP, AP𝑀 – APCount);
6            ACTIVATE addAP extra processors;
7    else if CC[k] = 𝑇𝑐 then
8        addAP = AP𝐷;
9        addAP = min(addAP, AP𝑀 – APCount);
10       ACTIVATE addAP extra processors;
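A direct Python rendering of activateAPs (our sketch: the `activate` callback and the returned updated APCount stand in for the MP's run-time system):

```python
def activate_aps(CC, k, Tc, AP_T, AP_D, AP_M, ap_count, activate):
    """Update the number of active APs at the Tc/2, 3Tc/4 or Tc instant."""
    if CC[k] == Tc // 2 or CC[k] == 3 * Tc // 4:      # lines 1-6
        if CC[k - 1] > Tc or CC[k - 2] > Tc:
            add_ap = CC[k - 1] // Tc                  # line 3: APs that were missing
            add_ap = max(add_ap, AP_T)                # line 4: at least AP_T at once
            add_ap = min(add_ap, AP_M - ap_count)     # line 5: never exceed AP_M
            activate(add_ap)                          # line 6
            return ap_count + add_ap
    elif CC[k] == Tc:                                 # lines 7-10: budget exhausted
        add_ap = min(AP_D, AP_M - ap_count)
        activate(add_ap)
        return ap_count + add_ap
    return ap_count

# e.g. at Tc/2, with iteration k-2 having overrun Tc, at least AP_T APs wake up
```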

Function activateOrDeactivateAPsInAdvance(W𝑃)
1    predAP = minAPs[W𝑃];
2    if predAP > APCount then
3        addAP = predAP – APCount;
4        ACTIVATE addAP extra processors;
5    else
6        if CC[k–1] < 𝑇𝑐/2 && CC[k–2] < 𝑇𝑐/2 then
7            subAP = (CC[k–1] × (APCount + 1))/𝑇𝑐;
8            subAP = max(predAP, subAP);
9            if subAP < APCount then
10               subAP = max(subAP, APCount – AP𝑇);
11               subAP = APCount – subAP;
12               DEACTIVATE subAP extra processors;
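And a matching sketch of activateOrDeactivateAPsInAdvance (again with stand-in callbacks; `min_aps` maps the predicted workload category to the minimum AP count from profiling):

```python
def adjust_aps_in_advance(W_P, min_aps, CC, k, Tc, AP_T, ap_count,
                          activate, deactivate):
    """Pre-iteration AP adjustment from the workload prediction W_P."""
    pred_ap = min_aps[W_P]                            # line 1
    if pred_ap > ap_count:                            # lines 2-4: activate the gap
        activate(pred_ap - ap_count)
        return pred_ap
    if CC[k - 1] < Tc // 2 and CC[k - 2] < Tc // 2:   # line 6: low workload seen
        sub_ap = (CC[k - 1] * (ap_count + 1)) // Tc   # line 7: APs needed last time
        sub_ap = max(pred_ap, sub_ap)                 # line 8: trust the larger
        if sub_ap < ap_count:                         # lines 9-12
            sub_ap = max(sub_ap, ap_count - AP_T)     # deactivate at most AP_T
            deactivate(ap_count - sub_ap)
            return sub_ap
    return ap_count
```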

It should be noted that these two functions ensure that APs are not activated and deactivated multiple times within the same iteration, which would otherwise incur a significant activation/deactivation overhead. The activateOrDeactivateAPsInAdvance function only deactivates extra processors if the execution time of each of the last two iterations is less than 𝑇𝑐/2 (line 6), which is mutually exclusive with the condition CC[k–1] > 𝑇𝑐 ∥ CC[k–2] > 𝑇𝑐 (line 2) of the activateAPs function, avoiding unnecessary activation and deactivation of APs. Furthermore, deactivated APs remain deactivated for at least one iteration. These two functions are inserted into the code of any MP with APs; however, the values of AP𝑀, AP𝑇, AP𝐷 and W𝑃 differ between MPs, as these values are obtained from the statistical information of the mapped sub-task(s). It should also be noted that our run-time approach does not use any complex computations, and hence its overhead is negligible (see Section 6.2). The activation and deactivation of each AP can be done through either clock- or power-gating. Furthermore, an AP is activated before the data is sent to it, overlapping the activation time with the communication time, as the FIFOs between the MP and the APs are always active.

6. CASE STUDY: AN H.264 VIDEO ENCODER

6.1 Implementation Details

We used a commercial design environment from Tensilica to implement an adaptive pipelined MPSoC for an H.264 video encoder supporting HD720p at 30 fps. Tensilica's Xtensa LX3 [5] family of processors with the RC-2010.1 tool suite was used to create the adaptive pipelined MPSoC. The basic experimental flow is shown in Figure 4. The H.264 encoder's sub-tasks (explained later) are input to the Xtensa PRocessor Extension Synthesis (XPRES) tool to automatically generate special instructions from the C code for each processor. These special instructions comprise a combination of fused operations, vector operations, FLIX instructions [22] and specialized operations [23]. The resulting ASIPs are then used to create the pipelined MPSoC in the XTensa Modeling Protocol (XTMP), a cycle-accurate multiprocessor simulation environment. XTMP uses the XT-XENERGY tool to measure the power and energy of the ASIPs in a multiprocessor environment. Hence, we obtained the throughput and energy of the adaptive pipelined MPSoC from XTMP, where all the ASIPs ran at 1 GHz and XT-XENERGY was configured for a 45 nm technology. In addition, we assumed no overhead for clock-gating an AP, as it can be done in a few clock cycles.

[Figure 4 diagram omitted: the H.264 sub-tasks are fed to XPRES to generate special instructions; the resulting adaptive pipelined MPSoC is simulated with XTMP and XT-XENERGY to obtain power, energy and throughput.]