
Valar: A Benchmark Suite to Study the Dynamic Behavior of Heterogeneous Systems

Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli
Electrical and Computer Engineering, Northeastern University, Boston, MA

{pmistry,yukidave,dschaa,kaeli}@ece.neu.edu

ABSTRACT

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. We have seen a growing number of systems incorporating CPUs, Graphics Processors (GPUs) and Accelerated Processing Units (APUs, which combine a CPU and GPU on the same chip). This emerging class of platforms is now being targeted to accelerate applications where the host processor (typically a CPU) and compute device (typically a GPU) cooperate on a computation. In this scenario, the performance of the application depends not only on the processing power of the respective heterogeneous processors, but also on the efficient interaction and communication between them. To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on specific characteristics that can help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages. We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely-coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.

General Terms

Profiling, Performance Measurement, Benchmarking

Categories and Subject Descriptors
C.1.3 [Processor Architectures]: Other Architecture Styles—Heterogeneous (hybrid) systems; D.2.8 [Software Engineering]: Metrics—performance measures

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GPGPU-6, March 16, 2013, Houston, TX, USA. Copyright 2013 ACM 978-1-4503-2017-7/13/03 ...$15.00.

Keywords
OpenCL, GPGPU, Heterogeneous computing, Benchmark suite, Computer Vision

1. INTRODUCTION

Heterogeneous systems are becoming the norm in both general-purpose and high-performance computing environments. The term heterogeneous in this context refers to systems where a CPU and a second device (CPU, GPU, or DSP) are present in the same system or on the same chip, and share a common system memory. New heterogeneous chips, popularly known as Accelerated Processing Units (APUs), have recently been released by AMD and Intel [13]. The tighter integration of the CPU and the GPU on the same chip has reduced the cost of data movement compared to traditional discrete GPUs, which require moving data across the PCI-Express (PCIe) bus [16]. Although APUs incur lower overhead when managing data shared between devices, some overhead remains, and the number of cores on an APU is typically far fewer than on a discrete device. Given that the gap between CPU and GPU compute capability has narrowed as a result, offloading computation needs to be carefully considered [31]. Applications that take advantage of this close coupling in shared-memory devices have been demonstrated [13, 20].

Heterogeneous devices with either discrete or shared-memory GPUs have been widely adopted in a range of systems, from smartphones to high-end servers [13, 14, 27]. The adoption of heterogeneous devices in real-world applications has motivated the architecture research community to optimize power and performance efficiency [15]. High-quality benchmark suites have been developed to provide target workloads for architects evaluating future platforms [8, 10, 33]. However, there is still a gap in the benchmark space: we are missing an appropriate set of benchmarks that can highlight the interaction between multiple compute devices, and particularly target systems where host participation in the computation is enabled through shared memory.
We have characterized the different dimensions of interaction present in heterogeneous applications, based on their inherent algorithmic design and their mapping to different heterogeneous devices. This has motivated our work to develop Valar, a benchmark suite consisting of a set of mature OpenCL applications implemented for heterogeneous systems. This paper makes the following key contributions:

• We present a benchmark suite that enables us to study interaction in heterogeneous applications.
• We describe a set-wise evaluation strategy and associated abstraction layers to classify the behavior of applications on heterogeneous systems.
• We evaluate the large behavioral space of the benchmarks to characterize execution behavior on APUs and discrete GPUs.

This evaluation allows architects and performance analysts to study the effects of interaction in heterogeneous applications.

The rest of this paper is organized as follows. Section 2 provides background on available benchmark suites for heterogeneous computing and on profiling execution on heterogeneous devices. Section 3 categorizes application behaviors on heterogeneous devices and introduces the applications selected for inclusion in Valar. Section 4 discusses the evaluation methodology used to characterize our benchmarks. Section 5 presents the experimental evaluation of the benchmarks. Section 6 discusses related research, and Section 7 covers future work and concludes the paper.

2. BACKGROUND AND MOTIVATION

2.1 Heterogeneous Computing and Benchmark Suites

Since our initial implementation of Valar is developed in OpenCL, we begin by discussing some important properties of this language/runtime. Heterogeneous computing using OpenCL has gained popularity due to the low cost and performance benefits of discrete GPU and shared-memory APU based systems. OpenCL is an open standard maintained by the Khronos Group, and has received the backing of a number of major graphics hardware vendors. An OpenCL program has host code that executes on a single thread of a CPU. The host code is responsible for setting up data and scheduling execution on OpenCL compute devices such as a GPU and/or CPU cores. An OpenCL context is created for the computation, with a unique command queue for each OpenCL device to be used (such as a GPU or CPU). This architecture is shown in Figure 1. The code that executes on a compute device is called a kernel. In-depth information on implementing heterogeneous applications in OpenCL is provided in [13].

Research in heterogeneous systems architecture has motivated researchers to develop a number of benchmark suites that equip architects with appropriate workloads to evaluate this emerging class of systems [15]. The most popular open-source benchmark suites targeting heterogeneous devices include Parboil [33], Rodinia [8] and the Scalable Heterogeneous Computing (SHOC) benchmark suite [10]. These benchmark suites fulfill roles similar to PARSEC [7]. SPEC [19] is another popular benchmark suite, managed by an industry consortium. Parboil, Rodinia and SHOC support OpenCL and also provide applications implemented in other programming models such as CUDA and OpenMP [8, 10, 33]. These benchmarks allow architects to study programming models and applications concurrently.

Figure 1: OpenCL allows targeting the different compute devices on heterogeneous systems using command queues.

SHOC [10] provides a range of low-level benchmarks based on scientific computing workloads and is unique in its support for GPU clusters. Rodinia provides representative real-world applications targeting GPU systems from multiple domains, such as data mining and medical imaging, that can be run “out of the box” as target workloads for architecture research. The Parboil [33] benchmarks also provide workloads from domains such as data mining and medical imaging. However, Parboil additionally provides serial C++ versions and different optimization levels for its CUDA, OpenMP and OpenCL workloads. This feature of Parboil allows compiler writers to evaluate source optimizations and compiler optimizations on different architectures.

The goal of Valar is to provide benchmarks that exhibit a wide range of host-device behavior and data sharing. Recent application studies on key-value stores [20] and databases [6] show the importance of studying host-device behavior on closely coupled heterogeneous systems. The differences between these publicly available benchmarks and our goals for Valar are summarized in Table 1.

2.2 Profiling Heterogeneous Applications

The set of heterogeneous applications chosen for inclusion in the Valar suite has been evaluated using profile-based analysis. Profiling of heterogeneous applications can be carried out at multiple layers of abstraction. OpenCL applications can be studied at a high level, across different vendors and compute devices, or using low-level, hardware-specific statistics. Our general evaluation methodology is summarized in Figure 2, where the profiler gathers information that is processed offline to select new input cases.

Device-specific performance of OpenCL kernels can be studied using vendor-provided profilers such as the AMD APP Profiler [2]. These profilers query hardware performance counters on the device [2]. Kernel optimization studies utilize such profilers [8, 13, 22, 29], since the area of interest of such studies is optimization within a single compute kernel. Studying host-device interaction only requires high-level statistics such as time and data size. When studying this behavior, we use OpenCL's cl_event interface, which allows querying statistics regarding OpenCL commands off the critical path and with low overhead [13, 28].

Table 1: Existing open source benchmarks commonly used in heterogeneous computing research and the goals of the Valar benchmarks. The table shows the common characteristics shared by the suites, followed by the differences between them.
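As an illustration of this event-based approach, the sketch below aggregates per-kernel execution time and queueing delay offline. The record layout and timestamp values are hypothetical, but mirror the four timestamps OpenCL's clGetEventProfilingInfo exposes (command queued, submitted, started, and ended, in nanoseconds):

```python
# Hypothetical profiling records: (kernel_name, queued, submit, start, end) in ns,
# as would be collected from cl_event objects and dumped for offline analysis.
records = [
    ("fir_filter", 100, 120, 150, 950),
    ("fir_filter", 1000, 1015, 1040, 1860),
    ("tap_update", 2000, 2010, 2025, 2125),
]

def summarize(records):
    """Aggregate per-kernel device execution time and queueing delay offline."""
    stats = {}
    for name, queued, submit, start, end in records:
        exec_ns = end - start       # time actually spent on the compute device
        queue_ns = start - queued   # launch/queueing latency off the critical path
        total, delay, count = stats.get(name, (0, 0, 0))
        stats[name] = (total + exec_ns, delay + queue_ns, count + 1)
    return {n: {"avg_exec_ns": t / c, "avg_queue_ns": d / c}
            for n, (t, d, c) in stats.items()}

print(summarize(records))
```

Because the timestamps are queried after the commands complete, this style of analysis adds little overhead to the running application.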

Figure 2: Profile-driven analysis of heterogeneous applications. Profiling subsystems could refer to the vendor-provided profiler, OpenCL events, or basic timing routines.

3. A HETEROGENEOUS BENCHMARK SUITE

Valar is a unique benchmark suite whose goal is to allow architects and application designers to study the performance of multiple OpenCL compute devices and their communication with the host in heterogeneous systems. Before presenting the benchmarks provided in Valar, we first discuss high-level characteristics of heterogeneous applications.

3.1 Benchmark Characteristics and Algorithm Patterns

A heterogeneous application's execution is a function of the algorithm and its mapping to a heterogeneous device. The importance of architecture-independent metrics for modeling applications has been discussed for single-threaded programs and multi-core devices [25]. For heterogeneous systems where work is offloaded to a secondary device, we present a simple architecture-neutral model to characterize the possible host-device interaction that can occur. The benchmarks in Valar have been categorized based on:

• Implementation: the mapping of computation onto the OpenCL compute device or devices, and
• Behavior: the interaction of the compute devices with the host.

The following characterizes the implementation characteristics of OpenCL kernels.

Computation Pipeline Implementations: A computational pipeline denotes applications where the majority of the algorithm's execution is carried out on a particular OpenCL device. Applications implemented for discrete GPU devices [5, 22, 29] follow such an implementation pattern to minimize the cost of data movement over PCIe [16] and to efficiently utilize all the available compute units.

Multi-Device Decoupled Implementations: This category refers to applications which utilize multiple compute devices in the execution of their algorithm. In this context, decoupled refers to a low frequency of communication between the devices. Applications that can benefit from a multi-device, decoupled implementation have independent units of work that can be dispatched to different devices and do not need to communicate with each other [32].

Multi-Device Coupled Implementations: This category also refers to benchmarks which utilize multiple compute devices in the execution of their algorithm. However, coupled implementations have a higher degree of communication between the compute devices and the host. Coupled implementations have recently become feasible with the introduction of APU devices (where the CPU and the GPU share system memory), architectural support for data movement (e.g., memory mapping over PCIe), and lower kernel launch overhead. Recently announced coupled implementations with irregular parallelism include hashing [20], databases [6] and image feature extraction [28]. These three computational patterns are illustrated in Figure 3a.

In addition to the implementation, the behavior of a heterogeneous application describes how the host and the OpenCL devices interact with respect to shared data.

Latency-Sensitive Behavior: Latency-sensitive applications commonly have real-time constraints and allow only limited combining of work. Such applications may not fully utilize a compute device's memory bandwidth or the PCIe bandwidth.

Streaming Behavior: Streaming behavior denotes applications which have a continuous input stream and/or output stream of data. Such applications commonly process large amounts of data and are bandwidth intensive.

Quality of Service or Anytime Behavior: Quality of Service (QoS) denotes an application where the processed output must meet certain quality requirements (e.g., the residual error in a linear algebra solver). Anytime behavior refers to algorithms where there is flexibility in the quality of the resultant output; web search is an example of an Anytime application [12, 21]. In Anytime applications, the processing time for the output is determined by external factors such as system load [21]. Implementing applications with QoS guarantees is non-trivial on heterogeneous platforms due to the “offload to compute device” programming model, the lack of exceptions, and the non-negligible cost of checking data.

Figure 3: The dynamic execution patterns of applications considered for Valar. (a) shows the mapping of a heterogeneous application onto compute devices. (b) shows the possible host-device behavior with regard to data sharing.

It should be noted that the implementation characteristics discussed above are not mutually exclusive within an application. For example, an application could be a compute pipeline with multiple kernels executing on one device, and also a decoupled implementation when it shares the pipeline's results with other compute devices. Similarly, the behavioral characteristics are not always fixed for an application: a benchmark could be latency sensitive under low system load, but QoS-based under high system load. In this work, the behavior classification is restricted to interaction between a host and a device. We leave interaction within the compute units of an OpenCL device, such as load imbalance between compute units and communication using atomics, for future work. Section 3.2 discusses our sample applications and the characteristics they exhibit.

3.2 Benchmark Description

The Valar benchmarks follow the patterns shown in Figure 3. We have selected applications from a number of domains, such as scientific computing, computer vision and data mining. The benchmarks selected exhibit variations in their OpenCL kernel execution and host-device interaction, based on their real-world input parameters. A summary of the application properties discussed in Section 3.1 is presented in Table 2.

Designing inputs for these benchmarks is a challenging task. As shown in Table 3, the input parameters stress different components of a heterogeneous system. The notion of an “input size”, as commonly discussed for SPEC benchmarks [19], cannot be easily applied here. Workload sizes and input arguments for the Valar benchmarks need to be chosen based on behavioral requirements.

SURF: Speeded Up Robust Features (SURF) is a commonly used algorithm in computer vision. It generates features that are invariant to rotation. SURF is commonly used as a stage in computer vision pipelines such as search and video stabilization. Valar utilizes an optimized, publicly available OpenCL version [28].

Traffic Simulation: The traffic simulation is an example of an agent-based modeling application [34]. Traffic is modeled as a simple cellular automaton model of flow. The model can reproduce traffic jams (i.e., the simulator can capture the deceleration of a car when the road is crowded with a high density of cars). The model incorporates randomization, since one car braking due to a random cause can slow down the cars behind it. We base our implementation on published work [34].

Adaptive FIR Filter: FIR filters are widely used in digital signal processing. The input signal to an FIR filter is split into chunks, called blocks, which are processed by a kernel. Each element in the block is multiplied by a number of coefficients, called taps, producing an output block (i.e., an output signal). The number of taps determines the filter's sharpness and stop-band attenuation characteristics, and also affects the memory usage and amount of computation per OpenCL workgroup1.

1 OpenCL workgroups are analogous to thread-blocks in CUDA.
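The randomized cellular-automaton traffic model described above can be sketched as follows. This is a minimal, self-contained update rule in the spirit of the Nagel-Schreckenberg model; the parameters (ring length, maximum speed, braking probability) are illustrative and not taken from the benchmark:

```python
import random

def step(road, length=100, vmax=5, p_brake=0.3, rng=random.Random(42)):
    """One synchronous update of a ring road.
    road maps cell index -> velocity for each car."""
    cars = sorted(road)
    new = {}
    for i, pos in enumerate(cars):
        v = min(road[pos] + 1, vmax)                       # accelerate
        gap = (cars[(i + 1) % len(cars)] - pos - 1) % length
        v = min(v, gap)                                    # don't hit the car ahead
        if v > 0 and rng.random() < p_brake:               # random braking
            v -= 1
        new[(pos + v) % length] = v                        # advance
    return new

road = {i * 10: 0 for i in range(10)}   # 10 cars, evenly spaced, at rest
for _ in range(20):
    road = step(road)
print(len(road))
```

The random braking term is what lets small perturbations propagate backwards as jams, which is the behavior the benchmark's input density parameter exercises.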

Table 2: The domains covered by each benchmark in Valar. The table shows the implementation and behavior characteristics seen in each benchmark.

Table 3: Input arguments for Valar's benchmarks and their influence on each benchmark's execution.

Adaptive filtering [35] extends the FIR filter by updating the weights of the filter taps on a separate command queue, based on signal characteristics. Adaptive filters are used in audio filtering, speech recognition, and pulse detection applications.

Search: Search algorithms have multiple strategies for refinement; commonly, load and cutoff latency determine the possible refinement. Search applications on multicore platforms are an emerging area of interest [21]. Our search application uses the page view count algorithm, which has been implemented using a GPU-based MapReduce framework [17]. We have extended it to an online system capable of dynamic updates. The GPU searches for elements in different ranges of values in the input data. The CPU reduces the data and generates a statistical mean of the output ranges. The search application is tuned according to the generated mean. Tuning involves changes in search size and number of iterations, which change the load on the GPU dynamically. Such dynamic search applications are also used in industrial engineering for process capability tuning.
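The adaptive FIR structure described above, block-wise filtering with a separate tap-update step between dispatches, can be sketched as follows. The helper names are hypothetical, and the tap-update rule is a placeholder rather than the adaptive algorithm of [35]:

```python
def fir_block(block, taps, history):
    """Filter one block; history holds the last len(taps)-1 input samples."""
    ext = history + block
    n = len(taps)
    out = [sum(taps[k] * ext[i + n - 1 - k] for k in range(n))
           for i in range(len(block))]
    return out, ext[-(n - 1):]

def adapt(taps, error=0.0, mu=0.01):
    """Placeholder for the tap-update kernel: nudge each tap by a step size."""
    return [t + mu * error for t in taps]

taps = [0.25, 0.25, 0.25, 0.25]          # 4-tap moving-average filter
history = [0.0] * (len(taps) - 1)
signal = [1.0] * 8
outputs = []
for start in range(0, len(signal), 4):   # dispatch one block at a time
    block = signal[start:start + 4]
    out, history = fir_block(block, taps, history)
    outputs += out
    taps = adapt(taps)                   # in Valar this runs concurrently on the CPU
print(outputs)
```

Carrying the filter history across blocks is what makes the block decomposition equivalent to filtering the whole signal at once, so the dispatch granularity can be tuned freely for the host-device experiments in Section 5.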

Physics: This application is a mixed particle simulation discussed in [13]. A mixed particle simulation is a simulation where the particles are of different sizes [13]. While particle simulations are easily parallelized, mixed particle simulations are irregular in nature and inefficient if implemented naively. The inefficiency results from the non-uniform granularity of the computation, especially for the collision detection, which is the most expensive part in a particle simulation. If there is a single large particle and many small particles, the number of collisions detected on the large particle can be significantly more than the number detected on the small particles [13]. The physics simulation has been implemented as a pipeline where the large-small particle collisions and the large-large particle collisions are calculated on the CPU, while the small-small collisions are calculated on the GPU.

4. VALAR BENCHMARKS: EVALUATION METHODOLOGY

Our evaluation of this initial set of heterogeneous benchmarks is based on profiling the native execution of each application. We provide a classification methodology that allows architects and application developers to explore the multidimensional execution and behavior spaces of our benchmarks.

4.1 Workload Classification Method

Application behavior can be studied at different levels of abstraction and granularity. We define three different abstraction layers (AL0-AL2) at which applications can be studied. The AL0 features denote the input arguments provided to an application. The AL0 features provide the maximum amount of abstraction, but minimal flexibility with respect to possible changes for performance optimization. As expected, AL0 features are unique to each application and consist of the application input arguments shown in Table 3.

The AL1 features shown in Figure 4 are used to study host-device interaction. The AL1 features describe data movement frequency and the time taken by a kernel to execute on an OpenCL-compatible compute device. The AL1 features can be controlled to a large extent by choosing appropriate input arguments to the benchmarks. This flexibility in choosing host-device behavior is one of the primary characteristics of the benchmarks in Valar.

The AL2 features shown in Figure 4 capture the execution results of the OpenCL kernels and the host-device transfer time. The execution results of an OpenCL kernel include not only kernel execution time but also architectural metrics provided by hardware performance counters.

Figure 4: Abstraction layers used to study heterogeneous applications. AL2 statistics denote hardware counters exposed by AMD's Southern Islands GPUs; other compute devices, such as Ivy Bridge GPUs, would expose different counters.

Figure 4 summarizes the breakdown of the different abstraction layers into which the behavior of a heterogeneous application can be decomposed. Using a profiling subsystem appropriate for each layer, we provide insight into the behavior of Valar on different heterogeneous systems. The classification discussed here provides layers at which to study application behavior. However, our present work still requires low-level profilers when we wish to study kernel execution performance. For future work, our goal is to develop a simulation-driven study to generate statistics and abstract models of compute kernels [11]. This would allow comparison of OpenCL kernel behavior across devices without depending on hardware counters, which complicate comparison across architectures.

4.2 Evaluated Platforms

The platforms on which we have run Valar are shown in Table 4. The list is far from exhaustive, but provides a range of different heterogeneous architectures. The platforms include two models of the latest discrete GPU architectures and shared-memory architectures from AMD. The discrete GPUs are implementations of AMD's Southern Islands architecture [27]. The shared-memory APUs integrate Evergreen and Northern Islands GPUs [2].

Table 4: OpenCL platforms evaluated with Valar.

5. PERFORMANCE RESULTS

We present performance results in two parts. First, in Section 5.1, we discuss the effects of input parameters on the performance of OpenCL kernels. For our targeted platforms, we study how behavior varies across different inputs to an application and also within the runtime of the application. In Section 5.2, we show how different applications can be characterized with respect to their host-device I/O behavior. Our results demonstrate the key features of Valar for studying the behavioral characteristics of heterogeneous applications.

5.1 OpenCL Kernel and Input Set Characterization

This section evaluates the execution of OpenCL kernels within the Valar benchmarks.

5.1.1 OpenCL Kernel Behavior Coverage

To examine the appropriateness of our choice of heterogeneous applications, we profile our application kernels and study the utilization statistics. This methodology examines the coverage of Valar over a range of execution metrics for the OpenCL kernels. Coverage simply refers to the range of kernel behaviors witnessed via profiling. Examining the coverage of a set of workloads entails studying their usage of different architectural components [18]. Due to the high correlation in the behavior of different architectural resources, only a small set of representative metrics is needed to characterize multicore workloads [8, 18, 25]. Research on multicore workloads selects metrics that are uncorrelated with each other. However, the number of performance counters exposed by the GPU compute devices of heterogeneous systems, such as Southern Islands GPUs, is much smaller [2] than on multicore processors.

Figure 5: Range of performance characteristics seen across input sizes for different benchmarks on SI GPUs (Tahiti and Pitcairn). The metric denoted by each line in the Kiviat chart is shown in the accompanying table.

Performance counters on AMD GPUs [2] are accessed through the GPU driver by the AMD APP Profiler. The profiler provides the same set of performance metrics for each architectural family, which allows us to compare resource usage across different GPUs of the same Southern Islands (SI) family (Table 4). The results are shown in Figure 5 using Kiviat charts, a popular representation for understanding workloads [8, 26]. In Figure 5 we observe the range of application behavior and resource utilization of the OpenCL kernels per application. Each application stresses different subsystems on the discrete GPUs. Large variation in vector ALU and memory utilization is observed for computational pipeline applications such as SURF. Applications such as Traffic have a lower cache hit rate due to their random and strided memory access patterns. We also see that applications such as Search and FIR have high scalar unit utilization. The scalar unit is an architectural resource introduced in SI GPUs, where operations that would be duplicated across work-items in a wavefront (e.g., incrementing a loop index) can instead be performed in a single hardware structure. Only FIR and Search use the scalar unit efficiently, due to the presence of tight, regular loops in their kernel bodies. Physics has low ALU utilization even though its memory utilization is high, due to the divergence present in its OpenCL kernels [13]. An alternative approach to studying benchmark coverage would be a simulation-based methodology, which would allow measuring similar performance statistics across architectural families [18, 25].

5.1.2 OpenCL Kernel Behavior within Inputs

Selecting representative input sets for any benchmark is a challenging problem. Multicore benchmark suites like PARSEC [7] have evaluated the fidelity of their input sets by comparing the architectural metrics of test input sets and native input sets. However, heterogeneous applications such as FIR and Traffic have a wide range of practical inputs, as shown in Table 3. There is no clear notion of a “native size” for the data, since different inputs stress different subsystems. We show this by profiling our application kernels, while Section 5.2 discusses it with respect to data movement.

We study the variation of OpenCL kernel behavior for a single input to examine the stability of each benchmark's execution and resource utilization. This variation is stable and repeatable for each input, since it arises because the OpenCL kernels are called multiple times with different data. A single input case for each benchmark is shown in Figure 6. We have used unmodified real-world data to study resource utilization. Our computational pipeline implementations (SURF and Physics) have the largest number of kernels. For the purpose of this evaluation, the CPU kernels in FIR (tap-change) and Search (wg-redn) were run on a SI GPU (Pitcairn) to obtain statistics that can be compared with the main computational kernels, which were also run on the SI GPU2.

The hatched results in Figure 6 show the kernels in Valar with the largest variation within an application. SURF's variation is due to the different number of features seen in each frame of the input video. Search uses a different number of workgroups in each search step. Traffic exhibits different random memory access patterns for the vehicle data at each time step. FIR, in contrast, performs the same processing steps for every block of the signal, and thus shows much smaller variation in utilization.
Results of individual kernel execution over the duration of an application can be used to select kernels for future optimization, to reduce data-dependent behavior, or to track the behavior of a data set. However, querying the information shown in Figure 6 from the graphics driver in an online system is a costly operation and affects the system's throughput. The experiments in Section 5.1.1 were used to study the coverage of the benchmarks across inputs and devices of the same family, while Section 5.1.2 examined the variability in resource utilization over the runtime of an application.

5.2 Studying Host-Device Interaction

In these experiments, we study the interaction between the host and device. This interaction can be considered at the AL1 level (Section 4.1), since it affects how the host and device share data for the same application.

2 Equivalent statistics for CPU devices would not provide a fair comparison due to architectural differences.

Figure 6: Variations in the average utilization characteristics for a single input case on a SI GPU (Pitcairn) for the OpenCL kernels present in Valar. Error bars show the minimum and maximum utilization. Hatching shows kernels with larger variation in their utilization statistics over the runtime of the application.

5.2.1 Investigating Streaming Optimizations - FIR

In this section, we demonstrate Valar's ability to explore the heterogeneous system optimization space with respect to data movement. The Adaptive FIR application is controlled by an input argument, Dispatch (Table 3), that varies the granularity of work offloaded to the compute device. Dispatch denotes the number of filter blocks batched into one buffer before execution is initiated on the GPU. Batching is a commonly implemented optimization in multicore memory systems. Batching data transfers on GPU architectures typically increases throughput due to more efficient utilization of the PCIe bus and amortization of kernel launch overheads.

To study the true benefit of batching in the adaptive FIR application, we vary the dispatch size of the filter. The results are shown for the discrete GPU and the APU in Figures 7a and 7b, respectively. The filter block size shown in Figure 7 denotes the size of each block and is the minimum unit of work done by the GPU device. We see that the execution time decreases to a level which is dependent on the filter block size. Higher dispatch sizes in Figure 7 denote multiple filter blocks being offloaded to the GPU at a time.

The results in Figure 7 show why the performance of APU devices is more dependent on the filter block size than that of discrete devices. Figure 7 breaks down the time spent in data movement and kernel computation. On the discrete GPU, the time spent moving data is reduced by up to 30% as the dispatch size increases, due to the lower number of transactions and better utilization of the PCIe bus. On the APU, due to the smaller number of compute units, the computation time is much larger than the data management time. These results lead to the conclusion that streaming I/O optimizations like batching are not as beneficial to the throughput of APU devices, due to their low compute unit count (Table 4) and low data movement cost.
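The intuition behind this result can be captured in a toy cost model: each dispatch pays a fixed launch/transfer setup cost that batching amortizes, on top of a per-block transfer and compute cost. The constants below are illustrative assumptions, not measurements from our experiments:

```python
def total_time(blocks, dispatch, launch_overhead_us,
               per_block_transfer_us, per_block_compute_us):
    """Toy cost model: fixed setup cost per dispatch, amortized over
    `dispatch` blocks batched into one buffer, plus per-block costs."""
    dispatches = -(-blocks // dispatch)   # ceiling division
    return (dispatches * launch_overhead_us
            + blocks * (per_block_transfer_us + per_block_compute_us))

blocks = 1024
# Hypothetical discrete GPU: high launch/PCIe setup cost, cheap compute.
discrete = [total_time(blocks, d, launch_overhead_us=50,
                       per_block_transfer_us=2, per_block_compute_us=1)
            for d in (1, 4, 16, 64)]
# Hypothetical APU: low setup and transfer cost, compute-dominated.
apu = [total_time(blocks, d, launch_overhead_us=10,
                  per_block_transfer_us=0.2, per_block_compute_us=4)
       for d in (1, 4, 16, 64)]
print(discrete)   # batching sharply reduces total time
print(apu)        # compute dominates, so gains from batching flatten quickly
```

Under these assumptions the discrete device sees a much larger relative speedup from batching than the APU, mirroring the trend in Figure 7.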
Adaptive FIR has a tap weight modulation component that executes concurrently on the CPU. However, it is a lightweight CPU-based OpenCL kernel that only changes the values of the filter taps, which usually number under 1024 in real-world applications. The tap weight modulation is not affected by batching, since it only operates on the taps of the filter.

5.2.2 Multi-Device Coupling Effects - Search

In this section, we study the effect of coupling IO and compute across the compute devices in different heterogeneous systems. As discussed in Section 3, the Search application searches for a set of target data values in blocks of data using the GPU OpenCL kernel. The application hands off the resultant data to the CPU for a final reduction step. Search has a parameter to vary the frequency at which results are communicated from the GPU device to the CPU device. To study the effects of communication between the CPU and GPU on discrete and APU platforms, we vary the frequency at which the CPU device reads data from the GPU's buffers. The results are shown in Figure 8. Figure 8c shows the variation in throughput of Search as the frequency of communication between the CPU and GPU devices is varied. From Figure 8a, we see that the performance of the CPU is more stable than that of the GPU across communication intervals. This occurs because, with less frequent communication, the GPU is allocated more data to search through for the target values, whereas the work done by the CPU to carry out the data reduction is independent of the communication interval, since the CPU's work depends only on the number of target elements found by the GPU. In Figure 8b, the GPU execution time per kernel increases as the communication frequency is reduced. The discrete GPU performs better for the Search kernel due to its larger memory bandwidth and greater number of compute units. We also see that for a shorter communication interval (with highly frequent communication), the CPU kernel on the APU performs worse. This can be attributed to the shared memory of the APU, where the limited memory bandwidth is also utilized by the GPU portion of the APU.
We also observe that our CPU kernel may need further tuning, since it does not efficiently utilize the multithreaded CPU cores in the Pitcairn system; its performance is comparable to that of the quad-core system.

a) Decomposing execution of the adaptive FIR application on a discrete GPU (Tahiti). All three charts share the same y-axis.

b) Decomposing execution of the adaptive FIR application on an APU (Trinity). All three charts share the same y-axis.

Figure 7: Variation in execution time for the Adaptive FIR application.

We see that the effective throughput is comparable on both platforms for shorter communication intervals. However, at longer intervals between communications, the high throughput of the discrete GPU and its lower usage of the PCIe bus push its performance beyond that of the APU.

5.2.3 Multi-Device Coupling Effects - Physics

As discussed in Section 3, the Physics application partitions the collision detection pipeline between the CPU and the GPU. We observe the effects of this property on two heterogeneous platforms. The Physics application's behavior is studied by observing the resultant throughput of the application for different particle distributions. Figure 9 shows the results on the discrete SI GPU (Tahiti) system and the Trinity APU. We have chosen three values for the number of small particles and vary the number of large particles for each test case. As expected, we observe that the time taken by the simulation increases with the number of large particles.

Figure 8: Search execution characteristics for a range of communication intervals between the CPU and the GPU compute devices.

To understand the difference in throughput between a discrete GPU system and an APU system, we examine the time taken by each system to perform the large particle collisions; the large-large collisions are the CPU-intensive task. Figure 10 shows the throughput of the large-large particle comparison for both systems for a fixed number of small particles. We see that the two systems have comparable throughput. The Tahiti system has higher throughput on the large-large calculations because of the greater number of CPU cores in that system. Thus, when we run the application with the maximum number of small particles (57,344), the discrete GPU Tahiti system performs better by up to 2X when the number of large particles is small, and the speedup reduces to 1.3X as we increase the number of large particles.

To summarize, in Section 5.2 we discussed how optimizing the interaction between the host and the OpenCL devices enables up to 30% and 15% improvements in throughput on discrete and APU systems, respectively. We observed that for Search, the effective throughput of the discrete and APU systems was comparable when the devices were closely coupled. For Physics, we saw that the lower throughput of the large particle collisions reduces the performance advantage provided by the discrete GPU.

Figure 9: Throughput for Physics on the SI discrete GPU (Tahiti) system and the Trinity APU. For smaller numbers of large particles, the performance difference between the platforms is greater.

Figure 10: Large particle throughput for increasing large particle counts on the Tahiti and Trinity systems.

6. RELATED WORK

Benchmarking parallel systems has a rich history. SPECint and SPECfp [19] are two suites from SPEC, an industry consortium focused on developing evaluation standards for benchmarking commercial systems. These two suites focus on single-core performance, specifically single-core integer and floating point performance. PARSEC is a widely adopted benchmark suite for studying multicore processors [7]. Architectural research on GPGPUs has led to benchmarks like Parboil [33] and Rodinia [8]. SHOC [10] is a benchmark suite tailored to HPC workloads and clusters. While Valar also targets heterogeneous systems, it is complementary to existing approaches, since its applications allow additional exploration of the interaction between the host and compute device in heterogeneous systems. Newer benchmark suites have also been developed in other popular domains such as cloud computing; the target workloads in cloud computing differ substantially from HPC and multicore research [12, 26]. The autonomic computing research community has developed benchmark suites that stress the self-monitoring/self-repairing behaviors of an application; for example, the Autonomic Computing Benchmark is used to test reliability [9]. Benchmarks are also commonly tailored towards real-time systems [24]. Newer heterogeneous systems are better equipped to handle the challenges of applications with irregular parallelism [14]; such applications include recent studies on key-value stores [20] and GPU-based databases [6]. As shown in Valar, true heterogeneous applications are highly sensitive to their input data. PARSEC [7] examines the fidelity of an input set by comparing its behavior to a native representative input; however, due to the range of application scenarios on heterogeneous systems, no single input would be truly representative. Previous workload characterization research on multicore systems used architectural counters to study a range of applications [18]; low-level counter information can be obtained using simulators or native platforms. Gregg and Hazelwood [16] examine how the location of data affects performance in heterogeneous systems. Spafford et al. have recently studied the low-level architecture of APU devices using the SHOC benchmark suite [31]. Other GPGPU workload characterization studies [5] examine the effects of architectural modifications on OpenCL/CUDA kernels. Characterization of a GPU memory system can be carried out by sparse instrumentation [4] and subsequent modeling. These workload characterization methods for discrete GPUs complement our analysis of heterogeneous workloads. The importance of interaction among heterogeneous devices is highlighted in recent research that reevaluates the role of the host CPU in heterogeneous systems [3].

7. CONCLUSIONS AND FUTURE WORK

In this work we present Valar, a new benchmark suite consisting of real-world applications that effectively leverage heterogeneous devices. The main characteristics of these benchmarks are their data-dependent behavior and their utility for studying the interaction between computation and data movement on different heterogeneous devices. We have presented a classification methodology that can be used to understand the different implementations and behaviors commonly seen on heterogeneous devices. We have evaluated Valar on different heterogeneous systems, including discrete GPU platforms and shared memory platforms (APUs). Our evaluation has examined the coverage of the benchmarks and identified the kernels possessing unique kinds of data-dependent behavior. To study host-device interaction, we have examined the benefits of staging IO in applications and observed that varying the offload granularity can provide performance benefits of up to 30%. To study interaction between OpenCL devices, we have examined the effects of the CPU device in two of our selected applications, where closely coupled behavior results in comparable performance between discrete GPUs and APUs.

There are a number of possible directions for future work.

• We plan to add a modeling framework to describe required host-device interaction scenarios, which would guide the user to the right Valar benchmark configuration. This would provide test scenarios and approximate application models for researchers working on developing runtimes for heterogeneous applications. Similar approaches have been used for producing SimPoints [30].

• We are working on Valar 2.0, which will extend the suite to include domains such as databases [6], where concurrent queries could be targeted onto data resident in a heterogeneous system with shared memory. We also plan to investigate complex algorithms such as Predator, an object detection system based on the dynamic interaction between feature detection and machine learning [23].

• We are also interested in porting Valar to other heterogeneous languages such as OpenACC [1].

• Other directions for future work include leveraging OpenCL portability to evaluate Valar on recently announced embedded CPU-GPU OpenCL devices such as the Adapteva Epiphany and Qualcomm platforms. We also plan to examine the effects of the Valar benchmarks on power consumption.

8. ACKNOWLEDGEMENTS

This work was supported in part by NSF Award EEC0946463, and through the support and donations from AMD and NVIDIA. The authors would also like to thank Norman Rubin for his advice and feedback on this work.

9. REFERENCES

[1] The OpenACC Application Programming Interface 1.0. http://www.openacc-standard.org/, 2011.
[2] AMD. Accelerated Parallel Processing: OpenCL Programming Guide. 2011.
[3] M. Arora, S. Nath, S. Mazumdar, S. B. Baden, and D. M. Tullsen. Redefining the Role of the CPU in the Era of CPU-GPU Integration. IEEE Micro, pages 1–1, 2012.
[4] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), page 23, New York, NY, USA, 2012. ACM Press.
[5] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 163–174. IEEE, Apr. 2009.
[6] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), page 94, New York, NY, USA, 2010. ACM Press.
[7] C. Bienia and K. Li. Fidelity and scaling of the PARSEC benchmark inputs. In IEEE International Symposium on Workload Characterization (IISWC '10), pages 1–10. IEEE, Dec. 2010.
[8] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, and K. Skadron. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In IEEE International Symposium on Workload Characterization (IISWC '10), pages 1–11. IEEE, Dec. 2010.
[9] J. Coleman, T. Lau, B. Lokhande, P. Shum, R. W. Wisniewski, and M. P. Yost. The Autonomic Computing Benchmark. pages 1–22, 2008.
[10] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. L. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In ACM International Conference Proceeding Series, Vol. 425, 2010.
[11] K. Fatahalian, W. J. Dally, P. Hanrahan, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, and A. Aiken. Sequoia: programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), page 83, New York, NY, USA, 2006. ACM Press.
[12] M. Ferdman, B. Falsafi, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, and A. Ailamaki. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12), page 37, New York, NY, USA, 2012. ACM Press.
[13] B. Gaster, L. Howes, D. Kaeli, P. Mistry, and D. Schaa. Heterogeneous Computing with OpenCL. Morgan Kaufmann.
[14] B. R. Gaster and L. Howes. Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck? Computer, 45(8):42–52, Aug. 2012.
[15] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), page 235, New York, NY, USA, 2011. ACM Press.
[16] C. Gregg and K. Hazelwood. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 134–144. IEEE, Apr. 2011.
[17] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), page 260, New York, NY, USA, 2008. ACM Press.
[18] W. Heirman, T. E. Carlson, S. Che, K. Skadron, and L. Eeckhout. Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads. In 2011 IEEE International Symposium on Workload Characterization (IISWC), pages 38–49, Nov. 2011.
[19] J. L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, Sept. 2006.
[20] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems. In 2012 IEEE International Symposium on Performance Analysis of Systems & Software, pages 88–98, Apr. 2012.
[21] V. Janapa Reddi, B. C. Lee, T. Chilimbi, and K. Vaid. Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), page 314, New York, NY, USA, 2010. ACM Press.
[22] B. Jang, D. Schaa, P. Mistry, and D. Kaeli. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, 22(1):105–118, Jan. 2011.
[23] Z. Kalal, J. Matas, and K. Mikolajczyk. Online learning of robust object detectors during unstable tracking. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1417–1424, Sept. 2009.
[24] T. Kalibera, J. Hagelberg, and P. Maj. A family of real time Java benchmarks. Concurrency and Computation: Practice and Experience, 2011.
[25] M. Kulkarni and V. Pai. Towards architecture independent metrics for multicore performance analysis. ACM SIGMETRICS Performance, 2011.
[26] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi. The HPC Challenge (HPCC) benchmark suite. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), page 213, Nov. 2006.
[27] M. Mantor and M. Houston. AMD Graphics Core Next. In AMD Fusion Developer Summit, 2011.
[28] P. Mistry, C. Gregg, N. Rubin, D. Kaeli, and K. Hazelwood. Analyzing program flow within a many-kernel OpenCL application. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4), page 1, New York, NY, USA, 2011. ACM Press.
[29] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08), page 73, 2008.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), page 45, New York, NY, USA, 2002. ACM Press.
[31] K. L. Spafford, J. S. Meredith, S. Lee, D. Li, P. C. Roth, and J. S. Vetter. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In Proceedings of the 9th Conference on Computing Frontiers (CF '12), page 103, New York, NY, USA, 2012. ACM Press.
[32] K. L. Spafford, J. S. Meredith, and J. S. Vetter. Maestro: Data Orchestration and Tuning for OpenCL Devices. Euro-Par 2010 Parallel Processing, 6272:275–286, 2010.
[33] J. A. Stratton, C. I. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. 2012.
[34] D. Strippgen and K. Nagel. Multi-agent traffic simulation with CUDA. In 2009 International Conference on High Performance Computing & Simulation, pages 106–114. IEEE, June 2009.
[35] W. Thies, M. Karczmarek, J. Sermulins, R. Rabbah, and S. Amarasinghe. Teleport messaging for distributed stream programs. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '05), page 224, New York, NY, USA, 2005. ACM Press.