Design of OpenCL Framework for Embedded Multi-core Processors

Jung-Hyun Hong, Young-Ho Ahn, Byung-Jin Kim and Ki-Seok Chung, Member, IEEE

Abstract — In modern mobile embedded systems, various energy-efficient hardware acceleration units are employed in addition to a multi-core CPU. To fully utilize the computational power of such heterogeneous systems, Open Computing Language (OpenCL) has been proposed. A key benefit of OpenCL is that it works on various computing platforms. However, most vendors offer OpenCL software development kits (SDKs) that support only their own computing platforms, and the study of the OpenCL framework for embedded multi-core CPUs is in a rudimentary stage. In this paper, an OpenCL framework for embedded multi-core CPUs that dynamically redistributes a time-varying workload to the CPU cores at runtime is proposed. A compilation environment for both host programs and OpenCL kernel programs was developed, and OpenCL libraries were implemented. A performance evaluation was carried out with respect to various definitions of the device architecture and the execution model. When running on embedded multi-core CPUs, applications parallelized with OpenCL C showed much better performance than applications written in C without parallelization. Furthermore, since the framework manages hardware resources and threads automatically through OpenCL application programming interfaces (APIs), highly efficient computing, in terms of both performance and energy consumption, can easily be achieved on a heterogeneous computing platform.1

Index Terms — Embedded multi-core system, OpenCL framework, Parallel processing, Heterogeneous computing.

1 This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2013-H0301-13-1011).
Jung-Hyun Hong is with the Department of Electronic and Computer Engineering, Hanyang University, Seoul, Korea (e-mail: [email protected]).
Young-Ho Ahn is with the Department of Electronic and Computer Engineering, Hanyang University, Seoul, Korea (e-mail: [email protected]).
Byung-Jin Kim is with the Department of Electronic and Computer Engineering, Hanyang University, Seoul, Korea (e-mail: [email protected]).
Ki-Seok Chung is with the Department of Electronic Engineering, Hanyang University, Seoul, Korea (e-mail: [email protected]).
Contributed Paper. Manuscript received 03/25/14; current version published 06/23/14; electronic version published 06/23/14.

I. INTRODUCTION

Modern embedded consumer devices are often required to process multiple computationally intensive applications simultaneously. Therefore, various application-specific accelerators are commonly employed. Typically, mobile devices are equipped with a multi-core central processing unit (CPU), a multi-core graphics processing unit (GPU), digital signal processors (DSPs), image signal processors (ISPs), and video decoders, and a corresponding set of software frameworks is ported to utilize such devices [1]. Since these processors have distinct hardware structures and instruction sets, compilers must generate the corresponding target binary code for each of them. Once a binary code for a particular target is generated, it can be executed only by that specific target device. Thus, it is not easy to fully utilize the various computing resources at runtime, and a flexible, dynamically redistributable computing environment is therefore desirable. With such an environment, dynamic job assignment and load balancing become feasible, and better real-time responsiveness and reduced power consumption can be achieved.

Open Computing Language (OpenCL) was proposed to provide a framework that supports heterogeneous computing platforms. OpenCL includes programming languages and application programming interfaces (APIs). OpenCL C is the programming language used to write a parallel function code called an OpenCL kernel. When the kernel is compiled dynamically, it can be executed on various heterogeneous devices [2]. However, contrary to the original motivation of OpenCL, most vendors offer OpenCL software development kits (SDKs) that support only their own computing platforms [3], [4]. In particular, for embedded mobile devices, communication interfaces and APIs may not be openly available to developers, so it is not easy for programmers to efficiently utilize heterogeneous computing platforms. Hence, it is necessary to develop an OpenCL framework that automatically generates binary codes from OpenCL codes for various computing devices. With such a framework, heterogeneous computing platforms can be utilized by a program written in one language, which is the key advantage of OpenCL.
Various ways of executing parallelized application programs on computing devices have been actively studied [5]-[7]. Each computing resource has pros and cons in terms of performance, applicability, and power consumption. GPUs are known to be effective for data parallelism, while CPUs are better for task parallelism. Park et al. exploited both data parallelism and task parallelism in designing a parallel low-density parity-check (LDPC) decoder: the parts more appropriate for data parallelism were assigned to the GPU, and the parts more appropriate for task parallelism were assigned to the CPU [8]. However, as the number of applications running simultaneously on an embedded device increases, the characteristics of the workload vary significantly.

0098 3063/14/$20.00 © 2014 IEEE

Therefore, it is highly desirable to have a flexible OpenCL framework that can redistribute workloads at runtime to fully utilize the available computing resources. Since the number of cores in embedded CPUs has increased rapidly, the processing power and the potential for parallelism of embedded CPUs have grown greatly. It has also become feasible for tasks that were originally intended to run on application-specific processors to be executed on an embedded CPU with the OpenCL framework. Therefore, an OpenCL framework for embedded multi-core CPUs is needed to fully utilize a heterogeneous embedded system. However, the study of such a framework is in a rudimentary stage. Hence, in this paper, a novel OpenCL framework for embedded multi-core CPUs is proposed. How the proposed OpenCL framework is designed, and how the various execution environment models are defined for an embedded multi-core CPU, will be discussed. To implement the proposed framework, the compilation environment for both the host program and the OpenCL kernels has been developed, and OpenCL libraries have been implemented. A performance evaluation was carried out with respect to various definitions of the device architecture and the execution model. When running on embedded multi-core CPUs, applications parallelized with OpenCL C showed much better performance than the corresponding serial C implementations. Furthermore, since the framework manages hardware resources and threads automatically through OpenCL APIs, efficient computing on a heterogeneous computing platform can easily be achieved.

The remainder of the paper is organized as follows. In Section II, the overall structure of the proposed OpenCL framework for embedded multi-core CPUs is presented. In Section III, the compilation environment to generate binary codes for the target embedded multi-core processors is addressed.
OpenCL libraries that are customized for the device structures of embedded multi-core processors will be explained in Section IV. The experimental environment and results will be presented next. Finally, the conclusions and future work will follow.

IEEE Transactions on Consumer Electronics, Vol. 60, No. 2, May 2014

II. OVERVIEW OF OPENCL FRAMEWORK FOR EMBEDDED MULTI-CORE PROCESSOR

OpenCL is an open standard programming framework that is independent of any specific target platform, so it supports heterogeneous computing platforms. Various hardware accelerators are abstracted as OpenCL computing devices [9]. The proposed OpenCL framework generates binary codes for a multi-core CPU by compiling the host program and the kernel programs written in the OpenCL C language. The host program is the main program that is executed by the host, and an OpenCL kernel is a program that runs on the embedded multi-core processor. Hence, both a compiler for the host program and a compiler for OpenCL kernels were implemented. OpenCL libraries were also implemented, comprising the built-in functions called by OpenCL kernels and the OpenCL runtime called by the host program. The architecture of the proposed OpenCL framework is shown in Fig. 1.

Fig. 1. Architecture of OpenCL framework for embedded multi-core processor.

In general, the number of cores in a CPU is much smaller than the number of processing elements in the other hardware accelerators of a heterogeneous platform. Therefore, the architecture-specific features of the embedded multi-core processor should be considered in order to enable the target CPU to work as an efficient OpenCL compute device. In the proposed OpenCL framework, the platform, execution, and memory models for the target CPU are defined so as to compose an execution environment that conforms to the OpenCL specification [9]. A detailed explanation of each model is given in the following discussion.

A. Platform Model

In the OpenCL specification, processors are called compute devices. An OpenCL compute device has one or more compute units (CUs), which are divided into one or more processing elements (PEs). The computation is carried out in a PE, and the PEs can execute in parallel. A kernel is built by the host program and is downloaded to a compute device. Each kernel instance is then executed by a PE in parallel [9].

Fig. 2. Target platform model.

As shown in Fig. 2, the embedded processor is defined as a compute device, the multi-core CPU is the CU, and each core in a CPU is defined to be a PE in the proposed OpenCL framework.

B. Execution Model

As shown in Fig. 3, the index space for OpenCL kernels is called "NDRange," which stands for the N-Dimensional Range. When the host program schedules a kernel execution, the index space is defined.

Fig. 3. The index space of OpenCL.

An instance of a kernel execution is called a work-item, and work-items are grouped into work-groups. A work-item is executed by a PE at a point in the index space. The total number of work-items in the entire index space is called the global size. Even though all work-items execute the same code, the data to be processed may be assigned differently to each work-item according to its unique ID in the index space, which is called the global ID. Work-groups are organized and assigned to a CU, which contains multiple PEs. The number of work-groups is computed by dividing the global size by the local size. Consequently, multiple parallelization schemes for a kernel become possible by simply varying the global and local sizes [9].

The number of CUs and PEs in an embedded multi-core CPU is much smaller than in other OpenCL devices such as GPUs, and a programmer may create many more work-items than the number of PEs. In the proposed OpenCL framework, many work-items can be processed in parallel by logical processing elements (LPEs), whose number is automatically set to the local size, even though the number of physically existing PEs is small. The execution model in the proposed framework will be explained in detail in Section IV.

C. Memory Model

In the OpenCL specification, each work-item can access four distinct memory regions: global memory, constant memory, local memory, and private memory. Global memory is a region that permits read and write accesses by all work-items in all work-groups. Constant memory is a region of global memory reserved for constants. Local memory is a region that is local to a work-group; it can be used to allocate objects that are shared by the work-items in one work-group. Private memory is a region that is private to a work-item, and it is invisible to other work-items [9].

The memory model of the OpenCL specification was originally defined for compute devices with many compute units, such as GPUs. Compute devices such as embedded CPUs have only a small number of compute units. Therefore, the memory model should be defined differently so that the target CPU can effectively carry out parallel execution of work-groups and work-items.

Fig. 4. Target memory model.

The memory model used to execute a kernel on a multi-core embedded CPU is shown in Fig. 4. Since each core in an embedded CPU is defined as a PE, the private cache is defined as the private memory that belongs to a work-item. In the OpenCL framework for embedded CPUs, the multi-core CPU serves as the compute device as well as the host. Thus, the main memory is defined as both the global memory and the local memory. The shared cache in an embedded CPU can be regarded as a part of the local memory that caches recently accessed data from the main memory.

III. COMPILATION ENVIRONMENT OF OPENCL FRAMEWORK FOR EMBEDDED MULTI-CORE PROCESSOR

Fig. 5. Compilation flow of OpenCL framework.

To execute the host program and a kernel program on an embedded multi-core processor, binary codes must be generated by both the host program compiler and the kernel program compiler. Fig. 5 shows the compilation flow of the OpenCL framework. The host program may be written in C, C++, or Java, and it describes the overall operation scenario of an application program. The OpenCL kernel is a computationally intensive code that is executed on a compute device, and it is written in the OpenCL C language [10]-[12]. In the proposed OpenCL framework, the compilation environment is composed of a Linux-based mobile operating system (OS), the software development kit (SDK) and native development kit (NDK) for the mobile platform [13], and LLVM [14]. Binary codes are generated by compiling both the host program and the kernel program with the OpenCL libraries.

A. Host Program Compilation

The host program describes the overall scenario of an application and is executed on the host. When an OpenCL application is described, the host program uses APIs in the OpenCL library and generates the context object. The host program is compiled by the SDK and NDK for the mobile platform. Binary codes for the embedded multi-core CPU are generated from C/C++ source programs by the cross-compiler provided by the NDK. A host program written in Java invokes native C library functions using the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and to be called by, native applications and libraries written in other languages such as C and C++ [15]. In the proposed implementation, Java bindings for OpenCL (JOCL), a Java wrapper for OpenCL, was used [16].

B. OpenCL Kernel Compilation

1) LLVM

Fig. 6. LLVM's implementation of the three-phase design.

An OpenCL kernel is compiled by LLVM. LLVM is implemented as a set of compilation modules in the form of libraries, and all of the compiler modules and linkers are provided as open source packages. As shown in Fig. 6, LLVM defines an intermediate code representation called LLVM-IR. Source programs written in various programming languages can be translated into LLVM-IR, and then a set of optimization techniques is applied before the target machine instructions are generated [14].

The OpenCL kernel compilation process using LLVM in the proposed framework is shown in Fig. 7. The source code, which is written in the OpenCL C language, is compiled by the Clang compiler to generate LLVM-IR. Using the LLVM compiler, assembly codes for the target embedded multi-core processor are generated. Finally, native shared libraries for the mobile OS are generated using the NDK cross-toolchains for the mobile platform.

Fig. 7. Compilation flow of OpenCL kernel.

2) Dynamic and Static Kernel Compilation

There are two ways to compile OpenCL kernels: dynamic and static compilation. In dynamic compilation, the host program reads in the OpenCL source codes and calls OpenCL APIs to compile them at runtime. In static compilation, the OpenCL kernel source codes are first compiled using LLVM; the host program then reads in the binary codes of the precompiled OpenCL kernels and, by calling APIs, loads the kernels from these binary codes [9].

IV. IMPLEMENTATION OF OPENCL LIBRARY

A. OpenCL Runtime

The OpenCL runtime allows the host program to manipulate the context object once it has been created. The OpenCL runtime APIs that are called by the host program were implemented based on the execution model defined in Section II. The host program first retrieves information on the platform and the device, and then generates the context object to compose the execution environment in which to run an application. The host program issues commands to the command-queue object assigned to a device in order to control the device. Next, the program object is created and the kernel binary codes are read. After a memory object is created for each argument of the kernel, the host delivers the address of the memory object to the device. The kernel is then executed by creating a kernel object. The host program terminates after reading the kernel's execution results from memory and releasing the created objects.

The commands inserted into the command-queue are typically those for kernel execution, memory control, and synchronization, which controls the execution of a command while it is enqueued [9]. The command-queue in the proposed framework is an in-order queue, in which an earlier command must terminate before a new command can start. In this way, the order of execution of the commands in one command-queue is strictly preserved. However, when multiple command-queues exist while there is only one device, the execution of commands across the command-queues must be serialized. The event object takes care of the synchronization of command executions under the same context object. An OpenCL API that enqueues a command returns an event object after the command execution is completed and the memory object is updated. Commands that have been given an event object, in order to use the result of the associated command, may start execution only after that event object is returned. In this way, synchronization is ensured.

Fig. 8. Overview of OpenCL runtime execution flow.

Fig. 8 shows the execution flow of the proposed OpenCL runtime. First, when a command-queue is created for a device, threads are created per command-queue for parallelization and synchronization. The created command-queue has a linked-list data structure and is managed by the "Command-queue governor" in Fig. 8. The created command-queue waits until a command is issued by one of the "Enqueue" runtime APIs. The "Enqueue" APIs first register the type of the command and the list of events for the command to be issued, and then wake up the waiting command-queue thread. When the "Command-queue governor" receives the wake-up signal, one of the execution functions (READ, WRITE, COPY, and RUN) for the command that was woken up is invoked.

In general, CPUs have a much smaller number of CUs than other OpenCL compute devices, and in the proposed framework the number of CUs in the execution model of the embedded multi-core processor is 1. Therefore, threads that have completed execution are reused in order to avoid the overhead of repeated thread creation and removal within one CU. Data exchanges between the host and the device are manipulated by the READ, WRITE, and COPY commands, and the kernel execution requested by the host program is carried out by the RUN command. To ensure mutual exclusion when multiple threads try to access the same command-queue data structure, critical-section management, which guarantees that only one thread accesses the shared resource until it finishes execution, is performed with a mutex lock.

"clEnqueueNDRangeKernel" is one of the enqueue runtime APIs; it sets the value of the NDRange and the type of the command to RUN, and then sends a wake-up signal to the "Command-queue governor". The "Command-queue governor" issues the RUN command to the command-queue to invoke it. In the RUN command, the IDs of the work-groups and the work-items of the index space set by the programmer are stored in the corresponding variables. The parallelization is carried out by cloning, which means that child processes sharing resources with the parent process are created. The number of created child processes is the same as the total number of work-items specified by the programmer. Since the number of threads to be created may equal the total number of work-items (the local size), the shared resources must be protected by the mutex lock until each child process completes its execution. The parent process waits until all of the child processes have completed. The programmer can thus have the threads allocated to the corresponding command-queue managed automatically by calling OpenCL runtime APIs in the host program.

B. OpenCL Built-in Functions

In the OpenCL specification, a set of built-in functions for various vector and scalar computations is defined for use in kernels written in the OpenCL C language. In the proposed framework, many built-in functions were implemented by writing wrapper functions around functionally similar standard C library functions. Other built-in functions that have no counterpart in the standard C library were implemented from scratch. In total, 646 built-in functions of 158 different types in the categories Work-item, Common, Geometric, Integer, Math, Relation, and Synchronization were implemented, and their functionality has been verified [9].

C. Execution Flow of OpenCL Framework for Embedded Multi-core Processor

Fig. 9. Execution flow of OpenCL framework for embedded multi-core processor.

Fig. 9 summarizes the overall execution flow of the proposed OpenCL framework from the perspective of the OpenCL libraries. The host program calls OpenCL runtime APIs in the target application's address space. The global size is defined as the total number of kernel instances (work-items), and the number of work-groups is computed by dividing the global size by the local size [9]. Embedded multi-core processors have far fewer computing resources than typical OpenCL devices, so the numbers of CUs and PEs are much smaller. In the proposed framework's platform model, the number of CUs is 1, and the number of PEs is 2 for a dual-core processor and 4 for a quad-core processor. In the execution model, programmers may create many more work-items than the real number of PEs. Such work-items are scheduled by the OS to share the PEs; thus, many more LPEs than physical PEs are assumed to exist. By increasing the number of work-items, and correspondingly the number of LPEs, the core utilization rate becomes higher. In the proposed implementation, the number of LPEs is automatically set to the local size so that each work-item is assigned to one LPE. Consequently, the framework can process many work-items in parallel, even though typical embedded mobile processors have only 2-8 cores.

OpenCL built-in functions are invoked when the kernel binary code executes kernel functions. Synchronization among work-items inside the same work-group is handled by the "barrier" built-in function. By varying the global size and the local size, programmers can change the number of work-groups, and therefore a wide variety of kernel configurations can be generated. To evaluate the performance of each kernel function using the proposed OpenCL framework, various numbers of LPEs and changes to the kernel structures were tried. The following section reports the results of the performance comparison when running two target applications.

V. EXPERIMENTAL RESULTS

To evaluate the performance, an embedded board with a quad-core processor running a Linux-based mobile OS was used, along with two target applications: a 2-dimensional matrix multiplication application and a 2-dimensional ray tracing application. The host program of the matrix multiplication application was written in C, while that of the ray tracing application was written in Java and composed with JOCL. The performance was measured for various combinations of global and local sizes: the number of work-groups was changed by varying the global and local sizes, and the number of LPEs was changed by varying the local size in a work-group. To show the effectiveness of the proposed framework, the processing time and memory usage of each application written in the OpenCL C language were compared with those of the same application written in the C language.

A. Performance Evaluation for Varying Work-groups

Fig. 10 shows the performance results of running the matrix multiplication application when the number of work-groups was varied from 1 to 65536 with a fixed local size of 4. Two matrix sizes were tested: 1024x1024 and 2048x2048. As Fig. 10 shows, the larger the matrix, the more conspicuous the performance degradation due to the increasing number of work-groups. The execution time was the shortest when the number of work-groups was 1. This strongly implies that the amount of overhead becomes more significant as the amount of data to process and the number of work-groups increase.

Fig. 10. Processing time of matrix multiplication with varying number of work-groups.

Fig. 11 shows the performance results for the ray tracing application. Again, the execution time was the shortest when the number of work-groups was 1. When the number of work-groups was increased, managing the work-groups caused significant overhead.

Fig. 11. Processing time of ray tracing application with varying number of work-groups.

These results are mainly due to the fact that the number of CUs defined in the execution model is 1. The CU is the processing resource that supports a work-group. In the proposed OpenCL framework, work-items in the same work-group are executed simultaneously on the PEs of a single CU by cloning child processes. If the application is parallelized with too many work-groups on one CU, the creation and removal of the cloned threads for every work-group imposes a burden on the framework. Therefore, the number of work-groups for the target processor should be carefully chosen to achieve high performance in the OpenCL framework.

B. Performance Evaluation for Varying Local Size

In the proposed OpenCL framework, the number of created child processes is the same as the total number of work-items specified by the programmer. Since the local size is the number of work-items that execute OpenCL kernels in parallel, the number of LPEs changes when the local size in a work-group changes. Since the number of PEs in an embedded CPU is much smaller than in other OpenCL devices, programmers should create an appropriate number of child threads. Figs. 12 and 13 show the performance comparison results with different levels of parallelization. For the matrix multiplication application, parallel processing with 4 threads improved the execution time by as much as 57% compared to serial processing; for the ray tracing application, parallel processing with 4 threads improved it by as much as 59%. These results are mainly due to the fact that the number of PEs defined in the execution model is 4. The proposed OpenCL framework shows the best performance when the number of cloned threads equals the number of physical PEs.

Fig. 12. Processing time of matrix multiplication with varying local size.

Fig. 13. Processing time of ray tracing application with varying local size.

C. Performance Comparison for C vs. OpenCL C

It is common to write applications for CPUs in the standard C language, since CPUs are designed to perform well over a wider range of applications than typical OpenCL devices. However, by implementing the OpenCL framework for embedded CPUs, more flexible job assignments become feasible. In this experiment, the execution times of kernel codes written in C are compared with those of kernel codes written in OpenCL C. Tables I and II summarize the execution time comparison when both implementations are parallelized. For the matrix multiplication application, the same parallelization method was applied to the C implementation by creating up to 8 threads with cloning. Due to the OpenCL framework overhead caused by OpenCL object management, when the same number of threads was used in both implementations, the OpenCL C implementation was slightly slower than the C implementation. However, the performance degradation is not significant, and with the proposed OpenCL framework the software can fully utilize the heterogeneous platform, so implementing the proposed OpenCL framework for embedded processors is worthwhile. When serial processing is carried out without parallelization, the OpenCL C code is slower than the C code. However, when 2 or more threads are used to parallelize the execution, the OpenCL C code outperforms the serial C code. For the ray tracing application, the OpenCL C implementation parallelized with 4 threads is 1.56 times faster than the serial C implementation.

TABLE I
MATRIX MULTIPLICATION PROCESSING TIME (WORK-GROUP=1, 2D, 2048X2048)

Thread | C processing time (s) | OpenCL processing time (s) | Speedup
1      | 563                   | 704                        | 0.80
2      | 360                   | 439                        | 0.82
4      | 263                   | 299                        | 0.88
8      | 312                   | 347                        | 0.90

TABLE II
RAY TRACING PROCESSING TIME (WORK-GROUP=1, 2D)
(The C implementation was run serially; speedup is relative to the serial C version.)

Thread | C processing time (Frames/s) | OpenCL processing time (Frames/s) | Speedup
1      | 0.837                        | 0.834                             | 0.98
2      | -                            | 1.204                             | 1.43
4      | -                            | 1.313                             | 1.56
8      | -                            | 1.275                             | 1.52

D. Analysis of Memory Usage and CPU Utilization The amount of memory usage and the utilization rate of the CPU are analyzed using both OpenCL C and C implementations in order to estimate the runtime overhead of the OpenCL framework. As shown in Table III, in the case of the matrix multiplication application, when a single thread was used, the average CPU utilization rates in both cases were 50%. When the number of threads increased, the average CPU utilization rates decreased, implying that the execution times were reduced. The average CPU overhead of the OpenCL C implementation turned out to be 1.75% because the framework allocated the OpenCL objects at runtime.

Table VI compares the memory usage of the ray tracing application. The OpenCL C implementation uses up to 16% more memory than the standard C implementation. This is mainly because the host program of the OpenCL C implementation was written in Java with JOCL, which invokes the standard C library functions and allocates additional memory space for objects. Thus, when the host program is written in Java, the proposed OpenCL framework may incur some memory usage overhead.

TABLE III
AVERAGE CPU UTILIZATION OF MATRIX MULTIPLICATION (WORK-GROUP=1, 2D, 2048x2048)

Thread   C (%)   OpenCL (%)
  1       50        50
  2       30        33
  4       22        25
  8       12        13

Tables IV and V summarize the memory usage of the matrix multiplication application, measured in terms of the Proportional Set Size (PSS). PSS is the total private memory usage of a process plus its prorated share of any shared memory: each shared region is charged to a process divided by the number of processes sharing it, which makes PSS a fairly accurate per-process measure [17]. As the number of threads and the amount of data increased, memory usage grew steadily, as expected. However, the difference in memory usage between the C implementation and the OpenCL C implementation was only 2.28% when computing a 2048x2048 matrix multiplication with 8 threads. Therefore, the memory overhead of the OpenCL framework is quite small.

TABLE IV
PSS OF MATRIX MULTIPLICATION (WORK-GROUP=1, 2D, 1024x1024)

Thread   C PSS (KB)   OpenCL PSS (KB)   Difference (%)
  1        12125           12213             0.72
  2        12371           12512             1.12
  4        12380           12531             1.20
  8        12411           12626             1.70

TABLE V
PSS OF MATRIX MULTIPLICATION (WORK-GROUP=1, 2D, 2048x2048)

Thread   C PSS (KB)   OpenCL PSS (KB)   Difference (%)
  1        49234           49317             0.16
  2        49237           49412             0.35
  4        49245           50124             1.78
  8        49171           50313             2.28
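The PSS definition above — private pages in full, shared pages prorated by sharer count — can be written down directly. The sketch below shows that accounting over a process's mappings; the struct and function names are illustrative, and this models the definition, not any kernel interface.

```c
/* Proportional Set Size for one process: private memory counts fully,
 * each shared mapping is prorated by the number of processes mapping it.
 * A mapping with sharers == 1 is private. Sizes in KB. */
typedef struct {
    unsigned long size_kb;  /* size of the mapping */
    unsigned int  sharers;  /* processes mapping it (>= 1) */
} mapping_t;

unsigned long pss_kb(const mapping_t *maps, int n)
{
    unsigned long pss = 0;
    for (int i = 0; i < n; ++i)
        pss += maps[i].size_kb / maps[i].sharers;  /* prorated charge */
    return pss;
}
```

For example, a process with a 1000 KB private mapping, a 600 KB region shared by 2 processes, and a 900 KB region shared by 3 would be charged 1000 + 300 + 300 = 1600 KB.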

TABLE VI
PSS OF RAY TRACING (WORK-GROUP=1, 2D)

Thread   C PSS (KB)   OpenCL PSS (KB)   Difference (%)
  1         6743           7201               7
  2          -             7549              12
  4          -             7554              13
  8          -             7713              16

VI. CONCLUSION
The OpenCL framework has an advantage over other programming frameworks in that OpenCL code can be executed on heterogeneous platforms such as CPUs and GPUs. However, the study of OpenCL frameworks for embedded multi-core CPUs is still at an early stage. In this paper, an OpenCL framework for embedded multi-core CPUs was proposed. The compilation environment for the host and kernel programs consists of the NDK and SDK for the mobile platform, together with LLVM. The platform model, memory model, and execution model for embedded multi-core processors were defined to implement the OpenCL libraries, and the OpenCL runtime was implemented based on the execution model definition. The implemented OpenCL framework uses a command-queue governor to manage threads and handle parallel processing. In addition, the built-in functions called by OpenCL kernels were implemented.

The performance of the proposed framework was evaluated while varying multiple design parameters, such as the number of work-groups and the sizes of the work-items. Applications parallelized with OpenCL C showed much better performance than those written in C without parallelization, which confirms the effectiveness of the proposed framework. Even though the OpenCL C implementation incurs some performance and memory usage overhead when running on the proposed framework, it can be executed on various heterogeneous platforms, whereas the standard C implementation can be executed only on CPUs. This paper verifies that the proposed framework is very useful for fully utilizing heterogeneous platforms, which is the original motivation of the OpenCL programming framework. Work is ongoing to apply this framework to various computing platforms in addition to CPUs and GPUs.
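To make the command-queue idea concrete, the sketch below shows a FIFO of kernel commands being enqueued and drained, in the spirit of the framework's command-queue governor. All names are illustrative; the real governor hands commands to worker threads, whereas here they run inline to keep the sketch self-contained.

```c
/* Minimal command-queue sketch: enqueued kernel commands are drained
 * in FIFO order. Illustrative only; the actual framework's governor
 * dispatches each command to a worker thread instead of running it
 * inline. */
#define QUEUE_CAP 16

typedef void (*kernel_fn)(void *args);

typedef struct {
    kernel_fn fn;
    void     *args;
} command_t;

typedef struct {
    command_t cmds[QUEUE_CAP];
    int head, tail;  /* ring-buffer indices */
} command_queue_t;

int enqueue(command_queue_t *q, kernel_fn fn, void *args)
{
    if ((q->tail + 1) % QUEUE_CAP == q->head)
        return -1;  /* queue full */
    q->cmds[q->tail] = (command_t){ fn, args };
    q->tail = (q->tail + 1) % QUEUE_CAP;
    return 0;
}

void drain(command_queue_t *q)
{
    while (q->head != q->tail) {  /* FIFO dispatch */
        command_t c = q->cmds[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        c.fn(c.args);
    }
}

/* Example "kernel": increments the integer it is given. */
void incr_kernel(void *args) { ++*(int *)args; }
```

Replacing the inline call in `drain` with a hand-off to a thread pool is where a governor's scheduling policy (which core, how many threads) would plug in.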


REFERENCES
[1] J. Lee et al., "An OpenCL framework for heterogeneous multicores with local memory," in Proc. of the 19th International Conference on Parallel Architectures and Compilation Techniques, pp. 193-204, Sep. 2010.
[2] J. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, pp. 66-73, May 2010.
[3] J. Park, J.-Y. Choi, I. Ryu, and J.-I. Park, "Universal view synthesis unit for glassless 3DTV," IEEE Trans. Consumer Electron., vol. 58, no. 2, pp. 706-711, May 2012.
[4] R. Ferrer et al., "Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL," Languages and Compilers for Parallel Computing, vol. 6584, pp. 215-229, Oct. 2010.
[5] W.-J. Kim, K. Cho, and K.-S. Chung, "Multi-threaded syntax element partitioning for parallel entropy decoding," IEEE Trans. Consumer Electron., vol. 57, no. 2, pp. 897-905, May 2011.
[6] S. H. Jo, S. Jo, and Y. H. Song, "Efficient coordination of parallel threads of H.264/AVC decoder for performance improvement," IEEE Trans. Consumer Electron., vol. 56, no. 3, pp. 1963-1971, Aug. 2010.
[7] C. H. Chan and F. C. Lau, "Parallel decoding of LDPC convolutional codes using OpenMP and GPU," 2012 IEEE Symposium on Computers and Communications, pp. 225-227, Jul. 2012.
[8] J.-Y. Park and K.-S. Chung, "Parallel LDPC decoding using CUDA and OpenMP," EURASIP Journal on Wireless Communications and Networking, vol. 2011, no. 1, Dec. 2011.
[9] Khronos OpenCL Working Group, The OpenCL Specification, Version 1.1, Document Revision 44, 2010.
[10] I. Lőrentz, R. Andonie, and M. Maliţa, "An implementation of evolutionary computation operators in OpenCL," Intelligent Distributed Computing V, vol. 382, pp. 103-113, 2012.
[11] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Computing, vol. 38, no. 8, pp. 391-407, Aug. 2012.
[12] H. Kim and R. Bond, "Multicore software technologies," IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 80-89, Nov. 2009.
[13] J. K. Lee and J. Y. Lee, "Android programming techniques for improving performance," IEEE 3rd International Conference on Awareness Science and Technology, pp. 386-389, Sep. 2011.
[14] C. Lattner and V. Adve, "The LLVM compiler framework and infrastructure tutorial," Languages and Compilers for High Performance Computing, vol. 3620, pp. 15-16, Sep. 2004.
[15] R. Gordon, Essential JNI: Java Native Interface, 1st ed., Prentice-Hall, 1998, pp. 134-136.


[16] F. Jacob, D. Whittaker, S. Thapaliya, P. Bangalore, M. Mernik, and J. Gray, "CUDACL: A tool for CUDA and OpenCL programmers," IEEE International Conference on High Performance Computing, pp. 1-11, Dec. 2011.
[17] C.-K. Hsieh, H. Falaki, N. Ramanathan, H. Tangmunarunkit, and D. Estrin, "Performance evaluation of Android IPC for continuous sensing applications," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 16, no. 4, pp. 6-7, Feb. 2013.

BIOGRAPHIES

Jung-Hyun Hong (M'12) received his B.S. degree in Media Communication Engineering from Hanyang University, Seoul, Korea in 2011. Since 2011, he has been taking a unified M.S. and Ph.D. course at Hanyang University, Seoul, Korea. His research interests include software parallelization, heterogeneous computing, and embedded multi-core architecture.

Young-Ho Ahn (M'13) received his B.S. degree in Computer Science and Engineering from Hanyang University, Seoul, Korea in 2008 and his M.S. degree in Electronics, Computer, and Communication from Hanyang University, Seoul, Korea in 2010. Since 2010, he has been taking a Ph.D. course at Hanyang University, Seoul, Korea. His research interests include system-level power management for embedded systems, low power chip design, and transaction-level design.

Byung-Jin Kim (M'13) received his B.S. degree in Computer Science and Engineering from Hanyang University, Seoul, Korea in 2009 and his M.S. degree in Electronics and Computer Engineering from Hanyang University, Seoul, Korea in 2013. His research interests include software parallelization, operating systems, and embedded multi-core architecture.

Ki-Seok Chung (M'09) received his B.E. degree in Computer Engineering from Seoul National University, Seoul, Korea in 1989 and his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 1998. He was a Senior R&D Engineer at Synopsys, Inc. in Mountain View, CA from 1998 to 2000, and was a Staff Engineer at Intel Corp. in Santa Clara, CA from 2000 to 2001. He also worked as an Assistant Professor at Hongik University, Seoul, Korea from 2001 to 2004. Since 2004, he has been an Associate Professor at Hanyang University, Seoul, Korea. His research interests include low power embedded system design, multi-core architecture, image processing, reconfigurable processor and DSP design, SoC-platform based verification, and system software for MPSoC.