Experiences in Teaching a Specialty Multicore Computing Course



Peter E. Strazdins∗, Research School of Computer Science, The Australian National University, Email: [email protected]

Abstract—We detail the design of, and experiences in delivering, a specialty multicore computing course whose materials are openly available. The course ambitiously covers three multicore programming paradigms: shared memory (OpenMP), device (CUDA) and message passing (RCCE), and involves significant practical work on their respective platforms: an UltraSPARC T2, a Fermi GPU and the Intel Single-Chip Cloud Computer. Specialized multicore architecture topics include chip multiprocessing, virtualization support, on-chip accelerators and networks, transactional memory and speculative execution. The mode of delivery emphasizes the relationship between program performance and the underlying computer architecture, necessitating suitable infrastructure in the form of instrumented test programs and the use of performance evaluation tools. Further infrastructure had to be created to facilitate the safe, convenient and efficient use by students of the GPU and Single-Chip Cloud Computer. The programming assignments, based on the theme of the LINPACK benchmark, also required significant infrastructure for reliably determining correctness and assisting debugging. While the course assumed as background knowledge an introductory computer systems and concurrency course, we found that students could learn device programming in a short time by building on their knowledge of shared memory programming. However, we found that more time is needed for learning message passing. We also found that, provided students had a suitably strong computer systems background, they could successfully meet the course's learning objectives, although the skill of correctly interpreting performance data remains difficult to learn when suitable performance analysis tools are not available.

Keywords-multicore computing, parallel computing, computing education

I. INTRODUCTION

COMP8320 Multicore Computing: Principles and Practice is a specialized multicore computing course available to advanced undergraduate and postgraduate students. It provides an in-depth study of the principles, architecture and programming paradigms of contemporary and emerging multicore (parallel) processors. As such, it is the most advanced and specialized course in the Computer Systems curriculum at The Australian National University. The course ran for the first time in the second half of 2009 and again in 2011, with approximately a dozen students each time.

The course covered three multicore programming paradigms: shared memory (OpenMP), device (CUDA) and

message passing (RCCE – the Rocky Creek Communication Environment), and involved significant practical work on their respective platforms: an UltraSPARC T2, a Fermi GPU and the Intel Single-Chip Cloud Computer. To our knowledge, this was the first time the SCC had been made available to coursework students. The SCC represents a future generation of manycore chips without cache coherency; to date, the only readily available programming environment is RCCE, which provides a simple interface for performing message passing.

In this paper, we discuss key elements that were required to make the course a success: its design philosophy, key learning goals, course structure, and the surprising amount of infrastructure required to support the learning goals. Some of this infrastructure is platform-specific; some is problem-domain specific. In this case, the domain was the LINPACK benchmark, together with dense matrix multiplication.

This paper is organized as follows. The context for the course is described in Section II, its design philosophy is discussed in Section III and its structure is outlined in Section IV. Selected teaching and learning experiences are described in Section V, architectural insights arising from the assignments are discussed in Section VI, and the infrastructure developed to support the course is described in Section VII. Finally, conclusions are given in Section VIII.

II. CONTEXT

The relevant curriculum for the course begins with COMP2300 Introduction to Computer Systems, which introduces instruction sets and computer architecture (for both illustrative and contemporary processors). Concurrency issues are introduced in the course COMP2310 Concurrent and Distributed Systems. These two courses comprise the assumed knowledge for the multicore computing course.

Our Multicore Computing course must necessarily specialize in multicore aspects due to the existence of two pre-existing courses. COMP3320 High Performance Scientific Computation covers data modelling, programming (including shared memory models) and program performance issues for high performance scientific applications, while COMP4300 Parallel Systems provides a practically oriented introduction to the special case of concurrency on shared and distributed memory parallel computers. These courses are also available to both undergraduates and postgraduates. As with the Multicore Computing course

itself, they are only offered every second year. For this reason, they cannot be used as prerequisites. There is, however, a potential problem of overlap between these courses (for example, Graphics Processing Units and OpenMP programming are taught in all three courses, although not to the same depth as in COMP8320). In some sense, the Multicore Computing course forms a 'capstone' of the computer systems curriculum, covering the most advanced aspects of computer architecture, concurrent programming and computer performance themes.

III. DESIGN PHILOSOPHY

The course was designed to prepare advanced undergraduate and postgraduate students for the rapidly unfolding future of multicore (sometimes termed manycore) computing. The material is based on relevant research and practice of the Computer Systems group at the ANU, and the method of delivery is termed research-based education [1]. The student numbers are intended to be small in order to facilitate more personal contact with the instructor, using the Cognitive Apprenticeship approach, where the instructor's experiences are passed on to the students as they undertake similar activities [2].

While the course emphasizes computer architecture and operating system concepts, it does so from the perspective of a programmer undertaking the task of writing correct, portable and efficient programs. A key goal is to teach how architectural effects relate to changes in program performance when a program or system parameter is varied. For this reason, infrastructure must be provided to the students (mainly in the form of instrumented test programs) in order to explore these effects. To better achieve this aim, the course also has a strong emphasis on the use of software tools (e.g. profilers, instrumented runtimes), which can provide more detailed insights.

IV. COURSE STRUCTURE

The course comprises ten modules, each consisting of a two-hour lecture and associated activities, as illustrated in Table I. Each module also has a two-hour tutorial and/or laboratory session. All lectures, except those on Advanced OpenMP, Synchronization and Software Engineering, deal with specific multicore (as opposed to general parallel computing) issues. The tutorials typically involve discussions of relevant papers and articles, and/or problem-solving exercises. The first 6 laboratory exercises use an UltraSPARC T2; however, they can easily be adapted to any other processor with multiple cores and hardware threading.

The course assessment in 2011 comprised three assignments, worth 20%, 15% and 10% respectively, each of which involved programming, performance evaluation and performance analysis activities. The assignments were based on a common theme: optimizing the LINPACK benchmark,

which prominently featured optimizing a matrix multiply kernel. Modules 1–7 were scheduled in the first half of the 13-week semester, in order to give more time for the second and third assignments.

In 2009, instead of the modules on GPUs, the SCC and Outlook (and the associated assignments), we ran small-group mini-projects on Image Registration, both on GPU and Cell Broadband Engine systems, and on SPARC Haskell. Here, the relevant project supervisor gave an overview in the first lecture, and the student groups presented their projects' 'theory' in the second and their practical results in the third. While this gave the students an in-depth experience of a multicore computing paradigm, we found that the student-generated material was not sufficiently coherent to base exam questions on. Student feedback was also critical, citing the high workload and the stress of presenting. For these reasons, we abandoned this approach for 2011.

Feedback from 2011 indicated that the above structure worked well, but that more time is needed for the GPU and SCC topics, especially given their weight in the assessment. The Software Engineering module proved the least popular, and replacing it with more material related to GPUs and the SCC (especially programming related) is a suggested improvement.

V. TEACHING AND LEARNING EXPERIENCES

This section describes salient experiences in the delivery of the course; these fell mainly under the categories of programming paradigms and architectural insights.

A. Programming Paradigms

In 2011, the course covered three different parallel programming paradigms: shared memory (OpenMP), device (CUDA), and message passing (the Rocky Creek Communication Environment – RCCE). The latter is a relatively simple message passing library built for the Intel Single Chip Cloud Computer (SCC) [10], [9]. Covering these three paradigms in a single course was ambitious, considering that experience in the latter two could not readily be assumed, and that a whole course can easily be devoted to just one paradigm (e.g. a course teaching CUDA and GPUs specifically [8]). However, by comparing and contrasting with the students' prior experience in OpenMP, it was found that the students could pick up CUDA programming very quickly. For example, Figure 1 illustrates how CUDA programming may be introduced to students with an OpenMP background¹. By and large, the students were able to successfully code kernels for the components of the LINPACK benchmark, although they were somewhat frustrated at the difficulty of approaching the performance of NVIDIA's CUBLAS library.

¹While the creation of a single thread block here is not efficient for a GPU, this gives the simplest analogy.

Table I
COURSE MODULES AND ASSOCIATED REFERENCES AND LEARNING ACTIVITIES FOR 2011

      module                                                          main references
   1  Advent of Multicore                                             [3], [4], [5]
   2  Multicore Architecture and the T2                               [6], [5]
   3  Advanced OpenMP Programming                                     [5]
   4  Performance Issues; Synchronization                             [5], [7]
   5  Software Engineering for Multicore                              [4]
   6  Operating System Issues and Virtualization                      [6]
   7  Graphics Processing Units                                       [8]
   8  On-chip Networks and the Single-chip Cloud Computer             [9]
   9  Transactional Memory and Speculation; Heterogeneous Multicore   [7], [3]
  10  Outlook (Manycore) and Review                                   -

      tutorial:    1, 2, 3, 4, 5, 6, 7, 8
      laboratory:  1, 2, 3, 4, 5, 6, 7, 8
      assignment:  1, 1, 1, 2, 3


    CUDA:
        reverse<<<1, N/2>>>(a_d, N);
        ...
        __global__ void reverse(int *a, int N)
        {
            int idx = threadIdx.x;
            int v = a[N-idx-1];
            a[N-idx-1] = a[idx];
            a[idx] = v;
        }

    OpenMP:
        #pragma omp parallel num_threads(N/2) \
                default(shared)
        {
            int idx = omp_get_thread_num();
            int v = a[N-idx-1];
            a[N-idx-1] = a[idx];
            a[idx] = v;
        }

Figure 1. Illustrating CUDA programming using OpenMP
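For completeness, the CUDA half of Figure 1 can be fleshed out into a small self-contained program (our illustration rather than the course's lab code; the array size N = 16 and the omission of error checking are our choices, and the single-block launch follows footnote 1):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* one thread per swap: thread idx exchanges a[idx] with a[N-idx-1] */
    __global__ void reverse(int *a, int N)
    {
        int idx = threadIdx.x;            /* plays the role of omp_get_thread_num() */
        int v = a[N - idx - 1];
        a[N - idx - 1] = a[idx];
        a[idx] = v;
    }

    int main(void)
    {
        const int N = 16;
        int a[N], *a_d;

        for (int i = 0; i < N; i++) a[i] = i;

        cudaMalloc((void **)&a_d, N * sizeof(int));
        cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);

        reverse<<<1, N / 2>>>(a_d, N);    /* a single thread block, as in footnote 1 */

        cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(a_d);

        for (int i = 0; i < N; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }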

However, we found that more time than was available in a single module was required for learning message passing to the same extent, and the programming exercises for the third assignment had to be made more limited accordingly.

A question on the third assignment asked students to comment on the programmability (the relative ease of designing, implementing and debugging programs) of the three paradigms. Rather surprisingly, they felt that CUDA was the hardest, due to the lack of debugging facilities. The availability of a GPU emulator, which permits the use of print statements, would have been the most useful aid, for in such a context there is not the time to learn specific debugging tools.

VI. ARCHITECTURAL INSIGHTS

Basing the assignments on a single theme had benefits in that students can amortize the cost of learning the problem domain over the whole course, and it provides a commonality in the infrastructure given to the students. It also helps facilitate a comparative experience of the three programming paradigms. The LINPACK benchmark was chosen due to the instructor's long experience with that computation. The downside was that the students had little experience with this class of problem, and reported "getting sick of LINPACK" after the second assignment.

LINPACK is a highly regular computation. The first assignment revealed that when such a computation is parallelized under a fixed decomposition scheme, destructive sharing affected performance on a platform with a high degree of chip multiprocessing/multithreading, such as the UltraSPARC T2. After profiling with the Solaris analyzer tool, about half the students could surmise that this was causing the unexpected loss of performance as the number of software threads increased beyond 32; this could be inferred from an unexpectedly long amount of time spent in barrier routines.

The vendor-supplied code for this computation (dgetrf()) did not suffer from this problem: the analyzer tool revealed that it did not spend significant time in barriers at all (and indeed made no significant use of atomic instructions). From this, it can be inferred that this code used dynamic partitioning methods, probably in conjunction with work-stealing techniques [7].
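The contrast can be sketched in OpenMP as follows (our illustration of the general point rather than the course's or the vendor's code; the routine names, the column-major layout with leading dimension ld, and the blocking factor nb are assumptions):

    /* Fixed decomposition: one contiguous range of columns per thread.
     * The slowest thread determines when the implicit barrier at the end
     * of the loop is passed, and blocks whose boundaries share cache
     * lines can suffer destructive (false) sharing. */
    void update_fixed(double *C, const double *A, const double *B,
                      int m, int n, int k, int ld)
    {
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n; j++)          /* trailing-matrix update C -= A*B */
            for (int p = 0; p < k; p++)
                for (int i = 0; i < m; i++)
                    C[j*ld + i] -= A[p*ld + i] * B[j*ld + p];
    }

    /* Dynamic partitioning: column blocks of width nb are handed out to
     * threads on demand, in the spirit of a dynamically partitioned or
     * work-stealing library routine. */
    void update_dynamic(double *C, const double *A, const double *B,
                        int m, int n, int k, int ld, int nb)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int jb = 0; jb < n; jb += nb) {
            int jend = (jb + nb < n) ? jb + nb : n;
            for (int j = jb; j < jend; j++)
                for (int p = 0; p < k; p++)
                    for (int i = 0; i < m; i++)
                        C[j*ld + i] -= A[p*ld + i] * B[j*ld + p];
        }
    }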

For the CUDA programming assignment, one version of the codes, which used an instructor-provided matrix multiply kernel, suffered from a strange CUDA error when the storage for the matrix on the device was released. The kernel was identified as the cause, with a subtle buffer overwrite, possibly introduced by the compiler, being suspected. This was set as a 'challenge problem' to the students, with one student finding the cause: the kernel read past the end of the buffer². This anecdote highlights that, in running such a course, very difficult problems may often be encountered, by both the instructor and the students.

²It is possible that the over-read brought stale data into the GPU's caches.

While the programming exercise in the SCC assignment had to be relatively limited, the test program provided a number of options which illustrated subtle effects of the communication architecture on performance. Figure 2 illustrates the setup for the multiply. In general, a P × Q process grid could be used. The algorithm involved broadcasting the A matrix across the grid, then broadcasting the B matrix down the grid, and performing a local multiply. Command line parameters for the test program could be used to set P and Q, select a tree-based, a ring-based or the provided RCCE broadcast, and time the computation over a series of iterations.

Figure 2. Distributed Matrix Multiply for the SCC for a 4 × 4 Process Grid
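The structure of the computation can be sketched as follows (a per-process view under assumptions of ours: row-major local blocks, and broadcast placeholders standing in for whichever of the tree-based, ring-based or native RCCE broadcasts the test program selects; this is not the RCCE API or the course's code):

    /* Per-process sketch of the distributed multiply on a P x Q grid.
     * A_row : this process's (M/P) x K  block-row of A     (row-major)
     * B_col : this process's  K x (N/Q) block-column of B  (row-major)
     * C_blk : this process's (M/P) x (N/Q) block of C, assumed zeroed */
    typedef void (*bcast_t)(double *buf, int count);

    void grid_multiply(int P, int Q, int M, int N, int K,
                       double *A_row, double *B_col, double *C_blk,
                       bcast_t bcast_across_row, bcast_t bcast_down_column)
    {
        int mb = M / P, nb = N / Q;

        bcast_across_row(A_row, mb * K);     /* broadcast A across the grid */
        bcast_down_column(B_col, K * nb);    /* broadcast B down the grid   */

        for (int i = 0; i < mb; i++)         /* local multiply: C += A * B  */
            for (int p = 0; p < K; p++) {
                double a = A_row[i*K + p];
                for (int j = 0; j < nb; j++)
                    C_blk[i*nb + j] += a * B_col[p*nb + j];
            }
    }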

The following questions from the SCC assignment are illustrative of the course's goal of teaching how architectural effects relate to changes in program performance, and of the resulting learning experience.

Question 6 concerned the effect on floating point speed of repeating the matrix multiply computation for r repetitions. The following table shows an example experiment that students performed on a 1 × 8 process grid, in this case for a 1000 × 1000 × 48 multiply:

    broadcast    MFLOPS (r = 1)    MFLOPS (r = 10)
    RCCE         107               164
    ring         106               170
    tree         140               172

The question asked: "Try to explain the difference in performance for the 2 values of r in each case. If possible, devise further experiments to test your conjectures." Rather surprisingly, all students failed to give the correct explanation, most suggesting that it was due to warming of the caches from the first repetition. A further experiment to test this would be to use a 1 × 1 grid with a 1000 × 125 × 48 multiply: in this case, r had no significant impact on speed. The correct answer is that the difference is due to pipelining effects (this is itself rather surprising, since communications on the SCC are synchronous, which would seem to limit this effect). Experiments showing that the effect increases as Q is increased confirm this as the cause.

The students did answer Question 7 better; it asked them to explore the effect of grid aspect ratio on parallel multiply speed (the SCC has 48 cores, which is very convenient for this purpose!). The question asked: "What aspect ratio is optimal, and why? Hint: consider the amount of data flowing across individual network links." Most students explained correctly the arguably counterintuitive result that the (near-)square ratios gave the best performance, because these ratios minimize the communication volume.
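One way of making the hint quantitative (a back-of-envelope sketch in our own notation, not taken from the assignment's model answer): for an M × N × K multiply on a P × Q grid with PQ fixed (48 on the SCC), the broadcast volumes are roughly

    \begin{align*}
      V_A &\approx M K\,(Q-1), &
        &\text{each grid row's $\tfrac{M}{P}\times K$ piece of $A$ reaching $Q-1$ peers,}\\
      V_B &\approx K N\,(P-1), &
        &\text{each grid column's $K\times\tfrac{N}{Q}$ piece of $B$ reaching $P-1$ peers,}\\
      V_A + V_B &\approx K\,(M Q + N P) \;\ge\; 2 K \sqrt{M N P Q},
    \end{align*}

with equality when P/Q = M/N; for the (near-)square matrices used here (M ≈ N) the volume is therefore minimized by a (near-)square process grid.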

The SCC currently lacks any profiling tools for its cores, which made the interpretation of performance data more difficult for the students.

VII. INFRASTRUCTURE SUPPORT

Such a course requires a considerable degree of infrastructure so that students can concentrate efficiently on their programming exercises and on interpreting their experimental data, and can debug their programs reasonably efficiently, without having to deal directly with the complexities of the facilities or needing to be experts in the problem domain. This section details the support given, which can be divided into problem-domain-specific and platform-specific categories.

A. Problem Domain Related

The principal learning activity in the assignments was optimizing components of the computation, analyzing the effects of these optimizations, and varying algorithm- and architecture-related parameters. For example, one of the test programs provided for Assignment 2 had the following synopsis:

    runXe ./linpack [-p] [-b NB] [-v v] [-w W] [-k] [-a info] N

The test program and support library infrastructure needed to be provided so that students could vary parameters (e.g. the LINPACK blocking factor NB, the CUDA thread block size W and the linear system size N) and select optimizations (the -v, -k and -a options). The test programs also reported the benchmark's execution time / speed, which acted as a single measurement. As a moderately large number of these were needed over an assignment, students were encouraged to nest the commands within for loops.

Preserving correctness in optimizations was essential. The test program not only rigorously checked for correctness, but also provided limited debugging support. The -p option printed out the input and result data, although in the case of LINPACK this is, in most situations, not very helpful in determining the cause of an error.

For the CUDA assignment, support was needed for debugging the main task, an optimized matrix multiply kernel, in isolation. In the context of LINPACK, buffer overwrites by the kernel would result in an incorrect result (or even stranger errors). Thus, the test program mimicked the memory layout of the LINPACK program, and checked for buffer overwrites as well as for the correctness of the result itself. The -p option was made more useful for debugging by setting the initial matrix elements to be simple linear functions of their indices (the result matrix would then have a similar pattern). However, due to the very large linear

system sizes required for good GPU performance (up to 8192), this created a problem: result matrix elements grew so large that roundoff errors became comparable to algorithmic errors, making the result-checking method unreliable. This was particularly problematic when the codes were compiled for single precision. The solution was to limit the result to exact integers, which can be done by setting the matrix elements for an m × n × k multiply as follows:

    \[ a_{i,j} = r_i\,(i \bmod M(r_i, k)) + r_j\,(j \bmod M(r_j, k)) \qquad (1) \]

    \[ M(r, k) = \frac{2^{W/2}}{(r+1)\,k} \qquad (2) \]

where W is the mantissa width, and we chose r_i = 2, r_j = 1.
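In code, the scheme might look as follows (our reconstruction: the helper names, the choice W = 53 for double precision and the naive O(k) reference sum are assumptions rather than the course's actual checker):

    #include <math.h>

    /* M(r,k) of Equation (2): 2^(W/2) / ((r+1) k), with W the mantissa width */
    static int modulus(int r, int k, int W)
    {
        return (int)(pow(2.0, W / 2.0) / ((r + 1) * k));
    }

    /* Equation (1) with r_i = 2, r_j = 1: element (i,j) of the input
     * matrices of an m x n x k multiply */
    static double elem(int i, int j, int k, int W)
    {
        const int ri = 2, rj = 1;
        return ri * (i % modulus(ri, k, W)) + rj * (j % modulus(rj, k, W));
    }

    /* Exact expected value of C[i][j]: a naive O(k) sum per element here,
     * whereas a closed form permits an O(mn) check of the whole matrix
     * (see the following paragraph) */
    static double expected(int i, int j, int k, int W)
    {
        double s = 0.0;
        for (int p = 0; p < k; p++)
            s += elem(i, p, k, W) * elem(p, j, k, W);
        return s;
    }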

This method was used for checking the multiply in the third assignment as well, but the checking method, comparing the computed result with that of a known-to-be-correct serial algorithm, was no longer acceptably efficient on the SCC: gathering the result on core 0 and comparing it with a serial algorithm took over a minute for a 10 s parallel multiply! The solution was to derive a formula generating the values of the result matrix in O(mn) time, utilizing the fact that the initial values are determined by Equation 1, and to perform a distributed check (this was very fast!).

B. Platform Related

To obtain reliable measurements, or even to use a multicore 'device' safely, mutually exclusive access to the 'device' must be ensured. This is an issue in a course with even a moderate number of students, who expect to get access whenever they want (and typically intensively as an assignment deadline approaches!). On a traditional multicore processor, the operating system mediates access to the cores; by keeping the runtime of each measurement relatively small (normally well under 10 s), mutual exclusion is to an extent facilitated by the operating system (with the students being advised to repeat measurements if they suspect interference). This section details infrastructure developed for a GPU accessed via batch processing, and for the SCC.

The NCI National Facility [11] provided access to Fermi GPUs for the course through its Xe cluster. Even though only single-node usage was required, experiments had to be submitted to a batch system, as this ensured mutual exclusion on the device. After a number of such experiments (or debugging runs), this quickly becomes unwieldy, since program output and job summaries appear some time later in separate files, and users have to keep track of job ids to identify the appropriate files. Due to the relatively low usage of the Xe cluster, single-node jobs were typically dispatched quickly, normally within 30 s. To take advantage of this, a front-end was written to hide the PBS batch system from the students (this is the runXe script mentioned in Section VII-A). This script submitted the job,

and, provided it completed within the expected time limit, sent program output and job completion information to the terminal. This provided pseudo-interactive access to the GPUs, enhancing student productivity and experience. Programs to collate and summarize the salient output of the cvp profiler were also provided, in order to expedite students' interpretation of the profiler output on their kernels' performance.

The Single-chip Cloud Computer is effectively a single-user machine: as users log into the cores or launch parallel jobs on them, all associated processes run as the root user, and only one process at any one time can access the message passing buffers required by all parallel jobs. We found that interfering jobs can not only lock each other up, but sometimes make the SCC cores crash entirely! This could happen even when a single user repeatedly submitted jobs, because occasionally jobs do not terminate cleanly. Using such a facility for assessable work under a given deadline is therefore extremely problematic.

Students were provided with a submission script called rccerun_h which not only abstracted various inessential details of the provided submission system, but also tried to ensure fair and safe access to the facility. It used a system of lockfiles: an overall lockfile managed by the script itself, and per-core lockfiles managed by the job's processes on the SCC. When a new job was submitted, the script determined whether an existing job was running; it then waited until that job either finished or exceeded its time limit (in which case it killed the job). If the submitted job exceeded its time limit, or the user sent it a kill signal, its respective processes on the SCC were identified and killed.

We found that as soon as several students used the SCC simultaneously, it would nevertheless still lock up within a few hours. As it proved impossible to reproduce the cause of this, we tried purging all 'unwanted' processes (those which did not appear to be usual system processes) from the SCC cores before each job was submitted. While this seemed an aggressive and risky strategy, it worked: from that point on, student jobs no longer interfered with each other and no rebooting was needed.
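The overall-lockfile idea can be illustrated with a minimal sketch (ours, not the actual rccerun_h, which is a script with its own paths, timeouts and per-core lockfiles):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Atomically create the overall lockfile; this fails if another job holds it. */
    int acquire_lock(const char *path)
    {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return -1;                       /* another submission holds the SCC */
        dprintf(fd, "%d\n", (int)getpid());  /* record the owner, for stale-lock cleanup */
        close(fd);
        return 0;
    }

    void release_lock(const char *path)
    {
        unlink(path);
    }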

VIII. CONCLUSIONS

Our specialty multicore computing course, covering three distinct programming paradigms, used the research-based education and cognitive apprenticeship approaches, and had as a key learning goal understanding how architectural effects relate to changes in program performance when an algorithm or system parameter is varied. Its modular structure, with three assignments comprising programming, performance evaluation and performance analysis activities, supported this goal well, with all but one student passing the course in each year it was run.

Suitable infrastructure in the form of instrumented test programs and the use of performance evaluation tools were also needed to support this goal. Further infrastructure had to be created to facilitate the safe and efficient use by students of the GPU and Single-Chip Cloud Computer. The programming assignments, based on the theme of the LINPACK benchmark, also required significant infrastructure for reliably determining correctness and assisting debugging.

While the course assumed as background knowledge an introductory computer systems and concurrency course, it was found that students could learn device programming in a short time by building on their knowledge of shared memory programming, although we found more time was needed for learning message passing. We found that, provided students had a suitably strong computer systems background, they could successfully meet the course's learning objectives, although the skill of correctly interpreting performance data remains difficult to learn when suitable performance analysis tools are not available.

The resulting infrastructure and other materials for this course have the potential to be useful for other multicore and parallel computing courses using similar platforms and assignment themes. They are freely available at the course's web site: http://cs.anu.edu.au/student/comp8320

ACKNOWLEDGMENTS

The author thanks Sun Microsystems (now Oracle) for the donation of the UltraSPARC T2 and Intel Corporation for the donation of the Single-chip Cloud Computer. We thank Steve Hanley for configuring the T2. We also thank Jie Cai and Judy Jenkinson of the NCI National Facility for administrative and technical support on the Xe cluster, and the Facility itself for the usage of the cluster. Finally, we thank Daniel Frampton for technical support on the SCC.

REFERENCES

[1] A. Jenkins, R. Breen, and R. Lindsay, Reshaping Teaching in Higher Education: Linking Teaching with Research. UK: Kogan Page, 2003.
[2] A. Collins, J. S. Brown, and A. Holum, "Cognitive Apprenticeship: Making Thinking Visible," American Educator, Winter 1991, http://www.21learn.org/arch/articles/brown_seely.html.
[3] K. Olukotun, L. Hammond, and J. Laudon, Chip Multiprocessor Architecture. Morgan & Claypool, 2007.
[4] C. Hughes and T. Hughes, Professional Multicore Programming: Design and Implementation for C++ Developers. Wiley, 2008.
[5] D. Gove, Multicore Application Programming. Addison-Wesley, 2011.
[6] Sun Microsystems, "Slidecast: All About OpenSPARC," http://www.opensparc.net/publications/presentations, 2008.
[7] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.
[8] D. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors. Morgan Kaufmann, 2010.
[9] T. Mattson, "The Future of Many Core Computing: A tale of two processors," (lecture slides), 2010.
[10] Intel Corporation, "Many-core Applications Research Community," http://communities.intel.com/community/marc, 2012.
[11] "The NCI National Supercomputing Facility," http://nf.nci.org.au/, 2012.