OpenMP in Multicore Architectures

Venkatesan Packirisamy, Harish Barathvajasankar

Abstract

OpenMP is an API (application program interface) used to explicitly direct multi-threaded, shared-memory parallelism. With the advent of multi-core processors, there has been renewed interest in parallelizing programs. Multi-core processors support executing threads in parallel while keeping the cost of communication between threads very low, which opens up new domains in which to extract parallelism. The aim of our project is to study how OpenMP could be used in multi-core processors. We investigated this in two parts - extracting fine-grained parallelism and extracting speculative parallelism. The report concentrates mainly on the second part - speculative parallelism - though we offer some insights into fine-grained parallelism. Our results show that specifying hints through OpenMP directives is important for exposing parallelism. We also propose a hardware technique that could improve the performance of thread-level speculation.

1 Introduction

OpenMP is an API (application program interface) used to explicitly direct multithreaded, shared memory parallelism. OpenMP was introduced in 1997 to standardize programming extensions for shared memory machines and has been widely used since then to parallelize scientific programs. In OpenMP the user specifies the regions of the code that are parallel, along with the necessary synchronization - locks, barriers, etc. - to ensure correct execution of the parallel region. At runtime, threads are forked for the parallel region and are typically executed on different processors sharing the same memory and address space.

Recently many chip manufacturing companies have announced multi-core processors [1][2]. In a multi-core processor, each die has several processing elements that typically share a common L2 cache; the die acts as a traditional multiprocessor system, except that the processors now share the L2 cache. The advantage is that we can execute different threads on these processing elements with a very low communication cost between them, since they share the L2 cache. Another advantage of having multiple cores is that they can be used to extract thread level parallelism from a single program and hence increase its performance. A lot of research has been done in this area: many techniques rely on hardware based mechanisms [3][4] and some depend on the compiler to extract the threads [5][6].

So we have a system which can be used as a traditional multiprocessor with minimal communication cost and which can also be used to improve single thread programs. OpenMP has so far been used only in traditional multiprocessor environments, and since multi-core processors are very similar to traditional multiprocessors, it is natural to consider OpenMP on multi-core processors. A multi-core processor can be used either as a traditional multiprocessor or to implement TLS, so OpenMP can be applied in both ways:

Fine-grained parallelism. Because of the minimal communication cost in multi-core processors, we could parallelize loops that suffered from high communication cost in traditional multiprocessors, including inner loops that were previously not parallelizable because of their communication cost and small size. OpenMP can be extended with the additional directives needed to support this fine-grained parallelism.

Speculative parallelism. Multi-core processors can be used to improve single thread performance through thread level speculation (TLS). OpenMP can serve as a way for the user to specify hints to the compiler and hardware, exposing the parallelism in the speculative threads.

The rest of the paper is organized as follows: section 2 discusses related work; section 3 discusses the potential for using OpenMP to extract fine-grained parallelism; section 4 describes the simulation environment used in the study of speculative parallelism; section 5 discusses how OpenMP can be used for speculative parallelization; section 6 presents the parallelization framework we used to parallelize the loops; section 7 presents results comparing our parallelized loops with the loops parallelized by the compiler [5]; section 8 discusses possible hardware optimizations; section 9 concludes; and section 10 discusses future work.

2 Related Work

Using OpenMP to extract multi-level parallelism is studied in [7], where OpenMP directives were proposed for expressing multiple levels of parallelism. We would want a similar technique for multi-core environments, where, say, the outer loop is parallelized across processors and the inner loop is parallelized across the processing elements inside each processor. [8] tried to apply speculative techniques to improve OpenMP programs: there, threads do not always (depending on the OpenMP hint) wait at synchronization points, so violations can occur; these are detected and the offending thread is squashed. In our report we use OpenMP to improve the thread level parallelism of integer programs which are not inherently parallel.

Thread level speculation has been widely studied as a technique for extracting thread level parallelism from sequential programs. The threads are extracted either in hardware [9][10][11][15] or in software [12][13][16] and are executed in parallel. The threads are not always independent, so dependence violations are typically checked for in hardware, and the offending thread is squashed and restarted. In [6] the compiler is used to schedule the instructions within the threads so as to reduce the impact of inter-thread data dependences; the code generated using this technique is what we compare against the code we generate. In [12] pipelined execution of threads is studied. The parallelization framework we use is very similar to thread pipelining, except that our method is more general and more easily applicable to different loops; also, in our case all compiler optimizations [8] remain applicable. In [13][14] hardware based synchronization mechanisms have been studied to improve the performance of synchronization operations. In this report we propose hardware based update mechanisms which are very similar in principle to these mechanisms. [16] used manual parallelization to improve the performance of speculative multithreading. Our report does not modify the basic algorithm design in the code; it only exposes parallelism to the compiler and hardware by providing hints.

3 Fine-grained parallelism

As the first possibility for using OpenMP on multi-cores, we tried to extract parallelism from the inner loops of the SpecOMP benchmarks. However, we were not able to use the OpenMP compiler [15] to compile the SpecOMP benchmarks, so we could not proceed very far in this direction. We instead studied the possible parallelism in the SpecOMP benchmarks by manually inspecting the code. In the benchmark equake, we found that OpenMP had been used to parallelize 14 of the major loops. There were some inner loops, and also some loops that had not been parallelized with OpenMP. But when we looked at these loops more closely, we found that the loops that were not parallelized have small iteration counts, and the inner loops have very small loop bodies. We were not able to analyze the FORTRAN based benchmarks because of our limited familiarity with FORTRAN. Due to the apparent lack of potential and the difficulty in understanding and compiling the SpecOMP code, we abandoned this part and focused on the other part - speculative parallelism. The rest of the paper concentrates on this main focus - speculative parallelization of integer programs using OpenMP directives. First we describe the simulation methodology and the basic multi-core architecture used for this project.


4 Simulation Methodology

For our study we used a SimpleScalar-based chip multiprocessor (CMP) simulator. The basic configuration of the simulator is given in fig. 1.

Figure 1: Basic Configuration

4.1 Basic Architecture

In the simulated CMP architecture we have 4 processing elements, each with the basic configuration stated above. Each processing element has a private L1 cache, and all of them share the L2 cache. There is also a signal table for each processing element, which is used for synchronization to avoid squashing. Each processing element has a store buffer of 2K entries and a load-address buffer of 2K entries. When a speculative thread executes on a processor, its store values are placed in the store buffer, and when a load is issued, the load address is recorded in the load-address buffer. For each store in a thread, the load-address buffers of the subsequent (more speculative) threads are checked for a dependence violation. If a violation is detected, the violating thread and all subsequent threads are squashed, and execution is restarted from the first violating thread.
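As a concrete illustration, the following is a minimal sketch, in our own simplified C, of the cross-thread dependence check described above; the data structures and the squash_from() helper are ours, not the simulator's actual code.

    #include <stdint.h>

    #define NUM_PE      4
    #define BUF_ENTRIES 2048

    struct thread_bufs {
        uint64_t load_addr[BUF_ENTRIES];   /* addresses of speculative loads       */
        int      num_loads;
        uint64_t store_addr[BUF_ENTRIES];  /* speculative stores held back from L2 */
        int      num_stores;
    };

    static struct thread_bufs pe[NUM_PE];

    extern void squash_from(int tid);      /* hypothetical: squash tid and all later threads */

    /* thread `tid` (0 = least speculative) performs a store to `addr` */
    void on_store(int tid, uint64_t addr)
    {
        pe[tid].store_addr[pe[tid].num_stores++] = addr;   /* overflow handling omitted */

        /* a more speculative thread that already loaded this address read a stale value */
        for (int t = tid + 1; t < NUM_PE; t++)
            for (int i = 0; i < pe[t].num_loads; i++)
                if (pe[t].load_addr[i] == addr) {
                    squash_from(t);        /* restart from the first violating thread */
                    return;
                }
    }

    /* thread `tid` performs a load from `addr` */
    void on_load(int tid, uint64_t addr)
    {
        pe[tid].load_addr[pe[tid].num_loads++] = addr;
    }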

4.2 Tool used

One of the major parts of our project is analyzing the different benchmark programs to find scope for the user to specify hints. We used a SimpleScalar-based tool to study the loops and identify frequently occurring data dependences between threads. Though other tools based on dependence profiling and SCCs were available, we chose to implement our own tool because the others do not report which PC addresses actually cause the dependence. We need this information because we analyze the benchmarks manually and have to find which source statements cause the dependence violations. To use the tool, we simply mark the loop we want to study, and the tool prints out the PC addresses of store-load dependence pairs.
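The sketch below shows the kind of bookkeeping such a tool performs; it is our own simplification (a per-address table of the last store), not the actual tool, and the function names are ours.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 65536

    struct last_store {
        uint64_t addr;       /* memory address written           */
        uint64_t store_pc;   /* PC of the store instruction      */
        long     iter;       /* loop iteration that performed it */
        int      valid;
    };

    static struct last_store table[TABLE_SIZE];

    static unsigned hash_addr(uint64_t addr) { return (unsigned)((addr >> 3) % TABLE_SIZE); }

    /* called for every store executed inside the marked loop */
    void profile_store(long iter, uint64_t pc, uint64_t addr)
    {
        struct last_store *e = &table[hash_addr(addr)];
        e->addr = addr; e->store_pc = pc; e->iter = iter; e->valid = 1;
    }

    /* called for every load executed inside the marked loop; a hit on an
       address stored by an earlier iteration is a cross-iteration dependence */
    void profile_load(long iter, uint64_t pc, uint64_t addr)
    {
        struct last_store *e = &table[hash_addr(addr)];
        if (e->valid && e->addr == addr && e->iter < iter)
            printf("dep: store pc=%#llx -> load pc=%#llx (distance %ld)\n",
                   (unsigned long long)e->store_pc, (unsigned long long)pc,
                   iter - e->iter);
    }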


Figure 2: CMP Architecture

4.3 Benchmarks studied

We aimed to study 6 of the Spec2000 integer benchmarks. Though we analyzed all 6 benchmarks (fig. 3), we present results for only 3 of them. For the remaining benchmarks we provide insights into the behavior of their loops and how they could be parallelized.

Figure 3: Benchmark Description

4.4 Simulation methodology

Since we had to run the simulations many times while tuning for performance, we worked only with the test input sets. For all three benchmarks for which results are presented, the results are for the test input sets.

5 Speculative parallelization

In this section we discuss how directives given by the user can improve the performance of the hardware and compiler mechanisms used for thread level speculation. First, in section 5.1, we look at what kinds of directives are possible. Section 6 then presents a framework for exposing parallelism, section 6.6 discusses how the different benchmarks we studied fit into this framework, and section 6.7 compares our framework with the related work.

5.1 OpenMP directives

There have been many hardware and compiler based techniques proposed to improve thread level speculation. Compiler techniques aim to reduce the complexity of the hardware by moving some of that complexity into the compiler, and the compiler, with its global view of the program, is well placed to extract speculative threads. Even the hardware based techniques usually assume some form of compiler support for selecting the region or for specifying synchronization. But the compiler lacks run time information. One way to solve this problem is to allow the user to specify hints that help both the compiler and the architecture. The possible directives from the user can be broadly classified into three categories.

Identifying regions to parallelize. One of the major decisions for the compiler or hardware is which loop or region to parallelize. This is a hard decision, usually made by considering the execution time of the loop and its dependence properties. A user (or a profile-driven optimizer) is in a better position to make this decision: the user can simply identify the loop and communicate it through a directive. This is very similar to the OpenMP parallel directive, except that the loop is now only speculatively parallel.

Changing the loop body. After the loop is identified, the hardware can execute its threads on different processors simultaneously, but this may cause frequent squashes due to dependence violations. The hardware could selectively synchronize on some dependences; this usually requires compiler support, and the compiler can also schedule the code to reduce the impact of the data dependences. Here too the user is in a better position to optimize the code: the user can indicate, for a specific load, whether a dependence violation is likely, or re-arrange the loop to reduce violations. This is the part we address in our project.

Other specific information. Beyond the two kinds of information above, the hardware may still need specific information to execute the threads efficiently. For example, the user could specify the typical iteration count of the loop, which lets the compiler make important decisions, or provide load-balancing information that helps the hardware distribute the work. We do not consider this part in our project.
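The sketch below illustrates how such hints could be attached to a loop. The directive names (speculative, expected_iters, spec_sync) are hypothetical - this project did not define a final syntax, and they are not part of the OpenMP standard - and the per-iteration work is simplified; the variable names are borrowed from the twolf example shown later.

    #define NUM_NETS 1000
    static int    missing_rows[NUM_NETS], tmp_missing_rows[NUM_NETS];
    static double delta_vert_cost;

    void update_costs(void)
    {
        /* (1) identify the speculatively parallel loop and
           (3) pass extra information such as the expected trip count */
        #pragma omp speculative for expected_iters(NUM_NETS)
        for (int net = 0; net < NUM_NETS; net++) {
            int m = tmp_missing_rows[net];       /* main per-iteration work ... */

            /* (2) loop-body hint: this global update conflicts across
               iterations, so request synchronization (or deferral) here
               instead of letting the hardware speculate and squash */
            #pragma omp spec_sync(delta_vert_cost)
            delta_vert_cost += (double)(m - missing_rows[net]);
        }
    }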


6 Framework for parallelization

In this section we discuss the typical behavior of the benchmark loops, and from that behavior we derive a framework for manual parallelization. By framework, we mean a specific layout describing how the code should be arranged so that it exposes the parallelism in the loop to the hardware or compiler.

6.1 Method of analysis

To analyze a benchmark program, we first obtain profile information and find the loops with high coverage. We then look at the data dependence profile; if the number of dependences is very large, the loop is not considered further. Next, the SimpleScalar-based tool is used to find the actual statements in the source code that cause the mis-speculations, and the source code is analyzed to see why exactly each dependence occurs. A typical loop in an integer program looks like the following sketch:
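(The code below is our schematic reconstruction of the pattern described in the surrounding text; the names are generic placeholders, not from a specific benchmark.)

    long typical_loop(const long *a, int n)
    {
        long global_total = 0;                /* global/reduction state              */
        for (int i = 0; i < n; i++) {         /* (1) induction variable - the only
                                                     value the next iteration needs  */
            long r = 2 * a[i] + 1;            /* (2) main processing - stands in for
                                                     work that is largely independent
                                                     across iterations                */
            global_total += r;                /* (3) global update at the end of
                                                     the iteration                    */
        }
        return global_total;
    }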

From our analysis, we could classify the benchmark loops into two types; generic sketches of both shapes follow.

Possibly parallel loops. In most loops, the data needed to start the next iteration does not depend on the main processing done in the current iteration. For example, in the inner loop of the refresh_potential() function of mcf, the induction variable is the only data needed to start the next iteration, and it is independent of the processing in the current iteration.

Serial loops. There are also many loops where the next iteration depends heavily on the processing done in the current iteration. For example, the loop in the sort_basket() function of mcf has very good coverage, but it is very serial - it implements a quicksort partition, and the next iteration starts based on values computed in the current iteration.
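The code below is our own simplification of the two shapes, not taken from mcf.

    /* (a) possibly parallel: the index is all a later iteration needs, and it
       does not depend on the work done in the loop body                       */
    void scale_costs(long *cost, const long *delta, int n)
    {
        for (int i = 0; i < n; i++)
            cost[i] += delta[i];               /* independent per-element work */
    }

    /* (b) serial: how far the next iteration advances depends on values
       computed in the current one (a quicksort-style partition, as in
       mcf's sort_basket()); cut must be the key of some element in [lo,hi]    */
    void partition(long *key, int lo, int hi, long cut)
    {
        while (lo <= hi) {
            while (key[lo] > cut) lo++;
            while (key[hi] < cut) hi--;
            if (lo <= hi) {
                long t = key[lo]; key[lo] = key[hi]; key[hi] = t;
                lo++; hi--;
            }
        }
    }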

6.2 Ideal parallel execution

We saw that in some cases the only thing needed to start the next thread is the induction variable. So ideally we want the loop execution to look like the following: each thread produces and forwards the induction value needed by the next iteration right at the top, and only then does the main work.
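The code below is our illustration of this ideal execution; wait_value()/forward_value() stand for whatever mechanism the hardware or compiler provides to forward the induction value between threads, and work() stands for the loop body.

    extern int  wait_value(int tid);             /* receive i from the previous thread */
    extern void forward_value(int tid, int i);   /* pass the value to the next thread  */
    extern long work(int i);                     /* main processing of iteration i     */

    long run_iteration(int tid)
    {
        int i = wait_value(tid);          /* the only thing needed to start           */
        forward_value(tid + 1, i + 1);    /* produced and forwarded immediately, so
                                             the next thread can start right away     */
        return work(i);                   /* the bulk of the iteration then overlaps
                                             with the later threads                   */
    }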


Now the next thread can be started immediately after the data for the next iteration is generated. But when we examined the loops, we found an interesting trend.

6.3 Usual model of Execution

Apart from generating the values for the next iteration, the loops usually update some global variables: the results of the processing done in the current iteration are written into global variables or arrays. Such updates at the end of the iteration are very common across benchmarks, and unlike the induction variables they cannot be moved to the top of the iteration, since they need the results of the entire iteration.

E.g., consider the code at the end of an iteration of a loop in new_dbox_a() of twolf:

        } else {
            m = 0 ;
        }
        tmp_missing_rows[net] = -m ;
        delta_vert_cost += ((tmp_num_feeds[net] - num_feeds[net]) +
            (tmp_missing_rows[net] - missing_rows[net])) * 2 * rowHeight ;
    }

At the end of the iteration, the global variable delta_vert_cost is updated with values calculated in the current iteration. Let us see how current TLS mechanisms handle such situations:

Speculation. One way is to execute the code speculatively. The update usually occurs towards the end of the iteration, and if the non-speculative thread completes first it does not cause a dependence violation. This does not always work: if the speculative thread runs faster, it suffers a squash.

Synchronization. We could synchronize on this dependence. This avoids the squash, but it can serialize the code. In our example that is acceptable, but these kinds of reductions also occur in the middle of the iteration, e.g., in the inner loop of new_dbox() in twolf.

Recovery code. Recovery code could also help, but there are limitations in applying it. Consider the loop in new_dbox() of twolf:

        }
        *costptr += ABS( newx - new_mean ) - ABS( oldx - old_mean ) ;
    }

Here the reduction occurs in the inner loop, so to recover we would have to execute the entire inner loop again.

6.4 Our Technique

Among these techniques, synchronization seems to be a good choice, but performance can suffer if the update occurs in the middle of the loop. To reduce this impact, we can privatize the update, accumulate the results in a temporary variable, and update the global structure at the end of the iteration. In our technique we therefore try to move the global update down; if it occurs in the middle, we privatize the local update. In our basic technique the global update is then guarded by a wait instruction. We also identify the induction variable so that it does not cause unnecessary synchronization or speculation. Based on the above analysis, we believe that a manually parallelized loop should have the structure sketched below, and in our basic technique we synchronize before entering the update phase. Different hardware based optimizations are possible; they are discussed in section 8.
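The sketch below is our reconstruction of that structure; wait_token()/signal_token() stand for the TLS synchronization primitives that guard the update phase, and the remaining names are placeholders.

    extern void wait_token(void);      /* blocks until the previous thread finishes its update */
    extern void signal_token(void);    /* releases the next thread's update phase              */
    extern long do_main_work(int i);

    static long global_total;

    void parallelized_loop(int start, int end)
    {
        for (int i = start; i < end; i++) {    /* phase 1: induction variable only       */

            long local = do_main_work(i);      /* phase 2: main processing; results are
                                                  accumulated in privatized temporaries  */

            wait_token();                      /* phase 3: global update, guarded so the
                                                  updates happen in iteration order      */
            global_total += local;
            signal_token();
        }
    }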


From our initial analysis, the coverage of loops that could fit such a framework structure is given below:

    Benchmark   Potentially parallel loops (fraction of execution time)
    mcf         0.563893
    twolf       0.406532
    vpr         0.751183

Some of these loops were later found to be either too large to be parallelized or to have high-frequency inter-thread data dependences.

6.5 OpenMP hints

The user has to specify hints that identify these phases in the loop. The user can rearrange the code, privatize the variables and perform the other optimizations manually; alternatively, the compiler can do the same, provided the basic hints identifying the phases are given.
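A possible form of such hints is sketched below. The phase-marking directives are hypothetical - they are not real OpenMP syntax - and are shown only to indicate what information the compiler would need.

    static double delta_vert_cost;             /* global updated by every iteration */

    void update_loop(const double *cost, int n)
    {
        #pragma omp speculative for
        for (int i = 0; i < n; i++) {
            /* main processing phase: results kept in a privatized temporary */
            #pragma omp spec_phase(processing)
            double local = 2.0 * cost[i];

            /* update phase: the compiler guards this region with wait/signal
               and keeps it at the end of the iteration */
            #pragma omp spec_phase(update)
            delta_vert_cost += local;
        }
    }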

6.6 Behavior of benchmark programs

In this section we see how the different benchmark programs we analyzed fit into the framework discussed in the previous section, and we suggest techniques that could be studied further for each benchmark.

a) twolf. Two high-coverage loops fit the framework exactly; these loops would be very difficult to parallelize without the user directives. Some other loops suffer from small size. Unrolling could help them, but unrolling can disrupt the loop structure and cause mis-speculations; the hardware technique discussed later could be used for such loops.

b) mcf. Mcf also has many loops that satisfy the code framework. Here too the loops suffer from small size; they could be unrolled, but that violates the framework.

c) vpr. There are some loops that satisfy the global update property, but they are enclosed by an outer loop which is itself parallel. Even a simple compiler technique can identify that parallelism, so we do not gain much by using our technique.

d) gzip and bzip2. There are some parallel loops, but they all have very small code size. They could give good performance if unrolled. We did not find any major loops that satisfy our framework.

e) vortex. The vortex loops contain many function calls, usually with complicated nesting, which made analyzing them very hard. From the tool output we find that many dependences occur due to a status variable that is updated almost everywhere. We believe, however, that the value read is always 0 unless an error has occurred, so the hardware could be modified to squash a thread only when the value read has actually changed.

6.7 Comparison with other techniques

In our technique there are three different phases of execution - induction variable calculation, main processing and global update. This is very similar to the thread pipelining strategy of [14]. There the first two stages are the same, but the update stage is replaced by a write-back stage, and the aim is always to produce the information needed by the other threads as early as possible; in our case we instead move the updates downward. In [16], manual parallelization was discussed, but the focus there was on completely restructuring the code to increase parallelism. In our technique the user only exposes the available parallelism; parallelism is never created.

7 Results

7.1 Results of basic technique

The aim of passing user hints is to overcome the limitations of the compiler and hardware. So we compare our technique with the compiler based technique [15], where the code is scheduled to reduce the impact of data dependences. We obtained the code generated by the scheduling technique from Prof. Antonia Zhai, who had implemented that technique in the SUIF compiler. That code, however, contained directives specific to their environment and had a very large code overhead, so we replicated the effect of their code in the original benchmark code: signal and wait instructions were inserted at exactly the same points, the loops were unrolled exactly the same number of times, and all other modifications were made to match their code completely. We then applied our technique on top of that code. Sometimes the same loops could be scheduled better using our technique; at other times, more loops could be parallelized that had previously not been parallelized due to compiler limitations. Fig. 4 shows the comparison of the scheduling based technique and our framework based technique; the bars indicate the percentage decrease in execution time after applying each technique. We see that performance is sometimes worse than single thread execution, one of the main reasons being instruction overhead. Fig. 5 shows the percentage increase in the number of instructions executed after applying the technique.

7.2 Analysis of results

Figure 4: Performance impact of the technique

Figure 5: Increase in the number of instructions executed

Here we try to analyze the performance of each of the benchmarks.

1. mcf: Though there are parallel loops, they are very small. When we unrolled them, the loops suffered mis-speculations, so these two loops pulled the performance back. In the case of the scheduler based code, the second loop was synchronized; this led to serial execution, and at the same time there were gaps in the pipeline and there was instruction overhead. Due to these effects there was a huge performance decrease.

2. twolf: Our technique showed more than 13% performance improvement. The scheduler based technique could not effectively parallelize two major loops; with our technique we could cover many more loops than the scheduler based technique.

3. vpr: The results are exactly the same. Though we were able to cover many loops not covered by the scheduler technique, there is one outer loop that encloses many of these inner loops, and performance is dominated by that single loop.


8 Hardware based improvements

In this section we discuss some improvements that can be made on top of the basic technique.

8.1 Commit overlap

In a typical speculative multithreading processor, the commit of one thread starts only after the previous thread has committed. In the update phase of its execution, a thread only updates the global variables, so it should not cause any additional mis-speculations. The next thread can therefore start committing instructions once the previous thread has reached the update phase, which lets us overlap the commit time to some extent. In [14], commit time was identified as one of the major bottlenecks and different techniques to reduce it were studied.
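A minimal sketch of the relaxed commit rule (our illustration, not a hardware design):

    enum phase { INDUCTION, PROCESSING, UPDATE, COMMITTING, DONE };

    /* baseline rule: thread tid may commit only when thread tid-1 is DONE;
       relaxed rule: it may start once thread tid-1 has reached its update
       phase, since that phase should not cause further mis-speculations   */
    int may_start_commit(const enum phase *state, int tid)
    {
        if (tid == 0)
            return 1;                      /* the non-speculative thread */
        return state[tid - 1] >= UPDATE;
    }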

8.2 Update in hardware

In [11][12], hardware based synchronization mechanisms were proposed to improve the performance of some common types of update operations. The update phase of our threads is also a kind of synchronization, so we could use hardware support to improve the performance of these operations. As we saw in the results, some of the benchmarks suffer from small code size; but if the code size is increased by unrolling the loops, the structure of the loop is affected and it no longer has the phase based structure. We would then have two or more update sections distributed over the entire iteration. This can be a severe bottleneck, since each one causes synchronization (in the basic technique, synchronization is used to guard the update phase). We propose a hardware buffer that holds the update instructions and their related load instructions. Instead of executing, the update instructions are sent to this buffer, and when the previous thread sends the signal to start the update phase, the instructions in the buffer are executed. This way we maintain the order of the updates without suffering from synchronization in the middle of the iteration.
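The update buffer itself would be a hardware structure; the sketch below only models the idea in C, with names of our own choosing.

    #define UPD_BUF 64

    struct deferred_update {
        volatile long *addr;     /* global variable to be updated */
        long           delta;    /* value computed by this thread */
    };

    struct update_buffer {
        struct deferred_update e[UPD_BUF];
        int n;
    };

    /* instead of performing `*addr += delta` immediately, capture it */
    void defer_update(struct update_buffer *b, volatile long *addr, long delta)
    {
        b->e[b->n].addr  = addr;
        b->e[b->n].delta = delta;
        b->n++;                             /* overflow handling omitted */
    }

    /* executed when the previous thread signals that the update order has been
       reached; the updates are replayed in program order, so ordering is kept
       without synchronizing during the main processing phase */
    void flush_updates(struct update_buffer *b)
    {
        for (int i = 0; i < b->n; i++)
            *b->e[i].addr += b->e[i].delta;
        b->n = 0;
    }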

9 Conclusion

In this report we studied how OpenMP hints are applicable to multi-core architectures. OpenMP could be applied both to extract fine-grained parallelism and to extract speculative parallelism; we concentrated on the latter - how the user can expose parallelism so that the compiler and hardware are more effective. We identified a typical structure for loops that can be parallelized: the user lays out the loop in this structure, and the compiler or the user can schedule the code accordingly. When we compared the code generated in this way with the code generated using a typical scheduling based technique, we found that our technique showed much better performance. We also propose some hardware techniques which can further improve the performance.

10 Future Work

We need to analyze the remaining benchmarks; it is not clear whether the framework we present is applicable to programs like gcc, parser, etc. We also have to complete the hardware based techniques and study their performance improvement. By analyzing more benchmarks we would get a deeper understanding of their behavior and of any recurring patterns. In our analysis we found that load balancing could be an important factor in some benchmarks: some iterations execute faster than others due to different control flow paths, e.g., in vpr.

References

[1] "Power4: A Dual-CPU Processor Chip", Kahle, J., Microprocessor Forum '99 (October 1999).
[2] "MAJC: Microprocessor Architecture for Java Computing", Tremblay, M., Hot Chips '99 (August 1999).
[3] "Multiscalar Processors", Sohi, G. S., Breach, S., and Vijaykumar, T., in Proceedings of the 22nd ISCA (June 1995).
[4] "Clustered Speculative Multithreaded Processors", Marcuello, P., and González, A., in Proc. of the ACM Int. Conf. on Supercomputing (June 1999).
[5] "The Superthreaded Processor Architecture", Tsai, J.-Y., Huang, J., Amlo, C., Lilja, D., and Yew, P.-C., IEEE Transactions on Computers, Special Issue on Multithreaded Architectures 48, 9 (September 1999).
[6] "Compiler Optimization of Scalar Value Communication Between Speculative Threads", Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry, The Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, CA, USA, October 7-9, 2002.
[7] "Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study", Eduard Ayguade, Xavier Martorell, Jesus Labarta, Marc Gonzalez and Nacho Navarro, Proceedings of the 1999 International Conference on Parallel Processing.
[8] "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications", José F. Martínez and Josep Torrellas, 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.


[9] "A Clustered Approach to Multithreaded Processors", Venkata Krishnan and Josep Torrellas, International Parallel Processing Symposium (IPPS), March 1998.
[10] "Using Thread-Level Speculation to Simplify Manual Parallelization", Manohar Prabhu and Kunle Olukotun, Proceedings of the 2003 Principles and Practices of Parallel Programming, San Diego, CA, June 2003.
[11] "The Cedar System and an Initial Performance Study", Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993.
[12] "On Data Synchronization for Multiprocessors", H.-M. Su and P.-C. Yew, International Conference on Computer Architecture, 1989.
[13] "ORC-OpenMP: An OpenMP Compiler Based on ORC", Yongjian Chen, Jianjiang Li, Shengyuan Wang, and Dingxing Wang, Tsinghua University, China.
[14] "Removing Architectural Bottlenecks to the Scalability of Speculative Parallelization", Milos Prvulovic, Maria Jesus Garzaran, Lawrence Rauchwerger, and Josep Torrellas, 28th Annual International Symposium on Computer Architecture (ISCA), June 2001.
[15] "The Stanford Hydra CMP", Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mike Chen and Kunle Olukotun, IEEE Micro, March-April 2000.
[16] "Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads", Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry, The 2nd International Symposium on Code Generation and Optimization (CGO-2004), Palo Alto, CA, USA, March 20-24, 2004.
