© 5th IEEE International Multi-Conference on Systems, Signals and Devices (SSD’08), Amman, Jordan, July 2008

IMAGE MAGNIFICATION AND REDUCTION USING HIGH ORDER FILTERING ON THE CELL BROADBAND ENGINE

Hashir Karim Kidwai, Fadi N. Sibai, Tamer Rabie

College of Information Technology, UAE University, Al Ain, UAE

ABSTRACT

The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor engines (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth and compute power. In this paper, we describe a 2D graphics algorithm for image resizing which we parallelized and developed on the Cell BE. We report the performance measured on one Cell blade with varying numbers of synergistic processor engines enabled. These results were compared to those obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that the Cell processor can outperform modern RISC processors by 20x on SIMD compute intensive applications such as image resizing.

Index Terms— Cell multi-core computing, image resizing

1. INTRODUCTION

The Cell Broadband Engine (BE) is a heterogeneous multi-core microprocessor whose objective is to provide high performance computation for graphics, imaging and visualization, and for a wide scope of data parallel applications. It is composed of one 64-bit PowerPC Processing Element (PPE) serving as the host processor, eight specialized co-processors called Synergistic Processing Elements (SPEs), and one internal high speed bus called the Element Interconnect Bus (EIB) which links the PPE and SPEs together. The host PPE supports the 64-bit PowerPC AS instruction set architecture and the VMX (AltiVec) vector instruction set architecture [9] to parallelize arithmetic operations. Each SPE consists of a Synergistic Processing Unit (SPU) and a Synergistic Memory Flow Controller (SMF) unit providing DMA, memory management, and bus operations. An SPE is a RISC processor with a 128-bit SIMD organization for single and double precision instructions. Each SPE contains a 256 KB instruction and data local memory area, known as the local store (LS), which is visible to the PPE and can be addressed directly by software. The LS does not operate like a superscalar CPU cache, since it is neither transparent to software nor does it contain hardware structures that predict what data to load. The EIB is a circular bus made of two channels in opposite directions and allows for communication between the PPE and SPEs. The EIB also connects to the L2 cache, the memory controller, and external communication. The Cell BE can handle 10 simultaneous threads and over 128 outstanding memory requests.

The following sections describe the 2D graphics image resizing application parallelized and developed for the Cell BE, and the programming model chosen for this application. A serial implementation on the PPE only and a parallel PPE-SPE implementation with 8 SPEs are described. This is followed by the presentation of speedup comparisons with and without the DMA and thread creation overhead times.

2. APPLICATION

Texture mapping on graphics processing units (GPUs) passes through the process of texture image magnification (zooming in) and reduction (zooming out), depending on the level of detail (LOD) of the 3D object the texture is mapped onto. The two most commonly implemented techniques for this type of texture processing are nearest neighbor (a.k.a. zero-order hold) texture filtering and bilinear (a.k.a. first-order hold) texture filtering [1]. Next, we briefly describe these two techniques.

2.1. Zero-order hold texture filtering

Zero-order hold is performed by repeating previous image pixel values, thus creating an undesirable blocky effect in the resized texture [2]. This is clear from Figures 1-5.

2.2. First-order hold (bilinear) texture filtering

Bilinear texture filtering enhances a computer's ability to scale 3D graphics in a smoother, more realistic way. With 3D graphics, and especially with games, a graphics card should not simply take the texture maps from memory and map them onto the 3D objects rendered to the computer screen: as the polygons drawn onscreen grow bigger, they would take on a blocky look due to the zero-order hold filtering. To improve the ability to scale 3D graphics, the textures need to be filtered using a higher order filter [3]. Figure 6 shows an illustration of a twice zoomed image, with white pixels representing the interpolated values that need to be calculated. More precisely, bilinear (interpolation) texture filtering averages the values of four adjacent pixels, thus creating a new pixel value that renders a more subtle, realistic texture. An example is depicted in Figure 7, where the pixel in between every four adjacent pixel values is calculated by summing these four pixels and dividing by 4 (i.e. 0.25*(8+4+4+8) = 6).
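To make the first-order hold computation concrete, the following is a minimal scalar sketch of 2x bilinear zooming consistent with the description above. The function and variable names are ours, border pixels are simply clamped, and the sketch is for illustration only; it is not the exact kernel used in our experiments.

/* 2x first-order hold (bilinear) zooming of a row-major float image.
   Assumptions: clamp-at-border handling; dst holds (2*h) x (2*w) floats. */
static inline float pixel_clamped(const float *s, int h, int w, int r, int c)
{
    if (r >= h) r = h - 1;          /* clamp row/column at the image border */
    if (c >= w) c = w - 1;
    return s[r * w + c];
}

void zoom2x_bilinear(const float *src, float *dst, int h, int w)
{
    int r, c;
    for (r = 0; r < h; r++) {
        for (c = 0; c < w; c++) {
            float p  = src[r * w + c];
            float pr = pixel_clamped(src, h, w, r,     c + 1);   /* right neighbor    */
            float pd = pixel_clamped(src, h, w, r + 1, c);       /* lower neighbor    */
            float pq = pixel_clamped(src, h, w, r + 1, c + 1);   /* diagonal neighbor */

            dst[(2*r)   * (2*w) + 2*c]     = p;                          /* original pixel  */
            dst[(2*r)   * (2*w) + 2*c + 1] = 0.5f  * (p + pr);           /* between columns */
            dst[(2*r+1) * (2*w) + 2*c]     = 0.5f  * (p + pd);           /* between rows    */
            dst[(2*r+1) * (2*w) + 2*c + 1] = 0.25f * (p + pr + pd + pq); /* centre: average */
        }
    }
    /* Zero-order hold would instead replicate p into all four destination slots. */
}

For the example of Figure 7, the centre value is 0.25*(8+4+4+8) = 6, as stated above.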

Fig. 1 (left). Original texture image. Fig. 2 (middle). A 2x bilinear zooming. Fig. 3 (right). A 2x nearest neighbor zooming.

Fig. 4 (left). 4x bilinear zooming of the image in Fig. 1. Fig. 5 (right). 4x nearest neighbor zooming, clearly showing signs of severe blockiness compared to the smoother effect in Fig. 4.

Fig. 6. A twice zoomed image (Z) obtained from the original image (S). The white pixel values are calculated using either zero-order or first-order hold filtering, which affects the resulting quality.

Fig. 7. (a) An example image matrix, and (b) its bilinear filtered twice zoomed version.

Fig. 8. (a) An example image matrix, and (b) its zero-order hold twice zoomed version.

3. ALGORITHM PARALLELIZATION AND PROGRAMMING MODEL

Our parallelization approach for the image resizing algorithm is based on the data parallel model, in which each SPE core performs the same computation on different parts of the data. The image resizing algorithm was first implemented on the simple single-PPE architecture model and then recoded for the parallel PPE-SPE architecture model. In the parallel model, the PPE is responsible for SPE thread creation and I/O functions, and the SPEs perform the image resizing computation. The computational part is uniformly distributed over all enabled SPEs; the number of SPEs therefore determines the number of times the SPE module gets replicated and executed. The PPE initiates the SPE threads (a total of 8 SPE threads in our experiment). The data is grouped and kept ready for distribution to the 8 SPEs. Each SPE can directly access only its 256 KB local store memory. The SPEs process the data retrieved from main memory through DMA transfers and, once finished, write the data back to the same memory location, again via DMA transfer.
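A minimal PPE-side sketch of this thread-creation step follows. It assumes the libspe spe_create_thread interface named in Section 4.3 (signatures vary across SDK releases), an SPE program handle named resize_spu, and a per-SPE control block type that we define here for illustration and assume is shared with the SPE code through a hypothetical header resize_cb.h. It is a sketch of the model, not our exact code.

/* PPE side: spawn one thread per enabled SPE and wait for completion. */
#include <libspe.h>

#define NUM_SPES 8

/* Contents of the hypothetical shared header "resize_cb.h": */
typedef struct {
    unsigned long long in_addr;   /* effective address of this SPE's 32x256 input block    */
    unsigned long long out_addr;  /* effective address for this SPE's resized output block */
    float zoom;                   /* zoom factor (0.5, 2.0, 4.0, ...)                      */
    int rows, cols;               /* 32 and 256 in our partitioning                        */
} __attribute__((aligned(128))) control_block_t;   /* 128-byte aligned for DMA */

extern spe_program_handle_t resize_spu;  /* SPE module embedded at link time (assumed name) */

int run_spes(control_block_t cb[NUM_SPES])
{
    speid_t ids[NUM_SPES];
    int i, status;

    for (i = 0; i < NUM_SPES; i++)       /* one thread per enabled SPE */
        ids[i] = spe_create_thread(0, &resize_spu, &cb[i], NULL, -1, 0);

    for (i = 0; i < NUM_SPES; i++)       /* wait for all SPE threads to finish */
        spe_wait(ids[i], &status, 0);

    return 0;
}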

4. IMPLEMENTATION

We created two implementation sets: i) a serial PowerPC/PPE-only implementation following the first model above; and ii) a parallel PPE-SPE (a.k.a. embedded) implementation following the second model.

4.1. Overview

In the serial PPE-only implementation, we input the 256x256 data from the input data file. Once the data is retrieved, we store it into a multi-dimensional array. We then apply the image resizing algorithm, calculate the execution time, and write the processed data to the output data file. In the embedded implementation, we again retrieve the data from the input file via the PPE code, store it into a multi-dimensional array, and pass to each of the 8 SPEs pointers to the input data as well as pointers to the memory locations where the output data will be transferred back from the SPEs. For our application, we experimented with both image reduction (zoom factor of 0.5) and image magnification (zoom factors of 2.0 and 4.0).

4.2. Data Partitioning

The data represents an image of 256x256 single precision floating point (float) values. For 256x256 matrices, the PPE code is responsible for partitioning the data into blocks of 32 rows and 256 columns, which are distributed among the 8 SPEs. This amounts to 32*256*4 = 32 KB of data for each SPE to process.
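The sketch below illustrates this partitioning on the PPE side, filling one control block per SPE. It reuses the illustrative control_block_t and resize_cb.h introduced in Section 3, assumes 128-byte aligned buffers, and assumes the output is one contiguous buffer; it is an illustration, not our exact code.

/* Partition the 256x256 image into 8 blocks of 32 rows x 256 columns (32 KB each). */
#include <stdint.h>
#include "resize_cb.h"        /* hypothetical shared header from Section 3 */

#define IMG_ROWS     256
#define IMG_COLS     256
#define NUM_SPES     8
#define ROWS_PER_SPE (IMG_ROWS / NUM_SPES)      /* 32 rows per SPE */

void partition(float image[IMG_ROWS][IMG_COLS], float *output, float zoom,
               control_block_t cb[NUM_SPES])
{
    int out_cols = (int)(IMG_COLS * zoom);              /* e.g. 512 for zoom 2.0 */
    int out_rows_per_spe = (int)(ROWS_PER_SPE * zoom);  /* e.g. 64 for zoom 2.0  */
    int i;

    for (i = 0; i < NUM_SPES; i++) {
        cb[i].in_addr  = (uint64_t)(uintptr_t)&image[i * ROWS_PER_SPE][0];
        cb[i].out_addr = (uint64_t)(uintptr_t)
                         (output + (long)i * out_rows_per_spe * out_cols);
        cb[i].zoom = zoom;
        cb[i].rows = ROWS_PER_SPE;
        cb[i].cols = IMG_COLS;
    }
}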

4.3. Data Transfer

Data is transferred from main memory to the SPEs' LS and vice versa. For efficient data transfer, the data has to be 128-byte aligned. The PPE starts an SPE module by creating a thread on the SPE through the spe_create_thread function; among the arguments to this function are the SPE program to load and a pointer to a data structure. Since the SPEs do not have any direct access to main memory, they can only access it through Memory Flow Controller (MFC) DMA calls. The primary functions of the MFC are to connect the SPUs to the EIB and to support DMA transfers between main storage and the LS. The DMA calls, once issued by the SPEs, bring the data structure into the LS. The maximum data size which can be transferred between the PPE and an SPE is 16 KB per DMA call.

4.4. Data Processing

Once the data is moved from main storage into the LS, it is processed with the bilinear texture filtering technique, which produces results based on the different zoom factors. For 32x256 matrices and a zoom factor of 0.5, the matrix produced by every SPE is of size 16x128. Similarly, for zoom factors of 2.0, 4.0, 8.0 and 16.0, it is of size 64x512, 128x1024, 256x2048 and 512x4096, respectively. While processing the data we had to keep in mind the LS size limitation of 256 KB, and thus for large zoom factors we could not process, produce and transfer the entire data at once. Instead we fetched a portion of the data, processed it, and then transferred it, while simultaneously processing another portion of the data; the data transfers and the computation were thus overlapped. However, this "choppy" transfer of the data incurred additional overhead, which we analyze in Section 6.
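A simplified SPE-side sketch of this get/process/put flow is given below. It assumes the illustrative control block from Section 3 (resize_cb.h), a libspe 1.x style SPU main signature, and a placeholder process_chunk function standing in for the filtering kernel of Section 2; the input chunk size is chosen so that the corresponding zoomed output also fits in one 16 KB DMA, and double buffering and alignment corner cases are omitted.

#include <spu_mfcio.h>
#include "resize_cb.h"        /* hypothetical shared header from Section 3 */

#define DMA_MAX 16384         /* maximum bytes per single DMA transfer */

static volatile control_block_t cb;
static volatile char in_buf[DMA_MAX]  __attribute__((aligned(128)));
static volatile char out_buf[DMA_MAX] __attribute__((aligned(128)));

/* Placeholder: filters in_bytes of input rows into out, returns output byte count. */
extern unsigned int process_chunk(volatile void *in, unsigned int in_bytes,
                                  volatile void *out, float zoom);

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned int tag = 1, done = 0, in_bytes, in_chunk, out_bytes;
    unsigned long long in_ea, out_ea;

    /* 1. Pull this SPE's control block from main memory into the local store. */
    mfc_get(&cb, argp, sizeof(cb), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    in_ea    = cb.in_addr;
    out_ea   = cb.out_addr;
    in_bytes = cb.rows * cb.cols * sizeof(float);
    /* Input chunk small enough that its zoomed output also fits one DMA. */
    in_chunk = (cb.zoom >= 1.0f) ? DMA_MAX / (unsigned int)(cb.zoom * cb.zoom) : DMA_MAX;

    /* 2. Fetch, filter and write back one chunk at a time; the real code overlaps
          the transfer of one chunk with the computation of another. */
    while (done < in_bytes) {
        mfc_get(in_buf, in_ea + done, in_chunk, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                 /* wait for the input chunk */

        out_bytes = process_chunk(in_buf, in_chunk, out_buf, cb.zoom);

        mfc_put(out_buf, out_ea, out_bytes, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                 /* wait for the write-back  */

        out_ea += out_bytes;
        done   += in_chunk;
    }
    return 0;
}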

5. EXPERIMENTS

We ran our experiments on an IBM Cell machine with one Cell blade containing a 9-core Cell processor package. To obtain more accurate timing information, we counted clock ticks on the serial PPE-only model through the mftb function, which is available through the PPU intrinsics header files, in addition to the gettimeofday function. In the embedded (PPE+SPE) version, we invoked the spu_read_decrementer and spu_write_decrementer functions to get the total number of clock ticks and passed that information on to the PPE code by updating a counter. The reason for using mftb or spu_read_decrementer and spu_write_decrementer was to obtain more accurate and reliable timing information than that provided by the gettimeofday function. On the Cell blades, the time base frequency is set to 14.318 MHz with a 3.2 GHz microprocessor. To obtain the execution time, we divided the number of clock ticks by the time base frequency.

We first measured the time it took to run our 2D image resize algorithm for varying zoom factors on the single PowerPC (PPE-only) model. We then measured the time it took to execute the same algorithm with our embedded model (both sets of times are reported in Figure 9 and Table 1) by taking the following steps: a) calculate the transfer time of the entire input data from the PPE to the SPEs, the complete execution time over the entire data, and the transfer time of the output data back from the SPEs to main memory; and b) calculate the execution time only, omitting the input and output data transfer times.

We also optimized our code through the compiler's auto-SIMDization, scheduling and loop unrolling features and measured their effect on the total execution time for different zoom factors. We used compiler optimization level 3 along with loop unrolling, which duplicates the loop body multiple times and can help SPE performance as it reduces the number of branches, and software pipelining, which attempts to arrange instructions in a loop so as to minimize pipeline stalls. We then calculated the overall speedup for various zoom factors. The speedup is defined as the ratio of the execution time on the serial PPE-only model over the execution time on the parallel PPE-SPE model.
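For illustration, the SPE-side timing follows the pattern sketched below. The constant TIMEBASE_HZ and the function name are ours; the sketch assumes the SPE decrementer counts down at the 14.318 MHz time base frequency reported above.

#include <spu_intrinsics.h>

#define TIMEBASE_HZ 14318000.0     /* time base frequency of our Cell blades */

double time_resize(void)
{
    unsigned int start, end, ticks;

    spu_write_decrementer(0xFFFFFFFF);   /* load the decrementer with a large value */
    start = spu_read_decrementer();

    /* ... run the image resizing kernel here ... */

    end   = spu_read_decrementer();
    ticks = start - end;                 /* the decrementer counts down             */
    return (double)ticks / TIMEBASE_HZ;  /* elapsed time in seconds                 */
}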

6. RESULTS

Figure 9 and Table 1 show the execution times of both the embedded and PPE-only algorithm implementations, with the embedded PPE-SPE implementation times reported both including and excluding the DMA data transfer times. The execution times increase with increasing zoom factor. Comparing the embedded and PPE-only results, it is clear that the embedded implementation, both with and without DMA transfers, is remarkably enhanced by parallelization. By examining Figure 9 and Table 1, we also analyzed the impact of DMA transfers on the overall execution time. We observe that the reduction in execution time is 34%, 20%, 3%, 5% and 20% for zoom factors of 0.5, 2.0, 4.0, 8.0 and 16.0, respectively. Thus zoom factors of 0.5, 2.0 and 16.0 are most affected by the DMA transfer overhead.

Fig. 9. Plot of the embedded implementation's execution time (in microseconds) vs. the zoom factor (two curves: Computation Time + DMA, and Computation Time only).

Fig. 10. Plot of speedup vs. the zoom factor before code optimization (two curves: Computation Time + DMA, and Computation Time only).
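As a concrete check, the 34% figure quoted above for a zoom factor of 0.5 follows directly from the corresponding Table 1 entries: (135.3 - 88.7) / 135.3 = 46.6 / 135.3 ≈ 0.34; the other percentages are obtained in the same way.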

Figure 10 shows the speedup trend with varying zoom factors, both for the computation-only times and for the times including DMA transfers. In the computation-only case, note the lower speedups for zoom factors of 4.0 and 8.0 and the higher speedups for zoom factors of 0.5, 2.0 and 16.0; once DMA is included, the latter factors suffer from higher relative DMA transfer times, while the former enjoy much higher ratios of computation time over DMA transfer time. Note also that for zoom factors of 4.0 and higher, our code introduces additional loops in order to reuse the same memory space across loop iterations, because of the limited local storage of the SPEs. These loops result in additional data movement and branches which, though not very expensive, still affect performance, because both the SPEs (no branch predictor) and the PPE (an inefficient branch predictor) are poor at managing branches.

Table 2 and Figure 11 show the execution times and the speedup versus the zoom factor after compiler-based code optimization. Note that we did not optimize our PPE-only algorithm code, and that this code optimization introduced loop unrolling, vectorization and software pipelining but did not optimize the DMA transfers.
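As an illustration of what these optimizations amount to, the sketch below shows a manually 4-way unrolled SIMD loop of the kind the compiler's auto-SIMDization and unrolling produce. The function is a toy scaling kernel, not our filtering code, and the vector count is assumed to be a multiple of 4.

#include <spu_intrinsics.h>

/* Scale nvec 4-float vectors by k; nvec assumed to be a multiple of 4. */
void scale_rows(vector float *dst, const vector float *src, float k, int nvec)
{
    vector float vk = spu_splats(k);
    int i;
    for (i = 0; i < nvec; i += 4) {           /* unroll by 4: fewer branches and   */
        dst[i]     = spu_mul(src[i],     vk); /* more independent work for the     */
        dst[i + 1] = spu_mul(src[i + 1], vk); /* scheduler / software pipeliner    */
        dst[i + 2] = spu_mul(src[i + 2], vk);
        dst[i + 3] = spu_mul(src[i + 3], vk);
    }
}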

Table 1: Execution time before optimization (in microseconds)

Mode / Zoom Factor               0.5         2.0         4.0          8.0          16.0
Embedded (Comp. Time + DMA)      135.3       1547.7      5829.3       23876.4      98971.57
Embedded (Comp. Time only)       88.7        1247.7      5688.8       22746.62     79856.01
PPE-only Application             2235.3      31165.6     123401.1     493047.8     1969534.3

Table 2: Execution time after code optimization (in microseconds)

Mode / Zoom Factor               0.5         2.0         4.0          8.0          16.0
Embedded (Comp. Time + DMA)      105.25      1074.18     3955.80      16136.66     64692.6
Embedded (Comp. Time only)       47.98       858.78      3071.74      12308.68     49114.3

Fig. 11. Plot of speedup (Computation Time + DMA) vs. the zoom factor after code optimization.

We notice that the zoom factor of 0.5, followed by 2.0, is least affected by the optimization, as these factors avoid the additional loops for data transfer. For zoom factors of 2.0, 4.0, 8.0, and 16.0 we noticed a significant decrease in the execution time and an increase in the overall speedup.

We also analyzed the effect of varying the number of SPEs on the overall speedup, as shown in Figure 12. These results are based on a single zoom factor of 2.0. We noticed a linear increase in the overall speedup with the increase in the number of SPEs.

Fig. 12. Plot of speedup vs. the number of SPEs for a zoom factor of 2.0.

7. CONCLUSION

The results presented in this paper demonstrate that the Cell BE processor can perform particularly well in cases where the application is compute intensive and not memory intensive. Also, as applications become more control flow intensive, their performance degrades. In order to avoid costly branch mispredictions and to improve the performance of computation within loops, techniques like software pipelining and loop unrolling should be adopted [8]. Also, for applications where the data transfer time becomes an overhead, techniques such as double buffering are recommended.

As we observed large speedups (20-30x), it can be concluded that compute intensive applications such as image resizing with large zoom factors, when distributed among 8 SPEs, enjoy a performance on the Cell unequaled by any other contemporary microprocessor. With its performance capability and the dedicated resources of its eight decoupled SPU engines, the Cell BE can show its prowess. In future work, we will enhance our code by employing techniques like double buffering for zoom factors of 0.5, 2.0 and 16.0, conduct a detailed investigation with performance analysis tools such as the Visual Performance Analyzer, FDPR-Pro and OProfile, and investigate further code optimization techniques to enhance the performance.

8. ACKNOWLEDGMENT

We acknowledge generous hardware equipment, software, and service support from IBM Corporation as a result of a 2006 IBM Shared University Research award.

9. REFERENCES

[1] M. Slater, A. Steed, Y. Chrysanthou. Computer Graphics and Virtual Environments: From Realism to Real-time. Addison-Wesley, 2002.

[2] S. Umbaugh. Computer Vision and Image Processing: A Practical Approach Using CVIPtools. Prentice Hall PTR, 1999.

[3] CNET Glossary. http://reviews.cnet.com/4520-6029_7-5752488-1.html

[4] T. Chen, R. Raghavan, J. N. Dale, E. Iwata. Cell Broadband Engine Architecture and its First Implementation – A Performance View. September 2007. http://www-03.ibm.com/industries/telecom/doc/content/bin/Cell_Information.pdf

[5] IBM Corp. SPE Runtime Management Library Reference Manual.

[6] IBM Corp. Cell Broadband Engine Programming Handbook.

[7] IBM Corp. Cell Broadband Engine Programming Tutorial.

[8] Programming the Cell Broadband Engine: Examples and Best Practices. IBM Redbook. http://www.redbooks.ibm.com/redpieces/pdfs/sg247575.pdf

[9] AltiVec Technology Programming Environments Manual. http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf

[10] L. Cico, R. Cooper, J. Greene. "Performance and Programmability of the IBM/Sony/Toshiba Cell Broadband Engine Processor," Workshop on Edge Computing Using New Commodity Architectures (EDGE), Univ. of North Carolina, Chapel Hill, May 23-24, 2006.