Machine Learning Algorithms for the Performance and Energy-Aware Characterization of Linear Algebra Kernels on Multithreaded Architectures

A. Cristiano I. Malossi∗, Yves Ineichen∗, Costas Bekas∗, Alessandro Curioni∗ and Enrique S. Quintana-Ortí†

∗ IBM Research - Zurich, Switzerland, Cognitive Computing & Computational Sciences Department
Email: {acm,yin,bek,cur}@zurich.ibm.com

† Universidad Jaime I, Castellón, Spain, Depto. de Ingeniería y Ciencia de Computadores
Email: [email protected]

Keywords—Performance and Power Modeling; Energy-Aware Computing; Classification Methods; Dense/Sparse Linear Algebra

SUMMARY OF THE WORK

The performance-power-energy characterization of linear algebra kernels represents a keystone towards the development of energy-aware algorithms. Indeed, any effective optimization process first requires a deep understanding of the analyzed phenomena, which is generally condensed into a simplified but accurate mathematical description of the phenomenon itself, i.e., a model. In this work we present a semi-automatic machine learning algorithm to measure, model, and thus characterize the time-power-energy performance triangle of linear algebra kernels. In particular, we make the following contributions:

1) We devise a bottom-up approach that decomposes dense linear algebra kernels into a small number of meaningful fine-grained components (arithmetic, memory access, etc.), which are subsequently modeled using a classical regression analysis on a selected number of measures. This strategy hides the unnecessary complexity of the low-level instructions and, at the same time, allows us to accurately determine (and also predict) the sources of power cost as a function of the critical parameters (dimension of the problem, number of threads and cores, architecture type, etc.). See [1] for more details.

2) We extend our methodology to tackle irregular and indirect memory access operations, where prefetching and memory bandwidth dominate the actual floating-point operations, i.e., the cost of arithmetic becomes almost negligible (≤ 1%). As an example, we focus on the sparse matrix-vector product (SpMV), which is the key ingredient for tackling large-scale sparse linear systems and eigenvalue problems via iterative methods. In more detail, we first identify a small set of critical parameters that drive the performance triangle of a baseline implementation of SpMV. On top of these parameters, we establish a simple and easy-to-use classification of sparse matrices that provides an immediate qualitative indication of the required performance triangle costs for any specific

SpMV operation. Finally, we devise models that are general (multi-architecture) yet simple and inexpensive to use, offering precise (quantitative) estimations of the time and energy requirements for an SpMV involving a given sparse matrix. The tool built upon these models extracts the aforementioned critical parameters from the graph representing the sparse matrix, and exploits this information, combined with 3-D interpolation, to yield the sought-after predictions. See [2] for details.

POTENTIAL APPLICATIONS

Our methodology is in line with the dwarfs' decomposition proposed in [3], [4] and can yield important benefits for the HPC community; in particular, it can:

1) Optimize the management and cost of HPC supercomputers and cloud systems under concurrent usage by thousands of different users, following power and energy directives (e.g., dynamically place power-hungry routines on different nodes to avoid hotspots, and run energy-consuming kernels during the night to limit costs).

2) Help algorithm designers analyze and model the performance-energy interaction of the kernels they develop, towards real energy-aware implementations.

3) Suggest energy-efficient architectural designs, extrapolating the models towards next-generation hardware (e.g., the performance and optimization of future cache hierarchies).

VALIDATION

We test our methodology on three modern HPC systems: the IBM Blue Gene/Q (BG/Q), the IBM POWER7, and the recently released IBM POWER8. These multithreaded architectures are equipped with hardware sensors that automatically collect run-time power measures for selected regions of the code. Concerning the performance triangle costs, we normalize all our measures with respect to the number of “operations”, where an operation is defined as the base arithmetic instruction that characterizes the kernel (e.g., the fused floating-point

Figure 1: IBM P7 measures of the inner product kernel without reduction using one (left) and four (right) threads per core. Panels report time per FMA operation [s], average power per core [W], and energy per FMA operation [J] for 1, 2, 4, and 8 cores, with data residing in the register file, the L1/L2/L3 caches, and DDR. The red horizontal lines indicate idle power.

Figure 2: IBM BG/Q SpMV measures of the entire UF collection using 4 threads per core, reporting time per non-zero [s], net average power per core [W], and net energy per non-zero per core [J]. Colors correspond to different matrix classes, following our model classification.
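Measurements of this kind feed the regression step of the bottom-up approach described in contribution 1. The following is a minimal sketch of such a classical regression analysis; the feature choice, coefficients, and data are synthetic placeholders for illustration, not values from the paper.

```python
import numpy as np

# Illustrative regression in the spirit of the bottom-up approach:
# model the net power of a fine-grained component as a linear function
# of a few descriptors (here: core count and a memory-distance index).
# All data below are synthetic placeholders.

# Each row: [1 (intercept), number of cores, memory-distance index]
# where the index runs 0 = register, 1 = L1, 2 = L2, 3 = L3, 4 = DDR.
X = np.array([[1.0, c, m] for c in (1, 2, 4, 8) for m in range(5)])

rng = np.random.default_rng(0)
true_coef = np.array([1.5, 0.3, 0.8])             # hypothetical ground truth
y = X @ true_coef + rng.normal(0.0, 0.01, len(X))  # noisy "measurements"

# Classical least-squares fit of the component model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted model can then predict unseen configurations,
# e.g., 8 cores with operands streamed from DDR.
predicted = np.array([1.0, 8, 4]) @ coef
```

In the actual methodology one such model is fitted per fine-grained component (arithmetic, memory access, etc.) and per architecture, and the components are composed to predict whole kernels.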

multiply-add characterizes the SpMV). This normalization allows a more general and comparable description of the cost-per-component for the analyzed kernels. We also subtract idle power from the measured values to compute the real net power consumption of the executed kernels. Due to the limited accuracy of the power measurement system (especially on the IBM BG/Q), this procedure might magnify the relative error between the measure and the prediction, especially in cases where the total measured power is close to the idle power of the machine, i.e., where the net consumed power is very low.

To validate the bottom-up approach we set up many different benchmarks, varying the input parameters with respect to size and type of problem, number of cores and threads, as well as memory distance (cache levels vs. DDR). As an example, Figure 1 shows 40 different measures performed for the inner (dot) product without reduction (another set of 40 measures has been collected for the parallel reduction case). Similar measures have been performed for the axpy kernel, the dense matrix-vector product, and the entire conjugate gradient (CG) method. In all these cases, our model predicts power consumption very accurately: in particular, the average and maximum relative errors for the CG are equal to 2.5% and 9.0%, respectively, which are quite small values considering the measurement noise.

Concerning the SpMV, we demonstrate the generality and robustness of our models by classifying and predicting the performance of the entire University of Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices), i.e., approximately 1200 square real sparse matrices arising from a broad range of fields and applications. Measures for this collection have been performed for the case of one and four threads per core on the IBM BG/Q and POWER7, and one and eight threads per core on the IBM POWER8. Part of these results on the IBM BG/Q are summarized in Figure 2. Despite the increased complexity of the problem, due to the indirect and irregular memory access pattern of the different sparse matrices, our model is still able to provide very accurate predictions in the time-power-energy performance triangle.

As a final note, we want to stress that, in spite of the large amount of presented measures, the primary focus of our work remains the methodology employed to derive the models, which is general and applies to any multithreaded architecture. The main purpose of the large set of presented results is thus the validation of this methodology.
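The per-operation normalization and idle-power subtraction described above can be sketched as follows; all numeric values are invented placeholders for illustration, not measurements from the paper.

```python
# Sketch of the post-processing applied to each raw measurement:
# normalize time and energy by the operation count (e.g., one FMA per
# non-zero for SpMV) and subtract idle power to obtain the net costs.
# All numeric values below are illustrative, not actual measurements.

def performance_triangle(time_s, avg_power_w, idle_power_w, n_ops):
    """Return (time per op [s], net power [W], net energy per op [J])."""
    net_power_w = avg_power_w - idle_power_w   # subtract machine idle power
    net_energy_j = net_power_w * time_s        # net energy of the whole run
    return time_s / n_ops, net_power_w, net_energy_j / n_ops

# Example: a SpMV-like run over a matrix with 5,000,000 non-zeros.
t_op, p_net, e_op = performance_triangle(
    time_s=0.012, avg_power_w=42.0, idle_power_w=30.0, n_ops=5_000_000)

# Relative error between a model prediction and a measurement, as used
# for the accuracy figures reported in the text. Note how a small net
# power (avg_power_w close to idle_power_w) magnifies this error.
def relative_error(predicted, measured):
    return abs(predicted - measured) / abs(measured)
```
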

ACKNOWLEDGMENTS

The project Exa2Green (under grant agreement no. 318793) acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the ICT theme of the Seventh Framework Programme for Research (FP7/2007-2013) of the European Commission. IBM, the IBM logo, ibm.com, Blue Gene/Q, and Power 755 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

REFERENCES

[1] A. C. I. Malossi, Y. Ineichen, C. Bekas, A. Curioni, and E. S. Quintana-Ortí, “Systematic derivation of time and power models for linear algebra kernels on multicore architectures,” 2014, submitted.
[2] A. C. I. Malossi, Y. Ineichen, C. Bekas, A. Curioni, and E. S. Quintana-Ortí, “Performance and energy-aware characterization of the sparse matrix-vector multiplication on multithreaded architectures,” in Proc. of 43rd ICPP, Minneapolis, MN, USA, Sep. 2014.
[3] K. Asanovic et al., “The landscape of parallel computing research: a view from Berkeley,” EECS Dept., UC Berkeley, Tech. Rep. UCB/EECS-2006-183, 2006.
[4] E. L. Kaltofen, “The “seven dwarfs” of symbolic computation,” in Numerical and Symbolic Scientific Computing: Progress and Prospects. Springer, 2012, pp. 95–104.