John von Neumann Institute for Computing

Towards Fault Resilient Global Arrays Vinod Tipparaju, Manoj Krishnan, Bruce Palmer, Fabrizio Petrini, Jarek Nieplocha

published in

Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 339-345, 2007. Reprinted in: Advances in Parallel Computing, Volume 15, ISSN 0927-5452, ISBN 978-1-58603-796-3 (IOS Press), 2008. © 2007 by John von Neumann Institute for Computing

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume38

Towards Fault Resilient Global Arrays

Vinod Tipparaju, Manoj Krishnan, Bruce Palmer, Fabrizio Petrini, and Jarek Nieplocha

Pacific Northwest National Laboratory
Richland, WA 99352, USA
E-mail: {vinod, manoj, bruce.palmer, fabrizio.petrini, jarek.nieplocha}@pnl.gov

This paper describes extensions to the Global Arrays (GA) toolkit to support user-coordinated fault tolerance through checkpoint/restart operations. GA implements a global address space programming model, is compatible with MPI, and offers bindings to multiple popular serial languages. Our approach uses a spare pool of processors to perform reconfiguration after a fault, process virtualization, an incremental or full checkpoint scheme, and restart capabilities. Experimental evaluation in an application context shows that the overhead introduced by checkpointing is less than 1% of the total execution time. Recovery from a single fault increased the execution time by 8%.

1 Introduction

As the number of processors for high-end systems grows to tens or hundreds of thousands, hardware failures are becoming frequent and must be handled in such a manner that the capability of the machines is not severely degraded. The development of scalable fault tolerance is critical to the success of future extreme-scale systems. In addition to general and fully automated approaches1, we recognize that some classes of applications in chemistry, bioinformatics, data mining, and Monte Carlo simulations have a natural fault resiliency, provided that some critical data is protected from loss and corruption due to hardware faults. For such applications, fault recovery can be much less expensive and consume fewer system resources than would be required in the general case.

There has been a considerable amount of work aiming at achieving fault tolerance in MPI. Using the MPI programming model, the programmer must explicitly represent and manage the interaction between multiple processes, distribute and map parallel data structures, and coordinate the data exchanges through pairs of send/receive operations. For the upcoming massively parallel systems with complex memory hierarchies and heterogeneous compute nodes, this style of programming leads to daunting challenges when developing, porting, and optimizing complex multidisciplinary applications that need to demonstrate performance at a petascale level. These productivity challenges in developing complex applications with MPI have resulted in renewed interest in programming models providing a shared global-address data view, such as UPC, Co-Array Fortran, or Global Arrays (GA). The focus of the current paper is adding fault resiliency to the Global Arrays. The Global Array (GA) toolkit2 enables scientific applications to use distributed memory systems as a single global address space environment.
The user has the ability to create and communicate through data objects called global arrays as if they were located in shared memory. GA has been adopted by multiple application areas and is supported by most hardware vendors, including IBM for the fastest system on the Top-500 list, the IBM BlueGene/L3. We extended the GA toolkit to provide capabilities that enable programmers to implement fault resiliency at the user level. Our fault-recovery approach is programmer-assisted and based on frequent incremental checkpoints and rollback recovery. In addition, it relies on a pool of spare nodes that are used to replace the failing nodes. This approach is consistent with that of FT-MPI4, an MPI implementation that handles failure at the MPI communicator level and allows the application to manage the recovery by providing corrective options such as shrinking, rebuilding, or aborting the communicator. We demonstrate the usefulness of fault resilient Global Arrays in the context of a Self Consistent Field (SCF) chemistry application. On our experimental platform, the overhead introduced by checkpointing is less than 1% of the total execution time. Recovery from a single fault increased the execution time by only 8%.

2 Programming Model Considerations

The demand for high-end systems is driven by scientific problems of increasing size, diversity, and complexity. These applications are based on a variety of algorithms, methods, and programming models that impose differing requirements on the system resources (memory, disk, system area network, external connectivity) and may have widely differing resilience to faults or data corruption. Because of this variability, it is not practical to rely on a single technique (e.g., system-initiated checkpoint/restart) for addressing fault tolerance for all applications and programming models across the range of petascale systems envisioned in the future. Approaches to fault tolerance may be broadly classified as user-coordinated or user-transparent. In the latter model, tolerance to hardware failures is achieved in a manner that is completely transparent to the programmer, whereas user-coordinated approaches rely on explicit user involvement in achieving fault tolerance.

Under US DoE FASTOS program funding, we have been developing fault tolerance solutions for global address space (GAS) models. This includes both an automatic (user-transparent) approach9 and the user-coordinated approach described in the current paper, where we focus on fault tolerance for the Global Arrays (GA). Our user-transparent approach relies on Xen virtualization and supports high-speed networks. With Xen, we can checkpoint and migrate the entire OS image, including the application, to another node. These two technologies are complementary. However, they differ in several key respects such as portability, generality, and use of resources. The virtualization approach is the most general, yet potentially not as portable. For example, it relies on the availability of Xen and Xen-enabled network drivers for the high-speed networks. The user-coordinated approach is very portable.
However, since it requires modifications to the application code (e.g., placing checkpoint and restart calls and possibly other restructuring of the program), it is harder to use and less general. The GA library offers a global address space programming model implemented as a library with multiple language bindings (Fortran, C, C++, Python). It provides a portable interface through which each process in a parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. This feature is similar to the traditional shared-memory programming model. However, the GA model also acknowledges that remote data is slower to access than local data, and it allows data locality to be explicitly specified and used. Due to its interoperability with MPI, GA extends the benefits of the global address space model to MPI applications. GA has been used in large applications in multiple science areas.
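To illustrate this access style, here is a minimal toy sketch in Python; `ToyGlobalArray`, `get`, and `acc` are hypothetical stand-ins loosely modeled on GA's one-sided get/accumulate operations, not the real GA API:

```python
# Toy sketch of GA-style one-sided access (hypothetical class, not the real
# GA API). A "global array" holds data with affinity to owner processes; any
# process can get() a logical block or acc()umulate into it without explicit
# cooperation by the owner.

class ToyGlobalArray:
    def __init__(self, n):
        self.data = [0.0] * n          # flat 1-D stand-in for distributed data

    def get(self, lo, hi):
        """Copy a logical block [lo, hi) into a local buffer."""
        return self.data[lo:hi]

    def acc(self, lo, hi, buf):
        """Accumulate a local buffer into the block [lo, hi)."""
        for i, v in zip(range(lo, hi), buf):
            self.data[i] += v

ga = ToyGlobalArray(8)
patch = ga.get(2, 5)                   # one-sided read of a block
ga.acc(2, 5, [1.0, 2.0, 3.0])          # one-sided accumulate back
```

The get/compute/accumulate pattern shown here is the same one the SCF application uses later in the paper to build the Fock matrix.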


GAS models such as GA enable the programmer to think of a single computation running across multiple processors, sharing a common address space. For the discussion in the rest of the paper, we assume that each task or application process is associated with a single processor. All data is distributed among the processors, and each processor has affinity to a part of the entire data. Each processor may operate directly on the data it contains but must use some indirect mechanism to access or update data at other processors. GAS models do not provide explicit mechanisms like send/receive for communication between processors but rather offer an implicit style of communication, with processors updating and accessing shared data. These characteristics lead to different design requirements for fault tolerance solutions for GAS models than for MPI. In particular, we need to assure that the global data remains consistent during checkpoint/restart operations, that tasks executed by processors affected by a failure can be transparently reassigned, and that updates to the global shared data are fully completed (partial updates should either be prevented or undone).

3 Technical Approach

Our prototype of fault resilient Global Arrays uses a spare pool of processors to perform reconfiguration after a fault. The prototype solution includes task virtualization, an incremental or full checkpoint scheme, and restart capabilities. In addition, we rely on the parallel job's resource manager to detect a fault and notify all the tasks in an application. Not all resource managers have the ability to deliver a fault notification to the application; Quadrics is one vendor that has incorporated such mechanisms into its RMS resource manager. As discussed in the previous section, to achieve fault tolerance we need to assure that the global shared data is fault tolerant, that tasks executed by processors affected by the failure can be transparently reassigned, and that updates to the global shared data are fully completed (partial updates should either be prevented or undone).

3.1 Reconfiguration After the Fault

Our approach allocates a spare set of idle nodes to replace a failed node after the fault, see Fig. 1. This approach is more realistic than dynamic allocation of replacement nodes, given the current operating practices of supercomputer centres. For quicker restart, the spare nodes load a copy of the executable at startup time.

3.2 Processor Virtualization

To enable effective reconfiguration, virtualization of processor IDs is needed. The advantage of this approach is that once a failed node is replaced with a spare node, the array distribution details as known to the application do not have to be changed after recovery. Any prior information obtained by the application, such as the array distribution or the task IDs of the tasks that hold a particular section of an array, is still valid after recovery.
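The idea can be sketched as a translation table from virtual ranks to physical nodes; the names below are hypothetical and the real GA mechanism differs in detail:

```python
# Sketch of processor-ID virtualization: the application addresses tasks by
# virtual rank; a translation table maps virtual ranks to physical nodes.
# After a failure only the table changes, so array-distribution metadata
# cached by the application (which refers to virtual ranks) stays valid.

class RankTable:
    def __init__(self, working, spares):
        self.phys = list(working)      # phys[v] = physical node of virtual rank v
        self.spares = list(spares)     # idle replacement pool

    def on_failure(self, failed_node):
        v = self.phys.index(failed_node)
        self.phys[v] = self.spares.pop(0)   # a spare takes over the virtual rank
        return v

table = RankTable(working=["n0", "n1", "n2", "n3"], spares=["s0", "s1"])
owner_before = table.phys[2]           # application caches the owner of rank 2
table.on_failure("n2")                 # node n2 fails; a spare replaces it
owner_after = table.phys[2]            # same virtual rank, new physical node
```

After `on_failure`, every cached reference to virtual rank 2 remains valid; only the table's view of which physical node backs that rank has changed.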

3.3 Data Protection and Recovery

The fault tolerance scheme for GA was designed to impose only a small overhead relative to the execution of an application. Similarly to prior work such as libckpt5,6, we checkpoint the global data, which is distributed among processes.

Figure 1. Fault reconfiguration using a spare node pool.

Checkpointing global arrays is based on a collective call, i.e., all the processes participate in the checkpointing operation. There are two options for checkpointing global arrays: (i) full checkpointing and (ii) incremental checkpointing. With full checkpointing, each process takes a full checkpoint of its locally owned portion of the distributed global array; in this case, the entire global array is always checkpointed. The other approach, incremental checkpointing6, reduces the amount of global array data saved at each checkpoint. Our implementation uses a page-based scheme similar to prior work7, where only the pages modified since the last checkpoint are saved rather than the entire global data.
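The page-based incremental scheme can be sketched as follows. This is a simplified model in which writes are tracked through an explicit `write()` call; a real implementation would typically detect modified pages through the virtual memory system rather than instrumented writes:

```python
# Sketch of page-based incremental checkpointing: the local portion of a
# global array is split into fixed-size pages; writes mark the containing
# page dirty, and a checkpoint saves only the pages dirtied since the last
# checkpoint instead of the whole local portion.

PAGE = 4                               # elements per page (toy size)

class CheckpointedChunk:
    def __init__(self, n):
        self.data = [0.0] * n          # locally owned portion of a global array
        self.dirty = set()             # page indices modified since last ckpt
        self.saved = {}                # page index -> last saved page contents

    def write(self, i, v):
        self.data[i] = v
        self.dirty.add(i // PAGE)      # mark the containing page dirty

    def checkpoint(self):
        for p in self.dirty:           # save only the dirty pages
            self.saved[p] = self.data[p * PAGE:(p + 1) * PAGE]
        n_saved = len(self.dirty)
        self.dirty.clear()
        return n_saved                 # number of pages written this time

    def restore(self):
        for p, page in self.saved.items():
            self.data[p * PAGE:(p + 1) * PAGE] = page

chunk = CheckpointedChunk(16)          # 4 pages of 4 elements
chunk.write(0, 1.0); chunk.write(1, 2.0)
first = chunk.checkpoint()             # only page 0 was touched
chunk.write(9, 3.0)
second = chunk.checkpoint()            # only page 2 was touched
```

Each checkpoint writes one page rather than all four, which is the source of the scheme's low overhead when updates between checkpoints touch a small fraction of the array.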

4 Application Example

The target application for testing fault tolerance was a chemistry code that implements a simple Hartree-Fock Self Consistent Field (HF SCF) method. The method obtains an approximate solution to Schrödinger's equation, $H\Psi = E\Psi$. The solution is assumed to have the form of a single Slater determinant composed of one-electron wavefunctions. Each one-electron wavefunction $\varphi_i(r)$ is further assumed to be a linear combination of basis functions $\chi_\mu(r)$, written as $\varphi_i(r) = \sum_\mu C_{i\mu}\,\chi_\mu(r)$. Combining the linear combination of basis functions with the assumption that the solution to Schrödinger's equation is approximated by a single Slater determinant leads to the self-consistent eigenvalue problem $\sum_\nu F_{\mu\nu} C_{k\nu} = \epsilon_k C_{k\nu}$, where the density matrix is $D_{\mu\nu} = \sum_k C_{k\mu} C_{k\nu}$ and the Fock matrix $F_{\mu\nu}$ is given by

$$F_{\mu\nu} = h_{\mu\nu} + \frac{1}{2}\sum_{\omega\lambda}\left[2(\mu\nu|\omega\lambda) - (\mu\omega|\nu\lambda)\right] D_{\omega\lambda} \qquad (4.1)$$

Because the Fock matrix is quadratically dependent on the solution vectors $C_{k\mu}$, the solution procedure is an iterative, self-consistent procedure. An initial guess for the solution is obtained and used to create an initial guess for the density and Fock matrices. The eigenvalue problem is then solved and the solution is used to construct a better guess for the Fock matrix. This procedure is repeated until the solution vectors converge to a stable solution.

Figure 2. Block diagram of SCF. Block arrows represent data movement.

The solution procedure for the HF SCF equations is as follows, see Fig. 2. An initial guess to the density matrix is constructed at the start of the calculation and is used to construct an initial guess for the Fock matrix. The HF SCF eigenvalue equation is then solved to get an initial set of solution vectors, which can be used to obtain an improved guess for the density matrix. This process continues until a converged solution is obtained. Generally, the convergence criterion is that the total energy (equal to the sum of the lowest k eigenvalues of the HF SCF equations) approaches a minimum value. The density and Fock matrices, along with the complete set of eigenvectors, all form two-dimensional arrays of dimension N, where N is the number of basis vectors used to expand the one-electron wavefunctions. These arrays are distributed over processors. The Fock matrix is constructed by copying patches of the density matrix to a local buffer, computing a contribution to the Fock matrix in another local buffer, and then accumulating the contribution back into the distributed Fock matrix. This is shown as the task loop in Fig. 2. The tasks are tracked via a global counter that is incremented each time a processor takes up a new task. An obvious point for checkpointing in this process is to save the solution vectors at each cycle in the self-consistent field solution. The checkpoint would occur right after the "Diagonalize Fock Matrix" operation in Fig. 2. If the calculation needs to restart, this can easily be accomplished by restoring the saved solution vectors from the most recent cycle.
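The counter-driven task loop can be sketched as follows. The atomic read-and-increment is modeled by a plain function here (GA exposes it as an atomic operation on a global counter), and the round-robin scheduling of "processes" is purely illustrative:

```python
# Sketch of the global-counter task loop: each process repeatedly performs an
# atomic read-and-increment on a shared counter to claim the next Fock-matrix
# task, so tasks are distributed dynamically and executed exactly once.

def run_task_loop(num_tasks, num_procs):
    counter = [0]                      # shared global task counter
    done_by = {}                       # task id -> process that executed it

    def next_task():
        t = counter[0]                 # stand-in for an atomic read-and-increment
        counter[0] += 1
        return t

    proc = 0                           # round-robin stand-in for real concurrency
    while True:
        t = next_task()                # claim the next task
        if t >= num_tasks:
            break                      # no tasks left; this process is finished
        done_by[t] = proc              # "compute a Fock-matrix contribution"
        proc = (proc + 1) % num_procs
    return done_by

done = run_task_loop(num_tasks=10, num_procs=3)
```

Because every task id is handed out by the single counter, no task is executed twice and no task is skipped, regardless of how fast individual processes pull work.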
It is fairly straightforward to reconstruct the density matrix using the most recent solution, and the remainder of the HF SCF process can proceed as before. This would correspond to restarting in the "Update Density" block in Fig. 2.
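The cycle-level checkpoint/rollback can be illustrated with a toy fixed-point iteration standing in for the SCF cycle; everything below is hypothetical scaffolding, not the application code:

```python
# Sketch of cycle-level checkpoint/rollback: the "solution vector" is saved
# after each cycle (the point right after "Diagonalize Fock Matrix"); on a
# fault, the run restores the last saved state and repeats the lost cycle.
# A Newton iteration for sqrt(2), x -> (x + 2/x)/2, stands in for one cycle.

def scf_like_run(cycles, fault_at=None):
    x = 1.0                            # initial guess
    saved = x                          # last checkpointed state
    cycle = 0
    while cycle < cycles:
        x = (x + 2.0 / x) / 2.0        # one "SCF cycle"
        if cycle == fault_at:
            x = saved                  # fault: roll back to the last checkpoint
            fault_at = None            # recovered; redo the lost cycle
            continue
        saved = x                      # checkpoint after a successful cycle
        cycle += 1
    return x

clean = scf_like_run(6)                # fault-free run
faulty = scf_like_run(6, fault_at=3)   # one fault mid-run, same final answer
```

The faulty run redoes exactly one cycle rather than restarting from the initial guess, which mirrors the cost profile reported in the experiments: cheap checkpoints, and a recovery cost proportional to one lost cycle plus the restore.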


Figure 3. Overhead of data protection, and of data protection plus recovery, in SCF relative to total execution time (y-axis: overhead %, 0 to 9; x-axis: number of processors, 1 to 32).

For very large calculations, this level of checkpointing may be too coarse, and it would be desirable to add additional refinement. This could be accomplished by storing intermediate versions of the Fock matrix, as well as the solution vectors, which would correspond to adding additional checkpoints inside the task loop that constructs the Fock matrix.

4.1 Experimental Results

The experimental evaluation was performed on a Quadrics cluster with 24 dual Itanium nodes running Linux kernel 2.6.11. We ran HF calculations for a molecule composed of sixteen beryllium atoms. In addition to the checkpointing operation, we measured the time for recovery from a single node fault. Figure 3 shows that the overhead introduced by the data protection is very low (1%). This is the overhead representative of normal program execution (without faults); in those cases the restart operation is not triggered, and therefore the restart overhead also shown in Fig. 3 does not apply. As expected, the time to recover from the fault was more significant (8%) than checkpointing, as measured relative to the original application; it involves the cost of loading the restart files and reconfiguration. That level of overhead is quite reasonable, since by extending the execution time by 8% we avoid the costly alternative of restarting the whole application from the beginning. These results indicate that, at least for applications with characteristics similar to the HF program, our fault tolerant solution would have a very limited impact on performance while providing protection against faults.

5 Conclusions and Future Work

We described a prototype of a fault resilient Global Arrays toolkit that supports a global address space programming model. This capability was used for a chemistry application and shown to introduce very small overheads as compared to the baseline application without fault resiliency. Our plans for future work include more experimental validation with other applications, and comparing the effectiveness and performance of the current user-directed and user-transparent approaches. We also plan on adding our own collective fault notification mechanism as a fallback strategy when working with resource managers that do not have a mechanism to deliver the failure notification to all the tasks in an application.

References

1. F. Petrini, J. Nieplocha and V. Tipparaju, SFT: Scalable fault tolerance, ACM SIGOPS Operating Systems Review, 40, 2, (2006).
2. J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease and E. Apra, Advances, applications and performance of the Global Arrays shared memory programming toolkit, International Journal of High Performance Computing Applications, 20, 2, (2006).
3. M. Blocksome, C. Archer, T. Inglett, P. McCarthy, M. Mundy, J. Ratterman, A. Sidelnik, B. Smith, G. Almasi, J. Castanos, D. Lieber, J. Moreira, S. Krishnamoorthy and V. Tipparaju, Blue Gene system software: design and implementation of a one-sided communication interface for the IBM eServer Blue Gene supercomputer, Proc. SC'06, (2006).
4. G. Fagg, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic and J. Dongarra, Scalable fault tolerant MPI: extending the recovery algorithm, Proc. 12th Euro PVM/MPI, (2005).
5. J. S. Plank, M. Beck, G. Kingsley and K. Li, Libckpt: transparent checkpointing under Unix, in: Proc. Usenix Technical Conference, New Orleans, pp. 213-223, (1995).
6. J. S. Plank, J. Xu and R. H. Netzer, Compressed differences: an algorithm for fast incremental checkpointing, Tech. Rep. CS-95-302, University of Tennessee, (1995).
7. Princeton University Scalable I/O Research, A checkpointing library for the Intel Paragon, http://www.cs.princeton.edu/sio/CLIP.
8. Y. Zhang, R. Xue, D. Wong and W. Zheng, A checkpointing/recovery system for MPI applications on a cluster of IA-64 computers, International Conference on Parallel Processing Workshops (ICPPW'05), pp. 320-327, (2005).
9. D. Scarpazza, P. Mullaney, O. Villa, F. Petrini, V. Tipparaju, D. M. L. Brown and J. Nieplocha, Transparent system-level migration of PGAS applications using Xen on InfiniBand, Proc. IEEE CLUSTER'07, (2007).
