• No se han encontrado resultados

3.6

Related Work

To the best of our knowledge, our work is the first to explore the use of ULFM as a runtime system for high-level programming models. The closest to our work is related to the Fortran Coarrays programming model (Section 2.3.2.3). Fortran 2018 added new features to allow Coarrays programs to detect image failures and reform the remaining images for post-failure processing4. Fanfarillo et al. [2019] used ULFM to add these features to the OpenCoarrays framework [OpenCoarrays, 2019]. Failure recovery is performed collectively by the active processes using the revoke, shrink, and agreement functions of ULFM. In contrast, our work avoids global synchronization for recovery.

More explorations of ULFM have been done in the context of MPI applications and frameworks, which we review next.

Laguna et al. [2016] present an evaluation of ULFM’s programmability in two common patterns of HPC applications: bulk-synchronous with static load balancing and master-worker with dynamic load balancing. They concluded that the absence of a non-shrinking communicator recovery interface in ULFM complicates the recov- ery procedures of statically balanced computations, since it requires updating the domain decomposition to map to an arbitrary number of ranks, which is complex for many applications and may even be infeasible. They also reported complexities in developing local recovery procedures for computations that rely on non-blocking computation, since ULFM does not report failures while initiating a non-blocking communication. Although detecting failures through return codes enables applica- tions to determine the location of the failure in the code, they reported that this method results in significant changes and complications to the non-resilient code.

Gamell et al.[2016] used ULFM as a basis for comparing two in-memory check- pointing mechanisms implemented in the Fenix library. The first is neighbor-based — each process saves a copy of its own data and a copy of its neighbor’s data. The second is checksum-based — each process saves a copy of its own data, and one process saves the checksum of the data of all the processes. In their experiments, neighbor-based checkpointing achieved better performance at the expense of higher memory footprint.

Teranishi and Heroux [2014] used ULFM to implement the Local Failure Local Recovery programming model, which aims to support the possible transition from global to local rollback recovery in certain HPC applications. The programming model provides simple abstractions for handling data and communicator recovery to enable MPI applications to adopt local non-shrinking recovery more easily. Three abstractions are provided: 1) a resilient communicator that uses spare processes to repair its internal structure, 2) a redundant storage interface that implements in- memory checksum-based checkpointing, and 3) an extendable programming template for building recoverable distributed data structures.

Losada et al.[2017] used ULFM to deliver transparent non-shrinking recovery for MPI applications using the ComPiler for Portable Checkpointing (CPPC) [Rodríguez et al., 2010]. CPPC is a source-to-source compiler capable of transparently inject- ing checkpointing instructions in MPI programs at safe points discovered using static code analysis and variable liveness analysis. The above paper improved the generated code from CPPC to automatically handle failure detection, global failure propaga- tion using MPI_COMM_REVOKE, communicator recovery using MPI_COMM_SHRINK and MPI_COMM_SPAWN, and computation restart.

In addition to the checkpoint/restart frameworks, other research groups have usedMPI-ULFMfor implementing algorithmic-based fault-tolerant applications.

Rizzi et al. [2016] used ULFM to design an approximate PDE solver that can tolerate both hard and soft faults. Algorithmic reformations were performed to the PDE to introduce filters for corrupt data and to add more task-parallelism that can fit a master-worker execution model. They favoured the master-worker model as it does require global communicator repair it the event of a worker failure. Depending on the computation stage, the described algorithm handles failure of a worker either by ignoring its results or by resubmitting its work to another worker. As in our work, the master detects the failure of any of the workers by continuously probing the communicator usingMPI_ANY_SOURCE. The master is assumed to be fully reliable.

Ali et al.[2015] applied an algorithmic fault-tolerant adaptation of the sparse grid combination technique (SGCT) [Harding and Hegland,2013] in the GENE (Gyroki- netic Electromagnetic Numerical Experiment) plasma application. Through data re- dundancy, introduced by the adapted SGCT, their implementation is capable of recov- ering lost sub-grids from existing ones more efficiently than using checkpoint/restart. Synchronous non-shrinking recovery of the MPI communicator is facilitated using ULFM.

3.7

Summary

This chapter has focused on the MPI backend of the X10 language, and how we extended it to support fault-tolerant execution using MPI-ULFM. We outlined the fault tolerance features thatRX10requires from the underlying communication layer to meet the requirements of the resilient async-finish model. We described the details of the integration between X10 and MPI-ULFM, including the use of MPI-ULFM

fault-tolerant collectives by X10’sTeamcollectives.

The experimental evaluation demonstrated strong performance and scalability advantages to RX10 applications by switching from the sockets backend to the

MPI-ULFMbackend. Most importantly, the performance of native Team collectives showed that RX10 bulk-synchronous applications can avoid most of the resilience overhead of the RX10 runtime system by using MPI-ULFM collectives. We will present a performance evaluation of a suite of resilient bulk-synchronous applica- tions in Section5.4.4.

In the next chapter, we describe another performance enhancement to RX10 by providing an optimistic termination detection protocol for resilientfinish.

Chapter 4

An Optimistic Protocol for

Resilient Finish

This chapter describes how we reduced the resilience overhead ofRX10applications by designing a message-optimal resilient termination detection (TD) protocol for the async-finish task model. We refer to this protocol as ‘optimistic finish’ since it favors the performance of failure-free executions over the performance of failure recovery. In this chapter, we evaluate the performance of this protocol using microbenchmarks of X10 programs representing different task graphs that are common in resilient X10 programs. In the next chapter, we evaluate the protocol using realistic X10 applications (see Section5.4).

After an introduction in Section 4.1, we give an overview about different nested task parallelism models in Section 4.2, and different approaches to resilient TDin Section 4.3. After that, we prove the optimality limits of non-resilient and resilient async-finishTDprotocols in Section4.4and highlight the challenges of adding fail- ure awareness to the async-finish model in Section 4.5. We then describe the data structures and APIs used by the X10 runtime system for control-flow tracking in Section4.6. The following sections explain different async-finishTDprotocols: non- resilient finish (Section4.7), pessimistic resilient finish (Section4.8), and our proposed optimistic resilient finish (Section4.9). Section4.10describes two resilient stores that can be used for maintaining criticalTDmetadata and Section4.11concludes with a performance evaluation.

This chapter is based on our paper “Resilient Optimistic Termination Detection for the Async-Finish Model” [Hamouda and Milthorpe,2019].

4.1

Introduction

Dynamic nested-parallel programming models present an attractive approach for exploiting the available parallelism in modernHPCsystems. A dynamic computation evolves by generating asynchronous tasks that form an arbitrary directed acyclic graph. A key feature of such computations is TD — determining when all tasks

in a subgraph are complete. In an unreliable system, additional work is required to ensure that termination can be detected correctly in the presence of component failures. Task-based models for use inHPCmust therefore support resilience through efficient fault-tolerant termination detection protocols.

The standard model of termination detection is the diffusing computation model [Di- jkstra and Scholten, 1980]. In that model, the computation starts by activating a coordinator process that is responsible for signaling termination. The status of the other processes is initially idle. An idle process becomes active only by receiving a message from an active process. An active process can become idle at any time. The computation terminates when all processes are idle, and no messages are passing to activate an idle process [Venkatesan,1989].

Most termination detection algorithms for diffusing computations assign a parental responsibility to the intermediate tasks, requiring each intermediate task to signal its termination only after its successor tasks terminate [Dijkstra and Scholten, 1980; Lai and Wu, 1995; Lifflander et al., 2013]. The root task detects global termination when it receives the termination signals from its direct children. This execution model is very similar to Cilk’s spawn-sync model [Blumofe et al., 1995], where a task calling ‘sync’ waits for join signals from the tasks it directly spawned. It is also similar to the execution model of thecobeginandcoforallconstructs of the Chapel language [Chamberlain et al.,2007]. While having the intermediate tasks as implicit

TDcoordinators is favorable for balancing the traffic ofTDamong tasks, it may add unnecessary blocking at the intermediate tasks that is not needed for correctness. The async-finish model does not impose this restriction; a task can spawn other asyn- chronous tasks without waiting for their termination. Unlike Cilk’s sync construct, the finishconstruct waits for join signals from a group of tasks that are spawned directly or transitively from it.

To the best of our knowledge, the first resilient termination detection protocol for the async-finish model was designed by [Cunningham et al.,2014] as part of theRX10

project. This protocol does not force the intermediate tasks to wait for the termination of their direct successors. It maintains a consistent view of the task graph across multiple processes, which can be used to reconstruct the computation’s control flow in case of a process failure. It adds more termination signals over the optimal number of signals needed in a resilient failure-free execution for the advantage of simplified failure recovery. Since it favors failure recovery over normal execution, we describe this protocol as ‘pessimistic’.

In this chapter, we review the pessimistic finish protocol and demonstrate that the requirement for a consistent view results in a significant performance overhead for failure-free execution. We propose the ‘optimistic finish’ protocol, an alternative message-optimal protocol that relaxes the consistency requirement, resulting in a faster failure-free execution with a moderate increase in recovery cost.

Documento similar