λ with an increase in FLOPS performance. A final conclusion cannot be drawn at this time because of a lack of data. Only recently have systems on the TOP500 list begun to report their available memory. For older systems this information is rarely provided, since memory was not considered a performance indicator. So we are left with adapting the checkpointing scheme, in particular increasing the number of checkpoints. Less memory per node means more checkpoints per node and thus more recomputations. In a parallel environment the number of checkpoints is influenced by two factors instead of one: the problem size n and, additionally, the number of nodes c. The memory footprint of an adjoint program without checkpoints is directly related to the runtime complexity of the original program. A code with a runtime complexity of O(n^2) will have an adjoint memory footprint of at least O(n^2). Using our simple checkpointing scheme allows us to decrease this footprint, but it introduces recomputations that raise the runtime complexity to the power of two, thus ending up with a problem in O(n^4) for the sequential program. The best known checkpointing scheme leads to a logarithmic growth of the memory complexity with respect to the original runtime complexity [17, 20].
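The trade-off between stored checkpoints and recomputation can be made concrete with a small counting sketch. It is not the checkpointing scheme used in this work. It assumes a chain of n identical steps, a checkpoint stored every `stride` steps, and a reverse sweep in which each step restarts from the nearest earlier checkpoint. The function name and its parameters are hypothetical.

```python
def reverse_sweep_cost(n, stride):
    """Count checkpoints and recomputed forward steps for a chain of n steps.

    Assumption (illustrative only): state i is checkpointed when
    i % stride == 0, and the adjoint of step i needs state i, which is
    recomputed from the nearest checkpoint at or before i.
    """
    checkpoints = len(range(0, n, stride))
    recomputed = 0
    for i in reversed(range(n)):
        nearest = (i // stride) * stride  # index of the checkpoint used
        recomputed += i - nearest         # forward steps replayed for step i
    return checkpoints, recomputed


if __name__ == "__main__":
    for stride in (1, 4, 8):
        print(stride, reverse_sweep_cost(8, stride))
```

With a single checkpoint (stride equal to n) the count of recomputed steps is n(n-1)/2, so under these assumptions the recomputation cost grows quadratically in the number of steps, while storing every state removes recomputation at the price of n stored states.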

The third option, not mentioned so far, is the analytical computation of adjoints. Instead of generating code that computes adjoints discretely, where the differentiation happens at the statement level, the adjoint model can be applied to an entire mathematical subproblem. This could be a linear solver, iterations thereof or even an entire nonlinear solver. This is the so-called continuous differentiation, as opposed to the aforementioned discrete method [47]. A continuously differentiated subproblem may dramatically reduce the memory footprint, as has been shown, for example, for linear and nonlinear solvers [47]. However, discrete adjoints may still be required inside a continuous model (e.g. continuous adjoints of a nonlinear solver). Moreover, a continuous adjoint may not be equivalent to a discrete adjoint. Already at the level of iterative solvers, the difference between the two is clear [12]. Additionally, an adjoint model of the original simulation model may not exist, or it may be mathematically challenging to derive. In summary, continuous adjoints are even more worth the effort in parallel computing than in a serial setting.
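As an illustration of adjoining a whole subproblem rather than its individual statements, the following sketch treats a 2x2 linear solve x = A^{-1} b as one unit. It relies on the standard analytic result that b_bar solves A^T b_bar = x_bar and that A_bar = -b_bar x^T. This formula and all names in the code are assumptions brought in for the example, not taken from this work.

```python
def solve2(A, b):
    """Solve the 2x2 system A x = b with Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]


def adjoint_solve2(A, x, x_bar):
    """Analytic adjoint of x = solve2(A, b), treated as a single operation.

    Only A, the solution x and the incoming adjoint x_bar are needed;
    no intermediate values of the solver are stored.
    """
    A_t = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]  # transpose of A
    b_bar = solve2(A_t, x_bar)                      # A^T b_bar = x_bar
    A_bar = [[-b_bar[i] * x[j] for j in range(2)] for i in range(2)]
    return A_bar, b_bar


A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x = solve2(A, b)
x_bar = [1.0, 0.0]  # seed: sensitivities of x[0]
A_bar, b_bar = adjoint_solve2(A, x, x_bar)
```

The adjoint costs one additional solve with the transposed matrix, regardless of how the forward solve was computed, which is why the memory footprint does not depend on the solver's internal steps.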

So what could be a general approach to the memory problem in an exascale environment? What design choices should be made when starting a new code base for a project? Whenever the runtime complexity of the original code is above the projected memory evolution with respect to the number of nodes, continuous adjoints are the only potentially scalable solution, provided that a continuous version exists. With discrete adjoints, the number of recomputations would increase, thus not fulfilling the expectations of the user. However, the future evolution of the memory per node is unknown. Everything that is below this unknown threshold can and should be adjoined using discrete adjoints generated by AD tools, including the MPI communication.

3.2 Generation and Correctness of Adjoint Patterns

[Figure: MPI code -> Induce -> Extended SAC -> Differentiate -> Extended SAC(1) -> Generate / Verify -> Adjoint MPI code]

Figure 3.2: Method of adjoint MPI code generation and potential verification

Chapter 2 laid out the foundations of our adjoint communication analysis. This section describes how the adjoint communication generation in this work is derived in three distinct steps and what to expect from the correctness or the verification of an adjoint MPI implementation.

First, the MPI standard serves as the basis for the adjoint code generation. It describes in English text the requirements for an MPI interface implementation. It lacks any mathematical or computer-scientific abstraction and is mainly composed of function signatures. The text then describes the arguments of these functions and the behaviour of the communication. There is no proof of correctness or coherence. Incoherences are mainly detected empirically, through implementations of MPI libraries and runtime tests thereof. Hence, we introduced an abstraction that allows us to model MPI communication mathematically and translate MPI code into extended SAC notation (see Section 2.6). The first step is to extract the necessary arguments from a function's signature. The arguments may be split into three categories:

1. arguments defining the communicated buffer (e.g. send buffer, receive buffer, reduction buffer),

2. arguments specifying where the data is being sent to or received from (e.g. rank) and when (MPI_Request, MPI_Fence),

3. MPI specific arguments that are not relevant for the logic of the communication pattern itself (e.g. communicators, tags).
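The three categories can be illustrated on the signature of MPI_Send (buf, count, datatype, dest, tag, comm). The sketch below is a hypothetical classification: the category names and the helper function are invented for the example, and placing count and datatype in the buffer category is an assumption, since the text names only the buffer itself.

```python
# Category labels (hypothetical names for the three categories in the text).
BUFFER = "buffer"              # 1. defines the communicated buffer
ROUTING = "routing"            # 2. where to / from, and when
MPI_SPECIFIC = "mpi_specific"  # 3. not relevant for the pattern logic

# Arguments of MPI_Send in signature order.
MPI_SEND_ARGS = {
    "buf": BUFFER,
    "count": BUFFER,      # assumption: describes the extent of the buffer
    "datatype": BUFFER,   # assumption: describes the layout of the buffer
    "dest": ROUTING,      # rank of the receiving process
    "tag": MPI_SPECIFIC,
    "comm": MPI_SPECIFIC,
}


def arguments_kept_for_sac(args):
    """Return the arguments of the first two categories, in signature order."""
    return [name for name, category in args.items()
            if category != MPI_SPECIFIC]


if __name__ == "__main__":
    print(arguments_kept_for_sac(MPI_SEND_ARGS))
```

Only the arguments returned by the helper would enter the abstraction; the tag and the communicator are dropped because they do not change the data flow of the pattern.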

For the induction of the extended SAC code, we only need the first two types of arguments, which we map onto the communication function SendRecv introduced in Section 2.6. To model the destination and source of a communication, the PGAS notation (see Section 2.5) is first applied to all variables. Each variable is prefixed with the rank of the process on which it resides. A receive is, for example, written as b.x = SendRecv(a.x). The first argument defines the buffer (variable x on process a) that is being sent from process a to the variable x on process b. In the SAC notation, MPI functions have only their input buffer as an argument, whereas the output buffer is the return value of the function. The second type of arguments, defining when and to or from whom the data is being sent or received, is modelled entirely by the prefix of the PGAS notation. The last missing element is the access restrictions described in the MPI standard, which we model using locks (see Section 2.6.2). These abstractions (PGAS, extended SAC and locks) allow us to induce an extended SAC. This is purely an interpretation of English text with no proof of correctness. It is assumed that the English text is interpreted correctly. This step is called induction (see Figure 3.2).

The second step is to apply the adjoint model to the extended SAC. The SendRecv function is adjoined just like any regular intrinsic function. The difficult part, not present in a regular SAC, is the lock statements. They are adjoined according to the rules derived in Section 2.6.2. The output of the differentiation step is an adjoint SAC (see Figure 3.2). The SAC abstracts the original MPI code by describing the data flow, the concurrency and the data access. The same is true for the adjoint SAC: it describes the constraints of an adjoint MPI implementation.
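The forward statement b.x = SendRecv(a.x) and its adjoint can be mimicked in a single-process sketch. It assumes the usual incremental adjoint rule for an overwriting assignment whose right-hand side is the identity, namely a.x_bar += SendRecv(b.x_bar) followed by b.x_bar = 0. The dictionaries standing in for processes and the name `send_recv` are hypothetical, and the lock statements are omitted entirely.

```python
def send_recv(value):
    """Model of SendRecv: the input buffer is the argument,
    the output buffer is the return value."""
    return value


# Two "processes" a and b, each holding x and its adjoint x_bar.
a = {"x": 3.0, "x_bar": 0.0}
b = {"x": 0.0, "x_bar": 0.0}

# Forward statement: b.x = SendRecv(a.x)
b["x"] = send_recv(a["x"])

# Seed the adjoint of the received value on process b.
b["x_bar"] = 2.0

# Adjoint statements: the data flows in the opposite direction.
a["x_bar"] += send_recv(b["x_bar"])  # a.x_bar += SendRecv(b.x_bar)
b["x_bar"] = 0.0                     # b.x was overwritten in the forward run
```

In this sketch the adjoint of a transfer from a to b is a transfer from b to a with an increment on the receiving side, which reflects the reversal of the data flow.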

Hence, the third step works in two ways and is called either verification or generation. The adjoint SAC gives us hints as to what an adjoint implementation might look like (generation). Moreover, it allows us to verify an adjoint MPI implementation (verification), because this adjoint code can be mapped onto an extended SAC in the same way as the original code. This leads us to the downside of this method. The extended SAC with its constraints does not translate uniquely and seamlessly into an adjoint MPI implementation. It does not allow us to generate adjoint MPI code automatically from the original code; there has to be human input. The non-uniqueness of an adjoint implementation is inherent in MPI. MPI is not a standard that needs to be read and interpreted by compilers; thus it is also not transformable through a grammar, as has been done for adjoints of OpenMP code [16].

For the communication patterns we distinguish between blocking, nonblocking, collective and one-sided communication. The signature of the MPI functions is presented and the arguments that are relevant for the adjoining of the communication are picked. Our method of adjoining a given MPI communication pattern is then applied. The blocking communication serves as an illustrative example for all the other patterns and should be seen as part of our method description. In general, it is assumed that the reader has an in-depth knowledge of MPI.
