11. tEnDEnCIAs DE Futuro En EL sECtor DE Los ContEnIDos
11.2 Cine y Vídeo digital
We build a mathematical model to combine task-level checkpointing with a system-wide check- pointing scheme. If task-level checkpointing cannot handle the errors occurred, such as those that corrupt kernel, runtime and MPI library, then the system-wide scheme will recover from those errors. The model targets the detected errors that prevent the successful completion of the application execution, i.e. fail-stop errors. The undetected errors, called silent errors, are mitigated by our techniques explained in Chapter7.
Table 5.1: Model Parameters Parameter Description
τs Checkpoint interval of system-wide checkpoints cs Time to take a system-wide checkpoint
rs Time to restart with system-wide scheme
Tf f Failure-free computation time without fault tolerance µs Expected rate for fail-stop errors
Tover al l Total execution time including fault-tolerance
C ov T L Failure coverage of task-level checkpointing from fail-stop errors TT L Total amount of time used for task-level checkpointing
Ts ys Total amount of time used for system-wide checkpointing
Twast ed Total amount of time used for task-level and system-wide scheme
WT L Total waste per unit of time of task-level checkpointing Ws ys Total waste per unit of time of system-wide scheme
C Fraction of time that an application is performing task computations
Table5.1shows the model parameters. We first introduce the total time equations:
Tover al l= Tf f + Twast ed
= Tf f + TT L+ Ts ys (5.1)
= Tf f + (WT L+ Ws ys)Tover al l (5.2)
The first equation shows the overall execution time which includes the wasted time due to task-level checkpointing and the wasted time due to system-wide checkpointing. The second equation shows the breakdown of the overall execution time with respect to the total waste per unit of time which we will minimize. We note that the failure coverage C ov T L refers to the fraction of time that a task-level scheme is able to recover from fail-stop errors. Formally, it is
defined as
C ov T L= C × e−
Rt
0λ(θ) dθ (5.3)
where C is a factor showing the fraction of time that a task-parallel application is perform- ing task computations, that is, the execution time except the time consumed in runtime and MPI library, and in system calls. During the time in runtime and MPI library and in system calls, task-level checkpointing cannot mitigate errors which we need to factor out. In addition,
e−R0tλ(θ) dθ is the probability of the successful execution of a task-parallel computation from the
beginning of the computation until time t with some failure distributionλ(θ) [94]. Our model does not assume any particular distribution forλ(θ).
We detail the waste per unit of time of system-wide checkpointing, i.e. Ws ys, in Section
5.3.1 and then the waste per unit of time of task-level checkpointing, i.e. WT L, in Section
5.3.2. In Section 5.3.3, we analytically find the optimal checkpoint intervals of the system- wide scheme and of the combined scheme. Finally, we conclude the model formalization with Section5.3.4where we discuss the generality of our model.
5.3.1
The waste time of a system-wide scheme
The waste per unit of time of a system-wide scheme, i.e. Ws ys, is the sum of the checkpointing, rework and restart time per unit of time. Checkpointing time captures the overhead of taking checkpoints. The rework time is the lost computation from the latest checkpoint to the moment of the failure. The restart time refers to the time to restore the latest checkpoint. We include down times, if any, in restart times since they do not require any special treatment. Now we formalize checkpointing, rework and restart time.
The checkpoint overhead per unit of time of a system-wide scheme, denoted by Ws yschk, is the product of the time to checkpoint cs and the number of checkpoints which is
1
τs. Hence the checkpoint overhead per unit of time is
Ws yschk= cs
τs
(5.4)
When a failure occurs, on average, half of the computation is lost since the last checkpoint. Thus, the expected value of rework overhead per unit of time of a system-wide scheme, denoted
by Ws ysr ew, is
Ws ysr ew=µsτs
2 (5.5)
Restart overhead per unit of time, denoted by Ws ysr es, is the product of the failure rate and the time to restore the checkpoint. Thus,
Ws ysr es= µsrs (5.6)
Therefore the waste per unit of time of a system-wide scheme without task-level check- pointing is
Ws ys= cs τs
+µsτs
2 + µsrs (5.7)
5.3.2
The waste time of task-level scheme
The waste per unit of time of a task-level scheme, WT L, is constituted by the checkpoint per unit of time of all tasks, and the rework and restart per unit of time of the failed tasks during the program execution. Let Fail Ts be the set of the failed tasks during the program execution due to fail-stop errors. In addition letα be a factor showing the fraction of the total wasted time of task-level checkpointing and recovery that is reflected in wall-clock time of the application execution. So for instance ifα = 0, then the application perfectly overlaps the computation with the task-level checkpointing and recovery and there is no overhead reflected in the wall- time of the application. On the other hand ifα = 1, then it means that task-level fault tolerance is sequential and fully reflected in the wall-time of the application. Note that0<= α <= 1.
The checkpoint overhead per unit of time, WT Lchk, is the rate of taking task-local checkpoints. Hence the checkpoint overhead per unit of time of task-level checkpointing is
WT Lchk= X
∀i,Taski
Taskichk (5.8)
where Taskichkis the checkpoint time per unit of time of task Taski.
The rework overhead per unit of time for a single task is the computation lost since the beginning of the task. Thus the rework overhead per unit of time, WT Lr ew, of task-level check-
pointing is
WT Lr ew= X Taski∈Fail Ts
µs× Taskir ew (5.9)
where Taskr ewi is the lost computation overhead per unit of time of the failed task Taski since
the beginning of its computation.
The restart overhead per unit of time, WT Lr es, is the per unit of time of restoring the task-local checkpoint. Thus the restart overhead per unit of time of task-level checkpointing is
WT Lr es= X Taski∈Fail Ts
µs× Taskir es (5.10)
where Taskir es is the restart overhead per unit of time of task Taski.
Finally the waste per unit of time of a task-level checkpointing scheme is
WT L = α(WT Lchk+ WT Lr ew+ WT Lr es) (5.11) In the next section we combine the two models.
5.3.3
The waste time of the combined model
If the task-level scheme provides the failure coverage C ov T L,µsis decreased as µ0
s = (1 − CovT L) × µs (5.12)
which leads us to parametrize the waste per unit of time of the system-wide scheme with respect to the expected failure rate.
Letµ0s= (1 − CovT L) × µs. We then denote the waste per unit of time of the system-wide scheme under failure coverage C ov T L with Ws ys(µ0s, C ov T L). As a result, note that Ws ys(µ0s, 0) denotes the waste per unit of time of the system-wide scheme without task-level checkpointing. The waste per unit of time of the system-wide scheme under failure coverage C ov T L is
Ws ys(µ0s, C ov T L) = cs τs +µ 0 sτs 2 + µ 0 srs (5.13)
Then the total waste per unit time of the combined scheme is
W = WT L+ Ws ys(µ0s, C ov T L) (5.14) We now compute the optimal checkpoint intervals of both the system-wide only and the unified scheme. To find the optimal interval of the system-wide only scheme, we take the derivative of Equation5.7with respect toτsto and equate it to zero to find the global minimum. We compute it as τ∗ s = v t2cs µs (5.15)
which is similar to the Young’s formula [52].
The waste per unit of time of the task-level scheme WT L is independent from τs of the system-wide scheme. That is, the task-level overheads and waste per unit of time is independent from the checkpoint interval of system-wide scheme. WT L then can be treated as a constant - with respect toτs- such that the total waste per unit of time of the combined scheme in Equation
5.14 is still a convex function and thus we can find a global minimum. The first derivative of Equation5.14with respect toτs is given by
dW dτs = µ0 s 2 − cs τ2 s (5.16)
where the derivative of WT L with respect toτsis zero. Setting the derivative to zero, the solution is τ∗ comb= v u t2cs µ0 s (5.17)
Hence, if the task-level scheme has coverage C ov T L, thenτ∗s is increased by a factor of γ =
q 1
1−C ovT L, which we call the gamma factor, according to Equation5.17. That is, the optimal
solution of the combined model is τ∗ comb= v t 1 1− C ovT L v t2cs µs = γτ∗ s (5.18)
5.3.4
Discussion on the Generality of the Model
The unified model can be applied to any non-hierarchical scheme where there is an uncoordi- nated fault tolerance scheme, such as task-level checkpointing, that is used with a coordinated checkpointing scheme. The coordinated scheme can be any system-wide scheme where the uncoordinated scheme decreases the optimal checkpoint intervals in accordance to Equation
5.18. Moreover, this effectively also means that our model is orthogonal to studies done for coordinated or hybrid hierarchical systems such as Di et al. [57] and Bosilca et al. [60].
Our model enables application users tune the optimal checkpoint interval according to their applications. For instance, the system-wide checkpoint and restart times for application data differ from application to application, as evident in Table5.5 from our results. They can tune the optimal checkpoint interval with respect to their applications’ checkpoint and restart times. In addition, if their applications exhibit different failure rates as reported by Fang et al. [95], the optimal checkpoint interval can be adapted further.