In the previous section, we looked at how the nodes should be grouped into replicas when the number of nodes to be replicated is fixed. In other words, the number of application visible nodes n was already decided, and the goal was to intelligently pick nodes to be placed in replicated and non-replicated sets based on their individual reliability. Now we attempt to answer the more general question: Given an N node system with node reliability g1(t) ≥ g2(t) ≥ · · · ≥ gN(t), how many of the N nodes should be used and how many
should be replicated? This question cannot be answered by considering system reliability alone. Although a higher value of n will reduce the work per node due to parallelism, system reliability will go down making failures more likely. On the other hand, higher replication factors are likely to add more runtime overhead to the application, although they lead to a more resilient configuration. These trade offs can only be captured by computing the expected completion time for given number of nodes and replica pairs, and then picking the values of these variables that yield the minimum completion time.
4.3.1 Job Model
The first thing to determine, as n becomes variable, is the amount of work that will be distributed over each node and executed in parallel. Similar to the previous chapter, we use Amdahl’s law to determine Wn, the time required to execute the job on n parallel nodes
without failures:
Wn = (1 − α)W/n + αW (4.5)
where 0 ≤ α ≤ 1 represents the sequential part of the job. Since the focus of this dissertation is on HPC applications that usually have a high level of parallelism, most of the analysis we perform and the results we report will be for values of α equal to, or close to, 0.
4.3.2 Overhead Model for Partial Replication
As mentioned in the previous chapter, in addition to reducing the nodes over which work is parallelized, replication also induces additional overhead to message passing applications because of the communication required between replicas in order maintain consistency among them. Naturally this overhead increases when the replication degree increases, since more replicas mean more messages being duplicated. An approach to model the overhead versus the degree of replication was proposed in [26] using γ, the ratio of application time spent in communication under no replication. We use their idea but update the model so that it agrees with the experimental results reported subsequently in [27]. According to this model, for an application executing under partial replication factor r, the time, Wr, that includes
the overhead of partial replication, is given by
Wr = Wn+
√
r − 1γWn (4.6)
This estimate provides a more pessimistic overhead for partial replication compared to the original model of [26]. Moreover, it matches with the experimental result of [27] on real sys- tems since, for r = 1.5, the overhead will be 1/√2 ≈ 71% of the overhead of full replication. We, therefore, use Eq. 4.6 to compute and add the overhead of partial replication.
4.3.3 Combining with Checkpointing
Having figured out the failure-free execution time, Wr, of a partially replicated applica-
tion, we now proceed to compute the expected completion time of such an application under failures. For that, however, we first need to determine the checkpointing interval, which in turn depends on the mean time to interrupt (MTTI). The MTTI, M , can be computed using the reliability as:
M = Z ∞
0
R(t)dt (4.7)
where R(t) is the overall reliability for a a given partial replication configuration. Although we mentioned in the previous chapter the closed form formula for MTTI of dual replication as derived in [7], in general, with partial replication, it is not always possible to evaluate the integral in Eq. 4.7 analytically. We, therefore, resort to numerical integration to obtain the MTTI for our results. For the results we compute numerically, we will be using the more accurate higher order approximation by Daly[20] to calculate the checkpointing interval, τ .
In order to determine the expected completion time, we employ the same approach used in the previous chapter for non-exponential distributions, which is based on the works of [53] and [68]. As a brief recap, this involves computing the extra work, which consists of the time spent writing checkpoints and the lost work due to failures. By considering each failure as a renewal process, the average fraction of extra time during an execution can be taken to be the same as the fraction of extra time between two consecutive failures. Let F (t) = 1 − R(t) be the cumulative failure distribution and f (t) = F0(t) be the failure density function. The extra time between consecutive failures will then be given by
E(extra) = Z ∞ 0 Ct τ f (t)dt + kτ = C ∗ M τ + kτ (4.8)
where k is the average fraction of work lost during a checkpointing interval due to failure and can be evaluated numerically[53].
Once we obtain E(extra), the expected time to finish work Wr will be given by
E(Wr) = Wr
M
M − E(extra) (4.9)
where Wr is determined from Eq. 4.6. This is the equation that we use to compute and
4.3.4 Optimization Problem
The purpose of computing the expected completion time was to find the replication factor r that minimizes it for a given system. We thus formulate the search for r as an optimization problem as follows:
minimize
a,b E(Wr)
subject to a + 2b ≤ N n = a + b ≥ 1
where r = (a + 2b)/(a + b) and a and b can only take nonnegative integer values. The inputs include work W , total number of system nodes N , individual node reliability functions 1 > g1(t) ≥ · · · ≥ gN(t) > 0, checkpointing cost C, the parameter α, and communication
ratio, γ. In the next section, we will discuss our findings and results about the optimal r for different kinds of systems. In our results, we report the expected completion time normalized by WN, which is calculated from Eq. 4.5, and represents the time it takes to execute the job
on all N nodes without replication or checkpointing and under no failures.