The percentage execution time overhead of all presented algorithms, relative to the conventional fault-intolerant single-precision GEMM (sGEMM) computation, is plotted in Figures 2.8 and 2.9. The results were obtained by averaging the execution time of all methods
[Plot, panels (a) and (b). x-axis: Matrix Size (M); y-axis: (ETime_method / ETime_conventional) (%); series: Proposed, ABFT, mABFT, DMR.]
Figure 2.8: Execution time percentage overhead of numerical packing, ABFT, mABFT and DMR for (a) GEMM computation only; (b) pre-processing, GEMM computation, error check and post-processing in the absence of errors.
[Plot, panels (a) and (b). x-axis: Matrix Size (M); y-axis: (ETime_method / ETime_conventional) (%); series: Proposed, ABFT, mABFT, DMR.]
Figure 2.9: Execution time percentage overhead of numerical packing, ABFT, mABFT and DMR for error detection and correction under (a) single-error injection and (b) one-row error injection using the KULFI fault injection tool.
For all the reported experiments, every method detected all errors that occurred (i.e., the detection rate was 1.0 for all approaches). Therefore, the reported results compare the different methods with respect to execution time. To present a detailed analysis of the execution time performance of all methods, four subcases are shown: (i) only the cost of the GEMM calls (i.e., excluding the cost of packing, unpacking, checksum generation and error checking), where the proposed approach is computed using double-precision GEMM (dGEMM) and the other approaches use the sGEMM routine; (ii) the complete processing chain when no SDCs occur; (iii) the complete processing chain, including error checking and correction, when one SDC occurs; (iv) the complete processing chain, including the detection and correction of one row of SDCs injected using the KULFI tool.
Theoretically, the execution time for GEMM computation (without error tolerance) is expected to be equal for the conventional sGEMM and the two quarter-size dGEMMs performed by the proposed approach (cf. proof of Section 2.3.2). In practice, due to the internal kernel optimizations of the ATLAS MKL for different matrix inner blocks, the results of Figure 2.8(a) show both positive and negative GEMM execution time overhead for the proposed approach, which subsequently affects the overall overhead for error tolerance for the presented matrix sizes. Overall, the results show that, against the fault-intolerant (conventional) GEMM design, the proposed approach incurs execution time overhead between −24.70% and 43.20% when no SDCs occur, and between 2.41% and 49.94% when mitigating up to N SDCs in an N × N GEMM output. On average, the proposed scheme incurs 12.05% to 21.21% overhead for tolerating up to N SDCs in GEMM for the presented matrix sizes. Similarly, the plots show that ABFT incurs an average execution time overhead of 12.64% to 120.34% for the same level of fault tolerance. Overall, the obtained results show that the proposed method incurs overhead comparable to that of ABFT when no errors occur, and that this overhead is approximately 18.04% and 46.37% less than that of mABFT and DMR-based GEMM, respectively.
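The overhead percentages quoted above can be negative, which corresponds to a speedup over the conventional GEMM. The following minimal sketch shows one way such a figure could be computed from averaged timings. It assumes the overhead is defined as the relative difference between the mean execution time of a method and that of the conventional baseline; this definition, the function name `overhead_percent` and the timing values are illustrative assumptions, not taken from the experiments.

```python
from statistics import mean


def overhead_percent(method_times, conventional_times):
    """Percentage overhead of a method relative to the conventional baseline.

    Assumed definition (not confirmed by the source):
    100 * (mean_method - mean_conventional) / mean_conventional.
    A negative value means the method ran faster than the baseline.
    """
    t_method = mean(method_times)
    t_conventional = mean(conventional_times)
    return 100.0 * (t_method - t_conventional) / t_conventional


# Hypothetical timings in seconds, chosen only for illustration.
slower = overhead_percent([1.5, 1.5], [1.0, 1.0])  # method slower than baseline
faster = overhead_percent([0.5, 1.0], [1.0, 1.0])  # method faster than baseline
print(slower, faster)
```

Under this assumed definition, a method that takes 1.5 s on average against a 1.0 s baseline has a positive overhead, while one that averages 0.75 s has a negative overhead.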
We note that the GEMM execution time is governed by the optimizations offered by the utilized ATLAS MKL for different matrix sizes and data types. For example, while experimenting with smaller subblock sizes, we observed both positive and negative execution time overhead (the latter being a speedup) for the two quarter-size 64-bit GEMMs computed by the proposed method in comparison to the conventional fault-intolerant 32-bit GEMM. Such behavior is well documented in the experimental results reported with such libraries; see, e.g., the experiments with small GEMM sizes reported in [8, 36, 97]. In terms of the choice of subblock sizes used for our experiments, the proof of Section 2.3.2 illustrates that low-cost SDC correction techniques in matrix products (beyond the brute-force method of modular redundancy) become very valuable as the matrix size increases, since, for small matrix sizes, SDCs can be efficiently mitigated by recomputation. From the results of Figs. 2.8(b) and 2.9 (cf. Table I of [131] for quantitative values), we see that, for the proposed method, mABFT and DMR, the ratio of pre- and post-processing to the actual GEMM computation decreases significantly as the matrix size increases. Thus, we focus our application section on a multimedia retrieval system with a requirement for large integer GEMM computations, as discussed in
8×8 blocks), the use of GEMM is not justified there because typical implementations compute the results directly without requiring a high-performance library, while also exploiting the symmetries that tend to exist in block transform matrices of that size (e.g., the Hadamard transform or the DCTs used in video coding). Therefore, such small block sizes are outside the scope of the matrix products considered in our work.
The results presented in this section are in line with the theoretical predictions of Propositions 2.1 and 2.2 of Section 2.3.2, with the “best-case” single-SDC scenario and the “worst-case” one-row SDC scenario shown in Figures 2.5(a) and 2.5(b), respectively. As expected, the percentage overhead of all methods tends to decrease with increased matrix size (with some fluctuation for small subblock sizes due to internal kernel optimizations of the utilized ATLAS library).
Regarding the execution time overhead when correcting SDCs, the execution time plot of Figure 2.9(b) and the theoretical prediction of Figure 2.5(b) show that the performance of ABFT can be worse than that of modular redundancy in multiple-SDC scenarios. Specifically, while the proposed algorithm, mABFT and DMR incur very low overhead for the correction of detected SDCs, ABFT requires (on average) more than 50% additional overhead for error correction. This is because, when multiple SDCs are detected in GEMM, ABFT recomputes several rows and columns of the result, or indeed the entire GEMM subblock when ten erroneous rows/columns are detected (“rollback ABFT” [2]). This significant increase in the incurred overhead is also evident in the theoretical analysis of Proposition 2.2. It is also evident that the additional overhead for tolerating multiple SDCs decreases considerably for all approaches, with the exception of ABFT, as the matrix size increases. For example, numerical packing requires 53.01% additional overhead to tolerate multiple SDCs for the 32×32 matrix, while requiring only 9.07% to tolerate the same proportion of SDCs in the 1152×1152 matrix. This property is shared amongst all exact error-location algorithms (like numerical packing and DMR) and is beneficial because the need for low-cost SDC correction techniques is greatest for large matrix sizes, where recomputing the entire GEMM would lead to substantial performance degradation.
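To illustrate why exact error location keeps the correction cost low, the following minimal sketch uses DMR on small integer matrices: two redundant results are compared entry by entry, and only the entries that disagree are recomputed. It is an illustration under stated assumptions, not the implementation evaluated in the experiments; the helper names (`matmul`, `locate_mismatches`, `recompute_entries`), the matrices and the bit-flip pattern are hypothetical. Since two copies alone cannot tell which one is corrupted, the sketch recomputes each disagreeing entry in both copies.

```python
def matmul(a, b):
    """Plain integer matrix product, standing in for a GEMM call."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def locate_mismatches(c_first, c_second):
    """Return the (row, col) positions where the two redundant results differ."""
    return [(i, j)
            for i in range(len(c_first))
            for j in range(len(c_first[0]))
            if c_first[i][j] != c_second[i][j]]


def recompute_entries(a, b, c, positions):
    """Recompute only the listed entries of c as inner products of a and b."""
    for i, j in positions:
        c[i][j] = sum(a[i][k] * b[k][j] for k in range(len(b)))


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c_first = matmul(a, b)
c_second = matmul(a, b)

# Emulate a one-row SDC pattern: flip bit 3 of every entry in row 1 of one copy.
for j in range(3):
    c_second[1][j] ^= 1 << 3

positions = locate_mismatches(c_first, c_second)
recompute_entries(a, b, c_first, positions)
recompute_entries(a, b, c_second, positions)
print(positions)
```

In this sketch, a corrupted row of an N × N result costs N inner products to repair rather than a full N × N recomputation, so the repair becomes a smaller fraction of the total work as N grows.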
In terms of error detection, by injecting IUD bit flips in all the outputs of the two GEMM calls of the proposed approach (in the integer case) under an extensive SDC campaign, we verified experimentally that the
locations of all SDCs were indeed detectable by the proposed approach.
In contrast, as detailed in Section 2.3.1, ABFT can reliably detect and correct only up to a single SDC within each GEMM product. ABFT requires recomputation of entire rows and columns to ensure that no SDCs remain uncorrected, as discussed in the example of Figure 1.3. This limitation is circumvented via the use of mABFT, which, under the utilized settings, can reliably detect the locations of up to 32 SDCs per GEMM subblock, albeit at the cost of substantial execution time overhead.
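The single-SDC limit of ABFT can be illustrated with the classic row and column checksum scheme. The sketch below is a generic textbook-style version on integer matrices, not the exact ABFT variant of Section 2.3.1; the function names and test matrices are hypothetical. A single corrupted entry produces exactly one mismatching row checksum and one mismatching column checksum, which locates and corrects it; any other mismatch pattern is treated as requiring recomputation.

```python
def matmul(a, b):
    """Plain integer matrix product, standing in for a GEMM call."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def abft_check_and_correct(a, b, c):
    """Check c against checksums derived from a and b; fix at most one entry.

    Returns "clean", "corrected" or "recompute".
    """
    n_rows, n_cols = len(c), len(c[0])
    # Reference checksums are computed from the inputs, never from c itself.
    col_sums_a = [[sum(row[k] for row in a) for k in range(len(a[0]))]]  # 1 x K
    row_sums_b = [[sum(row)] for row in b]                               # K x 1
    expected_col = matmul(col_sums_a, b)[0]               # column sums of a*b
    expected_row = [r[0] for r in matmul(a, row_sums_b)]  # row sums of a*b

    bad_rows = [i for i in range(n_rows) if sum(c[i]) != expected_row[i]]
    bad_cols = [j for j in range(n_cols)
                if sum(c[i][j] for i in range(n_rows)) != expected_col[j]]

    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The checksum discrepancy equals the error in the single bad entry.
        c[i][j] += expected_row[i] - sum(c[i])
        return "corrected"
    return "recompute"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

single = matmul(a, b)
single[1][0] ^= 1 << 4          # one corrupted entry
single_status = abft_check_and_correct(a, b, single)

double = matmul(a, b)
double[0][0] += 1               # two corrupted entries in the same row
double[0][1] += 2
double_status = abft_check_and_correct(a, b, double)
print(single_status, double_status)
```

With two corrupted entries in the same row, the checksums flag one row but two columns, so the errors cannot be corrected from the checksums alone and the sketch falls back to recomputation.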
Overall, our theoretical analysis and experimental results demonstrate that our proposal offers very high accuracy and reliability in the detection of the locations of SDCs, while it comes with runtime overhead that is similar to that of ABFT when no SDCs occur.