To better understand the results, the benchmarks were profiled to determine the percentage of each instruction type. Table 6-4 shows the instruction-type ratios of the tested benchmarks in three modes: unprotected, protected by ACEDR with ITMR, and protected with combined ITMR and TTMR.
Table 6-4 Profiling instruction types before and after protection

Fibo
Instruction type   No Protection   ITMR     Combined mode
load               81.81%          49.38%   44.44%
store              18.18%          50.61%   55.55%

Qsort
Instruction type   No Protection   ITMR     Combined mode
store              24.71%          12.16%   15.76%
load               55.05%          55.55%   67.25%
gep                8.98%           13.75%   5.42%
binop              11.23%          9.52%    11.55%

Roots
Instruction type   No Protection   ITMR     Combined mode
load               40.98%          38.80%   36.47%
store              37.70%          40.12%   40.25%
binop              20.63%          24.07%   23.27%

Math
Instruction type   No Protection   ITMR     Combined mode
load               39.06%          27.89%   39.05%
store              34.37%          36.31%   19.48%
binop              20.31%          21.05%   39.05%
gep                6.25%           14.73%   2.41%
Reliability is inversely proportional to the number of instructions, which is why the blue curves differ from one benchmark to another. For example, the Fibo benchmark has the lowest number of instructions and shows the highest reliability. The other reason for the different reliability curves is the number of memory instructions in the benchmark, which are highly vulnerable to injected errors. Injecting errors into memory instructions corrupts not only the memory but also parts of the CPU, since the data loaded from or stored to memory is corrupted; the errors then propagate and cause further error events. Branch instructions can be corrupted indirectly, by corrupting the compare (cmp) instruction, the preceding load instruction, or the branching memory address. This makes them very susceptible to injected errors. Memory instructions make up a high percentage (73.43%) of the Math benchmark, which also contains a large number of instructions (1160) compared to the rest, making it more vulnerable than the other tested benchmarks.
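The memory-instruction share quoted for the Math benchmark can be recomputed from the No Protection column of Table 6-4. The following is a minimal sketch; the dictionary layout and the function name `memory_share` are illustrative assumptions, not part of the original tooling.

```python
# Unprotected-mode instruction mix copied from Table 6-4
# (percent of profiled instructions per benchmark).
NO_PROTECTION = {
    "Fibo":  {"load": 81.81, "store": 18.18},
    "Qsort": {"load": 55.05, "store": 24.71, "gep": 8.98, "binop": 11.23},
    "Roots": {"load": 40.98, "store": 37.70, "binop": 20.63},
    "Math":  {"load": 39.06, "store": 34.37, "binop": 20.31, "gep": 6.25},
}


def memory_share(profile):
    """Return the combined load + store percentage of one profile."""
    return round(profile.get("load", 0.0) + profile.get("store", 0.0), 2)


if __name__ == "__main__":
    for name, profile in NO_PROTECTION.items():
        print(f"{name}: {memory_share(profile)}% memory instructions")
```

For Math this gives 39.06 + 34.37 = 73.43, matching the figure in the text. The share alone does not rank the benchmarks (Fibo is almost entirely loads and stores), so it has to be read together with the instruction count, as the paragraph above does.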
experiments on a new benchmark. However, a user of this model must have a data set of generic instruction-type sensitivities, obtained from quantitative software injection (or from neutron radiation tests) of other benchmarks. Our work provides starting values for such investigations, as found in Sections 6.2.1 and 6.2.2.
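One way to picture how a data set of instruction-type sensitivities could be combined with a benchmark profile is a share-weighted average. This is only a simplified sketch under stated assumptions: it is not the prediction model of Sections 6.2.1 and 6.2.2, the function name is hypothetical, and the sensitivity values are placeholders rather than measured results.

```python
def predicted_error_rate(profile_pct, sensitivity):
    """Weight each instruction type's sensitivity by its share of the profile.

    profile_pct: instruction type -> percentage of the benchmark (as in Table 6-4)
    sensitivity: instruction type -> assumed error rate per injection (0..1)
    """
    total = sum(profile_pct.values())
    if total <= 0:
        raise ValueError("profile must contain positive percentages")
    return sum(pct / total * sensitivity[kind] for kind, pct in profile_pct.items())


# Placeholder sensitivities for illustration only; NOT values from this work.
HYPOTHETICAL_SENSITIVITY = {"load": 0.30, "store": 0.20, "binop": 0.10, "gep": 0.15}

# Unprotected Qsort profile from Table 6-4.
QSORT_PROFILE = {"store": 24.71, "load": 55.05, "gep": 8.98, "binop": 11.23}

if __name__ == "__main__":
    rate = predicted_error_rate(QSORT_PROFILE, HYPOTHETICAL_SENSITIVITY)
    print(f"Illustrative predicted error rate: {rate:.4f}")
```

The profile is normalized by its own total because the table percentages do not always sum to exactly 100.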
The reliability prediction model is a first research attempt to model the reliability of a whole processing architecture, so its accuracy is affected by multiple parameters. Some of these parameters are hardware or software related, and others are related to the environment. The different levels of cache, the CPU, and the RAM all have different error rates (depending on the orbit and the circuit technology size) and different hit/miss rates (obtained using the Linux perf profiling tool [138]). Another parameter to take into account is the CPU and how many levels of cache it incorporates, which can dramatically affect the reliability prediction model. The error rates obtained from the injection experiments are affected by the benchmark's instruction types, the size of the code, and the operating system (OS). According to [136], enabling the OS makes the system more prone to SEU bit-flips. The current prediction model takes into account the processing architecture configuration and how the processor is connected to its different parts, such as the RAM and the caches. The prediction model also takes the OS into consideration when running TTMR as three parallel threads on three CPUs, which changes the configuration of the CPUs and their connection to the caches and the RAM.
The accuracy of the prediction model can be improved further by including other OS parameters specific to the running benchmark (if it calls OS libraries), such as input/output system calls, exec, fork, etc. As with TTMR, any call to the OS changes the configuration of the platform and how it runs from the perspective of the prediction model, which can affect the reliability prediction.
For our case study, which is predicting the reliability of processing architectures in orbit, one day of accurate predictions is enough for LEO [84], since the satellite completes multiple orbits in that period.
Our research provides a new, fast, and open way to not only estimate but also practically test, using software fault injection, the reliability of COTS processing architectures operating in the space environment. However, to fully understand the architecture and validate our model, a physical radiation test should be conducted. In the future, we hope to compare the reliability obtained from the software injection experiments and from the reliability prediction model with a proton radiation test.
6.3 Summary
Protecting COTS components against SEUs has become a necessity, especially with the scaling of processing architectures, which makes SEUs a prominent problem not only in the space domain but also at sea level, where they have been a source of disturbances for ICs. It is crucial for designers in mainstream, embedded, or mission-critical fields to ensure the reliability of their systems. Systems with redundant hardware, such as TMR and ECC, have been used to mitigate SEUs; however, this approach always increases the complexity of the system and adds more overhead and higher costs. COTS components are the alternative if their reliability is improved using software error detection and recovery techniques. This enables more processing power at lower cost and energy consumption.
A new adaptive framework has been presented in this work, capable of changing the protection mode of the COTS processing architecture depending on the SEU error rate. Adaptiveness enables the system to maintain high reliability without sacrificing performance. We first demonstrated how the adaptive system can be implemented, enabling the running code to switch its state depending on the SEU rate. Three operational modes were selected based on their performance and reliability. The first is the unprotected mode, which has the highest performance but adds no error detection and recovery schemes. The second mode protects the system using ITMR; it offers higher reliability than the unprotected mode with an acceptable overhead. The third mode protects the system with the combined techniques, ITMR and TTMR. This mode reduces the error rate dramatically and provides the highest reliability; however, it adds a benchmark-dependent time overhead to the system. The error rate is defined as the number of injections that caused an error divided by the total number of injections.
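The error-rate definition and the idea of switching among the three modes can be sketched as follows. This is a minimal illustration, not the framework's implementation: the `Mode` enum, the function names, and especially the two threshold values are hypothetical and would have to be chosen per mission and benchmark.

```python
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"   # highest performance, no detection or recovery
    ITMR = "ITMR"                 # higher reliability, acceptable overhead
    COMBINED = "ITMR+TTMR"        # highest reliability, benchmark-dependent time overhead


def error_rate(erroneous_injections, total_injections):
    """Error rate = injections that caused an error / total injections."""
    if total_injections <= 0:
        raise ValueError("total_injections must be positive")
    return erroneous_injections / total_injections


def select_mode(seu_rate, low=1e-6, high=1e-4):
    """Pick a protection mode from the observed SEU rate.

    `low` and `high` are hypothetical thresholds used only for illustration.
    """
    if seu_rate < low:
        return Mode.UNPROTECTED
    if seu_rate < high:
        return Mode.ITMR
    return Mode.COMBINED


if __name__ == "__main__":
    print(error_rate(5, 1000))        # 5 erroneous runs out of 1000 injections
    print(select_mode(1e-5).value)    # falls between the two assumed thresholds
```

Keeping the thresholds as parameters reflects the point made above that the overhead of the combined mode is benchmark dependent, so the switching points are unlikely to be universal.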
The reliability prediction equations demonstrate that the unprotected code is the least reliable mode, that ITMR offers better reliability by reducing the error rate, and that the best reliability is obtained with the third mode of protection. The reliability equations for the different protection techniques have been developed to include multiple parameters related to the processing architecture hardware, such as the number of cores, the pipeline stages, and the hit and miss rates of components. Other parameters are specific to the environment, such as the SEU error rate. The model also takes into account the different instruction types and their different sensitivities to a particular SEU error rate. Including all of these parameters is necessary to improve the accuracy of the prediction equations.
The error injection experiments verify the reliability prediction equations: the error rate was at its highest when no protection was applied. Reliability improved when ITMR was applied to protect the system, with the error rate dropping drastically compared to the unprotected mode. The third protection mode showed the highest reliability, with the error rate dropping to its lowest.
The adaptive solution contributes to the state of the art with the following points:
• Low overhead with the adaptive solution, switching modes depending on the SEU error rates and utilizing redundant independent instructions, taking advantage of the CPU's abundant resources,
• A reliability prediction model for all the processing architecture components,
• High error detection and recovery rate, where the error rate has been reduced to less than 1% in some benchmarks,
• Our protection modes include the protection of multiple data and instruction types (i32, i32*, i1, i8, i8*, i64, float & double, float pointers & double pointers) [12], in addition to both the CPU and the memory Read/Write instruction types,