CHAPTER II – LITERATURE REVIEW
CHAPTER 4 – RESULTS Introduction
This section presents a novel lockstep approach which is focused on circumvent-ing the problem of the latency in context savcircumvent-ing (checkpointcircumvent-ing) and reloadcircumvent-ing
(rollback) processes. This approach has been named as HW Fast Lockstep. The reduced latency of the HW Fast Lockstep is obtained by introducing hardware changes in the design of the soft-core processor. Following the line of research of this work, the selected soft-core processor to work with has been the PICDiY.
Utilizing this design as a basis, several adaptations have been made to achieve a fast context saving and restoring lockstep approach. Furthermore, a Lockstep Controller and a Backup Memory modules (for the context saving) have been added to the architecture. However, the storing elements must also be hardened to achieve an effective fault tolerance level. For this reason, data and program memories are protected by an ECC encoding and decoding implementation.
Figure 4.22 shows the implemented lockstep system in which two units of the adapted processor work at the same time and share program and data memories.
As it can be observed, this approach implies a medium grained DMR scheme. The Lockstep Controller block manages the behaviour of the system in presence of errors and the synchronization of both processors. The sharing of memories and registers between the processors is performed using two different mixer blocks.
Thus, there is a mixer structure for the program memory that merges the data address and the read enable inputs from each processor. There is also a mixer for the data memory which merges the data address and writes enable inputs and data input / output. The last mixer is used to mix the data off the registers of both processors (FSR, STATUS, PCLATH, INTCON, PORTB and WORK) and its output is connected to the backup memory for a direct context saving.
These mixer blocks are continuously comparing the signals from the processors and they report to the Lockstep Controller when a discrepancy is detected. When that happens, the Lockstep Controller forces the context load of the IDC blocks of both processors, loading the last correct context saved in the backup memory.
Figure 4.22: Simplified block diagram of the HW Fast Lockstep approach.
Several aspects of the original PICDiY design have had to be modified. One
of the most important adaptation has been made in the IDC module. Since it controls most of the elements during the execution of instructions, it has been modified to adapt its behaviour. The FSM of four states that manages the IDC has been adapted by adding an new state to perform the context saving and reloading. Figure 4.23 shows the flow diagram of the adapted FSM. As it can be observed, in the absence of errors the execution follows the original sequence of four states. In the first state, the IDC activates the instruction reading from the program memory. The second state, depending on the loaded instruction, is used to read data or to load the next instruction on the program counter when CALL, GOTO or RETURN instructions are executed. In the third state, if necessary, the IDC enables writing data memory or registers and the fourth state is used for data stabilization, interruption attention management and context saving in the backup memory. Thanks to the adaptation, during the fourth state, each processor sends out a signal to the Lockstep Controller to be used in the synchronization of both processors. When an error is detected the state machine changes to the new context reload state, loading the last correct context saved into both processors. The checkpointing is regularly performed every 4 clock cycles without any extra latency for the processors. In the same way, the rollback only takes a 2 extra clock cycles.
Figure 4.23: Finite state machine diagram of the adapted FSM.
Another significant modification has been made in the processor’s registers to permit data load simultaneously. As shown in Figure 4.24. an extra data input controlled by an extra write enable input has been added to each register. Due to this, when the Lockstep Controller sends the reload command, all the information saved in the backup memory can be reloaded simultaneously in the registers
without any latency.
Figure 4.24: Original and adapted registers.
By reason of their critical importance, a fault strategy has been adopted to protect memory blocks, which as in the original design, are implemented with BRAMs. Due to the extra clock latency addition and the possible errors caused by double-bit faults in unused bits, the Xilinx’s built-in ECC fabric has been discarded. These drawbacks have motivated the implementation of custom ECC generation and checking modules particularly adapted to the designed approach.
Hamming code has been selected for the implementation of these modules, which is a widely utilized, robust and simple code. Considering that the data word width of the designed processor is 8-bit, the ECC code needs at least 5-bit to detect and correct a single error and detect a double error. In the case of the program memory, the instruction word width is 13 bits, so the ECC code needs at least 6 bits. The 8-level stack memory of the processor has been also adapted by using the same ECC strategy. Different ECC checking modules are used to detect and auto-correct single errors in the data, the 8-level stack and the program memories during reading operation, but they don’t correct errors in the memory array. They only present corrected data in the output. Nevertheless, the errors in data and stack memories should be overwritten with the next data store during the normal operation. In addition, a double error detection output is available, which can be used to indicate uncorrectable errors.
In the case of memory elements sensitive to accumulative errors, like data mem-ory positions that are not overwritten during the program execution or program memories, additional strategies have to be adopted. In the case of the program memory, the proposed Data Content Scrubbing Approach can be utilized to per-form periodic scrubbings. The most adequate scrubbing frequency is application and program dependant. An alternative to harden non overwritten data memory positions is to modify the original program by adding extra software instructions in order to perform a overwriting operation of these positions.
On the other hand, the ECC generation modules are used only by the data and 8-level stack memories. As long as the program memory data has to be initialized
with the correct instructions during the synthesis and implementation processes, a ECC generation module would not be useful in the case of this memory. As a result, a visual basic application has been developed to generate the program memory HDL modules. This application uses the .HEX file generated by the assembler with the source code to create the program memory VHDL file with the ECC encoded data.
The implementation of the backup memory for the processor context saving de-mands a different method, because a parallel load of the data of all registers is necessary to achieve a low latency context saving. Taking into account that the Xilinx BRAM structure can only write to a single memory address at the same time, the data has been saved using a register structure. Considering this fact, the ECC alternative has been discarded for the hardening of the backup memory, because it will dramatically increases the overhead. So, the strategy adopted has been the DMR implementation. In this case, this is adequate because the prob-ability of two consecutive SEUs is usually very low [174]. So, if an SEU affects a register it is extremely rare that another SEU will affect the backup memory simultaneously. In the same way, if an SEU affects the backup memory it is not probable that one of the processors will be affected by another SEU at the same time, and the wrong data will be overwritten in the next context saving.
Nevertheless, if this rare situation occurs it will be detected by the comparators of the backup memory.
Thanks to the proposed architecture, a lockstep approach which performs a fast checkpointing and rollback is obtained. Due to this fact, the saving process can be regularly performed every 4 clock cycles without causing any extra latency, nor affecting the processor’s performance. In the same way, a context loading process can be achieved only with 2 extra clock cycles. The main drawback of this approach is that the logic added to the design increases the hardware overhead. For this reason, it also reduces the maximum achievable operating frequency. Finally, it has to be remarked that this design can be implemented in FPGA devices from different vendors and the main ideas of this approach can be adapted to designs based on different soft-core processors.