To hide the latency of runtime monitoring operations, the FlexCore architecture decouples monitoring operations from the main processing core using a Core-Fabric FIFO queue: the main processing core pushes committed instructions into the queue and continues without waiting for the monitoring operations for those instructions to complete. In this decoupled fashion, the main processing core stalls the commit of completed instructions only when the Core-Fabric queue is full, which greatly reduces the performance overhead of runtime monitoring on the FlexCore architecture. However, the decoupling also implies that the co-processor may detect an error many cycles after the error has occurred and modified system state. We call this mode of error detection decoupled checking. In decoupled checking, by the time an error is detected, additional instructions will have completed on the main core, and further damage to the monitored application and system state could have taken place.
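The trade-off of decoupled checking can be illustrated with a small cycle-level sketch. It is not the FlexCore implementation: the function name, the one-commit-per-cycle core, and the fixed per-instruction check latency are all assumptions made for illustration. The model counts how often the core stalls because the queue is full and how many cycles pass between a commit and the completion of its check.

```python
from collections import deque

def simulate_decoupled(num_instructions, check_latency, queue_depth):
    """Toy model of a Core-Fabric FIFO (all parameters are hypothetical).

    Assumes the core tries to commit one instruction per cycle and the
    monitor needs `check_latency` cycles per instruction.
    Returns (stall_cycles, worst_detection_lag_in_cycles).
    """
    queue = deque()        # commit cycle of each instruction awaiting a check
    in_flight = None       # commit cycle of the instruction being checked
    remaining = 0          # cycles left for the in-flight check
    committed = stalls = worst_lag = cycle = 0

    while committed < num_instructions or queue or in_flight is not None:
        # Monitor side: start the next check if idle, then advance it.
        if in_flight is None and queue:
            in_flight = queue.popleft()
            remaining = check_latency
        if in_flight is not None:
            remaining -= 1
            if remaining == 0:
                worst_lag = max(worst_lag, cycle - in_flight)
                in_flight = None
        # Core side: commit if there is room in the queue, otherwise stall.
        if committed < num_instructions:
            if len(queue) < queue_depth:
                queue.append(cycle)
                committed += 1
            else:
                stalls += 1
        cycle += 1
    return stalls, worst_lag

if __name__ == "__main__":
    for depth in (1, 4):
        stalls, lag = simulate_decoupled(4, check_latency=2, queue_depth=depth)
        print(f"depth={depth}: stalls={stalls}, worst detection lag={lag}")
```

Under these assumptions, a deeper queue removes stalls for a short burst but lengthens the window between an instruction's commit and the detection of an error in it, which is the cost of decoupled checking described above.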
Runtime monitoring functions typically perform two types of operations: bookkeeping, which updates the monitor's metadata to reflect the state of the monitored application; and checking, which uses application data and metadata to ensure that an error did not occur. For many runtime monitoring approaches, the bookkeeping operations can be performed in a decoupled fashion. Similarly, for runtime monitoring functions where damage from errors can be contained within the memory space of the monitored application and the application exhibits fail-stop behavior, the bookkeeping and checking operations can both be decoupled. However, for the remaining classes of applications and runtime monitoring functions, where errors could result in damage that is difficult to undo or recover from, a mechanism is needed to ensure that critical instructions are checked before they are allowed to update system state. We call this mode of error detection precise checking. In the remainder of this subsection, we discuss alternatives for implementing precise checking on different types of processing cores.
For security-critical instructions and system calls where the monitored application may make irreversible changes to system state, these instructions can be forwarded to the runtime monitoring function but delayed from commit until the checks are complete. For modern out-of-order processors, such delays simply mean that instructions stay in the ROB (Re-Order Buffer) longer without necessarily stalling the execution of subsequent instructions. This simple and straightforward approach to precise checking will have a negligible effect on the performance of the monitored application when the frequency of critical instructions and system calls is low. However, if the frequency is high, alternative mechanisms may be needed to mitigate the performance impact of such checks. We discuss options for precise checking on out-of-order and in-order processing cores in the subsequent paragraphs.
High-performance processing cores that use a ROB to reorder instructions and enable speculation buffer instructions in the ROB until the instructions are guaranteed to complete and are no longer speculative. For such architectures, we can simply extend the speculation support to cover instructions that have yet to be checked by the co-processor. Hence, instructions executed by the monitored application can complete and be readied for commit, but they do not write to architectural state until the co-processor completes its checks.
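The commit rule described above can be sketched as follows. This is a minimal illustration, not a model of any real ROB: the `RobEntry` fields and the `retire_ready` helper are hypothetical names, and the sketch assumes that only instructions flagged as needing a check wait for the co-processor.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    name: str
    executed: bool = False      # result is ready
    needs_check: bool = False   # assumption: only flagged instructions wait
    checked: bool = False       # set when the co-processor reports a pass

def retire_ready(rob):
    """Retire entries in program order from the head of the ROB.

    An entry retires only if it has executed and, when it needs a check,
    the co-processor has approved it. Younger entries may already be
    executed, but they wait behind an unchecked head entry.
    """
    retired = []
    while rob and rob[0].executed and (rob[0].checked or not rob[0].needs_check):
        retired.append(rob.popleft().name)
    return retired

if __name__ == "__main__":
    rob = deque([
        RobEntry("add", executed=True),
        RobEntry("syscall", executed=True, needs_check=True),
        RobEntry("mul", executed=True),   # executed out of order, not yet retired
    ])
    print(retire_ready(rob))   # only the entries ahead of the unchecked syscall
    rob[0].checked = True      # co-processor finishes its check
    print(retire_ready(rob))   # the syscall and the entries waiting behind it
```

The point of the sketch is that the unchecked instruction delays retirement, not execution: the younger `mul` has already produced its result while the check is pending.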
[Figure 3.3 block labels: Fetch Engine, Pipeline, UIL, Lifeguard, Register File, UML, Data Memory, Backup Register File, Restore, Commit]
Figure 3.3: Block diagram of processing core with precise exception support.
For in-order processing cores that eschew speculation for area and energy efficiency, additional cost-effective hardware mechanisms are needed to allow for precise checking. Figure 3.3 shows the block diagram of one possible way to implement precise checking on such architectures. The main idea behind the mechanism in the figure is to "buffer" completed instruction results so that the monitored application can continue to make forward progress, and to allow each "buffered" instruction to modify architectural state only when it does not result in an error. Intuitively, the approach can mitigate the overheads of checking when monitored applications exhibit bursty instruction execution patterns: a burst of commits can be "buffered" and then checked, and the buffers can drain while the pipeline is idle and waiting for long-latency operations, such as memory loads, to return.
Rather than make extensive modifications to the in-order processing core, we leave the majority of the core unchanged. All instructions executed on the main processing core are allowed to update the register file in the main core pipeline immediately upon commit so that subsequent instructions can use the results. However, memory writes are buffered in case a check fails, and a copy of the architectural state is preserved so that it can be restored when an error is detected. To this end, we extend the architecture with several "buffers" for completed instructions and their register and memory updates.
The Unchecked Memory List (UML) buffers and bypasses speculative memory updates (stores); each UML entry holds the value and the address of an unchecked memory write. All loads on the main processing core are extended to check the UML for a matching entry before reading from the memory hierarchy. A Back-up Register File (BRF), which has the same number of entries as the main (speculative) register file, is added and is written only by instructions that have been checked. The Unchecked Instruction List (UIL) is a FIFO that holds the speculative instructions that have completed in the main core but have not yet been checked by the co-processor. The main core enqueues an instruction into the UIL after completing it and forwarding it to the co-processor. Each entry in the UIL holds a pointer to a UML entry, a destination register, and the register write result.
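The UML lookup performed by loads can be sketched as below. The class and method names are hypothetical, memory is modeled as a plain dictionary, and the sketch assumes that when several unchecked stores target the same address, the youngest one supplies the value; the text does not spell out that policy.

```python
class UncheckedMemoryList:
    """Toy model of the UML: unchecked stores kept oldest first."""

    def __init__(self):
        self.entries = []  # list of (address, value) pairs

    def store(self, address, value):
        """Buffer a store instead of writing memory; return its UML index."""
        self.entries.append((address, value))
        return len(self.entries) - 1

    def load(self, address, memory):
        """Check the UML first, then fall back to the memory hierarchy."""
        # Assumption: the youngest matching store wins.
        for entry_address, value in reversed(self.entries):
            if entry_address == address:
                return value
        return memory.get(address, 0)

if __name__ == "__main__":
    memory = {0x10: 1}
    uml = UncheckedMemoryList()
    uml.store(0x10, 7)
    print(uml.load(0x10, memory))  # value forwarded from the UML
    print(memory[0x10])            # memory itself is still unmodified
```

The index returned by `store` stands in for the UML pointer that a UIL entry would hold.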
As the co-processor completes the checking of each instruction, if the check passes, the next UIL entry is dequeued and its register and/or memory write is allowed to proceed to the BRF and/or memory. However, if the check fails, the main core restores its state to that before the failing instruction using the additional "buffers". The register file is restored by copying values from the BRF. Next, the core flushes all entries in the UIL, the UML, and the pipeline, sets the appropriate status registers to indicate that a failed check occurred, and inserts a control transfer instruction to the base address of the exception handler into the processor pipeline.
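The pass and fail paths can be put together in a small behavioral sketch. It is an illustration under stated assumptions, not the hardware design: the `PreciseCheckCore` class and its methods are hypothetical, memory is a dictionary, the UIL entry records only whether the instruction has a store (relying on FIFO order instead of an explicit UML pointer), and the exception-handler jump is reduced to a status flag.

```python
from collections import deque

class PreciseCheckCore:
    """Toy model of the UIL / UML / BRF mechanism (names are hypothetical)."""

    def __init__(self, num_regs=4):
        self.regs = [0] * num_regs   # main (speculative) register file
        self.brf = [0] * num_regs    # back-up register file: checked state only
        self.memory = {}             # checked memory state
        self.uml = deque()           # unchecked stores as (address, value)
        self.uil = deque()           # unchecked instrs: (dest_reg, value, has_store)
        self.check_failed = False    # stands in for the status register

    def execute(self, dest_reg=None, value=None, store=None):
        """Complete one instruction on the main core and enqueue it."""
        if dest_reg is not None:
            self.regs[dest_reg] = value   # visible to later instructions at once
        if store is not None:
            self.uml.append(store)        # memory write is buffered, not applied
        self.uil.append((dest_reg, value, store is not None))

    def check_done(self, passed):
        """Handle the co-processor's verdict for the oldest unchecked instruction."""
        if passed:
            dest_reg, value, has_store = self.uil.popleft()
            if dest_reg is not None:
                self.brf[dest_reg] = value
            if has_store:
                address, stored = self.uml.popleft()
                self.memory[address] = stored
        else:
            self.regs = list(self.brf)    # restore registers from the BRF
            self.uil.clear()              # flush unchecked instructions
            self.uml.clear()              # discard unchecked stores
            self.check_failed = True

if __name__ == "__main__":
    core = PreciseCheckCore()
    core.execute(dest_reg=1, value=5)
    core.execute(store=(0x40, 5))
    core.execute(dest_reg=2, value=9)
    core.check_done(True)    # first instruction passes
    core.check_done(False)   # the store fails its check
    print(core.regs, core.memory, core.check_failed)
```

In this run the failing store never reaches memory, and the register write of the younger instruction is undone because the register file is reloaded from the BRF.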