The main contributions of this dissertation span two different areas: targeted lightweight runtime error detection and cost-effective post-silicon fault localization and diagnosis. The key results related to runtime fault detection are as follows:
1. We first study runtime validation of the register dataflow logic in depth. We propose a novel runtime technique to detect errors in the register dataflow logic. The solution introduces a novel concept called signature checking, which detects errors by attaching a token to each produced register value and by matching consumed signatures against source signatures. We show through fault injection campaigns that the rename tables, wake-up logic, select logic, bypass control, operand read and write-back, register free list, register release, register allocation, and load replay logic are protected with high coverage. The approach is shown to be very effective in detecting faults, and it allows designers to choose the coverage ratio by adjusting the signature size.
We also propose nine different signature allocation policies with different area and power requirements. We show that in-flight signature distribution can be controlled to increase coverage for different register dataflow failure scenarios.
2. We introduce a new microarchitecture that combines register dataflow checking and register value checking. In particular, we show how to improve our register dataflow checking technique by integrating it with an end-to-end residue checking scheme. Our evaluations show that a significant amount of power and area
can be amortized by combining both solutions, while at the same time protection is extended to the functional units, load-store queue data and addresses, bypass values and register file values.
3. We then study efficient runtime validation of the control flow logic. Even though a myriad of targeted solutions exist to detect faults in instruction sequencing (fetch, decode and allocate logic), none of them can check the complex logic involved in implementing efficient control flow recovery. We propose two techniques to validate the rename state recovery and the squashing functionalities of high-performance out-of-order cores. The proposal uses end-to-end rename state signature checking and tracking of squashed regions to detect faults in the ROB, the rename state recovery logic, the checkpoint rename tables, and the instruction squashing mechanism. Our evaluations demonstrate the effectiveness of our approach: very high failure reduction rates are achieved with minor power and area overheads.
4. Finally, we target the runtime validation of the memory dataflow logic implemented by the load-store queue. Our proposed solution (MOVT) relies on a tiny cache-like structure that keeps the last producer IDs for tracked addresses. At commit time, each load is checked to confirm that it obtained its data from the youngest older producing store. We show that, by exploiting the fact that most forwarding store-load pairs are close to each other, coverage can be increased for small set-associative MOVTs by conservatively flushing the pipeline and restarting execution under some scenarios. Three different implementations of the technique with different trade-offs are proposed and evaluated. The solution presents very high fault coverage with attractive area and performance overheads. Moreover, MOVT can be used to close the vulnerability hole inherent to redundant multi-threading designs, where the load-store queue activity is not replicated across threads.
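As an illustration only, the commit-time check described for MOVT can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the evaluated hardware design: the class name `ToyMOVT`, the modulo set indexing, the LRU replacement, and the three result strings are all hypothetical choices made for the example. Every load is assumed to report the ID of the store it believes produced its data.

```python
from collections import OrderedDict


class ToyMOVT:
    """Toy software model of a small set-associative producer-tracking table.

    Hypothetical layout: each set maps address -> ID of the last committed
    store to that address, kept in LRU order.
    """

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, addr):
        # Assumed indexing function: simple modulo on the address.
        return self.sets[addr % self.num_sets]

    def commit_store(self, addr, store_id):
        """Record store_id as the youngest committed producer for addr."""
        entries = self._set_for(addr)
        entries.pop(addr, None)
        entries[addr] = store_id           # most recently used goes last
        if len(entries) > self.ways:
            entries.popitem(last=False)    # evict the least recently used

    def commit_load(self, addr, producer_id):
        """Check a committing load against the tracked producer.

        Returns "ok", "mismatch", or "unknown" (address no longer tracked).
        """
        entries = self._set_for(addr)
        if addr not in entries:
            return "unknown"
        entries.move_to_end(addr)
        return "ok" if entries[addr] == producer_id else "mismatch"


movt = ToyMOVT(num_sets=1, ways=2)
movt.commit_store(0x10, store_id=1)
movt.commit_store(0x20, store_id=2)
print(movt.commit_load(0x10, producer_id=1))  # load forwarded from store 1
print(movt.commit_load(0x20, producer_id=1))  # load forwarded from a stale store
```

In this sketch, an `"unknown"` result marks the case where a small table cannot verify the load. A caller could treat it conservatively, for example by flushing and re-executing, which is one possible reading of the conservative scenarios mentioned above.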
The key results related to cost-effective fault localization and automated diagnosis are the following:
5. Existing tracing solutions are constrained by the capacity and area of on-chip logs. We propose a new software-hardware logging system that increases internal observability and alleviates these issues. First, we show that by sequestering physical memory pages from the application being run and repurposing them to store activity logs, we can increase observability by means of logs that can be sized to suit validation needs, without requiring large hardware structures. We then propose a hardware structure that temporarily buffers internal activity
at full speed and connects with the data cache to access the log pages. We study its efficiency and show that by offloading the buffer during idle cache cycles and by letting the application allocate lines as needed, performance is not critically impacted.
6. We show how to combine our error detection mechanisms with the described logging system to construct a novel post-silicon validation methodology. As a practical example, we focus in particular on the memory dataflow logic implemented by the load-store queue. By using our runtime bug-detection mechanisms together with the proposed non-intrusive logging system, we eliminate the simulation steps required to generate golden outputs for test programs, and we extend coverage to non-reproducible errors without any intervention to orchestrate the activity logging.
7. Current debugging practices are manual and cumbersome. We present a diagnosis algorithm that analyzes the log produced by our validation system and automatically localizes and diagnoses errors in the load-store queue. The algorithm determines not only the fault location, but also the incorrect behavior observed and the expected failure-free behavior. We evaluate its efficiency and show that a very high percentage of errors can be automatically diagnosed for different precision levels.
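To make the idea of log-based diagnosis concrete, the following sketch replays a memory-operation log offline and reports, for each faulty load, both the observed producer and the expected one. It is a simplified illustration under stated assumptions and not the diagnosis algorithm of this dissertation: the `LogEntry` format, its field names, and the `diagnose` function are hypothetical, and the only rule applied is that a load should obtain its data from the youngest older store to the same address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record; the real log format is not specified here."""
    seq: int                        # program-order sequence number
    kind: str                       # "store" or "load"
    addr: int
    producer: Optional[int] = None  # loads only: seq of the observed producing
                                    # store, or None if data came from memory


def diagnose(log):
    """Replay the log in program order and report mismatching loads."""
    last_store = {}                 # addr -> seq of youngest older store
    reports = []
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.kind == "store":
            last_store[entry.addr] = entry.seq
        elif entry.kind == "load":
            expected = last_store.get(entry.addr)
            if entry.producer != expected:
                reports.append({
                    "load": entry.seq,
                    "addr": entry.addr,
                    "observed": entry.producer,  # the wrong behavior
                    "expected": expected,        # the failure-free behavior
                })
    return reports


example_log = [
    LogEntry(seq=1, kind="store", addr=0x40),
    LogEntry(seq=2, kind="store", addr=0x40),
    LogEntry(seq=3, kind="load", addr=0x40, producer=1),  # stale forwarding
    LogEntry(seq=4, kind="load", addr=0x80, producer=None),
]
for report in diagnose(example_log):
    print(report)
```

Unlike a bounded hardware table, this replay keeps every address in an unbounded dictionary, which is acceptable offline because the log, not on-chip storage, holds the full history.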