
diagnosis.

Sometimes bugs elude the validation phases and end up in the market, potentially causing massive financial and reputational damage. The number of escaped bugs is increasing at a high pace: as an example, for Core 2 Duo designs researchers have reported a discovery rate that is 3 times larger than that of the Pentium 4 [43]. Under this scenario, the number of bugs debugged in the validation phases is projected to increase rapidly, as is the speed at which they are discovered once products hit the market (affecting millions of purchases).

These facts call for research advances in novel techniques and tools to improve the post-silicon validation phases, as well as in runtime verification approaches that provide processor lifetime correctness in the presence of undiscovered bugs. According to the ITRS [80, 81, 82], "without major breakthroughs, verification will be a non-scalable, show-stopping barrier for further progress in the semiconductor industry".

1.2 Problem Statement

The challenges described above, brought by the increasing vulnerability of silicon technologies and by the inefficiency of existing post-silicon validation methods, introduce several problems that we address in this thesis. In the following subsections we discuss them, critically analyze the shortcomings of some existing work, and state the high-level research objectives that this thesis pursues to alleviate these problems.

1.2.1 Lifetime Reliability Mechanisms for Multiple Sources of Failures

Reliability trends show that multiple wear-out and permanent sources of failure are emerging as important contributors to microprocessor failure rates, so soft errors are no longer the only reliability concern to be addressed during product lifetime. At the same time, design complexity is causing an increase in design bugs that elude the post-silicon validation phases and impact processor lifetime reliability.

As will be thoroughly analyzed in Chapter 3 (Related Work), most state-of-the-art error detection solutions are designed for a specific error type, or for a few of them. For pure hardware reexecution-based techniques (Section 3.1), permanent faults cannot be targeted by solutions relying on temporal redundancy [64, 93, 143, 157, 162, 183, 196, 205, 207], whereas design bugs cannot be detected by solutions based on spatial (and design) redundancy [63, 127, 182, 197, 198]. Software-implemented redundant execution approaches (Section 3.4) also fail to detect multiple sources of failures


Chapter 1. Introduction

for the same reasons: they either detect only soft errors [138, 158, 160] or cannot comprehensively detect design bugs [34, 211]. Circuit-level techniques (Section 3.3) are limited to soft error mitigation [61] or soft error detection [161, 202], or cannot detect permanent faults or design bugs in a cost-effective manner [47, 203]. On the other hand, built-in self-test circuits [2] cannot detect soft errors. Traditional error coding techniques like parity, ECC or CRC (Section 3.2) can detect soft and hard errors, but they target only data protection [70, 90, 210] and not combinational logic (an important contributor to processor failure rates).
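The coverage gaps described above can be illustrated with a toy model. The following Python sketch is not taken from any of the cited works: the 8-bit adder, the fault parameters and the helper names (`make_adder`, `temporal_check`, `spatial_check`) are all hypothetical and chosen only for illustration. It suggests why re-executing on the same unit misses a permanent fault, and why duplicating an identical design misses a design bug.

```python
def make_adder(stuck_at_zero_bit=None, transient_on_call=None, design_bug=False):
    """Return a toy 8-bit adder with optional injected faults (hypothetical model)."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        result = (a + b) & 0xFF
        if design_bug and a == 0xFF:
            # Design bug: every copy of this design mishandles a == 0xFF.
            result = 0
        if stuck_at_zero_bit is not None:
            # Permanent fault: one output bit is always forced to 0.
            result &= ~(1 << stuck_at_zero_bit) & 0xFF
        if transient_on_call == calls["n"]:
            # Transient fault: bit 0 flips on one specific invocation only.
            result ^= 1
        return result

    return add


def temporal_check(unit, a, b):
    """Execute twice on the same unit; True means a mismatch (error) was detected."""
    return unit(a, b) != unit(a, b)


def spatial_check(unit_a, unit_b, a, b):
    """Execute on two copies of the unit; True means a mismatch (error) was detected."""
    return unit_a(a, b) != unit_b(a, b)


# Temporal redundancy catches a transient fault but not a permanent one.
transient_detected = temporal_check(make_adder(transient_on_call=1), 3, 4)
permanent_detected = temporal_check(make_adder(stuck_at_zero_bit=0), 3, 4)

# Spatial redundancy catches the permanent fault in one copy,
# but not a design bug that both identical copies share.
spatial_permanent = spatial_check(make_adder(stuck_at_zero_bit=0), make_adder(), 3, 4)
spatial_design_bug = spatial_check(make_adder(design_bug=True),
                                   make_adder(design_bug=True), 0xFF, 1)

print(transient_detected, permanent_detected, spatial_permanent, spatial_design_bug)
```

In this model a mechanism covering all four failure sources would need a check that is independent of both the faulty hardware instance and the faulty design, which is the motivation for the unified mechanisms sought in this thesis.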

Therefore, one major goal of this thesis is to:

• Explore and evaluate novel on-line mechanisms for comprehensively detecting multiple sources of failures in modern microprocessor cores during their lifetime (including transient, intermittent and permanent faults, as well as design bugs). We look for unified mechanisms that can deal with all these sources of failure at the same time.

1.2.2 Overheads of Error Detection Solutions

The radical increase in raw error rates will pervade and threaten all commodity market segments. These segments impose challenging requirements on fault tolerance mechanisms that existing ones do not meet. Most error detection mechanisms were devised for high-end segments where extreme reliability levels were targeted, even at the cost of severely degrading overall performance. However, reliability is not a primary design goal in commodity systems, and some amount of fault coverage can be traded off as long as processor performance, power and area are not severely impacted by runtime error detection solutions.

As described in Chapter 3, state-of-the-art error detection solutions are generally not suitable from a performance, power or area perspective when dealing with multiple sources of failures. Reexecution-based techniques covering soft and hard errors [63, 127, 198] suffer extreme power and performance overheads because they redo, at every microarchitectural block, all the state and internal activity that constitute a computation. Reexecution-based techniques that exploit loose synchronization [182] or ineffectual instruction removal [197] to minimize performance slowdowns still incur high power overheads and sacrifice a hardware thread context from another core to execute redundant computations. Advanced solutions exploiting both spatial redundancy and design heterogeneity [10] protect against soft errors, hard errors and design bugs. However, their power overheads and area costs are not affordable.

Software-implemented redundant execution approaches targeting soft and hard errors [34, 211] suffer from the same performance and power problems, even though they require minimal area overheads. Compiler support has also been exploited by hybrid software-hardware solutions to avoid re-execution and to compute the expected microarchitectural activity to be observed during an error-free execution [131, 171, 219]. However, these techniques can only detect failures in the fetch and decode logic, and require extending the processor instruction set. Finally, error coding techniques implemented as self-checking circuits [12, 17, 136, 152] can detect soft errors, hard errors and design bugs with tolerable power and area overheads while causing no slowdown, though they are designed to detect errors only in data and functional units.

Globally, existing solutions based on re-execution cannot strategically protect selected critical blocks or functionalities in a cost-effective and targeted way: they are global all-or-nothing approaches. Furthermore, these solutions do not offer flexibility to processor designers who may prefer modulating error coverage and power, performance and area overheads.

Hence, this thesis also aims at:

• Satisfying the need for efficient reliability solutions with minimal costs in performance, power and area, while at the same time reaching reliability levels similar to those of traditional defect tolerance techniques.

• Exploring alternatives to reexecution-based techniques that can provide a more flexible trade-off between coverage and overheads, and that are also designed to be more modular for targeting specific blocks or functionalities.

1.2.3 Tackling Observability and Reproducibility During Post-Silicon Validation

The increasing design complexity and transistor integration are posing critical problems to error detection, localization and diagnosis during the post-silicon validation phases. Processors are like black boxes where observing internal state or activity is extremely difficult. Common techniques like scan chains [2], hold-scan flip-flops [94] and cycle breakpoints [18] allow high-speed state inspection at a given execution moment. However, these techniques are error-prone and require long, iterative, non-automated trial-and-error processes to hunt down the moment when the fault is exercised (as their use is extremely dependent on the experience of validators). Modern solutions based on on-chip embedded trace buffers [1, 103, 220] can continuously sample the internal state for a given time period by storing traced data into dedicated memory. They are, however, constrained by the capacity of on-chip storage buffers and by the pin I/O bandwidth available to extract their contents. Because on-chip trace buffers have a fixed and limited capacity, these solutions fail to capture the internal activity in common scenarios where errors manifest thousands of cycles after faults are exercised. In case of a failure, the log may have been overwritten with traces that carry no information about the real cause. Furthermore, on-chip trace buffers [1, 103] incur significant area overheads. Hardware features added for post-silicon validation purposes are costly and useless to the user once a product goes into production. Therefore, companies normally rely on scan-based techniques to increase the internal observability.
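The capacity limitation described above can be made concrete with a small sketch. It assumes, purely for illustration, a circular trace buffer that records one entry per cycle and overwrites its oldest entry when full; the function name `trace_survives` and the buffer size of 1024 entries are hypothetical and do not describe any specific design from the cited works.

```python
from collections import deque


def trace_survives(buffer_entries, fault_cycle, error_cycle):
    """Return True if the record from fault_cycle is still buffered at error_cycle.

    Assumes one trace record per cycle and a circular buffer that silently
    drops its oldest record when full (an illustrative simplification).
    """
    buf = deque(maxlen=buffer_entries)
    for cycle in range(error_cycle + 1):
        buf.append(cycle)  # the record is just the cycle number in this toy model
    return fault_cycle in buf


# Error manifests 500 cycles after the fault: the fault record is still present.
short_latency = trace_survives(buffer_entries=1024, fault_cycle=100, error_cycle=600)

# Error manifests 4900 cycles after the fault: the record was overwritten long ago.
long_latency = trace_survives(buffer_entries=1024, fault_cycle=100, error_cycle=5000)

print(short_latency, long_latency)
```

Under these assumptions the buffer is only useful when the gap between fault activation and error manifestation is smaller than the number of entries, which is why larger and more flexible buffering capacity matters for the objectives of this section.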

A major problem found during post-silicon validation is that of non-reproducible errors, which are important contributors to the high cost of current post-silicon approaches [84]. Existing tracing solutions aggravate the reproducibility problem: when attempting to reproduce an error, frequent and time-consuming scan chain and external logic analyzer operations can introduce interference and non-determinism into the normal program timing, potentially hiding the error. Independently of the interference caused by current state acquisition methods, many bugs are non-reproducible by nature because of the unique conditions that are needed for them to manifest (such as temperature, voltage fluctuations, etc.).

To enhance the post-silicon validation phases, in this thesis we also:

• Pursue advancements in system observability through microarchitectural logging technologies that can enable bigger and more flexible buffering capacities, while at the same time having a very low area impact (hardware cost).

• Look for new validation approaches that can extend coverage to non-reproducible errors and that minimally interfere with system performance and operation.

1.2.4 System-Level Simulation for Error Discovery and Diagnosis

The limited internal observability is pushing validation towards methodologies based on tracing errors back to their root cause once an architectural state mismatch is found. Post-silicon validation is principally driven by software tests that are run for a massive number of cycles on real silicon samples. These software tests are generated by specific applications [146], whereas RTL processor models are used to compute the expected error-free architectural results. As a consequence, big server farms are needed to keep pace with the validation flow. The biggest issue of these approaches is that catching errors by means of architectural state mismatches incurs huge detection latencies, which ultimately lead to extremely time-consuming and complex debugging processes to narrow down the time interval when the fault is exercised.
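As a rough illustration of mismatch-based detection and its latency, the sketch below compares a sequence of architectural state checkpoints observed on silicon against the expected values from a reference model. The checkpointing scheme, the names `first_mismatch` and `detection_latency`, and the sample values are assumptions made for this example only; they are not the flow used by any cited tool.

```python
def first_mismatch(golden, observed):
    """Return the index of the first checkpoint where the traces differ, or None."""
    for index, (expected, actual) in enumerate(zip(golden, observed)):
        if expected != actual:
            return index
    return None


def detection_latency(golden, observed, checkpoint_interval, fault_cycle):
    """Cycles elapsed between the fault being exercised and the first mismatch.

    Assumes checkpoint i is taken at the end of cycle (i + 1) * checkpoint_interval.
    Returns None if the traces never diverge.
    """
    index = first_mismatch(golden, observed)
    if index is None:
        return None
    return (index + 1) * checkpoint_interval - fault_cycle


# Hypothetical checkpoint values: the silicon trace diverges at the third checkpoint.
golden_trace = [10, 20, 30, 40]
silicon_trace = [10, 20, 31, 41]

latency = detection_latency(golden_trace, silicon_trace,
                            checkpoint_interval=1000, fault_cycle=1200)
print(latency)  # 1800 cycles between fault activation and detection
```

The validator only learns that something went wrong somewhere before the mismatching checkpoint; the cycle at which the fault was actually exercised (known here only because the example supplies it) must be recovered through further debugging.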

Once a reproducible error is discovered, methods to transfer and synchronize the silicon state to the RTL simulator [178] are used as a means to debug it. The objective is to help validators to understand the wrong system behavior, to reason about the error-free behavior and to locate the fault. System-level simulation of

1.3. Thesis Approach
