Within the fault tolerant computing community, the concept of fault coverage has been studied, as well as the importance of coverage to accurate reliability assessment [5]. Coverage is defined as the probability that the computer system can recover, given that a fault occurs. Fault coverage is a measure of the computer system’s ability to perform fault location, fault containment, or fault recovery [2], [6], [7], [8].

This section introduces the concept of coverage as it is understood in the fault tolerant computing community, and relates it to the concept of common-cause failure, as it is understood in the PRA community.

The concept of coverage in fault tolerant computing

Technological advances have reduced costs to the point where fault tolerant computer system designs can include sufficient redundant components to minimize the probability that all have failed. However, the system designer must be certain that faults and errors are detected promptly, so that the redundant units can be used effectively. If a faulty unit is not reconfigured out of the system, it can produce incorrect results that contaminate the non-faulty units. Reliability models of fault tolerant computer systems incorporate coverage factors to reflect the ability of the system to automatically recover from the occurrence of a fault during operation. A coverage factor is the probability that the system can automatically recover from a fault and its associated errors, and thus can continue normal operation, though possibly in a degraded mode.

A fault tolerant computer system may fail to recover from a fault even if spare units remain. For example, a fault may produce an undetected error, and subsequent calculations or operations then work on incorrect data, possibly leading to overall system failure. Even if an error is detected, the system may still be unable to recover, because the fault could “confuse” the automatic recovery procedures into disabling the wrong component. A coverage model is used to help structure our discussion of covered and uncovered faults.

General structure of a coverage model

Figure 8-10 shows the general structure of a coverage model. The entry point to the model is the occurrence of the fault, and the three exits (R, C, and S) are the three possible outcomes. The R or C exit is reached when a fault is covered; the S exit is reached when a fault is uncovered.

[Figure 8-10 diagram: a Coverage Model box entered when a fault, transient or permanent, occurs in a component, with three exits. Exit R, Transient Restoration: covered transient fault does not lead to component failure. Exit C, Permanent Coverage: fault leads to covered failure of component. Exit S, Single Point Failure: fault leads to uncovered failure of component, and hence to system failure.]

Figure 8-10. General Structure for a Coverage Model

Exit R from the coverage model represents transient restoration, the correct recognition of and recovery from a transient fault, for example by masking the error, retrying an instruction, or rolling back to a previous checkpoint. Reaching this exit successfully requires timely detection of an error produced by the fault; performance of an effective recovery procedure; and swift disappearance of the fault (the cause of the error).

Exit C from the coverage model represents permanent coverage, the determination of the permanent nature of the fault, and the successful isolation and (logical) removal of the faulty component.

Exit S from the coverage model represents single point failure, in that a single fault causes the system to fail, generally when an undetected error propagates through the system, or if the faulty unit cannot be isolated and the system cannot be reconfigured.

In a reliability analysis of a fault tolerant computer system, each component in the computer system has an associated set of coverage factors (r, c and s) that represent the probability of reaching the associated exit of the coverage model when a fault occurs in that component. The three exits from the coverage model are mutually exclusive and complete, thus the three probabilities sum to unity. The s probability reflects the extent to which a component fault can cripple the system. The r and c probabilities reflect the extent to which the computer system can automatically recover from a fault. The relative values of r and c reflect the relative proportion of transient and permanent faults expected to occur.
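As a minimal sketch of this bookkeeping, the three coverage factors of a component can be held in a small record that checks they form a probability distribution. The class name CoverageFactors, the covered property, and the numeric values are illustrative assumptions, not part of the handbook.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class CoverageFactors:
    """Hypothetical container for one component's coverage factors."""
    r: float  # probability of exit R (transient restoration)
    c: float  # probability of exit C (permanent coverage)
    s: float  # probability of exit S (single point failure)

    def __post_init__(self):
        for name in ("r", "c", "s"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        # The exits are mutually exclusive and complete, so r + c + s = 1.
        if not math.isclose(self.r + self.c + self.s, 1.0, abs_tol=1e-9):
            raise ValueError("r + c + s must equal 1")

    @property
    def covered(self) -> float:
        """Probability that a fault is covered (exit R or exit C)."""
        return self.r + self.c


# Illustrative values only (assumed, not from the source).
processor = CoverageFactors(r=0.90, c=0.09, s=0.01)
print(processor.covered)
```

Rejecting inconsistent factors at construction time keeps a mistyped value from silently skewing a later reliability calculation.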

Incorporating the coverage probabilities into the fault tree model

To illustrate how the coverage probabilities (r, c, and s) for each component are integrated into the solution of a static fault tree model, first consider a simple computer system. The 3P2M system consists of 3 processors (of which one is needed) and 2 memories (of which one is needed), connected by a bus. The fault tree model for the 3P2M system is shown in Figure 8-11. The k*X notation in a basic event represents k identical components of type X. This shorthand notation is supported by the DFT methodology as a convenience to the analyst. Suppose that we have defined coverage models for the processors and memories. Such coverage models would include an analysis of the proportion of transient to permanent faults, error detection, recovery mechanisms, and automatic reconfiguration. Figure 8-12 shows conceptually how coverage models are added to the fault tree. The pictures inside the boxes are representations of the coverage models for the components.

[Figure 8-11 diagram: top event "3P2M failure" with inputs "3 Processors; need 1" (basic event 3*P), "2 Memories; need 1" (basic event 2*M), and "Bus".]

Figure 8-11. 3P2M Fault Tree

In Figure 8-12, the arrow from a basic event to the coverage model represents the occurrence of a fault. The effects of the fault (transient restoration, covered fault or uncovered fault) are the exits of the coverage model. Transient faults (from which the system can automatically recover) lead back to the basic event, representing a fault that had no permanent consequences. A covered fault leads to the normal fault tree gate.

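[Figure 8-12 diagram: the 3P2M fault tree ("3P2M failure"; "Bus"; "2 Memories; need 1"; "3 Processors; need 1") with coverage models attached to the basic events. Labels on the coverage model paths: "Fault Occurs", "Transient Restoration", "Uncovered fault".]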

The original fault tree model represents the combinations of covered faults. Uncovered faults lead directly to the top OR gate in the tree. For details on how the coverage probabilities are incorporated into the quantitative analysis of the fault tree, see [9].
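The effect of the uncovered-fault path on the 3P2M result can be seen in the simplified sketch below. It is not the algorithm of [9]. It assumes that components are independent, that each processor or memory ends the mission in exactly one of three states (working, covered failure, uncovered failure), that the bus has no coverage model, and that the state probabilities are supplied directly. The function name and all numbers are hypothetical.

```python
def unreliability_3p2m(proc, mem, bus_fail):
    """Probability that the 3P2M system has failed (simplified sketch).

    proc and mem are (p_covered, p_uncovered) pairs: the assumed probability
    that a single processor or memory ends the mission in a covered-failure
    or an uncovered-failure state. bus_fail is the bus failure probability.
    """
    def group_ok(n, p_cov, p_unc):
        # No unit in the group has an uncovered failure, and at least one
        # unit is still working (not all n units are covered failures).
        return (1.0 - p_unc) ** n - p_cov ** n

    p_ok = (1.0 - bus_fail) * group_ok(3, *proc) * group_ok(2, *mem)
    return 1.0 - p_ok


# Illustrative numbers: each unit fails with total probability 0.01.
perfect = unreliability_3p2m(proc=(0.01, 0.0), mem=(0.01, 0.0), bus_fail=1e-4)
imperfect = unreliability_3p2m(
    proc=(0.0099, 0.0001), mem=(0.0099, 0.0001), bus_fail=1e-4
)
print(f"perfect coverage:   {perfect:.3e}")
print(f"imperfect coverage: {imperfect:.3e}")
```

With these made-up inputs, moving 1% of each unit's failure probability onto the uncovered path raises the computed system unreliability, because a single uncovered fault bypasses the redundancy and reaches the top OR gate directly.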

Relationship between coverage and common-cause failures

The concept of coverage and the concept of common-cause failures were developed independently to address similar needs in different analysis communities. Probabilistic analysis of redundant systems can produce wildly optimistic results if the redundant components are considered to fail independently.

Common cause failures (CCFs) are described in Chapter 5, Section 5.2, where the concept of a β-factor is introduced. When modeling CCF, one defines a set of (usually identical) components or subsystems, and associated with this set is a β-factor to represent the fraction of the failure rate of a single component that threatens the other components in the set. Suppose that there are three components of type A, with individual failure rate λA and a β-factor of βA. The rate of a common cause failure for all components of type A is then βAλA. A CCF affects all components in the common cause set of like components.

Imperfect coverage, or an uncovered fault in a computer system, similarly affects more than a single component. An uncovered fault in component A with individual failure rate λA and single-point failure (uncoverage) probability sA affects the entire computer system with rate λAsA.

Thus the notions of common-cause and uncovered failures are similar. The main difference is that CCFs affect components of the same type, and do not necessarily lead to system failure. Uncovered faults in a computer system can (and usually do) affect components of a different type and lead to the failure of the entire computer system (and presumably the system being controlled by the computer). That is, an uncovered failure in a processor could affect all the components connected to the bus, whether they are processors, memories, devices, etc. Further, coverage modeling allows the distinction between transient faults (which are very common in computer systems) and permanent faults.
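Taking the common cause failure rate as the β-factor times the single-component failure rate, and the uncovered failure rate as the component failure rate times its single-point failure probability, both rates have the same form. The short sketch below computes them for placeholder inputs; the function names and the numeric values are assumptions chosen only for illustration.

```python
def ccf_rate(failure_rate, beta):
    """Rate of a common cause failure of all components in the set."""
    return beta * failure_rate


def uncovered_rate(failure_rate, s):
    """Rate at which faults in one component fail the whole system."""
    return failure_rate * s


lambda_a = 1e-4  # assumed failure rate of a type A component, per hour
beta_a = 0.05    # assumed beta-factor for the type A common cause set
s_a = 0.01       # assumed single-point failure probability for component A

print(ccf_rate(lambda_a, beta_a))
print(uncovered_rate(lambda_a, s_a))
```

The arithmetic is identical in the two cases; what differs is the scope of the consequence, the common cause set for a CCF versus the entire computer system for an uncovered fault.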
