The symptom-directed diagnosis fault isolation tools use a knowledge base of fault isolation rules to determine how to analyze the data inside the error log entry. The fault isolation rules were designed by reliability engineering experts who understand the behavior of the machine when it fails.
There are two basic types of fault isolation rules: single event and multiple event. Single-event rules are used for analyzing single error events (i.e., one error log entry). Multiple-event rules are used for analyzing multiple error events that occur over a specified interval of time.
Single Event Fault Isolation Rules
There are several categories of single-event fault isolation rules. These rules are derived from the on-line error detection designed into the VAX 9000 system.
Primary Syndrome Fault Isolation Rules Primary syndromes are the error latches that detect and report error events. Each error latch stores the result of an on-line error detector. Each error detector covers a section of logic in the system. By mapping this logic to the physical partition (i.e., field replaceable units), the values of set error latches can be used as a first-pass fault isolation. In many instances, this analysis alone is sufficient to determine the faulty field replaceable unit.
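The first-pass isolation described above can be sketched as a table lookup. This is only an illustration: the latch names and the latch-to-FRU map below are hypothetical and are not taken from the VAX 9000 rule base.

```python
# Hypothetical map from error latch to the field replaceable units (FRUs)
# that hold the logic covered by that latch's detector.
LATCH_TO_FRUS = {
    "PARITY_ERROR_A": {"FRU 1"},
    "PARITY_ERROR_B": {"FRU 2", "FRU 3"},
}


def primary_callout(set_latches):
    """Return the sorted union of FRUs covered by every set error latch."""
    callout = set()
    for latch in set_latches:
        # Unknown latches contribute nothing in this sketch.
        callout |= LATCH_TO_FRUS.get(latch, set())
    return sorted(callout)
```

When a single latch maps to a single FRU, as with the assumed `PARITY_ERROR_A`, the primary syndrome alone identifies the unit to replace.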
Secondary Syndrome Fault Isolation Rules In some instances, the fault isolation provided by the primary syndromes may not localize the fault sufficiently. For example, if the primary syndrome field replaceable unit callout results in more than one field replaceable unit having a significant possibility of failure, then secondary syndromes must be used to reduce the callout. Secondary syndromes are key machine states, other than error latches, that are stored in the error log entry. Examples of secondary syndromes include multiplexer select lines, memory address values, and other path-sensitive control signals. These signal states are used to determine the specific path that was sensitized when an error occurred. The nonsensitized path(s) can then be removed from the callout. An example of how secondary syndromes are used for fault isolation is shown in Figure 4.
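A minimal sketch of secondary syndrome refinement follows. The layout is an assumption loosely modeled on Figure 4: FRU 1 and FRU 2 each drive one multiplexer input, and FRU 3 holds the multiplexer, the parity checker, and the error latch. The function and constant names are hypothetical.

```python
# Assumed layout: which FRU drives each MUX input, and which FRU holds
# the MUX, the parity checker, and the error latch.
MUX_INPUT_FRU = {0: "FRU 1", 1: "FRU 2"}
CHECKER_FRU = "FRU 3"


def refine_callout(parity_error, mux_select=None):
    """Drop the nonsensitized MUX input path from the callout.

    mux_select is None when the select state was not captured.
    """
    if not parity_error:
        return []
    if mux_select is None:
        # Without the secondary syndrome, every path stays in the callout.
        return sorted(set(MUX_INPUT_FRU.values()) | {CHECKER_FRU})
    return sorted({MUX_INPUT_FRU[mux_select], CHECKER_FRU})
```

With the select line known, the callout shrinks from three units to two, because the input path that was not selected could not have delivered the bad data.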
Fault Propagation Rules Sometimes a single error event can trigger multiple error detectors because of fault propagation or domain intersection.
Figure 4 Secondary Syndrome Example: MUX Select Used for Fault Isolation Refinement
Hierarchical Fault Detection and Isolation Strategy for the VAX 9000 System
Fault propagation occurs when a fault in a given error domain (i.e., the propagation source) propagates into other error domains (i.e., the propagation destinations). To identify the real source of the error, the possible fault propagation paths must be found and the precedence of the error detectors in each propagation path must be identified. When multiple error latches are set, the propagation rules can then be applied to eliminate all propagation destinations for each propagation source in the callout. An example of fault propagation is shown in Figure 5.
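One way to picture the propagation rules is as a graph from each error domain to the domains it can propagate into. The sketch below assumes such a graph is available and acyclic; the graph contents and the function name are hypothetical.

```python
# Hypothetical propagation graph: a fault detected in the key domain can
# propagate into each of the listed destination domains.
PROPAGATES_TO = {"PARITY_ERROR_1": {"PARITY_ERROR_2"}}


def propagation_sources(set_latches):
    """Keep only set latches that are not destinations of another set latch."""
    latches = set(set_latches)
    destinations = set()
    for source in latches:
        # Only destinations that actually fired are candidates for removal.
        destinations |= PROPAGATES_TO.get(source, set()) & latches
    return sorted(latches - destinations)
```

If both assumed latches are set, only `PARITY_ERROR_1` survives, so the callout can be limited to the units in the source domain.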
Domain Intersection Rules Domain intersection results when two or more error detectors cover a common piece of logic. This information is used to refine the callout when multiple error latches are set in the VAX 9000 system as shown in Figure 6.
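Domain intersection can be sketched as a set intersection over the FRUs covered by each set latch. The domains below are assumptions chosen for illustration, and the fallback to the union when the domains share nothing is a design choice of this sketch, not something the text specifies.

```python
# Hypothetical error domains: the FRUs whose logic each detector covers.
DOMAIN_FRUS = {
    "PARITY_ERROR_1": {"FRU 1", "FRU 2"},
    "PARITY_ERROR_2": {"FRU 1", "FRU 3"},
}


def intersect_callout(set_latches):
    """Call out the FRUs common to the domains of all set latches."""
    domains = [DOMAIN_FRUS[latch] for latch in set_latches]
    if not domains:
        return []
    common = set.intersection(*domains)
    # If the domains are disjoint, fall back to the full union.
    return sorted(common or set.union(*domains))
```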
Multiple Event Fault Isolation Rules Multiple-event rules attempt to correlate separate error events to find a common problem. This type of analysis is beneficial when an intermittent or transient problem is not diagnosed sufficiently by single-event symptom-directed diagnosis rules.
For example, if a logic fault were analyzed with single-event, symptom-directed diagnosis rules, an intermittent logic fault could be concluded as having occurred. Such an analysis would result in a callout of the faulty field replaceable unit. However, multiple-event rules include checking for certain environmental deviations in close proximity to a logic fault. In this case, multiple-event analysis would attempt to correlate the logic fault with the environmental deviations to determine if the fault is transient in nature. If this were the case, a callout would not be required.
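The environmental correlation described above can be sketched as a time-window check. The event labels, the `Event` type, and the 60-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float  # seconds since an arbitrary epoch
    kind: str    # "logic" or "environmental" (hypothetical labels)


def is_transient(logic_event, events, window=60.0):
    """True if an environmental deviation lies within `window` seconds
    of the logic fault, in which case no callout would be needed."""
    return any(
        e.kind == "environmental" and abs(e.time - logic_event.time) <= window
        for e in events
    )
```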
Figure 5 Fault Propagation Example (callout without propagation information: FRU 1, FRU 2; callout with propagation information: FRU 1)
Figure 6 Domain Intersection Example
Multiple-event rules can also be used to enforce the callout refinement provided by secondary syndromes, fault propagation, and domain intersection. For example, in a VAX 9000 system that repeatedly generates identical or similar error log entries, multiple-event analysis can correlate these entries to a single intermittent fault. It can indicate which secondary syndrome path is most likely to be sensitized and which error domain is most likely to detect the error first. In this case, multiple-event analysis can view these events as a single problem rather than seeing each error log entry in isolation.
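Correlating repeated entries can be sketched by counting identical signatures. The signature format used here, a pair of set latches and secondary syndrome state, and the repeat threshold are assumptions of this sketch.

```python
from collections import Counter


def most_likely_signature(entries, min_repeats=2):
    """Return the most frequent signature, or None if nothing repeats.

    entries: iterable of hashable (frozenset_of_latches, secondary_state)
    tuples, one per error log entry (a hypothetical format).
    """
    counts = Counter(entries)
    if not counts:
        return None
    signature, repeats = counts.most_common(1)[0]
    return signature if repeats >= min_repeats else None
```

A signature that recurs across entries suggests one intermittent fault, and its secondary state points to the path most likely sensitized.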
CAD Tools and Processes
To ensure that the VAX 9000 symptom-directed diagnosis fault coverage and isolation goals were achieved, CAD tools were needed to measure the quality of the on-line error detection in the design. Tools also were needed to help develop symptom-directed diagnosis fault isolation rules and to facilitate the conversion of these rules into a format that could be used by the fault isolation software.
Some of the significant symptom-directed diagnosis CAD tools that were developed and used for the VAX 9000 system are discussed below.
Hardware Isolation Domain Evaluator The hardware isolation domain evaluator (HIDE) CAD tool was developed to provide symptom-directed diagnosis fault coverage and isolation information to the VAX 9000 logic designers. HIDE also can generate simple symptom-directed diagnosis fault isolation rules for use in the system fault isolation matrices.
One of the goals for HIDE was to provide early feedback to logic designers on the quality of on-line error detection in designs. Early feedback gave designers time to make design changes if coverage or isolation goals were not achieved. Further, the information provided by HIDE helped designers select locations for error detectors and gave designers quick feedback on the implications of detector placement and design changes.
Symptom Diagnosis Information Language
The symptom-directed diagnosis fault isolation rules for the VAX 9000 system were coded into a set of system fault isolation matrix files, called symptom diagnosis information files. Symptom diagnosis information is a language that is designed to express both single-event and multiple-event, symptom-directed diagnosis fault isolation rules in an objective and consistent manner.
In earlier VAX systems, new fault isolation tools were needed for each new computer system. In the VAX 9000 system, the symptom diagnosis information language provides a general-purpose means to specify symptom-directed diagnosis fault isolation rules. The files are used as the rule base for the symptom-directed diagnosis fault isolation tools, which means that the tools can be used for future computer system designs.
On-line Fault Isolation Software
The VAX 9000 system contains on-line symptom-directed diagnosis software that automatically diagnoses faults as they occur. The software produces an isolation callout of the possible faulty field replaceable units that is automatically received by Digital customer service centers through a symptom-directed diagnosis reporting process. This process is designed to minimize the repair time for VAX 9000 systems. It automatically notifies Digital of problems and provides a repair plan to Customer Services before personnel are sent to the customer's site.
Service Processor Diagnostic
The VAX 9000 service processor unit contains a symptom-directed diagnosis fault isolation process that performs single-event analysis. This process runs in the background waiting for error log entries. When an error log entry is generated, the process analyzes the error log entry and produces an encoded callout of possible faulty field replaceable units.
The symptom-directed diagnosis fault isolation algorithm is performed by a general-purpose diagnostic engine. This engine uses a binary version of the symptom diagnosis information file, i.e., binary-coded matrix, as a rule base for its analysis. The diagnostic engine can analyze any error log entry that has a valid corresponding binary-coded matrix file.
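The separation between a general-purpose engine and a machine-specific rule base can be illustrated with a small data-driven matcher. This is a sketch under stated assumptions, not the actual engine or the binary-coded matrix format: the rule table, field names, and first-match policy are all hypothetical.

```python
def diagnose(rules, entry):
    """Return the callout of the first rule whose conditions all match.

    rules: list of (conditions, callout) pairs, most specific first.
    entry: dict of machine state captured in an error log entry.
    """
    for conditions, callout in rules:
        if all(entry.get(field) == value for field, value in conditions.items()):
            return callout
    return None  # no rule applies to this entry


# Hypothetical rule table; the engine above knows nothing about the machine.
RULES = [
    ({"PARITY_ERROR": 1, "MUX_SELECT": 0}, ["FRU 1", "FRU 3"]),
    ({"PARITY_ERROR": 1, "MUX_SELECT": 1}, ["FRU 2", "FRU 3"]),
    ({"PARITY_ERROR": 1}, ["FRU 1", "FRU 2", "FRU 3"]),
]
```

Because the machine-specific knowledge lives only in the table, supporting a different system in this sketch means supplying a different table rather than new engine code.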
In addition to the encoded callout, the single-event fault isolation process produces status information from each error event that is used for multiple-event analysis.
VAXsimPLUS
The VAXsimPLUS tool runs on the VAX 9000 CPU and performs symptom-directed diagnosis multiple-event analysis. The tool analyzes information generated by the single-event, symptom-directed diagnosis process using multiple-event, binary-coded matrix files. The VAXsimPLUS tool uses the same general-purpose diagnostic engine as the single-event, symptom-directed diagnosis process. The output of the VAXsimPLUS tool is a syndrome entry that collapses several error events into a single error analysis theory.
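Collapsing several events into one theory could take many forms. The sketch below uses simple voting across single-event callouts, which is an assumption chosen for illustration and not a description of how VAXsimPLUS works.

```python
from collections import Counter


def collapse(callouts):
    """Rank FRUs by how many single-event callouts named them.

    callouts: list of FRU lists, one per analyzed error event.
    Ties are broken alphabetically so the result is deterministic.
    """
    votes = Counter(fru for callout in callouts for fru in set(callout))
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [fru for fru, _ in ranked]
```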
Summary
A complete test and diagnosis strategy for a large computer system, such as the VAX 9000 system, requires off-line testing and its on-line counterpart, symptom-directed diagnosis. Off-line testing provides a hierarchical mechanism for testing each component before it is assembled into the next level. In off-line testing, the use of the scan system provides high coverage and accurate fault isolation. Scan testing also has proven effective during all phases of the VAX 9000 system product development: design, manufacturing, prototype debug, and customer support.
Digital Technical Journal Vol. 2 No. 4 Fall 1990
Symptom-directed diagnosis is a sophisticated tool that provides detection and isolation of intermittent faults. Intermittent faults have been a significant problem in the past because of the difficulty of re-creating the conditions that lead to such faults. Symptom-directed diagnosis solves the problem of intermittent faults by analyzing symptom information generated by on-line error handlers rather than by attempting to re-create the fault. Thus, the use of symptom-directed diagnosis provides greater machine availability for the VAX 9000 system.
Acknowledgments
The implementation of the VAX 9000 fault detection and isolation strategy would have been impossible if not for the perseverance and dedication to high quality shown by the following people: Jeff Barry, Dominic Carr, Steve Conway, Ed Crowley, Betty Daley, Tony Dancona, Dave D'Antonio, Chris Demos, Sue DesMarais, Paul Dormitzer, Rick Dusek, Mike Evans, Skip Gaede, Mike Gavronsky, Philippe Girard, Matt Goldman, Francis Gravel, Chris Joseph, Dale Keck, Tom Krehel, Charlie Kretz, Burch Leitz, Helen Lenane, Paul Leveille, Keith Mayhue, Chris McCabe, Robert Nobrega, Mike Newman, Paul Paternoster, Brian Rost, Dan Schullman, Scott Sitterly, Norm Sozio, Tamar Wexler, Tom Winter, Ted Wojcik, Richard Wood, Eugene Xia, and the members of the MCU tester and MCA3 test development teams.