System life data can include observations of system lives that were ended by something other than a system failure. These are called suspended or censored data. If the failure time of a system is known only to be beyond its current running time, the observation is said to be censored on the right, or right-censored. If the failure time of a system is known only to be before a certain time, the observation is said to be censored on the left, or left-censored. An example of the latter occurs when a system is found to be in a failed condition at an inspection under a particular inspection schedule. This research considers only right-censored observations, which are simply called censored observations.
1 Patent Pending, DaimlerChrysler AG.
Censored observations occur, for example, when system reliability is analyzed at a point in time when multiple units are still operational in a test or real-life environment. In a second example, participants in a medical experiment who decide to discontinue taking the test drugs during the experiment, or who stop showing up for checkups, cause censoring to occur. Reasons for censoring include:
• Time allowed for a test has run out without occurrence of a system failure,
• A reliability estimate is made at a point in time after market introduction of the product, at which time the systems are still field operational,
• A subject (or system) has stopped participating in the experiment for reasons other than failure,
• One component in a series system failed, causing the life-length observation for all other components to be censored as it caused system failure to occur.
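The last reason in the list can be illustrated with a minimal sketch. It assumes that the latent life of every component is known, which holds only in a simulation; the function name `component_records` and the component names are hypothetical and do not come from this research.

```python
def component_records(component_lives):
    """Turn one series-system failure into per-component life records.

    component_lives maps a (hypothetical) component name to its latent
    life. The system fails at the smallest of these lives. The component
    with that life is recorded as a failure; every other component is
    right-censored at the system failure time. In the unlikely case of a
    tie, all tied components are flagged as failed.
    """
    system_life = min(component_lives.values())
    return {
        name: (system_life, life == system_life)
        for name, life in component_lives.items()
    }


# Hypothetical latent lives in hours for a three-component series system.
records = component_records({"pump": 500.0, "valve": 320.0, "seal": 910.0})
for name, (time, failed) in records.items():
    print(name, time, "failure" if failed else "censored")
```

Each record is a pair of running time and failure flag, so the censored components carry the system failure time as their running time.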
If all censored systems have a common running time and all failure times are earlier, the dataset is called singly censored on the right. A singly right-censored dataset results from a test in which all systems simultaneously enter the test sample and the data are analyzed before all units fail. If the test duration, TDrf, i.e. the censoring time, is fixed beforehand and the number of system failures in that fixed period is random, the dataset is called singly time censored or Type I censored. If the test duration, i.e. the censoring time, is random and dependent on a preset number of system failures to be observed during the test, NFrf, the dataset is called singly failure censored. Such data are referred to as Type II censored. In practice, time censoring is more common, while failure censoring is more frequently used in the literature due to its mathematical properties [Nelson, 1982, p. 7]. The research presented here considers time, i.e. Type I, censoring only.
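The difference between the two singly censored schemes can be sketched as follows. This is an illustration only: the function names, the exponential life distribution and the numerical values are assumptions made here, with `test_duration` and `n_failures` playing the roles of the fixed test duration and the preset number of failures described above.

```python
import random


def type_i_censor(lifetimes, test_duration):
    """Singly time censored (Type I): the test stops at a fixed duration.

    Returns (time, failed) pairs; failed is False for units that are
    still running when the test ends.
    """
    return [(min(t, test_duration), t <= test_duration) for t in lifetimes]


def type_ii_censor(lifetimes, n_failures):
    """Singly failure censored (Type II): the test stops at the n-th failure.

    The censoring time is the n-th smallest life, so it is random.
    """
    if not 1 <= n_failures <= len(lifetimes):
        raise ValueError("n_failures must be between 1 and the sample size")
    stop_time = sorted(lifetimes)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in lifetimes]


# Hypothetical sample: ten units with exponential lives (mean 1000 hours).
rng = random.Random(0)
lifetimes = [rng.expovariate(1 / 1000.0) for _ in range(10)]

type_i = type_i_censor(lifetimes, test_duration=800.0)
type_ii = type_ii_censor(lifetimes, n_failures=4)

print("Type I failures observed:", sum(failed for _, failed in type_i))
print("Type II test duration:", max(t for t, _ in type_ii))
```

Under Type I censoring the number of observed failures varies from sample to sample, whereas under Type II censoring it is the test duration that varies.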
Systems can start operation, i.e. enter the (test) population, at different times and be taken out of service at various times. This behavior results in a dataset that has censored observations at miscellaneous running times intermixed with failure times. Such datasets are referred to as multiply censored. Multiply censored datasets usually come from the field, as systems go into service at various times and have different running times when the test is ended and data are recorded [Nelson, 1982, p. 7]. The research presented in this report considers multiply censored datasets, as they are common in practice.
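A minimal sketch of how staggered entry produces a multiply censored dataset is given below. The function name `multiply_censored`, the calendar times and the lives are hypothetical values chosen for illustration, and the latent lives are assumed known, as they would be in a simulation.

```python
def multiply_censored(entry_times, lifetimes, analysis_time):
    """Build (running_time, failed) records for units with staggered entry.

    entry_times are calendar times at which units go into service,
    lifetimes are their latent lives, and analysis_time is the calendar
    time at which the data are recorded.
    """
    records = []
    for entry, life in zip(entry_times, lifetimes):
        if entry >= analysis_time:
            continue  # unit has not entered service yet
        exposure = analysis_time - entry
        if life <= exposure:
            records.append((life, True))       # failure observed
        else:
            records.append((exposure, False))  # still running: censored
    return records


# Hypothetical field data: four units entering service at different times.
data = multiply_censored(
    entry_times=[0.0, 200.0, 400.0, 100.0],
    lifetimes=[300.0, 1000.0, 150.0, 2000.0],
    analysis_time=800.0,
)
print(data)
```

In this sample the two censored units have different running times (600 and 700 hours), which is what distinguishes a multiply censored dataset from a singly censored one.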
According to Nelson [1982, p. 1], the fact that life data are usually censored or incomplete in some manner is a key characteristic that distinguishes life data analysis from other areas of research in statistics. Note that some reliability estimation methods ignore suspended observations altogether, while others assume they do not exist. The same can be said of the occurrence of masking in a system life dataset.
1.5 Masked Data
In field operations, a system failure will likely be documented and some indication of the cause of failure might be given. The problem of detecting the exact individual component that was responsible for the system failure has long troubled reliability engineers. It might not be possible or efficient within the given operating environment to analyze which component failed and resulted in the system failure. Especially in continuous manufacturing environments, military systems, aviation systems and trucking or transportation systems, it is common to quickly replace the part diagnosed to be the likely culprit, or to replace the entire system by a backup system followed by offline repair. The latter procedure allows production to be brought back up to its original level as soon as possible, thereby minimizing opportunity costs resulting from missed production volume. Note that the results of diagnostic efforts in offline repair might not be related to (i.e. registered as the cause for) the system failure event. This omission complicates system failure analysis for the reliability engineer, who relies primarily on registered event data. When repair efforts are performed online, diagnostic efforts might not find the failing component with certainty within the available timeframe. Consequently, a number of components, among which the culprit is suspected to be, could be replaced or repaired. Failure observations without a uniquely identified cause are called cause-incomplete or masked data, and the dataset that contains them is referred to as an incomplete dataset.
Thus, masked data are system failure observations where the cause of failure has been limited to a subset of the components that constitute the system, without uniquely identifying the failing one. Some reasons for missing failure cause information are:
• Time required to perform failure analysis,
• Cost of performing failure analysis (both direct and opportunity costs),
• Technical complexity of the analysis work, and
• Loss of information in the communication process between diagnostician and reliability engineer.
An example of masking due to loss of information is given by Usher and Walker [1997]. The authors describe how system operation can lead to masked data on microwave tube system failures. This occurs when failure analysis reports created by a repair depot do not contain the tube serial number. Diagnostic information can therefore not be related to the maintenance reports created when the tube was placed into service and when it was removed from service, thus leading to a masked observation.
Two types of masked observations are identified: fully and partially masked. A fully masked (system failure) observation is a registered system failure event where all components are suspected of having caused the failure. In other words, an observation is fully masked if none of the components within the system can be confirmed as not having caused the system failure, so that no information is available on which component caused the failure or certainly did not cause it. On the other hand, a partially masked observation has at least one component identified as operational at the time of failure of system i, Tf. A subset (smaller than the full set) of the components that form the system is suspected to contain the cause of system failure.
A fully masked observation is not to be confused with a fully masked dataset. The latter is a set of system observations that are exclusively partially or fully masked. This means that the set contains neither censored observations nor failures with diagnosed causes.
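The distinctions drawn above can be summarized in a short sketch. The record layout, the names `Observation`, `classify` and `is_fully_masked_dataset`, the label "diagnosed" for a failure with a uniquely identified cause, and the component names are all assumptions made for illustration; they are not part of this research.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Observation:
    time: float                               # running time of the system
    failed: bool                              # False for a censored observation
    suspects: FrozenSet[str] = frozenset()    # components suspected as cause


def classify(obs, components):
    """Label one observation for a system made up of `components`."""
    if not obs.failed:
        return "censored"
    if not obs.suspects or not obs.suspects <= components:
        raise ValueError("a failure needs a non-empty suspect set of known components")
    if len(obs.suspects) == 1:
        return "diagnosed"          # cause uniquely identified
    if obs.suspects == components:
        return "fully masked"       # every component is suspected
    return "partially masked"       # a proper subset is suspected


def is_fully_masked_dataset(dataset, components):
    """True if the dataset holds only partially or fully masked observations."""
    masked = {"partially masked", "fully masked"}
    return bool(dataset) and all(classify(o, components) in masked for o in dataset)


COMPONENTS = frozenset({"A", "B", "C"})       # hypothetical three-component system
dataset = [
    Observation(95.0, True, frozenset({"A", "B"})),   # partially masked
    Observation(60.0, True, COMPONENTS),              # fully masked
]
print([classify(o, COMPONENTS) for o in dataset])
print(is_fully_masked_dataset(dataset, COMPONENTS))
```

Adding a single censored observation or a diagnosed failure to `dataset` would make `is_fully_masked_dataset` return False, in line with the definition above.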
Usher and Hodgson [1990] state that masked data can represent a significant proportion of observations in an industrial dataset. The masking level depends on the type of system and the level of failure analysis conducted, i.e. the diagnostic efforts exerted in the field upon identification of a system failure.
Masking can occur in systems consisting of several components connected in diverse configurations. One of these configurations is commonly referred to as a competing risks situation, and is the configuration assumed in the research presented here.
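For orientation, a competing risks situation is commonly formalized as a series arrangement in which the system fails as soon as its first component fails. The expressions below are a sketch in generic notation, not the notation of this research: the symbols $J$, $T_{ij}$, $T_i$, $K_i$ and $R_j$ are assumptions introduced here, and the product form additionally assumes that the component lives are statistically independent.

```latex
T_i = \min_{1 \le j \le J} T_{ij}, \qquad
K_i = \operatorname*{arg\,min}_{1 \le j \le J} T_{ij}, \qquad
R_{\mathrm{sys}}(t) = \Pr(T_i > t) = \prod_{j=1}^{J} R_j(t)
```

Here $T_{ij}$ denotes the latent life of component $j$ in system $i$, $T_i$ the system life, $K_i$ the component that caused the failure, and $R_j(t)$ the reliability function of component $j$. In these terms, a masked observation records $T_i$ together with a set of components suspected to contain $K_i$, rather than $K_i$ itself.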