The availability of WSN systems is becoming increasingly important because a growing number of WSN applications depend on continuous monitoring and therefore need fault-tolerant design. Availability is closely related to reliability, and is defined in ITU-T Recommendation E.800 [ITU-T, 2008] as “the ability of a system to be in a state to perform a function or an operation at a given instant of time, or at any instant of time within a given time interval, assuming that the external resources, if required, are provided.” The main difference between reliability and availability is that reliability refers to failure-free operation during an interval, while availability refers to failure-free operation at a given instant of time [Trivedi, 2002b], usually the time when a device or system is first accessed to provide a required function or service. Availability may further be categorised as:
1. Instantaneous Availability or Point Availability A(t) of a component (or a system) is defined as the probability that the component or system is properly functioning at time t [Trivedi, 2002b], [Ever, 2007], and may be described mathematically as:
$$A(t) = R(t) + \int_{0}^{t} R(t - x)\, m(x)\, dx \tag{3.6}$$
where R(t) is the probability of having no failure in the interval (0, t) and m(x) is the repair density. The equation shows that the system is available at time t either if no failure occurs in the interval (0, t), or if a failure occurs but the repair of the system is completed before time t [Trivedi, 2002a].
In the absence of a repair or a replacement, availability A(t) is simply equal to the reliability R(t) of the component.
2. Limiting Availability, defined as the steady-state availability (A), is the limiting value of A(t) as t → ∞. From the literature [Ever, 2007], it may be expressed mathematically as:
$$\lim_{t \to \infty} A(t) = A = \frac{1/\xi}{1/\xi + 1/\eta} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \tag{3.7}$$
where ξ and η are the failure and repair rates respectively, and 1/ξ and 1/η are the Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR) respectively.
3. Interval (Average) Availability, defined as the expected fraction of time the system is up in a given interval (0, t), may be given by:
$$A_I(t) = \frac{1}{t} \int_{0}^{t} A(x)\, dx \tag{3.8}$$
In [Trivedi, 2001], it is explained that the three availabilities relate as given in equation 3.9:
$$\lim_{t \to \infty} A_I(t) = \lim_{t \to \infty} A(t) = \frac{\eta}{\eta + \xi} \tag{3.9}$$
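The relationship between the three availability measures can be illustrated numerically. The sketch below assumes a single repairable unit with exponentially distributed failure and repair times (an assumption not stated in the text above), for which the point availability has the well-known closed form A(t) = η/(ξ+η) + (ξ/(ξ+η))·e^{-(ξ+η)t}. The function names and the rate values are hypothetical and chosen only for illustration.

```python
import math


def point_availability(t, xi, eta):
    """A(t) for a two-state unit, assuming exponential failure (xi) and repair (eta)."""
    s = xi + eta
    return eta / s + (xi / s) * math.exp(-s * t)


def interval_availability(t, xi, eta):
    """A_I(t) = (1/t) * integral of A(x) over (0, t), in closed form; requires t > 0."""
    s = xi + eta
    return eta / s + xi / (s * s * t) * (1.0 - math.exp(-s * t))


def steady_state_availability(xi, eta):
    """Limiting availability written as MTTF / (MTTF + MTTR)."""
    mttf, mttr = 1.0 / xi, 1.0 / eta
    return mttf / (mttf + mttr)


# Hypothetical rates (per hour): one failure per 1000 h, one repair per 10 h.
xi, eta = 0.001, 0.1
for t in (1.0, 10.0, 100.0, 10000.0):
    print(t, point_availability(t, xi, eta), interval_availability(t, xi, eta))
print("steady state:", steady_state_availability(xi, eta))
```

Under these assumptions both A(t) and A_I(t) start near 1 and settle towards η/(η+ξ) as t grows, with the interval availability converging more slowly because it averages over the early, highly available period.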
In order to study system reliability and availability, three model types are identified: combinatorial, state space and hierarchical models [Trivedi, 2001], [Ever, 2007]. Among combinatorial models, three types are commonly used: reliability block diagrams, reliability graphs and fault trees. These model types are similar in that they capture the conditions that make a system fail in terms of the structural relationships between the system components. Reliability block diagrams (RBD), arranged in series, parallel or k-out-of-n configurations, represent the logical structure of a system with regard to how the reliability of its components affects the system reliability. An RBD can be used to model availability if the repair and failure times are all independent. The assumption of independence and series-parallel structure allows very fast computation of reliability and availability measures. However, many system models in practice do not follow the series-parallel structure. The Symbolic Hierarchical Automated Reliability/Performance Evaluator (SHARPE) software package, developed by Sahner and Trivedi in 1986, allows easy specification and solution of such models [Trivedi and Malhotra, 1993], [Trivedi, 2001].
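The fast computation that independence and series-parallel structure allow can be sketched as follows. This is a minimal illustration, not SHARPE's implementation: the function names are hypothetical, the k-out-of-n block assumes identical components, and the component reliabilities in the example are made-up values.

```python
from math import comb


def series(reliabilities):
    """A series block works only if every independent component works."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(reliabilities):
    """A parallel block fails only if every independent component fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


def k_out_of_n(k, n, r):
    """At least k of n identical, independent components (reliability r) work."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# Hypothetical node: one radio in series with two redundant batteries
# and a 2-out-of-3 group of identical sensors.
node = series([0.99, parallel([0.9, 0.9]), k_out_of_n(2, 3, 0.95)])
print(node)
```

The same functions apply to availability if each input is a component availability and the failure and repair times are all independent.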
Reliability graph models consist of a set of nodes and directed edges (arcs), where the edges represent components that can fail or structural relationships between the components [Trivedi, 2001]. The graph contains one node with no incoming edges, the source, and one node with no outgoing edges, the sink (also called the destination or terminal node). As in reliability block diagrams, the edges can be assigned failure probabilities, failure rates, unavailability values or failure distribution functions. A system represented by a reliability graph fails when there is no path from the source to the sink. A reliability graph is equivalent to a non-series-parallel reliability block diagram. In the reliability graph the components are the arcs, while in the block diagram the components are the boxes. The non-series-parallel block diagram cannot be directly analysed by (or even specified for) SHARPE, but the reliability graph can. The price for more generality is the increased complexity of solution.
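The source-to-sink path criterion can be made concrete with a small sketch. It evaluates a reliability graph by exhaustively enumerating the up/down states of all edges, which is exponential in the number of edges and is used here only because it is easy to verify; it is not the algorithm used by SHARPE. The function names, node labels and the five-edge bridge example are hypothetical, and edges are assumed to fail independently.

```python
from itertools import product


def _reachable(adjacency, source, sink):
    """Depth-first search over the edges that are currently up."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == sink:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def graph_reliability(edges, source, sink):
    """edges maps a name to (tail, head, reliability); edges fail independently."""
    names = list(edges)
    total = 0.0
    for states in product((True, False), repeat=len(names)):
        prob = 1.0
        adjacency = {}
        for name, up in zip(names, states):
            tail, head, r = edges[name]
            prob *= r if up else 1.0 - r
            if up:
                adjacency.setdefault(tail, []).append(head)
        # The system works in this state if a source-to-sink path exists.
        if _reachable(adjacency, source, sink):
            total += prob
    return total


# Hypothetical bridge network, a classic non-series-parallel structure.
bridge = {
    "a": ("s", "1", 0.9),
    "b": ("s", "2", 0.9),
    "c": ("1", "t", 0.9),
    "d": ("2", "t", 0.9),
    "e": ("1", "2", 0.9),  # the bridging arc
}
print(graph_reliability(bridge, "s", "t"))
```

The bridging arc is what prevents the structure from being reduced by series and parallel steps alone, which illustrates why the added generality comes with a higher solution cost.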
A fault tree is a pictorial representation of the sequence of events or conditions that must be met for a failure to occur [Sahner et al., 1996], [Sathaye et al., 2000]. It uses AND, OR, and k-of-n logic gates to represent the combination of events in a tree-like structure. In order to represent situations where one failure event propagates failures along multiple paths in the fault tree, fault trees can have repeated nodes. Several efficient algorithms exist for solving fault trees [Sathaye et al., 2000]. Examples include algorithms for series-parallel systems (for fault trees without repeated components), a multiple variable inversion (MVI) algorithm called the LT algorithm for obtaining the sum of disjoint products (SDP) from the mincut set [Muppala and Trivedi, 1992], and the factoring/conditioning algorithm, which works by factoring a fault tree with repeated nodes into a set of fault trees without repeated nodes [Sathaye et al., 2000], [Satyanarayana and Prabhakar, 1978]. In [Doyle and Dugan, 1995], [Doyle et al., 1995], it is shown that binary decision diagram (BDD)-based algorithms can be used to solve very large fault trees.
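The idea behind factoring (conditioning) on a repeated node can be sketched as follows. This is a simplified illustration of the principle, not the algorithm as published in the cited works: the tree is conditioned on each repeated basic event having occurred or not, which leaves trees that can be evaluated bottom-up, and the results are combined by the law of total probability. The tuple-based tree encoding, the function names and the event probabilities are all hypothetical, and basic events are assumed independent.

```python
import math


def _at_least_k(k, probs):
    """Probability that at least k of the independent input events occur."""
    dist = [1.0]  # dist[j] = probability that exactly j inputs have occurred
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            new[j] += d * (1.0 - p)
            new[j + 1] += d * p
        dist = new
    return sum(dist[k:])


def top_probability(tree, probs):
    """Bottom-up evaluation; exact only when no basic event is repeated."""
    if isinstance(tree, str):
        return probs[tree]
    gate, *rest = tree
    if gate == "kofn":
        k, *children = rest
        return _at_least_k(k, [top_probability(c, probs) for c in children])
    values = [top_probability(c, probs) for c in rest]
    if gate == "and":
        return math.prod(values)
    if gate == "or":
        return 1.0 - math.prod(1.0 - v for v in values)
    raise ValueError(f"unknown gate: {gate}")


def factor(tree, probs, repeated):
    """Condition on each repeated event in turn (law of total probability)."""
    if not repeated:
        return top_probability(tree, probs)
    event, rest = repeated[0], repeated[1:]
    p = probs[event]
    occurred = factor(tree, {**probs, event: 1.0}, rest)
    absent = factor(tree, {**probs, event: 0.0}, rest)
    return p * occurred + (1.0 - p) * absent


# Hypothetical tree in which basic event "A" appears under two AND gates.
tree = ("or", ("and", "A", "B"), ("and", "A", "C"))
probs = {"A": 0.1, "B": 0.2, "C": 0.3}
print("ignoring the repetition:", top_probability(tree, probs))
print("factoring on A:", factor(tree, probs, ["A"]))
```

Evaluating the tree bottom-up without conditioning treats the two occurrences of A as independent events and therefore gives a different, incorrect top-event probability for this example.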
In previous studies [Sathaye et al., 2000], [Trivedi, 2002b], it is noted that reliability block diagrams, reliability graphs and fault trees cannot easily handle more complex situations such as failure/repair dependencies and shared repair facilities. State space representations have successfully been used to model such complex systems. A state space model describes the system under study as a set of states and the transitions between them. Gracefully degrading systems may be able to survive the failure of one or more active components and continue to provide service at a reduced level. Some commonly used techniques for modelling gracefully degradable systems include Markov chains, Markov reward models (MRM), stochastic reward nets and Petri nets [Trivedi, 2001], [Sathaye et al., 2000].
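A shared repair facility is one of the simplest dependencies that a state space model can capture. The sketch below builds a three-state continuous-time Markov chain for two identical components served by a single repair facility and solves for its steady-state probabilities with a small Gauss-Jordan routine. The model, the rate values and the function names are illustrative assumptions (exponential failure and repair times, one repair at a time), not a model taken from the cited works.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    # Rows of Q transposed, augmented with a zero right-hand side.
    A = [[Q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    # Replace one balance equation with the normalisation condition.
    A[-1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def shared_repair_generator(xi, eta):
    """States: 0 = both up, 1 = one up, 2 = both down (one repair at a time)."""
    return [
        [-2 * xi, 2 * xi, 0.0],
        [eta, -(xi + eta), xi],
        [0.0, eta, -eta],
    ]


xi, eta = 0.001, 0.1  # hypothetical failure and repair rates
pi = steady_state(shared_repair_generator(xi, eta))
availability_shared = 1.0 - pi[2]  # system is up if at least one unit is up
availability_independent = 1.0 - (xi / (xi + eta)) ** 2  # combinatorial result
print(availability_shared, availability_independent)
```

With these rates the combinatorial calculation, which implicitly gives each component its own repair facility, reports a higher availability than the Markov model with the shared facility.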
The advantage of the non-state-space models described above is that they are efficient to specify and solve. However, the solution of these models assumes that the components are independent. For instance, in a block diagram, fault tree or reliability graph, the components must be completely independent of one another in their failure and repair behaviour. A failure in one component cannot affect the operation of another component, and components cannot share a repair facility. Markov models provide the ability to model systems that violate these assumptions, but at the cost of a state space explosion. A system having n components may require up to 2^n states in a Markov chain representation [Trivedi, 2001].
Trivedi mentions two ways of dealing with the state space explosion problem: tolerance and avoidance [Trivedi, 2001]. Tolerance of a complex model must apply to the specification, storage and solution of the model. If the storage and solution problems can be solved, the specification problem can be solved by using more concise (and simpler) model specifications that can be automatically transformed into Markov models. Complex models can be avoided by using hierarchical model composition [Trivedi, 2002b]. The ability of SHARPE to combine results from different kinds of models also makes it possible to use state-space methods for those parts of a system that require them, and non-state-space methods for the more well-behaved parts of the system.
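Hierarchical composition can be illustrated with a minimal sketch in which a state-space sub-model is solved on its own and its result is passed to a combinatorial top-level model. The example is an assumption made for illustration and does not reproduce SHARPE's interface: a duplex subsystem with a shared repair facility is solved as a small birth-death Markov chain in closed form, two further units are treated as independent two-state components, and the subsystems are assumed independent of each other and combined in series. All names and rates are hypothetical.

```python
def single_unit_availability(xi, eta):
    """Steady-state availability of one independent repairable unit."""
    return eta / (xi + eta)


def duplex_shared_repair_availability(xi, eta):
    """Two identical units and one repair facility (birth-death Markov chain).

    Balance equations give pi1 = 2*rho*pi0 and pi2 = 2*rho**2*pi0 with
    rho = xi / eta; the subsystem is down only in the state with both units down.
    """
    rho = xi / eta
    p_both_down = 2 * rho**2 / (1 + 2 * rho + 2 * rho**2)
    return 1.0 - p_both_down


def series(availabilities):
    """Top-level combinatorial model: every subsystem must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Lower level: each sub-model is solved separately with its own method.
subsystems = [
    duplex_shared_repair_availability(0.002, 0.1),  # needs a state-space model
    single_unit_availability(0.0005, 0.05),  # well-behaved, independent unit
    single_unit_availability(0.001, 0.2),  # well-behaved, independent unit
]
# Upper level: only three numbers are combined, not one large Markov chain.
print(series(subsystems))
```

Solving the three parts as one flat Markov chain would need a state for every combination of subsystem states, whereas the hierarchical version only ever solves a three-state chain.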
In practical system design, a pure availability model may not be enough for gracefully degrading communication and computer systems, because such models tend to be very conservative: they do not explicitly consider the different levels of performance of the system states. A composite model of both availability and performance is therefore necessary as the system degrades over time. A more realistic analysis method was introduced in [Beaudry, 1978], and a conceptual framework of performability was introduced by Meyer [Meyer, 1980]. This modelling approach is very useful for systems that degrade and experience breakdowns and failures.
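The difference between a pure availability figure and a composite measure can be sketched with a simple steady-state reward calculation. This is only one elementary performability-style measure and is not the specific formulation of [Beaudry, 1978] or [Meyer, 1980]. The sketch assumes n identical, independent units that each have steady-state availability a, and a reward for each state equal to the fraction of full capacity that is delivered; the names and values are hypothetical.

```python
from math import comb


def state_probabilities(n, a):
    """P(exactly i of n independent units are up), for i = 0..n."""
    return [comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(n + 1)]


def expected_reward(probabilities, rewards):
    """Steady-state expected reward: sum over states of probability * reward."""
    return sum(p * r for p, r in zip(probabilities, rewards))


n, a = 4, 0.9  # hypothetical: four units, each available 90% of the time
probs = state_probabilities(n, a)
rewards = [i / n for i in range(n + 1)]  # fraction of full capacity in state i

strict_availability = probs[n]  # counts only the fully operational state
capacity_delivered = expected_reward(probs, rewards)
print(strict_availability, capacity_delivered)
```

Counting only the fully operational state as "up" ignores the service delivered in the degraded states, so under these assumptions the strict availability figure is noticeably lower than the expected fraction of capacity delivered.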