However, such mechanisms are not effective when large-scale multiple failures arise, that is, when a significant part of the network fails simultaneously. Large-scale failures usually have serious consequences in terms of the economic loss they cause and the disruption they bring to thousands or even millions of users.
Motivation
Almost all recovery techniques focus on individual failures and try to offer some compromise between guarantee of resilience and consumption of resources. It is therefore vital that we have methods and tools that can be applied to the design and operation of communications infrastructure so that essential services can be maintained as much as possible when major failures occur.
Objectives
Outline of the Thesis
- Faults, Errors and Failures
- The concept of Resilience
- Fault Tolerance and Survivability
- Basic Assessment of Resilience
In the network literature, it is common to find terms such as survivability, reliability, fault tolerance, robustness, and dependability. In a recent proposal, it is defined as the ability of a system to fulfill its mission in a timely manner in the presence of threats such as attacks or large-scale natural disasters [129].
Overview of Transport Network Technologies
Wavelength Division Multiplexing
WDM is a technology that allows the transmission of more than one optical signal along a single fiber at the same time. Its principle is essentially the same as frequency division multiplexing (FDM), which means that multiple signals are transmitted using different carriers, each occupying non-overlapping parts of the frequency spectrum [30].
WDM Networks
Light paths can traverse many optical links in a fiber optic network, ideally without OEO conversion. Given that the light path behaves as a pure channel between source and destination, theoretically there is nothing in the signal path to limit the throughput of the fiber [121].
GMPLS-based Networks
Connection-oriented versus Connectionless Networks
The virtual circuit is realized as one or more paths through the network and the data traffic between the two parties flows exclusively through them. "Routers" in a GMPLS-based network forward data based only on labels, so that any layer 3 identifiers (such as IP addresses) that may exist within the protocol's data units are completely ignored.
GMPLS-specific Features
The GMPLS Functional Planes
Network Failure and Recovery
- Types of Physical Failures
- A Failure Taxonomy
- Large-scale Failures
- Categorizing Multiple Failures
- Recovery in GMPLS-based Networks
- Recovery Phases
- Protection and Restoration
It can be argued that such errors are no more than node (hardware) failures due to the fact that the software runs on nodes. In any case, extra traffic may be blocked when failures occur, so it is not protected.
Fundamental Graph Concepts
Graphs and Paths
There can be multiple shortest paths for a given node pair, all of which have the same minimum cost. Two graphs G = (V, E) and G' = (V', E') are isomorphic if they contain the same number of vertices connected in the same way. For example, if the diameter of graph G is four, all graphs that are isomorphic to it have the same diameter.
In research, communication networks are usually modeled as simple, connected, weighted graphs, with link capacity and length (e.g., distance in km) as the most common weights.
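As an illustration of this modeling convention, the following minimal sketch stores a small weighted graph as an adjacency dictionary and computes minimum path costs with Dijkstra's algorithm. The four-node network, its link lengths, and the names `dijkstra` and `toy_network` are hypothetical and are not taken from the thesis.

```python
import heapq


def dijkstra(graph, source):
    """Return the minimum path cost from `source` to every reachable node.

    `graph` maps each node to a dict {neighbor: non-negative link weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical 4-node network; weights are link lengths in km.
toy_network = {
    "A": {"B": 100.0, "C": 250.0},
    "B": {"A": 100.0, "C": 100.0, "D": 300.0},
    "C": {"A": 250.0, "B": 100.0, "D": 150.0},
    "D": {"B": 300.0, "C": 150.0},
}

costs_from_a = dijkstra(toy_network, "A")
```

Link capacity could be stored alongside length by replacing each weight with a small record; only the length is used here to keep the sketch short.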
Basic Graph Features
Metrics and Non-trivial Graph Features
- Degree sequence and Degree distribution
- Path length distribution
- Clustering coefficient
- Measures of Centrality
- Assortativity coefficient
- Algebraic Connectivity and other spectral measurements
- Erdős-Rényi Networks
- Generalized Random Networks
- The Watts-Strogatz Small-World Networks
- Scale-free Networks
- Tools and Models for Internet-like Topologies
The path length distribution Pr[H_N = k] is the histogram of the length of the shortest path between all possible pairs of nodes in the graph. The study of the path length distribution has helped define the small-world network model. In the following subsections, an overview of the models that are more relevant for communication systems is presented.
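The path length distribution can be estimated on a small unweighted graph with one breadth-first search per node. The sketch below is a minimal illustration that measures path length in hops; the function names and the six-node ring are hypothetical examples, not data from the thesis.

```python
from collections import Counter, deque


def hop_counts(adj, source):
    """Breadth-first search: hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_distribution(adj):
    """Return {k: fraction of connected node pairs whose shortest path has k hops}."""
    counts = Counter()
    for s in adj:
        for t, d in hop_counts(adj, s).items():
            if t != s:
                counts[d] += 1  # ordered pairs; the fractions are unaffected
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: c / total for k, c in sorted(counts.items())}


# Hypothetical example: a ring of six nodes.
ring6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
ring6_distribution = path_length_distribution(ring6)
```

Pairs of nodes that lie in different components are simply not counted, so on a disconnected graph the fractions refer to connected pairs only.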
This results in the introduction of "bridges" that connect initially distant nodes and lead to the shortening of paths in the network.
Measures of Network Robustness
- Network criticality
- Symmetry ratio
- Connectivity and Average two-terminal reliability
- Elasticity
- Viral conductance
However, when transient disconnection of small network sections is acceptable, as occurs in large data networks such as the Internet, it does not provide a meaningful measure. One option is to look at the size of the giant component, but a more flexible alternative is the so-called k-terminal reliability. The use of the spectral radius as an epidemic threshold is well established in the epidemics literature.
Formally, the viral conductance VC is defined as the area under the curve of the fraction of infected nodes at steady state y∞(s).
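Given samples of the steady-state curve, the area that defines VC can be approximated numerically. The sketch below uses the trapezoidal rule and assumes that the curve has already been obtained (analytically or by simulation) on an increasing grid of s values; the function name and the sample points in the usage note are hypothetical.

```python
def viral_conductance(s_values, y_inf_values):
    """Trapezoidal estimate of the area under y_inf(s) over the sampled range.

    `s_values` must be increasing; `y_inf_values[i]` is the steady-state
    fraction of infected nodes at s_values[i].
    """
    if len(s_values) != len(y_inf_values):
        raise ValueError("both sequences must have the same length")
    area = 0.0
    for i in range(1, len(s_values)):
        width = s_values[i] - s_values[i - 1]
        if width < 0:
            raise ValueError("s_values must be increasing")
        area += 0.5 * width * (y_inf_values[i] + y_inf_values[i - 1])
    return area
```

For instance, a curve that falls linearly from 1.0 at s = 0 to 0.0 at s = 2 has an area of 1.0, which is what `viral_conductance([0.0, 1.0, 2.0], [1.0, 0.5, 0.0])` returns. The integration limits are whatever range the samples cover.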
Evaluation of topological damage
Size of the largest component
The size of the largest component (alternatively, the size of the giant component) is one of the most widely used measurements, especially in complex networks; see, e.g., the recent study [93]. Fig. 4.3 shows the variation observed in the relative size of the largest connected component as more and more links are removed. Initially, when the network is still intact, the size of the largest component corresponds to the network size, i.e. SLC(0) = N.
Therefore, when a number of links are removed, many nodes are very often left isolated in small components (usually of size one or two), reducing the size of the giant component.
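The effect of link removal on the largest component can be reproduced on a toy graph. The following minimal sketch computes the size of the largest connected component from a node list and a link list; the six-node ring and the sets of removed links are hypothetical, chosen only to show how isolated nodes shrink the giant component.

```python
from collections import deque


def largest_component_size(nodes, links):
    """Return the number of nodes in the largest connected component."""
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    best = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


# Hypothetical example: a six-node ring.
ring_nodes = list(range(6))
ring_links = [(i, (i + 1) % 6) for i in range(6)]

# Removing both links of node 1 leaves it isolated in a component of size one.
isolating_cut = [l for l in ring_links if l not in [(0, 1), (1, 2)]]
# Removing two opposite links splits the ring into two halves of three nodes.
splitting_cut = [l for l in ring_links if l not in [(0, 1), (3, 4)]]
```

With the intact ring the function returns 6, that is, SLC(0) = N; the two cuts above give 5 and 3 respectively.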
Average two-terminal reliability
In general, the difference between the topologies for r < 0.15 is small: SLC(rv) is almost linear in all the cases, although with different slopes. An alternative is to replace "remove node x" with "isolate node x" (by removing all its links). We use that approach to explore another aspect not mentioned so far: the variability of the measurements.
On the other hand, cost266x6 is again peculiar in that it is quite stable for r < 0.1, but after that point the variation increases rapidly.
Algebraic Connectivity
It is also possible to measure A2TR when removing a node, but this changes N and so the results cannot be related to the original full topology. The less variation the better, because it means that no matter which specific elements fail, the performance reduction is expected to be about the same. The bt400 topology performs very poorly in this respect, which in this context means that there are some key elements whose failure significantly changes the A2TR.
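A2TR can be computed directly from the component sizes. The sketch below assumes the common definition of A2TR as the fraction of node pairs that remain connected, and it keeps N fixed by isolating nodes (removing their links) rather than deleting them. The function name and the toy ring are hypothetical.

```python
def a2tr(nodes, links):
    """Fraction of node pairs that are connected by at least one path."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in links:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    sizes = {}
    for n in parent:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1

    n_total = len(parent)
    if n_total < 2:
        return 1.0
    connected_pairs = sum(s * (s - 1) // 2 for s in sizes.values())
    return connected_pairs / (n_total * (n_total - 1) // 2)


# Hypothetical example: a six-node ring, intact and with two opposite links cut.
a2tr_nodes = list(range(6))
a2tr_ring = [(i, (i + 1) % 6) for i in range(6)]
a2tr_cut = [l for l in a2tr_ring if l not in [(0, 1), (3, 4)]]
```

On the intact ring the value is 1.0. After the cut, two components of three nodes remain, so 6 of the 15 node pairs are still connected and the value drops to 0.4.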
Evaluation of functional damage
We perform this evaluation on the cost266x6 topology, which performed well in the previous section for small values of r and is similar in structure to reference transport networks. In order to simulate the provision of service in the network and the occurrence of large-scale failures, an event-driven simulator has been developed that reproduces the process of route selection in a connection-oriented transport network. Given the large diameter of the topology, in this experiment we chose to discard traffic that could be considered "local" in the sense of node neighborhood.
It is clear that these figures will be different if the links are equipped with protection, but in any case they emphasize that the functional damage, in this case from the point of view of active links, is much more serious than we would expect if we looked only at topological measures such as those evaluated in the previous section.
Limiting functional damage through Link Prioritization
EdgeBC : The betweenness centrality approach
Conversely, a link is reinserted later when the connections using it are terminated.
OLC : The Observed Link Criticality approach
Performance Comparison
Table 4.2 shows the connection path length frequency distribution of a representative simulation run. Each combination of r, strategy, and z discussed in this section has an entry in the table. Furthermore, due to the nature of the architecture, one failure in the lower layer can manifest as multiple simultaneous failures in higher layers.
In any case, the broader issue of control plane elasticity has not been neglected by the research community (see, for example, [79]), but it is outside the scope of this thesis.
Basic terminology of epidemic networks
A new model of failure propagation: The SID model
SID epidemic thresholds
The spread of the infection in the network depends on the topology of the network via the single parameter λ1 > 0. This phenomenon arises from the systematic approach of considering the linearization of the model around the disease-free steady state, where the adjacency matrix of the network appears and the largest eigenvalue λ1 determines the (in)stability. It also analytically shows the values for the number of nodes in the susceptible, infected and disabled states.
The figure highlights two important points: the intersection of the infected and susceptible curves and the intersection of the disabled and susceptible curves.
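Since the threshold analysis depends on the topology only through λ1, the largest eigenvalue of the adjacency matrix, it is useful to see how that value can be obtained. The sketch below uses shifted power iteration (multiplying by A + I so that bipartite graphs such as even rings also converge) and returns the Rayleigh quotient of A. It is a minimal illustration for small, non-empty, undirected graphs; the function name and the example graphs are hypothetical.

```python
import math


def spectral_radius(adj, iterations=1000, tol=1e-12):
    """Estimate the largest adjacency eigenvalue of an undirected graph.

    `adj` maps each node to an iterable of its neighbors.
    """
    nodes = list(adj)
    x = {n: 1.0 for n in nodes}
    lam = 0.0
    for _ in range(iterations):
        # One step of power iteration with the shifted matrix A + I.
        y = {n: x[n] + sum(x[m] for m in adj[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in y.values()))
        x = {n: v / norm for n, v in y.items()}
        # Rayleigh quotient x^T A x for the unit vector x.
        new_lam = sum(x[n] * sum(x[m] for m in adj[n]) for n in nodes)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam


# Hypothetical examples: an eight-node ring and a star with four leaves.
ring8 = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
star4 = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

Both example graphs have λ1 = 2: the ring because it is 2-regular, and the star because its largest eigenvalue is the square root of the number of leaves.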
Empirical validation of the model
1, then the infection dies out over time, that is, the number of infected and disabled nodes goes to zero. Finally, for a homogeneous network we have explicit expressions for the endemic steady state: the fraction of susceptible nodes is R1. As time goes by, we can thus obtain the number of nodes in each state and compare it with the analytical assessment.
As shown in the figure, once the epidemic reaches a steady state, the proportion of infected nodes is close to the analytical value.
Failure propagation on Rings
- Assumptions
- A CTMC model for a small ring
- Guidelines for the assignment of repair rates
- Numerical results
For simplicity, it therefore does not appear in the calculation of the corresponding infection rate. Such a "disconnected" state represents a major malfunction that requires the urgent intervention of the network operator. By solving for the steady-state probabilities of the CTMC-based model, it is easy to find the percentage of time the ring remains in each state as a function of the two repair rates δ1 and δ2.
Severe infection: includes all states where at most one node is disabled and more than half of the nodes are infected.
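The steady-state probabilities of a CTMC are the solution of πQ = 0 together with the normalization Σπ = 1, where Q is the generator matrix. The sketch below solves that linear system by Gaussian elimination for a deliberately tiny, hypothetical three-state chain (healthy, infected, disabled). It is not the ring model of the thesis, whose state space is larger, and the rate values, including the infection and disabling rates, are illustrative assumptions.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Work with the transposed system Q^T pi = 0; the balance equations are
    # linearly dependent, so the last one is replaced by the normalization.
    A = [[float(Q[j][i]) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    A[-1] = [1.0] * n
    b[-1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = s / A[r][r]
    return pi


# Hypothetical rates: infection 1.0, disabling 1.0, repair delta1 = 2.0
# (infected -> healthy) and delta2 = 4.0 (disabled -> healthy).
delta1, delta2 = 2.0, 4.0
toy_Q = [
    [-1.0, 1.0, 0.0],                # healthy
    [delta1, -(delta1 + 1.0), 1.0],  # infected
    [delta2, 0.0, -delta2],          # disabled
]
toy_pi = steady_state(toy_Q)
```

For these rates the chain spends 12/17 of the time healthy, 4/17 infected and 1/17 disabled. Re-running the solver over a grid of δ1 and δ2 values is one way to explore how the repair rates shift those proportions.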
Comparing robustness against propagating failures
Simulation environment
Thus, the blocking ratio depends only on the effects of the epidemic, which means that connections will only be blocked if there are no viable routes because the necessary intermediate nodes are not available (they are disabled). As for the parameters of the SID model, they are chosen so that the topologies to be compared are exposed to infections of similar intensity, which can be easily calculated using the formulations given as part of the SID model definitions.
Measuring the performance degradation
A more stable value is the global or accumulated blocking ratio, identified as “ABP” in Figure 5.10. This happens when the network is in a state where the number of disabled nodes is the maximum possible. Now TRG can be defined in terms of ABP and MBP: TRG is the area between ABP and MBP, once the epidemic reaches its steady state at some point t=k.
However, since MBP and ABP are cumulative values, a simple approximation is the difference between them.
Topology comparison through TRG
The model definition gives us the expression to estimate that number, which means that an additional simulation can be run with that many nodes already disabled. For example, t65 performs better than the others in mild and extreme infections, but is not the best in the rest of the cases. This behavior can be explained by the fact that TRG takes into account not only the speed of the epidemic, but also the connection path lengths.
As can be seen in Appendix B, t65 has a much shorter average path length than the others, and its largest value is between the other two.
Summary
This chapter summarizes the main contributions of this work, along with possible directions for future research. Topology robustness in GMPLS networks (TRG) measures how quickly a multiple-failure event degrades the performance of the system in terms of its ability to accept connection requests. In the simulation, we carried out a numerical estimation of the number of affected LSPs in a multi-link failure scenario and compared it with the average two-terminal reliability of the rest of the network to illustrate that topological metrics alone cannot capture the extent of the damage caused at the service level.
Conceptual Framework on Resilience
In addition, the thesis presents a comprehensive overview of the terminology on resilience adapted to the needs and applications of the field of networking.
Future work
Betweenness centrality in large complex networks. The European Physical Journal B-Condensed Matter and Complex Systems.
In Proceedings of the 1st Annual Workshop on Simplifying Complex Network for Practitioners, SIMPLEX ’09, pages 4:1–4:6.
Impact of random failures and attacks on Poisson and power-law random networks. ACM Comput.
Analysis of generalized recovery mechanisms based on GMPLS (including protection and recovery).
In Proceedings of the 3rd International Conference on Biology-Inspired Network, Information and Computing Systems Models, BIONETICS ’08, pp.
The cost266x6 topology is the result of the juxtaposition of several nearly identical copies of the "Cost266" reference topology, also available on the SNDlib website.
An example of error propagation in a two-component system
A schematic representation of a 3x3 optical cross-connect
Example of a logical topology defined over a physical network
Example of an MPLS domain
Example of the forwarding table of an MPLS switch
A taxonomy of failures in data networks
The Cost266 topology. Link thickness indicates link importance
An example of a random network
An example of a Small-World network
An example of a Scale-free network
Basic operation of DPP. A working path (solid line) and a
Performance comparison of DPP and SPP on four reference
Effect of link failure on the size of the largest component
Effect of node failure on the size of the largest component
Average algebraic connectivity of the largest component
Percentage of connections affected at given fraction of failed
The Control and Data planes in the GMPLS architecture
The state-transition diagram of the SIS model
The eight-node GMPLS-based ring example
Examples of system states on the eight-node ring topology
Robustness comparison of the three studied topologies under
The cost266x6 topology
The bt400 topology
The t204 topology
The er400d3 topology
The er400d6 topology
The eba400h topology
The t65 topology