Ministerio de Desarrollo Urbano y Transporte

1.5 Contents and Contributions

The rest of the manuscript is organized as follows:

• In Chapter 2 first basic dependability concepts as well as modeling and evaluation methods and tools are introduced. This includes definition of dependability attributes and figures of merit, means to attain dependability, the fault-error-failure concept as well as a brief description of commonly used modeling and evaluation tools. Also, considering recent approaches based on prediction of disturbances, we present our extension of a taxonomy of fault-tolerance policies. Then, modeling Smart Grid as a cyber-physical system (CPS), a unified approach is taken to define its dependability by combining approaches and definitions from computer and power systems communities. A large set of previous blackouts (close to 40) has been analyzed to identify most common root causes, to identify their main properties and to conduct their classification in a form of a developed taxonomy. Finally, appropriate figures of merit for quantification of Smart Grid availability are identified.

• Related work is reviewed in Chapter 3. First a traditional approach to power grid dependability is explained followed by an overview of recent research in Smart Grid dependability modeling and evaluation. Proac- tive management approaches and prediction-based methods are then reviewed. The section concludes with a description of relevant Smart Grid pilot projects.

• In Chapter 4 the impact of predictive (proactive) management on availability is analyzed. First a generic model of a predictive management is defined considering the quality of failure prediction and the cost of disturbance mitigation actions. Then, a metric for optimizing failure prediction for enhanced availability is proposed and a sensitivity analysis of the approach with respect to model parameters is conducted. The main contribu- tion of this part of the work is that the derived model and the metric may be used to identify if a proactive approach may be applied to the specific system and what the minimum prediction quality requirements are. Also, derived model and equations may be used to maximize availability when a proactive approach is used.

• A methodology and methods for online disturbance prediction are described in Chapter 5. Also a methodology, methods and tools for the design of a disturbance predictor are proposed.

19 1.5 Contents and Contributions

• Developed framework for disturbance-related data generation is described in Chapter 6. The framework is based on simulation and fault injection following successful examples of implementing a similar approach for syn- thesizing failure data in computer systems for failure predictor training and evaluation. The framework is developed as fully modular so that system simulator, monitor and disturbance detector may be clearly distinguished. The final output of the framework is a set of time tagged and classified in- stances. An instance represents a set of values of selected variables (voltages, phasors, etc.). Each instance may be classified as disturbance-free or related with a specific disturbance. The generated data set must be further conditioned before being used for training a prediction algorithm. The same data set maybe used for training and evaluating online disturbance detectors.

• The case study is presented in Chapter 7. We evaluate if and to what extent voltage sags may be predicted in an Active Distribution Network (ADN). Us- ing the developed simulation framework we first simulate behavior of an ADN in the presence of short-circuit balanced faults that cause voltage sags in different parts of the network. We record and condition the data so that it may be used with the classification-based machine learning algorithms. We then perform feature selection to identify the most indicative features and evaluate with what quality of prediction sags may be predicted. We also evaluate how the prediction quality varies with monitoring sampling period, lead time and the size of the prediction window. Finally, we optimize the prediction to maximize availability of an ADN and compare availability gain measured with downtime decrease.

• Chapter 8 concludes the work, reviews the contributions and gives direc- tions for the future work.

Chapter 2 Terminology, Concepts and Taxonomies

As dependability is a mature field in computer systems with well-defined and widely-accepted terms and terminology, basic dependability concepts, as used in computer systems, are first introduced. This includes definitions of dependability attributes and figures of merit, description of means to attain dependability, the fault-error-failure concepts as well as a brief overview of commonly used modeling and evaluation tools. Also, considering recent approaches based on prediction of disturbances, we present our extension of a taxonomy of fault-tolerance policies.

Then, we present our unified approach to Smart Grid dependability that we modeled as a cyber-physical system (CPS). In fact, our unified model is built by combining approaches and definitions from computer and power systems communities. A large set of reported blackouts (close to 40) has been analyzed to identify the most common root causes, to identify their main properties and to conduct their classification by proposing a fault taxonomy. The proposed set of definitions and the taxonomy contribute to better communication between the two communities as a mutual understanding platform. Also, it gives a unique overview of the range of faults that may occur in Smart Grids and help to better understand Smart Grid dependability threats. Moreover, the taxonomy facilitates the application of dependability evaluation methods, such as fault injection.

Finally, appropriate figures of merit for quantification of Smart Grid availability are identified.

A part of the work presented in this section has been published[57]. 21

22 2.1 Overview of Basic Dependability Concepts

2.1 Overview of Basic Dependability Concepts

Dependability is the ability of a system to perform a required service under stated conditions for a specified period of time. A dependable system is the one which delivers a required service during its lifetime.

As an umbrella term, dependability integrates the following attributes[74]: • Availability: readiness for correct service,

• Reliability: continuity of correct service,

• Safety: absence of catastrophic consequences on the user(s) and the envi- ronment,

• Integrity: absence of improper system alterations, and

• Maintainability: ability to undergo modifications and repairs.

Security, that is mostly addressed separately, integrates availability, integrity and confidentiality (the absence of unauthorized disclosure of information)[74]. An important concept in dependability is the fault-error-failure one. If a failure occurs, delivered service deviates from the correct one. An error is a deviation of a system’s state from the correct state, and a fault is a root cause of an error [74]. In addition, Salfner, Lenk and Malek distinguish between detected unde- tected and detected errors (an error is undetected as long as a detector does not identify an incorrect state)[72]. A failure of one component may also propagate to another components or cause a fault at higher levels of system hierarchy. A failure of a component may be a fault from the system’s point of view as long as it does not cause deviation of the system’s state.Once activated, the fault will cause an error that, if affecting the service provided to the user, propagates to a failure.

According to[74], means to attain dependability include:

• Fault prevention (also called fault avoidance or fault intolerance): prevention of occurrence or introduction of faults,

• Fault tolerance (FT): avoidance of service failures in the presence of faults. Also, capability to continue the correct operation in the presence of faults, • Fault removal: reduction of the number and severity of faults, and

• Fault forecasting: estimation of the present number, the future incidence, and the likely consequences of faults as well as prediction of failures. In this work, we focus mainly on fault tolerance. Considering novel techniques based on online prediction of failures, we extend taxonomy of fault- tolerance techniques, originally developed by Avizienis et. al[74]. The taxonomy

23 2.1 Overview of Basic Dependability Concepts

is presented in Figure 2.1. Parts that are present in the taxonomy from[74] are written in italics, whereas classes of techniques are presented in rectangles.

Figure 2.1. Taxonomy of fault-tolerance techniques.

Error detection may be performed concurrently or by interrupting the main process to check the system for errors. In concurrent error detection techniques, for example, health-status (as in [75]) or the system load (as in [76]) may be estimated by online monitoring system parameters (also called features or variables). For a computer system and ICT components these parameters may be: memory usage, CPU activity and temperature, disk activity, number of exceptions or fan speed. For a power systems and its components these parameters may be voltages, phasors, and component temperatures. Failure prediction may be performed by: failure tracking, symptom monitoring, detected error reporting, and undetected error auditing [72]. The same or a similar set of error- and fault- handling techniques that are identified in [74] as system recovery techniques, may be used to prevent errors and failures when triggered based on the results of the state estimation or failure prediction. When a failure is anticipated, the system can be prepared for a recovery. In computer systems this may be done by creating a checkpoint on-demand, preparing a spare components, performing a failover or by applying similar techniques. In power systems, for example, a spare

24 2.1 Overview of Basic Dependability Concepts

dispatchable distributed generator may be preventively started in anticipation of a voltage drop.

By combining FT techniques different FT techniques may be defined. In gen- eral, fault-tolerance policies may be reactive, proactive-reactive, or proactive depending if an action is taken after a failure has been detected, before a failure (preventively) or in anticipation of an upcoming failure. Considering only techniques originally presented in[74] two types of reactive fault-tolerance policies may be defined: detection and recovery and masking and recovery1. In the first one, recovery actions are triggered on demand, when a failure of a component is detected. In the second policy, errors (component failures) are systematically masked (for example with redundancy) and recovery is performed on demand when an error is detected. Note that different implementations of these basic policies may also include additional, proactive actions that are performed systematically. For example, periodic checkpoints are created as a part of the roll- back and recovery policy that is one possible implementation of a detection and recovery policy type.

Proactive-reactive policies assume taking actions to prepare the system for more efficient recovery if a failure occurs. A typical example of preparation and recovery policy is checkpointing. Even though a failure will not be fully avoided, recovery time may be shortened and availability improved.

Taking into account all FT techniques from Figure 2.1, we identify additional types of FT policies that we presented in a form of a taxonomy in Figure 2.2. Proactive policies can further be categorized as systematic or predictive. Sys- tematic policies include those where actions are taken periodically, with a period defined statically or adjusted dynamically as in [76], depending on the system status indicators as, for example, system’s health and load distribution (e.g. as in [77]), or after specific events (e.g. a completion of a task or a preemptive error detection). For example, a migration to a spare server (failover) may be performed periodically to rejuvenate the original server. Predictive policies are based on online failure prediction and the use of prevention or preparation techniques to avoid a failure or to prepare a system for recovery. In the first case, failure-related downtime may be eliminated by failure avoidance. In the second 1_{A system with a masking-and-recovery policy reacts on errors (and from this perspective is}

reactive) to prevent them from propagating into failures (from this perspective it is proactive). Failure prevention in this case is based on redundancy and is implicit by the system design. During system life, there is no need to take any actions before a failure to prevent it. This is the main difference with respect to proactive policies as defined here and the reason for classifying masking and recovery as a reactive policy. Another view, by which masking and recovery would be classified as passive-proactive and other proactive policies considered here as active-proactive, is also possible.

In document Boletín Oficial. Gobierno de la Ciudad Autónoma de Buenos Aires (página 114-118)