This chapter introduced Kheiron, a suite of tools for effecting fine-grained in-vivo and in-situ adaptations in software systems written in different languages (.NET, Java and C), running in different execution environments. We identified two major classes of execution environments, managed and unmanaged, and presented a generic model, which is used by Kheiron, for facilitating runtime adaptation in these execution environments.
Facility | Unmanaged Execution Environment: ELF Binaries | Managed Execution Environment: JVM 1.5.x | Managed Execution Environment: CLR 1.1
--- | --- | --- | ---
Program tracing | ptrace, /proc | JVMTI callbacks | ICorProfiler ICorProfilerCallback
Program steering | Trampolines + Dyninst | Bytecode rewriting | MSIL rewriting
Execution-unit metadata available to query | .symtab, .debug sections | Classfile constant-pool + bytecode | Assembly, type & method metadata + MSIL
Metadata augmentation/editing | N/A for compiled C-programs | Custom classfile parsing & editing APIs using BCEL + JVMTI RedefineClasses | IMetaDataImport IMetaDataEmit APIs

Table 3.12: Execution environment facilities
Our generic model of adaptation is based on four key facilities that exist in, or are easily added to, contemporary execution environments: program tracing, program steering, metadata querying and metadata editing. Table 3.12 summarizes the techniques used to effect adaptations in the three execution environments studied: Microsoft’s Common Language Runtime, Sun Microsystems’ Java Virtual Machine and the unmanaged execution environment consisting of the Linux operating system and the raw x86 processor.
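To make the four facilities concrete, the following is a minimal sketch of how they might look as a single interface. It is not Kheiron's API: the class `ToyEnvironment`, its method names and the `read_config` example are all hypothetical, and the "execution environment" is just an in-memory dictionary of Python functions rather than a CLR, JVM or ELF process.

```python
from typing import Callable, Dict, List


class ToyEnvironment:
    """Hypothetical in-memory stand-in for an execution environment."""

    def __init__(self) -> None:
        self._methods: Dict[str, Callable[..., object]] = {}
        self._metadata: Dict[str, Dict[str, object]] = {}
        self._tracers: List[Callable[[str], None]] = []

    def define(self, name: str, body: Callable[..., object]) -> None:
        """Register an execution unit (here, a named function) and its metadata."""
        self._methods[name] = body
        self._metadata[name] = {"name": name, "instrumented": False}

    def on_method_enter(self, callback: Callable[[str], None]) -> None:
        """Program tracing: notify the callback whenever a method is entered."""
        self._tracers.append(callback)

    def redirect(self, name: str, new_body: Callable[..., object]) -> None:
        """Program steering: transfer control to a different body for `name`."""
        self._methods[name] = new_body

    def query_metadata(self, name: str) -> Dict[str, object]:
        """Metadata querying: return a copy of what is known about `name`."""
        return dict(self._metadata[name])

    def edit_metadata(self, name: str, key: str, value: object) -> None:
        """Metadata editing: augment or change the metadata for `name`."""
        self._metadata[name][key] = value

    def call(self, name: str, *args: object) -> object:
        """Invoke a method, firing the tracing callbacks first."""
        for tracer in self._tracers:
            tracer(name)
        return self._methods[name](*args)


def inject_fault(*_args: object) -> object:
    """Replacement body that simulates a failure (illustrative only)."""
    raise OSError("injected fault")


env = ToyEnvironment()
env.define("read_config", lambda: "ok")

entered: List[str] = []
env.on_method_enter(entered.append)            # tracing

before = env.call("read_config")               # original behavior
env.redirect("read_config", inject_fault)      # steering
env.edit_metadata("read_config", "instrumented", True)  # metadata editing
```

After the redirect, a further `env.call("read_config")` raises `OSError`, which mirrors in miniature how steering plus metadata edits can be combined to inject a fault into a running system without restarting it.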
In elaborating on the implementation details of Kheiron, we comprehensively cover and compare the techniques that are used to effect runtime adaptations in the contemporary managed and unmanaged execution environments studied.
Finally, we demonstrate Kheiron’s ability to effect fine-grained adaptations in multiple systems using three case studies: runtime reconfiguration of .NET applications using Kheiron/CLR (§3.7.8), runtime fault-injection in Java-based applications using Kheiron/JVM (§3.8.6) and selective emulation of C programs using Kheiron/C (§3.9.4). The next chapter develops an evaluation methodology and benchmark for assessing the Reliability, Availability and Serviceability (RAS) properties of software systems, which uses the runtime adaptation capabilities of Kheiron (specifically its in-vivo and in-situ fault-injection capabilities) to construct failure scenarios that are used in RAS-evaluations.
RAS Evaluations via Runtime Adaptation and RAS Modeling
This part describes the runtime fault-injection tools and analytical techniques that we combine to construct failure scenarios, which allow us to evaluate and compare the RAS capabilities of software systems.
Evaluating RAS Capabilities
Evaluating and comparing the Reliability, Availability and Serviceability (RAS) capabilities of systems requires reasoning about aspects of the system’s operation that may be difficult to capture or quantify using performance metrics alone.
Whereas performance metrics provide insights into the feasibility of using a system with its RAS-enhancing remediation mechanisms enabled, there are more in-depth analyses that we wish to perform. For example, we want to be able to evaluate the efficacy of any RAS mechanisms the system may have, reason about the expected benefits of yet-to-be-added RAS-enhancing mechanisms, reason about RAS deficiencies, evaluate different combinations of mechanisms, evaluate and compare mechanisms that may employ different remediation strategies (reactive, preventative, proactive), reason about tradeoffs between mechanisms and identify under-performing or sub-optimal mechanisms. Measures concerned with overall system performance do not adequately capture the details that distinguish one remediation mechanism from another, e.g., remediation accuracy/success rates, fault/failure coverage, the impact of remediation failures, the consequences of remediation strategy/style and accounting for partially automated remediations. These deficiencies of performance metrics and benchmarks limit our ability to use them as a primary means of comparing or ranking systems based on their RAS capabilities.
An additional consideration for evaluating the RAS capabilities of systems is that the notions of “good” and “better” depend on the environmental constraints governing the system’s operation. Examples of such constraints include service level agreements (SLAs), policies, and internally or externally visible service level objectives such as uptime guarantees, meeting production targets, reducing production delays, and improving problem-resolution and service-restoration activities. Whereas some aspects of the environmental constraints can be evaluated using performance metrics, such as response time guarantees in SLAs, these metrics are insufficient for evaluating other constraints.
As a result, evaluating and comparing RAS capabilities requires something beyond performance metrics and benchmarks. Specifically, it requires tools and techniques that support more in-depth analyses of the details of RAS mechanisms (the micro-view), while considering the role and effects of the environmental constraints (the macro-view).
The importance of the environmental constraints in evaluating the RAS capabilities of systems cannot be overstated, since these constraints serve four major purposes. First, they help identify the failures and faults that impact these environmental constraints. Second, they enable reasoning about these impacts from the different perspectives of those affected (end-users, system operators/engineers/administrators and management). Third, they provide a source of possible metrics that can be used to quantify the impacts of RAS deficiencies, remediation failures and partially automated remediations. Finally, they establish the (scoring) boundaries within which a system and its collection/composition of mechanisms can be considered to be better than another.
In this thesis we develop a model-based and measurement-based approach to evaluating the RAS capabilities of systems. Our evaluation approach is based on failure scenarios, which can be combined and extended to develop a RAS benchmark for a specific system or class of systems.
A failure scenario consists of three elements:
1. A set of faults that induce the failure of interest
2. A set of fault-injection tools capable of a) injecting one or more of the faults or b) otherwise inducing the failure of interest
3. A set of reusable analytical model templates used for scoring, i.e., to quantify the impact(s) of a failure and/or the efficacy of any available remediation mechanism(s), and to capture the different perspectives of interest (end-user, operator/engineer and management)
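The three elements above can be sketched as a small data structure. This is an illustration under stated assumptions, not the representation used in this thesis: the `FailureScenario` class, its field names, the two scoring functions and the measurement values are all hypothetical, and the injector is a no-op placeholder standing in for a real fault-injection tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Measurements = Dict[str, float]


@dataclass
class FailureScenario:
    """Hypothetical container for the three elements of a failure scenario."""
    failure: str                                    # the failure of interest
    faults: List[str]                               # 1. faults that induce it
    injectors: Dict[str, Callable[[], None]]        # 2. fault name -> injection tool
    scoring_models: Dict[str, Callable[[Measurements], float]]  # 3. perspective -> model

    def score(self, measurements: Measurements) -> Dict[str, float]:
        """Apply every scoring model to the same set of measurements."""
        return {view: model(measurements)
                for view, model in self.scoring_models.items()}


def availability(m: Measurements) -> float:
    """End-user view: fraction of the observation window the service was up."""
    return m["uptime_s"] / (m["uptime_s"] + m["downtime_s"])


def manual_share(m: Measurements) -> float:
    """Operator view: fraction of failures that needed manual repair."""
    return m["manual_repairs"] / m["failures"] if m["failures"] else 0.0


scenario = FailureScenario(
    failure="service outage",
    faults=["resource leak"],
    injectors={"resource leak": lambda: None},      # placeholder, injects nothing
    scoring_models={"end-user": availability, "operator": manual_share},
)

# Illustrative measurements only; these are not results from any experiment.
scores = scenario.score({
    "uptime_s": 3540.0,
    "downtime_s": 60.0,
    "failures": 4.0,
    "manual_repairs": 1.0,
})
```

Keeping the scoring models separate from the injectors means the same set of measurements can be scored from several perspectives, and a model template can be reused across scenarios that induce the same failure by different means.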
4.1 Hypotheses
In Chapter 3 we demonstrated techniques and a suite of tools for effecting fine-grained adaptations that could be used to inject faults and induce failures in a variety of systems written in multiple languages running on different platforms. In the context of RAS evaluations and the construction of failure scenarios, similarly flexible adaptation tools allow failure scenario support to be grafted onto existing/legacy systems, allowing for the study of the failure behavior of systems and an evaluation of their RAS capabilities directly in their deployment environments. Such dynamic tools play a major role in our measurement-based evaluations of RAS capabilities.
The main hypothesis in this chapter is that mathematical tools such as Markov chains, Markov reward networks and Control Theory models can be successfully used to describe failure scenarios designed to quantitatively evaluate/score the RAS capabilities of systems. In validating this hypothesis we demonstrate how these tools can be used to create simple, reusable model templates for scoring and studying RAS properties. Further, we show that these model templates produced for scoring can be simpler than a detailed model of the implementation of the mechanisms, sub-system or system being studied while still providing insights into the failure behavior of systems and the efficacy of their remediation mechanisms.
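As an illustration of what such a simple scoring template could look like, the sketch below solves a three-state continuous-time Markov chain and attaches a reward to each state. The model structure, state names, rates and rewards are assumptions chosen for the example and are not taken from the models developed in this thesis. The system is `up` until a failure occurs (rate `fail_rate`), automated remediation then runs (rate `remediation_rate`) and succeeds with probability `coverage`; otherwise the system enters `failed` and waits for manual repair (rate `manual_repair_rate`). The steady-state probabilities follow from the balance equations, and the expected reward is the probability-weighted sum of the per-state rewards.

```python
from typing import Dict


def steady_state(fail_rate: float, remediation_rate: float,
                 manual_repair_rate: float, coverage: float) -> Dict[str, float]:
    """Steady-state probabilities of the assumed up/remediating/failed chain.

    Balance equations (flow out of a state equals flow into it):
      pi_remediating * remediation_rate = pi_up * fail_rate
      pi_failed * manual_repair_rate = pi_remediating * (1 - coverage) * remediation_rate
    """
    r_weight = fail_rate / remediation_rate                        # pi_remediating / pi_up
    f_weight = fail_rate * (1.0 - coverage) / manual_repair_rate   # pi_failed / pi_up
    up = 1.0 / (1.0 + r_weight + f_weight)                         # probabilities sum to 1
    return {"up": up, "remediating": up * r_weight, "failed": up * f_weight}


def expected_reward(pi: Dict[str, float], rewards: Dict[str, float]) -> float:
    """Steady-state reward rate: sum over states of probability times reward."""
    return sum(pi[state] * rewards[state] for state in pi)


# Reward 1 while serving requests and 0 otherwise, so the score is availability.
REWARDS = {"up": 1.0, "remediating": 0.0, "failed": 0.0}

# Hypothetical per-hour rates, for illustration only.
pi = steady_state(fail_rate=0.01, remediation_rate=60.0,
                  manual_repair_rate=0.5, coverage=0.9)
steady_availability = expected_reward(pi, REWARDS)
```

Changing only the reward vector (for example, assigning a cost to `failed` to reflect operator time) rescoring the same chain from a different perspective, while changing `coverage` compares remediation mechanisms of differing success rates without modeling their implementation.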