Determining the root cause of a fault is a hard problem, and solving it may help reduce the cost of mitigating downtime in computing environments. The approach proposed in this thesis demonstrates novel capabilities for analysing the root cause of faults within a computing system, with the aim of eventually producing demonstrable cost reductions. Uniquely, it shows that monitoring the content of the data is not strictly necessary to determine the root cause of a fault, as monitoring the pattern of changes in the observed data can be sufficient. Using machine learning techniques, this can be automated and provide a helping hand to existing computer operating procedures.
4.2. APPROACH
This approach builds on prior art – chiefly from the Autonomic Computing initiative – but also leverages Machine Learning and Computational Intelligence techniques. The approach operates in two stages:
First, an application periodically samples feature behaviour data. This information is transduced into vectors which form the basis for future analysis and forecasting. Second, the data is labelled – a process that occurs through performance tests. If a system passes a number of high-level objective goals and policies, the data at large can be assumed to be in a ‘good’ state. If any of these tests fail, then the opposite is assumed and an analysis compares the likelihood of the expected and observed behaviours, using stochastic primitives trained on the known ‘good’ feature data.
Finally, once trained, if any of the specified performance tests fail, then the primitives forecast feature behaviour with an accuracy that depends both on their respective learning algorithm(s) and on how much training data is present. Any mismatches are detected and returned in a list ordered by descending likelihood, indicating the potential root cause of the fault.
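The ranking step can be illustrated with a minimal sketch. It assumes each primitive yields a forecast probability that a feature changes, and that the observation is 1.0 if the feature changed and 0.0 otherwise. The function name, the `tolerance` parameter, and the use of absolute deviation as a stand-in for the likelihood comparison are all assumptions made for illustration, not the thesis implementation.

```python
def rank_fault_hypotheses(forecast, observed, tolerance=0.0):
    """Return (feature, deviation) pairs for mismatching features,
    ordered from most to least suspicious.

    forecast / observed: dicts mapping a feature name to a value in [0, 1].
    Absolute deviation is an illustrative stand-in for the likelihood
    comparison performed by the trained stochastic primitives.
    """
    mismatches = []
    for feature, expected in forecast.items():
        deviation = abs(observed[feature] - expected)
        if deviation > tolerance:
            mismatches.append((feature, deviation))
    # Largest mismatch first: the most likely root-cause lead heads the list.
    return sorted(mismatches, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature names, used only to exercise the function.
forecast = {"log_file": 0.9, "config_file": 0.1, "cache_dir": 0.5}
observed = {"log_file": 1.0, "config_file": 1.0, "cache_dir": 0.5}
leads = rank_fault_hypotheses(forecast, observed, tolerance=0.05)
```

Here `config_file` would head the list, since a change was observed where almost none was forecast.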
This thesis explores two possible implementations of this approach – one using a greedy data ingest mechanism (via ANNs and HMMs, Figure 4.1), and one using a lazy data ingest mechanism (via RBMs, Figure 4.2) – both of which are detailed in Section 4.3 along with examples of their operation. In all other ways, aside from the learning modules, the approaches are identical.
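The difference between the two ingest mechanisms can be sketched as follows. The class names and the `train` callable are hypothetical. The sketch captures only the timing of training (on every sample versus only on fault detection) and says nothing about how ANNs, HMMs, or RBMs are actually trained.

```python
class GreedyIngest:
    """Updates the primitive as soon as each sample arrives."""

    def __init__(self, train):
        self.train = train  # hypothetical callable: train(list_of_vectors)

    def on_sample(self, vector):
        self.train([vector])

    def on_fault(self):
        pass  # nothing to do: the primitive is already up to date


class LazyIngest:
    """Buffers samples and trains only once a fault is detected."""

    def __init__(self, train):
        self.train = train
        self.buffer = []

    def on_sample(self, vector):
        self.buffer.append(vector)

    def on_fault(self):
        self.train(list(self.buffer))
```

Under these assumptions the lazy variant defers all training cost to the moment of diagnosis, whereas the greedy variant spreads it across polling intervals.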
To accurately identify faults, the FDFs require a user to provide:
1. A polling interval (in milliseconds),
2. A ‘learning module’, and
3. A set of performance tests.
A polling interval specifies how often the system’s feature data should be gathered, evaluated, and stored. The learning module consists of a stochastic primitive and an associated learning algorithm. Using the AForge.NET and Accord.NET frameworks, it is possible to select from a number of previously built stochastic primitives and associated learning algorithms. However, a user may also specify their own primitives and learning algorithms. Performance tests consist of user-designed code that evaluates the state of the system. These are simple tests written into the application to verify the health of the system and are akin to SLOs. All of the specified tests must pass during each polling interval.
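As a rough illustration of these three requirements, the sketch below groups them into a single configuration object. It is written in Python for brevity, although the frameworks named above are .NET libraries. `FdfConfig`, `system_state`, and the field names are hypothetical and do not come from the thesis source code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class FdfConfig:
    polling_interval_ms: int   # how often feature data is gathered
    learning_module: Any       # a primitive plus its learning algorithm
    performance_tests: List[Callable[[], bool]] = field(default_factory=list)


def system_state(config: FdfConfig) -> bool:
    """True only if every performance test passes in this polling interval."""
    return all(test() for test in config.performance_tests)


# Two stand-in tests; real tests would probe the monitored service.
config = FdfConfig(
    polling_interval_ms=5000,
    learning_module=None,
    performance_tests=[lambda: True, lambda: False],
)
state = system_state(config)  # False, because one test fails
```

Note that `all()` over an empty list is `True`, so a configuration without tests would always report a ‘good’ state in this sketch.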
Figure 4.1: Fault Detection Framework Logic & Architecture Diagram using Greedy Ingest. The
FDF leveraging ANNs and HMMs operates by updating its primitives as soon as feature data is recovered from the system.
control over the window of observed data desired to be retained by the FDFs. For example, a higher frequency collection rate of once every 5 seconds using the default value of 30 maximum samples would only allow for a 150-second window of observation. This window is unlikely to capture changes in feature behaviour that occur over longer time-spans. Increasing the maximum number of samples thus increases the window size, and by adjusting the maximum number of samples and the polling interval, it is therefore possible to control the fidelity of the information as well.
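The window arithmetic above can be checked with a short sketch. The function name is hypothetical, and the bounded `deque` is only one plausible way to retain a fixed number of samples; the thesis does not state which data structure is used.

```python
from collections import deque


def observation_window_seconds(polling_interval_ms, max_samples):
    """Length of the retained observation window, in seconds."""
    return polling_interval_ms * max_samples / 1000


# Polling once every 5 seconds with the default of 30 samples.
window = observation_window_seconds(5000, 30)

# A bounded buffer silently drops the oldest sample once it is full.
samples = deque(maxlen=30)
for tick in range(40):
    samples.append(tick)
```

Doubling `max_samples` doubles the window at the same fidelity, while doubling the polling interval doubles the window at half the sampling rate.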
The FDFs are configured by default to run in a Windows VM to test the performance and state of Internet Information Services (IIS), Microsoft’s proprietary web server. No changes are
Figure 4.2: Fault Detection Framework Logic & Architecture Diagram using Lazy Ingest. The
FDF leveraging RBMs operates identically to the FDF that uses ANNs and HMMs except with a lazy ingest mechanism for feature behaviour data. Primitives using a lazy ingest are only trained upon fault detection.
required to the source code to revalidate the experiments described in this thesis. Although it is believed that these results can be generalised to other operating systems, no attempt is made to demonstrate this.
4.2.1 Running Example
Step 1 (Optional) A user provides the FDFs with the three requirements described in 4.1, and then compiles the code.
Step 2 The application is run and then waits in the background whilst the computing system operates normally for a desired period of time – in the case of these experiments, 5 – 30 minutes. After each polling interval the framework prints to screen memory usage and the collective state of the performance tests as “System State” (Figure 4.3).
Figure 4.3: Running Example – Successful Data Collection via FDFs. Cropped image showing
successful data collection when running an RBM-based FDF.
Step 3 Upon injection or detection of a fault, the FDF should break the loop it is currently operating in and provide a list of fault hypotheses in descending order of likelihood (Figure 4.4). If the monitoring loop is not broken, then a False Negative must be accounted for by hand.
Figure 4.4: Running Example – Fault Identification via FDFs. Cropped image showing a sample of
an FDF result screen. The full sized list has been truncated to save space but can contain 5 to 300 leads.
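The three steps of the running example can be summarised as a monitoring loop. This is a minimal sketch under stated assumptions: `monitor`, `sample`, `diagnose`, and `max_polls` are hypothetical names, and the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time


def monitor(sample, tests, diagnose, polling_interval_ms,
            sleep=time.sleep, max_polls=None):
    """Poll until a performance test fails, then return fault hypotheses.

    sample():             returns the current feature vector
    tests:                callables returning True when the system is healthy
    diagnose(good, bad):  returns hypotheses, most likely first
    """
    good_history = []
    polls = 0
    while max_polls is None or polls < max_polls:
        vector = sample()
        polls += 1
        if all(test() for test in tests):
            good_history.append(vector)  # labelled 'good' by the tests
            sleep(polling_interval_ms / 1000)
        else:
            # Break out of the loop and hand over to the diagnosis step.
            return diagnose(good_history, vector)
    return None  # no fault observed within max_polls
```

If a fault is injected but no test fails, the loop never reaches `diagnose`, which corresponds to the False Negative case described in Step 3.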