the failure. It should be noted that failure is not the same as shutting down and starting back up again: the storage devices themselves will keep the data across the shutdown and the system will start up normally afterwards. This can even be done in many cases of infrastructure outages (like network or power failures).
In the case of the Scheduling component, a crash and restart means repopulating the internal state from persistent job storage and resuming operations. This is trivial for an aborted planning run or a failure during job validation, and it can also be handled very simply for the top-level Scheduling component: we used the Error Kernel Pattern to keep this piece of software simple enough that a restart cycle can be assumed to take a sufficiently short time to count as acceptable downtime, unless specific requirements force us to use replication here as well.
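The restart behavior described above can be illustrated with a minimal sketch. It assumes an append-only job log as the persistent job storage; the `Scheduler` class, the record layout, and the file name are hypothetical and not taken from the example system. The sketch also does not handle a record torn by a crash in the middle of a write.

```python
import json
import os
import tempfile


class Scheduler:
    """Hypothetical scheduler whose in-memory state is only a cache of the log."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.pending = {}  # job_id -> job description
        self._recover()

    def _recover(self):
        # On every (re)start, rebuild the in-memory state by replaying the log.
        if not os.path.exists(self.store_path):
            return
        with open(self.store_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["event"] == "accepted":
                    self.pending[record["job_id"]] = record["job"]
                elif record["event"] == "finished":
                    self.pending.pop(record["job_id"], None)

    def _append(self, record):
        with open(self.store_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the record to the device

    def accept(self, job_id, job):
        # Persist first, then update memory, so a crash never loses an accepted job.
        self._append({"event": "accepted", "job_id": job_id, "job": job})
        self.pending[job_id] = job

    def finish(self, job_id):
        self._append({"event": "finished", "job_id": job_id})
        self.pending.pop(job_id, None)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "jobs.log")
        first = Scheduler(path)
        first.accept("a", {"cmd": "sort"})
        first.accept("b", {"cmd": "index"})
        first.finish("a")
        del first  # stand-in for a crash: all in-memory state is gone
        restarted = Scheduler(path)  # the restart repopulates state from storage
        return restarted.pending
```

Because the log is written before the in-memory state changes, the restarted instance in `demo()` sees exactly the jobs that were accepted but not finished.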
The Execution component works similarly in that the worker nodes can crash and be restarted as discussed above, where the supervisor makes sure that the affected batch job is started again on another available node (or on the newly provisioned one). For the Resource Pool Interface we can tolerate a short downtime while it is restarted, as its services are only rarely needed, and when they are needed the reaction times will be on the order of many seconds or even minutes in any case.
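As a rough illustration of the supervisor behavior described above, the following sketch reruns a job on the next available node when a worker crashes and provisions a new node when none is left. `Worker`, `Supervisor`, `NodeCrashed`, and the `provision` callback are hypothetical names chosen for this example; a real system would detect crashes through its own supervision mechanism rather than a Python exception.

```python
class NodeCrashed(Exception):
    """Raised by the hypothetical worker to signal that its node has failed."""


class Worker:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, job):
        if not self.healthy:
            raise NodeCrashed(self.name)
        return f"{job} done on {self.name}"


class Supervisor:
    def __init__(self, workers, provision):
        self.workers = list(workers)
        self.provision = provision  # callable returning a freshly provisioned Worker

    def run(self, job, max_attempts=3):
        for _ in range(max_attempts):
            if not self.workers:
                # No node left: ask the resource pool for a new one.
                self.workers.append(self.provision())
            worker = self.workers[0]
            try:
                return worker.run(job)
            except NodeCrashed:
                # Drop the failed node and start the job again elsewhere.
                self.workers.remove(worker)
        raise RuntimeError(f"job {job!r} failed on {max_attempts} nodes")


supervisor = Supervisor(
    [Worker("n1", healthy=False), Worker("n2")],
    provision=lambda: Worker("n3"),
)
result = supervisor.run("job-42")
```

The bound on attempts is a design choice of this sketch: it keeps a persistently failing job from consuming nodes forever and escalates the failure instead.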
12.3.3 The Pattern Revisited
We have looked at each of the components in our system’s supervision hierarchy and considered the consequences of a failure and subsequent restart. In some cases we encountered implementation constraints like having to update the request routing infrastructure so that the failed node is no longer considered and the replacement is taken into account once it is ready. In other cases we approached the formulation of service level agreements by saying that a short downtime may be acceptable: in a real system we would quantify this both in the failure frequency (e.g. by way of the MTBF [7]) and in the extent of the outage (measured by the MTTR [8]).
This pattern can also be turned around so that components are “crashed” intentionally on a regular basis instead of waiting for failures to occur; this could be termed the Pacemaker Pattern. Deliberately inducing failures has been standard operating procedure for a long time in high-availability scenarios in order to verify that failover mechanisms are effective and perform according to their specification. The concept has been popularized in recent years by the “Chaos Monkey” employed by Netflix [9] to establish and maintain the resilience of their infrastructure. The chaotic nature of this approach manifests in single nodes being killed at random without prior selection or human consideration. The idea is that in this way failure modes are exercised that could potentially be missed in human enumeration of all possible cases.
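The core of the random-kill idea can be sketched in a few lines. This is not how Netflix's Chaos Monkey is implemented; `chaos_round` and the `Cluster` stand-in (including its `heal` method, which plays the role of the failover mechanism under test) are assumptions made for illustration only.

```python
import random


def chaos_round(nodes, kill, rng=random):
    """Kill one running node picked uniformly at random and return its name."""
    if not nodes:
        return None
    victim = rng.choice(sorted(nodes))  # no human selection of the victim
    kill(victim)
    return victim


class Cluster:
    """Hypothetical cluster; heal() stands in for the failover mechanism."""

    def __init__(self, names):
        self.running = set(names)
        self.restarts = 0

    def kill(self, name):
        self.running.discard(name)

    def heal(self, expected):
        # Restart every node that should be running but is not.
        for name in expected - self.running:
            self.running.add(name)
            self.restarts += 1


expected = {"a", "b", "c"}
cluster = Cluster(expected)
rng = random.Random(0)  # seeded only to make the sketch reproducible
for _ in range(10):
    chaos_round(cluster.running, cluster.kill, rng)
    cluster.heal(expected)
```

After each round the check of interest is whether the system has returned to its expected set of running nodes, which is what such an exercise is meant to verify.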
[7] Mean Time Between Failures, see https://en.wikipedia.org/wiki/Mean_time_between_failures
[8] Mean Time To Repair, see https://en.wikipedia.org/wiki/Mean_time_to_repair
[9] At the time of writing, Netflix is the largest streaming video provider in the U.S. The Chaos Monkey is part of the SimianArmy project, which is available as open-source software at https://github.com/Netflix/SimianArmy; the approach is described in detail at http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html.
On a higher level, whole data centers or geographic regions are taken offline in a more prepared manner to verify global resource reallocation—this is done on the live production system since there is no simulation environment that could practically emulate the load and client dynamics in such a large-scale application.
Another way to look at this is to consider the definition of availability: it is the fraction of time during which the system is not failed, i.e. (MTBF - MTTR) / MTBF. This can be increased either by making MTBF larger (which corresponds to less frequent but possibly extensive failures) or by making MTTR smaller. In the latter case the maximum consecutive downtime period is smaller and the system operates more smoothly, which is the goal of the Let-It-Crash pattern.
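A small worked example may make the trade-off concrete. The function below simply evaluates the formula from the text; the two sets of numbers are made up for illustration and both arguments are assumed to use the same time unit (seconds here).

```python
def availability(mtbf, mttr):
    """Fraction of time the system is not failed, per the formula in the text."""
    if mtbf <= 0 or mttr < 0 or mttr > mtbf:
        raise ValueError("need MTBF > 0 and 0 <= MTTR <= MTBF")
    return (mtbf - mttr) / mtbf


# Illustrative numbers (assumptions), expressed in seconds.
rare_long = availability(mtbf=1000 * 3600, mttr=3600)   # a 1 h outage every 1000 h
frequent_short = availability(mtbf=10 * 3600, mttr=36)  # a 36 s outage every 10 h
```

Both hypothetical systems are available 99.9% of the time, but the second one never stays down for more than 36 seconds at a stretch, which is the kind of behavior Let-It-Crash aims for.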
12.3.4 Implementation Considerations
While this pattern is deeply ingrained in reactive application design already, it is nevertheless documented here to take note of its important consequences on the design of components and their interaction:
• Each component must tolerate a crash and restart at any point in time, just like a power outage can happen without warning. This means that all persistent state must be managed such that the service can resume processing requests with all necessary information and ideally without having to worry about state corruption.
• Each component must be strongly encapsulated so that failures are fully contained and cannot spread. The practical realization depends on the failure model for the hierarchy level under consideration: the options range from shared-memory message passing through separate OS processes to separate hardware in possibly different geographic regions.
• All interactions between components must tolerate peers crashing. This means ubiquitous use of timeouts and circuit breakers (described later in this chapter).
• All resources a component uses must be automatically reclaimable by performing a restart. Within an actor system this means that resources are freed by each actor upon termination or that they are leased from their parent. For an OS process it means that the kernel will release all open file handles, network sockets, etc. when the process exits. For a virtual machine it means that the infrastructure resource manager will release all allocated memory (including persistent filesystems) and CPU resources, to be reused by a different virtual machine image.
• All requests sent to a component must be as self-describing as is practical so that processing can resume with as little recovery cost as possible after a restart.
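To illustrate the point about tolerating crashed peers, here is a minimal sketch of a request guarded by a timeout. It assumes a peer can be modeled as a plain Python callable run on a thread pool; `ask`, `healthy_peer`, `slow_peer`, and the fallback value are hypothetical names for this example, and circuit breakers are not covered here.

```python
import concurrent.futures
import time

# Shared pool so that a hung peer blocks one pool thread, not the caller.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def ask(peer, request, timeout_s, fallback):
    """Send a request to a peer, but never wait longer than timeout_s seconds."""
    future = _POOL.submit(peer, request)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The peer did not answer in time; treat it as unavailable.
        return fallback
    except Exception:
        # The peer failed while processing; the failure stays contained here.
        return fallback


def healthy_peer(request):
    return request.upper()


def slow_peer(request):
    time.sleep(0.3)  # stand-in for a peer that is hung or restarting
    return "late"


def crashing_peer(request):
    raise RuntimeError("peer crashed")
```

The caller decides what the fallback means (a cached value, an error reply, a retry later); the important property is that a crashed or hung peer cannot stall the caller indefinitely.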
12.3.5 Corollary: the Heartbeat Pattern
Let-it-crash describes how failures are dealt with. The other side of this coin is that failures must first be detected before they can be acted upon. In particularly catastrophic