cases like hardware failures, the supervising component can only detect that something is wrong by observing the absence of expected behavior. This obviously requires that some behavior can be expected, which means that the supervisor and subordinate must communicate with each other on a regular basis. In cases where there would not otherwise be a reason for such interchange, the supervisor needs to send dummy requests whose sole purpose is to see whether the subordinate is still working properly. Due to their regular and vital nature these are called heartbeats. The resulting pattern’s diagram is shown in figure 12.8.
One caveat of using dedicated heartbeat messages is that the subordinate might have failed in a way that allows heartbeats to be processed while nothing else can be answered properly. To guard against such unforeseen failures, health monitoring should, where appropriate, be implemented by monitoring the service quality (failure rate, response latency, etc.) during normal operation. Sending such statistics to the supervisor on a regular basis can serve as a heartbeat signal at the same time, provided it is done by the subordinate itself (as opposed to the infrastructure, e.g. by monitoring the state of circuit breakers as discussed in section 12.4).
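The supervisor's side of this bookkeeping can be sketched as follows. This is a minimal illustration, not code from any particular library: the names `HealthReport` and `HeartbeatMonitor`, the three status strings, and the threshold parameters are all hypothetical, and time is passed in explicitly so that the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class HealthReport:
    """Statistics sent by the subordinate itself; each report doubles as a heartbeat."""
    failure_rate: float  # fraction of failed requests in the last interval
    latency_ms: float    # mean response latency in the last interval


class HeartbeatMonitor:
    """Supervisor-side view of one subordinate (hypothetical helper)."""

    def __init__(self, timeout_s, max_failure_rate, max_latency_ms, now):
        self.timeout_s = timeout_s
        self.max_failure_rate = max_failure_rate
        self.max_latency_ms = max_latency_ms
        self.last_seen = now      # the start of the subordinate counts as a sign of life
        self.last_report = None

    def record(self, report, now):
        """Remember the latest report and the time at which it arrived."""
        self.last_seen = now
        self.last_report = report

    def status(self, now):
        """Classify the subordinate as 'missing', 'degraded' or 'healthy'."""
        # Absence of expected behavior: no report within the timeout.
        if now - self.last_seen > self.timeout_s:
            return "missing"
        # Reports still arrive, but the service quality is unacceptable.
        report = self.last_report
        if report is not None and (
            report.failure_rate > self.max_failure_rate
            or report.latency_ms > self.max_latency_ms
        ):
            return "degraded"
        return "healthy"
```

A supervisor would call `status` on its own schedule and treat both `"missing"` and `"degraded"` as reasons to act. The second case is the one that a dedicated, content-free heartbeat message would not reveal.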
12.3.6 Corollary: The Proactive Failure Signal Pattern
Applying the Heartbeat Pattern to all failure modes already results in a high level of robustness, but there are classes of failures where patiently counting out the suspected component takes longer than necessary: the component can diagnose some failures itself. A prominent example is that all exceptions thrown from an actor implementation are treated as failures; exceptions that are handled inside the actor usually pertain to error conditions resulting from the use of libraries that use exceptions for this purpose. All uncaught exceptions can be sent by the infrastructure (i.e. the actor library) to the supervisor in a message signaling the failure so that the supervisor can act upon it immediately. Wherever this is possible it should be viewed as an optimization of the supervisor’s response time. The messaging pattern between supervisor and subordinate is depicted in figure 12.9 using the conventions established in appendix A.
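The distinction between handled error conditions and uncaught exceptions can be illustrated with a small sketch. It does not use a real actor library: `Supervisor`, `Subordinate`, `Failed` and `divide_hundred` are hypothetical names, and the `Subordinate` class merely stands in for the infrastructure that runs the component's code.

```python
from dataclasses import dataclass


@dataclass
class Failed:
    """Failure signal sent to the supervisor (hypothetical message type)."""
    name: str
    cause: Exception


class Supervisor:
    def __init__(self):
        self.failures = []

    def tell(self, message):
        # A real supervisor would restart or stop the subordinate here.
        if isinstance(message, Failed):
            self.failures.append(message)


class Subordinate:
    """Stands in for the infrastructure that runs a component's behavior."""

    def __init__(self, name, behavior, supervisor):
        self.name = name
        self.behavior = behavior
        self.supervisor = supervisor

    def receive(self, message):
        try:
            return self.behavior(message)
        except Exception as cause:
            # Uncaught exception: signal the failure immediately instead of
            # leaving the supervisor to notice a missing heartbeat later.
            self.supervisor.tell(Failed(self.name, cause))
            return None


def divide_hundred(message):
    """Example behavior: divides 100 by the number contained in the message."""
    try:
        divisor = int(message)
    except ValueError:
        # Error condition raised by a library call and handled inside the
        # component; this is not a failure.
        return "invalid input"
    return 100 // divisor  # a zero divisor raises an exception that escapes
```

With this setup, `receive("abc")` yields `"invalid input"` and the supervisor hears nothing, whereas `receive("0")` produces a `Failed` message carrying the `ZeroDivisionError`.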
Depending on the failure model it can also be adequate to rely entirely on such measures. This is equivalent to saying that, for example, an actor is assumed not to have failed until it has sent a failure signal. Monitoring the health of every single actor in a system is typically prohibitively expensive, and relying on these failure signals achieves sufficient robustness at the lower levels of the component hierarchy.
Figure 12.8 The supervisor starts the subordinate, then it performs periodic health checks by exchanging messages with it until no satisfactory answer is returned.
Figure 12.9 The supervisor starts the subordinate and reacts to its failure signals as they occur.
It is not uncommon to combine this pattern and the Heartbeat Pattern to cover all bases. Where the infrastructure supports lifecycle monitoring (for example the DeathWatch feature of Akka actors, see footnote 10), there is an additional way in which the supervisor can learn of the subordinate’s troubles: if the subordinate has stopped itself while the supervisor still expected it to do something, or if the component is not expected to ever stop while the application is running, then the resulting termination notification can be taken as a failure signal as well. The full communication diagram for such a relationship is shown in figure 12.10.
It is important to note that these patterns are not specific to Akka or the Actor Model; we use these implementations only to give concrete examples. An application based on RxJava would, for example, use the Hystrix library for health monitoring, allowing components to be restarted as needed. Another example is that a deployment of components as microservices on Amazon EC2 could use the AWS API to learn of some nodes’ termination and react in the same fashion as described for the DeathWatch feature here.
12.4 The Circuit Breaker Pattern
“Protect services by breaking the connection to their users during prolonged failure conditions.”
In the previous sections we discussed how to segregate a system into a hierarchy of components and sub-components for the purpose of isolating responsibilities and encapsulating failure domains. This pattern describes how to safely connect different parts of the system together so that failures do not spread uncontrollably across them. Its origin lies in electrical engineering: in order to protect electrical circuits from each other and introduce decoupled failure domains, we have established the technique of breaking the connection when the transmitted power exceeds a given threshold.
Translated to a reactive application, this means that the flow of requests from one component to the next may be broken up deliberately when the recipient is overloaded or otherwise failing. This serves two purposes: first, the recipient gets some breathing room to recover from possible load-induced failures, and second, the sender decides that requests will fail instead of wasting time waiting for negative replies.
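The sender-side behavior described above can be sketched in a few lines. This is an illustration under assumptions that the text so far does not prescribe: the breaker opens after a fixed number of consecutive failures (`max_failures`), stays open for a fixed `reset_timeout_s`, and then lets a single trial request through. All names are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```python
class CircuitOpenError(Exception):
    """Raised when a request is rejected without contacting the recipient."""


class CircuitBreaker:
    def __init__(self, max_failures, reset_timeout_s):
        self.max_failures = max_failures        # assumed threshold
        self.reset_timeout_s = reset_timeout_s  # assumed recovery period
        self.failures = 0
        self.opened_at = None                   # None means the circuit is closed

    def call(self, request, now):
        """Run request() unless the circuit is open; 'now' is the time in seconds."""
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout_s:
                # Fail fast: the recipient gets breathing room and the
                # sender does not wait for a negative reply.
                raise CircuitOpenError("recipient is assumed to be failing")
            # The timeout has elapsed: let this request through as a trial.
        try:
            result = request()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = now
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the wrapped `request` is not invoked at all, which is what protects an overloaded recipient from further load.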
10 See http://doc.akka.io/docs/akka/2.4.1/general/supervision.html#What_Lifecycle_Monitoring_Means and http://doc.akka.io/docs/akka/2.4.1/scala/actors.html#Lifecycle_Monitoring_aka_DeathWatch
Figure 12.10 The supervisor first starts the subordinate, then it performs periodic health checks by exchanging messages with it (step 1) until either no answer is returned or a failure signal is received (step 2).