2. METHODOLOGY
2.1. REQUIREMENTS ANALYSIS
5.3.1 The cost of perfect software
The software that controlled the space shuttle was some of the most perfect software ever written. It’s a good example of how expensive such software truly is, and calls out the absurdity of the expectations for enterprise software. It’s also a good example of how much effort is required to build redundant software components.
The initial cost estimate for the shuttle software system was $20m. The final bill was $200m. This is the first clue that defect-free software is an order of magnitude more expensive than even software engineers estimate. The full requirements specification has 40,000 pages for a mere 420,000 lines of code. By comparison, Google’s Chrome web browser is over 5 million lines of code. How perfect is the shuttle software? On average, there was one bug per release. Even this software wasn’t completely perfect!
The software development process was incredibly strict. It was a traditional process with highly detailed specifications. Strict testing, verification and code reviews were enforced. Bureaucratic signatures were needed for release. Many stakeholders in the enterprise software development process truly believe that this level of delivery is what they’re going to get.
It’s the business of business to make return-on-investment decisions. You spend money to make money, but you must have a business case. This breaks down if you don’t understand your cost model. It’s the job of the software architect to make these costs clear, and to provide alternatives, where the cost of software development is matched to the expected returns of the project.
5.4 Anarchy works
The most important question in software development is: "What is the acceptable error rate?" This is the first question to ask at the start of a project. It drives all the other questions and decisions. It also makes clear to all stakeholders that the process of software development is about controlling, not conquering, failure.
The primary consequence is that large-scale releases can never meet the acceptable error rate. Reliability is so compromised by the uncertainty of a large release that large releases must be rejected as an engineering approach. This is mathematics, and no amount of QA can overcome it.
Small releases are less risky. The smaller the better. Small releases have small uncertainties, and we can keep under the failure threshold. Small releases also mean frequent releases. Enterprise software must constantly change to meet market forces. These small releases must go all the way to production to fully reduce risk. Collecting them into large releases takes you back to square one. This is how the probabilities work.
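The effect of release size can be illustrated with a minimal sketch, assuming that every change in a release works independently and with the same probability, and that the release works only if all of its changes work. The independence assumption, the function names, and the 99.9% per-change figure are illustrative and are not taken from the text; the 99% threshold is the one used later in this section.

```python
def release_reliability(per_change_reliability: float, changes: int) -> float:
    """Probability that a release works if every one of its changes must work.

    Assumes changes fail independently (an illustrative simplification).
    """
    return per_change_reliability ** changes


def max_changes_within_threshold(per_change_reliability: float, threshold: float) -> int:
    """Largest number of changes whose combined reliability still meets the threshold."""
    if not 0.0 < per_change_reliability < 1.0:
        raise ValueError("per_change_reliability must be strictly between 0 and 1")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    changes = 0
    while release_reliability(per_change_reliability, changes + 1) >= threshold:
        changes += 1
    return changes


if __name__ == "__main__":
    # Hypothetical figure: each change is 99.9% likely to work.
    for size in (1, 10, 100):
        print(size, round(release_reliability(0.999, size), 4))
    print(max_changes_within_threshold(0.999, 0.99))
```

Under these assumed numbers, a release of 100 changes falls to roughly 90% reliability, and only about ten changes fit under a 99% threshold. Batching small releases into a large one multiplies the same factors together again, which is why it takes you back to square one.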
A system under constant failure isn’t fragile. Every component expects others to fail, and is built to be more tolerant of failure. The constant failure of components exercises redundant systems and backups, and ensures that you know they work. You have an accurate measure of the failure rate of the system. It’s a known quantity that can be
How does our simple risk model work under these conditions? You may only be changing one component at a time, but aren’t you still subject to large amounts of risk? You know that your software development process isn’t going to deliver updated components that are as stable as those that have been baked into production for a while.
Let’s say updated components are 80% reliable on first deployment. You’re not going to meet a reliability threshold of 99% in any of the systems we’ve looked at. Redeploying a single component still isn’t a small enough deployment. This is an engineering and process problem that we’ll address in the remainder of this chapter: how to make changes to a production software system whilst maintaining a desired risk tolerance.
5.5 Microservices and redundancy
An individual component of a software system should never be run as a single instance. A single instance is vulnerable to failure. The component itself could crash. The machine that it’s running on could fail. The network connection to that machine could be accidentally misconfigured. No component should be a single point of failure.
To avoid being a single point of failure, you can run multiple instances of the component. Now you can handle load and you’re more protected against some kinds of failure. You aren’t protected against software defects in the component itself, which affect all instances. Even then, such defects can usually be mitigated by automatic restarts. Once a component has been running in production for a while, you have enough data to get a good measure of its reliability.
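The benefit of multiple instances, and the idea of measuring reliability from production data, can be sketched as follows. This is a rough illustration that assumes instances fail independently, which is exactly what does not hold for a software defect shared by all instances. The function names and the sample figures are hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one instance is available.

    Assumes instances fail independently; a defect in the component itself
    affects every instance at once and is not covered by this model.
    """
    if instances < 1:
        raise ValueError("at least one instance is required")
    return 1.0 - (1.0 - instance_availability) ** instances


def observed_reliability(successes: int, total: int) -> float:
    """Fraction of requests handled correctly, measured from production counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


if __name__ == "__main__":
    # Hypothetical figure: a single instance is available 95% of the time.
    for count in (1, 2, 3):
        print(count, round(combined_availability(0.95, count), 6))
    # Hypothetical production counts: 9,990 good responses out of 10,000.
    print(observed_reliability(9_990, 10_000))
```

The longer a component runs in production, the larger the request counts become and the more trustworthy the measured figure is.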
How do you deploy a new version of a component? In the traditional model, you try, as quickly as possible, to replace all the old instances with a full set of new ones. The blue-green deployment strategy, as it’s known, is an example of this. You have a running version of the system; call this the blue version. You spin up a new version of the system; call this the green version. Then you choose a specific moment to redirect all traffic from blue to green. Now, if something goes wrong, you can quickly switch back to blue, and assess the damage. At least you’re still up.
One way to make this less risky is to redirect only a small fraction of traffic to green at first. If you’re satisfied that everything still works, redirect greater and greater volumes of traffic until green has completely taken over.
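The gradual shift from blue to green can be sketched as a simple loop. This is only an outline of the control flow, not a real deployment tool: the step sizes, the function name, and the `is_healthy` callback (standing in for whatever monitoring you use) are all hypothetical.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[float],
    is_healthy: Callable[[float], bool],
) -> float:
    """Move traffic to green step by step; return green's final traffic share.

    `steps` lists the increasing fractions of traffic to send to green.
    `is_healthy` is a hypothetical check that is called after each step with
    the current green share and reports whether everything still works.
    Returns 0.0 if a check fails, meaning all traffic goes back to blue.
    """
    green_share = 0.0
    for share in steps:
        green_share = share  # redirect this fraction of traffic to green
        if not is_healthy(green_share):
            return 0.0  # roll back: blue takes all traffic again
    return green_share


if __name__ == "__main__":
    steps = [0.01, 0.10, 0.50, 1.00]
    # Every check passes, so green ends up with all traffic.
    print(progressive_rollout(steps, lambda share: True))
    # Problems appear once green carries half the traffic, so we roll back.
    print(progressive_rollout(steps, lambda share: share < 0.5))
```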
The microservice architecture makes it easy to adopt this strategy, and reduce risk even further. Instead of spinning up a full quota of new instances of the green version of the service, spin up one instance. This one new instance gets a small portion of all production traffic, and the existing blues look after the main bulk of traffic. You can observe the behavior of the single green instance. If it’s badly behaved, you can decommission it. Although a small amount of traffic has been affected, and although there’s been a small increase in failure, you’re still in control. You can fully control the level of exposure by controlling the amount of traffic that you send to that single new instance.
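How much traffic the single new instance can safely take is a matter of arithmetic. The sketch below assumes that overall reliability is the traffic-weighted average of the old and new instances. The 80% and 99% figures are the ones used earlier in this chapter; the 99.9% reliability for the established blue instances and the function names are assumptions made for illustration.

```python
def blended_reliability(stable: float, canary: float, canary_share: float) -> float:
    """Traffic-weighted reliability when `canary_share` of requests hit the new instance."""
    return (1.0 - canary_share) * stable + canary_share * canary


def max_canary_share(stable: float, canary: float, threshold: float) -> float:
    """Largest fraction of traffic the new instance can take while meeting the threshold."""
    if stable < threshold:
        raise ValueError("the existing instances already miss the threshold")
    if canary >= threshold:
        return 1.0  # the new instance meets the threshold on its own
    # Solve (1 - f) * stable + f * canary = threshold for f.
    return (stable - threshold) / (stable - canary)


if __name__ == "__main__":
    # Assumed: existing instances are 99.9% reliable; the new one is 80% reliable.
    share = max_canary_share(stable=0.999, canary=0.80, threshold=0.99)
    print(round(share, 4))
    print(round(blended_reliability(0.999, 0.80, share), 4))
```

Under these assumed numbers, the new instance can receive roughly 4.5% of traffic before the system as a whole drops below the 99% threshold. Sending it less traffic than that keeps the exposure within the chosen risk tolerance.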