Antifr-agile: Breaking things on purpose at Agile Day Chicago 2016
Saying that we're into agile is a bit of an understatement. We've written numerous articles and ranted at scale across conferences globally on the topic. On November 4, Morningstar's offices became the watering hole of the agile community as Agile Day Chicago 2016 went into full swing. One topic in particular drew my attention this year, Breaking Things on Purpose, and I wanted to expand on it in this post.
Breaking Things on Purpose
Ex-Amazon, ex-Netflix engineer Kolton Andrus presented on purposefully induced failure within enterprise infrastructure. The concept matured during his time at Amazon and Netflix, where he wrote the Failure Injection Testing (FIT) framework that allows Netflix to intentionally create failure within its ecosystem. The outcome of this practice is an ecosystem that is not only robust (versus fragile), but one that learns and evolves on its own.
A good example that Kolton shared was the notion of a vaccine. A vaccine introduces a bounded failure experience while unlocking unbounded benefits. In other words, a small, weakened dose of the virus is introduced into the patient's bloodstream, and mild discomfort is often experienced as the immune system builds up tolerance and learns how to fight the assailant. So while the discomfort is limited, the upside is protection against contracting a potentially deadly disease in the future.
This analogy maps remarkably well to software. Downtime is expensive. Industry research indicates that one minute of data center downtime costs $5,000, with full incident-resolution costs running into the hundreds of thousands. Downtime of Amazon.com during Black Friday would equate to hundreds of millions. Downtime also causes a lot of strife. Even if middle-of-the-night responsiveness is part of the support role, no one is excited to get that SMS. To avoid what can often be perceived as a crisis, engineering organizations should simulate failure during times of calm.
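The downtime arithmetic above is easy to make concrete. A minimal sketch, using the $5,000-per-minute industry figure cited in this post; the incident durations are hypothetical examples, not data from any real outage:

```python
# Back-of-the-envelope downtime cost at the cited industry rate.
COST_PER_MINUTE = 5_000  # USD per minute of data center downtime


def outage_cost(minutes):
    """Direct cost of an outage at the cited per-minute rate."""
    return minutes * COST_PER_MINUTE


# Even a one-hour incident already lands in the hundreds of thousands:
print(outage_cost(60))  # 300000
```

This ignores indirect costs (lost customers, reputation, engineer burnout), so it is a floor, not a ceiling.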
As Picard Management Tips (@PicardTips) tweeted, one should, "Run crisis drills when all is well. A real calamity is not a good time for training." And yet a strategy of injecting something harmful into the production instance will commonly be met with wide eyes and lists of counterarguments from the CIO. The following steps from Kolton define how breaking things in production can create bounded failure, yet limitless benefits:
- Brainstorm what could go wrong. Usually the 80/20 rule applies, so look at your architecture diagram and identify areas where a) devices could die, b) connectivity could be lost, and c) demand could increase beyond capacity.
- Guess how likely the above scenarios are to happen.
- Define the typical cost of an outage and use that to justify investment necessary on proactive failure injection.
- Define the key business metric the process should test. For example, can users still continue streaming their shows through Netflix if a regional data center goes down and traffic is rerouted to a different geography?
- Limit the scope of the initial test.
- Expand the scope of the tests toward 100 percent as local issues are resolved.
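The last three steps above, testing a key business metric at a limited scope and widening only when the metric holds, can be sketched in a few lines. This is a hypothetical illustration: `inject_failure`, `measure_metric`, and the threshold value are placeholders, not part of FIT or any real API:

```python
# Stepwise rollout of a failure experiment: small blast radius first,
# verify the key business metric, then widen toward 100 percent.
SCOPES = [0.01, 0.05, 0.25, 0.50, 1.00]  # fraction of traffic affected
METRIC_THRESHOLD = 0.995                 # e.g. stream-start success rate


def run_experiment(inject_failure, measure_metric):
    """Return (scope, metric) where the rollout stopped."""
    for scope in SCOPES:
        inject_failure(scope)        # e.g. reroute this slice of traffic
        metric = measure_metric()    # the key business metric from step 4
        if metric < METRIC_THRESHOLD:
            # Stop and fix before widening the blast radius further.
            return scope, metric
    return 1.0, metric
```

The point of the structure is that a regression discovered at 1 percent of traffic is an experiment result; the same regression discovered at 100 percent is an outage.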
The presentation reminded me of another session I attended four years ago. In fact, Netflix was not the first organization to think about failure injection as a strategy. Harper Reed, the CTO behind Barack Obama's 2012 campaign, used a manual strategy for injecting failure into Project Narwhal. Each day prior to the election, the engineering team would draw three pieces of paper out of a hat. Each piece carried a failure scenario: data center availability, network bandwidth, a load balancer outage, etc. The combined scenario from the three notes would then be simulated on the production instance of the application, allowing the team to preemptively address the majority of possible failures.
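The hat exercise above is trivially automatable. A minimal sketch; the scenario list here is illustrative and not the campaign's actual list:

```python
import random

# Candidate failure scenarios to draw from; purely illustrative.
SCENARIOS = [
    "data center unavailable",
    "network bandwidth saturated",
    "load balancer outage",
    "database replica lag",
    "DNS resolution failure",
]


def draw_game_day(rng=random):
    """Draw three distinct scenarios to rehearse in combination."""
    return rng.sample(SCENARIOS, 3)


print(draw_game_day())  # e.g. three random scenarios for today's drill
```

Because the three draws are combined, the team rehearses compound failures, which is where untested systems tend to break in surprising ways.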
The same practices, however, were missing from the Romney campaign, resulting in the catastrophic failure of Project Orca. Multiple single points of failure did, in fact, go down when production traffic exceeded expectations. Additionally, Comcast mistook the rapid increase in traffic for a DDoS attack and initiated countermeasures.
Automating failure injection and building resilient production ecosystems should be top of mind for today's CIOs. If there is one thing I can predict without a Magic 8 Ball, it's that your production instance will go down. The question is: how catastrophic will it be, and shouldn't you train for that event?