Sometime last year, Netflix began using Amazon Web Services (AWS) to run their immensely successful video streaming business. They moved their entire source of revenue to the cloud. They are now totally reliant on the performance of AWS.
How would you manage the business risk of such a move? Stop reading and write down your answer. Come on, humor me. Just outline it in bullet points.
OK, now read on (no cheating!).
Here's what I would have done. Crossed my fingers and hoped for the best. Of course you monitor and you create the right remediation plans. But you wait for it to break before you do anything. And you keep hoping that nothing too bad will happen.
Obviously, I don't think like the preternaturally, freakishly smart genius engineers at Netflix. Here is how they describe what they did in the Netflix Tech Blog (I first read about this on Jeff Atwood's blog, Coding Horror).
"One of the first systems our [Netflix] engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage."
This is what proactive means! How many companies have the guts, backed up by the technical prowess, to do this (not to build the Chaos Monkey, but to deal with the destruction it leaves in its wake)?
It dawned on me that IT systems are constantly bombarded by countless chaos monkeys, or one chaos monkey if you prefer, controlling hundreds of variables. The best way to get ahead is to simulate the type of destruction these chaos monkeys might cause so you can be resilient rather than reactive to monkey strikes.
And strike the monkeys will, especially when software is changing rapidly (Agile development, major enhancements, etc.). In these conditions, the structural quality of software can degrade from one iteration or release to the next.
So I built a chaos monkey to simulate this deterioration of structural quality as an application is going through change. Here's how it works.
An application starts with the highest structural quality – a 4.0 (zero is the minimum quality score). At the end of each iteration/release/enhancement one of three things might happen to this value:
- It might, with a certain probability, increase [denoted by Prob(better quality)]
- It might, with a certain probability, decrease [Prob(worse quality)]
- It might, with a certain probability, stay the same [Prob(same quality)]
Of course, we don’t know what each of these probabilities should be. Once you have structural quality data at the end of each iteration for a few projects, you would be able to estimate them. We also don't know how much the structural quality will increase or decrease at each iteration, so we can try out a few values for this "step" increase or decrease.
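If you want to play with the idea yourself, here is a minimal sketch of that kind of simulation in Python. It is not my original model, just one way to express the rules above. The function name `simulate_quality`, the probabilities, and the step sizes are all made-up illustrations, and clamping the score to the 0 to 4.0 range is my assumption about how to keep it on the scale. Prob(same quality) is simply whatever is left over after the other two.

```python
import random

MAX_QUALITY = 4.0  # highest structural quality score (the starting point)
MIN_QUALITY = 0.0  # lowest structural quality score


def simulate_quality(p_better, p_worse, step, iterations=24,
                     start=MAX_QUALITY, seed=None):
    """Return the quality score after each iteration, starting score included.

    p_better and p_worse are the per-iteration probabilities of the score
    going up or down by `step`. The score stays the same with probability
    1 - p_better - p_worse. All values passed in are illustrative guesses.
    """
    if p_better < 0 or p_worse < 0 or p_better + p_worse > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")

    rng = random.Random(seed)  # seeded so a run can be repeated
    quality = start
    history = [quality]
    for _ in range(iterations):
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < p_better:
            quality += step
        elif roll < p_better + p_worse:
            quality -= step
        # otherwise the monkey leaves the score alone this iteration

        # Assumption: the score cannot leave the 0 to 4.0 scale.
        quality = min(MAX_QUALITY, max(MIN_QUALITY, quality))
        history.append(quality)
    return history


if __name__ == "__main__":
    # Try a few hypothetical step sizes with the same made-up probabilities.
    for step in (0.05, 0.1, 0.2):
        final = simulate_quality(p_better=0.2, p_worse=0.5, step=step, seed=1)[-1]
        print(f"step={step}: quality after 24 iterations = {final:.2f}")
```

Once you have real structural quality data from a few projects, you would swap the made-up probabilities and step sizes for estimates of your own.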
After 24 iterations, here is where the chaos monkey has left us.
Because structural quality sits beneath visible behavior rather than on the surface, it can be difficult to detect and monitor in the rush and tumble of development (or rapid enhancement). Even when structural quality drifts downward in small steps, the decline can quickly accumulate and drive down an application’s reliability. It’s not that you or your teammates lack the knowledge; you simply don’t have the time to ferret out this information. Business applications contain thousands of classes, modules, batch programs, and database objects that need to work flawlessly together in the production environment. Automated structural quality measurement is the only feasible way to get a handle on structural quality drift.
Once you know what to expect from the chaos monkey, you can plan the work needed to prevent the decline rather than be caught by surprise.
Long live the chaos monkey!