Embracing Downtime: Why 99.999...% Availability is Not Always Better
Andrew Phillips
8 December, 2010
A couple of weeks ago, my ever-active colleagues Marco Mulder and Serge Beaumont organised an nlscrum meetup about "Combining Scrum and Operations", with presentations by Jeroen Bekaert and devopsdays organiser Patrick Debois. Unfortunately, I was late and only managed to catch the tail end of Patrick's well-delivered talk explaining how Dev/ops can become Devops. Thankfully, the lively open space discussions that followed provided plenty of interesting insights, comments and general food for thought.

One recurring theme that particularly struck me was the comment, uttered with regret by many in Operations, that they would very much like to help and coordinate with the development teams but inevitably were always too busy keeping the production environment up and running. In other words, helping prepare for new releases might be desirable, but achieving the five nines, or whatever SLA Operations has committed to [1], will always be paramount. This is a fallacy! Indeed, one of the core realisations of the "Devops mindset", to me, is that 99.999...% uptime is not an end in itself, but a means to an end: delivering the greatest business value possible. And aiming for the highest possible availability may not be the best way to go about it! [2]
For instance, imagine a day's downtime in production costs $500k, and you have a new feature coming up for release that is estimated to bring in an extra $1m per day. Then for every day by which you can speed up the release, you can afford almost two days of downtime! [3] The point is: the ability to maintain a stable current environment cannot be considered independently of the ability to rapidly deliver change. Rather, the two need to be balanced against each other to determine which combination will likely deliver the greatest value. This is a decision only the business owner or customer can make. And naturally, the balance needs to be continuously monitored and updated in light of new requirements and experience.
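The back-of-the-envelope trade-off above can be sketched in a few lines of Python. The figures are the hypothetical ones from the example (and, as noted in the footnote, chosen for illustration rather than realism):

```python
# Hypothetical figures from the example above -- not real data.
DOWNTIME_COST_PER_DAY = 500_000    # cost of one day of production downtime, in $
FEATURE_VALUE_PER_DAY = 1_000_000  # extra revenue per day once the feature ships, in $

def affordable_downtime(days_saved: float) -> float:
    """Days of downtime whose cost is offset by shipping `days_saved` days earlier.

    Extra value gained by releasing earlier: days_saved * FEATURE_VALUE_PER_DAY.
    Dividing by the cost of a downtime day gives the downtime we could 'afford'.
    """
    return days_saved * FEATURE_VALUE_PER_DAY / DOWNTIME_COST_PER_DAY

print(affordable_downtime(1))  # → 2.0 days of downtime per day of earlier release
```

The simple ratio comes out at exactly two days; "almost two" in the text is the more honest figure, since a downtime day after the release also forfeits the new feature's revenue.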

1. Too often without drawing on actual day-to-day experience, a point made by Patrick.
2. Of course, rushing inadequately tested, unstable software out just to release a feature on a certain date usually isn't a good way to go about it, either. This post is not supposed to be "Ops-bashing"; it's just that reducing the "feature frequency" is far less controversial, in most organisations, than even considering reduced stability.
3. The relative magnitude of the two figures is not particularly realistic, for sure. It's just for example's sake.
4. Don't laugh! I've seen it happen too often, to clever and experienced developers, to believe this is only an isolated problem.
5. Quite a few big companies are adopting this model for all their applications. A number of attendees at the nlscrum meeting also reported positive experiences with this approach.
6. Or even "blog series", who knows.