You have to admire the chaps at Netflix. They are a fantastic example of the blending of customer service, cloud, operations and devops. One can learn many lessons from them, and perhaps the most awesome reason they are, erm, awesome is their openness. Not only do they publish their RCAs and other internal articles on their blog, but they also open source the tools they’ve developed on GitHub. Amazing, thank you, chaps. Everyone else: please copy this approach (including my own company, VCE!).
One of their many great practices (not best, not proven, but great!) is the Simian Army, made up of the following:
- Chaos Monkey – randomly disables live production instances
- Latency Monkey – introduces artificial delays into the API
- Conformity Monkey – checks that instances adhere to standards and shuts them down if they don’t
- Doctor Monkey – health checker
- Janitor Monkey – identifies and tidies up waste of resources
- Security Monkey – an extension of Conformity Monkey, checks certs and other security related configs
- 10-18 Monkey – internationalization
- Chaos Gorilla – simulates the failure of an entire AWS Availability Zone
Why isn’t every organization in the world doing this? Well, here are five reasons that Simian Armies are held back at other orgs…
- Culture – the org doesn’t think it’s possible/worth it/important/its job, or is too scared/<insert lame excuse>
- Organization has fragmented engineering, development and ops into separate orgs that have conflicting priorities.
- Fear – Operations are neither empowered nor rewarded for doing anything like a Simian Army. Any drop in service they cause will lead to dismissal. <legacy>Breaking (stupid) SLAs and incurring liability costs will hurt the business</legacy>
- Capability – IT org doesn’t have the skills to build a Simian Army; most often this is a lack of development tools.
- Fragility – ironically, the infrastructures that would most benefit from a Simian Army are the ones that are least likely to get one because they are so fragile and/or running important workloads.
One more reason I think most orgs don’t have Simian Armies is that their infrastructure (physical, logical, virtual) is not programmable. The API revolution hasn’t hit every component yet. Just take a small sample of compute, network and storage devices from multiple vendors and you’ll find a horrible mish-mash of APIs. The more heterogeneous it is, the worse it is. The worst case I can imagine would be an outsourcing company that had taken on lots of different orgs’ IT systems. This lack of a consistent interface across all the infrastructure resembles the crooked, stained and missing teeth of someone with a really bad smile.

If your infrastructure isn’t programmable, it’s hard to monitor and control. A Simian Army needs a way to engage the enemy, so step #1 is to find a way to API-ify your infrastructure. One API for all your infrastructure is the headless single pane of glass we need.
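
To make the "one API" idea a bit more concrete, here is a minimal sketch in Python. It is not Netflix's code and not any vendor's API: the `Infrastructure` interface, the `InMemoryInfrastructure` adapter and the `chaos_monkey` function are all hypothetical names made up for illustration. The assumption is that every vendor's kit gets wrapped in an adapter exposing the same few calls, so a monkey only ever has to talk to one interface.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Optional


class Infrastructure(ABC):
    """Hypothetical common interface that every vendor adapter would implement."""

    @abstractmethod
    def list_instances(self) -> List[str]:
        """Return the ids of all running instances."""

    @abstractmethod
    def is_healthy(self, instance_id: str) -> bool:
        """Report whether the given instance is up."""

    @abstractmethod
    def terminate(self, instance_id: str) -> None:
        """Shut the given instance down."""


class InMemoryInfrastructure(Infrastructure):
    """Stand-in adapter for illustration; a real one would call a vendor API."""

    def __init__(self, instance_ids):
        self._running = set(instance_ids)

    def list_instances(self) -> List[str]:
        return sorted(self._running)

    def is_healthy(self, instance_id: str) -> bool:
        return instance_id in self._running

    def terminate(self, instance_id: str) -> None:
        self._running.discard(instance_id)


def chaos_monkey(infra: Infrastructure, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen running instance.

    Returns the id of the terminated instance, or None if nothing is running.
    """
    instances = infra.list_instances()
    if not instances:
        return None
    victim = rng.choice(instances)
    infra.terminate(victim)
    return victim


if __name__ == "__main__":
    infra = InMemoryInfrastructure(["web-1", "web-2", "db-1"])
    victim = chaos_monkey(infra, random.Random())
    print("terminated:", victim)
    print("still running:", infra.list_instances())
```

The point of the sketch is the shape, not the detail: the monkey depends only on the interface, so adding another vendor means writing another adapter rather than another monkey.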
