I'll say this only once

One of the most common counterarguments to virtualization, and now to high-density computing on Cisco’s Unified Computing System, is the old “eggs in one basket” argument: if I run 50 applications on one host and that host dies, then I have an outage affecting 50 applications!

I’ve been hearing this since 2004, and I recently read it again (with surprise on my face) on Scott Lowe’s blog: my interpretation is that Scott isn’t fully on board yet, so in some ways this post is for him.

When someone says that running multiple production applications on one physical server is like “putting all my eggs in one basket”, we know they are really talking about risk. Risk can be broken down into two parts: Probability and Impact.

The Probability of a server having an outage is part hardware failure but mostly administrator-induced failure. A huge number of components, too little automation, too much variance in configurations, and poor process and system controls: these are the things that will cause an outage to a server. The Mean Time Between Failures (MTBF) for high-end components is measured in years. The Mean Time Between Cock-ups (MTBC) varies depending on how good your staff are at IT.
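To put rough numbers on that, here is a minimal sketch in Python. It assumes hardware failures and administrator cock-ups are independent and arrive at constant rates, which is a simplification I'm adding for illustration. The function names and the example figures are hypothetical, not measurements.

```python
import math


def combined_mean_time_between_outages(mtbf_years, mtbc_years):
    """Mean time between outages from either cause.

    Assumption: the two causes are independent with constant rates,
    so their failure rates (1 / mean time) simply add up.
    """
    rate = 1.0 / mtbf_years + 1.0 / mtbc_years
    return 1.0 / rate


def outage_probability(mtbf_years, mtbc_years, period_years=1.0):
    """Probability of at least one outage within the given period."""
    rate = 1.0 / mtbf_years + 1.0 / mtbc_years
    return 1.0 - math.exp(-rate * period_years)


if __name__ == "__main__":
    # Hypothetical figures: hardware fails every 5 years on average,
    # staff break something every 6 months on average.
    print(combined_mean_time_between_outages(5.0, 0.5))
    print(outage_probability(5.0, 0.5, period_years=1.0))
```

With figures like these the combined number is dominated by the MTBC, which is the point: improving your operations moves the needle far more than buying better hardware.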

The Impact of a server having an outage is really simple to calculate and mitigate. You need to think of the 3Rs: Redundancy, Resilience and Recovery.

  • Redundancy is duplicating components, including the application.  Running multiple web servers that do the same job and spreading them across physical servers means that when one server dies, the others take up the load.  Make sure you have redundant capacity for your workload (e.g. N+1) and the impact is not an outage, just a reduction in excess capacity.
  • Resilience is how well you can absorb an outage by restarting servers and applications in the same site.
  • Recovery sits at the more serious, disaster end of the scale and involves recovering services at a remote site.
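The N+1 idea from the Redundancy point can be checked with a few lines of Python. This is only a sketch: it assumes identical hosts and a load that spreads evenly across them, and the function names and numbers are made up for illustration.

```python
def capacity_after_failure(hosts, capacity_per_host, failed=1):
    """Total capacity left once `failed` hosts have died."""
    return max(hosts - failed, 0) * capacity_per_host


def is_n_plus_one(hosts, capacity_per_host, peak_load):
    """True if the remaining hosts can still carry the peak load
    after losing any single host (assumes identical hosts)."""
    return capacity_after_failure(hosts, capacity_per_host) >= peak_load


if __name__ == "__main__":
    # Hypothetical cluster: each host carries 100 units of work,
    # and the peak workload is 300 units.
    print(is_n_plus_one(4, 100, 300))  # four hosts: one can die safely
    print(is_n_plus_one(3, 100, 300))  # three hosts: no headroom
```

With four hosts, losing one costs you only your spare capacity. With three, the same failure is an outage, and that difference is the whole argument.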

Net-net, if you have redundant application components and well-deployed but, MOST IMPORTANTLY OF ALL, well-managed infrastructure, then both the probability and the impact of an outage will be within production limits, allowing you to scale up the ratio of virtual machines to physical servers to 100:1 and beyond.

Like I’ve always said, if you were crap at IT before virtualization, you’ll be even worse with virtualization and won’t be able to do any of the risk mitigation listed above. In your sad case, I wouldn’t put all your eggs in one basket. Instead, you should give the eggs to someone else :-)