I'll say this only once

One of the most common counterarguments to virtualization, and now to the high-density computing of Cisco’s Unified Computing System, is the old “eggs in one basket” argument: if I run 50 applications on one host and that host dies, then I have an outage affecting 50 applications!

I’ve been hearing this since 2004, and I recently read it again (with surprise on my face) on Scott Lowe’s blog: my interpretation is that Scott isn’t fully on board yet, so in some ways this post is for him.

When someone says that running multiple production applications on one physical server is like “putting all my eggs in one basket”, we know they are really talking about risk. Risk can be broken down into two parts: Probability and Impact.

The Probability of a server having an outage is part hardware failure but mostly administrator-induced failure. A huge number of components, less automation, more variance in configurations, poor process and system controls: these are the things that will cause a server outage. The Mean Time Between Failure (MTBF) for high-end components is measured in years. The Mean Time Between Cock-up (MTBC) varies depending on how good your staff are at IT.
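To put a rough number on the hardware side of that probability, here is a minimal sketch in Python. It assumes a constant failure rate (the exponential model) and independent components, both of which are simplifying assumptions rather than anything claimed in this post. The function names and any MTBF values you feed in are hypothetical, for illustration only.

```python
import math


def failure_probability(mtbf_years: float, period_years: float = 1.0) -> float:
    """Chance of at least one failure within the period.

    Assumes a constant failure rate (exponential model), a simplification.
    """
    if mtbf_years <= 0:
        raise ValueError("mtbf_years must be positive")
    return 1.0 - math.exp(-period_years / mtbf_years)


def any_component_fails(mtbf_years_list, period_years: float = 1.0) -> float:
    """Chance that at least one of several components fails in the period.

    Assumes the components fail independently of each other.
    """
    survive_all = 1.0
    for mtbf in mtbf_years_list:
        survive_all *= 1.0 - failure_probability(mtbf, period_years)
    return 1.0 - survive_all
```

Under these assumptions, adding components with the same MTBF always raises the chance that something fails, which is one way to read the point about a huge number of components. It says nothing about MTBC, which depends on your people and process and is not modeled here.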

The Impact of a server having an outage is really simple to calculate and mitigate. You need to think of the 3Rs: Redundancy, Resilience and Recovery.

  • Redundancy is duplicating components, including the application.  Running multiple webservers that do the same job and spreading these across physical servers means that when one server dies, the others take up the load.  Make sure you have redundant capacity for your workload (e.g. N+1) and the impact is not an outage, just a reduction in excess capacity.
  • Resilience is how well you can absorb an outage by restarting servers and applications in the same site.
  • Recovery is the more serious disaster end of the scale and involves recovering services at a remote site.
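The Redundancy point above can be sketched as a simple capacity check. This is a minimal illustration in Python, assuming identical hosts and a load measured in the same units as host capacity; the function names and the numbers in the usage note are hypothetical.

```python
def remaining_capacity(hosts: int, capacity_per_host: float, failed: int = 1) -> float:
    """Total capacity left after `failed` hosts die (identical hosts assumed)."""
    if failed < 0 or failed > hosts:
        raise ValueError("failed must be between 0 and hosts")
    return (hosts - failed) * capacity_per_host


def causes_outage(hosts: int, capacity_per_host: float, load: float, failed: int = 1) -> bool:
    """True if the surviving hosts can no longer carry the load."""
    return remaining_capacity(hosts, capacity_per_host, failed) < load
```

For example, with four web servers that each handle 100 requests per second and a load of 300 (an N+1 layout), `causes_outage(4, 100.0, 300.0)` is `False`: losing one server only uses up the spare capacity. Lose two and the same call with `failed=2` returns `True`.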

Net-net, if you have redundant application components, well-deployed infrastructure but MOST IMPORTANTLY OF ALL well-managed infrastructure, then your probability and impact of outage will be within production limits, allowing you to scale up virtual machines on physical servers to 100:1 and beyond.

Like I’ve always said, if you were crap at IT before virtualization, you’ll be even worse with virtualization and not be able to do any of the risk mitigation listed above and, in your sad case, I wouldn’t put all your eggs in one basket – instead, you should give the eggs to someone else :-)


7 Comments

  1. Trey Layton says:

    Great take, couldn’t agree more.

  2. PiroNet says:

    100% agree…

    I felt a bit alone when that article came out. Was I the only one thinking that ‘all your eggs in one basket’ is not bad practice as long as you stick to the 3R’s!?

  3. (Hum the tune) You are not alone… :)

  4. The View from the Other Side…

    As someone who doesn’t necessarily advocate in favor of always shooting for the highest possible consolidation ratios, I’m apparently not “fully on board yet.” Here are my reasons why. ……

  5. vTrooper says:

    Eggs in a Basket. Aren’t eggs delivered in trucks now? And don’t those trucks have GPS and AirConditioning?

    I get the idea that you need to minimize risks and you gents are right to speak of the 3R’s. When I hear scale up and scale out I also think about the operational changes that come with those decisions. Does everyone else think that the first choice (up or out) doesn’t have a corresponding impact (Risk vs OpEx)?

    I guess I’m the black sheep in the field watching the Roosters and Hens argue it out as I watch the farmer take all their eggs to his table and eat them anyway.

    I still think the reason you scale up is because the eggs are getting bigger. Like ostrich eggs, you need a bigger pan for that omelet, a bigger stove for the Quiche.

    In fact, it doesn’t matter to me how the eggs are carried as long as a good Cook is in the Kitchen.

  6. KenDonoghue says:

    Stick to the three Rs or skip clustering altogether and use a fault tolerant server platform. No unplanned downtime, data loss or failover caused by a server malfunction. A fault-tolerant server also removes opportunity for human interaction, a major cause of outages.

  7. Phil Riccio says:

    At Stratus Technologies, we have been saying this for the last several years, we even did a video on the topic. You can view it here – http://community.stratus.com/podcasts/dont-get-caught-with-egg-your-face

    Phil
