“Eggs in one basket” is a meme, not a tangible real thing. It’s an irrational but real fear. Let’s kill it, people! Eggs are laid by chickens. Right?
Applications in virtual machines are not like eggs, and hosts are not baskets. An egg is easy to break, and when it gets broken it’s finished for good. You can’t put an egg back together.
A modern computer application, unlike an egg, can be resilient to breaking and even when it does break it can be restored back to normal in minutes.
Don’t be scared of something going wrong: embrace it, for it will happen, my friend. When you face this eventuality you are set free. You will build resilient, automated and highly available systems. If you think of applications as eggs, where any failure is a catastrophe, you’re screwed.
What about the basket? The biggest fear in virtualization is that, with multiple applications running on one host, any outage to the host has a big impact on service because multiple hits happen at the same time.
The pressure is on the host to be as “up” as possible, with five 9s desired. This is wrong. Plain. Wrong. The Host is not the problem, it’s our thinking that is wrong. Here’s why:
- All applications can fail at any time for many reasons. Applications should be resilient to failure through human techniques such as Electrify The Fence and protective redundancy. If one node dies, the service keeps running at reduced capacity. If your service can’t afford this kind of resiliency and it has Single Points of Failure (SPoF), then your availability might be in the 80% range: that’s a fact of life, like buying a Ferrari kit car with a Toyota engine means – guess what? – IT AINT A FERRARI.
- Face up to the main cause of outages: me and you; us dopey humans. Our success is measured by the Mean Time Between Cock-Up (MTBCU). For those unfamiliar with my English vernacular, a “Cock-Up” is pulling the wrong cable, rebooting the wrong server, and eating bacon with jam.
- If you think a Host = Basket, then you’re wrong. Your face is so close up to the paper you can see a word but not a sentence, and certainly not the whole story. At last week’s London VMUG I heard two people refer to the Basket as the platform/vendor or a stack of blades. That’s more like it. I prefer to think of the basket as The System, a top-to-bottom layer of technology stacks (compute, network, storage) with resilience and recovery built in, where you can measure availability from the view of the service consumer. In The System I expect apps to break, servers to be rebooted incorrectly, networks to flap, disks to break: and because of this, I build in resilience and recovery to cope.
- Until Administrators are rewarded for higher ROI through higher consolidation ratios, instead of being severely penalised for the impact of failure, virtualization will stay at 30% and people will equate Eggs in one Basket to “How to get fired”.
- Nobody, not you or me, knows when Too Many is Too Many. Is one egg in a basket too many? What about two? What about thirty-two? What about three hundred? How do you calculate the line-that-must-not-be-crossed? Yes, this is a risk calculation, but you show me an Admin who has set Maximum VMs Per Host because of a Probability and Impact calculation…
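For anyone who wants to put rough numbers on the points above, here is a minimal sketch in Python. It is not a method from this post: the function names, the 80% node availability, the failure rate and the recovery time are all made-up assumptions, and it treats node failures as independent, which real outages often are not. It shows two things: how protective redundancy lifts service availability, and how a simple Probability x Impact figure for a host scales with the number of VMs and with recovery time.

```python
from math import comb


def service_availability(node_availability: float, nodes: int, needed: int = 1) -> float:
    """Probability that at least `needed` of `nodes` nodes are up.

    Assumes every node has the same availability and fails independently
    (an illustrative simplification, not a claim about real systems).
    """
    a = node_availability
    return sum(
        comb(nodes, k) * a**k * (1 - a) ** (nodes - k)
        for k in range(needed, nodes + 1)
    )


def expected_vm_hours_lost(vms_per_host: int,
                           host_failures_per_year: float,
                           recovery_hours: float) -> float:
    """Probability x Impact: expected VM-hours of downtime per host per year."""
    return vms_per_host * host_failures_per_year * recovery_hours


if __name__ == "__main__":
    # A single point of failure at 80% versus redundant nodes (hypothetical figures).
    for n in (1, 2, 3):
        print(f"{n} node(s): {service_availability(0.80, n):.3f}")

    # Same host, same failure rate: slow manual recovery versus fast automated restart.
    for recovery in (4.0, 0.1):
        lost = expected_vm_hours_lost(32, 0.5, recovery)
        print(f"recovery {recovery} h: {lost:.1f} VM-hours lost per year")
```

Under these assumptions, shortening recovery time cuts the expected loss in direct proportion, which is the "build resilience and recovery" argument in arithmetic form. It still does not tell you where Too Many is: that depends on what an hour of each service is worth to whoever consumes it.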
So there you have it: applications are not eggs and hosts are not baskets. The fear of “Eggs in one basket” is irrational and unfounded IF YOU ACCEPT FAILURE HAPPENS AND BUILD RESILIENCE AND RECOVERY INTO THE SYSTEM. Note those important words:
1. ACCEPT FAILURE
2. BUILD RESILIENCE AND RECOVERY
3. THE SYSTEM


Definitely there are two camps of speech/doctrine. I have chosen mine…
Anyway I’m more interested in the risk assessment point of view and I’m looking at how to calculate that risk. Any idea?
Thx,
Didier
Kudos to you – this is an excellent argument in favour of greater consolidation.
My analogy is this: the system is a truck. In that truck are shelves (hosts), and on those shelves are boxes (guests). Boxes fall off shelves but the truck goes on (non-HA guests); some boxes move during travel (guests protected by HA), but the truck goes on. If the truck fails, another tractor unit comes to take the trailer (SRM).
Right on! I understand and share the concerns that sysadmins have; I’m one of the family. But I think it’s not as big an issue as we all fear, so I hope we can put it to bed…
I agree with you that there ain’t no free high availability, meaning that as long as you keep it clear that hosts may and will fail (and network, and storage, etc.), you can probably circumvent those inner “limitations” and provide HA nonetheless (by implementing clusters, FT, you name it). Still, there are scenarios where some applications weren’t designed to be highly available, yet they end up needing to be. While this comes down to poorly designed apps, you’re still facing this requirement, and the simple answer is that you can’t blame the infrastructure if the app is crap. Sure, features like VMware’s FT can sometimes circumvent the app’s inner limitations, but even if it’s cool (while limited, at least so far), FT can’t solve the whole problem (e.g. if this crap-app is a singleton by definition, you’re going to have scalability issues sooner or later).
So, I think that some (few) doomed-to-break eggs do indeed exist, but it’s not the infrastructure’s fault. For the remaining vast majority of cases, I totally agree with you: know your chicken (the infrastructure and its inherent faulty/recoverable nature) and plan/act accordingly.
PJ