VDI, or Shared Desktop Services, has a critical Key Performance Indicator: Availability, which has a direct impact on end-user experience. If the desktop service is unavailable because of a failure of any single VDI technology layer, then user experience suffers. So how do you measure VDI availability and what are typical values?
I saw a great slide yesterday by one of our Tidal Software guys (you did know that Cisco has its own enterprise service management software suite, right?). It showed how a service made up of different apps can have a poor 80% service availability score even if some of the technology components are up 100% of the time.
What is VDI if it isn’t a multi-application service? In fact, VDI is the gorilla of shared, multi-layer services with expectations of zero downtime and severe consequences against business productivity when availability is poor. So why isn’t everyone talking about VDI availability?
- Redundant components only mean higher availability within a layer. More layers means lower availability.
- What’s the point of one layer being 100% available if others are at 80%? This has cultural complications, as you can well imagine.
- Measure availability for VDI at three points: layers, Service Desk, and end user.
- Give the end user the SLA, no more no less. More costs money. Are you competitive, though?
- Cheat a bit and only focus availability on an operational window that isn’t 24×7. If you provide 24×7 availability, see the previous point about the SLA.

Any VDI solution is made up of multiple technology layers managed by multiple teams. Think of a generic list running from the front (end user) to the back (storage). Your mileage may vary, so don’t argue about the list items: they are not important by themselves. It’s the cumulative effect of downtime on each layer that matters.
You can work out the effect of combined downtime by multiplying the (real or predicted) availability of each layer: the more layers the less aggregate availability. Why is this so? Imagine a simple example of just two layers:
- If one layer has a one hour outage in 24 hours, that leaves 23 hours uptime (about 95.8%).
- If the next layer has another one hour outage then its availability is also 23 hours / 95.8%.
- If the two outages occur at different times, the combined downtime is 2 hours out of 24, so the combined availability is 22 / 24 = 91.7%.
- You get almost the same figure by multiplying the two layers’ availabilities: 95.8% x 95.8% = 91.8%.
- The caveat: the figures differ slightly because multiplication treats the outages as independent, so they can occasionally overlap. If your outages tend to occur at the same time, more sophisticated calculations are required (Google is your friend).
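The two-layer arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `serial_availability` and the variable names are made up for this example, and the product rule assumes the layers fail independently.

```python
def serial_availability(layers):
    """Multiply per-layer availabilities (fractions between 0 and 1)."""
    result = 1.0
    for availability in layers:
        result *= availability
    return result


HOURS = 24
layer = (HOURS - 1) / HOURS  # each layer has a one-hour outage per day

# Product rule: assumes the two outages are independent of each other.
product = serial_availability([layer, layer])

# Worst case: the outages never overlap, so the downtime simply adds up.
disjoint = (HOURS - 2) / HOURS

print(f"per layer: {layer:.1%}")
print(f"product:   {product:.1%}")
print(f"disjoint:  {disjoint:.1%}")
```

The product comes out a fraction higher than the non-overlapping case because independent outages will sometimes coincide.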
Point 1: Multiple layers can lower availability of the service, while multiple components can increase availability of the layer
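You can calculate the availability of each layer by taking the unavailability (the fraction of time spent down) of each redundant component in that layer, multiplying those together and subtracting the result from one. For two identical components with unavailability f, that is A = 1 – f^2; for n of them, A = 1 – f^n. (This assumes all components have the same MTBF and fail independently.)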
For example, if you have two redundant web servers that have 99.9% availability each, then you calculate the availability of that Web Layer (two web servers) as 1 – 0.1% x 0.1% = 99.9999% (an improvement of about 0.1 percentage points).
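As a rough sketch, the redundancy calculation looks like this in Python. The function name `redundant_availability` is hypothetical, and the model assumes identical components that fail independently, any one of which can carry the full load.

```python
def redundant_availability(component_availability, n):
    """Availability of a layer built from n redundant components.

    Assumes identical, independently failing components, where a single
    surviving component is enough to keep the layer up.
    """
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** n


# Two web servers at 99.9% each, as in the example above.
web_layer = redundant_availability(0.999, 2)
print(f"web layer: {web_layer:.4%}")
```

In practice redundant components often share a failure cause (same rack, same patch, same SAN), so treat the result as an upper bound rather than a prediction.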
However, while redundant components increase availability within a layer, adding layers can only lower the availability of the service as a whole.
If more layers decrease availability then there is more bad news for VDI: these technology layers are managed by different teams. If these teams don’t work cohesively, characterised by disconnected tribal knowledge, different toolsets and unawareness of things beyond their silo, then your MTBF is likely to shrink as uncoordinated actions cause outages, and (double whammy!) your MTTR grows because it’s hard to find the root cause and deliver the fix.
Point 2: Even if a layer has 100% availability, other poor performing layers will reduce the service quality and everyone gets blamed.
But what does the end user actually see? Even if you can calculate the probability of availability for layers and the service, and analyse the real results against those calculations, the most important data points to collect are at the front-end of the VDI solution. There are a couple of common approaches to working out availability and quality of service from the end user perspective:
- If you’ve got a good Service Desk, then you should be able to calculate approximate availability based on incidents raised.
- If you’re more scientific, then put automated monitors at the end user locations.
An automated monitor can be something as simple as a PC with automation software simulating a user and recording bad experiences and feeding that data up to a central system. What would be more awesome would be individual monitors on each end point that can record the availability and send to a centralized system automatically.
Point 3: Measure availability across the solution at component and layers, verify with Service Desk, and measure at end-user points with simulation monitors.
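Here is a minimal sketch of how the two end-user measurements might be turned into availability figures so they can be compared. Everything in it is illustrative: the function names, the probe counts and the incident times are invented, and a real Service Desk export or monitoring tool will have its own data format.

```python
def availability_from_probes(results):
    """results: list of booleans, True when a simulated user session worked."""
    if not results:
        raise ValueError("no probe results")
    return sum(results) / len(results)


def availability_from_incidents(incidents, window_hours):
    """incidents: list of (start_hour, end_hour) outages inside the window.

    Overlapping incidents are merged so shared downtime is counted once.
    """
    downtime = 0.0
    current_start = current_end = None
    for start, end in sorted(incidents):
        if current_end is None or start > current_end:
            # Close the previous merged outage before starting a new one.
            if current_end is not None:
                downtime += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        downtime += current_end - current_start
    return 1.0 - downtime / window_hours


# Hypothetical data: one probe every five minutes for a day, two failures.
probes = [True] * 286 + [False] * 2
# Hypothetical data: two tickets raised for overlapping outages.
tickets = [(9.0, 10.0), (9.5, 10.5)]

probe_availability = availability_from_probes(probes)
desk_availability = availability_from_incidents(tickets, window_hours=24)
print(f"probes:       {probe_availability:.2%}")
print(f"service desk: {desk_availability:.2%}")
```

A large gap between the two numbers is itself useful information: either users are not reporting outages, or the monitors are not seeing what users see.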
So, now that you know how hard it is to get great uptime on VDI and how you can calculate, predict and monitor it: what can you do to improve it? I would suggest the following three essential steps:
- Treat VDI solutions as a Shared Desktop Service. I’ve written about this before and can’t stress it enough: if you don’t actively manage VDI holistically with service management front and foremost of mind, then you will be at the mercy of rogue silos. You Have Been Warned.
- Make sure every silo understands the impact of their actions: (1) their availability has a direct impact on 5,000 users, and (2) if they impact another silo’s availability (SMS update to all desktops crashing the SAN?) this has a direct impact on 5,000 users.
- Communicate the cost of availability and visibly punish those that affect it. What does four hours out of a business day cost to the mortgage application team? When it happens because people break the service management rules, then it’s three steps to the door: (1) name and shame, (2) warning, (3) re-assignment (including to the unemployment line).
Still think your VDI solution has 99% uptime? You are saying that all technology components and facilities, combined, have no more than 87.6 hours of downtime per year. That might be acceptable, though; in fact you may be exceeding your SLA!
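To see what a given uptime figure allows in hours, a quick sketch like the following helps. The helper name `allowed_downtime_hours` is made up for this example, and a year is taken as 8,760 hours (no leap years).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_hours


for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.3%} -> {hours:.2f} h ({hours * 60:.1f} min) per year")
```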
Point 4: Know your availability SLA and only give the business what they pay for
What if your VDI solution is only required to run between 8am and 6pm, Monday to Friday? Then you have a better focus on availability and a better chance of high availability, because more than half the week can be spent on planned maintenance. In this case, you are looking at a total of 2,600 service hours per year (10 hours x 5 days x 52 weeks) instead of 8,760, leaving you a whopping 6,160 hours (70% of the year) for maintenance.
In that kind of smaller, focused window you still can’t prevent unplanned outages but by eliminating planned outages your availability should increase significantly.
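The window arithmetic can be checked with a short sketch. The names here are illustrative only, and the 52-week year with no holidays is a simplifying assumption.

```python
HOURS_PER_YEAR = 365 * 24        # 8760, ignoring leap years
service_hours = 10 * 5 * 52      # 8am-6pm, Monday-Friday, 52 weeks (assumed)
maintenance_hours = HOURS_PER_YEAR - service_hours
maintenance_share = maintenance_hours / HOURS_PER_YEAR


def windowed_availability(unplanned_downtime_hours, window_hours=service_hours):
    """Availability measured only against the agreed service window."""
    return 1.0 - unplanned_downtime_hours / window_hours


print(f"service window:     {service_hours} h")
print(f"maintenance window: {maintenance_hours} h ({maintenance_share:.0%})")
print(f"4 h unplanned outage in window: {windowed_availability(4):.2%}")
```

One thing to keep in mind with this approach: because the denominator is smaller, each hour of unplanned downtime inside the window costs more availability than it would against a 24×7 measure. The gain comes from moving planned work outside the window.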
Point 5: If you can focus on a smaller service availability window your availability should improve.
Lastly, what’s the best possible availability for VDI? Check out this table. It shows that, cumulatively, even if all twelve of your layers are at five nines availability, the service still works out at roughly four nines (99.988%, which the table rounds to 99.99%). However, of course, you can exceed this if some of your technology layers have 100% uptime. I have seen this happen only in reduced operational windows (8am-6pm) with absolute focus on uptime at the expense of change.
| Layer | Five nines | Four nines | Three nines | Two nines | 98% |
| --- | --- | --- | --- | --- | --- |
| Client Device | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| LAN | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| WAN | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| DCN | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Balancer | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Broker | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Server | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Hypervisor | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Desktop OS | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Desktop App | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| SAN | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Array | 99.999% | 99.99% | 99.90% | 99.00% | 98.00% |
| Service (product of all layers) | 99.99% | 99.88% | 98.81% | 88.64% | 78.47% |
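The bottom row of the table is the product of the twelve layer values in each column, which a short sketch can reproduce. The function name is invented for this example, and it assumes every layer has the same availability and fails independently.

```python
LAYERS = [
    "Client Device", "LAN", "WAN", "DCN", "Balancer", "Broker",
    "Server", "Hypervisor", "Desktop OS", "Desktop App", "SAN", "Array",
]


def service_availability(per_layer, layer_count=len(LAYERS)):
    """Service availability when every layer has the same availability."""
    return per_layer ** layer_count


for per_layer in (0.99999, 0.9999, 0.999, 0.99, 0.98):
    service = service_availability(per_layer)
    print(f"{per_layer:.3%} per layer -> {service:.3%} for the service")
```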
In summary, it is possible but unlikely to hit five nines availability for VDI unless you have a reduced operational window AND keen operational focus on uptime. But, if the customer hasn’t asked for or isn’t paying for this kind of uptime, then so what? Oh, you don’t know what the customer expects? Right, well I’m sure you’ll know when the next incident happens and he or she is calling your desk with a few choice words in explanation of what an hour’s downtime costs. HAVE FUN!
3 Comments
An enormous part of getting to high availability is reducing mean-time-to-repair.
Virtualization and UCS service profiles for dedicated servers can vastly improve recovery from one of the most time-consuming faults, hardware failure.
Consider how long it takes to recover from a failed hard disk in a simple server – repair, provision, restore – vs the time it takes to move a guest to a host or attach a service profile to another blade.
You can reboot in five minutes, maintaining four 9s for the year. But one repair/provision/restore might take you down for hours, dropping below three 9s into ‘no bonus for you’ territory!
Great point, Kent! Nice to see Cisco and Zenoss working together too :-)
Couldn’t agree more, specifically with points 3 and 4 above.
Many enterprises’ high-availability architecture is based on the assumption that you can prevent failure from happening by putting all your critical data in a centralized database, backing it up with expensive storage, and replicating it somehow between the sites. As I argued in one of my previous posts (Why Existing Databases (RAC) are So Breakable!) many of those assumptions are broken at their core, as storage is doomed to failure just like any other device.
One of the main lessons that we can take from the likes of Amazon and Google is that the right way to ensure continuous high availability is by designing our system to cope with failure. We need to assume that what we tend to think of as unthinkable will probably happen, as that’s the nature of failure. So rather than trying to prevent failures, we need to build a system that will tolerate them.
Building elastic applications that are designed to handle their own failures and ensure continuous availability without human intervention is becoming the best practice for building robust and scalable systems.
As Kent mentioned above, what I believe UCS Manager and service profiles provide is fine-grained control over the hardware that makes this type of automation and self-healing possible.
Interestingly enough, I wrote a post just a few days ago on how a leading Wall Street firm managed to apply those same principles to handle a complete data center failure scenario while ensuring continuous availability of their real-time web application.