Since 2004 I’ve been helping VMware customers adopt virtualization. Over these years I’ve consistently come up against the same barriers to VMware adoption, and I talked about these in my 2007 VMworld session Breaking Down Barriers to VMware Technology Adoption.
Here’s the rub: virtualization has been a roots-up transformation. It started with the techies who loved the agility that virtualization gave them. When I was working on a UK Government web site in the early 2000s, the Design Authority could test a three-tier web-app-db deployment on his laptop within an hour of the release. Yes, that’s iPlanet, WebLogic and Oracle all running on his Windows laptop with VMware Workstation.
Pretty soon VMware moved from the desktop (or vClient, as we call it now) to the server (or VDC-OS, as we call it now), and all that agility was improved and targeted at enterprise datacenters. Now you could consolidate, avoid capital expenditure, and leverage the same agility as on the desktop (snapshots, quick deployments, deleting VMs), but for hundreds and soon thousands of servers.
More innovation came: vMotion, which transfers a running, live guest from one physical host to another with no downtime. This was 2003, a time when Microsoft didn’t believe in virtualization (search for David Cutler to find out one of the reasons why), never mind vMotion. By the way, Microsoft spent the next few years saying vMotion didn’t matter, while VMware customers said zero downtime for planned maintenance was important! Microsoft still doesn’t have “Live Migration”, so this article isn’t for Hyper-V aficionados.
But Microsoft’s FUD didn’t stifle the spread of vMotion: inertia in Problem I.T. did! Let me ask you a question that a zillion Problem I.T. people have asked me in the past five years:
“Steve, do other customers have a change record for a vMotion?”
Instead of just saying “No”, which is the correct answer, I feel there is more value to my customer to ask “Why?” and the discussion goes something like this:
“We need a change ticket for vMotion because the server has moved.”
Responses to this are many, including “No physical servers have moved.” and “There is no change to your Configuration Items in your CMDB.” but again my favourite is “Why do you need to know a server has moved?”
“We need to know a server has moved because if there’s a problem, we need to isolate the fault.”
This is the first acknowledgement that step one in ALL fault isolation is: search the ticketing system for recent changes that would impact this configuration item. I think that’s a very good first step, especially if you have an effective change and configuration management system (oh, you don’t? Join the club! It’s got more members than the one over the street that is reserved for those with Great Change and Configuration Management).
I partially like this response because it shows operational focus, but it’s still not a reason for having to create a change record for each vMotion, so I usually show how you can instantly find which guest is running on which host in VirtualCenter. So if Host 1 dies, here’s the list of guests that are affected.
“But VirtualCenter is another [pane|pain] of glass.”
OK, so which Enterprise Management System from the Big4 have you spent the farm on? I then show the integration point.
“But if there have been several vMotions and then we only find out there’s been a problem after the fact, how do we know which guest was on which host when the problem happened?”
Ah-ha. So you want to use your ticketing system to track historical events, so that you can “recreate” a system state at the time that some event happened? Good luck trying to compose a historical picture of a system through change tickets, but I get what you’re trying to do.
Instead of trying to rebuild a picture of a past system state, why don’t we just go back in time through the events and see what else was happening back then? All of the components of VMware Infrastructure are instrumented (aren’t they?) and this information is sent to a remote, central location (isn’t it?) so you can correlate events (can’t you?). Consider splunk.com.
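To make the "go back in time through the events" idea concrete, here’s a minimal Python sketch. It assumes migration events have already been collected centrally as simple (timestamp, guest, source host, destination host) tuples; that layout, the host and guest names, and the functions `placement_at` and `guests_on` are all hypothetical illustrations, not a VirtualCenter or Splunk format or API.

```python
from datetime import datetime

# Hypothetical migration events as they might sit in a central log store.
# The tuple layout (timestamp, guest, source host, destination host) is an
# assumption for illustration only.
EVENTS = [
    (datetime(2009, 3, 2, 9, 15), "web01", "host1", "host2"),
    (datetime(2009, 3, 2, 11, 40), "db01", "host1", "host3"),
    (datetime(2009, 3, 2, 14, 5), "web01", "host2", "host1"),
]

# Assumed starting point: where each guest ran before the first logged event.
INITIAL_PLACEMENT = {"web01": "host1", "db01": "host1", "app01": "host2"}


def placement_at(when, initial, events):
    """Replay migration events up to `when` and return the guest -> host map."""
    placement = dict(initial)
    for timestamp, guest, _source, destination in sorted(events, key=lambda e: e[0]):
        if timestamp > when:
            break  # events are in time order, so nothing later applies
        placement[guest] = destination
    return placement


def guests_on(host, when, initial, events):
    """List the guests that were running on `host` at time `when`."""
    placement = placement_at(when, initial, events)
    return sorted(guest for guest, h in placement.items() if h == host)


if __name__ == "__main__":
    incident = datetime(2009, 3, 2, 12, 0)
    for host in ("host1", "host2", "host3"):
        print(host, guests_on(host, incident, INITIAL_PLACEMENT, EVENTS))
```

The point of the sketch is that the answer to "which guest was on which host when the problem happened?" falls out of replaying logged events, with no change tickets involved.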
“OK, so we can correlate events… but we’re not happy with the risk exposure of someone performing a vMotion.”
If risk is probability and impact, let’s look at the data. If a vMotion fails – nothing actually happens. The only change in service is when the replicant VM becomes active and the external switch receives an RARP. If a problem occurs with copying memory, the final step never happens and the original VM continues happily along with no downtime.
The reliability of this is 99.999% according to YOUR test data (you tested this, didn’t you?). The impact of a vMotion failure depends on the workload you are running in that VM. If you are running “low hanging fruit” then you have a score of low/low, and don’t need a change ticket.
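The probability/impact reasoning above can be sketched as a tiny Python check. Everything here is an assumption for illustration: the cut-off for "low" probability, the two-level low/high scale, and the function names `rate` and `needs_change_ticket`. Plug in your own test data and your own thresholds.

```python
# Assumed cut-off: "low" probability means at most 1 failure per 1,000 vMotions.
LOW_PROBABILITY_THRESHOLD = 0.001


def rate(measured_reliability, workload_impact):
    """Return a (probability, impact) rating such as ("low", "low").

    measured_reliability: success rate from YOUR vMotion tests, e.g. 0.99999.
    workload_impact: "low" or "high", judged per workload by its owner.
    """
    if workload_impact not in ("low", "high"):
        raise ValueError("workload_impact must be 'low' or 'high'")
    failure_probability = 1.0 - measured_reliability
    if failure_probability <= LOW_PROBABILITY_THRESHOLD:
        probability = "low"
    else:
        probability = "high"
    return probability, workload_impact


def needs_change_ticket(measured_reliability, workload_impact):
    """Only a low/low rating is treated as ticket-free in this sketch."""
    return rate(measured_reliability, workload_impact) != ("low", "low")


if __name__ == "__main__":
    print(rate(0.99999, "low"))
    print(needs_change_ticket(0.99999, "low"))
    print(needs_change_ticket(0.99999, "high"))
```

With five-nines reliability and a "low hanging fruit" workload, the rating comes out low/low and no ticket is called for; a high-impact workload or a poor test record changes the answer.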
“But there’s still a human logging on to a system and issuing an instruction – they might get it wrong.”
Ah, needing a change record because it’s a human performing a change. That I can understand, but is this a change or an operational event? Are the two distinguished? Do you differentiate between an operator logging on to run “top” and an operator logging on to run “sudo rm -rf /”?
If your admins log on and issue vMotion commands, even if they vMotion the wrong guests, they cause no impact on service. We’ve already established that vMotions have a low probability of failure and a low impact on service.
If your admins log on and do something other than vMotion commands, then that is a problem. If you are following the Visible Ops methodology and you are Electrifying the Fence, then I would put vMotion commands as an allowable operational activity – you aren’t changing any configuration items, just the guest-host managed object map.
Now, can I ask you some questions:
“How long does it take for a change ticket to be approved?”
For a standard change, which is what vMotion might be, it would be three-to-five days from request to approval to scheduling.
“And say I want to perform maintenance on a physical host and need to migrate twenty guests from that host prior to maintenance, so to provide zero downtime to the business?”
That would be twenty different change tickets so the application owners are notified.
“And say I need to change the memory on twenty physical hosts thanks to a bad batch of DIMMs?”
That would be four hundred change tickets…
“What about if we use Distributed Resource Scheduler (DRS), which automatically moves guests around the hosts in a cluster?”
<Change Manager faints>
The above discussion shows the inflection point when Virtualization meets legacy thinking in Problem I.T. The impact of the decision here is immense. If you _do_ require a change ticket for every vMotion, you are putting the brakes on the amount of change you can do. Visible Ops says changes are “like the brakes that let your car go faster” – but in this case, we have massive disc brakes on a tiny vehicle.
The goal of Great I.T. organizations is more throughput of change for less cost. It means being able to make changes during the business day through low-risk enabling mechanisms like vMotion. If you can only perform those four hundred changes in weekend change windows, then that’s a lot of overtime and slow throughput. Urgh!
The business wants it, virtualization can do it, but Problem I.T. stands in the way.
There is a solution here:
- You don’t require a change ticket for any vMotion on a system that has been tested and proven to be reliable for vMotion events. If vMotion is dodgy at your site (if it is, you probably have much bigger problems), then that is the exception. You could say that answering “yes” when I ask whether you require a change ticket for vMotion says more, and nothing good, about your wider operational capability…
- DRS is safe to use, ask the 130,000 production VMware customers. If it isn’t safe in your organization, then perhaps you are an exception to the rule.
- Don’t rely on change tickets to reproduce a system state: it doesn’t work. Use something like splunk.com to correlate historical events.
- Class vMotion as an ITIL Routine Maintenance task – no change record required, it’s a known, reliable operational activity.
- If you do need to know where a guest lives, or where it used to live, use VirtualCenter and an event analyzer like Splunk.
This topic has also been discussed on VMTN with the likes of Jason Boche and Don Pomery chiming in – check out the thread, which dates back to 2006 – but it is *still* a topic for today, because remember we have only virtualized 15% of all servers, and only got to 130,000 customers so far – many of whom still might not be over this particular speed bump.
Here’s another link from itsjustanotherlayer.com about making changes like vMotion without change control and in business hours from an architect at a large financial (with strict change controls, no doubt!) – nice reference to ITIL Routine Maintenance.
