Since 2004 I’ve been helping VMware customers adopt virtualization.  Over these years I’ve consistently come up against the same barriers to VMware adoption, and I talked about these in my 2007 VMworld session Breaking Down Barriers to VMware Technology Adoption.

Here’s the rub: virtualization has been a roots-up transformation.  It started with the techies who loved the agility that virtualization gave them.  When I was working on a UK Government web site in the early 2000s, the Design Authority could test a three-tier web-app-db deployment on his laptop within an hour of the release.  Yes, that’s iPlanet, WebLogic and Oracle all running on his Windows laptop with VMware Workstation.

Pretty soon VMware moved from the desktop (or vClient, as we call it now) to the server (or VDC-OS, as we call it now), and all that agility was improved and targeted at enterprise datacenters.  Now you could consolidate, avoid capital expenditure, and leverage the same agility as on the desktop (snapshots, quick deployments, deleting VMs) but for hundreds and soon thousands of servers.

More innovation came: vMotion – transferring a running, live guest from one physical host to another with no downtime.  This was 2003, still a time when Microsoft didn’t believe in virtualization (search for David Cutler to find out one of the reasons why), never mind vMotion (and by the way, Microsoft spent the next few years saying vMotion didn’t matter – VMware customers said zero downtime for planned maintenance was important!) – and Microsoft still doesn’t have “Live Migration”, so this article isn’t for Hyper-V aficionados :-)

But Microsoft’s FUD didn’t stifle the spread of vMotion: inertia in Problem I.T. did!  Let me ask you a question that a zillion Problem I.T. people have asked me in the past five years:

“Steve, do other customers have a change record for a vMotion?”

Instead of just saying “No”, which is the correct answer, I feel there is more value to my customer to ask “Why?” and the discussion goes something like this:

“We need a change ticket for vMotion because the server has moved.”

Responses to this are many, including “No physical servers have moved.” and “There is no change to your Configuration Items in your CMDB.” but again my favourite is “Why do you need to know a server has moved?”

“We need to know a server has moved because if there’s a problem, we need to isolate the fault.”

This is the first acknowledgement that step one in ALL fault isolation is: search the ticketing system for recent changes that would impact this configuration item.  I think that’s a very good first step, especially if you have an effective change and configuration management system (oh, you don’t?  Join the club!  It’s got more members than the one over the street that is reserved for those with Great Change and Configuration Management).

I partially like this response because it shows operational focus, but it’s still not a reason for having to create a change record for each vMotion, so I usually show how you can instantly find which guest is running on which host in VirtualCenter.  So you can see if Host 1 dies, then here’s the list of Guests that are affected.
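As a rough sketch of that lookup (this is not VirtualCenter's actual API; the inventory dictionary and the guest and host names are made up for illustration), "which guests are affected if Host 1 dies?" is just an inversion of the guest-to-host map:

```python
from collections import defaultdict

# Hypothetical snapshot of the guest -> host map, e.g. from an inventory export.
guest_to_host = {
    "web01": "host1",
    "app01": "host1",
    "db01": "host2",
    "web02": "host2",
}

def guests_by_host(mapping):
    """Invert a guest -> host map into host -> sorted list of guests."""
    result = defaultdict(list)
    for guest, host in mapping.items():
        result[host].append(guest)
    return {host: sorted(guests) for host, guests in result.items()}

# Guests that would be hit if host1 died.
affected = guests_by_host(guest_to_host).get("host1", [])
print(affected)
```

The point is that the answer comes from current inventory data, not from a trail of change tickets.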

“But VirtualCenter is another [pane|pain] of glass.”

OK, so which Enterprise Management System from the Big4 have you spent the farm on?  I then show the integration point.

“But if there have been several vMotions and then we only find out there’s been a problem after the fact, how do we know which guest was on which host when the problem happened?”

Ah-ha.  So you want to use your ticketing system to track historical events, so that you can “recreate” a system state at the time that some event happened?  Good luck at trying to compose a historical picture of a system through change tickets, but I get what you’re trying to do.

Instead of trying to rebuild a picture of a past system state, why don’t we just go back in time through the events and see what else was happening back then?  All of the components of VMware Infrastructure are instrumented (aren’t they?) and this information is sent to a remote, central location (isn’t it?) so you can correlate events (can’t you?).  Consider splunk.com.
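To make the idea concrete, here is a minimal sketch of answering "which host was this guest on when the problem happened?" by replaying centrally collected migration events. The event tuple format, the names and the timestamps are all assumptions for illustration; a real log collector would have its own schema and query language.

```python
from datetime import datetime

# Hypothetical migration events: (timestamp, guest, source host, destination host).
events = [
    (datetime(2009, 3, 1, 9, 0), "web01", "host1", "host2"),
    (datetime(2009, 3, 1, 11, 30), "web01", "host2", "host3"),
    (datetime(2009, 3, 1, 10, 15), "db01", "host2", "host1"),
]

def host_at(guest, when, events, initial_host=None):
    """Return the host a guest was on at time `when` by replaying its migrations in time order."""
    host = initial_host
    for ts, name, src, dst in sorted(events):
        if name != guest:
            continue
        if ts > when:
            # This migration had not happened yet; if we have seen no earlier
            # event, the guest was still on this migration's source host.
            return src if host is None else host
        host = dst
    return host

print(host_at("web01", datetime(2009, 3, 1, 10, 0), events))
```

No change ticket is consulted anywhere: the event stream alone is enough to answer the historical question.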

“OK, so we can correlate events.. but we’re not happy with the risk exposure of someone performing a vMotion”

If risk is probability and impact, let’s look at the data.  If a vMotion fails – nothing actually happens.  The only change in service is when the replicant VM becomes active and the external switch receives an RARP.  If a problem occurs with copying memory, the final step never happens and the original VM continues happily along with no downtime.

The reliability of this is 99.999% according to YOUR test data (you tested this, didn’t you?).  The impact of a vMotion failure depends on the workload you are running in that VM.  If you are running “low hanging fruit” then you have a score of low/low, and don’t need a change ticket.
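A minimal sketch of that probability/impact scoring follows. The 0.001 threshold and the function names are illustrative assumptions, not a standard; plug in your own test data and service tiers.

```python
# Assumed cut-off: anything failing less often than 1 in 1,000 counts as "low" probability.
LOW_PROBABILITY_THRESHOLD = 0.001

def risk_score(failure_probability, impact):
    """Return a 'probability/impact' score such as 'low/low'."""
    prob = "low" if failure_probability <= LOW_PROBABILITY_THRESHOLD else "high"
    return f"{prob}/{impact}"

def can_skip_change_ticket(failure_probability, impact):
    """Only a low/low score skips the ticket; anything else goes to your normal process."""
    return risk_score(failure_probability, impact) == "low/low"

# 99.999% reliability means a failure probability of roughly 0.00001.
vmotion_failure_probability = 1 - 0.99999
print(risk_score(vmotion_failure_probability, "low"))
print(can_skip_change_ticket(vmotion_failure_probability, "low"))
```

The impact value has to come from the workload in the VM, which is why the same vMotion can score differently for different guests.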

“But there’s still a human logging on to a system and issuing an instruction – they might get it wrong.”

Ah, needing a change record because it’s a human performing a change.  That I can understand, but is this a change or an operational event?  Are the two distinguished?  Do you differentiate between an operator logging on to run “top” and an operator logging on to run “sudo rm -rf /”?

If your admins log on and issue vMotion commands, even if they vMotion the wrong guests, they cause no impact on service.  We’ve already established that vMotions have a low probability of failure and a low impact on service.

If your admins log on and do something other than vMotion commands, then that is a problem.  If you are following the Visible Ops methodology and you are Electrifying the Fence, then I would put vMotion commands as an allowable operational activity – you aren’t changing any configuration items, just the guest-host managed object map.

Now, can I ask you some questions:

“How long does it take for a change ticket to be approved?”

For a standard change, which is what vMotion might be, it would be three-to-five days from request to approval to scheduling.

“And say I want to perform maintenance on a physical host and need to migrate twenty guests from that host prior to maintenance, so as to provide zero downtime to the business?”

That would be twenty different change tickets so the application owners are notified.

“And say I need to change the memory on twenty physical hosts thanks to a bad batch of DIMMs?”

That would be four hundred change tickets…

“What about if we use Distributed Resource Scheduler (DRS), which automatically moves guests around the hosts in a cluster?”

<Change Manager faints>

The above discussion shows the inflection point when Virtualization meets legacy thinking in Problem I.T.   The impact of the decision here is immense.  If you _do_ require a change ticket for every vMotion, you are putting the brakes on the amount of change you can do.  Visible Ops says changes are “like the brakes that let your car go faster” – but in this case, we have massive disc brakes on a tiny vehicle.

The goal of Great I.T. organizations is more throughput of change for less cost.  It means being able to make changes during the business day through low risk enabling mechanisms like vMotion.  If you can only perform those four hundred changes in weekend change windows, then that’s a lot of overtime and slow throughput.  Urgh!
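To put numbers on that, here is the arithmetic from the conversation above as a quick sketch. Twenty guests per host is an assumption carried over from the earlier maintenance question.

```python
hosts_needing_new_dimms = 20
guests_per_host = 20  # assumed: same guest count as in the single-host example

# One change ticket per guest migration, so application owners are notified.
tickets_for_one_host = guests_per_host
tickets_for_dimm_swap = hosts_needing_new_dimms * guests_per_host

print(tickets_for_one_host)
print(tickets_for_dimm_swap)
```

Each of those tickets then waits the three-to-five days quoted for a standard change before the first guest can move.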

The business wants it, virtualization can do it, but Problem I.T. stands in the way.

There is a solution here:

  1. You don’t require a change ticket for any vMotion on a system that has been tested and proven to be reliable for vMotion events.  If vMotion is dodgy at your site (if it is, you probably have much bigger problems), then that is the exception.  You could say that if you do require a change ticket for vMotion, that says something negative about your wider operational capability….
  2. DRS is safe to use: ask the 130,000 production VMware customers.  If it isn’t safe in your organization, then perhaps you are an exception to the rule.
  3. Don’t rely on change tickets to reproduce a system state: it doesn’t work.  Use something like splunk.com to correlate historical events.
  4. Class vMotion as an ITIL Routine Maintenance task – no change record required, it’s a known, reliable operational activity.
  5. If you do need to know where a guest lives, or where it used to live, use VirtualCenter and an event analyzer like Splunk.

This topic has also been discussed on VMTN with the likes of Jason Boche and Don Pomery chiming in – check out the thread, which dates back to 2006 – but it is *still* a topic for today, because remember we have only virtualized 15% of all servers, and only got to 130,000 customers so far – many of whom still might not be over this particular speed bump.

Here’s another link from itsjustanotherlayer.com about making changes like vMotion without change control and in business hours from an architect at a large financial (with strict change controls, no doubt!) – nice reference to ITIL Routine Maintenance.

13 Comments

  1. duncan says:

    Welcome to the blogosphere mr Chambers,

    A lot of my customers are facing the same issues. I usually tell them to register VMs on a resource pool level or Cluster level in their CMDB instead of linked with a physical host. (resource pools can be linked to a cluster and a cluster can be linked to a host)

    If a VM moves to a different Cluster or Resource Pool it would be most definitely a change in my opinion. This usually also means changes in reservations / shares / datastores etc.

    Duncan

  2. Hany Michael says:

    Great post, unique and promising blog!
    But how in the hell can a blogger like me survive among these kinds of blogs, such as yours, coming every day to the blogosphere?!

    I subscribed to your RSS feed, god dammit!

  3. You’re too kind… perhaps someone stole your session while you were away from your keyboard… if you had a Gravatar I could see it was you ;-)

  4. Exactly! In “standalone” mode then there IS a relationship between a Guest CI and a Host CI. In a cluster, where multiple hosts share backend storage and group their resources for VMs then there ISN’T a relationship between Guest CI and Host CI – the relationship is, as you said, between Guest -> Resource Pool -> Cluster -> Hosts. Sweet, must post about Config Management next ;-)

  5. Very interesting post Steve, the reconciliation between change management and autonomic computing systems such as DRS is an event that has been brewing for years and you’ve nicely highlighted how virtualization will extend the need to deal with this issue out to every IT shop in the next couple of years. It will be a painful readjustment for many.

    There’s just one flaw in your argument regarding change management and non-autonomic systems (i.e. vMotion). What happens if you vMotion a Guest from an under-subscribed server to an over-subscribed server? Application performance takes a hit, the customer is disappointed, and the VI manager is going to have a lot of explaining to do. This is the type of thing that a change management process that is attuned to the challenges/opportunities that virtualization offers should catch. Granted, the majority of change management processes are not well attuned to the agility that virtualization offers. That shouldn’t be an excuse to abandon change management, but it should be a wake-up call showing that current change management processes have been left behind by the new reality that virtualization creates.

  6. Mike Wronski says:

    Great article! I agree that virtualization management can only benefit from taking a look at existing practices before blindly mapping them to virtualization. There are lots of places to gain operational efficiency through its use.

    On a side note, you mention splunk multiple times, Reflex Systems also provides a solution that tracks all the changes in the environment and can recreate history in both textual report or interactive topology for arbitrary points(or ranges) in the past. Making it very easy to see and correlate configuration changes with performance or other issues in the environment. I encourage you to check out the Reflex VMC.

  7. @Mike Wronski
    Hey Mike, I’ll check out Reflex Systems – thanks for the heads up! Who is your alliance manager @ VMware? You can email me as schambers, at VMware’s domain name. :-)

  8. @Simon Bramfitt
    What a great point! This goes back to testing which, IMHO, is rarely done: what happens in your tests when you move a real workload from under- to over-subscribed servers. Some organizations divide up their applications by Tiers/SLAs and guarantee service levels, but I don’t think this is always reflected in their change process… fantastic point and I will noodle on this more, and ask about inside VMware.

    And I’d never say abandon change management, as I’m an avid fan of professional, scientific operations and Visible Ops. :-)

    Thanks, Simon!

  9. This appears to be just another rant by a bigoted virtualization administrator. Clearly, you have no understanding about risk management.

    In some environments, running `debug all’ is equivalent to `debug anything’ and both are considered 3-5 week (not day) change requests for planning. The reason? Because what you consider an “operational event” must have context to the machine. If the machine is keeping people alive, or keeping the business alive, then you have a whole different situation.

    Saying that vMotion/VMware has been fully tested over 3 years demonstrates further bigotry. Java hasn’t been tested enough and it’s been 15 years. Microsoft, of course, has Live Migration, in 2008 R2 Beta (which is about as stable as vSphere by any logical account). You must be new to this whole IT thing.

    Quoting VisibleOps and ITIL are great, but you totally forgot about risk management. Also, I doubt you’d ever get an availability return that matches Six Sigma based around the suggestions you’ve made in this article. You’re talking about IT infrastructures that are not reliant on high availability for success.

  10. @Andre Gironda
    Ouch! Sorry you feel I deserve a personal attack, Andre, but hey that’s a “risk” I was willing to “manage” :-)

    I think IT is all about risk management, so I’m sorry I haven’t called it out enough but I can’t address everything in one post…

    I fear that your approach to IT, the paranoid screamer, would mean nothing ever gets done by IT. And your conflicting comments, that “Java hasn’t been tested enough in 15 years” whilst on the other hand claiming an unreleased bit of code from Microsoft is as stable as vSphere by “any logical account”, are quite the most outstanding piece of split brain that I’ve ever seen.

    Thanks for the post, would love to see some data from you on the Six Sigma stuff as I’m a great fan of less variance as long as it also increases throughput :-)

    Cheers
    Steve

  11. [...] Chambers – VMotion and change managementLet me ask you a question that a zillion Problem I.T. people have asked me in the past five [...]

  13. Grant says:

    Just stumbled upon this article. I wrestled with this a bit when we re-implemented network change requests. I cringed at the thought of doing one every time I needed to do any maintenance. But as far as reliability of a vmotion, the only time I shot myself in the foot w/ vmware was when I was doing something almost nobody would do… I had p2v’d a node of a file server cluster (including the RDMs). Then I storage vmotioned it. When I invoked the svmotion I lost all cluster shares. But, I was doing it during off hours with an outage window so no harm done. Once it was done, I rebooted and all was fine. I just hadn’t expected that and to be honest, I don’t know why it happened (and still happens).
