Stabilizing vSphere
Visible Ops translated for vSphere
I’ve talked at length on this site about how an IT system could be great if a small-but-fundamental change was made by the people who build and run the IT system.
- Good 2 great = IT revolution
- Data center 3.0 Scientific Method
- The VMM is upside down
- Pay IT per caravan, not per hour
I now want to provide an illustration of how this would work for vSphere.
Why vSphere?
When servers are files-on-a-disk rather than tin-in-a-rack, the data center behaves a bit like a non-Newtonian fluid, like custard powder, that goes from solid to fluid depending on the stress applied.
Gene Kim once said that “virtualization amplifies bad practice”, and this is commonly seen out there (just today I saw on Twitter that a customer changed their SRM password, with no change process, and killed their DR environment). Virtual server sprawl, eggs in one basket: there are many examples of why, if you are bad at IT, you’re going to be really bad at virtualization.
VMware’s latest incarnation of the virtual data center OS is another step forward in producing a tool that enables and supports a great IT organization. If IT is like the UK mortgage market, then virtualization is “a deal” vs. the Standard Variable Rate (SVR) “no deal”. If you’re on the “no-deal” you are wasting money, just like if you aren’t virtualizing.
So if virtualization is so good and the IT system is so bad, then we have a brilliant reason for getting IT right: if you have great IT and great virtualization, you have a high-performing IT organization that is likely among the top few performers worldwide today.
Why Change Management?
The Visible Ops bible starts with a first step of Stabilize the Patient. You never have a completely blank canvas to work from: even if you are at a greenfield site, the people brought in to start “afresh” have preconceived working practices that might be bad (sorry, but part of “Good to Great” is being honest and brutal about the facts, whilst being optimistic about a possible outcome).
To stabilize the patient, Visible Ops says:
- Reduce or eliminate access
- Document the new change policy
- Notify stakeholders
- Create change windows
- Reinforce the process
If you haven’t done the above, then you can forget about fault isolation, root cause analysis, release management and all the rest. With uncontrolled operations as the foundation, and with the heavy rain pouring down, your house will collapse (and will outsourcing be seen as a fix?).
Scientific Method
Before going through the list of hypotheses that make up this part of our IT system, first comes the most critical part: how to apply the scientific method. This is the critical difference between great and good organizations. Good organizations paint by numbers using Yet Another Framework (YAF) and never get the picture finished nor become great organizations. Great organizations recognize that the future is uncertain but that there are patterns to success that they can apply, even if they don’t know what the final system might look like (how can you, until you’ve completed a few iterations of the flywheel?).
For each activity that Visible Ops states, for it to apply to vSphere, we need to:
- Assign the work correctly: Make sure the person/people doing the work are adequately qualified and trained
- Write a hypothesis (like a test plan, but it becomes the policy that is proven)
- Test the hypothesis, by applying the method
- Analyze the results, based on our expected results vs. actual
- Either: Iterate again with a new hypothesis / expected results, and/or train up the person doing the work
- Or: Sign off the hypothesis as proven, and establish it as the proven procedure
- Continuous improvements are made by retrieving this procedure and running it through the hypothesis-test-improve-signoff cycle again.
In summary, give someone Smart and Lazy the job, then write-test-analyze-iterate-signoff-continuously improve.
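The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hypothesis` class, the `run_iteration` function and the check names are hypothetical, invented here to show the shape of write-test-analyze-iterate-signoff, and are not part of Visible Ops or any VMware tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    """A written hypothesis: named checks mapped to their expected outcomes."""
    name: str
    expected: Dict[str, bool]
    history: List[Dict[str, bool]] = field(default_factory=list)
    signed_off: bool = False


def run_iteration(hypothesis: Hypothesis, observe: Callable[[str], bool]) -> List[str]:
    """Test the hypothesis once and return the checks whose results differ."""
    actual = {check: observe(check) for check in hypothesis.expected}
    hypothesis.history.append(actual)  # keep a written record of every run
    discrepancies = [
        check for check, want in hypothesis.expected.items() if actual[check] != want
    ]
    # Sign off only when expected and actual results agree on every check.
    hypothesis.signed_off = not discrepancies
    return discrepancies


if __name__ == "__main__":
    h = Hypothesis(
        name="Reduce or eliminate access",
        expected={
            "ops_bridge_reaches_vcenter": True,
            "architect_reaches_vcenter": False,
        },
    )
    # First pass: everything is still reachable, so there is a discrepancy.
    print(run_iteration(h, lambda check: True), h.signed_off)
    # Second pass, after changing the How steps: results match expectations.
    print(run_iteration(h, lambda check: check == "ops_bridge_reaches_vcenter"),
          h.signed_off)
```

The `observe` callable stands in for whatever real test the team performs. The point of the sketch is that every run is recorded in `history`, and sign-off only happens when there are no discrepancies left.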
Let’s see this scientific method in action with the first step of the Stabilizing activity: Reduce or Eliminate Access, by ensuring that only authorized change agents can apply changes to the vSphere assets.
“Reduce or eliminate access” method
The following represents the written hypothesis of the scientific method to reduce and eliminate access. Each item on this list identifies the expected results and should have more detailed implementation steps on how to make the changes.
- Expected Result 1: 3-man team to run the procedure comprising (1) VMware Engineer, (2) Operations Engineer, (3) Supervisor with experience of the scientific method.
- How: Find qualified staff, allocate for 3 days simultaneously, agree exit criteria and quality standards.
- Expected Result 2: Infrastructure (servers, network) is locked down to prevent traffic between unauthorized access points and vSphere assets.
- How: The only authorized client access point is the secure Ops Bridge. Lock down vCenter and ESX access with firewall rules, and use a bastion host.
- Expected Result 3: vSphere assets are owned and controlled by the Ops Bridge in the ESM tools
- How: Assign assets to Ops Bridge in CMDB and assign escalation/approval/authorization to Ops Bridge
The next step is to test the above hypothesis by making the necessary infrastructure, enterprise systems management and change management alterations in the How steps.
Once the changes are in place, more tests should be carried out to ensure that the Expected Results happen. Tests should ensure that what is allowed works and what is not allowed does not. For example: it is true that the Ops Bridge can access vCenter, and it is true that the Architect can’t access vCenter.
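One way to write such allow/deny tests down is as a small table of expectations that gets checked row by row. The sketch below is an assumption-laden illustration: the source names (`ops-bridge`, `architect-desktop`), the target name `vcenter` and the `simulated_probe` function are hypothetical. In a real environment the probe would be a genuine connection attempt made from each source.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical names: the only authorized access point is the Ops Bridge.
AUTHORIZED_SOURCES = {"ops-bridge"}

# (source, target, expected_allowed): one row per written test.
ACCESS_TESTS: List[Tuple[str, str, bool]] = [
    ("ops-bridge", "vcenter", True),          # what is allowed must work
    ("architect-desktop", "vcenter", False),  # what is not allowed must fail
]


def simulated_probe(source: str, target: str) -> bool:
    """Stand-in for a real connection attempt made from `source` to `target`."""
    return source in AUTHORIZED_SOURCES


def run_access_tests(
    tests: List[Tuple[str, str, bool]],
    probe: Callable[[str, str], bool],
) -> List[Dict[str, object]]:
    """Compare the expected and actual result for every row in the table."""
    results = []
    for source, target, expected in tests:
        actual = probe(source, target)
        results.append({
            "source": source,
            "target": target,
            "expected": expected,
            "actual": actual,
            "passed": actual == expected,
        })
    return results


if __name__ == "__main__":
    for row in run_access_tests(ACCESS_TESTS, simulated_probe):
        print(row)
```

Both kinds of row matter: a lock-down that also blocks the Ops Bridge fails the first test just as surely as an open firewall fails the second.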
Note that the key to success here is documenting the tests. If this is done by a 1-man band in a test cell with no written record then, quite frankly, it is a waste of time and will cost more money in re-work later because nobody will know what’s been done.
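A written record does not need to be elaborate. As a minimal sketch, assuming a plain JSON Lines file is an acceptable place to keep it, each test could be appended as one entry. The `record_test` and `read_log` functions and the field names are hypothetical choices for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_test(log_path, tester, test_name, expected, actual, notes=""):
    """Append one test result to a JSON Lines file and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tester": tester,
        "test": test_name,
        "expected": expected,
        "actual": actual,
        "passed": expected == actual,
        "notes": notes,
    }
    # Append mode: earlier records are never overwritten.
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Load every recorded test so the team can analyze the results later."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A call such as `record_test("access_tests.jsonl", "ops-engineer", "architect cannot reach vCenter", False, False)` leaves a trace that the next person can read back with `read_log`, which is the whole point: somebody else knows what has been done.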
After analyzing, if there’s a discrepancy between the Expected Results and the actual results, it’s time to iterate and change the Expected Results (this becomes the policy later), or the How (this becomes the Standard Operating Procedure later), or perhaps the people doing the work (do they need more training?).
At some point (ideally within the 3 days allocated here), the method will have produced a policy (the Expected Results) and procedures (the How steps) that can be signed off.
What next?
The next steps are to expand this hypothesis to cover the other actions required to Stabilize the Patient. At the end of applying this method to implement all the recommendations in your environment, you will be on your way to a great IT organization!
Update – here’s a blog series that records the experiences of one organization that is working through Visible Ops – very cool!