Data Center 3.0 Scientific Method

Courtesy of Mike Tarrani - mike@tarrani.net
In my early days at VMware (2005-ish), customers were often surprised (sometimes pleasantly) to find that I was more interested in their processes than their technology.
I learned pretty early in the game that the secret to successful virtualization was the evolution of the processes that surround it, and that this evolution had been captured nicely by the good folks at Carnegie Mellon in the form of the Capability Maturity Model. While at VMware I worked with the rest of the team, notably Rich Hogan, to create a virtualization version called the VMware Maturity Model.
If you’re not familiar with the CMM or VMM, and/or perhaps are thinking “here we go, here comes the ivory tower consulting BS” – please stay with me, and see what a powerful tool this is to understand the lay of the land.
We listed all the processes that surround virtualization and used a common language (ITIL) to apply them to enterprise organizations: the goal was to (a) understand the situation in an organization, and then (b) work out what the goal was, before (c) developing a roadmap of work packages to get to that goal.
What we’d often find was that for non-virtualization platforms, say the mainframe, processes were often very mature (sometimes level 5 – continuously improving), but in the same organization the same process on the VMware platform was barely started (therefore a 1 – initial / ad hoc). This helped us work out what to aim for with the VMware platform: if your organization was awesome at change management, then we had to make it awesome for VMware. If your org was rubbish at Configuration Management, then maybe that wasn’t a priority from day 1. From this thinking, you can see that working out what to focus on wasn’t easy because there was often _a lot_ to do.
So in effect, for each process we had a simple triangle that showed “non-VMware measure, VMware measure, target measure” – and when drawn up on a whiteboard, you got a good feel for where the processes were, what the target was, and what needed to be done.
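As a rough sketch of that whiteboard triangle, assuming each process is scored on the 1 to 5 maturity scale, you can hold the three measures per process and sort by the gap between the VMware measure and the target. The class name, process names and scores below are made up for illustration; they are not from a real assessment.

```python
from dataclasses import dataclass


@dataclass
class ProcessMaturity:
    """One 'triangle': three maturity scores (1-5) for a single process."""
    name: str
    non_vmware: int  # maturity on the existing platform (e.g. mainframe)
    vmware: int      # maturity of the same process on the VMware platform
    target: int      # agreed target for the VMware platform

    def gap(self) -> int:
        # Levels still to climb on VMware; never negative.
        return max(self.target - self.vmware, 0)


# Hypothetical scores, for illustration only.
processes = [
    ProcessMaturity("Change Management", non_vmware=5, vmware=1, target=4),
    ProcessMaturity("Configuration Management", non_vmware=2, vmware=1, target=2),
    ProcessMaturity("Incident Management", non_vmware=4, vmware=3, target=3),
]

# Largest gap first: a starting point for discussion, not a final priority list.
by_gap = sorted(processes, key=lambda p: p.gap(), reverse=True)
for p in by_gap:
    print(f"{p.name}: non-VMware {p.non_vmware}, VMware {p.vmware}, "
          f"target {p.target}, gap {p.gap()}")
```

Note that the target is set per process: a process the organization is already awesome at elsewhere gets a high target, while a weak one may get a modest target to begin with.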
I added a simple self-assessment to VIOPS, so you can get a feel for some of the questions that help you understand where things are today.
The second important part of this process maturity work is deciding on what the target processes should look like, and this is where things get interesting. As an example, would you think that someone building an ESX server with a CD and manual script was less mature than someone building ESX with an automated tool and PXE? The answer is: it has nothing to do with the tool, as long as the process meets requirements (e.g. you can build enough ESX servers to meet demand) and it is at the capability level you require – someone using PXE build can be less mature than the manual script if they get inconsistent results, such as two people in the same team “doing their own thing”.
I used to always say that getting to VMM 3 was essential for all processes. VMM 3 meant that you were doing each process in a repeatable, consistent manner (same port group names, for example!) and that these were documented in a manner such that The New Guy could get up to speed quickly and deliver the same consistent results.
Our challenge when it came to applying what we had learned was: where do we start? Often we’d identify around sixty activities required. We used a common-sense approach at first to say: what is a simpler fix with a good return?
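One way to make the "simpler fix with a good return" question concrete is to give each activity a rough effort and benefit score and rank by the ratio. This is only a sketch: the function name, the activities and the scores are hypothetical, and in practice the scoring is a judgment call made with the team.

```python
def rank_quick_wins(activities):
    """Sort activities by benefit per unit of effort, highest first."""
    return sorted(activities, key=lambda a: a["benefit"] / a["effort"], reverse=True)


# Hypothetical activities with rough 1-5 scores for effort and benefit.
activities = [
    {"name": "Automate ESX builds with PXE", "effort": 5, "benefit": 5},
    {"name": "Standardize port group names", "effort": 1, "benefit": 4},
    {"name": "Document the ESX build process", "effort": 2, "benefit": 5},
]

ranked = rank_quick_wins(activities)
for a in ranked:
    print(f'{a["name"]}: benefit/effort = {a["benefit"] / a["effort"]:.1f}')
```

With sixty or so activities, a ranking like this will not settle the order, but it does surface the cheap wins to do first.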
Even if we’d managed to organize the activities into some order, when it comes to applying the changes we found that the VMM was upside down - at VMM 4/5 you are using metrics to continuously improve processes, but to get to that you have to develop processes to start with. But in reality, some kind of process already exists (maybe not specifically for VMware – yet), so you are always doing level 5, right? In reality, away from the theory, you want to be putting a metric in place from the start of the change so you can measure your improvement…right?
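Putting a metric in place from the start can be as simple as recording a baseline before the change and comparing it afterwards. The sketch below assumes a "lower is better" metric (hours to build an ESX server); the function name and the numbers are invented for illustration.

```python
from statistics import mean


def improvement(before, after):
    """Relative reduction in the mean of a lower-is-better metric."""
    baseline = mean(before)
    return (baseline - mean(after)) / baseline


# Hypothetical measurements: hours to build an ESX server.
before_change = [6.0, 8.0, 7.0, 7.0]  # baseline, recorded before the process change
after_change = [3.5, 3.5, 3.5, 3.5]   # same metric, recorded after the change

print(f"Build time reduced by {improvement(before_change, after_change):.0%}")
```

Without the baseline there is nothing to compare against, which is why the metric has to exist before the change rather than being bolted on later.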
To work at levels 4 and 5 you need to be using metrics. Metrics tell you how things are performing, and help you make laser-focused changes: if the metrics are correct. For VMware, I also posted some example metrics that are ITIL related:
After working with Gene Kim and Kevin Behr (of Visible Ops fame), I’m now thinking that a different model is required.
Gene Kim also talks about Security Metrics here.
What working with Gene and Kevin has taught me (they are awesome mentors, and I’m still learning) is that there is always more work than man-hours available, and the way to avoid running around like an ineffective, headless chicken is to understand similar problems in other fields and see how they might apply to IT: check out The Goal and How Toyota Turns Workers Into Problem Solvers – both manufacturing solutions, but eminently applicable to IT.
The goal of process maturity is, therefore, not just maturity for maturity’s sake, but instead to (a) increase throughput, whilst (b) decreasing capital and operational costs: how to be efficient and effective.
Increasing throughput, in IT terms, is throughput of change. This can be planned maintenance, but is also availability of the systems (no throughput when they are down!) and also new projects (time to market). The bigger and better this is, the more IT is contributing to the business, and this needs to be measured in $$$$. Throughput is the Effective part of the equation.
But increasing IT’s contribution in terms of throughput is for nought if it costs the earth to do so. The challenge here is to be efficient: reducing total cost, even though this might mean investing in some areas. Reducing IT spend by 10% across the board, blindly, is not a recipe for success because you could be reducing Throughput/effectiveness by 10% or more!
The magic IT recipe I’m working on is to actually drop the VMM idea – everyone should be working at level 5 from day 1. Metrics should be in place from the start, not as an after-thought, and to get great processes in place is a scientific exercise in hypothesis (defining the process), scientific method (applying the process) and continuous improvement (improve the person doing the job, and the process/system).
I think IT folks, like VMware Certified Professionals and their colleagues in Network and Storage and Security, are solving problems every day but that their work is under-valued and sometimes chaotic: people get rewarded for heroics and running around with hair-on-fire, which is the opposite of what we want (calm, predictable operations).
For data center 3.0 there is a warning to IT organizations from the experience of GM and Toyota: GM went to see how Toyota ran their plants and how they could become as efficient. What did they do? They bought the same robots, but didn’t adapt the culture (scientific approach to problem solving), and guess what: they spent $billions but were no more effective in terms of throughput. Data center 3.0 could be the same: but not if the merry band of believers in good2great have anything to do with it!
