Anecdotes from the trench: why manufacturing is better than on-site build

Extreme therapy

I’m giving a presentation to a VCE global partner later today and I’m going to start it with an anecdote from my times in the trench to remind people of just one, personal reason why Vblocks are a good idea.  I think it’s worth sharing beyond these private meetings, if only for my own therapeutic reasons!

About three years ago I was part of a team to build one of the very first Vblocks in the UK.  This was right at the start of VCE’s genesis, while I was at Cisco.  There was no mature manufacturing facility back in those early days, and staff was very light compared to now.  We had design guidance from the other Vblocks that had been deployed, a reference architecture if you will, but it was a long way from today’s VCE capability.

I’ve been herding cats most of my life and this project wasn’t any different: the virtual team was created (VCE had one employee in Europe, and he was a sales guy) from Cisco, EMC, VMware, a large IT service provider and the customer teams themselves.  Let’s round it at twenty people, from different companies, in different parts of the UK, with different skillsets.

At the board level of the customer a commitment had been made to have the new, business critical service live before the end of the year.  It was spring at the time, so we had about six months to get it delivered, integrated and fully operational.

We held long meetings trying to get different people to communicate.  It wasn’t that people were hostile (the team spirit was excellent); it was that there are three basic problems with getting different IT people to work together (even when they like each other, so imagine if they don’t!):

  1. They speak a different spoken language, for example using different acronyms.  If people don’t understand, they stop listening.
  2. They have a different unspoken language, in that they make assumptions of what others know because “it’s obvious” (to them).  This means stuff gets missed.
  3. They have different priorities, because surely the network is the core of everything and everyone knows the cloud is about storage, right?  This means stuff gets missed.

So imagine trying to agree on a full-stack design guide (application down to power sockets).  This is before any real work has started!  (SHOCK: Drawing a Visio and writing a PDF is not real work.  You heard it here first.)  This is a design that makes simple assertions as to what the solution will consist of, in terms of components, how they are assembled and configured.  It also dictates integration points and brushes the cheek of operations, but daren’t a kiss.

Once you’ve stumbled, bloodied and battered out of the design sessions with a PDF in your hand, you now need to work out how to actually build it!  Have the experts configured a Nexus 5020 in this way before?  Have they cut a VMAX binfile for this configuration before?  Have they integrated a 6140 with a Cat6509 before?

The problem with being all shouty, either as a vendor or consumer, about using the latest, shiniest technology is… very few people have done it in anger before.  You are likely to be the first to do this in production.  There aren’t lots of “best” or “proven” practices except those that were done in lab conditions (probably with different configurations and firmware to your scenario).  There aren’t loads of forum entries you can google for “Error 5230”.  So what happens is you learn on the job.  You build it live, test it, then document it afterwards.  Remember we have a deadline here.

What’s more, if you are building this new, shiny tech with a newly formed virtual team in the customer’s datacenter, you are most likely limited as to who can access the infrastructure.  That means you have to be really good at teamwork, because those twenty folks with their own domain knowledge have to cram it into the heads of the few who are approved to enter the datacenter and do the actual build.  Yes, we are now in the Olympic relay, passing PDFs instead of batons.

Let me give you a personal account of why this occasionally fails badly: I worked on the northbound network connectivity and the Nexus 1000V virtual network, with help from domain experts in Cisco.  The design looked great.  Nice Visio, using lots of colours and cool fonts, with obsessive-compulsive level build instructions tacked on: what could possibly go wrong?  Enter the relay!

We handed the instructions to the one approved guy who could go and do the work, a man who hadn’t been involved in any of the discussions so far.  He went and did his cabling job (under change control of course) and reported back a day or so later that it was done.

Me:  Create the Nexus 1000V VSM, OK.  Add the VEM to a host, OK.  Move the first NIC from the standard vSwitch to the VSM, OK.  Move the second NIC from the standard vSwitch to the VSM… and the host disappears from the VSM.  What on earth is happening?  Check all the configs.  Cisco Discovery Protocol (CDP) came to the rescue and we learned that the northbound cabling was wrong.  We called the guy up.

Me: “How did you cable the northbound connections?”

Him: “I just connected the fabrics to the switches.”

<I’ve cut out a lot of useless to-ing and fro-ing here, which felt like hammering nails into my skull (see pic)>

Me: “Did you follow the network diagram and the build guide?”

Him: “What diagram and build guide?”

Yes, the switches were misconfigured and miscabled.  I take part of the blame for not stapling the guide to the guy’s forehead.  Next time I will be smarter…  It all had to be done again, following the guide this time.  We lost about a week doing this (don’t forget change control!).  How do you explain to a senior executive that his business app will be late because someone didn’t put the right SFPs together?
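For what it’s worth, the check that CDP let us do by hand (compare what the build guide says should be plugged in where against what each port actually sees on the other end) can be sketched in a few lines of Python.  This is only a minimal illustration, not what we ran at the time: the device names, port names and the find_cabling_errors helper are all made up, and it assumes you have already collected the CDP neighbour information into a dictionary by whatever means you prefer.

```python
from typing import Dict, List, Tuple

# A link end is (device name, port name). All names here are hypothetical.
LinkEnd = Tuple[str, str]


def find_cabling_errors(
    planned: Dict[LinkEnd, LinkEnd],
    observed: Dict[LinkEnd, LinkEnd],
) -> List[str]:
    """Compare the build guide's cabling plan with observed neighbours.

    planned:  local port -> neighbour port the build guide expects
    observed: local port -> neighbour port actually seen (e.g. via CDP)
    Returns one human-readable message per mismatch or missing link.
    """
    errors = []
    for local, expected in sorted(planned.items()):
        actual = observed.get(local)
        if actual is None:
            errors.append(
                f"{local[0]} {local[1]}: no neighbour seen, "
                f"expected {expected[0]} {expected[1]}"
            )
        elif actual != expected:
            errors.append(
                f"{local[0]} {local[1]}: connected to {actual[0]} {actual[1]}, "
                f"expected {expected[0]} {expected[1]}"
            )
    return errors


# Made-up plan: each fabric should be cabled to both upstream switches.
planned = {
    ("fabric-a", "Eth1/1"): ("switch-1", "Te1/1"),
    ("fabric-a", "Eth1/2"): ("switch-2", "Te1/1"),
    ("fabric-b", "Eth1/1"): ("switch-1", "Te1/2"),
    ("fabric-b", "Eth1/2"): ("switch-2", "Te1/2"),
}

# Made-up observation: fabric-b's uplinks are swapped and one link is dead.
observed = {
    ("fabric-a", "Eth1/1"): ("switch-1", "Te1/1"),
    ("fabric-b", "Eth1/1"): ("switch-2", "Te1/2"),
    ("fabric-b", "Eth1/2"): ("switch-1", "Te1/2"),
}

for message in find_cabling_errors(planned, observed):
    print(message)
```

The point is not the code, it’s that the build guide only helps if something (or someone) actually compares it with reality before the next team builds on top of it.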

The moral of the story is this: building converged infrastructure is hard because of people, not because of complex technology, even when you appear to have all of the knights of the realm sat at your table.  If it’s a new team, if it’s new tech, and you’re doing it on a customer site, please take off those rose-tinted glasses and your superhero underpants: they won’t help you.

That’s why I jumped at the chance to join VCE: I couldn’t go through that again, so now all the hard work is done in a manufacturing facility.  It’s shipped WORKING, and then you can give ops a full kiss on the lips instead of a brush on the cheek and get on with the real work of consuming your converged infrastructure, instead of fussing about patching leaks.

You. Have. Been. Warned.
