Stop the IT shrieking

Do you remember when we used to blame the network for everything? Slow database transactions.  Slow web pages. Missing beer from the fridge.  Those pesky bits of string are so unreliable.  “It’s the network!” we shrieked in unison, throwing up our hands twice as high as our eyebrows.

But IT moved on and we started to blame someone new: “It’s VMware!” we shrieked.  That sandwich of mysteriousness (no mayo!) is now the first suspect in any and all IT problems.

Nowadays I’m plying my trade at Cisco in the Unified Computing System practice in Europe, and I have a hunch that the shrieking is about to move on from VMware to UCS.  “That’s the last change we made!” shriek the web admins, “burn the UCS!”

Like all superstitions this shrieking is based on ignorance, and the best way to bring silence and calm to the shrieking villagers, with their waving pitchforks and burning torches, is to present data about what’s really happening.  To quote an old, mud-covered VMware colleague (the inimitable Lance Berc): “Without data there is only conjecture”.  There’s only one way to reason with an IT guy who has a bad IT hangover: you need to show him the hard facts and possibly even a few home truths, in the case of our favourite SQL coders who favour “SELECT * FROM ALL”.  And that’s where Splunk comes in.

Picture a Unified Computing System with seven logical layers, where each of these layers produces multiple sources of events: SNMP traps, logs, CIM XML – all kinds of formats over all kinds of protocols, across long timelines, at every layer.  Are we really managing this with separate teams and separate tools?

Problems at any layer try to percolate to the top, like the bubbles in your hangover remedy, all the way up to the user of the service: that mistaken “sudo rm -rf” on the web server is soon going to impact the public-facing web service.

The race is on for you in IT to find and fix that problem before it percolates to the top.  If it gets to the top, impacts the end user and the phone rings, you’d better have a fast way to (a) explain what went wrong, and (b) fix it, because someone is going to shout “Your network/VMware/UCS is broken AGAIN!”

To explain what went wrong, you need to do Fault Isolation.  To fix it, you need to do Root Cause Analysis and create a Work Around or Fix.  If you are an ITIL phr34k then we are talking Incident and Problem Management, and the mother of all ITIL books: Continual Service Improvement.  All of this is unplanned work, and unplanned work = lost, wasted, never-get-back-again, Gladiator-feeding, money.

To do fault isolation across seven layers by hand, you will need to log in to every component on every layer to try to find that smoking gun.  That can take hours if not days of unplanned work by your best and most expensive staff, and you haven’t even started fixing the problem yet.  With Splunk, you do this in minutes, and if you’ve isolated a fault once, you can save that successful procedure for next time ;-)
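To make the idea concrete, here is a minimal sketch of what a centralized timeline buys you.  This is plain Python, not Splunk’s API or search language, and everything in it is an assumption made for illustration: the Event class, the merge_timelines and events_before helpers, the layer names and the sample events are all invented.  It merges per-layer event lists into one timeline and pulls out whatever happened in the few minutes before the phone rang.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    layer: str      # hypothetical layer label, e.g. "network" or "os"
    message: str


def merge_timelines(*streams):
    """Merge per-layer event lists into one time-ordered list.

    Each input list must already be sorted by timestamp.
    """
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def events_before(timeline, incident_time, window):
    """Return events in the window leading up to (and including) the incident."""
    start = incident_time - window
    return [e for e in timeline if start <= e.timestamp <= incident_time]


# Invented sample events from three illustrative layers.
network = [Event(datetime(2024, 1, 1, 9, 0), "network", "link flap on uplink")]
hypervisor = [Event(datetime(2024, 1, 1, 9, 58), "hypervisor", "VM migrated")]
os_logs = [
    Event(datetime(2024, 1, 1, 9, 30), "os", "cron job finished"),
    Event(datetime(2024, 1, 1, 9, 59), "os", "sudo rm -rf run on web server"),
]

timeline = merge_timelines(network, hypervisor, os_logs)
incident = datetime(2024, 1, 1, 10, 0)   # the moment the phone rings
suspects = events_before(timeline, incident, timedelta(minutes=5))

for e in suspects:
    print(e.timestamp.isoformat(), e.layer, e.message)
```

With the sample data, the five-minute window leaves two suspects out of four events, and the network is not one of them.  The point is not the code: it is that once every layer’s events sit on one timeline, “what changed just before it broke?” becomes a query rather than a login marathon.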

The presentations I saw at Splunk Live by Splunk customers showed massive improvements in service management and demonstrated that these customers (and some were telcos, whose business depends on uptime) are turning from low/med performing IT organizations into high performing IT organizations.  Splunk is enabling better operational controls: it’s bringing security and operations together, which is ideal (just ask Gene Kim and Kevin Behr!).  And you know what else about Splunk?  It’s making IT people smile again.

Selfishly for me, and my customers, Splunk is going to remove a barrier to Unified Computing: if today you only have manual methods for fault isolation and RCA, then you are in the arena of “bad practice”.  Virtualization and UCS will only accelerate this bad practice.  The technology has changed, the process needs to change, and you need to change.

If you want stateless, scalable, unified computing where a homogeneous infrastructure runs heterogeneous applications and services, then you NEED (not want) a centralized event aggregator like Splunk, or that headache is going to last all day and get worse at about 3pm when the CEO calls to ask why he’s paying you so much money to do a job you are patently not doing.
