Steve Chambers

Cloud formations

Apr 15, 2014

Azure Fabric Controller Internals

Source video on Channel9. Presenter: Igal Figlin, Sr Program Manager Lead.

Scope and Agenda

  1. Understanding how Azure Fabric Controller works
  2. Understanding update and fault recovery of Azure Compute Services
  3. Understanding how to design Azure Compute Service for HA and recovery

Note that the PaaS bit lasts in the video until around 28 mins, and then it gets more detailed on IaaS after that. Total time is 42 mins.

Windows Azure

  • It’s an OS for the DC
      ◦ Resource management, provisioning, monitoring, app lifecycle
  • Common building blocks for distributed apps
      ◦ Compute resources
      ◦ Queuing, simple structured storage, SQL storage
      ◦ App services like access control, caching and connectivity
  • The Fabric Controller (FC) manages compute infrastructure
      ◦ Deploys and manages compute services
      ◦ Manages infrastructure (h/w + s/w) and recovery
      ◦ Drives infrastructure updates

Deploying a service to the cloud (10,000 ft view)

  1. Develop using IDE like Visual Studio (not restricted to it).
  2. Upload the package to the Azure portal - from the IDE, PowerShell, or the API
  3. The package is passed to “RDFE” (Red Dog Front End), which decides where it should run and what resources are needed, then hands it to the right Fabric Controller, which “executes” the package on local resources

Datacenter Architecture

  • A region can span multiple DCs and multiple racks; this layout is very common but large scale
  • The unit of granularity is the cluster, which is a management domain (one FC per cluster); a fault domain is a rack inside the cluster, and a cluster is multiple racks. Racks don’t have redundant components in them.

Inside a cluster

  • Each cluster is managed by the FC
      ◦ FC manages DC h/w and services, allocates resources and manages life cycles
  • FC is a distributed stateful app running across multiple servers across multiple racks, in master/slave mode (one master, many slaves)
  • Each rack is independent (has ToR switch)

Inside a physical server

  • Resources are committed when allocating the service (i.e. each server is sliced up)
  • The FC host agent runs at the hypervisor level
  • Each VM has a guest agent

Leveraging fault domains

  • a single rack is a fault domain, and a fault domain is a physical unit of failure
  • node healing is moving VMs off the faulted server

PaaS leveraging fault domains

  • FC deploys the role instances in at least 2 fault domains, so across two racks minimum
  • You can’t control the instance-to-domain mapping, but you can query it via the portal/API
  • Queuing can be defined between layers; the default is a load balancer

PaaS leveraging update domains

  • UDs control how the service is updated
      ◦ User initiated - the PaaS service owner updates the service package or chooses a different guest OS
      ◦ Platform initiated - guest OS security patch, hypervisor update
  • Role instances are assigned to different UDs - from 5 to 20 - which spreads updates across your app
  • If you don’t have multiple role instances then you don’t get the 99.95% SLA
  • PaaS mapping of instances to UDs/FDs is done automagically, in a cyclical manner
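The cyclical mapping described above can be sketched in a few lines. This is an illustration of round-robin assignment, not Azure's actual allocator; the default counts (2 fault domains, 5 update domains) are assumptions taken from the notes above.

```python
# Sketch of cyclical FD/UD assignment for PaaS role instances.
# fault_domains=2 and update_domains=5 mirror the defaults described above.
def assign_domains(instance_count, fault_domains=2, update_domains=5):
    """Assign each role instance to a fault domain and update domain round-robin."""
    return [
        {"instance": i, "fd": i % fault_domains, "ud": i % update_domains}
        for i in range(instance_count)
    ]

# Six instances land evenly across both racks and six of the update domains.
for placement in assign_domains(6):
    print(placement)
```

With two or more instances, no single rack failure (FD) or update pass (UD) can take out every instance at once, which is what the SLA depends on.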

PaaS auto update of service

  • Method 1 - define the update mode
      ◦ Auto - UD walk performed by the FC when the package is uploaded
      ◦ Manual - call “Walk Upgrade Domain” for each UD
      ◦ Simultaneous - ignores UDs and blasts the whole service at once, e.g. an urgent hot fix when the service is down
  • Method 2 - swapping staging/production
      ◦ Define a staging deployment and swap between production and staging
      ◦ Allows testing before flip-flopping
  • Changing service size (autoscaling)
      ◦ Use “Change Deployment Configuration” with a new instance count
      ◦ Use “Delete Role Instances” when scaling down
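The manual mode above amounts to a walk loop: trigger one UD, confirm it came back healthy, then move on. A minimal sketch, assuming a hypothetical `client` wrapper; `walk_upgrade_domain` and `get_role_instances` are stand-in names for illustration, not real SDK calls.

```python
# Sketch of Method 1's manual mode: walk one update domain at a time.
# `client` is a hypothetical wrapper around the management API; the method
# names here are assumptions, not the real Azure SDK surface.
def manual_update(client, service, deployment, update_domain_count):
    for ud in range(update_domain_count):
        client.walk_upgrade_domain(service, deployment, ud)
        # Only proceed once every instance in this UD reports Ready, so at
        # most one UD's worth of capacity is ever offline at a time.
        for instance in client.get_role_instances(service, deployment, ud):
            assert instance["status"] == "Ready", f"UD {ud} not healthy"
```

The pause-and-verify step is the whole point of manual mode: auto mode does the same walk, but on the platform's schedule rather than yours.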

IaaS leveraging Availability Sets

  • App is a web / app / db architecture
  • Each s/w component is redundant
  • SLB at the front
  • Queuing or SLB between the front and middle tiers
  • Use an affinity group to gain physical proximity
  • Data layer is SQL Azure, but could be a storage layer
  • An availability set allocates VMs to mitigate failures and updates - FDs and UDs
      ◦ Set via portal or API
      ◦ Required for the 99.95% SLA
  • No correspondence between FDs in different availability sets
      ◦ Use queuing or SLB between sets/FDs

IaaS Update the service and infra

  • Azure doesn't update IaaS - it's up to the user/admin

HA IaaS VMs Usage Guidance

  • Analogous to deploying different PaaS services for each tier
  • Update strategy clear upfront
  • Use Availability Sets to get platform scenario working
  • Do not use single instance availability sets for prod apps
  • Each availability set is independent from an infrastructure standpoint
  • Mix PaaS roles and IaaS availability sets as needed
  • Use affinity groups to enforce proximity

Defining/Updating VMs in Availability Sets

  • Update one update domain at a time
  • When removing/restarting/shutting down VMs, make sure remaining VMs are evenly distributed in FDs and UDs
  • Prepare for / detect a platform update happening in parallel; same for h/w failures
      ◦ Validate VM status before walking to the next UD
      ◦ 3 UDs will minimise the collision risk with a platform update
  • Single IaaS instances will get a notification before the update
  • Add service autoscaling
      ◦ “Capture” a role from an existing stopped VM, or pre-create it, then “Add” a new role from it
      ◦ “Shutdown” / “Delete” the role when scaling down

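The even-distribution rule above can be checked mechanically before shutting anything down. A minimal sketch; the `fd` labels on each VM are illustrative, not pulled from a real API.

```python
# Sketch of the even-distribution check: before removing a VM from an
# availability set, verify the survivors still spread across domains.
# The "fd" field on each VM dict is an assumed label for illustration.
from collections import Counter

def is_evenly_spread(vms, key="fd"):
    """True if per-domain VM counts differ by at most one."""
    counts = Counter(vm[key] for vm in vms)
    return max(counts.values()) - min(counts.values()) <= 1

vms = [{"name": "web0", "fd": 0}, {"name": "web1", "fd": 1}, {"name": "web2", "fd": 0}]
remaining = [vm for vm in vms if vm["name"] != "web2"]  # shut down web2
print(is_evenly_spread(remaining))  # one VM survives in each fault domain
```

The same check works for update domains by passing `key="ud"`; run it for both before each removal or restart.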
Nov 24, 2013

Cloud is defined by its service boundaries, not by the technology

Humans are obsessed with creating order out of chaos. Some people believe each life has a deterministic fate line; others believe they can prevent entropy by writing Word documents and expecting other people to read them and act upon them (read: #ITIL). The latest phenomenon is the desire to sharpen the pencil once more and define #cloud.

The immediate problem is that the name #cloud was picked for a reason. In the past, technology architects used the icon of a cloud to cover the non-interesting and/or non-focus part of an architecture, an example being an MPLS network provided by a 3rd party. No point in going into detail on MPLS because this diagram is about virtual desktops. Or is there?

I used to work at a company called #Loudcloud. Our business was providing a service to businesses where we offered a 100% uptime SLA and other great features (including the ability, via a portal, to push your own code releases), but what we didn't explain was how we did it. That was our business. You bought an SLA, not 10,000 architect hours to explain every minute detail of which compute blade we bought or how we configured it. Our service was a cloud, and the boundary was defined by the service characteristics (think: SLA).

The good definitions of #cloud are those that describe the boundaries. Think NIST. Do they tell you which hypervisor to use when providing a cloud? No. Because a hypervisor isn't required, for one thing, never mind which hypervisor (MAAS and GCE to name two obvious examples where no hypervisor is required). Does NIST tell you to use Puppet, Chef or DevOps? No. They talk about how you access it, and the service characteristics, albeit at a (purposefully) abstract level. It's the same for analysts - Lydia at Gartner who does the IaaS MQ talks about the service, not the technology details. This is a Good Thing.

So, the next time somebody tells you a #cloud is defined by some kind of internal characteristic, like using OpenStack, or a specific implementation, like VMware's new Hybrid Service - not picking on them, just selecting from a sea of examples - then you can consider that "advisor" as not really getting cloud. If you ever catch yourself saying "It's not cloud" (and we've all done it, hand on heart) then catch your bad self and think twice: are you being a human control freak just living out your hardwired nature and putting your own restrictions on cloud?

Nov 22, 2013

Cloud Striping

You know what RAID is. Why not apply it to the cloud?

If you can access cloud storage from multiple providers via HTTP, then what's to stop you spreading your data over multiple providers?

All you need is a "Redundant Array of Inexpensive Clouds" controller - hey! why not call it a broker? - and you write to its API (put object) and in turn it stripes your data across AWS, HP and Google.

Isn't that the ultimate availability solution? Cloudstripazilla!

The key is the broker tech. Hmmmm...
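The broker tech isn't much more than a fan-out layer over several object stores. Here's a toy sketch of the idea; since the goal is availability, it mirrors each object to every provider (closer to RAID-1 than RAID-0 striping). The dict-backed "providers" are stand-ins for real object stores, not any real SDK.

```python
# Toy sketch of the "Redundant Array of Inexpensive Clouds" broker.
# Each provider is modelled as a plain dict acting as an object store;
# real implementations would call S3/GCS/etc. APIs instead.
class CloudStripeBroker:
    def __init__(self, providers):
        self.providers = providers  # name -> dict acting as an object store

    def put(self, key, data):
        # RAID-1 style: write the full object to every provider.
        for store in self.providers.values():
            store[key] = data

    def get(self, key):
        # Read from the first provider that still has the object,
        # so losing one whole cloud doesn't lose the data.
        for store in self.providers.values():
            if key in store:
                return store[key]
        raise KeyError(key)

broker = CloudStripeBroker({"aws": {}, "hp": {}, "google": {}})
broker.put("photo.jpg", b"...bytes...")
del broker.providers["aws"]["photo.jpg"]  # simulate one provider losing data
print(broker.get("photo.jpg"))            # still served from another cloud
```

True striping (splitting each object into chunks, RAID-0 style) trades that redundancy for capacity; a real Cloudstripazilla would likely combine both, plus erasure coding.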

Nov 4, 2013

When does Hybrid become Normal?

The phrase "hybrid cloud" has been around for some years now (circa 2009?), but it has never meant Just One Thing. In fact it is a many-faceted thing, as we shall explore, but the point here is: if hybrid is what everyone wants, is it the new normal?


In an abstract, general sense it means a mix between on- and off-premise cloud, and/or between private and public cloud. But there is more in there if you care to run a sharp nail across the barely formed surface...


What if the hybrid means just the cloud catalogue is shared across private and public cloud? An example is that you have one catalogue and you choose to deploy a workload (another overloaded term, we'll address that another day) at deploy time to one of the ends of your two-ended hybrid cloud.


Is this a scale-out or scale-up scenario? In the old days of tin, scale up was getting more out of one box, and scale out was adding more boxes. If you look at hybrid cloud as one cloud (old timers read: box), then it's scale up? If you look at hybrid cloud as two clouds (old timers read: boxes), then it's scale out?


What if you have a running workload on one end of your hybrid cloud and you want to migrate it (and possibly the data, but that's yet another topic for yet another day) to the other end? Must you have this capability to really qualify for having a hybrid cloud?

Elastic and Bursting

And what about the obsession with words like elasticity and bursting - must these capabilities, however hilarious, be present to qualify for a hybrid cloud?

Multi-leaf hybrid cloud

And what about if you want more than two ends in your hybrid cloud - are you now over qualified? What if I have three flavours of private cloud and six flavours of public, divided into Aristotelian buckets depending on their price, performance, security, sovereignty and any other criteria I care to apply?

Hybrid is the new normal

And lastly, to my main point - cloud is by its very nature undefined (that's a good thing, but not for you Completer Finishers out there), and that means it could be, amongst other things, singular or plural. From where I sit @canopycloud I have yet to meet a customer that wants to use just one cloud leaf node: all customers (no exceptions) do not want to be restricted to one cloud island, they all want their cloud (yes, their cloud, because cloudiness is in the eye of the beholder) to be a hybrid of many clouds.

And if everyone wants the same thing, and that thing becomes ubiquitous, the new normal, then there's no need to give it a special name anymore. All clouds can be plural, can be composite, and can be hybrid. There's no special magic anymore, it's a Must Have Feature (just depends on how, at the Demand Side - another topic for another day - you implement it).

Is it not the case that if your cloud is singular then it's like a yesterday cloud? If all the cool kids are composing their applications across multiple cloud targets, and the Old Timers are migrating their Cobol apps into clouds, then perhaps the new normal is Hybrid and instead we should drop the Hybrid and start highlighting "Monocloud" as the old normal?