Azure Users Seeing Red: When Patching the Cloud Causes Cracks
No, this isn’t one of those posts that suggests we can’t depend on the Cloud just because of one (ok, many) outages of note lately. That’s so dystopic. Besides, everyone else is already doing that.
I mean, Azure being offline for 22 hours isn’t cause for that much concern, right? It’s a beta community technology preview, anyway… Just like Google’s a beta.
What I found interesting, however, was what Microsoft reported as the root cause of the outage:
The Windows Azure Malfunction This Weekend
First things first: we’re sorry. As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime. Windows Azure storage was unaffected.
In the rest of this post, I’d like to explain what went wrong, who was affected, and what corrections we’re making.
During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail.
You catch that bit about “…a routine operating system upgrade”? Sometimes we call those things “patches.” Even if this wasn’t a patch, let’s call it one for argument’s sake, okay?
It reminded me of a blog post I wrote last year, “Patching the Cloud,” in which I squawked about my concerns regarding patching and change management/roll-back in Cloud services. It seems apropos:
Your application is sitting atop an operating system and underlying infrastructure that is managed by the cloud operator. This “datacenter OS” may not be virtualized or could actually be sitting atop a hypervisor which is integrated into the operating system (Xen, Hyper-V, KVM) or perhaps reliant upon a third party solution such as VMware. The notion of cloud implies shared infrastructure and hosting platforms, although it does not imply virtualization.
A patch affecting any one of the infrastructure elements could cause a ripple effect on your hosted applications. Without understanding the underlying infrastructure dependencies in this model, how does one assess risk and determine what any patch might do up or down the stack? …
Huh. Go figure.