
Azure Users Seeing Red: When Patching the Cloud Causes Cracks

No, this isn’t one of those posts that suggests we can’t depend on the Cloud just because of one (ok, many) outages of note lately.  That’s so dystopic.  Besides, everyone else is already doing that.

I mean just because Azure was offline for 22 hours isn’t cause for that much concern, right?  It’s a beta community technology preview, anyway… 😉  Just like Google’s a beta.

What I found interesting was what Microsoft reported as the root cause for the outage, however:

 

The Windows Azure Malfunction This Weekend

First things first: we’re sorry.  As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime.  Windows Azure storage was unaffected.

In the rest of this post, I’d like to explain what went wrong, who was affected, and what corrections we’re making.

What Happened?

During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues.  This caused a large number of servers to time out and fail.

You catch that bit about “…a routine operating system upgrade?”  Sometimes we call those things “patches.”  Even if this wasn’t a patch, let’s call it one for argument’s sake, okay?

As such, I was reminded of a blog post that I wrote last year titled “Patching the Cloud,” in which I squawked about my concerns regarding patching and change management/roll-back in Cloud services.  It seems apropos:

 

Your application is sitting atop an operating system and underlying infrastructure that is managed by the cloud operator.  This “datacenter OS” may not be virtualized or could actually be sitting atop a hypervisor which is integrated into the operating system (Xen, Hyper-V, KVM) or perhaps reliant upon a third party solution such as VMware.  The notion of cloud implies shared infrastructure and hosting platforms, although it does not imply virtualization.

A patch affecting any one of the infrastructure elements could cause a ripple effect on your hosted applications.  Without understanding the underlying infrastructure dependencies in this model, how does one assess risk and determine what any patch might do up or down the stack?  …

Huh.  Go figure.  

/Hoff

 

  1. PhilA
    March 19th, 2009 at 20:55 | #1

    Tis a sign of things to come. Great Blog. You have a new reader.

  2. March 20th, 2009 at 00:49 | #2

    When you own the platform, you are in charge of patching, and working out all the dependencies is very difficult.

    At the one place I worked at, the patch we applied killed off our fax software so the company could not send or receive faxes for a while.

    Cloud computing offers the advantage that you no longer need to concern yourself with patches because it's someone else's problem. What a pleasure.

    On the other hand, the company applying the patches does not know very much about your applications, nor care very much and when things go wrong – they go wrong in a big way.

  3. March 20th, 2009 at 04:03 | #3

    Operations rule #1: OS upgrade can't be routine by definition.

    Operations rule #2: Do not perform OS upgrades on Fridays.

    Operations rule #3: Do not perform any prod work on Friday the 13th.

    🙂

  4. March 20th, 2009 at 04:11 | #4

    @Dmitriy HA! Good points all!

    @PhilA Thank you sir.

    @Allen Baranov Indeed!

    My, am I brief in my comments today! 😉
