Patching the (Hypervisor) Platform: How Do You Manage Risk?

Hi. Me again.

In 2008 I wrote a blog titled “Patching the Cloud,” which I followed up with material examples in 2009 in another titled “Redux: Patching the Cloud.”

These blogs focused mainly on virtualization-powered IaaS/PaaS offerings and whilst they targeted “Cloud Computing,” they applied equally to the heavily virtualized enterprise.  To that point, I wrote another in 2008 titled “On Patch Tuesdays For Virtualization Platforms.”

The operational impacts of managing change control, vulnerability management and threat mitigation have always intrigued me, especially at scale.

I was reminded this morning of the importance of the question posed above as VMware released a series of security advisories detailing ten vulnerabilities across many products, some of which are remotely exploitable. While security vulnerabilities in hypervisors are not new, it’s unclear to me how many heavily-virtualized enterprises or Cloud providers actually deal with what it means to patch this critical layer of infrastructure.

Once virtualized, we expect/assume that VMs and the guest OSes within them should operate with functional equivalence when compared to non-virtualized instances. We have, however, seen that this is not the case. It’s rare, but it happens that OSes and applications, once virtualized, suffer from issues that cause faults in the underlying virtualization platform itself.

So here’s the $64,000 question – feel free to answer anonymously:

While virtualization is meant to effectively isolate the hardware from the resources atop it, the VMM/Hypervisor itself maintains a delicate position arbitrating this abstraction.  When the VMM/Hypervisor needs patching, how do you regression test the impact across all your VM images (across test/dev, production, etc.)?  More importantly, how are you assessing/measuring compound risk across shared/multi-tenant environments with respect to patching and its impact?
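For what it’s worth, even a partial answer to that question usually involves automating a post-patch smoke pass over a representative sample of images before the patch is promoted tier by tier. A minimal sketch — the inventory, image names, and stubbed health check below are all hypothetical, not any real platform’s API:

```python
# Hypothetical inventory: a representative sample of VM images per tier.
# All names and the check logic are illustrative only.
INVENTORY = {
    "dev": ["dev-web-01", "dev-db-01"],
    "test": ["test-web-01"],
    "prod": ["prod-web-01", "prod-db-01"],
}

def smoke_test(image: str) -> bool:
    """Boot the image on the patched hypervisor and run its health checks.
    Stubbed out here; a real harness would call the platform's CLI or API."""
    return True

def regression_pass(inventory: dict) -> dict:
    """Smoke-test tier by tier; stop promoting the patch if a tier fails."""
    results = {}
    for tier in ("dev", "test", "prod"):
        tier_results = {img: smoke_test(img) for img in inventory.get(tier, [])}
        results[tier] = tier_results
        if not all(tier_results.values()):
            break  # do not expose the next tier to a failing patch
    return results

print(regression_pass(INVENTORY))
```

Note that a harness like this only bounds the blast radius of a bad patch; it says nothing about the compound, multi-tenant risk the second half of the question is really about.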

/Hoff

P.S. It occurs to me that after I wrote the blog last night on ‘high assurance (read: TPM-enabled)’ virtualization/cloud environments with respect to change control, the reference images for trust launch environments would be impacted by patches like this. How are we going to scale this from a management perspective?

  1. April 12th, 2010 at 07:29 | #1

    Yep, the deployment-time benefits of Cloud/Virt are way ahead of the ongoing management capabilities. Which is fine for test/dev but an issue for prod. And remember, it's a lot worse for PaaS:

    http://stage.vambenepe.com/archives/1025

  2. April 14th, 2010 at 05:29 | #2

    Is this unique to cloud infrastructure?

    How would you regression test patches to 1000s of conventional systems?

    How do you measure the impact of any change? For many systems, the usage patterns of the users may be the biggest threat of all; how do you measure those?

    What if you only have one base image (or maybe a small set of base OSes)? And you use configuration management to build out all your systems? And they are automatically added to the monitoring which verifies a system is providing functional services?

    If you are constantly rebuilding systems from source in this manner in your dev/test/prod lifecycle, then how is the $64,000 question different than it would be for any change?

    If you believe that it is different, please help me to see it.

    I'm also interested in your opinions on an ideal process for managing risk in IT systems.

    –Andrew

  3. April 14th, 2010 at 05:50 | #3

    @Andrew Clay Shafer

    The difference (if you want to call it that) is that in a heterogeneous environment, once you add virtualization (in enterprise or cloud) you now have a common software layer which underlies ALL OS/App stacks…and as we've seen, user-mode VM issues *have* affected these underlying VMMs and vice-versa.

    When you mix multi-tenancy and the fact that an IaaS provider is often completely ignorant (and has no visibility) of the OS/App stacks and the customer is completely ignorant (and has no visibility) of the underlying VMM, then the folks responsible for "configuration management…and monitoring" are only seeing half the environment.

    I didn't say anything about NOT using automation. I'm asking how people manage a layer they actually don't manage…wrt public cloud environments specifically…it has implications in private clouds and enterprises too, but I ask you to read those prior posts I linked to for examples.

    Thanks for the comment…don't know if I made things clearer.

    /Hoff

  4. April 14th, 2010 at 06:32 | #4

    That clarifies the question considerably.

    My comment mostly assumed you controlled the whole stack from the application down to the turtles.

    You are posing a question about a much larger issue that most people already face in one form or another.

    How can you manage risk and proactively respond to changes in aspects of infrastructure that you depend on which you don't control? And implicitly, what kind of expectations and responsibilities should be placed on the party which provides and controls that infrastructure vs the responsibility of the consumer?

    In general, the infrastructure provider can make change as transparent as possible and the customer can communicate their business needs, but the 'conversation' approach can only get you so far.

    IMHO, the utopian ideal would be systems that publish changes in a machine-consumable format and expose APIs for the customer to register metadata about how to monitor their services and set up alerting thresholds for things like load and trends.

    Fleshing out a bit more of the 'Infrastructure' in IaaS.

    This doesn't avert all potential for catastrophe, but then what does?
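    To make the idea concrete, a toy sketch of that feed-plus-registration model — every class, field name, and threshold here is hypothetical, and a real provider would expose this over HTTP rather than in memory:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ChangeNotice:
    """A provider-published change event in a machine-consumable form."""
    change_id: str
    component: str               # e.g. "hypervisor"
    scheduled: str               # ISO 8601 maintenance-window start
    advisories: list = field(default_factory=list)

@dataclass
class MonitoringRegistration:
    """Customer-supplied metadata: what to watch and when to alert."""
    service: str
    health_url: str
    load_alert_threshold: float  # e.g. requests/sec

class ChangeFeed:
    """Toy in-memory feed standing in for a provider API."""
    def __init__(self):
        self.notices, self.registrations = [], []

    def publish(self, notice: ChangeNotice):
        self.notices.append(notice)

    def register(self, reg: MonitoringRegistration):
        self.registrations.append(reg)

    def export(self) -> str:
        # Machine-consumable: plain JSON the customer's tooling can poll.
        return json.dumps([asdict(n) for n in self.notices])

feed = ChangeFeed()
feed.publish(ChangeNotice("CHG-001", "hypervisor", "2010-05-01T02:00:00Z",
                          advisories=["VMSA-2010-0007"]))
feed.register(MonitoringRegistration("web", "https://example.com/health", 500.0))
print(feed.export())
```

    The point is only that both halves — the provider's change notices and the customer's monitoring metadata — become structured data that tooling on either side of the VMM boundary can act on.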

    I'm still interested in what you believe is a 'best practice' process approach to managing risk in general. Feel free to write a whole post about it.

    –Andrew

    @littleidea

  5. April 14th, 2010 at 06:43 | #5

    @Andrew Clay Shafer

    I have written about it before. I've presented on it. I've implemented it…as a CISO.

    It's part of the reason I do what I do today. I'll go find the references and send them to you.

    (EDIT: Here's one I wrote from 2005 – http://www.informationweek.com/news/showArticle.j… – Still valid today.

    Also Here's a rather selfish product-centric (when I worked for Crossbeam) paper I wrote on "Unified Risk Management" which tried to unite the control implementation with the notion of risk management through better visibility to threat/vulnerability management – http://www.rationalsurvivability.com/blog/?p=529 )

    Let me just say that I'll touch on a potentially controversial point: automation can increase risk as much as it can reduce it. It's all about the implementation. If history is any coach… ;)

    The point here is not to highlight *my* experiences, I was asking for others.

    /Hoff

  6. April 15th, 2010 at 05:40 | #6

    I use multiple clusters a lot. Besides testing (on a test cluster), we spread the risk by not patching them all at the same time, but within a certain timeframe. The clusters can help out with the load of others when something goes wrong. And in case you wonder, not all clusters are in the same forest/domain, nor are the guests. The isolation is at the hypervisor & networking (VLANs) level.

    As you said, implementation makes the difference. IT is risk management.
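    The staggered, spread-over-a-timeframe approach described above can be sketched as a simple wave scheduler — the cluster names and the 72-hour window below are made up for illustration:

```python
from datetime import datetime, timedelta

def patch_waves(clusters, start, window_hours, test_cluster="test"):
    """Schedule the test cluster first, then spread the remaining
    clusters evenly across the maintenance window so a bad patch
    never hits every cluster at once."""
    others = [c for c in clusters if c != test_cluster]
    step = timedelta(hours=window_hours / max(len(others), 1))
    schedule = {test_cluster: start}
    for i, cluster in enumerate(others, start=1):
        schedule[cluster] = start + i * step
    return schedule

clusters = ["test", "prod-a", "prod-b", "prod-c"]  # hypothetical names
start = datetime(2010, 4, 20, 2, 0)
for name, when in patch_waves(clusters, start, window_hours=72).items():
    print(name, when.isoformat())
```

    The spacing between waves is what buys you the "other clusters can absorb the load" property: at any point in the window, only one cluster is freshly patched.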

  7. Rob Lewis
    April 19th, 2010 at 04:49 | #7

    Hint: how about a technology that gets off the infosec vulnerability-centric patching train? :)
