Archive

Posts Tagged ‘Amazon’

On Stacked Turtles & the AWS Outage…

April 29th, 2011 2 comments

The best summary I could come up with:

Bye, Bye My Clustered AMIs…A Cloud Tribute to Don McLean

April 23rd, 2011 1 comment

Sung to the tune of Don McLean’s “American Pie”

A long, long time ago…
I could launch an instance
How that AMI used to make me smile
And I knew if I needed scale
that I’d avoid that fail whale
though I knew that I was in denial

But April 20 made me shiver
Amazon did not deliver
Bad news – oh what a mess
auto-cloning E B S…

I can’t remember if I cried
when the status dashboard said East had died
Tried to take my VMs back inside
The day…Amazon died

So bye-bye, my clustered AMIs
I tried to launch one
it just sat there, much to my surprise
And them angry devs were telling stories and lies
Singin’ “this public cloud I now despise
“this public cloud, I now despise.”

The CFO’s got a look of love,
and his faith, all-in, with the clouds above,
Buy less servers, Werner tells you so…

Do you believe in infinite scale
Can the cloud save your ass when it goes to hell
and can you teach me how to plan to fail?

Well I know that ….you’re in love with scrum
that agile, mobile are your rules of thumb
You tried, those VMs to move
but with no RDS, you’re screwed…

I was a lonely sysadmin with nothin’ to prove
until the cloud done fail, now the devs are screwed
and they didn’t know what quite to do..
the day…Amazon died…

I started singin’
bye-bye, my clustered AMIs
I tried to launch one
it just sat there, much to my surprise
And them angry devs were telling stories and lies
Singin’ “this public cloud I now despise
“this public cloud, I now despise.”


Amazon Web Services Hires a CISO – Did You Know?

May 18th, 2010 No comments

Just to point out a fact many of you may not be aware of: Amazon Web Services hired (or rather transferred, since he was already an AWS insider) Stephen Schmidt as their CISO earlier this year. He brings a team along with him, too.

That’s a very, very good thing. I, for one, am very glad to see it. Combine that with folks like Steve Riley and I’m enthusiastic that AWS will make some big leaps when it comes to visibility, transparency and interaction with the security community.

See. Christmas wishes can come true! Thanks, Santa! 😉

You can find more about Mr. Schmidt by checking out his LinkedIn profile.

/Hoff


Cloud: Over Subscription vs. Over Capacity – Two Different Things

January 15th, 2010 10 comments

There’s been a very interesting set of discussions lately regarding performance anomalies across Cloud infrastructure providers.  The most recent involves Amazon Web Services and RackSpace Cloud. Let’s focus on the former because it’s the one that has a good deal of analysis and data attached to it.

Reuven Cohen’s post (Oversubscribing the Cloud) summarizing many of these concerns speaks to the meme wherein he points to Alan Williamson’s initial complaints (Has Amazon EC2 become over subscribed?) followed by CloudKick’s very interesting experiments and data (Visual Evidence of Amazon EC2 network issues) and ultimately Rich Miller’s summary including a response from Amazon Web Services (Amazon: We Don’t Have Capacity Issues)

What’s interesting to me in all of this is yet another example of people mixing metaphors, terminology and common operating methodologies, as well as choosing to suspend disbelief within the reality distortion field associated with how service providers actually offer service versus how they market it.

Here’s the kicker: over subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model.

On the other hand, the sad truth is that we will have over capacity issues in cloud; it’s simply a sad intersection of the laws of physics and the delicate balance associated with cost control and service delivery.

Let me frame the following with an example: when you purchase an “unlimited data plan” from a telco or hosting company, you’ll normally notice that it comes with no latency or throughput figures attached to it…same with Cloud. You shouldn’t be surprised by this. If you are, you might want to rethink your approach to service level expectations.

Short and sweet:

  1. There is no such thing as infinite scale.  There is no such thing as an “unlimited ____ plan.”* Even in Cloud. Every provider has limits, even if they’re massive. Adding the word Cloud simply squeezes the limit balloon from you to them and it’s a tougher problem to solve at scale. It doesn’t eliminate the issue, even with “elasticity.”
  2. Allow me to repeat: over subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model. I don’t need to explain why, I trust.
  3. Capacity refers to the ability, within service level specifications, to meet the contracted needs of the customer and operate within acceptable thresholds. Depending upon how a provider measures that and communicates it to you, you may be horribly surprised if you chose the marketing over the engineering explanations of such.
  4. Capacity is also not the same as latency, is not the same as throughput…
  5. Over capacity means that the provider’s over-subscription modeling was flawed; it suggests that usage patterns overwhelmed the capacity threshold and the provider had no way of adding capacity in a manner that allowed them to satisfy demand (see the quick sketch after this list)
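
To make the distinction concrete, here’s a back-of-the-envelope sketch in Python (every number in it is hypothetical, not any provider’s actual ratio): over-subscription is a deliberate design ratio, and you only tip into over capacity when concurrent demand actually exceeds the physical resource.

```python
# Toy over-subscription model -- every number here is hypothetical.
UPLINK_GBPS = 10.0        # physical pipe shared by the tenants on a host/rack
TENANTS = 40              # tenants each sold "up to" the burst rate below
SOLD_BURST_GBPS = 1.0     # per-tenant advertised peak

oversub_ratio = (TENANTS * SOLD_BURST_GBPS) / UPLINK_GBPS
print(f"over-subscription ratio: {oversub_ratio:.0f}:1")   # 4:1 -- by design

def over_capacity(active_tenants: int, avg_demand_gbps: float) -> bool:
    """Over capacity = concurrent demand actually exceeds the physical pipe."""
    return active_tenants * avg_demand_gbps > UPLINK_GBPS

print(over_capacity(10, 0.8))   # False: 8 Gb/s fits in a 10 Gb/s pipe
print(over_capacity(24, 0.8))   # True: 19.2 Gb/s does not -- the model failed
```

The 4:1 ratio isn’t a flaw; it’s the design. It only becomes an over-capacity event when the usage pattern the model assumed turns out to be wrong.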

Why is this important?  Because the “illusion” of infinite scale is just that.

The abstraction at the infrastructure layer of compute, network and storage — especially delivered in software — still relies on the underlying capacity of the pipes and bit-buckets that deliver them. It’s a never-ending see-saw movement of Metcalfe’s and Moore’s laws.

The discrete packaging of each virtualized CPU/compute element within an AWS or a Rackspace is relatively easy to forecast and yields a reasonably helpful “fixed” capacity planning data point; it has a minimum of zero and a maximum associated with the peak compute hours/vCPU clock rating of the instance.

The network piece and its relationship to the compute piece is where it gets interesting. Your virtual interface is ultimately bundled together in aggregate with those of other tenants colocated on the same physical host and competes for a share of pipe (usually one or more single or trunked 1 Gb/s or 10 Gb/s Ethernet links). Network traffic, in terms of measurement, capacity planning and usage, must take into consideration the fact that it is asymmetric, suffers from variability in bucket (burst) size, and is very, very bursty. There’s generally no published service level associated with throughput in Cloud.
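
To see why the network side is the squishy part, here’s a tiny Monte Carlo sketch (guest count, burst profile and pipe size are all made-up assumptions): with bursty, aggregated tenant traffic, a pipe that looks comfortably sized on average still gets congested a meaningful fraction of the time.

```python
import random

random.seed(1)
UPLINK_GBPS, GUESTS, SAMPLES = 10.0, 20, 10_000   # hypothetical host/pipe sizing

def guest_demand_gbps() -> float:
    """Bursty tenant traffic: mostly idle, occasionally a large burst."""
    return random.choice([0.05] * 8 + [0.5, 3.0])  # ~10% chance of a 3 Gb/s burst

congested = sum(
    sum(guest_demand_gbps() for _ in range(GUESTS)) > UPLINK_GBPS
    for _ in range(SAMPLES)
)
print(f"intervals where aggregate demand exceeded the pipe: "
      f"{100 * congested / SAMPLES:.1f}%")
```

Average demand here sits well under the pipe, yet a sizeable slice of intervals still blow past it; that’s the part of the capacity model that’s hard to forecast and that nobody publishes an SLA for.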

This complicates things when you consider that, at this point, scaling out in CPU is easier than scaling out in the network. Add virtualization into the mix, which drives big, flat L2 networks as a design architecture layered with a control plane that is now (in the case of Cloud) mostly software driven, provisioned, orchestrated and implemented, and it’s no wonder that folks like Google, Amazon and Facebook are desperate for hugely dense, multi-terabit, wire-speed L2 switching fabrics and could use 40 and 100 Gb/s Ethernet today.

Check out this interesting article.

Oh, let’s not forget that there are also now providers who are deploying converged data/storage networking of said pipes with the likes of FCoE/DCE with all sorts of interesting ramifications on the above discussion.  If you thought it was tough to get your arms around before…

If you know much about Ethernet, congestion avoidance/recovery/control, QoS, etc. you know that it’s a complex beast. If service levels relating to network performance aren’t in your contract, you’re probably figuring out why right about now.

So, wrapping this up, I have to accept AWS’ statement that they “…do not have over-capacity issues,” because quite frankly there’s nothing to suggest otherwise. That’s not to say there aren’t performance issues related to something else (like software or hardware in the stack), but that’s not the same as being over capacity — and you’ll notice that they didn’t say they were not “over-subscribed” but rather that they were not “over capacity.” 😉

/Hoff

*Just ask AT&T about their network and the iPhone. This *is* a case where their over-subscription planning failed in the face of capacity…and continues to.


Silent Lucidity: IaaS — Already A Dinosaur? The Evolution of PaaSasaurus Rex…

November 12th, 2009 8 comments

Sitting in an impressive room at the Google campus in Mountain View last month, I asked the collective group of brainpower a slightly rhetorical question:

How much longer do you feel pure-play Infrastructure-As-A-Service will be a relevant service model within the spectrum of cloud services?

I couched the question with previous “incomplete thoughts*” relating to the move “up-stack” by IaaS providers — providing value-added, at-cost services to both differentiate and soften the market for what I call the “PaaSification” of the consumer.  I also highlighted the move “down-stack” by SaaS vendors building out platforms to support a broader ecosystem and value proposition.

In the long term, I think ultimately the trichotomy of the SPI model will dissolve thanks to commoditization and the need for providers to differentiate — even at mass scale.  We’ll ultimately just talk about service delivery and the platform(s) used to deliver them.  Infrastructure will enable these services, of course, but that’s not where the money will come from.

Just look at the approach of providers such as Amazon, Terremark and Savvis and how they are already clawing their way up the PaaS stack, adding more features and functions that either equalize public cloud capabilities with those of the enterprise or even differentiate from it.  Look at Microsoft’s Azure.  How about Heroku, Engine Yard, Joyent?  How about VMware and Springsource?  All platform plays. Develop, click, deploy.

As I mention in my Cloudifornication presentation, I think that from a security perspective, PaaS offers the potential of eliminating entire classes of vulnerabilities in the application development lifecycle by enforcing sanitary programmatic practices across the derivative works built upon them. I look forward also to APIs and standards that allow for consistency across providers. I think PaaS has the greatest potential to deliver this.
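
As a trivial illustration of what eliminating a class of vulnerability looks like (this is my own generic example, not any particular PaaS vendor’s API): if the only data-access primitive a platform exposes is a parameterized one, the injectable variant below simply can’t be written on it, and SQL injection disappears as a class for everything built on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def lookup_unsafe(name: str):
    # The classic injectable pattern a platform could refuse to offer at all.
    return conn.execute(f"SELECT role FROM users WHERE name = '{name}'").fetchall()

def lookup_safe(name: str):
    # Parameterized access: the only primitive a strict platform API might expose.
    return conn.execute("SELECT role FROM users WHERE name = ?", (name,)).fetchall()

payload = "x' OR '1'='1"
print(lookup_unsafe(payload))  # [('admin',)] -- the injection works
print(lookup_safe(payload))    # [] -- the payload is treated as a literal value
```

Whether a given PaaS actually enforces this is another matter; the point is that the enforcement happens above the developer, which is where whole classes of bugs get killed.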

There are clearly trade-offs here, but as we start to move toward the two key differentiators (at least for public clouds) — management and security — I think the value of PaaS will really start to shine.

Probably just another bout of obviousness, but if I were placing bets, this is where I’d sink my nickels.

You?

/Hoff

* The most relevant “incomplete thought” is the one titled “Incomplete Thought: Virtual Machines Are the Problem, Not the Solution…” in which I kicked around the notion that virtualization-enabled IaaS and the VM containers they enable are simply an ugly solution to an uglier problem…

Dear Santa: All I Want For Christmas On My Amazon Wishlist Is a Straight Answer…

October 31st, 2009 4 comments

A couple of weeks ago amidst another interesting Amazon Web Services announcement featuring the newly-arrived Relational Database Service, Werner Vogels (Amazon CTO) jokingly retweeted a remark that someone made suggesting he was like “…Santa for nerds.”

All I want for Christmas is my elastic IP...

So, now that I have Werner following me on Twitter and a confirmed mailing address (clearly the North Pole) I thought I’d make my Christmas wish early this year.  I’ve put a lot of thought into this.

Just when I had settled on a shiny new gadget from the bookstore side of the house, I saw Amazon’s response to Eran Tromer’s (et al) research on Cloud Cartography featured in this Computerworld article written by my old friend Jaikumar Vijayan titled “Amazon downplays report highlighting vulnerabilities in its cloud service.”

I feature Eran and his team’s work in my Cloudifornication presentation.  You can read more about it on Craig’s blog here.

I quickly cast aside my yuletide treasure list and instead decided to ask Santa (Werner/AWS) for a most important present: a straight answer from AWS that isn’t delivered by a PR spokeshole, one that instead speaks openly, transparently and in an engaging fashion with customers and the security community.

Here’s what torqued me (emphasis is mine):

In response, Amazon spokeswoman Kay Kinton said today that the report describes cloud cartography methods that could increase an attacker’s probability of launching a rogue virtual machine (VM) on the same physical server as another specific target VM.

What remains unclear, however, is how exactly attackers would be able to use that presence on the same physical server to then attack the target VM, Kinton told Computerworld via e-mail.

The research paper itself described how potential attackers could use so-called “side-channel” attacks to try and steal information from a target VM. The researchers had argued that a VM sitting on the same physical server as a target VM could monitor shared resources on the server to make highly educated inferences about the target VM.

By monitoring CPU and memory cache utilization on the shared server, an attacker could determine periods of high-activity on the target servers, estimate high-traffic rates and even launch keystroke timing attacks to gather passwords and other data from the target server, the researchers had noted.

Such side-channel attacks have proved highly successful in non-cloud contexts, so there’s no reason why they shouldn’t work in a cloud environment, the researchers postulated.

However, Kinton characterized the attack described in the report as “hypothetical,” and one that would be “significantly more difficult in practice.”

“The side channel techniques presented are based on testing results from a carefully controlled lab environment with configurations that do not match the actual Amazon EC2 environment,” Kinton said.

“As the researchers point out, there are a number of factors that would make such an attack significantly more difficult in practice,” she said.

So while the Amazon spokesperson admits the vulnerability/capability exists, rather than rationally address that issue, thank the researchers for pointing this out and provide customers some level of detail regarding how this vulnerability is mitigated, we get handwaving that attempts to have us not focus on the vulnerability, but rather the difficulty of a hypothetical exploit.  That example isn’t the point of the paper. The fact that I could deliver a targeted attack is.

Earth to Amazon: this sort of thing doesn’t work. It’s a lousy tactic.  It simply says that either you think we’re all stupid or you’re suffering from a very bad case of incident handling immaturity. Take a look around you, there are plenty of companies doing this right. You’re not one of them.  Consistently.

Tromer and crew gave a single example of how this vulnerability might be exploited that was latched on to by the AWS spokesperson as a way of defusing the seriousness of the underlying vulnerability by downplaying this sample exploit.  There are potentially dozens of avenues to be explored here.  Craig talked about many of them in his blog (above.)  What we got instead was this:

At the same time, Amazon takes all reports of vulnerabilities in its cloud infrastructure very seriously, she said. The company will continue to investigate potential exploits thoroughly and continue to develop features that bolster security for users of its cloud service, she said.

Amazon Web Services has already rolled out safeguards that prevent potential attackers from using the cartography techniques described in the paper, Kinton said without offering any details.

She also pointed to the recently launched Amazon Web Services Multi-Factor Authentication (AWS MFA) as another example of the company’s continuing effort to bolster cloud security. AWS MFA is designed to provide an extra layer of access control to a customer’s Web services account, Kinton said.

Did you catch “…without offering any details” or were you simply overwhelmed by the fact that you can use a token to authenticate your single-key driven AWS console instead?

I’m not interested in getting into a “full disclosure” battle here, but being dismissive, not providing clear-cut answers and being evasive without regard for transparency about issues like this or the DDoS attacks we saw with Bitbucket, etc. are going to backfire.  I posted about this before in previous blogs here and here.

If you want to be taken seriously by large enterprises and government agencies that require real answers to issues like this, you can engage with the security community or ignore us and get focused on by it (and me) until you decide that it’s a much better idea to do the former.  You’ll gain much more credibility and an eagerness to work with you instead of against you if you choose to use the force wisely 😉

Until then, may I suggest this?  I found it in the Amazon.com bookstore:

…you can download it to your Kindle in under a minute.

/Hoff

Incomplete Thought: The Cloud Software vs. Hardware Value Battle & Why AWS Is Really A Grid…

October 18th, 2009 2 comments

In discussing the role and long-term sustainable value of infrastructure versus software in cloud, some suggest that software will marginalize bespoke infrastructure and that the latter will simply commoditize.

I find that an interesting assertion, given that it tends to ignore the realities that both hardware and software ultimately suffer from a case of Moore’s Law — from speeds and feeds to the multi-core crisis, this will continue ad infinitum.  It’s important to keep that perspective.

In discussing this, proponents of software domination almost exclusively highlight Amazon Web Services as their lighthouse illustration.  For the purpose of simplicity, let’s focus on compute infrastructure.

Here’s why pointing to Amazon Web Services (AWS) as representative of all cloud offerings in general to anchor the hardware versus software value debate is not a reasonable assertion:

  1. AWS delivers a well-defined set of services designed to be deployed without modification across a massive number of customers; leveraging a common set of standardized capabilities across these customers differentiates the service and enables low cost
  2. AWS enjoys general non-variability in workload from *their* perspective since they offer fixed increments of compute and memory allocation per unit measure of exposed abstracted and virtualized infrastructure resources, so there’s a ceiling on what workloads per unit measure can do. It’s predictable.
  3. From AWS’ perspective (the lens of the provider), regardless of the “custom stuff” running within these fixed-sized containers, their core “cloud” infrastructure actually functions like a grid — performing what amounts to a few tasks on a finely-tuned platform built to deliver them
  4. This yields the ability for homogeneity in infrastructure and a focus on standardized and globalized power efficient, low cost, and easy-to-replicate components since the problem of expansion beyond a single unit measure of maximal workload capacity is simply a function of scaling out to more of them (or stepping up to one of the next few rungs on the scale-up ladder)

Yup, I just said that AWS is actually a grid whose derivative output is a set of cloud services.
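
A toy way to see points #2 and #4 in code (the instance sizes below are made up, not AWS’s actual catalog): once capacity only comes in fixed-size units, capacity forecasting collapses into counting identical bins, which is exactly the kind of problem a grid of homogeneous components is tuned for.

```python
# Hypothetical fixed instance bins: (vCPUs, GiB of RAM) per unit measure.
INSTANCE_TYPES = {"small": (1, 2), "medium": (2, 4), "large": (4, 8)}

def units_needed(itype: str, peak_vcpus: int, peak_ram_gib: int) -> int:
    """Scale-out math when workloads only come in fixed-size containers."""
    vcpu, ram = INSTANCE_TYPES[itype]
    return max(-(-peak_vcpus // vcpu), -(-peak_ram_gib // ram))  # ceiling division

# A forecasted peak of 37 vCPUs / 70 GiB is just "how many identical boxes?"
print(units_needed("medium", 37, 70))   # 19 units
```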

Why does this matter?  Because not all IaaS cloud providers are architected to achieve this — by design — and this has dramatic impact on where hardware and software, leveraged independently or as a total solution, play in the debate.

This is because AWS built and owns the entire “CloudOS” stack, from customized hardware through to the VMM, management and security planes (much as Google does), versus other providers who use what amounts to more generic software offerings from the likes of VMware, leaning on APIs and an ecosystem to extend their capabilities as well as big iron to power them. This will yield more customizable offerings that likely won’t scale as highly as AWS.

That’s because they’re not “grids” and were never designed to be.

Many other IaaS providers that have evolved from hosting are building their next-generation offerings from unified fabric and unified computing platforms (so-called “big iron”), which are the furthest thing from “commodity” hardware you can get. Further, SaaS and PaaS providers generally tend to do the same based on design goals and business models. Remember, IaaS is not representative of all things cloud — it’s only one of the service models.

Comparing AWS to most other IaaS cloud providers is a false argument upon which to anchor the hardware versus software debate.

/Hoff

Amazon Web Services: It’s Not The Size Of the Ship, But Rather The Motion Of the…

October 16th, 2009 3 comments
From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

Carl Brooks (@eekygeeky) gets some fantastic, thought-provoking interviews. His recent article, in which he interviewed Peter DeSantis, VP of EC2 at Amazon Web Services, titled “Amazon would like to remind you where the hype started,” is another great example.

However, this article left a bad taste in my mouth and ultimately invited more questions than it answered. Frankly, I felt there was a large amount of hand-waving in DeSantis’ points that glossed over some very important recent security issues.

DeSantis’ remarks implied, per the title of the article, that the poor handling of the issues people like me raise, and AWS’ continuing lack of transparency around them, are the customer’s fault: the product of hype and overly aggressive, misaligned expectations.

In short, it’s not AWS’ fault they’re so awesome, it’s ours.  However, please don’t remind them they said that when they don’t live up to the hype they help perpetuate.

You can read more about that here “Transparency: I Do Not Think That Means What You Think That Means…

I’m going to skip around the article because I do agree with Peter DeSantis on the points he made about the value proposition of AWS which ultimately appear at the end of the article:

“A customer can come into EC2 today and if they have a website that’s designed in a way that’s horizontally scalable, they can run that thing on a single instance; they can use [CloudWatch] to monitor the various resource constraints and the performance of their site overall; they can use that data with our autoscaling service to automatically scale the number of hosts up or down based on demand so they don’t have to run those things 24/7; they can use our Elastic Load Balancer service to scale the traffic coming into their service and only deliver valid requests.”

“All of which can be done self-service, without talking to anybody, without provisioning large amounts of capacity, without committing to large bandwidth contracts, without reserving large amounts of space in a co-lo facility and to me, that’s a tremendously compelling story over what could be done a couple years ago.”

Completely fair.  Excellent way of communicating the AWS value proposition.  I totally agree.  Let’s keep this definitional firmly in mind as we go on.

Here’s where the story turns into something like a confessional that implies AWS is sadly a victim of their own success:

DeSantis said that the reason stories like the DDoS on Bitbucket.org (and the non-cloud Sidekick story) blow up the way they do is that people have come to expect always-on, easily consumable services.

“People’s expectations have been raised in terms of what they can do with something like EC2. I think people rightfully look at the potential of an environment like this and see the tools, the multi- availability zone, the large inbound transit, the ability to scale out and up and fundamentally assume things should be better. “ he said.

That’s absolutely true. We look at what you offer (and how you offered/described it above) and we set our expectations accordingly.

We do assume that things should be better as that’s how AWS has consistently marketed the service.

You can’t reasonably expect to bitch about people’s perception of the service based on how it’s “sold” and then turn around when something negative happens and suggest that it’s the consumers’ fault for setting their expectational compass with the course you set.

It *is* absolutely fair to suggest that there is no excuse for not using common sense and not applying good architectural logic to the deployment of services on AWS, but it’s also disingenuous to expect much of the target market to whom you are selling to understand the caveats here when so much is obfuscated by design. I understand AWS doesn’t say they protect against every threat, but they also do not say they do not…until something happens where that becomes readily apparent 😉

When everything is great AWS doesn’t go around reminding people that bad things can happen, but when bad things happen it’s because of incorrectly-set expectations?

Here’s where the discussion turns to an interesting example —  the BitBucket DDoS issue.

For instance, DeSantis said it would be trivial to wash out standard DDOS attacks by using clustered server instances in different availability zones.

Okay, but four things come to mind:

  1. Why did it take 15 hours for AWS to recognize the DDoS in the first place? (They didn’t actually “detect” it, the customer did)
  2. Why did the “vulnerability” continue to exist for days afterward?
  3. While using different availability zones makes sense, it’s been suggested that this DDoS attack was internal to EC2, not externally-generated
  4. While it *is* good practice and *does* make sense, “clustered server instances in different availability zones” costs money

Keep those things in the back of your mind for a moment…

“One of the best defenses against any sort of unanticipated spike is simply having available bandwidth. We have a tremendous amount on inbound transit to each of our regions. We have multiple regions which are geographically distributed and connected to the internet in different ways. As a result of that it doesn’t really take too many instances (in terms of hits) to have a tremendous amount of availability – 2,3,4 instances can really start getting you up to where you can handle 2,3,4,5 Gigabytes per second. Twenty instances is a phenomenal amount of bandwidth transit for a customer.” he said.

So again, here’s where I take issue with this “bandwidth solves all” answer. The solution being proposed by DeSantis here is that a customer should be prepared to launch/scale multiple instances in response to a DoS/DDoS, in effect making it the customers’ problem instead of AWS detecting and squelching it in the first place?

Further, when you think of it, the trickle-down effect of DDoS is potentially good for AWS’ business. If they can absorb massive amounts of traffic, then the more instances you have to scale, the better for them given how they charge.  Also, per my point #3 above, it looks as though the attack was INTERNAL to EC2, so ingress transit bandwidth per region might not have done anything to help here.  It’s unclear to me whether this was a distributed DoS attack at all.

Lori MacVittie wrote a great post on this very thing titled “Putting a Price on Uptime” which basically asks who pays for the results of an attack like this:

A lack of ability in the cloud to distinguish illegitimate from legitimate requests could lead to unanticipated costs in the wake of an attack. How do you put a price on uptime and more importantly, who should pay for it?

This is exactly the point I was raising when I first spoke of Economic Denial Of Sustainability (EDoS) here. All the things AWS speaks to as solutions cost more money…money which many customers, based upon their expectations of AWS’ service, may be unprepared to spend. They wouldn’t have much better options (if any) if they were hosting it somewhere else, but that’s hardly the point.

I quote back to something I tweeted earlier “The beauty of cloud and infinite scale is that you get the benefits of infinite FAIL”

The largest DDOS attacks now exceed 40Gbps. DeSantis wouldn’t say what AWS’s bandwidth ceiling was but indicated that a shrewd guesser could look at current bandwidth and hosting costs and what AWS made available, and make a good guess.

The tests done here showed the capability to generate 650 Mbps from a single medium instance that attacked another instance which, per Radim Marek, was using another AWS account in another availability zone. So if the “largest DDoS attacks now exceed 40 Gbps” and five EC2 instances can handle 5 Gb/s, I’d need roughly eight times that (40 instances) to absorb an attack of this scale (it’s unknown whether that assumes a small or large instance). Seems simple, right?
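
Running those numbers through a calculator makes the economics plain (the per-instance absorption figure, instance pricing and transfer pricing below are assumptions for illustration, not published AWS figures):

```python
import math

# Every figure below is an assumption for illustration -- not a published AWS number.
attack_gbps            = 40.0   # "the largest DDoS attacks now exceed 40 Gbps"
absorbed_per_inst_gbps = 1.0    # ~1 Gb/s each, reading the quoted 2-5 figures as gigabits
instance_hour_usd      = 0.40   # made-up hourly rate for a mid-sized instance
transfer_usd_per_gb    = 0.10   # made-up metered inbound transfer price

instances   = math.ceil(attack_gbps / absorbed_per_inst_gbps)
gb_per_hour = attack_gbps / 8 * 3600          # Gb/s -> GB delivered per hour of attack

print(f"instances needed to soak the flood : {instances}")
print(f"compute cost per hour of attack    : ${instances * instance_hour_usd:,.2f}")
print(f"transfer cost per hour of attack   : ${gb_per_hour * transfer_usd_per_gb:,.2f}")
```

The compute is pocket change next to the metered bandwidth, which is the EDoS point in a nutshell: the customer pays to be the shock absorber.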

Again, this is about absorbing these attacks with bandwidth, not preventing them or defending against them. And it’s about not only passing the buck, but squeezing more of them out of you, the customer.

“ I don’t want to challenge anyone out there, but we are very, very large environment and I think there’s a lot of data out there that will help you make that case.” he said.

Of course you wish to challenge people, that’s the whole point of your arguments, Peter.

How much bandwidth AWS has is only one part of the issue here.  The other is AWS’ ability to respond to such attacks in reasonable timeframes and prevent them in the first place as part of the service.  That’s a huge part of what I expect from a cloud service.

So let’s do what DeSantis says and set our expectations accordingly.

/Hoff

Transparency: I Do Not Think That Means What You Think That Means…

October 12th, 2009 7 comments

Ha ha! You fool! You fell victim to one of the classic blunders – The most famous of which is “never get involved in a cloud war in Asia” – but only slightly less well-known is this: “Never go against Werner when availability is on the line!”

As an outsider, it’s easy to play armchair quarterback, point fingers and criticize something as mind-bogglingly marvelous as a service the size and scope of Amazon Web Services. After all, they make all that complexity disappear under the guise of a simple web interface to deliver value, innovation and computing wonderment the likes of which are really unmatched.

There’s an awful lot riding on Amazon’s success.  They set the pace by which an evolving industry is now measured in terms of features, functionality, service levels, security, cost and the way in which they interact with customers and the community of ecosystem partners.

An interesting set of observations and explanations have come out of recent events related to degraded performance, availability and how these events have been handled.

When something bad happens, there’s really two ways to play things:

  1. Be as open as possible, as quickly as possible and with as much detail as possible, or
  2. Release information only as needed, when pressured and keep root causes and their resolutions as guarded as possible

This, of course, is an over-simplification of the options, complicated by the need for privacy, protection of intellectual property, legal issues, compliance or security requirements.  That’s not really any different than any other sort of service provider or IT department, but then again, Amazon’s Web Services aren’t like any other sort of service provider or IT department.

So when something bad happens, it’s been my experience as a customer (and one that admittedly does not pay for their “extra service”) that sometimes notifications take longer than I’d like, status updates are not as detailed as I might like, and root causes are sometimes cloaked in the air of the mysterious “network connectivity problem” — a replacement for the old corporate stand-by of “blame the firewall.” There’s an entire industry cropping up to help you with these sorts of things.

Something like the BitBucket DDoS issue however, is not a simple “network connectivity problem.”  It is, however, a problem which highlights an oft-played pantomime of problem resolution involving any “managed” service being provided by a third party to which you as the customer have limited access at various critical points in the stack.

This outage represents a disconnect in experience versus expectation with how customers perceive the operational underpinnings of AWS’ operations and architecture and forces customers to reconsider how all that abstracted infrastructure actually functions in order to deliver what — regardless of what the ToS say — they want to believe it delivers.  This is that perception versus reality gap I mentioned earlier.  It’s not the redonkulous “end-of-cloud” scenarios parroted by the masses of the great un(cloud)washed, but it’s serious nonetheless.

As an example, BitBucket’s woes of 20+ hours of downtime due to UDP (and later TCP) DDoS floods led to the well-documented realization that support was inadequate, monitoring insufficient and security defenses lacking — from the perspective of both the customer and AWS*. The reality is that based on what we *thought* we knew about how AWS functioned to protect against these sorts of things, these attacks should never have wrought the damage they did. It seems AWS was equally surprised.

It’s important to note that these were revelations made in near real-time by the customer, not AWS.

Now, this wasn’t a widespread problem, so it’s understandable to a point why we didn’t hear a lot from AWS about this issue. But after it all played out, looking at what has been disclosed publicly by AWS, it appears the issue is still not remedied; despite the promise to do better, a follow-on study seems to suggest that the problem may not yet be well understood or solved by AWS (see: Amazon EC2 vulnerable to UDP flood attacks). (Ed: After I wrote this, I got a notification that this particular issue has been fixed, which is indeed good news.)

Now, releasing details about any vulnerability like this could put many customers at risk of similar attack, but the lack of transparency around service and architecture means that we’re left with more questions than answers. How can a customer (like me) defend against an attack like this today without knowing what causes it or how it is mitigated? What happens when the next one surfaces?

Can AWS even reliably detect this sort of thing given the “socialist security” implementation of good enough security spread across its constituent customers?

Security by obscurity in cloud cannot last as the gold standard.

This is the interesting part about the black-box abstraction that is Cloud, not just for Amazon, but any massively-scaled service provider; the more abstracted the service, the more dependent upon the provider or third parties we will become to troubleshoot issues and protect our assets.  In many cases, however, it will simply take much more time to resolve issues since visibility and transparency are limited to what the provider chooses or is able to provide.

We’re in the early days still of what customers know to ask about how security is managed in these massively scaled multi-tenant environments and since in some cases we are contractually prevented from exercising tests designed to understand the limits, we’re back to trusting that the provider has it handled…until we determine they don’t.

Put that in your risk management pipe and smoke it.

The network and systems that make up our cloud providers offerings must do a better job in stopping bad things from occurring before they reach our instances and workloads or customers should simply expect that they get what they pay for.  If the provider capabilities do not improve, combined with less visibility and an inability to deploy compensating controls, we’re potentially in a much worse spot than having no protection at all.

This is another opportunity to quietly remind folks about the Audit, Assertion, Assessment and Assurance (A6) API that is being brought to life; there will hopefully be some exciting news shortly about this project, but I see A6 as playing a very important role in providing a solution to some of the issues I mention here. Ready when you are, Amazon.

If only it were so simple and transparent:

Inigo Montoya: You are using Bonetti’s Defense against me, ah?
Man in Black: I thought it fitting considering the rocky terrain.
Inigo Montoya: Naturally, you must suspect me to attack with Capa Ferro?
Man in Black: Naturally… but I find that Thibault cancels out Capa Ferro. Don’t you?
Inigo Montoya: Unless the enemy has studied his Agrippa… which I have.

/Hoff

*It’s only fair to mention that depending upon a single provider for service, no matter how good they may be, and not taking advantage of monitoring services (at an extra cost) is a risk decision that comes with consequences, one of them being longer time to resolution.

Cloud Providers and Security “Edge” Services – Where’s The Beef?

September 30th, 2009 16 comments

Previously I wrote a post titled “Oh Great Security Spirit In the Cloud: Have You Seen My WAF, IPS, IDS, Firewall…” in which I described the challenges for enterprises moving applications and services to the Cloud while trying to ensure parity in compensating controls, some of which are either not available or suffer from the “virtual appliance” conundrum (see the Four Horsemen presentation on issues surrounding virtual appliances).

Yesterday I had a lively discussion with Lori MacVittie about the notion of what she described as “edge” service placement of network-based WebApp firewalls in Cloud deployments. I was curious about the notion of where the “edge” is in Cloud, but assuming it’s at the provider’s connection to the Internet, as Lori suggested, this brought up the arguments in the post above: how does one roll out compensating controls in Cloud?

The level of difficulty and the need to integrate controls (or any “infrastructure” enhancement) definitely depend upon the Cloud delivery model chosen (SaaS, PaaS, or IaaS) and the business problem being solved; SaaS offers the least amount of extensibility from the perspective of deploying controls (you don’t generally have any access to do so) whilst IaaS allows a lot of freedom at the guest level. PaaS is somewhere in the middle. None of the models is especially friendly to integrating network-based controls not otherwise supplied by the provider, for what should be pretty obvious reasons — the network is abstracted.

So here’s the rub: if MSSPs/ISPs/ASPs-cum-Cloud operators want to woo mature enterprise customers to use their services, they are leaving money on the table and failing to fulfill customer needs by not rolling out complementary security capabilities which lessen the compliance and security burdens of their prospective customers.

While many provide commoditized solutions such as anti-spam and anti-virus capabilities, more complex (but profoundly important) security services such as DLP (data loss/leakage prevention), WAF, Intrusion Detection and Prevention (IDP), XML Security, Application Delivery Controllers, VPNs, etc. should also be considered for roadmaps by these suppliers.

Think about it, if the chief concern in Cloud environments is security around multi-tenancy and isolation, giving customers more comfort besides “trust us” has to be a good thing.  If I knew where and by whom my data is being accessed or used, I would feel more comfortable.

Yes, it’s difficult to do properly and in many cases means the Cloud provider has to make a substantial investment in delivery platforms and management/support integration to get there.  This is why niche players who target specific verticals (especially those heavily regulated) will ultimately have the upper hand in some of these scenarios – it’s not socialist security where “good enough” is spread around evenly.  Services like these need to be configurable (SELF-SERVICE!) by the consumer.

An example? How about Google: where’s DLP integrated into the messaging/apps platforms?  Amazon AWS: where’s IDP integrated into the VMM for introspection?

I wrote a couple of interesting posts about this previously (they may show up in the automated related-posts list below).

My customers in the Fortune 500 complain constantly that the biggest providers they are being pressured to consider for Cloud services aren’t listening to these requests — or aren’t in a position to respond.

That’s bad for everyone.

So how about it? Are services like DLP, IDP, WAF integrated into your Cloud providers’ offerings something you’d like to see rather than having to add additional providers as brokers and add complexity and cost back into Cloud?

/Hoff