Cloud: Over Subscription vs. Over Capacity – Two Different Things

There’s been a very interesting set of discussions lately regarding performance anomalies across Cloud infrastructure providers. The most recent involves Amazon Web Services and Rackspace Cloud. Let’s focus on the former because it’s the one that has a good deal of analysis and data attached to it.

Reuven Cohen’s post (Oversubscribing the Cloud) summarizing many of these concerns speaks to the meme: he points to Alan Williamson’s initial complaints (Has Amazon EC2 become over subscribed?), followed by CloudKick’s very interesting experiments and data (Visual Evidence of Amazon EC2 network issues), and ultimately Rich Miller’s summary, which includes a response from Amazon Web Services (Amazon: We Don’t Have Capacity Issues).

The thing that’s interesting to me in all of this is that it’s yet another example of people mixing metaphors, terminology, and common operating methodologies — and choosing to suspend disbelief inside the reality distortion field that separates how service providers actually deliver a service from how they market it.

Here’s the kicker: over-subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model.

On the other hand, the sad truth is that we will have over-capacity issues in the Cloud; it’s simply the intersection of the laws of physics and the delicate balance between cost control and service delivery.

Let me frame what follows with an example: when you purchase an “unlimited data plan” from a telco or hosting company, you’ll normally notice that it does not have latency or throughput figures attached to it. Same with Cloud. You shouldn’t be surprised by this. If you are, you might want to rethink your approach to service level expectations.

Short and sweet:

  1. There is no such thing as infinite scale.  There is no such thing as an “unlimited ____ plan.”* Even in Cloud. Every provider has limits, even if they’re massive. Adding the word Cloud simply squeezes the limit balloon from you to them and it’s a tougher problem to solve at scale. It doesn’t eliminate the issue, even with “elasticity.”
  2. Allow me to repeat: over-subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model.  I don’t need to explain why, I trust.
  3. Capacity refers to the ability, within service level specifications, to meet the contracted needs of the customer and operate within acceptable thresholds. Depending upon how a provider measures that and communicates it to you, you may be horribly surprised if you choose the marketing explanation over the engineering one.
  4. Capacity is also not the same as latency, which is not the same as throughput…
  5. Over capacity means that the provider’s over-subscription modeling was flawed; the usage patterns overwhelmed the capacity threshold, and the provider had no way of adding capacity in a manner that allowed it to satisfy demand. (A back-of-the-envelope sketch of the difference follows this list.)
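
To make the distinction concrete, here’s a minimal back-of-the-envelope sketch. Every number in it is hypothetical (48 access ports at 1Gb/s uplinked through 4 x 10Gb/s, and a 70% engineering threshold) — the point is simply that a network can be over-subscribed by design and still be nowhere near over capacity; over capacity only happens when actual aggregate demand blows through the engineered threshold.

```python
# Hypothetical numbers: over-subscription (by design) vs. over capacity (a failure).
ACCESS_PORTS = 48          # tenant-facing ports on a top-of-rack switch
ACCESS_GBPS = 1.0          # speed of each access port
UPLINK_GBPS = 4 * 10.0     # aggregate uplink capacity
THRESHOLD = 0.70           # utilization level the provider engineers for

oversub_ratio = (ACCESS_PORTS * ACCESS_GBPS) / UPLINK_GBPS
print(f"Over-subscription ratio (by design): {oversub_ratio:.1f}:1")

def over_capacity(aggregate_demand_gbps: float) -> bool:
    """Over capacity = actual demand exceeds the engineered threshold,
    i.e., the assumptions behind the over-subscription model failed."""
    return aggregate_demand_gbps > UPLINK_GBPS * THRESHOLD

# Typical case: tenants are bursty and mostly idle, so demand stays low.
print(over_capacity(aggregate_demand_gbps=18.0))   # False: over-subscribed, and fine
# Failure case: everyone bursts at once and the model's assumptions break.
print(over_capacity(aggregate_demand_gbps=36.0))   # True: genuinely over capacity
```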

Why is this important?  Because the “illusion” of infinite scale is just that.

The abstraction at the infrastructure layer of compute, network and storage — especially delivered in software — still relies on the underlying capacity of the pipes and bit-buckets that deliver them. It’s a never-ending see-saw movement of Metcalfe’s and Moore’s laws.

The discrete packaging of each virtualized CPU compute element within an AWS or Rackspace is relatively easy to forecast and yields a reasonably helpful “fixed” capacity planning data point; it has a minimum of zero and a maximum associated with the peak compute hours/vCPU clock rating of the instance.
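
As a rough illustration of why the compute side is the easy part, here’s a minimal sketch — the instance sizes and clock ratings below are made-up placeholders, not actual AWS or Rackspace figures. Each instance’s worst case is bounded and known at provisioning time, so a host’s aggregate peak is just a sum.

```python
# Hypothetical instance catalog: each size has a fixed vCPU count and clock
# rating, so its peak compute demand is bounded and known up front.
CATALOG = {
    "small":  {"vcpus": 1, "ghz": 1.0},
    "medium": {"vcpus": 2, "ghz": 2.0},
    "large":  {"vcpus": 4, "ghz": 2.0},
}

def peak_compute_ghz(instances: list[str]) -> float:
    """Worst-case aggregate compute demand on a host: minimum is zero
    (everything idle), maximum is the sum of every instance's rated peak."""
    return sum(CATALOG[i]["vcpus"] * CATALOG[i]["ghz"] for i in instances)

# A host packed with tenants: the capacity planner's ceiling is a constant.
print(peak_compute_ghz(["small", "medium", "large", "large"]))  # 21.0 GHz
```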

The network piece and its relationship to the compute piece is where it gets interesting. Your virtual interface is ultimately bundled together in aggregate with those of other tenants colocated on the same physical host, and it competes for a share of pipe (usually one or more single or trunked 1Gb/s or 10Gb/s Ethernet links). Network traffic, in terms of measurement, capacity planning, and usage, must take into consideration the facts that it is asymmetric, suffers from variability in bucket size, and is very, very bursty. There’s generally no published service level associated with throughput in Cloud.
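
Here’s a minimal simulation sketch of that contention, under made-up assumptions (20 tenants on a host sharing a single 10Gb/s link, each idling most of the time but occasionally bursting to 2Gb/s). It shows how burstiness makes a comfortably over-subscribed link congest once in a while — exactly the behavior a single flat “capacity” number hides.

```python
import random

# Hypothetical model: 20 tenants share one 10 Gb/s link. Each tenant is
# idle-ish most of the time but occasionally bursts hard -- the classic
# pattern that makes over-subscription work... until it doesn't.
TENANTS = 20
LINK_GBPS = 10.0
BURST_PROB = 0.05      # chance a tenant is bursting in a given interval
BURST_GBPS = 2.0       # demand while bursting
IDLE_GBPS = 0.05       # background demand otherwise
TRIALS = 100_000

random.seed(42)
congested = 0
for _ in range(TRIALS):
    demand = sum(
        BURST_GBPS if random.random() < BURST_PROB else IDLE_GBPS
        for _ in range(TENANTS)
    )
    if demand > LINK_GBPS:
        congested += 1

avg_demand = TENANTS * (BURST_PROB * BURST_GBPS + (1 - BURST_PROB) * IDLE_GBPS)
print(f"Average aggregate demand: {avg_demand:.2f} Gb/s (link: {LINK_GBPS} Gb/s)")
print(f"Intervals with contention: {congested / TRIALS:.4%}")
```

The average demand lands around 3 Gb/s — comfortably under the pipe — yet a small fraction of intervals still congest because bursts occasionally line up. That gap between average and worst case is the whole over-subscription bet.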

This complicates things when you consider that, at this point, scaling out in CPU is easier to do than scaling out in the network. Add virtualization into the mix — which drives big, flat L2 networks as a design architecture, layered with a control plane that is now (in the case of Cloud) mostly software driven, provisioned, orchestrated and implemented — and it’s no wonder that folks like Google, Amazon and Facebook are desperate for hugely dense, multi-terabit, wire-speed L2 switching fabrics and could use 40 and 100Gb/s Ethernet today.
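
A quick back-of-the-envelope calculation makes the “multi-terabit” point — with purely illustrative numbers (2,000 hosts per fabric, 10Gb/s per host, and non-blocking east-west traffic as the goal), the bisection bandwidth you’d want is already in the tens of terabits:

```python
# Illustrative only: why big, flat L2 fabrics need multi-terabit capacity.
HOSTS = 2_000            # physical hosts hanging off one fabric
HOST_NIC_GBPS = 10.0     # NIC speed per host

# For non-blocking any-to-any traffic, half the hosts may talk to the
# other half simultaneously, so the bisection must carry half the edge.
edge_gbps = HOSTS * HOST_NIC_GBPS
bisection_tbps = (edge_gbps / 2) / 1_000
print(f"Edge capacity: {edge_gbps / 1_000:.0f} Tb/s")
print(f"Non-blocking bisection needed: {bisection_tbps:.0f} Tb/s")
```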

Oh, let’s not forget that there are also now providers who are deploying converged data/storage networking of said pipes with the likes of FCoE/DCE with all sorts of interesting ramifications on the above discussion.  If you thought it was tough to get your arms around before…

If you know much about Ethernet, congestion avoidance/recovery/control, QoS, etc. you know that it’s a complex beast. If service levels relating to network performance aren’t in your contract, you’re probably figuring out why right about now.
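
If you want a feel for one of the knobs involved, here’s a minimal token-bucket rate limiter sketch — a generic textbook construct, not any provider’s actual implementation. The bucket depth is what permits bursts, and the refill rate is the long-term cap, which is why “max burst” and “max sustained” are separate levers in any quota discussion.

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter: allows bursts up to `depth`
    while capping sustained throughput at `rate` tokens/second."""

    def __init__(self, rate: float, depth: float):
        self.rate = rate          # sustained refill rate (e.g., bytes/sec)
        self.depth = depth        # max burst size (bucket capacity)
        self.tokens = depth
        self.last = time.monotonic()

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # over the limit: drop, queue, or mark

# Sustained cap of 1 MB/s, bursts up to 5 MB allowed.
bucket = TokenBucket(rate=1_000_000, depth=5_000_000)
print(bucket.allow(4_000_000))  # True: within the burst allowance
print(bucket.allow(4_000_000))  # False: burst exhausted, must wait for refill
```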

So, wrapping this up, I have to accept AWS’ statement that they “…do not have over-capacity issues,” because quite frankly there’s nothing to suggest otherwise. That’s not to say there aren’t performance issues related to something else (like software or hardware in the stack), but that’s not the same as being over capacity — and you’ll notice that they didn’t say they were not “over-subscribed,” but rather that they were not “over capacity.” 😉

/Hoff

*Just ask AT&T about their network and the iPhone. This *is* a case where their over-subscription planning failed in the face of actual demand…and continues to.

Categories: Cloud Computing
  1. January 15th, 2010 at 12:39 | #1

    Very interesting, Chris. I've learned a lot by reading the flurry of related blog posts including Reuven's and Alan's initial entry re: EC2 that seemed to have started it all.

    I believe that there are many who don't understand the over capacity vs. over subscription issue. I also think much more will be written about the issues regarding scaling out the network vs. scaling CPUs in a cloud environment.

    Thanks and Best Regards, Bob

  2. January 15th, 2010 at 16:06 | #2

    But pretending that the cloud was a magical place with limitless capacity that could never be oversubscribed was a key part of my strategy. 🙂 Just set it, and forget it.

    How can I justify my zero responsibility approach to IT if I have to evaluate providers on vectors like throughput, latency, and capacity management strategy? That sounds hard.

    What's that bubble popping sound?

  3. January 15th, 2010 at 16:29 | #3

    Ahhh… here comes the "semi-elastic" cloud.

  4. Armorguy
    January 15th, 2010 at 16:32 | #4

    Hmmm…. So, if I follow you, Cloud Vendors plan for over-subscription much like airlines over-book flights? Makes perfect sense – until the customers don't react the way the plan predicts (as your AT&T example perfectly demonstrates)…

    So, at what point does over-subscription begin to cause people to turn away from the cloud? How much 'pain' makes cloud perceived to be a bad choice??

    Oh, and anybody want to buy this AT&T Blackberry from me? 🙂

  5. January 16th, 2010 at 08:13 | #5

    Great article… I’ve yet to see an ROI tool that considers nuances like over capacity in the calculation to make a business case for deploying critical business applications to the cloud. Makes it kind of difficult to be a proponent of cloud computing with a critical business applications perspective except when the scale of the application remains small and the dollars kept on the bottom line far exceed more common architectures. An educated bet that over-subscription remains only an issue with large scale enterprise deployments seems reasonable.

  6. January 19th, 2010 at 07:27 | #6

    Great post, thanks.

    I also would like to mention that, as far as I know, there are no network bandwidth quotas in EC2. Or at least I didn't find any in the AUP. Over-subscription technically would mean that the total of all users' quotas exceeds designed capacity – which can't happen by definition, because each customer's quota is 0 (or undefined).

    Please also note that quotas can be set in max traffic over a period of time, or max burst traffic, etc.

    Whether not having a quota is a good thing or bad thing – I don't know, but I don't think the term "over-subscription" in the context of this discussion is applicable.

    I think Alan may have used it figuratively, not literally, in his original post.

  7. Jost
    January 26th, 2010 at 05:42 | #7

    The I/O performance guarantees for virtual machines in EC2 indicate some form of differentiation between VMs of different types (small vs. medium and large instances). I do wonder how EC2 manages the priorities on network flows for internal and external traffic from and to all the deployed VMs in case of bandwidth shortages on overbooked (over-subscribed) network links…

    And by the way, a quota for a VM would suggest a capacity reservation, which would not allow for over-subscription at all. Maybe we will see statistical guarantees coming up one day…

  8. January 26th, 2010 at 11:11 | #8

    @Jost

    Re "a quota for a vm would suggest a capacity reservation, which would not allow for over-subscription at all":

    I meant it in the context "provider has 1 Mbps available, your quota is 1 Mbps, my quota is 1 Mbps; as a result, the network is over-subscribed."

  9. GaryBoom
    November 16th, 2011 at 15:10 | #9

    Does anyone have a feel for the total available capacity on AWS? How many servers? How many terabytes of data? How much power are these guys using?

  10. Mikee
    October 18th, 2012 at 08:42 | #10

    Because AWS does have capacity issues in the network that appear to be caused by the “oversubscription” problems, I'm led to believe it is just that…CAPACITY issues. Wrapping another name around it does not change the fact that their networking/compute model is at its capacity limit. What enterprises need is a dedicated cloud model that will scale and provide QoS modeling without “capacity” issues.
