Cloud: Over Subscription vs. Over Capacity – Two Different Things
There’s been a very interesting set of discussions lately regarding performance anomalies across Cloud infrastructure providers. The most recent involves Amazon Web Services and Rackspace Cloud. Let’s focus on the former because it’s the one that has a good deal of analysis and data attached to it.
Reuven Cohen’s post (Oversubscribing the Cloud) summarizes many of these concerns. He points to Alan Williamson’s initial complaints (Has Amazon EC2 become over subscribed?), followed by CloudKick’s very interesting experiments and data (Visual Evidence of Amazon EC2 network issues) and ultimately Rich Miller’s summary, which includes a response from Amazon Web Services (Amazon: We Don’t Have Capacity Issues).
What’s interesting to me in all of this is that it’s yet another example of people mixing metaphors, terminology and common operating methodologies, as well as choosing to suspend disbelief and buy into the reality distortion field surrounding how service providers actually deliver service versus how they market it.
Here’s the kicker: over subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model.
On the other hand, the truth is that we will have over capacity issues in Cloud; it’s simply a sad intersection of the laws of physics and the delicate balance associated with cost control and service delivery.
Let me frame the following with an example: when you purchase an “unlimited data plan” from a telco or hosting company, you’ll normally notice that it does not have latency or throughput figures attached to it…same with Cloud. You shouldn’t be surprised by this. If you are, you might want to rethink your approach to service level expectations.
Short and sweet:
- There is no such thing as infinite scale. There is no such thing as an “unlimited ____ plan.”* Even in Cloud. Every provider has limits, even if they’re massive. Adding the word Cloud simply squeezes the limit balloon from you to them and it’s a tougher problem to solve at scale. It doesn’t eliminate the issue, even with “elasticity.”
- Allow me to repeat: over subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model. I don’t need to explain why, I trust.
- Capacity refers to the ability, within service level specifications, to meet the contracted needs of the customer and operate within acceptable thresholds. Depending upon how a provider measures that and communicates it to you, you may be horribly surprised if you chose the marketing over the engineering explanations of such.
- Capacity is also not the same as latency, is not the same as throughput…
- Over capacity means that the provider’s over-subscription modeling was flawed: usage patterns overwhelmed the capacity threshold and the provider had no way of adding capacity in a manner that allowed them to satisfy demand.
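To make the distinction concrete, here is a minimal sketch. The numbers are made up for illustration and do not come from any provider; the function names are hypothetical too. The over-subscription ratio is a planning figure (what was sold versus what physically exists), while over capacity is a runtime condition (what is actually being demanded right now versus what physically exists).

```python
from typing import List


def oversubscription_ratio(sold_mbps: List[float], capacity_mbps: float) -> float:
    """Ratio of bandwidth sold to tenants versus physical capacity (a design choice)."""
    return sum(sold_mbps) / capacity_mbps


def is_over_capacity(demand_mbps: List[float], capacity_mbps: float) -> bool:
    """True only when actual concurrent demand exceeds physical capacity."""
    return sum(demand_mbps) > capacity_mbps


# Hypothetical shared uplink: 10 Gb/s, with 40 tenants each sold a nominal 1 Gb/s.
CAPACITY_MBPS = 10_000
sold = [1_000] * 40

# Two illustrative demand snapshots (assumed values, not measurements).
quiet_hour = [100] * 40   # 4,000 Mb/s aggregate demand
busy_hour = [300] * 40    # 12,000 Mb/s aggregate demand

print(oversubscription_ratio(sold, CAPACITY_MBPS))   # 4.0, i.e. 4:1 by design
print(is_over_capacity(quiet_hour, CAPACITY_MBPS))   # False
print(is_over_capacity(busy_hour, CAPACITY_MBPS))    # True
```

The same link is 4:1 over-subscribed in both snapshots; it is only over capacity in the second one, which is where the provider’s modeling would have failed.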
Why is this important? Because the “illusion” of infinite scale is just that.
The abstraction at the infrastructure layer of compute, network and storage — especially delivered in software — still relies on the underlying capacity of the pipes and bit-buckets that deliver them. It’s a never-ending see-saw movement of Metcalfe’s and Moore’s laws.
The discrete sizing of each virtualized CPU compute element within AWS or Rackspace is relatively easy to forecast and yields a reasonably helpful “fixed” capacity planning data point; it has a minimum of zero and a maximum associated with the peak compute hours/vCPU clock rating of the instance.
The network piece and its relationship to the compute piece is where it gets interesting. Your virtual interface is ultimately bundled in aggregate with those of other tenants colocated on the same physical host and competes for a share of the pipe (usually one or more single or trunked 1Gb/s or 10Gb/s Ethernet links). Measurement, capacity planning and usage accounting for network traffic must take into consideration that it is asymmetric, suffers from variability in bucket size, and is very, very bursty. There’s generally no published service level associated with throughput in Cloud.
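A rough way to see why burstiness makes the network side so much harder to plan than CPU is a toy simulation. Everything below is an assumption for illustration: the tenant count, the link speed, the idle and burst rates, and the simplistic two-state traffic model are invented and do not describe how any provider’s hosts actually behave.

```python
import random


def congested_fraction(n_tenants: int, link_mbps: float, idle_mbps: float,
                       burst_mbps: float, burst_prob: float,
                       intervals: int = 10_000, seed: int = 0) -> float:
    """Fraction of intervals in which aggregate tenant demand exceeds the shared link.

    Toy model: in each interval every tenant independently either bursts
    (with probability burst_prob) or idles. Real traffic is far messier.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    congested = 0
    for _ in range(intervals):
        demand = sum(
            burst_mbps if rng.random() < burst_prob else idle_mbps
            for _ in range(n_tenants)
        )
        if demand > link_mbps:
            congested += 1
    return congested / intervals


# Hypothetical host: 20 tenants sharing a 1 Gb/s link, each idling at
# 5 Mb/s and bursting to 200 Mb/s.
for p in (0.05, 0.15, 0.30):
    print(p, congested_fraction(20, 1_000, 5, 200, burst_prob=p))
```

Average utilization in a model like this can look comfortable while the tail still produces intervals in which tenants contend for the pipe, which is the sort of thing an average-based capacity figure hides.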
This complicates things when you consider that at this point scaling out in CPU is easier to do than scaling out in the network. Add virtualization into the mix, which drives big, flat L2 networks as a design architecture layered with a control plane that is now (in the case of Cloud) mostly software driven, provisioned, orchestrated and implemented, and it’s no wonder that folks like Google, Amazon and Facebook are desperate for hugely dense, multi-terabit, wire speed L2 switching fabrics and could use 40 and 100Gb/s Ethernet today.
Check out this interesting article.
Oh, let’s not forget that there are also now providers who are deploying converged data/storage networking of said pipes with the likes of FCoE/DCE with all sorts of interesting ramifications on the above discussion. If you thought it was tough to get your arms around before…
If you know much about Ethernet, congestion avoidance/recovery/control, QoS, etc. you know that it’s a complex beast. If service levels relating to network performance aren’t in your contract, you’re probably figuring out why right about now.
So, wrapping this up, I have to accept AWS’ statement that they “…do not have over-capacity issues,” because quite frankly there’s nothing to suggest otherwise. That’s not to say there aren’t performance issues related to something else (like software or hardware in the stack), but that’s not the same as being over capacity. You’ll also notice that they didn’t say they were not “over-subscribed” but rather that they were not “over capacity.”
*Just ask AT&T about their network and the iPhone. This *is* a case where their over-subscription planning failed in the face of capacity…and continues to.