22 APR 2011

Posted by Leo Zheng at 03:14 PM EDT

In light of the recent Amazon EC2 outage: let's talk clouds

The Amazon Elastic Compute Cloud or EC2 provides a scalable application deployment service by allowing users to create virtual machines (VMs) or ‘instances’ containing the software they want to distribute. Server instances can be launched or terminated as needed (customers pay by the hour), and users are given geographical control of their VMs to minimize latency and improve redundancy.

Now, redundant servers and the system’s internal checks and balances are usually enough to keep noticeable downtime to a minimum for customers hosting on EC2, but that was not the case yesterday. Big names like Reddit, Hootsuite, Quora, Foursquare, and many more startups and websites were all affected in one way or another by a massive Amazon cloud outage. Reddit is still in ‘emergency mode’ as I write this.

That being said, stuff goes down all the time. In the big picture, it’s not a big deal unless it stays down or it happens annoyingly often. I would wager that all the services that went down yesterday (some of which are still down today) took the expected downtime of EC2 into consideration when they decided to operate there, and concluded that the dollars saved were worth the downtime.
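To put "expected downtime" into numbers, here is a minimal sketch that converts an uptime percentage into hours of allowed downtime per year. The percentages are illustrative round figures chosen for this example, not numbers taken from Amazon's or anyone else's service agreement, and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service can be down and still hit the given uptime percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Illustrative uptime targets (assumptions, not any provider's published figures).
for pct in (99.0, 99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is roughly the dial a service is turning when it weighs hosting cost against reliability.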

But for everyone else (mainly customers of the services hosted on EC2) who has not done that price / performance evaluation, the idea of hosting your critical operations “in the cloud” may seem questionable in light of recent events.

Here's how I kind of imagine it going:


Cloud Hosting Vs. Cloud Service [OnSIP]

The term “cloud” is widely used these days to replace “hosted.” We’ve heard friends and customers refer to OnSIP as “voice in the cloud” and “cloud PBX.” In a sense, these terms properly explain that users need not have their own PBX - customers can just plug their phones into the Internet (cloud), and can deploy a phone system in a matter of minutes. In this sense, we don’t mind the term “cloud,” and we’ve even used the whole “cloud theme” on our website.

However, we’ve also seen the terms “voice in the cloud” and “cloud PBX” create a bit of confusion about our services, which we’d like to clarify. While OnSIP allows you to “manage your phone system in a cloud,” OnSIP is not, itself, hosted in the cloud. We do not use services like Amazon EC2 to maintain our data and applications. Instead, we have our own servers, which we manage ourselves and which are not shared with other applications. We believe this is an important distinction, so we asked our Engineering team to explain in detail:

Input from our CTO, John:

“From a business standpoint, part of operating an Internet service involves evaluating the price vs. performance of various technologies available. While a hosted service like EC2 provides a relatively compelling price/performance ratio for many Internet applications, we believe the performance level available does not meet the minimum required for a business-class VoIP service.

It boils down to the quality of service (QoS) the network supporting "the cloud" is able to provide. While we could potentially operate OnSIP on a third party's set of hosted servers, we are not aware of any providers who maintain a "hosted network" that is optimized for realtime IP traffic (RTP traffic in particular - the voice/video in VoIP is carried in RTP packets) and provide guaranteed levels of service to back it up...

So we built our own network which is optimized for the service we are delivering. We have purposely built our network to maximize our ability to deliver quality RTP traffic and achieve the level of quality we believe business users require.”

From our System Engineer, Charlotte:

"We need to very finely control latency and jitter in our network. When you host with someone else, you've no control over that at all, so we couldn't guarantee the kind of Internet backbone that we need to run a reliable service. We also need to connect to the Internet in a very particular way to guarantee the best connectivity to the networks where our end-users are located. If any of the networks that we connect to have network problems, we have enough redundancy that we can quickly reroute our traffic around them and then work with the Internet provider directly to solve the problem. If we were in a cloud hosting scenario, we'd only be able to complain and wait and hope it got better. That's just not good enough for our needs.

And, of course, when there is an issue like the Amazon cloud outage, there's nothing you can do to fix it other than wait for someone else. That's a terrible position to be in, especially if your customers are businesses that rely on you for their daily operations. Since we run all of our own stuff, we're able to monitor and predict potential failures and respond to them (as proactively as possible) ourselves.

Admittedly, the trade-off is that running our own equipment and network is much more expensive, but we get to build it to our specific requirements."
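For readers wondering what "jitter" means in practice, here is a rough sketch of the running interarrival jitter estimate described in RFC 3550 (the RTP specification), section 6.4.1. It is an illustration only, not a description of OnSIP's monitoring. The sample timestamps are invented, and times are given in milliseconds for readability, whereas real RTP uses timestamp units tied to the media clock.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    Both lists hold one time per packet, in arrival order, in the same unit
    (milliseconds here, which is a simplification for illustration).
    """
    if len(send_times) != len(recv_times):
        raise ValueError("need one send time and one receive time per packet")

    jitter = 0.0
    for i in range(1, len(send_times)):
        # Difference in transit time between this packet and the previous one.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        # Smooth with a gain of 1/16, as the RFC specifies.
        jitter += (abs(d) - jitter) / 16
    return jitter


# Hypothetical packets sent every 20 ms (made-up numbers).
sent = [0, 20, 40, 60, 80]
steady_arrivals = [50, 70, 90, 110, 130]  # constant 50 ms delay
bumpy_arrivals = [50, 75, 88, 118, 131]   # delay varies packet to packet

print("steady network jitter:", interarrival_jitter(sent, steady_arrivals))
print("bumpy network jitter:", interarrival_jitter(sent, bumpy_arrivals))
```

A constant delay produces zero jitter no matter how long the delay is. It is the packet-to-packet variation that hurts voice quality, which is why latency and jitter are called out separately in the quote above.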
