Down For The Count: 9 High-Profile Cloud Outages

A Problem That Won't Go Away

Despite the sophistication of the technology that powers cloud infrastructure and services and the vast sums of money involved, the cloud is still disturbingly vulnerable to outages that can close an online business down for hours, and even days, or cause it to lose precious data.

Cloud providers suffered some embarrassing service failures this year. Jeff Kaplan, the founder of THINKstrategies, has some simple advice that never grows old when it comes to cloud availability.

"There are inevitably going to be disruptions to service availability, and it's key for service providers to minimize these occurrences and for cloud consumers to mitigate their risk by having a backup and recovery plan in place and by exploring ways to take advantage of offline service options."

Read on to see which cloud providers experienced downtime this year.

Microsoft Azure

On Feb. 28, a so-called leap-year bug caused Microsoft Azure to suffer an extensive, worldwide outage that wasn't fixed for more than 24 hours.

Microsoft said the software bug was related to a "time calculation that was incorrect for the leap year."
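To illustrate how a leap-year time calculation can go wrong, here is a minimal Python sketch. It is not Microsoft's code, and the function names are hypothetical. It assumes the common mistake of computing "one year from today" by simply adding one to the year, which produces a nonexistent date when today is Feb. 29.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bumps the year and keeps month and day.
    # Feb. 29 has no counterpart in a non-leap year, so this raises ValueError.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: fall back to Feb. 28 when the target year has no Feb. 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
    naive_failed = False
except ValueError:
    # The naive calculation asked for Feb. 29, 2013, which does not exist.
    naive_failed = True

print("naive calculation failed:", naive_failed)
print("safe result:", safe_one_year_later(leap_day))
```

A calculation like the naive one works on every other day of the calendar, which is why this class of bug can sit unnoticed until a leap day arrives.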

The outage drew an angry reaction from customers, some of whom wanted more communication about the issue.

Amazon Web Services

On June 15, an Amazon Web Services power outage cut services to customers for about six hours, affecting its Amazon Elastic Compute Cloud, Amazon Relational Database Service and AWS Elastic Beanstalk, which are run from Amazon's data centers in Northern Virginia.

The Northern Virginia data centers, the company's oldest and most used, suffered a similar outage in 2011 and another in October, leading some to believe its infrastructure is wearing thin.

But for Amazon partners, the situation is just a drag.

"I like Amazon, but when these things happen you have to drop everything, although I understand these things also happen in traditional IT centers as well," said Jeremy Przygode, CEO and founder of Stratalux, a partner who lost services for one of his customers for about an hour.

Microsoft Windows Azure, Again

Azure customers in Western Europe were affected on July 26, when "a service interruption was triggered by a misconfigured network device that disrupted traffic to one cluster in our West Europe subregion," Microsoft said.

Azure's cloud computing service went down for about 2.5 hours. The company said customers' storage accounts were not impacted during the outage.

Google Talk

The same day as Microsoft's outage, the Google Talk chat service, used by Gmail customers, went down for almost five hours.

Google's Talk Service dashboard kept customers updated throughout the outage, and Google apologized when the service was restored, saying in part: "Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better."

GoDaddy

On Sept. 11, web-hosting and email services company GoDaddy said a six-hour outage that disrupted its operations was caused by a networking issue and not by an attack from Anonymous, as the hacker group claimed.

The company said the outage was due to a series of internal network events that corrupted router data tables.

In early October, GoDaddy said it planned to close its cloud computing business, telling its SMB customers it will try to integrate the business into other services.

Amazon Web Services, Again

Amazon Web Services went down in its Northern Virginia market Oct. 22, causing website outages at an unknown number of companies, including Reddit, Pinterest and Airbnb.

The outages affected Elastic Beanstalk services, followed by announcements of service interruptions with its Management Console for Elastic Beanstalk, Relational Database Service, ElastiCache, Elastic Compute Cloud and CloudSearch.

The event raised questions on whether Amazon's cloud infrastructure at the Northern Virginia data centers needs to be upgraded.

Google App Engine

Google App Engine, the company's platform for developing and hosting Web applications in Google-managed data centers, went down Oct. 26 for about four hours as it experienced slowness and errors. As a result, 50 percent of requests to App Engine failed.

The company said no application data was lost and application behavior was restored. Google said it would issue credits to customers for 10 percent of their usage in November.

Google said it is bolstering its network service to guard against traffic latency. "In response to this incident, we have increased our traffic-routing capacity and adjusted our configuration to reduce the possibility of another cascading failure," the company said.

The same day as Google's outage, Dropbox and Tumblr also reported service disruptions. Nothing was found to link the three incidents, but the coincidence does make one wonder.

Tumblr

At about the same time as the Google App Engine failure, the microblogging platform and social networking website Tumblr suffered a service disruption. The company said service was restored within a few hours and promised to issue a full report.

Tumblr could not be reached for comment in the days after the outage to explain it further.

Dropbox

Online storage company Dropbox also experienced an outage on Oct. 26.

It displayed a message that said, "Error: Something went wrong. Don't worry, your files are still safe and the Dropboxers have been notified."

Dropbox also could not be reached for comment after the outage.