The 10 Biggest Cloud Outages Of 2017
Downtime In 2017
As a general rule, uptime continues to improve as cloud providers gain proficiency and develop better tools for operating the largest and most-advanced server clusters to ever exist.
For that reason, the very notion of a catastrophic cloud outage seemed almost an anachronism going into 2017. All providers suffer bouts of downtime that restrict specific services, or short bursts of regional unavailability, but many believed massive failures of the kind seen in the industry's early days had surely gone the way of the dodo.
But near the end of February, the world was reminded that even the most experienced operators equipped with the most advanced automation tooling were vulnerable, and that the blast radius of failure was unprecedented.
That Amazon Web Services outage shook the industry, and diminished the confidence of enterprise customers warming to cloud adoption, because of the sheer number of business services that became unavailable that day. GitHub, Slack, Zendesk, Heroku, Twilio, Mailchimp, Citrix and Expedia were just a few of the casualties. Confidence further waned when the cloud leader revealed the cause was human error -- essentially an incorrect one-line command typed in by a technician.
That memorable outage, and to a lesser extent the nine others on the list below, remind a rapidly maturing industry that the stakes of operational excellence are higher than ever.
IBM, January 26
IBM's cloud credibility took a hit at the start of the year when a management portal used by customers to access its Bluemix cloud infrastructure went down for several hours.
While no underlying infrastructure failed, users were frustrated in finding they could not manage their applications or add or remove cloud resources powering workloads.
IBM said the problem was intermittent and stemmed from a botched update to the interface.
GitLab, January 31
GitLab's popular online code repository, GitLab.com, suffered an 18-hour service outage that ultimately couldn't be fully remediated. The problem arose when an employee removed a database directory from the wrong database server during maintenance procedures.
Some customer production data was ultimately lost, including modifications to projects, comments, and accounts.
"Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts," the company said in a post-mortem.
In an apology to users, GitLab's CEO said: "losing production data is unacceptable."
Facebook, February 24
For almost three long, painful hours, some users across the world were locked out of Facebook and worried their accounts had been hijacked.
The social media giant later explained that functionality meant to guard against hackers inadvertently sent users to a recovery screen that gave the impression someone else had logged into their accounts. Affected users were prevented from immediately logging back in.
Facebook confirmed no actual security breach had occurred.
It was the second time that week Facebook had problems. Days earlier, some people reported they could not see their news feeds.
Amazon Web Services, February 28
This was the outage that shook the industry.
An Amazon Web Services engineer trying to debug an S3 storage system in the provider's Virginia data center mistyped a command, and much of the Internet, including many enterprise platforms like Slack, Quora and Trello, was down for four hours.
The post-mortem said the employee was using "an established playbook," and intended to pull down a small number of servers that hosted subsystems for the billing process. Instead, the accidental command resulted in a far broader swath of servers being taken offline, including one subsystem necessary to serve specific requests for data storage functions and another allocating new storage.
The outage from a provider that owns roughly a third of the global cloud market reignited the debate on the risks of public cloud.
Microsoft Azure, March 16
Storage availability issues plagued Microsoft's Azure public cloud for more than eight hours, mostly affecting customers in the Eastern U.S.
Some users had trouble provisioning new storage or accessing existing resources in the region. A Microsoft engineering team later identified the culprit as a storage cluster that lost power and became unavailable.
In addition to that problem, Microsoft also listed on the Azure status page a software error affecting storage provisioning across multiple services for longer than an hour.
Microsoft Office 365, March 21
Several Microsoft business and consumer cloud services, including Office 365 storage and email services, became inaccessible due to problems authenticating users.
The widespread outage prevented customers from accessing OneDrive storage, Skype collaboration, Outlook email, and consumer products such as Xbox Live.
Apple iCloud, June 28
Multiple social media feeds reported availability problems with Apple's iCloud Backup service. Apple's system status page said iCloud Backup was down for less than 1 percent of users.
The problem, in which those affected could not restore iOS devices from previous backups, lasted for at least 36 hours. While the restore process would hang without completion, there was no problem initiating new backups of devices to protect data.
Amazon Web Services, September 14
While this AWS outage in September was nowhere near as severe as Amazon's February debacle, the fact that the failure affected the S3 storage service, and that the problems originated in the same US-EAST-1 region, was enough to evoke unpleasant memories of the calamitous event about half a year earlier.
Errors in accessing storage buckets started attracting attention around noon and were under control before 1 p.m. Eastern.
Microsoft Azure, September 29
Some Microsoft services in the Azure public cloud became unavailable to European customers for seven hours.
An accidentally triggered fire-suppression system brought down the world's second-largest cloud provider in Northern Europe.
Microsoft said routine maintenance on the system released fire-suppression gas, automatically triggering a shutdown of the air conditioning system. That raised the ambient temperature in the facility, forcing automatic shutdowns of computing systems.
Important cloud services like Virtual Machines, Cloud Services, Azure Backup and several others were offline between 1:27 p.m. and 8:15 p.m. local time.
Google Docs, November 15
Thousands of Google Docs users experienced a service disruption that affected their businesses.
The downtime started minutes before 4 p.m. Eastern and lasted 30 minutes to just over an hour for most users, Google said. During the outage, which the internet giant confirmed impacted a "significant subset of users," the popular document creation and editing tool couldn't access files.
Google said that the Docs service was back up for most users by Wednesday night.
A Google partner told CRN that out of its 400 business customers, six were impacted by the service interruption. The solution provider, a Google user itself, was also impacted.