The 10 Biggest Cloud Outages Of 2023 (So Far)
IT Glue, Microsoft, Google Cloud and AWS made the list.
A multi-day IT Glue outage in January. An hourslong Microsoft Exchange Online outage in March. And a fire that wreaked havoc for Google Cloud users in Europe.
These are among the biggest cloud outages of 2023 so far, based on a CRN review of incidents that left IT professionals without access to tools.
Cloud Outages In 2023
Other vendors hit with outages this year include:
*Oracle
*Datadog
*Amazon Web Services (AWS)
Cloud downtime can cost users $100,000 an hour, according to a report this year from New York-based insurance company Parametrix Solutions. Worldwide end-user spending on public cloud services is expected to hit $591.8 billion this year.
A survey by Enterprise Management Associates (EMA) and BigPanda, a Redwood City, Calif.-based event correlation and automation platform provider, found that unplanned downtime costs an average of $12,900 per minute.
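The two figures above use different units, so a quick conversion helps compare them. The following is a minimal sketch of that arithmetic in Python. The function names, the assumption that cost scales linearly with duration and the 300-minute outage length are all illustrative assumptions. They are not drawn from either report and do not describe any incident in this list.

```python
# Illustrative arithmetic only. The two rates are the figures cited above;
# everything else (names, linear scaling, outage length) is an assumption.
PARAMETRIX_RATE_PER_HOUR = 100_000      # dollars per hour (Parametrix figure)
EMA_BIGPANDA_RATE_PER_MINUTE = 12_900   # dollars per minute (EMA/BigPanda figure)


def per_hour(rate_per_minute):
    """Convert a per-minute cost into a per-hour cost."""
    return rate_per_minute * 60


def downtime_cost(rate_per_hour, outage_minutes):
    """Estimate outage cost, assuming cost grows linearly with duration."""
    return rate_per_hour * outage_minutes / 60


if __name__ == "__main__":
    # Hypothetical five-hour outage, chosen only to show the calculation.
    hypothetical_outage_minutes = 300
    rates = [
        ("Parametrix", PARAMETRIX_RATE_PER_HOUR),
        ("EMA/BigPanda", per_hour(EMA_BIGPANDA_RATE_PER_MINUTE)),
    ]
    for label, rate in rates:
        cost = downtime_cost(rate, hypothetical_outage_minutes)
        print(f"{label}: ${rate:,.0f} per hour, ${cost:,.0f} for the outage")
```

On an hourly basis, the EMA and BigPanda average works out to $774,000, well above the Parametrix figure. The two reports may measure different populations and types of loss, so the comparison is rough.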
E-learning provider A Cloud Guru recommends users explore a multi-cloud strategy, back up essential data and be familiar with service level agreements (SLAs) to get credits and partial or even full refunds.
Here are the biggest cloud outages of 2023 so far.
January Microsoft Outages
On Jan. 17, Microsoft Teams and 365 users in North America faced an outage from 9:17 a.m. Eastern to about 2:18 p.m. Eastern.
Outage-tracking website Downdetector showed thousands of reported issues with Teams including 504 reports of issues at approximately 10 a.m. and another 503 reports of issues at 11 a.m.
About 66 percent of the reports involved server connections, 20 percent the application and 14 percent login issues.
On Jan. 25, Reuters reported that a networking issue caused an outage for Azure, Teams, Outlook and other services in the Americas, Europe, Asia Pacific, Middle East and Africa. Services were fully restored by late morning.
Microsoft said a network connectivity issue happened with devices across the Microsoft wide area network (WAN), according to Reuters.
The incident lasted about five hours, according to Quest Software’s Practical 365. The issue involved a command given to a WAN router that caused it to send messages to other routers, prompting them to recompute their adjacency and forwarding tables, which prevented packet forwarding during the recomputation.
Microsoft has about 400,000 channel partners worldwide, according to CRN’s 2023 Channel Chiefs.
January IT Glue Outage
IT Glue reported at around 8 a.m. Pacific on Jan. 18 that it underwent “emergency database maintenance … to resolve an issue some customers are experiencing.”
The IT documentation software vendor owned by Kaseya was put into read-only mode until 9:33 a.m. Pacific, according to the incident report.
IT Glue restored all passwords and documents by Jan. 20.
Although IT Glue doesn’t have incident reports for the following dates, users of Reddit posted about issues with the platform on Jan. 9 and Jan. 11 as well.
IT Glue’s user base includes more than 13,000 organizations worldwide and more than 350,000 individuals.
February Oracle, NetSuite Outages
Despite public assertions by Oracle co-founder and Chief Technology Officer Larry Ellison that the vendor’s Oracle Cloud Infrastructure (OCI) “doesn’t go down,” the vendor saw a couple of issues in February.
A multi-day OCI outage struck in February, according to Network World.
The issue started at around 10:30 a.m. Pacific time Feb. 13 and lasted until around 3:30 p.m. Wednesday for users in the Americas, Australia, Asia Pacific, the Middle East, Europe and Asia.
The issue involved performance in the back-end infrastructure supporting the OCI public domain name system (DNS) application programming interface (API), which prevented the processing of some incoming service requests. Oracle used real-time back-end optimizations and DNS load management fine-tuning to mitigate the issue.
OCI Vault, API Gateway, Oracle Digital Assistant and OCI Search with OpenSearch faced issues during the outage, according to Network World.
An outage for Oracle subsidiary NetSuite started around noon Eastern Feb. 14 due to a Cyxtera data center fire in Waltham, Mass., according to Data Centre Dynamics.
The Massachusetts facility cut power to its servers and account restoration started around 10:26 p.m. Eastern, according to The Register.
At least one Reddit user reported receiving credit on their account for the issue.
NetSuite has about 880 channel partners worldwide, 300 in North America, according to CRN’s 2023 Channel Chiefs.
March Microsoft Exchange Online Outage
On March 1, some users of Microsoft Exchange Online couldn’t access mailboxes through any connection method.
The Redmond, Wash.-based vendor first tweeted at 8:56 a.m. Pacific Wednesday about the issue. By 12:59 p.m. Eastern, the problem was resolved.
A possible contributor to the issue was directory-based edge blocking (DBEB), which allows administrators to configure message rejection for invalid recipients and block messages sent to email addresses not present in Microsoft 365 or Office 365, according to Bleeping Computer.
March Datadog Outage
Datadog took almost two days to resolve an outage that started March 8.
The New York-based cloud monitoring and security tools vendor notified users about an issue with its web app at 1:31 a.m. Eastern, according to MarketWatch. Analysts at Wells Fargo even published a note expressing concerns about the outage’s effect on Datadog’s revenue.
The incident cost Datadog about $5 million and took three shifts of about 500 to 600 engineers to resolve, CEO Olivier Pomel revealed on the vendor’s May quarterly earnings call, according to a transcript.
Pomel said he doesn’t “worry so much about it happening again” and that Datadog learned how “to recover faster” and “a better way for our customers to mitigate an issue when that happens,” according to the transcript.
Tech columnist Gergely Orosz wrote that Datadog “most likely did not charge customers for data transfers while the system was down” and that “the loss represents about a day’s worth of revenue for the company.”
Orosz said that an operating system update was a factor in the outage and that the vendor could have done better communicating with users about the incident.
April AWS Outage
Hundreds of users lost access to AWS on April 16 while thousands couldn’t access voice assistant Alexa, according to Reuters. The vendor’s Amazon mobile app also experienced issues.
AWS users couldn’t complete account signups and received error messages about the billing console.
The outage lasted more than three hours starting at around 6 a.m. Pacific, according to Bloomberg.
AWS has about 100,000 channel partners worldwide, according to CRN’s 2023 Channel Chiefs.
April Microsoft Outages
Microsoft users faced a nearly six-hour-long issue with Microsoft 365 online applications and the vendor’s Teams collaboration app on April 20.
The vendor tweeted at 6:56 a.m. Pacific that it was “investigating access issues with Microsoft 365 Online apps and the Teams admin center.”
The company tweeted at 1:10 p.m. Pacific that it “received positive confirmation through our internal telemetry and impacted users that service has been restored.”
Ookla’s Downdetector website noted thousands of reported M365 outages Thursday, with reports passing 3,000 at around 7 a.m. Pacific and reaching a high around 9 a.m. Pacific.
Teams, SharePoint Online and Outlook had another outage on April 24, according to The Register. Microsoft tweeted about the issue at 4:17 a.m. Pacific and tweeted again at 7:17 a.m. to say a “majority of impact” had been remediated.
Bleeping Computer reported another outage with Exchange Online on April 25. Microsoft tweeted about the issue at 1:21 p.m. Pacific and said the issue was resolved about an hour later.
April Google Outage
On April 25 around 5:20 p.m. Pacific, a Paris data center fire downed Google Cloud and more than 90 cloud services for Europe region users, according to The New Stack.
The affected services included Google Cloud Storage (GCS), Cloud Key Management Service (KMS), Cloud Identity and Access Management (IAM) and Google Kubernetes Engine (GKE), according to IT Pro.
On May 10, Google reported that “some instances located in the impacted portions of the datacenter remain unavailable.”
April Oracle-Cerner Issues
On April 17, the Department of Veterans Affairs dealt with an outage of its Oracle-Cerner Electronic Health Record (EHR) system that lasted five hours, according to Federal News Network.
The outage happened due to an upgrade for database capability and failovers, according to FNN.
Then on April 25, the Oracle-Cerner system sustained another outage for almost four hours that hit the VA, the Defense Department and the Coast Guard.
The VA halted additional implementations of the system until gaining more confidence in the system’s functionality at the five sites that use it, according to EHRIntelligence.
June Microsoft Outages
The outages continue as solution providers prepare for the second half of 2023, with multiple Microsoft products rocked by outages at the beginning of June. On June 5, an outage affected tens of thousands of Microsoft 365 users in the morning. The software giant said it halted an unspecified “update.”
“We’ve identified downstream impact for Microsoft Teams, SharePoint Online and OneDrive for Business,” said Microsoft in a tweet at around 11:45 a.m. Eastern.
Microsoft said it prevented a “potentially problematic update” from propagating further across the service and is reviewing options to revert the change quickly in those portions of Microsoft’s infrastructure where it has been applied.
The next day, Microsoft saw a “recurrence” of the service issues. At 12:03 p.m. Eastern, Microsoft said that it had “identified that the impact has started again” and that it was applying further mitigation.
“Telemetry indicates a reduction in impact relative to earlier iterations due to previously applied mitigations,” said Microsoft.
At 11:22 a.m. Eastern, 3,118 Downdetector users reported Microsoft 365 problems.
On June 8, a hacktivist group known as “Anonymous Sudan” claimed responsibility for causing a Microsoft OneDrive outage. At 3 p.m. Eastern, Microsoft said it is “continuing to analyze monitoring telemetry and performing load-balancing processes to provide relief.”
A subsequent update to the status page Thursday indicated that the outage has only impacted access to OneDrive through a web browser. “Access to the OneDrive service using the desktop client, a synchronization client or Office clients are not impacted,” Microsoft said in the update.
The next day, Microsoft users experienced a major outage with the Azure cloud platform portal going down.
Microsoft appeared to have resolved the problem by the afternoon. Shortly after 11 a.m. Eastern, user reports of Azure availability issues began to climb on Downdetector. The site logged thousands of user reports of Azure outages over the next two hours.
Anonymous Sudan claimed that it carried out distributed denial-of-service (DDoS) attacks against the Azure portal.
On Monday, Microsoft said that a “spike in network traffic” has been identified as the likely cause of the outage.
“We identified a spike in network traffic which impacted the ability to manage traffic to these sites and resulted in the issues for customers to access these sites,” Microsoft said.