The 15 Biggest Cloud Outages Of 2023
Amazon Web Services, Microsoft, Google and other cloud vendors were hit by significant service outages this year.
Technology giants and vendors of all sizes experienced multiple cloud outages this year as cloud platform technology continues to grow in importance for running critical business processes.
In fact, service outages have become so commonplace and preparation so essential that during the AWS re:Invent conference in November, the cloud giant announced more scenarios for its Fault Injection Service (FIS) that customers use to test how applications perform if AWS Availability Zones experience full power interruption or lose connectivity from another AWS region.
A report this year from Parametrix Insurance concluded that a 24-hour outage of mission-critical services from AWS us-east-1 – the cloud region with the largest number of Fortune 500 companies relying on it – could cost $3.4 billion in direct revenue. A 48-hour outage could cost $7.8 billion.
A 24-hour loss of both east-1 and west-2 AWS services could cost $8.2 billion, or $17.5 billion if the outage lasted 48 hours, according to the report.
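To put those totals in perspective, the short Python sketch below converts each quoted estimate into an implied average cost per hour. It uses only the four figures cited above; the scenario labels and function names are illustrative, and the averaging is a simplification of this article's numbers, not the report's own methodology.

```python
# Direct-revenue loss estimates quoted above, in millions of US dollars,
# keyed by (scenario label, outage length in hours). Labels are illustrative.
ESTIMATES_MILLIONS = {
    ("east-1", 24): 3_400,
    ("east-1", 48): 7_800,
    ("east-1 + west-2", 24): 8_200,
    ("east-1 + west-2", 48): 17_500,
}


def implied_hourly_cost(total_millions: int, hours: int) -> float:
    """Average cost per hour, in millions of dollars, rounded to one decimal."""
    return round(total_millions / hours, 1)


def hourly_table(estimates=ESTIMATES_MILLIONS) -> dict:
    """Map each (scenario, hours) pair to its implied average hourly cost."""
    return {
        key: implied_hourly_cost(total, key[1])
        for key, total in estimates.items()
    }


if __name__ == "__main__":
    for (scenario, hours), per_hour in hourly_table().items():
        print(f"{scenario:>16} {hours}h: about ${per_hour}M per hour")
```

In both scenarios the 48-hour estimate is more than double the 24-hour estimate, so the implied hourly cost rises as the outage drags on.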
For IT professionals worried about threat actors causing outages, a report that Aviatrix expects to publish in January finds that “more than two times as many cloud network outages were caused by firewalls than by cyberattacks within respondents' organizations in the last year.”
Read on for more information about the biggest cloud outages that occurred in 2023.
January Microsoft Service Outages
On Jan. 17, users of Microsoft Teams and Microsoft 365 in North America faced an outage from 9:17 a.m. EST to about 2:18 p.m. EST.
Outage-tracking website Downdetector showed thousands of reported issues with Teams including 504 reports of issues at approximately 10 a.m. and another 503 reports of issues at 11 a.m.
About 66 percent of the reports concerned server connections, 20 percent the application and 14 percent login issues.
On Jan. 25, Reuters reported that a networking problem caused an outage for Azure, Teams, Outlook and other services in the Americas, Europe, Asia-Pacific, Middle East and Africa. Services resumed after a full system recovery by late morning.
Microsoft blamed a network connectivity issue with devices across the Microsoft wide area network (WAN), according to Reuters.
The incident lasted about five hours, according to Quest Software’s Practical 365. The issue involved a command given to a WAN router that caused it to send messages to other routers, prompting them to recompute their adjacency and forwarding tables; the routers could not forward packets while that recomputation was underway.
Microsoft has about 400,000 channel partners worldwide, according to CRN’s 2023 Channel Chiefs.
January IT Glue Outage
IT Glue reported at around 8 a.m. PST on Jan. 18 that it had to undertake “emergency database maintenance … to resolve an issue some customers are experiencing.”
The IT documentation software vendor owned by Kaseya was put into read-only mode until 9:33 a.m. PST, according to the incident report. IT Glue restored all passwords and documents by Jan. 20.
Although IT Glue doesn’t have incident reports for the following dates, users of Reddit posted about issues with the platform on Jan. 9 and Jan. 11 as well.
IT Glue’s user base includes more than 13,000 organizations worldwide and more than 350,000 individuals.
The incident led to more than 40 comments across multiple posts in January.
February Oracle, NetSuite Outages
Despite public assertions by Oracle co-founder and Chief Technology Officer Larry Ellison that the vendor’s Oracle Cloud Infrastructure (OCI) “doesn’t go down,” the vendor’s cloud service experienced a couple of issues in February.
A multi-day OCI outage struck in February, according to Network World.
The issue started around 10:30 a.m. PST on Monday, Feb. 13 and lasted until around 3:30 p.m. Wednesday, Feb. 15, for users in the Americas, Australia, Asia-Pacific, the Middle East, Europe and Asia.
The issue involved performance problems in the back-end infrastructure supporting the OCI public domain name system (DNS) application programming interface (API), which prevented the processing of some incoming service requests. Oracle used real-time back-end optimizations and DNS load management fine-tuning to mitigate the issue.
OCI Vault, API Gateway, Oracle Digital Assistant and OCI Search with OpenSearch all faced issues during the outage, according to Network World.
An outage for Oracle subsidiary NetSuite started around noon EST, Feb. 14, due to a Cyxtera data center fire in Waltham, Mass., according to Data Centre Dynamics.
The Massachusetts facility cut power to its servers and account restoration started around 10:26 p.m. EST, according to The Register.
At least one Reddit user reported receiving credit on their account for the issue.
NetSuite has about 880 channel partners worldwide, 300 in North America, according to CRN’s 2023 Channel Chiefs.
March Datadog Outage
Datadog took almost two days to resolve a service outage that started March 8.
The New York-based cloud monitoring and security tools vendor notified users about an issue with its web application at 1:31 a.m. EDT, according to MarketWatch. Analysts at Wells Fargo even published a note expressing concerns about the outage’s effect on Datadog’s revenue.
The incident cost Datadog about $5 million and took three shifts of about 500 to 600 engineers to resolve, CEO Olivier Pomel revealed on the vendor’s May quarterly earnings call, according to a transcript.
Pomel said he doesn’t “worry so much about it happening again” and that Datadog learned how “to recover faster” and “a better way for our customers to mitigate an issue when that happens,” according to the transcript.
Tech columnist Gergely Orosz wrote that Datadog “most likely did not charge customers for data transfers while the system was down” and that “the loss represents about a day’s worth of revenue for the company.”
Orosz said that an operating system update was a factor in the outage and said the vendor could have done better communicating with users about the incident.
April Microsoft Outages
Microsoft users faced a nearly six-hour-long issue with Microsoft 365 online applications and the vendor’s Teams collaboration application on April 20.
The vendor tweeted at 6:56 a.m. PDT that it was “investigating access issues with Microsoft 365 Online apps and the Teams admin center.”
The company tweeted at 1:10 p.m. PDT that it “received positive confirmation through our internal telemetry and impacted users that service has been restored.”
Ookla’s Downdetector website noted thousands of reported M365 outages that day, with reports surpassing 3,000 at around 7 a.m. PDT and peaking around 9 a.m. PDT.
Teams, SharePoint Online and Outlook suffered another outage on April 24, according to The Register. Microsoft tweeted about the issue at 4:17 a.m. PDT and again at 7:17 a.m. to say a “majority of impact” had been remediated.
Bleeping Computer reported another outage with Exchange Online on April 25. Microsoft tweeted about the issue at 1:21 p.m. PDT and said the issue was resolved about an hour later.
April Google Outage
On April 25 around 5:20 p.m. PDT, a fire in a Paris, France data center brought down Google Cloud and more than 90 cloud services for Europe region users, according to The New Stack.
The affected services included Google Cloud Storage (GCS), Cloud Key Management Service (KMS), Cloud Identity and Access Management (IAM) and Google Kubernetes Engine (GKE), according to IT Pro.
On May 10, Google reported that “some instances located in the impacted portions of the datacenter remain unavailable.”
The news prompted more than 200 comments on a Reddit post in a system administrators-focused forum.
April Oracle-Cerner Issues
On April 17, the Department of Veterans Affairs dealt with an outage of its Oracle-Cerner Electronic Health Record (EHR) system that lasted five hours, according to Federal News Network.
The outage happened due to an upgrade for database capability and failovers, according to FNN.
Then on April 25, the Oracle-Cerner system sustained another outage for almost four hours that hit the VA, the U.S. Department of Defense and the U.S. Coast Guard.
The VA halted additional implementations of the system until gaining more confidence in the system’s functionality at the five VA sites that use it, according to EHR Intelligence.
May Cisco SD-WAN Issue
On the hardware side of cloud outages, an expired public root certificate for various Cisco vEdge platforms resulted in a public apology from the vendor on X – formerly known as Twitter – and a post on a Cisco-focused forum on Reddit with more than 80 comments.
“We apologize for the challenge this is creating,” Cisco posted on X on May 10.
The vendor “published upgrade versions of software to permanently resolve this problem,” according to a post on Cisco’s website.
vEdge routers deliver “WAN, security and multi-cloud capability of the Cisco SD-WAN solution,” according to the vendor. “Cisco SD-WAN vEdge routers are delivered as hardware, software, cloud or virtualized components that sit at the perimeter of a site, such as remote office, branch office, campus or a data center.”
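Certificate expirations like this one are predictable, and many IT teams monitor for them. The Python sketch below shows one generic way to check how many days remain on the certificate a TLS endpoint presents. It is not Cisco's tooling and does not inspect vEdge hardware or root certificates; the function names, the 30-day warning threshold and the sample date in the usage note are assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate presented by host:port.

    Uses the default verifying context, so an already-expired or untrusted
    certificate raises ssl.SSLError instead of returning a value.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86_400


def needs_renewal(not_after: str, now: datetime, warn_days: float = 30.0) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

For example, `needs_renewal("Jan  1 00:00:00 2030 GMT", datetime.now(timezone.utc))` evaluates a made-up expiry date without touching the network, while `fetch_not_after("example.com")` would retrieve the real value from a live endpoint.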
June Microsoft Outages
Microsoft 365 services such as Teams and Outlook saw widespread outages on back-to-back days in early June, followed by a major OneDrive outage days later.
Then, the following day, the portal for Microsoft’s Azure cloud platform went down for thousands of users.
Microsoft confirmed later in the month that distributed denial-of-service (DDoS) attacks were responsible.
Getting into the details, on June 5, an outage affected tens of thousands of Microsoft 365 users in the morning. The software giant said it halted an unspecified “update.”
“We’ve identified downstream impact for Microsoft Teams, SharePoint Online and OneDrive for Business,” said Microsoft in a tweet at around 11:45 a.m. EDT.
Microsoft said it prevented a “potentially problematic update” from propagating further across the service and was reviewing options to revert the change quickly in those portions of Microsoft’s infrastructure where it had been applied.
The next day, Microsoft saw a “recurrence” of the service issues. At 12:03 p.m. EDT, Microsoft said it had “identified that the impact has started again” and that it was applying further mitigation.
“Telemetry indicates a reduction in impact relative to earlier iterations due to previously applied mitigations,” said Microsoft.
At 11:22 a.m. EDT, 3,118 Downdetector users reported Microsoft 365 problems.
On June 8, a hacktivist group known as “Anonymous Sudan” claimed responsibility for causing a Microsoft OneDrive outage. At 3 p.m. EDT, Microsoft said it was “continuing to analyze monitoring telemetry and performing load-balancing processes to provide relief.”
A subsequent update to the status page that day indicated that the outage had only impacted access to OneDrive through a web browser. “Access to the OneDrive service using the desktop client, a synchronization client or Office clients are not impacted,” Microsoft said in the update.
The next day, June 9, Microsoft users experienced a major outage with the Azure cloud platform portal going down.
Shortly after 11 a.m. EDT, user reports of Azure availability issues began to climb on Downdetector, and the site logged thousands of reports over the next two hours. Microsoft appeared to have resolved the problem by that afternoon.
Anonymous Sudan claimed that it carried out DDoS attacks against the Azure portal.
On Monday, June 12, Microsoft said that a “spike in network traffic” had been identified as the likely cause of the outage.
“We identified a spike in network traffic which impacted the ability to manage traffic to these sites and resulted in the issues for customers to access these sites,” Microsoft said.
Issues With AWS In June
Amazon Web Services experienced an outage lasting a couple of hours in June, according to an incident report on the cloud giant’s website.
“Starting at 11:49 AM PDT on June 13th, 2023, customers experienced increased error rates and latencies for Lambda function invocations within the Northern Virginia (US-EAST-1) Region,” according to the report. “Some other AWS services – including Amazon STS, AWS Management Console, Amazon EKS, Amazon Connect, and Amazon EventBridge – also experienced increased error rates and latencies as a result of the degraded Lambda function invocations. Lambda function invocations began to return to normal levels at 1:45 PM PDT, and all affected services had fully recovered by 3:37 PM PDT.”
To prevent this event from happening again, AWS “immediately disabled the scaling activities for the Lambda Frontend fleet activities that triggered the event, while we worked to address the latent bug that caused the issue; this bug has since been resolved and deployed to all Regions,” according to the report.
“This event also uncovered a gap in our Lambda cellular architecture for the scaling of the Lambda Frontend, which allowed a latent bug to cause impact as the affected cell scaled,” according to the report. “Lambda has already completed several action items to address the immediate concern with cellular scaling and remains on track to complete a larger effort later this year to ensure that all cells are bounded to a well-tested size to avoid future unexpected scaling issues.”
Tens of thousands of users reported outages for Seattle-based AWS around noon PDT, June 13, according to Downdetector. The Vermont Department of Motor Vehicles, The Boston Globe and New York’s Metropolitan Transportation Authority were among the organizations to take to X – then Twitter – to report outages due to AWS.
July Sees Slack Outage
Salesforce-owned collaboration platform Slack experienced a systemwide issue on July 27 that lasted for about an hour, clearing up by 3 a.m. PDT.
The company said in an online post that “users were not able to send or receive messages across multiple platforms” during the outage.
“Our engineering team identified an issue after a change was made to a service that manages our internal system communication,” according to the post. “This resulted in degradation of Slack functionality until the change was reverted which resolved the issue for all users.”
The outage resulted in a post on a Slack-focused forum on Reddit, garnering more than 20 comments. Media outlets including The New York Times and The Verge reported on the outage.
July Error With IT Glue
In July, an IT Glue issue that lasted about an hour resulted in a “502 Bad Gateway” error and prompted almost 100 comments in a Reddit post in an MSP-focused forum.
IT Glue posted on July 18 at 11:54 a.m. PDT that performance issues “may prevent access to IT Glue for some of our partners.” The incident was resolved by 12:46 p.m. PDT.
“Should just start making posts titled: is ITGlue up?” one Reddit user joked.
September Problems With Microsoft Teams
Microsoft Teams experienced an issue that lasted more than two hours in mid-September.
Microsoft posted to X – then known as Twitter – at 7:10 a.m. PDT Sept. 13 to say that the tech giant was “investigating an incident affecting Microsoft Teams” and that “users may encounter delays or failures sending and receiving messages.”
The vendor “determined the issue is specific to some users served through affected infrastructure in North America” and routed “affected service traffic to healthy infrastructure to alleviate impact.”
Microsoft posted at 9:43 a.m. PDT to say, “We've confirmed that impact associated with this issue has been resolved.”
The issue led to more than 20 comments in a post on a system administrator-focused forum on Reddit.
A post by Cisco’s ThousandEyes network intelligence company said that “the application frontend was accessible but attempts to log into the system and/or interact with it resulted in 500 errors and timeouts.”
The company said that indicated “some form of backend system or distribution layer issue,” according to the posting.
September Outage For Salesforce
Salesforce sustained a service disruption on Sept. 20 that lasted about two hours for most of its products and services. The exceptions were MuleSoft and Tableau, which were knocked out for about four hours, according to a report from the vendor.
The vendor accidentally caused the outage with a policy change “made as a part of our standard operating procedure of ongoing reviews and updates to our security controls,” according to the company’s review.
“While it was designed to add defense in depth, it inadvertently blocked access to other legitimate and necessary resources beyond its intended scope,” according to the report. “The end result was a breakdown in communication between our services due to a lack of access permissions, causing failures within our systems. This restricted some of our customers from logging in and using the services.”
The vendor has altered its change review and approval process and fixed a startup race condition bug in Tableau to prevent the same issue from recurring. It also promised:
- “specialized automated deployment pipelines to enforce staggered policy deployments,”
- “additional monitoring and alerting capabilities to diagnose policy-related issues faster,”
- and “re-architecting MuleSoft CloudHub’s back-end component … to increase resiliency.”
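The idea behind staggered policy deployments can be sketched in a few lines of Python: apply a change to a small first wave, check health, and only then move on, rolling back everything if a wave turns unhealthy. This is a generic illustration, not Salesforce's pipeline; the function name, the wave structure and the `apply`, `healthy` and `rollback` callables are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def staggered_rollout(
    waves: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Apply a change wave by wave, stopping at the first unhealthy wave.

    Returns (succeeded, targets_the_change_was_applied_to). On failure,
    every target touched so far is rolled back in reverse order and later
    waves are never touched.
    """
    applied: List[str] = []
    for wave in waves:
        for target in wave:
            apply(target)
            applied.append(target)
        # Gate: the whole wave must be healthy before the next one starts.
        if not all(healthy(target) for target in wave):
            for target in reversed(applied):
                rollback(target)
            return False, applied
    return True, applied
```

The smaller the first wave, the smaller the blast radius if a policy blocks more than intended. The trade-off is a slower rollout.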
November Outages For Cloudflare, Workday Traced To Oregon Facility
Workday and Cloudflare attributed outages that started Nov. 2 to issues within a facility in Oregon, leading Cisco’s ThousandEyes to speculate that the two were affected by the same data center.
Cloudflare CEO Matthew Prince said he was “sorry and embarrassed” over a multi-day incident in early November and put some of the blame on a Flexential-run data center in Oregon, according to a post on the vendor’s website.
On Nov. 2, Cloudflare's customer-facing control plane interface and analytics services experienced an outage. The incident lasted until Nov. 4.
“We were able to restore most of our control plane at our disaster recovery facility as of November 2 at 17:57 UTC,” Prince said. “Many customers would not have experienced issues with most of our products after the disaster recovery facility came online. However, other services took longer to restore and customers that used them may have seen issues until we fully resolved the incident. Our raw log services were unavailable for most customers for the duration of the incident.”
Prince apologized because Cloudflare “believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically.”
“While many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable,” he said.
Among the changes Cloudflare promised to make are:
- “Remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network,”
- “Require all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested,”
- and “Thorough auditing of all core data centers and a plan to reaudit to ensure they comply with our standards.”
A Workday report on the incident said it lasted for three hours. It did not name Cloudflare or Flexential in the report, but it did blame “a power outage at our Portland, Oregon, data center that resulted in a service interruption for some customers.”
“Due to issues with backup power failures, as well as an unstable power environment resulting in additional challenges, service restoration has taken longer than is typical,” according to the vendor.
At one point, Downdetector logged more than 1,200 outage reports related to the Workday outage, according to KRON4.