Analysis: Was A Global IT Outage Inevitable?
Unless there’s a concerted effort by a lot more tech industry vendors than just CrowdStrike, it’s unlikely to be the last incident of this kind.
Within the software industry, it has long been a cliché to point out the inevitability of bugs.
As the thinking goes, the world of software is insanely complex, and the potential impact of fallible humans is an ever-present risk. Whether we like it or not, just about everyone seems to have accepted at this point that “bugs happen.”
However, just under a week ago, an especially nasty bug happened, and the world woke up to the worst IT outage of all time. The sad-face emoticon on the Windows blue screen of death (BSOD) was never so infuriating as it was on July 19, 2024.
The whole thing can be traced back to a bug in CrowdStrike’s validation system, which didn’t catch a defect in a file before it was deployed to nearly 9 million Windows devices, as the cybersecurity vendor disclosed Wednesday. (I’ve reached out to CrowdStrike for further comment.)
To its credit, CrowdStrike’s leadership has taken full responsibility for this failure. Despite their history of frequently criticizing Microsoft, CrowdStrike executives have not said a peep over the past week about Microsoft bearing any responsibility for the Windows outage.
In other words, this was a bug, plain and simple.
Of course, it boggles the mind how a seemingly tiny mistake could wreak the havoc we’ve seen over the past six days: Potentially hundreds of thousands of passengers marooned away from home. Surgeries and other medical visits postponed. Untold numbers of businesses hit with sudden downtime. It was a tech apocalypse if we’ve ever seen one.
And yet, while CrowdStrike certainly is in no position to entertain this question right now, I think it’s one that is worth asking: Was something like this inevitable?
Given what we know about the factors mentioned above (software complexity, human fallibility), was it always just a matter of time before we’d be hit with a global BSOD event like this one? Could it be that we haven’t seen anything like this before out of sheer luck?
Constant Updates
It probably shouldn’t be a surprise that it was a cybersecurity vendor that caused the first outage of this magnitude.
Cybersecurity vendors are pushing threat-related updates constantly as part of keeping up with hackers who are always shifting their tactics.
For customers, it’s not desirable or perhaps even feasible to extensively test such updates themselves. In part that’s because, as a number of CrowdStrike partners have told me this week, your IT environment will be vulnerable for however long it takes to test everything.
In addition, this sort of thing has probably happened before on a smaller scale, but hardly anyone noticed. The reason for that is simple: CrowdStrike is the only endpoint security vendor, apart from Microsoft itself, that has this type of global scale, i.e., the type of reach needed to instigate a truly worldwide outage of Windows devices.
Long story short, I believe this all boils down to basic arithmetic. Global scale + frequent updates + inevitable bugs = inevitability of a worldwide IT meltdown.
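To put rough numbers on that arithmetic, here is a minimal sketch in Python. It assumes every update carries the same small, independent chance of shipping with an outage-causing defect, which is a simplification. The function name and the inputs are hypothetical illustrations, not CrowdStrike figures.

```python
def chance_of_at_least_one_bad_update(per_update_failure_rate: float,
                                      updates_shipped: int) -> float:
    """Probability that at least one of `updates_shipped` updates is defective.

    Assumes each update fails independently with the same probability,
    so the chance that all of them are clean is (1 - p) ** n.
    """
    return 1.0 - (1.0 - per_update_failure_rate) ** updates_shipped


if __name__ == "__main__":
    # Hypothetical inputs: a 1-in-10,000 chance of a bad update,
    # with 10 updates pushed per day.
    failure_rate = 1e-4
    updates_per_day = 10

    for years in (1, 5, 10):
        n = updates_per_day * 365 * years
        p = chance_of_at_least_one_bad_update(failure_rate, n)
        print(f"{years:>2} year(s), {n:>6} updates: {p:.1%} chance of at least one bad update")
```

Under these assumptions, frequent updates push the odds of at least one bad release toward certainty over time. Global scale does not change those odds; it determines how many machines go down when the bad release arrives.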
CrowdStrike is pledging to do all it can to prevent this sort of thing from happening again. That’s only to be expected, and I have little doubt they’ll succeed. But are other vendors planning to do the same?
This was the first global IT outage, but unless there’s a concerted effort by a lot more vendors than just CrowdStrike, it’s unlikely to be the last.