SentinelOne CEO On CrowdStrike Outage: ‘Not Just An Honest Mistake’
In an interview with CRN, SentinelOne CEO Tomer Weingarten says the massive outage was the result of a ‘very risky architecture.’
The massive Microsoft Windows outage set off by a CrowdStrike update on July 19 was the consequence of “bad design decisions” by the security vendor that allowed a defective file to hobble servers and PCs worldwide, according to SentinelOne Co-founder and CEO Tomer Weingarten.
In an interview with CRN, Weingarten, whose company is a top rival of CrowdStrike, sharply criticized the process that allowed the outage to occur and cause widely felt societal disruptions.
Specifically, Weingarten questioned the practices around CrowdStrike’s interactions with the core control center of the Windows operating system, known as the Windows kernel. After examining how CrowdStrike has handled updating the Windows kernel, “you can’t avoid arriving at this conclusion—that it’s just a very risky architecture,” he said.
The CrowdStrike update set off a “blue screen of death” scenario on 8.5 million devices worldwide, with massive impacts on air travel, health care and business. Experts have called it the largest IT outage of all time, and one estimate suggested the incident will cost U.S. Fortune 500 companies $5.4 billion in total direct financial loss.
The disruptions dragged on for much of the following week in part because of the need for IT teams to manually fix many of the affected Windows devices. CrowdStrike disclosed that a bug in its validation process for security configuration updates to its Falcon platform resulted in the outage.
During CRN’s interview with Weingarten, which took place July 25, the SentinelOne CEO said he’s convinced this was “not just an honest mistake.”
“It’s a result of how the architecture was used, or maybe even abused, I would say,” he said.
In a statement provided to CRN, CrowdStrike responded to Weingarten’s comments and disputed his characterization of the cause of the outage.
The update that was deployed on July 19 was a “rapid response content update,” and “these updates don’t execute code in the kernel,” CrowdStrike said.
Additionally, “Microsoft does have a clear kernel review process, and that process was followed,” CrowdStrike said in the statement provided to CRN.
CrowdStrike CEO George Kurtz previously disclosed that 97 percent of Windows sensors for Falcon were online as of Thursday.
“I am deeply sorry for the disruption this outage has caused and personally apologize to everyone impacted,” Kurtz wrote in a LinkedIn post Thursday. “While I can’t promise perfection, I can promise a response that is focused, effective, and with a sense of urgency.”
In a post Friday, Microsoft executive John Cable wrote that the outage “shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience.” Cable also touched on the role of third-party access to the Windows kernel, indicating that the tech giant is now looking to “encourage development practices that do not rely on kernel access.”
CRN has reached out to Microsoft for further comment.
In the interview with CRN, Weingarten spoke about what SentinelOne has been discussing with customers following the outage, the questions around Windows kernel access for security vendors and how the incident may change the cybersecurity industry.
What follows is an edited and condensed portion of CRN’s interview with Weingarten.
Where have you and SentinelOne been focusing in response to the outage?
I've been talking to so many customers in the past week. This is not just about an outage, or a faulty update, or the technical aspect of what happened. It’s more broadly about trust and credibility. Lots of customers are furious about this. I think a lot of them are also feeling deceived. The more you dig into the architecture, and understand and unravel what happened, you can't avoid arriving at this conclusion—that it’s just a very risky architecture. For many customers, it created a hole in their change management processes. Nobody expects that you’ll have something on your devices that constantly introduces new code, introduces new content into that device. In many cases, customers didn't know about that. They didn't know about that cadence. They didn't know about the frequency of updates.
In your view, how unusual is it for security vendors to have access to the Windows kernel?
Kernel-based protection is nothing new. But [the problem is] the pervasiveness of code that has been put in the kernel [by CrowdStrike], which is totally against best practices. As someone that has been doing this for now 10 years, it's very clear that you want to minimize the amount of code you put into the kernel. This is the most sensitive part of the operating system. And that’s also what the operating system vendor will tell you — to the point that typically, when you put code into the kernel, you need to have it tested and reviewed by the operating system vendor. Every change has to go through the review process. What you're seeing here is really a complete bypass of [that process]. So it creates this cloud-to-kernel connection, which is just very, very dangerous. It's unheard of. It upends the operating system processes for review and for protection.
So you don’t buy into the notion that due to the complexity of IT, it was just a matter of time before something like this happened?
It's not something that can happen to anybody. It's just not the case. This is a product of bad design, bad design decisions. And you’ve seen folks like Elon Musk [who] understand that and they're immediately removing that platform from all of [their] devices. This is not just an honest mistake. It's a result of how the architecture was used — or maybe even abused, I would say.
To me, that’s one of the bad things about cybersecurity — it’s so complex that many people don't fully understand what's happening. This is not force majeure. This is bad architecture.
Wouldn’t Microsoft have been aware of this? If this was bad architecture, why do you think Microsoft wouldn’t have put a stop to this sooner?
Microsoft has very clear kernel attestation rules and [a clear] kernel review process. It seems like they were bypassed here. Downloading new content directly to the kernel — as far as I know, that infringes the policy of how you attest kernel drivers. The whole purpose of attestation is that you submit [the changes] and your code gets reviewed. And that is what's approved to go into the kernel. If now you have this side-channel that can update what's happening in the kernel without that process — I don't know, it just seems to me like there was something here that just shouldn't have happened.
What is the normal arrangement for security vendors, in terms of how much access to the kernel they’re granted?
Normal is that you build it, you sign it, you send it to attestation. You don't update it frequently. And there's not a lot of code there. You want as little code as possible as part of your kernel driver. And that is what you find with pretty much every other security vendor. Some have a little bit more bloat in the kernel, some less. I think the more lightweight you are, the better — to the point that you might not want to have a kernel driver, as an example. On the Mac operating system, macOS, we don't use a driver. We had a full blog post a couple of years ago on how we're moving away from the kernel, and just working in user space. If you can do that, if the operating system gives you enough tools to do that, you have no reason to be in the kernel. Our entire cloud security suite does not touch the kernel. In Linux, we don't touch the kernel. So that, to me, is just a design decision. And I think it's not something that was shared here [by CrowdStrike]. I think this was very kernel-heavy. But updating directly to the kernel, with no re-attestation or re-review — that to me is just appalling.
So just to be clear, when SentinelOne updates anything that affects the kernel, you’re going through that review process you just described?
Every vendor that I know of does [that]. I haven't seen anybody update the kernel in such a way [as CrowdStrike did] — and definitely not in a way that doesn't go through customer approval, as well. I mean, the other part here is, it’s not just Microsoft in the process. It’s customers. You're [supposed to be] giving customers the ability to control what's deployed, when it’s deployed, what version is deployed, rollback capabilities, gradual rollout, phased deployments. All of those are kind of table stakes. So to see that with one push [of a] button, this gets immediately sent globally, causing the biggest IT security outage in history — it's not just a mistake. It’s just architecture. And it’s an architecture that's going to take time to change. Right now, I think customers are weighing that risk.
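For readers unfamiliar with the controls Weingarten lists here (phased deployment, gradual rollout, rollback), the following is a minimal Python sketch of the general idea. It is illustrative only and is not how CrowdStrike, SentinelOne or any other vendor actually ships updates. The function name `phased_rollout`, the stage names, the host names and the health check are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage plan: (stage name, hosts in that stage).
Stage = Tuple[str, List[str]]


def phased_rollout(
    stages: List[Stage],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Dict[str, object]:
    """Deploy one stage at a time; if any host in a stage is unhealthy,
    roll back every host updated so far and stop before later stages."""
    updated: List[str] = []
    for name, hosts in stages:
        for host in hosts:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(h) for h in hosts):
            for h in reversed(updated):
                rollback(h)
            return {"completed": False, "failed_stage": name,
                    "rolled_back": list(updated)}
    return {"completed": True, "failed_stage": None, "rolled_back": []}


# Toy fleet: every host starts on version "v1" (all names are made up).
versions = {h: "v1" for h in ["c1", "e1", "e2", "b1", "b2", "b3"]}
stages = [
    ("canary", ["c1"]),
    ("early", ["e1", "e2"]),
    ("broad", ["b1", "b2", "b3"]),
]


def deploy(host: str) -> None:
    versions[host] = "v2"


def rollback(host: str) -> None:
    versions[host] = "v1"


def is_healthy(host: str) -> bool:
    # Assumption for the demo: host "e2" crashes on the new version.
    return host != "e2"


result = phased_rollout(stages, deploy, is_healthy, rollback)
print(result["failed_stage"])  # prints "early"
```

In this toy run the failure surfaces in the second stage, so the three hosts in the “broad” stage never receive the update and the three that did are returned to the prior version.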
So you don’t see any security purpose for what CrowdStrike was doing here? You don’t see any justification for doing this?
I don't see a security reason. I think it's completely the opposite of that. Why would you put all that code in the kernel? That's just more attack surface. It doesn't have to be there. If it can be done in user space, I wouldn’t put all these different bits of your platform, communication stacks, all of that, into something as delicate as the kernel.
I think you can [offer] incredible protection without stuffing all your code into the kernel. I just don't see that as something that gives you a better protection capability.
Going back to your conversations with customers, what other things are you hearing from them about this incident?
What I would say is, customers are looking to build some redundancy and figure out how they become more resilient. We're not coming and saying, “Hey, you need to throw all this stuff out the door and there needs to be a complete overhaul.” But you can inject the Singularity platform, and you can start getting another vantage point into your environment. … We're going to help customers in any way that they choose. We're not there to tell them what to do. But we're there to just show them that there could be a better way, there could be a less-risky way. And again, the focus is on just building resilience as much as possible.
How do you think this outage is going to change the security industry?
I hope it's going to change the industry. I see some of that change happening. I've been calling for a long time for this industry to mature, to grow up, to look beyond the marketing claims, to test the products, to verify what's actually happening — to not lean on all these vacant promises that have been made by multiple vendors for many years.
I think it's really time to wake up and ask some questions — [such as] what do you actually need to have in your environment? And how much do you know about the technology that you're putting out there? That's how we avoid the next interruption, the next problem. We just need to be more attuned to the impact crater that technology can create and know what we deploy and know what's happening in our environment. It's tough. It's very, very technical. It will require some bandwidth.
I think partners have a deep role in this. They need to be these champions for the customers, and just make sure they're making informed decisions.
Are there any other factors here that you think are worth re-examining in this situation?
In the past couple of years, there was this vendor push towards consolidation, which also consolidates a lot of risk in one place. And I actually said on our last earnings call, I called that out, and I said, look, it's only in the benefit of the vendor to tell you to buy everything from a single platform. It's never advisable to have all your eggs in one basket in cybersecurity. It's just not the proper risk mitigation strategy.
So, as you kind of see what's happening right now, I think a lot of customers are just pausing and rethinking their security decisions. I see the same with partners. There's definitely a pause on the expansion. Folks are trying to discern, what is my best strategy to reduce risk? How do I reduce single-vendor risk in my environment? I think it's becoming more sane now, actually. Because moving everything to one single point of failure platform architecture, I just can't understand it.
I wouldn't even advise my customers to put everything on my platform. It’s a choice. You should choose that only if you feel like that is building more resilience in your network. Supporting open architectures that can bring together different products is the right way. And being transparent about architecture is the right way. Just trying to convince everybody that they should just go on your platform, that's the best thing in the world, which serves only you as a vendor, it’s just not the right way to do it.