Sophos CEO On How EDR Vendors, Microsoft Are ‘Rethinking’ Security After CrowdStrike Outage

In an interview with CRN, Sophos CEO Joe Levy discusses the future of the Windows kernel and endpoint security after attending the recent Microsoft-hosted summit of EDR vendors.

Microsoft continues to signal it has no intention of restricting Windows kernel access to endpoint security vendors in the wake of the massive CrowdStrike-caused outage in July, while the tech giant also appears open to finding less-severe ways for the kernel to respond to errors caused by security tool updates, Sophos CEO Joe Levy told CRN.

Levy was among the attendees of the Microsoft-hosted endpoint security summit that included executives from top vendors in the space, including CrowdStrike, on Sept. 10. The summit took place at Microsoft headquarters in Redmond, Wash., and was held in response to the globally disruptive Windows outage that began July 19 and lasted for several days afterwards.

In an interview with CRN, Levy described the summit as “very collaborative” — even with the competitive relationships between the 10 vendors in attendance — and suggested it should end up having a positive impact on how endpoint detection and response (EDR) tools interact with the Windows kernel going forward.

Following the July outage and discussions at the recent summit, “I think this is going to result in the endpoint community rethinking the amount of code and the complexity of the code that they introduce into their kernel drivers,” Levy said.

Kernel access has been pinpointed as a key factor that enabled CrowdStrike’s defective update to send 8.5 million Windows devices into a “blue screen of death” state, causing widespread disruptions to sectors including air travel, health care and banking. Estimates have suggested the costs to major corporations from the incident will reach into the billions of dollars.

Notably, the Microsoft executives in attendance at the Sept. 10 summit — including Microsoft President Brad Smith and David Weston, vice president of enterprise and OS security — set the tone for the discussion by defusing the tension over whether the tech giant could restrict kernel access to EDR vendors in response to the outage, according to Levy.

“They made it really clear that this was not going to be an overreaction to the incident, and that Microsoft and the Windows team were not going to take a punitive or an adversarial posture toward the endpoint security ecosystem as a result of this,” Levy told CRN. “But rather, what they wanted to do was just pause, take a breath, step back and ask the questions — what are some of the best practices that we're seeing in the industry?”

During the Sept. 10 summit, Microsoft executives also seemed “super receptive” to exploring whether there are alternative ways for the kernel to respond to adverse interactions with security tools besides a “blue screen of death,” Levy said.

Significantly, a number of developers from the Microsoft kernel team took part in the summit, he said. “And they were asking questions that were clear indications of their interest in [exploring], ‘How do we make this better?’”

“They were asking about use cases. They were asking us why we need certain kinds of capabilities and why we do certain kinds of things in our endpoint protection controls,” Levy said. “They were there to listen.”

In addition to Levy, a Sophos veteran who was named permanent CEO of the security vendor in May, executives from CrowdStrike, SentinelOne, Broadcom, ESET, Trellix and Trend Micro attended the Sept. 10 summit.

CRN has reached out to Microsoft for comment. In a blog post last week, Weston noted that Microsoft has faced calls to provide alternative ways for organizations to secure Windows devices “outside of kernel mode.”

Ultimately, following the summit discussion, “I’m hopeful that this begins to bring about the evolution of the safety protocols that the endpoint security ecosystem itself is deploying,” Levy said.

What follows is an edited and condensed portion of CRN’s interview with Levy.

What was the discussion like at the summit? And what do you see as the most likely outcomes from it?

It was a really important event. I’m very glad that Microsoft organized it and hosted us. We had about 10 vendors representing the Windows endpoint security ecosystem. CrowdStrike was present, along with Sophos and a number of others. And Microsoft started by setting the stage in what I thought was a very open and a very collaborative way. [Microsoft executives] Brad Smith, Dave Weston — they had really good executive representation there. And they made it really clear that this was not going to be an overreaction to the incident, and that Microsoft and the Windows team were not going to take a punitive or an adversarial posture toward the endpoint security ecosystem as a result of this.

But rather, what they wanted to do was just pause, take a breath, step back and ask the questions — what are some of the best practices that we're seeing in the industry? What are some advancements that we can make within the MVI program? That is the Microsoft Virus Initiative that is the basis for the integration of third parties to be able to provide endpoint security to the collection of Windows operating systems. And then, how can we evolve this program? It's not really practical to make a claim that you will never have this sort of an incident again. But what can we do to materially reduce the likelihood of its recurrence? And if it does occur again, what can we do to reduce the impact — so that we don't see the sort of global outage that we saw in this past incident?

What did Microsoft have to say in terms of what can be done to prevent a recurrence of this type of outage?

[Microsoft started off] with this immediate defusing of the tension, by confronting one of the key issues — which was, is Microsoft going to restrict access to the kernel? They spent what I thought was a really thoughtful and deliberate amount of time describing why kernel access is essential to being able to provide endpoint security. And Microsoft was good in the way that they [incorporated] Defender. They treated Defender as a guest in the room, the same way that they treated Sophos as a guest in the room. And they framed [the challenge] in the context of, how do we continue to confer the same sort of essential accesses — while improving the set of native interfaces and native benefits that the operating system itself can provide to all of you, including to Defender?

What did Sophos and the other participating vendors bring to the discussion?

In preparation for the summit itself, [Microsoft] asked that the participants prepare a statement of their current set of safe deployment practices. What do we do in the architecture of our software? What do we do in the design and the isolation and the segmentation and the determination of how much privilege we take, and how much work that we do in the kernel? [Microsoft] then factored this into this presentation back to us. And they said, “This is what we heard from you. We see that you're effectively doing these [certain] things as best practices, and that they've served you well.”

What were some of the key examples of these practices?

It could be the testing that we do internally. [Sophos] went to significant length to describe the way that we do internal testing, internal rollout of our updates — both our code updates and our content updates, first within our own employee population. We roll it more broadly through the employee population. Then we begin to segment it out to our customer population, where we have what I think is an essential control for safe deployment — where rather than just hitting the blast button, and it just goes out to everybody all at once, we can begin to mete it out to portions of the population. We're collecting telemetry as this is going on, and then if we see anything adverse happening within the population, we can immediately pause it so that it doesn't reach a broader group.

Microsoft reflected that back to us. Overall I was quite pleased with the fact that the vast majority of our submissions were then reflected back in the statements that Microsoft was making. It was a really good indication to us that we are indeed quite mature in our methodologies.

I will never make the claim that we won't have an incident of this sort. And I think every single vendor in the room has probably had a “blue screen of death” occur at one point or another in their history. But what I could say is that the summit just provided a really good forum for us to be able to share the best practices, and to do it in a very collaborative way — where I think it’s going to make the relationship that the vendor ecosystem has with Microsoft, and the Microsoft operating system itself, much more resilient and much more robust as a result of the effort.

So you would say that you believe it's going to have a tangible impact, around reducing the potential for another outage like this one?

I'm very optimistic that it will. I don't think that this crisis will go un-leveraged. And I think that we're already seeing it in some of the interactions that we're having in the industry — just the elevation of the discourse. And the fact that we are talking about things like separation of privilege and good architectural design — that just as a basic principle, every time you're interacting with the kernel, you're running the risk of potentially producing some sort of an adverse memory interaction that will then ultimately result in a blue screen. Because the kernel effectively has to protect itself.

[During the summit] we talked about things like, are there better kinds of controls that Microsoft could produce — within the kernel itself — to be able to effectively roll back these sorts of occurrences when they're observed? On the vendor side of things, we talked about better interfaces that Microsoft could potentially provide us in the kernel — so that we have to do less work in the kernel, because now the operating system is providing a native interface that we can all consistently interact with.

Those are just a couple of examples. But I think this is going to result in the endpoint community rethinking the amount of code and the complexity of the code that they introduce into their kernel drivers — and just rethink, “How can we minimize the surface of the code that we're putting in the kernel? And [how can we] do more of the logic that the endpoint depends on — in order to provide its protection and its detection and response capabilities — in user space?” [So that] if something goes wrong, at least it's happening in user space and not in kernel space.

I think we all came out of the room thinking about how to make these investments into our architectures, whether it's our endpoint protection software or the kernel itself. [We want to] continue to provide the same defensive and remedial security benefits to our customers, while minimizing the likelihood of the disruption to that very essential component of cybersecurity — [which is] availability.

What was Microsoft’s response to the idea of providing alternative ways for the kernel to respond to errors, in the ways that were suggested? Did they seem to think the idea was reasonable?

They were super receptive to this. In fact, they brought a number of their developers from the kernel team, and they were sitting in the room, and they were participating in the conversation. It was a good dialogue. And they were asking questions that were clear indications of their interest in [exploring], “How do we make this better?”

They were asking about use cases. They were asking us why we need certain kinds of capabilities and why we do certain kinds of things in our endpoint protection controls. They were there to listen. And I'm hopeful that this begins to bring about the evolution of the safety protocols that the endpoint security ecosystem itself is deploying. I would love to see a consistency in the set of practices that the industry uses. And I think the evolution of the MVI program is going to help to drive that. And similarly, I think Microsoft is taking a very serious look — under the leadership of Dave Weston, who I think is just absolutely legendary — and a number of other senior people in the room from Microsoft, they were just profoundly interested in figuring out how they can make the Windows operating system better at this.

Do you feel like some vendors have been doing too much at the kernel level?

What I would say is, we should do as much as we need to, and no more. Some vendors are better at that than others. We really do practice the minimization of the work that we do in-kernel [at Sophos]. We try to get in and out as quickly as possible, and we try to do as much of the more complicated logic in user space as we can. Over the years, we've very intentionally architected our endpoint to be able to practice that. The set of interfaces that Microsoft could provide has the potential to minimize the amount of work that we do in the kernel. So I'm optimistic for where this conversation is going to lead.

The things to keep in mind here are that No. 1, we're facing adversaries who are attempting to evade or evict us. Evade us means, they want to bypass the security controls. Evict means, they want to turn us off, or uninstall us, or defeat our ability to do the monitoring and process control that we need to do in order to stop malware from running, in order to stop ransomware from executing. So we have to operate at the kernel level in order to defend ourselves against evasion or eviction.

The other thing to keep in mind is, let's say we arrive at an architecture where the kernel itself provides a set of interfaces that we then need to interact with, and we're doing this in some asynchronous fashion. Now you have to think about what kind of an impact that is going to have on the performance of the system.

You never want to be aware of the fact that you've got a security control running on your endpoint. You want that to be invisible. And that's why it's so essential that if we do go down the path of creating native operating system interfaces, that they operate at least at the speed that we do today. We don't want to degrade the user experience in any way. We want to continue to elevate the kinds of security apparatus that we're delivering — but we want to be able to do it with a minimization of the impact on resources.

You’ve mentioned in a previous statement that there’s a “risk of monocultures” that is also a factor here. Could you say a bit more on that? What are you seeing there?

[The idea is] don't put all of your eggs in one basket. This is something that we all get intuitively. But right now, the risk of monocultures is that if an organization is attempting to build a resilient IT operation — and they're relying entirely on one architecture, one vendor, and they don't have any kind of diversity or heterogeneity in their environment — if something catastrophic were to happen to that one basket, they basically lack a rapid path to recovery. So a resilient operation is an operation that by design, is going to have some heterogeneity and some diversity built into it.

We get the concept when we talk about things like multi-cloud, for example. The reason why we talk about multi-cloud is because, again, we don't want to have these single points of failure, and the cloud itself is resilient by design. But even still, sometimes entire zones go out, sometimes entire regions go out. Sometimes the CSPs themselves experience an outage. So we think of multi-cloud in terms of having some of my workload in GCP, Azure, AWS. We get that. We have a relatively easy ability to do it when it comes to cloud service providers. We don't have an easy ability to do it when it comes to endpoint security. So how can we start thinking about ways where it's simpler for customers to be able to introduce that kind of diversity within their endpoint protection stack? And it's not strictly endpoint, but anything where there is a potential single point of failure — where you don't want that failure to pervade across the entirety of your operating environment.

Overall, do you feel like this incident is going to lead to major changes in the way security is done?

I think it did force all of the vendors who were present — and probably the entire vendor ecosystem — to rethink resiliency, to rethink safe deployment practices, to ensure that they're doing this in a way that they feel good describing and disclosing to their customers. I think it's going to change the narrative when customers are making vendor selections at this point. We should expect that probably all RFPs are going to include some questions that address the nature of this event. And I encourage that. I think customers should be asking their vendors and should be holding them to account on this, and they should be making selections based on the maturity that a vendor is able to describe and the transparency with which they operate.

It must have been an unusual experience to have so many vendors in the endpoint security industry in one room like that.

It was very cool. It was a very good, collaborative environment. I thank Microsoft for putting something like that together and just being able to create such a healthy and such a collaborative environment for us to get together and talk about the topic.