Matthias
Welcome to the KuppingerCole Analyst Chat. I'm your host. My name is Matthias Reinwarth. I'm an Analyst and Advisor with KuppingerCole Analysts, and I have guests today. We've come together to discuss a topic that made the news quite heavily in the last weeks, and therefore I've invited several of our KuppingerCole Analysts not only to give information, but to look at what we as analysts can provide as guidance in the aftermath of what happened. And for that I want to welcome four people, starting with Martin Kuppinger, one of our Analysts and Advisors and of course our Principal Analyst. Hi Martin, good to see you. I want to welcome John Tolbert. He is an Analyst hailing from Seattle, so this was a difficult call to coordinate, but we made it. We have John Tolbert. Hi, John. Not so far away, but still some logistics had to happen: Alexei Balaganski, hailing from Dusseldorf. Hi Alexei, good to have you.
Alexei
Hello, thanks for having me.
Matthias
And finally, moving over the Channel to Stockport, we have Mike Small. Hi Mike, good to have you.
Mike
Hello there. Good afternoon.
Matthias
Hi Mike, and hi to all of you. We want to start, of course, by talking about the incident and what happened in the last weeks. And maybe, John, can you give us a bit of an insight into what the recent CrowdStrike incident was and how it impacted global players around the world?
John
Sure. Well, let's take a look at what CrowdStrike themselves published on their website. They have provided some good information, pretty timely. They say that on Friday, July 19th, at 4:09 UTC, as part of regular operations, they released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. These updates are a regular part of the dynamic protection mechanisms of the Falcon platform. The problematic rapid response content configuration update resulted in a Windows system crash. Systems in scope included Windows hosts running sensor version 7.11 and above that were online between 4:09 and 5:27 UTC on that Friday. Mac and Linux hosts were not affected. The defect in the content update was reverted by 5:27 UTC that Friday. Systems that came online after that time or did not connect during that window were not impacted. What we learned later, well, in the immediate aftermath, is that about 8.5 million machines went into a reboot loop and required pretty significant manual intervention to bring them back online. So that's what they say was the cause, and obviously, as you said, it made the news. Many organizations around the world were affected. Some were affected for four, five days afterward trying to get things back up and running. So it was an extremely serious and significant cybersecurity incident. Although they say it wasn't caused by a cyber attack, the fact that it took so many machines offline made it into a cybersecurity incident.
Matthias
Right, and to get the facts and figures right, Alexei texted me just a few minutes before this recording. These 8.5 million might not be accurate, right Alexei?
Alexei
Apparently, what Microsoft did initially was to count the number of crash reports submitted to their telemetry hubs. Obviously, not every computer that crashed was even able to submit that report, so the real number is probably substantially higher. The problem is we will never know how high it was. Obviously, any environment which was air-gapped or isolated from the public Internet, or in whatever way configured not to send that information back home to Microsoft, was not counted. So yes, we are dealing with a probably much higher number of affected systems.
Matthias
Right. And when we just look at the facts, what has happened and what obviously could be the reasons for that, there has been some quite extensive analysis, of course, and we need to build on top of that. But the question really is: how can that be possible? How can such an event happen in the first place? And of course, then the message would be: what can we tell our audience, our customers, vendors, end user organizations, to do to be in a better situation the next time this happens? But maybe first of all: what happened? Mike, from your learnings, from what you've heard — what actually happened, in your opinion, and what could have been done better?
Mike
Well, from the perspective of what I have read and the information that has been published, it appears that the way in which CrowdStrike Falcon works is that they have a piece of software called a driver that runs in kernel mode, whose purpose in life is to ingest data that comes from CrowdStrike and do something with it. And in effect, this piece of software was provided with some data which it could not digest. And the result of that was that the software itself failed, which, as an ex-programmer and ex-person in charge of a lot of programmers, I find in a way unforgivable. One of the first rules of good programming is that you always check the data that you have been given to make sure that it is not corrupt. And to fail catastrophically because of that, when you have written a piece of software which has the highest possible privilege and therefore the ability to cause the system to crash completely, seems to me to be very, very bad practice. So that is...
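To illustrate Mike's point about validating data before using it, here is a minimal, purely illustrative sketch in Python (standing in for what would in reality be kernel-mode C). The content format — magic header, CRC32, payload — is invented for the example and is not CrowdStrike's actual format; the key idea is that a corrupt blob is rejected and skipped rather than allowed to crash anything.

```python
# Illustrative only: a fail-safe loader for an update "content file", assuming a
# hypothetical format (4-byte magic header + 4-byte CRC32 + payload).
import struct
import zlib

MAGIC = b"CNT1"  # hypothetical magic bytes identifying a valid content file

def load_content(raw: bytes):
    """Return parsed rules, or None if the blob is corrupt. Never raise."""
    try:
        if len(raw) < 12 or raw[:4] != MAGIC:
            return None                      # wrong size or header: skip the update
        (expected_crc,) = struct.unpack_from("<I", raw, 4)
        payload = raw[8:]
        if zlib.crc32(payload) != expected_crc:
            return None                      # corrupt payload (e.g. all zeros): skip
        return parse_rules(payload)          # hypothetical parser for detection rules
    except Exception:
        return None                          # any parse error degrades gracefully

def parse_rules(payload: bytes):
    # Placeholder: a real driver would decode rule structures here.
    return payload.split(b"\n")

rules = load_content(b"\x00" * 40)           # a degenerate, all-zeros file
if rules is None:
    print("Content rejected; keeping the previous rule set and reporting telemetry.")
```

The design choice being sketched is simply "fail closed on the update, fail open for the system": a rejected content file means the driver keeps running with its previous rules instead of taking the host down.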
Matthias
You agree, Martin?
Martin
You know, about what happened, probably everything has been written and said, et cetera, when we look at the internet, and also, I think, in a very short form here. And I'm fully with Mike: I think we need to have good checks of that in the software. But I think we also need to think about what else could have been done. And the thing that came to my mind is really, you know, we're gathering so much information right now about events on decentralized systems. So there's a lot of data we have, and we have the ability to consume that data very quickly. And clearly this brings up the question of what we can do with automated analytical systems. In the broader sense, it probably would be something we nowadays call an XDR system, even while it's a bit different in this case — the recognition that something strange is going on should have been very, very fast. And the question is, why not build into such solutions, where we have these mass updates that can impact a lot of systems, something where the individual client, before activating the update, just sort of asks a security question, so to speak: what is the risk score? We do this automated risk scoring in many areas. It would be sort of an automated kill switch in that sense, where we monitor what is happening, and once a certain threshold is passed, there's a stop or delay and an analysis. Something, by the way, which I have also seen in IGA systems, for instance, and IAM solutions, which start having thresholds — if there's a mass update, for instance, to an Active Directory or to an SAP system, and it is unusually high, if there's something very unusual, it stops and says to the people: hey, let's double check, is there something wrong? And if you just have all the clients asking once before and saying, okay, can we do it, or has a threshold been passed — which could even be configurable to a certain degree — we probably can deal better with such scenarios, because errors can happen. There are certain things that must not happen, like a lack of testing, but at the end of the day, errors can happen. And we need to think beyond the usual measures that everyone takes, to what else we can do.
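A minimal sketch of the "ask before you activate" idea Martin describes: before applying a freshly downloaded update, the agent queries a health endpoint for the crash rate reported so far for that update version and holds back if a configurable threshold is exceeded. The endpoint URL, the `crash_rate` field, and the update ID are all invented for illustration — this is not any vendor's real API.

```python
# Illustrative client-side threshold check before activating a mass update.
import json
import urllib.request

CRASH_RATE_THRESHOLD = 0.001  # e.g. hold back if >0.1% of early adopters reported crashes

def update_health(update_id: str) -> float:
    """Fetch the observed crash rate for an update from a hypothetical telemetry API."""
    url = f"https://updates.example.com/health/{update_id}"  # placeholder URL
    with urllib.request.urlopen(url, timeout=5) as resp:
        return float(json.load(resp)["crash_rate"])

def should_activate(update_id: str) -> bool:
    try:
        rate = update_health(update_id)
    except Exception:
        return False          # if in doubt, delay activation rather than risk the host
    return rate <= CRASH_RATE_THRESHOLD

if should_activate("channel-291"):
    print("Activating update")            # apply_update(...) in a real agent
else:
    print("Holding update back for analysis")
</code>
```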
Mike
Perhaps it should have been phased with feedback.
Alexei
If you will excuse me for sounding a little bit too blunt, one thing we should not attempt to do now or later is to try to reinvent the wheel. All those issues are not new by any means — I mean, they have happened decades ago, they have happened earlier this year, and they will happen again. That's a fact of life, if you will. We know how to solve that. There are multiple processes and algorithms and coding techniques and testing standards that have been developed. The only issue is how do you force everyone in that software supply chain, if you will, to follow those practices. And this is unfortunately not an entirely technical issue. Basically, you need a combination of a carrot and a stick. And only in that case would it actually be efficient in the long term.
Martin
Yeah, clearly. I think there's a need for regulation, and I think also, even on the technical side, we need to think beyond what we usually do. So coding practice, all that stuff is fine. But what else can we do in the case that something goes wrong? There always will be mistakes made, and we will potentially become better in spotting them and avoiding them. But as we all know, there's no 100% security and there's no 100% perfect coding, because the cost of doing that would be infinite. So at the end of the day, it's also about thinking about resiliency. And resiliency means: what can we do in the process, like adding, for instance, the additional checks I mentioned, but also, what can we do to bring systems back faster? And maybe also: what can we do to avoid kernel drivers for third parties — or kernel drivers that are not an integral part of the operating system, I think that would be the correct term — at all?
Mike
Well, you've just mentioned resiliency, and the strongest control to give you resiliency is diversity. And if you want to make software that is suitable where there is a danger to life and people, good practice says you have three different software modules, each using different algorithms and each provided by a different supplier. And you then vote on the results. The problem here is that we don't have enough diversity, because Microsoft has effectively got the major share of all of the desktop systems, and the software in there is all from one source.
Martin
Yeah, but only a small fraction of Windows systems was affected, because CrowdStrike only serves a fraction of the market. In that sense, we have diversity. The question is whether we would not just have more but smaller problems if we go to more diversity, in the sense that, for instance, we use different EPDR vendors in that area. So I think this is the other side of the coin. It's the same discussion that came up again about: oh, shouldn't we better test the patches before we roll them out? It's the balance between the risk of cyber attacks versus the risk of updates. When I go back 20 years, updates were quite a significant risk, but there were very, very few, if any, zero-day attacks. Nowadays, it's totally different. So it's always a risk balance, and I think there's really not a simple answer. Diversity may help, but that would require that you in that case have, whatever, three different client operating systems for doing the same task. And ideally then, for each, three different EPDR solutions — so we would already be at nine for only these two components, and it would multiply. And that's true, ...
Mike
Well, what actually matters is really at the business level — we're talking at the technology level. And the problem is that if a business like a hospital has become totally dependent upon one supplier or one piece of software that is spread across the board, then that isn't a very good, resilient solution. And all of this comes down to a very, very small part of one operating system which was corrupted by, in a sense, the ability of a third party to insert some rather poor code into it. And so the whole of this edifice — this business, this social edifice — is built upon and dependent upon this small piece of code. And that's the problem.
Matthias
Right, John, you wanted to say something. I think now it's your chance.
John
Yeah, well, let me back up and go back to something Mike said a few minutes ago around kernel mode drivers. So, first of all, CrowdStrike, of course, is an EPDR product — endpoint protection, detection and response. It implements a kernel mode driver because that's really the only way it can get access at the lowest levels: to read and write on the disk and at the memory level, to prevent malware from taking over, and to understand what's going through the processor to be able to do behavioral analysis. So that's kind of the way these programs are designed to work for the Windows platform. This didn't affect other platforms because they're designed differently in some cases. And there was a mention of regulation. Well, it's been pointed out that Microsoft had intended to take a more closed approach like, let's say, Mac OS does, but European regulators prevented that in the spirit of trying to promote competition. So I think, you know, how you handle regulation in a situation like this is an important question, because we see that it has, you know, consequences years later in how the program actually operates.
Martin
But I think this is a very interesting point, this claim that because of the regulators, they had to allow, so to speak, everyone to write kernel drivers. If I'm right — and I'm not super deep into that — basically, the point was about sort of a fair chance for everyone in comparison to what Microsoft themselves can do. So the other resolution in that logic — and I may be wrong on that — would have been that Microsoft says: no one, including Microsoft, except the operating system team itself, provides kernel mode drivers. And Microsoft provides the kernel mode drivers that are needed, for instance, for every type of EPDR solution, with the appropriate set of APIs. But kernel mode is really only used by the operating system team. That would again have meant fair chances for everyone, if you do it in that manner. I'm not sure whether this would have been the alternative answer to what the regulators have been requesting, but it's clearly a legal issue. It's something from the past, but I think it's very worth considering — I think also for a company such as Microsoft — what to do here, and aren't there better solutions than kernel mode drivers that can be provided by "everyone"?
Mike
Well, it comes back to this question of diversity, because apparently Microsoft had proposed a set of APIs to do what you were describing, Martin, but the regulators felt that it was going to lock people into a Microsoft platform in some way, or was going to be too difficult for small suppliers. So this is the challenge. But in a sense, if you are the provider of this piece of software upon which hospitals and society as a whole depend, and that can all be brought down by one single error causing all of this chaos, then you have a very high degree of responsibility to make sure that what you provide is really bulletproof. That would be my point.
Alexei
Well, if I may look at this whole discussion from a slightly different angle — and it's absolutely not a new one. I mean, isn't it actually the same situation that we had with cloud service providers like a decade ago? This whole discussion of who is responsible for what, which ended up with this, well, basically de facto established shared responsibility model. Like, yeah, a cloud platform is also a major part of your critical infrastructure — if AWS or Azure or Google Cloud just kind of goes down, it would affect millions of people around the world too, just like Windows. ...
Mike
Yeah, and AWS and the other cloud providers all have a very clearly defined process for deploying change, and that is that they do not deploy it everywhere at once. They deploy it in a phased manner to avoid that problem.
Alexei
Exactly. That is, they have those processes figured out, codified, probably run through all the possible legal checks, and they have had them in place for years. So basically, the only thing we actually need to change with CrowdStrike is to say: okay, from now on, you have to do the same, legally, because there is no other way. Or maybe we can try to establish some kind of an independent and neutral regulatory framework, like we have for cloud services, with independent certifications and stuff. But again, as long as you do not have a very tangible opportunity to go to jail if your product or your service has caused such a major disruption for millions of people around the world, nothing will change. We know that from centuries of human history. And there is really nothing secret here, no hidden technical trick. It is just, kind of, you have to define the processes and you have to make everyone equally responsible for following them.
Martin
And I think an interesting point is also what can be done now. If we think about a different model for building applications that require kernel mode access on Windows, then we are probably talking about a decade, at least many, many years. The advice you frequently hear is: just use something other than CrowdStrike. You know, I remember back in 2011, when the RSA SecurID incident happened, a lot of their competitors said: just replace RSA with our solution. The first problem was that most of the RSA SecurID implementations were baked deep into, for instance, e-banking services. So you couldn't switch them from one day to the next; it was a big effort. And it doesn't mean that the next solution then really works perfectly well. It might not be the same mistake, it might be something else that goes wrong. So that's also not the solution we need. Diversifying across multiple operating systems — we may be able to afford that, and be capable of doing so with quite some time and effort, for very selected areas where we are willing to take the investment needed, et cetera. But it's also nothing we can do overnight. Regulation unfortunately also will take time. In some jurisdictions it will probably come faster, in others it will take longer — so usually it's on the longer side of things. So at the end of the day, what remains is that the easiest thing that can be done immediately is probably really a phased rollout of any type of update, and then very thorough monitoring of what is happening. And then, ideally, double checks like I've proposed — maybe the endpoint also asking, but also the vendor closely monitoring it, having a kill switch to say: okay, let's interrupt this, there are too many signals of something going wrong. I think this is the thing which can be done more or less immediately, as well as everyone must work on their software development practices and double check: do we do everything well? And I'm absolutely confident that usually, everywhere, there will be some areas where, let's say, there is room for improvement — probably for more or less every vendor.
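A minimal sketch of the vendor-side phased rollout with a kill switch that Martin outlines: push an update to progressively larger rings, watch crash telemetry between waves, and stop as soon as a threshold is exceeded. The ring sizes, threshold, and the two functions that would talk to real deployment and telemetry infrastructure are placeholders, not anyone's actual tooling.

```python
# Illustrative ring-based rollout with a telemetry-driven kill switch.
import random
import time

RINGS = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per wave
CRASH_THRESHOLD = 0.001            # abort if >0.1% of a wave's hosts crash

def deploy_to(fraction: float) -> list[str]:
    """Placeholder: return the host IDs the update was pushed to in this wave."""
    return [f"host-{i}" for i in range(int(fraction * 1000))]

def crash_rate(hosts: list[str]) -> float:
    """Placeholder: query telemetry for the share of hosts that crashed after the push."""
    return random.random() * 0.0005   # simulated healthy fleet

def rollout(update_id: str) -> bool:
    for fraction in RINGS:
        hosts = deploy_to(fraction)
        time.sleep(1)                  # soak time between waves; hours in reality
        rate = crash_rate(hosts)
        if rate > CRASH_THRESHOLD:
            print(f"{update_id}: kill switch tripped at {fraction:.0%} (crash rate {rate:.3%})")
            return False               # halt the rollout and roll back the wave
        print(f"{update_id}: wave {fraction:.0%} healthy (crash rate {rate:.3%})")
    return True

rollout("rapid-response-example")
```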
John
So, just going back to the incident at hand: I think if you read between the lines of what's been reported and what CrowdStrike themselves have said, they're being very careful in their wording about what they did around validation of this content. So this was a content update sent to the kernel driver. It didn't change the kernel driver, but this was data that the kernel driver read. And the fact that they say it passed validation, to me, makes it sound as if there wasn't a lot of in-depth actual testing. And we've also kind of pointed out that this was pushed out to everybody at the same time. A staggered deployment would certainly be helpful. Maybe alerting people that these things are coming, giving customers some control back. All this started a few years back, like you said, Martin, around zero days. Because the way software used to get pushed to organizations was: you'd get an update, it would go to your own internal testing lab if you were a big company, and you'd spend weeks testing it to make sure everything was fine and then eventually push it out. Well, there was so much lag time between when a patch was released and when it got implemented that it was a great opportunity for bad actors to launch their attacks. So then, you know, as an industry, we moved to automatic updates — just turn on automatic updates, and as soon as something is ready, it gets pushed across the organization. And the customer organization really doesn't have any control over that. I think we need to find the happy medium there, where vendors push these updates but also give some opportunity to say yes or no or wait, such that, you know, this isn't a global problem again. And I think we see that reflected in CrowdStrike's messaging, where they say they will enhance their software testing procedures and strengthen error handling procedures. Again, I think that says there was not nearly enough testing done before this was released. I mean, if you look at some of the published crash dump analyses, the file that went out was just full of zeros. So I think, given the fact that 8.5 million machines were affected, surely some real testing would have discovered this, and it could have been prevented before it was pushed out to everyone. So yeah: stagger your deployments, put some control back in the hands of customers, and then see where we can go from there.
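An illustrative pre-release sanity check along the lines John suggests: before a content file ever leaves the build pipeline, verify it is not degenerate (empty, truncated, all zeros) and that the same parser the sensor uses can actually load it. The "reuse the agent's parser" step and the magic header are hypothetical; the degenerate-file checks themselves are generic.

```python
# Illustrative publisher-side release gate for an update content file.
import hashlib

def release_checks(blob: bytes) -> list[str]:
    """Return a list of problems; an empty list means the file may be released."""
    problems = []
    if len(blob) < 64:
        problems.append("file suspiciously small")
    if blob.count(0) == len(blob):
        problems.append("file is all zero bytes")
    try:
        parse_like_the_sensor_does(blob)      # hypothetical: run the agent's own parser
    except Exception as exc:
        problems.append(f"sensor parser rejected the file: {exc}")
    return problems

def parse_like_the_sensor_does(blob: bytes):
    # Placeholder parser, reusing the hypothetical magic header from the earlier sketch.
    if not blob.startswith(b"CNT1"):
        raise ValueError("missing magic header")

candidate = b"\x00" * 40_000                  # a degenerate all-zeros candidate file
issues = release_checks(candidate)
if issues:
    digest = hashlib.sha256(candidate).hexdigest()[:12]
    raise SystemExit(f"Blocking release (sha256={digest}): " + "; ".join(issues))
print("Content file passed release checks")
```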
Alexei
If I may just say a few words, Matthias. What everyone has discussed so far are all recommendations to vendors — how they should basically fix their faulty processes — which is all useful and great, but what can we tell end users? Because they have like zero control now. And what I've heard specifically from the people dealing with this incident in the field, so to say, is that it's just technically impossible to stagger CrowdStrike updates at the moment. So they cannot apply a policy, cannot deploy a firewall or anything like that — it would not have changed anything. What should they do?
Mike
So the first thing is this was a very good test of organizations' resilience plans. And so if you had a resilience plan, and it was a good one, then you probably were able to recover more quickly. And if you didn't have a resilience plan, then you found yourself in some trouble.
Alexei
So I guess probably the biggest takeaway from this whole incident is that security on its own is not the goal. The goal is still your resilience, and security is just one way to achieve it. But you have to balance it with other considerations as well, because too much security obviously just isn't good in this case — which is what we have learned the hard way. Because those people who were affected at a massive scale were the people with the best security strategies, with the broadest coverage of endpoints and with the fastest deployment of patches — which were all the best practices, if you will. And now they're paying the price for that.
Matthias
So we need to understand what organizations can really do to make sure that this can be prevented. That's really a question: are there additional measures on top of that? We see a movement that we've seen in operational technology before — just to say, it's not only cybersecurity, it's always also resilience: keep the systems running and make them secure. Do both at the same time, but don't sell one for the other. I think that is an important part. Of course, if there is some faulty development process or some faulty verification — every newbie has written a web frontend for a SQL database and then was surprised that there could be SQL injection, and I think this is something comparable. I would not say that it is the same as what we have here, that input verification did not take place properly, but this is just the groundwork. What else can organizations do? Can they look at the purchasing process? Can we as analysts look closer into this testing process when we look at our Leadership Compasses? Is this something that we should focus on as well?
Mike
Well, yes, yes. I think this is the same question that was asked about cloud services. You need to have a proper vendor evaluation process that takes into account some of these risks. And whilst there may be legal issues that you can bring in, there are also standards that can apply to software. I'm thinking particularly of ISO/IEC 15408, which is an external evaluation of software for its consistency and resilience. So you can ask vendors — or rather their solutions — to be accredited to that, if you will.
Martin
Yeah, but even then, the power of the individual buyer, unless you're a very, very large organization, is somewhat limited. I think there are other aspects, and when we go beyond this incident, then one of the questions always is: are we affected by this? Take things like Log4j. I've talked with very large organizations which said: okay, we spent the first long weekend analyzing which systems may be affected. Right now we have some sort of improvements in regulation around SBOMs, software bills of materials, which potentially makes this a bit simpler, but we're definitely not yet there. But understanding what we have in place, in which versions, et cetera, is one important part of every measure. Also, recovery concepts that are proven, from the individual client machine to different services, are important. It's very interesting — I've talked a lot in the past year with people from the backup and recovery space. Most organizations are very good at backup. Many are also somewhat good at deploying new systems, but bringing these things together is the challenge: how do you restore a large number of systems, and in which order do you do it? Do you have a plan for which systems must be up first and how you do it? Do you have a plan to restore, for instance, complex multi-tier distributed applications in the right order, so that the data isn't corrupted at the end, and does everyone know who does the recovery? It's usually not only the people who do the backup. All these things must be there. And at the end, we are at incident response management and business resiliency management — all the disciplines which are super important in cybersecurity, but which sometimes tend to fall apart a bit more than other disciplines.
Mike
Yeah, so that's it. I'm actually writing about disaster recovery at the moment, and that's an important issue. One of the interesting things about this, though, which John talked about at the beginning, is that in a standard Windows deployment, you would actually have to send a person with technical knowledge to each individual Windows installation. Now, in a way, I remember when there were dual-boot machines. So maybe there would have been a solution if you could have said that we always have a secondary boot which has not been updated — so if it really, really is important, we can get going. So there may be another technical solution to that. But the real problem is that the need for physical presence on all of these 8.5 million machines is a significant issue.
Alexei
And funnily enough, this again is a problem that has had a solution for years. I mean, we have IPMI, for example, for enterprise deployments; we have IP KVMs for personal deployments. Basically, you can always find a technology which would solve this need to be physically present at the console. There are solutions for that. The problem, of course, is that they cost money, and nobody wants to spend that budget on something which is supposedly an extremely unlikely scenario. Well, we have learned the hard way, again, that the scenario is not as extremely unlikely as many people thought.
Martin
And you need to have the solution in place before the incident happens — that's the other thing. But regarding the risk equation, I think we can still argue that the previous very big incident caused by such an update from an endpoint security solution was 14 years ago. So from that perspective, if you apply standard risk mathematics, you may say: if this happens every 14 years, why should I invest? I know it's a bit of a cynical argument, but I've been sitting in whatever steering committees with board members that did this risk mathematics and said: no, we don't spend money on that, because the probability is so low. Because it means, if it happens very rarely and only for very few systems relative to all the systems rolled out, why should we invest in that? I personally believe: yes, there are things we can do with reasonable effort which will increase our overall resilience, because this is not the only incident. And being able to bring client systems back in an automated manner, very efficiently, without walking — or in this case maybe running — to each and every system, clearly is a measure that makes absolute sense, because you will potentially have the same problem with cyber attacks. But you also need to be able to say: okay, I just bring the systems back, so to speak, from scratch by automated redeployment, because the data is somewhere in the cloud anyway, hopefully protected.
John
So I did want to give one final piece of advice to CIOs and CISOs about this type of situation. And that is: like we've alluded to, this isn't the first time this has happened. And even though it might be 14 years before it happens again, you should probably be aware of the tools in your environment that do have kernel mode drivers. And it's not just endpoint security — endpoint security itself has many different components, and some organizations are running multiple products that do that, while others use, you know, a full EPDR package — but your DLP software if you have that, virtualization, SASE or SSE clients, VPN clients: any of those might install kernel mode drivers. And I think you need to understand what you have in your environment and be prepared to deal with that if the time comes, you know, and then find out how those things work. Do you have any control over the updates that get pushed? If not, have a conversation with your vendor about that. And then, yeah, think about consolidation versus diversity in terms of the kinds of clients that you have, the kinds of operating systems that you have. Think about, you know, does a particular course of action really reduce the risk sufficiently, and is it justified?
Matthias
Right. So before we close down — that was really a great discussion — is there other advice that we have not yet mentioned? This would be your chance to provide that advice to our listeners. Maybe starting with Alexei: anything that we have not yet mentioned and you think everybody should just do?
Alexei
If I may bring back a term which will probably make some people groan a little bit: I think we still have to remember Zero Trust. In that original, organic way, if you will, Zero Trust really means: do not trust anyone, including your security vendors. If you are blindly giving them the keys to your digital kingdom, you are obviously doing your risk management wrong. And as a person who used to major in statistics back at university, I'm sometimes just a little bit amused by the way people calculate those probabilities. It reminds me of the old joke: what's the probability of meeting a dinosaur out in the street? It's 50-50 — either you meet one or you don't. Let's not calculate our risks like that. Let's do it in a little bit more scientific and proven way.
Matthias
Mike, final thoughts?
Mike
Well, my advice to organisations would be: hope for the best, but expect the worst, and be prepared.
Matthias
Right, good one. Martin.
Martin
I want to pick up two things: the thing Alexei just said, and one thing Mike said before. Yes, don't trust software. And what Mike said was: understand the procedures software vendors have in place. That also means thinking about how many solutions that truly go that deep into your systems — that have kernel mode drivers — you really need, and re-syncing your vendor assessment with trying to reduce this, reserving these deep integrations into systems — which could be other types of components and other operating systems — for the very critical components, the things that really can bring your business down. Try to focus on the vendors where you get more proof. It's not that you have trust, but that you sort of have at least more proof, or less distrust.
Matthias
Right, assurances at least.
Martin
Assurance is the more friendly term, that's true.
Matthias
Right. But I think if one thing came to our minds that should be taken care of, it's kernel drivers — kernel-level access to an operating system should be much better guarded, like John said. We're closing down. Martin, John, Alexei, and Mike, thank you very much for this discussion, for going beyond what's just in the news and really getting to more actionable recommendations. Looking forward to talking to all of you again very soon. And that's it for today. Thank you very much and see you soon. Bye bye.