Some years ago, Marc Elsberg’s novel Blackout described the impact on society of malicious software infecting Europe’s electricity supply network. Last week, the whole world experienced the consequences of a faulty software update to the CrowdStrike Falcon security software. According to CrowdStrike CEO George Kurtz, this was not a malicious act. Nevertheless, the impact of this error was considerable.
We Are All Now Dependent on Digital Systems
I can understand how the CrowdStrike team feels. When I was VP of Development for security software, I received a call one night telling me that our software was preventing clinicians in a paediatric hospital from accessing their systems, and that unless they regained access within the hour, babies would start to die. I remember very clearly how that felt.
Developing security software places an extra burden on development teams. This software is intended to protect organisations from malicious actors; however, by its very nature, it can also prevent the systems it is meant to protect from operating.
Over the past several years, governments have recognised society’s increasing dependence on IT systems and the resulting need for greater resilience in these systems. In Europe, this has led to legislation that includes NIS2 and DORA. The best control to ensure resilience is diversity. However, in the world of IT, most organisations are heavily dependent on systems delivered by a few suppliers. This is especially true for desktop systems, where Microsoft is the dominant supplier, and for cloud services, where AWS, Google, and Microsoft have the lion’s share of the market.
In this instance, an update to security software caused the systems running it to crash, and the fix required significant manual intervention on each affected machine. End users had no control over the deployment of this update, as the security vendor pushed it out to endpoints across the world within a very short period of time. The issue affected CrowdStrike customers’ Windows-based systems, including PCs, servers, kiosks, and other specialist terminals.
This is not the first time this has happened: just a few months ago, CrowdStrike had a similar issue with the Linux version of its software. An update incompatible with the latest version of Debian Linux was released, causing servers to crash and refuse to boot. At that time, it took the company weeks to acknowledge the issue and to reveal that Debian Linux was not covered by its test procedures, despite being officially supported.
Other cybersecurity vendors, including McAfee, Sophos, and Symantec, have had similar issues over the past two decades, although none of these had such a global impact.
What an End-User Organisation Must Do
Since this outage was caused by a defect in the security software itself, the usual advice to keep security software up to date is of little help here. Moreover, such events are infrequent but high-impact, which makes them hard to plan for. Here are some actions that organisations can take:
Include this in your Business Continuity Plan – Consider this kind of risk as part of your business continuity planning. Remember that as your organisation goes digital, it becomes more dependent upon IT, and cyber risks require special treatment. Cyber incidents spread very rapidly across interconnected components, so moving to another physical location does not help: systems at a backup site that run the same software are likely to be affected in the same way.
Resilience through Diversity – The most powerful control to ensure resilience is diversity, but this is difficult to achieve given the dominance of a small number of major suppliers. For your critical business systems, consider whether the savings in cost and ease of management from depending on a single IT environment justify the risk of a cyber failure that affects the whole environment at once. For life-critical systems, best practice calls for three different software elements from three different suppliers to minimise risk. This is impractical in most situations, but you could consider deploying security software from multiple vendors across different parts of your IT estate.
Evaluate Vendor Risk – When choosing security software, include this kind of risk in your vendor assessment process. Evaluate the controls the vendor has in place to prevent and mitigate this kind of error, such as its software design and development processes, including testing and deployment. Does the vendor roll out updates in phases, with built-in feedback that can halt a faulty release? Does the vendor give you any control over the deployment of updates, and can updates be deployed selectively to groups of systems? Does it follow standards-based practices for software supply chain security?
Incident Plan – Have a well-tested incident response plan that includes this kind of event. Plan and test how you would manage having to reimage or reboot a large portion of your IT estate. Do you have the tools and skills to manage this? Don’t forget to verify that your data is backed up and that you can restore it in time.
Keep Calm and Carry On – Unfortunately, many cybercriminals, and even a handful of security vendors, have already seen this massive incident as an opportunity to exploit victims’ insecurity and vulnerability. We are already observing a sharp increase in phishing and other criminal activity targeting users of CrowdStrike and Microsoft products. Some vendors are trying to push their own products as “more resilient” alternatives. The best thing you can do now is to avoid making rash decisions. Focus on addressing the immediate consequences of the outage and seek neutral expert guidance for adjusting your long-term security strategies, architectures, and portfolios. Favour methods that can be proven and validated, and avoid snake oil at all costs.
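The phased, feedback-driven rollout raised under Evaluate Vendor Risk can be illustrated with a short Python sketch. It is not how CrowdStrike or any other vendor actually deploys updates. It assumes endpoints are grouped into rings (a small canary ring first), and the names `phased_rollout`, `deploy` and `is_healthy`, as well as the 1% failure threshold, are hypothetical placeholders for whatever a real endpoint-management tool provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    deployed: List[str]            # hosts that received the update, in order
    halted_at_ring: Optional[int]  # index of the ring that failed the check, or None


def phased_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],      # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],  # hypothetical feedback signal, e.g. host reports in after reboot
    max_failure_rate: float = 0.01,     # illustrative threshold, not any vendor's default
) -> RolloutResult:
    """Deploy ring by ring; stop as soon as one ring exceeds the failure threshold."""
    deployed: List[str] = []
    for index, ring in enumerate(rings):
        for host in ring:
            deploy(host)
            deployed.append(host)
        # Feedback step: count how many hosts in this ring are unhealthy.
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            # Halt: later rings never receive the faulty update.
            return RolloutResult(deployed=deployed, halted_at_ring=index)
    return RolloutResult(deployed=deployed, halted_at_ring=None)


if __name__ == "__main__":
    # Simulated faulty update that crashes every Windows host.
    rings = [
        ["lab-win-01", "lab-linux-01"],    # ring 0: internal canary machines
        ["win-01", "win-02", "linux-01"],  # ring 1: early adopters
        ["win-03", "win-04", "linux-02"],  # ring 2: everyone else
    ]
    result = phased_rollout(
        rings,
        deploy=lambda host: None,
        is_healthy=lambda host: "win" not in host,
    )
    print(result.halted_at_ring)  # 0
    print(result.deployed)        # ['lab-win-01', 'lab-linux-01']
```

Because each ring must pass the health check before the next one starts, an update that crashes machines is stopped after reaching only the canary ring. Letting customers decide which systems sit in which ring is one way a vendor could answer the question of whether updates can be deployed selectively to groups of systems.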