Thoughts on the CrowdStrike/Microsoft Global IT Outage

Multiple blue screens of death, caused by an update pushed by CrowdStrike, on airport luggage conveyor belts at LaGuardia Airport, New York City. Smishra1, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

Unless you've been hiding under a rock or have been seriously off-grid over the last week, you will have seen the near-continuous news coverage of the global IT outage caused by a defective release of CrowdStrike content. Unfortunately, even reasonably trustworthy news sources misunderstand the issue. Worse still are the numerous industry professionals who are quick to declare the root cause of the issue - in complete contradiction to CrowdStrike's statements. Worst of all are some of CrowdStrike's competitors who are saying that this would never have happened to them and are even offering to switch customers over. This ambulance chasing is frankly disingenuous and insulting to all seasoned cybersecurity professionals.

So let's dispel the myths and disinformation and objectively consider the situation.

Disclaimer: Caribbean Solutions Lab is both an independent CrowdStrike customer and partner. Being independent means that we pay for our licenses just like any other customer, and CrowdStrike is just one of several tools we use to protect our endpoints.

What is CrowdStrike? CrowdStrike, the cybersecurity company, was founded in 2011 by George Kurtz, Dmitri Alperovitch, and Gregg Marston. George and Dmitri were previously at McAfee; in fact, George was McAfee's CTO (more on this later). CrowdStrike started out as an Endpoint Detection and Response (EDR) tool but has grown into a cybersecurity platform that, as of 2024, commands an impressive ~25% share of the global endpoint protection market. Their core product is the Falcon sensor, an application that is installed on endpoint systems (e.g. Windows, Linux, macOS) and that detects and responds to threats. In order to protect the underlying operating system, Falcon must operate in what is commonly called privileged mode and have full access to the operating system's kernel, or core. For Microsoft Windows this is by design, and it is the convention under which antivirus and endpoint protection (AV/EP) products operate.

What happened? Based upon CrowdStrike's disclosures, on Friday, 19 July 2024, CrowdStrike released a routine content update, Channel File 291. Organizations in Australia were the first to start experiencing systemic crashes, evidenced by Windows blue screens of death (BSoD). By the time the Americas had started their Friday, the problem was documented to have affected millions of CrowdStrike installations in Europe as well. In hindsight, we know that only Microsoft Windows systems were affected, not Linux or macOS systems. Within hours of the issue being discovered, and despite published remediations, the damage had been done to any systems that were online and able to download the offending content file. Systems that did not download the problem file were not affected. We fell into this group, having had all of our systems offline overnight - as we're 100% cloud-based. The resolution is to boot into Safe Mode, remove the offending files, reboot, and then allow CrowdStrike Falcon to download a working version.
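
For illustration, here is a minimal sketch of that published workaround, assuming the machine has already been booted into Safe Mode or the Windows Recovery Environment and that Falcon is installed in the default location; treat it as a sketch, not an official remediation tool.

```python
# Minimal sketch of the published manual workaround, meant to be run from
# Safe Mode or the Windows Recovery Environment on an affected machine.
# The install path and file pattern follow CrowdStrike's public guidance;
# adjust the drive letter/path for your environment before relying on it.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_bad_channel_files(directory: Path = CROWDSTRIKE_DIR) -> None:
    """Delete any copies of the defective Channel File 291 and report each one."""
    for bad_file in directory.glob("C-00000291*.sys"):
        print(f"Removing {bad_file}")
        bad_file.unlink()

if __name__ == "__main__":
    remove_bad_channel_files()
    print("Done - reboot normally and let Falcon download a healthy content file.")
```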

How hard can it be? On paper this is a straightforward process, but in practice it isn't trivial, due to everything from encrypted hard drives to virtual and remote systems. Even a nominal 15-minute recovery time, multiplied across 8.5 million devices, equates to more than two million man-hours. Anyone who has been in the industry long enough knows that recovering Microsoft Windows is never as clean in practice as it is on paper. First you have to successfully boot into Safe Mode. There are documented cases where this had to be attempted 15 times before the system booted and the bad CrowdStrike file could be removed. For encrypted systems, recovery keys need to be provided in order to decrypt the hard drive and access the system. On large systems, booting can sometimes take several minutes under perfect conditions, which these are not. CrowdStrike has provided guidance and tools to help organizations automate the recovery process. Microsoft, having a vested interest in the recovery of their mutual customers, has also released guidance and tools. In fact, many industry professionals, other IT companies, and service providers have jumped in to help. Given the exceptionally wide blast radius of this event, the entire industry (mostly) has been working towards a speedy recovery. Unfortunately, threat actors have also been quick to take advantage of the news cycle and have been actively campaigning with fake information and fixes.
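
The back-of-envelope effort estimate above is easy to verify; a quick calculation using the same figures:

```python
# Quick check of the recovery-effort estimate quoted above.
affected_devices = 8_500_000    # devices reported affected
minutes_per_device = 15         # optimistic hands-on recovery time per machine

total_hours = affected_devices * minutes_per_device / 60
print(f"{total_hours:,.0f} man-hours")   # ~2,125,000 man-hours, before failed boots and retries
```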

What has gone well? CrowdStrike has been very transparent about the issue and has been communicating with customers. I have seen some of George's interviews, and unfortunately the journalists' lack of understanding isn't helping matters. Emails such as the screenshot below have been forthcoming several times a day. As of 22 July 2024, CrowdStrike indicates that a significant number of systems are back online, but realistically it may be weeks before full recovery.

What really went wrong? Depending upon how much CrowdStrike eventually makes public, we may never really know. But let's be clear about a few things. First, the problematic file was a content update and NOT a patch or application update - contrary to what many supposed pundits and competitors claim. The challenge for all security vendors is to provide protection to customers while keeping up with the onslaught of threats. Current estimates are that there are over 500 million new and unique pieces of malware generated every day. So how do you get the latest instructions (content/signatures/DAT files) on how to detect threats to endpoints in a timely manner? Transferring a file is the oldest and most common technique, used by CrowdStrike and hundreds of other endpoint security vendors. Another method is to use web-based or online content, negating the need to transfer files but requiring a constant Internet connection. A less common method is to build the content into the application itself. This method is currently used by Microsoft Windows' native Defender product but is a more onerous process and problematic for structured management and change control. Then there is the completely signatureless route, where the endpoint protection doesn't need regular updates or to be online. CylancePROTECT (now owned by BlackBerry) is a prime example of this approach. Some vendors, such as McAfee, use a hybrid approach combining a very modular architecture with regular content updates, online content, and periodic engine upgrades. Speaking of McAfee, in 2010 a bad McAfee content file (5958 - yes, I was there) instructed its AV to consider a key Windows process (SVCHOST) malicious and triggered its deletion. The negative effects were global and widespread, though not quite as wide as CrowdStrike's today. McAfee's then CEO, Dave DeWalt, George's boss at the time, publicly acknowledged fault, apologized, and promised to do better. Parts of McAfee's current content distribution architecture were developed as a result of that incident. There are no right or wrong approaches to delivering the necessary threat intelligence to deployments. There are pros and cons to each.
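
To make the distinction concrete, here is a deliberately simplified, hypothetical sketch of the file-based content-update pattern: the installed engine never changes, only the detection content it loads. This is not CrowdStrike's actual mechanism; all names and file formats here are invented for illustration.

```python
# Hypothetical illustration of a file-based content update: the scanning
# engine (the installed application) stays the same; only the detection
# content it loads is swapped out. This is NOT how Falcon is implemented -
# just a sketch of the general pattern described above.
import json
from pathlib import Path

CONTENT_DIR = Path("content")   # hypothetical folder where downloaded content files land

class Engine:
    """Stand-in for an installed AV/EP engine that consumes content files."""
    def __init__(self) -> None:
        self.signatures: dict[str, str] = {}

    def load_content(self, content_file: Path) -> None:
        # A real engine would verify a digital signature and sanity-check the
        # file before loading it - exactly the kind of gate that matters when
        # a content file turns out to be defective.
        self.signatures = json.loads(content_file.read_text())

    def scan(self, sample_hash: str) -> str:
        # Flag the sample only if the loaded content lists it.
        return self.signatures.get(sample_hash, "clean")

if __name__ == "__main__":
    engine = Engine()
    engine.load_content(CONTENT_DIR / "channel-291.json")   # hypothetical file name
    print(engine.scan("44d88612fea8a8f36de82e1278abb02f"))  # EICAR test-file MD5
```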

Were mistakes made? Maybe; time will tell. Until proven otherwise, I am confident that CrowdStrike makes reasonable efforts to test before release. But given the practically infinite number of combinations of versions and languages of operating systems, applications, hardware, and device drivers, it is impossible to test every scenario. Those saying that this can't happen with their product or tool are basically saying a mistake will never be made. If that were really the case, why do they ever need to update or patch their product? Couldn't they get it 100% correct the first time and never have to change a thing? Of course not. Microsoft Windows alone changes every month on Patch Tuesday. Lather, rinse, and repeat for every other vendor and app present on a given system. It is certainly true that a non-signature-based product can't have a problem with a signature, but does it really matter to you and me as consumers which part of an application or system is faulty? Some pundits argue that staged deployments and testing are ways to avoid global outages. It has already been disclosed that the outages followed Friday's rising sun. Some of that might just be because of who was awake to witness the outage. But CrowdStrike is a global 24x7x365 operation, not one tied to western hemisphere time zones, so they were surely gathering data as the issues started to be documented.

What could be done better? CrowdStrike currently does not expose management of content updates and distribution to customers. Enabling staging, with a kill switch in case of emergency, would be useful and seems like something easily implemented. I also expect CrowdStrike to review and tweak their procedures to reduce the risk of a similar incident in the future. Microsoft could certainly improve Windows' recovery procedures, and architecturally Microsoft could learn a few things about kernel protection from the likes of Apple's macOS.
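
To illustrate, here is a purely hypothetical sketch of what a ring-based content rollout with an emergency kill switch could look like; the class and parameter names are invented and do not reflect any vendor's real design.

```python
# Purely hypothetical sketch of a staged ("ring") content rollout with a
# kill switch - an illustration of the idea, not any vendor's real design.
from dataclasses import dataclass

@dataclass
class Rollout:
    rings: list[list[str]]            # e.g. [canary hosts, early adopters, everyone else]
    crash_threshold: float = 0.01     # halt if more than 1% of a ring reports crashes
    halted: bool = False              # the emergency kill switch

    def push(self, content_version: str, crash_rate_for) -> None:
        for ring_number, hosts in enumerate(self.rings, start=1):
            if self.halted:
                print(f"{content_version}: rollout halted before ring {ring_number}")
                return
            print(f"{content_version}: deploying to ring {ring_number} ({len(hosts)} hosts)")
            if crash_rate_for(hosts) > self.crash_threshold:
                self.halted = True    # trip the kill switch; later rings never receive the file
                print(f"{content_version}: crash rate exceeded threshold, rollout stopped")
                return

if __name__ == "__main__":
    rollout = Rollout(rings=[["canary-01"], [f"host-{i:02d}" for i in range(1, 11)], ["rest-of-fleet"]])
    rollout.push("channel-291", crash_rate_for=lambda hosts: 0.0)  # healthy telemetry: all rings get the update
```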

What can you do? Plan for failure. Sooner or later, whether due to an application fault, a hardware issue, or something external like a hurricane or cyber attack, things will break. Similar scenarios have happened before with McAfee, Kaspersky, Symantec, and even Microsoft. Test and practice your recovery procedures. In terms of cybersecurity, some advocate hedging bets and running different AV/EP products. If you can afford this approach, why not? But how would it work? Running one product for servers and one for workstations (a very simplistic setup) doesn't really remove the concentration risk on critical infrastructure. So maybe you protect half of your systems with one product and the other half with another. This is great if your resources are evenly split, but how many of you have two production versions of every server/role? I'm sure very few. Now consider a mildly complex organization. There isn't a clear, universal solution. So I refer you back to my first statement: plan for and practice for failure.

- From the desk of the CTO

Cybersecurity Preparedness

Ivan near peak intensity west of Jamaica on September 11, 2004. NASA image courtesy Jacques Descloitres, MODIS Land Rapid Response Team at NASA GSFC. Public domain, via Wikimedia Commons

It's 1 July and the official start of Hurricane Season. Most organizations will have been reviewing and, hopefully, testing their Business Continuity and Disaster Recovery (BCDR) procedures. Annual BCDR exercises are generally considered a sound practice - for events that have a predictable timeframe and ample lead time.

But what about a cybersecurity incident/event? Can these not happen 24x7x365?

How often do you test your cybersecurity controls? Are your tools and policies actually effective? Are there any gaps?

How often do you test your technical incident response plan? Do all of your staff understand and can they fully execute your incident response playbook?

How often do you test your procedural incident response plan? In the event of a material breach, does your organization know who (regulators/customers) should be notified? Do you know how they should be notified? Do you have a statement prepared in advance?

Do you see the issue? The overall risk of a cybersecurity incident is substantially higher than that of a hurricane, and yet most organizations don't devote as much effort and resources to preparedness.

The following approach works well for most customers.

  • Quarterly Security Validation - This focuses on your technology implementation (e.g. AV, IPS, EDR, policies) and consequently also tests your technical incident response plan.
  • Annual Tabletop Exercise - This focuses more on your business-level efforts, i.e. management and the board of directors, and should also include members of your legal and communications teams.

- From the desk of the CTO

Case Study – Ourselves!

As the saying goes, sharing is caring - but first, some background. Most of our infrastructure has been in the cloud for close to ten years, i.e. email, security and management consoles, etc. Early on, our endpoints were domain-joined, but that proved cumbersome for us as we spent the majority of our time off-LAN, working remotely or at customer sites. As cybersecurity service providers, we lead by example and make a point of using the same technologies and tools that we recommend for our customers.

Our core endpoint security consists of several tools and layers. We’ll start from the foundation and work our way up the stack:

  • Trapezoid FIVE – If you can’t trust the hardware, how can you trust the software? Firmware is everywhere, from your laptop to IoT device to datacenters to the cloud. It is even part of the NIST Cybersecurity Framework. Using Trapezoid FIVE, we monitor the BIOS integrity of our systems for unauthorized changes.
  • OpenDNS Umbrella – Specifically, the Roaming Client provides security management and reporting from the endpoint, network, and web perspectives. Our systems benefit from the same Community Threat Intelligence feeds that we provide for our customers.
  • CylancePROTECT – All modules are fully enabled and enforced. File security is set to Auto Quarantine with Execution Control. Files are examined pre-execution. Memory violations are terminated. The CylancePROTECT services and applications are protected against tampering. The Application Control module is enabled and prevents creating or even saving an unauthorized application (Portable Executable or PE) file, let alone running it. Script Control is fully enabled and will block unauthorized ActiveScript, PowerShell scripts, or MS Office macros from running. Objectively, fully locking down scripts was the hardest aspect but is critical for the prevention of fileless malware. Lastly, the Device Control module allows the usage of only authorized devices.
  • CylanceOPTICS – The OPTICS endpoint detection and response (EDR) tool is deployed with a custom-developed ruleset that also aligns with the MITRE ATT&CK Framework.
  • OPSWAT MetaDefender Cloud – All web downloads are scanned with over 35 antimalware engines in the MetaDefender Cloud system.
  • HerdProtect – As CylancePROTECT only monitors Portable Executable (PE) files, we also regularly scan our endpoints with HerdProtect’s 68 independent antivirus scanners. This ensures an explicit check of data files such as MS Office, Adobe, and image files.
  • McAfee GetClean – We have a long relationship with McAfee and continue to work with the company, notably with the Joint Development Program. We use this tool to help provide McAfee with information on known clean system images and files.
  • DUO Beyond – Authentication to our local systems requires multifactor authentication using DUO. In addition to requiring MFA, we are also enabling usage of Yubikey Two Factor security keys wherever possible.
  • BitLocker – All local volumes are fully encrypted using Windows BitLocker.
  • Windows and application updates are installed within one month of release. Patch management is primarily monitored using Patch Manager Plus Cloud but also with DUO’s Device Health application.
  • RoboForm Everywhere manages my passwords. I’ve been using it for a very long time. LastPass is the other popular app for this. We never save passwords in browsers.
  • User accounts never have administrative rights – not now, not next November, never! Check out the latest BeyondTrust report on Microsoft vulnerabilities.
  • Device management is performed using several tools and layers, subject to the individual device’s specifics. Cisco Meraki’s MDM system is the most common, and the ability to manage our own systems as well as customers’ systems, firewalls, switches, and wireless devices is simply more efficient.
  • Finally, our online presence, or our digital risk, is monitored using a combination of open-source tools like Google Alerts as well as private deep/dark web tools, including but not limited to Digital Shadows and SpyCloud.

The commonly stated trade-offs for security are performance and convenience. As security is our business, it cannot be compromised for convenience. That said, if I really had to nitpick, the one area that might be considered inconvenient would be the authentication process with MFA. In reality the few extra seconds required to use the DUO app on our phones is truly trivial. Technology is constantly evolving and we’re always looking at ways to improve.

As for performance, we have long believed in making the right investments in computing devices up front. We favor business-class computers, currently the HP Zbook Workstations. We also made the switch to solid-state drives (SSDs) ten years ago and haven’t looked back. My current system is a five-year-old HP Zbook 14 G2 with dual SSDs and 16GB of RAM. It runs as well now as the day I purchased it. Only when I am simultaneously running several virtual systems do I long for more RAM. More recent models like the Zbook 15u support 32GB of RAM. Over time that too will improve. The longevity is partially due to the choice of hardware (platform, RAM, SSD) and partially due to our carefully curated security stack.

Bonus Tip: An easy way to improve the performance of a Windows system running on a standard hard disk is to use a USB key and enable ReadyBoost. This nifty feature has been around since Windows Vista and does a really good job of improving performance. Note that you should format and dedicate this key to the computer and not use it for moving files around. A small, high-speed 16GB key is ideal and can be connected to a USB port somewhere on the back or side, where it will be out of the way and where you won’t be tempted to remove it.

I hope you found this information useful. If you have any questions about the information provided, please do not hesitate to contact me.