Thoughts on the CrowdStrike/Microsoft Global IT Outage

Multiple blue screens of death, caused by an update pushed by CrowdStrike, on airport luggage conveyor belts at LaGuardia Airport, New York City. Smishra1, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

Unless you've been hiding under a rock or seriously off-grid in the last week, you will have seen the near-continuous news coverage of the global IT outage caused by a defective release of CrowdStrike content. Unfortunately, even reasonably trustworthy news sources misunderstand the issue. Worse still are the numerous industry professionals who are quick to declare the root cause of the issue, in complete contradiction to CrowdStrike's statements. Worst of all are some of CrowdStrike's competitors who claim this would never have happened to them and are even offering to switch customers over. This ambulance chasing is frankly disingenuous and insulting to all seasoned cybersecurity professionals.

So let's dispel the myths and disinformation and objectively consider the situation.

Disclaimer: Caribbean Solutions Lab is both an independent CrowdStrike customer and a partner. Being independent means that we pay for our licenses just like any other customer. And CrowdStrike is just one of several tools we use to protect our endpoints.

What is CrowdStrike? CrowdStrike, the cybersecurity company, was founded in 2011 by George Kurtz, Dmitri Alperovitch, and Gregg Marston. George and Dmitri were previously at McAfee; in fact, George was McAfee's CTO (more on this fact later). CrowdStrike started out as an Endpoint Detection and Response (EDR) tool but has grown into a cybersecurity platform that, as of 2024, commands an impressive ~25% market share of the global endpoint protection market. Their core product is the Falcon sensor, an application that is installed on endpoint systems (e.g. Windows, Linux, MacOS) and that detects and responds to threats. In order to protect the underlying operating system, Falcon must operate in what is commonly called privileged mode and have full access to the operating system's kernel, or core. For Microsoft Windows, this is by design and is the convention under which all antivirus and endpoint protection (AV/EP) products operate.

What happened? Based upon CrowdStrike's disclosures, on Friday, 19 July 2024, CrowdStrike released a routine content update, Channel File 291. Organizations in Australia were the first to start experiencing systemic crashes, evidenced by Windows blue screens of death (BSoD). By the time the Americas had started their Friday, the problem was documented to have affected millions of CrowdStrike installations in Europe as well. In hindsight, we know that only Microsoft Windows systems were affected, not Linux or MacOS systems. Within hours of discovering the issue, and despite published remediations, the damage had been done to any systems that were online and able to download the offending content file. Systems that did not download the problem file were not affected. We fell into this group, having had all of our systems offline overnight, as we're 100% cloud-based. The resolution is to simply boot into safe mode, remove the offending files, reboot, and then allow CrowdStrike Falcon to download a working version.
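For illustration only, here is a minimal sketch of what that manual cleanup amounts to, assuming the default sensor location and the C-00000291*.sys channel-file pattern referenced in CrowdStrike's public guidance. In practice you would follow the vendor's official remediation steps or tooling; Python is unlikely to even be available from Safe Mode.

```python
# Illustrative sketch only: mirrors the documented manual cleanup steps.
# Assumes the default sensor path and the published C-00000291*.sys pattern;
# would need to run with administrator rights.
import glob
import os

CHANNEL_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
PATTERN = "C-00000291*.sys"  # the offending Channel File 291 variants

for path in glob.glob(os.path.join(CHANNEL_DIR, PATTERN)):
    print(f"Removing {path}")
    os.remove(path)

# After a normal reboot, the Falcon sensor downloads a corrected channel file.
```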

How hard can it be? On paper this is a straightforward process, but the actual work isn't trivial due to everything from encrypted hard drives to virtual and remote systems. Even a nominal 15-minute recovery time, multiplied across 8.5 million devices, equates to more than two million man-hours. For anyone who has been in the industry long enough, you know that recovering Microsoft Windows is never as clean in practice as it is on paper. First you have to successfully boot into safe mode; there are documented cases where this had to be attempted as many as 15 times before succeeding and the bad CrowdStrike file could be removed. For encrypted systems, recovery keys need to be provided in order to decrypt the hard drive and access the system. On large systems, booting can sometimes take several minutes under perfect conditions, which these are not. CrowdStrike has provided guidance and tools to help organizations automate the recovery process. Microsoft, having a vested interest in the recovery of their mutual customers, has also released guidance and tools. In fact, many industry professionals, other IT companies, and service providers have jumped in to help. Given the exceptionally wide blast radius of this event, (most of) the industry has been working towards a speedy recovery. Unfortunately, threat actors have also been quick to take advantage of the news cycle and have been actively campaigning with fake information and fixes.

What has gone well? CrowdStrike has been very transparent about the issue and has been communicating with customers. I have seen some of George's interviews, and unfortunately the journalists' lack of understanding isn't helping matters. Emails such as the screenshot below have been forthcoming several times a day. As of 22 July 2024, CrowdStrike indicates that a significant number of systems are back online, but realistically it may be weeks before full recovery.

What really went wrong? Depending upon how much CrowdStrike eventually makes public, we may never really know. But let's be clear about a few things. First, the problematic file was a content update and NOT a patch or application update, contrary to what many supposed pundits and competitors claim. The challenge for all security vendors is to provide protection to customers while keeping up with the onslaught of threats. Current estimates are that there are over 500 million new and unique pieces of malware generated every day. So how do you get the latest instructions (content/signatures/DAT files) on how to detect threats to endpoints in a timely manner? Transferring a file is the oldest and most common technique, used by CrowdStrike and hundreds of other endpoint security vendors. Another method is to use web-based or online content, negating the need to transfer files but requiring a constant Internet connection. A less common method is to build the content into the application itself. This method is currently used by Microsoft Windows' native Defender product but is a more onerous process and problematic for structured management and change control. Then there is the completely signatureless route, where the endpoint protection doesn't need regular updates or to be online; Cylance PROTECT (now owned by BlackBerry) is a prime example of this approach. Some vendors such as McAfee use a hybrid approach, combining a very modular architecture with regular content updates, online content, and periodic engine upgrades. Speaking of McAfee, in 2010 a bad McAfee content file (5958 - yes, I was there) instructed its AV to consider a key Windows process (SVCHOST) malicious and triggered its deletion. The negative effects were global and widespread, though not quite as wide as CrowdStrike's today. McAfee's then CEO (and George's boss), Dave DeWalt, publicly acknowledged fault, apologized, and promised to do better. Some of McAfee's current content distribution architecture was developed from that incident. There are no right or wrong approaches to delivering the necessary threat intelligence to deployments; there are pros and cons to each.

Were mistakes made? Maybe, but time will tell. Until proven otherwise, I am confident that CrowdStrike does take reasonable efforts to test before release. But given the practically infinite number of combinations of versions and languages of operating systems, applications, hardware, and device-enabling drivers, it is impossible to test every scenario. Those saying that this can't happen with their product or tool are basically saying a mistake will never be made. If that were really the case, why do they ever need to update or patch their product? Couldn't they get it 100% correct the first time and never have to change a thing? Of course not. Microsoft Windows alone changes every month on Patch Tuesday. Lather, rinse, and repeat for every other vendor and app present on a given system. It is certainly true that a non-signature-based product can't have a problem with a signature, but does it really matter to you and me as consumers which part of an application or system is faulty? Some pundits argue that staged deployments and testing are ways to avoid global outages. It has already been disclosed that the outages followed Friday's rising sun. Some of that might just be because of who was awake to witness the outage. But CrowdStrike is a global 24x7x365 operation, not one tied only to western hemisphere time zones, so they were surely gathering data as the issues started to be documented.

What could be done better? CrowdStrike currently does not expose management of content updates and distribution to customers. Enabling staging, with a kill switch in case of emergency, would be useful and seems like something easily implemented. I also expect CrowdStrike to review and tweak their procedures to reduce the risk of a similar future incident. Microsoft could certainly improve Windows' recovery procedures, and architecturally Microsoft could learn a few things about kernel protection from the likes of Apple MacOS.

What can you do? Plan for failure. Sooner or later, whether due to an application fault, a hardware issue, or something external like a hurricane or cyber attack, things will break. Similar scenarios have happened before with McAfee, Kaspersky, Symantec, and even Microsoft. Test and practice your recovery procedures. In terms of cybersecurity, some advocate for hedging bets and running different AV/EP products. If you can afford this approach, why not? But how would this work? Running one product for servers and one for workstations (a very simplistic setup) doesn't really remove the concentration risk on critical infrastructure. So maybe you protect half of your systems with one product and the other half with another. This is great if your resources are evenly split, but how many of you have two production versions of every server/role? I'm sure very few. Now consider a mildly complex organization. There isn't a clear, universal solution. So I refer you back to my first statement: plan for and practice for failure.

Useful Links:

- From the desk of the CTO

MetricsMonday (011) – Endpoint Security (Part 3)

Vendors are stepping up their game!

Up until now we have been focused on our own metrics, but what about vendor-supplied or vendor-created ones? Some tools, such as Trellix ePO (shown above), natively provide useful metrics covering your overall security posture, even providing recommendations on how to improve it. In other words, leverage vendor-supplied metrics whenever possible.

Since we are dealing with endpoint protection, there will come a time when something suspicious or malicious is detected. Understanding the detection type (e.g. infostealer, ransomware, remote access trojan) and location (e.g. laptop, server, department, or business unit) provides valuable insight into what is happening in your organization. In fact, let's take it a step further and also correlate threat detections with the user and application (e.g. browser, MS Office, Windows Explorer).
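As a rough illustration of that kind of correlation, here is a minimal sketch assuming you can export detection events to a CSV with hypothetical columns such as detection_type, asset_type, business_unit, user, and application (column names will vary by product):

```python
# Minimal sketch: correlate exported detection events by type, location, user, and app.
# Column names are hypothetical; adjust to your product's export format.
import pandas as pd

events = pd.read_csv("detections.csv")

# Detections by type and by where they occurred
by_type_and_location = (
    events.groupby(["detection_type", "business_unit", "asset_type"])
          .size()
          .sort_values(ascending=False)
)

# Take it a step further: which users and applications keep showing up?
by_user_and_app = (
    events.groupby(["user", "application"])
          .size()
          .sort_values(ascending=False)
          .head(20)
)

print(by_type_and_location.head(20))
print(by_user_and_app)
```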

We once worked with a customer that had a staff member who was trying to download pirated movies. We tracked his behavior starting with his office desktop, then his office laptop, and eventually via his remote access to the Citrix Presentation Server (aka Remote Desktop Services) farm. Simply accepting the threat detection events, without understanding the context and addressing the issue with the user, would have opened the company up to significant cyber and business risk and liability.

MetricsMonday (010) – Endpoint Security (Part 2)

In our previous post about endpoint security, we discussed Operational Compliance: high-level metrics regarding coverage and timely communication. Now we're going to dive into the technical details of protection. Do note that, depending upon the particulars of your endpoint protection, product-specific capabilities and functions will vary.

First let's consider your tools' configuration, settings, and tasks; collectively we'll just call these policies. Ideally your management console can indicate whether endpoints' policies are up to date, and even report a policy version. If your console does not have this ability, hopefully the local app's console can indicate it. Without some method to confirm that your apps are doing what they are being told, you have nothing more than an assumption of what is in place.

  • % of endpoints with up to date policies*

* Overall health is important, but you should also be able to drill down into the details of any out-of-compliance systems so that they can be remediated.

If your chosen tool uses content updates (e.g. signatures, .DAT files, engines), you will want to ensure that your endpoints are kept up to date. With most vendors releasing content on a daily basis, a good rule of thumb is ensuring that your systems are never more than four versions out of date. This threshold allows for reporting delays, e.g. if a system is out of the office but online, as well as for systems that are temporarily offline. For scanning engines or other content that is not updated as frequently, sticking with no more than a one-version difference is good. (A rough sketch for computing these metrics follows the list below.)

  • % of endpoints with content within n-4 versions
  • % of endpoints with engines within n-1 versions
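As a rough sketch, assuming you can export an endpoint inventory with hypothetical columns like policy_current (true/false), content_version, engine_version, hostname, and last_seen, the metrics might be computed along these lines (thresholds, version formats, and column names will differ per product):

```python
# Minimal sketch: point-in-time policy/content/engine compliance from a console export.
# Column names and version numbers are hypothetical; adapt to your console's export.
import pandas as pd

LATEST_CONTENT = 11025   # assumed: latest content version published by the vendor
LATEST_ENGINE = 66       # assumed: latest engine version

eps = pd.read_csv("endpoints.csv")

pct_policy_current = 100 * eps["policy_current"].mean()
pct_content_ok = 100 * (eps["content_version"] >= LATEST_CONTENT - 4).mean()
pct_engine_ok = 100 * (eps["engine_version"] >= LATEST_ENGINE - 1).mean()

print(f"% endpoints with up-to-date policies: {pct_policy_current:.1f}")
print(f"% endpoints with content within n-4:  {pct_content_ok:.1f}")
print(f"% endpoints with engines within n-1:  {pct_engine_ok:.1f}")

# Drill-down: which systems are out of compliance and need remediation?
out_of_policy = eps.loc[~eps["policy_current"], ["hostname", "last_seen"]]
print(out_of_policy)
```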

These three metrics are good point-in-time snapshots. To really understand your environment, we recommend tracking them over time, i.e. collecting trends (a simple sketch of trend tracking follows the list below). Ideally your management consoles can be customized to display all of these on one screen, providing you with an at-a-glance view of the health of your organization.

  • 30-day/90-day Trend % of endpoints with up to date policies
  • 30-day/90-day Trend % of endpoints with content within n-4 versions
  • 30-day/90-day Trend % of endpoints with engines within n-1 versions
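Trend tracking can be as simple as appending each day's numbers to a file and reviewing the last 30 or 90 days. A minimal sketch, assuming you feed it the percentages from the daily compliance report:

```python
# Minimal sketch: append today's compliance snapshot and review the recent trend.
import datetime as dt
import pandas as pd

def record_snapshot(pct_policy, pct_content, pct_engine, path="compliance_trend.csv"):
    row = pd.DataFrame([{
        "date": dt.date.today().isoformat(),
        "pct_policy_current": pct_policy,
        "pct_content_ok": pct_content,
        "pct_engine_ok": pct_engine,
    }])
    row.to_csv(path, mode="a", header=False, index=False)

def recent_trend(days=30, path="compliance_trend.csv"):
    cols = ["date", "pct_policy_current", "pct_content_ok", "pct_engine_ok"]
    history = pd.read_csv(path, names=cols, parse_dates=["date"])
    cutoff = history["date"].max() - pd.Timedelta(days=days)
    return history[history["date"] >= cutoff].set_index("date")

# record_snapshot(97.2, 95.8, 99.1)   # example values from a daily compliance report
# print(recent_trend(30))             # plot or eyeball this for sudden drops
```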

Tracking trends is an effective way to spot operational anomalies in your environment and remediate them before they become institutional issues. For example, we worked with an organization that outsourced their Internet firewall management. When the supplier "cleaned up" some rules, they inadvertently disconnected the organization's endpoints from the local management console, which also prevented propagation of policies. Had the endpoint security team not been tracking these metrics, they would have only learned about the changes upon encountering an operational issue, or worse, a cyber incident.

MetricsMonday (009) – Endpoint Security (Part 1)

Let's now tackle another pillar of cybersecurity, endpoint security.

We'll leave the debate over antivirus (AV), antimalware (AM), endpoint protection (EP), next-gen antivirus (NGAV), and next-gen endpoint protection (NGEP) for another day, when we can play acronym bingo. We're going to lump all of these apps, which serve, through signatures, behavioral rules, machine learning (ML), or even artificial intelligence (AI), to protect endpoints from malicious apps and activity, into one basket.

Regardless of your organization's size, the first order of business is what I like to call Operational Compliance. Basically, we are ensuring that all of your endpoints are protected by the correct apps and are reporting to central management in a timely manner.

In terms of coverage, you should strive for nothing less than 100%, though 95% or higher is a good pass in my books. Of course, 95% coverage in a 100-device organization can be easily remediated, but in a 1,000-device organization, 50 non-compliant systems starts to be more daunting to handle. As the saying goes, your mileage may vary, all depending upon your organization's risk appetite.

We've already established that knowing the number of assets in your organization is key; we're just building on that principle. That said, there are times when a system may need your AV/EP temporarily removed or disabled for troubleshooting purposes. Application vendors do tend to blame security for most issues, don't they? So we should allow for those exceptions. Exceptions, though, should not linger and become the norm. (A sketch for computing the metrics below follows the list.)

  • % Endpoint Coverage (deployed / (total systems - exceptions))
  • # of Exceptions or better yet # of Exceptions older than two weeks
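A rough sketch of those two metrics, assuming an asset inventory export with hypothetical fields hostname, av_deployed (true/false), and exception_granted (a date, blank when no exception exists):

```python
# Minimal sketch: endpoint coverage and aging exceptions from an asset inventory export.
# Field names are hypothetical; adapt to your own inventory and exception register.
import pandas as pd

assets = pd.read_csv("asset_inventory.csv", parse_dates=["exception_granted"])

exceptions = assets[assets["exception_granted"].notna()]
in_scope = assets[assets["exception_granted"].isna()]

# deployed / (total systems - exceptions)
coverage = 100 * in_scope["av_deployed"].sum() / len(in_scope)
stale = exceptions[
    exceptions["exception_granted"] < pd.Timestamp.now() - pd.Timedelta(weeks=2)
]

print(f"% endpoint coverage: {coverage:.1f}")
print(f"# of exceptions: {len(exceptions)}")
print(f"# of exceptions older than two weeks: {len(stale)}")
```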

Now let's consider versions. Just like any application, your AV/EP will require periodic updating, patching, or upgrading. And just like any other application, you should be running recent (the latest n, or n-1) versions, excluding any compatibility issues. A pie chart is a good way to visually and quickly understand the state of your environment. If your AV/EP tools utilize several modules, you'll need to duplicate these efforts for each of them, as well as for any separate management agent your systems require. For larger organizations, segmenting this data by business unit or asset type can be helpful in order to direct resources for investigation.

  • # Deployed versions of AV/EP
  • # Deployed versions of AV/EP by business unit or asset type (servers, workstations, laptops)

Last is where the proverbial rubber meets the road. It does us no good to deploy software if we can't ensure that it is operating normally, has the latest policies and settings, and has reported back to the management console. Generally speaking, I like to ensure that all endpoints have checked in at least once a week. Exceptions such as systems being out of the office can be easily managed simply by bringing them online; you would be doing this anyway as part of your patch management process, right? (A minimal sketch for this metric follows below.)

  • % of assets with successful communication within n-days
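A minimal sketch of the check-in metric, assuming a console export with hypothetical hostname and last_seen columns per endpoint:

```python
# Minimal sketch: % of assets that have checked in within the last n days.
# Column names are hypothetical; adapt to your management console's export.
import pandas as pd

N_DAYS = 7  # "at least once a week"

endpoints = pd.read_csv("endpoints.csv", parse_dates=["last_seen"])
cutoff = pd.Timestamp.now() - pd.Timedelta(days=N_DAYS)

pct_recent = 100 * (endpoints["last_seen"] >= cutoff).mean()
stale = endpoints.loc[endpoints["last_seen"] < cutoff, ["hostname", "last_seen"]]

print(f"% of assets communicating within {N_DAYS} days: {pct_recent:.1f}")
print(stale.sort_values("last_seen"))  # candidates to bring online or investigate
```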

Your AV/EP architecture and management console will largely dictate how easily all of this information is gathered and reported. While automatic/scheduled export and delivery is ideal, at the very least make sure you can easily extract the information if manual effort is required. As the expression goes, your mileage may vary.

Cybersecurity Preparedness

Ivan near peak intensity west of Jamaica on September 11, 2004. NASA image courtesy Jacques Descloitres, MODIS Land Rapid Response Team at NASA GSFC. Public domain, via Wikimedia Commons

It's 1 July, the official start of Hurricane Season. Most organizations will have been reviewing, and hopefully testing, their Business Continuity and Disaster Recovery (BCDR) procedures. Annual BCDR exercises are generally considered a sound practice - for events that have a predictable timeframe and ample lead time.

But what about a cybersecurity incident/event? Can these not happen 24x7x365?

How often do you test your cybersecurity controls? Are your tools and policies actually effective? Are there any gaps?

How often do you test your technical incident response plan? Do all of your staff understand and can they fully execute your incident response playbook?

How often do you test your procedural incident response plan? In the event of a material breach, does your organization know who (regulators/customers) should be notified? Do you know how they should be notified? Do you have a statement prepared in advance?

Do you see the issue? The overall risk of a cybersecurity incident is substantially higher than that of a hurricane, and yet most organizations don't devote as much effort and resources to preparedness.

The following approach works well for most customers.

  • Quarterly Security Validation - This focuses on your technology implementation (e.g. AV, IPS, EDR, policies) and consequently also tests your technical incident response plan.
  • Annual Tabletop Exercise - This focuses more on your business-level efforts, i.e. management and the board of directors, and should also include members of your legal and communications teams.

- From the desk of the CTO

MetricsMonday (008) – Vulnerabilities (Part 3)

Let's bring this topic home and cover what we want to do about vulnerabilities, because we are going to do something, right? We patch, remediate, and mitigate in order to reduce the exploitability of the asset in question.

Ideally your business and/or asset owners should be indicating how long they are willing to tolerate being exposed. Turning cybersecurity into a business decision is a bigger discussion for another day, so let's seed this discussion with a 30-day window. Why 30 days? Simply because we are all very used to the cadence of Patch Tuesday: Microsoft, Adobe, Oracle, and a few others' regularly scheduled release of updates. If we can patch our systems within 30 days, we don't have to deal with the complications of overlapping updates. Don't forget that many vendors have their own update cadence and that many vendors may release out-of-band updates to address more critical issues.

The typical small to mid-sized enterprise (SME) that operates 9x5 should be able to adhere to the 30-day target. For all others, you may have to set different targets depending upon the type of asset. For example, you may choose to allow non-critical assets to be patched within 45 days. See previous posts regarding asset categories.

For now let's stick with 30 days for all assets (a rough sketch for computing the metrics below follows the list).

  • Average # of days to patch Critical assets
  • Average # of days to patch non-Critical assets
  • % of Critical assets patched within 30-days
  • % of non-Critical assets patched within 30-days
  • # of assets with exceptions
  • # of assets with exceptions over 90-days
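As a rough sketch of the timeliness metrics, assuming a per-finding export with hypothetical columns asset_criticality, published_date, and patched_date (blank while still open):

```python
# Minimal sketch: patch-timeliness metrics against a 30-day target.
# Column names are hypothetical; adapt to your vulnerability/patch management export.
import pandas as pd

TARGET_DAYS = 30

findings = pd.read_csv("findings.csv", parse_dates=["published_date", "patched_date"])
patched = findings.dropna(subset=["patched_date"]).copy()
patched["days_to_patch"] = (patched["patched_date"] - patched["published_date"]).dt.days

is_critical = patched["asset_criticality"] == "Critical"
for crit, group in patched.groupby(is_critical):
    label = "Critical assets" if crit else "non-Critical assets"
    pct_in_target = 100 * (group["days_to_patch"] <= TARGET_DAYS).mean()
    print(f"{label}: avg {group['days_to_patch'].mean():.1f} days to patch, "
          f"{pct_in_target:.1f}% within {TARGET_DAYS} days")
```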

MetricsMonday (007) – Vulnerabilities (Part 2)

Yes, that is a strong password, but the sticky note needs to be hidden under the computer!

In our previous post, we determined that we need to organize our assets based upon their context. With that in mind, let's consider what vulnerabilities matter to us.

The obvious place to start is the Common Vulnerability Scoring System (CVSS). Taking into account factors such as attack vector, complexity, privileges, and user interaction, CVSS provides a standardized way to assess the severity of security weaknesses. Sounds great, right? Before you answer, consider the real-world context. Does a Critical vulnerability on a trivial asset, let's say an intern's laptop, matter as much as a Medium vulnerability on your mission-critical communications server? Ceteris paribus, eventually yes, that laptop is concerning, but probably not in the immediate future.

Obsolete, end-of-life, or end-of-support software is its own class of vulnerability. In most cases, the vendors no longer offer support or updates, so your only recourse is to upgrade, seek an alternative, or uninstall.

Another significant class of vulnerabilities are those that are known to be exploited. These are worth tracking anywhere in your organization. The U.S. Cybersecurity and Infrastructure Security Agency (CISA) is one of several organizations that maintain a list of Known Exploited Vulnerabilities (KEV).

The last class of vulnerabilities to consider at this time are those with no remediation. Note that I did not specify a patch; remember that some vulnerabilities are simply misconfigurations, such as a default password left in operation. The lack of a remedy could simply be because a fix has not yet been developed. Or worse, a remediation might be incompatible with the system or might create other problems, such as performance issues. In either case we're dealing with vulnerabilities with no solution in sight.

In summary, so far we're working with:

  • Severity
  • End-of-life/end-of-support software
  • Known exploited
  • Vulnerabilities with no remediation or mitigation

Now let's include some context, and we have the following to get started (a rough sketch for flagging these follows the list):

  • Known Exploited Vulnerabilities for any asset or group
  • High-severity for Critical systems
  • Rated vulnerabilities for all non-Critical systems
  • Any severity above Informational (rated) for Internet-facing systems
  • End-of-life/end-of-support software by business unit
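A minimal sketch of how those context rules might be expressed, assuming a findings export with hypothetical fields for severity, known-exploited status, and the asset tags from the previous post (asset_critical, internet_facing, end_of_life):

```python
# Minimal sketch: flag findings that match the context rules above.
# Field names (severity, known_exploited, asset tags) are hypothetical.
import pandas as pd

findings = pd.read_csv("findings.csv")

rules = {
    "Known exploited (any asset)": findings["known_exploited"],
    "High severity on Critical systems":
        findings["severity"].isin(["High", "Critical"]) & findings["asset_critical"],
    "Rated severity on Internet-facing systems":
        (findings["severity"] != "Informational") & findings["internet_facing"],
    "End-of-life/end-of-support software": findings["end_of_life"],
}

for name, mask in rules.items():
    print(f"{name}: {int(mask.sum())} findings")

# Anything matching at least one rule goes to the top of the remediation queue.
priority = findings[pd.concat(list(rules.values()), axis=1).any(axis=1)]
print(f"Total prioritized findings: {len(priority)}")
```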

Next week, we bring this topic home when we also consider remediation/mitigation efforts.

#MetricsMonday (006) – Vulnerabilities (Part 1)

There are two critical vulnerabilities in the image, can you spot them?

Stated simply, vulnerabilities are weaknesses that attackers can exploit to gain unauthorized access or cause harm. Mitigating a vulnerability usually entails patching, updating, reconfiguring, or applying a compensating control. Sometimes, though, mitigation may not be possible due to the lack of a patch or because the patch might be incompatible with other parts of the system.

But before we can discuss measuring vulnerabilities, we need to really understand where we are measuring them. Is uniformly measuring all assets (devices, systems, operating systems, applications, etc) appropriate? If our organization only consisted of five laptops, all running the same software for users to perform the same work, maybe. But for any reasonably sized organization, a server has greater business value than a single user's desktop. The CEO's laptop is going to have greater business value (operationally) than a receptionist's desktop. And for a final example, a public-facing system will be of greater value than a test system. In other words we must establish levels of criticality or importance to business functions.

Here are some examples of asset categories that will help to define our vulnerability metrics, keeping in mind that an asset might belong to several categories simultaneously.

  • Critical vs non-critical
  • Tier-1 (production) vs Tier-2 (supporting) vs Tier-3 (test/development)
  • Internet-facing
  • Contains sensitive data e.g. customer or financial
  • VIP users: CEO, CFO, HR managers, i.e. high-value targets
  • Business unit

Thinking ahead, once you apply your policies and processes to these asset groups, your work is simplified to managing group membership as assets are commissioned or decommissioned.

In reference to this post's image, the first vulnerability should be obvious: the zip tie. The second is the Master lock; while wildly popular and mainstream, these are some of the easiest locks to defeat.

#MetricsMonday (005) – Who has admin rights?

Administrative or privileged accounts are the holy grail for threat actors because they are the proverbial and literal keys to the kingdom.

Since Windows Active Directory is the most popular network operating system, we'll focus our efforts on domain environments.

For IT administrators of a certain age, there are certain hard-to-break habits that persist. These include granting end users local administrator rights, making certain users (e.g. managers) Domain Admins, and, most egregious in my opinion, making their own user accounts Domain Admins.

This can be quite an expansive topic so we're going to focus on certain fundamentals to get the proverbial party started:

  • Set aside the default Domain Administrator account with a strong password kept under lock and key
  • Minimize privileged account sprawl
  • Enforce separate user and admin accounts for IT staff
  • Require multifactor authentication (MFA) for all privileged accounts
  • Monitor for and alert on undesirable privileged account activity
  • Monitor for and alert on privileged user group changes

Minimize the following key metrics for best results (a rough sketch for computing a few of them follows the list):

  • # of accounts with administrative permissions
  • # of privileged accounts without MFA enabled
  • # of privileged accounts with passwords older than 1-year (your mileage may vary)
  • # of inactive privileged accounts i.e. with no logon in last 30-days
  • Frequency that the default Administrator account has been used
  • Frequency that privileged user groups have been changed
  • Frequency of privileged account failed logins, lockouts, unlocks, and password resets
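A minimal sketch of a few of these, assuming you can export your privileged accounts (for example from Active Directory or your identity provider) to a CSV with hypothetical columns sam_account_name, mfa_enabled, password_last_set, and last_logon:

```python
# Minimal sketch: privileged-account hygiene metrics from an exported account list.
# Column names are hypothetical; build the export from AD or your IdP's reporting.
import pandas as pd

admins = pd.read_csv("privileged_accounts.csv",
                     parse_dates=["password_last_set", "last_logon"])
now = pd.Timestamp.now()

no_mfa = admins[~admins["mfa_enabled"]]
old_passwords = admins[admins["password_last_set"] < now - pd.Timedelta(days=365)]
inactive = admins[admins["last_logon"] < now - pd.Timedelta(days=30)]

print(f"# of accounts with administrative permissions: {len(admins)}")
print(f"# of privileged accounts without MFA enabled: {len(no_mfa)}")
print(f"# with passwords older than 1 year: {len(old_passwords)}")
print(f"# inactive (no logon in last 30 days): {len(inactive)}")
```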

We could go on and on with regards to auditing. Seriously we could go on and on, and will do so at a later time. For now, this should get you started on the straight and narrow.

Image Source: Adobe Firefly Generative AI

#MetricsMonday (004) – What’s Running?

Now that we can measure what's connected to our organization, let's see what's running (installed). As with the previous posts, we're going to initially focus on our local systems.

Consider what is running in your environment. The obvious things are productivity applications such as MS Office, collaboration software, and web browsers. Speaking of web browsers, what about plug-ins and extensions? Also consider any hardware-enabling drivers, their supporting apps, and of course all of your security software. You're probably thinking that this list is getting big.

But wait, there's more! The two most important bits of software have yet to be mentioned: the computers' operating systems (OS) and firmware (BIOS). The OS probably just slipped your mind but you probably didn't consider the BIOS. Without a working BIOS, your computer is just a mess of metal and electronic circuits. It is the firmware which turns that pile of stuff into a computer, and enables the OS to load and run. And yes, you really need to manage the firmware along with everything else. Don't worry though, there is an app for that!

Let's recap the various bits of software that we should be measuring:

  • BIOS/firmware
  • Operating Systems
  • Drivers and hardware enablers
  • Applications
  • Application add-ons e.g. Browser Helper Objects

The more versions and variations of these, the greater the risk from misconfigurations, vulnerabilities, and exploitation, and the greater the effort and time required to manage them. Therefore we want to have as few of these as possible in order for the business to function, i.e. establish a common operating environment (COE).

A common operating environment's benefits include but are not limited to:

  • Increased efficiency and productivity
  • Reduced costs
  • Improved collaboration and communication
  • Enhanced security and compliance

In terms of metrics, here are some to get you started. For simplicity, we'll refer to all items in the previous list as apps. Minimize these for best results, and there are bonus points for having them broken down by business unit. (A rough sketch for deriving a few of these from an inventory export follows the list.)

  • # of different app versions
  • # of end-of-life/end-of-support apps
  • # of unauthorized / non-COE apps
  • # of authorized / COE apps not used in the last n-months
  • # of system deviations from the COE standard
  • # of systems with COE exceptions or extensions
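A rough sketch of a few of these, assuming a software inventory export with hypothetical columns app_name, app_version, end_of_life, coe_approved, and business_unit:

```python
# Minimal sketch: COE/app-sprawl metrics from a software inventory export.
# Column names are hypothetical; adapt to your inventory tool's export.
import pandas as pd

inventory = pd.read_csv("software_inventory.csv")

version_count = inventory.groupby("app_name")["app_version"].nunique()
eol_apps = inventory[inventory["end_of_life"]]
non_coe = inventory[~inventory["coe_approved"]]

print(f"# of different app versions: {int(version_count.sum())}")
print(f"# of end-of-life/end-of-support apps: {eol_apps['app_name'].nunique()}")
print(f"# of unauthorized / non-COE apps: {non_coe['app_name'].nunique()}")

# Bonus points: break the sprawl down by business unit
print(non_coe.groupby("business_unit")["app_name"].nunique())
```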

Image Source: Adobe Firefly Generative AI