Global IT Outage – CrowdStrike
On July 19, 2024 I posted on X:
Massive GLOBAL IT outage on Windows devices traced back to a BAD PATCH issued by CrowdStrike which caused a domino effect… (NOT a cybersecurity event) CrowdStrike holds 24% of the market – that is why some businesses were affected and others were not…
The primary reason I posted is that as the story was hitting the United States news cycle that morning, there was a lot of finger-pointing at Microsoft for the outage. I had been up since 2:30am watching and listening as the outage unfolded across Australia, Asia and Europe for a few hours already. The cause was already known, and frankly I was a bit annoyed at the initial US coverage.
On Sunday (yesterday), 7/21, I ran across a GREAT video explaining why and how this happened, published by Dave Plummer on his Dave’s Garage YouTube channel. In it he clearly, efficiently and in plain language (English) explains the situation…
Two things are key to remember: 1) the obvious one, that we need to make sure patches are well tested before releasing them, and more importantly 2) I think we will need some changes to the architecture of tools like CrowdStrike so this does not happen again.
The key points of his video are:
- The CrowdStrike issue caused widespread blue screens (system crashes) due to a bad software update.
- CrowdStrike’s Falcon sensor operates as a kernel-mode driver, giving it complete access to system data structures and services.
- Kernel-mode code crashes cause system-wide crashes (blue screens), unlike user-mode application crashes.
- CrowdStrike uses dynamic definition files to update their driver without going through the Windows Hardware Quality Labs (WHQL) certification process each time.
- The issue stemmed from processing dynamic definition files containing executable code, not just data, which the driver executed without proper validation.
- The crash was likely caused by the driver attempting to process an update file (CY file) that contained all zeros instead of the expected data.
- The specific error involved dereferencing a null pointer, indicating inadequate error checking and parameter validation in the driver.
- CrowdStrike’s driver is marked as a boot driver, making it difficult to boot into safe mode to fix the issue.
- A temporary fix involves deleting the problematic update file (matching the pattern C-00000291*.sys) from the Windows\System32\drivers\CrowdStrike folder.
- The incident highlights the risks of running untrusted code in kernel mode and the importance of robust error handling in driver development.
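To make the point about validation concrete, here is a minimal sketch of the kind of sanity check a parser could run before trusting an update file. It is only an illustration in user-mode Python: the file layout (a 4-byte magic value, a 4-byte length, then the payload) and the names `MAGIC`, `InvalidDefinitionFile` and `validate_definition` are my own assumptions and are not CrowdStrike's real format or code.

```python
import struct

# Hypothetical layout, for illustration only: 4-byte magic, 4-byte
# little-endian payload length, then the payload itself.
# This is NOT CrowdStrike's real file format.
MAGIC = b"DEF1"
HEADER = struct.Struct("<4sI")


class InvalidDefinitionFile(ValueError):
    """Raised when a definition file fails a sanity check."""


def validate_definition(blob: bytes) -> bytes:
    """Return the payload if the blob passes basic checks, else raise."""
    if len(blob) < HEADER.size:
        raise InvalidDefinitionFile("file is shorter than the header")
    if not any(blob):
        # Every byte is zero: the failure mode described in the video.
        raise InvalidDefinitionFile("file contains only zero bytes")
    magic, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise InvalidDefinitionFile("unexpected magic value")
    payload = blob[HEADER.size:]
    if length != len(payload):
        raise InvalidDefinitionFile("declared length does not match payload")
    return payload
```

The idea is simply that a file of all zeros is rejected with a clear error before any of its contents are interpreted, instead of being handed to code that assumes it is well formed.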
CrowdStrike blog posts with Analysis Report and remediation steps can be found at:
- https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
- https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
- YouTube video with remediation steps: https://www.youtube.com/watch?v=Bn5eRUaMZXk
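For readers who want to see what the manual workaround amounts to, here is a small sketch. CrowdStrike's guidance gives the file pattern as C-00000291*.sys in the Windows\System32\drivers\CrowdStrike folder. The function names and the `dry_run` switch are my own, and this is purely illustrative: on an affected machine the deletion was done by hand from Safe Mode or the Windows Recovery Environment, where Python is unlikely to be available.

```python
from pathlib import Path

# Pattern and folder as given in CrowdStrike's remediation guidance.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def find_channel_files(directory: Path = DEFAULT_DIR,
                       pattern: str = CHANNEL_FILE_PATTERN) -> list[Path]:
    """List regular files in `directory` whose names match `pattern`."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_channel_files(directory: Path = DEFAULT_DIR,
                         dry_run: bool = True) -> list[Path]:
    """Delete matching files, or only report them when dry_run is True."""
    matches = find_channel_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_channel_files()` with the default `dry_run=True` only lists what would be deleted, which is a sensible first step before removing anything from a drivers folder.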
His video, well worth the watch, follows (very well done, Dave!):
Original Video: 7/21/2024
UPDATE VIDEO From Dave on 7/24/2024
The AI-generated (thank you, Claude) summary of the video follows (please watch the video!):
Last week, a widespread IT outage caused by CrowdStrike’s Falcon security software left many organizations scrambling to restore their systems. As a retired Windows developer, I’d like to shed some light on what happened and why it’s a cautionary tale for software companies developing kernel-mode drivers.
CrowdStrike’s Falcon is a security product that operates at the kernel level of Windows systems. Unlike typical applications that run in user mode, kernel-mode drivers have unrestricted access to system resources. This level of access is necessary for security software to monitor and protect against threats effectively. However, it comes with a significant risk: if a kernel-mode driver crashes, it takes down the entire system with it.
The recent outage was caused by a faulty update to CrowdStrike’s software. The company uses dynamic definition files to update their driver without going through the time-consuming Windows Hardware Quality Labs (WHQL) certification process for each update. This approach allows for rapid response to new threats but also introduces potential vulnerabilities.
In this case, it appears that an update file containing all zeros instead of valid data was distributed. When the CrowdStrike driver attempted to process this file, it led to a critical error – specifically, trying to dereference a null pointer. This type of error in kernel mode results in the infamous “blue screen of death” and system crash.
What makes this incident particularly problematic is that CrowdStrike’s driver is marked as a boot driver, meaning it loads very early in the Windows startup process. This designation made it challenging for affected systems to boot even into safe mode, complicating recovery efforts.
The situation highlights several critical points for software developers and IT professionals:
- The importance of rigorous error checking and parameter validation, especially in kernel-mode code.
- The risks associated with bypassing standard certification processes for driver updates.
- The need for robust fallback mechanisms in critical system components.
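As one way to picture a fallback mechanism, the sketch below tries a new update first and falls back to the previous known-good one if parsing fails. It is a hedged illustration of the control flow only: `load_with_fallback` and its `parse` argument are hypothetical names, nothing here reflects how Falcon actually works, and real kernel-mode code cannot lean on exceptions this way.

```python
from typing import Callable, Optional, Tuple


def load_with_fallback(candidate: bytes,
                       last_known_good: Optional[bytes],
                       parse: Callable[[bytes], object]) -> Tuple[str, object]:
    """Try the new update first, then the previous one, then give up safely.

    Returns a (source, parsed) pair where source is "candidate",
    "last_known_good", or "disabled" when nothing could be loaded.
    """
    attempts = (("candidate", candidate), ("last_known_good", last_known_good))
    for label, blob in attempts:
        if blob is None:
            continue
        try:
            return label, parse(blob)
        except Exception:
            # A bad file must not take the whole component down;
            # move on to the next option instead.
            continue
    return "disabled", None
```

The design choice worth noting is the last line: when neither file can be loaded, the component reports itself as disabled instead of crashing, which trades some protection for a system that still boots.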
For those affected by this outage, a temporary fix involved manually deleting the problematic update file from the Windows\System32\drivers\CrowdStrike folder. However, this incident serves as a stark reminder of the delicate balance between security, performance, and system stability.
As we increasingly rely on complex security software to protect our digital assets, it’s crucial to remember that these tools themselves can become points of failure if not developed and maintained with the utmost care. The CrowdStrike outage is a wake-up call for the industry to reevaluate practices around kernel-mode driver development and updates.
Moving forward, we must prioritize not just the speed of security updates but also their reliability and potential system-wide impacts. Only by doing so can we ensure that the very tools designed to protect us don’t end up being our biggest vulnerability.
End of AI Summary