Inside the 78 minutes that took down millions of Windows machines


On Friday morning, shortly after midnight in New York, disaster started to unfold around the world. In Australia, shoppers were met with Blue Screen of Death (BSOD) messages at self-checkout aisles. In the UK, Sky News had to suspend its broadcast after servers and PCs started crashing. In Hong Kong and India, airport check-in desks began to fail. By the time morning rolled around in New York, millions of Windows computers had crashed, and a global tech disaster was underway.

In the early hours of the outage, there was confusion over what was going on. How were so many Windows machines suddenly showing a blue crash screen? “Something super weird happening right now,” Australian cybersecurity expert Troy Hunt wrote in a post on X. On Reddit, IT admins raised the alarm in a thread titled “BSOD error in latest CrowdStrike update” that has since racked up more than 20,000 replies.

The problems led major airlines in the US to ground their fleets and left workers across banks, hospitals, and other major institutions in Europe unable to log in to their systems. And it quickly became apparent that it was all due to one small file.

At 12:09AM ET on July 19th, cybersecurity company CrowdStrike released a faulty update to the Falcon security software it sells to help companies prevent malware, ransomware, and any other cyber threats from taking down their machines. It’s widely used by businesses for important Windows systems, which is why the impact of the bad update was so immediate and felt so broadly.

CrowdStrike’s update was supposed to be like any other silent update, automatically providing the very latest protections for its customers in a tiny file (just 40KB) that’s distributed over the web. CrowdStrike issues these regularly without incident, and they’re fairly common for security software. But this one was different. It exposed a massive flaw in the company’s cybersecurity product, a catastrophe that was only ever one bad update away — and one that could have been easily avoided.

How did this happen?

CrowdStrike’s Falcon protection software operates in Windows at the kernel level, the core part of an operating system that has unrestricted access to system memory and hardware. Most other apps run in user mode and don’t need or get special access to the kernel. CrowdStrike’s Falcon software uses a special driver that allows it to run at a lower level than most apps so it can detect threats across a Windows system.

Running at the kernel makes CrowdStrike’s software far more capable as a line of defense — but also far more capable of causing problems. “That can be very problematic, because when an update comes along that isn’t formatted in the correct way or has some malformations in it, the driver can ingest that and blindly trust that data,” Patrick Wardle, CEO of DoubleYou and founder of the Objective-See Foundation, tells The Verge.

Kernel access makes it possible for the driver to create a memory corruption problem, which is what happened on Friday morning. “Where the crash was occurring was at an instruction where it was trying to access some memory that wasn’t valid,” Wardle says. “If you’re running in the kernel and you try to access invalid memory, it’s going to cause a fault and that’s going to cause the system to crash.”
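To make that concrete, here is a minimal user-space sketch in C of the failure mode Wardle is describing: a value pulled from untrusted input gets treated as a pointer and dereferenced without any validation. The address and the scenario are invented for illustration, and none of this is CrowdStrike’s actual code. In an ordinary process, the operating system catches the fault and kills only that process; inside a kernel-mode driver, there is nothing above it to absorb the fault, so Windows halts the entire machine with a blue screen.

```c
/* Illustrative only (not CrowdStrike's code): a Unix-style user-space demo
 * of an invalid memory access, the kind of fault that in kernel mode
 * becomes a Blue Screen of Death. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void on_segv(int sig)
{
    (void)sig;
    /* In user mode the damage stops here: one process dies, the OS keeps running. */
    const char msg[] = "invalid memory access caught; only this process is terminated\n";
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main(void)
{
    signal(SIGSEGV, on_segv);

    /* Pretend this value was read out of a malformed content update and
     * then blindly treated as a pointer, with no validation. */
    const int *untrusted = (const int *)(uintptr_t)0x9c;  /* made-up invalid address */

    printf("dereferencing untrusted pointer %p...\n", (const void *)untrusted);
    printf("%d\n", *untrusted);  /* faults here; the handler above fires instead */
    return 0;
}
```

The signal handler stands in for the safety net the OS gives ordinary processes; a kernel driver gets no equivalent second chance, which is why the same class of bug can take the whole machine down.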

CrowdStrike spotted the issue quickly, but the damage was already done. The company issued a fix 78 minutes after the original update went out. IT admins tried rebooting machines over and over, and some came back online if they managed to pull down the fix over the network before CrowdStrike’s driver crashed the server or PC again. But for many support workers, the fix has involved manually visiting the affected machines and deleting CrowdStrike’s faulty content update.

While investigations into the CrowdStrike incident continue, the leading theory is that a bug in the driver had been lying dormant for some time. The driver may not have been properly validating the data it read from content update files, but that flaw never surfaced until Friday’s problematic content update.

“The driver should probably be updated to do additional error checking, to make sure that even if a problematic configuration got pushed out in the future, the driver would have defenses to check and detect… versus blindly acting and crashing,” says Wardle. “I’d be surprised if we don’t see a new version of the driver eventually that has additional sanity checks and error checks.”
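Wardle’s suggestion amounts to defensive parsing: treat every content update as untrusted input and verify its structure before acting on it. The sketch below shows what those sanity checks could look like in C; the file format, field names, and magic value are all invented for the sake of the example and don’t reflect CrowdStrike’s actual driver or update format.

```c
/* Hypothetical sanity checks on a content update, sketched in C. The
 * header layout and magic value are invented; the point is the pattern:
 * validate everything the file claims about itself before trusting any
 * offset or count inside it, and reject the update instead of crashing. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CONTENT_MAGIC 0x43464446u  /* made-up format tag */

struct content_header {
    uint32_t magic;         /* identifies the file type */
    uint32_t record_count;  /* how many detection records follow */
    uint32_t record_size;   /* size of each record in bytes */
};

/* Returns true only if the header is internally consistent with the
 * number of bytes actually received. */
static bool content_update_is_sane(const uint8_t *buf, size_t len)
{
    struct content_header hdr;

    if (len < sizeof(hdr))
        return false;                     /* too short to even hold a header */
    memcpy(&hdr, buf, sizeof(hdr));

    if (hdr.magic != CONTENT_MAGIC)
        return false;                     /* wrong, truncated, or corrupted file */
    if (hdr.record_size == 0)
        return false;                     /* later math would index nonsense */
    if (hdr.record_count > (len - sizeof(hdr)) / hdr.record_size)
        return false;                     /* header promises more data than exists */

    return true;
}

int main(void)
{
    /* A stand-in for a malformed ~40KB update: here, a buffer of zeros.
     * Every check above rejects it before anything dereferences bad data. */
    static uint8_t malformed[40 * 1024];

    printf("malformed update accepted? %s\n",
           content_update_is_sane(malformed, sizeof(malformed)) ? "yes" : "no");
    return 0;
}
```

Rejecting a bad file is a recoverable error: a driver can fall back to its last known-good content and log the failure, which is the kind of “check and detect” behavior Wardle describes, rather than blindly acting and crashing.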

CrowdStrike should have caught this issue sooner. It’s a fairly standard practice to roll out updates gradually, letting developers test for any major problems before an update hits their entire user base. If CrowdStrike had properly tested its content updates with a small group of users, then Friday would have been a wake-up call to fix an underlying driver problem rather than a tech disaster that spanned the globe.
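For illustration, here is one way a staged rollout gate can be sketched in C. None of this reflects how CrowdStrike actually distributes content; it just shows the canary idea: hash a stable machine identifier into a bucket from 0 to 99 and deliver the new content only when that bucket falls under the current rollout percentage, which climbs toward 100 only if the early cohort stays healthy.

```c
/* Hypothetical staged-rollout gate, for illustration only. A stable hash
 * of the machine ID picks a bucket from 0-99; the update is applied only
 * when that bucket is below the current rollout percentage, so a bad
 * update hits a small canary slice of the fleet before everyone else. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a: small, stable hash so a given machine always lands in the same bucket. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

static bool update_enabled_for(const char *machine_id, unsigned rollout_pct)
{
    return fnv1a(machine_id) % 100u < rollout_pct;
}

int main(void)
{
    /* Made-up host names standing in for a fleet. */
    const char *hosts[] = { "pos-sydney-0042", "checkin-hkg-17", "newsroom-03" };
    unsigned stage = 1;  /* start with roughly 1% of machines, not 100% */

    for (int i = 0; i < 3; i++)
        printf("%-16s -> %s\n", hosts[i],
               update_enabled_for(hosts[i], stage) ? "gets new content" : "waits for wider rollout");
    return 0;
}
```

If the canary machines start blue-screening, the percentage never advances, and the blast radius stays at a handful of hosts instead of millions.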