ELI5: The CrowdStrike Outage

What actually happened from a technical perspective.

In what was probably the biggest IT failure in the history of humanity, computers across the globe stopped working over the weekend because of an errant software update. Planes were grounded, banks were offline, machines in hospitals stopped working… you get the gist. There have been endless articles analyzing what this means and why it’s a big deal. Instead of that, this post will explain what actually happened from a technical perspective, in simple, easy-to-understand language.

The outage itself: what went wrong

Let’s first focus on what actually happened. What broke?

Let’s start from the basics. Every device on the planet that runs software – whether it’s your phone, a laptop, a large server in the cloud, a TV at an airport, or a teleprompter in a studio – is built on an operating system. An operating system is the mastermind behind computers: it orchestrates all of the behind-the-scenes magic that lets something like Excel work on your laptop, or flight arrival times display on a TV.

The most popular operating system in the world by far is Microsoft Windows. As of last month, more than 70% of desktop systems were running Windows, which may come as a surprise to Mac-loyal readers. Windows is especially popular in commercial applications like factory robots, hospital devices, airplanes, etc. The initial release was 38 years ago, in November 1985.

Now if something goes wrong with the operating system, your device is completely fucked. If the operating system can’t run, the device can’t run; simple as that. And this is exactly what happened with the CrowdStrike outage. A piece of software (we’ll get to which one) messed up the Windows operating systems that it runs on, so computers that use that software couldn’t start at all.

Of all of the types of outages out there, an operating system not being able to boot up is perhaps the most sinister, because the only way to fix it is manually: you have to reboot the computer in a special way that allows you to remove the offending code (more on this later). But many of the people using these devices are not computer experts, and in most cases probably don’t even know what an operating system is. This is part of why it took so long to resolve this outage, and why it required significant hand-holding from CrowdStrike.

Anyway, now that we’ve got an understanding of the symptoms (i.e. what went wrong), we can dive into what caused the outage and who CrowdStrike is.

What does CrowdStrike do?

If you grew up in the ’90s like I did, you probably had something like Norton installed on your computer.

[Image: Norton AntiVirus cover]

Norton was (and is) antivirus software. It sits on your computer and tries to protect it from viruses that hackers and other malicious actors try to spread around the web.

Antivirus software works in some generic ways – like checking for generally suspicious-looking files, or files that it hasn’t seen before. But the main way that antivirus software does its job today is roughly the same way that “Wanted” signs work at the post office. Norton has a team of researchers finding all of the latest vulnerabilities and nasty software that hackers are using (don’t ask how they do this). Whenever they find new stuff, they update the software on your computer to look for that stuff, and to freak out if it finds it. They constantly publish these kinds of updates so your computer knows what to look for.

[Image: wanted poster]
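If you like seeing ideas as code, here is a minimal sketch of the “Wanted” poster approach. It is a toy model, not how Norton or CrowdStrike actually implement detection: the function names (`fingerprint`, `is_wanted`, `apply_update`) and the sample "malware" bytes are made up for illustration, and real products use far richer signatures than a single file hash.

```python
import hashlib

# Hypothetical "Wanted" list: SHA-256 fingerprints of known-bad files.
# The entry below is made up by hashing a toy byte string.
KNOWN_BAD = {
    hashlib.sha256(b"totally-evil-program").hexdigest(),
}


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def is_wanted(data: bytes, known_bad: set[str]) -> bool:
    """True if this file's fingerprint matches a known-bad signature."""
    return fingerprint(data) in known_bad


def apply_update(known_bad: set[str], new_signatures: set[str]) -> set[str]:
    """A signature update just adds new posters to the wall."""
    return known_bad | new_signatures


if __name__ == "__main__":
    new_malware = b"brand-new-evil-program"
    # Before the update, the scanner has no poster for this file.
    print("flagged before update:", is_wanted(new_malware, KNOWN_BAD))
    updated = apply_update(KNOWN_BAD, {fingerprint(new_malware)})
    # After the update, the same file matches a poster.
    print("flagged after update:", is_wanted(new_malware, updated))
```

The point of the sketch is the update step: the scanner only catches what it has a poster for, which is why these companies push new signatures so often.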

This is pretty much exactly what CrowdStrike does, just in 2024 and with better marketing. But instead of selling their software to consumers like you and me, they’re focused on selling it to businesses. 300 of the Fortune 500 use CrowdStrike; it’s incredibly ubiquitous. That’s how this outage ended up affecting so many different types of devices in so many industries: the owners of those devices, from Delta to Chase, were all CrowdStrike customers.

Antivirus and the operating system

In order for modern antivirus software to work, it needs to have permission to live in the deepest recesses of a computer’s operating system. Like, way in there. It’s monitoring all of the little different things an operating system does, so it needs to be able to see those things and stop them if it doesn’t like what it sees. In fact, it probably has the deepest “security clearance” of any piece of software on any computer. It has total control over what, how, and when your operating system runs programs.
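To make “see those things and stop them” a bit more concrete, here is a toy model in Python. Everything in it is hypothetical (`ToyOS`, `register_hook`, `antivirus_hook`, the program names); a real operating system and a real security product are vastly more complicated. The only idea it tries to capture is that the operating system asks the security software for permission before it runs anything.

```python
from typing import Callable

# A hook takes a program name and returns True to allow it, False to block it.
Hook = Callable[[str], bool]


class ToyOS:
    """A toy stand-in for an operating system that consults hooks before launching."""

    def __init__(self) -> None:
        self.hooks: list[Hook] = []
        self.running: list[str] = []

    def register_hook(self, hook: Hook) -> None:
        """Give a piece of security software a say in every launch."""
        self.hooks.append(hook)

    def launch(self, program: str) -> bool:
        """Start a program only if every registered hook allows it."""
        if all(hook(program) for hook in self.hooks):
            self.running.append(program)
            return True
        return False


def antivirus_hook(program: str) -> bool:
    """Hypothetical antivirus check: block anything on a made-up deny list."""
    blocked = {"ransomware.exe"}
    return program not in blocked


if __name__ == "__main__":
    toy = ToyOS()
    toy.register_hook(antivirus_hook)
    print("excel.exe launched:", toy.launch("excel.exe"))
    print("ransomware.exe launched:", toy.launch("ransomware.exe"))
```

Notice that in this toy, every single launch passes through the hook. If the hook itself blows up, nothing gets launched at all, which is a small-scale preview of why bugs in this kind of software are so dangerous.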

This privileged access that antivirus software has to operating systems has long been a topic of controversy. Way back in 2008, a researcher wrote a whitepaper outlining vulnerabilities in antivirus software itself, and how bad things could get given this relationship. For more depth on this topic, I highly recommend Jan Kammerath’s post on the outage.

On Friday, CrowdStrike released a software update that had a new Channel File in it. Remember how antivirus companies have teams researching the latest vulnerabilities? When they find a new one, they update their software to start looking for it. That’s what a CrowdStrike Channel File is: a file configured to detect a specific type of malware. It’s like a new “Wanted” poster for your computer. This one was called something like:

C-00000291.sys

Unfortunately, this file had a major issue: it led CrowdStrike’s software to try to access a piece of data that doesn’t actually exist. This is the kind of bug that developers usually call a Null Pointer Exception: a program tries to find a piece of data, it’s not there, and things go haywire. (CrowdStrike’s later root cause analysis described it more precisely as an out-of-bounds memory read, but the idea is the same.) This is a very common kind of bug that software engineers accidentally overlook all the time, without drastic consequences; usually, the operating system just shuts down the offending program. I’ve encountered many Null Pointer Exceptions in my own code over the years (then again, I’m not a very good developer). Generally, a run-of-the-mill thing.
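Here is roughly what that class of bug looks like in an everyday program. This is a sketch in Python, where the closest analog to a null pointer is `None`; it is not CrowdStrike’s code, and the names (`Record`, `find_record`, `read_name_unsafely`, `read_name_safely`) are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    name: str


def find_record(records: dict[str, Record], key: str) -> Optional[Record]:
    """Return the record for key, or None if it is not there."""
    return records.get(key)


def read_name_unsafely(records: dict[str, Record], key: str) -> str:
    """Assumes the record exists; raises AttributeError when it does not."""
    record = find_record(records, key)
    return record.name  # blows up if record is None


def read_name_safely(records: dict[str, Record], key: str, default: str = "unknown") -> str:
    """Checks for the missing record before touching it."""
    record = find_record(records, key)
    if record is None:
        return default
    return record.name


if __name__ == "__main__":
    records = {"a": Record("alice")}
    print(read_name_safely(records, "missing"))
    try:
        read_name_unsafely(records, "missing")
    except AttributeError as err:
        # In an ordinary program, the damage stops here.
        print("the program hit the bug:", err)
```

In a normal application, the worst case is that this one program crashes and you restart it. The difference between the two functions is a single check for "is the thing actually there?".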

But in this case, it wasn’t run of the mill at all. Precisely because CrowdStrike has such an intimate relationship with the operating system, this failure in its software broke the entire operating system. Once this messed up file got sent to your system in the software update, you couldn’t even start your Windows machine in the first place. Hence, the dreaded screen of doom:

[Image: blue screen error]
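One way to picture the difference between an ordinary crash and this one is with a small Python sketch. It is an analogy, not a model of Windows internals: the parent process plays the role of the operating system, and the child process plays an ordinary buggy program. The names `BUGGY_PROGRAM` and `run_isolated` are made up for the example.

```python
import subprocess
import sys

# A tiny program that touches data that is not there
# (None is Python's closest analog to a null pointer).
BUGGY_PROGRAM = "record = None\nprint(record.name)"


def run_isolated(source: str) -> int:
    """Run source in a separate Python process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's error output off our screen
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    exit_code = run_isolated(BUGGY_PROGRAM)
    print("buggy program exit code:", exit_code)
    # The child crashed, but the parent carries on as if nothing happened.
    print("supervisor is still running")
```

The buggy child dies with a nonzero exit code and the parent keeps going. That is the normal, boring outcome. Software that runs inside the operating system itself has no parent standing by to clean up after it, so when it hits the same kind of bug, the whole machine goes down with it.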

The last thing (probably) worth mentioning is why this issue affected Microsoft Windows in particular. CrowdStrike does have sister products for macOS and Linux, the other two largest desktop operating systems in the world. The simple explanation is that this particular channel file (remember: the wanted poster) was for a Microsoft Windows vulnerability, not a macOS or Linux one. The even simpler one is that Windows is just dominant in the enterprise, commercial universe: most of these devices just don’t run on macOS or Linux.

But that’s not the whole picture. Compared to Apple, Microsoft’s approach to building Windows is much more open, with a looser integration between software and hardware. Apple doesn’t allow software like CrowdStrike to have control over what the operating system does to the same degree, so a failure like this one is much less likely on a Mac. There’s also a narrative here (true or not, I can’t opine) about Microsoft’s weakened stance on security, and neglect of Windows since they started focusing on the cloud.