In what was probably the biggest IT failure in the history of humanity, computers across the globe got fried over the weekend from an errant software update. Planes were grounded, banks were offline, machines in hospitals stopped working…you get the gist. There have been endless articles analyzing what this means and why it’s a big deal. Instead of that, this post will explain what actually happened from a technical perspective, in simple, easy-to-understand language.
The outage itself: what went wrong
Let’s first focus on what actually happened. What broke?
Let’s start from the basics. Pretty much every device on the planet that runs software – whether it’s your phone, a laptop, a large server in the cloud, a TV at an airport, or a teleprompter in a studio – is built on an operating system. An operating system is the mastermind behind computers: it orchestrates all of the behind-the-scenes magic that lets something like Excel work on your laptop, or flight arrival times display on a TV.
The most popular operating system for desktop computers by far is Microsoft Windows. As of last month, more than 70% of desktop systems were running Windows, which may come as a surprise to Mac-loyal readers. Windows is especially popular in commercial applications like factory robots, hospital devices, airline and airport systems, etc. The initial release was 38 years ago, in November 1985.
Now if something goes wrong with the operating system, your device is completely fucked. If the operating system can’t run, the device can’t run; simple as that. And this is exactly what happened with the CrowdStrike outage. CrowdStrike’s software (we’ll get to what that is) messed up the Windows operating systems that it runs on, so computers that use the software couldn’t start at all.
Of all the types of outages out there, an operating system not being able to boot up is perhaps the most sinister, because the only way to fix it is manually: someone has to reboot the computer in a special way that allows them to remove the offending code (more on this later). But the people using these devices are usually not computer experts, and in most cases probably don’t even know what an operating system is. This is part of why the outage took so long to resolve, and why it required significant hand-holding from CrowdStrike.
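If it helps to see that as code, here is a toy model of the idea in Python. To be clear, this is not how Windows actually boots: the `boot` function, the `recovery_mode` flag, and the component names are all made up for illustration. It only shows the shape of the problem: one broken piece that loads at startup stops everything, and the fix is to start up without it and remove it by hand.

```python
class BootFailure(Exception):
    """Raised when a component crashes while the (toy) OS is starting."""


def boot(components, recovery_mode=False):
    """Start each component in order and return the names that loaded.

    In recovery mode, third-party components are skipped entirely.
    This is a simplified stand-in, not a real boot process.
    """
    loaded = []
    for component in components:
        if recovery_mode and component["third_party"]:
            continue  # recovery mode only loads the essentials
        if component["broken"]:
            # One crash this early takes the whole startup down with it.
            raise BootFailure(f"{component['name']} crashed during startup")
        loaded.append(component["name"])
    return loaded


# Hypothetical components: the core OS plus one bad third-party update.
components = [
    {"name": "core_os", "third_party": False, "broken": False},
    {"name": "bad_update", "third_party": True, "broken": True},
]

# 1. A normal boot hits the broken component and never finishes.
try:
    boot(components)
    normal_boot_ok = True
except BootFailure:
    normal_boot_ok = False

# 2. Recovery mode skips third-party components, so the machine starts.
recovery_loaded = boot(components, recovery_mode=True)

# 3. A person removes the offending component by hand...
repaired = [c for c in components if not c["broken"]]

# 4. ...and the next normal boot succeeds.
repaired_loaded = boot(repaired)

print(normal_boot_ok, recovery_loaded, repaired_loaded)
# prints: False ['core_os'] ['core_os']
```

Notice that step 3 is the only one a computer can’t do for itself, which is the whole problem: somebody has to physically be there to do it.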
Anyway, now that we’ve got an understanding of the symptoms (i.e. what went wrong), we can dive into what caused the outage and who CrowdStrike is.