The Value of Software Testing: Lessons from the CrowdStrike Outage

On July 19th, 2024, a faulty update caused system crashes worldwide and highlighted the importance of thorough testing, a practice i3 routinely applies through its QA process to prevent such failures.


The Value of Testing and the Cost of Neglecting It 

Companies have long looked for shortcuts and silver bullets when it comes to software testing. People see test documentation and reasonably assume testing is just running through checklists. Anyone could do that! If developers test their own code, there’s no need to budget for dedicated testers. In the software-as-a-service model, we can push changes to production quickly. If there are issues, clients will notify us and we’ll address them. Real software companies test in production!

Worldwide Chaos

On Friday, July 19th, 2024, major industries across the world, from airports to emergency call centers, were paralyzed. The cause was a faulty update to CrowdStrike’s Falcon Sensor, security software that runs as a kernel-level driver on Windows machines at major companies. The update affected an estimated 8.5 million Windows devices, less than 1% of all Windows machines worldwide.

The Falcon kernel driver read a malformed definition file and ended up accessing invalid memory, which caused it to throw an exception. Because this exception occurred at the kernel level, Windows could not safely recover from it, so the operating system halted itself as a safety mechanism. And because the driver loads during startup, affected machines crashed again on every reboot. What end users saw was the dreaded “Blue Screen of Death” on startup.
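The exact defect is CrowdStrike’s to describe, but the general failure class is easy to illustrate. Below is a hypothetical, simplified Python sketch of code that trusts a definition file to have the shape it expects, next to a version that validates first. `EXPECTED_FIELDS`, `parse_rule_unchecked`, `parse_rule_checked`, and the comma-separated format are all invented for this example; none of it is CrowdStrike’s code or file format. It also includes the kind of test a tester writes on purpose: feed in malformed input and confirm the code rejects it cleanly instead of crashing.

```python
# Hypothetical, simplified illustration of trusting vs. validating a content file.
# This is NOT CrowdStrike's code or file format; all names here are invented.
from dataclasses import dataclass

# Assumption: the parser was written to handle exactly this many fields per rule.
EXPECTED_FIELDS = 20


class MalformedContentError(ValueError):
    """Raised when a definition file does not have the shape the code expects."""


@dataclass
class Rule:
    fields: list[str]


def parse_rule_unchecked(line: str) -> Rule:
    # Trusts the file: reads every position the code assumes exists.
    parts = line.split(",")
    # Python raises IndexError on a missing field; an equivalent unchecked read
    # in C kernel code can touch invalid memory and take down the whole machine.
    return Rule(fields=[parts[i] for i in range(EXPECTED_FIELDS)])


def parse_rule_checked(line: str) -> Rule:
    # Validates the shape of the input first and fails safely if it is wrong.
    parts = line.split(",")
    if len(parts) != EXPECTED_FIELDS:
        raise MalformedContentError(
            f"expected {EXPECTED_FIELDS} fields, got {len(parts)}"
        )
    return Rule(fields=parts)


def test_malformed_definition_is_rejected() -> None:
    # A deliberately bad input: one field short of what the parser expects.
    malformed = ",".join(["x"] * (EXPECTED_FIELDS - 1))
    try:
        parse_rule_checked(malformed)
    except MalformedContentError:
        return  # rejected cleanly, which is what we want
    raise AssertionError("malformed definition was accepted")


if __name__ == "__main__":
    test_malformed_definition_is_rejected()
    print("malformed definition rejected")
```

The validation itself is only half the point. The other half is the test: someone whose job is to ask “what if this file is wrong?” writes the malformed case deliberately, before it ever reaches production.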

Five days on, thousands of flights have been cancelled. Insurance firm Parametrix estimates that Fortune 500 companies face some $5.4 billion in direct losses from the outage. CrowdStrike’s reputation and stock price are in freefall, and lawsuits may be forthcoming. Lives may have been lost, too: when emergency services are widely unavailable, serious but survivable events can turn fatal.

We’ve seen plenty of stark warnings in the aftermath. In a world increasingly reliant on technology, where 73% of computers use the same operating system, incidents like this feel inevitable. An unavoidable doomsday lurks on our horizon, when our online apparatus comes crashing down around us. These predictions discount how preventable this outage was. Simple policies and practices could’ve mitigated this issue or prevented it entirely. 

The Consequences of Insufficient Testing

If we’re being charitable, this update wasn’t tested sufficiently. Given the nature of the bug, it likely wasn’t tested at all. The code probably wasn’t even reviewed by another developer before being pushed to production.

The software industry at large has been pushing dedicated testers to the fringes of the production cycle to trim costs and shorten development time. Producing software cheaper and faster sounds prudent until your company accidentally costs the global economy over $5 billion in a single day. Thorough, deep testing requires dedicated resources. It takes time to understand a product and its risks, and to communicate those risks in a way that product owners can use to make decisions. This is why testing is a dedicated line item in i3’s project proposals.

How do we estimate testing in our proposals? 

Estimating testing time is extremely difficult. The purpose of testing is to learn about the product we’re building, and accurately estimating how long it will take to learn about something that doesn’t yet exist requires a level of clairvoyance that borders on the supernatural. For this reason, a common industry rule of thumb is to budget testing time equal to ⅓ to ½ of development time. An estimate like this has plenty of built-in buffer, so we can absorb changes in scope as previously unforeseen risks come to light.
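As a quick illustration of the arithmetic, here is a minimal Python sketch of the ⅓ to ½ rule of thumb. Only the ratio comes from the text above; the function name `estimate_testing_hours` and the 120-hour project are hypothetical.

```python
def estimate_testing_hours(dev_hours: float) -> tuple[float, float]:
    """Return a (low, high) testing estimate from the 1/3 to 1/2 rule of thumb."""
    if dev_hours < 0:
        raise ValueError("development hours must be non-negative")
    low = dev_hours / 3   # one third of development time
    high = dev_hours / 2  # one half of development time
    return low, high


# Hypothetical project: 120 hours of estimated development work.
low, high = estimate_testing_hours(120)
print(f"Budget {low:.0f} to {high:.0f} hours of testing")  # Budget 40 to 60 hours of testing
```

Keeping both ends of the range, rather than collapsing it to a single number, preserves the buffer for absorbing scope changes as new risks surface.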

Dedicated testing time and budget may seem extraneous, even expensive, in a vacuum. It means any product takes longer to go from concept to market. But when you look at the true costs of neglecting testing and risk analysis, the value testing provides is clear. 

Our Software Quality Assurance Analyst, Sam Whitesell, put together this article to help those inside and outside of i3 better understand the importance of testing and QA in development.
