Did CrowdStrike Learn the Lesson? I Don’t Think So
I did not see a real engineering solution in CrowdStrike’s Preliminary Post Incident Review, only pledges of better testing.
Five days after causing the horrific global IT outage, CrowdStrike released a detailed review of the incident on 2024-07-24. It describes the existing release process as follows:
“The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing.”
In the “How Do We Prevent This From Happening Again?” section, the review pledges:
Improve Rapid Response Content testing by using testing types such as:
Local developer testing
Content update and rollback testing
Stress testing, fuzzing and fault injection
Stability testing
Content interface testing
Wow, that is a lot of testing terms, and there is even a fancy name: “Rapid Response Content testing”. To non-IT people, it may look as if this company is getting serious about testing from now on.
Judging by this review document, I don’t think CrowdStrike has fully learned its lesson. Let’s revisit the basics.
What happened from the customers’ point of view?
Millions of users found that their Windows computers hit the “blue screen of death” at the same time.
We are talking about millions of computers (Microsoft estimated about 8.5 million Windows devices), not isolated incidents such as one user’s computer with unusual software installed. Essentially, the update failed End-User Testing, and the failure would have been easy to detect in that phase. This was not a matter of unit testing, performance testing, integration testing, or stress testing, the types repeatedly mentioned in CrowdStrike’s review document.
To get a bit more technical (most computer users will still understand), this is regression testing, which verifies that a new release does not break existing features.
If CrowdStrike’s senior engineers had a solid understanding of CI/CD and End-to-End Test Automation, particularly in the context of regression testing, the review document would have highlighted those. But it did not.
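To make the idea of regression testing concrete, here is a minimal sketch in Python. It is not CrowdStrike’s process or code. The `Release` class, its fields, and the two checks are hypothetical stand-ins; in a real pipeline each check would be an automated end-to-end test run against an actual test machine. The point is only that a release is blocked when any existing behaviour stops working.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Release:
    """Hypothetical stand-in for a build or content update under test."""
    version: str
    boots_cleanly: bool         # in practice: observed on a real test machine
    detects_known_threat: bool  # in practice: result of an end-to-end test


# Each regression check asserts that an existing behaviour still works.
RegressionCheck = Callable[[Release], bool]

REGRESSION_SUITE: Dict[str, RegressionCheck] = {
    "machine boots after update": lambda r: r.boots_cleanly,
    "known threat is still detected": lambda r: r.detects_known_threat,
}


def failed_checks(release: Release) -> List[str]:
    """Return the names of the regression checks that this release breaks."""
    return [name for name, check in REGRESSION_SUITE.items() if not check(release)]


def can_ship(release: Release) -> bool:
    """A release ships only if every existing behaviour still works."""
    return not failed_checks(release)


if __name__ == "__main__":
    good = Release("1.0.1", boots_cleanly=True, detects_known_threat=True)
    bad = Release("1.0.2", boots_cleanly=False, detects_known_threat=True)
    for candidate in (good, bad):
        verdict = "ship" if can_ship(candidate) else "block"
        print(candidate.version, verdict, failed_checks(candidate))
```

In this sketch, a release that no longer boots cleanly is blocked no matter how many unit or stress tests it passed, which is the kind of gate the review document does not describe.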
The Solution, a Proper Engineering Way