May An Alternative Be With You
Microsoft estimated that about 8.5 million Windows computers were directly affected when a CrowdStrike content update caused Windows systems to crash with the Blue Screen of Death (BSoD). That is less than 1% of the global Windows install base. But for anyone or any business depending on those computers to do their jobs, even one is too many. I hope you were not one of them, or that you were lucky enough to have an alternative, such as an Android phone, a Chromebook, or anything else to get you through the dark age.
Postmortem
This CrowdStrike outage may be the biggest outage so far, but it won't be the last one. Not long ago, in 2021, even Tesla's infotainment system could crash when navigating to a fish soup restaurant (開元路土魠魚羹) in Taiwan. I bet Tesla has learned valuable lessons since. So, how might CrowdStrike and Microsoft learn from a big one like this?
Any good engineering team is also good at doing postmortems, retrospectives, or whatever you call them, because in real life only bystanders can afford to wait for perfection. Doers need to focus on leaving things a bit better than they found them. All other things being equal, a shorter Lead Time for Changes is always better than a longer one. The question is whether you can also make the Change Failure Rate lower and the Time to Restore Service shorter.
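To make these metrics concrete, here is a minimal sketch of how a team might compute them from its own deployment log. The `Deployment` record and its field names are assumptions made for illustration, not a standard schema, and the averages are the simplest possible definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """One production change (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service came back, if it failed


def lead_time_for_changes(deploys: List[Deployment]) -> timedelta:
    """Average time from commit to production."""
    total = sum((d.deployed_at - d.committed_at for d in deploys), timedelta())
    return total / len(deploys)


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    return sum(d.failed for d in deploys) / len(deploys)


def time_to_restore_service(deploys: List[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restored service."""
    outages = [d.restored_at - d.deployed_at
               for d in deploys if d.failed and d.restored_at]
    if not outages:
        return None  # nothing failed, or nothing has been restored yet
    return sum(outages, timedelta()) / len(outages)
```

Tracking these numbers before and after each postmortem is one way to check whether the lessons actually changed anything.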
To do so, you have to do postmortems properly. Reducing the Deployment Frequency or skipping updates is very tempting because it is easier for now. Unfortunately, it is almost always the wrong answer in the long run. Check out how Google learned from failure. Feel free to do it in whatever way you like, as long as you get the essential outcomes. The formality and processes are secondary.
1. Better Safeguard
Adding new safeguards to prevent or flag offending changes early is the only way to avoid ending up in the same situation again. Otherwise, you or someone else will fall into the same hole sooner or later.
CrowdStrike did its part to avoid falling into the same hole. In its Preliminary Post Incident Review (PIR), it plans to add more content checking and testing, handle errors gracefully, and adopt staggered deployment.
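As a rough illustration of how content checking and staggered deployment fit together, here is a minimal sketch. It is not CrowdStrike's implementation: `validate_content`, `staggered_rollout`, the ring layout, and the 1% failure threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Dict, List


def validate_content(content: Dict) -> bool:
    """Hypothetical pre-release check: required fields are present and non-empty."""
    return all(content.get(key) for key in ("version", "rules"))


def staggered_rollout(
    content: Dict,
    rings: List[List[str]],
    apply_update: Callable[[str, Dict], bool],
    max_failure_rate: float = 0.01,
) -> Dict:
    """Push content ring by ring and stop as soon as one ring looks unhealthy."""
    if not validate_content(content):
        # Bad content never leaves the building.
        return {"status": "rejected", "updated": []}

    updated: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if apply_update(host, content):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            # Later, larger rings are never touched.
            return {"status": "halted", "updated": updated}
    return {"status": "complete", "updated": updated}
```

The point of the design is that the blast radius of a bad update is limited to the smallest ring that reveals the problem, for example an internal dogfooding ring before any customer machine.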
Dear CrowdStrike & Microsoft, here are my 2 cents for you: dogfooding and a public beta may further help you and your users.
2. Lessons Learned
Most companies tend to sweep their failures under the carpet. I was lucky to work for an unconventional company that believed in making lessons learned universally available and useful to everyone in the company, now or in the future. The more you share, the more people learn, and the more value the company harvests.
Such learning makes a rookie more seasoned, because lessons learned are the best guidance for complex design decisions in production. They ground key tradeoffs in reality, instead of in personal ego or preference. If your commander has never been to such a production battlefield, it's just a matter of time before you walk into a trap. Then panic & misery are inevitable.
No Crisis Should Go To Waste
While ordinary companies are busy with finger-pointing, good companies focus on fixing the issue. And a great company can even turn crises into opportunities. Most of the time, “if it ain’t broke, don’t fix it” is a good principle. But when there are enough lessons learned, it can be the time to properly refactor or rearchitect away the “technical debt” once and for all, so that the next time may be different. I like to do it in a Minimum Necessary Architecture (MNA) style: nothing more and nothing less. For example:
- Linux FUSE (Filesystem in Userspace) allows non-privileged users to create a filesystem without changing the kernel code. This improves system stability by minimizing the risk that a bug in such filesystem code crashes the kernel.
- Android A/B system updates ensure there is always a workable booting system, which prevents a bad update from bricking the device. If Microsoft implemented such a design, Windows users might get out of a blue screen loop with a reboot. CrowdStrike may also learn from this. Not only should it fail gracefully, but it could also revert to the last known good content to recover. Either way, there is no need to wait for a manual fix, so no one has to suffer a longer outage of airlines and other services.
- Android Application Sandbox isolates apps from each other and protects apps and the system from malicious apps. So, even when a navigation app crashes, it does not drag the whole system down into a restart.
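The A/B update idea from the list above, with its fallback to the last known good state, can be sketched in a few lines. This is a deliberately simplified model, not Android's actual bootloader logic: the `ABSlots` class, its method names, and the `boots_ok` health check are hypothetical.

```python
from typing import Callable, Dict, Optional


class ABSlots:
    """Two slots: one runs the system, the other receives the update."""

    def __init__(self, active_image: str) -> None:
        self.slots: Dict[str, Optional[str]] = {"A": active_image, "B": None}
        self.active = "A"
        self.pending = False  # True when an untested update is waiting

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install_update(self, image: str) -> None:
        # Write only to the inactive slot; the running system is untouched.
        self.slots[self.inactive()] = image
        self.pending = True

    def reboot(self, boots_ok: Callable[[str], bool]) -> Optional[str]:
        """Try the pending update once; keep the last known good slot if it fails."""
        if self.pending:
            candidate = self.inactive()
            image = self.slots[candidate]
            if image is not None and boots_ok(image):
                self.active = candidate  # the update becomes the new known good
            self.pending = False
        return self.slots[self.active]
```

A bad update costs one failed boot attempt instead of a manual repair, because the previous slot is never overwritten until the new one has proven itself.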
After all, fail fast only works for those who are actually learning and fixing.