CrowdStruck: Blue Screen of Box Checking

Ryan Mahoney

July 23, 2024

Genuine Safety Commitment
Government Reacts
CrowdStrike: Address One Risk, Create Another
Towards a More Secure Future
Resources: 🇨🇦 Invest in In-House Digital Competency
Resources: From the Blogosphere

On July 19, 2024, U.S. cybersecurity vendor CrowdStrike unleashed chaos on an estimated 8.5 million Microsoft Windows computers, rendering critical digital infrastructure unavailable worldwide. This disaster affected hospitals, 911 dispatch centers, airlines, and public transit agencies. CrowdStrike’s leadership failings will undoubtedly lead to severe consequences for the company—as they should. It has become clear that CrowdStrike lacks (or has a culture that permits bypassing) a reliable process to prevent software disruptions, including automated and manual testing, thorough code reviews, and, most notably, canary deployment. Multiple safety mechanisms were bypassed for this incident to occur. While these checks likely exist, they are applied superficially, failing to prevent issues, protect shareholder value, or safeguard the public.
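For readers unfamiliar with canary deployment, here is a minimal sketch of the idea in Python: an update goes to a small slice of machines first, and the rollout stops as soon as a wave fails its health check. This is an illustration only, not a description of CrowdStrike's actual release tooling. The names `staged_rollout`, `apply_update`, `is_healthy`, and the 1% / 10% / 100% stages are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)  # hosts that received the update
    failed: List[str] = field(default_factory=list)   # hosts in the last wave that failed the check
    halted: bool = False                              # True if the rollout stopped early


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    stage_percents: Sequence[int] = (1, 10, 100),  # assumed cumulative rollout stages
) -> RolloutResult:
    """Update hosts in growing waves and halt on the first unhealthy wave."""
    result = RolloutResult()
    total = len(hosts)
    for percent in stage_percents:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, total * percent // 100)
        wave = list(hosts[len(result.updated):target])
        for host in wave:
            apply_update(host)
        result.updated.extend(wave)
        # Gate: any unhealthy host in this wave stops the rollout.
        result.failed = [h for h in wave if not is_healthy(h)]
        if result.failed:
            result.halted = True
            return result
    return result
```

With 1,000 hypothetical hosts and an update that crashes every machine it touches, this sketch halts after the first wave of 10 hosts and leaves the other 990 untouched.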


Genuine Safety Commitment

In this post, I contend that the recent outages faced by public transit agencies were, in part, self-inflicted. By incorporating technology into our critical service pathways without fully understanding the risks, we inadvertently introduced vulnerabilities. My aim is not to assign blame but to advocate for a more secure future.

One might argue that many organizations made similar mistakes. This is true, yet numerous companies avoided these pitfalls, and we should strive to emulate their success.

Given our backgrounds in software development, it may come as a surprise that Paul Swartz and I devote so much time to information and infrastructure security. In government technology, however, the stakes are extraordinarily high. The risk of exposing private information or facing system outages necessitates developing expertise in these areas, an endeavor that requires years of experience and continuous learning.

Of course, one could take the path of least resistance: merely fulfilling the minimum compliance requirements and outsourcing security responsibilities to consulting firms. When failures occur, it’s easy to place the blame on these external entities. But for those genuinely committed to safety, there is no alternative but to learn how to architect, build, operate, and test highly secure and resilient systems.

What I describe here is the phenomenon of “checkbox compliance,” where organizations meet formal requirements by superficially ticking off items on a list without truly embracing the underlying principles or goals.

This fraudulent approach to safety has been implicated in numerous severe incidents throughout recent history:

In each of these cases, a lack of genuine commitment to safety had devastating consequences. We must learn from these tragedies and prioritize true security and resilience in our systems.

Government Reacts

If there’s one thing government agencies excel at, it’s being reactive. Public transit leaders and staff consistently navigate complex, unexpected challenges that threaten service. Last week, thanks to CrowdStrike, technical teams across the U.S. took swift and decisive action to restore normalcy, ensuring safety and service quality. Their dedication is commendable.

Now that we’ve been CrowdStruck, we must ask ourselves: Did we, as technology decision-makers and agency leaders, contribute to this problem? And if so, how can we better protect the riding public in the future?

CrowdStrike: Address One Risk, Create Another

CrowdStrike's product is closed-source, third-party threat-monitoring software embedded in the critical systems of many organizations. An external company controls the software and can update it at will. Like an autoimmune disease, it sometimes malfunctions and attacks the very systems it is meant to protect, as we saw last week.

In contrast, other critical vendor-operated systems that government agencies rely on don’t pose the same risks. Take GitHub Actions, for example. While it facilitates deployment processes for many companies, if GitHub goes offline, you can still deploy your code manually, though it requires more effort. Similarly, AWS provides cloud platforms for numerous public transit applications. Though doing so is costly, organizations can use tools like Terraform to manage infrastructure across multiple regions, and potentially multiple clouds, reducing risk when a single provider is unavailable.
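The difference is whether a fallback path exists at all. As a rough sketch in Python, a deployment routine can try its usual automated path and fall back to a manual one when the first is unavailable. The function `deploy_with_fallback` and the path names are hypothetical, and this does not reflect how GitHub Actions or AWS work internally.

```python
from typing import Callable, List, Sequence, Tuple


class AllPathsFailed(Exception):
    """Raised when every deployment path has been tried and none succeeded."""


def deploy_with_fallback(paths: Sequence[Tuple[str, Callable[[], None]]]) -> str:
    """Try each (name, deploy) pair in order and return the name of the first that succeeds."""
    errors: List[str] = []
    for name, deploy in paths:
        try:
            deploy()
            return name
        except Exception as exc:  # a failed path should not stop us from trying the next one
            errors.append(f"{name}: {exc}")
    raise AllPathsFailed("; ".join(errors))
```

A dependency that sits in the critical path with no second entry in `paths` is, by construction, a single point of failure.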

CrowdStrike, however, has long been known to have the following problematic qualities:

It is a single point of failure.
It presents a single dependency risk.
Customers lose strategic control of critical assets because CrowdStrike inserts itself into the critical path of the application.
Its proprietary nature can lead to vendor lock-in.
It exposes companies to supply chain vulnerabilities.

Given these issues, we must reconsider our reliance on such systems and strive for more resilient, strategically controlled solutions.

Towards a More Secure Future

Transit agencies affected by CrowdStrike must build in-house security expertise to identify and mitigate vulnerable security architectures, ensuring a genuine commitment to safety. Relying on external consultants and vendors is a comforting idea but ultimately ineffective. High-priced security consultants, often partnered with vendors like CrowdStrike, profit by recommending solutions that allow agencies to check compliance boxes without genuinely safeguarding the public. Transit agencies must manage vendors to get the best results, but in areas like information security, it’s not uncommon to see the vendor managing the agency: making key decisions without adequate oversight.

Agencies can achieve modern threat monitoring without making their systems and the public more vulnerable due to fragile system architectures, but only when they have the in-house capacity and culture to support a genuine safety commitment.

Resources: 🇨🇦 Invest in In-House Digital Competency

The Government of Canada serves as a compelling model for agencies aiming to build strong in-house digital competencies. By moving away from traditional IT procurement and investing in internal capabilities, Canada illustrates how innovation can flourish within government. Their IT Procurement Guide stresses the need to cultivate digital skills among employees rather than relying on external vendors. This ensures that expertise remains within the agency, enabling more agile and responsive service delivery. This shift reduces dependence on contractors and promotes a culture of continuous learning and improvement, which is essential in today’s rapidly evolving technological landscape.

The Canadian strategy also highlights the long-term benefits of in-house digital competency, including better project outcomes and increased public trust. With a workforce adept in digital transformation, agencies can align technology initiatives more closely with policy goals and public needs. This internal capability enhances oversight and governance of IT projects, mitigating the risks of outsourcing. Additionally, it empowers government employees to drive innovation, resulting in more effective, citizen-centric services. The Canadian experience underscores that investing in people and their skills is not just a technical necessity but a strategic imperative for any government aspiring to thrive in the digital age.

Resources: From the Blogosphere

Zagaga: Lessons from Crowdstrike
The blog post highlights the importance of recovery planning in cybersecurity. It emphasizes that organizations should diversify their software tools, ensure vendor compatibility with standard formats, run recovery drills, and establish a government-backed Digital Services Reserve for technical incident response.
Edward Zitron: CrowdStruck
In a stark portrayal of today’s tech industry failures, the article describes how CrowdStrike’s botched software update caused widespread chaos. It draws parallels to the Y2K bug and highlights systemic issues in tech’s “growth-at-all-costs” culture, which prioritizes expansion over quality and security.