When computer screens turned blue around the world on Friday, flights were grounded, hotel check-ins became impossible and deliveries of goods stalled. Businesses resorted to pen and paper. Initial suspicions landed on some sort of cyberterrorist attack. The reality, however, was far more mundane: a botched software update from cybersecurity firm CrowdStrike.
“In this case, it was a content update,” said Nick Hyatt, director of threat intelligence at security firm Blackpoint Cyber.
And because CrowdStrike has such a broad customer base, it was a content update felt around the world.
“One mistake had catastrophic results. This is a great example of how closely connected to IT our modern society is – from coffee shops to hospitals to airports, a mistake like this has huge consequences,” Hyatt said.
In this case, the content update was tied to CrowdStrike’s Falcon monitoring software. Falcon, Hyatt said, has deep connections into endpoints (in this case laptops, desktops and servers) so it can monitor them for malware and other malicious behavior. Falcon updates itself automatically to account for new threats.
“Buggy code was released through the auto-update feature, and so here we are,” Hyatt said. The automatic update feature is standard in many software applications and is not unique to CrowdStrike. “It’s just because of what CrowdStrike is doing, the effect here is devastating,” Hyatt added.
Photo caption: Blue screen of death errors are shown on computer monitors during the global communications outage caused by CrowdStrike, which provides cybersecurity services to U.S. technology company Microsoft, on July 19, 2024, in Ankara, Turkey. (Harun Ozalp | Anadolu | Getty Images)
Although CrowdStrike quickly identified the problem and many systems were up and running within hours, the global cascade of damage is not easily reversed for organizations with complex systems.
“We’re thinking three to five days before things are resolved,” said Eric O’Neill, a former FBI counterterrorism and counterintelligence agent and cybersecurity expert. “That’s a lot of downtime for organizations.”
It didn’t help, O’Neill said, that the outage happened on a summer Friday, with many offices empty and IT staff in short supply to help solve the problem.
Software updates should be released gradually
One lesson from the global IT outage, O’Neill said, is that the CrowdStrike update should have been rolled out gradually.
“What CrowdStrike did was release their updates to everyone at once. That’s not the best idea. Send it to a group and test it. There are layers of quality control that it has to go through,” O’Neill said.
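The gradual release O’Neill describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CrowdStrike’s actual process: the function name `staged_rollout`, the stage sizes and the `deploy_and_check` callback are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Push an update to growing slices of a fleet, halting on failures.

    `deploy_and_check` is a hypothetical callback that installs the update
    on one host and returns True if the host is still healthy afterwards.
    Returns the hosts that received the update before the rollout finished
    or was halted.
    """
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = min(len(hosts), max(1, round(len(hosts) * fraction)))
        batch = hosts[len(updated):target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy_and_check(host))
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Stop here: the remaining hosts never receive the bad update.
            break
    return updated
```

With a fleet of 100 machines and stages of 5%, 25% and 100%, an update that breaks every machine it touches would be stopped after the first five machines instead of reaching all 100.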
“It should have been tested in sandboxes, in multiple environments before it went out,” said Peter Avery, vice president of security and compliance at Visual Edge IT.
He expects that more safeguards will be needed to keep these types of failures from recurring.
“You need the right checks and balances in companies. It could have been one person who decided to push this update, or someone picked the wrong file to run,” Avery said.
The IT industry calls it a single point of failure: an error in one part of a system that creates a technical disaster across industries, operations and interconnected communications networks, a huge domino effect.
Call for redundancy in IT systems
Friday’s event could prompt companies and individuals to raise their level of cyber preparedness.
“The bigger picture is how fragile the world is. It’s not just a cyber or technical issue. There are many different phenomena that can cause disruption, such as solar flares that can take out our communications and electronics,” Avery said.
Ultimately, Friday’s meltdown was not an indictment of CrowdStrike or Microsoft, but of how businesses view cybersecurity, said Javed Abed, an assistant professor of information systems at the Johns Hopkins Carey Business School. “Business owners need to stop viewing cybersecurity services as just a cost and instead as a meaningful investment in their company’s future,” Abed said.
Businesses should do this by building redundancy into their systems.
“A single point of failure shouldn’t be able to stop a business, and it did,” Abed said. “You can’t just rely on one cybersecurity tool, cybersecurity 101.”
While creating redundancy in corporate systems is expensive, what happened on Friday is more expensive.
“I hope this is a wake-up call and I hope it causes some changes in the mindset of business owners and organizations to review their cyber security strategies,” Abed said.
What to do with ‘kernel level’ code
At a macro level, it is fair to assign some systemic responsibility to an enterprise IT world that often views cybersecurity, data security and the technology supply chain as “nice-to-haves” rather than essentials, and to a general lack of cybersecurity leadership within organizations, said Nicholas Reese, a former Department of Homeland Security official and instructor at New York University’s SPS Center for Global Affairs.
On a micro level, Reese said the code that caused this outage was kernel-level code, which affects every aspect of how a computer’s hardware and software communicate. “Kernel-level code should receive the highest level of scrutiny,” Reese said, with approval and implementation being entirely separate processes with accountability.
It is a problem that will persist across the entire ecosystem, which is flooded with third-party products, all of them with vulnerabilities.
“How do we look at the third-party vendor ecosystem and see where the next vulnerability is going to be? It’s almost impossible, but we have to try,” Reese said. “It’s not a maybe, but a certainty, until we deal with the number of potential vulnerabilities. We need to focus on backup and redundancy and invest in that, but businesses are saying they can’t afford to pay for things that might never happen. It’s a difficult case,” he said.