The day I.T. stopped
In today’s interconnected world, a single IT outage can have far-reaching consequences, disrupting businesses and affecting millions. Paul explores a recent major IT outage and offers some food for thought on enhancing resilience in our increasingly digital landscape.
Dire warnings blared, describing pandemonium in airports, halted stock market trading, failed credit card transactions, and so forth. Worldwide monitoring committees were set up to watch over the IT world as computers ticked past that dreaded second of midnight, January 1, 2000.
For those too young to remember, IT professionals were long wary of how computers would behave once the year 2000 was crossed. This was dubbed the “Y2K Problem.”
The root of this problem was that many early computer programs stored only the last two digits of each year. As a result, the years “1900” and “2000” could not be told apart – a real cause for concern if, say, an air traffic computer believed that a plane scheduled to land on January 1, 2000 had already landed a hundred years earlier.
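For the technically curious, here is a toy sketch in C – purely illustrative, and not taken from any actual legacy system – of how storing only two digits of the year makes 2000 appear to arrive a full century before 1999:

```c
#include <stdio.h>

/* Purely illustrative sketch of the Y2K ambiguity (not from any real
   legacy system): with only two digits stored, the year 2000 ("00")
   looks earlier than 1999 ("99"). */
int main(void) {
    int scheduled_year = 0;   /* two-digit storage: "00" meant 2000 */
    int current_year   = 99;  /* "99" meant 1999 */

    if (scheduled_year < current_year) {
        printf("Flight appears to have landed %d years ago!\n",
               current_year - scheduled_year);
    }
    return 0;
}
```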
In the end, as the first few seconds of the 2000s ticked by, IT professionals breathed deep sighs of relief. Planes didn’t fall from the sky, and our electrical and water systems chugged on as normal. Programmers had managed to save the day!
So years passed, and the notion of a “Global IT Outage” faded. That is, until Friday, July 19, 2024.
The first caller believed their organization had been compromised by a hacker. While such calls are common for us, I knew this particular organization to be quite secure and well funded.
“All our servers are affected, I need you to look into this,” the grave voice on the line said.
I hung up quite perplexed. For an attacker to do that sort of damage, there are usually ample warning signs beforehand – especially at an advanced organization like this one.
Then posts in IT and IT security forums came rushing in. Whole financial and transport systems ceased to work because their computers simply crashed. There is broad consensus that this was the largest such IT outage in history – one IT insurance firm, Parametrix, estimated the cost to Fortune 500 companies at $5.4 billion. Delta Air Lines in particular canceled more than 5,500 flights in the five days following the outage. Other airlines in the ASEAN region had to resort to handling transactions manually while frantically restoring their services.
By now, I suppose the general public is somewhat aware of what happened, though I suspect some confusion remains. Let me attempt my own layman’s explanation – and draw out what we can learn from it.
Most readers know what an Antivirus Program (AV) is and what it does. I want to emphasize, however, that AVs can be described as “passive”: they detect known malicious programs and stop them from running. Generally, that’s it.
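For readers who like to see the idea concretely, here is a minimal, purely illustrative sketch of that “passive” behavior: compare a file against a list of known-bad signatures and block it on a match. Real antivirus engines are vastly more sophisticated, and every name below is made up.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: a naive "signature scan" in the spirit of a passive
   AV. Real products use far richer detection techniques. */
static const char *known_bad_signatures[] = {
    "EVIL_PAYLOAD_V1",
    "EVIL_PAYLOAD_V2"
};

/* Return 1 if the file's contents contain any known-bad signature. */
int is_malicious(const char *file_contents) {
    for (size_t i = 0;
         i < sizeof(known_bad_signatures) / sizeof(known_bad_signatures[0]);
         i++) {
        if (strstr(file_contents, known_bad_signatures[i]) != NULL)
            return 1;
    }
    return 0;
}

int main(void) {
    const char *download = "...EVIL_PAYLOAD_V1...";
    if (is_malicious(download))
        printf("Blocked: file matches a known malicious signature.\n");
    else
        printf("Allowed: no known signature found.\n");
    return 0;
}
```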
For large companies, however, this isn’t sufficient. We need to know how that malicious program got there in the first place. Was the user phished? Did it come from somebody else who was compromised? There is a need to isolate the computer and analyze it, similar to how police would detain and interrogate a suspect.
To do this, a more advanced program called Endpoint Detection & Response (EDR) is used. Think of it as an AV on steroids. Unlike a passive AV that simply blocks and deletes malicious files, an EDR allows for further interaction. An IT team, for example, may isolate a compromised computer (so a hacker inside it cannot move elsewhere) and then investigate it. Bear in mind, the EDR lets an IT team do all this remotely, on computers they manage across the world.
You’re probably imagining that an EDR is a very powerful piece of software, and you’d be correct. It is “powerful” in the sense that it interacts directly with the very heart of the Windows operating system – a section of code literally referred to as the “kernel.” Mind you, this is not something most programs do; ordinary programs interact with Windows from safer, more restricted areas.
I suppose this needs an analogy.
When hotel guests simply need to charge their phones, they don’t typically need security access to the main electrical facility housing the circuit breakers. Guests simply plug their devices into the safe, faux-wood veneer electrical sockets in the comfort of their rose-scented rooms.
EDRs, however, not only need direct access to that main electrical facility, but are also constantly given new instructions on what to do once inside. That very room where one single mistake can blow the lights out of the entire building. And yes, this is essentially what happened with the EDR made by CrowdStrike.
See, as with AVs, an EDR is constantly updated to recognize the latest malicious software. It has to be, because new threats are concocted daily by hackers aiming to evade detection.
Unfortunately, as with all software, mistakes happen. The CrowdStrike update released on July 19 had a bug. While bugs occur in all software all the time (hence making IT security a viable career to put food on the table), recall that CrowdStrike’s software works directly inside that kernel – the proverbial main electrical room of the entire building.
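To make the danger concrete, here is a hypothetical sketch – emphatically not CrowdStrike’s actual code or file format – of why software that applies content updates must treat a malformed update defensively. An ordinary program that gets this wrong merely crashes; kernel code that gets it wrong takes the whole machine down.

```c
#include <stdio.h>

/* Hypothetical illustration only -- not CrowdStrike's actual code or
   update format. The point: security software applies content updates in
   privileged code, where one unhandled error stops the whole machine
   rather than a single program. */

struct detection_rule {
    const char *pattern;   /* what the rule looks for */
};

/* Pretend this parses a freshly downloaded update. A malformed update
   yields no usable rule (NULL). */
struct detection_rule *parse_update(int update_is_well_formed) {
    static struct detection_rule ok = { "known-bad-malware-signature" };
    return update_is_well_formed ? &ok : NULL;
}

int main(void) {
    struct detection_rule *rule = parse_update(0);   /* a bad update arrives */

    if (rule == NULL) {
        /* Defensive check: refuse to apply a malformed update. Kernel code
           that skipped a check like this and used the bad data anyway
           would take the entire operating system down with it. */
        printf("Update rejected: malformed content, keeping previous rules.\n");
        return 1;
    }

    printf("Applying new rule: %s\n", rule->pattern);
    return 0;
}
```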
CrowdStrike is a leading company in the global EDR market, which explains the widespread impact. But because of the digital manner in which modern companies interact, organizations that do not use CrowdStrike at all were affected too. A company with no EDR whatsoever would still feel the impact of its third-party payment system going offline.
Unfortunately, in a world where corporations must interact with each other digitally in order to survive, the digital armageddon imagined during Y2K remains all too possible. We may not yet have all the answers for reducing the chances of it happening again, but as always, full awareness of the problem is never a bad first step.