The Top Ten IT Outages in History

Published 09/03/2024

Editorial Note: The lessons learned and changes that may result from the CrowdStrike incident may take quite some time to fully understand. CSA is providing a platform for member experts to weigh in on this issue. The opinions of this article represent those of the member, not those of CSA.

Written by Javier Perez, Sr. Director of Product Marketing for Security at Veeam Software.

Originally published on LinkedIn.

The recent major outage triggered by a faulty update from cybersecurity vendor CrowdStrike, which froze systems at organizations worldwide, serves as a reminder that IT outages can occur at any time.

There has been a recent global increase in cyberattacks, and most of the attention, for good reason, has shifted to cybersecurity. We can’t ignore that in this ever-changing technological world, there’s always the risk of IT outages caused not just by cyberattacks but also by natural disasters, faulty software, hardware failure, or human error.

The Windows outage caused by CrowdStrike has a good chance of becoming one of the largest outages in history in terms of systems and end users affected. Government services, health care systems, 911 services, and airlines were disrupted because Windows machines were down and critical applications could not start.

In this blog, I will review the largest IT outages not caused by cyberattacks. Here's a breakdown of some of the most impactful outages based on factors like duration, affected users, and global impact.

The Top Ten Largest IT Outages in History

CrowdStrike outage of July 19, 2024: A faulty CrowdStrike update caused Windows versions 10 and higher to crash, triggering a global IT outage that disrupted critical services such as hospitals, 911 services, and airlines.
Amazon Web Services (AWS) outage of Dec. 7, 2021: Affected the entire US-East-1 region due to network device overload. Automated capacity scaling triggered unexpected behavior, impacting services like EC2 and EventBridge. Popular websites and online services, including Disney+ and Netflix, went offline, and Amazon’s retail operations were disrupted, halting deliveries.
Facebook outage of Oct. 4, 2021: Facebook, along with its subsidiaries Messenger, Instagram, WhatsApp, and Oculus, experienced a global outage. Billions of users worldwide were locked out for several hours because Facebook’s data centers were disconnected from the network.
Fastly outage on June 8, 2021: The content delivery network (CDN) provider Fastly reported a service configuration issue that triggered disruptions in the global network. It affected the UK Government and major websites such as Amazon, Reddit, and the New York Times.
Microsoft Azure outage of March 3, 2020: Microsoft’s crucial East US Azure region faced over six hours of disruption across most services. A temperature spike triggered by a malfunction in building automation control affected storage, compute, networking, and dependent services.
T-Mobile outage of June 2020: T-Mobile experienced a 12-hour outage that affected its 4G, 3G, and 2G networks. The outage led to congestion and the failure of over 23,000 911 calls. It was caused by an optical link failure compounded by other factors.
Equinix data center outage of March 2, 2018: An Equinix data center outage in Ashburn, Virginia, partially disrupted AWS connectivity, impacting customers like Atlassian, Twilio, and Capital One. A nor’easter in the region triggered power outages on the East Coast, affecting the Equinix data center.
British Airways outage of May 27, 2017: A contractor mistakenly disconnected the uninterruptible power supply to the airline’s data center. When power was restored, the surge damaged servers and systems, leading to widespread flight disruptions. The incident cost the airline over $100 million. An estimated 75,000 passengers were stranded worldwide.
Amazon Web Services outage of Feb. 28, 2017: An engineer accidentally mistyped a command during routine debugging. Services relying on S3 storage were affected. This led to a several-hour-long cloud outage, affecting AWS services and popular services like Coursera, Medium, and Slack.
Southwest Airlines outage of July 20, 2016: Faulty components for the Southwest website resulted in a 13-hour disruption that led to the cancellation of over 2,000 flights. Communication channels such as email were also disrupted, affecting communication with employees and customers.

What Can You Do to Prepare: Data Resilience

It doesn’t have to be a major national or international outage. Any type of outage can affect users, services, and even critical infrastructure. Preparation and readiness are the key to addressing the next outage, and this brings us to the topic of data resilience.

Data resilience is the capability to endure and bounce back from data-related disruptions or failures, such as an outage. It entails implementing strategies to ensure data availability, integrity, and security during unforeseen outages caused by natural disasters, cyberattacks, hardware malfunctions, and human errors.

To achieve data resilience, it's essential to have robust measures such as data backup and recovery, redundancy, disaster recovery (DR) planning, and cybersecurity protocols. This ensures critical data and applications remain accessible even if network disruptions occur.
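
As one small illustration of the backup and recovery measure mentioned above, the sketch below copies a file to a backup location and confirms that the copy matches the original by comparing SHA-256 checksums. It is a minimal example under stated assumptions, not a description of any particular product: the function names `sha256_of` and `backup_and_verify`, the file name `records.db`, and the directory layout are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> bool:
    """Copy source into backup_dir and check that the copy is byte-identical.

    Hypothetical helper for illustration only: a real backup process would
    also keep copies off-site and test restores regularly.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / source.name
    shutil.copy2(source, copy_path)
    return sha256_of(source) == sha256_of(copy_path)


if __name__ == "__main__":
    # Demonstrate with a throwaway file in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "records.db"  # hypothetical file name
        source.write_bytes(b"critical business data")
        ok = backup_and_verify(source, Path(tmp) / "backup")
        print("backup verified" if ok else "backup mismatch")
```

A checksum comparison only shows that the copy was intact at the time it was made. It does not replace periodic restore tests or a documented disaster recovery plan.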

By prioritizing data resilience, organizations can significantly reduce the risk of data loss and maintain business continuity despite unexpected challenges. This proactive approach not only safeguards crucial information but also reinforces trust and reliability in the face of adversity.

Conclusion

There are certainly many more major outages that made the news, and their impact is difficult to measure. Still, this list serves as a reminder to organizations that an outage can happen at any time, bringing financial, service, and reputational damage that is very hard to recover from. There’s no question that preparation and readiness are key. Prioritizing data resilience will significantly reduce the risk of data loss and help maintain business continuity despite unexpected challenges.

About the Author

Javier boasts an extensive career spanning 28 years, showcasing his expertise in a spectrum of technological domains, including application development, open source software, mobile technologies, app security, SaaS, and AI. As a Sr. Director of Product Marketing for Security at Veeam Software, Javier is responsible for driving technical thought leadership while also leading product marketing initiatives for cybersecurity and data resilience.

Prior to his current role, Javier held Chief Evangelist and leadership roles at startups and renowned technology companies, such as Perforce, IBM, and Red Hat. Javier actively engages as a speaker and prolific blogger, sharing his knowledge and insights across the global tech community. Armed with an honors degree in Computer Systems and an MBA, Javier aims to inspire others through his thought leadership and advocacy, fostering a culture that embraces cybersecurity and open source to drive innovation.
