Internet Outages Timeline

Dive into high-profile Internet disruptions. Discover their impact, root causes, and essential lessons to ensure the resilience of your Internet Stack.

October

October 1, 2024

Mashery

Various Regions

What Happened?

On October 1, 2024, TIBCO Mashery, an enterprise API management platform leveraged by some of the world’s most recognizable brands, experienced a significant outage. At around 7:10 AM ET, users began encountering SSL connection errors. Internet Sonar revealed that the root cause wasn’t an SSL failure but a DNS misconfiguration affecting access to key services.

Takeaways

The Mashery outage reveals a crucial lesson: SSL errors can be just the tip of the iceberg. The real issue often lies deeper; in this case, it was a DNS misconfiguration. If DNS isn’t properly configured or monitored, the entire system can fail, and what seems like a simple SSL error can spiral into a much bigger problem. To truly safeguard against the fragility of the Internet, you need full visibility into every layer of the Internet Stack, from DNS to SSL and beyond.
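
One way to put that visibility into practice is to test each layer on its own, so a DNS failure is reported as a DNS failure instead of surfacing later as a confusing SSL error. The sketch below is a minimal illustration using only Python's standard library and a placeholder hostname, not a description of how any particular monitoring product works.

```python
import socket
import ssl

def check_layers(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Check DNS, TCP, and TLS separately so a failure is attributed to the right layer."""
    # Layer 1: DNS resolution
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        return f"DNS failure for {hostname}: {exc}"

    # Layer 2: TCP connectivity
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
    except OSError as exc:
        return f"TCP failure to {ip}:{port}: {exc}"

    # Layer 3: TLS handshake and certificate validation
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
        return f"OK: {hostname} -> {ip}, certificate subject {cert.get('subject')}"
    except ssl.SSLError as exc:
        return f"TLS failure for {hostname}: {exc}"
    finally:
        sock.close()

if __name__ == "__main__":
    # Placeholder hostname; substitute the API endpoint you actually depend on.
    print(check_layers("api.example.com"))
```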

September

September 17, 2024

Reliance Jio

India

What Happened?

On September 17, 2024, Reliance Jio encountered a major network outage affecting customers across multiple regions in India and around the globe. The outage was first noticed when users began encountering connection timeouts when attempting to access both the AJIO and Jio websites. The outage was resolved around 05:42 EDT.

Takeaways

Gaining full visibility across the entire Internet Stack, including external dependencies like CDNs, DNS providers, and ISPs, is critical for businesses. Proactive monitoring is essential for early detection of issues such as packet loss and latency, helping companies mitigate risks before they escalate into major outages.
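
As a rough illustration of that kind of proactive check, the sketch below times repeated TCP connections to a placeholder host and treats connection failures as a stand-in for packet loss. Raw ICMP probes need elevated privileges, and dedicated network monitoring is far more precise, but even a coarse probe like this can surface rising latency or failure rates early.

```python
import socket
import statistics
import time

def probe(host: str, port: int = 443, attempts: int = 20, timeout: float = 2.0) -> None:
    """Time repeated TCP connects as a coarse stand-in for latency and packet loss."""
    latencies_ms, failures = [], 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies_ms.append((time.monotonic() - start) * 1000)
        except OSError:
            failures += 1
        time.sleep(0.2)  # pace the probes a little
    failure_pct = 100 * failures / attempts
    if latencies_ms:
        print(f"{host}: median {statistics.median(latencies_ms):.1f} ms, "
              f"worst {max(latencies_ms):.1f} ms, failures {failure_pct:.0f}%")
    else:
        print(f"{host}: all {attempts} connection attempts failed")

if __name__ == "__main__":
    # Placeholder host; probe the endpoints your users actually depend on.
    probe("www.example.com")
```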

August

August 15, 2024

ServiceNow

Global

What Happened?

On August 15, at 14:15 ET, ServiceNow experienced a significant outage lasting 2 hours and 3 minutes. Catchpoint's Internet Sonar detected the disruption through elevated response and connection timeout errors across major geographic locations. The disruption, caused by instability in connectivity with upstream provider Zayo (AS 6461), impacted ServiceNow's core services and client integrations. The outage resulted in intermittent service availability, with users facing high connection times and frequent timeouts.

Takeaways

A proactive approach to BGP monitoring is crucial to prevent extended outages. ServiceNow's quick response to reroute traffic is a good example of how effective incident management and holding vendors accountable can make all the difference in keeping things running and keeping your users happy.
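
A full BGP monitoring setup consumes live routing feeds, but even a periodic origin check can catch the basics. The sketch below assumes RIPEstat's public prefix-overview endpoint and its JSON layout (verify both before relying on them) and uses a placeholder prefix and ASN; it simply alerts if the prefix is no longer seen originated by the expected AS.

```python
import json
import urllib.request

# Assumption: RIPEstat's public data API exposes the announcing ASNs for a prefix
# at this endpoint; confirm the current API shape before depending on it.
RIPESTAT_URL = "https://stat.ripe.net/data/prefix-overview/data.json?resource={prefix}"

def origin_asns(prefix: str) -> set[int]:
    """Return the set of ASNs currently seen originating the given prefix."""
    with urllib.request.urlopen(RIPESTAT_URL.format(prefix=prefix), timeout=10) as resp:
        payload = json.load(resp)
    return {int(entry["asn"]) for entry in payload["data"].get("asns", [])}

def check_origin(prefix: str, expected_asn: int) -> None:
    seen = origin_asns(prefix)
    if expected_asn not in seen:
        print(f"ALERT: {prefix} no longer originated by AS{expected_asn}; seen {seen}")
    else:
        print(f"OK: {prefix} originated by AS{expected_asn}")

if __name__ == "__main__":
    # Placeholder prefix and ASN; substitute your own announced address space.
    check_origin("192.0.2.0/24", 64500)
```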

August 14, 2024

AWS

Multiple regions (primarily locations using CenturyLink AS209 and Lumen AS3356)

What Happened?

On August 14, between 8:00 and 8:25 UTC, AWS experienced a micro-outage affecting services like S3, EC2, CloudFront, and Lambda. Catchpoint's Internet Sonar detected connection timeouts across multiple regions, particularly in locations routing through CenturyLink AS209 and Lumen AS3356. This disruption, though not reflected on AWS’s status page, significantly impacted these regions' access to AWS services.

Takeaways

Status pages aren't always reliable indicators of service health. If you’re only relying on cloud-based monitoring tools, you’re in trouble if their cloud goes down. It’s good practice to diversify your monitoring strategy and have a fallback plan to ensure Internet resilience. Clear communication will also help you maintain trust with your users.
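
One way to act on that is to compare your own probe with what the provider publishes and flag disagreements. The sketch below uses placeholder URLs and assumes a Statuspage-style JSON summary endpoint for the provider side; adapt the parsing to whatever format your provider actually exposes.

```python
import json
import urllib.error
import urllib.request

SERVICE_URL = "https://service.example.com/health"            # placeholder: what you actually depend on
STATUS_URL = "https://status.example.com/api/v2/status.json"  # assumption: Statuspage-style JSON summary

def probe_ok(url: str, timeout: float = 10.0) -> bool:
    """Independent availability probe; any non-success response or error counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def provider_says_ok() -> bool:
    """Read the provider's published status; 'none' means no reported incidents in this format."""
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
            return json.load(resp)["status"]["indicator"] == "none"
    except Exception:
        return False  # an unreachable status page is itself a signal

if __name__ == "__main__":
    ours, theirs = probe_ok(SERVICE_URL), provider_says_ok()
    if not ours and theirs:
        print("Probe is failing but the status page still reports healthy: trust your own data.")
    else:
        print(f"probe_ok={ours}, status_page_ok={theirs}")
```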

July

July 31, 2024

Disney+

Multiple nodes

What Happened?

On July 31, at 20:12 EDT, Disney Plus experienced a brief outage lasting 38 minutes. Catchpoint detected 502 Bad Gateway errors from multiple nodes, an issue that was confirmed through both automated tests and manual browsing. The disruption was resolved by 20:50 EDT.

Takeaways

This incident shows why it's so important to monitor your services from multiple vantage points to quickly detect and verify outages. Even short-lived disruptions can impact user experience, making continuous monitoring and rapid response essential.
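
Verification across vantage points can be reduced to a simple rule: treat each location's check as a vote and only declare an outage when enough of them fail. The sketch below is a toy aggregation over hypothetical per-location results; in practice the probes themselves would run from geographically distributed agents.

```python
def confirmed_outage(results: dict[str, bool], min_failing: int = 3) -> bool:
    """Return True only when at least min_failing vantage points report failure.

    `results` maps a vantage-point name to whether its check passed.
    """
    failing = sorted(loc for loc, ok in results.items() if not ok)
    if len(failing) >= min_failing:
        print(f"Confirmed outage from {len(failing)} locations: {', '.join(failing)}")
        return True
    if failing:
        print(f"Isolated failures only ({', '.join(failing)}); likely local, keep watching")
    return False

if __name__ == "__main__":
    # Hypothetical results from distributed probes hitting the same URL.
    confirmed_outage({
        "new-york": False, "london": False, "frankfurt": False,
        "singapore": True, "sydney": True,
    })
```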

July 31, 2024

Alaska Airlines

North America

What Happened?

On July 23, from 14:35 to 14:52, Alaska Airlines’ website (www.alaskaair.com) experienced a 404 Not Found error, rendering the site inaccessible for approximately 17 minutes. Catchpoint detected the issue, confirming the failures across multiple tests. Response headers indicated the issue stemmed from configuration errors, as evidenced by the 404 error and subsequent cache miss responses.

July 23, 2024

Microsoft Outlook

Multiple locations

What Happened?

Starting at 21:23 EDT on July 23, Microsoft Outlook experienced intermittent failures across multiple regions. Users encountered various errors, including 404 Not Found, 400 Bad Request, and 503 Service Unavailable, when trying to access https://www.outlook.com/ and https://outlook.live.com/owa/. Catchpoint’s Internet Sonar detected the issue through multiple tests, while Microsoft’s official status page did not report any outages at the time.

Takeaways

Another example of how intermittent issues, often the hardest failures to detect, may not be reflected on official status pages. Given the high cost of Internet disruptions, even a brief delay in addressing these issues can be extraordinarily expensive. And if you’re waiting for your provider to tell you when something’s wrong, that delay could be even longer.

July 18, 2024

Azure

US Central Region

What Happened?

On July 18, starting at 18:36 EDT, Azure’s US Central region experienced a major service outage lasting until 22:17 EDT. Initially, 502 Bad Gateway errors were reported, followed by 503 Service Unavailable errors. This outage impacted numerous businesses reliant on Azure Functions, as well as Microsoft 365 services like SharePoint Online, OneDrive, and Teams, which saw significant disruptions.

Takeaways

This incident occurred within 24 hours of a separate CrowdStrike outage, leading to confusion in the media as both issues were reported simultaneously. Companies that relied solely on Azure without multi-region or multi-cloud strategies were significantly impacted, particularly those using eCommerce APIs. Catchpoint’s Internet Sonar detected the outage early and helped isolate the issue, confirming that it wasn’t related to network problems, saving time on unnecessary troubleshooting.
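
A multi-region posture does not have to be elaborate to help. Even a client-side fallback that walks an ordered list of regional endpoints keeps some traffic flowing when one region degrades; the sketch below uses hypothetical regional URLs purely for illustration.

```python
import urllib.error
import urllib.request

# Placeholder regional endpoints for the same service, ordered by preference.
ENDPOINTS = [
    "https://centralus.api.example.com/health",
    "https://eastus2.api.example.com/health",
    "https://westeurope.api.example.com/health",
]

def first_healthy(endpoints: list[str], timeout: float = 5.0) -> str | None:
    """Return the first endpoint that answers successfully, or None if all fail."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # this region is unhealthy or unreachable; try the next one
    return None

if __name__ == "__main__":
    target = first_healthy(ENDPOINTS)
    if target:
        print(f"Routing traffic to {target}")
    else:
        print("All regions failing: fail over to another provider or degrade gracefully")
```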

July 19, 2024

CrowdStrike

Global

What Happened?

On July 19, a massive global outage disrupted critical services worldwide, affecting systems dependent on Microsoft-based computers. The outage, caused by a faulty automatic software update from cybersecurity firm CrowdStrike, knocked Microsoft PCs and servers offline, forcing them into a recovery boot loop. This unprecedented outage impacted daily life on a global scale, grounding airlines, taking emergency services offline, and halting operations for major banks and enterprises.

Takeaways

The CrowdStrike outage is a wake-up call for how fragile our digital world really is. Everything we do relies on these systems, and when they fail, the ripple effects are huge. This incident shows just how important it is to be prepared. Know your dependencies, test updates like your business depends on it (because it does), and have a plan for when things go wrong. Don’t just assume everything will work—make sure it will. And remember, resilience isn’t just about your tech; it’s about your team too. Keep them trained, keep them ready, and make sure they know what to do when the unexpected happens.

June

May

May 23, 2024

Bing

Global

What Happened?

On May 23, starting at 01:39 EDT, Bing experienced an outage with multiple 50X errors affecting users globally. The issue was detected by Catchpoint’s Internet Sonar and confirmed through manual checks. The outage disrupted access to Bing’s homepage, impacting user experience across various regions.

Takeaways

This incident shows the value of having robust monitoring in place. Quick detection and confirmation are crucial for minimizing the impact of such outages.

May 1, 2024

Google

Global

What Happened?

On May 1, starting at 10:40 Eastern, Google services experienced a 34-minute outage across multiple regions, with users encountering 502 Bad Gateway errors. The issue affected accessibility in locations including Australia, Canada, and the UK. Internet Sonar detected the incident and the outage was also confirmed via manual checks.

April

April 29, 2024

X (Twitter)

North America, Asia Pacific

What Happened?

On April 29, starting at 03:29 EDT, X (formerly known as Twitter) experienced an outage where users encountered high wait times when trying to access the base URL 'twitter.com.' The issue was detected by Internet Sonar, with failures reported from multiple locations. Manual checks also confirmed the outage. Additionally, during this time, connection timeouts were observed for DFS and Walmart tests due to failed requests to Twitter’s analytics service, further impacting both platforms.

March

March 6, 2024

ChatGPT

Global

What Happened?

On March 6, starting at 03:00 EST, ChatGPT’s APIs experienced intermittent failures due to HTTP 502 (Bad Gateway) and HTTP 503 (Service Unavailable) errors. Micro-outages were observed at various intervals, including 03:00-03:05 EST, 03:49-03:54 EST, and 03:58-03:59 EST. These disruptions were detected by Catchpoint’s Internet Sonar and confirmed through further investigation.

Takeaways

Even brief micro-outages can affect services and user experience. Early detection is key to minimizing impact.

February

February 25, 2024

ChatGPT

Global

What Happened?

On February 25, 2024, at 23:29 EST, OpenAI’s ChatGPT API began experiencing intermittent failures. The primary issues were HTTP 502 Bad Gateway and HTTP 503 Service Unavailable errors when accessing the endpoint https://api.openai.com/v1/models. The outage was confirmed manually, and Catchpoint’s Internet Sonar dashboard identified the disruption across multiple regions, including North America, Latin America, Europe, the Middle East, Africa, and the Asia Pacific. The issues persisted into the next day, with 89 cities reporting errors during the outage.

Takeaways

As with many API-related outages, relying on real-time monitoring is essential to quickly mitigating user impact and ensuring service reliability across diverse geographies.
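
Even an unauthenticated probe of the endpoint mentioned above can be informative: without credentials it will typically get a 401, which still shows the API front end is answering, while a 502, 503, or timeout matches the failure pattern seen in this outage. A minimal sketch of that distinction, with the alerting hook left as a placeholder:

```python
import urllib.error
import urllib.request

# Probed unauthenticated, so a 401 is treated as the "healthy" answer.
ENDPOINT = "https://api.openai.com/v1/models"

def classify() -> str:
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=10):
            return "up"            # unlikely without credentials, but still a healthy response
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403):
            return "up"            # API layer responded; only authentication is missing
        if exc.code in (502, 503, 504):
            return "degraded"      # gateway/unavailable errors like those seen in this outage
        return f"unexpected HTTP {exc.code}"
    except (urllib.error.URLError, OSError):
        return "unreachable"       # DNS, TCP, or TLS level failure

if __name__ == "__main__":
    state = classify()
    print(f"{ENDPOINT}: {state}")
    # Placeholder: forward anything other than "up" to your alerting pipeline.
```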

January

January 26, 2024

Microsoft Teams

Global

What Happened?

On January 26, Microsoft Teams experienced a global service disruption affecting key functions like login, messaging, and calling. Initial reports indicated 503 Service Unavailable errors, with the issue captured by Autodesk synthetic tests. Microsoft later identified the root cause as networking issues impacting part of the Teams service. The failover process initially helped restore service for some regions, but the Americas continued to experience prolonged outages.

Takeaways

Failover processes can quickly resolve many service issues, but this outage showed the importance of ongoing optimization for full recovery across all regions. It also highlighted the value of monitoring from the user’s perspective. During the disruption, Teams appeared partially available, leading some users to believe the issue was on their end.

2023

December

December 15, 2023

Box

Global

What Happened?

On December 15, from 6:00 AM to 9:11 AM Pacific Time, Box experienced a significant outage that affected key services, including the All Files tool, Box API, and user logins. The outage disrupted uploading and downloading features, leaving users unable to share files or access their accounts. Early detection through proactive Internet Performance Monitoring (IPM) helped Box mitigate the outage’s impact, with IPM triggering alerts as early as 04:37 AM PST, well before the outage became widespread.

Takeaways

Early detection and quick response are key to minimizing downtime, reducing financial losses, and protecting brand reputation. This incident emphasizes the value of a mature Internet Performance Monitoring strategy, setting the right thresholds to avoid false positives, and ensuring teams can quickly identify root causes to keep systems resilient.
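
Setting the right thresholds often comes down to not paging on a single blip. One common pattern, sketched below with made-up numbers, is to alert only after several consecutive failed checks, trading a minute or two of detection time for far fewer false positives.

```python
class ConsecutiveFailureAlert:
    """Raise an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0
        self.alerted = False

    def record(self, check_passed: bool) -> None:
        if check_passed:
            if self.alerted:
                print("Recovered: closing the incident")
            self.streak, self.alerted = 0, False
            return
        self.streak += 1
        if self.streak >= self.threshold and not self.alerted:
            self.alerted = True
            print(f"ALERT: {self.streak} consecutive failures")

if __name__ == "__main__":
    monitor = ConsecutiveFailureAlert(threshold=3)
    # Simulated results: one blip (no alert), then a real outage (alert), then recovery.
    for result in [True, False, True, False, False, False, False, True]:
        monitor.record(result)
```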

December 8, 2023

Adobe

Global

What Happened?

Starting at 8:00 AM EST on December 8 and lasting until 1:45 AM EST on December 9, Adobe’s Experience Cloud suffered a major outage, affecting multiple services like Data Collection, Data Processing, and Reporting Applications. The outage, which lasted nearly 18 hours, disrupted operations for Adobe’s extensive customer base, impacting businesses worldwide. Catchpoint's Internet Sonar was the first tool to detect the issue, identifying failures in Adobe Tag Manager and other services well before Adobe updated its status page.

Takeaways

Yet another reminder of the fragility of the Internet and another catch for Internet Sonar, which was essential for early detection and rapid response, helping to pinpoint the source of the problem and minimize downtime. The outage also highlights the importance of proactive monitoring and preparedness, as well as the potential financial and reputational costs of service disruptions.

November

October

September

September 20, 2023

Salesforce

Global

What Happened?

On September 20, starting at 10:51 AM ET, Salesforce experienced a major service disruption affecting multiple services, including Commerce Cloud, MuleSoft, Tableau, Marketing Cloud, and others. The outage lasted over four hours, preventing a subset of Salesforce’s customers from logging in or accessing critical services. The root cause was a policy change meant to enhance security, which unintentionally blocked access to essential resources, causing system failures. Catchpoint detected the issue at 9:15 AM ET, nearly an hour and a half before Salesforce officially acknowledged the problem.

Takeaways

Catchpoint’s IPM helped identify the issue well before Salesforce's team detected it, potentially saving valuable time and minimizing disruption. For businesses heavily reliant on cloud services, having an IPM strategy that prioritizes real-time data and rapid root-cause identification is crucial to maintaining internet resilience and avoiding costly downtime.

August

July

June

June 28, 2023

Microsoft Teams

Global

What Happened?

On 28 June 2023, the web version of Microsoft Teams (https://teams.microsoft.com) became inaccessible globally. Users encountered the message "Operation failed with unexpected error" when attempting to access Teams via any browser. Catchpoint detected the issue at 6:51 AM Eastern, with internal tests showing HTTP 500 response errors. The issue was confirmed manually, though no updates were available on Microsoft’s official status page at the time.

May

April

March

February

January

January 25, 2023

Microsoft

Global

What Happened?

On January 25, 2023, at 07:08 UTC/02:08 EST, Microsoft experienced a global outage that disrupted multiple services, including Microsoft 365 (Teams, Outlook, SharePoint Online), Azure, and games like HALO. The outage lasted around five hours. The root cause was traced to a wide-area networking (WAN) routing change. A single router IP address update led to packet forwarding issues across Microsoft's entire WAN, causing widespread disruptions. Microsoft rolled back the change, but the incident caused significant impact globally, especially for users in regions where the outage occurred during working hours.

Takeaways

This outage shows how a single router update can cascade across an entire global WAN. Treat network changes with the same rigor as code changes: validate and stage them, and be ready to roll back quickly when something goes wrong. Independent, outside-in monitoring is equally important, since it shows you which services and regions are affected in real time while the provider works through its fix.

2022

December

December 5, 2022

Amazon

Global

What Happened?

Starting at 12:51 ET on December 5, 2022, Catchpoint detected intermittent failures related to Amazon’s Search function. The issue persisted for 22 hours until December 7, affecting around 20% of users worldwide on both desktop and mobile platforms. Impacted users were unable to search for products, receiving an error message. Catchpoint identified that the root cause was an HTTP 503 error returned by Amazon CloudFront, affecting search functionality during the outage.

Takeaways

Partial outages, even when affecting a small portion of users, can still have serious consequences. Relying solely on traditional monitoring methods like logs and traces can lead to delayed detection, especially with intermittent issues. Being able to pinpoint the specific layer of the Internet Stack responsible for the issue helps engineers troubleshoot and resolve problems faster.
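
Response headers are often enough to tell which layer produced an error like the 503 described here. The sketch below inspects a failing HTTP response and looks for CDN markers in the Via and X-Cache headers (values that are typical of CloudFront but treated here as an assumption) to separate an error served by the CDN edge from one coming back from the origin, or from no response at all.

```python
import urllib.error
import urllib.request

def attribute_error(url: str, timeout: float = 10.0) -> str:
    """Best-effort attribution of an HTTP failure to CDN edge, origin, or the network."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "no error"
    except urllib.error.HTTPError as exc:
        via = exc.headers.get("Via", "")
        x_cache = exc.headers.get("X-Cache", "")
        # Assumption: CloudFront-served errors usually carry these markers.
        if "cloudfront" in via.lower() or "cloudfront" in x_cache.lower():
            return f"HTTP {exc.code} served via the CDN edge (X-Cache: {x_cache!r})"
        return f"HTTP {exc.code} with no CDN markers; look at the origin application"
    except (urllib.error.URLError, OSError) as exc:
        return f"network-level failure before any HTTP response: {exc}"

if __name__ == "__main__":
    # Placeholder URL; point this at the endpoint that is returning errors.
    print(attribute_error("https://www.example.com/"))
```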

November

October

September

August

July

July 8, 2022

Rogers Communications

Canada (Nationwide)

What Happened?

On July 8, 2022, Rogers Communications experienced a major outage that impacted most of Canada for nearly two days, disrupting internet and mobile services. A code update error took down the core network at around 4 AM, affecting both wired and wireless services. The outage disrupted essential services, including 911 calls, businesses, government services, and payment systems like Interac. Some services were restored after 15 hours, but others remained down for up to four days. The incident impacted millions of Canadians, sparking widespread frustration and highlighting the risks of relying heavily on a single telecom provider.

Takeaways

Test thoroughly before deploying network changes and ensure redundancies are in place and effective. Rogers thought they had redundancies, but they failed to work when needed most. Fast detection and resolution are critical. Rogers' slow response led to significant financial losses, reputational damage, and a potential class-action lawsuit.

June

May

April

March

February

February 22, 2022

Slack

Global

What Happened?

On February 22, 2022, at 9:09 AM ET, Slack began experiencing issues, primarily impacting users' ability to fetch conversations and messages. While users could log in, key functionalities were down, leading to widespread disruption. The issue persisted intermittently, affecting productivity for many businesses relying on Slack for communication. Catchpoint tests confirmed errors at the API level, pointing to issues with Slack’s backend services, not the network.

Takeaways

Early detection and real-time visibility into service performance are critical. Being able to quickly diagnose an issue and notify users before the flood of support tickets arrives can significantly reduce downtime and frustration. Monitoring from the user’s perspective is crucial, as it helps detect problems faster and more accurately than waiting for official service updates.

January

2021

December

December 2021

Amazon Web Services (AWS)

Global (across multiple AWS regions)

What Happened?

In December 2021, AWS experienced three significant outages:

1. December 7, 2021: An extended outage originating in the US-EAST-1 region disrupted major services such as Amazon, Disney+, Alexa, and Venmo, as well as critical apps used by Amazon’s warehouse and delivery workers during the busy holiday season. The root cause was a network device impairment.

2. December 15, 2021: Lasting about an hour, this outage in the US-West-2 and US-West-1 regions impacted services like DoorDash, PlayStation Network, and Zoom. The issue was caused by network congestion between parts of the AWS Backbone and external Internet Service Providers (ISPs).

3. December 22, 2021: A power outage in the US-EAST-1 region caused brief disruptions for services such as Slack, Udemy, and Twilio. While the initial outage was short, some services experienced lingering effects for up to 17 hours.

Takeaways

Don’t depend on monitoring within the same environment. Many companies hosting their observability tools on AWS faced monitoring issues during the outages. It’s essential to have failover systems hosted outside the environment being monitored to ensure visibility during incidents.
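
One lightweight pattern for this is a dead man's switch: the monitored environment pushes a heartbeat to somewhere external, and a small watcher running outside that environment alerts when the heartbeats stop. The sketch below shows only the watcher side, with the heartbeat location and staleness window as placeholders.

```python
import time
from pathlib import Path

# Placeholder: wherever the monitored environment last wrote its heartbeat timestamp.
# In practice this would live outside that environment (object storage, a hosted KV store, etc.).
HEARTBEAT_FILE = Path("/var/monitoring/last_heartbeat")
MAX_SILENCE_SECONDS = 300  # placeholder staleness window

def heartbeat_is_fresh() -> bool:
    try:
        last = float(HEARTBEAT_FILE.read_text().strip())
    except (OSError, ValueError):
        return False  # a missing or unreadable heartbeat counts as silence
    return (time.time() - last) <= MAX_SILENCE_SECONDS

if __name__ == "__main__":
    if not heartbeat_is_fresh():
        print("ALERT: no heartbeat from the monitored environment; it may be down "
              "along with any monitoring hosted inside it")
    else:
        print("Heartbeat fresh; monitored environment is reporting in")
```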

November

November 16, 2021

Google Cloud

Global

What Happened?

On November 16, 2021, Google Cloud suffered an outage beginning at 12:39 PM ET, which knocked several major websites offline, including Home Depot, Spotify, and Etsy. Users encountered a Google 404 error page. This outage affected a variety of Google Cloud services such as Google Cloud Networking, Cloud Functions, App Engine, and Firebase. Google’s root cause analysis pointed to a latent bug in a network configuration service, triggered during a routine leader election change. While services were partially restored by 1:10 PM ET, the full recovery took almost two hours.

Takeaways

Monitor your services from outside your infrastructure to stay ahead of problems before customers notice. Tracking your service level agreements (SLAs) and mean time to recovery (MTTR) allows you to measure the efficiency of your teams and providers in resolving incidents.
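
Tracking MTTR and SLA compliance can start from nothing more than a list of incident start and end times. The sketch below uses made-up incidents to compute mean time to recovery and the availability percentage over a reporting period, which you can then compare against the SLA you are owed.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (start, end) pairs pulled from your incident records.
incidents = [
    (datetime(2024, 1, 10, 9, 12), datetime(2024, 1, 10, 10, 47)),  # 1h 35m
    (datetime(2024, 1, 23, 22, 5), datetime(2024, 1, 23, 22, 43)),  # 38m
]

period = timedelta(days=30)  # reporting window
downtime = sum((end - start for start, end in incidents), timedelta())
mttr = downtime / len(incidents)
availability = 100 * (1 - downtime / period)

print(f"Total downtime: {downtime}")
print(f"MTTR: {mttr}")
print(f"Availability over {period.days} days: {availability:.3f}%")
# A 99.95% SLA allows roughly this much downtime in the same period:
print(f"Allowed downtime at 99.95%: {period * 0.0005}")
```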