When the Cloud Falls: The Amazon Outage That Exposed the Internet’s Achilles Heel
How corporate greed, mass layoffs, and institutional amnesia brought down half the internet, and why it will happen again
On Monday, October 20, 2025, at 3:11 AM Eastern Time, the modern internet began to unravel. Amazon Web Services’ US-East-1 region, the crown jewel of cloud computing infrastructure in northern Virginia, started experiencing what the company euphemistically called “increased error rates and latencies.” Within minutes, the digital apocalypse was in full swing.
ChatGPT went silent. Banking apps locked customers out of their accounts. Snapchat’s friend lists vanished. Fortnite players were booted mid-game. Uber couldn’t dispatch drivers. Starbucks’ mobile ordering system collapsed. Ring doorbells stopped recording. Even Amazon’s own shopping site went dark, along with Alexa-powered devices that suddenly became expensive paperweights.
For the next fifteen hours, much of the internet simply didn’t work.
The culprit? DNS resolution issues with DynamoDB endpoints, tech-speak for “the internet’s phone book broke.” But the real story isn’t about the technical failure. It’s about what happens when a corporation prioritizes quarterly earnings over operational excellence, when “efficiency” becomes code for “dangerously understaffed,” and when decades of hard-won engineering knowledge walks out the door because nobody thought it was worth keeping around.
The Anatomy of a Predictable Disaster
Let’s be clear about what happened: this wasn’t a cyberattack, an act of God, or an unforeseeable technical failure. This was a DNS issue, one of the oldest and best-understood failure modes in systems administration. There’s even a haiku about it: “It’s not DNS / there is no way it’s DNS / it was DNS.”
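To make the “phone book” metaphor concrete, here’s a minimal sketch of the check any engineer can run from a laptop: asking the operating system’s resolver for the IP addresses behind DynamoDB’s public US-East-1 hostname. The endpoint name is real; everything else is purely illustrative.

```python
import socket

# DynamoDB's public regional endpoint for US-East-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Ask the system resolver ("the internet's phone book") to translate
    # the hostname into IP addresses, exactly as any SDK or browser would.
    results = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({sockaddr[0] for *_rest, sockaddr in results})
    print(f"{ENDPOINT} resolves to: {addresses}")
except socket.gaierror as exc:
    # Roughly what clients experienced during the outage: the name stops
    # resolving, so requests fail before ever reaching a DynamoDB server.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```

When that lookup fails, every retry, health check, and SDK call built on top of it fails with it, which is why the symptom showed up everywhere at once.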
Amazon Web Services knew this was a recurring weakness. They’d faced DNS failures in 2021 (causing a five-hour outage), in 2023, and now again in 2025. After the 2021 incident, AWS publicly committed to improving their outage notification times and response procedures. They had made the same promises after 2020’s problems. These weren’t unknown unknowns; they were known knowns that happened anyway.
What made this outage particularly damning was the response time. It took AWS engineers 75 minutes just to identify that DNS resolution was the problem. Another 40 minutes passed before they pinpointed DynamoDB as the root cause. For a company that prides itself on operational excellence and whose entire business model depends on five-nines reliability (99.999% uptime), this was the equivalent of a fire department taking over an hour to realize the building was on fire.
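The arithmetic behind “five nines” is worth spelling out, because it shows how far outside its own target AWS landed. A quick, self-contained sketch:

```python
# How much downtime an availability target actually allows per year,
# compared against a 15-hour outage.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    budget = (1 - target) * MINUTES_PER_YEAR
    print(f"{label} availability allows about {budget:.1f} minutes of downtime per year")

outage_minutes = 15 * 60
five_nines_budget = (1 - 0.99999) * MINUTES_PER_YEAR   # roughly 5.3 minutes per year
print(f"A 15-hour outage consumes about {outage_minutes / five_nines_budget:.0f} years of five-nines budget")
```

By that yardstick, a single 15-hour event burns through well over a century’s worth of five-nines downtime budget.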
The technical cascade was textbook: DNS issues with DynamoDB, a “foundational service” that underpins dozens of other AWS services, created a domino effect. When DynamoDB’s DNS failed, it impaired EC2 instance launches. That led to Network Load Balancer health check failures, which cascaded into Lambda, CloudWatch, and multiple other services. US-East-1’s failure didn’t just affect services hosted in northern Virginia; it brought down global features that relied on US-East-1 endpoints, including IAM (Identity and Access Management) and DynamoDB Global Tables.
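One way to see why the blast radius was so large is to model the dependencies described above as a tiny graph and ask what breaks when one node goes down. This is a drastic simplification based only on the relationships named in this article, not AWS’s actual internal service map:

```python
from collections import deque

# Drastically simplified, illustrative dependency map (service -> what it depends on),
# based only on the relationships described in this article.
DEPENDS_ON = {
    "DynamoDB": ["DNS (dynamodb.us-east-1)"],
    "EC2 instance launches": ["DynamoDB"],
    "Network Load Balancer health checks": ["EC2 instance launches"],
    "Lambda": ["Network Load Balancer health checks"],
    "CloudWatch": ["Network Load Balancer health checks"],
    "IAM (global features)": ["DynamoDB"],
    "DynamoDB Global Tables": ["DynamoDB"],
}

def blast_radius(failed: str) -> set[str]:
    """Return every service that transitively depends on the failed component."""
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service, deps in DEPENDS_ON.items():
            if current in deps and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted

# A single DNS failure marks essentially everything in the map as impacted.
print(blast_radius("DNS (dynamodb.us-east-1)"))
```

The point of the exercise isn’t the code; it’s that when a “foundational service” sits near the root of the graph, almost nothing is outside its blast radius.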
Think about that for a moment: a DNS problem in one data center region effectively paralyzed services around the world.
The Brain Drain Nobody Wants to Talk About
Here’s what Amazon won’t tell you: this disaster was entirely predictable, because they’ve been systematically dismantling the expertise that could have prevented it.
Since 2022, Amazon has laid off over 27,000 employees. In July 2025, just three months before this outage, AWS cut hundreds more positions. CEO Andy Jassy sent a memo making it explicit: “We expect that this will reduce our total corporate workforce as we get efficiency gains from using AI extensively across the company.”
Translation: we’re replacing experienced engineers with algorithms and hoping for the best.
The numbers tell a brutal story. Internal Amazon documents reveal that the company suffers from 69% to 81% regretted attrition across all employment levels. “Regretted attrition” is corporate-speak for “people we really didn’t want to lose but lost anyway.” When your best engineers are fleeing en masse and you’re simultaneously conducting mass layoffs, you’re not building resilience; you’re building a ticking time bomb.
In late 2023, Justin Garrison, a senior AWS engineer, left the company and published a scathing blog post warning that AWS was seeing an increase in Large Scale Events (LSEs) and predicting major outages in 2024 and beyond. He noted that “in my small sphere of people, there wasn’t a single person under an L7 [Principal level] that didn’t want out.”
By summer 2025, industry insiders were reporting that AWS planned to cut 10% of its workforce by year-end, with approximately 25% of those cuts targeting Principal-level engineers: the most experienced senior technical leaders, many of whom have been with the company for years or decades.
These aren’t replaceable cogs. Principal engineers at AWS are the people who remember that when DNS starts acting wonky, you check that seemingly unrelated system in the corner because it historically contributed to outages five years ago. They’re the ones who know which undocumented dependencies exist between services. They’re the institutional memory that prevents 75-minute diagnostic delays during a crisis.
As cloud expert Corey Quinn wrote in The Register: “You can hire a bunch of very smart people who will explain how DNS works at a deep technical level, but the one thing you can’t hire for is the person who remembers that when DNS starts getting wonky, check that seemingly unrelated system in the corner, because it has historically played a contributing role to some outages of yesteryear.”
When that tribal knowledge walks out the door, whether through layoffs, burnout, or deliberate attrition, you’re left with smart people who have to reinvent expertise that used to exist in-house. The new, leaner, presumably cheaper teams lack the institutional knowledge to prevent these outages or significantly reduce time to detection and recovery.
There was a time when Amazon’s famous “Frugality” leadership principle meant doing more with less. Now it means doing everything with basically nothing.
The US-East-1 Problem: Too Big to Fail, Too Critical to Secure
US-East-1 isn’t just another AWS region. It’s the original, the largest, and arguably the most critical piece of internet infrastructure on Earth. Located in northern Virginia, US-East-1 is often the default deployment location for new AWS services and customers. Its sheer scale means that when it goes down, the blast radius is global.
The region has become what engineers call a “single point of failure,” and everyone knows it. The 2017 US-East-1 outage felt like it took down most of the internet. The 2021 incident is still cited as the biggest AWS disruption in history. Now we can add 2025 to that ignominious list.
The obvious question is: why don’t companies use multi-region deployments with automatic failover to protect against this exact scenario?
The answer is simple: money.
Running truly redundant infrastructure across multiple regions is expensive. You’re paying for compute, storage, and bandwidth in multiple locations. You need sophisticated orchestration to handle failovers. You need to design your applications to be genuinely region-agnostic, which requires significant engineering investment. Many companies, especially startups and small-to-medium businesses, simply can’t afford it or don’t see the ROI until disaster strikes.
So they roll the dice, deploy to US-East-1, and pray that Amazon’s infrastructure holds. Until Monday, that prayer seemed to be working most of the time.
The bitter irony is that AWS sells high-availability solutions. They offer multi-region deployments, auto-scaling, and disaster recovery services. They publish best-practice documentation about building resilient systems. But they apparently didn’t follow their own advice for their own foundational services, because a DNS failure in one region shouldn’t be able to cascade globally the way this one did.
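To be fair to the customers rolling the dice, even the crude client-side version of multi-region resilience adds cost and complexity. As a rough illustration of the pattern (not a production recipe), here’s what a fallback read might look like with the standard boto3 SDK; the table name, key schema, and region pairing are hypothetical, and a real deployment would typically lean on DynamoDB Global Tables plus Route 53 health checks rather than hand-rolled retries:

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical table replicated to a second region (e.g. via Global Tables).
TABLE_NAME = "orders"                  # illustrative name
REGIONS = ["us-east-1", "us-west-2"]   # primary first, then fallback

def get_order(order_id: str) -> dict | None:
    """Read one item, falling back to the secondary region if the primary is unreachable."""
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            response = client.get_item(
                TableName=TABLE_NAME,
                Key={"order_id": {"S": order_id}},
            )
            return response.get("Item")
        except (EndpointConnectionError, ClientError) as exc:
            # DNS failures typically surface here as connection errors;
            # remember the error and try the next region.
            last_error = exc
    raise RuntimeError(f"All regions failed: {last_error}")
```

Every line of that fallback has a price tag attached: duplicated data, duplicated traffic, and application code that has to know more than one region exists.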
The Consolidation Crisis: When Three Companies Control the Internet
Here’s the uncomfortable truth that Monday’s outage laid bare: the modern internet runs on a dangerously consolidated infrastructure. AWS, Microsoft Azure, and Google Cloud Platform collectively control about two-thirds of the global cloud computing market, with AWS alone accounting for roughly 32%.
When AWS sneezes, the internet catches pneumonia.
This isn’t theoretical anymore. In July 2024, a faulty CrowdStrike software update crashed millions of Windows machines worldwide, grounding thousands of flights and causing chaos across industries. Now AWS has shown us that a DNS glitch in northern Virginia can knock out banking, gaming, social media, government services, and e-commerce simultaneously.
The economist would argue that this consolidation reflects efficient market dynamics: these companies succeeded because they offered the best services. The cynic would note that once you’ve migrated your entire infrastructure to AWS, switching to another provider is so expensive and technically complex that you’re essentially locked in. The realist understands that both things can be true.
But the infrastructure expert looks at this situation and sees a systemic risk that dwarfs any individual company’s problems. We’ve created an internet where three corporations control the foundational infrastructure, and we’re all just hoping they don’t screw up simultaneously.
What’s the alternative? Regulatory intervention to break up these cloud giants? Mandated interoperability standards? Government-funded public cloud infrastructure? None of these solutions seem politically feasible or technically straightforward. So we’re left with the status quo: crossing our fingers and hoping that Amazon, Microsoft, and Google keep the lights on.
The Economics of Acceptable Failure
Let’s do some back-of-the-envelope math. Monday’s outage lasted approximately 15 hours from first report to full resolution. During that time, thousands of companies lost productivity, revenue, and customer trust.
Coinbase, the cryptocurrency exchange, couldn’t process trades during a volatile market period, potentially costing it millions in transaction fees and eroding user confidence. Robinhood faced similar issues. Roblox and Fortnite, platforms with millions of daily active users, went offline during peak gaming hours. Airlines struggled with reservations and check-ins. Retail apps couldn’t process mobile orders.
Conservative estimates put the total economic impact in the billions of dollars. Uber drivers who couldn’t get dispatches. Restaurants that couldn’t process delivery orders. Smart home devices that became dumb. Productivity software that couldn’t be accessed. The cascade effects are almost impossible to calculate.
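“Back-of-the-envelope” is meant literally here: the estimate is just hours multiplied by affected businesses multiplied by an average hourly loss. The figures below are deliberately invented placeholders, not reported numbers, and they’re only there to show how quickly the total reaches billions:

```python
# Deliberately invented placeholder figures (not reported numbers).
outage_hours = 15
affected_businesses = 2_000       # hypothetical count of materially affected companies
avg_hourly_loss_usd = 50_000      # hypothetical blended revenue + productivity loss per company

estimated_impact = outage_hours * affected_businesses * avg_hourly_loss_usd
print(f"Estimated impact: ${estimated_impact:,}")   # $1,500,000,000 with these assumptions
```

Change any of those assumptions by a factor of a few and the total swings accordingly, which is exactly why serious estimates stay deliberately vague.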
And yet, Amazon’s stock price barely moved. AMZN closed Monday essentially flat. The market shrugged.
Why? Because investors understand that AWS generates massive profits, with operating margins around 30%, compared to the razor-thin single digits of Amazon’s retail business. Even with occasional outages, AWS remains a cash cow. The math works out: occasional disasters are cheaper than maintaining the engineering talent and infrastructure redundancy that would prevent them.
This is the perverse incentive structure of modern tech: outages are priced into the model. They’re acceptable sacrifices on the altar of short-term profit. Leadership gets to report “efficiency gains” from workforce reductions and AI automation, Wall Street applauds the improved margins, and meanwhile the engineers who remain are working 3 AM emergency escalations, trying to patch systems held together by tribal knowledge that has already left the building.
Former Amazon employees describe the recent layoffs as “cold and soulless”: unheralded emails announcing that positions had been eliminated, with decisions “based largely on titles and high-level optics rather than a nuanced understanding of roles, skills, or actual overlap in responsibilities.”
This is what happens when MBA-trained efficiency experts optimize for quarterly earnings instead of operational resilience. This is what happens when you treat senior engineers as replaceable expense line items instead of irreplaceable repositories of institutional knowledge. This is what happens when “doing more with less” becomes “doing everything with basically nothing.”
What Comes Next
Amazon will call this an “isolated incident.” They’ll publish a detailed post-mortem. They’ll promise improvements to their monitoring, alerting, and response procedures. They’ll probably announce some new reliability initiative with a catchy internal name.
And then it will happen again.
Not because the technology is flawed; DNS has been a solved problem for decades. Not because the engineers are incompetent; the people still working at AWS are among the best in the industry. It will happen again because the incentive structures that created Monday’s disaster remain unchanged.
AWS will continue bleeding senior talent. The mass layoffs will continue. The return-to-office mandates that drove experienced engineers to competitors will continue. The pressure to replace human expertise with AI tools that aren’t ready for production will continue. The expectation that remaining staff work mandatory overtime to compensate for understaffing will continue.
The pattern is clear, and unless something fundamental changes, the next outage is already brewing. Maybe it won’t be DNS next time. Maybe it’ll be a cascading failure in EC2’s control plane. Or a distributed systems bug that only manifests at massive scale. Or a configuration error that nobody with institutional memory is left to catch.
What we learned Monday isn’t that technology is unreliable; we knew that already. What we learned is that the internet’s foundational infrastructure is being operated by companies that have decided operational excellence is less important than cost efficiency, that institutional knowledge is expendable, and that occasional disasters causing billions in economic damage are simply the cost of doing business.
The question isn’t whether this will happen again. The question is how bad the next one will be, and whether we’ll finally do something about the dangerous consolidation and systematic undermining of expertise that made it possible.
Until then, we’re all just hoping that the cloud doesn’t fall on our heads.
The Bottom Line
The October 2025 AWS outage wasn’t a technology failure. It was a management failure, a policy failure, and a system design failure that’s been years in the making. When you:
Lay off 27,000 employees including senior engineers with irreplaceable institutional knowledge
Replace experienced staff with AI tools that aren’t ready for production-critical work
Maintain a single-region architecture for global services despite knowing it’s a point of failure
Prioritize quarterly earnings over operational resilience
Create a market structure where three companies control two-thirds of cloud infrastructure
...you don’t get to act surprised when DNS issues take 75 minutes to diagnose and 15 hours to fully resolve.
This is the internet we’ve built: held together by corporate profit optimization, understaffed engineering teams, and the prayer that nothing breaks at 3 AM on a Monday morning.
So far, we’ve been lucky. The outages have been measured in hours, not days. The economic damage has been billions, not trillions. No planes fell from the sky. No critical infrastructure failed catastrophically.
But luck eventually runs out. And when it does, we’ll look back at Monday, October 20, 2025, and realize it was a warning we chose to ignore.
What do you think? Are cloud providers doing enough to maintain reliability, or have they optimized for profit at the expense of resilience? Drop a comment below or share this article if you think more people need to understand the fragility of our digital infrastructure.
If you found this analysis valuable, subscribe for more deep dives into tech accountability, infrastructure resilience, and the hidden costs of corporate efficiency drives. Next week: examining the CrowdStrike disaster and what it reveals about software update procedures.



