Worst Case

[This fragment is available in an audio version.]

Suppose you’re running your organization’s crucial apps in the cloud. Specifically, suppose you’re running them on AWS, and in particular in the “us-east-1” region. Could us-east-1 go away? What might you do about it? Let’s catastrophize!

Acks & disclaimers · First, thanks to Corey Quinn for this Twitter thread, which got me thinking.

Second, while I worked for AWS for 5½ years, I’ve never been near a data center, nor do I have any inside information about the buildings, servers, or networking. On the other hand, I do have a decent understanding of AWS culture and capabilities in software engineering and operations. Bear those facts in mind as you read this.

Finally, since this blog fragment concerns itself entirely with catastrophic scenarios, I’ll try to be cheerful about it.

[Those of you who know what us-east-1 is can skip over the next section to the first entertaining disaster.]

“us-east-1”? · AWS means “Amazon Web Services”, Amazon’s insanely huge ($60B/year revenue) and profitable (~30% margin) collection of cloud-computing services. Basically, AWS will rent you computers and databases and the use of many other software services. So more or less everything your IT department owns can be rented by the hour (or second) rather than installed in your own data center.

If you’re using AWS, you have to pick one (or more) of its (24, as I write) “regions” to host your systems. They have boring names like “us-west-2” (Portland) and “ap-northeast-1” (Tokyo).

“us-east-1” (N. Virginia) is generally thought to be the biggest region, by a huge margin. There have been estimates that 30% of all Internet traffic flows through it. Here’s AWS’s official write-up and here’s a nice Atlantic story by a person who drove around Northern Virginia looking for the actual buildings.

Before we leave the subject, I should say that each AWS region is divided into multiple “availability zones” (AZ’s), data centers that are independently operated and geographically separated, so to really lose a whole region, you’d have to take all of them out.

If us-east-1 went off the air, it would be Really Bad. How could that happen?

Threatening clouds.

Terrestrial disaster · This is the first one anybody thinks of.

Suppose a big late-summer hurricane somehow misses Florida and Texas, cruises north offshore picking up energy from an anomalously-warm western Atlantic, turns left just south of DC, and savages anywhere that’s easy driving distance from Dulles airport. We’re talking about inches of rain in a few hours so every waterway floods; also, high winds and lightning are playing hell with the electrical and network infrastructure.

The other obvious candidate would be an earthquake, which can ravage infrastructure to a degree unequaled by any other flavor of natural catastrophe. Among other things, the Potomac bridges and lots of freeway overpasses would be rubble, so your ability to bring help in would be severely reduced.

If you’re the unlucky proprietor of systems hosted at us-east-1, they’d be off the air, and while AWS would probably arrange to answer your distress call, there’s really not much that could be done. How would your business do if it were off the air for, uh, nobody really knows how long?

How much should you worry? · This one worries me less than a lot of the other scenarios here. First off, the hurricane scenario is so utterly predictable that I bet anyone with a significant data-center presence in the region has been planning and wargaming around this one for at least a decade.

Modern data centers all come with self-contained backup generators and some sort of power-bridging gear, so assuming the water doesn’t actually get in and flood the equipment rooms, things should be fine. You’d expect Internet-provider outages as well, but once again, modern data centers strive for redundant connections and are built in places where there are multiple providers, so all of those providers would have to go down for a data center to go completely off the air.

Having said that, the climate is changing and possibly, everything we know about that storm system will turn out to have been wrong.

The earthquake scenario is tougher, but fortunately that’s not a seismically active zone.

Also bear in mind that the availability-zone architecture is going to help you. You can imagine one data centre’s backup power failing to operate, but it’d be really unlikely for that to happen in all the AZ’s.
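To put rough numbers on that intuition, here is a minimal sketch in Python. It assumes each AZ’s backup power fails independently and with the same probability; both the independence and the 1% figure are invented for illustration and are not AWS data.

```python
def p_all_zones_fail(p_single: float, zones: int) -> float:
    """Chance that every AZ fails together, IF the failures are independent."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return p_single ** zones


if __name__ == "__main__":
    # Hypothetical per-AZ failure probability, chosen only for illustration.
    for zones in (1, 3, 6):
        print(f"{zones} AZ(s): {p_all_zones_fail(0.01, zones):.0e}")
```

Anything that hits several AZ’s at once (the same flood, say) breaks the independence assumption and pushes the real number up, so the separation between the zones matters at least as much as how many there are.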

I’m not sure this is much consolation, but: If an event of this scale occurs, you’re not going to be the only operation that’s off the air. Probably, quite a lot of the United States government would be in the same boat. So while your customers and employees are going to be mad, they’re also going to be distracted from worrying about your downtime.

Extraterrestrial disaster · What about devastation raining down from space?

Sophie Schmieg is a high-level cryptography/security Googler, and Knows What She’s Talking About. She refers to the Carrington Event, a major solar storm (“Coronal Mass Ejection” they say) that happened in 1859, and severely disrupted the world’s telegraph system for about eight hours. This is an example of a Solar proton event. If/when one happens, it’s going to seriously suck for astronauts and for anyone who depends on aerial radio-frequency communications. How hard will it hit modern data-center and Internet infrastructure? The deepest dive on the subject seems to be Solar Superstorms: Planning for an Internet Apocalypse (PDF) by Sangeetha Abdu Jyothi.

Physicists I’ve talked to say “Yeah, that’s gonna happen someday.” Bear in mind that since the duration is measured in hours, we might get lucky and find us-east-1 facing away from the sun.

How much should you worry? · I figure that this is actually a more likely disaster scenario for us-east-1 than either the hurricane or the earthquake. But I’ve got no special insights into how much it will hurt. In Abdu Jyothi’s paper, she offers lots of specific recommendations about how to solar-storm-proof the infrastructure. How much have the operators of us-east-1 tried, and how well will their efforts work? We don’t know.

However, as with the terrestrial disasters, your personal pain may not matter that much. After all, as Abdu Jyothi points out, “A recent study … which analyzed the risks posed by a Carrington-scale event to the US power grid today found that 20 - 40 million people could be without power for up to 2 years, and the total economic cost will be 0.6 - 2.6 trillion USD.” So… there’s not going to be much leftover attention for your little outage.

Labor unrest · It’s increasing around the globe as multiple decades of increasing inequality in wealth and power bite down harder and harder. Also, it may turn out that Covid has disturbed the balance of power between the working and owning classes. A wave of Big Tech unionization would be surprising, but not that surprising.

So here’s the scenario: Some group of employees whose services are essential for the operation of us-east-1 wins a unionization vote and starts trying to negotiate a contract with AWS, because they’re looking at that 30% margin on the tens of billions in revenue.

Unsurprisingly, Amazon goes all hard-ass, explains that unionization is incompatible with Day One thinking and Amazon Leadership Principles, and refuses to talk. So they take a strike vote, and on one fine spring day, don’t come to work. Nobody’s watching the graphs, whether those are graphs of electrical-supply stability, fiber-repeater failures, or data-storage latencies. How long does us-east-1 stay operative? I have no idea. But it’s a terrifying scenario.

It’s going to be difficult to explain to your customers that you can’t service them because of a labor dispute between a company they’re not dealing with and a union that doesn’t contain any of your employees.

How much should you worry? · Not at all. This will never happen.

Let’s ignore the passion and fury with which Amazon will resist unionization, and suppose hypothetically that things proceed as described, the strike vote passes, and it’s becoming apparent that several thousand essential workers are absolutely not going to show up on a near-future morning. What happens? Amazon caves instantly and does whatever it takes to come to a settlement with the workers.

The company is always talking customer obsession and that’s no BS, they really mean it. Failing to provide services that customers pay for and rely on because of internal management failure (and this is one of those) is violently antipathetic to Amazon culture. So they just won’t let it happen.

AWS software or operational failure · I’m talking about something like what happened to Facebook this month: For reasons that nobody who’s not a serious software geek can understand, us-east-1 suddenly vanishes from the network. Or is still on the network but is refusing all requests. Or is accepting requests but timing them out. Or is accepting requests but returning empty answers.

Once again, you’re in a bad spot when you have to explain to your customers that you’re off the air because you made a bet on a provider who couldn’t deliver the goods.

How much should you worry? · I’m not going to say this could never happen. But I’d be shocked. AWS has been doing cloud at scale for longer than anyone, they have the most experience, and they’ve seen everything imaginable that could go wrong, most things multiple times, and are really good at learning from errors.

Also, AWS has a powerful and consciously-constructed culture of operational excellence based on extreme paranoia. To be honest, I’m just the tiniest bit concerned over the recent departure of Charlie Bell, because he, more than anyone else, deserves credit for building and maintaining that culture. But it runs very deep.

War · It doesn’t seem likely that foreign attackers are going to swarm ashore on the Virginia beaches and send tank battalions through the industrial parks to blow up us-east-1. So maybe you don’t need to worry?

But wait; how about civil war? Let’s see; suppose Trump wins the Republican nomination for 2024, and runs on a rabble-rousing campaign of Revenge For The Steal, and explicitly rallies the Proud Boys, Oath Keepers, Sovereign Citizens, Three Percenters, Groypers, and police unions, telling them, “We can’t lose in a fair election, so if we do, let’s not let them steal it again.”

His election rallies are stuffed with Second-Amendment fanatics brandishing assault weapons. Every debate and campaign interview features questions along the lines of “If you lose, will there be an insurrection?” The majority of voters are out of patience with Trump and vote in Kamala Harris by a decent popular margin, but once again it’s a squeaker in the Electoral College.

The Trump supporters scream “Steal!” and launch a march on Washington; it turns out they have support from significant factions in the police forces and the US armed forces. Northern Virginia becomes a key strategic battleground, and both sides deploy heavy artillery…

OK, that’s a little far-fetched (I hope). Here’s another scenario: Beijing launches an invasion of Taiwan and the US comes to its defence. China’s cyberwar apparatus turns out to have discovered multiple zero-day attacks against Internet exchanges, poison pills that knock BGP off the air and keep it from coming back up. In this scenario, us-east-1 may be up and running, but nobody can reach it.

How much should you worry? · Probably not very much. Like the hurricane or solar storm, your problems are going to vanish in the static.

Enemy action · In this scenario, the Bad Guys (who knows, maybe those Chinese cyberwarfighters I just mentioned) figure out some combination of poison pills and DDOS and Linux kernel zero-days to knock over us-east-1 and keep it that way.

Once again, there you are explaining to your customers why AWS’s incompetence is screwing up their lives.

How much should you worry? · Not at all; I just can’t see this happening. I remember an AWS meeting with a customer looking at moving to the cloud, who asked “What about DDOS attacks?” The Amazon executive in the room said “Yeah, there’s probably three or four of those going on right now, they’re a cost of doing business for us.” There’s nobody in the world with more experience than AWS in dealing with this kind of crap.

But there’s a bigger reason. The vast majority of hackers are in it for the money, and they know perfectly well that AWS has one of the best-defended attack surfaces on the planet. So it’s in their interest to go after softer targets; big companies with juicy customer lists and password files and so on who aren’t minding their perimeters.

Note: You might be one of those big companies; while AWS is generally secure, it’s possible to run insecurely on it. So while the Bad Guys might come after you, they’re almost certainly not going to go after us-east-1 as a whole.

Public legal risk · It seems quite unlikely that any force of nature or criminal action could wipe out us-east-1. How about the US Government? Bear in mind that Republicans hate Amazon because of Bezos’s Washington Post and because the whole tech industry is (somewhat correctly) perceived as progressive.

Suppose Trump or some guttersnipe like Cruz or DeSantis wins the Presidency in 2024, and the Republicans control Congress. Could AWS survive a US Federal legal move that forced a us-east-1 shutdown? Could it even survive a continuous credible threat of such a thing happening? The temptation might be too much for the GOP goons.

How much should you worry? · I would. But in a more general way; the existential peril to the USA following on the exercise of power by the Trumpist faction seems to me very severe, not something that can be ignored. So I would be watching which PACs I donated money to, and encouraging grassroots political activism to stave off the wreckage before it happens.

But then, I’m on the respectable left of the Canadian political spectrum, which makes me a raving Commie by US standards.

Surviving · Let’s assume you’re not going to wait for us-east-1 to come back, you want to resume operations elsewhere. So, you need to pick another region. Depending on which scenario worries you the most, you might want to be (as Sophie Schmieg suggested) in a different hemisphere, or if you’re worried about political/legal risks, at least a different jurisdiction.

The best thing you could possibly do is, don’t wait: Run “active-active”, which is to say have your application live in both regions all the time. Netflix kind of wrote the book on this; for example, consider this 2013 write-up. I’ll be honest: I don’t know if Netflix has ever actually failed over in the face of an actual region outage. But their thinking is correct: The only way you can be sure that your backup region will run in production is by running it in production.
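A toy illustration of the active-active idea, in Python: every request goes to one of the regions that currently passes a health check, so both regions carry production traffic all the time and a failed region simply drops out of the pool. The region names, the `probe` callback, and the hashing scheme are all assumptions made for this sketch; real deployments usually make this decision in DNS or a global load balancer, not in application code.

```python
import zlib
from typing import Callable, Sequence


def pick_region(
    regions: Sequence[str],
    probe: Callable[[str], bool],
    request_key: str,
) -> str:
    """Route a request to one of the regions that passes the health probe."""
    live = [region for region in regions if probe(region)]
    if not live:
        raise RuntimeError("no healthy region available")
    # A stable hash of the key spreads traffic across the live regions
    # and gives the same answer for the same key on every host.
    index = zlib.crc32(request_key.encode("utf-8")) % len(live)
    return live[index]


if __name__ == "__main__":
    regions = ["us-east-1", "eu-west-1"]  # hypothetical pair of regions
    east_is_down = lambda region: region != "us-east-1"
    print(pick_region(regions, east_is_down, "customer-42"))
```

With both regions healthy, traffic is split between them; when the probe for one starts failing, everything lands on the other without a separate failover step.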

But let’s suppose you’re less ambitious; you’re not going to try to keep operations running continuously in the case of a failed region, you just need to be able to get back on the air in a reasonable amount of time, probably accepting that some transactions happening just as disaster struck might get lost.

Your app inventory, if it’s typical, probably includes virts running your code, along with load-balancing and fire-walling gear, and your code accesses a variety of services such as messaging systems and databases and serverless stuff. Let’s assume you’ve got your configurations all stored as code with Terraform or CloudFormation or whatever, so that if you needed to rebuild the system from scratch, you could. You do, right? Seriously, given that, if us-east-1 got blown to hell and you have a copy of the config code, revivifying your app is plausible.
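One cheap sanity check on that config code, sketched in Python: scan it for hard-coded references to the region you’re trying to escape. This is an illustrative helper, not a feature of Terraform or CloudFormation; the function name and the sample file contents are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches the region name and its AZ names (us-east-1a, us-east-1b, ...).
PINNED_REGION = re.compile(r"\bus-east-1[a-f]?\b")


def find_pinned_regions(files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for each hard-coded region."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PINNED_REGION.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = {
        "main.tf": 'provider "aws" {\n  region = "us-east-1"\n}\n',
        "vars.tf": 'variable "region" {}\n',
    }
    for name, lineno, line in find_pinned_regions(sample):
        print(f"{name}:{lineno}: {line}")
```

Anything this turns up (a region string, an AZ name, an ARN) is something you would have to edit by hand in the middle of an outage, so it is better turned into a parameter now.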

Then there’s your data, which lives in some combination of databases, filesystems, and S3.

S3 has had region-to-region replication built in for a long time, and clearly people at AWS have been thinking about this; consider Introducing Multi-Region Asynchronous Object Replication Solution.
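For a sense of what the built-in replication looks like, here is a hedged sketch using boto3, the AWS SDK for Python. The bucket names and the IAM role ARN are placeholders, the single rule replicates every object, and both buckets are assumed to already have versioning enabled, which replication requires. Treat the rule shape as a starting point to check against the current S3 documentation, not as a vetted configuration.

```python
def replication_config(role_arn: str, dest_bucket: str) -> dict:
    """Build a one-rule replication configuration covering all objects."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket: str, role_arn: str, dest_bucket: str) -> None:
    """Apply the configuration; needs boto3 and AWS credentials to run."""
    import boto3  # third-party; imported here so the sketch loads without it

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket),
    )
```

The destination bucket would live in whichever backup region you picked; a call might look like `enable_replication("my-app-data", "arn:aws:iam::123456789012:role/replication", "my-app-data-backup")`, with all three values invented here.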

If it were me in my ideal world, I’d have copies of everything stored in S3 because of its exceptional durability; I sincerely believe there is no safer place on the planet to save data. Then I’d have a series of scripts that would rehydrate all my databases and config from S3, reconfigure all my code, and fire up my applications. I’d test this script regularly; any more than a few weeks untested and I’d lose confidence that it’d work.
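The “test it regularly” part is easy to let slide, so it helps to make staleness something a machine can nag you about. A minimal sketch in Python, assuming you record the time of the last successful restore drill somewhere; the three-week threshold, the function names, and the step names are my own illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Sequence, Tuple

MAX_DRILL_AGE = timedelta(weeks=3)  # assumption: one reading of "a few weeks"


def drill_is_stale(
    last_success: datetime,
    now: datetime,
    max_age: timedelta = MAX_DRILL_AGE,
) -> bool:
    """True if the last successful restore drill is older than max_age."""
    return now - last_success > max_age


def run_drill(steps: Sequence[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run the restore steps in order; an exception from any step stops the drill."""
    completed = []
    for name, step in steps:
        step()  # each step is expected to raise on failure
        completed.append(name)
    return completed


if __name__ == "__main__":
    steps = [
        ("rehydrate databases from S3", lambda: None),  # placeholder steps
        ("reconfigure code", lambda: None),
        ("start applications", lambda: None),
    ]
    print(run_drill(steps))
    last = datetime(2021, 9, 1, tzinfo=timezone.utc)
    print(drill_is_stale(last, datetime.now(timezone.utc)))
```

Hooking `drill_is_stale` up to an alert means an untested restore path becomes a visible problem instead of a surprise on the worst possible day.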

Anyhow… · We probably won’t lose us-east-1. I’m not absolutely 100% sure that these scenarios are even worth thinking about, in a strictly economic sense. But if I were running a big important app, I wouldn’t be able to not think about it.

Contributions

From: Jon Stewart (Oct 10 2021, at 18:05)

Virginia had a 5.8 earthquake in 2011, with the epicenter right about between Richmond and Charlottesville. “Meh, 5.8,” you scoff. But because the geology of the east coast is so different than the west, the quake traveled far, far longer than a 5.8 on the west coast would. It caused pretty severe damage to the Washington Monument and the National Cathedral.

https://en.m.wikipedia.org/wiki/2011_Virginia_earthquake

While exceedingly rare, an earthquake in northern Virginia is likely to have a much greater area of effect than one on the west coast, and building codes there aren’t designed for earthquakes. A data center might have more robust building standards but it still depends on other infrastructure to be effective. So, it’s not inconceivable that a magnitude ~7 earthquake in NoVA could put a real crimp in us-east-1.