This is eventually about the public cloud and Open Source, but — apologies in advance — takes an indirect and long-ish path.
In AWS engineering, we develop stuff and we operate stuff. I think the second is more important.
We have good hardware and software engineers, and infrastructure that feels pretty magic to me (faves: the racks and networking gear, the consensus manager underlying QLDB, and the voodoo that makes S3 go). But, like Bill Joy said, “Wherever you work, most of the smart people are somewhere else”, so I’m not gonna kid myself that we’re magically unique at programming.
But on the operations side, the picture is really unique. First of all, there are very few places in the world where you can get operational experience at this scale. Second, AWS doesn’t run on SRE culture; the same engineers who write the code live by the dashboards and alarms and metrics that try to reflect and protect the customers’ experience (not perfectly, but we make progress).
The obsessive focus on operational excellence isn’t subtle and it’s not a secret. There’s been a re:Invent presentation about how we run our ops meetings and we even open-sourced the AWS Ops Wheel.
But it’s not all meetings. We build and deploy a lot of technology with no direct connection to any feature or function or API that a customer will ever see. These are all about having the right dashboards, and being able to extract the key ratio from petabytes of logs, and predicting what might melt down before it even gets warm.
The asshole ratio · I’ve already written that at AWS, it’s lower than I experienced at other BigTech outfits. Here’s why this is relevant: There is plenty of evidence that you can be a white-hot flaming asswipe and still ship great software. But (going out on a limb) I don’t think you can be an asshole and be good at operations.
Because ops requires being humble in the face of the evidence, acknowledging fallibility, assuming that the problem is your problem even when quite likely it’s not, and always being eager to investigate theories B, C, and D even when you’re pretty sure your current theory A is right-on. Since problems in complex services are almost never solved by a single individual’s efforts, you have to be good at working with people under pressure.
Those LPs · I have a hypothesis about that good ratio and it involves the Amazon Leadership Principles (we just say LPs). I’ve gotten flak from friends who think having such things is lame and corny. But in practice they turn out to be useful, and to explain how, I’m going to take a side-trip into modern clinical medicine.
There’s this guy Atul Gawande, a surgeon and writer whose work I’ve admired, mostly in The New Yorker, for years. I recommend pretty well anything he writes but in particular I recommend The Checklist Manifesto. Do me (and yourself) a favor, follow that link and read the Malcolm Gladwell review excerpt. From which:
“…the routine tasks of surgeons have now become so incredibly complicated that mistakes of one kind or another are virtually inevitable: it’s just too easy for an otherwise competent doctor to miss a step, or forget to ask a key question or, in the stress and pressure of the moment, to fail to plan properly for every eventuality.” [Sounds just like updating a million-TPS Web Service. -Tim] “Gawande then visits with pilots and the people who build skyscrapers and comes back with a solution. Experts need checklists–literally–written guides that walk them through the key steps in any complex procedure.”
Well, one insanely-complex routine task that we do all the time is hiring. You know what the LPs are at hiring time? A checklist. Now even the typical all-day interview marathon isn’t gonna reliably dig into every LP, but we do an acceptable job of taking a close look at enough of them. I believe that’s very helpful in bringing down the asshole ratio.
Open Source · Which brings me to the touchy subject of the relationship between Cloud Providers and Open Source. We and our competitors have made a good business of infrastructure operations, keeping service-oriented software servicing: reliably, durably, 24/7/365. The core EC2 business is about operating Linux boxes and IP networking at extreme scale, efficiently enough that we can rent them out at an attractive price and still make a buck.
In recent Open-Source years, some very gifted people have created wonderful pieces of software — Kafka, ElasticSearch, Mongo — and taken a new course, launching VC-financed companies to monetize with service and support. Then sometimes they find themselves competing with multiple public-cloud providers.
I have a load of sympathy for the virtuoso engineers who created these wonderful pieces of work. But here’s the thing: I have at least as much for the customers who (let’s take Kafka for an example) just need reliable high-performance streaming. A direct quote: “I’ll cheerfully pay monthly to never worry about Zookeeper again.”
On the other hand, I have little sympathy with modern VC-driven business models.
It’s like this: The qualities that make people great at carving high-value software out of nothingness aren’t necessarily the ones that make them good at operations. This has two unfortunate effects: They don’t necessarily have the right skills to build and run a crack operations team, and they might not manage to get a job at an operations-obsessed company.
I have recent personal experience with failing to hire a senior committer to a well-known OSS project, and also with paying an “open-source company” for tech support when we were spinning up a service around a package we didn’t know very well. Both of these left me unhappy.
Jack and Jonathan · Let me tell you a story. Sometime around 2008, Jonathan Schwartz, then the CEO of Sun Microsystems, and I made a sales call on Jack Dorsey at Twitter. Sun had acquired MySQL and Twitter was using the hell out of it. We wanted them to start paying us for support; after all, they were existentially dependent on this technology and everyone knew that serious Enterprises would never use unsupported software.
Jack was nice, and listened to our pitch, but we didn’t get the business.
And while, as a career software guy, I entirely love open-source culture and technologies and methods, the hypothesis that Open Source in and of itself constitutes a business model is not well supported by the evidence.
Which way forward? · Google Cloud’s recent Open Source partnerships are interesting. I look at that list of companies and it’s not obvious to me that they’re going to offer better operational excellence than Google’s, but maybe I’m wrong. It’s an interesting and probably useful experiment.
At the end of the day I’m not that worried. Most of us who’ve open-sourced stuff love the creative process for its own sake, and love touching and improving other engineers’ lives. The skillset evidenced by having done so will probably help you get really good jobs. Yeah, you might not get to be a Bay Area Unicorn. But you probably weren’t going to anyhow.