I recently watched Build an enterprise-grade service mesh with Traffic Director, featuring Stewart Reichling and Kelsey Hightower of GCP, and of course Google Cloud’s Traffic Director. Coming at this with a brain steeped in 5½ years of AWS technology and culture was surprising in ways that seem worth sharing.
Stewart presents the problem of a retail app’s shopping-cart checkout code. Obviously, first you need to call a payment service. However it’s implemented, this needs to be a synchronous call because you’re not going to start any fulfillment work until you know the payment is OK.
If you’re a big-league operation, your payment processing needs to scale and is quite likely an external service you call out to. Which raises the questions of how you deploy and scale it, and how clients find it. Since this is GCP, both Kubernetes and a service mesh are assumed. I’m not going to explain “service mesh” here; if you need to know, go and web-search some combination of Envoy and Istio and Linkerd.
The first thing that surprised me was Stewart talking about the difficulty of scaling the payment service’s load balancer, and it being yet another thing in the service to configure, bearing in mind that you need health checks, and might need to load-balance multiple services. Fair enough, I guess. Their solution was a client-local load balancer, embedded in sidecar code in the service mesh. Wow… in such an environment, everything I think I know about load-balancing issues is probably wrong. There seemed to be an implicit claim that client-side load balancing is a win, but I couldn’t quite parse the argument. Counterintuitive! Need to dig into this.
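To make “client-local load balancer” a bit more concrete for myself, here is a minimal sketch in Go. It is not what Envoy or Traffic Director actually runs; the `Endpoint` and `RoundRobin` types and the addresses are invented for illustration. The idea it tries to show is only that the caller holds the list of backends, hears about health-check results, and picks a target per request, so there is no central load balancer to scale.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Endpoint is one backend instance of a hypothetical payment service.
type Endpoint struct {
	Addr    string
	Healthy bool // in a real mesh, health checks would keep this current
}

// RoundRobin is a minimal client-side load balancer: the caller itself
// holds the endpoint list and chooses a backend for every request.
type RoundRobin struct {
	mu        sync.Mutex
	endpoints []Endpoint
	next      int
}

// ErrNoHealthyEndpoint is returned when every known endpoint is unhealthy.
var ErrNoHealthyEndpoint = errors.New("no healthy endpoint")

// NewRoundRobin copies the endpoint list so later changes stay local.
func NewRoundRobin(eps []Endpoint) *RoundRobin {
	return &RoundRobin{endpoints: append([]Endpoint(nil), eps...)}
}

// SetHealthy records a health-check result for addr.
func (rr *RoundRobin) SetHealthy(addr string, healthy bool) {
	rr.mu.Lock()
	defer rr.mu.Unlock()
	for i := range rr.endpoints {
		if rr.endpoints[i].Addr == addr {
			rr.endpoints[i].Healthy = healthy
		}
	}
}

// Pick returns the next healthy endpoint in rotation, skipping unhealthy ones.
func (rr *RoundRobin) Pick() (string, error) {
	rr.mu.Lock()
	defer rr.mu.Unlock()
	for i := 0; i < len(rr.endpoints); i++ {
		ep := rr.endpoints[rr.next]
		rr.next = (rr.next + 1) % len(rr.endpoints)
		if ep.Healthy {
			return ep.Addr, nil
		}
	}
	return "", ErrNoHealthyEndpoint
}

func main() {
	lb := NewRoundRobin([]Endpoint{
		{Addr: "10.0.0.1:8443", Healthy: true},
		{Addr: "10.0.0.2:8443", Healthy: true},
		{Addr: "10.0.0.3:8443", Healthy: true},
	})
	lb.SetHealthy("10.0.0.2:8443", false) // pretend a health check failed
	for i := 0; i < 4; i++ {
		addr, err := lb.Pick()
		fmt.Println(addr, err)
	}
}
```

In a mesh, the sidecar would presumably play the role of `RoundRobin`, with the endpoint list and health data pushed to it from a control plane rather than hard-coded.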
And the AWS voice in the back of my head is saying “Why don’t you put your payments service behind API Gateway? Or ALB? Or maybe even make direct calls out to a Lambda function? (Or, obviously, their GCP equivalents.) They come with load-balancing and monitoring and error reporting built-in. And anyhow, you’re probably going to need application-level canaries, whichever way you go.” I worry a little bit about hiding the places where the networking happens, just like I worry about ORM hiding the SQL. Because you can’t ignore either networking or SQL.
Traffic Director · It’s an interesting beast. It turns out that there’s a set of APIs called “xDS”, originally from Envoy, nicely introduced in The universal data plane API. They manage the kinds of things a sidecar provides: Endpoint discovery and routing, health checks, secrets, listeners. What Google has done is arrange for gRPC to support xDS for configuration, and it seems Traffic Director can configure and deploy your services using a combination of K8s with a service mesh, gRPC, and even on-prem stuff; plus pretty well anything that supports xDS. Which apparently includes Google Cloud Run.
It does a lot of useful things. Things that are useful, at least, in the world where you build your distributed app by turning potentially any arbitrary API call into a proxied load-balanced monitored logged service, via the Service Mesh.
Is this a good thing? Sometimes, I guess, otherwise people wouldn’t be putting all this work into tooling and facilitation. When would you choose this approach to wiring services together, as opposed to consciously building more or less everything as a service with an endpoint, in the AWS style? I don’t know. Hypothesis: You do this when you’re already bought-in to Kubernetes, because in that context service mesh is the native integration idiom.
I was particularly impressed by how you could set up “global” routing, which means load balancing against resources that run in multiple Google regions (which don’t mean the same things as AWS regions or Azure regions). AWS would encourage you to use multiple AZs to achieve this effect.
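Here is how I picture the “global” part, as a rough Go sketch. I am assuming it means roughly “prefer healthy backends in the caller’s region, and fail over to another region when there are none”; the talk didn’t spell out the policy, and the `Backend` type, the `Candidates` function, and the addresses are all mine, with the region names there only as examples.

```go
package main

import "fmt"

// Backend is a hypothetical service instance tagged with the region it runs in.
type Backend struct {
	Addr    string
	Region  string
	Healthy bool
}

// Candidates walks the regions in preference order and returns the healthy
// backends of the first region that has any. It returns nil if no region
// in the list has a healthy backend.
func Candidates(backends []Backend, preferred []string) []Backend {
	for _, region := range preferred {
		var out []Backend
		for _, b := range backends {
			if b.Region == region && b.Healthy {
				out = append(out, b)
			}
		}
		if len(out) > 0 {
			return out
		}
	}
	return nil
}

func main() {
	backends := []Backend{
		{Addr: "10.1.0.1:8443", Region: "us-central1", Healthy: false},
		{Addr: "10.2.0.1:8443", Region: "europe-west1", Healthy: true},
	}
	// A caller in us-central1 prefers its own region, then falls back.
	for _, b := range Candidates(backends, []string{"us-central1", "europe-west1"}) {
		fmt.Println(b.Addr, b.Region)
	}
}
```

The preference list is passed in by hand here; presumably the real thing works out proximity and capacity on its own.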
Also there’s a lot of support for automated-deployment operations, and I don’t know if they extend the current GCP state of the art, but they looked decent.
Finally, I was once again taken aback when Stewart pointed out that with Traffic Director, you don’t have to screw around with iptables to get things working. I had no idea that was something people still had to do; if this makes that go away, that’s gotta be a good thing.
Kelsey makes it go · Kelsey Hightower takes 14 of the video’s 47 minutes to show how you can deploy a simple demo app on your laptop and then, with the help of Traffic Director, on various combinations of virts and K8s resources, and then on Google Cloud Run. It’s impressive but, as with most K8s demos, it assumes that you have everything up and running and configured. If you didn’t, it’d take a galaxy-brain expert like Kelsey a couple of hours (probably?) to pull that together, and someone like me, who’s mostly a K8s noob, who knows, but probably days.
I dunno, I’m in a minority here but damn, is that stuff ever complicated. The number of moving parts you have to have configured just right to get “Hello world” happening is really super intimidating.
But bear in mind it’s perfectly possible that someone coming into AWS for the first time would find the configuration work there equally scary. To do something like this on AWS you’d spend (I think) less time doing the service configuration, but then you’d have to get all the IAM roles and permissions wired up so that anything could talk to anything, which can get hairy fast. I noticed the GCP preso entirely omitted access-control issues. So, all in, I don’t have evidence to claim “Wow, this would be simpler on AWS!”, just that the number of knobs and dials was intimidating.
One thing made me gasp then laugh. Kelsey said “for the next step, you just have to put this in your Go imports, you don’t have to use it or anything.”
I was all “WTF how can that do anything?” but then a few minutes later he started wiring endpoint URIs into config files that began with xds: and oh, of course. Still, is there a bit of a code smell happening or is that just me?
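For anyone else who had the “how can that do anything?” reaction: as far as I know, the import in question in gRPC-Go is the blank import `_ "google.golang.org/grpc/xds"`, and a blank import in Go still runs the imported package’s `init` functions. Those can register a handler for a URI scheme as a side effect. Here is a stdlib-only sketch of that pattern, not gRPC’s actual code; `Register`, `Resolve`, and the fake backend names are invented for illustration, and the `init` below stands in for the one that would live in the imported package.

```go
package main

import (
	"fmt"
	"strings"
)

// Resolver turns the name part of a target into backend addresses.
type Resolver func(name string) []string

// resolvers maps a URI scheme to the code that knows how to resolve it.
var resolvers = map[string]Resolver{}

// Register is what a plugin package would call from its own init function.
func Register(scheme string, r Resolver) { resolvers[scheme] = r }

// init stands in for the init function of a blank-imported package.
// Nothing in main refers to it, yet it runs before main starts.
func init() {
	Register("xds", func(name string) []string {
		// Hypothetical: a real resolver would ask a control plane instead.
		return []string{name + "-backend-1:8443", name + "-backend-2:8443"}
	})
}

// Resolve splits a target like "xds:///payments" and dispatches on its scheme.
func Resolve(target string) ([]string, error) {
	scheme, rest, ok := strings.Cut(target, ":")
	if !ok {
		return nil, fmt.Errorf("target %q has no scheme", target)
	}
	r, found := resolvers[scheme]
	if !found {
		return nil, fmt.Errorf("no resolver registered for scheme %q", scheme)
	}
	return r(strings.TrimLeft(rest, "/")), nil
}

func main() {
	addrs, err := Resolve("xds:///payments")
	fmt.Println(addrs, err)
}
```

Go’s standard library does the same trick with `database/sql` drivers and `image` decoders, so the idiom is well established, whether or not you think it smells.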
Anyhow · If I were already doing a bunch of service-mesh stuff, I think that Traffic Director might meet some needs of today and could become really valuable when my app started getting heterogeneous and needed to talk to various sorts of things that aren’t in the same service mesh.
What I missed · Stewart’s narrative stopped after the payment, and I’d been waiting for the fulfillment part of the puzzle, because for that, synchronous APIs quite likely aren’t what you want; event-driven and message-based asynchronous infrastructure would come into play. Which of course is what I spent a lot of time working on recently. I wonder how that fits into the K8s/service-mesh landscape?