
Recently, Google launched a beta of Google Cloud Workflows. This grabs my attention because I did a lot of work on AWS Step Functions, also a workflow service. The differences between them are super interesting, if you’re among the handful of humans who care about workflows in the cloud. For those among the other 7.8 billion, move right along, nothing to see here.

Context · The Google launch seemed a bit subdued. There was an August 27th tweet by Product Manager Filip Knapik, but the real “announcement” is apparently Serverless workflows in Google Cloud, a 16-minute YouTube preso also by Knapik. It’s good.

On Twitter someone said Good to see “Step Functions / Logic Apps” for GCP! and I think that’s fair. I don’t know when Logic Apps launched, but Step Functions was at re:Invent 2016, so has had four years of development.

I’m going to leave Logic Apps out of this discussion, and for brevity will just say “AWS” and “GCP” for AWS Step Functions and Google Cloud Workflows.

[Images: AWS Step Functions · · · Google Cloud Workflows]

Docs · For GCP, I relied on the Syntax reference, and found Knapik’s YouTube useful too. For AWS, I think the best starting point is the Amazon States Language specification.

The rest of this piece highlights the products’ similarities and differences, with a liberal seasoning of my opinions.

YAML vs JSON · AWS writes workflows (which it calls “state machines”) in JSON. GCP uses YAML. Well… meh. A lot of people prefer YAML; it’s easier to write. To be honest, I always thought of the JSON state machines as sort of assembler level, and assumed that someone would figure out a higher-level way to express them, then compile down to JSON. But that hasn’t happened very much.

I have this major mental block against YAML because, unlike JSON or XML, it doesn’t have an end marker, so if your YAML gets truncated, be it by a network spasm or your own fat finger, it might still parse and run — incorrectly. Maybe the likelihood is low, but the potential for damage is apocalyptic. Maybe you disagree; it’s a free country.

And anyhow, you want YAML Step Functions? You can do that with serverless.com or in SAM (see here and here).

Control flow · Both GCP and AWS model workflows as a series of steps; AWS calls them “states”. They both allow any step to say which step to execute next, and have switch-like conditionals to pick the next step based on workflow state.

But there’s a big difference. In GCP, if a step doesn’t explicitly say where to go next, execution moves to whatever step is on the next line in the YAML. I guess this is sort of idiomatic, based on what programming languages do. In AWS, if a state doesn’t say what the next step is, it has to be a terminal success/fail state; this is wired into the syntax. In GCP you can’t have a step named “end”, because next: end signals end-of-workflow. Bit of a syntax smell?
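
To make that concrete, here’s a minimal States Language sketch (state names invented): every non-terminal state names its successor with Next, and a terminal state either says End: true or is a Succeed/Fail state.

{
  "StartAt": "DoWork",
  "States": {
    "DoWork": {
      "Type": "Pass",
      "Next": "WrapUp"
    },
    "WrapUp": {
      "Type": "Pass",
      "End": true
    }
  }
}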

I’m having a hard time developing an opinion one way or another on this one. GCP workflows can be more compact. AWS syntax rules are simpler and more consistent. [It’s almost as if a States Language contributor was really anal about minimalism and syntactic integrity.] I suspect it maybe doesn’t make much difference?

GCP and AWS both have sub-workflows for subroutine-like semantics, but GCP’s are internal, part of the workflow, while AWS’s are external, separately defined and managed. Neither approach seems crazy.
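
For flavor, here’s a sketch of the GCP internal style (names invented); once you have subworkflows, the main flow moves under a main: block and each subworkflow is another top-level block with its own params and steps. On the AWS side, as I recall, the equivalent is a Task state whose Resource is the states:startExecution integration, pointing at a separately deployed state machine.

main:
    steps:
        - call_sub:
            call: greet
            args:
                name: "workflows"
            result: message
        - done:
            return: ${message}
greet:
    params: [name]
    steps:
        - build:
            return: ${"Hello, " + name}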

The work in workflow · Workflow engines don’t actually do any work, they orchestrate compute resources — functions and other Web services — to get things done. In AWS, the worker is identified by a field named Resource which is syntactically a URI. All the URIs currently used to identify worker resources are Amazon-flavored ARNs.
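
A Task state, with a made-up Lambda ARN, looks something like this:

"ProcessOrder": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
  "Next": "NotifyCustomer"
}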

GCP, at the moment, mostly assumes an HTTP world. You specify the URL and whether you want to GET/POST/PATCH/DELETE (why no PUT?); you can fill in header fields, append a query to the URL, provide auth info, and so on. You can also give the name of a variable where the result will be stored.
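
A sketch of what that looks like in a step (the URL and names are invented):

- fetch_user:
    call: http.get
    args:
        url: https://example.com/users/123
        query:
            verbose: true
        headers:
            Accept: application/json
        auth:
            type: OIDC
    result: userRecord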

I said “mostly” because in among all the call: http.get examples, I saw one call: sys.sleep, so the architecture allows for other sorts of things.
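
That one is about what you’d expect; if I’m reading the docs right, something like:

- pause:
    call: sys.sleep
    args:
        seconds: 10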

GCP has an advantage, which is that you can call out to an arbitrary HTTP endpoint. I really like that: Integration with, well, anything.

There’s nothing in the AWS architecture that would get in the way of doing this, but while I was there, that feature never quite made it to the top of the priority list.

That’s because pre-built integrations seemed to offer more value. There are a lot of super-useful services with APIs that are tricky to talk to. Some don’t have synchronous APIs, just fire-and-forget. Or there are better alternatives than HTTP to address them. Or they have well-known failure modes and workarounds to deal with those modes. So, AWS comes with pre-cooked integrations for Lambda, AWS Batch, DynamoDB, ECS/Fargate, SNS, SQS, Glue, SageMaker, EMR, CodeBuild, and Step Functions itself. Each one of these takes a service that may be complicated or twitchy to talk to and makes it easy to use in a workflow step.

The way it works is that the “Resource” value is something like, for example, arn:aws:states:::sqs:sendMessage, which means that the workflow should send an SQS message.
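
So a step that sends an SQS message looks roughly like this (queue URL and names invented); the Parameters block shapes the API call, and the .$ suffix means “pull this value out of the workflow state”:

"NotifyQueue": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/orders",
    "MessageBody.$": "$.orderId"
  },
  "Next": "AfterNotify"
}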

AWS has a couple of other integration tricks up its sleeve. One is “Callback tasks”, where the service launches a task, passes it a callback token, and then pauses the workflow until something calls back with the token to let Step Functions know the task is finished. This is especially handy if you want to run a task that involves interacting with a human.
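
The way you ask for this, if memory serves, is by appending .waitForTaskToken to the integration’s Resource and passing the token (available as $$.Task.Token) along to whoever will eventually answer; the names below are invented:

"WaitForApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
    "MessageBody": {
      "TaskToken.$": "$$.Task.Token",
      "OrderId.$": "$.orderId"
    }
  },
  "Next": "Approved"
}

The execution then sits there until something calls SendTaskSuccess or SendTaskFailure with that token.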

Finally, AWS has “Activities”. These are worker tasks that poll Step Functions to say “Got any work for me?” and “Here’s the result of the work you told me about” and “I’m heartbeating to say I’m still here”. These turn out to have lots of uses. One is if you want to stage a fixed number of hosts to do workflow tasks, for example to avoid overrunning a relational database.
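
In the state machine itself, an activity is just another Resource, optionally with a heartbeat interval; the workers live elsewhere and drive themselves with the GetActivityTask, SendTaskHeartbeat, and SendTaskSuccess/SendTaskFailure APIs. A sketch, with an invented activity name:

"ResizeImages": {
  "Type": "Task",
  "Resource": "arn:aws:states:us-east-1:123456789012:activity:resize-images",
  "HeartbeatSeconds": 60,
  "TimeoutSeconds": 3600,
  "End": true
}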

So at the moment, AWS comes with way more built-in integrations and ships new ones regularly. Having said that, I don’t see anything in GCP’s architecture that prevents it from eventually taking a similar path.

The tools for specifying what to do are a little more stripped-down in AWS: “Here’s a URI that says who’s supposed to do the work, and here’s a blob of JSON to serve as the initial input. Please figure out how to start the worker and send it the data.” [It’s almost as if a States Language contributor was a big fan of Web architecture, and saw URIs as a nicely-opaque layer of indirection for identifying units of information or service.]

Reliability · This is one of the important things a workflow brings to the table. In the cloud, tasks sometimes fail; that’s a fact of life. You want your workflow to be able to retry appropriately, catch exceptions when it has to, and reroute the workflow when bad stuff happens. Put another way, you’d like to take every little piece of work and surround it with what amounts to a try/catch/finally.

GCP and AWS both do this, with similar mechanisms: exception catching with control over the number and timing of retries, and eventual dispatching to elsewhere in the workflow. GCP allows you to name a retry policy and re-use it, which is cool. But the syntax is klunky.
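
On the AWS side the try/catch/finally flavor looks roughly like this (the States.* error names are real, everything else is invented); GCP’s equivalent, if I’m reading the docs right, hangs try, retry, and except blocks off a step.

"CallBillingService": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-billing",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleFailure"
    }
  ],
  "Next": "RecordSuccess"
}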

AWS goes to immense, insane lengths to make sure a workflow step is never missed, or broken by a failure within the service. (The guarantees provided by the faster, cheaper Express Workflows variant are still good, but weaker.) I’d like to see a statement from GCP about the expected reliability of the service.

Parallelism · It’s often the case that you’d like a workflow to orchestrate a bunch of its tasks in parallel. AWS provides two ways to do this: You can take one data item and feed it in parallel to a bunch of different tasks, or you can take an array and feed its elements to the same task. In the latter case, you can limit the maximum concurrency or even force one-at-a-time processing.
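
The array flavor is the Map state; here’s a sketch in which everything but the field names is invented:

"ProcessItems": {
  "Type": "Map",
  "ItemsPath": "$.items",
  "MaxConcurrency": 5,
  "Iterator": {
    "StartAt": "ProcessOne",
    "States": {
      "ProcessOne": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
        "End": true
      }
    }
  },
  "End": true
}

MaxConcurrency: 1 gets you one-at-a-time processing; the feed-one-item-to-several-tasks flavor is a Parallel state with a list of Branches.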

GCP is all single-threaded. I suppose there’s no architectural reason it has to stay that way in future.

Workflow state in GCP · Now here’s a real difference, and it’s a big one. When you fire up a workflow (AWS or GCP), you feed data into it, and as it does work, it builds up state, then it uses the state to generate output and make workflow routing choices.

In GCP, this is all done with workflow variables. You can take the output of a step and put it in a variable. Or you can assign them and do arithmetic on them, like in a programming language. So you can build a loop construct like so:

# Initialize the loop variables.
- define:
    assign:
        - array: ["foo", "ba", "r"]
        - result: ""
        - i: 0
# Loop while there are array elements left to process.
- check_condition:
    switch:
        - condition: ${len(array) > i}
          next: iterate
    next: exit_loop
# Append the current element and advance the index.
- iterate:
    assign:
        - result: ${result + array[i]}
        - i: ${i + 1}
    next: check_condition
# Return the concatenated string.
- exit_loop:
    return:
        concat_result: ${result}

Variables are untyped; can be numbers or strings or objects or fish or bicycles. Suppose a step’s worker returns JSON. Then GCP will parse it into a multi-level thing where the top level is an object with members named headers (for HTTP headers) and body, and then the body has the parsed JSON, and then you can add functions on the end, so you can write incantations like this:

- condition: ${ userRecord.body.fields.amountDue.doubleValue == 0 }

(doubleValue is a function not a data field. So what if I have a field in my data named doubleValue? Ewwww.)

Then again, if a step worker returns PDF, you can stick that into a variable too. And if you call doubleValue on that I guess that’s an exception?

Variable names are global to the workflow, and it looks like some are reserved, for example http and sys.

…and in AWS · It could hardly be more different. As of now AWS doesn’t use variables. The workflow state is passed along, as JSON, from one step to the next as the workflow executes. There are operators (InputPath, ResultPath, OutputPath, Parameters) for pulling pieces out and stitching them together.

Just like in GCP, JSONPath syntax is used to pick out bits and pieces of state, but the expression starts with just $ rather than the name of a variable.
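
Here’s a sketch of a Task (names invented) that uses all four Path-ish operators: InputPath trims the incoming state, Parameters reshapes it for the worker, ResultPath says where to graft the worker’s output back in, and OutputPath trims what gets passed along to the next state.

"LookupCustomer": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:lookup-customer",
  "InputPath": "$.order",
  "Parameters": {
    "customerId.$": "$.customerId"
  },
  "ResultPath": "$.customer",
  "OutputPath": "$",
  "Next": "CheckCredit"
}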

There’s no arithmetic, but then you don’t need to do things with array indices like in the example above, because parallelism is built in.

If you want to do fancy manipulation to prepare input for a worker or pick apart one’s output, AWS takes you quite a ways with the built-in Pass feature; but to run real procedural code you might need a Lambda. We thought that was OK; go as far as you can declaratively while remaining graceful, because when that breaks down, this is the cloud and these days clouds have functions.
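
Pass can inject fixed values and reshape the state without calling any worker at all; a sketch, names invented:

"ReshapeForWorker": {
  "Type": "Pass",
  "Parameters": {
    "region": "us-west-2",
    "total.$": "$.order.total"
  },
  "ResultPath": "$.summary",
  "Next": "CallWorker"
}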

While I haven’t been to the mat with GCP to do real work, at the moment I think the AWS approach is a winner here. First of all, I’ve hated global variables — for good reason — since before most people reading this were born. Second, YAML is a lousy programming language to do arithmetic and so on in.

Third, and most important, what happens when you want to do seriously concurrent processing, which I think is a really common workflow scenario? GCP doesn’t really have parallelism built-in yet, but I bet that’ll change assuming the product gets any traction. The combination of un-typed un-scoped global variables with parallel processing is a fast-track to concurrency hell.

AWS application state is always localized to a place in the executing workflow and can be logged and examined as you work your way through, so it’s straightforward to use as a debugging resource.

[It’s almost as if a States Language contributor was Functional-Programming literate and thought immutable messages make sense for tracking state in scalable distributed systems.]

GCP variables have the virtue of familiarity and probably have less learning curve than the AWS “Path” primitives for dealing with workflow state. But I just really wouldn’t want to own a large mission-critical workflow built around concurrent access to untyped global variables.

Pricing · GCP is cheaper than AWS, but not cheaper than the Express Workflows launched in late 2019.

[Image: Google Cloud Workflows pricing table]

It’s interesting but totally unsurprising that calling out to arbitrary HTTP endpoints is a lot more expensive. Anyone who’s built general call-out-to-anything infrastructure knows that it’s a major pain in the ass because those calls can and will fail and that’s not the customer’s problem, it’s your problem.

Auth · This is one area where I’m not going to go deep because my understanding of Google Cloud auth semantics and best practices is notable by its absence. It’s nice to see GCP’s secrets-manager integration, particularly for HTTP workers. I was a bit nonplussed that for auth type you can specify either or both of OIDC and OAuth2; clearly more investigation is required.

UX · I’m talking about graphical stuff in the console. The GCP console, per Knapik’s YouTube, looks clean and helpful. Once again, the AWS flavor has had way more years of development and just contains more stuff. Probably the biggest difference is that AWS draws little ovals-and-arrows graphic renditions of your workflow, colors the ovals as the workflow executes, and lets you click on them to examine the inputs and outputs of any workflow execution. This is way more than just eye candy; it’s super helpful.

Which workflow service should you use? · That’s dead easy. You should use a declaratively-specified fully-managed cloud-native service that can call out to a variety of workers, and which combines retrying and exception handling to achieve high reliability. And, you should use the one that’s native to whatever public cloud you’re in! Like I said, easy.



Contributions


From: Geoff Arnold (Sep 22 2020, at 20:22)

The first thing I looked at when Step Functions came out was how one managed change, like blue-green deployment and versioning. There's a BurningMonk piece that shows the clunky but serviceable answer.... How does GCP do it? (And I agree with you about the evils of globals...!)

This really strikes me as something that should be a single open-source project with multiple cloud bindings, rather than a proprietary thing. But hey....


From: Adrian Brennan (Sep 23 2020, at 15:05)

There's a lot to like about Step Functions (especially segregated execution, self-contained logs) but support for testing and debugging is lacking. Without that tooling it feels not-quite-ready for consideration as a serious development medium.


From: Jarek (Sep 26 2020, at 16:50)

"GCP has an advantage, which is that you can call out to an arbitrary HTTP endpoint. I really like that: Integration with, well, anything.

There’s nothing in the AWS architecture that would get in the way of doing this, but while I was there, that feature never quite made it to the top of the priority list."

A feature that would lessen vendor lock-in never quite makes it to the top of the vendor's priority list. Colour me not quite surprised ;)


From: Joe Bowbeer (Sep 28 2020, at 06:28)

I'd like to see both of these compared to a modern workflow engine (Zeebe?) and a standard notation such as BPMN.


From: Tim Bannister (Sep 28 2020, at 08:03)

Nice comparison!

One of the points in the article was about YAML vs. JSON; you can always spot when JSON documents are truncated.

YAML does have end markers, but they're optional; https://yaml.org/spec/1.2/spec.html#id2800401 describes them. The end marker for YAML is "..." as the last line of the document.

A particular API or tool that consumes YAML could insist that its input is a single YAML document with the end marker present, maybe with a flag like --no-strict to allow all valid YAML streams instead.

(I'd like to see tools such as kubectl that often process YAML add a strict mode that you can enable, but equally you can achieve the same thing via preprocessing, linting, or with custom tooling.)


From: Sean Donovan (Feb 01 2021, at 09:46)

The fundamental problem with both of these tools is that they (i) don't support signals/events and (ii) don't support cancellation . . . and a few others. In practical terms, you can only build trivial workflows with them.

