Back in 2016, in Message Processing Styles, I was sort of gloomy and negative about the notion of automated mapping between messages on the wire and strongly-typed programming data structures. Since we just launched a Schema Registry, and it’s got my fingerprints on it, I guess I must have changed my mind.
Eventing lessons · I’ve been mixed up in EventBridge, formerly known as CloudWatch Events, since it was scratchings on a whiteboard. It has a huge number of customers, including but not limited to the hundreds of thousands that run Lambda functions, and the volume of events per second flowing through the main buses is keeping a sizeable engineering team busy. This has taught me a few things.
First of all, events are strongly subject to Hyrum's Law: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. Which is to say, once you’ve started shipping an event, it’s really hard, which is to say usually impossible, to change anything about it.
Second: Writing code to map back and forth between bits-on-the-wire and program data structures is a very bad use of developer time. Particularly when the messages on the wire are, as noted, very stable in practice.
Thus, the new schema registry. I’m not crazy about the name, because…
Schemas are boring · Nobody has ever read a message schema for pleasure, and very few for instruction. Among other things, most messages are in JSON, and I have repeatedly griped about the opacity and complexity of JSON Schema. So, why am I happy about the launch of a Schema Registry? Because it lets us do two useful things: Search and autocomplete.
Let’s talk about Autocomplete first. When I’m calling an API, I don’t have to remember the names of the functions or their arguments, because my IDE does that for me. As of now, this is true for events as well; the IDE knows the names and types of the fields and sub-fields. This alone makes a schema registry useful. Or, to be precise, the code bindings and serializers generated from the schema do.
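To make the autocomplete point concrete, here is a minimal sketch of the sort of binding a code generator might emit. The `OrderPlaced` event, its fields, the camelCase wire names, and the `from_json` helper are all hypothetical, hand-written stand-ins; whatever the registry actually generates will look different.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:
    id: str
    email: str


@dataclass
class OrderPlaced:
    order_id: str
    total_cents: int
    customer: Customer

    @classmethod
    def from_json(cls, raw: str) -> "OrderPlaced":
        # Map wire names (camelCase, by assumption) onto typed attributes.
        doc = json.loads(raw)
        return cls(
            order_id=doc["orderId"],
            total_cents=doc["totalCents"],
            customer=Customer(**doc["customer"]),
        )


raw = (
    '{"orderId": "o-17", "totalCents": 4250, '
    '"customer": {"id": "c-9", "email": "a@example.com"}}'
)
event = OrderPlaced.from_json(raw)
# From here on the IDE can complete event.customer.email;
# there are no string keys to remember or mistype.
```

The payoff is in the last line: once the event is a typed object, a misspelled field is something the editor or type checker can flag, rather than a `KeyError` at runtime.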
The search side is pretty simple. The schema registry is just a DynamoDB thing, nothing fancy about it. But we’ve wired up an Elasticsearch index so you can type random words at it to figure out which events have a field named “drama llama” or whatever else you need to deal with today.
Inference/Discovery · This is an absolutely necessary schema-registry feature that most people will never use. It turns out that writing schemas is a difficult and not terribly pleasant activity. Such activities should be automated, and the schema registry comes with a thing that looks at message streams and infers schemas for them. They told us we couldn’t call it an “Inferrer” because everyone thinks that means Machine Learning. So it’s called “schema discovery” and it’s not rocket science at all; people have been doing schema inference for years and there’s good open-source code out there.
So if you want to write a schema and jam it into the registry, go ahead. For most people, I think it’s going to be easier to send a large-enough sample of your messages and let the code do the work. At least it’ll get the commas in the right place. It turns out that if you don’t like the auto-generated schema, you can update it by hand; like I said, it’s just a simple database with versioning semantics.
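For a feel of why inference isn’t rocket science, here is a toy sketch that folds a handful of sample messages into a JSON-Schema-flavored description. It is not the registry’s discovery code; the function names and the merge rules (widen integer to number, keep only fields seen in every sample as required, give up and accept anything on a real conflict) are assumptions chosen for brevity.

```python
def infer(value):
    """Describe one sample value in a JSON-Schema-like dict."""
    if isinstance(value, bool):  # bool first: it is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, list):
        items = None
        for element in value:
            items = merge(items, infer(element))
        schema = {"type": "array"}
        if items is not None:  # an empty list tells us nothing about items
            schema["items"] = items
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer(v) for k, v in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"not a JSON value: {value!r}")


def merge(a, b):
    """Combine two inferred schemas into one that accepts both."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    ta, tb = a.get("type"), b.get("type")
    if ta == tb == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge(props.get(key), sub)
        # Only fields present in every sample stay required.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if ta == tb == "array":
        items = merge(a.get("items"), b.get("items"))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    if {ta, tb} == {"integer", "number"}:
        return {"type": "number"}
    return {}  # conflicting types: the empty schema accepts anything


def infer_schema(samples):
    schema = None
    for sample in samples:
        schema = merge(schema, infer(sample))
    return schema


samples = [
    {"id": 1, "tags": ["a"]},
    {"id": 2.5, "note": "hi", "tags": []},
]
schema = infer_schema(samples)
```

With these two samples, `id` widens to a number, `note` shows up as an optional string, and `tags` stays an array of strings. This is also why a large-enough sample matters: a field that happens to appear in every message you sent will be marked required whether or not it really is.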
Tricky bits · By which I mean, what could go wrong? Well, as I said above, events rarely change… except when they do. In particular, the JSON world tends to believe that you can always add a new field without breaking things. Which you can, until you’ve interposed strong types. This is a problem, but it has a good solution. When it comes to bits-on-the-wire protocols, there are essentially two philosophies: Must-Understand (receiving software should blow up if it sees anything unexpected in the data) and Must-Ignore (receiving software must tactfully ignore unexpected data in an incoming message). There are some classes of application where the content is so sensitive that Must-Understand is called for, but for the vast majority of Cloud-native apps, I’m pretty sure that Must-Ignore is a better choice.
Having said that, we probably need smooth support for both approaches. Let me make this concrete with an example. Suppose you’re a Java programmer writing a Lambda to process EC2 Instance State-change Notification events, and through the magic of the schema registry, you don’t have to parse the JSON, you just get handed an EC2InstanceStateChangeNotification object. So, what happens when EC2 decides to toss in a new field? There are three plausible options. First, throw an exception. Second, stick the extra data into some sort of Map<String, Object> structure. Third, just pretend the extra data wasn’t there. None of these are
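The three options just listed can be sketched in a few lines. This is Python rather than Java, purely to keep the example short. The `instance-id` and `state` keys come from the event’s `detail` payload; the class shape, the `bind` helper, the `on_unknown` switch, and the `hibernated` field standing in for “a new field EC2 tossed in” are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

KNOWN_FIELDS = {"instance-id", "state"}
MODES = ("raise", "keep", "ignore")


class UnknownFieldError(ValueError):
    """Raised in Must-Understand mode when the payload has extra fields."""


@dataclass
class EC2InstanceStateChangeNotification:
    instance_id: str
    state: str
    extras: Dict[str, Any] = field(default_factory=dict)


def bind(detail: Dict[str, Any], on_unknown: str = "ignore"):
    if on_unknown not in MODES:
        raise ValueError(f"on_unknown must be one of {MODES}")
    unknown = {k: v for k, v in detail.items() if k not in KNOWN_FIELDS}
    if unknown and on_unknown == "raise":
        # Option 1, Must-Understand: blow up on anything unexpected.
        raise UnknownFieldError(f"unexpected fields: {sorted(unknown)}")
    event = EC2InstanceStateChangeNotification(
        instance_id=detail["instance-id"],
        state=detail["state"],
    )
    if on_unknown == "keep":
        # Option 2: stash the extras in a map for anyone who cares.
        event.extras = unknown
    # Option 3, Must-Ignore: fall through and drop the extras silently.
    return event


detail = {"instance-id": "i-0abc", "state": "running", "hibernated": False}
ignored = bind(detail)          # extras is empty
kept = bind(detail, "keep")     # extras holds the new field
```

Making the behavior a per-call switch is only one way to support both philosophies; a generator could equally well bake the choice into the binding at build time.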
There’s another world out there where the bits-on-the-wire aren’t in JSON, they’re in a “binary” format like Avro or Protocol Buffers or whatever. In that world you really need schemas because unlike JSON, you just can’t process the data without one. In the specific (popular) case of Avro-on-Kafka, there’s a whole body of practice around “schema evolution”, where you can update schemas and automatically discover whether the change is backward-compatible for existing consumers. This sounds like something we should look at across the schemas space.
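The heart of a compatibility check is small enough to sketch. What follows is a deliberately simplified version loosely modeled on how Avro resolves a reader’s schema against a writer’s: fields the reader doesn’t know about are skipped, and fields the reader expects but the writer never sent need a default. The flat dict-of-fields schema format and the `read_problems` name are assumptions, and real Avro also allows certain type promotions that this ignores.

```python
def read_problems(reader, writer):
    """List the reasons a consumer on `reader` can't decode `writer` data.

    Both schemas are dicts of field name -> {"type": ..., "default": ...},
    with "default" optional. An empty list means the pair is compatible.
    """
    problems = []
    for name, spec in reader.items():
        if name not in writer:
            if "default" not in spec:
                problems.append(
                    f"field '{name}' is missing from the writer schema "
                    "and has no default"
                )
        elif writer[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type: "
                f"{writer[name]['type']} -> {spec['type']}"
            )
    # Writer fields unknown to the reader are simply skipped.
    return problems


v1 = {"id": {"type": "string"}, "state": {"type": "string"}}
v2 = {**v1, "region": {"type": "string", "default": "unknown"}}
v3 = {**v1, "region": {"type": "string"}}  # same addition, but no default

existing_consumers_ok = read_problems(reader=v1, writer=v2)
upgraded_consumers_ok = read_problems(reader=v2, writer=v1)
upgraded_without_default = read_problems(reader=v3, writer=v1)
```

Running the check in both directions covers the two questions that matter when a schema changes: can existing consumers read the new data, and can upgraded consumers still read the old data? Here `v2` passes both, while `v3` fails the second because old records have no `region` and the reader has nothing to fall back on.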
Tactical futures · Speaking of those binary formats, I absolutely do not believe that the current OpenAPI schema dialect is the be-all and end-all. Here’s a secret: The registry database has a SchemaType field and I’m absolutely sure that in future, it’s going to have more than one possible value.
Another to-do is supporting code bindings in languages other than the current Java, TypeScript, and Python. At the top of my list would be Go and C#, but I know there are members of other religions out there. And for the existing languages, we should make the integrations more native. For example, the Java bindings should be in Maven.
And of course, we need support in all the platform utilities: CloudFormation, SAM, CDK, Terraform, serverless.com, and any others that snuck in while I wasn’t looking.
Big futures · So, I seem to have had a change of worldview, from “JSON blobs on the wire are OK” to “It’s good to provide data types.” Having done that, everywhere I look around cloud-native apps I see things that deal with JSON blobs on the wire. Including a whole lot of AWS services. I’m beginning to think that more or less anything that deals with messages or events should have the option of viewing them as strongly-typed objects.
Which is going to be a whole lot of work, and not happen instantly. But as it says in Chapter 64 of the Dao De Jing: 千里之行，始於足下, or “A journey of a thousand leagues begins with a single step”.