Exactly 100% of everything on the Internet involves exchanging messages which represent items of interest to humans. These items can be classified into three baskets: One for “media” (images, sound, video), one for text (HTML, PDF, XML), and one for “objects” (chat messages, payments, love poems, order statuses). This is a survey of how Object data is encoded for transmission over the Internet. Discussed: JSON, binary formats like Avro and Protobufs, and the trade-offs. Much of what people believe to be true is not.
History sidebar · The first ever cross-systems data interchange format was ASN.1, still used in some low-level Internet protocols and crypto frameworks. ASN.1 had a pretty good data-type story, but not much in the way of labeling. Unfortunately, this was in the days before Open Source, so the ASN.1 software I encountered was slow, buggy, and expensive.
Then XML came along in 1998. It had no data-typing at all but sensibly labeled nested document-like data structures. More important, it had lots of fast solid open-source software you could download and use for free, so everybody started using it for everything.
Then sometime after 2005, a lot of people noticed JSON. The “O” stands for “Object” and for shipping objects around the network, it was way slicker than XML. By 2010 or so, the virtuous wrath of the RESTafarians had swept away the pathetic remnants of the WS-* cabal. Most REST APIs are JSON, so the Internet’s wires filled up with media, text, and JSON.
On JSON · I think there’s still more of it out there than anything else, if only because there are so many incumbent REST CRUD APIs that are humming along staying out of the way and getting shit done.
Readers and writers are implemented in every computer language known to humankind, and they tend to interoperate correctly and frictionlessly with each other, particularly if you follow the interoperability guidelines in RFC 8259, which all the software I use seems to.
It does a pretty good job of modeling nested-record structures.
It’s all-text, so humans can read it, which is super extra helpful.
You can receive a JSON message you know nothing about and pick it apart successfully without knowing its schema, assuming it has one, which it probably doesn’t. So you can accomplish a task like “Pull out the item-count and item-price fields that are nested in the top-level order-detail field” with pretty good results given just a blob of raw JSON.
You can reliably distinguish between numbers, strings, booleans, and null.
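Here is a minimal sketch of that schema-free picking-apart, using Python's standard `json` module. The message and its field names are made up to match the order-detail example above; nothing here comes from a real API.

```python
import json

# Hypothetical order message; the field names follow the example in the text.
raw = (
    '{"order-id": "A-1001", "order-detail": '
    '{"item-count": 3, "item-price": 19.99, "gift": false, "note": null}}'
)

message = json.loads(raw)          # no schema needed to parse
detail = message["order-detail"]   # walk into the nested record
item_count = detail["item-count"]
item_price = detail["item-price"]

# Numbers, strings, booleans, and null each land in a distinct Python type.
kinds = {name: type(value).__name__ for name, value in detail.items()}

print(item_count, item_price)
print(kinds)
```

Note that the int-versus-float split you see in `kinds` is the Python parser's decision based on how the digits were written, not something JSON itself promises.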
The type system is impoverished. There is no timestamp type, no way to know whether a number should be treated as an integer or float or Bignum, no way to signal when string values are really enums, and so on.
Numbers are specially impoverished; in general you should assume that your repertoire is that of an IEEE double-precision float (but without NaN or ∞) which is adequate for most purposes, as long as you’re OK with an integer range of ±2^53 (which you probably should be).
Since JSON is textual, there is a temptation to edit it by hand, and this is painful since it’s nearly impossible to get the commas in the right places. On top of which there are no comments.
JSON’s textuality, and the fact that it carries its field labels along, no matter how deeply nested and often repeated, suggest that it is unnecessarily verbose, particularly when numeric values are represented in textual form. Also, the text needs to be converted into binary form to be loaded into objects (or structs, or dicts) for processing by code in memory.
JSON doesn’t have a universally-accepted schema language. I have been publicly disappointed over “JSON Schema”, the leading contender in that space; it’s just not very good. For a long time, the popular Swagger (now OpenAPI) protocols for specifying APIs used a variant version of a years-old release of JSON Schema; those are stable and well-tooled.
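The integer-range limit mentioned above is easy to see in a few lines. This is a sketch, not a statement about any particular JSON library: Python's own `json` module keeps integers exact, so `float()` stands in for a receiver that stores every JSON number as an IEEE double.

```python
import json

LIMIT = 2 ** 53  # largest range in which every integer fits exactly in a double

# Python's json module keeps integers exact, so this round-trips here...
exact = json.loads(json.dumps(LIMIT + 1))

# ...but a receiver that holds all numbers as doubles cannot represent
# LIMIT + 1; float() simulates such a receiver.
as_double = float(LIMIT + 1)

print(exact == LIMIT + 1)          # True
print(as_double == float(LIMIT))   # True: LIMIT + 1 collapsed onto LIMIT
```

So an ID like 9007199254740993 can leave one system intact and arrive at another off by one, which is why big identifiers often travel as strings.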
Mainstream binary formats · I think that once you get past JSON, Apache Avro might be the largest non-text non-media consumer of network bandwidth. This is due to its being wired into Hadoop and, more recently, the surging volume of Kafka traffic. Confluent, the company founded by Kafka’s original creators, provides good Avro-specific tooling. Most people who use Avro seem to be reasonably happy with it.
Protobufs (short for “Protocol Buffers”) I think would be the next-biggest non-media eater of network bandwidth. It’s out of Google and is used in gRPC which, as an AWS employee, I occasionally get bitched at for not supporting. When I worked at Google I heard lots of whining about having to use Protobufs, and it’s fair to say that they are not universally loved.
Next in line would be Thrift, which is kind of abstract and includes its own RPC protocol and is out of Facebook and I’ve never been near it.
JSON vs binary · This is a super-interesting topic. It is frequently declaimed that only an idiot would use JSON for anything because it’s faster to translate back and forth between data types in memory with Avro/Protobufs/Thrift/Whatever (hereinafter “binary”) than it is with JSON, and because binary is hugely more compact. Also binary comes with schemas, unlike JSON. And furthermore, binary lets you use gRPC, which must be brilliant since it’s from Google, and so much faster because it’s compact and can stream. So, get with it!
Is binary more compact than JSON? · Yes, but it depends. In one respect, absolutely, because JSON carries all its field labels along with it.
Also, binary represents numbers as native hardware numbers, while JSON uses strings of decimal digits. Which must be more compact, right? Except for your typical hardware number these days occupies 8 bytes if it’s a float, and I can write lots of interesting floats in less than 8 digits; or 4 bytes for integers, and I can… hold on, a smart binary encoder can switch between 1, 2, 4, and 8-byte representations. As for strings, they’re all just the same UTF-8 bytes either way. But binary should win big on enums, which can be represented as small numbers.
So let’s grant that binary is going to be more compact as long as your data isn’t mostly all strings, and the string values aren’t massively longer than the field labels. But maybe not as much as you thought.
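To put rough numbers on that, here is a sketch comparing one record as compact JSON against a hand-rolled binary layout built with Python's `struct` module. The record, the layout, and the `STATUS_CODES` table are all invented for illustration; this is not Avro, Protobufs, or any real format.

```python
import json
import struct

# Hypothetical sensor reading with two floats and an enum-like string.
record = {"sensor_id": 17, "temperature": 21.5, "humidity": 0.43, "status": "OK"}

# Compact JSON: no extra whitespace, but every field label rides along.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Assumed binary layout: 16-bit unsigned id, two 64-bit floats, and the
# status enum squeezed into one byte. "<" means little-endian, no padding.
STATUS_CODES = {"OK": 0, "WARN": 1, "FAIL": 2}
binary_bytes = struct.pack(
    "<HddB",
    record["sensor_id"],
    record["temperature"],
    record["humidity"],
    STATUS_CODES[record["status"]],
)

print(len(json_bytes), "bytes as JSON")
print(len(binary_bytes), "bytes as packed binary")
```

Most of the JSON overhead in this record is field labels; notice that the two floats cost 8 bytes each in binary but only 4 characters each as text. Swap in a record dominated by long strings and the gap narrows a lot.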
Unless of course you compress. This changes the picture and there are a few more it-depends clauses, but compression, in those scenarios where you can afford it, probably reduces the difference dramatically. And if you really care about size enough that it affects your format choices, you should be seriously looking at compression, because there are lots of cases where you’ve got CPU to spare and are network-limited.
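A quick way to see the effect of compression is to run both encodings through `zlib`. This sketch uses a thousand synthetic readings in the same kind of invented layout as a hand-packed struct; the data is highly repetitive, which flatters compression, so treat it as a demonstration of the mechanism and measure your own traffic before drawing conclusions.

```python
import json
import struct
import zlib

# 1000 hypothetical readings; values cycle, so the data is very repetitive.
records = [
    {"sensor_id": i % 50, "temperature": 20.0 + (i % 7) * 0.5, "status": "OK"}
    for i in range(1000)
]

json_blob = json.dumps(records, separators=(",", ":")).encode("utf-8")

# Assumed binary layout: 16-bit id, 64-bit float, one status byte (0 = OK).
binary_blob = b"".join(
    struct.pack("<HdB", r["sensor_id"], r["temperature"], 0) for r in records
)

json_z = zlib.compress(json_blob)
binary_z = zlib.compress(binary_blob)

print("raw:       ", len(json_blob), "vs", len(binary_blob))
print("compressed:", len(json_z), "vs", len(binary_z))
```

The repeated field labels that make raw JSON fat are exactly what a general-purpose compressor is good at squeezing out.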
Whether or not your data is number- or string-heavy matters in this context too, because serializing or deserializing strings is just copying UTF-8 bytes.
I mentioned gRPC above, and one aspect of speed heavily touted by the binary tribe is in protobufs-on-gRPC which, they say, is obviously much faster than JSON over HTTP. Except for HTTP is increasingly HTTP/2, with longer-lived connections and interleaved requests. And is soon going to be QUIC, running over UDP with no TCP streams at all. And I wonder how “obvious” the speed advantage of gRPC is going to be in that world?
I linked to that one benchmark just now but that path leads to a slippery slope; the Web is positively stuffed with serialization/deserialization benchmarks, many of them suffering from various combinations of bias and incompetence. Which raises a question:
Do speed and size matter? · Can I be seriously asking that question? Sure, because lots of times the size and processing speed of your serialization format just don’t matter in the slightest, because your app is bottlenecked on database, or on garbage collection, or on a matrix inversion or an FFT or whatever.
What you should do about this · Start with the simplest possible thing that could possibly work. Then benchmark using your data with your messaging patterns. In the possible but not terribly likely case that your message transmission and serialization is a limiting factor in what you’re trying to do, go shopping for a better data encoding.
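Benchmarking your own data does not have to be elaborate. Here is a minimal sketch using `timeit` from the standard library; `make_messages` is a hypothetical stand-in that you would replace with samples of your real traffic, and the number it prints means nothing outside your own machine and workload.

```python
import json
import timeit

def make_messages(count):
    # Hypothetical stand-in; substitute captured samples of real messages.
    return [
        {"order-id": f"A-{i}",
         "order-detail": {"item-count": i % 5 + 1, "item-price": 19.99}}
        for i in range(count)
    ]

def round_trip(messages):
    # Serialize and immediately parse each message, as sender plus receiver.
    return [json.loads(json.dumps(m)) for m in messages]

def best_seconds(messages, repeats=5):
    # Take the fastest of several runs to reduce noise from other processes.
    timer = timeit.Timer(lambda: round_trip(messages))
    return min(timer.repeat(repeat=repeats, number=1))

messages = make_messages(1000)
elapsed = best_seconds(messages)
print(f"{len(messages) / elapsed:,.0f} messages/second round-tripped")
```

Compare that rate with what your database or downstream service can actually absorb before deciding that the encoding is the problem.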
The data format long tail · Amazon Ion has been around for years running big systems inside Amazon, and decloaked in 2015-16. It’s a JSON superset with a usefully-enriched type system that comes in fully interoperable binary and textual formats. It has a schema facility. I’ve never used Ion but people at Amazon whose opinion I respect swear by it. Among other things, it’s used heavily in QLDB, which is my personal favorite new AWS service of recent years.
CBOR is another binary format, also a superset of JSON. I am super-impressed with the encoding and tagging designs. It also has a schema facility called CDDL that I haven’t really looked at. CBOR has implementations in really a lot of different languages.
I know of one very busy data stream at AWS that’s running at a couple of million objects a second where you inject JSON and receive JSON, but the data in the pipe is CBOR because at that volume size really starts to matter. It helped that the service is implemented in Java and the popular Jackson library handles CBOR in a way that’s totally transparent to the developer.
I hadn’t really heard much about MessagePack until I was researching this piece. It’s yet another “efficient binary serialization format”. The thing that strikes me is that every single person who’s used it seems to have positive things to say, and I haven’t encountered a “why this sucks” rant of the form that it’s pretty easy to find for every other object encoding mentioned in this piece. Checking it out is on my to-do list.
While on the subject of efficient something something binary somethings, I should mention Cap’n Proto and FlatBuffers, both of which seem to be like Avro only more so, and make extravagant claims about how you can encode/decode in negative nanoseconds. Neither seems to have swept away the opposition yet, though.
[Shouldn’t you mention YAML? —Ed.]
[No, this piece is about data on the network. —T.]
On Schemas · Binary really needs schemas to work, because unless you know what those bits all snuggled up together mean, you can’t un-snuggle them into your software’s data structures. This creates a problem because the sender and receiver need to use the same (or at least compatible) schemas, and, well, they’re in different places, aren’t they? Otherwise what’s the point of having messaging software?
Now there are some systems, for example Hadoop, where you deal with huge streams of records all of which are the same type. So you only have to communicate the schema once. A useful trick is to have the first record you send be the schema which then lets you reliably parse all the others.
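The schema-first trick can be sketched in a toy format: a length-prefixed JSON header that names the fields and gives a `struct` layout, followed by tightly packed records. Everything here (the header shape, `write_stream`, `read_stream`) is invented for illustration and is not the layout of Avro or any Hadoop file format.

```python
import json
import struct

def write_stream(fields, fmt, rows):
    # First item on the wire is the schema: field names plus a struct format.
    header = json.dumps({"fields": fields, "format": fmt}).encode("utf-8")
    packer = struct.Struct(fmt)
    parts = [struct.pack(">I", len(header)), header]
    parts.extend(packer.pack(*row) for row in rows)
    return b"".join(parts)

def read_stream(blob):
    # Read the schema once, then use it to decode every following record.
    (header_len,) = struct.unpack_from(">I", blob, 0)
    schema = json.loads(blob[4:4 + header_len])
    packer = struct.Struct(schema["format"])
    body = blob[4 + header_len:]
    return [dict(zip(schema["fields"], rec)) for rec in packer.iter_unpack(body)]

blob = write_stream(["id", "value"], "<Hd", [(1, 2.5), (2, 3.5), (3, 4.5)])
decoded = read_stream(blob)
print(decoded)
```

The schema cost is paid once per stream, so with millions of identically shaped records the per-record overhead of labels disappears.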
Avro’s wire format on Kafka has a neat trick: The second through fifth byte encode a 4-byte integer that identifies the schema. The number has no meaning, the schema registry assigns them one-by-one as you add new schemas. So assuming both the sender and the receiver are using the same schema registry, everything should work out fine. One can imagine a world in which you might want to share schemas widely and give them globally-unique names. But those 32-bit numbers are deliciously compact and stylishly postmodern in their minimalism, no syntax to worry about.
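Pulling that schema identifier out of a message takes a few lines. In this sketch the 4-byte ID in bytes two through five comes from the description above; the leading zero "magic" byte and the big-endian byte order are my understanding of Confluent's framing and should be treated as assumptions to check against their documentation. The payload is placeholder bytes, not real Avro.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0                  # assumed value of the first byte
HEADER = struct.Struct(">bI")   # 1 byte, then a 4-byte big-endian unsigned int

def frame(schema_id: int, payload: bytes) -> bytes:
    # Prefix an already-encoded payload with the 5-byte header.
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes) -> Tuple[int, bytes]:
    # Split a framed message into (schema id, encoded payload).
    if len(message) < HEADER.size:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected first byte: {magic}")
    return schema_id, message[HEADER.size:]

framed = frame(42, b"\x02\x06abc")   # placeholder payload
schema_id, payload = unframe(framed)
print(schema_id, payload)
```

A receiver would then look up `schema_id` in the shared registry, cache the schema, and use it to decode the payload.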
Some factions of the developer population are disturbed and upset that a whole lot of JSON is processed by programmers who don’t trouble themselves much about schemas. Let me tell you a story about that.
Back in 2015, I was working on the AWS service that launched as CloudWatch Events and is now known as EventBridge. It carries events from a huge number of AWS services in a huger number of distinct types. When we were designing it, I was challenged “Shouldn’t we require schemas for all the event types?” I made the call that no, we shouldn’t, because we wanted to make it super-easy for AWS services to onboard, and in a lot of cases the events were generated by procedural code and never had a schema anyhow.
We’ve taken a lot of flak for that, but I think it was the right call, because we did onboard all those services and now there are a huge number of customers getting good value out of EventBridge. Having said that, I think it’d be a good idea at some future point to have schemas for those events to make developers’ lives easier.
Not that most developers actually care about schemas as such. But they would like autocomplete to work in their IDEs, and they’d like to make it easy to transmogrify a JSON blob into a nice programming-language object. And schemas make that possible.
But let’s not kid ourselves; schemas aren’t free. You have to coördinate between sender and receiver, and you have to worry what happens when someone wants to add a new field to a message type — but in raw JSON, you don’t have to worry, you just toss in the new field and things don’t break. Flexibility is a good thing.
Events, pub/sub, and inertia · Speaking of changes in message formats, here’s something I’ve learned in recent years while working on AWS eventing technology: It’s really hard to change them. Publish/subscribe is basic to event-driven software, and the whole point of pub/sub is that the event generator needn’t know, and doesn’t have to care, about who’s going to be catching and processing those events. This buys valuable decoupling between services; the bigger your apps get and the higher the traffic volume, the more valuable the decoupling becomes. But it also means that you really really can’t make any backward-incompatible changes in your event formats because you will for damn sure break downstream software you probably never knew existed. I speak from bitter experience here.
Now, if your messages are in JSON, you can probably get away with throwing in new fields. But almost certainly not if you’re using a binary encoding.
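A sketch of the difference: a JSON consumer that only looks up the fields it needs, next to a consumer hard-wired to a naive fixed binary layout. Both consumers and both layouts are invented for illustration. Formats such as Protobufs and Avro define schema-evolution rules meant to soften this, so the sketch shows the failure mode rather than an inevitability.

```python
import json
import struct

def json_consumer(raw):
    # Reads only the fields it knows about; extra fields are ignored.
    event = json.loads(raw)
    return event["id"], event["price"]

OLD_LAYOUT = struct.Struct("<Hd")   # assumed layout: 16-bit id, 64-bit price

def binary_consumer(raw):
    # Hard-wired to the old layout; any other size is an error.
    return OLD_LAYOUT.unpack(raw)

def survives(consumer, raw):
    try:
        consumer(raw)
        return True
    except (struct.error, KeyError, ValueError):
        return False

new_json = '{"id": 7, "price": 9.5, "currency": "EUR"}'     # field added
new_binary = struct.pack("<Hd3s", 7, 9.5, b"EUR")           # field appended

print("JSON consumer survives new field:  ", survives(json_consumer, new_json))
print("binary consumer survives new field:", survives(binary_consumer, new_binary))
```

The JSON consumer keeps working because it never asked about `currency`; the fixed-layout consumer fails because the message no longer matches the exact size it expects.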
What this means in practice is that if you have a good reason to update your event format, you can go ahead and do it, but then you probably have to emit a stream of new-style events while you keep emitting the old-style events too, because if you cut them off, cue more downstream breakage.
The take-away is that if you’re going to start emitting events from a piece of software, put just as much care into it as you would into specifying an API. Because event formats are a contract, too. And never forget Hyrum’s Law:
With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.
The single true answer to all questions about data encoding formats · “It depends.”