More or less all the big APIs are RESTful these days. Yeah, you can quibble about what “REST” means (and I will, a bit), but the assertion is broadly true. Is it going to stay that way forever? Seems unlikely. So, what’s next?

What we talk about when we talk about “REST” · These days, it’s used colloquially to mean any API that is HTTP-based. In fact, the vast majority of them offer CRUD operations on things that have URIs, embed some of those URIs in their payloads, and thus are arguably RESTful in the original sense; although these days I’m hearing the occasional “CRUDL”, where the L is for List.
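To make the CRUDL shape concrete, here’s a minimal in-memory sketch in Python; the resource name and fields are made up, but the verb-to-operation mapping is the standard one:

```python
# A minimal in-memory sketch of the CRUDL shape most "RESTful" APIs share.
# The resource ("widgets") and its fields are hypothetical.

class WidgetStore:
    def __init__(self):
        self._items = {}
        self._next_id = 1

    def create(self, data):       # POST /widgets
        uri = f"/widgets/{self._next_id}"
        self._next_id += 1
        self._items[uri] = dict(data, uri=uri)
        return self._items[uri]

    def read(self, uri):          # GET /widgets/{id}
        return self._items[uri]

    def update(self, uri, data):  # PUT /widgets/{id}
        self._items[uri].update(data)
        return self._items[uri]

    def delete(self, uri):        # DELETE /widgets/{id}
        del self._items[uri]

    def list(self):               # GET /widgets  (the "L" in CRUDL)
        return list(self._items)

store = WidgetStore()
widget = store.create({"name": "demo"})
print(widget["uri"])     # the payload embeds the resource's own URI
print(store.list())
```

The point is just that each item has a URI of its own, the payload embeds it, and List is the one operation that doesn’t fit the CRUD acronym.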

At AWS, where I work, we almost always distinguish, for a service or an app, between its “control plane” and its “data plane”. For example, consider our database-as-a-service RDS: the control-plane APIs are where you create, configure, back up, start, stop, and delete databases. The data plane is SQL, with connection pools and all that RDBMS baggage.

It’s interesting to note that the control plane is RESTful, but the data plane isn’t at all. (This isn’t necessarily a database thing: DynamoDB’s data plane is pretty RESTful.)

I think there’s a pattern there: the control plane for almost anything online has a good chance of being RESTful because, well, that’s where you’re going to be creating and deleting stuff. The data plane might be a different story; my first prediction here is that whatever starts to displace REST will start doing it on the data-plane side, if only because control planes and REST are such a natural fit.

RESTful imperfections · What are some reasons we might want to move beyond REST? Let me list a few:

Latency · Setting up and tearing down an HTTP connection for every little operation you want to do is not free. A couple of decades of effort have reduced the cost, but still.

For example, consider two messaging systems that are built by people who sit close to me: Amazon SQS and Amazon MQ. SQS has been running for a dozen years, can handle millions of messages per second and, assuming your senders and receivers are reasonably well balanced, can be really freaking fast. In fact, I’ve heard stories of messages actually being received before they were sent; the long-polling receiver grabbed the message before the sender side got around to tearing down its SendMessage HTTP connection. The MQ data plane, on the other hand, doesn’t use HTTP; it uses nailed-up TCP/IP connections with its own framing protocols, so you can get astonishingly low latencies for transmit and receive operations. But your throughput is limited by the number of messages the “message broker” terminating those connections can handle. A lot of the people who use MQ are pretty convinced that one of the reasons they’re doing it is that they don’t want a RESTful interface.

Coupling · In the wild, most REST requests (like most things labeled as APIs) operate synchronously; that is to say, you call them (GET, POST, PUT, whatever) and you stall until you get your result back. Now (speaking HTTP lingo) your request might return 202 Accepted, in which case you’d expect either to have sent a URI along to be called back as a webhook, or to get one in the response that you can poll. But in all these cases the coupling is still pretty tight; you (the caller) have to maintain some sort of state about the request until the callee is done with it, whether that’s now or later.
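Here’s the 202-Accepted dance boiled down to a toy sketch (names hypothetical, no real HTTP involved), showing the state both sides have to carry until the work is done:

```python
# Sketch of the 202-Accepted pattern: the server returns a polling URI and
# the caller stays coupled to it until the job finishes. Illustrative only.
import itertools

class AsyncJobServer:
    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, payload):
        job_id = next(self._ids)
        self._jobs[job_id] = {"state": "PENDING", "payload": payload}
        return 202, f"/jobs/{job_id}"     # 202 Accepted plus a URI to poll

    def work(self):                       # stand-in for a background worker
        for job in self._jobs.values():
            job["state"] = "DONE"

    def status(self, uri):                # GET /jobs/{id}
        job_id = int(uri.rsplit("/", 1)[1])
        return self._jobs[job_id]["state"]

server = AsyncJobServer()
code, uri = server.submit({"op": "resize"})
print(code, server.status(uri))   # the caller must now remember this URI
server.work()
print(server.status(uri))
```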

Which sort of sucks. In particular when it’s one microservice calling another, and the client service is sending requests at a higher rate than the server-side one can handle; a situation that can lead to acute pain very quickly.

Short life · Handling some requests takes milliseconds. Handling others (a citizenship application, for example) can take weeks and involve orchestrating lots of services, and occasionally human interactions. The notion of having a thread hanging around waiting for something to happen is ridiculous.

A word on GraphQL · It exists, basically, to handle the situation where a client has to assemble several flavors of information to do its job; for example, a mobile app building an information-rich display. Since RESTful interfaces tend to do a good job of telling you about a single resource, this can lead to a wasteful flurry of requests. So GraphQL lets you cherry-pick an arbitrary selection of fields from multiple resources in a single request. Presumably, the server-side implementation issues that request flurry inside the data center, where those calls are cheaper, then assembles your GraphQL output; but anyhow, that’s no longer your problem.
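For illustration, here’s the cherry-picking idea reduced to a few lines of Python; the “query” is a plain dict rather than real GraphQL syntax, and the data is invented:

```python
# Toy version of the GraphQL idea: one request names the fields it wants
# from several resources, and the server fans out internally. The data
# and field names here are invented for illustration.

USERS = {1: {"name": "Ada", "email": "ada@example.com", "team_id": 7}}
TEAMS = {7: {"title": "Infra", "size": 12, "budget": 100000}}

def resolve(query, user_id):
    user = USERS[user_id]
    result = {f: user[f] for f in query["user"]}   # cherry-pick user fields
    if "team" in query:
        team = TEAMS[user["team_id"]]              # the internal "flurry" call
        result["team"] = {f: team[f] for f in query["team"]}
    return result

# One round trip instead of two RESTful GETs:
print(resolve({"user": ["name"], "team": ["title"]}, 1))
# {'name': 'Ada', 'team': {'title': 'Infra'}}
```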

I observe that lots of client developers like GraphQL, and it seems like the world has a place for it, but I don’t see it as that big a game-changer. To start with, it’s not as though client developers can compose arbitrary queries, limited only by the semantics of GraphQL, and expect to get uniformly decent performance. (To be fair, the same is true of SQL.) Anyhow, I see GraphQL as a convenience feature designed to make synchronous APIs run more efficiently.

A word on RPC · By which, these days, I guess I must mean gRPC. I dunno; I’m old enough that I saw generation after generation of RPC frameworks fail miserably: brittle, requiring lots of configuration, and failing to deliver the anticipated performance wins. It smells like making RESTful APIs more tightly coupled, to me, and it’s hard to see that as a win. But I could be wrong.

Post-REST: Messaging and Eventing · This approach is all over, and I mean all over, the cloud infrastructure that I work on. The idea is that you get a request, you validate it, maybe you do some computation on it, then you drop it on a queue (or bus, or stream, or whatever you want to call it) and forget about it; it’s not your problem any more.

The next stage of request handling is implemented by services that read the queue and either route an answer back to the original requester or pass it on to another service stage. Now, for this to work, the queues in question have to be fast (which, these days, they are), scalable (which they are), and very, very durable (which they are).
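Here’s the fire-and-forget pattern in miniature, with Python’s in-process queue.Queue standing in for a durable, scalable message service like SQS or Kafka:

```python
# Fire-and-forget in miniature: the front end validates and enqueues,
# a separate stage drains the queue later. queue.Queue is only a stand-in
# for a real durable message service.
import queue

work_queue = queue.Queue()

def handle_request(req):
    if "body" not in req:
        return "400 Bad Request"   # validate, reject junk up front
    work_queue.put(req)            # drop it on the queue and forget it
    return "202 Accepted"          # done from the front end's point of view

def next_stage(results):
    # A downstream service reads the queue on its own schedule.
    while not work_queue.empty():
        req = work_queue.get()
        results.append(req["body"].upper())   # stand-in for real work

results = []
print(handle_request({"body": "hello"}))   # 202 Accepted
next_stage(results)
print(results)
```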

There are a lot of wins here. To start with, transient query surges are no longer a problem. Also, once you’ve got a message stream you can do fan-out and filtering and assembly and subsetting and all sorts of other useful stuff, without disturbing the operations of the upstream message source.
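Fan-out and filtering can be sketched in a few lines; each subscriber below applies its own filter to the same stream without touching the source (the event shapes and subscriber names are invented):

```python
# Fan-out and filtering over a message stream, sketched with plain lists.
# Each subscriber gets its own filtered copy; the source is untouched.
events = [
    {"type": "order", "amount": 250},
    {"type": "ping"},
    {"type": "order", "amount": 40},
]

subscribers = {
    "billing":   lambda e: e["type"] == "order",
    "big_deals": lambda e: e["type"] == "order" and e["amount"] > 100,
    "ops":       lambda e: True,              # ops wants everything
}

fanned_out = {
    name: [e for e in events if accept(e)]
    for name, accept in subscribers.items()
}

print({name: len(msgs) for name, msgs in fanned_out.items()})
# {'billing': 2, 'big_deals': 1, 'ops': 3}
```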

Post-REST: Orchestration · This gets into workflow territory, something I’ve been working on a lot recently. By “workflow” I mean a service tracking the state of computations that have multiple steps, any one of which can take an arbitrarily long time, can fail, can need to be retried, and whose behavior and output affect the choice of subsequent steps and their behavior.

An increasing number of (for example) Lambda functions are, rather than serving requests and returning responses, executing in the context of a workflow that provides their input, waits for them to complete, and routes their output further downstream.
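A workflow engine is a big piece of software, but the core loop (track state, retry failures, feed each step’s output to the next) fits in a small sketch; this is illustrative only, not any particular service’s API:

```python
# A tiny orchestrator: it tracks multi-step state, retries failed steps,
# and routes each step's output to the next. Names are hypothetical.

def run_workflow(steps, inp, max_retries=2):
    state = {"input": inp, "history": []}
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                state["input"] = fn(state["input"])   # output feeds next step
                state["history"].append((name, "ok"))
                break
            except Exception:
                state["history"].append((name, f"retry {attempt + 1}"))
        else:
            raise RuntimeError(f"step {name} exhausted retries")
    return state

# A step that fails once, then succeeds, to show the retry path:
calls = {"n": 0}
def flaky_double(x):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")
    return x * 2

result = run_workflow([("validate", lambda x: x), ("double", flaky_double)], 21)
print(result["input"])     # 42
print(result["history"])   # includes the recorded retry of "double"
```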

Post-REST: Persistent connections · Back a few paragraphs I talked about how MQ message brokers work, maintaining a bunch of nailed-up network connections and pumping bytes back and forth across them. It’s not hard to believe that there are lots of scenarios where this is a good fit for the way data and execution want to flow.

Now, we’re already partway there. For example, SQS clients routinely use “long polling” (up to 20 seconds) to receive messages. That means they ask for messages, and if there aren’t any, the server doesn’t say “no dice”; it holds the connection open for a while and, if some messages come in, shoots them back to the caller. If you have a bunch of threads (potentially on multiple hosts) long-polling an SQS queue, you can get massive throughput at low latency and really reduce the per-request cost of using HTTP.
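Long polling is easy to simulate with a blocking queue; the receiver below holds its “connection” open and gets the message the moment it arrives, instead of timing out and re-polling:

```python
# Long polling simulated with a thread and a blocking queue: the receiver
# waits up to a timeout, and a message arriving mid-wait comes back
# immediately rather than forcing another poll.
import queue
import threading
import time

q = queue.Queue()

def receive(wait_seconds):
    try:
        return q.get(timeout=wait_seconds)   # hold the "connection" open
    except queue.Empty:
        return None                          # timed out, nothing arrived

def late_sender():
    time.sleep(0.1)          # the message shows up after the poll starts
    q.put("hello")

threading.Thread(target=late_sender).start()
msg = receive(wait_seconds=2)   # returns ~0.1s in, well before the timeout
print(msg)
```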

The next two steps forward are pretty easy to see, too. The first is HTTP/2, already widely deployed, which lets you multiplex multiple HTTP requests across a single network connection. Used intelligently, it can buy you quite a few of the benefits of a permanent connection. But it’s still firmly tied to TCP, which has some unfortunate side-effects that I’m not going to deep-dive on here, partly because it’s not a thing I understand that deeply. But I expect to see lots of apps and services get good value out of HTTP/2 going forward, in some part because, as far as clients can tell, they’re still making, and responding to, the same old HTTP requests they were before.

The next step after that is QUIC (Quick UDP Internet Connections), which abandons TCP in favor of UDP while retaining HTTP semantics. This is already in production on a lot of Google properties. I personally think it’s a really big deal; one of the reasons HTTP was so successful is that its connections are short-lived and thus much less likely to suffer breakage while they’re at work. This is really good, because designing an application-level protocol that can deal with broken connections is super-hard. In the world of HTTP, the most you have to deal with at one time is a failed request, and a broken connection is just one of the reasons that can happen. UDP makes the connection-breakage problem go away by not really having connections.

Of course, there’s no free lunch. If you’re using UDP, you’re not getting the TC in TCP (Transmission Control, that is), which takes care of packetizing and reassembly and checksumming and throttling and loads of other super-useful stuff. But judging by the evidence I see, QUIC does enough of that well enough to support HTTP semantics cleanly, so once again, apps that want to go on using the same old XMLHttpRequest calls like it was 2005 can remain happily oblivious.

Brave New World! · It seems inevitable to me that, particularly in the world of high-throughput, high-elasticity cloud-native apps, we’re going to see a steady increase in reliance on persistent connections, orchestration, and message/event-based logic. If you’re not using that stuff already, now would be a good time to start learning.

But I bet that for the foreseeable future, a high proportion of all requests to services are going to have (approximately) HTTP semantics; that for most control planes and quite a few data planes, REST still provides a good clean way to decompose complicated problems; and that its extreme simplicity and resilience will mean that if you want to design networked apps, you’re still going to have to learn that way of thinking about things.



From: Chad Brewbaker (Nov 19 2018, at 11:50)

At Global Day of Code Retreat last weekend, the easiest Game of Life implementation was actually in SQL. Surprised me a bit.

Most transactional data is in SQL data stores. Perhaps the problem is on the database end? Both the failure of SQL databases to have REST-style discoverability, and our failure to give SQL schemas for the data views offered by "REST" APIs.

Blowing away data schemas at the server-application level, for the sake of type-erasure "flexibility", has hurt more than it has helped.


From: Mike Bannister (Nov 19 2018, at 14:13)

>Anyhow, I see GraphQL as a convenience feature designed to make synchronous APIs run more efficiently.

I was curious if you meant to use the word "synchronous" here or not? Doesn't seem right but maybe I'm misunderstanding?


From: Alastair Houghton (Nov 20 2018, at 10:12)

It's a shame that SCTP hasn't taken off outside of telecoms; it solves a lot of the problems with TCP and it would make a lot of sense to use it instead of reimplementing TCP-like semantics over UDP or coming up with complicated multi-channel application protocols that can work over TCP (these have significant problems when running on top of TCP, as the TCP layer enforces potentially unnecessary ordering constraints that can cause head-of-line blocking and other problems).


From: Santiago Gala (Nov 20 2018, at 13:29)

Not completely unrelated. :)

Re: latency and persistent connections, I had an "aha!" moment when I recently experimented with Wireguard as compared with my usual L2TP/IPsec VPNs: when I suspend the laptop, move elsewhere and resume it, 2 roundtrips, prompted by the first packet routed, is all it takes to have the connection back. It is virtually stateless and maintenance free. When I compare with L2TP/IPsec, it takes minimum 5 roundtrips, to get a security association going, even more if the L2TP tunnel dies and needs to be restarted.


From: Lewis Cowles (Nov 20 2018, at 22:49)

I'm not so sure that these millions-of-requests-per-second monoliths are (more than superficially) needed. After recently using GraphQL as the sole source of truth for an application, I hate it with a passion, and consider it antithetical to achieving high throughput.

Problems GraphQL introduces which REST / single-object RPC do not

- Traversing the leaf nodes of the graph that can be n-levels deep (assuming we're not expecting junk we ignore)

- Aggregation of data from systems which do not (or should not) store hierarchical data (SQL, KV)

- Separation of data to systems which do not (or should not) store hierarchical data (SQL, KV)

- What to do in case of a potentially increased space of conflicts

We recently had an error in one of our applications. In comes a request to update just one field of a resource that was "designed to use deep-structures". It's a JSON API endpoint, and it simply sets approval for a person. Somewhere in that nest of code (which I did not author), it triggers a cascade effect which wipes out skills, which can be sent with a resource, but have not in this case been sent.

If we instead had REST endpoints for attached resources for all write operations, we'd have no problems apart from selecting the appropriate endpoint for our requests, reducing debugging, error count, and software complexity.

I don't pretend that the problem lies solely with graph-inspired API's, but I do think if people are honest, having RPC and REST APIs as well as a few graph endpoints might be more the future than "REST is dead, long live graph"


From: Matthew Pava (Nov 27 2018, at 08:36)

I don't think we should be spending so much time focusing on a million-requests-per-minute infrastructure. The Internet was supposed to be distributed. It should be unlikely that a million endpoints are accessing a single server for any purpose. It's time to go back to our roots and focus on truly distributed computing. The problem becomes simpler to solve.


From: Nox (Dec 03 2018, at 18:39)

On coupling and Post-REST Messaging and Eventing. Is the understanding that the client/caller is required to maintain the state of the call on both cases?


November 18, 2018