Slow REST

We’re working on a fairly substantial revision of the Sun Cloud API, motivated by this problem: In a RESTful context, how do you handle state-changing operations (POST, PUT, DELETE) which have substantial and unpredictable latency?

What we’ve learned, from work with our own back-end based on the Q-layer technology and with some other back-ends, is that Cloud operations are by and large not very fast; and that the latencies show up in weird places. Here’s an example: in our own implementation, creating a Virtual Machine from a template or by copying another VM instance is very snappy. But weirdly, connecting a network (public or private) to a VM can sometimes be extremely slow. Go check out other implementations like EC2 and you see a similar unpredictable-latency narrative.

The idiom we’d been using so far was along these lines:

As with both AtomPub and Rails, when you want to create something new you POST it to a collection of some sort and the server comes back with “201 Created” and the URI of the new object.
When you POST to some controller (for example “boot a machine”) or do a DELETE, the server comes back with “204 No content” to signal success.

This is all very well and good; but what happens when some of these operations take a handful of milliseconds and others (e.g. “boot all the VMs in this cluster”) could easily go away for several minutes?

The current thinking is evolving in the Project Kenai forums, and was started up by Craig McClanahan in PROPOSAL: Handling Asynchronous Operation Requests. Check it out, and put your oar in if you have something better in mind.

To summarize: For any and all PUT/POST/DELETE operations, we return “202 Accepted” and a new “Status” resource, which contains a 0-to-100 progress indicator, a target_uri for whatever’s being operated on, an op to identify the operation, and, when progress reaches 100, status and message fields to tell how the operation came out. The idea is to give clients a hook that implementors can make cheap to poll.
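To make that concrete, here is a rough client-side sketch of the polling loop in Python. The field names (progress, target_uri, op, status, message) come from the proposal; everything else is an assumption for illustration: the Status class, the poll_until_done helper, the fetch callable standing in for a GET on the Status resource, and the /vms/1 URI.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Status:
    """Hypothetical client-side view of the proposed Status resource."""
    op: str                        # which operation this tracks, e.g. "boot"
    target_uri: str                # the resource being operated on
    progress: int                  # 0 to 100
    status: Optional[int] = None   # outcome, only meaningful at progress 100
    message: Optional[str] = None  # human-readable outcome, same condition

    @property
    def done(self) -> bool:
        return self.progress >= 100


def poll_until_done(fetch: Callable[[], Status],
                    interval: float = 1.0,
                    max_polls: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> Status:
    """Call fetch() until the Status reports progress 100, or give up."""
    for _ in range(max_polls):
        current = fetch()
        if current.done:
            return current
        sleep(interval)
    raise TimeoutError("operation still in progress after max_polls polls")


# Canned responses stand in for repeated GETs on the Status URI.
canned = iter([
    Status("boot", "/vms/1", 0),
    Status("boot", "/vms/1", 60),
    Status("boot", "/vms/1", 100, status=204, message="booted"),
])
final = poll_until_done(lambda: next(canned), sleep=lambda _: None)
print(final.op, final.target_uri, final.status, final.message)
```

In a real client, fetch would do an HTTP GET on the Status URI returned with the 202 and parse the body; the loop itself would not need to change.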

We also thought about a Comet-style implementation where we keep the HTTP channel open. That can be made clean, but support for it in popular libraries is less than ubiquitous. My personal favorite idea was to use “Web hooks”, i.e. the client sends a URI along with the request and the server POSTs back to it when the operation is complete. But every time I started talking about it I ran into a brick wall because it probably doesn’t work for a client behind a firewall, which is where most of them will be. Sigh.
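For the record, the Web-hooks variant might look something like this on the server side. This is only a sketch under assumptions: the callback_uri parameter name, the JSON body, and the example URIs are all made up here, and the payload simply reuses the Status fields described above. The function builds the callback request without sending it.

```python
import json
import urllib.request


def build_completion_callback(callback_uri: str, op: str, target_uri: str,
                              status: int, message: str) -> urllib.request.Request:
    """Build (but do not send) the POST the server would make on completion."""
    payload = {
        "op": op,
        "target_uri": target_uri,
        "progress": 100,     # a callback only fires once the work is finished
        "status": status,
        "message": message,
    }
    return urllib.request.Request(
        callback_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical callback URI that the client supplied with its original request.
request = build_completion_callback(
    "http://client.example/hooks/42", "boot", "/vms/1", 204, "booted")
print(request.get_method(), request.full_url)
```

Actually delivering it would mean passing the request to urllib.request.urlopen, and that is exactly the step that fails when the client’s callback URI sits behind a firewall.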

There are a few points that are still troubling me, listed here in no particular order:

When an operation is finished and you want to provide a Status code, we’re re-using HTTP status codes. Which on the one hand seems a bit outside their design space, but on the other hand maybe it’s a wheel we don’t have to re-invent.
Instead of having the “op” field, we could have a different media-type for each imaginable kind of Status resource. That might be a bit more RESTful but seems less convenient for client implementors to use.
This whole notion of the target_uri makes me wonder if we’re missing a generalization. The most obvious role is when the Status is that of a create operation, for example Create New VM; then the target_uri is the new resource’s URI, what would come back in the Location HTTP header in a synchronous world.

And in a few cases you might want more than one target, for example when you’re attaching an IP address to a VM.

Hmmm.
Speaking of generalization, I wonder if this whole “Slow REST” thingie is a pattern that’s going to pop up again often enough in the future that we should be thinking of a standardized recipe for approaching it; the kind of thing that has arisen for CRUD operations in the context of AtomPub and Rails.

What do you think?

Contributions


From: David Ing (Jul 02 2009, at 16:24)

When I've had to do this we usually end up with an addressable state/resource to represent what's going on in transit, i.e.

- On

- Turning Off

- Off

You're right that it does come up a fair bit, and I can count at least three times already I've seen it modeled in very different apps.

Unsure about the generalization of it though, as you seem to want to improve the whole '202 in progress' notification part as well - and that starts meaning specific client capabilities/environments, i.e. why HTTP isn't XMPP.

Anything that gets REST further from a narrow CRUD usage view is all good with me, and if Slow REST helps that then more power to it.