We’re working on a fairly substantial revision of the Sun Cloud API, motivated by this problem: In a RESTful context, how do you handle state-changing operations (POST, PUT, DELETE) which have substantial and unpredictable latency?
What we’ve learned, from work with our own back-end based on the Q-layer technology and with some other back-ends, is that Cloud operations are by and large not very fast; and that the latencies show up in weird places. Here’s an example: in our own implementation, creating a Virtual Machine from a template or by copying another VM instance is very snappy. But weirdly, connecting a network (public or private) to a VM can sometimes be extremely slow. Go check out other implementations like EC2 and you see a similar unpredictable-latency narrative.
The idiom we’d been using so far was along these lines:
As with both AtomPub and Rails, when you want to create something new you POST it to a collection of some sort and the server comes back with “201 Created” and the URI of the new object.
When you POST to some controller (for example “boot a machine”) or do a DELETE, the server comes back with “204 No Content” to signal success.
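In wire terms, the create case looks something like this; the URIs and payload here are invented for illustration, not actual Sun Cloud API resources:

```http
POST /vms HTTP/1.1
Host: cloud.example.com
Content-Type: application/json

{"name": "web-01", "template": "/templates/opensolaris"}

HTTP/1.1 201 Created
Location: /vms/1234
```

A controller POST or a DELETE is even simpler: the server just comes back with “204 No Content” and an empty body.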
This is all very well and good; but what happens when some of these operations take a handful of milliseconds and others (e.g. “boot all the VMs in this cluster”) could easily go away for several minutes?
The current thinking is evolving in the Project Kenai forums, and was started up by Craig McClanahan in PROPOSAL: Handling Asynchronous Operation Requests. Check it out, and put your oar in if you have something better in mind.
To summarize: For any and all PUT/POST/DELETE operations, we return “202 Accepted” and a new “Status” resource, which contains a 0-to-100 progress indicator, a target_uri for whatever’s being operated on, an op to identify the operation, and, when progress reaches 100, code and message fields to tell how the operation came out. The idea is that this is designed to give a hook that implementors can make cheap to poll.
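Here’s a sketch of what a client’s polling loop against such a Status resource might look like. The field names (progress, target_uri, op, code, message) follow the proposal’s vocabulary, but the JSON shape, the URIs, and the fake_get stand-in are my own inventions; a real client would do an HTTP GET on the Status URI and parse the representation.

```python
# Hypothetical polling client for the proposed Status resource.
# Field names follow the proposal's vocabulary; everything else
# (URIs, JSON shape, poll interval) is invented for illustration.
import time

def fake_get(uri, _state={"progress": 0}):
    """Stand-in for an HTTP GET; bumps progress 50 points per call."""
    _state["progress"] = min(100, _state["progress"] + 50)
    status = {"op": "create", "target_uri": "/vms/1234",
              "progress": _state["progress"]}
    if status["progress"] == 100:
        status["code"] = 201        # re-used HTTP status code
        status["message"] = "Created"
    return status

def wait_for(status_uri, poll_interval=0.01):
    """Poll the Status resource until progress reaches 100."""
    while True:
        status = fake_get(status_uri)
        if status["progress"] == 100:
            return status
        time.sleep(poll_interval)

result = wait_for("/status/42")
print(result["code"], result["target_uri"])  # → 201 /vms/1234
```

The point of the design shows up in wait_for: the only thing the client ever does is a plain GET on one URI, which is the operation a server can make really cheap.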
We also thought about a Comet-style implementation where we keep the HTTP channel open; that can be made clean, but support for it in popular libraries is less than ubiquitous. My personal favorite idea was to use “Web hooks”, i.e. the client sends a URI along with the request and the server POSTs back to it when the operation is complete. But every time I started talking about it I ran into a brick wall, because it probably doesn’t work for a client behind a firewall, which is where most of them will be. Sigh.
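For what it’s worth, here’s roughly what the web-hooks variant looks like from the client side, assuming the client can run a reachable HTTP listener at all (which is exactly the firewall problem). Everything here, from the callback URI to the fields in the completion notice, is invented for illustration; in the last step we play the cloud’s part and POST the notice ourselves.

```python
# Sketch of the "web hooks" completion-callback idea: the client runs a
# tiny HTTP endpoint and hands its URI to the server with the original
# request; the server POSTs a completion notice back when it's done.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

received = []  # completion notices the client has seen

class CompletionHandler(BaseHTTPRequestHandler):
    """Client-side endpoint the cloud would POST to on completion."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)  # nothing to say back
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CompletionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# This is the URI the client would send along with its original request.
callback_uri = "http://127.0.0.1:%d/completed" % server.server_port

# Playing the cloud's part: POST the completion notice to the callback.
notice = {"op": "create", "target_uri": "/vms/1234", "code": 201}
urlopen(Request(callback_uri, data=json.dumps(notice).encode(),
                method="POST"))
server.shutdown()
print(received[0]["target_uri"])  # → /vms/1234
```

Nothing in the flow requires polling, which is the attraction; the catch is that callback_uri has to be reachable from the server’s side of the network.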
There are a few points that are still troubling me, listed here in no particular order:
To report how a finished operation came out, we’re re-using HTTP status codes as the Status resource’s code values. On the one hand that seems a bit outside their design space, but on the other hand maybe it’s a wheel we don’t have to re-invent.
Instead of having the “op” field, we could have a different media-type for each imaginable kind of Status resource. That might be a bit more RESTful, but seems less convenient for client implementors.
This whole notion of the target_uri makes me wonder if we’re missing a generalization. The most obvious role is when the Status is that of a create operation, for example Create New VM; then target_uri is the new resource’s URI, what would come back in the Location HTTP header in a synchronous world.
And in a few cases you might want more than one target, for example when you’re attaching an IP address to a VM.
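So maybe a Status resource wants a list of targets rather than a single one. Purely as a sketch (the plural target_uris field and all the values are my invention, not part of the proposal):

```json
{
  "op": "attach-address",
  "progress": 100,
  "code": 204,
  "message": "Attached",
  "target_uris": [
    "/vms/1234",
    "/addresses/192.0.2.7"
  ]
}
```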
Speaking of generalization, I wonder if this whole “Slow REST” thingie is a pattern that’s going to pop up again often enough in the future that we should be thinking of a standardized recipe for approaching it; the kind of thing that has arisen for CRUD operations in the context of AtomPub and Rails.
What do you think?