This piece is provoked by Ryan Daigle’s What’s New in Edge Rails: Simpler Conditional Get Support (ETags). I think it’s an important subject. I realize that many readers here understand what ETags are and why they matter, and will see right away how the API Ryan describes fits into the picture. If you don’t, and you build Web apps, it’s probably worth reading this and following some links.
Caching in General · If, every time a browser processed a URI beginning http://..., it actually fired off a GET at a server, and if, every time a GET hit a server, the server actually recomputed and sent the data, the Web would melt down PDQ. There is a lot of machinery available, on both the client and server side, to detect when the work of computing results and sending them over the network can be avoided. Normally, we use the term “caching” to refer to all this stuff. ¶
There is browser caching, and there are expiration dates and cache-control headers, which are worth implementing but which I’m not going to cover here. And on the server side, there are a variety of caching tools, of which memcached is the best-known example.
If You Must Ask · Even with all the caching, there are lots of occasions when a Web client has to fire off that GET, and it gets through to your server-side application code. But that doesn’t necessarily mean you have to compute and transmit. If the server discovers that whatever the URI identifies hasn’t changed since that client last fetched it, the server can send an (essentially) one-line response labeled with the HTTP Status Code “304 Not Modified”. This saves network bandwidth, because you don’t retransmit the whole resource representation, and, if it’s done cleverly, may save a whole bunch of computation and I/O for the server. ¶
Time-stamping · The most obvious way to accomplish this is for the client to send an HTTP If-Modified-Since header containing the date the URI was last fetched. This works just fine, particularly for a resource which is a static file (in fact, popular Web servers have this built right in).
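To make that concrete, here’s roughly what such an exchange looks like on the wire; the path, hostname, and date are invented for illustration:

GET /people/tim HTTP/1.1
Host: example.com
If-Modified-Since: Sat, 09 Aug 2008 19:43:31 GMT

HTTP/1.1 304 Not Modified

No body follows the 304; the client just keeps showing the copy it already has.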
But sometimes a single time-stamp isn’t enough information for a server to figure out whether the client needs a fresh copy of whatever it’s asking for. ¶
ETags are for this situation. The way it works is: when a client sends you a GET, you send back, along with the result, an HTTP header like so:
ETag: "1cc044-172-3d9aee80"
Whatever goes between the quotation marks is a signature for the current state of the resource that’s been requested. Here’s an example: Suppose you’ve got some sort of social-networking Web app, and a user asks to see her profile page. The way the page looks depends on a few things:
Who the user is.
Whether the app has been updated (e.g. new templates, stylesheets) since the last fetch.
Whether the profile has been updated since the last fetch.
The first you know already, and it shouldn’t be tough to make the second available to application code; you could have an app-version global, or store it in your database, or just have a file somewhere that gets updated so you can check its timestamp. The third requires a version number or update-timestamp field associated with the user profile, which you probably already have.
So what you do is turn those three things into a signature (probably by concatenating them and hashing the string) and send the ETag header along with the profile page.
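As a sketch, the signature computation might look like this in Ruby; APP_VERSION, the user’s id, and the profile’s updated_at column are hypothetical stand-ins for whatever your app actually keeps track of:

require 'digest/sha1'

# Turn the three inputs (user, app version, profile timestamp) into
# one opaque signature by concatenating and hashing them.
def profile_etag(user, profile)
  Digest::SHA1.hexdigest([user.id, APP_VERSION, profile.updated_at.to_i].join('/'))
end

Whatever you compute, remember that the value gets wrapped in quotation marks when it goes into the ETag header.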
Then, when the client wants to look at the profile page again, it sends an HTTP header along with the request like so:
If-None-Match: "1cc044-172-3d9aee80"
When you see this, you have a quick glance at the user ID, app version, and profile version, recompute the signature, and if it matches, you just send back a 304 Not Modified and your job is done. (The header is called If-None-Match because the client can send a bunch of different ETags along; but I’ve never seen anyone do that.)
In many cases, this is going to be a lot less computing than fetching the profile information out of the database tables and re-running the template to create the HTML you were going to send along.
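In pre-Edge-Rails terms, the whole dance might look something like this minimal sketch, reusing the hypothetical profile_etag helper from above:

def show
  @profile = Profile.find(params[:id])
  etag = %("#{profile_etag(current_user, @profile)}")  # quoted, per the spec
  if request.headers['If-None-Match'] == etag
    head :not_modified             # 304, no body, job done
  else
    response.headers['ETag'] = etag
    render                         # build the whole page as usual
  end
end

(In real life you’d look up just the profile’s version or timestamp, a cheap query, rather than loading the whole row; otherwise the database savings evaporate.)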
When This Matters · This matters if your Web app is maxed out on some combination of CPU and database, and a noticeable proportion of requests don’t really need a page-rebuild, and your existing caching and last-modified setup isn’t getting the job done. This isn’t going to be true of all Web apps, nor even of all Web apps that are suffering from overload. But my feeling, on surveying the landscape, is that there are a lot of apps out there where smart ETagging could cut the CPU load and database traffic down by a few percentage points, and those percentage points are damn precious in a server that’s breathing hard in public. ¶
This is particularly likely to be true if your Web app is written in a language that isn’t the world’s fastest (like Rails), and has an elaborate, complex object-relational mapper (like Rails), and was built in a big hurry to meet a perceived need, without much pre-optimization (like a lot of Rails apps).
I’m impressed by the response.etag and request.fresh? API Ryan Daigle describes; it’s typically elegant in the Rails style: “Tell us what matters and we’ll do the housekeeping”.
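Going by Ryan’s write-up, the controller sketched above collapses to something like this (the fingerprint ingredients are still hypothetical):

def show
  @profile = Profile.find(params[:id])
  # Hand Rails anything that fingerprints the page state; it digests
  # this into a quoted ETag header and handles the bookkeeping.
  response.etag = [current_user.id, APP_VERSION, @profile.updated_at].join('/')
  if request.fresh?(response)      # does it match the incoming If-None-Match?
    head :not_modified
  else
    render
  end
end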
I’m sure that other Web frameworks offer similar tools; perhaps readers might contribute pointers below?
Further Reading · The trade-offs around this, like the trade-offs around everything having to do with Web-app performance, are complicated. I wrote about this subject before (essentially requesting exactly what Edge Rails now has) in On Being for the Web. The comments to that piece are erudite and instructive, and link to lots of valuable primary materials: ¶
James Abley’s comment, while rude, is full of useful links, including to his own High Performance Web Sites: Rule 13 – Configure ETags, which covers some of the issues around ETagging static content.
Michael Koziarski’s Clever Caching has a nice explanation of the issues and shows the Ruby code you’d have had to write before these new APIs.
Joe Gregorio’s REST Tip: Deep etags give you more benefits covers still more ground, touches on database design, and his examples are in Python.
Bill de hÓra starts an argument over whether it’s more important to save bandwidth or CPU. I think he’s in the minority when he argues that bandwidth is more important, but it’s an issue you have to think about (and measure!) as you tune your system.
Comment feed for ongoing:
From: Bob Aman (Aug 14 2008, at 15:58)
It also matters when, for some reason, you're on a webhost with bandwidth limits and nasty overage charges. There are fewer and fewer of those these days, but I've gotten emails about it, so I know they still exist.
If someone starts pulling down a large-ish feed every hour and no Etag stuff is being done, that often works out to a lot of bytes transferred. So it's not always just about CPU usage saved.
From: Loïc d'Anterroches (Aug 15 2008, at 00:05)
Um, bandwidth is still an issue, especially now, because more and more people are using their mobile phone/pda over UMTS/Edge to access websites. And the connection in that case is bandwidth limited.
We basically need to handle both cases at the same time: DSL access with a Mbps connection and edge access with only a kbps connection.
Challenging when we want to do it right, but that's where the fun is.
From: John O'Shea (Aug 15 2008, at 02:38)
I almost missed it in your post but it is worth highlighting: when you use a proxy HTTP server to serve your Rails-generated cached pages, the proxy server will automatically compute/update the ETag: and Last-Modified: headers when the cached file appears/changes on disk (at least LiteSpeed does out of the box; pretty sure Apache does too). The result really is an incredibly fast, efficient and scalable use of bandwidth and server resources: page requests can scale from <10 req/sec to 1000+ req/sec easily once page caching is put in place.
Two barriers to page caching (apart from the obvious security/authentication constraints) that I think are worth considering early in a service's design are:
- metrics: if your business absolutely has to integrate request counts into other Rails app business metrics, then you'll need to build something that takes data from the HTTP server log file. Not terribly difficult if you have nice RESTful URLs and no session state, and ultimately it is a small price to pay for the service scalability achieved. High-volume sites could do worse than reuse some of the code from contributors to Tim's WideFinder project.
- pages that contain just one user-specific element but would otherwise be cacheable, e.g. a login name in a navigation bar. Techniques to fudge this like http://www.ibm.com/developerworks/web/library/wa-rails2/ work, but only if you have a small piece of static data and if you are assuming the browser has JavaScript/cookies enabled.
John.
From: Justin Sheehy (Aug 15 2008, at 05:59)
In addition to the good links you already provided, Mark Nottingham's caching tutorial (though perhaps a little dated) should be required reading for anyone building a cacheable Web system.
http://www.mnot.net/cache_docs/
From: Gerald Oskoboiny (Aug 22 2008, at 15:05)
Good stuff.
It's straightforward to use the last-modified/if-modified-since approach when a page depends on multiple components: you just need to compare the If-Modified-Since time against the maximum timestamp of the components that make up the page.
Here's a sample implementation in Perl:
http://impressive.net/archives/fogo/20061012104240.GA28586@impressive.net
From: James Abley (Aug 23 2008, at 13:04)
A minor correction: the link you cite as mine is part of the Yahoo! Developer Network Blog rather than something I authored.
My apologies. The last thing I wanted to do was come across as rude. I think at the time, I'd been reading a lot around the subject and for whatever reason, you seemed a little late to the party, that was all. Normally, you seem to be one of the first people in my feed reader to be talking about a particular meme. Please put it down to the impersonal nature of communicating via this medium.