One of the more interesting pieces of the discussion on the Buddha-nature of log entries (blog entries, whatever) that was launched by Sam Ruby is the notion of how to identify one. Various versions of RSS have struggled with this one, and I’ve finally developed an opinion: URI and Version. (Update: Maybe #fragments are a bad idea.)

URI · A log entry’s primary identifier should be a URI, just because this is about the Web and if you have a URI you’re on it, otherwise not. This is not quite the end of the story, though.

By the way, the link in the previous paragraph is to “RFC2396bis,” the in-progress redraft, which is in my opinion much cleaner and easier to understand than the original RFC2396 (but then I wrote part of it, so I’m biased).

Fragments? · A URI can come equipped with a “fragment identifier”: that’s the bit after the #-mark. At some level it’s kind of a second-rate part of the URI, because it isn’t sent to the server; the client applies it only after it has dereferenced the rest of the URI.

Second-rate or not, URIs with fragments are commonly used to refer to individual weblog postings; here’s one from Jon Udell: http://radio.weblogs.com/0100887/2002/03/08.html#a121. It seems that URIs-with-fragments are just fine here.
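To make the split concrete, here is a minimal sketch using Python's standard `urllib.parse.urldefrag` on the entry URI quoted above. It only shows which part is dereferenced and which part is left for the client to apply afterwards; it is an illustration, not a claim about how any particular aggregator works.

```python
from urllib.parse import urldefrag

# The entry URI quoted above, from Jon Udell's weblog.
entry_uri = "http://radio.weblogs.com/0100887/2002/03/08.html#a121"

# urldefrag separates the part that gets dereferenced from the fragment,
# which the client applies to whatever representation comes back.
base, fragment = urldefrag(entry_uri)

print(base)      # http://radio.weblogs.com/0100887/2002/03/08.html
print(fragment)  # a121
```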

UPDATE: Peter Stuer, whose email address is at the Vrije Universiteit in Brussels, writes to point out that in the very common case of a URI with a #fragment pointing into an HTML page, the #fragment only shows you where the entry starts, not where it ends. In his words:

Suppose I write an aggregator that fetches the full content of an entry not from having it somewhere in the RSS itself but from dereferencing the URI (a much sounder approach IMHO). If fragment identifiers are OK then, even apart from the wasted bandwidth of having to pull in the complete base URI doc, such an aggregator would have to rely on parsing and heuristics to provide a decent presentation.

Hmm. I think he has a point.
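A rough sketch of the kind of heuristic Peter is describing, using Python's standard `html.parser`. The rule it applies (an entry runs from its named anchor to the next named anchor) is purely an assumption for illustration, as are the `EntryExtractor` and `extract_entry` names and the sample page with its `a120` and `a122` anchors; real pages need not be laid out this way, which is exactly the problem.

```python
from html.parser import HTMLParser


class EntryExtractor(HTMLParser):
    """Collect text from the anchor named `fragment` up to the next named anchor."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.inside = False   # True while we are between the two anchors
        self.done = False     # True once the guessed end has been reached
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name") or attrs.get("id")
        if tag != "a" or name is None:
            return
        if name == self.fragment and not self.done:
            # The fragment tells us where the entry starts...
            self.inside = True
        elif self.inside:
            # ...but the end is a guess: we stop at the next named anchor.
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract_entry(html, fragment):
    parser = EntryExtractor(fragment)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.chunks).split())


# Hypothetical page: three entries, each introduced by a named anchor.
page = """
<html><body>
<a name="a120"></a><p>Earlier entry.</p>
<a name="a121"></a><p>The entry we want.</p>
<a name="a122"></a><p>Later entry.</p>
</body></html>
"""

print(extract_entry(page, "a121"))  # The entry we want.
```

Note that for the last entry on a page this heuristic would sweep in everything that follows it, footer included, because nothing in the URI says where the entry stops.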

HTTP? URN? · The part of the URI before the first colon (:) character is called the scheme. A very common scheme is http; let’s call URIs in that scheme URLs (everyone knows URL stands for Universal Republic of Love).

URLs are special because they don’t just identify something, they assert that if you ask using the HTTP protocol, you might get a useful representation over the wire. That is to say, the name does double duty as a locator.

This double duty has bothered many over the years. One of the consequences is URNs, which are URIs in the urn: scheme. URNs are not necessarily useful for retrieving things over the network, but they are in theory designed to be simpler (because they’re not trying to be both names and locators) and more long-lived (because they don’t depend on a website).
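A small sketch of the distinction, using Python's standard `urllib.parse.urlsplit` to pull out the scheme. The helper names `scheme_of` and `doubles_as_locator` are made up for this example, the URN is just an illustrative ISBN-style name, and the test for "locator" follows the simplification in the text (http only).

```python
from urllib.parse import urlsplit


def scheme_of(uri):
    """Return the part of the URI before the first colon, lowercased."""
    return urlsplit(uri).scheme


def doubles_as_locator(uri):
    """True when the name also says how to fetch the thing (http, per the text)."""
    return scheme_of(uri) == "http"


url = "http://radio.weblogs.com/0100887/2002/03/08.html#a121"
urn = "urn:isbn:0451450523"  # illustrative URN, not from the original post

print(scheme_of(url), doubles_as_locator(url))  # http True
print(scheme_of(urn), doubles_as_locator(urn))  # urn False
```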

In the world of weblog entries, I can’t imagine why you’d use an identifier that doesn’t double as a locator, so I don’t think URNs are particularly relevant.

Versioning · One of the really nice things about weblogs is that you can post an entry, then when your audience, who in aggregate are always much smarter than you, write to tell you that you screwed up, you can fix things. Newspapers and magazines just can’t do that.

When you do that, I think you shouldn’t change the URI, because it’s really the same entry; it’s just changed. It is, in fact, a new version. And when you have a new version you probably want to signal this fact in your syndication feed and on the front page of your site and so on.

I think the way to identify the new version is with a version identifier. I think it doesn’t matter in the slightest what the version identifier is, just that it changes when the version does.

Anyone who’s spent time in the publishing-technology trenches knows that versioning is an immense, messy can of worms where everyone’s application works and thinks differently.

So I think that in the ideal world, log entries are identified by URI and include a version identifier, which is a character string that can contain anything you want. Obvious candidates to use for versioning would be sequence numbers and datestamps, but it doesn’t really matter.
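Here is a minimal sketch of how a consumer might use the URI-plus-version pair. Everything in it is hypothetical (the `EntryId` type, the `classify` helper, the example.org URI); the one point it tries to capture from the text is that the version string is opaque, so it is only ever compared for equality, never parsed or ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryId:
    uri: str      # identifies the entry; stays the same across edits
    version: str  # opaque string; only required to change when the entry does


def classify(seen, entry):
    """Compare an entry against what we saw last time, then remember it.

    `seen` maps each URI to the last version string observed for it.
    """
    previous = seen.get(entry.uri)
    seen[entry.uri] = entry.version
    if previous is None:
        return "new"
    if previous != entry.version:
        return "updated"
    return "unchanged"


seen = {}
uri = "http://example.org/2003/06/18/entry"  # hypothetical entry URI

print(classify(seen, EntryId(uri, "1")))                 # new
print(classify(seen, EntryId(uri, "1")))                 # unchanged
print(classify(seen, EntryId(uri, "2003-06-19T12:00")))  # updated
```

The third call switches from a sequence number to a datestamp on purpose: since the consumer only tests for a change, either style of version string works.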

I almost had myself convinced that the last-updated datestamp could double as the version number, but that wouldn’t work for ongoing, which probably means it wouldn’t work for lots of other people too. The reason is that when I change the software that writes ongoing, it usually changes the presentation (not the content) of all the essays, and they get a new “updated” date, but they shouldn’t get a new version number.

(After thinking about this through lunch.) If it were decreed from on high that the last-updated date should double as the version identifier, ongoing could live with that. And one less piece of metadata to maintain is always good. But at the moment I think keeping them separate is cleaner.
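One way to keep the two separate, sketched here purely as an assumption and not as a description of how ongoing actually works, is to derive the version from the entry's source text rather than from its rendered form or its timestamp. The `content_version` helper and the sample text are invented for the example.

```python
import hashlib


def content_version(source_text):
    """Derive an opaque version string from the entry's content alone."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return digest[:12]  # shortened for readability; any stable string would do


source = "Log entries should be identified by URI."

# Re-rendering with new software changes the presentation (and would bump
# the "updated" date), but the source text, and so the version, is the same.
before_rerender = content_version(source)
after_rerender = content_version(source)

# An actual edit to the content yields a different version.
after_edit = content_version(source + " And they should carry a version.")

print(before_rerender == after_rerender)  # True
print(before_rerender == after_edit)      # False
```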

Conclusion · Log entries should be identified by URI, which may include a fragment identifier, and that URI’s scheme should normally be http. Log entries should include a version stamp, which can be any old string of characters that changes from one version to the next.


June 18, 2003
