RSS Needs Fixing

There are two big problems with RSS that aren't going away and will have to be fixed to avoid a train wreck, given the way this thing is taking off. They are, first, what can go in a <description>, and second, how relative URIs are handled. (Warning: yet another incestuous self-referential post by a blogger about blogging, of interest only to syndication geeks.) (Substantially updated 11AM Pacific time)

Breakage · This essay tries to illustrate the problems it talks about. In its RSS description, it tries to mention the <description> tag, with the angle brackets visible, and it contains a relative reference to another ongoing article; one or both of these may have failed in your aggregator.

Fixing <Description> · What provoked this was a complaint from fellow-TAGger Norm Walsh that he could see the HTML markup in the ongoing feed in his (Linux-based) RSS aggregator. Well, yeah, all the HTML is escaped because I went and looked at other people's feeds (Udell and Pilgrim, I believe) and copied the way they did it; that's how the Web's supposed to work.

After Norm's complaint, I decided to (sigh) RTFM. The RSS2 spec, marvel of informality that it is, notes in passing that “(entity-encoded HTML is allowed)” with no words about what this might mean or how such HTML might be interpreted. This underspecification (inherited from many previous versions of RSS) leads to really stupid behavior even in good software:

The notion that you unescape markup and then act on it is just architecturally hosed. If I write &lt; rather than < in some text, I'm saying “please ignore the semantics of this character!” That's what escaping is for.
It essentially prevents an RSS feed from ever mentioning a tag by name, i.e. there's no good way for me to say: “this is about the <description> tag.” Now it turns out that the ongoing generator is anal enough to do double-escaping, which worked in at least one RSS reader, but there's a word for this: stupid.
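
To make the double-escaping point concrete, here is a minimal sketch using Python's standard html module. It assumes an aggregator that first lets the XML parser undo one layer of escaping and then renders the result as HTML; the variable names are made up for illustration and this is not the actual ongoing generator.

```python
from html import escape, unescape

# What the author wants a reader to see: a literal mention of a tag.
intended_text = "this is about the <description> tag"

# Escape once so the text is valid HTML content, then a second time
# so that HTML can be carried inside the XML <description> element.
as_html = escape(intended_text, quote=False)
double_escaped = escape(as_html, quote=False)

# Reading the feed with an XML parser removes one layer of escaping.
after_xml_parse = unescape(double_escaped)

# An aggregator that then renders the text as HTML removes the second
# layer, so the reader sees the tag name with its angle brackets.
rendered = unescape(after_xml_parse)

# With only one layer of escaping, the XML parser already hands over raw
# angle brackets, and an HTML renderer would treat them as live markup.
single_escaped = escape(intended_text, quote=False)
after_xml_parse_single = unescape(single_escaped)

print(double_escaped)
print(rendered)
```

The catch is that the double-escaped form only survives in readers that unescape exactly twice; a reader that treats <description> as plain text would show the reader a stray &lt;description&gt;.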

These days, the preferred method for dealing with this seems to be an <html:body> element, in which markup need not be escaped. This seems to work, but first, I don't see why RSS should make me do this. Second, it seems like I'm lying: the text in the RSS entry isn't the body of the ongoing essay, it's what <description> seems to be designed for. (Since many ongoing pieces are over a thousand words and studded with pictures, there's no way I'm putting the whole thing in the RSS for every RSS scraper to grab whether or not the user is interested.)

I'm not 100% sure what the right solution is, but either <description> should be totally plain text - no HTML markup - or it should allow well-formed HTML markup; in which case it would be OK for aggregators either to act on or ignore it.
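
As a rough sketch of the second option, suppose <description> were allowed to carry well-formed, unescaped markup. That is an assumption for illustration, not what the RSS2 spec says. An aggregator could then either ignore the markup and show plain text, or act on it, for example by pulling out the links. The function names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical item whose description holds well-formed markup.
ITEM_XML = (
    '<item><description>My <a href="/113">note yesterday</a> '
    "about RSS was wrong.</description></item>"
)


def description_as_plain_text(item_source):
    """Ignore the markup: keep only the character data."""
    description = ET.fromstring(item_source).find("description")
    return "".join(description.itertext())


def description_links(item_source):
    """Act on the markup: collect the href of every <a> element."""
    description = ET.fromstring(item_source).find("description")
    return [anchor.get("href") for anchor in description.iter("a")]


print(description_as_plain_text(ITEM_XML))
print(description_links(ITEM_XML))
```

Either behavior is safe here because the markup is well-formed, so the parser, not a guess about escaping, decides what is a tag and what is text.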

Relative URI References · If I, in an ongoing essay, want to refer to another ongoing essay, the natural, correct, robust, flexible, concise way to do this is with a relative reference. So I encode a link to my Colophon as <a href="/ongoing/misc/Colophon">Colophon</a>, and the browsers know how to deal with this and everything just works. Also, it works identically both in production and on my staging site, which isn't at www.tbray.org. Of course, if I want to copy the first paragraph of my essay into my RSS feed, apparently I have to parse the hyperlink and make the reference absolute, which as a side-effect makes it less portable, more fragile, and longer. There's a word for this: wrong.

When you have a chunk of markup that looks like this:

<item>
  <title>Wrong</title>
  <link>http://example.com/114</link>
  <description>My <a href="/113">note yesterday</a> about RSS was wrong.</description>
</item>

Then the only sane interpretation of /113 is as http://example.com/113. RSS needs to say this, and software needs to implement it.
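
Here is a minimal sketch of that interpretation, using urljoin from Python's standard library. Treating the item's <link> as the base URI is the assumption this essay argues for, not something the RSS spec currently states, and resolve_item_links is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The item from the example above, with the markup left unescaped.
ITEM = """<item>
  <title>Wrong</title>
  <link>http://example.com/114</link>
  <description>My <a href="/113">note yesterday</a> about RSS was wrong.</description>
</item>"""


def resolve_item_links(item_source):
    """Resolve every href in the item against the item's own <link>."""
    item = ET.fromstring(item_source)
    # Assumption: the item's <link> serves as the base URI.
    base = item.findtext("link").strip()
    return [urljoin(base, anchor.get("href")) for anchor in item.iter("a")]


print(resolve_item_links(ITEM))
```

The generator gets to keep its short, portable relative references, and the work of making them absolute moves to the one place that knows where the feed came from.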

Not A Toy · Because, boys and girls, RSS is no longer a science experiment, it's becoming an important part of the infrastructure, which means that a lot of programmers are going to get the assignment of generating and parsing it, and they need better instructions.

April 22, 2003

By Tim Bray.
