Mark Pilgrim has echoed Aaron Swartz’s earlier call for, in general, forgiving parsing of Internet content and, in particular, the application of this “liberal” policy to the parsing of subscription feeds, and in particular particular, to the parsing of the Atom format. Others including Dave Winer have weighed in on the other side. Parts of this no-exceptions message are mistaken and malformed, but I’ll parse it forgivingly and address some interesting related issues.

By the way, I’ve already written on this issue once before; those who care might want to revisit that; it also has an amusing side-trip into Greek legal history.

Postel’s Law Has Exceptions · Mark and Aaron’s rallying cry is “Postel’s Law Has No Exceptions!”, where the law is something like “Be conservative in what you transmit and liberal in what you accept.” Probably the most overwhelmingly successful application of Postel’s Law was, of course, the triumph of the Web, where forgiving parsing allowed more or less anybody to hand-author Web content, and most times more or less anybody would be able to read it. This was A Good Thing.

On the other hand, it’s painfully obvious that this law does have exceptions. Suppose I’m an equity-trade execution module receiving messages from traders, and I get:

<trade>
 <ticker>IBM</ticker>
 <amount>100

Then it is clearly not OK to guess that someone just forgot the </amount> and </trade> but didn’t also drop a trailing zero or two. A programmer in a position of responsibility who did this would be spanked and maybe fired. A manager who mandated or authorized such an implementation would be spanked, maybe fired, and maybe subject to legal action.

In fact, the correct action is to halt processing on this particular message and immediately raise an alarm to the system’s operators, as what we’ve just seen is evidence of severe breakage that probably needs urgent human attention. Raising the alarm with the originator of the broken message would be nice, but in a lot of systems wouldn’t be practical; addressing severe breakage is done better by humans than software, anyhow.
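
As a rough illustration of the halt-and-alarm behavior described above, here is a minimal Python sketch built on the standard library’s strict XML parser. The names parse_trade, alert_operators, and MalformedMessage are invented for this example, and the “alarm” is just a list; a real system would page a human.

```python
import xml.etree.ElementTree as ET


class MalformedMessage(Exception):
    """Raised when an incoming message is not well-formed XML."""


def alert_operators(reason: str, alerts: list) -> None:
    # Hypothetical stand-in for paging a human; here it only records the reason.
    alerts.append(reason)


def parse_trade(raw: str, alerts: list) -> dict:
    """Parse a trade message strictly; never guess at missing markup."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as err:
        # Stop processing this message and tell the operators why.
        alert_operators(f"ill-formed trade message: {err}", alerts)
        raise MalformedMessage(str(err)) from err
    return {
        "ticker": root.findtext("ticker"),
        "amount": int(root.findtext("amount")),
    }


alerts = []
good = "<trade><ticker>IBM</ticker><amount>100</amount></trade>"
broken = "<trade>\n <ticker>IBM</ticker>\n <amount>100\n"

print(parse_trade(good, alerts))
try:
    parse_trade(broken, alerts)
except MalformedMessage:
    print("halted; alerts raised:", len(alerts))
```

The point of the design is that the truncated message never yields a trade at all: the only outputs are an exception and an alert.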

But that, you say, is different; we’re talking about newsfeed aggregation here, not life-or-death financial transactions; let’s be forgiving and allow a thousand newswire feed flowers to bloom, even if some of them are kind of malformed. I don’t buy that; I already have a newsfeed for my stock portfolio, and I really want one from my bank account and my credit card. Let’s suppose I’ve got that credit card feed going, and the aggregator sees:

<charge>
 <merchant>Ace Merchandising, Inc.</merchant>
 <charge units="GBP">100

Now, as the consumer, I’m pretty well in the same position as that equity-trading system we were talking about. I don’t want forgiveness. I don’t want liberalism. I want my aggregator to put up a flashing red flag saying YOUR CREDIT CARD FEED IS BROKEN!

By the way, it doesn’t make any difference whether the ill-formedness is grossly missing tags, as above, or a single unescaped &; in these kinds of apps, if it isn’t XML, that is evidence of serious breakage.

The conclusion is obvious: Postel’s law clearly does have exceptions. Since Mark and Aaron are smart people, we will forgive their rhetorical flourishes and address the really interesting question, which is: how forgiving should we be in parsing syndication feeds?

The Case of RSS · Mark maintains an ultra-liberal feed parser, seems very proud of it, and I’m prepared to believe that it’s excellent. Given that a culture of permissiveness has grown up around the many flavors of RSS, I’d absolutely use Mark’s parser in anything advertising itself as an RSS reader.

I gather that while the evidence seems to be that most feeds are well-formed, there are quite a few, particularly among those that are screen-scraped from HTML, that aren’t.

The Case of Atom · An Atom feed is going to be defined as an XML document, which means that if it’s not well-formed then it’s not Atom. All it needs is for one (I repeat, one) popular newsreader with a large installed base to enforce this policy (stop parsing and display an error to the subscriber) to turn this from de jure to de facto reality. This works because Atom doesn’t have an installed base. Let’s name names; if any one of NetNewsWire or FeedDemon or Radio adopted this policy for Atom, it would be game over; people who’ve gotten used to these aggregators are not going to switch clients because some upstream feed producer is a bozo.

The Bozo Factor · There’s just no nice way to say this: Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool. Here are the rules:

  1. For the tags you write, make sure that begin-tags and end-tags match up, and all the attribute values are quoted.

  2. Make sure that you generate correct UTF-8 or UTF-16 text.

  3. Filter out characters that aren’t legal in XML. Don’t get fancy, just lose them.

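  4. Clean up any text you’re passing through by replacing < with &lt;, & with &amp;, > with &gt;, " with &quot;, and ' with &apos;. Do the & replacement first (or do them all in one pass) so that you don’t re-escape the entities you just wrote. This applies to attribute values and character data in elements.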

Note that this doesn’t require that the feed payload be well-formed; that’s what all the escaping is about.
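
The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names clean, element, and serialize are invented here, element and attribute names are assumed to be trusted strings that you wrote yourself, and the character filter follows my reading of the XML 1.0 Char production.

```python
import re
from xml.sax.saxutils import escape

# Rule 3: anything outside the XML 1.0 Char production gets dropped
# (most control characters, surrogates, U+FFFE and U+FFFF).
_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# escape() handles &, < and > itself (ampersand first); quotes are added here.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def clean(text: str) -> str:
    """Drop illegal characters, then escape markup characters (rules 3 and 4)."""
    return escape(_ILLEGAL.sub("", text), _QUOTES)


def element(name: str, text: str, **attrs: str) -> str:
    """Emit one element with matching tags and quoted attributes (rule 1).

    Assumption: `name` and the attribute names are trusted, legal XML names.
    """
    attr_text = "".join(f' {key}="{clean(value)}"' for key, value in attrs.items())
    return f"<{name}{attr_text}>{clean(text)}</{name}>"


def serialize(markup: str) -> bytes:
    """Encode the finished document as UTF-8 (rule 2)."""
    return markup.encode("utf-8")


# The payload is not well-formed markup, but the output still is.
print(element("summary", "<b>Sale & more"))
# prints: <summary>&lt;b&gt;Sale &amp; more</summary>
```

Funneling every piece of pass-through text through one function like clean is the whole trick; the bugs tend to come from the one string that got concatenated in without it.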

Maybe this is unkind and elitist of me, but I think that anyone who either can’t or won’t implement these measures is, as noted above, a bozo. Many people who are not bozos will have bugs in their code and drop intermittently into bozo mode (I do all the time), and if the Atom readers are unforgiving, guess what, they’ll find out and fix the problem. If not, then not.

Dave’s Right · Aggregator writers ought to be focusing on cool new features, not reverse bozo engineering.

PostScript · I just did the first proof on the first draft of this article. It had a mismatched tag and wasn’t well-formed. The publication script runs an XML parser over the draft and it told me the problem and I fixed it. It took less time than writing this postscript.
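
For the curious, a check of this kind takes very little code. The sketch below is not the actual publication script; it is a hypothetical Python stand-in (first_error is an invented name) that runs a strict parser over a draft and reports where the first problem is.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def first_error(draft: str) -> Optional[str]:
    """Return a description of the first well-formedness error, or None."""
    try:
        ET.fromstring(draft)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the parser.
        line, column = err.position
        return f"not well-formed at line {line}, column {column}"
    return None


draft = "<p>This draft has <em>a mismatched tag.</p>"
problem = first_error(draft)
print(problem if problem else "draft is well-formed")
```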

PPS: Putting My Money Where My Mouth Is · If you’re programming in .NET, there’s a decent-looking XmlWriter class. There’s software available if you’re in DOM4J or BioJava mode too, but a couple of minutes with Google don’t seem to turn up generic low-level Java or C-language equivalents. If I just missed ’em, send me email and I’ll publicize them.

Otherwise, in the interests of Doing the Right Thing for Atom, I’ll make an offer. I have some coding time in February, and I hereby offer to create free and open-source XML Writing packages for C, Java, and Perl, and to host them indefinitely. They’ll be quite efficient and fanatically attentive to correctness. In fact, if I do this, after a short shakedown period, I’ll offer a $500 reward for each significant bug reported that would allow someone to use the API as documented but produce non-well-formed output.

If enough people pipe up and say they’d use such a facility, I’ll go ahead and do it.


January 11, 2004