Dave Walker over at freeform goodness catches me with my XML pants, figuratively speaking, down. I wrote a piece about leaving the W3C TAG entitled (cleverly I thought) </TAG>. Unfortunately that < in the title caused all sorts of grief and breakage, both here at ongoing and downstream in the world of syndication and aggregation. I can fix my own problems, but it’s deeper downstream; long term, the answer is Atom. Herewith some thoughts on good programming practices and the larger problem. [Update: A couple of notes on the “href problem.”]

Local Repair · ongoing is written in XML and processed by an XML parser and a bunch of Perl code to produce both what you’re looking at and the RSS (and soon Atom) feeds. There’s a function called escape() that turns < into &lt;, & into &amp;, and so on. The problem was, as I was writing the software, I stuck an escape() call in whenever it seemed necessary, without thinking about the dataflow too much. Bad, bad Tim! So just now, when I read the freeform goodness essay and went to look for the breakage, the code was kind of ugly.

There was quite a bit of double-escaping going on, so what appeared as &lt; in the input ended up as &amp;lt; in the output. This was showing up as “&lt;” here at ongoing, but (maddeningly) as < in the RSS aggregator display. Please, we need Atom.
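The failure is easy to reproduce. ongoing's own escape() is Perl and isn't shown here, so this minimal sketch uses Python's standard html.escape as a stand-in; the variable names are made up for illustration.

```python
from html import escape, unescape

title = "</TAG>"

# One pass: what well-formed XML/HTML output needs.
once = escape(title, quote=False)
# A second pass somewhere downstream: the double-escaping bug.
twice = escape(once, quote=False)

print(once)   # &lt;/TAG&gt;
print(twice)  # &amp;lt;/TAG&amp;gt;

# A consumer that unescapes exactly once recovers the title from `once`,
# but is left showing entity text for `twice`.
rendered_once = unescape(once)
rendered_twice = unescape(twice)
print(rendered_once)   # </TAG>
print(rendered_twice)  # &lt;/TAG&gt;
```

Whether a reader sees the right thing then depends on how many times the consumer unescapes, which is exactly the variable nobody controls.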

By the way, at Antarctica we had quite a few similar problems, with things surprising us by turning up either unescaped or doubly-escaped.

Getting the Policy Right · I think that software designers have to look at their application dataflows and get the policy right. Here’s a picture:

[Figure: XML flows through a system.]

The policy ideally should be, I think, that all data in the Your Code block has to be known to be escaped or known to be unescaped. That is to say, you always do escaping on the data at the pointy end of the input arrows, or you never do it.

I think always-unescaped is a little better, since some of those output arrows might not be XML or HTML, but probably they all are; so always-escaped is certainly viable.
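Here is one way the always-unescaped policy might look in code. It's a sketch, not ongoing's implementation: the function names are hypothetical, and in practice an XML parser on the input side usually does the unescaping for you.

```python
from html import escape, unescape

def read_input(raw_text: str) -> str:
    """Input arrow: text arrives escaped, so unescape it once, here and only here."""
    return unescape(raw_text)

def write_xml_text(text: str) -> str:
    """Output arrow for XML/HTML text content: escape exactly once."""
    return escape(text, quote=False)

def write_plain(text: str) -> str:
    """Output arrow for a non-XML destination: nothing to escape."""
    return text

# Everything between the arrows is known to be unescaped.
internal = read_input("&lt;/TAG&gt; &amp; other titles")

print(internal)                  # </TAG> & other titles
print(write_xml_text(internal))  # &lt;/TAG&gt; &amp; other titles
print(write_plain(internal))     # </TAG> & other titles
```

The point of the design is that no code in the middle ever has to ask whether a string has already been escaped.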

Now, in a small, constrained publishing system like here at ongoing, this is achievable. It’s tougher in a big professional multi-user system where there are a lot of input arrows and you don’t control them. Which gives us a third choice: accompany every piece of text in your program with a little boolean metadatum saying whether it’s escaped or not. Quite a bit more work; but maybe the only choice for a big-league system.
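The boolean-metadatum idea could be sketched like this. The Text class and its method names are assumptions invented for the example; the only real API used is Python's html.escape and html.unescape.

```python
from dataclasses import dataclass
from html import escape, unescape

@dataclass(frozen=True)
class Text:
    """A string that carries its own escaping state (hypothetical helper)."""
    value: str
    escaped: bool

    def as_escaped(self) -> "Text":
        # Escape only if it hasn't been done already.
        if self.escaped:
            return self
        return Text(escape(self.value, quote=True), True)

    def as_unescaped(self) -> "Text":
        # Unescape only if the text is currently escaped.
        if not self.escaped:
            return self
        return Text(unescape(self.value), False)

# Two input arrows that disagree about escaping.
from_feed = Text("&lt;/TAG&gt;", escaped=True)
from_form = Text("</TAG>", escaped=False)

# Both converge on the same output, with no double-escaping.
print(from_feed.as_escaped().value)  # &lt;/TAG&gt;
print(from_form.as_escaped().value)  # &lt;/TAG&gt;
```

The cost is that every function touching text has to accept and return the wrapper rather than a bare string, which is the "quite a bit more work" part.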

For what it’s worth, I’m now reworking ongoing to make the internal data always-unescaped.

Later: Couldn’t quite manage that, since the first paragraph is stashed away in a persistent database and contains markup, so it is a mixture of unescaped real markup and escaped magic characters in content. So it’s never ever simple.

The Output · Once your internal text is in a deterministic state, you have a chance of generating the correct output. For XML and HTML, you single-escape, and don’t forget to escape quotation marks as well as < and & or you’re going to get attribute-value breakage.
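To see the attribute-value breakage, here is a small sketch using Python's standard library; the sample title and the element are made up for illustration.

```python
from html import escape
import xml.etree.ElementTree as ET

title = 'He said "</TAG>" & left'

# Forgetting the quotation marks: the attribute value ends early.
broken = f'<a title="{escape(title, quote=False)}">link</a>'
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# Escaping <, & and quotation marks: the attribute survives intact.
fixed = f'<a title="{escape(title, quote=True)}">{escape(title, quote=False)}</a>'
element = ET.fromstring(fixed)

print(broken_parses)         # False
print(element.get("title"))  # He said "</TAG>" & left
print(element.text)          # He said "</TAG>" & left
```

Note that text content only needs < and & handled, while attribute values also need the quote character that delimits them.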

For Atom, you single-escape and set the mode= attribute and you’re good to go.

For RSS it’s tougher; lots of people single-escape their HTML and assume it will get executed; which means that you have to double-escape any markup that you don’t want executed. But even so, implementations vary.
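A sketch of the RSS case, assuming an aggregator that XML-parses the description and then renders the result as HTML (which, as noted, not all of them do). The function name is hypothetical.

```python
from html import escape

def rss_description(html_fragment: str) -> str:
    """Escape an HTML fragment once so it can sit inside an RSS <description>."""
    return escape(html_fragment, quote=False)

# Text that must be displayed, not executed: escape it for HTML first...
literal = escape("</TAG>", quote=False)
# ...then wrap it in markup that IS meant to be executed.
fragment = f"My piece was titled <em>{literal}</em>"
# The whole fragment is escaped once more on the way into the feed.
wire = rss_description(fragment)

print(fragment)  # My piece was titled <em>&lt;/TAG&gt;</em>
print(wire)      # My piece was titled &lt;em&gt;&amp;lt;/TAG&amp;gt;&lt;/em&gt;
```

So the em tags end up single-escaped and the literal title double-escaped, and the output is only correct if the aggregator peels off exactly one layer before rendering.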

And there’s still one nasty sharp-fanged viper lurking in the bushes...

The “href” Problem · There are lots of URIs out there that look like this:

http://example.com/select?y=1999&m=Jan

Well, that’s an &, right, and everyone knows that those have to be escaped in well-formed XML, right? So it should look like this in your HTML:

http://example.com/select?y=1999&amp;m=Jan

Well, yes, but... some browsers have been known to react poorly to this, probably depending on whether you serve your stuff as text/html or application/xhtml+xml. Hrumph. When I figure out the right solution to that one, I’ll let you know.

Update: Julian Reschke writes to tell me that whereas he’s heard lots of talk about this “problem,” he’s never heard of anyone getting bitten. Come to think of it, neither have I. And Nik Clayton writes to point out that you can usually (but not always) use ; instead of & for this kind of URI.
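For the well-formed XML case, a quick round trip shows that the escaped form carries the original URI. This is only a sketch with Python's standard library; it says nothing about how any particular browser treats text/html, and the semicolon variant assumes a server that accepts ; as a separator.

```python
from html import escape
import xml.etree.ElementTree as ET

uri = "http://example.com/select?y=1999&m=Jan"

# Escape the URI once when writing it into the attribute.
link = f'<a href="{escape(uri, quote=True)}">January 1999</a>'
# An XML parser hands back the original, unescaped URI.
parsed_href = ET.fromstring(link).get("href")

print(link)         # <a href="http://example.com/select?y=1999&amp;m=Jan">January 1999</a>
print(parsed_href)  # http://example.com/select?y=1999&m=Jan

# The semicolon form has nothing in it that needs escaping at all.
alt_uri = uri.replace("&", ";")
print(escape(alt_uri, quote=True) == alt_uri)  # True
```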

Conclusion · OK, I think it’s now right. NetNewsWire is obstinately refusing to show the last character of that </TAG> article, even though it’s irritatingly double-escaped. Brent is no fool. I rest my case; this really needs fixing in the spec.



March 16, 2004
