It’s amazing how issues float to the top of multiple minds independently. I’ve been spending a lot of time thinking about how to sanitize to-be-published data. Then Rob Sayre wrote Interoperability and XSS Mitigation; XSS stands for “cross-site scripting”, the main threat that you sanitize to avoid. Sam Ruby noticed and got active: Interoperability and XSS Mitigation announced the Sanitization rules wiki-space. Microsoft’s Joe Cheng is worrying, too.

mod_atom · As of now, mod_atom is, as a pure Atom Store, approaching 1.0 status. It’s interoperated with every credible client that’s tried. (Further evidence, were any needed, that the Atom protocol is Really Simple.) Except that it’s not finished, for one small and one large reason. The small reason is that it doesn’t yet generate HTML (but that’s not hard). The big reason is that it’s not safe; I can send it HTML loaded with horrible XSS exploits and it’ll stuff them into Web-space, ready to wreak havoc on the world.

Feedparser’s Whitelist Approach · What the Sanitization Wiki Page doesn’t spell out is that this logic, derived originally from Feedparser, is whitelist based. For HTML, it goes through the data, examines each element and attribute, and lets it survive if it appears on the “Approved Elements/Attributes” list.
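To make the whitelist idea concrete, here is a minimal sketch in Python using the standard-library html.parser module. It is not Feedparser's code: the class name, the function name, and the tiny whitelists are all hypothetical, and the real approved lists are much longer. URL-bearing attributes such as href are deliberately left off this toy list, since they would need an extra check on the URL scheme.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny whitelists (assumptions for illustration).
ALLOWED_ELEMENTS = {"p", "b", "i", "em", "strong", "ul", "ol", "li"}
ALLOWED_ATTRIBUTES = {"title", "lang"}
# Elements whose text content is discarded along with the tags themselves.
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only approved elements and attributes; keep ordinary text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._out = []
        self._skipping = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skipping += 1
            return
        if self._skipping or tag not in ALLOWED_ELEMENTS:
            return  # unapproved tag vanishes; its text content survives
        parts = [tag]
        for name, value in attrs:
            if name in ALLOWED_ATTRIBUTES:
                parts.append('%s="%s"' % (name, escape(value or "", quote=True)))
        self._out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self._skipping:
                self._skipping -= 1
            return
        if not self._skipping and tag in ALLOWED_ELEMENTS:
            self._out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self._skipping:
            # Text was unescaped by the parser, so escape it again on output.
            self._out.append(escape(data, quote=False))

    def result(self):
        return "".join(self._out)


def sanitize_html(text):
    parser = WhitelistSanitizer()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return parser.result()


dirty = '<p onclick="evil()" title="hi">Hello <script>alert(1)</script><b>world</b></p>'
print(sanitize_html(dirty))
```

The point of the design is that anything not explicitly approved disappears by default, so a new exploit vector has to get onto the list before it can get onto the page.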

The same approach is used with MathML and SVG markup; CSS is sanitized by removing the url() pattern and anything that looks like it might be hiding something bad.

I haven’t seen any pushback against the basic approach, which makes me happy because it seems very sound to me.

At Microsoft · Check out Joe Cheng’s AtomPub interop event notes. He writes “I’m thinking about implementing a web app that takes any AtomPub endpoint and makes a blog out of it, although I would love it if someone beat me to it.” So he’ll be looking at the same problems.

During the interop we were talking about sanitizing the payload, and I described the whitelist approach. Joe pointed out that simply removing style, both element and attribute, wouldn’t work for his users, because authoring tools use this to produce nice visuals that there’s no other obvious way to get.

So I guess that you could look inside style elements and attributes and do your CSS-cleanup there in situ. Hmm.
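One way the in-situ cleanup of a style attribute might look, as a rough Python sketch: split the value into declarations, keep only whitelisted properties, and reject any value containing characters outside a conservative set. That throws out url(), expression(), backslash escapes and comments in one go. The property list, the value pattern, and the function name are assumptions made for this example, not anything taken from the wiki page.

```python
import re

# Hypothetical whitelist of harmless presentational properties.
ALLOWED_PROPERTIES = {
    "color", "background-color", "font-weight", "font-style",
    "text-align", "text-decoration",
}
# Conservative value pattern: letters, digits, spaces and a little punctuation.
# Parentheses, quotes, slashes and backslashes are excluded, which rules out
# url(...), expression(...), comments and escape sequences. It also rejects
# harmless values such as rgb(1, 2, 3); that is the price of being strict.
SAFE_VALUE = re.compile(r"[A-Za-z0-9 #%.,\-]+")


def sanitize_style(style):
    """Return the style string with only approved declarations kept."""
    kept = []
    for declaration in style.split(";"):
        name, colon, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not colon or name not in ALLOWED_PROPERTIES:
            continue
        if not SAFE_VALUE.fullmatch(value):
            continue
        kept.append(name + ": " + value)
    return "; ".join(kept)


print(sanitize_style(
    "color: red; background-color: url(javascript:alert(1)); "
    "width: expression(alert(1)); font-weight: bold"
))
```

Authoring tools that rely on richer CSS would need a longer property list and a more forgiving value grammar, which is exactly where the hard judgment calls start.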

Where to Sanitize? · mod_atom actually has some cleanup code right now. If you post an Atom entry with text marked type="xhtml", it applies a whitelist algorithm much as specified above. Which is easy, because the Apache server includes an XML parser that builds a DOM for you, and it’s straightforward to run around it checking against the whitelist. The still-unsolved problem is type="html", because that requires parsing the HTML. Blecch.
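For the type="xhtml" case, the walk over a parsed tree can be sketched as follows. This is Python with xml.etree.ElementTree, not the C that mod_atom is written in and not its actual code; the whitelists and function names are hypothetical. It assumes the root passed in is the XHTML div that wraps the content, and that unapproved elements are dropped together with their subtrees while the text that follows them is kept.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def qname(local):
    """Build the '{namespace}local' form ElementTree uses for tag names."""
    return "{%s}%s" % (XHTML, local)


# Hypothetical whitelists for illustration only.
ALLOWED_ELEMENTS = {qname(n) for n in ("div", "p", "b", "i", "em", "strong", "ul", "ol", "li")}
ALLOWED_ATTRIBUTES = {"title", "lang"}


def scrub(element):
    """Strip unapproved attributes and child elements, in place."""
    for name in list(element.attrib):
        if name not in ALLOWED_ATTRIBUTES:
            del element.attrib[name]

    previous = None  # last surviving sibling, which inherits orphaned tail text
    for child in list(element):
        if child.tag in ALLOWED_ELEMENTS:
            scrub(child)
            previous = child
            continue
        # Drop the child and its subtree, but keep the text that followed it.
        tail = child.tail or ""
        if previous is None:
            element.text = (element.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        element.remove(child)


doc = ('<div xmlns="http://www.w3.org/1999/xhtml">'
       '<p onclick="x()" title="t">Hi <script>bad()</script>there</p></div>')
root = ET.fromstring(doc)
scrub(root)
print("".join(root.itertext()))
```

A production version would also want a parser hardened against hostile XML input, since ElementTree itself is not designed for untrusted documents.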

Right now, the mod_atom cleanup happens as the data comes in, so the version in the Atompub Collection feeds is sanitized. I’m beginning to think that’s wrong, that the Atom Store part of mod_atom should preserve the data as-is, as much as possible; presumably, those feeds and entries will be access-controlled, not world-readable. Then there should be a separate set of feeds offered to the world for subscription purposes. They, and the HTML pages, exist only in the sanitized state.
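The store-raw, sanitize-on-the-way-out idea could look something like this Python sketch. Every name here is hypothetical, and html.escape merely stands in for a real whitelist sanitizer so that the example is self-contained.

```python
from html import escape


class EntryStore:
    """Keep posted content untouched; sanitize only for the public view."""

    def __init__(self, sanitize):
        self._sanitize = sanitize  # any callable from str to str
        self._raw = {}

    def post(self, entry_id, content):
        # The Atom Store side preserves the data as-is.
        self._raw[entry_id] = content

    def collection_entry(self, entry_id):
        # Intended for the access-controlled Collection feed: raw data.
        return self._raw[entry_id]

    def public_entry(self, entry_id):
        # Intended for world-readable feeds and HTML pages: sanitized only.
        return self._sanitize(self._raw[entry_id])


store = EntryStore(sanitize=escape)  # escape is a stand-in sanitizer
store.post("e1", "<script>x</script>")
print(store.collection_entry("e1"))
print(store.public_entry("e1"))
```

One consequence of this split is that improving the sanitizer later fixes every already-stored entry, because nothing lossy was done at posting time.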

But at this stage we’re just making this up as we go along. It’s really nice, though, that everyone seems to have realized that the problem is real and important; and if we can develop a set of Best Common Practices, that’d be good for everyone.



From: Zak Greant (Aug 09 2007, at 18:17)

I bet that you aren't the only Vancouverite who is currently thinking about sanitation.

(Non-Vancouverites: Garbage pickup in Vancouver has been suspended for a few weeks now, as members of the CUPE union (which includes sanitation workers) have been on strike since July 20th. The garbage is definitely starting to pile up.)


From: Mark (Aug 09 2007, at 18:49)

<> is the one to beat.

It passes these tests: <>


From: Mark (Aug 09 2007, at 20:20)

It's ironic that a Microsoftie is complaining about sanitizing CSS styles, since it was Microsoft who polluted CSS with executable JavaScript in the first place.


From: Tim Bray (Aug 09 2007, at 22:02)

Mark, the problem isn't finding good HTML cleanup code... the problem is finding good cleanup code *in C*. No... the problem is finding good *parsing* code in C. I can do the cleanup if it's parsed.



From: Aristotle Pagaltzis (Aug 10 2007, at 02:51)

Is there *still* no one to have bribed John Cowan into porting TagSoup to C?


From: JD (Aug 10 2007, at 06:01)

When I wrote Eddie I borrowed the whitelist approach from Feedparser. I believe that I could certainly improve on the sanitization, as it strips out style tags and it would be useful to have a whitelist of object urls. Empty entries that once just contained a YouTube video aren't ideal.


From: Mark (Aug 10 2007, at 07:12)

I assume you have a good reason why you're not using Tidy?*checkout*/tidy/tidy/src/parser.c?revision=1.184


From: Ian Bicking (Aug 10 2007, at 10:18)

I've had pretty good experience with libxml2's HTML parser, which could be the basis for cleanup code written in C. Once it's parsed it's not too terribly hard to clean, after all.


From: John Cowan (Aug 10 2007, at 22:32)

Aristotle: Nothing could persuade me to rewrite TagSoup in C or C++. Of course, the code is Open Source, so someone else would be free to do so, and I would provide encouragement and support.


From: Seth A. Roby (Aug 11 2007, at 14:05)

Given that APP is a standard, wouldn't it be possible to write a APP Server whose whole purpose was to pick up Atom feeds from some other server, sanitize them, and distribute? That would allow any APP implementation to benefit from them, and make one more small piece, loosely joined.

You could even have the backing server only accept feed requests if the REFERRER was the sanitation server.


From: Sam Ruby (Aug 11 2007, at 18:14)

> wouldn't it be possible to write a APP Server whose whole purpose was to pick up Atom feeds from some other server, sanitize them, and distribute?

Venus has a command line program (test/ which can do this for any feed. It would be trivial to create a CGI or FastCGI wrapper for this function.


From: Aristotle Pagaltzis (Aug 11 2007, at 19:02)

Seth: I am not sure why you’re putting Atompub in the picture. What you are talking about is nothing more than basically FeedBurner; it does not involve the publishing protocol at all, just the syndication format.


August 09, 2007