It’s amazing how issues float to the top of multiple minds independently. I’ve been spending a lot of time thinking about how to sanitize to-be-published data. Then Rob Sayre wrote Interoperability and XSS Mitigation; XSS stands for “cross-site scripting”, the main threat that you sanitize to avoid. Sam Ruby noticed and got active: Interoperability and XSS Mitigation announced the Sanitization rules wiki-space. Microsoft’s Joe Cheng is worrying, too.
mod_atom · As of now, mod_atom is, as a pure Atom Store, approaching 1.0 status. It’s interoperated with every credible client that’s tried. (Further evidence, were any needed, that the Atom protocol is Really Simple.) Except that it’s not finished, for one small and one large reason. The small reason is that it doesn’t yet generate HTML (but that’s not hard). The big reason is that it’s not safe; I can send it HTML loaded with horrible XSS exploits and it’ll stuff them into Web-space, ready to wreak havoc on the world.
Feedparser’s Whitelist Approach · What the Sanitization Wiki Page doesn’t spell out is that this logic, derived originally from Feedparser, is whitelist based. For HTML, it goes through the data, examines each element and attribute, and lets it survive if it appears on the “Approved Elements/Attributes” list.
The same approach is used with MathML and SVG markup; CSS is sanitized by removing any url() pattern and anything that looks like it might be hiding something bad.
I haven’t seen any pushback against the basic approach, which makes me happy because it seems very sound to me.
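To make the whitelist idea concrete, here is a minimal Python sketch built on the standard library’s `html.parser`. It is not Feedparser’s code and not the wiki’s rule set: the names `WhitelistSanitizer` and `sanitize`, the tiny element and attribute lists, and the choice of allowed URL schemes are all assumptions made for illustration. Elements that aren’t approved lose their tags but keep their text, except for `script` and `style`, whose content is dropped as well.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny whitelists; a real list would be much longer.
ALLOWED_ELEMENTS = {"a", "b", "blockquote", "code", "em", "i", "li", "ol", "p", "pre", "strong", "ul"}
ALLOWED_ATTRIBUTES = {"href", "title"}
ALLOWED_SCHEMES = ("http://", "https://", "mailto:")  # assumption: relative URLs are dropped too
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_ELEMENTS:
            return  # tag is dropped; any text inside it still passes through
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRIBUTES:
                continue
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser hands over decoded text, so re-escape it on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

The point of the design is that nothing survives by default: an `onclick` attribute or a `javascript:` link disappears simply because nobody put it on a list, not because somebody remembered to ban it.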
At Microsoft · Check out Joe Cheng’s AtomPub interop event notes. He writes “I’m thinking about implementing a web app that takes any AtomPub endpoint and makes a blog out of it, although I would love it if someone beat me to it.” So he’ll be looking at the same problems.
During the interop we were talking about sanitizing the payload, and I described the whitelist approach. Joe pointed out that simply removing style, both element and attribute, wouldn’t work for his users, because authoring tools use it to produce nice visuals that there’s no other obvious way to get.
So I guess that you could look inside
style elements and
attributes and do your CSS-cleanup there in situ. Hmm.
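Here is one way that in-situ cleanup could look for a `style` attribute, as a Python sketch. Everything in it is an assumption made for illustration: the function name `sanitize_style`, the short property list, and the very conservative rule for values, which rejects anything containing parentheses, quotes, colons, slashes, or backslashes. That rule throws out `url()` and `expression()` but also harmless things like `rgb()`, so a real rule set would need to be more generous.

```python
import re

# Hypothetical whitelist of CSS properties that authoring tools commonly emit.
ALLOWED_PROPERTIES = {
    "background-color", "color", "font-style", "font-weight",
    "margin", "padding", "text-align", "text-decoration",
}

# Values may contain only letters, digits, '#', '%', '.', ',', '-' and whitespace.
# No parentheses means no url() or expression(); no backslash means no escapes.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9#%.,\s-]+$")


def sanitize_style(style):
    """Return the style string with only whitelisted, harmless-looking declarations."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not sep or name not in ALLOWED_PROPERTIES:
            continue  # not a declaration, or a property we don't know
        if not SAFE_VALUE.match(value):
            continue  # looks like it might be hiding something bad
        kept.append(f"{name}: {value}")
    return "; ".join(kept)
```

The same function could be applied to each declaration block inside a `style` element, though that also requires splitting out and vetting the selectors, which this sketch doesn’t attempt.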
Where to Sanitize? ·
mod_atom actually has some cleanup code right now. If you post an Atom
entry with text marked
type="xhtml", it applies a whitelist
algorithm much as specified above. Which is easy, because the Apache server
includes an XML parser that builds a DOM for you, and it’s straightforward to
run around it checking against the whitelist. The still-unsolved problem is
type="html", because that requires parsing the HTML. Blecch.
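For the `type="xhtml"` case, the walk over the parsed tree can be sketched in Python with `xml.etree.ElementTree` standing in for the parser that Apache provides. This is not mod_atom’s code; the function names `clean` and `is_allowed`, the whitelists, and the decision to remove a disallowed element together with its whole subtree (while keeping the text that follows it) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
# Hypothetical whitelists, far shorter than a real one would be.
ALLOWED_ELEMENTS = {"a", "blockquote", "code", "div", "em", "li", "ol", "p", "pre", "strong", "ul"}
ALLOWED_ATTRIBUTES = {"href", "title"}
ALLOWED_SCHEMES = ("http://", "https://", "mailto:")


def is_allowed(element):
    """True if the element is in the XHTML namespace and on the whitelist."""
    prefix = "{%s}" % XHTML
    return element.tag.startswith(prefix) and element.tag[len(prefix):] in ALLOWED_ELEMENTS


def clean(element):
    """Sanitize the element's attributes and descendants in place."""
    for name in list(element.attrib):
        if name not in ALLOWED_ATTRIBUTES:
            del element.attrib[name]
    href = element.get("href")
    if href is not None and not href.strip().lower().startswith(ALLOWED_SCHEMES):
        del element.attrib["href"]

    previous = None  # last child that survived, if any
    for child in list(element):
        if is_allowed(child):
            clean(child)
            previous = child
            continue
        # Remove the whole subtree, but keep the text that came after it.
        tail = child.tail or ""
        if previous is None:
            element.text = (element.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        element.remove(child)
```

A usage example: parse the content with `root = ET.fromstring(xhtml_text)`, call `clean(root)`, and serialize the result with `ET.tostring(root)`. Since the parser has already guaranteed well-formedness, the cleanup itself never has to guess where an element ends, which is exactly what makes the `type="html"` case so much less pleasant.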
Right now, the mod_atom cleanup happens as the data comes in, so the version in the AtomPub Collection feeds is sanitized. I’m beginning to think that’s wrong, that the Atom Store part of mod_atom should preserve the data as-is, as much as possible; presumably, those feeds and entries will be access-controlled, not world-readable. Then there should be a separate set of feeds offered to the world for subscription purposes. They, and the HTML pages, would exist only in the sanitized state.
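The split between an as-is store and sanitized public output might be sketched like this in Python. It is only an illustration of the idea, not how mod_atom is built: the class `AtomStore`, its method names, the boolean `authorized` flag, and the use of `html.escape` as a stand-in sanitizer are all hypothetical.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Callable, Dict


@dataclass
class AtomStore:
    # Any function from raw markup to safe markup; supplied by the caller.
    sanitize: Callable[[str], str]
    _raw: Dict[str, str] = field(default_factory=dict)

    def post(self, entry_id: str, content: str) -> None:
        """Store the content exactly as the client sent it."""
        self._raw[entry_id] = content

    def get_raw(self, entry_id: str, authorized: bool) -> str:
        """Access-controlled view: the unmodified content, for the author's tools."""
        if not authorized:
            raise PermissionError("raw entries are not world-readable")
        return self._raw[entry_id]

    def get_public(self, entry_id: str) -> str:
        """World-readable view: always passed through the sanitizer on the way out."""
        return self.sanitize(self._raw[entry_id])


# Stand-in sanitizer for the example: escape everything. A whitelist
# sanitizer would go here instead.
store = AtomStore(sanitize=escape)
store.post("entry-1", "<script>alert(1)</script>")
```

One consequence of this arrangement is that the sanitization rules can be tightened later and take effect on everything already stored, since nothing was thrown away at posting time.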
But at this stage we’re just making this up as we go along. It’s really nice, though, that everyone seems to have realized that the problem is real and important; and if we can develop a set of Best Common Practices, that’d be good for everyone.