I’ve decided that mod_atom really needs to be a blog-publishing system, not just an Atom Store. And furthermore, based mostly on the comments to that Sanitation piece, I’ve made two design decisions. First, the sanitizing happens only on the HTML output; the Atom-store part will persist the data as close as possible to the way it was sent upstream. Second, I’m going to try using the TidyLib parser to pick apart type="html" text constructs so I can clean ’em up.
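
The first decision, store verbatim and sanitize only on the way out, might look roughly like the sketch below. Everything in it is an assumption for illustration: `entry_t`, `entry_store`, `entry_render_html`, and `sanitize_for_output` are hypothetical names, not mod_atom code, and the sanitizer is a deliberately crude stand-in that escapes all markup rather than the Tidy-based cleanup described above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stored entry: the store keeps the bytes exactly as received. */
typedef struct {
    char *raw_content;   /* type="html" payload, persisted verbatim */
} entry_t;

/* Store side: copy the payload untouched. Returns 0 on success, -1 on failure. */
int entry_store(entry_t *e, const char *payload)
{
    size_t n = strlen(payload) + 1;
    e->raw_content = malloc(n);
    if (e->raw_content == NULL)
        return -1;
    memcpy(e->raw_content, payload, n);
    return 0;
}

/* Placeholder sanitizer (an assumption, not the parser-based cleanup):
 * escapes the markup-significant characters so nothing can execute.
 * Returns a newly allocated string the caller must free, or NULL. */
char *sanitize_for_output(const char *raw)
{
    size_t len = strlen(raw);
    char *out = malloc(len * 6 + 1);   /* worst case: every byte becomes "&quot;" */
    char *p = out;
    if (out == NULL)
        return NULL;
    for (const char *s = raw; *s != '\0'; s++) {
        switch (*s) {
        case '<': memcpy(p, "&lt;", 4);   p += 4; break;
        case '>': memcpy(p, "&gt;", 4);   p += 4; break;
        case '&': memcpy(p, "&amp;", 5);  p += 5; break;
        case '"': memcpy(p, "&quot;", 6); p += 6; break;
        default:  *p++ = *s;              break;
        }
    }
    *p = '\0';
    return out;
}

/* Publishing side: sanitizing happens here, never in entry_store(). */
char *entry_render_html(const entry_t *e)
{
    return sanitize_for_output(e->raw_content);
}

/* Release the stored payload. */
void entry_free(entry_t *e)
{
    free(e->raw_content);
    e->raw_content = NULL;
}
```

The point of the split is that a better sanitizer can be swapped in later without touching anything already persisted; only `sanitize_for_output` would change.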

Why Tidy? · The other candidate was libxml2. Online research failed to reveal any hands-on comparisons of the two, but it also failed to turn up anyone seriously dissing either HTML parser. Then I noticed that the libxml2 binary is about 3.8M, while TidyLib is under 400K. Of course, to be fair, libxml2 does tons of other useful stuff that I don’t care about.

So after a couple of days’ part-time poking around, I figured out how to compile TidyLib and mod_atom together and load the result into httpd.

Now let’s see how it goes. I must say that I’m a little intimidated by Tidy’s memory allocator. That’s extremely, uh, extreme. I suppose I can figure it out. Compare Genx’s. Am I too simple-minded?

As soon as I stop blogging I’m going to try to wire it up. Surely I have some big thick books or corporate strategies or social-software trends to review first?



Contributions

From: Bob DuCharme (Aug 16 2007, at 19:43)

What about John Cowan's TagSoup (http://ccil.org/~cowan/XML/tagsoup/)?

From: Tim (Aug 16 2007, at 21:43)

Bob: TagSoup is in Java.

From: Aslak Raanes (Aug 17 2007, at 01:01)

I guess a plain C version of html5lib would be nice, but I don't know if anyone is working on that.

From: David Comay (Aug 27 2007, at 10:53)

Tim, you may be interested to know that Tidy has been integrated into build 71 of OpenSolaris so it's now part of Solaris Express.

August 16, 2007