RAD VII: J?REXML

[RAD stands for Ruby Ape Diaries, of which this is part VII.] In native Ruby, the default way to do XML processing is via the REXML library. In JRuby, you have another default choice—Java’s built-in XML APIs. Neither option is that great. Still, there are some reasonably safe ways to get the job done. I wrote some glue code called JREXML to make the Java APIs look more like REXML, which forced me to think about this stuff perhaps more than is entirely healthy.

REXML · I and others have already griped about REXML’s forgiving attitude to broken XML, but let’s assume that Sam Ruby or someone else gets around to fixing that. The REXML API still has some surprises, most notably the “elements” field of the Element object. It has a bunch of methods, but the most popular is [], in which an integer selects a child element by (1-based) index. Or you can put an XPath in those brackets, but every time I do, I get a surprising result that I don’t understand. Maybe if I could figure out the following from the online docs I’d be able to make it go:

The attribute Element.elements is an Elements class instance which has the each and [] methods for accessing elements. Both methods can be supplied with an XPath for filtering, which makes them very powerful.

Since Element is a subclass of Parent, you can also access the element’s children directly through the Array-like methods Element[], Element.each, Element.find, Element.delete. This is the fastest way of accessing children, but note that, being a true array, XPath searches are not supported, and that all of the element children are contained in this array, not just the Element children.
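To make the two access paths in that quoted documentation concrete, here is a minimal sketch. The sample document and its item/id names are invented for illustration, and the calls (elements[] with an integer or an XPath string, children for the raw child list) are the ones I believe REXML offers, so take it as an illustration rather than a reference.

```ruby
require 'rexml/document'

# Hypothetical sample document; the whitespace between tags matters here.
xml = <<~XML
  <list>
    <item id="a">one</item>
    <item id="b">two</item>
  </list>
XML

root = REXML::Document.new(xml).root

# elements[] with an integer counts child *elements* only, starting at 1.
first_item = root.elements[1]

# elements[] with a string treats it as an XPath relative to this element.
item_b = root.elements['item[@id="b"]']

# The raw child list counts from 0 and also holds the whitespace text nodes.
first_child = root.children[0]

puts first_item.text
puts item_b.text
puts first_child.class
```

The point of the last line is that the first raw child is not the first item at all, which is one way to get a surprising result when mixing the two styles.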

Use XPaths · But (leaping ahead to this essay’s conclusion), I think that the sensible way to deal with XML APIs is, if you possibly can, to forget tree walking and depend on XPath. The combination of current-context and an XPath can get you pretty well whatever you want out of an in-memory XML structure in a straightforward way.

The good news is that XPaths seem to work about the same in Java and REXML; since these are two entirely unrelated implementations, that’s a really good sign.

There are two pieces of bad news. The first is that XPaths can be tricky to get right, and it’s made harder by the fact that there are sometimes little bits of DOM goo that get in your way; the stuff that sits above the root element and is used to hold <!DOCTYPE> drivel and suchlike. You can work around this stuff, but you have to do it consciously.
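Here is a small sketch of what that goo looks like from REXML and two deliberate ways around it. The one-line feed document is made up, and the exact list of prolog nodes you see may vary with the parser version, so the sketch only prints them.

```ruby
require 'rexml/document'

# Hypothetical document with a prolog in front of the root element.
xml = '<?xml version="1.0"?><!DOCTYPE feed><feed><entry/><entry/></feed>'
doc = REXML::Document.new(xml)

# The Document node's children include the prolog, not just the root element.
doc.children.each { |child| puts child.class }

# Two conscious choices of starting point:
from_document = REXML::XPath.match(doc, '/feed/entry')  # absolute path from the document
from_root     = REXML::XPath.match(doc.root, 'entry')   # relative path from doc.root

puts from_document.size
puts from_root.size
```

Either starting point is fine as long as you pick one on purpose; the trouble starts when a relative path is evaluated against the document node by accident.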

REXML::XPath.* Has Three Arguments. You Need Them All. · The second piece of bad news is REXML-specific: for most real-world XML processing you need to use namespaces, and REXML makes it really easy to go off the rails. Suppose, for example, you want to find all the entries in an Atom feed. Assuming you trust the structure (as in, you’ve validated it), something like //atom:feed/atom:entry will do what you need. The code looks like:

REXML::XPath.each(doc, '//atom:feed/atom:entry') do |entry|
  # use the entry
end

You might idly wonder what namespace that atom: prefix maps to. Well, it turns out, whatever atom: is mapped to in the document you’re searching. How whimsical. Danger, Will Robinson! On this planet they rely on prefix assignments from random input documents. Repeat: Danger! If, for some reason, they’ve mapped http://www.w3.org/2005/Atom to, for example, a:, or used it as the default namespace, both of which I’ve seen in the wild, you’re hosed. And if some black-hat has been doing malicious prefix mapping...

The solution is easy enough:

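require 'rexml/document'

# Bind the prefix ourselves instead of trusting the document's prefixes.
atom_ns = { 'a' => 'http://www.w3.org/2005/Atom' }
REXML::XPath.each(doc, '//a:feed/a:entry', atom_ns) do |entry|
  # use the entry
end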

Then your XPath will Just Work. Or at least, mine have. But it’s easy to get the impression that it’s OK to leave out that namespace-hash argument, and that’s really insidious: someone who looks at a sample doc and writes their XPaths to work with it will encounter mysterious breakage at some unpredictable later time, once the code is exposed to real live Internet XML.
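As a rough check of that claim, the sketch below runs one three-argument query against three made-up feeds that differ only in how they spell the Atom namespace. The feed strings and the ATOM / ATOM_NS names are mine, invented for the illustration.

```ruby
require 'rexml/document'

ATOM = 'http://www.w3.org/2005/Atom'
ATOM_NS = { 'a' => ATOM }

# Three hypothetical feeds, identical except for the prefix they picked.
feeds = {
  'atom: prefix'      => %(<atom:feed xmlns:atom="#{ATOM}"><atom:entry/></atom:feed>),
  'a: prefix'         => %(<a:feed xmlns:a="#{ATOM}"><a:entry/></a:feed>),
  'default namespace' => %(<feed xmlns="#{ATOM}"><entry/></feed>)
}

counts = feeds.transform_values do |xml|
  doc = REXML::Document.new(xml)
  # The third argument binds 'a' for this query, whatever the document calls it.
  REXML::XPath.match(doc, '//a:feed/a:entry', ATOM_NS).size
end

counts.each { |label, n| puts "#{label}: #{n}" }
```

If each feed reports one entry, the query is keyed to the namespace URI rather than to whichever prefix the document's author happened to pick, which is the behavior you want.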

On the Java Side · The default Java approach to using XML conforms to the W3C DOM, which is why alternatives like JDOM and XOM and so on exist. The DOM tries to be infinitely general and language-independent, and succeeds at these things, but pays a severe price in usability.

I wouldn’t have been able to figure out how to make it go from the official docs. However, some quality time with Elliotte Rusty Harold’s excellent survey over at IBM Developerworks got me pointed in the right direction, and I recommend it heavily.

Not only is the DOM irritating in general to most programmers, its Java expression in particular is perhaps not all that it would be in an ideal world. That’s OK; with a little work, you can in fact make it go, even (as in my case) peering through the not-quite-100%-transparent Java-to-Ruby porthole. But do start by reading ERH’s piece.

So, why would anyone use this API? Because it works, and in any reasonably-modern Java installation, you know it’s going to Just Be There. For some people, this is not a very compelling reason; but for others it’s a big deal. Choose your poison.

August 22, 2006

By Tim Bray.
