This fragment is mostly a note to myself and a placeholder, but it might prove useful to someone slashing through the XML undergrowth with bleeding-edge Ruby. Briefly: I revived my “RX” Ruby tokenizer (see here, here, and here) to contribute to Antonio Cangiano’s proposed Ruby benchmark suite, which I think is a Really Good Idea. I had a bit of pain getting the code to run on both Ruby 1.8 and 1.9, and then when I tried sanity-checking the output by comparing it to REXML on 1.9, REXML blew chunks. There are, apparently, issues with REXML on 1.9. Read on for details in the unlikely event that you care about any of this.

Benchmarking · The problem is that there are now a lot of plausible-looking Ruby implementations (MRI, YARV, JRuby, Rubinius, IronRuby, MagLev) and it would be nice to compare their performance. I was talking to some of the implementers about this and someone (Charles Nutter, I think) said “Problem is, there’s this huge gap between running fib() and running Rails.” So, for example, how do we find out how fast MagLev will run Rails, without going through all the pain of making MagLev run Rails?

Antonio Cangiano sensibly proposed “Let’s create a Ruby Benchmark Suite”; when Avi Bryant told me he’d tried my RX code on MagLev, it occurred to me that it might be an interesting benchmark.

RX refresher: It’s a pure automaton-based XML tokenizer whose performance is totally dependent on the efficiency of dereferencing integer arrays, and it turns out that mainstream Ruby really sucks at this.
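To give a feeling for what “dereferencing integer arrays” means here, this is a toy table-driven automaton. It is a sketch of the general technique, not RX’s actual tables: the state names, character classes, and the count_start_tags helper are all made up for illustration, and it only knows enough about markup to count start tags.

```ruby
# Toy table-driven automaton. NOT the real RX tables; every name here
# is hypothetical and exists only to illustrate the technique.

# States
TEXT, AFTER_LT, IN_START_TAG, IN_END_TAG = 0, 1, 2, 3
# Character classes
LT, GT, SLASH, OTHER = 0, 1, 2, 3

# Maps an ASCII code point to its character class.
CHAR_CLASS = Array.new(128, OTHER)
CHAR_CLASS[60] = LT     # '<'
CHAR_CLASS[62] = GT     # '>'
CHAR_CLASS[47] = SLASH  # '/'

# TRANSITIONS[state][char_class] gives the next state.
TRANSITIONS = [
  [AFTER_LT,     TEXT, TEXT,         TEXT],         # TEXT
  [AFTER_LT,     TEXT, IN_END_TAG,   IN_START_TAG], # AFTER_LT
  [IN_START_TAG, TEXT, IN_START_TAG, IN_START_TAG], # IN_START_TAG
  [IN_END_TAG,   TEXT, IN_END_TAG,   IN_END_TAG]    # IN_END_TAG
]

# Counts start tags (including empty-element tags) in a UTF-8 string.
def count_start_tags(input)
  state = TEXT
  count = 0
  # Turn the input into Unicode integer characters, then do nothing
  # but array lookups for every one of them.
  input.unpack('U*').each do |cp|
    cls = cp < 128 ? CHAR_CLASS[cp] : OTHER
    next_state = TRANSITIONS[state][cls]
    count += 1 if state == AFTER_LT && next_state == IN_START_TAG
    state = next_state
  end
  count
end
```

Every input character costs a couple of integer-array lookups and nothing else, which is why this style of code is mostly a measure of how fast an implementation can index arrays.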

To make it a little more competitive with REXML, the de-facto standard Ruby parser, I had kludged it all over the place with regex preprocessing to cut down on the array traffic.

So I asked Antonio whether RX would be interesting for his suite if I de-optimized it into a pure array benchmark; he said yes, so I did.

1.8.6 vs. 1.9 · Perhaps the single most visible difference between today’s Ruby and tomorrow’s is in the low-level string-handling API. Well, an XML parser lives entirely right there, so boy did I ever learn all about it. I had previously converted RX to run on 2006-vintage YARV, but I wanted one version of the code that would run in both 1.8.6 and 1.9. Sigh.

Here’s one of the detail issues, to give a feeling for the problems. Suppose you know that your input stream is in UTF-8, you’ve read a buffer-full of data, and you want to turn it into Unicode integer characters for the parser. The catch is that the buffer might end in the middle of a multi-byte UTF-8 character. Easy enough; a glance at the last byte will diagnose that. But how do you pull out the unsigned-integer value of the last byte of a buffer, without processing through the whole (potentially large) buffer, with code that runs in both Rubies?

I poked around on IRC and Eric Hodel managed to improve on my original suggestion. Read it and weep:

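  # Unsigned integer value of the byte at index i, without scanning the
  # rest of s. Assumes s holds raw bytes, as a buffer read from a stream does.
  def byte_at(s, i)
    s[i, 1].unpack('C')[0]
  end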
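For what it’s worth, here is roughly how that helper might be used to diagnose a buffer that ends mid-character. This is a minimal sketch, not the code in RX: incomplete_tail_length is a name I made up for illustration, and it assumes the buffer is a binary (byte-indexed) string, which is what you get from reading a fixed number of bytes off a stream.

```ruby
def byte_at(s, i)
  s[i, 1].unpack('C')[0]
end

# Hypothetical helper: returns how many bytes at the end of buf belong
# to a UTF-8 character cut off by the buffer boundary, or 0 if the
# buffer ends on a character boundary. Assumes buf is a binary string.
# Malformed input is not diagnosed here; it is left for the decoder.
def incomplete_tail_length(buf)
  len = buf.length
  back = 1
  # A UTF-8 character is at most 4 bytes, so look at no more than 4.
  while back <= 4 && back <= len
    b = byte_at(buf, len - back)
    if b < 0x80
      return 0 # plain ASCII byte: nothing is cut off
    elsif b >= 0xC0
      # Lead byte: its high bits say how long the character should be.
      needed = if b >= 0xF0 then 4 elsif b >= 0xE0 then 3 else 2 end
      return back < needed ? back : 0
    end
    back += 1 # continuation byte (10xxxxxx): keep walking backwards
  end
  0
end

# Usage: hold back the incomplete tail and prepend it to the next read.
buf = "caf\xC3".b # "café" with its final byte missing
tail = incomplete_tail_length(buf)
complete = buf[0, buf.length - tail]
carry = buf[buf.length - tail, tail]
```

The walk backwards touches at most four bytes, so the cost does not depend on the size of the buffer. String#b is only used to build a binary test string on a modern Ruby; it does not exist in 1.8.6, where strings are byte-indexed anyway.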

REXML Ouch · RX has a primitive unit-test suite; what I do to sanity-check it at a high level is feed a nontrivial XML doc to it and REXML and check that they find the same number of elements, PIs, paragraphs, img elements with a src= attribute whose value ends in .jpg, and occurrences of the word “the” in running text.
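The REXML half of that sanity check might look something like the following. It is a sketch, not my actual test script: the TallyListener class and rexml_tally method are names invented for illustration, it assumes the rexml library is installed, and the RX half is left out entirely.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that tallies the things the sanity check compares.
class TallyListener
  include REXML::StreamListener
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called for every start tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    @counts[:elements] += 1
    @counts[:paragraphs] += 1 if name == 'p'
    @counts[:jpg_images] += 1 if name == 'img' && attrs['src'].to_s =~ /\.jpg\z/
  end

  # Called for every processing instruction.
  def instruction(name, content)
    @counts[:pis] += 1
  end

  # Called for runs of character data.
  def text(text)
    @counts[:the] += text.scan(/\bthe\b/).length
  end
end

def rexml_tally(xml)
  listener = TallyListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.counts
end
```

Run the same tally over RX’s token stream and the check boils down to comparing the two hashes for equality.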

Well, when I finally got it running in Ruby 1.9, and started the sanity check, REXML blew up on my document, 2.8 Meg of the input to this blog.

With a bit of poking around, I ascertained that:

Well, there you go. By the way, in Ruby 1.9’s favor, it runs the (simplified de-optimized) RX about three times as fast as 1.8.6. Any other implementors want a whack at it?



Contributions

From: MenTaLguY (Jun 11 2008, at 07:19)

I'm not sure what's worse: that REXML has issues, or that lots of people seem to think that Hpricot (a permissive HTML parser lacking namespace support and implementing a small subset of XPath) is a better replacement.

From: automatthew (Jun 11 2008, at 10:12)

The author of Hpricot commented recently that he does not "think Rubyists and XMLists share much of a Venn diagram."

http://www.rubyflow.com/items/388

Many rubyists can get away with using Hpricot for XML parsing, the way I can get away with using Google's translated pages to make DNS changes on a foreign registrar's website. Unpleasant, but effective so long as it's infrequent.

From: Scott Johnson (Jun 12 2008, at 08:28)

All of this makes me happy to stick with Python. I haven't personally had any issues with XML support there.

June 10, 2008