Thanks to the commenters on the previous RX piece who recommended ruby-prof (there’s a gem install), which is a much faster and thus better profiler than the built-in one. I learned a few more things.

The commenter who hit closest to the mark was Aristotle Pagaltzis, who references Perl lore: “The fastest way to do something in Perl is frequently the one that implements the most costly step in the fewest ops.” The reasons will be obvious on a little thought, but the PerlMonks piece is worth reading.

Looping Blues · When I run the parser over 2,477,645 bytes of XML, because of the buffer-skipping trick it only actually has to look at 484,021 characters individually, and that loop still burns 25% of the 13 seconds it takes on my PowerBook. I’ve looked at it pretty hard, sliced and diced the loop two or three different ways with the profiler, and squeezed a few improvements out, but at the end of the day, it’s pretty simple, sensible code. For now, my conclusion is that the current Ruby implementation is just not gonna be fast enough for any algorithm that requires looking individually at a nontrivial proportion of the characters.
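To give a feel for the kind of gap involved, here is a rough sketch; it is not RX code and not a measurement of RX. The two helper names and the sample text are made up for illustration. One helper inspects every code point in a pure-Ruby loop, the other hands the whole scan to a single built-in call, which is the "fewest ops" idea from the Perl lore above.

```ruby
require 'benchmark'

# Hypothetical helper: inspect every code point in Ruby code.
def count_lt_loop(codepoints)
  n = 0
  codepoints.each { |c| n += 1 if c == 0x3C } # 0x3C is '<'
  n
end

# Hypothetical helper: let one built-in call do the scanning.
def count_lt_builtin(text)
  text.count('<')
end

# Made-up sample input, not the XML used in the article.
text = '<a>some character data</a>' * 20_000
codepoints = text.unpack('U*')

loop_time    = Benchmark.realtime { count_lt_loop(codepoints) }
builtin_time = Benchmark.realtime { count_lt_builtin(text) }
puts format('loop: %.4fs  builtin: %.4fs', loop_time, builtin_time)
```

The timings will vary by machine and Ruby version; the point is only that both helpers return the same count while doing very different amounts of interpreted work.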

Packing Blues · The other problem is that RX internally processes XML text as an array of numeric values identifying Unicode characters; but APIs are going to want to deal with Ruby strings, so the arrays have to be run through pack before handing data to whatever API is in use. This hurts; Array#pack was burning 11% of my time. What puzzled me was that Integer#to_int (a no-op) was getting called 2,354,147 times and burning another 6.59%.

So I poked around and it turns out that any_array.pack('U*') calls to_int on each element; duck-typing culture at work.
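The conversion can be made visible with a small sketch. CountedCodepoint is a hypothetical wrapper class invented for this example; it stands in for "anything that responds to to_int". Newer interpreters may short-circuit the call for real Integers, which is why the sketch wraps the values rather than packing plain numbers.

```ruby
# Hypothetical wrapper that counts how often pack asks for an integer.
class CountedCodepoint
  @calls = 0
  class << self
    attr_accessor :calls
  end

  def initialize(value)
    @value = value
  end

  # pack('U*') uses this to turn each element into an integer.
  def to_int
    self.class.calls += 1
    @value
  end
end

codepoints = "h\u00e9llo".unpack('U*')                  # String -> Array of code points
wrapped    = codepoints.map { |c| CountedCodepoint.new(c) }
packed     = wrapped.pack('U*')                         # Array -> String again

puts packed
puts "to_int calls: #{CountedCodepoint.calls}"
```

Every element gets converted on its way into the string, so the cost grows with the number of characters handed to the API.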

If there were such a notion as an array known to contain only Fixnums, or a pack-type operation that was allowed to throw an exception if its arguments weren’t Fixnums, things might be better.

Now, in the case where the input is already UTF-8, there’s a chance for a special-purpose hack to avoid the String->Array->String round-trip, but that feels both brutal and discriminatory.
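As a sketch of what that special-purpose hack might look like (not what RX does): if the parser kept byte offsets into the original UTF-8 buffer, text could be sliced straight out of it instead of being packed from code points. Both helper names are hypothetical, and String#byteslice is assumed to be available, which means a Ruby newer than the one discussed here.

```ruby
# Hypothetical helper: the general path, packing code points into a String.
def text_via_pack(codepoints, from, to)
  codepoints[from...to].pack('U*')
end

# Hypothetical helper: the UTF-8-only shortcut, slicing bytes out of the
# original buffer. byte_from and byte_to must fall on character boundaries.
def text_via_slice(utf8_buffer, byte_from, byte_to)
  utf8_buffer.byteslice(byte_from, byte_to - byte_from)
end

buffer     = "<p>caf\u00e9</p>"
codepoints = buffer.unpack('U*')

puts text_via_pack(codepoints, 3, 7) # character offsets
puts text_via_slice(buffer, 3, 8)    # byte offsets; the accented e is two bytes
```

The shortcut only works when the input encoding is UTF-8, so any other encoding would still have to take the slower path.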

Conclusion · The notion of picking one of the libxml or expat based Ruby libraries and maintaining it properly and blessing that as the “right” way to do XML in Ruby is looking better and better.



From: Justin (Nov 15 2006, at 23:35)

Since a good chunk of the RX code base is generated from your state-machine data file, wouldn't it be possible to generate platform-specific code for different implementations of Ruby instead? It would always still be possible to fall back to pure Ruby if the platform was unknown.

Refactoring MachineBuilder to use separate code-generation classes shouldn't be too rough and it would allow new Ruby implementations the option of a fast (mostly) native XML-Parser with minimal effort.

Lastly, is there a reason to specify the DFA machine in a custom file? It seems like a little Ruby metaprogramming would handle the expressivity you wanted and make the source more approachable.


From: Robert Hahn (Nov 16 2006, at 05:58)

On the assumption that almost all of the 484,021 characters that you have to look at are one byte big, you're processing 37,232 bytes/sec - if you divide the time it takes to process 2,477,645 bytes, you're talking 190,588 bytes/sec.

Am I naïve to think that those speeds sound plenty fast for a scripting language? 13s to process 2.5 megabytes of data sounds pretty good to me…


From: Tim Bray (Nov 16 2006, at 15:09)

Yeah, Robert, but REXML does the same work in one-third the time.


From: Robert Hahn (Nov 16 2006, at 16:25)

REXML may do it in a third the time, but obviously something's being sacrificed in the process. You yourself pointed out that it happily works with non-conforming XML. If the cost of doing it correctly means you're taking a bit more time, that's still a win.

But at this point, I bet far better Ruby programmers than you or I are looking over the source. Hopefully they'll find a trick or 3 that'll help.


From: Tom Pollard (Nov 16 2006, at 21:13)

Aristotle's advice is common wisdom for people working in high-level scripting languages. I learned it first when doing Matlab programming in grad school. Most likely it was first formulated a few decades ago by some pioneering APL hacker.


From: assaf (Nov 23 2006, at 01:12)

Apparently, libxml-ruby is still being maintained:

I haven't used it yet to judge how well it works, or how simple the API is, just thought it's worth a mention.

I won't discount it on account of being C code. I'm happy to let Java hold a monopoly on the "one language to code it all" philosophy.


November 15, 2006