I’m in about round eight of my duel with Ruby, trying to make a correct automaton-based parser run as fast as a regexp-based one that has a casual attitude toward the rules. (Our story thus far here and here.) I thought I’d try out YARV, the heir-apparent Ruby VM. [Update: Hah! Improved RX a little more.]
More RX · I’d added one more optimization since the last report: when the parser wants to assemble characters into UTF-8 strings to pass to the API, it gets them from the input subsystem, since if the input were already in UTF-8, there’s a chance to dodge string recomposition overhead. It bought about 20% on large UTF-8 input documents.
[Update, Friday night.] I was trying to check out JRuby timings and something wasn’t working; a closer look at the code revealed a bone-headed programming error that was doing a useless (and expensive) regexp match on every buffer refresh. The numbers below are a little better.
Ruby 2 Breakage · There will be some incompatible changes in Ruby 2. One of the most visible is that indexing into a string returns a one-character string, as opposed to the value of the byte thus indexed. I think that in general I probably approve of this behavior, and most programs that it breaks will probably be getting what they deserve, since they were probably heading for an i18n brick wall anyhow.
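To make the change concrete, here is a minimal sketch of the new indexing behavior as it appears in Rubies that shipped with these semantics (1.9 and later). The variable names are illustrative only; `String#getbyte` and `String#ord` are the standard ways to get back the number that plain indexing used to return.

```ruby
# Indexing now yields a one-character String, not a byte value.
s = "abc"
first_char = s[0]          # a String of length 1

# Two ways to recover the numeric value the old indexing gave:
first_byte = s.getbyte(0)  # the raw byte at offset 0
first_code = s[0].ord      # the codepoint of the first character

# For non-ASCII text, bytes and characters part ways, which is
# exactly the i18n wall the old byte-indexing code was heading for.
word = "né"
char_count = word.length    # counts characters
byte_count = word.bytesize  # counts bytes in the string's encoding (UTF-8 here)

puts "#{first_char.inspect} #{first_byte} #{first_code} #{char_count} #{byte_count}"
```

Code that wrote `s[i]` and expected an integer is the code that breaks; code that already thought in characters mostly carries on unchanged.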
Of course, RX used strings to hold the automaton itself and a couple of related data structures, so they all had to be converted, and all the numeric references are now into the converted structures rather than strings, which I’d like to think is cheap, but I haven’t measured yet.
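As a rough illustration of that kind of conversion, the sketch below unpacks a byte-packed transition table into an array of integers once, so that every later lookup is a plain array index. Everything here is an assumption for the sake of the example: the table contents, the names `PACKED`, `TABLE`, `NUM_CLASSES`, `next_state` and `run`, and the choice of `unpack('C*')` as the conversion. It is not RX’s actual data or code.

```ruby
# Hypothetical 2-state, 3-input-class transition table, one byte per cell.
# Row = current state, column = input class, cell = next state.
NUM_CLASSES = 3
PACKED = [0, 1, 0,
          1, 1, 0].pack('C*')   # stands in for a table stored as a String

# Convert once, up front: an Array of small Integers.
TABLE = PACKED.unpack('C*').freeze

# One automaton step is now a single Array lookup returning an Integer.
def next_state(state, input_class)
  TABLE[state * NUM_CLASSES + input_class]
end

# Feed a sequence of input classes through the automaton.
def run(input_classes, start = 0)
  input_classes.inject(start) { |state, c| next_state(state, c) }
end

puts run([1, 1, 2])
```

The one-time `unpack` costs time and memory proportional to the table size; whether the per-lookup cost is better or worse than the old string indexing is the thing that would need measuring.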
TDD Rah Rah · Refactoring the code to deal with the new String regime wasn’t that tricky, but I bow my head in the general direction of whichever deity governs Unit Testing. If I hadn’t had a fairly decent set of input-subsystem tests, it would have been nightmarish, automaton-based parsers being fairly tricky to debug.
The Numbers · Once again, the benchmark is 2,477,645 bytes of ongoing source text; the code counts the number of PIs, img elements whose src attribute points at something ending in a given suffix, and occurrences of the word “the” in running text.
The numbers are averages of a few runs, and consider only “user” CPU time as reported by Mac OS X. Units are seconds, smaller is better.
The results are fairly self-explanatory: RX and REXML both got a shot in the arm from YARV; RX’s was a little bigger, so it’s not quite as far behind REXML as it used to be.
Frankly, I’m disappointed. I’d have thought that if there’s anywhere a VM, as opposed to an interpreter, ought to do well, it’s this kind of stripped-down array-lookup-and-dispatch code.
Next step: track down what they use for profiling in YARV-land, and find out what’s going on. [Are we getting a little obsessive here? -Ed.] [And your point is? -Tim]