I’m in about round eight of my duel with Ruby, trying to make a correct automaton-based parser run as fast as a regexp-based one that has a casual attitude toward the rules. (Our story thus far here and here.) I thought I’d try out YARV, the heir-apparent Ruby VM. [Update: Hah! Improved RX a little more.]

More RX · I’d added one more optimization since the last report: when the parser wants to assemble characters into UTF-8 strings to pass to the API, it gets them from the input subsystem, since if the input were already in UTF-8, there’s a chance to dodge string recomposition overhead. It bought about 20% on large UTF-8 input documents.

[Update, Friday night.] I was trying to check out JRuby timings and something wasn’t working; a closer look at the code revealed a bone-headed programming error that was doing a useless (and expensive) rexexp match on every buffer refresh. The numbers below are a little better.

Ruby 2 Breakage · There will be some incompatible changes in Ruby 2. One of the most visible is that indexing into a string returns a one-character string, as opposed to the value of the byte thus indexed. I think that in general I probably approve of this behavior, and most programs that it breaks will probably be getting what they deserve, since they were probably heading for an i18n brick wall anyhow.

Of course, RX used strings to hold the automaton itself and a couple of related data structures, so they all had to have .bytes.to_a applied, and all the numeric references are into Fixnum arrays, not strings, which I’d like to think is cheap, but I haven’t measured yet.

TDD Rah Rah · Refactoring the code to deal with the new String regime wasn’t that tricky, but I bow my head in the general direction of whichever deity governs Unit Testing. If I hadn’t had a fairly decent set of input-subsystem tests, it would have been nightmarish, automaton-based parsers being fairly tricky to debug.

The Numbers · Once again, the benchmark is 2,477,645 bytes of ongoing source text; the code counts the number of PIs, Elements, p elements, img elements whose src attribute points at something ending in jpg, and occurrences of the word “the” in running text.

The numbers are averages of a few runs, and consider only “user” CPU time as reported by Mac OS X. Units are seconds, smaller is better.

RX REXML Δ
Ruby 1.8.5 11.53 4.52 2.55
YARV 0.4.1 6.99 3.08 2.27
Δ 1.65 1.47

This is fairly self-explanatory; RX and REXML both got a shot in the arm from YARV; RX’s was a little bigger, so it’s not quite as far behind REXML as it used to be.

Frankly, I’m disappointed. I’d have thought that if there’s anywhere a VM, as opposed to an interpreter, ought to do well, it’s this kind of stripped down array-lookup-and-dispatch code.

Next step; track down what they use for profiling in YARV-land, and find out what’s going on. [Are we getting a little obsessive here? -Ed.] [And your point is? -Tim]



Contributions

Comment feed for ongoing:Comments feed

From: Pierre Phaneuf (Nov 23 2006, at 22:58)

It all depends on the design of the interpreter/VM. For example, the perl5 interpreter was often faster than Java VMs for a while (until they improved the hell out of the VMs), by the simple expedient of "chunky" ops. A single interpreter op can be something like "take this list, and give me a list of the items matching this regex". This way, the decoding/interpreter overhead remains small in comparison to the work done in native code.

If your ops are "load this memory location into this pseudo-register (or the stack)", then yeah, it's going to be very slow.

Now, what if perl5 JITted the smaller ops (like addition operator and friends) together? They didn't even *try* to make it faster!

[link]

From: Ed (Nov 24 2006, at 23:07)

Dear Mr. Bray,

Seeing how you are more enthusiast to Ruby than Java has made me wonder why SUN did not chose to recruit Matz to work on Ruby full time with a group of brightest SUN engineers.

Google recruited Guido (thank God, now Python can be taken seriously and have long future).

Yahoo! recruited Rasmus

Well.. nobody recruited Larry Wall perhaps because Perl is so cryptic..

SmallTalk is.. just there..

Seeing how Ruby seems to capture many people interest including old-school LISP developers, I think SUN should grab Matz.

Come to think again.. this is a good topic for my first blog ^_^.

[link]

From: Peter Green (Nov 25 2006, at 00:40)

I'm just a humble Ruby foot soldier and don't know squat about such things as VM implementation, and doubtless Koichi knows what he's doing and eventually it will get faster. BUT, I think it's fair to say that a lot of folks are quietly getting a bit impatient about the whole Ruby execution speed issue... Elephant in the room type stuff - and nobody wants to say anything negative too loudly in case they start an internet meme... least of all me - I wish I could do all my programming in Ruby. But I can't.

So, if a few Sun elves happened to get medieval on YARV with DTrace and found some rate-determining bottlenecks... well we'd all recycle our powerbooks and buy Ultra 40s and carry them around on our backs... or something like that :).

[link]

From: David Waite (Nov 29 2006, at 14:02)

Tim, I'm interested in seeing how your rx library has changed over time, is there a chance you could put the code up somewhere? It appears the only available version is the initial version from the 12th.

[link]

From: Tim Bray (Nov 29 2006, at 16:47)

Oops, David, you're right. I put the latest YARV-ized version at http://www.tbray.org/code/rx-yarv.tgz

[link]

From: David Waite (Dec 05 2006, at 18:15)

The reason I asked for the latest and greatest is that I've been a bit inspired to work on my own XML parser recently, and have taken a liking to your DFA-based approach. I'll send you a link sometime once I have a blog up around my efforts :)

Some things I've done so far are to make a different DFA language with some additional rules (you can either match single characters or against sequences of characters as a shortcut) and a macro/templating feature for reuse of some common things like Name matching. I also restricted calling functions to just declaring function names to allow for easier (potential) cross-language code generation.

I did notice that some of the DFA for the "Eq" pattern seem to be less than strict, it consumes any number of spaces rather than the single 'optional' space. The place I saw this was around the XML declaration 'version', but I might have not seen other places. I'm assuming you are shooting for 'compliant as possible without supporting doctypes' (when is XML 2.0 coming out again?)

[link]

From: Paul B (Dec 06 2006, at 12:57)

Ed said:

> Well.. nobody recruited Larry Wall perhaps because Perl is so cryptic..

O'Reilly "recruited" Larry for a while. This, in combination with the excellent Perl books they published -- excellent in conveying general concepts as a context for Perl and not just Perl specifics -- has contributed significantly to my continuing affection for O'Reilly. (Added to my affection for Larry, for authoring or consulting on those works.)

Also, I believe I recall that Yahoo gave him, unsolicited and unrequired, some nice options or grants at the time they went public. I seem to recall (admittedly, unreliably) some comments around that time more or less to the effect: "We couldn't have done it without Perl."

[link]

author · Dad · software · colophon · rights
picture of the day
November 23, 2006
· Technology (87 fragments)
· · Ruby (93 more)

By .

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.