· Naughties
· · 2004
· · · August
· · · · 22 (1 entry)

Java Regex Wrangling · I needed a quick-and-dirty tokenizer for a big chunk of XML-ish text to feed into some Java code, so I was going to fire up Perl; then I remembered that modern Java comes with its own regular-expression library. Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran over a 100M file, finding fifteen million tokens, in about three minutes of CPU time on my 1.25GHz PowerBook. Quite respectable, but, I thought with a snicker, I bet Perl can beat that. (Perl’s regex engine is generally regarded as the state of the art.) So I whacked together a Perl version and, just to make sure I was getting the right answers, I had both the Java and Perl versions print out all the tokens they found. They both burned something over ten minutes, and Perl was maybe 10% faster; that might have been the I/O or other static. I was impressed to find Java within 10% of the best. So then I ran it again without the output, just counting the tokens, and yowie zowie, Perl was at 8 minutes 47 seconds, Java back at 3 minutes 4 seconds. So I re-ran on a nearby Debian box, on the theory that the OS X versions of Java and Perl might not be representative of their kind. There are all sorts of variations around I/O and so on, but my finding is that for this problem, the Java 1.4.2 regex processing is somewhere around twice as fast as Perl 5.8.1. Frankly, I’m astounded. Read on for acknowledgements, some gory details, and a tasteful selection of Google ads for regular expression software. [Update: There is a good reason things are the way they are, and Perl’s trade-off may well be better.] ...
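For a rough idea of what a count-only tokenizer built on `java.util.regex` might look like, here is a minimal sketch. The actual code and pattern are not shown in this summary, so the token pattern below (a tag, or a run of text between tags) and the names `TokenCount` and `countTokens` are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenCount {
    // Hypothetical token pattern (an assumption, not the original):
    // either a tag such as <p> or </p>, or a run of text that starts
    // with a non-whitespace character and extends up to the next '<'.
    private static final Pattern TOKEN =
        Pattern.compile("<[^>]*>|[^<\\s][^<]*");

    // Counts tokens without printing them, as in the count-only run.
    static int countTokens(CharSequence text) {
        Matcher m = TOKEN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Tokens: <doc>, <p>, "Hello, regex", </p>, </doc>
        System.out.println(countTokens("<doc><p>Hello, regex</p></doc>")); // prints 5
    }
}
```

Compiling the pattern once and reusing a single `Matcher` over the whole input keeps per-token overhead low; for a file of this size, the input would presumably be read in large chunks rather than line by line.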

I am an employee
of Amazon.com, but
the opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.