Technology · Coding · Text

Ruby 1.9 I18n and Mashup Testing · A cou­ple of points on that PDML + Twit­ter mashup. First, yet an­oth­er homi­ly say­ing “Please unit test”. Se­cond, some com­menters want­ed to see the code even though it’s triv­ial, and I found a rea­son to agree ...
Unicode Blues · In the past few days I’ve been watching two debates on the subject of Unicode; one on the main IETF general-discussion list, and another on ruby-talk (there must be a better archive). In IETF-land, the elders are once again convincing each other that Internet Standards need not be written in a way that allows characters other than ASCII; thus, for example, you can’t correctly record the names of contributors like Bill de hÓra or Martin Dürst; nor can you illustrate any discussions of network protocols which carry payloads other than those which can be expressed in primitively-typeset English. I have a lot of respect for the IETF’s achievements, but I think my revulsion at this institutional bigotry will probably soon drive me out of the organization. In Ruby-land, it seems that Matz has spoken, and Ruby, the next generation, will have a wonderful String class that deals with everything, handling Unicode, which they see as unacceptably limited, as merely one case among many. This thinking seems deeply broken to me, but I am only shallowly immersed in Ruby and don’t understand the Han Unification angst that is at the root of things. I don’t have much influence in either community (which is appropriate, I haven’t earned it). I’ll raise my voice, for what that’s worth, to argue that getting Unicode really right is a necessary condition for being a technology provider in the third millennium, and may prove to be sufficient, insofar as internationalized-text issues go. I’m not optimistic that this will make any difference. But if either community decides to give Unicode a serious go, I’ll volunteer to pitch in, to work to make it work.
Regex Update · Back in Au­gust 2004, I wrote a piece com­par­ing Perl and Ja­va regex per­for­mance, ob­serv­ing that, to my sur­prise, ap­par­ent­ly Ja­va was way faster on what I thought was a pret­ty com­mon task. Last mon­th, Ben Til­ly wrote me say­ing that Perl con­scious­ly ac­cept­ed a regex slow­down to route around a patho­log­i­cal case where search time could ex­plode to in­fin­i­ty. I asked him to write it up and promised to point to it, and he has. If you care about this kind of thing, read Ben’s piece and don’t miss the com­ments, which are in­ter­est­ing. Sum­ma­ry: the jury’s still out. See al­so: Open-Source Regex.
Republished · At some point in the transition to Debian Sarge, something broke in the ongoing software. The Perl code reads text using an XML processor and various pieces of it get stashed in a MySQL database. Only somewhere along the line, non-ASCII UTF-8 characters were getting trashed. I tried all sorts of stupid dodges, and was whining away at Sam Ruby via instant messenger, and he said “of course, you could do it all as seven-bit ASCII via &#xBABE;... or you could rewrite it in Ruby and It Would Be Much Better”. I shrieked “Get thee behind me foul tempter!” and have now jammed everything into 7-bit ASCII as it comes out of the XML parser, and of course all the problems have gone away. Actually, the code got simpler; lots of XML escaping/unescaping calls are no longer necessary. This is one of the nice things about XML I guess, it allows you to be a good internationalization citizen even when your software infrastructure isn’t. It still feels evil. Anyhow, the whole site’s been republished, let me know if anything’s busted. (By the way, if you’re reading this in my RSS feed and all the entries show up as new, switch to the Atom feed and that problem will go away, because Atom actually has unique IDs and datestamps that work.) [Updated: Tony Coates (interesting new blog there, BTW) reports that Opera 8.02 gets it backwards, which means that it’s one of the rare pieces of software that respects guids in RSS, but that it’s doing Atom 1.0 wrong.]
Text Encoding Progress · It’s good to see the IETF show­ing for­ward mo­tion on the vi­tal is­sues around how to store text ef­fi­cient­ly; check out the brand-new RFC4042 on UTF-9 and UTF-18. Good stuff.
Big Unicode! · Via jwz, a mon­ster Uni­code chart about 6 by 12 feet. I want one!
Open-Source Regex · A few days ago I wrote a lit­tle re­port on regular-expression per­for­mance; it drew a sur­pris­ing amount of feed­back, in­clud­ing one piece that throws an in­ter­est­ing side­light on the trade-offs around Ja­va and Open Source ...
Java Regex Wrangling · I need­ed a quick and dirty to­k­eniz­er for a big chunk of XML-ish text to feed in­to some Ja­va code so I was go­ing to fire up Per­l, then I re­mem­bered that mod­ern Ja­va comes with its own regular-expression li­brary. Hey, it’s good! I put it to­geth­er in quick-n-dirty hack­er style, and it ran over a 100M file, find­ing fif­teen mil­lion to­ken­s, in about three min­utes of CPU time on my 1.25GHz Pow­erBook. Quite re­spectable, but, I thought with a snick­er, I bet Perl can beat that. (Perl’s regex en­gine is gen­er­al­ly re­gard­ed as the state of the art.) So I whacked to­geth­er a Perl ver­sion and, just to make sure I was get­ting the right an­swer­s, I had both the Ja­va and Perl ver­sions print out all the to­kens they found. They both burned some­thing over ten min­utes, and Perl was maybe 10% faster; might have been the I/O or oth­er stat­ic. I was im­pressed to find Ja­va with­in 10% of the best. So then I ran it again with­out the out­put, just count­ing the to­ken­s, and yowie zowie, Perl was at 8 min­utes 47 sec­ond­s, Ja­va back at 3 min­utes 4 sec­ond­s. So I re-ran on a near­by De­bian box, on the the­o­ry that the OS X ver­sions of Ja­va and Perl might not be rep­re­sen­ta­tive of their kind. There are all sorts of vari­a­tions around I/O and so on, but my find­ing is that for this prob­lem, the Ja­va 1.4.2 regex pro­cess­ing is some­where around twice as fast as Perl 5.8.1. Frankly, I’m as­tound­ed. Read on for ac­knowl­edge­ments, some gory de­tail­s, and a taste­ful se­lec­tion of Google ads for reg­u­lar ex­pres­sion soft­ware. [Up­date: There is a good rea­son things are the way they are, and Perl’s trade-off may well be bet­ter.] ...
Yooster, v0.1 · Ar­ti­cles in this space have in­tro­duced Uni­code, dis­cussed how it is pro­cessed by com­put­er­s, and ar­gued that Java's prim­i­tives are less than ide­al for heavy text pro­cess­ing. To ex­plore this fur­ther, I've been writ­ing a Ja­va class called Ustr for “Unicode String,” pro­nounced “Yooster.” The de­sign goals are cor­rect Uni­code se­man­tic­s, sup­port for as much of the Ja­va String API as rea­son­able, and sup­port for the fa­mil­iar, ef­fi­cient null-terminated byte ar­ray ma­chin­ery from C ...
Programming Languages and Text · Wel­come to an­oth­er in­stall­ment in on­go­ing's on­go­ing tour through text-processing is­sues. This one is about programming-language sup­port, and while it makes spe­cif­ic ref­er­ence to Java, tries to be gen­er­al­ly ap­pli­ca­ble to mod­ern soft­ware en­vi­ron­ments. The con­clu­sion is that Ja­va is OK for some kinds of text pro­cess­ing, but has re­al prob­lems when the lift­ing gets heavy ...
Characters vs. Bytes · This is the first of a three-part es­say on mod­ern char­ac­ter string pro­cess­ing for com­put­er pro­gram­mer­s. Here I ex­plain and il­lus­trate the meth­ods for stor­ing Uni­code char­ac­ters in byte se­quences in com­put­er­s, and dis­cuss their ad­van­tages and dis­ad­van­tages. Th­ese meth­ods have well-known names like UTF-8 and UTF-16 ...
On Character Strings · In the re­cent es­say on Uni­code, I promised to say more about UTF-8 and UTF-16. Which is still a good idea, but a won­der­ful ar­ti­cle by Paul Gra­ham, The Hundred-Year Lan­guage, got me think­ing about char­ac­ter strings in gen­er­al, and how many dif­fer­ent ways there are of ap­proach­ing them ...
On the Goodness of Unicode · Quite a few soft­ware pro­fes­sion­als have learned that they need to wor­ry about in­ter­na­tion­al­iz­ing soft­ware, and some of those have learned how to go about do­ing it. For those get­ting start­ed, here­with a brief in­tro­duc­tion to Uni­code, the one tech­nol­o­gy that you have to get com­fort­able with if you're go­ing to do a good job as a soft­ware cit­i­zen of the world ...


I am an employee
of Amazon.com, but
the opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.