· · Coding
· · · Text
· In the past few days I’ve been watching two debates on the subject of Unicode: one on the main IETF general-discussion list, and another on ruby-talk (there must be a better archive). In IETF-land, the elders are once again convincing each other that Internet Standards need not be written in a way that allows characters other than ASCII; thus, for example, you can’t correctly record the names of contributors like Bill de hÓra or Martin Dürst; nor can you illustrate any discussions of network protocols which carry payloads other than those which can be expressed in primitively-typeset English. I have a lot of respect for the IETF’s achievements, but I think my revulsion at this institutional bigotry will probably soon drive me out of the organization. In Ruby-land, it seems that Matz has spoken, and Ruby, the next generation, will have a wonderful String class that deals with everything, handling Unicode, which they see as unacceptably limited, as merely one case among many. This thinking seems deeply broken to me, but I am only shallowly immersed in Ruby and don’t understand the Han Unification angst that is at the root of things. I don’t have much influence in either community (which is appropriate; I haven’t earned it). I’ll raise my voice, for what that’s worth, to argue that getting Unicode really right is a necessary condition for being a technology provider in the third millennium, and may prove to be sufficient, insofar as internationalized-text issues go. I’m not optimistic that this will make any difference. But if either community decides to give Unicode a serious go, I’ll volunteer to pitch in, to work to make it work.
· Back in August 2004, I wrote a piece comparing Perl and Java regex performance, observing that, to my surprise, apparently Java was way faster on what I thought was a pretty common task. Last month, Ben Tilly wrote me saying that Perl consciously accepted a regex slowdown to route around a pathological case where search time could explode to infinity. I asked him to write it up and promised to point to it, and he has. If you care about this kind of thing, read Ben’s piece and don’t miss the comments, which are interesting. Summary: the jury’s still out. See also: Open-Source Regex.
· At some point in the transition to Debian Sarge, something broke in the ongoing software. The Perl code reads text using an XML processor and various pieces of it get stashed in a MySQL database. Only somewhere along the line, non-ASCII UTF-8 characters were getting trashed. I tried all sorts of stupid dodges, and was whining away at Sam Ruby via instant messenger, and he said “of course, you could do it all as seven-bit ASCII via &#xBABE;... or you could rewrite it in Ruby and It Would Be Much Better”. I shrieked “Get thee behind me foul tempter!” and have now jammed everything into 7-bit ASCII as it comes out of the XML parser, and of course all the problems have gone away. Actually, the code got simpler, lots of XML escaping/unescaping calls are no longer necessary. This is one of the nice things about XML I guess, it allows you to be a good internationalization citizen even when your software infrastructure isn’t. It still feels evil. Anyhow, the whole site’s been republished, let me know if anything’s busted. (By the way, if you’re reading this in my RSS feed and all the entries show up as new, switch to the Atom feed and that problem will go away, because Atom actually has unique IDs and datestamps that work.) [Updated: Tony Coates (interesting new blog there, BTW) reports that Opera 8.02 gets it backwards, which means that it’s one of the rare pieces of software that respects guids in RSS, but that it’s doing Atom 1.0 wrong.]
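For readers who want to see the seven-bit dodge in concrete form, here is a minimal sketch in Java, not the Perl that ongoing actually runs. The class and method names (`AsciiEscaper`, `toAscii`) are invented for the illustration, and it assumes that markup characters such as `&` and `<` are escaped elsewhere. Every code point above U+007F becomes an XML numeric character reference, so whatever sits downstream only ever sees ASCII.

```java
import java.util.Locale;

final class AsciiEscaper {
    // Replace each code point above U+007F with a hexadecimal XML
    // numeric character reference; ASCII passes through untouched.
    // Assumption: '&' and '<' have already been escaped by the caller.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Martin D&#xFC;rst
        System.out.println(toAscii("Martin D\u00FCrst"));
    }
}
```

Walking the string by code point rather than by `char` matters here: a character outside the Basic Multilingual Plane comes out as one reference, not as two broken surrogate halves.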
Text Encoding Progress
· It’s good to see the IETF showing forward motion on the vital issues around how to store text efficiently; check out the brand-new RFC4042 on UTF-9 and UTF-18. Good stuff.
· A few days ago I wrote a little report on regular-expression performance; it drew a surprising amount of feedback, including one piece that throws an interesting sidelight on the trade-offs around Java and Open Source ...
Java Regex Wrangling
· I needed a quick and dirty tokenizer for a big chunk of XML-ish text to feed into some Java code so I was going to fire up Perl, then I remembered that modern Java comes with its own regular-expression library. Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran over a 100M file, finding fifteen million tokens, in about three minutes of CPU time on my 1.25GHz PowerBook. Quite respectable, but, I thought with a snicker, I bet Perl can beat that. (Perl’s regex engine is generally regarded as the state of the art.) So I whacked together a Perl version and, just to make sure I was getting the right answers, I had both the Java and Perl versions print out all the tokens they found. They both burned something over ten minutes, and Perl was maybe 10% faster; might have been the I/O or other static. I was impressed to find Java within 10% of the best. So then I ran it again without the output, just counting the tokens, and yowie zowie, Perl was at 8 minutes 47 seconds, Java back at 3 minutes 4 seconds. So I re-ran on a nearby Debian box, on the theory that the OS X versions of Java and Perl might not be representative of their kind. There are all sorts of variations around I/O and so on, but my finding is that for this problem, the Java 1.4.2 regex processing is somewhere around twice as fast as Perl 5.8.1. Frankly, I’m astounded. Read on for acknowledgements, some gory details, and a tasteful selection of Google ads for regular expression software. [Update: There is a good reason things are the way they are, and Perl’s trade-off may well be better.] ...
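The counting-only run described above has roughly the following shape in Java. This is a hedged sketch, not the code that was timed: the token pattern is a guess at what "XML-ish" tokenizing might look like, and `TokenCounter` and `countTokens` are names invented for the example. It uses only `java.util.regex`, which has shipped with Java since 1.4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenCounter {
    // Hypothetical token pattern: either a tag-like run "<...>",
    // or a run of characters containing no '<' and no whitespace.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<\\s]+");

    // Count matches without materializing or printing them, which is
    // the variant where the two engines diverged most in the timings.
    static int countTokens(CharSequence input) {
        Matcher m = TOKEN.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints: 6
        System.out.println(countTokens("<p>Hello, <b>world</b></p>"));
    }
}
```

Compiling the pattern once and reusing it matters for a loop over a 100M input; what the sketch cannot show is the I/O side, which the summary itself flags as a source of noise.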
· Articles in this space have introduced Unicode, discussed how it is processed by computers, and argued that Java's primitives are less than ideal for heavy text processing. To explore this further, I've been writing a Java class called Ustr for “Unicode String,” pronounced “Yooster.” The design goals are correct Unicode semantics, support for as much of the Java String API as reasonable, and support for the familiar, efficient null-terminated byte array machinery from C ...
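To illustrate what "correct Unicode semantics" is up against, here is a small sketch of the gap between Java's `char` and a Unicode character. It is not taken from Ustr; `CodePointDemo` and its methods are made up for the example, and it leans on the code-point methods that `java.lang.String` gained in Java 5.

```java
final class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is what most people mean
    // by "characters".
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1D54F (outside the Basic Multilingual Plane,
        // stored as a surrogate pair), then 'b'.
        String s = "a\uD835\uDD4Fb";
        System.out.println(codeUnits(s));   // prints: 4
        System.out.println(codePoints(s));  // prints: 3
    }
}
```

Any index-based operation on such a string, `charAt` or `substring` for instance, can land in the middle of the surrogate pair, which is the sort of thing a purpose-built Unicode string class would be designed to rule out.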
Programming Languages and Text
· Welcome to another installment in ongoing's ongoing tour through text-processing issues. This one is about programming-language support, and while it makes specific reference to Java, it tries to be generally applicable to modern software environments. The conclusion is that Java is OK for some kinds of text processing, but has real problems when the lifting gets heavy ...
Characters vs. Bytes
· This is the first of a three-part essay on modern character string processing for computer programmers. Here I explain and illustrate the methods for storing Unicode characters in byte sequences in computers, and discuss their advantages and disadvantages. These methods have well-known names like UTF-8 and UTF-16 ...
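As a quick illustration of the storage trade-off, the sketch below encodes the same strings as UTF-8 and as UTF-16 and compares byte counts. The class name `EncodingSizes` and its helpers are invented for the example; the charsets come from the standard `java.nio.charset.StandardCharsets`, and `UTF_16BE` is used so that no byte-order mark inflates the count.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Bytes needed to store the string as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the string as big-endian UTF-16, no BOM.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String name = "D\u00FCrst"; // "Dürst": four ASCII letters plus U+00FC
        System.out.println(utf8Length(name));   // prints: 6
        System.out.println(utf16Length(name));  // prints: 10
    }
}
```

Mostly-ASCII text is smaller in UTF-8, since each ASCII character takes one byte and U+00FC takes two, while UTF-16 spends two bytes on every one of these characters. For text dominated by, say, CJK characters the comparison tilts the other way, with three bytes per character in UTF-8 against two in UTF-16.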
On Character Strings
· In the recent essay on Unicode, I promised to say more about UTF-8 and UTF-16. Which is still a good idea, but a wonderful article by Paul Graham, The Hundred-Year Language, got me thinking about character strings in general, and how many different ways there are of approaching them ...
On the Goodness of Unicode
· Quite a few software professionals have learned that they need to worry about internationalizing software, and some of those have learned how to go about doing it. For those getting started, herewith a brief introduction to Unicode, the one technology that you have to get comfortable with if you're going to do a good job as a software citizen of the world ...
By Tim Bray
I am an employee of Amazon.com, but the opinions expressed here are my own, and no other party necessarily agrees with them.
A full disclosure of my professional interests is on the author page.