Programming Languages and Text

Welcome to another installment in ongoing's ongoing tour through text-processing issues. This one is about programming-language support, and while it makes specific reference to Java, tries to be generally applicable to modern software environments. The conclusion is that Java is OK for some kinds of text processing, but has real problems when the lifting gets heavy.

Last time out I said this was going to be a three-part essay, but now I realize I'd already written two other text-processing-centric pieces before that, one an intro to Unicode, and the other entitled On Character Strings. The present essay will recapitulate some of the material in that second note, but no matter how you cut it, we're already (to quote Douglas Adams) on volume four of the trilogy. To make it worse, I'm gestating some essays on full-text-search, so we'll just call it a continuing series.

I had also said that I was going to discuss C# to a lesser degree, and I thought I could probably get away with that simply because C# is so amazingly like Java, but I found the language and library documentation so hard to find and navigate on the Microsoft site that I abandoned that project. I suspect that some of the remarks will apply, and if someone wants to follow-up on the differences (if any), I'll point to or reproduce such material.

I think that most text processing falls into one of two buckets, which I'll illustrate with a story.

You Look Inside? · I'm remembering a meeting that took place sometime in '95 or '96. My company, Open Text, had just acquired a company called Odesta. We were mostly a full-text-search company, Odesta was mostly a pure document-management outfit.

Shortly after the acquisition, we had a technical summit of the senior engineers from the two companies to explain to each other what our stuff did and figure out how to move forward. The Odesta guys talked first. Their Doc Man stuff was pretty cool, it had a RDBMS of interesting metadata about each object checked into and out of the system, and the software could slice and dice this many different ways, had a very sophisticated security model, and so on. We were up next, and our head search-algorithm guy was talking about tokenizing and inflexion processing and index update and so on, when one of the Odesta guys put up his hand with a question and said in a slow Southern-accented voice: “Hold on... you mean you guys look inside the documents?!?” At which point a silence fell in the room as we all had an “Aha!” moment.

Classes of Text Processing · I think that story illustrates the two main classes of text processing software, which for the purposes of this essay I'll call “light” and “heavy”.

Light Text Processing · This is what most people do most of the time. You get some text, from a file or more likely via a database for example:

SELECT uri, description FROM Essays WHERE age < 24 * 60 * 60;

You keep the text around in String object and maybe you intern it for efficiency and use it as a hash key and eventually jam it into an HTML template or send it off in a SOAP message or update another table. What you don't do is invest many lines of code in reaching into that chunk of text and fiddling around with characters, rather you take what you're given and route it around.

An example is the code that generates ongoing, which takes the chunks of text emitted by the XML parser, sticks them into an RDBMS, wraps XHTML tags around them and writes them into the output files.

Heavy Text Processing · This is the province of search engines and parsers and semantic analyzers and browsers and rendering engines and word processors.

Such software lives by “looking inside the document”, disassembling, searching, sorting, transforming, and reconstructing strings of characters. Its needs are qualitatively different from the light processing world.

An example is the Expat XML parser and the Perl regular expression engine that the ongoing software uses to do all the heavy lifting.

Java String is Lightweight · Before I go any further, I should note that my remarks on Java are in the context of the “Java 2 Platform Std. Ed. v1.3.1”, which is what Apple packages up with the Macintosh.

Anyone who's tried to do heavy lifting with Java's String class is rolling their eyes at this section's headline, but what I mean to say is that String is well-suited to the needs of light text processing:

It is immutable, so you can intern it and do very lightweight comparison
It comes with lots of handy built-in methods for indexing, prefix and suffix testing, and so on.
You can feed it directly to input-output routines.
It can pass Unicode text through without damaging it.
Every programmer in the universe knows how to deal with it.

I'm less friendly to the mutable StringBuffer class. It has all the methods you could want for twiddling characters, but they seem to think that successive append method dispatches are a good way to build up character strings, and if your lifting gets heavy at all this collection of blunt-edged tools is apt to feel like trying to play a Stradivarius with welding gloves on.

Still, most light text processing is going to be in applications where the costs are dominated by database, UI, and network overhead, so this is probably not too bad.

But if you're going to be doing any heavy text processing, you may find that Java's String and char primitives are part of the problem, not part of the solution.

Character Support · In Java, characters are represented by the char data type, which is claimed to be a “16-bit Unicode character”. Unfortunately, as I pointed out recently, there really is no such thing. To be precise, a Java char represents a UTF-16 code point, which may represent a character or may, via the surrogate mechanism, represent only half a character. The consequence of this is that the following methods of the String class can produce results that are incorrect: charAt, getChars, indexOf, lastIndexOf, length, and substring. Of course, if you are really sure that you will never have to deal with an “astral-plane” character, to the point of being willing to accept that your software will break messily if one shows up, you can pretend that these errors can't happen.

To me, this feels just like deciding that you'll hever have to deal with more than 64K of memory, or a database bigger than 32 bits in size, or a date after December 31, 1999. What Hunter S. Thompson would call “bad craziness.” I'll settle for “shortsighted.”

But still probably OK for light processing, where you're not counting on looking inside those String objects.

I note that Java sorta kinda recognizes the problem; the indexOf and lastIndexOf methods of String (correctly) use an int to hold the character argument, but the replace method still uses char.

One minor problem with Java should be noted; DataInputStream and DataOutputStream have readUTF and writeUTF methods which do not in fact read or write any of the recognized Unicode encoding formats, but that's OK, because InputStreamReader and OutputStreamWriter handle them just fine, just don't get confused by the names.

Fortunately, Those Who Control Java have noticed the UTF-16 problem, and there is work in progress to address it. I am a bit puzzled, though, by the remark in the above-noted document that most of the changes will take place in java.lang.Character. Now admittedly this is a class which I've never found occasion to use, but it seems to me the real breakage is on the String front. In any case, it's good that they're addressing the issue.

Characters Considered Harmful · One of the key principles of doing heavy text processing in an internationalized environment is that it's almost always wrong to work directly with individual characters, especially at the API level. I'll cover this in brief, but for those working in this area, it would be good to spend some quality time with a W3C Working Draft entitled Character Model for the World Wide Web. This is not light reading, but it's important stuff. It highlights the following problems with characters as units of work:

Characters do not correspond one-to-one with the sounds of language (examples: “ch,” “ng”).
Characters do not correspond one-to-one with units of displayed text (examples: combining diacritics).
Characters do not correspond one-to-one with units of input (examples: Chinese, Japanese, Korean).
Characters do not correspone one-to-one with units of physical storage (examples: UTF-8, UTF-16).

I'm not going to lay all this out in detail, but the conclusion is sound: if there's a place in your API that takes a “character” as input or provides one as output, you're probably looking at a design error.

Strings or Character Arrays or Byte Arrays? · In a previous essay inspired by some fine writing by Paul Graham, I considered the question he raised: do we need String data objects at all?

After all, a string is just a sequence of characters, and every programming language under the sun has arrays, so why do you need a separate object to package one of these up? Some of the most sophisticated text-processing software in the world, er make that most of the most sophisticated text-processing software in the world, is written in C, which gets along just fine with null-terminated byte arrays.

So for Java, should we abandon String and do all our work with char[] constructs? I don't think so, simply because I think the char primitive is just too deeply broken. Also, I want to use tricks like strcmp() and strncat(), beloved of grey-bearded Unix veterans.

Gasp! Surely he's not suggesting that we retreat to putting everything in a byte[] construct, and presumably revert to living in caves and courting women with clubs?!? Well no, because I am an object-oriented kinda guy, when I can get away with it. So, how do we get the heavy industrial machinery for doing superior text processing in modern languages without compromising their virtues? Stay tuned.

Acknowledgements · Grateful thanks to Rick Jelliffe, Miles Sabin, and Elliote Rusty Harold over on xml-dev who helped hash through some of these issues.

ongoing

What this is ·

Truth · Biz · Tech

author · Dad
colophon · rights

April 30, 2003
· Technology (90 fragments)
· · Coding (99 fragments)
· · · Text (12 more)

By Tim Bray.

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!