Articles in this space have introduced Unicode, discussed how it is processed by computers, and argued that Java's primitives are less than ideal for heavy text processing. To explore this further, I've been writing a Java class called Ustr for “Unicode String,” pronounced “Yooster.” The design goals are correct Unicode semantics, support for as much of the Java String API as reasonable, and support for the familiar, efficient null-terminated byte array machinery from C.

Here goes:

package com.textuality;
public class Hello
{
  public static void main (String[] args)
  {
    // room for 13 UTF-8 bytes
    Ustr message = new Ustr(13);

    // construct Ustr from String
    Ustr hello = new Ustr("Hello");

    // blast it into message

    // append a character given as an integer
    message.appendChar((int) ' ');

    // construct Ustr from some integers
    int [] wints = { 'w', 'o', 'r', 'l', 'd' };
    Ustr world = new Ustr(wints);

    // stick it on the end, using byte ops
    Ustr.strcat(message.s, world.s);

    // there's no room in the buffer for all these bangs
    Ustr bangs = new Ustr("!!!!!!!!!!!!!!!!!!!!!");

    // damn the torpedoes, we have safe methods (note extra 's')

    // it would be more stylish to do this in Hebrew or Korean
  }
}

I've got enough of it working to start sharing it with the world. I can't post the code till I sort out the copyright—anything I produce belongs to Antarctica, but since there's no money in this, the best thing for the world and the company, in the event there's any interest, is to publish it under some sane OSS license.

Also, I'm a little reluctant to publish code until there's been a bit of feedback, because I haven't programmed Java professionally for a few years and it's quite possible that the interface reveals my profound ignorance of some new trick or approach that I should fix. But I promise to get it out there in the next seven days.

So here's the Javadoc, as a basis for discussion. It provides enough info for anyone competent to implement Ustr or (hint hint) the C# equivalent. It only took me a couple of weeks, working an hour here and an hour there, mostly late at night. There are just under 1500 lines of Java (at least half Javadoc bloat; that's OK, Javadoc bloat is a good thing) and the class file is 11K.

How It Works · A Ustr is a thin wrapper around a null-terminated UTF-8 byte sequence. Nothing is private; the byte array s and the start of the sequence, base, are both public fields. You could allocate a really big byte array and have lots of different Ustrs in it. There's one more field called offset, which is used for stepping through the characters embedded in the UTF-8.
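To make that layout concrete, here is a minimal sketch of several null-terminated UTF-8 strings sharing one byte array. It is not the Ustr source: the class name UstrLayoutSketch and the put() helper are invented for illustration, and treating offset as an absolute index into s is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the layout described above; NOT the real Ustr.
final class UstrLayoutSketch {
    public byte[] s;    // backing buffer, possibly shared with other instances
    public int base;    // index in s where this string's bytes start
    public int offset;  // cursor for stepping through characters (assumed absolute)

    UstrLayoutSketch(byte[] s, int base) {
        this.s = s;
        this.base = base;
        this.offset = base;
    }

    // Number of bytes before the terminating zero, like C's strlen().
    int strlen() {
        int i = base;
        while (s[i] != 0) i++;
        return i - base;
    }

    // Illustrative helper: writes text as UTF-8 at index 'at', adds the
    // null terminator, and returns a view onto that region of the pool.
    static UstrLayoutSketch put(byte[] pool, int at, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(utf8, 0, pool, at, utf8.length);
        pool[at + utf8.length] = 0;
        return new UstrLayoutSketch(pool, at);
    }

    public static void main(String[] args) {
        byte[] pool = new byte[64];
        UstrLayoutSketch a = put(pool, 0, "Hello");
        // Start the second string just past the first one's terminator.
        UstrLayoutSketch b = put(pool, a.strlen() + 1, "w\u00f6rld");
        System.out.println(a.strlen()); // 5
        System.out.println(b.strlen()); // 6, because U+00F6 takes two UTF-8 bytes
    }
}
```

Both objects point at the same array; only base differs, which is what lets one big allocation hold many strings.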

I went to some effort to get the Unicode right; there are methods called appendChar() and nextChar() that store and retrieve Unicode characters (as integers) from the UTF-8, and use that offset field to work through the text in a natural way. Also there are constructors and generators for integer arrays and Java's kind-of-UTF16 String thingies.
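For readers who haven't written a UTF-8 codec by hand, the following is a rough sketch of what methods like appendChar() and nextChar() have to do. The class name Utf8CursorSketch is made up, the behavior at the terminator (returning 0) is an assumption, and there is no validation of malformed input, so treat it as an illustration of the bit-twiddling rather than the actual Ustr code.

```java
// Hypothetical sketch of storing and retrieving code points in a
// null-terminated UTF-8 buffer; not taken from Ustr itself.
final class Utf8CursorSketch {
    final byte[] s;  // null-terminated UTF-8 bytes
    int offset = 0;  // index of the next byte nextChar() will read

    Utf8CursorSketch(int capacity) {
        s = new byte[capacity + 1]; // one extra byte for the terminator
    }

    // Encodes code point c after the existing text and re-terminates.
    void appendChar(int c) {
        int i = 0;
        while (s[i] != 0) i++; // find the current terminator
        if (c < 0x80) {
            s[i++] = (byte) c;
        } else if (c < 0x800) {
            s[i++] = (byte) (0xC0 | (c >> 6));
            s[i++] = (byte) (0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            s[i++] = (byte) (0xE0 | (c >> 12));
            s[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            s[i++] = (byte) (0x80 | (c & 0x3F));
        } else {
            s[i++] = (byte) (0xF0 | (c >> 18));
            s[i++] = (byte) (0x80 | ((c >> 12) & 0x3F));
            s[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            s[i++] = (byte) (0x80 | (c & 0x3F));
        }
        s[i] = 0;
    }

    // Decodes the code point at offset and advances past it.
    // Returns 0 at the terminator (an assumption of this sketch).
    int nextChar() {
        int b = s[offset] & 0xFF;
        if (b == 0) return 0;
        offset++;
        if (b < 0x80) return b;
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1; // continuation bytes
        int c = b & (0x3F >> extra);                       // payload bits of lead byte
        for (int k = 0; k < extra; k++) {
            c = (c << 6) | (s[offset++] & 0x3F);
        }
        return c;
    }

    public static void main(String[] args) {
        Utf8CursorSketch u = new Utf8CursorSketch(16);
        u.appendChar('H');
        u.appendChar(0x4E2D);  // a CJK character, three bytes
        u.appendChar(0x1F600); // outside the BMP, four bytes
        for (int c = u.nextChar(); c != 0; c = u.nextChar()) {
            System.out.println(Integer.toHexString(c));
        }
    }
}
```

The caller only ever sees whole code points as integers; the one-to-four-byte bookkeeping stays inside the two methods.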

I'm thinking about making a strong claim: with some more polishing, this package, in the hands of someone who knows what they're doing, is going to be both more correct and more efficient for doing heavy-lifting text processing than what comes with Java.

strcpy() and Friends · Once you have null-terminated byte arrays, why shouldn't you be able to make like a classic C programmer? So Ustr has strcat(), strcmp(), strcpy(), strlen(), strstr(), strchr(), and strrchr(). They all operate on bytes, not Unicode characters.

There are lots more where that came from (strspn(), strtok(), etc etc ad nauseam); I just implemented the ones I've actually used regularly over the years. One exception: I implemented strchr() and strrchr() instead of the slightly-more-idiomatic index() and rindex() because I wanted the term “index” to always mean counts of Unicode characters, not bytes.

Each comes in at least two flavors: first, a nice modern object-oriented version, so if you have a Ustr named ustr, you can say things like ustr.strcmp(other_ustr). Second, there are down-to-the-metal static functions that just pump the contents of byte arrays back and forth: Ustr.strcpy(to, from). One of the reasons the class is Ustr rather than ExcellentPoMoUnicodeString is that Ustr.strcpy() is less typing (snicker).
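The static flavor is easy to picture. Below is a sketch of how such byte-pumping functions might look in Java; the class name ByteStrSketch and the cstr() helper are invented here, and the return types are assumptions modeled on the C originals rather than on the Ustr Javadoc.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical C-style routines over null-terminated byte arrays.
final class ByteStrSketch {
    // Illustrative helper: a String as null-terminated UTF-8 bytes.
    static byte[] cstr(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1); // trailing byte is 0
    }

    // Counts bytes up to, but not including, the terminator.
    static int strlen(byte[] s) {
        int n = 0;
        while (s[n] != 0) n++;
        return n;
    }

    // Copies from into to, terminator included. No bounds care is taken:
    // an undersized target throws ArrayIndexOutOfBoundsException.
    static byte[] strcpy(byte[] to, byte[] from) {
        int i = 0;
        do {
            to[i] = from[i];
        } while (from[i++] != 0);
        return to;
    }

    // Appends from to the end of the text already in to.
    static byte[] strcat(byte[] to, byte[] from) {
        int t = strlen(to);
        int f = 0;
        do {
            to[t++] = from[f];
        } while (from[f++] != 0);
        return to;
    }

    // Compares byte by byte, treating bytes as unsigned. For valid UTF-8
    // this ordering matches code point order.
    static int strcmp(byte[] a, byte[] b) {
        int i = 0;
        while (a[i] != 0 && a[i] == b[i]) i++;
        return (a[i] & 0xFF) - (b[i] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        strcpy(buf, cstr("Hello"));
        strcat(buf, cstr(" world"));
        System.out.println(strlen(buf));                     // 11
        System.out.println(strcmp(buf, cstr("Hello world"))); // 0
    }
}
```

An object-oriented flavor would presumably just hand its own byte array to routines like these.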

I haven't done the “n” variants (strncpy() etc) yet, because for anything that copies data I made a safe version, e.g. for strcpy() there's sstrcpy() (note the extra “s”) which efficiently makes sure you don't overrun the target buffer by catching ArrayIndexOutOfBoundsException.

Which is fine, but I think there's a good case for strncpy() and friends anyhow.
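Here is one way the catch-the-exception trick behind a safe copy could work. This is a guess at the approach, not the Ustr implementation: the class name SafeCopySketch is hypothetical, and what happens on overflow (truncate, then back up so no UTF-8 character is cut in half) is a design choice of this sketch that the real sstrcpy() may not share.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical safe copy: let the JVM's bounds check detect overflow,
// then repair the target so it is still a valid null-terminated string.
final class SafeCopySketch {
    // Illustrative helper: a String as null-terminated UTF-8 bytes.
    static byte[] cstr(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    static int strlen(byte[] s) {
        int n = 0;
        while (s[n] != 0) n++;
        return n;
    }

    static byte[] sstrcpy(byte[] to, byte[] from) {
        if (to.length == 0) return to; // nowhere to put even a terminator
        int i = 0;
        try {
            // The fast path has no explicit length checks at all.
            do {
                to[i] = from[i];
            } while (from[i++] != 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Ran off the end. Terminate at the last slot, then back up
            // over continuation bytes (10xxxxxx) so that no character is
            // left partially copied. This policy is an assumption.
            int p = Math.min(i, to.length - 1);
            while (p > 0 && p < from.length && (from[p] & 0xC0) == 0x80) p--;
            to[p] = 0;
        }
        return to;
    }

    public static void main(String[] args) {
        byte[] small = new byte[4];
        sstrcpy(small, cstr("Hello"));
        System.out.println(strlen(small)); // 3: "Hel" plus the terminator
    }
}
```

The appeal is that the common case pays nothing extra, since Java performs the bounds check regardless; the cost only shows up when the copy actually overflows.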

The java.lang.String Family · Pretty well all of String that's not actually pernicious is in Ustr. The constructors are a bit different, but the other methods are there, except for anything that involves case-folding (arguably wrong and empirically horribly expensive in Unicode) and all the valueOf() stuff. Also I started implementing trim() by calling out to the String version, decided that what that version does is horribly, unsalvageably wrong, and postponed it, because doing this properly with respect to the Unicode tables will take a bit of work.

The other difference is that methods such as charAt() and substring() operate correctly in terms of Unicode characters.
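The difference is visible as soon as a character outside the Basic Multilingual Plane shows up. The sketch below contrasts java.lang.String's UTF-16 indexing with a character-counting lookup over UTF-8 bytes; CodePointIndexSketch and its charAt() are hypothetical, written only to illustrate the idea, and assume well-formed input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical character-indexed lookup over null-terminated UTF-8.
final class CodePointIndexSketch {
    // Illustrative helper: a String as null-terminated UTF-8 bytes.
    static byte[] cstr(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    // Returns the index-th Unicode character, counting characters, not bytes.
    static int charAt(byte[] s, int index) {
        int i = 0;
        for (int seen = 0; s[i] != 0; seen++) {
            int b = s[i] & 0xFF;
            // The lead byte says how many bytes this character occupies.
            int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            if (seen == index) {
                return new String(s, i, len, StandardCharsets.UTF_8).codePointAt(0);
            }
            i += len;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (written as a surrogate pair), then 'b'.
        String java = "a\uD83D\uDE00b";
        byte[] utf8 = cstr(java);

        System.out.println(java.length());                         // 4 UTF-16 units
        System.out.println(Integer.toHexString(java.charAt(1)));   // d83d, half a character
        System.out.println(Integer.toHexString(charAt(utf8, 1)));  // 1f600, the whole character
        System.out.println((char) charAt(utf8, 2));                // b
    }
}
```

With String, index 1 lands on a lone high surrogate and 'b' sits at index 3; counting characters puts the emoji at 1 and 'b' at 2.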

I even did intern() which is kind of questionable since a Ustr is mutable, but as noted, this is a tool for experts.

Implementation Notes · The implementation (while fairly well-tested) is entirely unoptimized; it's all done with the bare minimum amount of code. This, I would argue, is entirely correct in a first cut. I can imagine all sorts of optimizations that my intuition tells me would make it run a lot faster (I note that in GNU libc the str*() routines are largely in assembler), but I say “Get thee behind me, intuition!”

It should be emphasized that this is a power tool; pumping bytes around like this is efficient, but it relies on you to do null-termination and allocate enough space and all that good stuff. So if you're not doing heavy lifting, Java's built-in String is probably a much better choice.

TDD Again · This is my first outing with JUnit in hand and an aggressive TDD approach. JUnit rocks, and TDD is programmers' crack of the highest purity. There is no going back, nosirree. (Mind you, on the Mac, JUnit or its Swing wrapper has a busy loop of some sort and burns CPU steadily. But still.)

So there's a TestUstr harness with 1500 or so lines of code and 26 test functions with 134 assertions last time I counted. At that, I don't think the harness is as complete as Kent Beck would like, and could well be expanded. As long as this remains my baby the policy will be that nothing goes into the class without a corresponding test in the harness.

Plans? · I don't know. There's a kind of low-level geek thrill in implementing strcpy() and friends, and there are subtleties to some of these calls that I hadn't appreciated. It would appear that I wasn't quite as comprehensively on top of text-processing issues as I maybe thought I was, and this is worth doing if only to have learned that.

In the unlikely event that other people want to take this seriously and use it or work with it, I'd be perfectly happy to park it on sourceforge and run it for a while. I can't see it being much work.

For the first time, I regret not having a comments feature for ongoing, but I'm just not going to have time to write that code. Is this a job for Yahoo groups or some such? Advice on hosting discussions would be received with gratitude; Google and five seconds will get you my email address.

And if I don't get any mail, this has all been a pleasant experiment, and so far, I have failed to falsify my hypotheses about string processing in high-level languages.

May 17, 2003