On Character Strings

In the recent essay on Unicode, I promised to say more about UTF-8 and UTF-16. That's still a good idea, but a wonderful article by Paul Graham, The Hundred-Year Language, got me thinking about character strings in general, and how many different ways there are of approaching them.

I'll quote Graham's remarks on strings, but if you haven't already read that article I strongly recommend doing so:

Most data structures exist because of speed. For example, many languages today have both strings and lists. Semantically, strings are more or less a subset of lists in which the elements are characters. So why do you need a separate data type? You don't, really. Strings only exist for efficiency. But it's lame to clutter up the semantics of the language with hacks to make programs run faster. Having strings in a language seems to be a case of premature optimization.

I think Graham's 100% right, and I further think that the “right way” to do strings in programming languages is far from a solved problem. I'm not sure what the solution is, but here are some observations and assertions based on some decades in the text-processing coalmines.

Neolithic String Technology · My first serious programming tasks were in C (with a bit of lex and yacc) on early (V6) Unix systems. Then I got out into the real world and went to work for DEC (R.I.P.) because I thought the VAX was a cool computer, and got turned into a VMS expert. The VMS geeks thought this notion of null-terminated strings was just incredibly primitive, and Real Programmers did it with “descriptors”. A String Descriptor was two words, one word pointing at the string, one half-word giving its length, and the other half-word containing type information - I seem to recall that sometimes this final half-word was presented as two separate byte fields.

Apparently, for representing COBOL and BASIC and PL/I strings, this was just the ticket, and many of the VMS system calls wanted their string arguments by descriptor.

This sounds fine in principle, but I noticed that it seemed to be a lot more work to create and manipulate a string, and I really couldn't do anything that I couldn't do with the old-fashioned C flavour. But obviously descriptors were more advanced technology than null-termination, with a layer of indirection yet, so they had to be better.
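For readers who never met one, here is a minimal C sketch of the descriptor layout described above, set beside a plain null-terminated string. The type, field, and function names are invented for illustration and are not the real VMS definitions; "word" follows the essay's usage, and on a modern 64-bit machine the pointer would of course be wider than that.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout modelled on the essay's description: a pointer,
   a half-word length, and a half-word of type information shown here
   as two separate byte fields. Names are assumptions, not VMS's own. */
typedef struct {
    uint16_t length;      /* half-word: number of bytes in the string */
    uint8_t dtype;        /* byte: data type code (unused in this sketch) */
    uint8_t klass;        /* byte: descriptor class code (unused here) */
    const char *pointer;  /* address of the first byte of the string */
} string_descriptor;

/* Build a descriptor over existing bytes; nothing is copied. */
string_descriptor describe(const char *bytes, uint16_t length) {
    assert(bytes != NULL);
    string_descriptor d = { length, 0, 0, bytes };
    return d;
}

/* Compare a described string with a null-terminated C string.
   Returns 1 when lengths and bytes match, 0 otherwise. */
int descriptor_equals_cstr(const string_descriptor *d, const char *s) {
    size_t n = strlen(s);
    return d->length == n && memcmp(d->pointer, s, n) == 0;
}
```

The extra bookkeeping is visible even at this size: every string needs a descriptor built and passed around. What it buys is an explicit length, so a described string can hold embedded null bytes, which `strlen()` on the same bytes would treat as the end.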

Then I changed jobs and got back into Unix and it seemed like handling strings was a lot less work, and I didn't seem to miss the layer of indirection. But I always felt vaguely guilty about using this primitive-feeling text-processing facility.

Over the years we all learned about buffer overflows and using strncpy() rather than strcpy() and other best practices, but still, it all felt kind of primitive.
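As a reminder of why even the "best practice" felt primitive: strncpy() stops writing at the size you give it, but it does not add a terminator when the source is too long, so the caller still has to do that by hand. A minimal sketch, where `bounded_copy` is a hypothetical helper name and not a standard function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, which has room for dst_size
   bytes, always leaving dst null-terminated.
   Returns 1 if all of src fit, 0 if it had to be truncated. */
int bounded_copy(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL && dst_size > 0);
    /* Copy at most dst_size - 1 bytes, keeping one byte in reserve. */
    strncpy(dst, src, dst_size - 1);
    /* strncpy() does not terminate on truncation, so do it explicitly. */
    dst[dst_size - 1] = '\0';
    return strlen(src) < dst_size;
}
```

Copying "abcdef" into a four-byte buffer with this helper leaves "abc" plus the terminator and reports truncation, where strcpy() would have written past the end of the buffer.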

Everything's a Scalar · Moving into Perl was obviously like teleporting to another planet. Perl “scalars” (which really are very sophisticated pieces of technology, under the covers) can be strings or numbers or whatever you want, and anything you could possibly ever want to do with a string is already built into the language, and you never think about where the bytes are and what form they're in and so on.

Also, the strings can contain nulls and any old bit pattern you can cook up, and everything just sort of works. There are very few C programmers indeed who can construct code that will outperform the equivalent Perl code for any task that is mostly string manipulation.

There's no free lunch of course; if you have any data structures to speak of, Perl burns memory like there's no tomorrow. Having said that, Perl's notion of a string seems organic and natural and about right.

Object Orientation <sob> · I first went to the mat with Java in 1997, the project being Lark, which was the world's first XML parser. It was used in several different production systems, and, I claim, if I cared to maintain it (and thus compete with IBM, Microsoft, and James Clark), it would still be the world's fastest Java-language parser, but I don't, so it's been untouched since 1998.

Obviously, an XML parser has a significant proportion of its code dedicated to string processing; in fact, just about everything you want to do with strings in heavy industrial code is there in an XML processor. I quickly came to hate the Java String and StringBuffer classes. First of all, they were slow and expensive (I've heard that they've gotten smarter and quicker). Second, the methods just felt klunky. I quote from the Java 1.3.1 reference documentation:

String buffers are used by the compiler to implement the binary string concatenation operator +. For example, the code:

x = "a" + 4 + "c"

is compiled to the equivalent of:

x = new StringBuffer().append("a").append(4).append("c").toString()

Maybe I'm missing something, but four method dispatches to create a three-character string feels a little, well, stupid.

The amusing thing is, the one time I peeked into source code for the String and StringBuffer classes (I think this may have been in a Microsoft implementation) it looked remarkably like VMS “string descriptors”, circa 1985.

For what it's worth, Lark did almost all its processing with char arrays and even a few byte arrays, and created String objects in the smallest number absolutely required, and then only lazily when requested by the client software.

And all these years later, it dawns on me that this is something that is probably a good general characteristic of an XML processing API, that it shouldn't force the processor to create String objects until the client software asks to see them.
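The lazy idea carries over to any language. Lark was Java, and the following is not its code; it is a small C analogue under invented names (`text_span`, `tag_name`, `span_to_string`), in which the parser hands out views into its input buffer and only builds a separate string when the client actually asks for one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token type: a view into the parser's input buffer. */
typedef struct {
    const char *start;  /* first byte of the token inside the input */
    size_t length;      /* number of bytes; no terminator is stored */
} text_span;

/* Locate the element name in a start tag such as "<title>".
   No allocation and no copying: the span points into the input. */
text_span tag_name(const char *tag) {
    assert(tag != NULL && tag[0] == '<');
    /* The name runs until the first space, '/', or '>'. */
    text_span span = { tag + 1, strcspn(tag + 1, " />") };
    return span;
}

/* Only when the client asks: copy the span into a fresh C string.
   The caller owns the result and must free() it.
   Returns NULL if the allocation fails. */
char *span_to_string(text_span span) {
    char *copy = malloc(span.length + 1);
    if (copy != NULL) {
        memcpy(copy, span.start, span.length);
        copy[span.length] = '\0';
    }
    return copy;
}
```

A client that only wants to compare names can work on the span directly with memcmp() and never pay for an allocation; the cost of `span_to_string` is incurred only by the callers that need an independent string.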

So Why Do We Have String Objects Anyhow? · In the guts of Antarctica's Visual Net server, it's all done with null-terminated C strings, and it's all 100% Unicode-clean, and it's all very clean and paranoid about buffer overflows, and the string-handling code is pretty clear and easy to read, and it runs very damn fast.

So all these years later, to get back to Paul Graham's remarks, I'm inclined to think, as regards the notion of String objects generally, that the emperor has no clothes.

Of course, I wouldn't really take that position to the extreme. If I'm writing a big middleware application in which all the text is coming out of database columns and being pumped out into HTTP streams, I have a strong intuition that the performance problems and awkwardnesses induced by String objects are going to get lost in the shuffle, and maybe the encapsulation is a win.

And, as I said, for Perl, which is all about strings from the ground up, it makes all sorts of sense to have a native String-ish type. But for apps in a general-purpose programming language where the string processing is a big chunk of the workload, I think Mr. Graham is right; if I have characters, and I have arrays of characters, why would I want a separate construct for strings?

April 13, 2003

By Tim Bray.
