In the recent essay on Unicode, I promised to say more about UTF-8 and UTF-16. Which is still a good idea, but a wonderful article by Paul Graham, The Hundred-Year Language, got me thinking about character strings in general, and how many different ways there are of approaching them.
I'll quote Graham's remarks on strings, but if you haven't already read that article I strongly recommend doing so:
Most data structures exist because of speed. For example, many languages today have both strings and lists. Semantically, strings are more or less a subset of lists in which the elements are characters. So why do you need a separate data type? You don't, really. Strings only exist for efficiency. But it's lame to clutter up the semantics of the language with hacks to make programs run faster. Having strings in a language seems to be a case of premature optimization.
I think Graham's 100% right, and I further think that the “right way” to do strings in programming languages is far from a solved problem. I'm not sure what the solution is, but here are some observations and assertions based on some decades in the text-processing coalmines.
Neolithic String Technology ·
My first serious programming tasks were in C (with a bit of
yacc) on early (V6) Unix systems.
Then I got out into the real world and went to work for DEC (R.I.P.) because
I thought the VAX was a cool computer, and got turned into a VMS expert.
The VMS geeks thought this notion of null-terminated strings was just
incredibly primitive, and Real Programmers did it with “descriptors”.
A String Descriptor was two words: one word pointing at the string,
one half-word giving its length, and the other half-word containing type
information; I seem to recall that sometimes this final half-word was
presented as two separate byte fields.
Apparently, for representing COBOL and BASIC and PL/I strings, this was just the ticket, and many of the VMS system calls wanted their string arguments by descriptor.
This sounds fine in principle, but I noticed that it seemed to be a lot more work to create and manipulate a string, and I really couldn't do anything that I couldn't have done with the old-fashioned C flavour. But obviously descriptors were more advanced technology than null-termination, with a layer of indirection yet, so they had to be better.
Then I changed jobs and got back into Unix and it seemed like handling strings was a lot less work, and I didn't seem to miss the layer of indirection. But I always felt vaguely guilty about using this primitive-feeling text-processing facility.
Over the years we all learned about buffer overflows and using
strncpy() rather than
strcpy() and other best
practices, but still, it all felt kind of primitive.
Everything's a Scalar · Moving into Perl was obviously like teleporting to another planet. Perl “scalars” (which really are very sophisticated pieces of technology, under the covers) can be strings or numbers or whatever you want, and anything you could possibly ever want to do with a string is already built into the language, and you never think about where the bytes are and what form they're in and so on.
Also, the strings can contain nulls and any old bit pattern you can cook up, and everything just sort of works. There are very few C programmers indeed who can write code that will outperform the equivalent Perl code for any task that is mostly string manipulation.
There's no free lunch of course; if you have any data structures to speak of, Perl burns memory like there's no tomorrow. Having said that, Perl's notion of a string seems organic and natural and about right.
Obviously, an XML parser has a significant proportion of its code
dedicated to string processing; in fact, just about everything you want to
do with strings in heavy industrial code is there in an XML processor.
I quickly came to hate the Java String and StringBuffer classes.
First of all, they were slow and expensive (I've heard that they've gotten
smarter and quicker).
Second, the methods just felt klunky.
I quote from the Java 1.3.1 reference documentation:
String buffers are used by the compiler to implement the binary string concatenation operator +. For example, the code:
x = "a" + 4 + "c"
is compiled to the equivalent of:
x = new StringBuffer().append("a").append(4).append("c").toString()
Maybe I'm missing something, but four method dispatches to create a three-character string feels a little, well, stupid.
The amusing thing is, the one time I peeked into source code for the
StringBuffer classes (I think this may
have been in a Microsoft implementation) it looked remarkably like VMS
“string descriptors”, circa 1985.
For what it's worth, Lark did almost all its processing with
char and even a few
byte arrays, and created
String objects only in the smallest numbers absolutely required,
and then only lazily, when requested by the client software.
And all these years later, it dawns on me that this is
probably a good general characteristic of an XML processing API: it
shouldn't force the processor to create
String objects until the
client software asks to see them.
So Why Do We Have String Objects Anyhow? · In the guts of Antarctica's Visual Net server, it's all done with null-terminated C strings, and it's all 100% Unicode-clean, it's very careful and paranoid about buffer overflows, the string-handling code is pretty clear and easy to read, and it runs very damn fast.
So all these years later, to get back to Paul Graham's remarks, I'm
inclined to think, as regards the notion of String objects
generally, that the emperor has no clothes.
Of course, I wouldn't really take that position to the extreme.
If I'm writing a big middleware application in which all the text is coming
out of database columns and being pumped out into HTTP streams, I have a
strong intuition that the performance problems and awkwardnesses induced by
String objects are going to get lost in the shuffle,
and maybe the encapsulation is a win.
And, as I said, for Perl, which is all about strings from the ground up, it makes all sorts of sense to have a native String-ish type. But for apps in a general-purpose programming language where the string processing is a big chunk of the workload, I think Mr. Graham is right; if I have characters, and I have arrays of characters, why would I want a separate construct for strings?