On the Goodness of Unicode

Quite a few software professionals have learned that they need to worry about internationalizing software, and some of those have learned how to go about doing it. For those getting started, herewith a brief introduction to Unicode, the one technology that you have to get comfortable with if you're going to do a good job as a software citizen of the world.

This essay:

Provides some reasons why you ought to worry about internationalization.
Provides some basic background on the world's writing systems.
Explains generally how characters and encodings and fonts and so on fit together.
Describes the history and politics of Unicode.
Describes the Unicode standard technically.
Offers some advice on the right things to do about all this.

Right: U+0024 DOLLAR SIGN

Why Should You Care? · Whether you're doing business or academic research or public service, you have to deal with people, and these days, it's quite likely that some of the people you want to deal with come from somewhere else, and you'll sometimes want to deal with them in their own language. And if your software is unable to collect, store, and display a name, an address, or a part description in Chinese, Bengali, or Greek, there's a good chance that this could become very painful very quickly.

There are a few organizations that as a matter of principle operate in one language only (the US Department of Defense, the Académie française), but as a proportion of the world, they shrink every year.

Right: U+05D4 HEBREW LETTER HE

If you're in the business of specifying, paying for, or building software, and you're not paying attention to this stuff, you're probably not doing your job. The good news is that doing the right thing isn't that difficult or that expensive.

Writing Systems · The number of human languages is much larger than the number of systems for writing them down, but the definitive reference on the subject, The World's Writing Systems (Peter T. Daniels and William Bright, eds.), still has 74 big sections, most of which discuss not one but a family of related writing systems. The Unicode system, which we'll discuss in depth, covers some three dozen different language-oriented character sets.

Right: U+0E12 THAI CHARACTER THO PHUTHAO

Many languages, of course, aren't written in our A-to-Z alphabet; in fact, many aren't written with alphabets at all. Many scripts don't fill the page left-to-right and top-to-bottom, don't have spaces between words, don't have alphabetical order, and don't conform to Western expectations in lots of other ways.

And once you get past languages you have to deal with symbols for currency, mathematics, and science.

If you're feeling intimidated, don't; there is good technology in place to help deal with this, and the really hard problems have mostly been solved for you by other people.

We'll start with the basics: how do you get the languages of the world into and out of computers?

Right: U+AE7D HANGUL SYLLABLE

Input Methods · How do people get text in all the world's languages into the computer? This is one of the many problems that you don't have to solve; anyone who sells a computer, or a PDA, or a cellphone, equips it with technology to do this.

For languages which have a reasonably-small number of characters (Hebrew, Arabic, Greek, the languages of India) you just use a keyboard with those characters painted on the keys.

For Chinese, Japanese, and Korean, there are a variety of tricks people use to enter thousands of characters using only a few dozen keys. I won't go into detail, but if you haven't seen it before, it's pretty impressive to watch a Japanese person pounding text into their PDA at high speed using just their thumbs.

Right: U+00D8 LATIN CAPITAL LETTER O WITH STROKE

Fonts and Rendering · How do computers display text in all those writing systems? The bad news is that this is a horribly hard problem; the good news is that once again, the people who make computer systems have done most of the work. If you're going to have to support complicated text-editing operations including select/cut/paste, you're going to have to bite the bullet and learn a lot more about this than you probably want to, but most software only really needs to accept short chunks of text in the fields of a form, and then to hand off other chunks of text to a browser or equivalent for display on the screen.

Right: U+4F5B HAN IDEOGRAPH

There are two pieces of technology necessary to make this work. The first is fonts; if you have a Russian customer and send them some Cyrillic text (for example their name), they probably have the appropriate fonts installed on their computer and everything will just work out; on the other hand, if you're a Canadian anglophone like me and try to open an Indian website that's written in Gujarati, there's a good chance the fonts won't be there. Having said that, Macintosh OS X comes with an astoundingly wide selection of fonts that covers pretty well the whole world, and I believe modern Windows boxes are reasonably well-supplied as well.

Fonts don't solve the whole problem. Many languages just can't be rendered without some built-in knowledge of how characters, words, and lines fit together; for example, many versions of Windows can't display Thai text without downloading some special Thai rendering software (which Microsoft supplies). Once again, the good news is that nobody expects you to write this into your software.

Right: U+0634 ARABIC LETTER SHEEN

Unicode, ISO, Politics · You can do the right thing at a reasonable cost, mostly because of an excellent standard normally referred to as “Unicode”. There's a lot of history behind this simple label; Unicode proper is a consortium of technology vendors that, many years ago in a flash of intelligence and public-spiritedness, decided to unify their work with that going on at the ISO. Thus, while there are officially two standards you should care about, Unicode and ISO 10646, through some political/organizational magic they are exactly the same, and if you're using one you're also using the other.

Right: U+0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA

The reason we usually talk about Unicode rather than ISO 10646 is that the Unicode consortium has a helpful web site and also publishes its product as a beautifully printed book, which you should think seriously about buying; more on that later.

What's a “Character” Anyhow? · All human languages are written using characters, and while philologists can enjoy decades-long arguments about what characters are, as far as Unicode (and computers) are concerned, a character can usefully be defined as the smallest atomic unit of text with semantic value.

Computers usually store characters as small numbers; back in the days of A-to-Z ASCII, you could fit a character into an eight-bit byte, but those days are long gone.

Right: U+221E INFINITY

Historically, there have been hundreds of different systems for assigning characters to numbers and then stuffing those numbers into bytes of computer storage. Given that every computer manufacturer in the world tended to cook up their own scheme for every language in the world, this was clearly an interoperability disaster in the making, and led to the ISO and Unicode work.

How Unicode Works · The basics of Unicode are actually pretty simple. It defines a large (and steadily growing) number of characters - just under 100,000 last time I checked. Each character gets a name and a number, for example LATIN CAPITAL LETTER A is 65 and TIBETAN SYLLABLE OM is 3840. Unicode includes a table of useful character properties such as "this is lower case" or "this is a number" or "this is a punctuation mark".
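As a rough illustration of the name, number, and property tables, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is my own invention for this example, and the two-letter category codes (such as `Lu` for an upper-case letter) are how Python happens to report the properties; other environments expose them differently.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return the number, name, and general category of a single character."""
    return {
        "number": ord(ch),                        # the character's Unicode number
        "name": unicodedata.name(ch),             # the character's official name
        "category": unicodedata.category(ch),     # a two-letter property code
    }

info_a = describe("A")          # number 65, LATIN CAPITAL LETTER A, category "Lu"
info_om = describe(chr(3840))   # number 3840, TIBETAN SYLLABLE OM

# A few of the properties mentioned above, as Python reports them:
is_lower = unicodedata.category("a") == "Ll"    # "this is lower case"
is_number = unicodedata.category("5") == "Nd"   # "this is a number"
is_punct = unicodedata.category("!") == "Po"    # "this is a punctuation mark"
```

The property lookups come from tables shipped with the language runtime, so which characters are known depends on the Unicode version that runtime was built against.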

Right: U+0A8A GUJARATI LETTER UU

Also, for each of these characters, the standard provides a helpful picture of a reasonably-typical rendition.

For reasons we need not explore here, Unicode numbers are given in at least four hex digits preceded by U+, so “A” is U+0041 and “Tibetan Om” is U+0F00. Now the labels for the pictures in the right margin should make sense.
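The U+ labels are just the character numbers written in hexadecimal. A small sketch in Python of converting in both directions; the function names `u_plus` and `from_u_plus` are made up for this example:

```python
def u_plus(ch: str) -> str:
    """Format a character's number in U+ notation, padded to four hex digits."""
    return f"U+{ord(ch):04X}"

def from_u_plus(label: str) -> str:
    """Turn a label such as 'U+0F00' back into the character it names."""
    if not label.startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))

label_a = u_plus("A")            # "U+0041", since 65 is 41 in hex
label_om = u_plus(chr(3840))     # "U+0F00", since 3840 is F00 in hex
```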

The Unicode standard also includes a large volume of helpful rules and explanations about how to display these characters properly, do line-breaking and hyphenation and sorting and all sorts of other stuff that you probably don't have to worry about, but if you do, it's all right here and easy to find.

Right: U+091D DEVANAGARI LETTER JHA

Encodings · From Unicode's point of view, text is stored on a computer as a series of numbers, one per character. There are many different ways to arrange these numbers in memory (or in a network transmission), some straightforward and efficient, some less so. These are called “encodings”. Unicode itself defines several different encoding schemes, the two best known of which are UTF-8 and UTF-16.
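To make the difference between characters and encodings concrete, here is a minimal Python sketch that takes the same two characters and lays them out as bytes in both encodings. The choice of big-endian byte order for UTF-16 is an assumption made to keep the output easy to read.

```python
# Two characters: LATIN CAPITAL LETTER A (U+0041) and TIBETAN SYLLABLE OM (U+0F00).
text = "A\u0F00"

# UTF-8 uses one byte for "A" and three bytes for the Tibetan character.
utf8 = text.encode("utf-8")        # b'A\xe0\xbc\x80'

# UTF-16 (big-endian here, without a byte-order mark) uses two bytes for each.
utf16 = text.encode("utf-16-be")   # b'\x00A\x0f\x00'

# Either byte sequence decodes back to the same series of character numbers.
numbers = [ord(ch) for ch in utf8.decode("utf-8")]   # [65, 3840]
```

The series of numbers is the same in both cases; only the arrangement of bytes differs.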

However, there's a good chance that your software will have to input and output characters in some other pre-Unicode encoding scheme such as ASCII, ISO-8859, or a Microsoft Code Page. Fortunately, converting back and forth is a fairly well-defined process, if a little bit less efficient than we would like.
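Converting from a pre-Unicode encoding amounts to decoding the bytes into characters and then encoding those characters again. A hedged sketch in Python follows; the helper name `to_utf8` is hypothetical, and the example assumes you already know which legacy encoding the bytes are in, which in practice is often the hard part.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)   # bytes -> Unicode characters
    return text.encode("utf-8")          # Unicode characters -> UTF-8 bytes

# The single byte 0xE4 means different characters in different legacy encodings.
raw = b"\xe4"
as_latin1 = raw.decode("iso-8859-1")   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
as_cp1251 = raw.decode("cp1251")       # U+0434 CYRILLIC SMALL LETTER DE

# 0xD8 in ISO-8859-1 is U+00D8 LATIN CAPITAL LETTER O WITH STROKE.
converted = to_utf8(b"\xd8", "iso-8859-1")   # b'\xc3\x98'
```

Guessing the wrong source encoding does not raise an error here; it quietly produces the wrong characters, so the encoding label needs to travel with the data.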

Right: U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS

It would be a really good idea to start storing all the data inside your software as either UTF-8 or UTF-16, starting now. I'll discuss the trade-offs between these two in another essay, which will be quite a bit more technical than this one.

Special Problems in Asian Scripts · The Asian scripts (Chinese, Japanese, and Korean, often abbreviated “CJK”) present special problems, both political and technical. The process by which all these related character sets were organized into the Unicode tables was somewhat controversial and left bruised egos in various places around Asia, in particular Japan. For quite some time, whether or not you were using Unicode, you had to be really careful what you said about it in Japan or you could end up catching some real grief.

Right: U+306C HIRAGANA LETTER NU

However, today there seems to be fairly widespread acceptance of the fact that while Unicode may not be perfect, it's probably an acceptable compromise and substantially better than the chaos that came before.

Another problem is that in these parts of the world, it is not unheard-of to invent new characters. The Japanese word for such characters is gaiji; historically they were invented for personal or company names. Just last year, I found out that NTT DoCoMo has been inventing new characters for teenagers to include in their cellphone text messages. This made my blood run a little bit cold, and I think the jury's still out on what the impact is going to be from a business point of view.

Right: U+0A14 GURMUKHI LETTER AU

Search · One of the most common things you have to do with text is search it. In an internationalized environment, this is tricky with Unicode and essentially impossible without it. It's tricky because a decent search capability knows about things like singular/plural, verb conjugations, and maybe something about synonyms. This is obviously different from language to language.

Another problem is that in some languages (for example Japanese and Chinese) there are no spaces between the words. This is a problem for software that needs to search such text. Not all search-engine vendors have done a good job of this, and if you're doing your own search capability you're going to have to think about it.
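One common workaround, offered here only as a sketch and not as what any particular search engine does, is to index overlapping pairs of characters (bigrams) instead of words, so that no word boundaries are needed. The function names below are hypothetical, and the Japanese sample is the rendering of the author's name used elsewhere in this essay.

```python
def bigrams(text: str) -> set:
    """Return the set of overlapping two-character sequences, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return {a + b for a, b in zip(chars, chars[1:])}

def might_contain(document: str, query: str) -> bool:
    """Cheap candidate filter: every bigram of the query appears in the document.

    This can report false positives, so a real system would confirm each hit,
    and queries shorter than two characters are not handled by this sketch.
    """
    query_grams = bigrams(query)
    return bool(query_grams) and query_grams <= bigrams(document)

document = "チムブレー"
hit = might_contain(document, "ブレー")    # True: both query bigrams occur
miss = might_contain(document, "レブ")     # False: that pair never occurs
```

A filter like this knows nothing about the language itself, which is why serious search products layer linguistic analysis on top of it.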

Does This All Work? · It's important to realize that all this is here today and really works. The following is a bit of an experiment and depends on how many fonts you have in your browser. Suppose you wanted to send an invoice to me, Tim Bray; it's a good idea to spell someone's name correctly, particularly when you're asking for money. So, if I were living in Cairo you'd probably want to send it to تم براي, and if in Osaka, to チムブレー.

What You Have to Do · So what, practically speaking, should you do, as a software practitioner? Here are a bunch of recommendations:

Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
Spend some time poking around the Unicode web site and learning how the code charts work.
If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
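For the advice about distrusting library code, one cheap first check is a round-trip test: push some non-ASCII text through the library and see whether it comes back unchanged. The sketch below is in Python, and everything in it is illustrative; `good_store` and `lossy_store` are hypothetical stand-ins for whatever library you are evaluating, and the samples are the names from the invoice example above.

```python
SAMPLES = ["Tim Bray", "تم براي", "チムブレー"]

def failed_round_trips(process, samples=SAMPLES):
    """Return the samples that `process` does not hand back unchanged."""
    return [s for s in samples if process(s) != s]

def good_store(text: str) -> str:
    """Hypothetical library that stores text as UTF-8 and reads it back."""
    return text.encode("utf-8").decode("utf-8")

def lossy_store(text: str) -> str:
    """Hypothetical library that quietly assumes everything fits in Latin-1."""
    return text.encode("latin-1", errors="replace").decode("latin-1")

good_failures = failed_round_trips(good_store)     # nothing is lost
lossy_failures = failed_round_trips(lossy_store)   # the Arabic and Japanese names
```

Passing a test like this proves very little about sorting, searching, or display, but failing it tells you right away that the library is not ready for the text of the world.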

April 06, 2003

By Tim Bray.
