UTF-8+names

Here’s the problem. You want to put “funny” characters in your XML, ones that aren’t on your keyboard, the way “ñ” isn’t in Greece and “Δ” isn’t in Mexico. XML has a bunch of ways to do this; some of them require sophisticated software, others are really ugly, and if you want to avoid both the ugliness and the fancy software, you can use a DTD. Except that people don’t want to use DTDs either. This set of issues has been darkening the XML skies for years now, but we may have stumbled on a way out of the box. (Warning: Bit-banging technicalia of interest only to XML obsessives.)

Fancy Software · The right way to deal with these situations is to use fancy software. If you’re in Microsoft Word, for example, and you want to put in a Δ you pull up a Unicode character palette that lets you select it for insertion. Once you’ve inserted it, Word is smart enough to display it properly like any other character.

Another sophisticated text-editing system is GNU Emacs, which is what I use here at ongoing. Recently I got interested in this problem and did a bunch of work to make these things easier to type into Emacs. At more or less the same time, so did Norm Walsh, taking quite a different approach.

Uglification · You probably don’t want to try to use Word to type in XML anyhow, and many people aren’t comfy with the notion of hacking away on Emacs; they just want to bang XML in with an ordinary text editor. This, after all, was one of the original design goals of XML, and (except for this particular problem) one that’s mostly been achieved.

XML does provide you with a brute-force way to type in unfamiliar characters: you crack open your handy local Unicode reference and find out what the numeric “code point” for that character is, and you use that. For example, that capital-Delta is number 916, only it’s usually given in hex as U+0394. So I can put &#x394; in my XML text and sure enough, a Δ will appear. Of course, this is pretty ugly, and hard to edit, among other things because if you’re typing a few of these things in, it may not remain fresh in your memory that &#x394; is Δ but &#xF1; is ñ.
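To make the arithmetic concrete, here is a small Python sketch that goes from a character to its code point and on to the two numeric reference forms. It is only an illustration; `html.unescape` is used as a convenient stand-in for what a parser does with a numeric reference, not as a real XML processor.

```python
import html

delta = "\u0394"  # GREEK CAPITAL LETTER DELTA

code_point = ord(delta)                # the number Unicode assigns to the character
hex_form = f"U+{code_point:04X}"       # the usual way of writing that number
decimal_ref = f"&#{code_point};"       # decimal numeric character reference
hex_ref = f"&#x{code_point:X};"        # hexadecimal numeric character reference

# Stand-in for a parser: turn the reference back into the character.
resolved = html.unescape(hex_ref)

print(code_point, hex_form, decimal_ref, hex_ref, resolved)
```

Either reference form resolves to the same single character; the hex one just lines up with how Unicode charts are printed.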

DTDs to the Rescue · That’s OK, because XML has a solution for the ugliness. In an XML DTD you can give any old chunk of text a name and use it that way. For example, you can say that the string &#x394; is named Delta and then you can put &Delta; in your text, which isn’t all that pretty but is way better than the hex version.
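For the record, here is roughly what such a declaration looks like when it lives in a document’s internal DTD subset, wrapped in a few lines of Python so it can be run. The document and the `p` element are made up for the example; the parser is the standard library’s ElementTree, which expands internally declared entities.

```python
import xml.etree.ElementTree as ET

# A made-up document whose internal DTD subset gives the name "Delta"
# to the string &#x394; (the character U+0394).
doc = """<!DOCTYPE p [
  <!ENTITY Delta "&#x394;">
]>
<p>&Delta;</p>"""

root = ET.fromstring(doc)
text = root.text  # the entity reference has been replaced by the character

print(text)
```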

These named chunks of text are called “Entities.”

[Illustration: U+222F SURFACE INTEGRAL]

Entities are particularly useful to people from the Math world, who invent and use all sorts of funky characters to get their jobs done. Fortunately for them, they’ve managed to get a huge number of these into Unicode and thus into XML. Unfortunately for them, there’s basically no keyboard in the world that has these things painted on the keys, and there are very few people who have the right set of software and fonts on hand to display them.

Since sensible people like to standardize things, a bunch of “standard entity sets” have been developed over the years. They’re wired into HTML, which is why you can put things like &Delta; into your web pages and it all just works.

The XHTML 1.0 spec did a particularly good job of pulling all these definitions together in its Appendix A.2, and most browsers implement this pretty well as far as I know.

As noted above, the mathies have special needs, so they’ve standardized names for their frighteningly-large inventory of symbols; for example the surface integral illustrated above is known as &Conint;. To appreciate their work in the fullness of its imposing glory, check out Chapter 6 of their spec.

Suppose I Don’t Want a DTD? · Lots of people don’t, these days. DTDs are supposed to be mostly about writing down the rules for an XML language, what tags and attributes you can use and what order they have to be in and so on. This functionality is in the process of being rapidly replaced by newer inventions such as W3C XML Schema and RelaxNG; in the case of RelaxNG, this is a clear step forward.

Unfortunately, DTDs not only did this “Schema” stuff, they were used to declare entities. Which is quite a different kettle of fish, and the designers of modern schema facilities like W3C XML Schema and RelaxNG had no interest in addressing that problem, so they didn’t.

The designer of a modern language doesn’t want to have to write a modern schema and then still have a DTD around to declare funny characters. On top of that, it’s operationally tricky if you have to have a DTD around just to parse an XML document, which you do if you’ve used entities that were declared in that DTD. This is particularly nasty when the set of declared entities numbers in the hundreds or thousands (XHTML + MathML between them get up there) but you only want to use one or two.

What you’d really like would be for XML to be like HTML, where the system has built-in knowledge of a bunch of names for characters that you can use without having to declare them or hook up with a DTD.
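The contrast is easy to see from Python’s standard library, which I’m using here purely as a convenient pair of stand-ins: the `html` module knows the HTML names out of the box, while the XML parser refuses a name nobody declared.

```python
import html
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# HTML side: the name is built in, no declaration needed.
html_value = html.unescape("&Delta;")
code_point = name2codepoint["Delta"]

# XML side: same reference, no DTD, so the parser gives up.
try:
    ET.fromstring("<p>&Delta;</p>")
    xml_error = None
except ET.ParseError as exc:
    xml_error = str(exc)

print(html_value, code_point, xml_error)
```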

Unfortunately, it would be totally out-of-bounds to try to change the definition of XML at this point in history to cause it suddenly to have all this built-in knowledge. So people have been concocting all sorts of imaginative and (in some cases) elegant proposals aimed at giving people access to characters by name without having to use a DTD or tear up the XML pavement. So far, consensus has been hard to find.

Which Brings Us to Today · On one of the W3C mailing lists, Michael Sperberg-McQueen brought the problem to the fore and essentially challenged the community to do something about it. The discussion swirled around in a kind of unsatisfactory way, then, reading an interchange between Rick Jelliffe and Martin Dürst, it dawned on me that we can maybe dodge the whole problem by moving it out of XML into Unicode. First a bit of background.

UTF Revisited · As I discussed in previous essays, Unicode’s huge inventory of characters is identified by number and there are a bunch of ways to pack those numbers into bytes. Unicode defines three of its own, called UTF-8, UTF-16, and UTF-32. Furthermore, all the characters in the good old ASCII and ISO-Latin we grew up with, and all the Microsoft Code Pages and so on, are actually just different ways to stuff Unicode characters into memory.

XML is sensitive to this, and lets you use any old encoding of the Unicode characters as long as you declare what it is. For example, you might have the following at the top of an XML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>

While XML software is only required to know about UTF-8 and UTF-16, most of what’s out there can deal with popular code pages and ISO-Latin and so on; in fact, the original XML specification was encoded in ISO-Latin-1 and has a declaration like the one above.
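A quick way to watch the declaration do its job, sketched in Python with the standard ElementTree parser (the one-element documents are invented for the example): the same “ñ” is stored as different bytes in the two files, and the parser hands back the same character from each.

```python
import xml.etree.ElementTree as ET

# The same tiny document, stored in two different encodings.
latin1_bytes = '<?xml version="1.0" encoding="ISO-8859-1" ?><p>\u00f1</p>'.encode("iso-8859-1")
utf8_bytes = '<?xml version="1.0" encoding="UTF-8" ?><p>\u00f1</p>'.encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly.
latin1_text = ET.fromstring(latin1_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text

print(latin1_bytes != utf8_bytes, latin1_text == utf8_text)
```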

Finally, as I discussed at length in the article referenced above, UTF-8 stores the old-fashioned 7-bit ASCII characters one per byte, with the same values ASCII uses. It has a clever trick for storing the rest of the Unicode characters in multiple bytes, but we need not go there in this essay.
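If you want to see the bytes for yourself, a few lines of Python will do it. Nothing here is specific to XML; it just encodes one ASCII letter and one Δ in the three UTF flavors (big-endian, no byte-order mark, to keep the output tidy).

```python
# An ASCII character comes out of UTF-8 as the same single byte ASCII uses.
ascii_letter = "A".encode("utf-8")

# U+0394 needs two bytes in UTF-8, two in UTF-16, four in UTF-32.
delta_utf8 = "\u0394".encode("utf-8")
delta_utf16 = "\u0394".encode("utf-16-be")
delta_utf32 = "\u0394".encode("utf-32-be")

print(ascii_letter, delta_utf8, delta_utf16, delta_utf32)
```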

Introducing UTF-8+names · This is the name of a new encoding of Unicode. It’s just like UTF-8, only it has built-in knowledge of the XHTML and MathML entity sets. That means that when you have &Delta; in your file, that’s an encoding of the single Unicode character U+0394.

So the following short XML document is just fine as it stands:

<?xml version="1.0" encoding="UTF-8+names" ?>
<p>From &Alpha; to &omega;.</p>

The XML processor’s text reader will autoconvert away the &Alpha; and &omega; so that all the underlying XML processor ever sees is the Unicode characters.

It wouldn’t be that hard to implement, and it would make some people’s lives easier; not millions, but not just one or two either. I’d use it here at ongoing for sure. And it would make all the XML machinery run just a little bit smoother, which these days is significant.

The rules of UTF-8+names are simple. Anything that starts with & and ends with ; is called a replacement. The bit between the & and ; is called the replacement’s name, and whatever it stands for is called the value. For example, &Delta; is a replacement whose name is Delta and whose value is the single Unicode character U+0394 GREEK CAPITAL LETTER DELTA.

UTF-8+names defines one replacement of its own, for the ampersand (&) character, whose name is &. That is to say, if you want an ampersand, use &&;. The jury is still out on this one; others have proposed an empty name, i.e. &;. More discussion is required.

Then, UTF-8+names adopts as replacements all the entities defined by XHTML 1.0 and MathML 2.0 (see the references above). I haven’t worked out yet how many replacements that is. The jury’s still out on whether HTML and MathML are the right sets to adopt; they’re not 100% consistent, and there is work under way to produce a grand unified list, which would be much more convenient.

Finally, if you have a replacement whose name isn’t defined by UTF-8+names, it just stands for itself. For example, &ongoing; just stands for &ongoing;.
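To show how little machinery the rules above need, here is a rough Python sketch of a decoder. It is an assumption-laden toy, not the draft’s definition: the `NAMES` table is a three-entry stand-in for the real XHTML and MathML lists, the pattern for what counts as a name is my own guess, and `decode_utf8_names` is a made-up function name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the adopted XHTML + MathML names, plus the
# one replacement UTF-8+names defines itself: the name "&" for an ampersand.
NAMES = {"Alpha": "\u0391", "Delta": "\u0394", "omega": "\u03c9", "&": "&"}

# Assumed shape of a replacement: "&", then either "&" or a simple name, then ";".
REPLACEMENT = re.compile(r"&(&|[A-Za-z][A-Za-z0-9]*);")

def decode_utf8_names(data: bytes) -> str:
    """Decode as plain UTF-8, then swap each known replacement for its value."""
    text = data.decode("utf-8")
    # An unknown name just stands for itself, so fall back to the matched text.
    return REPLACEMENT.sub(lambda m: NAMES.get(m.group(1), m.group(0)), text)

decoded = decode_utf8_names(b"<p>From &Alpha; to &omega;.</p>")
parsed_text = ET.fromstring(decoded).text  # the XML parser never sees a name

print(decoded)
print(decode_utf8_names(b"AT&&;T"), decode_utf8_names(b"&ongoing;"))
```

The point of the last two calls is the edge cases: &&; comes out as a bare ampersand, and &ongoing;, which isn’t in the table, passes through untouched.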

Another open question is, why is this based on UTF-8? It could be based on UTF-16 or ISO-8859-1 or even US-ASCII. I’m not going to dive deep on that one here because the night is yet young in that discussion.

Adventures in IETF-Land · I ran this idea up the flagpole and a couple of smart people liked it, and John Cowan proposed a couple of real improvements on the original notion, and I wondered how we might make such a thing official. More or less anyone can define a Unicode Character Encoding, and more or less anyone can write an IETF Internet-Draft, and if it catches on, more or less anyone can write an IETF RFC. So I decided to write an Internet-Draft.

I’d never written an I-D or an RFC, and I know the formatting rules are very strict. Quickly, I discovered xml2rfc, a nifty package cooked up by the formidable Marshall Rose, which includes a DTD for writing I-Ds and RFCs, and a web site where you can upload your XML and get it back in prettified ASCII or HTML. It’s slick and straightforward and Just Works. I think this XML stuff is going to catch on, you know?

Anyhow, courtesy of xml2rfc, here is an Internet-Draft for UTF-8+names. It’s not finished by any means, but it’s cooked enough to be worth arguing about.

October 17, 2003

By Tim Bray.
