Much time on the W3C TAG telecon today went to (I think) an important issue: how to extend the machinery of the Universal Republic of Love, er I mean URL, er I mean URI, to the billions who don't use our ninety-five printable ASCII characters to describe the world. This is tricky, not so much because the problem itself is hard, but because there's so much Anglocentric software out there that we have to cater to. (Warning: severely geeky.)

There's a fairly ambitious, not to say heroic, effort under way to define something called Internationalized Resource Identifiers (IRIs); the drafts are IETF documents even though the work is (mostly) being done at the W3C.

I can hear some grumbles in the background wondering if we really need this stuff; isn't the Japanese cabinet making do with an all-ASCII address, and doesn't Al-Jazeera seem to be OK with one too? After all, this stuff is for computers not people, right?

Wrong (at least partly). URIs are painted on the sides of buses everywhere, and shouldn't I be able to paint something perfectly useful like this: 伊藤穣一

Well, with IRIs I'd be able to do just that, and most people who've thought about it seem to think that this is a good thing. There's a problem though; the current definition of URIs is perfectly clear that they can only contain printable ASCII characters, on the reasonable basis that in the nineties when this was cooked up, that was all you could really be sure would be available on any and all computers in any and all locations.

That last assumption may no longer be valid, but there is a heck of a lot of software out there right now that processes URIs according to those all-ASCII rules.

Now in fact, you can get a string like “伊藤穣一” into a URI through a trick called “hex-encoding.” Here's that same Google search, hex-encoded:

Click on it and you'll see that it works, but you wouldn't want to paint it on the side of a bus.

(For those in the crowd who are technical enough to care how this works but haven't learned hex-encoding, what's happening is that the four Kanji characters are encoded in UTF-8, where each of them occupies three bytes, and then each of those bytes is written as a % sign followed by two hexadecimal digits; thus four Kanji become twelve sequences of a % followed by two hex digits.)
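For the curious, that encoding can be sketched in a few lines of Python. This is a minimal illustration, not anything the IRI drafts prescribe: it assumes UTF-8, leans on the standard library's `urllib.parse.quote`, and the helper name `hex_encode` is made up for this example.

```python
from urllib.parse import quote, unquote


def hex_encode(text: str) -> str:
    """Percent-encode the UTF-8 bytes of `text` (hypothetical helper)."""
    # safe="" leaves only quote()'s always-safe ASCII characters bare;
    # every non-ASCII byte is written as "%" plus two hex digits.
    return quote(text, safe="", encoding="utf-8")


name = "伊藤穣一"
encoded = hex_encode(name)

# Each of these Kanji takes three UTF-8 bytes, and each byte becomes "%XX".
print(len(name.encode("utf-8")))  # number of UTF-8 bytes
print(encoded.count("%"))         # number of %XX sequences
print(unquote(encoded) == name)   # decoding recovers the original string
```

The byte count and the count of % signs come out equal, which is the "three bytes per character, one %XX per byte" arithmetic described above.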

OK, so the problem is solved, right? In future, we'll paint IRIs on the sides of buses and transform them into hideous hex-encoded URIs behind the scenes for the computers to use.

Well, sort of. The problem is that the rules for when and how you hex-encode are hopelessly vague and nowhere near deterministic. (Geeks may want to do a view-source and look at the href= on that hex-encoded URI above to see the extra little pieces you have to stick on to make it work.) This has the potential to play hell with caches and proxies and webcrawlers and XML namespaces, since the same IRI can end up as a bunch of wildly different-looking URIs.
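To make the non-determinism concrete, here is a small Python sketch of how one string can come out as several different-looking hex-encoded forms. Picking Shift_JIS as the alternative character encoding is purely an assumption for illustration; the point is only that the result depends on choices the rules leave open.

```python
from urllib.parse import quote, unquote

name = "伊藤穣一"

# Same four characters, three plausible hex-encodings.
utf8_upper = quote(name, safe="", encoding="utf-8")
utf8_lower = utf8_upper.lower()  # same bytes, lowercase hex digits
shift_jis = quote(name, safe="", encoding="shift_jis")  # assumed legacy charset

variants = {utf8_upper, utf8_lower, shift_jis}
print(len(variants))  # count of distinct strings for one name

# Each variant decodes back to the same name, given the right charset.
print(unquote(utf8_lower, encoding="utf-8") == name)
print(unquote(shift_jis, encoding="shift_jis") == name)
```

A cache or crawler that compares URIs as plain strings would treat these as different resources, even though a human would say they all name the same thing.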

I don't think the problems are that hard, and fortunately the installed base of IRIs is not yet big enough to be an intractable problem, but clearly some combination of the TAG, the W3C internationalization gang, and the IETF experts is going to have to buckle down and sort this out.

March 31, 2003