I gave a presentation called I18n, M17n, Unicode, And All That at the recent 2006 RubyConf in Denver. This piece doesn’t duplicate the presentation; it outlines the problem, recounts some conference conversation, and includes a couple of images that you might want to steal and use in a future Unicode presentation. For those who don’t know, “i18n” is short for “internationalization” (i-18 letters-n), “m17n” for “multilingualization”, and you can call me “T1m”.
Background · The object of my presentation was to argue, to the Ruby community and its leadership, that I18n in general and Unicode in particular are really important technologies which deserve to be taken quite a bit more seriously than they currently are.
My 45 minutes included 20 or so introducing Unicode concepts and issues; I was a little worried that this might be old hat, but when I asked for a show of hands from people who thought they understood Unicode, damn few went up. So I guess this was OK. From that part of the presentation, here are a couple of pretty graphics illustrating the structure and layout of the Unicode character set.
The Problem · Right now, Ruby sees a String as a byte sequence, and doesn’t provide much in the way of character-oriented, as opposed to byte-oriented, functions. Also, it has very little built-in knowledge of Unicode semantics, aside from a few UTF-8 packing and unpacking capabilities, and some rudimentary regular expression support. Up till now, this hasn’t been seen as an urgent problem.
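To make the byte-versus-character distinction concrete, here is a minimal sketch built on the UTF-8 packing and unpacking support mentioned above. The helper names `byte_length` and `char_length` are invented for illustration, and the `\u` escape in the sample string assumes a Ruby recent enough to understand it; take this as a sketch of the gap, not as a proposal for how String ought to behave.

```ruby
# Count the raw bytes: "C*" unpacks every byte as an unsigned 8-bit integer.
def byte_length(str)
  str.unpack("C*").length
end

# Count the characters: "U*" decodes the bytes as UTF-8 into Unicode code points.
def char_length(str)
  str.unpack("U*").length
end

word = "na\u00EFve"        # "naïve"; U+00EF occupies two bytes in UTF-8

puts byte_length(word)     # 6
puts char_length(word)     # 5

# Round trip: code points packed back into a UTF-8 string.
puts word.unpack("U*").pack("U*") == word   # true
```

The two counts disagree as soon as the text leaves ASCII, which is exactly where byte-oriented string functions start giving surprising answers.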
A solution has been promised in Ruby 2 (due next year, maybe); Matz calls it M17n and has made some general remarks in speeches but hasn’t published much in the way of code or documentation.
When I was preparing this talk, I sent email off to the Unicode people saying “Hey, I’m going to be plugging your baby to an audience of a few hundred eager programmers, want to send me some advertising?” They sent me some leaflets, but also a pre-release bound copy of the Unicode 5.0 manuscript. All the previous versions of Unicode have been impressive and beautiful books, and it looks like 5.0 will carry on the tradition.
Before my presentation, I was at the front of the room getting set and talking to Matz, who was a little worried, I think, that I’d come in raining fire and brimstone on all those who are not members of the Church of Unicode. I got the 5.0 book out and told the organizers they could raffle it off (then an idea struck) “...unless Matz wants it” and he did, so he has it now.
Some Progress · I had a talk over lunch with Matz and he made a couple of pretty good points. First, for operating on the Web, you just gotta do Unicode because it’s an inherently multilingual place and Unicode is the only plausible way to deal with texts that combine characters from lots of different languages. But for people who are using Ruby to process their own data on their own computers, compulsory Unicode round-tripping could be a real problem, because it can cause breakage. For example, take currency symbols. In Japan, there is ambiguity in the JIS encodings as to where the ¥ (Yen) symbol goes: it can sit at byte 0x5C, the slot where ASCII puts the backslash. Anyone who’s used a Japanese keyboard and seen a “¥” pop up when they hit the “\” key knows about this.
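The round-trip breakage can be sketched without any real converter at all. The two tables below are hypothetical, hand-written stand-ins for how different software might read the single byte 0x5C (backslash in ASCII, ¥ in the JIS reading); the names `ASCII_READING`, `JIS_READING`, `decode`, and `encode` are all made up for this example and don't correspond to any library.

```ruby
# Hypothetical decoding tables for the one contested byte, 0x5C.
ASCII_READING = { 0x5C => 0x005C }  # U+005C REVERSE SOLIDUS (backslash)
JIS_READING   = { 0x5C => 0x00A5 }  # U+00A5 YEN SIGN

# Decode bytes to Unicode code points; bytes not in the table pass through.
def decode(bytes, table)
  bytes.map { |b| table.fetch(b, b) }
end

# Encode code points back to bytes by inverting the same kind of table.
def encode(codepoints, table)
  inverse = table.invert
  codepoints.map { |cp| inverse.fetch(cp, cp) }
end

price = [0x5C, 0x31, 0x30, 0x30]    # shows up as "¥100" on a Japanese system

as_jis = decode(price, JIS_READING)

# Same reading in and out: the bytes survive.
puts encode(as_jis, JIS_READING) == price     # true

# In through one reading, out through the other: the bytes change.
puts encode(as_jis, ASCII_READING) == price   # false
```

Once the data is in Unicode the two characters are distinct, so a mismatch between the way in and the way out silently alters bytes that the owner never asked anyone to touch.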
There’s another problem right here in our own Western back yard: the blurring of the border between two immensely popular encodings, ISO-8859-1 and Microsoft Windows 1252. Sam Ruby’s Survival Guide to I18n (from 2004) covers these issues; also see a couple of slides starting here, Copy and Paste, and Mozilla bug 121174 comment #23.
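One byte is enough to show where ISO-8859-1 and Windows 1252 part ways. This sketch assumes a Ruby with encoding-aware strings and `String#encode`, which is newer than the Ruby discussed in this piece; the helper name `codepoint_for` is invented for the example.

```ruby
# Interpret a single byte under the named encoding and return its
# Unicode code point. String#encode(to, from) reads the bytes as `from`.
def codepoint_for(byte, encoding_name)
  raw = [byte].pack("C")   # one-byte binary string
  raw.encode("UTF-8", encoding_name).unpack("U*").first
end

# 0x93 is a curly opening quote in Windows 1252,
# but an invisible C1 control character in ISO-8859-1.
printf("Windows-1252: U+%04X\n", codepoint_for(0x93, "Windows-1252"))  # U+201C
printf("ISO-8859-1:   U+%04X\n", codepoint_for(0x93, "ISO-8859-1"))    # U+0093

# Outside the 0x80..0x9F range the two agree, which is why the mix-up is so easy.
puts codepoint_for(0xE9, "Windows-1252") == codepoint_for(0xE9, "ISO-8859-1")  # true
```

Text labeled ISO-8859-1 that actually contains bytes like 0x93 is almost certainly Windows 1252 in disguise, and a converter that takes the label at face value turns the quotation marks into control characters.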
So the problem with Ruby is: how do we make Unicode Just Work, effortlessly, for the zillions of people who really need it, while at the same time allowing others to avoid the potential round-trip breakage?
I don’t know the answer, but lots of people are looking at the problem, along with Matz. Julian Tarkhanov and Manfred Stienstra have been working on ActiveSupport::Multibyte (get it from Edge Rails), Nikolai Weibull on the character-encodings project, and the JRuby guys are trying to build Ruby on a platform that’s natively Unicode.