In preparation for my presentation next weekend at
RubyConf, I’ve been poking
around at Ruby’s string-handling. One thing that text-wranglers such as
me like to do is walk through a string a character at a time, and Ruby doesn’t
make this particularly easy.
I ended up implementing
String#each_char_utf8 three times along the way.
[Update: Lots of interesting feedback, and a worth-reading
I poked around in the
(I can’t get it
to work on my computer, but I can read the source).
It appears that the only way to look into a Ruby string and see Unicode
characters is with
unpack('U*'), or using a regexp with
$KCODE set to 'u'.
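To make the unpack('U*') route concrete, here is a minimal sketch. It assumes Ruby 1.9 or later (for the \u escapes) and a UTF-8 string; the sample text is made up for illustration.

```ruby
# Assumes Ruby 1.9+ and UTF-8 input; the sample string is illustrative only.
s = "na\u00efve \u20ac5"

# unpack('U*') decodes the whole string into an array of integer code points.
codepoints = s.unpack('U*')

p codepoints         # => [110, 97, 239, 118, 101, 32, 8364, 53]
p codepoints.length  # => 8 characters
p s.bytesize         # => 11 bytes, since two of the characters are multi-byte
```

The gap between the character count and the byte count is the reason a plain byte-indexed walk is not enough.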
So, I want to go walking through a string, looking at the Unicode characters.
Maybe I’m parsing a big XML file using
mmap(2) or some such.
What I want is an efficient
This will be hard in Ruby in the general case because Strings don’t know
what encoding they’re in; there’s
$KCODE but that’s only defined
to work with regular expressions. So let’s look at the special case of UTF-8.
Of course, if
String#unpack took a block like
String#gsub, that would give you the tools you need. I looked at
pack.c and it would be real work, but it doesn’t look
architecturally impossible. Failing that, let’s use unpack:
def each_utf8_unpack(s, &block)
  s.unpack('U*').each(&block)
end
The above sucks for big strings because you create a monster array of
integers. Regular expressions are maybe a little more efficient
(this depends on the
$KCODE setting, obviously):
def each_utf8_regex(s)
  s.gsub(/./m) do |c|
    yield c.unpack('U').shift
    ''
  end
end
The unpack voodoo is because I want the integer value of the character.
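In the gsub version the trailing '' is the block's return value, which gsub uses as the replacement text. A possible variant, sketched here under the assumption of Ruby 1.9 or later and a UTF-8 string, uses scan with a block so there is no replacement string to build. The name each_codepoint_scan is hypothetical.

```ruby
# Assumes Ruby 1.9+ and a UTF-8 encoded string.
# each_codepoint_scan is a hypothetical name used for illustration.
def each_codepoint_scan(s)
  # With a block, scan yields each match in turn instead of collecting them,
  # so nothing is accumulated the way gsub accumulates a result string.
  s.scan(/./m) do |c|
    # unpack('U') returns a one-element array holding the code point.
    yield c.unpack('U').first
  end
end

seen = []
each_codepoint_scan("a\u00e9\n") { |cp| seen << cp }
p seen  # => [97, 233, 10]
```

The /m flag matters here: without it the dot would skip the newline.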
Here is a more ambitious version, extending
String and picking
the UTF-8 apart a byte at a time:
class String
  def each_char_utf8
    @utf8_index = 0
    while @utf8_index < length
      yield next_utf8_char
    end
  end

  def next_byte
    b = self[@utf8_index]
    @utf8_index += 1
    return b
  end

  def first_utf8
    b = next_byte
    if b & 0x80 == 0 then return 1, b
    elsif b & 0xe0 == 0xc0 then return 2, b & 0x1f
    elsif b & 0xf0 == 0xe0 then return 3, b & 0x0f
    else return 4, b & 0x07
    end
  end

  def next_6bits
    next_byte & 0x3f
  end

  def next_utf8_char
    len, c = first_utf8
    case len
    when 2
      c = (c << 6) | next_6bits
    when 3
      c = (c << 12) | (next_6bits << 6) | next_6bits
    when 4
      c = (c << 18) | (next_6bits << 12) | (next_6bits << 6) | next_6bits
    end
    return c
  end
end
I’m pretty sure the above has more or less the right semantics; but it’s a candidate for implementation in C (or Java, for JRuby).
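The byte-at-a-time code relies on String#[] with an integer index returning a byte value and on length counting bytes, which holds for Ruby 1.8. From Ruby 1.9 on, indexing returns a one-character string, so the following is a rough sketch of the same bit-twiddling using getbyte and bytesize. It assumes well-formed UTF-8 input and does no validation; each_utf8_codepoint is a hypothetical helper name.

```ruby
# A sketch assuming Ruby 1.9+ and well-formed UTF-8 (no error handling).
# each_utf8_codepoint is a hypothetical helper, not an existing API.
def each_utf8_codepoint(s)
  i = 0
  n = s.bytesize
  while i < n
    b = s.getbyte(i)
    i += 1
    # The lead byte says how many continuation bytes follow and
    # which of its low bits are payload.
    if b & 0x80 == 0
      extra, c = 0, b
    elsif b & 0xe0 == 0xc0
      extra, c = 1, b & 0x1f
    elsif b & 0xf0 == 0xe0
      extra, c = 2, b & 0x0f
    else
      extra, c = 3, b & 0x07
    end
    # Each continuation byte contributes its low six bits.
    extra.times do
      c = (c << 6) | (s.getbyte(i) & 0x3f)
      i += 1
    end
    yield c
  end
end

decoded = []
each_utf8_codepoint("A\u00e9\u20ac\u{1F600}") { |cp| decoded << cp }
p decoded  # => [65, 233, 8364, 128512]
```

Keeping the cursor in a local variable rather than an instance variable on the string means two walks over the same string cannot interfere with each other.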
I tested the performance by running all three versions over 2,000,000
bytes of ongoing text, containing a few
thousand non-ASCII characters, and doing a character frequency count.
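For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a frequency-count harness built on the standard Benchmark module. The test string is a made-up stand-in, not the corpus used here, and only the unpack approach is timed; it assumes Ruby 1.9 or later.

```ruby
require 'benchmark'

# Hypothetical stand-in for the test data: mostly ASCII with a few
# non-ASCII characters mixed in. This is not the original corpus.
text = ("plain ascii text " * 50 + "caf\u00e9 \u20ac ") * 20

# Count how often each code point occurs, using the unpack approach.
counts = Hash.new(0)
elapsed = Benchmark.realtime do
  text.unpack('U*').each { |cp| counts[cp] += 1 }
end

puts "distinct code points: #{counts.size}"
puts format("elapsed: %.4f s", elapsed)
```

Swapping the body of the Benchmark.realtime block for another iteration strategy gives a like-for-like comparison on the same input.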
The regex version and the byte-at-a-time version took around
18 seconds on my PowerBook; the
unpack version took less than
I’ve asked the
ruby-talk mailing list why String#unpack
doesn’t take a block; let’s see what they say.