Being a brief code fragment that makes me happy.

There’s this little 10-byte file called 4c like so:

~/dev/rx/ 627> hexdump 4c
0000000 26 d0 96 e4 b8 ad f0 90 8d 86

These bytes are the UTF-8 encoding of a particular four-character string as described in Characters vs. Bytes.
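As a rough sketch of the bytes-versus-characters distinction, the same ten bytes can be typed in by hand and relabeled as UTF-8. The constant names `BYTES` and `TEXT` are just illustrative, and the code assumes a Ruby with 1.9-style string encodings.

```ruby
# The ten bytes from the hexdump above, written out by hand.
BYTES = [0x26, 0xd0, 0x96, 0xe4, 0xb8, 0xad, 0xf0, 0x90, 0x8d, 0x86]

# pack('C*') builds a binary string from the byte values; force_encoding
# relabels those same bytes as UTF-8 without changing any of them.
TEXT = BYTES.pack('C*').force_encoding('UTF-8')

puts TEXT.bytesize   # 10 (bytes)
puts TEXT.length     # 4  (characters)

# Show each character next to the bytes that encode it.
TEXT.each_char do |ch|
  hex = ch.bytes.map { |b| format('%02x', b) }.join(' ')
  puts "#{ch}  #{hex}"
end
```

The four characters take one, two, three and four bytes respectively, which is why this makes a handy little test file.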

I’m running Ruby 1.9 as checked out from svn earlier today:

~/dev/rx/ 628> ruby -v
ruby 1.9.0 (2008-09-19 revision 19423) [i386-darwin9.4.0]

There’s a new method, String#each_codepoint:

~/dev/rx/ 629> ri String#each_codepoint
-------------------------------------------------- String#each_codepoint
     str.each_codepoint {|integer| block }    => str
     Passes the +Integer+ ordinal of each character in _str_, also known
     as a _codepoint_ when applied to Unicode strings to the given
     block.

        "hello\u0639".each_codepoint {|c| print c, ' ' }

     produces:

        104 101 108 108 111 1593

And it works! (Disclaimer: I probably am not using the best and simplest idiom.)

~/dev/rx/ 630> irb
irb(main):001:0> u = File.read('4c').force_encoding('UTF-8')
=> "&Ж中𐍆"
irb(main):002:0> u.each_codepoint {|c| printf("U+%04X\n", c) }
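For anyone who wants to try this without the file, here is a small sketch that rebuilds the string from the bytes in the hexdump and formats each code point the same way. The helper name `codepoint_labels` is made up for this example and is not part of Ruby.

```ruby
# Hypothetical helper: format every code point of a string as U+XXXX.
def codepoint_labels(str)
  str.each_codepoint.map { |c| format('U+%04X', c) }
end

# Rebuild the string from the ten bytes shown in the hexdump.
bytes = [0x26, 0xd0, 0x96, 0xe4, 0xb8, 0xad, 0xf0, 0x90, 0x8d, 0x86]
s = bytes.pack('C*').force_encoding('UTF-8')

puts codepoint_labels(s)
# U+0026
# U+0416
# U+4E2D
# U+10346
```

The last line is the interesting one: a single code point above U+FFFF, reported as one integer rather than as a pair of 16-bit units.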

Further background and explanation may be found here. I felt like writing back saying “And can we have ponies, too?”



From: Lars Marius Garshol (Sep 19 2008, at 04:28)

In other words: each_codepoint really does what it advertises, and does not treat UTF-16 surrogates as code points, but shows the last character as a single code point, instead of as the two units used to encode it in UTF-16.

That really is good, and is certainly more than Java can do:


From: g (Sep 20 2008, at 16:16)

What a pity that the Linear A syllabary hasn't yet made it into the Unicode standard: it would have been amusing to have U+10646 instead of U+10346.


From: Jay Carlson (Sep 20 2008, at 18:25)

Sometimes I think Ruby's text processing is an elaborate parody of the English-centric Unix world's attitudes decades ago.


From: Julian Reschke (Sep 21 2008, at 03:25)


"That really is good, and is certainly more than Java can do" -- see:


From: Gaute Strokkenes (Sep 23 2008, at 22:59)

Julian: Lars' point is that Java strings can only cope with non-BMP characters by means of surrogate pairs, i.e. encoding them with pairs of chars, rather than a single char. This is a common misfeature of all UTF-16 systems, and I see nothing in the Javadoc you quoted to refute this.
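The surrogate-pair point in this thread can be checked from Ruby itself. The sketch below transcodes the last character of the test string to UTF-16 and lists the 16-bit units; `utf16_units` and `NON_BMP_CHAR` are names invented for the illustration.

```ruby
# Hypothetical helper: the 16-bit code units of a string in UTF-16.
def utf16_units(str)
  # Transcode to big-endian UTF-16, then read it as unsigned 16-bit integers.
  str.encode('UTF-16BE').unpack('n*')
end

NON_BMP_CHAR = "\u{10346}"  # the fourth character of the test string

puts NON_BMP_CHAR.codepoints.length   # 1 code point
units = utf16_units(NON_BMP_CHAR)
puts units.length                     # 2 code units
puts units.map { |u| format('0x%04X', u) }.join(' ')
# 0xD800 0xDF46
```

An API that indexes by 16-bit unit sees two items here, while each_codepoint sees one.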


September 18, 2008