Being a hyper-pedantic note about turning bytes into Java strings and a small fix for a smaller and almost-purely-aesthetic but ubiquitous problem. [Update: Heavily revised with a better solution.]

[Most of the comments below apply to the original solution I’d been using, which turned out to be sub-optimal.]

So, it’s like this: You’ve received some bytes over the wire and run them through a JSON parser and you’re looking at a few of them that you know damn well are a field name in UTF-8. So, you say:

final String name = new String(bytes);

Then your perfectly sensible reviewer points out that it’s a Best Practice to call the constructor with the “charsetName” argument because otherwise it’ll use “the platform’s default charset”. Which you know damn well is UTF-8 but hey, a Best Practice is a Best Practice, so you say:

final String name = new String(bytes, "UTF8");

At which point your IDE sticks a sharp little red underline in your eye because hey, that might throw an UnsupportedEncodingException even though the Javadocs say explicitly that implementations must damn well support UTF-8.

Now your nice minimal code is wearing an ugly necklace of tries and catches. And what, I ask, is supposed to go in the catch clause? There’s only one thing you can be sure of: It’ll be ugly and distracting and useless. I mean, if you had a “detonateBioWeapon()” call, you could put it in there and still sleep soundly. Once my catch clause read:

throw new RuntimeException("Paging Mr Gosling to the white courtesy phone for a message.");

But another perfectly-reasonable reviewer made me take it out.
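
For the record, here is roughly what the full necklace looks like. This is only a sketch: the class and method names are made up for illustration, and the catch body is just one of many equally pointless things you could put there.

```java
import java.io.UnsupportedEncodingException;

final class Necklace {
    // Hypothetical helper: decodes UTF-8 bytes via the charset-name constructor,
    // which forces the caller to deal with a checked exception.
    static String fieldName(byte[] bytes) {
        final String name;
        try {
            name = new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should be unreachable: Java implementations are required to support UTF-8.
            throw new IllegalStateException("UTF-8 is missing from this JVM", e);
        }
        return name;
    }
}
```

Seven lines of ceremony to guard against something that cannot happen.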

I bet there are millions of these stupid intrusive little excrescences all over the world’s Java code-bases. In the previous sentence, “millions” is not a figure of speech.

Anyhow, the answer is this:

import java.nio.charset.StandardCharsets;

final String name = new String(bytes, StandardCharsets.UTF_8);

I can’t conceive of any circumstances in which any of the three versions of this code presented here will produce results different from any of the others. But hey, your Practices are the Best, baby.



Contributions


From: Adrian Sutton (Feb 10 2015, at 20:51)

java.nio.charset.StandardCharsets is going to be your friend here, probably with a static import for StandardCharsets.UTF_8.
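
A minimal sketch of what that static import might look like in practice. The class and method names (FieldNames, decode, encode) are hypothetical; only StandardCharsets.UTF_8 comes from the JDK.

```java
import static java.nio.charset.StandardCharsets.UTF_8;

final class FieldNames {
    // Bytes to String with an explicit charset and no checked exception.
    static String decode(byte[] bytes) {
        return new String(bytes, UTF_8);
    }

    // String to bytes, the same way in the other direction.
    static byte[] encode(String s) {
        return s.getBytes(UTF_8);
    }
}
```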

[link]

From: Carey Evans (Feb 10 2015, at 21:34)

The behaviour of the String constructor with a character set name is undefined when the input isn't valid UTF-8. In practice, though, it's the same as when a Charset is provided.

In any case, your hypothetical code reviewer should have pointed you to java.nio.charset.StandardCharsets, where Oracle fixed this in the same way as you did.
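
To make the invalid-input case concrete, here is a sketch under a couple of assumptions. As far as I can tell from the Javadoc, the Charset overload of the String constructor replaces malformed input with the charset's replacement string (U+FFFD for UTF-8), while a CharsetDecoder configured to report errors throws instead. The class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MalformedUtf8 {
    // Lenient: malformed bytes are replaced rather than rejected.
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: malformed bytes cause a CharacterCodingException.
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

If silently substituted characters in a field name would be a problem, the strict variant is the one that tells you about it.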

[link]

From: Tim K (Feb 10 2015, at 21:38)

I went through a similar process some time ago, but I try to avoid having a "junk drawer".

The approach I came up with looks like this:

enum Encodings {
    DEFAULT(Charset.defaultCharset()),
    UTF8("UTF-8"),
    LATIN1("ISO-8859-1"),
    // ... all other ones we require

    public String toString(byte[] b)...
    // ... other charset-dependent methods
}

Then use it like this:

String s = Encodings.UTF8.toString(bytes);

Reader r = Encodings.LATIN1.reader(is);
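
One way the elided parts of such an enum could be filled in, as a sketch only. It assumes a single constructor taking a Charset (using the StandardCharsets constants rather than charset names), and the toBytes method is a hypothetical addition that is not in the comment above.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

enum Encodings {
    DEFAULT(Charset.defaultCharset()),
    UTF8(StandardCharsets.UTF_8),
    LATIN1(StandardCharsets.ISO_8859_1);

    private final Charset charset;

    Encodings(Charset charset) {
        this.charset = charset;
    }

    // Decodes bytes using this constant's charset.
    public String toString(byte[] b) {
        return new String(b, charset);
    }

    // Hypothetical companion method: encodes a string using this constant's charset.
    public byte[] toBytes(String s) {
        return s.getBytes(charset);
    }

    // Wraps a byte stream in a Reader that decodes with this constant's charset.
    public Reader reader(InputStream is) {
        return new InputStreamReader(is, charset);
    }
}
```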

[link]

From: Paul Clapham (Feb 10 2015, at 21:54)

I had that code. I used to call that a Yoda exception... "There is no try or not try..."

But when I found out about that new feature, it became the first thing on my to-do list.

[link]

From: Karl Ostendorf (Feb 11 2015, at 00:08)

Perhaps not a best practice, but we used to just set the default file encoding system property: java -Dfile.encoding=UTF-8 ...

[link]

From: Jean Hominal (Feb 11 2015, at 01:26)

I am still working on systems where the platform encoding is ISO-8859-1, and that have a very hard time moving away because of the number of programs that assume it to be the default encoding...

So I would argue that yes, the first version of the code can easily yield different results, unless you specify that "this library only runs as expected on systems where the platform encoding is UTF-8", which is a backward requirement to make (that is, programs should not unduly constrain OS configuration). Even though I want UTF-8 everywhere...

[link]

From: Bruno (Feb 11 2015, at 01:36)

If you want to assert that the checked exception cannot occur, why not just catch it and do:

throw new AssertionError();

The weakness of checked exceptions is that the API designer has to decide whether the exceptional condition is important enough to force some handling.

Nevertheless, I find checked exceptions very useful to structure error handling in some cases. In an MVC webapp, the model and some parts of the controller will do database operations that may throw SQLException. This should be handled before the controller starts emitting the view.

[link]

From: Jilles van Gurp (Feb 11 2015, at 04:41)

Yes, highly annoying. What's more annoying is the fact that the default encoding differs per platform. OS X for example does not default to UTF8. Linux does, but only if your user has the right env settings. This of course is not necessarily true when launching your process from an init.d script, which runs as root. I found all of this out the hard way. For this reason, have your continuous integration environment default to something wild and exotic to flush out any sloppy coding.

In any case feeding strings to json parsers is probably a bad idea. You should be passing in a stream and not buffer input into strings.
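
A rough sketch of the streaming approach, assuming your parser accepts a java.io.Reader. The class name is invented, and the drain method is only a stand-in for a real parser so the example runs on its own; an actual parser would consume the Reader incrementally instead of building a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class WireInput {
    // Wraps raw bytes in a Reader with an explicit charset,
    // so decoding never depends on the platform default.
    static Reader utf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Stand-in for a streaming parser: reads everything the Reader produces.
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```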

[link]

From: Craig R Ewert (Feb 11 2015, at 08:57)

> Best Practice to call the constructor with the “charsetName” argument because otherwise it’ll use “the platform’s default charset”

I think you should push back on your less-than-sensible reviewer here. The reason we HAVE defaults is so we can not mention them when we use them. Your first line of code was the best one.

Or do what I do: Curse Gosling, and his robot boats.

[link]

From: Art (Feb 11 2015, at 10:47)

"I can’t conceive of any circumstances in which any of the three versions of this code presented here will produce results different from any of the others."

Run the code on Windows and it will be wrong. Worse, it will work for ASCII, but fail as soon as a "special character" is in the JSON.

Your reviewer is absolutely right to complain.
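
A small sketch of that failure mode. It uses ISO-8859-1 as a stand-in for a non-UTF-8 platform default (an assumption for illustration; the actual default on a given Windows machine may be a different single-byte charset), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    // Encodes text as UTF-8, as it would arrive over the wire, then decodes it
    // with the charset we are pretending is the platform default.
    static String decodeAs(String text, Charset pretendDefault) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, pretendDefault);
    }
}
```

With these definitions, decodeAs("name", StandardCharsets.ISO_8859_1) returns "name" unchanged, while decodeAs("café", StandardCharsets.ISO_8859_1) returns "cafÃ©": the two UTF-8 bytes of "é" are each decoded as a separate Latin-1 character.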

[link]

From: Gavin (Feb 11 2015, at 13:46)

There's a static for exactly that in Java 7.

http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html?is-external=true#UTF_8

[link]

From: Caleb Powell (Feb 12 2015, at 11:26)

Curious, is there a specific reason why the java.nio.charset.StandardCharsets class was not implemented as an enum type? Possibly due to a concern that people would start relying on the ordinal value of a specific instance?

[link]

From: Caleb Powell (Feb 12 2015, at 11:41)

I guess having to type StandardCharset.UTF_8.getCharset() would be a bit of a pain.

[link]

February 09, 2015

I am an employee of Amazon.com, but the opinions expressed here are my own, and no other party necessarily agrees with them.

A full disclosure of my professional interests is on the author page.