Editing JSON

As sort of a 2% project, I’m helping out over in the IETF, working on a revision of the JSON spec.

I wrote back in February about the depressing floppiness of the JSON spec, which allows things that are just bugs to people like me who use JSON always and only to represent hashes and records and suchlike in network APIs. And if the API is a crypto/authentication thing, those bugs can be nasty exploits. (Think “duplicate key” or “naked surrogate”, and shudder.)
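To make the "duplicate key" worry concrete, here is a minimal sketch, assuming Python's standard `json` module. The sample document and the `reject_duplicates` helper are hypothetical illustrations, not anything taken from RFC 4627 or the new draft.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises on a repeated key instead of keeping the last one."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

# Hypothetical payload: the same key appears twice with conflicting values.
doc = '{"user": "alice", "admin": false, "admin": true}'

# Default behaviour in this parser: the last value silently wins.
lenient = json.loads(doc)

# Strict behaviour: the duplicate is reported instead of resolved.
try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

Two parsers that resolve the duplicate differently would disagree about whether this user is an admin, which is the kind of gap an authentication API can't afford.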

What I hadn’t realized was that there actually isn’t a standalone anything you can link to and say “This is the JSON spec”; RFC 4627 is just a mime-type registration. So the IETF made a Working Group to write one, with the constraints that it can’t actually fix the problems with 4627; i.e. anything that is currently considered JSON, no matter how broken, can’t be de-JSON-ified.

What the WG can do is fix a couple of errata, document where the stupid things that 4627 allows can lead to breakage, and turn it into a spec, not just a registration doc. Fair enough.

I pitched in on a low-bandwidth level, suggesting a couple of chunks of spec language here and there. I enjoy arguing about markup-language syntax, so it was a pleasant background activity. Anyhow, the time came to produce the next draft, and Douglas Crockford, to whom we all owe a major debt for having crafted JSON, couldn’t be found to do it. So they asked me if I would and I did (real HTML version here); a couple hours’ work while watching MLB.tv.

No real news of any significance here. But I’m amused, because there’s a file I’m editing called “json.xml”, sort of like how, fifteen years ago, I was putting cycles into editing a file called “xml.xml”.

Contributions


From: David (Sep 19 2013, at 19:14)

For me at least, the URL for the list of drafts comes across as http:/⁠/⁠datatracker.ietf.org/⁠drafts/⁠current/⁠.

(several ampersand pound 8288 semi colons rather than forward slashes)


From: Andrew (Sep 19 2013, at 20:17)

Aside: are "curly brackets" and "square brackets" real things now, even in IETF docs?

Are parentheses doomed to be renamed "rounded brackets" someday soon? Don't get me started on "angle brackets".. :-)

Glad to see that a solidus is still a thing, though.


From: Brett Slatkin (Sep 19 2013, at 21:00)

Sounds fun! Looking forward to seeing it evolve.


From: Hanan Cohen (Sep 20 2013, at 00:18)

I have Googled ["naked surrogate" unicode -sex] and other combinations but didn't find an explanation of what it is.

Can you please explain?

Thanks.


From: Deron Meranda (Sep 21 2013, at 13:54)

I know from having been involved with several JSON implementations in different languages, that there are still some real-world interoperability concerns that have not yet been addressed. These include:

* Some implementations cannot handle strings containing the character U+0000, however encoded or escaped. This most often occurs when trying to embed raw binary data into a JSON value, and usually affects C-language-based implementations.

* Negative zeros, whether -0 or -0.0, are distinct from 0 and 0.0. The IEEE floating-point standard, as well as JavaScript, requires the negative sign for floating-point zeros to be preserved. Other implementations may not preserve it. So encoding a -0.0 in JSON may lead to interoperability problems. I've not seen any implementation that preserves the sign for integer zeros.

* Integer vs. floating point. The JSON spec makes no distinction between integer and floating-point numbers. However most, but not all, JSON implementations do, so again this can lead to interoperability problems. Some implementations may treat 1.0, 1E3, 1000E-2, etc. as integers, as mathematically they are. Adding some language about the "interpretation" of numbers in JSON may be welcome.
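The three points above can be probed in any given implementation. As a rough sketch, here is how one parser, Python's standard `json` module, happens to behave; other implementations may differ, and that difference is the whole concern. The variable names are just for illustration.

```python
import json
import math

# U+0000: this parser accepts the escaped form and yields a one-character string.
nul = json.loads('"\\u0000"')
nul_roundtrip = json.dumps(nul)

# Negative zero: the sign survives for the float form, but "-0" is parsed as
# an integer and the sign is lost.
neg_float = json.loads("-0.0")
neg_int = json.loads("-0")
float_sign_kept = math.copysign(1.0, neg_float) == -1.0

# Integer vs. floating point: the lexical form decides the type here, so
# mathematically integral values such as 1.0 and 1E3 come back as floats.
kinds = {text: type(json.loads(text)).__name__
         for text in ("1000", "1.0", "1E3", "1000E-2")}
```

A consumer that checks for an integer type would accept `1000` from this parser but reject `1E3`, even though both denote the same number.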


From: Deron Meranda (Sep 22 2013, at 13:54)

Hanan, re the "naked surrogates": it's a historical artifact of Unicode. In its early days, Unicode used 16 bits to represent characters. Later it was determined that this wasn't enough, and the range was expanded to (almost) 21 bits.

Old applications that were coded to use two bytes per character were out of luck. So rather than start completely over, a clever encoding hack, called UTF-16, was invented. It allowed those new "code points" that were beyond the 16-bit numeric range to be represented by using a pair of 2-byte values (for 4 bytes in total). Each half of one of these special pairs is called a "surrogate", and surrogates must be in the numerical range of 0xD800 to 0xDFFF. Most importantly, Unicode explicitly says those values are not to be considered as "characters", and they may only appear in "pairs".

Thus a single surrogate value by itself (i.e., a "naked surrogate") is illegal in Unicode, and should be illegal in well-formed JSON too.

JSON, being a derivative of JavaScript, uses this surrogate pair encoding technique to be able to represent any character from the entire Unicode repertoire.
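The pairing arithmetic described above can be sketched in a few lines. This is an illustration only: `to_surrogate_pair` is a hypothetical helper, U+1F600 is an arbitrary example of a code point beyond the 16-bit range, and the decoding step assumes Python's standard `json` module.

```python
import json

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits land in 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits land in 0xDC00..0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)

# Write the pair as two JSON \u escapes inside a JSON string literal.
escaped = '"\\u{:04x}\\u{:04x}"'.format(high, low)

# This parser recombines the two escapes into a single character.
decoded = json.loads(escaped)
```

Here the pair works out to 0xD83D and 0xDE00, and the decoded string contains exactly one character.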


From: John Roth (Sep 23 2013, at 07:08)

Just to expand on the naked surrogate explanation: according to the standard, surrogates can only be used in UTF-16 encodings; whether individually or in pairs, they're invalid in both UTF-8 and UTF-32 encoded text.

Enforcement of this rule is very spotty, and depends on the programming language. Go, for example, appears to enforce it, while IIRC, Python doesn't. This can cause interesting interoperability problems.
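As one data point on how spotty enforcement is, here is a small sketch of what Python's standard `json` module does with a lone surrogate escape. The `find_lone_surrogates` helper and the sample string are hypothetical; a stricter application could run a check like this after parsing.

```python
import json

def find_lone_surrogates(text):
    """Return the positions of surrogate code points in a decoded Python string.

    The json parser has already merged valid pairs, so anything left in
    0xD800..0xDFFF is unpaired.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]

# A JSON string ending in a high surrogate with no low surrogate after it.
lone = json.loads('"ab\\ud83d"')      # accepted without complaint
positions = find_lone_surrogates(lone)

# The problem only surfaces later, when the string has to become UTF-8.
try:
    lone.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
```

The parse succeeds and the failure is deferred to whatever code later writes the string out, which is exactly where an interoperability bug is hardest to trace.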


September 18, 2013

By Tim Bray.

