On XML Language Design

If you’re going to be designing a new XML language, first of all, consider not doing it. But if you really have to, this piece discusses the problems you’re apt to face and offers some advice on improving your chances of success.

Expect Semantic Gaps · “There are only two hard things in Computer Science: cache invalidation and naming things,” said Phil Karlton. Designing an XML vocabulary is all about agreeing on names for things, and thus it’s hard. ¶

The disease’s symptoms are semantic gaps, painfully familiar to anyone who’s done application integration at a large scale. It’s astounding how similar-sounding things can have completely different meanings; the example that burns brightest in my mind right now happened in the Atom Working Group last year; we spun our wheels for weeks and weeks trying to agree what an entry’s “Last-Updated” date really meant. No, I’m not kidding.

Agreeing on what the objects you’re dealing with are, and on what to name them, is often a process that can try the patience of saints; and few saints are engaged in XML language building.

Manage the Process · Language design is a business process. It’s not quite like product development or marketing-campaign design or negotiating an OEM contract, but it’s not entirely unlike any of those things. Thus, it needs to be managed, and a crucial component of success is the quality of that management. ¶

This means that if you are running one of these, you’ll need to focus on good old-fashioned business values like accountability, transparency, goal-setting, and information sharing. Each and every language-design effort has the potential to blow through its deadlines and trail along forever, and offers many chances to fall into the bikeshed trap.

A good standards process manager has to be impatient, courteous, skilled at listening, and good at relentless polite follow-ups to hold people to their commitments, because many of the people involved in these processes are doing it in their spare time or would rather be doing something else.

The Syntax-vs-Model Wars · There are two completely different (and fairly incompatible) ways of thinking about language invention. The first, which I’ll call syntax-centric, focuses on the language itself: what the tags and attributes are, and which can contain which, and what order they have to be in, and (even more important) on the human-readable prose that describes what they mean and what software ought to do with them. The second approach, which I’ll call model-centric, focuses on agreeing on a formal model of the data objects which captures as much of their semantics as possible; then the details of the language should fall out. ¶

My description of the two sides cannot hope to be fair, since I am firmly in the syntax-and-prose camp, and frequently am simply unable to understand the diagrams produced by the data modeling tools. Having said that, even someone who is generally leery of model-centric design can occasionally appreciate the benefits of formalisms: Mark Pilgrim wrote up a very nice explanation of how RDF-based modeling clarified some of the design of Atom.

Whichever approach you take, be warned: a conflict between these approaches will quite likely arise, and will be made worse by the fact that many members of both camps don’t see a reason for the other camp to exist.

Minimalism vs. Completeness · The single most important decision regarding the design of a new language is “How big is it?” Which is to say, how much of the problem does it try to solve? Where are the design goals on the spectrum ranging from bare-minimum at one end through nicely-balanced to complete-solution at the other end? ¶

On this issue, I’ll put my heart on my sleeve and quote Gall’s Law: A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

I think the number-one thing that makes language-design projects fail is overreaching, completism, trying to be, in Gall’s terms, “a complex system designed from scratch.” There are a bunch of good solid explanations why the world works this way; my favorite comes from the world of Extreme Programming and it goes like this: until you’ve implemented and deployed a feature, you don’t really understand it. So don’t try to build multiple layers of features on things you don’t really understand.

I’ve made the same point from another angle in my Technology Predictor Success Matrix series, especially in The 80/20 Point.

There are a bunch of cute slogans that apply here: “Dare to do less”, “MPRDV: The Minimum Progress Required to Declare Victory”, “Worse is better”, and “YAGNI” are some of them.

Ignore this at your peril. But it’s not an absolute; the W3C’s XQuery effort seems like it’s actually going to be finished this year, and actually be used; the process has taken over five years of elapsed time and consumed probably a hundred person-years on the part of smart, senior people. Maybe it will have been worth it.

Evolution vs. Stability · After the completeness issue, and closely related to it, the next Really Big Question you’ll have to face is about evolution. That is to say, when you ship release 1.0 of your new language, are you going to disband your working group, or immediately get to work on the next release? ¶

This is not a slam-dunk; there are lots of problems that need multi-level solutions, so you want to do one level at a time. In other cases, you get urgent feedback from the field on release 1.0 that makes it obvious that there just has to be a release 2.0.

Having said that, the advantages to stopping with 1.0 are huge. Here’s why: one of the main success factors for a new XML language is how much software there is out there that does useful things with it. (See my proposed restatement of Metcalfe’s law over in the companion piece.) And software developers love a stable target above all things. If you publish 1.0 of your language and are fortunate enough to get some uptake in the developer community so some useful tools are shipping, bear in mind that if you then release 2.0, you’ve quite likely broken all those tools. You can maybe do this once before the developers are going to start seeing you as part of the problem rather than as part of the solution. They might just give up on you, or they might decide to go on maintaining their 1.0-compatible releases and simply ignore your 2.0, which will then likely go un-used. This is more or less exactly what happened to XML 1.1 which, despite having my name on the cover, I thought was a bad idea and fought every step of the way.

Having said all that, the job is not over once you’ve shipped 1.0; the world changes, and if what you’ve built is useful, people will want to use it in new, different, unforeseen applications, and in ways you didn’t predict. This is a good thing; a symptom of success.

Rather than having your Working Group endlessly revising the language to keep up with the world, I’d recommend arranging that people who want to add stuff to solve their problems can do it, without asking you and without breaking anything. This is called “Extensibility”.

Extensibility · The experience of the Web, and the currently-in-progress exploration of the Web Services territory, has led to a lot of people thinking intensely about extensibility, and some powerful lessons have been learned. ¶

If you’re going to be designing an XML language and you want it to have any kind of future, you owe it to yourself to do some of this thinking too.

The two most important things to think about are called “MustIgnore” and “MustUnderstand”, and they mean about what they sound like; while they encapsulate fine old engineering principles, as labels they are new things in the world, only having achieved common currency in recent years.

MustIgnore · This was an unstated axiom of the World Wide Web. When a browser runs across a weird, unknown tag, it just ignores it. This fact allowed the explosive multidirectional growth of HTML technology back in the nineties. Anybody shipping a browser could, and many did, introduce weird new tags that did weird new things. If people liked what they did, the other browsers would pick them up. Meanwhile, nothing broke, because of the unwritten MustIgnore. ¶

Here’s how MustIgnore is stated in a modern XML language design, Atom 1.0, from section 6.3 of RFC 4287:

Atom Processors that encounter foreign markup in a location that is legal according to this specification MUST NOT stop processing or signal an error. It might be the case that the Atom Processor is able to process the foreign markup correctly and does so. Otherwise, such markup is termed “unknown foreign markup”.

When unknown foreign markup is encountered as a child of atom:entry, atom:feed, or a Person construct, Atom Processors MAY bypass the markup and any textual content and MUST NOT change their behavior as a result of the markup's presence.

When unknown foreign markup is encountered in a Text Construct or atom:content element, software SHOULD ignore the markup and process any text content of foreign elements as though the surrounding markup were not present.

The notion there of “foreign markup” is interesting. In Atom, it includes both markup from other non-Atom namespaces and markup from the Atom namespace that’s not defined in the RFC. In another situation, you might want to distinguish between the two.

But note that this would in principle allow the creation of an “Atom 2.0” in the same namespace without breaking old software.
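To make the MustIgnore rule concrete, here is a minimal sketch of a reader that follows it, assuming Python’s standard xml.etree.ElementTree. The function name read_entry, the sample entry, the example.com extension namespace, and the deliberately short list of “known” children are all illustrative assumptions, not anything taken from RFC 4287 itself. Anything not on the known list counts as foreign, whether it comes from another namespace or from the Atom namespace, and is skipped without an error; foreign elements inside a text field contribute their text as though the tags were not there.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Illustrative subset of the atom:entry children defined in RFC 4287;
# a real processor would cover the full list.
KNOWN_ENTRY_CHILDREN = {
    f"{{{ATOM_NS}}}{name}"
    for name in ("id", "title", "updated", "summary", "content")
}


def read_entry(entry_xml: str) -> dict:
    """Collect the known children of an atom:entry, skipping foreign markup."""
    entry = ET.fromstring(entry_xml)
    fields = {}
    for child in entry:
        if child.tag not in KNOWN_ENTRY_CHILDREN:
            # Unknown foreign markup: no error, no change in behavior.
            continue
        local_name = child.tag.split("}", 1)[1]
        # itertext() yields the text of nested foreign elements too, so the
        # field reads as though the surrounding markup were not present.
        fields[local_name] = "".join(child.itertext()).strip()
    return fields


# Hypothetical entry with two kinds of foreign markup: elements from another
# namespace, and an undefined element in the Atom namespace.
SAMPLE = """\
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:x="http://example.com/ext">
  <id>urn:example:1</id>
  <title>Hello <x:shout>world</x:shout></title>
  <updated>2006-01-09T00:00:00Z</updated>
  <x:mood>cheerful</x:mood>
  <sparkle>from a hypothetical later Atom</sparkle>
</entry>
"""

if __name__ == "__main__":
    print(read_entry(SAMPLE))
```

Run against the sample, this returns only id, title, and updated, with the title reading “Hello world”; the x:mood and sparkle elements simply never show up.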

MustUnderstand · This is the opposite of MustIgnore; you use it when you add an extension to an existing language that you don’t want ignored. For example, maybe you’ve got a new security policy and you don’t want anyone acting on messages unless they’ve been signed with this year’s newer and better digital-signature technology. So you’d have something built into the base language so that when you include one of the new digital signatures in a message there’s a way to say “Do not process this message unless you can verify this signature.” ¶

There are a variety of ways to do MustUnderstand; in SOAP, any field in the message header can have a MustUnderstand flag saying that if you don’t understand that field, you can’t do anything with the whole message. In practice this turns out to be awkward, since a SOAP message can have a lot of headers, and you can’t act on any of them until you’ve read them all because there might be a lurking MustUnderstand.
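The awkwardness is easy to see in a sketch. The following assumes SOAP 1.1, where the flag is a mustUnderstand="1" attribute in the envelope namespace, and uses Python’s standard xml.etree.ElementTree; the function name blocking_headers, the sample message, and the example.com header namespaces are hypothetical. The point is that the whole header has to be scanned before any single header block can safely be acted on.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
MUST_UNDERSTAND = f"{{{SOAP_NS}}}mustUnderstand"


def blocking_headers(envelope_xml: str, understood: set) -> list:
    """Return the tags of mustUnderstand header blocks we cannot handle."""
    envelope = ET.fromstring(envelope_xml)
    header = envelope.find(f"{{{SOAP_NS}}}Header")
    if header is None:
        return []
    # Every header block is inspected before any of them is processed,
    # because a flagged one may be lurking at the end.
    return [
        block.tag
        for block in header
        if block.get(MUST_UNDERSTAND) == "1" and block.tag not in understood
    ]


# Hypothetical message: one optional header, one flagged as mandatory.
MESSAGE = """\
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:a="http://example.com/audit"
               xmlns:s="http://example.com/sig">
  <soap:Header>
    <a:trace>abc123</a:trace>
    <s:signature soap:mustUnderstand="1">PLACEHOLDER</s:signature>
  </soap:Header>
  <soap:Body/>
</soap:Envelope>
"""

if __name__ == "__main__":
    # A receiver that only knows the audit header must refuse the message.
    print(blocking_headers(MESSAGE, {"{http://example.com/audit}trace"}))
```

A non-empty result means the receiver may not do anything with the message at all; an empty one means it can go ahead and ignore whatever headers it doesn’t recognize.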

We were considering a MustUnderstand for Atom, and were thinking of something much simpler: one MustUnderstand element at the top of the document with a list of namespace names, and if software found any namespaces there it couldn’t handle, it was not to proceed.
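For comparison, a check along those lines might be sketched as below. This is purely hypothetical: no such element exists in RFC 4287, and the element name, its namespace, and the whitespace-separated list format are all assumptions made up for illustration, as are the function name may_process and the sample feed. It uses Python’s standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace and element; nothing like this is in Atom.
EXT_NS = "http://example.com/hypothetical-atom-ext"
MUST_UNDERSTAND_TAG = f"{{{EXT_NS}}}mustUnderstand"


def may_process(feed_xml: str, supported_namespaces: set) -> bool:
    """Return True unless the feed lists a namespace we cannot handle."""
    root = ET.fromstring(feed_xml)
    # Only the top level of the document is consulted: one element, one check.
    declared = root.find(MUST_UNDERSTAND_TAG)
    if declared is None:
        return True
    required = set((declared.text or "").split())
    return required <= supported_namespaces


# Hypothetical feed that insists on two extension namespaces.
FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:mu="http://example.com/hypothetical-atom-ext">
  <mu:mustUnderstand>http://example.com/sig http://example.com/geo</mu:mustUnderstand>
  <title>Example</title>
</feed>
"""

if __name__ == "__main__":
    print(may_process(FEED, {"http://example.com/sig"}))
    print(may_process(FEED, {"http://example.com/sig", "http://example.com/geo"}))
```

Compared with the per-header flag in SOAP, the decision here is made once, up front, before any of the rest of the document is examined.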

But in the end we decided to leave MustUnderstand out of Atom. I think that was the right decision; there are going to be some languages where you’ll need it, but be aware that it adds real complexity to implementors’ lives. Also, at the end of the day it’s hard to enforce not only in theory but in practice: telling programmers who’ve already received a message that they’re not allowed to process it may not produce the results you expect.

If it seems like I’ve put more effort into extensibility than everything else put together, that’s because I have, because I think it’s important. If you want to dive a whole lot deeper, look here or here; both have Dave Orchard’s fingerprints on them, which is appropriate because he’s thought about it a whole lot.

Conclusion · Good luck! You’ll need it. ¶

January 09, 2006

By Tim Bray.
