Don’t Invent XML Languages

The X in XML stands for “Extensible”; one big selling point is that you can invent your own XML languages to help you solve your own problems. But I’ve become convinced, over the last couple of years, that you shouldn’t. Unless you really have to. This piece explains why. And, there’s a companion piece entitled On XML Language Design, in case you do really have to.

Even though I’ve spent time recently helping invent an XML language, please lay off the cries of hypocrisy. I’m not chiefly a language designer, and if I can lay claim to any special expertise, it’s primarily as the user of a whole bunch of different XML languages. Too many, in fact; that’s the point.

Neither Easy Nor Fun · Designing XML languages is hard. It’s boring, political, time-consuming, unglamorous, irritating work. It always takes longer than you think it will, and when you’re finished, there’s always this feeling that you could have done more or should have done less or got some essential detail wrong.

Pass/Fail Ratio · At Robin Cover’s invaluable XML Cover Pages, he’s assembled a helpful list of known XML languages, which currently has about 600 members. I’m sure Robin wouldn’t claim the list is comprehensive; these are just the ones that have crossed his radar.

Looking at the list, I have a question: How many of them matter?

Have a look at the list; what do you think? I think it’s a lot fewer than 600. Let’s rephrase that question: How many achieved their designers’ objectives? Same answer, I think.

The conclusion is obvious; if you embark on designing a new XML language, there’s a substantial probability that your effort will not be rewarded with success.

[Sidebar: Of course, to mitigate that risk, when you’ve finished designing your language, you have to take your show on the road and sell that sucker. You can no more hope to succeed without marketing than can any other technology. Some of us like marketing and selling, so for us this is not a big downside. But it will be for others.]

So right there is a good reason not to embark on this kind of thing: it’s really hard, really time-consuming, and there’s an excellent chance that it won’t produce the results you were hoping for. In this life it’s generally a good idea to stay away from projects which are difficult, unpleasant, and have a high chance of failure. And so far I’ve just talked about the personal expenditure of time.

Software Pain · If you’re going to design a new language, you’re committing to a major investment in software development. First, you’ll need a validator. Don’t kid yourself that writing a schema will do the trick; any nontrivial language will have a whole lot of constraints that you can’t check in a schema language, and any nontrivial language needs an automated validator if you’re going to get software to interoperate. Second, if you’re designing a language that will be human-authored, you’re going to have to arrange for there to be authoring software. This either means writing an authoring package from scratch (we’re talking huge money and time and pain), or customizing one of the generalized XML-authoring tools, which is only moderately-less-huge time and money and pain. Finally, there’s the payload software; you wouldn’t be designing a language if you didn’t want to do something with it, other than author and validate it. Someone’s going to have to write that software. Software is expensive.
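To make the validator point concrete, here is a minimal sketch in Python of the kind of check a grammar-based schema such as a DTD can't express: an arithmetic rule that ties one attribute to the sum of others. The `order` and `item` elements and the `total` and `price` attributes are invented for illustration and come from no real vocabulary.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


def _to_decimal(value):
    """Return value as a Decimal, or None if it is missing or not a number."""
    try:
        return Decimal(value)
    except (InvalidOperation, TypeError, ValueError):
        return None


def validate_order(xml_text):
    """Return a list of error strings; an empty list means the document passed.

    The vocabulary here (order/item, total/price) is hypothetical.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    if root.tag != "order":
        return [f"unexpected root element: {root.tag}"]

    errors = []
    items = root.findall("item")
    if not items:
        errors.append("order has no items")

    # Add up the item prices, reporting any that are missing or malformed.
    running_total = Decimal("0")
    for position, item in enumerate(items, start=1):
        price = _to_decimal(item.get("price"))
        if price is None:
            errors.append(f"item {position}: missing or invalid price")
        else:
            running_total += price

    # The cross-field rule: the declared total must match the computed one.
    declared = _to_decimal(root.get("total"))
    if declared is None:
        errors.append("order: missing or invalid total")
    elif not errors and declared != running_total:
        errors.append(f"order: total {declared} does not match item sum {running_total}")
    return errors
```

Even this toy has to make policy decisions (what to report first, whether one error should suppress another), and a real language would have dozens of rules like it. That is the software cost being described.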

But there are other, more important reasons not to invent languages, starting with the network effect.

Restating Metcalfe · Bob Metcalfe, that is, who invented Ethernet and is a really smart guy and will probably be known to history mostly as the originator of Metcalfe’s Law: The value of a network is proportional approximately to the square of the number of nodes.

Here’s a related law: The value of a markup language is proportional approximately to the square of the number of different software implementations that can process it. I could argue this from theory but would prefer to do so by example: HTML. RSS. PDF. (And I’m betting on Atom and XMPP and ODF pretty soon.) Convinced yet? HTML can be used by robotic link followers as well as human-oriented rendering engines. RSS/Atom feeds can be used by event trackers as well as desktop news aggregators. Apple can build a PDF reader that’s better than Adobe’s.

In case it isn’t already obvious, this is why I’m such a hardcore partisan of ODF and so irritated by Microsoft’s refusal to get behind this particular network effect.

Opportunity Cost · I’ve already mentioned that developing an XML language is time-consuming (I’ve never seen it done in less than a year), and then you get to start writing the software that’s going to do whatever it is you need done with that language. But presumably you need that whatever-it-is done right now, or you wouldn’t be working on that problem. Are you prepared to live with the multi-year cost of developing the language and then implementing the software around it?

The Big Five · Suppose you’ve got an application where a markup language would be handy, and you’re wisely resisting the temptation to build your own. What are you going to do, then?

The smartest thing to do would be to find a way to use one of the perfectly good markup languages that have been designed and debugged and have validators and authoring software and parsers and generators and all that other good stuff. Here’s a radical idea: don’t even think of making your own language until you’re sure that you can’t do the job using one of the Big Five: XHTML, DocBook, ODF, UBL, and Atom.

XHTML + Microformats · If you’re delivering information to humans over the Web, even if you don’t think of it as “Web Pages”, it’s almost certainly insane not to use XHTML. Yes, XHTML is semantically weak and doesn’t really grok hierarchy and has a bunch of other problems. That’s OK, because it has a general-purpose class attribute and ignores markup it doesn’t know about and you can bastardize it eight ways from center without anything breaking. The Kool Kids call this “Microformats” and in fact I accidentally invented one on ongoing last November; look at that template and its class attributes.

And of course, if you use XHTML you can feed it to the browsers that are already there on a few hundred million desktops and humans can read it, and if they want to know how to do what it’s doing, they can “View Source”. These are powerful arguments.
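As a rough illustration of the class-attribute trick, the Python sketch below pulls structured data back out of an ordinary XHTML list. The class names `quote`, `symbol`, and `price` are made up for this example and are not a published microformat; the page itself would still render in any browser.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A hypothetical page: plain XHTML, with class attributes carrying the semantics.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Prices</title></head>
  <body>
    <ul>
      <li class="quote"><span class="symbol">ABC</span> <span class="price">12.50</span></li>
      <li class="quote"><span class="symbol">XYZ</span> <span class="price">7.25</span></li>
      <li>Not a quote, just a note.</li>
    </ul>
  </body>
</html>"""


def has_class(element, name):
    """True if name is one of the space-separated tokens in the class attribute."""
    return name in element.get("class", "").split()


def extract_quotes(xhtml_text):
    """Return one dict per <li class="quote">, keyed by the inner span classes."""
    root = ET.fromstring(xhtml_text)
    quotes = []
    for li in root.iter(f"{{{XHTML_NS}}}li"):
        if not has_class(li, "quote"):
            continue
        fields = {}
        for span in li.iter(f"{{{XHTML_NS}}}span"):
            for key in ("symbol", "price"):
                if has_class(span, key):
                    fields[key] = (span.text or "").strip()
        quotes.append(fields)
    return quotes


if __name__ == "__main__":
    for quote in extract_quotes(PAGE):
        print(quote)
```

The parser here is the standard library's generic XML parser, which works only because the page is well-formed XHTML. Nothing new had to be designed, validated, or given an authoring tool.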

DocBook · Suppose you’re building something that needs to go bigger and deeper and richer than XHTML is comfy with, and you want to repurpose it for print and electronic and voice, and you need chapters and sections and appendices and bibliographies and footnotes and so on. DocBook is what you need. It’s got everything you could possibly begin to imagine already built-in, and there are lots of good tools out there to do useful things with it.

ODF · Suppose you’re working with material that’s going to have a lot of workflow around it, and be complex, visually if not structurally, and maybe some day will be printed out and have signatures at the bottom. ODF is what you want. Not the most Web-oriented approach, but on the other hand the authoring tools are more human-friendly than anything else on this list.

UBL · If you’re working with invoices and purchase orders and that kind of stuff (and who isn’t?), do not even think of inventing anything. A whole bunch of smart people have put hundreds of person-years into pulling together the basics, and they did a good job, and it’s ready to go today. Look no further.

Atom · Suppose you think of your data as a list of, well, anything: stock prices or workflow steps or cake ingredients or sports statistics. Atom might be for you. Suppose the things in the list ought to have human-readable labels and have to carry a timestamp and might be re-aggregated into other lists. Atom is almost certainly what you need. And for a data format that didn’t exist a year ago, there’s a whole great big butt-load of software that understands it.
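For a sense of how little it takes to express "a list of labeled, timestamped things" in Atom, here is a small Python sketch that builds a feed with the standard library. The `build_feed` helper, the `tag:example.org` identifiers, and the workflow-step data are all invented for illustration; a production feed would usually also carry links and content.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def _child(parent, name, text=None):
    """Append an Atom-namespaced child element, optionally with text."""
    element = ET.SubElement(parent, f"{{{ATOM_NS}}}{name}")
    if text is not None:
        element.text = text
    return element


def build_feed(feed_id, title, author, items):
    """Build an Atom feed string.

    items is a non-empty list of (entry_id, label, when) tuples, where
    when is a timezone-aware datetime.
    """
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    _child(feed, "id", feed_id)
    _child(feed, "title", title)
    _child(_child(feed, "author"), "name", author)
    # The feed-level timestamp is the most recent entry's timestamp.
    newest = max(when for _, _, when in items)
    _child(feed, "updated", newest.isoformat())
    for entry_id, label, when in items:
        entry = _child(feed, "entry")
        _child(entry, "id", entry_id)
        _child(entry, "title", label)
        _child(entry, "updated", when.isoformat())
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    steps = [
        ("tag:example.org,2006:step-1", "Draft written",
         datetime(2006, 1, 7, 9, 0, tzinfo=timezone.utc)),
        ("tag:example.org,2006:step-2", "Draft reviewed",
         datetime(2006, 1, 8, 12, 0, tzinfo=timezone.utc)),
    ]
    print(build_feed("tag:example.org,2006:workflow", "Workflow steps", "Example Author", steps))
```

Every entry gets the identifier, human-readable label, and timestamp described above, and the result can be handed to software that already understands Atom.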

To the Managers Out There · The next time one of your technical superstars comes into the room and says “We gotta design an XML vocabulary for X”, make them prove they can’t do it with one of the Big Five. And if they can prove it, sigh deeply and budget a couple of years’ delay, and a few thousand more engineering hours.


January 08, 2006

By Tim Bray.
