Not so long ago, I wrote a piece about open document formats. Just today there was an interesting (as always) follow-up from Jon Udell, but what I wanted to address here is Dare Obasanjo’s take, which is pretty well the Microsoft party line (not that Dare’s always a party-line guy): the Office software and its document formats are winners because they allow the use of custom schemas for office documents. That’s more important, they say, than the dodgy licensing terms and the missing pieces. I used to believe that custom schemas for office documents were generally a good idea, but I no longer do. Here’s why.

(Oh, and by the way, I’ve done work on authoring and publishing systems for the Oxford English Dictionary, Random House, the European Union Legislature, Encyclopedia Britannica, Medtronic, and some others, so I may be wrong, but it’s not due to lack of experience.)

History · The first time I saw real descriptive markup was eighteen years ago, and it was a custom tag-set cooked up for the OED. I quickly got with the SGML idea: that you cook up your own tag-set for each problem. SGML never really made it very far outside of the domain of monster publishing systems: Boeing maintenance docs, EU legislation and so on. One of the reasons was the insanely high cost of developing custom tag-sets that actually worked.

Then the Web came along a decade or so ago, and by virtue of having one tag-set (HTML) with semantics shared globally, turned the world inside out. HTML in the early days had plenty of warts, but in the form of modern XHTML, it’s a pretty decent general-purpose document markup language. Just take a minute and consider how many person-years and dollars it’s taken to shake HTML down to the point where it generally just kind of interoperates and there are good authoring environments and so on.

The Cost of Languages · HTML isn’t unusual. Documents are hard to design, and general frameworks for families of documents are even harder. The conventional wisdom back in the day was that to get yourself a good DTD designed, you were looking at several tens of thousands of dollars.

Then, once you’ve got your language designed, you start the hard work on the software. Frameworks like XSLT help, but no significant language comes without a significant cost in software design.

Then, if it’s an office document format, well then let’s assume that people are going to want to edit it by hand. Which means you’re going to need to customize your editor to make that smooth; and bear in mind that the victims, er, users are probably non-technical content specialists who have no time for or patience with content models and attribute namespaces and that kind of thing.

There used to be a bunch of companies that sold such authoring environments; a few still survive, but none of them ever made much money. The cost of customizing one of these products for a particular new language, and getting production-ready polish on it, involved a lot of effort and, usually, some nontrivial software development.

It took years and years and years to build adequate authoring environments for HTML; why should we expect any other custom language to be easier? By the way, Lauren was in the trenches with one of these vendors for years, and knows the pain as well as anyone.

Interoperability · Here’s the real dirty secret: every time you cook up your own tag-set, you lose interoperability. The deep semantics that XML tags are labels for can’t be captured by any one of a schema or a write-up or lunchroom chats or running code; they need all of these things. (The notion, inherent in the phrase “custom schemas”, that a schema captures the essence of a language, is just totally wrong.) The lesson is, to the extent that you can use a language that someone else already wrote, you win.
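To make that concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the `invoice` tag-set, the `structurally_valid` check (a crude stand-in for a real schema), and the two readers. The point is only that a document can pass the structural check while two pieces of running code disagree about what a tag means.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical tag-set: an <invoice> with exactly these children, in this order.
REQUIRED_CHILDREN = ["title", "date", "amount"]

DOC = (
    "<invoice>"
    "<title>Q2 services</title>"
    "<date>03/04/2004</date>"
    "<amount>1500</amount>"
    "</invoice>"
)

def structurally_valid(text):
    """Crude stand-in for schema validation: checks element names and order only."""
    root = ET.fromstring(text)
    return root.tag == "invoice" and [child.tag for child in root] == REQUIRED_CHILDREN

def read_date_month_first(text):
    """One reader assumes month/day/year."""
    raw = ET.fromstring(text).findtext("date")
    return datetime.strptime(raw, "%m/%d/%Y").date()

def read_date_day_first(text):
    """Another reader assumes day/month/year."""
    raw = ET.fromstring(text).findtext("date")
    return datetime.strptime(raw, "%d/%m/%Y").date()

print(structurally_valid(DOC))        # the structure is fine
print(read_date_month_first(DOC))     # one interpretation of <date>
print(read_date_day_first(DOC))       # a different one, from the same bytes
```

A richer schema could pin down the date format, but then the same question comes up for `amount` (dollars? cents? tax included?), and so on down the line. That part lives in the write-ups, the lunchroom chats, and the code.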

Just Documents, Of Course · Of course many “XML Documents” aren’t documents at all; they’re RPC invocations or Jabber conversations or software configuration files or syndication feeds or any of a million other program-to-program things. These are read and written by programs and exist to capture specific semantics, and none of the remarks in this essay so far apply to them, so it’s just fine to make up your own languages; I do it all the time.
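For a sense of how cheap the program-to-program case is, here is a small sketch in Python. The `server-config` language, its element and attribute names, and the `load_config` helper are all made up for this example; nobody edits the file in a customized authoring tool, and the only software that has to understand it is the few lines that read it.

```python
import xml.etree.ElementTree as ET

# A made-up configuration language; every name here is hypothetical.
CONFIG = """
<server-config>
  <listen port="8080"/>
  <cache enabled="true" max-entries="500"/>
</server-config>
"""

def load_config(text):
    """Read the invented config language into a plain dictionary."""
    root = ET.fromstring(text)
    listen = root.find("listen")
    cache = root.find("cache")
    return {
        "port": int(listen.get("port")),
        "cache_enabled": cache.get("enabled") == "true",
        "cache_max_entries": int(cache.get("max-entries")),
    }

settings = load_config(CONFIG)
print(settings)
```

The semantics of each tag are whatever the one reading program says they are, which is exactly why the costs described above don’t show up here.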

But for office documents, the costs of custom schemas are insanely, unbearably high, and the benefits not that great.

What Then? · There is one area in which I disagree pretty seriously with the conclusions of the European Commission that I referred to in that other article. They considered, and rejected, XHTML as a standard office document format. I think that it can do most things you need in a modern office document and has remarkably few real drawbacks.

No, I’m not saying that everyone should use XHTML or the OpenOffice.org formats for every document in the world. But I do think that the cost of rolling your own is a lot higher than you think, and you should really try to avoid doing that if you possibly can.

But with specific reference to XML languages for office documents, I think that, in the interests of openness, interoperability, and reducing friction, fewer is better and one is ideal. I don’t think the OpenOffice.org people should waste their time on custom schemas, which are at best a red herring. And I think deployments of custom schemas in Microsoft Office will happen, but they’ll be at best a small, uninteresting niche market. Just like they always have been.


June 17, 2004