<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns:og='https://ogp.me/ns#' lang='en'>
<head>
<title>ongoing by Tim Bray &#xb7; On Custom Schemas</title>
<meta name='viewport' content='width=device-width, initial-scale=1.0, shrink-to-fit=no'/>
<meta property='og:site_name' content='ongoing by Tim Bray'/>
<meta property='og:title' content='On Custom Schemas'/>
<meta property='og:image' content='/ongoing/misc/podcast-default.jpg'/>
<meta property='og:type' content='website'/>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
<link rel='stylesheet' type='text/css' media='screen' title='serif' href='/ongoing/serif.css' />
<script type='text/javascript' src='//use.typekit.net/ugm7uwx.js'></script>
<script type='text/javascript'>try{Typekit.load();}catch(e){}</script>
<script type='text/javascript' src='/ongoing/ongoing.js'></script>
<link rel='alternate' type='application/atom+xml' title='Atom (full content)' href='/ongoing/ongoing.atom' />
<!-- Generated from XML source code using Perl, Expat, Emacs, Mysql, Ruby, Java, and ImageMagick.  Industrial-strength technology, baby. -->
</head><body itemscope='' itemtype='http://schema.org/Blog'>
<div id='payload'>
<div id='banner'><h1 itemprop='name'>On Custom Schemas</h1><div id='search'><form action="https://www.google.com/search" target="_parent">Search <input size="20" name="as_q" /><input type="hidden" name="hl" value="en" /><input type="hidden" name="ie" value="UTF-8" /><input type="hidden" name="btnG" value="Google+Search" /><input type="hidden" name="as_qdr" value="all" /><input type="hidden" name="as_occt" value="any" /><input type="hidden" name="as_dt" value="i" /><input type="hidden" name="as_sitesearch" value="tbray.org" /></form></div></div>
<div id='center-and-right'><div id='centercontent'>
<p itemprop='description'>Not so long ago, I wrote
<a href='/ongoing/When/200x/2004/06/09/ScienceStreet'>a piece
about open document formats</a>.  Just today there was an interesting (as
always) 
<a href='http://weblog.infoworld.com/udell/2004/06/17.html#a1025'>follow-up
from Jon Udell</a>, but what I wanted to address here is
<a href='http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=eeb0c3e1-b8a0-48da-8c1a-4701b6fd16de'>Dare
Obasanjo’s</a> take, which is pretty well the Microsoft party line (not that
Dare’s always a party-line guy): the Office software and its document formats
are winners because they allow the use of custom schemas for office
documents.  That’s more important, they say, than the dodgy licensing terms
and the missing pieces.
I used to believe that custom schemas for office documents were generally a
good idea, but I no longer do.
Here’s why.</p>

<p id="p-9">(Oh, and by the way, I’ve done work on authoring and publishing
systems for 
the <cite>Oxford English Dictionary</cite>, Random House, the European Union
Legislature, Encyclopedia Britannica, Medtronic, and some others, so I may be
wrong, but it’s not due to lack of experience.)</p>

<p id='p-1' class='p1'><span class='h2'>History</span> &#xb7; 
The first time I saw real
<a href='/ongoing/When/200x/2003/04/09/SemanticMarkup'>descriptive
markup</a> was eighteen years ago, and it was a custom tag-set cooked up for
the <cite>OED</cite>.  I quickly got with the SGML
idea: that you cook up your own tag-set for each problem.
SGML never really made it very far outside of the domain of monster
publishing systems: Boeing maintenance docs, EU legislation and so on.
One of the reasons was the insanely high cost of developing custom tag-sets
that actually worked.</p>

<p id="p-2">Then the Web came along a decade or so ago, and by virtue of having
one 
tag-set (HTML) with semantics shared globally, turned the world inside out. 
HTML in the early days had plenty of warts, but in the
form of modern XHTML, it’s a pretty decent general-purpose document markup
language.
Just take a minute and consider how many person-years and dollars it’s taken
to shake HTML down to the point where it generally just kind of interoperates
and there are good authoring environments and so on.</p>

<p id='p-3' class='p1'><span class='h2'>The Cost of Languages</span> &#xb7; 
HTML isn’t unusual.
Documents are hard to design, and general frameworks for families of
documents are even harder. 
The conventional wisdom back in the day was that to get yourself a good DTD
designed, you were looking at several tens of thousands of dollars.</p>

<p id="p-4">Then, once you’ve got your language designed, you start the hard
work on 
the software.  Frameworks like XSLT help, but no significant language comes
without a significant cost in software design.</p>

<p id="p-5">Then, if it’s an <strong>office</strong> document format, well
then let’s 
assume that people are going to want to edit it by hand.
Which means you’re going to need to customize your editor to make that
smooth; and bear in mind that the <s>victims</s> users are probably
non-technical content specialists who have no time for or patience with content
models and attribute namespaces and that kind of thing.</p>
 
<p id="p-6">There used to be a bunch of companies that sold such authoring
environments; a few still survive, but none of them ever made much money.
The cost of customizing one of these products for a particular new language,
and 
getting production-ready polish on it, involved a lot of effort and, usually,
some nontrivial software development.</p>

<p id="p-7">It took years and years and years to build adequate authoring
environments for HTML; why should we expect any other custom language to be
easier?
By the way,
<a href='http://www.laurenwood.org/anyway/'>Lauren</a> was in the trenches
with one of these vendors for years, and knows the pain as well as anyone.</p>

<p id='p-11' class='p1'><span class='h2'>Interoperability</span> &#xb7; 
Here’s the real dirty secret: every time you cook up your own tag-set, you
lose interoperability. 
The deep semantics that XML tags are labels for can’t be
captured in any one of a schema or a write-up or lunchroom chats or running
code; they need all of these things.
(The notion, inherent in the phrase “custom schemas”, that a schema captures
the essence of a language, is just totally wrong.)
The lesson is, to the extent that you can use a language that someone else
already wrote, you win.</p>

<p id='p-12' class='p1'><span class='h2'>Just Documents, Of Course</span> &#xb7; 
Of course many “XML Documents” aren’t documents at all; they’re 
RPC invocations or Jabber conversations or software configuration
files or syndication feeds or any of a million other program-to-program
things.
These are read and written by programs and exist to capture specific
semantics.  None of the remarks in this essay so far apply to them, so it’s
just fine to make up your own languages; I do it all the time.</p>

<p id="p-14">But for office documents, the costs of custom schemas are
insanely, 
unbearably high, and the benefits not that great.</p>

<p id='p-13' class='p1'><span class='h2'>What Then?</span> &#xb7; 
There is one area in which I disagree pretty seriously with the
conclusions of the European Commission that I referred to in that other
article.  They considered, and rejected, XHTML as a standard office document
format.  I think that it can do most things you need in a modern office
document and has remarkably few real drawbacks.</p>

<p id="p-16">No, I’m not saying that everyone should use XHTML or the
OpenOffice.org formats for every document in the world.  But I do think that
the cost of rolling your own is a lot higher than you think, and you should
really try to avoid doing that if you possibly can.</p>

<p id="p-15">But with specific reference to XML languages for office
documents, I think 
that, in the interests of openness, interoperability, and reducing friction,
fewer is better and one is ideal.  
I don’t think the OpenOffice.org people should waste their time on custom
schemas, which are at best a red herring.
And I think deployments of custom schemas in Microsoft Office will
happen, but they’ll be at best a small, uninteresting niche market.
Just like they always have been.</p>

<hr />
<div id='commentHere'></div>
<div id='footer'><p class='footer'><b>Updated: 2004/06/18</b></p>
</div>
</div>

<div id='rightcontent'><div class='oo'><a id='to-home' href='https://www.tbray.org/ongoing/'><span id='home'>ongoing</span></a></div>
<div>
<div class='principles'>
<a href='/ongoing/WhatItIs'>What this is</a> &#xb7;
<a href='/ongoing/ongoing.atom'><img title="Subscribe to ongoing" alt="Subscribe to ongoing" src="/ongoing/Feed.png"/></a><br/>
<a href='/ongoing/Truth'>Truth</a> &#xb7;
<a href='/ongoing/Biz'>Biz</a> &#xb7;
<a href='/ongoing/Tech'>Tech</a></div>
<a href='/ongoing/misc/Tim'>author</a> &#xb7;
<a href='http://www.textuality.com/BillBray/'>Dad</a><br/>
<a href='/ongoing/misc/Colophon'>colophon</a> &#xb7;
<a href='/ongoing/misc/Copyright'>rights</a>
</div>
<div id='potd'><a id='tnA' href='/ongoing/goto-potd/'><img id='tnI' src='/ongoing/potd.png' alt='picture of the day' /></a></div>
<div id='cats'>
<a href='/ongoing/When/200x/2004/06/'>June</a> <a href='/ongoing/When/200x/2004/06/17/'>17</a>, <a href='/ongoing/When/200x/2004/'>2004</a><br/> &#xb7; <a href='/ongoing/What/Technology'>Technology</a><span class='more'> (90 fragments)</span>
<br/> &#xb7; &#xb7; <a href='/ongoing/What/Technology/XML'>XML</a><span class='more'> (136 more)</span>
</div>

<div class="employ">
<p>By <a rel="author" href="/ongoing/misc/Tim">Tim Bray</a>.</p>
<p>The opinions expressed here <br/>
are my own, and no other party<br/>
necessarily agrees with them.</p>
<p>A full disclosure of my<br/>
professional interests is<br/> 
on the <a href='/ongoing/misc/Tim'>author</a> page.</p>
<p>I’m on <a rel="me" href="https://cosocial.ca/@timbray">Mastodon</a>!</p>
</div>



</div>
</div>
</div>

</body>
</html>