Schemaware for Pie 0.1

I cooked up a RelaxNG schema for Pie/Not-Echo, or whatever you want to call it, in its 0.1 snapshot form; as a side-effect, it also generates a W3C XML Schema. This note includes specific commentary on this schema, general commentary on schemas (summary: Why would you ever use XML Schema?), and some recommendations for pruning Pie/Not-Echo.

Pie.rnc v0.1 · The schema is available at http://www.tbray.org/ongoing/pie/0.1/pie.rnc. As the snapshot versions advance, I’ll try to make sure there are snapshot schemas under directories named by the version number, so the 0.2 schema will be in http://www.tbray.org/ongoing/pie/0.2/pie.rnc, and so on.

While I’ve fooled around with RelaxNG, this is my first attempt to take on something substantial from scratch. It’s perfectly possible that I’ve done this in a way that is stupid or wrong or suboptimal; I’d be delighted to get feedback and will incorporate it to the extent possible. I’ve created a discussion page at the Wiki; feedback there, please.

Here is a list of points with reference to the existing schema, in no particular order:

The schema is written in RelaxNG’s Compact Syntax (tutorial here), thus the extension .rnc; I’ll refer to it as “the RNC” from here on in.
Using James Clark’s wonderful Trang tool, I have generated a W3C XML Schema; I’ll go on doing this, and the XSD can always be found at the same place as the RelaxNG source, with the .xsd extension.
The XSD version doesn’t apply some of the same controls as the RNC version; I’m not enough of an XSD expert to know whether XSD just can’t do this stuff, or whether Trang doesn’t know how to generate the XSD. In particular the XSD doesn’t do the selection magic based on the mode= and src= attributes of <content>. I’d welcome feedback on the quality of the XSD as well as the RNC.
I can’t get Trang to generate a DTD, because there are just too many things in the RNC that have no remote equivalent in DTDs.
I changed the namespace, because the snapshot uses one based in example.com, and it’s just not OK to use that for anything but an example. So for the moment I’m using http://www.intertwingly.net/wiki/pie/.
This version of the schema forces the top-level version= attribute to have the value 0.1. Accepting 1.0 here would just be incorrect and dangerous.
I tried to follow the snapshot as closely as possible.
The elements inside <feed> and <entry> are allowed to appear in any order. I’m not sure this is cost-effective. Since these things are usually going to be machine-generated, it might be a good idea to lock down the order of the elements. It might also be a good idea to force any foreign-namespace elements off into a ghetto at the end of the parent element. It would provide another level of sanity-checking and simplify the lives of those who are doing quickie jobs with regular expressions or whatever.
For <content mode="xml"> (the default), the most common contents will be XHTML. So for the moment, there’s a rule that allows any mixture of elements in the XHTML namespace, with any attributes at all. This means that you have to have a topmost XHTML element (for example <div> or <span> or <body>) immediately inside the <content> element. This will be useful anyhow because you have to have somewhere to declare the XHTML namespace. Alternatively, if you had declared a prefix for the XHTML namespace higher up in the feed, you could just plunge into mixed XHTML content with all the elements prefixed. If there’s demand for that scenario it would be easy enough to rewrite the schema. But requiring a top-level element feels cleaner to me anyhow.
For this cut, I didn’t put in support for embedding other things like the <ent:topic> found in the example. This is trivially easy to add later with RelaxNG; let’s get the base language right first.
I used the Jing tool to validate a slightly-modified version of the example in the snapshot (namespace name, version, and so on). I’m not planning to post the modified version; anyone who is close enough to the problem to care is capable of grabbing Jing and fixing it up themselves. I will also intermittently create a Pie version of the ongoing feed at http://www.tbray.org/ongoing/ongoing.pie; the one there right now validates with pie.rnc.
The RNC makes use of the XSD predeclared datatypes anyURI and dateTime, which are now built in to Jing.
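
To make two of the points above concrete, here is a rough Python sketch of what the version rule and the XHTML-inside-<content> rule amount to. It is not derived from pie.rnc and is no substitute for running Jing; the function name check_feed is made up for illustration, and the sketch simply skips any <content> that has a src= attribute or a mode other than xml.

```python
import xml.etree.ElementTree as ET

# The namespace adopted "for the moment" in the post, plus the XHTML namespace.
PIE_NS = "http://www.intertwingly.net/wiki/pie/"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def check_feed(xml_text):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper: it covers only two rules, not the whole schema.
    """
    problems = []
    root = ET.fromstring(xml_text)

    # Rule 1: the top-level version attribute must be exactly "0.1".
    if root.get("version") != "0.1":
        problems.append("version must be 0.1, got %r" % root.get("version"))

    # Rule 2: xml-mode content (the default) needs at least one child
    # element, and every child element must be in the XHTML namespace.
    for content in root.iter("{%s}content" % PIE_NS):
        if content.get("mode", "xml") != "xml" or content.get("src") is not None:
            continue  # assumption: other modes and src= are out of scope here
        children = list(content)
        if not children:
            problems.append("xml-mode content needs a top-level XHTML element")
        for child in children:
            if not child.tag.startswith("{%s}" % XHTML_NS):
                problems.append("non-XHTML element in content: %s" % child.tag)
    return problems
```

A feed whose <content> wraps its text in an XHTML <div> comes back with an empty list; one with version="1.0", or with bare text directly inside <content>, comes back with one complaint each.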

What Needs Fixing in Pie · The elements and attributes that are in the 0.1 snapshot are OK, except there are too many of them. The following need removal forthwith, simply because previous generations of syndication technology got by without them just fine, and we’re not here to invent stuff:

subtitle · Exactly what can we not do if we don’t have this? What prior art demonstrates its necessity?

weblog/homepage · The debate over in the Wiki had, I thought, some crushing arguments in favor of just having a <web> field per-person; the extras are at best unnecessary and in some cases actively harmful.

content* · Why do we ever need more than one <content> element per entry? This has never been proved necessary in previous syndication formats, and now is the wrong time to invent it. We have the ability to embed XML in the <content> element, and XML provides many nice mechanisms for marking-up lists of things, so anybody who really needs this functionality can work out the bugs in that sandbox until we know what needs to go in at the Pie level.

<content src= · Content-by-reference is a bold new idea, and we don’t need bold new ideas, we need to write down what already works. Once again, <content> can contain XML, and XML provides excellent ways to insert hyperlinks to other things. Work it out there and when you prove that you understand the issues, then it’s a candidate for first-class citizenship in the syndication format itself.

...But the Glass is Half-Full · These gripes aside, the Pie format feels reasonably well-baked to me. All we have to do is lose the superfluous bits, find it a name, sort out a pure-HTTP API and derive XML-RPC and SOAP versions from that (let the market sort ’em out), figure out a neutral, long-lived home for the spec, and declare victory.

RelaxNG vs W3C XML Schemas · I invite people, even those who don’t think they’re schema weenies (I for example am definitely not a schema guy), to have a look at that RelaxNG compact-syntax schema. It’s readable, and it only took me two hours to get it working (that includes downloading the Jing and Trang software, downloading and installing Java 1.4 from Apple, rebooting, and sorting out the usual CLASSPATH hell).

It does some pretty magical things with the allowed content of <content>, based on attribute values. It calls out to precooked definitions of dates and URIs, and it generates XML Schema files for free.
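
For a feel of what a precooked date definition saves you from writing, here is a simplified Python sketch of an xsd:dateTime check. The function name is hypothetical, and the pattern is a deliberate simplification: it ignores negative and more-than-four-digit years, 24:00:00, and range checks on the timezone offset, all of which a real datatype library handles.

```python
import re
from datetime import datetime

# Simplified lexical shape: date, "T", time, optional fractional seconds,
# then an optional "Z" or +hh:mm / -hh:mm offset.
_DATETIME = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})(\.[0-9]+)?"
    r"(Z|[+-][0-9]{2}:[0-9]{2})?"
)


def looks_like_xsd_datetime(text):
    """Rough approximation of an xsd:dateTime check (hypothetical helper)."""
    match = _DATETIME.fullmatch(text)
    if match is None:
        return False
    fields = [int(group) for group in match.groups()[:6]]
    try:
        # Let the standard library reject impossible dates such as Feb 30.
        datetime(*fields)
    except ValueError:
        return False
    return True
```

Even this cut-down version needs a regular expression plus a calendar check, which is the sort of thing you get by just naming the datatype in the schema.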

I’d really like to see a best effort from an XML Schemas maven which duplicates the functionality of the RNC as closely as possible, as readably as possible; and maybe does some more things that the RNC can’t do.

Until I’ve seen that, my provisional conclusion is that XML Schemas are basically second-rate in terms of functionality and usability, and you can get them for free by starting with Relax NG.

So, why would you use anything else?

July 09, 2003

By Tim Bray.
