I originally covered this subject back on February 27th, 2003, the same day that I announced ongoing to the world. I think it’s worth revisiting, because it sure would be handy if there were such a thing as a Web Site, as Dave Winer, Sam Ruby, and Jeremy Zawodny have all observed.

Summary: I think the Jeremy/Dave idea of a site feed directory in OPML or some equivalent is just fine. I think, though, that we’re going to have to point to it from the individual pages rather than try to park it in “a well-known spot on the site.” Note that Joe Gregorio has published a rant along the same lines as this one, but covering some useful ground that I don’t.

The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or one of a dozen daily entries on some blog. On the other hand, there are a lot of things that we’d like to know about a site, including:

  • What pages are in the site.

  • What the home page of the site is.

  • What syndication feeds there are for the site (the problem that got us here today).

  • What the site-owner’s policies are for crawlers (what we now use robots.txt for).

  • What little icon should be displayed in the address bar (what we now use favicon.ico for).

  • What the site’s privacy policies and content ratings are.

  • Where the site’s sitemap is.

And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.

Historically, things like robots.txt and favicon.ico have been jammed into the URI right after the host name. Unfortunately, there are lots of cases where this just doesn’t work. Anyone who’s run a big corporate website has gotten tired of explaining to, say, the HR group why they can’t have their own robots.txt in the root of their space so they can establish their own crawling policies. They don’t want to be pestering the webmaster for every little change, and the webmaster doesn’t want to be making those changes either. (If your setup allows the use of virtual hosts, that helps, but some setups don’t.)
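To make the HR group's problem concrete, here is a minimal sketch of how a crawler arrives at the location of a host-level file like robots.txt. The function name `well_known_url` and the example.com URLs are made up for illustration; the only thing taken as given is that such files are looked up at the root of the host, whatever page you started from.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_url(page_url: str, name: str) -> str:
    """Return where a client would look for a host-level file such as robots.txt.

    Hypothetical helper: only the scheme and host of the page survive;
    the page's own path plays no part in the answer.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/" + name, "", ""))


# Hypothetical pages belonging to two different groups on one corporate host.
hr_page = "http://www.example.com/hr/benefits/index.html"
sales_page = "http://www.example.com/sales/contact.html"

# Both groups end up sharing exactly one robots.txt.
print(well_known_url(hr_page, "robots.txt"))
print(well_known_url(sales_page, "robots.txt"))
```

Whatever the HR group puts at /hr/robots.txt, nothing in this lookup will ever go and fetch it.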

Finding the “Site” Isn’t Simple · There’s just no way, as far as I can tell, to look at a URI and figure out what site it’s from. Some sites just aren’t hierarchical, and sometimes the site isn’t rooted at the top level. For example, the root of ongoing is at http://www.tbray.org/ongoing/, but there are things that are part of ongoing that don’t start with http://www.tbray.org/ongoing/ and there are things elsewhere on http://www.tbray.org/ that are part of other web sites.

In particular, some of the big content management systems have URI-space layouts that have nothing to do with hierarchy. (In general, most of them also have URI-space layouts that suck, but that’s another matter.)

Grabbing Pieces of Namespace Isn’t OK · Now, let’s assume that we could somehow find the “root” of a web site by some magic. I just don’t think it’s OK now in 2003, when we’re maybe 1% of the way into the Web’s lifespan, to start gobbling up little bits of the namespace. As it is, the names robots.txt and favicon.ico are stolen forever; nobody will ever be able to use them for their own purposes again.

What To Do? · I think that the MyFeeds.opml idea is basically sound; there’s lots of room to argue about the merits or deficiencies of OPML but hey, it’s here and it works and it doesn’t get in the way.

So in the short term, I would just arrange for web pages to contain one more link element like so:

<link rel="feed-directory" href="myfeeds.opml" />

It ain’t perfect, but it’ll get the job done. Down the road, if we all manage to agree on how to wire the notion of a “Site” into the Web, one of the things that a “Site File” ought to point to will, of course, be this OPML file. But that’s down the road.
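For what it's worth, here is a rough sketch of what a feed reader might do with such a link, using only Python's standard library. The class and function names (`FeedDirectoryFinder`, `find_feed_directories`) and the example.org page are invented for illustration, and `feed-directory` is just the rel value proposed above, not anything standardized.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedDirectoryFinder(HTMLParser):
    """Collect href values of <link rel="feed-directory"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also arrive here via handle_startendtag.
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, so split before matching.
        rels = (attr.get("rel") or "").lower().split()
        if "feed-directory" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_feed_directories(page_url, html):
    """Return absolute URIs of the feed directories a page points to."""
    finder = FeedDirectoryFinder()
    finder.feed(html)
    # A relative href is resolved against the page that contained it.
    return [urljoin(page_url, href) for href in finder.hrefs]


# Hypothetical page on a hypothetical site.
page_url = "http://www.example.org/blog/2003/10/15/entry.html"
page = """<html><head>
<title>An entry</title>
<link rel="feed-directory" href="myfeeds.opml" />
</head><body><p>Hello.</p></body></html>"""

print(find_feed_directories(page_url, page))
```

Note that the relative href resolves against the page it appears in, so a site with pages scattered across many directories would probably want an absolute path or full URI in the href, so that every page points at the same file.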

October 15, 2003


I am an employee of Amazon.com, but the opinions expressed here are my own, and no other party necessarily agrees with them.

A full disclosure of my professional interests is on the author page.