This is not really for general consumption (although all are welcome); it is a contribution to a massive, lengthy debate that has been swirling for a long time on the W3C TAG public forum as well as the more IETF-centric URI talk shop. It’s just that ongoing not only gives me a nicer writing environment, but gives the rest of the world a better reading environment, than email. Particularly if you use a better browser (if you’re reading this in IE, try ongoing in some other browser flavor just as an experiment). Anyhow, it gets real technical and pretty abstruse starting right here.
What the TAG’s For · Since this argument has a nice well-identified centre, namely the TAG’s in-progress Architecture of the Web document, it’s appropriate to consider what this document (and the TAG, for which it is a key deliverable) is for.
I think our job is to write down the practices and principles that have made the Web work so well, straightforwardly enough that moderately technical people can understand them, and with enough supporting evidence to be convincing. The goal is this: that those who are writing new software or specifying new formats have the option of adopting those practices and principles. Such adoption is worth serious consideration where projects are Web-centric, or where some of the technical constraints are Web-like.
This is not a trivial goal. In my life’s experience, the Web has constituted one of the really important steps forward, a blow against the forces of entropy and stupidity. The project of informing the world how this works, and thus protecting our gains, is one that’s worth doing.
So, explaining the Web-as-it-is would be enough to make me happy. Clearly, we should have an eye to the future, and, in writing down the architecture, try to avoid making life difficult for any others who are working to make something new and important involving the Web. Obvious examples are the Semantic-Web and Web-Services efforts.
But at the end of the day, the success criterion for me is having the success criteria for the Web-as-it-is explained clearly and convincingly. For me, this means spending a lot of time thinking in terms of what the software does: the servers, intermediaries, spiders, and browsers. I've spent substantial time working with each of these kinds of software, and I’m pretty confident I understand how they work today. So, without apology, a lot of the subsequent discussion will be framed in terms of what the software does.
Three Legs · We have good consensus that the three legs of Web Architecture are URIs, which (as the acronym suggests) identify resources; representations (the MIME headers + bag-of-bits combination that your browser eats); and protocols (most importantly HTTP).
And in fact, if you look inside the software, you find lots of code to collect, store, index, look up, and dereference URIs. You find even more that reads representations and does one thing or another, most commonly display them to humans. And finally, there is a large amount of code implementing Web protocols like HTTP. (By the way, for those who care about HTTP, Mark Pilgrim’s back-to-basics tour of the landscape is a must-read.)
As I look around the landscape of Web software, I don’t seem to encounter any that processes or uses or considers Resources.
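Here is a rough sketch, in Python, of how those three legs tend to show up inside a program. The names `Representation`, `media_type`, and `protocol_for` are made up for illustration and come from no spec or library; the fallback media type is likewise an assumption of the sketch. Note what is absent: there is a URI string, a representation, and a protocol, but no type anywhere for a Resource.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Representation:
    """MIME headers plus a bag of bits (hypothetical type, for illustration)."""
    headers: dict = field(default_factory=dict)
    body: bytes = b""

    def media_type(self) -> str:
        # Content-Type with any parameters stripped; the fallback value
        # is an assumption of this sketch, not something the post specifies.
        raw = self.headers.get("Content-Type", "application/octet-stream")
        return raw.split(";")[0].strip().lower()


def protocol_for(uri: str) -> str:
    """Return the URI scheme, which is how software picks a protocol to speak."""
    return urlsplit(uri).scheme


rep = Representation({"Content-Type": "text/html; charset=utf-8"}, b"<html></html>")
print(protocol_for("http://www.tbray.org/ongoing/"), rep.media_type())
```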
Do We Need Resources At All? · There is little or no software or spec-ware that says much of anything about Resources. If you tour around the specification of HTTP in RFC 2616, the language occasionally does talk about resources: code 201 says a new resource has been created, and code 304 says “the document has not been modified” (sloppy; this is the first we’ve heard of “documents”).
But that’s about it. The software doesn’t go there. If I claim that http://www.tbray.org/ongoing/misc/Tim represents me, Tim Bray, you might raise your eyebrows but it would not cause any software either to break or work better.
So, you could in fact drop the notion of “Resource” from the TAG’s Web Architecture document, and it would work about as well in terms of keeping the software running smoothly.
This would seem a little perverse, though. After all, the “R” in URI ought to stand for something, and if only for the mental comfort of our readers, we ought to say what.
But in practical terms, in the Web as implemented, a resource is simply “that which is named by a URI.” That’s all the system knows about. Any further assumptions about what a Resource can or can’t be, from the Web software’s point of view, are simply vacuous, because they have no observable effect. More damning, such assertions are non-scientific, because there is no falsifiable hypothesis that can be constructed to test them.
Doctor, It Hurts When I Do This · So don’t do it. Trying to make assertions about what resources must be or not be, in the context of the architecture of today’s Web, is a dead-end street. Others may be willing to invest time in arguing propositions whose truth or falsehood has no observable effect, and which are not subject to scientific verification, but I’m not.
Ambiguity is Bad · Having said all that, clearly people have a notion about what a URI identifies: a picture of a sunset, or an XML namespace, or a service for booking airline tickets.
Everyone agrees that when you get confused about what’s being identified, this is a bad thing and makes the Web less useful. As TimBL has said repeatedly: a resource can’t be both a person and a picture of a person.
Unfortunately, such ambiguity is not a condition that Web software can detect. So this problem clearly has to be dealt with at the social level; it’s a problem of policy or of management, not of technology. The Web Architecture document should both (as it does) note the problem, and point out that it’s in the social/policy domain.
“Information Resources” · In recent days, there has been much ballyhoo about this notion. The idea is that “everybody knows” that some resources are just HTML or graphic files on a server somewhere, and others are just names, for example XML Namespace Names or an RDF #-URI identifying a toaster. So there is an attempt to sort the universe of resources into two buckets: those that actually supply and consume information, and those that don’t.
To some extent this might be useful, but let’s not pretend it has anything to do with Web software.
Given any URI, Web software will, on request, try to dereference it.
If it gets a representation, it will do whatever seems most appropriate with it.
If the dereference fails, it fails, and the state of the system is (generally: but see 410 Gone) not changed.
There is no way for the Web software to distinguish between an “Information Resource” and any other kind.
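The behaviour just described can be sketched in a few lines of Python. This is a minimal illustration, not how any particular browser or spider is written; `http_fetch` and `dereference` are hypothetical names, and the ten-second timeout is arbitrary. The thing to notice is the signature: all the function ever receives is a URI string, so it has nothing to go on for telling an “Information Resource” from any other kind.

```python
import urllib.error
import urllib.request


def http_fetch(uri):
    """Fetch a URI with the standard library; returns (headers, body)."""
    with urllib.request.urlopen(uri, timeout=10) as resp:
        return dict(resp.headers.items()), resp.read()


def dereference(uri, fetch=http_fetch):
    """Try to get a representation for a URI.

    Returns (headers, body) on success, or None if the dereference fails.
    The fetch argument is injectable so the sketch can be exercised offline.
    """
    try:
        headers, body = fetch(uri)
    except (urllib.error.URLError, OSError, ValueError):
        # It fails, and that is all: no state held by the caller is changed.
        return None
    return headers, body
```

A caller would then do whatever seems most appropriate with the result, for example `result = dereference("http://www.tbray.org/ongoing/")` followed by a check for `None`.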
So let’s please not pretend that this distinction is fundamental to, or even noticeable in, the architecture of today’s Web.
Resource Taxonomies · Despite this skepticism, I’m positive, in fact enthusiastic, about sorting resources into taxonomic buckets and knowing more about them than you can learn by retrieving representations.
I’ve spent a high proportion of my life doing search technology, with a substantial proportion of that Web search. Thus my enthusiasm for the Semantic Web is mostly due to its potential to empower search by giving it more to work with than just the content of representations.
The HTTPRange-14 Fallacy · The Web doesn’t know about information resources. In fact, the Web hardly knows about resources. Any attempt to classify resources by tying them to pieces of Web infrastructure, such as the URI scheme, is just inconsistent with the way the Web actually works. So we shouldn’t go there.
What Can We Do? · Soft-Pedal Resources: We could just not talk about resources in the Architecture document. That wouldn’t get in the way of any software that I know of. But I suspect that this would impair the document’s usefulness as people paged frantically back and forth trying to figure out what URIs identify. Perhaps there’s a middle ground, where we say that the nature of resources is outside the scope of this document, aside from the fact that they are what is named by URIs.
“Name” vs. “Identify”: A huge number of words have been spilled on discussions centering on the multiple usages, both formal and informal, of the word “identify.” It has been contrasted with “denote” and “address” in ways which I find entirely unilluminating. Note that in the last sentence of the previous paragraph I talked about URIs naming resources. I find that it went down nice and smooth and easy, and is at the end of the day possibly even more correct.
Any other ideas? · Let’s hear them. Because at the moment we’re kind of stuck.