Metadata, Semantics and All That

Mr. Shirky’s recent extended polemic against the Semantic Web has been given lots of linkage from more than one influential direction. I’m closer to the Semantic Web project than most, and remain significantly unconvinced, but I don’t think dismissing it is as easy as shooting fish in a barrel, and anyhow shooting fish in a barrel is unsportsmanlike and generally sucks. So here is another take on it all, with an (admittedly somewhat scaled down) vision of what the Next Web might look like.

Where I Stand · As with most people, it depends in part on where I sit. I sit on the W3C TAG, which means I spend a few hours a month and a few days a year working closely with Tim Berners-Lee and Dan Connolly, two of the SemWeb’s chief evangelists. There is no doubt that Tim in particular would like us in the TAG to orient our work more towards the architecture of his vision of the Next Web, rather than the boring old current edition. There is also no doubt that the TAG contains several people who really haven’t bought into that vision.

I see both sides of it: I think there is some justice in the accusation that the SemWeb is trying to resuscitate the (very expensive) twenty-year-old failed dreams of the Knowledge-Representation visionaries. In fact, I have some history here. Back in the Eighties when I was just getting started in this business, a lot of people threw a lot of time and a whole lot of money into the visions of the AI practitioners; AI was going to revolutionize life, replace human medical diagnosticians, make Reagan’s “Star Wars” technology work, and cement Japan’s insurmountable economic lead over the rest of the world. I took the trouble to dive pretty deep into Prolog (something you don’t hear much about now) but at the end of the day thought it was a bunch of bullshit, and said so, and suffered some temporary career damage and subsequently some credibility enhancement.

Having said all that, as Tim is prone to point out, lots of people used to say (pre-Web) that Hypertext was a grand failed experiment too. He’s right; I was there and I heard them. So in the big picture, I think that (a) just because something hasn’t worked so far doesn’t mean it will go on not working, and (b) betting against Tim Berners-Lee is something that ought to make you feel nervous.

It’s in the Metadata · Right at the moment, a lot of the Semantic Web theory is doomed to remain just that, theory, because it relies on the existence of a critical mass of metadata, and if we’ve learned one thing in recent decades it’s that there is no such thing as cheap metadata. I’m pretty convinced that if you could build up a lot more metadata you could make the Web a more useful place, and I’ve thought a whole lot about this problem over the years, and really haven’t made much progress. I tend to think that the incentives for authors to provide metadata are slowly increasing in strength, and technology (such as what I build at work) is making it easier to provide, so there are grounds for optimism. But it’s a slow process.

Machine-Readability · If you listen closely to the Semantic Web premise, when they say “semantic” they mean in some part “machine-readable” and I think that this alternate formulation may show a useful way forward. I first heard the idea I’m about to present from R. V. Guha.

Right now, if I hear of some company by name (for example, let’s imagine a company called “Example Corporation”) I know that if I stick www. in front of the name and .com after it, then I can point a web browser at www.example.com and find out a bunch of stuff, including:

How long they’ve been around.
Whether I know anyone in their management.
Whether they’re private or public.
If they’re private, who their investors are, and if they’re public, a whole lot of detailed financial information.
Where their office is.
What their phone number is.
Whether they’re hiring.
What the names of their products are.

Of course, when I say “I can find out” I mean that I as a human can plow through Flash intros and HTML pages and PDF printouts to laboriously hand-assemble all this useful information. Which basically sucks.

So imagine that given any www.example.com, I could count on there also being a data.example.com, which would typically have all these facts available in some straightforward XML dialect, so that I could use a program to do the tedious basic factfinding work.
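
To make that concrete, here is a minimal sketch in Python of what such a program might look like. Everything specific in it is an assumption made up for illustration: no such XML dialect exists, the element names (founded, ownership, hiring and so on) are hypothetical, and the sample values are invented. The document is embedded as a string rather than fetched, since data.example.com is imaginary.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: this XML dialect and its element names are invented
# for illustration; they do not correspond to any real standard.
SAMPLE_FACTS = """\
<company name="Example Corporation">
  <founded>1994</founded>
  <ownership>public</ownership>
  <office>123 Main Street, Anytown</office>
  <phone>+1-555-0100</phone>
  <hiring>true</hiring>
  <products>
    <product>Widget</product>
    <product>Gadget</product>
  </products>
</company>
"""


def data_host(web_host: str) -> str:
    """Map www.example.com to data.example.com (the imagined convention)."""
    prefix = "www."
    if web_host.startswith(prefix):
        return "data." + web_host[len(prefix):]
    return "data." + web_host


def parse_facts(xml_text: str) -> dict:
    """Pull the basic company facts out of the hypothetical XML document."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "founded": int(root.findtext("founded")),
        "ownership": root.findtext("ownership"),
        "office": root.findtext("office"),
        "phone": root.findtext("phone"),
        "hiring": root.findtext("hiring") == "true",
        "products": [p.text for p in root.findall("products/product")],
    }


if __name__ == "__main__":
    facts = parse_facts(SAMPLE_FACTS)
    print(data_host("www.example.com"))
    for key, value in facts.items():
        print(f"{key}: {value}")
```

The point is not the particular tags but that a dozen lines of code replace an afternoon of clicking through Flash intros and PDFs.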

XBRL · In this context, it’s interesting to note that I gave a keynote speech last week at the 8th XBRL International Conference; XBRL is Extensible Business Reporting Language, designed to capture all the minutiae of financial statements in a machine-processable form. The value proposition for XBRL is a no-brainer: cost reduction. The financial industry depends totally on consuming accurate financial information, and since at the moment this is generally available only on paper, or if electronically, in PDF or the hardly-more-useful EDGAR version, a huge amount of time and error-prone human effort goes into extracting and repackaging it.

Of course, if companies as a matter of routine posted XBRL versions of their financials at addresses like data.ibm.com and data.renault.com and data.hsbc.com and data.daimler-chrysler.com, a huge amount of time and money would be saved. And you’d have taken some useful steps towards a machine-processable web.

Now, I wouldn’t go so far as to assert that all the inferencing-machinery goodness that TimBL prophesies and that Clay Shirky pisses on, both from a great height, would spontaneously emerge. However, if all of a sudden there were a million machine-readable business facts there for anyone to read, I think that quite a few software-savvy and accounting-savvy entrepreneurs would retreat into their garages and there would be some considerable surprises in store.

There is very little information as valuable as quantitative data about the performances of businesses and markets; if a Machine-Processable (not to say Semantic) Web can’t be built in this domain it can’t be built anywhere.

But Why? · Why would companies go to the trouble of publishing XBRL in particular or machine-readable financials in general? I can think of three reasons: First, the whole investment community is generally pissed off at the whole business community and CFOs around the world are keenly aware that to compete for investor dollars, they’d do well to have a good story on transparency. Second, the investment community might actually be willing to pay for such information; they’re already paying people to plow through companies’ PDFs and paper to extract the same information less reliably. Third, the governments of the civilized world might simply point a legislative gun at the companies’ heads and tell them they have to do this, for reasons including the first one in this list and some others discussed here.

Parting Words · When what you’re doing feels like shooting fish in a barrel, you ought to be worried that things aren’t as simple as you think they are.

November 09, 2003

By Tim Bray.
