About as Big as the Web

In recent months, I've been having serious fun on the job working with OCLC WorldCat data. WorldCat is big - about as big as the Web, and in some respects richer. It is also amazingly under-utilized (when was the last time you did a large-scale search on anything but Google?), and we'd like to fix that. Herewith some notes on who OCLC is, what WorldCat is, and some of the fun we're having. (Warning: long, and with some pitching for Antarctica; but some juicy screenies, and infojunkies must read.)

OCLC · It stands for Online Computer Library Center, but the “O” used to stand for Ohio; they're in Dublin, OH (not far from Columbus). OCLC is a nonprofit co-operative whose 43,559 members probably include every library you've ever been in. Out there in Dublin they have several hundred staff and a lot of really big computers. I think there are some people there who have approximately the best job in the world, getting paid to spend their time wrestling with the world's biggest library catalog; but then I'm a geek.

Cataloging · WorldCat is just one of the things OCLC does, but it's my favorite. I'm going to have to provide a bit of back-story to help the non-Librarians in the crowd (I'm sure there must be one) understand what it is.

When a “Research Library” (i.e. most University libraries, and places like the British Library and the New York Public Library) acquires a book, they don't just put it on the shelf, they have to catalog it first. This is a serious process, and usually takes several weeks.

Let's not let that number just slip by. Once I've written and proofread an ongoing essay, it takes me less than a second to publish it to the world. If it has a spicy title and I publish at a time of day when people are awake, usually several hundred of the RSS-driven subscribers will have read it within an hour. Unlike many online authors, I take the trouble to categorize these pages, which sometimes takes me as long as thirty seconds. Librarians have a different world-view; these are people who take information seriously, which is appropriate since that's their profession.

Cataloging includes assigning a Library of Congress or Dewey Decimal “call number” to the book, as well as some related subject headings, and capturing the basic metadata like author, title, date, and pagecount. There's more, but let's leave it at that.

The call number is important; it is an assertion as to what the book is primarily about, and it controls where (physically) the book is going to go on the shelves.

WorldCat · Clearly, it's expensive to catalog all the books that all the libraries buy; furthermore, given that every serious library in the world is going to acquire Vikram Seth's next novel and Bill Clinton's memoirs, it seems redundant for everyone to do essentially the same task. Thus WorldCat, the union of all the member libraries' catalogs.

When a new book arrives at a library, the cataloguer first of all looks it up in WorldCat; if it's there, some or all of the existing catalog entry can be re-used. If it isn't there, the cataloguer does the hard work and then stores the results in WorldCat. Many books are catalogued (at least in part) by the Library of Congress, and the big famous Universities' research libraries do more than their share. But books with a regional focus often get fed into the system by the local college or public library.
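The look-it-up-first workflow can be sketched in a few lines of Python. This is only an illustration of the idea, not how OCLC's systems work: the `CatalogRecord` fields, the `(title, author)` lookup key, and the `catalog_book` function are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CatalogRecord:
    # Hypothetical, heavily simplified record; real catalog entries carry far more.
    title: str
    author: str
    call_number: str
    cataloged_by: str


# Stand-in for the shared union catalog, keyed by an assumed (title, author) pair.
union_catalog = {}


def catalog_book(title, author, library, catalog_from_scratch):
    """Return (record, reused): reuse a shared record if one exists, else create one."""
    key = (title.lower(), author.lower())
    record = union_catalog.get(key)
    if record is not None:
        # Some other library already did the hard work; re-use its entry.
        return record, True
    # Nobody has cataloged this yet: do the work once and share the result.
    record = catalog_from_scratch(title, author, library)
    union_catalog[key] = record
    return record, False


def original_cataloging(title, author, library):
    # Placeholder for the weeks of human effort; the call number is made up.
    return CatalogRecord(title, author, "XX0000", library)


first, reused_first = catalog_book(
    "A Suitable Boy", "Vikram Seth", "Library A", original_cataloging)
second, reused_second = catalog_book(
    "A Suitable Boy", "Vikram Seth", "Library B", original_cataloging)
print(reused_first, reused_second, second.cataloged_by)
```

The second library gets the first library's record back instead of repeating the work, which is the whole point of a union catalog.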

As you might expect, this adds up. Basic stats are available online, and they're impressive (this from the April 2003 snapshot): 798 million books, 40,974,753 unique book titles, 2,475,845 serials, 692,264 maps, and so on, totaling over 883 million copies of 49 million different pieces of our species' memory.

This is remarkable, especially when you consider that it was all built by hand. I am in awe of WorldCat; I think it is one of the enduring monuments, one of the reasons why we can (occasionally) be proud to be human.

WorldCat vs. Google · Let's ignore the copies and do a little math. As of today, Google says it indexes 3,083,324,652 web pages. Let's say they contain on average about the equivalent of five printed pages of information (I'm just pulling that out of the air, but bear with me; I suspect it's a bit high). Let's ignore everything in WorldCat but books and assume that the 40 million unique titles have on average 200 pages (I bet that's low). So in terms of print pages, we get a (high-ish) estimate of fifteen billion for Google and a (low-ish) eight billion for WorldCat. I'd call that being in the same ballpark.
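For anyone who wants to fiddle with the guesses, here is the same back-of-the-envelope estimate as a small Python sketch. The page counts come from the paragraph above; the pages-per-web-page and pages-per-book figures are the same pulled-out-of-the-air assumptions, and the variable names are just for illustration.

```python
# Figures quoted in the text.
GOOGLE_WEB_PAGES = 3_083_324_652   # pages Google said it indexed
WORLDCAT_BOOK_TITLES = 40_000_000  # unique book titles, rounded down

# Assumptions (guesses, not measurements).
PRINT_PAGES_PER_WEB_PAGE = 5       # probably a bit high
PRINT_PAGES_PER_BOOK = 200         # probably low

google_print_pages = GOOGLE_WEB_PAGES * PRINT_PAGES_PER_WEB_PAGE
worldcat_print_pages = WORLDCAT_BOOK_TITLES * PRINT_PAGES_PER_BOOK
ratio = google_print_pages / worldcat_print_pages

print(f"Google:   {google_print_pages / 1e9:.1f} billion print pages")
print(f"WorldCat: {worldcat_print_pages / 1e9:.1f} billion print pages")
print(f"Google / WorldCat: {ratio:.2f}")
```

Change either guess and the ratio moves, but it takes a fairly extreme pair of assumptions to push the two out of the same order of magnitude.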

WorldCat and Google are very different. Google has the full text of each of those pages and one very important piece of metadata: how many people thought it was worth pointing at (the famous “PageRank”). WorldCat has basically no full-text, but they know a lot about each object they have, starting with basic stuff like who wrote it, and when and where that happened. Also (this is important) they have the Library of Congress and/or Dewey Decimal categorization information. This is an assertion made by information professionals saying “Here's what this is about.”

Google is not entirely bereft of topic information: they link as best they can to the taxonomy maintained by the Open Directory Project, which currently indexes around 3.8 million web sites. You'd be wrong to conclude that only 0.12% of the pages are categorized, because the ODP does sites not pages; do a few random Google searches and you'll see that quite a few of the result-list entries have categories. (The ODP is worth a write-up here at ongoing some day; I used to be an editor but got myself colourfully fired, and it's a good story.)

I'm not going to argue that WorldCat is either better or worse than Google. They're different; I see them both as essential pieces of the shared information space that in some sense defines us as an intelligent language-using species.

Under-Utilization · WorldCat was built mostly to support cataloging, but any information hound can see that it's one red-hot potential search resource. Unfortunately, you can't get access to it directly. Fortunately, OCLC provides a WorldCat search facility called “FirstSearch” that member libraries can subscribe to and offer their patrons.

I just now checked the WorldCat stats page and they proclaim that “Every 1.36 seconds a library user searches WorldCat using FirstSearch.” To me, this search volume seems astoundingly, mind-bogglingly, incomprehensibly low. Google doesn't publicize its query volume, but based on personal experience running a web search engine back in 1995-96, I would bet that it's a couple of hundred per second.
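To put the two rates side by side, here is a quick Python sketch. The 1.36-second figure is from the WorldCat stats page as quoted above; the Google number is only my guess of "a couple of hundred per second", taken here as 200, and should be read as an assumption rather than a published figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# From the WorldCat stats page: one FirstSearch query every 1.36 seconds.
SECONDS_PER_FIRSTSEARCH_QUERY = 1.36

# Assumption: "a couple of hundred per second", taken as 200. Not a published number.
ASSUMED_GOOGLE_QUERIES_PER_SECOND = 200

firstsearch_qps = 1 / SECONDS_PER_FIRSTSEARCH_QUERY
firstsearch_per_day = firstsearch_qps * SECONDS_PER_DAY
google_per_day = ASSUMED_GOOGLE_QUERIES_PER_SECOND * SECONDS_PER_DAY
ratio = ASSUMED_GOOGLE_QUERIES_PER_SECOND / firstsearch_qps

print(f"FirstSearch: {firstsearch_qps:.2f} queries/sec, "
      f"about {firstsearch_per_day:,.0f} per day")
print(f"Google (assumed): {google_per_day:,} per day")
print(f"Ratio: roughly {ratio:.0f} to 1")
```

Under that guess, Google would be handling a few hundred searches for every one that reaches WorldCat through FirstSearch.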

Do people who are searching for information genuinely not care about the entire repertoire of books in the libraries of the world? This is more than a little surprising.

Islam in the OPAC · Before I go on I should say that while we've been talking to OCLC, I absolutely am not in the slightest speaking for them.

Basically, WorldCat is like a library catalog, and so FirstSearch is like a library catalog, and you normally search such a thing with what librarians call an OPAC, for “online public access catalog.” Many educational-institution and public libraries put their OPACs up on the Web for anyone to use. Here are screenshots of the OPACs at a few well-known libraries, in each case after a subject search for “islam.” I picked them more or less at random and don't claim they're a statistically valid sample, but I think they're typical of what's out there.

Interestingly, I wanted to put hyperlinks in the text here that would take you to the real screens, but they use weird forms and have bizarre URIs and apparently it just can't be done.

Yale University Library

UC Berkeley Library

University of Toronto Library

The British Library

You can enlarge any of those screenies, but unfortunately they don't look much better. Nobody would call any of these screens a “successful” search; if you were an undergrad working on an essay on Islam, you'd probably go to Plan B:

Plan B: I typed “islam” into Google.

I'm Being a Little Unfair · The OPACs were never really designed to do this kind of thing, to be general-purpose research tools. They are optimized for very structured lookups, for example to see if a particular book is on the shelves, or what's in the collection by a particular author.

But people, particularly students in a hurry, want general-purpose research tools. So they're ending up at Google.

But Google isn't perfect. And it just doesn't have a lot of the goodies that are there in all those OPACs, and most of all in WorldCat.

It's Time · Long past time, in fact, to take the world's OPACs, and especially WorldCat, and build a general-purpose research tool for everybody; with this and Google we would really be covering the bases.

The community of librarians has devoted tens of thousands of lives, in aggregate, to the stewardship of this remarkable body of knowledge, and it is just wrong that it isn't an everyday part of the Web.

Visual Net · I can't show you our take on the WorldCat data, because it's just an experiment and the folks there haven't signed off on it. I can say, though, that it's been hugely, monstrously fun to work with.

But we have already put Visual Net in production on more than one OPAC; here's a (much smaller) library collection, showing the results of the same search.

Visual Net doesn't really offer any information that the OPAC doesn't, it just provides a different user interface:

Ten of the 322 matches are brought to the top, based on the rules set by the librarians: I believe in this case it's a combination of how recently the book was acquired and how often it's checked out.
It makes it obvious that there's no good way to line all the matches for “islam” up in a Google-style result list. Some of them are about religion, some of them about politics, and some, as you can see, about Law and Language and other things.
Using graphics, you can show off how new each item is, and whether it's electronic or paper or one of the other things in the legend at the left (in this case the top ten are all books or e-books).
The red polygon shows you that one of the categories in the Library of Congress taxonomy has Islam in its name. A good way to pursue your research would be to click on it and start poking around.
Which leads to maybe the most important point: this map is interactive; click on any of those sub-headings and you'll be there, and it'll remember that you're only looking for things Islamic until you tell it to clear the search. It's (deliberately) like walking around in a real library.

I'm a search guy, and when I think of the possibilities of using Google and WorldCat, and of having an interface that isn't “type a query and see a list,” I get excited. Which is why I like my job.

May 08, 2003

By Tim Bray.

