This series of essays on the construction, deployment and use of search technology (by which I mean primarily “full-text” search) was written between June and December of 2003. It has fifteen instalments, not including this table of contents.
This may be a weblog, but the following are not in reverse chronological order; they’re in the order I wrote ’em, which I suspect is the right order to read ’em.
Backgrounder · This is the first of a series on search, by which I mean full-text search. Anyone who uses computers now uses search pretty well every day, so this is an important chunk of our technology spectrum. This piece covers the business and history angles; future instalments will explain how search engines work and the interfaces to them. I plan to conclude with a description of the next search engine, which doesn’t exist yet but someone ought to start building.
The Users · Herewith Chapter Two of the search travelogue. Between late 1994 and early 1996 I was occupied full-time and then some building and running one of the first Web search engines, the long-departed Open Text Index. There weren’t many million-hits-a-day sites back then. When you’re running that kind of thing, you spend a lot of time watching your logs to figure out what your users are doing and what makes them happy. There are two lessons that loom larger than all the others put together.
Basic Basics · In this ongoing safari through the Search hinterland, I had thought next to talk about popular features of search engines and their costs and benefits and so on. But I think that everything else I want to cover will be easier if there’s a shared view of the machinery making it all go. So here is a tour through the basics of search-engine engineering.
Precision and Recall · Searching is a branch of computer programming, which is supposed to be a quantitative discipline and a member of the engineering family. That means we should have metrics: measures of how good our search techniques are. Otherwise, how can we ever measure improvements in one system or the differences between two systems? “Precision” and “recall” are the most common measures of search performance. But they’re not as helpful as we’d like.
Intelligence · Here’s the problem: searching for words isn’t really what you want to do. You’d like to search for ideas, for concepts, for solutions, for answers. Instead, your typical search engine moronically sorts through its postings, and tries to solve your problems by looking at which words appear where, and how often, and so on. What we’d really like is an intelligent search engine. This essay is mostly about why we’re not likely to get one any time soon.
Squirmy Words · In this, the sixth instalment of my search saga, a survey of the fuzzy edges of words and their meanings and the (surprisingly moderate) consequences for search systems.
UI Archeology · This chapter of the On Search saga is a side-trip; a look at an unusual search user interface I built a dozen years ago. One of the reasons it didn’t catch on back then was that there wasn’t enough XML in the world. Now that there is, maybe this bit of legacy code will provoke an idea or two. Just maybe, it contains some ideas that will be useful to the folk who are wondering how to make the power of XPath and XQuery useful to ordinary people.
Stopwords · Here’s a Google search for a famous phrase, to be or not to be; give it a try and see what happens. When you look at word frequencies, it appears that there are a few words that appear unreasonably often and carry unreasonably little information. They are called “stopwords,” and this (brief) eighth bead in the On Search necklace considers them.
Metadata · In the Web’s early years, the overwhelming favorite among search engines was Yahoo. Today it’s Google. Neither has actually had better text search technology than the competition. They won because they used metadata effectively to make their services more useful. In this ninth On Search, a survey of what metadata is, where it comes from, and how to use it.
I18n · The “On Search” series resumes with this look at the issues that arise in search when (as you must) you deal with words from around the world, written in the characters that the people around the world use. I18n stands for “internationalization.”
Result Ranking · If you’re searching a big database, unless you’re lucky you’re usually going to get a lot more matches to any given query than you want to look through. So it really matters what order that result list is in. Google got to be famous in large part because they do a good job on this; the stuff near the top of their list is usually about what you want, and if you don’t see what you need near the top of the list, it may not be out there. Herewith some remarks on how to go about sorting result lists. In general, the news is not very good; however, there are some promising techniques that are under-explored.
Interfaces · Herewith an investigation of how search software ought to interact with the outside world. I’ll start with a look at the current state of the art, and propose another (I think better) approach. This is, I think, the third-last On Search piece, so a few words at the meta level about that.
XML · Searching is all about text, and the proportion of all the world’s text that is XML keeps getting higher and higher. So if you’re going to do search, at some point you’re going to have to think about searching XML. Herewith a survey of some of the issues and problems (which, like other essays as we approach the end of On Search, contains opinions among the reportage).
Robots · Robots—also known as spiders—are programs that retrieve Web resources, using the embedded hyperlinks to automate the process. Robots retrieve data for all sorts of purposes, but they were invented mostly to drive search engines. Herewith a tour through Robot Village.
Turn On Search · This is the last in my series of On Search essays. I’ve written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I’d like to change this part of the world. In short, I’d like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I’ll write something on how it might get built.