Herewith an investigation of how search software ought to interact with the outside world. I’ll start with a look at the current state of the art, and propose another (I think better) approach. This is, I think, the third-last On Search piece, so a few words at the meta level about that.
Meta-Essay: A Change of Tone · So far, On Search has been pretty pure reportage: here’s how things are done, and why. There’s not a lot of scope for editorializing while covering why internationalized case-folding is hard or how postings work. Starting here, I’ll venture into the territory of opinion and make some proposals for how things ought to be done. While these opinions are well-informed by experience, they may be entirely wrong. After this, I’ll write a piece on Search and XML, and wrap up the series with one on what the next search engine should look like and how we ought to go about building it.
How Lucene Does It · Lucene is probably the best Open Source/free search engine out there; it’s now part of the Apache family. The following is excerpted from its excellent online documentation and will be transparent to the Java-savvy, perhaps less so to others, which is part of the problem:
The Jakarta Lucene API is divided into several packages.
org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, a collection of named files written by an OutputStream and read by an InputStream. Two implementations are provided, FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Token's. A TokenStream is composed by applying TokenFilter's to the output of a Tokenizer. A few simple implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer.
org.apache.lucene.search provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into Hits. IndexSearcher implements search over a single IndexReader.
To use Lucene, an application should:
Create Documents by adding Fields;
Create an IndexWriter and add documents to it with addDocument();
Call QueryParser.parse() to build a query from a string; and
Create an IndexSearcher and pass the query to its search() method.
How Commercial Products Do It · My industry feelers, which are pretty good, tell me that the two leading commercial search products these days come from Autonomy and Verity. I just invested an hour failing to find online documentation for their APIs. You can indeed download Verity’s entire Ultraseek package on a trial-run basis, and it does indeed contain a documentation directory, but no browser here on this Macintosh would successfully open any of the HTML files therein, so the hell with ’em.
I do have considerable experience in this area, though, and can testify that most of the APIs out there look quite a bit like the Lucene example, if perhaps less well-factored and well-thought-out. Which is only to say that Lucene, coming to the game a decade later, has learned the lessons of its predecessors, something that Open Source is generally good at.
What’s Wrong with This Picture? ·
I think Lucene’s API is well thought out, but there are not one but two
elephants in the room that it’s trying hard to ignore.
The first is the fact that the world contains many who are not members of the
Church of Java, who will be left cool by the notion of, for example,
“converting text from a
java.io.Reader into a
TokenStream.”
The second elephant is the Web. Suppose I don’t want to write Java code, I
just want to tell my website to index this directory.
A pure Web interface would solve both these problems, but first a few words on
what that means.
Web Architecture Revisited ·
In Web Architecture, nothing
exists unless it’s a Resource, identified by a URI.
Web protocols, chiefly HTTP, offer a simple set of stateless methods for
interacting with resources:
GET retrieves representations
without changing anything,
POST changes resource state,
PUT sends a new representation, and
you get the idea.
These days, most search is de facto on and of the Web. I find that I can easily think of a search engine as a Web resource that allows you to add and update document info and postings. While most search services, historically, have kept postings out of sight under the covers, I think it makes all sorts of sense to expose them as the basic public API object. I’ll discuss the reasons below, but first let’s sketch out what a pure Web search API might look like.
This is easy; there is rough consensus among search engines on what a query
should look like: quotes mark
"phrase searches", you use “+”
to say a word
must +occur and “-” to
say it -mustn’t.
You can also URI-restrict searches, limiting matches to resources whose URIs begin with a given prefix.
Down the road, you might want to support something like XPath or even XQuery,
but that would work fine through a Web interface too.
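As a sketch of how that consensus syntax might be interpreted, here’s a minimal Python parser; the dictionary it returns is my own invention, not anything a real engine prescribes:

```python
import re

def parse_query(q):
    """Split a query into phrases, required terms, excluded terms,
    and plain terms, following the rough consensus syntax:
    "..." for phrases, + for must-occur, - for must-not-occur."""
    phrases = re.findall(r'"([^"]*)"', q)   # pull out quoted phrases first
    rest = re.sub(r'"[^"]*"', ' ', q)       # then scan the leftover words
    required, excluded, terms = [], [], []
    for word in rest.split():
        if word.startswith('+'):
            required.append(word[1:])
        elif word.startswith('-'):
            excluded.append(word[1:])
        else:
            terms.append(word)
    return {'phrases': phrases, 'required': required,
            'excluded': excluded, 'terms': terms}
```

So `parse_query('"call me" +ishmael -fishmeal whale')` yields one phrase, one required term, one excluded term, and one plain term.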
So a search engine would publicize a URI that you could use HTTP
GET on to run searches; the arguments would include
mr for maximum results, and so on.
So a complete search request is just one such URI with the arguments tacked on.
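To make that concrete, here’s one hypothetical request built in Python; the /search path and the q and mr parameter names are assumptions of mine, since a real engine would advertise its own:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names (q = query text,
# mr = maximum results); a real engine would publicize its own.
base = 'http://search.example.com/search'
params = {'q': '"call me" -fishmeal', 'mr': 10}
url = base + '?' + urlencode(params)
# A client would now simply GET this URL.
```

The whole request is one stateless GET; anything that can fetch a URL can search.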
What’s the result of running such a query? Well, for your users you’d like a nice HTML page that you’ve fixed up with your favorite Website constructor, be it PHP or ASP.NET or JSP or whatever. So let’s have our search return a simple chunk of XML with no particular concern for templating or HTML beautification, designed to make it easy for the PHP-heads and ASP.NET bigots and JSP partisans to parse and feed through their own templating engines.
Here’s an example of a very short result list. To make sense of it, you might want to go and review the Basic Basics entry in this series, which explains what a posting is.
<results start='11'>
  <resource href="http://example.com/marcel" weight="32">
    <match wnum="2" />
    <match wnum="383" />
  </resource>
  <resource href="http://example.com/herman" weight="17">
    <match wnum="1" />
  </resource>
</results>
In English: here are two resources matching our search,
starting at the eleventh-best, and each has a
weight
that estimates how good the match is (the
preceding entry in
this series discusses how those weights might have been computed).
For each document, the matches are provided;
wnum is “Word
Number,” i.e. how many words into the document you have to go to find the
match; to use it, you’d have to know how that resource was tokenized into
words.
This is easy to parse, and gives clients a lot of leeway as to how they might build a user interface around it.
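For instance, a minimal Python client could digest the result chunk above with nothing but the standard library (the tuple shape returned here is just an illustrative choice):

```python
import xml.etree.ElementTree as ET

# The sample result list from above.
results_xml = """
<results start='11'>
  <resource href="http://example.com/marcel" weight="32">
    <match wnum="2" /> <match wnum="383" />
  </resource>
  <resource href="http://example.com/herman" weight="17">
    <match wnum="1" />
  </resource>
</results>
"""

def read_results(xml_text):
    """Turn a <results> chunk into a list of (href, weight, [wnums])."""
    root = ET.fromstring(xml_text)
    out = []
    for res in root.findall('resource'):
        wnums = [int(m.get('wnum')) for m in res.findall('match')]
        out.append((res.get('href'), int(res.get('weight')), wnums))
    return out
```

From here, turning those tuples into HTML is a templating problem, in whatever templating engine you already like.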
What About Structure? · In my discussion of both queries and result lists, there’s a glaring absence of any attention to document structures such as “fields,” and in particular no mention of how you might deal with XML. That topic is I think worth its own essay, so with the reader’s indulgence I’ll push it on the stack. The focus here is on a Web-flavored interface to search, and I don’t think the absence or presence of simple (or even XML) structures really bears on that.
Updating the Search Engine ·
To update our search engine, we have to tell it about resources, and we
have to tell it about postings in the resources.
Since this changes the state of the index, we’ll use POST.
It wouldn’t make sense to use
PUT since we’re not really
creating a new resource, we’re just updating the index, which knows about
resources and postings.
I’m sorry, we’ll just have to live with the potential confusion between the
POST and the (Search-jargon) “posting.”
I’m not totally married to the details of the interface I’m about to describe; among other things it hasn’t been hashed over by a bunch of smart people, some of whom have tried writing code to implement pieces of it. Having acknowledged that this is vapour-ware, I think it’s useful in support of my main goal, which is to establish that a pure Web-flavored search API is not only possible, but natural and desirable.
Ishmael Again ·
Index updates are accomplished by
POSTing chunks of XML to
some URI advertised by the search engine.
I think only three verbs are needed:
add: Index a resource and one or more postings.
erase: Delete all knowledge of a resource from the index.
unpost: Delete all the postings about a resource from the index.
Let’s add the resource
http://example.com/herman; it contains
only the text “Call me Ishmael.”
Here we go:
<update op="add" href="http://example.com/herman">
  <posting word="call" wnum="0" />
  <posting word="me" wnum="1" />
</update>
Oops, we forgot that last word:
<update op="add" href="http://example.com/herman">
  <posting word="fishmeal" wnum="2" />
</update>
Note that we’ve taken care of monocasing Latin characters before the search engine ever sees the posting. Oops, mis-spelt that name. Let’s remove all the postings and start again:
<update op="unpost" href="http://example.com/herman" />
<update op="add" href="http://example.com/herman">
  <posting word="call" wnum="0" />
  <posting word="me" wnum="1" />
  <posting word="ishmael" wnum="2" />
</update>
And for completeness, let’s erase this from our index:
<update op="erase" href="http://example.com/herman" />
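A client-side sketch in Python of generating those update bodies; the element and attribute names follow the examples above, but the function itself is purely illustrative:

```python
import xml.etree.ElementTree as ET

def update_xml(op, href, postings=()):
    """Build one <update> element as a string; op is 'add', 'unpost',
    or 'erase'. postings is a sequence of (word, wnum) pairs, used
    only when op is 'add'."""
    update = ET.Element('update', op=op, href=href)
    for word, wnum in postings:
        ET.SubElement(update, 'posting', word=word, wnum=str(wnum))
    return ET.tostring(update, encoding='unicode')

# The client would POST these bodies to the engine's update URI.
doc = update_xml('add', 'http://example.com/herman',
                 [('call', 0), ('me', 1), ('ishmael', 2)])
```

The client just serializes and POSTs; there’s no session, no handle, no index object to hold open.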
What’s Being Left Out · I am definitely sweeping some administrative details under the rug here. In particular, there should be a bunch of well-known metadata that can be provided for any resource: when it was last updated, the source data format (HTML, Word, PDF, whatever), what method was used to do the tokenization, and so on. And then there should be a facility for user-defined per-resource metadata. But once again, none of these things get in the way of giving the basic flavor of a web interface to a search engine.
This is Different · Pretty radically different from any search interface I’ve ever seen. But I like it. Let’s finish up with what I consider to be the big advantages of such an approach.
Platform Independence ·
If you want to index something, you don’t have to buy into
any particular programming language or operating system.
All you need to be able to do is
POST XML to a server and
GET URIs with some arguments tacked on.
The list of popular programming platforms that support this includes, well,
all of them.
Roll-Your-Own Input Processor · Earlier in this series, I’ve discussed some of the choices that go on at indexing time: how to deal with stopwords and inflexions and synonyms and so on. The best decision on all of these things is usually “It depends;” most strongly on the specifics of the text being searched and the people doing the searching, but on other things as well.
This Web-flavored API strongly decouples the approaches you might take to all these problems from the guts of the search engine, and I think that’s a good thing.
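For example, a client might do its own tokenizing along these lines before POSTing anything; the letters-only tokenizer and the tiny stopword list here are placeholder assumptions, exactly the kind of thing you’d tune to your own text and searchers:

```python
import re

# A tiny, illustrative stopword list; a real one would be tuned
# to the text being searched and the people doing the searching.
STOPWORDS = {'the', 'a', 'an', 'of', 'and'}

def postings_for(text):
    """Tokenize client-side: split on non-letters, monocase, and drop
    stopwords. Stopwords still advance the word count before being
    dropped, so the surviving wnums stay aligned with the document."""
    out = []
    for wnum, word in enumerate(re.findall(r"[A-Za-z]+", text)):
        word = word.lower()
        if word not in STOPWORDS:
            out.append((word, wnum))
    return out
```

Swap in a different tokenizer, a stemmer, or a synonym expander and the search engine never knows the difference; it only ever sees postings.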
Statelessness · Someone who’s loading data into a search engine shouldn’t really have to keep track of where they are in the process; consider the Lucene example above (and remember that I consider Lucene state-of-the-art); it feels like I have to do a lot of housekeeping to get text indexed.
With the Web-flavored interface, once you’ve tokenized the data (in the privacy of your own sandbox) you just blast one message off to a web server and when you’re done, you’re done. One of the strong lessons of the Web is the beneficial power of statelessness.
In Conclusion · Having engineered search into a few applications and websites, I think that the job would have been easier with a pure Web-flavored interface like the one I describe here. Now, this doesn’t mean that we have to throw away everything we have and start again from scratch. I’m pretty sure that you could build an interface like this to quite a few different modern search engines; and that somebody should.