In this ongoing safari through the Search hinterland, I had thought next to talk about popular features of search engines and their costs and benefits and so on. But I think that everything else I want to cover will be easier if there’s a shared view of the machinery making it all go. So here’s a tour through the basics of search-engine engineering.
Call Me Ishmael ·
This will be a more satisfactory journey if there are concrete examples to
feed the eye. Thus, suppose we are going to search through exactly one
which happens to be a Web resource whose URI is
containing exactly three words; here it is:
Call me Ishmael. Doc # 0 Word # 0 1 2
First of all, we’re going to need a table of documents, which, since there’s only one, will be pretty simple:
<docs> <doc id="0" href="http://example.com/herman" /> </docs>
Note: In this discussion, when I illustrate search-related data structures, I am going to use XML syntax, because it’s reasonably readable and because I have a hidden agenda that we’ll be getting to a little later in this series.
This is not to suggest that you’d actually store a search engine’s document table in XML. You might (I wouldn’t), but the point is the XML is being used here as an expository device.
Postings · Search engines are almost universally built around what we call postings. (I read somewhere about the origin of this usage of “posting” but can neither recall nor find it). A posting is a small data structure recording that a particular word appears in a particular document. Here, without further ado, is a complete set of postings describing our document:
<postings> <posting doc="http://example.com/herman" word="Call" /> <posting doc="http://example.com/herman" word="me" /> <posting doc="http://example.com/herman" word="Ishmael" /> </postings>
Each posting tells you that a particular word appears in a particular document. Obviously, when you’re storing your postings in your search engine, you’re not going to replicate the URI for each one, you’re just going to use the id number from the document table. So internally it might look more like:
<postings> <posting doc="0" word="Call" /> <posting doc="0" word="me" /> <posting doc="0" word="Ishmael" /> </postings>
Word List · The other really fundamental piece of your search engine is your word list. In our database so far, so it’s only going to have three words:
<words> <word w="Ishmael" /> <word w="call" /> <word w="me" /> </words>
Notice that the word list is sorted in strict alphabetical order: capital letters sort before lower-case in both ASCII and Unicode (let’s ignore issues of case-folding for now). It doesn’t have to be that way, but in your search engine, you do have to organize your word list so you can look up entries quickly. If you put them in an array in sorted order then you can binary search it; on the other hand, there are good arguments for putting it in some flavor of tree. Let’s just assume that we’re clever enough programmers that we can organize a word list for fast lookup.
Now, if you tie your word list and your postings together, you’ve got a search engine. But to illustrate that, we’re going to need at least two documents in our database.
Here's our other short document, whose URI is
Longtemps, je me suis couché de bonne heure. Doc # 1 Word # 0 1 2 3 4 5 6 7
Here are its postings:
<postings> <posting doc="1" w="Longtemps" /> <posting doc="1" w="je" /> <posting doc="1" w="me" /> <posting doc="1" w="suis" /> <posting doc="1" w="couché" /> <posting doc="1" w="de" /> <posting doc="1" w="bonne" /> <posting doc="1" w="heure" /> </postings>
Here’s our document table:
<docs> <doc id="0" href="http://example.com/herman" /> <doc id="1" href="http://example.com/marcel" /> </docs>
And here“s our new word list:
<words> <word w="Ishmael" /> <word w="Longtemps" /> <word w="bonne" /> <word w="call" /> <word w="couché" /> <word w="de" /> <word w="heure" /> <word w="je" /> <word w="me" /> <word w="suis" /> </words>
Here’s the Trick · The idea is, you break your postings lists up and sort them out by word, and call it your index, like so:
<index> <word w="Ishmael"> <posting doc="0"/> </word> <word w="Longtemps"> <posting doc="1"/> </word> <word w="bonne"> <posting doc="1"/> </word> <word w="call"> <posting doc="0"/> </word> <word w="couché"> <posting doc="1"/> </word> <word w="de"> <posting doc="1"/> </word> <word w="heure"> <posting doc="1"/> </word> <word w="je"> <posting doc="1"/> </word> <word w="me"> <posting doc="0"/> <posting doc="1"/> </word> <word w="suis"> <posting doc="1"/> </word> </index>
Remember, your word list is set up so you can look up a word quickly, and once you’ve done that you’ve got a list of all that word’s postings, so you know where it appears in your document collection.
That’s really all there is to it. Want to find out if the word
“madeleine” appears in your collection?
Look it up, it’s not in the word list, no dice.
How about “call?”
Yes, it appears in document 0,
How about “me”?
It’s in both documents.
If you’re a data-structures kind of person you can now have some fun thinking how you’d store the words from the word list (trickier than you think), and how you store the postings (they’re fixed in size but there are more of them), and how you point from the word list to the postings list and from the postings list to the document list.
If you’re not, trust me, this is the kind of thing data-structures people live for, give them the problem and they’ll go away happily and not bother you for a while. Don’t forget to tell them that they need to allow space for somewhere between twenty thousand and a few million unique words, and many billions of documents, each with potentially lots of postings, so really a very large number indeed of postings. And, it all has to be updateable.
And the cool thing is, when you tell them those big numbers, it will make them happier! Aren’t geeks wonderful?
Booleans · A little thought reveals that our search engine is pretty well set up to do Boolean queries: e.g. documents that have word1 AND (word2 OR word3) BUT NOT word4. Each one of those words gives you a postings list, which is basically just an array of ID numbers, which are easy to merge (for the purists, this is really set arithmetic). Of course, you want to be clever about looking at your ANDS and ORs and starting with the shortest postings list and so on, but it’s really not rocket science.
Phrases · Suppose I want to search for the phrase “bonne heure”. Our index isn’t going to help us with that. But, if we added the word number to our postings, it would. Here's part of our postings list, slightly enhanced:
<index> ... <word w="bonne"> <posting doc="1" wnum="6" /> </word> ... <word w="heure"> <posting doc="1" wnum="7"/> </word> ... </index>
Now I can do phrase searching. I find the postings for the first word in the phrase, then for the second word, and compare the lists, looking to match a posting for the first and one for the second which are in the same document and where the second’s word number is one more than the first’s.
Of course, you’re paying a pretty severe price, since each and every one of your billions of postings now has to have another number in it. But there’s probably no way around it, since phrase search is very useful.
That’s About It · Believe it or not. I’ve never been behind the scenes at Google or Inktomi, but I betcha all those zillions of servers are spending a high proportion of their time finding and merging big lists of postings.
OK, I admit it, I’m lying egregiously. I have swept a huge number of icky problems under the rug. What about synonyms, and thesauri, and inflexions (run/ran, cow/cattle), and language differences, and finding the postings when there are no spaces between the words (e.g. Japanese and Chinese), and metadata (e.g. PageRank) and stopwords, and result-list organization, and APIs, and sub-document structures (e.g. XML) and “concept search” and “fuzzy search” and Latent Semantic Indexing and document clustering, and upper/lower case, and relevance ranking, and... well, the Devil is in the details.
In fact, I’ll devote at least one more essaylet in this series to some of these things. But once you’ve fought through them, the basic postings-list architecture of most search engines still looms pretty large in their workings, and pretty large in terms of understanding what they can and can’t do.
Acknowledgement · All this was worked out a long time ago. I read about it first of all in the works of Gerald Salton, who probably wrote more of the basic reference works in this area than everyone else put together. I don’t know how much of the technique he invented and how much he just organized and wrote down, but anyone who uses search technology, which these days is pretty well everyone, owes him a thank-you.