If you’re searching a big database, unless you’re lucky you’re usually going to get a lot more matches to any given query than you want to look through. So it really matters what order that result list is in. Google got to be famous in large part because they do a good job on this; the stuff near the top of their list is usually about what you want, and if you don’t see what you need near the top of the list, it may not be out there. Herewith some remarks on how to go about sorting result lists. In general, the news is not very good; however, there are some promising techniques that are under-explored.
How Google Does It · I assume everyone already knows this, but Google sorts Web pages depending mostly on how many other Web pages point at them. For example, if you type my name into Google, ongoing is the top result even though my name isn’t on the front page; that’s because a lot of pages that contain my name point here.
Also, Google claims that links from pages that themselves have a lot of incoming links count for more, but I’m not actually sure they’d need to do that to get the results they do.
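To make the difference between the two ideas concrete, here is a minimal sketch, not a description of what Google actually runs. The function names, the toy link graph format, the damping value of 0.85 and the fixed number of rounds are all my assumptions. The first function just counts incoming links; the second lets a link from a well-linked page count for more.

```python
from collections import defaultdict


def inlink_counts(links):
    """Count incoming links per page.

    `links` maps each page to the list of pages it points at
    (a hypothetical format chosen for this sketch).
    """
    counts = defaultdict(int)
    for source, targets in links.items():
        counts[source] += 0  # make sure pages with no inlinks still appear
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return dict(counts)


def weighted_rank(links, damping=0.85, rounds=50):
    """Iteratively share each page's score among the pages it links to,
    so that links from high-scoring pages are worth more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(rounds):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = [t for t in set(links.get(page, [])) if t != page]
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

On a small graph the two orderings often agree, which fits the doubt expressed above about whether the weighting is strictly needed.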
Sidenote: Feed-Forward · I’ve noticed, six months and 500 essays into ongoing, that the number of people coming here via Google searches is going up slowly and steadily. It’s obvious when you think about it: if you accumulate a few incoming pointers Google starts sending more people, a few of whom inject more incoming pointers into the system. This is a feed-forward loop; in other words, popularity tends to be self-reinforcing. This phenomenon is not limited to the Internet.
It Won’t Work for You · If you’re writing or deploying a search engine for your Intranet or product catalogue or portal, Google’s PageRank trick probably won’t work, because most Intranet and catalogue and portal pages don’t point at each other. The Web is unique in that it has millions of authors independently making decisions about what’s important; aggregating those decisions is what makes PageRank so powerful.
Statistical Alternatives · The well-known alternatives for doing result ranking are based on analysis of the search terms and how they relate to the documents and other search terms. For example, if a search term appears in the document’s title, that’s probably worth more than having it appear somewhere twenty-five paragraphs in. If it appears twice in the title, even better.
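A rough sketch of that title-versus-body idea follows. The field names and the weights (a title hit worth five body hits) are made up for illustration; any real engine would tune them.

```python
import re

# Hypothetical weights: a title match counts five times as much as a body match.
FIELD_WEIGHTS = {"title": 5.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_score(term, doc):
    """Score one search term against a document given as a dict of fields.

    Every appearance counts, so a term that appears twice in the title
    scores twice the title weight.
    """
    term = term.lower()
    return sum(
        weight * tokenize(doc.get(field, "")).count(term)
        for field, weight in FIELD_WEIGHTS.items()
    )
```

With these weights, one match in the title outranks three matches buried in the body.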
Speaking of multiple appearances, frequencies matter more than counts. For example, suppose I search for osmolytic ratio. Since osmolytic is a far less common word than ratio, a match to osmolytic anywhere in the document is probably worth more than twelve ratio matches, even if one of them’s in the title.
Generalizing this, it’s pretty easy, given any word, to compute how many times it appears in the whole database, how many documents and words there are in the database, and thus the expected per-document and per-word frequencies of any of your query terms. Then, you’d sort your documents based on whether the search terms appear with higher than average frequency. You can go a long way down the road with this kind of statistical trickery, and many search engineers do.
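Here is one way that bookkeeping might look, as a sketch under my own assumptions: documents are plain strings keyed by an id, and the score for a term is the ratio of its frequency in the document to its frequency across the whole database (the name `lift` is mine). Real engines use more refined formulas.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_stats(docs):
    """Gather per-document word counts, database-wide counts, and total words."""
    per_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    totals = Counter()
    for counts in per_doc.values():
        totals.update(counts)
    return per_doc, totals, sum(totals.values())


def lift(term, doc_id, stats):
    """How much more frequent is `term` in this document than in the database?"""
    per_doc, totals, total_words = stats
    counts = per_doc[doc_id]
    doc_length = sum(counts.values())
    if doc_length == 0 or totals[term] == 0:
        return 0.0
    observed = counts[term] / doc_length
    expected = totals[term] / total_words
    return observed / expected


def rank(query, stats):
    """Sort document ids by the summed lift of the query terms, best first."""
    terms = tokenize(query)
    scores = {d: sum(lift(t, d, stats) for t in terms) for d in stats[0]}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because a rare word has a low expected frequency, a single match to it produces a large ratio, which is the effect described above: one rare-word hit can outweigh many hits on a common word.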
“Latent Semantic Indexing,” which I’ve already mentioned in this series, is a technique that goes even deeper, working out not just word frequencies but word associations. Thus, an LSI engine might notice that words such as barn and pig are strongly clustered with the word farm, and rank a document whose title was Barn Construction for Pigs very highly in a search for farm even if it didn’t contain farm at all.
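To give the flavor of word association, here is a deliberately crude sketch. It is not Latent Semantic Indexing, which works on a factorized term-document matrix; this just counts which words turn up in the same documents. The function names and the `min_docs` threshold are my assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cooccurrence(docs):
    """For each pair of distinct words, count the documents containing both."""
    pairs = Counter()
    for text in docs:
        words = sorted(set(tokenize(text)))
        for first, second in combinations(words, 2):
            pairs[(first, second)] += 1
    return pairs


def associated(term, pairs, min_docs=2):
    """Words that share at least `min_docs` documents with `term`."""
    found = set()
    for (first, second), count in pairs.items():
        if count >= min_docs:
            if first == term:
                found.add(second)
            elif second == term:
                found.add(first)
    return found


def matches(query_term, title, pairs):
    """True if the title contains the term or any word associated with it."""
    words = set(tokenize(title))
    return query_term in words or bool(words & associated(query_term, pairs))
```

A sketch this simple will also associate very common words like "the" with everything, so anything practical would need to discount them.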
Bad News · Unfortunately, most of these techniques don’t work very well. Go to the websites of a few public-facing organizations—government departments, big companies, universities—and try a few random searches. In many cases, the results are pathetic.
The Real Lesson of Google · PageRank works on the basis of guessing that what’s popular is what’s important, which turns out to be a good guess. The technique relies on the Web’s linkage network, and while non-Web deployments can’t use that, we shouldn’t give up on the notion of using a popularity metric.
If you’re running a library, you know what’s getting checked out. If you’re running a store, you know what’s selling. If you’re just running a Web server, at least you have your log files, so you know what people are fetching. These are all popularity measures, and yet when I’m advising people about their search engines and ask if they’re feeding back their usage numbers into their result rankings, I get a blank look.
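Feeding usage numbers back in can be very simple. The sketch below assumes you already have a text-match score per result and a usage count per document (checkouts, sales, log hits); the logarithmic blend and the `popularity_weight` value are my own arbitrary choices, not a recommended formula.

```python
import math


def blended_score(text_score, usage_count, popularity_weight=0.5):
    """Boost a text-match score by a damped function of how often the item is used.

    The logarithm keeps a hugely popular item from swamping everything else.
    """
    return text_score * (1.0 + popularity_weight * math.log1p(usage_count))


def rerank(results, usage):
    """Reorder (doc_id, text_score) pairs using usage counts.

    `usage` maps doc_id to a count; documents with no recorded usage
    keep their plain text score.
    """
    return sorted(
        results,
        key=lambda r: (-blended_score(r[1], usage.get(r[0], 0)), r[0]),
    )
```

With no usage data at all the order is unchanged, so this degrades gracefully to whatever ranking you had before.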
Here’s another trick that many but not all search engine operators have picked up on: when you generate a result list, presumably it contains links to the matching resources, something like this:
<li><a href="http://example.com/barns">Barn Construction for pigs.</a></li>
Well, don’t let people just click on your results and traipse off to visit them without finding out about it; try this instead (assume your search engine is at search.example.com):
<li><a href="http://search.example.com/goto?u=http://example.com/barns">Barn Construction for pigs.</a></li>
So when they click on the link, the request comes back to you; you record that they decided to visit http://example.com/barns and then redirect them there. Most times they’ll never notice. And you’re learning something that might be very valuable: which of your search results people are actually using. I think you might get very good results by feeding that back into your result ranking: maybe not PageRank but rather UsedRank.
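The server side of that redirect could be sketched as below. This is framework-free on purpose: `tracked_link` and `handle_goto` are hypothetical names, the click log is just an in-memory counter, and the `goto` path and `u` parameter are taken from the example link. The sketch percent-encodes the target when building the link, which is my addition; the handler accepts the unencoded form too.

```python
from collections import Counter
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_HOST = "http://search.example.com"  # the search engine's address from the example


def tracked_link(target_url):
    """Build a result link that passes through the search engine first."""
    return f"{SEARCH_HOST}/goto?{urlencode({'u': target_url})}"


def handle_goto(request_url, clicks):
    """Record the click, then say where to redirect.

    Returns (status, location): 302 and the target when `u` is present,
    otherwise 400 and None. `clicks` is a Counter keyed by target URL.
    """
    query = parse_qs(urlsplit(request_url).query)
    targets = query.get("u")
    if not targets:
        return 400, None
    target = targets[0]
    clicks[target] += 1
    return 302, target
```

The counts that pile up in `clicks` are exactly the usage numbers a UsedRank would need. A production version would want to check that the target is a URL it actually handed out, so the endpoint can't be abused to bounce visitors to arbitrary sites.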
Let’s not obsess on the details. Virtually anyone who’s maintaining online information and cares enough about it to run a search engine has some metrics or statistics as to what’s popular and what’s not. So, put them to work. The results are apt to be better than what you’ll get with most of the statistical and linguistic techniques.