This one for web-tech aficionados only. Those of you who watch your webserver logs, go do a fgrep msnbot access_log (MSNbot got to ongoing today). Unlike any robot I’ve seen or heard of, MSNbot tells you the referer, so you can actually watch the trails it takes into and through your online presence. Neato. Ordinary people who are well-integrated with the real world can safely ignore this fascinating discovery and be fairly sure it will not impair their quality-of-life. Move along, now; nothing to watch here. (Update: I’m baffled, this makes no sense.)

Baffled · I’ve written two large-scale many-millions-of-pages Web robots, so I have some experience in this space. And this referer behavior has me shaking my head. Every robot I’ve written or known about has more or less the same basic algorithm; you keep a big pool of URIs to work on, and you have a zillion parallel threads, each doing:


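  while (true)
  {
     // Take the next URI that is due for a fetch.
     URI uri = pool.getUriThatNeedsCrawling();
     Page page = uri.fetch();
     page.addContentToIndex();
     // Feed every link found on the page back into the pool.
     URI link;
     while ((link = page.nextUriInPage()) != null)
       pool.addUri(link);
  }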

Of course, the same URI is going to show up in lots of different pages, so the notion that there is one referer is just wrong.
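
To make that concrete, here is a minimal sketch of a pool that remembers every page a URI was found on. It is not how any real robot (mine or MSNbot) does it; the UriPool class, the two-argument addUri() and referersOf() are names invented for illustration, and URIs are plain strings to keep it short.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical pool, for illustration only.
class UriPool {
    // Every page on which each URI has been seen as a link.
    private final Map<String, Set<String>> referers = new LinkedHashMap<>();
    // URIs waiting to be fetched; each one is queued only once.
    private final Queue<String> pending = new ArrayDeque<>();

    // Record that the page `foundOn` links to `uri`.
    synchronized void addUri(String uri, String foundOn) {
        Set<String> seenOn = referers.get(uri);
        if (seenOn == null) {
            // First sighting: remember the URI and queue it for crawling.
            seenOn = new LinkedHashSet<>();
            referers.put(uri, seenOn);
            pending.add(uri);
        }
        seenOn.add(foundOn);
    }

    // Returns the next queued URI, or null when nothing is waiting.
    synchronized String getUriThatNeedsCrawling() {
        return pending.poll();
    }

    // A copy of all the pages known to link to `uri`.
    synchronized Set<String> referersOf(String uri) {
        return new LinkedHashSet<>(
            referers.getOrDefault(uri, Collections.<String>emptySet()));
    }
}
```

Add the same URI from two different pages and it gets queued once but ends up with two candidate referers; a robot that sends a Referer header has to pick one of them somehow.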

Side-trip: One of the really interesting choices in designing a robot is what getUriThatNeedsCrawling() does. In my most recent robot, I had each thread work on a single site, asking for its pages one by one; other designs give each thread a new random page to fetch, usually not from the same site. You can get a heated discussion going among any group of robot designers by raising this issue (by the way, I’m right). But I digress.
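
For the curious, the two camps might look roughly like this. It is a single-threaded sketch under loud assumptions: the CrawlScheduler interface and both class names are made up, URIs are assumed to be absolute (so getHost() is meaningful), and a real robot would add locking, politeness delays and revisit policies that are left out here.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Random;

// Hypothetical interface; only the method name comes from the loop above.
interface CrawlScheduler {
    void addUri(URI uri);
    URI getUriThatNeedsCrawling(); // null when nothing is left
}

// Site-at-a-time: drain one host's queue before moving to the next host.
class PerSiteScheduler implements CrawlScheduler {
    private final Map<String, Queue<URI>> byHost = new LinkedHashMap<>();

    public void addUri(URI uri) {
        byHost.computeIfAbsent(uri.getHost(), h -> new ArrayDeque<>()).add(uri);
    }

    public URI getUriThatNeedsCrawling() {
        Iterator<Queue<URI>> sites = byHost.values().iterator();
        if (!sites.hasNext()) return null;
        Queue<URI> site = sites.next();      // oldest host that still has work
        URI next = site.poll();
        if (site.isEmpty()) sites.remove();  // host finished, drop its queue
        return next;
    }
}

// Random page: any pending URI, regardless of which site it lives on.
class RandomScheduler implements CrawlScheduler {
    private final List<URI> pending = new ArrayList<>();
    private final Random random;

    RandomScheduler(long seed) { this.random = new Random(seed); }

    public void addUri(URI uri) { pending.add(uri); }

    public URI getUriThatNeedsCrawling() {
        if (pending.isEmpty()) return null;
        int i = random.nextInt(pending.size());
        URI picked = pending.get(i);
        // Move the last element into the hole so removal is constant-time.
        int last = pending.size() - 1;
        pending.set(i, pending.get(last));
        pending.remove(last);
        return picked;
    }
}
```

Feed the per-site version http://a.example/1, http://b.example/1 and http://a.example/2 in that order and it hands back both a.example pages before touching b.example; the random version makes no such promise. Which trade-off is better is exactly the argument mentioned above.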

But I have no idea what MSNbot is doing; maybe they’ve got a radical new crawler architecture. That would be surprising.


June 19, 2003
