This one for web-tech aficionados only.
Those of you who watch your webserver logs, go do a
fgrep msnbot access_log
(MSNbot got to ongoing today).
Unlike any robot I’ve seen or heard of, MSNbot tells you the
referer, so you can actually watch the trails it takes into and through
your online presence.
Neato. Ordinary people who are well-integrated with the real world can
safely ignore this fascinating discovery and be fairly sure it will not
impair their quality-of-life.
Move along, now; nothing to watch here.
(Update: I’m baffled, this makes no sense.)
Baffled ·
I’ve written two large-scale many-millions-of-pages Web robots, so I
have some experience in this space.
And this referer behavior has me shaking my head.
Every robot I’ve written or known about has more or less the same basic
algorithm: you keep a big pool of URIs to work on, and you have a zillion
parallel threads, each doing:
while (true)
{
    URI uri = pool.getUriThatNeedsCrawling();
    Page page = uri.fetch();
    page.addContentToIndex();
    URI link;
    // nextUriInPage() is assumed to return null once the page has no more links
    while ((link = page.nextUriInPage()) != null)
        pool.addUri(link);
}
Notice that nothing in there remembers where a URI came from. The same URI
is, of course, going to show up in lots of different pages, so the notion
that there is one referer is just wrong.
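To make the point concrete, here's a minimal sketch in Java of a pool that does try to remember where links came from. Everything in it is made up for illustration: the class UriPool, the two-argument addUri (the loop above only passes the link), and referrersOf. It also uses plain strings for URIs to stay short. It is not how any particular robot works, MSNbot included.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical pool; none of these names come from a real crawler.
// Not thread-safe: a real robot would need locking or concurrent collections.
class UriPool {
    // Each known URI maps to the set of pages seen linking to it.
    private final Map<String, Set<String>> linkedFrom = new LinkedHashMap<>();
    // URIs discovered but not yet handed out for fetching.
    private final Queue<String> pending = new ArrayDeque<>();

    // Record that `source` links to `uri`. The URI is queued only the
    // first time it is seen. Returns true if it was new to the pool.
    boolean addUri(String uri, String source) {
        Set<String> sources = linkedFrom.get(uri);
        boolean isNew = (sources == null);
        if (isNew) {
            sources = new LinkedHashSet<>();
            linkedFrom.put(uri, sources);
            pending.add(uri);
        }
        sources.add(source);
        return isNew;
    }

    // Next URI to fetch, or null when nothing is pending.
    String getUriThatNeedsCrawling() {
        return pending.poll();
    }

    // Every page known to link to `uri` (empty if the URI is unknown).
    Set<String> referrersOf(String uri) {
        Set<String> sources = linkedFrom.get(uri);
        return sources == null ? Collections.<String>emptySet() : sources;
    }
}
```

Add the same URI from two different pages and it gets queued once but ends up with two entries in its set, so a robot built this way would have to pick one of them more or less arbitrarily to send as the referer.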
Side-trip: One of the real interesting choices in designing a robot
is what getUriThatNeedsCrawling() does.
In my most recent robot, I had each thread work on a single site, asking for
its pages one by one; other designs give each thread a new random page to
fetch, usually not from the same site.
You can get a heated discussion going among any group of robot designers by
raising this issue (by the way, I’m right). But I digress.
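For the curious, the two policies can be sketched side by side in Java. This is a toy under stated assumptions: the Frontier class and its nextFromSite and nextRandom methods are hypothetical names, "site" is taken to mean the URI's host, and the whole thing lives in one unsynchronized list, which no robot crawling millions of pages would do.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical frontier showing two ways to choose the next URI.
// Not thread-safe and linear-time on purpose; it only illustrates the policies.
class Frontier {
    private final List<URI> pending = new ArrayList<>();
    private final Random random = new Random();

    void add(URI uri) {
        pending.add(uri);
    }

    // Policy 1: the calling thread sticks to one site and takes its
    // pages one by one, in the order they were added.
    // Returns null when that site has nothing left.
    URI nextFromSite(String host) {
        Iterator<URI> it = pending.iterator();
        while (it.hasNext()) {
            URI uri = it.next();
            if (host.equalsIgnoreCase(uri.getHost())) {
                it.remove();
                return uri;
            }
        }
        return null;
    }

    // Policy 2: hand out any pending URI at random, so consecutive
    // fetches usually land on different sites.
    // Returns null when nothing is pending.
    URI nextRandom() {
        if (pending.isEmpty()) {
            return null;
        }
        return pending.remove(random.nextInt(pending.size()));
    }
}
```

A thread using the first policy would loop on nextFromSite("www.example.com") until it returns null and then move on to another host; a thread using the second would just keep calling nextRandom().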
Back to the referer: I have no idea what MSNbot is doing; maybe they’ve got a radical new crawler architecture. That would be surprising.