Last fall, I ran the Wide Finder Project. The results were interesting, but incomplete; it was a real shoestring operation. I think this line of work is interesting, so I’m restarting it. I’ve got a new computer and a new dataset, and anyone who’s interested can play.
The Story Thus Far · I’ll use this entry as a table-of-contents for the series as it grows.
The Problem · The problem is that lots of simple, basic data-processing operations, in my case a simple Ruby script, run like crap on modern many-core processors. Since the whole world is heading in the slower/many-core direction, this is an unsatisfactory situation.
If you look at the results from last time, it’s obvious that there are solutions, but the ones we’ve seen so far impose an awful complexity cost on the programmer. The holy grail would be something that maximizes the ratio of performance increase per core to programmer effort. My view: Anything that requires more than twice as much source code to take advantage of many-core is highly suspect.
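For a concrete sense of the kind of script in question, here is a minimal sketch in Ruby. It is not necessarily the Wide Finder 1 program itself; it just counts article fetches in a Web server logfile and reports the most popular ones. The regular expression, the URL layout it assumes, and the name `top_fetches` are all hypothetical, chosen for illustration.

```ruby
# Hypothetical pattern: a GET for an article page under /ongoing/When/.
# The URL layout is an assumption made for this sketch.
ARTICLE = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Count article fetches in an enumerable of log lines and return the
# `limit` most-fetched articles as [path, count] pairs, most popular first.
def top_fetches(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    # The match sets $1 to the captured article path.
    counts[$1] += 1 if line =~ ARTICLE
  end
  counts.sort_by { |_path, count| -count }.first(limit)
end

# Typical use, reading the files named on the command line (or stdin):
#   top_fetches(ARGF.each_line).each { |path, count| puts "#{count}: #{path}" }
```

Everything here happens in one loop on one thread, so on a many-core machine all but one core sits idle while that loop grinds through the file. That is the baseline the "twice as much source code" rule of thumb is measured against.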
Last Time & This Time · There were a few problems last time. First, the disk wasn’t big enough and the sample data was too small (much smaller than the computer’s memory). Second, I could never get big Java programs to run properly on that system; something locked up and went weird. Finally, when I had trouble compiling other people’s code, I eventually ran out of patience and gave up. One consequence is that no C or C++ candidates ever ran successfully.
This time, we have sample data that’s larger than main memory and we have our own computer, and I’ll be willing to give anyone who’s seriously interested their own account to get on and fine-tune their own code.
The Set-Up · This time, the computer is a T2000, with 8 cores and 32 threads. It’s actually slower (well, fewer cores) than the semi-pre-production T5120 I was working with last time, but it’s got over 250G of pretty fast disks, which I’ve built a ZFS pool and one filesystem on.
I’ve also staged all the ongoing logfiles from February 2003 to April 2008 there, 45G of data in 218 million lines.
It’s Internet-facing, with ssh access and port 80 open (not that I’m running a Web server yet).
Want to Join In? · I’ve set up a WideFinder Project Wiki. The details of how to get started are there. For now, anyone with a wikis.sun.com account can write to it, which I’m hoping will be an adequate vandalism filter.
The First Step · Before we start coding, we need to agree on a benchmark; the 13-line Ruby program was an instructive target for Wide Finder 1, but several people pointed out that this could be done with a sort one-liner, so something a little more ambitious might be appropriate. I’ve made a couple of initial suggestions on