This is the eighth progress report from the Wide Finder Project; a quick comment-light aggregation of work that other people have been doing in the space. I’ve managed to get access to an unannounced many-core server and have some preliminary results (summary: Vinoski’s in the lead); I’ll publish those in the, uh, very near future, when things are, uh, less unannounced.

I’ll also make a serious effort to grab all the code I’ve been linking to here, run it on the T2, and report back, but probably won’t be able to start till next Thursday. By the way, extra credit is given for solving the whole problem and replicating all the functionality of the original Ruby script: computing the ten most popular fetches by matching a pattern that includes the GET in front of the URI. If you’re working on this and I’ve missed you, drop me a line.
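For anyone joining late, the whole problem is small enough to sketch. The original script is Ruby; here’s a rough Python equivalent of what “full credit” looks like — the regexp below is my reconstruction of the article-URI shape described above (GET plus date-stamped path), so treat the exact pattern as an assumption, not gospel:

```python
import re
import sys
from collections import Counter

# Assumed pattern: match the GET plus ongoing's date-stamped article URIs.
# The original Ruby script's exact regexp may differ in detail.
PAT = re.compile(r'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')

def top_fetches(lines, n=10):
    """Count matching fetches and return the n most popular."""
    counts = Counter()
    for line in lines:
        m = PAT.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        for key, count in top_fetches(f):
            print(f'{count}: {key}')
```

Serial, obvious, and short — which is exactly why it makes a good baseline for the concurrency experiments below.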

OK, off to the races.

Java · Over at UnintentionalObjectRetention (yet another blogger with apparently no name, sigh) we have Wide Finder and Java, More Threads FTW, and Disk caching and file splitting.

C · Ilmari Heikkinen has Wide Finder implementations, but seems kind of discouraged and negative about the whole thing.

Python · The Pythonistas have arrived with a mighty slither of scales. Bill de hÓra has a serial implementation. Fredrik Lundh takes a serious run at the problem in Some Notes on Tim Bray's Wide Finder Benchmark. Stan Seibert’s Parallel Processing in Python with processing is a direct response.

Erlang · Pichi follows up on Caoyuan’s work, already reported here, in Binaries really faster than lists.

C# · I also got a damn interesting C# implementation, but it turned out to involve some unreleased infrastructure and its author had to back off. Hey man, I’d love to run that code if you can make it Safe For Work.



From: Fredrik (Oct 07 2007, at 14:22)

I'm pretty sure me and Seibert qualify as Pythonistas (even if I see myself more as a Pythoneer).

Btw, Erlangista Vinoski has tried my stuff on a big machine:

His HTML is a bit broken at the moment, but his findings are top notch.


From: Hynek (Pichi) Vychodil (Oct 07 2007, at 23:49)

I'm not sure I didn't make a mistake. Caoyuan and I need to test binary vs. list performance more carefully. But I think we both agree that binaries are a good solution for the initial partitioning of the problem. What's best for the final processing is not clear yet.


From: Fredrik (Oct 08 2007, at 02:17)

(When I wrote my earlier comment, Tim had my work listed under Erlang ;-)

For the record, Andrew Dalke has followed up on my article with a few more experiments, and an interesting discussion on *why* we're doing this:


From: Andrew Dalke (Oct 09 2007, at 17:12)

More specifically, here are my numbers when using Python to analyze 1,000,000 records on a 2.33GHz MacBook Pro. Using regexps directly on a memory-mapped file makes for clean code and a fast 1.4 sec. With somewhat ugly code that does simple filtering plus post-processing I can get to 1.0 sec. With mxTextTools, a third-party Python extension for processing text, I can get down to 0.75 sec.
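Dalke doesn’t paste his code in the comment, but the memory-mapped variant he describes might look something like this — a sketch, not his actual program, and it reuses the same reconstructed pattern discussed earlier in the post (a bytes regexp, since mmap hands back a buffer, not text):

```python
import mmap
import re
from collections import Counter

# Assumed pattern, as a bytes regexp so it can scan the mapped buffer directly.
PAT = re.compile(rb'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')

def count_mmap(path, n=10):
    """Run the regexp over a memory-mapped log file, with no per-line splitting."""
    counts = Counter()
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in PAT.finditer(mm):
                counts[m.group(1)] += 1
    return counts.most_common(n)
```

Skipping the line-by-line loop and letting the regexp engine walk the whole mapping is presumably where the “clean code and a fast 1.4 sec” comes from: one pass, no string slicing in Python-land.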

By comparison, your original Ruby takes 2.3 sec, the parallel Ruby code at (linked by Ilmari Heikkinen) takes 1.69 sec, and the C code wf3.c available from takes 1.23 sec.

(All times on the same machine. Not connected to an outlet. Reporting best of two, but so far all times for a program run have been close to each other.)

Why is the Python code faster than the C code? I think it's because Python implements a better text search, both in the regexp engine and in the string ops, than that one-off C code. The Python times <= 1 second use Boyer-Moore variants.


From: Alastair Rankine (Oct 09 2007, at 19:53)

Hey Tim, as long as you're collecting implementations, my 42-line C++ version is here:

No attempt to make it concurrent... yet.


October 07, 2007