This is the eleventh progress report from the Wide Finder Project; I’ll use it as the results accumulator, updating it in place even if I go on to write more on the subject, which seems very likely. [Update: Your new leader: Perl.]
The Name · There’ve been a few posts wondering about the name “Wide Finder”. The original Ruby code came from my Beautiful Code chapter called “Finding Things”; thus it was a Finder. The problem with modern CPUs is that they’re getting wider, not faster. Thus, I seek a Wide Finder.
Worth Doing · There is a steady drumbeat of commentary along the lines of “WTF? This is a trivial I/O-bound hack you could do on the command line, ridiculously inappropriate for investigating issues of parallelism.” To which I say “Horseshit”. It’s an extremely mainstream, boring, unglamorous file-processing batch job. There’s not the slightest whiff of IT Architecture or Social Networking or Service-Orientation about it. Well, that’s how lots of life is. And the Ruby version runs faster on my Mac laptop than on the mighty T5120. The T5120 is what most of the world’s future computers are going to look like! Houston, we have a problem, and the Wide Finder is a useful lens to examine it.
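For anyone who hasn’t seen the problem statement: the job is to scan an Apache logfile and report the ten most-fetched ongoing articles. Here’s a minimal serial sketch of it in Python, my transliteration rather than anything from the project (the filename is hypothetical; the regex follows the shape of ongoing’s article URIs):

    import re
    from collections import Counter

    # Count fetches of ongoing articles, whose URIs look like
    # /ongoing/When/200x/2007/03/14/Some-Title
    pat = re.compile(r'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')
    counts = Counter()
    with open('access.log') as f:   # hypothetical filename
        for line in f:
            m = pat.search(line)
            if m:
                counts[m.group(1)] += 1
    for article, n in counts.most_common(10):
        print(n, article)

The interesting question is how to spread that loop across a T5120’s many slow threads.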
The Latest · [2007/11/19]: After a week or so off, I’m back to running Wide Finder code. First: Sean O’Rourke’s Perl code.
The good news is that it’s insanely fast; the bad news is that I had to visit CPAN to get Sys::Mmap, and CPAN hates me. Always has. In this case, its Makefile.PL makes a Makefile that doesn’t work on Solaris out of the box until you mangle the incantations.
I also ran Dave Thomas’ memory-mapped code.
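Both of those rely on mapping the file into memory with mmap(2) and scanning it in place, rather than read()ing it a line at a time. A rough Python rendering of that technique, a sketch only and not either author’s code:

    import mmap
    import re
    from collections import Counter

    pat = re.compile(rb'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')
    counts = Counter()
    with open('access.log', 'rb') as f:   # hypothetical filename
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Scan the whole mapping directly; no per-line buffer management.
            for m in pat.finditer(mm):
                counts[m.group(1)] += 1
    for article, n in counts.most_common(10):
        print(n, article.decode())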
Neither the table’s format nor contents are carved in stone; I’m quite sure I’ll update it as I pour more results in. Are there any missing columns? Or outright broken-ness? Feel free to offer suggestions.
I still have a few Wide Finder implementations to run; if you’ve done one and it doesn’t appear here, it couldn’t hurt to drop me a line to make sure I know about it.
Results · The data is essentially all of the ongoing logfile from March of 2007; 971,538,252 bytes of data in 4,625,236 lines. I will make it available in compressed form to other experimenters on request, but I’ll require you to convince me that you won’t publish it.
In each case, the benchmark was run at least twice, usually in succession with other benchmarks, in an attempt to have the disk cache as hot as possible.
This is a production T5120 with eight cores, each at 1.4GHz, and 64G of RAM. Its I/O performance is unexciting.
The table can be sorted by clicking on column headings.
The quantities in the cells marked “*” above exhibited so much variability from run to run that including them would probably be actively misleading.
Unknown · The quantities in the cells marked “†” are unknown. These multi-process Erlang runs arrange for the absence of parent/child process relationships, so it’s trickier to determine User and System CPU times.
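The mechanics, for the curious: the standard accounting call, getrusage(RUSAGE_CHILDREN), only aggregates CPU time for child processes the caller has waited for, so workers that aren’t your children simply never show up in the totals. A small Unix-only Python illustration of that limitation (the busy-work is arbitrary):

    import os
    import resource

    # Fork a child, burn some CPU in it, wait for it, then ask the
    # kernel for the accumulated CPU time of waited-for children.
    pid = os.fork()
    if pid == 0:
        sum(range(10_000_000))      # arbitrary user-CPU busy-work
        os._exit(0)
    os.waitpid(pid, 0)
    ru = resource.getrusage(resource.RUSAGE_CHILDREN)
    print("user:", ru.ru_utime, "system:", ru.ru_stime)
    # Processes launched some other way (e.g. independently started
    # slave nodes) are invisible to this call.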
Notes on Running Nygren’s wfinder7_2 · This is multi-process, not multi-threaded. The best results were with 32 processes.
Notes on Running Nygren’s wfinder8 · The number in parentheses is the value of the +S argument to Erlang, telling it how many schedulers to run. Note that this is an 8-core machine with two integer instruction threads per core and support for eight thread contexts per core. Solaris thinks it sees 64 CPUs. So if you don’t specify +S, the default is 64, which seems about right.
Notes on Running the wf-* Python Code · This code is from Fredrik Lundh; see his writeup on the benchmark. I added a number-of-processes command-line parameter to the scripts, which appears in parentheses in the table above; so “wf-6(8)” means running wf-6 with eight processes. I left the chunk size at 50M for now; this is just barely small enough to run with 16 processes.
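The overall pattern, sketched in Python with hypothetical names (this is not Fredrik’s actual wf-* code): carve the file into fixed-size chunks, hand the chunks to a pool of worker processes, and merge the per-chunk counts at the end.

    import multiprocessing
    import os
    import re
    from collections import Counter

    PAT = re.compile(rb'GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ')
    CHUNK = 50 * 1024 * 1024        # 50M, matching the runs above

    def count_chunk(args):
        path, start = args
        counts = Counter()
        with open(path, 'rb') as f:
            if start:
                # If we landed mid-line, skip the fragment; the previous
                # chunk reads any line that straddles its end.
                f.seek(start - 1)
                if f.read(1) != b'\n':
                    f.readline()
            while f.tell() < start + CHUNK:
                line = f.readline()
                if not line:
                    break
                m = PAT.search(line)
                if m:
                    counts[m.group(1)] += 1
        return counts

    def wide_find(path, nprocs):
        size = os.path.getsize(path)
        work = [(path, off) for off in range(0, size, CHUNK)]
        total = Counter()
        with multiprocessing.Pool(nprocs) as pool:
            for partial in pool.map(count_chunk, work):
                total.update(partial)
        return total.most_common(10)

    if __name__ == '__main__':
        for article, n in wide_find('access.log', 8):   # hypothetical filename
            print(n, article.decode())

The chunk boundaries are the fiddly part: a line that straddles one has to be counted exactly once, by exactly one worker.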
Fredrik took a glance and suggested removing wf-5 in the interests of making the table more readable.
The reporting of user and system CPU time was wildly variable; wf-6(8) reported user CPU as low as 5.06 seconds and as high as 31.99 seconds. I suspect this may be worth bringing to the attention of the Solaris people; the state of the art in tracking CPU usage on this fairly exotic type of machine is still a little shaky.
Analysis · Are you kidding me!?!? Getouttahere. Maybe someday.
Previous News · [2007/10/30]: I’ve been back from Shanghai for a few days now, but it took till past lunch today to get logged into the mighty T5120 and doing some Wide Finding. It’s a pretty naked Solaris box, and compiling all this stuff has been just a bundle of fun. Not.
[2007/10/31]: Added a Lines-of-Code column to the table. Ran Fredrik Lundh’s multi-process Python, Ilmari Heikkinen’s OCaml, and Russ Beattie’s PHP. There are a few that I’ve tried and failed to run; Erlang code that won’t compile, C and JoCaml code that will compile but not run. In each case I’ve pinged the author, and I may go back and try to see if I can sort things out.
[2007/11/01]: Got Pichi’s patched Erlang to run. 545 lines, wow! Caoyuan’s too, much more concise, but not as fast.
[2007/11/01]: Did some runs with Nygren’s wfinder8; see the notes above.
[2007/11/04]: Per advice from contributors below, I installed sorttable and it seems to work just fine!
[2007/11/05]: I ran Caoyuan’s latest; notes here.
Then I ran Mauricio Fernandez’ OCaml and JoCaml; see Aim for the Top! Beating the current #1 Wide Finder log analyzer with the join-calculus. Holy crap! I have to say that the OCaml build/run process is kind of klunky, this isn’t your classic REPL by any means. But you know, this isn’t the first time that I’ve seen OCaml thump the competition on a benchmark; it’s just the first time that it was a benchmark I cared about.
Then I ran Anders Nygren’s revised wfinder7_2; he writes: “The difference compared to wfinder8 is that this one does not use the SMP virtual machine. Instead it starts a number of slave erlang nodes.”
Caoyuan revised his tbray9 to produce a new version, requesting that I replace the results.
[2007/11/09]: Russ Beattie revised his PHP code, renaming it widefinder3, and indeed, it moved a few places up the table.
Also, based on several requests about this box’s I/O performance, I did a Bonnie run, which is kind of interesting.
I/O · There have been appeals, both in the comments and in email, for a characterization of the test box’s I/O performance. I ran Bonnie with the following results:
       -------Sequential Output-------- ---Sequential Input-- --Random--
       -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
    GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
    20  23.9 99.7  41.5 29.8  44.2 61.8  22.3  100 161.8  100 21835 239.4
Given that this stupid thing has 64G of RAM and I could only find 20G of disk space to work with, the results should be taken with a grain of salt (especially the Random Seeks number). Having said that, watching the free-memory readout made it look to me as though when it was doing the sequential runs, it was doing real I/O, not just caching.
Note that when it says “100% CPU”, that means 100% of one of the 64 logical CPUs that Solaris sees on this box. And in fact, it looks to me that doing C I/O on this box pegs one of the cores, whether you’re using per-character or block I/O.
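If you want to reproduce that observation without Bonnie, a crude check is to read the file in big blocks while watching the process CPU clocks; a rough Python sketch, nowhere near as careful as Bonnie (the filename is hypothetical):

    import os
    import time

    def read_rate(path, bs=1 << 20):
        # Read `path` in 1M blocks; report throughput and CPU time consumed.
        t0, c0 = time.time(), os.times()
        n = 0
        with open(path, 'rb') as f:
            while True:
                buf = f.read(bs)
                if not buf:
                    break
                n += len(buf)
        t1, c1 = time.time(), os.times()
        print(f"{n / (t1 - t0) / 1e6:.1f} MB/s, "
              f"user {c1.user - c0.user:.2f}s, sys {c1.system - c0.system:.2f}s")

    read_rate('access.log')    # hypothetical filename

If user plus system time accumulates at about one second per second of wall clock, you’re pegging one logical CPU.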