WF XI: Results

This is the eleventh progress report from the Wide Finder Project; I’ll use it as the results accumulator, updating it in place even if I go on to write more on the subject, which seems very likely. [Update: Your new leader: Perl.]

The Name · There’ve been a few posts wondering about the name “Wide Finder”. The original Ruby code came from my Beautiful Code chapter called “Finding Things”; thus it was a Finder. The problem with modern CPUs is that they’re getting wider not faster. Thus, I seek a Wide Finder.

Worth Doing · There is a steady drumbeat of commentary along the lines of “WTF? This is a trivial I/O-bound hack you could do on the command line, ridiculously inappropriate for investigating issues of parallelism.” To which I say “Horseshit”. It’s an extremely mainstream, boring, unglamorous file-processing batch job. There’s not the slightest whiff of IT Architecture or Social Networking or Service-Orientation about it. Well, that’s how lots of life is. And the Ruby version runs faster on my Mac laptop than on the mighty T5120. The T5120 is what most of the world’s future computers are going to look like! Houston, we have a problem, and the Wide Finder is a useful lens to examine it.

The Latest · [2007/11/19]: After a week or so off, I’m back to running Wide Finder code. First: Sean O’Rourke’s Perl code.

The good news is that it’s insanely fast; the bad news is that I had to visit CPAN to get Sys::Mmap, and CPAN hates me. Always has. In this case, the Makefile.PL makes a Makefile that doesn’t work on Solaris out of the box; you have to mangle the incantations.

I also ran Dave Thomas’ memory-mapped code.

Neither the table’s format nor contents are carved in stone; I’m quite sure I’ll update it as I pour more results in. Are there any missing columns? Or outright broken-ness? Feel free to offer suggestions.

I still have a few Wide Finder implementations to run; if you’ve done one and it doesn’t appear here, it couldn’t hurt to drop me a line to make sure I know about it.

Results · The data is essentially all of the ongoing logfile from March of 2007; 971,538,252 bytes of data in 4,625,236 lines. I will make it available in compressed form to other experimenters on request, but I’ll require you to convince me that you won’t publish it.
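To put those sizes in perspective, here is a small arithmetic sketch. The byte and line counts are the ones quoted above, and the three elapsed times are copied from rows of the results table; the helper name and the choice of decimal megabytes are assumptions made for illustration only.

```python
# Figures quoted in the text: size of the March 2007 logfile.
TOTAL_BYTES = 971_538_252
TOTAL_LINES = 4_625_236


def throughput_mb_per_s(elapsed_seconds: float) -> float:
    """Decimal megabytes of log data handled per second of elapsed time."""
    return TOTAL_BYTES / elapsed_seconds / 1e6


# Average length of a log line, in bytes.
avg_line_bytes = TOTAL_BYTES / TOTAL_LINES

# Elapsed times (in seconds) copied from three rows of the results table;
# 1:43.71 is written out as 103.71 seconds.
samples = {
    "wf(32) Perl": 1.51,
    "wf(1) Perl": 14.85,
    "report-counts Ruby": 103.71,
}

print(f"average line length: {avg_line_bytes:.1f} bytes")
for name, elapsed in samples.items():
    print(f"{name}: {throughput_mb_per_s(elapsed):.1f} MB/s")
```

These rates only make sense because the runs were repeated to keep the disk cache hot; they say nothing about what the disks themselves can deliver.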

In each case, the benchmark was run at least twice, usually in succession with other benchmarks, in an attempt to have the disk cache as hot as possible.

This is a production T5120 with eight cores each at 1.4GHz and 64G of RAM. Its I/O performance is unexciting.

The table can be sorted by clicking on column headings.

Name	Language	Elapsed	User CPU	System CPU	LoC	Notes
wf(32)	Perl	1.51	16.06	2.89	61	O’Rourke
wf(16)	Perl	1.70	13.79	2.66	61	O’Rourke
wf-mmap-multicore	JoCaml	1.76	2.42	0.55	278	Fernandez
wf(64)	Perl	1.77	18.72	3.56	61	O’Rourke
wf(8)	Perl	2.52	12.63	2.49	61	O’Rourke
wfinder7_2(32)	Erlang	3.54	†	†	345	Nygren
wfinder7_2(16)	Erlang	4.09	†	†	345	Nygren
wf(4)	Perl	4.25	12.24	2.38	61	O’Rourke
wfinder7_2(64)	Erlang	4.27	†	†	345	Nygren
wf-6(16)	Python	4.38	*	*	137	Lundh
wfinder8(32)	Erlang	4.42	46.28	12.27	322	Nygren
wfinder8(64)	Erlang	4.45	56.13	18.92	322	Nygren
wfinder8(16)	Erlang	4.74	38.48	7.40	322	Nygren
tbray9a(128)	Erlang	5.26	36.75	8.39	121	Caoyuan
tbray9a(32)	Erlang	5.29	36.64	8.11	121	Caoyuan
tbray9a(64)	Erlang	5.45	36.79	8.23	121	Caoyuan
tbray9a(16)	Erlang	5.60	36.08	8.27	121	Caoyuan
wf-6(8)	Python	5.81	*	*	137	Lundh
wfinder7_2(8)	Erlang	6.02	†	†	345	Nygren
wfinder8(128)	Erlang	6.02	2:20.71	24.75	322	Nygren
wfinder8(8)	Erlang	6.39	34.94	5.49	322	Nygren
wfinder1_1	Erlang	6.46	34.07	8.02	287	Nygren
tbray9a(8)	Erlang	7.63	35.28	8.33	121	Caoyuan
wf(2)	Perl	7.64	12.16	3.32	61	O’Rourke
wf_pichi3	Erlang	8.28	51.98	9.38	545	Pichi
wf-6(4)	Python	9.08	3.66	1.89	137	Lundh
wfinder7_2(4)	Erlang	9.97	†	†	345	Nygren
wfinder8(4)	Erlang	10.50	33.03	5.12	322	Nygren
tbray9a(4)	Erlang	11.81	35.37	8.25	121	Caoyuan
wf-mmap	OCaml	14.64	12.20	2.44	200	Fernandez
wf(1)	Perl	14.85	12.09	3.25	61	O’Rourke
wf-6(2)	Python	16.91	3.62	1.86	137	Lundh
wfinder8(2)	Erlang	18.88	31.58	4.83	322	Nygren
wf-block	OCaml	18.99	12.96	6.01	144	Fernandez
tbray9a(2)	Erlang	20.14	35.31	8.28	121	Caoyuan
tbray5	Erlang	20.74	3:51.33	8.00	76	Caoyuan
wfinder8(1)	Erlang	36.11	31.17	4.72	322	Nygren
tbray9a(1)	Erlang	37.58	35.51	7.82	121	Caoyuan
wf	OCaml	39.17	31.48	7.69	124	Fernandez
wf-2	Python	41.04	34.80	6.24	38	Lundh
widefinder	Perl	44.29	1:15.22	12.78	57	Wong
clv5	Gawk	46.73	40.63	6.10	24	Paddy3118
wf	OCaml	49.69	41.94	7.75	110	Heikkinen
wf_p	Ruby	50.16	37.58	12.50	39	Heikkinen
dave	Ruby	58.27	43.18	14.39	8	Thomas
widefinder3	PHP	1:00.25	55.04	5.21	39	Beattie
tbray5	Erlang	1:04.32	35:33.35	45.84	93	Vinoski
report-counts	Ruby	1:43.71	1:27.11	16.60	13	Bray
?	Groovy	2:21.83	2:22.97	19.95	17	Brown

Variability · The quantities in the cells marked “*” above exhibited so much variability from run to run that including them would probably be actively misleading.

Unknown · The quantities in the cells marked “†” are unknown. These multi-process Erlang runs arrange for the absence of parent/child process relationships, so it’s trickier to determine User and System CPU times.
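To illustrate why the parent/child relationship matters, here is a minimal Python sketch of a timing harness. It is not the tooling used for the table; `time_command` and the little CPU-burning workload are hypothetical names invented for this example. It relies on `resource.getrusage(resource.RUSAGE_CHILDREN)`, which is Unix-only and only accumulates CPU time for descendants that have terminated and been waited for.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion; return (elapsed, user_cpu, system_cpu) in seconds.

    The CPU figures come from RUSAGE_CHILDREN, which only covers descendants
    that have terminated and been waited for. Workers that are not
    waited-for descendants of the timed command do not show up here.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # blocks until the direct child exits
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user_cpu = after.ru_utime - before.ru_utime
    system_cpu = after.ru_stime - before.ru_stime
    return elapsed, user_cpu, system_cpu


if __name__ == "__main__":
    # Hypothetical workload: a short CPU-bound loop in a child interpreter.
    burn = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]
    elapsed, user_cpu, system_cpu = time_command(burn)
    print(f"elapsed {elapsed:.2f}s  user {user_cpu:.2f}s  system {system_cpu:.2f}s")
```

A harness like this gets elapsed time right regardless, but if the workers are started so that they are not waited-for descendants of the timed process, their CPU time never lands in these counters.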

Notes on Running Nygren’s wfinder7_2 · This is multi-process, not multi-threaded. The best results are with 32 processes.

Notes on Running Nygren’s wfinder8 · Check out Anders’ notes. The number in parentheses is the value of the +S argument to Erlang, telling it how many schedulers to run. Note that this is an 8-core machine with two integer instruction threads per core and support for eight thread contexts per core. Solaris thinks it sees 64 CPUs. So if you don’t specify +S the default is 64, which seems about right.

Notes on Running the wf-* Python Code · This code is from Fredrik Lundh; see his discussion. I added a number-of-processes command-line parameter to wf-6.py, which appears in parentheses in the table above; so wf-6(8) means running with eight processes. I left the chunk size at 50M for now; this is just barely small enough to run with 16 processes.
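For readers who haven’t seen the chunk-per-process pattern, here is a rough Python sketch of the general idea. It is not Fredrik’s wf-6.py: the function names are hypothetical, it merely counts lines rather than doing the real Wide Finder work, and reading “50M” as 50 × 1024 × 1024 bytes is an assumption.

```python
import multiprocessing
import os
import sys


def chunk_ranges(path, chunk_size):
    """Yield (start, end) byte ranges of about chunk_size bytes each.

    Every range ends on a line boundary, so no line is split between workers.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < size:
            f.seek(min(start + chunk_size, size))
            f.readline()  # run on to the end of the line in progress
            end = f.tell()
            yield start, end
            start = end


def count_lines(job):
    """Worker: count the newline-terminated lines inside one byte range."""
    path, start, end = job
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")


def parallel_line_count(path, processes, chunk_size):
    """Split the file into chunks and count lines with a pool of processes."""
    jobs = [(path, start, end) for start, end in chunk_ranges(path, chunk_size)]
    with multiprocessing.Pool(processes) as pool:
        return sum(pool.map(count_lines, jobs))


if __name__ == "__main__":
    # Hypothetical usage: python chunked_count.py <logfile> <processes>
    if len(sys.argv) == 3:
        log_path, nprocs = sys.argv[1], int(sys.argv[2])
        print(parallel_line_count(log_path, nprocs, 50 * 1024 * 1024))
```

In this sketch each worker reads its whole chunk into memory, so peak memory grows roughly with the number of processes times the chunk size; that is the kind of trade-off the chunk-size setting controls.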

Fredrik took a glance and suggested removing wf-3 through wf-5 in the interests of making the table more readable.

The reporting of user and system CPU time was wildly variable. For example, wf-6(8) reported user CPU as low as 5.06 seconds and as high as 31.99 seconds. I suspect that this may be worth bringing to the attention of the Solaris people; the state of the art in tracking CPU usage in this fairly exotic type of machine is still a little shaky, apparently.

Analysis · Are you kidding me!?!? Getouttahere. Maybe someday.

Previous News · [2007/10/30]: I’ve been back from Shanghai for a few days now, but it took till past lunch today to get logged into the mighty T5120 and doing some Wide Finding. It’s a pretty naked Solaris box, and compiling all this stuff has been just a bundle of fun. Not.

[2007/10/31]: Added a Lines-of-Code column to the table. Ran Fredrik Lundh’s multi-process Python, Ilmari Heikkinen’s OCaml, and Russ Beattie’s PHP. There are a few that I’ve tried and failed to run; Erlang code that won’t compile, C and JoCaml code that will compile but not run. In each case I’ve pinged the author, and I may go back and try to see if I can sort things out.

[2007/11/01]: Got Pichi’s patched Erlang to run. 545 lines, wow! Caoyuan’s too, much more concise, but not as fast.

[2007/11/01]: Did some runs with Nygren’s wfinder8; details here.

[2007/11/04]: Per advice from contributors below, I installed sorttable and it seems to work just fine!

[2007/11/05]: I ran Caoyuan’s latest; notes here.

Then I ran Mauricio Fernandez’ OCaml and JoCaml; see Aim for the Top! Beating the current #1 Wide Finder log analyzer with the join-calculus. Holy crap! I have to say that the OCaml build/run process is kind of klunky, this isn’t your classic REPL by any means. But you know, this isn’t the first time that I’ve seen OCaml thump the competition on a benchmark; it’s just the first time that it was a benchmark I cared about.

[2007/11/08]: I ran Eric Wong’s Perl code (notes here); kind of disappointing.

Then I ran Anders Nygren’s revised wfinder7_2; he writes: “The difference compared to wfinder8 is that this one does not use the SMP virtual machine. Instead it starts a number of slave erlang nodes.”

Caoyuan revised his tbray9 to produce tbray9a, requesting that I replace the results.

[2007/11/09]: Russ Beattie revised his PHP code, renaming it widefinder3, and indeed, it moved a few places up the table.

Also, based on several requests about this box’s I/O performance, I did a Bonnie run, which is kind of interesting.

I/O · There have been appeals, both in the comments and in email, for a characterization of the test box’s I/O performance. I ran Bonnie with the following results:


     -------Sequential Output-------- ---Sequential Input-- --Random--
     -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec  %CPU
  20  23.9 99.7  41.5 29.8  44.2 61.8  22.3  100 161.8  100 21835 239.4

Given that this stupid thing has 64G of RAM and I could only find 20G of disk space to work with, the results should be taken with a grain of salt (especially the Random Seeks number). Having said that, watching the free-memory readout made it look to me as though when it was doing the sequential runs, it was doing real I/O, not just caching.

Note that when it says “100% CPU”, that means 100% of one of the 64 logical CPUs that Solaris sees on this box. And in fact, it looks to me that doing C I/O on this box pegs one of the cores, whether you’re using getc() or read().

Contributions

From: Stuart Langridge (Oct 31 2007, at 01:40)

My sorttable (http://www.kryogenix.org/code/browser/sorttable/) is a JS table sorter. I'm not sure how it'll cope with the 1:44.32 format times, mind...