This is the eleventh progress report from the Wide Finder Project; I’ll use it as the results accumulator, updating it in place even if I go on to write more on the subject, which seems very likely. [Update: Your new leader: Perl.]

The Name · There’ve been a few posts wondering about the name “Wide Finder”. The original Ruby code came from my Beautiful Code chapter called “Finding Things”; thus it was a Finder. The problem with modern CPUs is that they’re getting wider not faster. Thus, I seek a Wide Finder.

Worth Doing · There is a steady drumbeat of commentary along the lines of “WTF? This is a trivial I/O-bound hack you could do on the command line, ridiculously inappropriate for investigating issues of parallelism.” To which I say “Horseshit”. It’s an extremely mainstream, boring, unglamorous file-processing batch job. There’s not the slightest whiff of IT Architecture or Social Networking or Service-Orientation about it. Well, that’s how lots of life is. And the Ruby version runs faster on my Mac laptop than on the mighty T5120. The T5120 is what most of the world’s future computers are going to look like! Houston, we have a problem, and the Wide Finder is a useful lens to examine it.

The Latest · [2007/11/19]: After a week or so off, I’m back to running Wide Finder code. First: Sean O’Rourke’s Perl code.

The good news is that it’s insanely fast; the bad news is that I had to visit CPAN to get Sys::Mmap, and CPAN hates me. Always has. In this case, the Makefile.PL makes a Makefile that doesn’t work on Solaris, out of the box until you mangle the incantations.

I also ran Dave Thomas’ memory-mapped code.

Neither the table’s format nor contents are carved in stone; I’m quite sure I’ll update it as I pour more results in. Are there any missing columns? Or outright broken-ness? Feel free to offer suggestions.

I still have a few Wide Finder implementations to run; if you’ve done one and it doesn’t appear here, it couldn’t hurt to drop me a line to make sure I know about it.

Results · The data is essentially all of the ongoing logfile from March of 2007; 971,538,252 bytes of data in 4,625,236 lines. I will make it available in compressed form to other experimenters on request, but I’ll require you to convince me that you won’t publish it.

In each case, the benchmark was run at least twice, usually in succession with other benchmarks, in an attempt to have the disk cache as hot as possible.

This is a production T5120 with eight cores each at 1.4GHz and 64G of RAM. It’s I/O performance is unexciting.

The table can be sorted by clicking on column headings.

Name Language Elapsed User
CPU
System
CPU
LoC Notes
wf(32) Perl 1.51 16.06 2.89 61 O’Rourke
wf(16) Perl 1.70 13.79 2.66 61 O’Rourke
wf-mmap-multicore JoCaml 1.76 2.42 0.55 278 Fernandez
wf(64) Perl 1.77 18.72 3.56 61 O’Rourke
wf(8) Perl 2.52 12.63 2.49 61 O’Rourke
wfinder7_2(32) Erlang 3.54 345 Nygren
wfinder7_2(16) Erlang 4.09 345 Nygren
wf(4) Perl 4.25 12.24 2.38 61 O’Rourke
wfinder7_2(64) Erlang 4.27 345 Nygren
wf-6(16) Python 4.38 * * 137 Lundh
wfinder8(32) Erlang 4.42 46.28 12.27 322 Nygren
wfinder8(64) Erlang 4.45 56.13 18.92 322 Nygren
wfinder8(16) Erlang 4.74 38.48 7.40 322 Nygren
tbray9a(128) Erlang 5.26 36.75 8.39 121 Caoyuan
tbray9a(32) Erlang 5.29 36.64 8.11 121 Caoyuan
tbray9a(64) Erlang 5.45 36.79 8.23 121 Caoyuan
tbray9a(16) Erlang 5.60 36.08 8.27 121 Caoyuan
wf-6(8) Python 5.81 * * 137 Lundh
wfinder7_2(8) Erlang 6.02 345 Nygren
wfinder8(128) Erlang 6.02 2:20.71 24.75 322 Nygren
wfinder8(8) Erlang 6.39 34.94 5.49 322 Nygren
wfinder1_1 Erlang 6.46 34.07 8.02 287 Nygren
tbray9a(8) Erlang 7.63 35.28 8.33 121 Caoyuan
wf(2) Perl 7.64 12.16 3.32 61 O’Rourke
wf_pichi3 Erlang 8.28 51.98 9.38 545 Pichi
wf-6(4) Python 9.08 3.66 1.89 137 Lundh
wfinder7_2(4) Erlang 9.97 345 Nygren
wfinder8(4) Erlang 10.50 33.03 5.12 322 Nygren
tbray9a(4) Erlang 11.81 35.37 8.25 121 Caoyuan
wf-mmap OCaml 14.64 12.20 2.44 200 Fernandez
wf(1) Perl 14.85 12.09 3.25 61 O’Rourke
wf-6(2) Python 16.91 3.62 1.86 137 Lundh
wfinder8(2) Erlang 18.88 31.58 4.83 322 Nygren
wf-block OCaml 18.99 12.96 6.01 144 Fernandez
tbray9a(2) Erlang 20.14 35.31 8.28 121 Caoyuan
tbray5 Erlang 20.74 3:51.33 8.00 76 Caoyuan
wfinder8(1) Erlang 36.11 31.17 4.72 322 Nygren
tbray9a(1) Erlang 37.58 35.51 7.82 121 Caoyuan
wf OCaml 39.17 31.48 7.69 124 Fernandez
wf-2 Python 41.04 34.80 6.24 38 Lundh
widefinder Perl 44.29 1:15.22 12.78 57 Wong
clv5 Gawk 46.73 40.63 6.10 24 Paddy3118
wf OCaml 49.69 41.94 7.75 110 Heikkinen
wf_p Ruby 50.16 37.58 12.50 39 Heikkinen
dave Ruby 58.27 43.18 14.39 8 Thomas
widefinder3 PHP 1:00:25 55:04 5.21 39 Beattie
tbray5 Erlang 1:04.32 35:33.35 45.84 93 Vinoski
report-counts Ruby 1:43.71 1:27.11 16.60 13 Bray
? Groovy 2:21.83 2:22.97 19.95 17 Brown

Variability · The quantities in the cells marked “*” above exhibited a lot of variability from run to run, sufficiently so as to make them probably actively misleading to include.

Unknown · The quantities in the cells marked “†” are unknown. These multi-process Erlang runs arrange for the absence of parent/child process relationships, so it’s trickier to determine User and System CPU times.

Notes on Running Nygren’s wfinder7_2 · This is multi-process not multi-thread. The best results are with 32 processes.

Notes on Running Nygren’s wfinder8 · Check out Anders’ notes. The number in parentheses is the value of the +S argument to Erlang, telling it how many schedulers to run. Note that this is an 8-core machine with two integer instruction threads per core and support for eight thread contexts per core. Solaris thinks it sees 64 CPUs. So if you don’t specify +S the default is 64, which seems about right.

Notes on Running the wf-* Python Code · This code is from Fredrik Lundh; see his discussion. I added a number-of-processes command-line parameter to wf-6.py, which appears in parentheses in the table above; so wf-6(8) means running with eight processes. I left the chunk size at 50M for now, this is just (just barely) small enough to run with 16 processes.

Fredrik took a glance and suggested removing wf-3 through wf-5 in the interests of making the table more readable.

The reporting of user and system CPU time was wildly variable. For example, wf-6(8) reported user CPU as low as 5.06 seconds and as high as 31.99 seconds. I suspect that this may be worth bringing to the attention of the Solaris people; the state of the art in tracking CPU usage in this fairly exotic type of machine is still a little shaky, apparently.

Analysis · Are you kidding me!?!? Getouttahere. Maybe someday.

Previous News · [2007/10/30]: I’ve been back from Shanghai for a few days now, but it took till past lunch today to get logged into the mighty T5120 and doing some Wide Finding. It’s a pretty naked Solaris box, and compiling all this stuff has been just a bundle of fun. Not.

[2007/10/31]: Added a Lines-of-Code column to the table. Ran Fredrik Lundh’s multi-process Python, Ilmari Heikkinen’s OCaml, and Russ Beattie’s PHP. There are a few that I’ve tried and failed to run; Erlang code that won’t compile, C and JoCaml code that will compile but not run. In each case I’ve pinged the author, and I may go back and try to see if I can sort things out.

[2007/11/01]: Got Pichi’s patched Erlang to run. 545 lines, wow!. Caoyuan’s too, much more concise, but not as fast.

[2007/11/01]: Did some runs with Nygren’s wfinder8; details here.

[2007/11/04]: Per advice from contributors below, I installed sorttable and it seems to work just fine!.

[2007/11/05]: I ran Caoyuan’s latest; notes here.

Then I ran Mauricio Fernandez’ OCaml and JoCaml; see Aim for the Top! Beating the current #1 Wide Finder log analyzer with the join-calculus. Holy crap! I have to say that the OCaml build/run process is kind of klunky, this isn’t your classic REPL by any means. But you know, this isn’t the first time that I’ve seen OCaml thump the competition on a benchmark; it’s just the first time that it was a benchmark I cared about.

[2007/11/08]: I ran Eric Wong’s Perl code (notes here); kind of disappointing.

Then I ran Anders Nygren’s revised wfinder7_2; he writes: “The difference compared to wfinder8 is that this one does not use the SMP virtual machine. Instead it starts a number of slave erlang nodes.”

Caoyuan revised his tbray9 to produce tbray9a, requesting that I replace the results.

[2007/11/09]: Russ Beattie revised his PHP code, renaming it widefinder3, and indeed, it moved a few places up the table.

Also, based on several requests about this box’s I/O performance, I did a Bonnie run, which is kind of interesting.

I/O · There have been appeals, both in the comments and in email, for a characterization of the test box’s I/O performance. I ran Bonnie with the following results:


     -------Sequential Output-------- ---Sequential Input-- --Random--
     -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec  %CPU
  20  23.9 99.7  41.5 29.8  44.2 61.8  22.3  100 161.8  100 21835 239.4

Given that this stupid thing has 64G of RAM and I could only find 20G of disk space to work with, the results should be taken with a grain of salt (especially the Random Seeks number). Having said that, watching the free-memory readout made it look to me as though when it was doing the sequential runs, it was doing real I/O, not just caching.

Note that when it says “100% CPU”, that means 100% of one of the 64 logical CPUs that Solaris sees on this box. And in fact, it looks to me that doing C I/O on this box pegs one of the cores, whether you’re using getc() or read().



Contributions

Comment feed for ongoing:Comments feed

From: Stuart Langridge (Oct 31 2007, at 01:40)

My sorttable (http://www.kryogenix.org/code/browser/sorttable/) is a JS table sorter. I'm not sure how it'll cope with the 1:44.32 format times, mind...

[link]

From: Etienne Posthumus (Oct 31 2007, at 02:38)

For brain-dead-easy sortable HTML tables, look no further than http://www.kryogenix.org/code/browser/sorttable/

by Stuart Langridge.

It is non-obtrusive Javascript with a tiny footprint. Highly recommended.

[link]

From: Al (Oct 31 2007, at 04:49)

Hi Tim did you try Pichis code at all:

http://pichis-blog.blogspot.com/

It looks very fast, worth checking out.

P.S. This is a great challenge

regrards

Al

[link]

From: Will Fitzgerald (Oct 31 2007, at 06:27)

I've used 'Standardardnista Table Sorting' for table sorting. It works pretty well. You would need to define a special duration sorter, though, since it doesn't come with one standard.

[link]

From: Hadley (Oct 31 2007, at 07:27)

How about standardising on seconds for the measurement of time? It would make it easier to get a quick feel for the magnitude of the results (as would some rounding).

[link]

From: Mahlen Morris (Oct 31 2007, at 07:30)

For wfinder1_1, how is it possible that the Elapsed time is less than the User time? I'm guessing that the numbers got mistyped somewhere?

[link]

From: Anton McConville (Oct 31 2007, at 08:08)

How about using a Google spreadsheet ( docs.google.com ) and sharing for public view? Though not sure if we'd all need Google accounts just to view.

Maybe if you're not too jet-lagged and snowed under, you could also build a nifty results atom feed for the spreadsheet to actively import ;) It's like a swiss army knife user interface.

Thanks for your interesting dispatches from China.

[link]

From: Brett (Oct 31 2007, at 08:17)

What about effbot's wf-6.py?

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

[link]

From: Al (Oct 31 2007, at 09:45)

I Use www.EditGrid.com to share tables

e.g. The NBL chart I did for Steve Yegge's post : http://www.editgrid.com/user/folknology/NBL

[link]

From: Keith Fahlgren (Oct 31 2007, at 09:53)

Yeah, use sorttable as everyone else suggests. It makes all sorts of simple stuff appear heroic to businessy types.

[link]

From: Will (Oct 31 2007, at 10:12)

I'd like to second the request to include effbot's Python work, if it's not too much trouble. That table looks a little lonely without some kind of Python results...

[link]

From: Jeff (Oct 31 2007, at 11:42)

Mahlen: This is the whole point of parallelized computation. It's possible for elapsed - wallclock - time to be less than the user time if the user time is parallelized. Ideally, 35 threads each taking 1s of user time is 35s of user time but still only 1s of elapsed time.

[link]

From: Fredrik (Oct 31 2007, at 12:03)

"how is it possible that the Elapsed time is less than the User time"

If user time is the total time for all CPU:s, that's the expected result for any implementation that actually uses more than one CPU.

That doesn't explain how one of the Erlang scripts manage to accumulate 35 minutes user time in a little over one minute, though.

[link]

From: Anders Nygren (Oct 31 2007, at 14:32)

Fredrik:

"That doesn't explain how one of the Erlang scripts manage to accumulate 35 minutes user time in a little over one minute, though."

Tim wrote about that some time ago on http://www.tbray.org/ongoing/When/200x/2007/10/09/Niagara-2-T2-T5120

quote

"it seems that the hardware doesn’t tell the OS how it’s sharing out the cycles among the the threads that it has runnable at any point in time. So Solaris just credits them with user CPU time whenever they’re in Run state."

[link]

From: Fredrik (Oct 31 2007, at 14:43)

"What about effbot's wf-6.py?"

Both wf-5.py and wf-6.py would be interesting, I think (with wf-2 as a baseline).

Also note that the scripts aren't self-tuning; to get optimal performance, Tim needs to run them with a suitable number of processes (the default is 2; I'd try them with 8, 16, and 64 processes on this machine). And the default chunk size in wf-6.py is 50 megabytes; to use all processes, make sure its lower than (size of file / number of processes). For 64 processes, 4, 8 or possibly 16 megabytes might be good values.

[link]

From: Harry Fuecks (Nov 01 2007, at 02:15)

Another PHP implementation but which processes the file in parallel over HTTP (localhost) using curl - http://www.sitepoint.com/blogs/2007/11/01/wide-finder-in-errr-php

Comparing it running single threaded vs. 4 "threads" (the optimimum on my T60p given filesize, startup costs) typical times look like;

$ time ./wf_reduce.php 1

real 0m0.220s

user 0m0.076s

sys 0m0.136s

$ time ./wf_reduce.php 4

real 0m0.163s

user 0m0.048s

sys 0m0.068s

[link]

From: Pete Kirkham (Nov 01 2007, at 03:28)

I've updated my effort to report the top-10 (link above), along with a pthreads variant. On my box, the single thread C++ version uses around ~11 times less CPU time than the ruby, so should be ~8s, so it does become worth parallelising the CPU parts, as there are numbers in the table less than that.

[link]

From: Pierre Phaneuf (Nov 01 2007, at 10:05)

Oh, while I agree with you that making this concurrent is important for the future, that doesn't change the fact that thing is indeed a quick and dirty command-line job.

That's why GNU's "xargs" has the -P (parallelism!) option, of course. You might want to combine it with -n.

[link]

From: Anders Nygren (Nov 01 2007, at 12:36)

The user and system times for wf-6(2) and wf-6(4) seems strange too.

Going from2 to 4 processes shows a nice improvement in elapsed time, but no change in user and system times.

[link]

From: Anders Nygren (Nov 01 2007, at 14:53)

Please forget my previous comment.

Its an obvious brain-fart.

[link]

From: Bryan Murphy (Nov 01 2007, at 15:13)

You should write up a quick perl script to munge the data, that way you can post it for everybody.

[link]

From: Tim (but not THE Tim) (Nov 01 2007, at 17:40)

Coming from a mainframe perspective, the CPU measurement "problem" has been around a while: the theoretical maximum CPU time you can get with 'N' processors (which I assume roughly corresponds to "cores") is N*(the clock time); multiprocessor effects (spin locks, etc.) usually reduce this somewhat.

I'd be interested to see a post about the I/O subsystem(s) underlying these tests: is it one big hard drive, some sort of RAID array, some sort of parallelized filesystem? Since as someone said this is "I/O bound" the I/O involved makes a difference and I don't remember seeing it documented yet.

As with Chinese, I don't read or speak Erlang, so I couldn't tell if there were tricks done with the file processing. Nothing wrong with those kinds of tricks, I'd just like to know if there were any.

[link]

From: Hynek (Pichi) Vychodil (Nov 02 2007, at 08:48)

I was compared how much each solution used CPU cores:

http://pichis-blog.blogspot.com/2007/11/how-much-cores-are-using-wf-solutions.html

[link]

From: Daniel (Nov 03 2007, at 09:26)

Hi Tim,

Thanks for doing this!

Like many others, I had expected this to be I/O-bound with no noticeable effect of the many cores. So... I'm quite (pleasantly) surprised.

Could you add a benchmark that merely reads the log file, like 'cat' or so, to give us an idea how close to I/O speed we are?

Thanks,

Daniel

[link]

From: Eric Wong (Nov 04 2007, at 03:56)

I don't have an SMP machine, but here's my stab at doing it in Perl with

fork + pipe:

http://normalperson.com/widefinder

I'd be interested in how it scales to multiple cores, you can just

change the 'J' constant at the top to define the number of children to

spawn.

Mapping back is done using tempfiles, so it'll be worth it to put them

on a RAM disk or tmpfs-like system if the temporary set gets big

enough...

[link]

From: Mauricio Fernandez (Nov 04 2007, at 11:46)

I have made a number of OCaml and JoCaml implementations. They can be found at http://eigenclass.org/repos/widefinder/head/

wf-kig is the entry from Ilmari Heikkinen with the dead code removed; you might want to update the line count to reflect the length without all the extra code.

wf uses a sublinear string search algorithm in addition to RE matching (1.3X relative to wf-kig).

wf-block processes the file in 50MB blocks (1.8X relative to wf-kig).

wf-mmap uses mmap and works over three times faster than wf-kig. It is over 25% faster than the fastest single-core implementation in the above table on my box.

wf-mmap-multicore is an extension of wf-mmap implemented with JoCaml. I haven't tested it on a multicore machine yet, but I expect the single-core 20-30% speed advantage to hold with multiple cores until IO dominates.

[link]

From: Scott Robinson (Nov 05 2007, at 20:56)

You list LoC in your statistics. Why not take the next step? LoC / Elapsed Time

http://ship.quadhome.com/posts/wide_finder_models/

[link]

author · Dad
colophon · rights
picture of the day
October 30, 2007
· Technology (90 fragments)
· · Concurrency (75 more)

By .

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!