[This is part of the Wide Finder 2 series.] This should be the final entry, after a couple of years of silence. The results can be read here, for as long as that address keeps working. I’m glad I launched that project, and there is follow-on news, taking effect today in fact.

Conclusions · This was a lot of work to demonstrate two simple findings that most of us already believed:

  1. It is possible to achieve remarkable throughput on highly parallel hardware, even for boring old-style I/O-heavy batch-processing problems.

  2. It remains unacceptably hard to achieve such performance. Whether you measure it by the number of lines of code, the obscurity of the languages and libraries you have to learn, or the number of bugs you have to fight, it’s still too difficult to write concurrent application code.

Smiling · If you see Wide Finder 2 as a horse race, there’s a clear winner: Dmitry Vyukov. More than one software-savvy geezer, when I described the kind of throughput he was squeezing out of that very modest SPARC box, looked incredulous. His code pulled some useful statistics out of the test data in less elapsed time than a lot of people thought it would take that processor to read it off those disks.

Mind you, he’s also won a couple of parallel-programming contests put on by Intel, and he’s running a lovely series of writeups on the issues over at 1024Cores, including (ahem) Wide Finder 2, from which I quote: “The program contains 17 times more LOC than the initial Ruby program, and is 479 times faster.”

When Dmitry wrote me about that post, he also mentioned that he was interviewing with Google. I immediately tracked down the recruiter and told her not to let this one go. I get the impression that they’d already noticed, and Dmitry is starting work at Google today.

I’m pretty sure he’s not going to be able to talk much about what he works on, so I hope he keeps 1024Cores going; he’s got plenty to tell the world about how to keep getting us the most out of Moore’s astonishing law.



Contributions


From: John Hart (Feb 07 2011, at 10:24)

Thanks for the round-up, Tim.

There was a fair bit of confusion at the start of the competition, with people saying things like "we're somehow reading data off the disk faster than the disk can read data." And by "people" I mean very smart people like Tim Bray (see Wide Finder XV w/r/t JoCaml).

But it ain't so.

As Dmitry states:

"... however, in reality there is a crucial difference between processing of the cached and uncached file parts. Average processing speed is 235MB/s (while disks are able to provide only 150MB/s, once again it's due to file caching)."

His (very smart) algorithm lets the CPU fly on the file-system-cached data without waiting on the reads of the non-cached files.

And that is what lets him win. Any result that gets a time below ~ 5:00 is similarly optimizing use of cached data.

I also think that's what contributes to your conclusion that "it remains unacceptably hard to achieve such performance."

From a truly cold start, you might see the top contenders squash into a narrower performance band. And then all of a sudden the easier-to-write programs don't look so bad anymore.


February 07, 2011