This is the twelfth progress report from the Wide Finder Project. It exists to host the excellent discussion from others so far; see the comments.

My Notes So Far · Check out the results, which I think are remarkable; JoCaml, a concurrent variant of a language often regarded as a CS-research toy, totally wiped the floor with all the other contenders. Near the top is a thread-oblivious implementation in Python, a language with no particular concern for concurrency. What do you think? If your thoughts are material, write ’em in your own space and link from the comments. I do have one specific question, which I hope to address at length: Which of these programs is the best?

Now let’s turn to others’ contributions; they are definitely worth reading:



Contributions


From: John Wilson (Nov 06 2007, at 10:52)

The three things you seem to need to do to get this to run fast are:

1/ keep the data as US-ASCII

2/ use Boyer-Moore, not regexps

3/ read really big chunks of the data in one go.

If you do that you can trivially parallelise it (you have a minor issue with lines crossing block boundaries, but it's not hard to solve; a rough sketch follows below).
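As a rough illustration (a sketch only, not one of the submitted programs; the log file name, worker count, and key extraction are simplified placeholders), the three points plus the block-boundary fix might look like this in Python, using the multiprocessing module:

    import os
    from multiprocessing import Pool

    NEEDLE = b"GET /ongoing/When/"      # stay in bytes: no transcoding (point 1)

    def scan_range(args):
        """Tally hits in the byte range [start, end); a worker finishes any line
        it is in the middle of at `end`, and the next worker skips that partial line."""
        path, start, end = args
        counts = {}
        with open(path, "rb") as f:
            f.seek(start)
            if start:
                f.readline()             # partial first line belongs to the previous worker
            while f.tell() <= end:
                line = f.readline()
                if not line:
                    break
                hit = line.find(NEEDLE)  # built-in substring search, no regexp (point 2)
                if hit >= 0:
                    key = line[hit + len(NEEDLE):].split(b" ", 1)[0]
                    counts[key] = counts.get(key, 0) + 1
        return counts

    def wide_find(path, workers=8):
        size = os.path.getsize(path)
        step = size // workers           # one big contiguous chunk per worker (point 3)
        ranges = [(path, i * step, (i + 1) * step if i < workers - 1 else size)
                  for i in range(workers)]
        totals = {}
        with Pool(workers) as pool:
            for partial in pool.map(scan_range, ranges):
                for key, n in partial.items():
                    totals[key] = totals.get(key, 0) + n
        return totals

    if __name__ == "__main__":
        print(wide_find("access.log", workers=8))

Splitting at byte offsets and letting each worker skip its first partial line, while its neighbour reads past its own end offset to finish that line, is one way to resolve the boundary issue.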

One question about the Python solution, which appears to use multiple processes rather than multiple threads. My understanding is that on Sun's multi-core boxes one core can only run threads from a single process (i.e. they must share the same virtual memory mapping data). I did some tests on a box which Sun was kind enough to lend me and it did seem to support this assertion. Do you know if this is, in fact, true? If so then the Python solution will not be able to use all the processing power of the system and will block access to processor cycles for other multi-threaded programs. I'd be most grateful if you could get a definitive answer to this question.

[link]

From: Anders Nygren (Nov 06 2007, at 11:42)

John Wilson wrote

"3/ read really big chunks of the data in one go."

I am not sure about this point. At least on Linux, both Steve Vinoski and I found that reading ~200KB chunks was faster.

[link]

From: CB (Nov 06 2007, at 13:24)

I think your results are not really representative of what you would get in the real world. At best you are reading 500MB per second, so presumably the disk cache is warmed up. In real-world use I don't think that would be the case.

As John says, most of the fast results use custom string matching (Python has the advantage of having that built in). That makes me doubt how quickly you could re-purpose some of these solutions for a similar task.

I would also really like to see how single-thread/process versions perform. It could be that the multicore stuff was mostly irrelevant and the fastest I/O library won.

I doubt the Java version will come close to JoCaml, but for reasons that are unrelated to concurrency. It uses full regexps rather than a fast string match (and the regexp library is written in Java, not C). It uses stream-based I/O instead of memory mapping. And it uses Unicode strings (which have to be converted from the ASCII) instead of simple bytes. So I'm not sure where that leaves the question of multicore scaling.

That said, I think you have killed the Erlang hype stone dead, and we will be hearing more about JoCaml in the near future.

[link]

From: Erik Engbrecht (Nov 06 2007, at 14:33)

I still owe Tim a Scala version. I've got what I think is a deadlock problem caused by the way I'm using the Actors library. I may just switch to plain old worker threads.

I would say one of the conclusions is that abstraction costs are real. In Java, transcoding costs from bytes to chars are huge. Using direct or memory-mapped buffers helps I/O performance but then hurts transcoding performance, because all the calls to "byteAt" cost more than just letting NIO copy the buffer into normal heap space so that array operations can be used. Splitting the input into lines costs a lot, especially doing it separately from transcoding.
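A rough Python analogue of that transcoding layer (a sketch only, not the Java code under discussion; "access.log" is a placeholder and the pattern is roughly the Wide Finder regexp) is the same scan run once directly on bytes and once with a per-line decode step:

    import re, time

    BYTES_PAT = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
    TEXT_PAT  = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

    def scan(path, decode):
        hits = 0
        with open(path, "rb") as f:
            for raw in f:
                if decode:
                    # Extra layer: transcode every line to text before matching.
                    hits += TEXT_PAT.search(raw.decode("ascii", "replace")) is not None
                else:
                    # Match directly on the bytes read from disk.
                    hits += BYTES_PAT.search(raw) is not None
        return hits

    if __name__ == "__main__":
        for decode in (False, True):
            t0 = time.time()
            n = scan("access.log", decode)
            print("decode=%-5s hits=%d elapsed=%.2fs" % (decode, n, time.time() - t0))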

So all these layers cost in terms of performance, because you're scanning and copying something several times when it could be simply transferred from disk to memory and then processed.

But then maybe not (I'll know better when I'm done and Tim can run my code). Doing it in three layers costs more in total, but those tasks can all be done in a pipeline by separate threads. So on a sufficiently parallel system it theoretically could be faster to split up the tasks.

And then how many of the solutions would break down if the file was bigger than 2^32 bytes? If the box only had 1 or 2 cores? If the solution had to be generalized for any line-oriented file processing function the user/programmer can dream up? If the box was shared with production applications and so the program had to be resource friendly - i.e. share the CPU(s) and not start paging the RAM, possibly fitting within minimal memory bounds?

More questions than answers, so I hope this topic doesn't go away as internet ADD sets in.

[link]

From: Thomas Lindgren (Nov 06 2007, at 14:33)

The JoCaml results are impressive, all the more so given that it beats the best OCaml solution so comprehensively, even considering user time. I would have guessed OCaml to perform better in this respect.

A couple of the solutions manage to reduce the user time from about 10+ seconds to 2-3 seconds. (Don't forget Python, user CPU 3.6 seconds, even if the timings seem very variable?) Where does this come from? Seems like a nice source of improvement for the rest of the pack.

[link]

From: Anders Nygren (Nov 06 2007, at 19:04)

Thomas wrote

"Don't forget Python, user cpu 3.6 seconds, even if the timings seem very variable?"

I think that is a measurement error. On my 1.66GHz dual-core laptop wf-6.py reports user 0.048s and sys 0.008s. Since it spawns extra processes that do the work, I think that time is not reported.

I have an Erlang version that does not use the SMP VM, but instead starts slave nodes. The real time is a little longer than for wfinder8, but time only reports user 0.208s and sys 0.048s. So I am pretty sure that it is a problem with the time command.
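For what it's worth, a small Python sketch (not from wf-6.py; the worker pool and busy-loop are just placeholders) shows where child-process CPU time ends up: getrusage(RUSAGE_CHILDREN) only counts children that have already been waited for, so a measurement that looks only at the parent's own usage reports almost nothing.

    import resource
    from multiprocessing import Pool

    def burn(n):
        # Stand-in for real work done in a child process.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        pool = Pool(4)
        pool.map(burn, [2000000] * 4)
        pool.close()
        pool.join()   # children must be reaped before their CPU time becomes visible
        own = resource.getrusage(resource.RUSAGE_SELF)
        kids = resource.getrusage(resource.RUSAGE_CHILDREN)
        print("parent:   user %.3fs sys %.3fs" % (own.ru_utime, own.ru_stime))
        print("children: user %.3fs sys %.3fs" % (kids.ru_utime, kids.ru_stime))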

Tim hasn't tested that one yet, but I hope it will beat wfinder8, since with the Erlang SMP VM the user+sys times increase by ~0.6s for each extra scheduler thread.

[link]

From: Andrew Dalke (Nov 07 2007, at 01:56)

"read really big chunks of the data in one go"

In my single-threaded Python version (at http://www.dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html ) I found that reading 14K block chunks gave the fastest performance: 0.9s for my test set, vs. about 1.2 seconds for "really big chunks." I also found that reading chunks was faster than using an mmap'ed file.

My code is in Python, based on Fredrik Lundh's code here. The single-threaded version is about 30% faster (1 second instead of 1.5 seconds) than his wf-6 with one process. That suggests it should be possible to get the time reported here down to about 3.0 seconds, though I can go faster if I use a third-party extension.
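For comparison, a minimal single-threaded sketch of the block-reading loop (not Andrew's code; the needle and the plain hit count are simplifications), with the block size left as the knob to experiment with:

    NEEDLE = b"GET /ongoing/When/"

    def count_hits(path, blocksize=16 * 1024):
        """Read fixed-size blocks, carrying the trailing partial line over to
        the next block, and count matches with the built-in bytes.count."""
        hits = 0
        leftover = b""
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                block = leftover + block
                cut = block.rfind(b"\n") + 1          # keep only complete lines
                block, leftover = block[:cut], block[cut:]
                hits += block.count(NEEDLE)
        return hits + leftover.count(NEEDLE)          # a final line with no newline

Timing this loop with blocksize values from a few KB up to several MB is an easy way to reproduce the 14K-versus-big-chunks comparison on a given machine.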

[link]

From: Fredrik (Nov 07 2007, at 09:02)

"I also found that reading chunks was faster than using a mmap'ed file"

From what I can tell (without actually running the code on a Mac), that's because memory mapping is really slow on Mac OS X, compared to all other platforms.

Would still be interesting to see wf-5.py and wf-6.py figures for more than 16 processes, though.

[link]

From: Mauricio Fernandez (Nov 07 2007, at 12:32)

I've realized that wf-mmap-multicore might in fact be considerably faster than reported previously. On my 2-core machine, spawning more than 2 processes makes it 25% slower than with 2. By default, wf-mmap-multicore uses 16 processes, and the T5120 has got only 8 cores... (The number of processes can now be specified with the optional 2nd argument.)

I have also modified the fastest single-core and multi-core programs to operate without regexps (similarly to wfinder8). wf-mmap-nore and wf-mmap-multicore-nore might be up to 40% faster than the originals. See http://eigenclass.org/repos/widefinder/head/.

Thomas: the user/sys times are totally bogus. As expected, wf-mmap-multicore is about 8 times faster than wf-mmap because it uses 8 cores instead of one. That's all there is to it; don't draw any conclusions from the user/sys time. It's mostly the same code, both source and binary (JoCaml is binary-compatible with OCaml). In fact, I could put the common code in a separate file and compile it once with ocamlopt for both versions. The performance wouldn't differ at all; it's the same code generator.

Tim: It's too late to ask "Which of these programs is the best?", or, if you want, too early, if you intend to host some new, vaguely defined competition. All programs were written with performance in mind and other considerations were secondary. If there are other goals, they should have been made explicit from the beginning, otherwise it will be hard to reach any meaningful conclusion based on performance-oriented code.

Even worse, people might have been aiming for different qualities in addition to speed. For instance, my code includes a reusable module to operate on mmap'ed areas that includes things I didn't need for this specific task. This makes the program longer than necessary, but a part of it can be reused now. What's better, code that can be reused or low line count? Same goes for performance: I can simplify wf-mmap-multicore and get something that will still be faster than the current second entry while fitting in maybe 60 or 70 lines. What's better, shorter code or speed? How fast is fast enough, how much code is too much? The general point applies to all entries, not only mine, of course.

Finally, we don't know anything about the effort invested in each solution. AFAIK wf-6.py didn't take a long time. As for myself, benchmarking properly and documenting all my solutions (plus writing long comments such as this one :) took much longer than coding them (I wrote them last Sunday). OTOH, it seems the top Erlang implementations have gone through many expert hands for over one month. This illustrates that low line counts don't necessarily map to less programmer time.

I'll probably write some more on wide finder, scaling-up vs. scaling-out, JoCaml, Erlang and stuff on my site ( http://eigenclass.org ).

[link]

From: Bob Miller (Nov 07 2007, at 17:08)

I know I'm late to the party, but here's another WF implementation.

On my desktop PC, I can scan a 900 MB log in 0.70 seconds. That's faster than any of the results in the table...

I have every reason to believe that the algorithm will scale through 32 CPUs, and possibly higher, depending on the T5120's memory bandwidth.

Read all about it here.

http://www.lug.corvallis.or.us/drupal/node/102

[link]

From: Adrian Thurston (Nov 07 2007, at 17:28)

Hi, I did a wide finder using Ragel. I don't have a multi-core machine to test it on, but on my single-core machine it goes about 10x faster than the report-counts.rb program.

The source program is the .rl file. The Ragel-generated .cpp file is included.

http://groups.google.com/group/ragel-users/browse_thread/thread/45c5b1448771df7

[link]

From: Paddy3118 (Nov 07 2007, at 23:57)

I found Fredrik Lundh's blog entry on Wide Finder very informative. That guy is good, so it is hard to work out the contribution of his coding language, Python, but I'll have a go ;-)

Python allowed for the timely exploration of multiple solutions and, in this case, shows that prototyping in Python, a dynamic, interpreted language, can allow you to ship the prototype. Python's readability allowed the contributor of the only faster JoCaml implementation to copy the techniques used in Python.

- Paddy.

(P.S. I'll be making time to test a parallel gawk solution soon).

[link]

From: Mauricio Fernandez (Nov 08 2007, at 03:46)

Paddy: >> "Python's readability allowed the contributor of the only faster JoCaml implementation to copy the techniques used in Python."

That's not really the case; I didn't refer to the Python code when coding my solutions.

Several techniques just don't apply at all ("compiling the RE", "skipping lines that cannot match", "reading files in binary mode"). As for those that do, block processing and using mmap are obvious (it was clear to me that they'd be needed before I found that blogpost), and the one key insight, using sublinear string matching, wasn't conveyed by the Python code, but rather by the article itself (I had implemented the Boyer-Moore algorithm before so I just needed to see two words, "superlinear search", to get the point). Distributed processing is done in a totally different way.

That said, Python indeed allowed me to create a prototype that yields maybe 20% of the maximum attainable performance (with low-level, hand-tuned, very specialized code). JoCaml allowed me to create a prototype that is maybe within 75% of the maximum performance, and it doesn't look like it took much more effort. The Python implementations used mmap and sublinear string-search routines from the standard library; I reused the Boyer-Moore implementation I had handy. The application-specific code that analyzes the logs and controls the processes is about the same size.
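For readers who haven't met it, here is a minimal Boyer-Moore-Horspool sketch in Python (the simplified Boyer-Moore variant; illustrative only, and not Mauricio's OCaml code):

    def horspool_find(haystack: bytes, needle: bytes, start: int = 0) -> int:
        """Return the offset of the first occurrence of needle at or after start,
        or -1 if absent. On each attempt, skip ahead by a precomputed distance
        keyed on the last byte of the current window."""
        m, n = len(needle), len(haystack)
        if m == 0 or n - start < m:
            return -1
        skip = [m] * 256                    # default shift: the whole needle length
        for i in range(m - 1):
            skip[needle[i]] = m - 1 - i     # distance from this byte to the needle's end
        i = start
        while i <= n - m:
            if haystack[i:i + m] == needle:
                return i
            i += skip[haystack[i + m - 1]]
        return -1

The sublinear behaviour comes from the skip table: most attempts advance by nearly len(needle) bytes without examining them. CPython's built-in substring search is itself a tuned variant of this idea, which is the "built in" advantage mentioned earlier in the thread.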

[link]

From: Paddy3118 (Nov 08 2007, at 11:01)

Sorry Mauricio,

There was no slight intended. I read too much into your blog post.

I do note the absence of C++/C/Java solutions in Tim's table, but then this seems to me to be the kind of problem people usually write interpreted scripts to solve.

- Paddy.

[link]

From: Mauricio Fernandez (Nov 08 2007, at 11:20)

s/within 75% of the maximum performance/around 75% of the maximum performance/ in my previous post

[link]

From: Caoyuan (Nov 09 2007, at 03:27)

How about adding a (User+System)/Elapsed column?

Per the result of tbray9a.erl:

Schedulers  Elapsed(s)  User(s)  System(s)  (User+System)/Elapsed
        1       37.58    35.51      7.82                   1.15
        2       20.14    35.31      8.28                   2.16
        4       11.81    35.37      8.25                   3.69
        8       07.63    35.28      8.33                   5.72
       16       05.60    36.08      8.27                   7.92
       32       05.29    36.64      8.11                   8.46
       64       05.45    36.79      8.23                   8.26
      128       05.26    36.75      8.39                   8.58

It seems the T5120 is a machine with 8 cores for parallel computing and 64 hardware threads for scheduling. An interesting point is that as the number of schedulers increased, the elapsed time dropped while the user and system times stayed almost constant. This may be because I separated the parallelized and sequential parts completely in my code.

On my 4-core Intel Linux box, the ratio of the slowest time (1 scheduler) to the fastest time (128 schedulers) is 6.627/3.763 = 1.76. On this T5120, it is 37.58/5.26 = 7.14. So, with its 8-core, 64-thread combination, the T5120 is pretty good at parallel computing.

[link]

From: Adrian Thurston (Nov 09 2007, at 12:05)

Hi, the Ragel version I posted has an error in it that is only revealed when the letter 'G' appears ahead of a GET. I sent a fixed version to the mailing list.

http://groups.google.com/group/ragel-users/browse_thread/thread/45c5b1448771df7

Adrian

[link]
