This is the sixth progress report from the Wide Finder Project, in which I try to paint a picture of what the solution should look like. The project really has two goals. First, to establish whether it’s possible to write code that solves this problem and runs much faster in a many-core environment. Second, to consider how that code should look to the programmer who has to write it.

[In the background, Steve Vinoski and I are both grinding away trying to convince Erlang to do file-reading and line-matching efficiently. Other people are contributing ideas both from Erlang-land and in other languages. We’ll get back to that.]

Here’s the original Ruby Wide Finder program that we’re trying to replicate the semantics of, with a one-character change.

counts = {}
counts.default = 0

ARGF.each_line* do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

See, the each_line method has become each_line* with the addition of a *. It says “Dear Ruby, I don’t care about the order of the invocations of the block here, and there will be no problems if you run many copies of it in parallel. Figure it out and make it go fast, please.”

[I’m not seriously proposing this as a syntax convention, although it’s not entirely implausible on the face of it, given Ruby’s conventions around ? and !. This is just a thought experiment proposing that there be a really simple way for a programmer to mark something as parallelizable.]

Now, this doesn’t quite work, because the counts hash is shared-state, so you’d have to do something about that.
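One plausible way to deal with the shared-state problem (a sketch of what the runtime might do behind the `*`, not a proposal for how it would actually be implemented): give each worker thread its own private counts hash, so the inner loop needs no locking at all, and merge the per-thread hashes once every thread has finished. The `parallel_count` name and the fixed worker count are mine, purely for illustration.

```ruby
WF_PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Split the lines across a few threads; each thread fills a private
# hash (no shared state in the loop), and the results are merged once
# at the end.
def parallel_count(lines, workers = 4)
  slice = (lines.size.to_f / workers).ceil
  slice = 1 if slice < 1
  threads = lines.each_slice(slice).map do |chunk|
    Thread.new do
      local = Hash.new(0)
      chunk.each do |line|
        if (m = WF_PATTERN.match(line))
          local[m[1]] += 1
        end
      end
      local
    end
  end
  # Merge the per-thread hashes into one total.
  threads.map(&:value).inject(Hash.new(0)) do |total, local|
    local.each { |key, n| total[key] += n }
    total
  end
end
```

The merge step is cheap relative to the scan, which is the usual shape of a map/reduce: all the contention is deferred to one small reduction at the end.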

I’m not asking for the detection of concurrency to be automatic, because that just feels too hard; it’s OK to require the programmer to tell the system what parts to parallelize. But it shouldn’t require more than a character or two of syntactic overhead.

[Update: If you’re programming in C++ and have a compiler with OpenMP support, you can do something like this:]

#pragma omp parallel for
for (int i = 0; i < MAX; i++) {
  some_sub(i);
}

void some_sub(int i) {
  parallel_stuff();     /* runs concurrently across iterations */
#pragma omp critical
  do_crit_section();    /* one thread at a time */
}


Contributions


From: Kris (Sep 29 2007, at 15:02)

In Fortran at least, there is OpenMP, which in many cases *is* that simple to parallelize (just a special comment in front of a loop).

[link]

From: Sam Pullara (Sep 29 2007, at 15:19)

Before the Erlang parallel edition was posted I had implemented this in Java; it's basically just a Map/Reduce job. The problem is that if the file isn't in memory and you actually have to do the reads and seek around the hard drive, I found it to be very, very slow. In the case where it was cached, it ran very quickly, in the neighborhood of 6s (vs 11s for Ruby on this box [4cpu] and my data set). Have you ensured in the Erlang case that the data is in fact not cached in system memory?

[link]

From: CB (Sep 29 2007, at 16:03)

I implemented something like that in Java, where each line was submitted as a task to an Executor and the tasks updated a ConcurrentHashMap. It only added about 4 lines to the length. The performance was appalling.

[link]

From: Janne (Sep 29 2007, at 19:45)

I do parallel programming in the context of a vision processing system. My, perhaps naive, reaction to this whole exercise is that the granularity of this application is too fine. Just doing one grep and updating a counter is a quick operation. No matter how you set up your threads or otherwise parallelize it, I worry that the setup, synchronization, context switches, teardown and so on will completely swamp the actual work being done.

I like the "*" idea. But running this on a per-line basis feels like at least an order of magnitude too small a work unit.
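Janne's granularity point can be made concrete: instead of one task per line, hand each worker a large slice of lines through a queue, so the per-task overhead is paid once per batch rather than once per line. A minimal Ruby sketch; the `batched_map` name, chunk size, and worker count are arbitrary illustrations, and results come back in whatever order the batches finish.

```ruby
# Feed chunks of items (not single items) through a queue to a small
# pool of worker threads, amortizing queue/thread overhead over each
# whole chunk.
def batched_map(items, chunk_size: 1000, workers: 4)
  work = Queue.new
  items.each_slice(chunk_size) { |chunk| work << chunk }
  workers.times { work << :done }  # one stop sentinel per worker
  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      while (chunk = work.pop) != :done
        results << chunk.map { |item| yield item }
      end
    end
  end
  threads.each(&:join)
  out = []
  out.concat(results.pop) until results.empty?
  out
end
```

Because the queue is FIFO, all the real chunks are consumed before any worker sees its sentinel, so no work is dropped.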

[link]

From: noyb (Sep 29 2007, at 23:29)

This is a great thread. Thanks.

It underscores the problem that there are subtleties in this multi-threaded world which are not simple, nor intuitive to resolve.

No matter how folks try to make this look easy, this problem is hard.

Keep up the great work! I've been following with intense interest to see you and others succeed in taking advantage of the multi-core CPU's for a great Wide Finder.

Thanks again.

[link]

From: Joe Cheng [MSFT] (Sep 30 2007, at 01:10)

Don't you want the sort to be parallelized too?

BTW, in case you haven't seen this yet:

http://msdn.microsoft.com/msdnmag/issues/07/10/Futures/default.aspx

[link]

From: Danny (Sep 30 2007, at 01:33)

Ok, this is a great way of kicking the tyres of the languages, but every time I've read one of your posts in this series I've felt a little uncomfortable. Performing all this work for a single query seems to fly in the face of sanity, not least because the data appears in real time and not especially rapidly, so indexing could also take place in real time.

I'd been thinking in traditional database terms, but it only just occurred to me that the underlying problem is to efficiently join a single-process system (the disk reader) to a multi-process system (the regex/count/sort stuff). The database-style approach would only be one possible option. It certainly shouldn't be necessary to go as far as having log writes appearing to multiple processes.

I dunno, I suspect the optimum solution that still maintained the facilities that a text log file allow would be to have the server write to a virtualised file system that allowed parallel, read access (not necessarily random) as well as regular document-style linear retrieval. Just swap /var/log for /new_fs/var/log.

[link]

From: Anthony (Sep 30 2007, at 07:05)

The problem is that it doesn't work (as you pointed out) when you try to parallelize at this fine a level. You end up with the same problems of locking and synchronization that have plagued parallel programming in these languages for years. You can't make that disappear with a bit of syntactic sugar.

[link]

From: Davi Pires (Sep 30 2007, at 09:01)

Indeed, it seems to me that it's impossible for a compiler/interpreter to figure out all the resources being used inside the loop (in this case, the file and the counts hash) and turn them into "parallel-accessible" stuff. Although it'd be really cool if we could. I'm not familiar with Erlang, so I'm glad we're back in Ruby-land. I'll keep following the thread with great interest.

[link]

From: Steve Tooke (Oct 01 2007, at 01:50)

For ruby, you could try using StarFish (http://tech.rufy.com/2006/08/mapreduce-for-ruby-ridiculously-easy.html) to achieve something like this very easily.

[link]

From: Tony Fisk (Oct 03 2007, at 20:38)

Something like '//' would indicate parallelism better (except it's been taken in most languages!)

I suppose that, whenever a routine that invokes parallelism is entered, the language interpreter could wrap every variable it accesses in a locking resource (similar to a shared pointer), and then have the main thread handle the parallel sections, firing off child threads to handle each block.

Performance then depends on how threads are handled on a multi-core engine.

It does start to feel a little heavy duty for an interpreter, however.
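Tony's "wrap each shared variable in a locking resource" idea can be sketched directly in Ruby: put the shared hash behind a Mutex so every access is serialized. It's correct, and it also shows the cost he's worried about, since every increment from every thread contends on the same lock. The class and method names here are mine, purely illustrative.

```ruby
# A shared counts hash guarded by a Mutex, in the spirit of wrapping
# a shared variable in a locking resource. Correct, but every
# increment serializes on the one lock.
class LockedCounts
  def initialize
    @lock = Mutex.new
    @counts = Hash.new(0)
  end

  def increment(key)
    @lock.synchronize { @counts[key] += 1 }
  end

  def snapshot
    @lock.synchronize { @counts.dup }
  end
end
```

Compare this with the merge-at-the-end approach: here the synchronization cost is paid on every single update, which is exactly the fine-granularity trap Janne described above.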

[link]

From: Bruno (Oct 04 2007, at 00:30)

Actually, I knew this all along, but your "update" reminded me.

The implementation of concurrency patterns proposed here allows you to parallelize Java code by introducing very similar annotations:

http://ctp.di.fct.unl.pt/~mpm/Cunha-AOSD06.pdf

Off the top of my head, I think your example would look like this:

for (int i = 0; i < MAX; i++) {
  some_sub(i);
}

@OneWay void some_sub(int i) {
  parallel_stuff();
  do_crit_section();
}

@Synchronized void do_crit_section() {
  sensitive_stuff();
}

The implementation is based on AspectJ aspects, which are in turn based on bytecode manipulation, to insert the thread-spawning/synchronizing code. This is nice and clean, and it'll work for your example, but are threads the right concurrency model for the new era? Also see the concurrency section of this post:

http://www.artima.com/weblogs/viewpost.jsp?thread=214112

[link]

From: Toby (Oct 04 2007, at 05:34)

Janne> I like the "*" idea. But running this on a per-line basis feels like at least an order of magnitude too small a work unit.

It seems to me though, that what we want from an automatically scaling-out language is to worry much less about what size a work unit is.

Sure, that's a very small work unit, so I want the interpreter to take a look at it, and think

"Hmm, that's very small, no point in spinning out a thread for every single one of these - I'll spend more time thread-handling than working. I'll just batch them all up together and do them in one workload."

or better, "well, I know how many cores are available, and I know what my inter-process communications latency is, and I know the cost of building and tearing down a thread context, so probably this will do best if I do these in batches of about 10000 per thread".

Same principle as single-threaded compilers, really. The computer knows its own idiosyncrasies much better than I do, so it is in a better position to do micro-optimizations.

[link]

From: Hynek (Pichi) Vychodil (Oct 06 2007, at 14:41)

An Erlang solution faster than Caoyuan's, for your collection.

http://pichis-blog.blogspot.com/2007/10/binaries-realy-faster-than-lists.html

[link]

From: Kim (Oct 07 2007, at 04:31)

A fast Python solution:

http://mtrr.org/blog/?p=91

in response to:

http://effbot.org/zone/wide-finder.htm

[link]

From: Fredrik (Oct 07 2007, at 10:56)

Kim, you did notice that the "fast" solution is a bit slower than the one it's a response to, didn't you? ;-)

[link]

September 29, 2007


I am an employee of Amazon.com, but the opinions expressed here are my own, and no other party necessarily agrees with them.

A full disclosure of my professional interests is on the author page.