ongoing by Tim Bray · Concur.next

References, “refs” for short, are one of the three tools offered by Clojure to make concurrency practical and manageable. Herewith a walk-through of code that uses them to accomplish a simple task in a highly concurrent fashion.

[This is part of the Concur.next series.]

[Update: There are a ton of comments; I’ve added some reactions at the bottom.]

The Problem · It’s one that’s been much-discussed in this space: reading all the lines in a large file and computing some simple statistics. It was the discovery that this sort of script ran faster on my Mac laptop than on a pricey industrial-grade Sun server that originally motivated the Wide Finder work.

What I would really like to come out of all this work is a set of tools, accessible to ordinary programmers, that help them accomplish everyday tasks with code that makes good use of many of the compute engines on a modern multi-core processor.

I keep coming back to this “read lines and compute stats” task because it’s as simple as you can imagine, it occurs in the real world, and if I can’t manage that one maybe that’s a signal that the whole project is doomed.

If you want an application to process lines out of a file concurrently, you need, first, to parallelize the process of reading the file, and second, to parallelize the application code.

Paralines · This is Clojure code I wrote to parallelize the line-reading part of the problem. I’m not sure what the right name is for a splodge of Clojure code: “Module”? “Package”? Whatever; let’s ignore Paralines for the purposes of this essay; I’ll get back to it later. It offers a function called read-lines; here’s its doc string:

([file-name destination all-done user-data params]) Reads the named file and feeds the lines one at a time to the function 'destination', which will be called in parallel on many threads with two arguments, the first being a line from the file, the second 'user-data'. When all the lines have been processed, the 'all-done' function will be called with 'user-data' as its single argument. The 'destination' function will be called on multiple threads in parallel, so its logic must be thread-safe. There is no guarantee in which order the lines from the file will be processed. The 'params' argument is a map; the default values for the number of threads and file block-size may be overridden with the :width and :block-size values respectively.
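For readers who want to run the code in this piece without Paralines, here is a minimal stand-in with the same argument list. It is a sketch built on assumptions, not the Paralines code: the name read-lines-sketch, the default width of 4, and the 1000-line batches are all invented for illustration, and :block-size is ignored. It also queues every batch in memory, so it is only suitable for small test files.

```clojure
(import '(java.util.concurrent Executors TimeUnit))
(require '[clojure.java.io :as io])

;; Hypothetical stand-in for read-lines; NOT the Paralines implementation.
(defn read-lines-sketch
  [file-name destination all-done user-data params]
  (let [width (get params :width 4)                 ; assumed default thread count
        pool  (Executors/newFixedThreadPool (int width))]
    (with-open [rdr (io/reader file-name)]
      ;; Hand the lines to the pool in batches of 1000 (an arbitrary size).
      (doseq [chunk (partition-all 1000 (line-seq rdr))]
        (let [lines (vec chunk)                     ; realize before the reader closes
              ^Runnable task (fn []
                               (doseq [line lines]
                                 (destination line user-data)))]
          (.execute pool task))))
    ;; Wait for every queued batch to finish, then report.
    (.shutdown pool)
    (.awaitTermination pool 1 TimeUnit/HOURS)
    (all-done user-data)))
```

With this in place, the call shown later in the piece could be tried as (read-lines-sketch "data/O.1m" proc-line report (ref {}) {}).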

Popular · This is the little Clojure script I wrote to solve the original Wide Finder problem: read an Apache logfile from this blog and find the ten most popular pieces; for a refresher, see the original Ruby script. It calls Paralines like so:

(read-lines "data/O.1m" proc-line report (ref {}) {})

In this invocation data/O.1m is the one-million-line sample of logfile data. proc-line is the function to which each line will be passed. report is the function which will be used to report on the results of the run.

The second-last argument, (ref {}), is more interesting. This will be passed as the second argument to each invocation of proc-line and as the sole argument to the final report invocation. It’s meant to be used to hold the program’s state as it does its computations.

{} is the Clojure idiom for an empty map (think hash table). The ref function produces a reference, a thing that (unlike most things in Clojure) can be changed, and in particular can be changed safely in a concurrent environment. The notion of using a hash table to build up state in this kind of work should hardly be surprising to anyone who’s ever written a Perl script.
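A few lines at the REPL make the idea concrete. This is a minimal sketch; the name state and the key "some/uri" are made up for the example.

```clojure
;; A ref wrapping an empty map, like the (ref {}) passed to read-lines.
(def state (ref {}))

;; @state is shorthand for (deref state): it reads the current value.
(prn @state)                                  ; {}

;; Changes to a ref must happen inside a dosync transaction;
;; calling alter outside one throws an IllegalStateException.
(dosync (alter state assoc "some/uri" 1))

(prn @state)                                  ; {"some/uri" 1}
```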

And we’re going to use it just as we would in a Perl script; the hash will be keyed by the URIs of ongoing pieces, and the values will just be integers, how many times we’ve seen each one so far in the logfile.

The final argument, {}, is an empty hash table which tells Paralines to use the default block-size and thread-count.

proc-line · Paralines calls this function once for each line, the first argument being the line as a string, the second being that hash-table-reference that we set up at the beginning.

1 (def re #"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
2
3 (defn proc-line [ line so-far ]
4   (let [hit (re-find re line)
5         target (if hit (last hit) nil)]
6     (if target (record target so-far))))

Line 1: In Clojure a string with # in front is a regular expression. I find it pleasing that I can use pretty well the same regexes in Perl, Ruby, Java, and now anything built on the Java platform.

Line 3: The ref to the hash where the results live will be called so-far from here on in.

Line 4: re-find uses the java.util.regex library to scan the string.

Line 5: If re-find misses, it returns nil. If it hits, it returns a vector whose first member is the whole match, followed by $1, $2, and so on. Since I only have one ()-group, it’ll always be of size two and I can use last to fish out $1.
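To see what re-find hands back, here is a small sketch using the same regex on two request fragments. Both sample strings are made up for illustration; real logfile lines carry more fields around them.

```clojure
(def re #"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

;; Invented request fragments in the shape of an Apache log entry.
(def article "GET /ongoing/When/200x/2009/11/12/Some-Piece HTTP/1.1")
(def image   "GET /ongoing/When/200x/2009/11/12/picture.png HTTP/1.1")

;; A hit: the whole match first, then the single ()-group.
(prn (re-find re article))
;; ["GET /ongoing/When/200x/2009/11/12/Some-Piece " "2009/11/12/Some-Piece"]

(prn (last (re-find re article)))   ; "2009/11/12/Some-Piece"

;; A miss: [^ .]+ stops at the dot, so no space follows and nothing matches.
(prn (re-find re image))            ; nil
```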

Line 6: Assuming the regex hit, we call record with the interesting part of the URI and the so-far hash ref.

record · Here is where we start to use Clojure references.

1 (defn record [target so-far]
2   (let [ counter (@so-far target) ]
3     (if counter
4       (incr counter)
5       (incr (new-counter so-far target)))))

Line 2: Remember that so-far isn’t the results hash table, it’s a ref. The @ means “reach through the ref and extract the value”, apparently ignoring concurrency issues.

In Clojure, a hash is just a function; so (@so-far target) says “reach through the ref, get the hash table, and look up the target in it”.
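A quick illustration of calling a map as a function, directly and through a ref. The names hits and table are hypothetical, and the values are plain integers here to keep the example short.

```clojure
(def hits {"2009/11/12/Some-Piece" 3})

;; Calling the map with a key looks the key up.
(prn (hits "2009/11/12/Some-Piece"))     ; 3
(prn (hits "no/such/piece"))             ; nil

;; The same lookup through a ref: dereference first, then call.
(def table (ref hits))
(prn (@table "2009/11/12/Some-Piece"))   ; 3
```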

Lines 3-4: If counter is non-nil, that means we’ve already seen this target and have started counting references to it. So we just use the incr function to increment that counter.

Line 5: We didn’t find a counter for this target, so we’re going to have to use new-counter to add one to the hash table; new-counter also returns the counter it just added, so we can increment it.

incr · What could be simpler than adding one to a counter? Well, except for, this is running in (a potentially large number of) parallel threads, more than one of which might be wanting to increment this counter at the same time.

Above, we said that the values in the hash table were the occurrence counts, but that wasn’t quite accurate. They’re actually Clojure refs to the occurrence counts, so we can run in parallel.

1 (defn incr [ counter ]
2   (dosync (alter counter inc)))

All the magic is in Line 2. First, since we’re going to update a reference, we have to call dosync to launch a “transaction”, the STM voodoo that Clojure uses to make this possible and safe.

Inside the dosync, we use alter to send the inc function to the actual integer with the count. Since Clojure always uses immutable values as opposed to traditional variables, what actually happens is it creates a new integer and rejiggers the counter ref to point to that.
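One way to convince yourself that incr is safe is to hammer a single counter from several threads at once. The sketch below is an assumption-laden demo rather than part of the script: it uses Clojure futures as the threads, and the figures of 8 workers and 1000 increments each are arbitrary.

```clojure
(defn incr [counter]
  (dosync (alter counter inc)))

(def counter (ref 0))

;; Start 8 futures that each increment the shared counter 1000 times,
;; then dereference every future to wait for it to finish.
(let [workers (doall (for [_ (range 8)]
                       (future (dotimes [_ 1000] (incr counter)))))]
  (doseq [w workers] @w))

;; No increments are lost: conflicting transactions are retried.
(prn @counter)   ; 8000
```

When this is run as a script rather than at a REPL, calling (shutdown-agents) at the very end lets the JVM exit promptly, since futures run on a shared thread pool.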

new-counter · This is the function that gets called when we’re trying to record the occurrence of a URI but the hash doesn’t yet contain a counter for it.

1 (defn new-counter [ so-far target ]
2   (dosync
3     (let [ c (@so-far target) counter (ref 0) ]
4       (if c
5         c
6         (do
7           (ref-set so-far (assoc @so-far target counter))
8           counter)))))

Line 2: The function wraps itself up in a dosync because it’s going to be poking and prodding at the hash reference quite a bit.

Line 3: The part where we set c may be a little puzzling; we look the target up in the hash table; but didn’t we just do that, and call this function because it wasn’t there? Be careful: We’re running in parallel, and someone else might have got in and added it, and counted a few occurrences even, while we weren’t looking, and we wouldn’t want to reset the count to zero accidentally.

Remember that when we dereferenced the table in the higher-level record function by saying (@so-far target), I said “apparently ignoring concurrency issues”. Well, not quite. If another thread had been in this new-counter function, the dosync function and Clojure’s concurrency magic would have prevented the dereference from seeing the hash in an inconsistent state.

The second half of Line 3, assuming that we’re probably going to need a new counter, goes ahead and creates a new ref-to-an-int for that purpose.

Lines 4-5: If c came back non-nil, that means that, yes indeed, someone came along and created the counter and we don’t need to, we can just return it.
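The effect of that re-check can be shown on a single thread by calling new-counter twice for the same target. This sketch repeats the function so that it runs on its own; the names first-call and second-call and the sample URI are invented for the demo.

```clojure
(defn new-counter [so-far target]
  (dosync
    (let [c (@so-far target) counter (ref 0)]
      (if c
        c
        (do
          (ref-set so-far (assoc @so-far target counter))
          counter)))))

(def so-far (ref {}))

;; The first call finds nothing and installs a fresh counter.
(def first-call (new-counter so-far "2009/11/12/Some-Piece"))
(dosync (alter first-call inc))            ; somebody counts a hit

;; A later call for the same target returns the existing counter
;; instead of resetting the count to zero.
(def second-call (new-counter so-far "2009/11/12/Some-Piece"))
(prn (identical? first-call second-call))  ; true
(prn @second-call)                         ; 1
```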

Line 7: This is the normal path through the function, where we update the hash table with the new counter. ref-set is like alter, except that it takes the new value itself rather than a function to apply to the old one; I find it a bit more readable when the code is something more complicated than just inc.
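For comparison, here are the two styles side by side on throwaway refs (a and b are hypothetical names); both leave the ref holding the same map.

```clojure
(def a (ref {}))
(def b (ref {}))

;; ref-set: compute the new value yourself, then store it.
(dosync (ref-set a (assoc @a :k 1)))

;; alter: pass the function and its extra arguments;
;; alter applies it to the ref's current value.
(dosync (alter b assoc :k 1))

(prn (= @a @b))   ; true
```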

assoc looks like it updates the hash table, but because we’re living in the land of immutability it doesn’t really; it creates a new one which differs only by having the new key/value pair. This isn’t as horrifically inefficient as it sounds; Clojure is smart about creating a “new” object by just twiddling pointers and re-using most of the old one.
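A two-line check of that claim, with made-up names original and updated:

```clojure
(def original {"a" 1})
(def updated  (assoc original "b" 2))

(prn original)   ; {"a" 1}   (untouched)
(prn updated)    ; {"a" 1, "b" 2}
```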

report · We’re about done; all that’s left is the function that prints out the top ten most-referenced ongoing pieces.

(defn report [ so-far ]
  (let [sorted (sort-by last (fn [a b] (> @a @b)) @so-far) ]
    (doseq [key (keys (take 10 sorted))]
      (println (str "K " key " - " @(@so-far key))))))

There’s nothing terribly exotic here, except for we have to throw a lot of @ characters around to reach through the references to the table itself and then to the values in the key/value pairs. Which is perfectly OK since this stage is read-only.
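The sorting step can be tried on a tiny hand-built table. The three keys and their counts below are invented; the sort-by call is the one used in report, applied to the top two entries rather than the top ten.

```clojure
;; A miniature results table: URI -> ref to a count.
(def so-far (ref {"a" (ref 2) "b" (ref 7) "c" (ref 5)}))

;; Each element of the map is a [key counter-ref] entry; `last` picks the
;; counter ref, and the comparator dereferences both refs to sort descending.
(def sorted (sort-by last (fn [a b] (> @a @b)) @so-far))

(prn (keys (take 2 sorted)))   ; ("b" "c")
(prn @(@so-far "b"))           ; 7
```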

Looking Back · Note that actual collisions between concurrent threads are probably going to be rare in this implementation. Most times when you probe the hash-table ref you’ll find that yes, we’re already counting this URI; no transaction required. And most times when you increment the counter, probably no other threads are, so once again, the STM should work smoothly.

So there you have it: 30 or so lines of code that reliably and concurrently compute line-at-a-time statistics.

Does This Actually Work? · This section used to begin:

Not quite, at the moment. I mean, it works fine for the first 50 million lines or so of the big dataset’s quarter-billion, keeping my eight-core 32-thread SPARC T2000 maxed, and I mean smokin’, you can see the 50M blocks that Paralines is reading go by pop-pop-pop. But then it hits its heap limit and descends into garbage-collection hell and thrashes its way to a miserable single-threaded standstill.

No more! I found the bug, due to a combination of some moderate stupidity on my part and an old and well-known Java issue, and killed it.

It runs fine; see Parallel I/O for more.

The Big Question · Are the ref and @ and dosync and alter and ref-set machinery accessible to ordinary programmers? Will they figure out how to use them and use them correctly? Clearly this is a lot less painful than wrangling locks and threads, but on the other hand there is still careful thought required and places where you can go wrong.

Is this a primitive that will be one of the distinguishing features of the Java of Concurrency?

Reactions to Commenters · On the question of whether agents would be a better fit: well, maybe. I don’t think there’s a single person on this green planet who is smart enough to predict, based on intuition, where the bottlenecks are going to happen in this kind of a setup. Once I’ve fought through the GC weirdness, I’ll see if going to an agent-based system helps.
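For readers wondering what the agent-based variant might look like, here is one possible shape. It is a guess at the commenters’ suggestion, not something measured or used in this piece: a single agent owns the whole table, and each hit is sent to it as an update. The names counts and record-with-agent are hypothetical, and fnil requires Clojure 1.2 or later.

```clojure
;; One agent holds the entire URI -> count map.
(def counts (agent {}))

;; Queue an update; the agent applies queued updates one at a time,
;; so no transaction is needed around the increment.
(defn record-with-agent [target]
  (send counts update-in [target] (fnil inc 0)))

(doseq [t ["x" "y" "x"]]
  (record-with-agent t))

;; Block until the updates sent from this thread have been applied.
(await counts)
(prn @counts)   ; {"x" 2, "y" 1}
```

Because every update funnels through one agent, the counting itself is serialized; whether that costs more or less than the STM version is exactly the sort of thing that would have to be measured. When run as a script, finish with (shutdown-agents) so the JVM can exit.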

I will point out that, until the GC conflagration broke out, this gave the appearance of running very efficiently on the Niagara box, with all the available threads maxed out. So it may be that the Clojure mechanisms have hit an 80/20 point and further improvement is incremental. We’ll see.

To Avi and others who suggested map/reduce or other more deeply-parallel approaches: Well, maybe. Do bear in mind that I’m trying to nail down The Simplest Thing That Will Really Help, not the most optimal solution regardless of complexity.

To those who suggest that my performance problem is due to concurrency contention rather than GC: I really think you’re wrong. I have the GC diagnostics printing out, and I can see it descending into GC hell when it bottlenecks.

To those advancing theories based on the number of unique counters, lines, and so on: There are around 3,000 unique URIs, and a quarter-billion lines of data, just under 10% of which match the regex and will lead to an update.

Contributions


From: Patrick (Nov 12 2009, at 12:19)

Would you like to post the entire code so I/we can poke at it?

Thanks for the detailed walk-thru.

Concur.next — References
