ongoing by Tim Bray · Concur.next & WF2 — Tuning Concurrent Clojure

I’ve been working on, and writing about, running Clojure Wide Finder code. But I was never satisfied with the absolute performance numbers. This is a write-up in some detail as to how I made the code faster and also slower, including lessons that might be useful to those working on Clojure specifically, concurrency more generally, and with some interesting data on Java garbage collection and JDK7.

[Update: My relationship with the JVM is improving, based on good advice from the Net. This article has been partly re-written and is much more cheerful.]

[This is part of the Concur.next series and also the Wide Finder 2 series.]

In a comment on my last piece, Alexy Khrabrov noted “I saw WF2 has a pretty good Scala result, apparently better than Clojure's — and that's a year-old Scala.” Alexy is right; my best times were mostly over 30 minutes, while Ray Waldin’s Scala code ran in under 15. There’s no obvious reason I can see why Clojure should be significantly slower than Scala, and while I stand by my Eleven Theses on Clojure, I was puzzled and irritated.

The following are organized roughly chronologically; if the narrative isn’t all that coherent, that would be an indicator of me having done considerable thrashing about.

Low-Hanging Fruit · The information you can get from the JVM with the -Xprof argument is not brilliant, but it’s a whole lot better than nothing. What you want to see in profiling output is a big fat obvious culprit, and I did, and it was the code fragment below, which breaks up buffers into lines; text is the text we’re wrangling, first-nl and last-nl are the first and last places where a newline character(whose value, note, is 10) appears in the block. destination is the user-provided payload function.

(loop [ nl first-nl ]
  (let [ from (inc nl) to (. text indexOf 10 from) ]
    (if (< from last-nl)
      (do
        (destination (new String (. text substring from to)) accum user-data)
        (recur to)))))))

This sucker was burning well over half of all the CPU. There are a few things wrong with it. First of all, since I was already using Java to split up the text, why not do it all at once like this?

(let [chunks (. #"\n" split text)

Yep, that helped. (I’d actually had it that way in a previous iteration but had unrolled it while fighting a memory leak that turned out to be unrelated).

Lesson: Any time you can package up a bunch of work as a single call into Javaland, that’s probably a good idea.

Reducing · The second thing that was wrong was that I was iterating at all. Lisps want you to think in terms of lists not their members, and I wasn’t. What I was actually wanted to do was to call the payload function on each line of text, passing the output of each invocation to the input of the next. And of course Clojure, like any decent functional language, has a nice reduce function. So let’s just add a line to that fragment:

(let [chunks (. #"\n" split text)
      ; ... turn "chunks" into "lines" by stripping leading/trailing fragments ...
      accumulator (reduce per-line nil lines)

Lispers are now nodding their heads in a despairing of-course-the-moron-should-have-done-it-that-way way. Me, I like it when a profiler shows me where I’m being stupid. Some non-Lispers are probably thinking “That’s slick”. Most modern languages, not just Lisps, have some sort of a reduce equivalent.

I’m a bit amused here: this code has (basically) a map/reduce architecture, except for I’m using reduce in the map phase.

Lesson: Operate on lists instead of iterating.

Type Hinting · The code was now running visibly faster; time for another whack at the profiler. I saw a whole lot of time being sucked up by java.lang.Class.forName0. This suggested to me that the Clojure runtime was spending a bunch of time trying to figure out what type of thing was crossing some method-dispatch barrier. I already had a wrapper around the user-provided payload function because it has three arguments and reduce wants just two, and to protect myself from Java issue 4513622. I seemed to remember from somewhere that Clojure has “type hints”, so I looked that up and decorated the wrapper:

(let per-line (fn [accum #^String line] (destination (new String line) user-data accum))

See that #^String goober? It’s a hint telling Clojure to assume that the incoming line argument is a Java string. It helped too, quite a bit. I was sort of irritated at having to hold the system’s hand like this. But I think this may be a Release-1.0 symptom rather than a Clojure design flaw. It feels to me like there’s quite a bit more scope for type inferencing, maybe even JIT inferencing at run-time.

Lesson: A little static typing can go a long way.

Reading vs. Mapping · My first cut at this problem used Java’s NIO subsystem; I’d noted that java.nio.channels.FileChannel advertises itself as being thread-safe and furthermore had a map() method for mapping stuff into memory. So I had this pretty slick (I thought) setup where multiple threads would just get dealt out an offset and block size and map successive regions of the file, then do the string-i-fying and splitting.

I was twiddling buffer sizes and thread counts and not observing any real helpful patterns and wondered if I was maybe overthinking the problem. So I made another version that uses a Clojure agent to synchronize sucking in the data with good old-fashioned java.io.FileInputStream.read() and fires off a new thread for each block. Right off the bat, it was faster.

Experimenting with this one seemed to suggest that it ran faster and faster the smaller I made the buffers. I’d started with 64M and eventually went all the way down to 2M; 1M wasn’t any faster. I hypothesize that with the big buffers, the big String.split operation illustrated above was creating too many short-lived transient objects that were overloading the garbage collector.

Lesson: Disk really is the new tape.

Stabilizing the JVM · Early on, when I was fighting a memory leak, Rich Hickey suggested using a smaller heap size. I did that, and was astounded.

I was now seeing throughput on the order of 50-60M/sec. But as I watched the program run, I could see that as the program built up its heap to the ceiling (set via java’s -Xmx argument), it was running much slower, as slow as a tenth of that speed. When it hit the ceiling, absent other pathologies, it would ramp up to production speed, I assume as the GC got into balance with the rest of the program’s work.

Since it can easily take minutes for the JVM to expand to fill a 10G-or-bigger heap, you really can shave a couple of minutes off your run time by giving it a whole lot less. Yes, this is profoundly counter-intuitive.

Lesson: Measure, don’t guess.

Happy Dance! · At this point, I was starting to see run-times down well below fifteen minutes, and profiler output that looked like this (the primitive -Xprof facility gives you a little dump every time a thread exits, which turns out to work really well for me as I’m running through hundreds of ’em):

Flat profile of 29.63 secs (259 total ticks): Thread-58

  Interpreted + native   Method
  0.4%     1  +     0    java.util.Arrays.copyOf
  0.4%     1  +     0    sun.nio.cs.UTF_8.updatePositions
  0.4%     1  +     0    java.lang.StringCoding$StringDecoder.decode
  1.2%     3  +     0    Total interpreted

     Compiled + native   Method
 57.1%   148  +     0    java.util.regex.Matcher.search
 17.0%     1  +    43    java.util.regex.Pattern.split
  8.1%     0  +    21    clojure.lang.APersistentVector$Seq.reduce
  5.8%     3  +    12    java.util.regex.Pattern.matcher
  3.1%     0  +     8    clojure.lang.ArraySeq.next
  2.7%     0  +     7    clojure.core$conj__3121.invoke
  1.2%     1  +     2    clojure.lang.PersistentHashMap$FullNode.assoc
  0.8%     2  +     0    clojure.core$next__3117.invoke
  0.4%     0  +     1    clojure.core$re_find__4554.invoke
  0.4%     1  +     0    org.tbray.paralines$record__54.invoke
  0.4%     0  +     1    clojure.core$assoc__3148.invoke
  0.4%     0  +     1    clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc
 97.3%   156  +    96    Total compiled

         Stub + native   Method
  0.8%     0  +     2    java.lang.System.arraycopy
  0.8%     0  +     2    java.lang.Object.getClass
  1.5%     0  +     4    Total stub

In other words, the program is spending most of its time using Java’s regex machinery to grind away at all that text input; which is I think what you’d like to see.

This was around 10PM last Friday, and I was crowing exultantly on Twitter, feeling like a rockin’ sockin’ functional-programming’ concurrin’ homoiconic wizard. “Faster than Scala” I uttered, incautiously.

Lesson: Read some Classics in transition from the archaic Greek and learn to fear the punishment for hubris, especially public hubris.

Which Wide Finder Was That? · Because, you see, I was actually running a Wide Finder 1 program, just computing one little statistic, whereas Ray Waldin’s Scala code, and the others on the results page, were working on the benchmark from Wide Finder 2. It’s a lot more complex and computes five different statistics which require looking at a much higher proportion of the data.

Well, I thought, How hard can it be?

Lesson: [Never mind. We’ve all been there.]

Attitude Problem · I now have one, because it seemed like a lot more work than it should be to code this up.

The stats you want to compute could be called “hits”, “bytes”, “referers”, “404s”, and “fetchers”. So I accumulated results in a map whose keys were :hits, :bytes, :referers, :404s, and :fetchers; the values were maps whose keys were the URIs and values the in-progress statistics. So if you want to add increment the :hits value for some URI, you say:

(bump accumulator :hits uri 1)

Here’s the bump function:

(defn bump [accum stat uri increment]
  (let [totals (accum stat)
        currently (if-let [total (totals uri)] total 0)]
    (assoc accum stat (assoc totals uri (+ currently increment)))))

In Ruby it’d all be a one-liner...

accumulator[stat][uri] += increment

...but then I’d be promiscuously mutating state and condemning myself to single-threadedness. Which is what this series is all about trying to avoid. On the other hand, it says exactly what I want to say without any ceremony or extra arm-waving. Is there a middle way?

Anyhow, here’s my gripe: I started sketching in the code to gather the five Wide Finder 2 stats Sunday while I was watching football. I spent all day Monday, until late in the evening, getting it all to work; admittedly, I’m a relative Clojure newb, but I’m not that terrible a programmer and I think I’m actually fairly quick at learning new things. This is way too long.

The result-merging step in particular gave me heartburn. If you can merge two maps with Clojure’s elegant merge-with and you can turn a list-or-vector into another list-or-vector with map, why can’t you run a map on a map and get a map?!?

Now, the situation wasn’t helped by the fact that Clojure is after all 1.0, so its tracebacks are not models of helpful transparency; also, the code is running in dozens of threads in parallel which made for extra work in tracking down where what provoked that NullPointerException or ClassCastFailure.

Lesson: Some combination of faults in Clojure and faults in your humble correspondent is interacting badly.

Performance · Improved, but not terribly satisfying. The wrong collection of JVM options can drive this sucker into Garbage-Collection hell. This is painfully obvious when it’s going on; suppose you’re using a simple monitoring tool like top(1) while you monitor the output, you can see it trundling along processing 30 to 50 MB/sec and reporting a CPU burn rate anywhere between 1200% and 3000%, which means you’re keeping all the cores and threads pretty darn busy. Then all of a sudden the CPU drops to around 100% and the output just stops. For like 90 seconds or more. If you’ve picked a bad combination of options, this happens more and more frequently until you’re spending way more time garbage-collecting than actually computing.

[Update:] Someone who wishes to remain anonymous read the first version of this piece had a discussion with me about JVM options and suggested simply taking all of them out, except for -Xms and -Xmx to ensure there’s enough heap; letting it manage its own GC strategy.

This person turned out to be smarter than the tribal lore I’d picked up over the years and around the net, and I’ve managed to stay out of GC Hell for the last couple of days.

On the other hand, after all the profiling and adjusting, my code has grown tons of extrusions and decorations; here’s the bit that actually does the work of processing a line of logfile:

 1 (defn proc-line [line so-far accum-1]
 2   (if (nil? line)
 3     (send so-far merger accum-1)
 4     (let [accum-2 (if accum-1 accum-1 (new-accum))
 5           fields (. #" " split line)
 6           ; [client _ _ _ _ _ uri _ bstring status ref] (. #" " split line)
 7           #^String uri-1 (aget fields 6) #^String uri (. uri-1 intern)
 8           #^String status (aget fields 8)
 9           accum-3 (if (= status "404") (bump accum-2 :404s uri 1) accum-2)
10           #^String bstring (aget fields 9)
11           accum-4 (if (re-find #"^\d+$" bstring)
12                     (bump accum-3 :bytes uri (new Integer bstring))
13                     accum-3)]
14       (if (re-find re uri)
15         (let [accum-5 (bump accum-4 :hits uri 1)
16               #^String ref (aget fields 10)
17               accum-6 (if (or (= ref "\"-\"") (re-find #"tbray.org/" ref))
18                         accum-5 (bump accum-5 :referers ref 1))
19               #^String client (aget fields 0)
20               accum-7 (bump accum-6 :fetchers client 1) ]
21           accum-7)
22         accum-4))))

Boy, that ain’t pretty. At one point, I was pulling out the fields with fragments like (get fields 6) but that led to some XxxArrayAccessor method bubbling to the top of the profile output. I attempted to replace that with “destructuring”, as in the commented-out Line 6, but that turns out to be just a macro that apparently generates about the same calls.

I settled on aget, apparently specialized for fishing around in Java arrays. But you still have to type-hint what comes out of it. Feaugh.

In Line 7, you can protect yourself from our old friend 4513622 by either calling new String on the uri, or by interning it, as above. The former causes more memory stress but, if you can spare the memory, runs faster.

Lesson 1: I’m not sure there’s a clear one here, but it makes me nervous when application code that should be simple is hard to get right, and so far Clojure’s not making this developer happy the way Matz designed Ruby to.

Lesson 2: The Concurrency Tax is too high.

Fallback hypothesis: I’m just Doing It Wrong.

GC Explosion · I accidentally ran an experiment that showed me where a lot of my GC pain was coming from. Reminder: this works by having one Clojure agent read blocks sequentially and handing them off to threads to process, and build those tables of tables with the results. When each thread is done, it sends its table off to another agent to merge.

One time, I had a bug that turned the merger into a no-op, the agent was invoked all right but simply ignored the incoming table. This version ran very fast; not as fast as the simple Wide Finder 1 code I’d been working with earlier, but not bad at all, with a lot less GC overhead. So that code above may be ugly but it actually runs OK.

Here’s the code that does the merge:

(defn merger [current incoming]
  (loop [keys (keys current) output {}]
    (if-let [key (first keys)]
      (let [merge-1 (merge-with + (current key) (incoming key))]
        (recur (rest keys) (assoc output key merge-1)))
      output)))

It’s iterating all right, but only over the keys of the top-level map, of which, remember, there are only 5: :hits, :bytes, and so on. Not much to it, and reasonably idiomatic I think. But boy does it ever generate garbgae.

I note that in Ray Waldin’s nicely-performant Wide Finder 2 code he relied on a java.util.concurrent.ConcurrentHashMap. I suppose it’s unsurprising that that would outperform the sort of pure functional/dynamic code above, but it feels a bit like cheating.

Java Versions and Options · My best times for the full WF 2 run are now around 30 minutes. 29:38 with JDK7, 35:10 with Java 6. The results of tinkering with GC options and generation sizes and so on seem insignificant-to-damaging. In particular then new “G1” garbage collector in JDK7 doesn’t seem appropriate for this problem.

What I Could Do Next ·

Use a ConcurrentHashMap like Ray did.
Bring VisualVM to bear on the problem and really do a deep-dive on what’s happening inside this code.
Try refactoring the app some more.

I dunno. I still like Clojure and stand by my 11 Theses, but that impedence mismatch with my conventional procedural object-oriented programer’s mind grows fatiguing.

Contributions

Comment feed for ongoing:

From: Erik Engbrecht (Dec 09 2009, at 21:09)

One of the neat things Ray's solution does (and, indeed, I think most of the top performers) is not bother with line splitting. Once you're dealing with fairly optimized code, the performance is really driven by how many times you have to pass over the data. Java is automatically penalized in this case because it does one pass in order to turn bytes into strings, where languages that natively work with ASCII can just copy the bytes in. Being able to use the bytes directly rather than having a big heavy immutable string object probably helps a lot too (there's a another pass - copying the file buffer memory into the string buffer memory).

Anyway, I suggest looking through your code and seeing how many times you'll end up scanning the bytes/chars during your processing. Squishing that down will squish your time.