In which the actual costs of running concurrently are examined, and seem shockingly high.
[This is part of the Concur.next series.]
[Update: Apologies; I just made a major correction; I had omitted to put units on the timings and they were highly misleading. You can probably disregard my back-and-forth with Dmitriy in the comments.]
I was running lots and lots of map/reduce-style Wide Finder scripts in Clojure, mostly concerned about the elapsed time they took to run; but I couldn’t help noticing that the reported CPU times seemed awfully high.
As a sanity check, I made a quickie single-threaded version of the most-popular script and ran it on the big dataset. It ran in 2 hours 9 minutes, burning 2h 28m in user CPU and another 13 minutes in system time. While the code is single-threaded, I’m using Java’s “concurrent mark-sweep” garbage collector, which runs on its own threads; that would account for the 25% or so excess of compute over elapsed time, which seems reasonable to me.
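For anyone who wants to check the “25% or so” figure, here is a minimal Python sketch of the arithmetic. It uses only the timings quoted above; the variable names are mine and purely illustrative.

```python
from datetime import timedelta

# Single-threaded run, timings as quoted in the text.
elapsed = timedelta(hours=2, minutes=9)
user = timedelta(hours=2, minutes=28)
system = timedelta(minutes=13)

# Total CPU is user plus system time.
cpu = user + system

# Dividing one timedelta by another yields a plain float.
ratio = cpu / elapsed

print(f"total CPU:   {cpu}")         # 2:41:00
print(f"CPU/elapsed: {ratio:.2f}")   # 1.25
```

So the single-threaded job burns about 2h 41m of CPU for 2h 9m on the clock, roughly a quarter more compute than elapsed time.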
This also reveals, since the box can do I/O at 150M/sec, that the code is entirely CPU-limited; at the moment I think most of the time is going into Java converting bytes into UTF-16 characters at what seems a painfully high cost.
Surprise! · So I took an average over 9 runs of the parallel map/reduce-style code. The average elapsed time was 32m, the average system+user CPU was 5h 35m; a ratio of 10.3 or so between CPU and elapsed times. Put another way, the concurrency tax was 3 extra hours of CPU time on a 2½-hour job.
There’s a caveat; CPU-time reporting on the Niagara boxes is a bit shaky, since it regards a process as being in a run state whenever it occupies one of the multiple cached-threads-per-core, even if it’s not actually running. Since the thing only has 8 cores, ratios like the 10.3 here might be suspect. Then again, each core can (mostly) run two integer threads in parallel, so reports of CPU time up to 16 times the elapsed time might be right.
Whatever... even if we chop the CPU/elapsed ratio back to eight or so, this seems to be telling us that the CPU-cycle cost of concurrency is startlingly high. To achieve a 4× speedup in elapsed time, we’re at least doubling the number of CPU cycles we burn.
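The speedup and the CPU cost can be worked out from the reported figures with a short Python sketch. It takes the numbers at face value, without the Niagara reporting caveat, and uses the rounded averages quoted above, so the CPU/elapsed ratio comes out near 10.5 rather than exactly 10.3; I am assuming that gap is down to rounding or to averaging per-run ratios. The variable names are illustrative.

```python
# All times in minutes, taken from the figures quoted in the text.
single_elapsed = 2 * 60 + 9          # 2h 9m  -> 129
single_cpu = (2 * 60 + 28) + 13      # 2h 28m user + 13m system -> 161
parallel_elapsed = 32                # average over 9 runs
parallel_cpu = 5 * 60 + 35           # 5h 35m -> 335

# How much faster the parallel version finishes on the clock.
speedup = single_elapsed / parallel_elapsed

# How much more CPU the parallel version burns in total.
cpu_multiplier = parallel_cpu / single_cpu
extra_cpu = parallel_cpu - single_cpu

# Reported CPU time relative to elapsed time for the parallel runs.
cpu_per_elapsed = parallel_cpu / parallel_elapsed

print(f"elapsed speedup:      {speedup:.2f}x")          # 4.03x
print(f"CPU multiplier:       {cpu_multiplier:.2f}x")   # 2.08x
print(f"extra CPU (minutes):  {extra_cpu}")             # 174
print(f"CPU/elapsed ratio:    {cpu_per_elapsed:.1f}")   # 10.5
```

On these numbers the parallel code finishes about four times sooner while burning a bit more than twice the CPU, which is where the roughly three extra hours come from.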
Is this an artifact of the Clojure runtime or the Java runtime or the way I’m using the primitives or is it inherent to highly-concurrent software? Clearly there’s lots more research here that wants to be done.