In which the actual costs of running concurrently are examined, and seem shockingly high.
[This is part of the Concur.next series.]
[Update: Apologies; I just made a major correction; I had omitted to put units on the timings and they were highly misleading. You can probably disregard my back-and-forth with Dmitriy in the comments.]
I was running lots and lots of map/reduce-style Wide Finder scripts in Clojure, mostly concerned about the elapsed time they took to run; but I couldn’t help noticing that the reported CPU times seemed awfully high.
As a sanity check, I made a quickie single-threaded version of the most-popular script and ran it on the big dataset. It ran in 2 hours 9 minutes, burning 2h 28m in user CPU and another 13 minutes in system time. While the code is single-threaded, I’m using Java’s “concurrent mark-sweep” garbage-collector, so that’s the 25% or so excess of compute over elapsed time; seems reasonable to me.
This also reveals, since the box can do I/O at 150 MB/sec, that the code is entirely CPU-limited; at the moment I think most of the time is going into Java converting bytes into UTF-16 characters, at what seems a painfully high cost.
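That decoding cost is easy to see in isolation. Here's a minimal Java sketch (not the Wide Finder code, just an illustration) contrasting a raw byte scan with the same scan after a bytes-to-UTF-16 decode:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: counting newlines over raw bytes vs. over a
// decoded String, to show where byte->UTF-16 conversion cost comes in.
public class DecodeCost {
    // Count newline bytes with no charset decoding at all.
    static int countLinesRaw(byte[] data) {
        int n = 0;
        for (byte b : data) if (b == '\n') n++;
        return n;
    }

    // Same count, but after decoding the bytes into a UTF-16 String;
    // this allocates and transcodes every byte, which is the extra work
    // a profiler would attribute to the charset machinery.
    static int countLinesDecoded(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        int n = 0;
        for (int i = 0; i < s.length(); i++) if (s.charAt(i) == '\n') n++;
        return n;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8);
        long t0 = System.nanoTime();
        int raw = countLinesRaw(data);
        long t1 = System.nanoTime();
        int decoded = countLinesDecoded(data);
        long t2 = System.nanoTime();
        System.out.println("raw=" + raw + " decoded=" + decoded);
        System.out.println("raw ns: " + (t1 - t0) + ", decode+scan ns: " + (t2 - t1));
    }
}
```

On a real dataset the decoded path also pays allocation and GC costs, which is why working on raw bytes where possible is a common Wide Finder optimization.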
Surprise! · So I took an average over 9 runs of the parallel map/reduce style code. The average elapsed time was 32m, the average system+user CPU was 5h 35m; a ratio of 10.3 or so between cpu and elapsed times. Put another way, the concurrency tax was 3 extra hours of CPU time on a 2½-hour job.
There’s a caveat: CPU-time reporting on the Niagara boxes is a bit shaky, since a process is counted as being in a run state while it occupies one of the multiple hardware thread contexts per core, even if it’s not actually running. Since the thing only has 8 cores, ratios like the 10.3 here might be suspect. Well, except that each core can (mostly) run two integer threads in parallel, so reports of CPU time up to 16 times the elapsed might be right.
Whatever... even if we chop the CPU/elapsed ratio back to eight or so, this seems to be telling us that the CPU-cycle cost of concurrency is startlingly high. To achieve a 4× speedup in elapsed time, we’re at least doubling the number of CPU cycles we burn.
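For the record, here is the arithmetic behind those ratios, using the timings reported above (all in minutes: single-threaded 129 elapsed / 161 CPU, parallel 32 elapsed / 335 CPU):

```java
// The two ratios under discussion: elapsed-time speedup, and how much
// the total CPU time inflates when going parallel.
public class ConcurrencyTax {
    static double speedup(double seqElapsedMin, double parElapsedMin) {
        return seqElapsedMin / parElapsedMin;
    }

    static double cpuInflation(double seqCpuMin, double parCpuMin) {
        return parCpuMin / seqCpuMin;
    }

    public static void main(String[] args) {
        // 2h9m = 129 min elapsed, 2h41m = 161 min CPU (single-threaded)
        // 32 min elapsed, 5h35m = 335 min CPU (parallel)
        System.out.printf("speedup:   %.1fx%n", speedup(129, 32));       // ~4.0x
        System.out.printf("extra CPU: %.1fx%n", cpuInflation(161, 335)); // ~2.1x
    }
}
```

So the 4× elapsed-time win costs a bit more than 2× in total cycles, which is the "concurrency tax" in question.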
Is this an artifact of the Clojure runtime or the Java runtime or the way I’m using the primitives or is it inherent to highly-concurrent software? Clearly there’s lots more research here that wants to be done.
Comment feed for ongoing:
From: Kevin Scaldeferri (Nov 26 2009, at 17:01)
I'm not sure if it's inherent (i.e. unavoidable), but it's certainly common. If you look at the shootout benchmarks:
and compare the single-core vs. quad-core performance of identical programs, it's quite common to see a 50-100% increase in total CPU usage. In some cases, it's dramatically worse than that.
From: Dmitriy V'jukov (Nov 26 2009, at 23:00)
Tim, I do not quite follow you here.
CPU time for single-threaded program is 922 secs. While CPU time for multi-threaded program is 335 secs. What overhead are you talking about? I see 3x decrease in total work for multi-threaded version (which probably just means some defect in your measurements).
The fact that CPU time is larger than elapsed time does not mean anything good or bad in itself.
For example, suppose you have 10 employees on your team, and they accomplish a task in 1 day; the total time spent is 10 days (10 times the elapsed time). That means nothing until we know how fast 1 employee can accomplish the task. If 1 employee accomplishes it in 10 days, we have perfect parallelization. If 1 employee accomplishes it in 5 days, we have sub-optimal 0.5x parallelization. If 1 employee accomplishes it in 20 days, we have super-linear 2x parallelization.
In general, I believe that both Java and Scala are able to deliver overhead-less concurrency. For example, consider a single-threaded program that calculates the sum of the numbers 1..N (where N is very large). Now consider a multi-threaded program with M threads, each of which calculates the sum of an independent subrange of 1..N. If M < number_of_cores, I believe one will observe exactly the same work (CPU time) and an M-times-smaller elapsed time.
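That example can be sketched in plain Java like so (a toy illustration, assuming M divides the range roughly evenly; the last thread picks up the remainder):

```java
// M threads each sum an independent subrange of 1..N. The total work
// (CPU time) is the same as the sequential version; only elapsed time
// shrinks, assuming M <= cores and no memory-bus contention.
public class ParallelSum {
    static long sumRange(long from, long to) { // inclusive bounds
        long s = 0;
        for (long i = from; i <= to; i++) s += i;
        return s;
    }

    static long parallelSum(long n, int m) {
        long[] partial = new long[m];
        Thread[] threads = new Thread[m];
        long chunk = n / m;
        for (int t = 0; t < m; t++) {
            final int idx = t;
            final long from = idx * chunk + 1;
            final long to = (idx == m - 1) ? n : (idx + 1) * chunk;
            threads[t] = new Thread(() -> partial[idx] = sumRange(from, to));
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1_000_000, 4)); // 500000500000
    }
}
```

No shared mutable state is touched during the computation (each thread writes only its own slot), so there is nothing to synchronize on until the final join.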
For real programs there are a number of reasons for concurrency overhead, of course. For example, threads may saturate the memory bus, contend for cache (given shared caches), or lose some time on synchronization; GC will take longer; etc.
From: Tim (Nov 26 2009, at 23:08)
Dmitriy, I don't follow you either; where are those 922-second and 335-second numbers coming from? Either the elapsed/user/system times being reported by Solaris are wrong (unlikely but possible) or the increase in CPU time to do the same work in less time is surprisingly large.
I think that overhead-less concurrency is of course the goal, but I think we will have to be satisfied with moderate-overhead concurrency.
I wouldn't be surprised if this is due to some weird GC artifact or something. At this point, I'm just reporting a surprising observation, not drawing conclusions.
From: Florian (Nov 26 2009, at 23:25)
The 922 seconds are just what you wrote: 2:28 + 12:54 = 15:22 = 922s. This shows that you should always report your measurements with units; from what you wrote further down, one might infer that you were actually talking about 2h28m + 12m54s > 922s, but that is pure conjecture.
From: Jilles van Gurp (Nov 26 2009, at 23:28)
I would suggest you hook this up to a profiler to see what is really going on. Where is it spending its time? Are there any synchronization bottlenecks? Etc. The problems might be inherent to your design, or just to how things are done in the OS/JVM.
From: Brett Morgan (Nov 27 2009, at 01:20)
One thing worth noting is that, as far as I understand it, Clojure's concurrency primitives are optimistic, as opposed to the pessimistic defaults of lock-based designs.
Is burning cpu cycles worthwhile to get faster wall clock performance? I think in a lot of cases it is.
From: Jouni Osmala (Nov 27 2009, at 01:23)
Actually, if you really ran this on Niagara boxes, it didn't consume as much extra CPU as it looks.
Niagara runs multiple hardware threads on each core, with every thread counted as accumulating CPU time all the while. That should yield better utilization of CPU resources, but it also means that each reported unit of CPU time corresponds to fewer actual resources when you run many threads than when you run few. So the ONLY thing you can really measure reliably is wall-clock time.
From: Robert Young (Nov 27 2009, at 07:02)
>> I think that overhead-less concurrency is of course the goal
Not a reasonable goal, in that it violates the laws of math and physics, with existing processors. Barring Black Ops machines deep inside mountains, parallelism/concurrency is carried out with multiple Von Neumann machines stitched together. As such, software (either system or user level) does the dispatching and syncing. This is overhead, and can't be removed.
I have been saying for years: parallelism will only be mainstream when a new physical architecture is defined which transforms logically linear problems into logically parallel problems, without coder (system or user) intervention. That's the problem that we need to work on.
Barring such a breakthrough, we should focus on low-power cpu clients wired to very high power (multi-core/processor) cpu servers (which use their multi-ness in support of multiple linear clients), and re-invent the world of X-terms and real centralized databases. Back to the future.
From: Dmitriy V'jukov (Nov 27 2009, at 08:25)
Solaris spreads threads evenly across cores, so if one uses <= 8 threads, CPU time is a quite meaningful value.
Yes, there will definitely be some non-zero overhead, but it can be kept below 5%, or 3%, or even 1%.
I/O, memory accesses, and addition operations have overheads too, so let's not use them either!
From: Mike (Nov 30 2009, at 03:32)
It's late as I read this, so maybe this insight is totally bogus: Isn't this CPU overhead just Physics? After all, as has already been commented, the wall-clock speed-up is about right given the number of CPUs thrown at the problem.
Surely this is just our intuition being wrong. After all, to go twice as fast down the highway, you actually burn four times as much fuel...
I think I'll go to bed :)
From: Tiest Vilee (Dec 02 2009, at 02:03)
I've just been looking at some non-blocking data structures, ones that use CompareAndSwap (CAS), and wondered to myself how CAS works.
So I looked it up (and promptly forgot where I found it). It seems that this instruction asserts a lock on the bus while performing the atomic operation. A perfectly reasonable thing to do, except that it hinders the other CPUs' ability to read from memory.
So I wonder if it is a lot of little things like that which stall all the other processors while one processor does something atomic.
I guess that's obvious, but it implies that communication should be kept to a minimum between the different processors. (and in my mind encourages the use of agents with hard edges, rather than lots of teeny little locks being updated by everyone)
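The CAS pattern Tiest describes looks like this in Java, using `java.util.concurrent.atomic` — a minimal sketch of an optimistic retry loop, where every failed compareAndSet is work that gets thrown away and retried:

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal CAS retry loop: read, compute, compareAndSet, retry on
// contention. Failed CAS attempts are wasted cycles, which is one way
// cross-processor communication shows up as extra CPU time.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long addAndGetViaCas(long delta) {
        while (true) {
            long current = value.get();
            long next = current + delta;
            if (value.compareAndSet(current, next)) return next;
            // CAS failed: another thread updated value first; retry.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.addAndGetViaCas(1);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.value.get()); // 40000
    }
}
```

Under heavy contention the loop may spin several times per update, so the total instruction count rises with the number of competing threads even though no thread ever blocks.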
From: Dmitriy V'jukov (Dec 02 2009, at 08:12)
Bus locking has fallen into oblivion. Modern hardware uses cache locking, which does not affect the other processors/cores in the system.
Intel QPI resorts to a system-wide stoppage only for non-aligned atomic accesses, solely in the name of backwards compatibility.
However, communication should indeed be kept to a minimum, for a different reason: communication itself is costly. And that applies equally to message sending in agent-oriented systems.