Anyone who cares at all about taking advantage of these nasty new microprocessors, which still follow Moore’s law, but sideways, not just straight up, ought to go and read The Landscape of Parallel Computing Research: A View from Berkeley. As the title suggests, it’s an overview paper. Thank goodness we have universities, so that smart people who know about this stuff can invest a few months in surveying the landscape and report back for the rest of us who are being whipped around by the market maelstrom. Herewith way too much applause and disagreement and expansion.
[First off, thanks to Liang Chen; I’d been meaning to get around to looking at this for a while, and he slapped a printed copy down in front of me, which got me started.]
Logistics · This is a big, thick, dense paper; I had to read it in fits and starts while doing lots of other things and I lost track, but I bet it added up to a couple of hours. Also, there’s a supporting wiki and a blog, although I’m not clear whether the paper represents the conclusion of the wiki process, or whether the material in the wiki reflects further thought.
Crunchy Research Goodness · Maybe the best thing here for me is that the survey covers a bunch of research that I’d missed. I’m sure lots of my readers knew the following, but I didn’t:
Internal system bandwidth is increasing much faster than latency, they say “at least the square of the improvement in latency”.
Cisco has been drawing a lot of attention with their Metro chip: 188 small cores built on 130nm technology.
There are some neat calculations from Amdahl’s law to suggest that making the cores on your chip non-uniform, as in having one that’s a lot faster than the rest, can pay off big-time.
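The flavor of that Amdahl’s-law argument can be sketched in a few lines of Python. The numbers here are assumptions of mine, not the paper’s: a chip budget of 64 simple-core equivalents, and a single fast core whose speed grows as the square root of the resources poured into it.

```python
# Hedged sketch of the asymmetric-core Amdahl's-law argument.
# The budget, the sqrt(resources) speed rule, and the parallel
# fractions are all illustrative assumptions, not the paper's model.
import math

def symmetric(parallel_frac, n):
    # Serial part runs on one simple core; parallel part spreads over n.
    return 1.0 / ((1 - parallel_frac) + parallel_frac / n)

def asymmetric(parallel_frac, budget, r):
    fast = math.sqrt(r)      # one fast core built from r units of budget
    simple = budget - r      # the remaining simple cores
    serial = (1 - parallel_frac) / fast          # serial part on the fast core
    parallel = parallel_frac / (fast + simple)   # everything helps in parallel
    return 1.0 / (serial + parallel)

for p in (0.9, 0.99):
    print(p, round(symmetric(p, 64), 1), round(asymmetric(p, 64, 16), 1))
```

With code that’s 90% parallelizable, the asymmetric design comes out around 23.6× against the symmetric design’s 8.8×, which is the kind of big payoff they’re talking about.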
As memories get bigger and cheaper and processors get faster, the number of processor cycles it takes to fetch something all the way from main RAM gets bigger and bigger.
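To put a rough number on that, here’s a back-of-the-envelope sketch; the ~100 ns memory latency and 3 GHz clock are assumed illustrative figures of mine, not numbers from the paper.

```python
# Back-of-the-envelope: cycles burned waiting on one trip to main RAM.
# Assumed, illustrative numbers: ~100 ns memory latency, 3 GHz clock.
latency_ns = 100
clock_ghz = 3                 # 3 cycles per nanosecond
cycles_per_miss = latency_ns * clock_ghz
print(cycles_per_miss)        # cycles stalled on a single miss
```

And the latency side of that product is barely budging, while the clock side kept climbing for years; hence the trend.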
In a complicated multi-core multi-chip system, message-passing latency between cores varies dramatically depending on how “close” the cores are to each other. Hmm, I feel a new acronym coming on: NuML.
In recent research, “Autotuners” have been showing excellent results. It’s like this: there are a lot of different ways a compiler can go about optimizing the code it generates. It’s really tough to make the right choices along many different optimization axes based on your best guess as to how the underlying hardware will behave. An autotuner explores the optimization space by trying random permutations of lots of different settings while running actual code, and discovers empirically what works best; you run it once to determine the right compiler settings for a particular machine configuration. Unsurprisingly, these things can outperform the best guesses of the compiler writers.
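Here’s a toy sketch of the autotuner idea in Python. The tuning knobs and the kernel being timed are invented for illustration; a real autotuner would be compiling and running real code on the target machine.

```python
# Toy autotuner: sample the settings space in random order, time actual
# runs, keep whatever is empirically fastest. The knobs here (block size,
# builtin-vs-loop summation) are stand-ins for real compiler settings.
import random
import time

def time_kernel(block_size, use_builtin):
    # Stand-in for "build with these settings, run it, and time it".
    data = list(range(100_000))
    start = time.perf_counter()
    total = 0
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        if use_builtin:
            total += sum(chunk)
        else:
            for x in chunk:
                total += x
    return time.perf_counter() - start

space = [(b, u) for b in (16, 256, 4096) for u in (True, False)]
best_cfg, best_t = None, float("inf")
for cfg in random.sample(space, len(space)):   # random order over the space
    t = min(time_kernel(*cfg) for _ in range(3))  # best of three runs
    if t < best_t:
        best_cfg, best_t = cfg, t
print("fastest settings:", best_cfg)
```

The point is the shape of the loop: no model of the hardware at all, just measure-and-keep-the-winner, which is exactly why it can beat the compiler writers’ best guesses.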
Convincing · The paper presents a few arguments I hadn’t heard before that left me thinking (without my going and plowing through all the supporting research): that has to be right.
Finite Automata · State machines have long been my personal single favorite piece of Computer Science. But the parallelization story isn’t good, not good at all. Which means they will probably become a little less central to our discipline. Sob.
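A minimal sketch of why the parallelization story is bad: each transition needs the state the previous one produced, so the input gets consumed strictly in order. The little machine below, which tracks whether it has seen an even number of ‘a’s, is my own toy example, not from the paper.

```python
# Why finite automata resist naive parallelization: step i cannot start
# until step i-1 has produced the current state.
TRANSITIONS = {
    ("even", "a"): "odd",
    ("odd",  "a"): "even",
    ("even", "b"): "even",
    ("odd",  "b"): "odd",
}

def run(state, inputs):
    for symbol in inputs:                      # inherently sequential loop
        state = TRANSITIONS[(state, symbol)]   # depends on the last result
    return state

print(run("even", "abba"))  # two 'a's, so we end back at "even"
```

As I understand it there are workarounds (transition functions compose associatively, which opens the door to parallel-prefix tricks), but at the cost of speculatively doing the work once per possible starting state.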
Multicore vs. Manycore · I’d never seen “manycore” prior to reading this; perhaps it’s a term of art in the Computer-Architecture coffee-shops. By “multicore” they mean “what IBM and AMD and Intel and Sun are doing now”. By “manycore” they mean “way more cores, each simpler, like what the Cisco Metro is doing”.
They argue that “manycore” buys you more of the performance metrics that matter. Sounds convincing. But also sounds hard to program.
Performance Counters · Historically, some CPUs have had them, but they’ve been low-priority features that got triaged out when the designers had their back to the wall.
Which is just silly. They’re easy to design, they don’t slow things down, and in a processor whose design is getting increasingly complex along multiple dimensions, it’s just not reasonable to ask software designers to live without them. JIT compilers can use them, big-time. As the study points out, ideally you’d like existing programs (existing binaries, even) to run faster on complex parallel hardware, and without hard data about where the cycles are going, you’re pushing a hard problem very near the border of “impossible”.
Human Results · The full title of the section is Programming model efforts inspired by psychological research, but that’s a little misleading. If I understand correctly, what they mean is: Study the effectiveness of different programming models by measuring the results delivered by actual real human programmers who use them.
Well, d’oh. Anyone who’s sat in a bar with a bunch of heavy AOP or Lisp wankers will sign up for this one in a flash. Me, I do know quite a bit about programming models and I know what I like, but I don’t think I’m smart enough to predict what’s going to empower developers to deliver, whether your efficiency metric revolves around programmer time or CPU time.
So let’s, you know, do the experiments and learn from whatever it is they tell us? Yes, controlling for all the variables will be hard. In fact, there may be some flashes of genius required in pulling together the experimental methodology. But important science isn’t supposed to be easy.
Unconvincing · Mother always said “If you can’t find anything nice to say, don’t say anything.” But this is Science, so that doesn’t apply.
Dwarfs · The study supposes that if you could partition the universe of programming problems into a few equivalence classes, then by tuning your system design for one or more of those classes you could serve all the problems that fall into those baskets. Which sounds plausible. They call their baskets “Dwarfs” because they started with seven of them (the number has since grown to thirteen).
Personally, I think all these basket-weaving Dwarfs could be subtracted from the study without loss of value. The majority of them apply to particular sub-disciplines of scientific computing; for example, the first three are Dense Linear Algebra, Sparse Linear Algebra, and Spectral Methods. Important stuff for people who do Scientific Computing, which is itself important stuff.
But I do Web computing, which constitutes a not-insignificant proportion of all computing, and when I consider which Dwarfs I ought to care about, well, I’m not sure. The wiki is a little more helpful than the paper itself, and my best guess is that I ought to care about Graph Traversal and Finite State Machines.
The first suggests that a lot of this computing depends on random memory access latency, which accords well with my intuitions. But I’m not sure that all those hypothetical dwarfs were a good way to establish this.
Data Types and Characters · The discussion of data sizes and types, as they appear in programming models, seems really weak to me. I’m offended that in 2007 intelligent people would still suggest either that an 8-bit datum is suitable for representing a character, or that a 16-bit datum is suitable for representing a Unicode character.
Also, they suggest direct support for “Large Integer (> 128 bits)”, but this is hardly news. Every programming language in the world has this, one way or another; in many of them (Smalltalk and Ruby at least) it’s built into Integer and transparent to the programmer. Last time I suggested this might be inefficient I was gently but firmly corrected by Avi Bryant.
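Python behaves the same way as Smalltalk and Ruby here, which makes the point easy to demonstrate: the programmer never asks for a big integer, arithmetic just keeps working past any fixed word size.

```python
# Transparent bignums: no special type, no overflow, just integers.
x = 2 ** 64                  # already past a 64-bit word
y = x * x                    # 2**128, still an ordinary int
print(y.bit_length())        # 129 bits, promoted silently
```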
Missing In Action · The paper has surprisingly little to say about what I think is the central problem: what interface should the programmer use when dealing with multicore or “manycore” systems? [Oops, in fact it says one profoundly important thing as I noted above: we ought to measure the effectiveness of competing approaches to parallelism by measuring the empirical results achieved with them by programmers]. I had hoped for a bit more insight or some research results in this space. They do note that Software Transactional Memory is expensive and suggest that with hardware support, it might become very interesting; but that’s about all.
It seems that, based on the evidence we have right now, we can draw two firm conclusions:
Conventional thread-based concurrency is “too hard” for the general population of programmers.
The combination of functional programming and message-passing (as in Erlang or MapReduce) makes astoundingly good use of parallel systems and may well be usable by the average programmer.
Thus, my question: is there anything between #1 and #2? Who knows the answer?
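For the record, the shape of #2 is easy to sketch in Python: pure functions plus an associative merge, no shared mutable state, so the map step can be farmed out to as many cores as you have. The word-count example is mine, not from the paper.

```python
# The functional map/reduce style: pure map, associative reduce,
# no shared mutable state anywhere.
from collections import Counter
from functools import reduce

def map_phase(doc):
    # Pure function: any document can be processed on any core.
    return Counter(doc.split())

def reduce_phase(a, b):
    # Associative merge, so partial results combine in any order.
    return a + b

docs = ["to be or not to be", "to thine own self be true"]
counts = reduce(reduce_phase, map(map_phase, docs))
print(counts["to"], counts["be"])  # 3 3
```

Swap the built-in map for multiprocessing.Pool.map and the map phase runs in parallel without touching the rest of the code; that’s the property that makes this style so attractive.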