This is an introduction to the state of the art in Grid Computing in mid-2006. I start with some definitions and motivation, look at grid economics, survey the kinds of infrastructure that are out there now, and touch on future directions. The introduction is “brief” in the sense that it does little more than touch the surface of the subject, but it’s still immensely long for a blog. [Update: MapReduce is here and multi-node today! It’s called Hadoop.]
Defining Terms · Houston, we have a problem: nobody agrees what “Grid” means. Check out Wikipedia for a list of alternative definitions. Whatever. I think that the massed Opterons folding proteins in university computing centers are grids. I think the Google and Yahoo data centers handling a kazillion searches a day are too. And I think that SETI@Home is too.
So I’m going to be sloppy and use “grid” to mean any scenario where you want a good way to throw a whole bunch of computers in parallel at a problem, and you want some infrastructure there to help.
How I Got Here · For an application I was thinking of, I needed something like memcached or Prevayler or Tangosol Coherence; a really fast distributed hash table storing data in RAM across a lot of machines. I looked at those things and eventually decided they didn’t really meet my needs, so I built my own.
It worked pretty well (fast!), but once I was finished I realized the infrastructure was broken. I had to pick what machines I was going to run on and preconfigure the system with static files. Which wasn’t good enough, since I wanted to be able to add machines to the grid on demand, and to detect and survive the condition when a machine failed.
So I decided I needed to run it on one of these new-fangled “Grid” thingies and went looking for infrastructure. This article details what I found. I don’t claim it’s complete, just that it represents the results of research by a motivated non-specialist.
Grids are Great... · Grids are becoming attractive in a lot of different scenarios. One reason is that we’re all generally moving toward scaling out rather than up; throwing lots of relatively cheap machines at problems in parallel, rather than trying to use one big honking mainframe-class box. The potential wins in flexibility and scaling are huge; of course, life gets more complex.
Another driving force is one of my favorite mantras: “Memory is the new disk. Disk is the new tape.” (First uttered by Jim Gray.) This is true in a bunch of different ways. First, memory is several orders of magnitude faster than disk for random access to data (even the highest-end disk storage subsystems struggle to reach 1,000 seeks/second). Second, with data-center networks getting faster, it’s not only cheaper to access memory than disk, it’s cheaper to access another computer’s memory through the network. As I write, Sun’s Infiniband product line includes a switch with 9 fully-interconnected non-blocking ports each running at 30Gbit/sec; yow! The Voltaire product pictured above has even more ports; the mind boggles. (If you want the absolute last word on this kind of ultra-high-performance networking, check out Andreas Bechtolsheim’s Stanford lecture.)
Don’t forget the disk part of the mantra. For random access, disks are irritatingly slow; but if you pretend that a disk is a tape drive, it can soak up sequential data at an astounding rate; it’s a natural for logging and journaling a primarily-in-RAM application.
So, why wouldn’t you deploy a grid for every computing platform?
... Except When They’re Not · Jim Gray (of Microsoft Research, quoted above, pictured here) published a wonderful paper in 2003 entitled Distributed Computing Economics; the abstract says: “Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth.”
Let’s put that another way: memory space and compute cycles are pretty cheap. Disk space is effectively free. Moving data around in large quantities is expensive. In Jim’s case, he’s been working with big astronomical data sets. When he wants to send one to a colleague on the other side of the continent, he loads it onto a desk-side computer stuffed with 250G disks, puts that in a FedEx box and off it goes.
The take-away is that you’d like to maximize the ratio of the amount of computation to the amount of data traffic. The perfect example, of course, is SETI@Home, in which remote nodes perform lengthy calculations on tiny chunks of data.
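To make that ratio concrete, here is a minimal Python sketch built only on the price parity quoted above (100,000 instructions cost about as much as ten bytes of network traffic). The function names and the idea of a hard break-even threshold are my own illustration, not something from Gray’s paper, and real costs will vary.

```python
# Rough break-even check based on the price parity quoted above:
# 100,000 instructions cost about as much as 10 bytes of network traffic.
INSTRUCTIONS_PER_PARITY_UNIT = 100_000
NETWORK_BYTES_PER_PARITY_UNIT = 10

# Instructions you need to execute per byte shipped before the
# computation, rather than the traffic, dominates the cost.
BREAK_EVEN_INSTRUCTIONS_PER_BYTE = (
    INSTRUCTIONS_PER_PARITY_UNIT // NETWORK_BYTES_PER_PARITY_UNIT
)


def compute_to_traffic_ratio(instructions: int, bytes_moved: int) -> float:
    """Instructions executed per byte moved over the network."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return instructions / bytes_moved


def worth_shipping(instructions: int, bytes_moved: int) -> bool:
    """True when the job does enough work per byte to justify moving the data."""
    ratio = compute_to_traffic_ratio(instructions, bytes_moved)
    return ratio >= BREAK_EVEN_INSTRUCTIONS_PER_BYTE
```

By this yardstick, a hypothetical job that burns a trillion instructions on a megabyte of input is comfortably worth farming out, while one that merely touches each byte once is not.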
Another approach, of course, is to have the data live in the grid. Consider a big search engine like Google or Yahoo: the incoming requests and outgoing results are modest in size; the data living in the grid is enormous, but it stays there.
Two Kinds of Grids · Here’s what real-world grids do today:
Predict financial risks and returns.
Build Web search indices.
Search the Web.
One of these is not like the others. Every item in the list above except the last is a batch job. This is what a lot of grids do; let’s call them “batch grids”.
Searching the Web isn’t a batch job at all, it’s online 24x7x365 and will go on providing this service as long as Homo sapiens is still using computers. The notion of a Web search engine “ending” or “completing” is silly.
Let’s call this kind of grid a “service-oriented grid”. By definition, a service-oriented grid has to be available, which means you have to be able to connect to it while it’s running. Also, for the economic reasons we discussed above, the data pretty well has to live in the grid.
Batch and Service-oriented Grids Today · This table groups grid technologies by batch or service orientation, and also by whether they’re here now and deployed now, or still largely speculative for one reason or another.
|The Grid Landscape|
||Here now|Speculative|
|Batch|MPI, MapReduce, SGE, DRMAA||
|Service-Oriented|Globus, Google, Yahoo, etc.|Rio, Gridbus|
I’ll touch on each of these and try to end up with a complete-ish picture of the current landscape.
MPI · It stands for Message Passing Interface, and may be found at MPI-Forum.org. MPI was first standardized in 1994, and the current MPI-2 dates from 1997. It’s a set of FORTRAN and C libraries for doing parallel computing, taking care of the bookkeeping of moving data around among processes working together in a grid; it also does parallel I/O. I looked around and found some talk about Java APIs, but nothing that looked like it had complete coverage and was being maintained and used. This isn’t surprising, because the style of the API is very un-Java.
MPI is by far the world’s most popular “grid” API, in terms of code that’s actually being run today by real people to do real work. If you go to any academic supercomputing center, you can bet they’ll be running a lot of MPI.
It supports several message-passing patterns: point-to-point, broadcast, and so on. You can call it synchronously (don’t return until the messages get there) or asynchronously (launch the message traffic and I’ll go on working). It’s got language-independent data binding, if by “language-independent” you mean “FORTRAN and C”.
Here’s a basic C-language call from the MPI Tutorial at NCSA.
int MPI_Bcast ( void* buffer, int count, MPI_Datatype datatype, int rank, MPI_Comm comm );
Co-operating programs in MPI lingo are called “Processors”, by which they mean “processes”. They live in a named cluster, each identified by an integer called its “rank”.
In the call above, count items whose type is given by datatype and which are stored at the address given by buffer will be copied from buffer in the processor identified by rank to buffer in all the other processors in the group identified by comm; it’s a broadcast.
MPI can group the processes into matrices and sub-matrices to divide the work up. It also has an interesting “Reduce” operation, which can aggregate data from many processors, put it through a computation step, and store the result in one of them.
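For readers who don’t speak C or FORTRAN, here is a toy Python model of what broadcast and reduce mean. It is not MPI and doesn’t use any MPI library; it just pretends each rank’s buffer is an entry in a list inside one process, and the function names are made up for illustration.

```python
from functools import reduce as fold
from typing import Callable, List, TypeVar

T = TypeVar("T")


def broadcast(buffers: List[List[T]], count: int, root: int) -> None:
    """Copy the first `count` items of the root rank's buffer to every other rank."""
    source = buffers[root][:count]
    for rank, buf in enumerate(buffers):
        if rank != root:
            buf[:count] = source


def reduce_to_root(values: List[T], op: Callable[[T, T], T]) -> T:
    """Combine one value per rank with `op`; the root would end up holding the result."""
    return fold(op, values)
```

In this model, broadcasting two items from rank 1 overwrites the first two slots of every other rank’s buffer, and reducing `[1, 2, 3, 4]` with addition leaves 10 at the root. In real MPI the buffers live in separate processes and the library moves the bytes.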
This provides a natural lead-in to what is probably the single most important advance in batch grid technology in the last decade; Google’s MapReduce.
MapReduce · This is described in MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. It’s an absolutely remarkable piece of work, and I recommend that anyone who cares about this stuff read it.
The idea is that the programmer provides a map function, which reads (usually a modest number of) key/value pairs and emits (usually a large number of) other key/value pairs, and a reduce function, which takes the map output pairs, aggregated by key, and produces a useful result.
Here’s an example from the paper that shows how you’d use it to count the number of occurrences of each unique word in a large number of documents.
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
The MapReduce implementation (in C++) is tremendously clever; it uses some elegant tricks to work around machines which fail in mid-run, machines which run anomalously slow, and even bugs that are provoked by weird input values.
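To show the shape of the programming model, here is a single-process Python sketch of the word-count example. It is emphatically not Google’s implementation: there is no distribution, no fault tolerance, and none of the clever tricks; the driver function `run_mapreduce` and the other names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def map_words(doc_name: str, contents: str) -> Iterator[Pair]:
    """Map step: emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_counts(word: str, counts: Iterable[str]) -> str:
    """Reduce step: sum the counts that were emitted for one word."""
    return str(sum(int(c) for c in counts))


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterable[Pair]],
    reducer: Callable[[str, Iterable[str]], str],
) -> Dict[str, str]:
    """Toy driver: run the mapper, group its output by key, reduce each group."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in grouped.items()}
```

Calling `run_mapreduce({"a.txt": "the quick the", "b.txt": "quick fox"}, map_words, reduce_counts)` gives a count of 2 for “the” and “quick” and 1 for “fox”. The point of the real thing is that the grouping step and both function calls can be spread over thousands of machines without the programmer changing either function.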
The benchmark results are impressive: a grep style program running on a few thousand CPUs scanned 10^10 100-byte records, extracting 92,000 matches, in 80 seconds; and sorted those 10^10 records in 839 seconds.
MapReduce really represents a qualitative step forward in the state of the art. Java fans will be delighted to hear that Doug Cutting, of Lucene and Nutch fame, is working on a Java version called Hadoop. The benchmark results are impressive already.
The Sun Grid Engine · This is a specialized instance of Sun’s N1 Grid Engine deployed on the public-facing Sun Grid. The easiest way to introduce it is to look at how you’d run a simple job, for example:
Step1 initializes, reads data file “input.txt”, writes three intermediate files.
Step2 processes them in parallel, writes three output files, no cross-dependencies.
Step3 processes the three output files to generate the final output.
Let’s assume there are shell scripts step1.sh, step2.sh, and step3.sh which run the steps. You’d create a shell script, say it’s called run.sh, like so:
    #!/bin/sh
    qsub -N step1 -b n step1.sh
    qsub -N step2 -hold_jid step1 -b n step2.sh
    qsub -N step2 -hold_jid step1 -b n step2.sh
    qsub -N step2 -hold_jid step1 -b n step2.sh
    qsub -hold_jid step2 -b n step3.sh
The qsub command does all the work. Its -b argument is for “binary”; in this case the value is always “n” for no, since these are scripts. -N assigns a job-step ID, and the -hold_jid argument says not to run until all jobs with the named job-step ID have completed. Note that the last step doesn’t need to have a job-step ID because there’s nothing waiting for it to complete.
Then you zip up the scripts and input.txt; you use a Web GUI to submit the zipfile to the grid engine, track its progress, and, when it’s done, fetch the output. This all works just fine and people are running these kinds of jobs right now today.
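If the hold logic in run.sh isn’t obvious, this little Python sketch models it. It is only my reading of the rule described above (a job waits until every job carrying the named ID has finished), not how the Grid Engine scheduler is actually built; `Job` and `schedule_waves` are invented names.

```python
from typing import List, NamedTuple, Optional


class Job(NamedTuple):
    script: str
    name: Optional[str] = None     # the value given to -N, if any
    hold_on: Optional[str] = None  # the value given to -hold_jid, if any


def schedule_waves(jobs: List[Job]) -> List[List[str]]:
    """Group scripts into waves; everything in one wave may run in parallel."""
    pending = list(jobs)
    waves: List[List[str]] = []
    while pending:
        # A name blocks its dependents while any job carrying it is unfinished.
        unfinished_names = {j.name for j in pending if j.name is not None}
        ready = [j for j in pending if j.hold_on not in unfinished_names]
        if not ready:
            raise ValueError("circular hold: no job can start")
        waves.append([j.script for j in ready])
        pending = [j for j in pending if j.hold_on in unfinished_names]
    return waves


# The five submissions from run.sh, in order.
RUN_SH = [
    Job("step1.sh", name="step1"),
    Job("step2.sh", name="step2", hold_on="step1"),
    Job("step2.sh", name="step2", hold_on="step1"),
    Job("step2.sh", name="step2", hold_on="step1"),
    Job("step3.sh", hold_on="step2"),
]
```

Feeding `RUN_SH` to `schedule_waves` yields three waves: step1.sh alone, then the three copies of step2.sh together, then step3.sh.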
It’s amazingly reminiscent of the JCL that we used to use to submit punched-card decks to mainframes when I was a kid. Which is OK; that was highly evolved technology and is entirely appropriate for batch-job wrangling.
The Sun Grid creates a private subnet, connected to nothing, for each job, as well as a private filesystem. So obviously it’s not service-oriented in any meaningful way. The Grid people are aware that service-oriented grids are interesting, but they’re also keenly aware that our grid, in the hands of a malicious user, could flatten any given bank in seconds, and maybe even give Google a headache.
The Global Grid Forum · On the net at ggf.org, this organization first met in 2001. It’s big; there are 34 working groups. They’ve produced a ton of specifications, the best known of which are DRMAA and OGSA.
DRMAA is not very interesting. It provides C and Java interfaces that let programmers do more or less what the Grid Engine’s qsub and friends do; it’s been implemented on the Sun Grid and in “Project Condor” at the University of Wisconsin-Madison.
OGSA, for Open Grid Services Architecture, is a much bigger thing, an attempt to provide standardized general-purpose infrastructure for a service-oriented grid. One of their use cases gives a good flavor for the kind of problem they’re addressing, and is interesting. Imagine a grid which is receiving telemetry data from all over the Caribbean: air and water temperature, barometric pressure, wind speed. If you see sudden changes in a bunch of these over a 50-square-km area, you need to run some large-scale hurricane simulations Pretty Damn Quick to decide whether or not to evacuate New Orleans.
OGSA tries hard to Solve The Whole Problem, and it’s based on the WS-* technologies; indeed, it is a key contributor to WSRF (Web Services Resource Framework).
There’s another organization, Globus, which actually produces open-source implementations of OGSA standards; here’s a block diagram of the architecture from their online tutorial.
To give more flavor, here’s an excerpt:
Writing and deploying a WSRF Web Service is easier than you might think. You just have to follow five simple steps.
Define the service's interface. This is done with WSDL
Implement the service. This is done with Java.
Define the deployment parameters. This is done with WSDD and JNDI.
Compile everything and generate a GAR file. This is done with Ant.
Deploy service. This is also done with a GT4 tool.
This is not just theory; the Globus Toolkit is in release 4 and (they claim) is being used by a whole bunch of different academic-research projects.
Clearly, there is a remarkable amount of flexibility in this architecture, purchased at the price of complexity. There will also be a runtime cost to all the SOAP-packet marshaling and unmarshaling; whether it’s significant will be highly dependent on application specifics.
I’m probably the wrong person to come to for an opinion on OGSA and Globus. Not only is my antipathy to the WS-* suite well-known, but the specs clustering around WSRF have always struck me as particularly offensive, since on the face of it they seem to re-implement HTTP (badly) on top of SOAP on top of HTTP.
Having said all that, it seems like Globus is an actual running, usable instance of general-purpose service-oriented grid infrastructure. In fact, the only one I’ve encountered.
Rio · When I started looking around for grid infrastructure, a bunch of different people pointed me at the Jini Rio Framework. Here’s its architecture diagram.
Rio’s introductory text says “Project Rio provides a model to dynamically instantiate, monitor & manage service components as described in an architectural meta-model called an OperationalString.” It’s based on multiple layers of abstraction and the documentation uses lots of words that I thought I understood in ways that make it obvious they mean something different. To start with, I don’t know what an “architectural meta-model” is.
Rio’s sample “Hello World” app has one interface, six classes, and 625 lines of code. Now, to be fair, it’s very complete and has a nice Swing user interface. Rio provides a lot of facilities that sound like they ought to be useful, including help for handling errors and failures in your infrastructure. It’s pretty clear that understanding Rio is going to take a lot of work, and if you’re going to be using something for infrastructure in a large-scale high-performance situation, it’s important to understand all the layers thoroughly.
I think that Rio has a terrific future, but seriously needs work to reduce the barrier to entry.
Gridbus · If you go Googling around for “Service-Oriented Grid” you’re pretty quickly going to run up against the work of Prof. Rajkumar Buyya (right) of the University of Melbourne, director of the Gridbus project. Like OGSA, Gridbus is ambitious, but in a different direction. It aims to provide an infrastructure that comprises not only the technical requirements for doing grid computing, but also an economic model so that multiple players can combine dynamically to offer a grid utility, competing for business.
There is apparently a .NET-based implementation of at least part of the theory, called Alchemi, which reached 1.0 in December 2005.
Web-Facing Service-Oriented Grids · Of course, multitudes of people do grid computing every day, when they search at Yahoo! or Google, or shop at Amazon, or use any of a huge number of Web-facing services. Most of which are implemented as grids; it’s the only practical way to deal with Web-scale traffic.
The trouble is that these guys all build their infrastructure themselves, by hand, down to the metal. None of them are in the business of offering general-purpose grid infrastructure that you can use to run your programs on your computers, or even on theirs.
So let’s revisit that table above, adding “Infrastructure” to the title and removing all the entries that aren’t in the you-can-use-it infrastructure business:
|The Grid Infrastructure Landscape|
||Here now|Speculative|
|Batch|MPI, MapReduce, SGE, DRMAA||
|Service-Oriented|Globus|Rio, Gridbus|
Conclusion · There are a lot of options out there; this article has reviewed the ones that I turned up when I was looking for something to use myself.
Unfortunately, I decided that none of them really met my needs. My problem is that I’ve been a Unix guy for twenty years and a Web guy for ten. My feeling is that if something says it’s a service, the right way to talk to it is to get a hostname/port-number combination, call it up, send a request, and wait for a response. My feeling is that a good way to have processes work together is for them to exchange streams of text using standard input and standard output. My feeling is that if I’m going to be stressing out the performance of an app on a grid, I don’t want too many layers of abstraction between me and the messages. My feeling is that server failures are going to have to be handled in an application-specific way, not pushed down into the abstraction layer.
So I ended up writing my own piece of service-oriented grid infrastructure, named Sigrid, which isn’t like any of the other things in this essay. I’ll write that up here soon.
But I’m 100% sure that there are lots of problems out there where one of the alternatives described here will do just what you want.