I did some research on storage-system performance for my QCon San Francisco preso and have numbers for your delectation (pictures too). Now is a good time to do this research because the hardware spectrum is shifting, and there are a bunch of things you might mean by the word “disk”. I managed to test some of the very hottest and latest stuff.

Methodology · Even after all these years, I still like to measure with Bonnie. Yeah, it’s old (18 years!) and is a fairly blunt instrument, but it has the virtue that you don’t have to think very much before running it, and I’m still proud of how clear and compact the output is, and I still believe that the very few things it measures are really useful things to measure.

I’m not alone, either. Last week at ApacheCon, during the httpd-tuning talk by Colm MacCárthaigh, he talked about using it (the Bonnie++ flavor) to get a grip on filesystem performance. He said, looking kind of embarrassed, something along the lines of “Yeah, it’s old and it’s simplistic but it’s easy to use and has decent output.” [Smile]

Also, Steve Jenson has been using it to look at MacBook Pro filesystem performance; see “More RPMs means faster access times”. No news there. (Hey Steve, it’s OK to cut out all Bonnie’s friendly in-progress chat about how it’s readin’ and writin’, and just include the ’rithmetic.)

And hey, just to brighten up this dry technical post, here’s a picture of Bonnie Raitt, after whom the program is named. She’s older than me; doesn’t she look great?

[Photo: Bonnie Raitt]

What Does “Disk” Mean? · I think it can mean three distinct things, these days:

  • A plain old-fashioned spinning-rust disk system attached directly to your computer through some sort of bus connection.

  • (This is new) A solid-state disk (SSD) device; essentially flash memory packaged up to look like a disk.

  • A network-accessed storage device, like for example the Storage 7000 storage-appliance line Sun just announced, which might well include both traditional and SSD storage modules.

Systems Under Test · There are four different tests here, representing (I think) a pretty fair sampling of the storage options system builders have to choose from. The titles in the next few sub-sections correspond to the row labels in the summary table below.

MacPro · This is my own Mac Pro at home that I use for photo and video editing. It’s a meat-grinder; dual quad-core 2.8GHz Xeons, 6GB of RAM. There’s one 250G disk; whatever Apple thinks is appropriate, which bloody well better be pretty damn high-end considering what I paid for this puppy.

T2K · This is the Sun T2000 that hosts the Wide Finder 2 project; eight 1.2GHz cores, 32G of RAM, two 170G disks; whatever Sun thinks is appropriate. There’s a single ZFS filesystem splashed across them, taking all the defaults.

7410 · This is a Sun Storage 7410 appliance, the top of the line that we just announced. It has an 11TB filesystem, backed by some combination of RAM and SSDs and spinning rust. They gave me a smaller box with 8G of RAM to run the actual test on, connected to the 7410 via 10G Ethernet.

IntelSSD · This is one of the latest-and-greatest; in fact the very one that Paul Stamatiou recently wrote up in “Review: Intel X25-M 80GB SSD”. It’s attached to a recent 4G MacBook Pro, which Paul also reviewed. What happened was, I filled out Paul’s contact form and wondered politely if he’d be open to doing a Bonnie run. He wrote back with the output; what a guy.

The Table · There are notes below commenting on each of the four lines of numbers but, if you’re the kind of person who cares about this kind of thing, take a minute to review the numbers first and think about what you’re seeing.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
MacPro     12  64.7 82.0  67.4 10.0  29.8  5.0  64.8 76.7  67.9  6.5   190  0.7
T2K       100  20.5  100 150.1  100  61.4 64.8  19.8 98.9 148.9 76.7   214 10.7
7410       16 121.5 97.7 222.2 51.0  75.3 27.2 100.0 95.6 254.2 47.3   975 76.6
IntelSSD    8  44.8 66.4  69.3 12.8  51.5 10.7  73.4 94.3 246.0 27.0  7856 43.2

Mac Pro Results · Given that Apple’s HFS+ filesystem is held in fairly low regard by serious I/O weenies, these numbers are not too bad. Salient points:

  • On this system with its heavy-metal CPUs, I/O is not in the slightest CPU-limited; the per-char and block input and output rates don’t differ much. Remember that Bonnie’s %CPU numbers are percentage of one CPU, and the Mac Pro, like several of the other boxes under test, has lots.

  • The maximum input and output rates are about the same. This is actually a little surprising; most modern I/O setups, including the others under test here, exhibit some asymmetry.

  • That under-30M/sec number for in-place update of a big file is pretty poor.

  • The ability to seek almost 200 times/second is quite cheering; as recently as a single-digit number of years ago, it was really hard to find hardware that could seek more than 50 times per second. Since disk subsystems benefit only slightly from Moore’s Law, these performance increases are pretty hard-won.
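To put those seek rates in time terms, here's a back-of-the-envelope sketch in Python. The rates are copied from the table above; the helper name is my own, and the result is the mean time per completed seek, which is only a rough stand-in for latency if (as I believe) Bonnie issues its seeks from more than one process at once.

```python
# Random-seek rates copied from the summary table (seeks per second).
SEEKS_PER_SEC = {
    "MacPro": 190,
    "T2K": 214,
    "7410": 975,
    "IntelSSD": 7856,
}


def ms_per_seek(seeks_per_sec: float) -> float:
    """Mean wall-clock milliseconds per completed seek at the given rate."""
    return 1000.0 / seeks_per_sec


if __name__ == "__main__":
    for machine, rate in SEEKS_PER_SEC.items():
        print(f"{machine:9s} {rate:5d} seeks/sec  ~{ms_per_seek(rate):.2f} ms each")
```

At 190 seeks/sec that works out to a bit over 5ms per seek, against 20ms for the 50-per-second hardware of a few years back.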

T2000 Results · This thing has a much slower and wider CPU than the Mac Pro, and a massively more ambitious I/O subsystem; it’s designed for life as a Web server.

  • The kind of single-threaded I/O that Bonnie does (so do lots of other apps) is totally CPU-limited. See the per-character input and output, where the performance is lousy while one of its many cores is maxed. Even on the block-I/O tests it looks like the CPU may be the bottleneck.

  • Despite the CPU bottleneck, this box clearly has massively more I/O bandwidth than the Mac; the block I/O numbers are more than twice as high. There’s no indication we’re maxing out the I/O bandwidth; if we got a few more cores pumping and tuned the ZFS parameters, I bet those numbers could be cranked way up.
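One way to read the %CPU columns is to flag every phase where a single core was essentially pinned. Here's a minimal Python sketch using the MacPro and T2K rows from the table; the 95% cutoff and the function names are my own assumptions, not anything Bonnie reports.

```python
# (M/sec, %CPU) pairs for each Bonnie phase, copied from the summary table.
RESULTS = {
    "MacPro": {
        "out per-char": (64.7, 82.0),
        "out block": (67.4, 10.0),
        "rewrite": (29.8, 5.0),
        "in per-char": (64.8, 76.7),
        "in block": (67.9, 6.5),
    },
    "T2K": {
        "out per-char": (20.5, 100.0),
        "out block": (150.1, 100.0),
        "rewrite": (61.4, 64.8),
        "in per-char": (19.8, 98.9),
        "in block": (148.9, 76.7),
    },
}

# Assumed cutoff: call a phase CPU-bound when one core is at least this busy.
CPU_BOUND_THRESHOLD = 95.0


def cpu_bound_phases(machine: str, threshold: float = CPU_BOUND_THRESHOLD) -> list[str]:
    """Return the phases in which Bonnie's single thread pinned a core."""
    return [
        phase
        for phase, (_mb_per_sec, cpu_pct) in RESULTS[machine].items()
        if cpu_pct >= threshold
    ]


def throughput_ratio(fast: str, slow: str, phase: str) -> float:
    """How many times higher `fast` scored than `slow` in one phase."""
    return RESULTS[fast][phase][0] / RESULTS[slow][phase][0]


if __name__ == "__main__":
    for machine in RESULTS:
        print(machine, cpu_bound_phases(machine))
    print(f"T2K/MacPro block output: {throughput_ratio('T2K', 'MacPro', 'out block'):.2f}x")
```

With that cutoff the T2K shows three pinned phases and the Mac Pro none, while the T2K's block-output rate comes out a bit over 2.2 times the Mac's.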

7410 Results · Remember, in this one, there’s a (fast) network in between the computer and the disk subsystem.

  • The numbers are really, really big. Way north of 200M/second both in and out, nearly a thousand random filesystem seeks a second. Yow.

  • The bottleneck here isn’t obvious: We had a close look with the Fishworks analytics (more on that later), and it was clear that the 10GigE link had lots of headroom and the 7410 itself was barely breaking a sweat. So something in the client or (quite likely) its network adapter was holding things back. Technically, it may not be correct to call this I/O “CPU-limited”, but it’s certainly some aspect of the single-threaded client that’s holding things back.

    It shouldn’t be a surprise, but there’s an important lesson here: Given modern storage back-ends and network infrastructure, single-threaded programs are just not going to be able to max out the available I/O throughput. Of course, the 7410 is designed to serve a whole lot of threads and processes and clients; the total bandwidth this puppy could deliver under a serious load ought to be mind-boggling.
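For what it's worth, here's roughly what "more than one thread" can look like on the client side: a minimal Python sketch, under my own assumptions, that splits a file into contiguous ranges and reads them concurrently in 1MiB blocks. The function names, block size and worker count are illustrative and untuned, and none of this is what Bonnie or the 7410 actually does.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MiB per read; an assumed, untuned value


def read_range(path: str, offset: int, length: int) -> int:
    """Read `length` bytes starting at `offset`, block by block; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        while total < length:
            chunk = f.read(min(BLOCK_SIZE, length - total))
            if not chunk:  # hit end of file early
                break
            total += len(chunk)
    return total


def parallel_read(path: str, workers: int = 4) -> int:
    """Split the file into up to `workers` contiguous ranges and read them concurrently."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    span = -(-size // workers)  # ceiling division
    ranges = [(off, min(span, size - off)) for off in range(0, size, span)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: read_range(path, *r), ranges))


if __name__ == "__main__":
    # Tiny demo file; a real measurement needs a file much bigger than RAM.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * BLOCK_SIZE))
        demo_path = tmp.name
    try:
        print(parallel_read(demo_path), "bytes read")
    finally:
        os.remove(demo_path)
```

CPython releases its interpreter lock during blocking reads, so the worker threads can keep several requests in flight at once; whether that helps on any given box depends on the client, the network adapter and the back-end.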

Intel SSD · Um, one of these things is not like the other, and this would be the one.

  • The output numbers are just not that great. I’m not sure what’s wrong here; maybe HFS+ is getting in the way? Also, this is after all a notebook; high-volume output may not have been the design center.

  • The block-input number, at nearly 250M/sec, is pretty mind-boggling in a notebook. But you have to be smart and do block not per-char I/O; once again, it’s easy to get into CPU-limited I/O mode.

  • As for the random-access number... words fail me. I’ve never seen numbers like this on any disk-like storage device ever; nearly 8000 seeks/second. This is getting into territory that’s competitive with memcached and friends.

    And, unlike the other local disks under test, this class of device has Moore’s Law in its corner; so the price, capacity, and performance will all be moving in the right direction. Ladies and gentlemen, you are looking at the future.
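The block versus per-char distinction is easy to try for yourself. Below is a small Python sketch, not Bonnie's own code: as I understand it Bonnie's per-char phases go through stdio one character at a time, and the closest Python analogue is a buffered one-byte read per call. The function names, the 64KiB block size and the 4MiB demo file are all my assumptions, and Python's per-call overhead is far heavier than C's, so treat any timings as a cartoon of the effect.

```python
import os
import tempfile
import time


def read_per_char(path: str) -> int:
    """Read through buffered I/O one byte per call (roughly getc()-style)."""
    count = 0
    with open(path, "rb") as f:
        while f.read(1):
            count += 1
    return count


def read_by_block(path: str, block_size: int = 1 << 16) -> int:
    """Read the same file in large unbuffered blocks (roughly read()-style)."""
    count = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            count += len(chunk)
    return count


def timed(fn, path: str):
    """Run one reader and return (bytes read, elapsed seconds)."""
    start = time.perf_counter()
    n = fn(path)
    return n, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 << 20))  # small file; it will sit in cache
        demo_path = tmp.name
    try:
        for fn in (read_per_char, read_by_block):
            n, secs = timed(fn, demo_path)
            print(f"{fn.__name__:14s} {n} bytes in {secs:.3f}s")
    finally:
        os.remove(demo_path)
```

Because the demo file fits in cache, both loops are measuring CPU cost per call rather than the device, which is exactly the trap the per-char numbers in the table illustrate.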

Take-Aways ·

  • My employer Sun Microsystems is really good at I/O.

  • SSDs are gonna win. They have fewer moving parts, better performance, and Moore’s Law on their side. Plus they burn less energy.

  • If your application is I/O-bound, and lots are, you’re going to have to go parallel, and be smart about doing block I/O.

  • It’s easy to be bottlenecked on your network link or your storage client performance. It’s getting harder and harder to actually max out the raw throughput of a big-league storage back-end.

What a great time to be in this business.



Contributions


From: alphageek (Nov 21 2008, at 03:12)

Nice recap of a variety of systems, although I'd point out one little thing - the T2000 is pretty clearly CPU bound on some operations (which makes sense since the ZFS management is being handled by the main CPU as well as other tasks), and the other numbers are just over twice those of the MacPro, which would seem to be clearly the result of having two spindles and a more efficient file system than HFS (which showed up much better than I'd imagined). I'd clarify that it's the two disk configuration, rather than the box itself, that seems to make the difference.

I'd be curious to see the results on the Mac Pro with two disks using the experimental ZFS project from Apple.

The 7410 obviously rocks out, and it would be really interesting to see the results with multiple concurrent clients. Given the architecture, I suspect that you'll be able to produce almost the same profile on two to four clients concurrently - love to see a scale up chart on that one.

The SSD justifies ZFS as the filesystem to use in the future with the ability to easily integrate SSD cache and log devices with this kind of performance fronting for cheap high capacity devices. Moore's law is working great on capacity on spinning rust - it's just in IOps that it falls down.


From: robert (Nov 21 2008, at 10:07)

>> If your application is I/O-bound, and lots are, you’re going to have to go parallel, and be smart about doing block I/O.

This sounds like what an RDBMS engine does. Can you say 4NF? Too bad Sun bought a flat-file database.


From: Sam Pullara (Nov 21 2008, at 10:09)

Can you publish the command line that you used? When I try and run this on my system I get ridiculously high results which imply that some pretty severe caching is being done.

Sam


From: Tim (Nov 21 2008, at 10:15)

Sam: The trick is the -s option. You need at least twice as much data as you have RAM. So on a 4G machine, you'd say:

./Bonnie -s 8000
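If you'd rather compute that number than eyeball it, here's a small Python sketch of the twice-RAM rule. It assumes -s is given in megabytes (which is how the 8000 above reads), the function names are mine, and the sysconf lookup is a best-effort trick that works on Linux and may not be available elsewhere.

```python
import os


def bonnie_size_mb(ram_bytes: int, factor: int = 2) -> int:
    """File size in MB for Bonnie's -s option: `factor` times physical RAM."""
    return factor * ram_bytes // (1024 * 1024)


def physical_ram_bytes() -> int:
    """Best-effort physical RAM size via sysconf (assumes a Unix-like system)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    print(f"./Bonnie -s {bonnie_size_mb(physical_ram_bytes())}")
```

For a machine with exactly 4GiB this gives 8192; the 8000 above is just the round-number version of the same thing.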


From: Sam Pullara (Nov 21 2008, at 12:50)

Interesting test. I've been looking forward to replacing my bootdisk on my Mac Pro with an SSD and it looks like they are getting close. I went ahead and ran this benchmark on two of my disk subsystems:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
MacPro     16  56.2 88.5 305.1 53.9 105.2 23.4  70.7 97.6 474.0 60.2   758  4.0
RAIDZ      16  59.0 71.2 137.0 31.9 126.7 36.2  89.4 96.9 587.0 81.8   162  3.1

The first is the 8 1TB disk array described here:

http://flickr.com/photos/spullara/2923859802/

The second result is using ZFS on my Mac Pro from here: http://zfs.macosforge.org/trac/wiki configured across 3 500GB disks using RAIDZ. I was quite impressed with the performance of ZFS!


From: Ric Davis (Dec 03 2008, at 05:56)

"That under-30G/sec number for in-place update of a big file is pretty poor." for the MacPro.

Shouldn't that be 30M/sec, or am I misunderstanding?

Looking at the GB column, it looks as if for the 7410(7140 in the table) test, you've gone with a file the same size as the test system RAM. Isn't that going to tend to flatter it?


From: Tim (Dec 03 2008, at 07:55)

Ric - Oops, you caught 2 typos. Should be 30M and 16G. The Network Is The Editor. Shocking that nobody caught this yet.


From: Tom Matthews (Dec 04 2008, at 02:25)

The 7140 results overview states :

"Way north of 200G/second both in and out"

Surely 200M/second?


From: uli.w (Dec 05 2008, at 03:46)

Does Sun force its employees to include PR and marketing in their own private blogs? That's really sad.


From: Edward Vielmetti (Dec 13 2008, at 21:01)

Tim -

There's a class of disk systems that you didn't look at which fit into some useful category - the ones where the drive controllers deliberately spin down the drives to keep power consumption down, at the expense of latency. This is the "massive array of idle disks" approach, where the cost of a "miss" might be 10 seconds or more.

You have to compare it to tape for near line storage. Power consumption should be a fraction of equivalent always-on drives since maybe 75% of the disks are powered down at any moment.



November 20, 2008
