WF XV: On Parallel I/O

This is the fifteenth progress report from the Wide Finder Project; it’s fairly-uncooked musing about parallelism and I/O.

Preston L. Bannister, in The Wide Finder Project (sample quote: “First, the example problem chosen by Tim Bray is crap...”) asserts, essentially, that there is no such thing as parallel I/O.

Normally, I’d tend to agree. They say “memory is the new disk, disk is the new tape” and one of the reasons that’s true is that disk seeking is still pretty slow but modern disks can read remarkably fast sequentially. But still; as I reported before, Bonnie seems to think that that system can pull data through the read() call at about 160MB/sec, and the fact that Fernandez’ JoCaml code processed it at more or less four times that speed leads one to think that something is happening in parallel.

Maybe I was getting fooled by the fact that I ran Bonnie on a 20G file-set while the Wide Finder data was just under 1G. So I re-ran Bonnie with a test file very close to the WF data size, and in a bunch of successive runs observed maximum block input performance, in MB/sec in successive runs, of 230, 214, 215, 261, and 230. If you look at how Bonnie works, the block-input phase immediately follows a phase in which the same file is read through stdio, so the cache ought to be pretty hot. So let’s say around 250, thus you’d expect to need four seconds to read a 1G file. And the JoCaml code still beat that by a factor of two.

Empirically, the I/O is somewhat parallel (unless the mmap() code path is twice as fast as read(), which seems implausible). How could this be happening?

There are a bunch of plausible explanations, but no experimental evidence yet to help us choose among them. Let’s have some fun speculating.

This is UFS not ZFS on a fairly-full disk, so to start with, the blocks are quite likely smashed all over the disk and you’re never gonna get truly sequential I/O anyhow. So if there’s a lot of seeking going on anyhow, it’s perfectly possible that the filesystem or the disk scheduler is optimizing the read requests and spraying the blocks from different parts of the file out to different cores with minimal head movement.

Also, bear in mind that on a modern server, there are a lot of layers of indirection hiding behind that read() call. There may be multiple physical paths to the disk, and the operating system or the filesystem or something may well be dealing different phases of the I/O out among cores as a matter of course. Or maybe the filesystem cache is smart about multiple parallel threads accessing big data; that’d probably be real helpful to MySQL and Oracle and friends, so it wouldn’t be too surprising. Or some other voodoo that I don’t know about.

So yeah, empirically, parallelizing I/O seems to work even if we don’t know why.

Something else occurs to me. In this kind of app, when you’ve got cores to burn and a serious I/O issue, would it ever not be a good idea to compress the data? ZFS has that built-in.

I’m thinking I need to write a parallelized I/O phase into Bonnie.

Contributions

Comment feed for ongoing:

From: Manuel (Nov 16 2007, at 18:15)

Tim, could you give us the exact specs of the machine you are running this on (motheboard, used RAM and disk?)

[link]

From: Daniel Berger (Nov 16 2007, at 20:54)

No such thing as parallel I/O? Isn't that exactly what scattered I/O is?

[link]

From: Joe Cheng [MSFT] (Nov 16 2007, at 22:22)

Leaving a comment here since Preston's blog doesn't allow comments.

At 150MB/s, there is no possible way those are "real not cached disk reads". StorageReview.com shows that the maximum sustained transfer rate of a Seagate Cheetah 15K.5 300GB Ultra320 SCSI, the fastest drive they've ever tested, is 135.0 MB/s. Second place is the Savvio 15K.1 73GB SAS at 108.0 MB/s. An average SATA 7200RPM desktop drive from last year (Seagate Barracuda 7200.9) is down at 62 MB/s. This is measuring at the very outer zones of the platter so you'll never actually see these rates in real life unless you do some VERY low level tweaking.

So now that we've established that all of these insanely high benchmark numbers all mostly involve reading stuff that's already in main memory, it's not as surprising that multiple cores might help. It also makes the Wide Finder problem less realistic, as it's hard to imagine the cache being this hot if you were using this code in real life.

[link]

From: Ben Hutchings (Nov 17 2007, at 05:28)

The need for a kind of parallel I/O - or rather, parallelisation of I/O processing - has been recognised in the high-speed networking world. The latest 10G network interface cards implement Receive-Side Scaling (to use the MS terminology) which means that RX interrupts are delivered to different CPUs depending on the received packets' flow ids. (I'm not sure how this interacts with coalescing.) Spreading out TX work may also help.

Of course IBM recognised this long ago, and implemented channel processors on its mainframes.

So your idea that the layers of abstraction of disk I/O take a lot of cycles and can be parallelised is quite plausible.

[link]

From: Tim (but not THE Tim) (Nov 17 2007, at 08:48)

Parallel I/O exists (note I am a mainframer so some terms may require translation):

Striping: spreading a file across multiple (possibly virtual) disk drives allows multiple I/Os to be going on at one time for a file; if a lot of read-ahead buffering is being used this can stage data into buffers more quickly than just one device.

Parallel Access Volumes: more than one virtualized device address referencing the same virtual disk device (which is itself emulated by software in a RAID array). This eliminates queuing for the "device".

A lot depends upon the hardware being used, but a RAID array fronted with all sorts of software in the array itself can certainly provide a lot of I/O speedup.

That said, if all of the tests are being run on the same I/O hardware, then the differences in speed related to I/O must have to do with a specific language's way of interfacing to the operating system's I/O subsystems.

[link]

From: John Hart (Nov 17 2007, at 10:05)

I think Joe Cheng's comment is perhaps the best in this entire Wide Finder thread.

Tim accompanies his Bonnie specs (160MB/sec) with the comment:

"Given that this stupid thing has 64G of RAM and I could only find 20G of disk space to work with, the results should be taken with a grain of salt"

If the fastest hard drive on the market can only sustain 135MB/sec - that's a sequential read, where no seeks are happening and *every bit* under the head is part of the desired data set and is seen in the right order - then any single-drive number above that is clearly based off cache. Somewhere.

Tim, your theory about Parallel I/O seems like post-hoc justification. The blocks-are-all-over-the-disk, maybe-the-scheduler-is-really-smart reasoning doesn't allow the drive to exceed its own peak sustained transfer rate. An ideal scheduler is one in which every bit under the head is always immediately relevant to the outgoing stream - which means we're back at sequential read speed, no faster.

A lot of man-hours have been spent on Wide Finder, and I'm glad to have heard about JoCaml more than I have in the past, but these benchmarks on a 1GB dataset on a machine with 64GB of ram don't mean much.

I/O bound is still I/O bound. Trying running the JoCaml code on a 500GB dataset and see what happens.

[link]

From: Adrian Thurston (Nov 17 2007, at 11:29)

Hi Tim, are you still trying out solutions? Any reason you haven't posted results on this C++/ragel solution?

http://athurston.blogspot.com/2007/11/tim-brays-wide-finder-in-ragel.html

[link]

From: Wes Felter (Nov 17 2007, at 12:48)

I think there are different definitions of I/O here. Bonnie and WF are doing plenty of read()s and mmap()s, but they aren't touching the disk at all. If we define I/O as read() and mmap(), then yeah, Solaris can do that in parallel. But if we define I/O as actually talking to the disk (there happens to be only one), then it's fundamentally a serial process (pipelined, but serial).

As I said at Hacker News (http://news.ycombinator.com/item?id=79277): There is an engineering tradeoff here; doing I/O in multiple threads introduces seeks, but it simplifies the code dramatically. Using one thread for I/O will introduce data copies and locking for the work queue. However, since WF does not touch to the disk, Bannister's concerns about seeks are irrelevant.

[link]

From: Preston L. Bannister (Nov 23 2007, at 00:50)

I would not count reading from the in-memory OS file cache as I/O. Given your test data is *much* smaller than memory, and given that you pre-read the file, what you are testing is the performance of the OS file cache.

Of course you can do "parallel" access to the file cache - it is only memory.

For the class of problems of which your test is one example, do you *always* expect the file size(s) to be much smaller than memory? If not then there will be a threshold where a small increase in size causes a huge drop in performance (when the file can no longer be cached). Also you need to count the pre-reading of the uncached file into memory as a part of the time.

I have a strong bias against solutions that don't scale.

Just does not seem an interesting problem for small data. The elapsed time is short enough that not many folk are going to care. What *would* be interesting is to show a T2-based box ripping through a huge processing job in less time than hottest (ha?) x86 single or dual CPU, while sipped a fraction of the power, and without requiring any exotic programming.

At least that is how I interpret (re-interpret?) the purpose of this exercise.

[link]

From: Erik Engbrecht (Nov 24 2007, at 10:12)

I think you need to test the widefinders on some files that are too big to cache, or on a box under decent load so it doesn't have a lot of free memory for caching or for the widefinder to pull down while it processes the file.

[link]

From: Erik Engbrecht (Nov 26 2007, at 10:38)

Ok, I did some tests of my own on my dual core Windows laptop with 2 GB RAM using my Scala version.

For a 5 million line input file (just under 1 gig, fits in memory), Scala pegs both cores and takes ~16 seconds to finish. Ruby takes ~34 seconds and pegs one CPU.

For a 25 million line file (just under 5 gigs, does not fit in memory), Scala takes 7 minutes 49 seconds and doesn't even peg one core. Ruby takes 11 minutes 31 seconds and shows slightly lower utilization than the Scala version.

So what we see is that the task is very I/O bound. The JVM is faster than the Ruby VM, so even though the Scala version is doing more work (it has to transcode 8bit ASCII to 16bit Unicode) it is faster. The parallelization is probably yielding a slight benefit as well.

[link]

From: Adrien Lamothe (Nov 27 2007, at 13:57)

mmap() was more than twice as fast as read(), in Solaris 2.6. I

developed a data pumping application, written in C on Solaris 2.6,

running on an E-450, that had to make comparisons between two very large

data files. I used read() in the first version and it took ~ 28 minutes

to perform the comparison. After replacing read() with mmap(), the

comparison time sped up dramatically, to an average of 1 minute 47

seconds, which made mmap() 15.7 times faster than read(). I tested both

versions many times, with approximately the same performance every run.

The server the program ran on was lightly loaded, used primarily for the

data pump.

I believe mmap() performs similarly in other *nixes.

[link]

From: raggi (Nov 30 2007, at 17:06)

IO is going parallel, there's no doubt. No matter what this test provides in terms of an idea of Disk IO in this particular scenario is irrelevant, and I personally think the original spec was described toward this in some ways. That is, we're looking at a new style of machine.

Memory performance across multiple cores is still relevant to IO, as memory IO is IO too.

I'm wondering how many of the previous comments were made with an awareness of technologies such as the Fusion IO cards, which reportedly achieve over 100,000 IOPS even with 4k packets. The rates are show also significantly higher bandwidth than the disks described above. In fact, it's closer to the speed of RAM.

It is also irrelevant to note the particular state of the particular Fusion IO product release (or not yet release), as this *is* the next generation style of such technologies. It will happen because we need it to, the same as changes in core processing bandwidth architectures is changing for even end users, now.

The series has been an interesting read, thank you. :)

[link]

ongoing

What this is ·

Truth · Biz · Tech

author · Dad
colophon · rights

November 16, 2007
· Technology (90 fragments)
· · Concurrency (75 more)

By Tim Bray.

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!