This is the fifteenth progress report from the Wide Finder Project; it’s fairly-uncooked musing about parallelism and I/O.
Preston L. Bannister, in The Wide Finder Project (sample quote: “First, the example problem chosen by Tim Bray is crap...”) asserts, essentially, that there is no such thing as parallel I/O.
Normally, I’d tend to agree. They say “memory is the new disk, disk is the
new tape,” and one of the reasons that’s true is that disk seeking is still
pretty slow, while modern disks can read remarkably fast sequentially.
But still: as I reported before, Bonnie seems to think that the system can pull
data through the
read() call at about 160MB/sec, and the fact that
Fernandez’s JoCaml code processed it at more or less four times that speed
leads one to think that something is happening in parallel.
Maybe I was getting fooled by the fact that I ran Bonnie on a 20G file-set
while the Wide Finder data was just under 1G. So I re-ran Bonnie with a test
file very close to the WF data size, and across a bunch of successive runs
observed maximum block-input rates, in MB/sec, of 230, 214,
215, 261, and 230. If you look at how Bonnie works, the block-input phase
immediately follows a phase in which the same file is read through
stdio, so the cache ought to be pretty hot.
So let’s say around 250 MB/sec; at that rate, you’d expect to need about four seconds to read a 1G file.
And the JoCaml code still beat that by a factor of two.
Empirically, the I/O is somewhat parallel (unless JoCaml’s
code path is somehow twice as fast as
read(), which seems implausible).
How could this be happening?
There are a bunch of plausible explanations, but no experimental evidence yet to help us choose among them. Let’s have some fun speculating.
This is UFS, not ZFS, on a fairly-full disk, so to start with, the blocks are quite likely smashed all over the disk and you’re never gonna get truly sequential I/O anyhow. Given all that seeking, it’s perfectly possible that the filesystem or the disk scheduler is reordering the read requests and spraying the blocks from different parts of the file out to different cores with minimal head movement.
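I have no idea how the JoCaml code actually does its reading, but here’s a minimal C sketch, with made-up NTHREADS and CHUNK knobs and most of the error handling skipped, of the access pattern I have in mind: several threads pulling interleaved chunks with pread(), so the kernel always has a handful of outstanding requests at scattered offsets to sort into a cheap head path.

    #include <sys/types.h>
    #include <sys/time.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 8          /* made up; tune to the core count */
    #define CHUNK (1 << 20)     /* 1MB per request */

    struct arg { int fd; off_t size; int id; };

    /* Thread i reads chunks i, i+NTHREADS, i+2*NTHREADS, and so on.
       pread() carries its own offset, so the threads never fight
       over a shared seek pointer. */
    static void *reader(void *p)
    {
        struct arg *a = p;
        char *buf = malloc(CHUNK);
        if (buf == NULL)
            return NULL;
        for (off_t off = (off_t) a->id * CHUNK; off < a->size;
             off += (off_t) NTHREADS * CHUNK)
            if (pread(a->fd, buf, CHUNK, off) < 0)
                break;
        free(buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t t[NTHREADS];
        struct arg a[NTHREADS];
        struct timeval t0, t1;
        int fd;
        off_t size;
        double secs;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        size = lseek(fd, 0, SEEK_END);

        gettimeofday(&t0, NULL);
        for (int i = 0; i < NTHREADS; i++) {
            a[i] = (struct arg) { fd, size, i };
            pthread_create(&t[i], NULL, reader, &a[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/sec\n", size / 1048576.0 / secs);
        return 0;
    }

Mind you, with a cache as hot as the one Bonnie leaves behind, a toy like this will flatter itself; you’d have to blow the cache away between runs to learn anything.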
Also, bear in mind that on a modern server, there are a lot of layers of
indirection hiding behind that
read() call. There
may be multiple physical paths to the disk, and the operating system or the
filesystem or something may well be dealing different phases of the I/O out
among cores as a matter of course. Or maybe the filesystem cache is smart
about multiple parallel threads accessing big data; that’d probably be real
helpful to MySQL and Oracle and friends, so it wouldn’t be too surprising.
Or some other voodoo that I don’t know about.
So yeah, empirically, parallelizing I/O seems to work even if we don’t know why.
Something else occurs to me. In this kind of app, when you’ve got cores to burn and a serious I/O issue, would it ever not be a good idea to compress the data? ZFS has that built-in.
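ZFS would do the decompression transparently, under the covers of read(). If you were doing it by hand, and assuming the log had been gzip’d first, the trade looks something like this zlib sketch (file name and buffer size made up):

    #include <stdio.h>
    #include <zlib.h>

    /* Sketch only: scan a gzip'd logfile, burning spare CPU on
       decompression so the disk has to deliver fewer bytes. */
    int main(int argc, char **argv)
    {
        char buf[1 << 16];
        int n;
        gzFile in;

        if (argc < 2 || (in = gzopen(argv[1], "rb")) == NULL)
            return 1;                   /* e.g. access.log.gz */
        while ((n = gzread(in, buf, sizeof buf)) > 0) {
            /* hand the buffer off to the line-matching workers here */
        }
        return gzclose(in) == Z_OK ? 0 : 1;
    }

Links with -lz. Whether it wins depends on how well the logfile squeezes, but Apache logs are awfully repetitive, so you’d hope for a lot.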
I’m thinking I need to write a parallelized I/O phase into Bonnie.
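Something like the strided-pread() sketch above, timing loop and all, would probably be the place to start.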