This is a sketch of how to provide highly concurrent read and update access to sorted paged lists while requiring minimal locking. This particular trick has probably been covered before but if so I’ve missed it and haven’t seen anyone else using it.

I implemented it a long time ago in a closed-source and now-defunct piece of software. It worked really well, and I’ve never since seen it done quite this way. There was some thought that it might be patentable and a literature search back then came up empty. But I’d be amazed if this weren’t being used here and there; who knows, maybe it’s crept into the undergrad CS curriculum while I wasn’t looking.

Page Table · Suppose we have a large ordered list containing some type that the “<” operator works with, and we’ve organized it into pages; we access them through a simple page table whose entries need contain only a base value for the page and a pointer to it.

Searching · For this to work, we assume we can search both the page table and the pages in a way that depends only on the ordering of the elements, and can tolerate duplicate elements (as in where the same object appears in two or more successive slots in a page). An example of such an algorithm is binary search, which is in fact what I used in my implementation, and I think is generally a good choice.

Reading · To find something in the list, you search the page table to find which page it’s in, then search the appropriate page, then you’re done. If you just want to retrieve information, you don’t need to acquire any locks at any time.

Shuffling · This is the lowest level of the process by which we actually or add or delete items from the table. I think the explanation will run smoother if I cover shuffling first.

Let’s consider a page in the table as a number declaring the element count, and then a series of numbers representing the items. For example, a page with four items represented as numbers:

4: 10, 20, 30, 40

Now suppose we want to add the value 15, which will fall between 10 and 20. Let’s assume there are more slots available in the page; the count 4 describes the number that are actually used.

I’ll refer to this process as shuffling-in. First we add a duplicate of the top element in space one off the declared end of the page.

4: 10, 20, 30, 40, 40

Now we increment the element count.

5: 10, 20, 30, 40, 40

Now we shuffle the elements up, going from the top down.

5: 10, 20, 30, 30, 40
5: 10, 20, 20, 30, 40

Finally, we over-write the occupant of the correct position with the new value.

5: 10, 15, 20, 30, 40

Note that at every point in this process, a binary search through the page, based on the declared number of elements, will produce a correct result, assuming it can tolerate the presence of duplicates (which binary search by default does).

I’ll call the inverse process shuffling-out; here we shuffle out the newly-added 15.

First, we copy the to-be-deleted element’s right neighbor on top of it, creating a duplicate of the neighbor, and repeat the process until we’ve reached the top of the page.

5: 10, 15, 20, 30, 40
5: 10, 20, 20, 30, 40
5: 10, 20, 30, 30, 40
5: 10, 20, 30, 40, 40

Finally we decrement the element count.

4: 10, 20, 30, 40, 40

Once again, at no point during the shuffle have we disrupted anyone’s ability to binary-search the page correctly.

Simple Updates · Shuffling-in and shuffling-out allow a writer to update a page of the list while readers are active, without getting in their way. Thus an updater who wants to add or delete an element from the list needs to do the following:

  1. Search the page table to find out which page will be affected.

  2. Acquire an update lock on that page.

  3. Shuffle in or out.

  4. Release the lock.

Difficult Updates · Things are a bit more complex when the page table needs to be updated, which happens in two cases: When a page is too big and needs to be split, or is too small and needs to be merged with another (or alternately, is empty and needs to be removed).

It turns out that you can shuffle the page table too. It’s a bit more complex; for example when you’re splitting a page with X entries whose base value is N into two pages with approximately X/2 entries the second of which has base value N2, you’d do this:

  1. Acquire the page table update lock, then the lock of the page to be split.

  2. Construct the new page, starting at N2, with the top half of the entries.

  3. Shuffle it into the table. There will now be a bunch of entries duplicated between the big page beginning at N and the small page beginning at N2, but that won’t keep a reader’s search from running correctly.

  4. Change the count declaration in the page starting at N.

  5. Release the locks.

Similarly, merging small pages requires a bit of housekeeping around the shuffle, but it’s not rocket science and is left as an exercise for the reader. The trick is to keep things sorted and usable by read-only code at all times.

Finally, there’s a corner-case to the simple update where you delete the lowest entry in a page; this also requires locking the page table.

Hot Lock · In my implementation, I had every updater acquire the page table lock long enough to look at the target page and figure out whether the page table would need to be updated, then, assuming not, release it before it went to work on the target page.

I was worried that that lock would get hot and bottleneck the system, so I thought of some dodges including storing page metadata in the page table, and taking a two-pass approach with back-off; going straight to the page and then if it needed splitting or merging, backing off and acquiring the page-table lock.

Tuning and Experience · In practice, the page-table lock wasn’t a problem, probably since the amount of work involved in seeing if a page is going to split or merge is very small.

Another sin of omission was that I never bothered to merge pages, just deleted them if they emptied out entirely.

I was actually worried about cache coherency; when the system is running hot, it’s absolutely the case that two processes will be issuing simultaneous read and write requests to the same address at the same time. But the algorithm is pretty resistant to update latencies and I never observed any problems in practice.

Obviously this is not transactional; a reader process can fail to see an insertion that started some time ago if the updater is shuffling its way through a large page.

There’s only one obvious tuning parameter: page size. In my implementation there was a very high query rate but with less than 10% updates, and whatever page size I picked more or less at random on the first time through was good enough. It was fairly small; my intuition was that the cost of big shuffles would be higher than the cost of searching a larger page table.

This was done in a C program; the entries were fixed-size structs packed into huge mmap-backed buffers, so the entries were all just 64-bit pointers. I didn’t even persist the page-table or page structures, just constructed them when the program started up. The locking primitives were basic POSIX stuff, can’t remember which.

I could run hundreds of processes handling immense numbers of queries and quite a few updates per second through about as much memory as you would reasonably put on a generic Debian server a decade-and-a-bit ago. I never actually figured out how fast it would go or how to tune it because it was never the critical point in the system. My intuition is that you could distribute it across multiple systems without too much pain, but I haven’t thought it through.

I haven’t really any useful information about the range of update patterns and frequencies that this scheme might tolerate before becoming unworkable. My intuition is that there are some patterns where it’ll outdo quite a few of the other off-the-shelf options, but I have no hard evidence.

That’s All Folks · The other night, I had insomnia and for some reason thought about this trick, and figured out I should write it down before I forget it, in case someone else finds it useful.



Contributions

Comment feed for ongoing:Comments feed

From: Andrew Brereton (Jul 14 2010, at 17:39)

Second fantastic article I've read this week regarding algorithms.

http://cacm.acm.org/magazines/2010/7/95061-youre-doing-it-wrong/fulltext

[link]

From: Rowan Nairn (Jul 14 2010, at 18:48)

It seems like you would get false negatives in some rare situations if you use binary search. Say that while your shuffle-in example is in progress, a searcher is looking for 30. Say that before the shuffle begins it has narrowed the search down to the range [10, 20, 30] but it doesn't get to the end of its search until the shuffle has finished. At that point the range it was looking in has become [10, 15, 20].

If this is a cache I guess you don't care about a rare false negative though.

[link]

From: Janne (Jul 14 2010, at 19:29)

A quick thought: Why stop at two levels; how about creating a B-tree without reader locks?

A second quick thought: What would be the minimal locking required for concurrent updates?

[link]

From: George Phillips (Jul 14 2010, at 21:21)

Clever stuff. What happens if you've got a reader searching for 40 when the page is at the "4: 10, 20, 30, 40" state and it reads in 4 as the number of elements. Then an updater comes along and gets at least to the "5: 10, 20, 30, 30, 40" state. Now the reader will be looking at the "10, 20, 20, 30" list and miss the 40.

Or is there some way to work around that case?

[link]

From: BPW (Jul 14 2010, at 21:43)

False positives are possible too. If I understand the description correctly, it appears that this algorithm assumes sequential consistency (a separate issue from cache coherency). Unfortunately, the strongest language shared-memory models around today (the Java Memory Model and the new C++ memory model) only provide sequential consistency in the absence of data races. (Other languages with shared-memory don't really have formal semantics for this...) This means that if there are data races, the execution is not guaranteed to be sequentially consistent: different threads may observe memory operations in different (and irreconcilable) orders.

Under this model, since readers of the list may race with updaters of the list, we cannot assume sequential consistency. (In essence, the compiler and processor are free to reorder memory accesses as long as they don't reorder across synchronization operations/fences.)

For example, a thread reading from a page while another thread updates the same page may see the updated page size value (5 in your "shuffling in" example) but see the previous value of the 5th slot in the page (i.e. before 40 is written there in the example; an unitialized value, perhaps). In other words, we've violated the invariant that "at every point in this process, a binary search through the page, based on the declared number of elements, will produce a correct result." Spurious results (false positives or false negatives) could follow. In practice, these may occur exceedingly rarely if at all, but they're entirely possible under the right loads.

[link]

From: Jeffrey Yasskin (Jul 14 2010, at 23:45)

BPW: You're certainly right that plain C operations won't work here, but I believe Mr. Bray can use C++0x's atomics to get the same performance here. Specifically, I think he only needs acquire loads on the read side and release stores on the write side, not fully sequential consistent ones. And on x86, acquire loads and release stores are plain mov instructions, with simply restrictions on the compiler's behavior.

In Java, the algorithm would need to use AtomicLongArrays, with an xchg on every store (unless...is it correct to replace a sequence of volatile stores with a series of plain movs with an xchg at the end? The x86 memory model may say yes, in which case a JVM could make this nearly as efficient as C++.)

I'm also curious about Tim's strategy for freeing pages (in non-GC'd languages). If a reader is trying to search a page when an updater empties it out and frees it, the reader is left with a dangling pointer. The easiest way to deal with this is with a shared_ptr (in C++), but you still have to lock the shared_ptr briefly to copy it/increment the refcount.

That's aside from the hole Rowan and George pointed out.

[link]

From: Paul (Jul 15 2010, at 03:10)

As others have pointed out, the page size could change during your search loop, so your search algorithm must not cache the page size while searching. In other words, the following would be bad:

int n = page.size;

for (int i=0; i<n; i++)

{

if (page.entry[i] == x)

break;

}

You will need to test its value every time around the loop:

for (int i=0; i<page.size; i++)

{

if (page.entry[i] == x)

break;

}

and mark page.size as volatile where appropriate for your language.

The same would apply to the page table part of the search.

[link]

From: Ivo (Jul 15 2010, at 05:59)

@Jeffrey Yaskin

"That's aside from the hole Rowan and George pointed out."

Is that a hole, given that Tim wrote:

"Obviously this is not transactional; a reader process can fail to see an insertion that started some time ago if the updater is shuffling its way through a large page."?

[link]

From: ray.racine@gmail.com (Jul 15 2010, at 06:30)

To George's point if right most element is duplicated a reader will not lose sight of the last element in a shuffle in as dups are allowed.

Starting with

{5: 10,20,30,40,40}

{5: 10,20,30,40,40,40}

{6: 10,20,30,40,40,40}

{6: 10,20,30,30,40,40}

...

[link]

From: George Phillips (Jul 15 2010, at 19:06)

@Paul

Constantly checking the page size might work but I don't see how you'd apply the trick to a binary search.

@Ivo

The particular case I came up with fails to find an element that was already in the list, not a newly inserted one.

@Ray

Duplicating the last element would certainly lower the probability of failure. However, a reader who is suspended after reading the length and doesn't get going again until two inserts to the page occur will have the same problem.

Notwithstanding more subtle memory ordering problems, the page count could be fixed by duplicating the last value across the rest of the page. The reader would always search the entire page and could never have a problem with page counts. It is not clear to me that the same trick can be applied at the page table level, though.

[link]

From: nobody (Jul 16 2010, at 01:55)

I don't get it. Say there's one item on a page. My "binary search" decides to look at index 0. But before it can do so, an update pushes that item to index 1. My code will fail to find it.

Are you assuming that binary searches are atomic or something?

[link]

From: Rodrigo Bernardo Pimentel (Jul 23 2010, at 10:34)

George and Rowan: what if you did a sort of "sloppy" binary search? I haven't really thought this through yet, but what I mean is: you assume that, at any point while you're searching, one item might be introduced into the array (let's work with only one, as it seems unlikely, from Bray's description, that multiple updates would be made during a single search). Every time you split up the array in two, you get two extra elements, one from each end. So, if you were going to search through a sub-array of 10 elements, you now search through one of 12. One the next step, instead of one with 6 ou get one with 8, and so on.

So Now, instead of stopping when lower == upper (or however you chose to define your endpoints), stop when lower - upper < 4, and do a linear search on these (at most) four elements, in memory.

Still O(log(n)), though obviously a bit slower. But you shouldn't be caught off-guard by the insertion of one element.

But, once again, I'm thinking as I'm typing. I might regret this later :)

[link]

author · Dad · software · colophon · rights
picture of the day
July 13, 2010
· Technology (81 fragments)
· · Coding (98 more)

By .

I am an employee
of Amazon.com, but
the opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.