Niall’s virtual diary archives – Sunday 31st August 2014


Sunday 31st August 2014: 11.44pm. Link shared: http://en.cppreference.com/w/cpp/atomic/memory_order

Was up till 3.30am on Saturday night de-racing the proposed boost::concurrent_unordered_map, and I found myself pondering the problem during the drive to and from Limerick today.

valgrind drd and helgrind say the code is absolutely clean, and I believe those tools. Yet ThreadSanitiser reports 15Mb of warnings. I initially dismissed the tool as surely producing false positives, but that many???

It occurs to me that the proposed concurrent_unordered_map almost certainly is race free on Intel CPUs - not hard, as those are strongly ordered: all memory loads are acquires and all memory writes are releases with no extra effort. It probably is race free on ARM too - I have an ARM test slave after all, and it soak tests without issue. Most programmers would probably accept the thing as good and move on (as should I, really).

But something is bugging me about it. ThreadSanitiser is happy when you're concurrently no-alloc inserting and removing - that's all good. It gets upset when a bucket expansion occurs or during concurrent rehashing, and I have a sneaking suspicion that it's right, even though it would never be a problem in the real world. You see, with valgrind drd and helgrind you need to annotate your code with hints such as ANNOTATE_RWLOCK_ACQUIRED() because Intel machine code is so strongly ordered that valgrind needs hinting as to what you really mean. ThreadSanitiser doesn't need that, because the C11/C++11 thread memory visibility ordering model is itself the annotation you use to tell ThreadSanitiser what you're doing - i.e. the same stuff which makes your program correct is the annotation, which is obviously handy. The C11/C++11 thread memory visibility ordering model is described at http://en.cppreference.com/w/cpp/atomic/memory_order, but to recap, it means this:

* a release on a value by thread A means any acquire on that value by thread B makes visible to thread B all writes by thread A up to the release.

* a release on a value by thread A means any consume on that value by thread B makes visible to thread B all data-dependent writes by thread A up to the release. Consume is very handy when you release a pointer: anything pointed to by that pointer is made visible to the consumer of the pointer, while all other writes may or may not be visible. As often the pointer and what it points to is the only thing you wish to transfer between threads, and you don't care about all the other writes, this is a useful optimisation which can save a ton of cache coherency traffic.
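A minimal sketch of that release/consume pointer hand-off might look like this (all names here are illustrative, not from the proposed code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload { int value; };

std::atomic<Payload*> ptr{nullptr};

void producer()
{
  Payload *p = new Payload{42};             // data-dependent write
  ptr.store(p, std::memory_order_release);  // publish the pointer
}

int consume_value()
{
  Payload *p;
  // Spin until published; the consume load carries a dependency, so the
  // producer's write to p->value is guaranteed visible through p here.
  while(!(p = ptr.load(std::memory_order_consume)));
  return p->value;
}
```

Only the writes the pointer depends on are guaranteed visible to the consumer - any unrelated stores the producer made may or may not be seen, which is exactly the cache-coherency saving described above.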

And just to remind you of the obvious: a spinlock whose lock operation is an atomic compare-exchange always acquires (as an optimisation, a failed exchange can consume instead), and its unlock always releases. This ensures that all writes before the unlock are always made visible to the next thread to successfully gain the lock.
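For concreteness, here is the two-state core of such a spinlock (a sketch only - the proposed boost spinlock is a tri-state template, and the names are mine):

```cpp
#include <atomic>

// Minimal acquire/release spinlock as described above (illustrative only).
struct spinlock_t
{
  std::atomic<unsigned char> v{0};  // 0 = unlocked, 1 = locked

  void lock()
  {
    unsigned char expected = 0;
    // A successful compare-exchange is an acquire: every write made before
    // the previous unlock() becomes visible to the thread gaining the lock.
    while(!v.compare_exchange_weak(expected, 1,
                                   std::memory_order_acquire,
                                   std::memory_order_relaxed))
      expected = 0;  // reset for the next attempt
  }

  void unlock()
  {
    // The release publishes all our writes to the next locker.
    v.store(0, std::memory_order_release);
  }
};
```

The failure ordering is relaxed here; per the optimisation mentioned above, consume would also be a valid choice on the failed path.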

Proposed boost::concurrent_unordered_map works as fast as a std::unordered_map yet is highly concurrent and rehash safe, using some nifty atomic tricks based on the above. Firstly, the bucket array is pointed to by an atomic pointer, and each bucket in the array has a tri-state spinlock: unlocked, locked, or reload-bucket-list, where the third state means go and reload the bucket array from its atomic pointer, as the one you're looking at has been retired. Within each bucket there is an array of hashes and pointers to value_types, so to find or insert you load the pointer to the bucket array, take the hash of the key modulo the array size, lock the bucket in question and then do a linear search through the array of hash-pointer pairs, comparing hashes and, if equal, verifying the keys are indeed equal. If while attempting to lock the bucket the lock goes into the reload state, you loop back and reload the pointer to the bucket array before trying to lock the bucket again; otherwise you spin trying to lock the bucket.
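In outline, the find path just described might look something like this (heavily simplified and with hypothetical names; the bucket locking itself is elided to keep the sketch short):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Heavily simplified sketch of the lookup path described above.
struct item
{
  std::size_t hash;                          // cached hash of the key
  std::pair<const std::string, int> *p;      // pointer to the value_type
};

struct bucket
{
  std::atomic<unsigned char> lock{0};  // 0 unlocked, 1 locked, 2 reload list
  std::vector<item> items;             // array of hash-pointer pairs
};

struct table
{
  std::atomic<std::vector<bucket>*> buckets{nullptr};

  int *find(const std::string &key)
  {
    std::size_t h = std::hash<std::string>()(key);
    for(;;)
    {
      std::vector<bucket> *b = buckets.load(std::memory_order_acquire);
      bucket &bk = (*b)[h % b->size()];
      // Real code spin-locks bk.lock here; if it reads 2 ("reload"), it
      // loops back and reloads the bucket array pointer instead of waiting.
      if(bk.lock.load(std::memory_order_acquire) == 2) continue;
      for(item &i : bk.items)                       // linear search
        if(i.hash == h && i.p->first == key)        // hash first, then key
          return &i.p->second;
      return nullptr;
    }
  }
};
```

Comparing the cached hashes before the keys means the common miss case never touches the value_type allocation at all.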

Rehashing works by firstly locking a special rehash spinlock - rehashing must be serialised. It then locks all buckets in the existing array, halting all concurrency. It creates a new bucket array of the new size and locks all its buckets too. It then atomically exchanges the pointer to the array using memory_order_acq_rel (probably a pure release would be enough, but rehashing isn't supposed to be fast), marks all the locks in the old array as reload-bucket-list, and then copies every item in the old bucket array into the new array. If an exception throws, the old array can be instantly restored; if one doesn't, the old array is reset to minimum storage and added to a ring buffer of old bucket arrays, keeping it around for a while so it can tell its threads to reload the bucket array. And thus we keep locking entirely away from normal insert/erase/find, and performance rises linearly with CPU cores, which is outstanding. This, by the way, is with safe erase, which none of the split ordered list concurrent hash tables can do (they can concurrently find, insert and rehash, but they cannot erase, which rather ruins their usefulness).
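The pointer-swap step of that rehash might be sketched like so (names hypothetical; this assumes every bucket in both arrays is already locked, as described above):

```cpp
#include <atomic>
#include <vector>

struct bucket
{
  std::atomic<unsigned char> lock{0};  // 2 = reload the bucket list
  // ... items elided ...
};

// Sketch of the rehash pointer swap described above. Returns the retired
// array so the caller can copy its items across and keep it in the ring
// buffer of old bucket arrays.
std::vector<bucket> *rehash_swap(std::atomic<std::vector<bucket>*> &buckets,
                                 std::vector<bucket> *newarray)
{
  // Publish the new array; acq_rel as in the text (release alone would
  // likely suffice, but rehash is not the fast path).
  std::vector<bucket> *oldarray =
      buckets.exchange(newarray, std::memory_order_acq_rel);
  // Tell any thread still spinning on an old bucket's lock to go and
  // reload the bucket array pointer rather than keep waiting.
  for(bucket &b : *oldarray)
    b.lock.store(2, std::memory_order_release);
  return oldarray;
}
```

Because threads spinning on a retired bucket see state 2 and re-fetch the array pointer, no find/insert/erase ever blocks on the rehash beyond that one reload loop.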

So far, so good. Much of the 15Mb of warnings from ThreadSanitiser is a race between the construction of a bucket_type (a store) and the point where its count member is examined for zero-ness (a load). Here is bucket_type:

      struct bucket_type_impl
      {
        spinlock<unsigned char> lock;  // = 2 if you need to reload the bucket list
        atomic<unsigned> count;        // number of items in use
        std::vector<item_type, item_type_allocator> items;
        bucket_type_impl() : count(0), items(0) {  }
        ...

What could it be possibly upset about I wondered?

Well, it turns out it is upset because the std::atomic constructor is not atomic - yes, you read that right: if you initialise via the constructor, that constructing store is not guaranteed visible to other threads using that atomic. Yes, I was very surprised too - you would think a release would be the default :( - but no, and this is where ThreadSanitiser is correct. Strictly speaking, you must manually do a release store before an atomic can be used safely. If there ever was a defect in the standard, this is surely one of them; it is highly counter-intuitive.

And by adding a release store to the constructor, voila - proposed boost::concurrent_unordered_map is now race free.
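Reduced to its essentials, the fix looks something like this (a sketch, not the actual boost code - the other members are elided):

```cpp
#include <atomic>

// Sketch of the fix described above: std::atomic's constructor performs
// non-atomic initialisation, so follow construction with an explicit
// release store before other threads can see the object.
struct bucket_type_impl
{
  std::atomic<unsigned> count;
  bucket_type_impl()
  {
    // Not count(0) in the initialiser list: that initialisation is not an
    // atomic store, and ThreadSanitiser flags it as racing with concurrent
    // loads of count. An explicit release store publishes the zero.
    count.store(0, std::memory_order_release);
  }
};
```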

Thank you ThreadSanitiser! :)


Contact the webmaster: Niall Douglas @ webmaster2<at symbol>nedprod.com (Last updated: 2014-08-31 23:44:07 +0000 UTC)