ned Productions - Wednesday 30th July 2014

Wednesday 30th July 2014: 11.08pm. Link shared: http://boost.2283326.n4.nabble.com/Request-for-feedback-on-design-of-concurrent-unordered-map-plus-notes-on-use-of-memory-transactions-td4665594.html

Realised earlier tonight that the design for a concurrent_unordered_map which I posted to http://boost.2283326.n4.nabble.com/Request-for-feedback-on-design-of-concurrent-unordered-map-plus-notes-on-use-of-memory-transactions-td4665594.html is flawed in two ways: (i) references could invalidate during inserts if the load factor was high, which makes concurrent use hard (ii) I couldn't make rehashing meet the Abrahams exception guarantees. Which is probably why unordered_map always uses malloc to store objects, and which drove a stake through my own attempt to avoid malloc :(

So, restoring malloc in there has had obvious severe consequences to performance. On VS2013, single threaded comparing a spin locked unordered_map to my concurrent_unordered_map:

=== Large unordered_map spinlock write performance ===
1. Achieved 10937284.353334 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 34137236.511912 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 9069097.564201 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 20303962.021199 transactions per second

For GCC 4.9 (note this is a different, much more powerful machine):

=== Large unordered_map spinlock write performance ===
1. Achieved 24429803.739067 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 67016665.221787 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 27521011.025952 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 58011088.216197 transactions per second

My concurrent_unordered_map is now -17% for writes and -40% for reads on MSVC, and +13% for writes and -13% for reads on GCC 4.9. A lot of the disparity is that I suspect that MSVC's optimiser doesn't really cope with acquire/release atomic compare-exchange, he basically just always calls memory_order_seq_cst as the acquire/release implements just aliases to that on x64, and that forces a full fence on the optimiser. Put it another way, using memory_order_seq_cst for the cmpxchg makes no difference to MSVC but an enormous difference to GCC and GCC completely rewrites the assembler output. But I digress.

There has also been substantial loss of concurrent performance unfortunately :(. This is for 4 hyperthreads on VS2013:

=== Large unordered_map spinlock write performance ===
1. Achieved 10471187.171366 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 14316081.976936 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 4 threads in this CPU
1. Achieved 19220198.321859 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 4 threads in this CPU
1. Achieved 42207529.380585 transactions per second

And 8 hyperthreads for GCC 4.9:

=== Large unordered_map spinlock write performance ===
1. Achieved 21901097.405844 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 41994076.017479 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 74325527.035646 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 204756178.377553 transactions per second

So, my concurrency is reduced to the following multiples of single threaded performance:

MSVC writes: 2.1x (dual core machine)
MSVC reads: 2.1x (dual core machine)
GCC writes: 2.7x (quad core machine)
GCC reads: 3.5x (quad core machine)

I suspect the design doesn't scale out for writes particularly well - because we no longer store the mapped type in the per-bucket table, we are now squeezing four items in per cache line, so when buckets collide (the test intentionally collides a lot of buckets) even once past the spinlock you're going to generate a ton of cache invalidation traffic, probably effectively a line per insert/erase for the table and another for the bucket lock. I don't think that can hugely be escaped unfortunately, though I suppose I could erase by searching backwards and insert by searching forwards, it might help for high load factors.

Lastly, someone might wonder how does mine compare to a classic split ordered list concurrent_unordered_map like those from Intel Thread Building Blocks or Microsoft's Parallel Patterns Library? Well, the PPL on VS2013 gets about 29% more performance than mine for concurrent reads, but split ordered list designs can't do concurrent erase(), so I guess that's the tradeoff.

#boostcpp #c++

Go to previous entry

Go to next entry

Go back to the archive index

Go back to the latest entries

Niall’s virtual diary archives – Wednesday 30th July 2014