[webkit-dev] PSA: you should use WTF::Lock and WTF::Condition instead of WTF::SpinLock, WTF::Mutex, WTF::ThreadCondition, std::mutex, std::condition_variable, or std::condition_variable_any

Wed Aug 19 13:08:38 PDT 2015

Hi everyone,

Over the past two weeks or so I’ve been doing some work to improve the locking primitives that we use in WebKit.  These new primitives have landed, and they are simply called Lock and Condition.  You should use Lock instead of SpinLock, Mutex, or std::mutex, because it combines the best qualities of those other locks into one simple implementation.  You should use Condition instead of WTF::ThreadCondition, std::condition_variable, or std::condition_variable_any because Condition is always smaller and faster.  Also, Condition has a richer API for timed wait, which allows you to use either std::chrono time or the double-based time from <wtf/CurrentTime.h>.  It can even use our notion of monotonic time.

Prior to this change, we had to choose between WTF::SpinLock, which was fast, and WTF::Mutex/std::mutex, which didn’t waste CPU cycles during contention.  After this change, we no longer have to make such choices: the new Lock is fast and doesn’t waste CPU cycles.

I’ve already landed changes to all of WebKit that replace all uses of those other locking primitives with Lock/Condition.

The specific benefits of these new primitives are:

- Lock and Condition take 1 byte of space.  That’s the total space that they will ever consume.  Contention is handled using thread local data structures managed by WTF::ParkingLot and that space usage is O(number of threads).  By comparison, WTF::Mutex and std::mutex take 64 bytes on Darwin.  They are usually take a lot of space on all OSes.  WTF::SpinLock takes 4 bytes.  So, Lock is 4x more compact than SpinLock and 64x more compact than std::mutex/WTF::Mutex.

- Lock is “adaptive”: it will not waste CPU time or power when a lock is held for a long time.  SpinLock will peg a CPU at 100% while it’s trying to acquire a lock, which makes SpinLock unsuitable for any critical section that may be held for a while.  This means never using SpinLock in critical sections that do I/O or that may block on other locks.  Lock doesn’t have this problem; like std::mutex and WTF::Mutex, it has the ability to block threads indefinitely while waking them up as soon as the lock is available again.  WTF::Lock behaves like a spinlock so long as no thread waits for too long (currently, too long = 40 spins), and otherwise turns into a queue-based lock with barging (a hot thread that calls lock() just as another thread is dequeued may barge in, which increases throughput and somewhat avoids the convoy problem) and random fairness (in the long term, every contending thread has an equal shot at getting the lock) and thundering herd avoidance (unlock could cause one thread to try to contend for the lock, but it won’t ever wake up threads only to have them go back to sleep).

- Lock is very fast in the uncontended case.  Locking and unlocking fast paths are just inline CAS instructions (or LL/SC sequences on ARM and friends).  This makes Lock about 3x faster than some system mutexes, and within 2x of a spin lock.  For Lock performance details, see http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog <http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog>, http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog <http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog>, and http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog <http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog>.

- Lock is very fast in the case of microcontention.  Microcontention is when you have multiple threads all piling up on a lock that is held for a short time.  You may remember that this case is important in WebKit, in particular in our parallel GC: https://trac.webkit.org/changeset/117478 <https://trac.webkit.org/changeset/117478>, https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/ <https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/>.  Well, Lock takes all of the good ideas from the WTF::SpinLock, which was our previous best answer for microcontention.  Lock is up to 100x faster than WTF::Mutex and std::mutex in case of microcontention on some OSes.

- Condition::notifyOne()/notifyAll() are just a fast inlined load-and-branch in the case that nobody is waiting on the condition.  This feature, combined with the fact that Condition can be used with Lock, and the fact that Lock is very fast to acquire and release, means that producer/consumer scenarios can run up to 58x faster with Lock/Condition than with Mutex/ThreadCondition or std::mutex/std::condition_variable.  For Condition performance details, see http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog <http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog> and http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog <http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog>.

The algorithms used for Lock and Condition have novelty in them, but the overall heuristics are consistent with what I take to be existing best practices as documented in other places.  The way that Lock handles contention is most similar to how Jikes RVM did it.  That algorithm is best documented in http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf <http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf> (though ignore all of the stuff about biasing and inflation of “thin” locks - WTF::Lock doesn’t do any of that).  That basic algorithm had been around for a while and took in many influences (see that paper for citations to other relevant stuff).

FWIW, in parallel and concurrent code in JSC, switching over to Lock/Condition from the other primitives gave us speed-ups on SunSpider and Kraken, as well as speed-ups on some Octane subtests.  Both parallel GC and concurrent/parallel JIT appeared to be happy with this change.

Another nice outcome of this work is that the underlying queueing mechanism used by both Lock and Condition is available for implementing other locking protocols.  It’s called ParkingLot, because its job is to allow you to “park” threads that shouldn’t run until someone else “unparks” them.  Any valid memory address can be used as a handle for a ParkingLot queue.  ParkingLot can be used to implement any locking protocol that uses futexes (https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf <https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf>) - just use “parkConditionally” whenever you would have used “FUTEX_WAIT” and use “unparkOne” when you would have used “FUTEX_WAKE”, but it can also do a lot more than that, since it allows you to pass std::function callbacks that run while ParkingLot holds its internal queue lock.  This is what enables Lock to avoid the thundering herd and enables Condition to have a lock-free notify fast path when nobody is waiting.  In theory, you can use ParkingLot to implement completely fair locks, counting semaphores, read-write locks, and other things.  It’ll also come in handy for implementing new locking primitives for things like WebAssembly.

TL;DR.  Don’t use WTF::SpinLock, WTF::Mutex, std::mutex, WTF::ThreadCondition, std::condition_variable, or std::condition_variable_any.  They waste CPU time and they waste memory.  Use WTF::Lock and WTF::Condition instead.

-Filip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.webkit.org/pipermail/webkit-dev/attachments/20150819/4c5c753b/attachment.html>