[webkit-dev] PSA: you should use WTF::Lock and WTF::Condition instead of WTF::SpinLock, WTF::Mutex, WTF::ThreadCondition, std::mutex, std::condition_variable, or std::condition_variable_any

Fri Aug 21 21:54:45 PDT 2015

Hi Filip,

very interesting to see you change out such a fundamental algorithm! It's
amazing that it works.
Reading the code, it seems very comparable to Windows' critical sections
[1]. Is that the case?
Looking at github and msdn, 4000 seems to be the most common number of
spins.

I believe your code would also useful be outside of WebKit. Have you given
any thought on breaking it off and making it part of the operating system?

1:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms683476(v=vs.85).aspx

On Wed, Aug 19, 2015 at 1:08 PM, Filip Pizlo <fpizlo at apple.com> wrote:

> Hi everyone,
>
> Over the past two weeks or so I’ve been doing some work to improve the
> locking primitives that we use in WebKit.  These new primitives have
> landed, and they are simply called Lock and Condition.  You should use Lock
> instead of SpinLock, Mutex, or std::mutex, because it combines the best
> qualities of those other locks into one simple implementation.  You should
> use Condition instead of WTF::ThreadCondition, std::condition_variable, or
> std::condition_variable_any because Condition is always smaller and
> faster.  Also, Condition has a richer API for timed wait, which allows you
> to use either std::chrono time or the double-based time from
> <wtf/CurrentTime.h>.  It can even use our notion of monotonic time.
>
> Prior to this change, we had to choose between WTF::SpinLock, which was
> fast, and WTF::Mutex/std::mutex, which didn’t waste CPU cycles during
> contention.  After this change, we no longer have to make such choices: the
> new Lock is fast and doesn’t waste CPU cycles.
>
> I’ve already landed changes to all of WebKit that replace all uses of
> those other locking primitives with Lock/Condition.
>
> The specific benefits of these new primitives are:
>
> - Lock and Condition take 1 byte of space.  That’s the total space that
> they will ever consume.  Contention is handled using thread local data
> structures managed by WTF::ParkingLot and that space usage is O(number of
> threads).  By comparison, WTF::Mutex and std::mutex take 64 bytes on
> Darwin.  They are usually take a lot of space on all OSes.  WTF::SpinLock
> takes 4 bytes.  So, Lock is 4x more compact than SpinLock and 64x more
> compact than std::mutex/WTF::Mutex.
>
> - Lock is “adaptive”: it will not waste CPU time or power when a lock is
> held for a long time.  SpinLock will peg a CPU at 100% while it’s trying to
> acquire a lock, which makes SpinLock unsuitable for any critical section
> that may be held for a while.  This means never using SpinLock in critical
> sections that do I/O or that may block on other locks.  Lock doesn’t have
> this problem; like std::mutex and WTF::Mutex, it has the ability to block
> threads indefinitely while waking them up as soon as the lock is available
> again.  WTF::Lock behaves like a spinlock so long as no thread waits for
> too long (currently, too long = 40 spins), and otherwise turns into a
> queue-based lock with barging (a hot thread that calls lock() just as
> another thread is dequeued may barge in, which increases throughput and
> somewhat avoids the convoy problem) and random fairness (in the long term,
> every contending thread has an equal shot at getting the lock) and
> thundering herd avoidance (unlock could cause one thread to try to contend
> for the lock, but it won’t ever wake up threads only to have them go back
> to sleep).
>
> - Lock is very fast in the uncontended case.  Locking and unlocking fast
> paths are just inline CAS instructions (or LL/SC sequences on ARM and
> friends).  This makes Lock about 3x faster than some system mutexes, and
> within 2x of a spin lock.  For Lock performance details, see
> http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog,
> http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog, and
> http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog.
>
> - Lock is very fast in the case of microcontention.  Microcontention is
> when you have multiple threads all piling up on a lock that is held for a
> short time.  You may remember that this case is important in WebKit, in
> particular in our parallel GC: https://trac.webkit.org/changeset/117478,
> https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/.  Well, Lock
> takes all of the good ideas from the WTF::SpinLock, which was our previous
> best answer for microcontention.  Lock is up to 100x faster than WTF::Mutex
> and std::mutex in case of microcontention on some OSes.
>
> - Condition::notifyOne()/notifyAll() are just a fast inlined
> load-and-branch in the case that nobody is waiting on the condition.  This
> feature, combined with the fact that Condition can be used with Lock, and
> the fact that Lock is very fast to acquire and release, means that
> producer/consumer scenarios can run up to 58x faster with Lock/Condition
> than with Mutex/ThreadCondition or std::mutex/std::condition_variable.  For
> Condition performance details, see
> http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog and
> http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog.
>
> The algorithms used for Lock and Condition have novelty in them, but the
> overall heuristics are consistent with what I take to be existing best
> practices as documented in other places.  The way that Lock handles
> contention is most similar to how Jikes RVM did it.  That algorithm is best
> documented in http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf (though
> ignore all of the stuff about biasing and inflation of “thin” locks -
> WTF::Lock doesn’t do any of that).  That basic algorithm had been around
> for a while and took in many influences (see that paper for citations to
> other relevant stuff).
>
> FWIW, in parallel and concurrent code in JSC, switching over to
> Lock/Condition from the other primitives gave us speed-ups on SunSpider and
> Kraken, as well as speed-ups on some Octane subtests.  Both parallel GC and
> concurrent/parallel JIT appeared to be happy with this change.
>
> Another nice outcome of this work is that the underlying queueing
> mechanism used by both Lock and Condition is available for implementing
> other locking protocols.  It’s called ParkingLot, because its job is to
> allow you to “park” threads that shouldn’t run until someone else “unparks”
> them.  Any valid memory address can be used as a handle for a ParkingLot
> queue.  ParkingLot can be used to implement any locking protocol that uses
> futexes (https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf) -
> just use “parkConditionally” whenever you would have used “FUTEX_WAIT” and
> use “unparkOne” when you would have used “FUTEX_WAKE”, but it can also do a
> lot more than that, since it allows you to pass std::function callbacks
> that run while ParkingLot holds its internal queue lock.  This is what
> enables Lock to avoid the thundering herd and enables Condition to have a
> lock-free notify fast path when nobody is waiting.  In theory, you can use
> ParkingLot to implement completely fair locks, counting semaphores,
> read-write locks, and other things.  It’ll also come in handy for
> implementing new locking primitives for things like WebAssembly.
>
> TL;DR.  Don’t use WTF::SpinLock, WTF::Mutex, std::mutex,
> WTF::ThreadCondition, std::condition_variable, or
> std::condition_variable_any.  They waste CPU time and they waste memory.
> Use WTF::Lock and WTF::Condition instead.
>
> -Filip
>
>
>
> _______________________________________________
> webkit-dev mailing list
> webkit-dev at lists.webkit.org
> https://lists.webkit.org/mailman/listinfo/webkit-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.webkit.org/pipermail/webkit-dev/attachments/20150821/2882de34/attachment.html>