[webkit-dev] WTF::fastMalloc

Tue Oct 1 16:16:07 PDT 2013

On Oct 1, 2013, at 3:47 PM, Geoffrey Garen <ggaren at apple.com> wrote:

>>> To access thread-specific data using pthreads, you first need to take a lock and call pthread_key_create(). Since the whole point of thread-specific data is to avoid taking a lock, the API is useless.
>> 
>> The normal way to do it is to use pthread_once to create the key, which does not in general take a lock. (That or use an out-of-band prior initializer, but that wouldn't work for malloc).
> 
> Most implementations of pthread_once use a spinlock, or some moral equivalent. Fundamentally, there’s no memory-safe way to implement concurrent one-time execution of arbitrary side effects without a spinlock.

This implementation from the Linux C library will only ever take a lock in the rare case where initialization has not already been performed, as far as I can tell:
http://searchcode.com/codesearch/view/18325089

Assuming my reading is correct, it only ever hits the slow path if initialization has not been performed yet, and multiple threads attempt to do it at once, which happens at most once early in startup.
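
To illustrate the shape I mean, here is a minimal sketch of one-time initialization with a lock-free fast path. This is not the glibc code; OnceFlag and callOnce are hypothetical names, and I use a plain mutex on the slow path for simplicity where a real implementation might use a futex or a spinlock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical illustration only; not taken from any C library.
struct OnceFlag {
    std::atomic<bool> done{false};
    std::mutex lock;
};

inline void callOnce(OnceFlag& flag, void (*init)())
{
    assert(init);

    // Fast path: a single acquire load and no lock once init has completed.
    if (flag.done.load(std::memory_order_acquire))
        return;

    // Slow path: reached only while initialization has not finished yet.
    std::lock_guard<std::mutex> guard(flag.lock);

    // Re-check under the lock; the mutex orders this load against the
    // store made by whichever thread ran init().
    if (!flag.done.load(std::memory_order_relaxed)) {
        init();
        // Release store publishes init()'s side effects to fast-path readers.
        flag.done.store(true, std::memory_order_release);
    }
}
```

After the first successful call, every later call costs one load and a branch, plus the function call itself if it is not inlined.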

As far as I know, the only significant cost in practice to using pthread_once + pthread_getspecific instead of pthread_getspecific_direct is function call overhead. That is my recollection from when we switched on Mac.
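
For concreteness, the pthread_once + pthread_getspecific pattern looks roughly like the sketch below. ThreadCache, cacheKey, and currentThreadCache are hypothetical names, not FastMalloc's. I also use new/delete to keep it short, which an actual malloc implementation obviously could not do for its own per-thread state.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical per-thread state; a stand-in for an allocator's thread cache.
struct ThreadCache {
    int allocations = 0;
};

static pthread_once_t cacheKeyOnce = PTHREAD_ONCE_INIT;
static pthread_key_t cacheKey;

// Runs at thread exit for any thread that stored a non-null value.
static void destroyThreadCache(void* p)
{
    delete static_cast<ThreadCache*>(p);
}

// Runs exactly once per process, on the first call to currentThreadCache().
static void createCacheKey()
{
    int result = pthread_key_create(&cacheKey, destroyThreadCache);
    assert(!result);
    (void)result; // Avoid an unused-variable warning when NDEBUG is defined.
}

ThreadCache* currentThreadCache()
{
    // Cheap after the first call, assuming pthread_once has a fast path.
    pthread_once(&cacheKeyOnce, createCacheKey);

    void* p = pthread_getspecific(cacheKey);
    if (!p) {
        // First access on this thread: create and register its cache.
        p = new ThreadCache;
        pthread_setspecific(cacheKey, p);
    }
    return static_cast<ThreadCache*>(p);
}
```

Each lookup is then two function calls (pthread_once and pthread_getspecific) plus a null check, which is the overhead I am referring to.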

> 
> That’s why requiring concurrent one-time execution of arbitrary side effects in order to access thread-specific memory is broken API.

It's definitely lame, but we have existence proofs that you can still be a lot faster than popular system malloc implementations without solving this problem (namely FastMalloc on Linux platforms today, and FastMalloc as initially deployed on Mac before we adopted pthread_getspecific). Does the new malloc implementation access thread-specific data much more frequently?

>> C++11 also introduces the thread_local keyword which is likely more readily optimizable than function-call-based APIs where supported.
> 
> thread_local might be a reasonable option, if a platform achieves all the other requirements for fast malloc. It’s still too slow, but at least it isn’t slow by definition, and it doesn’t pollute the rest of the code too badly.

Maybe it would be easier to understand what the issue is by looking at the code.

From this and your other posts, it sounds like there might be an issue of code pollution/complexity and not just prospective performance.

Regards,
Maciej