[webkit-dev] Hashing pairs in WTF

Fri Sep 7 10:32:47 PDT 2012

Hello WebKittens,

I noticed the other day that our hashing method in wtf/HashFunctions.h
for hashing of pairs can be improved.

We have a hash method (WTF::intHash) for a single integer that takes
32 bits as input and produces a key of 32 bits. This works well enough
for hashing an integer, as it is a reversible function.

To hash a pair (A, B) of 32 bits, we hash each of A and B with the
above method, to produce (A', B'). Then concatenate the two and hash
the result to produce a 64 bit hash code. This whole process is
reversible. However, as a final step we truncate the hash code to 32
bits, which destroys the properties of the hashing method. The result
is a hash function that collides more than it should for typical
inputs.

I am proposing a patch [1] to improve the hash method for keys
composed of pairs. The new method [2][3][4] is both almost-universal
and efficient for machines to compute.

To demonstrate the difference I inserted pairs of integers into a
WTF::HashMap. A good hash method will populate the HashMap more
densely. Whereas a bad hash method does not and will cause the HashMap
to require more space and more calls to realloc(), which is a costly
function and will be slower in the end.

In testing, the new hash method reduced the time to insert integer
pair keys into a WTF::HashMap by between 23% and 49% for different
sized input sets. For example, with various input set sizes, the time
to insert into the WTF::HashMap, across multiple runs, was reduced by:

8192 elements: 32% 31% 30% 34% 34% 33%
4096 elements: 47% 48% 48% 47% 47% 49%
1024 elements: 43% 43% 39% 42% 35%
256 elements: 25% 39% 33% 41% 23%

Lookup times into an empty HashMap were also improved, incidentally.
For example, with the old intHash method, making 8192 queries requires
approximately 129ms on my machine while the new method takes less than
1ms. All testing was done on a 64-bit x86 machine. The tests I ran are
attached to the bug [1] for those who may be interested in repeating
the experiment [5].

I want to repeat that the reason the new hash method is faster is
because of better hashing behaviour - ie, fewer collisions and a
denser resulting hash table. I am not comparing the time to do the
hash() itself.

Any feedback would be welcome on the bug!

Thanks,
Dana

[1] https://bugs.webkit.org/show_bug.cgi?id=96022
[2] http://opendatastructures.org/versions/edition-0.1d/ods-java/node33.html#SECTION00832000000000000000
[3] http://pages.cpsc.ucalgary.ca/~woelfel/paper/efficient_strongly/efficient_strongly.pdf
[4] http://pages.cpsc.ucalgary.ca/~woelfel/paper/diss/diss.pdf
[5] https://bugs.webkit.org/attachment.cgi?id=162804