Review Request CR#7118743 : Alternative Hashing for String with Hash-based Maps

Tue May 29 21:25:25 UTC 2012

On May 28 2012, at 12:34 , Doug Lea wrote:

> On 05/25/12 21:43, Ulf Zibis wrote:
>> To me it seems, that computing the murmur hash is more expensive, especially on
>> short strings, than the old hash algorithm.
> 
> It is definitely slower, but also definitely has better statistical
> properties (fewer collisions when used in hash tables).
> 
> In its original (C) context, Murmur hash is often about as fast as other
> string hashes, because it operates on 4byte words rather than 1-byte
> elements at a time. So even though the per-element cost is much higher, it
> takes 1/4 the steps. When done on Java char[]'s though, it can only
> process 2 chars at a time. (Plus, we cannot necessarily read 32bitsat a
> time because of possible byteswapping.) So it is a struggle to make
> it only twice as slow.

An intrinsic and/or native version might be added later if it turns out that the Java murmur3 implementation is a performance bottleneck. So far it hasn't been a particular bottleneck. Implementations are also free to change to a different algorithm for performance, security or scalability reasons as they see fit.

> This means that any application that hashes strings only once will
> be slower than using (old) hashCode. but any application that
> uses the (cached) hashes more than a few times will tend to run
> faster than old hashCode version because of higher quality hash codes.
> A few tests so far confirm this.

This was my result as well. Caching makes most of the differences between String.hashCode (legacy algorithm) and String.hash32 (the murmur3 implementation) disappear. I actually tested with a version of hashCode that recalculated the result 2X, 5X and 10X times in a loop to look at the impact of caching. The conclusion was that even a 5X more expensive hashing algorithm was bearable as long as caching was present. There are still many cases where caching can't be present, such as parsing, where all of the items being hashed are new objects with no previously cached hash value. These cases require that we provide a reasonably performing hash algorithm. Murmur3 is slower than the legacy algorithm but not enough slower to cripple overall usage and it's better distribution really shines on larger tables.

>> So I think, a developer should have an influence on the hash algorithm to be
>> used by a hash map,
> 
> I've had to make a lot of internal adjustments in hash tables to
> counteract crummy user hashCode() algorithms over the years,
> so I think that the less control, the better :-)

I am in agreement with Doug. I've seen checked in code with "return 0;" as the hashCode implementation for classes used as hash keys. One example that clearly didn't understand hashes had a different random number as the hashCode result for each class.... 

Mike