Review Request CR#7118743 : Alternative Hashing for String with Hash-based Maps
Ulf Zibis
Ulf.Zibis at gmx.de
Wed May 30 14:05:22 UTC 2012
Am 29.05.2012 23:25, schrieb Mike Duigou:
> On May 28 2012, at 12:34 , Doug Lea wrote:
>> On 05/25/12 21:43, Ulf Zibis wrote:
>>> To me it seems, that computing the murmur hash is more expensive, especially on
>>> short strings, than the old hash algorithm.
>> It is definitely slower, but also definitely has better statistical
>> properties (fewer collisions when used in hash tables).
Thanks for your confirmation.
> Implementations are also free to change to a different algorithm for performance, security or scalability reasons as they see fit.
How should that work, if HashMap.hash() is not overrideable?
>> This means that any application that hashes strings only once will
>> be slower than using (old) hashCode. but any application that
>> uses the (cached) hashes more than a few times will tend to run
>> faster than old hashCode version because of higher quality hash codes.
>> A few tests so far confirm this.
> This was my result as well. Caching makes most of the differences between String.hashCode (legacy algorithm) and String.hash32 (the murmur3 implementation) disappear. I actually tested with a version of hashCode that recalculated the result 2X, 5X and 10X times in a loop to look at the impact of caching. The conclusion was that even a 5X more expensive hashing algorithm was bearable as long as caching was present. There are still many cases where caching can't be present, such as parsing, where all of the items being hashed are new objects with no previously cached hash value. These cases require that we provide a reasonably performing hash algorithm. Murmur3 is slower than the legacy algorithm but not enough slower to cripple overall usage and it's better distribution really shines on larger tables.
So applications, which mostly process new strings from I/O, must live with the slower performance in
future?
Especially, if the developer knows about the nature of the expected strings, there could be a
significant performance gain from customizing the hash algorithm, see example:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6812862
Please additionally note, that my more general "overkill" approach has less instructions to execute
for calculating the hash32 and does not need the instanceof processing, which can be potentially
slow on some VM's.
May be someone knows about a better code architecture than my given approach to allow custom hash
algorithms, let us know.
In case of often using the same strings, developer should intern them for benefitting from the
already calculated VM-internal hash. BTW, wouldn't it make sense, to use the Murmur hash also there?
-Ulf
More information about the core-libs-dev
mailing list