Review Request CR#7118743 : Alternative Hashing for String with Hash-based Maps

Wed May 30 13:49:08 UTC 2012

On 05/29/12 17:25, Mike Duigou wrote:
> An intrinsic and/or native version might be added later if it turns out that
> the Java murmur3 implementation is a performance bottleneck.

I don't think that going native will make much difference.  The
main disadvantages of using Murmur3 for char[]s are that it takes
twice as long as for byte[]s (for which it was developed), and that
its statistical collision properties when using chars in which the top
byte is usually 0 (as is the case) are probably weaker than otherwise.

Backing up: There's a history of obsession with these sorts of details
in JDK hash table classes. Here's some background.

Many small-looking factors can make a big difference in
application-level performance, and even more so in some
"important" but poorly conceived benchmarks (most notably,
the previous version of specJBB). Some of this is due to
to the fact that some user and JDK hashCode algorithms
(e.g. for class Float) are so poorly distributed that some
intervention is required inside the hash table classes to avoid
horrible performance.  The added wrinkle in the current efforts is to
also avoid horrible performance when Strings are based on external
inputs that may be crafted to cause collisions. The use of a random
seed for String hashing addresses this pretty well. But since this
cannot (per the JLS) be done for all uses of String.hashCode(), it
opens up further exploration of exactly which random-seeded algorithm
to use.

This is similar to previous efforts that took the hashCode()
as a given, and then looked at:
   cost(convertHashToIndexAndLoadBin) +
       sum(i=0, inf, cost(compareElement) * prob >= i collisions))

Here, the main question is how much effort to expend to convert poorly
distributed hash codes via techniques that spread bits across a
32bit hash, resolved to an index via a power-of-two mask.
Bit-spreading (via those shifts, xors, etc you see in hash classes)
gives fewer collisions, but with diminishing returns with more effort,
in part because this code occurs at a choke-point: Loading the bin
(and key) at the index may entail a cache miss, stalling the
processor, so the code leading up to this point is on a critical
path.

You can model some of this analytically, as deviations from ideal
Poisson distribution, where the probability of at least one collision
is roughly 1 / (8 * N), where N is number of elements, under default
resizing policies and pure random keys. But models don't give you any
answer beyond illustrating that the tradeoff between mixing cost and
collision rates generally favors cheaper mixing. Beyond that, the best
you can do is empirically evaluate particular choices against a wide
range of key types and use cases, (There are other non-analytic
concerns here as well. For example, the current bit-spreading in class
HashMap additionally has the property of bounding cache miss rates
during traversal when used for adjacent consecutive integer keys. Yes,
this is a weird consideration but was important to some people at the
time.)

The new String efforts differ in that the hashCode function is up for
grabs, and additionally is cacheable. So the costs have both one-time
initial hash computation, and recurring (for, say, "U" usages of a
key) components. But because we are in control of initial hash,
we can ensure that we don't need further bit-spreading. so,
for U usages, the cost is:

cost(initialHash(key)) +
   U * sum(i=0, inf, cost(compareElement) * prob >= i collisions))

Back-of-envelope analytic modeling shows that the tradeoffs between
initial cost and collision rate depend on both U (usages per key) and
N (number of elements). There are too many unknown constants to make
any absolute conclusions about choice of algorithm.  But plugging in
a few sample values shows that the break even point for increasing
initial hash cost vs better collision rate is further out than you
might expect. For example, guestimating that the murmur code in
current patch is twice as expensive as the "DJB" algorithm in String
hashCode (which I think is a lower bound) but gives 10% lower collision
probability (which I think is  upper bound), you'd only choose
it if U > c * N, where "c" absorbs a lot of slop but is likely
greater than 1. Which indicates that this is probably not the most
common use case.  (But it is a common case in most benchmarks, that
reuse keys a lot to get better time estimates.)

So it does seem worth exploring algorithms with different
speed/quality tradeoffs. For example, some similar guestimation (and a
bit of empirical confirmation) suggests that a good choice might be a
random-seeded version of the well-established Jenkins "one-time"
algorithm (http://en.wikipedia.org/wiki/Jenkins_hash_function), which
is only a little more expensive as DJB but doesn't have as good
collision rates as Murmur3. (Although I'm not at all sure it is
not as good because of the top-byte-usually-zero issue
for Murmur3 on char[]).

In fact, it is not clear to me at the moment whether just using
random-seeded version of current DJB would be about as good as any
other choice. If that were the case, some of the upcoming changes
could be further simplified. And could be vastly simplified
if it were OK to have a VM switch -UseRandomStringHashSeed.
If so, the main impact would be to change one line in String.hashCode.
I'm sure the JLS purists out there would not be happy about this though.

(BTW, see some related discussions on StackExchange:
http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/)

There are other, completely different, approaches to dealing with bad
user hash codes, such as the use of separate data structures to handle
colliding codes. We've rejected them in the past, but I'm trying some
out in the in-progress overhaul of ConcurrentHashMap.

Aside: I discovered while investigating some of this that explicitly
omitting the bit-spreading step for String keys in hash table classes
under any of these algorithms works well if you are going to
explicitly do a (key instanceof String) check anyway.  (This
partially accounts for results I noted in previous mail about
impact of Murmur3.) While it doesn't hurt to further bit-spread
any of these String hash codes, it doesn't help, and since the
test is there anyway, it is better to also use it to skip spread step.
It may be is worth adding this check even if not otherwise
needed because it helps compilers resolve the hashCode call along
this path, and Strings are the most common kinds of keys anyway.

-Doug