A library for implementing equals and hashCode

John Rose john.r.rose at oracle.com
Thu Apr 25 00:41:49 UTC 2019


On Apr 24, 2019, at 4:45 PM, Kevin Bourrillion <kevinb at google.com> wrote:
> 
> On Wed, Apr 24, 2019 at 4:31 PM Stuart Marks <stuart.marks at oracle.com> wrote:
> 
>> However, I can see that 
>> some applications might want greater control over the actual hash function, so 
>> that capability probably will need to be provided through the API somehow.

> I would disagree with this strongly. Use cases that require a quality hash code really need to use a proper hashing library like (bias alert) Guava's. It can produce quality results because its API is designed for that.

Kevin, are you saying that people who are concerned about
hash quality should just stay away from Object.hashCode?

Or are you saying that people who want greater control
over actual hash functions should use an external
library like Guava?  (I'm guessing the former, but not
sure here.  I see the "strongly" but I can't process it
without some more clarity.)

> Object.hashCode() doesn't need to be a quality hash code! It only needs to be good enough for in-memory hash tables, which is about as low as one's standards can possibly get. One should certainly avoid easily constructable collisions... and then that's about all there is to worry about. And, you can't reasonably get much better than that anyway, because you will continually find yourself depending on Integer.hashCode() and String.hashCode() etc., which will never get better than basic.

This paragraph proves that Object.hashCode is pretty
poor except for some limited use cases.  It's actually
worse than you claim:  hashCode was designed when
everybody ran Java on systems with less than a gigabyte
of memory.  Nowadays you can build in-memory structures
which are *guaranteed* to experience collisions because
they contain more than 2^32 keys.
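
To make the arithmetic concrete, here is a tiny illustration of my own
(not from the original discussion), using only the JDK: hashCode() returns
a 32-bit int, so collisions can be written down by hand, and a structure
with more than 2^32 distinct keys cannot avoid them.

    // Illustration only.
    public class Pigeonhole {
        public static void main(String[] args) {
            // Long.hashCode folds the high and low 32 bits together,
            // so distinct 64-bit keys can share one 32-bit hash:
            long a = 0L;
            long b = 0x0000_0001_0000_0001L;   // halves XOR to zero
            System.out.println(Long.hashCode(a));   // 0
            System.out.println(Long.hashCode(b));   // 0
            // Any structure holding more than 2^32 distinct keys must
            // contain such collisions, however good the mixing is.
        }
    }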

You point out that Integer and String have… "basic"
hash codes; I would add List to that, since it uses the
same "basic" mixing function as String.  That was
appropriate in 1995 when 32-bit multiply was slow,
but now it's just an obsolete function seeking a sinecure
for a barely-working retirement.  These "basic" hash
codes used to be reasonable, and now they need
warning labels.  Something along the lines of, "these
so-called hash codes were reasonable in the 90's".
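
To show what "basic" means in practice, a small illustration of my own
(not part of the original message): String and List both specify the same
h = 31*h + element recurrence, and its collisions are easy to construct
by hand.

    import java.util.List;

    public class BasicMix {
        public static void main(String[] args) {
            // String.hashCode: h = 31*h + char, seeded with 0.
            System.out.println("Aa".hashCode());               // 2112
            System.out.println("BB".hashCode());               // 2112
            // List.hashCode: the same recurrence, seeded with 1.
            System.out.println(List.of('A', 'a').hashCode());  // 3073
            System.out.println(List.of('B', 'B').hashCode());  // 3073
        }
    }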

But the Object.hashCode API can't be deprecated or
ignored, because it is centrally located, and that makes
it frankly useful, sometimes despite its shortcomings.
This is why I'm hoping to find a path to replace it with
something better, something equally useful (which
includes being ubiquitous) but not so basic.  Or moldy.

What excites me, in all this, is a possible convergence
of a need (hashCode is increasingly antiquated), an
opportunity (fast mixing primitives for wide values on
stock hardware), some well-understood frameworks for
hashing (like Guava's funnels and mixers IIRC), and some
upcoming new capabilities for migrating old classes to
new APIs (Valhalla bridges, coming soon I hope).
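
For readers who haven't seen it, a rough sketch of the Guava style alluded
to above: a Funnel describes which fields feed the hasher, and the hash
algorithm is chosen separately.  The Funnel/Hashing/PrimitiveSink calls are
Guava's actual API; the Person record and class names are only illustrative.

    import com.google.common.hash.Funnel;
    import com.google.common.hash.Hashing;
    import com.google.common.hash.PrimitiveSink;
    import java.nio.charset.StandardCharsets;

    public class FunnelSketch {
        record Person(String name, int birthYear) {}

        // The Funnel says which fields feed the hasher, and in what order.
        static final Funnel<Person> PERSON_FUNNEL =
            (Person p, PrimitiveSink into) -> {
                into.putString(p.name(), StandardCharsets.UTF_8);
                into.putInt(p.birthYear());
            };

        public static void main(String[] args) {
            // The algorithm (here 128-bit murmur3) can be swapped without
            // touching the field description.
            long h = Hashing.murmur3_128()
                            .hashObject(new Person("Ada", 1815), PERSON_FUNNEL)
                            .asLong();
            System.out.println(Long.toHexString(h));
        }
    }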

Out of this convergence I think we might be able to
obtain some good things:

 - A higher-precision hash code *in the JDK* that scales
 to all in-memory structures.  (working title "longHashCode";
 a rough sketch follows this list)
 - A migration path that allows classes to lean on old hash
 codes as a crutch until they can be upgraded.
 - Some easy ways to upgrade them (such as Liam's library,
 plus maybe some delegation stuff Brian is cooking up).
 - Good upgrades for all commonly used JDK types.
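
Purely to fix the shape in mind (nothing below is a real or proposed API;
the names are invented for illustration), the migration "crutch" could look
something like this:

    // Hypothetical shapes, invented for illustration only.
    interface LongHashable {
        long longHashCode();   // the wider hash, for upgraded classes
    }

    class LegacyPoint implements LongHashable {
        final int x, y;
        LegacyPoint(int x, int y) { this.x = x; this.y = y; }

        @Override public int hashCode() { return 31 * x + y; }  // the old "basic" code

        // Crutch: until this class gets a real 64-bit hash, lean on the old one.
        @Override public long longHashCode() {
            return Integer.toUnsignedLong(hashCode());
        }
    }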

And also a strong version of the randomization Stuart
was talking about:

 - A contract that allows the JVM to switch hashes, by
 modifying salts and even algorithms, depending on
 hardware available.
 - (Optionally) A way to pass a dynamic salt to a hash
 code, for container-specific perturbations.  But it might
 be OK to handle this as post-processing (sketched below).
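
One way the post-processing option could look; the SplitMix64-style
finalizer is just my choice of mixer for the sketch, not anything settled:

    // Illustration only: perturb a hash after the fact, outside the hashed class.
    final class SaltedHasher {
        private final long salt;   // per-container or per-run perturbation

        SaltedHasher(long salt) { this.salt = salt; }

        long apply(long hash) {
            long z = hash ^ salt;
            // SplitMix64 finalizer: cheap and well mixed on stock 64-bit hardware.
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return z ^ (z >>> 31);
        }
    }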

Ultimately it probably can't be a one-size-fits-all thing,
so I think it might be reasonable to build something like
the funnel and mixer framework into a JDK template
function (when we have those) so that applications can
ask for customized stronger or weaker, wider or narrower
hash functions, maybe to cover data sets which are too
large for memory and/or which need to be reproducible
across JVMs.

Then the standard hash code ("longHashCode" or the
like) can be defined as a JVM-controlled instance of
that framework.  By then, if we start to roll out 128-bit
integers in Java (using value types) we can roll out some
larger hash codes at the same time, using the same
migration tricks.

Bonus point:  If the JVM gets ownership of a pervasively
used hash code, then hardening that hash code, even
at the expense of JVM performance, becomes an allowed
move.  Kind of like compiling in side-channel remediations:
You do it sometimes because you need more defense in depth
(DiD), but not always, because of the performance hit.

HTH

— John

