Value type hash code
John Rose
john.r.rose at oracle.com
Wed Apr 11 20:27:42 UTC 2018
On Apr 11, 2018, at 11:32 AM, Remi Forax <forax at univ-mlv.fr> wrote:
>
> The default implementation of equals and hashCode in java/lang/Object make only sense for reference,
That's because currently Object is the supertype of references only.
When Object becomes the supertype of values also, its portfolio increases.
This is a common logical error: Because X did Y historically, X must also do only Y in the future.
We could make this our story for X=Object, Y=identityHashCode by fiat, but we are not forced to.
In fact, since Object is the super of value types (by another fiat; yes, I know it doesn't fit perfectly),
it is totally reasonable to enhance the Object.hashCode method to some logic like this:
hashCode() {
    if (this.getClass() is a reference type)
        return System.identityHashCode(this);
    else
        return System.valueHashCode(this);
}
And in fact I'm convinced that is what we should do.
That leads to the question, "what is the definition of System.valueHashCode" (name is unimportant)?
> for value types, it depends on the value type, i.e, it's far easier for the compiler to provide an equals/hashCode than for the VM,
There is a third and much better answer, which I'm surprised you don't mention: use a BSM (bootstrap method).
In other words, we are not stuck with the following unpleasant choice:
- Put lots of complicated intrinsic behavior into the JVM proper.
- Put lots of complicated generated bytecode into the classfiles, which is there forever.
The third choice is:
- Let the Java runtime manage complicated intrinsic behaviors using BSMs.
This third choice makes classfiles smaller (which is important!) and allows the JVM and JDK
to jointly choose the most efficient code shapes to implement the documented behavior.
This worked really, really well for lambdas.
(Somebody might still be thinking that only generated bytecodes in classfiles will get
full performance. Actually, generated bytecodes in classfiles are the worst-performing
option. BSMs can generate efficient code with "cheats" like Unsafe which bytecodes
are forbidden to use. Even a non-BSM solution like System.valueHashCode can be
faster than raw bytecodes, if the magic method is treated as an intrinsic by the JIT.
The problem with raw bytecodes is that the JIT cannot trust them to correspond to
an optimizable pattern; it must implement every corner of them and may not replace
them with more efficient code, because they might have been written by the user.
Also, loads of bytecodes in classfiles are slow to load and hard to verify and analyze.
The anti-pattern to avoid is a one-line Java class compiling to kilobytes of classfile,
and BSMs let us avoid it.)
The most I'd like to see in value type classfiles is a one-line hashCode (and equals, toString):
hashCode() {
    return indy[Bootstraps::valueHashCode](this);
}
And in fact, I would prefer to have this stereotyped code be inherited (by a new VM feature
"mindy" = method instance dynamic.) …That's me wanting to sand down all the corners.
> hence my proposition that the VM should not provide any default and let the compiler of the different languages decide what defaults are.
It's fine to let the compiler decide, but putting lots of boilerplate bytecode into classfiles
is an anti-pattern; see above. Let the compiler choose a BSM and (maybe) a supertype to
inherit mindy methods from.
Now, back to the details of hashCode:
At points like this we can go back and try to fix historical defects in hashCode. Fix more
defects and get more expense and delay, or fix fewer and get the feature sooner.
Someone noticed that there is a theoretical possibility of stack overflow in some cases.
The most important fact to note here is that there are working precedents for handling this issue today.
The most relevant working precedent is collection types, whose documented semantics
support value-based types, a close cousin to value types. (The product of List.of is approximately
value-based.)
The code for AbstractList gives us the best precedent to work with, and it does not check
for self-cycles. I don't think value types have any greater chance of self-cycles than lists do;
in fact they probably have less, since they will often be logical leaves in data structures.
So it would be reasonable to adapt the standard definitions from AbstractList for
the value types. To me that is the best first cut. (The self-check in toString,
which is in AbstractCollection, short-circuits away in the case of value types.)
Details of this first cut: Make a List (using List.of) of the value instance's fields, in
source order. Omit statics, of course. For good measure, fold in the type, say by
inserting this.getClass() into the field list as a leading element.
Defer equals/hashCode/toString to the List derived that way. Certainly tweak
the toString output so it doesn't look exactly like List.toString output.
hashCode() {
    // library optimizes this behavior via BSM:
    return List.of(this.getClass(), this.field1, this.field2, …).hashCode();
}
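Spelled out as ordinary Java, with a plain final class standing in for a value class (there is
no value-class syntax to write down yet, and the field names here are invented), that first cut
might look like this:

    import java.util.List;

    final class Point {   // stand-in for a two-field value type
        final int x;
        final double y;

        Point(int x, double y) { this.x = x; this.y = y; }

        // The derived List: type first, then fields in source order.
        // Note that List.of rejects nulls, so nullable reference fields
        // would need extra care in a real implementation.
        private List<Object> asList() {
            return List.of(this.getClass(), x, y);
        }

        @Override public boolean equals(Object o) {
            return o instanceof Point && asList().equals(((Point) o).asList());
        }

        @Override public int hashCode() {
            return asList().hashCode();
        }

        @Override public String toString() {
            // tweaked so it doesn't read exactly like List.toString
            return "Point" + asList().subList(1, 3);
        }
    }

A BSM-based implementation would of course avoid materializing the List and the boxing on every call.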
To be quite clear: Yes, if the value type has a reference field, that sub-object's
hashCode method gets called, and that could do something crazy. If that's an
issue, the author of the value type should specify more expensive and careful
methods. Preferably using a different BSM, of course.
(Alternative compliant behaviors would be to throw an exception or to skip the
reference field, for hashCode. Using identityHashCode is also an option but
only if the corresponding equals method uses acmp/op== not an equals call.
But in that case having values use an identity-sensitive comparison for their
sub-components seems even more surprising than throwing an exception.
Of all the choices, following the precedent of List seems to me least surprising
and most useful.)
— John
P.S. Now to make it more complicated: My favorite grudge against hashCode is
not that it is potentially self-recursive. My grudge is that hashCode produces at
most 32 bits (or, in the case of identityHashCode, only about 20 bits), and the standard
mixing functions (xor, *31+) are known to be poor: they create funnels that confuse
pairs of values and fail to spread input bits to output bits.
Should we consider making a longHashCode method for various types, including
value types, and have it produce a full 64 bits of output, mixed in a more modern
way? According to my experiments, two AES steps are better than any
standard arithmetic operation of similar cycle count on modern hardware, and
light-years better (at the cost of another cycle or two) than h*31+x.
This algorithm would be much better than the legacy h*31+x:
    uint128_t h = (SALT);
    for (uint64_t x1, x2 in pairs of inputs) {
        uint128_t x12 = x1; x12 <<= 64; x12 |= x2;
        h = aes_step(h ^ x12);
        // optional: h = aes_step(h ^ SALT2);
    }
The second iteration of AES (a single AES encode or decode step, not a full block
of 10 of them) seems to meet or beat anything else doable in the same number of
cycles: full-word multiplies are today about as expensive as AES steps,
AES steps handle two words at a time, and two rounds of AES are at least
as good at mixing as full-word multiplies.
AES or not, the relevant methods, for value types, could use longHashCode
as the defining block:
hashCode() {
    // library optimizes this behavior via BSM:
    return (int) this.longHashCode();
}
longHashCode() {
    // library optimizes this behavior via BSM:
    return List.of(this.getClass(), this.field1, this.field2, …).longHashCode();
}
Anyway, this elephant has many corners, and various of us are inspecting
distinct corners. But poor hashing is the particular toenail I would like to trim.
Of course it's elephants all the way down: Replacing h*31+x is, all by itself,
its own multi-faceted conversation; I think I've got this one by the trunk but
maybe there's something on a leg or tail that needs pointing out, such as the
fact that AES isn't available on many 32-bit chips.
One final point: We don't need to be stuck with 32-bit hashCode forever.
Our eventual solution to reified generics, whatever it is, seems likely to allow
the size of the hashCode to be a generic parameter. So it is reasonable to
hope that many of our hashing data structures could be generified over the
quality of the hash function.