Value type hash code

Stuart Marks stuart.marks at oracle.com
Wed Apr 11 18:42:53 UTC 2018


Interesting stuff. Several observations:

* Clearly there ought to be a default implementation. One question I have is 
whether the default implementation ought to be *specified*. If it's not 
specified -- really, it's specified only to conform to the general contract, but 
the details aren't specified -- then we'll have more freedom to change it in the 
future. Indeed, it'd be nice if it changed from time to time, or even 
frequently; it would help avoid code making inadvertent dependencies on the hash 
values.

* I think a primary use case for value types will be as HashMap keys. If so, I 
don't see any reasonable alternative than to delegate to the hashCode() of 
ordinary objects referenced from a value type, even if they're mutable. Lists of 
any implementation are defined to be equal, and to have the same hash code, if 
their elements are equal. Two value types containing distinct List instances 
with equal elements should therefore be equal and have equal hash codes.
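That List behavior is easy to check directly. A minimal sketch in plain Java, 
using Objects.hash as a stand-in for a value type's field-delegating hashCode 
(the class name is just for illustration; value-type syntax is still prototype-only):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;

public class ListHashDemo {
    public static void main(String[] args) {
        // Two distinct List implementations holding equal elements.
        List<Integer> a = new ArrayList<>(List.of(1, 2, 3));
        List<Integer> b = new LinkedList<>(List.of(1, 2, 3));

        // The List contract makes them equal with equal hash codes.
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true

        // So a value delegating to its fields' hashCode() (simulated
        // here with Objects.hash) hashes a and b identically.
        System.out.println(Objects.hash(a) == Objects.hash(b)); // true
    }
}
```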

Yes, this implies that a value type's hashCode() and equals() can change if it 
contains a reference to a mutable object. Sure, someone (javac?) can try to warn 
about this, but then we get into the business of tagging some objects as 
immutable. I'm not sure I want to go there.

To a certain extent, people are used to this; they use mutable objects as 
HashMap keys all the time. They just "know" that they mustn't mutate the object 
while it's in use. Occasionally, though, somebody will pop up in a huff 
complaining that HashMap is broken, and it's because they mutated a key... and 
then they'll complain that this isn't properly documented. Overall though, I 
don't think this is a serious problem.
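For illustration, here is the familiar failure mode with today's mutable HashMap 
keys, which a value type referencing a mutable object would inherit (class name 
hypothetical; the exact behavior after mutating a key is unspecified, but this is 
what OpenJDK's HashMap does):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MutatedKeyDemo {
    public static void main(String[] args) {
        Map<List<Integer>, String> map = new HashMap<>();
        List<Integer> key = new ArrayList<>(List.of(1, 2));
        map.put(key, "present");

        // Mutating the key changes its hash code, so the entry now sits
        // in a bucket that lookups with the new hash will not match.
        key.add(3);
        System.out.println(map.get(key));  // null -- the entry is stranded
        System.out.println(map.size());    // 1 -- but it hasn't gone away
    }
}
```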

* I'm sort-of ok with arrays using identity hash code, since that's what you get 
when you call hashCode() on them. The user can override this by calling 
Arrays.hashCode() or deepHashCode() if desired. But I can't help but think this 
is perpetuating a mistake....
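A quick sketch of that status quo for arrays, showing the inherited identity hash 
and the content-based alternatives (class name is illustrative only):

```java
import java.util.Arrays;

public class ArrayHashDemo {
    public static void main(String[] args) {
        int[] x = {1, 2, 3};
        int[] y = {1, 2, 3};

        // Arrays inherit Object's identity hash, so equal contents
        // almost certainly produce different hash codes.
        System.out.println(x.hashCode() == y.hashCode());

        // The content-based alternatives the user can call explicitly:
        System.out.println(Arrays.hashCode(x) == Arrays.hashCode(y));  // true
        System.out.println(Arrays.deepHashCode(new Object[] {x})
                        == Arrays.deepHashCode(new Object[] {y}));     // true
    }
}
```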

* I'm wondering if it makes sense to mix the field types into the hashCode, at 
least for primitives. The hash codes for the integral primitives are all equal to 
the value, for small values. That is, the hashes of (byte)123, (short)123, 
(char)123, 123, and 123L are all 123. That bugs me. I'm not sure how much 
practical impact it would have, though.
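That collision is straightforward to verify with the boxed types' static 
hashCode methods (class name here is just for illustration):

```java
public class PrimitiveHashDemo {
    public static void main(String[] args) {
        // Every integral type hashes small values to the value itself,
        // so 123 collides across five distinct types:
        System.out.println(Byte.hashCode((byte) 123));      // 123
        System.out.println(Short.hashCode((short) 123));    // 123
        System.out.println(Character.hashCode((char) 123)); // 123
        System.out.println(Integer.hashCode(123));          // 123
        System.out.println(Long.hashCode(123L));            // 123
    }
}
```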

* Somebody (maybe me) should do some research into better hash algorithms. The 
base-31 polynomial was probably quite reasonable 20 years ago, but better things 
might have been discovered since then. Intuitively, I'd say that the starting 
value should be something other than zero, preferably a large prime number. 
Since the hash code of 0 is 0, and the hash code of "" is 0, the hash of a value 
containing only zeros and empty strings will itself be zero! (Unless we mix in 
the field types, as above.)
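A minimal sketch of that zero fixed point, mirroring the base-31 recipe from the 
quoted message below; note that the JDK's own Objects.hash seeds the polynomial 
with 1 rather than 0, which sidesteps this particular case (class name is 
illustrative only):

```java
import java.util.Objects;

public class ZeroHashDemo {
    public static void main(String[] args) {
        // Base-31 polynomial seeded with 0: zeros and empty strings
        // contribute nothing, so the hash stays stuck at 0.
        int hc = 0;
        for (Object field : new Object[] {0, "", 0L}) {
            hc = (31 * hc) + Objects.hashCode(field);
        }
        System.out.println(hc); // 0

        // Objects.hash seeds with 1 instead, avoiding the fixed point.
        System.out.println(Objects.hash(0, "", 0L)); // 29791, not 0
    }
}
```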

s'marks

On 4/10/18 7:49 AM, David Simms wrote:
> 
> After prototyping "hashCode()" for value types, here's a few observations and 
> thoughts...
> 
>   * "The general contract of hashCode" [1] is unchanged.
>   * The default implementation, if no user implementation is provided,
>     is assumed to be completely based upon the entire contents of the
>     value (there's nothing else to go on).
>       o The current "Object" implementation, both its generation and
>         object header hash caching are completely inappropriate for
>         value types.
>       o The VM cannot pretend to know one field is more significant than
>         another.
>       o Large values will benefit from user implementation to provide
>         efficiency.
>       o Whilst the VM may provide a default implementation for safety, a
>         "javac" generated method would be optimal (can be optimized by
>         the JIT, includes inlining).
>   * Values containing references whose contents are mutable pose a
>     problem, their hash code is only as stable as the contents of the
>     whole object graph.
>       o Objects may suffer similar problems, difficult to say this is
>         any more important for values. Except to say values are supposed
>         to be "immutable" but references may break this quality, perhaps
>         "javac" could warn when value fields are mutable objects (not
>         always possible, e.g. field reference to an interface).
> 
> I assume the default implementation should look something like this (only with 
> concrete fields, not reflection):
> 
>          // (needs java.lang.reflect.Field/Modifier and java.util.Objects;
>          //  Field.get() can throw IllegalAccessException)
>          int hc = 0;
>          for (Field field : val.getClass().getDeclaredFields()) {
>              if (Modifier.isStatic(field.getModifiers())) continue;
>              field.setAccessible(true); // fields will typically be private
> 
>              // Using the generic JDK hash for all types
>              hc = (31 * hc) + Objects.hashCode(field.get(val));
>          }
>          return hc;
> 
> This algorithm assumes the VM calls hashCode() on reference fields, and encodes 
> primitives the same way as their boxed JDK counterparts (e.g. 
> "Long.hashCode(long l)" does not generically hash two int-sized chunks; rather it 
> xors the hi and lo words. Boolean is another interesting example: 1231 or 1237). 
> Unclear if this is actually important... however, consider this example:
> 
>      final __ByValue class MyInt implements Comparable<MyInt> {
>          final int value;
>          //....
>      }
> 
> The user is free to explicitly delegate to "Integer.hashCode(int val)", but is 
> it just more natural that the default works this way? Alternative VM 
> implementations might hash over value data payload including field padding. With 
> h/w support (suitable checksum instruction) there might be some performance gain 
> for large values, but then if you introduce object references, said h/w support 
> would get broken. Said implementation would be dependent on field layout, and 
> not give the same result on different platforms. Whilst the Javadoc states 
> hashing "need not remain consistent from one execution of an application to 
> another execution of the same application." [1], I'm wondering how many folks 
> rely on consistent hashing; more than nobody, I fear. Lastly, hashing large 
> amounts of data per value seems an unlikely general use case.
> 
> 
> Cheers
> /David Simms
> 
> [1] https://docs.oracle.com/javase/10/docs/api/java/lang/Object.html#hashCode()
> 
> 


