RFR: 8268358: [lworld] toString for primitive class should return `ClassName@hash`

Wed Jun 9 16:27:36 UTC 2021

On Tue, 8 Jun 2021 19:55:46 GMT, Mandy Chung <mchung at openjdk.org> wrote:

> The `Object::toString` implementation of a primitive class should return the traditional `ClassName@hash` (rather than listing the field values) so as not to leak any private and security-sensitive information. A primitive class can override the `toString` implementation for its custom string representation.

I agree strongly with this change.  The correct
generalization to primitives of System.identityHashCode and
the hex number mentioned by Object.toString is not something
that looks like java.lang.Record field reports, but rather a
VM-chosen number, suitable for hash codes, and hard to
predict.

But, using the algorithm of Objects.hashCode should not be
the final word in producing the VM-chosen number.

(This is not the PR to change that, but this is a very good
moment to make the point.)

We should use a variable (salted) hash code, so that people
will not rely on the presence of the `31*x+` hashcode.  It
would be an own-goal if we locked ourselves forever into a
new use of that weak and leaky algorithm.

(The `31*x+` hash code computes a weighted checksum of
`Sum[i] x[L-i]*(31**i)`, truncated to 32 bits.  It is very
prone to collisions, especially when neighboring x’s are
linearly related.  A better hash code would (a) be more
resistant to simple collisions and (b) tend to have
all bits of output depend on all bits of input.)
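
To make the collision point concrete, here is a small sketch.
The class name `PolyHashCollisions` and the helper `polyHash`
are made up for illustration; the helper follows the same
recurrence as `Arrays.hashCode(int[])`, which `Objects.hash`
also reduces to for boxed ints.

```java
import java.util.Arrays;
import java.util.Objects;

// Illustration only: shows how easily the 31*x+ recurrence collides.
final class PolyHashCollisions {
    // Same recurrence as Arrays.hashCode(int[]): h = 31*h + x, starting at 1.
    static int polyHash(int... xs) {
        int h = 1;
        for (int x : xs) {
            h = 31 * h + x;
        }
        return h;
    }

    public static void main(String[] args) {
        // Linearly related neighbors collide: 31*a + b is unchanged
        // when a grows by 1 and b shrinks by 31.
        System.out.println(polyHash(0, 31) == polyHash(1, 0));          // true
        System.out.println(Arrays.hashCode(new int[] {0, 31})
                == Arrays.hashCode(new int[] {1, 0}));                  // true
        System.out.println(Objects.hash(0, 31) == Objects.hash(1, 0));  // true
        // The well-known String instance of the same effect.
        System.out.println("Aa".hashCode() == "BB".hashCode());         // true
    }
}
```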

At a bare minimum, the JVM should configure a constant
random 32-bit “salt” at startup time, and xor that number
into the computation of the primitive hash code in such a
way that the output (for one object) is unguessable.

That is not secure (nor am I expecting it to be) since after you
hash a few thousand test objects you can reverse-engineer
the constants.  But adding the randomness _now_ would give
us the ability to tune the algorithm _going forward_.
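
A minimal sketch of that bare-minimum idea follows.  It is
not what the JVM does; `SaltedPolyHash`, `SALT`, and the exact
places where the salt is xor-ed in are assumptions made for
illustration.  It deliberately keeps the weak `31*x+` step, so
it only makes the output vary from run to run; it does not
improve mixing.

```java
import java.security.SecureRandom;

// Illustration only: a per-process salt folded into the legacy recurrence.
final class SaltedPolyHash {
    // Hypothetical stand-in for a salt the JVM would pick once at startup.
    static final int SALT = new SecureRandom().nextInt();

    // The salt seeds the state and is xor-ed into every field value,
    // so the result for a given object differs between salts.
    static int hash(int salt, int... fields) {
        int h = salt;
        for (int x : fields) {
            h = 31 * h + (x ^ salt);
        }
        return h;
    }

    public static void main(String[] args) {
        // Stable within one run, different across runs (with high probability).
        System.out.println(Integer.toHexString(hash(SALT, 1, 2)));
    }
}
```

Because the salt is a parameter, code that hard-wires an
expected hash value breaks immediately, which is the property
that leaves room to change the algorithm later.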

Note that the JVM’s identity hash code is configurable and
relatively unpredictable [1].

[1]:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/synchronizer.cpp#L828

We could make the same true of the built-in primitive hash
code.  The key step would be replacing the `31*x+` step with
a step that uses a larger state (at least 64 bits) and
better mixing (at least another shift and xor).

The cost of such upgrades is negligible.  The Marsaglia
register used for identity hash code is 128 bits (a good
size) and each hash step uses four xor operations to produce
one 32-bit value.
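
For reference, the step described above can be sketched in
Java as Marsaglia's xorshift128 generator with the shift
triple (11, 19, 8).  This is a port written from the
published algorithm, not a copy of the HotSpot source; the
class name `XorShift128` is hypothetical, and seeding is left
to the caller.

```java
// Illustration only: Marsaglia xorshift128 with four 32-bit words of state.
final class XorShift128 {
    private int x, y, z, w;

    // The state must not be all zeros, or every output is zero.
    XorShift128(int x, int y, int z, int w) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.w = w;
    }

    // One step: four xors, three shifts, and a rotation of the state words.
    int next() {
        int t = x;
        t ^= t << 11;                            // xor 1
        x = y;
        y = z;
        z = w;
        int v = w;
        v = (v ^ (v >>> 19)) ^ (t ^ (t >>> 8));  // xors 2, 3 and 4
        w = v;
        return v;
    }

    public static void main(String[] args) {
        XorShift128 r = new XorShift128(1, 0, 0, 0);
        System.out.println(r.next());  // 2057
    }
}
```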

A different use of a similar register could consume about 64
bits of primitive field material for each step.  More steps
for smaller chunks of field material would improve mixing at
a growing CPU cost.  A little parametric salt could be added
either at the beginning or during hash steps.  All such
options could be configured at VM startup time, as with
object identity hash codes.
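
One possible shape for such a step, as a sketch under stated
assumptions: a 64-bit state, one 64-bit chunk of field
material absorbed per step, a salt supplied at the start, and
a Marsaglia 64-bit xorshift (13, 7, 17) for mixing.  The names
`WideSaltedHash`, `step`, and `hash`, the use of addition to
absorb each chunk, and the final fold to 32 bits are all
choices made for illustration, not anything proposed in the
PR.

```java
import java.security.SecureRandom;

// Illustration only: 64-bit state, 64 bits of input per step, salted start.
final class WideSaltedHash {
    // Hypothetical stand-in for a salt configured at VM startup.
    static final long SALT = new SecureRandom().nextLong();

    // Absorb one 64-bit chunk, then mix with a shift-and-xor sequence.
    // Addition is used to absorb so the function is not purely
    // xor-linear in its input bits.
    static long step(long state, long chunk) {
        long s = state + chunk;
        s ^= s << 13;
        s ^= s >>> 7;
        s ^= s << 17;
        return s;
    }

    // Start from the salt, absorb every chunk, fold 64 bits down to 32.
    static int hash(long salt, long... chunks) {
        long s = salt;
        for (long c : chunks) {
            s = step(s, c);
        }
        return (int) (s ^ (s >>> 32));
    }

    public static void main(String[] args) {
        // Two int fields packed into one 64-bit chunk.
        long packed = ((long) 0 << 32) | (31L & 0xFFFFFFFFL);
        System.out.println(Integer.toHexString(hash(SALT, packed)));
    }
}
```

Running `step` more than once per chunk, or feeding smaller
chunks, trades CPU time for better mixing, as described above.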

(I think, in the future, using a couple of iterations of the
hardware AES instruction, or a 64-bit multiply, might be
slightly more performant with better mixing, but that's a
research project for later, and in any case produces more
bits than would be useful for a 32-bit result.)

Again: None of the above suggestions are cryptographic; they
can all be reversed straightforwardly.  But they are all
superior to the legacy `31*x+` hash code, and (crucially) if
we adopt a variable (salted) scheme, we can evolve the
algorithm.  If we use a fixed algorithm, we are stuck with
it forever.

And `31*x+`, as a fixed algorithm, is about the worst
choice.  We are stuck with it for java.lang.String.  We
don’t need _new_ uses of it.

-------------

PR: https://git.openjdk.java.net/valhalla/pull/438