[jdk8u-dev] RFR: 8310026: [8u] make java_lang_String::hash_code consistent across platforms

Mon Jul 10 09:09:04 UTC 2023

On Wed, 14 Jun 2023 13:02:53 GMT, Zdenek Zambersky <zzambers at openjdk.org> wrote:

> `java_lang_String::hash_code` produces inconsistent results across platforms when `s` is `char*`. This is because `char` is signed on some platforms and unsigned on others, so it is either sign-extended or zero-extended when cast to `unsigned int`. This causes one tier1 test failure on aarch64.
> 
> Details:
> This was discovered by examining one failing test (from tier1) present on aarch64 builds:
> `test/serviceability/sa/jmap-hashcode/Test8028623.java`
> The test was introduced by [JDK-8028623](https://bugs.openjdk.org/browse/JDK-8028623). However, the fix made there does not work on aarch64. The code was later fixed in newer JDKs in the [hotspot part](https://github.com/openjdk/jdk11u-dev/commit/7af927f9c10923b61f746eb6e566bcda853dd95a) of [JDK-8141132](https://bugs.openjdk.org/browse/JDK-8141132) (JEP 254: Compact Strings).
> 
> Fix:
> Fixed by backporting a very small portion of JDK-8141132.
> 
> Testing:
> tier1 (x86, x86_64, aarch64): OK (tested by GH and in rhel-8 aarch64 VM)
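
To illustrate the signedness issue described above, here is a minimal Java sketch that emulates both behaviours. It is not the HotSpot code: the class and method names are made up for illustration, and Java's wrapping `int` arithmetic stands in for the C++ `unsigned int`.

```java
class CharSignednessDemo {
    // Emulates a platform where char is signed: each byte is sign-extended.
    static int hashSignExtended(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + b;          // Java bytes are signed, so b widens with sign extension
        }
        return h;
    }

    // Emulates a platform where char is unsigned: each byte is zero-extended.
    static int hashZeroExtended(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + (b & 0xFF); // mask keeps only the low 8 bits
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] ascii = {'a', 'b', 'c'};
        byte[] nonAscii = {(byte) 0xE3, (byte) 0x81, (byte) 0x81};
        // Bytes below 0x80 hash identically either way.
        System.out.println(hashSignExtended(ascii) == hashZeroExtended(ascii));       // true
        // Bytes at or above 0x80 do not.
        System.out.println(hashSignExtended(nonAscii) == hashZeroExtended(nonAscii)); // false
    }
}
```

Only input containing bytes with the high bit set is affected, which would explain why the problem shows up in a single test.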

hotspot/agent/src/share/classes/sun/jvm/hotspot/utilities/Hashtable.java line 66:

> 64:     // Emulate the unsigned int in java_lang_String::hash_code
> 65:     while (len-- > 0) {
> 66:       h = 31*h + (0xFFL & buf[s]);

`Byte.toUnsignedInt()` would be clearer.
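
As a quick sanity check of that suggestion, the sketch below compares the mask from the patch with `Byte.toUnsignedInt()` and `Byte.toUnsignedLong()` (both available since Java 8) over every byte value. The class and method names are hypothetical.

```java
class UnsignedByteDemo {
    // Returns true if all three ways of zero-extending a byte agree for every byte value.
    static boolean allFormsAgree() {
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            long masked = 0xFFL & b;                // form used in the patch
            long viaInt = Byte.toUnsignedInt(b);    // int result in 0..255, widened to long
            long viaLong = Byte.toUnsignedLong(b);  // long result in 0..255
            if (masked != viaInt || masked != viaLong) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allFormsAgree());
    }
}
```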

hotspot/src/share/vm/classfile/javaClasses.hpp line 197:

> 195:     return h;
> 196:   }
> 197: 

I don't understand this. According to the comment, both of these functions are meant to mimic `String.hashCode`. But only the `jchar` variant actually does, doesn't it?

Assuming the `jchar` variant gets fed UCS2 and the `jbyte` variant UTF8, those encodings could differ for the same Java string if it contains non-ASCII characters.

For example, take a string consisting of the single Unicode character "ぁ" (`U+3041`), which is encoded as `0x3041` (len 1) in UCS2 and as `0xE3 0x81 0x81` (len 3) in UTF8.

Hash for the first would use the jchar* variant, len=1, and return 0x3041. Hash for the UTF8 variant would get, I assume, a byte array of `0xE3 0x81 0x81` and a len of 3, and return 0x36443 (`(((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81`).
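
The arithmetic above can be reproduced with a small Java sketch. It assumes the `jbyte` variant zero-extends each byte and that the byte array holds standard UTF-8 (which, for `U+3041`, matches the modified UTF-8 HotSpot uses internally). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingHashDemo {
    // jchar-style hash: 31*h + c over UTF-16 code units, as String.hashCode does.
    static int hashChars(char[] s) {
        int h = 0;
        for (char c : s) {
            h = 31 * h + c;
        }
        return h;
    }

    // jbyte-style hash: 31*h + b over the encoded bytes, each zero-extended.
    static int hashBytes(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "\u3041"; // "ぁ"
        int utf16Hash = hashChars(s.toCharArray());
        int utf8Hash = hashBytes(s.getBytes(StandardCharsets.UTF_8));
        System.out.println(Integer.toHexString(utf16Hash)); // 3041
        System.out.println(Integer.toHexString(utf8Hash));  // 36443
        System.out.println(utf16Hash == s.hashCode());      // true
    }
}
```

Under these assumptions only the `jchar` variant agrees with `String.hashCode` for this input.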

I must be missing something basic here.

-------------

PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1257953806
PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1257943896