[jdk8u-dev] RFR: 8310026: [8u] make java_lang_String::hash_code consistent across platforms
Thomas Stuefe
stuefe at openjdk.org
Mon Jul 10 09:09:04 UTC 2023
On Wed, 14 Jun 2023 13:02:53 GMT, Zdenek Zambersky <zzambers at openjdk.org> wrote:
> `java_lang_String::hash_code` produces inconsistent results across platforms when `s` is a `char*`. This is because `char` is signed on some platforms and unsigned on others, so each `char` is either sign-extended or zero-extended when cast to `unsigned int`. This causes one tier1 test failure on aarch64.
>
> Details:
> This was discovered by examining a failing tier1 test on aarch64 builds:
> `test/serviceability/sa/jmap-hashcode/Test8028623.java`
> The test was introduced by [JDK-8028623](https://bugs.openjdk.org/browse/JDK-8028623). However, the fix made there does not work on aarch64. The code was later fixed in newer JDKs, in the [hotspot part](https://github.com/openjdk/jdk11u-dev/commit/7af927f9c10923b61f746eb6e566bcda853dd95a) of [JDK-8141132](https://bugs.openjdk.org/browse/JDK-8141132) (JEP 254: Compact Strings).
>
> Fix:
> Fixed by backporting a very small portion of JDK-8141132.
>
> Testing:
> tier1 (x86, x86_64, aarch64): OK (tested by GH and in rhel-8 aarch64 VM)
hotspot/agent/src/share/classes/sun/jvm/hotspot/utilities/Hashtable.java line 66:
> 64: // Emulate the unsigned int in java_lang_String::hash_code
> 65: while (len-- > 0) {
> 66: h = 31*h + (0xFFL & buf[s]);
`Byte.toUnsignedInt()` would be clearer.
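
To illustrate the suggestion, here is a minimal sketch comparing the mask from the quoted line with `Byte.toUnsignedInt()`, plus a sign-extending variant that mimics what a platform with signed `char` would compute. The class name, the method names, and the final `& 0xFFFFFFFFL` truncation are assumptions made for this example, not code taken from the patch.

```java
// Hypothetical demo class; only the loop body with (0xFFL & b) mirrors the quoted line.
class UnsignedByteHash {

    // Zero-extends each byte with an explicit mask, as in the quoted snippet.
    static long hashWithMask(byte[] buf) {
        long h = 0;
        for (byte b : buf) {
            h = 31 * h + (0xFFL & b);
        }
        return h & 0xFFFFFFFFL; // assumed: truncate to emulate a 32-bit unsigned int
    }

    // Same computation, using Byte.toUnsignedInt() (available since Java 8).
    static long hashWithToUnsignedInt(byte[] buf) {
        long h = 0;
        for (byte b : buf) {
            h = 31 * h + Byte.toUnsignedInt(b);
        }
        return h & 0xFFFFFFFFL;
    }

    // Sign-extends each byte, which is what a signed char would do in C++.
    static long hashSignExtended(byte[] buf) {
        long h = 0;
        for (byte b : buf) {
            h = 31 * h + b;
        }
        return h & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] buf = {(byte) 0xE3, (byte) 0x81, (byte) 0x81};
        System.out.println(Long.toHexString(hashWithMask(buf)));          // 36443
        System.out.println(Long.toHexString(hashWithToUnsignedInt(buf))); // 36443
        System.out.println(Long.toHexString(hashSignExtended(buf)));      // ffff8343
    }
}
```

The two zero-extending variants agree for every input; they differ from the sign-extending one only when a byte has its high bit set.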
hotspot/src/share/vm/classfile/javaClasses.hpp line 197:
> 195: return h;
> 196: }
> 197:
I don't understand this. According to the comment, both of these functions are meant to mimic `String.hashCode`. But only the `jchar` variant does, doesn't it?
I assume the `jchar` variant gets fed UCS2 and the `jbyte` variant UTF8. Those encodings can differ for the same Java string as soon as it contains non-ASCII characters.
For example, let the string be the single Unicode character "ぁ" (`U+3041`), which is encoded as `0x3041` (len 1) in UCS2 and as `0xE38181` in UTF8.
The hash for the first would use the `jchar*` variant with len=1 and return 0x3041. The hash for the UTF8 variant would, I assume, get a byte array of `0xE3 0x81 0x81` with a len of 3 and return 0x36443 (`(((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81`).
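
The arithmetic above can be checked with a small sketch. The class and method names are made up for illustration, and the byte variant assumes zero-extension of each byte; it is not the HotSpot code itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing a per-char hash with a per-byte hash.
class EncodingHashDemo {

    // 31-based hash over UTF-16 code units, the same recurrence String.hashCode uses.
    static int hashChars(char[] s) {
        int h = 0;
        for (char c : s) {
            h = 31 * h + c;
        }
        return h;
    }

    // Same recurrence over bytes, with each byte zero-extended (assumption).
    static int hashBytes(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        String str = "\u3041"; // the single character U+3041
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // E3 81 81

        System.out.println(Integer.toHexString(str.hashCode()));                // 3041
        System.out.println(Integer.toHexString(hashChars(str.toCharArray())));  // 3041
        System.out.println(Integer.toHexString(hashBytes(utf8)));               // 36443
    }
}
```

For pure ASCII input both variants return the same value, since every character then occupies exactly one byte with the same numeric value.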
I must be missing something basic here.
-------------
PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1257953806
PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1257943896