[jdk8u-dev] RFR: 8310026: [8u] make java_lang_String::hash_code consistent across platforms

Mon Jul 10 10:37:11 UTC 2023

On Mon, 10 Jul 2023 08:59:35 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> `java_lang_String::hash_code` produces inconsistent results on different platforms, when `s` is `char*`.  This is because on some platforms `char` is signed, while on others unsigned (resulting in `char` to be either zero-extended or sign-extended, when cast to `unsigned int`). This causes 1 tier1 test failure on aarch64.
>> 
>> Details:
>> This was discovered by examining one failing test (from tier1) present on aarch64 builds:
>> `test/serviceability/sa/jmap-hashcode/Test8028623.java`
>> Test was introduced by [JDK-8028623](https://bugs.openjdk.org/browse/JDK-8028623). However fix done there does not work on aarch64. Code was later fixed (newer jdks) in [hotspot part](https://github.com/openjdk/jdk11u-dev/commit/7af927f9c10923b61f746eb6e566bcda853dd95a) of [JDK-8141132](https://bugs.openjdk.org/browse/JDK-8141132) (JEP 254: Compact Strings).
>> 
>> Fix:
>> Fixed by backporting very small portion of JDK-8141132.
>> 
>> Testing:
>> tier1 (x86, x86_64, aarch64): OK (tested by GH and in rhel-8 aarch64 VM)
>
> hotspot/src/share/vm/classfile/javaClasses.hpp line 197:
> 
>> 195:     return h;
>> 196:   }
>> 197: 
> 
> I don't understand this. According to the comment, both of these functions are to mimic `String.hashCode`. But only the `jchar` variant does, or?
> 
> Assuming the `jchar` variant gets fed UCS2 and the `jbyte` variant UTF8. Those encodings could be different for the same java string if we have surrogate chars.
> 
> For example, let string be a single unicode "ぁ" character, aka `U+3041`, which would be encoded as `0x3041` (len 1) with UCS2, `0xE38181` as UTF8.
> 
> Hash for the first would use the jchar* variant, len=1, and return 0x3041. Hash for the UTF8 variant would get, I assume, a byte array of `0xE3 0x81 0x81` and a len of 3, and return 0x36443 (`(((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81`).
> 
> I must be missing something basic here.

I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the `jchar` variant matches `String.hashCode` but I believe only that variant /needs/ to match `String.hashCode`. The `jchar` variant is used by all code operating on Java Strings proper. The `jbyte` variant is only used by the Symbol table and the agent. 

The problem this is fixing is to do with the disparity between `SymbolTable::hash_symbol` and the agent `HashTable`. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because `SymbolTable::hash_symbol` accepts and passes on to `java_lang_String::hash_code` a value of C type `char*` (where `char` may be signed or unsigned depending on the platform) while the agent `HashTable` code operates on a Java `byte[]` (which is always signed). This means that the template code may or may not sign extend the values melded into the hash, causing the `SymbolTable` and the agent `HashTable` to compute different results.
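
To make the sign-extension point concrete, here is a minimal standalone sketch. It is not HotSpot code; the helper names `hash_sign_extended` and `hash_zero_extended` are hypothetical and only model the two ways a byte can be widened before it is melded into the hash.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: widens each byte as a signed value, which is what
// happens when plain `char` is signed, and what a Java `byte` always does.
inline uint32_t hash_sign_extended(const uint8_t* s, size_t len) {
  uint32_t h = 0;
  for (size_t i = 0; i < len; i++) {
    // Map 0x80..0xFF to -128..-1 explicitly so the sketch is portable.
    int32_t v = (s[i] >= 0x80) ? static_cast<int32_t>(s[i]) - 256
                               : static_cast<int32_t>(s[i]);
    h = 31 * h + static_cast<uint32_t>(v);
  }
  return h;
}

// Hypothetical helper: widens each byte as an unsigned value, which is what
// happens when plain `char` is unsigned.
inline uint32_t hash_zero_extended(const uint8_t* s, size_t len) {
  uint32_t h = 0;
  for (size_t i = 0; i < len; i++) {
    h = 31 * h + static_cast<uint32_t>(s[i]);
  }
  return h;
}
```

For the bytes `0xE3 0x81 0x81` the zero-extended version gives `0x36443`, while the sign-extended version gives -31933 reduced modulo 2^32. The two only agree when every byte is below `0x80`.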

This current fix decouples the definitions of `hash_code(const jchar* s, int len)` and `hash_code(const jbyte* s, int len)` in order to allow the latter to match the redefined behaviour of the agent `HashTable`, i.e. it hashes individual unsigned 8 bit values in the input rather than unsigned 16 bit values.
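
Roughly, the decoupled pair can be pictured as below. This is a sketch of the idea rather than the patch text: `sketch_jchar` and `sketch_jbyte` are stand-in typedefs I am assuming for `jchar` and `jbyte`, and the bodies are my reading of the intent (the byte variant masks each element to 8 bits so that signedness no longer matters).

```cpp
#include <cstdint>

// Stand-in typedefs for this sketch only (assumed to mirror jchar / jbyte).
typedef uint16_t sketch_jchar;
typedef int8_t   sketch_jbyte;

// 16 bit variant: each element is already unsigned, so it is used as-is.
inline unsigned int hash_code(const sketch_jchar* s, int len) {
  unsigned int h = 0;
  while (len-- > 0) {
    h = 31 * h + static_cast<unsigned int>(*s);
    s++;
  }
  return h;
}

// 8 bit variant: mask to the low 8 bits so a signed element contributes
// the same value as its unsigned counterpart.
inline unsigned int hash_code(const sketch_jbyte* s, int len) {
  unsigned int h = 0;
  while (len-- > 0) {
    h = 31 * h + (static_cast<unsigned int>(*s) & 0xFF);
    s++;
  }
  return h;
}
```

With Thomas's example, the 16 bit variant applied to the single element `0x3041` returns `0x3041`, and the 8 bit variant applied to the three bytes `0xE3 0x81 0x81` (stored as signed values) returns `0x36443`, independent of how the platform treats `char`.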

As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field `String.value`, whether it is considered as 8 bit or 16 bit values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

-------------

PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1258051917