[jdk8u-dev] RFR: 8310026: [8u] make java_lang_String::hash_code consistent across platforms

Mon Jul 10 11:38:10 UTC 2023

On Mon, 10 Jul 2023 10:37:19 GMT, Zdenek Zambersky <zzambers at openjdk.org> wrote:

>> I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the `jchar` variant matches `String.hashCode` but I believe only that variant /needs/ to match `String.hashCode`. The `jchar` variant is used by all code operating on Java Strings proper. The `jbyte` variant is only used by the Symbol table and the agent. 
>> 
>> The problem this is fixing is to do  with the disparity between `SymbolTable::hash_symbol` and the agent `HashTable`. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because `SymbolTable::hash_symbol` accepts and passes on to `java_lang_String::hash_code` a value of C type `char*` (which may be signed or unsigned depending on the OS) while the agent `HashTable` code operates on a Java `byte[]` (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the `SymbolTable` and agent HashTable` to compute different results.
>> 
>> This current fix decouples the definitions of `hash_code(const jchar* s, int len)` and `hash_code(const jbyte* s, int len)` in order to allow the latter to match the redefined behaviour of the agent `HashTable` i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.
>> 
>> As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field `String.value`, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?
>
> Few uses of `java_lang_String::hash_code` in JDK8 are following:
> 
> `jbyte *` varaiant is only used to hash symbols (I don't think this affects code in java):
> https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/symbolTable.cpp#L228
> 
> other uses dealing with Strings use `jchar *`  variant:
> https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/javaClasses.cpp#L344
> https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/symbolTable.cpp#L654
> https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/gc_implementation/g1/g1StringDedupTable.cpp#L344
> 
> 
> In newer JDKs, `jbyte *` variant is also used for compact strings (as zero extended latin1 should be equal  to UCS2/UTF-16), e.g.:
> https://github.com/openjdk/jdk/blob/06a1a15d014f5ca48f62f5f0c8e8682086c4ae0b/src/hotspot/share/classfile/javaClasses.cpp#L522
> (but JDK8 does not have compact strings...)

> I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the `jchar` variant matches `String.hashCode` but I believe only that variant /needs/ to match `String.hashCode`. The `jchar` variant is used by all code operating on Java Strings proper. The `jbyte` variant is only used by the Symbol table and the agent.
> 
> The problem this is fixing is to do with the disparity between `SymbolTable::hash_symbol` and the agent `HashTable`. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because `SymbolTable::hash_symbol` accepts and passes on to `java_lang_String::hash_code` a value of C type `char*` (which may be signed or unsigned depending on the OS) while the agent `HashTable` code operates on a Java `byte[]` (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the `SymbolTable` and agent HashTable` to compute different results.
> 
> This current fix decouples the definitions of `hash_code(const jchar* s, int len)` and `hash_code(const jbyte* s, int len)` in order to allow the latter to match the redefined behaviour of the agent `HashTable` i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.
> 
> As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field `String.value`, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Thank you, Andrew, for disentangling this. I get it now. The byte variant only has to match its java counterpart in the agent.

-------------

PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1258116487