[jdk8u-dev] RFR: 8310026: [8u] make java_lang_String::hash_code consistent across platforms

Zdenek Zambersky zzambers at openjdk.org
Mon Jul 10 10:40:10 UTC 2023


On Mon, 10 Jul 2023 10:32:38 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> hotspot/src/share/vm/classfile/javaClasses.hpp line 197:
>> 
>>> 195:     return h;
>>> 196:   }
>>> 197: 
>> 
>> I don't understand this. According to the comment, both of these functions are to mimic `String.hashCode`. But only the `jchar` variant does, or?
>> 
>> Assuming the `jchar` variant gets fed UCS2 and the `jbyte` variant UTF8. Those encodings could be different for the same java string if we have surrogate chars.
>> 
>> For example, let string be a single unicode "ぁ" character, aka `U+3041`, which would be encoded as `0x3041` (len 1) with UCS2, `0xE38181` as UTF8.
>> 
>> Hash for the first would use the jchar* variant, len=1, and return 0x3041. Hash for the UTF8 variant would get, I assume, a byte array of `0xE3 0x81 0x81` and a len of 3, and return 0x36443 (`(((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81`).
>> 
>> I must be missing something basic here.
>
> I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the `jchar` variant matches `String.hashCode` but I believe only that variant /needs/ to match `String.hashCode`. The `jchar` variant is used by all code operating on Java Strings proper. The `jbyte` variant is only used by the Symbol table and the agent. 
> 
> The problem this is fixing is to do  with the disparity between `SymbolTable::hash_symbol` and the agent `HashTable`. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because `SymbolTable::hash_symbol` accepts and passes on to `java_lang_String::hash_code` a value of C type `char*` (which may be signed or unsigned depending on the OS) while the agent `HashTable` code operates on a Java `byte[]` (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the `SymbolTable` and agent HashTable` to compute different results.
> 
> This current fix decouples the definitions of `hash_code(const jchar* s, int len)` and `hash_code(const jbyte* s, int len)` in order to allow the latter to match the redefined behaviour of the agent `HashTable` i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.
> 
> As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field `String.value`, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Few uses of `java_lang_String::hash_code` in JDK8 are following:

`jbyte *` varaiant is only used to hash symbols (I don't think this affects code in java):
https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/symbolTable.cpp#L228

other uses dealing with Strings use `jchar *`  variant:
https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/javaClasses.cpp#L344
https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/classfile/symbolTable.cpp#L654
https://github.com/openjdk/jdk8u-dev/blob/a2a68737d927e7afa6e5c2d2642d8cb42f9d6be9/hotspot/src/share/vm/gc_implementation/g1/g1StringDedupTable.cpp#L344


In newer JDKs, `jbyte *` variant is also used for compact strings (as zero extended latin1 should be equal  to UCS2/UTF-16), e.g.:
https://github.com/openjdk/jdk/blob/06a1a15d014f5ca48f62f5f0c8e8682086c4ae0b/src/hotspot/share/classfile/javaClasses.cpp#L522
(but JDK8 does not have compact strings...)

-------------

PR Review Comment: https://git.openjdk.org/jdk8u-dev/pull/336#discussion_r1258056867


More information about the jdk8u-dev mailing list