RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)

Fri Oct 18 09:52:50 UTC 2024

On Wed, 16 Oct 2024 21:24:01 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> The `len` in all these methods is the number of unicode characters, which yes, could be less than the length of the UTF8 array.
>> 
>> ---
>> 
>> `UTF8::convert_to_unicode` uses `UTF8::next`, like I also do in the new UTF8 equals/hash functions. Part of `UTF8::next` looks like this: 
>> 
>> ```c++
>>   switch ((ch = ptr[0]) >> 4) {
>>     default:
>>     ... work ... /* 1-byte character */
>>     break;
>> 
>>   case 0x8: case 0x9: case 0xA: case 0xB: case 0xF:
>>     /* Shouldn't happen. */
>>     break;
>> 
>>   case 0xC: case 0xD:
>>     /* 110xxxxx  10xxxxxx */
>>     ... work ... /* 2-byte character */
>>     break;
>> 
>>   case 0xE:
>>     /* 1110xxxx 10xxxxxx 10xxxxxx */
>>     ... work ... /* 3-byte character */
>>     break;
>>   } /* end of switch */
>> 
>> 
>> In this code, 4-byte long UTF8 characters are not converted. This leads me to believe that we do not support this range of characters. Or that we at least do not support it encoded in UTF8. With this restriction, we also do not have 2 wide (4 byte) UTF16 characters, as 3-byte UTF8 characters fit in a single (2 byte) UTF16 unit.
>> 
>> So for the question, I do not know what would happen. I believe it would be undefined behaviour as this character would not be supported encoded in UTF8. Just like if we would use `UTF8::convert_to_unicode` first to then compare two UTF16 strings.
>
> @caspernorrbin - Aarggh! Sorry I picked a bad example - the JVM uses modified-UTF8 which does not support 4-byte UTF8 encodings. Any UTF16 surrogate pair would be encoded as two three-byte modifed-UTF8 sequences.
> 
>> The len in all these methods is the number of unicode characters
> 
> Are you sure you meant "characters" there?
> 
> Let me try to back up a bit and go to the code that I have a problem with:
> 
> bool java_lang_String::equals(oop java_string, const char* utf8_string, int num_unicode_points) {
>   assert(java_string->klass() == vmClasses::String_klass(),
>          "must be java_string");
>   typeArrayOop value = java_lang_String::value_no_keepalive(java_string);
>   int length = java_lang_String::length(java_string, value);
>   if (length != num_unicode_points) {
>     return false;
>   }
> 
> The incoming String represents a single unicode character encoded as a surrogate pair - so `length == 2`. Each element in the surrogate pairs encodes as a 3-byte modified-UTF8 sequence, so the length of `utf8_string` is 6. For the length check to pass the incoming value of `num_unicode_points` has to be 2. So for this to succeed the caller always has to pass in the number of UTF16 elements that comprise the `java_string`. As Johan noted in the utf8.cpp file when it refers to "unicode" it really means UTF16, so in that sense `num_unicode_points` is not incorrect.  But it means those `len` parameters should also be named the same way.

You're right, I meant the number of code points..

The `num_unicode_points` in this case would come from `UTF8::unicode_length`, which would give 2 from the 6-byte UTF8 string in your example. But I agree that it is somewhat unclear, especially if someone were to use this function in the future.

I have two ideas how this could be improved:
I could either add clarifying comments and change naming across the different functions to be clearer on how to use and call correctly. Or I could instead change it so that the function takes the UTF8-length instead of the UTF16-length. That way you always pass the array length to the function.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1806218904