RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
David Holmes
dholmes at openjdk.org
Wed Oct 16 21:27:17 UTC 2024
On Wed, 16 Oct 2024 08:45:47 GMT, Casper Norrbin <cnorrbin at openjdk.org> wrote:
>> See for example this snippet from `java_lang_String::create_from_str(const char* utf8_str, TRAPS)`:
>> ```c++
>> int length = UTF8::unicode_length(utf8_str, is_latin1, has_multibyte);
>> if (!CompactStrings) {
>>   has_multibyte = true;
>>   is_latin1 = false;
>> }
>>
>> Handle h_obj = basic_create(length, is_latin1, CHECK_NH);
>> if (length > 0) {
>>   if (!has_multibyte) {
>>     const jbyte* src = reinterpret_cast<const jbyte*>(utf8_str);
>>     ArrayAccess<>::arraycopy_from_native(src, value(h_obj()), typeArrayOopDesc::element_offset<jbyte>(0), length);
>>   } else if (is_latin1) {
>>     UTF8::convert_to_unicode(utf8_str, value(h_obj())->byte_at_addr(0), length);
>>   } else {
>>     UTF8::convert_to_unicode(utf8_str, value(h_obj())->char_at_addr(0), length);
>>   }
>> }
>> ```
>
> The `len` in all these methods is the number of unicode characters, which yes, could be less than the length of the UTF8 array.
>
> ---
>
> `UTF8::convert_to_unicode` uses `UTF8::next`, like I also do in the new UTF8 equals/hash functions. Part of `UTF8::next` looks like this:
>
> ```c++
> switch ((ch = ptr[0]) >> 4) {
>   default:
>     ... work ... /* 1-byte character */
>     break;
>
>   case 0x8: case 0x9: case 0xA: case 0xB: case 0xF:
>     /* Shouldn't happen. */
>     break;
>
>   case 0xC: case 0xD:
>     /* 110xxxxx 10xxxxxx */
>     ... work ... /* 2-byte character */
>     break;
>
>   case 0xE:
>     /* 1110xxxx 10xxxxxx 10xxxxxx */
>     ... work ... /* 3-byte character */
>     break;
> } /* end of switch */
> ```
>
> In this code, 4-byte UTF8 sequences are not converted. This leads me to believe that we do not support this range of characters, or at least not when encoded in UTF8. With this restriction we also never get characters that span two (4-byte) UTF16 units, as a 3-byte UTF8 sequence always fits in a single (2-byte) UTF16 unit.
>
> So for the question, I do not know what would happen. I believe it would be undefined behaviour, as such a character is not supported when encoded in UTF8 - just as if we first used `UTF8::convert_to_unicode` and then compared two UTF16 strings.
@caspernorrbin - Aarggh! Sorry I picked a bad example - the JVM uses modified-UTF8, which does not support 4-byte UTF8 encodings. Any UTF16 surrogate pair would be encoded as two three-byte modified-UTF8 sequences.
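For concreteness, here is a rough standalone sketch of that encoding (plain C++, not HotSpot code; U+1F600 is just an example code point I picked): the code point is split into its surrogate pair, and each surrogate is then written out as its own 3-byte sequence, giving six modified-UTF8 bytes for a single code point.

```c++
#include <cstdint>
#include <cstdio>

// Write one UTF16 code unit as a 3-byte (modified-)UTF8 sequence.
static void emit_3byte(uint16_t unit) {
  std::printf("%02X %02X %02X ",
              0xE0 | (unit >> 12),          // 1110xxxx
              0x80 | ((unit >> 6) & 0x3F),  // 10xxxxxx
              0x80 | (unit & 0x3F));        // 10xxxxxx
}

int main() {
  uint32_t cp = 0x1F600;               // example supplementary code point
  uint32_t v  = cp - 0x10000;
  uint16_t hi = 0xD800 | (v >> 10);    // high surrogate: D83D
  uint16_t lo = 0xDC00 | (v & 0x3FF);  // low surrogate:  DE00
  emit_3byte(hi);                      // ED A0 BD
  emit_3byte(lo);                      // ED B8 80  -> six bytes in total
  std::printf("\n");
  return 0;
}
```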
> The len in all these methods is the number of unicode characters
Are you sure you meant "characters" there?
Let me try to back up a bit and go to the code that I have a problem with:
```c++
bool java_lang_String::equals(oop java_string, const char* utf8_string, int num_unicode_points) {
  assert(java_string->klass() == vmClasses::String_klass(),
         "must be java_string");
  typeArrayOop value = java_lang_String::value_no_keepalive(java_string);
  int length = java_lang_String::length(java_string, value);
  if (length != num_unicode_points) {
    return false;
  }
```
The incoming String represents a single unicode character encoded as a surrogate pair - so `length == 2`. Each element of the surrogate pair encodes as a 3-byte modified-UTF8 sequence, so the length of `utf8_string` is 6. For the length check to pass, the incoming value of `num_unicode_points` has to be 2, which means the caller always has to pass in the number of UTF16 elements that comprise the `java_string`. As Johan noted, when the utf8.cpp file refers to "unicode" it really means UTF16, so in that sense `num_unicode_points` is not incorrect - but it means those `len` parameters should also be named the same way.
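To make that arithmetic concrete, here is a small sketch (again standalone C++, not a HotSpot API; the byte string is the modified-UTF8 form of the example code point above) that counts UTF16 code units by counting sequence lead bytes. Since modified-UTF8 only uses 1-, 2- and 3-byte sequences, each sequence decodes to exactly one UTF16 unit, which is the value the caller would have to pass as `num_unicode_points`.

```c++
#include <cstddef>
#include <cstdio>
#include <cstring>

// Count UTF16 code units in a modified-UTF8 string by counting sequence
// lead bytes: every byte that is not a 10xxxxxx continuation starts a new
// sequence, and each sequence maps to exactly one UTF16 code unit.
static int utf16_unit_count(const char* s) {
  int units = 0;
  for (std::size_t i = 0; i < std::strlen(s); i++) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) {
      units++;
    }
  }
  return units;
}

int main() {
  // U+1F600 as modified-UTF8: two 3-byte sequences, one per surrogate.
  const char* s = "\xED\xA0\xBD\xED\xB8\x80";
  // Prints: bytes=6 utf16_units=2, so num_unicode_points would have to be 2.
  std::printf("bytes=%zu utf16_units=%d\n", std::strlen(s), utf16_unit_count(s));
  return 0;
}
```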
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1803816182