RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)

Wed Oct 16 07:27:13 UTC 2024

On Wed, 16 Oct 2024 07:23:35 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:

>> Actually that in itself is not enough to make the code correct AFAICS. Consider this example from Table 2-11, page 71, in https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf The Unicode character 00010384 in UTF-32, consists of the surrogate pair D800, DF84 in UTF-16, and the four byte sequence F0, 90, 8E, 84 in UTF-8. So if we have a java.lang.String that represents this single unicode character, the String will consist of an array of 2 char values, and the UTF8 representation would consist of 4 byte values. So if you were doing an equals comparison between the String's value array and the utf_str, what length would you pass to the equals method?
>
> I'd hope that the answer is 4, but I suspect that the answer is 2 (UTF-16). When the class `UTF8` talks about "unicode", it seems to be talking about whatever encoding Java's strings are, which AFAIK is basically UTF-16 with some special cases.

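To make the two candidate lengths concrete, here is a minimal standalone sketch that encodes a single code point both ways. It assumes standard UTF-16 and standard UTF-8 only and does not model the JVM's modified UTF-8; `encode_utf16`, `encode_utf8` and `to_byte` are hypothetical helpers written for this illustration, not HotSpot functions. For U+10384 it yields two UTF-16 code units (D800, DF84) and four UTF-8 bytes (F0, 90, 8E, 84), matching the example quoted above.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only; none of this is HotSpot code.
// Both encoders assume cp is a valid Unicode scalar value
// (at most 0x10FFFF and not itself a surrogate).

static uint8_t to_byte(uint32_t x) { return static_cast<uint8_t>(x); }

// Encode one code point as standard UTF-16 code units.
static std::vector<uint16_t> encode_utf16(uint32_t cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x10000) {
    return { static_cast<uint16_t>(cp) };
  }
  uint32_t v = cp - 0x10000;  // 20 bits, split 10/10 across the pair
  return { static_cast<uint16_t>(0xD800 + (v >> 10)),
           static_cast<uint16_t>(0xDC00 + (v & 0x3FF)) };
}

// Encode one code point as standard (not modified) UTF-8 bytes.
static std::vector<uint8_t> encode_utf8(uint32_t cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    return { to_byte(cp) };
  }
  if (cp < 0x800) {
    return { to_byte(0xC0 | (cp >> 6)),
             to_byte(0x80 | (cp & 0x3F)) };
  }
  if (cp < 0x10000) {
    return { to_byte(0xE0 | (cp >> 12)),
             to_byte(0x80 | ((cp >> 6) & 0x3F)),
             to_byte(0x80 | (cp & 0x3F)) };
  }
  return { to_byte(0xF0 | (cp >> 18)),
           to_byte(0x80 | ((cp >> 12) & 0x3F)),
           to_byte(0x80 | ((cp >> 6) & 0x3F)),
           to_byte(0x80 | (cp & 0x3F)) };
}
```
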
See for example this snippet from `java_lang_String::create_from_str(const char* utf8_str, T/
```c++
  int length = UTF8::unicode_length(utf8_str, is_latin1, has_multibyte);
  if (!CompactStrings) {
    has_multibyte = true;
    is_latin1 = false;
  }

  Handle h_obj = basic_create(length, is_latin1, CHECK_NH);
  if (length > 0) {
    if (!has_multibyte) {
      const jbyte* src = reinterpret_cast<const jbyte*>(utf8_str);
      ArrayAccess<>::arraycopy_from_native(src, value(h_obj()), typeArrayOopDesc::element_offset<jbyte>(0), length);
    } else if (is_latin1) {
      UTF8::convert_to_unicode(utf8_str, value(h_obj())->byte_at_addr(0), length);
    } else {
      UTF8::convert_to_unicode(utf8_str, value(h_obj())->char_at_addr(0), length);
    }
  }
```

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802504344