RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)

Wed Oct 16 08:48:13 UTC 2024

On Wed, 16 Oct 2024 07:24:48 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:

>> I'd hope that the answer is 4, but I suspect that the answer is 5 (UTF-16). When the class `UTF8` talks about "unicode", it seems to be talking about whatever encoding Java's strings are, which AFAIK is basically UTF-16 with some special cases.
>
> See for example this snippet from `java_lang_String::create_from_str(const char* utf8_str, T/
> ```c++
>   int length = UTF8::unicode_length(utf8_str, is_latin1, has_multibyte);
>   if (!CompactStrings) {
>     has_multibyte = true;
>     is_latin1 = false;
>   }
> 
>   Handle h_obj = basic_create(length, is_latin1, CHECK_NH);
>   if (length > 0) {
>     if (!has_multibyte) {
>       const jbyte* src = reinterpret_cast<const jbyte*>(utf8_str);
>       ArrayAccess<>::arraycopy_from_native(src, value(h_obj()), typeArrayOopDesc::element_offset<jbyte>(0), length);
>     } else if (is_latin1) {
>       UTF8::convert_to_unicode(utf8_str, value(h_obj())->byte_at_addr(0), length);
>     } else {
>       UTF8::convert_to_unicode(utf8_str, value(h_obj())->char_at_addr(0), length);
>     }
>   }
> ```

The `len` in all these methods is the number of Unicode characters, which, yes, can be less than the number of bytes in the UTF-8 array.

---

`UTF8::convert_to_unicode` uses `UTF8::next`, which I also use in the new UTF8 equals/hash functions. Part of `UTF8::next` looks like this:

```c++
  switch ((ch = ptr[0]) >> 4) {
    default:
    ... work ... /* 1-byte character */
    break;

  case 0x8: case 0x9: case 0xA: case 0xB: case 0xF:
    /* Shouldn't happen. */
    break;

  case 0xC: case 0xD:
    /* 110xxxxx  10xxxxxx */
    ... work ... /* 2-byte character */
    break;

  case 0xE:
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    ... work ... /* 3-byte character */
    break;
  } /* end of switch */
```

In this code, 4-byte UTF-8 sequences are not converted, which leads me to believe that we do not support that range of characters. With this restriction we also never get two-unit (4-byte) UTF-16 characters, since every 3-byte UTF-8 sequence fits in a single (2-byte) UTF-16 code unit.
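
The claim that a 3-byte sequence always fits in one 16-bit unit can be checked with a small sketch: a 3-byte sequence carries 4 + 6 + 6 = 16 payload bits, so its largest value is 0xFFFF. `decode_next` below is a hypothetical stand-in that mirrors the shape of the switch above. It is not the actual `UTF8::next`, and its handling of unsupported lead bytes (returning the raw byte and advancing by one) is an assumption made only for illustration:

```c++
#include <cstdint>

// Hypothetical sketch (not HotSpot's UTF8::next): decodes one 1-, 2-, or
// 3-byte sequence into a single 16-bit unit and returns a pointer just past
// it. Assumes NUL-terminated input. Unsupported or malformed lead bytes are
// passed through as their raw byte value, one byte at a time.
static const unsigned char* decode_next(const unsigned char* p, uint16_t* out) {
  const unsigned char ch = p[0];
  switch (ch >> 4) {
    case 0xC: case 0xD:  // 110xxxxx 10xxxxxx
      if ((p[1] & 0xC0) == 0x80) {
        *out = static_cast<uint16_t>(((ch & 0x1F) << 6) | (p[1] & 0x3F));
        return p + 2;
      }
      break;
    case 0xE:  // 1110xxxx 10xxxxxx 10xxxxxx
      if ((p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
        *out = static_cast<uint16_t>(((ch & 0x0F) << 12) |
                                     ((p[1] & 0x3F) << 6) |
                                     (p[2] & 0x3F));
        return p + 3;
      }
      break;
    default:  // 1-byte character, or a lead byte this sketch does not handle
      break;
  }
  *out = ch;
  return p + 1;
}
```

With this sketch, the bytes `E2 82 AC` decode to the single unit 0x20AC, while a 4-byte lead byte such as `F0` is not decoded as a character at all.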

So as for the question, I do not know what would happen. I believe the behaviour would be undefined, since this character is not supported. The same would apply if we first used `UTF8::convert_to_unicode` and then compared two UTF-16 strings.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802636657