RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
Casper Norrbin
cnorrbin at openjdk.org
Wed Oct 16 08:48:13 UTC 2024
On Wed, 16 Oct 2024 07:24:48 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:
>> I'd hope that the answer is 4, but I suspect that the answer is 5 (UTF-16). When the class `UTF8` talks about "unicode", it seems to be talking about whatever encoding Java's strings are, which AFAIK is basically UTF-16 with some special cases.
>
> See for example this snippet from `java_lang_String::create_from_str(const char* utf8_str, T/
> ```c++
> int length = UTF8::unicode_length(utf8_str, is_latin1, has_multibyte);
> if (!CompactStrings) {
>   has_multibyte = true;
>   is_latin1 = false;
> }
>
> Handle h_obj = basic_create(length, is_latin1, CHECK_NH);
> if (length > 0) {
>   if (!has_multibyte) {
>     const jbyte* src = reinterpret_cast<const jbyte*>(utf8_str);
>     ArrayAccess<>::arraycopy_from_native(src, value(h_obj()), typeArrayOopDesc::element_offset<jbyte>(0), length);
>   } else if (is_latin1) {
>     UTF8::convert_to_unicode(utf8_str, value(h_obj())->byte_at_addr(0), length);
>   } else {
>     UTF8::convert_to_unicode(utf8_str, value(h_obj())->char_at_addr(0), length);
>   }
> }
> ```
The `len` in all these methods is the number of Unicode characters, which, yes, can be less than the length of the UTF-8 byte array.
---
`UTF8::convert_to_unicode` uses `UTF8::next`, as do the new UTF8 equals/hash functions I added. Part of `UTF8::next` looks like this:
```c++
switch ((ch = ptr[0]) >> 4) {
  default:
    ... work ... /* 1-byte character */
    break;
  case 0x8: case 0x9: case 0xA: case 0xB: case 0xF:
    /* Shouldn't happen. */
    break;
  case 0xC: case 0xD:
    /* 110xxxxx 10xxxxxx */
    ... work ... /* 2-byte character */
    break;
  case 0xE:
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    ... work ... /* 3-byte character */
    break;
} /* end of switch */
```
In this code, 4-byte UTF-8 sequences are not converted, which leads me to believe that we do not support that range of characters. With this restriction we also never get two-unit (4-byte) UTF-16 characters, since every 3-byte UTF-8 sequence fits in a single 2-byte UTF-16 unit.
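
The claim that a 3-byte sequence fits in one UTF-16 unit follows from the payload bits: 4 + 6 + 6 = 16. A rough sketch of a decoder with the same case structure shows this. `decode_next` is a hypothetical function written for this illustration, not HotSpot's `UTF8::next`. It assumes the continuation bytes are present and well-formed, and returning `nullptr` for unhandled lead bytes is a choice made here, not something the original code does.

```cpp
#include <cstdint>

// Hypothetical decoder for illustration only; NOT HotSpot's UTF8::next.
// Decodes one 1- to 3-byte sequence starting at p into *out and returns a
// pointer just past it. Returns nullptr for lead bytes it does not handle
// (0x8-0xB and 0xF), which is an assumption of this sketch.
// Assumes the required continuation bytes are present and well-formed.
static const unsigned char* decode_next(const unsigned char* p, std::uint16_t* out) {
  const unsigned char ch = p[0];
  switch (ch >> 4) {
    default:  // 0x0-0x7: 0xxxxxxx, 7 payload bits
      *out = ch;
      return p + 1;
    case 0x8: case 0x9: case 0xA: case 0xB: case 0xF:
      // Continuation byte in lead position, or a 4-byte lead byte.
      return nullptr;
    case 0xC: case 0xD:  // 110xxxxx 10xxxxxx, 5 + 6 = 11 payload bits
      *out = static_cast<std::uint16_t>(((ch & 0x1F) << 6) | (p[1] & 0x3F));
      return p + 2;
    case 0xE:  // 1110xxxx 10xxxxxx 10xxxxxx, 4 + 6 + 6 = 16 payload bits
      *out = static_cast<std::uint16_t>(((ch & 0x0F) << 12) |
                                        ((p[1] & 0x3F) << 6) |
                                        (p[2] & 0x3F));
      return p + 3;
  }
  return nullptr;  // not reached; keeps compilers quiet about the return path
}
```

The largest 3-byte sequence, `EF BF BF`, decodes to `0xFFFF`, which is exactly the top of the single-unit range.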
So, to answer the question: I do not know what would happen. I believe it would be undefined behaviour, since this character would not be supported, just as it would be if we first used `UTF8::convert_to_unicode` and then compared two UTF-16 strings.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802636657