RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
Johan Sjölen
jsjolen at openjdk.org
Wed Oct 16 07:27:13 UTC 2024
On Wed, 16 Oct 2024 07:23:35 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:
>> Actually that in itself is not enough to make the code correct AFAICS. Consider this example from Table 2-11, page 71, in https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf : the Unicode character 00010384 in UTF-32 consists of the surrogate pair D800, DF84 in UTF-16, and the four-byte sequence F0, 90, 8E, 84 in UTF-8. So if we have a java.lang.String that represents this single Unicode character, the String will consist of an array of 2 char values, and the UTF-8 representation would consist of 4 byte values. So if you were doing an equals comparison between the String's value array and the utf_str, what length would you pass to the equals method?
>
> I'd hope that the answer is 4, but I suspect that the answer is 2 (UTF-16). When the class `UTF8` talks about "unicode", it seems to mean whatever encoding Java's strings use, which AFAIK is basically UTF-16 with some special cases.
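To make the two candidate lengths concrete, here is a minimal standalone sketch that counts the UTF-16 code units needed for a UTF-8 byte sequence. `utf16_units` is a hypothetical helper written for this mail, not HotSpot's `UTF8::unicode_length`, and it assumes well-formed standard UTF-8 (not the JVM's modified UTF-8, which encodes supplementary characters differently):

```c++
#include <cstddef>

// Hypothetical helper, not HotSpot code: counts the UTF-16 code units needed
// to represent a well-formed standard UTF-8 byte sequence of n bytes.
// No validation is done; malformed input gives a meaningless result.
static size_t utf16_units(const unsigned char* s, size_t n) {
  size_t units = 0;
  size_t i = 0;
  while (i < n) {
    unsigned char lead = s[i];
    if (lead < 0x80) {          // 1-byte sequence (ASCII)
      i += 1; units += 1;
    } else if (lead < 0xE0) {   // 2-byte sequence
      i += 2; units += 1;
    } else if (lead < 0xF0) {   // 3-byte sequence (rest of the BMP)
      i += 3; units += 1;
    } else {                    // 4-byte sequence: needs a surrogate pair
      i += 4; units += 2;
    }
  }
  return units;
}
```

For the bytes F0, 90, 8E, 84 from the example above, the input length is 4 and this function returns 2, which are the two lengths under discussion.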
See for example this snippet from `java_lang_String::create_from_str(const char* utf8_str, T/
```c++
int length = UTF8::unicode_length(utf8_str, is_latin1, has_multibyte);
if (!CompactStrings) {
  has_multibyte = true;
  is_latin1 = false;
}

Handle h_obj = basic_create(length, is_latin1, CHECK_NH);
if (length > 0) {
  if (!has_multibyte) {
    const jbyte* src = reinterpret_cast<const jbyte*>(utf8_str);
    ArrayAccess<>::arraycopy_from_native(src, value(h_obj()), typeArrayOopDesc::element_offset<jbyte>(0), length);
  } else if (is_latin1) {
    UTF8::convert_to_unicode(utf8_str, value(h_obj())->byte_at_addr(0), length);
  } else {
    UTF8::convert_to_unicode(utf8_str, value(h_obj())->char_at_addr(0), length);
  }
}
```
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802504344