RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
Johan Sjölen
jsjolen at openjdk.org
Wed Oct 16 07:27:13 UTC 2024
On Wed, 16 Oct 2024 00:14:30 GMT, David Holmes <dholmes at openjdk.org> wrote:
>> src/hotspot/share/classfile/stringTable.hpp line 88:
>>
>>> 86: static const jchar *to_unicode(StringWrapper wrapped_str, int len, TRAPS);
>>> 87: static Handle to_handle(StringWrapper wrapped_str, int len, TRAPS);
>>> 88: static void print_string(StringWrapper wrapped_str, int len, TRAPS);
>>
>> What is `len` supposed to represent in all of these methods? The code only makes sense to me if `len` here is actually "number of unicode characters" (which need not be the same as the length of any wrapped UTF8 sequence).
>
> Actually that in itself is not enough to make the code correct AFAICS. Consider this example from Table 2-11, page 71, in https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf The Unicode character 00010384 in UTF-32, consists of the surrogate pair D800, DF84 in UTF-16, and the four byte sequence F0, 90, 8E, 84 in UTF-8. So if we have a java.lang.String that represents this single unicode character, the String will consist of an array of 2 char values, and the UTF8 representation would consist of 4 byte values. So if you were doing an equals comparison between the String's value array and the utf_str, what length would you pass to the equals method?
I'd hope that the answer is 4, but I suspect that the answer is 2 (UTF-16). When the class `UTF8` talks about "unicode", it seems to mean whatever encoding Java's strings use, which AFAIK is basically UTF-16 with some special cases.
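
For reference, the three candidate lengths for the quoted example (U+10384) can be checked from plain Java. This is only a sketch: the class and method names are made up for illustration, and it uses the standard UTF-8 encoder from the Java library, not HotSpot's `UTF8` class, whose encoding may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK or of the PR under review.
public class EncodingLengths {
    // Number of UTF-16 code units (Java char values) needed for one code point.
    static int utf16Units(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    // Number of bytes in the standard UTF-8 encoding of one code point.
    static int utf8Bytes(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in a String (surrogate pairs count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        int cp = 0x10384; // the character from the quoted example
        String s = new String(Character.toChars(cp));
        System.out.println("code points:  " + codePoints(s));   // 1
        System.out.println("UTF-16 chars: " + utf16Units(cp));  // 2
        System.out.println("UTF-8 bytes:  " + utf8Bytes(cp));   // 4
    }
}
```

So a single `len` parameter is ambiguous between at least these three counts unless the method documents which one it expects.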
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802502642