RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
Johan Sjölen
jsjolen at openjdk.org
Fri Oct 18 11:41:15 UTC 2024
On Fri, 18 Oct 2024 09:49:19 GMT, Casper Norrbin <cnorrbin at openjdk.org> wrote:
>> @caspernorrbin - Aarggh! Sorry I picked a bad example - the JVM uses modified-UTF8 which does not support 4-byte UTF8 encodings. Any UTF16 surrogate pair would be encoded as two three-byte modifed-UTF8 sequences.
>>
>>> The len in all these methods is the number of unicode characters
>>
>> Are you sure you meant "characters" there?
>>
>> Let me try to back up a bit and go to the code that I have a problem with:
>>
>> bool java_lang_String::equals(oop java_string, const char* utf8_string, int num_unicode_points) {
>> assert(java_string->klass() == vmClasses::String_klass(),
>> "must be java_string");
>> typeArrayOop value = java_lang_String::value_no_keepalive(java_string);
>> int length = java_lang_String::length(java_string, value);
>> if (length != num_unicode_points) {
>> return false;
>> }
>>
>> The incoming String represents a single unicode character encoded as a surrogate pair - so `length == 2`. Each element in the surrogate pairs encodes as a 3-byte modified-UTF8 sequence, so the length of `utf8_string` is 6. For the length check to pass the incoming value of `num_unicode_points` has to be 2. So for this to succeed the caller always has to pass in the number of UTF16 elements that comprise the `java_string`. As Johan noted in the utf8.cpp file when it refers to "unicode" it really means UTF16, so in that sense `num_unicode_points` is not incorrect. But it means those `len` parameters should also be named the same way.
>
> You're right, I meant the number of code points..
>
> The `num_unicode_points` in this case would come from `UTF8::unicode_length`, which would give 2 from the 6-byte UTF8 string in your example. But I agree that it is somewhat unclear, especially if someone were to use this function in the future.
>
> I have two ideas how this could be improved:
> I could either add clarifying comments and change naming across the different functions to be clearer on how to use and call correctly. Or I could instead change it so that the function takes the UTF8-length instead of the UTF16-length. That way you always pass the array length to the function.
I just want consistency and clarity in the header prototypes, for me either option is fine.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1806353639
More information about the hotspot-dev
mailing list