RFR: 8338257: UTF8 lengths should be size_t not int [v5]
Dean Long
dlong at openjdk.org
Tue Aug 27 03:17:03 UTC 2024
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes <dholmes at openjdk.org> wrote:
>> This work has been split out from JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths
>>
>> The modified UTF-8 format used by the VM can require up to six bytes to represent one Unicode character, but six-byte characters are stored as UTF-16 surrogate pairs. Hence the most bytes per UTF-16 `char` is 3, and so the maximum length is 3*`Integer.MAX_VALUE`. With compact strings, however, this reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should therefore define UTF8 lengths as `size_t` to accommodate all possible representations. Higher-level APIs can still use `int` if they know the strings (e.g. symbols) are sufficiently constrained in length. See the comments in utf8.hpp that explain Strings, compact strings and the encoding.
>>
>> As the existing JNI `GetStringUTFLength` still requires the current truncating behaviour of `UNICODE::utf8_length`, we add back `UNICODE::utf8_length_as_int` for it to use.
>>
>> Note that some APIs, like `UNICODE::as_utf8(const T* base, size_t& length)`, use `length` as an IN/OUT parameter: it is the incoming (int) length of the jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. This makes some of the call sites a little messy with casts.
>>
>> Testing:
>> - tiers 1-4
>> - GHA
>
> David Holmes has updated the pull request incrementally with one additional commit since the last revision:
>
> more missing casts
src/hotspot/share/classfile/javaClasses.cpp line 588:
> 586: size_t utf8_len = static_cast<size_t>(length);
> 587: const char* base = UNICODE::as_utf8(position, utf8_len);
> 588: Symbol* sym = SymbolTable::new_symbol(base, checked_cast<int>(utf8_len));
With the current limitations of checked_cast(), we would also need to check whether the result is negative on 32-bit platforms. There size_t and int are the same size, so a length above INT_MAX simply becomes a negative int and checked_cast will never complain.
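
To illustrate the concern, here is a minimal standalone sketch, not HotSpot code. It assumes that checked_cast is in effect a round-trip comparison (cast to the target type, cast back, compare with the original). The names `round_trips`, `fits_non_negative` and `to_symbol_length` are hypothetical, and `std::uint32_t`/`std::int32_t` stand in for `size_t`/`int` on a 32-bit platform so the effect can be reproduced on any host.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical model of a round-trip checked cast: true when converting
// to To and back to From reproduces the original value.
template <typename To, typename From>
constexpr bool round_trips(From value) {
  return static_cast<From>(static_cast<To>(value)) == value;
}

// Hypothetical stricter check for unsigned -> signed conversions: the value
// must fit in the non-negative range of the signed target type.
// Assumes From is unsigned and at least as wide as To.
template <typename To, typename From>
constexpr bool fits_non_negative(From value) {
  static_assert(sizeof(From) >= sizeof(To), "From must be at least as wide as To");
  return value <= static_cast<From>(std::numeric_limits<To>::max());
}

// Models the narrowing at the new_symbol call site on a 32-bit platform:
// uint32_t plays the role of size_t, int32_t the role of int.
inline std::int32_t to_symbol_length(std::uint32_t utf8_len) {
  assert(fits_non_negative<std::int32_t>(utf8_len) && "UTF-8 length exceeds INT32_MAX");
  return static_cast<std::int32_t>(utf8_len);
}
```

With same-width types the round trip always succeeds, so `round_trips<std::int32_t, std::uint32_t>(0x80000000u)` holds even though the converted value is negative, while `fits_non_negative` rejects it. When the source type is wider, as with a 64-bit size_t, the round trip does catch values that lose high bits.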
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1732062256