RFR: 8338257: UTF8 lengths should be size_t not int [v7]

Thomas Stuefe stuefe at openjdk.org
Thu Aug 29 05:56:22 UTC 2024


On Thu, 29 Aug 2024 02:45:53 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> This work has been split out from JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths
>> 
>> The modified UTF-8 format used by the VM can require up to six bytes to represent one unicode character, but six byte characters are stored as UTF-16 surrogate pairs. Hence the most bytes per character is 3, and so the maximum length is 3*`Integer.MAX_VALUE`.  Though with compact strings this reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should therefore define UTF8 lengths as `size_t` to accommodate all possible representations. Higher-level API's can still use `int` if they know the strings (eg symbols) are sufficiently constrained in length.  See the comments in utf8.hpp that explain Strings, compact strings and the encoding.
>> 
>> As the existing JNI `GetStringUTFLength` still requires the current truncating behaviour of ` UNICODE::utf8_length` we add back `UNICODE::utf8_length_as_int` for it to use.
>> 
>> Note that some API's, like ` UNICODE::as_utf8(const T* base, size_t& length)` use `length` as an IN/OUT parameter: it is the incoming (int) length of the jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. This makes some of the call sites a little messy with casts.
>> 
>> Testing:
>>  - tiers 1-4
>>  - GHA
>
> David Holmes has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision:
> 
>  - Merge branch 'master' into 8338257-utf8-length
>  - Extra assertion requested by tstuefe
>  - more missing casts
>  - fix cast
>  - missing cast
>  - Fix incorrect comments and size_t use per Dean's review
>  - Add missing cast for signed-to-unsigned converion.
>  - unnecessary cast
>  - Fix comments
>  - Fix off-by-one error
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/fb153748...9dce4ffb

Many of these translations seem awkward, since they convert to size_t only to then convert back to int.

Proposal: I undestand you need to find a good point to tourniquet off the int->size_t conversion to minimize the translations needed. But I'd consider converting SymbolTable functions to size_t too. SymbolTable already does not use the full width of the int length parameter, so functionally nothing changes (it needs to check the length for validity). 

If you worry that the changes fan out too much, at least consider converting SymbolTable::new_symbol. That appears about ten times.

-------------

PR Review: https://git.openjdk.org/jdk/pull/20560#pullrequestreview-2267738621


More information about the hotspot-dev mailing list