RFR: 8338257: UTF8 lengths should be size_t not int [v7]

Thu Aug 29 08:58:24 UTC 2024

On Thu, 29 Aug 2024 02:45:53 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> This work has been split out from JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths
>> 
>> The modified UTF-8 format used by the VM can require up to six bytes to represent one unicode character, but six byte characters are stored as UTF-16 surrogate pairs. Hence the most bytes per character is 3, and so the maximum length is 3*`Integer.MAX_VALUE`.  Though with compact strings this reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should therefore define UTF8 lengths as `size_t` to accommodate all possible representations. Higher-level API's can still use `int` if they know the strings (eg symbols) are sufficiently constrained in length.  See the comments in utf8.hpp that explain Strings, compact strings and the encoding.
>> 
>> As the existing JNI `GetStringUTFLength` still requires the current truncating behaviour of ` UNICODE::utf8_length` we add back `UNICODE::utf8_length_as_int` for it to use.
>> 
>> Note that some API's, like ` UNICODE::as_utf8(const T* base, size_t& length)` use `length` as an IN/OUT parameter: it is the incoming (int) length of the jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. This makes some of the call sites a little messy with casts.
>> 
>> Testing:
>>  - tiers 1-4
>>  - GHA
>
> David Holmes has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision:
> 
>  - Merge branch 'master' into 8338257-utf8-length
>  - Extra assertion requested by tstuefe
>  - more missing casts
>  - fix cast
>  - missing cast
>  - Fix incorrect comments and size_t use per Dean's review
>  - Add missing cast for signed-to-unsigned converion.
>  - unnecessary cast
>  - Fix comments
>  - Fix off-by-one error
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/8c60a602...9dce4ffb

I think this patch was onerous but valuable work. I have one question inline.

Other than that, I think most of the cases where you modified calls to `SymbolTable` to feed in the new size_t length, but checked, could have just continued to use the old int length.

But I did not find any errors here. I'll mark this as approved. Up to you if you address my inline concern.

src/hotspot/share/classfile/javaClasses.hpp line 138:

> 136:   // Legacy variants that truncate the length if needed
> 137:   static int    utf8_length_as_int(oop java_string);
> 138:   static int    utf8_length_as_int(oop java_string, typeArrayOop string_value);

I don't get the point of this variant of the function. It takes a string and a typearray. What is the contract here, is the only value allowed for typearray the array oop underlying the string? If yes, why do you assert for value equality in the function? That implies I can feed in any typearrayoop here as long as it has the same value as the string.

I can only see a single point where this function is used, so that does not explain much. Maybe I am overlooking something, but why not just inline the code into that one using call site?

-------------

Marked as reviewed by stuefe (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/20560#pullrequestreview-2268097833
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1735827631