RFR: 8338257: UTF8 lengths should be size_t not int

Coleen Phillimore coleenp at openjdk.org
Thu Aug 15 20:47:51 UTC 2024


On Tue, 13 Aug 2024 02:20:41 GMT, David Holmes <dholmes at openjdk.org> wrote:

> This work has been split out from JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths
> 
> The modified UTF-8 format used by the VM can require up to six bytes to represent one unicode character, but six byte characters are stored as UTF-16 surrogate pairs. Hence the most bytes per character is 3, and so the maximum length is 3*`Integer.MAX_VALUE`.  Though with compact strings this reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should therefore define UTF8 lengths as `size_t` to accommodate all possible representations. Higher-level API's can still use `int` if they know the strings (eg symbols) are sufficiently constrained in length.  See the comments in utf8.hpp that explain Strings, compact strings and the encoding.
> 
> As the existing JNI `GetStringUTFLength` still requires the current truncating behaviour of ` UNICODE::utf8_length` we add back `UNICODE::utf8_length_as_int` for it to use.
> 
> Note that some API's, like ` UNICODE::as_utf8(const T* base, size_t& length)` use `length` as an IN/OUT parameter: it is the incoming (int) length of the jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. This makes some of the call sites a little messy with casts.
> 
> Testing:
>  - tiers 1-4
>  - GHA

One thing that I think needs changing and some comments, but your change strikes a good balance of promoting enough parameters to size_t but not too many.
Did you try to compile this with -Wconversion (not -Wsign-conversion) without -Werror?

It doesn't look like GHA is configured for you here.

src/hotspot/share/classfile/javaClasses.cpp line 586:

> 584:     ResourceMark rm;
> 585:     jbyte* position = (length == 0) ? nullptr : value->byte_at_addr(0);
> 586:     size_t utf8_len = length;

These are sign conversions.  This is too big to change here and not sure what the fan out would be but java_lang_String::length() should return unsigned and also arrayOop.hpp length.  They're never negative.  We should probably assert that (below).

src/hotspot/share/classfile/javaClasses.cpp line 639:

> 637:   if (length == 0) {
> 638:     return 0;
> 639:   }

Maybe assert length > 0 here?

src/hotspot/share/classfile/javaClasses.cpp line 702:

> 700:   } else {
> 701:     jbyte* position = (length == 0) ? nullptr : value->byte_at_addr(0);
> 702:     return UNICODE::as_utf8(position, static_cast<int>(length), buf, buflen);

Don't you want checked_cast here not static casts.

src/hotspot/share/prims/jni.cpp line 2226:

> 2224:  HOTSPOT_JNI_GETSTRINGUTFLENGTH_ENTRY(env, string);
> 2225:   oop java_string = JNIHandles::resolve_non_null(string);
> 2226:   jsize ret = java_lang_String::utf8_length_as_int(java_string);

So the spec says that this should be jsize (signed int), which is why this is, right?

src/hotspot/share/utilities/utf8.cpp line 512:

> 510: 
> 511: template<typename T>
> 512: int UNICODE::utf8_length_as_int(const T* base, int length) {

Why was this parameter left as int and not size_t ?

src/hotspot/share/utilities/utf8.hpp line 47:

> 45: 
> 46: In the code below if we have latin-1 content then we treat the String's data
> 47: array as a jbyte[], else a jchar[]. The lengths of these arrays are specified

jchar is 16-bits right?  So the max length if not latin-1 is INT_MAX/2 ?

src/hotspot/share/utilities/utf8.hpp line 70:

> 68: 
> 69: */
> 70: 

This is a good place for this comment.  One could find it again if they need to understand this.

-------------

PR Review: https://git.openjdk.org/jdk/pull/20560#pullrequestreview-2241255691
PR Comment: https://git.openjdk.org/jdk/pull/20560#issuecomment-2292205627
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718945451
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718948865
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718950816
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718954620
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718959373
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718964159
PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1718966422


More information about the serviceability-dev mailing list