RFR: 8338257: UTF8 lengths should be size_t not int [v5]

Tue Aug 27 07:54:04 UTC 2024

On Tue, 27 Aug 2024 07:20:27 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> src/hotspot/share/classfile/javaClasses.cpp line 588:
>> 
>>> 586:     size_t utf8_len = static_cast<size_t>(length);
>>> 587:     const char* base = UNICODE::as_utf8(position, utf8_len);
>>> 588:     Symbol* sym = SymbolTable::new_symbol(base, checked_cast<int>(utf8_len));
>> 
>> With the current limitations of checked_cast(), we would also need to check if the result is negative on 32-bit platforms, because then size_t and int will be the same size, and checked_cast will never complain.
>
> I'm trying to reason about whether, on 32-bit, we could even create a string large enough for this to be a problem. Once we have the giant string, `as_utf8` will have to allocate an array that is just as large, if not larger. So for overflow to be an issue we need a string of length INT_MAX, which is limited to 2GB, and then we have to allocate a resource array of 2GB as well. So we need to have allocated 4GB, which is our entire address space on 32-bit. So I don't think we can ever hit a problem on 32-bit where the size_t utf8 length would convert to a negative int.

I think the Java string would only need to be just over INT_MAX/3 in length if every char encodes to three bytes of UTF-8, for example if all the characters require surrogate encoding.
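
To make the arithmetic concrete, here is a rough standalone sketch (not HotSpot code). The helper names are hypothetical, and it assumes each Java char costs 3 UTF-8 bytes in the worst case and that int is 32 bits wide. A 32-bit size_t is simulated with uint32_t:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: worst-case encoded size, assuming every UTF-16
// code unit in the Java string needs 3 bytes of UTF-8.
constexpr uint64_t worst_case_utf8_bytes(uint64_t java_length) {
  return java_length * 3;
}

// True if the byte count is representable as a non-negative 32-bit int.
constexpr bool fits_in_int32(uint64_t utf8_len) {
  return utf8_len <= static_cast<uint64_t>(INT32_MAX);
}

// Smallest Java string length whose worst-case encoding exceeds INT32_MAX.
constexpr uint64_t first_overflowing_length() {
  return static_cast<uint64_t>(INT32_MAX) / 3 + 1;
}

// Simulates the narrowing on a 32-bit platform, where size_t and int have
// the same width, so a round-trip check cannot detect the sign change.
// Assumes two's complement conversion (guaranteed from C++20 onwards).
inline int32_t narrow_like_32bit(uint64_t utf8_len) {
  assert(utf8_len <= UINT32_MAX);  // must fit the simulated 32-bit size_t
  return static_cast<int32_t>(static_cast<uint32_t>(utf8_len));
}
```

Under those assumptions, the first problematic length is 715,827,883 chars: the worst-case byte count is then 2,147,483,649, which no longer fits in a 32-bit int and comes out negative after narrowing.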

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1732326074