RFR: 8338257: UTF8 lengths should be size_t not int [v5]

David Holmes dholmes at openjdk.org
Tue Aug 27 12:23:06 UTC 2024


On Tue, 27 Aug 2024 07:51:38 GMT, Dean Long <dlong at openjdk.org> wrote:

>> I'm trying to work out whether, on 32-bit, we could even create a string large enough for this to be a problem. Once we have the giant string, `as_utf8` will have to allocate an array that is just as large, if not larger. So for overflow to be an issue we would need a string of length INT_MAX, which means 2GB of string storage, and then we would have to allocate a 2GB resource array as well. That means allocating 4GB, which is our entire address space on 32-bit. So I don't think we can ever hit a case on 32-bit where the size_t UTF-8 length would convert to a negative int.
>
> I think the Java string would only need to be INT_MAX/3 characters in length, if all the characters require surrogate encoding.

IIUC, for compact strings: with a non-latin-1 (UTF-16) string, each pair of stored bytes requires at most 3 bytes to encode, so you'd need about 2/3 of INT_MAX bytes of string storage. With latin-1 it would be 1/2 of INT_MAX, since each stored byte encodes to at most 2 bytes. But yes, I suppose in theory you might be able to get an overflow on 32-bit. Need to think more about what could even be done for this case ... and whether it is worth trying ...
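
To make the arithmetic concrete, here is a minimal standalone C++ sketch (not HotSpot code, purely illustrative; the narrowing cast is implementation-defined, though it wraps to a negative value on typical two's-complement targets):

#include <climits>
#include <cstddef>
#include <cstdio>

int main() {
  // Worst-case modified UTF-8 expansion discussed above:
  //   latin-1 string : 1 stored byte  -> at most 2 UTF-8 bytes
  //   UTF-16 string  : 2 stored bytes -> at most 3 UTF-8 bytes
  size_t nchars   = (size_t)INT_MAX / 3 + 1;  // smallest UTF-16 char count whose
  size_t utf8_len = 3 * nchars;               // UTF-8 form exceeds INT_MAX
  int narrowed    = (int)utf8_len;            // what an int-typed length would hold
  // On a 32-bit target this needs ~1.4GB of String storage plus a ~2GB UTF-8
  // resource array -- in theory under the 4GB address space, though the OS
  // user/kernel split may make it unreachable in practice.
  std::printf("chars=%zu storage=%zu utf8_len=%zu as int=%d\n",
              nchars, 2 * nchars, utf8_len, narrowed);
  return 0;
}

On a typical target the printed int value comes out negative, which is the kind of sign flip being discussed here.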

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1732741739

