RFR: 8311906: Improve robustness of String constructors with mutable array inputs [v2]

Fri Nov 10 16:44:00 UTC 2023

On Fri, 10 Nov 2023 14:59:57 GMT, Raffaello Giulietti <rgiulietti at openjdk.org> wrote:

>> Roger Riggs has updated the pull request incrementally with three additional commits since the last revision:
>> 
>>  - Refactored extractCodePoints to avoid multiple resizes if the array was modified
>>  - Replaced isLatin1 implementation with `getChar(buf, ndx) <= 0xff`
>>    It performs better than the single byte array access by avoiding the bounds check.
>>  - Misc updates for review comments, javadoc cleanup
>>    Extra checking on maximum string lengths when calling toBytes().
>
> src/java.base/share/classes/java/lang/StringUTF16.java line 279:
> 
>> 277:             } else {
>> 278:                 // Pass 1: Compute precise size of char[]; see extractCodePoints for caveat
>> 279:                 int estSize = ndx + computeCodePointSize(val, off, end);
> 
> To avoid reallocations in `extractCodepoints()`, a way is to profit from the presence of `latin1[]`, putting `latin1[i] = (byte) (cp >>> 16)`, starting from `ndx`, while scanning the `val[]` in `computeCodePointSize()` to preserve information about the upper bits of the codepoint.
> 
> Later, while copying the `val[]` codepoints to `utf16[]`, the info in `latin1[]` is included in the `cp` just read from `val[]`, for example as in `cp = (cp & 0xffff) | ((latin1[i] & 0xff) << 16)`.
> 
> The resulting codepoint would be BMP or supplementary as when it was scanned during `computeCodePointSize()`, even in presence of later modifications, and preserving the original value if it wasn't modified in the meantime. Since the info about a codepoint needing 1 or 2 chars in `utf16[]` is preserved in `latin1[]`, there should be no need for reallocations.

Interesting idea, but it might mean that if the codepoint val[i] was modified it could result in a cp that did not (ever) exist in the input; creating a value out of thin air.  The high bits would be from the first read of val[i] and the low bits from the 2nd read. The code in extractCodepoints could be simpler and computeCodePointSize just a little more comples. Creating values out of thin air is usually bad and could have (different) unexpected results in the app.

The additional writes to the latin1 array would also slow down the normal case of computing the size whether or not the input was modified.
On that basis, I'd keep the current approach to resizing only if needed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1389637034