RFR: 8311906: Improve robustness of String constructors with mutable array inputs

Wed Nov 8 18:40:07 UTC 2023

On Mon, 30 Oct 2023 18:34:44 GMT, Roger Riggs <rriggs at openjdk.org> wrote:

> Strings, after construction, are immutable but may be constructed from mutable arrays of bytes, characters, or integers.
> The string constructors should guard against the effects of mutating the arrays during construction that might invalidate internal invariants for the correct behavior of operations on the resulting strings. In particular, a number of operations have optimizations for operations on pairs of latin1 strings and pairs of non-latin1 strings, while operations between latin1 and non-latin1 strings use a more general implementation. 
> 
> The changes include:
> 
> - Adding a warning to each constructor with an array as an argument to indicate that the results are indeterminate 
>   if the input array is modified before the constructor returns. 
>   The resulting string may contain any combination of characters sampled from the input array.
> 
> - Ensure that strings that are represented as non-latin1 contain at least one non-latin1 character.
>   For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or another encoding decoded to latin1 the scanning and compression is unchanged.
>   If a non-latin1 character is found, the string is represented as non-latin1 with the added verification that a non-latin1 character is present at the same index.
>   If that character is found to be latin1, then the input array has been modified and the result of the scan may be incorrect.
>   Though a ConcurrentModificationException could be thrown, the risk to an existing application of an unexpected exception should be avoided.
>   Instead, the non-latin1 copy of the input is re-scanned and compressed; that scan determines whether the latin1 or the non-latin1 representation is returned.
> 
> - The methods that scan for non-latin1 characters and their intrinsic implementations are updated to return the index of the non-latin1 character.
> 
> - String construction from StringBuilder and CharSequence must also be guarded as their contents may be modified during construction.

First take ;-)
More will follow.

src/java.base/share/classes/java/lang/String.java line 602:

> 600:                 }
> 601:                 this.value = utf16;
> 602:                 this.coder = (utf16.length == dp) ? LATIN1 : UTF16;

Is it possible to have `utf16.length == dp` here? I think the `coder` can only be `UTF16`.

src/java.base/share/classes/java/lang/StringLatin1.java line 37:

> 35: import jdk.internal.util.ArraysSupport;
> 36: import jdk.internal.util.DecimalDigits;
> 37: import jdk.internal.vm.annotation.ForceInline;

This annotation does not seem to be used.

src/java.base/share/classes/java/lang/StringUTF16.java line 158:

> 156:      * {@return an encoded byte[] for the UTF16 characters in char[]}
> 157:      * **Only** use this if it is known that at least one character is UTF16.
> 158:      * Otherwise, an untrusted char array may have racy contents and really be latin1.

While this is a good advice, it turns out that the `compress()` method below invokes this method _without_ knowing for sure if the provided `value` contains at least one non-latin1 `char` when this method is invoked or while it runs: it only _assumes_ so, and indeed prudentially checks afterwards.

It should be suggested to check the result after invoking this method if `value` is untrusted.

src/java.base/share/classes/java/lang/StringUTF16.java line 198:

> 196:      * @param val   a char array
> 197:      * @param off   starting offset
> 198:      * @param count length of chars to be compressed, length > 0

Suggestion:

     * @param count count of chars to be compressed, {@code count} > 0

src/java.base/share/classes/java/lang/StringUTF16.java line 214:

> 212:             }
> 213:         }
> 214:         return latin1;     // latin1 success

The original version of this `public` method can return `null` to signal failure to compress. Does this change impact callers that might expect `null`?

src/java.base/share/classes/java/lang/StringUTF16.java line 226:

> 224:      * @param val   a byte array with UTF16 coding
> 225:      * @param off   starting offset
> 226:      * @param count length of chars to be compressed, length > 0

Suggestion:

     * @param count count of chars to be compressed, {@code count} > 0

src/java.base/share/classes/java/lang/StringUTF16.java line 232:

> 230:         int ndx = compress(val, off, latin1, 0, count);
> 231:         if (ndx != count) {// Switch to UTF16
> 232:             byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + count) << 1);

Not sure if the left shifts do not overflow on this `public` method. If that happens, the outcomes could be non-negative, so the copy would succeed but be kind of corrupted.

src/java.base/share/classes/java/lang/StringUTF16.java line 240:

> 238:             }
> 239:         }
> 240:         return latin1;     // latin1 success

See the `compress()` above for a remark on `null` as a return value.

src/java.base/share/classes/java/lang/StringUTF16.java line 319:

> 317:             int codePoint = val[off];           // read each codepoint from val only once
> 318:             int dstLimit = dstOff
> 319:                     + (Character.isBmpCodePoint(codePoint) ? 1 : 2)

Suggestion:

                    + Character.charCount(codePoint)

This method was introduced in 1.5, so should be safe to use even for backports.

src/java.base/share/classes/java/lang/StringUTF16.java line 411:

> 409:             return 2;
> 410:         } else
> 411:             throw new IllegalArgumentException(Integer.toString(codePoint));

Maybe `Character.charCount()` can be used here, although it returns 2 even for invalid codepoints.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16425#pullrequestreview-1720680695
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386992886
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386829351
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386885841
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905655
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386923763
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905964
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386915873
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386927514
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386951908
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386997067