RFR: 8311906: Race condition in String constructor

Mon Sep 25 15:45:12 UTC 2023

On Mon, 25 Sep 2023 12:28:40 GMT, Chen Liang <liach at openjdk.org> wrote:

> In the constructor of String, many locations the user-supplied byte or char arrays are read multiple times with a plain memory access; if a user previously wrote to one of such locations out of happens-before order, distinct plain memory reads may result in different unanticipated values.
> 
> The main problem caused by such error is that String constructor may incorrectly produce a UTF16 coder string with all-LATIN1 compatible characters when `COMPACT_STRING` is true, which breaks the contract of String. (The error can happen the other way around, but the resulting LATIN1 string is valid; this patch does not address that.)
> 
> Thus, I modified the String data compression for non-trusted arrays: a LATIN1 compression first-pass is still done, but if the first compression fails, a second compression pass is done on a trusted (that is, copied from the original data) data where reading would be consistent. The approach takes a toll on UTF16 string construction time, but should not be more costly memory-wise.
> 
> A separate routine to decode UTF8 in String constructor that takes byte encoding has the same multi-read problem, that the old `offset--` leads to a problematic double read. This is resolved by copying the data to decode to a local array at first instead of reading from the user-provided byte array. This fix also costs more runtime but at no extra memory cost.
> 
> Internal APIs such as newStringNoRepl are not guarded by this patch, as they are already trusted to be immutable and unshared.
> 
> `test/jdk/java/lang/String` tests pass. More testing is needed to see if there are other edge cases not covered.
> 
> Please review and don't hesitate to critique my approach and patch.

A side comment: I don't think it's problematic if we incorrectly create a LATIN1 String (such as by downcasting a char to byte), for such a String is valid and it's user's fault (for not publishing their writes to the array in happens-before order). We only think about avoiding writing a values array with no significant byte: we can just write any non-trivial 2-byte into UTF16 to make it valid, and that's why I'm looking for compress to return a `twoByte << 32 | consumedLength`. Such a task of writing a 2-byte should be easy to accomplish.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15902#issuecomment-1733990052