RFR: 8311906: Improve robustness of String constructors with mutable array inputs [v2]

Thu Nov 9 09:10:00 UTC 2023

On Mon, 6 Nov 2023 15:30:46 GMT, Roger Riggs <rriggs at openjdk.org> wrote:

>> src/java.base/share/classes/java/lang/StringUTF16.java line 202:
>> 
>>> 200:     @ForceInline
>>> 201:     public static byte[] compress(final char[] val, final int off, final int count) {
>>> 202:         byte[] latin1 = new byte[count];
>> 
>> Will this redundant array allocation be costly if we are working with mostly-utf16 strings, such as CJK strings with no latin characters?
>> 
>> I suggest a heuristic that reads the initial char; if it's utf16, we skip the latin-1 process altogether (and we can assign the utf16 value to the initial index to ensure the result is not latin-1 compressible).
>
> We can reconsider this design as a separate PR. 
> Every additional check has a performance impact and in this bug the goal is to avoid any regression.
> 
> We'll need to gain some insight into the distribution of strings when used in a non-latin1 application.
> How many of the strings are latin1 vs non-latin1, what is the distribution of string lengths, and which APIs are in use in the applications?
> The implementation is already pretty good about working with strings of different coders,
> but there may be some different choices when converting between char arrays and int arrays and strings.

Just curious, how do the results of the StringConstructor.newStringFromCharsMixedBegin benchmark change before and after this patch? It would be helpful to see how much of an impact the extra allocation has on CJK strings.
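
For reference, the first-char heuristic suggested earlier in the thread could be sketched roughly as below. This is only an illustration, not the code in the PR: the class name FirstCharHeuristicSketch and the method compressOrNull are hypothetical, the "return null when not compressible" convention is an assumption made for the example, and the sketch ignores concurrent modification of the input array, which is the main concern of this bug.

```java
// Hypothetical sketch, not JDK code: peek at the first char before
// allocating the latin-1 buffer, so mostly-utf16 input skips the allocation.
final class FirstCharHeuristicSketch {
    private FirstCharHeuristicSketch() {}

    /**
     * Returns the latin-1 bytes for val[off .. off+count), or null if any
     * char in that range does not fit in one byte (assumed convention).
     */
    static byte[] compressOrNull(char[] val, int off, int count) {
        // Heuristic: if the first char is already outside latin-1,
        // give up before allocating anything.
        if (count > 0 && val[off] > 0xFF) {
            return null;
        }
        byte[] latin1 = new byte[count];
        for (int i = 0; i < count; i++) {
            char c = val[off + i];
            if (c > 0xFF) {
                // A later char is not latin-1; the allocation was wasted.
                return null;
            }
            latin1[i] = (byte) c;
        }
        return latin1;
    }
}
```

With this shape, an all-CJK input returns null on the first comparison, while an input like the "MixedBegin" case (latin-1 prefix followed by a non-latin-1 char) still pays for the allocation, which is why the benchmark numbers would be interesting.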

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1387693255