RFR: 8299807: newStringNoRepl should avoid copying arrays for ASCII compatible charsets

Sat Jan 28 04:06:51 UTC 2023

On Fri, 27 Jan 2023 16:04:41 GMT, Roger Riggs <rriggs at openjdk.org> wrote:

>> This is the javadoc of `JavaLangAccess::newStringNoRepl`:
>> 
>> 
>>     /**
>>      * Constructs a new {@code String} by decoding the specified subarray of
>>      * bytes using the specified {@linkplain java.nio.charset.Charset charset}.
>>      *
>>      * The caller of this method shall relinquish and transfer the ownership of
>>      * the byte array to the callee since the later will not make a copy.
>>      *
>>      * @param bytes the byte array source
>>      * @param cs the Charset
>>      * @return the newly created string
>>      * @throws CharacterCodingException for malformed or unmappable bytes
>>      */
>> 
>> 
>> The javadoc states that the method should construct the string directly from the given byte array, avoiding an extra array allocation.
>> 
>> However, at present, `newStringNoRepl` always copies arrays for UTF-8 or other ASCII compatible charsets.
>> 
>> This PR fixes this problem.
>
> It seems odd that the benchmark seems slower for smaller files; can you suggest why that might be?
> I'd expect the size distribution for Files.readString to be biased toward the smaller files.
> Can you repeat the benchmark using the default file system.  OS file caching should eliminate the disk speed effects.

@RogerRiggs 

I reran the benchmark on the default file system, with test file sizes between 0 and 32 KiB.

The throughput of reading ASCII files as UTF-8:
![image](https://user-images.githubusercontent.com/20694662/215230987-ded941e1-6260-497a-8f50-e62f9651c320.png)

The throughput of reading ASCII files as GBK:
![image](https://user-images.githubusercontent.com/20694662/215231012-0bc715c5-6473-4152-8089-4e64128e164e.png)

Throughput improves slightly in both cases, and there is no regression.

For UTF-8 and GBK files that contain non-ASCII characters, throughput fluctuates by no more than 4%.

Test code and original results: https://gist.github.com/Glavo/f3d2060d0bd13cd0ce2add70e6060ea0?permalink_comment_id=4451350#gistcomment-4451350

> It seems odd that the benchmark seems slower for smaller files; can you suggest why that might be?

The most likely reason is the cost of the newly added `if` check in `newStringUTF8NoRepl`.

I don't think this is an important issue, because once actual I/O operations are involved, its impact is negligible.

The main purpose of this PR is to eliminate unnecessary temporary memory allocation, thus reducing GC pressure. The change in throughput is only a by-product.
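
To illustrate the idea, here is a minimal sketch of an ASCII fast path written against public API only. It is not the code in this PR: the class `AsciiFastPathSketch` and the helpers `isAscii` and `decode` are hypothetical names, and the public `String` constructor used here still copies the array. The sketch assumes that, inside `java.lang`, the same check would let the byte array be adopted directly as the string's backing storage, which public code cannot do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the JDK implementation.
final class AsciiFastPathSketch {

    // Returns true when every byte is in the 7-bit ASCII range (0..127).
    // Java bytes are signed, so any byte >= 0x80 shows up as negative.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // Decodes bytes as UTF-8, taking a cheaper route for pure ASCII input.
    static String decode(byte[] bytes) {
        if (isAscii(bytes)) {
            // ASCII bytes have the same values in ISO-8859-1, so no real
            // decoding is needed. Assumption: JDK-internal code could reuse
            // the array here instead of copying; this constructor copies.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
        // General path: full UTF-8 decoding into a new array.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The extra `isAscii` branch is the kind of check that adds a small fixed cost for very short inputs, while saving an allocation for larger ASCII inputs.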

-------------

PR: https://git.openjdk.org/jdk/pull/12119