RFR: 8299807: newStringNoRepl should avoid copying arrays for ASCII compatible charsets
Glavo
duke at openjdk.org
Sat Jan 28 04:06:51 UTC 2023
On Fri, 20 Jan 2023 16:47:27 GMT, Glavo <duke at openjdk.org> wrote:
> This is the javadoc of `JavaLangAccess::newStringNoRepl`:
>
>
> /**
> * Constructs a new {@code String} by decoding the specified subarray of
> * bytes using the specified {@linkplain java.nio.charset.Charset charset}.
> *
> * The caller of this method shall relinquish and transfer the ownership of
> * the byte array to the callee since the latter will not make a copy.
> *
> * @param bytes the byte array source
> * @param cs the Charset
> * @return the newly created string
> * @throws CharacterCodingException for malformed or unmappable bytes
> */
>
>
> The documentation states that the method should be able to construct the string directly from the byte array passed in, avoiding an extra array allocation.
>
> However, at present, `newStringNoRepl` always copies arrays for UTF-8 or other ASCII compatible charsets.
>
> This PR fixes this problem.
I ran the tier1 and tier2 tests, and there were no new errors.
The only use case affected is `Files.readString`. I measured the performance of `readString` on an in-memory file system.
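For context, the call being measured looks roughly like the following. This is a minimal sketch, not the benchmark itself: it uses a temporary file on the default file system rather than a memory file system, and the class name `ReadStringExample` and the helper `roundTrip` are made up for illustration. GBK is assumed to be available in the running JDK (it is provided by the `jdk.charsets` module).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringExample {
    // Hypothetical helper: writes the bytes to a temp file and reads them
    // back through Files.readString with the given charset.
    static String roundTrip(byte[] content, Charset cs) {
        try {
            Path file = Files.createTempFile("norepl", ".txt");
            try {
                Files.write(file, content);
                // readString reports malformed or unmappable input as an
                // IOException instead of substituting replacement characters.
                return Files.readString(file, cs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "hello, world".getBytes(StandardCharsets.US_ASCII);
        // The same ASCII bytes decode identically under both charsets.
        System.out.println(roundTrip(ascii, StandardCharsets.UTF_8));
        System.out.println(roundTrip(ascii, Charset.forName("GBK")));
    }
}
```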
baseline:
Benchmark                   (length)   Mode  Cnt        Score       Error  Units
NoRepl.testReadAscii               0  thrpt    5  5049760.744 ±  3563.324  ops/s
NoRepl.testReadAscii            1024  thrpt    5  3523083.785 ± 23747.078  ops/s
NoRepl.testReadAscii            8192  thrpt    5  2415952.140 ± 85884.289  ops/s
NoRepl.testReadAscii         1048576  thrpt    5    32425.563 ±   284.121  ops/s
NoRepl.testReadAscii        33554432  thrpt    5      872.492 ±     4.311  ops/s
NoRepl.testReadAscii       268435456  thrpt    5       58.736 ±     0.224  ops/s
NoRepl.testReadAsciiAsGBK          0  thrpt    5  4229547.832 ± 10997.381  ops/s
NoRepl.testReadAsciiAsGBK       1024  thrpt    5  3409472.580 ±   819.566  ops/s
NoRepl.testReadAsciiAsGBK       8192  thrpt    5  2179865.886 ± 16606.862  ops/s
NoRepl.testReadAsciiAsGBK    1048576  thrpt    5    32962.429 ±   249.385  ops/s
NoRepl.testReadAsciiAsGBK   33554432  thrpt    5      871.810 ±     1.812  ops/s
NoRepl.testReadAsciiAsGBK  268435456  thrpt    5       59.131 ±     0.172  ops/s
NoRepl.testReadGBK                 0  thrpt    5  4657464.074 ± 51849.667  ops/s
NoRepl.testReadGBK              1024  thrpt    5  1199083.242 ±  8653.846  ops/s
NoRepl.testReadGBK              8192  thrpt    5    75823.949 ±    46.493  ops/s
NoRepl.testReadGBK           1048576  thrpt    5      675.214 ±     1.729  ops/s
NoRepl.testReadGBK          33554432  thrpt    5       20.972 ±     2.280  ops/s
NoRepl.testReadGBK         268435456  thrpt    5        2.585 ±     0.394  ops/s
NoRepl.testReadUTF8                0  thrpt    5  5222403.558 ± 44728.740  ops/s
NoRepl.testReadUTF8             1024  thrpt    5  1417472.161 ± 23776.534  ops/s
NoRepl.testReadUTF8             8192  thrpt    5   185714.265 ±  2096.328  ops/s
NoRepl.testReadUTF8          1048576  thrpt    5     1318.763 ±    16.051  ops/s
NoRepl.testReadUTF8         33554432  thrpt    5       30.663 ±     0.114  ops/s
NoRepl.testReadUTF8        268435456  thrpt    5        3.782 ±     0.041  ops/s
This PR:
Benchmark                   (length)   Mode  Cnt        Score       Error  Units
NoRepl.testReadAscii               0  thrpt    5  5084610.141 ± 37449.826  ops/s
NoRepl.testReadAscii            1024  thrpt    5  3425713.961 ± 24060.542  ops/s
NoRepl.testReadAscii            8192  thrpt    5  2765684.248 ± 20586.103  ops/s
NoRepl.testReadAscii         1048576  thrpt    5    48074.603 ±   371.213  ops/s
NoRepl.testReadAscii        33554432  thrpt    5     1167.878 ±    14.427  ops/s
NoRepl.testReadAscii       268435456  thrpt    5       71.028 ±     0.439  ops/s
NoRepl.testReadAsciiAsGBK          0  thrpt    5  4783174.805 ±  9789.109  ops/s
NoRepl.testReadAsciiAsGBK       1024  thrpt    5  3518265.840 ± 18467.577  ops/s
NoRepl.testReadAsciiAsGBK       8192  thrpt    5  2775108.822 ± 19282.776  ops/s
NoRepl.testReadAsciiAsGBK    1048576  thrpt    5    46956.963 ±   147.593  ops/s
NoRepl.testReadAsciiAsGBK   33554432  thrpt    5     1165.036 ±    10.032  ops/s
NoRepl.testReadAsciiAsGBK  268435456  thrpt    5       70.878 ±     0.191  ops/s
NoRepl.testReadGBK                 0  thrpt    5  4910043.054 ± 27295.344  ops/s
NoRepl.testReadGBK              1024  thrpt    5  1177675.970 ± 15573.239  ops/s
NoRepl.testReadGBK              8192  thrpt    5    75417.479 ±   233.957  ops/s
NoRepl.testReadGBK           1048576  thrpt    5      674.620 ±     5.856  ops/s
NoRepl.testReadGBK          33554432  thrpt    5       19.899 ±     1.504  ops/s
NoRepl.testReadGBK         268435456  thrpt    5        2.705 ±     0.002  ops/s
NoRepl.testReadUTF8                0  thrpt    5  4851516.950 ±  9237.743  ops/s
NoRepl.testReadUTF8             1024  thrpt    5  1332016.420 ±  9570.465  ops/s
NoRepl.testReadUTF8             8192  thrpt    5   184177.766 ±  4662.562  ops/s
NoRepl.testReadUTF8          1048576  thrpt    5     1326.439 ±     3.420  ops/s
NoRepl.testReadUTF8         33554432  thrpt    5       30.782 ±     0.116  ops/s
NoRepl.testReadUTF8        268435456  thrpt    5        3.790 ±     0.011  ops/s
When an ASCII-only file is read as UTF-8 or GBK, the throughput of `readString` improves significantly. For 1 MiB ASCII files the throughput increases by 40%~50%; for larger or smaller files the improvement is smaller.
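The gain for ASCII input comes from skipping one array copy: when every byte is below 0x80, an ASCII-compatible charset decodes each byte to the same character, so the input array can be adopted as the string's storage. The sketch below only illustrates that idea. `AsciiText`, `adopt`, `isAscii`, and `sharesStorageWith` are hypothetical names, not JDK APIs; the actual change is in JDK-internal code that public classes cannot reproduce.

```java
import java.nio.charset.StandardCharsets;

public class AsciiText {
    // One byte per character; every element is in the range 0x00..0x7F.
    private final byte[] value;

    private AsciiText(byte[] value) {
        this.value = value;
    }

    // True when no byte has its high bit set.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // Takes ownership of 'bytes' without copying. The caller must not
    // modify the array afterwards. Returns null for non-ASCII input, where
    // a real decoder would fall back to the normal (copying) path.
    static AsciiText adopt(byte[] bytes) {
        return isAscii(bytes) ? new AsciiText(bytes) : null;
    }

    // Exposed only so the example can show that no copy was made.
    boolean sharesStorageWith(byte[] bytes) {
        return value == bytes;
    }

    @Override
    public String toString() {
        return new String(value, StandardCharsets.US_ASCII);
    }
}
```

The ownership transfer is what makes this safe: if the caller kept writing to the array, the supposedly immutable text would change underneath its users.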
For files containing non-ASCII characters, the throughput of `readString` is between 94% and 104% of the baseline.
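These percentages are the "This PR" score divided by the baseline score for the same benchmark and length. A minimal sketch of that arithmetic, using two rows copied from the tables above (the class and method names are made up for illustration):

```java
public class ThroughputRatio {
    // Relative throughput of the PR build versus the baseline;
    // 1.0 means unchanged, values above 1.0 mean the PR is faster.
    static double ratio(double baselineOpsPerSec, double prOpsPerSec) {
        return prOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // NoRepl.testReadAscii, length 1048576
        System.out.printf("ASCII, 1 MiB: %.1f%%%n", 100 * ratio(32425.563, 48074.603));
        // NoRepl.testReadGBK, length 33554432
        System.out.printf("GBK, 32 MiB:  %.1f%%%n", 100 * ratio(20.972, 19.899));
    }
}
```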
This is the source code of the benchmark: https://gist.github.com/Glavo/f3d2060d0bd13cd0ce2add70e6060ea0
Could someone help me open an issue in the Java Bug System?
[Figure: throughput of reading ASCII files as UTF-8]
[Figure: throughput of reading ASCII files as GBK]
> /issue JDK-8299807
Thank you!
-------------
PR: https://git.openjdk.org/jdk/pull/12119