RFR: 8299807: newStringNoRepl should avoid copying arrays for ASCII compatible charsets

Glavo duke at openjdk.org
Sat Jan 28 04:06:51 UTC 2023


On Fri, 20 Jan 2023 16:47:27 GMT, Glavo <duke at openjdk.org> wrote:

> This is the javadoc of `JavaLangAccess::newStringNoRepl`:
> 
> 
>     /**
>      * Constructs a new {@code String} by decoding the specified subarray of
>      * bytes using the specified {@linkplain java.nio.charset.Charset charset}.
>      *
>      * The caller of this method shall relinquish and transfer the ownership of
>      * the byte array to the callee since the later will not make a copy.
>      *
>      * @param bytes the byte array source
>      * @param cs the Charset
>      * @return the newly created string
>      * @throws CharacterCodingException for malformed or unmappable bytes
>      */
> 
> 
> The documentation states that the method should construct the string directly from the given byte array, so that no additional array is allocated.
> 
> However, at present, `newStringNoRepl` always copies the array for UTF-8 and other ASCII-compatible charsets.
> 
> This PR fixes this problem.
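
To illustrate why ASCII-only input makes the copy avoidable in principle: for an ASCII-compatible charset, bytes in the range 0x00..0x7F decode to the same characters as in Latin-1, so the decoded content is byte-for-byte identical to the input. The following is only a user-level sketch of that observation, not the JDK implementation; `AsciiCheckSketch` and `isAscii` are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheckSketch {
    // Hypothetical helper: true if every byte is in the ASCII range 0x00..0x7F.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // byte values 0x80..0xFF are negative in Java
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] nonAscii = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        System.out.println(isAscii(ascii));
        System.out.println(isAscii(nonAscii));

        // For ASCII-only input, decoding as UTF-8 and as Latin-1 gives equal strings.
        String asUtf8 = new String(ascii, StandardCharsets.UTF_8);
        String asLatin1 = new String(ascii, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.equals(asLatin1));
    }
}
```

Note that the public `String` constructors used here always copy the array; only internal code can hand over ownership of the array as the javadoc describes.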

I ran the tier1 and tier2 tests, and there were no new errors.

The only use case affected is `Files.readString`. I tested the performance of `readString` using an in-memory file system.
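
For readers unfamiliar with the affected call path, here is a minimal sketch of the kind of call being measured: an ASCII-only file read back as UTF-8 through `Files.readString`. It is not the benchmark itself. As a simplifying assumption it uses a temporary file on the default file system rather than a memory file system, and `ReadStringSketch` and `roundTrip` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringSketch {
    // Writes ASCII-only content to a temporary file and reads it back as UTF-8.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("readstring-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.US_ASCII));
                // Files.readString throws CharacterCodingException (an IOException)
                // on malformed or unmappable input instead of substituting characters.
                return Files.readString(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = roundTrip("plain ASCII content");
        System.out.println(text.length());
    }
}
```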

baseline:

Benchmark                   (length)   Mode  Cnt        Score       Error  Units
NoRepl.testReadAscii               0  thrpt    5  5049760.744 ±  3563.324  ops/s
NoRepl.testReadAscii            1024  thrpt    5  3523083.785 ± 23747.078  ops/s
NoRepl.testReadAscii            8192  thrpt    5  2415952.140 ± 85884.289  ops/s
NoRepl.testReadAscii         1048576  thrpt    5    32425.563 ±   284.121  ops/s
NoRepl.testReadAscii        33554432  thrpt    5      872.492 ±     4.311  ops/s
NoRepl.testReadAscii       268435456  thrpt    5       58.736 ±     0.224  ops/s
NoRepl.testReadAsciiAsGBK          0  thrpt    5  4229547.832 ± 10997.381  ops/s
NoRepl.testReadAsciiAsGBK       1024  thrpt    5  3409472.580 ±   819.566  ops/s
NoRepl.testReadAsciiAsGBK       8192  thrpt    5  2179865.886 ± 16606.862  ops/s
NoRepl.testReadAsciiAsGBK    1048576  thrpt    5    32962.429 ±   249.385  ops/s
NoRepl.testReadAsciiAsGBK   33554432  thrpt    5      871.810 ±     1.812  ops/s
NoRepl.testReadAsciiAsGBK  268435456  thrpt    5       59.131 ±     0.172  ops/s
NoRepl.testReadGBK                 0  thrpt    5  4657464.074 ± 51849.667  ops/s
NoRepl.testReadGBK              1024  thrpt    5  1199083.242 ±  8653.846  ops/s
NoRepl.testReadGBK              8192  thrpt    5    75823.949 ±    46.493  ops/s
NoRepl.testReadGBK           1048576  thrpt    5      675.214 ±     1.729  ops/s
NoRepl.testReadGBK          33554432  thrpt    5       20.972 ±     2.280  ops/s
NoRepl.testReadGBK         268435456  thrpt    5        2.585 ±     0.394  ops/s
NoRepl.testReadUTF8                0  thrpt    5  5222403.558 ± 44728.740  ops/s
NoRepl.testReadUTF8             1024  thrpt    5  1417472.161 ± 23776.534  ops/s
NoRepl.testReadUTF8             8192  thrpt    5   185714.265 ±  2096.328  ops/s
NoRepl.testReadUTF8          1048576  thrpt    5     1318.763 ±    16.051  ops/s
NoRepl.testReadUTF8         33554432  thrpt    5       30.663 ±     0.114  ops/s
NoRepl.testReadUTF8        268435456  thrpt    5        3.782 ±     0.041  ops/s


This PR:

Benchmark                   (length)   Mode  Cnt        Score       Error  Units
NoRepl.testReadAscii               0  thrpt    5  5084610.141 ± 37449.826  ops/s
NoRepl.testReadAscii            1024  thrpt    5  3425713.961 ± 24060.542  ops/s
NoRepl.testReadAscii            8192  thrpt    5  2765684.248 ± 20586.103  ops/s
NoRepl.testReadAscii         1048576  thrpt    5    48074.603 ±   371.213  ops/s
NoRepl.testReadAscii        33554432  thrpt    5     1167.878 ±    14.427  ops/s
NoRepl.testReadAscii       268435456  thrpt    5       71.028 ±     0.439  ops/s
NoRepl.testReadAsciiAsGBK          0  thrpt    5  4783174.805 ±  9789.109  ops/s
NoRepl.testReadAsciiAsGBK       1024  thrpt    5  3518265.840 ± 18467.577  ops/s
NoRepl.testReadAsciiAsGBK       8192  thrpt    5  2775108.822 ± 19282.776  ops/s
NoRepl.testReadAsciiAsGBK    1048576  thrpt    5    46956.963 ±   147.593  ops/s
NoRepl.testReadAsciiAsGBK   33554432  thrpt    5     1165.036 ±    10.032  ops/s
NoRepl.testReadAsciiAsGBK  268435456  thrpt    5       70.878 ±     0.191  ops/s
NoRepl.testReadGBK                 0  thrpt    5  4910043.054 ± 27295.344  ops/s
NoRepl.testReadGBK              1024  thrpt    5  1177675.970 ± 15573.239  ops/s
NoRepl.testReadGBK              8192  thrpt    5    75417.479 ±   233.957  ops/s
NoRepl.testReadGBK           1048576  thrpt    5      674.620 ±     5.856  ops/s
NoRepl.testReadGBK          33554432  thrpt    5       19.899 ±     1.504  ops/s
NoRepl.testReadGBK         268435456  thrpt    5        2.705 ±     0.002  ops/s
NoRepl.testReadUTF8                0  thrpt    5  4851516.950 ±  9237.743  ops/s
NoRepl.testReadUTF8             1024  thrpt    5  1332016.420 ±  9570.465  ops/s
NoRepl.testReadUTF8             8192  thrpt    5   184177.766 ±  4662.562  ops/s
NoRepl.testReadUTF8          1048576  thrpt    5     1326.439 ±     3.420  ops/s
NoRepl.testReadUTF8         33554432  thrpt    5       30.782 ±     0.116  ops/s
NoRepl.testReadUTF8        268435456  thrpt    5        3.790 ±     0.011  ops/s


When an ASCII file is read as UTF-8 or GBK, the throughput of `readString` improves significantly. For ASCII files of 1 MiB, throughput increases by 40% to 50%; for larger or smaller files, the improvement is smaller.
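
The 1 MiB figures can be checked directly against the two tables: the relative improvement is `(patched / baseline - 1) * 100`, which gives roughly 48% for `testReadAscii` and roughly 42% for `testReadAsciiAsGBK`. A small sketch of that arithmetic, using the scores from the tables above (`SpeedupSketch` and `improvementPercent` are hypothetical names):

```java
public class SpeedupSketch {
    // Relative throughput improvement of the patched build over the baseline, in percent.
    static double improvementPercent(double baseline, double patched) {
        return (patched / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ops/s) for length 1048576, copied from the tables above.
        double ascii = improvementPercent(32425.563, 48074.603);
        double asciiAsGbk = improvementPercent(32962.429, 46956.963);

        System.out.printf("testReadAscii:      %.1f%%%n", ascii);
        System.out.printf("testReadAsciiAsGBK: %.1f%%%n", asciiAsGbk);
    }
}
```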

For files containing non-ASCII characters, the throughput of `readString` is between 94% and 104% of the baseline.

This is the source code of the benchmark: https://gist.github.com/Glavo/f3d2060d0bd13cd0ce2add70e6060ea0

Can someone help me open an issue in the Java Bug System?

The throughput of reading ASCII files as UTF-8:

![image](https://user-images.githubusercontent.com/20694662/213833461-97e60b8d-2845-48fb-9331-585810c182b2.png)


The throughput of reading ASCII files as GBK:

![image](https://user-images.githubusercontent.com/20694662/213833687-d2811afb-efce-4d4e-9074-a797f2eea83a.png)

> /issue JDK-8299807

Thank you!

-------------

PR: https://git.openjdk.org/jdk/pull/12119


More information about the core-libs-dev mailing list