RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
Steve Dohrmann
duke at openjdk.org
Tue Nov 21 00:12:10 UTC 2023
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann <duke at openjdk.org> wrote:
>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark (arrayKind) (sizeKind) Mode Cnt Score Error Units
>> --------------------------------------------------------------------------------------
>> XorTest.copy ELEMENTS SMALL avgt 30 584737355.767 ± 60414308.540 ns/op
>> XorTest.copy ELEMENTS MEDIUM avgt 30 272248995.683 ± 2924954.498 ns/op
>> XorTest.copy ELEMENTS LARGE avgt 30 1019200210.900 ± 28334453.652 ns/op
>> XorTest.copy REGION SMALL avgt 30 7399944.164 ± 216821.819 ns/op
>> XorTest.copy REGION MEDIUM avgt 30 20591454.558 ± 147398.572 ns/op
>> XorTest.copy REGION LARGE avgt 30 21649266.051 ± 179263.875 ns/op
>> XorTest.copy CRITICAL SMALL avgt 30 51079.357 ± 542.482 ns/op
>> XorTest.copy CRITICAL MEDIUM avgt 30 2496.961 ± 11.375 ns/op
>> XorTest.copy CRITICAL LARGE avgt 30 515.454 ± 5.831 ns/op
>> XorTest.copy FOREIGN SMALL avgt 30 7558432.075 ± 79489.276 ns/op
>> XorTest.copy FOREIGN MEDIUM avgt 30 19730666.341 ± 500505.099 ns/op
>> XorTest.copy FOREIGN LARGE avgt 30 34616758.085 ± 340300.726 ns/op
>> XorTest.xor ELEMENTS SMALL avgt 30 219832692.489 ± 2329417.319 ns/op
>> XorTest.xor ELEMENTS MEDIUM avgt 30 505138197.167 ± 3818334.424 ns/op
>> XorTest.xor ELEMENTS LARGE avgt 30 1189608474.667 ± 5877981.900 ns/op
>> XorTest.xor REGION SMALL avgt 30 64093872.804 ± 599704.491 ns/op
>> XorTest.xor REGION MEDIUM avgt 30 81544576.454 ± 1406342.118 ns/op
>> XorTest.xor REGION LARGE avgt 30 90091424.883 ± 775577.613 ns/op
>> XorTest.xor CRITICAL SMALL avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
> - Merge branch 'master' into memcpy
> - Update full name
> Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
> - - remerge upstream master
> - remove ::copy test from XorTest
> - Merge branch 'master' into memcpy
> - - fix whitespace error
> - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
> - - bug fix: only generate / use large copy code if MaxVectorSize == 64
> - - fix whitespace issues
> - fix xor test foreign impl constructor signature
> - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
> - add src address prefetches
> - switch to non-temporal writes
> - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the introductory comment above used XorTest code from PR commit 7cc272e86279, which was based on Maurizio Cimadamore's commit a788f066af17. XorTest has since been updated, and XorTest::copy is no longer needed and has been removed from this pull request. Performance can be evaluated using both the new XorTest and a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this PR. Results from these two benchmarks are shown below:
In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active for the 5 MB and 10 MB sizes (lengths 5000000 and 10000000).
// Baseline
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2434.515 ±    11.526  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   51211.235 ±   539.355  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  104837.012 ±  1338.823  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  293357.745 ±  3233.745  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  957068.292 ± 15509.983  ns/op
// PR
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2443.354 ±    17.996  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   50854.800 ±  1253.863  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  105105.124 ±  1286.606  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  207298.875 ±  1260.289  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  457461.404 ±  8628.867  ns/op
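For readers who want the ratios rather than the raw times, here is a minimal sketch that divides each baseline score by the matching PR score. The class name ArrayCopySpeedup and the speedup helper are illustrative only and are not part of the PR; the numbers are copied from the two tables above. By this arithmetic the 5000000 and 10000000 lengths come out at roughly 1.4x and 2.1x, while the three smaller lengths stay within about 1% of baseline.

```java
import java.util.Locale;

// Illustrative helper, not part of the PR: turns the reported ns/op
// scores into baseline/PR speedup factors.
final class ArrayCopySpeedup {
    // Speedup factor: baseline time divided by PR time (above 1.0 means the PR is faster).
    public static double speedup(double baselineNsPerOp, double prNsPerOp) {
        return baselineNsPerOp / prNsPerOp;
    }

    public static void main(String[] args) {
        // Scores (ns/op) copied from the Baseline and PR tables above.
        long[] lengths = {100_000, 1_000_000, 2_000_000, 5_000_000, 10_000_000};
        double[] baseline = {2434.515, 51211.235, 104837.012, 293357.745, 957068.292};
        double[] pr = {2443.354, 50854.800, 105105.124, 207298.875, 457461.404};

        for (int i = 0; i < lengths.length; i++) {
            System.out.printf(Locale.ROOT, "%9d bytes: %.2fx%n",
                    lengths[i], speedup(baseline[i], pr[i]));
        }
    }
}
```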
In the XorTest::xor benchmark below, the PR code is active in 3 of the LARGE case runs: FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE.
// New xor test - Baseline
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.220 ± 0.002  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.859 ± 0.034  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  117.436 ± 1.708  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.623 ± 0.013  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.923 ± 0.095  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.058 ± 0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.215 ± 0.012  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.246 ± 0.048  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.572 ± 0.018  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30   10.204 ± 0.116  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ± 0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.697 ± 0.008  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   12.056 ± 0.152  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ± 0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ± 0.012  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.301 ± 0.110  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.589 ± 0.029  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30   10.177 ± 0.108  ms/op
// New xor test - PR
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.224 ± 0.003  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.873 ± 0.025  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  118.523 ± 0.951  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.639 ± 0.019  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.890 ± 0.124  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.059 ± 0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.213 ± 0.013  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.241 ± 0.099  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.580 ± 0.015  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30    8.936 ± 0.059  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ± 0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.727 ± 0.028  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   10.544 ± 0.114  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ± 0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ± 0.014  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.230 ± 0.029  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ± 0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.578 ± 0.020  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30    8.910 ± 0.100  ms/op
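One rough way to read the LARGE rows is to treat each Score and Error pair as an interval and ask whether the PR interval lies entirely below the baseline interval. The sketch below does that for the seven LARGE cases. The class XorLargeComparison and the clearlyFaster helper are hypothetical names introduced here, and the non-overlap check is only a crude reading aid, not a statistical test and not something the PR itself performs. With the numbers from the two tables, only FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE pass the check, which matches the three cases where the PR code is active.

```java
import java.util.Locale;

// Illustrative helper, not part of the PR: compares baseline and PR
// results for the LARGE rows using the reported Score and Error columns.
final class XorLargeComparison {
    // True when the PR interval (score + error) lies entirely below
    // the baseline interval (score - error).
    public static boolean clearlyFaster(double baseScore, double baseErr,
                                        double prScore, double prErr) {
        return prScore + prErr < baseScore - baseErr;
    }

    public static void main(String[] args) {
        String[] kinds = {
            "JNI_ELEMENTS", "JNI_REGION", "JNI_CRITICAL", "FOREIGN_NO_INIT",
            "FOREIGN_INIT", "FOREIGN_CRITICAL", "UNSAFE"
        };
        // {baseline score, baseline error, PR score, PR error} in ms/op,
        // copied from the LARGE rows of the two tables above.
        double[][] rows = {
            {117.436, 1.708, 118.523, 0.951},
            {8.923, 0.095, 8.890, 0.124},
            {6.246, 0.048, 6.241, 0.099},
            {10.204, 0.116, 8.936, 0.059},
            {12.056, 0.152, 10.544, 0.114},
            {6.301, 0.110, 6.230, 0.029},
            {10.177, 0.108, 8.910, 0.100},
        };

        for (int i = 0; i < kinds.length; i++) {
            double[] r = rows[i];
            System.out.printf(Locale.ROOT, "%-16s ratio %.3f clearlyFaster=%b%n",
                    kinds[i], r[0] / r[2], clearlyFaster(r[0], r[1], r[2], r[3]));
        }
    }
}
```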
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1820006548