RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
Steve Dohrmann
duke at openjdk.org
Fri Nov 10 00:40:28 UTC 2023
Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected the data on an Ubuntu 22.04 laptop with a Tiger Lake i7-1185G7, which supports AVX-512.
Baseline data
Benchmark (arrayKind) (sizeKind) Mode Cnt Score Error Units
--------------------------------------------------------------------------------------
XorTest.copy ELEMENTS SMALL avgt 30 584737355.767 ± 60414308.540 ns/op
XorTest.copy ELEMENTS MEDIUM avgt 30 272248995.683 ± 2924954.498 ns/op
XorTest.copy ELEMENTS LARGE avgt 30 1019200210.900 ± 28334453.652 ns/op
XorTest.copy REGION SMALL avgt 30 7399944.164 ± 216821.819 ns/op
XorTest.copy REGION MEDIUM avgt 30 20591454.558 ± 147398.572 ns/op
XorTest.copy REGION LARGE avgt 30 21649266.051 ± 179263.875 ns/op
XorTest.copy CRITICAL SMALL avgt 30 51079.357 ± 542.482 ns/op
XorTest.copy CRITICAL MEDIUM avgt 30 2496.961 ± 11.375 ns/op
XorTest.copy CRITICAL LARGE avgt 30 515.454 ± 5.831 ns/op
XorTest.copy FOREIGN SMALL avgt 30 7558432.075 ± 79489.276 ns/op
XorTest.copy FOREIGN MEDIUM avgt 30 19730666.341 ± 500505.099 ns/op
XorTest.copy FOREIGN LARGE avgt 30 34616758.085 ± 340300.726 ns/op
XorTest.xor ELEMENTS SMALL avgt 30 219832692.489 ± 2329417.319 ns/op
XorTest.xor ELEMENTS MEDIUM avgt 30 505138197.167 ± 3818334.424 ns/op
XorTest.xor ELEMENTS LARGE avgt 30 1189608474.667 ± 5877981.900 ns/op
XorTest.xor REGION SMALL avgt 30 64093872.804 ± 599704.491 ns/op
XorTest.xor REGION MEDIUM avgt 30 81544576.454 ± 1406342.118 ns/op
XorTest.xor REGION LARGE avgt 30 90091424.883 ± 775577.613 ns/op
XorTest.xor CRITICAL SMALL avgt 30 57231375.744 ± 438223.342 ns/op
XorTest.xor CRITICAL MEDIUM avgt 30 58583884.930 ± 375355.215 ns/op
XorTest.xor CRITICAL LARGE avgt 30 60644832.949 ± 588120.738 ns/op
XorTest.xor FOREIGN SMALL avgt 30 73868679.405 ± 819965.524 ns/op
XorTest.xor FOREIGN MEDIUM avgt 30 88156275.944 ± 1051257.152 ns/op
XorTest.xor FOREIGN LARGE avgt 30 123115513.182 ± 1287935.621 ns/op
The 'copy' benchmark was added to measure the memory copy components of the 'xor' benchmark, separate from the memory allocation and xor data update components.
Profile data for the baseline REGION LARGE case shows two hotspots covering about 90% of cycles:
Baseline REGION LARGE (r231)
Function CPU Time Clockticks Instructions Retired CPI Rate
--------------------------------------------------------------------------------------------
xor_op 63.7% 18,189,000,000 52,464,000,000 0.347
__memcpy_evex_unaligned_erms 28.5% 7,608,000,000 3,459,000,000 2.199
The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:
Baseline FOREIGN LARGE (r226)
Function CPU Time Clockticks Instructions Retired CPI Rate
--------------------------------------------------------------------------------------------
xor_op 46.4% 18,345,000,000 52,476,000,000 0.350
jlong_disjoint_arraycopy_avx3 29.3% 11,124,000,000 1,404,000,000 7.923
Copy::fill_to_memory_atomic 15.3% 5,016,000,000 8,010,000,000 0.626
This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The Copy::fill_to_memory_atomic hotspot (which I believe is associated with the benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3 array copy code is optimized by increasing the loop granularity from 192 to 256 bytes, adding source address prefetches, and using non-temporal writes with a store fence. The optimized code is only used for copies larger than a set threshold, currently 2.5MB, which is the size at which the optimized code was observed to be faster than the original code. The profile data with optimization is:
Optimized FOREIGN LARGE (r277)
Function CPU Time Clockticks Instructions Retired CPI Rate
--------------------------------------------------------------------------------------------
xor_op 51.2% 18,153,000,000 52,404,000,000 0.346
jlong_disjoint_arraycopy_avx3 22.4% 7,581,000,000 2,364,000,000 3.207
Copy::fill_to_memory_atomic 16.3% 5,316,000,000 7,917,000,000 0.671
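To illustrate the shape of the change described above (a size threshold, a 256-byte loop body, source prefetches, non-temporal stores, and a trailing store fence), here is a minimal C++ sketch. It is not the stub code from the PR: the actual change is emitted by the HotSpot macro assembler using AVX-512 instructions, whereas this sketch uses SSE2 intrinsics so it builds on any x86-64 compiler and falls back to plain memcpy elsewhere. The function name, the constant names, the prefetch distance, the alignment handling, and the reading of "2.5MB" as 2.5 * 1024 * 1024 bytes are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 loads and streaming stores; also pulls in prefetch and sfence
#endif

// Hypothetical names and values, chosen for this sketch only.
constexpr std::size_t kLargeCopyThreshold = 2621440;  // "2.5MB", assumed to mean 2.5 * 1024 * 1024
constexpr std::size_t kBlockBytes = 256;              // loop granularity named in the description
constexpr std::size_t kPrefetchDistance = 1024;       // assumed value, not taken from the PR

// Copies n bytes between two non-overlapping ranges.
void large_disjoint_copy(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    const auto da = reinterpret_cast<std::uintptr_t>(dst);
    const auto sa = reinterpret_cast<std::uintptr_t>(src);
    // Disjoint copy: the two ranges must not overlap.
    assert(da + n <= sa || sa + n <= da);

#if defined(__SSE2__)
    if (n > kLargeCopyThreshold) {
        // Copy a short head so the destination is 16-byte aligned,
        // which _mm_stream_si128 requires.
        const std::size_t head = (16 - (da & 15)) & 15;
        std::memcpy(d, s, head);
        d += head; s += head; n -= head;

        while (n >= kBlockBytes) {
            // Prefetch the four source cache lines that lie kPrefetchDistance
            // ahead, but only while they are still inside the source range.
            if (n >= kPrefetchDistance + kBlockBytes) {
                for (std::size_t line = 0; line < kBlockBytes; line += 64) {
                    _mm_prefetch(reinterpret_cast<const char*>(s + kPrefetchDistance + line),
                                 _MM_HINT_T0);
                }
            }
            // One 256-byte block: unaligned loads, non-temporal (streaming) stores.
            for (std::size_t off = 0; off < kBlockBytes; off += 16) {
                const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + off));
                _mm_stream_si128(reinterpret_cast<__m128i*>(d + off), v);
            }
            d += kBlockBytes; s += kBlockBytes; n -= kBlockBytes;
        }
        // Streaming stores are weakly ordered; the fence makes them visible
        // before any stores that follow.
        _mm_sfence();
    }
#endif
    // Small copies, the tail of a large copy, and non-SSE2 builds.
    std::memcpy(d, s, n);
}
```

Non-temporal stores bypass the cache hierarchy, which is why they only pay off once the copy is large enough that the destination would not stay cached anyway; that is consistent with gating the new path on a size threshold.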
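The optimization brings the cycles for the memory copy work roughly to parity with the REGION LARGE case (7,581,000,000 clockticks in jlong_disjoint_arraycopy_avx3 versus 7,608,000,000 in __memcpy_evex_unaligned_erms). Benchmark data for the optimized case: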
Optimized data
Benchmark (arrayKind) (sizeKind) Mode Cnt Score Error Units
--------------------------------------------------------------------------------------
XorTest.copy ELEMENTS SMALL avgt 30 551072938.467 ± 4287149.108 ns/op
XorTest.copy ELEMENTS MEDIUM avgt 30 272304419.633 ± 2993793.130 ns/op
XorTest.copy ELEMENTS LARGE avgt 30 1013925081.233 ± 8590245.238 ns/op
XorTest.copy REGION SMALL avgt 30 7472329.003 ± 77394.114 ns/op
XorTest.copy REGION MEDIUM avgt 30 19882540.205 ± 349544.602 ns/op
XorTest.copy REGION LARGE avgt 30 21185593.636 ± 404369.655 ns/op
XorTest.copy CRITICAL SMALL avgt 30 52358.715 ± 1382.355 ns/op
XorTest.copy CRITICAL MEDIUM avgt 30 2525.108 ± 22.396 ns/op
XorTest.copy CRITICAL LARGE avgt 30 528.865 ± 11.747 ns/op
XorTest.copy FOREIGN SMALL avgt 30 7748587.890 ± 67352.844 ns/op
XorTest.copy FOREIGN MEDIUM avgt 30 19401977.378 ± 256247.071 ns/op
XorTest.copy FOREIGN LARGE avgt 30 21519594.325 ± 124712.980 ns/op
XorTest.xor ELEMENTS SMALL avgt 30 221049328.389 ± 2629557.148 ns/op
XorTest.xor ELEMENTS MEDIUM avgt 30 503362446.150 ± 3759664.343 ns/op
XorTest.xor ELEMENTS LARGE avgt 30 1186563496.067 ± 5135607.671 ns/op
XorTest.xor REGION SMALL avgt 30 88402928.083 ± 790941.309 ns/op
XorTest.xor REGION MEDIUM avgt 30 80041519.052 ± 597221.491 ns/op
XorTest.xor REGION LARGE avgt 30 87706448.917 ± 751350.609 ns/op
XorTest.xor CRITICAL SMALL avgt 30 56869387.315 ± 408618.338 ns/op
XorTest.xor CRITICAL MEDIUM avgt 30 59041245.745 ± 820141.039 ns/op
XorTest.xor CRITICAL LARGE avgt 30 60433672.443 ± 500954.831 ns/op
XorTest.xor FOREIGN SMALL avgt 30 72838421.976 ± 410147.170 ns/op
XorTest.xor FOREIGN MEDIUM avgt 30 87970109.478 ± 1058857.783 ns/op
XorTest.xor FOREIGN LARGE avgt 30 103970690.407 ± 1033001.637 ns/op
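For readers who want to quantify the FOREIGN LARGE change, the following small C++ helper computes the fractional reduction between a baseline and an optimized measurement. The constants are copied from the FOREIGN LARGE rows of the benchmark and profile tables in this message; the helper and constant names are hypothetical, and the error bars reported by JMH are ignored here.

```cpp
#include <cassert>

// Fractional reduction from a baseline measurement to an optimized one,
// e.g. 0.25 means the optimized value is 25% lower than the baseline.
double reduction(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (baseline - optimized) / baseline;
}

// XorTest.copy FOREIGN LARGE, ns/op.
constexpr double kCopyBaselineNs = 34616758.085;
constexpr double kCopyOptimizedNs = 21519594.325;

// XorTest.xor FOREIGN LARGE, ns/op.
constexpr double kXorBaselineNs = 123115513.182;
constexpr double kXorOptimizedNs = 103970690.407;

// Clockticks attributed to jlong_disjoint_arraycopy_avx3 in the two profiles.
constexpr double kStubBaselineTicks = 11124000000.0;
constexpr double kStubOptimizedTicks = 7581000000.0;
```

Calling reduction(kCopyBaselineNs, kCopyOptimizedNs) gives about 0.38, reduction(kXorBaselineNs, kXorOptimizedNs) gives about 0.16, and reduction(kStubBaselineTicks, kStubOptimizedTicks) gives about 0.32. The clocktick figures come from separate profiling runs, so that last ratio is only indicative.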
I am very much looking forward to contributing to OpenJDK! Please review this PR and let me know how it can be improved.
-------------
Commit messages:
 - fix whitespace issues
 - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
Changes: https://git.openjdk.org/jdk/pull/16575/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16575&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8310159
Stats: 597 lines in 11 files changed: 596 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/16575.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575
PR: https://git.openjdk.org/jdk/pull/16575