RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue Nov 21 01:52:13 UTC 2023
On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann <duke at openjdk.org> wrote:
>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data.
>>
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512.
>>
>> Baseline data
>> Benchmark (arrayKind) (sizeKind) Mode Cnt Score Error Units
>> --------------------------------------------------------------------------------------
>> XorTest.copy ELEMENTS SMALL avgt 30 584737355.767 ± 60414308.540 ns/op
>> XorTest.copy ELEMENTS MEDIUM avgt 30 272248995.683 ± 2924954.498 ns/op
>> XorTest.copy ELEMENTS LARGE avgt 30 1019200210.900 ± 28334453.652 ns/op
>> XorTest.copy REGION SMALL avgt 30 7399944.164 ± 216821.819 ns/op
>> XorTest.copy REGION MEDIUM avgt 30 20591454.558 ± 147398.572 ns/op
>> XorTest.copy REGION LARGE avgt 30 21649266.051 ± 179263.875 ns/op
>> XorTest.copy CRITICAL SMALL avgt 30 51079.357 ± 542.482 ns/op
>> XorTest.copy CRITICAL MEDIUM avgt 30 2496.961 ± 11.375 ns/op
>> XorTest.copy CRITICAL LARGE avgt 30 515.454 ± 5.831 ns/op
>> XorTest.copy FOREIGN SMALL avgt 30 7558432.075 ± 79489.276 ns/op
>> XorTest.copy FOREIGN MEDIUM avgt 30 19730666.341 ± 500505.099 ns/op
>> XorTest.copy FOREIGN LARGE avgt 30 34616758.085 ± 340300.726 ns/op
>> XorTest.xor ELEMENTS SMALL avgt 30 219832692.489 ± 2329417.319 ns/op
>> XorTest.xor ELEMENTS MEDIUM avgt 30 505138197.167 ± 3818334.424 ns/op
>> XorTest.xor ELEMENTS LARGE avgt 30 1189608474.667 ± 5877981.900 ns/op
>> XorTest.xor REGION SMALL avgt 30 64093872.804 ± 599704.491 ns/op
>> XorTest.xor REGION MEDIUM avgt 30 81544576.454 ± 1406342.118 ns/op
>> XorTest.xor REGION LARGE avgt 30 90091424.883 ± 775577.613 ns/op
>> XorTest.xor CRITICAL SMALL avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
> - Merge branch 'master' into memcpy
> - Update full name
> Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
> - - remerge upstream master
> - remove ::copy test from XorTest
> - Merge branch 'master' into memcpy
> - - fix whitespace error
> - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
> - - bug fix: only generate / use large copy code if MaxVectorSize == 64
> - - fix whitespace issues
> - fix xor test foreign impl constructor signature
> - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
> - add src address prefetches
> - switch to non-temporal writes
> - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 753:
> 751: Label L_pre_main_post_large;
> 752:
> 753: if (MaxVectorSize == 64) {
This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64.
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 768:
> 766: }
> 767: __ movq(temp3, temp2);
> 768: copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);
The last argument should be "true" or "1" instead of "0" or "false". This is as temp3 (length) could be less than 32 as well. This case is only handled when use64byteVector argument is true.
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 777:
> 775:
> 776: __ BIND(L_main_pre_loop_large);
> 777: __ subq(temp1, loop_size[shift]); // whay is this here
Spurious comment.
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 797:
> 795: __ jcc(Assembler::lessEqual, L_exit_large);
> 796: arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
> 797: temp4, temp3, L_entry_large, L_exit_large);
When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need to be copied so we don't need to go back to L_entry_large.
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1058:
> 1056: };
> 1057:
> 1058: if (MaxVectorSize == 64) {
This should be an assert here instead of if check as this method shouldn't be called if MaxVectorSize is < 64.
src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp line 1175:
> 1173: void StubGenerator::copy256_avx3(Register dst, Register src, Register index, XMMRegister xmm1,
> 1174: XMMRegister xmm2, XMMRegister xmm3, XMMRegister xmm4,
> 1175: bool conjoint, int shift, int offset) {
The conjoint parameter is not used so could be removed from this function.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916497
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399918640
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399896421
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399934713
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399916567
PR Review Comment: https://git.openjdk.org/jdk/pull/16575#discussion_r1399881916
More information about the core-libs-dev
mailing list