RFR: 8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions [v5]

Fri Sep 25 21:08:49 UTC 2020

On Tue, 22 Sep 2020 15:55:11 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary:
>> 
>> 1)  New AVX3 optimized stubs for both conjoint and disjoint arraycopy.
>> 2)  Special instruction sequence blocks for copy sizes b/w 32-192 bytes.
>> 3)  Block copy operation above 192 bytes is performed using destination address aligned PRE-MAIN-POST loop. Main loop
>> copies 192 byte in one iteration and tail part fall over special instruction sequence blocks. 4)  Both small copy block
>> and aligned loop use 32 byte vector register to prevent and frequency penalty for copy sizes less than AVX3Threshold.
>> 5)  For block size above AVX3Theshold both special blocks and loop operate using 64 byte register. 6)  In case user
>> sets the maximum vector size to 32 bytes, forward copy (disjoint) operations are done using efficient REP MOVS for copy
>> sizes above 4096 bytes.  JMH Results:
>>   System     :  CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
>>   Micros     :  test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
>>   Baseline   :  [http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_Baseline.txt]()
>>   WithOpt  :  [http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_WithOpts.txt]()
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8252847 : Modifying file permission to resolve jcheck failure.

The main concern which is not clear in these changes is ZMM usage which will lower frequency and case performance
regression for small arrays. That is why AVX3Threshold is set to 4096 bytes by default.
Allowing and checking for 0 AVX3Threshold  value contradicts that. Would be nice to have clear comment/explanation
about that.

I also propose to use Flag constraint() functionality for checking  AVX3Threshold value instead of runtime checks
everywhere. Separate RFE, please.

src/hotspot/cpu/x86/assembler_x86.cpp line 2593:

> 2591:
> 2592: void Assembler::evmovdqu(XMMRegister dst, KRegister mask, Address src, int vector_len, int type) {
> 2593:   assert(VM_Version::supports_avx512vlbw(), "");

I suggest to add assert to these 2 new instruction to check 'type' value to make sure only expected types are passed.

src/hotspot/cpu/x86/assembler_x86.cpp line 2596:

> 2594:   InstructionMark im(this);
> 2595:   bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR;
> 2596:   bool bwinstr = type == T_BYTE ||  type == T_SHORT || type == T_CHAR;

'bwinstr' is used only once. You may as well directly set 'prefix' here. (Same in second instruction).

src/hotspot/cpu/x86/assembler_x86.cpp line 2595:

> 2593:   assert(VM_Version::supports_avx512vlbw(), "");
> 2594:   InstructionMark im(this);
> 2595:   bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR;

It looks strange but it is correct (I looked on existing evmovdqu* instructions). May be reorder - T_LONG last.

Do you consider replacing existing evmovdqu* instructions with these two?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1264:

> 1262:   }
> 1263:
> 1264: #ifndef PRODUCT

macroAssembler_x86.hpp become big. May be we should start thing about splitting arraycopy stubs into separate file.

src/hotspot/cpu/x86/stubRoutines_x86.hpp line 36:

> 34: enum platform_dependent_constants {
> 35:   code_size1 = 20000 LP64_ONLY(+10000),         // simply increase if too small (assembler will crash if too small)
> 36:   code_size2 = 35300 LP64_ONLY(+21400)          // simply increase if too small (assembler will crash if too small)

This is big increase in size!

src/hotspot/cpu/x86/vm_version_x86.cpp line 1167:

> 1165:
> 1166:   if (!FLAG_IS_DEFAULT(AVX3Threshold)) {
> 1167:     if (AVX3Threshold != 0 && !is_power_of_2(AVX3Threshold)) {

Consider flag's constraint() instead of runtime these checks.  Separate RFE, please.

src/jdk.internal.vm.compiler/share/classes/org.graalvm.compiler.lir.amd64/src/org/graalvm/compiler/lir/amd64/AMD64ArrayCompareToOp.java
line 87:

> 85:         super(TYPE);
> 86:
> 87:         assert useAVX3Threshold == 0 || CodeUtil.isPowerOf2(useAVX3Threshold) : "AVX3Threshold must be power of 2";

You would need to upstream Graal changes.

test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyConjoint.java line 33:

> 31:  *
> 32:  * @run main/othervm/timeout=600 -XX:-TieredCompilation  -Xbatch -XX:+IgnoreUnrecognizedVMOptions
> 33:  *      -XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:ArrayCopyPartialInlineSize=0 -XX:MaxVectorSize=32 -XX:+UnlockDiagnosticVMOptions

ArrayCopyPartialInlineSize flag is not defiled in these changes.
It seems they need to be included in 8252848 changes.

-------------

Changes requested by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/61