RFR: 8252847: New AVX512 optimized stubs for both conjoint and disjoint arraycopy [v2]

Fri Sep 18 08:52:30 UTC 2020

On Thu, 17 Sep 2020 13:22:07 GMT, Nils Eliasson <neliasso at openjdk.org> wrote:

> My only concern is that it's getting hard to follow under what circumstances avx3 instructions are used:
> Could it be the case that different thresholds are needed for when you are using avx3 instructions with 32 or 64 byte
> vectors? Are we sure all variants are tested?

The following two runtime flags influence the implementation:
- MaxVectorSize: Determined during VM initialization based on the CPUID of the target.
- AVX3Threshold: Set to a default value of 4096 bytes based on prior performance analysis.

The following general rules were followed during implementation:
1) If the target supports AVX3 features (BW+VL+F), the copy uses 32 byte vectors (YMMs) for both the special cases and
   the aligned copy loop. This is the default configuration.
2) If the copy length is above AVX3Threshold, we can safely use 64 byte vectors (ZMMs) for the main copy loop (and
   tail), since the bulk of the cycles will be consumed in it.
3) Leaf level macro assembly routines can dynamically choose between YMM and ZMM registers based on the
   AVX3Threshold value.
4) If the user forces MaxVectorSize=32, then above 4096 bytes REP MOVS is seen to perform better for disjoint
   copies. For conjoint/backward copies, the vector based copy performs better.

Thus, for 32 byte vectors we do not need any threshold, since they execute at the max frequency level.
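
To make the rules above easier to follow, here is a rough decision-table sketch in Java. It is only an illustrative
model of the selection logic described in this mail, not the actual stub generator code: the class, enum and method
names are hypothetical, the fallback for targets without AVX3 (or with MaxVectorSize below 32) is an assumption, and
the 4096 byte cutoff for REP MOVS is simply the figure quoted in rule 4.

```java
public class CopyStrategySketch {
    // Hypothetical labels for the code paths discussed in the rules above.
    enum Strategy { LEGACY_STUB, YMM_32_BYTE, ZMM_64_BYTE, REP_MOVS }

    // Rule 4 quotes 4096 bytes; treating it as a fixed constant is an assumption of this sketch.
    static final long REP_MOVS_CUTOFF_BYTES = 4096;

    static Strategy choose(boolean hasAvx3, int maxVectorSize, long avx3Threshold,
                           long lengthBytes, boolean disjoint) {
        // Assumption: without AVX3 (BW+VL+F), or with vectors narrower than 32 bytes,
        // the new stubs are not used at all.
        if (!hasAvx3 || maxVectorSize < 32) {
            return Strategy.LEGACY_STUB;
        }
        if (maxVectorSize >= 64) {
            // Rules 1 and 2: YMM by default, ZMM only above AVX3Threshold.
            return lengthBytes > avx3Threshold ? Strategy.ZMM_64_BYTE : Strategy.YMM_32_BYTE;
        }
        // Rule 4: MaxVectorSize forced to 32.
        if (disjoint && lengthBytes > REP_MOVS_CUTOFF_BYTES) {
            return Strategy.REP_MOVS;
        }
        // Conjoint/backward copies and short disjoint copies stay on 32 byte vectors.
        return Strategy.YMM_32_BYTE;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 64, 4096, 1024, true));   // YMM_32_BYTE
        System.out.println(choose(true, 64, 4096, 8192, true));   // ZMM_64_BYTE
        System.out.println(choose(true, 32, 4096, 8192, true));   // REP_MOVS
        System.out.println(choose(true, 32, 4096, 8192, false));  // YMM_32_BYTE
    }
}
```

The sketch does not model rule 3, where the leaf level routines pick the register width themselves.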

tier1, tier2 and tier3 did not show any new issues with the changes.

> Also - have you thought about supporting oop-copies? You only have to call the

We may not see a significant performance improvement, considering that the prologue and epilogue barriers do
considerable processing over object arrays.

-------------

PR: https://git.openjdk.java.net/jdk/pull/61