RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions

Wed Oct 14 12:11:11 UTC 2020

On Thu, 8 Oct 2020 17:29:27 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>>> Can you explain why 32 bytes are such a distinct performance cliff?
>>> 
>>> Is there any performance difference between doing a single 64-byte masked copy and two 32-byte copies?
>> 
>> Hi Nils,
>> Copies of <= 32 bytes can be done using one YMM register; the AVX-512 vector length extension (AVX512VL) allows
>> masked instructions to operate on YMM and XMM registers. Using the newly added flag
>> -XX:ArrayCopyPartialInlineSize=64 one can inline copies of up to 64 bytes, but since that uses a ZMM register the
>> CPU will operate at a lower frequency; it could still give better performance depending on the application. A
>> single 64-byte masked copy may incur a performance hit if the CPU operates at its highest frequency for the
>> majority of the application's runtime. There is a switchover penalty from the higher frequency level to the lower
>> one, along with some hysteresis that forces subsequent instructions to operate at the lower frequency for some
>> cycles. The current implementation has been kept simple to avoid emitting too many instructions at the call site,
>> considering that arraycopy is a very high frequency operation.
>
> Hi @neliasso, @vnkozlov, kindly let me know your review comments.
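
For readers following the thread, the size-based dispatch described in the quoted reply can be modelled roughly in plain Java. This is only an illustrative sketch and not the code the JIT emits: the class name, the helper methods, and the constant standing in for -XX:ArrayCopyPartialInlineSize are all hypothetical, and the scalar loop merely imitates what a single masked vector move would do.

```java
import java.util.Arrays;

public class PartialInlineSketch {
    // Hypothetical stand-in for -XX:ArrayCopyPartialInlineSize, in bytes.
    // 32 corresponds to one YMM register, 64 to one ZMM register.
    static final int PARTIAL_INLINE_SIZE = 32;

    // Builds a lane mask with the low `lanes` bits set (bit i set means lane i is copied).
    static long laneMask(int lanes) {
        return lanes >= 64 ? -1L : (1L << lanes) - 1;
    }

    // True when the whole copy fits in one masked vector move of the configured size.
    static boolean fitsInline(int length, int elemBytes) {
        return (long) length * elemBytes <= PARTIAL_INLINE_SIZE;
    }

    // Models the call site: small copies take the "inline" path, larger ones
    // fall back to the regular copy routine. Returns the path taken.
    // Assumes 0 <= length <= min(src.length, dst.length).
    static String copy(byte[] src, byte[] dst, int length) {
        if (fitsInline(length, Byte.BYTES)) {
            long mask = laneMask(length);
            // Scalar imitation of a masked load followed by a masked store.
            for (int i = 0; i < length; i++) {
                if (((mask >>> i) & 1L) != 0) {
                    dst[i] = src[i];
                }
            }
            return "inline";
        }
        System.arraycopy(src, 0, dst, 0, length);
        return "stub";
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[64];
        System.out.println(copy(src, dst, 20) + " " + Arrays.equals(src, 0, 20, dst, 0, 20));
        System.out.println(copy(src, dst, 48) + " " + Arrays.equals(src, 0, 48, dst, 0, 48));
    }
}
```

Setting the hypothetical constant to 64 would model the ZMM case, where one masked move covers up to 64 bytes at the cost of the frequency effects discussed above.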

Hi Jatin,

I'm ready to approve it, but I would like to kick it through some performance testing first.

Best regards,
Nils Eliasson

-------------

PR: https://git.openjdk.java.net/jdk/pull/302