RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions

Thu Oct 8 17:32:19 UTC 2020

On Wed, 23 Sep 2020 15:27:48 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Can you explain why 32 bytes are such a distinct performance cliff?
>> 
>> Is there any performance difference between doing a single 64 bytes masked copy or two 32 bytes?
>
> Hi Nils,
> Copies of sizes <= 32 bytes can be done using one YMM register; the AVX-512 vector length extension allows masked
> instructions to operate on YMM and XMM registers. Using the newly added flag -XX:ArrayCopyPartialInlineSize=64, one can
> perform inlining up to 64 bytes, but since that uses a ZMM register, the CPU will operate at a lower frequency. It
> could still give better performance depending on the application. A single 64-byte masked copy may incur a performance
> hit if the CPU operates at its highest frequency for the majority of the application's runtime. There is a switchover
> penalty when moving from a higher frequency level to a lower one, along with some hysteresis that forces subsequent
> instructions to operate at the lower frequency for some cycles. The current implementation has been kept simple to avoid
> emitting too many instructions at the call site, considering that arraycopy is a very high-frequency operation.
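
To illustrate the idea for readers of the thread, here is a minimal scalar sketch in Java of what a masked copy does: build a mask with one bit per byte lane, load only the selected lanes, then store only the selected lanes. This is a model of the semantics, not the code C2 emits. The class name, the method names and the `PARTIAL_INLINE_SIZE` constant are hypothetical and chosen only for this example; the value 32 mirrors the one-YMM-register case described above.

```java
import java.util.Arrays;
import java.util.Objects;

public class MaskedCopySketch {
    // Hypothetical stand-in for the inlining threshold: 32 bytes fit in one YMM register.
    static final int PARTIAL_INLINE_SIZE = 32;

    // Mask with the low `length` bits set, one bit per byte lane (0 <= length <= 64).
    static long laneMask(int length) {
        return length == 64 ? -1L : (1L << length) - 1;
    }

    // Scalar model of a masked vector load followed by a masked vector store.
    static void maskedCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        long mask = laneMask(length);
        byte[] vector = new byte[64]; // models a vector register of up to 64 byte lanes
        for (int lane = 0; lane < 64; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                vector[lane] = src[srcPos + lane]; // masked load: inactive lanes are never read
            }
        }
        for (int lane = 0; lane < 64; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                dst[dstPos + lane] = vector[lane]; // masked store: inactive lanes are never written
            }
        }
    }

    // Returns true if the copy took the modelled inline path, false if it fell back.
    static boolean copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length >= 0 && length <= PARTIAL_INLINE_SIZE) {
            Objects.checkFromIndexSize(srcPos, length, src.length);
            Objects.checkFromIndexSize(dstPos, length, dst.length);
            maskedCopy(src, srcPos, dst, dstPos, length);
            return true;
        }
        System.arraycopy(src, srcPos, dst, dstPos, length); // larger copies use the regular path
        return false;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5};
        byte[] dst = new byte[8];
        boolean inlined = copy(src, 1, dst, 2, 3);
        System.out.println(inlined + " " + Arrays.toString(dst)); // true [0, 0, 2, 3, 4, 0, 0, 0]
    }
}
```

Because the load of all active lanes completes before the store begins, the sketch also behaves correctly when source and destination ranges overlap within the same array.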

Hi @neliasso, @vnkozlov, kindly let me know your review comments.

-------------

PR: https://git.openjdk.java.net/jdk/pull/302