RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions

Wed Sep 23 15:33:50 UTC 2020

On Wed, 23 Sep 2020 11:09:25 GMT, Nils Eliasson <neliasso at openjdk.org> wrote:

> Can you explain why 32 bytes are such a distinct performance cliff?
> 
> Is there any performance difference between doing a single 64 bytes masked copy or two 32 bytes?

Hi Nils,
A copy of up to 32 bytes can be done with a single YMM register, because the AVX-512 Vector Length extension
(AVX512VL) allows masked instructions to operate on YMM and XMM registers. With the newly added flag
-XX:ArrayCopyPartialInlineSize=64 one can enable inlining for sizes up to 64 bytes, but since that uses a ZMM
register the CPU will operate at a lower frequency. It could still give better performance, depending on the
application.
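
To illustrate the idea, here is a rough Java-level sketch of what partial inlining does for byte arrays. It is only an
emulation under assumptions: the class and method names are hypothetical, the 32-byte threshold is hard-coded, and the
loops stand in for a single masked vector load and a single masked vector store. It is not the code the compiler
actually emits.

```java
public class PartialInlineSketch {
    // Assumed threshold: 32 bytes, i.e. one YMM register (hypothetical constant).
    static final int PARTIAL_INLINE_SIZE = 32;

    // Lane mask with the low `len` bits set, one bit per byte lane.
    static long laneMask(int len) {
        return len >= 64 ? -1L : (1L << len) - 1L;
    }

    // Emulates a masked load into a vector register followed by a masked store.
    // Lanes whose mask bit is clear are neither read nor written.
    static void maskedCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        long mask = laneMask(len);
        byte[] vec = new byte[PARTIAL_INLINE_SIZE]; // stands in for the YMM register
        for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
            if (((mask >>> lane) & 1L) != 0L) {
                vec[lane] = src[srcPos + lane];      // masked load
            }
        }
        for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
            if (((mask >>> lane) & 1L) != 0L) {
                dst[dstPos + lane] = vec[lane];      // masked store
            }
        }
    }

    // Returns true if the copy took the inlined (masked) path,
    // false if it fell back to the regular arraycopy call.
    static boolean copy(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        if (len >= 0 && len <= PARTIAL_INLINE_SIZE) {
            maskedCopy(src, srcPos, dst, dstPos, len);
            return true;
        }
        System.arraycopy(src, srcPos, dst, dstPos, len);
        return false;
    }
}
```

In this sketch, a call such as copy(src, 0, dst, 0, 5) takes the masked path and touches only 5 bytes of dst, while a
40-byte copy falls back to System.arraycopy.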

A single 64-byte masked copy may incur a performance hit if the CPU operates at its highest frequency for the majority
of the application's runtime. There is a switchover penalty when moving from the higher frequency level to the lower
one, along with some hysteresis that forces subsequent instructions to operate at the lower frequency for some cycles.

The current implementation has been kept simple to avoid emitting too many instructions at the call site, considering
that arraycopy is a very high-frequency operation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/302