RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions
Jatin Bhateja
jbhateja at openjdk.java.net
Wed Sep 23 15:33:50 UTC 2020
On Wed, 23 Sep 2020 11:09:25 GMT, Nils Eliasson <neliasso at openjdk.org> wrote:
> Can you explain why 32 bytes are such a distinct performance cliff?
>
> Is there any performance difference between doing a single 64 bytes masked copy or two 32 bytes?
Hi Nils,
A copy of up to 32 bytes fits in a single YMM register, and the AVX-512 vector length extension (AVX512VL) allows masked
instructions to operate on YMM and XMM registers. With the newly added flag -XX:ArrayCopyPartialInlineSize=64, inlining
can be extended to copies of up to 64 bytes, but that uses a ZMM register, which makes the CPU run at a lower frequency.
It could still give better performance, depending on the application.
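To illustrate the size check described above, here is a minimal Java sketch that models the idea in scalar code. It is not the C2 implementation from the PR: the class and the helpers laneMask, maskedCopy and copy are hypothetical names, the constant PARTIAL_INLINE_SIZE only stands in for the value of -XX:ArrayCopyPartialInlineSize, and the per-lane loop merely emulates what a single masked vector load/store pair would do in hardware.

```java
import java.util.Arrays;

final class PartialInlineSketch {
    // Assumption: stands in for -XX:ArrayCopyPartialInlineSize; 32 bytes is one YMM register.
    static final int PARTIAL_INLINE_SIZE = 32;

    // Hypothetical helper: mask with the low `lanes` bits set, one bit per byte lane.
    static long laneMask(int lanes) {
        if (lanes < 0) {
            throw new IllegalArgumentException("negative lane count");
        }
        return lanes >= 64 ? -1L : (1L << lanes) - 1;
    }

    // Emulates a masked vector copy: only lanes whose mask bit is set are written.
    static void maskedCopy(byte[] src, byte[] dst, int length, int vectorBytes) {
        long mask = laneMask(length);
        for (int lane = 0; lane < vectorBytes; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                dst[lane] = src[lane];
            }
        }
    }

    // Returns true if the copy took the modeled inline path, false if it fell back.
    static boolean copy(byte[] src, byte[] dst, int length) {
        if (length <= PARTIAL_INLINE_SIZE) {
            maskedCopy(src, dst, length, PARTIAL_INLINE_SIZE);
            return true;
        }
        // Larger copies go through the regular arraycopy path.
        System.arraycopy(src, 0, dst, 0, length);
        return false;
    }

    public static void main(String[] args) {
        byte[] src = new byte[40];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i + 1);
        }
        byte[] small = new byte[40];
        byte[] large = new byte[40];
        System.out.println("5 bytes inlined:  " + copy(src, small, 5));
        System.out.println("40 bytes inlined: " + copy(src, large, 40));
        System.out.println("large copy matches: " + Arrays.equals(src, large));
    }
}
```

Raising the constant to 64 in this model corresponds to the ZMM case, where one masked operation would cover the whole copy at the cost of the frequency effect discussed in this message.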
A single 64-byte masked copy may hurt performance if the CPU otherwise runs at its highest frequency for the majority
of the application's runtime. There is a switchover penalty when moving from the higher frequency level to the lower
one, along with some hysteresis that forces subsequent instructions to run at the lower frequency for some cycles.
The current implementation has been kept simple to avoid emitting too many instructions at the call site, considering
that arraycopy is a very high-frequency operation.
-------------
PR: https://git.openjdk.java.net/jdk/pull/302
More information about the hotspot-compiler-dev mailing list