RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions
Jatin Bhateja
jbhateja at openjdk.java.net
Thu Oct 8 17:32:19 UTC 2020
On Wed, 23 Sep 2020 15:27:48 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Can you explain why 32 bytes are such a distinct performance cliff?
>>
>> Is there any performance difference between doing a single 64 bytes masked copy or two 32 bytes?
>
> Hi Nils,
> A copy of up to 32 bytes fits in a single YMM register, and the AVX-512 Vector Length
> extension (AVX512VL) allows masked instructions to operate on YMM and XMM registers, not
> only on ZMM. With the newly added flag -XX:ArrayCopyPartialInlineSize=64, inlining can be
> extended to copies of up to 64 bytes. Because such copies use a ZMM register, the CPU will
> operate at a lower frequency, but this could still give better performance depending on
> the application. A single 64-byte masked copy may hurt performance if the CPU otherwise
> runs at its highest frequency for most of the application's runtime: switching from a
> higher to a lower frequency level incurs a penalty, and hysteresis keeps subsequent
> instructions running at the lower frequency for some cycles. The current implementation
> is deliberately kept simple to avoid emitting too many instructions at the call site,
> since arraycopy is a very high-frequency operation.
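The size-based choice described in the reply can be modeled in plain Java. This is a scalar emulation under stated assumptions, not HotSpot's implementation: the class `PartialInlineCopyModel`, its methods, and the 32-byte default limit used in `main` are hypothetical, and the `partialInlineSize` parameter only stands in for the value of `-XX:ArrayCopyPartialInlineSize`. The sketch picks a YMM (32-byte) or ZMM (64-byte) register width from the copy length, builds an opmask with one bit per byte lane, and performs a masked load followed by a masked store, so bytes past the copy length are never read or written. Lengths are in bytes; for a wider element type the byte length is the element count times the element size.

```java
import java.util.Arrays;

/**
 * Hypothetical scalar model of size-based partial inlining for a byte[] copy.
 * NOT HotSpot code: class/method names and the 32-byte default limit are
 * illustrative assumptions.
 */
public final class PartialInlineCopyModel {

    static final int YMM_BYTES = 32; // one 256-bit YMM register
    static final int ZMM_BYTES = 64; // one 512-bit ZMM register

    /** Register width (bytes) used to inline the copy, or -1 to fall back to the out-of-line stub. */
    static int registerWidthFor(int lengthBytes, int partialInlineSize) {
        if (lengthBytes < 0) {
            throw new IllegalArgumentException("negative length: " + lengthBytes);
        }
        if (lengthBytes > partialInlineSize || lengthBytes > ZMM_BYTES) {
            return -1;
        }
        return lengthBytes <= YMM_BYTES ? YMM_BYTES : ZMM_BYTES;
    }

    /** Opmask with the low lengthBytes bits set: bit i enables byte lane i. */
    static long laneMask(int lengthBytes) {
        // Java masks long shift counts to 6 bits (1L << 64 == 1L), so handle 64 explicitly.
        return lengthBytes == 64 ? -1L : (1L << lengthBytes) - 1;
    }

    /**
     * Copies length bytes. Returns true if modeled as one masked load/store pair,
     * false if it fell back to System.arraycopy.
     */
    static boolean maskedCopy(byte[] src, int srcPos, byte[] dst, int dstPos,
                              int length, int partialInlineSize) {
        int width = registerWidthFor(length, partialInlineSize);
        if (width < 0) {
            System.arraycopy(src, srcPos, dst, dstPos, length);
            return false;
        }
        long mask = laneMask(length);
        // Masked load into a register-sized temporary: masked-off lanes are never
        // read, mirroring how masked lanes suppress memory accesses and faults.
        byte[] reg = new byte[width];
        for (int lane = 0; lane < width; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                reg[lane] = src[srcPos + lane];
            }
        }
        // Masked store: only enabled lanes are written. Loading everything first
        // keeps overlapping copies (src == dst) correct, as with memmove.
        for (int lane = 0; lane < width; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                dst[dstPos + lane] = reg[lane];
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] src = new byte[48];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i + 1);
        }
        byte[] dst = new byte[48];
        // 20 bytes: within the assumed 32-byte limit, so one YMM-width masked copy.
        System.out.println(maskedCopy(src, 0, dst, 0, 20, 32));
        // 48 bytes: not inlined with a 32-byte limit, ZMM width with a 64-byte limit.
        System.out.println(registerWidthFor(48, 32) + " " + registerWidthFor(48, 64));
        System.out.println(Arrays.toString(Arrays.copyOf(dst, 22)));
    }
}
```

Modeling the register as a temporary array is a deliberate choice: the whole masked load completes before the masked store begins, which is why a single load/store pair can also serve overlapping ranges within one array.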
Hi @neliasso, @vnkozlov, kindly let me know your review comments.
-------------
PR: https://git.openjdk.java.net/jdk/pull/302
More information about the hotspot-compiler-dev mailing list