RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE

Andrew Haley aph at openjdk.java.net
Tue Dec 7 11:17:11 UTC 2021


On Thu, 18 Nov 2021 03:50:45 GMT, Pengfei Li <pli at openjdk.org> wrote:

> Arraycopy partial inlining is a C2 compiler technique that avoids stub
> call overhead in small-sized arraycopy operations by generating masked
> vector instructions. So far it works on x86 AVX512 only and this patch
> enables it on AArch64 with SVE.
> 
> We add an AArch64 matching rule for VectorMaskGenNode and refactor that
> node slightly. The major change is moving the element type field into
> its TypeVectMask bottom type, because AArch64 vector masks differ
> between vector element types.
> 
> For example, an x86 AVX512 vector mask value that masks the 3 least
> significant vector lanes (of any type) looks like
> 
> `0000 0000 ... 0000 0000 0000 0000 0111`
> 
> On AArch64 SVE, this mask value can only be used for masking the 3 least
> significant lanes of bytes. But for 3 lanes of ints, the value should be
> 
> `0000 0000 ... 0000 0000 0001 0001 0001`
> 
> where the least significant bit of each lane matters. So the AArch64
> matcher needs to know the vector element type to generate the right masks.
> 
> After this patch, the C2 generated code for copying a 50-byte array on
> AArch64 SVE looks like
> 
>   mov     x12, #0x32
>   whilelo p0.b, xzr, x12
>   add     x11, x11, #0x10
>   ld1b    {z16.b}, p0/z, [x11]
>   add     x10, x10, #0x10
>   st1b    {z16.b}, p0, [x10]
> 
> We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
> both x86 AVX512 and AArch64 SVE machines and found no issues. We also
> ran the JMH benchmark org/openjdk/bench/java/lang/ArrayCopyAligned.java
> with small array size arguments on a CPU with 512-bit SVE. The
> performance changes are shown below.
> 
> Benchmark                  (length)  (Performance)
> ArrayCopyAligned.testByte        10          -2.6%
> ArrayCopyAligned.testByte        20          +4.7%
> ArrayCopyAligned.testByte        30          +4.8%
> ArrayCopyAligned.testByte        40         +21.7%
> ArrayCopyAligned.testByte        50         +22.5%
> ArrayCopyAligned.testByte        60         +28.4%
> 
> The test machine has an SVE vector size of 512 bits, so we see a
> performance gain for most array sizes below 64 bytes. For very small
> arrays we see a slight regression because a vector load/store may be a
> bit slower than 1 or 2 scalar loads/stores.
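
As an illustration of the two mask layouts described in the quoted text,
here is a minimal Java sketch. It is not HotSpot code: the class and method
names are hypothetical, and it assumes a 512-bit vector, so an SVE predicate
has 64 bits (one per vector byte) and fits in a long. The last helper loosely
models how many lanes `whilelo` would activate for a given element count.

```java
// Illustrative sketch only; all names here are hypothetical, not HotSpot APIs.
final class MaskSketch {
    // Assumption: 512-bit vectors, i.e. 64 bytes and 64 predicate bits.
    static final int VECTOR_BYTES = 64;

    // AVX512 style: one mask bit per lane, independent of the element type.
    static long avx512Mask(int lanes) {
        if (lanes < 0 || lanes > 64) {
            throw new IllegalArgumentException("lanes out of range: " + lanes);
        }
        return lanes == 64 ? -1L : (1L << lanes) - 1;
    }

    // SVE style: one predicate bit per vector byte; for wider elements only
    // the lowest bit of each element's group of bits marks the lane active.
    static long svePredicate(int lanes, int elemBytes) {
        if (lanes < 0 || elemBytes <= 0 || (long) lanes * elemBytes > VECTOR_BYTES) {
            throw new IllegalArgumentException("mask does not fit the vector");
        }
        long bits = 0L;
        for (int i = 0; i < lanes; i++) {
            bits |= 1L << (i * elemBytes);
        }
        return bits;
    }

    // Rough model of whilelo: lane i is active while base + i < limit,
    // capped at the number of lanes the vector holds (non-negative inputs).
    static int whileloActiveLanes(long base, long limit, int elemBytes) {
        long maxLanes = VECTOR_BYTES / elemBytes;
        long remaining = Math.max(0L, limit - base);
        return (int) Math.min(maxLanes, remaining);
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(avx512Mask(3)));      // 111
        System.out.println(Long.toBinaryString(svePredicate(3, 1))); // 111
        System.out.println(Long.toBinaryString(svePredicate(3, 4))); // 100010001
        System.out.println(whileloActiveLanes(0, 50, 1));            // 50
    }
}
```

With these assumptions, 3 byte lanes give the same bits in both layouts,
while 3 int lanes give `1 0001 0001` in the SVE layout, which is why the
matcher has to know the element type.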

Marked as reviewed by aph (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/6444


More information about the hotspot-compiler-dev mailing list