RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE
Andrew Haley
aph at openjdk.java.net
Mon Dec 6 11:39:11 UTC 2021
On Thu, 18 Nov 2021 17:24:18 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Arraycopy partial inlining is a C2 compiler technique that avoids stub
>> call overhead in small-sized arraycopy operations by generating masked
>> vector instructions. So far it works on x86 AVX512 only and this patch
>> enables it on AArch64 with SVE.
>>
>> We add an AArch64 matching rule for VectorMaskGenNode and refactor that
>> node a little bit. The major change is moving the element type field
>> into its TypeVectMask bottom type. The reason is that AArch64 vector
>> masks are different for different vector element types.
>>
>> E.g., an x86 AVX512 vector mask value masking 3 least significant vector
>> lanes (of any type) is like
>>
>> `0000 0000 ... 0000 0000 0000 0000 0111`
>>
>> On AArch64 SVE, this mask value can only be used for masking the 3 least
>> significant lanes of bytes. But for 3 lanes of ints, the value should be
>>
>> `0000 0000 ... 0000 0000 0001 0001 0001`
>>
>> where the least significant bit of each lane matters. So the AArch64
>> matcher needs to know the vector element type to generate the right masks.
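
A minimal Java sketch of the two mask layouts described above, assuming one predicate bit per vector byte on SVE and one mask bit per lane on AVX512. The class and method names are hypothetical and are not part of the patch; the sketch only models the low 64 bits of a mask.

```java
public class SveMaskSketch {
    // Hypothetical helper: low 64 bits of an SVE-style predicate.
    // SVE keeps one predicate bit per vector byte, and only the lowest
    // bit of each element's group marks the lane as active.
    // Assumes activeLanes * elementBytes <= 64 so every shift stays in range.
    static long svePredicateBits(int activeLanes, int elementBytes) {
        long bits = 0L;
        for (int lane = 0; lane < activeLanes; lane++) {
            bits |= 1L << (lane * elementBytes);
        }
        return bits;
    }

    // Hypothetical helper: AVX512-style mask, one bit per lane
    // regardless of the element size.
    static long avx512MaskBits(int activeLanes) {
        return activeLanes >= 64 ? -1L : (1L << activeLanes) - 1;
    }

    public static void main(String[] args) {
        // 3 lanes of any type on AVX512
        System.out.println(Long.toBinaryString(avx512MaskBits(3)));      // 111
        // 3 byte lanes on SVE: same bit pattern
        System.out.println(Long.toBinaryString(svePredicateBits(3, 1))); // 111
        // 3 int lanes on SVE: one bit every 4 positions
        System.out.println(Long.toBinaryString(svePredicateBits(3, 4))); // 100010001
    }
}
```

The byte case gives the same value on both layouts, which is why only wider element types need the element type in the mask's type.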
>>
>> After this patch, the C2 generated code for copying a 50-byte array on
>> AArch64 SVE looks like
>>
>> mov x12, #0x32
>> whilelo p0.b, xzr, x12
>> add x11, x11, #0x10
>> ld1b {z16.b}, p0/z, [x11]
>> add x10, x10, #0x10
>> st1b {z16.b}, p0, [x10]
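
The sequence above can be modelled in plain Java as a rough sketch: `whilelo` builds a predicate whose lane i is active while i < length, and the predicated `ld1b`/`st1b` pair touches only the active lanes. The class, the method names and the 64-byte cutoff are assumptions for illustration (the cutoff matches the 512-bit test machine); this is not the C2 implementation.

```java
import java.util.Arrays;

public class PartialInlineSketch {
    // Assumed vector width in bytes (512-bit SVE, as on the test machine).
    static final int VECTOR_BYTES = 64;

    // Models `whilelo p0.b, xzr, n`: lane i is active while i < n.
    static boolean[] whileLo(int n) {
        boolean[] predicate = new boolean[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            predicate[i] = i < n;
        }
        return predicate;
    }

    // Models one predicated ld1b/st1b pair: inactive lanes are never
    // read or written, so arrays shorter than the vector are safe.
    static void maskedCopy(byte[] src, byte[] dst, int n) {
        boolean[] predicate = whileLo(n);
        for (int i = 0; i < VECTOR_BYTES; i++) {
            if (predicate[i]) {
                dst[i] = src[i];
            }
        }
    }

    // Small copies take the inlined masked path; larger ones fall back
    // to System.arraycopy, standing in here for the stub call.
    static void copy(byte[] src, byte[] dst, int n) {
        if (n <= VECTOR_BYTES) {
            maskedCopy(src, dst, n);
        } else {
            System.arraycopy(src, 0, dst, 0, n);
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[50];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[50];
        copy(src, dst, 50);
        System.out.println(Arrays.equals(src, dst)); // true
    }
}
```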
>>
>> We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
>> both x86 AVX512 and AArch64 SVE machines and found no issues. We also ran
>> JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array
>> size arguments on a CPU with 512-bit SVE. The performance changes are
>> shown below.
>>
>> Benchmark                  (length)  (Performance)
>> ArrayCopyAligned.testByte        10          -2.6%
>> ArrayCopyAligned.testByte        20          +4.7%
>> ArrayCopyAligned.testByte        30          +4.8%
>> ArrayCopyAligned.testByte        40         +21.7%
>> ArrayCopyAligned.testByte        50         +22.5%
>> ArrayCopyAligned.testByte        60         +28.4%
>>
>> The test machine has an SVE vector size of 512 bits, so we see a
>> performance gain for most array sizes below 64 bytes. For very small
>> arrays we see a slight regression because a vector load/store may be a
>> bit slower than 1 or 2 scalar loads/stores.
>
> Hurrah! I have managed to duplicate your results.
>
> Old:
>
> Benchmark                  (length)  Mode  Cnt   Score   Error  Units
> ArrayCopyAligned.testByte        40  avgt    5  23.332 ± 0.016  ns/op
>
>
> New:
>
> ArrayCopyAligned.testByte        40  avgt    5  18.092 ± 0.093  ns/op
>
>
> ... and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now.
>
> Hi @theRealAph, are you still looking at this? I have another big fix which depends on the vector mask change inside this patch. So I hope this can be integrated soon.
I'm quite happy with the AArch64 parts, but I'm not familiar with that part of the C2 compiler. I think you need an additional reviewer, perhaps @rwestrel.
-------------
PR: https://git.openjdk.java.net/jdk/pull/6444
More information about the hotspot-compiler-dev mailing list