RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE

Andrew Haley aph at openjdk.java.net
Mon Dec 6 11:39:11 UTC 2021


On Thu, 18 Nov 2021 17:24:18 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Arraycopy partial inlining is a C2 compiler technique that avoids stub
>> call overhead for small arraycopy operations by generating masked
>> vector instructions instead. So far it works only on x86 with AVX512;
>> this patch enables it on AArch64 with SVE.
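>> 
>> As a rough illustration (a made-up snippet, not code from the patch),
>> the kind of call that benefits is a small, constant-length copy such
>> as:
>> 
>>   // With partial inlining, C2 can emit a masked vector load/store
>>   // here instead of calling out to the arraycopy stub.
>>   static void copySmall(byte[] src, byte[] dst) {
>>       System.arraycopy(src, 0, dst, 0, 50);
>>   }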
>> 
>> We add an AArch64 matching rule for VectorMaskGenNode and refactor
>> that node a little. The major change is moving the element type field
>> into its TypeVectMask bottom type. The reason is that AArch64 vector
>> masks differ for different vector element types.
>> 
>> E.g., an x86 AVX512 vector mask value masking the 3 least significant
>> vector lanes (of any element type) looks like
>> 
>> `0000 0000 ... 0000 0000 0000 0000 0111`
>> 
>> On AArch64 SVE, this mask value can only be used for masking the 3 least
>> significant lanes of bytes. But for 3 lanes of ints, the value should be
>> 
>> `0000 0000 ... 0000 0000 0001 0001 0001`
>> 
>> where the least significant bit of each lane matters. So the AArch64
>> matcher needs to know the vector element type to generate the right
>> masks.
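>> 
>> As a rough sketch of the difference (illustration only, not HotSpot
>> code; both method names are made up):
>> 
>>   // AVX512-style mask: one bit per lane, independent of lane size.
>>   static long avx512StyleMask(int activeLanes) {
>>       return (1L << activeLanes) - 1;        // 3 lanes -> 0b111
>>   }
>> 
>>   // SVE-style predicate: one bit per byte, where only the lowest bit
>>   // of each lane is significant, so the element size matters.
>>   static long sveStylePredicate(int activeLanes, int elemBytes) {
>>       long p = 0;
>>       for (int i = 0; i < activeLanes; i++) {
>>           p |= 1L << (i * elemBytes);        // 3 int lanes -> 0b100010001
>>       }
>>       return p;
>>   }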
>> 
>> After this patch, the C2-generated code for copying a 50-byte array on
>> AArch64 SVE looks like
>> 
>>   mov     x12, #0x32            // length = 50 bytes
>>   whilelo p0.b, xzr, x12        // predicate: byte lanes 0..49 active
>>   add     x11, x11, #0x10       // advance to the source array data
>>   ld1b    {z16.b}, p0/z, [x11]  // masked vector load of the 50 bytes
>>   add     x10, x10, #0x10       // advance to the destination array data
>>   st1b    {z16.b}, p0, [x10]    // masked vector store of the 50 bytes
>> 
>> We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
>> both x86 AVX512 and AArch64 SVE machines; no issues were found. We ran
>> the JMH benchmark org/openjdk/bench/java/lang/ArrayCopyAligned.java
>> with small array size arguments on a CPU with 512-bit SVE vectors and
>> observed the performance changes below.
>> 
>> Benchmark                  (length)  (Perf. change)
>> ArrayCopyAligned.testByte        10          -2.6%
>> ArrayCopyAligned.testByte        20          +4.7%
>> ArrayCopyAligned.testByte        30          +4.8%
>> ArrayCopyAligned.testByte        40         +21.7%
>> ArrayCopyAligned.testByte        50         +22.5%
>> ArrayCopyAligned.testByte        60         +28.4%
>> 
>> The test machine has an SVE vector size of 512 bits, so we see a
>> performance gain for most array sizes below 64 bytes. For very small
>> arrays we see a slight regression because a masked vector load/store
>> may be a bit slower than one or two scalar loads/stores.
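>> 
>> For reference, a benchmark along these lines looks roughly like the
>> sketch below (not the actual ArrayCopyAligned.java source; the class
>> name is made up and the parameter values follow the table above):
>> 
>>   import org.openjdk.jmh.annotations.*;
>> 
>>   @State(Scope.Thread)
>>   public class SmallArrayCopyBench {
>>       @Param({"10", "20", "30", "40", "50", "60"})
>>       int length;
>> 
>>       byte[] src, dst;
>> 
>>       @Setup
>>       public void setup() {
>>           src = new byte[length];
>>           dst = new byte[length];
>>       }
>> 
>>       // Copies length bytes; with this patch the copy can be done with
>>       // a single masked SVE load/store when length fits in one vector.
>>       @Benchmark
>>       public void testByte() {
>>           System.arraycopy(src, 0, dst, 0, length);
>>       }
>>   }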
>
> Hurrah! I have managed to duplicate your results.
> 
> Old:
> 
> Benchmark                       (length)  Mode  Cnt   Score   Error  Units
> ArrayCopyAligned.testByte             40  avgt    5  23.332 ± 0.016  ns/op
> 
> 
> New:
> 
> ArrayCopyAligned.testByte             40  avgt    5  18.092 ± 0.093  ns/op
> 
> 
> ... and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now.

> Hi @theRealAph, are you still looking at this? I have another big fix that depends on the vector mask change in this patch, so I hope it can be integrated soon.

I'm quite happy with the AArch64 parts, but I'm not familiar with that part of the C2 compiler. I think you need an additional reviewer, perhaps @rwestrel.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6444

