RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE

Pengfei Li pli at openjdk.java.net
Thu Nov 18 04:03:54 UTC 2021


Arraycopy partial inlining is a C2 compiler technique that avoids stub
call overhead for small arraycopy operations by generating masked
vector instructions. So far it works only on x86 AVX512; this patch
enables it on AArch64 with SVE.

We add an AArch64 matching rule for VectorMaskGenNode and refactor that
node slightly. The major change is moving the element type field
into its TypeVectMask bottom type. The reason is that AArch64 vector
masks differ depending on the vector element type.

E.g., an x86 AVX512 vector mask value that selects the 3 least
significant vector lanes (of any type) looks like

`0000 0000 ... 0000 0000 0000 0000 0111`

On AArch64 SVE, this mask value selects the 3 least significant lanes
only when the lanes are bytes. For 3 lanes of ints, the value should be

`0000 0000 ... 0000 0000 0001 0001 0001`

where only the least significant bit of each lane matters. So the
AArch64 matcher needs to know the vector element type to generate the
right masks.
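
To make the two layouts concrete, here is a minimal Java sketch that
computes both mask values for a given number of active lanes. It is an
illustration only, not code from the patch: the class and method names
are hypothetical, and the masks are modelled as a 64-bit long, which
is enough for a 512-bit vector (one predicate bit per vector byte).

```java
public class MaskLayoutSketch {
    // AVX512-style opmask: one bit per lane, independent of element size.
    static long avx512Mask(int activeLanes) {
        return activeLanes >= 64 ? -1L : (1L << activeLanes) - 1;
    }

    // SVE-style predicate: one bit per vector byte. For an element of
    // elemBytes bytes, only the lowest bit of its group is significant.
    // Assumes activeLanes * elemBytes <= 64 (a 512-bit vector).
    static long svePredicate(int activeLanes, int elemBytes) {
        long bits = 0L;
        for (int lane = 0; lane < activeLanes; lane++) {
            bits |= 1L << (lane * elemBytes);
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(avx512Mask(3)));      // 111
        System.out.println(Long.toBinaryString(svePredicate(3, 1))); // 111
        System.out.println(Long.toBinaryString(svePredicate(3, 4))); // 100010001
    }
}
```

For byte lanes the two layouts coincide, which is why the first mask
value above is valid on SVE only for bytes.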

After this patch, the C2-generated code for copying a 50-byte array on
AArch64 SVE looks like

  mov     x12, #0x32
  whilelo p0.b, xzr, x12
  add     x11, x11, #0x10
  ld1b    {z16.b}, p0/z, [x11]
  add     x10, x10, #0x10
  st1b    {z16.b}, p0, [x10]
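
The sequence above can be read as: build a byte predicate that is
active for lanes 0..49 (0x32 = 50), do one predicated load, then one
predicated store. The two `add ..., #0x10` instructions presumably skip
the array header, but that is an assumption on the reader's side. A
rough scalar model in Java, with hypothetical names and a fixed
64-byte vector as on the test machine, might look like this:

```java
import java.util.Arrays;

public class PredicatedCopySketch {
    static final int VECTOR_BYTES = 64; // assumed 512-bit SVE vector

    // Models "whilelo p.b, xzr, n": lane i is active while i < n.
    static boolean[] whilelo(int n) {
        boolean[] p = new boolean[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            p[i] = i < n;
        }
        return p;
    }

    // Models "ld1b {z.b}, p/z, [src]": inactive lanes are zeroed and
    // their memory is never read.
    static byte[] ld1b(boolean[] p, byte[] src, int off) {
        byte[] z = new byte[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            if (p[i]) {
                z[i] = src[off + i];
            }
        }
        return z;
    }

    // Models "st1b {z.b}, p, [dst]": only active lanes are written.
    static void st1b(byte[] z, boolean[] p, byte[] dst, int off) {
        for (int i = 0; i < VECTOR_BYTES; i++) {
            if (p[i]) {
                dst[off + i] = z[i];
            }
        }
    }

    // One predicated load/store pair covers any length up to one vector.
    static void copySmall(byte[] src, byte[] dst, int length) {
        if (length < 0 || length > VECTOR_BYTES) {
            throw new IllegalArgumentException("length must fit in one vector");
        }
        boolean[] p = whilelo(length);
        st1b(ld1b(p, src, 0), p, dst, 0);
    }

    public static void main(String[] args) {
        byte[] src = new byte[50];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i + 1);
        }
        byte[] dst = new byte[50];
        copySmall(src, dst, 50);
        System.out.println(Arrays.equals(src, dst));
    }
}
```

Because inactive lanes are neither read nor written, no scalar tail
loop is needed for lengths that are not a multiple of the vector size.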

We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
both x86 AVX512 and AArch64 SVE machines and found no issues. We also
ran JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small
array size arguments on a CPU with 512-bit SVE. The performance changes
are shown below.

Benchmark                  (length)  (Performance)
ArrayCopyAligned.testByte        10          -2.6%
ArrayCopyAligned.testByte        20          +4.7%
ArrayCopyAligned.testByte        30          +4.8%
ArrayCopyAligned.testByte        40         +21.7%
ArrayCopyAligned.testByte        50         +22.5%
ArrayCopyAligned.testByte        60         +28.4%

The test machine has an SVE vector size of 512 bits (64 bytes), so we
see a performance gain for most array sizes below 64 bytes. For very
small arrays we see a slight regression, because a vector load/store
may be a bit slower than 1 or 2 scalar loads/stores.

-------------

Commit messages:
 - 8277168: AArch64: Enable arraycopy partial inlining with SVE

Changes: https://git.openjdk.java.net/jdk/pull/6444/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6444&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8277168
  Stats: 87 lines in 16 files changed: 57 ins; 7 del; 23 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6444.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6444/head:pull/6444

PR: https://git.openjdk.java.net/jdk/pull/6444