RFR(L): 8231441: AArch64: Initial SVE backend support

Wed Mar 25 06:37:17 UTC 2020

Hi,

Could you please help to review this patch adding AArch64 SVE support? 
It also touches c2 compiler shared code.

Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00

Arm has released new vector ISA extensions for AArch64, SVE [1] and
SVE2 [2]. This patch adds initial SVE support to OpenJDK. In this
patch we have:

1) SVE feature enablement and detection
2) SVE vector register allocation support with an initial predicate
register definition
3) SVE c2 backend for the current SLP-based vectorizer. (We also have a
POC patch of a new vectorizer using SVE predicate-driven loop control,
but that's still under development.)

SVE register definition
=======================
Unlike other SIMD architectures, SVE allows hardware implementations to
choose a vector register length from 128 to 2048 bits, in multiples of
128 bits. So we introduce a new vector type VectorA, i.e. a
length-agnostic (scalable) vector type, and Op_VecA for the machine
vectora register. In the meantime, to minimize register allocation code
changes, we also take advantage of one aspect of a JIT compiler: at
compile time we actually know the real hardware SVE vector register
size of the machine we are running on. So the register allocator
actually knows how many register slots an Op_VecA ideal reg requires,
and can work fine without much modification.

Since the bottom 128 bits are shared with NEON, we extend the current
register mask definition of the V0-V31 registers. Currently, c2 uses one
mask bit for a 32-bit register slot, so to define at most 2048 bits we
would need 64 slots per register in the AD file. That's a really large
number, and would also break the current regmask assumptions.
Considering that the SVE vector register is architecturally scalable to
different sizes, we just define double the original NEON vector register
slots, i.e. 8 slots: Vx, Vx_H, Vx_J ... Vx_O. After adlc, the generated
register masks now look like:

const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff, 
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...

const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303, 
0x3030303, 0x3030303, 0x3030303, 0x3030303, ...

const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f, 
0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...

And we use SlotsPerVecA to indicate the regmask bit size of a VecA
register.
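
As a rough illustration of how the three masks above relate to the
8-slot layout, here is a minimal Java sketch. It is not adlc code:
maskWord() and the class name are hypothetical, and it assumes the mask
bits are packed register by register, 8 bits per register, so that one
32-bit mask word covers four vector registers.

```java
public class RegMaskSketch {
    // One regmask bit stands for one 32-bit register slot.
    static final int SLOT_BITS = 32;
    // Slots defined per vector register in this patch (Vx, Vx_H, ... Vx_O).
    static final int SLOTS_PER_VEC_A = 8;

    // Slots a fixed-size vector of the given width occupies.
    static int slotsFor(int vectorBits) {
        return vectorBits / SLOT_BITS;
    }

    // Build one 32-bit mask word in which each register contributes
    // its lowest 'usedSlots' bits out of SLOTS_PER_VEC_A.
    static int maskWord(int usedSlots) {
        int perRegister = (1 << usedSlots) - 1;
        int word = 0;
        for (int reg = 0; reg < 32 / SLOTS_PER_VEC_A; reg++) {
            word |= perRegister << (reg * SLOTS_PER_VEC_A);
        }
        return word;
    }

    public static void main(String[] args) {
        // VecD is 64 bits (2 slots), VecX is 128 bits (4 slots),
        // VecA always claims all 8 slots of the register.
        System.out.println(String.format("VecD: 0x%x", maskWord(slotsFor(64))));
        System.out.println(String.format("VecX: 0x%x", maskWord(slotsFor(128))));
        System.out.println(String.format("VecA: 0x%x", maskWord(SLOTS_PER_VEC_A)));
    }
}
```

Under these assumptions the three words come out as 0x3030303,
0xf0f0f0f and 0xffffffff, matching the generated masks quoted above.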

For physical register allocation, the register allocator does not need
to know the real VecA register size. But for spill/unspill, the current
register allocation needs to know the actual stack slot size to
store/load VecA registers. SVE is able to do vector-size-agnostic
spilling, but to minimize the code changes, as I mentioned before, we
just let the RA know the actual vector register size of the running
machine, by calling scalable_vector_reg_size().
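
To make the difference between regmask slots and stack slots concrete,
here is a small Java sketch of the arithmetic only. The names are
hypothetical, and the detectedVectorBits parameter merely stands in for
whatever scalable_vector_reg_size() reports; its real signature and
units are not shown in this mail.

```java
public class SpillSlotSketch {
    // One stack slot is 32 bits, the same unit as a regmask slot.
    static final int SLOT_BITS = 32;
    // Regmask slots reserved per VecA register, whatever the hardware size.
    static final int SLOTS_PER_VEC_A = 8;

    // Stack slots needed to spill one VecA register for the vector
    // length detected on the running machine.
    static int spillSlots(int detectedVectorBits) {
        if (detectedVectorBits < 128 || detectedVectorBits > 2048
                || detectedVectorBits % 128 != 0) {
            throw new IllegalArgumentException(
                "not a legal SVE vector length: " + detectedVectorBits);
        }
        return detectedVectorBits / SLOT_BITS;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512, 1024, 2048}) {
            System.out.println(bits + "-bit SVE: " + SLOTS_PER_VEC_A
                + " regmask slots, " + spillSlots(bits) + " stack slots");
        }
    }
}
```

The regmask side stays at 8 slots for every hardware size, while the
stack side grows with the real vector length, up to 64 slots for 2048
bits.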

In the meantime, some vector operations do not have unpredicated SVE1
instructions, but only predicated versions, e.g. vector multiply and
vector load/store. So we have also defined predicate registers in this
patch, and the c2 register allocator will allocate a temp predicate
register to fulfill the expected unpredicated operations. This can also
be used by a future predicate-driven vectorizer. It is not efficient for
now, as we can see many ptrue instructions in the generated code. One
possible solution I can see is to block one predicate register and
preset it to all true. But preserving/reinitializing a caller-save
register value across calls seems too risky to do in this patch, so I
decided to defer it to further optimization work. If anyone has any
suggestions on this, I would appreciate it.

SVE feature detection
=====================
Since we may have some compiled code based on the initially detected
SVE vector register length, and that code is compiled only for that
vector register length, we assume that the SVE vector register length
will not change during the JVM lifetime. However, the SVE vector length
is per-thread and can be changed by a system call [3], so we need to
make sure that no JNI call changes the SVE vector length.

Currently, we verify the SVE vector register length on each JNI return,
and if an SVE vector length change is detected, the JVM simply reports
an error and stops running. The vector length the VM runs with can also
be set by the existing VM option MaxVectorSize with c2 enabled. If the
specified MaxVectorSize differs from the system default SVE vector
length (in /proc/sys/abi/sve_default_vector_length), the JVM will set
the current process's SVE vector length to the specified vector length.
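
The policy described above can be sketched in Java as plain decision
logic. This only models the behaviour; the real checks are in the VM's
native code. All names are hypothetical, lengths are assumed to be in
bytes, and 0 is an assumed sentinel meaning that MaxVectorSize was not
specified.

```java
public class SveLengthCheckSketch {
    // A legal SVE length is 16..256 bytes (128..2048 bits), in steps of 16.
    static boolean isLegalSveLength(int bytes) {
        return bytes >= 16 && bytes <= 256 && bytes % 16 == 0;
    }

    // Pick the length the VM runs with: the requested MaxVectorSize if
    // one was given (0 means "not specified"), else the system default.
    static int chooseVectorLength(int requestedMaxVectorSize, int systemDefault) {
        int chosen = requestedMaxVectorSize == 0
            ? systemDefault : requestedMaxVectorSize;
        if (!isLegalSveLength(chosen)) {
            throw new IllegalArgumentException("illegal SVE length: " + chosen);
        }
        return chosen;
    }

    // Model of the check on each JNI return: any change is fatal,
    // because the compiled code was generated for the initial length.
    static void verifyOnJniReturn(int initialLength, int currentLength) {
        if (initialLength != currentLength) {
            throw new IllegalStateException("SVE vector length changed from "
                + initialLength + " to " + currentLength + " bytes");
        }
    }

    public static void main(String[] args) {
        int initial = chooseVectorLength(64, 32);  // option overrides default
        System.out.println("running with " + initial + " bytes");
        verifyOnJniReturn(initial, 64);            // unchanged: passes
        try {
            verifyOnJniReturn(initial, 32);        // native code changed it
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```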

Compiled code
=============
We have added all current c2 backend codegen on par with NEON, but only
for vector lengths larger than 128 bits.

In a 1024-bit SVE environment, for the following simple loop with int
array element type:

   for (int i = 0; i < LENGTH; i++) {
     c[i] = a[i] + b[i];
   }

c2 generated loop:

   0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
   0x0000ffff811c0824:   add     x13, x18, x11
   0x0000ffff811c0828:   add     x14, x1, x11
   0x0000ffff811c082c:   add     x13, x13, #0x10
   0x0000ffff811c0830:   add     x14, x14, #0x10
   0x0000ffff811c0834:   add     x11, x0, x11
   0x0000ffff811c0838:   add     x11, x11, #0x10
   0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
   0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
   0x0000ffff811c0844:   ptrue   p0.s
   0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
   0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
   0x0000ffff811c0850:   ptrue   p1.s
   0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
   0x0000ffff811c0858:   add     w10, w10, #0x20
   0x0000ffff811c085c:   cmp     w10, w12
   0x0000ffff811c0860:   b.lt    0x0000ffff811c0820

Test
====
Currently, we don't have real hardware to verify SVE features (and
performance), but we have run jtreg tests with SVE in some emulators. On
the QEMU system emulator, which has SVE emulation support, jtreg tier1-3
passed with different vector sizes. We've also run the full jtreg tests
without SVE on both x86 and AArch64, to make sure that there's no
regression.

The patch has also been applied to the Vector API code base and verified
on an emulator. The Vector API has more vector-related tests and is more
likely to generate vector instructions, through intrinsification.

A simple test can also run in QEMU user emulation, e.g.

$ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD

(To run it in user emulation mode, we will need to bypass the SVE
feature detection code in this patch, e.g. by applying:
http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch)
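
The SIMD class run by the command above is not included in this mail. A
minimal stand-in could look like the sketch below: the class body, the
array length and the iteration count are all assumptions, chosen only
so that the add loop from the "Compiled code" section gets hot enough
for c2 to compile it.

```java
public class SIMD {
    static final int LENGTH = 1024;

    // Same shape as the loop shown in the "Compiled code" section.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < LENGTH; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[LENGTH];
        int[] b = new int[LENGTH];
        int[] c = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        // Call the method repeatedly so that the JIT compiles it.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        // Check the result against the scalar expectation.
        boolean ok = true;
        for (int i = 0; i < LENGTH; i++) {
            if (c[i] != 3 * i) {
                ok = false;
            }
        }
        System.out.println(ok ? "PASS" : "FAIL");
    }
}
```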

Others
======
Since this patch is a bit large, I've also split it into 3 parts, for
easier review:

1) SVE feature detection
http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature

2) c2 register allocation
http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra

3) SVE c2 backend
http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2

Parts of this patch were contributed by Joshua Zhu and Yang Zhang.

Refs
====
[1] https://developer.arm.com/docs/ddi0584/latest
[2] https://developer.arm.com/docs/ddi0602/latest
[3] https://www.kernel.org/doc/Documentation/arm64/sve.txt

Thanks,
Ningsheng