[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Tue Jul 21 06:05:48 UTC 2020

[Ping]

Could anyone please help review this patch, especially the c2
register allocation part?

JBS: https://bugs.openjdk.java.net/browse/JDK-8231441

The latest webrev:
http://cr.openjdk.java.net/~njian/8231441/webrev.02

In the latest webrev, we block one predicate register (p7) with all 
elements preset to TRUE, so that c2 compiled code can use it freely to 
generate instructions for unpredicated operations.

And the split parts:

1) SVE feature detection:
http://cr.openjdk.java.net/~njian/8231441/webrev.02-feature

2) c2 register allocation:
http://cr.openjdk.java.net/~njian/8231441/webrev.02-ra

3) SVE c2 backend:
http://cr.openjdk.java.net/~njian/8231441/webrev.02-c2

The initial RFR which has some descriptions of the patch:
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-March/037628.html

The description can also be found at:
http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt

Notes to verify the patch on QEMU user emulation, with an example of 
compiled code:
http://cr.openjdk.java.net/~njian/8231441/running-sve-in-qemu-user.txt

Thanks,
Ningsheng

On 5/27/20 3:23 PM, Ningsheng Jian wrote:
> Hi,
> 
> I have rebased this patch and added some more comments. I have also
> relaxed the instruction matching conditions for 128-bit vectors.
> 
> I would appreciate it if someone could help review this.
> 
> Whole patch:
> http://cr.openjdk.java.net/~njian/8231441/webrev.01
> 
> Different parts of changes:
> 
> 1) SVE feature detection
> http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
> 
> 2) c2 register allocation
> http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
> 
> 3) SVE c2 backend
> http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
> 
> (Or should I split this into separate JBS issues?)
> 
> Thanks,
> Ningsheng
> 
> On 3/25/20 2:37 PM, Ningsheng Jian wrote:
>> Hi,
>>
>> Could you please help to review this patch adding AArch64 SVE support?
>> It also touches c2 compiler shared code.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
>> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
>>
>> Arm has released new vector ISA extensions for AArch64: SVE [1] and
>> SVE2 [2]. This patch adds initial SVE support to OpenJDK. In this
>> patch we have:
>>
>> 1) SVE feature enablement and detection
>> 2) SVE vector register allocation support with initial predicate
>> register definition
>> 3) SVE c2 backend for current SLP based vectorizer. (We also have a POC
>> patch of a new vectorizer using SVE predicate-driven loop control, but
>> that's still under development.)
>>
>> SVE register definition
>> =======================
>> Unlike other SIMD architectures, SVE allows hardware implementations to
>> choose a vector register length from 128 to 2048 bits, in multiples of
>> 128 bits. So we introduce a new vector type VectorA, i.e. a
>> length-agnostic (scalable) vector type, and Op_VecA for the machine
>> vectora register. At the same time, to minimize register allocation code
>> changes, we take advantage of one aspect of JIT compilation: at compile
>> time we actually know the real hardware SVE vector register size of the
>> machine we are running on. So the register allocator knows how many
>> register slots an Op_VecA ideal reg requires, and can work without much
>> modification.
>>
>> Since the bottom 128 bits are shared with NEON, we extend the current
>> register mask definition of the V0-V31 registers. Currently, c2 uses one
>> mask bit per 32-bit register slot, so to define at most 2048 bits we
>> would need 64 slots per register in the AD file. That's a really large
>> number, and it would also break the current regmask assumptions. Since
>> the SVE vector register is architecturally scalable to different sizes,
>> we just define twice as many slots as the original NEON vector register
>> had, i.e. 8 slots: Vx, Vx_H, Vx_J ... Vx_O. After adlc, the generated
>> register masks now look like:
>>
>> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
>> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
>>
>> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
>> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
>>
>> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
>> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
>>
>> We use SlotsPerVecA to indicate the number of regmask bits for a VecA
>> register.
>>
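
To make the slot arithmetic above concrete, here is a small Java sketch. It is only an illustration, not code from the patch: the class and method names are made up, and the assumption that each register's used slots occupy the low bits of its 8-bit group is my reading of the generated masks quoted above.

```java
// Illustrative only: VecSlots, slotsFor and maskWord are hypothetical names.
class VecSlots {
    // One regmask bit describes one 32-bit register slot.
    static final int BITS_PER_SLOT = 32;
    // The patch defines 8 mask slots per vector register (Vx, Vx_H, Vx_J ... Vx_O).
    static final int SLOTS_PER_VEC_A = 8;

    // Number of 32-bit slots needed to describe a register of the given width in full.
    static int slotsFor(int vectorBits) {
        return vectorBits / BITS_PER_SLOT;
    }

    // Build one 32-bit mask word covering 4 registers x 8 slots, with the
    // low `usedSlots` bits set in each register's 8-bit group (assumed layout).
    static int maskWord(int usedSlots) {
        int group = (1 << usedSlots) - 1;   // 2 -> 0x03, 4 -> 0x0f, 8 -> 0xff
        int word = 0;
        for (int reg = 0; reg < 32 / SLOTS_PER_VEC_A; reg++) {
            word |= group << (reg * SLOTS_PER_VEC_A);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(slotsFor(2048));                                  // 64 slots for a full 2048-bit register
        System.out.println(Integer.toHexString(maskWord(slotsFor(64))));     // VecD: 3030303
        System.out.println(Integer.toHexString(maskWord(slotsFor(128))));    // VecX: f0f0f0f
        System.out.println(Integer.toHexString(maskWord(SLOTS_PER_VEC_A)));  // VecA: ffffffff
    }
}
```

The three hex values match the repeated words in the _VECTORD_REG_mask, _VECTORX_REG_mask and _VECTORA_REG_mask definitions above.
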
>> For physical register allocation, the register allocator does not need
>> to know the real VecA register size. For spill/unspill, however, the
>> current register allocator needs to know the actual stack slot size to
>> store/load VecA registers. SVE is able to do vector-size-agnostic
>> spilling, but to minimize the code changes, as mentioned above, we
>> just let the RA know the actual vector register size of the machine we
>> are running on, by calling scalable_vector_reg_size().
>>
>> In addition, some vector operations have no unpredicated SVE1
>> instructions, only predicated versions, e.g. vector multiply and
>> vector load/store. So we have also defined predicate registers in this
>> patch, and the c2 register allocator will allocate a temp predicate
>> register to implement the expected unpredicated operations. This can
>> also be used by a future predicate-driven vectorizer. It is not
>> efficient for now, as we can see many ptrue instructions in the
>> generated code. One possible solution I can see is to block one
>> predicate register and preset it to all true. But preserving or
>> reinitializing a caller-saved register value across calls seems risky
>> to do in this patch, so I decided to defer it to further optimization
>> work. If anyone has any suggestions on this, I would appreciate them.
>>
>> SVE feature detection
>> =====================
>> Compiled code is generated only for the SVE vector register length
>> detected initially, so we assume that the SVE vector register length
>> will not change during the JVM lifetime. However, the SVE vector length
>> is per-thread and can be changed by a system call [3], so we need to
>> make sure that a JNI call does not change the SVE vector length.
>>
>> Currently, we verify the SVE vector register length on each JNI return,
>> and if an SVE vector length change is detected, the JVM simply reports
>> an error and stops running. The vector length the VM runs with can also
>> be set by the existing VM option MaxVectorSize with c2 enabled. If the
>> specified MaxVectorSize differs from the system default SVE vector
>> length (in /proc/sys/abi/sve_default_vector_length), the JVM will set
>> the current process SVE vector length to the specified vector length.
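
The decision logic described above can be sketched roughly as follows. This is a Java illustration only, not the HotSpot implementation (which lives in the VM, not in Java). All names here are hypothetical, and I am assuming the sysctl file holds the length in bytes, as the kernel document [3] describes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: every name in this class is hypothetical.
class SveVectorLength {
    // Architectural range: 128 to 2048 bits in 128-bit steps, expressed in bytes.
    static final int MIN_BYTES = 16;
    static final int MAX_BYTES = 256;

    static boolean isValid(int bytes) {
        return bytes >= MIN_BYTES && bytes <= MAX_BYTES && bytes % MIN_BYTES == 0;
    }

    // Parse the text form of a vector length, e.g. the contents of
    // /proc/sys/abi/sve_default_vector_length (assumed to be in bytes).
    static int parse(String text) {
        return Integer.parseInt(text.trim());
    }

    // Startup decision: override only if a valid, different length is requested.
    static boolean needsOverride(int requestedBytes, int systemDefaultBytes) {
        return isValid(requestedBytes) && requestedBytes != systemDefaultBytes;
    }

    // Check after a native call returns: any change is treated as fatal.
    static void verifyUnchanged(int initialBytes, int currentBytes) {
        if (initialBytes != currentBytes) {
            throw new IllegalStateException("SVE vector length changed from "
                    + initialBytes + " to " + currentBytes + " bytes");
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Paths.get("/proc/sys/abi/sve_default_vector_length");
        if (!Files.isReadable(p)) {
            System.out.println("no SVE sysctl on this system");
            return;
        }
        int systemDefault = parse(new String(Files.readAllBytes(p)));
        System.out.println("default=" + systemDefault + " bytes, override needed for 32: "
                + needsOverride(32, systemDefault));
    }
}
```
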
>>
>> Compiled code
>> =============
>> We have added all current c2 backend codegen on par with NEON, but only
>> for vector lengths larger than 128 bits.
>>
>> In a 1024-bit SVE environment, for the following simple loop with int
>> array element type:
>>
>>     for (int i = 0; i < LENGTH; i++) {
>>       c[i] = a[i] + b[i];
>>     }
>>
>> c2 generated loop:
>>
>>     0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
>>     0x0000ffff811c0824:   add     x13, x18, x11
>>     0x0000ffff811c0828:   add     x14, x1, x11
>>     0x0000ffff811c082c:   add     x13, x13, #0x10
>>     0x0000ffff811c0830:   add     x14, x14, #0x10
>>     0x0000ffff811c0834:   add     x11, x0, x11
>>     0x0000ffff811c0838:   add     x11, x11, #0x10
>>     0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
>>     0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
>>     0x0000ffff811c0844:   ptrue   p0.s
>>     0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
>>     0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
>>     0x0000ffff811c0850:   ptrue   p1.s
>>     0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
>>     0x0000ffff811c0858:   add     w10, w10, #0x20
>>     0x0000ffff811c085c:   cmp     w10, w12
>>     0x0000ffff811c0860:   b.lt    0x0000ffff811c0820
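
The loop stride in the listing above follows from the vector width: one 1024-bit vector holds 32 int elements, which is the #0x20 added to w10 on each iteration. A small Java sketch of that arithmetic (the class and method names are made up for illustration):

```java
// Illustrative only: SveLanes and its methods are hypothetical names.
class SveLanes {
    // Number of elements that fit in one vector register.
    static int lanesPerVector(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Number of full-vector iterations for an array of the given length.
    static int vectorIterations(int length, int lanes) {
        return length / lanes;
    }

    // Elements left over for scalar (or predicated) handling after the vector loop.
    static int tailElements(int length, int lanes) {
        return length % lanes;
    }

    public static void main(String[] args) {
        int lanes = lanesPerVector(1024, Integer.SIZE);
        System.out.println("0x" + Integer.toHexString(lanes));   // 0x20, the stride in "add w10, w10, #0x20"
        System.out.println(vectorIterations(1000, lanes));       // 31 full vector iterations
        System.out.println(tailElements(1000, lanes));           // 8 elements left over
    }
}
```
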
>>
>> Test
>> ====
>> Currently, we don't have real hardware to verify SVE features (and
>> performance), but we have run jtreg tests with SVE in some emulators. On
>> the QEMU system emulator, which has SVE emulation support, jtreg tier1-3
>> passed with different vector sizes. We've also run the full jtreg tests
>> without SVE on both x86 and AArch64, to make sure that there's no
>> regression.
>>
>> The patch has also been applied to the Vector API code base and verified
>> on an emulator. The Vector API has more vector-related tests and is more
>> likely to generate vector instructions through intrinsification.
>>
>> A simple test can also be run in QEMU user emulation, e.g.
>>
>> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
>>
>> (
>> To run it in user emulation mode, we need to bypass the SVE feature
>> detection code in this patch, e.g. by applying:
>> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch
>> )
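
The source of the SIMD test class is not included in this mail. A minimal stand-in, assuming it is just the int array addition loop shown under "Compiled code", could look like the sketch below. The array length, the number of rounds and the checksum are my own choices, picked so that the add method gets hot enough to be compiled by c2.

```java
// Hypothetical stand-in for the SIMD test class; not the original test source.
class SIMD {
    static final int LENGTH = 1024;

    // The loop the SLP vectorizer is expected to turn into vector instructions.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < LENGTH; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Run the loop repeatedly so the JIT compiles it, then return a checksum.
    static long run(int rounds) {
        int[] a = new int[LENGTH];
        int[] b = new int[LENGTH];
        int[] c = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        for (int r = 0; r < rounds; r++) {
            add(a, b, c);
        }
        long sum = 0;
        for (int v : c) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Expected checksum: 3 * (0 + 1 + ... + 1023) = 1571328
        System.out.println(run(20_000));
    }
}
```

The same checksum should be printed with and without SVE enabled, which gives a quick sanity check of the generated code.
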
>>
>> Others
>> ======
>> Since this patch is a bit large, I've also split it into 3 parts for
>> easier review:
>>
>> 1) SVE feature detection
>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
>>
>> 2) c2 register allocation
>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
>>
>> 3) SVE c2 backend
>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
>>
>> Parts of this patch were contributed by Joshua Zhu and Yang Zhang.
>>
>> Refs
>> ====
>> [1] https://developer.arm.com/docs/ddi0584/latest
>> [2] https://developer.arm.com/docs/ddi0602/latest
>> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
>>
>> Thanks,
>> Ningsheng
>>
>