[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Fri Jul 31 01:41:45 UTC 2020

Hi Andrew,

Thanks a lot!!

FYI, the latest patch:

http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-July/039289.html

And some descriptions:

http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt

Thanks,
Ningsheng

On 7/30/20 7:26 PM, Andrew Dinn wrote:
> Hi Ningsheng,
> 
> I will start to review this either later today or (more likely)
> tomorrow. It will probably take some time to work through it all. I will
> work from the updated patch posted by PengFei.
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
> Red Hat Distinguished Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill
> 
> On 21/07/2020 07:05, Ningsheng Jian wrote:
>> [Ping]
>>
>> Could anyone please help to review this patch, especially for the c2
>> register allocation part?
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8231441
>>
>> The latest webrev:
>> http://cr.openjdk.java.net/~njian/8231441/webrev.02
>>
>> In the latest webrev, we block one predicate register (p7) with all
>> elements preset to TRUE, so that c2 compiled code can use it freely to
>> generate instructions for unpredicated operations.
>>
>> And the split parts:
>>
>> 1) SVE feature detection:
>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-feature
>>
>> 2) c2 register allocation:
>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-ra
>>
>> 3) SVE c2 backend:
>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-c2
>>
>> The initial RFR which has some descriptions of the patch:
>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-March/037628.html
>>
>>
>> The description can also be found at:
>> http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt
>>
>> Notes to verify the patch on QEMU user emulation, with an example of
>> compiled code:
>> http://cr.openjdk.java.net/~njian/8231441/running-sve-in-qemu-user.txt
>>
>> Thanks,
>> Ningsheng
>>
>>
>> On 5/27/20 3:23 PM, Ningsheng Jian wrote:
>>> Hi,
>>>
>>> I have rebased this patch and added some more comments. I have also
>>> relaxed the instruction matching conditions for 128-bit vectors.
>>>
>>> I would appreciate it if someone could help to review this.
>>>
>>> Whole patch:
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01
>>>
>>> Different parts of changes:
>>>
>>> 1) SVE feature detection
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
>>>
>>> 2) c2 register allocation
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
>>>
>>> 3) SVE c2 backend
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
>>>
>>> (Or should I split this into different JBS?)
>>>
>>> Thanks,
>>> Ningsheng
>>>
>>> On 3/25/20 2:37 PM, Ningsheng Jian wrote:
>>>> Hi,
>>>>
>>>> Could you please help to review this patch adding AArch64 SVE support?
>>>> It also touches c2 compiler shared code.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
>>>> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
>>>>
>>>> Arm has released new vector ISA extensions for AArch64: SVE [1] and
>>>> SVE2 [2]. This patch adds the initial SVE support in OpenJDK. In this
>>>> patch we have:
>>>>
>>>> 1) SVE feature enablement and detection
>>>> 2) SVE vector register allocation support with initial predicate
>>>> register definition
>>>> 3) SVE c2 backend for current SLP based vectorizer. (We also have a POC
>>>> patch of a new vectorizer using SVE predicate-driven loop control, but
>>>> that's still under development.)
>>>>
>>>> SVE register definition
>>>> =======================
>>>> Unlike other SIMD architectures, SVE allows hardware implementations to
>>>> choose a vector register length between 128 and 2048 bits, in multiples
>>>> of 128 bits. So we introduce a new vector type VectorA, i.e. a length
>>>> agnostic (scalable) vector type, and Op_VecA for the machine vectora
>>>> register. Meanwhile, to minimize register allocation code changes, we
>>>> take advantage of one aspect of JIT compilation: at compile time we
>>>> actually know the real hardware SVE vector register size of the machine
>>>> we are running on. So the register allocator knows how many register
>>>> slots an Op_VecA ideal reg requires, and works without much
>>>> modification.
>>>>
>>>> Since the bottom 128 bits are shared with NEON, we extend the current
>>>> register mask definition of the V0-V31 registers. Currently, c2 uses one
>>>> mask bit per 32-bit register slot, so to define at most 2048 bits we
>>>> would need 64 slots per register in the AD file. That's a really large
>>>> number, and would also break the current regmask assumptions. Since the
>>>> SVE vector register is architecturally scalable across sizes, we just
>>>> define double the original NEON vector register slots, i.e. 8 slots: Vx,
>>>> Vx_H, Vx_J ... Vx_O. After adlc, the generated register masks now look
>>>> like:
>>>>
>>>> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
>>>> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
>>>>
>>>> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
>>>> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
>>>>
>>>> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
>>>> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
>>>>
>>>> We use SlotsPerVecA to indicate the regmask bit size of a VecA
>>>> register.
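
A minimal sketch of the slot arithmetic behind the masks quoted above. The class and method names are hypothetical and are not part of the patch. It assumes one regmask bit per 32-bit slot, 8 slots defined per vector register, and that the slots of each register occupy consecutive bits, which is how the generated masks read:

```java
// Hypothetical illustration only; none of these names exist in HotSpot.
final class VecSlots {
    static final int BITS_PER_SLOT = 32;   // c2 uses one mask bit per 32-bit slot
    static final int SLOTS_PER_REG = 8;    // Vx, Vx_H, Vx_J ... Vx_O

    // Number of 32-bit slots needed to cover a vector of the given width.
    static int slotsFor(int vectorBits) {
        if (vectorBits < 128 || vectorBits > 2048 || vectorBits % 128 != 0) {
            throw new IllegalArgumentException("not an SVE vector length: " + vectorBits);
        }
        return vectorBits / BITS_PER_SLOT;
    }

    // One 32-bit regmask word covers 4 registers of 8 slots each.
    // slotsUsed is how many of the 8 slots a register class claims.
    static int maskWord(int slotsUsed) {
        if (slotsUsed < 1 || slotsUsed > SLOTS_PER_REG) {
            throw new IllegalArgumentException("slotsUsed must be 1.." + SLOTS_PER_REG);
        }
        int perRegister = (1 << slotsUsed) - 1;   // low bits of one register's byte
        return perRegister * 0x01010101;          // repeat the byte 4 times
    }

    public static void main(String[] args) {
        System.out.println(slotsFor(128));    // NEON-sized vector: 4 slots
        System.out.println(slotsFor(2048));   // architectural maximum: 64 slots
        System.out.printf("VecD %08x%n", maskWord(2));   // 2 slots per register
        System.out.printf("VecX %08x%n", maskWord(4));   // 4 slots per register
        System.out.printf("VecA %08x%n", maskWord(8));   // all 8 defined slots
    }
}
```

The three mask words correspond to 0x3030303, 0xf0f0f0f and 0xffffffff in the adlc output: VecD claims 2 of the 8 slots, VecX claims 4, and VecA claims all 8 regardless of the real hardware vector length.
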
>>>>
>>>> Although the register allocator does not need to know the real VecA
>>>> register size for physical register allocation, it does need to know
>>>> the actual stack slot size to store/load VecA registers when doing
>>>> spill/unspill. SVE is able to do vector size agnostic spilling, but to
>>>> minimize the code changes, as I mentioned before, we just let RA know
>>>> the actual vector register size of the current running machine, by
>>>> calling scalable_vector_reg_size().
>>>>
>>>> Meanwhile, some vector operations, e.g. vector multiply and vector
>>>> load/store, have no unpredicated SVE1 instructions, only predicated
>>>> versions. So we have also defined predicate registers in this patch, and
>>>> the c2 register allocator will allocate a temp predicate register to
>>>> implement the expected unpredicated operations. This can also be used
>>>> by a future predicate-driven vectorizer. It is not efficient for now, as
>>>> we can see many ptrue instructions in the generated code. One possible
>>>> solution I can see is to block one predicate register and preset it to
>>>> all true. But preserving/reinitializing a caller-saved register value
>>>> across calls seems too risky for this patch, so I decided to defer it
>>>> to further optimization work. If anyone has any suggestions on this, I
>>>> would appreciate them.
>>>>
>>>> SVE feature detection
>>>> =====================
>>>> Since some code may already be compiled based on the initially detected
>>>> SVE vector register length, and that code is valid only for that vector
>>>> register length, we assume that the SVE vector register length will not
>>>> change during the JVM lifetime. However, the SVE vector length is
>>>> per-thread and can be changed by a system call [3], so we need to make
>>>> sure that no JNI call changes the SVE vector length.
>>>>
>>>> Currently, we verify the SVE vector register length on each JNI return,
>>>> and if an SVE vector length change is detected, the JVM simply reports
>>>> an error and stops running. The vector length the VM runs with can also
>>>> be set by the existing VM option MaxVectorSize with c2 enabled. If the
>>>> specified MaxVectorSize differs from the system default SVE vector
>>>> length (in /proc/sys/abi/sve_default_vector_length), the JVM will set
>>>> the current process SVE vector length to the specified vector length.
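
A rough sketch of the MaxVectorSize comparison described above, written as a standalone Java program rather than VM code. The class and method names are hypothetical. It assumes the /proc file holds the default vector length as a decimal number of bytes, and that MaxVectorSize is given in bytes as well. The real check lives inside the VM and is not shown here:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; this is not how the patch is implemented.
final class SveLengthCheck {
    static final Path DEFAULT_VL =
        Paths.get("/proc/sys/abi/sve_default_vector_length");

    // Parse the text form of a vector length (assumed to be in bytes).
    static int parseVectorLength(String text) {
        return Integer.parseInt(text.trim());
    }

    // True when the requested size differs from the system default, i.e.
    // the case in which the JVM would set the process vector length.
    static boolean needsChange(int maxVectorSize, int systemDefault) {
        return maxVectorSize != systemDefault;
    }

    public static void main(String[] args) throws IOException {
        if (!Files.isReadable(DEFAULT_VL)) {
            System.out.println("No SVE default vector length exposed on this system");
            return;
        }
        String text = new String(Files.readAllBytes(DEFAULT_VL),
                                 StandardCharsets.US_ASCII);
        int systemDefault = parseVectorLength(text);
        // First argument stands in for -XX:MaxVectorSize=<bytes>.
        int requested = args.length > 0 ? Integer.parseInt(args[0]) : systemDefault;
        System.out.println("system default = " + systemDefault
            + " bytes, requested = " + requested
            + " bytes, change needed = " + needsChange(requested, systemDefault));
    }
}
```
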
>>>>
>>>> Compiled code
>>>> =============
>>>> We have added all current c2 backend codegen on par with NEON, but only
>>>> for vector lengths larger than 128 bits.
>>>>
>>>> In a 1024-bit SVE environment, for the following simple loop with int
>>>> array element type:
>>>>
>>>>      for (int i = 0; i < LENGTH; i++) {
>>>>        c[i] = a[i] + b[i];
>>>>      }
>>>>
>>>> c2 generated loop:
>>>>
>>>>      0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
>>>>      0x0000ffff811c0824:   add     x13, x18, x11
>>>>      0x0000ffff811c0828:   add     x14, x1, x11
>>>>      0x0000ffff811c082c:   add     x13, x13, #0x10
>>>>      0x0000ffff811c0830:   add     x14, x14, #0x10
>>>>      0x0000ffff811c0834:   add     x11, x0, x11
>>>>      0x0000ffff811c0838:   add     x11, x11, #0x10
>>>>      0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
>>>>      0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
>>>>      0x0000ffff811c0844:   ptrue   p0.s
>>>>      0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
>>>>      0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
>>>>      0x0000ffff811c0850:   ptrue   p1.s
>>>>      0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
>>>>      0x0000ffff811c0858:   add     w10, w10, #0x20
>>>>      0x0000ffff811c085c:   cmp     w10, w12
>>>>      0x0000ffff811c0860:   b.lt    0x0000ffff811c0820
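
The loop counter step in the listing above (add w10, w10, #0x20) follows from the lane count: a 1024-bit vector holds 32 int elements. A small sketch of that arithmetic, with hypothetical names:

```java
// Hypothetical illustration only.
final class LaneCount {
    // Number of elements of the given width that fit in one vector register.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        int step = lanes(1024, Integer.SIZE);   // int elements are 32 bits wide
        // Prints the per-iteration increment of i in the vectorized loop.
        System.out.println("0x" + Integer.toHexString(step));
    }
}
```
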
>>>>
>>>> Test
>>>> ====
>>>> Currently, we don't have real hardware to verify SVE features (and
>>>> performance). But we have run jtreg tests with SVE in some emulators. On
>>>> the QEMU system emulator, which has SVE emulation support, jtreg tier1-3
>>>> passed with different vector sizes. We've also run the full jtreg tests
>>>> without SVE on both x86 and AArch64, to make sure that there's no
>>>> regression.
>>>>
>>>> The patch has also been applied to the Vector API code base and verified
>>>> on an emulator. The Vector API has more vector related tests and is more
>>>> likely to generate vector instructions, through intrinsification.
>>>>
>>>> A simple test can also run in QEMU user emulation, e.g.
>>>>
>>>> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
>>>>
>>>> (To run it in user emulation mode, we will need to bypass the SVE
>>>> feature detection code in this patch. E.g. apply:
>>>> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch)
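
The source of the SIMD class used on the command line above is not included in this mail. A hypothetical stand-in, built around the loop from the "Compiled code" section, could look like the sketch below. The array length, the round count and the self-check are assumptions chosen only so that c2 has a chance to compile the loop:

```java
// Hypothetical test class; the actual SIMD class is not shown in this thread.
final class SIMD {
    static final int LENGTH = 1024;

    // The simple int loop that the SLP vectorizer is expected to pick up.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < LENGTH; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Run the loop often enough that it is likely to be JIT compiled.
    static int[] run(int rounds) {
        int[] a = new int[LENGTH];
        int[] b = new int[LENGTH];
        int[] c = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        for (int r = 0; r < rounds; r++) {
            add(a, b, c);
        }
        return c;
    }

    public static void main(String[] args) {
        int[] c = run(20_000);
        for (int i = 0; i < LENGTH; i++) {
            if (c[i] != 3 * i) {
                throw new AssertionError("wrong result at index " + i);
            }
        }
        System.out.println("OK");
    }
}
```

After javac SIMD.java, the class can be started with the qemu-aarch64 command line shown above.
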
>>>>
>>>> Others
>>>> ======
>>>> Since this patch is a bit large, I've also split it into 3 parts, for
>>>> easier review:
>>>>
>>>> 1) SVE feature detection
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
>>>>
>>>> 2) c2 register allocation
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
>>>>
>>>> 3) SVE c2 backend
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
>>>>
>>>> Part of this patch has been contributed by Joshua Zhu and Yang Zhang.
>>>>
>>>> Refs
>>>> ====
>>>> [1] https://developer.arm.com/docs/ddi0584/latest
>>>> [2] https://developer.arm.com/docs/ddi0602/latest
>>>> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
>>>>
>>>> Thanks,
>>>> Ningsheng
>>>>
>>>
>>
>