[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Thu Jul 30 11:26:42 UTC 2020

Hi Ningsheng,

I will start to review this either later today or (more likely)
tomorrow. It will probably take some time to work through it all. I will
work from the updated patch posted by PengFei.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

On 21/07/2020 07:05, Ningsheng Jian wrote:
> [Ping]
> 
> Could anyone please help to review this patch, especially for the c2
> register allocation part?
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8231441
> 
> The latest webrev:
> http://cr.openjdk.java.net/~njian/8231441/webrev.02
> 
> In the latest webrev, we block one predicate register (p7) with all
> elements preset to TRUE, so that c2 compiled code can use it freely to
> generate instructions for unpredicated operations.
> 
> And the split parts:
> 
> 1) SVE feature detection:
> http://cr.openjdk.java.net/~njian/8231441/webrev.02-feature
> 
> 2) c2 register allocation:
> http://cr.openjdk.java.net/~njian/8231441/webrev.02-ra
> 
> 3) SVE c2 backend:
> http://cr.openjdk.java.net/~njian/8231441/webrev.02-c2
> 
> The initial RFR which has some descriptions of the patch:
> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-March/037628.html
> 
> 
> The description can also be found at:
> http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt
> 
> Notes to verify the patch on QEMU user emulation, with an example of
> compiled code:
> http://cr.openjdk.java.net/~njian/8231441/running-sve-in-qemu-user.txt
> 
> Thanks,
> Ningsheng
> 
> 
> On 5/27/20 3:23 PM, Ningsheng Jian wrote:
>> Hi,
>>
>> I have rebased this patch, added some more comments, and relaxed the
>> instruction matching conditions for 128-bit vectors.
>>
>> I would appreciate it if someone could help to review this.
>>
>> Whole patch:
>> http://cr.openjdk.java.net/~njian/8231441/webrev.01
>>
>> Different parts of changes:
>>
>> 1) SVE feature detection
>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
>>
>> 2) c2 register allocation
>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
>>
>> 3) SVE c2 backend
>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
>>
>> (Or should I split this into separate JBS issues?)
>>
>> Thanks,
>> Ningsheng
>>
>> On 3/25/20 2:37 PM, Ningsheng Jian wrote:
>>> Hi,
>>>
>>> Could you please help to review this patch adding AArch64 SVE support?
>>> It also touches c2 compiler shared code.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
>>> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
>>>
>>> Arm has released new vector ISA extensions for AArch64: SVE [1] and
>>> SVE2 [2]. This patch adds the initial SVE support in OpenJDK. In this
>>> patch we have:
>>>
>>> 1) SVE feature enablement and detection
>>> 2) SVE vector register allocation support with initial predicate
>>> register definition
>>> 3) SVE c2 backend for the current SLP-based vectorizer. (We also have a
>>> POC patch of a new vectorizer using SVE predicate-driven loop control,
>>> but that's still under development.)
>>>
>>> SVE register definition
>>> =======================
>>> Unlike other SIMD architectures, SVE allows hardware implementations to
>>> choose a vector register length between 128 and 2048 bits, in multiples
>>> of 128 bits. So we introduce a new vector type VectorA, i.e. a length
>>> agnostic (scalable) vector type, and Op_VecA for machine vectora
>>> register. At the same time, to minimize register allocation code
>>> changes, we take advantage of one aspect of JIT compilation: at compile
>>> time we actually know the real hardware SVE vector register size of the
>>> machine we are running on. So the register allocator knows how many
>>> register slots an Op_VecA ideal reg requires, and works without much
>>> modification.
>>>
>>> Since the bottom 128 bits are shared with NEON, we extend the current
>>> register mask definition of the V0-V31 registers. Currently, c2 uses one
>>> mask bit per 32-bit register slot, so covering the maximum 2048 bits
>>> would require 64 slots per register in the AD file. That's a really
>>> large number, and would also break the current regmask assumption.
>>> Considering that the SVE vector register is architecturally scalable to
>>> different sizes, we simply define twice as many slots as the original
>>> NEON vector register, i.e. 8 slots: Vx, Vx_H, Vx_J ... Vx_O. After
>>> adlc, the generated register masks now look like:
>>>
>>> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
>>> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
>>>
>>> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
>>> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
>>>
>>> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
>>> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
>>>
>>> We use SlotsPerVecA to indicate the number of regmask bits for a VecA
>>> register.
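
To make the mask values above concrete, here is a small sketch. It is not
code from the patch; the names kSlotsPerVecReg and mask_word are made up
for illustration. It assumes each V register owns 8 consecutive mask bits,
with VecD using the low 2, VecX the low 4 and VecA all 8, so that one
32-bit mask word covers 4 registers:

```cpp
#include <cassert>
#include <cstdint>

// Assumption for illustration: 8 mask bits (slots) per V register,
// i.e. Vx, Vx_H, Vx_J ... Vx_O, as described in the mail.
constexpr int kSlotsPerVecReg = 8;
constexpr int kBitsPerMaskWord = 32;

// Build one 32-bit mask word in which every register contributes its
// lowest `slots_used` slots (2 for VecD, 4 for VecX, 8 for VecA).
inline std::uint32_t mask_word(int slots_used) {
  assert(slots_used >= 1 && slots_used <= kSlotsPerVecReg);
  const std::uint32_t low_bits = (1u << slots_used) - 1u;
  std::uint32_t word = 0;
  for (int reg = 0; reg < kBitsPerMaskWord / kSlotsPerVecReg; ++reg) {
    word |= low_bits << (reg * kSlotsPerVecReg);
  }
  return word;
}
```

With these assumptions, mask_word(2), mask_word(4) and mask_word(8)
reproduce the repeating words of the VECTORD, VECTORX and VECTORA masks
quoted above.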
>>>
>>> For physical register allocation, the register allocator does not need
>>> to know the real VecA register size. For spill/unspill, however, the
>>> current register allocator needs to know the actual stack slot size to
>>> store/load VecA registers. SVE is able to do vector size agnostic
>>> spilling, but to minimize the code changes, as mentioned above, we
>>> just let RA know the actual vector register size of the running
>>> machine, by calling scalable_vector_reg_size().
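
The slot arithmetic behind this can be sketched as follows. This only
illustrates the 32-bit slot granularity described above; the helper
slots_for_vector_bits is hypothetical and is not the patch's
scalable_vector_reg_size():

```cpp
#include <cassert>

// One c2 register/stack slot covers 32 bits (as stated in the mail).
constexpr int kBitsPerSlot = 32;

// Hypothetical helper: number of 32-bit slots needed to spill one
// vector register of the given length. SVE lengths are multiples of
// 128 bits in the range 128..2048.
inline int slots_for_vector_bits(int vector_bits) {
  assert(vector_bits >= 128 && vector_bits <= 2048);
  assert(vector_bits % 128 == 0);
  return vector_bits / kBitsPerSlot;
}
```

For the architectural maximum of 2048 bits this gives the 64 slots
mentioned earlier, while a 128-bit (NEON-sized) vector needs 4.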
>>>
>>> Some vector operations, e.g. vector multiply and vector load/store,
>>> have only predicated SVE1 instructions and no unpredicated form. So we
>>> have also defined predicate registers in this patch, and the c2
>>> register allocator will allocate a temp predicate register to fulfill
>>> the expected unpredicated operations. This can also be used by a
>>> future predicate-driven vectorizer. It is not efficient for now, as we
>>> can see many ptrue instructions in the generated code. One possible
>>> solution is to block one predicate register and preset it to all
>>> true. But preserving/reinitializing a caller-saved register value
>>> across calls seems too risky for this patch, so I have decided to
>>> defer it to further optimization work. If anyone has any suggestions
>>> on this, I would appreciate them.
>>>
>>> SVE feature detection
>>> =====================
>>> Since we may have compiled code that is based on the initially detected
>>> SVE vector register length and is valid only for that length, we assume
>>> that the SVE vector register length will not change during the JVM
>>> lifetime. However, the SVE vector length is per-thread and can be
>>> changed by a system call [3], so we need to make sure that no JNI call
>>> changes the SVE vector length.
>>>
>>> Currently, we verify the SVE vector register length on each JNI return,
>>> and if an SVE vector length change is detected, the JVM simply reports
>>> an error and stops running. The vector length the VM runs with can also
>>> be set by the existing VM option MaxVectorSize with c2 enabled. If the
>>> specified MaxVectorSize differs from the system default SVE vector
>>> length (in /proc/sys/abi/sve_default_vector_length), the JVM will set
>>> the current process's SVE vector length to the specified vector length.
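
As a rough sketch of what such a check could look like on Linux (this is
not the patch's implementation), the per-thread vector length can be read
with the prctl() interface documented in [3]. The function names below are
invented for illustration, and the fallback constant values are assumed to
match the kernel headers:

```cpp
#include <sys/prctl.h>

// Fallbacks for older userspace headers; values assumed to match
// the Linux uapi definitions described in [3].
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff
#endif

// Hypothetical helper: current thread's SVE vector length in bytes,
// or -1 if the kernel or CPU does not support SVE.
inline int current_sve_vector_bytes() {
  const int ret = prctl(PR_SVE_GET_VL);
  if (ret < 0) {
    return -1;
  }
  return ret & PR_SVE_VL_LEN_MASK;
}

// Hypothetical check for use after returning from native code:
// true if the length still matches the one recorded at startup.
inline bool sve_length_unchanged(int initial_bytes) {
  return current_sve_vector_bytes() == initial_bytes;
}
```

A VM would record current_sve_vector_bytes() once at startup and treat a
false result from sve_length_unchanged() as a fatal error, which is the
behaviour described above.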
>>>
>>> Compiled code
>>> =============
>>> We have added all current c2 backend codegen on par with NEON, but only
>>> for vector lengths larger than 128 bits.
>>>
>>> In a 1024-bit SVE environment, for the following simple loop with int
>>> array element type:
>>>
>>>     for (int i = 0; i < LENGTH; i++) {
>>>       c[i] = a[i] + b[i];
>>>     }
>>>
>>> c2 generated loop:
>>>
>>>     0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
>>>     0x0000ffff811c0824:   add     x13, x18, x11
>>>     0x0000ffff811c0828:   add     x14, x1, x11
>>>     0x0000ffff811c082c:   add     x13, x13, #0x10
>>>     0x0000ffff811c0830:   add     x14, x14, #0x10
>>>     0x0000ffff811c0834:   add     x11, x0, x11
>>>     0x0000ffff811c0838:   add     x11, x11, #0x10
>>>     0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
>>>     0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
>>>     0x0000ffff811c0844:   ptrue   p0.s
>>>     0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
>>>     0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
>>>     0x0000ffff811c0850:   ptrue   p1.s
>>>     0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
>>>     0x0000ffff811c0858:   add     w10, w10, #0x20
>>>     0x0000ffff811c085c:   cmp     w10, w12
>>>     0x0000ffff811c0860:   b.lt    0x0000ffff811c0820
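
The loop stride in the listing above (add w10, w10, #0x20) can be checked
with a little arithmetic. A minimal sketch, with a made-up helper name,
assuming the loop processes exactly one vector per iteration as the
listing suggests:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements that fit in one vector
// register of `vector_bits` bits.
inline int lanes_per_vector(int vector_bits, std::size_t element_bytes) {
  assert(element_bytes > 0);
  return vector_bits / 8 / static_cast<int>(element_bytes);
}
```

For 1024-bit vectors and 4-byte int elements this gives 32 lanes, i.e. the
0x20 added to the loop counter on each iteration.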
>>>
>>> Test
>>> ====
>>> Currently, we don't have real hardware to verify SVE features (and
>>> performance), but we have run jtreg tests with SVE in some emulators.
>>> On the QEMU system emulator, which has SVE emulation support, jtreg
>>> tier1-3 passed with different vector sizes. We have also run the full
>>> jtreg tests without SVE on both x86 and AArch64, to make sure that
>>> there is no regression.
>>>
>>> The patch has also been applied to the Vector API code base and verified
>>> on an emulator. The Vector API has more vector related tests and is more
>>> likely to generate vector instructions through intrinsification.
>>>
>>> A simple test can also run in QEMU user emulation, e.g.
>>>
>>> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
>>>
>>> (
>>> To run it in user emulation mode, we will need to bypass SVE feature
>>> detection code in this patch. E.g. apply:
>>> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch
>>> )
>>>
>>> Others
>>> ======
>>> Since this patch is a bit large, I've also split it into 3 parts, for
>>> easier review:
>>>
>>> 1) SVE feature detection
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
>>>
>>> 2) c2 register allocation
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
>>>
>>> 3) SVE c2 backend
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
>>>
>>> Part of this patch has been contributed by Joshua Zhu and Yang Zhang.
>>>
>>> Refs
>>> ====
>>> [1] https://developer.arm.com/docs/ddi0584/latest
>>> [2] https://developer.arm.com/docs/ddi0602/latest
>>> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
>>>
>>> Thanks,
>>> Ningsheng
>>>
>>
>