[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Thu Aug 20 12:29:27 UTC 2020

Hi Ningsheng,

> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-July/039289.html 

Impressive work, Ningsheng!

> http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt

"Since the bottom 128 bits are shared with the NEON, we extend current
register mask definition of V0-V31 registers. Currently, c2 uses one bit
mask for a 32-bit register slot, so to define at most 2048 bits we will
need to add 64 slots in AD file. That's a really large number, and will
also break current regmask assumption."
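
To make the slot arithmetic in that passage concrete, here is a minimal
sketch. It assumes only what the README states (one mask bit per 32-bit
register slot); the function names are mine for illustration and are not
HotSpot identifiers.

```cpp
#include <cassert>

// One RegMask bit describes one 32-bit register slot (per the README).
constexpr int kBitsPerSlot = 32;

// Slots needed to describe one vector register of the given width.
constexpr int slots_for_vector_bits(int vector_bits) {
  return vector_bits / kBitsPerSlot;
}

// Slots needed for a whole vector register file (V0-V31 means 32 registers).
constexpr int total_vector_slots(int num_regs, int vector_bits) {
  return num_regs * slots_for_vector_bits(vector_bits);
}

// Hypothetical helper that checks the figures quoted in the thread.
inline void check_slot_counts() {
  assert(slots_for_vector_bits(128) == 4);      // NEON-sized register
  assert(slots_for_vector_bits(2048) == 64);    // largest SVE register
  assert(total_vector_slots(32, 2048) == 2048); // V0-V31 fully described
}
```

So describing every bit of a 2048-bit register file would take 2048 mask
bits for vectors alone, versus 128 for plain NEON.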

Could you please elaborate on the last point? What RegMask assumptions 
are broken for 2048-bit vectors? I'm looking at [1] and trying to 
understand the motivation for the changes in shared code.

Compared to x86 w/ AVX512, architectural state for vector registers is 
4x larger in the worst case (ignoring predicate registers for now). Here 
are the relevant constants on x86:

gensrc/adfiles/adGlobals_x86.hpp:

// the number of reserved registers + machine registers.
#define REG_COUNT    545
...
// Size of register-mask in ints
#define RM_SIZE 22

My estimate is that for AArch64 with SVE support the constants will be:

   REG_COUNT < 2500
   RM_SIZE < 100

which don't look too bad.
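
For reference, a rough sketch of how I arrive at the RM_SIZE figure. The
helper below is hypothetical and is not the adlc code: it assumes one mask
bit per register slot, a few extra words (3 here) reserved for
stack-argument slots, and rounding up to an even number of words. I picked
those two details only because they reproduce the x86 value above.

```cpp
#include <cassert>

// Estimated register-mask size in 32-bit ints for `reg_count` slots.
// Assumptions (not taken from the patch): `extra_words` of headroom for
// stack-argument slots, and the result rounded up to an even word count.
constexpr int rm_size_for(int reg_count, int extra_words = 3) {
  const int words_for_regs = (reg_count + 31) / 32;  // one bit per slot
  return (words_for_regs + extra_words + 1) & ~1;    // round up to even
}

// Hypothetical helper that checks the estimate against the numbers above.
inline void check_rm_size_estimate() {
  assert(rm_size_for(545) == 22);    // matches adGlobals_x86.hpp
  assert(rm_size_for(2500) == 82);   // upper bound assumed for AArch64+SVE
  assert(rm_size_for(2500) < 100);
}
```

Under these assumptions the AArch64 mask would be a bit under 4x the x86
one, which is in line with the 4x difference in vector state.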

Also, I don't see any changes related to stack management, so I assume 
it continues to be managed in slots. Any problems there? As I 
understand it, wide SVE registers are caller-save, so there may be many 
spills of huge vectors around a call. (That is probably not possible 
with the C2 auto-vectorizer as it is now, but the Vector API will 
expose it.)
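
A back-of-the-envelope sketch of the worst case I have in mind. It assumes
a 4-byte stack slot and that all 32 vector registers are live across the
call; both are assumptions for illustration, and the names are not HotSpot
identifiers.

```cpp
#include <cassert>

constexpr int kStackSlotBytes = 4;  // assumption: one stack slot is 32 bits

// Bytes needed to spill `live_vector_regs` registers of `vector_bits` each.
constexpr int spill_bytes(int live_vector_regs, int vector_bits) {
  return live_vector_regs * (vector_bits / 8);
}

// The same spill area expressed in stack slots.
constexpr int spill_slots(int live_vector_regs, int vector_bits) {
  return spill_bytes(live_vector_regs, vector_bits) / kStackSlotBytes;
}

// Hypothetical helper comparing 2048-bit SVE with 512-bit AVX512.
inline void check_spill_estimate() {
  assert(spill_bytes(32, 2048) == 8192);  // 8 KB around a single call
  assert(spill_slots(32, 2048) == 2048);
  assert(spill_bytes(32, 512) == 2048);   // AVX512 worst case
  assert(spill_bytes(32, 2048) == 4 * spill_bytes(32, 512));
}
```

That is 8 KB of spill area per call site in the worst case, 4x the AVX512
figure.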

Have you noticed any performance problems? If that's the case, then 
AVX512 support on x86 would benefit from similar optimization as well.

FTR there was a similar exercise [2] on x86 to abstract away the exact 
sizes of vector registers, but it didn't have to worry about RA since 
all the operands were already available. Also, vectors of all different 
sizes may be used, which makes it hard to compare.

Best regards,
Vladimir Ivanov

[1] http://cr.openjdk.java.net/~njian/8231441/webrev.03-ra/

[2] https://bugs.openjdk.java.net/browse/JDK-8230015

> On 7/30/20 7:26 PM, Andrew Dinn wrote:
>> Hi Ningsheng,
>>
>> I will start to review this either later today or (more likely)
>> tomorrow. It will probably take some time to work through it all. I will
>> work from the updated patch posted by PengFei.
>>
>> regards,
>>
>>
>> Andrew Dinn
>> -----------
>> Red Hat Distinguished Engineer
>> Red Hat UK Ltd
>> Registered in England and Wales under Company Registration No. 03798903
>> Directors: Michael Cunningham, Michael ("Mike") O'Neill
>>
>> On 21/07/2020 07:05, Ningsheng Jian wrote:
>>> [Ping]
>>>
>>> Could anyone please help to review this patch, especially for the c2
>>> register allocation part?
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8231441
>>>
>>> The latest webrev:
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.02
>>>
>>> In the latest webrev, we block one predicate register (p7) with all
>>> elements preset to TRUE, so that c2 compiled code can use it freely to
>>> generate instructions for unpredicated operations.
>>>
>>> And the split parts:
>>>
>>> 1) SVE feature detection:
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-feature
>>>
>>> 2) c2 register allocation:
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-ra
>>>
>>> 3) SVE c2 backend:
>>> http://cr.openjdk.java.net/~njian/8231441/webrev.02-c2
>>>
>>> The initial RFR which has some descriptions of the patch:
>>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-March/037628.html 
>>>
>>> The description can also be found at:
>>> http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt
>>>
>>> Notes to verify the patch on QEMU user emulation, with an example of
>>> compiled code:
>>> http://cr.openjdk.java.net/~njian/8231441/running-sve-in-qemu-user.txt
>>>
>>> Thanks,
>>> Ningsheng
>>>
>>>
>>> On 5/27/20 3:23 PM, Ningsheng Jian wrote:
>>>> Hi,
>>>>
>>>> I have rebased this patch and added some more comments. I have also
>>>> relaxed the instruction matching conditions for 128-bit vectors.
>>>>
>>>> I would appreciate it if someone could help review this.
>>>>
>>>> Whole patch:
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01
>>>>
>>>> Different parts of changes:
>>>>
>>>> 1) SVE feature detection
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
>>>>
>>>> 2) c2 register allocation
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
>>>>
>>>> 3) SVE c2 backend
>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
>>>>
>>>> (Or should I split this into separate JBS issues?)
>>>>
>>>> Thanks,
>>>> Ningsheng
>>>>
>>>> On 3/25/20 2:37 PM, Ningsheng Jian wrote:
>>>>> Hi,
>>>>>
>>>>> Could you please help to review this patch adding AArch64 SVE support?
>>>>> It also touches c2 compiler shared code.
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
>>>>> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
>>>>>
>>>>> Arm has released new vector ISA extensions for AArch64: SVE [1] and
>>>>> SVE2 [2]. This patch adds the initial SVE support in OpenJDK. In this
>>>>> patch we have:
>>>>>
>>>>> 1) SVE feature enablement and detection
>>>>> 2) SVE vector register allocation support with initial predicate
>>>>> register definition
>>>>> 3) SVE c2 backend for the current SLP-based vectorizer. (We also have
>>>>> a POC patch of a new vectorizer using SVE predicate-driven loop
>>>>> control, but that's still under development.)
>>>>>
>>>>> SVE register definition
>>>>> =======================
>>>>> Unlike other SIMD architectures, SVE allows hardware implementations
>>>>> to choose a vector register length from 128 to 2048 bits, in
>>>>> multiples of 128 bits. So we introduce a new vector type VectorA,
>>>>> i.e. a length-agnostic (scalable) vector type, and Op_VecA for the
>>>>> machine vectora register. In the meantime, to minimize register
>>>>> allocation code changes, we also take advantage of one property of a
>>>>> JIT compiler: at compile time we actually know the real hardware SVE
>>>>> vector register size of the machine we are running on. So the
>>>>> register allocator actually knows how many register slots an Op_VecA
>>>>> ideal reg requires, and can work fine without much modification.
>>>>>
>>>>> Since the bottom 128 bits are shared with NEON, we extend the current
>>>>> register mask definition of the V0-V31 registers. Currently, c2 uses
>>>>> one mask bit for a 32-bit register slot, so to define at most 2048
>>>>> bits we would need to add 64 slots in the AD file. That's a really
>>>>> large number, and would also break the current regmask assumption.
>>>>> Considering that the SVE vector register is architecturally scalable
>>>>> across different sizes, we just define double the original NEON
>>>>> vector register slots, i.e. 8 slots: Vx, Vx_H, Vx_J ... Vx_O. After
>>>>> adlc, the generated register masks now look like:
>>>>>
>>>>> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
>>>>> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
>>>>>
>>>>> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
>>>>> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
>>>>>
>>>>> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
>>>>> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
>>>>>
>>>>> And we use SlotsPerVecA to indicate regmask bit size for a VecA
>>>>> register.
>>>>>
>>>>> For physical register allocation, the register allocator does not
>>>>> need to know the real VecA register size. For spill/unspill, however,
>>>>> the current register allocation needs to know the actual stack slot
>>>>> size to store/load VecA registers. SVE is able to do vector size
>>>>> agnostic spilling, but to minimize the code changes, as I mentioned
>>>>> before, we just let RA know the actual vector register size on the
>>>>> current running machine, by calling scalable_vector_reg_size().
>>>>>
>>>>> In the meantime, some vector operations do not have unpredicated
>>>>> SVE1 instructions, only predicated versions, e.g. vector multiply and
>>>>> vector load/store. So we have also defined predicate registers in
>>>>> this patch, and the c2 register allocator will allocate a temp
>>>>> predicate register to fulfill the expected unpredicated operations.
>>>>> This can also be used for a future predicate-driven vectorizer. It is
>>>>> not efficient for now, as we can see many ptrue instructions in the
>>>>> generated code. One possible solution I can see is to block one
>>>>> predicate register and preset it to all true. But preserving or
>>>>> reinitializing a caller-save register value across calls seems risky
>>>>> to do in this patch, so I decided to defer it to further optimization
>>>>> work. If anyone has any suggestions on this, I would appreciate it.
>>>>>
>>>>> SVE feature detection
>>>>> =====================
>>>>> Since we may have some compiled code based on the initially detected
>>>>> SVE vector register length, and that code is compiled only for that
>>>>> vector register length, we assume that the SVE vector register length
>>>>> will not change during the JVM lifetime. However, the SVE vector
>>>>> length is per-thread and can be changed by a system call [3], so we
>>>>> need to make sure that each JNI call does not change the SVE vector
>>>>> length.
>>>>>
>>>>> Currently, we verify the SVE vector register length on each JNI
>>>>> return, and if an SVE vector length change is detected, the JVM
>>>>> simply reports an error and stops running. The vector length the VM
>>>>> runs with can also be set by the existing VM option MaxVectorSize
>>>>> with c2 enabled. If the specified MaxVectorSize is not the same as
>>>>> the system default SVE vector length (in
>>>>> /proc/sys/abi/sve_default_vector_length), the JVM will set the
>>>>> current process's SVE vector length to the specified vector length.
>>>>>
>>>>> Compiled code
>>>>> =============
>>>>> We have added all current c2 backend codegen on par with NEON, but
>>>>> only for vector lengths larger than 128 bits.
>>>>>
>>>>> On a 1024-bit SVE environment, for the following simple loop with int
>>>>> array element type:
>>>>>
>>>>>      for (int i = 0; i < LENGTH; i++) {
>>>>>        c[i] = a[i] + b[i];
>>>>>      }
>>>>>
>>>>> c2 generated loop:
>>>>>
>>>>>      0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
>>>>>      0x0000ffff811c0824:   add     x13, x18, x11
>>>>>      0x0000ffff811c0828:   add     x14, x1, x11
>>>>>      0x0000ffff811c082c:   add     x13, x13, #0x10
>>>>>      0x0000ffff811c0830:   add     x14, x14, #0x10
>>>>>      0x0000ffff811c0834:   add     x11, x0, x11
>>>>>      0x0000ffff811c0838:   add     x11, x11, #0x10
>>>>>      0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
>>>>>      0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
>>>>>      0x0000ffff811c0844:   ptrue   p0.s
>>>>>      0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
>>>>>      0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
>>>>>      0x0000ffff811c0850:   ptrue   p1.s
>>>>>      0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
>>>>>      0x0000ffff811c0858:   add     w10, w10, #0x20
>>>>>      0x0000ffff811c085c:   cmp     w10, w12
>>>>>      0x0000ffff811c0860:   b.lt    0x0000ffff811c0820
>>>>>
>>>>> Test
>>>>> ====
>>>>> Currently, we don't have real hardware to verify SVE features (and
>>>>> performance). But we have run jtreg tests with SVE in some emulators.
>>>>> On the QEMU system emulator, which has SVE emulation support, jtreg
>>>>> tier1-3 passed with different vector sizes. We've also verified it
>>>>> with full jtreg tests without SVE on both x86 and AArch64, to make
>>>>> sure that there's no regression.
>>>>>
>>>>> The patch has also been applied to the Vector API code base and
>>>>> verified on an emulator. The Vector API has more vector-related tests
>>>>> and is more likely to generate vector instructions through
>>>>> intrinsification.
>>>>>
>>>>> A simple test can also run in QEMU user emulation, e.g.
>>>>>
>>>>> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
>>>>>
>>>>> (To run it in user emulation mode, we will need to bypass the SVE
>>>>> feature detection code in this patch. E.g. apply:
>>>>> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch )
>>>>>
>>>>> Others
>>>>> ======
>>>>> Since this patch is a bit large, I've also split it into 3 parts for
>>>>> easier review:
>>>>>
>>>>> 1) SVE feature detection
>>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
>>>>>
>>>>> 2) c2 register allocation
>>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
>>>>>
>>>>> 3) SVE c2 backend
>>>>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
>>>>>
>>>>> Part of this patch has been contributed by Joshua Zhu and Yang Zhang.
>>>>>
>>>>> Refs
>>>>> ====
>>>>> [1] https://developer.arm.com/docs/ddi0584/latest
>>>>> [2] https://developer.arm.com/docs/ddi0602/latest
>>>>> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
>>>>>
>>>>> Thanks,
>>>>> Ningsheng
>>>>>
>>>>
>>>
>>
>