[aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE backend support

Thu Jul 30 06:59:23 UTC 2020

Hi,

To help with reviewing the large ad file changes in the AArch64 backend, I created some jtreg tests that check whether the expected SVE/NEON instructions are generated for each C2 vector node.

I've uploaded my jtreg tests at http://cr.openjdk.java.net/~pli/rfr/8231441/jtreg.webrev.00/. I hope they are useful to other reviewers.

--
Thanks,
Pengfei

> -----Original Message-----
> From: Ningsheng Jian <ningsheng.jian at arm.com>
> Sent: Thursday, July 30, 2020 14:23
> To: hotspot-compiler-dev at openjdk.java.net; Pengfei Li
> <Pengfei.Li at arm.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>;
> Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Andrew Haley
> <aph at redhat.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR(L): 8231441: AArch64: Initial SVE
> backend support
> 
> Hi,
> 
> Pengfei helped to review the patch offline and found that some
> multiply-add/sub and popcount match rules were missing for SVE. They
> have been added in the new webrev. Thanks to Pengfei!
> 
> New webrev:
> http://cr.openjdk.java.net/~njian/8231441/webrev.03
> 
> Incremental changes:
> http://cr.openjdk.java.net/~njian/8231441/webrev.03-vs-02/
> 
> Split parts:
> 
> 1) SVE feature detection:
> http://cr.openjdk.java.net/~njian/8231441/webrev.03-feature
> 
> 2) c2 register allocation:
> http://cr.openjdk.java.net/~njian/8231441/webrev.03-ra
> 
> 3) SVE c2 backend:
> http://cr.openjdk.java.net/~njian/8231441/webrev.03-c2
> 
> Thanks,
> Ningsheng
> 
> On 7/21/20 2:05 PM, Ningsheng Jian wrote:
> > [Ping]
> >
> > Could anyone please help to review this patch, especially for the c2
> > register allocation part?
> >
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8231441
> >
> > The latest webrev:
> > http://cr.openjdk.java.net/~njian/8231441/webrev.02
> >
> > In the latest webrev, we block one predicate register (p7) with all
> > elements preset to TRUE, so that c2 compiled code can use it freely to
> > generate instructions for unpredicated operations.
> >
> > And the split parts:
> >
> > 1) SVE feature detection:
> > http://cr.openjdk.java.net/~njian/8231441/webrev.02-feature
> >
> > 2) c2 register allocation:
> > http://cr.openjdk.java.net/~njian/8231441/webrev.02-ra
> >
> > 3) SVE c2 backend:
> > http://cr.openjdk.java.net/~njian/8231441/webrev.02-c2
> >
> > The initial RFR which has some descriptions of the patch:
> > http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-March
> > /037628.html
> >
> > The description can also be found at:
> > http://cr.openjdk.java.net/~njian/8231441/README-RFR.txt
> >
> > Notes to verify the patch on QEMU user emulation, with an example of
> > compiled code:
> > http://cr.openjdk.java.net/~njian/8231441/running-sve-in-qemu-user.txt
> >
> > Thanks,
> > Ningsheng
> >
> >
> > On 5/27/20 3:23 PM, Ningsheng Jian wrote:
> >> Hi,
> >>
> >> I have rebased this patch and added some more comments. I have also
> >> relaxed the instruction matching conditions for 128-bit vectors.
> >>
> >> I would appreciate it if someone could help to review this.
> >>
> >> Whole patch:
> >> http://cr.openjdk.java.net/~njian/8231441/webrev.01
> >>
> >> Different parts of changes:
> >>
> >> 1) SVE feature detection
> >> http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
> >>
> >> 2) c2 register allocation
> >> http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
> >>
> >> 3) SVE c2 backend
> >> http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
> >>
> >> (Or should I split this into different JBS?)
> >>
> >> Thanks,
> >> Ningsheng
> >>
> >> On 3/25/20 2:37 PM, Ningsheng Jian wrote:
> >>> Hi,
> >>>
> >>> Could you please help to review this patch adding AArch64 SVE support?
> >>> It also touches c2 compiler shared code.
> >>>
> >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
> >>> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
> >>>
> >>> Arm has released new vector ISA extensions for AArch64, SVE [1] and
> >>> SVE2 [2]. This patch adds initial SVE support to OpenJDK. In
> >>> this patch we have:
> >>>
> >>> 1) SVE feature enablement and detection
> >>> 2) SVE vector register allocation support with initial predicate
> >>> register definition
> >>> 3) SVE c2 backend for the current SLP-based vectorizer. (We also have
> >>> a POC patch of a new vectorizer using SVE predicate-driven loop
> >>> control, but that's still under development.)
> >>>
> >>> SVE register definition
> >>> =======================
> >>> Unlike other SIMD architectures, SVE allows hardware implementations
> >>> to choose a vector register length between 128 and 2048 bits, in
> >>> multiples of 128 bits. So we introduce a new vector type VectorA,
> >>> i.e. a length-agnostic (scalable) vector type, and Op_VecA for the
> >>> machine vectora register. Meanwhile, to minimize register allocation
> >>> code changes, we also take advantage of one property of a JIT
> >>> compiler: at compile time we actually know the real hardware SVE
> >>> vector register size of the machine we are running on. So the
> >>> register allocator knows how many register slots an Op_VecA ideal
> >>> reg requires, and can work without much modification.
> >>>
> >>> Since the bottom 128 bits are shared with NEON, we extend the
> >>> current register mask definition of the V0-V31 registers. Currently,
> >>> c2 uses one mask bit per 32-bit register slot, so to define at most
> >>> 2048 bits we would need to define 64 slots in the AD file. That's a
> >>> really large number, and would also break the current regmask
> >>> assumptions. Considering that the SVE vector register is
> >>> architecturally scalable across different sizes, we just define
> >>> double the original NEON vector register slots, i.e. 8 slots: Vx,
> >>> Vx_H, Vx_J ... Vx_O. After adlc, the generated register masks now
> >>> look like:
> >>>
> >>> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
> >>> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
> >>>
> >>> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
> >>> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
> >>>
> >>> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
> >>> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
> >>>
> >>> And we use SlotsPerVecA to indicate regmask bit size for a VecA register.
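
The mask words quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration, not adlc output and not code from the patch. It assumes that each vector register owns 8 consecutive mask bits (one per 32-bit slot), so one 32-bit mask word covers 4 registers, and that VecD, VecX and VecA use the low 2, 4 and 8 slots of a register respectively. The class and method names are made up for this example.

```java
// Illustrative sketch only; RegMaskSketch and maskWord are hypothetical names.
class RegMaskSketch {
    // Assumption taken from the text: 8 mask slots (Vx, Vx_H, ... Vx_O) per register.
    static final int SLOTS_PER_REG = 8;

    // Builds one 32-bit mask word in which the low `slotsUsed` slots of
    // each register are set. One word covers 32 / 8 = 4 registers.
    static int maskWord(int slotsUsed) {
        if (slotsUsed < 1 || slotsUsed > SLOTS_PER_REG) {
            throw new IllegalArgumentException("slotsUsed: " + slotsUsed);
        }
        int perReg = (1 << slotsUsed) - 1;
        int word = 0;
        for (int reg = 0; reg < 32 / SLOTS_PER_REG; reg++) {
            word |= perReg << (reg * SLOTS_PER_REG);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.printf("VecD (2 slots): 0x%x%n", maskWord(2));
        System.out.printf("VecX (4 slots): 0x%x%n", maskWord(4));
        System.out.printf("VecA (8 slots): 0x%x%n", maskWord(8));
    }
}
```

Under these assumptions the three words come out as 0x3030303, 0xf0f0f0f and 0xffffffff, which is the pattern seen in the generated masks.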
> >>>
> >>> For physical register allocation, the register allocator does not
> >>> need to know the real VecA register size. For spill/unspill,
> >>> however, the current register allocator needs to know the actual
> >>> stack slot size to store/load VecA registers. SVE is able to do
> >>> vector-size-agnostic spilling, but to minimize the code changes, as
> >>> I mentioned before, we just let RA know the actual vector register
> >>> size of the current running machine, by calling
> >>> scalable_vector_reg_size().
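
To make the difference between mask slots and spill size concrete, here is a small sketch. It is an assumption-laden illustration, not the patch's code: it only assumes 32-bit stack slots (as stated above for register slots) and the architectural 128..2048-bit range, and the names are hypothetical. It does not model what scalable_vector_reg_size() actually returns.

```java
// Illustrative sketch only; SpillSlotSketch and stackSlotsFor are hypothetical names.
class SpillSlotSketch {
    // Assumption: one stack slot is 32 bits wide, like a c2 register slot.
    static final int BITS_PER_SLOT = 32;

    // Number of 32-bit stack slots needed to spill one full vector register
    // of the given hardware length.
    static int stackSlotsFor(int vectorBits) {
        if (vectorBits < 128 || vectorBits > 2048 || vectorBits % 128 != 0) {
            throw new IllegalArgumentException("not a valid SVE length: " + vectorBits);
        }
        return vectorBits / BITS_PER_SLOT;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 1024, 2048}) {
            System.out.println(bits + " bits -> " + stackSlotsFor(bits) + " stack slots");
        }
    }
}
```

So a 1024-bit machine would need 32 stack slots per spilled VecA register even though the register mask only reserves 8 bits for it.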
> >>>
> >>> In the meantime, some vector operations, e.g. vector multiply and
> >>> vector load/store, have no unpredicated SVE1 instructions, only
> >>> predicated versions. So we have also defined predicate registers in
> >>> this patch, and the c2 register allocator will allocate a temp
> >>> predicate register to fulfill the expected unpredicated operations.
> >>> This can also be used by a future predicate-driven vectorizer. It is
> >>> not efficient for now, as we can see many ptrue instructions in the
> >>> generated code. One possible solution I can see is to block one
> >>> predicate register and preset it to all true. But
> >>> preserving/reinitializing a caller-save register value across calls
> >>> seems too risky for this patch, so I have decided to defer it to
> >>> further optimization work. If anyone has any suggestions on this, I
> >>> would appreciate them.
> >>>
> >>> SVE feature detection
> >>> =====================
> >>> Since we may have compiled code based on the initially detected
> >>> SVE vector register length, and that code is compiled only
> >>> for that vector register length, we assume that the SVE vector
> >>> register length will not change during the JVM lifetime.
> >>> However, SVE vector length is per-thread and can be changed by a
> >>> system call [3], so we need to make sure that no JNI call changes
> >>> the SVE vector length.
> >>>
> >>> Currently, we verify the SVE vector register length on each JNI
> >>> return, and if an SVE vector length change is detected, the JVM
> >>> simply reports an error and stops running. The vector length the VM
> >>> runs with can also be set by the existing VM option MaxVectorSize
> >>> with c2 enabled. If the specified MaxVectorSize differs from the
> >>> system default SVE vector length (in
> >>> /proc/sys/abi/sve_default_vector_length), the JVM will set the
> >>> current process SVE vector length to the specified vector length.
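
The policy described above can be modelled roughly as follows. This is a plain Java model for illustration only, not the VM's implementation: the class, its methods, the use of bytes as the unit, and the convention that a non-positive maxVectorSize means "option not specified" are all assumptions of this sketch. The real check runs in the VM on JNI return and reads the hardware vector length.

```java
// Illustrative model only; all names here are hypothetical.
class VectorLengthGuardSketch {
    private final int expectedVlBytes;

    // systemDefaultVlBytes models /proc/sys/abi/sve_default_vector_length.
    // maxVectorSize <= 0 models "MaxVectorSize not specified" (an assumption).
    VectorLengthGuardSketch(int systemDefaultVlBytes, int maxVectorSize) {
        this.expectedVlBytes = maxVectorSize > 0 ? maxVectorSize : systemDefaultVlBytes;
    }

    int expectedVlBytes() {
        return expectedVlBytes;
    }

    // Models the check on JNI return: any change of vector length is fatal.
    void verifyAfterNativeCall(int currentVlBytes) {
        if (currentVlBytes != expectedVlBytes) {
            throw new IllegalStateException("SVE vector length changed from "
                    + expectedVlBytes + " to " + currentVlBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        VectorLengthGuardSketch guard = new VectorLengthGuardSketch(64, 32);
        guard.verifyAfterNativeCall(32); // unchanged, passes silently
        System.out.println("expected VL: " + guard.expectedVlBytes() + " bytes");
    }
}
```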
> >>>
> >>> Compiled code
> >>> =============
> >>> We have added all current c2 backend codegen on par with NEON, but
> >>> only for vector lengths larger than 128 bits.
> >>>
> >>> In a 1024-bit SVE environment, for the following simple loop with
> >>> int array element type:
> >>>
> >>>      for (int i = 0; i < LENGTH; i++) {
> >>>        c[i] = a[i] + b[i];
> >>>      }
> >>>
> >>> c2 generated loop:
> >>>
> >>>      0x0000ffff811c0820:   sbfiz   x11, x10, #2, #32
> >>>      0x0000ffff811c0824:   add     x13, x18, x11
> >>>      0x0000ffff811c0828:   add     x14, x1, x11
> >>>      0x0000ffff811c082c:   add     x13, x13, #0x10
> >>>      0x0000ffff811c0830:   add     x14, x14, #0x10
> >>>      0x0000ffff811c0834:   add     x11, x0, x11
> >>>      0x0000ffff811c0838:   add     x11, x11, #0x10
> >>>      0x0000ffff811c083c:   ptrue   p1.s    // To be optimized
> >>>      0x0000ffff811c0840:   ld1w    {z16.s}, p1/z, [x14]
> >>>      0x0000ffff811c0844:   ptrue   p0.s
> >>>      0x0000ffff811c0848:   ld1w    {z17.s}, p0/z, [x13]
> >>>      0x0000ffff811c084c:   add     z16.s, z17.s, z16.s
> >>>      0x0000ffff811c0850:   ptrue   p1.s
> >>>      0x0000ffff811c0854:   st1w    {z16.s}, p1, [x11]
> >>>      0x0000ffff811c0858:   add     w10, w10, #0x20
> >>>      0x0000ffff811c085c:   cmp     w10, w12
> >>>      0x0000ffff811c0860:   b.lt    0x0000ffff811c0820
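
The loop increment `add w10, w10, #0x20` in the listing above follows from the vector length: 1024 bits / 32 bits per int = 32 lanes. The sketch below is a scalar Java model of that stepping, written only to illustrate the arithmetic. The names are hypothetical, and the simple tail loop is an assumption of this sketch rather than a description of how c2 handles leftover iterations.

```java
// Illustrative sketch only; LaneCountSketch and its methods are hypothetical names.
class LaneCountSketch {
    // Number of elements that fit in one vector register.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Scalar model of the vectorized loop: each outer iteration stands for one
    // ld1w/ld1w/add/st1w group and covers lanes(vectorBits, 32) ints.
    static void add(int[] a, int[] b, int[] c, int vectorBits) {
        int step = lanes(vectorBits, Integer.SIZE);
        int i = 0;
        for (; i + step <= c.length; i += step) {
            for (int lane = 0; lane < step; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Leftover elements, handled one at a time in this model.
        for (; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        System.out.println("int lanes at 1024 bits: " + lanes(1024, Integer.SIZE));
    }
}
```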
> >>>
> >>> Test
> >>> ====
> >>> Currently, we don't have real hardware to verify SVE features (and
> >>> performance). But we have run jtreg tests with SVE in some
> >>> emulators. On the QEMU system emulator, which has SVE emulation
> >>> support, jtreg tier1-3 passed with different vector sizes. We've
> >>> also run the full jtreg tests without SVE on both x86 and
> >>> AArch64, to make sure that there's no regression.
> >>>
> >>> The patch has also been applied to the Vector API code base and
> >>> verified on an emulator. The Vector API has more vector-related
> >>> tests and is more likely to generate vector instructions through
> >>> intrinsification.
> >>>
> >>> A simple test can also be run in QEMU user emulation, e.g.
> >>>
> >>> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
> >>>
> >>> (
> >>> To run it in user emulation mode, we will need to bypass SVE feature
> >>> detection code in this patch. E.g. apply:
> >>> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch
> >>> )
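
The SIMD class used in the command above is not included in this mail. A hypothetical stand-in, built around the array-add loop shown earlier, might look like the sketch below. The array size, the iteration count used to get the method compiled by c2, and the result check are all assumptions of this example.

```java
// Hypothetical stand-in for the SIMD test class; not the author's actual test.
class SIMD {
    static final int LENGTH = 1024;

    // The simple loop from the mail, which c2's SLP vectorizer can vectorize.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[LENGTH];
        int[] b = new int[LENGTH];
        int[] c = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            a[i] = i;
            b[i] = LENGTH - i;
        }
        // Call the method often enough that it is likely to be JIT compiled.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        for (int i = 0; i < LENGTH; i++) {
            if (c[i] != LENGTH) {
                throw new AssertionError("wrong result at index " + i);
            }
        }
        System.out.println("OK");
    }
}
```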
> >>>
> >>> Others
> >>> ======
> >>> Since this patch is a bit large, I've also split it into 3 parts,
> >>> for easy review:
> >>>
> >>> 1) SVE feature detection
> >>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
> >>>
> >>> 2) c2 register allocation
> >>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
> >>>
> >>> 3) SVE c2 backend
> >>> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
> >>>
> >>> Part of this patch has been contributed by Joshua Zhu and Yang Zhang.
> >>>
> >>> Refs
> >>> ====
> >>> [1] https://developer.arm.com/docs/ddi0584/latest
> >>> [2] https://developer.arm.com/docs/ddi0602/latest
> >>> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
> >>>
> >>> Thanks,
> >>> Ningsheng
> >>>
> >>
> >