RFR(L): 8231441: AArch64: Initial SVE backend support
Ningsheng Jian
ningsheng.jian at arm.com
Wed May 27 07:23:13 UTC 2020
Hi,
I have rebased this patch and added some more comments. I have also
relaxed the instruction matching conditions for 128-bit vectors.
I would appreciate it if someone could help review this.
Whole patch:
http://cr.openjdk.java.net/~njian/8231441/webrev.01
Different parts of changes:
1) SVE feature detection
http://cr.openjdk.java.net/~njian/8231441/webrev.01-feature
2) c2 register allocation
http://cr.openjdk.java.net/~njian/8231441/webrev.01-ra
3) SVE c2 backend
http://cr.openjdk.java.net/~njian/8231441/webrev.01-c2
(Or should I split this into separate JBS issues?)
Thanks,
Ningsheng
On 3/25/20 2:37 PM, Ningsheng Jian wrote:
> Hi,
>
> Could you please help to review this patch adding AArch64 SVE support?
> It also touches c2 compiler shared code.
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
> Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
>
> Arm has released new vector ISA extensions for AArch64, SVE [1] and
> SVE2 [2]. This patch adds initial SVE support to OpenJDK. In this
> patch we have:
>
> 1) SVE feature enablement and detection
> 2) SVE vector register allocation support with initial predicate
> register definition
> 3) SVE c2 backend for the current SLP-based vectorizer. (We also have a POC
> patch of a new vectorizer using SVE predicate-driven loop control, but
> that's still under development.)
>
> SVE register definition
> =======================
> Unlike other SIMD architectures, SVE allows hardware implementations to
> choose a vector register length between 128 and 2048 bits, in multiples
> of 128 bits. So we introduce a new vector type VectorA, i.e. a length
> agnostic (scalable) vector type, and Op_VecA for the machine vectora
> register. Meanwhile, to minimize register allocation code changes, we
> also take advantage of one aspect of a JIT compiler: at compile time we
> actually know the real hardware SVE vector register size of the machine
> we are running on. So the register allocator knows how many register
> slots an Op_VecA ideal reg requires, and can work fine without much
> modification.
>
> Since the bottom 128 bits are shared with NEON, we extend the current
> register mask definition of the V0-V31 registers. Currently, c2 uses one
> mask bit for a 32-bit register slot, so to define at most 2048 bits we
> would need to add 64 slots in the AD file. That's a really large number,
> and would also break the current regmask assumption. Considering that the
> SVE vector register is architecturally scalable to different sizes, we
> just define double the original NEON vector register slots, i.e. 8
> slots: Vx, Vx_H, Vx_J ... Vx_O. After adlc, the generated register masks
> now look like:
>
> const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
> 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
>
> const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
> 0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
>
> const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
> 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
>
> And we use SlotsPerVecA to indicate regmask bit size for a VecA register.
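
One way to read the mask words above: with 8 slots per register, each 32-bit
mask word covers four V registers, and a vector type sets only the low slots
it occupies (2 for a 64-bit VecD, 4 for a 128-bit VecX, all 8 for VecA). The
sketch below is only an illustration of that reading; the class and method
names are hypothetical and this is not code from the patch or from adlc.

```java
// Illustrative sketch only; RegMaskSketch and maskWord are hypothetical names.
final class RegMaskSketch {
    static final int SLOTS_PER_REG = 8; // Vx, Vx_H, Vx_J ... Vx_O in the text
    static final int WORD_BITS = 32;    // one regmask word

    /** Builds one mask word in which each register uses its lowest slotsUsed slots. */
    static int maskWord(int slotsUsed) {
        if (slotsUsed < 1 || slotsUsed > SLOTS_PER_REG) {
            throw new IllegalArgumentException("slotsUsed out of range: " + slotsUsed);
        }
        int perRegister = (1 << slotsUsed) - 1; // low slotsUsed bits set
        int word = 0;
        // Four registers fit in one 32-bit word when each has 8 slots.
        for (int reg = 0; reg < WORD_BITS / SLOTS_PER_REG; reg++) {
            word |= perRegister << (reg * SLOTS_PER_REG);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println("VecD 0x" + Integer.toHexString(maskWord(2))); // 0x3030303
        System.out.println("VecX 0x" + Integer.toHexString(maskWord(4))); // 0xf0f0f0f
        System.out.println("VecA 0x" + Integer.toHexString(maskWord(8))); // 0xffffffff
    }
}
```

The printed words match the repeated values in the generated masks quoted
above.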
>
> For physical register allocation, the register allocator does not need
> to know the real VecA register size. When spilling/unspilling, however,
> the current register allocator needs to know the actual stack slot size
> to store/load VecA registers. SVE is able to do vector size agnostic
> spilling, but to minimize the code changes, as I mentioned before, we
> just let RA know the actual vector register size of the machine we are
> running on, by calling scalable_vector_reg_size().
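
To make the spill arithmetic concrete: the regmask always uses 8 bits per
VecA register, while the stack space needed for a spill depends on the real
vector length. A minimal sketch, assuming the vector length is available in
bytes and that stack slots are 32 bits wide like register slots; the names
are hypothetical and do not mirror the HotSpot sources.

```java
// Illustrative sketch only; SpillSizeSketch and stackSlots are hypothetical names.
final class SpillSizeSketch {
    static final int SLOT_BYTES = 4; // one 32-bit slot

    /** Number of 32-bit stack slots needed to spill one SVE register. */
    static int stackSlots(int vectorBytes) {
        // SVE lengths are multiples of 128 bits (16 bytes), from 128 to 2048 bits.
        if (vectorBytes < 16 || vectorBytes > 256 || vectorBytes % 16 != 0) {
            throw new IllegalArgumentException("invalid SVE vector length: " + vectorBytes);
        }
        return vectorBytes / SLOT_BYTES;
    }

    public static void main(String[] args) {
        for (int bytes : new int[] {16, 32, 128, 256}) {
            System.out.println((bytes * 8) + "-bit vector: "
                    + stackSlots(bytes) + " stack slots per spill");
        }
    }
}
```

For example, a 1024-bit machine needs 32 stack slots per spilled VecA
register, and a 2048-bit machine needs the full 64.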
>
> Meanwhile, some vector operations, e.g. vector multiply and vector
> load/store, do not have unpredicated SVE1 instructions, only predicated
> versions. So we have also defined predicate registers in this patch, and
> the c2 register allocator will allocate a temp predicate register to
> fulfill the expected unpredicated operations. This can also be used by a
> future predicate-driven vectorizer. It is not efficient for now, as we
> can see many ptrue instructions in the generated code. One possible
> solution I can see is to reserve one predicate register and preset it
> to all true. But preserving/reinitializing a caller-saved register
> value across calls seems too risky for this patch, so I decided to defer
> it to further optimization work. If anyone has any suggestions on this,
> I would appreciate them.
>
> SVE feature detection
> =====================
> Since we may have compiled code that is based on the initially detected
> SVE vector register length and is valid only for that length, we assume
> that the SVE vector register length will not change during the JVM
> lifetime. However, the SVE vector length is per-thread and can be
> changed by a system call [3], so we need to make sure that no JNI call
> changes the SVE vector length.
>
> Currently, we verify the SVE vector register length on each JNI return,
> and if an SVE vector length change is detected, the JVM simply reports
> an error and stops running. The vector length the VM runs with can also
> be set by the existing VM option MaxVectorSize with c2 enabled. If the
> specified MaxVectorSize differs from the system default SVE vector
> length (in /proc/sys/abi/sve_default_vector_length), the JVM will set
> the current process SVE vector length to the specified vector length.
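
The policy described above can be modelled in a few lines. This is only a
sketch of the decision logic, assuming both MaxVectorSize and the system
default are expressed in bytes; the real checks live in the VM's C++ code and
use a system call, and every name below is hypothetical.

```java
// Illustrative model only; SveLengthPolicy and its members are hypothetical names.
final class SveLengthPolicy {
    enum StartupAction { KEEP_SYSTEM_DEFAULT, SET_PROCESS_LENGTH }

    /** Startup: decide whether the process vector length must be changed. */
    static StartupAction onStartup(int systemDefaultBytes, int maxVectorSizeBytes) {
        return maxVectorSizeBytes == systemDefaultBytes
                ? StartupAction.KEEP_SYSTEM_DEFAULT
                : StartupAction.SET_PROCESS_LENGTH;
    }

    /** JNI return: compiled code assumes the length never changes, so fail hard. */
    static void verifyOnJniReturn(int initialBytes, int currentBytes) {
        if (initialBytes != currentBytes) {
            throw new IllegalStateException("SVE vector length changed from "
                    + initialBytes + " to " + currentBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        System.out.println(onStartup(64, 64)); // KEEP_SYSTEM_DEFAULT
        System.out.println(onStartup(64, 32)); // SET_PROCESS_LENGTH
        verifyOnJniReturn(32, 32);             // unchanged length: no error
    }
}
```

Throwing an exception here stands in for the "report an error and stop
running" behaviour; the model does not attempt to recover.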
>
> Compiled code
> =============
> We have added all current c2 backend codegen on par with NEON, but only
> for vector lengths larger than 128 bits.
>
> On a 1024-bit SVE environment, for the following simple loop with int
> array element type:
>
> for (int i = 0; i < LENGTH; i++) {
>   c[i] = a[i] + b[i];
> }
>
> c2 generated loop:
>
> 0x0000ffff811c0820: sbfiz x11, x10, #2, #32
> 0x0000ffff811c0824: add x13, x18, x11
> 0x0000ffff811c0828: add x14, x1, x11
> 0x0000ffff811c082c: add x13, x13, #0x10
> 0x0000ffff811c0830: add x14, x14, #0x10
> 0x0000ffff811c0834: add x11, x0, x11
> 0x0000ffff811c0838: add x11, x11, #0x10
> 0x0000ffff811c083c: ptrue p1.s // To be optimized
> 0x0000ffff811c0840: ld1w {z16.s}, p1/z, [x14]
> 0x0000ffff811c0844: ptrue p0.s
> 0x0000ffff811c0848: ld1w {z17.s}, p0/z, [x13]
> 0x0000ffff811c084c: add z16.s, z17.s, z16.s
> 0x0000ffff811c0850: ptrue p1.s
> 0x0000ffff811c0854: st1w {z16.s}, p1, [x11]
> 0x0000ffff811c0858: add w10, w10, #0x20
> 0x0000ffff811c085c: cmp w10, w12
> 0x0000ffff811c0860: b.lt 0x0000ffff811c0820
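
As a cross-check on the listing: the induction variable w10 advances by 0x20
per iteration, which is the number of 32-bit int lanes in a 1024-bit vector.
The small sketch below only reproduces that arithmetic; the class and method
names are hypothetical.

```java
// Illustrative arithmetic only; LanesPerIteration is a hypothetical name.
final class LanesPerIteration {
    /** Number of elements of the given width that fit in one vector register. */
    static int lanes(int vectorBits, int elementBits) {
        if (elementBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException("element width must divide vector width");
        }
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        int step = lanes(1024, Integer.SIZE); // int elements are 32 bits wide
        System.out.println("elements per iteration: " + step
                + " (0x" + Integer.toHexString(step) + ")");
    }
}
```

On the same reasoning, a 256-bit SVE machine would process 8 ints per vector
operation in this loop.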
>
> Test
> ====
> Currently, we don't have real hardware to verify SVE features (and
> performance). But we have run jtreg tests with SVE in some emulators. On
> the QEMU system emulator, which has SVE emulation support, jtreg tier1-3
> passed with different vector sizes. We've also run the full jtreg tests
> without SVE on both x86 and AArch64, to make sure that there's no
> regression.
>
> The patch has also been applied to the Vector API code base, and
> verified on an emulator. The Vector API has more vector related tests
> and is more likely to generate vector instructions, through
> intrinsification.
>
> A simple test can also run in QEMU user emulation, e.g.
>
> $ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
>
> (
> To run it in user emulation mode, we will need to bypass SVE feature
> detection code in this patch. E.g. apply:
> http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch
> )
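
The source of the SIMD class used in the command above is not shown in this
mail. A minimal stand-in, assuming it simply exercises the int array addition
loop from the "Compiled code" section often enough for c2 to compile it,
could look like the sketch below. The array length and iteration count are
arbitrary choices, not values from the patch.

```java
// Hypothetical stand-in for the SIMD test class; not the author's actual test.
final class SIMD {
    static final int LENGTH = 1024;

    /** The simple loop that the SLP vectorizer is expected to vectorize. */
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < LENGTH; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[LENGTH];
        int[] b = new int[LENGTH];
        int[] c = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        // Call the method repeatedly so that it becomes hot enough to be compiled.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        // Check the result so that a miscompiled loop is noticed.
        for (int i = 0; i < LENGTH; i++) {
            if (c[i] != 3 * i) {
                throw new AssertionError("wrong result at index " + i + ": " + c[i]);
            }
        }
        System.out.println("OK");
    }
}
```

With sve-max-vq=2 the emulated vector length is 2 x 128 = 256 bits, so each
vector operation in this loop would cover 8 ints.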
>
> Others
> ======
> Since this patch is a bit large, I've also split it into 3 parts, for
> easier review:
>
> 1) SVE feature detection
> http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
>
> 2) c2 register allocation
> http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
>
> 3) SVE c2 backend
> http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
>
> Part of this patch has been contributed by Joshua Zhu and Yang Zhang.
>
> Refs
> ====
> [1] https://developer.arm.com/docs/ddi0584/latest
> [2] https://developer.arm.com/docs/ddi0602/latest
> [3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
>
> Thanks,
> Ningsheng
>
More information about the hotspot-compiler-dev
mailing list