RFR(L): 8231441: AArch64: Initial SVE backend support
Ningsheng Jian
ningsheng.jian at arm.com
Wed Mar 25 06:37:17 UTC 2020
Hi,
Could you please help to review this patch adding AArch64 SVE support?
It also touches c2 compiler shared code.
Bug: https://bugs.openjdk.java.net/browse/JDK-8231441
Webrev: http://cr.openjdk.java.net/~njian/8231441/webrev.00
Arm has released new vector ISA extensions for AArch64: SVE [1] and
SVE2 [2]. This patch adds initial SVE support to OpenJDK. The patch
contains:
1) SVE feature enablement and detection
2) SVE vector register allocation support with initial predicate
register definition
3) SVE c2 backend for the current SLP-based vectorizer. (We also have a
POC patch of a new vectorizer using SVE predicate-driven loop control,
but that's still under development.)
SVE register definition
=======================
Unlike other SIMD architectures, SVE allows hardware implementations to
choose a vector register length between 128 and 2048 bits, in multiples
of 128 bits. So we introduce a new vector type VectorA, i.e. a
length-agnostic (scalable) vector type, and Op_VecA for the machine
vectora register. At the same time, to minimize register allocation code
changes, we take advantage of one property of a JIT compiler: at compile
time we know the real hardware SVE vector register size of the machine
we are running on. So the register allocator knows how many register
slots an Op_VecA ideal reg requires, and can work without much
modification.
Since the bottom 128 bits are shared with NEON, we extend the current
register mask definition of the V0-V31 registers. Currently, c2 uses one
mask bit per 32-bit register slot, so to describe up to 2048 bits we
would need 64 slots per register in the AD file. That is a really large
number, and would also break current regmask assumptions. Considering
that the SVE vector register is architecturally scalable across
different sizes, we just define double the original NEON vector register
slots, i.e. 8 slots: Vx, Vx_H, Vx_J ... Vx_O. After adlc, the generated
register masks now look like:
const RegMask _VECTORA_REG_mask( 0x0, 0x0, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, ...
const RegMask _VECTORD_REG_mask( 0x0, 0x0, 0x3030303, 0x3030303,
0x3030303, 0x3030303, 0x3030303, 0x3030303, ...
const RegMask _VECTORX_REG_mask( 0x0, 0x0, 0xf0f0f0f, 0xf0f0f0f,
0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, ...
And we use SlotsPerVecA to indicate regmask bit size for a VecA register.
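The mask patterns above follow from the slot layout: with 8 mask bits per
V register, one 32-bit mask word covers 4 registers, and a 64-bit VecD
uses the low 2 bits of each group, a 128-bit VecX the low 4, and VecA all
8. The sketch below recomputes the three patterns. The names mask_word
and kMaskSlotsPerReg are hypothetical and used only for illustration;
this is not adlc or HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not HotSpot/adlc code.
// Each V register is described by 8 consecutive mask bits
// (Vx, Vx_H, Vx_J ... Vx_O), so one 32-bit mask word covers 4 registers.
constexpr int kMaskSlotsPerReg = 8;

// Build one 32-bit mask word in which every register contributes its
// lowest `slots_used` bits (2 for VecD, 4 for VecX, 8 for VecA).
inline std::uint32_t mask_word(int slots_used) {
  assert(slots_used >= 1 && slots_used <= kMaskSlotsPerReg);
  const std::uint32_t per_reg = (1u << slots_used) - 1u;  // e.g. 0x3 for 2 slots
  std::uint32_t word = 0;
  for (int bit = 0; bit < 32; bit += kMaskSlotsPerReg) {
    word |= per_reg << bit;  // repeat the pattern for each of the 4 registers
  }
  return word;
}
```

Calling mask_word with 2, 4 and 8 gives the repeated words seen in
_VECTORD_REG_mask, _VECTORX_REG_mask and _VECTORA_REG_mask respectively.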
For physical register allocation, the register allocator does not need
to know the real VecA register size. For spill/unspill, however, the
current register allocator needs to know the actual stack slot size in
order to store/load VecA registers. SVE is able to do vector size
agnostic spilling, but to minimize the code changes, as mentioned above,
we just let the RA know the actual vector register size of the running
machine, by calling scalable_vector_reg_size().
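To make the distinction concrete: the regmask width of a VecA register
is fixed at 8 bits, while the spill size depends on the detected vector
length. The following is a minimal sketch of that arithmetic, assuming
32-bit stack slots. All names in it (kBitsPerSlot, kMaskSlotsPerVecA,
is_valid_sve_length, spill_slots_for_vec_a) are hypothetical and not
taken from the patch.

```cpp
#include <cassert>

// Hypothetical illustration, not HotSpot code.
constexpr int kBitsPerSlot = 32;      // assumption: one slot describes 32 bits
constexpr int kMaskSlotsPerVecA = 8;  // fixed regmask width of a VecA register

// True if `bits` is an architecturally valid SVE vector length:
// between 128 and 2048 bits, in multiples of 128 bits.
inline bool is_valid_sve_length(int bits) {
  return bits >= 128 && bits <= 2048 && bits % 128 == 0;
}

// Number of 32-bit stack slots needed to spill one VecA register when the
// running machine uses `vector_bits`-wide SVE registers.
inline int spill_slots_for_vec_a(int vector_bits) {
  assert(is_valid_sve_length(vector_bits));
  return vector_bits / kBitsPerSlot;
}
```

For a 1024-bit machine this gives 32 stack slots per spilled register,
while the regmask still uses only the 8 bits of kMaskSlotsPerVecA.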
In the meantime, some vector operations, e.g. vector multiply and
vector load/store, have no unpredicated SVE1 instructions, only
predicated versions. So we have also defined predicate registers in this
patch, and the c2 register allocator will allocate a temp predicate
register to implement the expected unpredicated operations. This can
also be used by a future predicate-driven vectorizer. It is not
efficient for now, as we can see many ptrue instructions in the
generated code. One possible solution I can see is to reserve one
predicate register and preset it to all true. But preserving or
reinitializing a caller-saved register value across calls seems too
risky for this patch, so I decided to defer it to further optimization
work. If anyone has any suggestions on this, I would appreciate them.
SVE feature detection
=====================
Since we may have compiled code based on the initially detected SVE
vector register length, and that code is compiled only for that length,
we assume that the SVE vector register length will not change during the
JVM lifetime. However, the SVE vector length is per-thread and can be
changed by a system call [3], so we need to make sure that no JNI call
changes the SVE vector length. Currently, we verify the SVE vector
register length on each JNI return, and if a change is detected, the JVM
simply reports an error and stops running. The vector length the VM runs
with can also be set by the existing VM option MaxVectorSize with c2
enabled. If the specified MaxVectorSize differs from the system default
SVE vector length (in /proc/sys/abi/sve_default_vector_length), the JVM
will set the current process's SVE vector length to the specified vector
length.
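The check on JNI return can be pictured as follows. This is only a
sketch of the idea, not the code in the patch: VectorLengthGuard and its
query callback are hypothetical names. On Linux the real value could be
read with prctl(PR_SVE_GET_VL), as described in [3]; the sketch takes the
query as a parameter so that it does not depend on SVE hardware.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical illustration, not the patch's code.
// `query_vl_bytes` stands in for whatever reads the current thread's SVE
// vector length in bytes.
class VectorLengthGuard {
 public:
  explicit VectorLengthGuard(std::function<int()> query_vl_bytes)
      : query_(std::move(query_vl_bytes)), initial_(query_()) {}

  // True if the vector length still matches the one recorded at startup.
  // A VM would call this on each JNI return and stop on a mismatch.
  bool unchanged() const { return query_() == initial_; }

  int initial_bytes() const { return initial_; }

 private:
  std::function<int()> query_;  // declared first, so it is initialized first
  int initial_;                 // vector length recorded at construction
};
```

A caller constructs the guard once at startup and tests unchanged()
after native code returns; what to do on a mismatch (here, the JVM
reports an error and stops) is left to the caller.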
Compiled code
=============
We have added all current c2 backend codegen on par with NEON, but only
for vector lengths larger than 128 bits.
In a 1024-bit SVE environment, for the following simple loop with int
array element type:
for (int i = 0; i < LENGTH; i++) {
  c[i] = a[i] + b[i];
}
c2 generated loop:
0x0000ffff811c0820: sbfiz x11, x10, #2, #32
0x0000ffff811c0824: add x13, x18, x11
0x0000ffff811c0828: add x14, x1, x11
0x0000ffff811c082c: add x13, x13, #0x10
0x0000ffff811c0830: add x14, x14, #0x10
0x0000ffff811c0834: add x11, x0, x11
0x0000ffff811c0838: add x11, x11, #0x10
0x0000ffff811c083c: ptrue p1.s // To be optimized
0x0000ffff811c0840: ld1w {z16.s}, p1/z, [x14]
0x0000ffff811c0844: ptrue p0.s
0x0000ffff811c0848: ld1w {z17.s}, p0/z, [x13]
0x0000ffff811c084c: add z16.s, z17.s, z16.s
0x0000ffff811c0850: ptrue p1.s
0x0000ffff811c0854: st1w {z16.s}, p1, [x11]
0x0000ffff811c0858: add w10, w10, #0x20
0x0000ffff811c085c: cmp w10, w12
0x0000ffff811c0860: b.lt 0x0000ffff811c0820
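The loop increment of #0x20 matches the lane count: a 1024-bit vector
holds 1024 / 32 = 32 int elements, so each iteration handles 32 array
elements. The scalar model below sketches this stepping. The function
names are hypothetical, and the scalar tail for leftover elements is an
assumption added to keep the model complete; it is not shown in the
listing above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of the vectorized loop, for illustration only.
inline std::size_t lanes_per_vector(std::size_t vector_bits,
                                    std::size_t element_bits) {
  assert(element_bits > 0 && vector_bits % element_bits == 0);
  return vector_bits / element_bits;
}

// Adds a and b into c, `lanes` elements per main-loop iteration.
// Returns the number of main-loop (whole-vector) iterations executed.
inline std::size_t add_in_vector_steps(const std::vector<int>& a,
                                       const std::vector<int>& b,
                                       std::vector<int>& c,
                                       std::size_t lanes) {
  assert(lanes > 0 && a.size() == b.size() && a.size() == c.size());
  std::size_t iterations = 0;
  std::size_t i = 0;
  for (; i + lanes <= a.size(); i += lanes) {  // models ld1w, ld1w, add, st1w
    for (std::size_t lane = 0; lane < lanes; ++lane) {
      c[i + lane] = a[i + lane] + b[i + lane];
    }
    ++iterations;
  }
  for (; i < a.size(); ++i) {  // assumed scalar tail for leftover elements
    c[i] = a[i] + b[i];
  }
  return iterations;
}
```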
Test
====
Currently, we don't have real hardware to verify SVE features (and
performance), but we have run jtreg tests with SVE in some emulators. On
the QEMU system emulator, which has SVE emulation support, jtreg tier1-3
passed with different vector sizes. We have also run the full jtreg
tests without SVE on both x86 and AArch64, to make sure that there is no
regression.
The patch has also been applied to the Vector API code base and verified
on an emulator. The Vector API has more vector-related tests, and is
more likely to generate vector instructions through intrinsification.
A simple test can also run in QEMU user emulation, e.g.
$ qemu-aarch64 -cpu max,sve-max-vq=2 java -XX:UseSVE=1 SIMD
(To run it in user emulation mode, we will need to bypass the SVE feature
detection code in this patch, e.g. by applying:
http://cr.openjdk.java.net/~njian/8231441/user-emulation.patch)
Others
======
Since this patch is a bit large, I've also split it into 3 parts for
easier review:
1) SVE feature detection
http://cr.openjdk.java.net/~njian/8231441/webrev.00-feature
2) c2 register allocation
http://cr.openjdk.java.net/~njian/8231441/webrev.00-ra
3) SVE c2 backend
http://cr.openjdk.java.net/~njian/8231441/webrev.00-c2
Parts of this patch were contributed by Joshua Zhu and Yang Zhang.
Refs
====
[1] https://developer.arm.com/docs/ddi0584/latest
[2] https://developer.arm.com/docs/ddi0602/latest
[3] https://www.kernel.org/doc/Documentation/arm64/sve.txt
Thanks,
Ningsheng