RFR: 8286941: Add mask IR for partial vector operations for ARM SVE
Xiaohong Gong
xgong at openjdk.java.net
Mon Jun 6 09:48:08 UTC 2022
The Vector API SVE backend supports vector operations whose vector length is smaller than the maximum vector length that the current hardware supports. We call them partial vector operations. For some partial operations, such as vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operation so that the results are correct.
For example, if the user defines an IntVector with a 256-bit species and runs it on SVE hardware whose maximum vector size is 512 bits, all the 256-bit int vector operations are partial. For some of these ops, a mask is generated in which every lane above the real vector length is set to 0.
Currently the mask is generated in the backend, together with the code generation for each op in its match rule. This generates many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant and could be hoisted out of the loop.
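As a rough illustration of the 256-bit on 512-bit case, the sketch below models the partial mask in plain Java. It is not HotSpot code: the class `PartialVectorMask` and its methods are hypothetical names, and the register is modeled as a plain `int[]`.

```java
import java.util.Arrays;

// Toy model only: names here are hypothetical and not part of the JDK.
public class PartialVectorMask {
    // Builds a per-lane mask for a partial vector operation: lanes that
    // belong to the species are active, all higher lanes are cleared.
    static boolean[] partialMask(int maxVectorBits, int speciesBits, int elementBits) {
        if (speciesBits > maxVectorBits) {
            throw new IllegalArgumentException("species is wider than the hardware vector");
        }
        int hardwareLanes = maxVectorBits / elementBits; // 512 / 32 = 16
        int activeLanes = speciesBits / elementBits;     // 256 / 32 = 8
        boolean[] mask = new boolean[hardwareLanes];
        Arrays.fill(mask, 0, activeLanes, true);
        return mask;
    }

    // Add reduction that only consumes the lanes enabled by the mask.
    static int maskedAddReduce(int[] register, boolean[] mask) {
        int sum = 0;
        for (int i = 0; i < register.length; i++) {
            if (mask[i]) {
                sum += register[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean[] mask = partialMask(512, 256, 32);
        int[] register = new int[16];
        Arrays.fill(register, 1); // the upper 8 lanes stand in for stale data
        System.out.println(maskedAddReduce(register, mask)); // prints 8
    }
}
```

Without the mask, the reduction in this model would also sum the eight upper lanes and return 16, which is why the mask is needed for correctness.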
Here is an example for vector load and add reduction inside a loop:
ptrue p0.s, vl8 ; mask generation
ld1w {z16.s}, p0/z, [x14] ; load vector
ptrue p0.s, vl8 ; mask generation
uaddv d17, p0, z16.s ; add reduction
smov x14, v17.s[0]
As we can see, the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can then be optimized out by GVN and hoisted out of the loop.
Note that for masked vector operations, there is no need to generate an additional mask even though the vector length is smaller than the max vector register size, because the higher bits of the original input mask have already been cleared.
Here is the performance gain for the 256-bit vector reductions on an SVE 512-bit system:
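To show why an explicit mask node helps, here is a minimal value-numbering sketch in plain Java. It is a toy model and not C2's implementation: the class `MaskGvnSketch` and the op names passed to `make` are illustrative assumptions. Nodes with the same op and the same inputs are looked up in a table and shared, so the second mask request returns the first node.

```java
import java.util.HashMap;
import java.util.Map;

// Toy IR with hash-based value numbering; all names are hypothetical.
public class MaskGvnSketch {
    static final class Node {
        final int id;
        final String op;
        final Node[] inputs;

        Node(int id, String op, Node[] inputs) {
            this.id = id;
            this.op = op;
            this.inputs = inputs;
        }
    }

    private final Map<String, Node> valueTable = new HashMap<>();
    private int nextId = 0;

    // Returns an existing node when one with the same op and inputs
    // already exists; otherwise creates and records a new node.
    Node make(String op, Node... inputs) {
        StringBuilder key = new StringBuilder(op);
        for (Node in : inputs) {
            key.append(':').append(in.id);
        }
        Node existing = valueTable.get(key.toString());
        if (existing != null) {
            return existing;
        }
        Node created = new Node(nextId++, op, inputs);
        valueTable.put(key.toString(), created);
        return created;
    }

    int nodeCount() {
        return nextId;
    }

    public static void main(String[] args) {
        MaskGvnSketch ir = new MaskGvnSketch();
        Node address = ir.make("Address");
        Node maskForLoad = ir.make("PartialMask256Int");
        Node load = ir.make("MaskedLoad", address, maskForLoad);
        Node maskForReduce = ir.make("PartialMask256Int");
        ir.make("MaskedAddReduction", load, maskForReduce);
        System.out.println(maskForLoad == maskForReduce); // prints true
        System.out.println(ir.nodeCount());               // prints 4
    }
}
```

In this model the mask node has no inputs at all, so it cannot depend on anything computed inside a loop, which mirrors the loop-invariance argument made above.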
Benchmark                  size   Gain
Byte256Vector.ADDLanes     1024   0.999
Byte256Vector.ANDLanes     1024   1.065
Byte256Vector.MAXLanes     1024   1.064
Byte256Vector.MINLanes     1024   1.062
Byte256Vector.ORLanes      1024   1.072
Byte256Vector.XORLanes     1024   1.041
Short256Vector.ADDLanes    1024   1.017
Short256Vector.ANDLanes    1024   1.044
Short256Vector.MAXLanes    1024   1.049
Short256Vector.MINLanes    1024   1.049
Short256Vector.ORLanes     1024   1.089
Short256Vector.XORLanes    1024   1.047
Int256Vector.ADDLanes      1024   1.045
Int256Vector.ANDLanes      1024   1.078
Int256Vector.MAXLanes      1024   1.123
Int256Vector.MINLanes      1024   1.129
Int256Vector.ORLanes       1024   1.078
Int256Vector.XORLanes      1024   1.072
Long256Vector.ADDLanes     1024   1.059
Long256Vector.ANDLanes     1024   1.101
Long256Vector.MAXLanes     1024   1.079
Long256Vector.MINLanes     1024   1.099
Long256Vector.ORLanes      1024   1.098
Long256Vector.XORLanes     1024   1.110
Float256Vector.ADDLanes    1024   1.033
Float256Vector.MAXLanes    1024   1.156
Float256Vector.MINLanes    1024   1.151
Double256Vector.ADDLanes   1024   1.062
Double256Vector.MAXLanes   1024   1.145
Double256Vector.MINLanes   1024   1.140
This patch also adds 32-bit variants of the SVE whileXX instructions, with one more matching rule for `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, as shown below:
sxtw x14, w14
whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14
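The equivalence of the two forms can be checked with a small plain-Java model. This is only a sketch under assumptions: the class `WhileLoSketch` is hypothetical, the lane count of 16 assumes `.s` elements in a 512-bit register, and `whilelo` is modeled as "lane i is active while start + i is unsigned lower than the limit".

```java
import java.util.Arrays;

// Toy model of the 64-bit and 32-bit whilelo forms; names are hypothetical.
public class WhileLoSketch {
    // 64-bit form: operands are compared as unsigned 64-bit values.
    static boolean[] whilelo64(long start, long limit, int lanes) {
        boolean[] predicate = new boolean[lanes];
        boolean active = true;
        for (int i = 0; i < lanes; i++) {
            // Once the condition fails, every later lane stays inactive.
            active = active && Long.compareUnsigned(start + i, limit) < 0;
            predicate[i] = active;
        }
        return predicate;
    }

    // 32-bit form: operands are compared as unsigned 32-bit values.
    static boolean[] whilelo32(int start, int limit, int lanes) {
        boolean[] predicate = new boolean[lanes];
        boolean active = true;
        for (int i = 0; i < lanes; i++) {
            active = active && Integer.compareUnsigned(start + i, limit) < 0;
            predicate[i] = active;
        }
        return predicate;
    }

    public static void main(String[] args) {
        int length = 5;                 // the int input of VectorMaskGen
        long extended = (long) length;  // what sxtw produces for the 64-bit form
        boolean[] wide = whilelo64(0L, extended, 16);
        boolean[] narrow = whilelo32(0, length, 16);
        System.out.println(Arrays.equals(wide, narrow)); // prints true
    }
}
```

Both calls activate the first five lanes in this model, so the sign extension adds no information when the input is already an int.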
-------------
Commit messages:
- 8286941: Add mask IR for partial vector operations for ARM SVE
Changes: https://git.openjdk.java.net/jdk/pull/9037/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8286941
Stats: 2228 lines in 19 files changed: 811 ins; 920 del; 497 mod
Patch: https://git.openjdk.java.net/jdk/pull/9037.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/9037/head:pull/9037
PR: https://git.openjdk.java.net/jdk/pull/9037