RFR: 8286941: Add mask IR for partial vector operations for ARM SVE

Mon Jun 6 09:48:08 UTC 2022

The Vector API SVE backend supports vector operations whose vector length is smaller than the maximum vector length supported by the current hardware. We call them partial vector operations. For some partial operations, such as vector load/store and reductions, we need to generate a mask based on the actual vector length and use it to govern the operation so that the results are correct.

For example, if the user defines an IntVector with a 256-bit species and runs it on SVE hardware whose maximum vector size is 512 bits, all the 256-bit int vector operations are partial. For some ops, a mask is generated in which all lanes above the actual vector length are set to 0.
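
As a rough illustration of the lane arithmetic in this example, the following Java sketch models the mask as a boolean array. The class and method names are hypothetical; they are not part of HotSpot or the Vector API, and the real mask lives in an SVE predicate register.

```java
final class PartialMaskSketch {
    // Hypothetical helper: builds a lane mask for a register of maxBits in
    // which only the low vectorBits are in use. Lanes are laneBits wide.
    static boolean[] partialMask(int maxBits, int vectorBits, int laneBits) {
        int totalLanes = maxBits / laneBits;
        int activeLanes = vectorBits / laneBits;
        boolean[] mask = new boolean[totalLanes]; // all lanes start as false (0)
        for (int i = 0; i < activeLanes; i++) {
            mask[i] = true; // lanes below the actual vector length are active
        }
        return mask;
    }

    public static void main(String[] args) {
        // 256-bit int species on 512-bit hardware: 16 lanes, 8 of them active.
        boolean[] mask = partialMask(512, 256, Integer.SIZE);
        StringBuilder sb = new StringBuilder();
        for (boolean b : mask) {
            sb.append(b ? '1' : '0');
        }
        System.out.println(sb); // prints 1111111100000000 (lane 0 first)
    }
}
```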

Currently the mask is generated in the backend, together with the code generation for each op in its match rule. This produces many duplicate instructions for operations that have the same vector type. Moreover, the mask generation is loop-invariant and could be hoisted out of the loop.

Here is an example for vector load and add reduction inside a loop:

  ptrue   p0.s, vl8             ; mask generation
  ld1w    {z16.s}, p0/z, [x14]  ; load vector

  ptrue   p0.s, vl8             ; mask generation
  uaddv   d17, p0, z16.s        ; add reduction
  smov    x14, v17.s[0]

As we can see, the mask generation instruction `ptrue` is duplicated. To improve this, the patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation can then be eliminated by GVN and hoisted out of the loop.
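
The intended effect can be modeled in plain Java. This is only a sketch under stated assumptions: the helpers `ptrue`, `maskedLoad` and `maskedAddReduce` are hypothetical stand-ins for the SVE instructions in the example above, not C2 or Vector API code. The point is that the load and the reduction share one mask, which is created once before the loop.

```java
final class HoistedMaskSketch {
    static final int MAX_LANES = 16;    // assumed: 512-bit register, 32-bit lanes
    static final int ACTIVE_LANES = 8;  // assumed: 256-bit species

    // Models "ptrue p0.s, vl8": the first n lanes are active.
    static boolean[] ptrue(int n) {
        boolean[] p = new boolean[MAX_LANES];
        for (int i = 0; i < n; i++) {
            p[i] = true;
        }
        return p;
    }

    // Models a predicated load with zeroing: inactive lanes stay 0.
    static int[] maskedLoad(int[] src, int offset, boolean[] mask) {
        int[] reg = new int[MAX_LANES];
        for (int i = 0; i < MAX_LANES; i++) {
            if (mask[i]) {
                reg[i] = src[offset + i];
            }
        }
        return reg;
    }

    // Models a predicated add reduction: only active lanes contribute.
    static int maskedAddReduce(int[] reg, boolean[] mask) {
        int sum = 0;
        for (int i = 0; i < MAX_LANES; i++) {
            if (mask[i]) {
                sum += reg[i];
            }
        }
        return sum;
    }

    // Sums the array in 8-lane chunks; any tail shorter than 8 is ignored.
    static int sumAll(int[] a) {
        boolean[] mask = ptrue(ACTIVE_LANES); // one mask, outside the loop
        int total = 0;
        for (int i = 0; i + ACTIVE_LANES <= a.length; i += ACTIVE_LANES) {
            int[] v = maskedLoad(a, i, mask);     // load uses the shared mask
            total += maskedAddReduce(v, mask);    // so does the reduction
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[32];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        System.out.println(sumAll(a));
    }
}
```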

Note that masked vector operations do not need an additional mask, even when the vector length is smaller than the max vector register size, because the higher bits of the input mask have already been cleared.

Here are the performance gains for 256-bit vector reductions on a 512-bit SVE system:

  Benchmark                  size   Gain
  Byte256Vector.ADDLanes     1024   0.999
  Byte256Vector.ANDLanes     1024   1.065
  Byte256Vector.MAXLanes     1024   1.064
  Byte256Vector.MINLanes     1024   1.062
  Byte256Vector.ORLanes      1024   1.072
  Byte256Vector.XORLanes     1024   1.041
  Short256Vector.ADDLanes    1024   1.017
  Short256Vector.ANDLanes    1024   1.044
  Short256Vector.MAXLanes    1024   1.049
  Short256Vector.MINLanes    1024   1.049
  Short256Vector.ORLanes     1024   1.089
  Short256Vector.XORLanes    1024   1.047
  Int256Vector.ADDLanes      1024   1.045
  Int256Vector.ANDLanes      1024   1.078
  Int256Vector.MAXLanes      1024   1.123
  Int256Vector.MINLanes      1024   1.129
  Int256Vector.ORLanes       1024   1.078
  Int256Vector.XORLanes      1024   1.072
  Long256Vector.ADDLanes     1024   1.059
  Long256Vector.ANDLanes     1024   1.101
  Long256Vector.MAXLanes     1024   1.079
  Long256Vector.MINLanes     1024   1.099
  Long256Vector.ORLanes      1024   1.098
  Long256Vector.XORLanes     1024   1.110
  Float256Vector.ADDLanes    1024   1.033
  Float256Vector.MAXLanes    1024   1.156
  Float256Vector.MINLanes    1024   1.151
  Double256Vector.ADDLanes   1024   1.062
  Double256Vector.MAXLanes   1024   1.145
  Double256Vector.MINLanes   1024   1.140

This patch also adds 32-bit variants of the SVE whileXX instructions, with one more match rule for `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction in most VectorMaskGen cases, as shown below:

  sxtw    x14, w14
  whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14
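
To see why the `sxtw` can be dropped, here is a simplified Java model of `whilelo` in its 64-bit and 32-bit forms. It is a sketch, not the architectural pseudocode: the method names are hypothetical, and it only models the unsigned "lower than" comparison that stops at the first failing lane. For an int length, both forms yield the same predicate.

```java
final class WhileLoSketch {
    // 64-bit form: lane i is active while (start + i) < limit, unsigned.
    static boolean[] whilelo64(long start, long limit, int lanes) {
        boolean[] p = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            if (Long.compareUnsigned(start + i, limit) >= 0) {
                break; // once the condition fails, remaining lanes stay false
            }
            p[i] = true;
        }
        return p;
    }

    // 32-bit form: same comparison on 32-bit operands, no sign extension needed.
    static boolean[] whilelo32(int start, int limit, int lanes) {
        boolean[] p = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            if (Integer.compareUnsigned(start + i, limit) >= 0) {
                break;
            }
            p[i] = true;
        }
        return p;
    }

    public static void main(String[] args) {
        int len = 5;
        // "sxtw" followed by the 64-bit whilelo, versus the 32-bit whilelo.
        boolean[] wide = whilelo64(0L, (long) len, 16);
        boolean[] narrow = whilelo32(0, len, 16);
        System.out.println(java.util.Arrays.equals(wide, narrow));
    }
}
```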

-------------

Commit messages:
 - 8286941: Add mask IR for partial vector operations for ARM SVE

Changes: https://git.openjdk.java.net/jdk/pull/9037/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8286941
  Stats: 2228 lines in 19 files changed: 811 ins; 920 del; 497 mod
  Patch: https://git.openjdk.java.net/jdk/pull/9037.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/9037/head:pull/9037

PR: https://git.openjdk.java.net/jdk/pull/9037