RFR: 8282162: [vector] Optimize vector negation API

Fri Mar 11 06:37:00 UTC 2022

The current vector `"NEG"` is implemented by subtracting the vector from zero when the architecture does not support a negation instruction. To fit the predicate feature on architectures that support it, the masked vector `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. On ARM SVE, both can be optimized to a single negation instruction,
and so can the non-masked "NEG" on NEON. In addition, on architectures that support neither a negation instruction nor the predicate feature, implementing the masked "NEG" with subtraction takes several fewer instructions than the current pattern.

To optimize the Vector API negation, this patch moves the implementation from the Java side to HotSpot. The compiler generates different nodes depending on the architecture:
  - Generate the (predicated) negation node if the architecture supports it; otherwise, generate the `"zero.sub(v)"` pattern for the non-masked operation.
  - For the masked operation, generate `"zero.sub(v, m)"` if the architecture does not have the predicate feature; otherwise, generate the original pattern `"v.xor(-1, m).add(1, m)"` (the same computation as `"v.not(m).add(1, m)"` above).
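
To illustrate why these patterns are interchangeable, here is a minimal scalar sketch in plain Java that models each one lane by lane. It is not the HotSpot implementation and does not use the Vector API; the class and method names (MaskedNegSketch, negViaXorAdd, negViaZeroSub, negDirect) are hypothetical. It also assumes that the `"zero.sub(v, m)"` shorthand means "subtract from zero, then keep the original lane where the mask is unset", which is how the neg + bsl sequence in the masked NEON listing below behaves.

```java
import java.util.Arrays;

final class MaskedNegSketch {
    // Models "v.xor(-1, m).add(1, m)": two masked steps, unset lanes keep v.
    static int[] negViaXorAdd(int[] v, boolean[] m) {
        int[] r = v.clone();
        for (int i = 0; i < r.length; i++) {
            if (m[i]) r[i] = r[i] ^ -1;   // masked bitwise NOT
        }
        for (int i = 0; i < r.length; i++) {
            if (m[i]) r[i] = r[i] + 1;    // masked add of 1
        }
        return r;
    }

    // Models the subtraction form (assumption): compute 0 - v for every lane,
    // then select the result only where the mask is set.
    static int[] negViaZeroSub(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            int sub = 0 - v[i];
            r[i] = m[i] ? sub : v[i];     // blend with the original lane
        }
        return r;
    }

    // Models a single predicated negation instruction.
    static int[] negDirect(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = m[i] ? -v[i] : v[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {1, -2, 0, Integer.MIN_VALUE};
        boolean[] m = {true, false, true, true};
        // All three lines print the same lane values.
        System.out.println(Arrays.toString(negViaXorAdd(v, m)));
        System.out.println(Arrays.toString(negViaZeroSub(v, m)));
        System.out.println(Arrays.toString(negDirect(v, m)));
    }
}
```

The three methods agree for every input, including Integer.MIN_VALUE, because two's complement negation is exactly "invert all bits, then add one" with wraparound. That equivalence is what lets the compiler pick whichever form is cheapest on the target.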

So with this patch, the following transformations are applied:

For non-masked negation with NEON:

  movi    v16.4s, #0x0
  sub v17.4s, v16.4s, v17.4s       ==> neg v17.4s, v17.4s

and with SVE:

  mov z16.s, #0
  sub z18.s, z16.s, z17.s          ==> neg z16.s, p7/m, z16.s

For masked negation with NEON:

  movi    v17.4s, #0x1
  mvn v19.16b, v18.16b
  mov v20.16b, v16.16b             ==>  neg v18.4s, v17.4s
  bsl v20.16b, v19.16b, v18.16b         bsl v19.16b, v18.16b, v17.16b
  add v19.4s, v20.4s, v17.4s
  mov v18.16b, v16.16b
  bsl v18.16b, v19.16b, v20.16b

and with SVE:

  mov z16.s, #-1
  mov z17.s, #1                    ==> neg z16.s, p0/m, z16.s
  eor z18.s, p0/m, z18.s, z16.s
  add z18.s, p0/m, z18.s, z17.s

Here are the performance gains for the benchmarks (see [1][2]) on ARM and x86 machines (note that the non-masked negation benchmarks show no improvement on x86, since the generated instructions are unchanged):

NEON:
Benchmark                Gain
Byte128Vector.NEG        1.029
Byte128Vector.NEGMasked  1.757
Short128Vector.NEG       1.041
Short128Vector.NEGMasked 1.659
Int128Vector.NEG         1.005
Int128Vector.NEGMasked   1.513
Long128Vector.NEG        1.003
Long128Vector.NEGMasked  1.878

SVE (512-bit vectors):
Benchmark                Gain
ByteMaxVector.NEG        1.10
ByteMaxVector.NEGMasked  1.165
ShortMaxVector.NEG       1.056
ShortMaxVector.NEGMasked 1.195
IntMaxVector.NEG         1.002
IntMaxVector.NEGMasked   1.239
LongMaxVector.NEG        1.031
LongMaxVector.NEGMasked  1.191

x86 (non-AVX-512):
Benchmark                Gain
ByteMaxVector.NEGMasked  1.254
ShortMaxVector.NEGMasked 1.359
IntMaxVector.NEGMasked   1.431
LongMaxVector.NEGMasked  1.989

[1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

-------------

Commit messages:
 - 8282162: [vector] Optimize vector negation API

Changes: https://git.openjdk.java.net/jdk/pull/7782/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8282162
  Stats: 308 lines in 15 files changed: 267 ins; 25 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7782.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782

PR: https://git.openjdk.java.net/jdk/pull/7782