RFR: 8282162: [vector] Optimize vector negation API
Xiaohong Gong
xgong at openjdk.java.net
Tue Mar 15 01:18:46 UTC 2022
On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
> The current vector `"NEG"` is implemented by subtracting the vector from zero in case the architecture does not support a negation instruction. And to fit the predicate feature for architectures that support it, the masked vector `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be optimized to a single negation instruction on ARM SVE,
> and so can the non-masked "NEG" on NEON. Besides, implementing the masked "NEG" with subtraction on architectures that support neither a negation instruction nor the predicate feature also saves several instructions compared with the current pattern.
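>
> For illustration, the two fallback patterns above can be written against the public Vector API roughly as follows (a minimal sketch for the int case; the helper names are illustrative, not the actual JDK implementation):
>
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorOperators;
>
> // -v == 0 - v : the fallback when the target has no vector negation instruction
> static IntVector negViaSub(IntVector v) {
>     return IntVector.zero(v.species()).sub(v);
> }
>
> // -v == (~v) + 1 in the active lanes : the current masked pattern
> static IntVector maskedNegViaNotAdd(IntVector v, VectorMask<Integer> m) {
>     return v.lanewise(VectorOperators.NOT, m)
>             .lanewise(VectorOperators.ADD, 1, m);
> }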
>
> To optimize the Vector API negation, this patch moves the implementation from the Java side to HotSpot. The compiler generates different nodes according to the architecture:
> - Generate the (predicated) negation node if the architecture supports it; otherwise, generate the "`zero.sub(v)`" pattern for the non-masked operation.
> - Generate `"zero.sub(v, m)"` for the masked operation if the architecture does not have the predicate feature; otherwise generate the original pattern `"v.xor(-1, m).add(1, m)"`. The affected API entry points are sketched right after this list.
>
> So with this patch, the following transformations are applied:
>
> For non-masked negation with NEON:
>
> movi v16.4s, #0x0
> sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s
>
> and with SVE:
>
> mov z16.s, #0
> sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s
>
> For masked negation with NEON:
>
> movi v17.4s, #0x1
> mvn v19.16b, v18.16b
> mov v20.16b, v16.16b            ==>   neg v18.4s, v17.4s
> bsl v20.16b, v19.16b, v18.16b         bsl v19.16b, v18.16b, v17.16b
> add v19.4s, v20.4s, v17.4s
> mov v18.16b, v16.16b
> bsl v18.16b, v19.16b, v20.16b
>
> and with SVE:
>
> mov z16.s, #-1
> mov z17.s, #1                   ==>   neg z16.s, p0/m, z16.s
> eor z18.s, p0/m, z18.s, z16.s
> add z18.s, p0/m, z18.s, z17.s
>
> Here are the performance gains for the benchmarks (see [1][2]) on ARM and x86 machines (note that the non-masked negation benchmarks show no improvement on x86, since no instructions change there):
>
> NEON:
> Benchmark Gain
> Byte128Vector.NEG 1.029
> Byte128Vector.NEGMasked 1.757
> Short128Vector.NEG 1.041
> Short128Vector.NEGMasked 1.659
> Int128Vector.NEG 1.005
> Int128Vector.NEGMasked 1.513
> Long128Vector.NEG 1.003
> Long128Vector.NEGMasked 1.878
>
> SVE with 512-bit vectors:
> Benchmark Gain
> ByteMaxVector.NEG 1.10
> ByteMaxVector.NEGMasked 1.165
> ShortMaxVector.NEG 1.056
> ShortMaxVector.NEGMasked 1.195
> IntMaxVector.NEG 1.002
> IntMaxVector.NEGMasked 1.239
> LongMaxVector.NEG 1.031
> LongMaxVector.NEGMasked 1.191
>
> X86 (without AVX-512):
> Benchmark Gain
> ByteMaxVector.NEGMasked 1.254
> ShortMaxVector.NEGMasked 1.359
> IntMaxVector.NEGMasked 1.431
> LongMaxVector.NEGMasked 1.989
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
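>
> The NEG/NEGMasked micro-benchmarks referenced in [1][2] follow the usual JMH shape; a simplified sketch of the int variant (class name, field names and sizes here are illustrative, see the links for the real benchmarks):
>
> import java.util.concurrent.ThreadLocalRandom;
> import jdk.incubator.vector.*;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class NegBench {
>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>     static final int SIZE = 1024;
>
>     int[] in = new int[SIZE];
>     int[] out = new int[SIZE];
>     boolean[] maskBits = new boolean[SIZE];
>
>     @Setup
>     public void setup() {
>         for (int i = 0; i < SIZE; i++) {
>             in[i] = ThreadLocalRandom.current().nextInt();
>             maskBits[i] = (i % 2) == 0;
>         }
>     }
>
>     @Benchmark
>     public void NEG() {
>         for (int i = 0; i < SIZE; i += SPECIES.length()) {
>             IntVector.fromArray(SPECIES, in, i)
>                      .lanewise(VectorOperators.NEG)
>                      .intoArray(out, i);
>         }
>     }
>
>     @Benchmark
>     public void NEGMasked() {
>         for (int i = 0; i < SIZE; i += SPECIES.length()) {
>             VectorMask<Integer> m = VectorMask.fromArray(SPECIES, maskBits, i);
>             IntVector.fromArray(SPECIES, in, i)
>                      .lanewise(VectorOperators.NEG, m)
>                      .intoArray(out, i);
>         }
>     }
> }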
Hi, could anyone please take a look at this PR? Thanks so much!
-------------
PR: https://git.openjdk.java.net/jdk/pull/7782