RFR: 8282162: [vector] Optimize vector negation API
Xiaohong Gong
xgong at openjdk.java.net
Mon Mar 21 01:24:40 UTC 2022
On Sat, 19 Mar 2022 02:34:55 GMT, Jie Fu <jiefu at openjdk.org> wrote:
>> The current vector `"NEG"` is implemented by subtracting the vector from zero in case the architecture does not support a negation instruction. And to fit the predicate feature on architectures that support it, the masked vector `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be optimized to a single negation instruction on ARM SVE,
>> and so can the non-masked "NEG" on NEON. Besides, implementing the masked "NEG" with subtraction on architectures that support neither a negation instruction nor the predicate feature also saves several instructions compared with the current pattern.
>>
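>> As a rough Java-level sketch of these two fallback patterns (a minimal illustration using `IntVector`; the class and method names below are mine, not code from the library):
>>
>>     import jdk.incubator.vector.IntVector;
>>     import jdk.incubator.vector.VectorMask;
>>     import jdk.incubator.vector.VectorOperators;
>>     import jdk.incubator.vector.VectorSpecies;
>>
>>     public class NegFallbacks {
>>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>
>>         // Non-masked fallback: negate by subtracting the vector from zero.
>>         static IntVector neg(IntVector v) {
>>             return IntVector.zero(SPECIES).sub(v);
>>         }
>>
>>         // Masked fallback: bitwise-not the active lanes, then add 1 under the
>>         // same mask (two's-complement negation), i.e. "v.not(m).add(1, m)".
>>         static IntVector negMasked(IntVector v, VectorMask<Integer> m) {
>>             return v.lanewise(VectorOperators.NOT, m).add(1, m);
>>         }
>>     }
>>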
>> To optimize the VectorAPI negation, this patch moves the implementation from Java side to hotspot. The compiler will generate different nodes according to the architecture:
>> - Generate the (predicated) negation node if the architecture supports it; otherwise, generate the "`zero.sub(v)`" pattern for the non-masked operation.
>> - Generate `"zero.sub(v, m)"` for the masked operation if the architecture does not have the predicate feature; otherwise generate the original pattern `"v.xor(-1, m).add(1, m)"`.
>>
>> So with this patch, the following transformations are applied:
>>
>> For non-masked negation with NEON:
>>
>> movi v16.4s, #0x0
>> sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s
>>
>> and with SVE:
>>
>> mov z16.s, #0
>> sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s
>>
>> For masked negation with NEON:
>>
>> movi v17.4s, #0x1
>> mvn v19.16b, v18.16b
>> mov v20.16b, v16.16b
>> bsl v20.16b, v19.16b, v18.16b
>> add v19.4s, v20.4s, v17.4s
>> mov v18.16b, v16.16b
>> bsl v18.16b, v19.16b, v20.16b
>>
>> becomes:
>>
>> neg v18.4s, v17.4s
>> bsl v19.16b, v18.16b, v17.16b
>>
>> and with SVE:
>>
>> mov z16.s, #-1
>> mov z17.s, #1
>> eor z18.s, p0/m, z18.s, z16.s
>> add z18.s, p0/m, z18.s, z17.s
>>
>> becomes:
>>
>> neg z16.s, p0/m, z16.s
>>
>> Here are the performance gains for the benchmarks (see [1][2]) on ARM and x86 machines (note that the non-masked negation benchmarks show no improvement on x86, since no instructions are changed):
>>
>> NEON:
>> Benchmark Gain
>> Byte128Vector.NEG 1.029
>> Byte128Vector.NEGMasked 1.757
>> Short128Vector.NEG 1.041
>> Short128Vector.NEGMasked 1.659
>> Int128Vector.NEG 1.005
>> Int128Vector.NEGMasked 1.513
>> Long128Vector.NEG 1.003
>> Long128Vector.NEGMasked 1.878
>>
>> SVE with 512-bits:
>> Benchmark Gain
>> ByteMaxVector.NEG 1.10
>> ByteMaxVector.NEGMasked 1.165
>> ShortMaxVector.NEG 1.056
>> ShortMaxVector.NEGMasked 1.195
>> IntMaxVector.NEG 1.002
>> IntMaxVector.NEGMasked 1.239
>> LongMaxVector.NEG 1.031
>> LongMaxVector.NEGMasked 1.191
>>
>> X86 (non AVX-512):
>> Benchmark Gain
>> ByteMaxVector.NEGMasked 1.254
>> ShortMaxVector.NEGMasked 1.359
>> IntMaxVector.NEGMasked 1.431
>> LongMaxVector.NEGMasked 1.989
>>
>> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
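>>
>> For context, the linked benchmarks exercise the public negation API; the kernels boil down to loops of roughly this shape (an illustrative sketch assuming the array length is a multiple of the species length, not the actual benchmark source):
>>
>>     import jdk.incubator.vector.IntVector;
>>     import jdk.incubator.vector.VectorMask;
>>     import jdk.incubator.vector.VectorOperators;
>>     import jdk.incubator.vector.VectorSpecies;
>>
>>     public class NegKernels {
>>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>
>>         // NEG: negate every lane of `a` into `r`, one full vector at a time.
>>         static void neg(int[] a, int[] r) {
>>             for (int i = 0; i < a.length; i += SPECIES.length()) {
>>                 IntVector av = IntVector.fromArray(SPECIES, a, i);
>>                 av.lanewise(VectorOperators.NEG).intoArray(r, i);
>>             }
>>         }
>>
>>         // NEGMasked: only lanes selected by `m` are negated; unselected
>>         // lanes keep their original values.
>>         static void negMasked(int[] a, boolean[] mask, int[] r) {
>>             for (int i = 0; i < a.length; i += SPECIES.length()) {
>>                 IntVector av = IntVector.fromArray(SPECIES, a, i);
>>                 VectorMask<Integer> m = VectorMask.fromArray(SPECIES, mask, i);
>>                 av.lanewise(VectorOperators.NEG, m).intoArray(r, i);
>>             }
>>         }
>>     }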
>
> src/hotspot/share/opto/vectornode.cpp line 141:
>
>> 139: case T_BYTE:
>> 140: case T_SHORT:
>> 141: case T_INT: return Op_NegVI;
>
> Why not add `Op_NegVB` for `BYTE` and `Op_NegVS` for `SHORT`?
> Is there any performance drop for byte/short negation operation if both of them are handled as a NegVI vector?
The compiler can get the real element type from the vector type of `Op_NegVI`, so it can also handle the `BYTE` and `SHORT` basic types. I just don't want to add more new IR nodes, which would also need more match rules in the ad files.
> Is there any performance drop for byte/short negation operation if both of them are handled as a NegVI vector?
From the benchmark results I showed in the commit message, I didn't see any performance drop for byte/short. Thanks!
> src/hotspot/share/opto/vectornode.cpp line 1635:
>
>> 1633: }
>> 1634:
>> 1635: Node* NegVINode::Ideal(PhaseGVN* phase, bool can_reshape) {
>
> Much duplication in `NegVINode::Ideal` and `NegVLNode::Ideal`.
> Is it possible to refactor the implementation?
Yeah, maybe we need a superclass for `NegVINode` and `NegVLNode`. Thanks!
-------------
PR: https://git.openjdk.java.net/jdk/pull/7782