RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6]
Jatin Bhateja
jbhateja at openjdk.org
Fri Jul 25 07:28:42 UTC 2025
On Fri, 25 Jul 2025 07:24:28 GMT, erifan <duke at openjdk.org> wrote:
>> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. Because the cost of `maskAll` is
>> relatively smaller than that of `fromLong`, this patch does the conversion for these cases.
>>
>> The conversion is done in C2's IGVN phase. On platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during the intrinsification process if `MaskAll` or `Replicate` is supported.
>>
>> This optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of the loop, so we can't see a noticeable performance change.
>>
>> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64.
>>
>> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like
>>
>> VectorMaskToLong (VectorLongToMask x) => x
>>
>>
>> Hence, this patch also added the following optimizations:
>>
>> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
>> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
>>
>> VectorMaskCast (VectorMaskCast x) => x
>>
>> And we can see noticeable performance improvement with the above optimizations for floating-point types.
>>
>> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`:
>>
>> Benchmark                          Unit   Before       Error        After        Error        Uplift
>> microMaskFromLongToLong_Double128  ops/s  1522384.986  1324881.46   2835774480   403575069.7  1862.71
>> microMaskFromLongToLong_Double256  ops/s  4275.415598  28.560622    4285.587451  27.633101    1
>> microMaskFromLongToLong_Double512  ops/s  3702.171936  9.528497     3692.747579  18.47744     0.99
>> microMaskFromLongToLong_Double64   ops/s  4624.452243  37.388427    4616.320519  23.455954    0.99
>> microMaskFromLongToLong_Float128   ops/s  1239661.887  1286803.852  2842927993   360468218.3  2293.3
>> microMaskFromLongToLong_Float256   ops/s  3681.64954   15.153633    3685.411771  21.737124    1
>> microMaskFromLongToLong_Float512   ops/s  3007.563025  10.189944    3022.002986  14.137287    1
>> microMaskFromLongToLong_Float64    ops/s  1646664.258  1375451.279  2948453900   397472562.4  1790.56
>>
>>
>> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:
>>
>> Benchm...
>
> erifan has updated the pull request incrementally with one additional commit since the last revision:
>
> Add an assertion
Still looks good.
src/hotspot/share/opto/vectornode.cpp line 1989:
> 1987: if (in1->Opcode() == Op_VectorStoreMask) {
> 1988: in1 = in1->in(1);
> 1989: assert(!in1->bottom_type()->isa_vectmask(), "sanity");
The assertion should precede any other statement in the block :-)
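
For readers following the quoted description, the bit arithmetic behind the all-true/all-false conversion can be modeled in plain Java. This is only a sketch, not the C2 implementation: the class `MaskAllModel` and its method names are hypothetical, and it assumes that only the low `vlen` bits of the input long are significant, with `vlen` between 1 and 64.

```java
public final class MaskAllModel {
    private MaskAllModel() {}

    // Long image of an all-true mask with vlen lanes: the low vlen bits set.
    // Mirrors the (-1ULL >> (64 - vlen)) term from the quoted rules.
    // Assumption: 1 <= vlen <= 64 (Java masks shift counts modulo 64).
    static long allTrueBits(int vlen) {
        return -1L >>> (64 - vlen);
    }

    // Classifies a fromLong-style input (hypothetical helper):
    //   Boolean.TRUE  -> every lane set, could become maskAll(true)
    //   Boolean.FALSE -> no lane set, could become maskAll(false)
    //   null          -> mixed lanes, no conversion applies
    static Boolean asMaskAll(long bits, int vlen) {
        long laneBits = bits & allTrueBits(vlen); // assumed: bits above vlen are ignored
        if (laneBits == allTrueBits(vlen)) {
            return Boolean.TRUE;
        }
        if (laneBits == 0L) {
            return Boolean.FALSE;
        }
        return null;
    }

    // Models VectorMaskToLong (MaskAll x) => x & (-1ULL >> (64 - vlen)),
    // where x is -1 (all true) or 0 (all false).
    static long maskAllToLong(long x, int vlen) {
        return x & allTrueBits(vlen);
    }
}
```

In this model a 4-lane mask built from `-1` maps back to `0b1111`, and a mask built from `0` maps back to `0`, which is why the round trip through `MaskAll` still yields a value that depends only on the lane count.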
-------------
Marked as reviewed by jbhateja (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3054381237
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230359167