RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]
erifan
duke at openjdk.org
Thu Jul 17 09:09:14 UTC 2025
> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
> lower than that of `fromLong`, so this patch performs the conversion for these cases.
>
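The all-lanes condition can be sketched in plain Java (a minimal illustration; the class and method names below are hypothetical, not part of the patch):

```java
public class FromLongFold {
    // fromLong(SPECIES, l) only consults the low vlen bits of l,
    // where vlen is the species' lane count (1..64).
    static long laneMask(int vlen) {
        return vlen == 64 ? -1L : (1L << vlen) - 1;
    }

    // All low vlen bits set => equivalent to maskAll(true)
    static boolean isAllTrue(long l, int vlen) {
        return (l & laneMask(vlen)) == laneMask(vlen);
    }

    // All low vlen bits clear => equivalent to maskAll(false)
    static boolean isAllFalse(long l, int vlen) {
        return (l & laneMask(vlen)) == 0;
    }
}
```

Note that bits above the lane count are ignored, so e.g. `l = 0x100` still folds to `maskAll(false)` for an 8-lane species.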
> The conversion is done in C2's IGVN phase. On platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during the intrinsification process if `MaskAll` or `Replicate` is supported.
>
> Since this optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of the loop, we can't see a noticeable performance change from this conversion alone.
>
> This conversion also enables further optimizations that recognize maskAll patterns (see [1]), with which we observe a performance improvement of about 7% on both aarch64 and x64.
>
> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations that recognize `VectorLongToMask` are affected, such as
>
> VectorMaskToLong (VectorLongToMask x) => x
>
>
> Hence, this patch also adds the following optimizations:
>
> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
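The value folded by the two rules above can be checked in plain Java, where `-1ULL >> (64 - vlen)` becomes `-1L >>> (64 - vlen)` (the class and method names are illustrative only):

```java
public class MaskAllToLong {
    // toLong of a maskAll/Replicate pattern: x is -1 (all lanes set)
    // or 0 (all lanes clear); keep only the low vlen bits (vlen 1..64).
    static long fold(long x, int vlen) {
        return x & (-1L >>> (64 - vlen));
    }
}
```

For `vlen == 64` the shift amount is 0 (Java masks long shift counts to 0..63), so the full 64-bit value is kept.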
>
> VectorMaskCast (VectorMaskCast x) => x
>
> And we can see a noticeable performance improvement with the above optimizations for floating-point types.
>
> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`:
>
> Benchmark                          Unit    Before       Error        After        Error        Uplift
> microMaskFromLongToLong_Double128  ops/s   1522384.986  1324881.46   2835774480   403575069.7  1862.71
> microMaskFromLongToLong_Double256  ops/s   4275.415598  28.560622    4285.587451  27.633101    1
> microMaskFromLongToLong_Double512  ops/s   3702.171936  9.528497     3692.747579  18.47744     0.99
> microMaskFromLongToLong_Double64   ops/s   4624.452243  37.388427    4616.320519  23.455954    0.99
> microMaskFromLongToLong_Float128   ops/s   1239661.887  1286803.852  2842927993   360468218.3  2293.3
> microMaskFromLongToLong_Float256   ops/s   3681.64954   15.153633    3685.411771  21.737124    1
> microMaskFromLongToLong_Float512   ops/s   3007.563025  10.189944    3022.002986  14.137287    1
> microMaskFromLongToLong_Float64    ops/s   1646664.258  1375451.279  2948453900   397472562.4  1790.56
>
>
> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:
>
> Benchmark Unit Before Error After Error Uplift
> microMaskFromLongToLong_Double...
erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
- Refactor the implementation
Do the conversion in C2's IGVN phase to cover more cases.
- Merge branch 'master' into JDK-8356760
- Simplify the test code
- Address some review comments
Add support for the following patterns:
toLong(maskAll(true)) => (-1ULL >> (64 - vlen))
toLong(maskAll(false)) => 0
And add more test cases.
- Merge branch 'master' into JDK-8356760
- 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases
If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would
set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent
to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
lower than that of `fromLong`. This patch does the conversion
for these cases if `l` is a compile-time constant.
And this conversion also enables further optimizations that recognize
maskAll patterns, see [1].
Some JTReg test cases are added to ensure the optimization is effective.
I tried many different ways to write a JMH benchmark, but failed. Since
the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific
compile-time constant, the statement is hoisted out of the loop.
If we don't use a loop, the hotspot shifts to other instructions, and
no obvious performance change was observed. However, combined with the
optimization of [1], we can observe a performance improvement of about
7% on both aarch64 and x64.
The patch was tested on both aarch64 and x64; all tier1, tier2 and
tier3 tests passed.
[1] https://github.com/openjdk/jdk/pull/24674
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/25793/files
- new: https://git.openjdk.org/jdk/pull/25793/files/9f07d5c7..8ebe5e56
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02-03
Stats: 21470 lines in 667 files changed: 10937 ins; 6238 del; 4295 mod
Patch: https://git.openjdk.org/jdk/pull/25793.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793
PR: https://git.openjdk.org/jdk/pull/25793
More information about the hotspot-compiler-dev mailing list