RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]
erifan
duke at openjdk.org
Thu Jul 17 09:09:14 UTC 2025
> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
> lower than that of `fromLong`, so this patch performs the conversion for these cases.
>
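The all-lanes condition can be sketched in plain Java (a minimal illustration; the class and method names below are hypothetical, not part of the patch):

```java
public class FromLongFold {
    // fromLong(SPECIES, l) only consults the low vlen bits of l,
    // where vlen is the species' lane count (1..64).
    static long laneMask(int vlen) {
        return vlen == 64 ? -1L : (1L << vlen) - 1;
    }

    // All low vlen bits set => equivalent to maskAll(true)
    static boolean isAllTrue(long l, int vlen) {
        return (l & laneMask(vlen)) == laneMask(vlen);
    }

    // All low vlen bits clear => equivalent to maskAll(false)
    static boolean isAllFalse(long l, int vlen) {
        return (l & laneMask(vlen)) == 0;
    }
}
```

Note that bits above the lane count are ignored, so e.g. `l = 0x100` still folds to `maskAll(false)` for an 8-lane species.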
> The conversion is done in C2's IGVN phase. On platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during the intrinsification process if `MaskAll` or `Replicate` is supported.
>
> Since this optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of the loop, we can't see a noticeable performance change from this conversion alone.
>
> This conversion also enables further optimizations that recognize maskAll patterns (see [1]), with which we observe a performance improvement of about 7% on both aarch64 and x64.
>
> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations that recognize `VectorLongToMask` are affected, such as
>
> VectorMaskToLong (VectorLongToMask x) => x
>
>
> Hence, this patch also adds the following optimizations:
>
> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0
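The value folded by the two rules above can be checked in plain Java, where `-1ULL >> (64 - vlen)` becomes `-1L >>> (64 - vlen)` (the class and method names are illustrative only):

```java
public class MaskAllToLong {
    // toLong of a maskAll/Replicate pattern: x is -1 (all lanes set)
    // or 0 (all lanes clear); keep only the low vlen bits (vlen 1..64).
    static long fold(long x, int vlen) {
        return x & (-1L >>> (64 - vlen));
    }
}
```

For `vlen == 64` the shift amount is 0 (Java masks long shift counts to 0..63), so the full 64-bit value is kept.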
>
> VectorMaskCast (VectorMaskCast x) => x
>
> And we can see a noticeable performance improvement with the above optimizations for floating-point types.
>
> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`:
>
> Benchmark                          Unit    Before       Error        After        Error        Uplift
> microMaskFromLongToLong_Double128  ops/s   1522384.986  1324881.46   2835774480   403575069.7  1862.71
> microMaskFromLongToLong_Double256  ops/s   4275.415598  28.560622    4285.587451  27.633101    1
> microMaskFromLongToLong_Double512  ops/s   3702.171936  9.528497     3692.747579  18.47744     0.99
> microMaskFromLongToLong_Double64   ops/s   4624.452243  37.388427    4616.320519  23.455954    0.99
> microMaskFromLongToLong_Float128   ops/s   1239661.887  1286803.852  2842927993   360468218.3  2293.3
> microMaskFromLongToLong_Float256   ops/s   3681.64954   15.153633    3685.411771  21.737124    1
> microMaskFromLongToLong_Float512   ops/s   3007.563025  10.189944    3022.002986  14.137287    1
> microMaskFromLongToLong_Float64    ops/s   1646664.258  1375451.279  2948453900   397472562.4  1790.56
>
>
> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:
>
> Benchmark Unit Before Error After Error Uplift
> microMaskFromLongToLong_Double...
erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
- Refactor the implementation
Do the conversion in C2's IGVN phase to cover more cases.
- Merge branch 'master' into JDK-8356760
- Simplify the test code
- Address some review comments
Add support for the following patterns:
toLong(maskAll(true)) => (-1ULL >> (64 - vlen))
toLong(maskAll(false)) => 0
And add more test cases.
- Merge branch 'master' into JDK-8356760
- 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases
If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would
set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent
to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
lower than that of `fromLong`. This patch does the conversion
for these cases if `l` is a compile-time constant.
And this conversion also enables further optimizations that recognize
maskAll patterns, see [1].
Some JTReg test cases are added to ensure the optimization is effective.
I tried many different ways to write a JMH benchmark, but failed. Since
the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific
compile-time constant, the statement is hoisted out of the loop.
If we don't use a loop, the hotspot shifts to other instructions, and
no obvious performance change was observed. However, combined with the
optimization of [1], we can observe a performance improvement of about
7% on both aarch64 and x64.
The patch was tested on both aarch64 and x64; all tier1, tier2 and
tier3 tests passed.
[1] https://github.com/openjdk/jdk/pull/24674
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/25793/files
- new: https://git.openjdk.org/jdk/pull/25793/files/9f07d5c7..8ebe5e56
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02-03
Stats: 21470 lines in 667 files changed: 10937 ins; 6238 del; 4295 mod
Patch: https://git.openjdk.org/jdk/pull/25793.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793
PR: https://git.openjdk.org/jdk/pull/25793
More information about the hotspot-compiler-dev mailing list