RFR: 8343689: AArch64: Optimize MulReduction implementation [v9]

Fri Aug 8 14:40:01 UTC 2025

> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies halves of the source vector using SVE instructions until the result is a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
> 
> Nothing changes for vectors of 128 bits or shorter: those still use the existing ASIMD implementation directly.
> 
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
> 
> Benchmark results:
> 
> Neoverse-V1 (SVE 256-bit):
> 
>   Benchmark                 (size)   Mode   master         PR  Units
>   ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>   ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>   IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>   LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>   FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>   DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
> 
> 
> Fujitsu A64FX (SVE 512-bit):
> 
>   Benchmark                 (size)   Mode   master         PR  Units
>   ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>   ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>   IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>   LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>   FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>   DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
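
The halving scheme described in the quoted summary can be modeled in scalar Java. This is only an illustrative sketch, not the code from the PR: the class name `MulReduceSketch`, the method `mulReduce`, and the parameter `lanesIn128Bits` are hypothetical, the `int[]` array stands in for the lanes of a vector register, and the lane count is assumed to be a power of two.

```java
// Hypothetical scalar model of the halving multiply reduction; not the PR's code.
class MulReduceSketch {

    // lanes:          values standing in for the lanes of one vector register
    //                 (length assumed to be a power of two)
    // lanesIn128Bits: number of lanes that fit into 128 bits, e.g. 4 for int
    static int mulReduce(int[] lanes, int lanesIn128Bits) {
        int[] v = lanes.clone(); // keep the source unmodified
        int len = v.length;

        // Halving steps: multiply the lower half by the upper half, lane by lane,
        // until the remaining data fits into 128 bits.
        while (len > lanesIn128Bits) {
            int half = len / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            len = half;
        }

        // Stand-in for the existing 128-bit path: reduce the remaining lanes.
        int result = 1;
        for (int i = 0; i < len; i++) {
            result *= v[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Eight int lanes model a 256-bit vector; four int lanes fit into 128 bits.
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(mulReduce(lanes, 4));
    }
}
```

For integer lanes the result does not depend on the order of the multiplications, since two's-complement multiplication is associative and commutative even when it overflows. For floating-point lanes a pairwise order like this one can round differently from a strictly sequential product.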

Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:

 - Address review comments and simplify the implementation

   - remove the loops from gt128b methods making them 256b only
   - fixup: missed fnoregs in instruct reduce_mulL_256b
   - use an extra vtmp3 reg for the 256b integer method
   - remove a no longer needed change in reduce_mul_integral_le128b
   - cleanup: unify comments
 - Merge commit '8193856af8546332bfa180cb45154a4093b4fd2c'
 - remove the strictly-ordered FP implementation as unused
 - Compare VL against MaxVectorSize instead of FloatRegister::sve_vl_max
 - Use a dedicated ptrue predicate register

   This shifts MulReduction performance on Neoverse V1 a bit. Here Before
   is the state before this specific commit (ebad6dd37e332da44222c50cd17c69f3ff3f0635)
   and After is this commit.

   | Benchmark                | Before (ops/ms) | After (ops/ms) | Diff (%) |
   | ------------------------ | --------------- | -------------- | -------- |
   | ByteMaxVector.MULLanes   | 9883.151        | 9093.557       | -7.99%   |
   | DoubleMaxVector.MULLanes | 2712.674        | 2607.367       | -3.89%   |
   | FloatMaxVector.MULLanes  | 3388.811        | 3291.429       | -2.88%   |
   | IntMaxVector.MULLanes    | 4765.554        | 5031.741       | +5.58%   |
   | LongMaxVector.MULLanes   | 2685.228        | 2896.445       | +7.88%   |
   | ShortMaxVector.MULLanes  | 5128.185        | 5197.656       | +1.35%   |
 - cleanup: update a copyright notice

   Co-authored-by: Hao Sun <haosun at nvidia.com>
 - fixup: remove undefined insts from aarch64-asmtest.py
 - cleanup: address nits, rename several symbols
 - cleanup: remove unreferenced definitions
 - Address review comments.

   - fixup: disable FP mul reduction auto-vectorization for all targets
   - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
     reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
   - cleanup: replace a complex lambda in the above methods with a loop
   - cleanup: rename symbols to follow the existing naming convention
   - cleanup: add asserts to SVE only instructions
   - split mul FP reduction instructions into strictly-ordered (default)
     and explicitly non strictly-ordered
   - remove redundant conditions in TestVectorFPReduction.java

   Benchmark results:

   Neoverse-V1 (SVE 256-bit):

   | Benchmark                 | Before   | After    | Units  | Diff  |
   |---------------------------|----------|----------|--------|-------|
   | ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
   | DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
   | FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
   | IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
   | LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
   | ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms |  984% |
 - ... and 5 more: https://git.openjdk.org/jdk/compare/8193856a...5b06b638

-------------

Changes: https://git.openjdk.org/jdk/pull/23181/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=08
  Stats: 383 lines in 9 files changed: 236 ins; 2 del; 145 mod
  Patch: https://git.openjdk.org/jdk/pull/23181.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181

PR: https://git.openjdk.org/jdk/pull/23181