RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]

Wed Feb 26 14:54:45 UTC 2025

> Add an SVE specialization of the reduce_mul intrinsic for vectors that are 256 bits or longer. It repeatedly multiplies the two halves of the source vector using SVE instructions until the result is a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
> 
> Nothing changes for vectors that are 128 bits or shorter: those still use the existing ASIMD implementation directly.
> 
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
> 
> Benchmark results:
> 
> Neoverse-V1 (SVE 256-bit):
> 
>   Benchmark                 (size)   Mode   master         PR  Units
>   ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>   ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>   IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>   LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>   FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>   DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
> 
> 
> Fujitsu A64FX (SVE 512-bit):
> 
>   Benchmark                 (size)   Mode   master         PR  Units
>   ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>   ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>   IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>   LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>   FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>   DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
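
For readers who want the idea in a few lines, the halving scheme from the description above can be modeled in scalar Java. This is only a sketch of the arithmetic, not the HotSpot code: the class name, the method names, and the LANES_128 constant are made up for illustration, and an int array stands in for a vector register.

```java
import java.util.Arrays;

class MulReduceSketch {
    // Assumption for illustration: four 32-bit int lanes fit in 128 bits.
    static final int LANES_128 = 4;

    // Stand-in for the existing path used for vectors of 128 bits or fewer.
    static int reduceLe128(int[] lanes, int count) {
        int acc = 1;
        for (int i = 0; i < count; i++) {
            acc *= lanes[i];
        }
        return acc;
    }

    // Models the long-vector path: multiply the upper half into the lower
    // half until only 128 bits' worth of lanes remain, then hand over.
    // Expects a power-of-two lane count, as with real vector lengths.
    static int reduceMul(int[] src) {
        int[] tmp = Arrays.copyOf(src, src.length); // work on a copy
        int width = tmp.length;
        while (width > LANES_128) {
            int half = width / 2;
            for (int i = 0; i < half; i++) {
                tmp[i] *= tmp[i + half];
            }
            width = half;
        }
        return reduceLe128(tmp, width);
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4, 5, 6, 7, 8}; // a 256-bit vector of ints
        System.out.println(reduceMul(v));   // prints 40320
    }
}
```

Each pass halves the number of live lanes, so a 256-bit input needs one fold and a 512-bit input needs two before the 128-bit stage takes over. Reordering the products is safe here because wrapping int multiplication is associative and commutative.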

Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:

 - fixup: don't modify the value in vsrc

   Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
   change, the result of recursive folding is held in vtmp1. To be able to
   pass this intermediate result to reduce_mul_integral_le128b(), we would
   have to use another temporary FloatRegister, as vtmp1 would essentially
   act as vsrc. It's possible to get around this, however:
   reduce_mul_integral_le128b() is modified so it's possible to pass
   matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
   temporary register in rules that match to reduce_mul_integral_gt128b().
 - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
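
The register roles in the fixup commit can be illustrated with a scalar Java model. This is a hedged sketch rather than the actual macro-assembler code: arrays stand in for FloatRegisters, the class and method names are hypothetical, and the way the 128-bit stage uses its scratch operand is an assumption chosen only to show why passing the same register as source and scratch can be safe.

```java
import java.util.Arrays;

class VsrcPreservingFold {
    // Assumption for illustration: four 32-bit int lanes fit in 128 bits.
    static final int LANES_128 = 4;

    // Stand-in for the 128-bit stage. It tolerates src == tmp2 because
    // each tmp2[i] is written only after src[i] and src[i + 2] are read,
    // and lanes 2 and 3 are never written.
    static int reduceLe128(int[] src, int[] tmp2) {
        for (int i = 0; i < 2; i++) {
            tmp2[i] = src[i] * src[i + 2];
        }
        return tmp2[0] * tmp2[1];
    }

    // Stand-in for the long-vector stage. The first fold reads src and
    // writes tmp1, so src is never modified; later folds stay in tmp1.
    // Expects a power-of-two lane count of at least 8.
    static int reduceGt128(int[] src, int[] tmp1) {
        int half = src.length / 2;
        for (int i = 0; i < half; i++) {
            tmp1[i] = src[i] * src[i + half];
        }
        for (int width = half; width > LANES_128; width /= 2) {
            int h = width / 2;
            for (int i = 0; i < h; i++) {
                tmp1[i] *= tmp1[i + h];
            }
        }
        // tmp1 serves as both source and scratch, so no third
        // temporary is needed.
        return reduceLe128(tmp1, tmp1);
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] before = src.clone();
        int result = reduceGt128(src, new int[src.length]);
        System.out.println(result);                     // prints 40320
        System.out.println(Arrays.equals(src, before)); // prints true
    }
}
```

The point of the model is the aliasing contract: once the final stage is written so that its source and scratch operands may coincide, the intermediate result in the first temporary can be consumed in place.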

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23181/files
  - new: https://git.openjdk.org/jdk/pull/23181/files/c9dcc45f..3fc989bd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=01-02

  Stats: 67 lines in 1 file changed: 35 ins; 17 del; 15 mod
  Patch: https://git.openjdk.org/jdk/pull/23181.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181

PR: https://git.openjdk.org/jdk/pull/23181