RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]

Wed Feb 5 11:20:59 UTC 2025

> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until the result is a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
> 
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
> 
> Benchmark results for an AArch64 CPU that supports SVE with a 256-bit vector length:
> 
>   Benchmark                 (size)   Mode      Old        New  Units
>   Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717 ops/ms
>   Double256Vector.MULLanes    1024  thrpt  172.116   3130.997 ops/ms
>   Float256Vector.MULLanes     1024  thrpt  291.612   4164.138 ops/ms
>   Int256Vector.MULLanes       1024  thrpt  362.276   3717.213 ops/ms
>   Long256Vector.MULLanes      1024  thrpt  184.826   2054.345 ops/ms
>   Short256Vector.MULLanes     1024  thrpt  379.231   5716.223 ops/ms
> 
> 
> Benchmark results for an AArch64 CPU that supports SVE with a 512-bit vector length:
> 
>   Benchmark                 (size)   Mode      Old       New   Units
>   Byte512Vector.MULLanes      1024  thrpt  160.129  2630.600  ops/ms
>   Double512Vector.MULLanes    1024  thrpt   51.229  1033.284  ops/ms
>   Float512Vector.MULLanes     1024  thrpt   84.617  1658.400  ops/ms
>   Int512Vector.MULLanes       1024  thrpt  109.419  1180.310  ops/ms
>   Long512Vector.MULLanes      1024  thrpt   69.036   704.144  ops/ms
>   Short512Vector.MULLanes     1024  thrpt  131.029  1629.632  ops/ms
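
For readers unfamiliar with the approach, the halving strategy described above can be modelled in scalar Java. This is only an illustrative sketch, not the code in the PR: the class name, the method name and the `stopLanes` parameter are hypothetical, the lane count is assumed to be a power of two, and the final sequential loop merely stands in for the existing 128-bit ASIMD path. It uses `long` lanes, where wrapping multiplication is associative, so reordering the products does not change the result.

```java
public class MulReductionSketch {

    // Hypothetical scalar model of the halving strategy: multiply the upper
    // half of the vector into the lower half until at most stopLanes lanes
    // remain, then reduce the remainder sequentially.
    static long reduceMulByHalving(long[] lanes, int stopLanes) {
        long[] v = lanes.clone();      // do not modify the caller's array
        int n = v.length;              // assumed to be a power of two
        while (n > stopLanes) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];   // lane-wise multiply of the two halves
            }
            n = half;
        }
        long acc = 1;
        for (int i = 0; i < n; i++) {  // stand-in for the 128-bit tail reduction
            acc *= v[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        // Eight 64-bit lanes model a 512-bit vector; two lanes model 128 bits.
        System.out.println(reduceMulByHalving(lanes, 2)); // prints 40320
    }
}
```

Each pass of the `while` loop corresponds to one lane-wise multiply of the two halves, so a 512-bit input needs two such passes and a 256-bit input needs one before the 128-bit tail is reached.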

Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:

  Use EXT instead of COMPACT to split a vector into two halves

  Benchmark results:

  Neoverse-V1 (SVE 256-bit)

    Benchmark                 (size)   Mode   master         PR  Units
    ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
    ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
    IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
    LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
    FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
    DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms

  Fujitsu A64FX (SVE 512-bit)

    Benchmark                 (size)   Mode   master         PR  Units
    ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
    ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
    IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
    LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
    FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
    DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23181/files
  - new: https://git.openjdk.org/jdk/pull/23181/files/0a62dc33..c9dcc45f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=00-01

  Stats: 140 lines in 7 files changed: 10 ins; 6 del; 124 mod
  Patch: https://git.openjdk.org/jdk/pull/23181.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181

PR: https://git.openjdk.org/jdk/pull/23181