RFR: 8343689: AArch64: Optimize MulReduction implementation
Emanuel Peter
epeter at openjdk.org
Tue Feb 4 18:55:13 UTC 2025
On Fri, 17 Jan 2025 19:35:44 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies halves of the source vector using SVE instructions until the result fits into a 128-bit SIMD&FP register. After that point, the existing ASIMD implementation is used.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>
> Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length:
>
> Benchmark                  (size)  Mode        Old        New   Units
> Byte256Vector.MULLanes       1024  thrpt   502.498  10222.717  ops/ms
> Double256Vector.MULLanes     1024  thrpt   172.116   3130.997  ops/ms
> Float256Vector.MULLanes      1024  thrpt   291.612   4164.138  ops/ms
> Int256Vector.MULLanes        1024  thrpt   362.276   3717.213  ops/ms
> Long256Vector.MULLanes       1024  thrpt   184.826   2054.345  ops/ms
> Short256Vector.MULLanes      1024  thrpt   379.231   5716.223  ops/ms
>
>
> Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length:
>
> Benchmark                  (size)  Mode        Old        New   Units
> Byte512Vector.MULLanes       1024  thrpt   160.129   2630.600  ops/ms
> Double512Vector.MULLanes     1024  thrpt    51.229   1033.284  ops/ms
> Float512Vector.MULLanes      1024  thrpt    84.617   1658.400  ops/ms
> Int512Vector.MULLanes        1024  thrpt   109.419   1180.310  ops/ms
> Long512Vector.MULLanes       1024  thrpt    69.036    704.144  ops/ms
> Short512Vector.MULLanes      1024  thrpt   131.029   1629.632  ops/ms
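For illustration, the halving strategy described in the PR could be sketched in scalar C++ roughly as follows (names and code are made up for this sketch, not the actual HotSpot implementation):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative scalar sketch of the folding strategy: multiply the upper
// half of the vector into the lower half while the data is wider than a
// 128-bit SIMD&FP register (4 float lanes), then finish with the narrow
// reduction; a plain loop stands in for the existing ASIMD path.
float reduce_mul_sve_style(float* v, size_t n) {  // n: power-of-two lane count
    while (n * sizeof(float) > 16) {              // wider than 128 bits
        n /= 2;
        for (size_t i = 0; i < n; ++i) {
            v[i] *= v[i + n];                     // fold upper half into lower
        }
    }
    float acc = v[0];                             // <= 4 lanes left
    for (size_t i = 1; i < n; ++i) {
        acc *= v[i];
    }
    return acc;
}
```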
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
> 2138: // instructions are used.
> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
Drive-by question:
This is recursive folding: halve the vector and multiply the halves element-wise, repeating until one lane remains.
What about the linear reduction: is that also implemented somewhere? We need it for vector reductions coming from SuperWord, which have a strict-order requirement to avoid rounding divergences.
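To see why the order matters for floating point, here is a minimal sketch (hypothetical helper names, not HotSpot code) contrasting the two evaluation orders for a 4-lane multiply reduction:

```cpp
#include <cassert>

// Strictly ordered (linear) product: the left-to-right order that a
// SuperWord-generated FP reduction must preserve.
float mul_reduce_linear4(const float* v) {
    return ((v[0] * v[1]) * v[2]) * v[3];
}

// Recursive-folding (pairwise) product: the order the halving strategy
// produces.
float mul_reduce_pairwise4(const float* v) {
    return (v[0] * v[1]) * (v[2] * v[3]);
}
```

For example, with all four lanes equal to 1.0f + 0x1p-12f the two orders round differently and the results disagree in the last ulp, which is exactly the rounding divergence a strictly ordered reduction avoids.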
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1941733604
More information about the hotspot-compiler-dev mailing list