RFR: 8343689: AArch64: Optimize MulReduction implementation

Mon Jan 20 03:38:41 UTC 2025

On Fri, 17 Jan 2025 19:35:44 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:

> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until a 128-bit vector remains, which fits into a SIMD&FP register. After that point, the existing ASIMD implementation is used.
> 
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
> 
> Benchmark results for an AArch64 CPU with support for SVE with 256-bit vector length:
> 
>   Benchmark                 (size)   Mode      Old        New  Units
>   Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717 ops/ms
>   Double256Vector.MULLanes    1024  thrpt  172.116   3130.997 ops/ms
>   Float256Vector.MULLanes     1024  thrpt  291.612   4164.138 ops/ms
>   Int256Vector.MULLanes       1024  thrpt  362.276   3717.213 ops/ms
>   Long256Vector.MULLanes      1024  thrpt  184.826   2054.345 ops/ms
>   Short256Vector.MULLanes     1024  thrpt  379.231   5716.223 ops/ms
> 
> 
> Benchmark results for an AArch64 CPU with support for SVE with 512-bit vector length:
> 
>   Benchmark                 (size)   Mode      Old       New   Units
>   Byte512Vector.MULLanes      1024  thrpt  160.129  2630.600  ops/ms
>   Double512Vector.MULLanes    1024  thrpt   51.229  1033.284  ops/ms
>   Float512Vector.MULLanes     1024  thrpt   84.617  1658.400  ops/ms
>   Int512Vector.MULLanes       1024  thrpt  109.419  1180.310  ops/ms
>   Long512Vector.MULLanes      1024  thrpt   69.036   704.144  ops/ms
>   Short512Vector.MULLanes     1024  thrpt  131.029  1629.632  ops/ms
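
For reference, the halving strategy in the quoted description can be modelled in scalar Java roughly as below. This is only a sketch: the class name, the method name, and the use of an `int[]` to stand in for a vector register are illustrative assumptions and are not taken from the patch.

```java
import java.util.Arrays;

public class HalvingMulReduction {
    // Scalar model of the reduction (hypothetical helper, not HotSpot code).
    // 'lanes' stands in for one SVE vector of ints; 'asimdLanes' is the number
    // of lanes that fit into 128 bits. lanes.length is assumed to be a power
    // of two and at least asimdLanes.
    static int reduceMul(int[] lanes, int asimdLanes) {
        int[] v = Arrays.copyOf(lanes, lanes.length);
        int width = v.length;
        // Phase 1 (the SVE part): multiply the upper half into the lower half
        // until the live part of the vector is 128 bits wide.
        while (width > asimdLanes) {
            int half = width / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            width = half;
        }
        // Phase 2: stand-in for the existing ASIMD reduction of the last 128 bits.
        int result = 1;
        for (int i = 0; i < width; i++) {
            result *= v[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4, 5, 6, 7, 8};  // 8 x 32-bit lanes = 256 bits
        System.out.println(reduceMul(v, 4)); // prints 40320 (= 8!)
    }
}
```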

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2095:

> 2093:     // matter: a contiguous set of elements is moved and its size is a multiple of D RegVariant.
> 2094:     sve_compact(vtmp1, D, vsrc, pgtmp1);
> 2095:     sve_mul(vsrc, elemType_to_regVariant(bt), pgtmp2, vtmp1);

Have you tried the SVE `EXT` instruction (https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/EXT--Extract-vector-from-pair-of-vectors-?lang=en)? I think it could also be used to shuffle the upper-half elements of a vector into the lower half. If it works, I think these five instructions can be reduced to three, such as `ext, whilelo, mul`.
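
To make the suggestion concrete, here is a scalar Java model of the `ext, whilelo, mul` sequence on byte lanes. It is only a sketch of the idea: the helper names, the choice of passing the same vector as both `EXT` operands, and the immediate of half the vector length in bytes are my assumptions and have not been checked against the patch.

```java
import java.util.Arrays;

public class ExtShuffleModel {
    // Model of EXT: take first.length bytes starting at byte offset 'imm'
    // from the concatenation first:second ('first' holds the low bytes).
    static byte[] ext(byte[] first, byte[] second, int imm) {
        int n = first.length;
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int pos = imm + i;
            out[i] = pos < n ? first[pos] : second[pos - n];
        }
        return out;
    }

    // Model of WHILELO starting from 0: lane i is active while i < limit.
    static boolean[] whilelo(int lanes, int limit) {
        boolean[] pg = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            pg[i] = i < limit;
        }
        return pg;
    }

    // Model of a predicated, merging MUL: inactive lanes keep their old value.
    static void mul(byte[] dst, boolean[] pg, byte[] src) {
        for (int i = 0; i < dst.length; i++) {
            if (pg[i]) {
                dst[i] = (byte) (dst[i] * src[i]);
            }
        }
    }

    // One halving step: lower half becomes lower * upper.
    static byte[] foldUpperIntoLower(byte[] vsrc) {
        int half = vsrc.length / 2;
        byte[] vtmp = ext(vsrc, vsrc, half);        // upper half moved to lower half
        boolean[] pg = whilelo(vsrc.length, half);  // only the lower half is active
        byte[] dst = vsrc.clone();
        mul(dst, pg, vtmp);
        return dst;
    }

    public static void main(String[] args) {
        byte[] v = {1, 2, 3, 4, 5, 6, 7, 8};
        // prints [5, 12, 21, 32, 5, 6, 7, 8]
        System.out.println(Arrays.toString(foldUpperIntoLower(v)));
    }
}
```

Because `EXT` works at byte granularity, the same offset moves the upper half down regardless of the element size, which is why the model does not depend on the lane type.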

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1921747777