RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Mikhail Ablakatov
mablakatov at openjdk.org
Mon Jun 30 13:25:10 UTC 2025
On Mon, 30 Jun 2025 13:22:34 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for vectors that are at least 256 bits long. It multiplies halves of the source vector together using SVE instructions until the intermediate result is a 128-bit vector that fits into a SIMD&FP register; from that point, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or less: those still use the existing ASIMD implementation directly.
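>>
>> As a rough illustration of that strategy, here is a minimal scalar model (not the actual MacroAssembler/SVE code; the long[] stands in for an SVE vector register of longs):
>>
>> ```java
>> // Illustrative model of the >= 256-bit reduction: each step folds the upper
>> // half of the vector into the lower half with a lane-wise multiply (the real
>> // code uses SVE EXT + MUL), until 128 bits (2 longs) remain for the ASIMD path.
>> public class MulReduceSketch {
>>     static long reduceMul(long[] lanes) {
>>         long[] v = lanes.clone();          // keep the source vector unmodified
>>         int n = v.length;
>>         while (n > 2) {                    // 2 longs == 128 bits
>>             n /= 2;
>>             for (int i = 0; i < n; i++) {
>>                 v[i] *= v[i + n];          // fold upper half into lower half
>>             }
>>         }
>>         return v[0] * v[1];                // final 128-bit (ASIMD) step
>>     }
>>
>>     public static void main(String[] args) {
>>         // 256-bit vector of longs: [2, 3, 4, 5] -> (2*4) * (3*5) = 120
>>         System.out.println(reduceMul(new long[] {2, 3, 4, 5}));
>>     }
>> }
>> ```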
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
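>>
>> For reference, the MULLanes micro-benchmarks boil down to a lane-wise multiply reduction like the sketch below (assumed shape, not the exact benchmark source; run with --add-modules jdk.incubator.vector):
>>
>> ```java
>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> // A multiply reduction over an array via the Vector API; reduceLanes(MUL)
>> // is the operation that maps to the reduce_mul backend rules touched here.
>> public class MulLanesSketch {
>>     static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_MAX;
>>
>>     static long mulLanes(long[] a) {
>>         long acc = 1;
>>         for (int i = 0; i < a.length; i += SPECIES.length()) {
>>             acc *= LongVector.fromArray(SPECIES, a, i)
>>                              .reduceLanes(VectorOperators.MUL);
>>         }
>>         return acc;
>>     }
>>
>>     public static void main(String[] args) {
>>         long[] a = new long[1024];
>>         java.util.Arrays.fill(a, 1L);
>>         a[3] = 7L;
>>         System.out.println(mulLanes(a)); // 7
>>     }
>> }
>> ```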
>>
>> Benchmark results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> | Benchmark                | (size) | Mode  | master   | PR        | Units  |
>> |--------------------------|--------|-------|----------|-----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 5447.643 | 11455.535 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 3388.183 | 7144.301  | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 3010.974 | 4911.485  | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 1539.137 | 2562.835  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 1355.551 | 4158.128  | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 1715.854 | 3284.189  | ops/ms |
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> | Benchmark                | (size) | Mode  | master   | PR       | Units  |
>> |--------------------------|--------|-------|----------|----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 1091.692 | 2887.798 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 597.008  | 1863.338 | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 510.642  | 1348.651 | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 468.878  | 878.620  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 376.284  | 2237.564 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 431.343  | 1646.792 | ops/ms |
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>
> - cleanup: address nits, rename several symbols
> - cleanup: remove unreferenced definitions
> - Address review comments.
>
> - fixup: disable FP mul reduction auto-vectorization for all targets
> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
>   reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
> - cleanup: replace a complex lambda in the above methods with a loop
> - cleanup: rename symbols to follow the existing naming convention
> - cleanup: add asserts to SVE only instructions
> - split mul FP reduction instructions into strictly-ordered (default)
>   and explicitly non-strictly-ordered variants (see the ordering sketch after this commit list)
> - remove redundant conditions in TestVectorFPReduction.java
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit)
>
> | Benchmark | Before | After | Units | Diff |
> |---------------------------|----------|----------|--------|-------|
> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
> - Merge branch 'master' into 8343689-rebase
> - fixup: don't modify the value in vsrc
>
> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
> change, the result of recursive folding is held in vtmp1. To be able to
> pass this intermediate result to reduce_mul_integral_le128b(), we would
> have to use another temporary FloatRegister, as vtmp1 would essentially
> act as vsrc. It's possible to get around this, however:
> reduce_mul_integral_le128b() is modified so that matching vsrc and vtmp2
> arguments can be passed. By doing this, we save ourselves a temporary
> register in rules that match reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
> - Use EXT instead of COMPACT to split a vector into two halves
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark (size) Mode master PR Units
> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
> Short...
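>
> The strictly-ordered vs. non-strictly-ordered split above matters because FP multiplication is not associative: a reassociated (pairwise) fold can round differently from a strict left-to-right reduction. A minimal illustration of the difference (not HotSpot code):
>
> ```java
> // Strictly-ordered (left-to-right) FP multiply reduction vs. a reassociated
> // pairwise fold, as a vectorized reduction would compute it. The results can
> // differ in the last bits, which is why the strict form stays the default.
> public class FpOrderSketch {
>     static float strictlyOrdered(float[] a) {
>         float acc = 1.0f;
>         for (float x : a) acc *= x;        // fixed left-to-right order
>         return acc;
>     }
>
>     static float pairwise(float[] a) {     // assumes a power-of-two length
>         if (a.length == 1) return a[0];
>         float[] half = new float[a.length / 2];
>         for (int i = 0; i < half.length; i++) {
>             half[i] = a[i] * a[i + half.length];
>         }
>         return pairwise(half);
>     }
>
>     public static void main(String[] args) {
>         float[] a = {0.1f, 3.0f, 7.0f, 0.3f};
>         System.out.println(strictlyOrdered(a) == pairwise(a)); // may print false
>     }
> }
> ```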
Thank you for the review! There are a couple more nits I've missed; I'll submit an update to resolve them shortly.
-------------
PR Review: https://git.openjdk.org/jdk/pull/23181#pullrequestreview-2970941468