RFR: 8343689: AArch64: Optimize MulReduction implementation [v13]

Fri Oct 31 06:48:10 UTC 2025

On Tue, 28 Oct 2025 13:57:08 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:

>> Add an SVE specialization of the reduce_mul intrinsic for >= 256-bit long vectors. It multiplies the halves of the source vector using SVE instructions until the result is a 128-bit long vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>> 
>> Nothing changes for <= 128-bit long vectors: for those, the existing ASIMD implementation is still used directly.
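
The halving scheme described above can be modelled in plain Java. This is only a scalar sketch under stated assumptions, not the code from the PR: the class and method names (`MulReductionSketch`, `foldHalves`, `reduceMul`) are hypothetical, the lanes are 64-bit longs, the lane count is assumed to be a power of two, and the general loop is used only for illustration.

```java
import java.util.Arrays;

// Scalar model of a halving multiply reduction (hypothetical names, not PR code).
class MulReductionSketch {

    // Multiply the upper half of the lanes into the lower half until only
    // `stopLanes` lanes remain. Assumes lanes.length is a power of two.
    static long[] foldHalves(long[] lanes, int stopLanes) {
        long[] v = lanes.clone();
        int n = v.length;
        while (n > stopLanes) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];   // lane i absorbs lane i + half
            }
            n = half;
        }
        return Arrays.copyOf(v, n);
    }

    // Reduce a "wide" vector: fold down to two 64-bit lanes (128 bits),
    // then finish with a plain loop standing in for the existing
    // 128-bit path.
    static long reduceMul(long[] lanes) {
        long[] low128 = foldHalves(lanes, 2);
        long result = 1L;
        for (long x : low128) {
            result *= x;
        }
        return result;
    }

    public static void main(String[] args) {
        // Four 64-bit lanes model a 256-bit vector.
        System.out.println(reduceMul(new long[] {2, 3, 5, 7})); // prints 210
    }
}
```

Integer multiplication in Java wraps on overflow and stays associative and commutative, so for long lanes the pairwise order gives the same result as a left-to-right product. The sketch makes no claim about floating-point lanes, where the order of operations can change the result.
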
>> 
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
>> 
>> Benchmark results:
>> 
>> Neoverse-V1 (SVE 256-bit):
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
>> 
>> 
>> Fujitsu A64FX (SVE 512-bit):
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 24 commits:
> 
>  - Merge commit 'c8679713402186b24608fa4c91397b6a4fd5ebf3' into 8343689
>    
>    Change-Id: Icfa70da585e034774e4ff0f60b8f0c9ce0598399
>  - cleanup: remove redundant local variables
>    
>    Change-Id: I6fb6a9a7a236537612caa5d53c5516ed2f260bad
>  - cleanup: remove a trivial switch-case statement
>    
>    Change-Id: Ib914ce02ae9d88057cb0b88d4880df6ca64f8184
>  - Assert the exact supported VL of 32B in SVE-specific methods
>    
>    Change-Id: I8768c653ff563cd8a7a75cd06a6523a9526d15ec
>  - cleanup: fix long line formatting
>    
>    Change-Id: I173e70a2fa9a45f56fe50d4a6b81699665e3433d
>  - fixup: remove VL asserts in match rules to fix failures on >= 512b SVE platforms
>    
>    Change-Id: I721f5a97076d645905ee1716f7d57ec8c90ef6e9
>  - Merge branch 'master' into 8343689
>    
>    Change-Id: Iebe758e4c7b3ab0de5f580199f8909e96b8c6274
>  - cleanup: start the SVE Integer Misc - Unpredicated section
>  - Merge branch 'master'
>  - Address review comments and simplify the implementation
>    
>    - remove the loops from gt128b methods making them 256b only
>    - fixup: missed fnoregs in instruct reduce_mulL_256b
>    - use an extra vtmp3 reg for the 256b integer method
>    - remove a no longer needed change in reduce_mul_integral_le128b
>    - cleanup: unify comments
>  - ... and 14 more: https://git.openjdk.org/jdk/compare/c8679713...e564d6c1

LGTM! Thanks for your work!

-------------

Marked as reviewed by xgong (Committer).

PR Review: https://git.openjdk.org/jdk/pull/23181#pullrequestreview-3402778189