RFR: 8343689: AArch64: Optimize MulReduction implementation [v11]
Xiaohong Gong
xgong at openjdk.org
Tue Sep 9 06:53:31 UTC 2025
On Thu, 14 Aug 2025 14:01:13 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for vectors of 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until it reaches a 128-bit vector that fits into a SIMD&FP register. From that point, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or shorter: for those, the existing ASIMD implementation is still used directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                  (size)  Mode      master         PR  Units
>> ByteMaxVector.MULLanes       1024  thrpt   5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes      1024  thrpt   3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes        1024  thrpt   3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes       1024  thrpt   1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes      1024  thrpt   1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes     1024  thrpt   1715.854   3284.189  ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark                  (size)  Mode      master         PR  Units
>> ByteMaxVector.MULLanes       1024  thrpt   1091.692   2887.798  ops/ms
>> ShortMaxVector.MULLanes      1024  thrpt    597.008   1863.338  ops/ms
>> IntMaxVector.MULLanes        1024  thrpt    510.642   1348.651  ops/ms
>> LongMaxVector.MULLanes       1024  thrpt    468.878    878.620  ops/ms
>> FloatMaxVector.MULLanes      1024  thrpt    376.284   2237.564  ops/ms
>> DoubleMaxVector.MULLanes     1024  thrpt    431.343   1646.792  ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>
> cleanup: start the SVE Integer Misc - Unpredicated section
Do you intend to leave ops with vector sizes larger than 32 bytes unoptimized? May I ask the reason?
If so, maybe a title like `AArch64: Implement MulReduction for 256-bit SVE` would be more accurate?
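For context, the halving strategy described in the PR summary can be modeled in scalar C++ as follows. This is an illustrative sketch only, not the actual HotSpot SVE code; the function name and the scalar loop standing in for the ASIMD tail are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the halving strategy: while the vector is wider than
// 128 bits (16 bytes), multiply the low half by the high half elementwise,
// halving the width each step. The remaining 128-bit part is then reduced
// the way the existing ASIMD sequence would be (modeled here by a plain loop).
int64_t mul_reduce_by_halving(int64_t* v, size_t len_bytes) {
    const size_t elem_bytes = sizeof(int64_t);
    size_t n = len_bytes / elem_bytes;
    while (len_bytes > 16) {               // > 128 bits: one SVE halving step
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++) {
            v[i] *= v[i + half];           // low half *= high half
        }
        n = half;
        len_bytes /= 2;
    }
    int64_t acc = 1;                       // 128-bit tail: ASIMD-style reduce
    for (size_t i = 0; i < n; i++) {
        acc *= v[i];
    }
    return acc;
}
```

A 256-bit vector of four longs needs one halving step before the tail reduction; a 512-bit vector needs two.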
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 2199:
> 2197:
> 2198: instruct reduce_non_strict_order_mulF_256b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp1, vReg tmp2) %{
> 2199: predicate(Matcher::vector_length_in_bytes(n->in(2)) == 32 && !n->as_Reduction()->requires_strict_order());
Suggestion:

    predicate(Matcher::vector_length_in_bytes(n->in(2)) == 32 &&
              !n->as_Reduction()->requires_strict_order());
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2119:
> 2117: assert(false, "unsupported");
> 2118: ShouldNotReachHere();
> 2119: }
Can we just add a type assertion at the start of the method and remove the switch-case?
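For what it's worth, the suggested shape might look like this. This is a standalone sketch with a stand-in BasicType enum and hypothetical function names, not the actual HotSpot code:

```cpp
#include <cassert>

// Illustrative stand-in for HotSpot's BasicType; values are hypothetical.
enum BasicType { T_FLOAT, T_DOUBLE, T_INT };

// Helper capturing the suggested up-front type check.
bool is_fp_type(BasicType bt) {
    return bt == T_FLOAT || bt == T_DOUBLE;
}

// Sketch: instead of a switch whose default case asserts, assert the element
// type once at the start and branch only between the two legal cases.
void reduce_mul_fp_sketch(BasicType bt) {
    assert(is_fp_type(bt) && "unsupported element type");
    if (bt == T_FLOAT) {
        // ... emit the float reduction ...
    } else {
        // ... emit the double reduction ...
    }
}
```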
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2165:
> 2163: FloatRegister vtmp1,
> 2164: FloatRegister vtmp2) {
> 2165: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead");
Is it better to assert `vector_length_in_bytes == 32` or `vector_length_in_bytes == 2 * FloatRegister::neon_vl`?
-------------
PR Review: https://git.openjdk.org/jdk/pull/23181#pullrequestreview-3199499604
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332130585
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332153670
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332197936
More information about the hotspot-compiler-dev
mailing list