RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]

Mikhail Ablakatov mablakatov at openjdk.org
Mon Jun 30 13:25:11 UTC 2025


On Thu, 27 Feb 2025 03:49:41 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
>> 
>>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>>> 2138: // instructions are used.
>>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
>> 
>> Drive-by question:
>> This is recursive folding: halve the vector and add the two halves, repeating until done.
>> 
>> What about the linear reduction: is that also implemented somewhere? We need it for vector reductions coming from SuperWord, which have a strict-order requirement, to avoid rounding divergences.
>
> I share @eme64's concern about the ordering issue.
> Should we enable this only for the VectorAPI case, which doesn't require strict order?

FP reductions have been disabled for auto-vectorization; please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 to see how the patch affects auto-vectorization performance. The only benchmark that saw a performance uplift on a 256b SVE platform is `VectorReduction2.WithSuperword.intMulBig`, which is fine since it's an integer benchmark.
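To illustrate why strict order matters for SuperWord, here is a small standalone Java sketch (not from the patch; values chosen for demonstration) showing that a linear reduction and a pairwise (recursive-folding) reduction of the same floats can round to different results:

```java
public class FpReductionOrder {
    public static void main(String[] args) {
        // 2^24 is the largest float magnitude at which every integer is
        // still exactly representable; adding 1f to it rounds back down.
        float[] v = {16777216f, 1f, 1f, 1f};

        // Strict-order (linear) reduction, as SuperWord requires:
        // each +1f is individually lost to rounding.
        float linear = ((v[0] + v[1]) + v[2]) + v[3];

        // Pairwise (recursive-folding) reduction, as a lane-halving
        // vector sequence computes: the 1f terms combine first.
        float pairwise = (v[0] + v[1]) + (v[2] + v[3]);

        System.out.println(linear);   // 1.6777216E7
        System.out.println(pairwise); // 1.6777218E7
    }
}
```

The same divergence applies to multiplication, which is why a lane-halving reduction can only replace a strictly-ordered one when the caller (e.g. the Vector API's unordered reduction semantics) permits reassociation.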

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2174943784


More information about the hotspot-compiler-dev mailing list