RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]

Thu Feb 27 03:55:06 UTC 2025

On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - fixup: don't modify the value in vsrc
>>    
>>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>    change, the result of recursive folding is held in vtmp1. To be able to
>>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>>    have to use another temporary FloatRegister, as vtmp1 would essentially
>>    act as vsrc. It's possible to get around this however:
>>    reduce_mul_integral_le128b() is modified so it's possible to pass
>>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>    temporary register in rules that match to reduce_mul_integral_gt128b().
>>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
> 
>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>> 2138: // instructions are used.
>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
> 
> Drive-by question:
> This is recursive folding: repeatedly halve the vector and combine the two halves.
> 
> What about the linear reduction, is that also implemented somewhere? We need that for vector reductions that come from SuperWord and have a strict-order requirement, to avoid rounding divergences.

I share @eme64's concern about the ordering issue.
Should we enable this only for the Vector API case, which doesn't require strict ordering?
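
To make the concern concrete, here is a minimal scalar sketch in plain Java (not HotSpot code). The class and method names are made up for illustration, and the lane pairing in `folded` (lower half times upper half) is an assumption about how recursive folding pairs elements; the patch may pair lanes differently. The input deliberately uses overflow so the divergence is obvious; with ordinary inputs the two orders would more typically differ only in the last bits of rounding.

```java
public class MulReductionOrder {
    // Strict-order (linear) reduction: ((v[0] * v[1]) * v[2]) * ...
    static float linear(float[] v) {
        float acc = 1.0f;
        for (float x : v) {
            acc *= x;
        }
        return acc;
    }

    // Recursive folding: multiply the lower half by the upper half lane-wise,
    // then repeat on the result until a single lane remains.
    // Assumes v.length is a power of two.
    static float folded(float[] v) {
        float[] tmp = v.clone();
        for (int half = tmp.length / 2; half >= 1; half /= 2) {
            for (int i = 0; i < half; i++) {
                tmp[i] *= tmp[i + half];
            }
        }
        return tmp[0];
    }

    public static void main(String[] args) {
        // Linear order overflows on the second multiply (1e30f * 1e30f),
        // while folding pairs each large lane with a small one first.
        float[] v = {1e30f, 1e30f, 1e-30f, 1e-30f};
        System.out.println("linear: " + linear(v));
        System.out.println("folded: " + folded(v));
    }
}
```

With this input the linear order yields Infinity while the folded order stays finite, which is the kind of mismatch a strict-order reduction coming from SuperWord must not expose.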

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972792220