RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Hao Sun
haosun at openjdk.org
Thu Feb 27 03:55:06 UTC 2025
On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - fixup: don't modify the value in vsrc
>>
>> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>> change, the result of recursive folding is held in vtmp1. To be able to
>> pass this intermediate result to reduce_mul_integral_le128b(), we would
>> have to use another temporary FloatRegister, as vtmp1 would essentially
>> act as vsrc. It's possible to get around this however:
>> reduce_mul_integral_le128b() is modified so it's possible to pass
>> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>> temporary register in rules that match to reduce_mul_integral_gt128b().
>> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
>
>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>> 2138: // instructions are used.
>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
>
> Drive-by question:
> This is recursive folding: split the vector in half, combine the two halves, and repeat.
>
> What about linear reduction: is that also implemented somewhere? We need it for vector reductions that come from SuperWord, which have a strict-order requirement to avoid rounding divergences.
I share @eme64's concern about the ordering issue.
Should we enable this only for the Vector API case, which doesn't require strict ordering?
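
To make the ordering concern concrete, here is a minimal scalar sketch of the two reduction orders. It is not the HotSpot code: the function names mul_reduce_linear and mul_reduce_folded are hypothetical, a std::vector stands in for the vector register, and the lane count is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// Strictly ordered (linear) reduction: ((init * v[0]) * v[1]) * v[2] ...
// This is the order a scalar Java loop would use.
float mul_reduce_linear(float init, const std::vector<float>& v) {
  float acc = init;
  for (float x : v) {
    acc *= x;
  }
  return acc;
}

// Recursive folding: multiply the lower half by the upper half lane-wise,
// halve the active length, and repeat until one lane is left.
// Assumes v.size() is a power of two (illustrative simplification).
float mul_reduce_folded(float init, std::vector<float> v) {
  if (v.empty()) {
    return init;
  }
  for (std::size_t n = v.size() / 2; n >= 1; n /= 2) {
    for (std::size_t i = 0; i < n; ++i) {
      v[i] *= v[i + n];
    }
  }
  return init * v[0];
}
```

For an input such as {1e30f, 1e30f, 1e-30f, 1e-30f}, the linear order overflows to infinity after the second multiply, while the folded order pairs each large lane with a small one and stays finite. Overflow only makes the divergence easy to see; ordinary rounding differences arise in the same way.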
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972792220