RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Xiaohong Gong
xgong at openjdk.org
Thu Jul 3 05:56:41 UTC 2025
On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.
>
> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;)
>
> I don't have access to any SVE machines, so I cannot help you there, unfortunately.
>
> Is this helpful to you?
Thanks for your input @eme64 ! It's really helpful. Using a cost model to decide whether vectorizing an FP mul reduction is profitable sounds like the right direction. With that in place, I think the backend check for auto-vectorization of such operations can be removed safely, and we can rely on SLP's analysis.
BTW, the current profitability heuristics already help by disabling auto-vectorization for the simple cases while enabling it for the complex ones. This is also helpful to us.
I tested the performance of `VectorReduction2` with and without auto-vectorization of FP mul reductions on my 128-bit SVE machine. The difference is not significant for either `floatMulSimple` or `floatMulBig`. But I guess the result could change with auto-vectorization on hardware with a larger vector size. Since we do not have SVE machines with larger vector sizes either, we may need help from @mikabl-arm ! If the performance of `floatMulBig` improves with auto-vectorization, I think we can remove the restriction on such reductions for auto-vectorization on AArch64.
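
For reference, the two kernel shapes being compared are roughly like the sketch below. This is only an illustrative approximation of what `VectorReduction2` exercises, not the actual benchmark code; the array names and loop bodies are assumptions.

```java
// Illustrative sketch only -- not the actual VectorReduction2 benchmark.
// "Simple": the loop body is just the strict-order FP mul reduction, so the
// expensive ordered vector reduction usually outweighs any vector benefit.
// "Big": extra element-wise work in the loop can amortize that cost,
// especially on hardware with wider vectors.
public class FloatMulReductionSketch {
    static final int SIZE = 1024;
    static float[] a = new float[SIZE];
    static float[] b = new float[SIZE];
    static float[] c = new float[SIZE];

    static float floatMulSimple() {
        float acc = 1.0f;
        for (int i = 0; i < SIZE; i++) {
            acc *= a[i];                   // FP mul reduction: order must be preserved
        }
        return acc;
    }

    static float floatMulBig() {
        float acc = 1.0f;
        for (int i = 0; i < SIZE; i++) {
            float v = a[i] * b[i] + c[i];  // additional vectorizable work
            acc *= v;
        }
        return acc;
    }
}
```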
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030931690