RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Xiaohong Gong
xgong at openjdk.org
Fri Jul 11 01:29:44 UTC 2025
On Thu, 3 Jul 2025 05:53:44 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.
>>
>> However, I do plan to remove the auto-vectorization restrictions for simple reductions.
>> https://bugs.openjdk.org/browse/JDK-8307516
>>
>> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`.
>> https://bugs.openjdk.org/browse/JDK-8357530
>> I published benchmark results there:
>> https://github.com/openjdk/jdk/pull/25387
>> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.
>>
>> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;)
>>
>> I don't have access to any SVE machines, so I cannot help you there, unfortunately.
>>
>> Is this helpful to you?
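For context on the strict-order point above, here is a tiny standalone illustration (my own example, not code from the PR or the benchmarks) of why FP mul reductions cannot be freely reordered: float multiplication is not associative, so a vector reduction that regroups the multiplies can change the result.

```java
public class FloatMulOrder {
    public static void main(String[] args) {
        float a = 1e30f, b = 1e30f, c = 1e-30f;
        // Strict left-to-right order: a * b overflows to Infinity first.
        float strict = (a * b) * c;      // Infinity
        // Regrouped order, as an out-of-order vector reduction might do.
        float regrouped = a * (b * c);   // roughly 1e30
        System.out.println(strict + " vs " + regrouped);
    }
}
```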
>
>> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.
>>
>> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;)
>>
>> I don't have access to any SVE machines, so I cannot help you there, unfortunately.
>>
>> Is this helpful to you?
>
> Thanks for your input @eme64 ! It's really helpful to me. Using the cost model to guide whether vectorizing FP mul reductions is profitable seems like the right direction. With that in place, I think the backend auto-vectorization check for such operations can be removed safely; we can rely on SLP's analysis.
>
> BTW, the current profitability heuristics already help by disabling auto-vectorization for the simple cases while enabling it for the complex ones. This is also helpful to us.
>
> I tested the performance of `VectorReduction2` with and without auto-vectorization for FP mul reductions on my 128-bit SVE machine. The difference is not significant for either `floatMulSimple` or `floatMulBig`. But I guess the result would be different with auto-vectorization on hardware with larger vector sizes. Since we do not have SVE machines with larger vector sizes either, we may need help from @mikabl-arm ! If the performance of `floatMulBig` improves with auto-vectorization, I think we can remove the restriction on such reductions for auto-vectorization on AArch64.
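For reference, a simplified sketch (my own approximation, not the actual JMH sources) of the kind of kernels the `floatMulSimple` and `floatMulBig` benchmarks measure; the "big" variant has extra independent work in the loop body, which is what can make the expensive strict-order vector reduction pay off.

```java
// Hypothetical simplification of the VectorReduction2 kernels.
static float floatMulSimple(float[] a) {
    float acc = 1.0f;
    for (int i = 0; i < a.length; i++) {
        acc *= a[i];              // strict-order mul reduction, nothing else
    }
    return acc;
}

static float floatMulBig(float[] a, float[] b, float[] c) {
    float acc = 1.0f;
    for (int i = 0; i < a.length; i++) {
        c[i] = a[i] * b[i];       // independent, easily vectorized work
        acc *= c[i];              // strict-order mul reduction
    }
    return acc;
}
```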
> @XiaohongGong , @shqking , @eme64 ,
>
> Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:
>
> > To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used.
>
> This PR does **not** target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.
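As a concrete reminder of the Vector API shape this PR targets, here is a minimal, self-contained sketch (my own example, not code from the PR) of a cross-lane MUL reduction; it needs `--add-modules jdk.incubator.vector` to compile and run.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReduce {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiplies all elements of 'a' together using the Vector API.
    // reduceLanes(VectorOperators.MUL) is the operation that, on SVE
    // machines wider than 128 bits, previously fell back to the Java
    // implementation instead of being lowered to AArch64 code.
    static float mulReduce(float[] a) {
        float acc = 1.0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            acc *= v.reduceLanes(VectorOperators.MUL);  // cross-lane MUL reduction
        }
        for (; i < a.length; i++) {                     // scalar tail
            acc *= a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] a = new float[64];
        java.util.Arrays.fill(a, 1.001f);
        System.out.println(mulReduce(a));
    }
}
```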
>
> @shqking , regarding case 2 that you highlighted: I believe this change is incidental. Prior to the patch, `Matcher::match_rule_supported_auto_vectorization()` returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the **same scalar mul reduction implementation** for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I'm mistaken.
>
> This PR cannot leave `Matcher::match_rule_supported_auto_vectorization()` unchanged. If it did, HotSpot would select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement.
>
> @XiaohongGong ,
>
> > But I guess the result would be different with auto-vectorization on hardware with larger vector sizes. Since we do not have SVE machines with larger vector sizes either, we may need help from @mikabl-arm !
>
> Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in `Matcher::match_rule_supported_auto_vectorization()` lifted (`After`). The linear strictly-ordered SVE implementation matched this way was later removed by [4593a5d](https://github.com/openjdk/jdk/commit/4593a5d717024df01769625993c2b769d8dde311).
>
> ```
> | Benchmark | Before (ns/op) | After (ns/op) | Diff (%) |
> |:-----------------------------------------------|-----------------:|----------------:|:-----------|
> | VectorReduction.WithSuperword.mulRedD | 401.679 | 401.704 | ~ |
> | VectorReduction2.WithSuperword.doubleMulBig | 2365.554 | 7294.706 | +208.37% |
> | VectorReduction2.WithSuperword.doubleMulSimple | 2321.154 | 2321.207 | ~ |
> | VectorReduction2.WithSuperword.floatMulBig | 2356.006 | 2648.334 | +12.41% |
> | VectorReduction2.WithSuperword.floatMulSimple | 2321.018 | 2321.135 | ~ |
> ```
>
> Given that:
>
> * this PR focuses on VectorAPI and **not** on auto-vectorization,
> * and it does **not** introduce regressions in auto-vectorization performance,
>
> I suggest:
>
> * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
> * moving forward with resolving the remaining VectorAPI issues and merging this PR.
I'm fine with removing the strict-ordered rules and disabling these operations for SLP, since they do not benefit performance. Thanks for your testing and the update!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059883936