RFR: 8343689: AArch64: Optimize MulReduction implementation [v6]

Fei Gao fgao at openjdk.org
Thu Jul 10 15:52:45 UTC 2025


On Thu, 3 Jul 2025 04:44:35 GMT, Hao Sun <haosun at openjdk.org> wrote:

> Background: case-1 was switched off by @fg1417's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175), but case-2 was not touched. We are not sure of the reason. Was there no 128-bit SVE machine at the time? Or was there some limitation of SLP on **reduction**?
> 
> **Limitation** of SLP as mentioned in @fg1417's patch
> 
> > Because superword doesn't vectorize reductions unconnected with other vector packs,
> 
> Performance data in this PR on case-2: from your provided [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) on `Neoverse V2 (SVE 128-bit). Auto-vectorization section`, there is no obvious performance change on the FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`. When we checked the code generated for `floatMul(Big|Simple)` on an Nvidia Grace machine (128-bit SVE2), we found that before this PR:
> 
> * `floatMulBig` is vectorized.
> * `floatMulSimple` is not vectorized because SLP determines that there is no profit.
> 
> Discussion: should we enable case-1 and case-2?
> 
> * Should they be enabled once the SLP limitation on reductions is fixed?
> * If there is no such limitation, we may consider enabling case-1 and case-2, because a) there is no perf regression, at least based on the current performance results, and b) it may provide more auto-vectorization opportunities for other packs inside the loop.
> 
> It would be appreciated if @eme64 or @fg1417 could provide more input.
> 

@shqking Sorry for joining the discussion a bit late.

The patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175) was intended to fix a regression on `NEON` machines while keeping the behaviour unchanged on `SVE` machines, which may be a source of confusion now.

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like `floatMulSimple`.
The rationale was that if a case like `floatMulBig` doesn’t show any performance gain, then a simpler case like `floatMulSimple` is even less likely to benefit. In general, more complex reduction cases are more likely to benefit from auto-vectorization.
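
For illustration, the shapes below are a minimal sketch of that distinction; the class, array names, and loop bodies are my assumptions, not the actual JMH benchmark sources. In the `Simple` shape the mul reduction is the only candidate vector pack in the loop, so SLP may judge vectorization unprofitable; in the `Big` shape the reduction is connected to other packable element-wise work:

```java
// Hypothetical sketch of the two benchmark shapes; names and loop bodies
// are illustrative, not the actual (float|double)Mul(Big|Simple) sources.
public class MulReductionShapes {
    static final int SIZE = 1024;
    static float[] a = new float[SIZE];
    static float[] b = new float[SIZE];
    static float[] r = new float[SIZE];

    // "Simple" shape: a bare reduction, unconnected to other vector packs.
    static float floatMulSimpleShape() {
        float acc = 1.0f;
        for (int i = 0; i < SIZE; i++) {
            acc *= a[i];
        }
        return acc;
    }

    // "Big" shape: the reduced value also feeds an element-wise add pack
    // and a store pack, so vectorizing the whole loop, reduction included,
    // is more likely to pay off.
    static float floatMulBigShape() {
        float acc = 1.0f;
        for (int i = 0; i < SIZE; i++) {
            float v = a[i] + b[i];
            r[i] = v;
            acc *= v;
        }
        return acc;
    }
}
```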

@XiaohongGong thanks for testing on a `128-bit SVE` machine. Since the performance difference with and without auto-vectorization is not significant for either `floatMulSimple` or `floatMulBig`, and @mikabl-arm reported a performance drop with auto-vectorization on a `256-bit SVE` machine, it seems reasonable to disable it on SVE as well.

I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3058018414

