RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]

Hao Sun haosun at openjdk.org
Fri Jul 11 02:07:41 UTC 2025


On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> This patch improves mul reduction VectorAPI operations on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms that are not expected to benefit from these changes, as well as for auto-vectorization benchmarks.
>> 
>> ### Neoverse N1 (NEON)
>> 
>> <details>
>> 
>> <summary>Auto-vectorization</summary>
>> 
>> | Benchmark                 | Before   | After    | Units | Diff |
>> |---------------------------|----------|----------|-------|------|
>> | mulRedD                   | 739.699  | 740.884  | ns/op |  ~   |
>> | byteAddBig                | 2670.248 | 2670.562 | ns/op |  ~   |
>> | byteAddSimple             | 1639.796 | 1639.940 | ns/op |  ~   |
>> | byteMulBig                | 2707.900 | 2708.063 | ns/op |  ~   |
>> | byteMulSimple             | 2452.939 | 2452.906 | ns/op |  ~   |
>> | charAddBig                | 2772.363 | 2772.269 | ns/op |  ~   |
>> | charAddSimple             | 1639.867 | 1639.751 | ns/op |  ~   |
>> | charMulBig                | 2796.533 | 2796.375 | ns/op |  ~   |
>> | charMulSimple             | 2453.034 | 2453.004 | ns/op |  ~   |
>> | doubleAddBig              | 2943.613 | 2936.897 | ns/op |  ~   |
>> | doubleAddSimple           | 1635.031 | 1634.797 | ns/op |  ~   |
>> | doubleMulBig              | 3001.937 | 3003.240 | ns/op |  ~   |
>> | doubleMulSimple           | 2448.154 | 2448.117 | ns/op |  ~   |
>> | floatAddBig               | 2963.086 | 2962.215 | ns/op |  ~   |
>> | floatAddSimple            | 1634.987 | 1634.798 | ns/op |  ~   |
>> | floatMulBig               | 3022.442 | 3021.356 | ns/op |  ~   |
>> | floatMulSimple            | 2447.976 | 2448.091 | ns/op |  ~   |
>> | intAddBig                 | 832.346  | 832.382  | ns/op |  ~   |
>> | intAddSimple              | 841.276  | 841.287  | ns/op |  ~   |
>> | intMulBig                 | 1245.155 | 1245.095 | ns/op |  ~   |
>> | intMulSimple              | 1638.762 | 1638.826 | ns/op |  ~   |
>> | longAddBig                | 4924.541 | 4924.328 | ns/op |  ~   |
>> | longAddSimple             | 841.623  | 841.625  | ns/op |  ~   |
>> | longMulBig                | 9848.954 | 9848.807 | ns/op |  ~   |
>> | longMulSimple             | 3427.169 | 3427.279 | ns/op |  ~   |
>> | shortAddBig               | 2670.027 | 2670.345 | ns/op |  ~   |
>> | shortAddSimple            | 1639.869 | 1639.876 | ns/op |  ~   |
>> | shortMulBig               | 2750.812 | 2750.562 | ns/op |  ~   |
>> | shortMulSimple            | 2453.030 | 2452.937 | ns/op |  ~   |
>> 
>>...
>
> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.
> 
> However, I do plan to remove the auto-vectorization restrictions for simple reductions.
> https://bugs.openjdk.org/browse/JDK-8307516
> 
> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`.
> https://bugs.openjdk.org/browse/JDK-8357530
> I published benchmark results there:
> https://github.com/openjdk/jdk/pull/25387
> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.
> 
> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;)
> 
> I don't have access to any SVE machines, so I cannot help you there, unfortunately.
> 
> Is this helpful to you?

@eme64  Thanks for your input. It's very helpful to us.
@fg1417  Thanks for the clarification on `case-2` that I mentioned earlier.
@mikabl-arm Thanks for providing the performance data on the Neoverse-V1 machine.
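
As a side note on the strict-order float/double reductions discussed above, here is a minimal, self-contained Java sketch (purely illustrative, not code from this patch) of why reassociating a float add reduction into vector lanes can change the result:

```java
// Illustrative only: floating-point addition is not associative, so a
// strict-order reduction cannot be freely split across vector lanes
// without potentially changing the result.
public class FpReductionOrder {
    public static void main(String[] args) {
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};

        // Strict left-to-right reduction, as Java scalar semantics require.
        float strict = 0.0f;
        for (float v : a) {
            strict += v;
        }

        // Lane-wise reduction as a hypothetical 2-lane vectorizer would do it:
        // lane 0 sums a[0] and a[2]; lane 1 sums a[1] and a[3]; then combine.
        float reassoc = (a[0] + a[2]) + (a[1] + a[3]);

        System.out.println(strict);   // 1.0 (1e8f + 1.0f rounds back to 1e8f)
        System.out.println(reassoc);  // 2.0
    }
}
```

This is why a vectorized float/double add or mul reduction must preserve the strict lane order, which is exactly the expensive case mentioned above.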

> Given that:
> 
> * this PR focuses on VectorAPI and **not** on auto-vectorization,
> * and it does **not** introduce regressions in auto-vectorization performance,
> 
> I suggest:
> 
> * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
> * moving forward with resolving the remaining VectorAPI issues and merging this PR.

I agree with your suggestion.
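
For readers following along, a hedged sketch of the kind of Vector API mul reduction this PR targets (the species choice and loop shape here are illustrative assumptions, not code from the patch; compiling and running requires `--add-modules jdk.incubator.vector`):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReduce {
    // The preferred species picks the widest shape the hardware supports,
    // e.g. 256b or wider on the SVE targets this PR optimizes.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float mulReduce(float[] a) {
        float acc = 1.0f;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            // reduceLanes(MUL) is the cross-lane mul reduction whose backend
            // code generation the patch improves on wide SVE vectors.
            acc *= v.reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {  // scalar tail
            acc *= a[i];
        }
        return acc;
    }
}
```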

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059975539


More information about the hotspot-compiler-dev mailing list