RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Emanuel Peter
epeter at openjdk.org
Thu Jul 3 05:26:46 UTC 2025
On Mon, 30 Jun 2025 12:35:47 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - fixup: don't modify the value in vsrc
>>
>> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>> change, the result of recursive folding is held in vtmp1. To be able to
>> pass this intermediate result to reduce_mul_integral_le128b(), we would
>> have to use another temporary FloatRegister, as vtmp1 would essentially
>> act as vsrc. However, it's possible to work around this:
>> reduce_mul_integral_le128b() is modified so that it accepts
>> matching vsrc and vtmp2 arguments. By doing this, we save a
>> temporary register in rules that match reduce_mul_integral_gt128b().
>> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating
>
> This patch improves the performance of mul reduction VectorAPIs on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms that aren't expected to benefit from these implementations, as well as for auto-vectorization benchmarks.
>
> ### Neoverse N1 (NEON)
>
> <details>
>
> <summary>Auto-vectorization</summary>
>
> | Benchmark | Before | After | Units | Diff |
> |---------------------------|----------|----------|-------|------|
> | mulRedD | 739.699 | 740.884 | ns/op | ~ |
> | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ |
> | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ |
> | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ |
> | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ |
> | charAddBig | 2772.363 | 2772.269 | ns/op | ~ |
> | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ |
> | charMulBig | 2796.533 | 2796.375 | ns/op | ~ |
> | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ |
> | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ |
> | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ |
> | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ |
> | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ |
> | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ |
> | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ |
> | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ |
> | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ |
> | intAddBig | 832.346 | 832.382 | ns/op | ~ |
> | intAddSimple | 841.276 | 841.287 | ns/op | ~ |
> | intMulBig | 1245.155 | 1245.095 | ns/op | ~ |
> | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ |
> | longAddBig | 4924.541 | 4924.328 | ns/op | ~ |
> | longAddSimple | 841.623 | 841.625 | ns/op | ~ |
> | longMulBig | 9848.954 | 9848.807 | ns/op | ~ |
> | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ |
> | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ |
> | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ |
> | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ |
> | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ |
>
> </details>
>
> <details>
>
> <summary>VectorAPI</summary>
>
> | Benchmark ...
@mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.
However, I do plan to remove the auto-vectorization restrictions for simple reductions.
https://bugs.openjdk.org/browse/JDK-8307516
You can already disable the (bad) reduction heuristic using `AutoVectorizationOverrideProfitability`.
https://bugs.openjdk.org/browse/JDK-8357530
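For reference, a rough sketch of how that switch could be exercised on the command line (whether it needs diagnostic options unlocked and which values it accepts are assumptions here; the benchmark jar and benchmark name are only placeholders):

```
# Inspect the flag (assumes it appears among the diagnostic options; adjust if it is a product flag).
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version | grep AutoVectorizationOverrideProfitability

# Run a JMH benchmark with the profitability heuristic overridden.
# The value 2 is an assumption -- check the flag's description for the accepted settings.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:AutoVectorizationOverrideProfitability=2 \
     -jar benchmarks.jar VectorReduction
```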
I published benchmark results there:
https://github.com/openjdk/jdk/pull/25387
You can see that enabling simple reductions is now actually profitable in most cases. But float/double add and mul have a strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost model, so that we can predict whether vectorization is profitable.
It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any such example in my benchmarks (https://github.com/openjdk/jdk/pull/25387). If you find one, please let me know ;)
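As a concrete illustration of the strict-order issue, here is a minimal, self-contained Java sketch (class and array names are only illustrative) of the kind of float mul reduction where source-order accumulation is what makes a vectorized reduction expensive:

```java
// Strict-order float mul reduction: Java requires the products to be
// accumulated in source order, so an auto-vectorized version must use an
// ordered (lane-by-lane) reduction. That is only worthwhile when the rest
// of the loop body vectorizes well.
public class FloatMulReduction {
    static float mulReduce(float[] a) {
        float acc = 1.0f;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i]; // reassociating changes rounding, hence the strict order
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        java.util.Arrays.fill(a, 1.0001f);
        System.out.println(mulReduce(a));
    }
}
```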
I don't have access to any SVE machines, so I cannot help you there, unfortunately.
Is this helpful to you?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030798159