RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Emanuel Peter
epeter at openjdk.org
Thu Jul 3 05:26:46 UTC 2025
On Mon, 30 Jun 2025 12:35:47 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - fixup: don't modify the value in vsrc
>>
>> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>> change, the result of recursive folding is held in vtmp1. To be able to
>> pass this intermediate result to reduce_mul_integral_le128b(), we would
>> have to use another temporary FloatRegister, as vtmp1 would essentially
>> act as vsrc. However, it's possible to work around this:
>> reduce_mul_integral_le128b() is modified so that it accepts
>> matching vsrc and vtmp2 arguments. By doing this, we save a
>> temporary register in rules that match reduce_mul_integral_gt128b().
>> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating
>
> This patch improves the performance of mul reduction VectorAPIs on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms that aren't expected to benefit from these implementations, as well as for auto-vectorization benchmarks.
>
> ### Neoverse N1 (NEON)
>
> <details>
>
> <summary>Auto-vectorization</summary>
>
> | Benchmark | Before | After | Units | Diff |
> |---------------------------|----------|----------|-------|------|
> | mulRedD | 739.699 | 740.884 | ns/op | ~ |
> | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ |
> | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ |
> | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ |
> | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ |
> | charAddBig | 2772.363 | 2772.269 | ns/op | ~ |
> | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ |
> | charMulBig | 2796.533 | 2796.375 | ns/op | ~ |
> | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ |
> | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ |
> | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ |
> | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ |
> | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ |
> | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ |
> | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ |
> | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ |
> | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ |
> | intAddBig | 832.346 | 832.382 | ns/op | ~ |
> | intAddSimple | 841.276 | 841.287 | ns/op | ~ |
> | intMulBig | 1245.155 | 1245.095 | ns/op | ~ |
> | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ |
> | longAddBig | 4924.541 | 4924.328 | ns/op | ~ |
> | longAddSimple | 841.623 | 841.625 | ns/op | ~ |
> | longMulBig | 9848.954 | 9848.807 | ns/op | ~ |
> | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ |
> | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ |
> | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ |
> | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ |
> | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ |
>
> </details>
>
> <details>
>
> <summary>VectorAPI</summary>
>
> | Benchmark ...
@mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.
However, I do plan to remove the auto-vectorization restrictions for simple reductions.
https://bugs.openjdk.org/browse/JDK-8307516
You can already disable the (bad) reduction heuristic using `AutoVectorizationOverrideProfitability`.
https://bugs.openjdk.org/browse/JDK-8357530
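For reference, a rough sketch of how that switch could be exercised on the command line (whether it needs diagnostic options unlocked and which values it accepts are assumptions here; the benchmark jar and benchmark name are only placeholders):

```
# Inspect the flag (assumes it appears among the diagnostic options; adjust if it is a product flag).
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version | grep AutoVectorizationOverrideProfitability

# Run a JMH benchmark with the profitability heuristic overridden.
# The value 2 is an assumption -- check the flag's description for the accepted settings.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:AutoVectorizationOverrideProfitability=2 \
     -jar benchmarks.jar VectorReduction
```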
I published benchmark results there:
https://github.com/openjdk/jdk/pull/25387
You can see that enabling simple reductions is now actually profitable in most cases. But float/double add and mul have a strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost model, so that we can predict whether vectorization is profitable.
It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any such example in my benchmarks (https://github.com/openjdk/jdk/pull/25387). If you find one, please let me know ;)
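As a concrete illustration of the strict-order issue, here is a minimal, self-contained Java sketch (class and array names are only illustrative) of the kind of float mul reduction where source-order accumulation is what makes a vectorized reduction expensive:

```java
// Strict-order float mul reduction: Java requires the products to be
// accumulated in source order, so an auto-vectorized version must use an
// ordered (lane-by-lane) reduction. That is only worthwhile when the rest
// of the loop body vectorizes well.
public class FloatMulReduction {
    static float mulReduce(float[] a) {
        float acc = 1.0f;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i]; // reassociating changes rounding, hence the strict order
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        java.util.Arrays.fill(a, 1.0001f);
        System.out.println(mulReduce(a));
    }
}
```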
I don't have access to any SVE machines, so I cannot help you there, unfortunately.
Is this helpful to you?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030798159