RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Hao Sun
haosun at openjdk.org
Fri Jul 11 02:07:41 UTC 2025
On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> This patch improves the performance of mul reduction VectorAPIs on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms, which aren't expected to benefit from these implementations, and for auto-vectorization benchmarks.
>>
>> ### Neoverse N1 (NEON)
>>
>> <details>
>>
>> <summary>Auto-vectorization</summary>
>>
>> | Benchmark | Before | After | Units | Diff |
>> |---------------------------|----------|----------|-------|------|
>> | mulRedD | 739.699 | 740.884 | ns/op | ~ |
>> | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ |
>> | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ |
>> | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ |
>> | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ |
>> | charAddBig | 2772.363 | 2772.269 | ns/op | ~ |
>> | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ |
>> | charMulBig | 2796.533 | 2796.375 | ns/op | ~ |
>> | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ |
>> | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ |
>> | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ |
>> | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ |
>> | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ |
>> | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ |
>> | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ |
>> | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ |
>> | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ |
>> | intAddBig | 832.346 | 832.382 | ns/op | ~ |
>> | intAddSimple | 841.276 | 841.287 | ns/op | ~ |
>> | intMulBig | 1245.155 | 1245.095 | ns/op | ~ |
>> | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ |
>> | longAddBig | 4924.541 | 4924.328 | ns/op | ~ |
>> | longAddSimple | 841.623 | 841.625 | ns/op | ~ |
>> | longMulBig | 9848.954 | 9848.807 | ns/op | ~ |
>> | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ |
>> | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ |
>> | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ |
>> | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ |
>> | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ |
>>
>>...
>
> @mikabl-arm @XiaohongGong I'm a bit busy in the weeks before I go on vacation, so I won't have time to look into this more deeply.
>
> However, I do plan to remove the auto-vectorization restrictions for simple reductions.
> https://bugs.openjdk.org/browse/JDK-8307516
>
> You can already disable the (bad) reduction heuristic using `AutoVectorizationOverrideProfitability`.
> https://bugs.openjdk.org/browse/JDK-8357530
> I published benchmark results there:
> https://github.com/openjdk/jdk/pull/25387
> You can see that enabling simple reductions is now actually profitable in most cases. But float/double add and mul have a strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost model, so that we can predict whether vectorization is profitable.
>
> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any such example in my benchmarks: https://github.com/openjdk/jdk/pull/25387. If you find one, please let me know ;)
>
> I don't have access to any SVE machines, so I cannot help you there, unfortunately.
>
> Is this helpful to you?
@eme64 Thanks for your input. It's very helpful to us.
@fg1417 Thanks for clarifying `case-2`, which I mentioned earlier.
@mikabl-arm Thanks for providing the performance data on the Neoverse-V1 machine.
> Given that:
>
> * this PR focuses on VectorAPI and **not** on auto-vectorization,
> * and it does **not** introduce regressions in auto-vectorization performance,
>
> I suggest:
>
> * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
> * moving forward with resolving the remaining VectorAPI issues and merging this PR.
I agree with your suggestion.
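
For readers following the auto-vectorization side of the thread, the strict-order constraint on float/double reductions mentioned above can be illustrated with a minimal sketch. The class and method names below (`StrictOrderDemo`, `strictSum`, `twoLaneSum`, and so on) are hypothetical and do not come from the PR. The two-accumulator loops only model what a lane-wise vector reduction would compute; they are not what C2 or the VectorAPI intrinsics actually emit.

```java
public class StrictOrderDemo {
    // Sequential reduction: the evaluation order that Java semantics
    // require for a plain float accumulation loop.
    static float strictSum(float[] a) {
        float acc = 0.0f;
        for (float v : a) {
            acc += v;
        }
        return acc;
    }

    // Reassociated reduction: two partial accumulators stand in for two
    // vector lanes and are combined once at the end (illustrative model only).
    static float twoLaneSum(float[] a) {
        float lane0 = 0.0f;
        float lane1 = 0.0f;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            lane0 += a[i];
            lane1 += a[i + 1];
        }
        if (i < a.length) {
            lane0 += a[i]; // odd-length tail
        }
        return lane0 + lane1;
    }

    // The same two shapes for int, where wrap-around addition is associative.
    static int strictSum(int[] a) {
        int acc = 0;
        for (int v : a) {
            acc += v;
        }
        return acc;
    }

    static int twoLaneSum(int[] a) {
        int lane0 = 0;
        int lane1 = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            lane0 += a[i];
            lane1 += a[i + 1];
        }
        if (i < a.length) {
            lane0 += a[i];
        }
        return lane0 + lane1;
    }

    public static void main(String[] args) {
        // 1e8f + 1.0f rounds back to 1e8f, so the result depends on the order.
        float[] f = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("float strict:   " + strictSum(f));
        System.out.println("float two-lane: " + twoLaneSum(f));

        int[] n = {Integer.MAX_VALUE, 1, -Integer.MAX_VALUE, 1};
        System.out.println("int strict:     " + strictSum(n));
        System.out.println("int two-lane:   " + twoLaneSum(n));
    }
}
```

With this input the two float results differ, while the two int results agree. This is why integer reductions can be reordered freely, whereas a float reduction must either keep the sequential order (the expensive strict-order vector reduction) or stay scalar.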
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059975539