RFR: 8343689: AArch64: Optimize MulReduction implementation [v6]
Hao Sun
haosun at openjdk.org
Thu Jul 3 04:47:41 UTC 2025
On Wed, 2 Jul 2025 08:48:59 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add an SVE specialization of the `reduce_mul` intrinsic for vectors of 256 bits or more. It repeatedly multiplies halves of the source vector using SVE instructions until the intermediate result fits into a 128-bit SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or less; for those, the existing ASIMD implementation is still used directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmark results:
>>
>> Neoverse-V1 (SVE 256-bit):
>>
>> | Benchmark | (size) | Mode | master | PR | Units |
>> | :-- | --: | :-- | --: | --: | :-- |
>> | ByteMaxVector.MULLanes | 1024 | thrpt | 5447.643 | 11455.535 | ops/ms |
>> | ShortMaxVector.MULLanes | 1024 | thrpt | 3388.183 | 7144.301 | ops/ms |
>> | IntMaxVector.MULLanes | 1024 | thrpt | 3010.974 | 4911.485 | ops/ms |
>> | LongMaxVector.MULLanes | 1024 | thrpt | 1539.137 | 2562.835 | ops/ms |
>> | FloatMaxVector.MULLanes | 1024 | thrpt | 1355.551 | 4158.128 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024 | thrpt | 1715.854 | 3284.189 | ops/ms |
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> | Benchmark | (size) | Mode | master | PR | Units |
>> | :-- | --: | :-- | --: | --: | :-- |
>> | ByteMaxVector.MULLanes | 1024 | thrpt | 1091.692 | 2887.798 | ops/ms |
>> | ShortMaxVector.MULLanes | 1024 | thrpt | 597.008 | 1863.338 | ops/ms |
>> | IntMaxVector.MULLanes | 1024 | thrpt | 510.642 | 1348.651 | ops/ms |
>> | LongMaxVector.MULLanes | 1024 | thrpt | 468.878 | 878.620 | ops/ms |
>> | FloatMaxVector.MULLanes | 1024 | thrpt | 376.284 | 2237.564 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024 | thrpt | 431.343 | 1646.792 | ops/ms |
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>
> cleanup: update a copyright notice
>
> Co-authored-by: Hao Sun <haosun at nvidia.com>
Hi. This PR changes {Int Mul Reduction, FP Mul Reduction} × {auto-vectorization, VectorAPI}. After an offline discussion with @XiaohongGong, we have one question about the impact of this PR on **FP Mul Reduction + auto-vectorization**.
The table below lists whether **FP Mul Reduction + auto-vectorization** is on or off before and after this PR.
| Case | Condition | Before | After |
| :-------- | :-------: | --------: | --------: |
| case-1 | `UseSVE=0` | off | off |
| case-2 | `UseSVE>0` and `length_in_bytes` = 8 or 16 | on | off |
| case-3 | `UseSVE>0` and `length_in_bytes` > 16 | off | off |
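For concreteness, **FP Mul Reduction + auto-vectorization** refers to C2's SLP vectorizing a scalar FP mul-reduction loop of roughly the following shape. This is a minimal Java sketch for illustration, not code from the PR or its tests:

```java
// Scalar FP mul reduction. When the platform reports support for the
// corresponding reduction node (MulReductionVF), SLP may vectorize this
// loop; the table above summarizes when that happens on AArch64.
static float mulReduce(float[] a) {
    float r = 1.0f;
    for (int i = 0; i < a.length; i++) {
        r *= a[i];
    }
    return r;
}
```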
## case-1 and case-2
Background: case-1 was turned off by @fg1417's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175), but case-2 was not touched.
We are not sure of the reason. Was there no 128-bit SVE machine available at the time? Or was there some limitation of SLP on **reduction**?
The SLP **limitation** mentioned in @fg1417's patch:
> Because superword doesn't vectorize reductions unconnected with other vector packs,
Performance data on case-2 in this PR: in the [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) you provided (`Neoverse V2 (SVE 128-bit)`, auto-vectorization section), there is no obvious performance change in the FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`.
We checked the code generated for `floatMul(Big|Simple)` on an NVIDIA Grace machine (128-bit SVE2) and found that before this PR (see the sketch after this list):
- `floatMulBig` is vectorized.
- `floatMulSimple` is not vectorized, because SLP determines there is no profit in doing so.
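Our reading of the difference between the two shapes, as a paraphrased sketch (illustrative, not the actual benchmark sources):

```java
// floatMulBig-like shape: the element-wise multiply a[i] * b[i] forms a
// vector pack of its own, so the reduction is connected to another pack
// and SLP vectorizes both.
static float mulBigLike(float[] a, float[] b) {
    float r = 1.0f;
    for (int i = 0; i < a.length; i++) {
        r *= a[i] * b[i];
    }
    return r;
}
```

`floatMulSimple` would then correspond to the bare reduction loop sketched after the case table: the reduction is the only candidate pack in the loop, which matches the quoted limitation that superword does not vectorize reductions unconnected with other vector packs.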
Discussion: should we enable case-1 and case-2?
- Should they be enabled once the SLP limitation on reductions is fixed?
- If there is no such limitation, we may consider enabling case-1 and case-2 now, because a) there is no perf regression, at least based on the current performance results, and b) it may provide more auto-vectorization opportunities for other packs inside the loop.
It would be appreciated if @eme64 or @fg1417 could provide more input.
## case-3
Status: this PR adds the rules `reduce_mulF_gt128b` and `reduce_mulD_gt128b`, but these two rules are **never** selected. See the [comment from Xiaohong](https://github.com/openjdk/jdk/pull/23181/files#r2176590314).
Our suggestion: we are not sure whether enabling case-3 is profitable. Could you help run more tests on `Neoverse V1 (SVE 256-bit)`? Note that a local change is needed to enable case-3, e.g. removing [these lines](https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130-R136).
Expected result:
- If there is performance gain, we may consider enabling case-3 for auto-vectorization.
- If there is no performance gain, we suggest removing these two match rules because they are dead code.
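PS: for anyone reproducing the `MULLanes` numbers quoted at the top, the VectorAPI path this PR optimizes boils down to `reduceLanes(VectorOperators.MUL)`. Below is a minimal Java sketch of such a kernel (illustrative only; it uses `FloatVector.SPECIES_MAX` like the `FloatMaxVector` benchmark but is not the benchmark source). Run with `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulLanesSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_MAX;

    // Reduces each vector-width chunk with the intrinsic this PR optimizes,
    // then folds the partial products together; a scalar loop handles the tail.
    static float mulLanes(float[] a) {
        float r = 1.0f;
        int i = 0;
        for (int bound = SPECIES.loopBound(a.length); i < bound; i += SPECIES.length()) {
            r *= FloatVector.fromArray(SPECIES, a, i)
                            .reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) { // scalar tail
            r *= a[i];
        }
        return r;
    }
}
```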
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030705608