RFR: 8343689: AArch64: Optimize MulReduction implementation [v6]

Hao Sun haosun at openjdk.org
Thu Jul 3 04:47:41 UTC 2025


On Wed, 2 Jul 2025 08:48:59 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:

>> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used.
>> 
>> Nothing changes for <= 128-bit long vectors, as the existing ASIMD implementation is still used directly for those.
>> 
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>> 
>> Benchmarks results:
>> 
>> Neoverse-V1 (SVE 256-bit)
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
>> 
>> 
>> Fujitsu A64FX (SVE 512-bit):
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
> 
>   cleanup: update a copyright notice
>   
>   Co-authored-by: Hao Sun <haosun at nvidia.com>

Hi. This PR touches {Int Mul Reduction, FP Mul Reduction} X {auto-vectorization, VectorAPI}. After an offline discussion with @XiaohongGong, we have one question about the impact of this PR on **FP Mul Reduction + auto-vectorization**.
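
For concreteness, here is a minimal Java sketch of the VectorAPI flavour of FP mul reduction (illustrative only; the class and method names are made up, and this is not the actual `FloatMaxVector.MULLanes` benchmark code). The auto-vectorization flavour, by contrast, is simply a scalar reduction loop that SLP may vectorize; the question below concerns only that second flavour.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class MulReduceSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // VectorAPI path: reduceLanes(MUL) is compiled via the Vector API
    // intrinsics (a MulReductionVF node for floats), independent of SLP.
    static float mulLanes(float[] a) {
        float acc = 1.0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            acc *= FloatVector.fromArray(SPECIES, a, i)
                              .reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {   // scalar tail
            acc *= a[i];
        }
        return acc;
    }
}
```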

The table below lists whether **FP Mul Reduction + auto-vectorization** is on or off before and after this PR.

|        | Check                                  | before | after |
| :----- | :------------------------------------- | :----: | :---: |
| case-1 | UseSVE=0                                | off    | off   |
| case-2 | UseSVE>0 and length_in_bytes = 8 or 16  | on     | off   |
| case-3 | UseSVE>0 and length_in_bytes > 16       | off    | off   |

## case-1 and case-2

Background: case-1 was turned off by @fg1417 's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175), but case-2 was not touched.
We are not sure of the reason. Was there no 128-bit SVE machine at the time? Or was there some limitation of SLP on **reduction**?

The SLP **limitation** mentioned in @fg1417 's patch:
> Because superword doesn't vectorize reductions unconnected with other vector packs, 

Performance data in this PR for case-2: in the [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) you provided for `Neoverse V2 (SVE 128-bit)`, auto-vectorization section, there is no obvious performance change in the FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`.
When we checked the code generated for `floatMul(Big|Simple)` on an Nvidia Grace machine (128-bit SVE2), we found that before this PR:
- `floatMulBig` is vectorized.
- `floatMulSimple` is not vectorized, because SLP determines it is not profitable (see the sketch below).
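
As a rough illustration of the difference, here is a hypothetical sketch of the two kernel shapes (not the actual benchmark code; the real JMH benchmarks live in the JDK's micro-benchmark tree): the "Big" flavour has extra vectorizable work connected to the reduction, while the "Simple" flavour is the bare reduction.

```java
class MulKernelsSketch {
    // "Big" flavour: the multiply reduction is fed by other operations that
    // SLP can vectorize, so the reduction pack is connected to other packs.
    static float floatMulBigLike(float[] a, float[] b, float[] c) {
        float acc = 1.0f;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i] * b[i] + c[i];   // extra vectorizable work in the body
        }
        return acc;
    }

    // "Simple" flavour: the body is only the reduction itself, so SLP finds
    // no connected packs and judges vectorization unprofitable.
    static float floatMulSimpleLike(float[] a) {
        float acc = 1.0f;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i];
        }
        return acc;
    }
}
```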

Discussion: should we enable case-1 and case-2?
- Should we enable them once the SLP limitation on reductions is fixed?
- If there is no such limitation, we may consider enabling case-1 and case-2, because a) there is no perf regression, at least based on the current performance results, and b) it may provide more auto-vectorization opportunities for other packs inside the loop.

It would be appreciated if @eme64 or @fg1417 could provide more input.

## case-3

Status: this PR adds rules `reduce_mulF_gt128b` and `reduce_mulD_gt128b` but these two rules are **not** selected. See the [comment from Xiaohong](https://github.com/openjdk/jdk/pull/23181/files#r2176590314).

Our suggestion: we're not sure whether enabling case-3 is profitable. Could you run more tests on `Neoverse V1 (SVE 256-bit)`? Note that a local change is needed to enable case-3, e.g. removing [these lines](https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130-R136).

Expected result:
- If there is a performance gain, we may consider enabling case-3 for auto-vectorization.
- If there is no performance gain, we suggest removing these two match rules, since they would be dead code.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030705608

