RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]
Mikhail Ablakatov
mablakatov at openjdk.org
Wed Feb 5 17:09:16 UTC 2025
On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Use EXT instead of COMPACT to split a vector into two halves
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189  ops/ms
>>
>> Fujitsu A64FX (SVE 512-bit)
>>
>> Benchmark                 (size)  Mode   master        PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  1091.692  2887.798  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt   597.008  1863.338  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt   510.642  1348.651  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt   468.878   878.620  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt   376.284  2237.564  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt   431.343  1646.792  ops/ms
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
>
>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>> 2138: // instructions are used.
>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
>
> Drive-by question:
> This is recursive folding: halve the vector and combine it that way.
>
> What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord, and have strict order requirement, to avoid rounding divergences.
We have strictly-ordered intrinsics for add reduction: https://github.com/openjdk/jdk/blob/19399d271ef00f925232fbbe9087b5772f2fca01/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2903
None of the Arm64 NEON/SVE/SVE2 extensions has a dedicated mul reduction instruction, so it's implemented recursively where strict ordering isn't required (i.e. for the Vector API). For auto-vectorization we impose `_requires_strict_order` on `MulReductionVFNode` / `MulReductionVDNode`. Although I suspect that we might have missed something, as I see a speedup for `VectorReduction2.WithSuperword.doubleMulBig` / `floatMulBig` which I didn't expect.
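To make the distinction concrete, here is a minimal sketch (not JDK code; class and method names are made up for illustration) of the two reduction orders on a plain array. The folding variant mirrors the PR's strategy of repeatedly splitting the vector into halves and multiplying them pairwise until the data fits a 128-bit register; the linear variant is the strictly-ordered evaluation SuperWord must preserve for FP, since for floats the two association orders can round differently. Integers are used here so both orders provably agree:

```java
public class FoldMulReduce {
    // Recursive folding: lg(n) rounds, each multiplying the upper half
    // of the live region into the lower half (length must be a power of two).
    static int mulReduceFolded(int[] v) {
        int[] work = v.clone();
        for (int half = work.length / 2; half >= 1; half /= 2) {
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
        }
        return work[0];
    }

    // Strictly-ordered (linear) reduction: one multiply per element,
    // left to right -- the order that avoids FP rounding divergences.
    static int mulReduceLinear(int[] v) {
        int acc = 1;
        for (int x : v) {
            acc *= x;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] v = {3, 1, 4, 1, 5, 9, 2, 6};
        // Integer multiplication is associative, so both orders match.
        System.out.println(mulReduceFolded(v)); // 6480
        System.out.println(mulReduceLinear(v)); // 6480
    }
}
```

For `float`/`double` inputs the two methods can return different values, which is exactly why the recursive scheme is reserved for the order-insensitive Vector API path.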
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1943343335