RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]
Mikhail Ablakatov
mablakatov at openjdk.org
Wed Feb 5 17:09:16 UTC 2025
On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Use EXT instead of COMPACT to split a vector into two halves
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189  ops/ms
>>
>> Fujitsu A64FX (SVE 512-bit)
>>
>> Benchmark                 (size)  Mode   master        PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  1091.692  2887.798  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt   597.008  1863.338  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt   510.642  1348.651  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt   468.878   878.620  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt   376.284  2237.564  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt   431.343  1646.792  ops/ms
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
>
>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>> 2138: // instructions are used.
>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
>
> Drive-by question:
> This is recursive folding: halve the vector and combine it that way.
>
> What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord, and have strict order requirement, to avoid rounding divergences.
We have strictly-ordered intrinsics for add reduction: https://github.com/openjdk/jdk/blob/19399d271ef00f925232fbbe9087b5772f2fca01/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2903
None of the Arm64 NEON/SVE/SVE2 extensions has a dedicated mul reduction instruction, so it's implemented recursively where strict ordering isn't required (i.e. for the Vector API). For auto-vectorization we impose `_requires_strict_order` on `MulReductionVFNode` / `MulReductionVDNode`. Although I suspect that we might have missed something, as I see a speedup for `VectorReduction2.WithSuperword.doubleMulBig` / `floatMulBig` which I didn't expect.
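To make the distinction concrete, here is a minimal sketch (not JDK code; class and method names are made up for illustration) of the two reduction orders on a plain array. The folding variant mirrors the PR's strategy of repeatedly splitting the vector into halves and multiplying them pairwise until the data fits a 128-bit register; the linear variant is the strictly-ordered evaluation SuperWord must preserve for FP, since for floats the two association orders can round differently. Integers are used here so both orders provably agree:

```java
public class FoldMulReduce {
    // Recursive folding: lg(n) rounds, each multiplying the upper half
    // of the live region into the lower half (length must be a power of two).
    static int mulReduceFolded(int[] v) {
        int[] work = v.clone();
        for (int half = work.length / 2; half >= 1; half /= 2) {
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
        }
        return work[0];
    }

    // Strictly-ordered (linear) reduction: one multiply per element,
    // left to right -- the order that avoids FP rounding divergences.
    static int mulReduceLinear(int[] v) {
        int acc = 1;
        for (int x : v) {
            acc *= x;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] v = {3, 1, 4, 1, 5, 9, 2, 6};
        // Integer multiplication is associative, so both orders match.
        System.out.println(mulReduceFolded(v)); // 6480
        System.out.println(mulReduceLinear(v)); // 6480
    }
}
```

For `float`/`double` inputs the two methods can return different values, which is exactly why the recursive scheme is reserved for the order-insensitive Vector API path.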
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1943343335