RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Mikhail Ablakatov
mablakatov at openjdk.org
Mon Jun 30 13:25:10 UTC 2025
On Mon, 30 Jun 2025 13:22:34 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for vectors that are at least 256 bits long. It multiplies halves of the source vector together using SVE instructions until the intermediate result is a 128-bit vector that fits into a SIMD&FP register; from that point, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or less: those still use the existing ASIMD implementation directly.
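>>
>> As a rough illustration of that strategy, here is a minimal scalar model (not the actual MacroAssembler/SVE code; the long[] stands in for an SVE vector register of longs):
>>
>> ```java
>> // Illustrative model of the >= 256-bit reduction: each step folds the upper
>> // half of the vector into the lower half with a lane-wise multiply (the real
>> // code uses SVE EXT + MUL), until 128 bits (2 longs) remain for the ASIMD path.
>> public class MulReduceSketch {
>>     static long reduceMul(long[] lanes) {
>>         long[] v = lanes.clone();          // keep the source vector unmodified
>>         int n = v.length;
>>         while (n > 2) {                    // 2 longs == 128 bits
>>             n /= 2;
>>             for (int i = 0; i < n; i++) {
>>                 v[i] *= v[i + n];          // fold upper half into lower half
>>             }
>>         }
>>         return v[0] * v[1];                // final 128-bit (ASIMD) step
>>     }
>>
>>     public static void main(String[] args) {
>>         // 256-bit vector of longs: [2, 3, 4, 5] -> (2*4) * (3*5) = 120
>>         System.out.println(reduceMul(new long[] {2, 3, 4, 5}));
>>     }
>> }
>> ```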
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
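>>
>> For reference, the MULLanes micro-benchmarks boil down to a lane-wise multiply reduction like the sketch below (assumed shape, not the exact benchmark source; run with --add-modules jdk.incubator.vector):
>>
>> ```java
>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> // A multiply reduction over an array via the Vector API; reduceLanes(MUL)
>> // is the operation that maps to the reduce_mul backend rules touched here.
>> public class MulLanesSketch {
>>     static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_MAX;
>>
>>     static long mulLanes(long[] a) {
>>         long acc = 1;
>>         for (int i = 0; i < a.length; i += SPECIES.length()) {
>>             acc *= LongVector.fromArray(SPECIES, a, i)
>>                              .reduceLanes(VectorOperators.MUL);
>>         }
>>         return acc;
>>     }
>>
>>     public static void main(String[] args) {
>>         long[] a = new long[1024];
>>         java.util.Arrays.fill(a, 1L);
>>         a[3] = 7L;
>>         System.out.println(mulLanes(a)); // 7
>>     }
>> }
>> ```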
>>
>> Benchmark results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> | Benchmark                | (size) | Mode  | master   | PR        | Units  |
>> |--------------------------|--------|-------|----------|-----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 5447.643 | 11455.535 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 3388.183 | 7144.301  | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 3010.974 | 4911.485  | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 1539.137 | 2562.835  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 1355.551 | 4158.128  | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 1715.854 | 3284.189  | ops/ms |
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> | Benchmark                | (size) | Mode  | master   | PR       | Units  |
>> |--------------------------|--------|-------|----------|----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 1091.692 | 2887.798 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 597.008  | 1863.338 | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 510.642  | 1348.651 | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 468.878  | 878.620  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 376.284  | 2237.564 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 431.343  | 1646.792 | ops/ms |
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>
> - cleanup: address nits, rename several symbols
> - cleanup: remove unreferenced definitions
> - Address review comments.
>
> - fixup: disable FP mul reduction auto-vectorization for all targets
> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
>   reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
> - cleanup: replace a complex lambda in the above methods with a loop
> - cleanup: rename symbols to follow the existing naming convention
> - cleanup: add asserts to SVE only instructions
> - split mul FP reduction instructions into strictly-ordered (default)
>   and explicitly non-strictly-ordered variants (see the ordering sketch after this commit list)
> - remove redundant conditions in TestVectorFPReduction.java
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit)
>
> | Benchmark | Before | After | Units | Diff |
> |---------------------------|----------|----------|--------|-------|
> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
> - Merge branch 'master' into 8343689-rebase
> - fixup: don't modify the value in vsrc
>
> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
> change, the result of recursive folding is held in vtmp1. To be able to
> pass this intermediate result to reduce_mul_integral_le128b(), we would
> have to use another temporary FloatRegister, as vtmp1 would essentially
> act as vsrc. It's possible to get around this, however:
> reduce_mul_integral_le128b() is modified so that matching vsrc and vtmp2
> arguments can be passed. By doing this, we save ourselves a temporary
> register in rules that match reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
> - Use EXT instead of COMPACT to split a vector into two halves
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark (size) Mode master PR Units
> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
> Short...
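>
> The strictly-ordered vs. non-strictly-ordered split above matters because FP multiplication is not associative: a reassociated (pairwise) fold can round differently from a strict left-to-right reduction. A minimal illustration of the difference (not HotSpot code):
>
> ```java
> // Strictly-ordered (left-to-right) FP multiply reduction vs. a reassociated
> // pairwise fold, as a vectorized reduction would compute it. The results can
> // differ in the last bits, which is why the strict form stays the default.
> public class FpOrderSketch {
>     static float strictlyOrdered(float[] a) {
>         float acc = 1.0f;
>         for (float x : a) acc *= x;        // fixed left-to-right order
>         return acc;
>     }
>
>     static float pairwise(float[] a) {     // assumes a power-of-two length
>         if (a.length == 1) return a[0];
>         float[] half = new float[a.length / 2];
>         for (int i = 0; i < half.length; i++) {
>             half[i] = a[i] * a[i + half.length];
>         }
>         return pairwise(half);
>     }
>
>     public static void main(String[] args) {
>         float[] a = {0.1f, 3.0f, 7.0f, 0.3f};
>         System.out.println(strictlyOrdered(a) == pairwise(a)); // may print false
>     }
> }
> ```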
Thank you for the review! There are a couple more nits I've missed; I'll submit an update to resolve them shortly.
-------------
PR Review: https://git.openjdk.org/jdk/pull/23181#pullrequestreview-2970941468