RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Mikhail Ablakatov
mablakatov at openjdk.org
Tue Jul 1 16:14:47 UTC 2025
On Tue, 1 Jul 2025 07:00:08 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>>
>> - cleanup: address nits, rename several symbols
>> - cleanup: remove unreferenced definitions
>> - Address review comments.
>>
>> - fixup: disable FP mul reduction auto-vectorization for all targets
>> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
>> reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
>> - cleanup: replace a complex lambda in the above methods with a loop
>> - cleanup: rename symbols to follow the existing naming convention
>> - cleanup: add asserts to SVE only instructions
>> - split mul FP reduction instructions into strictly-ordered (default)
>> and explicitly non strictly-ordered
>> - remove redundant conditions in TestVectorFPReduction.java
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> | Benchmark | Before | After | Units | Diff |
>> |---------------------------|----------|----------|--------|-------|
>> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
>> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
>> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
>> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
>> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
>> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
>> - Merge branch 'master' into 8343689-rebase
>> - fixup: don't modify the value in vsrc
>>
>> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>> change, the result of recursive folding is held in vtmp1. To be able to
>> pass this intermediate result to reduce_mul_integral_le128b(), we would
>> have to use another temporary FloatRegister, as vtmp1 would essentially
>> act as vsrc. It's possible to get around this however:
>> reduce_mul_integral_le128b() is modified so it's possible to pass
>> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>> temporary register in rules that match to reduce_mul_integral_gt128b().
>> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>> - Use EXT instead of COMPACT to split a vector into two halves
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark (size) Mode master ...
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536:
>
>> 3534:
>> 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{
>> 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order());
>
> Are there any cases that can match this rule?
Well, we don't match it right now for auto-vectorization, as it isn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so that the set of instructions is complete.
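
To illustrate why the strictly-ordered and non-strictly-ordered FP rules are kept separate, here is a minimal scalar sketch in Java. It is not the code from the patch: the class and method names are hypothetical, and the folding loop only models the idea of halving a vector recursively while leaving the source untouched, assuming a power-of-two lane count.

```java
public class MulReductionSketch {
    // Strictly-ordered reduction: lanes are multiplied left to right,
    // which is the order a scalar loop would use.
    static float strictOrderMul(float init, float[] lanes) {
        float acc = init;
        for (float v : lanes) {
            acc *= v;
        }
        return acc;
    }

    // Non-strictly-ordered reduction by recursive folding: the upper half
    // is multiplied into the lower half until one lane is left.
    // Assumption: lanes.length is a power of two.
    static float foldedMul(float init, float[] lanes) {
        float[] tmp = lanes.clone(); // work on a copy so the source stays unmodified
        for (int len = tmp.length / 2; len >= 1; len /= 2) {
            for (int i = 0; i < len; i++) {
                tmp[i] *= tmp[i + len];
            }
        }
        return init * tmp[0];
    }

    public static void main(String[] args) {
        float[] small = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        System.out.println(strictOrderMul(1f, small));
        System.out.println(foldedMul(1f, small));

        // Here the two orders disagree: left to right overflows early,
        // while folding pairs each large lane with a small one.
        float[] extreme = {1e30f, 1e30f, 1e-30f, 1e-30f};
        System.out.println(strictOrderMul(1f, extreme));
        System.out.println(foldedMul(1f, extreme));
    }
}
```

For integral types the two orders always agree, because wrapping multiplication is associative. For floating point they may not, so a reduction that requires strict order cannot be folded this way.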
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178014966