RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Hao Sun
haosun at openjdk.org
Tue Jul 1 02:54:47 UTC 2025
On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions until the result is a 128-bit long vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>>
>> Nothing changes for <= 128-bit long vectors: for those, the existing ASIMD implementation is still used directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> | Benchmark                | (size) | Mode  | master   | PR        | Units  |
>> |--------------------------|--------|-------|----------|-----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 5447.643 | 11455.535 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 3388.183 | 7144.301  | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 3010.974 | 4911.485  | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 1539.137 | 2562.835  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 1355.551 | 4158.128  | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 1715.854 | 3284.189  | ops/ms |
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> | Benchmark                | (size) | Mode  | master   | PR       | Units  |
>> |--------------------------|--------|-------|----------|----------|--------|
>> | ByteMaxVector.MULLanes   | 1024   | thrpt | 1091.692 | 2887.798 | ops/ms |
>> | ShortMaxVector.MULLanes  | 1024   | thrpt | 597.008  | 1863.338 | ops/ms |
>> | IntMaxVector.MULLanes    | 1024   | thrpt | 510.642  | 1348.651 | ops/ms |
>> | LongMaxVector.MULLanes   | 1024   | thrpt | 468.878  | 878.620  | ops/ms |
>> | FloatMaxVector.MULLanes  | 1024   | thrpt | 376.284  | 2237.564 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024   | thrpt | 431.343  | 1646.792 | ops/ms |
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>
> - cleanup: address nits, rename several symbols
> - cleanup: remove unreferenced definitions
> - Address review comments.
>
> - fixup: disable FP mul reduction auto-vectorization for all targets
> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
> reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
> - cleanup: replace a complex lambda in the above methods with a loop
> - cleanup: rename symbols to follow the existing naming convention
> - cleanup: add asserts to SVE only instructions
> - split mul FP reduction instructions into strictly-ordered (default)
> and explicitly non strictly-ordered
> - remove redundant conditions in TestVectorFPReduction.java
>
> Benchmarks results:
>
> Neoverse-V1 (SVE 256-bit)
>
> | Benchmark | Before | After | Units | Diff |
> |---------------------------|----------|----------|--------|-------|
> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
> - Merge branch 'master' into 8343689-rebase
> - fixup: don't modify the value in vsrc
>
> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
> change, the result of recursive folding is held in vtmp1. To be able to
> pass this intermediate result to reduce_mul_integral_le128b(), we would
> have to use another temporary FloatRegister, as vtmp1 would essentially
> act as vsrc. It's possible to get around this however:
> reduce_mul_integral_le128b() is modified so it's possible to pass
> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
> temporary register in rules that match to reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
> - Use EXT instead of COMPACT to split a vector into two halves
>
> Benchmarks results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark (size) Mode master PR Units
> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
> Short...
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729:
> 3727: #undef INSN
> 3728:
> 3729: // SVE aliases
In the initial commit, an asm test for `sve_(mov|movs|not|nots)` was added to `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definition is removed in this commit, the corresponding asm test should be removed as well. Otherwise, the JDK build fails on AArch64.
See the error log in the GHA test: https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618
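
As an aside for readers of the quoted description: the strategy of multiplying halves of the source vector can be modeled in scalar Java. This is only an illustrative sketch, not the HotSpot code. The class and method names are hypothetical, lanes are modeled as a `long[]` whose length is assumed to be a power of two, the copy stands in for the temporary register that keeps `vsrc` unmodified, and for simplicity the folding continues down to a single lane instead of handing over to a separate 128-bit routine.

```java
// Hypothetical scalar model of a multiply reduction done by folding halves.
class MulReductionSketch {

    // Folds the upper half of the lanes onto the lower half until one lane is left.
    // Assumes lanes.length is a power of two (as with SVE vector lengths used here).
    static long reduceMulByHalving(long[] lanes) {
        long[] tmp = lanes.clone();          // work on a copy so the source stays unmodified
        int width = tmp.length;
        while (width > 1) {
            int half = width / 2;
            for (int i = 0; i < half; i++) {
                tmp[i] *= tmp[i + half];     // lane i of low half times lane i of high half
            }
            width = half;                    // the live part of the "vector" is now half as wide
        }
        return tmp[0];
    }

    // Reference: plain left-to-right product of all lanes.
    static long reduceMulSequential(long[] lanes) {
        long acc = 1L;
        for (long lane : lanes) {
            acc *= lane;
        }
        return acc;
    }
}
```

For integral lanes the folded result matches the sequential product, because two's-complement multiplication stays associative and commutative even when it overflows. Floating-point multiplication is not associative, which is presumably why the PR separates strictly-ordered FP reductions from explicitly non-strictly-ordered ones.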
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176310497