RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]

Hao Sun haosun at openjdk.org
Tue Jul 1 02:54:47 UTC 2025


On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:

>> Add a reduce_mul intrinsic SVE specialization for vectors of 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until the result is a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>> 
>> Nothing changes for vectors of 128 bits or shorter: those still use the existing ASIMD implementation directly.
>> 
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>> 
>> Benchmarks results:
>> 
>> Neoverse-V1 (SVE 256-bit)
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
>> 
>> 
>> Fujitsu A64FX (SVE 512-bit):
>> 
>>   Benchmark                 (size)   Mode   master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
> 
>  - cleanup: address nits, rename several symbols
>  - cleanup: remove unreferenced definitions
>  - Address review comments.
>    
>    - fixup: disable FP mul reduction auto-vectorization for all targets
>    - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
>      reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
>    - cleanup: replace a complex lambda in the above methods with a loop
>    - cleanup: rename symbols to follow the existing naming convention
>    - cleanup: add asserts to SVE only instructions
>    - split mul FP reduction instructions into strictly-ordered (default)
>      and explicitly non strictly-ordered
>    - remove redundant conditions in TestVectorFPReduction.java
>    
>    Benchmarks results:
>    
>    Neoverse-V1 (SVE 256-bit)
>    
>    | Benchmark                 | Before   | After    | Units  | Diff  |
>    |---------------------------|----------|----------|--------|-------|
>    | ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
>    | DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
>    | FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
>    | IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
>    | LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
>    | ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms |  984% |
>  - Merge branch 'master' into 8343689-rebase
>  - fixup: don't modify the value in vsrc
>    
>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>    change, the result of recursive folding is held in vtmp1. To be able to
>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>    have to use another temporary FloatRegister, as vtmp1 would essentially
>    act as vsrc. It's possible to get around this however:
>    reduce_mul_integral_le128b() is modified so it's possible to pass
>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>    temporary register in rules that match to reduce_mul_integral_gt128b().
>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>  - Use EXT instead of COMPACT to split a vector into two halves
>    
>    Benchmarks results:
>    
>    Neoverse-V1 (SVE 256-bit)
>    
>      Benchmark                 (size)   Mode   master         PR  Units
>      ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>      Short...

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729:

> 3727: #undef INSN
> 3728: 
> 3729: // SVE aliases

In the initial commit, asm tests for `sve_(mov|movs|not|nots)` were added to `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definitions are removed in this commit, the corresponding asm tests should be removed as well. Otherwise, the JDK build fails on AArch64.
See the error log in the GHA test run: https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618
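
As a side note on the quoted PR description, the halving scheme it describes can be modelled in plain scalar Java. This is only an illustrative sketch, not the HotSpot code: the class `MulReductionSketch`, the methods `foldHalves` and `reduceMul`, and the parameter `lanesIn128Bits` are hypothetical names, and the lane count is assumed to be a power of two. The upper half of the lanes is multiplied into the lower half until the data fits into 128 bits, and the remaining lanes are then reduced sequentially, standing in for the existing ASIMD path. The source array is left unmodified, in the spirit of the "keep vsrc unmodified" fixup.

```java
public class MulReductionSketch {
    // Hypothetical helper: multiply the upper half of the lanes into the lower half.
    public static int[] foldHalves(int[] lanes) {
        int half = lanes.length / 2;
        int[] out = new int[half];
        for (int i = 0; i < half; i++) {
            out[i] = lanes[i] * lanes[i + half];
        }
        return out;
    }

    // Scalar model of the reduction; assumes lanes.length is a power of two.
    public static int reduceMul(int[] lanes, int lanesIn128Bits) {
        int[] cur = lanes.clone(); // work on a copy so the source stays unmodified
        while (cur.length > lanesIn128Bits) {
            cur = foldHalves(cur); // one "multiply halves" step
        }
        int result = 1;
        for (int v : cur) {        // stand-in for the <= 128-bit path
            result *= v;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4, 5, 6, 7, 8}; // eight 32-bit lanes = 256 bits
        System.out.println(reduceMul(v, 4)); // prints 40320
    }
}
```

Because integer multiplication is associative and commutative (including on overflow), the folded order gives the same result as a left-to-right product. That does not hold for floating point, which is presumably why the PR separates strictly-ordered and non-strictly-ordered FP reductions.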

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176310497


More information about the hotspot-compiler-dev mailing list