RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]

Mikhail Ablakatov mablakatov at openjdk.org
Tue Jul 1 16:10:49 UTC 2025


On Tue, 1 Jul 2025 06:57:10 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>> 
>>  - cleanup: address nits, rename several symbols
>>  - cleanup: remove unreferenced definitions
>>  - Address review comments.
>>    
>>    - fixup: disable FP mul reduction auto-vectorization for all targets
>>    - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
>>      reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
>>    - cleanup: replace a complex lambda in the above methods with a loop
>>    - cleanup: rename symbols to follow the existing naming convention
>>    - cleanup: add asserts to SVE only instructions
>>    - split mul FP reduction instructions into strictly-ordered (default)
>>      and explicitly non strictly-ordered
>>    - remove redundant conditions in TestVectorFPReduction.java
>>    
>>    Benchmarks results:
>>    
>>    Neoverse-V1 (SVE 256-bit)
>>    
>>    | Benchmark                 | Before   | After    | Units  | Diff  |
>>    |---------------------------|----------|----------|--------|-------|
>>    | ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
>>    | DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
>>    | FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
>>    | IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
>>    | LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
>>    | ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms |  984% |
>>  - Merge branch 'master' into 8343689-rebase
>>  - fixup: don't modify the value in vsrc
>>    
>>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>    change, the result of recursive folding is held in vtmp1. To be able to
>>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>>    have to use another temporary FloatRegister, as vtmp1 would essentially
>>    act as vsrc. It's possible to get around this however:
>>    reduce_mul_integral_le128b() is modified so it's possible to pass
>>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>    temporary register in rules that match to reduce_mul_integral_gt128b().
>>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>>  - Use EXT instead of COMPACT to split a vector into two halves
>>    
>>    Benchmarks results:
>>    
>>    Neoverse-V1 (SVE 256-bit)
>>    
>>      Benchmark                 (size)   Mode   master       ...
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097:
> 
>> 2095:   sve_movprfx(vtmp1, vsrc);                                // copy
>> 2096:   sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2);       // swap halves
>> 2097:   sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves
> 
>> sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc);
> 
> Can we use `ptrue` instead of `pgtmp` here? The higher bits can be computed, but they have no influence on the final result, right?

Thanks! For some reason I thought that we don't have a dedicated predicate register for that.
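
To illustrate the point about the upper lanes, here is a minimal scalar sketch of the three quoted instructions. It is not HotSpot code: `Lanes`, `swap_halves`, `predicated_mul` and `lower_halves_match` are hypothetical names made up for this model, and unsigned 64-bit lanes are assumed only to keep the arithmetic well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a vector register: one value per lane.
using Lanes = std::vector<uint64_t>;

// Models the quoted movprfx + ext pair: rotate the lanes by half the
// vector length, so the upper half lands in the lower half and vice versa.
Lanes swap_halves(const Lanes& v) {
  assert(v.size() % 2 == 0);
  const std::size_t half = v.size() / 2;
  Lanes out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    out[i] = v[(i + half) % v.size()];
  }
  return out;
}

// Models a predicated multiply: lanes [0, active) become dst * src,
// the remaining lanes keep the old value of dst.
Lanes predicated_mul(Lanes dst, const Lanes& src, std::size_t active) {
  assert(dst.size() == src.size() && active <= dst.size());
  for (std::size_t i = 0; i < active; ++i) {
    dst[i] *= src[i];
  }
  return dst;
}

// Checks that a half-length predicate and an all-true predicate
// produce identical values in the lower half of the result.
bool lower_halves_match(const Lanes& vsrc) {
  const std::size_t half = vsrc.size() / 2;
  const Lanes swapped = swap_halves(vsrc);
  const Lanes with_half = predicated_mul(swapped, vsrc, half);
  const Lanes with_all = predicated_mul(swapped, vsrc, vsrc.size());
  for (std::size_t i = 0; i < half; ++i) {
    if (with_half[i] != with_all[i]) return false;
  }
  return true;
}
```

In this model the two predicates differ only in the upper half of the result, and the following reduction steps read only the lower half.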

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2106:
> 
>> 2104:     sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vtmp2); // multiply halves
>> 2105:     vector_length_in_bytes = vector_length_in_bytes / 2;
>> 2106:     vector_length = vector_length / 2;
> 
> I guess you want to update `pgtmp` with the new `vector_length`? But the code seems to be missing. Anyway, maybe it's not necessary to generate a predicate, as I commented above.

It isn't strictly necessary, for the same reason that we can always use `ptrue` here. But yeah, I'll just remove it, following the suggestion above.
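
As a rough scalar model of why the predicate doesn't need to be regenerated as `vector_length` shrinks: the sketch below folds a vector down to lane 0 either with a predicate narrowed on every iteration or with all lanes active, and both give the same product. This is a simplification made for illustration, not the actual loop. `Vec` and `fold_mul` are hypothetical names, the per-iteration lane movement is modelled as a rotation of the whole vector, and the fold runs all the way down to a single lane instead of handing off at 128 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a vector register: one value per lane.
using Vec = std::vector<uint64_t>;

// Recursive-folding multiply reduction, modelled lane by lane.
// narrow_predicate == true : only lanes [0, len/2) are multiplied each step.
// narrow_predicate == false: every lane is multiplied (all-true predicate).
uint64_t fold_mul(Vec v, bool narrow_predicate) {
  const std::size_t n = v.size();
  assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two lane count assumed
  for (std::size_t len = n; len > 1; len /= 2) {
    const std::size_t half = len / 2;
    // Bring lanes [half, len) down to [0, half) by rotating the whole vector.
    Vec rotated(n);
    for (std::size_t i = 0; i < n; ++i) {
      rotated[i] = v[(i + half) % n];
    }
    const std::size_t active = narrow_predicate ? half : n;
    for (std::size_t i = 0; i < active; ++i) {
      v[i] *= rotated[i];
    }
  }
  return v[0];  // lanes above 0 may hold leftovers; they are never read
}
```

For the lanes 1..8 both `fold_mul(v, true)` and `fold_mul(v, false)` return 40320 (8!), since lanes [0, len/2) only ever read from lanes [0, len) of the previous step.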

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178009839
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178007165

