RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]

Wed Feb 5 11:40:09 UTC 2025

On Mon, 20 Jan 2025 03:35:44 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Use EXT instead of COMPACT to split a vector into two halves
>>   
>>   Benchmark results:
>>   
>>   Neoverse-V1 (SVE 256-bit)
>>   
>>     Benchmark                 (size)   Mode   master         PR  Units
>>     ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
>>     ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
>>     IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
>>     LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
>>     FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
>>     DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms
>>   
>>   Fujitsu A64FX (SVE 512-bit)
>>   
>>     Benchmark                 (size)   Mode   master         PR  Units
>>     ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
>>     ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
>>     IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
>>     LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
>>     FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
>>     DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2095:
> 
>> 2093:     // matter: a contiguous set of elements is moved and its size is a multiple of D RegVariant.
>> 2094:     sve_compact(vtmp1, D, vsrc, pgtmp1);
>> 2095:     sve_mul(vsrc, elemType_to_regVariant(bt), pgtmp2, vtmp1);
> 
> Have you tried the SVE `EXT` instruction (https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/EXT--Extract-vector-from-pair-of-vectors-?lang=en)? I think it could also shuffle the upper-half elements of a vector into the lower half. If it works, these five instructions could be reduced to three, such as `ext, whilelo, mul`.

Hi @XiaohongGong, thank you for the great suggestion! I've submitted https://github.com/openjdk/jdk/pull/23181/commits/c9dcc45f7f362f5af87f013715f0b55777472c78 to implement it. It improves performance by up to ~30% compared to the initially submitted implementation.
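
For readers following the thread, here is a rough scalar model in C++ of the halving scheme being discussed: `EXT` with both sources set to the same vector rotates the upper half down into the lower half, and a predicated multiply then combines the two halves. This is only a sketch under stated assumptions, not the code from the commit. The names `ext_model` and `mul_reduce_model` are hypothetical, the model works on whole elements rather than the byte offset the real instruction takes, and the `whilelo` predicate is approximated by the loop bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of SVE EXT at element granularity: the result
// starts at lo[offset], runs to the end of lo, then continues from hi[0].
// (The real instruction is destructive and indexes bytes, not elements.)
std::vector<uint64_t> ext_model(const std::vector<uint64_t>& lo,
                                const std::vector<uint64_t>& hi,
                                std::size_t offset) {
    const std::size_t n = lo.size();
    assert(hi.size() == n && offset < n);
    std::vector<uint64_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t j = offset + i;
        out[i] = (j < n) ? lo[j] : hi[j - n];
    }
    return out;
}

// Hypothetical model of the multiply reduction: on each step, bring the
// upper half down with EXT, then multiply only the active lower lanes
// (the role a WHILELO-generated predicate would play). Unsigned
// arithmetic is used so that overflow wraps instead of being undefined.
uint64_t mul_reduce_model(std::vector<uint64_t> v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // assumes a power-of-two lane count
    for (std::size_t half = n / 2; half >= 1; half /= 2) {
        const std::vector<uint64_t> tmp = ext_model(v, v, half);
        for (std::size_t i = 0; i < half; ++i) {
            v[i] *= tmp[i];
        }
    }
    return v[0];
}
```

For example, `mul_reduce_model({1, 2, 3, 4, 5, 6, 7, 8})` takes three halving steps (4, 2, and 1 active lanes) and returns the product of all eight lanes.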

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1942711747