RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]
Andrew Haley
aph at openjdk.org
Wed Feb 5 17:59:21 UTC 2025
On Wed, 5 Feb 2025 11:20:59 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for vectors of 256 bits or longer. It multiplies halves of the source vector using SVE instructions until the result is a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or shorter: those still use the existing ASIMD implementation directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189  ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  1091.692   2887.798  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt   597.008   1863.338  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt   510.642   1348.651  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt   468.878    878.620  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt   376.284   2237.564  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt   431.343   1646.792  ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>
> Use EXT instead of COMPACT to split a vector into two halves
>
> Benchmarks results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark                 (size)  Mode     master         PR  Units
> ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535  ops/ms
> ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301  ops/ms
> IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485  ops/ms
> LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835  ops/ms
> FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128  ops/ms
> DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189  ops/ms
>
> Fujitsu A64FX (SVE 512-bit)
>
> Benchmark                 (size)  Mode     master         PR  Units
> ByteMaxVector.MULLanes      1024  thrpt  1091.692   2887.798  ops/ms
> ShortMaxVector.MULLanes     1024  thrpt   597.008   1863.338  ops/ms
> IntMaxVector.MULLanes       1024  thrpt   510.642   1348.651  ops/ms
> LongMaxVector.MULLanes      1024  thrpt   468.878    878.620  ops/ms
> FloatMaxVector.MULLanes     1024  thrpt   376.284   2237.564  ops/ms
> DoubleMaxVector.MULLanes    1024  thrpt   431.343   1646.792  ops/ms
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1915:
> 1913: %}
> 1914:
> 1915: instruct reduce_mulD(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
Please consider that `reduce_mulF_gt128b` and `reduce_mulD_gt128b` might be similar enough that they should be combined in the same way as other patterns in this file.
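
As a reading aid for the quoted PR description, here is a minimal scalar sketch in Java of the halving scheme it describes: multiply the upper half of the lanes into the lower half until the live part is no wider than a 128-bit register, then finish with a plain lane-by-lane product. This is not the code in the patch. The class name `MulReduceSketch`, the method `reduceMulByHalving`, and the `tailLanes` parameter are hypothetical, and the sketch assumes a power-of-two lane count and `int` lanes, where wrap-around multiplication gives the same result in any order.

```java
class MulReduceSketch {
    // Hypothetical helper, for illustration only (not HotSpot code).
    // lanes:     the source vector, one array element per lane
    //            (length assumed to be a power of two).
    // tailLanes: number of lanes that fit in 128 bits (4 for int lanes).
    static int reduceMulByHalving(int[] lanes, int tailLanes) {
        int[] v = lanes.clone();      // do not modify the caller's array
        int live = v.length;          // number of lanes still holding data

        // Halving phase: multiply the upper half into the lower half
        // until the live part fits in a 128-bit register.
        while (live > tailLanes) {
            int half = live / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            live = half;
        }

        // Tail phase: stands in for the existing <= 128-bit reduction.
        int acc = 1;
        for (int i = 0; i < live; i++) {
            acc *= v[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        // Eight int lanes model a 256-bit vector; four lanes are 128 bits.
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(reduceMulByHalving(lanes, 4));
    }
}
```

With sixteen `int` lanes (modelling a 512-bit vector) the halving loop runs twice before the tail phase takes over. For floating-point lanes the regrouping can change rounding, so this sketch should not be read as saying anything about the float and double patterns beyond the overall shape.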
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1943420223