RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Hao Sun
haosun at openjdk.org
Thu Feb 27 03:55:06 UTC 2025
On Wed, 26 Feb 2025 14:54:45 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used.
>>
>> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark (size) Mode master PR Units
>> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
>> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms
>> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms
>> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms
>> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms
>> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark (size) Mode master PR Units
>> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms
>> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms
>> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms
>> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms
>> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms
>> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>
> - fixup: don't modify the value in vsrc
>
> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
> change, the result of recursive folding is held in vtmp1. To be able to
> pass this intermediate result to reduce_mul_integral_le128b(), we would
> have to use another temporary FloatRegister, as vtmp1 would essentially
> act as vsrc. It's possible to get around this however:
> reduce_mul_integral_le128b() is modified so it's possible to pass
> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
> temporary register in rules that match to reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating
src/hotspot/cpu/aarch64/aarch64_vector.ad line 3033:
> 3031: format %{ "reduce_mulI_gt128b $dst, $isrc, $vsrc\t# vector (> 128 bits). KILL $tmp1, $tmp2, $pgtmp" %}
> 3032: ins_encode %{
> 3033: BasicType bt = Matcher::vector_element_basic_type(this, $vsrc);
I suggest adding `assert(UseSVE > 0, "must be sve");` assertion here and for the other three `*_gt128b` rules.
src/hotspot/cpu/aarch64/aarch64_vector.ad line 3043:
> 3041: %}
> 3042:
> 3043: instruct reduce_mulL_le128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{
I suggest using `_128b` here since only `2L` is matched here.
Suggestion:
instruct reduce_mulL_128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{
src/hotspot/cpu/aarch64/aarch64_vector.ad line 3055:
> 3053: %}
> 3054:
> 3055: instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp1,
nit: only one tmp vReg is used here
Suggestion:
instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp,
src/hotspot/cpu/aarch64/aarch64_vector.ad line 3099:
> 3097: %}
> 3098:
> 3099: instruct reduce_mulD_le128b(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
Similar to `long` type, I suggest using `_128b` as only `2D` is matched here.
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3:
> 1: /*
> 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved.
> 3: * Copyright (c) 2014, 2025, Red Hat Inc. All rights reserved.
nit: I don't think the copyright year for Red Hat needs to be updated
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3677:
> 3675: INSN(sve_nots, sve_eors); // Bitwise invert predicate, setting the condition flags; an alias of sve_eors
> 3676: #undef INSN
> 3677:
These instructions are not used any more after the follow-up commit of using `EXT`. I suggest removing them.
Besides, could you also share the benchmark data after using `EXT`? I don't have >=256-bit SVE on hand and cannot test that. Thanks.
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2002:
> 2000: assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported");
> 2001: assert_different_registers(vtmp1, vsrc);
> 2002: assert_different_registers(vtmp1, vtmp2);
nit: would be neat to use
Suggestion:
assert_different_registers(vsrc, vtmp1, vtmp2);
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2087:
> 2085: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead");
> 2086: assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length");
> 2087: assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length");
Better to compare with `MaxVectorSize`.
I suggest using `assert(length_in_bytes == MaxVectorSize, "invalid vector length");` and putting this assertion in `aarch64_vector.ad` file, i.e. inside the matching rule.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972763554
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972764334
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972765345
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972764690
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972765990
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972770302
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972770928
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972778855
More information about the hotspot-compiler-dev
mailing list