RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]

Mikhail Ablakatov mablakatov at openjdk.org
Mon Jun 30 13:25:11 UTC 2025


On Thu, 27 Feb 2025 03:17:59 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - fixup: don't modify the value in vsrc
>>    
>>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>    change, the result of recursive folding is held in vtmp1. To be able to
>>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>>    have to use another temporary FloatRegister, as vtmp1 would essentially
>>    act as vsrc. It's possible to get around this however:
>>    reduce_mul_integral_le128b() is modified so it's possible to pass
>>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>    temporary register in rules that match to reduce_mul_integral_gt128b().
>>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3033:
> 
>> 3031:   format %{ "reduce_mulI_gt128b $dst, $isrc, $vsrc\t# vector (> 128 bits). KILL $tmp1, $tmp2, $pgtmp" %}
>> 3032:   ins_encode %{
>> 3033:     BasicType bt = Matcher::vector_element_basic_type(this, $vsrc);
> 
> I suggest adding `assert(UseSVE > 0, "must be sve");` assertion here and for the other three `*_gt128b` rules.

Done, thanks.
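
For reference, roughly where the assertion now sits in each of the four `*_gt128b` rules (sketched from the quoted rule; the surrounding lines are abridged):

    ins_encode %{
      // These rules are only matched for vectors wider than 128 bits, which
      // requires SVE, so fail fast if the rule is selected without it.
      assert(UseSVE > 0, "must be sve");
      BasicType bt = Matcher::vector_element_basic_type(this, $vsrc);
      ...
    %}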

> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3043:
> 
>> 3041: %}
>> 3042: 
>> 3043: instruct reduce_mulL_le128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{
> 
> I suggest using `_128b` since only `2L` is matched here.
> 
> Suggestion:
> 
> instruct reduce_mulL_128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{

Done, thank you!

> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3055:
> 
>> 3053: %}
>> 3054: 
>> 3055: instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp1,
> 
> nit: only one tmp vReg is used here
> 
> Suggestion:
> 
> instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp,

Done, thanks!

> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3099:
> 
>> 3097: %}
>> 3098: 
>> 3099: instruct reduce_mulD_le128b(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
> 
> Similar to the `long` type, I suggest using `_128b` as only `2D` is matched here.

Done!

> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3:
> 
>> 1: /*
>> 2:  * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved.
>> 3:  * Copyright (c) 2014, 2025, Red Hat Inc. All rights reserved.
> 
> nit: I don't think the copyright year for Red Hat needs to be updated

Done, thanks.

> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3677:
> 
>> 3675:   INSN(sve_nots, sve_eors); // Bitwise invert predicate, setting the condition flags; an alias of sve_eors
>> 3676: #undef INSN
>> 3677: 
> 
> These instructions are no longer used after the follow-up commit that switched to `EXT`. I suggest removing them.
> 
> Besides, could you also share the benchmark data after using `EXT`? I don't have >=256-bit SVE on hand and cannot test that. Thanks.

Done, thanks!

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2002:
> 
>> 2000:   assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported");
>> 2001:   assert_different_registers(vtmp1, vsrc);
>> 2002:   assert_different_registers(vtmp1, vtmp2);
> 
> nit: would be neat to use 
> Suggestion:
> 
>   assert_different_registers(vsrc, vtmp1, vtmp2);

`vsrc` and `vtmp2` are allowed to match.
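
For context, a short sketch of the constraint these two asserts express (register names taken from the quoted code; surrounding code abridged):

    // vtmp1 must not alias either of the other two registers ...
    assert_different_registers(vtmp1, vsrc);
    assert_different_registers(vtmp1, vtmp2);
    // ... but vsrc and vtmp2 may be the same register, which the gt128b
    // rules rely on to save a temporary. The combined
    //   assert_different_registers(vsrc, vtmp1, vtmp2);
    // would require all three to be pairwise distinct and reject that case.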

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2087:
> 
>> 2085:   assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead");
>> 2086:   assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length");
>> 2087:   assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length");
> 
> Better to compare with `MaxVectorSize`.
> 
> I suggest using `assert(length_in_bytes == MaxVectorSize, "invalid vector length");` and putting this assertion in `aarch64_vector.ad` file, i.e. inside the matching rule.

Why is it better that way? Currently the assertions check that we only end up here for computations that can only be done using SVE (length > neon && length <= sve). What would happen if a user operates on 256-bit VectorAPI vectors on a 512-bit SVE platform?
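
To make the question concrete, a sketch of the two checks side by side (taken from the quoted lines; the 512-bit scenario is an assumption for illustration, with MaxVectorSize following the hardware vector length):

    // Current: accept any length that needs SVE and that the hardware supports.
    assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead");
    assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length");
    assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length");

    // Suggested: require the operand to span the full machine vector.
    // On a 512-bit SVE machine (MaxVectorSize == 64 bytes), a 256-bit Vector API
    // species would give a length of 32 bytes, which still needs the SVE path
    // (32 > 16) but would trip this assertion.
    assert(length_in_bytes == MaxVectorSize, "invalid vector length");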

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2174893039
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2175064198
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2175062794
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2175063706
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2174918739
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2174926322
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2154442438
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2174917565

