RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2]

Thu Apr 21 10:51:28 UTC 2022

On Wed, 20 Apr 2022 09:43:29 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Eric Liu has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Generate SVE reduction for MIN/MAX/ADD as before
>>   
>>   Change-Id: Ibc6b9c1f46c42cd07f7bb73b81ed38829e9d0975
>
> src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2179:
> 
>> 2177: %}
>> 2178: 
>> 2179: 
> 
> This is all far too repetitive and (therefore) hard to maintain. Please use the macro processor in a sensible way.
> 
> Please isolate the common factors.
> `n->in(X)->bottom_type()->is_vect()->length_in_bytes()` should have a name, for example.

I have tried. The tricky part is that I couldn't find a sensible way to fold these rules into one macro while keeping both the m4 source and the generated ad format readable. One reason is that the rules differ in register usage, and each comes with a different predicate. In the example below, if it were acceptable to waste one register for `reduce_mul_sve_4S`, things would be much easier and all the rules could be merged. But for the sake of better performance, I have for now traded away some maintainability and written more repetitive code.

instruct reduce_mul_sve_4S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp) %{
  predicate(UseSVE > 0 &&
            n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 8 &&
            n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT);
  match(Set dst (MulReductionVI isrc vsrc));
  ins_cost(8 * INSN_COST);
  effect(TEMP_DEF dst, TEMP vtmp);
  format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction4S (sve)" %}
  ins_encode %{
    __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg),
                                   as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 8,
                                   as_FloatRegister($vtmp$$reg), fnoreg);
  %}
  ins_pipe(pipe_slow);
%}

instruct reduce_mul_sve_8S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp1, vReg vtmp2) %{
  predicate(UseSVE > 0 &&
            n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 16 &&
            n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT);
  match(Set dst (MulReductionVI isrc vsrc));
  ins_cost(10 * INSN_COST);
  effect(TEMP_DEF dst, TEMP vtmp1, TEMP vtmp2);
  format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction8S (sve)" %}
  ins_encode %{
    __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg),
                                   as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 16,
                                   as_FloatRegister($vtmp1$$reg), as_FloatRegister($vtmp2$$reg));
  %}
  ins_pipe(pipe_slow);
%}
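
For the naming part of the suggestion alone, a minimal sketch could look like the following. `VEC_BYTES` and `VEC_ELEM` are hypothetical macro names used only for illustration; they are not in the current patch, and this does not address the difference in register usage between the two rules.

```m4
dnl Hypothetical helpers, not part of the current aarch64_sve_ad.m4.
dnl VEC_BYTES(idx): length in bytes of the vector type of input idx.
define(`VEC_BYTES', `n->in($1)->bottom_type()->is_vect()->length_in_bytes()')dnl
dnl VEC_ELEM(idx): element basic type of the vector type of input idx.
define(`VEC_ELEM', `n->in($1)->bottom_type()->is_vect()->element_basic_type()')dnl
dnl
dnl The predicate of reduce_mul_sve_4S written with the helpers:
  predicate(UseSVE > 0 &&
            VEC_BYTES(2) == 8 &&
            VEC_ELEM(2) == T_SHORT);
```

After m4 expansion this should produce the same predicate text as in `reduce_mul_sve_4S` above, so the generated ad file would not change.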

Indeed, we are looking for a better way to maintain the NEON and SVE rules. @nsjian is working on the details.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7999