RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2]

Thu Apr 21 14:11:27 UTC 2022

On Thu, 21 Apr 2022 10:47:36 GMT, Eric Liu <eliu at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2179:
>> 
>>> 2177: %}
>>> 2178: 
>>> 2179: 
>> 
>> This is all far too repetitive and (therefore) hard to maintain. Please use the macro processor in a sensible way.
>> 
>> Please isolate the common factors.
>> `n->in(X)->bottom_type()->is_vect()->length_in_bytes()` should have a name, for example.
>
> I have tried. That tricky thing is that I didn't find a sensible way to integrate them in a macro and balance the readability of m4, and the format of ad as well. One reason is they have different register usage, also accompanies with the different predicate. In the example below, if it's fine to waste one register for `reduce_mul_sve_4S`, thing would change more easier, that all the rules can merged together. But to pursue the better performance, at this moment I degrade the maintainability and write more repetitive code. 
> 
> 
> instruct reduce_mul_sve_4S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp) %{
>   predicate(UseSVE > 0 &&
>             n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 8 &&
>             n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT);
>   match(Set dst (MulReductionVI isrc vsrc));
>   ins_cost(8 * INSN_COST);
>   effect(TEMP_DEF dst, TEMP vtmp);
>   format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction4S (sve)" %}
>   ins_encode %{
>     __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg),
>                                    as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 8,
>                                    as_FloatRegister($vtmp$$reg), fnoreg);
>   %}
>   ins_pipe(pipe_slow);
> %}
> 
> instruct reduce_mul_sve_8S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp1, vReg vtmp2) %{
>   predicate(UseSVE > 0 &&
>             n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 16 &&
>             n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT);
>   match(Set dst (MulReductionVI isrc vsrc));
>   ins_cost(10 * INSN_COST);
>   effect(TEMP_DEF dst, TEMP vtmp1, TEMP vtmp2);
>   format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction8S (sve)" %}
>   ins_encode %{
>     __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg),
>                                    as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 16,
>                                    as_FloatRegister($vtmp1$$reg), as_FloatRegister($vtmp2$$reg));
>   %}
>   ins_pipe(pipe_slow);
> %}
> 
> 
> Indeed, we are looking for a better way to maintain the NEON and SVE rules. @nsjian is working on the detail work.

OK. There are 8 slightly different versions of `reduce_X_sve_nT` here. I would have thought that an `ifelse` around `, vReg vtmp2` etc. would be exactly what you'd need, but I'm not going to try to rewrite your work.
I'm no great fan of m4, but I used it because we needed some way to write the boilerplate code such that it could be reviewed and extended; that's still true today. It's not perfect, but it's better than cut-and-paste programming.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7999