RFR: 8282541: AArch64: Auto-vectorize Math.round API [v2]

Fri Apr 15 03:29:40 UTC 2022

On Thu, 14 Apr 2022 17:41:13 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 374:
>> 
>>> 372: VECTOR_JAVA_FROUND(F, 4F,  I, T4S, 4,  INT, vReg)
>>> 373: VECTOR_JAVA_FROUND(D, 2D,  L, T2D, 2, LONG, vReg)
>>> 374: 
>> 
>> I don't know why do we need these rules. Should "UseSVE > 0" all go to the rules in sve ad file which call to vector_round_sve()?
>
>> I don't know why do we need these rules. Should "UseSVE > 0" all go to the rules in sve ad file which call to vector_round_sve()?
> 
> The freely-available Arm® Neoverse V1 Software Optimization Guide shows instructions such as ASIMD `FRINTA` having a throughput of 2 operations per clock, whereas it shows SVE `FRINTA` has a throughput of 1 operation per clock. This is true of most instructions used in `Math.round()`. I conclude that on V1, for short vectors, if we use ASIMD rather than equivalent SVE instructions, we should expect to virtually double throughput. For vectors wider than ASIMD supports, SVE should be a win.
> 
> At present, there is no reason not to use ASIMD for short vectors on all AArch64 processors. It won't significantly impair performance, and I can't think of any future circumstances in which it might.

OK, thanks! Looks reasonable to me. We are going to convert all vecX/vecD regs to vReg; I think that should make the SIMD code cleaner.

Currently, all rules for vReg are in aarch64_sve.ad. Since this codegen is actually for the SVE target, even though it generates ASIMD insns, perhaps it would be better to move these rules to aarch64_sve.ad? Also, I think the 2F/4F rules could be merged into one, like:

instruct vroundvRegF(vReg dst, vReg src, vReg tmp1, vReg tmp2, vReg tmp3)
%{
  predicate(n->as_Vector()->length_in_bytes() <= 16);
  match(Set dst (RoundVF src));
  effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2, TEMP tmp3);
  format %{ "vround  $dst, $src\t# round vReg F to I vector" %}
  ins_encode %{
    // Pick the lane arrangement from the vector size: 16 bytes -> 4S, otherwise (8 bytes) -> 2S.
    uint size = Matcher::vector_length_in_bytes(this);
    __ vector_round_neon(as_FloatRegister($dst$$reg), as_FloatRegister($src$$reg),
                         as_FloatRegister($tmp1$$reg), as_FloatRegister($tmp2$$reg),
                         as_FloatRegister($tmp3$$reg), (size == 16) ? __ T4S : __ T2S);
  %}
  ins_pipe(pipe_slow);
%}
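
For reference, the scalar loop shape that a rule like this is meant to cover can be sketched in Java as below. This is only an illustration: the class name `RoundLoop` and the method `roundAll` are made up for this example, and whether C2 actually vectorizes the loop depends on the platform, flags, and array sizes. The inputs are chosen to show the `Math.round()` corner cases (ties round toward positive infinity, NaN gives 0, out-of-range values saturate). The tie case differs from a plain `FRINTA`, which rounds ties away from zero, and that is presumably why the NEON sequence needs temporaries.

```java
import java.util.Arrays;

public class RoundLoop {
    // Hypothetical helper: a simple counted loop over arrays, the shape
    // the auto-vectorizer may turn into RoundVF nodes.
    static void roundAll(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.round(in[i]);
        }
    }

    public static void main(String[] args) {
        // Corner cases any vectorized sequence has to preserve.
        float[] in = {2.5f, -2.5f, Float.NaN, Float.POSITIVE_INFINITY};
        int[] out = new int[in.length];
        roundAll(in, out);
        // Ties go toward positive infinity, NaN maps to 0,
        // and +Infinity saturates to Integer.MAX_VALUE.
        System.out.println(Arrays.toString(out));
    }
}
```

With four floats the loop matches the 4F (T4S) case; a two-element vector would take the T2S path in the merged rule above.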

-------------

PR: https://git.openjdk.java.net/jdk/pull/8204