RFR: 8282541: AArch64: Auto-vectorize Math.round API [v2]

Andrew Haley aph at openjdk.java.net
Fri Apr 15 08:17:41 UTC 2022


On Fri, 15 Apr 2022 03:26:27 GMT, Ningsheng Jian <njian at openjdk.org> wrote:

>>> I don't know why we need these rules. Shouldn't all "UseSVE > 0" cases go to the rules in the sve ad file, which call vector_round_sve()?
>> 
>> The freely-available Arm® Neoverse V1 Software Optimization Guide shows instructions such as ASIMD `FRINTA` having a throughput of 2 operations per clock, whereas it shows SVE `FRINTA` has a throughput of 1 operation per clock. This is true of most instructions used in `Math.round()`. I conclude that on V1, for short vectors, if we use ASIMD rather than equivalent SVE instructions, we should expect to virtually double throughput. For vectors wider than ASIMD supports, SVE should be a win.
>> 
>> At present, there is no reason not to use ASIMD for short vectors on all AArch64 processors. It won't significantly impair performance, and I can't think of any future circumstances in which it might.
>
> OK, thanks! Looks reasonable to me. We are going to convert all vecX/vecD regs to vReg, which I think should make the SIMD code cleaner.
> 
> Currently all rules for vReg are in aarch64_sve.ad. Since the codegen is actually for an SVE target, even though it generates ASIMD insns, perhaps moving these rules to aarch64_sve.ad would be better? Also, I think the 2F/4F rules could be merged into one, like:
> 
> 
> instruct vroundvRegF(vReg dst, vReg src, vReg tmp1, vReg tmp2, vReg tmp3)
> %{
>   predicate(n->as_Vector()->length_in_bytes() <= 16);
>   match(Set dst (RoundVF src));
>   effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2, TEMP tmp3);
>   format %{ "vround  $dst, $src\t# round vReg F to I vector" %}
>   ins_encode %{
>     uint size = Matcher::vector_length_in_bytes(this);
>     __ vector_round_neon(as_FloatRegister($dst$$reg), as_FloatRegister($src$$reg),
>                          as_FloatRegister($tmp1$$reg), as_FloatRegister($tmp2$$reg),
>                          as_FloatRegister($tmp3$$reg), (size == 16) ? __ T4S : __ T2S);
>   %}
>   ins_pipe(pipe_slow);
> %}

Seems reasonable. Maybe we could push the logic down into MacroAssembler. That way there'd be a single point, in MacroAssembler, at which the SVE/Neon decision was made. The disadvantage would be that the Neon and SVE versions require different register clobbers, but that might not matter.
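
To make the idea concrete, here is a rough standalone sketch of what a single decision point might look like. It is not HotSpot code: `RoundImpl` and `select_vector_round_impl` are hypothetical names invented for illustration, and the only facts taken from the thread are the 16-byte cutoff from the predicate above and the T2S/T4S choice for float vectors.

```cpp
#include <cassert>

// Hypothetical stand-in for the code paths a MacroAssembler-level
// vector round helper could choose between. Not a real HotSpot type.
enum class RoundImpl { Neon2S, Neon4S, Sve };

// One place where the SVE/Neon decision is made (illustrative only).
// length_in_bytes: size of the vector being rounded.
// use_sve:         models the UseSVE level (0 means SVE is unavailable).
RoundImpl select_vector_round_impl(unsigned length_in_bytes, int use_sve) {
  if (length_in_bytes <= 16) {
    // Short vectors always take the ASIMD path, even on SVE hardware.
    return (length_in_bytes == 16) ? RoundImpl::Neon4S : RoundImpl::Neon2S;
  }
  // Anything wider than 128 bits cannot be handled by ASIMD.
  assert(use_sve > 0 && "vectors wider than 16 bytes require SVE");
  return RoundImpl::Sve;
}
```

With this shape, a call such as `select_vector_round_impl(16, 2)` picks the 4S Neon path even though SVE is enabled. Because the two paths clobber different registers, a single matching rule in front of such a helper would presumably have to declare the temporaries needed by both.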

-------------

PR: https://git.openjdk.java.net/jdk/pull/8204

