RFR: 8279508: Auto-vectorize Math.round API [v15]

Mon Mar 21 17:59:44 UTC 2022

On Sun, 13 Mar 2022 04:27:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4178:
>> 
>>> 4176:   movl(scratch, 1056964608);
>>> 4177:   movq(xtmp1, scratch);
>>> 4178:   vbroadcastss(xtmp1, xtmp1, vec_enc);
>> 
>> You could put the constant in the constant table and use `vbroadcastss` here also.
>> 
>> Thank you very much.
>
> constant and register to register moves are never issued to execution ports,  rematerializing value rather than reading from memory will give better performance.

I have looked into this a little. While `movl r, i` may not consume an execution port, `movq x, r` and `vbroadcastss x, x` certainly do, so the sequence costs 3 retired and 2 executed uops. Furthermore, both `movq x, r` and `vbroadcastss x, x` can only run on port 5, which limits the throughput of the operation. By contrast, `vbroadcastss x, m` results in only 1 retired and 1 executed uop, reducing pressure on the decoder and the backend. It can also run on either port 2 or port 3, offering much better throughput. Latency is not much of a concern here since the operation has no input dependency.
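
As an aside, the immediate in the quoted `movl` appears to be the IEEE-754 bit pattern of `0.5f`, which is the value a constant-table entry would hold. Below is a minimal standalone check, not HotSpot code; `bits_to_float`, `kHalfBits` and `verify_round_constant` are names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit integer as an IEEE-754 single-precision float.
// memcpy avoids the undefined behaviour of a pointer cast.
static float bits_to_float(std::uint32_t bits) {
  float f;
  static_assert(sizeof(f) == sizeof(bits), "float must be 32 bits wide");
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// The immediate from the quoted movl(scratch, 1056964608).
static const std::uint32_t kHalfBits = 1056964608u;

// Hypothetical helper: checks that the immediate decodes to 0.5f.
static void verify_round_constant() {
  assert(kHalfBits == 0x3F000000u);            // sign 0, exponent 126, mantissa 0
  assert(bits_to_float(kHalfBits) == 0.5f);    // 1.0 * 2^(126 - 127)
}
```

Calling `verify_round_constant()` from a small `main` is enough to confirm the decoding on a platform with 32-bit IEEE-754 `float`.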

> register to register moves are never issued to execution ports

I believe you misremembered this part: a register-to-register move is only elided when both registers are of the same kind. `vmovq x, r` would still result in 1 uop being executed on port 5.

What do you think? Thank you very much.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094