RFR: 8279508: Auto-vectorize Math.round API [v15]

Mon Mar 21 18:28:34 UTC 2022

On Mon, 21 Mar 2022 17:56:22 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>> Constant and register to register moves are never issued to execution ports, so rematerializing the value rather than reading it from memory will give better performance.
>
> I have come across this a little bit. While `movl r, i` may not consume execution ports, `movq x, r` and `vbroadcastss x, x` surely do. This leads to 3 retired and 2 executed uops. Furthermore, both `movq x, r` and `vbroadcastss x, x` can only run on port 5, which limits the throughput of the operation. In contrast, a `vbroadcastss x, m` results in only 1 retired and 1 executed uop, reducing pressure on the decoder and the backend. A `vbroadcastss x, m` can run on either port 2 or port 3, offering much better throughput. Latency is not much of a concern in this circumstance since the operation does not have any input dependency.
> 
>> register to register moves are never issued to execution ports
> 
> I believe you misremembered this part. A register to register move is only elided when the registers are of the same kind; `vmovq x, r` would result in 1 uop being executed on port 5.
> 
> What do you think? Thank you very much.

A read from the constant table incurs at minimum an L1I access penalty to reach the code blob, and in the worst case more if the data is not present in the first-level cache. vpbroadcastd was replaced with vbroadcastss for two reasons:
1) vbroadcastss works at the AVX=1 level, whereas vpbroadcastd needs the AVX2 feature.
2) We avoid the extra cycle penalty from two domain switchovers (FP -> INT and then INT -> FP).
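
For readers following along, here is a minimal sketch of the kind of scalar loop this discussion is about: a counted loop over a float array that calls Math.round on each element. The class name RoundLoop and the method roundAll are hypothetical names picked for illustration and are not part of the PR. Whether C2 actually vectorizes such a loop, and which broadcast instruction it emits for its constants, depends on the JVM build and the CPU features available.

```java
import java.util.Arrays;

public class RoundLoop {
    // Hypothetical helper: rounds every element of src into dst.
    // A simple counted loop like this is a candidate for auto-vectorization;
    // nothing in this code forces or guarantees that it happens.
    static void roundAll(float[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.round(src[i]);
        }
    }

    public static void main(String[] args) {
        float[] src = {0.4f, 0.5f, -0.5f, 2.5f, -2.6f};
        int[] dst = new int[src.length];
        roundAll(src, dst);
        // Math.round(float) rounds ties toward positive infinity.
        System.out.println(Arrays.toString(dst)); // prints [0, 1, 0, 3, -3]
    }
}
```

Any vectorized version must produce the same results as this scalar loop, including for ties such as 0.5f and -0.5f.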

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094