RFR: 8279508: Auto-vectorize Math.round API [v15]

Jatin Bhateja jbhateja at openjdk.java.net
Tue Mar 22 02:55:32 UTC 2022


On Tue, 22 Mar 2022 01:55:38 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>> A read from the constant table will incur at minimum an L1I access penalty to reach the code blob, or worse if the data is not present in the first-level cache. vpbroadcastd was replaced with vbroadcastss for two reasons.
>> 1) vbroadcastss works at the AVX=1 level, whereas vpbroadcastd needs the AVX2 feature.
>> 2) We avoid the extra cycle penalty from two domain switchovers (FP -> INT and then INT -> FP).
>
>> A read from the constant table will incur at minimum an L1I access penalty to reach the code blob, or worse if the data is not present in the first-level cache
> 
> But your approach comes at the cost of frontend bandwidth and port contention, which imo are more important than latency in this case, since a constant load does not prolong dependency chains. A load has very good throughput, so it is often performant unless its address depends on an earlier result (the memory location or the registers used for address calculation). Thanks

Thanks for going into the details. A multi-cycle memory load will also defer the dispatch of dependent instructions to an execution port. Port congestion becomes a bottleneck when multiple ready instructions cannot be issued, either for lack of execution resources or because of throughput constraints imposed by the instruction. Even so, a single-cycle dependency chain may still win over the latency of pending memory operations.
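
For context, the kind of loop this PR is concerned with looks roughly like the sketch below. The class and method names (RoundLoop, roundAll) are hypothetical and not taken from the patch. Whether the loop is actually vectorized, and whether the constants involved are broadcast from a register or loaded from the constant table, depends on the JIT and the CPU features available. The sketch only shows the scalar source shape.

```java
import java.util.Arrays;

// Hypothetical example class; not part of the PR.
class RoundLoop {
    // A simple counted loop applying Math.round to each float element.
    // This is the shape of loop the auto-vectorizer is meant to recognize.
    static int[] roundAll(float[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.round(src[i]);
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] in = {0.5f, 1.4f, -2.5f, 2.5f};
        // Math.round rounds ties toward positive infinity.
        System.out.println(Arrays.toString(roundAll(in))); // prints [1, 1, -2, 3]
    }
}
```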

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094


More information about the core-libs-dev mailing list