RFR: 8283232: x86: Improve vector broadcast operations [v2]

Wed Mar 16 14:55:48 UTC 2022

On Wed, 16 Mar 2022 05:55:18 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>> Hi,
>> 
>> This patch improves the generation of broadcasting a scalar in several ways:
>> 
>> - Avoid potential data bypass delay which can be observed on some platforms by using the correct type of instruction if it does not require extra instructions.
>> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector loads can be performed efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> 
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix crash in sse

Hi, forwarding results within the same bypass domain does not result in a delay. Data bypass delay happens when the data crosses between different domains, according to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":

> When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow.

The manual gives the guideline in section 3.5.2.2:

![image](https://user-images.githubusercontent.com/49088128/158618209-c0674ba7-1c93-4014-a7e1-330f4e5846da.png)
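To make the quoted passage concrete, here is a toy sketch of picking a broadcast instruction from the same domain as the element type, so that integer data is not produced by a floating-point instruction (or vice versa). This is not the code in the patch: the class `BroadcastDomainSketch` and its helpers are hypothetical, and the table simply pairs each element type with the AVX/AVX2 broadcast mnemonic of the matching domain. It also ignores the case where the matching instruction would need extra instructions.

```java
import java.util.Map;

// Hypothetical illustration only; C2 expresses this in its .ad files, not like this.
public class BroadcastDomainSketch {
    enum Domain { INTEGER, FLOATING_POINT }

    // Assumed mapping from element type to the broadcast mnemonic of its own domain.
    static final Map<Class<?>, String> BROADCAST = Map.of(
        byte.class,   "vpbroadcastb",
        short.class,  "vpbroadcastw",
        int.class,    "vpbroadcastd",
        long.class,   "vpbroadcastq",
        float.class,  "vbroadcastss",
        double.class, "vbroadcastsd");

    // Classify an element type by the domain its vector operations execute in.
    static Domain domainOf(Class<?> elem) {
        return (elem == float.class || elem == double.class)
            ? Domain.FLOATING_POINT
            : Domain.INTEGER;
    }

    // Pick the broadcast instruction that stays in the element type's domain.
    static String pickBroadcast(Class<?> elem) {
        String insn = BROADCAST.get(elem);
        if (insn == null) {
            throw new IllegalArgumentException("unsupported element type: " + elem);
        }
        return insn;
    }

    // A producer and consumer in different domains may incur a bypass delay.
    static boolean crossesDomain(Class<?> producerElem, Class<?> consumerElem) {
        return domainOf(producerElem) != domainOf(consumerElem);
    }

    public static void main(String[] args) {
        for (Class<?> c : new Class<?>[] { int.class, float.class }) {
            System.out.println(c + " -> " + pickBroadcast(c) + " (" + domainOf(c) + ")");
        }
        System.out.println("float producer, int consumer crosses domain: "
            + crossesDomain(float.class, int.class));
    }
}
```

In this model, broadcasting an `int` with a floating-point instruction and then feeding it to integer arithmetic is the kind of cross-domain transition the manual warns about.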

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7832