RFR: 8352585: Add special case handling for Float16.max/min x86 backend [v2]

Wed Mar 26 11:21:03 UTC 2025

On Tue, 25 Mar 2025 15:01:47 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7093:
>> 
>>> 7091: }
>>> 7092: 
>>> 7093: void C2_MacroAssembler::scalar_max_min_fp16(int opcode, XMMRegister dst, XMMRegister src1, XMMRegister src2,
>> 
>> Any reason we are not doing this along the lines of the scalar emit_fp_min_max? For the most common cases, an emit_fp_min_max-based sequence would have much better latency.
>
> emit_fp_min_max in x86_64.ad doesn't have any blend emulation.

Hi @sviswa7,
An instruction sequence similar to emit_fp_min_max for half floats prevents micro-ops from being issued from the Decoded ICache, which makes its performance worse than that of the proposed sequence. The problem seems to be the presence of several branches within a 32-byte window. Section 3.4.2.5, "Optimization for Decoded ICache", has more details on this. The proposed sequence is also vector-friendly.
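
To make the contrast concrete, here is a minimal sketch of the blend-then-max idea. It is written in Java on `float` rather than Float16, and it is not the exact sequence proposed in this PR. The names `BlendMaxSketch`, `hwMax`, and `blendMax` are hypothetical. `hwMax` models the hardware max instruction, which returns its second operand when either input is NaN or when the inputs compare equal. The ternaries stand in for blend instructions, so the data flow has no dependence on branch outcomes in the real assembly.

```java
final class BlendMaxSketch {
    // Models a hardware scalar max: the second operand is returned when either
    // input is NaN or when both compare equal (so -0.0 and +0.0 are "equal").
    static float hwMax(float first, float second) {
        return (first > second) ? first : second;
    }

    // True when the sign bit is set (negative values, -0.0, negative NaNs).
    static boolean signBitSet(float v) {
        return (Float.floatToRawIntBits(v) & 0x80000000) != 0;
    }

    // Java-style max built from two blends, one hardware max, and a NaN fix-up.
    static float blendMax(float a, float b) {
        boolean bNeg = signBitSet(b);
        float btmp = bNeg ? a : b;      // blend: bias the non-negative input to the second slot
        float atmp = bNeg ? b : a;      // blend: the other input goes to the first slot
        float tmp = hwMax(atmp, btmp);  // hardware max picks the second slot on ties and NaN
        return Float.isNaN(atmp) ? atmp : tmp; // propagate a NaN that sat in the first slot
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, -0.0f, 1.5f, -2.0f, Float.NaN,
                           Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY};
        int mismatches = 0;
        for (float a : samples) {
            for (float b : samples) {
                int got = Float.floatToIntBits(blendMax(a, b));
                int want = Float.floatToIntBits(Math.max(a, b));
                if (got != want) {
                    mismatches++;
                }
            }
        }
        System.out.println("mismatches vs Math.max: " + mismatches);
    }
}
```

The pre-blend handles the signed-zero case and the final select handles NaN propagation, which are the two places where the hardware max differs from `Math.max`.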

![image](https://github.com/user-attachments/assets/0efcb12b-dcb4-4346-b3fa-9fefeb46636f)

[max_micro_sequences.txt](https://github.com/user-attachments/files/19465321/max_micro_sequences.txt)

Do you suggest going with the proposed, better-performing sequence to fix this bug, and addressing any shortcomings later after more experimentation?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24169#discussion_r2013923122