[lworld+fp16] RFR: 8338102: x86 backend support for newly added Float16 intrinsics. [v2]

Bhavana Kilambi bkilambi at openjdk.org
Mon Aug 12 10:02:45 UTC 2024


On Mon, 12 Aug 2024 03:51:27 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> This patch enables the Float16 intrinsification support added by [JDK-8336406](https://bugs.openjdk.org/browse/JDK-8336406) for x86 targets supporting the AVX512_FP16 feature.
>> 
>> Kindly review and approve.
>> 
>> Best Regards,
>> Jatin
>> 
>> Hi @Bhavana-Kilambi, 
>> On second thought, do you see a possibility of sharing IR nodes by appending a secondary opcode to a shared IR node in applicable scenarios, so that we can have one IR node for each class of operations (unary / binary / ternary)? It may require defining the following new matcher routines and some additional interfaces:
>> 
>>    match_rule_supported_shared(int primary_opcode, int secondary_opcode)
>>    match_rule_supported_vector_shared(int primary_opcode, int secondary_opcode, int vlen, BasicType bt)
>>    VectorNode::opcode(int popc, int sopc, BasicType bt)
>> 
>> BinaryOpNode(Dst, Src1, Src2, immI_Opcode);
>> 
>> 
>> The secondary opcode, being an immediate operand, can be accessed by the encoding routines. WDYT?
>> 
>> Another possibility could be to encode both the primary and secondary opcodes in the existing opcode field without disturbing the interfaces, and add relevant helper routines to extract the primary / secondary opcodes. I think opcodes are never negative, hence the secondary opcode could be accommodated in the higher-order bits starting from (MSB-1).
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Optimizing IR checks

I am not sure about the idea of having unified nodes for multiple classes of operations (unary, binary, etc.); I wonder whether it would simplify or complicate the implementation of these operations. I'll break my thoughts down into the following major points:
- The size of the resulting sea-of-nodes graph should not change. Instead of AddHF/SubHF we would have a single BinaryOpNode, which would now need to store an extra field (the secondary opcode) to differentiate between the various binary operations. The size of the JVM binary itself might shrink a bit, but I am not sure the C2 IR graph would.
- How easy would it be to implement Ideal/Value optimizations for a BinaryOpNode (or any other unified node, for that matter)? I don't think we can club together optimizations for INT8 and FP8/FP16, as those optimizations would be vastly different between these types. Should we then have separate unified nodes for floating-point (FP8, FP16) and integer (INT4, INT8) types? In the current state, we can reuse the FP32 optimizations very well for FP16, and wherever we do not want an optimization to apply, we simply override it in the FP16 methods.
- From your description above, it looks like we would still have separate vector nodes, right?
- Also, we might need some extra handling in the backend to decode the secondary opcode during matching.

I am not sure how much benefit we would get from this approach. Also, do you know tentatively when support for the FP8/INT8 types is planned in the JDK? I think it will take quite some time for users to actually start using even FP16, and there are a lot of future bugs/optimizations to be addressed for FP16.

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1196#issuecomment-2283552499
