[lworld+fp16] RFR: 8338102: x86 backend support for newly added Float16 intrinsics. [v2]

Jatin Bhateja jbhateja at openjdk.org
Mon Aug 12 10:42:46 UTC 2024


On Mon, 12 Aug 2024 09:58:50 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

> I am not sure about the idea of having unified nodes for multiple classes of operations (unary, binary etc). I am wondering whether it will simplify or complicate the implementation of these operations? I'll just break down my thoughts into the following major points -
> 
> * The size of the resulting sea of nodes graph should not change. Instead of having AddHF/SubHF we will have a BinaryOpNode. Plus it needs to now store an extra field (secondary opcode) to differentiate between the various binary operations. Maybe the size of the JVM binary itself might reduce a bit but not sure about the c2 IR graph.

Yes, the Sea of Nodes is a compile-time graph, and its size is agnostic to whether we choose a common IR representation.

Having a common IR translates into a single instruction selection pattern; as you know, ADLC processes each pattern and generates code catering to different downstream passes. By passing the additional secondary opcode as an immI (constant) operand, we can save a lot of generated code.
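For illustration, here is a minimal C++ sketch of the idea, assuming a hypothetical unified node class (the names `BinaryFP16Node`, `SecondaryOp`, etc. are illustrative only, not actual HotSpot code): one node class carries a constant secondary opcode instead of there being distinct AddHF/SubHF/... classes, so one selection pattern covers all the binary operations.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (illustrative names, not actual HotSpot code):
// a single unified binary FP16 node carries a compile-time secondary
// opcode instead of having separate AddHF/SubHF/MulHF/DivHF classes.
enum SecondaryOp : uint8_t { HF_ADD, HF_SUB, HF_MUL, HF_DIV };

struct BinaryFP16Node {
  SecondaryOp sec_op;  // would be passed to the matcher as an immI constant
  float in1, in2;      // operands, modeled as float for the sketch

  // One node class, one selection pattern; the specific operation is
  // selected by the constant secondary opcode.
  float value() const {
    switch (sec_op) {
      case HF_ADD: return in1 + in2;
      case HF_SUB: return in1 - in2;
      case HF_MUL: return in1 * in2;
      case HF_DIV: return in1 / in2;
    }
    return 0.0f;
  }
};
```

Since the secondary opcode is a compile-time constant, the dispatch cost exists only at compilation time, not in the generated code.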

> * How easy would it be to implement Ideal/Value optimizations for a BinaryOpNode (or any other unified node for that 

This should be relatively straightforward, since the secondary opcode is a compile-time constant. I guess your main concern is that the Value / Identity routines which are currently inherited may no longer be usable, but one can always factor out the meaty parts of the existing routines and call them from the new IR's Value / Identity routines to avoid code duplication.
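As a sketch of that factoring (hypothetical helper, not actual HotSpot code), the shared "meaty part" of an Identity check could live in a free function that both the existing per-op nodes and a unified node call, keyed on the secondary opcode:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the reusable core of an Identity routine, factored
// out so both existing nodes and a unified node can call it. Returns true
// if applying the op with the given constant right operand leaves the left
// operand unchanged (ignoring signed-zero subtleties for the sketch).
enum SecOp : uint8_t { OP_ADD, OP_SUB, OP_MUL, OP_DIV };

static bool is_identity(SecOp op, float rhs_const) {
  switch (op) {
    case OP_ADD:
    case OP_SUB: return rhs_const == 0.0f;  // x + 0 == x, x - 0 == x
    case OP_MUL:
    case OP_DIV: return rhs_const == 1.0f;  // x * 1 == x, x / 1 == x
  }
  return false;
}
```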

> matter)? I don't think we can club optimizations for INT8 and FP8/FP16 as those optimizations would be vastly different between these types. Then should we have separate unified nodes for floating-point (FP8, FP16) and integer types (INT4, INT8)? 

What I meant was one Binary / Unary / Ternary node for each new reduced-precision floating-point type. As of now, on x86, INT8 applicability is limited to dot product; it's not a full-blown ISA extension like FP16.

> In the current state, we are able to reuse FP32 optimizations very well for FP16 and where ever we do not want those optimizations to be applied, we just override it (in FP16 methods) and change accordingly.

Yes, that was our intention while choosing this implementation strategy. My intent here is to record our discussion on the community mailing list so that we can accommodate suggestions from a larger audience, if any.

> * From your description above, it looks like we will still have separate Vector Nodes right?

No, we can also have a common IR for vectors.

> * Also, we might need some extra handling to decode the secondary opcode in the backend during matching.
> 

The Matcher already handles constant integral operands, hence the actual decision to emit a specific instruction encoding based on the secondary opcode can easily be deferred to the encoding blocks without any hassle.
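To sketch what such deferred selection could look like (hypothetical helper, not actual HotSpot code): the matcher accepts the single unified pattern, and the constant secondary opcode picks the concrete instruction only at emission time. The opcode bytes below follow the familiar SSE/AVX add/sub/mul/div pattern (58/5C/59/5E), but treat them as illustrative rather than authoritative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of an encoding-block decision: the constant secondary
// opcode selects the concrete FP16 instruction opcode byte at emit time.
enum SecOp : uint8_t { OP_ADD, OP_SUB, OP_MUL, OP_DIV };

static uint8_t fp16_opcode_byte(SecOp sec_op) {
  switch (sec_op) {
    case OP_ADD: return 0x58;  // e.g. vaddph
    case OP_SUB: return 0x5C;  // e.g. vsubph
    case OP_MUL: return 0x59;  // e.g. vmulph
    case OP_DIV: return 0x5E;  // e.g. vdivph
  }
  return 0;
}
```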
 
> I am not sure how much of benefit we might have with this approach. Also, do you know tentatively, by when is it planned to implement support for FP8/INT8 types in JDK? I think it will take quite some time for users to actually start using even FP16 and lot of future bugs/optimizations to be addressed for FP16.

FTR, here is a link to our discussion on common vs. separate
IR: https://github.com/jatin-bhateja/external_staging/blob/main/FP16Support/design_discussions.txt#L61

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1196#issuecomment-2283627677
