[lworld+fp16] RFR: 8339473: Add support for FP16 isFinite, isInfinite and isNaN

Fri Sep 13 05:59:20 UTC 2024

On Wed, 11 Sep 2024 08:37:38 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

> This patch adds intrinsic support for FP16 isNaN, isFinite and isInfinite methods and also adds aarch64 backend for these intrinsics.
> 
> Tested all FP16 related tests successfully on aarch64.

> Hi @jatin-bhateja , I have implemented these intrinsics in floating point in this patch. I also have another version where I tried implementing isInfinite in integer arithmetic for aarch64. I have created a draft PR here - [e095b2d](https://github.com/openjdk/valhalla/commit/e095b2d47a4c28bece8fc0fcbc656da713feb132)
> 
> I avoided generating the ReinterpretS2HF nodes in the mid-end which will make the "src" to be in a GPR and made the implementation in integer arithmetic easier and resulted in significant gains which can be seen in the commit message of the draft PR.
> 
> I have also tried implementing similar integer logic on x86 (in the draft PR) and I could see ~2.6x better performance than the case without the intrinsic (I have however not tried to implement the vpclass instruction for FP16 and compare performance).
> 
> Keeping the RenterpretS2HF nodes would mean that we need to move the `src` from an FPR to a GPR in the backend, before executing the integer operations which proved to be costly and not very beneficial.
> 
> What do you think of not using ReinterpretS2HF node for these intrinsics and keeping the `src` in a GPR instead? I know that technically FP16 is floating point (even if storage is `short`) and we have always implemented the `src` in an FPR so as to generate FP instructions but I was wondering if this tweak is ok if we see good performance gains? Looking forward to hearing your feedback. Thanks!

Hi @Bhavana-Kilambi , Thanks for the patch, I will add the corresponding x86 backend implimentation.

On your prototype, c2 instruction selector is very flexible in folding bigger graph pallets,  we can skip over creating a specialized IR for graph fragment and directly create a matcher rule covering a bigger pattern and emit optimized instruction sequence for it. 

So what you have in this patch looks good to me.

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1239#issuecomment-2348088194