[lworld+fp16] RFR: 8339473: Add support for FP16 isFinite, isInfinite and isNaN

Jatin Bhateja jbhateja at openjdk.org
Tue Sep 17 12:14:33 UTC 2024


On Fri, 13 Sep 2024 07:09:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi @jatin-bhateja , I have implemented these intrinsics in floating point in this patch. I also have another version where I tried implementing isInfinite in integer arithmetic for aarch64. I have created a draft PR here - https://github.com/openjdk/valhalla/pull/1242/commits/e095b2d47a4c28bece8fc0fcbc656da713feb132
>> 
>> I avoided generating the ReinterpretS2HF nodes in the mid-end, which keeps the "src" in a GPR. This made the integer-arithmetic implementation easier and resulted in significant gains, which can be seen in the commit message of the draft PR.
>> 
>> I have also tried implementing similar integer logic on x86 (in the draft PR) and saw ~2.6x better performance than the case without the intrinsic (I have, however, not tried implementing the vpclass instruction for FP16 and comparing performance).
>> 
>> Keeping the ReinterpretS2HF nodes would mean that we need to move the `src` from an FPR to a GPR in the backend before executing the integer operations, which proved to be costly and not very beneficial.
>> 
>> What do you think of not using the ReinterpretS2HF node for these intrinsics and keeping the `src` in a GPR instead? I know that technically FP16 is floating point (even if the storage is `short`) and we have always kept the `src` in an FPR so as to generate FP instructions, but I was wondering if this tweak is ok if we see good performance gains? Looking forward to hearing your feedback. Thanks!
>
> Hi @Bhavana-Kilambi , 
> Did you try modifying the Java implementation and measuring the performance against the intrinsic solution?
> e.g. isNaNTest := (Float16.float16ToRawShortValue() & 0x7e00) == 0x7e00, etc.

> Hi @jatin-bhateja , I just tested the other two intrinsics as well. I do get huge performance gains (on aarch64) with the Java versions. I used these -
> 
> ```
> @IntrinsicCandidate
>     public static boolean isNaN(Float16 f16) {
>         return ((float16ToRawShortBits(f16) & 0x7e00) == 0x7e00);
>     }
> ```
> 
> ```
> @IntrinsicCandidate
>     public static boolean isInfinite(Float16 f16) {
>         return ((float16ToRawShortBits(f16) ^ float16ToRawShortBits(POSITIVE_INFINITY)) & 0x7fff) == 0;
>     }
> ```
> 
> ```
> @IntrinsicCandidate
>     public static boolean isFinite(Float16 f16) {
>         return !(isInfinite(f16) || isNaN(f16));
>     }
> ```
> 
> I see these gains compared to the intrinsic version uploaded in the latest patch on this PR -
> 
> ```
> Benchmark               (SIZE)    Gain
> FP16Ops.isFiniteHF      65504     1.18409353876134
> FP16Ops.isInfiniteHF    65504     1.57611028955779
> FP16Ops.isNaNHF         65504     1.48276431718062
> ```
> 
> Do you think it's better to not have any intrinsics at all for these operations?
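
For reference, the bit-level checks in the snippets above fall straight out of the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits): the 0x7e00 test matches the canonical quiet-NaN pattern, while the general classification is "exponent all ones, mantissa non-zero" for NaN and "exponent all ones, mantissa zero" for infinity. Below is a minimal, self-contained sketch over raw `short` bit patterns (hypothetical helper names, not part of the patch) showing why each check reduces to a mask and compare once the value sits in a GPR:

```java
// Standalone sketch over raw binary16 bit patterns (hypothetical helper names).
// Layout: bit 15 = sign, bits 14-10 = exponent, bits 9-0 = mantissa.
public final class Fp16Bits {
    static final int EXP_MASK  = 0x7C00;  // exponent field all ones
    static final int MANT_MASK = 0x03FF;  // mantissa field
    static final int ABS_MASK  = 0x7FFF;  // everything except the sign bit

    // NaN: exponent all ones and mantissa non-zero.
    static boolean isNaN(short bits) {
        return (bits & EXP_MASK) == EXP_MASK && (bits & MANT_MASK) != 0;
    }

    // Infinity: exponent all ones and mantissa zero, i.e. |bits| == 0x7C00.
    static boolean isInfinite(short bits) {
        return (bits & ABS_MASK) == EXP_MASK;
    }

    // Finite: exponent not all ones.
    static boolean isFinite(short bits) {
        return (bits & EXP_MASK) != EXP_MASK;
    }

    public static void main(String[] args) {
        System.out.println(isNaN((short) 0x7e01));       // true  (NaN)
        System.out.println(isInfinite((short) 0xfc00));  // true  (-Infinity)
        System.out.println(isFinite((short) 0x3c00));    // true  (1.0)
    }
}
```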

Thanks @Bhavana-Kilambi, yes, that would be optimal. Your Java versions still carry @IntrinsicCandidate; I hope we are not accidentally running the intrinsics. If that is not the case, then making only a Java-side change should suffice, as it is also easily auto-vectorizable.
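
To illustrate the auto-vectorization point, here is a hypothetical micro-loop (not part of the patch) over an array of raw binary16 bit patterns; once the checks are plain integer arithmetic, the loop body is a branch-free mask-and-compare that C2's auto-vectorizer can typically turn into SIMD compares and masks:

```java
// Hypothetical example, not part of the patch: counts NaNs in an array of raw
// binary16 bit patterns using the same quiet-NaN pattern test as the Java
// versions above. The body has no branches, so it is a good candidate for
// C2 auto-vectorization.
static int countNaNs(short[] f16Bits) {
    int count = 0;
    for (int i = 0; i < f16Bits.length; i++) {
        count += ((f16Bits[i] & 0x7e00) == 0x7e00) ? 1 : 0;
    }
    return count;
}
```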

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1239#issuecomment-2355541564

