[lworld+fp16] RFR: 8330021: AArch64: Add backend support for FP16 add operation [v3]

Bhavana Kilambi bkilambi at openjdk.org
Thu Jun 20 13:13:30 UTC 2024


On Thu, 20 Jun 2024 10:42:57 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> This commit [1] adds initial support for FP16 operations and adds backend support for FP16 add operation for X86. This task adds backend support for scalar and vector FP16 add operation on aarch64.
>> 
>> [1] https://github.com/openjdk/valhalla/commit/f03fb4e4ee4d59ed692d0c26ddce260511f544e7#diff-a799ce8da7f3062bb3699beb65aae504840c649942032e325c2a50f88a2869ad
>
> Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
> 
>  - Remove couple of IR testcases + add --enable-preview flag to tests
>  - Merge branch 'lworld+fp16'
>  - Add a missing return statement in the ad file
>  - 8330021: AArch64: Add backend support for FP16 add operation
>    
>    This commit [1] adds initial support for FP16 operations and adds
>    backend support for FP16 add operation for X86. This task adds backend
>    support for scalar and vector FP16 add operation on aarch64.
>    
>    [1] https://github.com/openjdk/valhalla/commit/f03fb4e4ee4d59ed692d0c26ddce260511f544e7#diff-a799ce8da7f3062bb3699beb65aae504840c649942032e325c2a50f88a2869ad

Hello @jatin-bhateja, I noticed that the vector test `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorSum.java` fails on aarch64/x86 machines that support the Float16 feature. Have you noticed this on any of your `avx512_fp16`-supporting machines too? I suspect this failure does not show up in GitHub Actions for this PR or others because they are likely running the tests on hardware without FP16 support.

I dug a bit deeper and found that the loop containing the Float16 `add` operation is not getting unrolled in the first place, which is why it never gets vectorized. I have an internal implementation for other binary FP16 operations (subtract, multiply, divide, min and max) and none of those get vectorized either. After some debugging, I noticed that the inner loop body (`_body`) at this point in the code - https://github.com/openjdk/valhalla/blob/404a4fae6fc099cd93cc73d57e29b08b5418adb9/src/hotspot/share/opto/loopTransform.cpp#L1069 - contains 188 nodes for the Float16 add, compared to 29 nodes for the Float32 add and 31 nodes for the FP16 add before the merge on June 9th. I think that with so many nodes in the loop body, the unrolling policy decides not to unroll it.
I printed the extra nodes that the loop (for the FP16 add operation) contains and I can see nodes like `DecodeNKlass`, `DecodeN`, `LoadNKlass`, `LoadN`, `CmpP`, `StoreP`, `StoreCM`, `MergeMem`, `EncodeP`, `CastPP` and other pointer-related IR nodes, possibly related to decoding compressed class pointers. None of these appear in the FP32 loop or in the FP16 loop before the merge. Do you have any idea about this failure or the extra nodes in the loop?
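For reference, a minimal sketch of the kind of reduction loop involved. This is not the actual `TestFloat16VectorSum` kernel; the package location of `Float16`, the static `valueOf`/`add` methods and the use of a `Float16[]` array are assumptions based on the lworld+fp16 prototype API and may differ from the real test:

```java
// Hedged sketch only: assumes a Float16 value class with valueOf/add/floatValue,
// as in the lworld+fp16 prototype. Package location is an assumption.
import jdk.incubator.vector.Float16;

public class Float16SumSketch {
    static final int LEN = 1024;

    // FP16 variant: the kind of loop whose C2 `_body` reportedly grows to ~188 nodes.
    // Each Float16.add goes through the value-class path, which is where the extra
    // pointer-related nodes (LoadNKlass/DecodeNKlass/CmpP/...) would come from,
    // so the unrolling policy sees a much larger body than for the FP32 loop.
    static Float16 sumFp16(Float16[] a) {
        Float16 acc = Float16.valueOf(0.0f);
        for (int i = 0; i < a.length; i++) {
            acc = Float16.add(acc, a[i]);
        }
        return acc;
    }

    // FP32 reference: the same loop shape (~29 nodes) unrolls and vectorizes as expected.
    static float sumFp32(float[] a) {
        float acc = 0.0f;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        Float16[] a = new Float16[LEN];
        float[] b = new float[LEN];
        for (int i = 0; i < LEN; i++) {
            a[i] = Float16.valueOf(1.0f);
            b[i] = 1.0f;
        }
        System.out.println(sumFp16(a).floatValue() + " " + sumFp32(b));
    }
}
```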

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1096#issuecomment-2180646670

