RFR: 8366444: Add support for add/mul reduction operations for Float16

Xiaohong Gong xgong at openjdk.org
Tue Oct 7 02:50:45 UTC 2025


On Thu, 2 Oct 2025 10:21:06 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1900:
>> 
>>> 1898:           fmulh(dst, dst, vtmp);
>>> 1899:           ins(vtmp, H, vsrc, 0, 7);
>>> 1900:           fmulh(dst, dst, vtmp);
>> 
>> Do you know why the performance is not improved significantly for multiply reduction? Seems instructions between different `ins` instructions will have a data-dependence, which is not expected? Could you please use other instructions instead or clear the register `vtmp` before `ins` and check the performance changes?
>> 
>> Note that a clear of `mov` such as `MOVI Vd.2D, #0` has zero cost from V2's guide.
>
> Are you referring to the N1 numbers? The add reduction operation has gains around ~40% while the mul reduction is around ~20% on N1. On V1 and V2 they look comparable (not considering the cases where we generate `fadda` instructions for add reduction).
> 
>> Seems instructions between different ins instructions will have a data-dependence, which is not expected
> 
> Why do you think it's not expected? We have the exact same sequence for Neon add reduction as well. There's back to back dependency there as well and yet it shows better performance. The N1 optimization guide shows 2 cyc latency for `fadd` and 3 cyc latency for `fmul`. Could this be the reason? WDYT?

I mean we do not expect there is data-dependence between two `ins` operations, but it has now. We do not recommend use the instructions that just write part of a register. This might involve un-expected dependence between. I suggest to use `ext` instead, and I can observe about 20% performance improvement compared with current version on V2. I did not check the correctness, but it looks right to me. Could you please help check on other machines? Thanks!

The change might look like:
Suggestion:

        fmulh(dst, fsrc, vsrc);
        ext(vtmp, T8B, vsrc, vsrc, 2);
        fmulh(dst, dst, vtmp);
        ext(vtmp, T8B, vsrc, vsrc, 4);
        fmulh(dst, dst, vtmp);
        ext(vtmp, T8B, vsrc, vsrc, 6);
        fmulh(dst, dst, vtmp);
        if (isQ) {
          ext(vtmp, T16B, vsrc, vsrc, 8);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 10);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 12);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 14);
          fmulh(dst, dst, vtmp);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2409190652


More information about the core-libs-dev mailing list