RFR: 8366444: Add support for add/mul reduction operations for Float16
Bhavana Kilambi
bkilambi at openjdk.org
Fri Dec 12 15:45:29 UTC 2025
On Tue, 7 Oct 2025 02:47:50 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Are you referring to the N1 numbers? The add reduction operation gains around 40% while the mul reduction gains around 20% on N1. On V1 and V2 they look comparable (not considering the cases where we generate `fadda` instructions for add reduction).
>>
>>> Seems instructions between different ins instructions will have a data-dependence, which is not expected
>>
>> Why do you think it's not expected? We have the exact same sequence for Neon add reduction as well; there is a back-to-back dependency there too, and yet it shows better performance. The N1 optimization guide lists a 2-cycle latency for `fadd` and a 3-cycle latency for `fmul`. Could this be the reason? WDYT?
>
> I mean we do not expect a data dependence between two `ins` operations, but there is one now. We do not recommend using instructions that write only part of a register, since this can introduce unexpected dependences. I suggest using `ext` instead; with it I observe about a 20% performance improvement over the current version on V2. I have not verified correctness, but it looks right to me. Could you please help check on other machines? Thanks!
>
> The change might look like:
> Suggestion:
>
> fmulh(dst, fsrc, vsrc);
> ext(vtmp, T8B, vsrc, vsrc, 2);
> fmulh(dst, dst, vtmp);
> ext(vtmp, T8B, vsrc, vsrc, 4);
> fmulh(dst, dst, vtmp);
> ext(vtmp, T8B, vsrc, vsrc, 6);
> fmulh(dst, dst, vtmp);
> if (isQ) {
>   ext(vtmp, T16B, vsrc, vsrc, 8);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 10);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 12);
>   fmulh(dst, dst, vtmp);
>   ext(vtmp, T16B, vsrc, vsrc, 14);
>   fmulh(dst, dst, vtmp);
> }
Hi @XiaohongGong Thanks for this suggestion. I understand that `ins` has a read-modify-write dependency, while `ext` does not, since it never reads the `vtmp` register in this case.
I made the change in both the add and mul reduction implementations and see some performance gains on V1 and V2 for mul reduction:
Benchmark | vectorDim | 8B | 16B
-- | -- | -- | --
Float16OperationsBenchmark.ReductionAddFP16 | 256 | 1.0022509 | 0.99938584
Float16OperationsBenchmark.ReductionAddFP16 | 512 | 1.05157946 | 1.00262025
Float16OperationsBenchmark.ReductionAddFP16 | 1024 | 1.02392196 | 1.00187924
Float16OperationsBenchmark.ReductionAddFP16 | 2048 | 1.01219315 | 0.99964493
Float16OperationsBenchmark.ReductionMulFP16 | 256 | 0.99729809 | 1.19006546
Float16OperationsBenchmark.ReductionMulFP16 | 512 | 1.03897347 | 1.0689105
Float16OperationsBenchmark.ReductionMulFP16 | 1024 | 1.01822982 | 1.01509971
Float16OperationsBenchmark.ReductionMulFP16 | 2048 | 1.0086255 | 1.0032434
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2614674991