RFR: 8366444: Add support for add/mul reduction operations for Float16 [v3]

Fri Dec 19 08:46:07 UTC 2025

On Fri, 12 Dec 2025 15:42:24 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> I mean we do not expect a data dependence between two `ins` operations, but there is one now. We do not recommend using instructions that write only part of a register, since this can introduce unexpected dependences between them. I suggest using `ext` instead; I observe about a 20% performance improvement over the current version on V2. I did not check the correctness, but it looks right to me. Could you please help check on other machines? Thanks!
>> 
>> The change might look like:
>> Suggestion:
>> 
>>         fmulh(dst, fsrc, vsrc);
>>         ext(vtmp, T8B, vsrc, vsrc, 2);
>>         fmulh(dst, dst, vtmp);
>>         ext(vtmp, T8B, vsrc, vsrc, 4);
>>         fmulh(dst, dst, vtmp);
>>         ext(vtmp, T8B, vsrc, vsrc, 6);
>>         fmulh(dst, dst, vtmp);
>>         if (isQ) {
>>           ext(vtmp, T16B, vsrc, vsrc, 8);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 10);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 12);
>>           fmulh(dst, dst, vtmp);
>>           ext(vtmp, T16B, vsrc, vsrc, 14);
>>           fmulh(dst, dst, vtmp);
>>         }
>
> Hi @XiaohongGong Thanks for this suggestion. I understand that `ins` has a read-modify-write dependency while `ext` does not have that as we are not reading the `vtmp` register in this case.
> 
> I made changes to both the add and mul reduction implementations and I could see some perf gains on Neoverse V1 and Neoverse V2 for mul reduction but none on Neoverse N1. The following is the ratio of throughput with `ext` to throughput with `ins` (`>1` means `ext` is better) on Neoverse V2:
> 
> Benchmark | vectorDim | 8B | 16B
> -- | -- | -- | --
> Float16OperationsBenchmark.ReductionAddFP16 | 256 | 1.0022509 | 0.99938584
> Float16OperationsBenchmark.ReductionAddFP16 | 512 | 1.05157946 | 1.00262025
> Float16OperationsBenchmark.ReductionAddFP16 | 1024 | 1.02392196 | 1.00187924
> Float16OperationsBenchmark.ReductionAddFP16 | 2048 | 1.01219315 | 0.99964493
> Float16OperationsBenchmark.ReductionMulFP16 | 256 | 0.99729809 | 1.19006546
> Float16OperationsBenchmark.ReductionMulFP16 | 512 | 1.03897347 | 1.0689105
> Float16OperationsBenchmark.ReductionMulFP16 | 1024 | 1.01822982 | 1.01509971
> Float16OperationsBenchmark.ReductionMulFP16 | 2048 | 1.0086255 | 1.0032434
> 
> The 20% gain you mentioned is reproducible but only for the smallest array size. The gains taper off for larger array sizes. My wild guess is that for smaller array sizes the loop is latency-bound, so removing the dependency chains caused by `ins` helps bring down the total latency, whereas for larger array sizes the loop becomes more memory-bound because of the larger number of loads/stores, and removing the `ins` dependency chains probably doesn't help much there.
> 
> 
> Similar numbers for Neoverse V1:
> 
> ...

Thanks for your testing!
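
For anyone following the lane arithmetic in the `ext` sequence quoted above, here is a minimal C++ sketch that models it on the host. It is not HotSpot code: `VReg`, `ext_self`, `lane16`, `make_tagged_register` and `lanes_in_reduction_order` are hypothetical names made up for illustration. The model only covers the case where both `ext` sources are the same register, and it tracks which 16-bit lane each step consumes rather than doing any FP16 arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a 128-bit SIMD register as 16 raw bytes (assumption:
// little-endian lane layout, lane k occupies bytes 2k and 2k+1).
using VReg = std::array<std::uint8_t, 16>;

// Simplified model of AArch64 `EXT Vd, Vn, Vm, #index` for Vn == Vm only.
// `width` is 8 (T8B) or 16 (T16B). Result byte i is source byte
// (i + index) % width; for T8B the upper 8 bytes of the result are zero.
// Note that the result depends only on `src`, never on the old destination.
VReg ext_self(const VReg& src, std::size_t width, std::size_t index) {
    assert((width == 8 || width == 16) && index < width);
    VReg out{};  // zero-initialised
    for (std::size_t i = 0; i < width; ++i) {
        out[i] = src[(i + index) % width];
    }
    return out;
}

// Read the 16-bit value held in lane `lane`.
std::uint16_t lane16(const VReg& r, std::size_t lane) {
    return static_cast<std::uint16_t>(r[2 * lane] | (r[2 * lane + 1] << 8));
}

// Build a register whose lane k holds the tag 0x1000 + k.
VReg make_tagged_register() {
    VReg r{};
    for (std::size_t k = 0; k < 8; ++k) {
        const std::uint16_t v = static_cast<std::uint16_t>(0x1000 + k);
        r[2 * k] = static_cast<std::uint8_t>(v & 0xFF);
        r[2 * k + 1] = static_cast<std::uint8_t>(v >> 8);
    }
    return r;
}

// Mirror the quoted sequence: the first scalar multiply reads lane 0 of vsrc,
// and every later multiply reads lane 0 of a freshly written vtmp.
std::vector<std::uint16_t> lanes_in_reduction_order(const VReg& vsrc, bool isQ) {
    std::vector<std::uint16_t> order;
    order.push_back(lane16(vsrc, 0));  // fmulh(dst, fsrc, vsrc)
    for (std::size_t idx = 2; idx <= 6; idx += 2) {
        const VReg vtmp = ext_self(vsrc, 8, idx);  // ext(vtmp, T8B, vsrc, vsrc, idx)
        order.push_back(lane16(vtmp, 0));          // fmulh(dst, dst, vtmp)
    }
    if (isQ) {
        for (std::size_t idx = 8; idx <= 14; idx += 2) {
            const VReg vtmp = ext_self(vsrc, 16, idx);  // ext(vtmp, T16B, vsrc, vsrc, idx)
            order.push_back(lane16(vtmp, 0));           // fmulh(dst, dst, vtmp)
        }
    }
    return order;
}
```

With a register whose lane k holds `0x1000 + k`, `lanes_in_reduction_order(v, true)` visits lanes 0 through 7 in order, so the reduction order is unchanged. Each `vtmp` is computed from `vsrc` alone, which is the property that avoids the read-modify-write dependency of `ins`.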

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2634220120