RFR: 8366444: Add support for add/mul reduction operations for Float16 [v5]
Yi Wu
duke at openjdk.org
Thu Feb 5 11:50:08 UTC 2026
On Mon, 26 Jan 2026 16:51:56 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1940:
>>
>>> 1938: }
>>> 1939: BLOCK_COMMENT("} neon_reduce_add_fp16");
>>> 1940: }
>>
>> Given that the reduction order is sequential: why do you see any speedup in your benchmarks, comparing scalar to vector performance? How do you explain it? I'm just curious ;)
>
> Also: why not allow a vector with only 2 elements? Is there some restriction here?
Short answer: the speedup does not come from parallelizing the reduction chain; it comes from lower per-element overhead and vector loads. On NEON (Neoverse N1) the reduction was previously a chain of scalar fadd instructions fed by scalar loads. We now load the FP16 values with vector loads, which cuts the load count and the loop overhead. That alone can improve throughput even though the floating-point dependence chain is unchanged.
Before (scalar loads + scalar mov + scalar fadd chain):
ldrsh w13, [x11, #16]
mov v18.h[0], w13
ldrsh w13, [x11, #18]
mov v19.h[0], w13
...
fadd h16, h20, h18
fadd h18, h21, h16
fadd h16, h22, h18
fadd h18, h23, h16
fadd h16, h17, h18
After (vector loads + ext to bring each lane to position 0 + scalar fadd chain):
ldr q17, [x15, #16]
ldr q19, [x15, #32]
...
fadd h18, h16, h17
ext v26.8b, v17.8b, v17.8b, #2
fadd h18, h18, h26
ext v26.8b, v17.8b, v17.8b, #4
fadd h18, h18, h26
...
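To make the ordering constraint concrete, here is a minimal Java sketch (not the JDK implementation, and the class/method names are made up) of a strictly sequential FP16 add-reduction, using the Float.float16ToFloat / Float.floatToFloat16 conversions available since Java 20. Each intermediate sum is rounded back to FP16, so reassociating the chain can change the result; that is why the vectorized code above still has to perform the fadd chain in order and only the loads are vectorized.

```java
public class Fp16Reduce {
    // Strictly sequential FP16 add-reduction: widen each operand to
    // float, add, then round the running sum back to FP16. Because of
    // this per-step rounding, FP16 addition is not associative, so a
    // tree-shaped (parallel) reduction could produce different bits.
    static short reduceAdd(short[] halfs, short init) {
        short acc = init;
        for (short h : halfs) {
            float sum = Float.float16ToFloat(acc) + Float.float16ToFloat(h);
            acc = Float.floatToFloat16(sum);
        }
        return acc;
    }
}
```

For example, reducing {1.0, 2.0} (FP16 bit patterns 0x3C00 and 0x4000) with a zero accumulator yields 3.0 (0x4200).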
(Bhavana-Kilambi is on leave right now, so I’m taking this over. I will update the test shortly. However, I will be on leave as well starting next week for 2.5 weeks.)
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2768689930
More information about the core-libs-dev
mailing list