RFR: 8366444: Add support for add/mul reduction operations for Float16 [v5]
Yi Wu
duke at openjdk.org
Thu Feb 5 11:50:08 UTC 2026
On Mon, 26 Jan 2026 16:51:56 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1940:
>>
>>> 1938: }
>>> 1939: BLOCK_COMMENT("} neon_reduce_add_fp16");
>>> 1940: }
>>
>> Given that the reduction order is sequential: why do you see any speedup in your benchmarks, comparing scalar to vector performance? How do you explain it? I'm just curious ;)
>
> Also: why not allow a vector with only 2 elements? Is there some restriction here?
Short answer: the speedup does not come from parallelizing the reduction chain; it comes from lower per-element overhead and vector loads. On NEON (Neoverse N1) the reduction was previously a chain of scalar fadd instructions fed by scalar loads. We now load the FP16 values with vector loads, which cuts the load count and the loop overhead. That alone can improve throughput even though the floating-point dependence chain is unchanged.
Before (scalar loads + scalar mov + scalar fadd chain):
ldrsh w13, [x11, #16]
mov v18.h[0], w13
ldrsh w13, [x11, #18]
mov v19.h[0], w13
...
fadd h16, h20, h18
fadd h18, h21, h16
fadd h16, h22, h18
fadd h18, h23, h16
fadd h16, h17, h18
After (vector loads + ext to bring each lane to position 0 + scalar fadd chain):
ldr q17, [x15, #16]
ldr q19, [x15, #32]
...
fadd h18, h16, h17
ext v26.8b, v17.8b, v17.8b, #2
fadd h18, h18, h26
ext v26.8b, v17.8b, v17.8b, #4
fadd h18, h18, h26
...
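To make the ordering constraint concrete, here is a minimal Java sketch (not the JDK implementation, and the class/method names are made up) of a strictly sequential FP16 add-reduction, using the Float.float16ToFloat / Float.floatToFloat16 conversions available since Java 20. Each intermediate sum is rounded back to FP16, so reassociating the chain can change the result; that is why the vectorized code above still has to perform the fadd chain in order and only the loads are vectorized.

```java
public class Fp16Reduce {
    // Strictly sequential FP16 add-reduction: widen each operand to
    // float, add, then round the running sum back to FP16. Because of
    // this per-step rounding, FP16 addition is not associative, so a
    // tree-shaped (parallel) reduction could produce different bits.
    static short reduceAdd(short[] halfs, short init) {
        short acc = init;
        for (short h : halfs) {
            float sum = Float.float16ToFloat(acc) + Float.float16ToFloat(h);
            acc = Float.floatToFloat16(sum);
        }
        return acc;
    }
}
```

For example, reducing {1.0, 2.0} (FP16 bit patterns 0x3C00 and 0x4000) with a zero accumulator yields 3.0 (0x4200).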
(Bhavana-Kilambi is on leave right now, so I’m taking this over. I will update the test shortly. However, I will be on leave as well starting next week for 2.5 weeks.)
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2768689930
More information about the core-libs-dev
mailing list