RFR: 8366444: Add support for add/mul reduction operations for Float16 [v5]
Emanuel Peter
epeter at openjdk.org
Mon Jan 26 16:54:45 UTC 2026
On Mon, 29 Dec 2025 17:39:42 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> This patch adds mid-end support for vectorized add/mul reduction operations for half floats. It also includes backend aarch64 support for these operations. Only vectorization support through autovectorization is added as VectorAPI currently does not support Float16 vector species.
>>
>> Both add and mul reductions, when vectorized through autovectorization, are required to be strictly ordered. The following is how each of these reductions is implemented for different aarch64 targets -
>>
>> **For AddReduction :**
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the scalar `fadd` instruction for both 8B and 16B vector lengths. This is because Neon does not provide a direct instruction for computing a strictly ordered floating-point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes add reduction for floating point in strict order.
>>
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing a strictly ordered floating-point multiply reduction. For vector lengths of 8B and 16B, a scalarized sequence of scalar `fmul` instructions is generated, and multiply reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in `Float16OperationsBenchmark.java` tested on three different aarch64 machines and with varying `MaxVectorSize` -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master branch without this patch which generates a sequence of loads (`ldrsh`) to load the FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded value to the running sum/product. The ratios given below are the ratios between the throughput with this patch and the throughput without this patch.
>> Ratio > 1 indicates the performance with this patch is better than the master branch.
>>
>> **N1 (UseSVE = 0, max vector length = 16B):**
>>
>> Benchmark          vectorDim   Mode  Cnt     8B    16B
>> ReductionAddFP16         256  thrpt    9   1.41   1.40
>> ReductionAddFP16         512  thrpt    9   1.41   1.41
>> ReductionAddFP16        1024  thrpt    9   1.43   1.40
>> ReductionAddFP16        2048  thrpt    9   1.43   1.40
>> ReductionMulFP16         256  thrpt    9   1.22   1.22
>> ReductionMulFP16         512  thrpt    9   1.21   1.23
>> ReductionMulFP16        1024  thrpt    9   1.21   1.22
>> ReductionMulFP16        2048  thrpt    9   1.20   1.22
>>
>>
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
>
> - Address review comments for the JTREG test and microbenchmark
> - Merge branch 'master'
> - Address review comments
> - Fix build failures on Mac
> - Address review comments
> - Merge 'master'
> - 8366444: Add support for add/mul reduction operations for Float16
>
> This patch adds mid-end support for vectorized add/mul reduction
> operations for half floats. It also includes backend aarch64 support for
> these operations. Only vectorization support through autovectorization
> is added as VectorAPI currently does not support Float16 vector species.
>
> Both add and mul reduction vectorized through autovectorization mandate
> the implementation to be strictly ordered. The following is how each of
> these reductions is implemented for different aarch64 targets -
>
> For AddReduction :
> On Neon only targets (UseSVE = 0): Generates scalarized additions
> using the scalar "fadd" instruction for both 8B and 16B vector lengths.
> This is because Neon does not provide a direct instruction for computing
> strictly ordered floating point add reduction.
>
> On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
> computes add reduction for floating point in strict order.
>
> For MulReduction :
> Both Neon and SVE do not provide a direct instruction for computing
> strictly ordered floating point multiply reduction. For vector lengths
> of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
> generated and multiply reduction for vector lengths > 16B is not
> supported.
>
> Below is the performance of the two newly added microbenchmarks in
> Float16OperationsBenchmark.java tested on three different aarch64
> machines and with varying MaxVectorSize -
>
> Note: On all machines, the score (ops/ms) is compared with the master
> branch without this patch which generates a sequence of loads ("ldrsh")
> to load the FP16 value into an FPR and a scalar "fadd/fmul" to
> add/multiply the loaded value to the running sum/product. The ratios
> given below are the ratios between the throughput with this patch and
> the throughput without this patch.
> Ratio > 1 indicates the performance with this patch is better than the
> master branch.
>
> N1 (UseSVE = 0...
I had another quick look, and I was wondering:
In my experience, float/double reductions that just add/mul up values (i.e. simple reductions) generally show no speedup when vectorized: whether they are scalar or vector, the bottleneck is the latency along the reduction chain. So why do you measure speedups here for `Float16`? Do you have a good explanation? Memory bandwidth should be even less of a limiting factor here, so the latency along the chain should carry even more weight. What do you think?
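Just to make the concern concrete, this is the kind of loop shape I have in mind (a plain `float` analogue with made-up names, not code from this patch): every add consumes the result of the previous one, so a strictly ordered reduction carries a serial chain of dependent adds whether it is scalar or vectorized.

    // Illustrative only -- not code from this patch.
    static float reduce(float[] a) {
        float acc = 0.0f;
        for (int i = 0; i < a.length; i++) {
            // Loop-carried dependence: this add cannot start before the
            // previous iteration's add has produced 'acc', so the runtime
            // is bounded by a.length * (fadd latency), regardless of how
            // the loads are vectorized.
            acc += a[i];
        }
        return acc;
    }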
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1929:
> 1927: ext(vtmp, T8B, vsrc, vsrc, 6);
> 1928: faddh(dst, dst, vtmp);
> 1929: if (isQ) {
I don't think the `if` should be indented here, right?
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1940:
> 1938: }
> 1939: BLOCK_COMMENT("} neon_reduce_add_fp16");
> 1940: }
Given that the reduction order is sequential: why do you see any speedup in your benchmarks when comparing scalar to vector performance? How do you explain it? I'm just curious ;)
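For reference, a small sketch of what "sequential" means here (made-up identifiers, plain `float` instead of `Float16`): the strict-order requirement fixes the parenthesization of the adds, so the auto-vectorizer is not allowed to rewrite the reduction into a pairwise/tree form that would shorten the critical path.

    // Illustrative only -- not code from this patch.
    // Strict order required:  ((((acc + v0) + v1) + v2) + v3)
    static float strictOrder(float acc, float[] v) {
        for (float x : v) {
            acc = acc + x;   // each add depends on the previous result
        }
        return acc;
    }

    // A tree-shaped reduction halves the dependent chain, but it reorders
    // the floating-point adds and can give a differently rounded result,
    // so it is not a legal transformation here. (Assumes v.length == 4.)
    static float treeOrder(float acc, float[] v) {
        return acc + ((v[0] + v[1]) + (v[2] + v[3]));
    }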
-------------
PR Review: https://git.openjdk.org/jdk/pull/27526#pullrequestreview-3706944699
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2728374969
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2728381603