RFR: 8373344: Add support for min/max reduction operations for Float16 [v2]

Tue Jan 6 07:40:58 UTC 2026

On Mon, 5 Jan 2026 11:31:26 GMT, Yi Wu <duke at openjdk.org> wrote:

>> This patch adds mid-end support for vectorized min/max reduction operations for half floats. It also includes backend AArch64 support for these operations.
>> Neither the floating-point min nor the max reduction requires strict ordering, because both operations are associative.
>> 
>> It generates NEON fminv/fmaxv reduction instructions when the max vector length is 8B or 16B. On SVE-capable machines with vector lengths > 16B, it generates the SVE fminv/fmaxv instructions.
>> The patch also adds support for partial min/max reductions on SVE machines using predicated SVE fminv/fmaxv.
>> 
>> A throughput (ops/ms) ratio > 1 indicates that performance with this patch is better than mainline.
>> 
>> Neoverse N1 (UseSVE = 0, max vector length = 16B):
>> 
>> Benchmark         vectorDim  Mode   Cnt     8B    16B
>> ReductionMaxFP16  256        thrpt    9   3.69   6.44
>> ReductionMaxFP16  512        thrpt    9   3.71   7.62
>> ReductionMaxFP16  1024       thrpt    9   4.16   8.64
>> ReductionMaxFP16  2048       thrpt    9   4.44   9.12
>> ReductionMinFP16  256        thrpt    9   3.69   6.43
>> ReductionMinFP16  512        thrpt    9   3.70   7.62
>> ReductionMinFP16  1024       thrpt    9   4.16   8.64
>> ReductionMinFP16  2048       thrpt    9   4.44   9.10
>> 
>> Neoverse V1 (UseSVE = 1, max vector length = 32B):
>> 
>> Benchmark         vectorDim  Mode   Cnt     8B    16B    32B
>> ReductionMaxFP16  256        thrpt    9   3.96   8.62   8.02
>> ReductionMaxFP16  512        thrpt    9   3.54   9.25  11.71
>> ReductionMaxFP16  1024       thrpt    9   3.77   8.71  14.07
>> ReductionMaxFP16  2048       thrpt    9   3.88   8.44  14.69
>> ReductionMinFP16  256        thrpt    9   3.96   8.61   8.03
>> ReductionMinFP16  512        thrpt    9   3.54   9.28  11.69
>> ReductionMinFP16  1024       thrpt    9   3.76   8.70  14.12
>> ReductionMinFP16  2048       thrpt    9   3.87   8.45  14.70
>> 
>> Neoverse V2 (UseSVE = 2, max vector length = 16B):
>> 
>> Benchmark         vectorDim  Mode   Cnt     8B    16B
>> ReductionMaxFP16  256        thrpt    9   4.78  10.00
>> ReductionMaxFP16  512        thrpt    9   3.74  11.33
>> ReductionMaxFP16  1024       thrpt    9   3.86   9.59
>> ReductionMaxFP16  2048       thrpt    9   3.94   8.71
>> ReductionMinFP16  256        thrpt    9   4.78  10.00
>> ReductionMinFP16  512        thrpt    9   3.74  11.29
>> ReductionMinFP16  1024       thrpt    9   3.86   9.58
>> ReductionMinFP16  2048       thrpt    9   3.94   8.71
>> 
>> Testing:
>> hotspot_all, jdk (tier1-3) and langtools (tier1) all pass ...
>
> Yi Wu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - Replace assert with verify
>  - Add IRNode constant and code refactor
>  - Merge remote-tracking branch 'origin/master' into yiwu-8373344
>  - 8373344: Add support for FP16 min/max reduction operations
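
A small sketch can illustrate the ordering argument in the PR description. It is not code from the patch, and the class and method names are hypothetical. FP16 lanes are modelled as float. Every FP16 value is exactly representable as a float, and min/max always return one of their inputs, so no rounding is involved. The sketch assumes that FP16 min/max follow the NaN-propagating and signed-zero rules of Math.min/Math.max(float). tree() only stands in for a cross-lane reduction and does not describe how fminv/fmaxv work internally. Because min and max are associative and also commutative, the two evaluation orders agree. Float addition is included as a contrast, since reordering it can change the result.

```java
import java.util.Arrays;

// Hypothetical sketch: models FP16 lanes as float (exact for every FP16 value)
// and compares a strictly ordered reduction with a pairwise "tree" reduction.
public class MinMaxReductionOrder {

    // Float-specialised binary operator, so no boxing is needed.
    interface FloatOp {
        float apply(float a, float b);
    }

    // Strictly ordered reduction: (((identity op x0) op x1) op x2) ...
    static float linear(float[] xs, float identity, FloatOp op) {
        float acc = identity;
        for (float x : xs) {
            acc = op.apply(acc, x);
        }
        return acc;
    }

    // Pairwise reduction: fold the upper half of the lanes onto the lower
    // half until one lane remains (a stand-in for a cross-lane reduction).
    static float tree(float[] xs, float identity, FloatOp op) {
        if (xs.length == 0) {
            return identity;
        }
        float[] lanes = Arrays.copyOf(xs, xs.length);
        int n = lanes.length;
        while (n > 1) {
            int half = (n + 1) / 2;
            for (int i = 0; i < n / 2; i++) {
                lanes[i] = op.apply(lanes[i], lanes[i + half]);
            }
            n = half;
        }
        return lanes[0];
    }

    public static void main(String[] args) {
        float[][] inputs = {
            {3.5f, -0.0f, 0.0f, 65504.0f, -2.25f, 1.0f, 0.5f},
            {0.0f, -0.0f},                  // signed zero: min must be -0.0
            {1.0f, Float.NaN, -1.0f, 2.0f}, // NaN must propagate
        };
        for (float[] xs : inputs) {
            float lMin = linear(xs, Float.POSITIVE_INFINITY, Math::min);
            float tMin = tree(xs, Float.POSITIVE_INFINITY, Math::min);
            float lMax = linear(xs, Float.NEGATIVE_INFINITY, Math::max);
            float tMax = tree(xs, Float.NEGATIVE_INFINITY, Math::max);
            System.out.printf("min %s/%s  max %s/%s%n", lMin, tMin, lMax, tMax);
        }
        // Contrast: float addition is not associative, so the orders differ.
        float[] sums = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("sum " + linear(sums, 0.0f, Float::sum)
                + "/" + tree(sums, 0.0f, Float::sum));
    }
}
```

With the inputs in main, min and max agree between the two orders, including the -0.0 and NaN cases. The float sum of {1e8f, 1.0f, -1e8f, 1.0f} is 1.0 when evaluated in order and 2.0 when evaluated as a tree.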

src/hotspot/cpu/aarch64/aarch64_vector.ad line 381:

> 379:       case Op_XorReductionV:
> 380:       case Op_MinReductionVHF:
> 381:       case Op_MaxReductionVHF:

For partial cases where the vector size is <= 16B, we could use the NEON instructions as well. Did you compare the performance of NEON against the predicated SVE instructions?
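
To make the suggestion concrete, the following hypothetical model contrasts two lowerings. The first (current) reflects one reading of the PR description. The second (suggested) is the variant proposed in the comment. It is not HotSpot or .ad code. The enum and method names are illustrative. The assumption that the NEON fminv/fmaxv path covers exactly 8B and 16B vectors is also illustrative.

```java
// Hypothetical model (not HotSpot code) of how an FP16 min/max reduction
// might be lowered on AArch64, before and after the suggestion above.
public class Fp16ReductionLowering {

    enum Lowering { NEON, SVE, SVE_PREDICATED }

    // Assumption: the NEON fminv/fmaxv path handles 8B and 16B vectors.
    static boolean neonSupports(int lengthInBytes) {
        return lengthInBytes == 8 || lengthInBytes == 16;
    }

    // Reading of the PR description: NEON when the max vector length is
    // 8B/16B; otherwise SVE, predicated when the vector is partial.
    static Lowering current(int lengthInBytes, int maxVectorBytes) {
        if (maxVectorBytes <= 16) {
            return Lowering.NEON;
        }
        return lengthInBytes == maxVectorBytes ? Lowering.SVE : Lowering.SVE_PREDICATED;
    }

    // Suggested variant: partial vectors that NEON can handle use NEON;
    // every other case is lowered as before.
    static Lowering suggested(int lengthInBytes, int maxVectorBytes) {
        if (neonSupports(lengthInBytes)) {
            return Lowering.NEON;
        }
        return current(lengthInBytes, maxVectorBytes);
    }

    public static void main(String[] args) {
        int[][] cases = { {8, 16}, {16, 32}, {32, 32}, {16, 64}, {32, 64} };
        for (int[] c : cases) {
            System.out.printf("len=%dB max=%dB current=%s suggested=%s%n",
                    c[0], c[1], current(c[0], c[1]), suggested(c[0], c[1]));
        }
    }
}
```

In this model, only the 8B and 16B partial cases on machines with a max vector length > 16B change. Whether NEON actually beats the predicated SVE form there is the measurement the comment asks for.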

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28828#discussion_r2663933727