RFR: 8373344: Add support for min/max reduction operations for Float16 [v2]

Wed Jan 7 17:38:11 UTC 2026

On Tue, 6 Jan 2026 07:28:59 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Yi Wu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>> 
>>  - Replace assert with verify
>>  - Add IRNode constant and code refactor
>>  - Merge remote-tracking branch 'origin/master' into yiwu-8373344
>>  - 8373344: Add support for FP16 min/max reduction operations
>>    
>>    This patch adds mid-end support for vectorized min/max reduction
>>    operations for half floats. It also includes backend AArch64 support
>>    for these operations.
>>    Both floating point min/max reductions don’t require strict order,
>>    because they are associative.
>>    
>>    It will generate NEON fminv/fmaxv reduction instructions when
>>    max vector length is 8B or 16B. On SVE supporting machines
>>    with vector lengths > 16B, it will generate the SVE fminv/fmaxv
>>    instructions.
>>    The patch also adds support for partial min/max reductions on
>>    SVE machines using fminv/fmaxv.
>>    
>>    Ratio of throughput(ops/ms) > 1 indicates the performance with
>>    this patch is better than the mainline.
>>    
>>    Neoverse N1 (UseSVE = 0, max vector length = 16B):
>>    Benchmark         vectorDim  Mode   Cnt     8B    16B
>>    ReductionMaxFP16   256       thrpt 9      3.69   6.44
>>    ReductionMaxFP16   512       thrpt 9      3.71   7.62
>>    ReductionMaxFP16   1024      thrpt 9      4.16   8.64
>>    ReductionMaxFP16   2048      thrpt 9      4.44   9.12
>>    ReductionMinFP16   256       thrpt 9      3.69   6.43
>>    ReductionMinFP16   512       thrpt 9      3.70   7.62
>>    ReductionMinFP16   1024      thrpt 9      4.16   8.64
>>    ReductionMinFP16   2048      thrpt 9      4.44   9.10
>>    
>>    Neoverse V1 (UseSVE = 1, max vector length = 32B):
>>    Benchmark         vectorDim  Mode   Cnt     8B    16B    32B
>>    ReductionMaxFP16   256       thrpt 9      3.96   8.62   8.02
>>    ReductionMaxFP16   512       thrpt 9      3.54   9.25  11.71
>>    ReductionMaxFP16   1024      thrpt 9      3.77   8.71  14.07
>>    ReductionMaxFP16   2048      thrpt 9      3.88   8.44  14.69
>>    ReductionMinFP16   256       thrpt 9      3.96   8.61   8.03
>>    ReductionMinFP16   512       thrpt 9      3.54   9.28  11.69
>>    ReductionMinFP16   1024      thrpt 9      3.76   8.70  14.12
>>    ReductionMinFP16   2048      thrpt 9      3.87   8.45  14.70
>>    
>>    Neoverse V2 (UseSVE = 2, max vector length = 16B)...
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 381:
> 
>> 379:       case Op_XorReductionV:
>> 380:       case Op_MinReductionVHF:
>> 381:       case Op_MaxReductionVHF:
> 
> We can use the NEON instructions if the vector size <= 16B as well for partial cases. Did you test the performance with NEON instead of using predicated SVE instructions?

Do you mean moving it down next to `Op_AddReductionVI` and `Op_AddReductionVL`, so that it uses `return !VM_Version::use_neon_for_vector(length_in_bytes);`?
It doesn't seem to make much of a difference.
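
For readers without the source at hand, here is a minimal, self-contained sketch of the two variants being compared. It is not the actual code in `aarch64_vector.ad`: the enum, the function names `needs_partial_old` and `needs_partial_new`, and the 16-byte threshold inside `use_neon_for_vector` are simplified, hypothetical stand-ins used only to show how moving the `case` labels changes the result for small vectors.

```cpp
// Hypothetical, simplified stand-ins for illustration only; these are NOT
// the real HotSpot definitions.
enum Opcode {
  Op_AddReductionVI,
  Op_AddReductionVL,
  Op_XorReductionV,
  Op_MinReductionVHF,
  Op_MaxReductionVHF
};

// Assumption: NEON is usable for vectors of at most 16 bytes.
constexpr bool use_neon_for_vector(int length_in_bytes) {
  return length_in_bytes <= 16;
}

// "old" variant: FP16 min/max reductions always take the predicated
// (partial) path, whatever the vector size.
constexpr bool needs_partial_old(Opcode opc, int length_in_bytes) {
  switch (opc) {
    case Op_XorReductionV:
    case Op_MinReductionVHF:
    case Op_MaxReductionVHF:
      return true;
    case Op_AddReductionVI:
    case Op_AddReductionVL:
      return !use_neon_for_vector(length_in_bytes);
  }
  return false;
}

// "new" variant: the FP16 min/max cases are moved down, so vectors that
// fit NEON skip the predicated path.
constexpr bool needs_partial_new(Opcode opc, int length_in_bytes) {
  switch (opc) {
    case Op_XorReductionV:
      return true;
    case Op_AddReductionVI:
    case Op_AddReductionVL:
    case Op_MinReductionVHF:
    case Op_MaxReductionVHF:
      return !use_neon_for_vector(length_in_bytes);
  }
  return false;
}
```

Under these assumptions the two variants differ only for vectors of 16 bytes or fewer; a 32-byte vector takes the predicated path in both.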

Neoverse V1 (UseSVE = 1, max vector length = 32B)
Benchmark           vectorDim  Mode   Cnt   8B(old) 8B(new)  new/old    16B(old) 16B(new)  new/old    32B(old) 32B(new)  new/old
ReductionMaxFP16       256     thrpt    9     3.96     3.96     1.00        8.63     8.62     1.00        8.02     8.02     1.00
ReductionMaxFP16       512     thrpt    9     3.54     3.54     1.00        9.25     9.25     1.00       11.71    11.71     1.00
ReductionMaxFP16      1024     thrpt    9     3.77     3.77     1.00        8.70     8.71     1.00       14.12    14.07     1.00
ReductionMaxFP16      2048     thrpt    9     3.88     3.88     1.00        8.45     8.44     1.00       14.69    14.69     1.00
ReductionMinFP16       256     thrpt    9     3.96     3.96     1.00        8.62     8.61     1.00        8.02     8.03     1.00
ReductionMinFP16       512     thrpt    9     3.55     3.54     1.00        9.26     9.28     1.00       11.72    11.69     1.00
ReductionMinFP16      1024     thrpt    9     3.76     3.76     1.00        8.69     8.70     1.00       14.10    14.12     1.00
ReductionMinFP16      2048     thrpt    9     3.87     3.87     1.00        8.44     8.45     1.00       14.76    14.70     1.00

Neoverse V2 (UseSVE = 2, max vector length = 16B)
Benchmark           vectorDim  Mode   Cnt   8B(old) 8B(new)  new/old    16B(old) 16B(new)  new/old
ReductionMaxFP16       256     thrpt    9     4.77     4.78     1.00       10.00    10.00     1.00
ReductionMaxFP16       512     thrpt    9     3.75     3.74     1.00       11.32    11.33     1.00
ReductionMaxFP16      1024     thrpt    9     3.87     3.86     1.00        9.59     9.59     1.00
ReductionMaxFP16      2048     thrpt    9     3.94     3.94     1.00        8.72     8.71     1.00
ReductionMinFP16       256     thrpt    9     4.77     4.78     1.00        9.97    10.00     1.00
ReductionMinFP16       512     thrpt    9     3.77     3.74     0.99       11.35    11.29     0.99
ReductionMinFP16      1024     thrpt    9     3.86     3.86     1.00        9.56     9.58     1.00
ReductionMinFP16      2048     thrpt    9     3.94     3.94     1.00        8.71     8.71     1.00

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28828#discussion_r2669419647